5.1.8 K-Nearest-Neighbor Algorithm
KNN classification has been used to classify imbalanced data and was selected as one of the top 10 data mining algorithms (Wu et al. 2008; Zhang et al. 2017; Zhang et al. 2018b; Zheng et al. 2017). There are two main research directions: one is setting a proper K value, and the other is choosing the distance function used to identify the K nearest neighbors. For setting the K value, the usual approach is to determine a single proper K value once a training dataset is given. However, training samples are distributed with different densities in the training sample space, which raises a challenging issue: different samples need different K values for class prediction. Recently, Cheng et al. proposed learning an optimal K value for each new data point, and Zhang et al. (2018b) designed a KNN algorithm that efficiently learns K for KNN classification. Although there are many distance functions, most KNN classification algorithms use the Euclidean distance, which is defined as follows.
Equation 8:
$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
where $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$ are two samples described by n features.
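As an illustration of this definition, the Euclidean distance between two samples can be computed directly in R; the vectors x and y below are hypothetical examples, not data from our experiments.

# Two hypothetical samples described by three features each
x <- c(2.1, 0.5, 3.3)
y <- c(1.0, 0.7, 2.8)

# Euclidean distance computed from the definition above
d_manual <- sqrt(sum((x - y)^2))

# The same value using base R's dist(), whose default method is "euclidean"
d_builtin <- dist(rbind(x, y))

d_manual
d_builtin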
Some recent research reports also address KNN. For example, Deng et al. (2016) pioneered a KNN method to classify big data: it first conducts k-means clustering to separate the whole dataset into several parts, and then each subset is classified with the KNN method. Liu et al. (2016) proposed a neighbor selection approach for multi-label classification. Liu and Zhang (2012) studied noisy data elimination using mutual KNN for data classification. Zhang (2010) proposed replacing the majority rule with a CF measure for KNN classification, which allows a minority class to become the winner. Zhang (2011) studied a shell-neighbor method for KNN classification, which assists in learning from datasets with missing values. Zhang et al. proposed representing each sample by the other samples with a new self-reconstruction method. The obtained coefficients are used to compute a K value for every sample, rather than using a single K value for all samples as in traditional methods (Zheng et al. 2017; Lei and Zhu 2017; Zhu et al. 2018a). Finally, this work builds a decision tree with the obtained K value stored in the leaves to output the labels of the training samples.
classifiers are lazy learners, which is time consuming since the distance between every
test sample and other samples should be calculated. To deal with this issue, Zhang et al
(2018) pioneered a K-tree and a k*Tree to use different numbers of nearest neighbors for
KNN classification. The K-tree method needs less running cost but achieves similar
classification accuracy, compared with those KNN methods that assign different K values
to different test samples. The technique k*Tree is a K-tree expansion. It speeds up its
experiment phase by additional storing data from the coaching samples in K-tree's leaf
nodes, such as the coaching samples in the leaf clusters, their KNNs, and those KNN's
closest neighbor. It makes KNN only using a subset of training samples in the leaf nodes.
This is different from previous methods, e.g., (Zhu et al. 2014; Zheng et al. 2018), which
use KNN method to visit all samples. Therefore, our proposed method may decrease the
computation cost of the test process. We have used this classifier because the K-nearest
nonlinear data. The output value for the object is calculated by the average value of the
nearest k neighbors.
We have used K = 1 and the Euclidean distance in our experiment. Table 3 lists the parameters and the Weka functions used in our research.
Table 3: Parameters
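As a minimal sketch of this configuration, the snippet below runs 1-nearest-neighbor classification with Euclidean distance using the class package in R; train_x, train_y and test_x are hypothetical placeholders for our feature matrices and labels, and in the actual experiments the equivalent Weka classifier was used instead.

library(class)   # provides knn(), which uses Euclidean distance

# Hypothetical stand-ins for the real training/testing feature matrices and labels
set.seed(1)
train_x <- matrix(rnorm(40), ncol = 4)
train_y <- factor(rep(c("tumor", "normal"), each = 5))
test_x  <- matrix(rnorm(12), ncol = 4)

# K = 1: each test sample takes the class of its single nearest training sample
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 1)
pred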
5.2 Tools
Data processing is done using R scripts, whereas classification, as well as feature selection, is done using Weka.
5.2.1 R
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be regarded as a different implementation of S; there are some important differences, but much code written for S runs unaltered under R. R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, classification, clustering, and more) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an open-source route to participation in that activity. Well-designed publication-quality plots can be produced, including mathematical symbols and formulae where required, and great care has been taken over the defaults for the minor design choices in graphics. R is available as free software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), as well as Windows and macOS.
5.2.2 Weka
Weka is a collection of machine learning algorithms for data mining tasks, containing tools for data preparation, classification, regression, clustering, association rule mining, and visualization.
The Weka is a curious-looking flightless bird found only on the islands of New Zealand, and the software's name is pronounced like the bird's. Weka is open-source software issued under the GNU General Public License.
Feature selection in Weka requires two components:
a. Attribute Evaluator
b. Search Method
The attribute evaluator is the technique by which each attribute in the dataset (also known as a column or feature) is evaluated in the context of the output variable (e.g. the class). The search method is the technique by which different combinations of attributes in the dataset are tried or navigated.
We have used the following feature selection ratios (a minimal sketch of applying such a ratio is given after this list):
a. Raw, where the whole dataset has been used for feature selection
b. 95%
c. 90%
d. 85%
e. 80%
f. 75%
g. 70%
h. 65%
i. 60%
j. 55%
k. 50%
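The sketch below shows one way such a ratio could be applied in R, assuming the FSelector package is available and a hypothetical data frame expr whose column Class holds the labels; it illustrates the idea and is not the exact Weka configuration used in our experiments.

library(FSelector)   # assumed available; provides gain.ratio() and cutoff.k.percent()

# Rank all attributes of the hypothetical data frame 'expr' by gain ratio
weights <- gain.ratio(Class ~ ., data = expr)

# Keep the top 75% of attributes, i.e. a 75% feature selection ratio
selected_75 <- cutoff.k.percent(weights, 0.75)
reduced_75  <- expr[, c(selected_75, "Class")]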
Cross validation is used to reduce the estimation variance. We take a training set and create a classifier, and we then want to assess that classifier's performance; there is a certain amount of variance in that assessment because it is all statistical underneath. We want to keep this variance in the estimate as low as possible. Cross validation also helps to prevent overfitting.
With cross validation, we split the dataset only once, but we split it into, say, 10 parts. We take 9 of the parts and use them for training, and we use the last part for testing. Then, with the same division, we take another 9 parts for training and the held-out part for testing. We repeat the whole process 10 times, each time using a different part for testing. In other words, we divide the dataset into 10 parts, hold out each part in turn for testing, train on the remainder, test, and average the 10 outcomes. Therefore, each data point in the dataset is used once for testing and nine times for training.
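A minimal sketch of 10-fold cross validation for the 1-NN classifier in R is given below; x and y are hypothetical placeholders for the feature matrix and class labels.

library(class)

# Hypothetical feature matrix and labels
set.seed(1)
x <- matrix(rnorm(200), ncol = 4)
y <- factor(rep(c("tumor", "normal"), length.out = 50))

k_folds <- 10
fold_id <- sample(rep(1:k_folds, length.out = nrow(x)))  # randomly assign each sample to a fold

accuracy <- numeric(k_folds)
for (f in 1:k_folds) {
  test_idx    <- which(fold_id == f)                          # hold out fold f for testing
  pred        <- knn(x[-test_idx, ], x[test_idx, ], y[-test_idx], k = 1)
  accuracy[f] <- mean(pred == y[test_idx])                    # accuracy on the held-out fold
}
mean(accuracy)   # performance averaged over the 10 folds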
5.6 Methodology
Figure 11 illustrates the workflow of the breast cancer data analysis. Four different types of datasets, namely CpG Methylation Data, Histone Marker Modification Data, Human Genome Data, and RNA-Seq Data, are used for the analysis. The next step after downloading the data is feature extraction, in which useful features are extracted from each dataset. These features are useful for better prediction of breast cancer as well as for reducing the complexity of the dataset. Using the transcript ID of each dataset, we have combined all four datasets and created a model using the R tool, which includes all extracted features with the most promising number of genes. On this model we have applied four different feature selection techniques, namely PCA, CFS, Gain Ratio, and ReliefF. We have used different feature selection ratios (Raw, 95%, 90%, 85%, 80%, 75%, 70%, 65%, 60%, 55%, 50%) and different training-testing ratios (90:10, 80:20, 70:30, 60:40) in combination with 8 different classifiers (Gaussian SVM, Linear SVM, KNN, Naïve Bayes, Random Forest, SVM, Logistic Regression, and Multi-Layer Perceptron); a sketch of how this experimental grid can be enumerated is given after the model list below. We have also used 10-fold cross validation in the research.
Models used:
1. SVM
2. Linear SVM
3. Gaussian SVM
4. KNN
5. Logistic Regression
6. Multilayer Perceptron
7. Naïve Bayes
8. Random Forest
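As a sketch of how the experimental grid described above can be enumerated, the combinations of feature selection ratio, training-testing split, and classifier can be generated in R with expand.grid; the labels below simply restate the settings listed above.

ratios      <- c("Raw", "95%", "90%", "85%", "80%", "75%", "70%", "65%", "60%", "55%", "50%")
splits      <- c("90:10", "80:20", "70:30", "60:40")
classifiers <- c("SVM", "Linear SVM", "Gaussian SVM", "KNN", "Logistic Regression",
                 "Multilayer Perceptron", "Naive Bayes", "Random Forest")

# Every combination of ratio, split and classifier evaluated in the experiments
grid <- expand.grid(ratio = ratios, split = splits, classifier = classifiers)
nrow(grid)   # 11 x 4 x 8 = 352 classifier runs, before 10-fold cross validation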