A Review of Data Classification Using K-Nearest Neighbour
B. Use in Data Mining
Data mining is the extraction of hidden information from large databases. Classification is a data mining task that predicts the value of a categorical variable by building a model based on one or more numerical and/or categorical variables. The classification mining function is used to gain a deeper understanding of the structure of the database. There are various classification techniques, such as decision tree induction, Bayesian networks, lazy classifiers and rule-based classifiers. Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms and machine learning methods. Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction [8]. Data mining applications can use a variety of parameters to examine the data, including association, sequence or path analysis, classification, clustering, and forecasting. Classification techniques are capable of processing a wide variety of data and are growing in popularity. The main classification techniques are Bayesian networks, tree classifiers, rule-based classifiers, lazy classifiers [9], fuzzy set approaches, rough set approaches, etc.

III. MATERIAL AND METHODOLOGY

A. Material
Outputs obtained with different methodologies have been compared.

a) Sample
Matrix whose rows will be classified into groups. Sample must have the same number of columns as Training.

b) Training
Matrix used to group the rows in the matrix Sample. Training must have the same number of columns as Sample. Each row of Training belongs to the group whose value is the corresponding entry of Group.

c) Group
Vector whose distinct values define the grouping of the rows in Training.

e) Distance
1) Euclidean
2) Cityblock
3) Cosine
4) Correlation

f) Rule
1) Nearest
2) Random
3) Consensus

I) Distance

a) Euclidean distance
The Euclidean distance between points p and q is the length of the line segment connecting them (pq). In Cartesian coordinates, if p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n) are two points in Euclidean n-space, then the distance from p to q, or from q to p, is given by [10]:

d(p,q) = d(q,p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}    (2)

The position of a point in a Euclidean n-space is a Euclidean vector, so p and q are Euclidean vectors [11], starting from the origin of the space, with their tips indicating the two points. The Euclidean norm, or Euclidean length, or magnitude of a vector measures the length of the vector:

|p| = \sqrt{p_1^2 + p_2^2 + \cdots + p_n^2} = \sqrt{p \cdot p}    (3)

where the last expression involves the dot product. A vector can be described as a directed line segment from the origin of the Euclidean space (vector tail) to a point in that space (vector tip) [12]. If we consider its length to be the distance from its tail to its tip, it becomes clear that the Euclidean norm of a vector is just a special case of the Euclidean distance: the Euclidean distance between its tail and its tip. The distance between points p and q may have a direction, so it may be represented by another vector, given by [13]

q - p = (q_1 - p_1, q_2 - p_2, \ldots, q_n - p_n)    (4)

In a three-dimensional space (n = 3), this is an arrow from p to q, which can also be regarded as the position of q relative to p. It may also be called a displacement vector if p and q represent two positions of the same point at two successive instants of time. The Euclidean distance between p and q is then just the Euclidean length of this distance (or displacement) vector [14]:

|q - p| = \sqrt{(q_1 - p_1)^2 + \cdots + (q_n - p_n)^2}    (5)
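As a quick concrete check of equations (2)-(5), the short MATLAB sketch below evaluates them for two illustrative two-dimensional points (taken, purely for illustration, from the Sample matrix given later in this section); it is a minimal sketch, not code from the paper.

% Minimal sketch of equations (2)-(5) for two illustrative points.
p = [0.559 0.510];
q = [0.987 0.988];
d_pq   = sqrt(sum((q - p).^2));   % Euclidean distance, equation (2)
norm_p = sqrt(sum(p.^2));         % Euclidean norm of p, equation (3); equals norm(p)
v      = q - p;                   % displacement vector, equation (4)
d_alt  = sqrt(sum(v.^2));         % length of the displacement vector, equation (5); equals d_pq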
b) Cityblock (Taxicab metric)
The taxicab distance d_1 between two vectors p and q in an n-dimensional real vector space with a fixed Cartesian coordinate system is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes. More formally,

d_1(p,q) = \|p - q\|_1 = \sum_{i=1}^{n} |p_i - q_i|    (7)

where p = (p_1, p_2, p_3, ..., p_n) and q = (q_1, q_2, q_3, ..., q_n) are vectors [15]. For example, in the plane, the taxicab distance between (p_1, p_2) and (q_1, q_2) is |p_1 - q_1| + |p_2 - q_2| [16].

c) Cosine distance
The cosine of the angle between two vectors can be derived from the Euclidean dot product formula:

a \cdot b = \|a\| \|b\| \cos\theta    (8)

Given two vectors of attributes, A and B, the cosine similarity cos(θ) is represented using a dot product and magnitudes [17] as

\text{similarity} = \cos\theta = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}    (9)

The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 usually indicating independence and in-between values indicating intermediate similarity or dissimilarity. For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since term frequencies (tf-idf weights) cannot be negative; the angle between two term frequency vectors cannot be greater than 90°.

d) Correlation
The distance correlation of two random variables is obtained by dividing their distance covariance [18] by the product of their distance standard deviations. The distance correlation is

\mathrm{dCor}(X,Y) = \frac{\mathrm{dCov}(X,Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}    (10)

II) Rule

a) Nearest
Majority rule with nearest-point tie-break (the default).

b) Random
Majority rule with random-point tie-break.

c) Consensus

B. Material
To get the output, some training data and some sample data are chosen, and with different rules and different distance metrics we get different classified outputs. The data chosen for the classification are:

Sample = [0.559 0.510; 0.101 0.282; 0.987 0.988]
Training = [0 0; 0.559 0.559; 1 1]
Group = [1; 2; 3]
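To see how the four distance metrics of Section III behave on this data, the sketch below computes the distance from every Sample row to every Training row. It assumes MATLAB's pdist2 (Statistics and Machine Learning Toolbox), whose 'cosine' and 'correlation' options return one minus the cosine similarity and one minus the sample correlation, respectively, rather than the distance correlation of equation (10); it is an illustrative sketch, not the authors' code.

% Distances from each Sample row to each Training row under four metrics
% (a sketch assuming pdist2 is available).
Sample   = [0.559 0.510; 0.101 0.282; 0.987 0.988];
Training = [0 0; 0.559 0.559; 1 1];
metrics  = {'euclidean', 'cityblock', 'cosine', 'correlation'};
for m = 1:numel(metrics)
    D = pdist2(Sample, Training, metrics{m});   % 3x3 matrix of distances
    fprintf('%s distances:\n', metrics{m});
    disp(D);
end
% Note: the training row [0 0] has zero norm, so its cosine distance is
% undefined (0/0), and every training row here is a constant vector, so the
% correlation distances are likewise undefined.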
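The inputs above (Sample, Training, Group), together with the Distance and Rule options listed in Section III, match the interface of the knnclassify function that shipped with MATLAB's Bioinformatics Toolbox. Assuming that function, or an equivalent, was used with k = 1 (both assumptions; the paper names neither the routine nor the value of k), the twelve cases reported below could be produced along the following lines, in the same order as Cases 1-12.

% Hypothetical reconstruction of the experimental loop; knnclassify and
% k = 1 are assumptions, not details stated in the paper.
Sample   = [0.559 0.510; 0.101 0.282; 0.987 0.988];
Training = [0 0; 0.559 0.559; 1 1];
Group    = [1; 2; 3];
distances = {'euclidean', 'cityblock', 'cosine', 'correlation'};
rules     = {'nearest', 'random', 'consensus'};
k = 1;
caseNo = 0;
for i = 1:numel(distances)
    for j = 1:numel(rules)
        caseNo = caseNo + 1;
        cls = knnclassify(Sample, Training, Group, k, distances{i}, rules{j});
        fprintf('Case %2d: %-11s + %-9s -> classes %s\n', ...
                caseNo, distances{i}, rules{j}, mat2str(cls.'));
    end
end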
C. Results

Case 1
Fig.1 Distance used Euclidean and Rule Nearest
In this case Euclidean distance is used, and the corresponding figure is referenced in Table I. Using the Nearest rule, the classification result was medium.

Case 2
In this case Euclidean distance is used, and the corresponding figure is referenced in Table I. Using the Random rule, the classification result was good.

Case 3

Case 4
In this case Cityblock distance is used, and the corresponding figure is referenced in Table I. Using the Nearest rule, the classification result was excellent.

Case 5
Case 7
Fig.7 Distance used Cosine and Rule Nearest
In this case Cosine distance is used, and the corresponding figure is referenced in Table I. Using the Nearest rule, the classification result was medium.

Case 8
Fig.8 Distance used Cosine and Rule Random
In this case Cosine distance is used, and the corresponding figure is referenced in Table I. Using the Random rule, the classification result was poor.

Case 9
Fig.9 Distance used Cosine and Rule Consensus
In this case Cosine distance is used, and the corresponding figure is referenced in Table I. Using the Consensus rule, the classification result was medium.

Case 10
Fig.10 Distance used Correlation and Rule Nearest
In this case Correlation distance is used, and the corresponding figure is referenced in Table I. Using the Nearest rule, the classification result was poor.
Case 11

Case 12
Fig.12 Distance used Correlation and Rule Consensus
In this case Correlation distance is used, and the corresponding figure is referenced in Table I. Using the Consensus rule, the classification result was medium.

Hamming distance has not been used in this paper because that distance requires binary data, which the sample does not contain.

D. Inference

TABLE I
RESULTS AND EFFICIENCY OF CLASSIFIERS

IV. CONCLUSION
Classifiers have paved an important path for the classification of data in biometrics, such as iris detection and signature verification. Compared with the other distances, the Euclidean distance gives higher efficiency, and compared with the Bayes algorithm, the K-nearest neighbor algorithm again maintains its efficiency. The KNN classifier is one of the most popular neighborhood classifiers in pattern recognition. However, it has limitations such as high computational complexity, complete dependence on the training set, and no weight difference between classes. To avert this, an innovative method to improve the classification performance of KNN using a Genetic Algorithm (GA) is being implemented. Also, in the results almost every case has an efficiency near 100% because the training set and the sample used are small and the distances involved are small.
REFERENCES
[1] R.O. Duda and P.E. Hart, "Pattern Classification and Scene Analysis", New York: John Wiley & Sons, 1973.
[2] B.V. Dasarathy, "Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques", IEEE Computer Society Press, 1990.
[3] D. Wettschereck and T.G. Dietterich, "An Experimental Comparison of the Nearest-Neighbor and Nearest-Hyperrectangle Algorithms", Machine Learning, 9, pp. 5-28, 1995.
[4] J.C. Platt, "Fast Training of Support Vector Machines Using Sequential Minimal Optimization", in Advances in Kernel Methods: Support Vector Machines (B. Scholkopf, C. Burges and A. Smola, eds.), Cambridge, MA: MIT Press, pp. 185-208, 1998.
[5] Y. Yang and X. Liu, "A Re-Examination of Text Categorization Methods", Proc. SIGIR '99, pp. 42-49, 1999.
[6] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", Proc. European Conf. Machine Learning, pp. 137-142, 1998.
[7] Man Lan, Chew Lim Tan, Jian Su, and Yue Lu, "Supervised and Traditional Term Weighting Methods for Automatic Text Categorization", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 4, April 2009.
[8] Thair Nu Phyu, "Survey of Classification Techniques in Data Mining", Proceedings of the International MultiConference of Engineers and Computer Scientists 2009, Vol. I, IMECS 2009, March 18-20, 2009.
[9] William Perrizo, Qin Ding, and Anne Denton, "Lazy Classifiers Using P-trees", Department of Computer Science, Penn State Harrisburg, Middletown, PA 17057.
[10] A.Y. Alfakih, "Graph rigidity via Euclidean distance matrices", Linear Algebra Appl., 310, pp. 149-165, 2000.
[11] M. Bakonyi and C. Johnson, "The Euclidean distance matrix completion problem", SIAM Journal on Matrix Analysis and Applications, 16, pp. 646-654, 1995.
[12] Elena Deza and Michel Marie Deza, "Encyclopedia of Distances", p. 94, Springer, 2009.
[13] W. Glunt, T.L. Hayden, S. Hong, and J. Wells, "An alternating projection algorithm for computing the nearest Euclidean distance matrix", SIAM Journal on Matrix Analysis and Applications, 11, pp. 589-600, 1990.
[14] R.W. Farebrother, "Three theorems with applications to Euclidean distance matrices", Linear Algebra Appl., 95, pp. 11-16, 1987.
[15] Z. Akca and R. Kaya, "On the Taxicab Trigonometry", Jour. of Inst. of Math. & Comp. Sci. (Math. Ser.), 10, No. 3, pp. 151-159, 1997.
[16] K. Thompson and T. Dray, "Taxicab Angles and Trigonometry", Pi Mu Epsilon J., 11, pp. 87-97, 2000.
[17] Bei-Ji Zou, "Shape-Based Trademark Retrieval Using Cosine Distance Method", Intelligent Systems Design and Applications (ISDA '08), Eighth International Conference on, 26-28 Nov. 2008.
[18] M.R. Kosorok, "Discussion of Brownian distance covariance", Ann. Appl. Stat., 3(4), pp. 1270-1278, 2009.