K Nearest Neighbour - Algorithm
Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell University, Lecture 2, https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
• Nearest Neighbor Methods, Victor Lavrenko, Assistant Professor at the University of Edinburgh, https://www.youtube.com/playlist?list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ
• Wiki K-Nearest Neighbors: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
• Effects of Distance Measure Choice on KNN Classifier Performance - A Review, V. B. Surya Prasath et al., https://arxiv.org/pdf/1708.04321.pdf
• A Comparative Analysis of Similarity Measures to find Coherent Documents, Mausumi Goswami et al., http://www.ijamtes.org/gallery/101.%20nov%20ijmte%20-%20as.pdf
• A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data, Ali Seyed Shirkhorshidi et al.
The K Nearest Neighbors Algorithm
• Basic idea: similar inputs have similar outputs
Formal Definition
• Assuming 𝑥 to be our test point, let's denote the set of the 𝑘 nearest neighbors of 𝑥 as 𝑆𝑥.
• Formally, 𝑆𝑥 is defined as

  $S_x \subseteq D \quad \text{s.t.} \quad |S_x| = k$

  and

  $\forall (x', y') \in D \setminus S_x: \ \mathrm{dist}(x, x') \ge \max_{(x'', y'') \in S_x} \mathrm{dist}(x, x'')$

• That is, every point that is in 𝐷 but not in 𝑆𝑥 is at least as far away from 𝑥 as the furthest point in 𝑆𝑥.
• We define the classifier ℎ(⋅) as a function returning the most common label in 𝑆𝑥:

  $h(x) = \mathrm{mode}(\{y'' : (x'', y'') \in S_x\})$

• where mode(⋅) selects the label with the highest occurrence.
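As a concrete illustration, here is a minimal sketch of this definition in Python/NumPy (function and variable names are illustrative, not from the lecture):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k):
    """Return h(x): the most common label among the k nearest neighbors S_x."""
    # Euclidean distance from the test point to every training point
    distances = np.linalg.norm(np.asarray(X_train) - np.asarray(x_test), axis=1)
    # The indices of the k smallest distances define S_x
    S_x = np.argsort(distances)[:k]
    # mode(...): the label with the highest occurrence in S_x
    return Counter(np.asarray(y_train)[S_x].tolist()).most_common(1)[0][0]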
K=1
KNN Decision Boundary
• Voronoi tessellation and KNN decision boundaries
The KNN Algorithm
• A supervised, non-parametric algorithm
• It makes no assumptions about the underlying distribution, nor does it try to estimate it
• There are no parameters to train, unlike in logistic/linear regression or Bayes classifiers
Euclidean Distance
$d(p, q) = d(q, p) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

• Good choice for numeric attributes
• When data is dense or continuous, this is a good proximity measure
• Downside: sensitive to extreme deviations in a single attribute, as it squares the differences
• The variables with the largest values dominate the result
• Solution: feature normalization
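As a sketch, Euclidean distance and z-score feature normalization (one common normalization choice; the slide does not prescribe a specific one) in NumPy:

import numpy as np

def euclidean(p, q):
    # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def zscore_normalize(X):
    # Rescale each feature to zero mean and unit variance so that
    # attributes with large values do not dominate the distance.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)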
Chebyshev Distance
Effect of Different Distance Measures in Result of Cluster Analysis, Sujan Dahal
$d_{\mathrm{Cheb}}(p, q) = \lim_{a \to \infty} \left( \sum_{i=1}^{n} |p_i - q_i|^a \right)^{1/a} = \max_i |p_i - q_i|$
• The Chebyshev distance between two vectors is the greatest of their differences along any coordinate dimension
• Useful when two objects should be considered "different" as soon as they differ in any single dimension
• Also called chessboard distance, maximum metric, or 𝐿∞ metric
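A one-line NumPy sketch of the Chebyshev distance:

import numpy as np

def chebyshev(p, q):
    # d_Cheb(p, q) = max_i |p_i - q_i|
    return np.max(np.abs(np.asarray(p) - np.asarray(q)))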
The KNN Algorithm
• Input: training samples D = {(x1, y1), (x2, y2), …, (xn, yn)}, test sample 𝑑 = (x, y), and 𝑘. Assume x to be an m-dimensional vector.
• Note: all the action takes place in the test phase; the training phase essentially just cleans, normalizes, and stores the data
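A minimal sketch of this division of labor (class and method names are illustrative): fitting merely stores the data, while prediction does all the work, for both classification and regression:

import numpy as np

class KNN:
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        # "Training" is essentially just memorizing the (cleaned, normalized) data.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def _neighbors(self, x):
        # All the action: distances to every stored sample, then the k closest.
        distances = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        return np.argsort(distances)[:self.k]

    def predict_class(self, x):
        # Classification: majority vote among the k nearest labels
        values, counts = np.unique(self.y[self._neighbors(x)], return_counts=True)
        return values[np.argmax(counts)]

    def predict_value(self, x):
        # Regression: mean of the k nearest target values
        return self.y[self._neighbors(x)].astype(float).mean()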
KNN Classification and Regression
#   Height (inches)   Weight (kgs)   B.P. Sys   B.P. Dia   Heart disease   Cholesterol Level
1   62                70             110        80         No              150
2   72                90             130        70         No              160
3   74                80             150        70         No              130
4   65                120            140        90         Yes             200
5   67                100            130        85         Yes             190
6   64                110            170        90         No              130
7   69                150            110        100        Yes             250
KNN Classification and Regression
#   Height (inches)   Weight (kgs)   B.P. Sys   B.P. Dia   Heart disease   Cholesterol Level
1   62                70             110        80         No              150
2   72                90             130        70         No              160
3   74                80             150        70         No              130
4   65                120            140        90         Yes             200
5   67                100            130        85         Yes             190
6   64                110            170        90         No              130
7   69                150            110        100        Yes             250
8   66                115            145        90         ??              ??
KNN Classification and Regression
Now compute the Euclidean distance from test sample 8 to each training sample:

#   Height (inches)   Weight (kgs)   B.P. Sys   B.P. Dia   Heart disease   Cholesterol Level   Euclidean Distance
1   62                70             110        80         No              150
2   72                90             130        70         No              160
3   74                80             150        70         No              130
4   65                120            140        90         Yes             200
5   67                100            130        85         Yes             190
6   64                110            170        90         No              130
7   69                150            110        100        Yes             250
8   66                115            145        90         ??              ??
KNN Classification and Regression
#   Height (inches)   Weight (kgs)   B.P. Sys   B.P. Dia   Heart disease   Cholesterol Level   Euclidean Distance
1   62                70             110        80         No              150                 52.59
2   72                90             130        70         No              160                 47.81
3   74                80             150        70         No              130                 43.75
4   65                120            140        90         Yes             200                 7.14
5   67                100            130        85         Yes             190                 16.61
6   64                110            170        90         No              130                 15.94
7   69                150            110        100        Yes             250                 44.26
8   66                115            145        90         ??              ??
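For example, with k = 3 the nearest neighbors of sample 8 are samples 4 (7.14), 6 (15.94) and 5 (16.61). Classification takes a majority vote on heart disease (Yes, No, Yes → Yes), while regression averages the cholesterol levels: (200 + 130 + 190) / 3 ≈ 173.3.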
KNN Classification and Regression (Fruit)
#    Mass   Width   Height   Color Score   Fruit Name   Fruit Label
1    176    7.4     7.2      0.6           apple        0
2    178    7.1     7.8      0.92          apple        0
3    156    7.4     7.4      0.84          apple        0
4    154    7.1     7.5      0.78          orange       1
5    180    7.6     8.2      0.79          orange       1
6    118    5.9     8        0.72          lemon        2
7    120    6       8.4      0.74          lemon        2
8    118    6.1     8.1      0.7           lemon        ??
9    140    7.3     7.1      0.87          apple        ??
10   154    7.2     7.2      0.82          orange       ??
KNN Classification and Regression (Fruit)
Columns D8, D9 and D10 will hold the distance from each training sample to test samples 8, 9 and 10:

#    Mass   Width   Height   Color Score   Fruit Name   Fruit Label   D8   D9   D10
1    176    7.4     7.2      0.6           apple        0
2    178    7.1     7.8      0.92          apple        0
3    156    7.4     7.4      0.84          apple        0
4    154    7.1     7.5      0.78          orange       1
5    180    7.6     8.2      0.79          orange       1
6    118    5.9     8        0.72          lemon        2
7    120    6       8.4      0.74          lemon        2
8    118    6.1     8.1      0.7           lemon        ??
9    140    7.3     7.1      0.87          apple        ??
10   154    7.2     7.2      0.82          orange       ??
KNN Classification and Regression (Fruit)
KNN predicts labels 2, 0 and 1 for test samples 8, 9 and 10, matching their true fruit names:

#    Mass   Width   Height   Color Score   Fruit Name   Fruit Label   Predicted
1    176    7.4     7.2      0.6           apple        0
2    178    7.1     7.8      0.92          apple        0
3    156    7.4     7.4      0.84          apple        0
4    154    7.1     7.5      0.78          orange       1
5    180    7.6     8.2      0.79          orange       1
6    118    5.9     8        0.72          lemon        2
7    120    6       8.4      0.74          lemon        2
8    118    6.1     8.1      0.7           lemon        ??            2
9    140    7.3     7.1      0.87          apple        ??            0
10   154    7.2     7.2      0.82          orange       ??            1
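The same workflow can be sketched with scikit-learn (the lecture does not reference scikit-learn, and the exact predictions depend on the choice of k and on feature scaling, since mass dominates the raw Euclidean distance):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training samples 1-7: mass, width, height, color score
X_train = np.array([[176, 7.4, 7.2, 0.6], [178, 7.1, 7.8, 0.92],
                    [156, 7.4, 7.4, 0.84], [154, 7.1, 7.5, 0.78],
                    [180, 7.6, 8.2, 0.79], [118, 5.9, 8.0, 0.72],
                    [120, 6.0, 8.4, 0.74]])
y_train = np.array([0, 0, 0, 1, 1, 2, 2])  # 0 = apple, 1 = orange, 2 = lemon

# Test samples 8-10
X_test = np.array([[118, 6.1, 8.1, 0.7], [140, 7.3, 7.1, 0.87],
                   [154, 7.2, 7.2, 0.82]])

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(clf.predict(X_test))  # predicted fruit labels for samples 8-10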
Example: Handwritten digit recognition
• 16x16 bitmaps
• 8-bit grayscale
• Euclidean distances over raw pixels:

  $d(p, q) = d(q, p) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

• Accuracy:
  • 7-NN ~ 95.2%
  • SVM ~ 95.8%
  • Humans ~ 97.5%
http://rstudio-pubs-static.s3.amazonaws.com/6287_c079c40df6864b34808fa7ecb71d0f36.html,
Victor Lavrenko https://www.youtube.com/watch?v=ZD_tfNpKzHY&list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ&index=6
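To try this yourself, here is a sketch using scikit-learn's built-in 8x8 digits dataset as a stand-in for the 16x16 bitmaps above (so the resulting accuracy will not match the numbers quoted):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 8x8 grayscale digits, flattened to 64 raw pixel features
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# 7-NN with Euclidean distance over raw pixels
clf = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
print(f"accuracy: {clf.score(X_test, y_test):.3f}")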
Complexity of KNN
• Input: training samples D = {(x1, y1), (x2, y2), …, (xn, yn)}, test sample 𝑑 = (x, y), 𝑘. Assume x to be an m-dimensional vector.
• For each test sample, we compute each distance in 𝑂(𝑚), all 𝑛 distances in 𝑂(𝑛𝑚), and then return all points whose distances are no larger than the 𝑘th smallest (a selection step in 𝑂(𝑛)), so every single prediction costs 𝑂(𝑛𝑚).
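A sketch of the prediction step with partial selection instead of a full sort (NumPy's argpartition selects the k smallest distances in expected linear time):

import numpy as np

def k_nearest_indices(X_train, x_test, k):
    # O(n*m): one O(m) Euclidean distance per training sample
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # O(n) expected: partial selection of the k smallest distances
    return np.argpartition(distances, k)[:k]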
KNN – The good, the bad and the ugly
• KNN is a simple algorithm, yet highly effective for many real-life classification problems, especially when datasets are large and continuously growing.
• Challenges:
  1. How to find the optimum value of K? (see the cross-validation sketch below)
  2. How to find the right distance function?
• Problems:
  1. High computational cost for each prediction.
  2. High memory requirement, as we need to keep all training samples.
  3. The curse of dimensionality.
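For challenge 1, a common approach (not specific to this lecture) is to choose K by cross-validation, as in this sketch:

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Mean 5-fold cross-validation accuracy for each candidate K;
# keep the K with the best score.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 9, 11)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))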
Thank You