K Nearest Neighbour - Algorithm

The document provides an overview of the K Nearest Neighbors (KNN) algorithm, emphasizing its classification and regression capabilities based on the principle that similar inputs yield similar outputs. It discusses the algorithm's non-parametric nature, the importance of distance measures (like Euclidean and Manhattan), and the process of determining the class label for a test input by analyzing its k-nearest neighbors. Additionally, it highlights the algorithm's lazy learning approach and the need to tune the hyperparameter k.


Instructor: Dr Asad Arshed

K Nearest Neighbor

Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell University, Lecture 2, https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
• Nearest Neighbor Methods, Victor Lavrenko, Assistant Professor at the University of Edinburgh, https://www.youtube.com/playlist?list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ
• Wiki K-Nearest Neighbors: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
• Effects of Distance Measure Choice on KNN Classifier Performance – A Review, V. B. Surya Prasath et al., https://arxiv.org/pdf/1708.04321.pdf
• A Comparative Analysis of Similarity Measures to find Coherent Documents, Mausumi Goswami et al., http://www.ijamtes.org/gallery/101.%20nov%20ijmte%20-%20as.pdf
• A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data, Ali Seyed Shirkhorshidi et al.
The K Nearest Neighbors Algorithm
• Basic idea: similar inputs have similar outputs
• Classification rule: for a test input x, assign the most common label amongst its k most similar (nearest) training inputs
Formal Definition
• Assuming x to be our test point, let's denote the set of the k nearest neighbors of x as S_x.
• Formally, S_x is defined as
      S_x ⊆ D  such that  |S_x| = k
      and
      for all (x', y') ∈ D \ S_x :  dist(x, x') ≥ max_{(x'', y'') ∈ S_x} dist(x, x'')
• That is, every point that is in D but not in S_x is at least as far away from x as the furthest point in S_x.
• We define the classifier h(·) as a function returning the most common label in S_x:
      h(x) = mode({y'' : (x'', y'') ∈ S_x}),
  where mode(·) selects the label with the highest occurrence.
• So, what do we do if there is a draw? Keep k odd, or return the result of k-NN with a smaller k.
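A minimal sketch of this rule (illustrative only, not the lecture's code): given the neighbor labels already sorted by distance to x, take the mode, and on a draw fall back to k-NN with a smaller k, as suggested above.

```python
# Illustrative only: h(x) = mode({y'' : (x'', y'') in S_x}), with the tie-break
# suggested above (fall back to a smaller k when several labels are equally common).
from collections import Counter

def knn_label(neighbor_labels_sorted_by_distance, k):
    counts = Counter(neighbor_labels_sorted_by_distance[:k]).most_common()
    top = [label for label, c in counts if c == counts[0][1]]    # all modes
    if len(top) == 1 or k == 1:
        return top[0]
    return knn_label(neighbor_labels_sorted_by_distance, k - 1)  # draw -> shrink k

print(knn_label(["A", "B", "B", "A", "A"], k=4))  # draw at k=4, resolved at k=3 -> "B"
```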
KNN Decision Boundary
• Voronoi tessellation and KNN decision boundaries (figure: K = 1)
KNN Decision Boundary
• Voronoi tessellation and KNN decision boundaries (figure)
The KNN Algorithm
• A supervised, non-parametric algorithm
  • It does not make any assumptions about the underlying distribution, nor does it try to estimate it
  • There are no parameters to train, unlike Logistic/Linear Regression or Bayes
  • There is a hyperparameter k that needs to be tuned
    o Parameters allow models to make predictions
    o Hyperparameters help with the learning/prediction process
• Used for classification and regression (see the sketch after this list)
  • Classification: choose the most frequent class label amongst the k nearest neighbors of the test point; votes may be weighted, e.g. w = 1/d (d: distance from x)
  • Regression: take an average over the output values of the k nearest neighbors and assign it to the test point
• An instance-based learning algorithm
  • Instead of performing explicit generalization, form hypotheses by comparing new problem instances with the training instances
  • (−) Complexity of prediction is a function of n (the size of the training data)
  • (+) Can easily adapt to unseen data
• A lazy learning algorithm
  • Delay computations on the training data until a query is made, as opposed to eager learning
  • (+) Good for continuously updated training data, e.g. recommender systems
  • (−) Slower to evaluate and needs to store the whole training data
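A brief sketch (assuming scikit-learn is available; not part of the slides) of KNN used for both classification and regression, with the optional w = 1/d distance weighting:

```python
# KNN for classification and regression on a tiny toy dataset (values are ours).
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X_train = [[62, 70], [72, 90], [65, 120], [67, 100]]   # toy features
y_class = ["No", "No", "Yes", "Yes"]                   # class labels
y_value = [150, 160, 200, 190]                         # numeric targets

clf = KNeighborsClassifier(n_neighbors=3, weights="distance")  # w = 1/d voting
reg = KNeighborsRegressor(n_neighbors=3, weights="distance")   # 1/d-weighted average
clf.fit(X_train, y_class)
reg.fit(X_train, y_value)

x_test = [[66, 115]]
print(clf.predict(x_test), reg.predict(x_test))
```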
Similarity/Distance Measures
• Lots of choices; the right one depends on the problem
• The Minkowski distance is a generalized metric form of the Euclidean, Manhattan and Chebyshev distances
• For two n-dimensional vectors P = <p1, p2, …, pn> and Q = <q1, q2, …, qn>, the Minkowski distance is defined as:
      d(P, Q) = ( Σ_{i=1..n} |p_i − q_i|^a )^(1/a)
  • a = 1 gives the Manhattan distance
  • a = 2 gives the Euclidean distance
  • a → ∞ gives the Chebyshev distance
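A small sketch (our own helper, not from the slides) showing how the single Minkowski formula reduces to the Manhattan and Euclidean distances for a = 1 and a = 2:

```python
# Minkowski distance d(P, Q) = (sum_i |p_i - q_i|^a)^(1/a); a is the order parameter.
import numpy as np

def minkowski(p, q, a):
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)) ** a) ** (1.0 / a)

p, q = [2, 3, 9], [4, 6, 10]
print(minkowski(p, q, 1))   # a = 1, Manhattan: 2 + 3 + 1 = 6
print(minkowski(p, q, 2))   # a = 2, Euclidean: sqrt(4 + 9 + 1) ~ 3.74
# a -> infinity (the Chebyshev case) is worked through numerically a few slides below.
```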
Constraints on Distance Metrics
• The distance function d(p, q) between vectors p and q is considered a metric if it satisfies the following properties (a quick numerical check follows this list):
1. Non-negativity: the distance between p and q is always a value greater than or equal to zero
      d(p, q) ≥ 0
2. Identity of indiscernible vectors: the distance between p and q is equal to zero if and only if p is equal to q
      d(p, q) = 0 iff p = q
3. Symmetry: the distance between p and q is equal to the distance between q and p
      d(p, q) = d(q, p)
4. Triangle inequality: given a third point r, the distance between p and q is always less than or equal to the sum of the distance between p and r and the distance between r and q
      d(p, q) ≤ d(p, r) + d(r, q)
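As an illustration (our own check, not from the slides), the four properties can be verified numerically for a concrete metric such as the Manhattan distance:

```python
# Sanity-check the four metric properties for the Manhattan (L1) distance
# on randomly sampled vectors.
import numpy as np

rng = np.random.default_rng(0)
d = lambda p, q: np.sum(np.abs(p - q))   # Manhattan / L1 distance

for _ in range(1000):
    p, q, r = rng.normal(size=(3, 5))
    assert d(p, q) >= 0                            # 1. non-negativity
    assert np.isclose(d(p, p), 0)                  # 2. identity of indiscernibles
    assert np.isclose(d(p, q), d(q, p))            # 3. symmetry
    assert d(p, q) <= d(p, r) + d(r, q) + 1e-12    # 4. triangle inequality
print("all four properties held on the sampled vectors")
```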


Manhattan Distance
      d_Man(p, q) = d_Man(q, p) = |p1 − q1| + |p2 − q2| + … + |pn − qn| = Σ_{i=1..n} |p_i − q_i|
• The distance between two points is the sum of the absolute differences of their Cartesian coordinates, i.e. the total of the differences along the x- and y-coordinates.
• Also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, city block distance, snake distance, or taxi-cab metric
• Works well for high-dimensional data: it does not amplify differences among the features of the two vectors and, as a result, does not ignore the effect of any feature dimension
• Higher values of a amplify differences and ignore features with smaller differences
Euclidean vs Manhattan Distance
• See https://en.wikipedia.org/wiki/Taxicab_geometry (figure)
Euclidean Distance
      d_Euc(p, q) = d_Euc(q, p) = sqrt( (p1 − q1)^2 + (p2 − q2)^2 + … + (pn − qn)^2 ) = sqrt( Σ_{i=1..n} (p_i − q_i)^2 )
• Good choice for numeric attributes
• When data is dense or continuous, this is a good proximity measure
• Downside: sensitive to extreme deviations in a single attribute (as it squares the differences)
  • The variables which have the largest values greatly influence the result
  • Solution: feature normalization
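A short sketch of that fix (assuming scikit-learn; the toy numbers are ours): standardizing the features keeps a large-valued attribute such as blood pressure from dominating the Euclidean distance.

```python
# Standardize features before KNN so no single attribute swamps the distance.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Blood pressure (~100s) would otherwise dwarf height (~60s-70s) in raw distances.
X = [[62, 110], [72, 130], [65, 140], [67, 130], [64, 170]]
y = ["No", "No", "Yes", "Yes", "No"]

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[66, 145]]))
```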
Chebyshev Distance
(Source: Effect of Different Distance Measures in Result of Cluster Analysis, Sujan Dahal)

      d_Cheb(p, q) = lim_{a→∞} ( Σ_{i=1..n} |p_i − q_i|^a )^(1/a) = max_i |p_i − q_i|

How? Assume p = <2, 3, …, 9> and q = <4, 6, …, 10>:
      d_Cheb(p, q) = lim_{a→∞} ( |2 − 4|^a + |3 − 6|^a + … + |9 − 10|^a )^(1/a)
                   = lim_{a→∞} ( 2^a + 3^a + … + 1^a )^(1/a)
Suppose a = 2:    d(p, q) = (4 + 9 + … + 1)^(1/2)
Suppose a = 3:    d(p, q) = (8 + 27 + … + 1)^(1/3)
Suppose a = 10:   d(p, q) = (1,024 + 59,049 + … + 1)^(1/10)
Now, as a → ∞, the largest term dominates the sum:
      d_Cheb(p, q) = lim_{a→∞} ( Σ_{i=1..n} |p_i − q_i|^a )^(1/a) → lim_{a→∞} ( max_i |p_i − q_i|^a )^(1/a) = max_i |p_i − q_i|
Chebyshev Distance
      d_Cheb(p, q) = max_i |p_i − q_i|
• For the Chebyshev distance, the distance between two vectors is the greatest of their differences along any coordinate dimension
• Useful when two objects are to be considered "different" as soon as they differ in any one dimension
• Also called chessboard distance, maximum metric, or L∞ metric
The KNN Algorithm
• Input: training samples D = {(x1, y1), (x2, y2), …, (xn, yn)}, test sample d = (x, y), and k. Assume x to be an m-dimensional vector.
• Output: class label of test sample d
1. Compute the distance between d and every sample in D
2. Choose the k samples in D that are nearest to d; denote the set by S_d ⊆ D
3. Assign d the label y_i of the majority class in S_d
• Note: all the action takes place in the test phase; the training phase essentially just cleans, normalizes and stores the data
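A from-scratch sketch of these three steps (illustrative only; the helper name and toy data are ours):

```python
# Steps 1-3 of the algorithm above: distances, k nearest, majority vote.
from collections import Counter
import numpy as np

def knn_classify(D_X, D_y, x, k):
    """D_X: n x m training inputs, D_y: n labels, x: m-vector, k: number of neighbors."""
    dists = np.linalg.norm(np.asarray(D_X) - np.asarray(x), axis=1)  # step 1: all distances
    S_d = np.argsort(dists)[:k]                                      # step 2: k nearest indices
    return Counter(np.asarray(D_y)[S_d]).most_common(1)[0][0]        # step 3: majority label

X = [[62, 70], [72, 90], [74, 80], [65, 120], [67, 100], [64, 110], [69, 150]]
y = ["No", "No", "No", "Yes", "Yes", "No", "Yes"]
print(knn_classify(X, y, [66, 115], k=3))
```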
KNN Classification and Regression
#   Height (inches)   Weight (kgs)   B.P. Sys   B.P. Dia   Heart disease   Cholesterol Level
1 62 70 110 80 No 150
2 72 90 130 70 No 160
3 74 80 150 70 No 130
4 65 120 140 90 Yes 200
5 67 100 130 85 Yes 190
6 64 110 170 90 No 130
7 69 150 110 100 Yes 250

KNN Classification and Regression
#   Height (inches)   Weight (kgs)   B.P. Sys   B.P. Dia   Heart disease   Cholesterol Level
1 62 70 110 80 No 150
2 72 90 130 70 No 160
3 74 80 150 70 No 130
4 65 120 140 90 Yes 200
5 67 100 130 85 Yes 190
6 64 110 170 90 No 130
7 69 150 110 100 Yes 250
8 66 115 145 90 ?? ??

KNN Classification and Regression
#   Height (inches)   Weight (kgs)   B.P. Sys   B.P. Dia   Heart disease   Cholesterol Level   Euclidean Distance
1 62 70 110 80 No 150
2 72 90 130 70 No 160
3 74 80 150 70 No 130
4 65 120 140 90 Yes 200
5 67 100 130 85 Yes 190
6 64 110 170 90 No 130
7 69 150 110 100 Yes 250
8 66 115 145 90 ?? ??

KNN Classification and Regression
#   Height (inches)   Weight (kgs)   B.P. Sys   B.P. Dia   Heart disease   Cholesterol Level   Euclidean Distance
1 62 70 110 80 No 150 52.59
2 72 90 130 70 No 160 47.81
3 74 80 150 70 No 130 43.75
4 65 120 140 90 Yes 200 7.14
5 67 100 130 85 Yes 190 16.61
6 64 110 170 90 No 130 15.94
7 69 150 110 100 Yes 250 44.26
8 66 115 145 90 ?? ??

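Using the distances shown in the table, the three nearest neighbors of sample 8 are samples 4, 6 and 5, so the majority vote gives Heart disease = Yes, and averaging their cholesterol levels gives (200 + 130 + 190) / 3 ≈ 173. A sketch of the computation follows; note the slide does not state which features or scaling produced its distance column, so raw-feature distances printed here need not match those numbers exactly.

```python
# Classification (heart disease) and regression (cholesterol) for sample 8
# from its k=3 nearest neighbors, using raw Euclidean distance over the
# four predictor columns of the table.
import numpy as np
from collections import Counter

# columns: height, weight, bp_sys, bp_dia (samples 1-7)
X = np.array([[62, 70, 110, 80], [72, 90, 130, 70], [74, 80, 150, 70],
              [65, 120, 140, 90], [67, 100, 130, 85], [64, 110, 170, 90],
              [69, 150, 110, 100]], dtype=float)
heart = np.array(["No", "No", "No", "Yes", "Yes", "No", "Yes"])
chol = np.array([150, 160, 130, 200, 190, 130, 250], dtype=float)

x8 = np.array([66, 115, 145, 90], dtype=float)       # sample 8
dists = np.linalg.norm(X - x8, axis=1)
nearest = np.argsort(dists)[:3]
print("3 nearest samples:", nearest + 1)
print("heart disease prediction:", Counter(heart[nearest]).most_common(1)[0][0])
print("cholesterol prediction:", chol[nearest].mean())   # plain (unweighted) average
```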
KNN Classification and Regression (Fruit)
#   Mass   Width   Height   Color Score   Fruit Name   Fruit Label
1 176 7.4 7.2 0.6 apple 0
2 178 7.1 7.8 0.92 apple 0
3 156 7.4 7.4 0.84 apple 0
4 154 7.1 7.5 0.78 orange 1
5 180 7.6 8.2 0.79 orange 1
6 118 5.9 8 0.72 lemon 2
7 120 6 8.4 0.74 lemon 2
8 118 6.1 8.1 0.7 lemon ??
9 140 7.3 7.1 0.87 apple ??
10 154 7.2 7.2 0.82 orange ??
KNN Classification and Regression (Fruit)
#   Mass   Width   Height   Color Score   Fruit Name   Fruit Label   D8   D9   D10
1 176 7.4 7.2 0.6 apple 0
2 178 7.1 7.8 0.92 apple 0
3 156 7.4 7.4 0.84 apple 0
4 154 7.1 7.5 0.78 orange 1
5 180 7.6 8.2 0.79 orange 1
6 118 5.9 8 0.72 lemon 2
7 120 6 8.4 0.74 lemon 2
8 118 6.1 8.1 0.7 lemon ??
9 140 7.3 7.1 0.87 apple ??
10 154 7.2 7.2 0.82 orange ??
KNN Classification and Regression (Fruit)
#   Mass   Width   Height   Color Score   Fruit Name   Fruit Label   D8   D9   D10
1 176 7.4 7.2 0.6 apple 0
2 178 7.1 7.8 0.92 apple 0
3 156 7.4 7.4 0.84 apple 0
4 154 7.1 7.5 0.78 orange 1
5 180 7.6 8.2 0.79 orange 1
6 118 5.9 8 0.72 lemon 2
7 120 6 8.4 0.74 lemon 2
8 118 6.1 8.1 0.7 lemon ?? 2
9 140 7.3 7.1 0.87 apple ?? 0
10 154 7.2 7.2 0.82 orange ?? 1
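A sketch for this fruit example (illustrative; the slide does not show how its predictions were computed). Because mass is two orders of magnitude larger than the other features, raw Euclidean distance is dominated by mass, so a feature-scaled variant is shown for comparison:

```python
# 1-NN predictions for the unlabeled rows 8-10, with and without feature scaling.
import numpy as np

# rows 1-7: mass, width, height, color score
X = np.array([[176, 7.4, 7.2, 0.60], [178, 7.1, 7.8, 0.92], [156, 7.4, 7.4, 0.84],
              [154, 7.1, 7.5, 0.78], [180, 7.6, 8.2, 0.79], [118, 5.9, 8.0, 0.72],
              [120, 6.0, 8.4, 0.74]])
labels = np.array([0, 0, 0, 1, 1, 2, 2])       # 0 = apple, 1 = orange, 2 = lemon
queries = np.array([[118, 6.1, 8.1, 0.70], [140, 7.3, 7.1, 0.87], [154, 7.2, 7.2, 0.82]])

def one_nn(train, test):
    return labels[np.argmin(np.linalg.norm(train - test, axis=1))]

mu, sigma = X.mean(axis=0), X.std(axis=0)
for q in queries:
    print("raw:", one_nn(X, q), " scaled:", one_nn((X - mu) / sigma, (q - mu) / sigma))
```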
Example: Handwritten digit recognition
• 16x16 bitmaps, 8-bit grayscale
• Euclidean distances over raw pixels:
      d(p, q) = d(q, p) = sqrt( Σ_{i=1..n} (p_i − q_i)^2 )
• Accuracy:
  • 7-NN ~ 95.2%
  • SVM ~ 95.8%
  • Humans ~ 97.5%
Sources: http://rstudio-pubs-static.s3.amazonaws.com/6287_c079c40df6864b34808fa7ecb71d0f36.html; Victor Lavrenko, https://www.youtube.com/watch?v=ZD_tfNpKzHY&list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ&index=6
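A rough, runnable analogue (assuming scikit-learn; its bundled digits are 8x8 rather than the 16x16 bitmaps above, so the accuracy will differ) using the same recipe of Euclidean 7-NN on raw pixels:

```python
# 7-NN on raw pixel intensities of scikit-learn's small digits dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=7)   # 7-NN as in the slide
print(knn.fit(X_tr, y_tr).score(X_te, y_te))
```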
Complexity of KNN
• Input: training samples D = {(x1, y1), (x2, y2), …, (xn, yn)}, test sample d = (x, y), and k. Assume x to be an m-dimensional vector.
• Output: class label of test sample d
1. Compute the distance between d and every sample in D
   • n samples, each m-dimensional ⇒ O(mn)
2. Choose the k samples in D that are nearest to d; denote the set by S_d ⊆ D
   • Either naively do k passes over all samples, costing O(n) each time, for O(nk)
   • Or use the quickselect algorithm (median of medians) to find the k-th smallest distance in O(n) and then return all distances no larger than it; this accumulates to O(n) (see the sketch below)
3. Assign d the label y_i of the majority class in S_d
   • This is O(k)
• Time complexity: O(mn + n + k) = O(mn), assuming k to be a constant
• Space complexity: O(mn), to store the n m-dimensional training samples
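A sketch of step 2 without a full sort (numpy's argpartition plays the quickselect role here; the helper name is ours):

```python
# Select the k nearest neighbors with partial selection instead of a full O(n log n) sort.
import numpy as np

def k_nearest_indices(X_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)   # step 1: O(mn)
    idx = np.argpartition(dists, k - 1)[:k]       # step 2: O(n) selection, unordered
    return idx[np.argsort(dists[idx])]            # optional: order the k winners, O(k log k)

X_train = np.random.rand(10_000, 16)
print(k_nearest_indices(X_train, np.random.rand(16), k=5))
```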




Choosing the value of K – The theory
• k = 1:
  • High variance: small changes in the dataset lead to big changes in classification
  • Overfitting: the model is too specific and not well generalized
  • It tends to be sensitive to noise
  • The model achieves high accuracy on the training set but will be a poor predictor on new, previously unseen data points
• k = very large (e.g. 100):
  • The model is too generalized and not a good predictor on either the training or the test set
  • High bias
  • Underfitting
• k = n:
  • The majority class in the dataset wins for every prediction
  • High bias
Tuning the hyperparameter K – The method
• Divide your training data into training and validation sets.
• Do multiple iterations of m-fold cross-validation, each time with a different value of k, starting from k = 1.
• Keep iterating until the k with the best classification accuracy (minimal loss) is found.
• What happens if we use the training set itself, instead of a validation set? Which k wins?
  • k = 1, as there is always a nearest instance with the correct label: the instance itself.
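A sketch of this tuning loop (assuming scikit-learn; the dataset and the range of k are arbitrary choices for illustration):

```python
# Pick k by m-fold cross-validation: evaluate each candidate k and keep the best mean accuracy.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16)}          # m = 5 folds, k = 1..15
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```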
KNN – The good, the bad and the ugly
• KNN is a simple algorithm but is highly effective for solving various real-life classification problems, especially when the datasets are large and continuously growing.
• Challenges:
  1. How to find the optimum value of k?
  2. How to find the right distance function?
• Problems:
  1. High computational time cost for each prediction
  2. High memory requirement, as we need to keep all training samples
  3. The curse of dimensionality
Thank You

