L3 KNN
Supervised Learning
Some slides were adapted/taken from various sources, including Prof. Andrew Ng’s Coursera Lectures, Stanford
University, Prof. Kilian Q. Weinberger’s lectures on Machine Learning, Cornell University, Prof. Sudeshna Sarkar’s
Lecture on Machine Learning, IIT Kharagpur, Prof. Bing Liu’s lecture, University of Illinois at Chicago (UIC),
CS231n: Convolutional Neural Networks for Visual Recognition lectures, Stanford University and many more. We
thankfully acknowledge them. Students are requested to use this material for their study only and NOT to distribute it.
Recap
• Training data: D = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ R^d is a feature vector and y_i ∈ C is its label.
Hypothesis Class
Given a test point x, let S_x ⊆ D denote the set of its k nearest neighbors, i.e. |S_x| = k and
∀ (x', y') ∈ D \ S_x :  dist(x, x') ≥ max_{(x'', y'') ∈ S_x} dist(x, x'')
(i.e. every point in D but not in S_x is at least as far away from x as the furthest
point in S_x). We can then define the classifier h(·) as a function returning the most
common label in S_x:
h(x) = mode({y'' : (x'', y'') ∈ S_x})
A binary classification example with k = 3. The green point in the center is the test
sample x. The labels of the 3 neighbors are 2×(+1) and 1×(−1), resulting in a majority
prediction of (+1).
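To make this concrete, here is a minimal sketch of such a classifier in Python (not from the slides; the function name knn_predict and the choice of Euclidean distance are illustrative assumptions):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Euclidean distances from the test point to every training point
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # indices of the k closest training points (the set S_x)
    nearest = np.argsort(dists)[:k]
    # return the most common label among the neighbors (the mode)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny example mirroring the figure: 2 neighbors labeled +1 and 1 labeled -1, so predict +1
X_train = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.9], [5.0, 5.0]])
y_train = np.array([+1, +1, -1, -1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # prints 1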
Distance Function
• The k-nearest neighbor classifier fundamentally relies on a distance
metric. The better that metric reflects label similarity, the better the
classification will be. The most common choice is the Minkowski
distance (see the sketch below this list).
• p = 1: Manhattan distance
• p = 2: Euclidean distance, etc.
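As a sketch (not from the slides; the function name minkowski is an illustrative choice), the Minkowski distance can be written directly as code:

import numpy as np

def minkowski(x, z, p):
    # dist(x, z) = (sum_r |x_r - z_r|^p)^(1/p)
    return np.sum(np.abs(x - z) ** p) ** (1.0 / p)

x, z = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, z, p=1))  # Manhattan distance: 7.0
print(minkowski(x, z, p=2))  # Euclidean distance: 5.0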
Curse of Dimensionality
• The kNN classifier makes the assumption that similar points share similar labels.
Let ℓ be the edge length of the smallest hyper-cube that contains all k nearest
neighbors of a test point. Then ℓ^d ≈ k/n, i.e. ℓ ≈ (k/n)^(1/d). If n = 1000 and
k = 10, how big is ℓ?

d        ℓ
2        0.1
10       0.63
100      0.955
1000     0.9954
So as d ≫ 0, almost the entire space is needed to find the 10-NN. This breaks down the
k-NN assumption, because the k nearest neighbors are not particularly closer (and therefore
not more similar) to the test point than any other data points in the training set. Why would
the test point share its label with its k nearest neighbors if they are not actually similar to it?
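The table above can be reproduced with a few lines (a sketch, assuming k = 10 and n = 1000 as in the example):

# edge length of the hyper-cube expected to contain the k nearest neighbors:
# ell^d ~ k/n, so ell ~ (k/n)^(1/d)
k, n = 10, 1000
for d in [2, 10, 100, 1000]:
    ell = (k / n) ** (1 / d)
    print(f"d = {d:4d}   ell = {ell:.4f}")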
Figure demonstrating "the curse of dimensionality". The histogram plots show the distributions
of all pairwise distances between randomly distributed points within d-dimensional unit hypercubes.
As the number of dimensions d grows, all distances concentrate within a very small range.
What happens if we increase k?
Curse of Dimensionality
• One might think that one rescue could be to increase the number of training
samples, n, until the nearest neighbors are truly close to the test point. How many
data points would we need such that ℓ becomes truly small?
• For d>100, we would need far more data points than there are electrons in the
universe...
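Why so many? A quick back-of-the-envelope check (a sketch, assuming we want ℓ = 0.1 and keep k = 10; the ~10^80 figure is the commonly cited estimate for the number of electrons in the observable universe):

# from ell ~ (k/n)^(1/d): fixing ell = 0.1 requires n = k / ell^d = k * 10^d points
k, ell = 10, 0.1
for d in [10, 100]:
    n = k / ell ** d
    print(f"d = {d:3d}   points needed ~ {n:.0e}")
# d = 100 already needs ~1e+101 points, far more than the ~1e80 electrons
# commonly estimated to exist in the observable universe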
Distances to hyperplanes
• So the distance between two randomly drawn data points increases drastically with their
dimensionality.
• How about the distance to a hyperplane?
• Consider the following figure. There are two blue points and a red hyperplane.
The left plot shows the scenario in 2d and the right plot in 3d.
This confirms again that pairwise distances grow in high dimensions. On the other hand, the distance
to the red hyperplane remains unchanged as the third dimension is added.
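The effect in the figure is easy to check numerically. The following sketch (not from the slides; the axis-aligned hyperplane x_1 = 0.5 is an illustrative choice) compares average pairwise distances with average distances to a fixed hyperplane as d grows:

import numpy as np

rng = np.random.default_rng(0)
for d in [2, 3, 10, 100, 1000]:
    # pairs of points drawn uniformly from the d-dimensional unit cube
    x = rng.random((1000, d))
    z = rng.random((1000, d))
    pairwise = np.linalg.norm(x - z, axis=1).mean()
    # distance of x to the hyperplane x_1 = 0.5; only the first coordinate matters,
    # because the remaining d-1 axes are orthogonal to the hyperplane's normal
    to_plane = np.abs(x[:, 0] - 0.5).mean()
    print(f"d = {d:4d}   mean pairwise dist = {pairwise:6.2f}   mean dist to hyperplane = {to_plane:.2f}")

For points drawn uniformly from the unit cube, the mean pairwise distance grows roughly like √(d/6), while the mean distance to this hyperplane stays near 0.25 for every d.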
Distances to hyperplanes
• The reason is that the normal of the hyperplane is orthogonal to the new dimension. This
is a crucial observation.
• In d dimensions, d−1 dimensions will be orthogonal to the normal of any given hyperplane.
Movement in those dimensions cannot increase or decrease the distance to the hyperplane;
the points just shift around and remain at the same distance.
• So while pairwise distances between points become very large in high-dimensional spaces,
distances to hyperplanes become comparatively tiny.
• For machine learning algorithms, this is highly relevant. As we will see later on, many
classifiers (e.g. the Perceptron or SVMs) place hyperplanes between concentrations of
different classes.
• One consequence of the curse of dimensionality is that most data points tend to be very
close to these hyperplanes, and it is often possible to perturb an input slightly (and often
imperceptibly) in order to change the classification outcome. This practice has recently
become known as the creation of adversarial samples, whose existence is often falsely
attributed to the complexity of neural networks.
Summary
• As n → ∞, the k-NN classifier becomes provably very accurate, but also very slow.