Week03 - 1 - KNN
Pattern Recognition
Adapted from Dr. Pádraig Cunningham, COMP47750, School of Computer Science, UCD (Dublin)
Overview
• Eager v Lazy Classification Strategies
• Distance-based Models
• Feature Spaces
• Measuring Distance
• Data Normalisation
• Nearest Neighbours
• k-Nearest Neighbour Classifier (kNN)
• Weighted kNN
• kNN in scikit-learn in Python
Reminder: Classification
• Supervised Learning: Algorithm that learns a function from
manually-labelled training examples.
• Classification: Training examples, usually represented by a set
of descriptive features, help decide the class to which a new
unseen query input belongs.
• Binary Classification: Assign one of two possible target class
labels to the new query input.
[Figure: a new query email input must be assigned to either the Spam or Non-Spam class]
Eager v Lazy Classifiers
• Eager Learning Classification Strategy (model based)
• Classifier builds a full model during an initial training phase, to
use later when new query examples arrive.
• More offline setup work, less work at run-time.
• Generalise before seeing the query example.
• Lazy Learning Classification Strategy (instance based)
• Classifier keeps all the training examples for later use.
• Little work is done offline, wait for new query examples.
• Focus on the local space around the examples.
• Distance-based Models: Many learning algorithms are based on
generalising from training data to unseen data by exploiting the
distances (or similarities) between the two.
Example: Athlete Selection
• Training set of performance ratings for 20 college athletes, where
each athlete is described by 2 continuous features: speed, agility.
• Each athlete has a target class label indicating whether they were
selected for the university athletics team: 'Yes' or 'No'.
[Table: training set of 20 examples (athletes) with Speed, Agility and Selected values; the full table appears in the Weighted kNN example later]
Measuring Distance
• Measuring the distance (or similarity) between two examples is
fundamental to many ML algorithms.
• Many measures can be used to calculate distance. There is no
“best” distance measure. The choice is highly problem-dependent.
[Figure: athletes plotted in the speed-agility feature space; examples x4 and x5 have a low distance (high similarity)]
Measuring Distance
• Distance function: A suitable function to measure how distant (or similar) two input examples are from one another in some D-dimensional feature space.
• Overlap metric: the distance for a categorical feature is 0 if the two values are identical and 1 otherwise. Generally suitable for categorical data.

(x1 = Female, Irish; x2 = Male, Irish; x3 = Male, Italian)
For feature Gender:      dg(x1,x2) = 1,  dg(x1,x3) = 1,  dg(x2,x3) = 0
For feature Nationality: dn(x1,x2) = 0,  dn(x1,x3) = 1,  dn(x2,x3) = 1
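To make the overlap metric concrete, here is a minimal Python sketch (not from the original slides; the function name overlap is illustrative) applied to the Gender and Nationality values above:

def overlap(a, b):
    # Overlap distance for a single categorical feature: 0 if the values match, 1 otherwise
    return 0 if a == b else 1

# x1 = Female/Irish, x2 = Male/Irish, x3 = Male/Italian
print(overlap('Female', 'Male'))    # dg(x1, x2) = 1
print(overlap('Male', 'Male'))      # dg(x2, x3) = 0
print(overlap('Irish', 'Italian'))  # dn(x2, x3) = 1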
Measuring Distance
• Absolute difference: For numeric data, we can calculate the absolute value of the difference between the two feature values.

Athlete  Speed  Agility
x1       2.50   6.00
x2       3.75   8.00
x3       2.25   5.50

For feature Speed:   ds(x1,x2) = |2.50-3.75| = 1.25,  ds(x1,x3) = |2.50-2.25| = 0.25,  ds(x2,x3) = |3.75-2.25| = 1.50
For feature Agility: ds(x1,x2) = |6.0-8.0| = 2.0,  ds(x1,x3) = |6.0-5.5| = 0.5,  ds(x2,x3) = |8.0-5.5| = 2.5
• Euclidean distance: For two input examples p and q, calculate the square of the difference between the examples on each feature f in the feature set F, sum these, and take the square root:

$ED(p, q) = \sqrt{\sum_{f \in F} (p_f - q_f)^2}$
Measuring Distance
• Example: Apply Euclidean distance, where F consists of 2 numeric features: speed, agility.

Athlete  Speed  Agility
x4       3.25   8.25
x15      4.75   6.25
x5       2.75   7.50

$ED(x_4, x_{15}) = \sqrt{(3.25-4.75)^2 + (8.25-6.25)^2} = \sqrt{6.25} = 2.5$
$ED(x_4, x_5) = \sqrt{(3.25-2.75)^2 + (8.25-7.50)^2} = \sqrt{0.8125} \approx 0.90$
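As a quick check (added for illustration, not part of the slides), a small NumPy sketch that reproduces the Euclidean distances above:

import numpy as np

def euclidean(p, q):
    # Square root of the sum of squared per-feature differences
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

x4, x15, x5 = [3.25, 8.25], [4.75, 6.25], [2.75, 7.50]
print(euclidean(x4, x15))   # 2.5
print(euclidean(x4, x5))    # ~0.90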
Heterogeneous Distance Functions
• In many datasets, the features associated with examples will
have different types (e.g. continuous, categorical, ordinal etc).
• We can create a global measure from different local distance
functions, using an appropriate function for each feature.
Athlete  Speed  Agility  Gender  Nationality
x1       2.50   6.00     Female  Irish
x2       3.75   8.00     Male    Irish
x3       2.25   5.50     Male    Italian

• Use absolute difference for continuous features Speed & Agility.
• Use overlap for categorical features Gender & Nationality.
• The global distance is calculated as the sum over the individual local distances (see the sketch below).
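Here is a small illustrative sketch (not from the slides; hetero_distance is a made-up helper name) of such a global heterogeneous distance over these four features, summing absolute differences for the numeric features and overlap distances for the categorical ones:

def hetero_distance(p, q):
    # p and q are dicts with numeric Speed/Agility and categorical Gender/Nationality
    d = abs(p['Speed'] - q['Speed']) + abs(p['Agility'] - q['Agility'])
    d += 0 if p['Gender'] == q['Gender'] else 1
    d += 0 if p['Nationality'] == q['Nationality'] else 1
    return d

x1 = {'Speed': 2.50, 'Agility': 6.00, 'Gender': 'Female', 'Nationality': 'Irish'}
x2 = {'Speed': 3.75, 'Agility': 8.00, 'Gender': 'Male', 'Nationality': 'Irish'}
print(hetero_distance(x1, x2))   # 1.25 + 2.0 + 1 + 0 = 4.25

In practice the numeric features would usually be normalised first, as covered next, so that no single feature dominates the sum.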
Data Normalisation
• Min-max normalisation: Use the min and max values for a given feature to rescale it to the range [0,1]:

$z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$

• Example: Feature Age
Age (non-normalised): 24   19   50   40   23   68   45   33   80   58
min(x) = 19, max(x) = 80, so max(x) - min(x) = 61
Age (normalised):     0.08 0.00 0.51 0.34 0.07 0.80 0.43 0.23 1.00 0.64
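A quick NumPy check of the Age example (added for illustration; not part of the slides):

import numpy as np

age = np.array([24, 19, 50, 40, 23, 68, 45, 33, 80, 58])
z = (age - age.min()) / (age.max() - age.min())   # rescale to [0, 1]
print(np.round(z, 2))   # [0.08 0.   0.51 0.34 0.07 0.8  0.43 0.23 1.   0.64]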
Nearest Neighbour Classifier
Lazy learning approach: Do not build a model for the data. Identify
most similar previous example(s) from the training set for which a
label has already been assigned, using some distance function.
Nearest neighbour rule (1NN): For a new query input q, find a single
labelled example x closest to q, and assign q the same label as x.
[Figure: query q1 is assigned to the 'No' class of its nearest neighbour; query q2 is assigned to the 'Yes' class]
k-Nearest Neighbour Classifier
k-Nearest neighbours (kNN): The NN approach naturally generalises
to the case where we use k nearest neighbours from the training
set to assign a label to a new query input.
Example: For new query inputs, calculate distance to all training
examples. Find k=3 nearest examples (i.e. with smallest distances).
[Figure: the 3 nearest training examples to q1 and the 3 nearest training examples to q2, highlighted in the feature space]
k-Nearest Neighbour Classifier
Majority voting: The decision on a label for a new query example is
decided based on the “votes” of its k nearest neighbours. The label
for the query is the majority label of its neighbours.
Example: Measure distance from q to all training examples.
Find the k=3 nearest examples, and use their labels as votes.
[Figure: query q with its 3 nearest neighbours x16, x15 and x6]
Neighbour counts:
• Yes = 2 votes
• No = 1 vote
➡ Majority says Yes!
k-Nearest Neighbour Classifier
Majority voting: The decision on a label for a new query example is
decided based on the “votes” of its k nearest neighbours. The label
for the query is the majority label of its neighbours.
Example: Measure distance from q to all training examples.
Find the k=4 nearest examples, and use their labels as votes.
[Figure: query q with its 4 nearest neighbours x16, x15, x6 and x12]
In the case that…
• Yes = 2 votes
• No = 2 votes
Can break ties:
‣ At random
‣ Based on the sum of neighbour distances
Example: kNN Classification (k=3)
• Training set of 20 athletes - 8 labelled as 'Yes', 12 as 'No'.
• Each athlete described by 2 continuous features: Speed, Agility
Euclidean distance would be an appropriate distance function.
[Table: the full training set of 20 athletes with Speed, Agility and Selected values; shown with the distances to q in the Weighted kNN example below]
• Rank the training examples and identify the set of 3 examples with the smallest distances to the query q.

Athlete  Speed  Agility  Selected  Distance
x16      5.50   6.75     Yes       0.901
x15      4.75   6.25     Yes       1.275
x2       3.75   8.00     No        1.346

• Yes = 2 votes
• No = 1 vote
➡ Majority says Yes, so assign label Yes to q
Weighted kNN
• Weighted voting: In this approach, some training examples have a
higher weight than others.
• Instead of using a binary vote of 1 for each nearest neighbour,
typically closer neighbours get higher votes when deciding on the
predicted label for a query example.
• Inverse distance-weighted voting: Simplest strategy is to take a
neighbour’s vote to be the inverse of their distance from the query
(i.e. 1/Distance). We then sum over the weights for each class.
d(q, x16) = 0.901
weight(x16) = 1 / d(q, x16) = 1 / 0.901 = 1.109

d(q, x2) = 1.346
weight(x2) = 1 / d(q, x2) = 1 / 1.346 = 0.743
Example: Weighted kNN (k=3)
• Measure distance between q and all 20 training examples.
Athlete Speed Agility Selected Distance Athlete Speed Agility Selected Distance
x1 2.50 6.00 No 2.915 x11 2.00 2.00 No 6.265
x2 3.75 8.00 No 1.346 x12 5.00 2.50 No 5.000
x3 2.25 5.50 No 3.400 x13 8.25 8.50 Yes 3.400
x4 3.25 8.25 No 1.904 x14 5.75 8.75 Yes 1.458
x5 2.75 7.50 No 2.250 x15 4.75 6.25 Yes 1.275
x6 4.50 5.00 No 2.550 x16 5.50 6.75 Yes 0.901
x7 3.50 5.25 No 2.704 x17 5.25 9.50 Yes 2.016
x8 3.00 3.25 No 4.697 x18 7.00 4.25 Yes 3.816
x9 4.00 4.00 No 3.640 x19 7.50 8.00 Yes 2.550
x10 4.25 3.75 No 3.824 x20 7.25 3.75 Yes 4.373
• Rank the training examples and identify the set of 3 examples with the smallest distances. Assign weights based on 1/Distance, and sum the weights for each class.

Athlete  Speed  Agility  Selected  Distance  Weight
x16      5.50   6.75     Yes       0.901     1.109
x15      4.75   6.25     Yes       1.275     0.784
x2       3.75   8.00     No        1.346     0.743

• Weights for Yes = 1.109 + 0.784 = 1.893
• Weights for No = 0.743
➡ Majority says Yes
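For reference, a NumPy sketch (not from the slides) that reproduces this worked example end to end. The query's coordinates are not given on the slides; q = (5.00, 7.50) is an assumed value, chosen because it is consistent with the distances shown (0.901, 1.275, 1.346, ...).

import numpy as np

# Training data from the athlete table above: (Speed, Agility) and class labels
X = np.array([
    [2.50, 6.00], [3.75, 8.00], [2.25, 5.50], [3.25, 8.25], [2.75, 7.50],
    [4.50, 5.00], [3.50, 5.25], [3.00, 3.25], [4.00, 4.00], [4.25, 3.75],
    [2.00, 2.00], [5.00, 2.50], [8.25, 8.50], [5.75, 8.75], [4.75, 6.25],
    [5.50, 6.75], [5.25, 9.50], [7.00, 4.25], [7.50, 8.00], [7.25, 3.75],
])
y = np.array(['No'] * 12 + ['Yes'] * 8)

# Assumed query point (consistent with the distances in the table)
q = np.array([5.00, 7.50])

# Euclidean distance from q to every training example
dists = np.sqrt(((X - q) ** 2).sum(axis=1))

# Indices of the k = 3 nearest neighbours
k = 3
nn = np.argsort(dists)[:k]

# Plain majority vote
labels, counts = np.unique(y[nn], return_counts=True)
print('Majority vote:', labels[np.argmax(counts)])      # Yes

# Inverse-distance weighted vote: sum 1/d per class
weights = 1.0 / dists[nn]
for label in np.unique(y[nn]):
    print(label, round(weights[y[nn] == label].sum(), 3))
# No ≈ 0.743, Yes ≈ 1.89 -> predict Yes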
Parameter Tuning
• A simple 1-NN classifier is easy to implement, but it is susceptible to “noise” in the data: a misclassification occurs every time a single noisy example is retrieved as the nearest neighbour.
• We might decide to vary the neighbourhood size parameter k to
improve the predictive performance of kNN.
• Choosing between different settings of an algorithm is often
referred to as hyperparameter tuning or model selection.
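As an illustrative sketch of such tuning (not on the slides), k could be chosen by cross-validation with scikit-learn's GridSearchCV; the data loading follows the athlete example on the later slides, while the parameter grid and cv value are assumptions:

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Load the athlete data as on the later slides
athlete = pd.read_csv('AthleteSelection.csv', index_col='Athlete')
y = athlete.pop('Selected').values
X = athlete.values

# Try a range of neighbourhood sizes; pick the best k by cross-validated accuracy
param_grid = {'n_neighbors': [1, 3, 5, 7]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)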
Problems with kNN
Finding nearest neighbours can be slow, especially in a high-dimensional feature space, since each query must be compared against the full training set.
k-NN with Scikit Learn
Load a dataset into Python
• Load a csv file into a Pandas dataframe in Python
import pandas as pd

# Load the CSV file into a DataFrame, using the Athlete column as the index
athlete = pd.read_csv('AthleteSelection.csv', index_col='Athlete')
athlete.head()

# Separate the target labels (y) from the descriptive features (X)
y = athlete.pop('Selected').values
X = athlete.values
Output of athlete.head() (columns: Speed, Agility, Selected):
x2  3.75  8.00  0
x3  2.25  5.50  0
x4  3.25  8.25  0
x5  2.75  7.50  0
Train a k-NN classifier
# Set up a kNN classifier with k = 3 (the value used on the following slide) and train it
from sklearn.neighbors import KNeighborsClassifier
forecast_kNN = KNeighborsClassifier(n_neighbors=3)
forecast_kNN.fit(X, y)
First rows of the training data shown on this slide (index, three numeric features, 0/1 class label); note that this and the following outputs come from a different example dataset (18 training examples) than the 20 athletes:
0   6  85  30  0
1  14  90  35  0
2  15  86   8  1
3  21  56  15  1
4  17  67   9  1
Test on the Training Data
• Use the training data as test data (not a good idea in general; a sketch of the usual held-out test-set approach follows this slide).
• With k = 3 there is one misclassification.
y_dash = forecast_kNN.predict(X)
print(' y:',y)
print('y_dash:',y_dash)
y: [0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 0 0]
y_dash: [0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 1 0]
Confusion matrix:
[[ 7 1]
[ 0 10]]
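Since evaluating on the training data is optimistic, here is a minimal sketch (not from the slides) of the more usual approach with a held-out test set, assuming X and y as loaded earlier:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Hold out 30% of the data and evaluate the kNN classifier on unseen examples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))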
Normalizing (Scaling) Data
• Normalize the data so that all features have the same influence.
• Two popular approaches:
• Standardisation to N(0,1) (zero mean, unit variance)
• Min-max scaling, typically to the range [0,1]
from sklearn import preprocessing

# Set up the scaler on the training data
scaler = preprocessing.StandardScaler().fit(X)

# Scale the data (and any query example q) using the same scaler
X_scaled = scaler.transform(X)
q_scaled = scaler.transform([q])
Instance weighting
• Give nearer neighbours a bigger weight (vote), based on their distance from the query.
from sklearn.metrics import confusion_matrix

forecast_kNN_SW = KNeighborsClassifier(n_neighbors=3, weights='distance')
forecast_kNN_SW.fit(X_scaled, y)
y_dash = forecast_kNN_SW.predict(X_scaled)
confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion))
print('\n     y:', y)
print('y_dash:', y_dash)
Confusion matrix:
[[ 8 0]
[ 0 10]]
y: [0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 0 0]
y_dash: [0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 0 0]
Advantages and disadvantages of the KNN algorithm
Advantages
• Easy to implement: kNN is simple and, as a lazy learner, requires no model-building phase.
• Adapts easily: newly labelled examples can simply be added to the training set.
• Few hyperparameters: essentially just the neighbourhood size k and the choice of distance measure.
Disadvantages
Does not scale well: Since KNN is a lazy algorithm, it takes up more memory and data storage than other classifiers, which can be costly in both time and money.
Curse of dimensionality: The KNN algorithm tends to fall victim to the curse of dimensionality, meaning it does not perform well with high-dimensional inputs. This is sometimes also referred to as the peaking phenomenon: after the algorithm attains the optimal number of features, additional features increase the number of classification errors, especially when the sample size is small.
Prone to overfitting: The value of k also affects the model's behaviour. Lower values of k can overfit the data, whereas higher values of k tend to “smooth out” the predictions, since the decision is made over a greater area, or neighbourhood. However, if the value of k is too high, the model can underfit the data.