Week03 - 1 - KNN

The document discusses nearest neighbors classifiers. It begins with an overview of eager vs lazy classification strategies and distance-based models. It then covers feature spaces, various methods for measuring distance between examples (absolute difference, Euclidean distance, etc.), and data normalization. Finally, it introduces the k-nearest neighbor classifier (kNN) and weighted kNN, noting they can be implemented in scikit-learn.

CCAI 312

Pattern Recognition

Nearest Neighbors Classifiers

Adapted from Dr. Pádraig Cunningham, COMP47750, School of Computer Science, UCD (Dublin)
Overview
• Eager v Lazy Classification Strategies
• Distance-based Models
• Feature Spaces
• Measuring Distance
• Data Normalisation
• Nearest Neighbours
• k-Nearest Neighbour Classifier (kNN)
• Weighted kNN
• kNN in scikit-learn in Python

Reminder: Classification
• Supervised Learning: Algorithm that learns a function from
manually-labelled training examples.
• Classification: Training examples, usually represented by a set
of descriptive features, help decide the class to which a new
unseen query input belongs.
• Binary Classification: Assign one of two possible target class
labels to the new query input.

[Diagram: a query input is assigned to one of the two classes, 'Spam' or 'Non-Spam'.]

• Multiclass Classification: Assign one of M > 2 possible target class
labels to the new query input.
Eager v Lazy Classifiers
• Eager Learning Classification Strategy (model based)
• Classifier builds a full model during an initial training phase, to
use later when new query examples arrive.
• More offline setup work, less work at run-time.
• Generalise before seeing the query example.
• Lazy Learning Classification Strategy (instance based)
• Classifier keeps all the training examples for later use.
• Little work is done offline, wait for new query examples.
• Focus on the local space around the examples.
• Distance-based Models: Many learning algorithms are based on
generalising from training data to unseen data by exploiting the
distances (or similarities) between the two.
Example: Athlete Selection
• Training set of performance ratings for 20 college athletes, where
each athlete is described by 2 continuous features: speed, agility.
• Each athlete has a target class label indicating whether they were
selected for the university athletics team: 'Yes' or 'No'.
Athlete  Speed  Agility  Selected        Athlete  Speed  Agility  Selected
x1       2.50   6.00     No              x11      2.00   2.00     No
x2       3.75   8.00     No              x12      5.00   2.50     No
x3       2.25   5.50     No              x13      8.25   8.50     Yes
x4       3.25   8.25     No              x14      5.75   8.75     Yes
x5       2.75   7.50     No              x15      4.75   6.25     Yes
x6       4.50   5.00     No              x16      5.50   6.75     Yes
x7       3.50   5.25     No              x17      5.25   9.50     Yes
x8       3.00   3.25     No              x18      7.00   4.25     Yes
x9       4.00   4.00     No              x19      7.50   8.00     Yes
x10      4.25   3.75     No              x20      7.25   3.75     Yes

Q. Will a new athlete q (Speed 3.00, Agility 8.00) be selected: 'Yes' or 'No'?
Feature Spaces
• A feature space is a D-dimensional coordinate space used to
represent the input examples for a given problem, with one
coordinate for each descriptive feature.
• Example: Use a feature space to visually position the 20 athletes
in a 2-dimensional coordinate space (i.e. agility versus speed):

[Scatter plot: the training set of 20 examples (athletes), each described by 2
feature values, plotted in the agility-versus-speed feature space.]
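As a rough sketch of how such a feature-space plot could be produced in Python (the data entry and plotting choices below are mine, not from the original notebook):

import matplotlib.pyplot as plt

# Athlete training data from the table above: (speed, agility) and 'Selected' labels
speed    = [2.50, 3.75, 2.25, 3.25, 2.75, 4.50, 3.50, 3.00, 4.00, 4.25,
            2.00, 5.00, 8.25, 5.75, 4.75, 5.50, 5.25, 7.00, 7.50, 7.25]
agility  = [6.00, 8.00, 5.50, 8.25, 7.50, 5.00, 5.25, 3.25, 4.00, 3.75,
            2.00, 2.50, 8.50, 8.75, 6.25, 6.75, 9.50, 4.25, 8.00, 3.75]
selected = ['No'] * 12 + ['Yes'] * 8   # x1-x12 not selected, x13-x20 selected

# Plot each class with its own marker so the two groups are easy to tell apart
for label, marker in [('No', 'o'), ('Yes', '^')]:
    xs = [s for s, c in zip(speed, selected) if c == label]
    ys = [a for a, c in zip(agility, selected) if c == label]
    plt.scatter(xs, ys, marker=marker, label=label)

plt.xlabel('Speed')
plt.ylabel('Agility')
plt.legend(title='Selected')
plt.show()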
Measuring Distance
• Measuring the distance (or similarity) between two examples is
fundamental to many ML algorithms.
• Many measures can be used to calculate distance. There is no
“best” distance measure. The choice is highly problem-dependent.
[Scatter plot annotation: examples x4 and x5 have a low distance (high similarity),
while examples x10 and x13 have a high distance (low similarity).]
Measuring Distance
• Distance function: A suitable function to measure how distant
(or similar) two input examples are from one another in some
D-dimensional feature space.

• Local distance function: Measures the distance between two
examples based on a single feature.
  • e.g. what is the distance between x1 and x2 in terms of Speed?
  • e.g. what is the distance between x1 and x2 in terms of Agility?

Athlete  Speed  Agility
x1       2.50   6.00
x2       3.75   8.00

• Global distance function: Measures the distance between two
examples based on the combination of the local distances
across all features.
  • e.g. what is the distance between x1 and x2 based on both
    Speed and Agility?
Measuring Distance
• Overlap function: Simplest local distance measure. Returns 0 if the
two values for a feature are equal and 1 otherwise. Generally suitable
for categorical data.

Athlete  Gender  Nationality
x1       Female  Irish
x2       Male    Irish
x3       Male    Italian

For feature Gender:          For feature Nationality:
dg(x1,x2) = 1                dn(x1,x2) = 0
dg(x1,x3) = 1                dn(x1,x3) = 1
dg(x2,x3) = 0                dn(x2,x3) = 1

• Hamming distance: Global distance function which is the sum of
the overlap differences across all features - i.e. the number of features
on which two examples disagree.

d(x1,x2) = 1 + 0 = 1    (overlap distance for Gender +
d(x1,x3) = 1 + 1 = 2     overlap distance for Nationality)
d(x2,x3) = 0 + 1 = 1
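A minimal Python sketch of the overlap and Hamming distances just described (the function names are mine, for illustration):

def overlap(a, b):
    """Local overlap distance: 0 if the two feature values are equal, 1 otherwise."""
    return 0 if a == b else 1

def hamming(x, y):
    """Global Hamming distance: number of features on which x and y disagree."""
    return sum(overlap(a, b) for a, b in zip(x, y))

# Examples from the slide, described by (Gender, Nationality)
x1 = ('Female', 'Irish')
x2 = ('Male', 'Irish')
x3 = ('Male', 'Italian')

print(hamming(x1, x2))  # 1
print(hamming(x1, x3))  # 2
print(hamming(x2, x3))  # 1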
Measuring Distance
• Absolute difference: For numeric data, we can calculate the absolute
value of the difference between the values of a feature.

Athlete  Speed  Agility
x1       2.50   6.00
x2       3.75   8.00
x3       2.25   5.50

For feature Speed:                     For feature Agility:
ds(x1,x2) = |2.50-3.75| = 1.25         da(x1,x2) = |6.0-8.0| = 2.0
ds(x1,x3) = |2.50-2.25| = 0.25         da(x1,x3) = |6.0-5.5| = 0.5
ds(x2,x3) = |3.75-2.25| = 1.5          da(x2,x3) = |8.0-5.5| = 2.5

• Again we can compute a global distance between two examples by
summing the local distances over all features.

d(x1,x2) = 1.25 + 2.0 = 3.25    (absolute difference for Speed +
d(x1,x3) = 0.25 + 0.5 = 0.75     absolute difference for Agility)
d(x2,x3) = 1.5 + 2.5 = 4.0

• For ordinal features, calculate the absolute value of the difference
between the two positions in the ordered list of possible values.

e.g. Ordinal feature Dosage: {Low, Medium, High} = {1, 2, 3}
diff(Low,High) = |1-3| = 2
diff(Medium,Low) = |2-1| = 1
diff(High,High) = |3-3| = 0
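A matching Python sketch for the absolute-difference local distance and its summed global distance (helper names are mine):

def abs_diff(a, b):
    """Local distance for a numeric (or ordinal-encoded) feature."""
    return abs(a - b)

def global_abs_distance(x, y):
    """Global distance: sum of the absolute differences over all features."""
    return sum(abs_diff(a, b) for a, b in zip(x, y))

# Examples from the slide, described by (Speed, Agility)
x1, x2, x3 = (2.50, 6.00), (3.75, 8.00), (2.25, 5.50)

print(global_abs_distance(x1, x2))  # 1.25 + 2.0 = 3.25
print(global_abs_distance(x1, x3))  # 0.25 + 0.5 = 0.75
print(global_abs_distance(x2, x3))  # 1.5 + 2.5 = 4.0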
Measuring Distance
• Euclidean distance: Most common measure used to quantify
distance between two examples with numeric features.
• Given by the "straight line" distance between two points in a
Euclidean coordinate space - e.g. a feature space.
• Calculated as the square root of sum of squared differences for
each feature f representing a pair of examples.
• The output is a real value ≥ 0, where a larger value indicates two
examples are more distant (i.e. less similar to one another).

ED(p, q) = sqrt( Σ_{f ∈ F} (p_f - q_f)² )

Input: two examples p and q. For each feature f in the full set of features F,
calculate the square of the difference between the two examples on that feature,
sum these squared differences, and take the square root.
Measuring Distance
• Example: Apply Euclidean distance, where F consists of 2 numeric
features: speed, agility.

[Scatter plot highlighting athletes x4, x5 and x15 in the feature space.]

Athlete  Speed  Agility
x4       3.25   8.25
x5       2.75   7.50
x15      4.75   6.25

ED(x4, x15) = sqrt( (3.25 - 4.75)² + (8.25 - 6.25)² )
            = sqrt( 2.25 + 4.00 ) = sqrt(6.25) = 2.5

ED(x4, x5)  = sqrt( (3.25 - 2.75)² + (8.25 - 7.50)² )
            = sqrt( 0.25 + 0.5625 ) = sqrt(0.8125) ≈ 0.90
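A quick Python check of these two Euclidean distances (a throwaway sketch, names mine):

import math

def euclidean(p, q):
    """ED(p,q): square root of the sum of squared per-feature differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

x4, x5, x15 = (3.25, 8.25), (2.75, 7.50), (4.75, 6.25)

print(euclidean(x4, x15))  # 2.5
print(euclidean(x4, x5))   # ~0.901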
Heterogeneous Distance Functions
• In many datasets, the features associated with examples will
have different types (e.g. continuous, categorical, ordinal etc).
• We can create a global measure from different local distance
functions, using an appropriate function for each feature.
Athlete  Speed  Agility  Gender  Nationality
x1       2.50   6.00     Female  Irish
x2       3.75   8.00     Male    Irish
x3       2.25   5.50     Male    Italian

• Use absolute difference for the continuous features Speed & Agility.
• Use overlap for the categorical features Gender & Nationality.
• The global distance is calculated as the sum of the individual local distances.

d(x1,x2) = 1.25 + 2.0 + 1 + 0 = 4.25
d(x1,x3) = 0.25 + 0.5 + 1 + 1 = 2.75
d(x2,x3) = 1.5 + 2.5 + 0 + 1 = 5.0

Often domain expertise is required to choose an appropriate
distance function for a particular dataset.
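A sketch of such a heterogeneous global distance in Python, combining the two local functions used above (the helper names are mine):

def abs_diff(a, b):
    return abs(a - b)

def overlap(a, b):
    return 0 if a == b else 1

def hetero_distance(x, y, local_fns):
    """Apply one local distance function per feature and sum the results."""
    return sum(fn(a, b) for fn, a, b in zip(local_fns, x, y))

# Features: (Speed, Agility, Gender, Nationality)
x1 = (2.50, 6.00, 'Female', 'Irish')
x2 = (3.75, 8.00, 'Male', 'Irish')
x3 = (2.25, 5.50, 'Male', 'Italian')
local_fns = [abs_diff, abs_diff, overlap, overlap]

print(hetero_distance(x1, x2, local_fns))  # 4.25
print(hetero_distance(x1, x3, local_fns))  # 2.75
print(hetero_distance(x2, x3, local_fns))  # 5.0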
Data Normalisation
• Numeric features often have different ranges, which can skew
certain distance functions.
• So that all features have a similar range, we apply feature
normalisation.

Example  Age
x1       24
x2       19
x3       50
x4       40
x5       23
x6       68
x7       45
x8       33
x9       80
x10      58

• Min-max normalisation: Use the min and max values for a given
feature to rescale it to the range [0,1]:

    z_i = (x_i - min(x)) / (max(x) - min(x))

• Example: Feature Age, with min(x) = 19, max(x) = 80,
so max(x) - min(x) = 61.

Age (non-normalised):  24    19    50    40    23    68    45    33    80    58
Age (normalised):      0.08  0.00  0.51  0.34  0.07  0.80  0.43  0.23  1.00  0.64
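A minimal Python sketch of min-max normalisation that reproduces the Age example (values rounded to two decimal places):

ages = [24, 19, 50, 40, 23, 68, 45, 33, 80, 58]

lo, hi = min(ages), max(ages)                      # 19 and 80, so the range is 61
normalised = [(a - lo) / (hi - lo) for a in ages]

print([round(z, 2) for z in normalised])
# [0.08, 0.0, 0.51, 0.34, 0.07, 0.8, 0.43, 0.23, 1.0, 0.64]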
Nearest Neighbour Classifier
Lazy learning approach: Do not build a model for the data. Identify
most similar previous example(s) from the training set for which a
label has already been assigned, using some distance function.
Nearest neighbour rule (1NN): For a new query input q, find a single
labelled example x closest to q, and assign q the same label as x.

[Feature space plot: query q1 is assigned to the 'No' class and query q2 to the
'Yes' class, each taking the label of its single nearest training example.]
k-Nearest Neighbour Classifier
k-Nearest neighbours (kNN): The NN approach naturally generalises
to the case where we use k nearest neighbours from the training
set to assign a label to a new query input.
Example: For new query inputs, calculate distance to all training
examples. Find k=3 nearest examples (i.e. with smallest distances).

[Feature space plot: the 3 nearest training examples to q1 and the 3 nearest
training examples to q2 are highlighted.]
k-Nearest Neighbour Classifier
Majority voting: The decision on a label for a new query example is
decided based on the “votes” of its k nearest neighbours. The label
for the query is the majority label of its neighbours.
Example: Measure distance from q to all training examples.
Find the k=3 nearest examples, and use their labels as votes.

[Feature space plot: the 3 nearest neighbours of q are x6, x15 and x16.]

Neighbour counts:
• Yes = 2 votes
• No = 1 vote
➡ Majority says Yes!
k-Nearest Neighbour Classifier
Majority voting: The decision on a label for a new query example is
decided based on the “votes” of its k nearest neighbours. The label
for the query is the majority label of its neighbours.
Example: Measure distance from q to all training examples.
Find the k=4 nearest examples, and use their labels as votes.

[Feature space plot: the 4 nearest neighbours of q are x6, x12, x15 and x16.]

In the case that…
• Yes = 2 votes
• No = 2 votes
Can break ties…
‣ At random
‣ Based on the sum of neighbour distances (see the sketch below)
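A small sketch of the second tie-breaking option, preferring the class whose neighbours have the smaller total distance (the function name and the illustrative distances are mine):

from collections import defaultdict

def vote_with_tiebreak(neighbours):
    """neighbours: list of (label, distance) pairs for the k nearest examples.
    Majority vote; ties are broken in favour of the class whose neighbours
    have the smaller summed distance."""
    counts = defaultdict(int)
    dist_sums = defaultdict(float)
    for label, dist in neighbours:
        counts[label] += 1
        dist_sums[label] += dist
    # Most votes first; for equal votes, the smaller total distance wins
    return max(counts, key=lambda lab: (counts[lab], -dist_sums[lab]))

# Illustrative k=4 case with a 2-2 tie: the 'Yes' neighbours are closer overall
print(vote_with_tiebreak([('Yes', 0.9), ('No', 1.1), ('Yes', 1.3), ('No', 1.4)]))  # Yes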
Example: kNN Classification (k=3)
• Training set of 20 athletes - 8 labelled as 'Yes', 12 as 'No'.
• Each athlete described by 2 continuous features: Speed, Agility
Euclidean distance would be an appropriate distance function.
Athlete  Speed  Agility  Selected        Athlete  Speed  Agility  Selected
x1       2.50   6.00     No              x11      2.00   2.00     No
x2       3.75   8.00     No              x12      5.00   2.50     No
x3       2.25   5.50     No              x13      8.25   8.50     Yes
x4       3.25   8.25     No              x14      5.75   8.75     Yes
x5       2.75   7.50     No              x15      4.75   6.25     Yes
x6       4.50   5.00     No              x16      5.50   6.75     Yes
x7       3.50   5.25     No              x17      5.25   9.50     Yes
x8       3.00   3.25     No              x18      7.00   4.25     Yes
x9       4.00   4.00     No              x19      7.50   8.00     Yes
x10      4.25   3.75     No              x20      7.25   3.75     Yes

Will a new input example q (Speed 5.00, Agility 7.50) be classified as 'Yes' or 'No'?
Example: kNN Classification (k=3)
• Measure distance between q and all 20 training examples.
Athlete Speed Agility Selected Distance Athlete Speed Agility Selected Distance
x1 2.50 6.00 No 2.915 x11 2.00 2.00 No 6.265
x2 3.75 8.00 No 1.346 x12 5.00 2.50 No 5.000
x3 2.25 5.50 No 3.400 x13 8.25 8.50 Yes 3.400
x4 3.25 8.25 No 1.904 x14 5.75 8.75 Yes 1.458
x5 2.75 7.50 No 2.250 x15 4.75 6.25 Yes 1.275
x6 4.50 5.00 No 2.550 x16 5.50 6.75 Yes 0.901
x7 3.50 5.25 No 2.704 x17 5.25 9.50 Yes 2.016
x8 3.00 3.25 No 4.697 x18 7.00 4.25 Yes 3.816
x9 4.00 4.00 No 3.640 x19 7.50 8.00 Yes 2.550
x10 4.25 3.75 No 3.824 x20 7.25 3.75 Yes 4.373

q 5.00 7.50 ???

• Rank the training examples and identify the set of 3 examples with the
smallest distances.

Athlete  Speed  Agility  Selected  Distance
x16      5.50   6.75     Yes       0.901
x15      4.75   6.25     Yes       1.275
x2       3.75   8.00     No        1.346

• Yes = 2 votes
• No = 1 vote
➡ Majority says Yes, so assign label Yes to q
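The same k=3 prediction can be reproduced with scikit-learn; a sketch, assuming the athlete table has been entered as arrays in the order shown above:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Athlete training data: [Speed, Agility] and the 'Selected' labels for x1..x20
X = np.array([[2.50, 6.00], [3.75, 8.00], [2.25, 5.50], [3.25, 8.25], [2.75, 7.50],
              [4.50, 5.00], [3.50, 5.25], [3.00, 3.25], [4.00, 4.00], [4.25, 3.75],
              [2.00, 2.00], [5.00, 2.50], [8.25, 8.50], [5.75, 8.75], [4.75, 6.25],
              [5.50, 6.75], [5.25, 9.50], [7.00, 4.25], [7.50, 8.00], [7.25, 3.75]])
y = np.array(['No'] * 12 + ['Yes'] * 8)

knn = KNeighborsClassifier(n_neighbors=3)   # Euclidean distance by default
knn.fit(X, y)

q = np.array([[5.00, 7.50]])
dist, idx = knn.kneighbors(q)               # the 3 smallest distances and their row indices
print(dist.round(3), idx + 1)               # ~[0.901 1.275 1.346] -> athletes x16, x15, x2
print(knn.predict(q))                       # ['Yes'] by majority vote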
Weighted kNN
• Weighted voting: In this approach, some training examples have a
higher weight than others.
• Instead of using a binary vote of 1 for each nearest neighbour,
typically closer neighbours get higher votes when deciding on the
predicted label for a query example.
• Inverse distance-weighted voting: Simplest strategy is to take a
neighbour’s vote to be the inverse of their distance from the query
(i.e. 1/Distance). We then sum over the weights for each class.
d(q, x16) = 0.901   →   weight(x16) = 1 / d(q, x16) = 1 / 0.901 = 1.109

d(q, x2)  = 1.346   →   weight(x2)  = 1 / d(q, x2)  = 1 / 1.346 = 0.743
Example: Weighted kNN (k=3)
• Measure distance between q and all 20 training examples.
Athlete Speed Agility Selected Distance Athlete Speed Agility Selected Distance
x1 2.50 6.00 No 2.915 x11 2.00 2.00 No 6.265
x2 3.75 8.00 No 1.346 x12 5.00 2.50 No 5.000
x3 2.25 5.50 No 3.400 x13 8.25 8.50 Yes 3.400
x4 3.25 8.25 No 1.904 x14 5.75 8.75 Yes 1.458
x5 2.75 7.50 No 2.250 x15 4.75 6.25 Yes 1.275
x6 4.50 5.00 No 2.550 x16 5.50 6.75 Yes 0.901
x7 3.50 5.25 No 2.704 x17 5.25 9.50 Yes 2.016
x8 3.00 3.25 No 4.697 x18 7.00 4.25 Yes 3.816
x9 4.00 4.00 No 3.640 x19 7.50 8.00 Yes 2.550
x10 4.25 3.75 No 3.824 x20 7.25 3.75 Yes 4.373

• Rank the training examples and identify the set of 3 examples with the
smallest distances. Assign weights based on 1/Distance, and sum the
weights for each class.

Athlete  Speed  Agility  Selected  Distance  Weight
x16      5.50   6.75     Yes       0.901     1.109
x15      4.75   6.25     Yes       1.275     0.784
x2       3.75   8.00     No        1.346     0.743

• Weights for Yes = 1.109 + 0.784 = 1.893
• Weights for No  = 0.743
➡ Weighted majority says Yes
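A short Python sketch of the inverse distance-weighted vote for these three neighbours (distances taken from the table above):

# (label, distance to q) for the k=3 nearest neighbours
neighbours = [('Yes', 0.901), ('Yes', 1.275), ('No', 1.346)]

votes = {}
for label, dist in neighbours:
    votes[label] = votes.get(label, 0.0) + 1.0 / dist   # weight = 1/Distance

print({label: round(w, 3) for label, w in votes.items()})
# {'Yes': 1.894, 'No': 0.743}  (the slide sums the rounded weights, giving 1.893)
print(max(votes, key=votes.get))  # Yes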
Parameter Tuning
• A simple 1-NN classifier is easy to implement. But it will be
susceptible to “noise” in the data. A misclassification will occur
every time a single noisy example is retrieved.
• We might decide to vary the neighbourhood size parameter k to
improve the predictive performance of kNN.
• Choosing between different settings of an algorithm is often
referred to as hyperparameter tuning or model selection.

• Using a larger k (e.g. k > 2) can sometimes make the classifier more
robust and overcome this problem.
• But when k is large (k→N) and the classes are unbalanced, we always
predict the majority class.
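One common way to tune k is cross-validation; a minimal sketch using scikit-learn's GridSearchCV on the athlete data (the grid of k values is just an illustration):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Athlete data as before: [Speed, Agility] and the 'Selected' labels
X = np.array([[2.50, 6.00], [3.75, 8.00], [2.25, 5.50], [3.25, 8.25], [2.75, 7.50],
              [4.50, 5.00], [3.50, 5.25], [3.00, 3.25], [4.00, 4.00], [4.25, 3.75],
              [2.00, 2.00], [5.00, 2.50], [8.25, 8.50], [5.75, 8.75], [4.75, 6.25],
              [5.50, 6.75], [5.25, 9.50], [7.00, 4.25], [7.50, 8.00], [7.25, 3.75]])
y = np.array(['No'] * 12 + ['Yes'] * 8)

# Try odd values of k (odd values avoid ties in a binary problem), 5-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [1, 3, 5, 7, 9]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)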
Problems with kNN
 Can be slow to find nearest neighbours in a high-dimensional space

 Need to store all the training data, so takes a lot of memory

 Need to specify the distance function

 Does not give probabilistic output

 Sensitive to class noise

 Sensitive to scales of attributes

 Distances are less meaningful in high dimensions

k-NN with Scikit Learn

• Examples in Notebook 02-kNN


• Loading a dataset
• Finding nearest neighbours
• Training a k-NN classifier
• Scaling features
• Weighting Instances

Load a dataset into Python
• Load a csv file into a Pandas dataframe in Python
athlete = pd.read_csv('AthleteSelection.csv',index_col = 'Athlete')
athlete.head()

y = athlete.pop('Selected').values
X = athlete.values

• X contains the features
• y contains the targets

         Speed  Agility  Selected
Athlete
x1        2.50     6.00         0
x2        3.75     8.00         0
x3        2.25     5.50         0
x4        3.25     8.25         0
x5        2.75     7.50         0
Train a k-NN classifier

# Set up classifier
forecast_kNN = KNeighborsClassifier(n_neighbors=3)

# Train it
forecast_kNN.fit(X, y)

# Set up query examples
xinput = np.array([[8., 70., 11.],
                   [8, 69, 15]])

# Make predictions
forecast_kNN.predict(xinput)

The query examples here have three features (Temperature, Humidity, Wind_Speed),
matching the weather 'Go-Out' dataset whose first rows are shown below.

   Temperature  Humidity  Wind_Speed  Go-Out
0            6        85          30       0
1           14        90          35       0
2           15        86           8       1
3           21        56          15       1
4           17        67           9       1
Test on the Training Data
• Use training data as test (not a good idea)
• k = 3 so one misclassification
y_dash = forecast_kNN.predict(X)
print(' y:',y)
print('y_dash:',y_dash)

y: [0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 0 0]
y_dash: [0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 1 0]

confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion))

Confusion matrix:
[[ 7 1]
[ 0 10]]

What would we expect to happen when k=1? (Try it.)

Normalizing (Scaling) Data
• Normalize data so that all features have the same influence.
• Two popular approaches:
  • Standardisation to N(0,1) (zero mean, unit variance)
  • Min-max scaling, typically to the range (0,1)

# Set up scaler (StandardScaler performs the N(0,1) standardisation)
scaler = preprocessing.StandardScaler().fit(X)

# Scale the data and the query q
X_scaled = scaler.transform(X)
q_scaled = scaler.transform([q])

# Retrain the classifier on the scaled data
forecast_kNN_S = KNeighborsClassifier(n_neighbors=3)  # new classifier instance for the scaled data
forecast_kNN_S.fit(X_scaled, y)

# Find the nearest neighbours of the scaled query
forecast_kNN_S.kneighbors(q_scaled)
Instance weighting
• Give nearer neighbours a bigger weight (vote), based on their distance from the query.
forecast_kNN_SW = KNeighborsClassifier(n_neighbors=3,weights='distance')
forecast_kNN_SW.fit(X_scaled,y)
y_dash = forecast_kNN_SW.predict(X_scaled)
confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion))
print('\n y:',y)
print('y_dash:',y_dash)

Confusion matrix:
[[ 8 0]
[ 0 10]]

y: [0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 0 0]
y_dash: [0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 0 0]

Advantages and disadvantages of the KNN algorithm

Advantages

 Easy to implement: Given the algorithm’s simplicity and accuracy, it is
one of the first classifiers that a new data scientist will learn.

 Adapts easily: As new training samples are added, the algorithm
adjusts to account for the new data, since all training data is stored
in memory.

 Few hyperparameters: KNN only requires a value of k and a distance
metric, far fewer hyperparameters than most other machine learning
algorithms require.
Advantages and disadvantages of the KNN algorithm

Disadvantages
 Does not scale well: Since KNN is a lazy algorithm, it takes up more memory
and data storage compared to other classifiers. This can be costly from both a
time and money perspective.

 Curse of dimensionality: The KNN algorithm tends to fall victim to the curse
of dimensionality, meaning that it does not perform well with high-dimensional
data inputs. This is sometimes also referred to as the peaking phenomenon,
where after the algorithm attains the optimal number of features, additional
features increase the number of classification errors, especially when the
sample size is small.

 Prone to overfitting: The value of k also impacts the model’s behaviour. Lower
values of k can overfit the data, whereas higher values of k tend to “smooth
out” the predictions, since the vote is averaged over a greater area, or
neighbourhood. However, if the value of k is too high, the model can underfit
the data.
