Lecture Note #3_PEC-CS701E
Simple Analogy
• "Tell me about your friends (who your neighbors are) and I will tell you who you are."
KNN – Different names
• K-Nearest Neighbors
• Memory-Based Reasoning
• Example-Based Reasoning
• Instance-Based Learning
• Lazy Learning
Instance-based Learning
What is instance-based learning?
In machine learning, instance-based learning (sometimes called memory-based learning) is a family of learning algorithms that, instead of performing explicit generalization, compare new problem instances with instances seen in training, which have been stored in memory.
Advantage:
• Ability to adapt the model to previously unseen data.
Examples are:
• KNN
• RBF networks
Why Is KNN Called a "Lazy Learner" or a "Lazy Algorithm"?
KNN is called a lazy learner because it does not train itself when we supply training data. It learns no discriminative function from the training data; instead, it memorizes the entire training dataset and defers all computation until a new example has to be classified.
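This "lazy" behaviour can be sketched in a few lines. The class name `LazyKNN` and its methods are illustrative, not from any library:

```python
import numpy as np

class LazyKNN:
    """Minimal sketch of a lazy learner: 'training' only stores the data."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # No model is built -- the entire training set is memorized.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, x):
        # All the work happens at query time: compute distances to every
        # stored example and take a majority vote among the k nearest.
        d = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        nearest = self.y[np.argsort(d)[:self.k]]
        values, counts = np.unique(nearest, return_counts=True)
        return values[np.argmax(counts)]
```

Note that `fit` does essentially nothing, while `predict` scans the whole training set; this is the opposite of an eager learner, which pays the cost up front.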
KNN: Classification Approach
Distance Measure
[Figure: to classify a test record, compute its distance to every training record, then choose the k "nearest" records.]
Different Distance Measures
Distance Measures for Continuous Variables
• Euclidean distance
• Manhattan distance
• Minkowski distance
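The three measures above can be written as small functions (a sketch; the function names are my own, and Euclidean and Manhattan are just the Minkowski distance with p = 2 and p = 1):

```python
import numpy as np

def euclidean(a, b):
    # sqrt of the sum of squared differences -- Minkowski with p = 2
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def manhattan(a, b):
    # sum of absolute differences -- Minkowski with p = 1
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

def minkowski(a, b, p):
    # general form: (sum_i |a_i - b_i|^p)^(1/p)
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p)
```

For example, between (0, 0) and (3, 4) the Euclidean distance is 5 and the Manhattan distance is 7.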
Distance Between Neighbors
• Calculate the distance between the new example E and every example x in the training set, e.g. the Euclidean distance over n attributes:
D(E, x) = √( Σᵢ₌₁ⁿ (Eᵢ − xᵢ)² )
K-Nearest Neighbor Algorithm
• All the instances correspond to points in an n-dimensional
feature space.
How to choose K?
• If k is too small, the classification is sensitive to noise points.
• If k is too large, the neighborhood may include points from other classes.
• In practice, k is often chosen as an odd number (to avoid ties in two-class problems) or selected by cross-validation.
KNN Feature Weighting
Feature Normalization
• Distance between neighbors can be dominated by attributes with relatively large values, e.g., the income of customers in our previous example, so attributes should be rescaled to comparable ranges.
Nominal/Categorical Data
• Distance works naturally with numerical attributes; nominal attributes need a convention, such as distance 0 when two values match and 1 when they differ.
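One common convention (an assumption here, since the slides do not fix one) is the overlap, or Hamming, distance: count 0 when two nominal values match and 1 when they differ, summed over all attributes:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length attribute tuples
    differ -- a simple distance for nominal/categorical data."""
    assert len(a) == len(b), "records must have the same attributes"
    return sum(x != y for x, y in zip(a, b))
```

For example, ("red", "small", "round") and ("red", "large", "round") differ in one attribute, so their distance is 1.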
KNN Classification
[Scatter plot: Loan ($0–$250,000) on the y-axis vs. Age (0–70) on the x-axis; points labeled Default and Non-Default.]
KNN Classification – Distance

Age   Loan       Default   Distance
25    $40,000    N         102000
35    $60,000    N          82000
45    $80,000    N          62000
20    $20,000    N         122000
35    $120,000   N          22000
52    $18,000    N         124000
23    $95,000    Y          47000
40    $62,000    Y          80000
60    $100,000   Y          42000
48    $220,000   Y          78000
33    $150,000   Y           8000
48    $142,000   ?

D = √((x₁ − x₂)² + (y₁ − y₂)²)
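The Distance column above can be reproduced by applying the Euclidean formula to the (Age, Loan) pairs against the query record (Age 48, Loan $142,000). A minimal sketch, which also shows that with k = 1 the nearest neighbor (Age 33, Loan $150,000, distance ≈ 8000) predicts Default = Y:

```python
import math

# Query record from the table: Age = 48, Loan = $142,000
query = (48, 142000)

# (Age, Loan, Default) rows from the table
training = [
    (25, 40000, "N"), (35, 60000, "N"), (45, 80000, "N"),
    (20, 20000, "N"), (35, 120000, "N"), (52, 18000, "N"),
    (23, 95000, "Y"), (40, 62000, "Y"), (60, 100000, "Y"),
    (48, 220000, "Y"), (33, 150000, "Y"),
]

def euclidean(p, q):
    # D = sqrt((x1 - x2)^2 + (y1 - y2)^2)
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Rank training records by distance to the query; with k = 1 the single
# nearest neighbor's label is the prediction.
ranked = sorted(training, key=lambda r: euclidean(r[:2], query))
prediction = ranked[0][2]
```

Because Loan values are tens of thousands while Age values are below 100, the Loan attribute completely dominates these distances, which motivates the standardization on the next slide.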
KNN Classification – Standardized Distance

Age     Loan   Default   Distance
0.125   0.11   N         0.7652
0.375   0.21   N         0.5200
0.625   0.31   N         0.3160
0       0.01   N         0.9245
0.375   0.50   N         0.3428
0.8     0.00   N         0.6220
0.075   0.38   Y         0.6669
0.5     0.22   Y         0.4437
1       0.41   Y         0.3650
0.7     1.00   Y         0.3861
0.325   0.65   Y         0.3771
0.7     0.61   ?

Xs = (X − Min) / (Max − Min)
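The min-max formula above can be applied per attribute to rebuild the standardized columns; a minimal sketch (the function name is my own):

```python
def min_max_scale(values):
    """Min-max standardization Xs = (X - Min) / (Max - Min):
    maps each attribute to [0, 1] so that large-valued attributes
    such as Loan no longer dominate the distance."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

For the Age column (min 20, max 60), Age 25 maps to (25 − 20)/(60 − 20) = 0.125, matching the first row of the standardized table.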
Strengths of KNN
• Very simple and intuitive.
• Can be applied to data from any distribution.
• Gives good classification if the number of samples is large enough.
Weaknesses of KNN
• Classifying a new example takes time: the distance from the new example to every stored example must be computed and compared.
• Choosing k may be tricky.
• Needs a large number of samples for accuracy.