
K Nearest Neighbour Classifier

Eager Learners vs Lazy Learners


 Eager learners, when given a set of training tuples, construct a generalization model before receiving new (e.g., test) tuples to classify.
 Lazy learners simply store the data (or do only minor processing) and wait until they are given a test tuple.
 Because lazy learners store the training tuples, or “instances,” they are also referred to as instance-based learners, even though all learning is essentially based on instances.
 Lazy learners spend less time in training but more in predicting.

- k-Nearest Neighbor Classifier
- Case-Based Classifier


k-Nearest Neighbor Classifier
 History
• It was first described in the early 1950s.
• The method is labor intensive when given large training sets.
• It gained popularity when increased computing power became available.
• It is widely used in the areas of pattern recognition and statistical estimation.
What is k-NN?
 Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it.
 The training tuples are described by n attributes.
 When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space.
 When k = 3 or k = 5, it is assigned the most common class among its 3 (or 5) nearest neighbors.
Remarks!!
 Similarity-function based.
 Choose an odd value of k for a 2-class problem.
 k must not be a multiple of the number of classes.

Closeness
 The Euclidean distance between two points or tuples, say, X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

   $d(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$

 Min-max normalization can be used to transform a value v of a numeric attribute A to v' in the range [0, 1] by computing

   $v' = \dfrac{v - \min_A}{\max_A - \min_A}$
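As a quick illustration, here is a minimal Python sketch of both formulas; the function names and the sample values are mine, not from the slides:

```python
import math

def euclidean_distance(x1, x2):
    """Euclidean distance between two numeric tuples of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def min_max_normalize(v, min_a, max_a):
    """Map a value v of numeric attribute A into [0, 1] by min-max normalization."""
    return (v - min_a) / (max_a - min_a)

# Illustrative values: normalize 73,600 when the attribute ranges over [12,000, 98,000]
print(min_max_normalize(73600, 12000, 98000))          # ~0.716
print(euclidean_distance((0.72, 0.30), (0.55, 0.90)))  # distance in the normalized space
```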
What if attributes are categorical?
 How can distance be computed for an attribute such as colour?
- Simple method: compare the corresponding values of the attribute; the difference is 0 if they are identical and 1 otherwise.
- Other method: differential grading, which assigns larger differences to pairs of values that are less similar to each other.
What about missing values?
 If the value of a given attribute A is missing in tuple X1 and/or in tuple X2, we assume the maximum possible difference.
 For categorical attributes, we take the difference to be 1 if either one or both of the corresponding values of A are missing.
 If A is numeric and missing from both tuples X1 and X2, the difference is also taken to be 1.
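A small sketch of how these rules might be implemented for a single attribute, assuming numeric values have already been min-max normalized to [0, 1] and that `None` stands for a missing value (the function name and conventions are mine):

```python
def attribute_difference(a, b, numeric=True):
    """Per-attribute difference following the rules above.

    Numeric values are assumed to be min-max normalized to [0, 1];
    None marks a missing value.
    """
    if numeric:
        if a is None and b is None:
            return 1.0                      # missing in both tuples
        if a is None or b is None:
            known = a if a is not None else b
            return max(known, 1.0 - known)  # assume the maximum possible difference
        return abs(a - b)
    # categorical attribute
    if a is None or b is None:
        return 1.0                          # one or both values missing
    return 0.0 if a == b else 1.0           # simple method: match -> 0, mismatch -> 1
```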
How to determine a good value for k?
 Starting with k = 1, we use a test set to estimate the error rate of the classifier. This process is repeated, incrementing k each time.
 The k value that gives the minimum error rate may be selected.
KNN Algorithm and Example
Distance Measures
Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

Squared Euclidean distance: $d(x, y) = \sum_i (x_i - y_i)^2$

Manhattan distance: $d(x, y) = \sum_i |x_i - y_i|$

Which distance measure to use? We use the Euclidean distance, as it treats each feature as equally important.
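Written out as code, the three measures look as follows (a sketch in plain Python, with my own function names):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def squared_euclidean(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))
```

Note that the squared Euclidean distance produces the same neighbour ranking as the Euclidean distance while avoiding the square root, which is why the worked example below uses it.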
How to choose k?
 If an infinite number of samples were available, the larger k is, the better the classification.
 k = 1 is often used for efficiency, but it is sensitive to “noise”.
 Larger k gives smoother boundaries and better generalization, but only if locality is preserved. Locality is not preserved if we end up looking at samples that are too far away and not from the same class.
 A useful rule of thumb for large sample sizes: k = sqrt(n)/2, where n is the number of examples.
 k can also be chosen through cross-validation.
KNN Classifier Algorithm
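The slides do not reproduce the algorithm in code; the following is a minimal sketch of the procedure described so far (brute-force neighbour search plus majority vote), with assumed function and parameter names:

```python
from collections import Counter
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(training_data, query, k=3, distance=euclidean):
    """Classify `query` by a majority vote among its k nearest training tuples.

    training_data: list of (feature_tuple, class_label) pairs.
    """
    nearest = sorted(training_data, key=lambda ex: distance(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Wrapping `knn_predict` in a cross-validation loop over candidate values of k is one way to choose k, as suggested in the previous section.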
Example
 We have data from a questionnaire survey and objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Y = Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good

Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Guess the classification of this new tissue.
 Step 1: Initialize and define k. Let us say k = 3.
(It is common to choose an odd k when the number of classes is even, to avoid ties in the majority vote.)
 Step 2: Compute the distance between the input sample and every training sample.
- The coordinates of the input sample are (3, 7).
- Instead of the Euclidean distance, we calculate the squared Euclidean distance.

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Squared Euclidean distance
7 | 7 | (7−3)² + (7−7)² = 16
7 | 4 | (7−3)² + (4−7)² = 25
3 | 4 | (3−3)² + (4−7)² = 9
1 | 4 | (1−3)² + (4−7)² = 13
 Step 3: Sort the distances and determine the nearest neighbours based on the k-th minimum distance.

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Squared Euclidean distance | Rank (by minimum distance) | Included in 3-Nearest Neighbours?
7 | 7 | 16 | 3 | Yes
7 | 4 | 25 | 4 | No
3 | 4 | 9 | 1 | Yes
1 | 4 | 13 | 2 | Yes
 Step 4: Take the 3 nearest neighbours and gather their category Y.

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Squared Euclidean distance | Rank (by minimum distance) | Included in 3-Nearest Neighbours? | Y = Category of the nearest neighbour
7 | 7 | 16 | 3 | Yes | Bad
7 | 4 | 25 | 4 | No | -
3 | 4 | 9 | 1 | Yes | Good
1 | 4 | 13 | 2 | Yes | Good
 Step 5: Apply a simple majority vote.
 Use the simple majority of the categories of the nearest neighbours as the prediction for the query instance.
 We have two “Good” and one “Bad”. Thus we conclude that the new paper tissue that passed the laboratory test with X1 = 3 and X2 = 7 falls in the “Good” category.
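For completeness, a short Python sketch reproducing the whole worked example (variable names are mine):

```python
from collections import Counter

# Training samples: (acid durability, strength) -> class
train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)   # the new paper tissue
k = 3

# Step 2: squared Euclidean distance to every training sample
def sq_dist(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

distances = [(sq_dist(x, query), label) for x, label in train]
# -> [(16, 'Bad'), (25, 'Bad'), (9, 'Good'), (13, 'Good')]

# Steps 3-4: sort by distance and keep the k nearest neighbours
neighbours = sorted(distances)[:k]   # [(9, 'Good'), (13, 'Good'), (16, 'Bad')]

# Step 5: simple majority vote
prediction = Counter(label for _, label in neighbours).most_common(1)[0][0]
print(prediction)                    # Good
```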
Iris Dataset Example using Weka
 The Iris dataset contains 150 sample instances belonging to 3 classes, with 50 samples in each class.
 Statistical observations:
 Let us denote the true value of interest as θ (expected) and the value estimated by the algorithm as θ̂ (observed).
 Kappa statistic: the kappa statistic measures the agreement of the predictions with the true classes; 1.0 signifies complete agreement. It measures the significance of the classification with respect to the observed and expected values.
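The kappa statistic is not spelled out on the slide; a common way to compute it (Cohen's kappa) from a confusion matrix is sketched below, using an illustrative 3-class matrix that is not taken from the slides:

```python
def kappa_statistic(confusion):
    """Cohen's kappa from a square confusion matrix (rows = true class, cols = predicted)."""
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    expected = sum(
        (sum(confusion[i]) / n) * (sum(row[i] for row in confusion) / n)
        for i in range(len(confusion))
    )
    return (observed - expected) / (1 - expected)

# Illustrative 3-class confusion matrix (150 instances, 50 per true class)
confusion = [[50, 0, 0],
             [0, 47, 3],
             [0, 4, 46]]
print(round(kappa_statistic(confusion), 2))   # 0.93
```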
 Mean absolute error: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |\hat{\theta}_i - \theta_i|$

 Root mean squared error: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{\theta}_i - \theta_i)^2}$

 Relative absolute error: $\mathrm{RAE} = \frac{\sum_i |\hat{\theta}_i - \theta_i|}{\sum_i |\bar{\theta} - \theta_i|}$

 Root relative squared error: $\mathrm{RRSE} = \sqrt{\frac{\sum_i (\hat{\theta}_i - \theta_i)^2}{\sum_i (\bar{\theta} - \theta_i)^2}}$

where $\bar{\theta}$ is the mean of the true values.
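These four measures are easy to read as code; here is a sketch computing all of them for paired lists of actual and predicted values (the function name is mine):

```python
import math

def error_metrics(actual, predicted):
    """Return (MAE, RMSE, RAE, RRSE) for paired numeric actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    mae = sum(abs_err) / n
    rmse = math.sqrt(sum(sq_err) / n)
    rae = sum(abs_err) / sum(abs(a - mean_actual) for a in actual)
    rrse = math.sqrt(sum(sq_err) / sum((a - mean_actual) ** 2 for a in actual))
    return mae, rmse, rae, rrse
```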


Complexity
 The basic kNN algorithm stores all examples.
 Suppose we have n examples, each of dimension d:
 O(d) to compute the distance to one example
 O(nd) to compute the distances to all examples
 Plus O(nk) time to find the k closest examples
 Total time: O(nk + nd)
 This is very expensive for a large number of samples, but we need a large number of samples for kNN to work well!
 Advantages of the kNN classifier:
 It can be applied to data from any distribution; for example, the data does not have to be separable by a linear boundary.
 Very simple and intuitive.
 Good classification if the number of samples is large enough.

 Disadvantages of the kNN classifier:
 Choosing k may be tricky.
 The test stage is computationally expensive.
 There is no training stage; all the work is done during the test stage.
 This is the opposite of what we want: usually we can afford a long training step, but we want classifying a new sample to be fast.
Applications of the kNN Classifier
 Used in classification
 Used to impute missing values
 Used in pattern recognition
 Used in gene expression analysis
 Used in protein-protein interaction prediction
 Used to predict the 3D structure of proteins
 Used to measure document similarity
Comparison of various classifiers
C4.5 Algorithm
- Features: models built can be easily interpreted; easy to implement; can use both discrete and continuous values; deals with noise.
- Limitations: a small variation in the data can lead to different decision trees; does not work very well on small training datasets; prone to over-fitting.

ID3 Algorithm
- Features: produces more accurate results than C4.5; detection rate is increased and space consumption is reduced.
- Limitations: requires a large searching time; sometimes generates very long rules which are difficult to prune; requires a large amount of memory to store the tree.

K-Nearest Neighbour Algorithm
- Features: classes need not be linearly separable; zero cost of the learning process; sometimes robust with regard to noisy training data; well suited for multimodal classes.
- Limitations: the time to find the nearest neighbours in a large training dataset can be excessive; sensitive to noisy or irrelevant attributes; performance depends on the number of dimensions used.

Naïve Bayes Algorithm
- Features: simple to implement; great computational efficiency and classification rate; predicts accurate results for most classification and prediction problems.
- Limitations: precision decreases if the amount of data is small; obtaining good results requires a very large number of records.

Support Vector Machine Algorithm
- Features: high accuracy; works well even if the data is not linearly separable in the base feature space.
- Limitations: speed and size requirements are high in both training and testing; high complexity and extensive memory requirements for classification in many cases.

Artificial Neural Networks Algorithm
- Features: easy to use, with few parameters to adjust; a neural network learns, so reprogramming is not needed; easy to implement; applicable to a wide range of real-life problems.
- Limitations: requires high processing time if the neural network is large; difficult to know how many neurons and layers are necessary; learning can be slow.
Conclusion
 kNN is what we call lazy learning (as opposed to eager learning).
 It is conceptually simple, and easy to understand and explain.
 Very flexible decision boundaries.
 Not much learning at all!
 It can be hard to find a good distance measure.
 Irrelevant features and noise can be very detrimental.
 Typically it cannot handle more than a few dozen attributes.
 Computational cost: classification requires a lot of computation.
References
 “Data Mining: Concepts and Techniques”, J. Han, J. Pei, 2001.
 “A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining”, Sakshi, S. Khare, International Journal on Recent and Innovation Trends in Computing and Communication, Volume 3, Issue 8, ISSN: 2321-8169.
 “Comparison of various classification algorithms on iris datasets using WEKA”, Kanu Patel et al., IJAERD, Volume 1, Issue 1, February 2014, ISSN: 2348-4470.
