08 Classification Using KNN
http://www.saedsayad.com/classification.htm
Classification
• Given a collection of records (the training set)
– Each record contains a set of attributes; one of the attributes is the class
• Find a model for the class attribute as a function of the values of the other attributes
• Goal: previously unseen records should be assigned a class as accurately as possible
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it, as sketched below
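A minimal sketch of this train/validate workflow, assuming scikit-learn is available (the toy data, split fraction, and choice of classifier here are illustrative):

    # Split labeled records into training and test sets, build a model
    # on the training set, and estimate its accuracy on the test set.
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X = [[1, 2], [2, 1], [3, 3], [2, 3], [1, 1],
         [6, 5], [7, 7], [8, 6], [7, 5], [8, 8]]   # attribute values
    y = ["A"] * 5 + ["B"] * 5                      # class attribute

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))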
Illustrating the Classification Task
Classification Techniques
• Decision Tree-based Methods
• Rule-based Methods
• Memory-based Reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Network
• Support Vector Machines
• Nearest Neighbor
K Nearest Neighbors, a.k.a. KNN
• K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function)
• KNN has been used in statistical estimation and pattern recognition since the beginning of the 1970s as a non-parametric technique
Algorithm
• A case is classified by a majority vote of its neighbors: the case is assigned to the class most common among its K nearest neighbors, as measured by a distance function
• If K = 1, then the case is simply assigned to the class of its single nearest neighbor
• In short: classify an unknown example with the most common class among its k closest examples, as in the sketch below
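A minimal sketch of this majority-vote rule in plain Python (the function knn_classify and the toy data are illustrative, not from the slides):

    # Classify a query point by a majority vote among its k nearest
    # stored cases, using Euclidean distance as the similarity measure.
    import math
    from collections import Counter

    def knn_classify(query, examples, k):
        # examples is a list of (point, label) pairs
        by_distance = sorted(examples, key=lambda e: math.dist(query, e[0]))
        votes = Counter(label for _, label in by_distance[:k])
        return votes.most_common(1)[0][0]

    examples = [((1, 1), "A"), ((2, 1), "A"), ((5, 5), "B")]
    print(knn_classify((1.5, 1.2), examples, k=1))  # nearest neighbor -> "A"
    print(knn_classify((1.5, 1.2), examples, k=3))  # majority of 3  -> "A"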
KNN: Multiple Classes
• Easy to implement for multiple classes
• Example for k = 5 (see the sketch below)
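The same vote needs no changes for multiple classes: whichever label is most common among the k nearest neighbors wins. Reusing knn_classify from the sketch above (the points and labels are illustrative):

    examples = [((1, 0), "red"), ((2, 0), "red"), ((3, 0), "red"),
                ((0, 5), "green"), ((5, 5), "blue")]
    # 3 of the 5 nearest neighbors are "red", so "red" wins the vote
    print(knn_classify((2, 1), examples, k=5))  # -> "red"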
How to Choose K?
• In theory, if an infinite number of samples were available, the larger k is, the better the classification
• The caveat is that all k neighbors have to be close to the query point
– Possible when an infinite number of samples is available
– Impossible in practice, since the number of samples is finite
• A rule of thumb is k < sqrt(n), where n is the number of training examples (sketched below)
– this choice has interesting theoretical properties
• In practice, k = 1 is often used for efficiency, but it can be sensitive to noise
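A small sketch of this rule of thumb; forcing k odd is a common extra convention to avoid tied votes in two-class problems, not part of the rule itself:

    import math

    def rule_of_thumb_k(n):
        k = math.isqrt(n)        # integer square root of n
        if k * k == n:           # ensure k < sqrt(n), not k == sqrt(n)
            k -= 1
        if k % 2 == 0:           # prefer odd k so a 2-class vote cannot tie
            k -= 1
        return max(1, k)

    print(rule_of_thumb_k(10))   # sqrt(10) ~ 3.16 -> k = 3
    print(rule_of_thumb_k(100))  # sqrt(100) = 10  -> k = 9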
• A larger k may improve performance, but too large a k destroys locality, i.e., you end up looking at samples that are not true neighbors
• Cross-validation (covered later) may be used to choose k, as sketched below
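A sketch of choosing k by cross-validation, assuming scikit-learn (the data here is the snack-quality example below; the candidate values of k and the 5-fold setup are illustrative):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([[3, 2], [4, 1], [4, 3], [5, 1], [5, 4],
                  [6, 5], [7, 6], [8, 4], [7, 2], [9, 1]])
    y = np.array(["GOOD"] * 5 + ["BAD"] * 5)

    scores = {}
    for k in (1, 3, 5):
        model = KNeighborsClassifier(n_neighbors=k)
        # mean accuracy over 5 held-out folds
        scores[k] = cross_val_score(model, X, y, cv=5).mean()

    best_k = max(scores, key=scores.get)
    print(scores, "-> best k:", best_k)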
Example
• A snack company wants to classify the quality of its products into 2 groups, GOOD and BAD. There are two variables used to assess quality: the increase of the degree of acidity (%) and the volume shrinkage (%). The 10 labeled samples shown in the table below serve as the training data.
• The company wants to know whether a product with an increase in acidity of 6% and a volume shrinkage of 3% falls into the GOOD or the BAD category
No | Increase of degree of acidity, V1 (%) | Volume shrinkage, V2 (%) | Category
 1 | 3 | 2 | GOOD
 2 | 4 | 1 | GOOD
 3 | 4 | 3 | GOOD
 4 | 5 | 1 | GOOD
 5 | 5 | 4 | GOOD
 6 | 6 | 5 | BAD
 7 | 7 | 6 | BAD
 8 | 8 | 4 | BAD
 9 | 7 | 2 | BAD
10 | 9 | 1 | BAD
• Choose k = 5
• Compute the distance between the query point r = (6, 3) and every training sample using the Euclidean distance, d(a, b) = sqrt((a1 - b1)^2 + (a2 - b2)^2)
Result

No | V1 | V2 | Category | Distance to r = (6, 3)
 1 | 3 | 2 | GOOD | sqrt(10) ≈ 3.16
 2 | 4 | 1 | GOOD | sqrt(8) ≈ 2.83
 3 | 4 | 3 | GOOD | sqrt(4) = 2.00
 4 | 5 | 1 | GOOD | sqrt(5) ≈ 2.24
 5 | 5 | 4 | GOOD | sqrt(2) ≈ 1.41
 6 | 6 | 5 | BAD | sqrt(4) = 2.00
 7 | 7 | 6 | BAD | sqrt(10) ≈ 3.16
 8 | 8 | 4 | BAD | sqrt(5) ≈ 2.24
 9 | 7 | 2 | BAD | sqrt(2) ≈ 1.41
10 | 9 | 1 | BAD | sqrt(13) ≈ 3.61
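A short sketch that reproduces this computation end to end. Note the tie at the fifth neighbor: samples 4 (GOOD) and 8 (BAD) are both at distance sqrt(5), and Python's stable sort breaks the tie in favor of the earlier sample, giving a 3-to-2 vote for GOOD:

    import math
    from collections import Counter

    samples = [((3, 2), "GOOD"), ((4, 1), "GOOD"), ((4, 3), "GOOD"),
               ((5, 1), "GOOD"), ((5, 4), "GOOD"), ((6, 5), "BAD"),
               ((7, 6), "BAD"), ((8, 4), "BAD"), ((7, 2), "BAD"),
               ((9, 1), "BAD")]
    query = (6, 3)

    # rank all samples by Euclidean distance to the query point
    ranked = sorted(samples, key=lambda s: math.dist(query, s[0]))
    votes = Counter(label for _, label in ranked[:5])
    print(votes)                       # Counter({'GOOD': 3, 'BAD': 2})
    print(votes.most_common(1)[0][0])  # -> GOOD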