Week 4 Classification KNN
Week 4
Program Studi Teknik Informatika
Fakultas Teknik – Universitas Surabaya
k-Nearest Neighbors
• k-Nearest Neighbors (k-NN) is one of the simplest classification models; it classifies based on similarity to the nearest neighbors
• k-NN predicts the class of an unknown observation as the class with the largest proportion among its k nearest observations
• k-NN uses distance to measure
similarity between observations
k-Nearest Neighbors
• For k = 1:
• There is one observation from class A closest to the new example
• The new example is classified as class A
k-Nearest Neighbors
• For k = 5:
• There are 2 observations from class A and 3 observations from class B closest to the new example
• The new example is classified as class B
k-NN algorithm
• Choose a value for k
• Calculate the distance of the unknown observation to all training data
• Select the k observations in the training data that are nearest to the unknown observation (the k nearest neighbors)
• Predict the class of the unknown observation as the class with the largest proportion among the k nearest neighbors (a minimal sketch of these steps follows below)
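Below is a minimal from-scratch sketch of these four steps in Python (the function name knn_predict and the toy data are illustrative, not from the slides):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of one unknown observation with plain k-NN."""
    # Step 2: Euclidean distance from the unknown observation to all training data
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 4: majority class among the k nearest neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny illustrative example (made-up data)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # expected: "A"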
Distance
Some distances can be used to measure the similarity between two vectors x and y:
• Euclidean distance
• Manhattan distance
• Minkowski distance
Euclidean distance
• $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Example:
Manhattan distance
• $d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$
• Example:
Minkowski distance
• $d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
• Example:
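These three distances are available in scipy.spatial.distance; a small sketch with made-up vectors (the values here are not the worked example from the slides):

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(x, y))       # sqrt((1-4)^2 + (2-0)^2 + (3-3)^2) = sqrt(13)
print(distance.cityblock(x, y))       # |1-4| + |2-0| + |3-3| = 5 (Manhattan)
print(distance.minkowski(x, y, p=3))  # (|1-4|^3 + |2-0|^3 + |3-3|^3)^(1/3)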
How to determine the best k in k-NN
• The value of k is related to the error rate of the model
• There are no exact methods to determine the best k
• A small value of k can lead to overfitting, while a large value of k can lead to underfitting
• Overfitting: the model performs well on the training data but poorly on the testing data.
• Underfitting: the model performs poorly on both the training and testing data.
• A heuristic method can be used to determine the best k.
• Plot accuracy against k over a defined range, then choose the k value with the maximum accuracy (a sketch follows below).
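A minimal sketch of this heuristic on the Iris data, assuming a 70/30 train/test split and k from 1 to 30 (both assumptions, not values given on the slides):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

ks = range(1, 31)
accuracies = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))  # classification accuracy on the test set

best_k = ks[int(np.argmax(accuracies))]
print("best k:", best_k)

# Optional plot of accuracy against k:
# import matplotlib.pyplot as plt
# plt.plot(list(ks), accuracies); plt.xlabel("k"); plt.ylabel("accuracy"); plt.show()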
Classification accuracy
• Classification accuracy: the rate of correct classifications, i.e. the number of correct predictions divided by the total number of predictions
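In scikit-learn the same quantity is computed by sklearn.metrics.accuracy_score; a small sketch with made-up labels:

from sklearn.metrics import accuracy_score

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
print(accuracy_score(y_true, y_pred))  # 4 correct out of 5 -> 0.8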
Iris Dataset
• Number of Instances: 150 (50 in each of three classes)
• Number of features: 4 numeric
– sepal length in cm
– sepal width in cm
– petal length in cm
– petal width in cm
• Class:
– Iris-Setosa
– Iris-Versicolour
– Iris-Virginica
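The dataset ships with scikit-learn; a short sketch to load it and inspect the feature and class names:

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4)
print(iris.feature_names)  # sepal length/width, petal length/width (cm)
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']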
Iris Dataset
Feature        Min   Max   Mean   Std. Dev.   Class Correlation
sepal length   4.3   7.9   5.84   0.83         0.7826
sepal width    2.0   4.4   3.05   0.43        -0.4194
petal length   1.0   6.9   3.76   1.76         0.9490
petal width    0.1   2.5   1.20   0.76         0.9565
Training k-NN for Iris using all features with
sklearn.neighbors.KNeighborsClassifier
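A minimal sketch of this step, assuming a 70/30 train/test split and k = 5 (neither value is given on the slide):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train on all four features
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))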
Training k-NN for Iris using petal length and
petal width
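A minimal sketch using only the two petal features (columns 2 and 3 of the scikit-learn Iris data); the split and k = 5 are again assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, 2:4]  # petal length (cm), petal width (cm)
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))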
Training k-NN for Iris using petal length
Training k-NN for Iris using petal width
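For the single-feature variants the same pattern applies with one column selected; a sketch using petal length (column index 2), where switching to index 3 gives the petal-width variant (the split and k = 5 remain assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, [2]]  # petal length only; use [3] for petal width only
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))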