K-NN (Nearest Neighbor)
Eager vs Lazy Learners
Eager learners build a classification model during training; lazy learners such as K-NN simply store the training data and defer all computation until a query arrives
K-NN
K-Nearest Neighbor is a lazy learning algorithm that
classifies data points based on their similarity to their neighbors
Non-parametric method used for classification
Prediction for a test record is made on the basis of its neighbors
k is a positive integer; if k = 1, the test record is assigned the class of its
single nearest neighbor
Processing is deferred until classification time and depends on the value of k
The result is generated after analysis of the stored training data
WHY NEAREST NEIGHBOR?
Used to classify objects based on the closest training examples in the
feature space
Feature space: raw data transformed into sample vectors of fixed length
using feature extraction (training data)
Listed among the top 10 data mining algorithms
ICDM paper – December 2007
Among the simplest of all data mining algorithms
Classification method
Implementation of a lazy learner:
all computation is deferred until classification
K-NEAREST NEIGHBOR
Requires 3 things:
A feature space (training data)
A distance metric,
to compute the distance between records
The value of k,
the number of nearest neighbors to retrieve,
from which to take the majority class
To classify an unknown record:
Compute its distance to all training records
Identify the k nearest neighbors
Use the class labels of those neighbors to
determine the class label of the unknown record
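The three steps above can be sketched in a few lines of Python; the function and variable names here are illustrative, not from the slides:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Distance between two equal-length feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, class_label) pairs."""
    # Step 1: compute the distance from the query to every training record
    ranked = sorted(train, key=lambda rec: euclidean(rec[0], query))
    # Step 2: identify the k nearest neighbors
    k_nearest = ranked[:k]
    # Step 3: majority vote over the neighbors' class labels
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

# Toy feature space: two classes in 2-D
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]
print(knn_classify(train, (1.2, 1.5), k=3))  # "A"
```

Note that sorting the whole training set is the simplest (lazy) approach; it also makes the cost of every single query proportional to the size of the training data, which is exactly the drawback discussed on the Disadvantages slide.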
Example
Similarity Measure (Euclidean distance)
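For two records x = (x1, …, xn) and y = (y1, …, yn) in the feature space, the Euclidean distance referred to here is:

```latex
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```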
Similarity Measure (Manhattan Distance)
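The Manhattan (city-block) distance sums the absolute coordinate differences instead of squaring them:

```latex
d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|
```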
Similarity Calculation
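As an illustrative similarity calculation with two made-up records, (1, 2) and (4, 6), the two metrics give:

```latex
d_{\text{Euclidean}} = \sqrt{(1-4)^2 + (2-6)^2} = \sqrt{9 + 16} = 5
\qquad
d_{\text{Manhattan}} = |1-4| + |2-6| = 3 + 4 = 7
```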
Rank these Attributes
K=1
K=2
K=3
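The K=1, K=2, K=3 slides illustrate that the prediction for the same query can change as k grows. A small self-contained sketch of that effect, using toy data invented here:

```python
import math
from collections import Counter

# Toy training set: the single nearest point is class "B",
# but class "A" dominates the surrounding neighborhood
train = [((0.0, 0.0), "B"), ((1.0, 0.0), "A"),
         ((0.0, 1.2), "A"), ((1.1, 1.1), "A")]
query = (0.1, 0.1)

def predict(k):
    # Rank training records by Euclidean distance to the query,
    # then take the majority class among the k nearest
    dist = lambda p: math.sqrt((p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2)
    nearest = sorted(train, key=lambda rec: dist(rec[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(predict(1))  # "B": the nearest neighbor wins outright
print(predict(3))  # "A": the majority of the 3 nearest neighbors
```

With k = 2 the vote here is a 1–1 tie, which is why odd values of k are usually preferred for two-class problems.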
Advantages
Can be applied to data from any distribution; for example, the data
does not have to be separable by a linear boundary
Simple technique that is easily implemented
Good classification accuracy if the number of samples is large enough
Building the model is inexpensive
Extremely flexible classification scheme that
does not involve a preprocessing step
Disadvantages
Results depend on the choice of k
The test stage is computationally expensive
There is no training stage; all the work is done during the test stage
This is the opposite of what we usually want: we can often afford a long training
step, but we want a fast test step
Needs a large number of samples for accuracy
Classifying unknown records is relatively expensive
Requires computing the distances to the k nearest neighbors
Computationally intensive, especially as the size of the training set grows
Accuracy can be severely degraded by the presence of noisy or irrelevant features