
K-NN (Nearest Neighbor)

1
Eager vs Lazy Learners
Eager learners build a general model from the training data before any test query arrives; lazy learners such as K-NN simply store the training data and defer all computation until classification time.

2
K-NN
K-Nearest Neighbor is a lazy learning algorithm that classifies data points based on their similarity to neighboring points.
Non-parametric method used for classification.
The prediction for a test point is based on its neighbors in the training data.
k is a positive integer; if k = 1, the point is assigned to the class of its single nearest neighbor.
Processing is deferred until a query arrives, and the outcome depends on the chosen value of k.
The result is generated only after analyzing the stored training data.

3
WHY NEAREST NEIGHBOR?
Used to classify objects based on the closest training examples in the feature space
- Feature space: the raw data transformed into sample vectors of fixed length using feature extraction (the training data)
One of the top 10 data mining algorithms
- ICDM paper, December 2007
Among the simplest of all data mining algorithms
- A classification method
An implementation of lazy learning
- All computation is deferred until classification
4
K-NEAREST NEIGHBOR
Requires 3 things:
- Feature space (training data)
- Distance metric: to compute the distance between records
- The value of k: the number of nearest neighbors to retrieve, from which the majority class is taken
To classify an unknown record:
- Compute its distance to all training records
- Identify the k nearest neighbors
- Use the class labels of the k nearest neighbors to determine the class label of the unknown record
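
A minimal sketch of this procedure in Python; the helper names and the toy points and labels below are invented for illustration, not taken from the slides:

    import math
    from collections import Counter

    def euclidean(x, y):
        # Straight-line distance between two equal-length vectors
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def knn_classify(query, training_data, k):
        # training_data is a list of (point, label) pairs
        # 1. Compute the distance from the query to every training record
        by_distance = sorted(training_data, key=lambda pair: euclidean(query, pair[0]))
        # 2. Identify the k nearest neighbors
        k_nearest = by_distance[:k]
        # 3. Majority vote over their class labels
        votes = Counter(label for _, label in k_nearest)
        return votes.most_common(1)[0][0]

    # Toy 2-D training data, invented for illustration
    training = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
                ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]
    print(knn_classify((1.2, 1.5), training, k=3))  # -> A

Note that sorting the whole training set costs O(n log n) per query; this is the computation that a lazy learner defers until classification time.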
5
Example

6
Similarity Measure (Euclidean distance)

Ex: Given X = (-2, 2) and Y = (2, 5)

dist(X, Y) = sqrt((-2 - 2)^2 + (2 - 5)^2) = sqrt(16 + 9) = sqrt(25) = 5
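
A quick check of this arithmetic in Python:

    import math

    X, Y = (-2, 2), (2, 5)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(X, Y)))
    print(dist)  # 5.0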

7
Similarity Measure (Euclidean distance)
In general, for n-dimensional points X and Y:
dist(X, Y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)

8
Similarity Measure (Manhattan Distance)

Ex: Given X = (1, 2) and Y = (2, 5)

dist(X, Y) = |1 - 2| + |2 - 5| = 1 + 3 = 4
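
And the corresponding check in Python:

    X, Y = (1, 2), (2, 5)
    dist = sum(abs(a - b) for a, b in zip(X, Y))
    print(dist)  # 4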

9
Similarity Calculation

10
Rank these Attributes

11
K=1

12
K=2

13
K=3
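
The slides for k = 1, 2, and 3 illustrate these cases with figures. As a sketch of how the chosen k can flip the prediction, here is an example using scikit-learn's KNeighborsClassifier on invented one-dimensional data (the library choice and the data are assumptions, not from the slides; k = 2 is skipped because a one-to-one vote would be a tie):

    from sklearn.neighbors import KNeighborsClassifier

    # Invented 1-D toy data: a single "A" point nearest the query at 0.0,
    # with "B" points just beyond it
    X = [[1.0], [2.0], [3.0], [4.0]]
    y = ["A", "B", "B", "B"]

    for k in (1, 3):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        print(f"k={k}: {clf.predict([[0.0]])[0]}")
    # k=1: A  (the single nearest neighbor wins)
    # k=3: B  (majority vote among A, B, B)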

14
Advantages
Can be applied to data from any distribution; for example, the data does not have to be separable by a linear boundary
Simple technique that is easily implemented
Good classification if the number of samples is large enough
Building the model is inexpensive
Extremely flexible classification scheme
- Does not involve preprocessing

15
Disadvantages
Dependent on the value of k
The test stage is computationally expensive
No training stage; all the work is done during the test stage
- This is the opposite of what we usually want: we can typically afford a slow training step, but we want a fast test step
Needs a large number of samples for good accuracy
Classifying unknown records is relatively expensive
- Requires computing the distances to the k nearest neighbors
- Computationally intensive, especially as the training set grows
Accuracy can be severely degraded by the presence of noisy or irrelevant features

16
Reference
https://www.youtube.com/watch?v=2YQHPfwVuF8

17
