
K-Nearest Neighbor Classification: Algorithm and Characteristics
o K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique.
o The K-NN algorithm measures the similarity between the new case/data and the available cases and puts the new case into the category it is most similar to.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily assigned to a well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumptions about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs the computation at classification time.
o At the training phase, the KNN algorithm simply stores the dataset; when it receives new data, it classifies that data into the category most similar to it.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. We can use the KNN algorithm for this identification, since it works on a similarity measure. The KNN model will compare the features of the new image with those of cat and dog images and, based on the most similar features, place it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, Category A and Category B, and we have a new data point x1: in which of these categories does this point belong? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.

How does K-NN work?


The working of K-NN can be explained with the following algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to every point in the training data.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point that we need to assign to one of the categories.

o Firstly, we will choose the number of neighbors; here we choose K = 5.
o Next, we will calculate the Euclidean distance between the new point and each data point. The Euclidean distance is the straight-line distance between two points, familiar from geometry: for points (x1, y1) and (x2, y2) it is √((x2 − x1)² + (y2 − y1)²).
o By calculating the Euclidean distance we obtain the nearest neighbors: three nearest neighbors in Category A and two in Category B.
o Since the majority of the 5 nearest neighbors (3 out of 5) belong to Category A, the new data point is assigned to Category A.
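The steps above map directly onto a few lines of code. Below is a minimal from-scratch sketch in Python/NumPy; the toy dataset, variable names, and the choice of K are illustrative assumptions, not taken from this text.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step-2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step-3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step-4 and Step-5: count labels among those neighbors and take the majority
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative toy data: two categories, A and B
X_train = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 1.0],   # Category A region
                    [6.0, 6.5], [7.0, 6.0], [6.5, 7.0]])  # Category B region
y_train = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=5))  # -> "A" (3 of 5 neighbors are A)
```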

K-Nearest Neighbors (KNN) Classification Algorithm:

1. Input:
   - Training dataset with labeled examples.
   - New data point (unlabeled) that needs to be classified.
2. Choose K:
   - Decide the number of neighbors, K.
3. Calculate Distance:
   - Use a distance metric (commonly Euclidean distance, Manhattan distance, Minkowski distance, etc.) to measure the distance between the new data point and each training data point (a brief comparison sketch follows this list).
4. Find K Nearest Neighbors:
   - Identify the K training data points that are closest to the new data point based on the chosen distance metric.
5. Majority Vote (for classification):
   - For classification tasks, assign the class label that is most common among the K nearest neighbors.
6. Output:
   - The predicted class label for the new data point.
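As flagged in step 3, different distance metrics can be plugged into the same algorithm. The short sketch below shows how Euclidean, Manhattan, and an order-3 Minkowski distance could be computed with NumPy; the two points are made-up values.

```python
import numpy as np

p = np.array([3.0, 4.0])   # new data point (illustrative values)
q = np.array([1.0, 2.0])   # one training data point

euclidean = np.sqrt(((p - q) ** 2).sum())           # straight-line distance
manhattan = np.abs(p - q).sum()                     # sum of absolute differences
minkowski = (np.abs(p - q) ** 3).sum() ** (1 / 3)   # order-3 Minkowski; order 2 reduces to Euclidean

print(euclidean, manhattan, minkowski)
```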

Characteristics and Considerations:


1. Non-parametric:
   - KNN is a non-parametric algorithm, meaning it does not make assumptions about the underlying data distribution.
2. Lazy Learning:
   - KNN is a lazy learner because it postpones the learning phase until a prediction is needed; the model simply memorizes the training dataset.
3. Choice of K:
   - The value of K is crucial. A small K may be sensitive to noise, while a large K may smooth out the decision boundaries.
4. Distance Metric:
   - The choice of distance metric depends on the nature of the data. Euclidean distance is commonly used, but other metrics may be more suitable for certain types of data.
5. Feature Scaling:
   - KNN is sensitive to the scale of features, so it is often good practice to scale features before applying KNN (see the sketch after this list).
6. Computational Cost:
   - Because KNN requires computing distances for each prediction, it can be computationally expensive for large datasets.
7. Effect of Outliers:
   - KNN is sensitive to outliers, as they can significantly affect the distance calculation.
8. Decision Boundaries:
   - KNN tends to produce complex decision boundaries, especially in high-dimensional spaces.
9. Curse of Dimensionality:
   - In high-dimensional spaces, distances between points may lose their meaning, leading to performance degradation (the curse of dimensionality). Feature selection or dimensionality reduction techniques may be applied in such cases.
10. Use Cases:
   - KNN is suitable for relatively small datasets and can be a good choice for problems with clear decision boundaries when the data is not high-dimensional.
11. Implementation:
   - Commonly implemented using libraries such as scikit-learn in Python (see the sketch after this list).
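Tying together items 5 and 11, here is a minimal sketch of scaling features before KNN using a scikit-learn pipeline; the dataset, feature values, and n_neighbors setting are illustrative assumptions, not taken from this text.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data: feature 1 is on a much larger scale than feature 2,
# so scaling matters for a distance-based method like KNN.
X = [[1200, 1.5], [1500, 1.2], [900, 3.8], [1000, 4.1]]
y = ["A", "A", "B", "B"]

# Standardize features, then fit a 3-nearest-neighbor classifier
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)

print(model.predict([[1100, 2.0]]))  # predicted class for a new point
```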

KNN is a simple yet effective algorithm, often used as a baseline model or for quick
prototyping. It's particularly useful when the decision boundaries are not well-defined
or when the data distribution is not known in advance. However, its performance
may suffer in high-dimensional or large-scale datasets.

Sample Data:
Consider the following dataset:

Feature 1   Feature 2   Class
    3           5         A
    1           2         B
    4           2         A
    4           5         B
    2           1         B
    3           3         A

Implementation:
Let's say we want to predict the class of a new data point with features (3, 4). We'll
use Euclidean distance as the distance metric and set K to 3.

Step 1: Calculate Distance

Calculate the Euclidean distance between the new point (3, 4) and each point in the
training set:

- Distance to (3, 5): √((3−3)² + (4−5)²) = √1 = 1
- Distance to (1, 2): √((3−1)² + (4−2)²) = √8 ≈ 2.83
- Distance to (4, 2): √((3−4)² + (4−2)²) = √5 ≈ 2.24
- Distance to (4, 5): √((3−4)² + (4−5)²) = √2 ≈ 1.41
- Distance to (2, 1): √((3−2)² + (4−1)²) = √10 ≈ 3.16
- Distance to (3, 3): √((3−3)² + (4−3)²) = √1 = 1

Step 2: Find Neighbors

Select the K data points with the smallest distances:

- Nearest neighbors for K = 3: (3, 3), (3, 5), (4, 5)

Step 3: Majority Vote

Determine the majority class among the nearest neighbors:

- Class of (3, 3): A
- Class of (3, 5): A
- Class of (4, 5): B

Since two out of three nearest neighbors belong to class A, we predict that the new
data point (3, 4) belongs to class A.

This is a basic example to illustrate the steps of the KNN algorithm. In practice, you
would typically use libraries like scikit-learn in Python to implement KNN and handle
distance calculations efficiently.
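For instance, the worked example above could be reproduced with scikit-learn roughly as follows, assuming the six-row sample dataset and K = 3 as given.

```python
from sklearn.neighbors import KNeighborsClassifier

# Sample data from the table above: (Feature 1, Feature 2) -> Class
X = [[3, 5], [1, 2], [4, 2], [4, 5], [2, 1], [3, 3]]
y = ["A", "B", "A", "B", "B", "A"]

# K = 3 with the default Euclidean (Minkowski, p=2) distance
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[3, 4]]))  # expected to print ['A'], matching the hand calculation
```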
