Assignment No: 3B
Title:
Implement the K-Nearest Neighbours (KNN) algorithm on the Social Network Ads dataset. Compute
the confusion matrix, accuracy, error rate, precision, and recall on the given dataset.
Problem Statement:
Implement the K-Nearest Neighbours (KNN) algorithm on the Social Network Ads dataset for
classification, and compute the confusion matrix, accuracy, error rate, precision, and recall on
the given dataset.
Objectives:
Apply the KNN classification algorithm to classify the data with appropriate labels.
Theory:
K-Nearest Neighbours (KNN) is one of the most basic yet essential classification algorithms in
machine learning. It belongs to the supervised learning domain and finds wide application in
pattern recognition, data mining, and intrusion detection. KNN is a simple, easy-to-implement
supervised machine learning algorithm that can be used to solve both classification and
regression problems. The algorithm assumes that similar things exist in close proximity; in
other words, similar things are near each other. KNN captures this idea of similarity
(sometimes called distance, proximity, or closeness) with basic mathematics: calculating the
distance between points on a graph. There are other ways of calculating distance, which may be
preferable depending on the problem, but the straight-line distance (also called the Euclidean
distance) is a popular and familiar choice. KNN is widely applicable in real-life scenarios
because it is non-parametric, meaning it makes no underlying assumptions about the distribution
of the data (as opposed to algorithms such as GMM, which assume a Gaussian distribution of the
given data). This assignment illustrates K-nearest neighbours on sample data using the sklearn
library.
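As a quick illustration of the idea, KNN can be run on a small synthetic dataset with sklearn. The dataset shapes and random seeds below are arbitrary choices for demonstration, not part of the assignment dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for a real dataset (illustrative only).
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 neighbours
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # mean accuracy on the held-out test set
```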
Example:
Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want
to know whether it is a cat or a dog. We can use the KNN algorithm for this identification,
since it works on a similarity measure. The KNN model compares the features of the new image
with those of known cat and dog images and, based on the most similar features, places it in
either the cat or the dog category.
Libraries Required:
● Pandas – This library loads data in a 2D DataFrame format and has many
functions to perform analysis tasks in one go.
● Numpy – Numpy arrays are very fast and can perform large computations in a very
short time.
● Matplotlib/Seaborn – These libraries are used to draw visualizations.
● Sklearn – This module contains multiple libraries with pre-implemented
functions to perform tasks from data preprocessing to model development and
evaluation.
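The libraries above are typically imported at the top of the script; the aliases below (pd, np, plt, sns) follow common convention:

```python
# Standard imports for this assignment (aliases are conventional choices).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score)
```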
KNN Algorithm:
The working of K-NN can be explained with the following steps. Suppose we have a new data
point that we need to assign to one of two categories, A or B:
• First, choose the number of neighbours; here we choose k = 5.
• Next, calculate the Euclidean distance between the new data point and every existing data
point.
The Euclidean distance is the straight-line distance between two points, familiar from
geometry. For two points (x1, y1) and (x2, y2) it is calculated as:
d = √((x2 - x1)² + (y2 - y1)²)
• By calculating the Euclidean distances we find the k = 5 nearest neighbours: three nearest
neighbours in category A and two nearest neighbours in category B.
• Since the majority of the nearest neighbours (3 of 5) are from category A, the new data
point is assigned to category A.
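The distance calculation and majority vote described above can be sketched in plain Python. This is a simplified from-scratch illustration, not the sklearn implementation; the point coordinates and labels are made up for the example:

```python
import math
from collections import Counter

def euclidean(p, q):
    # d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + ...), generalised to any dimension
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_points, train_labels, query, k=5):
    # Sort training points by distance to the query and keep the k nearest.
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: euclidean(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    # The majority class among the k neighbours wins the vote.
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7)]
labels = ["A", "A", "A", "B", "B"]
print(knn_predict(points, labels, (2, 2), k=3))  # → A
```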
Advantages of KNN:
● Simple to implement and intuitive to understand.
● No explicit training phase; the training data itself acts as the model (a lazy learner).
● Can be used for both classification and regression problems.
● Makes no assumptions about the underlying data distribution (non-parametric).
Disadvantages of KNN:
● Sensitive to noisy features in the dataset.
● Computationally expensive for large datasets.
● Can be biased on imbalanced datasets.
● Requires choosing an appropriate value of K.
● Normalization (feature scaling) may be required.
Conclusion:
Thus, we have successfully implemented the KNN algorithm on the given dataset for
classification and computed the confusion matrix, accuracy, error rate, precision, and recall.
# Feature scaling: KNN is distance-based, so features must be standardized
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)
Sample output (confusion matrix and accuracy on the test set):
[[64 4]
[ 3 29]]
0.93
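The full workflow behind the fragment above can be sketched end to end. Since the Social Network Ads CSV is not reproduced here, a synthetic dataset stands in for it (an assumed substitute for illustration); with the real file, the first step would instead read the CSV with pandas and select the feature and label columns:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score)

# Synthetic stand-in for the dataset's two features (e.g. Age, EstimatedSalary).
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature scaling: KNN is distance-based, so features must share a scale.
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)

# Fit the classifier and predict on the held-out test set.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

# The five metrics required by the problem statement.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print("Accuracy :", acc)
print("Error rate:", 1 - acc)
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
```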