
Computer Laboratory-I Class: BE (AI & DS)

Assignment No: 3B

Title:
Implement the K-Nearest Neighbours algorithm on the Social Network Ads dataset. Compute the
confusion matrix, accuracy, error rate, precision, and recall on the given dataset.
Problem Statement:
Implementation of the K-Nearest Neighbours algorithm on the Social Network Ads dataset for
classification, and computation of the confusion matrix, accuracy, error rate, precision, and
recall on the given dataset.
Objectives:
Apply the KNN classification algorithm to classify the data with appropriate labels.

Theory:
K-Nearest Neighbors (KNN) is one of the most basic yet essential classification algorithms in
machine learning. It belongs to the supervised learning family and is widely applied in pattern
recognition, data mining, and intrusion detection. KNN is a simple, easy-to-implement supervised
algorithm that can be used to solve both classification and regression problems. It assumes that
similar things exist in close proximity; in other words, similar things are near each other. KNN
captures the idea of similarity (sometimes called distance, proximity, or closeness) with
mathematics we learned in school: calculating the distance between points on a graph. There are
other ways of calculating distance, which may be preferable depending on the problem we are
solving, but the straight-line (Euclidean) distance is a popular and familiar choice. KNN is
widely applicable in real-life scenarios because it is non-parametric: it makes no underlying
assumptions about the distribution of the data (as opposed to algorithms such as GMM, which
assume a Gaussian distribution of the given data). This assignment illustrates K-Nearest
Neighbors on the Social Network Ads dataset using the sklearn library.

Example:

Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want
to know whether it is a cat or a dog. For this identification we can use the KNN algorithm,
since it works on a similarity measure. Our KNN model will find the features of the new image
that are most similar to those of the cat and dog images and, based on the most similar
features, will place it in either the cat or the dog category.


Importing Libraries and Dataset


Python libraries make it very easy to handle data and to perform typical and complex tasks
with a single line of code.

● Pandas – loads the dataset into a 2D data-frame format and provides many functions to
perform analysis tasks in one go.
● NumPy – NumPy arrays are very fast and can perform large computations in a very short time.
● Matplotlib/Seaborn – these libraries are used to draw visualizations.
● Sklearn – this module contains multiple sub-packages with pre-implemented functions for
everything from data preprocessing to model development and evaluation.

KNN Algorithm:

The working of K-NN can be explained with the following algorithm; a minimal from-scratch
sketch of these steps appears after the list:

Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new data point to every point in the training set.
Step-3: Take the K nearest neighbors according to the calculated distances.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category with the largest number of neighbors.
Step-6: Our model is ready.
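As an illustration (not part of the assignment code), these steps can be sketched in plain
Python with NumPy; the function name and toy data below are hypothetical:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: majority vote among the k neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical toy data: two features, two classes
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [8.0, 8.0], [9.0, 7.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 2.5]), k=3))  # -> 0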


Suppose we have a new data point that we need to assign to one of the two categories.

• First, we choose the number of neighbors; here we choose k = 5.
• Next, we calculate the Euclidean distance between the new point and the existing data points.

The Euclidean distance between two points (x1, y1) and (x2, y2), familiar from geometry, is
calculated as:

d = √((x2 − x1)² + (y2 − y1)²)
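This distance is easy to compute with NumPy; a small illustrative snippet (not part of the
assignment code):

import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])
d = np.sqrt(((p2 - p1) ** 2).sum())  # equivalent to np.linalg.norm(p2 - p1)
print(d)  # 5.0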


By calculating the Euclidean distances we obtain the nearest neighbors: three nearest neighbors
in category A and two in category B.

Since the majority (3 of 5) of the nearest neighbors belong to category A, the new data point
is assigned to category A.

Advantages of KNN:

● It is easy to understand and implement.
● It can handle multiclass classification problems.
● It is useful when the data does not follow a clear distribution.
● It is non-parametric.

Disadvantages of KNN:
● Sensitive to noisy features in the dataset.
● Computationally expensive for large datasets.
● It can be biased on imbalanced datasets.
● Requires choosing an appropriate value of K (a common approach is sketched below).
● Feature normalization may sometimes be required.
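A common way to choose K is cross-validation. A minimal sketch using sklearn, assuming the
scaled training data x_train and y_train as prepared in the notebook below:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Try several odd values of K and keep the one with the best cross-validated accuracy
for k in range(1, 16, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, x_train, y_train, cv=5)
    print(k, scores.mean())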


Conclusion:
Thus we have successfully implemented the KNN algorithm on the given dataset for classification
and computed the confusion matrix, accuracy, error rate, precision, and recall on the given dataset.

Experiment No 3B (KNN) – Jupyter Notebook

# Implement K-Nearest Neighbours algorithm on the Social Network Ads dataset. Compute
confusion matrix, accuracy, error rate, precision and recall on the given dataset
In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from matplotlib import pyplot as plt
import numpy as np
import pickle
import pandas as pd

In [4]: data_set = pd.read_csv('E:\\archive\\Social_Network_Ads.csv')

In [5]: # Extracting independent and dependent variables
x = data_set.iloc[:, 2:4].values  # Age and EstimatedSalary columns
y = data_set.iloc[:, 4].values    # Purchased column

# Splitting the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)  # random_state value truncated in the original printout; 0 assumed

# Feature scaling (KNN is distance-based, so features must be on comparable scales)
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)

In [6]: model = KNeighborsClassifier(n_neighbors=5, metric='minkowski')  # minkowski with the default p=2 is Euclidean distance
model.fit(x_train, y_train)

Out[6]: KNeighborsClassifier()
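Once fitted, the model can classify a new individual. A small illustrative usage, with
hypothetical age and salary values that do not appear in the original notebook:

# Hypothetical query: a 30-year-old earning 87,000; scale with the same StandardScaler
new_person = ss.transform([[30, 87000]])
print(model.predict(new_person))  # e.g. [0] -> not purchased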

In [7]: y_pred = model.predict(x_test)

cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
print(cm)

[[64 4]
[ 3 29]]

In [8]: acu = accuracy_score(y_test, y_pred)


print(acu)

0.93
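The assignment also asks for the error rate, which the printed notebook never computes
explicitly. A short additional cell like the following (an assumed extension, not in the
original printout) derives it, together with precision and recall for the positive class,
directly from the confusion matrix:

# Error rate is the complement of accuracy
err = 1 - acu
print(err)  # ≈ 0.07

# Precision and recall for the positive class (Purchased = 1)
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)  # 29 / (29 + 4) ≈ 0.88
recall = tp / (tp + fn)     # 29 / (29 + 3) ≈ 0.91
print(precision, recall)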

In [9]: cr = classification_report(y_test, y_pred)


print(cr)

              precision    recall  f1-score   support

           0       0.96      0.94      0.95        68
           1       0.88      0.91      0.89        32

    accuracy                           0.93       100
   macro avg       0.92      0.92      0.92       100
weighted avg       0.93      0.93      0.93       100


In [10]: # Scatter plot of the full (unscaled) dataset: red = not purchased, green = purchased
plt.scatter(x = x[y == 0, 0], y = x[y == 0, 1], color = 'red')
plt.scatter(x = x[y == 1, 0], y = x[y == 1, 1], color = 'green')
plt.show()

In [12]: # Scatter plot of the test set (x_test is standardized, so the axes are in scaled units)
X_set, y_set = x_test, y_test
plt.scatter(X_set[y_set == 0, 0], X_set[y_set == 0, 1], color="red", label="Not purchased")
plt.scatter(X_set[y_set == 1, 0], X_set[y_set == 1, 1], color="green", label="Purchased")
plt.xlabel("Age")
plt.ylabel("Estimated Salary")
plt.title("Car purchase by age and estimated salary")
plt.legend()
plt.show()
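The notebook imports pickle but never uses it in the printed cells, presumably to save the
trained model; a typical (assumed) final cell would be:

# Assumed model-persistence step; the filename is hypothetical
with open('knn_model.pkl', 'wb') as f:
    pickle.dump(model, f)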
