The document outlines an assignment by Manjiri Makode on data classification and clustering using Python. It includes the implementation of a K-Nearest Neighbors (KNN) classifier and a Support Vector Machine (SVM) classifier on a dataset of emails, providing metrics such as accuracy, precision, recall, and F1 score for both models. Additionally, it demonstrates clustering using K-Means on the Iris dataset, visualizing the results with PCA.


Name: Manjiri Makode

Roll No: 2441015


Batch: C
Assignment No. 05: Perform data classification using a classification algorithm, or perform data clustering using a clustering algorithm.

1. Classification
In [1]: import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn import metrics

In [2]: df=pd.read_csv('emails.csv')

In [3]: df.head()
Out[3]:
    Email No.  the  to  ect  and  for  of    a  you  hou  ...  connevey  jay  valued  lay  infrastructure  military
0     Email 1    0   0    1    0    0   0    2    0    0  ...         0    0       0    0               0         0
1     Email 2    8  13   24    6    6   2  102    1   27  ...         0    0       0    0               0         0
2     Email 3    0   0    1    0    0   0    8    0    0  ...         0    0       0    0               0         0
3     Email 4    0   5   22    0    5   1   51    2   10  ...         0    0       0    0               0         0
4     Email 5    7   6   17    1    5   2   57    0    9  ...         0    0       0    0               0         0

5 rows × 3002 columns

In [4]: df.tail()

Out[4]:
       Email No.  the  to  ect  and  for  of    a  you  hou  ...  connevey  jay  valued  lay  infrastructure  military
5167  Email 5168    2   2    2    3    0   0   32    0    0  ...         0    0       0    0               0         0
5168  Email 5169   35  27   11    2    6   5  151    4    3  ...         0    0       0    0               0         0
5169  Email 5170    0   0    1    1    0   0   11    0    0  ...         0    0       0    0               0         0
5170  Email 5171    2   7    1    0    2   1   28    2    0  ...         0    0       0    0               0         0
5171  Email 5172   22  24    5    1    6   5  148    8    2  ...         0    0       0    0               0         0

5 rows × 3002 columns

In [5]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5172 entries, 0 to 5171
Columns: 3002 entries, Email No. to Prediction
dtypes: int64(3001), object(1)
memory usage: 118.5+ MB

In [6]: df.describe()

Out[6]:
               the           to          ect          and          for           of            a          you  ...
count  5172.000000  5172.000000  5172.000000  5172.000000  5172.000000  5172.000000  5172.000000  5172.000000  ...
mean      6.640565     6.188128     5.143852     3.075599     3.124710     2.627030    55.517401     2.466551  ...
std      11.745009     9.534576    14.101142     6.045970     4.680522     6.229845    87.574172     4.314444  ...
min       0.000000     0.000000     1.000000     0.000000     0.000000     0.000000     0.000000     0.000000  ...
25%       0.000000     1.000000     1.000000     0.000000     1.000000     0.000000    12.000000     0.000000  ...
50%       3.000000     3.000000     1.000000     1.000000     2.000000     1.000000    28.000000     1.000000  ...
75%       8.000000     7.000000     4.000000     3.000000     4.000000     2.000000    62.250000     3.000000  ...
max     210.000000   132.000000   344.000000    89.000000    47.000000    77.000000  1898.000000    70.000000  ...

8 rows × 3001 columns

In [7]: df.columns

Out[7]: Index(['Email No.', 'the', 'to', 'ect', 'and', 'for', 'of', 'a', 'you', 'hou',
...
'connevey', 'jay', 'valued', 'lay', 'infrastructure', 'military',
'allowing', 'ff', 'dry', 'Prediction'],
dtype='object', length=3002)

In [8]: df.dtypes

Out[8]: Email No.    object
the          int64
to int64
ect int64
and int64
...
military int64
allowing int64
ff int64
dry int64
Prediction int64
Length: 3002, dtype: object

In [9]: df.size

Out[9]: 15526344
In [10]: df.isna().sum()
Out[10]: Email No. 0
the 0
to 0
ect 0
and 0
..
military 0
allowing 0
ff 0
dry 0
Prediction 0
Length: 3002, dtype: int64

In [11]: df.dropna(inplace=True)

In [12]: df.drop(['Email No.'],axis=1,inplace=True)

In [13]: X = df.drop(['Prediction'],axis = 1)
X

Out[13]:
      the  to  ect  and  for  of    a  you  hou  in  ...  enhancements  connevey  jay  valued  lay  infrastructure
0       0   0    1    0    0   0    2    0    0   0  ...             0         0    0       0    0               0
1       8  13   24    6    6   2  102    1   27  18  ...             0         0    0       0    0               0
2       0   0    1    0    0   0    8    0    0   4  ...             0         0    0       0    0               0
3       0   5   22    0    5   1   51    2   10   1  ...             0         0    0       0    0               0
4       7   6   17    1    5   2   57    0    9   3  ...             0         0    0       0    0               0
...   ...  ..  ...  ...  ...  ..  ...  ...  ...  ..  ...           ...       ...  ...     ...  ...             ...
5167    2   2    2    3    0   0   32    0    0   5  ...             0         0    0       0    0               0
5168   35  27   11    2    6   5  151    4    3  23  ...             0         0    0       0    0               0
5169    0   0    1    1    0   0   11    0    0   1  ...             0         0    0       0    0               0
5170    2   7    1    0    2   1   28    2    0   8  ...             0         0    0       0    0               0
5171   22  24    5    1    6   5  148    8    2  23  ...             0         0    0       0    0               0

5172 rows × 3000 columns

In [14]: y = df['Prediction']
y
Out[14]: 0 0
1 0
2 0
3 0
4 0
..
5167 0
5168 0
5169 1
5170 1
5171 0
Name: Prediction, Length: 5172, dtype: int64

In [15]: from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

x = scale(X)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
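
Note that scale(X) above standardizes the full feature matrix before the train/test split, so the test rows also contribute to the scaling statistics. A minimal alternative sketch, not part of the original assignment, that fits the scaler on the training split only (assuming the same X and y):

In [ ]: # Sketch: standardize after splitting, so only training data defines the scaling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # fit mean/std on the training split only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test split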
KNN Classifier
In [16]: from sklearn.neighbors import KNeighborsClassifier
knn= KNeighborsClassifier(n_neighbors=7)
knn.fit(x_train,y_train)
y_pred=knn.predict(x_test)

In [17]: print("Prediction",y_pred)

Prediction [0 0 1 ... 1 1 1]

In [18]: print("Confusion Matrix: ")
print(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred))

Confusion Matrix:
[[804 293]
[ 16 439]]
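
If a plotted confusion matrix is preferred over the printed array, a minimal sketch (assuming the same y_test and y_pred, and scikit-learn >= 1.0 for ConfusionMatrixDisplay.from_predictions):

In [ ]: # Sketch: draw the KNN confusion matrix as a heatmap
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=["Not Spam", "Spam"])
plt.title("KNN Confusion Matrix")
plt.show()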

In [19]: print("KNN Accuracy: ",metrics.accuracy_score(y_test,y_pred))

KNN Accuracy: 0.8009020618556701

In [20]: print("KNN Precision score: ",metrics.precision_score(y_test,y_pred))

KNN Precision score: 0.5997267759562842

In [21]: print("KNN Recall score: ",metrics.recall_score(y_test,y_pred))

KNN Recall score: 0.9648351648351648

In [22]: print("KNN F1 Score: ",metrics.f1_score(y_test,y_pred))

KNN F1 Score: 0.7396798652064027

In [23]: print("Classification Report:\n", metrics.classification_report(y_test, y_pred,
         target_names=["Not Spam", "Spam"]))

Classification Report:
              precision    recall  f1-score   support

    Not Spam       0.98      0.73      0.84      1097
        Spam       0.60      0.96      0.74       455

    accuracy                           0.80      1552
   macro avg       0.79      0.85      0.79      1552
weighted avg       0.87      0.80      0.81      1552
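
The value n_neighbors=7 is fixed above without tuning; one way to choose k is a cross-validated grid search, sketched below (the candidate values and the F1 scoring are illustrative, not taken from the assignment):

In [ ]: # Sketch: pick k for KNN by 5-fold grid search on the training split
from sklearn.model_selection import GridSearchCV
param_grid = {"n_neighbors": [3, 5, 7, 9, 11]}  # hypothetical candidate values
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="f1")
grid.fit(x_train, y_train)
print("Best k:", grid.best_params_["n_neighbors"])
print("Best cross-validated F1:", grid.best_score_)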

SVM Classifier
In [24]: from sklearn.svm import SVC
model=SVC(C=1)
model.fit(x_train,y_train)
y_pred=model.predict(x_test)

In [25]: print('Confusion Matrix: ')
print(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred))

Confusion Matrix:
[[1091 6]
[ 90 365]]
In [26]: print("SVM accuracy: ",metrics.accuracy_score(y_test,y_pred))

SVM accuracy: 0.9381443298969072

In [27]: print("SVM Precision score: ",metrics.precision_score(y_test,y_pred))

SVM Precision score: 0.9838274932614556

In [28]: print("SVM Recall score: ",metrics.recall_score(y_test,y_pred))

SVM Recall score: 0.8021978021978022

In [29]: print("SVM F1 Score: ",metrics.f1_score(y_test,y_pred))

SVM F1 Score: 0.8837772397094431

In [30]: print("SVM Classification Report:\n", metrics.classification_report(y_test, y_pred,
         target_names=["Not Spam", "Spam"]))

SVM Classification Report:
              precision    recall  f1-score   support

    Not Spam       0.92      0.99      0.96      1097
        Spam       0.98      0.80      0.88       455

    accuracy                           0.94      1552
   macro avg       0.95      0.90      0.92      1552
weighted avg       0.94      0.94      0.94      1552
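
As a quick check beyond the single 70/30 split, both classifiers could also be compared with k-fold cross-validation; a sketch, assuming the scaled matrix x and labels y defined earlier (5 folds and accuracy scoring are illustrative choices):

In [ ]: # Sketch: 5-fold cross-validated accuracy for both models
from sklearn.model_selection import cross_val_score
for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=7)), ("SVM", SVC(C=1))]:
    scores = cross_val_score(clf, x, y, cv=5, scoring="accuracy")
    print(name, "mean CV accuracy:", round(scores.mean(), 4))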

2. Clustering using K-Means


In [31]: from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [32]: # Load dataset
iris = load_iris()
X = iris.data

In [33]: # Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

In [34]: # Visualizing using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

In [35]: plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.title("K-Means Clustering (PCA Reduced)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()
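
Here n_clusters=3 matches the three iris species; a common check on that choice is the elbow method, plotting the K-Means inertia (within-cluster sum of squares) against k. A sketch, with an illustrative range of k values:

In [ ]: # Sketch: elbow method - inertia versus number of clusters
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(list(k_values), inertias, marker="o")
plt.title("Elbow Method for K-Means on Iris")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.show()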
