2 - Jupyter Notebook

This notebook classifies emails as spam or not spam with scikit-learn. It loads word-count data from emails.csv, computes descriptive statistics, checks for missing values, drops the Email No. identifier column, and relabels the 0/1 Prediction target as 'Not spam'/'Spam'. It then splits the data into feature (X) and target (Y) variables, holds out 20% of the rows as a test set, and compares a k-nearest-neighbours classifier (k = 7) with a support vector classifier (C = 1) by accuracy and confusion matrix.

Uploaded by shivam.wagh22

08/11/2023, 13:02 2 - Jupyter Notebook

In [2]: import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [3]: df = pd.read_csv('emails.csv')

In [4]: df

Out[4]:
       Email No.  the  to  ect  and  for  of    a  you  hou  ...  connevey  jay  valued  lay  infras
0        Email 1    0   0    1    0    0   0    2    0    0  ...         0    0       0    0
1        Email 2    8  13   24    6    6   2  102    1   27  ...         0    0       0    0
2        Email 3    0   0    1    0    0   0    8    0    0  ...         0    0       0    0
3        Email 4    0   5   22    0    5   1   51    2   10  ...         0    0       0    0
4        Email 5    7   6   17    1    5   2   57    0    9  ...         0    0       0    0
...          ...  ...  ..  ...  ...  ...  ..  ...  ...  ...  ...       ...  ...     ...  ...
5167  Email 5168    2   2    2    3    0   0   32    0    0  ...         0    0       0    0
5168  Email 5169   35  27   11    2    6   5  151    4    3  ...         0    0       0    0
5169  Email 5170    0   0    1    1    0   0   11    0    0  ...         0    0       0    0
5170  Email 5171    2   7    1    0    2   1   28    2    0  ...         0    0       0    0
5171  Email 5172   22  24    5    1    6   5  148    8    2  ...         0    0       0    0

5172 rows × 3002 columns

localhost:8888/notebooks/Desktop/B190594295/2.ipynb 1/6

In [5]: df.describe()

Out[5]:
               the           to          ect          and          for           of
count  5172.000000  5172.000000  5172.000000  5172.000000  5172.000000  5172.000000  5172.00
mean      6.640565     6.188128     5.143852     3.075599     3.124710     2.627030    55.51
std      11.745009     9.534576    14.101142     6.045970     4.680522     6.229845    87.57
min       0.000000     0.000000     1.000000     0.000000     0.000000     0.000000     0.00
25%       0.000000     1.000000     1.000000     0.000000     1.000000     0.000000    12.00
50%       3.000000     3.000000     1.000000     1.000000     2.000000     1.000000    28.00
75%       8.000000     7.000000     4.000000     3.000000     4.000000     2.000000    62.25
max     210.000000   132.000000   344.000000    89.000000    47.000000    77.000000  1898.00

8 rows × 3001 columns

In [6]: df.shape

Out[6]: (5172, 3002)

In [7]: df.isnull().any()

Out[7]:
Email No.     False
the           False
to            False
ect           False
and           False
              ...
military      False
allowing      False
ff            False
dry           False
Prediction    False
Length: 3002, dtype: bool
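
The per-column listing above shows only ten of the 3002 entries, so on its own it does not confirm that every column is free of missing values. A minimal sketch of collapsing the check into a single answer follows; the frame `toy` is a hypothetical stand-in, not the emails data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the emails frame; only the pattern matters.
toy = pd.DataFrame({
    "the": [0, 8, 0],
    "to": [0, 13, np.nan],
    "Prediction": [0, 0, 1],
})

has_missing = bool(toy.isnull().any().any())   # True if any cell is NaN
missing_per_column = toy.isnull().sum()        # NaN count for each column
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

print(has_missing)
print(columns_with_missing)
```

Applied to the notebook's `df`, `df.isnull().any().any()` would return a single boolean instead of a truncated Series.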


In [8]: df.drop(columns='Email No.', inplace=True)
df

Out[8]:
      the  to  ect  and  for  of    a  you  hou  in  ...  connevey  jay  valued  lay  infrastru
0       0   0    1    0    0   0    2    0    0   0  ...         0    0       0    0
1       8  13   24    6    6   2  102    1   27  18  ...         0    0       0    0
2       0   0    1    0    0   0    8    0    0   4  ...         0    0       0    0
3       0   5   22    0    5   1   51    2   10   1  ...         0    0       0    0
4       7   6   17    1    5   2   57    0    9   3  ...         0    0       0    0
...   ...  ..  ...  ...  ...  ..  ...  ...  ...  ..  ...       ...  ...     ...  ...
5167    2   2    2    3    0   0   32    0    0   5  ...         0    0       0    0
5168   35  27   11    2    6   5  151    4    3  23  ...         0    0       0    0
5169    0   0    1    1    0   0   11    0    0   1  ...         0    0       0    0
5170    2   7    1    0    2   1   28    2    0   8  ...         0    0       0    0
5171   22  24    5    1    6   5  148    8    2  23  ...         0    0       0    0

5172 rows × 3001 columns

In [9]: df.columns

Out[9]: Index(['the', 'to', 'ect', 'and', 'for', 'of', 'a', 'you', 'hou', 'in',
...
'connevey', 'jay', 'valued', 'lay', 'infrastructure', 'military',
'allowing', 'ff', 'dry', 'Prediction'],
dtype='object', length=3001)

In [10]: df.Prediction.unique()

Out[10]: array([0, 1], dtype=int64)

In [11]: df['Prediction'] = df['Prediction'].replace({0:'Not spam', 1:'Spam'})
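
Relabelling can go wrong silently, because any value that is missing from the mapping passes through `replace` unchanged. A small sketch of checking the result with `value_counts` follows; `labels` is a hypothetical stand-in series, not the real column.

```python
import pandas as pd

# Hypothetical stand-in for df['Prediction'].
labels = pd.Series([0, 1, 0, 0, 1], name="Prediction")
mapping = {0: "Not spam", 1: "Spam"}

relabelled = labels.replace(mapping)
counts = relabelled.value_counts()

# Any leftover 0/1 would show up here as an unmapped value.
unmapped = set(relabelled.unique()) - set(mapping.values())

print(counts.to_dict())
print(unmapped)
```

An empty `unmapped` set indicates that every original value was covered by the mapping.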


In [12]: df

Out[12]:
      the  to  ect  and  for  of    a  you  hou  in  ...  connevey  jay  valued  lay  infrastru
0       0   0    1    0    0   0    2    0    0   0  ...         0    0       0    0
1       8  13   24    6    6   2  102    1   27  18  ...         0    0       0    0
2       0   0    1    0    0   0    8    0    0   4  ...         0    0       0    0
3       0   5   22    0    5   1   51    2   10   1  ...         0    0       0    0
4       7   6   17    1    5   2   57    0    9   3  ...         0    0       0    0
...   ...  ..  ...  ...  ...  ..  ...  ...  ...  ..  ...       ...  ...     ...  ...
5167    2   2    2    3    0   0   32    0    0   5  ...         0    0       0    0
5168   35  27   11    2    6   5  151    4    3  23  ...         0    0       0    0
5169    0   0    1    1    0   0   11    0    0   1  ...         0    0       0    0
5170    2   7    1    0    2   1   28    2    0   8  ...         0    0       0    0
5171   22  24    5    1    6   5  148    8    2  23  ...         0    0       0    0

5172 rows × 3001 columns

In [13]: X = df.drop(columns='Prediction')  # features: the 3000 word-count columns (axis=1 is redundant with columns=)
Y = df['Prediction']  # target: 'Not spam' / 'Spam'

In [14]: X.columns

Out[14]: Index(['the', 'to', 'ect', 'and', 'for', 'of', 'a', 'you', 'hou', 'in',
...
'enhancements', 'connevey', 'jay', 'valued', 'lay', 'infrastructure',
'military', 'allowing', 'ff', 'dry'],
dtype='object', length=3000)

In [15]: Y.head()

Out[15]:
0    Not spam
1    Not spam
2    Not spam
3    Not spam
4    Not spam
Name: Prediction, dtype: object

In [16]: x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, ran
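
The call above is cut off at the page edge after `ran`, so its remaining arguments are unknown. Purely as a general illustration, the sketch below shows a seeded, stratified 80/20 split on hypothetical toy data. Both `random_state=42` and `stratify=` are assumptions chosen for the example, not values recovered from the notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, 70/30 class balance.
X_toy = [[i, i % 3] for i in range(20)]
y_toy = ["Not spam"] * 14 + ["Spam"] * 6

# random_state fixes the shuffle so the split is reproducible;
# stratify keeps the class ratio similar in both parts.
# Both settings are illustrative assumptions.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(x_tr), len(x_te))
```

With 5172 rows and `test_size=0.2`, scikit-learn rounds the test size up to 1035 rows, which matches the totals of the confusion matrices printed later in the notebook.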

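In [17]: # k-nearest-neighbours classifier: each test email gets the majority label of its 7 nearest training emails
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)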
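
For intuition about what a k-nearest-neighbours classifier computes, here is a minimal pure-Python sketch of majority voting with Euclidean distance, which is scikit-learn's default metric. The function `knn_predict` and the toy word counts are hypothetical, and the sketch leaves out the tie-breaking and indexing optimisations of a real implementation.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label.
    distances = [
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])  # nearest first
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical word-count vectors: [count of "the", count of "free"].
train_X = [[5, 0], [6, 1], [4, 0], [0, 7], [1, 6]]
train_y = ["Not spam", "Not spam", "Not spam", "Spam", "Spam"]

print(knn_predict(train_X, train_y, [5, 1], k=3))
print(knn_predict(train_X, train_y, [0, 5], k=3))
```

Because the vote depends on raw distances, columns with large counts (such as 'a' in this dataset) carry more weight than rare words unless the features are rescaled.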


In [18]: print("Prediction: \n")
print(y_pred)

Prediction:

['Not spam' 'Spam' 'Not spam' ... 'Not spam' 'Not spam' 'Not spam']

In [19]: M = metrics.accuracy_score(y_test,y_pred)
print("KNN accuracy: ", M)

KNN accuracy: 0.8714975845410629

In [20]: C = metrics.confusion_matrix(y_test,y_pred)
print("Confusion matrix: ", C)

Confusion matrix:  [[635  84]
 [ 49 267]]
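
scikit-learn orders the confusion matrix by sorted label, so the rows are the true classes ('Not spam', then 'Spam') and the columns are the predicted classes in the same order. A short sketch of deriving accuracy and the spam-class precision and recall by hand from the four counts printed above (the variable names are illustrative):

```python
# Counts copied from the KNN confusion matrix above.
# Rows: true 'Not spam', true 'Spam'.
# Columns: predicted 'Not spam', predicted 'Spam'.
tn, fp = 635, 84
fn, tp = 49, 267

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
spam_precision = tp / (tp + fp)  # share of predicted spam that is truly spam
spam_recall = tp / (tp + fn)     # share of true spam that was caught

print(f"total={total}")
print(f"accuracy={accuracy:.4f}")
print(f"spam precision={spam_precision:.4f}")
print(f"spam recall={spam_recall:.4f}")
```

The accuracy computed this way agrees with the value reported by `metrics.accuracy_score` in cell 19.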

In [21]: model = SVC(C = 1) # cost C = 1
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

In [22]: n = metrics.accuracy_score(y_test,y_pred)
print("SVM accuracy: ", n)

SVM accuracy: 0.7990338164251207

In [23]: kc = metrics.confusion_matrix(y_test, y_pred)
print("SVM confusion matrix: ", kc)

SVM confusion matrix:  [[700  19]
 [189 127]]
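
Both confusion matrices describe the same test set: by row sums, 719 'Not spam' and 316 'Spam' emails. A sketch that uses only the printed counts to compare the two models on spam recall and against a majority-class baseline (the dictionary layout and names are illustrative):

```python
# Confusion-matrix counts printed in the notebook, as (tn, fp, fn, tp).
matrices = {
    "KNN": (635, 84, 49, 267),
    "SVM": (700, 19, 189, 127),
}

summary = {}
for name, (tn, fp, fn, tp) in matrices.items():
    total = tn + fp + fn + tp
    summary[name] = {
        "accuracy": (tn + tp) / total,
        "spam_recall": tp / (tp + fn),
    }
    print(
        f"{name}: accuracy={summary[name]['accuracy']:.4f} "
        f"spam recall={summary[name]['spam_recall']:.4f}"
    )

# Baseline: always predict the majority class ('Not spam').
tn, fp, fn, tp = matrices["KNN"]
baseline_accuracy = (tn + fp) / (tn + fp + fn + tp)
print(f"majority-class baseline accuracy={baseline_accuracy:.4f}")
```

On these counts the SVC misses 189 of the 316 spam emails, so the gap between the two models is wider on spam recall than on accuracy.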

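In [24]: # Summarise the accuracies computed above (M for KNN, n for SVM) as percentages
results = pd.DataFrame({
    'Model Name': ['KNN', 'SVM'],
    'Accuracy Score': [round(M * 100, 2), round(n * 100, 2)]
})

In [25]: results

Out[25]:
  Model Name  Accuracy Score
0        KNN           87.15
1        SVM           79.90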
