Exp 5

import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from scipy.stats import skew

KNN classification algorithm
First we figure out the value of K by trial and error.

Steps

• Let k = 3, i.e. we find the three nearest datapoints to the query point using Euclidean
distance.
• If all three belong to the same class, the query point is also classified as that class.

• If k = 10,
• the nearest neighbours may come from two different classes,
• because the points in one class are limited, so the search looks to the other class to
complete the K neighbours,
• and we classify the point as the class with the greater number of points among those K.

• If k = 20,
• the prediction for the query point may sometimes be incorrect,
• because the larger neighbourhood can take in more datapoints from the other class, which
leads to a wrong classification.

So we have to choose K carefully. A small sketch of this majority-vote idea follows below.
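A minimal sketch of the majority-vote idea above, on a tiny made-up 2-D dataset (the points, labels, and the knn_predict helper are illustrative only, not part of the diabetes workflow):

import numpy as np
from collections import Counter

# Toy training data (illustrative only): 2-D points with class labels 0/1
X_toy = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                  [6.0, 9.0], [1.2, 0.5], [7.0, 9.5]])
y_toy = np.array([0, 0, 1, 1, 0, 1])

def knn_predict(query, X, y, k=3):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))
    # Indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Majority vote among their class labels
    return Counter(y[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([1.1, 1.0]), X_toy, y_toy, k=3))   # all 3 neighbours are class 0 -> 0
print(knn_predict(np.array([6.5, 9.0]), X_toy, y_toy, k=3))   # all 3 neighbours are class 1 -> 1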

data = pd.read_csv('diabetes-dataset.csv')
data.head(3)

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            2      138             62             35        0  33.6
1            0       84             82             31      125  38.2
2            0      145              0              0        0  44.2

   DiabetesPedigreeFunction  Age  Outcome
0                     0.127   47        1
1                     0.233   23        0
2                     0.630   31        1

data.isna().sum()

Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
print(data.isin({0}).sum())

Pregnancies 301
Glucose 13
BloodPressure 90
SkinThickness 573
Insulin 956
BMI 28
DiabetesPedigreeFunction 0
Age 0
Outcome 1316
dtype: int64

# Zero is not a valid measurement for these columns; replace zeros with the column median / mean
for col in ['BMI', 'Glucose', 'BloodPressure']:
    data[col] = data[col].replace({0: data[col].median()})

for col in ['Insulin', 'SkinThickness']:
    data[col] = data[col].replace({0: data[col].mean()})

def skewness(data):
    # Compute the skew of every numeric column and sort by absolute skew
    skew_df = pd.DataFrame(data.select_dtypes(np.number).columns, columns=['Feature'])
    skew_df['Skew'] = skew_df['Feature'].apply(lambda feature: skew(data[feature]))
    skew_df['Absolute Skew'] = skew_df['Skew'].apply(abs)
    return skew_df.sort_values(by='Absolute Skew', ascending=False).reset_index(drop=True)

skewness(data)

                    Feature      Skew  Absolute Skew
0                   Insulin  2.946441       2.946441
1  DiabetesPedigreeFunction  1.810620       1.810620
2             SkinThickness  1.575336       1.575336
3                       Age  1.180381       1.180381
4               Pregnancies  0.981629       0.981629
5                       BMI  0.936902       0.936902
6                   Outcome  0.666133       0.666133
7                   Glucose  0.515607       0.515607
8             BloodPressure  0.219439       0.219439

All of the features are noticeably skewed except Glucose and BloodPressure. Let us apply a
log1p transformation to reduce the skew (applied below to all the numeric features, including
the mildly skewed ones).

for col in ['Insulin', 'DiabetesPedigreeFunction', 'SkinThickness',
            'Age', 'Pregnancies', 'BMI', 'Glucose', 'BloodPressure']:
    data[col] = np.log1p(data[col])

from sklearn.model_selection import train_test_split

X = data.drop('Outcome', axis=1)
y = data['Outcome'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

Scaling

mean = X_train.mean()
std = X_train.std()   # use the training-set mean and std for both splits

X_train = (X_train - mean) / std
X_train = np.c_[np.ones(X_train.shape[0]), X_train]   # prepend a constant column of ones
X_test = (X_test - mean) / std
X_test = np.c_[np.ones(X_test.shape[0]), X_test]
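An equivalent way to do this standardisation, shown only as a sketch, is sklearn's StandardScaler: it fits the mean and std on the training split and reuses them on the test split. It assumes the raw, unscaled splits and does not add the extra column of ones.

from sklearn.preprocessing import StandardScaler

# Sketch of the same scaling step with sklearn (applied to the raw train/test splits)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit mean/std on training data only
X_test_scaled = scaler.transform(X_test)         # reuse the training statistics on the test data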

len(X_train)

1600

len(X_test)

400

Create KNN (K nearest neighbors classifier)


from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=21)
# default metric is minkowski with p=2, which is the Euclidean distance

knn.fit(X_train,y_train)

KNeighborsClassifier(n_neighbors=21)

knn.score(X_test,y_test)

0.83
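The choice n_neighbors=21 comes from the trial-and-error idea described at the start. A minimal sketch of that sweep, assuming the X_train/X_test/y_train/y_test arrays prepared above (the exact accuracies will vary):

# Sketch: evaluate accuracy over a range of K values to pick n_neighbors
scores = {}
for k in range(1, 31, 2):   # odd K values avoid ties in the majority vote
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])

In practice K would be tuned with cross-validation on the training split rather than on the test set, so that the final test score is not used twice.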

The confusion matrix tells us, class by class, which predictions we got right and which we got wrong.

from sklearn.metrics import confusion_matrix

# Predict the labels for the test set
y_pred = knn.predict(X_test)

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g', cbar=False)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix')
plt.show()

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.91      0.87       253
           1       0.82      0.69      0.75       147

    accuracy                           0.83       400
   macro avg       0.83      0.80      0.81       400
weighted avg       0.83      0.83      0.83       400

Precision: Precision is the ratio of correctly predicted positive observations (true positives) to the
total predicted positives (true positives + false positives). For class 0, precision is 0.84, and for
class 1, precision is 0.82. This means that 84% of the samples predicted as class 0 are actually
class 0, and 82% of the samples predicted as class 1 are actually class 1.

Recall: Recall, also known as sensitivity, is the ratio of correctly predicted positive observations
to all observations in the actual class (true positives + false negatives). For class 0, recall is 0.91,
and for class 1, recall is 0.69. This means that 91% of the actual class 0 samples are correctly
identified, and 69% of the actual class 1 samples are correctly identified.

F1-score: The F1-score is the harmonic mean of precision and recall, so it accounts for both false
positives and false negatives. For class 0, the F1-score is 0.87, and for class 1, the F1-score is
0.75. The weighted-average F1-score across both classes is 0.83.

Accuracy: Accuracy is the ratio of correctly predicted observations to the total observations. The
overall accuracy of the model is 0.83, which means that 83% of the predictions are correct.
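
As a check on these definitions, the same numbers can be recomputed directly from the confusion matrix cm produced above (a minimal sketch; the variable names follow the earlier cells):

# Sketch: recompute the class-1 metrics and the overall accuracy from the confusion matrix
tn, fp, fn, tp = cm.ravel()            # rows = true labels, columns = predicted labels

precision_1 = tp / (tp + fp)           # correct class-1 predictions / all class-1 predictions
recall_1 = tp / (tp + fn)              # correct class-1 predictions / all actual class-1 samples
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)   # harmonic mean
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))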

Overall these metrics indicate a reasonably good model, although the lower recall for class 1 (0.69) means a noticeable share of the positive (diabetic) cases are still missed.
