Exp 5
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
sns.set()
K-Nearest Neighbors (KNN) algorithm
First, we figure out the value of K by trial and error.
Steps
• Let k = 3: we find the three nearest data points using Euclidean distance.
• If all three belong to the same class, the point we are predicting for is classified as that class.
• If k = 10, the nearest neighbors may come from two different classes. The reason is that the points in one class are limited, so the search pulls in points from the other class to complete K's count. We then classify the point into the class that contributes more of the K neighbors.
• If k = 20, the prediction may sometimes be incorrect, because the neighborhood can include even more data points from the other class, which leads to wrong classification.
So we have to choose K carefully; a sketch of this trial-and-error search follows.
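A minimal sketch of this trial-and-error search for K, assuming the X_train/X_test split and scaling defined further below in this notebook:

k_values = range(1, 40, 2)  # odd values of K avoid voting ties in a binary problem
accuracies = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))
plt.plot(list(k_values), accuracies, marker='o')
plt.xlabel('K')
plt.ylabel('Test accuracy')
plt.title('Choosing K by trial and error')
plt.show()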
data = pd.read_csv('diabetes-dataset.csv')
data.head(3)
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  ...
1            0       84             82             31      125  38.2  ...
2            0      145              0              0        0  44.2  ...
(row 0 and the remaining columns are truncated in this export)
data.isna().sum()
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
print(data.isin({0}).sum())
Pregnancies 301
Glucose 13
BloodPressure 90
SkinThickness 573
Insulin 956
BMI 28
DiabetesPedigreeFunction 0
Age 0
Outcome 1316
dtype: int64
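The zero counts above suggest that several physiological columns use 0 as a stand-in for a missing measurement (an Insulin or SkinThickness of 0 is not physically plausible). A minimal cleaning sketch, assuming we impute those zeros with the median of the valid entries; this step is an assumption and is not part of the original run:

# Assumption (not in the original run): treat zeros in these physiological
# columns as missing values and impute them with the median of valid entries.
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for col in zero_as_missing:
    valid_median = data.loc[data[col] != 0, col].median()
    data[col] = data[col].replace(0, valid_median)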
def skewness(data):
    # Compute the skew of every numeric feature, sorted by absolute skew.
    skew_df = pd.DataFrame(data.select_dtypes(np.number).columns,
                           columns=['Feature'])
    skew_df['Skew'] = skew_df['Feature'].apply(lambda feature: skew(data[feature]))
    skew_df['Absolute Skew'] = skew_df['Skew'].apply(abs)
    return skew_df.sort_values(by='Absolute Skew',
                               ascending=False).reset_index(drop=True)
skewness(data)
All of the features are skewed except Glucose and Blood Pressure. Let us apply a log transformation to deal with it; a sketch follows.
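A minimal sketch of the transformation, assuming np.log1p (log(1 + x), which is safe for the zero values in these columns) applied to the skewed features; the exact column list is an assumption read off the skew table:

# Assumption: log1p-transform the right-skewed columns identified above.
skewed_cols = ['Insulin', 'SkinThickness', 'DiabetesPedigreeFunction',
               'Age', 'Pregnancies', 'BMI']
for col in skewed_cols:
    data[col] = np.log1p(data[col])
skewness(data)  # re-check skew after the transformation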
X = data.drop('Outcome', axis=1)
y = data['Outcome'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
Scaling
mean = X_train.mean()
std = X_train.std()  # use training-set statistics only, to avoid test-set leakage
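The statistics above are computed but never applied in the original cells; a minimal sketch of the standardization step, assuming the usual z-score form:

# Standardize both splits with the training statistics (z-score scaling).
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std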
len(X_train)
1600
len(X_test)
400
knn = KNeighborsClassifier(n_neighbors=21)
# The default metric is Minkowski, which with p=2 equals Euclidean distance.
knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=21)
knn.score(X_test,y_test)
0.83
The confusion matrix tells us for which classes we got our predictions right and where we went wrong; a sketch of computing it follows.
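A minimal sketch of producing the confusion matrix and the per-class report that the numbers below come from, using scikit-learn's metrics API:

from sklearn.metrics import confusion_matrix, classification_report
y_pred = knn.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')  # rows: actual, columns: predicted
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
print(classification_report(y_test, y_pred))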
Precision: Precision is the ratio of correctly predicted positive observations (true positives) to the
total predicted positives (true positives + false positives). For class 0, precision is 0.84, and for
class 1, precision is 0.82. This means that 84% of the samples predicted as class 0 are actually
class 0, and 82% of the samples predicted as class 1 are actually class 1.
Recall: Recall, also known as sensitivity, is the ratio of correctly predicted positive observations to all observations in the actual class. For class 0, recall is 0.91, and for class 1, recall is 0.69. This means that 91% of the actual class 0 samples are correctly identified, and 69% of the actual class 1 samples are correctly identified.
F1-score: The F1-score is the harmonic mean of precision and recall, so it accounts for both false positives and false negatives. For class 0, the F1-score is 0.87, and for class 1, it is 0.75. The report also provides a weighted average F1-score, which is 0.83 in this case.
Accuracy: Accuracy is the ratio of correctly predicted observations to the total observations. The
overall accuracy of the model is 0.83, which means that 83% of the predictions are correct.
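As a quick arithmetic check, the reported per-class F1 values follow from the precision and recall above via the harmonic mean:

# F1 = 2 * P * R / (P + R), checked against the reported values.
for cls, p, r in [(0, 0.84, 0.91), (1, 0.82, 0.69)]:
    print(f'class {cls}: F1 = {2 * p * r / (p + r):.2f}')  # -> 0.87 and 0.75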