Clustering Approach in Diabetes Dataset: Submitted By: Submitted To: Dr. Mridu Sahu
RAIPUR
INFORMATION TECHNOLOGY (5th semester)
DATA MINING
Data mining is one of the most useful techniques that help entrepreneurs,
researchers, and individuals extract valuable information from huge sets
of data. Data mining is also called Knowledge Discovery in Databases (KDD). In
this data mining study we consider a diabetes dataset.
Diabetes is a disease that occurs when your blood glucose, also called
blood sugar, is too high. It is a condition in which the body does
not properly process food for use as energy. There are several types of
diabetes, including Type 1, Type 2, and gestational diabetes.
Attributes
S. No.  Name of Column              Values                          Data-Type
1.      Pregnancies                 Number                          Numerical
2.      Glucose                     Number                          Numerical
3.      Blood Pressure              (mm Hg)                         Numerical
4.      Skin Thickness              (mm)                            Numerical
5.      Insulin                     (mu U/ml)                       Numerical
6.      Body Mass Index (BMI)       (weight in kg/(height in m)^2)  Numerical
7.      Diabetes Pedigree Function  Double                          Numerical
8.      Age                         Number                          Numerical
9.      Outcome                     (0 or 1)                        Numerical
DIABETES DATASET INFORMATION
DATA DESCRIPTION
Various Techniques To Handle Missing Data
Real-world data often has many missing values. Missing values can be caused by
data corruption or by failure to record data. Handling missing data is very
important during pre-processing of the dataset, as many machine learning
algorithms do not support missing values.
Missing values in columns that hold continuous numeric values can be replaced with the
mean, median, or mode of the remaining values in the column. This method prevents the loss
of data that occurs when rows containing missing values are simply deleted. Replacing
missing values with these approximations (mean, median) is a statistical approach to handling them.
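Mean and median replacement can be sketched with pandas as follows. This is a minimal illustration on made-up rows, not the project's actual pre-processing; the column names Glucose and BMI are assumed from the attribute table, and missing entries are assumed to be stored as NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; NaN marks a missing measurement
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0],
    "BMI": [33.6, np.nan, 23.3, 28.1, 43.1],
})

# mean() and median() skip NaN by default, so they are computed
# from the remaining (observed) values in each column
df["Glucose"] = df["Glucose"].fillna(df["Glucose"].mean())
df["BMI"] = df["BMI"].fillna(df["BMI"].median())

print(df)
```

The median is less sensitive to extreme values than the mean, which is one reason the two replacements can lead to different downstream results.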
Data with missing values :
Mixture of Gaussians
The K-Means Clustering Method
Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
General Characteristics
– Find mutually exclusive clusters of spherical shape
– Distance-based
– May use mean or medoid (etc.) to represent cluster centre
– Effective for small- to medium-size data sets
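The difference between representing a cluster by its mean and by its medoid can be illustrated with a small NumPy sketch. The points are made up, and the medoid is taken here to be the cluster member with the smallest total Euclidean distance to the other members, which is one common definition.

```python
import numpy as np

# Made-up cluster with one outlying point
cluster = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [14.0, 0.0]])

# Mean: the average position, which need not be an actual data point
mean_centre = cluster.mean(axis=0)

# Medoid: the data point with the smallest summed distance to all others
pairwise = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
medoid = cluster[pairwise.sum(axis=1).argmin()]

print("mean:", mean_centre)
print("medoid:", medoid)
```

In this example the outlier pulls the mean away from the bulk of the points, while the medoid stays on an actual member of the cluster.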
k-means (MacQueen, 1967): the k-means algorithm for partitioning, where each cluster's centre is represented
by the mean value of the objects in the cluster.
K-means algorithm
Given k
1. Randomly choose k data points to be the initial cluster centres
2. Assign each data point to the closest cluster centre
3. Re-compute the cluster centres using the current cluster memberships.
4. If a convergence criterion is not met, go to 2.
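Steps 1 to 4 can be sketched in plain NumPy as follows. This is an illustrative sketch rather than the implementation used in the project (which relies on scikit-learn); the function name, the max_iter and seed parameters, and the toy points are assumptions, and the loop stops when no point changes cluster.

```python
import numpy as np


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means; returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centres
    centres = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centre (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop when no point changes cluster
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centre as the mean of its members
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
    return centres, labels


# Two well-separated toy groups (made-up values, not the diabetes data)
toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centres, labels = kmeans(toy, k=2)
print(centres)
print(labels)
```

Because the initial centres are chosen at random, the cluster numbering can differ between seeds even when the grouping is the same.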
Stopping/convergence criterion
1. No re-assignments of data points to different clusters
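The sum of squared error (SSE) measures how close the data points are to their cluster centres:
SSE = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - m_j \rVert^2
where C_j is the j-th cluster and m_j is its centre (the mean of the objects in C_j).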
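Assuming SSE denotes the usual k-means sum of squared errors (the squared Euclidean distance from each point to its assigned centre, summed over all points), it can be computed as in the sketch below. The function name and the four points are made up for illustration.

```python
import numpy as np


def sse(points, labels, centres):
    """Sum of squared Euclidean distances from each point to its assigned centre."""
    diffs = points - centres[labels]  # centres[labels] picks each point's own centre
    return float((diffs ** 2).sum())


points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centres = np.array([[0.0, 1.0], [10.0, 1.0]])

# Every point lies exactly 1 unit from its centre, so the total is 4.0
print(sse(points, labels, centres))
```

scikit-learn's KMeans reports the same quantity for a fitted model through its inertia_ attribute.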
Source code
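import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load the dataset and preview the first rows
mydata = pd.read_csv("diabetes.csv")
print(mydata.head())

# Columns used for clustering; .copy() lets a 'cluster' column be added later
# without pandas' SettingWithCopyWarning
iv = mydata[["BloodPressure", "Age", "Outcome"]].copy()

# Initial fit with k = 3 to check that clustering runs
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(iv)
print(kmeans.predict(iv))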
# Elbow method: within-cluster sum of squares (inertia) for k = 1..10
wss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=0)
    kmeans.fit(iv)
    wss.append(kmeans.inertia_)
    print(i, kmeans.inertia_)
plt.plot(range(1, 11), wss)
plt.title("The Elbow plot")
plt.xlabel("Number of clusters")
plt.ylabel("Within-cluster sum of squares")
plt.show()
# Final model with k = 3; store each row's cluster label
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
iv['cluster'] = kmeans.fit_predict(iv)
print(iv.head())
# Plot all three clusters; cluster ids are arbitrary and are not diabetes outcomes
for label, colour in zip(range(3), ['red', 'green', 'blue']):
    members = iv.loc[iv['cluster'] == label]
    plt.scatter(members['BloodPressure'], members['Age'], s=100, c=colour, label=f'cluster {label}')
plt.title("Result of K-means clustering")
plt.xlabel("BloodPressure")
plt.ylabel("Age")
plt.legend()
plt.show()
In this study, various machine learning algorithms were applied to the dataset for classification, of which
Support Vector Machine (SVM) gave the highest accuracy, 75.66%. After applying Standard Scaler, the AdaBoost
classifier was the best model, with an accuracy of 78.66%. We compared the accuracies of machine learning
algorithms on two different datasets. The model improves the accuracy and precision of diabetes prediction
with this dataset compared to the existing dataset.
We also used different methods to handle missing values in the diabetes dataset. Applying mean and median
replacement gave different values and different accuracies. Depending on the data type (numerical or
categorical), we used different methods to handle missing values. Data pre-processing and handling give
insight into the problems and their solutions. The methods we applied to the dataset gave different
accuracies (in %).
Accuracy Table
Algorithms     Accuracy
Decision Tree  74.67%
KNN            75.54%
SVC            78.66%