
NATIONAL INSTITUTE OF TECHNOLOGY

RAIPUR
INFORMATION TECHNOLOGY (5th semester)
DATA MINING

Submitted by:
OMESH KUMAR SAHU - 19118059
PANKAJ KUMAR - 19118062
PRIYANK KULSHRESTHA - 19118067
RAJ NAGDEO - 19118072
TRILOK CHAND THAKUR - 19118090

Submitted to:
Dr. Mridu Sahu

CLUSTERING APPROACH IN DIABETES DATASET


TOPIC DESCRIPTION & INTRODUCTION

Data mining is one of the most useful techniques for helping entrepreneurs,
researchers, and individuals extract valuable information from huge sets of
data. Data mining is also called Knowledge Discovery in Databases (KDD). In
this project we apply data mining techniques to a diabetes dataset.

Diabetes is a disease that occurs when your blood glucose, also called
blood sugar, is too high. It is a condition in which the body does not
properly process food for use as energy. There are several types of
diabetes, such as Type 1, Type 2, and gestational diabetes.

The objective of the dataset is to diagnostically predict whether or not a
patient has diabetes, based on certain diagnostic measurements included
in the dataset. Several constraints were placed on the selection of these
instances from a larger database.
DATA DESCRIPTION

 Number of Instances: 768
 Number of Attributes: 8 plus class
 For Each Attribute: all numeric-valued
 Column descriptions and their values:

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: class value 1 is interpreted as "tested positive for diabetes"
All patients in the dataset are females at least 21 years old.

Attributes
S. No.  Name of Column               Values                          Data-Type
1       Pregnancies                  Number                          Numerical
2       Glucose                      Number                          Numerical
3       Blood Pressure               mm Hg                           Numerical
4       Skin Thickness               mm                              Numerical
5       Insulin                      mu U/ml                         Numerical
6       Body Mass Index (BMI)        weight in kg/(height in m)^2    Numerical
7       Diabetes Pedigree Function   Double                          Numerical
8       Age                          Number                          Numerical
9       Outcome                      0 or 1                          Numerical
DIABETES DATASET INFORMATION
DATA DESCRIPTION
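A minimal pandas sketch that produces this kind of dataset summary (illustrative only; the file name diabetes.csv matches the source code section later in this document):

import pandas as pd

# Load the diabetes dataset (same file name as in the source code section)
mydata = pd.read_csv("diabetes.csv")

# First few rows of the data
print(mydata.head())

# Column names, non-null counts and data types
mydata.info()

# Summary statistics (count, mean, std, min, quartiles, max) for each column
print(mydata.describe())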
Various Techniques To Handle Missing Data

Real-world data often has many missing values, caused by data corruption or
failure to record the data. Handling missing data is very important during
preprocessing of the dataset, because many machine learning algorithms do not
support missing values.

1. Delete the records that have missing values
2. Build a separate model to predict (impute) the missing values
3. Fill in missing values manually
4. Impute missing values with statistical measures such as the mean, median, or mode
IMPUTATION: Imputation is a technique for replacing missing data with a substitute
value so that most of the data/information in the dataset is retained. These techniques
are used because removing records from the dataset every time is not feasible and can
reduce the size of the dataset to a large extent, which not only raises concerns about
biasing the dataset but also leads to incorrect analysis. Here we discuss some important
imputation methods.

Impute missing values with the mean/median or the most frequent value (mode):

Columns in the dataset that contain numeric continuous values can have their missing
entries replaced with the mean, median, or mode of the remaining values in the column.
This method prevents the loss of data compared with simply deleting records. Replacing
missing values with these approximations (mean, median, mode) is a statistical approach
to handling missing values.
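A minimal pandas sketch of this approach (illustrative only; the column names follow the attribute table above, and treating zeros in some columns as missing measurements is an assumption, a common convention for this dataset rather than something stated here):

import numpy as np
import pandas as pd

mydata = pd.read_csv("diabetes.csv")

# Assumption: zeros in these columns stand in for missing measurements
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
mydata[cols] = mydata[cols].replace(0, np.nan)

# Mean imputation for roughly symmetric columns
mydata["Glucose"] = mydata["Glucose"].fillna(mydata["Glucose"].mean())
mydata["BloodPressure"] = mydata["BloodPressure"].fillna(mydata["BloodPressure"].mean())

# Median imputation for skewed columns
mydata["SkinThickness"] = mydata["SkinThickness"].fillna(mydata["SkinThickness"].median())
mydata["Insulin"] = mydata["Insulin"].fillna(mydata["Insulin"].median())
mydata["BMI"] = mydata["BMI"].fillna(mydata["BMI"].median())

# Mode imputation would look like: mydata["col"].fillna(mydata["col"].mode()[0])
print(mydata.isna().sum())  # verify that no missing values remain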
Data with missing values :

After Filling Missing Values :


Machine Learning Algorithms
1. Support Vector Machine (SVM)
2. k-Nearest Neighbors (KNN)
3. Decision Tree Classifier

Support Vector Machine

A Support Vector Machine (SVM) is a supervised machine learning algorithm
that can be used for both classification and regression challenges;
however, it is mostly used for classification problems. In the SVM
algorithm, we plot each data item as a point in n-dimensional space
(where n is the number of features), with the value of each feature being
the value of a particular coordinate. Then we perform classification by
finding the hyperplane that best separates the two classes.
K-Nearest Neighbors Algorithm
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement
supervised machine learning algorithm that can be used to solve both
classification and regression problems. As the name suggests, it considers
the K nearest neighbors (data points) to predict the class or continuous
value for a new data point.

Decision Tree Classifier

A Decision Tree is a supervised learning technique that can be used for
both classification and regression problems, but it is mostly preferred
for solving classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches
represent the decision rules, and each leaf node represents the outcome.
A short sketch of training these three classifiers follows below.
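A minimal scikit-learn sketch of training the three classifiers on the diabetes dataset (illustrative only; the train/test split and default hyperparameters are assumptions, so the accuracies will not necessarily match the numbers reported in the conclusion):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

mydata = pd.read_csv("diabetes.csv")
X = mydata.drop(columns=["Outcome"])  # the 8 attributes
y = mydata["Outcome"]                 # class variable (0 or 1)

# Hold out 25% of the data for testing (assumed split, not stated in the slides)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.2%}")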
Introduction to Clustering

 Clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar to each
other than to those in other clusters; in other words, similar objects are
placed in one group and dissimilar objects in other groups.

 Clustering is unsupervised classification (no predefined classes).
 Useful for
Pattern Recognition
Image Processing
Economic Science (especially market research)
Automatically organizing data
Understanding hidden structure in data.
Applications
News clustering (e.g., Google News)

Other applications

Biology: classification of plants and animals given their features

Marketing: customer segmentation based on a database of customer data
containing their properties and past buying records

Social networks: recognizing communities within a network

Aspects of clustering

1. A clustering algorithm, such as the following (a short comparison sketch follows this list):

 Partitional clustering, e.g., k-means

 Hierarchical clustering, e.g., AHC (agglomerative hierarchical clustering)

 Mixture of Gaussians
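A minimal scikit-learn sketch contrasting these three families on the diabetes data (illustrative only; the choice of the BloodPressure/Age features and of three clusters is an assumption made for this example):

import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

mydata = pd.read_csv("diabetes.csv")
X = mydata[["BloodPressure", "Age"]]  # assumed features for the comparison

# Partitional clustering: k-means
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical clustering: agglomerative (AHC)
ahc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Mixture of Gaussians
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Compare cluster sizes produced by each method
print(pd.Series(km_labels).value_counts())
print(pd.Series(ahc_labels).value_counts())
print(pd.Series(gmm_labels).value_counts())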
The K-Means Clustering Method
Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
General Characteristics
– Find mutually exclusive clusters of spherical shape
– Distance-based
– May use mean or medoid (etc.) to represent cluster centre
– Effective for small- to medium-size data sets

k-means (MacQueen’67) : The k-means algorithm for partitioning, where each cluster’s centre is represented
by the mean value of the objects in the cluster.

K-means algorithm

Given k:
1. Randomly choose k data points to be the initial cluster centres.
2. Assign each data point to the closest cluster centre.
3. Re-compute the cluster centres using the current cluster memberships.
4. If a convergence criterion is not met, go to step 2.

Stopping/convergence criteria:
1. No re-assignments of data points to different clusters
2. No (or minimal) change of the centroids
3. Minimal decrease in the sum of squared errors (SSE)
SSE = Σ_{i=1..k} Σ_{x ∈ C_i} dist(x, m_i)^2, where C_i is the i-th cluster and m_i is its centre (mean).
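A minimal from-scratch NumPy sketch of these steps (illustrative only; the project's own implementation uses scikit-learn's KMeans, as shown in the source code section below):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly choose k data points as the initial cluster centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each data point to its closest cluster centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Re-compute the cluster centres from the current memberships
        #    (keep the old centre if a cluster becomes empty)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids no longer change, otherwise repeat
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # Sum of squared errors (SSE): squared distance of each point to its centre
    sse = ((X - centres[labels]) ** 2).sum()
    return labels, centres, sse

# Example usage on random 2-D data:
# labels, centres, sse = kmeans(np.random.rand(100, 2), k=3)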
Source code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load the diabetes dataset and inspect the first rows
mydata = pd.read_csv("diabetes.csv")
print(mydata.head())

# Columns used for clustering (copied so the original frame is not modified)
iv = mydata[["BloodPressure", "Age", "Outcome"]].copy()

# Elbow method: run k-means for k = 1..10 and record the within-cluster
# sum of squares (inertia) for each k
wss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=0)
    kmeans.fit(iv)
    wss.append(kmeans.inertia_)
    print(i, kmeans.inertia_)

plt.plot(range(1, 11), wss)
plt.title("The Elbow plot")
plt.xlabel("Number of clusters")
plt.ylabel("Within-cluster sum of squares")
plt.show()

# Fit k-means with the chosen number of clusters and store the cluster labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
iv["cluster"] = kmeans.fit_predict(iv)
print(iv.head())

# Scatter plot of the clusters in the BloodPressure/Age plane
# (cluster labels are arbitrary and do not directly correspond to the Outcome column)
for label, colour in zip(range(3), ["red", "green", "blue"]):
    plt.scatter(iv.loc[iv["cluster"] == label, "BloodPressure"],
                iv.loc[iv["cluster"] == label, "Age"],
                s=100, c=colour, label=f"cluster {label}")

plt.title("Result of k-means clustering")
plt.xlabel("BloodPressure")
plt.ylabel("Age")
plt.legend()
plt.show()

Output and Result


CONCLUSION

In this study, various machine learning algorithms were applied to the dataset and classification was performed
with several of them, of which the Support Vector Machine (SVM) gave the highest accuracy, 75.66%. After
applying the Standard Scaler, the AdaBoost classifier became the best model, with an accuracy of 78.66%. We
compared machine learning algorithm accuracies on two different datasets; the model improves the accuracy and
precision of diabetes prediction on this dataset compared with the existing dataset.

We also used different methods to handle missing values in the diabetes dataset. When we applied mean and
median imputation, we obtained different values and different accuracies. Depending on the data type, numerical
or categorical, we used different methods to handle the missing values. Data preprocessing and missing-value
handling gave insight into the problems and their solutions. The methods applied to the dataset gave different
accuracies (in %), summarised in the table below.

Accuracy Table

Algorithm       Accuracy
Decision Tree   74.67%
KNN             75.54%
SVC             78.66%
