
NATIONAL INSTITUTE OF TECHNOLOGY

RAIPUR
INFORMATION TECHNOLOGY (5th semester)
DATA MINING

Submitted by:
OMESH KUMAR SAHU - 19118059
PANKAJ KUMAR - 19118062
PRIYANK KULSHRESTHA - 19118067
RAJ NAGDEO - 19118072
TRILOK CHAND THAKUR - 19118090

Submitted to:
Dr. Mridu Sahu

CLUSTERING APPROACH IN DIABETES DATASET


TOPIC DESCRIPTION & INTRODUCTION

Data mining is one of the most useful techniques for helping entrepreneurs,
researchers, and individuals extract valuable information from huge sets of
data. Data mining is also called Knowledge Discovery in Databases (KDD). In
this project we apply data mining techniques to a diabetes dataset.

Diabetes is a disease that occurs when your blood glucose, also called
blood sugar, is too high. It is a condition in which the body does not
properly process food for use as energy. There are several types of
diabetes, such as Type 1, Type 2, and gestational diabetes.

The objective of the dataset is to diagnostically predict whether or not a
patient has diabetes, based on certain diagnostic measurements included
in the dataset. Several constraints were placed on the selection of these
instances from a larger database.
DATA DESCRIPTION

 Number of Instances: 768
 Number of Attributes: 8 plus class
 For Each Attribute: all numeric-valued
 Column descriptions and their values:

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: class value 1 is interpreted as "tested positive for diabetes"
All patients in the dataset are females at least 21 years old.

Attributes
S. No.  Name of Column               Values                          Data-Type
1       Pregnancies                  Number                          Numerical
2       Glucose                      Number                          Numerical
3       Blood Pressure               mm Hg                           Numerical
4       Skin Thickness               mm                              Numerical
5       Insulin                      mu U/ml                         Numerical
6       Body Mass Index (BMI)        weight in kg/(height in m)^2    Numerical
7       Diabetes Pedigree Function   Double                          Numerical
8       Age                          Number                          Numerical
9       Outcome                      0 or 1                          Numerical
DIABETES DATASET INFORMATION
DATA DESCRIPTION
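A minimal pandas sketch that produces this kind of dataset summary (illustrative only; the file name diabetes.csv matches the source code section later in this document):

import pandas as pd

# Load the diabetes dataset (same file name as in the source code section)
mydata = pd.read_csv("diabetes.csv")

# First few rows of the data
print(mydata.head())

# Column names, non-null counts and data types
mydata.info()

# Summary statistics (count, mean, std, min, quartiles, max) for each column
print(mydata.describe())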
Various Techniques To Handle Missing Data

Real-world data often has many missing values, caused by data corruption or
failure to record the data. Handling missing data is very important during
preprocessing of the dataset, because many machine learning algorithms do not
support missing values.

1. Delete the records that have missing values
2. Build a separate model to predict (impute) the missing values
3. Fill in missing values manually
4. Impute missing values with statistical measures such as the mean, median, or mode
IMPUTATION: Imputation is a technique for replacing missing data with a substitute
value so that most of the data/information in the dataset is retained. These techniques
are used because removing records from the dataset every time is not feasible and can
reduce the size of the dataset to a large extent, which not only raises concerns about
biasing the dataset but also leads to incorrect analysis. Here we discuss some important
imputation methods.

Impute missing values with the mean/median or the most frequent value (mode):

Columns in the dataset that contain numeric continuous values can have their missing
entries replaced with the mean, median, or mode of the remaining values in the column.
This method prevents the loss of data compared with simply deleting records. Replacing
missing values with these approximations (mean, median, mode) is a statistical approach
to handling missing values.
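A minimal pandas sketch of this approach (illustrative only; the column names follow the attribute table above, and treating zeros in some columns as missing measurements is an assumption, a common convention for this dataset rather than something stated here):

import numpy as np
import pandas as pd

mydata = pd.read_csv("diabetes.csv")

# Assumption: zeros in these columns stand in for missing measurements
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
mydata[cols] = mydata[cols].replace(0, np.nan)

# Mean imputation for roughly symmetric columns
mydata["Glucose"] = mydata["Glucose"].fillna(mydata["Glucose"].mean())
mydata["BloodPressure"] = mydata["BloodPressure"].fillna(mydata["BloodPressure"].mean())

# Median imputation for skewed columns
mydata["SkinThickness"] = mydata["SkinThickness"].fillna(mydata["SkinThickness"].median())
mydata["Insulin"] = mydata["Insulin"].fillna(mydata["Insulin"].median())
mydata["BMI"] = mydata["BMI"].fillna(mydata["BMI"].median())

# Mode imputation would look like: mydata["col"].fillna(mydata["col"].mode()[0])
print(mydata.isna().sum())  # verify that no missing values remain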
Data with missing values :

After Filling Missing Values :


Machine Learning Algorithms
1. Support Vector Machine (SVM)
2. k-Nearest Neighbors (KNN)
3. Decision Tree Classifier

Support Vector Machine

A Support Vector Machine (SVM) is a supervised machine learning algorithm
that can be used for both classification and regression challenges;
however, it is mostly used for classification problems. In the SVM
algorithm, we plot each data item as a point in n-dimensional space
(where n is the number of features), with the value of each feature being
the value of a particular coordinate. Then we perform classification by
finding the hyperplane that best separates the two classes.
K-Nearest Neighbors Algorithm
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement
supervised machine learning algorithm that can be used to solve both
classification and regression problems. As the name suggests, it considers
the K nearest neighbors (data points) to predict the class or continuous
value for a new data point.

Decision Tree Classifier

A Decision Tree is a supervised learning technique that can be used for
both classification and regression problems, but it is mostly preferred
for solving classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches
represent the decision rules, and each leaf node represents the outcome.
A short sketch of training these three classifiers follows below.
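A minimal scikit-learn sketch of training the three classifiers on the diabetes dataset (illustrative only; the train/test split and default hyperparameters are assumptions, so the accuracies will not necessarily match the numbers reported in the conclusion):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

mydata = pd.read_csv("diabetes.csv")
X = mydata.drop(columns=["Outcome"])  # the 8 attributes
y = mydata["Outcome"]                 # class variable (0 or 1)

# Hold out 25% of the data for testing (assumed split, not stated in the slides)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.2%}")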
Introduction to Clustering

 Clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar to each
other than to those in other clusters; in other words, similar objects are
placed in one group and dissimilar objects in other groups.

 Clustering is unsupervised classification (no predefined classes).
 Useful for
Pattern Recognition
Image Processing
Economic Science (especially market research)
Automatically organizing data
Understanding hidden structure in data.
Applications
News clustering (e.g., Google News)

Other applications

Biology: classification of plants and animals given their features

Marketing: customer segmentation based on a database of customer data
containing their properties and past buying records

Social networks: recognizing communities within a network

Aspects of clustering

1. A clustering algorithm, such as the following (a short comparison sketch follows this list):

 Partitional clustering, e.g., k-means

 Hierarchical clustering, e.g., AHC (agglomerative hierarchical clustering)

 Mixture of Gaussians
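A minimal scikit-learn sketch contrasting these three families on the diabetes data (illustrative only; the choice of the BloodPressure/Age features and of three clusters is an assumption made for this example):

import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

mydata = pd.read_csv("diabetes.csv")
X = mydata[["BloodPressure", "Age"]]  # assumed features for the comparison

# Partitional clustering: k-means
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical clustering: agglomerative (AHC)
ahc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Mixture of Gaussians
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Compare cluster sizes produced by each method
print(pd.Series(km_labels).value_counts())
print(pd.Series(ahc_labels).value_counts())
print(pd.Series(gmm_labels).value_counts())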
The K-Means Clustering Method
Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
General Characteristics
– Find mutually exclusive clusters of spherical shape
– Distance-based
– May use mean or medoid (etc.) to represent cluster centre
– Effective for small- to medium-size data sets

k-means (MacQueen’67) : The k-means algorithm for partitioning, where each cluster’s centre is represented
by the mean value of the objects in the cluster.

K-means algorithm

Given k:
1. Randomly choose k data points to be the initial cluster centres.
2. Assign each data point to the closest cluster centre.
3. Re-compute the cluster centres using the current cluster memberships.
4. If a convergence criterion is not met, go to step 2.

Stopping/convergence criteria:
1. No re-assignments of data points to different clusters
2. No (or minimal) change of the centroids
3. Minimal decrease in the sum of squared errors (SSE)
SSE = Σ_{i=1..k} Σ_{x ∈ C_i} dist(x, m_i)^2, where C_i is the i-th cluster and m_i is its centre (mean).
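A minimal from-scratch NumPy sketch of these steps (illustrative only; the project's own implementation uses scikit-learn's KMeans, as shown in the source code section below):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly choose k data points as the initial cluster centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each data point to its closest cluster centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Re-compute the cluster centres from the current memberships
        #    (keep the old centre if a cluster becomes empty)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids no longer change, otherwise repeat
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # Sum of squared errors (SSE): squared distance of each point to its centre
    sse = ((X - centres[labels]) ** 2).sum()
    return labels, centres, sse

# Example usage on random 2-D data:
# labels, centres, sse = kmeans(np.random.rand(100, 2), k=3)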
Source code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load the diabetes dataset and inspect the first rows
mydata = pd.read_csv("diabetes.csv")
print(mydata.head())

# Columns used for clustering (copied so the original frame is not modified)
iv = mydata[["BloodPressure", "Age", "Outcome"]].copy()

# Elbow method: run k-means for k = 1..10 and record the within-cluster
# sum of squares (inertia) for each k
wss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=0)
    kmeans.fit(iv)
    wss.append(kmeans.inertia_)
    print(i, kmeans.inertia_)

plt.plot(range(1, 11), wss)
plt.title("The Elbow plot")
plt.xlabel("Number of clusters")
plt.ylabel("Within-cluster sum of squares")
plt.show()

# Fit k-means with the chosen number of clusters and store the cluster labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
iv["cluster"] = kmeans.fit_predict(iv)
print(iv.head())

# Scatter plot of the clusters in the BloodPressure/Age plane
# (cluster labels are arbitrary and do not directly correspond to the Outcome column)
for label, colour in zip(range(3), ["red", "green", "blue"]):
    plt.scatter(iv.loc[iv["cluster"] == label, "BloodPressure"],
                iv.loc[iv["cluster"] == label, "Age"],
                s=100, c=colour, label=f"cluster {label}")

plt.title("Result of k-means clustering")
plt.xlabel("BloodPressure")
plt.ylabel("Age")
plt.legend()
plt.show()

Output and Result


CONCLUSION

In this study, various machine learning algorithms were applied to the dataset and classification was performed
with several of them, of which the Support Vector Machine (SVM) gave the highest accuracy, 75.66%. After
applying the Standard Scaler, the AdaBoost classifier became the best model, with an accuracy of 78.66%. We
compared machine learning algorithm accuracies on two different datasets; the model improves the accuracy and
precision of diabetes prediction on this dataset compared with the existing dataset.

We also used different methods to handle missing values in the diabetes dataset. When we applied mean and
median imputation, we obtained different values and different accuracies. Depending on the data type, numerical
or categorical, we used different methods to handle the missing values. Data preprocessing and missing-value
handling gave insight into the problems and their solutions. The methods applied to the dataset gave different
accuracies (in %), summarised in the table below.

Accuracy Table

Algorithm       Accuracy
Decision Tree   74.67%
KNN             75.54%
SVC             78.66%
