Diabetes and Glucose Correlation - IBM Machine Learning Training Project

New Project

March 21, 2021

1 Brief Description of the Dataset and a Summary of Its Attributes

The dataset for this project was obtained from Kaggle, where it is called “diabetes.csv”. It contains 768 samples and 9 attributes, one of which is the class variable, and all samples come from female patients. The attributes are listed below:
• Pregnancies: Number of times pregnant
• Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
• BloodPressure: Diastolic blood pressure (mm Hg)
• SkinThickness: Triceps skin fold thickness (mm)
• Insulin: 2-Hour serum insulin (mu U/ml)
• BMI: Body mass index (weight in kg/(height in m)^2)
• DiabetesPedigreeFunction: Diabetes pedigree function
• Age: Age (years)
• Outcome: Class variable (0 or 1)

[51]: import pandas as pd


data = pd.read_csv('data/diabetes.csv')

[52]: data.head()

[52]: Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \


0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1

DiabetesPedigreeFunction Age Outcome


0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1

[53]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----

0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

2 Initial Plan for Data Exploration


Initially, the data will be examined for missing values and outliers. Once outliers and missing data are addressed, exploratory data analysis will begin: summary statistics and visualizations such as histograms and box plots (using matplotlib and seaborn) will be generated to examine the data, and correlations will be computed. Each variable will be examined to see whether transformations are needed in order to analyze the data.

3 Data Cleaning and Feature Engineering


First, let's confirm the number of rows and columns.

[54]: samples = data.shape[0]


attributes = data.shape[1]
print(samples, "samples")
print(attributes, "attributes")

768 samples
9 attributes
Let's check whether there are any missing values in the data.

[55]: data.isnull().sum()

[55]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

As we can see, there are no missing values. Next, we look at the summary statistics of the data, which indicate whether there are outliers. To confirm this, we also use box plots.

[56]: data.describe([0.05,0.25,0.50,0.75,0.90,0.95,0.99]).T

[56]: count mean std min 5% \


Pregnancies 768.0 3.845052 3.369578 0.000 0.00000
Glucose 768.0 120.894531 31.972618 0.000 79.00000
BloodPressure 768.0 69.105469 19.355807 0.000 38.70000
SkinThickness 768.0 20.536458 15.952218 0.000 0.00000
Insulin 768.0 79.799479 115.244002 0.000 0.00000
BMI 768.0 31.992578 7.884160 0.000 21.80000
DiabetesPedigreeFunction 768.0 0.471876 0.331329 0.078 0.14035
Age 768.0 33.240885 11.760232 21.000 21.00000
Outcome 768.0 0.348958 0.476951 0.000 0.00000

25% 50% 75% 90% 95% \


Pregnancies 1.00000 3.0000 6.00000 9.0000 10.00000
Glucose 99.00000 117.0000 140.25000 167.0000 181.00000
BloodPressure 62.00000 72.0000 80.00000 88.0000 90.00000
SkinThickness 0.00000 23.0000 32.00000 40.0000 44.00000
Insulin 0.00000 30.5000 127.25000 210.0000 293.00000
BMI 27.30000 32.0000 36.60000 41.5000 44.39500
DiabetesPedigreeFunction 0.24375 0.3725 0.62625 0.8786 1.13285
Age 24.00000 29.0000 41.00000 51.0000 58.00000
Outcome 0.00000 0.0000 1.00000 1.0000 1.00000

99% max
Pregnancies 13.00000 17.00
Glucose 196.00000 199.00
BloodPressure 106.00000 122.00
SkinThickness 51.33000 99.00
Insulin 519.90000 846.00
BMI 50.75900 67.10
DiabetesPedigreeFunction 1.69833 2.42
Age 67.00000 81.00
Outcome 1.00000 1.00

[57]: data.plot(kind = 'box', subplots = True, layout = (3, 3), sharex = False,
          sharey = False, figsize = (14, 12));

Because there are so many outliers in the Insulin and DiabetesPedigreeFunction columns, I remove them from the dataset. I also remove the SkinThickness column, because a value of zero makes no physical sense for a skin-fold thickness measurement.

[58]: data = data.drop("Insulin", axis=1)


data = data.drop("DiabetesPedigreeFunction", axis=1)
data = data.drop("SkinThickness", axis=1)
data.head()

[58]: Pregnancies Glucose BloodPressure BMI Age Outcome


0 6 148 72 33.6 50 1
1 1 85 66 26.6 31 0
2 8 183 64 23.3 32 1
3 1 89 66 28.1 21 0
4 0 137 40 43.1 33 1
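An alternative to dropping these columns entirely, which I did not pursue here, is to treat the physiologically implausible zeros as missing values and impute them. A minimal sketch of that approach, using a small synthetic stand-in frame rather than the notebook's data:

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the diabetes data; zeros mark missing readings.
data = pd.DataFrame({
    'Glucose': [148, 85, 0, 89],
    'BloodPressure': [72, 0, 64, 66],
    'Insulin': [0, 0, 94, 168],
})
cols = ['Glucose', 'BloodPressure', 'Insulin']

# Treat physiologically implausible zeros as missing values.
data[cols] = data[cols].replace(0, np.nan)

# Impute each column with its median, which is robust to the heavy
# right tail seen in Insulin.
data[cols] = data[cols].fillna(data[cols].median())
print(data)
```

This keeps all three columns (and their predictive signal) in the dataset at the cost of some imputation bias.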

Next, we count how many diabetics (1) and non-diabetics (0) there are in this dataset.

[59]: import matplotlib.pyplot as plt
data["Outcome"].value_counts().plot(kind="bar", color = "Green")
plt.title("Outcome");

data.Outcome.value_counts()

[59]: 0 500
1 268
Name: Outcome, dtype: int64

Since our attributes are on different scales, which can hurt the accuracy of some regression models, I use simple feature scaling to normalize the values of each attribute except “Outcome”.

[60]: data["Pregnancies"] = data["Pregnancies"] / data["Pregnancies"].max()


data["Glucose"] = data["Glucose"] / data["Glucose"].max()
data["BloodPressure"] = data["BloodPressure"] / data["BloodPressure"].max()
data["BMI"] = data["BMI"] / data["BMI"].max()
data["Age"] = data["Age"] / data["Age"].max()
data.head()

[60]: Pregnancies Glucose BloodPressure BMI Age Outcome


0 0.352941 0.743719 0.590164 0.500745 0.617284 1
1 0.058824 0.427136 0.540984 0.396423 0.382716 0
2 0.470588 0.919598 0.524590 0.347243 0.395062 1

3 0.058824 0.447236 0.540984 0.418778 0.259259 0
4 0.000000 0.688442 0.327869 0.642325 0.407407 1
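Note that simple feature scaling only divides by the column maximum; min-max scaling, another common choice, also subtracts the minimum so that each column spans exactly [0, 1]. A sketch of both on a tiny stand-in frame (not the notebook's data):

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this would be the diabetes data.
df = pd.DataFrame({'Glucose': [148, 85, 183], 'Age': [50, 31, 32], 'Outcome': [1, 0, 1]})
features = df.columns.drop('Outcome')

# Simple feature scaling: divide by the column maximum (used above).
simple = df[features] / df[features].max()

# Min-max scaling: also subtracts the minimum, mapping each column onto [0, 1].
minmax = (df[features] - df[features].min()) / (df[features].max() - df[features].min())

print(simple.round(3))
print(minmax.round(3))
```

Either works for this dataset; min-max scaling just guarantees the minimum maps to exactly 0.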

Histograms were generated for the data. Again, the plots to examine are Pregnancies, Glucose, BloodPressure, BMI, and Age. They appear to need transformations so that linear techniques can be used.

[61]: data.hist(figsize = (12, 12))


plt.show();

[62]: data.skew(axis=0, skipna=True)

[62]: Pregnancies 0.901674


Glucose 0.173754

BloodPressure -1.843608
BMI -0.428982
Age 1.129597
Outcome 0.635017
dtype: float64
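The skew values above show that Pregnancies and Age are noticeably right-skewed (and BloodPressure left-skewed). For the right-skewed columns, a log transform is one standard option; a sketch on synthetic Age-like values, not the notebook's frame:

```python
import numpy as np
import pandas as pd

# Right-skewed toy column (Age-like values with a long right tail).
s = pd.Series([21, 22, 24, 25, 29, 33, 41, 51, 58, 81], dtype=float)

# log1p compresses the right tail; it handles zeros safely (log(1 + x)).
transformed = np.log1p(s)

print(round(s.skew(), 2), round(transformed.skew(), 2))
```

The transformed column has visibly lower skew, which helps techniques that assume roughly symmetric inputs.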

[63]: import seaborn as sns


sns.set_context('talk')
sns.pairplot(data, hue='Outcome');

[64]: k = 6 #number of variables for heatmap


cols = data.corr().nlargest(k, 'Outcome')['Outcome'].index
cm = data[cols].corr()
plt.figure(figsize=(10,6))
sns.heatmap(cm, annot=True, cmap = 'viridis')

[64]: <matplotlib.axes._subplots.AxesSubplot at 0x1e80bdf0>

4 Key Findings and Insights


Since the data does not have any categorical dtypes, I assume it does not need one-hot encoding, so it is ready to be used for building a regression model.
From the feature engineering, I expect Glucose to be a good variable for predicting diabetes: when it is paired with the other variables, the points separate visibly by Outcome. To confirm this, I also built a correlation table to see how correlated glucose is with the outcome, and indeed glucose has the highest correlation with the outcome of all the attributes. I will also use BMI for future modelling because it, too, correlates well with the outcome.
Now that the independent variables have been examined and transformed, a predictive model can be built. Logistic regression works for a categorical target with values of 0 and 1 like diabetes.
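To illustrate why logistic regression fits a 0/1 target, here is a minimal from-scratch sketch on synthetic one-dimensional data standing in for the Glucose column (a toy example, not the model I would train on the actual dataset):

```python
import numpy as np

# Toy 1-D problem standing in for Glucose -> Outcome: higher values
# are labelled 1, mimicking the glucose/diabetes relationship.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.45, 0.08, 50), rng.normal(0.75, 0.08, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])

X = np.column_stack([np.ones_like(x), x])   # bias term + feature
w = np.zeros(2)

# Plain batch gradient descent on the logistic (cross-entropy) loss.
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid predictions in (0, 1)
    w -= 0.5 * X.T @ (p - y) / len(y)       # gradient step

pred = (1.0 / (1.0 + np.exp(-X @ w)) >= 0.5).astype(int)
accuracy = (pred == y).mean()
print(round(accuracy, 2))
```

In practice one would use a library implementation (e.g. scikit-learn's LogisticRegression) on the scaled features above rather than hand-rolled gradient descent.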

5 Hypothesis
Since, as noted above, glucose is the variable most correlated with diabetes, I use it for my first hypothesis. I also state hypotheses for BMI and blood pressure.

Hypothesis 1
H0 : glucose has no correlation with diabetes
H1 : glucose has a correlation with diabetes
Hypothesis 2
H0 : BMI has no correlation with diabetes
H1 : BMI has a correlation with diabetes
Hypothesis 3
H0 : Blood Pressure has no correlation with diabetes
H1 : Blood Pressure has a correlation with diabetes

Conducting a formal significance test for one of the hypotheses and discussing the results
Hypothesis 1 above will be tested to determine whether there is a correlation between glucose and diabetes, using the p-value from a Pearson correlation test.
Pearson Correlation Test
p-value
* p-value < 0.05 : reject H0 in favour of H1
* p-value ≥ 0.05 : fail to reject H0
Pearson correlation value
* 0.00 : no correlation
* 0.01 - 0.19 : negligible correlation
* 0.20 - 0.29 : weak correlation
* 0.30 - 0.39 : moderate correlation
* 0.40 - 0.69 : strong correlation
* 0.70 - 1.00 : very strong correlation
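The decision rules above can be expressed as a small helper (the strength labels simply encode the scale just listed; the function name is my own):

```python
def interpret_correlation(r, p, alpha=0.05):
    """Apply the rules above: reject H0 when p < alpha, and
    label the strength of |r| using the listed scale."""
    decision = "reject H0" if p < alpha else "fail to reject H0"
    a = abs(r)
    if a == 0.0:
        strength = "no correlation"
    elif a < 0.20:
        strength = "negligible"
    elif a < 0.30:
        strength = "weak"
    elif a < 0.40:
        strength = "moderate"
    elif a < 0.70:
        strength = "strong"
    else:
        strength = "very strong"
    return decision, strength

print(interpret_correlation(0.47, 0.0))  # → ('reject H0', 'strong')
```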

[76]: from scipy import stats

      correlation = pd.DataFrame(columns = ['r', 'p'])

      # Test only the Glucose column (column index 1).
      for i in data.iloc[:, 1:2]:
          if pd.api.types.is_numeric_dtype(data[i]):
              r, p = stats.pearsonr(data.Outcome, data[i])
              correlation.loc[i] = [round(r, 2), round(p, 2)]

      correlation.head()

correlation.head()

[76]: r p
Glucose 0.47 0.0

Result:
The p-value is less than 0.05, so H0 is rejected and H1 is accepted. The Pearson correlation value is 0.47; by the scale above, glucose therefore has a strong correlation with diabetes.

6 Suggestion and Conclusion


For future use, this data gives an understandable picture of how strongly some variables correlate with diabetes. Moreover, since the data comes from female patients only, it could be useful for predicting whether a given woman is at risk of diabetes.
This dataset is also very good for anyone new to machine learning practice: there is no missing data and there are only two attribute types (numeric and binary), which helps a newcomer understand the data quickly and effectively. Accordingly, the data processing done here is very simple.
