Diabetes and Glucose Correlation - IBM Machine Learning Training Project

New Project

March 21, 2021

1 Brief Description of the Dataset and a Summary of Its Attributes

The dataset for this project was obtained from Kaggle, where it is called “diabetes.csv”. It contains 768 samples and 9 attributes, one of which is the class variable, and all samples come from female patients. The attributes are listed below:
• Pregnancies: Number of times pregnant
• Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
• BloodPressure: Diastolic blood pressure (mm Hg)
• SkinThickness: Triceps skin fold thickness (mm)
• Insulin: 2-Hour serum insulin (mu U/ml)
• BMI: Body mass index (weight in kg/(height in m)^2)
• DiabetesPedigreeFunction: Diabetes pedigree function
• Age: Age (years)
• Outcome: Class variable (0 or 1)

[51]: import pandas as pd


data = pd.read_csv('data/diabetes.csv')

[52]: data.head()

[52]: Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \


0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1

DiabetesPedigreeFunction Age Outcome


0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1

[53]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----

0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

2 Initial Plan for Data Exploration


Initially, the data will be examined for missing values and outliers. Once outliers and missing data are addressed, exploratory data analysis will begin: summary statistics and visualizations such as histograms and box plots (using matplotlib and seaborn) will be generated to examine the data, and correlations will be computed. Each variable will be examined to see whether transformations are needed in order to analyze the data.

3 Data Cleaning and Feature Engineering


First, let's confirm the number of rows and columns.

[54]: samples = data.shape[0]


attributes = data.shape[1]
print(samples, "samples")
print(attributes, "attributes")

768 samples
9 attributes
Let's check whether there are any missing values in the data.

[55]: data.isnull().sum()

[55]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

As we can see, there are no missing values. Next, we look at the summary statistics of the data, which indicate whether there are outliers. To confirm this, we also use box plots.

[56]: data.describe([0.05,0.25,0.50,0.75,0.90,0.95,0.99]).T

[56]: count mean std min 5% \


Pregnancies 768.0 3.845052 3.369578 0.000 0.00000
Glucose 768.0 120.894531 31.972618 0.000 79.00000
BloodPressure 768.0 69.105469 19.355807 0.000 38.70000
SkinThickness 768.0 20.536458 15.952218 0.000 0.00000
Insulin 768.0 79.799479 115.244002 0.000 0.00000
BMI 768.0 31.992578 7.884160 0.000 21.80000
DiabetesPedigreeFunction 768.0 0.471876 0.331329 0.078 0.14035
Age 768.0 33.240885 11.760232 21.000 21.00000
Outcome 768.0 0.348958 0.476951 0.000 0.00000

25% 50% 75% 90% 95% \


Pregnancies 1.00000 3.0000 6.00000 9.0000 10.00000
Glucose 99.00000 117.0000 140.25000 167.0000 181.00000
BloodPressure 62.00000 72.0000 80.00000 88.0000 90.00000
SkinThickness 0.00000 23.0000 32.00000 40.0000 44.00000
Insulin 0.00000 30.5000 127.25000 210.0000 293.00000
BMI 27.30000 32.0000 36.60000 41.5000 44.39500
DiabetesPedigreeFunction 0.24375 0.3725 0.62625 0.8786 1.13285
Age 24.00000 29.0000 41.00000 51.0000 58.00000
Outcome 0.00000 0.0000 1.00000 1.0000 1.00000

99% max
Pregnancies 13.00000 17.00
Glucose 196.00000 199.00
BloodPressure 106.00000 122.00
SkinThickness 51.33000 99.00
Insulin 519.90000 846.00
BMI 50.75900 67.10
DiabetesPedigreeFunction 1.69833 2.42
Age 67.00000 81.00
Outcome 1.00000 1.00

[57]: data.plot(kind = 'box', subplots = True, layout = (3, 3), sharex = False,
          sharey = False, figsize = (14, 12));

Because there are so many outliers in the Insulin and DiabetesPedigreeFunction columns, I remove them from the dataset. I also remove the SkinThickness column, because a value of zero makes no physical sense for a skin-fold thickness measurement.

[58]: data = data.drop("Insulin", axis=1)


data = data.drop("DiabetesPedigreeFunction", axis=1)
data = data.drop("SkinThickness", axis=1)
data.head()

[58]: Pregnancies Glucose BloodPressure BMI Age Outcome


0 6 148 72 33.6 50 1
1 1 85 66 26.6 31 0
2 8 183 64 23.3 32 1
3 1 89 66 28.1 21 0
4 0 137 40 43.1 33 1
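An alternative to dropping these columns entirely, which I did not pursue here, is to treat the physiologically implausible zeros as missing values and impute them. A minimal sketch of that approach, using a small synthetic stand-in frame rather than the notebook's data:

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the diabetes data; zeros mark missing readings.
data = pd.DataFrame({
    'Glucose': [148, 85, 0, 89],
    'BloodPressure': [72, 0, 64, 66],
    'Insulin': [0, 0, 94, 168],
})
cols = ['Glucose', 'BloodPressure', 'Insulin']

# Treat physiologically implausible zeros as missing values.
data[cols] = data[cols].replace(0, np.nan)

# Impute each column with its median, which is robust to the heavy
# right tail seen in Insulin.
data[cols] = data[cols].fillna(data[cols].median())
print(data)
```

This keeps all three columns (and their predictive signal) in the dataset at the cost of some imputation bias.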

Next, we count how many diabetics (1) and non-diabetics (0) there are in this dataset.

[59]: import matplotlib.pyplot as plt
data["Outcome"].value_counts().plot(kind="bar", color = "Green")
plt.title("Outcome");

data.Outcome.value_counts()

[59]: 0 500
1 268
Name: Outcome, dtype: int64

Since our attributes are on different scales, which can hurt the accuracy of some regression models, I use simple feature scaling to normalize the values of each attribute except “Outcome”.

[60]: data["Pregnancies"] = data["Pregnancies"] / data["Pregnancies"].max()


data["Glucose"] = data["Glucose"] / data["Glucose"].max()
data["BloodPressure"] = data["BloodPressure"] / data["BloodPressure"].max()
data["BMI"] = data["BMI"] / data["BMI"].max()
data["Age"] = data["Age"] / data["Age"].max()
data.head()

[60]: Pregnancies Glucose BloodPressure BMI Age Outcome


0 0.352941 0.743719 0.590164 0.500745 0.617284 1
1 0.058824 0.427136 0.540984 0.396423 0.382716 0
2 0.470588 0.919598 0.524590 0.347243 0.395062 1

3 0.058824 0.447236 0.540984 0.418778 0.259259 0
4 0.000000 0.688442 0.327869 0.642325 0.407407 1
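Note that simple feature scaling only divides by the column maximum; min-max scaling, another common choice, also subtracts the minimum so that each column spans exactly [0, 1]. A sketch of both on a tiny stand-in frame (not the notebook's data):

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this would be the diabetes data.
df = pd.DataFrame({'Glucose': [148, 85, 183], 'Age': [50, 31, 32], 'Outcome': [1, 0, 1]})
features = df.columns.drop('Outcome')

# Simple feature scaling: divide by the column maximum (used above).
simple = df[features] / df[features].max()

# Min-max scaling: also subtracts the minimum, mapping each column onto [0, 1].
minmax = (df[features] - df[features].min()) / (df[features].max() - df[features].min())

print(simple.round(3))
print(minmax.round(3))
```

Either works for this dataset; min-max scaling just guarantees the minimum maps to exactly 0.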

Histograms were generated for the data. Again, the plots to examine are Pregnancies, Glucose, BloodPressure, BMI, and Age. They appear to need transformations so that linear techniques can be used.

[61]: data.hist(figsize = (12, 12))


plt.show();

[62]: data.skew(axis=0, skipna=True)

[62]: Pregnancies 0.901674


Glucose 0.173754

BloodPressure -1.843608
BMI -0.428982
Age 1.129597
Outcome 0.635017
dtype: float64
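The skew values above show that Pregnancies and Age are noticeably right-skewed (and BloodPressure left-skewed). For the right-skewed columns, a log transform is one standard option; a sketch on synthetic Age-like values, not the notebook's frame:

```python
import numpy as np
import pandas as pd

# Right-skewed toy column (Age-like values with a long right tail).
s = pd.Series([21, 22, 24, 25, 29, 33, 41, 51, 58, 81], dtype=float)

# log1p compresses the right tail; it handles zeros safely (log(1 + x)).
transformed = np.log1p(s)

print(round(s.skew(), 2), round(transformed.skew(), 2))
```

The transformed column has visibly lower skew, which helps techniques that assume roughly symmetric inputs.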

[63]: import seaborn as sns


sns.set_context('talk')
sns.pairplot(data, hue='Outcome');

[64]: k = 6 #number of variables for heatmap


cols = data.corr().nlargest(k, 'Outcome')['Outcome'].index
cm = data[cols].corr()
plt.figure(figsize=(10,6))
sns.heatmap(cm, annot=True, cmap = 'viridis')

[64]: <matplotlib.axes._subplots.AxesSubplot at 0x1e80bdf0>

4 Key Findings and Insights


Since the data does not have any categorical dtypes, I assume it does not need one-hot encoding, so it is ready to be used for building a regression model.
From the feature engineering, I expect Glucose to be a good variable for predicting diabetes: when it is paired with the other variables, the points separate visibly by Outcome. To confirm this, I also built a correlation table to see how correlated glucose is with the outcome, and indeed glucose has the highest correlation with the outcome of all the attributes. I will also use BMI for future modelling because it, too, correlates well with the outcome.
Now that the independent variables have been examined and transformed, a predictive model can be built. Logistic regression works for a categorical target with values of 0 and 1 like diabetes.
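To illustrate why logistic regression fits a 0/1 target, here is a minimal from-scratch sketch on synthetic one-dimensional data standing in for the Glucose column (a toy example, not the model I would train on the actual dataset):

```python
import numpy as np

# Toy 1-D problem standing in for Glucose -> Outcome: higher values
# are labelled 1, mimicking the glucose/diabetes relationship.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.45, 0.08, 50), rng.normal(0.75, 0.08, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])

X = np.column_stack([np.ones_like(x), x])   # bias term + feature
w = np.zeros(2)

# Plain batch gradient descent on the logistic (cross-entropy) loss.
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid predictions in (0, 1)
    w -= 0.5 * X.T @ (p - y) / len(y)       # gradient step

pred = (1.0 / (1.0 + np.exp(-X @ w)) >= 0.5).astype(int)
accuracy = (pred == y).mean()
print(round(accuracy, 2))
```

In practice one would use a library implementation (e.g. scikit-learn's LogisticRegression) on the scaled features above rather than hand-rolled gradient descent.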

5 Hypothesis
Since, as noted above, glucose is the variable most correlated with diabetes, I use it for my first hypothesis. I also state hypotheses for BMI and blood pressure.

Hypothesis 1
H0 : glucose has no correlation with diabetes
H1 : glucose has a correlation with diabetes
Hypothesis 2
H0 : BMI has no correlation with diabetes
H1 : BMI has a correlation with diabetes
Hypothesis 3
H0 : Blood Pressure has no correlation with diabetes
H1 : Blood Pressure has a correlation with diabetes

Conducting a formal significance test for one of the hypotheses and discussing the results
Hypothesis 1 above will be tested to determine whether there is a correlation between glucose and diabetes, using the p-value from a Pearson correlation test.
Pearson Correlation Test
p-value
* p-value < 0.05 : reject H0 in favour of H1
* p-value ≥ 0.05 : fail to reject H0
Pearson correlation value
* 0.00 : no correlation
* 0.01 - 0.19 : negligible correlation
* 0.20 - 0.29 : weak correlation
* 0.30 - 0.39 : moderate correlation
* 0.40 - 0.69 : strong correlation
* 0.70 - 1.00 : very strong correlation
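The decision rules above can be expressed as a small helper (the strength labels simply encode the scale just listed; the function name is my own):

```python
def interpret_correlation(r, p, alpha=0.05):
    """Apply the rules above: reject H0 when p < alpha, and
    label the strength of |r| using the listed scale."""
    decision = "reject H0" if p < alpha else "fail to reject H0"
    a = abs(r)
    if a == 0.0:
        strength = "no correlation"
    elif a < 0.20:
        strength = "negligible"
    elif a < 0.30:
        strength = "weak"
    elif a < 0.40:
        strength = "moderate"
    elif a < 0.70:
        strength = "strong"
    else:
        strength = "very strong"
    return decision, strength

print(interpret_correlation(0.47, 0.0))  # → ('reject H0', 'strong')
```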

[76]: from scipy import stats

      correlation = pd.DataFrame(columns = ['r', 'p'])

      # Test only the Glucose column (column index 1).
      for i in data.iloc[:, 1:2]:
          if pd.api.types.is_numeric_dtype(data[i]):
              r, p = stats.pearsonr(data.Outcome, data[i])
              correlation.loc[i] = [round(r, 2), round(p, 2)]

      correlation.head()

correlation.head()

[76]: r p
Glucose 0.47 0.0

Result:
The p-value is less than 0.05, so H0 is rejected and H1 is accepted. The Pearson correlation value is 0.47; by the scale above, glucose therefore has a strong correlation with diabetes.

6 Suggestion and Conclusion


For future use, this data gives an understandable picture of how strongly some variables correlate with diabetes. Moreover, since the data comes from female patients only, it could be useful for predicting whether a given woman is at risk of diabetes.
This dataset is also very good for anyone new to machine learning practice: there is no missing data and there are only two attribute types (numeric and binary), which helps a newcomer understand the data quickly and effectively. Accordingly, the data processing done here is very simple.
