Diabetes and Glucose Correlation - IBM Machine Learning Training Project
[52]: data.head()
[53]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
768 samples
9 attributes
Let's check whether the project data contains any missing values.
[55]: data.isnull().sum()
[55]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
As we can see, there are no missing values. Next, we look at the summary statistics of the data
to see whether there are outliers, and we use boxplots to confirm it.
[56]: data.describe([0.05,0.25,0.50,0.75,0.90,0.95,0.99]).T
99% max
Pregnancies 13.00000 17.00
Glucose 196.00000 199.00
BloodPressure 106.00000 122.00
SkinThickness 51.33000 99.00
Insulin 519.90000 846.00
BMI 50.75900 67.10
DiabetesPedigreeFunction 1.69833 2.42
Age 67.00000 81.00
Outcome 1.00000 1.00
[57]: data.plot(kind = 'box', subplots = True, layout = (3, 3), sharex = False,
                sharey = False, figsize = (14, 12));
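The boxplot whiskers use matplotlib's default 1.5 x IQR rule, so the points drawn as outliers can also be counted directly. The following is a minimal sketch, assuming a toy DataFrame in place of `data` (the values are made up for illustration) and a hypothetical helper named `count_iqr_outliers`:

```python
import pandas as pd

def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per column that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on the column names.
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum()

# Toy stand-in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Insulin": [0, 0, 94, 168, 88, 0, 543, 846],
    "Age": [50, 31, 32, 21, 33, 30, 26, 29],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Applied to the real data, a call such as `count_iqr_outliers(data.drop(columns="Outcome"))` would give one count per attribute to set beside the boxplots.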
Because there are so many outliers in the Insulin and DiabetesPedigreeFunction columns, I remove
both columns from the data set. I also remove the SkinThickness column, because its distribution
extends down to zero, and a skin thickness of zero does not make sense for any person.
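The column removal described above could look like the following. This is only a sketch on a made-up three-row frame standing in for `data`; the names `toy`, `zero_skin` and `reduced` are assumptions, and the original notebook cell is not shown in this export.

```python
import pandas as pd

# Toy stand-in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "SkinThickness": [35, 29, 0],
    "Insulin": [0, 0, 0],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672],
    "Outcome": [1, 0, 1],
})

# A skin thickness of zero is implausible, so such entries may be
# placeholders for measurements that were never taken.
zero_skin = int((toy["SkinThickness"] == 0).sum())

# Drop the three columns named in the text; the original frame is left untouched.
reduced = toy.drop(columns=["Insulin", "DiabetesPedigreeFunction", "SkinThickness"])
print(zero_skin, list(reduced.columns))
```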
Next, we need to know how many diabetics (1) and non-diabetics (0) are counted in this data set.
[59]: import matplotlib.pyplot as plt
data["Outcome"].value_counts().plot(kind="bar", color = "Green")
plt.title("Outcome");
data.Outcome.value_counts()
[59]: 0 500
1 268
Name: Outcome, dtype: int64
Since the attributes are on very different scales, which can make regression models less
accurate, I use simple feature scaling to normalize the values of every attribute except “Outcome”.
3 0.058824 0.447236 0.540984 0.418778 0.259259 0
4 0.000000 0.688442 0.327869 0.642325 0.407407 1
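Simple feature scaling is usually taken to mean dividing each value by its column maximum, which maps non-negative columns into the range 0 to 1. A minimal sketch of that step, assuming a toy frame in place of the reduced data set (the values and the names `toy`, `features` and `scaled` are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for the reduced data set; values are illustrative only.
toy = pd.DataFrame({
    "Glucose": [50, 100, 200],
    "BMI": [20.0, 30.0, 40.0],
    "Outcome": [0, 0, 1],
})

# Scale every attribute except the label column.
features = toy.columns.drop("Outcome")
scaled = toy.copy()
scaled[features] = toy[features] / toy[features].max()
print(scaled)
```

Leaving "Outcome" out of `features` keeps the label as 0/1 integers.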
Histograms were generated for the data. Again, the plots to examine are Pregnancies, Glucose,
BloodPressure, BMI, and Age. They appear to need transformations so that linear techniques can
be used.
BloodPressure -1.843608
BMI -0.428982
Age 1.129597
Outcome 0.635017
dtype: float64
[64]: <matplotlib.axes._subplots.AxesSubplot at 0x1e80bdf0>
5 Hypothesis
As noted earlier, glucose appears to be more strongly correlated with diabetes than the other
variables, so I take it as my first hypothesis. I will also state hypotheses for BMI and blood pressure.
Hypothesis 1
H0 : glucose has no correlation with diabetes
H1 : glucose has a correlation with diabetes
Hypothesis 2
H0 : BMI has no correlation with diabetes
H1 : BMI has a correlation with diabetes
Hypothesis 3
H0 : Blood Pressure has no correlation with diabetes
H1 : Blood Pressure has a correlation with diabetes
Conducting a formal significance test for one of the hypotheses and discussing the results
Hypothesis 1 listed above will be tested with the Pearson correlation test to determine whether
there is a correlation between glucose and diabetes. The p-value and the correlation value are
interpreted as follows.
Pearson Correlation Test
p-value
* p-value < 0.05 : reject H0 in favor of H1
* p-value > 0.05 : fail to reject H0
Pearson correlation value
* 0.00 : no correlation
* 0.01 - 0.19 : negligible correlation
* 0.20 - 0.29 : weak correlation
* 0.30 - 0.39 : moderate correlation
* 0.40 - 0.69 : strong correlation
* 0.70 - 1.00 : very strong correlation
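The decision rule and the strength bands above can be written as a small helper. This is only a sketch: `interpret_pearson` is a hypothetical name, the bands are applied to the absolute value of r rounded to two decimals, and a p-value of exactly 0.05 (which the rule above does not cover) is treated here as not significant.

```python
def interpret_pearson(r: float, p: float, alpha: float = 0.05) -> tuple[str, str]:
    """Return (decision on H0, strength label) using the bands listed above."""
    decision = "reject H0" if p < alpha else "fail to reject H0"
    size = round(abs(r), 2)  # the bands are stated to two decimals
    if size == 0.0:
        strength = "no correlation"
    elif size < 0.20:
        strength = "negligible"
    elif size < 0.30:
        strength = "weak"
    elif size < 0.40:
        strength = "moderate"
    elif size < 0.70:
        strength = "strong"
    else:
        strength = "very strong"
    return decision, strength

print(interpret_pearson(0.47, 0.0))
```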
correlation.head()
[76]: r p
Glucose 0.47 0.0
Result:
The p-value is less than 0.05, so H0 is rejected in favor of H1. The Pearson correlation value is
0.47, which falls in the 0.40 - 0.69 band. Therefore, glucose has a strong correlation with diabetes.
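The cell that built `correlation` is not included in this export. One way such a table of r and p might be produced is with `scipy.stats.pearsonr`. The sketch below runs on made-up numbers, and the way `correlation` is constructed here is an assumption, not the original code.

```python
import pandas as pd
from scipy.stats import pearsonr

# Made-up values for illustration only; not taken from the project data.
toy = pd.DataFrame({
    "Glucose": [85, 89, 116, 78, 137, 148, 183, 197],
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})

# pearsonr returns the correlation coefficient and the two-sided p-value.
r, p = pearsonr(toy["Glucose"], toy["Outcome"])

# Collect the rounded results in a one-row table indexed by the attribute name.
correlation = pd.DataFrame({"r": [round(r, 2)], "p": [round(p, 3)]}, index=["Glucose"])
print(correlation)
```

Because "Outcome" is binary, the Pearson coefficient computed this way is the point-biserial correlation. A p-value displayed as 0.0 is a rounded value, not an exact zero.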
about data more quickly and effectively. The data processing that I do is also very simple.