Diabetes Prediction Using Machine Learning
David
2024-11-15
Introduction
Diabetes is a metabolic condition that causes excessively high blood sugar levels (MSD Manual).
It occurs when the pancreas does not produce enough insulin or when the body cannot
use the insulin it produces effectively. Insulin is the hormone that regulates blood glucose
by moving sugar from the blood into the cells for storage or energy use.
Patients suspected of having diabetes must undergo a series of tests and
examinations before the disease can be properly diagnosed, and these are expensive
(Tasin, Nabil, Islam, & Khan, 2022). A predictive model that can accurately detect
diabetes is therefore needed, as it would support early detection, treatment and
management of the disease.
Study Objective
The main objective of this study was to develop a predictive model that can accurately
predict (with high precision) the likelihood of developing diabetes, and to identify the
most important predictors of diabetes.
# Load packages
suppressMessages({
  library(tidyverse)
  library(janitor)
  library(caret)
  library(mlr)
  library(tidymodels)
  library(pROC)
  library(vip)
  library(parallel)
  library(parallelMap)
})
# Import data
diabetes <- read_csv("diabetes.txt")
## Rows: 15,000
## Columns: 10
## $ PatientID              <dbl> 1354778, 1147438, 1640031, 1883350, 1424119, 16…
## $ Pregnancies            <dbl> 0, 8, 7, 9, 1, 0, 0, 0, 8, 1, 1, 3, 5, 7, 0, 3,…
## $ PlasmaGlucose          <dbl> 171, 92, 115, 103, 85, 82, 133, 67, 80, 72, 88,…
## $ DiastolicBloodPressure <dbl> 80, 93, 47, 78, 59, 92, 47, 87, 95, 31, 86, 96,…
## $ TricepsThickness       <dbl> 34, 47, 52, 25, 27, 9, 19, 43, 33, 40, 11, 31, …
## $ SerumInsulin           <dbl> 23, 36, 35, 304, 35, 253, 227, 36, 24, 42, 58, …
## $ BMI                    <dbl> 43.50973, 21.24058, 41.51152, 29.58219, 42.6045…
## $ DiabetesPedigree       <dbl> 1.21319135, 0.15836498, 0.07901857, 1.28286985,…
## $ Age                    <dbl> 21, 23, 23, 43, 22, 26, 21, 26, 53, 26, 22, 23,…
## $ Diabetic               <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
The data contains 15,000 observations of 10 variables, all of which are numeric
(double).
## integer(0)
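The later chunks work with an object called `dbTib` whose target column is `diabetic`, but the step that creates it is not shown. A minimal sketch of one way to derive it follows, assuming that `PatientID` is dropped, that the target is renamed to lower case, and that it is recoded as a factor with levels "Yes" and "No" (suggested by the `prob.Yes` column used later). The first four rows of the printed data stand in for the full file, and the recoding itself is an assumption rather than the original code.

```r
# First four rows from the printed data, standing in for diabetes.txt
diabetes <- data.frame(
  PatientID = c(1354778, 1147438, 1640031, 1883350),
  Pregnancies = c(0, 8, 7, 9),
  PlasmaGlucose = c(171, 92, 115, 103),
  DiastolicBloodPressure = c(80, 93, 47, 78),
  TricepsThickness = c(34, 47, 52, 25),
  SerumInsulin = c(23, 36, 35, 304),
  BMI = c(43.50973, 21.24058, 41.51152, 29.58219),
  DiabetesPedigree = c(1.21319135, 0.15836498, 0.07901857, 1.28286985),
  Age = c(21, 23, 23, 43),
  Diabetic = c(0, 0, 0, 1)
)

# Assumption: the identifier carries no predictive information, so drop it
dbTib <- diabetes[, names(diabetes) != "PatientID"]

# Assumption: lower-case target name, recoded as a factor with "Yes" first
names(dbTib)[names(dbTib) == "Diabetic"] <- "diabetic"
dbTib$diabetic <- factor(dbTib$diabetic, levels = c(1, 0), labels = c("Yes", "No"))

str(dbTib)
```

With this layout the target is the ninth column, which is consistent with the `dbTib[-9]` used for the correlation plot.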
EDA
# Generate summary statistics for each and every variable.
dbTib |> summary()
The minimum and first quartile of the number of pregnancies are both zero, so at
least a quarter of the patients have never been pregnant. The data has no sex
variable, so some of these may be male patients. The median number of
pregnancies is 2 and the maximum is 14.
The median values for plasma glucose, diastolic blood pressure, triceps
thickness, serum insulin, BMI and diabetes pedigree are 104, 72, 31, 83, 31.77
and 0.2 units respectively. The median age is 24 years.
There are 5,000 diabetic and 10,000 non-diabetic patients, so the classes are
imbalanced at a ratio of 1:2.
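To put the imbalance in numbers, the short sketch below computes the class shares from the reported counts, together with the accuracy of a trivial classifier that always predicts the majority class. The names `class_counts` and `baseline_acc` are illustrative; the baseline is a reference point against which the accuracies reported later can be read.

```r
# Class counts as reported in the text
class_counts <- c(No = 10000, Yes = 5000)

# Share of each class in the full data
class_props <- class_counts / sum(class_counts)
print(round(class_props, 3))

# Accuracy of a trivial classifier that always predicts the majority class
baseline_acc <- max(class_props)
print(round(baseline_acc, 3))
```

Any useful model should clear this baseline of roughly two thirds by a comfortable margin.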
# Convert the data into longer format for visualization
diabetesUntidy <- gather(dbTib, key = "Variable", value = "Value", -diabetic)
Most non-diabetic patients are young (under 30 years), although there are also
many older patients who are non-diabetic.
Diabetic patients tend to have higher BMI, serum insulin and plasma glucose, and
greater triceps thickness.
Diabetic patients are more likely to have had a high number of pregnancies.
Based on the distributions of age, BMI, diabetes pedigree, plasma glucose, number of
pregnancies and serum insulin, the two classes seem to be separable.
# Check for highly correlated features
corrplot::corrplot(cor(dbTib[-9]))
Model Training
I’ll try six algorithms: Logistic Regression, a Naive Bayes classifier, KNN,
Random Forest, XGBoost and an Artificial Neural Network. Before training the models,
I’ll split the data into training and test sets. The training set will be used to train and
fine-tune the models with cross-validation, and the test set will be used for model
validation.
# Partition the data into training and test sets (use 75/25 split)
# Set random seed for reproducibility
set.seed(1234)
# Data partitioning
train_index <- createDataPartition(dbTib$diabetic, p = 0.75, list = FALSE)
# Assign 75% to training set
training_data <- dbTib[train_index, ]
# Assign the remaining 25% to test set
test_data <- dbTib[-train_index, ]
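When the outcome is a factor, `createDataPartition()` samples within each class so that both sets preserve the class ratio. The base R sketch below imitates that idea on a made-up outcome vector with the reported class counts. It illustrates stratified sampling and is not a reimplementation of caret's function; `y`, `idx_by_class` and `train_idx` are hypothetical names.

```r
set.seed(1234)

# Hypothetical outcome vector with the class counts reported earlier
y <- factor(rep(c("No", "Yes"), times = c(10000, 5000)))

# Sample 75% of the row indices within each class, then combine them
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(
  lapply(idx_by_class, function(i) sample(i, size = round(0.75 * length(i)))),
  use.names = FALSE
)

y_train <- y[train_idx]
y_test <- y[-train_idx]

# Both sets keep the original 2:1 ratio of "No" to "Yes"
print(table(y_train))
print(table(y_test))
```

Under these assumptions the test set holds 3,750 patients, one third of them diabetic, the same share as in the full data.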
# Define learner
logReg <- makeLearner("classif.logreg", predict.type = "prob")
The Logistic Regression model has a training accuracy of 78.83%, which is reasonable
given how simple the model is. The model generalizes well. It does, however, have a
high false negative rate.
The Naive Bayes model has a training accuracy of 78.68%. This model also has a high
false negative rate and performs slightly worse than the Logistic Regression
model.
KNN model
# Make learner
knnLearner <- makeLearner("classif.knn")
## $k
## [1] 7
Optimal value of k = 7.
# Check mmce value
tunedK$y
The model has a low mean misclassification error (mmce) of 0.16, which corresponds
to a training accuracy of 84% and indicates good performance. The KNN model performs
better than the Logistic Regression and Naive Bayes models.
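The mmce is the share of misclassified cases, so accuracy is its complement (here 1 - 0.16 = 0.84). The sketch below shows the relationship on ten made-up predictions; the vectors `truth` and `response` are hypothetical and are not taken from the model output.

```r
# Hypothetical true labels and predictions for ten patients
truth    <- c("Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No")
response <- c("Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "No")

# mmce is the share of predictions that disagree with the truth
mmce <- mean(truth != response)

# Accuracy is its complement
accuracy <- 1 - mmce

print(c(mmce = mmce, accuracy = accuracy))
```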
# Visualize the tuning process
# Obtained model data
knnTuningData <- generateHyperParsEffectData(tunedK)
# Plot
plotHyperParsEffect(knnTuningData, x = "k", y = "acc.test.mean",
plot.type = "line") +
theme_bw()
Accuracy is highest at k = 7.
# Set hyperparameters for the final model
tunedKnn <- setHyperPars(makeLearner("classif.knn"),
par.vals = tunedK$x)
# Train the final model
tunedKnnModel <- train(tunedKnn, diabetesTask)
# Start parallelization
parallelStartSocket(cpus = detectCores())
# Stop parallelization
parallelStop()
# View CV results
tuned_rf_Pars
## Tune result:
## Op. pars: ntree=216; mtry=10; nodesize=21; maxnodes=20
## acc.test.mean=0.8934222,fpr.test.mean=0.0944601,fnr.test.mean=0.1309094
The RF model has a training (cross-validated) accuracy of 89.34%, which is good. The
model performs better than the Logistic Regression, Naive Bayes and KNN models.
# Set the optimal hyperparameters for the final model
tuned_rf <- setHyperPars(rf, par.vals = tuned_rf_Pars$x)
# Train the final model with the optimal hyperparameters
tuned_rf_Model <- train(tuned_rf, diabetesTask)
The mean out-of-bag error begins to stabilize early, at about 50 trees, which implies
that the forest has enough trees. The positive class has a high mean out-of-bag
error rate.
XGBoost
# Define learner
XGB <- makeLearner("classif.xgboost", predict.type = "prob")
A training accuracy of 96.07% is very good, though the model might be overfitting the
training data. XGBoost outperforms all the previous models.
# Train the final model using optimal hyperparameters
Neural Network
# Define learner for the neural network
nnet <- makeLearner("classif.nnet", predict.type = "prob")
# Start parallelization
parallelStartSocket(cpus = detectCores())
## Starting parallelization in mode=socket with cpus=4.
# Stop parallelization
parallelStop()
# View CV results
print(tunedNnetPars$x)
## $size
## [1] 6
##
## $decay
## [1] 0.3855826
tunedNnetPars$y
The Neural Net performs better than the Logistic Regression, Naive Bayes and KNN
classifiers, but is outperformed by Random Forest and XGBoost. The Neural Net has a
training accuracy of 85.67%. The optimal hyperparameters are a hidden layer of 6 units
and a weight decay of 0.386.
# Train the final model using optimal hyperparameters
## # weights: 61
## initial value 16193.872951
## iter 10 value 7173.257497
## iter 20 value 6958.835329
## iter 30 value 6667.220271
## iter 40 value 6130.795307
## iter 50 value 5872.353183
## iter 60 value 5131.380692
## iter 70 value 4675.695522
## iter 80 value 4441.830627
## iter 90 value 4352.662468
## iter 100 value 4257.294982
## final value 4257.294982
## stopped after 100 iterations
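The `# weights: 61` line in the log can be checked by hand. For a single-hidden-layer network with bias terms and no skip-layer connections, the weight count is (inputs + 1) × hidden units + (hidden units + 1) × outputs. The sketch below assumes 8 predictors (the 10 columns minus `PatientID` and the target) and one output unit; `count_nnet_weights` is a hypothetical helper, not part of the nnet package.

```r
# Number of weights in a single-hidden-layer network with bias terms
count_nnet_weights <- function(n_inputs, n_hidden, n_outputs = 1) {
  hidden_layer <- (n_inputs + 1) * n_hidden   # inputs plus bias into each hidden unit
  output_layer <- (n_hidden + 1) * n_outputs  # hidden units plus bias into each output
  hidden_layer + output_layer
}

# Assumption: 8 predictors and the tuned hidden layer size of 6
n_weights <- count_nnet_weights(n_inputs = 8, n_hidden = 6)
print(n_weights)
```

Under these assumptions the count comes to 61, which agrees with the log. The log also ends with "stopped after 100 iterations", which matches nnet's default iteration limit (`maxit = 100`), so the optimizer may have stopped before fully converging.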
# Benchmark
bench <- benchmark(learners, diabetesTask, benchCV,
show.info = FALSE, measures = list(acc, kappa))
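The benchmark reports Cohen's kappa alongside accuracy. Kappa compares the observed agreement between predictions and truth with the agreement expected by chance, which makes it more informative than accuracy when classes are imbalanced. A minimal worked example on a hypothetical confusion matrix follows; the counts are made up for illustration.

```r
# Hypothetical confusion matrix: rows are predictions, columns are true classes
cm <- matrix(c(80, 10, 20, 90), nrow = 2,
             dimnames = list(predicted = c("Yes", "No"), truth = c("Yes", "No")))

n <- sum(cm)

# Observed agreement: share of cases on the diagonal (this is the accuracy)
observed_agreement <- sum(diag(cm)) / n

# Chance agreement: product of the row and column marginals, summed over classes
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2

kappa_value <- (observed_agreement - expected_agreement) / (1 - expected_agreement)
print(round(c(accuracy = observed_agreement, kappa = kappa_value), 3))
```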
Model Evaluation
I’ll use the two best-performing models (Random Forest and XGBoost) to make
predictions on the test data and evaluate how they would perform on unseen data.
# Use the RF model to make predictions on test data
rf_Preds <- predict(tuned_rf_Model, newdata = test_data)
# Collect prediction
rf_Preds_data <- rf_Preds$data
# Calculate confusion matrix (caret expects predictions first, then the reference)
confusionMatrix(data = rf_Preds_data$response, reference = rf_Preds_data$truth,
                positive = "Yes")
The Random Forest model has a validation accuracy of 88.93%. With "Yes" (diabetic) as
the positive class, sensitivity is 86.32% and precision is 81.56%, and the negative
predictive value is 0.9295. The training accuracy is slightly higher than the validation
accuracy, suggesting that the model didn’t overfit the training data.
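The metrics discussed here all derive from the four cells of the confusion matrix. The sketch below computes them from hypothetical counts and shows why the orientation of the table matters: when a table is supplied, caret's `confusionMatrix()` treats rows as predictions and columns as the reference, so a transposed table swaps sensitivity and precision. The counts are made up and are not the model's results.

```r
# Hypothetical counts, with "Yes" (diabetic) as the positive class
tp <- 80; fp <- 20; fn <- 10; tn <- 90

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)  # share of true diabetics that are detected
specificity <- tn / (tn + fp)  # share of non-diabetics correctly cleared
precision   <- tp / (tp + fp)  # share of predicted diabetics that are truly diabetic

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))

# Orientation matters: with predictions in rows, the first row sum is tp + fp
cm <- matrix(c(tp, fn, fp, tn), nrow = 2,
             dimnames = list(predicted = c("Yes", "No"), truth = c("Yes", "No")))
precision_from_cm <- cm["Yes", "Yes"] / sum(cm["Yes", ])

# Transposing the table turns the same calculation into sensitivity
cm_t <- t(cm)
swapped <- cm_t["Yes", "Yes"] / sum(cm_t["Yes", ])

print(c(precision_from_cm = precision_from_cm, after_transpose = swapped))
```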
# Calculate ROC AUC value
rf_Preds_data |> roc_auc(truth = truth, prob.Yes)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.919
A ROC AUC of 0.919 for the Random Forest model is very good: it indicates that the
model separates diabetic from non-diabetic patients well.
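The AUC can be read as the probability that a randomly chosen diabetic patient receives a higher predicted probability than a randomly chosen non-diabetic patient. The sketch below computes it two ways on made-up scores: with the rank-based (Mann-Whitney) formula and by counting correctly ordered pairs. The data are hypothetical and contain no ties; this illustrates the quantity and is not how yardstick computes it internally.

```r
# Hypothetical predicted probabilities of "Yes" and true labels
prob_yes <- c(0.92, 0.81, 0.67, 0.55, 0.43, 0.38, 0.21, 0.10)
truth    <- c("Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No")

is_pos <- truth == "Yes"
n_pos <- sum(is_pos)
n_neg <- sum(!is_pos)

# Rank-based (Mann-Whitney) form of the AUC
ranks <- rank(prob_yes)
auc_rank <- (sum(ranks[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Same quantity by brute force: share of (Yes, No) pairs ordered correctly
pair_wins <- outer(prob_yes[is_pos], prob_yes[!is_pos], ">")
auc_pairs <- mean(pair_wins)

print(c(auc_rank = auc_rank, auc_pairs = auc_pairs))
```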
# Plot ROC curve
rf_Preds_data |> roc_curve(truth = truth, prob.Yes) |> autoplot()
The ROC curve looks good: it rises steeply and closely approaches the top-left corner,
where the AUC would be 1.
# Plot variable importance for the Random Forest model
vip(tuned_rf_Model)
The Random Forest model ranks the number of pregnancies as the most important
predictor of diabetes, followed, in order, by BMI, serum insulin, age, plasma glucose,
diabetes pedigree and triceps thickness.
# Use the XGBoost model to make predictions on test data
xgbPreds <- predict(tunedXgbModel, newdata = test_data)
# Collect prediction
xgbPreds_data <- xgbPreds$data
The XGBoost model reaches a validation accuracy of 95.71%, an excellent performance.
Sensitivity and Specificity for the XGBoost model are both good, with Specificity
being high, and the trade-off between them is small. With "Yes" (diabetic) as the
positive class, Sensitivity and Precision for this model are also good (93.20% and
93.88% respectively).
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.992
The XGBoost model has a ROC AUC of 0.992, which is excellent: the model separates
the two classes almost perfectly on the test data.
# Plot ROC curve
xgbPreds_data |> roc_curve(truth = truth, prob.Yes) |> autoplot()
The curve almost touches the top-left corner, where the AUC would be 1.
# Plot variable importance for the XGBoost model
vip(tunedXgbModel, type = "gain")
Based on the gain score, the number of pregnancies is the most important
predictor of diabetes, followed, in order, by age, BMI, serum insulin, plasma glucose,
triceps thickness, diastolic blood pressure and diabetes pedigree.
The main limitation of this analysis is that the class imbalance in the data was
not addressed.
References
Rhys, H. I. (2020). Machine learning with R, the tidyverse, and mlr. Manning
Publications. https://livebook.manning.com/book/machine-learning-with-r-the-tidyverse-and-mlr/about-this-book
Tasin, I., Nabil, T. U., Islam, S., & Khan, R. (2022). Diabetes prediction using machine
learning and explainable AI techniques. Healthcare Technology Letters, 10(1-2), 1-10.
https://pmc.ncbi.nlm.nih.gov/articles/PMC10107388/#htl212039-sec-0010