Prediction of Cardiovascular Disease Using Machine Learning Techniques
Abstract—New technologies such as machine learning and big data analytics have proven to provide promising solutions for biomedical communities, healthcare problems, and patient care. They also help in the early prediction of disease through accurate interpretation of medical data. Disease management strategies can be further improved by detecting early signs of disease. Early prediction, moreover, can help in controlling the symptoms of the disease and in treating it properly. Machine learning approaches can be used to predict chronic diseases, such as kidney and heart diseases, by developing classification models. In this paper, we propose a preprocessing-intensive approach to predict Coronary Heart Disease (CHD). The approach involves replacing null values, resampling, standardization, normalization, classification, and prediction. This work aims to predict the risk of CHD using machine learning algorithms such as Random Forest, Decision Tree, and K-Nearest Neighbour. A comparative study of these algorithms on the basis of prediction accuracy is also performed. Further, K-fold cross-validation is used to generate randomness in the data. The algorithms are evaluated on the “Framingham Heart Study” dataset, which has 4240 records. In our experimental analysis, Random Forest, Decision Tree, and K-Nearest Neighbour achieved accuracies of 96.8%, 92.7%, and 92.89%, respectively. Therefore, with our preprocessing steps, Random Forest classification gives more accurate results than the other machine learning algorithms.

Index Terms—Random Forest, Decision Tree, K-Nearest Neighbour, Coronary Heart Disease

... in the US, the government spends around 2.7 trillion USD annually on the treatment of chronic diseases.

Continuous technological improvements have helped researchers develop new methodologies based on artificial intelligence and machine learning. Rising health issues have also led to an increase in the generation of big data, and utilizing this big data requires an automatic computer-based system that can predict those diseases by deploying machine learning algorithms that work efficiently despite the various challenges that occur in the datasets.

Along with other chronic healthcare problems such as obesity, smoking, diabetes, etc., cardiovascular diseases (CVDs) are also found to be a huge risk factor. Population aging, especially in developed countries, is highly correlated with CVDs, meaning that older people are more prone to CVDs [3]. According to the World Health Organisation report in 2014 [4], cardiovascular diseases including heart attacks, chronic heart failure, cardiac arrest, etc. led to the death of 17.5 million people. Most heart diseases cannot be detected by the primitive Electrocardiography (ECG) process [6], so many preventive sensors and devices have been invented, such as the Phonocardiogram (PCG), Electromyogram (EGM), etc.
978-1-7281-1895-6/19/$31.00 ©2019 IEEE
II. RELATED WORK

Researchers have applied machine learning and data-mining approaches to health-related issues in different ways, using various approaches to classify and predict chronic diseases.

Data mining techniques have been shown to predict chronic kidney disease efficiently. Weka [5] is a user interface that provides a data processing cycle, including data preprocessing, classification, and other data mining techniques. It has a large collection of data to which various algorithms can be applied. This tool shows that Random Forest gives the best result among algorithms such as Naive Bayes, J48, etc.

Heart sounds can also be used to detect chronic heart failure [6]. The method involves filtering of audio signals, segmentation for feature extraction, and machine learning. It also describes a stacking process of ML algorithms with three phases: a segment-based ML phase, a recording-based feature extraction phase, and a recording-based ML phase.

Different authors have adopted different approaches for developing classification models. Data can also be captured from ADL (activities of daily living) with wearables [7]. The proposed framework used this data and implemented supervised and unsupervised machine learning algorithms and batch-level processing, including cohort segmentation. Different ML algorithms such as KNN, SVM, and Random Forest were implemented on two different datasets and compared based on accuracy and model building time. The maximum accuracy on the NHANES and Framingham datasets was obtained with Random Forest and SVM modeling, respectively.

In [8], the NHANES and FHS datasets were used. In the NHANES dataset, feature selection methods led to an improvement in performance based on information theory ranking, and in the FHS dataset this was done by grouping based on the subdivision filter variable method. The model also showed the trade-off between accuracy and execution time. Random Forest and KNN gave better results in the confusion matrix and classification accuracy, but they were unable to satisfy the boundaries of creation time. Meanwhile, the decision tree showed good results in both aspects.

In [9], the effect of class imbalance in data on the performance of a multilayer perceptron was studied. Different learning rates were used to evaluate the performance of the multilayer perceptron, and the dataset was analyzed with three sampling algorithms, among which the Resample method provided the best accuracy. A comparative study based on accuracy metrics and execution time was also carried out; the Spread Sub Sample algorithm had the least execution time.

In [10], the aim was to assess the importance of features for the classification result. Preprocessing and normalization were performed on the dataset. Next, the correlation matrix was obtained to measure the correlation between features. The classification was then carried out in three stages: first, L1-based feature selection; second, an AUC (Area Under Curve) based comparison of the performance of five algorithms, namely Decision Tree, Neural Network, Logistic Regression, Support Vector Machine, and Naive Bayes, over both normalized and original data; third, a comparison of classifier performance based on sensitivity, accuracy, specificity, and AUC. Except for Neural Network, the classifiers showed almost the same results.

In [17], a novel approach is proposed that uses the analytic time-frequency flexible wavelet transform (ATFFWT) and fractal dimension (FD) to detect epileptic seizures. Using ATFFWT, the EEG signals are decomposed into subbands, and the FD is then calculated on each subband. To train the model, these FDs are given as input to a Least Squares SVM. Finally, cross-validation is performed to deal with model overfitting.

III. PROPOSED WORK

This section describes the resources and approaches used in this work. First, the dataset is described to explain how to work with it, followed by the preprocessing steps involved. Finally, the internal working of the analytical models used is explained.

A. Dataset Description

We used a subset of the Framingham Heart Study (FHS) dataset, which is made publicly available through the Framingham Heart Institute [7]. The available section of the FHS dataset used in this paper contains records of 4240 participants. The dataset was generated by a long-term study of the population of Framingham, Massachusetts. The study addresses the causes and origins of cardiovascular heart disease and is regarded as one of the best in the public health disease management domain [8]. The Framingham Heart Study focused mainly on identifying the risk factors that affect a person's likelihood of developing coronary heart disease. The dataset contains 16 different features that affect Coronary Heart Disease.

TABLE I
ATTRIBUTES OF THE DATASET AND THEIR INTERPRETATION

Attribute         Interpretation
gender            Female: 0; Male: 1
age               Age at the examination time
education         1: high school; 2: high school or GED; 3: college or vocational school; 4: college
currentSmoker     0 = nonsmoker; 1 = smoker
diabetes          0 = No; 1 = Yes
totChol           Total cholesterol in the patient's body (mg/dL)
sysBP             Systolic blood pressure (mmHg)
diasBP            Diastolic blood pressure (mmHg)
cigsPerDay        Number of cigarettes smoked per day (average)
BPMeds            Whether the person is on BP medication
prevalentStroke   Whether the person had any prevalent stroke
prevalentHyp      Whether the person had prevalent hypertension
BMI               Body Mass Index: weight (kg) / height (m) squared
heartRate         Beats/min (ventricular)
glucose           Amount of glucose (mg/dL)
TenYearCHD        Risk of developing CHD (Yes: 1; No: 0)
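The preprocessing steps named in the abstract (replacing null values, resampling, standardization, normalization) can be sketched on a single column of the dataset. This is only an illustrative sketch: the paper does not state which imputation or resampling strategy was used, so mean imputation and random oversampling of the minority class are assumptions here, and the glucose values and labels are hypothetical toy data rather than Framingham records.

```python
import random
import statistics


def impute_mean(column):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in column]


def standardize(column):
    """Rescale to zero mean and unit population standard deviation.

    Assumes the column is not constant (otherwise sigma is zero).
    """
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    return [(v - mu) / sigma for v in column]


def min_max_normalize(column):
    """Rescale to the [0, 1] range. Assumes the column is not constant."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal size.

    Assumes binary labels (0/1) with at least one example of each class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy values for the glucose column (mg/dL); None marks a null entry.
glucose = [77.0, None, 70.0, 103.0, 85.0]
glucose_filled = impute_mean(glucose)
glucose_z = standardize(glucose_filled)
glucose_01 = min_max_normalize(glucose_filled)

# Hypothetical TenYearCHD labels, imbalanced toward class 0.
labels = [0, 0, 0, 1, 0]
rows = [[g] for g in glucose_filled]
balanced_rows, balanced_labels = oversample_minority(rows, labels)
```

In a real pipeline, the scaling statistics would normally be computed on the training portion only and then reused on the test portion, so that no information leaks across the split.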
$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{8}$$
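As a small worked illustration of Eq. (8), the sketch below counts the confusion-matrix entries for a binary classifier and derives the false positive rate from them. The true positive rate is included using its standard definition, TP / (TP + FN), because it is the other axis of the ROC curve. The function names and the toy label vectors are hypothetical and are not taken from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def false_positive_rate(fp, tn):
    """Eq. (8): FPR = FP / (FP + TN)."""
    return fp / (fp + tn)


def true_positive_rate(tp, fn):
    """Standard definition: TPR = TP / (TP + FN)."""
    return tp / (tp + fn)


# Hypothetical toy labels: two positives and four negatives.
y_true = [1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
fpr = false_positive_rate(fp, tn)
tpr = true_positive_rate(tp, fn)
```

Sweeping the decision threshold of a classifier and plotting each resulting (FPR, TPR) pair is what produces a ROC curve.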
B. Results
In the proposed work, 10-fold cross-validation is performed for the machine learning algorithms used for analysis, namely RF, DT, and KNN. The performance measures mentioned in the parameters section are calculated and compared.
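The splitting step behind 10-fold cross-validation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `k_fold_indices`, the shuffling seed, and the use of 4240 as the sample count are assumptions, and the resampling step may change the number of rows that were actually evaluated.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices are shuffled once, then dealt into k folds; each fold serves as
    the test set exactly once while the remaining folds form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Illustrative sample count only (the record count quoted for the dataset).
splits = list(k_fold_indices(4240, k=10))
```

Each classifier would be fitted on `train` and scored on `test` for every pair, and the per-fold accuracy and AUC values would then be averaged.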
The accuracy and AUC of RF, DT, and KNN for the 10-fold iterations are shown in Table II. It can be observed that the average accuracy and AUC of the RF classifier are higher than those of DT and KNN.

TABLE II
10-FOLD RESULTS

Fold   RF                  DT                  KNN
No.    Accuracy   AUC      Accuracy   AUC      Accuracy   AUC
6      96.52%     0.99     92.08%     0.92     92.63%     0.91
7      97.49%     0.99     94.15%     0.92     94.15%     0.91
8      97.63%     0.99     92.61%     0.91     93.03%     0.91
9      97.21%     0.99     92.2%      0.92     93.03%     0.91
10     97.07%     0.99     92.61%     0.92     92.2%      0.91

The ROC curves for the 10 folds of the RF classifier, with a mean AUC of 0.99, are shown in Fig. 3. An AUC closer to 1 indicates a better model [13]. A model is better if it predicts true more often, that is, if its TPR is higher. Therefore, a curve passing through the top left corner gives a better predictive model, as in the case of RF.

Fig. 3. 10-Fold RF ROC Curve

Table III shows the mean accuracy and mean AUC for the 10-fold cross-validation scores of RF, DT, and KNN. It also compares the execution time taken by each algorithm for model creation and prediction. RF achieves the maximum accuracy among RF, DT, and KNN, namely 96.8%, with an AUC of 0.99. The execution time of RF is 1.39 seconds, falling between DT and KNN. DT has the least execution time of 0.81 seconds, whereas that of KNN is as high as 1.9 seconds.

TABLE III
PERFORMANCE STATISTICS

Fig. 4 compares the ROC curves for RF, DT, and KNN, giving an AUC of 0.99 for RF, 0.92 for DT, and 0.91 for KNN. The average AUC of RF is the closest to 1, and hence RF is more suitable for the prediction model than DT and KNN.

Table IV shows the results for different performance measures: accuracy, precision, recall, specificity, and F1 score. When recall is taken into consideration, RF gives a recall of 94.4%, which is higher than that of DT and KNN. Thus, RF outperforms the other algorithms in predicting CHD risk among individuals.

TABLE IV
PERFORMANCE STATISTICS

Algorithm   Accuracy   Precision   Recall    Specificity   F1 Score
RF          96.71%     98.94%      94.4%     99%           96.61%
DT          92.1%      98.57%      85.33%    98.78%        91.47%
KNN         91.49%     98.42%      84.21%    98.67%        90.76%
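The F1 score is the harmonic mean of precision and recall, so the F1 column of Table IV can be cross-checked against the reported precision and recall columns. The sketch below does this with the standard formula F1 = 2PR / (P + R). The dictionary layout and variable names are illustrative choices, and the numbers are copied from Table IV.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as listed in Table IV.
table_iv = {
    "RF": (98.94, 94.4, 96.61),
    "DT": (98.57, 85.33, 91.47),
    "KNN": (98.42, 84.21, 90.76),
}

computed = {name: f1_score(p, r) for name, (p, r, _) in table_iv.items()}

for name, (_, _, reported) in table_iv.items():
    print(f"{name}: computed F1 = {computed[name]:.2f}%, reported F1 = {reported}%")
```

For all three classifiers, the recomputed values agree with the reported F1 scores to within about 0.01 percentage points, which is consistent with rounding of the inputs.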
Fig. 4. Comparison ROC Curve

TABLE V
RESULT COMPARISON

Algorithm       Previous work [7]                   Proposed work
                Time Taken (sec)   Accuracy (%)     Time Taken (sec)   Accuracy (%)
Random Forest   2180               90.1             1.3969             96.80
Decision Tree   77.4               90               0.8138             92.45
KNN             88.8               90.1             1.9029             92.81

Table V compares the work done in [7] with the proposed work in terms of the time taken by the algorithms and their accuracy. For all three classification algorithms, the accuracy is higher and the time taken is much lower in the proposed work. Moreover, the accuracies of the algorithms in the previous work are nearly the same, while in the proposed work Random Forest shows much higher accuracy. Therefore, the Random Forest machine learning algorithm outperforms the results in [7] and [8] on the same dataset.

V. CONCLUSION AND FUTURE WORK

We propose a preprocessing-intensive approach in which Random Forest is the most suitable contender for the prediction model and gives the highest performance measures compared with K-Nearest Neighbour and Decision Tree. The accuracy, precision, recall, specificity, and F1 score of RF in the proposed work are 96.71%, 98.74%, 94.4%, 99%, and 96.61%, respectively, with an execution time of 1.3969 seconds. The Decision Tree, however, gives lower performance than Random Forest, though in considerably less time (0.8138 seconds). The execution time for K-Nearest Neighbour is the highest of all; however, its performance measures are quite similar to those of the Decision Tree.

REFERENCES

[1] Huse, Hettiarachchi, Gearon E, Nichols M, Allender S, Peeters A, “Obesity In Australia Modi”, June 2015.
[2] P. Groves, B. Kayyali, D. Knott, and S. van Kuiken, “The Big Data Revolution in Healthcare: Accelerating Value and Innovation”, USA: Center for US Health System Reform Business Technology Office, 2016.
[3] D. Kumar, “Automatic heart sound analysis for cardiovascular disease assessment”, Ph.D. dissertation, University of Coimbra, 2014.
[4] S. Mendis et al., “Global status report on noncommunicable diseases 2014”, World Health Organization, 2014.
[5] Tilakachuri Balakrishna, B. Narendra, Mooray Harika Reddy, Damarapati Jayasri, “Diagnosis of Chronic Kidney Disease Using Random Forest Classification Technique”, Helix, Vol. 7(1), pp. 873-877, 2017.
[6] Martin Gjoreski, Monika Simjanoska, Anton Gradisek, “Chronic Heart Failure Detection from Heart Sounds Using a Stack of Machine-Learning Classifiers”, IEEE 13th International Conference on Intelligent Environments, pp. 14-19, 2017.
[7] Nitten S. Rajliwall, Girija Chetty, Rachel Davey, “Chronic disease risk monitoring based on an innovative predictive modeling framework”, IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1-8, 2017.
[8] Rajliwall, Nitten S., Rachel Davey, and Girija Chetty, “Machine learning based models for Cardiovascular risk prediction”, IEEE International Conference on Machine Learning and Data Engineering (ICMLDE), 2018.
[9] Pinar Yildirim, “Chronic Kidney Disease Prediction on Imbalanced Data by Multilayer Perceptron”, IEEE 41st Annual Computer Software and Applications Conference, pp. 193-198, 2017.
[10] Maryam Soltanpour Gharibdousti, Kamran Azimi, “Prediction of Chronic Kidney Disease Using Data Mining Techniques”, Proceedings of the Industrial and Systems Engineering Conference, 2017.
[11] Min Chen, Yixue Hao, Kai Hwang, Lu Wang, and Lin Wang, “Disease Prediction by Machine Learning Over Big Data From Healthcare Communities”, IEEE Access, Vol. 5, pp. 8869-8879, 2017.
[12] M. A. Jabbar, Shirina Samreen, “Heart disease prediction system based on Hidden Naive Bayes classifier”, IEEE International Conference on Circuits, Controls, Communications and Computing (I4C), pp. 1-5, 2016.
[13] Hajian-Tilaki, Karimollah, “Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation”, Caspian Journal of Internal Medicine, Vol. 4, pp. 627-635, 2013.
[14] Gunarathne W.H.S.D, Perera K.D.M, Kahandawaarachchi K.A.D.C.P, “Performance Evaluation on Machine Learning Classification Techniques for Disease Classification and Forecasting through Data Analytics for Chronic Kidney Disease”, IEEE 17th International Conference on Bioinformatics and Bioengineering, pp. 291-296, 2017.
[15] Somaya Hashem, Gamal Esmat, Wafaa Elakel, Shahira Habashy, “Comparison of Machine Learning Approaches for Prediction of Advanced Liver Fibrosis in Chronic Hepatitis C Patients”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 15, No. 3, pp. 861-868, 2018.
[16] Ahmed J. Aljaaf, Dhiya Al-Jumeily, Hussein M. Haglan, Mohamed Alloghani, “Early Prediction of Chronic Kidney Disease Using Machine Learning Supported by Predictive Analytics”, IEEE Congress on Evolutionary Computation (CEC), pp. 1-9, 2018.
[17] M. Sharma, R. S. Tan, U. R. Acharya, “A new method to identify coronary artery disease with ECG signals and time-Frequency concentrated antisymmetric biorthogonal wavelet filter bank”, Pattern Recognition Letters, Vol. 125, pp. 235-240, 2019.