Machine Learning Classification Techniques For Heart Disease Prediction: A Review
3 authors:
Mohammad Hijjawi
Applied Science Private University
All content following this page was uploaded by Maryam Aljanabi on 03 April 2020.
Research paper
Abstract
The most crucial task in the healthcare field is disease diagnosis. If a disease is diagnosed early, many lives can be saved. Machine learning classification techniques can significantly benefit the medical field by providing accurate and quick diagnoses of diseases, thereby saving time for both doctors and patients. Heart disease is the number one killer in the world today and one of the most difficult diseases to diagnose. In this paper, we provide a survey of the machine learning classification techniques that have been proposed to help healthcare professionals diagnose heart disease. We start with an overview of machine learning and brief definitions of the classification techniques most commonly used to diagnose heart disease. Then, we review representative research works on using machine learning classification techniques in this field. A detailed tabular comparison of the surveyed papers is also presented.
Keywords: heart disease; heart disease diagnosis; heart disease prediction; machine learning; machine learning classification techniques.
Artificial intelligence (AI) is the part of computer science concerned with making computers more intelligent. Since the most basic requirement of intelligence is learning, a subfield of AI called machine learning (ML) emerged. ML is one of the most rapidly evolving fields of AI and is used in many areas of life, primarily in the healthcare field. ML has great value in healthcare because it is an intelligent tool for analyzing data, and the medical field is rich with data. In the past few years, large amounts of data have been collected and stored because of the digital revolution. Monitoring and other data collection devices are available in modern hospitals and are used every day, so abundant amounts of data are being gathered. It is very hard or even impossible for humans to derive useful information from these massive amounts of data, which is why machine learning is now widely used to analyze these data and diagnose problems in the healthcare field. Put simply, a machine learning algorithm learns from previously diagnosed patient cases. The resulting classifier has many uses, such as helping doctors diagnose new patients with greater speed and efficiency, and training students and non-specialists to diagnose patients [1].

Since vast amounts of medical data are available, machine learning can help us discover patterns and beneficial information in them. Although it has many uses, machine learning is mostly used for disease prediction in the medical field. Many researchers became interested in using machine learning for diagnosing diseases because it helps to reduce diagnosis time and increases accuracy and efficiency. Several diseases can be diagnosed using machine learning techniques, but the focus of this paper is on heart disease diagnosis. Since heart disease is the primary cause of

The term heart disease, also called cardiovascular disease, encompasses the diverse diseases that affect the heart. The World Health Organization estimates that 12 million deaths occur worldwide every year due to heart disease. It is the major cause of death in many developing countries. For example, in the United States, it kills one person every 34 seconds. It is also the main cause of death in India, which shows that heart disease is one of the most dangerous diseases threatening adults' lives today [2]. Heart disease diagnosis is one of the most critical and challenging tasks in the healthcare field. Heart disease must be diagnosed quickly, efficiently and correctly in order to save lives. Diagnosis requires the patient to undergo many tests, and healthcare professionals must carefully examine the results. That is why researchers have been interested in predicting heart disease and have developed different heart disease prediction systems using various machine learning algorithms [3]. Some of them achieved better results than others. Many used the well-known UCI heart disease dataset to train and test their classifiers, while others used data obtained from hospitals accessible to them.

This survey paper provides an overview of the machine learning classification techniques used in diagnosing heart disease and of how previous researchers implemented them. It sheds light on how important machine learning is in the healthcare field and how it can make accurate predictions and help healthcare professionals.

The rest of the paper is organized as follows. Section 2 presents background topics on ML, classification techniques, and the heart disease dataset most widely used by researchers in this field. Section 3 contains the literature review of the research work currently proposed in this area. Section 4 presents a tabular comparison
Copyright © 2018 Maryam I. Aljanabi et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
5374 International Journal of Engineering & Technology
This method combines multiple classifiers into one model to increase accuracy. There are three types of ensemble learning methods. The first type is bagging, which aggregates classifiers of the same kind through voting. The second type is boosting, which is like bagging except that each new model is influenced by the results of the previous models. The third type is stacking, which aggregates machine learning classifiers of different kinds to produce one model [6].

2.4. Data Preprocessing

3. Current Classification Techniques for Predicting Heart Disease

Many researchers have used various classification techniques to predict heart disease. In this section, we provide a summary of the surveyed papers in this area. We grouped the papers based on the algorithms that were used in their prediction models. Most researchers combined multiple algorithms in their research work or provided a comparison between them; these papers can be found in the last section, called the "Hybrid Approach" section.
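
Ensemble methods recur in several of the papers surveyed in this section. As a rough illustration of the bagging idea (classifiers of the same kind, each trained on a resampled copy of the data and combined by majority vote), the following is a minimal sketch in plain Python. It is not taken from any surveyed paper: the decision-stump base learner, the function names, and the assumption of binary labels 0 and 1 are all chosen here for illustration, and boosting and stacking are not shown.

```python
import random
from collections import Counter


def train_stump(X, y):
    """Fit a one-feature threshold rule (decision stump) with the fewest
    training errors. Assumes binary labels 0 and 1 (illustrative choice)."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for left_label, right_label in ((0, 1), (1, 0)):
                errors = sum(
                    (left_label if row[feature] <= threshold else right_label) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def stump_predict(stump, row):
    """Apply the threshold rule to a single sample."""
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def bagging_fit(X, y, n_models=15, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(train_stump([X[i] for i in sample], [y[i] for i in sample]))
    return models


def bagging_predict(models, row):
    """Aggregate the individual stump predictions by majority vote."""
    votes = Counter(stump_predict(model, row) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy one-feature data, invented for this sketch (not patient data).
    X = [(value,) for value in range(10)]
    y = [0] * 5 + [1] * 5
    models = bagging_fit(X, y)
    print(bagging_predict(models, (0,)), bagging_predict(models, (9,)))
```

A stump is used as the base learner only because it keeps the sketch short; any classifier with a fit and predict step could take its place.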
3.1. Naive Bayes

Vembandasamy et al. in [11] used a Naive Bayes classifier to diagnose the presence or absence of heart disease. The dataset used in the research was obtained from one of the leading diabetic research institutes in Chennai; it contained records of about 500 patients and had 11 attributes (including the diagnosis). The Waikato Environment for Knowledge Analysis (WEKA) tool, which is a collection of ML algorithms, was used to apply the Naive Bayes classifier. The accuracy of their research work was 86.4198%.
Medhekar et al. in [12] proposed a system that categorized the data into five categories using a Naive Bayes classifier. The categories are no, low, average, high and very high. The system predicts the possibility of heart disease in the input data. The dataset used for training and testing is the UCI heart disease dataset shown in table 1. The system showed an accuracy of 88.96%.

logistic model tree, and random forest. The J48 algorithm outperformed the rest with an accuracy of 56.76%.

3.4. K-Nearest Neighbor (KNN)

Shouman et al. in [17] applied K-Nearest Neighbor (KNN) to predict heart disease using the Cleveland dataset. The paper compared the results of applying KNN alone and applying KNN with the voting technique. Voting is the method of dividing the data into subsets and applying the classifier to each subset. Evaluation was done using 10-fold cross-validation. The results showed that without voting, the accuracy ranged from 94% to 97.4% for various values of K, with the highest accuracy of 97.4% at K=7. Using the voting technique, however, did not improve the accuracy: with voting, the accuracy at K=7 decreased to 92.7%.
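
The evaluation pattern described for KNN (trying several values of K and scoring each with 10-fold cross-validation) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in [17]: the data are synthetic two-feature points rather than the Cleveland dataset, the distance is Euclidean, the function names are hypothetical, and the voting variant is not shown.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points nearest to query (Euclidean)."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_val_accuracy(X, y, k, folds=10, seed=0):
    """Mean KNN accuracy over `folds` disjoint held-out folds.
    Assumes len(X) >= folds so that no fold is empty."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    fold_scores = []
    for f in range(folds):
        test_idx = indices[f::folds]  # every folds-th shuffled index
        held_out = set(test_idx)
        train_X = [X[i] for i in indices if i not in held_out]
        train_y = [y[i] for i in indices if i not in held_out]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx
        )
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / folds


def make_toy_data(n_per_class=50, seed=1):
    """Synthetic data: class 0 centred at (0, 0), class 1 centred at (3, 3)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            X.append((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)))
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    for k in (1, 3, 5, 7, 9):
        print(f"K={k}: accuracy={cross_val_accuracy(X, y, k):.3f}")
```

The accuracies printed for this toy data say nothing about heart disease data; the sketch only shows how an accuracy-versus-K comparison of the kind reported above could be produced.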
output layers respectively. The research was divided into two experiments: the first compared the different classifiers mentioned above, while the second applied the ensemble techniques. The results showed that SVM outperformed the other classifiers in the first experiment with an accuracy of 84.15%. In the second experiment, using the boosting technique with SVM also proved to be the most efficient, with an accuracy of 84.81%.
Amin et al. in [19] proposed a hybrid system for predicting heart disease using ANN and a genetic algorithm. The dataset used in this research was collected from 50 people through a survey conducted by the American Heart Association and contains thirteen attributes. Data analysis involved preprocessing the data to remove missing or incorrect values. The dataset was divided into 70% for training and 15% each for testing and validation. The system was implemented using MATLAB R2012a through the Global Optimization Toolbox and the Neural Network Toolbox. The results showed an accuracy of 89% for predicting whether a person has heart disease or not.
Waghulde and Patil in [8] developed a heart disease prediction system using ANN and a genetic algorithm. The method used a genetic algorithm to initialize the weights in the neural network. The experiment was done using MATLAB on a dataset of 50 people collected by the American Heart Association that included thirteen attributes. The results showed an accuracy of 98% and 84% when using six hidden nodes and ten hidden nodes respectively.
Amma in [20] presented a system for heart disease diagnosis that combines ANN and a genetic algorithm. The dataset used was the Cleveland dataset. Preprocessing the dataset consisted of filling in missing values and normalizing the data using Min-Max normalization. The weights of the neural network were determined using the genetic algorithm. The accuracy obtained was 94.17%.
Venkatalakshmi and Shivsankar in [21] compared Naive Bayes and Decision Tree to determine which one has the higher accuracy for heart disease prediction. The dataset used was the UCI heart disease dataset. The experiment was carried out using the WEKA tool and showed an accuracy of 85.03% and 84.01% for Naive Bayes and Decision Tree respectively. As future work, the paper suggested using a genetic algorithm in MATLAB to reduce the number of features before feeding the dataset into the WEKA tool.
Palaniappan and Awang in [22] proposed an Intelligent Heart Disease Prediction System (IHDPS) using multiple classification techniques, namely Decision Tree, Naive Bayes and Neural Network. The system is web-based and was implemented using the .NET framework. The data source consisted of 909 records with fifteen attributes obtained from the Cleveland Heart Disease database. The Data Mining Extension (DMX) query language was used to create the model. The results showed that Naive Bayes was the most efficient with 86.53% correct predictions, followed by Neural Network with only 1% difference.
Dangare and Apte in [23] developed a model for predicting heart disease. The dataset used is the Cleveland database, consisting of 303 records, alongside the Statlog database, comprising 270 records. Instead of using only the thirteen attributes present in the dataset, they added two attributes: obesity and smoking. The WEKA tool was used for preprocessing the dataset. The classification techniques used for analyzing the dataset were Decision Tree, Naive Bayes, and ANN. According to the results, ANN gave an accuracy of 100%, Decision Tree 99.62%, and Naive Bayes 90.74%, which indicates that Artificial Neural Network is the highest performing algorithm.
Zriqat et al. in [24] developed an effective intelligent medical decision support system. Five classification algorithms were compared: Naive Bayes, Decision Tree, Discriminant, Random Forest, and Support Vector Machine. The analysis was done using MATLAB on two datasets, the Cleveland Heart Disease and the Statlog Heart Disease. The results showed that Decision Tree achieved the highest accuracy for both datasets, at 99.01% and 98.15% for the Cleveland and Statlog datasets respectively.
Liu et al. in [25] proposed a hybrid model for diagnosing heart disease. The dataset used was the Statlog heart disease dataset from the UCI repository. The model, developed with MATLAB, consisted of two subsystems: feature selection and classification. The feature selection subsystem used the Relief method to estimate the weights of the features and then used the Rough Set feature selection approach (RFRS) to remove unnecessary features and improve the accuracy of the model. The classification subsystem used an ensemble classifier with the C4.5 algorithm (which is used to generate a Decision Tree) as the base. The results showed 92.59% classification accuracy.
Ghumbre et al. in [26] compared Support Vector Machine and Radial Basis Function (RBF), which is a type of ANN. The algorithms were applied to a patient dataset from India consisting of 214 records and 19 attributes to predict whether a person has heart disease or not. The performance of the algorithms was evaluated using the overall average through training and testing the dataset, 5-fold cross-validation, and 10-fold cross-validation. The overall average performance yielded 86.42% and 80.81% accuracy for SVM and RBF respectively. Their results showed that SVM provided better accuracy.
Masethe and Masethe in [27] applied several algorithms to diagnose heart disease, namely J48, Naive Bayes, REPTREE, Simple Cart (Classification and Regression Tree), which is a type of Decision Tree, and Bayes Net. The dataset used for this work was obtained from South African physicians and contains eleven attributes: patient identification number (replaced with dummy values to protect the privacy of patients), gender, cardiogram, age, chest pain, blood pressure level, heart rate, cholesterol, smoking, alcohol consumption and blood sugar level. The tool used in the experiment was WEKA. The performance evaluation was done using 10-fold cross-validation to assess the efficiency of the built model. The results showed an accuracy of 99.0471% for J48, 99.0471% for REPTREE, 97.222% for Naive Bayes, 98.1481% for Bayes Net, and 99.0741% for Simple Cart, showing that Simple Cart outperformed the rest.

4. Comparison of ML Classification Techniques for Heart Disease Prediction

This section provides a tabular comparison between all the research papers described above. The comparison is made on the basis of accuracy and can be seen in table 2. The table has six columns, which are as follows:

1. Author: This shows the author/s of the paper and the reference number.
2. Classification Technique/s: This represents the classification algorithm used in the research, whether it was a single algorithm, a comparison or a hybrid model.
3. Best Technique Found: This column is only applicable to papers that compare multiple algorithms. It represents the best algorithm found in the research work, which is the algorithm with the highest accuracy.
4. Tool: The framework or programming language used to build the model is shown in this column. It is what the researcher used to pre-process the input dataset, create the predictive model and test it.
5. Dataset: This shows the dataset that was used as an input for the classification algorithm.
6. Accuracy: This represents the accuracy of the results of the proposed model. If the paper contained a comparison, this column only shows the accuracy of the best technique found by the author.
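
The relationship between the "Best Technique Found" and "Accuracy" columns can be made concrete with a short sketch: accuracy is the fraction of predictions that match the true diagnosis, and the best technique is the one with the highest such fraction. The labels and predictions below are invented purely for illustration and do not come from any surveyed paper; the function names are likewise hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def best_technique(y_true, predictions):
    """Return the name of the technique with the highest accuracy, plus all scores.

    `predictions` maps a technique name to its list of predicted labels.
    """
    scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Invented labels (1 = heart disease present, 0 = absent) for illustration only.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    predictions = {
        "Naive Bayes": [1, 0, 1, 0, 0, 0, 1, 0],
        "Decision Tree": [1, 1, 1, 0, 0, 0, 1, 0],
    }
    best, scores = best_technique(y_true, predictions)
    for name, score in scores.items():
        print(f"{name}: {score:.2%}")
    print("Best technique found:", best)
```

Accuracy alone hides the difference between false positives and false negatives, so a comparison made only on this basis should be read with that limitation in mind.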
Author | Classification Technique/s | Best Technique Found | Tool | Dataset | Accuracy
--- | --- | --- | --- | --- | ---
Das et al. [7] | ANN Ensemble | n/a | SAS enterprise miner 5.2 | Cleveland (UCI) | 89.01%
Chen et al. [13] | ANN LVQ | n/a | C and C# | Cleveland (UCI) | 80%
Dangare and Apte [14] | ANN | n/a | WEKA | Cleveland and Statlog (UCI) | Nearly 100%
Sabarinathan and Sugumaran [15] | DT | J48 with feature selection | Not mentioned | A dataset with 240 records for testing and 120 for training | 85%
Patel et al. [16] | | J48 | WEKA | Cleveland (UCI) | 56.76%
Shouman et al. [17] | KNN | n/a | Not mentioned | Cleveland (UCI) | 97.4%
Wiharto et al. [18] | SVM | BT SVM | Not mentioned | Cleveland (UCI) | 61.86%
Venkatalakshmi and Shivsankar [21] | NB and DT | NB | WEKA | UCI | 85.03%
Ghumbre et al. [26] | SVM and Radial Basis Function | SVM | Not mentioned | Indian patients dataset of 214 records and 19 attributes | 86.42%
5. Conclusion and Final Remarks

This paper gives an overview of the literature on machine learning classification methods for diagnosing heart disease. Many representative papers on using machine learning classification techniques were surveyed and categorized. The accuracy of the proposed models varies depending on the tool used, the dataset used, the number of attributes and records in the dataset, the preprocessing techniques, and the classifier implemented in the model. It also depends on whether the model is a hybrid model and whether it uses feature selection. From table 2, we can conclude that the researchers who produced the highest accuracy were Dangare and Apte, using an Artificial Neural Network (ANN), the WEKA tool and a combination of the Cleveland and Statlog heart disease datasets.

We conclude that to build an accurate heart disease prediction model, a dataset with sufficient samples and correct data must be used. The dataset must be preprocessed appropriately, because preprocessing is the most critical step in preparing the dataset for the machine learning algorithm and obtaining good results. A suitable algorithm must also be used when developing a prediction model. We can observe that Artificial Neural Network (ANN) performed well in most models for predicting heart disease, as did Decision Tree (DT).

Finally, using machine learning for diagnosing heart disease is an important field that can help both healthcare professionals and patients. It is still a growing field, and despite the massive availability of patient data in hospitals and clinics, not much of it is published. As observed in table 2, most researchers got their datasets from the same source, the UCI repository. Since the quality of the dataset is an essential factor in the accuracy of the prediction, more hospitals should be encouraged to publish high-quality datasets (while protecting the privacy of patients) so that researchers have a good source to help them develop their models and obtain good results.

Acknowledgement

This work was made possible by the financial support from the Applied Science Private University in Amman, Jordan.

References

[1] I. Kononenko, “Machine learning for medical diagnosis: History, state of the art and perspective,” Artificial Intelligence in Medicine, vol. 23, no. 1, pp. 89–109, 2001.
[2] J. Soni et al., “Intelligent and effective heart disease prediction system using weighted associative classifiers,” International Journal on Computer Science and Engineering, vol. 3, no. 6, pp. 2385–2392, 2011.
[3] N. Khateeb and M. Usman, “Efficient heart disease prediction system using k-nearest neighbor classification technique,” in Proceedings of the International Conference on Big Data and Internet of Thing (BDIOT), New York, NY, USA: ACM, 2017, pp. 21–26.
[4] H. Almarabeh and E. Amer, “A study of data mining techniques accuracy for healthcare,” International Journal of Computer Applications, vol. 168, no. 3, pp. 12–17, Jun 2017.
[5] M. Fatima and M. Pasha, “Survey of machine learning algorithms for disease diagnostic,” Journal of Intelligent Learning Systems and Applications, vol. 9, no. 01, pp. 1–16, 2017.
[6] S. Pouriyeh et al., “A comprehensive investigation and comparison of machine learning techniques in the domain of heart disease,” in Proceedings of IEEE Symposium on Computers and Communications (ISCC). Heraklion, Greece: IEEE, July 2017, pp. 204–207.
[7] R. Das, I. Turkoglu, and A. Sengur, “Effective diagnosis of heart disease through neural networks ensembles,” Expert systems with applications, vol. 36, no. 4, pp. 7675–7680, 2009.
[8] N. Waghulde and N. Patil, “Genetic neural approach for heart disease prediction,” International Journal of Advanced Computer Research, vol. 4, no. 3, pp. 778, 2014.
[9] S. Garcia et al., “Big data preprocessing: methods and prospects,” Big Data Analytics, vol. 1, no. 1, p. 9, Nov 2016.
[10] A. Janosi et al., “Heart disease data set,” Jul 1988. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Heart+Disease.
[11] K. Vembandasamy, R. Sasipriya, and E. Deepa, “Heart diseases detection using naive bayes algorithm,” International Journal of Innovative Science, Engineering & Technology, vol. 2, no. 9, pp. 441–444, 2015.
[12] D. Medhekar, M. Bote, and S. Deshmukh, “Heart disease prediction system using naive bayes,” International Journal of Enhanced Research In Science Technology & Engineering, vol. 2, no. 3, pp. 1–5, 2013.
[13] A. Chen et al., “HDPS: Heart disease prediction system,” in Computing in Cardiology, Hangzhou, China: IEEE, 2011, pp. 557–560.
[14] C. Dangare and S. Apte, “A data mining approach for prediction of heart disease using neural networks,” International Journal of Computer Engineering & Technology, vol. 3, no. 3, pp. 30–40, 2012.
[15] V. Sabarinathan and V. Sugumaran, “Diagnosis of heart disease using decision tree,” International Journal of Research in Computer Applications & Information Technology, vol. 2, no. 6, pp. 74–79, 2014.
[16] J. Patel et al., “Heart disease prediction using machine learning and data mining technique,” Heart Disease, vol. 7, no. 1, pp. 129–137, 2015.
[17] M. Shouman, T. Turner, and R. Stocker, “Applying k-nearest neighbour in diagnosing heart disease patients,” International Journal of Information and Education Technology, vol. 2, no. 3, pp. 220, 2012.
[18] W. Wiharto, H. Kusnanto, and H. Herianto, “Performance analysis of multiclass support vector machine classification for diagnosis of coronary heart diseases,” International Journal on Computational Science & Applications, vol. 5, no. 5, pp. 27–37, 2015.
[19] S. Amin, K. Agarwal, and R. Beg, “Genetic neural network based data mining in prediction of heart disease using risk factors,” in IEEE Conference on Information Communication Technologies. Thuckalay, Tamil Nadu, India, April 2013, pp. 1227–1231.
[20] N. Amma, “Cardiovascular disease prediction system using genetic algorithm and neural network,” in International Conference on Computing, Communication and Applications. Dindigul, Tamilnadu, India: IEEE, Feb 2012, pp. 1–5.
[21] B. Venkatalakshmi and M. Shivsankar, “Heart disease diagnosis using predictive data mining,” International Journal of Innovative Research in Science, Engineering and Technology, vol. 3, no. 3, pp. 1873–1877, 2014.
[22] S. Palaniappan and R. Awang, “Intelligent heart disease prediction system using data mining techniques,” in IEEE/ACS International Conference on Computer Systems and Applications. Doha, Qatar, March 2008, pp. 108–115.
[23] C. Dangare and S. Apte, “Improved study of heart disease prediction system using data mining classification techniques,” International Journal of Computer Applications, vol. 47, no. 10, pp. 44–48, 2012.
[24] I. Zriqat, A. Altamimi, and M. Azzeh, “A comparative study for predicting heart diseases using data mining classification methods,” International Journal of Computer Science and Information Security (IJCSIS), vol. 14, no. 12, pp. 868–879, 2017.
[25] X. Liu et al., “A hybrid classification system for heart disease diagnosis,” Computational and Mathematical Methods in Medicine, vol. 2017, pp. 1–11, 2017.
[26] S. Ghumbre, C. Patil, and A. Ghatol, “Heart disease diagnosis using support vector machine,” in International conference on computer science and information technology. Pattaya, Thailand: Planetary Scientific Research Centre, 2011, pp. 84–88.
[27] H. Masethe and M. Masethe, “Prediction of heart disease using classification algorithms,” in Proceedings of the world congress on Engineering and Computer Science, San Francisco, USA: International Association of Engineers (IAENG), 2014, pp. 22–24.