Efficient Prediction of Cardiovascular Disease Using Machine Learning Algorithms With Relief and LASSO Feature Selection Techniques

The document presents a study on predicting cardiovascular disease (CVD) using machine learning algorithms, specifically employing Relief and LASSO feature selection techniques. A hybrid model incorporating various classifiers, including Decision Tree and Random Forest, was developed and trained on a combined dataset, achieving a high accuracy of 99.05% with the Random Forest Bagging Method. The research highlights the importance of early diagnosis and effective feature selection in improving prediction accuracy for heart disease.


Received December 23, 2020, accepted January 19, 2021, date of publication January 22, 2021, date of current version February 2, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3053759

Efficient Prediction of Cardiovascular Disease Using Machine Learning Algorithms With Relief and LASSO Feature Selection Techniques
PRONAB GHOSH1, SAMI AZAM2, MIRJAM JONKMAN2 (Member, IEEE), ASIF KARIM2, F. M. JAVED MEHEDI SHAMRAT3, EVA IGNATIOUS2, SHAHANA SHULTANA1, ABHIJITH REDDY BEERAVOLU2, AND FRISO DE BOER2
1 Department of Computer Science and Engineering, Daffodil International University, Dhaka 1225, Bangladesh
2 College of Engineering, IT, and Environment, Charles Darwin University, Casuarina, NT 0810, Australia
3 Researcher and Developer, Information and Communication Technology Division, Ministry of Posts, Telecommunications and Information Technology, Government of Bangladesh, Dhaka 1000, Bangladesh
Corresponding author: Sami Azam ([email protected])

ABSTRACT Cardiovascular diseases (CVD) are among the most common serious illnesses affecting
human health. CVDs may be prevented or mitigated by early diagnosis, and this may reduce mortality
rates. Identifying risk factors using machine learning models is a promising approach. We would like to
propose a model that incorporates different methods to achieve effective prediction of heart disease. For
our proposed model to be successful, we have used efficient Data Collection, Data Pre-processing and Data
Transformation methods to create accurate information for the training model. We have used a combined
dataset (Cleveland, Long Beach VA, Switzerland, Hungarian and Statlog). Suitable features are selected
by using the Relief, and Least Absolute Shrinkage and Selection Operator (LASSO) techniques. New
hybrid classifiers like Decision Tree Bagging Method (DTBM), Random Forest Bagging Method (RFBM),
K-Nearest Neighbors Bagging Method (KNNBM), AdaBoost Boosting Method (ABBM), and Gradient
Boosting Boosting Method (GBBM) are developed by integrating the traditional classifiers with bagging
and boosting methods, which are used in the training process. We have also evaluated the machine
learning algorithms by calculating the Accuracy (ACC), Sensitivity (SEN), Error Rate, Precision (PRE) and F1
Score (F1) of our model, along with the Negative Predictive Value (NPR), False Positive Rate (FPR), and
False Negative Rate (FNR). The results are shown separately to provide comparisons. Based on the result
analysis, we can conclude that our proposed model produced the highest accuracy while using RFBM and
Relief feature selection methods (99.05%).

INDEX TERMS Heart disease, machine learning, CVD, relief feature selection, LASSO feature selection,
decision tree, random forest, K-nearest neighbors, AdaBoost, and gradient boosting.

The associate editor coordinating the review of this manuscript and approving it for publication was Claudio Cusano.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
19304 VOLUME 9, 2021
P. Ghosh et al.: Efficient Prediction of Cardiovascular Disease Using Machine Learning Algorithms

I. INTRODUCTION
Cardiovascular disease has been regarded as the most severe and lethal disease in humans. The increasing incidence of cardiovascular diseases, combined with their high mortality rate, places a significant risk and burden on healthcare systems worldwide. Cardiovascular diseases are more common in men than in women, particularly in middle or old age [1], [2], although there are also children with similar health issues [3], [99].

According to data provided by the WHO, one-third of deaths globally are caused by heart disease. CVDs cause the death of approximately 17.9 million people every year worldwide and have a higher prevalence in Asia [4], [5]. The European Society of Cardiology (ESC) reported that 26 million adults worldwide have been diagnosed with heart disease, and 3.6 million are identified each year. Roughly half of all patients diagnosed with heart disease die within just 1-2 years, and about 3% of the total health care budget is spent on treating heart disease [6]. Multiple tests are required to predict heart disease. Lack of expertise among medical staff may result in false predictions [7]. Early diagnosis can be difficult [8]. Surgical treatment of heart disease is challenging, particularly in developing countries which lack trained medical staff as well as testing equipment and other resources required for proper diagnosis and care of patients with heart problems [9]. An accurate evaluation of the risk of cardiac failure would help to prevent severe heart attacks and improve the safety of patients [10]. Machine learning algorithms can be effective in identifying diseases when trained on proper data [11]. Heart disease datasets are publicly available for the comparison of prediction models. The introduction of machine learning and artificial intelligence helps researchers to design the best prediction model using the large databases which are available. Recent studies which focus on heart-related issues in adults and children emphasized the need to reduce mortality related to CVDs. Since the available clinical datasets are inconsistent and redundant, proper preprocessing is a crucial step [12]. Selecting the significant features that can be used as risk factors in prediction models is essential. Care should be taken to select the right combination of features and appropriate machine learning algorithms to develop accurate prediction models [13]. It is important to evaluate the effect of risk factors which meet three criteria: a high prevalence in most populations; a significant independent impact on heart disease; and the possibility of being controlled or treated to reduce the risks [14]. Different researchers have included different risk factors or features while modelling the predictors for CVD. Features used in the development of CVD prediction models in different research works include age, sex, chest pain (cp), fasting blood sugar (FBS) (elevated FBS is linked to diabetes [72]), resting electrocardiographic results (restecg), exercise-induced angina (exang), ST depression induced by exercise relative to rest (oldpeak), slope, number of major vessels coloured by fluoroscopy (ca), heart status (thal), maximum heart rate achieved (thalach), poor diet, family history, cholesterol (chol), high blood pressure, obesity, physical inactivity and alcohol intake [12], [15]–[19]. Recent studies reveal a need for a minimum of 14 attributes for making the prediction accurate and reliable [20]. Current researchers are finding it difficult to combine these features with the appropriate machine learning techniques to make an accurate prediction of heart disease [21]. Machine learning algorithms are most effective when they are trained on suitable datasets [22]–[25]. Since the algorithms rely on the consistency of the training and test data, the use of feature selection techniques such as data mining, Relief selection, and LASSO can help to prepare the data in order to provide a more accurate prediction. Once the relevant features are selected, classifiers and hybrid models can be applied to predict the chances of disease occurrence. Researchers have applied different techniques to develop classifiers and hybrid models [12], [20]. There are still a number of issues which may prevent accurate prediction of heart disease, such as limited medical datasets, feature selection, ML algorithm applications, and a lack of in-depth analysis. Our research aims to address some of these research gaps to develop a better model for CVD prediction. In this research, five datasets are combined, increasing the total size of the dataset. Two selection techniques, Relief and LASSO, are utilized to extract the most relevant features based on the rank values in medical references. This also helps to deal with overfitting and underfitting problems of machine learning.

In this study, various supervised models such as AdaBoost (AB), Decision Tree (DT), Gradient Boosting (GB), K-Nearest Neighbors (KNN), and Random Forest (RF), together with hybrid classifiers, are applied. Results are compared with existing studies.

The flow of the paper is as follows: Section II describes the aim and scope of this research. Section III provides an overview of related literature on the prediction of heart disease with various classifiers and hybrid approaches. Subsequently, Section IV details the proposed system and various performance metrics. The process of data preparation, preprocessing and the hybrid algorithms (Bagging and Boosting methods) is explained in Section V. Section VI describes the implementation of the system and the results. The statistical significance of the results, runtime and computational complexity, and hyper-parameter tuning are discussed in Sections VIII to X, respectively. Limitations of the proposed approach are briefly discussed in Section XI, and recommendations for future work and the conclusion are given in Section XII.

II. RESEARCH AIM AND SCOPE OF THE PAPER
The aim of this research is to develop an effective method to predict heart disease, in particular Coronary Artery Disease or Coronary Heart Disease, as accurately as possible. Required steps can be summarized as follows:
1) Five datasets are combined to develop a larger and more reliable dataset.
2) Two selection techniques, Relief and LASSO, are utilized to extract the most relevant features based on rank values in medical references. This also helps to deal with overfitting and underfitting problems of machine learning.
3) Additionally, various hybrid approaches, including Bagging and Boosting, are implemented to improve the testing rate and reduce the execution time.
4) The performance of the different models is evaluated based on the overall results with All, Relief, and LASSO selected features.

III. LITERATURE REVIEW
The application of artificial intelligence and machine learning algorithms has gained much popularity in recent years due to the improved accuracy and efficiency of making predictions [25]. The importance of research in this area lies in the possibility to develop and select models with the highest accuracy and efficiency [26]. Hybrid models which integrate different machine learning models with information systems (major factors) are a promising approach for disease

prediction [27]. Various available public data sets are applied. In the study of Latha and Jeeva [28], an ensemble technique was applied for improved prediction accuracy. Using bagging and boosting techniques, the accuracy of weak classifiers was increased, and the performance for risk identification of heart disease was considered satisfactory. They used the majority voting of Naïve Bayes, Bayes Net, C4.5, Multilayer Perceptron, PART and Random Forest (RF) classifiers in their study for the hybrid model development. An accuracy of 85.48% was achieved with the designed model. More recently [29], machine learning and conventional techniques like RF, Support Vector Machine (SVM), and learning models were tested on the UCI Heart Disease dataset. The accuracy was improved by the voting-based model, together with multiple classifiers. The study showed that for the weak classifiers, an improvement of 2.1% was achieved. In the study of N. K. Kumar and Sikamani [30], different machine learning classification techniques were used to predict chronic disease. In their study, the Hoeffding classifier achieved an accuracy of 88.56% in CVD prediction.

Ashraf et al. [15] used both individual learning algorithms and ensemble approaches like Bayes Net, J48, KNN, multilayer perceptron, Naïve Bayes, random tree, and random forest for prediction purposes. Of these, J48 had an accuracy of 70.77%. They subsequently employed newer techniques, of which Keras obtained an 80% accuracy. A multi-task (MT) recurrent neural network with an attention mechanism was proposed to predict the onset of cardiovascular disease [16]. The proposed model achieved an Area under Curve (AUC) increase of between 2 and 6%.

In the study of Amin et al. [12], the critical risk factors were identified, machine learning models were applied (k-NN, DT, NB, LR, SVM, Neural Network, and a hybrid of voting with NB and LR) and a comparative analysis was performed. The outcome of their study indicates that the hybrid model, together with the selected attributes, achieved an accuracy of 87.41%. The mean Fisher score feature selection algorithm (MFSFSA) together with the SVM classification model was used in the technique proposed by Saqlain et al. [31]. By using an SVM they obtained the selected feature subset, and they used a validation process for MCC calculation. The features were selected based on a higher than average Fisher score. The combination of MFSFSA and SVM resulted in 81.19% accuracy, a 72.92% sensitivity, and an 88.68% specificity.

In the research work of Mienye et al. [22], a prediction model for heart disease was proposed which involves a mean-based splitting method and classification and regression trees, used for randomly partitioning the dataset into smaller subsets. Afterwards, using an accuracy-based weighted classifier ensemble, a homogeneous ensemble was generated with classification accuracies of 93% and 91% on the Cleveland and Framingham test sets. A two-tier ensemble-based coronary heart disease (CHD) detection model [24] was proposed in the study of Tama et al. Three different ensemble learners were used: random forest, gradient boosting machine, and extreme gradient boosting machine. The proposed model provides accuracy, F1, and AUC values of 98.13%, 96.6%, and 98.7%, respectively, which exceeded other existing CHD detection methods.

A novel prediction model was introduced in the paper of Mohan et al. [32] with different combinations of features and several known classification techniques. An ANN with backpropagation and 13 clinical features as the input was used in the proposed HRFLM. DT, NN, SVM, and KNN were considered while making use of the data mining methods. SVM was useful for enhanced accuracy in disease prediction. The novel method Vote, in conjunction with a hybrid approach using LR and NB, was proposed. An accuracy of 88.7% was obtained with the HRFLM method.

An improved random survival forest (iRSF) with high accuracy was used for the development of a comprehensive risk model in predicting heart failure mortality [33]. iRSF could discriminate between survivors and non-survivors using the novel split rule and the stop criteria. Patient demographics, clinical and laboratory information, and medications were included in the 32 risk factors for the development of predictors. A data mining approach to detect cardiovascular disease has also been applied [34]. The Decision Tree, Bayesian classifiers, neural networks, association rules, SVM, and KNN data mining algorithms were used to detect heart diseases. SVM resulted in an accuracy of 99.3%.

In works related to the prediction of patient survival [35], several machine learning classifiers were utilized. Features relating to the significant risk factors were ranked and a comparison was performed between the traditional biostatistics tests and the provided machine learning algorithms. The result was that serum creatinine and ejection fraction were demonstrated to be the two most relevant features for accurate predictions. A model for CVD detection was developed with the AL Algorithm [36]. The dataset preparation and investigation was done with four algorithms. The precision was 99.83% for the Decision Tree and Random Forest methods, and 85.32% and 84.49% respectively for SVM and KNN. Congestive heart failure (CHF) was effectively predicted using the ensemble method in another study [37] by analyzing the heart rate variability (HRV) and using deep neural networks to solve the gap in related fields. The accuracy of the proposed system was 99.85%.

Yadav and Pal [3] used the UCI repository for their study. This dataset contains 14 attributes. The classification was carried out by four tree-based classification algorithms: M5P, Random Tree, Reduced Error Pruning, and the Random Forest ensemble method. The Pearson Correlation, Recursive Features Elimination, and Lasso Regularization were the three feature-based algorithms used in this work. The methods were then compared for accuracy and precision. The last method achieved the best performance. In recent work [38], Gupta et al. utilized the factor analysis of mixed data (FAMD) and RF-based MLA for developing a machine intelligence framework. RF was used for the prediction of disease by finding the relevant features using the FAMD. The proposed method achieved a 93.44% accuracy, an 89.28% sensitivity and a 96.96% specificity.

Rashmi et al. [40] experimented on 303 records extracted from the Cleveland dataset. The proposed algorithm, Decision Tree, obtained 75.55% accuracy. Dinesh et al. [41] examined 920 records (Cleveland, Long Beach VA, Switzerland, and Hungarian) from the UCI machine learning repository. Random forest achieved 80.89% accuracy; on the other hand, Saqlain obtained 68.6% accuracy on the AFIC dataset [49]. Sharma et al. [43] and Dwivedi et al. [50] applied the K-Nearest Neighbors algorithm to the same dataset. The results were 90.16% and 80% respectively. An accuracy of 46% was recorded by Enriko [48] when using the Kita Hospital Jakarta (450) dataset. An improved result of 56.13% was obtained using AdaBoost on the Cleveland dataset by Kaur et al. [51]. Shetty et al. [45] achieved 89% accuracy using the 270 records from the Statlog dataset, and Chaurasia et al. [39] used the same dataset with a Boosting hybrid approach, resulting in an accuracy of 75.9%. The UCI laboratory dataset was also used to evaluate the performance of the Boosting ensemble technique. Cheng et al. obtained an accuracy of 82.5% with an ANN model [46], and Chaurasia et al. obtained 78.88% accuracy using a hybrid model [39]. Using the Gradient Boosting technique, Dinesh et al. [41] obtained 84.27% accuracy using a combination of 4 different datasets, whereas Bhuvaneeswari et al. [53] achieved 95.19% accuracy using 583 records from the Cleveland and Statlog datasets. A survey result was generated on the Rajaie cardiovascular medical dataset [44] using the hybrid approach, resulting in a 79.54% accuracy. On the other hand, the Bagging approach of Decision Tree [52] achieved more than 85.03% accuracy. Three different datasets were combined into one to obtain a more accurate result. A hybrid approach by Mohan et al. [42] achieved an accuracy of 88.4%. Latha et al. [39] used 303 records of the Cleveland heart disease dataset with a Bagging approach and obtained 80.53% accuracy. Tan et al. [47] experimented on 303 records collected from the Cleveland Heart disease dataset with a hybrid approach and obtained 84.07% accuracy, while Latha et al. [39] achieved 85.48%.

Various techniques have been implemented on data of cardiovascular disease patients. Data are processed such that the K-Nearest Neighbors algorithm handles the missing data. The feature selection process is done using Relief and LASSO. Various machine learning algorithms are implemented using the Bagging and Boosting approaches. One of the goals of the proposed approach is to analyze the accuracy and error rates of the algorithms in order to determine the best features.

IV. RESEARCH METHODOLOGY
This section gives an overall explanation of how an intelligent machine learning system is built over the dataset of chronic heart disease.

A. OVERVIEW OF THE PROPOSED MODEL
The dataset is constructed by combining five different datasets (Cleveland, Hungary, Switzerland, VA Long Beach, and Statlog). This is included in the framework. Fig. 1 illustrates the workflow of the recommended models. During data preprocessing, the combined dataset is analyzed to check for missing values, which are then dealt with by the K-Nearest Neighbors imputation technique. To overcome overfitting issues and avoid long execution times, two different feature selection techniques are utilized: Relief and LASSO. This assists in extracting the best features. The performance of the classifiers with the features selected by these techniques, as well as with the original features, is analyzed. After feature selection, the dataset is split into two parts: training and testing. Based on model learning rates, 80% of the data is assigned to the training phase, and the remaining 20% to the testing phase. All ensemble models with classifiers are implemented to make a comparison over the combined dataset; however, the generated outcome of our model is obtained within a short period. Different models are trained and tested on the dataset so that the best model for our dataset can be picked. The process resulted in RFBM being the most useful, with an accuracy of 99.05%. Furthermore, the most suitable features for identifying a patient affected by heart disease are suggested by this diagnosis system.

B. PERFORMANCE MEASURE INDICES
The effectiveness and accuracy of the machine learning method can be evaluated using performance indicators. A positive classification occurs when a person is classified as having HD. When a person is not classified as having HD, they have a negative classification. Formulas (1) to (7) are used to compute these indices [54], [55].
TP = True Positive (when the model correctly identified a patient as having HD).
TN = True Negative (when the model correctly identified the opposite class, such as patients truly having no heart issues).
FP = False Positive (when the model incorrectly identified HD patients, i.e., identifying non-HD patients as HD patients).
FN = False Negative (when the model incorrectly identified the opposite class, such as HD patients as normal patients).

$$\text{Accuracy (Acc)} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Recall or Sensitivity (Sen)} = \frac{TP}{TP + FN} \tag{3}$$

$$\text{F1-score} = \frac{2\,(\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} \tag{4}$$

$$\text{False Positive Rate} = \frac{FP}{FP + TN} \tag{5}$$

$$\text{False Negative Rate} = \frac{FN}{TP + FN} \tag{6}$$

$$\text{Negative predictive value} = \frac{TN}{TN + FN} \tag{7}$$
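The indices in (1) to (7) follow directly from the four confusion-matrix counts. The sketch below is only an illustration of those formulas: the function name `classification_metrics` and the example counts are hypothetical, not taken from the paper, and all denominators are assumed to be non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the indices of Eqs. (1)-(7) from confusion-matrix counts.

    Assumes every denominator is non-zero (hypothetical helper, not the
    authors' code).
    """
    precision = tp / (tp + fp)                      # Eq. (2)
    recall = tp / (tp + fn)                         # Eq. (3)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),              # Eq. (1)
        "precision": precision,
        "sensitivity": recall,
        "f1": 2 * (precision * recall) / (precision + recall),    # Eq. (4)
        "fpr": fp / (fp + tn),                                    # Eq. (5)
        "fnr": fn / (tp + fn),                                    # Eq. (6)
        "npv": tn / (tn + fn),                                    # Eq. (7)
    }


# Made-up counts for a test set of 200 patients (illustration only).
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that sensitivity and the False Negative Rate share the denominator TP + FN, so they always sum to 1.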

FIGURE 1. Working diagram of proposed model.
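The workflow of Fig. 1 (K-Nearest Neighbors imputation, feature selection, an 80/20 split, and an ensemble classifier) can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the data are synthetic (`make_classification`), `relief_weights` is a hypothetical helper implementing a basic two-class Relief update, RFBM is interpreted here as scikit-learn's `BaggingClassifier` wrapped around a `RandomForestClassifier`, and all hyper-parameters (neighbour count, iteration count, number of trees) are arbitrary. Keeping 10 features mirrors the 10 Relief-selected features mentioned in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def relief_weights(X, y, n_iter, rng):
    """Basic two-class Relief (hypothetical helper).

    For a random instance, features that differ on the nearest miss gain
    weight; features that differ on the nearest hit lose weight.
    """
    n_samples, n_features = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                      # guard constant features
    weights = np.zeros(n_features)
    for _ in range(n_iter):
        i = rng.integers(n_samples)
        diff = np.abs(X - X[i]) / span         # range-normalised differences
        dist = diff.sum(axis=1)
        dist[i] = np.inf                       # exclude the instance itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        weights += (diff[miss] - diff[hit]) / n_iter
    return weights


def run_workflow(seed=0, n_selected=10):
    rng = np.random.default_rng(seed)
    # Synthetic stand-in for the combined dataset: 13 inputs, binary target.
    X, y = make_classification(n_samples=300, n_features=13,
                               n_informative=6, random_state=seed)
    X[rng.random(X.shape) < 0.05] = np.nan     # inject missing values

    # Step 1: fill missing values from the nearest neighbours.
    X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)

    # Step 2: rank features with Relief and keep the top-weighted ones.
    # (The text selects features before splitting; this sketch follows that order.)
    weights = relief_weights(X_imputed, y, n_iter=100, rng=rng)
    top = np.argsort(weights)[::-1][:n_selected]
    X_selected = X_imputed[:, top]

    # Step 3: 80% training, 20% testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X_selected, y, test_size=0.2, stratify=y, random_state=seed)

    # Step 4: bagging over random-forest base learners (assumed reading of RFBM).
    model = BaggingClassifier(
        RandomForestClassifier(n_estimators=20, random_state=seed),
        n_estimators=5, random_state=seed)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return {"X_imputed": X_imputed, "X_selected": X_selected,
            "y_test": y_test, "accuracy": accuracy}


result = run_workflow()
print(f"test accuracy on synthetic data: {result['accuracy']:.3f}")
```

The accuracy printed here says nothing about the paper's reported results, since the data are synthetic; the sketch only shows how the stages connect.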

C. APPLICATION OF THE PROPOSED MODEL
Having a suitable application of the proposed model is key to the development of this unique system and will also help to deal with real-world challenges. The process is illustrated in this section. Fig. 2 depicts how a community health center can put the system to use; the following steps describe the procedure.
• Step 1: Reports are uploaded into the database.
• Step 2: Attributes are selected from the uploaded data to create input for the trained RFBM model.
• Step 3: Selected attributes are processed in the trained model.
• Step 4: Output is generated in terms of 0 and 1.
◦ 0 = A person is less prone to CVDs.
◦ 1 = A person is prone to CVDs.
• Step 5: If ‘1’, notify or request the person to consult a doctor or take additional tests.
• Step 6: Data uploaded to the database is used to create the trained model, further improving the accuracy of the hybrid classifiers and the trained model.

FIGURE 2. Suitable application of proposed model.

D. JUSTIFICATION OF THE PROPOSED TECHNIQUE
This intelligent system has been developed based on five classifiers. Subsequently, we used ensemble techniques such as bagging and boosting, retaining those algorithms as base classifiers. Numerous studies have already been conducted on different types of machine learning algorithms. Among them, we picked the three most common techniques (DT, RF and KNN) and two less common techniques (AB and GB). Some of the previous studies have shown that the predictive accuracy of the DT [1], RF [1], [2] and KNN [3] algorithms was quite high compared to other existing techniques. Additionally, a limited number of studies also demonstrated that AB [5], [6] as well as GB [53] can perform rather well, with considerably high accuracy. Our paper highlights some of the notable research attempts that deployed Bagging and Boosting ensemble techniques, as well as some that proposed hybrid frameworks; however, none of those research attempts closely resembled our introduced approaches as a base classifier except DT [8] and kNN [7]. As a consequence, in this work, all of those previous approaches have been further explored with the help of ensemble techniques to make the proposed model more efficient. Although the Literature Review shows that the propositions put forward in [1], [5], [24] and [27] yielded promising predictive accuracy, it was not high enough in comparison to our work.

Basically, we felt the need to improve on the current studies in this field and analyzed previous models to determine what might be lacking, after which we took the initiative to devise a solution that might reshape the current ideas and provide an acceptable level of results that makes the system suitable for practical implementation.

As has been discussed before, previous works that are somewhat related to this study and deal with the datasets used here are available; however, the performance of those systems was not as expected in most cases.

We believe one reason for the lack of performance of some systems is the inability of those systems to identify the most important and highly correlated features. We want to develop a method that will first identify the optimal group of features and then identify the algorithms that work best with those features.

In our understanding, algorithms that performed well benefitted from the tightly correlated feature set, mainly derived from the use of Relief, whereas the algorithms that did not show strong performance could not properly evaluate the correlative structure among the features used.

Fig. 3 is based on the 10 features most highly correlated with the predicted attribute (num), as selected by the Relief feature selection technique. On the right side, the attribute values are shown (from 0.3 to −0.4). From Fig. 3, it is clearly seen that the ca, chol and trestbps features have a strong relationship with age, where the value was approximately 0.3; on the other hand, the lowest correlation was observed for thalach, at about −0.4. Similarly, cp shows a significant correlation with exang. However, the correlation values among other features were not so high and fluctuated between 0.15 and −0.3.

FIGURE 3. Highly correlated features of Relief approach.

V. IMPLEMENTATION
A. DIFFERENT MACHINE LEARNING LIBRARIES
The implemented model is written in Python in a Jupyter notebook, using simple libraries like pandas [56], Pyplot [57] and Scikit-learn [58].

B. DATASET
Data is considered the first and most basic aspect of using machine learning techniques to get accurate results. The dataset used is gathered from a well-known data repository, the ‘UCI machine learning repository’. There are five different datasets: the Cleveland, Hungary, Switzerland, VA Long

19310 VOLUME 9, 2021


P. Ghosh et al.: Efficient Prediction of Cardiovascular Disease Using Machine Learning Algorithms

TABLE 1. Value range in dataset.

FIGURE 4. Actual data points in the datasets.

Beach [59], and Statlog heart disease dataset [60]. We have combined all of them in this research to obtain more accurate outcomes. More than 1190 cases are collected as a text file along with 14 special features from their database. Thirteen attributes of these combined datasets are taken as diagnosis inputs, whereas the ‘num’ attribute is selected as output. Six features which are considered relevant in medical literature were present in all or most records: age in years (age), sex (sex), resting blood pressure (trestbps), fasting blood sugar (fbs), chest pain type (cp), and resting electrocardiographic results (restecg). Table 1 describes the different attributes and the range of values.

The value of the ‘num’ attribute can be 0, 1, 2, 3 or 4. The value ‘0’ represents that a patient does not have heart disease and the values from 1 to 4 reflect the various stages of chronic heart disease.

An overview of the total number of patients for each value of the num attribute in the combined dataset is shown in Fig. 4.

Since the purpose of this research is to predict whether or not a patient is suffering from heart disease, we convert all values in the range of 1 to 4 to 1. This means that the attribute is now binary, taking only the values 0 and 1.

C. AN OVERVIEW OF DATA PREPROCESSING AND CLEANING TECHNIQUES
A large amount of data is gathered in the modern world via the internet, surveys, experiments, and other sources. However, the data to be used often contain missing values, noise, and distortions. The combined dataset used for this research also contains missing or null values. Popular techniques such as imputation and deletion can be used to deal with missing values. In our dataset, this problem is resolved by using the K-Nearest Neighbors [62] imputation method. Before machine learning algorithms can be applied, the data must also be normalized or standardized. Standardization converts the data to a mean (µ) of 0 and a standard deviation (σ) of 1. The conversion formula (8) is given below [63]:

$$X' = \frac{X - \mu}{\sigma} \tag{8}$$

D. FEATURE SELECTION TECHNIQUES
Feature selection techniques are important for the machine learning procedure as the best attributes for classification need to be extracted. This also helps to reduce the execution time. We have selected two algorithms: Relief feature selection and the Least Absolute Shrinkage and Selection Operator.

1) RELIEF FEATURE SELECTION TECHNIQUE
Relief is an attribute selection algorithm that assigns a weight to every feature in the dataset. These weights are then updated gradually [64]. The aim is to ensure that the important features have a large weight and that the remaining features have low weights. Relief uses techniques similar to those of KNN to determine feature weights. This well-known feature selection algorithm was introduced by Kira and Rendell [65]. Let Ri be a randomly selected instance. Relief searches for its two nearest neighbours: one from the same class, called the closest hit H, and one from the opposite class, called the closest miss M. It adjusts the quality estimate W[A] for feature A according to the values of Ri, M, and H. A large difference between Ri and H is not desirable, so the value W[A] is reduced. On the

VOLUME 9, 2021 19311


P. Ghosh et al.: Efficient Prediction of Cardiovascular Disease Using Machine Learning Algorithms

FIGURE 5. The working techniques of ensemble process.

other hand, if there is a large difference between Ri and M for attribute A, then A may be used to distinguish different classes, so the weight W[A] is increased. This process is repeated m times, where m is a parameter that can be adjusted.

2) LEAST ABSOLUTE SHRINKAGE AND SELECTION OPERATOR ALGORITHM (LASSO)
The selection and shrinkage functionality of this operator depends on modifying the absolute values of the feature coefficients. Some coefficient values of the features are shrunk to zero, and features with negative coefficients can also be removed from the subset of features. The LASSO has a very good performance for feature values with small coefficients. Features which have large coefficient values will be available in the chosen subsets of features. Unnecessary features can be found with LASSO [66]. Moreover, the reliability of the selection can be enhanced by repeating the above procedure many times and eventually taking the most frequently found features as the most important ones. This is called the randomized LASSO, which was introduced by Meinshausen and Bühlmann in 2010 and Wang in 2011 [67]. It should be implemented on a powerful computer as it uses parallel programming. It also demonstrates its realization for the present application, where q−i represents the vector of the related ith sub-region keys.

E. ENSEMBLE METHODS OF MACHINE LEARNING
Ensemble techniques combine multiple Decision Tree classifiers to achieve better classification results than a single Decision Tree classifier. The core idea behind the ensemble method is that a combination of weak learners can work together to form a strong learner, thus improving the model’s accuracy and precision [39]. Fig. 5 depicts the ensemble process [39]. When we seek to identify the target feature using any machine learning method, the key reasons for the difference between real and identified outcomes are noise, uncertainty, and bias. Ensemble techniques assist in dealing with some of these variables, particularly uncertainty and bias. In this study, we apply two ensemble techniques, Bagging and Boosting, to obtain more accurate results. These techniques are explained below.

1) BAGGING TECHNIQUE
Bagging is used when the goal is to reduce the variance of Decision Tree classifiers. The objective is to create several subsets of data from the training samples [68]. Each randomly chosen subset is used to train its own Decision Tree. As a result, we get an ensemble of different models. The average of all predictions from the different trees is then used. This is more robust than a single Decision Tree classifier. It helps not only to reduce the overfitting problem but also to handle higher dimensionality data properly. It resolves missing data issues and maintains accuracy. The process of the Bagging method is described in Pseudocode 1 and Fig. 6. With the help of the Bagging technique, three ensemble hybrid models, based on DT, RF, and KNN, are constructed. The three hybrid models, DTBM, RFBM, and KNNBM, are applied in both the training and the testing phase.

FIGURE 6. Bagging method.

2) BOOSTING TECHNIQUE
Boosting is an iterative process in which each step depends on the previous prediction and changes the instance weights accordingly. Fig. 7 is added to better illustrate the workflow.

If an instance is incorrectly classified, its weight is increased. Usually, Boosting constructs good predictive models [69]. It generates different loss functions and works by combining the weak models to boost their performance. For this research, we have applied the Boosting technique to two classification algorithms, AB and GB, to construct our hybrid models. The resulting ABBM and GBBM are applied in both the training and testing phases.
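The Bagging idea described above (bootstrap samples, one model per sample, combined prediction) can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the one-feature threshold "stump" learner, the toy data, and the names `train_stump`, `bagging_fit`, and `bagging_predict` are assumptions made for the example and are not the DTBM, RFBM, or KNNBM models of this study. Majority voting is used here as the combination rule for class labels.

```python
import random
from collections import Counter

def train_stump(samples):
    """Fit a one-feature threshold classifier (hypothetical weak learner).

    samples: list of (x, label) pairs with numeric x and label in {0, 1}.
    Returns (threshold, label_if_above) with the fewest training errors.
    """
    best = None
    for threshold in sorted({x for x, _ in samples}):
        for above in (0, 1):
            errors = sum(
                (above if x >= threshold else 1 - above) != y
                for x, y in samples
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    return best[1], best[2]

def stump_predict(stump, x):
    threshold, above = stump
    return above if x >= threshold else 1 - above

def bagging_fit(train, n_models, rng):
    """Train n_models stumps, each on a bootstrap sample drawn with replacement."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]
        models.append(train_stump(sample))
    return models

def bagging_predict(models, x):
    """Combine the ensemble members by majority vote."""
    votes = Counter(stump_predict(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Toy, linearly separable data (illustrative only).
rng = random.Random(0)
train = [(x, 0) for x in (1, 2, 3, 4, 5)] + [(x, 1) for x in (6, 7, 8, 9, 10)]
ensemble = bagging_fit(train, n_models=25, rng=rng)
predictions = [bagging_predict(ensemble, x) for x in (0, 2, 9, 12)]
```

An odd number of ensemble members is chosen so that a two-class majority vote cannot tie.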


Pseudocode 1 Pseudocode for Bagging Method
BEGIN
1. Let D = {d1, d2, d3, ..., dn} be the given dataset
2. E = {}, the set of ensemble classifiers
3. C = {c1, c2, c3, ..., cn}, the set of classifiers
4. X = the training set, X ⊂ D
5. Y = the test set, Y ⊂ D
6. L = n(D)
7. for i = 1 to L do
8.    S(i) = {Bootstrap sample I with replacement}, I ⊂ X
9.    M(i) = Model trained using C(i) on S(i)
10.   E = E ∪ C(i)
11. next i
12. for i = 1 to L
13.   R(i) = Y classified by E(i)
14. next i
15. Result = max(R(i) : i = 1, 2, ..., n)
END

FIGURE 7. Boosting method.

F. PROPOSED APPROACH FOR THE CLASSIFICATION MODEL
This section discusses the machine learning approaches that are used in this research to generate an intelligent prediction system for heart disease.

1) DECISION TREE
The Decision Tree algorithm, used here with two classes (numClasses = 2), is one of the most powerful and well-known predictive instruments [70]. Every interior node in the structure of a Decision Tree refers to a test on a property, every branch corresponds to a test outcome, and each leaf node is a separate class [71], [87]. ‘Learning’ based on a Decision Tree (DT) often applies an upside-down, tree-based progression technique. The algorithm is capable of resolving both classification and regression problems. The tree grows from the root node by determining a ‘Best Feature’ or ‘Best Attribute’ from the set of attributes available at hand; ‘splitting’ is then applied. Selection of the ‘Best Attribute’ is often carried out through the calculation of two other metrics: ‘Entropy’, as shown in (9), and Information Gain, shown in (10). The ‘best attribute’ is the one that provides the most useful information. Entropy indicates how homogeneous the dataset is, and Information Gain is the rate of increase or decrease in Entropy of attributes [100].

$$E(D) = -P(\text{positive}) \log_2 P(\text{positive}) - P(\text{negative}) \log_2 P(\text{negative}) \tag{9}$$

Equation (9) calculates the Entropy E of a dataset D, which holds the positive and negative ‘Decision Attributes’.

$$\text{Gain}(\text{Attribute } X) = \text{Entropy}(\text{Decision Attribute } Y) - \text{Entropy}(X, Y) \tag{10}$$

Non-parametric supervised learning methods, such as C4.5, are used for classification and regression. The aim of the method is to develop a model that predicts the value of the dependent variable by learning basic rules for decision making.

Baihaqi et al. [73] applied the C4.5 classifier to diagnose CAD and obtained 78.95% accuracy. However, the C4.5 classifier is usually not suited to small datasets. The RF classifier (described below) may perform better [74] for heart disease detection, or alternatively the combining strategy using bagged decision trees [75].

2) RANDOM FOREST
The Random Forest (RF) classifier is an ensemble algorithm [76]. This implies that it consists of more than one algorithm. In this case, it consists of several DT algorithms [77]. RF builds up an entire forest from several uncorrelated and random Decision Trees during the training phase [101]. Ensemble learning methods employ multiple learning algorithms to generate an optimal predictive model, which can provide better results than any of the individual models’ predictions [101]. Computational complexity may increase as RF uses more features than a standalone DT, but it generally has a higher accuracy when dealing with unseen datasets. The result of the Random Forest algorithm is the mean result of the total number of Decision Tree algorithms. Fig. 8 gives a graphical illustration of Random Forest [87].

The Random Forest ensemble classifier builds and integrates multiple decision trees to get the best result. It primarily refers to tree learning through aggregating bootstraps. Let the provided data be X = {x1, x2, x3, ..., xn} with responses Y = {y1, y2, y3, ..., yn}, with a lower limit of b = 1 and an upper limit of B. The prediction for sample x0 is made by averaging the predictions $\sum_{b=1}^{B} f_b(x_0)$ from every


individual tree for x0, as shown in (11).

$$j = \frac{1}{B} \sum_{b=1}^{B} f_b(x_0) \tag{11}$$

FIGURE 8. Random Forest algorithm.

The Random Forest (RF) classifier, a combination of many different tree predictors, is often used for the analysis of big data. It is an ensemble learning method for grouping, regression, and other functions.

Banerjee et al. [79] successfully applied the RF classifier using time-frequency characteristics from PCG signals to identify heart disease.

3) K-NEAREST NEIGHBORS
K-Nearest Neighbors (n_neighbors = 5) is amongst the most common classification techniques in the field of machine learning. It has previously been used for coronary artery disease. KNN is considered nonparametric since the method does not use data distribution assumptions. KNN considers the similarity between the new data and the existing data and places the new data in the class that is nearest to the existing classes. KNN is used for regression problems as well as for recognition problems. It is also known as the lazy learner algorithm [80] as it does not immediately learn from a collection of training data. KNN calculates the Euclidean distance between new data A(x1, y1) and previously accessible data B(x2, y2), using equation (12) [81].

$$d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{12}$$

The Euclidean formula may be used to evaluate the distance between two data points (x1, y1) and (x2, y2) in two-dimensional space. KNN puts the new data into the class which has the least Euclidean distance to the new data.

Previous research [82] has used KNN as an automated classification technique for coronary artery disease. When conducting linear discriminant analysis, KNN had a better accuracy than SVM and NN [85]. Rajkumar and Reena obtained an accuracy of just 45.67% [83] using KNN to diagnose CAD. However, Gilani et al. [84] subsequently compared the F1 score with many classification models and found that the KNN classifier performed best among the seven classifiers. A limitation of the method is that, due to the high computational complexity, KNN is not appropriate for implementation in a low-power or real-time environment.

On a different note, in place of using Euclidean distance, Suryawanshi and Sharma [102] have shown that ‘Spearman Correlation’ [103] can also be employed as the distance measure for KNN-based classification, as shown in (13). P and Q are the training and testing tuples respectively, while n is the total number of observations. The values of fij usually lie between 1 and −1.

$$f_{ij} = 1 - \frac{6 \sum_{i=1}^{n} \left( \operatorname{rank}(P_i) - \operatorname{rank}(Q_j) \right)^2}{n \left( n^2 - 1 \right)} \tag{13}$$

The changes have demonstrated some enhancements over the regular KNN model, with nearly 50% improvement in accuracy (97.44% with an 80%-20% train and test ratio).

4) ADABOOST
AdaBoost, or Adaptive Boosting, is a Boosting algorithm that is used for binary classification and combines a number of weak classifiers to make a more robust classifier [86]. This algorithm produces the predicted accuracy based on 1000 samples. The training dataset instances are weighted with a starting weight [87] as shown in (14).

$$\text{Weight}(x_i) = \frac{1}{N} \tag{14}$$

where N is the number of training instances, and xi is the ith training instance. The decision stump gives an output for each input variable. The misclassification rate is then calculated using equation (15).

$$\text{Error} = \frac{N - \text{correct}}{N} \tag{15}$$

where N is the number of training instances and correct is the number of correctly classified instances. Boosting simply means combining several simple learners to achieve a more accurate prediction. AdaBoost (Adaptive Boosting) sets the weights, which vary for both samples and classifiers [88]. This causes the classifiers to focus on instances that are relatively difficult to identify accurately. The final classification formula is shown in equation (16).

$$H_K(p) = \operatorname{sign}\left( \sum_{k=1}^{K} a_k h_k(p) \right) \tag{16}$$

Equation (16) is a linear combination of all the weak classifiers (simple learners), where K is the total number of weak classifiers, hk(p) is the output of weak classifier k (this can be either −1 or 1), and ak is the weight of classifier k.
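One round of the AdaBoost procedure above can be sketched as follows. The sketch uses equation (14) for the starting weights and equation (16) for the final vote; the classifier weight a_k = 0.5 ln((1 − error)/error) and the exponential weight update are taken from the standard AdaBoost formulation and are an assumption here, since the text does not spell them out. The labels and the weak learner's outputs are hypothetical toy values.

```python
import math

def initial_weights(n):
    """Equation (14): every training instance starts with weight 1/N."""
    return [1.0 / n] * n

def weighted_error(weights, predictions, labels):
    """Sum of the weights of misclassified instances (labels are -1 or +1)."""
    return sum(w for w, p, y in zip(weights, predictions, labels) if p != y)

def classifier_weight(error):
    """Assumed standard AdaBoost form: a_k = 0.5 * ln((1 - error) / error)."""
    return 0.5 * math.log((1.0 - error) / error)

def update_weights(weights, predictions, labels, alpha):
    """Raise the weights of misclassified instances, then renormalize to sum to 1."""
    raw = [w * math.exp(-alpha * p * y)
           for w, p, y in zip(weights, predictions, labels)]
    total = sum(raw)
    return [w / total for w in raw]

def ensemble_predict(alphas, weak_outputs):
    """Equation (16): sign of the weighted sum of weak outputs (each -1 or +1)."""
    score = sum(a * h for a, h in zip(alphas, weak_outputs))
    return 1 if score >= 0 else -1

# Hypothetical toy round: the weak learner is wrong on the last instance only.
labels = [1, 1, -1, -1, 1]
stump_out = [1, 1, -1, -1, -1]
w0 = initial_weights(len(labels))
err = weighted_error(w0, stump_out, labels)
alpha = classifier_weight(err)
w1 = update_weights(w0, stump_out, labels, alpha)
```

After the update, the single misclassified instance carries half of the total weight, which is what forces the next weak classifier to concentrate on it.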


5) GRADIENT BOOSTING
Gradient Boosting is a Boosting algorithm that required only 100 samples and is used for classification and regression problems [89]. Gradient Boosting consists primarily of three factors [90]: an enhanced loss function, a weak learner to make predictions, and an additive model to combine weak learners to minimize the loss function [91]. Gradient Boosting is a technique that can increase the algorithm’s efficiency by eliminating overfitting.

The application of gradient tree Boosting to the Tobit model, called the ‘Grabit’ model, helps to improve the accuracy when there is an imbalance between the numbers in each class. Boosting uses base methods, also known as regression tree learners, to obtain higher predictive precision on a large variety of datasets, e.g. [92], but it relies on familiarity with a specific area. The distinction between the Boosting process and traditional machine learning is that the optimization is carried out in function space. The optimal function F(X) is obtained after the m-th iteration [93], as per (17):

$$F(X) = \sum_{i=0}^{m} f_i(x) \tag{17}$$

where fi(x) (i = 1, 2, ..., M) indicates the function increments, with $f_i(x) = -\rho_i \times g_m(X)$. The latest base learner is the one most strongly correlated with the negative gradient of the loss function [94]. The negative gradient for the m-th iteration is given by (18).

$$g_m = -\left[ \frac{\partial L(y, F(X))}{\partial F(X)} \right]_{F(X) = F_{m-1}(X)} \tag{18}$$

where gm is the direction in which the loss function decreases most rapidly when F(X) = Fm−1(X) [93]. A new decision tree aims to correct the error made by its previous base learner. The model is then updated as in (19).

$$F_m(X) = F_{m-1}(X) + \rho_m \times h_m(X, \alpha_m) \tag{19}$$

In this system, several classifiers with ensemble techniques, including DTBM, RFBM, KNNBM, ABBM and GBBM, have been applied to compare these algorithms. Using DT as a base classifier does not always help to get a higher accuracy. The highest accuracy using ensemble techniques was achieved by using RFBM in our prediction system.

VI. RESULTS AND DISCUSSION
A. OUTCOMES OF FEATURE SELECTION PROCESSES
Relief [95], a feature selection algorithm, selects the main features based on the weight of the data. The seven most important input features selected by Relief are given in Table 2. According to the findings, the most important feature for predicting heart disease is serum cholesterol (chol), with a rank score of 0.869.

TABLE 2. Features selected by Relief algorithms and their rankings.

The LASSO treats closely related features as true, and the rest as false. After applying the LASSO, chest pain (cp) had the highest rank score (0.0796), whereas maximum heart rate (thalach) had a very low score.

Table 3 shows the score of the eight most essential features selected by LASSO for diagnosing heart disease.

TABLE 3. Features selected by LASSO algorithms and their rankings.

B. COMPARISON OF VARIOUS ALGORITHMS AND HYBRID APPROACHES ON THE DIFFERENT FEATURES
This section compares the outcomes of the different classification models with the different input features. First, five machine learning classifiers and five hybrid techniques were applied to all features of the heart disease dataset. Secondly, the Least Absolute Shrinkage and Selection Operator feature selection algorithm (LASSO) was implemented to extract some relevant features, and the same five machine learning classifiers and five hybrid techniques were applied again. Finally, the most important features selected by the Relief model were used as input to the classifiers and hybrid methods. Different performance metrics are also used to evaluate the predicted outcomes.

Our original dataset contains 14 individual attributes, of which 13 input features are used to generate the outcome of the disease. Of these 13 features, 6 significant features of our dataset matched prominent medical books and guides. These are age in years (age), gender (sex), resting blood pressure (trestbps), fasting blood sugar (fbs), chest pain type (cp), and resting electrocardiographic results (restecg) [61], [97]. Some features, including age and sex, are


not modifiable, while risk factors associated with other features (fbs, restecg, cp and trestbps) are gradually modifiable.

After applying the Relief feature selection algorithm to the proposed dataset, 7 features have been selected based on their ranking values: age in years (age), serum cholesterol (chol), fasting blood sugar (fbs), resting electrocardiographic results (restecg), maximum heart rate (thalach), exercise induced angina (exang), and number of major vessels (0–3) colored by fluoroscopy (ca). Some missing attributes, present in notable medical books, were added: sex, trestbps [61], and cp [97], as it was felt that it was important that these features were included.

Eight relevant features were selected according to their ranking by the LASSO feature selection algorithm: age, cp, trestbps, chol, thalach, oldpeak, slope, and thal. Chest pain was the feature with the highest score. Some missing attributes, present in all medical records, were added: sex, fbs and restecg, so that these features were part of all three feature sets.

Different machine learning techniques were applied to the selected features. The 2 × 2 confusion matrix was generated to produce the different performance metrics and provided a comparison of all mentioned algorithms. The performance metrics Accuracy, Error Rate, Sensitivity, Precision, F1-Score, Negative Predictive Value, False Positive Rate, and False Negative Rate were used to evaluate the proposed models.

1) COMPARISON BETWEEN DIFFERENT METHODS BASED ON ACCURACY
Accuracy is usually considered to be one of the most important measures for evaluating machine learning algorithms. As mentioned above, we use five classifiers and five hybrid classifiers. We applied the ten different methods to the original 13 input features, then to the eleven input features selected by the LASSO approach, and to the 10 features selected with the Relief method. Fig. 9 shows the accuracy of the different types of classifiers, including the five hybrid classifiers.

FIGURE 9. Accuracy.

Considering 13 features, the most accurate prediction [98], 89.07%, was obtained from the AB classifier, whereas the accuracy of KNN is 83.61%. The accuracies of DT and GB are very similar to each other (86.97%). However, results are significantly better for some of the hybrid classifiers: the accuracy of RFBM is 92.65%. When only evaluating the 11 selected features (LASSO), the RF classifier generates the lowest accuracy (86.97%). We get 88.6%, 93%, 90.75%, and 92.85% accuracy for the DT, KNN, AB, and GB classifiers respectively with the 11 LASSO features. GBBM has an outstanding performance of 97.85%, and the other four hybrid classifiers, DTBM, RFBM, KNNBM, and ABBM, also provide a good accuracy: 88.65%, 97.65%, 96.6% and 90.75% respectively.

Looking at the accuracy of these ten strategies with the Relief features, the Random Forest Bagging Method (RFBM), which is a hybrid classifier, demonstrated an excellent accuracy of 99.05%. The results of the hybrid models of DT, AB, and GB were similar to the previous results. A dramatic improvement in accuracy with hybridization was observed for the KNN model, from 94.11% to more than 98% accuracy.

2) COMPARISON BETWEEN DIFFERENT METHODS BASED ON ERROR RATES
Error rates also help to understand the model performance. The lowest error rate, approximately 0.95%, is generated by RFBM on the ten features selected by Relief. However, for the eleven features selected by LASSO, the lowest error rate was obtained with KNN: just under 2.2%. Fig. 10 clearly shows that KNN had the highest error rate (16.39%) for 13 features, followed by RF for 11 features (13.03%) and DT for 10 features (10.88%).

FIGURE 10. Error rates.
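Accuracy and error rate are complementary quantities computed from the same 2 × 2 confusion matrix. The short sketch below illustrates this; the function name `accuracy_and_error` and the confusion-matrix counts are hypothetical and are not taken from the study. The last line only checks that the reported 99.05% accuracy and 0.95% error rate for RFBM are consistent with each other.

```python
def accuracy_and_error(tp, tn, fp, fn):
    """Return (accuracy, error rate) from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    error_rate = (fp + fn) / total
    return accuracy, error_rate

# Hypothetical counts, for illustration only.
acc, err = accuracy_and_error(tp=120, tn=110, fp=5, fn=3)

# Consistency check on figures quoted in the text: 100% - 99.05% accuracy.
reported_error = round(100.0 - 99.05, 2)
```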


3) COMPARISON BETWEEN DIFFERENT METHODS BASED ON PRECISION
Other performance metrics such as precision have also been used to evaluate the performance of the classifiers and hybrid algorithms. Considering 13 input features, a noticeable result of over 93% was obtained for precision with the RFBM model. KNN had the lowest precision score: 84%. Other models had precision scores between these values. When applied to the 11 LASSO features, the best precision was obtained with the GBBM (98%), and the lowest precision (84%) with the GB classifier. Both the Decision Tree (DT) and Random Forest (RF) classifiers achieved a precision score of approximately 87%. The best precision, close to 99%, was obtained by RFBM on the 10 Relief features. KNN also had a high precision score (94%). For the 10 Relief features, DT produced the lowest score, but this was still 89%. The outcomes for precision are depicted in Fig. 11.

FIGURE 11. Precision.

4) COMPARISON BETWEEN DIFFERENT METHODS BASED ON RECALL
The recall or sensitivity score is an important performance metric, as it is important that people with heart disease are accurately classified. Fig. 12 shows the recall scores for the different algorithms and feature sets. A very poor recall score (just over 84%) was obtained with the KNN algorithm, while the RFBM achieved the highest recall score (92%) when applied to the original 13 features. ABBM, KNNBM, RF, and GBBM had recall scores of 89%, 89%, 86%, and 91% respectively based on 13 features. For the 11 LASSO features, the RF algorithm had a low recall score (just over 85%), while RFBM and GBBM provided more satisfactory results over the 11 features. Similar recall scores of approximately 98% were obtained by the DTBM, RF, and KNNBM classifiers and hybrid models when using the 10 Relief features. The best recall score, however, was obtained with RFBM when applied to the 10 Relief features.

FIGURE 12. Obtained recall scores.

5) COMPARISON BETWEEN DIFFERENT METHODS BASED ON F1-SCORE
The F1-score is the harmonic mean of the precision and recall scores. For the 13 features, the highest F1-score (approximately 92%) is achieved with the RFBM, which outperformed all other algorithms. KNN had the lowest F1-score for 13 features (84%), and the results for the DT and GB classifiers were similar: 87% and 88% respectively.

After decreasing the number of features, the F1-score increased. For 11 features, GBBM had the highest score, and most other classifiers also had better F1-scores than for 13 features. Results improved still further for the 10 Relief features, with KNNBM and GBBM obtaining F1-scores of approximately 98%, and DT, RF and AB obtaining 90%, 98% and 93% respectively. The highest F1-score (99%) was obtained with the RFBM model, and KNNBM provides the second highest score, which is exactly 98%. The DTBM model had the lowest score for 10 features. F1-scores are shown in Fig. 13.

6) COMPARISON BETWEEN DIFFERENT METHODS BASED ON NEGATIVE PREDICTIVE VALUE
The negative predictive values (NPV) of the various algorithms have also been evaluated. The maximum NPV (98.59%) was obtained with RFBM when applied to the Relief feature selection. The lowest NPVs were recorded for DT (86.47%) and DTBM (89.7%). For 13 features, the performance of the classifiers and hybrid models was not so good. The best NPV, for RFBM, was only 90.8%. NPVs for the features selected by the LASSO algorithm were lower than for the features selected by Relief but still quite good compared with the values for the 13 features (93.6% for both RFBM and KNNBM). Overall, the lowest NPV score was obtained by applying DT and KNN to the 13 features. The NPV outcomes are depicted in Fig. 14.
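The four metrics discussed in these subsections follow directly from the confusion-matrix counts. A minimal sketch is given below; the function name `classification_metrics` and the counts are hypothetical and chosen only to make the arithmetic easy to follow. The F1-score is computed as the harmonic mean of precision and recall, as stated in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall (sensitivity), F1-score, and NPV from a 2 x 2 confusion matrix."""
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    npv = tn / (tn + fn)                # share of predicted negatives that are correct
    return {"precision": precision, "recall": recall, "f1": f1, "npv": npv}

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, tn=80, fp=10, fn=20)
```

With these counts, precision is 0.9 and recall is about 0.818, so the F1-score (about 0.857) lies between the two and closer to the lower value, which is the usual behaviour of a harmonic mean.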


FIGURE 13. Obtained F1-scores.

FIGURE 14. Outcomes of negative predictive values.

7) COMPARISON BETWEEN DIFFERENT METHODS BASED ON FALSE POSITIVE RATE
The false-positive rates (FPR) of the various algorithms are illustrated before and after feature selection. After applying the Relief feature selection algorithm, the minimum false-positive rate, 2.05%, was obtained with the RFBM, whereas the highest FPR was seen with DT. The FPRs for RF, KNNBM, GBBM and others were just under 3.5%, a good result. Without applying the Relief or LASSO feature selection techniques, false-positive rates for classifiers and hybrid algorithms are considerably higher. The lowest FPR score for all 13 features was obtained by the RFBM, while the FPR for KNN was very high. FPRs are shown in Fig. 15.

FIGURE 15. Metrics for false positive rates.

8) COMPARISON BETWEEN DIFFERENT OUTCOMES BASED ON FALSE NEGATIVE RATES
The false-negative rates (FNR) of the various algorithms are presented before and after applying the two feature selection techniques, Relief and LASSO. With the features selected by LASSO, GBBM had the lowest FNR (2.1%) and RF had the highest FNR (14.7%). For the features selected by Relief, the FNR was approximately 0% for RFBM. For GBBM, the FNRs are low for both feature selection techniques. Without a feature selection technique, the false negative rates are higher. KNN had the highest FNR (15.62%). False negative rates are depicted in Fig. 16.

FIGURE 16. Obtained outcomes of false negative rates.

C. COMPARISON TABLE BETWEEN THE ACCURACY OF THE PROPOSED MODELS AND EXISTING TECHNIQUES
A combination of five different datasets has been employed for this study. Fig. 1 depicts the infrastructure of our proposed system. The outcomes based on all features (13) and on the features selected by LASSO (11) and Relief (10) are shown in Table 4. As a consequence, separate results have been reported based on these features.


TABLE 4. A comparison of accuracy between the proposed system and some existing systems.

After changing the number of selected features by implementing selection algorithms, significant improvements have been noticeable. When all features were used, the best accuracy was achieved with the RFBM hybrid model (92.65%) and a low accuracy score was obtained with KNN (83.61%). Application of the LASSO selection algorithm leads to some dramatic changes. The highest accuracy was obtained with GBBM (97.85%), whereas the RF model performed the worst. The best results were obtained with the Relief feature selection technique. This achieves a 99.05% accuracy with RFBM. Our results have been compared to the existing models and datasets; see Table 4. Each row of the table deals with an algorithm that has been used in our studies, as well as two other related studies, and the results that have been reported. As auxiliary information, we have also added the dataset that those studies have used. The table draws an overall picture of the performance of the algorithms in our study against other related works. The highest accuracy among previous results was just over 90% (90.16% [43]), and the performance of hybrid models was poor due to the limitations of the datasets [45]. The best result for hybrid models was only 89% (see Table 4). The highest accuracy achieved in previous research was 95.19% [53], and hybrid models performed very poorly [39]. Rashmi et al. [40] examined a 303-record dataset that had been extracted from the Cleveland dataset. That analysis showed that the Decision Tree achieved 75.55% accuracy. Dinesh et al. [41] worked on a 920-record dataset, combining the Cleveland, Long Beach VA, Switzerland and Hungarian datasets from the UCI repository, and showed that RF could obtain an accuracy of 80.89%. Other authors in [49] applied DT and RF to a dataset of 500 records taken from the Armed Forces Institute of Cardiology (AFIC) and reported that DT achieved the best result (86.6%). Hybrid classifiers were explored by several researchers [39], [52], obtaining an accuracy of 85.48% using the KNNBM approach. The performance of our proposed model is very good compared to previous research works, as can be seen from Table 4.

FPR is used to show the percentage of wrongly detected heart disease, whereas the FNR or miss rate measures the incorrect negative classifications. Fig. 17 shows FPR and


FIGURE 17. Comparison between FPR and FNR values.

FNR values. The low FNRs represent a major outcome, based on the heart disease dataset. After evaluation, RFBM in combination with the Relief feature selection algorithm has been demonstrated to have the best performance.
The highest accuracy was obtained with the Relief feature selection algorithm and the Random Forest Bagging Method (99.05%). However, the RestAVG scores were not bad for a diagnosis system. From Fig. 18, it can be observed that the values of the relevant performance indices were all about 94%, except precision, which was higher (96.55%).
Note that the remaining three features which were not used with Relief are Thal, oldpeak and slope.

FIGURE 18. The experimental accuracy of RFBM compared with RestAVG.

VII. COMPARATIVE EVALUATION OF OUR PROPOSED MODEL
In our predicted model, ten features have been evaluated to make this comparison more unique. Our algorithms were run on all features (13), the LASSO selected features (11), and the Relief selected features (10).
The obtained outcomes were compared to other works to show the percentage of improvement, while a decrease in performance was also noted on one occasion (KNN). The highest increment was noticed for the AB approach relative to previous work, which reported about 46% [48]; the percentage improvements were calculated as 93.63% for 13 features, 97.28% for 11 features, and 101.85% for 10 features. On the other hand, the lowest increment was seen for the ABBM model, which was just under 2%; for the LASSO selected features, however, it was just over 4%. Significantly higher percentage improvements were observed with 10 features than with 13 features for RF, RFBM, KNNBM, and GBBM. The comparison is given in Table 5.

VIII. STATISTICAL ANALYSIS
We applied the Root Mean Squared Error and Log Loss to the output of our algorithms. Results are described below.

A. ROOT MEAN SQUARED ERROR
Fig. 19 portrays the Root Mean Squared Error (RMSE) for the 10 models. Here we analyze the RMSE of each model for 13 features, 11 features (LASSO) and 10 features (Relief). The RFBM model has the lowest RMSE for 13 features and 10 features, 27.735 and 8.602 respectively. The GBBM model produces the minimum RMSE for LASSO, which is 14.732. Thus, the RFBM model produces the best results for 13 features and 10 features (Relief), whereas the GBBM model produces the best result for 11 features (LASSO).
KNN, RF and DT have the highest RMSE values for 13 features (41.345), 11 features (36.313) and 10 features (30.38) respectively. Therefore, we might conclude that these three models are the least effective.

FIGURE 19. Root Mean Squared Errors of different algorithms on Relief selected features.


TABLE 5. A comparison of accuracy between the proposed system and existing outcomes.

B. LOG LOSS
Fig. 20 depicts the log loss (LL) for the 10 models. Here we analyze the LL value of each model for 13 features, 11 features (LASSO) and 10 features (Relief). The GBBM model has the lowest LL value for 13 features and 11 features, 1.997 and 0.721 respectively. The RFBM model generates the lowest LL value for Relief, which is 0.127. Therefore, the GBBM model produces the best result for 13 features and 11 features (LASSO), whereas the RFBM model produces the best result for 10 features (Relief). On the contrary, KNN, RF and DT give the highest LL values for 13 features, 11 features and 10 features, which are 5.683, 4.854 and 3.532 respectively. Thus, to recapitulate, these three models are the least efficient across the three categories of features (13 features, LASSO and Relief).

FIGURE 20. Log Loss of different algorithms on Relief selected features.

IX. ANALYSIS ON RUNTIME AND COMPUTATIONAL COMPLEXITY
A. COMPUTATIONAL COMPLEXITY
Table 6 illustrates the computational complexity for seven different models. Two types of complexity are included: training complexity and prediction complexity. Only the KNN and Boosting models have no training complexity associated with them. All other models have both training and prediction complexity, which are given in the table.
Denoting $n$ as the number of training samples, $p$ as the number of features, $n_{trees}$ as the number of trees (for methods based on various trees), and $k$ as the number of neighbors, we have the following approximations: Bootstrap Aggregation or bagging is $O(n)$ (for $n$-sized trees) and random subspace is $O(d')$ (where $d' \le d$). Complexity for $t$ bagged trees of random subspaces is $O(t \cdot d^2 \cdot n^2 \cdot \log(n))$ (taking $d' = d$ for big O notation).

B. RUNTIME
Fig. 21 displays the run time (RT) for the 10 models. Here we evaluate the RT value of each model for 13 features, 11 features (LASSO) and 10 features (Relief). The RFBM model has the lowest RT: 0.0126 for 13 features, 0.0012 for LASSO and 0.0011 for Relief. On the other hand, the GBBM model has the longest RT: 1.9973 for 13 features, 1.7021 for 11 features and 1.5547 for 10 features.
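A run-time comparison of this kind can be reproduced in outline with a small timing helper. The sketch below is only an assumption about how such a measurement might be taken, since the timing procedure and units are not described here; `median_runtime` and `toy_predict` are hypothetical names, and `toy_predict` is a stand-in for a trained model's prediction call.

```python
import statistics
import time


def median_runtime(fn, *args, repeats=5):
    """Return the median wall-clock time in seconds over `repeats` calls of fn(*args)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def toy_predict(rows):
    """Hypothetical stand-in for a classifier: label 1 when the row mean exceeds 0.5."""
    return [1 if sum(row) / len(row) > 0.5 else 0 for row in rows]


# Synthetic input: 1000 rows of 10 feature values in [0, 1].
rows = [[((i * j) % 7) / 6 for j in range(1, 11)] for i in range(1000)]
elapsed = median_runtime(toy_predict, rows)
```

Taking the median over several repeats reduces the influence of one-off delays, which matters when the measured times are as small as the ones reported above.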


FIGURE 21. Run time performance of different algorithms on Relief selected features.

TABLE 6. Algorithmic complexities of the algorithms used.

X. HYPERPARAMETER TUNING
GridSearchCV, which allocates hyperparameters, is a tuning process which can determine the optimal value for a given model. In our proposed system, GridSearchCV has been used in order to obtain a higher accuracy. The following parameters were used on the examined algorithms (see Table 7):
sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
For getting an accurate prediction, tuning is a fundamental part for all types of classifiers. As a result, we tuned our 5 classifiers (DT, RF, KNN, AB, and GB); however, the default parameters were used for the base classifiers in the ensemble technique.

TABLE 7. Parameters used.

XI. LIMITATIONS OF OUR PROPOSED SYSTEM
The overall discussion has shown that the performance of the different classifiers was good in comparison to previous studies; however, there are a few limitations, such as the dependency on a specific feature selection technique, for instance the reliance on Relief in this case to produce highly accurate results. Additionally, a high level of missing values in the dataset can have an adverse effect. We have demonstrated how to address the issue through the proper methods, and therefore other datasets, when used with this model, must also take care of this issue if the level of missing values is significant. Furthermore, though our training dataset is reasonably extensive, a larger dataset would make the model more precise.

XII. CONCLUSION
Identifying the risk of heart disease with reasonably high accuracy could potentially have a profound effect on the long-term mortality rate of humans, regardless of social and cultural background. Early diagnosis is a key step in achieving that goal. Several studies have already attempted to predict heart disease with the help of machine learning. This study takes a similar route, but with an improved and novel method and with a larger dataset for training the model. This research demonstrates that the Relief feature selection algorithm can provide a tightly correlated feature set which can then be used with several machine learning algorithms. The study has also identified that RFBM works particularly well with the high impact features (obtained by feature selection algorithms or medical literature) and produces an accuracy substantially higher than related work. RFBM achieved an


accuracy of 99.05% with 10 features. In the future we aim to generalize the model even further so that it can work with other feature selection algorithms and be robust against datasets where the level of missing data is high. The application of Deep Learning algorithms is another future approach. The primary aim of this research was to improve upon the existing work with an innovative and novel way of building the model, as well as to make the model useful and easily implementable in practical settings.

REFERENCES
[1] C. Trevisan, G. Sergi, S. J. B. Maggi, and H. Dynamics, “Gender differences in brain-heart connection,” in Brain and Heart Dynamics. Cham, Switzerland: Springer, 2020, p. 937.
[2] M. S. Oh and M. H. Jeong, “Sex differences in cardiovascular disease risk factors among Korean adults,” Korean J. Med., vol. 95, no. 4, pp. 266–275, Aug. 2020.
[3] D. C. Yadav and S. Pal, “Prediction of heart disease using feature selection and random forest ensemble method,” Int. J. Pharmaceutical Res., vol. 12, no. 4, 2020.
[4] World Health Organization and J. Dostupno, “Cardiovascular diseases: Key facts,” vol. 13, no. 2016, p. 6, 2016. [Online]. Available: https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
[5] K. Uyar and A. Ilhan, “Diagnosis of heart disease using genetic algorithm based trained recurrent fuzzy neural networks,” Procedia Comput. Sci., vol. 120, pp. 588–593, Jan. 2017.
[6] A. U. Haq, J. P. Li, M. H. Memon, S. Nazir, and R. Sun, “A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms,” Mobile Inf. Syst., vol. 2018, pp. 1–21, Dec. 2018.
[7] S. Pouriyeh, S. Vahid, G. Sannino, G. De Pietro, H. Arabnia, and J. Gutierrez, “A comprehensive investigation and comparison of machine learning techniques in the domain of heart disease,” in Proc. IEEE Symp. Comput. Commun. (ISCC), Jul. 2017, pp. 204–207.
[8] J. Mourao-Miranda, A. L. W. Bokde, C. Born, H. Hampel, and M. Stetter, “Classifying brain states and determining the discriminating activation patterns: Support vector machine on functional MRI data,” NeuroImage, vol. 28, no. 4, pp. 980–995, Dec. 2005.
[9] S. Ghwanmeh, A. Mohammad, and A. Al-Ibrahim, “Innovative artificial neural networks-based decision support system for heart diseases diagnosis,” J. Intell. Learn. Syst. Appl., vol. 5, no. 3, pp. 176–183, 2013.
[10] Q. K. Al-Shayea, “Artificial neural networks in medical diagnosis,” Int. J. Comput. Sci., vol. 8, no. 2, pp. 150–154, 2011.
[11] F. M. J. M. Shamrat, M. A. Raihan, A. K. M. S. Rahman, I. Mahmud, and R. Akter, “An analysis on breast disease prediction using machine learning approaches,” Int. J. Sci. Technol. Res., vol. 9, no. 2, pp. 2450–2455, Feb. 2020.
[12] M. S. Amin, Y. K. Chiam, and K. D. Varathan, “Identification of significant features and data mining techniques in predicting heart disease,” Telematics Informat., vol. 36, pp. 82–93, Mar. 2019.
[13] N. Kausar, S. Palaniappan, B. B. Samir, A. Abdullah, and N. Dey, “Systematic analysis of applied data mining based optimization algorithms in clinical attribute extraction and classification for diagnosis of cardiac patients,” in Applications of Intelligent Optimization in Biology and Medicine. Cham, Switzerland: Springer, 2016, pp. 217–231.
[14] J. Mackay and G. A. Mensah, “The atlas of heart disease and stroke,” World Health Org., Geneva, Switzerland, Tech. Rep., 2004.
[15] M. Ashraf, S. M. Ahmad, N. A. Ganai, R. A. Shah, M. Zaman, S. A. Khan, and A. A. Shah, Prediction of Cardiovascular Disease Through Cutting-Edge Deep Learning Technologies: An Empirical Study Based on TENSORFLOW, PYTORCH and KERAS. Singapore: Springer, 2021, pp. 239–255.
[16] F. Andreotti, F. S. Heldt, B. Abu-Jamous, M. Li, A. Javer, O. Carr, S. Jovanovic, N. Lipunova, B. Irving, R. T. Khan, R. Dürichen, “Prediction of the onset of cardiovascular diseases from electronic health records using multi-task gated recurrent units,” 2020, arXiv:2007.08491. [Online]. Available: https://arxiv.org/abs/2007.08491
[17] W. Wiharto, H. Kusnanto, and H. Herianto, “Hybrid system of tiered multivariate analysis and artificial neural network for coronary heart disease diagnosis,” Int. J. Electr. Comput. Eng., vol. 7, no. 2, p. 1023, Apr. 2017.
[18] A. K. Paul, P. C. Shill, M. R. I. Rabin, and M. A. H. Akhand, “Genetic algorithm based fuzzy decision support system for the diagnosis of heart disease,” in Proc. 5th Int. Conf. Informat., Electron. Vis. (ICIEV), May 2016, pp. 145–150.
[19] X. Liu, X. Wang, Q. Su, M. Zhang, Y. Zhu, Q. Wang, and Q. Wang, “A hybrid classification system for heart disease diagnosis based on the RFRS method,” Comput. Math. Med., vol. 2017, pp. 1–11, Jan. 2017.
[20] D. Singh and J. S. Samagh, “A comprehensive review of heart disease prediction using machine learning,” J. Crit. Rev., vol. 7, no. 12, p. 2020, 2020.
[21] M. Shouman, T. Turner, and R. Stocker, “Integrating clustering with different data mining techniques in the diagnosis of heart disease,” J. Comput. Sci. Eng., vol. 20, no. 1, pp. 1–10, 2013.
[22] I. D. Mienye, Y. Sun, and Z. Wang, “An improved ensemble learning approach for the prediction of heart disease risk,” Informat. Med. Unlocked, vol. 20, Jan. 2020, Art. no. 100402.
[23] H. Wang, Z. Huang, D. Zhang, J. Arief, T. Lyu, and J. Tian, “Integrating co-clustering and interpretable machine learning for the prediction of intravenous immunoglobulin resistance in kawasaki disease,” IEEE Access, vol. 8, pp. 97064–97071, 2020.
[24] B. A. Tama, S. Im, and S. Lee, “Improving an intelligent detection system for coronary heart disease using a two-tier classifier ensemble,” BioMed Res. Int., vol. 2020, Apr. 2020, Art. no. 9816142.
[25] J. Mishra and S. Tarar, Chronic Disease Prediction Using Deep Learning. Singapore: Springer, 2020, pp. 201–211.
[26] F. Z. Abdeldjouad, M. Brahami, and N. Matta, A Hybrid Approach for Heart Disease Diagnosis and Prediction Using Machine Learning Techniques. Cham, Switzerland: Springer, 2020, pp. 299–306.
[27] M. Tarawneh and O. Embarak, “Hybrid approach for heart disease prediction using data mining techniques,” Acta Sci. Nutritional Health, vol. 3, no. 7, pp. 147–151, Jul. 2019.
[28] C. B. C. Latha and S. C. Jeeva, “Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques,” Informat. Med. Unlocked, vol. 16, Jan. 2019, Art. no. 100203.
[29] I. Javid, A. Khalaf, and R. Ghazali, “Enhanced accuracy of heart disease prediction using machine learning and recurrent neural networks ensemble majority voting method,” Int. J. Adv. Comput. Sci. Appl., vol. 11, no. 3, 2020.
[30] N. Kumar and K. Sikamani, “Prediction of chronic and infectious diseases using machine learning classifiers: A systematic approach,” Int. J. Intell. Eng. Syst., vol. 13, no. 4, pp. 11–20, 2020.
[31] S. M. Saqlain, M. Sher, F. A. Shah, I. Khan, M. U. Ashraf, M. Awais, and A. Ghani, “Fisher score and matthews correlation coefficient-based feature subset selection for heart disease diagnosis using support vector machines,” Knowl. Inf. Syst., vol. 58, no. 1, pp. 139–167, Jan. 2019.
[32] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease prediction using hybrid machine learning techniques,” IEEE Access, vol. 7, pp. 81542–81554, 2019.
[33] F. Miao, Y.-P. Cai, Y.-X. Zhang, X.-M. Fan, and Y. Li, “Predictive modeling of hospital mortality for patients with heart failure by using an improved random survival forest,” IEEE Access, vol. 6, pp. 7244–7253, 2018.
[34] C. Raju, E. Philipsy, S. Chacko, L. P. Suresh, and S. D. Rajan, “A survey on predicting heart disease using data mining techniques,” in Proc. Conf. Emerg. Devices Smart Syst. (ICEDSS), 2018, pp. 253–255.
[35] D. Chicco and G. Jurman, “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone,” BMC Med. Informat. Decis. Making, vol. 20, no. 1, p. 16, Dec. 2020.
[36] E. Ahmad, A. Tiwari, and A. Kumar, “Cardiovascular Diseases (CVDs) Detection using Machine Learning Algorithms,”
[37] L. Wang, W. Zhou, Q. Chang, J. Chen, and X. Zhou, “Deep ensemble detection of congestive heart failure using short-term RR intervals,” IEEE Access, vol. 7, pp. 69559–69574, 2019.
[38] A. Gupta, R. Kumar, H. S. Arora, and B. Raman, “MIFH: A machine intelligence framework for heart disease diagnosis,” IEEE Access, vol. 8, pp. 14659–14674, 2020.
[39] C. B. C. Latha and S. C. Jeeva, “Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques,” Informat. Med. Unlocked, vol. 16, no. 2, 2019, Art. no. 100203.
[40] G. O. Rashmi and U. M. A. kumar, “Machine learning methods for heart disease prediction,” Int. J. Eng. Adv. Technol., vol. 8, no. 5S, pp. 220–223, May 2019.


[41] K. G. Dinesh, K. Arumugaraj, K. D. Santhosh, and V. Mareeswari, “Prediction of cardiovascular disease using machine learning algorithms,” in Proc. Int. Conf. Current Trends Towards Converging Technol. (ICCTCT), Coimbatore, India, Mar. 2018, pp. 1–7.
[42] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease prediction using hybrid machine learning techniques,” IEEE Access, vol. 7, pp. 81542–81554, 2019.
[43] S. Sharma and M. Parmar, “Heart diseases prediction using deep learning neural network model,” Int. J. Innov. Technol. Exploring Eng., vol. 9, no. 3, pp. 1–5, Jan. 2020.
[44] R. Alizadehsani, J. Habibi, Z. A. Sani, H. Mashayekhi, R. Boghrati, A. Ghandeharioun, F. Khozeimeh, and F. Alizadeh-Sani, “Diagnosing coronary artery disease via data mining algorithms by considering laboratory and echocardiography features,” Res. Cardiovascular Med., vol. 2, no. 3, pp. 133–139, Aug. 2013.
[45] A. A. Shetty and C. Naik, “Different data mining approaches for predicting heart disease,” Int. J. Innov. Sci. Eng. Technol., vol. 5, pp. 277–281, May 2016.
[46] C. A. Cheng and H. W. Chiu, “An artificial neural network model for the evaluation of carotid artery stenting prognosis using a national-wide database,” in Proc. 39th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), Jul. 2017, pp. 2566–2569.
[47] K. C. Tan, E. J. Teoh, Q. Yu, and K. C. Goh, “A hybrid evolutionary algorithm for attribute selection in data mining,” Expert Syst. Appl., vol. 36, no. 4, pp. 8616–8630, May 2009.
[48] I. K. A. Enriko, “Comparative study of heart disease diagnosis using top ten data mining classification algorithms,” in Proc. 5th Int. Conf. Frontiers Educ. Technol., 2019, pp. 159–164.
[49] M. Saqlain, W. Hussain, N. A. Saqib, and M. A. Khan, “Identification of heart failure by using unstructured data of cardiac patients,” in Proc. 45th Int. Conf. Parallel Process. Workshops (ICPPW), Aug. 2016, pp. 426–431.
[50] A. K. Dwivedi, “Evaluate the performance of different machine learning techniques for prediction of heart disease using ten-fold cross-validation,” Neural Comput. Appl., vol. 29, pp. 685–693, Sep. 2016.
[51] A. Kaur, “A comprehensive approach to predict heart diseases using data mining,” Int. J. Innov. Eng. Technol., vol. 8, no. 2, pp. 1–5, Apr. 2017.
[52] V. Chaurasia and S. Pal, “Data mining approach to detect heart diseases,” Int. J. Adv. Comput. Sci. Inf. Technol., vol. 2, no. 4, pp. 56–66, 2014.
[53] R. Bhuvaneeswari, P. Sudhakar, and G. Prabakaran, “Heart disease prediction model based on gradient boosting tree (GBT) classification algorithm,” Int. J. Recent Technol. Eng., vol. 8, no. 2, pp. 41–51, Sep. 2019.
[54] F. M. J. M. Shamrat, P. Ghosh, M. H. Sadek, A. Kazi, and S. Shultana, “Implementation of machine learning algorithms to detect the prognosis rate of kidney disease,” in Proc. IEEE Int. Conf. Innov. Technol., Nov. 2020, pp. 1–7.
[55] S. Shultana, M. S. Moharram, and N. Neehal, “Olympic sports events classification using convolutional neural networks,” in Proc. Int. Joint Conf. Comput. Intell. (IJCCI), Dhaka, Bangladesh, 2018, pp. 507–518.
[56] S. V. J. Jaikrishnan, O. Chantarakasemchit, and P. Meesad, “A breakup machine learning approach for breast cancer prediction,” in Proc. 11th Int. Conf. Inf. Technol. Electr. Eng. (ICITEE), Pattaya, Thailand, Oct. 2019, pp. 1–6.
[57] A. Gavhane, G. Kokkula, I. Pandya, and K. Devadkar, “Prediction of heart disease using machine learning,” in Proc. 2nd Int. Conf. Electron., Commun. Aerosp. Technol. (ICECA), Coimbatore, India, Mar. 2018, pp. 1275–1278.
[58] G. Singh, “Breast cancer prediction using machine learning,” Int. J. Sci. Res. Comput. Sci., Eng. Inf. Technol., vol. 8, no. 4, pp. 278–284, Jul. 2020.
[59] Heart Disease Datasets From UCI Machine Learning Repository. Accessed: May 31, 2020. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
[60] Heart Disease Statlog Dataset of UCI Machine Learning Repository. Accessed: May 31, 2020. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/statlog+(heart)
[61] S. Ralston, I. Penman, M. Strachan, and R. Hobson, Davidson’s Principles and Practice of Medicine, 23rd ed. U.K.: Elsevier, Apr. 2018, pp. 219–225.
[62] A. Rairikar, V. Kulkarni, V. Sabale, H. Kale, and A. Lamgunde, “Heart disease prediction using data mining techniques,” in Proc. Int. Conf. Intell. Comput. Control (IC), Jun. 2017, pp. 1–8.
[63] A. Acharya, “Comparative study of machine learning algorithms for heart disease prediction,” M.S. thesis, Helsinki Metropolia Univ. Appl. Sci., Helsinki, Finland, Apr. 2017. [Online]. Available: https://www.theseus.fi/bitstream/handle/10024/124622/Final%20Thesis.pdf?sequence=1&isAllowed=y
[64] A. M. D. Silva, Feature Selection, vol. 13. Berlin, Germany: Springer, 2015, pp. 1–13.
[65] S. Chikhi and S. Benhammada, “ReliefMSS: A variation on a feature ranking ReliefF algorithm,” Int. J. Bus. Intell. Data Mining, vol. 4, pp. 375–390, Jan. 2009.
[66] R. Tibshirani, “Regression shrinkage and selection via the lasso: A retrospective,” J. Roy. Stat. Soc. B, Stat. Methodol., vol. 73, no. 3, pp. 273–282, Jun. 2011.
[67] C. Zhou and A. Wieser, “Jaccard analysis and LASSO-based feature selection for location fingerprinting with limited computational complexity,” in Proc. 14th Int. Conf. Location Based Services (LBS), Dec. 2018, pp. 71–87.
[68] Ensemble Techniques of Bagging. Accessed: Jun. 31, 2020. [Online]. Available: https://quantdare.com/what-is-the-difference-between-Bagging-and-Boosting/
[69] An Explanation of Ensemble Bagging Techniques. Accessed: Jun. 31, 2020. [Online]. Available: https://towardsdatascience.com/ensemble-methods-Bagging-Boosting-and-stacking-c9214a10a205/
[70] P. Ghosh, M. Z. Hasan, and M. I. Jabiullah, “A comparative study of machine learning approaches on dataset to predicting cancer outcome,” Bangladesh Electron. Soc., vol. 18, nos. 1–3, pp. 1–5, 2018.
[71] F. M. Javed Mehedi Shamrat, Z. Tasnim, P. Ghosh, A. Majumder, and M. Z. Hasan, “Personalization of job circular announcement to applicants using decision tree classification algorithm,” in Proc. IEEE Int. Conf. Innov. Technol. (INOCON), Nov. 2020, pp. 1–5.
[72] M. M. Alam, S. Saha, P. Saha, F. N. Nur, N. N. Moon, A. Karim, and S. Azam, “D-CARE: A non-invasive glucose measuring technique for monitoring diabetes patients,” in Proc. Int. Joint Conf. Comput. Intell. Algorithms Intell. Syst., 2019, pp. 443–453.
[73] W. M. Baihaqi, N. A. Setiawan, and I. Ardiyanto, “Rule extraction for fuzzy expert system to diagnose coronary artery disease,” in Proc. 1st Int. Conf. Inf. Technol., Inf. Syst. Electr. Eng. (ICITISEE), Yogyakarta, Indonesia, Aug. 2016, pp. 136–141.
[74] Z. Masetic and A. Subasi, “Congestive heart failure detection using random forest classifier,” Comput. Methods Programs Biomed., vol. 130, pp. 54–64, Jul. 2016.
[75] A. Mert, N. Kılıç, and A. Akan, “Evaluation of bagging ensemble method with time-domain feature extraction for diagnosing of arrhythmia beats,” Neural Comput. Appl., vol. 24, no. 2, pp. 317–326, Feb. 2014.
[76] P. Ghosh, A. Karim, S. T. Atik, S. Afrin, and M. Saifuzzaman, “Expert model of cancer disease using supervised algorithms with a LASSO feature selection approach,” Int. J. Electr. Comput. Eng., vol. 11, no. 3, 2020.
[77] P. Ghosh, M. Z. Hasan, O. A. Dhore, A. A. Mohammad, and M. I. Jabiullah, “On the application of machine learning to predicting cancer outcome,” in Proc. Int. Conf. Electron. (ICT). Dhaka, Bangladesh: Bangladesh Electronics Society (BES), Nov. 2018, p. 60.
[78] Responsible for Heart Disease Risk Factors. Accessed: Jul. 15, 2020. [Online]. Available: https://www.texasheart.org/heart-health/heart-informationcenter/topics/heart-disease-risk-factors/
[79] R. Banerjee, S. Biswas, S. Banerjee, A. D. Choudhury, T. Chattopadhyay, A. Pal, P. Deshpande, and K. M. Mandana, “Time-frequency analysis of phonocardiogram for classifying heart disease,” in Proc. Comput. Cardiol. Conf. (CinC), Vancouver, BC, Canada, Sep. 2016, pp. 573–576.
[80] F. M. J. M. Shamrat, P. Ghosh, M. H. Sadek, M. A. Kazi, and S. Shultana, “Implementation of machine learning algorithms to detect the prognosis rate of kidney disease,” in Proc. IEEE Int. Conf. Innov. Technol., Nov. 2020, pp. 1–7.
[81] An Overview of K_Nearest Neighbors Algorithm. Accessed: Jun. 31, 2020. [Online]. Available: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
[82] D. Giri, U. R. Acharya, R. J. Martis, S. V. Sree, T.-C. Lim, T. Ahamed, and J. S. Suri, “Automated diagnosis of coronary artery disease affected patients using LDA, PCA, ICA and discrete wavelet transform,” Knowl.-Based Syst., vol. 37, pp. 274–282, Jan. 2013.
[83] A. Rajkumar and G. S. Reena, “Diagnosis of heart disease using data mining algorithm,” Global J. Comput. Sci. Technol., vol. 10, pp. 38–43, Sep. 2010.
[84] M. Gilani, J. M. Eklund, and M. Makrehchi, “Automated detection of atrial fibrillation episode using novel heart rate variability features,” in Proc. 38th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), Lake Buena Vista, FL, USA, Aug. 2016, pp. 3461–3464.


[85] K. Padmavathi and K. S. Ramakrishna, “Classification of ECG signal during atrial fibrillation using autoregressive modeling,” Procedia Comput. Sci., vol. 46, pp. 53–59, Jan. 2015.
[86] S. H. Ripon, “Rule induction and prediction of chronic kidney disease using boosting classifiers, Ant-Miner and J48 Decision Tree,” in Proc. Int. Conf. Elect., Comput. Commun. Eng. (ECCE), Cox’s Bazar, Bangladesh, 2019, pp. 1–6.
[87] A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti, and M. Alazab, “A comprehensive survey for intelligent spam email detection,” IEEE Access, vol. 7, pp. 168261–168295, 2019.
[88] P. Ghosh, F. M. J. M. Shamrat, S. Shultana, S. Afrin, A. A. Anjum, and A. A. Khan, “Optimization of prediction method of chronic kidney disease with machine learning algorithms,” in Proc. 15th Int. Symp. Artif. Intell. Natural Lang. Process. (iSAI-NLP), Int. Conf. Artif. Intell. Internet Things (AIoT), 2020.
[89] An Overview of Gradient Boosting Algorithm. Accessed: Jun. 31, 2020. [Online]. Available: https://machinelearningmastery.com/gentle-introduction-gradient-Boosting-algorithm-machine-learning/
[90] M. Almasoud and T. E. Ward, “Detection of chronic kidney disease using machine learning algorithms with least number of predictors,” Int. J. Adv. Comput. Sci. Appl., vol. 10, no. 8, pp. 89–96, 2019.
[91] Gradient Boosting Algorithm. Accessed: Jun. 31, 2020. [Online]. Available: https://data-flair.training/blogs/gradient-Boosting-algorithm/
[92] T. Chen and C. Guestrin, “XGBOOST: A scalable tree boosting system,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 785–794.
[93] J. Cheng, G. Li, and X. Chen, “Research on travel time prediction model of freeway based on gradient boosting decision tree,” IEEE Access, vol. 7, pp. 7466–7480, 2019, doi: 10.1109/ACCESS.2018.2886549.
[94] A. Natekin and A. Knoll, “Gradient boosting machines, a tutorial,” Frontiers Neurorobotics, vol. 7, no. 7, pp. 1–21, 2013.
[95] A. M. De Silva and P. H. W. Leong, Grammar-Based Feature Generation for Time-Series Prediction. Berlin, Germany: Springer, 2015.
[96] F. M. J. M. Shamrat, M. Asaduzzaman, P. Ghosh, M. D. Sultan, and Z. Tasnim, “A Web based application for agriculture: ‘Smart farming system,’” Int. J. Emerg. Trends Eng. Res., vol. 8, no. 6, pp. 2309–2320, Jun. 2020.
[97] Responsible for Heart Disease Risk Factors. Accessed: Jul. 15, 2020. [Online]. Available: https://www.texasheart.org/heart-health/heart-informationcenter/topics/heart-disease-risk-factors/
[98] F. M. J. M. Shamrat, P. Ghosh, I. Mahmud, N. I. Nobel, and M. D. Sultan, “An intelligent embedded AC automation model with temperature prediction and human detection,” in Proc. 2nd Int. Conf. Emerg. Technol. Data Mining Inf. Secur. (IEMIS), 2020.
[99] Sex, Age, Cardiovascular Risk Factors, and Coronary Heart Disease. Accessed: Dec. 29, 2020. [Online]. Available: https://www.ahajournals.org/doi/full/10.1161/01.cir.99.9.1165
[100] S. Hegelich, “Decision trees and random forests: Machine learning techniques to classify rare events,” Eur. Policy Anal., vol. 2, no. 1, pp. 98–120, 2016.
[101] K. Fawagreh, M. M. Gaber, and E. Elyan, “Random forests: From early developments to recent advancements,” Syst. Sci. Control Eng., vol. 2, no. 1, pp. 602–609, Dec. 2014.
[102] A. Sharma and A. Suryawanshi, “A novel method for detecting spam email using KNN classification with spearman correlation as distance measure,” Int. J. Comput. Appl., vol. 136, no. 6, pp. 28–35, Feb. 2016.
[103] Spearman’s Rank-Order Correlation. Accessed: Jul. 15, 2019. [Online]. Available: https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php

PRONAB GHOSH received the B.Sc. degree from the Computer Science and Engineering Department, Daffodil International University, in 2019. He has been heavily involved in collaborative research activities with researchers in Bangladesh and researchers from Australia, especially in the fields of machine learning, deep learning, cloud computing, and the IoT.

SAMI AZAM is currently a leading Researcher and a Senior Lecturer with the College of Engineering and IT, Charles Darwin University, Casuarina, NT, Australia. He is also actively involved in the research fields relating to Computer Vision, Signal Processing, Artificial Intelligence, and Biomedical Engineering. He has a number of publications in peer-reviewed journals and international conference proceedings.

MIRJAM JONKMAN (Member, IEEE) is currently a Lecturer and a Researcher with the College of Engineering, IT, and Environment. Her research interests include biomedical engineering, signal processing, and the application of computer science to real life problems.

ASIF KARIM is currently a Ph.D. Researcher with Charles Darwin University, Casuarina, NT, Australia, and lives in the port city of Darwin. His research interests include machine intelligence and cryptographic communication. He is also working towards the development of a robust and advanced email filtering system primarily using Machine Learning algorithms. He has considerable industry experience in IT, primarily in the field of Software Engineering.

F. M. JAVED MEHEDI SHAMRAT received the B.Sc. degree in software engineering from Daffodil International University, in 2018. He used to work at Daffodil International University as a Research Associate. He is currently working in a Government Project under the ICT Division as a Researcher and Developer. He has published several research papers and articles in journals (Scopus) and international conferences. His research interests include the IoT, machine learning, data science, information security, android applications, image processing, neural network, cyber security, Artificial Intelligence, robotics, and deep learning.

EVA IGNATIOUS is currently a Ph.D. Researcher with Charles Darwin University, Casuarina, NT, Australia. Her research interests include biomedical signal processing (interesting features and abnormalities found in bio-signals), theoretical modelling and simulation (breast cancer tissues), applied electronics (thermistors), process control and instrumentation, and embedded/VLSI systems. She has considerable research experience with one U.S. patent and two Indian patents for the development of thermal sensor-based breast cancer detection at its early stages together with Centre for Materials for Electronics Technology (C-MET), an autonomous scientific society under Ministry of Electronics and Information Technology (MeitY), Government of India. She also has industrial experience as a Production Engineer and a Quality Controller, primarily in the Electronics and Instrumentation Engineering.


SHAHANA SHULTANA received the B.Sc. degree in computer science and engineering from Daffodil International University, where she is currently pursuing the M.Sc. degree in computer science and engineering. She is also working as a Lecturer with the Department of Computer Science and Engineering, Daffodil International University. Her research interests include computer vision, data mining, neural network, and Artificial Intelligence.

ABHIJITH REDDY BEERAVOLU received the M.S. degree in information systems and data science from Charles Darwin University. His goal is to live free and come up with ideas that can help the people and the societies near him, and to ship those ideas into the world. He is a Computer Science Enthusiast who is interested in anything that is related to computers. Also, he is interested in reading books on history and making comparisons with the current world, to make sense of the reality and its progression. Mostly, he is interested in reading and analyzing information related to cognitive and behavioral psychology and trying to implement/integrate them into various technological ideas.

FRISO DE BOER is currently a Professor with the College of Engineering, IT, and Environment, Charles Darwin University, Casuarina, NT, Australia. His research interests include signal processing, biomedical engineering, and mechatronics.
