Efficient Prediction of Cardiovascular Disease Using Machine Learning Algorithms With Relief and LASSO Feature Selection Techniques
ABSTRACT Cardiovascular diseases (CVD) are among the most common serious illnesses affecting
human health. CVDs may be prevented or mitigated by early diagnosis, and this may reduce mortality
rates. Identifying risk factors using machine learning models is a promising approach. We
propose a model that incorporates different methods to achieve effective prediction of heart disease. For
our proposed model to be successful, we have used efficient Data Collection, Data Pre-processing and Data
Transformation methods to create accurate information for the training model. We have used a combined
dataset (Cleveland, Long Beach VA, Switzerland, Hungarian and Statlog). Suitable features are selected
using the Relief and Least Absolute Shrinkage and Selection Operator (LASSO) techniques. New
hybrid classifiers like Decision Tree Bagging Method (DTBM), Random Forest Bagging Method (RFBM),
K-Nearest Neighbors Bagging Method (KNNBM), AdaBoost Boosting Method (ABBM), and Gradient
Boosting Boosting Method (GBBM) are developed by integrating the traditional classifiers with bagging
and boosting methods, which are used in the training process. We have also applied several machine
learning algorithms and calculated the Accuracy (ACC), Sensitivity (SEN), Error Rate, Precision (PRE) and F1
Score (F1) of our model, along with the Negative Predictive Value (NPR), False Positive Rate (FPR), and
False Negative Rate (FNR). The results are shown separately to provide comparisons. Based on the result
analysis, we can conclude that our proposed model produced the highest accuracy while using RFBM and
Relief feature selection methods (99.05%).
INDEX TERMS Heart disease, machine learning, CVD, relief feature selection, LASSO feature selection,
decision tree, random forest, K-nearest neighbors, AdaBoost, and gradient boosting.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
19304 VOLUME 9, 2021
P. Ghosh et al.: Efficient Prediction of Cardiovascular Disease Using Machine Learning Algorithms
staff may result in false predictions [7]. Early diagnosis can be difficult [8]. Surgical treatment of
heart disease is challenging, particularly in developing countries which lack trained medical staff as
well as testing equipment and other resources required for proper diagnosis and care of patients with
heart problems [9]. An accurate evaluation of the risk of cardiac failure would help to prevent severe
heart attacks and improve the safety of patients [10]. Machine learning algorithms can be effective in
identifying diseases when trained on proper data [11]. Heart disease datasets are publicly available
for the comparison of prediction models. The introduction of machine learning and artificial
intelligence helps researchers to design the best prediction model using the large databases which are
available. Recent studies which focus on heart-related issues in adults and children emphasized the
need to reduce mortality related to CVDs. Since the available clinical datasets are inconsistent and
redundant, proper preprocessing is a crucial step [12]. Selecting the significant features that can be
used as the risk factors in prediction models is essential. Care should be taken to select the right
combination of features and the appropriate machine learning algorithms to develop accurate prediction
models [13]. It is important to evaluate the effect of risk factors which meet three criteria: high
prevalence in most populations; a significant independent impact on heart disease; and the possibility
of being controlled or treated to reduce the risks [14]. Different researchers have included different
risk factors or features while modelling the predictors for CVD. Features used in the development of
CVD prediction models in different research works include age, sex, chest pain (cp), fasting blood
sugar (FBS; elevated FBS is linked to diabetes [72]), resting electrocardiographic results (Restecg),
exercise-induced angina (exang), ST depression induced by exercise relative to rest (oldpeak), slope,
number of major vessels coloured by fluoroscopy (ca), heart status (thal), maximum heart rate achieved
(thalach), poor diet, family history, cholesterol (chol), high blood pressure, obesity, physical
inactivity and alcohol intake [12], [15]–[19]. Recent studies reveal a need for a minimum of
14 attributes for making the prediction accurate and reliable [20]. Current researchers are finding it
difficult to combine these features with the appropriate machine learning techniques to make an
accurate prediction of heart disease [21]. Machine learning algorithms are most effective when they are
trained on suitable datasets [22]–[25]. Since the algorithms rely on the consistency of the training
and test data, the use of feature selection techniques such as data mining, Relief selection, and LASSO
can help to prepare the data in order to provide a more accurate prediction. Once the relevant features
are selected, classifiers and hybrid models can be applied to predict the chances of disease
occurrence. Researchers have applied different techniques to develop classifiers and hybrid
models [12], [20]. There are still a number of issues which may prevent accurate prediction of heart
disease, such as limited medical datasets, feature selection, ML algorithm applications, and a lack of
in-depth analysis. Our research aims to address some of these research gaps to develop a better model
for CVD prediction. In this research, five datasets are combined, increasing the total size of the
dataset. Two selection techniques, Relief and LASSO, are utilized to extract the most relevant features
based on the rank values in medical references. This also helps to deal with overfitting and
underfitting problems of machine learning.
In this study, various supervised models such as AdaBoost (AB), Decision Tree (DT), Gradient
Boosting (GB), K-Nearest Neighbors (KNN), and Random Forest (RF), together with hybrid classifiers, are
applied. Results are compared with existing studies.
The flow of the paper is as follows: Section II describes the aim and scope of this research.
Section III provides an overview of related literature on the prediction of heart disease with various
classifiers and hybrid approaches. Subsequently, Section IV details the proposed system and various
performance metrics. The process of data preparation and preprocessing and the hybrid algorithms
(Bagging and Boosting methods) are explained in Section V. Section VI describes the implementation of
the system and the results. Discussion of the statistical significance of the results, runtime and
computational complexity, and hyper-parameter tuning is covered in Sections VIII to X, respectively.
Recommendations for future work and the conclusion are in Section XII, with a brief discussion of the
limitations of the proposed approach in Section XI.

II. RESEARCH AIM AND SCOPE OF THE PAPER
The aim of this research is to develop an effective method to predict heart disease, in particular
Coronary Artery Disease or Coronary Heart Disease, as accurately as possible. Required steps can be
summarized as follows:
1) Five datasets are combined to develop a larger and more reliable dataset.
2) Two selection techniques, Relief and LASSO, are utilized to extract the most relevant features based
on rank values in medical references. This also helps to deal with overfitting and underfitting
problems of machine learning.
3) Additionally, various hybrid approaches, including Bagging and Boosting, are implemented to improve
the testing rate and reduce the execution time.
4) The performance of the different models is evaluated based on the overall results with All, Relief,
and LASSO selected features.

III. LITERATURE REVIEW
The application of artificial intelligence and machine learning algorithms has gained much popularity
in recent years due to the improved accuracy and efficiency of making predictions [25]. The importance
of research in this area lies in the possibility to develop and select models with the highest
accuracy and efficiency [26]. Hybrid models which integrate different machine learning models with
information systems (major factors) are a promising approach for disease
prediction [27]. Various available public data sets are applied. In the study of Latha and Jeeva [28],
an ensemble technique was applied for improved prediction accuracy. Using bagging and boosting
techniques, the accuracy of weak classifiers was increased, and the performance for risk
identification of heart disease was considered satisfactory. They used the majority voting of Naïve
Bayes, Bayes Net, C4.5, Multilayer Perceptron, PART and Random Forest (RF) classifiers in their study
for the hybrid model development. An accuracy of 85.48% was achieved with the designed model. More
recently [29], machine learning and conventional techniques like RF, Support Vector Machine (SVM), and
learning models were tested on the UCI Heart Disease dataset. The accuracy was improved by the
voting-based model, together with multiple classifiers. The study showed that for the weak
classifiers, an improvement of 2.1% was achieved. In the study of N. K. Kumar and Sikamani [30],
different machine learning classification techniques were used to predict chronic disease. In their
study, the Hoeffding classifier achieved an accuracy of 88.56% in CVD prediction.
Ashraf et al. [15] used both individual learning algorithms and ensemble approaches like Bayes Net,
J48, KNN, multilayer perceptron, Naïve Bayes, random tree, and random forest for prediction purposes.
Of these, J48 had an accuracy of 70.77%. They subsequently employed newer techniques, of which Keras
obtained an 80% accuracy. A multi-task (MT) recurrent neural network with an attention mechanism was
proposed to predict the onset of cardiovascular disease [16]. The proposed model benefits from an Area
under Curve (AUC) increase of between 2 and 6%.
In the study of Amin et al. [12], the critical risk factors were identified, machine learning models
were applied (k-NN, DT, NB, LR, SVM, Neural Network, and a hybrid of voting with NB and LR) and a
comparative analysis was performed. The outcome of their study indicates that the hybrid model,
together with the selected attributes, achieved an accuracy of 87.41%. The mean Fisher score feature
selection algorithm (MFSFSA) together with the SVM classification model was used in the technique
proposed by Saqlain et al. [31]. By using an SVM they obtained the selected feature subset and they
used a validation process for MCC calculation. The features were selected based on a higher than
average Fisher score. The combination of MFSFSA and SVM resulted in 81.19% accuracy, a 72.92%
sensitivity, and an 88.68% specificity.
In the research work of Mienye et al. [22], a prediction model for heart disease was proposed in which
a mean-based splitting method and classification and regression trees were used to randomly partition
the dataset into smaller subsets. Afterwards, using an accuracy-based weighted classifier ensemble, a
homogeneous ensemble was generated with classification accuracies of 93% and 91% on the Cleveland and
Framingham test sets. A two-tier ensemble-based coronary heart disease (CHD) detection model [24] was
proposed in the study of Tama et al. Three different ensemble learners (random forest, gradient
boosting machine, and extreme gradient boosting machine) were used. The proposed model provides
accuracy, F1, and AUC values of 98.13%, 96.6%, and 98.7%, respectively, which exceeded other existing
CHD detection methods.
A novel prediction model was introduced in the paper of Mohan et al. [32] with different combinations
of features and several known classification techniques. An ANN with backpropagation and 13 clinical
features as the input was used in the proposed HRFLM. DT, NN, SVM, and KNN were considered while making
use of the data mining methods. SVM was useful for enhanced accuracy in disease prediction. The novel
method Vote, in conjunction with a hybrid approach using LR and NB, was proposed. An accuracy of 88.7%
was obtained with the HRFLM method.
An improved random survival forest (iRSF) with high accuracy was used for the development of a
comprehensive risk model in predicting heart failure mortality [33]. iRSF could discriminate between
survivors and non-survivors using the novel split rule and the stop criteria. Patient demographics,
clinical and laboratory information, and medications were included in the 32 risk factors for the
development of predictors. A data mining approach to detect cardiovascular disease has also been
applied [34]. The Decision Tree, Bayesian classifiers, neural networks, Association law, SVM, and KNN
data mining algorithms were used to detect heart diseases. SVM resulted in an accuracy of 99.3%.
In works related to the prediction of patient survival [35], several machine learning classifiers were
utilized. Features relating to the significant risk factors were ranked and a comparison was performed
between the traditional biostatistics tests and the provided machine learning algorithms. The result
was that serum creatinine and ejection fraction were demonstrated to be the two most relevant features
for accurate predictions. A model for CVD detection was developed with the AL Algorithm [36]. The
dataset preparation and investigation were done with four algorithms. The precision was 99.83% for the
Decision Tree and Random Forest methods, and 85.32% and 84.49% respectively for SVM and KNN. Congestive
heart failure (CHF) was effectively predicted using the ensemble method in another study [37] by
analyzing heart rate variability (HRV) and using deep neural networks to solve the gap in related
fields. The accuracy of the proposed system was 99.85%.
Yadav and Pal [3] used the UCI repository for their study. This dataset contains 14 attributes. The
classification was carried out by four tree-based classification algorithms: M5P, random Tree, and
Reduced Error Pruning and the Random forest ensemble method. The Pearson Correlation, Recursive
Features Elimination, and Lasso Regularization were the three feature-based algorithms used in this
work. The methods were then compared for accuracy and precision. The last method achieved the best
performance. In recent work [38], Gupta et al. utilized the factor analysis of mixed data (FAMD) and
RF-based MLA for developing a machine intelligence framework. RF was used for the prediction of disease
by finding the relevant features using the FAMD. The proposed
method achieved a 93.44% accuracy, an 89.28% sensitivity and a 96.96% specificity.
Rashmi et al. [40] experimented on a 303-record dataset that was extracted from the Cleveland dataset.
The proposed algorithm, Decision Tree, obtained 75.55% accuracy. Dinesh et al. [41] examined
920 records from four datasets (Cleveland, Long Beach VA, Switzerland, and Hungarian) from the UCI
machine learning repository. Random forest achieved 80.89% accuracy; on the other hand, Saqlain
obtained 68.6% accuracy on the AFIC dataset [49]. Sharma et al. [43] and Dwivedi et al. [50] applied
the K-Nearest Neighbors algorithm to the same dataset. The results were 90.16% and 80% respectively. An
accuracy of 46% was recorded by Enriko [48] when using the Kita Hospital Jakarta (450) dataset. A
somewhat better result, 56.13%, was obtained by Kaur et al. [51] using AdaBoost on the Cleveland
dataset. Shetty et al. [45] achieved 89% accuracy using the 270 records of the Statlog dataset, and
Chaurasia et al. [39] used the same dataset with a Boosting hybrid approach, resulting in an accuracy
of 75.9%. The UCI laboratory dataset was also used to evaluate the performance of the Boosting
ensemble technique. Cheng et al. obtained an accuracy of 82.5% with an ANN model [46] and
Chaurasia et al. obtained 78.88% accuracy using a hybrid model [39]. Using the Gradient Boosting
technique, Dinesh et al. [41] obtained 84.27% accuracy using a combination of 4 different datasets,
whereas Bhuvaneeswari et al. [53] achieved 95.19% accuracy using 583 records from the Cleveland and
Statlog datasets. A survey result has been generated on the Rajaie cardiovascular medical dataset [44]
using the hybrid approach, resulting in a 79.54% accuracy. On the other hand, the Bagging approach of
Decision Tree [52] achieved more than 85.03% accuracy. Three different datasets were combined into one
to obtain a more accurate result. Mohan et al. [42] achieved an accuracy of 88.4% with a hybrid
approach. Latha et al. [39] applied the Bagging approach to 303 records of the Cleveland heart disease
dataset and obtained 80.53% accuracy. Tan et al. [47] applied a hybrid approach to 303 records
collected from the Cleveland heart disease dataset and obtained 84.07% accuracy, while
Latha et al. [39] achieved 85.48%.
Various techniques have been implemented on data of cardiovascular disease patients. Data are
processed such that the K-Nearest Neighbors algorithm handles the missing data. The feature selection
process is done using Relief and LASSO. Various machine learning algorithms are implemented using the
Bagging and Boosting approaches. One of the goals of the proposed approach is to analyze the accuracy
and error rates of the algorithms in order to determine the best features.

IV. RESEARCH METHODOLOGY
This section gives an overall explanation of how an intelligent machine learning system is built over
the dataset of chronic heart disease.

A. OVERVIEW OF THE PROPOSED MODEL
The dataset is constructed by combining five different datasets (Cleveland, Hungary, Switzerland,
VA Long Beach and Statlog). This is included in the framework. Fig. 1 illustrates the workflow of the
recommended models. During data preprocessing, the combined dataset is analyzed to check for missing
values, which are then dealt with by the K-Nearest Neighbors imputation technique. To overcome
overfitting issues and avoid long execution times, two different feature selection techniques are
utilized: Relief and LASSO. This assists in extracting the best features. Performance of classifiers
with the features selected by these techniques as well as with the original features is analyzed.
After feature selection, the dataset is split into two parts: training and testing. Based on model
learning rates, 80% of the data is assigned to the training phase, and the remaining 20% to the testing
phase. All ensemble models with classifiers are implemented to make a comparison over the combined
dataset; however, the generated outcome of our model is gained within a short period. Different
trained models were tested on the dataset so that we can pick the best model for our dataset. The
process resulted in RFBM being the most useful, with 99.05% accuracy. Furthermore, the most suitable
features of a patient affected by heart disease have been suggested in this diagnosis system.

B. PERFORMANCE MEASURE INDICES
The effectiveness and accuracy of the machine learning method can be evaluated using performance
indicators. Positive classification occurs when a person is classified as having HD. When a person is
not classified as having HD, they have a negative classification. Formulas (1) to (7) are applied to
compute these indices [54], [55].
TP = True Positive (when the model correctly identified a patient as having HD).
TN = True Negative (when the model correctly identified the opposite class, such as patients truly
having no heart issues).
FP = False Positive (when the model incorrectly identified HD patients, i.e., identifying non-HD
patients as HD patients).
FN = False Negative (when the model incorrectly identified the opposite class, such as HD patients as
normal patients).

\[ \text{Accuracy (Acc)} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]
\[ \text{Recall or Sensitivity (Sen)} = \frac{TP}{TP + FN} \tag{3} \]
\[ \text{F1-score} = \frac{2\,(\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} \tag{4} \]
\[ \text{False Positive Rate} = \frac{FP}{FP + TN} \tag{5} \]
\[ \text{False Negative Rate} = \frac{FN}{TP + FN} \tag{6} \]
\[ \text{Negative predictive value} = \frac{TN}{TN + FN} \tag{7} \]
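The performance indices in (1) to (7) can be computed directly from the four confusion-matrix counts. The following is a minimal sketch, not the authors' implementation: the function name `classification_metrics`, the helper `safe_div`, the choice to return 0.0 for an empty denominator, and the example counts are all assumptions made for illustration.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero.

    Returning 0.0 is an illustrative convention (assumption), not something
    specified in the paper.
    """
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Compute the indices of Eqs. (1)-(7) from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)                      # Eq. (2)
    recall = safe_div(tp, tp + fn)                         # Eq. (3)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),  # Eq. (1)
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),  # Eq. (4)
        "fpr": safe_div(fp, fp + tn),                      # Eq. (5)
        "fnr": safe_div(fn, tp + fn),                      # Eq. (6)
        "npv": safe_div(tn, tn + fn),                      # Eq. (7)
    }


if __name__ == "__main__":
    # Hypothetical counts for a 100-patient test set (not from the paper).
    for name, value in classification_metrics(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.4f}")
```

Since (3) and (6) share the denominator TP + FN, sensitivity and the false negative rate always sum to 1 whenever that denominator is non-zero, which is a quick sanity check on reported results.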
C. APPLICATION OF THE PROPOSED MODEL
Having a suitable application of the proposed model is key to the development of this unique system
and will also help to deal with real-world challenges. The process is illustrated in this section.
Fig. 2 illustrates how a community health center can put the system to use. The following steps
describe the procedure.
• Step 1: Reports are uploaded into the database.
• Step 2: Attributes are selected from the uploaded data to create input for the trained RFBM model.
• Step 3: Selected attributes are processed in the trained model.
• Step 4: Output is generated in terms of 0 and 1.
◦ 0 = A person is less prone to CVDs.
◦ 1 = A person is prone to CVDs.
• Step 5: If ‘1’, notify or request the person to consult a doctor or take additional tests.
• Step 6: Data uploaded to the database is used to create the trained model, to further improve the
accuracy of the hybrid classifiers and the trained model.

D. JUSTIFICATION OF THE PROPOSED TECHNIQUE
This intelligent system has been developed based on the five classifiers. Subsequently, we used
ensemble techniques such as bagging and boosting, retaining those algorithms as base classifiers.
Numerous studies have already been conducted on different types of machine learning algorithms. Among
them, we picked the three most common techniques (DT, RF and KNN) and two less common techniques (AB
and GB). Some of the previous studies have shown that the predicted accuracy of the DT [1],
RF [1], [2] and KNN [3] algorithms was quite high compared to other existing techniques. Additionally,
a limited number of studies also demonstrated that AB [5], [6] as well as GB [53] can perform rather
well, with considerably high accuracy. Our paper highlights some of the notable research attempts that
deployed Bagging and Boosting ensemble techniques as well as proposed some hybrid frameworks; however,
none of those research attempts closely resembled our introduced approaches as a base classifier
except DT [8] and kNN [7]. As a consequence, in this work, all of those previous approaches have been
further explored with the help of ensemble techniques to make the proposed model more efficient.
Although it can be seen from the Literature Review that the propositions put forward in [1], [5], [24]
and [27] yielded promising predictive accuracy, it was not high enough in comparison to our work.
Basically, we felt the need to improve the current studies in this field and analyzed previous models
to determine what might be lacking, after which we took the initiative to devise a solution that might
reshape the current ideas and provide an acceptable level of results that makes the system suitable
for practical implementation.
As has been discussed before, previous works that are somewhat related to this study and deal with the
datasets used here are available; however, the performance of those systems was not as expected in
most cases.
We believe one reason for the lack of performance of some systems is the inability of those systems to
identify the most important and highly correlated features. We want to develop a method that will
first identify the optimal group of features and then identify the algorithms that work best with
those features.
In our understanding, algorithms that performed well benefitted from the tightly correlated feature
set, mainly derived from the use of Relief, whereas the algorithms that did not show strong performance
could not properly evaluate the correlative structure among the features used.
Fig. 3 is based on the 10 features most highly correlated with the predicted attribute (num), which
are selected by the Relief feature selection technique. On the right side, the attribute values are
shown (from 0.3 to −0.4). From Fig. 3, it is clearly seen that the ca, chol and trestbps features have
a strong relationship with age, where the value was approximately 0.3; on the other hand, the lowest
correlation was observed for thalach, at about −0.4. Similarly, cp shows a significant correlation
with exang. However, the correlation values among other features were not so high and fluctuated
between 0.15 and −0.3.

V. IMPLEMENTATION
A. DIFFERENT MACHINE LEARNING LIBRARIES
The implemented model is written in Python in a Jupyter notebook, using simple libraries like
pandas [56], Pyplot [57] and scikit-learn [58].

B. DATASET
Data is considered the first and most basic aspect of using machine learning techniques to get
accurate results. The applied dataset is gathered from a well-known data repository, the ‘UCI machine
learning repository’. There are five different datasets: the Cleveland, Hungary, Switzerland, VA Long
Pseudocode 1 Pseudocode for Bagging Method ‘Learning’ based on Decision Tree (DT) often applies an
BEGIN upside-down tree based progression technique. The algorithm
1. Let D = {d1 , d2 , d3 , . . . dn } be the given dataset is capable of resolving both classification and regression
2. E = {}, the set of ensemble classifiers problems. The tree grows from the root node by determine
3. C = {c1 , c2 , c3 , . . . cn }, the set of classifiers a ‘Best Feature’ or ‘Best Attribute’ from the set of attributes
4. X = the training set, X D available at hand, ‘splitting’ is then applied. Selection of the
5. Y = the test set, Y D ‘Best Attribute’ is often carried out through the calculation of
6. L = n(D) two other metric, ‘Entropy’ as shown in (9), and Information
7. for i = 1 to L do Gain, shown in (10). The ‘best attribute’ is the one that
8. S(i) = {Bootstrap sample I with replacement} I provides the most useful information. Entropy indicates how
X homogeneous the dataset is and Information Gain is the rate
9. M(i) = Model trained using C(i) on S(i) of increase or decrease in Entropy of attributes [100].
10. E = E C(i)
E (D) = −P (positive) log2 P (positive)
11. next I
12. for i = 1 to L − P (negative) log2 P (negative) (9)
13. R(i) = Y classified by E(i) Equation (9) calculates the Entropy E, of a dataset D, which
14. next i holds the positive and negative ‘Decision Attributes’.
15. Result = max(R (i) : i = 1, 2, . . . . . . , n)
END Gain (Attribute X ) = Entropy (Decision Attribute Y )
− Entropy(X , Y ) (10)
Non-parametrically supervised learning methods, such as
C4.5 are used for classification and regression. This aim of
the method is to develop a model that predicts the value of
the dependent variable by studying basic rules for decision
making.
Baihaqi et al. [73] applied the C4.5 classifier to diagnose
CAD using and obtained 78.95% accuracy. However, the
classifier C4.5 usually does not allow small datasets. The
RF classifier (describer below) may perform better [74], for
heart disease detection or alternatively the combining strategy
using bagged decision trees [75].
2) RANDOM FOREST
The Random Forest (RF) classifier is an ensemble algo-
rithm [76]. This implies that it consists of more than one
algorithm. Usually In this case, it consists of several DT
algorithms [77]. RF build up an entire forest from several
uncorrelated and random Decision Trees during training seg-
ment [101]. Ensemble learning methods employ multiple
learning algorithms to generate an optimal predictive model,
which can provide better results than any of the individual
FIGURE 7. Boosting method.
model’s prediction [101]. Computational complexity may
increase as RF uses more features than a standalone DT, but
F. PROPOSED APPROACH FOR THE CLASSIFICATION it generally has a higher accuracy when dealing with unseen
MODEL datasets. The result of the Random Forest algorithm is the
This section discusses the machine learning approaches that mean result of the total number of Decision Tree algorithms.
are used in this research to generate an intelligent prediction Illustration. Fig. 8 gives and graphical description of Random
system for heart disease. Forest [87].
The Random Forest ensemble classifier builds and inte-
1) DECISION TREE grates multiple decision trees to get the best result. It pri-
The Decision Tree algorithm, which has only 2 numClasses, marily refers to tree learning through aggregating bootstraps.
is one of the most powerful and well-known predictive instru- Let the provided data be X = {x1 , x2 , x3 , . . . . . . , xn ) with
ments [70]. Every interior node in the structure of a Decision responses Y = { x1 , x2 , x3 , . . . . . . , xn } with a lower limit of
Tree refers to testing a property, every branch corresponds to a b = 1 and an upper limit of B: The prediction PB for sample x0
test outcome, and each leaf node is a separate class [71], [87]. is made by averaging the predictions b=1 f b (x0 ) from every
two-dimensional space. KNN puts the new data into the class
which has the least Euclidean distance to the new data.
Previous research [82] has used KNN as an automated
classification technique for coronary artery disease. When
conducting linear discriminant analysis KNN had a better
accuracy than SVM and NN [85]. Rajkumar and Reena
obtained an accuracy of just 45.67% [83] using KNN to
diagnose CAD. However, Gilani et al. [84] subsequently
compared the F1 score with many classification models and
found that the KNN classifier performed best among the
seven classifiers. A limitation of the method is that due to the
high computational complexity, KNN is not appropriate for
implementation in a low power or a real-time environment.
FIGURE 8. Random Forest algorithm.

individual trees for $x'$, as shown in (11).

$$ j = \frac{1}{B} \sum_{b=1}^{B} f_b(x') \tag{11} $$

The Random Forest (RF) classifier, a combination of many different tree predictors, is often used for the analysis of big data. It is an ensemble learning method for classification, regression, and other tasks. Banerjee et al. [79] successfully applied the RF classifier, using time-frequency characteristics from PCG signals, to identify heart disease.

3) K-NEAREST NEIGHBORS
K-Nearest Neighbors (n_neighbors = 5) is among the most common classification techniques in the field of machine learning. It has previously been used for coronary artery disease. KNN is considered nonparametric since the method makes no assumptions about the data distribution. KNN assesses the similarity between the new data and the existing data and places the new data in the class whose existing members it is nearest to. KNN is used for regression problems as well as for recognition problems. It is also known as the lazy learner algorithm [80], as it does not immediately learn from a collection of training data. KNN calculates the Euclidean distance between a new data point $A(x_1, y_1)$ and a previously available data point $B(x_2, y_2)$ using equation (12) [81].

$$ \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{12} $$

The Euclidean formula may be used to evaluate the distance between two data points $(x_1, y_1)$ and $(x_2, y_2)$ in

On a different note, in place of the Euclidean distance, Suryawanshi and Sharma [102] have shown that the 'Spearman Correlation' [103] can also be employed as the distance measure for KNN-based classification, as shown in (13). P and Q are the training and testing tuples respectively, while n is the total number of observations. The value of $f_{ij}$ lies between −1 and 1.

$$ f_{ij} = 1 - \frac{6 \sum_{i=1}^{n} \left( \operatorname{rank}(P_i) - \operatorname{rank}(Q_j) \right)^2}{n \left( n^2 - 1 \right)} \tag{13} $$

This change demonstrated an enhancement over the regular KNN model, with nearly 50% improvement in accuracy (97.44% with an 80%-20% train and test ratio).

4) ADABOOST
AdaBoost, or Adaptive Boosting, is a boosting algorithm that is used for binary classification and combines a number of weak classifiers to make a more robust classifier [86]. This algorithm produces the predicted accuracy based on 1000 samples. Each instance in the training dataset is given a starting weight [87], as shown in (14).

$$ \text{Weight}(x_i) = \frac{1}{N} \tag{14} $$

where N is the number of training instances and $x_i$ is the i-th training instance. The decision stump gives an output for each input variable. The misclassification rate is then calculated using equation (15).

$$ \text{Error} = \frac{N - \text{correct}}{N} \tag{15} $$

where N is the number of training instances and correct is the number of instances classified correctly. Boosting simply means combining several simple learners to achieve a more accurate prediction. AdaBoost adjusts the weights, which vary for both samples and classifiers [88]. This causes the classifiers to focus on observations that are relatively difficult to classify accurately. The final classification formula is shown in equation (16).

$$ H_K(p) = \operatorname{sign}\left( \sum_{k=1}^{K} a_k h_k(p) \right) \tag{16} $$

Equation (16) is a linear combination of all the weak classifiers (simple learners), where K is the total number of weak classifiers, $h_k(p)$ is the output of weak classifier k (this can be either −1 or 1), and $a_k$ is the weight of classifier k.
5) GRADIENT BOOSTING
TABLE 2. Features selected by Relief algorithms and their rankings.
TABLE 4. A comparison of accuracy between the proposed system and some existing systems.
After changing the number of selected features by applying the feature selection algorithms, significant improvements were noticeable. When all features were used in the experiment, the best accuracy was achieved with the RFBM hybrid model (92.65%), and a low accuracy score was obtained with KNN (83.61%). Application of the LASSO selection algorithm led to some dramatic changes: the highest accuracy was obtained with GBBM (97.85%), whereas the RF model performed the worst. The best results were obtained with the Relief feature selection technique, which achieved 99.05% accuracy with RFBM. Our results are compared with existing models and datasets in Table 4. Each row of the table deals with an algorithm that has been used in our study, as well as in two other related studies, together with the results that have been reported. As auxiliary information, we have also added the datasets that those studies used. The table gives an overall picture of the performance of the algorithms in our study against other related works. The highest accuracy in previous results was just over 90.16% [43], and the performance of hybrid models was poor due to the limitations of the datasets [45]. The best result for hybrid models was only 89% (see Table 4). The highest accuracy achieved in previous research was 95.19% [53], with very poor performance of hybrid models [39]. Rashmi et al. [40] examined a 303-record dataset that had been extracted from the Cleveland dataset; that analysis showed that the Decision Tree achieved 75.55% accuracy. Dinesh et al. [41] worked on a 920-record dataset combining the Cleveland, Long Beach VA, Switzerland and Hungarian datasets from the UCI repository and showed that RF could obtain an accuracy of 80.89%. The authors of [49] applied DT and RF to a dataset of 500, taken from the Armed Forces Institute of Cardiology (AFIC), and reported that DT achieved the best result (86.6%). Hybrid classifiers were explored by several researchers [39], [52], who obtained an accuracy of 85.48% using the KNNBM approach. The performance of our proposed model is very good compared with previous research, as can be seen from Table 4.

FPR shows the percentage of wrongly detected heart disease cases, whereas the FNR, or miss rate, measures the incorrect negative classifications. Fig. 17 shows FPR and
X. HYPERPARAMETER TUNING
GridSearchCV, which searches over hyperparameter values, is a tuning process that can determine the optimal values for a given model. In our proposed system, GridSearchCV has been used in order to obtain a higher accuracy. The following parameters were used on the examined algorithms (see Table 7):
sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
For an accurate prediction, tuning is a fundamental part of all types of classifiers. As a result, we tuned our 5 classifiers, namely DT, RF, KNN, AB, and GB; however, the default parameters were used with the base classifiers for the ensemble technique.

XI. LIMITATIONS OF OUR PROPOSED SYSTEM
The overall discussion has shown that the performance of the different classifiers was good in comparison to previous studies. However, there are a few limitations, such as the dependency on a specific feature selection technique, in this case the reliance on Relief to produce highly accurate results. Additionally, a high level of missing values in the dataset can have an adverse effect. We have demonstrated how to address this issue through proper methods; other datasets used with this model must therefore also handle this issue if the proportion of missing values is significant. Furthermore, though our training dataset is reasonably extensive, a larger dataset would make the model more precise.

XII. CONCLUSION
Identifying the risk of heart disease with reasonably high accuracy could potentially have a profound effect on the long-term mortality rate of humans, regardless of social and cultural background. Early diagnosis is a key step in achieving that goal. Several studies have already attempted to predict heart disease with the help of machine learning. This study takes a similar route, but with an improved and novel method and with a larger dataset for training the model. This research demonstrates that the Relief feature selection algorithm can provide a tightly correlated feature set, which can then be used with several machine learning algorithms. The study has also identified that RFBM works particularly well with the high impact features (obtained by feature selection algorithms or medical literature) and produces an accuracy substantially higher than related work. RFBM achieved an
accuracy of 99.05% with 10 features. In the future, we aim to generalize the model even further so that it can work with other feature selection algorithms and be robust against datasets where the level of missing data is high. The application of Deep Learning algorithms is another future approach. The primary aim of this research was to improve upon the existing work with an innovative and novel way of building the model, as well as to make the model useful and easily implementable in practical settings.

REFERENCES
[1] C. Trevisan, G. Sergi, S. J. B. Maggi, and H. Dynamics, “Gender differences in brain-heart connection,” in Brain and Heart Dynamics. Cham, Switzerland: Springer, 2020, p. 937.
[2] M. S. Oh and M. H. Jeong, “Sex differences in cardiovascular disease risk factors among Korean adults,” Korean J. Med., vol. 95, no. 4, pp. 266–275, Aug. 2020.
[3] D. C. Yadav and S. Pal, “Prediction of heart disease using feature selection and random forest ensemble method,” Int. J. Pharmaceutical Res., vol. 12, no. 4, 2020.
[4] World Health Organization and J. Dostupno, “Cardiovascular diseases: Key facts,” vol. 13, no. 2016, p. 6, 2016. [Online]. Available: https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
[5] K. Uyar and A. Ilhan, “Diagnosis of heart disease using genetic algorithm based trained recurrent fuzzy neural networks,” Procedia Comput. Sci., vol. 120, pp. 588–593, Jan. 2017.
[6] A. U. Haq, J. P. Li, M. H. Memon, S. Nazir, and R. Sun, “A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms,” Mobile Inf. Syst., vol. 2018, pp. 1–21, Dec. 2018.
[7] S. Pouriyeh, S. Vahid, G. Sannino, G. De Pietro, H. Arabnia, and J. Gutierrez, “A comprehensive investigation and comparison of machine learning techniques in the domain of heart disease,” in Proc. IEEE Symp. Comput. Commun. (ISCC), Jul. 2017, pp. 204–207.
[8] J. Mourao-Miranda, A. L. W. Bokde, C. Born, H. Hampel, and M. Stetter, “Classifying brain states and determining the discriminating activation patterns: Support vector machine on functional MRI data,” NeuroImage, vol. 28, no. 4, pp. 980–995, Dec. 2005.
[9] S. Ghwanmeh, A. Mohammad, and A. Al-Ibrahim, “Innovative artificial neural networks-based decision support system for heart diseases diagnosis,” J. Intell. Learn. Syst. Appl., vol. 5, no. 3, pp. 176–183, 2013.
[10] Q. K. Al-Shayea, “Artificial neural networks in medical diagnosis,” Int. J. Comput. Sci., vol. 8, no. 2, pp. 150–154, 2011.
[11] F. M. J. M. Shamrat, M. A. Raihan, A. K. M. S. Rahman, I. Mahmud, and R. Akter, “An analysis on breast disease prediction using machine learning approaches,” Int. J. Sci. Technol. Res., vol. 9, no. 2, pp. 2450–2455, Feb. 2020.
[12] M. S. Amin, Y. K. Chiam, and K. D. Varathan, “Identification of significant features and data mining techniques in predicting heart disease,” Telematics Informat., vol. 36, pp. 82–93, Mar. 2019.
[13] N. Kausar, S. Palaniappan, B. B. Samir, A. Abdullah, and N. Dey, “Systematic analysis of applied data mining based optimization algorithms in clinical attribute extraction and classification for diagnosis of cardiac patients,” in Applications of Intelligent Optimization in Biology and Medicine. Cham, Switzerland: Springer, 2016, pp. 217–231.
[14] J. Mackay and G. A. Mensah, “The atlas of heart disease and stroke,” World Health Org., Geneva, Switzerland, Tech. Rep., 2004.
[15] M. Ashraf, S. M. Ahmad, N. A. Ganai, R. A. Shah, M. Zaman, S. A. Khan, and A. A. Shah, Prediction of Cardiovascular Disease Through Cutting-Edge Deep Learning Technologies: An Empirical Study Based on TENSORFLOW, PYTORCH and KERAS. Singapore: Springer, 2021, pp. 239–255.
[16] F. Andreotti, F. S. Heldt, B. Abu-Jamous, M. Li, A. Javer, O. Carr, S. Jovanovic, N. Lipunova, B. Irving, R. T. Khan, R. Dürichen, “Prediction of the onset of cardiovascular diseases from electronic health records using multi-task gated recurrent units,” 2020, arXiv:2007.08491. [Online]. Available: https://arxiv.org/abs/2007.08491
[17] W. Wiharto, H. Kusnanto, and H. Herianto, “Hybrid system of tiered multivariate analysis and artificial neural network for coronary heart disease diagnosis,” Int. J. Electr. Comput. Eng., vol. 7, no. 2, p. 1023, Apr. 2017.
[18] A. K. Paul, P. C. Shill, M. R. I. Rabin, and M. A. H. Akhand, “Genetic algorithm based fuzzy decision support system for the diagnosis of heart disease,” in Proc. 5th Int. Conf. Informat., Electron. Vis. (ICIEV), May 2016, pp. 145–150.
[19] X. Liu, X. Wang, Q. Su, M. Zhang, Y. Zhu, Q. Wang, and Q. Wang, “A hybrid classification system for heart disease diagnosis based on the RFRS method,” Comput. Math. Med., vol. 2017, pp. 1–11, Jan. 2017.
[20] D. Singh and J. S. Samagh, “A comprehensive review of heart disease prediction using machine learning,” J. Crit. Rev., vol. 7, no. 12, p. 2020, 2020.
[21] M. Shouman, T. Turner, and R. Stocker, “Integrating clustering with different data mining techniques in the diagnosis of heart disease,” J. Comput. Sci. Eng., vol. 20, no. 1, pp. 1–10, 2013.
[22] I. D. Mienye, Y. Sun, and Z. Wang, “An improved ensemble learning approach for the prediction of heart disease risk,” Informat. Med. Unlocked, vol. 20, Jan. 2020, Art. no. 100402.
[23] H. Wang, Z. Huang, D. Zhang, J. Arief, T. Lyu, and J. Tian, “Integrating co-clustering and interpretable machine learning for the prediction of intravenous immunoglobulin resistance in kawasaki disease,” IEEE Access, vol. 8, pp. 97064–97071, 2020.
[24] B. A. Tama, S. Im, and S. Lee, “Improving an intelligent detection system for coronary heart disease using a two-tier classifier ensemble,” BioMed Res. Int., vol. 2020, Apr. 2020, Art. no. 9816142.
[25] J. Mishra and S. Tarar, Chronic Disease Prediction Using Deep Learning. Singapore: Springer, 2020, pp. 201–211.
[26] F. Z. Abdeldjouad, M. Brahami, and N. Matta, A Hybrid Approach for Heart Disease Diagnosis and Prediction Using Machine Learning Techniques. Cham, Switzerland: Springer, 2020, pp. 299–306.
[27] M. Tarawneh and O. Embarak, “Hybrid approach for heart disease prediction using data mining techniques,” Acta Sci. Nutritional Health, vol. 3, no. 7, pp. 147–151, Jul. 2019.
[28] C. B. C. Latha and S. C. Jeeva, “Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques,” Informat. Med. Unlocked, vol. 16, Jan. 2019, Art. no. 100203.
[29] I. Javid, A. Khalaf, and R. Ghazali, “Enhanced accuracy of heart disease prediction using machine learning and recurrent neural networks ensemble majority voting method,” Int. J. Adv. Comput. Sci. Appl., vol. 11, no. 3, 2020.
[30] N. Kumar and K. Sikamani, “Prediction of chronic and infectious diseases using machine learning classifiers: A systematic approach,” Int. J. Intell. Eng. Syst., vol. 13, no. 4, pp. 11–20, 2020.
[31] S. M. Saqlain, M. Sher, F. A. Shah, I. Khan, M. U. Ashraf, M. Awais, and A. Ghani, “Fisher score and matthews correlation coefficient-based feature subset selection for heart disease diagnosis using support vector machines,” Knowl. Inf. Syst., vol. 58, no. 1, pp. 139–167, Jan. 2019.
[32] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease prediction using hybrid machine learning techniques,” IEEE Access, vol. 7, pp. 81542–81554, 2019.
[33] F. Miao, Y.-P. Cai, Y.-X. Zhang, X.-M. Fan, and Y. Li, “Predictive modeling of hospital mortality for patients with heart failure by using an improved random survival forest,” IEEE Access, vol. 6, pp. 7244–7253, 2018.
[34] C. Raju, E. Philipsy, S. Chacko, L. P. Suresh, and S. D. Rajan, “A survey on predicting heart disease using data mining techniques,” in Proc. Conf. Emerg. Devices Smart Syst. (ICEDSS), 2018, pp. 253–255.
[35] D. Chicco and G. Jurman, “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone,” BMC Med. Informat. Decis. Making, vol. 20, no. 1, p. 16, Dec. 2020.
[36] E. Ahmad, A. Tiwari, and A. Kumar, “Cardiovascular Diseases (CVDs) Detection using Machine Learning Algorithms,”
[37] L. Wang, W. Zhou, Q. Chang, J. Chen, and X. Zhou, “Deep ensemble detection of congestive heart failure using short-term RR intervals,” IEEE Access, vol. 7, pp. 69559–69574, 2019.
[38] A. Gupta, R. Kumar, H. S. Arora, and B. Raman, “MIFH: A machine intelligence framework for heart disease diagnosis,” IEEE Access, vol. 8, pp. 14659–14674, 2020.
[39] C. B. C. Latha and S. C. Jeeva, “Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques,” Informat. Med. Unlocked, vol. 16, no. 2, 2019, Art. no. 100203.
[40] G. O. Rashmi and U. M. A. Kumar, “Machine learning methods for heart disease prediction,” Int. J. Eng. Adv. Technol., vol. 8, no. 5S, pp. 220–223, May 2019.
[41] K. G. Dinesh, K. Arumugaraj, K. D. Santhosh, and V. Mareeswari, “Prediction of cardiovascular disease using machine learning algorithms,” in Proc. Int. Conf. Current Trends Towards Converging Technol. (ICCTCT), Coimbatore, India, Mar. 2018, pp. 1–7.
[42] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease prediction using hybrid machine learning techniques,” IEEE Access, vol. 7, pp. 81542–81554, 2019.
[43] S. Sharma and M. Parmar, “Heart diseases prediction using deep learning neural network model,” Int. J. Innov. Technol. Exploring Eng., vol. 9, no. 3, pp. 1–5, Jan. 2020.
[44] R. Alizadehsani, J. Habibi, Z. A. Sani, H. Mashayekhi, R. Boghrati, A. Ghandeharioun, F. Khozeimeh, and F. Alizadeh-Sani, “Diagnosing coronary artery disease via data mining algorithms by considering laboratory and echocardiography features,” Res. Cardiovascular Med., vol. 2, no. 3, pp. 133–139, Aug. 2013.
[45] A. A. Shetty and C. Naik, “Different data mining approaches for predicting heart disease,” Int. J. Innov. Sci. Eng. Technol., vol. 5, pp. 277–281, May 2016.
[46] C. A. Cheng and H. W. Chiu, “An artificial neural network model for the evaluation of carotid artery stenting prognosis using a national-wide database,” in Proc. 39th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), Jul. 2017, pp. 2566–2569.
[47] K. C. Tan, E. J. Teoh, Q. Yu, and K. C. Goh, “A hybrid evolutionary algorithm for attribute selection in data mining,” Expert Syst. Appl., vol. 36, no. 4, pp. 8616–8630, May 2009.
[48] I. K. A. Enriko, “Comparative study of heart disease diagnosis using top ten data mining classification algorithms,” in Proc. 5th Int. Conf. Frontiers Educ. Technol., 2019, pp. 159–164.
[49] M. Saqlain, W. Hussain, N. A. Saqib, and M. A. Khan, “Identification of heart failure by using unstructured data of cardiac patients,” in Proc. 45th Int. Conf. Parallel Process. Workshops (ICPPW), Aug. 2016, pp. 426–431.
[50] A. K. Dwivedi, “Evaluate the performance of different machine learning techniques for prediction of heart disease using ten-fold cross-validation,” Neural Comput. Appl., vol. 29, pp. 685–693, Sep. 2016.
[51] A. Kaur, “A comprehensive approach to predict heart diseases using data mining,” Int. J. Innov. Eng. Technol., vol. 8, no. 2, pp. 1–5, Apr. 2017.
[52] V. Chaurasia and S. Pal, “Data mining approach to detect heart diseases,” Int. J. Adv. Comput. Sci. Inf. Technol., vol. 2, no. 4, pp. 56–66, 2014.
[53] R. Bhuvaneeswari, P. Sudhakar, and G. Prabakaran, “Heart disease prediction model based on gradient boosting tree (GBT) classification algorithm,” Int. J. Recent Technol. Eng., vol. 8, no. 2, pp. 41–51, Sep. 2019.
[54] F. M. J. M. Shamrat, P. Ghosh, M. H. Sadek, A. Kazi, and S. Shultana, “Implementation of machine learning algorithms to detect the prognosis rate of kidney disease,” in Proc. IEEE Int. Conf. Innov. Technol., Nov. 2020, pp. 1–7.
[55] S. Shultana, M. S. Moharram, and N. Neehal, “Olympic sports events classification using convolutional neural networks,” in Proc. Int. Joint Conf. Comput. Intell. (IJCCI), Dhaka, Bangladesh, 2018, pp. 507–518.
[56] S. V. J. Jaikrishnan, O. Chantarakasemchit, and P. Meesad, “A breakup machine learning approach for breast cancer prediction,” in Proc. 11th Int. Conf. Inf. Technol. Electr. Eng. (ICITEE), Pattaya, Thailand, Oct. 2019, pp. 1–6.
[57] A. Gavhane, G. Kokkula, I. Pandya, and K. Devadkar, “Prediction of heart disease using machine learning,” in Proc. 2nd Int. Conf. Electron., Commun. Aerosp. Technol. (ICECA), Coimbatore, India, Mar. 2018, pp. 1275–1278.
[58] G. Singh, “Breast cancer prediction using machine learning,” Int. J. Sci. Res. Comput. Sci., Eng. Inf. Technol., vol. 8, no. 4, pp. 278–284, Jul. 2020.
[59] Heart Disease Datasets From UCI Machine Learning Repository. Accessed: May 31, 2020. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
[60] Heart Disease Statlog Dataset of UCI Machine Learning Repository. Accessed: May 31, 2020. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/statlog+(heart)
[61] S. Ralston, I. Penman, M. Strachan, and R. Hobson, Davidson’s Principles and Practice of Medicine, 23rd ed. U.K.: Elsevier, Apr. 2018, pp. 219–225.
[62] A. Rairikar, V. Kulkarni, V. Sabale, H. Kale, and A. Lamgunde, “Heart disease prediction using data mining techniques,” in Proc. Int. Conf. Intell. Comput. Control (IC), Jun. 2017, pp. 1–8.
[63] A. Acharya, “Comparative study of machine learning algorithms for heart disease prediction,” M.S. thesis, Helsinki Metropolia Univ. Appl. Sci., Helsinki, Finland, Apr. 2017. [Online]. Available: https://www.theseus.fi/bitstream/handle/10024/124622/Final%20Thesis.pdf?sequence=1&isAllowed=y
[64] A. M. D. Silva, Feature Selection, vol. 13. Berlin, Germany: Springer, 2015, pp. 1–13.
[65] S. Chikhi and S. Benhammada, “ReliefMSS: A variation on a feature ranking ReliefF algorithm,” Int. J. Bus. Intell. Data Mining, vol. 4, pp. 375–390, Jan. 2009.
[66] R. Tibshirani, “Regression shrinkage and selection via the lasso: A retrospective,” J. Roy. Stat. Soc. B, Stat. Methodol., vol. 73, no. 3, pp. 273–282, Jun. 2011.
[67] C. Zhou and A. Wieser, “Jaccard analysis and LASSO-based feature selection for location fingerprinting with limited computational complexity,” in Proc. 14th Int. Conf. Location Based Services (LBS), Dec. 2018, pp. 71–87.
[68] Ensemble Techniques of Bagging. Accessed: Jun. 31, 2020. [Online]. Available: https://quantdare.com/what-is-the-difference-between-Bagging-and-Boosting/
[69] An Explanation of Ensemble Bagging Techniques. Accessed: Jun. 31, 2020. [Online]. Available: https://towardsdatascience.com/ensemble-methods-Bagging-Boosting-and-stacking-c9214a10a205/
[70] P. Ghosh, M. Z. Hasan, and M. I. Jabiullah, “A comparative study of machine learning approaches on dataset to predicting cancer outcome,” Bangladesh Electron. Soc., vol. 18, nos. 1–3, pp. 1–5, 2018.
[71] F. M. Javed Mehedi Shamrat, Z. Tasnim, P. Ghosh, A. Majumder, and M. Z. Hasan, “Personalization of job circular announcement to applicants using decision tree classification algorithm,” in Proc. IEEE Int. Conf. Innov. Technol. (INOCON), Nov. 2020, pp. 1–5.
[72] M. M. Alam, S. Saha, P. Saha, F. N. Nur, N. N. Moon, A. Karim, and S. Azam, “D-CARE: A non-invasive glucose measuring technique for monitoring diabetes patients,” in Proc. Int. Joint Conf. Comput. Intell. Algorithms Intell. Syst., 2019, pp. 443–453.
[73] W. M. Baihaqi, N. A. Setiawan, and I. Ardiyanto, “Rule extraction for fuzzy expert system to diagnose coronary artery disease,” in Proc. 1st Int. Conf. Inf. Technol., Inf. Syst. Electr. Eng. (ICITISEE), Yogyakarta, Indonesia, Aug. 2016, pp. 136–141.
[74] Z. Masetic and A. Subasi, “Congestive heart failure detection using random forest classifier,” Comput. Methods Programs Biomed., vol. 130, pp. 54–64, Jul. 2016.
[75] A. Mert, N. Kılıç, and A. Akan, “Evaluation of bagging ensemble method with time-domain feature extraction for diagnosing of arrhythmia beats,” Neural Comput. Appl., vol. 24, no. 2, pp. 317–326, Feb. 2014.
[76] P. Ghosh, A. Karim, S. T. Atik, S. Afrin, and M. Saifuzzaman, “Expert model of cancer disease using supervised algorithms with a LASSO feature selection approach,” Int. J. Electr. Comput. Eng., vol. 11, no. 3, 2020.
[77] P. Ghosh, M. Z. Hasan, O. A. Dhore, A. A. Mohammad, and M. I. Jabiullah, “On the application of machine learning to predicting cancer outcome,” in Proc. Int. Conf. Electron. (ICT). Dhaka, Bangladesh: Bangladesh Electronics Society (BES), Nov. 2018, p. 60.
[78] Responsible for Heart Disease Risk Factors. Accessed: Jul. 15, 2020. [Online]. Available: https://www.texasheart.org/heart-health/heart-informationcenter/topics/heart-disease-risk-factors/
[79] R. Banerjee, S. Biswas, S. Banerjee, A. D. Choudhury, T. Chattopadhyay, A. Pal, P. Deshpande, and K. M. Mandana, “Time-frequency analysis of phonocardiogram for classifying heart disease,” in Proc. Comput. Cardiol. Conf. (CinC), Vancouver, BC, Canada, Sep. 2016, pp. 573–576.
[80] F. M. J. M. Shamrat, P. Ghosh, M. H. Sadek, M. A. Kazi, and S. Shultana, “Implementation of machine learning algorithms to detect the prognosis rate of kidney disease,” in Proc. IEEE Int. Conf. Innov. Technol., Nov. 2020, pp. 1–7.
[81] An Overview of K_Nearest Neighbors Algorithm. Accessed: Jun. 31, 2020. [Online]. Available: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
[82] D. Giri, U. R. Acharya, R. J. Martis, S. V. Sree, T.-C. Lim, T. Ahamed, and J. S. Suri, “Automated diagnosis of coronary artery disease affected patients using LDA, PCA, ICA and discrete wavelet transform,” Knowl.-Based Syst., vol. 37, pp. 274–282, Jan. 2013.
[83] A. Rajkumar and G. S. Reena, “Diagnosis of heart disease using data mining algorithm,” Global J. Comput. Sci. Technol., vol. 10, pp. 38–43, Sep. 2010.
[84] M. Gilani, J. M. Eklund, and M. Makrehchi, “Automated detection of atrial fibrillation episode using novel heart rate variability features,” in Proc. 38th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), Lake Buena Vista, FL, USA, Aug. 2016, pp. 3461–3464.
[85] K. Padmavathi and K. S. Ramakrishna, “Classification of ECG signal during atrial fibrillation using autoregressive modeling,” Procedia Comput. Sci., vol. 46, pp. 53–59, Jan. 2015.
[86] S. H. Ripon, “Rule induction and prediction of chronic kidney disease using boosting classifiers, Ant-Miner and J48 Decision Tree,” in Proc. Int. Conf. Elect., Comput. Commun. Eng. (ECCE), Cox’s Bazar, Bangladesh, 2019, pp. 1–6.
[87] A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti, and M. Alazab, “A comprehensive survey for intelligent spam email detection,” IEEE Access, vol. 7, pp. 168261–168295, 2019.
[88] P. Ghosh, F. M. J. M. Shamrat, S. Shultana, S. Afrin, A. A. Anjum, and A. A. Khan, “Optimization of prediction method of chronic kidney disease with machine learning algorithms,” in Proc. 15th Int. Symp. Artif. Intell. Natural Lang. Process. (iSAI-NLP), Int. Conf. Artif. Intell. Internet Things (AIoT), 2020.
[89] An Overview of Gradient Boosting Algorithm. Accessed: Jun. 31, 2020. [Online]. Available: https://machinelearningmastery.com/gentle-introduction-gradient-Boosting-algorithm-machine-learning/
[90] M. Almasoud and T. E. Ward, “Detection of chronic kidney disease using machine learning algorithms with least number of predictors,” Int. J. Adv. Comput. Sci. Appl., vol. 10, no. 8, pp. 89–96, 2019.
[91] Gradient Boosting Algorithm. Accessed: Jun. 31, 2020. [Online]. Available: https://data-flair.training/blogs/gradient-Boosting-algorithm/
[92] T. Chen and C. Guestrin, “XGBOOST: A scalable tree boosting system,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 785–794.
[93] J. Cheng, G. Li, and X. Chen, “Research on travel time prediction model of freeway based on gradient boosting decision tree,” IEEE Access, vol. 7, pp. 7466–7480, 2019, doi: 10.1109/ACCESS.2018.2886549.
[94] A. Natekin and A. Knoll, “Gradient boosting machines, a tutorial,” Frontiers Neurorobotics, vol. 7, no. 7, pp. 1–21, 2013.
[95] A. M. De Silva and P. H. W. Leong, Grammar-Based Feature Generation for Time-Series Prediction. Berlin, Germany: Springer, 2015.
[96] F. M. J. M. Shamrat, M. Asaduzzaman, P. Ghosh, M. D. Sultan, and Z. Tasnim, “A Web based application for agriculture: ‘Smart farming system,’” Int. J. Emerg. Trends Eng. Res., vol. 8, no. 6, pp. 2309–2320, Jun. 2020.
[97] Responsible for Heart Disease Risk Factors. Accessed: Jul. 15, 2020. [Online]. Available: https://www.texasheart.org/heart-health/heart-informationcenter/topics/heart-disease-risk-factors/
[98] F. M. J. M. Shamrat, P. Ghosh, I. Mahmud, N. I. Nobel, and M. D. Sultan, “An intelligent embedded AC automation model with temperature prediction and human detection,” in Proc. 2nd Int. Conf. Emerg. Technol. Data Mining Inf. Secur. (IEMIS), 2020.
[99] Sex, Age, Cardiovascular Risk Factors, and Coronary Heart Disease. Accessed: Dec. 29, 2020. [Online]. Available: https://www.ahajournals.org/doi/full/10.1161/01.cir.99.9.1165
[100] S. Hegelich, “Decision trees and random forests: Machine learning techniques to classify rare events,” Eur. Policy Anal., vol. 2, no. 1, pp. 98–120, 2016.
[101] K. Fawagreh, M. M. Gaber, and E. Elyan, “Random forests: From early developments to recent advancements,” Syst. Sci. Control Eng., vol. 2, no. 1, pp. 602–609, Dec. 2014.
[102] A. Sharma and A. Suryawanshi, “A novel method for detecting spam email using KNN classification with spearman correlation as distance measure,” Int. J. Comput. Appl., vol. 136, no. 6, pp. 28–35, Feb. 2016.
[103] Spearman’s Rank-Order Correlation. Accessed: Jul. 15, 2019. [Online]. Available: https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php

PRONAB GHOSH received the B.Sc. degree from the Computer Science and Engineering Department, Daffodil International University, in 2019. He has been heavily involved in collaborative research activities with researchers in Bangladesh and researchers from Australia, especially in the fields of machine learning, deep learning, cloud computing, and the IoT.

SAMI AZAM is currently a leading Researcher and a Senior Lecturer with the College of Engineering and IT, Charles Darwin University, Casuarina, NT, Australia. He is also actively involved in the research fields relating to Computer Vision, Signal Processing, Artificial Intelligence, and Biomedical Engineering. He has a number of publications in peer-reviewed journals and international conference proceedings.

MIRJAM JONKMAN (Member, IEEE) is currently a Lecturer and a Researcher with the College of Engineering, IT, and Environment. Her research interests include biomedical engineering, signal processing, and the application of computer science to real life problems.

ASIF KARIM is currently a Ph.D. Researcher with Charles Darwin University, Casuarina, NT, Australia, and lives in the port city of Darwin. His research interests include machine intelligence and cryptographic communication. He is also working towards the development of a robust and advanced email filtering system primarily using Machine Learning algorithms. He has considerable industry experience in IT, primarily in the field of Software Engineering.

F. M. JAVED MEHEDI SHAMRAT received the B.Sc. degree in software engineering from Daffodil International University, in 2018. He used to work at Daffodil International University as a Research Associate. He is currently working in a Government Project under the ICT Division as a Researcher and Developer. He has published several research papers and articles in journals (Scopus) and international conferences. His research interests include the IoT, machine learning, data science, information security, android applications, image processing, neural network, cyber security, Artificial Intelligence, robotics, and deep learning.

EVA IGNATIOUS is currently a Ph.D. Researcher with Charles Darwin University, Casuarina, NT, Australia. Her research interests include biomedical signal processing (interesting features and abnormalities found in bio-signals), theoretical modelling and simulation (breast cancer tissues), applied electronics (thermistors), process control and instrumentation, and embedded/VLSI systems. She has considerable research experience with one U.S. patent and two Indian patents for the development of thermal sensor-based breast cancer detection at its early stages together with Centre for Materials for Electronics Technology (C-MET), an autonomous scientific society under Ministry of Electronics and Information Technology (MeitY), Government of India. She also has industrial experience as a Production Engineer and a Quality Controller, primarily in the Electronics and Instrumentation Engineering.