
Prediction of Coronary Heart Disease using

Supervised Machine Learning Algorithms


Divya Krishnani, Anjali Kumari, Akash Dewangan, Aditya Singh, Nenavath Srinivas Naik
Department of Computer Science and Engineering, Dr. SPM International Institute of Information Technology, Naya Raipur, India
Email: [email protected], [email protected], [email protected],
[email protected], [email protected]

Abstract—New technologies like Machine learning and Big data analytics have been proven to provide promising solutions to biomedical communities, healthcare problems, and patient care. They also help in the early prediction of disease by accurate interpretation of medical data. Disease management strategies can further be improved by the detection of early signs of disease. This early prediction, moreover, can be helpful in controlling the symptoms of the disease as well as its proper treatment. Machine learning approaches can be used in the prediction of chronic diseases, such as kidney and heart diseases, by developing classification models. In this paper, we propose a preprocessing-extensive approach to predict Coronary Heart Disease (CHD). The approach involves replacing null values, resampling, standardization, normalization, classification, and prediction. This work aims to predict the risk of CHD using machine learning algorithms like Random Forest, Decision Trees, and K-Nearest Neighbours. Also, a comparative study among these algorithms on the basis of prediction accuracy is performed. Further, K-fold Cross Validation is used to generate randomness in the data. These algorithms are evaluated on the "Framingham Heart Study" dataset, which has 4240 records. In our experimental analysis, Random Forest, Decision Tree, and K-Nearest Neighbour achieved an accuracy of 96.8%, 92.7%, and 92.89% respectively. Therefore, by including our preprocessing steps, Random Forest classification gives more accurate results than the other machine learning algorithms.

Index Terms—Random Forest, Decision Tree, K-Nearest Neighbour, Coronary Heart Disease

I. INTRODUCTION

The enormous advancements in the health domain, such as smartwatches and fitness bands, have led to a revolutionary change in the detection of day-to-day activities of an individual, such as heart rate, calories burned, etc. The invention of smart devices such as the Continuous Glucose Monitor (CGM) and the Smart Cholesterol Monitoring System has reduced the chances of occurrence of diseases. These devices keep track of daily activities and assist in decision making for healthcare. Even after these advancements, people are unaware of the risks and symptoms associated with chronic diseases. Therefore, the prediction of such diseases is now a major concern not only for individuals but for mankind. Among all human-related diseases, chronic diseases are the riskiest and major health problems, according to the Lancet Global Burden of Disease Study 2013 [1]. Moreover, in a survey by McKinsey [2], it was found that in several countries like China, chronic diseases are the major cause of death, and in the US, the government spends around 2.7 trillion USD annually on the treatment of chronic diseases.

The continuous technological improvements have helped researchers to develop new methodologies based on Artificial intelligence and Machine learning. The rising health issues have also led to an increase in the generation of big data, and to utilize this Big data it is required to develop an automatic computer-based system that can predict those diseases by deploying machine learning algorithms which work efficiently despite the various challenges that occur in the datasets.

Along with other chronic healthcare problems such as obesity, smoking, diabetes, etc., cardiovascular diseases (CVDs) are also found to be a huge risk factor. Population aging, especially in developed countries, is highly correlated with CVDs, meaning that old-age people are more prone to CVDs [3]. According to the World Health Organisation report in 2014 [4], cardiovascular diseases, including heart attacks, chronic heart failure, cardiac arrest, etc., led to the death of 17.5 million people. Most heart diseases cannot be detected by the primitive Electrocardiography (ECG) process [6], so many preventive sensors or devices have been invented, such as the Phonocardiogram (PCG), Electromyogram (EMG), etc.

In this work, we analyze and estimate the use of different machine learning algorithms in the prediction of coronary heart disease by combining all the attributes in the dataset to develop the classification models. Random Forest, K-Nearest Neighbours, and Decision Tree classification models for coronary heart disease risk prediction are developed. We also apply K-fold Cross-Validation for the algorithms. Results show that these models can efficiently predict the risk of heart disease, to conclude whether an individual is prone to suffer from CHD within a span of 10 years.

The rest of the paper is organized as follows: Section II describes the related work. In Section III, the proposed work is described, including the dataset description, preprocessing, and analytics and modeling. Section IV presents the experimental analysis, covering the performance measurement parameters for the different algorithms and a performance comparison of the algorithms. Finally, the paper is concluded in Section V.

978-1-7281-1895-6/19/$31.00 © 2019 IEEE
II. RELATED WORK

Different researchers practice different ways of involving machine learning and data-mining approaches to solve health-related issues. They have used various approaches to classify and predict chronic diseases.

Data mining techniques have been proved to efficiently predict chronic kidney disease. Weka [5] is a user interface which provides the data processing cycle, such as data preprocessing, classification, and other data mining techniques, to a user. It has a large collection of data on which various algorithms can be practiced. This tool illustrates that Random Forest gives the best result among various algorithms like Naive Bayes, J48, etc.

Heart sounds can also be used for the detection of chronic heart failure [6]. The method involves filtering of audio signals, segmentation for feature extraction, and machine learning. It also described the stacking process of ML algorithms, having three phases: a segment-based ML phase, a recording-based feature extraction phase, and a recording-based ML phase.

Different authors have adopted different approaches for developing classification models. Data can also be captured from ADL (activities of daily living) with wearables [7]. The proposed framework used this data and implemented supervised and unsupervised machine learning algorithms and batch-level processing, including cohort segmentation. Different ML algorithms like KNN, SVM, Random Forest, etc. were implemented on two different datasets and compared based on accuracy and model building time. Maximum accuracy on the NHANES and Framingham datasets was obtained with Random Forest and SVM modeling respectively.

In [8], the NHANES and FHS datasets were used. In the NHANES dataset, feature selection methods led to an improvement in performance based on information-theory ranking, and in the FHS dataset, this was done by grouping based on the subdivision filter variable method. The model also showed the trade-off between accuracy and execution time. Random Forest and KNN gave better results in confusion matrix and classification accuracy, but they were unable to satisfy the boundaries of creation time. Meanwhile, the decision tree showed good results in both aspects.

In [9], a study of the effect of class imbalance in data on the performance of a multilayer perceptron was carried out. Different learning rates were used to evaluate the performance of the multilayer perceptron, and the dataset was analyzed with three sampling algorithms, among which the Resample method provided better accuracy results than the others. Also, a comparative study was done based on accuracy metrics and execution time; the Spread Sub Sample algorithm had the least execution time.

The target of [10] was to see the importance of features on the classification result. Preprocessing and normalization were performed over the dataset. Next, to measure the correlation between features, the correlation matrix was obtained. Further, the classification was carried out in three stages: first, L1-based feature selection; second, AUC (Area Under Curve) based comparison of the performance of five algorithms, namely, Decision Tree, Neural Network, Logistic Regression, Support Vector Machine, and Naive Bayes, over both normalized and original data; third, the performance of the classifiers was compared based on sensitivity, accuracy, specificity, and AUC. Except for the Neural Network, the classifiers showed almost the same results.

In [17], a novel approach is proposed that uses the analytic time-frequency flexible wavelet transform (ATFFWT) and fractal dimension (FD) to detect epileptic seizures. Using ATFFWT, the EEG signals are decomposed into subbands, and then the FD is calculated on each subband. For training the model, these FDs are given as input to a Least Squares SVM. At last, cross validation is performed to deal with model overfitting.

III. PROPOSED WORK

This section illustrates the various resources and approaches that are used in this work. Primarily, the description of the dataset is provided to understand how to work on it, followed by the preprocessing steps involved. Finally, the internal working and understanding of the analytical models used are explained.

A. Dataset Description

We use a dataset which is a subset of the Framingham Heart Study (FHS) dataset, made publicly available through the Framingham Heart Institute [7]. The available section of the FHS dataset used in this paper contains records of 4240 participants. The dataset was generated by a long-term study on the population of Framingham, Massachusetts. The study investigates the causes and origins that lead to cardiovascular heart disease, and it is regarded as one of the foremost public health disease management studies [8]. The Framingham Heart Study focused mainly on retrieving the risk factors that affect the health of a person with respect to developing coronary heart disease. The dataset contains 16 different attributes related to Coronary Heart Disease.

TABLE I
ATTRIBUTES OF THE DATASET AND THEIR INTERPRETATION

Attribute        | Interpretation
gender           | Female: 0; Male: 1
age              | Age at the examination time
education        | 1: high school; 2: high school or GED; 3: college or vocational school; 4: college
currentSmoker    | 0 = nonsmoker; 1 = smoker
diabetes         | 0 = No; 1 = Yes
totChol          | Total cholesterol inside patient's body (mg/dL)
sysBP            | Systolic Blood Pressure (mmHg)
diasBP           | Diastolic Blood Pressure (mmHg)
cigsPerDay       | Number of cigarettes smoked per day (average)
BPMeds           | Whether the person is on BP medicines
prevalentStroke  | Whether the person had any prevalent stroke
prevalentHyp     | Whether the person has prevalent hypertension
BMI              | Body Mass Index: Weight (kg) / Height (m²)
heartRate        | Beats/Min (Ventricular)
glucose          | Amount of glucose in mg/dL
TenYearCHD       | Risk of developing CHD (Yes: 1; No: 0)

368 2019 IEEE Region 10 Conference (TENCON 2019)


Table I provides an interpretation of the different attributes in the dataset. The dataset contains many inconsistent and discrepant values, which can lead to incorrect results. Thus, proper care needs to be taken while treating these values for better performance. Therefore, the dataset is preprocessed before model creation.
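As a hedged illustration of this inspection step, the snippet below builds a tiny synthetic stand-in for the FHS subset (only the attribute names follow Table I; the values and record count are invented) and counts the missing values that preprocessing must repair:

```python
import pandas as pd

# Miniature stand-in for the FHS subset: four synthetic records using
# attribute names from Table I (the real subset has 4240 records).
df = pd.DataFrame({
    "age":        [39, 46, 48, 61],
    "sysBP":      [106.0, 121.0, 127.5, 150.0],
    "glucose":    [77.0, None, 70.0, 103.0],
    "TenYearCHD": [0, 0, 0, 1],
})

# Inconsistent records surface as missing values; counting them per
# attribute shows what the preprocessing stage has to handle.
missing_per_attribute = df.isna().sum()
print(missing_per_attribute["glucose"])  # one missing glucose reading
```

In practice the same `isna().sum()` call would be run on the full 4240-record dataset after loading it.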
B. Preprocessing
Preprocessing is a method to obtain complete, consistent,
interpretable data. The data quality affects the mining results
that are obtained using machine learning algorithms. Quality
data results in a quality decision. Therefore, the FHS dataset
is integrated using the following preprocessing steps.
• Irrelevant features can decrease the performance of the model and reduce the learning rate. Therefore, feature selection plays a major role in preprocessing: the features that contribute the most to predicting the desired result are selected. In the FHS dataset, an automatic feature selection would have eliminated important features as well; therefore, an analytical approach gives better performance.
• The mean is the most probable value that tends to occur in any attribute, and it preserves the extremes of an attribute. Therefore, missing values in the FHS dataset are replaced by the attribute mean, as shown in equation (1):

    Attribute Mean = ( Σ_{i=1}^{l} (attribute value)_i ) / l        (1)

where l is the total number of values in an attribute.
• Class imbalance in a dataset is a major problem in data mining applications. Most machine learning algorithms fail to perform well on a dataset where the classes are imbalanced [9].
• Sampling is an effective method to balance an imbalanced dataset. Sampling is of two types: oversampling and undersampling. Undersampling involves removing instances from the majority class to balance the class distribution. Oversampling involves replicating instances from the minority class to balance the class distribution. Fig. 1 illustrates the resampling mechanism.

Fig. 1. Resampling of data set

• The target class in the dataset predicts the risk of coronary heart disease (CHD). The proportion of individuals who are more likely to suffer from CHD is 15.2% (644 out of 4240 entries), and that of individuals not suffering from CHD is 84.8% (3596 out of 4240 entries). In order to balance this class distribution, we used random oversampling to replicate the instances in the minority class, that is, individuals suffering from CHD.
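The two core steps above, mean imputation per equation (1) and random oversampling of the minority class, can be sketched on toy values as follows (the attribute name follows Table I; the numbers are invented for illustration):

```python
import random
from statistics import mean

random.seed(0)

# Toy stand-in for one FHS attribute with missing values (None) and an
# imbalanced binary target; names follow Table I of the paper.
tot_chol = [200.0, None, 240.0, 180.0, None, 220.0]
ten_year_chd = [0, 0, 0, 0, 1, 1]

# Step 1: replace missing values with the attribute mean (equation (1)).
attr_mean = mean(v for v in tot_chol if v is not None)   # (200+240+180+220)/4 = 210.0
tot_chol = [attr_mean if v is None else v for v in tot_chol]

# Step 2: random oversampling - replicate minority-class rows until the
# two classes are balanced.
rows = list(zip(tot_chol, ten_year_chd))
minority = [r for r in rows if r[1] == 1]
majority = [r for r in rows if r[1] == 0]
while len(minority) < len(majority):
    minority.append(random.choice(minority))
balanced = majority + minority   # 4 negative + 4 positive rows
```

On the real dataset the same logic replicates the 644 CHD-positive records until they match the 3596 negative ones.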
Fig 2 depicts the graphical representation of the steps in
sequential order that are used in the proposed work.
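The standardization and normalization steps named in the abstract and flowchart can be sketched as follows, assuming the usual z-score and min-max formulations (the paper does not spell out which variants were used; the attribute values are illustrative):

```python
from statistics import mean, pstdev

# One toy attribute from the dataset (values invented for illustration).
sys_bp = [110.0, 120.0, 130.0, 140.0]

# Standardization (z-score): zero mean, unit variance.
mu, sigma = mean(sys_bp), pstdev(sys_bp)
standardized = [(v - mu) / sigma for v in sys_bp]

# Normalization (min-max): rescale into the [0, 1] interval.
lo, hi = min(sys_bp), max(sys_bp)
normalized = [(v - lo) / (hi - lo) for v in sys_bp]
```

Either transform puts attributes with different units (mmHg, mg/dL, cigarettes per day) on a comparable scale before classification.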
C. Analytics and Modeling

Fig. 2. Flowchart of proposed work.

This section explains the supervised Machine learning algorithms that are used in this work. It briefs the analytical approach and internal working of Random Forest, Decision Tree, and KNN to create a prediction model. There are also other powerful machine learning algorithms, such as Convolutional Neural Networks (CNN) and Naive Bayes, but these algorithms did not fit well, as they gave lower accuracy and performance measures in our experimental analysis. Therefore, we chose Random Forest, Decision Tree, and KNN over them.

• Random Forest (RF) is a supervised machine learning algorithm. As the name suggests, it is a forest of randomly generated decision trees. It uses the bagging approach, where various learning models are combined to improve the overall result: it produces multiple decision trees and synthesizes them together to obtain a refined result. It uses a random subset of features when splitting a node to obtain the feature that contributes the most to the model, and the result can be improved further by adding random threshold values for each feature. The Random Forest algorithm is also used to score the features: on the basis of how much impurity a feature adds to the model, it decides the relative feature importance. Also, RF is robust to outlier values.

• Decision Tree (DT) is one of the simplest algorithms, yet most effective and useful. It is a tree which comprises three kinds of nodes: chance nodes, decision nodes, and end nodes. A chance node shows the probable outcomes at a particular point, whereas a decision node is a node at which a decision is made based on the outcome. An end node is the last node of the tree, which gives the final result of a path. A Decision Tree starts from a node known as the root node and splits off into various branches; the splitting of the root node is done on the basis of probabilities. Each node extracts some information about the features of the data, and each link represents a decision rule applied at a node. The tree is built on one of two bases: the Gini index or the entropy rule. It is one of the simplest, easiest to understand, and finest predictive models.

• K-Nearest Neighbour (KNN) is also a supervised classification algorithm. It predicts the target class on the basis of how similar a particular data point is to the labeled training data provided to the model. That is, the features of the data point whose target label needs to be predicted are compared with the features of the existing data (except the target class), and its resemblance to a class decides which class it will belong to. KNN compares unclassified data with classified data by calculating the distance between the features of data points (using Euclidean distance, Manhattan distance, etc.). First, the model collects the unclassified data. It then calculates the distance of that data point from the classified data. By doing so, it selects the K smallest distances. Then, it counts the class that appears the most among these K observations. Finally, it classifies the data with the class that appeared the most.

IV. EXPERIMENTAL ANALYSIS

A. Parameters Used

Performance evaluation of the proposed work is done based on the following measures.

Confusion Matrix: a matrix that is used to evaluate the performance of a model. The four terms associated with the confusion matrix, which are used to determine the performance metrics, are:
True Positive (TP): an outcome where the model correctly predicts the positive class.
True Negative (TN): an outcome where the model correctly predicts the negative class.
False Positive (FP): an outcome where the model incorrectly predicts the positive class.
False Negative (FN): an outcome where the model incorrectly predicts the negative class.

Accuracy: the ratio of the number of correct predictions given by the model to the total number of instances.

    Accuracy = (TP + TN) / (TP + FP + FN + TN)        (2)

Precision: in this work, precision measures the proportion of individuals predicted to be at risk of developing CHD who actually were at risk of developing CHD.

    Precision = TP / (TP + FP)        (3)

Recall/Sensitivity: in this work, recall measures the proportion of individuals who were at risk of developing CHD and were predicted by the algorithm to be at risk of developing CHD.

    Recall = TP / (TP + FN)        (4)

Specificity: here, specificity measures the proportion of individuals who were not at risk of developing CHD and were predicted by the algorithm to be not at risk of developing CHD.

    Specificity = TN / (TN + FP)        (5)

F1 Score: the harmonic mean of precision and recall.

    F1 Score = 2(Precision × Recall) / (Precision + Recall)        (6)

ROC (Receiver Operating Characteristic): a probability curve indicating the capability of a model to distinguish



between classes. The ROC curve shows the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR). According to [13], a model with an AUC (Area Under the Curve) closer to 1 is able to differentiate the two classes almost perfectly in the case of binary classification. Therefore, an AUC closer to 1 indicates a better predictive model.

    TPR = TP / (TP + FN)        (7)

    FPR = FP / (FP + TN)        (8)
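Equations (2)-(8) can be checked with a small worked example; the confusion-matrix counts below are invented for illustration and are not the paper's results:

```python
# Hypothetical confusion-matrix counts for a binary CHD classifier.
tp, tn, fp, fn = 90, 80, 10, 20

accuracy    = (tp + tn) / (tp + fp + fn + tn)                # (2) -> 170/200 = 0.85
precision   = tp / (tp + fp)                                 # (3) -> 90/100  = 0.90
recall      = tp / (tp + fn)                                 # (4) -> 90/110  ~ 0.818
specificity = tn / (tn + fp)                                 # (5) -> 80/90   ~ 0.889
f1          = 2 * precision * recall / (precision + recall)  # (6) harmonic mean
tpr         = tp / (tp + fn)                                 # (7) identical to recall
fpr         = fp / (fp + tn)                                 # (8) -> 10/90   ~ 0.111
```

Note that TPR in equation (7) is the same quantity as recall in equation (4); the ROC curve is traced by plotting TPR against FPR at varying decision thresholds.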
B. Results

In the proposed work, 10-fold cross-validation is performed for the machine learning algorithms RF, DT, and KNN that are used for analysis. The different performance measures mentioned in the parameters section are calculated and compared.

The accuracy and AUC of RF, DT, and KNN for the 10 fold iterations are shown in Table II. It can be observed that the average accuracy and AUC of the RF classifier come out to be higher than those of DT and KNN.

TABLE II
10-FOLD RESULTS

Fold No. | RF Accuracy | RF AUC | DT Accuracy | DT AUC | KNN Accuracy | KNN AUC
1        | 96.80%      | 1.00   | 93.33%      | 0.92   | 91.52%       | 0.91
2        | 97.36%      | 0.99   | 91.38%      | 0.92   | 92.08%       | 0.91
3        | 94.86%      | 0.99   | 91.94%      | 0.92   | 93.05%       | 0.91
4        | 96.67%      | 0.99   | 91.94%      | 0.92   | 93.47%       | 0.91
5        | 96.38%      | 0.99   | 92.22%      | 0.91   | 92.91%       | 0.91
6        | 96.52%      | 0.99   | 92.08%      | 0.92   | 92.63%       | 0.91
7        | 97.49%      | 0.99   | 94.15%      | 0.92   | 94.15%       | 0.91
8        | 97.63%      | 0.99   | 92.61%      | 0.91   | 93.03%       | 0.91
9        | 97.21%      | 0.99   | 92.2%       | 0.92   | 93.03%       | 0.91
10       | 97.07%      | 0.99   | 92.61%      | 0.92   | 92.2%        | 0.91

Fig. 3. 10-Fold RF ROC Curve

The ROC curves for the 10 folds of the RF classifier, with a mean AUC of 0.99, are shown in Fig. 3. An AUC closer to 1 depicts a better model [13]. A model is better if it predicts true more often, that is, if the TPR is higher; therefore, a curve passing through the top-left corner indicates a better predictive model, as in the case of RF.

Table III shows the mean accuracy and mean AUC of the 10-fold cross-validation scores of RF, DT, and KNN. It also compares the execution time taken by each of the algorithms for model creation and prediction. The observations show that RF achieves the maximum accuracy, 96.8%, among RF, DT, and KNN, with an AUC of 0.99. The execution time of RF is 1.39 seconds, falling between those of DT and KNN: DT has the least execution time of 0.81 seconds, whereas that of KNN is as high as 1.9 seconds.

TABLE III
PERFORMANCE STATISTICS

Algorithm | Execution Time (Seconds) | Mean Accuracy | Mean AUC
RF        | 1.3969                   | 96.80%        | 0.99
DT        | 0.8138                   | 92.45%        | 0.92
KNN       | 1.9029                   | 92.81%        | 0.91

Fig. 4 compares the ROC curves of RF, DT, and KNN, giving AUCs of 0.99 for RF, 0.92 for DT, and 0.91 for KNN. The average AUC of RF is closest to 1, and hence RF is more suitable for the prediction model than DT and KNN.

Table IV shows the results considering the different performance measures: accuracy, precision, recall, specificity, and F1 score. When recall is taken into consideration, RF gives a recall of 94.4%, which is higher than that of DT and KNN. Thus, RF outperforms the other algorithms in predicting CHD risk among individuals.

TABLE IV
PERFORMANCE STATISTICS

Algorithm | Accuracy | Precision | Recall | Specificity | F1 Score
RF        | 96.71%   | 98.94%    | 94.4%  | 99%         | 96.61%
DT        | 92.1%    | 98.57%    | 85.33% | 98.78%      | 91.47%
KNN       | 91.49%   | 98.42%    | 84.21% | 98.67%      | 90.76%
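The evaluation protocol, 10-fold cross-validation of RF, DT, and KNN, can be sketched with scikit-learn; the synthetic data and hyperparameters below are assumptions (the paper does not state them), so the resulting scores say nothing about the numbers reported in Tables II-IV:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced stand-in for the FHS subset (~85%/15% classes,
# 15 features), mirroring the dataset's shape but not its contents.
X, y = make_classification(n_samples=600, n_features=15,
                           weights=[0.85, 0.15], random_state=0)

models = {
    "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
    "DT":  DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# 10-fold cross-validated mean accuracy for each classifier.
scores = {name: cross_val_score(m, X, y, cv=10, scoring="accuracy").mean()
          for name, m in models.items()}
```

Replacing `scoring="accuracy"` with `"roc_auc"`, `"precision"`, `"recall"`, or `"f1"` yields the other measures compared in Tables II-IV.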



Fig. 4. Comparison ROC Curve

TABLE V
RESULT COMPARISON

Algorithm     | Previous work [7] Time Taken (sec) | Previous work [7] Accuracy (%) | Proposed work Time Taken (sec) | Proposed work Accuracy (%)
Random Forest | 2180                               | 90.1                           | 1.3969                         | 96.80
Decision Tree | 77.4                               | 90                             | 0.8138                         | 92.45
KNN           | 88.8                               | 90.1                           | 1.9029                         | 92.81

Table V highlights the comparison between the work done in [7] and the proposed work based on the time taken by the algorithms and the accuracy. The accuracy is higher and the time taken by the algorithms is far less in the proposed work for all three classification algorithms. Moreover, the accuracies of the algorithms in the previous work are nearly the same, while in the proposed work, Random Forest shows much higher accuracy. Therefore, the Random Forest machine learning algorithm outperforms the results in [7] and [8] on the same dataset.

V. CONCLUSION AND FUTURE WORK

We propose a preprocessing-extensive work in which Random Forest is the most suitable contender for the prediction model, giving the highest performance measures among K-Nearest Neighbour and Decision Tree. The accuracy, precision, recall, specificity, and F1 score of RF in the proposed work are 96.71%, 98.94%, 94.4%, 99%, and 96.61% respectively, with an execution time of 1.3969 seconds. The Decision Tree gives lower performance than Random Forest, though in considerably less time (0.8138 seconds). The execution time of K-Nearest Neighbour is the highest of all; however, its performance measures are quite similar to those of the Decision Tree. Thus, in an environment similar to that of the used dataset, if all the features are preprocessed such that they acquire a normal distribution, Random Forest is a good selection for obtaining a robust prediction model. Such models provide valuable assistance to society in the health care management domain.

Further, as an extension to this work, a more real-time and bigger dataset is required to obtain a better training model. Also, an emphasis on further refining the preprocessing will give more veracious outcomes.

REFERENCES

[1] Huse, Hettiarachchi, E. Gearon, M. Nichols, S. Allender, and A. Peeters, "Obesity In Australia Modi", June 2015.
[2] P. Groves, B. Kayyali, D. Knott, and S. van Kuiken, "The Big Data Revolution in Healthcare: Accelerating Value and Innovation", USA: Center for US Health System Reform Business Technology Office, 2016.
[3] D. Kumar, "Automatic heart sound analysis for cardiovascular disease assessment", Ph.D. dissertation, University of Coimbra, 2014.
[4] S. Mendis et al., "Global status report on noncommunicable diseases 2014", World Health Organization, 2014.
[5] T. Balakrishna, B. Narendra, M. Harika Reddy, and D. Jayasri, "Diagnosis of Chronic Kidney Disease Using Random Forest Classification Technique", Helix, Vol. 7(1), pp. 873-877, 2017.
[6] M. Gjoreski, M. Simjanoska, and A. Gradisek, "Chronic Heart Failure Detection from Heart Sounds Using a Stack of Machine-Learning Classifiers", IEEE 13th International Conference on Intelligent Environments, pp. 14-19, 2017.
[7] N. S. Rajliwall, G. Chetty, and R. Davey, "Chronic disease risk monitoring based on an innovative predictive modelling framework", IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1-8, 2017.
[8] N. S. Rajliwall, R. Davey, and G. Chetty, "Machine learning based models for Cardiovascular risk prediction", IEEE International Conference on Machine Learning and Data Engineering (ICMLDE), 2018.
[9] P. Yildirim, "Chronic Kidney Disease Prediction on Imbalanced Data by Multilayer Perceptron", IEEE 41st Annual Computer Software and Applications Conference, pp. 193-198, 2017.
[10] M. Soltanpour Gharibdousti and K. Azimi, "Prediction of Chronic Kidney Disease Using Data Mining Techniques", Proceedings of the Industrial and Systems Engineering Conference, 2017.
[11] M. Chen, Y. Hao, K. Hwang, L. Wang, and L. Wang, "Disease Prediction by Machine Learning Over Big Data From Healthcare Communities", IEEE Access, Vol. 5, pp. 8869-8879, 2017.
[12] M. A. Jabbar and S. Samreen, "Heart disease prediction system based on Hidden Naive Bayes classifier", IEEE International Conference on Circuits, Controls, Communications and Computing (I4C), pp. 1-5, 2016.
[13] K. Hajian-Tilaki, "Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation", Caspian Journal of Internal Medicine, Vol. 4, pp. 627-635, 2013.
[14] W.H.S.D. Gunarathne, K.D.M. Perera, and K.A.D.C.P. Kahandawaarachchi, "Performance Evaluation on Machine Learning Classification Techniques for Disease Classification and Forecasting through Data Analytics for Chronic Kidney Disease", IEEE 17th International Conference on Bioinformatics and Bioengineering, pp. 291-296, 2017.
[15] S. Hashem, G. Esmat, W. Elakel, and S. Habashy, "Comparison of Machine Learning Approaches for Prediction of Advanced Liver Fibrosis in Chronic Hepatitis C Patients", IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 15, No. 3, pp. 861-868, 2018.
[16] A. J. Aljaaf, D. Al-Jumeily, H. M. Haglan, and M. Alloghani, "Early Prediction of Chronic Kidney Disease Using Machine Learning Supported by Predictive Analytics", IEEE Congress on Evolutionary Computation (CEC), pp. 1-9, 2018.
[17] M. Sharma, R. S. Tan, and U. R. Acharya, "A new method to identify coronary artery disease with ECG signals and time-frequency concentrated antisymmetric biorthogonal wavelet filter bank", Pattern Recognition Letters, Vol. 125, pp. 235-240, 2019.
