Hindawi

Health & Social Care in the Community


Volume 2023, Article ID 1406060, 10 pages
https://doi.org/10.1155/2023/1406060

Research Article
A Novel Study on Machine Learning Algorithm-Based
Cardiovascular Disease Prediction

Arsalan Khan,1 Moiz Qureshi,2 Muhammad Daniyal,3 and Kassim Tawiah 4,5

1 Department of Statistics, Quaid-i-Azam University, Islamabad, Pakistan
2 Department of Statistics, Shaheed Benazir Bhutto University, Shaheed Benazirabad, Nawabshah, Pakistan
3 Department of Statistics, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
4 Department of Mathematics and Statistics, University of Energy and Natural Resources, Sunyani, Ghana
5 Department of Statistics and Actuarial Science, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana

Correspondence should be addressed to Kassim Tawiah; [email protected]

Received 29 September 2022; Revised 19 October 2022; Accepted 27 October 2022; Published 20 February 2023

Academic Editor: Andrea Maugeri

Copyright © 2023 Arsalan Khan et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cardiovascular disease (CVD) is a life-threatening disease that is rising considerably worldwide. Early detection and prediction of CVD as well as other heart diseases might save many lives. This requires careful clinical data analysis. The potential of predictive machine learning algorithms to sharpen doctors' judgement is important to all stakeholders in the health sector, since it can augment the efforts of doctors and create a better environment for patient diagnosis and treatment. We used machine learning (ML) algorithms to support accurate prediction and decision making for CVD patients. Simple random sampling was used to select heart disease patients from the Khyber Teaching Hospital and Lady Reading Hospital, Pakistan. ML methods such as decision tree (DT), random forest (RF), logistic regression (LR), Naïve Bayes (NB), and support vector machine (SVM) were implemented for classification and prediction purposes for CVD patients in Pakistan. We performed exploratory analysis and experimental output analysis for all algorithms. We also estimated the confusion matrix and receiver operating characteristic (ROC) curve for all algorithms. The performance of the ML algorithms was evaluated under several conditions to identify the most suitable algorithm in the class of models. The RF algorithm had the highest accuracy of prediction, sensitivity, and receiver operating characteristic curve of 85.01%, 92.11%, and 87.73%, respectively, for CVD. It also had the least specificity and misclassification errors of 43.48% and 8.70%, respectively, for CVD. These results indicated that the RF algorithm is the most appropriate algorithm for CVD classification and prediction. Our proposed model can be implemented in all settings worldwide in the health sector for disease classification and prediction.

1. Introduction

The heart is a major part of the human or animal body that plays an essential role in the life of mammals. The heart pumps blood throughout the body, thereby supplying oxygen to all parts of the body and controlling the pressure of the blood. The heart performs its function together with the nervous system and the endocrine system. The nervous system helps to control the heart rate, while the endocrine system releases hormones that influence blood pressure by causing the blood vessels to either constrict or relax. However, when the human brain is at rest or under stress, it transmits signals telling the heart to beat more quickly. In stressful situations, our heart beats faster than usual, leading to serious heart problems. Aside from stress, heart problems escalate with excessive drinking of liquor, smoking, and heavy fat intake [1, 2]. The rate of health hazards in humans rises as a function of unhealthy dietary habits, excessive stress, lack of good sleep, and lifestyle changes [2].

Cardiovascular disease (CVD) is one of the most noticeable heart diseases and has affected people of all ages. CVD is caused by excessive intake of alcohol, smoking, high blood pressure, high cholesterol level, poor diet, and family history [3]. Del Paoli et al. [4] showed that high blood pressure, unhealthy arguments, and alcohol are highly correlated with CVD. It has been proven that men are at a higher risk of

CVD compared to women [5]. Age is one of the most significant factors for heart disease [6].

In addition to CVD, coronary disease, myocarditis, congenital heart disease, arrhythmias, cardiomyopathy, congestive heart failure, angina pectoris, and myocardial infarction have been classified as acute heart diseases. Each type of heart disease has its own symptoms. However, it is very difficult to tell these heart diseases apart because they share common high-risk factors like cholesterol level and blood pressure, diabetes, abnormal pulse rate (PR), and many more [7].

The lack of physical fitness due to lifestyle changes may also lead to heart disease for all age groups. A survey reported that seventeen million people in recent years lost their lives due to heart failure [8]. The early detection of heart disease may save a lot of lives provided the patients take their treatments together with their medication seriously and on time [8]. The predicted global number of casualties from CVD in 2015 was 17.7 million, of which 7.4 million were as a result of coronary heart disease and 6.7 million by stroke. According to the World Health Organization (WHO), approximately 54% of deaths from non-communicable diseases in Pakistan are due to cardiovascular problems [9]. Although 17.3 million deaths were caused by heart disease in 2008, studies by the WHO in 2018 estimated deaths due to heart disease to be around 56.9 million globally [10].

Deep learning models like the backpropagation neural network (BNN) are highly effective for predicting diseases [11]. Likewise, classification approaches like decision tree (DT), logistic regression (LR), random forest (RF), Naïve Bayes (NB), and support vector machine (SVM) have been observed to be equally effective in disease prediction [12, 13]. Soni et al. [14] used predictive data mining techniques for the prediction of cardiovascular disease and found the highest accuracy in the DT among a class of predictive machine learning models such as K-nearest neighbour algorithms, neural network classification, and Bayesian classification algorithms [15, 16].

Data mining techniques are essential in effective healthcare delivery as they can assist in determining whether a patient has a disease or not in healthcare centres (hospitals or clinics). Additionally, they can be employed to rapidly and automatically diagnose people with diseases with great satisfaction [17]. The prediction approach of these techniques may enable all participants to make rational decisions, especially professionals who must make decisions about how to treat patients [18].

Hybrid machine learning models have been applied to predict heart diseases as well as perform optimum classification methods for prediction. Hybrid models give a better optimum output depending on the machine learning method implemented for the execution [8]. Similarly, random forest, decision trees, and hybrid algorithms have been used to predict diseases with high accuracy. The hybrid algorithms were found to have a high accuracy in the neighbourhood of 88.7% for the prediction of disease compared to other models [8].

Nyaga et al. [19], by summarizing available information on aetiology, rates, treatment, covariates, and mortality prevalence arising from heart failure in sub-Saharan Africa, created CVD models. Prasad et al. [20] implemented procedures geared towards predicting heart problems by recapitulating recent studies that utilized artificial intelligence procedures. Wu et al. [21] initiated new CVD forecasting structures by incorporating several procedures in a single hybridized phoned protocol. Their result validated accuracy in diagnosing by implementing a mixture of styles emanating from all methods.

In recent medical fields, a lot of information on diseases is generated through numerous sources. These available data need to be cleaned as fast as possible with different preprocessing techniques to obtain the information required to fast-track the diagnosis of diseases. This study seeks to develop and propose new methodologies by the utilization of machine learning algorithms to increase the accuracy of the detection of CVD. We investigated and predicted CVD based on hybrid machine learning methods. We used hybrid machine learning models to predict CVD and perform optimum classification methods for the predictions. Our models and approach can be applied in all hospital settings across the world for effective prediction and diagnosis of CVD and other heart diseases. We are hopeful that our suggested technique will be utilized for the detection and prediction of other diseases in general.

We discuss the materials and methods applied in the following section, followed by the results and discussion. The paper ends with the conclusions of the study.

2. Materials and Methods

2.1. Data. The data were collected from the two largest teaching hospitals, the Lady Reading Hospital (LRM) and the Khyber Teaching Hospital (KTH), in Khyber Pakhtunkhwa (KPK), one of the four provinces of Pakistan. Ethical approval for the inclusion of heart disease patients was sought from the Human Ethical Committees of the two teaching hospitals. The ethics approval certificate number for the Lady Reading Hospital is B371/12/07/2022, while that of the Khyber Teaching Hospital is A418/12/07/2022. A simple random sampling technique was employed in the collection of sample units included in the survey. The sample data consisted of a total of 518 randomly selected heart disease patients.

2.1.1. Variables in the Study. The CVD data included the individual output with corresponding factors. The all-inclusive dataset contained the following attributes: age, gender, height, weight, systolic, diastolic, cholesterol, glucose, smoke, alcohol intake, physical activity, cardiovascular disease, and body mass index (BMI). The response variable, CVD, was classified into two categories “presence” and “absence.” Furthermore, the data were cleaned of noise, inconsistencies, and any missing observations. We found a few missing observations in the data because some of the patients were discharged from the ward without any proper residential address or mobile/telephone numbers to trace them. As a result, it was very difficult to contact them. Since

our analysis is based on complete data, we replaced the missing data by implementing the usual statistical method such as using median/mode for the categorical data to replace the missing values with the corresponding value. Thus, the data cleaning was completed using the corresponding statistical tools for the preprocessing stage.

Different data mining techniques were utilized in association, classification, clustering, pattern evaluation, and prediction. In the methods section below, we have discussed the techniques extensively.

2.2. Methods

2.2.1. Classification. Classification is the process of categorizing a given set of data into classes. Classification can be performed for both structured and unstructured data. Predicting the class of the provided data points is the first step in the procedure [22]. Common names for the classes include target, label, and categories. Different statistical and mathematical procedures such as linear programming, decision trees, and neural networks involve classification [23]. CVD detection can be treated as a classification problem because it has two categories, that is, one has CVD or not [24].

2.2.2. Decision Tree (DT) Algorithm. The decision tree (DT) is one of the most important predictive modelling and classification methods in learning algorithms and is widely used in practical supervised learning [25, 26]. It utilizes algorithms that can detect different ways of splitting datasets based on numerous conditions. In the classification tree, the response variable takes a discrete set of values [26]. DT is a useful contemporary approach to solving decision-making challenges by building models that can be used for prediction through systematic analysis. Internal nodes of a DT indicate a test of the features, branches represent the result, and leaves reflect the decisions that are produced after further computation [27, 28]. We performed our DT as follows:

(I) Divide the dataset into two subdata, that is, training and testing datasets.
(II) In the initial stage, the entire training data are considered the root.
(III) Continuous values are discretized before the model building, whereas categorical values are preferable for feature values.
(IV) Establish subsets such that each subset includes data with the aforementioned feature attributes.
(V) Finally, steps I–IV are repeated for each subset until we get the tree leaves.

In the DT, the prediction of a class label for a record begins at the root. The value of the root feature is compared with the corresponding attribute of the record. Based on this comparison, the branch matching that value is followed to the next node [29–31].

2.2.3. Random Forest (RF) Algorithm. A random forest (RF) is a classifier consisting of a collection of tree-structured classifiers $\{h(x; \Theta_k);\ k = 1, 2, \ldots\}$, where the $\Theta_k$ are independent and identically distributed random vectors and each tree casts a unit vote for the most popular class at the input x [32–35].

The RF is an ensemble learning approach for regression or classification that builds a large number of decision trees at training time. For regression, the average prediction of the individual trees is returned, while for classification, the RF output is the class predicted by the most trees. The RF algorithm developed by Ho [36] used a stochastic subspace approach and was reintroduced as a technique for the implementation of a collection of tree predictors by Breiman [37]. RF implements bootstrapping to randomly select training and testing datasets from the original data. After selecting the training dataset, the remaining dataset, called out of bag (OOB), is used to estimate the goodness of fit [37].

In the growing phase of the RF, classification and regression tree techniques are used to grow each tree by splitting the local training set at each node according to the value of one variable from a randomly selected subset of the variables. The growth of the tree continues to the largest extent possible since it does not consider pruning. The phases of bootstrapping and growing of the tree require independent random input quantities. We assumed that these inputs are independent and identically distributed among trees. In that manner, each tree can be viewed as independently sampled for a given training data [37, 38].

For prediction purposes, each tree as well as its terminal nodes are assigned to a class in the forest. Predictions by the trees are performed through voting processes in such a way that the forest returns a class with the maximum number of votes by random selection [39].

2.2.4. Logistic Regression (LR) Algorithm. The logistic regression (LR) model is well suited to a dichotomous categorical response variable [40]. In machine learning (ML), the LR model can be used for classification purposes [40, 41]. We used the LR model to classify the respondents affected by cardiovascular disease. It is based on the idea of likelihood and assigns observations to a discrete class [42]. The logistic function is utilized for output transformation. The LR hypothesis restricts the output to a range between 0 and 1. Consequently, linear functions cannot be used here because they can take values greater than 1 or less than 0. We classified and predicted the CVD patients in the machine learning LR [43] using the function

$$f(x) = \begin{cases} 1, & \text{CVD present}, \\ 0, & \text{CVD absent}. \end{cases} \qquad (1)$$

2.2.5. Naïve Bayes (NB) Algorithm. The Naïve Bayes (NB) method is a supervised learning approach that is based on the Bayes theorem. The NB machine learning method

applies probabilistic techniques in solving classification problems [44]. The main assumption of the NB is the independence (freedom from multicollinearity) of the predictors fitted in the probabilistic models [45]. A class of classification algorithms predicated on the Bayes theorem is referred to as Naïve Bayes classifiers. It is characterized as a collection of algorithms whereby each algorithm follows the same guiding principle that every pair of features being classified is independent of each other [46]. In our case, we used the NB classifier to partition the response variable, CVD, into those who have CVD and those who do not, for all patients with heart disease [44, 47].

2.2.6. Support Vector Machine (SVM) Algorithm. Among the different classification techniques, the support vector machine (SVM) is well known for its discriminative power for classification. The SVM has been widely considered in recent times due to its efficiency in many different pattern classification tasks [48]. It has numerous applications ranging from bioinformatics to involuntary language recognition as well as handwritten typescript recognition with sufficient accomplishment. Kim et al. [49] showed that the SVM displays exceptional performance in the classification for prognostic prediction of class III malocclusion. Based on [50], we discuss a brief mathematical theory of the SVM below.

Assuming binary classification of our response variable, CVD, and linear separability of the training samples, we have

$$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}, \qquad (2)$$

where $x_i \in \mathbb{R}^d$, such that the design matrix X belongs to the d-dimensional space, and the response variable, CVD, is represented by $y_i$, which has a binary class in the vector Y with $y_i \in \{0, 1\}$ in our study. The appropriate discriminating equation is given by

$$f(x) = \operatorname{sgn}\{\langle z, x \rangle + \beta\}. \qquad (3)$$

Here, z represents the vector that determines the orientation of the hyperplane (discriminating plane), and β is the offset [48, 51, 52]. Infinitely many hyperplanes can correctly classify the training data, and any of them can be applied to the validation dataset. The optimal classifier identifies the hyperplane that generalizes best, lying as far as possible from each cluster of objects [53]. The input set of coordinates is considered optimally separated by the hyperplane if it is separated without error and the distance between the hyperplane and the nearest points (the support vectors) is maximal, which leads to the identification of a specific hyperplane [53, 54].

We used R version 4.1.2 for all our analyses.

3. Results and Discussion

The descriptive analysis of the attributes at the aggregate and age levels of the responses of all randomly selected patients with heart disease in the study is represented in Table 1. The table illustrates the numerical output of the cardiovascular disease-associated risk factors. Table 1 indicates the variability in the age proportion of the CVD-affected patients. The exploratory analysis revealed that almost 52.1% of the respondents had CVD at an aggregate level. Furthermore, there was a noticeable variation in the proportion of heart disease concerning different factors such as gender, physical activity, smoking, and so on that correlated with CVD. For instance, a maximum of 4.25% of 60-year-old patients were estimated to have CVD, whereas a maximum of 0.19% of 45-year-old patients had it.

Figure 1 shows the gender, cholesterol level, and glucose levels for all randomly selected CVD patients in the study. The figure shows that a greater proportion of the patients had CVD. Figure 2 presents a line graph for the proportion of gender with respect to the age of patients. The figure shows that CVD is predominant in males compared to females since a greater proportion of the males had the disease. Moreover, the proportion of CVD patients increases from forty years to sixty-one years, which confirms the result of Gulfam Ahmad and Jasim Shah [6].

To achieve our goal, we employed the binary classifier based on a supervised machine learning algorithm for classification to predict the association for the appropriate class of patients [55–57] as proposed by Ramesh et al. [58] and Boukhatem [42]. Table 2 indicates the output of the predictive models that were used for the prediction of CVD. All five ML algorithms (i.e., DT, SVM, NB, LR, and RF) were used to build the CVD prediction model in two different stages. In the initial stage, the data were split into two separate 70% and 30% groups for training and validation, respectively. In the second stage, however, the data were split into 75% and 25% for training and validation, respectively. The RF model had the highest accuracy of 85.01% with a 95% confidence interval of (0.6608, 0.8043), followed by DT with 83.72% accuracy with a 95% confidence interval of (0.654, 0.7986). The SVM and LR algorithms had the same accuracy of 83.08%, each with a 95% confidence interval of (0.654, 0.7986). The NB had the least accuracy of 74.74% with a 95% confidence interval of (0.567, 0.7221). This shows that the RF algorithm is the best predictor of CVD patients. Our outcome confirms the results obtained by the authors in [6, 55–58].

Sensitivity, mathematically defined as the ratio of the total number of true-positive patients to the sum of the number of true-positive and false-negative patients, was used to find the proportion of patients suffering from CVD who were correctly identified [59, 60]. Similarly, the specificity is described according to respondents that are not affected by cardiovascular disease. Specificity, mathematically defined as the ratio of the total number of true negatives to the sum of the number of true negatives and false-positive patients [61], was also used to determine the proportion of patients not suffering from CVD who were correctly identified [62]. The RF algorithm estimated sensitivity and specificity as 86.11% and 65.48%, respectively. That is, our algorithm correctly classified 86.11% of the patients with CVD as having CVD but failed to identify 13.89% of them. Similarly, the test correctly classified 65.48% of patients as not having CVD while 34.52% of them were misclassified. Although the DT was not the best in terms of accuracy of prediction, it had the highest sensitivity

Table 1: Descriptive analysis of both response and predictive variables at aggregate and age levels of CVD patients.

Age | CVD patient | Gender | Height | Weight | Systolic | Diastolic | Cholesterol | Glucose | Smoke | Alcohol | Exercise | BMI
Aggregate | 0.521 | 0.639 | 164.11 | 74.33 | 128.05 | 91.803 | Extremely high | Extremely high | 904 | 0.959 | 0.786 | 27.685
40 | 0.0097 | 0.0135 | 168 | 69.1 | 120 | 79.9 | Normal | Extremely high | 0.0212 | 0.0232 | 0.0251 | 24.4
41 | 0.0039 | 0.0116 | 169 | 74.8 | 122 | 80 | High | High | 0.0212 | 0.0232 | 0.0154 | 26.1
42 | 0.0077 | 0.0174 | 166 | 78.5 | 128 | 156 | Normal | Extremely high | 0.0232 | 0.0232 | 0.0174 | 28.8
43 | 0.0039 | 0.0000 | 172 | 61.5 | 125 | 80 | Normal | Normal | 0.0039 | 0.0039 | 0.0019 | 20.8
44 | 0.0058 | 0.0135 | 168 | 74.5 | 120 | 78.5 | High | Normal | 0.0212 | 0.0212 | 0.0212 | 26.4
45 | 0.0019 | 0.0039 | 164 | 68.9 | 112 | 71.2 | Extremely high | Extremely high | 0.0135 | 0.0135 | 0.0135 | 25.7
46 | 0.0116 | 0.0290 | 167 | 85.7 | 122 | 73.3 | High | High | 0.0405 | 0.0425 | 0.0347 | 30.1
47 | 0.0058 | 0.0135 | 163 | 69.6 | 122 | 75 | Normal | Normal | 0.0193 | 0.0232 | 0.0193 | 26.2
48 | 0.0232 | 0.0290 | 164 | 76 | 127 | 130 | High | High | 0.0347 | 0.0386 | 0.0270 | 28.3
49 | 0.0058 | 0.0097 | 160 | 78 | 126 | 81.2 | Normal | Normal | 0.0135 | 0.0154 | 0.0116 | 30.5
50 | 0.0251 | 0.0444 | 163 | 74.7 | 126 | 109 | High | Extremely high | 0.0502 | 0.0502 | 0.0463 | 28.3
51 | 0.0270 | 0.0347 | 165 | 74.6 | 131 | 84.4 | High | Normal | 0.0444 | 0.0463 | 0.0347 | 27.5
52 | 0.0328 | 0.0328 | 164 | 75.7 | 132 | 121 | High | High | 0.0444 | 0.0405 | 0.0367 | 28.1
53 | 0.0116 | 0.0328 | 164 | 72.3 | 124 | 81.3 | Normal | Extremely high | 0.0425 | 0.0502 | 0.0386 | 26.8
54 | 0.0290 | 0.0386 | 161 | 73.2 | 129 | 82.5 | Extremely high | Normal | 0.0483 | 0.0541 | 0.0483 | 28.3
55 | 0.0232 | 0.0425 | 161 | 73.6 | 128 | 80 | High | High | 0.0521 | 0.0521 | 0.0386 | 28.2
56 | 0.0309 | 0.0309 | 165 | 73 | 133 | 81.9 | High | Extremely high | 0.0444 | 0.0463 | 0.0386 | 26.7
57 | 0.0290 | 0.0309 | 164 | 69.4 | 122 | 80.2 | Extremely high | High | 0.0425 | 0.0483 | 0.0347 | 25.9
58 | 0.0270 | 0.0193 | 164 | 72.9 | 131 | 81.7 | High | High | 0.0386 | 0.0463 | 0.0309 | 27.1
59 | 0.0367 | 0.0367 | 163 | 74.9 | 134 | 82.9 | Normal | Normal | 0.0483 | 0.0502 | 0.0405 | 28
60 | 0.0425 | 0.0367 | 160 | 71 | 132 | 82 | Extremely high | High | 0.0637 | 0.0656 | 0.0560 | 29
61 | 0.0425 | 0.0367 | 165 | 75.1 | 133 | 117 | Normal | Normal | 0.0483 | 0.0521 | 0.0405 | 27.8
62 | 0.0232 | 0.0232 | 167 | 79.5 | 128 | 81.3 | High | High | 0.0386 | 0.0425 | 0.0367 | 28.4
63 | 0.0135 | 0.0212 | 165 | 80.3 | 129 | 82.1 | Extremely high | High | 0.0251 | 0.0270 | 0.0193 | 29.6
64 | 0.0251 | 0.0193 | 163 | 75.1 | 135 | 85.6 | High | Extremely high | 0.0328 | 0.0328 | 0.0309 | 28.5
65 | 0.0232 | 0.0174 | 165 | 70 | 131 | 139 | Extremely high | Extremely high | 0.0270 | 0.0270 | 0.0270 | 25.7

CVD patient: proportion of affected CVD patients. Gender: proportion of male patients. Height: mean of height predictor. Weight: mean of weight predictor. Systolic: mean of systolic predictor. Diastolic: mean of diastolic predictor. Cholesterol: median value of cholesterol level. Smoke: proportion of patients who smoke. Alcohol: proportion of patients with alcohol intake. Exercise: proportion of patients with physical activity.
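
The replacement of missing observations described in Section 2.1.1 can be illustrated with a short Python sketch (the study itself used R, and its code is not given here). The sketch assumes that the median fills numeric fields and the mode fills categorical fields; the records, the field names, and the `impute` helper are hypothetical and are not taken from the study data.

```python
from statistics import median, mode

# Hypothetical patient records; None marks a missing observation.
records = [
    {"age": 52, "systolic": 130, "cholesterol": "High"},
    {"age": 60, "systolic": None, "cholesterol": "Normal"},
    {"age": 47, "systolic": 120, "cholesterol": None},
    {"age": 58, "systolic": 140, "cholesterol": "High"},
]


def impute(rows, numeric_fields, categorical_fields):
    """Return copies of the rows with missing values filled in.

    Assumption: median for numeric fields, mode for categorical fields.
    """
    fills = {}
    for field in numeric_fields:
        observed = [r[field] for r in rows if r[field] is not None]
        fills[field] = median(observed)
    for field in categorical_fields:
        observed = [r[field] for r in rows if r[field] is not None]
        fills[field] = mode(observed)
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in r.items()}
        for r in rows
    ]


completed = impute(records, ["systolic"], ["cholesterol"])
for row in completed:
    print(row)
```

The original records are left unchanged, so the completed data can be compared with the raw data before any model is fitted.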

Figure 1: Bar graph with error bars for patient CVD status with gender, cholesterol level, and glucose level.

(90.28%). Our results confirm those of Boukhatem et al. [63]. Figure 3 shows the visualization of all ML algorithm outputs, thereby confirming the superiority of the RF.

Table 3 represents the confusion matrix of the predictive model for 25% of our validation data. The confusion matrix is used to evaluate the performance of the

[Line graph: x-axis, age of affected CVD patients (40 to 65); y-axis, gender variable (proportion); series: Male, Female, Linear (Male).]
Figure 2: Line chart with error bars for the proportion of gender with respect to the age of patients.
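
The two training/validation splits described in the Results (70%/30% and 75%/25%) can be sketched in Python as below. This is only an illustration under stated assumptions: `split_indices` is a hypothetical helper, the seed is arbitrary, and truncating the training size to a whole number is a choice made here, not something the source specifies. Only the sample size of 518 comes from Section 2.1.

```python
import random


def split_indices(n, train_fraction, seed=0):
    """Shuffle the indices 0..n-1 and cut them into training and validation lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    cut = int(n * train_fraction)         # assumption: truncate to a whole patient
    return indices[:cut], indices[cut:]


n_patients = 518  # sample size reported in Section 2.1
for fraction in (0.70, 0.75):
    train, valid = split_indices(n_patients, fraction)
    print(f"{fraction:.0%} training: {len(train)} train, {len(valid)} validation")
```

Each patient index lands in exactly one of the two lists, so no validation record is seen during training.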

Table 2: An experimental output of the predictive models for CVD patients.

Output | DT | SVM | NB | LR | RF
Accuracy | 0.8372 | 0.8308 | 0.7474 | 0.8308 | 0.8501
95% confidence interval | (0.6608, 0.8043) | (0.654, 0.7986) | (0.567, 0.7221) | (0.654, 0.7986) | (0.6745, 0.8158)
Sensitivity | 0.9028 | 0.8472 | 0.8889 | 0.8333 | 0.8611
Specificity | 0.5952 | 0.631 | 0.4405 | 0.6429 | 0.6548
+Predicted value | 0.6566 | 0.663 | 0.5766 | 0.6667 | 0.6813
−Predicted value | 0.8772 | 0.8281 | 0.8222 | 0.8182 | 0.8862
Prevalence | 0.4615 | 0.4615 | 0.4615 | 0.4615 | 0.4615
Detection rate | 0.4167 | 0.391 | 0.4103 | 0.3846 | 0.3974
Detection prevalence | 0.6346 | 0.5897 | 0.7115 | 0.5769 | 0.5833
Balanced accuracy | 0.849 | 0.8391 | 0.7647 | 0.8381 | 0.8579
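
Several rows of Table 2 follow directly from the counts in a confusion matrix. The Python sketch below works this out for the decision tree using the counts listed in Table 3. It assumes that class 1 denotes CVD present, that the columns of Table 3 are the actual classes, and that the rows are the predicted classes; the source does not state this layout, and the function name `table2_style_metrics` is hypothetical.

```python
def table2_style_metrics(tp, fp, fn, tn):
    """Compute confusion-matrix summaries from the four cell counts."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),           # true positives / actual positives
        "specificity": tn / (tn + fp),           # true negatives / actual negatives
        "pos_pred_value": tp / (tp + fp),        # true positives / predicted positives
        "neg_pred_value": tn / (tn + fn),        # true negatives / predicted negatives
        "prevalence": (tp + fn) / total,         # actual positives / all patients
        "detection_rate": tp / total,            # true positives / all patients
        "detection_prevalence": (tp + fp) / total,  # predicted positives / all patients
    }


# Decision tree counts from Table 3, read under the layout assumed above.
dt = table2_style_metrics(tp=65, fp=34, fn=7, tn=50)
for name, value in dt.items():
    print(f"{name}: {value:.4f}")
```

Under this reading, the values agree with the DT column of Table 2, and the misclassification errors listed for the decision tree in Table 3 are the complements of the two predicted values (1 − 0.6566 = 0.3434 and 1 − 0.8772 = 0.1228).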

[Chart titled “Visualization of machine learning algorithm output”: accuracy, sensitivity, specificity, pos pred value, neg pred value, prevalence, detection rate, detection prevalence, and balanced accuracy for DT, LR, SVM, RF, and NB.]
Figure 3: Visualization of the ML algorithm output.

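Table 3: Confusion matrix for predictive models.

Model | Prediction | 1 | 2 | Misclassification error
Decision tree | 1 | 65 | 34 | 0.3434
Decision tree | 2 | 7 | 50 | 0.1228
Support vector machine | 1 | 61 | 31 | 0.3370
Support vector machine | 2 | 11 | 53 | 0.1719
Naïve Bayes model | 1 | 64 | 47 | 0.4234
Naïve Bayes model | 2 | 8 | 37 | 0.1778
Logistic regression model | 1 | 60 | 30 | 0.3333
Logistic regression model | 2 | 12 | 54 | 0.1818
Random forest model | 1 | 61 | 26 | 0.2989
Random forest model | 2 | 6 | 63 | 0.0870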
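
A confusion matrix such as those in Table 3 reflects a single decision threshold, whereas the ROC curve discussed in this section sweeps the threshold over all predicted scores. The Python sketch below shows one way the curve and the area under it could be computed. The labels and scores are made-up illustrative values, not study data, and the helper names `roc_points` and `auc` are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false-positive rate, true-positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each step admits more predicted positives.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = CVD present) and predicted scores for eight patients.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

curve = roc_points(labels, scores)
print(round(auc(curve), 4))
```

The true-positive rate is the sensitivity and the false-positive rate equals one minus the specificity, so plotting sensitivity against specificity on a reversed axis gives the same curve.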

classification algorithm by associating the actual target values for the response variable, CVD patients, with a predicted output of the response by the machine learning model. Just as expected, the RF had the best performance for all evaluation metrics for the confusion matrix. The confusion matrix essentially provides the misclassification error rates for all our ML algorithms. The misclassification error rates for the respondents who are affected were 0.087, 0.1228, 0.1719, 0.1778, and 0.1818, for the RF, DT, SVM, NB, and LR, respectively, in decreasing order of performance. Thus, the RF performed the best among all competing algorithms, while the LR had the poorest performance among them. Our results are similar to those obtained by O’Kelly et al. [64].

Furthermore, the receiver operating characteristic (ROC) curve was used for the visualization of the accuracy. The ROC curve evaluates the performance of classification algorithms by plotting the true-positive rate against the corresponding false-positive rate, thereby measuring and highlighting the specificity and sensitivity of the classifiers. Figure 4 shows the ROC for the different classifiers.

[Plot titled “ROC For different ML Classifier”: x-axis, specificity (1.0 to 0.0); y-axis, sensitivity (0.0 to 1.0); curves for LR, NB, RF, and SVM.]
Figure 4: Receiver operating characteristic (ROC) curve for different ML classifiers.

The ROC also indicates that the RF algorithm’s performance is the best among all classes of ML algorithms. The ROC value (the area under the curve) ranges from 0 to 1, where a value near 0 means the classifier performs poorly, whereas a value near 1 signifies a more capable classifier. The ROC value is 0.8737 for the RF algorithm, which signifies good prediction and classification. The highest ROC for the RF algorithm implies a better ability to discriminate the classes, while the highest accuracy signifies the well-performing ability of the algorithm and the sense of prediction just as in [15, 42, 56].

4. Conclusion

Heart diseases are considered a significant concern in medical data analysis. The potential of predictive machine learning algorithms to sharpen doctors' judgement is important to all stakeholders in the health sector, since it can augment the efforts of doctors and create a better environment for patient diagnosis and treatment. This study investigated the performance of predictive ML algorithms for CVD patients. CVD is one of the leading causes of mortality worldwide. We used data from the Lady Reading Hospital and the Khyber Teaching Hospital in Khyber Pakhtunkhwa Province, Pakistan. Ethical approval for the inclusion of heart disease patients was sought from the Human Ethical Committees of the two teaching hospitals. Five machine learning algorithms (i.e., DT, RF, LR, NB, and SVM) were implemented for the classification and prediction of CVD. We performed exploratory analysis and experimental output analysis for all algorithms. We also estimated the confusion matrix and receiver operating characteristic curve for all algorithms. The performance of the ML algorithms was evaluated under several conditions to identify the most suitable algorithm in the class of models. The RF algorithm had the highest accuracy of prediction, sensitivity, and receiver operating characteristic curve of 85.01%, 92.11%, and 87.73%, respectively, for CVD. It also had the least specificity and misclassification errors of 43.48% and 8.70%, respectively, for CVD. These results indicated that the RF algorithm is the most appropriate for

CVD classification and prediction. Our proposed model can be implemented in all settings worldwide in the health sector for disease classification and prediction. It can also be implemented in other sectors with a similar function. The main limitation of the study is that building more powerful prediction models may require detailed patient data and clinical datasets from across the globe. For improving the accuracy of the ML models and algorithms, high-dimensional data would be more suitable. The ML algorithms used are limited to heart disease prediction studies. Future studies should explore other ML techniques for selecting significant characteristics.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.

Authors’ Contributions
Arsalan Khan, Moiz Qureshi, Muhammad Daniyal, and Kassim Tawiah were responsible for conceptualization, methodology, validation, and visualization. Arsalan Khan, Moiz Qureshi, and Muhammad Daniyal were responsible for data curation, formal analysis, and original draft preparation. Kassim Tawiah and Muhammad Daniyal were responsible for review and editing.

Acknowledgments
We are grateful to the authorities of the Lady Reading Hospital and the Khyber Teaching Hospital in Khyber Pakhtunkhwa (KPK) Province, Pakistan, for the opportunity to conduct the study and for providing us with the ethical approval certificate and waiving the consent. We appreciate all participants for taking time to contribute to this study.

References
[1] M. G. Tektonidou, “Cardiovascular disease risk in antiphospholipid syndrome: thrombo-inflammation and atherothrombosis,” Journal of Autoimmunity, vol. 128, Article ID 102813, 2022.
[2] World Health Organization, The World Health Report: 2000: Health Systems: Improving Performance, World Health Organization, Geneva, Switzerland, 2000.
[3] M. A. Said, Y. J. van de Vegte, M. M. Zafar et al., “Contributions of interactions between lifestyle and genetics on coronary artery disease risk,” Current Cardiology Reports, vol. 21, no. 9, pp. 1–8, 2019.
[4] M. De Paoli, D. W. Wood, M. K. Bohn et al., “Investigating the protective effects of estrogen on β-cell health and the progression of hyperglycemia-induced atherosclerosis,” American Journal of Physiology-Endocrinology and Metabolism, vol. 323, no. 3, pp. E254–E266, 2022.
[5] S. Jóźwik, A. Wrzeciono, B. Cieślik, P. Kiper, J. Szczepańska-Gieracha, and R. Gajda, “The use of virtual therapy in cardiac rehabilitation of male patients with coronary heart disease: a randomized pilot study,” Healthcare, vol. 10, no. 4, p. 745, 2022.
[6] H. Gulfam Ahmad and M. Jasim Shah, “Prediction of cardiovascular diseases (cvds) using machine learning techniques in health care centers,” Azerbaijan Journal of High Performance Computing, vol. 4, no. 2, pp. 267–279, 2021.
[7] C. D. Patnode, N. Redmond, M. O. Iacocca, and M. Henninger, “Behavioral counseling interventions to promote a healthy diet and physical activity for cardiovascular disease prevention in adults without known cardiovascular disease risk factors: updated evidence report and systematic review for the us preventive services task force,” JAMA, vol. 328, no. 4, pp. 375–388, 2022.
[8] M. Kavitha, G. Gnaneswar, R. Dinesh, Y. R. Sai, and R. S. Suraj, “Heart disease prediction using hybrid machine learning model,” in Proceedings of the 2021 6th International Conference on Inventive Computation Technologies (ICICT), pp. 1329–1333, Coimbatore, India, January 2021.
[9] F. M. Zahid, S. Ramzan, S. Faisal, and I. Hussain, “Gender based survival prediction models for heart failure patients: a case study in Pakistan,” PLoS One, vol. 14, no. 2, Article ID e0210602, 2019.
[10] K. Hill, “Review of the world health report 2000: health systems: improving performance, by World Health Organization,” Population and Development Review, vol. 27, no. 2, pp. 373–376, 2001.
[11] N. Al-Milli, “Backpropogation neural network for prediction of heart disease,” Journal of Theoretical and Applied Information Technology, vol. 56, no. 1, pp. 131–135, 2013.
[12] S. Bashir, Z. S. Khan, F. H. Khan, A. Anjum, and K. Bashir, “Improving heart disease prediction using feature selection approaches,” in Proceedings of the 2019 16th International Bhurban Conference on Applied Sciences and Technology (IBCAST), pp. 619–623, Islamabad, Pakistan, January 2019.
[13] A. Aleem, G. Prateek, and N. Kumar, “Improving heart disease prediction using feature selection through genetic algorithm,” in Advanced Network Technologies and Intelligent Computing ANTIC, I. Woungang, S. K. Dhurandher, K. K. Pattanaik, A. Verma, and P. Verma, Eds., Springer, Berlin, Germany, pp. 765–776, 2021.
[14] J. Soni, U. Ansari, D. Sharma, and S. Soni, “Predictive data mining for medical diagnosis: an overview of heart disease prediction,” International Journal of Computer Applications, vol. 17, no. 8, pp. 43–48, 2011.
[15] W. M. Jinjri, P. Keikhosrokiani, and N. L. Abdullah, “Machine learning algorithms for the classification of cardiovascular disease-a comparative study,” in Proceedings of the 2021 International Conference on Information Technology (ICIT), pp. 132–138, Amman, Jordan, July 2021.
[16] M. N. Uddin and R. K. Halder, “An ensemble method based multilayer dynamic system to predict cardiovascular disease using machine learning approach,” Informatics in Medicine Unlocked, vol. 24, Article ID 100584, 2021.
[17] M. Kumar, S. Shambhu, and A. Sharma, “Classification of heart diseases patients using data mining techniques,” International Journal of Research in Electronics and Computer Engineering, vol. 6, no. 3, pp. 1495–1499, 2018.
[18] K. Sudhakar and D. M. Manimekalai, “Study of heart disease prediction using data mining,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, no. 1, pp. 1157–1160, 2014.

[19] U. F. Nyaga, J. J. Bigna, V. N. Agbor, M. Essouma, N. A. Ntusi, and J. J. Noubiap, “Data on the epidemiology of heart failure in Sub-Saharan Africa,” Data in Brief, vol. 17, pp. 1218–1239, 2018.
[20] R. Prasad, P. Anjali, S. Adil, and N. Deepa, “Heart disease prediction using logistic regression algorithm using machine learning,” International Journal of Engineering and Advanced Technology, vol. 8, no. 3S, pp. 659–662, 2019.
[21] C. M. Wu, M. Badshah, and V. Bhagwat, “Heart disease prediction using data mining techniques,” in Proceedings of the 2019 2nd International Conference on Data Science and Information Technology (DSIT 2019), pp. 7–11, New York, NY, USA, July 2019.
[22] A. D. Gordon, Classification, Chapman and Hall/CRC, London, UK, 2nd edition, 1999.
[23] E. M. De Villiers, C. Fauquet, T. R. Broker, H. U. Bernard, and H. Zur Hausen, “Classification of papillomaviruses,” Virology, vol. 324, no. 1, pp. 17–27, 2004.
[24] X. Liu, X. Wang, Q. Su et al., “A hybrid classification system for heart disease diagnosis based on the RFRS method,” Computational and Mathematical Methods in Medicine, vol. 2017, Article ID 8272091, 11 pages, 2017.
[25] T. G. Dietterich, “Machine learning,” Annual Review of Computer Science, vol. 4, no. 1, pp. 255–306, 1990.
[26] P. H. Swain and H. Hauska, “The decision tree classifier: design and potential,” IEEE Transactions on Geoscience Electronics, vol. 15, no. 3, pp. 142–147, 1977.
[27] M. T. Huyut and H. Üstündağ, “Prediction of diagnosis and prognosis of COVID-19 disease by blood gas parameters using decision trees machine learning model: a retrospective observational study,” Medical Gas Research, vol. 12, no. 2, pp. 60–66, 2022.
[28] S. Christa, V. Suma, and U. Mohan, “Regression and decision tree approaches in predicting the effort in resolving incidents,” International Journal of Business Information Systems, vol. 39, no. 3, pp. 379–399, 2022.
[29] Y. Y. Song and Y. Lu, “Decision tree methods: applications for classification and prediction,” Shanghai Arch Psychiatry, vol. 27, no. 2, pp. 130–135, 2015.
[30] A. Akavia, M. Leibovich, Y. S. Reshef, R. Ron, M. Shahar, and M. Vald, “Privacy-preserving decision trees training and prediction,” ACM Transactions on Privacy and Security, vol. 25, no. 3, pp. 1–30, 2022.
[31] A. Hamoud, “Selection of best decision tree algorithm for prediction and classification of students’ action,” American International Journal of Research in Science, Technology, Engineering & Mathematics, vol. 16, no. 1, pp. 26–32, 2016.
[32] T. A. Pham and V. Q. Tran, “Developing random forest hybridization models for estimating the axial bearing capacity of pile,” PLoS One, vol. 17, no. 3, Article ID e0265747, 2022.
[33] X. Wu, C. Peng, P. T. Nelson, and Q. Cheng, “Random forest-integrated analysis in AD and LATE brain transcriptome-wide data to identify disease-specific gene expression,” PLoS One, vol. 16, no. 9, Article ID e0256648, 2021.
[34] F. Santos, V. Graw, and S. Bonilla, “A geographically weighted random forest approach for evaluate forest change drivers in the Northern Ecuadorian Amazon,” PLoS One, vol. 14, no. 12, Article ID e0226224, 2019.
[35] A. W. Oehm, A. Springer, D. Jordan et al., “A machine learning approach using partitioning around medoids clustering and random forest classification to model groups of farms in regard to production parameters and bulk tank milk antibody status of two major internal parasites in dairy cows,” PLoS One, vol. 17, no. 7, Article ID e0271413, 2022.
[36] T. K. Ho, “Random decision forests,” in Proceedings of the 3rd International Conference on Document Analysis and Recognition, pp. 278–282, Lausanne, Switzerland, August 1995.
[37] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[38] A. Liaw and M. Wiener, “Classification and regression by randomforest,” R News, vol. 2, no. 3, pp. 18–22, 2022.
[39] Y. Liu, Y. Wang, and J. Zhang, “New machine learning algorithm: random forest,” in Information Computing and Applications, Lecture Notes in Computer Science, B. Liu, M. Ma, and J. Chang, Eds., pp. 246–252, Springer, Berlin, Germany, 2012.
[40] M. P. LaValley, “Logistic regression,” Circulation, vol. 117, no. 18, pp. 2395–2399, 2008.
[41] D. G. Kleinbaum, K. Dietz, M. Gail, M. Klein, and M. Klein, Logistic Regression, Springer-Verlag, Berlin, Germany, 2002.
[42] C. Y. Bakhoum, S. Madala, C. K. Long et al., “Retinal vein occlusion is associated with stroke independent of underlying cardiovascular disease,” Eye, 2022.
[43] P. E. Rubini, C. A. Subasini, A. V. Katharine, V. Kumaresan, S. G. Kumar, and T. M. Nithya, “A cardiovascular disease prediction using machine learning algorithms,” Annals of the Romanian Society for Cell Biology, vol. 25, no. 2, pp. 904–912, 2021, https://www.annalsofrscb.ro/index.php/journal/article/view/104048.
[44] R. G. Nadakinamani, A. Reyana, S. Kautish et al., “Clinical data analysis for prediction of cardiovascular disease using machine learning techniques,” Computational Intelligence and Neuroscience, vol. 2022, Article ID 2973324, 13 pages, 2022.
[45] S. Chen, G. I. Webb, L. Liu, and X. Ma, “A novel selective naïve Bayes algorithm,” Knowledge-Based Systems, vol. 192, Article ID 105361, 2020.
[46] K. Rrmoku, B. Selimi, and L. Ahmedi, “Application of trust in recommender systems—utilizing naive Bayes classifier,” Computation, vol. 10, no. 1, p. 6, 2022.
[47] S. K. Sameer and P. Sriramya, “Improving the accuracy for prediction of heart disease by novel feature selection scheme using decision tree comparing with naive-bayes classifier algorithms,” in Proceedings of the 2022 International Conference on Business Analytics for Technology and Security (ICBATS), pp. 1–8, Dubai, UAE, February 2022.
[48] M. Tanveer, T. Rajani, R. Rastogi, Y. H. Shao, and M. A. Ganaie, “Comprehensive review on twin support vector machines,” Annals of Operations Research, pp. 1–46, 2022.
[49] B. M. Kim, B. Y. Kang, H. G. Kim, and S. H. Baek, “Prognosis prediction for Class III malocclusion treatment by feature wrapping method,” The Angle Orthodontist, vol. 79, no. 4, pp. 683–691, 2009.
[50] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
[51] I. Steinwart and A. Christmann, Support Vector Machines, Springer Science & Business Media, Berlin, Germany, 2008.
[52] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vector machines,” IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18–28, 1998.
[53] J. Fan, J. Zheng, L. Wu, and F. Zhang, “Estimation of daily maize transpiration using support vector machines, extreme gradient boosting, artificial and deep neural networks models,” Agricultural Water Management, vol. 245, Article ID 106547, 2021.
[54] A. Kurani, P. Doshi, A. Vakharia, and M. Shah, “A comprehensive comparative study of artificial neural network (ANN) and support vector machines (SVM) on stock forecasting,” Annals of Data Science, pp. 1–26, 2021.
[55] S. Dhar, K. Roy, T. Dey, P. Datta, and A. Biswas, “A hybrid machine learning approach for prediction of heart diseases,” in Proceedings of the 2018 4th International Conference on Computing Communication and Automation (ICCCA), pp. 1–6, Greater Noida, India, December 2018.
[56] J. Azmi, M. Arif, M. T. Nafis, M. A. Alam, S. Tanweer, and G. Wang, “A systematic review on machine learning approaches for cardiovascular disease prediction using medical big data,” Medical Engineering & Physics, vol. 105, Article ID 103825, 2022.
[57] R. Gulhane and S. Gupta, Machine Learning Approach for Predicting the Heart Disease, Elsevier, Amsterdam, Netherlands, 2022.
[58] R. Tr, U. K. Lilhore, M. Poongodi, S. Simaiya, A. Kaur, and M. Hamdi, “Predictive analysis of heart diseases with machine learning approaches,” Malaysian Journal of Computer Science, vol. 2022, pp. 132–148, 2022.
[59] K. Lange, D. R. Hunter, and I. Yang, “Optimization transfer using surrogate objective functions,” Journal of Computational & Graphical Statistics, vol. 9, no. 1, pp. 1–20, 2000.
[60] E. Diday and J. C. Simon, “Clustering analysis,” in Digital Pattern Recognition, pp. 47–94, Springer, Berlin, Germany, 1976.
[61] T. S. Madhulatha, “An overview on clustering methods,” 2012, https://arxiv.org/abs/1205.1117.
[62] E. H. Ruspini, “A new approach to clustering,” Information and Control, vol. 15, no. 1, pp. 22–32, 1969.
[63] C. Boukhatem, H. Y. Youssef, and A. B. Nassif, “Heart disease prediction using machine learning,” in Proceedings of the 2022 Advances in Science and Engineering Technology International Conferences (ASET), pp. 1–6, Dubai, UAE, February 2022.
[64] A. C. O’Kelly, E. D. Michos, C. L. Shufelt et al., “Pregnancy and reproductive risk factors for cardiovascular disease in women,” Circulation Research, vol. 130, no. 4, pp. 652–672, 2022.
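
Supplementary sketch. The evaluation measures discussed in the results (misclassification error, sensitivity, specificity, and the ROC curve) can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `ConfusionCounts`, `summarize`, `roc_points`, and `auc` are hypothetical, and the counts and scores in the usage section are made up for illustration and are not taken from the study data. It assumes a binary outcome (1 = CVD present), at least one case in each class, and no tied scores.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Binary confusion-matrix counts (positive class = CVD present)."""
    tp: int  # affected patients predicted as affected
    fn: int  # affected patients predicted as unaffected
    fp: int  # unaffected patients predicted as affected
    tn: int  # unaffected patients predicted as unaffected


def summarize(c: ConfusionCounts) -> dict:
    """Derive the usual summary rates from the four confusion-matrix cells."""
    total = c.tp + c.fn + c.fp + c.tn
    accuracy = (c.tp + c.tn) / total
    return {
        "accuracy": accuracy,
        "misclassification_error": 1.0 - accuracy,
        "sensitivity": c.tp / (c.tp + c.fn),  # true-positive rate
        "specificity": c.tn / (c.tn + c.fp),  # true-negative rate
    }


def roc_points(labels, scores):
    """Return (false-positive rate, true-positive rate) pairs.

    Cases are visited from the highest to the lowest score, which is
    equivalent to lowering the decision threshold one case at a time.
    Assumes distinct scores and both classes present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    # Hypothetical counts, chosen only to make the arithmetic easy to follow.
    metrics = summarize(ConfusionCounts(tp=40, fn=10, fp=5, tn=45))
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")

    # Hypothetical true labels and predicted probabilities of CVD.
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print(f"AUC: {auc(roc_points(labels, scores)):.4f}")
```

Note that sensitivity and specificity are computed from different rows of the confusion matrix, so a classifier can score high on one and low on the other, as the RF figures in the conclusion show; the ROC curve makes that trade-off visible across thresholds.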
