Predictive Modeling of Cardiovascular Risk using Machine Learning: Focus on Heart Attack Prevention
Research Scholar, CMR Institute of Technology, Bengaluru, India
Email: [email protected]
Abstract— Heart disease and stroke have ranked high among the leading causes of global mortality for decades. Every year, the rising death toll from cardiac arrest is mostly attributable to long-
term health problems and the inability to receive treatment quickly enough. The importance of
early and correct diagnosis in decreasing life risk is a point on which healthcare specialists can
agree without reservation. Technological developments, especially in the fields of AI and ML,
have allowed for a plethora of studies to build models for preventive care using electronic
medical records. In light of the gravity of the situation, our research presents a model based on
machine learning to aid medical practitioners in making risk assessments for heart attacks.
This research makes use of the Heart Attack Prediction dataset, which contains 303
instances. Our method yields strong results, and the Random Forest algorithm provides the best fit, with 92% recall and 94% accuracy. In addition, we discuss the possibility of future
studies that use machine learning models to forecast or detect particular cardiac ailments, such
as right-heart problems by analyzing jugular veins.
Index Terms— Cardiovascular diseases, Machine Learning, Heart attack prediction, Electronic
medical records, Random Forest algorithm, Preventive care.
I. INTRODUCTION
Diseases of the heart, blood vessels, and the system of coronary arteries are together known as cardiovascular
diseases (CVD). This wide phrase covers conditions that cause heart attacks or strokes, including cerebral
vascular conditions, thrombosis of the deep veins, and congestive heart disease. The most typical cause is obstruction of blood vessels, either as a result of internal bleeding or fat buildup. According to the WHO, 17.9 million individuals succumb to cardiovascular diseases annually, accounting for 32% of deaths worldwide. Healthcare professionals believe that reducing these concerning numbers depends on the early and precise
diagnosis of disorders. The purpose of this study is to help doctors, family practitioners, and cardiologists in
forecasting a person's likelihood of having a heart attack in order to address this growing issue.
key variables, enhancing the efficiency and accuracy of medical predictions. Kumar et al. [9] advance previous
research by substituting key features for all other features in classification models to improve their accuracy and
precision. The authors demonstrate that focusing on the most relevant variables can significantly streamline the
algorithms and improve model performance, which is beneficial in medical data analysis. Sharma et al. [10] review the current state of medical data mining with a focus on feature selection and model efficiency. The
authors discuss various methods for identifying significant variables and their impact on reducing the complexity
of machine learning algorithms. They also highlight future work and advancements needed in this field to aid
future researchers in developing more accurate and efficient models. Mishra et al. [11] explain how different machine learning methods rely on feature selection for medical data classification. The study highlights the need for efficient algorithms that utilize the most significant variables by evaluating models such as SVM, Decision
Trees, and Logistic Regression. The findings highlight improved accuracy and reduced complexity in medical
data analysis.
Lakshmi et al. [12] explore techniques for improving stroke prediction models by addressing data imbalance and enhancing feature engineering. After using UCO and several classifiers, the authors found that the Random Forest Classifier had the highest accuracy at 71.15%. The study underscores the significance of balanced datasets and relevant feature selection in medical predictions. Katarya and Meena [13] discuss advanced machine
learning approaches for heart disease prediction, evaluating models such as SVM, KNN, MLP, and Naive Bayes.
The authors focus on the integration of feature selection techniques like the relief method to enhance model
performance. The study reports an accuracy improvement to 85.10% with the SVM-relief combination,
showcasing the potential of advanced methods in healthcare.
B. Setting up the Environment
The machine learning program was set up using Anaconda, with Spyder as the integrated development
environment (IDE). Additionally, all necessary libraries and add-ons required for building the framework were
installed to ensure a complete and functional development environment.
C. Data Preparation and Processing
The dataset was first shuffled to randomly disperse previously grouped classes, ensuring unbiased data
distribution. It was then split into 80% training and 20% testing sets. The preprocessing phase included data
cleaning to remove inconsistencies and visualization to understand variable importance. Various models were
trained independently on the training set and tested on the testing set, with their predicted outcomes compared
against actual outcomes to determine the most effective model. This structured approach ensured accurate and
reliable predictions.
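A minimal sketch of this preparation step is given below, assuming the dataset is available as a CSV file; the file name ('heart.csv'), the target column name ('output'), and the random seed are illustrative assumptions rather than the exact values used in this study.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Heart Attack Prediction dataset (file name is an assumption).
data = pd.read_csv("heart.csv")

# Shuffle to randomly disperse previously grouped classes.
data = data.sample(frac=1.0, random_state=42).reset_index(drop=True)

# Separate features from the target label (column name 'output' is assumed).
X = data.drop(columns=["output"])
y = data["output"]

# 80% training / 20% testing split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)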
IV. IMPLEMENTATION
The framework for categorizing the dataset was evaluated using recall, accuracy, precision, F1-Score, and the
confusion matrix. Several classifiers were employed to assess the model's performance, including Decision Tree,
K-Nearest Neighbour, Extreme Gradient Boost, Logistic Regression, Random Forest, and Support Vector
Machine (SVM). These metrics and classifiers helped in comparing the results and determining the effectiveness
of each model. A fundamental and straightforward classification technique used in supervised learning is KNN
(K-Nearest Neighbour). It is commonly used in investigations when the distribution of the data is unknown. The approach determines the distances between the existing training data points and the test point.
To find out what a test point is classified as, the user chooses a set number of instances, K, to act as its nearest
neighbors [16].
KNN makes no assumptions about the data distribution because it is non-parametric. This suits real-world situations, where the actual data rarely conform to standard theoretical distributions. KNN also has a very quick training phase, since most of the computation is deferred to prediction time. The model was tested by selecting the five nearest neighbours and calculating their distance using the Euclidean distance formula shown in Equation (1).
Distance(a, b) = √(Σᵢ (aᵢ − bᵢ)²)    Equation (1)
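As a brief sketch of this configuration, assuming scikit-learn and the training/testing split from the data-preparation step, the classifier could be set up as follows.

from sklearn.neighbors import KNeighborsClassifier

# Five nearest neighbours, with distances computed as in Equation (1).
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)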
An if-then rule obtained from the training data forms the basis of a decision tree, which in turn forms a tree-like
prediction model. The structural element that serves as the starting point for a decision tree is known as the root.
Subsequent attributes are referred to as nodes, and their outcomes form branches. The tree reaches a prediction leaf node as it classifies a test point. Decision trees are a popular forecasting strategy in mathematics, data mining, and machine learning. ID3 and CART are the most popular decision tree methods. Binary classification
uses ID3, while binary or multiple classifications use CART [17]. In the model constructed for this study, the maximum depth of the tree was set to six levels, and entropy was used as the splitting criterion.
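A corresponding sketch of this configuration, assuming scikit-learn and the earlier training/testing split, is shown below; the random seed is an assumption.

from sklearn.tree import DecisionTreeClassifier

# Maximum depth of six levels and entropy as the splitting criterion, as stated above.
dt = DecisionTreeClassifier(criterion="entropy", max_depth=6, random_state=42)
dt.fit(X_train, y_train)
dt_predictions = dt.predict(X_test)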
Logistic regression analysis is widely used in medical research and machine learning technologies that predict
medical conditions [18]. For cases where a binary prediction is made using numerous independent factors, the
algorithm is deemed the best option. When it comes to determining the clinical significance of observed effects,
logistic regression analysis is seen as a crucial tool due to its mathematical ease and flexibility in comparison to other methods. The logistic or sigmoid function, which underpins logistic regression, is given in Equation (2).
σ(z) = 1 / (1 + e^(−z))    Equation (2)
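As a minimal sketch, the sigmoid of Equation (2) and a logistic regression classifier could be written as follows; the solver iteration limit is an assumption, not a value reported in this study.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Logistic (sigmoid) function from Equation (2).
    return 1.0 / (1.0 + np.exp(-z))

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
lr_predictions = lr.predict(X_test)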
The Random Forest classifier performs very accurately and efficiently when used with high-dimensional data. Each tree in a Random Forest classifies data like a single decision tree; however, the ensemble trains many trees on sub-samples of the data, increasing the system's variety and classification capability. Extreme Gradient Boost (XGB) is a machine learning approach employed for both classification and regression. Like Random Forest, XGB combines many decision trees, but it builds them sequentially so that each new tree corrects the mistakes made by the previous ones. The XGBoost classifier's booster was set to 'dart' [19]. Following the transformation of the dataset from a
lower-dimensional format to a higher-dimensional feature space, the support vector machine (SVM) locates a
hyperplane in order to linearly split the dataset into its multiple classes. The classifier has the ability to maximize
class distance in order to guarantee reliable data classifications in the future. A hyperplane generalizes a line in two dimensions and a plane in three dimensions; it is a decision boundary that assists in separating the various classes [20]. Classes that are not linearly separable can be differentiated using a straightforward technique known as the kernel trick, which maps the data to a higher-dimensional space. Kernel-based support vector machines (SVMs) are most frequently utilized for non-linear data categorization in low-dimensional domains.
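A sketch of these three classifiers, assuming scikit-learn, the xgboost package, and the earlier training/testing split, might look as follows; the number of trees and the RBF kernel choice are assumptions, while the 'dart' booster follows the text above.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier  # requires the xgboost package

# Random Forest: an ensemble of trees trained on sub-samples of the data.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# XGBoost with the 'dart' booster, as described above.
xgb = XGBClassifier(booster="dart", eval_metric="logloss")
xgb.fit(X_train, y_train)

# Kernel SVM: the RBF kernel implicitly maps the data to a higher-dimensional space.
svm = SVC(kernel="rbf", probability=True)
svm.fit(X_train, y_train)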
In order to evaluate the models, we looked at four types of cases: "True Positive (TP)" cases, where the instance really had a heart attack and the model correctly predicted it; "False Positive (FP)" cases, where the model predicted a heart attack that did not occur; "False Negative (FN)" cases, where the model failed to predict a heart attack although the instance showed symptoms associated with one; and "True Negative (TN)" cases, where the instance did not have a heart attack and the model correctly predicted this. Type I errors are represented by FP and Type II errors by FN [21].
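Continuing the earlier sketches, these four counts and the derived metrics can be obtained with scikit-learn as follows (the KNN predictions are used purely as an example).

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

predictions = knn.predict(X_test)

# For binary labels the confusion matrix is laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

print("Accuracy :", accuracy_score(y_test, predictions))
print("Precision:", precision_score(y_test, predictions))
print("Recall   :", recall_score(y_test, predictions))
print("F1-Score :", f1_score(y_test, predictions))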
VI. RESULTS AND DISCUSSIONS
The algorithm's predicted instance label is stored in the 'predicted' class, while the actual instance label is stored in the 'actual' class. A confusion matrix summarizes this two-category contingency table. If the model correctly predicts a heart attack, it is a "True Positive (TP)" case; if it incorrectly diagnoses one, it is a "False Positive (FP)" case. A "True Negative (TN)" case occurs when the instance did not have a heart attack and the model predicted none, while a "False Negative (FN)" case occurs when the model did not predict a heart attack although the instance exhibited symptoms. Type I and Type II errors are referred to as FP and FN, respectively. XGB and KNN had the fewest misclassifications and the best instance identification rates. The different performance metrics of the
classification models that were used are compared in the column graphs of Figure 3. A classifier's precision is its capacity to accurately identify positive labels, whereas accuracy is the proportion of instances, both positive and negative, that are classified correctly. Precision is the percentage of true positives relative to the total number of instances predicted as positive.
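For reference, these metrics follow directly from the confusion-matrix counts defined earlier (TP, TN, FP, FN):
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)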
The plots show that KNN has the highest accuracy, precision, and F1-score. Both Logistic Regression and
Random Forest have the highest recall, which is 0.89. On the other hand, Support Vector Machines have the
weakest accuracy, precision, recall, and F1-score when it comes to predicting heart attacks. The outcomes of the various models relevant to the medical domain are listed in Table 1. These results include the models' accuracy (their
capacity to accurately forecast or diagnose the condition), recall (their capacity to recognize a certain category or
class), precision, and F1-Score. Among the six models tested, KNN proved to be the most precise. The models
with the highest recall ratings were the ones that used logistic regression and random forest. It was determined
that the Support Vector Machine produced the least accurate results, with an accuracy of 70.49 percent, recall
and F1-score of 70 percent, and precision of 55 percent. For the purpose of illustrating the usefulness of a binary
classifier in illness detection, the Receiver Operating Characteristic (ROC) curve is a graphic that compares the
True Positive rate to the False Positive rate of the various classifiers. This curve can help discover the proper cut-
off value to increase the classification system's sensitivity, recall, and specificity. In comparison to the other classifiers that were evaluated, KNN had the best ROC curve, indicating that it achieves a higher true positive rate relative to its false positive rate. The SVM classifier, on the other hand, demonstrated the lowest accuracy and was found to be the least suitable option for the modeling framework.
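A sketch of how such a curve can be produced with scikit-learn and matplotlib is given below, again reusing the KNN classifier from the earlier sketches as an example; any fitted model that provides class probabilities could be substituted.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Probability scores for the positive (heart attack) class.
scores = knn.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label="KNN (AUC = %.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False Positive rate")
plt.ylabel("True Positive rate")
plt.legend()
plt.show()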
VII. CONCLUSIONS
The proposed AI-based model for cardiac stroke prediction is a noninvasive technique. Feature engineering techniques were combined with six classification techniques, namely Decision Tree, Logistic Regression, KNN, SVM, Extreme Gradient Boost, and Random Forest, to investigate the key characteristics. The results obtained from the proposed approach appear to be interesting and acceptable. The ROC and AUC values are also interpreted to achieve the objective. The classifiers were evaluated using a wide variety of metrics, and the model was cross-validated to eliminate biased results. We achieved 94.2 percent accuracy using the Random Forest classifier, which proved to be the best. A graphical user interface was developed for heart stroke prediction, and a web application for multi-patient monitoring was developed to automatically monitor and report critical patient parameters. Additional features can be added to the work to make it run on a larger database and produce more accurate results. Other diseases and disorders can also be predicted using similar methods.
REFERENCES
[1] Li, J., Haq, A., Din, S., Khan, J., Khan, A. & Saboor, A. Heart disease identification method using machine learning
classification in e-healthcare. IEEE Access. 8 pp. 107562-107582 (2020)
[2] Furst, B. & Furst, B. Functional morphology of the heart. The Heart and Circulation: An Integrative Model. pp. 97-120
(2020)
[3] Yamabayashi, C. & Reid, W. Anatomy and Physiology of the Respiratory and Cardiovascular Systems.
Cardiopulmonary Physical Therapy. pp. 3-24 (2024)
[4] Mathur, P., Srivastava, S., Xu, X. & Mehta, J. Artificial intelligence, machine learning, and cardiovascular disease.
Clinical Medicine Insights:Cardiology. 14 pp. 1179546820927404 (2020)
[5] Kumar, N. & Kumar, D. Machine learning based heart disease diagnosis using non-invasive methods: A review. Journal
Of Physics: Conference Series. 1950, 012081 (2021)
[6] Sharma, O. Prediction and Analysis of Heart Attack using Various Machine Learning Algorithms. 2023 International
Conference on Artificial Intelligence and Smart Communication (AISC). pp. 786-790 (2023)
[7] Wang, M., Yao, X. & Chen, Y. An imbalanced-data processing algorithm for the prediction of heart attack in stroke
patients. IEEE Access. 9 pp. 25394-25404 (2021)
[8] Wu, W., Li, Y., Feng, A., Li, L., Huang, T., Xu, A. & Lyu, J. Data mining in clinical big data: the frequently used
databases, steps, and methodological models. Military Medical Research. 8 pp. 1-12 (2021)
[9] Kumar, Y., Koul, A., Sisodia, P., Shafi, J., Verma, K., Gheisari, M. & Davoodi, M. Heart failure detection using
quantum-enhanced machine learning and traditional machine learning techniques for internet of artificially intelligent
medical things. Wireless Communications and Mobile Computing. 2021, 1616725 (2021)
[10] Sharma, A. & Mishra, P. Performance analysis of machine learning based optimized feature selection approaches for
breast cancer diagnosis. International Journal of Information Technology. 14, 1949-1960 (2022)
[11] Mishra, S., Mallick, P., Tripathy, H., Bhoi, A. & González-Briones, A. Performance evaluation of a proposed machine
learning model for chronic disease datasets using an integrated attribute evaluator and an improved decision tree
classifier. Applied Sciences. 10, 8137 (2020)
[12] Lakshmi, N. & Rout, R. An 8-Layered MLP Network for Detection of Cardiac Arrest at an Early
Stage of Disease. Artificial Intelligence and Data Science: First International Conference, ICAIDS 2021, Hyderabad,
India, December 17–18, 2021, Revised Selected Papers. pp. 306 (2022)
[13] Katarya, R. & Meena, S. Machine learning techniques for heart disease prediction: a comparative study and analysis.
Health And Technology. 11, 87-97 (2021)
[14] Izonin, I., Tkachenko, R., Shakhovska, N., Ilchyshyn, B. & Singh, K. A two-step data normalization approach for
improving classification accuracy in the medical diagnosis domain. Mathematics. 10, 1942 (2022)
[15] Hemanth Kumar, H., Gowramma, Y., Manjula, S., Anil, D. & Smitha, N. Comparison of various ML and DL Models
for Emotion Recognition using Twitter. 2021 Third International Conference On Intelligent Communication
Technologies and Virtual Mobile Networks (ICICV). pp. 1332-1337 (2021)
[16] N, S., Singh, A., Ghosh, A., Kumari, A., R, T., Manjula, S. & K. R, V. Early Prediction of Sepsis using ML Algorithms
on Clinical Data. 2023 14th International Conference on Computing Communication And Networking Technologies
(ICCCNT). pp. 1-8 (2023)
[17] Shivakumar, B., Nagaraja, B. & Thimmaraja Yadava, G. Classification Performance Analysis of CART and ID3
Decision Tree Classifiers on Remotely Sensed Data. International Conference On VLSI, Signal Processing, Power
Electronics, IoT, Communication and Embedded Systems. pp. 89-107 (2022)
[18] Ramesh, T., Lilhore, U., Poongodi, M., Simaiya, S., Kaur, A. & Hamdi, M. Predictive analysis of heart diseases with
machine learning approaches. Malaysian Journal of Computer Science. pp. 132-148 (2022)
[19] Raj, S., Vani, R., Raja, B., Harsha, T., Drakshayani, T. & Charith, R. Heart disease detection using XGB-classifier and failure prediction using gradient boosting. Journal Of Nonlinear Analysis and
Optimization. 15 (2024)
[20] Anil, D. & Suresh, S. Predicting Early Reviewers on E-Commerce Websites. 2022 IEEE 3rd Global Conference for
Advancement in Technology (GCAT). pp. 1-5 (2022)
[21] Mathapati, S., Anil, D., Tanuja, R., Manjula, S. & Venugopal, K. CNSM: cosine and n-gram similarity measure to
extract reasons for sentiment variation on Twitter. International Journal Of Computer Engineering And Technology
(IJCET) IAEME Journal. 9, 150-161 (2018)