Grenze International Journal of Engineering and Technology, January Issue

Predictive Modeling of Cardiovascular Risk using Machine Learning: Focus on Heart Attack Prevention
Bindu S1, Kavyashree I Pathan2, Kishore K V3, Mariyan Richard A4, Anil D5 and Ashika Raj6
1 New Horizon College of Engineering, Bengaluru, India
Email: [email protected]
2 Dayanand Sagar University, Bengaluru, India
Email: [email protected]
3,5 CMR Institute of Technology, Bengaluru, India
Email: [email protected], [email protected]
4 Department of Computer Science PG, Kristu Jayanthi College, Bengaluru, India
Email: [email protected]
6 Research Scholar, CMR Institute of Technology, Bengaluru, India
Email: [email protected]

Abstract— Heart disease and stroke have ranked high among the global causes of mortality for decades.
Every year, the rising death toll from cardiac arrests is mostly attributable to long-term health
problems and the inability to receive treatment quickly enough. Healthcare specialists agree without
reservation that early and correct diagnosis is important in reducing the risk to life. Technological
developments, especially in the fields of AI and ML, have enabled many studies to build models for
preventive care using electronic medical records. In light of the gravity of the situation, our
research presents a machine learning model to aid medical practitioners in assessing heart attack
risk. This research uses the Heart Attack Prediction dataset, which contains 303 instances. Our method
yields strong outcomes, and the Random Forest algorithm performs best, with 92% recall and 94%
accuracy. In addition, we discuss the possibility of future studies that use machine learning models
to forecast or detect particular cardiac ailments, such as right-heart problems, by analyzing jugular
veins.

Index Terms— Cardiovascular diseases, Machine Learning, Heart attack prediction, Electronic
medical records, Random Forest algorithm, Preventive care.

I. INTRODUCTION
Diseases of the heart and blood vessels, including the coronary arteries, are together known as
cardiovascular diseases (CVD). This broad term covers conditions that cause heart attacks or strokes,
including cerebrovascular disease, deep vein thrombosis, and congestive heart disease. The most typical
cause is obstruction of the blood vessels, as a result of either internal bleeding or fat buildup.
Globally, 17.9 million people die from cardiovascular diseases each year, accounting for 32% of all
deaths, as reported by the WHO. Healthcare professionals believe that reducing these concerning numbers
depends on the early and precise diagnosis of disorders. To address this growing issue, the purpose of
this study is to help doctors, family practitioners, and cardiologists forecast a person's likelihood
of having a heart attack.

Grenze ID: 01.GIJET.11.1.118


© Grenze Scientific Society, 2025
A person's cardiac health depends on a number of factors and medical conditions. A machine learning
model was developed using a dataset downloaded from Kaggle. Various machine learning techniques were
trained and tested on this dataset to classify instances and forecast heart attacks. The best approach
was then chosen through a comparative analysis [1].
The cardiovascular system is one of the organ systems of the human body. It is made up of the blood,
the heart, and blood vessels such as arteries, veins, and capillaries. Blood contains three types of
cells: red blood cells (erythrocytes), white blood cells (leukocytes), and platelets (thrombocytes).
The system distributes oxygen and nutrients throughout the body. The cardiac muscle of the heart is
responsible for pumping blood throughout the body. A septum separates the left and right halves of the
heart, which is divided into four chambers. The myocardium is the muscular wall of the heart, and it is
situated between the endocardium and the membrane that surrounds the heart [2]. The human heart
supports a double circulation, in which blood passes through the heart twice in each complete circuit.
Deoxygenated blood first enters the right atrium via the vena cava. The right ventricle then pumps the
blood into the pulmonary artery, which carries it to the lungs. By means of the capillaries and
alveoli, the lungs exchange gases and deliver oxygen to the bloodstream. Oxygenated blood returns to
the left atrium through the pulmonary veins and flows into the left ventricle, which pumps it into the
aorta for distribution to the body [3].
Insufficient blood supply through the coronary arteries, for example, can prevent the heart from
pumping and functioning normally, increasing the risk of a heart attack. Plaques composed of lipids can
obstruct the coronary arteries. Both complete and partial blockages, denoted as ST elevation and non-ST
elevation, are possible. A heart attack can be deadly if not treated promptly by a medical
professional. Heart attack risk is increased by the same variables as atherosclerotic plaque. One of
these factors is age, since the risk of heart attacks increases with age, particularly for men over 45
and women over 55. Furthermore, according to research, cardiovascular disease is more common among
people who smoke cigarettes, particularly those who smoke continuously. The likelihood increases in the
presence of stress as well as chronically elevated blood sugar, cholesterol, triglycerides, and blood
pressure. Diabetes and obesity are two conditions that have been linked to cardiovascular disease.
Unhealthy lifestyle choices, poor diet, autoimmune diseases, and a family history of cardiovascular
disease can all increase the likelihood of developing cardiovascular disease [4]. The majority of
deaths in the modern world are caused by different types of cardiac disorders, and their prevalence has
been rising over time. In the USA, cardiovascular disease reportedly accounts for around one-third of
all fatalities annually. In 2018, 2,380 people died every day in the US from cardiovascular disorders,
according to the American Heart Association. Advancements continue to be made in the detection,
management, and treatment of cardiovascular illnesses. One recent development involves the integration
of artificial intelligence with medical diagnosis. Diagnostic tools based on machine learning are
currently being developed to forecast heart attacks [5].

II. LITERATURE REVIEW


SVM, neural networks (NNs), decision trees, and KNN are used to forecast medical issues. Many studies have
focused on heart disease prediction in an effort to find an algorithm that can reliably predict if a person will have
a heart attack.
Sharma et al. [6] apply a number of classification models to the Statlog (Heart) dataset for
evaluation, including Decision Tree (C4.5, C-RT, ID3), SVM, KNN, Naive Bayes, MLP, and Logistic
Regression. With an accuracy of 84.81%, the study concludes that the best results are produced by
combining SVM with the Relief feature selection method. Such an evaluation informs the selection of
effective models for heart disease prediction. Wang et al. [7] introduce the UCO technique to create
balanced data from an imbalanced stroke dataset, which is crucial for training machine learning
systems. The most effective method is the UCO-assisted Random Forest Classifier, which has an accuracy
of 70.29 percent and a precision of 70.05 percent. This demonstrates the necessity of data balancing in
medical diagnosis.
Wu et al. [8] explore pattern recognition techniques in medical data mining to identify the most
significant variables affecting health outcomes. The authors highlight that irregular ECGs, blood
pressure, blood sugar, and cholesterol levels are crucial factors. Their work aims to streamline the
data mining process by focusing on these
key variables, enhancing the efficiency and accuracy of medical predictions. Kumar et al. [9] advance
previous research by using only the key features, rather than the full feature set, in classification
models to improve their accuracy and precision. The authors demonstrate that focusing on the most
relevant variables can significantly streamline the algorithms and improve model performance, which is
beneficial in medical data analysis. Sharma et al. [10] review the current state of medical data mining
with a focus on feature selection and model efficiency. The authors discuss various methods for
identifying significant variables and their impact on reducing the complexity of machine learning
algorithms. They also outline the future work and advancements needed in this field to help researchers
develop more accurate and efficient models. Mishra et al. [11] explain how different machine learning
methods rely on feature selection for medical data classification. By evaluating models such as SVM,
Decision Trees, and Logistic Regression, the study highlights the need for efficient algorithms that
use the most significant variables. The findings show improved accuracy and reduced complexity in
medical data analysis.
Lakshmi et al. [12] explore techniques for improving stroke prediction models by addressing data
imbalance and enhancing feature engineering. After using UCO and several classifiers, the authors found
that the Random Forest Classifier had the highest accuracy at 71.15%. The study underscores the
significance of balanced datasets and relevant feature selection in medical predictions. Katarya and
Meena [13] discuss advanced machine learning approaches for heart disease prediction, evaluating models
such as SVM, KNN, MLP, and Naive Bayes. The authors focus on the integration of feature selection
techniques such as the Relief method to enhance model performance. The study reports an accuracy
improvement to 85.10% with the SVM-Relief combination, showcasing the potential of advanced methods in
healthcare.

III. PROPOSED SYSTEM


The following steps were carried out after reviewing the literature on cardiovascular illnesses and the
machine learning techniques used to predict them.
A. Obtaining the dataset
The Heart Attack Analysis and Prediction dataset was obtained from Kaggle (retrieved on February 15,
2022) [14]. The dataset consists of 303 instances with 14 columns, where 13 columns are features and
one is the label. After obtaining the dataset as a .csv file, we used Microsoft Excel to look for null
values and found no missing or null values. Data standardization and visualization were also part of
the preprocessing, in order to identify patterns and the relationships between the various
characteristics [15]. The dataset includes the following features: gender (sex), age (age), number of
major vessels (ca), type of chest pain (cp), resting blood pressure (trtbps), cholesterol (chol) in
mg/dL, fasting blood sugar (fbs), exercise-induced angina (exang), resting electrocardiogram
(rest_ecg), maximal heart rate (thalach), and target, which signifies whether the patient had a heart
attack. A bar graph was used to compare the relative importance of the features. The patient's maximal
heart rate and number of blocked major arteries are the most important features. Many people,
especially in impoverished places, cannot afford complex tests such as the major-vessel count. Thus,
the models were built using ten of the thirteen dataset features, excluding resting blood pressure,
cholesterol in mg/dL, and major artery count.
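The null check and the feature exclusion described above can be sketched with the Python standard library. This is only an illustration, not the authors' code: the column names follow the abbreviations quoted in the text, the two data rows are made-up placeholder values, and the helper names `count_missing` and `drop_columns` are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample; the values are placeholders for illustration only.
SAMPLE_CSV = """age,sex,cp,trtbps,chol,fbs,rest_ecg,thalach,exang,ca,target
63,1,3,145,233,1,0,150,0,0,1
41,0,1,130,204,0,0,172,0,2,0
"""

# Columns the text says were left out of model training.
EXCLUDED = {"trtbps", "chol", "ca"}


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per instance."""
    return list(csv.DictReader(io.StringIO(text)))


def count_missing(rows):
    """Count cells that are absent or contain only whitespace."""
    return sum(
        1
        for row in rows
        for value in row.values()
        if value is None or value.strip() == ""
    )


def drop_columns(rows, excluded):
    """Return copies of the rows without the excluded columns."""
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]


rows = load_rows(SAMPLE_CSV)
missing = count_missing(rows)
reduced = drop_columns(rows, EXCLUDED)
```

With a real export of the dataset, `SAMPLE_CSV` would be replaced by the file contents, and the column names would need to match the headers actually present in that file.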

Figure 1. Model Flow Diagram

B. Setting up the Environment
The machine learning program was set up using Anaconda, with Spyder as the integrated development
environment (IDE). Additionally, all necessary libraries and add-ons required for building the framework were
installed to ensure a complete and functional development environment.
C. Data Preparation and Processing
The dataset was first shuffled to randomly disperse previously grouped classes, ensuring unbiased data
distribution. It was then split into 80% training and 20% testing sets. The preprocessing phase included data
cleaning to remove inconsistencies and visualization to understand variable importance. Various models were
trained independently on the training set and tested on the testing set, with their predicted outcomes compared
against actual outcomes to determine the most effective model. This structured approach ensured accurate and
reliable predictions.
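A minimal sketch of the shuffle and 80/20 split, assuming a fixed random seed for reproducibility (the seed and the function name `shuffle_split` are assumptions, not taken from the paper). It operates on instance indices rather than on the actual records.

```python
import random


def shuffle_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of the items and split it into training and testing parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 303 instance indices, matching the dataset size reported in the text.
train, test = shuffle_split(range(303))
```

For 303 instances this yields 242 training and 61 testing instances. A 61-instance test set is also consistent with the 70.49 percent SVM accuracy reported in the results, since 43/61 ≈ 0.7049.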

IV. IMPLEMENTATION
The framework for classifying the dataset was evaluated using recall, accuracy, precision, F1-score,
and the confusion matrix. Several classifiers were employed to assess performance, including Decision
Tree, K-Nearest Neighbor, Extreme Gradient Boost, Logistic Regression, Random Forest, and Support
Vector Machine (SVM). These metrics and classifiers helped in comparing the results and determining the
effectiveness of each model.
K-Nearest Neighbor (KNN) is a fundamental and straightforward classification technique used in
supervised learning. It is commonly used when the distribution of the data is unknown. The approach
computes the distances between the training data points and the test point. To classify a test point,
the user chooses a number of instances, K, to act as its nearest neighbors, and the test point is
assigned the majority class among them [16].
KNN makes no assumption about the data distribution because it is non-parametric. This suits real-world
situations in which the actual data mostly do not conform to a standard theoretical distribution. KNN
has only a quick training phase, since it simply stores the training data. The model was tested by
selecting the five nearest neighbors and calculating their distances using the Euclidean distance
formula in Equation (1).
$$\mathrm{Distance}(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \tag{1}$$
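The KNN procedure with five neighbors and Euclidean distance can be sketched as follows. This is an illustrative pure-Python version, not the authors' implementation; the toy points and labels are invented for the example, and `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=5):
    """Label the query with the majority class among its k nearest training points."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: euclidean(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data (made up): two well-separated clusters with labels 0 and 1.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
prediction = knn_predict(points, labels, (5.5, 5.5))
```

Here the query's five nearest neighbors are the four points of the second cluster plus one point of the first, so the majority vote returns label 1.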
A decision tree is a tree-like prediction model built from if-then rules learned from the training
data. The structural element that serves as the starting point of a decision tree is known as the root.
The subsequent attributes are referred to as nodes, and their outcomes are branches. As the tree
classifies a test point, it follows the branches until it reaches a leaf node, which gives the
prediction. It is a popular forecasting strategy in statistics, data mining, and machine learning. ID3
and CART are the most popular decision tree methods. ID3 is used for binary classification, while CART
is used for binary or multi-class classification [17]. In the model constructed for this study, the
maximum depth of the tree was set to six levels, and the splitting criterion was entropy.
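To make the entropy criterion concrete, the sketch below computes the Shannon entropy of a set of labels and the information gain of a candidate split. It illustrates the general criterion only and makes no claim about how the authors' library evaluates splits internally; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = [0, 0, 1, 1]
gain = information_gain(parent, [0, 0], [1, 1])
```

A perfectly mixed parent has an entropy of one bit, and a split that separates the two classes completely recovers all of it, so the gain in this example is one bit.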
Logistic regression is widely used in medical research and in machine learning tools that predict
medical conditions [18]. The algorithm is often considered a good choice for cases where a binary
prediction is made from several independent variables. Logistic regression is also valued for assessing
the clinical significance of observed effects because of its mathematical simplicity and flexibility
compared with other methods. The logistic, or sigmoid, function given in Equation (2) underpins
logistic regression.

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2}$$
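As a small illustration of how the sigmoid turns a linear score into a probability, the sketch below evaluates the function in a numerically stable way. The weights, bias, and the helper `predict_probability` are hypothetical and are not fitted values from the study.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # for negative x, exp(x) is small and cannot overflow
    return z / (1.0 + z)


def predict_probability(weights, bias, features):
    """Probability of the positive class for one instance under a linear model."""
    score = bias + sum(w * f for w, f in zip(weights, features))
    return sigmoid(score)


# Hypothetical two-feature model; the numbers are placeholders.
probability = predict_probability([0.5, -0.25], 0.0, [2.0, 4.0])
```

In this example the linear score is zero, so the predicted probability is exactly 0.5, the usual decision threshold for a binary prediction.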

The Random Forest classifier performs accurately and efficiently on high-dimensional data. Unlike a
single decision tree, the Random Forest classifier trains many trees on sub-samples of the data and
combines their predictions, increasing the system's variety and classification capability. Extreme
Gradient Boost (XGB) is a machine learning approach employed for both classification and regression.
Whereas Random Forest trains its trees independently, XGB builds its decision trees sequentially, with
each new tree fixing the mistakes made by the previous ones. The XGBoost classifier's booster was set
to dart [19].
The support vector machine (SVM) maps the dataset from a lower-dimensional format into a
higher-dimensional feature space and then locates a hyperplane that linearly splits the dataset into
its classes. The classifier maximizes the margin between classes in order to make future
classifications reliable. A hyperplane generalizes a line in two dimensions and a plane in three
dimensions, and it acts as the decision boundary that divides the classes [20]. Classes that are not
linearly separable can be told apart using a technique known as the kernel trick, which implicitly maps
the data into a higher-dimensional space. For non-linear data classification, kernel-based SVMs are the
most frequently used in low-dimensional domains.
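The kernel trick can be illustrated with a degree-2 polynomial kernel, chosen here only as an example since the paper does not state which kernel was used. The kernel value computed in the original two-dimensional space equals the inner product of the explicitly mapped three-dimensional vectors, so the higher-dimensional mapping never has to be built.

```python
import math


def poly_kernel(x, y):
    """Degree-2 polynomial kernel (x . y)^2 for two-dimensional inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit three-dimensional feature map corresponding to poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly_kernel(x, y)                        # computed in 2-D
explicit_value = dot(feature_map(x), feature_map(y))    # computed in 3-D
```

Both quantities agree up to floating-point rounding, which is why an SVM can work with the kernel function alone.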
To evaluate the models, we considered four types of cases: "True Positive (TP)" cases, where the
instance really had a heart attack and the model correctly predicted it; "False Positive (FP)" cases,
where the model predicted a heart attack that did not occur; "False Negative (FN)" cases, where the
model failed to predict a heart attack that the instance actually had; and "True Negative (TN)" cases,
where the model predicted no heart attack and the instance had none. Type I errors are represented by
FP and Type II errors by FN [21].
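The four case types translate directly into the reported metrics. The sketch below uses the standard definitions of accuracy, precision, recall, and F1-score; the label vectors are invented for illustration and are not results from the study.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 means heart attack."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1-score from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for eight instances.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(actual, predicted)
scores = metrics(*counts)
```

In this toy case there are three true positives, one false positive, one false negative, and three true negatives, so all four metrics equal 0.75.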

TABLE I. PERFORMANCE METRICS OF VARIOUS MODELS

Sl No  Model                    Accuracy  Recall  Precision  F1 Score
1      Decision Tree            0.86      0.83    0.89       0.87
2      K-Nearest Neighbor       0.90      0.87    0.86       0.84
3      Logistic Regression      0.88      0.89    0.86       0.89
4      Random Forest            0.94      0.92    0.93       0.90
5      Extreme Gradient Boost   0.86      0.87    0.93       0.87
6      Support Vector Machine   0.70      0.70    0.55       0.70

Figure 2. Comparative Analysis

Figure 3. Different classifiers' ROC (Receiver Operating Characteristic) curves

V. RESULTS AND DISCUSSIONS
The algorithm's predicted label for each instance is stored in the 'predicted' class, while the actual
label is stored in the 'actual' class. A confusion matrix summarizes this two-category contingency
table. If the model correctly predicts a heart attack, it is a "True Positive (TP)" case; if it
predicts a heart attack that did not occur, it is a "False Positive (FP)" case. A "True Negative (TN)"
case is one where the model predicted no heart attack and the instance had none, while a "False
Negative (FN)" case is one where the model did not predict a heart attack but the instance had one. FP
and FN are referred to as Type I and Type II errors, respectively. Random Forest and KNN had the fewest
misclassifications, consistent with their accuracies in Table I. The different performance metrics of
the classification models are compared in the column graphs of Figure 2. Accuracy is the proportion of
all instances, positive and negative, that are classified correctly. Precision is the proportion of
predicted positives that are true positives, and recall is the proportion of actual positives that are
correctly identified.
The plots show that Random Forest has the highest accuracy (0.94), recall (0.92), and F1-score (0.90),
and shares the highest precision (0.93) with Extreme Gradient Boost. Logistic Regression has the
second-highest recall, at 0.89. On the other hand, the Support Vector Machine has the weakest accuracy,
precision, recall, and F1-score when it comes to predicting heart attacks. The outcomes of the various
models are listed in Table I. These results include each model's accuracy (its capacity to correctly
forecast or diagnose the condition), recall (its capacity to recognize the positive class), precision,
and F1-score. The Support Vector Machine produced the least accurate results, with an accuracy of 70.49
percent, recall and F1-score of 70 percent, and precision of 55 percent.
To illustrate the usefulness of a binary classifier in illness detection, the Receiver Operating
Characteristic (ROC) curve plots the True Positive rate against the False Positive rate of each
classifier across decision thresholds. This curve can help identify a cut-off value that balances
sensitivity (recall) and specificity. Compared with the other five classifiers, KNN had the best ROC
curve, which indicates a higher true positive rate for a given false positive rate. The SVM classifier,
on the other hand, demonstrated the lowest accuracy and was found to be the least suitable solution for
the modeling framework.
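How an ROC curve and its area (AUC) are obtained from classifier scores can be sketched as follows. The scores and labels are invented for illustration, the function names are hypothetical, and the sketch assumes both classes are present in the labels.

```python
def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each step admits more predicted positives.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Made-up scores for six instances (1 = heart attack).
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(actual, scores)
area = auc(curve)
```

In this toy example eight of the nine positive-negative pairs are ranked correctly, so the area is 8/9. A curve closer to the top-left corner gives a larger area.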

VI. CONCLUSIONS
The proposed AI-based model for heart attack prediction is a noninvasive technique. Feature engineering
was combined with six classification techniques, namely Decision Tree, KNN, Extreme Gradient Boost,
Logistic Regression, Random Forest, and SVM, to investigate the key characteristics. The results
obtained from the proposed approach appear to be interesting and acceptable. The ROC curves and AUC
values are also interpreted to achieve the objective. The classifiers were evaluated using a wide
variety of metrics. The model is also cross-validated to eliminate biased results. We achieved 94.2
percent accuracy using the Random Forest classifier, which proved to be the best. A graphical user
interface was developed for the prediction of heart attacks. A web application for multi-patient
monitoring was developed to monitor and acknowledge critical patient parameters automatically.
Additional features can be added to the work so that it runs on a larger database and produces more
accurate results. Other diseases and disorders can also be predicted using similar methods.

REFERENCES
[1] Li, J., Haq, A., Din, S., Khan, J., Khan, A. & Saboor, A. Heart disease identification method using machine learning
classification in e-healthcare. IEEE Access. 8 pp. 107562-107582 (2020)
[2] Furst, B. & Furst, B. Functional morphology of the heart. The Heart and Circulation: An Integrative Model. pp. 97-120
(2020)
[3] Yamabayashi, C. & Reid, W. Anatomy and Physiology of the Respiratory and Cardiovascular Systems.
Cardiopulmonary Physical Therapy. pp. 3-24 (2024)
[4] Mathur, P., Srivastava, S., Xu, X. & Mehta, J. Artificial intelligence, machine learning, and cardiovascular disease.
Clinical Medicine Insights:Cardiology. 14 pp. 1179546820927404 (2020)
[5] Kumar, N. & Kumar, D. Machine learning based heart disease diagnosis using non-invasive methods: A review. Journal
Of Physics: Conference Series. 1950, 012081 (2021)
[6] Sharma, O. Prediction and Analysis of Heart Attack using Various Machine Learning Algorithms. 2023 International
Conference on Artificial Intelligence and Smart Communication (AISC). pp. 786-790 (2023)

[7] Wang, M., Yao, X. & Chen, Y. An imbalanced-data processing algorithm for the prediction of heart attack in stroke
patients. IEEE Access. 9 pp. 25394-25404 (2021)
[8] Wu, W., Li, Y., Feng, A., Li, L., Huang, T., Xu, A. & Lyu, J. Data mining in clinical big data: the frequently used
databases, steps, and methodological models. Military Medical Research. 8 pp. 1-12 (2021)
[9] Kumar, Y., Koul, A., Sisodia, P., Shafi, J., Verma, K., Gheisari, M. & Davoodi, M. Heart failure detection using
quantum-enhanced machine learning and traditional machine learning techniques for internet of artificially intelligent
medical things. Wireless Communications and Mobile Computing. 2021, 1616725 (2021)
[10] Sharma, A. & Mishra, P. Performance analysis of machine learning based optimized feature selection approaches for
breast cancer diagnosis. International Journal of Information Technology. 14, 1949-1960 (2022)
[11] Mishra, S., Mallick, P., Tripathy, H., Bhoi, A. & González-Briones, A. Performance evaluation of a proposed machine
learning model for chronic disease datasets using an integrated attribute evaluator and an improved decision tree
classifier. Applied Sciences. 10, 8137 (2020)
[12] Lakshmi, N. & Rout, R. An 8-Layered MLP Network for Detection of Cardiac Arrest at an Early
Stage of Disease. Artificial Intelligence and Data Science: First International Conference, ICAIDS 2021, Hyderabad,
India, December 17–18, 2021, Revised Selected Papers. pp. 306 (2022)
[13] Katarya, R. & Meena, S. Machine learning techniques for heart disease prediction: a comparative study and analysis.
Health And Technology. 11, 87-97 (2021)
[14] Izonin, I., Tkachenko, R., Shakhovska, N., Ilchyshyn, B. & Singh, K. A two-step data normalization approach for
improving classification accuracy in the medical diagnosis domain. Mathematics. 10, 1942 (2022)
[15] Hemanth Kumar, H., Gowramma, Y., Manjula, S., Anil, D. & Smitha, N. Comparison of various ML and DL Models
for Emotion Recognition using Twitter. 2021 Third International Conference On Intelligent Communication
Technologies and Virtual Mobile Networks (ICICV). pp. 1332-1337 (2021)
[16] N, S., Singh, A., Ghosh, A., Kumari, A., R, T., Manjula, S. & K. R, V. Early Prediction of Sepsis using ML Algorithms
on Clinical Data. 2023 14th International Conference on Computing Communication And Networking Technologies
(ICCCNT). pp. 1-8 (2023)
[17] Shivakumar, B., Nagaraja, B. & Thimmaraja Yadava, G. Classification Performance Analysis of CART and ID3
Decision Tree Classifiers on Remotely Sensed Data. International Conference On VLSI, Signal Processing, Power
Electronics, IoT, Communication and Embedded Systems. pp. 89-107 (2022)
[18] Ramesh, T., Lilhore, U., Poongodi, M., Simaiya, S., Kaur, A. & Hamdi, M. Predictive analysis of heart diseases with
machine learning approaches. Malaysian Journal of Computer Science. pp. 132-148 (2022)
[19] Raj, S., Vani, R., Raja, B., Harsha, T., Drakshayani, T. & Charith, R. HEART DISEASE DETECTION USING XGB-
CLASSIFIER AND FAILURE PREDICTION USING GRADIENT BOOSTING. Journal Of Nonlinear Analysis and
Optimization. 15 (2024)
[20] Anil, D. & Suresh, S. Predicting Early Reviewers on E-Commerce Websites. 2022 IEEE 3rd Global Conference for
Advancement in Technology (GCAT). pp. 1-5 (2022)
[21] Mathapati, S., Anil, D., Tanuja, R., Manjula, S. & Venugopal, K. CNSM: cosine and n-gram similarity measure to
extract reasons for sentiment variation on Twitter. International Journal Of Computer Engineering And Technology
(IJCET) IAEME Journal. 9, 150-161 (2018)
