0% found this document useful (0 votes)
22 views

Group 6

The document describes a project report on predicting heart disease using machine learning. It includes an abstract, table of contents, and sections on literature review, system analysis, design, coding, testing, results and analysis. Various machine learning algorithms for heart disease prediction are discussed, including SVM, Naive Bayes, KNN, decision trees, random forest, logistic regression, and neural networks.

Uploaded by

ishan13dwivedi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views

Group 6

The document describes a project report on predicting heart disease using machine learning. It includes an abstract, table of contents, and sections on literature review, system analysis, design, coding, testing, results and analysis. Various machine learning algorithms for heart disease prediction are discussed, including SVM, Naive Bayes, KNN, decision trees, random forest, logistic regression, and neural networks.

Uploaded by

ishan13dwivedi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 68

Heart Disease Prediction

A Project Report Submitted in Partial Fulfillment of


the Requirements for the Degree of

Bachelor of Technology
in
Information Technology
by
Shivanshu Shukla (1903480130054)
Pankaj Kumar (1903480130041)
Navneet Kushwaha (1903480130037)
Harshdeep Singh (1903480130025)

Under the Supervision of


Mr. Neeraj Kumar Bharti
(Assistant Professor)
PSIT COLLEGE OF ENGINEERING, KANPUR
to the

Faculty of Information Technology


Dr. A.P.J. Abdul Kalam Technical University, Lucknow
(Formerly Uttar Pradesh Technical University)
May, 2023
DECLARATION

I hereby declare that this submission is my own work and that, to the best of my
knowledge and belief. It contains no matter previously published or written by any
other person nor material which to substantial extent has been accepted to the award
of any degree or diploma of the university or other institute of higher learning except
where due acknowledge has been made in the text.

Signature:
Name: Shivanshu Shukla
Roll No: 1903480130054
Date:

Signature:
Name: Pankaj Kumar
Roll No: 1903480130041
Date:

Signature:
Name: Harshdeep Singh
Roll No: 1903480130025
Date:

Signature:
Name: Navneet Kushwaha
Roll No: 1903480130037
Date:

ii
ACKNOWLEDGMENT

It gives us a great sense of pleasure to present the report of B.Tech. Project “Heart
Disease Prediction using Machine Learning” undertaken during B.Tech. Final
Year. We owe special debt of gratitude to our project guide Mr. Neeraj Kumar
Bharti (Assistant Professor, CSE), PSIT College of Engineering Kanpur for his
constant support and guide throughout course our work his sincerity, thoroughness
and perseverance have been a constant source of inspiration for us .It is only his
cognizant efforts that our endeavors has seen light of the day.
We also do not like to miss the opportunity to acknowledge the contribution of all
faculty member of the department for their kind assistance and cooperation during
the development of our project. Last but not the least, we acknowledge our friends for
their contribution in the completion of the project.

Signature:
Name: Shivanshu Shukla
Roll No: 1903480130054
Date:

Signature:
Name: Pankaj Kumar
Roll No: 1903480130041
Date:

Signature:
Name: Harshdeep Singh
Roll No: 1903480130025
Date:

Signature:
Name: Navneet Kushwaha
Roll No: 1903480130037
Date:

iii
CERTIFICATE

This is to certify that the project titled “Heart Disease Prediction using Machine
Learning” is which is submitted by
 Shivanshu Shukla (1903480130054)
 Pankaj Kumar (1903480130041)
 Harshdeep Singh (1903480130025)
 Navneet Kushwaha (1903480130037)
in partial fulfillment of the requirement for the award of the degree of Bachelor of
Technology in Computer Science and Engineering to PSIT College of engineering,
Kanpur, affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow during
the academic year 2022-23, is the record of candidate‟s own work carried out by
him/her under my supervision. The matter embodied in this report is original and has
not been submitted for the award of any other degree.

Mr. Abhay Kumar Tripathi Mr. Neeraj Kumar Bharti


(Head of Dept.,CSE) (Assistant Professor,Dept. of CSE)

iv
Heart Disease Prediction Using Machine Learning
Mr. Neeraj Kumar Bharti
(Assistant Professor)
Shivanshu Shukla Harshdeep Singh Pankaj Kumar Navneet Kushwaha

ABSTRACT

Heart disease prediction using machine learning is an active area of research that aims
to develop models to predict the risk of developing heart disease in individuals.
Machine learning models are trained using various demographic, clinical, and
lifestyle data, such as age, gender, blood pressure, cholesterol levels, smoking status,
and family history of heart disease, to predict the likelihood of developing heart
disease in the future.

The prediction of heart disease using machine learning involves several steps,
including data preprocessing, feature selection, model training, and model evaluation.
Various machine learning algorithms, such as logistic regression, decision trees,
random forests, support vector machines, and neural networks, can be used to build
predictive models. The accuracy and reliability of heart disease prediction models
depend on several factors, including the quality and quantity of data used for training,
the selection of relevant features, and the choice of an appropriate machine learning
algorithm. Heart disease prediction using machine learning has the potential to
improve early detection and prevention of heart disease, which can ultimately reduce
the mortality and morbidity associated with this disease. However, further research is
needed to develop more accurate and reliable prediction models that can be used in
clinical practice.

Keywords – Machine Learning, Logistic regression, Decision Tree, Prediction


system, Heart disease, Data, training, Model, Algorithms

v
TABLE OF CONTENT

TITLE i
DECLARATION ii
ACKNOWLEDGMENT iii
CERTIFICATE iv
ABSTRACT v
TABLE OF CONTENT vi
LIST OF TABLES viii
LIST OF FIGURES ix
LIST OF ABBREVIATIONS xi

1.INTRODUCTION 1
1.1 OVERVIEW 2
1.2 DATA ANALYTICS IN HEART DISEASE PREDICTION 3
1.3 MOTIVATION 5
1.4 PROPOSED RESEARCH 6
1.5 REPORT OBJECTIVES 6
1.6 REPORT ORGANISATION 7
2.LITERATURE STUDY 8
3.SYSTEM ANALYSIS 10
3.1 OVERVIEW OF THE SYSTEM 11
3.2 ADVANTAGES OF PROPOSED SYSTEM 12
3.3 MACHINE LEARNING ALGORITHMS 12
3.3.1 SVM 13
3.3.2 BAYES 15
3.3.3 KNN 16
3.3.4 DECISION TREE 17
3.3.5 RANDOM FOREST 18
3.3.6 LOGISTIC REGRESSION 19
3.3.7 ANN 20
3.4 DEEP NEURAL NETWORK 21
4.SYSTEM DESIGN 23

vi
4.1 SYSTEM ARCHITECTURE 23
4.2 DATA FLOW DIAGRAM 24
5. CODING AND TESTING 28
5.1 SOFTWARE REQUIREMENTS 28
5.2 OBJECTIVES AND TYPES OF TESTING 29
5.2.1 UNIT TESTING 29
5.2.2 INTEGRATION TESTING 29
5.2.3 FUNCTIONAL TESTING 30
5.2.4 SYSTEM TESTING 30
5.2.5 WHITEBOX TESTING 30
5.2.6 BLACKBOX TESTING 30
5.3 TESTING IN MACHINE LEARNING 31
5.4 TEST CASES 32
6.RESULTS AND ANALYSIS 36
6.1 DATA SET 36
6.2 METRICS FOR PERFORMANCE ANALYSIS 41
6.3 PERFORMANCE OF MACHINE LEARNING ALGORITHS 42
6.3.1 SVM RESULTS 42
6.3.2 NAÏVE BAYES RESULTS 43
6.3.3 KNN RESULTS 43
6.3.4 DECISION TREE RESULTS 44
6.3.5 RANDOM FOREST RESULTS 45
6.3.6 LOGISTIC REGRESSION RESULTS 46
6.3.7 ARTIFICIAL NEURAL NETWORK RESULTS 47
6.3.8 DEEP NEURAL NETWORK RESULTS 48
6.4 RESULTS FROM TEST DATA SET 48
CONCLUSION 51
FUTURE WORK 53
REFERENCES 54
PLAGIRISM REPORT 56
AUTHOR’S DETAILS 57

vii
LIST OF TABLES

TABLE No TABLE NAME PAGE NO.

1.1 Different types of heart diseases 2

5.1 Test cases 33

6.1 Data Set 37

6.2 Confusion Matrix 41

viii
LIST OF FIGURES

FIGURE NO. FIGURE NAME PAGE NO.

Fig 3.1 Possible Hyperplane 13

Fig 3.2 Hyperplane with the maximum margin 14

Fig 3.3 Naïve Bayes 15

Fig 3.4 KNN Clustering 16

Fig 3.5 Decision tree 17

Fig 3.6 Random forest with two trees 18

Fig 3.7 Sigmoid function graph 19

Fig 3.8 ANN 20

Fig 3.9 DNN 22

Fig 4.1 System Architecture 23

Fig 4.2 Data Flow Diagram 24

Fig 6.1 Data set 37

Fig 6.2 SVM Results 42

Fig 6.3 Naïve Bayes results 43

Fig 6.4 KNN results 44

Fig 6.5 Decision tree results 44

Fig 6.6 Random forest results 45

Fig 6.7 Logistic regression results 46

ix
Fig 6.8 ANN results 47

Fig 6.9 DNN results 47

Fig 6.10 Percentage of men and women in dataset 49

Fig 6.11 Risk of men and women having a heart attack 49

Fig 6.12 Heart attack in men out of total men 50

Fig 6.13 Heart attack in women out of total women 50

x
LIST OF ABBREVIATIONS

DNN Deep Neural Network

ANN Artificial Neural Network

SVM Support Vector Machine

KNN K Nearest Neighbors

CVD Cardiovascular Diseases

MCC Maxwell‟s Correlation Coefficient

HDFS Hadoop Distributed File System

ECG Electrocardiogram

BP Blood Pressure

WEKA Waikato Environment for Knowledge Analysis

CNN Convolutional Neural Network

SMO Sequential Minimal Optimization

TP True Positive

TN True Negative

FP False Positive

FN False Negative

CVD Cardiovascular diseases

EKG Electrocardiogram

ReLU Rectified Linear Unit

xi
CHAPTER 1
INTRODUCTION

1.1 OVERVIEW

Heart disease is a leading cause of mortality worldwide. Detecting heart disease early
is crucial for the effective management of the disease and prevention of adverse
outcomes. In recent years, machine learning techniques have gained attention in
healthcare as a promising tool for accurate diagnosis and prediction of diseases.

Machine learning algorithms use statistical models to identify patterns in data and
learn from them. The algorithms can then use these patterns to make predictions or
classify new data. In the case of heart disease detection, machine learning algorithms
can be trained on large datasets of medical records, imaging data, and clinical
information to predict the likelihood of a patient having heart disease.

There are several approaches to heart disease detection using machine learning. One
approach is to use supervised learning algorithms, such as logistic regression or
support vector machines, to classify patients as either having or not having heart
disease based on their medical records and risk factors such as age, gender, blood
pressure, cholesterol levels, smoking history, and family history. The algorithms are
trained on labeled data, where the outcome of interest (heart disease or no heart
disease) is known, and then tested on new, unlabeled data to evaluate their accuracy.

Another approach is to use unsupervised learning algorithms, such as clustering or


principal component analysis, to identify patterns in the data without prior knowledge
of the outcome of interest. These algorithms can be used to group patients based on
their clinical features, which can then be used to identify subgroups of patients at high
risk of heart disease or to develop personalized treatment plans.
.
Deep learning techniques, such as convolutional neural networks or recurrent neural
networks, can also be used for heart disease detection. These techniques are
1
particularly useful for processing large imaging datasets, such as MRI or CT scans,
and can extract complex features that may not be detectable by traditional machine
learning algorithms.

In addition to detecting heart disease, machine learning can also be used to predict the
risk of cardiovascular events, such as heart attacks or strokes. Predictive models can
be developed using machine learning algorithms that incorporate a wide range of
patient data, including medical history, lifestyle factors, and genetic information.
These models can be used to identify.

Table 1.1: Different types of heart diseases

Arrythmia The irregular beating of heart. It is either too slow or too


fast.

Congestive heart failure The heart doesn‟t pump blood as it normally should
which leads to be a chronic disease.

Cardiac arrest An unexpected loss of consciousness, heart function, and


breathing that occurs suddenly.

Congenital heart disease The abnormality in the heart that developsbefore birth.

2
1.2 DATA ANALYTICS IN HEART DISEASE PREDICTION

Several research papers have explored the use of data analytics and data mining
algorithms to analyze various datasets and extract patterns for predicting the
occurrence of heart diseases. Among the commonly employed algorithms are Support
Vector Machines (SVM), Logistic Regression, Naive Bayes, K-Nearest Neighbors
(KNN), and decision trees. SVM and Logistic Regression algorithms have been
recognized for their ability to deliver more accurate results. Additionally, some
studies have incorporated Hadoop-based MapReduce and HDFS algorithms to
distribute data storage across multiple nodes and enable parallel system processing.
Typically, the datasets utilized in these studies have focused on attributes such as
blood pressure, heart rate, and an age group predominantly over 30 years old. It is
worth noting that while existing research papers have extensively employed the
aforementioned algorithms, they have not yet explored the application of deep
learning techniques, which have shown potential for achieving higher accuracy rates.
Our system, in contrast, incorporates a deep neural networkalongside these algorithms
and incorporates additional attributes such as cholesterol levels, presence of angina,
age, and gender to enhance the precision of predictions.

Heart disease is a global health concern, and early prediction of cardiovascular


conditions plays a vital role in improving patient outcomes. With the advancements in
machine learning, predictive models have gained prominence in accurately
identifying individuals at risk of heart disease. This article explores the significance
of data analysis in heart disease prediction using machine learning techniques,
highlighting the key steps involved and the impact of high-quality data on model
performance.

Data analysis is a fundamental step in developing robust machine learning models for
heart disease prediction. It involves extracting insights, identifying patterns, and
making informed decisions based on the available data. Proper analysis enables the
identification of relevant features and relationships, which form the foundation for
training accurate and reliable predictive models.

3
The first step in data analysis for heart disease prediction is the collection and
preprocessing of relevant datasets. These datasets typically include a combination of
clinical, demographic, and lifestyle factors, along with medical test results. Ensuring
data quality is essential, as inaccurate or incomplete data can lead to biased or
erroneous predictions. Preprocessing techniques such as data cleaning, normalization,
and handling missing values are applied to ensure the dataset is suitable for analysis.

Exploratory Data Analysis (EDA) is a critical phase in data analysis, enabling


researchers to gain a deeper understanding of the dataset's characteristics. EDA
involves techniques such as data visualization, statistical summaries, and correlation
analysis to identify patterns, outliers, and relationships between variables.
Visualizations, such as histograms, scatter plots, and box plots, provide insights into
the distribution and relationships among the features, aiding in feature selection and
engineering.

Feature selection is the process of identifying the most relevant features from the
dataset that contribute significantly to the prediction task. Techniques like correlation
analysis, mutual information, and recursive feature elimination are commonly
employed to select the optimal subset of features. Feature engineering involves
transforming and creating new features based on domain knowledge and data
characteristics. For example, creating interaction terms, binning continuous variables,
or encoding categorical variables can enhance the predictive power of the model.

Once the dataset is analyzed and the features are selected or engineered, the next step
is to choose an appropriate machine learning model for heart disease prediction.
Various algorithms, including decision trees, random forests, support vector
machines, and neural networks, have been applied in this context. The choice of the
model depends on factors such as the dataset size, complexity, interpretability, and
the desired performance metrics.

To evaluate the performance of the chosen model, various evaluation metrics such as
accuracy, precision, recall, F1-score, and area under the receiver operating
characteristic curve (AUC-ROC) are utilized. Cross-validation techniques like k-fold
cross-validation help estimate the model's performance on unseen data, mitigating
4
issues such as overfitting or underfitting. Additionally, techniques like
hyperparameter tuning further optimize the model's performance by finding the best
combination of hyperparameters.

Validating the predictive model is crucial to ensure its generalizability and reliability.
External validation, using independent datasets or real-world scenarios, helps assess
the model's performance in different populations and settings. Moreover, interpreting
the model's predictions is essential for understanding the factors contributing to the
risk of heart disease. Techniques like feature importance, partial dependence plots,
and SHAP (Shapley Additive Explanations) values provide insights into the model's
decision-making process and aid in clinical interpretation.

Data analysis in heart disease prediction using machine learning presents certain
challenges. Limited access to diverse and comprehensive datasets, data imbalance,
and interpretability issues are some of the key hurdles faced in this field. Addressing
these challenges

1.3 MOTIVATION

It is widely recognized that patients diagnosed with heart disease typically undergo
various tests such as ECG and EKG. However, these tests are usually conducted only
when individuals experience chest pain or other symptoms associated with heart
disease. In the modern world, wearable devices capable of monitoring vital signs like
pulse rate and blood pressure have become increasingly prevalent. It is important to
note that the risk of developing heart disease is not limited to individuals above the
age of 40. The current generation faces significant stress and pressure due to work and
other factors. Therefore, there is an urgent need to analyze physiological parameters
and assess the possibility of heart disease before a heart attack occurs. To address
this, research has been conducted using machine learning algorithms such as
Regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Naive
Bayes, Decision Trees, and more for the prediction of heart disease.
In light of recent advancements in deep learning, we propose the development of a
heart disease prediction system that surpasses other machine learning algorithms in

5
terms of accuracy. Our approach incorporates thirteen physiological parameters, with
a focus on crucial factors such as heart rate, age, and sex, to achievethe highest level of
accuracy in predicting heart disease.

1.4 PROPOSED RESEARCH

Based on the aforementioned motivation, we propose the development of a deep


learning-based prediction system for heart diseases utilizing a historical dataset of
physiological data. Previous research has made significant progress in predicting
heart diseases using machine learning algorithms such as Support Vector Machines
(SVM), Regression, Naive Bayes, K-Nearest Neighbors (KNN), and even Artificial
Neural Networks. Most existing systems have focused on utilizing two parameters,
namely heart rate and age, for prediction. However, our work stands out by
incorporating thirteen parameters, including cholesterol levels, blood pressure, and
heart rate, which provide unique insights for heart disease prediction.
Our system leverages a deep learning framework known as the Deep Artificial Neural
Network. It utilizes a physiological dataset categorized by age group and sex,
encompassing thirteen parameters such as cholesterol levels, resting blood pressure,
resting electrocardiogram (ECG) results, maximum heart rate, age, sex, chest pain,
fasting blood sugar, exercise-induced angina, number of major vessels, slope of peak
exercise, thalassemia, and depression. These parameters are compared with other
machine learning algorithms including SVM, Naive Bayes, KNN, Decision Tree,
Random Forest, Logistic Regression, and Artificial Neural Networks for heart disease
prediction, taking into account accuracy and error metrics.

1.5 RESEARCH OBJECTIVES

The major objective of our project work is focused on the following


 Collection of heart disease related data set and preprocessed

 Validation of SVM, Regression, Bayes, KNN algorithm for heart disease


prediction on thedataset.

6
 Validation of random forest and decision tree on the given set of data for heart
diseaseprediction

 Validation of ANN and Deep Neural Network on data set for prediction of heart
disease.

 Accuracy as well as error computation on above mentioned algorithm

 Comparative analysis of algorithm in terms of accuracy

1.6 REPORT ORAGANISATION

This report provides an overview of the proposed system, covering various aspects.
Chapter 1 presents a basic introduction to the project, highlighting the utilization of
data analytics in heart disease prediction and outlining our research objectives. In
Chapter 2, we discuss relevant research papers that have contributed valuable insights
and served as a foundation for our work.
Moving on to Chapter 3, we analyze both the current system and the proposed
system, providing an overview of how the project functions and outlining its
objectives. Additionally, this chapter details the algorithms utilized in the project. In
Chapter 4, we describe the dataset employed, including its attributes, the workflow of
the system, and the system architecture.
Chapter 5 covers the hardware and software requirements necessary for the project,
along with the testing procedures. Chapter 6 presents the results obtained and the
accuracy of our project.

7
CHAPTER 2
LITERATURE REVIEW

In a study conducted by Polaraju, the prediction of heart disease was carried out using
Multiple Regression[1]. The dataset comprised 3000 entries with 13 attributes, which
were divided into training (70%) and testing (30%) sets. The results indicated that
Multiple Linear Regression yielded higher accuracy compared to other algorithms.

Marjia focused on heart disease prediction using algorithms such as SMO, j48, KStar,
Multilayer Perception, and Bayes Net with the aid of the WEKA tool[2]. Through k-
fold cross- validation, it was observed that Bayes Net and SMO performed optimally,
while the other algorithms did not yield satisfactory results. Consequently, efforts
were made to improve accuracy for enhanced diagnostic decisions.

S. Seema conducted a study on predicting general chronic diseases, including heart


disease[3]. The hospital health records were utilized, and algorithms like Decision
Tree, Support Vector Machine (SVM), Artificial Neural Network (ANN), and Naive
Bayes were employed. SVM was found to provide the highest accuracy overall, while
Naive Bayes achieved the highest accuracy specifically for diabetes prediction.

In a research by Ashok Kumar Dwivedi, various algorithms such as SVM,


Classification Tree, Logistic Regression, Naive Bayes, KNN, and ANN were
employed for heart disease prediction. Among these, Logistic Regression achieved
the highest accuracy rate[4].

Megha Shahi utilized data mining techniques for heart disease prediction. WEKA
tool was used to diagnose heart disease and improve service quality in healthcare
centers[5]. Algorithms such as SVM, KNN, Naïve Bayes, ANN, Association Rule,
and Decision Tree were applied. The results demonstrated that SVM outperformed
other algorithms in terms.
8
In a study conducted by, the occurrence rate of heart disease was predicted and
analyzed using data mining techniques[6]. The primary objective was to enable
automatic diagnosis of heart diseases in a timely manner. Factors such as blood sugar,
age, heart rate, and sex were utilized to predict the likelihood of a person having heart
disease. The analysis of data was performed usingthe WEKA tool.

Sharmila employed non-linear algorithms for heart disease classification. Big data
tools like MapReduce and HDFS, along with SVM, were utilized with an optimized
set of attributes for predicting heart diseases[7]. HDFS was used to store large
datasets in separate nodes, and SVM was applied in a parallel fashion, leading to
optimal computational time compared to sequential usage.

Jayami Patel proposed the utilization of machine learning and data mining
algorithms for heart disease prediction[8]. The objective was to uncover hidden
patterns through data mining techniques. The results indicated that J48 achieved the
highest accuracy rate among the algorithms tested, based on UCI data, surpassing
LMT
.
Purushottam employed data mining techniques for heart disease prediction, aiming to
support medical practitioners in making better decisions based on specific
parameters[9]. By training and testing a particular parameter, an accuracy rate of
86.3% during testing and 87.3% during trainingwas achieved.

Gomathi proposed the prediction of multiple diseases using data mining techniques.
The study focused on predicting diseases such as breast cancer, heart diseases, and
diabetes[10]. Data mining played a significant role in multi-disease prediction,
reducing the number of tests required.
.

9
CHAPTER 3
SYSTEM ANALYSIS

3.1 OVERVIEW OF THE SYSTEM

Heart disease prediction traditionally relies on patients visiting hospitals due to


symptoms like high pulse rate or chest pain. Diagnostic tests such as ECG and
Angiogram are conducted to detect heart blockages, and treatment options like
pacemakers and angioplasty are implemented accordingly.

However, with the increasing work pressure and stress in today's world, heart
problems are no longer restricted to older age groups. Even adults in their 20s, 30s,
and 40s are susceptible to heart diseases. Some individuals may even develop heart
conditions during childhood. Hence, there is a critical need for an automated medical
diagnosis system that utilizes physiological data like blood pressure, heart rate, blood
sugar, and cholesterol to predict the occurrence of heart disease.

Prior research has employed machine learning algorithms such as Regression, SVM,
Bayes, KNN, Decision Tree, and ANN for heart disease prediction. However, these
systems often consider only a limited number of parameters, such as heart rate and
age. Furthermore, the use of deep neural networks in this domain remains largely
unexplored. In our work, we have utilized a Deep Neural Network for heart disease
prediction, incorporating 13 physiological parameters. We have compared the
performance of our approach with other algorithms, including Regression, SVM,
Bayes, KNN, Decision Tree, Random Forest, and ANN, in termsof error and accuracy.

For our study, we obtained a publicly available dataset from Kaggle, which is
commonly used for predicting heart diseases. The input parameters used for data
analysis with machine learning algorithms are as follows:

10
1. age: which is taken in years

2.sex: 1 for male, 0 for female

3.cp: short for chest pain.

4.trestbps: blood pressuretaken when the body is resting

5.chol: level of cholesterol

6.Fasting blood sugar: (1 for true; 0 for false)

7.resting electrocardiographic resultsValue 0: normal

Value 1: having the ST-T wave which is not normal

Value 2: shows the probability of having left ventricular hypertrophy

8.thalach: the max level of heart beats achieved by the heart

9.exercise induced angina: 1 for yes; 0 for no

10.oldpeak = the depression by exercise compared to the one at rest

11.slope: peak of the ST segment during exerciseValue 1: no slope

Value 2: straight line Value 3: down sloping

12.ca: major vessels (0-3) coloured by fluoroscopy

13.thal: 3 = perfect; 6 = permanently defected; 7 = defect that can be altered

14.num: heart disease diagnosis.

11
3.2 ADVANTAGES OF THE PROPOSED SYSTEM

Our study differs from existing approaches by considering not only blood pressure
and heart rate but also other factors such as age, sex, angina, chest pain location, and
cholesterol, which are often overlooked in similar studies.
By incorporating a greater number of attributes, our system can create clusters within
the data based on different factors, allowing for more comprehensive analysis and
prediction.
Unlike other systems that generate alerts only when extreme values are reached, our
system aims to predict the likelihood of heart disease occurrence before it actually
happens, providing early detection and intervention.
Our system categorizes the data based on factors like age and sex, allowing for more
targeted andpersonalized predictions.
We have employed Deep Neural Network (DNN) in our system, which sets our
approach apart from other papers in the field of heart disease prediction. DNN has
been proven to have the highest accuracy among allmachine learning and data mining
algorithms.
Our system utilizes a comprehensive set of viable machine learning and data mining
algorithms, ensuring arobust and thorough analysis of the data.

3.3 MACHINE LEARNING ALGORITHMS

In this work, eight machine leaning algorithms which include Random Forest, SVM,
Decision Tree, Naïve Bayes, ANN, KNN, Logistic Regression and Deep Neural
Network was employed. These algorithms are used to unfold a prediction system
which will analyze and predict whether the particular patient is pertaining to any heart
disease or not with best accuracy.

12
3.3.1 Support Vector Machine

The main objective of support vector machine algorithm is finding a hyperplane that
distinctly plots and classifies data that is plotted in an N-dimensional space.

Figure 3.1: Possible hyperplanes [11]

13
Figure 3.2: Hyperplane with maximum margin [12]

The two classes are separated by choosing hyperplanes from the possible one. Our
major objective here, would be to find the plane containing the maximum margin of all
the planes,
i.e. finding the data points with max distance between the two classes. When we
maximize this margin, the distance provides us with more accurate classification when
future data points are plotted.

SVM are basically of two types:

1.Linear SVM: It given us a linear hyperplane that distinguishes the classes. The
objective for us is to maximize the distance of the hyperplane to the point of the
classes should be maximized.

Non-Linear SVM: It has a non-linear hyperplane, and depicts a graph which is closer
to the real-world scenario.

14
3.3.2 Naive Bayes

It is a machine learning technique that works on the strategy of the Bayes‟ Theorem.
It basically assumes that there would be no attributes dependent on each other. It is a
group of algorithms that have a common principle that every feature is independent of
the other. Bayes‟ Theorem tells us the probability of an event that will occur when
another event has already occurred. The mathematical equation is:

Probability(a|z) = (Probability(z|a) * Probability(a)) / Probability(z)Where

3.3.1.1 Probability(a|z): Gives us the probability of a (the hypothesis) gives the data .

3.3.1.2 Probability (z|a): Gives the probability of the data when the hypothesis is
true.

3.3.1.3 Probability (a): Regardless of the data, the hypothesis is said to be true.

3.3.1.4Probability (d): Regardless if the data, the probability of the hypothesis is


given.

Figure 3.3: Naïve bayes [13]

15
3.3.3 KNN

K nearest neighbors abbreviated as KNN is an algorithm that clusters data into classes
and then classifies it as per their similarity measures. Classification is based on the
majority of the votes to its neighbors. Data is assigned to the classes that have the
nearest neighbors. As we increase the number of nearest neighbors, I.e. the value of k,
the accuracy might increase. KNN is broadly used for pattern recognition and
statistical prediction.

It divides the data into clusters, based upon the distance from the nearest neighbor.

Figure 3.4: KNN clustering [14]

16
3.3.4 Decision Tree

Decision Tree is one of the supervised learning algorithms. In this the data is
continuously split based on a certain parameter after which we end up getting the
decision nodes and the leaves. What makes it different from the other supervised
algorithms is that is can also solve the regression and classification problems easily.
The main aim is to create a system that can predict the results that we desire just by
learning the decision rules from the prior data, i.e., the training set.

Figure 3.5: Decision tree [15]

17
3.3.5 Random forest

Random Forest, just as the name suggest creates a number of random forest due to
which it is also called the random decision forest. It is one of the supervised learning
algorithms. It builds random forests that are basically just a group of decision trees. It is
mostly trained with the bagging method as it is the most efficient. What the bagging
method does is, it combines all the learning models which in turn helps us with the
overall result. Just like decision trees, it can be used for both, regression and
classification problems.

Figure 3.6: Random forest with two trees [16]

3.3.6 Logistic Regression

Logistic Regression is rather a predictive analysis. It is used to analyze the relation


between the data among which one is a dependent variable and the other one is
independent. What logistic regression does is, it rounds off all the values of the
results to the closest binary value for the ease of prediction. This can sometimes cause
a few errors, and not predict very precisely. In order to remove this error, multiple
18
regression is used which helps in getting the closest to the true result. Logistic
Regression uses complex cost functions such as the „Sigmoid function‟. In machine
learning sigmoid is used to predict the probabilities.

Figure 3.7: Sigmoid Function Graph [17]

3.3.7 Artificial Neural Network (ANN)

An artificial neural network abbreviated as ANN is a model that works as the human
brain or neural network I.e. the neurons. It is called a computational model that
processes all the complex data. It isn‟t given any task specific goal, but it learns from
the examples or data that is given to it just like the brain. It is based on a collection of
nodes called the artificial neuron. More the no. Of neurons, better the system. The
neurons transmit signals from one to another making a proper connection which
resembles the human neural network. A neural network has the following 3 layers:

Input layer – It consists of the complex raw info that we feed to the neurons.
3.3.4.1 Hidden layer – These are the computational layers which take the input and
the weight of a node from the previous layers, processes it with the activation

19
function, and sends the output to the next layer.

3.3.4.1 Output layer – This depends of the output of the hidden layers, and the
functions taking place in there.

Figure 3.8: ANN [18]

The basic computational unit is the neuron. It receives inputs from the sources
provided, and each input carries a weight, which is given according to the relative
importance of all the other inputs. Then the function is applied to it as shown in
figure 3.8.

This function is called the Activation function which basically introduces non-
linearity to the function and sends it in the form of the output. The function used in
our module is the sigmoid

20
which takes inputs as real values and further fits them in the range of 0 to 1. The
equation is: σ(x)
= 1 / (1 + exp(−x))

3.4 DEEP NEURAL NETWORK

Deep Neural Network abbreviated as DNN, is a complex neural network consisting of


more than 2 layers. This algorithm uses high level mathematical models for data
processing in complex manner. Deep learning is derived from a vaster family of
networks of neural methods, such as CNNs. Deep learning can either be
unsupervised, supervised or semi- supervised.

Deep, Convolutional and recurrent neural networks are utilized in natural language
processing, computer vision, audio recognition, recognition, social network filtering,
bioinformatics, machine translation, image analysis, drug design, and material
inspection, where the outcomes were comparable/superior to some human
professionals.

Deep Neural Networks

 Use different layers of nonlinear units of processing for featured extraction


as well as transformation. Each of the successive layers uses the previous
layer‟s output as the input.

 Can either be supervised, unsupervised or semi-supervised

 learn different hierarchies of representations which relate to different


hierarchies ofabstractions.

 It consists of an input layer, an output layer and between many variable hidden
layers.

 Output of one layer goes as the input to other.

21
The activation function used in our module is the ReLU between hidden layers and
Softmax at the output layer which takes 0 or 1 .

Figure 3.9: DNN [19]

22
CHAPTER 4
SYSTEM DESIGN

This chapter gives the system architecture of our project towards Heart Disease
analysis using Machine Learning. Followed by the architecture, we give the Data
Flow Diagram and Data set used.

4.1 SYSTEM ARCHITECTURE

Figure 4.1: System Architecture

The above architecture shows only 3 machine learning algorithms. But our work here
uses a total of 8 algorithms. For the training set, we included 80% of the data, and for
the testing we used 20% of the data. We test the model with all the 8 algorithms and
check the accuracy. The one that best fits, and gives us the most accuracy is the one
which can be used to for further work in this field.

23
4.2 DATA FLOW DIAGRAM

Figure 4.2: Data flow diagramExplanation of the workflow of our system

1) The training data is 70% and is given supervised inputs and outputs.

2) The testing data is 30% and shows us how well the system is trained.

3) The dataset we have chosen consists of 13 attributes according to which various


algorithms perform their calculations and approximations.

4) The system starts with first pre-processing of the dataset we have fed to it.

5) It studies and analysis it, and then applies the required machine learning algorithm.

6) If it finds that the dataset is supervised, it will separate it into training data and
testingdata.

24
7) Otherwise it will stop.

8) The algorithms we are using, are all supervised.

9) After the application of algorithm, internal validation is done.

10) Accuracy is printed in the code itself.

11) Different accuracy is given for different algorithms.

12) We compare the accuracies of all the algorithms and the algorithm that gives the

Highest is the one, which is chosen for prediction, eventually.

13) In our system, the algorithm that gets the highest accuracy rate is DNN.

In the realm of machine learning, the validation process of models often yields
approximate results rather than exact ones. This unique characteristic distinguishes
software testing in machine learning from that in traditional systems. While software
testing plays a crucial role in any system development, the dynamic factors and
inherent complexity of machine learning models introduce new challenges and
opportunities. This article delves into the significance of software testing in machine
learning and how it embraces approximate results, which are best represented
statistically.

Machine learning models are designed to learn patterns and make predictions based
on available data. Unlike traditional systems where outputs are often deterministic,
machine learning models provide approximate results. These models adapt and
evolve over time, making their behavior more dynamic and less predictable.
Consequently, software testing in machine learning necessitates a different approach
to ensure accuracy and reliability.

Software testing holds a crucial position in machine learning, just as it does in


traditional systems. It serves to validate the model's performance, identify potential
25
errors or biases, and assess its overall effectiveness. However, due to the approximate
nature of machine learning results, testing in this domain requires a shift in
perspective.

Rather than seeking absolute correctness, testing in machine learning focuses on


statistical measures to evaluate model performance. This approach recognizes that
machine learning models provide relative or approximate results, which are still
valuable and meaningful within specific contexts. Statistical analysis becomes
instrumental in understanding the model's behavior and assessing its effectiveness.

Machine learning models operate in dynamic environments where data distribution,


patterns, and input characteristics can change over time. As a result, traditional testing
methodologies may fall short in capturing the model's adaptability and generalization
capabilities. To overcome this challenge, continuous testing becomes imperative,
ensuring the model's performance remains consistent amidst evolving conditions.

In machine learning testing, several techniques are employed to validate the models.
Cross-validation, for instance, assesses the model's performance on different subsets
of the data, providing insights into its generalization capabilities. A/B testing is
another valuable technique, enabling comparison between different versions of the
model to gauge improvements or identify potential issues.

Software testing in machine learning is an iterative process. As new data becomes


available, models are retrained and tested to ensure their ongoing reliability and
performance. This iterative approach enables the identification and rectification of
any discrepancies or biases, ultimately enhancing the model's accuracy and
robustness.

In addition to statistical evaluation, human interpretation plays a vital role in machine


learning testing. While the models provide approximate results, it is crucial to
understand and interpret these outcomes within the context of the problem at hand.
Furthermore, ethical considerations become paramount, as biases or unfair outcomes
could emerge from imperfect testing methodologies.

26
Software testing in machine learning is a critical and intricate task that ensures the
accuracy and reliability of models. While approximate results are inherent to machine
learning, embracing statistical measures allows for a comprehensive evaluation of
model performance. Understanding the dynamic factors at play, employing
appropriate validation techniques, and incorporating iterative testing contribute to
enhancing the model's accuracy and adaptability. By recognizing the importance of
software testing and its distinctive challenges in machine learning, we can harness the
power of these models responsibly and unlock their full potential in various domains.

27
CHAPTER 5
CODING & TESTING

5.1 SOFTWARE REQUIREMENTS

1. Anaconda- It is open source software available to us which enables us to easily


code in using python or R on different operating systems such as the windows, Linux,
and Mc OS. It has millions of users worldwide, and is well known as the industry
which helps us in developing systems, testing them, and training the machines. This
further enables us to:
Manage all the imported libraries, their dependencies, and the environments of
developingwith Anaconda.
In developing techniques to train our machine with Tensor Flow, scikit-learn, etc.
Analyze the datasets and manipulate the with Dask, NumPy, pandas, and Numba
Visualize or plot the results with Matplotlib, Holoviews, Bokeh, and Datashader.
It also provides us with jupyter notebooks which has all the in-built libraries
embedded in the already. This eases our coding stress, and also helps us code with
more efficiency.

2. Python: The most abundantly used general level programming language. It is used
for both a small scale and big scale systems. It can easily be interpreted. It is said to
support multiple programming paradigms. It includes features of procedural, object-
oriented, and functional programming together. It is already garbage-collected which
makes it more efficient.

3. Numpy- It is a python programming library. This basically helps us deal with large
datasets, matrices, and multi-dimensional arrays. It also provides us with a number of
mathematical functions which help us and ease the calculations. It is open-source
softwareavailable to all.

4. Pandas- It is a library that is written in python language. It helps us with the


analyzing of data. It also provides us with tools and functions to manipulate a large
amount of data.
28
5. Sklearn- It is a library used in machine learning in python programming language.
It mainly helps us with the classification of data, regression of models, and in
clustering algorithms. These algorithms include SVM, random forest etc.

6. Tensorflow- It is a highly know open-source software, which is free and available


to all. It is used for differentiable programming with a large number of tasks. It is
used against large data sets to help us with the dataflow, and its manipulation. It is a
basically a math library withvarious features.

5.2 OBJECTIVES AND TYPES OF TESTING

Testing is performed so that we can locate problems. It is the method of identifying


all possible faults of any system. Also, it acts as method to test the functionality of all
the individual components and the system as a whole. It is the process of verifying
that the hardware system performs its functionalities and meets user needs and
expectations. Testing ensures that even if the product fails, it fails in a controlled
condition that can be managed later. There are various types of test. Each of the tests
is used to address particular testing needs.

5.2.1 UNIT TESTING

This testing includes the creation of test scenarios that verify that program is working
in the way it supposed to. In this type of testing, the objective is to validate that each
module or unit is performing the tasks that it is supposed to do. A unit is the smallest
part that has to be tested.

5.2.2 INTEGRATION TESTING

These tests are made just to check the different parts after they have been integrated
to determine whether they are functioning properly as a unit or not. This type of
testing is driven by events and is more curious about the outcomes of the component
as a whole. Integration tests show that although the components are working properly
29
as individual but also the combination of components is also working properly.

5.2.3 FUNCTIONAL TESTING

These kinds of tests provide systematic demonstration of the functions tested and the
proof of their functioning being successful. Functional testing is targeted at a valid
input, invalid input, functions, output, and procedures. The preparation on functional
tests is focused around the requirements, key functionalities, different and extreme
test cases

5.2.4 SYSTEM TESTING

This type of tests tells us whether the whole system after all its parts have been joined
together meets the need of shareholders. It checks and confirms the known and
expected results. These kinds of testing are done to check whether the system delivers
what it was supposed to. It checks whether the all the parts work well with each other
or not.

5.2.5 WHITE BOX TESTING

It is done in a way such that the tester has information about all the components,
workings, structure and architecture of the hardware/software. It is used for deep
level testing to ensure that those places that are not accessible by a black box test are
covered in white box testing. It basicallymeans that there shall be no blind spots in the
system.

5.2.6 BLACK BOX TESTING

Here the testing that is done on the software/hardware by someone who has no
information on howto operate it. Before a product is developed some documentation is
done, they include requirement analysis document that has all the details about what a
product must be. Now these tests are based on those sources only. Imagine a black
colored box, you cannot see into it. You know nothing about what it contains. This is
30
exactly how the testing is done

5.3 TESTING IN MACHINE LEARNING

Very easily put, machine learning is a type of system/application which is based


upon models that use prediction and analysis. Building these systems is a hard task,
but testing and validating the entire system is even more difficult. Traditional
methods of testing are always based on stationary inputs. The output will also be
fixed as the input remains fixed, but in the case of machine learning, the system is
built along with the input, which will change or modify as we come to know more
and more about it, therefore unlike the traditional system, the output here will never
remain fixed. Hence, the strategies for testing a machine learning system have to be
a bit different.

Below are some of these strategies mentioned that can be used in testing the machine
learning systems:

1) Development of training datasets:

Here, a subset of dataset is used for training purpose, which means it is used to train
the system to obtain the given prediction. It is supervised, i.e., the output is given
alongwith the input.

2) Development of testing data sets:

The testing data set is a training data subset that is built in an intelligent way to
checkhow robustly our system has been trained and to check all the combinations that
are possible. The resultant model will be finely tuned according to the outputs of the
testing dataset.

31
3) Development of validation and correction test suite:

This is based on algorithms and test datasets. For eg, in our system scenarios
consist of clustering results based upon different factors/attributes and creating
profiles of riskdepending upon behaviors and demography.

1. Understanding the algorithm:


All of this is dependent upon calculations and how different algorithms use different
calculations. Some algorithms like regression based algorithms give some numeric
results regressively, or continuously. On the other hand, there are some algorithms
that calculate the results by dividing the outcomes into different parts or behaviors,
and then some algorithms create multiple layers between the input and output layer.
Continuous numeric variables such as return on investment.

2. Using Statistics for conveying the results


Outcomes are generally predicted in the form of working model, or a numeric value
or a working interface or something like that. But in machine learning, the use of
bars, histograms and other statistical diagrams is used to depict the prediction or
analysis ina more comparative or mathematical way.

The machine learning models will usually have approximate and not actual results
upon validation. In conclusion, software testing is as crucial and important a task in
machine learning, as is in any other traditional system, but unlike those systems, our
testing is based on more dynamic factors and generally will produce a relative or
approximate result which can be best shown statistically.

32
5.4 TEST CASE
Table 5.1: Test Cases
Sr. TEST TEST CASE EXAMPLE EXPECTED
No. CASE DESCRIPTION TEST CASE OUTPUT
NAME

1 Equivalence This divides the Our dataset can If valid Successful


Partitioning input data of a be if invalid
software unit divided into Unsuccessful.
into partition of classes on the
equivalent data basis
from which test Of the number
Cases can be mapped with
derived. state.

2 Boundary In this we We want to test If valid Successful


Value Focus on the The age to be If invalid
Analysis boundary various positive integer. Unsuccessful.
Valid and invalid
inputs.

3 Data input To check what Instead of The code must throw


happens when giving the input an error pertaining to
data is in putted in 0‟ s change of format of
In the wrong and1‟s.The data The data inputted.
format. Is fed into the
system as
alphabet. For
example Sex of
the persons.

33
In the realm of machine learning, the validation process of models often yields
approximate results rather than exact ones. This unique characteristic distinguishes
software testing in machine learning from that in traditional systems. While software
testing plays a crucial role in any system development, the dynamic factors and
inherent complexity of machine learning models introduce new challenges and
opportunities. This article delves into the significance of software testing in machine
learning and how it embraces approximate results, which are best represented
statistically.
Machine learning models are designed to learn patterns and make predictions based
on available data. Unlike traditional systems where outputs are often deterministic,
machine learning models provide approximate results. These models adapt and
evolve over time, making their behavior more dynamic and less predictable.
Consequently, software testing in machine learning necessitates a different approach
to ensure accuracy and reliability.
Software testing holds a crucial position in machine learning, just as it does in
traditional systems. It serves to validate the model's performance, identify potential
errors or biases, and assess its overall effectiveness. However, due to the approximate
nature of machine learning results, testing in this domain requires a shift in
perspective.
Rather than seeking absolute correctness, testing in machine learning focuses on
statistical measures to evaluate model performance. This approach recognizes that
machine learning models provide relative or approximate results, which are still
valuable and meaningful within specific contexts. Statistical analysis becomes
instrumental in understanding the model's behavior and assessing its effectiveness.
Machine learning models operate in dynamic environments where data distribution,
patterns, and input characteristics can change over time. As a result, traditional testing
methodologies may fall short in capturing the model's adaptability and generalization
capabilities. To overcome this challenge, continuous testing becomes imperative,
ensuring the model's performance remains consistent amidst evolving conditions.

In machine learning testing, several techniques are employed to validate the models.
Cross-validation, for instance, assesses the model's performance on different subsets
34
of the data, providing insights into its generalization capabilities. A/B testing is
another valuable technique, enabling comparison between different versions of the
model to gauge improvements or identify potential issues.

Software testing in machine learning is an iterative process. As new data becomes


available, models are retrained and tested to ensure their ongoing reliability and
performance. This iterative approach enables the identification and rectification of
any discrepancies or biases, ultimately enhancing the model's accuracy and
robustness.

In addition to statistical evaluation, human interpretation plays a vital role in machine


learning testing. While the models provide approximate results, it is crucial to
understand and interpret these outcomes within the context of the problem at hand.
Furthermore, ethical considerations become paramount, as biases or unfair outcomes
could emerge from imperfect testing methodologies.

Software testing in machine learning is a critical and intricate task that ensures the
accuracy and reliability of models. While approximate results are inherent to machine
learning, embracing statistical measures allows for a comprehensive evaluation of
model performance. Understanding the dynamic factors at play, employing
appropriate validation techniques, and incorporating iterative testing contribute to
enhancing the model's accuracy and adaptability. By recognizing the importance of
software testing and its distinctive challenges in machine learning, we can harness the
power of these models responsibly and unlock their full potential in various domains.

35
CHAPTER 6
RESULTS AND ANALYSIS

This chapter discusses the comparative analysis of different machine learning


towards the prediction of heart disease. In addition we also have given the graphical
representation of data set on basis of age, Serum cholesterol, BP and so on.

6.1 DATA SET

Our system uses a data set to predict the occurrence of heart disease by dividing it
into training data and test data: 70% of the data is used for training and the
remaining 30% for testing. The input in our case is the set of attributes that
correspond to factors leading to heart disease. The output is a binary digit
indicating whether a person is susceptible to heart disease or not.
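A minimal sketch of this 70/30 split with scikit-learn follows; as before, the file
name and label column are assumptions based on the Kaggle data set described in the
next section.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("heart.csv")        # assumed file name
    X = df.drop(columns=["target"])      # the input attributes
    y = df["target"]                     # 1 = susceptible, 0 = not susceptible

    # 70% of the rows for training, the remaining 30% held out for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42, stratify=y)
    print(len(X_train), "training rows;", len(X_test), "test rows")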

The algorithms used in the system for prediction are learning algorithms, whose
behaviour changes over time based on various input factors. Therefore, the results
might change as we learn even more about the data fed as input.


Table 6.1: Data set

For our system, we have taken a data set that was publicly available on Kaggle for
predicting heart disease. The parameters used as input for data analysis using the
machine learning algorithms are as follows (a short loading sketch is given after the list):

1. Age: taken in years

2. Sex: 1 for male, 0 for female

3. Cp: short for chest pain (type of chest pain experienced)

4. Trestbps: blood pressure taken when the body is resting

5. Chol: level of serum cholesterol

6. Fbs: fasting blood sugar (1 for true; 0 for false)

7. Restecg: resting electrocardiographic results (Value 0: normal; Value 1: having an abnormal ST-T wave; Value 2: showing probable left ventricular hypertrophy)

8. Thalach: the maximum heart rate achieved

9. Exang: exercise-induced angina (1 for yes, 0 for no)

10. Oldpeak: ST depression induced by exercise relative to rest

11. Slope: slope of the peak exercise ST segment (Value 1: upsloping; Value 2: flat; Value 3: downsloping)

12. Ca: number of major vessels (0-3) colored by fluoroscopy

13. Thal: 3 = normal; 6 = fixed (permanent) defect; 7 = reversible defect

14. Num: heart disease diagnosis (the target attribute)
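Assuming the usual Kaggle column names for these attributes (age, sex, cp, trestbps,
chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, target), the data set
can be loaded and inspected as in this sketch:

    import pandas as pd

    df = pd.read_csv("heart.csv")            # assumed file name
    print(df.head())                         # first few rows of the 14 attributes
    print(df.describe())                     # ranges of age, trestbps, chol, etc.
    print(df["target"].value_counts())       # class balance of the diagnosis label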

6.2 METRICS FOR PERFORMANCE ANALYSIS

The implementations of the machine learning algorithms are compared to identify the
best algorithm for predicting the occurrence of heart disease. Before the results, a
brief explanation about the confusion matrix, accuracy, and other terms is given below.

Confusion Matrix: It summarizes the performance of a classifier. A confusion matrix
can be of any size depending upon the number of distinct class labels in the task.
Since we have two labels, the confusion matrix in our case is a 2x2 matrix.

Table 6.2 Confusion Matrix

                     Predicted Positive    Predicted Negative
    Actual Positive          TP                   FN
    Actual Negative          FP                   TN

TP - True Positive, FN - False Negative, FP - False Positive, TN - True Negative

TP and TN denote the number of instances which have been correctly classified as
heart disease occurrence and no heart disease occurrence, respectively. FP and FN
signify the number of instances which have been wrongly classified as heart disease
occurrence and no heart disease occurrence, respectively.

Accuracy: The accuracy can be calculated with the help of the formula given below.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Precision: Termed the positive predictive value, it is calculated as given below.

Precision = TP / (TP + FP)

Recall: Also called sensitivity, it is calculated as given below.

Recall = TP / (TP + FN)

F1 Score: This metric takes into account both recall and precision and is calculated as below.

F1 score = 2 * (Precision * Recall) / (Precision + Recall)

MCC: The Matthews Correlation Coefficient considers all four cells of the confusion
matrix when calculated. MCC lies within the range -1 to +1, where +1 indicates
perfect prediction, 0 indicates performance no better than random guessing, and -1
indicates complete disagreement between prediction and observation. This makes the
metric really useful, as it is easy to interpret.

MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
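All of these metrics can be computed from a fitted model's test-set predictions with
scikit-learn's metric helpers, as in the sketch below; it assumes the X_train/X_test
split shown in Section 6.1 and any fitted classifier named model.

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 matthews_corrcoef, precision_score, recall_score)

    y_pred = model.predict(X_test)

    # Note: scikit-learn orders the matrix as [[TN, FP], [FN, TP]].
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1 score :", f1_score(y_test, y_pred))
    print("MCC      :", matthews_corrcoef(y_test, y_pred))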

6.3 PERFORMANCE OF MACHINE LEARNING ALGORITHMS
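Before the per-model figures, here is a hedged sketch of how the candidate
classifiers can be trained and scored on the same 70/30 split; the hyperparameters
are illustrative defaults, not necessarily the settings behind the numbers reported below.

    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    models = {
        "SVM": SVC(kernel="rbf"),
        "Naive Bayes": GaussianNB(),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }

    # Fit each model on the training split and report its test accuracy.
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        print(f"{name}: accuracy = {accuracy_score(y_test, clf.predict(X_test)):.4f}")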

6.3.1 SVM RESULTS

ACCURACY: 0.8852

RECALL: 0.8666

PRECISION: 0.8965

MCC: 0.84699

Fig 6.2 SVM Results

6.3.2 NAIVE BAYES RESULTS

ACCURACY: 0.8852

RECALL: 0.9

PRECISION: 0.8666

MCC: 0.21

Fig 6.3 Naïve Bayes Results

6.3.3 KNN RESULTS

ACCURACY: 0.639

RECALL: 0.4666

PRECISION: 0.76

MCC: 0.7465

Fig 6.4. KNN Results

6.3.4 DECISION TREE RESULTS

ACCURACY: 0.8033

RECALL: 0.7666

PRECISION: 0.821

MCC: 0.271

Fig 6.5.Decision Tree Result

6.3.5 RANDOM FOREST RESULTS

ACCURACY: 0.7734

RECALL: 0.8

PRECISION: 0.8125

MCC: 0.6607

Fig 6.6. Random Forest Results

6.3.6 LOGISTIC REGRESSION RESULTS

ACCURACY: 0.9016

RECALL: 0.8666

PRECISION: 0.9285

MCC: 0.6394

Fig 6.7 Logistic Regression Results

6.3.7 ANN RESULTS

ACCURACY: 0.8852

RECALL: 0.8666

PRECISION: 0.8965

MCC: 0.84699

Fig 6.8 ANN Results

6.3.8 DNN RESULTS

ACCURACY: 0.918

RECALL: 0.9033

PRECISION: 0.9655

MCC: 0.5320
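The DNN can be built with Keras, as in the hedged sketch below; the layer sizes,
epochs, and optimizer are assumptions for illustration, not necessarily the exact
architecture behind these numbers.

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense

    # A small feed-forward network over the 13 input attributes.
    dnn = Sequential([
        Dense(32, activation="relu", input_shape=(13,)),
        Dense(16, activation="relu"),
        Dense(1, activation="sigmoid"),   # output: probability of heart disease
    ])
    dnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    dnn.fit(X_train, y_train, epochs=50, batch_size=16, validation_split=0.1, verbose=0)

    loss, acc = dnn.evaluate(X_test, y_test, verbose=0)
    print(f"DNN test accuracy: {acc:.4f}")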

6.4 RESULTS FROM THE TEST DATA SET:

We have analyzed the data set with regard to heart disease on the basis of various
factors, which are sex, age, blood pressure, serum cholesterol, and so on. A sketch
of the plotting code is given below, followed by the resulting figures.
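A minimal sketch of how such plots can be generated with matplotlib follows; the
"sex" and "target" column names are assumptions carried over from the earlier sketches.

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("heart.csv")   # assumed file name, as before

    # Percentage of men and women in the data set (sex: 1 = male, 0 = female).
    df["sex"].map({1: "Male", 0: "Female"}).value_counts().plot(
        kind="pie", autopct="%1.1f%%")
    plt.ylabel("")
    plt.title("Percentage of men and women in the dataset")
    plt.show()

    # Fraction of each sex group with a positive heart-disease diagnosis.
    df.groupby("sex")["target"].mean().rename({0: "Female", 1: "Male"}).plot(kind="bar")
    plt.ylabel("Fraction with positive diagnosis")
    plt.title("Risk of heart disease by sex")
    plt.show()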

Fig 6.10: Percentage of men and women in our dataset

Fig 6.11: Risk of men and women having a heart attack

Fig 6.12: Analysis of positive heart attack in men out of total men

Fig 6.13: Analysis of positive heart attack in women out of total women

CONCLUSION

Heart disease is a significant global health issue that affects millions of individuals
worldwide. Early detection and prediction of heart disease play a vital role in
preventing adverse health outcomes and improving patient care. In this report, we
explored various methods and techniques used in heart disease prediction and
discussed their strengths, limitations, and potential future developments. Through a
comprehensive analysis, we have reached a conclusion regarding the prediction of
heart diseases.

Throughout this report, we examined several approaches for heart disease prediction,
including traditional risk factor assessment, machine learning algorithms, and genetic
profiling. Traditional risk factor assessment considers factors such as age, gender,
blood pressure, cholesterol levels, and smoking habits. While this method has been
widely used, it has limitations in accurately predicting individual risk due to its
reliance on population-based statistics.

Machine learning algorithms, on the other hand, have shown promise in improving
heart disease prediction. These algorithms utilize large datasets and complex
algorithms to identify patterns and create prediction models. They can incorporate a
wide range of features and variables to achieve higher accuracy than traditional risk
factor assessment. However, challenges such as interpretability, data quality, and
potential biases need to be addressed to ensure reliable predictions.

Genetic profiling is another emerging area of heart disease prediction. By analyzing
an individual's genetic makeup, researchers can identify specific genetic variants
associated with increased risk of heart disease. Although still in the early stages of
development, genetic profiling has the potential to personalize risk assessment and
identify individuals who may benefit from targeted preventive measures.

Traditional risk factor assessment provides a straightforward and cost-effective
method for heart disease prediction, making it accessible in various healthcare
settings. However, its reliance on population-based statistics limits its accuracy in
individual risk assessment. Additionally, it may not capture all relevant risk factors,
such as genetic predispositions, that could affect an individual's susceptibility to
heart disease.

Machine learning algorithms offer the advantage of considering a wide range of
variables, including genetic factors, lifestyle choices, and medical history. They can
handle large datasets and identify complex patterns that may go unnoticed with
traditional risk factor assessment. Nevertheless, challenges such as data quality,
potential biases in training data, and the lack of interpretability pose obstacles to
their widespread adoption in clinical practice.

Genetic profiling holds great potential for personalized heart disease prediction. By
analyzing an individual's genetic markers, it can provide valuable insights into their
inherent risk. However, genetic profiling is still in its early stages, and further research
is needed to establish its accuracy, reliability, and clinical utility. Ethical concerns
related to genetic privacy and potential discrimination also need to be addressed.

Heart disease prediction is a rapidly evolving field with several promising avenues for
future research and development. First, efforts should be made to integrate traditional
risk factor assessment with machine learning algorithms to leverage the strengths of
both approaches. This could enhance the accuracy and interpretability of predictions
while considering a wide range of variables.

FUTURE WORK

The work done here trains the system with a limited number of data sets. Machine
learning algorithms become more accurate once they are fed a large number of data
sets, so this system can be trained with many more records, which would increase the
accuracy in predicting heart disease. The analysis part of the system is complete; to
be more useful, it can be integrated with electronic systems that feed the system
real-time inputs and help the patient get results then and there. There are multiple
combinations of algorithms that can be tested against these data sets in order to
yield better results. Second, advancements in data collection and analysis techniques
are crucial for improving heart disease prediction. High-quality, diverse, and
representative datasets are needed to train robust machine learning models and ensure
unbiased predictions. Collaboration between healthcare institutions and researchers is
essential to establish data-sharing frameworks that respect privacy and facilitate
research advancements.

Third, genetic profiling should continue to be explored and refined. As our
understanding of genetic variants associated with heart disease improves, genetic
testing could become more accessible and affordable, allowing for more personalized
risk assessment. However, ethical considerations, such as informed consent and
genetic privacy, must be carefully addressed to ensure responsible implementation.

Heart disease prediction is a complex and multi-faceted process that requires a
comprehensive approach. Traditional risk factor assessment, machine learning
algorithms, and genetic profiling each have their strengths and limitations in
predicting heart disease. Integrating these approaches offers the most promising path
toward accurate, personalized prediction of heart disease.

PLAGIARISM REPORT

AUTHOR’S DETAILS

Name-Shivanshu Shukla
Roll No-1903480130054
Mobile No-7355911854
Email [email protected]

Name-Pankaj Kumar
Roll No-1903480130041
Mobile No-8853071492
Email [email protected]

Name-Harshdeep Singh
Roll No-1903480130025
Mobile No-8077013931
Email [email protected]

Name-Navneet Kumar Kushwaha
Roll No-1903480130037
Mobile No-7521938511
Email [email protected]

