Group 6
Bachelor of Technology
in
Information Technology
by
Shivanshu Shukla (1903480130054)
Pankaj Kumar (1903480130041)
Navneet Kushwaha (1903480130037)
Harshdeep Singh (1903480130025)
DECLARATION
We hereby declare that this submission is our own work and that, to the best of our
knowledge and belief, it contains no material previously published or written by any
other person, nor material which to a substantial extent has been accepted for the
award of any degree or diploma of the university or any other institute of higher
learning, except where due acknowledgement has been made in the text.
Signature:
Name: Shivanshu Shukla
Roll No: 1903480130054
Date:
Signature:
Name: Pankaj Kumar
Roll No: 1903480130041
Date:
Signature:
Name: Harshdeep Singh
Roll No: 1903480130025
Date:
Signature:
Name: Navneet Kushwaha
Roll No: 1903480130037
Date:
ACKNOWLEDGMENT
It gives us a great sense of pleasure to present the report of the B.Tech. project “Heart
Disease Prediction using Machine Learning” undertaken during our B.Tech. final
year. We owe a special debt of gratitude to our project guide Mr. Neeraj Kumar
Bharti (Assistant Professor, CSE), PSIT College of Engineering, Kanpur, for his
constant support and guidance throughout the course of our work. His sincerity,
thoroughness and perseverance have been a constant source of inspiration for us. It is
only through his cognizant efforts that our endeavours have seen the light of day.
We would also like to take this opportunity to acknowledge the contribution of all the
faculty members of the department for their kind assistance and cooperation during
the development of our project. Last but not the least, we acknowledge our friends for
their contribution in the completion of the project.
Signature:
Name: Shivanshu Shukla
Roll No: 1903480130054
Date:
Signature:
Name: Pankaj Kumar
Roll No: 1903480130041
Date:
Signature:
Name: Harshdeep Singh
Roll No: 1903480130025
Date:
Signature:
Name: Navneet Kushwaha
Roll No: 1903480130037
Date:
CERTIFICATE
This is to certify that the project titled “Heart Disease Prediction using Machine
Learning”, which is submitted by
Shivanshu Shukla (1903480130054)
Pankaj Kumar (1903480130041)
Harshdeep Singh (1903480130025)
Navneet Kushwaha (1903480130037)
in partial fulfillment of the requirement for the award of the degree of Bachelor of
Technology in Computer Science and Engineering to PSIT College of Engineering,
Kanpur, affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow, during
the academic year 2022-23, is a record of the candidates' own work carried out by
them under my supervision. The matter embodied in this report is original and has
not been submitted for the award of any other degree.
Heart Disease Prediction Using Machine Learning
Mr. Neeraj Kumar Bharti
(Assistant Professor)
Shivanshu Shukla, Harshdeep Singh, Pankaj Kumar, Navneet Kushwaha
ABSTRACT
Heart disease prediction using machine learning is an active area of research that aims
to develop models to predict the risk of developing heart disease in individuals.
Machine learning models are trained using various demographic, clinical, and
lifestyle data, such as age, gender, blood pressure, cholesterol levels, smoking status,
and family history of heart disease, to predict the likelihood of developing heart
disease in the future.
The prediction of heart disease using machine learning involves several steps,
including data preprocessing, feature selection, model training, and model evaluation.
Various machine learning algorithms, such as logistic regression, decision trees,
random forests, support vector machines, and neural networks, can be used to build
predictive models. The accuracy and reliability of heart disease prediction models
depend on several factors, including the quality and quantity of data used for training,
the selection of relevant features, and the choice of an appropriate machine learning
algorithm. Heart disease prediction using machine learning has the potential to
improve early detection and prevention of heart disease, which can ultimately reduce
the mortality and morbidity associated with this disease. However, further research is
needed to develop more accurate and reliable prediction models that can be used in
clinical practice.
TABLE OF CONTENTS
TITLE
DECLARATION
ACKNOWLEDGMENT
CERTIFICATE
ABSTRACT
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
1. INTRODUCTION
1.1 OVERVIEW
1.2 DATA ANALYTICS IN HEART DISEASE PREDICTION
1.3 MOTIVATION
1.4 PROPOSED RESEARCH
1.5 REPORT OBJECTIVES
1.6 REPORT ORGANISATION
2. LITERATURE REVIEW
3. SYSTEM ANALYSIS
3.1 OVERVIEW OF THE SYSTEM
3.2 ADVANTAGES OF PROPOSED SYSTEM
3.3 MACHINE LEARNING ALGORITHMS
3.3.1 SVM
3.3.2 BAYES
3.3.3 KNN
3.3.4 DECISION TREE
3.3.5 RANDOM FOREST
3.3.6 LOGISTIC REGRESSION
3.3.7 ANN
3.4 DEEP NEURAL NETWORK
4. SYSTEM DESIGN
4.1 SYSTEM ARCHITECTURE
4.2 DATA FLOW DIAGRAM
5. CODING AND TESTING
5.1 SOFTWARE REQUIREMENTS
5.2 OBJECTIVES AND TYPES OF TESTING
5.2.1 UNIT TESTING
5.2.2 INTEGRATION TESTING
5.2.3 FUNCTIONAL TESTING
5.2.4 SYSTEM TESTING
5.2.5 WHITEBOX TESTING
5.2.6 BLACKBOX TESTING
5.3 TESTING IN MACHINE LEARNING
5.4 TEST CASES
6. RESULTS AND ANALYSIS
6.1 DATA SET
6.2 METRICS FOR PERFORMANCE ANALYSIS
6.3 PERFORMANCE OF MACHINE LEARNING ALGORITHMS
6.3.1 SVM RESULTS
6.3.2 NAÏVE BAYES RESULTS
6.3.3 KNN RESULTS
6.3.4 DECISION TREE RESULTS
6.3.5 RANDOM FOREST RESULTS
6.3.6 LOGISTIC REGRESSION RESULTS
6.3.7 ARTIFICIAL NEURAL NETWORK RESULTS
6.3.8 DEEP NEURAL NETWORK RESULTS
6.4 RESULTS FROM TEST DATA SET
CONCLUSION
FUTURE WORK
REFERENCES
PLAGIARISM REPORT
AUTHOR’S DETAILS
LIST OF ABBREVIATIONS
ECG Electrocardiogram
BP Blood Pressure
TP True Positive
TN True Negative
FP False Positive
FN False Negative
EKG Electrocardiogram
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
Heart disease is a leading cause of mortality worldwide. Detecting heart disease early
is crucial for the effective management of the disease and prevention of adverse
outcomes. In recent years, machine learning techniques have gained attention in
healthcare as a promising tool for accurate diagnosis and prediction of diseases.
Machine learning algorithms use statistical models to identify patterns in data and
learn from them. The algorithms can then use these patterns to make predictions or
classify new data. In the case of heart disease detection, machine learning algorithms
can be trained on large datasets of medical records, imaging data, and clinical
information to predict the likelihood of a patient having heart disease.
There are several approaches to heart disease detection using machine learning. One
approach is to use supervised learning algorithms, such as logistic regression or
support vector machines, to classify patients as either having or not having heart
disease based on their medical records and risk factors such as age, gender, blood
pressure, cholesterol levels, smoking history, and family history. The algorithms are
trained on labeled data, where the outcome of interest (heart disease or no heart
disease) is known, and then tested on new, unlabeled data to evaluate their accuracy.
In addition to detecting heart disease, machine learning can also be used to predict the
risk of cardiovascular events, such as heart attacks or strokes. Predictive models can
be developed using machine learning algorithms that incorporate a wide range of
patient data, including medical history, lifestyle factors, and genetic information.
These models can be used to identify individuals at elevated risk who may benefit
from early intervention.
Congestive heart failure: a chronic condition in which the heart does not pump blood
as well as it normally should.
Congenital heart disease: an abnormality in the heart that develops before birth.
1.2 DATA ANALYTICS IN HEART DISEASE PREDICTION
Several research papers have explored the use of data analytics and data mining
algorithms to analyze various datasets and extract patterns for predicting the
occurrence of heart diseases. Among the commonly employed algorithms are Support
Vector Machines (SVM), Logistic Regression, Naive Bayes, K-Nearest Neighbors
(KNN), and decision trees. SVM and Logistic Regression algorithms have been
recognized for their ability to deliver more accurate results. Additionally, some
studies have incorporated Hadoop-based MapReduce and HDFS algorithms to
distribute data storage across multiple nodes and enable parallel system processing.
Typically, the datasets utilized in these studies have focused on attributes such as
blood pressure, heart rate, and an age group predominantly over 30 years old. It is
worth noting that while existing research papers have extensively employed the
aforementioned algorithms, they have not yet explored the application of deep
learning techniques, which have shown potential for achieving higher accuracy rates.
Our system, in contrast, incorporates a deep neural network alongside these algorithms
and incorporates additional attributes such as cholesterol levels, presence of angina,
age, and gender to enhance the precision of predictions.
Data analysis is a fundamental step in developing robust machine learning models for
heart disease prediction. It involves extracting insights, identifying patterns, and
making informed decisions based on the available data. Proper analysis enables the
identification of relevant features and relationships, which form the foundation for
training accurate and reliable predictive models.
The first step in data analysis for heart disease prediction is the collection and
preprocessing of relevant datasets. These datasets typically include a combination of
clinical, demographic, and lifestyle factors, along with medical test results. Ensuring
data quality is essential, as inaccurate or incomplete data can lead to biased or
erroneous predictions. Preprocessing techniques such as data cleaning, normalization,
and handling missing values are applied to ensure the dataset is suitable for analysis.
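As a minimal sketch of these preprocessing steps in Python (pandas and scikit-learn; the column names and values here are hypothetical, not taken from the project's dataset):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical fragment of a heart-disease table containing missing values.
df = pd.DataFrame({
    "age":  [54, 61, None, 45],
    "chol": [240, None, 198, 286],
})

df = df.fillna(df.median(numeric_only=True))       # handle missing values by median imputation
df[df.columns] = MinMaxScaler().fit_transform(df)  # normalize each feature to the [0, 1] range
print(df)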
Feature selection is the process of identifying the most relevant features from the
dataset that contribute significantly to the prediction task. Techniques like correlation
analysis, mutual information, and recursive feature elimination are commonly
employed to select the optimal subset of features. Feature engineering involves
transforming and creating new features based on domain knowledge and data
characteristics. For example, creating interaction terms, binning continuous variables,
or encoding categorical variables can enhance the predictive power of the model.
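For illustration, a sketch of recursive feature elimination (RFE), one of the selection techniques named above, run on synthetic stand-in data rather than the project's dataset:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 13 features, of which only 5 are informative.
X, y = make_classification(n_samples=300, n_features=13, n_informative=5, random_state=4)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print("selected feature mask:", rfe.support_)   # True for each retained feature
print("feature ranking:", rfe.ranking_)         # 1 means the feature was selected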
Once the dataset is analyzed and the features are selected or engineered, the next step
is to choose an appropriate machine learning model for heart disease prediction.
Various algorithms, including decision trees, random forests, support vector
machines, and neural networks, have been applied in this context. The choice of the
model depends on factors such as the dataset size, complexity, interpretability, and
the desired performance metrics.
To evaluate the performance of the chosen model, various evaluation metrics such as
accuracy, precision, recall, F1-score, and area under the receiver operating
characteristic curve (AUC-ROC) are utilized. Cross-validation techniques like k-fold
cross-validation help estimate the model's performance on unseen data, mitigating
issues such as overfitting or underfitting. Additionally, techniques like
hyperparameter tuning further optimize the model's performance by finding the best
combination of hyperparameters.
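A brief sketch of hyperparameter tuning via grid search combined with k-fold cross-validation (synthetic stand-in data; the parameter grid is illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=13, random_state=5)

# Try every combination of C and kernel, scoring each by 5-fold AUC-ROC.
grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                    cv=5, scoring="roc_auc")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)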
Validating the predictive model is crucial to ensure its generalizability and reliability.
External validation, using independent datasets or real-world scenarios, helps assess
the model's performance in different populations and settings. Moreover, interpreting
the model's predictions is essential for understanding the factors contributing to the
risk of heart disease. Techniques like feature importance, partial dependence plots,
and SHAP (Shapley Additive Explanations) values provide insights into the model's
decision-making process and aid in clinical interpretation.
Data analysis in heart disease prediction using machine learning presents certain
challenges. Limited access to diverse and comprehensive datasets, data imbalance,
and interpretability issues are some of the key hurdles faced in this field. Addressing
these challenges is essential for building models that generalize reliably to real-world
clinical settings.
1.3 MOTIVATION
It is widely recognized that patients diagnosed with heart disease typically undergo
various tests such as ECG and EKG. However, these tests are usually conducted only
when individuals experience chest pain or other symptoms associated with heart
disease. In the modern world, wearable devices capable of monitoring vital signs like
pulse rate and blood pressure have become increasingly prevalent. It is important to
note that the risk of developing heart disease is not limited to individuals above the
age of 40. The current generation faces significant stress and pressure due to work and
other factors. Therefore, there is an urgent need to analyze physiological parameters
and assess the possibility of heart disease before a heart attack occurs. To address
this, research has been conducted using machine learning algorithms such as
Regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Naive
Bayes, Decision Trees, and more for the prediction of heart disease.
In light of recent advancements in deep learning, we propose the development of a
heart disease prediction system that surpasses other machine learning algorithms in
terms of accuracy. Our approach incorporates thirteen physiological parameters, with
a focus on crucial factors such as heart rate, age, and sex, to achieve the highest level of
accuracy in predicting heart disease.
1.5 REPORT OBJECTIVES
Validation of Random Forest and Decision Tree on the given data set for heart
disease prediction.
Validation of ANN and Deep Neural Network on the data set for heart disease
prediction.
1.6 REPORT ORGANISATION
This report provides an overview of the proposed system, covering various aspects.
Chapter 1 presents a basic introduction to the project, highlighting the utilization of
data analytics in heart disease prediction and outlining our research objectives. In
Chapter 2, we discuss relevant research papers that have contributed valuable insights
and served as a foundation for our work.
Moving on to Chapter 3, we analyze both the current system and the proposed
system, providing an overview of how the project functions and outlining its
objectives. Additionally, this chapter details the algorithms utilized in the project. In
Chapter 4, we describe the dataset employed, including its attributes, the workflow of
the system, and the system architecture.
Chapter 5 covers the hardware and software requirements necessary for the project,
along with the testing procedures. Chapter 6 presents the results obtained and the
accuracy of our project.
CHAPTER 2
LITERATURE REVIEW
In a study conducted by Polaraju, the prediction of heart disease was carried out using
Multiple Regression[1]. The dataset comprised 3000 entries with 13 attributes, which
were divided into training (70%) and testing (30%) sets. The results indicated that
Multiple Linear Regression yielded higher accuracy compared to other algorithms.
Marjia focused on heart disease prediction using algorithms such as SMO, j48, KStar,
Multilayer Perceptron, and Bayes Net with the aid of the WEKA tool [2]. Through
k-fold cross-validation, it was observed that Bayes Net and SMO performed optimally,
while the other algorithms did not yield satisfactory results. Consequently, efforts
were made to improve accuracy for enhanced diagnostic decisions.
Megha Shahi utilized data mining techniques for heart disease prediction. WEKA
tool was used to diagnose heart disease and improve service quality in healthcare
centers[5]. Algorithms such as SVM, KNN, Naïve Bayes, ANN, Association Rule,
and Decision Tree were applied. The results demonstrated that SVM outperformed
other algorithms in terms of accuracy.
In another study, the occurrence rate of heart disease was predicted and
analyzed using data mining techniques[6]. The primary objective was to enable
automatic diagnosis of heart diseases in a timely manner. Factors such as blood sugar,
age, heart rate, and sex were utilized to predict the likelihood of a person having heart
disease. The analysis of data was performed using the WEKA tool.
Sharmila employed non-linear algorithms for heart disease classification. Big data
tools like MapReduce and HDFS, along with SVM, were utilized with an optimized
set of attributes for predicting heart diseases[7]. HDFS was used to store large
datasets in separate nodes, and SVM was applied in a parallel fashion, leading to
optimal computational time compared to sequential usage.
Jayami Patel proposed the utilization of machine learning and data mining
algorithms for heart disease prediction[8]. The objective was to uncover hidden
patterns through data mining techniques. The results indicated that J48 achieved the
highest accuracy rate among the algorithms tested, based on UCI data, surpassing
LMT.
Purushottam employed data mining techniques for heart disease prediction, aiming to
support medical practitioners in making better decisions based on specific
parameters[9]. By training and testing a particular parameter, an accuracy rate of
86.3% during testing and 87.3% during training was achieved.
Gomathi proposed the prediction of multiple diseases using data mining techniques.
The study focused on predicting diseases such as breast cancer, heart diseases, and
diabetes[10]. Data mining played a significant role in multi-disease prediction,
reducing the number of tests required.
CHAPTER 3
SYSTEM ANALYSIS
3.1 OVERVIEW OF THE SYSTEM
With the increasing work pressure and stress in today's world, heart
problems are no longer restricted to older age groups. Even adults in their 20s, 30s,
and 40s are susceptible to heart diseases. Some individuals may even develop heart
conditions during childhood. Hence, there is a critical need for an automated medical
diagnosis system that utilizes physiological data like blood pressure, heart rate, blood
sugar, and cholesterol to predict the occurrence of heart disease.
Prior research has employed machine learning algorithms such as Regression, SVM,
Bayes, KNN, Decision Tree, and ANN for heart disease prediction. However, these
systems often consider only a limited number of parameters, such as heart rate and
age. Furthermore, the use of deep neural networks in this domain remains largely
unexplored. In our work, we have utilized a Deep Neural Network for heart disease
prediction, incorporating 13 physiological parameters. We have compared the
performance of our approach with other algorithms, including Regression, SVM,
Bayes, KNN, Decision Tree, Random Forest, and ANN, in terms of error and accuracy.
For our study, we obtained a publicly available dataset from Kaggle, which is
commonly used for predicting heart diseases. The input parameters used for data
analysis with machine learning algorithms are as follows:
1. age: the patient's age in years
3.2 ADVANTAGES OF THE PROPOSED SYSTEM
Our study differs from existing approaches by considering not only blood pressure
and heart rate but also other factors such as age, sex, angina, chest pain location, and
cholesterol, which are often overlooked in similar studies.
By incorporating a greater number of attributes, our system can create clusters within
the data based on different factors, allowing for more comprehensive analysis and
prediction.
Unlike other systems that generate alerts only when extreme values are reached, our
system aims to predict the likelihood of heart disease occurrence before it actually
happens, providing early detection and intervention.
Our system categorizes the data based on factors like age and sex, allowing for more
targeted and personalized predictions.
We have employed Deep Neural Network (DNN) in our system, which sets our
approach apart from other papers in the field of heart disease prediction. DNN has
been proven to have the highest accuracy among all machine learning and data mining
algorithms.
Our system utilizes a comprehensive set of viable machine learning and data mining
algorithms, ensuring a robust and thorough analysis of the data.
3.3 MACHINE LEARNING ALGORITHMS
In this work, eight machine learning algorithms, which include Random Forest, SVM,
Decision Tree, Naïve Bayes, ANN, KNN, Logistic Regression and Deep Neural
Network, were employed. These algorithms are used to build a prediction system
that analyzes and predicts, with the best possible accuracy, whether a particular
patient has heart disease or not.
3.3.1 Support Vector Machine
The main objective of the support vector machine algorithm is to find a hyperplane
in an N-dimensional space that distinctly classifies the data points.
Figure 3.2: Hyperplane with maximum margin [12]
The two classes are separated by choosing one hyperplane from the possible ones.
Our major objective is to find the plane with the maximum margin of all the planes,
i.e., the one that maximizes the distance between the data points of the two classes.
Maximizing this margin gives us more confident classification when future data
points are plotted.
1. Linear SVM: gives us a linear hyperplane that distinguishes the classes. The
objective is to maximize the distance from the hyperplane to the nearest points of
each class.
2. Non-Linear SVM: has a non-linear hyperplane, and depicts a graph which is closer
to the real-world scenario.
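A minimal sketch of this step in Python with scikit-learn, assuming a hypothetical heart.csv file whose target column is 1 for heart disease and 0 otherwise (the file and column names are assumptions, not the project's actual artifacts):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("heart.csv")                     # hypothetical file name
X, y = df.drop(columns=["target"]), df["target"]  # hypothetical label column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)  # SVMs are sensitive to feature scale
clf = SVC(kernel="rbf")                 # non-linear SVM; kernel="linear" gives a linear hyperplane
clf.fit(scaler.transform(X_train), y_train)
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))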
3.3.2 Naive Bayes
It is a machine learning technique that works on the strategy of Bayes' theorem.
It assumes that no attribute depends on any other; it is a group of algorithms that
share the common principle that every feature is independent of the others. Bayes'
theorem tells us the probability of an event occurring when another event has already
occurred. The mathematical equation is:

P(a|z) = P(z|a) * P(a) / P(z)

where:
P(a|z): the probability of the hypothesis a given the data z (the posterior).
P(z|a): the probability of the data z when the hypothesis a is true (the likelihood).
P(a): the probability of the hypothesis regardless of the data (the prior).
P(z): the probability of the data (the evidence).
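A tiny worked example of the theorem, with illustrative, made-up numbers:

# Bayes' theorem with illustrative numbers: a = "has heart disease",
# z = "reports chest pain". All probabilities below are made up.
p_a = 0.10            # prior P(a)
p_z_given_a = 0.70    # likelihood P(z|a)
p_z = 0.25            # evidence P(z)

p_a_given_z = p_z_given_a * p_a / p_z   # posterior P(a|z)
print(p_a_given_z)    # 0.28 -> observing z raises the estimated risk from 10% to 28%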
3.3.3 KNN
K-nearest neighbors, abbreviated as KNN, is an algorithm that groups data into
classes and then classifies new points as per their similarity measures. Classification
is based on a majority vote among a point's neighbors: data is assigned to the class
most common among its nearest neighbors. As we increase the number of nearest
neighbors, i.e., the value of k, the accuracy might increase. KNN is broadly used for
pattern recognition and statistical prediction.
It divides the data based upon the distance from the nearest neighbors.
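A short sketch of the effect of k on KNN accuracy, run on synthetic stand-in data rather than the project's dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart data: 13 features, binary label.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: accuracy={knn.score(X_test, y_test):.3f}")  # accuracy often rises, then flattens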
3.3.4 Decision Tree
Decision Tree is one of the supervised learning algorithms. In it, the data is
continuously split based on certain parameters, after which we end up with decision
nodes and leaves. What distinguishes it from other supervised algorithms is that it
can solve both regression and classification problems easily. The main aim is to
create a system that can predict the desired results simply by learning decision rules
from the prior data, i.e., the training set.
3.3.5 Random forest
Random Forest, just as the name suggests, creates a number of random decision
trees, due to which it is also called a random decision forest. It is one of the
supervised learning algorithms: it builds a forest that is basically just a group of
decision trees. It is mostly trained with the bagging method, as it is the most
efficient. What the bagging method does is combine multiple learning models, which
in turn improves the overall result. Just like decision trees, it can be used for both
regression and classification problems.
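As an illustration of the bagging idea, a sketch comparing a single decision tree with a random forest on synthetic stand-in data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=13, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

print("decision tree:", tree.score(X_test, y_test))
print("random forest:", forest.score(X_test, y_test))  # averaging many trees usually reduces variance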
3.3.7 ANN
An artificial neural network, abbreviated as ANN, is a model that works like the
human brain or neural network, i.e., the neurons. It is a computational model that
processes complex data. It isn't given any task-specific goal; instead, it learns from
the examples or data given to it, just like the brain. It is based on a collection of
nodes called artificial neurons. The more neurons there are, the more capable the
system. The neurons transmit signals from one to another, forming connections that
resemble the human neural network. A neural network has the following three layers:
Input layer – It consists of the raw information that we feed to the neurons.
Hidden layers – These are the computational layers which take the inputs and
weights from the previous layers, process them with the activation function, and
send the output to the next layer.
Output layer – This depends on the output of the hidden layers and the functions
taking place in them.
The basic computational unit is the neuron. It receives inputs from the sources
provided, and each input carries a weight, which is assigned according to its relative
importance compared with the other inputs. Then a function is applied, as shown in
figure 3.8.
This function is called the activation function; it introduces non-linearity and
produces the output. The function used in our module is the sigmoid, which takes
real-valued inputs and maps them into the range 0 to 1. The equation is:

σ(x) = 1 / (1 + exp(−x))
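A minimal NumPy sketch of a single sigmoid neuron matching the equation above (the input values and weights are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

inputs = np.array([0.5, -1.2, 3.0])    # raw inputs (illustrative)
weights = np.array([0.4, 0.6, -0.1])   # weights reflecting each input's relative importance (illustrative)
bias = 0.1

output = sigmoid(np.dot(weights, inputs) + bias)
print(output)                          # always lies in the range (0, 1)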
3.4 DEEP NEURAL NETWORK
Deep, convolutional, and recurrent neural networks are utilized in natural language
processing, computer vision, audio recognition, social network filtering,
bioinformatics, machine translation, image analysis, drug design, and material
inspection, where the outcomes have been comparable or superior to those of some
human professionals.
A deep neural network consists of an input layer, an output layer, and a variable
number of hidden layers in between.
The activation functions used in our module are ReLU between the hidden layers
and softmax at the output layer, which produces the final class output (0 or 1).
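A sketch of such a network in Keras; the number and sizes of the hidden layers here are assumptions, since the report does not specify them:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(13,)),                     # the 13 physiological parameters
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layers with ReLU (sizes assumed)
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # softmax over the two classes (0 or 1)
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()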
CHAPTER 4
SYSTEM DESIGN
This chapter gives the system architecture of our project for heart disease analysis
using machine learning. Following the architecture, we present the data flow
diagram and the data set used.
4.1 SYSTEM ARCHITECTURE
The architecture shows only 3 of the machine learning algorithms, but our work here
uses a total of 8 algorithms. For the training set we included 80% of the data, and for
testing we used 20% of the data. We test the model with all 8 algorithms and check
the accuracy. The one that fits best and gives us the highest accuracy is the one that
can be used for further work in this field.
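A sketch of this train-and-compare loop for a subset of the eight algorithms (default settings, synthetic stand-in data):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=13, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)  # 80/20 split

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive bayes": GaussianNB(),
    "svm": SVC(),
    "random forest": RandomForestClassifier(random_state=2),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")  # pick the best-scoring model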
4.2 DATA FLOW DIAGRAM
1) The training data is 70% and is given supervised inputs and outputs.
2) The testing data is 30% and shows us how well the system is trained.
4) The system starts by first pre-processing the dataset we have fed to it.
5) It studies and analyzes the data, and then applies the required machine learning
algorithm.
6) If it finds that the dataset is supervised, it will separate it into training data and
testing data.
7) Otherwise it will stop.
12) We compare the accuracies of all the algorithms, and the algorithm that gives the
highest accuracy is selected.
13) In our system, the algorithm that gets the highest accuracy rate is DNN.
CHAPTER 5
CODING & TESTING
5.1 SOFTWARE REQUIREMENTS
2. Python: a widely used general-purpose programming language, suitable for both
small-scale and large-scale systems. It is interpreted and supports multiple
programming paradigms, combining features of procedural, object-oriented, and
functional programming. It is garbage-collected, which simplifies memory
management.
3. NumPy: a Python programming library that helps us deal with large datasets,
matrices, and multi-dimensional arrays. It also provides a number of mathematical
functions which ease calculations. It is open-source software available to all.
5.2.1 UNIT TESTING
This testing includes the creation of test scenarios that verify that the program is
working the way it is supposed to. The objective is to validate that each module or
unit performs the tasks it is supposed to do. A unit is the smallest part that can be
tested.
5.2.2 INTEGRATION TESTING
These tests are made to check the different parts after they have been integrated, to
determine whether they function properly as a unit. This type of testing is driven by
events and is concerned with the outcome of the component as a whole. Integration
tests show not only that the components work properly as individuals but also that
the combination of components works properly together.
5.2.3 FUNCTIONAL TESTING
These tests provide a systematic demonstration of the functions tested and proof that
they function successfully. Functional testing targets valid input, invalid input,
functions, output, and procedures. The preparation of functional tests is focused on
the requirements, key functionalities, and different and extreme test cases.
5.2.4 SYSTEM TESTING
This type of test tells us whether the whole system, after all its parts have been
joined together, meets the needs of stakeholders. It checks and confirms the known
and expected results. This kind of testing is done to check whether the system
delivers what it was supposed to, and whether all the parts work well with each
other.
5.2.5 WHITEBOX TESTING
This testing is done such that the tester has information about all the components,
workings, structure, and architecture of the hardware/software. It is used for
deep-level testing to ensure that the places not reachable by a black-box test are
covered; it basically means that there shall be no blind spots in the system.
5.2.6 BLACKBOX TESTING
Here, testing is done on the software/hardware by someone who has no information
on how to operate it. Before a product is developed, some documentation is done,
including a requirement analysis document that has all the details about what the
product must be; these tests are based on those sources only. Imagine a black-colored
box: you cannot see into it and know nothing about what it contains. This is exactly
how the testing is done.
5.3 TESTING IN MACHINE LEARNING
Below are some strategies that can be used in testing machine learning systems:
1) Development of the training data set:
Here, a subset of the dataset is used for training, i.e., it is used to train the system to
obtain the given prediction. It is supervised: the output is given along with the input.
2) Development of the testing data set:
The testing data set is a subset of the data, built in an intelligent way to check how
robustly our system has been trained and to cover all the combinations that are
possible. The resultant model will be finely tuned according to the outputs on the
testing data set.
3) Development of validation and correction test suite:
This is based on algorithms and test datasets. For example, in our system, scenarios
consist of clustering results based upon different factors/attributes and creating risk
profiles depending upon behaviour and demography.
Machine learning models will usually have approximate, not exact, results upon
validation. In conclusion, software testing is as crucial and important a task in
machine learning as it is in any other traditional system; but unlike those systems,
our testing is based on more dynamic factors and will generally produce a relative or
approximate result, which is best shown statistically.
5.4 TEST CASES
Table 5.1: Test Cases
Sr. No. | TEST CASE NAME | TEST CASE DESCRIPTION | EXAMPLE TEST CASE | EXPECTED OUTPUT
In the realm of machine learning, the validation process of models often yields
approximate results rather than exact ones. This unique characteristic distinguishes
software testing in machine learning from that in traditional systems. While software
testing plays a crucial role in any system development, the dynamic factors and
inherent complexity of machine learning models introduce new challenges and
opportunities. This section delves into the significance of software testing in machine
learning and how it embraces approximate results, which are best represented
statistically.
Machine learning models are designed to learn patterns and make predictions based
on available data. Unlike traditional systems where outputs are often deterministic,
machine learning models provide approximate results. These models adapt and
evolve over time, making their behavior more dynamic and less predictable.
Consequently, software testing in machine learning necessitates a different approach
to ensure accuracy and reliability.
Software testing holds a crucial position in machine learning, just as it does in
traditional systems. It serves to validate the model's performance, identify potential
errors or biases, and assess its overall effectiveness. However, due to the approximate
nature of machine learning results, testing in this domain requires a shift in
perspective.
Rather than seeking absolute correctness, testing in machine learning focuses on
statistical measures to evaluate model performance. This approach recognizes that
machine learning models provide relative or approximate results, which are still
valuable and meaningful within specific contexts. Statistical analysis becomes
instrumental in understanding the model's behavior and assessing its effectiveness.
Machine learning models operate in dynamic environments where data distribution,
patterns, and input characteristics can change over time. As a result, traditional testing
methodologies may fall short in capturing the model's adaptability and generalization
capabilities. To overcome this challenge, continuous testing becomes imperative,
ensuring the model's performance remains consistent amidst evolving conditions.
In machine learning testing, several techniques are employed to validate the models.
Cross-validation, for instance, assesses the model's performance on different subsets
of the data, providing insights into its generalization capabilities. A/B testing is
another valuable technique, enabling comparison between different versions of the
model to gauge improvements or identify potential issues.
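A minimal sketch of k-fold cross-validation as a testing technique; the fold-wise accuracies are exactly the kind of statistical result discussed above (synthetic stand-in data):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=13, random_state=3)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # 5-fold CV
print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())  # a statistical, not exact, result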
Software testing in machine learning is a critical and intricate task that ensures the
accuracy and reliability of models. While approximate results are inherent to machine
learning, embracing statistical measures allows for a comprehensive evaluation of
model performance. Understanding the dynamic factors at play, employing
appropriate validation techniques, and incorporating iterative testing contribute to
enhancing the model's accuracy and adaptability. By recognizing the importance of
software testing and its distinctive challenges in machine learning, we can harness the
power of these models responsibly and unlock their full potential in various domains.
CHAPTER 6
RESULTS AND ANALYSIS
6.1 DATA SET
For example, our system uses a dataset to predict the occurrence of heart disease by
dividing the dataset into test data and training data: 70% of the data is used for
training and the remaining 30% is used for testing. The input in our case is the
attributes we have used, which correspond to factors that contribute to heart disease.
The output is a binary digit indicating whether or not a person is susceptible to heart
disease.
The algorithms used in the system for prediction are learning algorithms, which will
always change over periods of time based on various input factors. Therefore, the
results might change when we learn even more about the data fed as input.
Table 6.1: Data set
For our system, we have taken a dataset that was publicly available on Kaggle for
predicting heart disease. The parameters used as input for data analysis with the
machine learning algorithms are as follows:
1. chol: level of cholesterol
6.2 METRICS FOR PERFORMANCE ANALYSIS
TP, FN, FP, and TN are the four cells of the confusion matrix:

Actual positive: TP (predicted positive), FN (predicted negative)
Actual negative: FP (predicted positive), TN (predicted negative)

TP and TN denote the number of instances which have been correctly classified as
heart disease occurrence and no heart disease occurrence, respectively. FP and FN
signify the number of instances which have been wrongly classified as heart disease
occurrence and no heart disease occurrence, respectively.
Accuracy: The accuracy can be calculated with the help of the formula given below.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
F1 score = 2 * (Precision * Recall) / (Precision + Recall)
MCC: the Matthews Correlation Coefficient considers all four cells of the confusion
matrix. MCC lies within the range −1 to +1, where +1 indicates perfect prediction, 0
indicates a random prediction, and −1 indicates total disagreement between
prediction and observation. This makes the metric really useful, as it is easy to
interpret.

MCC = (TP * TN − FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
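A small sketch computing the above metrics from the four confusion-matrix counts (the counts are illustrative only, not the project's results):

import math

TP, TN, FP, FN = 26, 28, 3, 4   # illustrative counts

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
mcc       = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(accuracy, precision, recall, f1, mcc)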
6.3 PERFORMANCE OF MACHINE LEARNING ALGORITHMS
6.3.1 SVM RESULTS
ACCURACY: 0.8852
RECALL: 0.8666
PRECISION: 0.8965
MCC: 0.84699
6.3.2 NAIVE BAYES RESULTS
ACCURACY: 0.8852
RECALL: 0.9
PRECISION: 0.8666
MCC: 0.21
6.3.3 KNN RESULTS
ACCURACY: 0.639
RECALL: 0.4666
PRECISION: 0.76
MCC: 0.7465
6.3.4 DECISION TREE RESULTS
ACCURACY: 0.8033
RECALL: 0.7666
PRECISION: 0.821
MCC: 0.271
6.3.5 RANDOM FOREST RESULTS
ACCURACY: 0.7734
RECALL: 0.8
PRECISION: 0.8125
MCC: 0.6607
6.3.6 LOGISTIC REGRESSION RESULTS
ACCURACY: 0.9016
RECALL: 0.8666
PRECISION: 0.9285
MCC: 0.6394
6.3.7 ARTIFICIAL NEURAL NETWORK RESULTS
ACCURACY: 0.8852
RECALL: 0.8666
PRECISION: 0.8965
MCC: 0.84699
6.3.8 DNN RESULTS
ACCURACY: 0.918
RECALL: 0.9033
PRECISION: 0.9655
MCC: 0.5320
6.4 RESULTS FROM THE TEST DATA SET:
We have analyzed the test data set with regard to heart disease on the basis of
various factors, which are sex, age, blood pressure, serum cholesterol, and so on.
Fig 6.12: Analysis of positive heart attack in men out of total men
Fig 6.13: Analysis of positive heart attack in women out of total women
CONCLUSION
Heart disease is a significant global health issue that affects millions of individuals
worldwide. Early detection and prediction of heart disease play a vital role in
preventing adverse health outcomes and improving patient care. In this report, we
explored various methods and techniques used in heart disease prediction and
discussed their strengths, limitations, and potential future developments. Through a
comprehensive analysis, we have reached a conclusion regarding the prediction of
heart diseases.
Throughout the report, we examined several approaches for heart disease prediction,
including traditional risk factor assessment, machine learning algorithms, and genetic
profiling. Traditional risk factor assessment considers factors such as age, gender,
blood pressure, cholesterol levels, and smoking habits. While this method has been
widely used, it has limitations in accurately predicting individual risk due to its
reliance on population-based statistics.
Machine learning algorithms, on the other hand, have shown promise in improving
heart disease prediction. These algorithms utilize large datasets and complex
algorithms to identify patterns and create prediction models. They can incorporate a
wide range of features and variables to achieve higher accuracy than traditional risk
factor assessment. However, challenges such as interpretability, data quality, and
potential biases need to be addressed to ensure reliable predictions.
Traditional risk factor assessment remains widely used in clinical settings. However,
its reliance on population-based statistics limits its accuracy in
individual risk assessment. Additionally, it may not capture all relevant risk factors,
such as genetic predispositions, that could affect an individual's susceptibility to heart
disease.
Genetic profiling holds great potential for personalized heart disease prediction. By
analyzing an individual's genetic markers, it can provide valuable insights into their
inherent risk. However, genetic profiling is still in its early stages, and further research
is needed to establish its accuracy, reliability, and clinical utility. Ethical concerns
related to genetic privacy and potential discrimination also need to be addressed.
Heart disease prediction is a rapidly evolving field with several promising avenues for
future research and development. First, efforts should be made to integrate traditional
risk factor assessment with machine learning algorithms to leverage the strengths of
both approaches. This could enhance the accuracy and interpretability of predictions
while considering a wide range of variables.
FUTURE WORK
The work done here trains the system with a limited number of data sets. Machine
learning algorithms become more accurate once they are fed with a large number of
data sets, so this system can be trained with many more data sets to increase its
accuracy in predicting heart disease. The analysis part of the system is complete; to
be more useful, it can be integrated with electronic systems that give the system
real-time inputs and help the patient get results then and there. There are multiple
combinations of algorithms that can be tested against these
analysis techniques are crucial for improving heart disease prediction. High-quality,
diverse, and representative datasets are needed to train robust machine learning models
and ensure unbiased predictions. Collaboration between healthcare institutions and
researchers is essential to establish data-sharing frameworks that respect privacy and
facilitate research advancements.
REFERENCES
[1] Dey, N., Ashour, A. S., & Bhatt, C. (Eds.). (2021). Smart Healthcare Analytics: A
Data-Driven Approach for Healthcare Quality Improvement. CRC Press.
[2] Singh, R., & Khanna, D. (Eds.). (2020). Advances in Machine Learning and Data
Science: Recent Achievements and Research Directives. Springer.
[3] Goldstein, B. A., Navar, A. M., & Pencina, M. J. (2017). Risk prediction with
electronic health records: the importance of model validation and clinical context.
JAMA cardiology, 2(2), 143-144.
[4] Krittanawong, C., Zhang, H., & Wang, Z. (2020). Artificial intelligence in
precision cardiovascular medicine. Journal of the American College of Cardiology,
75(23), 2935-2937.
[5] Cho, I., & Sengupta, P. P. (2019). Machine learning for interpretation of
echocardiograms. Circulation research, 124(8), 1172-1182.
[6] Rajkomar, A., Oren, E., Chen, K., Dai, A. M., Hajaj, N., Hardt, M., ... & Zhang,
Y. (2018). Scalable and accurate deep learning with electronic health records. npj
Digital Medicine, 1(1), 1-10.
[7] Krittanawong, C., Rogers, A. J., & Aydar, M. (2020). Artificial intelligence in
precision cardiovascular medicine. Journal of the American College of Cardiology,
75(23), 2935-2937.
[8] Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare:
promise and potential. Health information science and systems, 2(1), 1-10.
[9] Xie, W., Almeida, D., Ding, K., Wijewickrema, S., Chen, Y., Keegan, J., ... &
Greenstein, J. L. (2021). Developing and validating cardiovascular risk prediction
models using big data: a systematic review. Journal of the American Heart
Association, 10(4), e018613.
[10] Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and
TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly
Media.
[11] Possible Hyperplanes. https://www.researchgate.net/figure/PossibleHyperplanes_fig5_351783755
[12] Hyperplane with maximum margin. https://www.researchgate.net/figure/Possible-hyperplanes-left-and-optimal-hyperplane-right_fig8_349822972
[13] Naive Bayes. https://hub.knime.com/knime/spaces/Academic%20Alliance/latest/Guide%20to%20Intelligent%20Data%20Science/Exampl%20Workflows/Chapter8/02_NaiveBayes~0oyhMdWYK5w19xGj
[14] KNN Clustering. https://deepai.org/machine-learning-glossary-and-terms/kNN
[18] ANN. https://data-flair.training/blogs/artificial-neural-networks-for-machine-learning/
PLAGIARISM REPORT
AUTHOR’S DETAILS
Name: Shivanshu Shukla
Roll No: 1903480130054
Mobile No: 7355911854
Email: [email protected]

Name: Pankaj Kumar
Roll No: 1903480130041
Mobile No: 8853071492
Email: [email protected]

Name: Harshdeep Singh
Roll No: 1903480130025
Mobile No: 8077013931
Email: [email protected]