Lizu Report
Project Report
On
SUBMITTED BY
Name – ARPIT POLEI
Regd. No. – 2101020502
Branch - CSE
Roll No. – CSE21188
Batch – 2021-2025
Date of submission –
CERTIFICATE OF APPROVAL
This is to certify that we have examined and approved the
project entitled “Heart Disease measure using Random
Forest”. We hereby accord approval of it as a survey report
carried out and presented in a manner required for its
acceptance in the partial fulfilment of the requirements for the
internship project for which it has been submitted. This
approval does not necessarily endorse or accept every
statement made, opinion expressed or conclusions drawn as
recorded in this report; it only signifies the acceptance of the mini
project for the purpose it has been submitted.
(Signature of Supervisor)
ABSTRACT:
Data mining is the process of discovering information in large volumes
of data. Its applications are wide and diverse and touch almost every
aspect of human life; health care is among the fields that have
benefited most. Heart disease is one of the most dangerous
life-threatening chronic diseases worldwide. The objective of this work
is to predict the occurrence of heart disease in a patient using the
random forest algorithm. The dataset was obtained from Kaggle and
contains 303 samples with 14 attributes. It was processed in Python, an
open-source language, using Jupyter Notebook, and classified with the
random forest machine learning algorithm. The results are expressed as
accuracy, sensitivity and specificity, in percent. Using random forest,
we obtained an accuracy of 86.9% for the prediction of heart disease,
with a sensitivity of 90.6% and a specificity of 82.7%.
From the receiver operating characteristic (ROC) curve, we obtained a
diagnosis rate (area under the curve) of 93.3% for the prediction of
heart disease using random forest. Because the random forest algorithm
classified heart disease well in this study, it is used in the proposed
system.
TABLE OF CONTENTS
Sl. No. Title
1. Introduction
2. Related Work
3. Heart Disease
4. Methodology
6. Conclusion
7. References
INTRODUCTION:
Data mining is also known as knowledge discovery from data. It
attempts to extract hidden patterns and trends from large databases and
supports the automatic exploration of data. The main objective of data
mining is to find information hidden in the database. It is also called
exploratory data analysis, data-driven learning and deductive learning.
It extracts meaningful information from the database. When the database
is very large, in the terabyte to petabyte range, manual analysis is not
possible, so automatic data analysis is needed. Data mining was
introduced in the 1990s. Various data mining technologies are as
follows:
The world is filled with data such as pictures, video and music. Machine
learning promises to derive meaning from all of this data. Arthur C.
Clarke stated that any sufficiently advanced technology is
indistinguishable from magic. A great deal of data is generated not only
by people but also by mobile phones, computers and other devices. An
automatic system can learn from data and adapt as the data changes.
Machine learning is widely applied in speech processing, image
processing and fraud detection, and in medical science for problems such
as diabetic retinopathy, skin cancer detection and heart disease. Using
data to build a model is referred to as training, and using the model to
answer questions is referred to as prediction. Training data is used to
create the model. The resulting predictive model can then serve
predictions on previously unseen data and answer the questions.
RELATED WORK:
This study presents a prediction method for the classification of heart
disease. Risk factors that can be controlled and those that cannot are
explained in this report. Heart disease is predicted using the random
forest machine learning algorithm.
In [5], the authors proposed a data mining model for the prediction of
heart disease. The dataset was taken from the UCI Machine Learning
Repository. The authors applied four data mining algorithms (Naïve
Bayes, random forest, linear regression and decision tree) to predict
heart disease. Among these, random forest gave the best accuracy,
90.16%.
In [6], the authors used k-nearest neighbor (KNN), decision tree, linear
regression and support vector machine algorithms to predict heart
disease and compared their accuracy. All datasets were obtained from the
UCI repository. The algorithms were implemented in Python and run in
Jupyter Notebook. In their experiments the authors obtained the best
accuracy, 87%, with k-nearest neighbor, followed by support vector
machine at 83%, decision tree at 79% and linear regression at 78%.
HEART DISEASE:
The heart is the most important organ of the human body. If it does not
function properly, other organs are affected. According to one report,
7,000,000 people die from heart attacks each year. According to a WHO
report, around 17.9 million people died of cardiovascular diseases
(CVDs) in 2016. About 31% of deaths worldwide each year are due to heart
disease. The vital function of the heart is to pump blood through the
body, supplying oxygen and nutrients and removing metabolic waste. If
the supply of blood is insufficient, the heart does not function
properly, and if it stops working the person dies. Angina occurs when
there is a temporary loss of blood supply to the heart, causing chest
pain.
Cardiovascular disease is of two types.
i. Heart attack: It occurs when the blood vessels of the heart are
suddenly blocked.
Table 1. Major causes of heart disease.
Risk factor
Smoking
High Blood Pressure
High Cholesterol
Diabetes and Prediabetes
Being overweight
Physical inactivity
Metabolic syndrome
METHODOLOGY:
For the proposed study, the dataset was taken from Kaggle and
downloaded as a comma-separated values (CSV) file that can be opened in
Excel. The data was processed in Python using Jupyter Notebook. The
dataset contains 303 sample instances, as shown in Table 3, and 14
clinical features, as shown in Table 2. Python libraries such as pandas,
scikit-learn (sklearn), NumPy and matplotlib were used to run the
algorithms. The data was analysed in Jupyter Notebook using exploratory
data analysis. 10-fold cross-validation was used to split the dataset
into training and testing data. The dataset was then processed using the
random forest algorithm.
Description of the algorithms: Machine learning is the ability of a
computer to learn automatically from experience. A machine can learn in
three ways.
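The workflow described at the start of this section (a 303-sample dataset, 10-fold cross-validation, a random forest classifier) can be sketched roughly as follows. This is only an illustration and not the notebook used for the study: synthetic data from scikit-learn's make_classification stands in for the Kaggle CSV so that the block runs without the file, and the number of trees, the use of stratified folds and the random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# ASSUMPTION: synthetic stand-in data. 303 rows and 13 predictor columns
# mirror the dataset size (14 attributes = 13 predictors + 1 target).
# With the real file, X and y would come from the downloaded CSV instead.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

# ASSUMPTION: 100 trees and a fixed seed; the report does not state
# the hyperparameters that were used.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# 10-fold cross-validation: each fold serves once as the test set while
# the other nine folds are used for training.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Because the data here is synthetic, the printed accuracy says nothing about heart disease; only the structure of the procedure carries over.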
Table 2. Features for data prediction
Attribute   Meaning
Age1        Age: continuous
Gender 1    1 = male, 0 = female
Cp1         Chest pain
Trestbps    Resting blood pressure on admission to hospital: continuous (mmHg)
chol        Cholesterol level in mg/dl
Fbs1        Fasting blood sugar: 0 = <=120 mg/dl, 1 = >120 mg/dl
restecg     Resting electrocardiographic results: 1 = true, 0 = false
thalach     Maximum heart rate achieved: continuous
exang       Exercise-induced angina
oldpeak     ST depression
slope       ST segment slope
ca          Number of major vessels coloured by fluoroscopy: discrete (0, 1, 2, 3)
thal        3 = normal, 6 = fixed defect, 7 = reversible defect
Table 3. Features for data prediction
Figure 3. ROC curve obtained using random forest algorithm
The ROC curve plots the true positive rate against the false positive
rate at different threshold levels. From the ROC curve we obtained an
AUC value of 93.3%, which means that in 93.3% of cases the model ranks a
randomly chosen patient with heart disease above a randomly chosen
patient without it.
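To make the link between the ROC curve and the AUC value concrete, the following minimal sketch builds the curve from predicted scores and integrates it with the trapezoid rule. The six labels and scores are invented toy values, not results from this study, and the function names are hypothetical; in practice scikit-learn's roc_curve and roc_auc_score do the same job.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the given points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def pairwise_rank_probability(y_true, y_score):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 1 = disease, 0 = no disease.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

area = auc_trapezoid(roc_points(y_true, y_score))
rank_prob = pairwise_rank_probability(y_true, y_score)
print(f"AUC = {area:.3f}, pairwise ranking probability = {rank_prob:.3f}")
```

Both quantities agree on this toy data, which illustrates why AUC is read as a ranking probability rather than as classification accuracy.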
CONCLUSION:
In this paper the random forest data mining algorithm was implemented
for the prediction of heart disease. From the experimental work we
obtained a sensitivity of 90.6%, a specificity of 82.7% and a
classification accuracy of 86.9%, with a diagnosis rate (AUC) of 93.3%.
The proposed system can also be applied to the prediction of other
diseases, and other machine learning algorithms such as Naïve Bayes,
decision tree, K-NN, linear regression and fuzzy logic could be tried
for better accuracy. Cloud computing technology could also be used with
the proposed system to manage large volumes of patient data.
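For reference, sensitivity, specificity and accuracy all follow from the four cells of a confusion matrix. The report does not give its confusion matrix, so the counts in the sketch below are hypothetical, picked only because they yield values close to (not exactly equal to) the reported figures; the function name is likewise illustrative.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # diseased patients correctly flagged
    specificity = tn / (tn + fp)                 # healthy patients correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # all correct predictions
    return sensitivity, specificity, accuracy


# HYPOTHETICAL counts for a 61-patient test set (not taken from the report).
sens, spec, acc = classification_metrics(tp=29, fn=3, tn=24, fp=5)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

A high sensitivity with a lower specificity, as in the reported results, means the model misses few patients with heart disease at the cost of more false alarms among healthy patients.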
REFERENCES:
[1] Chen, A. H., Huang, S. Y., Hong, P. S., Cheng, C. H., & Lin, E. J.
(2011, September). HDPS: Heart disease prediction system. In 2011
Computing in Cardiology (pp. 557-560). IEEE.
[2] Shetty, D., Rit, K., Shaikh, S., & Patil, N. (2017). Diabetes
disease prediction using data mining. In 2017 International Conference
on Innovations in Information, Embedded and Communication Systems
(ICIIECS) (pp. 1-5). IEEE.
[5] Rajdhan, A., Agarwal, A., Sai, M., Ravi, D., & Ghuli, P. Heart
disease prediction using machine learning. International Journal of
Engineering Research & Technology.
[6] Singh, A., & Kumar, R. (2020). Heart disease prediction using
machine learning algorithms. In 2020 International Conference on
Electrical and Electronics Engineering (ICE3).
doi:10.1109/ice348803.2020.9122958