Bhabesh - Chapter 5
This chapter describes the heart disease datasets used to train and test the models in this
study, together with the Python machine-learning tool. In addition, a number of data
pre-processing techniques are discussed, and the feature selection techniques relevant to this
study, along with their working principles, are explained.
5.1 INTRODUCTION
Heart disease describes a range of conditions that affect your heart. Diseases under the heart
disease umbrella include blood vessel diseases, such as coronary artery disease, heart rhythm
problems (arrhythmias) and heart defects you’re born with (congenital heart defects), among
others. The term “heart disease” is often used interchangeably with the term “cardiovascular
disease”. Cardiovascular disease generally refers to conditions that involve narrowed or
blocked blood vessels that can lead to a heart attack, chest pain (angina) or stroke. Other heart
conditions, such as those that affect your heart’s muscle, valves or rhythm, also are considered
forms of heart disease. Coronary Heart Disease (CHD) is the most common type of heart
disease, killing over 370,000 people annually. Researchers apply several data mining and
machine learning techniques to analyse huge complex medical data, helping healthcare
professionals to predict heart disease [505].
The datasets used for this research are freely available from kaggle.com, an online
community platform for data scientists and machine learning enthusiasts. Kaggle allows users
to collaborate with other users, find and publish datasets, use GPU-integrated notebooks, and
compete with other data scientists to solve data science challenges. The aim of this online
platform (founded in 2010 by Anthony Goldbloom and Jeremy Howard and acquired by
Google in 2017) is to help professionals and learners reach their goals in their data science
journey with the powerful tools and resources it provides. As of 2021, there are over 8 million
registered users on Kaggle. It is difficult to identify heart disease because of several
contributory risk factors, such as diabetes, high blood pressure, high cholesterol, abnormal
pulse rate and many other factors [506].
Python is a computer programming language often used to build websites and software,
automate tasks, and conduct data analysis. Python is a general-purpose language, meaning it
can be used to create a variety of different programs and is not specialized for any specific
problems. This versatility, along with its beginner-friendliness, has made it one of the most-
used programming languages today.
Python was developed by Guido van Rossum, a Dutch programmer, in the late 1980s. He
began working on the language in December 1989 and released the first version, Python 0.9.0,
in February 1991. Guido van Rossum continued to develop and refine Python over the years,
and he remained the "Benevolent Dictator For Life" (BDFL) of the Python community until
he stepped down from that role in July 2018.
Python has become one of the most popular programming languages in the world in recent
years. It is used in everything from machine learning to building websites and software testing.
It can be used by developers and non-developers alike. Stack Overflow's 2022 Developer
Survey revealed that Python is the fourth most popular programming language, with
respondents saying that they use Python almost 50 percent of the time in their development
work. Survey results also showed that Python is tied with Rust as the most-wanted
technology, with 18 percent of developers who are not using it already saying that they are
interested in learning Python. Python is commonly used for developing websites and
software, task automation, data analysis, and data visualization. Venkata R.K Kolla (2015)
compared different ML classification algorithms for the diagnosis of heart disease using the
Python ML tool [507]. Ritika Chadha et al (2016) analysed different data mining techniques,
such as Artificial Neural Networks, Decision Tree and Naive Bayes, to predict heart disease
in a Python environment [508]. Santhana K.J et al (2019) processed their heart disease
datasets in Python by applying two main ML algorithms, namely Decision Tree and Naive
Bayes [509]. Archana Singh et al (2020) proposed a method to calculate the accuracy of ML
algorithms for predicting heart disease using different classification techniques in the Python
programming environment [510]. Since it's relatively easy to learn, Python has been
adopted by many non-programmers such as accountants and scientists, for a variety of
everyday tasks, like organizing finances.
Data analysis and machine learning: Python has become a staple in data science,
allowing data analysts and other professionals to use the language to conduct complex
statistical calculations, create data visualizations, build machine learning algorithms,
manipulate and analyze data, and complete other data-related tasks. Python can build a
wide range of different data visualizations, like line and bar graphs, pie charts,
histograms, and 3D plots. Python also has a number of libraries that enable coders to
write programs for data analysis and machine learning more quickly and efficiently,
like TensorFlow and Keras.
Web development: Python is often used to develop the back end of a website or
application—the parts that a user doesn’t see. Python’s role in web development can
include sending data to and from servers, processing data and communicating with
databases, URL routing, and ensuring security. Python offers several frameworks for
web development. Commonly used ones include Django and Flask. Some web
development jobs that use Python include back-end engineers, full stack engineers,
Python developers, software engineers, and DevOps engineers.
Software testing and prototyping: In software development, Python can aid in tasks
like build control, bug tracking, and testing. With Python, software developers can
automate testing for new products or features. Some Python tools used for software
testing include Green and Requestium.
Everyday tasks: Python isn't only for programmers and data scientists. Learning
Python can open new possibilities for those in less data-heavy professions, like
journalists, small business owners, or social media marketers. Python can also enable
non-programmers to simplify certain tasks in their lives. Here are just a few of the
tasks you could automate with Python:
a) Keeping track of stock market or crypto prices
b) Sending yourself a text reminder to carry an umbrella anytime it's raining
c) Updating your grocery shopping list
d) Renaming large batches of files
e) Converting text files to spreadsheets
f) Randomly assigning chores to family members
g) Filling out online forms automatically
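As a minimal sketch of item (d), renaming large batches of files needs only the Python standard library; the folder and naming scheme below are hypothetical.

```python
from pathlib import Path

def batch_rename(folder, prefix):
    """Rename every .txt file in `folder` to prefix_000.txt, prefix_001.txt, ...
    Files are processed in sorted order so the numbering is deterministic."""
    files = sorted(Path(folder).glob("*.txt"))
    for i, path in enumerate(files):
        path.rename(path.with_name(f"{prefix}_{i:03d}.txt"))
    return len(files)
```

Calling `batch_rename("scans", "report")` would turn an arbitrary pile of .txt files into report_000.txt, report_001.txt, and so on.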
1. Scikit-Learn:
Scikit-Learn is one of the most widely used libraries for machine learning in
Python.
It provides simple and efficient tools for data preprocessing, feature selection,
model selection, and evaluation.
Includes a wide range of algorithms for classification, regression, clustering,
dimensionality reduction, and more.
2. TensorFlow:
Developed by Google, TensorFlow is an open-source deep learning
framework.
Widely used for building neural networks and deep learning models.
Provides high-level APIs like Keras for easier model development.
3. PyTorch:
Developed by Facebook's AI Research lab, PyTorch is another popular deep
learning framework.
Known for its dynamic computation graph, which makes it more flexible for
certain tasks.
Gaining popularity in both academia and industry.
4. Keras:
Keras is a high-level neural networks API that runs on top of TensorFlow,
Theano, or CNTK.
Designed for fast prototyping and experimentation with deep learning models.
Offers a simple and intuitive API for building and training neural networks.
5. Pandas:
Pandas is a library for data manipulation and analysis.
It provides data structures like DataFrames for handling structured data.
Essential for data preprocessing and cleaning before feeding data to ML
models.
6. NumPy:
NumPy is a fundamental library for numerical operations in Python.
It provides support for multidimensional arrays and matrices, which are
essential for handling data in ML.
7. Matplotlib and Seaborn:
Matplotlib and Seaborn are libraries for data visualization in Python.
They allow you to create various types of plots and graphs to visualize your
data and model results.
8. Jupyter Notebook:
Jupyter Notebook is an interactive environment for data science and ML.
It allows you to write and execute code in a notebook-style interface, making it
easy to document and share your work.
9. XGBoost:
XGBoost is a popular gradient boosting library for classification and
regression tasks.
Known for its efficiency and effectiveness in handling structured data.
10. SciPy:
SciPy is built on top of NumPy and provides additional scientific and technical
computing functionalities.
Useful for advanced optimization, integration, interpolation, and more.
11. Statsmodels:
Statsmodels is a library for estimating and interpreting statistical models.
It's helpful for traditional statistical analysis and hypothesis testing.
12. NLTK (Natural Language Toolkit):
NLTK is a library for natural language processing (NLP).
It provides tools and resources for tasks like text classification, sentiment
analysis, and language modeling.
These tools and libraries, along with many others in the Python ecosystem, make Python a
powerful and versatile platform for machine learning and data science projects. You can use
them to build, train, and deploy machine learning models for a wide range of applications,
from image recognition to natural language processing and more.
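As a minimal sketch of how these libraries work together, the example below uses NumPy to generate a small synthetic table, pandas to hold it, and Scikit-Learn to train and score a classifier; the data is artificial and stands in for no particular dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data: the label depends only on the
# first two columns, mimicking a table with informative and noise features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
df = pd.DataFrame(X, columns=["f1", "f2", "f3", "f4"])

# Hold out a quarter of the rows for testing, then fit and score.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```

The same train/fit/score pattern carries over unchanged when the synthetic table is replaced by a real dataset loaded with pandas.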
5.3.1 The heart disease dataset I (DS-I)
This dataset, also known as the Cleveland heart disease dataset, consists of records for 303
individuals. There are 14 columns in the dataset, one of which is the target variable. The
original dataset contained 76 features, but, as per different studies, only 14 are found to be
the most suitable for research; these are described below.
Table 5.1: Description of dataset I
'age' (age of patient in years): Age is the most important risk factor in developing
cardiovascular or heart diseases, with approximately a tripling of risk with each decade of
life. Coronary fatty streaks can begin to form in adolescence. It is estimated that 82 percent
of people who die of coronary heart disease are 65 and older. Simultaneously, the risk of
stroke doubles every decade after age 55.

'sex' (sex of patient; 1: male, 0: female): Men are at greater risk of heart disease than
pre-menopausal women. Once past menopause, it has been argued that a woman's risk is
similar to a man's, although more recent data from the WHO and UN disputes this. If a
female has diabetes, she is more likely to develop heart disease than a male with diabetes.

'cp' (chest pain type; 0: typical angina, 1: atypical angina, 2: non-anginal pain, 3: no
symptom): Chest pain, also called angina, is discomfort caused when the heart muscle does
not get enough oxygen-rich blood. It may feel like pressure or squeezing in the chest. The
discomfort can also occur in the shoulders, arms, neck, jaw, or back. Angina pain may even
feel like indigestion.

'trestbps' (resting blood pressure): Over time, high blood pressure can damage the arteries
that feed the heart. High blood pressure that occurs with other conditions, such as obesity,
high cholesterol or diabetes, increases the risk even more.

'chol' (serum cholesterol level): A high level of low-density lipoprotein (LDL) cholesterol
(the "bad" cholesterol) is most likely to narrow arteries. A high level of triglycerides, a type
of blood fat related to diet, also raises the risk of a heart attack. However, a high level of
high-density lipoprotein (HDL) cholesterol (the "good" cholesterol) lowers the risk of a
heart attack.

'fbs' (fasting blood sugar greater than 120; 1: true, 0: false): Not producing enough of the
hormone secreted by the pancreas (insulin), or not responding to insulin properly, causes
the body's blood sugar levels to rise, increasing the risk of a heart attack.

'restecg' (resting electrocardiographic result; 0: nothing to note, 1: ST-T wave abnormality,
2: possible or definite left ventricular hypertrophy): For people at low risk of cardiovascular
disease, the USPSTF concludes with moderate certainty that the potential harms of screening
with resting or exercise ECG equal or exceed the potential benefits. For people at
intermediate to high risk, current evidence is insufficient to assess the balance of benefits
and harms of screening.

'thalach' (maximum heart rate achieved): The increase in cardiovascular risk associated with
the acceleration of heart rate is comparable to the increase in risk observed with high blood
pressure. It has been shown that an increase in heart rate by 10 beats per minute is
associated with an increase in the risk of cardiac death by at least 20%, which is similar to
the increase in risk observed with a rise in systolic blood pressure of 10 mm Hg.

'exang' (exercise-induced angina; 1: yes, 0: no): The pain or discomfort associated with
angina usually feels tight, gripping or squeezing, and can vary from mild to severe. Angina
is usually felt in the centre of the chest but may spread to either or both shoulders, or to the
back, neck, jaw or arm; it can even be felt in the hands. The types of angina are (a) stable
angina (angina pectoris), (b) unstable angina, (c) variant (Prinzmetal) angina, and
(d) microvascular angina.

'oldpeak' (ST depression induced by exercise relative to rest): ST depression induced by
exercise, relative to rest, as observed in the ECG test.

'slope' (slope of the peak-exercise ST segment; 0: upsloping, 1: flat, 2: downsloping): The
slope of the ST depression during peak exercise.

'ca' (number of major vessels, 0 to 3): The number of major blood vessels that can be
visualized using fluoroscopy, ranging from 0 to 3.

'thal' (thallium stress test result; 3: normal, 6: fixed defect, 7: reversible defect):
Thalassemia is a blood disorder caused by abnormal hemoglobin production, with a score of
3 indicating normal production, 6 indicating a permanent (fixed) deficiency, and 7 signifying
a temporary (reversible) impairment.

'target' (heart disease indicator of the patient; 1: yes, 0: no): This is the final indicator of the
disease, also called the predicted class; it indicates a binary state.
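In practice, DS-I would be loaded and sanity-checked with pandas before any modelling. The sketch below assumes the Kaggle file is saved as heart.csv (a hypothetical filename); to keep the example self-contained and runnable, a two-row in-memory sample with the same 14 columns is used instead of the real file.

```python
import pandas as pd

# Column names of DS-I (the 14 retained Cleveland features, target last).
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

def inspect(df):
    """Basic pre-modelling checks: row count, feature count (excluding the
    target column), missing values, and the class balance of the target."""
    return {
        "n_rows": len(df),
        "n_features": df.shape[1] - 1,
        "missing": int(df.isna().sum().sum()),
        "class_counts": df["target"].value_counts().to_dict(),
    }

# With the real file this would be: df = pd.read_csv("heart.csv")
sample = pd.DataFrame([[63, 1, 0, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 3, 1],
                       [57, 0, 1, 130, 236, 0, 2, 174, 0, 0.0, 1, 1, 3, 0]],
                      columns=COLUMNS)
report = inspect(sample)
```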
5.3.2 The heart disease dataset II (DS-II)
This heart disease dataset, also known as the Cardiovascular Disease dataset, is likewise
available on the Kaggle web portal. It consists of information on 70,000 patients, and the
input features are categorised in three sections: Objective (factual information), Examination
(results of medical examination) and Subjective (information given by the patient). As per
the source of the dataset, all of the values were collected and recorded at the time of
medical examination.
(Source: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset )
This dataset consists of 11 features and one target attribute.
Table 5.2: Description of dataset II
Features Description of the attribute
'age' age of patient in days
'height' height of patient in centimetres
'weight' weight of patient in kilograms
'gender' sex of patient (categorical information)
'ap_hi' systolic (upper-level) blood pressure (integer value)
primarily focused on removing non-informative or redundant predictors from the model
[501].
The input variables that we give to our machine learning models are called features. Each
column in our dataset constitutes a feature. To train an optimal model, we need to make sure
that we use only the essential features. If we have too many features, the model can capture
the unimportant patterns and learn from noise. The method of choosing the important
parameters of our data is called Feature Selection. Less complex models are less likely to
overfit the data.
where retraining needs to happen in real-time, one might need to limit oneself to a
couple of best features.
4. Deployment effort: The more features, the more complex the machine learning
system becomes in production. This poses multiple risks, including but not limited
to high maintenance effort, entanglement, undeclared consumers, or correction
cascades.
5. Interpretability: With too many features, we lose the explainability of the model.
While not always the primary modeling goal, interpreting and explaining the
model’s results are often important and, in some regulated domains, might even
constitute a legal requirement.
7. Data-model compatibility: Finally, there is the issue of data-model compatibility.
While, in principle, the approach should be data-first, which means collecting and
preparing high-quality data and then choosing a model which works well on this
data, real life may have it the other way around.
Feature selection models are broadly categorised into supervised and unsupervised
approaches.
In filter methods, a subset of features is selected based on its relationship to the target
variable; the selection does not depend on any machine learning algorithm. Instead, filter
methods measure the "relevance" of the features to the output via statistical tests. Filter
methods are found to be fast, scalable, computationally simple and independent of the
classifier [512]. A number of techniques are available to remove unimportant features from
a dataset; some of the most commonly used are:
Univariate selection (ANOVA: Analysis of variance)
Chi Square
Based on Pearson’s correlation
Linear discriminant analysis (LDA): It is used to find a linear combination of features
that characterizes or separates two or more classes of a categorical variable.
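In Scikit-Learn these univariate filter techniques are exposed through SelectKBest, with f_classif providing the ANOVA F-score and chi2 the chi-square score. A minimal sketch on synthetic data, in which only the first two of five features carry any signal:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Five features: columns 0-1 determine the class, columns 2-4 are noise.
rng = np.random.default_rng(1)
informative = rng.normal(size=(300, 2))
noise = rng.normal(size=(300, 3))
y = (informative.sum(axis=1) > 0).astype(int)
X = np.hstack([informative, noise])

# Keep the two features with the highest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
kept = selector.get_support(indices=True)
```

Note that swapping f_classif for chi2 would require non-negative feature values, since the chi-square test operates on frequencies.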
We can calculate Pearson correlation coefficient (r) using the following formula.
r = (nΣxy − ΣxΣy) / √( [nΣx² − (Σx)²] [nΣy² − (Σy)²] ) .................. (5.1)
In the above formula, the variables are represented by x and y, where x is the independent
variable, y is the dependent variable. In addition, the value n represents the sample size, and
summation of all values can be represented by the summation symbol Σ.
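Equation (5.1) can be verified numerically; the sketch below computes r from the raw sums exactly as written, and agrees with NumPy's built-in corrcoef.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient from the raw-sums form of eq. (5.1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
                  (n * np.sum(y ** 2) - np.sum(y) ** 2))
    return num / den

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
```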
the chi-square value, which is proportional to the difference between the observed frequencies
and the expected frequencies of the categories. The higher the chi-square value, the more
dependent the two variables are, and therefore the more relevant the characteristic is. Alisha
et al [502] mentioned in their study that categories with a very low frequency of occurrence
might affect the ranks calculated using feature selection techniques, and this can hamper the
final performance of the classifiers. The chi-squared value can be calculated using the
following formula.
χ² = Σ (O − E)² / E .................. (5.2)
Where O stands for observed or actual value and E represents expected value.
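Equation (5.2) can be evaluated directly; the frequencies below are illustrative, with the expected counts taken as uniform across four categories.

```python
import numpy as np

def chi_square(observed, expected):
    """Chi-squared statistic of eq. (5.2): the sum of (O - E)^2 / E."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return np.sum((observed - expected) ** 2 / expected)

# Illustrative category counts vs. a uniform expectation of 25 each.
O = [18, 22, 20, 40]
E = [25, 25, 25, 25]
stat = chi_square(O, E)  # larger values indicate stronger dependence
```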
5.4.3.3 ANOVA
ANOVA (Analysis of Variance) is a statistical method commonly used in machine learning
for comparing the means of two or more groups. It was developed by Ronald Fisher in the
early twentieth century. In the context of machine learning, ANOVA can be used for
feature selection or to analyze the significance of different features in predicting the target
variable. ANOVA is particularly useful when dealing with categorical independent variables
and a continuous dependent variable. It helps in determining whether there are significant
differences in the means of the dependent variable across the categories of the independent
variable.
The ANOVA filter method, also known as ANOVA F-value feature selection, ranks features
by the F-statistic computed between each feature and the target; it is particularly useful for
classification tasks whose input features are continuous.
In general, there are three types of ANOVA:
One-Way ANOVA: The one-way analysis of variance is also known as single-factor
ANOVA or simple ANOVA. It is suitable for experiments with only one independent
variable (factor) with two or more levels. One-Way-ANOVA method for feature
selection is a technique that analyzes the experimental data such that, one or more
response variables are calculated under various conditions identified by one or more
classification variables [504].
Two-way ANOVA: It is also called full Factorial ANOVA and is used when there are
two or more independent variables. Each of these factors can have multiple levels. It
can compare two or more factors, which means it can check the effect of two
independent variables on a single dependent variable.
N-way ANOVA: An analysis of variance test is considered an N-way ANOVA test if the
user uses more than two independent variables, where N represents the number of
independent variables in the experiment. (N-way ANOVA should not be confused with
MANOVA, which involves more than one dependent variable.)
The mathematical implementation of ANOVA involves a number of calculations, step by
step, which can be understood with the help of the following description.
Step 1: Calculating overall mean (in the first step we need to calculate the mean of all
observations in the sample).
Step 2: From the overall mean, find the sum of squared deviations of every observation.
Sum of squares total: SST = Σ (X − X̄)²
Step 3: Find the sum of squared deviations of the observations within every group.
Sum of squares within: SSW = Σ (Xi − X̄i)²
Step 4: Calculate the sum of squared deviations of every group mean from the overall mean.
Sum of squares between: SSB = Σ Ni (X̄i − X̄)²
Here, Ni indicates number of observations in each group,
X̄ i indicates mean of each group,
and X̄ is the overall mean.
Step 5: Find the degrees of freedom of each of the three sums of squares.
degrees of freedom (df) can be calculated as
df(SST) = N − 1
df(SSW) = N − k
df(SSB) = k − 1
Here, N is the total no. of observations and k is the number of groups.
Step 6: Now, to calculate the mean squares, we need to divide the sum of squares by the
respective degree of freedom
MSW = SSW / ( N – k )
MSB = SSB / ( k – 1 )
Step 7: Calculating F-statistic
F = MSB / MSW
Step 8: Calculate the p-value and compare the result. If the p-value is smaller than the
chosen significance level, the means of the groups may be considered significantly
different.
Nadir O.F Elssied et al (2014), in their research, applied feature selection based on a
one-way ANOVA F-test statistics scheme to determine the most important features
contributing to e-mail spam classification [503]. It is important to note that filter methods
are much faster than other methods, as they do not involve training any models, whereas
wrapper methods are computationally much more expensive because they are entirely based
on machine learning models. Another interesting point about filter methods is that they
might fail to find the best subset of features in many situations, whereas wrapper methods
can usually provide a better subset, although they are very time-consuming.
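For contrast with the filter methods discussed here, a wrapper-style selection can be sketched with Scikit-Learn's RFE (recursive feature elimination), which repeatedly retrains a model and discards the weakest feature; the synthetic data below is illustrative only.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Five features, but the label depends only on columns 0 and 3.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = (2 * X[:, 0] - 3 * X[:, 3] > 0).astype(int)

# Wrapper: each elimination round refits the estimator, which is why
# wrapper methods cost much more than filters as the feature count grows.
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2).fit(X, y)
kept = np.where(rfe.support_)[0]
```

Because the model is retrained at every round, the wrapper can account for feature interactions that a univariate filter would miss, at the computational cost described above.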
5.6 SUMMARY
In this chapter, the details of the primary heart disease datasets used in this research are
discussed. The Cleveland heart disease dataset is referred to here as DS-I, and the other
dataset, also known as the Cardiovascular Disease dataset, is referred to as DS-II. The first
dataset contains 303 records with 14 features, whereas the second contains 70,000 patient
records with 11 features. The entire set of experiments conducted during this study was
carried out with the help of the Python ML tool. The robust toolbox provided by Python's
machine learning (ML) environment enables researchers and developers to create effective
models for a variety of applications. The different libraries available in the Python
programming language are also discussed here. Since the primary aim of this study is to
improve classifier performance with a minimum number of features, different feature
selection techniques are also included. Feature selection is the process of selecting a subset
of the original features in order to minimise the feature space as much as possible while
meeting certain criteria. The different models of feature selection are explained with the
help of a diagram, and the important techniques of filter feature selection models are
discussed mathematically. Finally, all the experimental results, figures, and comparison
tables are included in the next chapter (Chapter 6).
REFERENCES
[501] Applied Predictive Modeling (2013), Springer, Page 488
[502] Alisha Sikri, N. P. Sing, Surjeet Dalal, International Journal of Intelligent Systems and
Applications in Engineering (IJISAE), 2023, ISSN:2147-6799
[503] Nadir Omer Fadl Elssied, Othman Ibrahim and Ahmed Hamza Osman, A Novel
Feature Selection Based on One-Way ANOVA F-Test for E-Mail Spam Classification,
Research Journal of Applied Sciences, Engineering and Technology, ISSN: 2040-7459, July
2014, Page: 625-638
[504] Arowolo, Abdulsalam, Saheed, Y.K. and Salawu, A Feature Selection Based on One-
Way-Anova for Microarray Data Classification, Al-Hikmah Journal of Pure & Applied
Sciences, Vol. 3 (2016), Page: 30-35
[505] Shah, D., Patel, S. & Bharti, S.K. Heart Disease Prediction using Machine Learning
Techniques, SN Computer. Science. 1, 345 (2020), Publisher: Springer,
https://doi.org/10.1007/s42979-020-00365-y
[506] S. Mohan, C. Thirumalai and G. Srivastava, Effective Heart Disease Prediction Using
Hybrid Machine Learning Techniques, in IEEE Access, vol. 7, pp. 81542-81554, 2019, doi:
10.1109/ACCESS.2019.2923707.
[507] Kolla, Venkata Ravi Kiran, Heart Disease Diagnosis Using Machine Learning
Techniques In Python: A Comparative Study of Classification Algorithms For Predictive
Modeling (September 6, 2015). International Journal of Electronics and Communication
Engineering & Technology, 2015, Available at SSRN: https://ssrn.com/abstract=4413723
[508] Chadha, R., Mayank, S. Prediction of heart disease using data mining techniques, CSI
Transactions on ICT 4, 193–198 (2016). Publisher: Springer, https://doi.org/10.1007/s40012-
016-0121-0
[509] Santhana. K.J. and G.S., Prediction of Heart Disease Using Machine Learning
Algorithms, 2019 1st International Conference on Innovations in Information and
Communication Technology (ICIICT), Chennai, India, 2019, pp. 1-5, doi:
10.1109/ICIICT1.2019.8741465.
[510] Archana Singh and R. Kumar, Heart Disease Prediction Using Machine Learning
Algorithms, 2020 International Conference on Electrical and Electronics Engineering (ICE3),
Gorakhpur, India, 2020, pp. 452-457, doi: 10.1109/ICE348803.2020.9122958.
[511] Akshit J. Dhruv, Reema Patel and Nishant Doshi, Python: The Most Advanced
Programming Language for Computer Science Applications, International Conference on
Culture Heritage, Education, Sustainable Tourism, and Innovation Technologies (CESIT
2020), pages 292-299, ISBN: 978-989-758-501-2, DOI: 10.5220/0010307900003051
[512] Yap B. Wah, N. Ibrahim, H.A. Hamid, S.A. Rahman and S. Fong, Feature Selection
Methods: Case of Filter and Wrapper Approaches for Maximising Classification Accuracy,
Pertanika J. Sci. & Technol. 26 (1): 329 - 340 (2018)
[513] Andrea Bommert, Xudong Sun, Bernd Bischl, Jörg Rahnenführer, Michel Lang,
Benchmark for filter methods for feature selection in high-dimensional classification data,
Computational Statistics & Data Analysis, Volume 143, 2020, 106839, ISSN 0167-9473,
https://doi.org/10.1016/j.csda.2019.106839.
[514] Mahdieh Labani, Parham Moradi, Fardin Ahmadizar, Mahdi Jalili, A novel multivariate
filter method for feature selection in text classification problems, Engineering Applications of
Artificial Intelligence, Volume 70, 2018, Pages 25-37, ISSN 0952-1976,
https://doi.org/10.1016/j.engappai.2017.12.014.