Bhabesh - Chapter 5
This chapter describes the heart disease datasets used to train and test the models in this
study, together with the Python machine-learning tool. In addition, a number of data
pre-processing techniques are discussed, and the feature selection techniques relevant to this
study, along with their working principles, are explained.
5.1 INTRODUCTION
Heart disease describes a range of conditions that affect your heart. Diseases under the heart
disease umbrella include blood vessel diseases, such as coronary artery disease, heart rhythm
problems (arrhythmias) and heart defects you’re born with (congenital heart defects), among
others. The term “heart disease” is often used interchangeably with the term “cardiovascular
disease”. Cardiovascular disease generally refers to conditions that involve narrowed or
blocked blood vessels that can lead to a heart attack, chest pain (angina) or stroke. Other heart
conditions, such as those that affect your heart’s muscle, valves or rhythm, also are considered
forms of heart disease. Coronary Heart Disease (CHD) is the most common type of heart
disease, killing over 370,000 people annually. Researchers apply several data mining and
machine learning techniques to analyse huge complex medical data, helping healthcare
professionals to predict heart disease [505].
The datasets used for this research are freely available from kaggle.com, an online
community platform for data scientists and machine learning enthusiasts. Kaggle allows users
to collaborate with other users, find and publish datasets, use GPU-integrated notebooks, and
compete with other data scientists to solve data science challenges. The aim of this online
platform (founded in 2010 by Anthony Goldbloom and Jeremy Howard and acquired by
Google in 2017) is to help professionals and learners reach their goals in their data science
journey with the powerful tools and resources it provides. As of 2021, there are over 8 million
registered users on Kaggle. It is difficult to identify heart disease because of several
contributory risk factors, such as diabetes, high blood pressure, high cholesterol, abnormal
pulse rate and many other factors [506].
Python is a computer programming language often used to build websites and software,
automate tasks, and conduct data analysis. Python is a general-purpose language, meaning it
can be used to create a variety of different programs and is not specialized for any specific
problems. This versatility, along with its beginner-friendliness, has made it one of the most-
used programming languages today.
Python was developed by Guido van Rossum, a Dutch programmer, in the late 1980s. He
began working on the language in December 1989 and released the first version, Python 0.9.0,
in February 1991. Guido van Rossum continued to develop and refine Python over the years,
and he remained the "Benevolent Dictator For Life" (BDFL) of the Python community until
he stepped down from that role in July 2018.
Python has become one of the most popular programming languages in the world in recent
years. It is used in everything from machine learning to building websites and software testing.
It can be used by developers and non-developers alike. Stack Overflow's 2022 Developer
Survey revealed that Python is the fourth most popular programming language, with
respondents saying that they use Python almost 50 percent of the time in their development
work. Survey results also showed that Python is tied with Rust as the most-wanted
technology, with 18 percent of developers who are not using it already saying that they are
interested in learning Python. Python is commonly used for developing websites and
software, task automation, data analysis, and data visualization. Venkata R.K Kolla (2015)
compared different ML classification algorithms for the diagnosis of heart disease using the
Python ML tool [507]. Ritika Chadha et al (2016) analysed different data mining techniques,
such as Artificial Neural Networks, Decision Tree and Naive Bayes, to predict heart disease
in a Python environment [508]. Santhana K.J et al (2019) processed their heart disease
datasets in Python by applying two main ML algorithms, namely Decision Tree and Naive
Bayes [509]. Archana Singh et al (2020) proposed a method to calculate the accuracy of ML
algorithms for predicting heart disease using different classification techniques in the Python
programming environment [510]. Since it's relatively easy to learn, Python has been
adopted by many non-programmers such as accountants and scientists, for a variety of
everyday tasks, like organizing finances.
Data analysis and machine learning: Python has become a staple in data science,
allowing data analysts and other professionals to use the language to conduct complex
statistical calculations, create data visualizations, build machine learning algorithms,
manipulate and analyze data, and complete other data-related tasks. Python can build a
wide range of different data visualizations, like line and bar graphs, pie charts,
histograms, and 3D plots. Python also has a number of libraries that enable coders to
write programs for data analysis and machine learning more quickly and efficiently,
like TensorFlow and Keras.
Web development: Python is often used to develop the back end of a website or
application—the parts that a user doesn’t see. Python’s role in web development can
include sending data to and from servers, processing data and communicating with
databases, URL routing, and ensuring security. Python offers several frameworks for
web development. Commonly used ones include Django and Flask. Some web
development jobs that use Python include back-end engineers, full stack engineers,
Python developers, software engineers, and DevOps engineers.
Software testing and prototyping: In software development, Python can aid in tasks
like build control, bug tracking, and testing. With Python, software developers can
automate testing for new products or features. Some Python tools used for software
testing include Green and Requestium.
Everyday tasks: Python isn't only for programmers and data scientists. Learning
Python can open new possibilities for those in less data-heavy professions, like
journalists, small business owners, or social media marketers. Python can also enable
non-programmers to simplify certain tasks in their lives. Here are just a few of the
tasks you could automate with Python:
a) Keeping track of stock market or crypto prices
b) Sending yourself a text reminder to carry an umbrella anytime it's raining
c) Updating your grocery shopping list
d) Renaming large batches of files
e) Converting text files to spreadsheets
f) Randomly assigning chores to family members
g) Filling out online forms automatically
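As a minimal sketch of item (d), renaming large batches of files needs only the Python standard library; the folder and naming scheme below are hypothetical.

```python
from pathlib import Path

def batch_rename(folder, prefix):
    """Rename every .txt file in `folder` to prefix_000.txt, prefix_001.txt, ...
    Files are processed in sorted order so the numbering is deterministic."""
    files = sorted(Path(folder).glob("*.txt"))
    for i, path in enumerate(files):
        path.rename(path.with_name(f"{prefix}_{i:03d}.txt"))
    return len(files)
```

Calling `batch_rename("scans", "report")` would turn an arbitrary pile of .txt files into report_000.txt, report_001.txt, and so on.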
1. Scikit-Learn:
Scikit-Learn is one of the most widely used libraries for machine learning in
Python.
It provides simple and efficient tools for data preprocessing, feature selection,
model selection, and evaluation.
Includes a wide range of algorithms for classification, regression, clustering,
dimensionality reduction, and more.
2. TensorFlow:
Developed by Google, TensorFlow is an open-source deep learning
framework.
Widely used for building neural networks and deep learning models.
Provides high-level APIs like Keras for easier model development.
3. PyTorch:
Developed by Facebook's AI Research lab, PyTorch is another popular deep
learning framework.
Known for its dynamic computation graph, which makes it more flexible for
certain tasks.
Gaining popularity in both academia and industry.
4. Keras:
Keras is a high-level neural networks API that runs on top of TensorFlow,
Theano, or CNTK.
Designed for fast prototyping and experimentation with deep learning models.
Offers a simple and intuitive API for building and training neural networks.
5. Pandas:
Pandas is a library for data manipulation and analysis.
It provides data structures like DataFrames for handling structured data.
Essential for data preprocessing and cleaning before feeding data to ML
models.
6. NumPy:
NumPy is a fundamental library for numerical operations in Python.
It provides support for multidimensional arrays and matrices, which are
essential for handling data in ML.
7. Matplotlib and Seaborn:
Matplotlib and Seaborn are libraries for data visualization in Python.
They allow you to create various types of plots and graphs to visualize your
data and model results.
8. Jupyter Notebook:
Jupyter Notebook is an interactive environment for data science and ML.
It allows you to write and execute code in a notebook-style interface, making it
easy to document and share your work.
9. XGBoost:
XGBoost is a popular gradient boosting library for classification and
regression tasks.
Known for its efficiency and effectiveness in handling structured data.
10. SciPy:
SciPy is built on top of NumPy and provides additional scientific and technical
computing functionalities.
Useful for advanced optimization, integration, interpolation, and more.
11. Statsmodels:
Statsmodels is a library for estimating and interpreting statistical models.
It's helpful for traditional statistical analysis and hypothesis testing.
12. NLTK (Natural Language Toolkit):
NLTK is a library for natural language processing (NLP).
It provides tools and resources for tasks like text classification, sentiment
analysis, and language modeling.
These tools and libraries, along with many others in the Python ecosystem, make Python a
powerful and versatile platform for machine learning and data science projects. You can use
them to build, train, and deploy machine learning models for a wide range of applications,
from image recognition to natural language processing and more.
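As a minimal sketch of how these libraries work together, the example below uses NumPy to generate a small synthetic table, pandas to hold it, and Scikit-Learn to train and score a classifier; the data is artificial and stands in for no particular dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data: the label depends only on the
# first two columns, mimicking a table with informative and noise features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
df = pd.DataFrame(X, columns=["f1", "f2", "f3", "f4"])

# Hold out a quarter of the rows for testing, then fit and score.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```

The same train/fit/score pattern carries over unchanged when the synthetic table is replaced by a real dataset loaded with pandas.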
5.3.1 The heart disease dataset I (DS-I)
This dataset, also known as the Cleveland heart disease dataset, consists of records for 303
individuals. There are 14 columns in the dataset, one of which is the target variable. The
original dataset contained 76 features, but, as per different studies, only 14 are found to be
the most suitable for research; these are described below.
Table 5.1: Description of dataset I
'age' (age of patient in years): Age is the most important risk factor in developing
cardiovascular or heart diseases, with approximately a tripling of risk with each decade of
life. Coronary fatty streaks can begin to form in adolescence. It is estimated that 82 percent
of people who die of coronary heart disease are 65 and older. Simultaneously, the risk of
stroke doubles every decade after age 55.

'sex' (sex of patient; 1: male, 0: female): Men are at greater risk of heart disease than
pre-menopausal women. Once past menopause, it has been argued that a woman's risk is
similar to a man's, although more recent data from the WHO and UN disputes this. If a
female has diabetes, she is more likely to develop heart disease than a male with diabetes.

'cp' (chest pain type; 0: typical angina, 1: atypical angina, 2: non-anginal pain, 3: no
symptom): Chest pain, also called angina, is discomfort caused when the heart muscle does
not get enough oxygen-rich blood. It may feel like pressure or squeezing in the chest. The
discomfort can also occur in the shoulders, arms, neck, jaw, or back. Angina pain may even
feel like indigestion.

'trestbps' (resting blood pressure): Over time, high blood pressure can damage the arteries
that feed the heart. High blood pressure that occurs with other conditions, such as obesity,
high cholesterol or diabetes, increases the risk even more.

'chol' (serum cholesterol level): A high level of low-density lipoprotein (LDL) cholesterol
(the "bad" cholesterol) is most likely to narrow arteries. A high level of triglycerides, a type
of blood fat related to diet, also raises the risk of a heart attack. However, a high level of
high-density lipoprotein (HDL) cholesterol (the "good" cholesterol) lowers the risk of a
heart attack.

'fbs' (fasting blood sugar greater than 120; 1: true, 0: false): Not producing enough of the
hormone secreted by the pancreas (insulin), or not responding to insulin properly, causes
the body's blood sugar levels to rise, increasing the risk of a heart attack.

'restecg' (resting electrocardiographic result; 0: nothing to note, 1: ST-T wave abnormality,
2: possible or definite left ventricular hypertrophy): For people at low risk of cardiovascular
disease, the USPSTF concludes with moderate certainty that the potential harms of screening
with resting or exercise ECG equal or exceed the potential benefits. For people at
intermediate to high risk, current evidence is insufficient to assess the balance of benefits
and harms of screening.

'thalach' (maximum heart rate achieved): The increase in cardiovascular risk associated with
the acceleration of heart rate is comparable to the increase in risk observed with high blood
pressure. It has been shown that an increase in heart rate by 10 beats per minute is
associated with an increase in the risk of cardiac death by at least 20%, which is similar to
the increase in risk observed with a rise in systolic blood pressure of 10 mm Hg.

'exang' (exercise-induced angina; 1: yes, 0: no): The pain or discomfort associated with
angina usually feels tight, gripping or squeezing, and can vary from mild to severe. Angina
is usually felt in the centre of the chest but may spread to either or both shoulders, or to the
back, neck, jaw or arm; it can even be felt in the hands. The types of angina are (a) stable
angina (angina pectoris), (b) unstable angina, (c) variant (Prinzmetal) angina, and
(d) microvascular angina.

'oldpeak' (ST depression induced by exercise relative to rest): ST depression induced by
exercise, relative to rest, as observed in the ECG test.

'slope' (slope of the peak-exercise ST segment; 0: upsloping, 1: flat, 2: downsloping): The
slope of the ST depression during peak exercise.

'ca' (number of major vessels, 0 to 3): The number of major blood vessels that can be
visualized using fluoroscopy, ranging from 0 to 3.

'thal' (thallium stress test result; 3: normal, 6: fixed defect, 7: reversible defect):
Thalassemia is a blood disorder caused by abnormal hemoglobin production, with a score of
3 indicating normal production, 6 indicating a permanent (fixed) deficiency, and 7 signifying
a temporary (reversible) impairment.

'target' (heart disease indicator of the patient; 1: yes, 0: no): This is the final indicator of the
disease, also called the predicted class; it indicates a binary state.
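In practice, DS-I would be loaded and sanity-checked with pandas before any modelling. The sketch below assumes the Kaggle file is saved as heart.csv (a hypothetical filename); to keep the example self-contained and runnable, a two-row in-memory sample with the same 14 columns is used instead of the real file.

```python
import pandas as pd

# Column names of DS-I (the 14 retained Cleveland features, target last).
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

def inspect(df):
    """Basic pre-modelling checks: row count, feature count (excluding the
    target column), missing values, and the class balance of the target."""
    return {
        "n_rows": len(df),
        "n_features": df.shape[1] - 1,
        "missing": int(df.isna().sum().sum()),
        "class_counts": df["target"].value_counts().to_dict(),
    }

# With the real file this would be: df = pd.read_csv("heart.csv")
sample = pd.DataFrame([[63, 1, 0, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 3, 1],
                       [57, 0, 1, 130, 236, 0, 2, 174, 0, 0.0, 1, 1, 3, 0]],
                      columns=COLUMNS)
report = inspect(sample)
```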
5.3.2 The heart disease dataset II (DS-II)
This heart disease dataset, also known as the Cardiovascular Disease dataset, is likewise
available on the Kaggle web portal. It consists of information on 70,000 patients, and the
input features are categorised in three sections: Objective (factual information), Examination
(results of medical examination) and Subjective (information given by the patient). As per
the source of the dataset, all of the values were collected and recorded at the time of
medical examination.
(Source: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset )
This dataset consists of 11 features and one target attribute.
Table 5.2: Description of dataset II
Features Description of the attribute
'age' age of patient in days
'height' height of patient in centimetres
'weight' weight of patient in kilograms
'gender' sex of patient (categorical information)
'ap_hi' systolic (upper-level) blood pressure (integer value)
primarily focused on removing non-informative or redundant predictors from the model
[501].
The input variables that we give to our machine learning models are called features. Each
column in our dataset constitutes a feature. To train an optimal model, we need to make sure
that we use only the essential features. If we have too many features, the model can capture
the unimportant patterns and learn from noise. The method of choosing the important
parameters of our data is called Feature Selection. Less complex models are less likely to
overfit the data.
where retraining needs to happen in real-time, one might need to limit oneself to a
couple of best features.
4. Deployment effort: The more features, the more complex the machine learning
system becomes in production. This poses multiple risks, including but not limited
to high maintenance effort, entanglement, undeclared consumers, or correction
cascades.
5. Interpretability: With too many features, we lose the explainability of the model.
While not always the primary modeling goal, interpreting and explaining the
model’s results are often important and, in some regulated domains, might even
constitute a legal requirement.
7. Data-model compatibility: Finally, there is the issue of data-model compatibility.
While, in principle, the approach should be data-first, which means collecting and
preparing high-quality data and then choosing a model which works well on this
data, real life may have it the other way around.
Feature selection models are broadly categorised into supervised and unsupervised
approaches.
In filter methods, a subset of features is selected based on its relationship to the target
variable; the selection does not depend on any machine learning algorithm. Instead, filter
methods measure the "relevance" of the features to the output via statistical tests. Filter
methods are found to be fast, scalable, computationally simple and independent of the
classifier [512]. A number of techniques are available to remove unimportant features from
a dataset; some of the most commonly used are:
Univariate selection (ANOVA: Analysis of variance)
Chi Square
Based on Pearson’s correlation
Linear discriminant analysis (LDA): It is used to find a linear combination of features
that characterizes or separates two or more classes of a categorical variable.
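In Scikit-Learn these univariate filter techniques are exposed through SelectKBest, with f_classif providing the ANOVA F-score and chi2 the chi-square score. A minimal sketch on synthetic data, in which only the first two of five features carry any signal:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Five features: columns 0-1 determine the class, columns 2-4 are noise.
rng = np.random.default_rng(1)
informative = rng.normal(size=(300, 2))
noise = rng.normal(size=(300, 3))
y = (informative.sum(axis=1) > 0).astype(int)
X = np.hstack([informative, noise])

# Keep the two features with the highest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
kept = selector.get_support(indices=True)
```

Note that swapping f_classif for chi2 would require non-negative feature values, since the chi-square test operates on frequencies.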
We can calculate Pearson correlation coefficient (r) using the following formula.
r = (nΣxy − ΣxΣy) / √( [nΣx² − (Σx)²] [nΣy² − (Σy)²] ) .................. (5.1)
In the above formula, the variables are represented by x and y, where x is the independent
variable, y is the dependent variable. In addition, the value n represents the sample size, and
summation of all values can be represented by the summation symbol Σ.
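Equation (5.1) can be verified numerically; the sketch below computes r from the raw sums exactly as written, and agrees with NumPy's built-in corrcoef.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient from the raw-sums form of eq. (5.1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
                  (n * np.sum(y ** 2) - np.sum(y) ** 2))
    return num / den

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
```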
the chi-square value, which is proportional to the difference between the observed frequencies
and the expected frequencies of the categories. The higher the chi-square value, the more
dependent the two variables are, and therefore the more relevant the characteristic is. Alisha
et al [502] mentioned in their study that categories with a very low frequency of occurrence
might affect the ranks calculated using feature selection techniques, and this can hamper the
final performance of the classifiers. The chi-squared value can be calculated using the
following formula.
χ² = Σ (O − E)² / E .................. (5.2)
Where O stands for observed or actual value and E represents expected value.
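Equation (5.2) can be evaluated directly; the frequencies below are illustrative, with the expected counts taken as uniform across four categories.

```python
import numpy as np

def chi_square(observed, expected):
    """Chi-squared statistic of eq. (5.2): the sum of (O - E)^2 / E."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return np.sum((observed - expected) ** 2 / expected)

# Illustrative category counts vs. a uniform expectation of 25 each.
O = [18, 22, 20, 40]
E = [25, 25, 25, 25]
stat = chi_square(O, E)  # larger values indicate stronger dependence
```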
5.4.3.3 ANOVA
ANOVA (Analysis of Variance) is a statistical method commonly used in machine learning
for comparing the means of two or more groups. It was developed by Ronald Fisher in the
early twentieth century. In the context of machine learning, ANOVA can be used for
feature selection or to analyze the significance of different features in predicting the target
variable. ANOVA is particularly useful when dealing with categorical independent variables
and a continuous dependent variable. It helps in determining whether there are significant
differences in the means of the dependent variable across the categories of the independent
variable.
The ANOVA filter method, also known as ANOVA F-value feature selection, ranks features
by the F-statistic computed between each feature and the target; it is particularly useful for
classification tasks whose input features are continuous.
In general, there are three types of ANOVA:
One-Way ANOVA: The one-way analysis of variance is also known as single-factor
ANOVA or simple ANOVA. It is suitable for experiments with only one independent
variable (factor) with two or more levels. One-Way-ANOVA method for feature
selection is a technique that analyzes the experimental data such that, one or more
response variables are calculated under various conditions identified by one or more
classification variables [504].
Two-way ANOVA: It is also called full Factorial ANOVA and is used when there are
two or more independent variables. Each of these factors can have multiple levels. It
can compare two or more factors, which means it can check the effect of two
independent variables on a single dependent variable.
N-way ANOVA: An analysis of variance test is considered an N-way ANOVA test if the
user uses more than two independent variables, where N represents the number of
independent variables in the experiment. (N-way ANOVA should not be confused with
MANOVA, which involves more than one dependent variable.)
The mathematical implementation of ANOVA involves a number of calculations, step by
step, which can be understood with the help of the following description.
Step 1: Calculating overall mean (in the first step we need to calculate the mean of all
observations in the sample).
Step 2: From the overall mean, find the sum of squared deviations of every observation.
Sum of squares total: SST = Σ (X − X̄)²
Step 3: Find the sum of squared deviations of the observations within every group.
Sum of squares within: SSW = Σ (Xi − X̄i)²
Step 4: Calculate the sum of squared deviations of every group mean from the overall mean.
Sum of squares between: SSB = Σ Ni (X̄i − X̄)²
Here, Ni indicates number of observations in each group,
X̄ i indicates mean of each group,
and X̄ is the overall mean.
Step 5: Find the degrees of freedom of each of the three sums of squares.
degrees of freedom (df) can be calculated as
df(SST) = N − 1
df(SSW) = N − k
df(SSB) = k − 1
Here, N is the total no. of observations and k is the number of groups.
Step 6: Now, to calculate the mean squares, we need to divide the sum of squares by the
respective degree of freedom
MSW = SSW / ( N – k )
MSB = SSB / ( k – 1 )
Step 7: Calculating F-statistic
F = MSB / MSW
Step 8: Calculate the p-value and compare the result. If the p-value is smaller than the
chosen significance level, the means of the groups may be considered significantly
different.
Nadir O.F Elssied et al (2014), in their research, applied feature selection based on a
one-way ANOVA F-test statistics scheme to determine the most important features
contributing to e-mail spam classification [503]. It is important to note that filter methods
are much faster than other methods, as they do not involve training any models, whereas
wrapper methods are computationally much more expensive because they are entirely based
on machine learning models. Another interesting point about filter methods is that they
might fail to find the best subset of features in many situations, whereas wrapper methods
can usually provide a better subset, although they are very time-consuming.
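For contrast with the filter methods discussed here, a wrapper-style selection can be sketched with Scikit-Learn's RFE (recursive feature elimination), which repeatedly retrains a model and discards the weakest feature; the synthetic data below is illustrative only.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Five features, but the label depends only on columns 0 and 3.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = (2 * X[:, 0] - 3 * X[:, 3] > 0).astype(int)

# Wrapper: each elimination round refits the estimator, which is why
# wrapper methods cost much more than filters as the feature count grows.
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2).fit(X, y)
kept = np.where(rfe.support_)[0]
```

Because the model is retrained at every round, the wrapper can account for feature interactions that a univariate filter would miss, at the computational cost described above.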
5.6 SUMMARY
In this chapter, the details of the primary heart disease datasets used in this research are
discussed. The Cleveland heart disease dataset is referred to here as DS-I, and the other
dataset, also known as the Cardiovascular Disease dataset, is referred to as DS-II. The first
dataset contains 303 records with 14 features, whereas the second contains 70,000 patient
records with 11 features. The entire set of experiments conducted during this study was
carried out with the help of the Python ML tool. The robust toolbox provided by Python's
machine learning (ML) environment enables researchers and developers to create effective
models for a variety of applications. The different libraries available in the Python
programming language are also discussed here. Since the primary aim of this study is to
improve classifier performance with a minimum number of features, different feature
selection techniques are also included. Feature selection is the process of selecting a subset
of the original features in order to minimise the feature space as much as possible while
meeting certain criteria. The different models of feature selection are explained with the
help of a diagram, and the important techniques of filter feature selection models are
discussed mathematically. Finally, all the experimental results, figures, and comparison
tables are included in the next chapter (Chapter 6).
REFERENCES
[501] Applied Predictive Modeling (2013), Springer, Page 488
[502] Alisha Sikri, N. P. Sing, Surjeet Dalal, International Journal of Intelligent Systems and
Applications in Engineering (IJISAE), 2023, ISSN:2147-6799
[503] Nadir Omer Fadl Elssied, Othman Ibrahim and Ahmed Hamza Osman, A Novel
Feature Selection Based on One-Way ANOVA F-Test for E-Mail Spam Classification,
Research Journal of Applied Sciences, Engineering and Technology, ISSN: 2040-7459, July
2014, Page: 625-638
[504] Arowolo, Abdulsalam, Saheed, Y.K. and Salawu, A Feature Selection Based on One-
Way-Anova for Microarray Data Classification, Al-Hikmah Journal of Pure & Applied
Sciences, Vol. 3 (2016), Page: 30-35
[505] Shah, D., Patel, S. & Bharti, S.K. Heart Disease Prediction using Machine Learning
Techniques, SN Computer. Science. 1, 345 (2020), Publisher: Springer,
https://doi.org/10.1007/s42979-020-00365-y
[506] S. Mohan, C. Thirumalai and G. Srivastava, Effective Heart Disease Prediction Using
Hybrid Machine Learning Techniques, in IEEE Access, vol. 7, pp. 81542-81554, 2019, doi:
10.1109/ACCESS.2019.2923707.
[507] Kolla, Venkata Ravi Kiran, Heart Disease Diagnosis Using Machine Learning
Techniques In Python: A Comparative Study of Classification Algorithms For Predictive
Modeling (September 6, 2015). International Journal of Electronics and Communication
Engineering & Technology, 2015, Available at SSRN: https://ssrn.com/abstract=4413723
[508] Chadha, R., Mayank, S. Prediction of heart disease using data mining techniques, CSI
Transactions on ICT 4, 193–198 (2016). Publisher: Springer, https://doi.org/10.1007/s40012-
016-0121-0
[509] Santhana. K.J. and G.S., Prediction of Heart Disease Using Machine Learning
Algorithms, 2019 1st International Conference on Innovations in Information and
Communication Technology (ICIICT), Chennai, India, 2019, pp. 1-5, doi:
10.1109/ICIICT1.2019.8741465.
[510] Archana Singh and R. Kumar, Heart Disease Prediction Using Machine Learning
Algorithms, 2020 International Conference on Electrical and Electronics Engineering (ICE3),
Gorakhpur, India, 2020, pp. 452-457, doi: 10.1109/ICE348803.2020.9122958.
[511] Akshit J. Dhruv, Reema Patel and Nishant Doshi, Python: The Most Advanced
Programming Language for Computer Science Applications, International Conference on
Culture Heritage, Education, Sustainable Tourism, and Innovation Technologies (CESIT
2020), pages 292-299, ISBN: 978-989-758-501-2, DOI: 10.5220/0010307900003051
[512] Yap B. Wah, N. Ibrahim, H.A. Hamid, S.A. Rahman and S. Fong, Feature Selection
Methods: Case of Filter and Wrapper Approaches for Maximising Classification Accuracy,
Pertanika J. Sci. & Technol. 26 (1): 329 - 340 (2018)
[513] Andrea Bommert, Xudong Sun, Bernd Bischl, Jörg Rahnenführer, Michel Lang,
Benchmark for filter methods for feature selection in high-dimensional classification data,
Computational Statistics & Data Analysis, Volume 143, 2020, 106839, ISSN 0167-9473,
https://doi.org/10.1016/j.csda.2019.106839.
[514] Mahdieh Labani, Parham Moradi, Fardin Ahmadizar, Mahdi Jalili, A novel multivariate
filter method for feature selection in text classification problems, Engineering Applications of
Artificial Intelligence, Volume 70, 2018, Pages 25-37, ISSN 0952-1976,
https://doi.org/10.1016/j.engappai.2017.12.014.