
CHAPTER 5: THE HEART DISEASE DATASETS AND FEATURE SELECTION TECHNIQUES

This chapter describes the different heart disease datasets used to train and test the models in this study, along with the Python machine-learning tool. In addition, a number of data pre-processing techniques are discussed, and the feature selection techniques relevant to this study, together with their working principles, are explained.

5.1 INTRODUCTION
Heart disease describes a range of conditions that affect your heart. Diseases under the heart
disease umbrella include blood vessel diseases, such as coronary artery disease, heart rhythm
problems (arrhythmias) and heart defects you’re born with (congenital heart defects), among
others. The term “heart disease” is often used interchangeably with the term “cardiovascular
disease”. Cardiovascular disease generally refers to conditions that involve narrowed or
blocked blood vessels that can lead to a heart attack, chest pain (angina) or stroke. Other heart
conditions, such as those that affect your heart’s muscle, valves or rhythm, also are considered
forms of heart disease. Coronary Heart Disease (CHD) is the most common type of heart
disease, killing over 370,000 people annually. Researchers apply several data mining and
machine learning techniques to analyse huge complex medical data, helping healthcare
professionals to predict heart disease [505].
It is difficult to identify heart disease because of several contributory risk factors, such as diabetes, high blood pressure, high cholesterol, abnormal pulse rate and many other factors [506]. The datasets used for this research are freely available from kaggle.com, an online community platform for data scientists and machine learning enthusiasts. Kaggle allows users to collaborate with other users, find and publish datasets, use GPU-integrated notebooks, and compete with other data scientists to solve data science challenges. The aim of this online platform (founded in 2010 by Anthony Goldbloom and Jeremy Howard and acquired by Google in 2017) is to help professionals and learners reach their goals in their data science journey with the powerful tools and resources it provides. As of 2021, there are over 8 million registered users on Kaggle.

5.2 MACHINE LEARNING THROUGH PYTHON

Python is a computer programming language often used to build websites and software,
automate tasks, and conduct data analysis. Python is a general-purpose language, meaning it
can be used to create a variety of different programs and is not specialized for any specific
problems. This versatility, along with its beginner-friendliness, has made it one of the most-
used programming languages today.

Python was developed by Guido van Rossum, a Dutch programmer, in the late 1980s. He
began working on the language in December 1989 and released the first version, Python 0.9.0,
in February 1991. Guido van Rossum continued to develop and refine Python over the years,
and he remained the "Benevolent Dictator For Life" (BDFL) of the Python community until
he stepped down from that role in July 2018.

Python has become one of the most popular programming languages in the world in recent
years. It is used in everything from machine learning to building websites and software testing.
It can be used by developers and non-developers alike. Stack Overflow's 2022 Developer
Survey revealed that Python is the fourth most popular programming language, with
respondents saying that they use Python almost 50 percent of the time in their development
work. Survey results also showed that Python is tied with Rust as the most-wanted technology, with 18 percent of developers who are not using it already saying that they are interested in learning Python. Python is commonly used for developing websites and software, task automation, data analysis, and data visualization. Venkata R.K. Kolla (2015) compared different ML classification algorithms for the diagnosis of heart disease using the Python ML tool [507]. Ritika Chadha et al. (2016) analysed different data mining techniques, such as Artificial Neural Networks, Decision Tree and Naive Bayes, to predict heart disease in a Python environment [508]. Santhana K.J. et al. (2019) processed their heart disease datasets in Python by applying two main ML algorithms, namely the Decision Tree algorithm and the Naive Bayes algorithm [509]. Archana Singh et al. (2020) proposed a method to calculate the accuracy of ML algorithms for predicting heart disease using different classification techniques in a Python programming environment [510]. Since it is relatively easy to learn, Python has been adopted by many non-programmers, such as accountants and scientists, for a variety of everyday tasks, like organizing finances.

Some of the most important tasks performed through Python include:

 Data analysis and machine learning: Python has become a staple in data science,
allowing data analysts and other professionals to use the language to conduct complex
statistical calculations, create data visualizations, build machine learning algorithms,
manipulate and analyze data, and complete other data-related tasks. Python can build a
wide range of different data visualizations, like line and bar graphs, pie charts,
histograms, and 3D plots. Python also has a number of libraries that enable coders to
write programs for data analysis and machine learning more quickly and efficiently,
like TensorFlow and Keras.

 Web development: Python is often used to develop the back end of a website or
application—the parts that a user doesn’t see. Python’s role in web development can
include sending data to and from servers, processing data and communicating with
databases, URL routing, and ensuring security. Python offers several frameworks for
web development. Commonly used ones include Django and Flask. Some web
development jobs that use Python include back-end engineers, full stack engineers,
Python developers, software engineers, and DevOps engineers.

 Automation or scripting: If you find yourself performing a task repeatedly, you could work more efficiently by automating it with Python. Writing code used to build
these automated processes is called scripting. In the coding world, automation can be
used to check for errors across multiple files, convert files, execute simple math, and
remove duplicates in data. Python can even be used by relative beginners to automate
simple tasks on the computer—such as renaming files, finding and downloading
online content or sending emails or texts at desired intervals.

 Software testing and prototyping: In software development, Python can aid in tasks
like build control, bug tracking, and testing. With Python, software developers can
automate testing for new products or features. Some Python tools used for software
testing include Green and Requestium.

 Everyday tasks: Python isn't only for programmers and data scientists. Learning
Python can open new possibilities for those in less data-heavy professions, like
journalists, small business owners, or social media marketers. Python can also enable non-programmers to simplify certain tasks in their lives. Here are just a few of the
tasks you could automate with Python:
a) Keep track of stock market or crypto prices
b) Send yourself a text reminder to carry an umbrella anytime it’s raining
c) Update your grocery shopping list
d) Rename large batches of files (see the sketch after this list)
e) Convert text files to spreadsheets
f) Randomly assign chores to family members
g) Fill out online forms automatically
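As an illustration of task d), here is a minimal sketch of batch-renaming files with Python's standard pathlib module; the folder name "scans" and the naming scheme are hypothetical.

from pathlib import Path

# Hypothetical folder containing the files to rename
folder = Path("scans")

# Rename every .txt file to report_001.txt, report_002.txt, ... in sorted order
for index, path in enumerate(sorted(folder.glob("*.txt")), start=1):
    path.rename(folder / f"report_{index:03d}.txt")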

5.2.1 Advantages of Python


Python is popular for a number of reasons. It has many advantages over other languages: its wide variety of libraries can reduce the code a programmer must write to roughly one-third, which has helped Python reach the highest peak in terms of machine learning [511].
Here’s a deeper look at what makes it so versatile and easy to use for coders.
 It has a simple syntax that mimics natural language, so it’s easier to read and
understand. This makes it quicker to build projects, and faster to improve on them.
 It’s versatile. Python can be used for many different tasks, from web development to
machine learning.
 It’s beginner friendly, making it popular for entry-level coders.
 It’s open source, which means it’s free to use and distribute, even for commercial
purposes.
 Python’s archive of modules and libraries—bundles of code that third-party users have
created to expand Python’s capabilities—is vast and growing.
 Python has a large and active community that contributes to Python’s pool of modules
and libraries, and acts as a helpful resource for other programmers. The vast support
community means that if coders run into a stumbling block, finding a solution is
relatively easy; somebody is bound to have encountered the same problem before.

5.2.2 Libraries in Python


Python is a popular programming language for machine learning (ML) and artificial
intelligence (AI) due to its extensive libraries, strong community support, and ease of use.
There are several powerful tools and libraries available in Python for ML, and here are some
of the most important ones:

1. Scikit-Learn:
 Scikit-Learn is one of the most widely used libraries for machine learning in
Python.
 It provides simple and efficient tools for data preprocessing, feature selection,
model selection, and evaluation.
 Includes a wide range of algorithms for classification, regression, clustering,
dimensionality reduction, and more.
2. TensorFlow:
 Developed by Google, TensorFlow is an open-source deep learning
framework.
 Widely used for building neural networks and deep learning models.
 Provides high-level APIs like Keras for easier model development.
3. PyTorch:
 Developed by Facebook's AI Research lab, PyTorch is another popular deep
learning framework.
 Known for its dynamic computation graph, which makes it more flexible for
certain tasks.
 Gaining popularity in both academia and industry.
4. Keras:
 Keras is a high-level neural networks API that runs on top of TensorFlow,
Theano, or CNTK.
 Designed for fast prototyping and experimentation with deep learning models.
 Offers a simple and intuitive API for building and training neural networks.
5. Pandas:
 Pandas is a library for data manipulation and analysis.
 It provides data structures like DataFrames for handling structured data.
 Essential for data preprocessing and cleaning before feeding data to ML
models.
6. NumPy:
 NumPy is a fundamental library for numerical operations in Python.
 It provides support for multidimensional arrays and matrices, which are
essential for handling data in ML.
7. Matplotlib and Seaborn:
 Matplotlib and Seaborn are libraries for data visualization in Python.

 They allow you to create various types of plots and graphs to visualize your
data and model results.
8. Jupyter Notebook:
 Jupyter Notebook is an interactive environment for data science and ML.
 It allows you to write and execute code in a notebook-style interface, making it
easy to document and share your work.
9. XGBoost:
 XGBoost is a popular gradient boosting library for classification and
regression tasks.
 Known for its efficiency and effectiveness in handling structured data.
10. SciPy:
 SciPy is built on top of NumPy and provides additional scientific and technical
computing functionalities.
 Useful for advanced optimization, integration, interpolation, and more.
11. Statsmodels:
 Statsmodels is a library for estimating and interpreting statistical models.
 It's helpful for traditional statistical analysis and hypothesis testing.
12. NLTK (Natural Language Toolkit):
 NLTK is a library for natural language processing (NLP).
 It provides tools and resources for tasks like text classification, sentiment
analysis, and language modeling.

These tools and libraries, along with many others in the Python ecosystem, make Python a
powerful and versatile platform for machine learning and data science projects. You can use
them to build, train, and deploy machine learning models for a wide range of applications,
from image recognition to natural language processing and more.
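To make the interplay of these libraries concrete, the following is a minimal illustrative sketch (not the exact code used in this study) that combines Pandas and Scikit-Learn to train and evaluate a simple classifier; the file name heart.csv and the column name 'target' are assumptions.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a hypothetical heart disease CSV with a binary 'target' column
df = pd.read_csv("heart.csv")
X, y = df.drop(columns=["target"]), df["target"]

# Hold out 20% of the records for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a random forest and report accuracy on the held-out records
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))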

5.3 DATASETS USED FOR THIS STUDY


To analyse the performance of the classifiers, it is highly preferable to use multiple datasets, so that the models can be trained and tested in different environments. Hence, in this study we have used two different heart disease datasets, both of which are freely available. Both datasets have already been used in a number of research activities and have been found very useful.

5.3.1 The heart disease dataset I (DS-I)
This dataset, also known as the Cleveland heart disease dataset, consists of the records of 303 individuals. There are 14 columns in the dataset, of which one is the target variable. The original dataset contained 76 features, but different studies have found that only 14 of them are best suited for research; these are described below.
Table 5.1: Description of dataset I
Features, details of feature values, and descriptions:

'age' (age of the patient in years): Age is the most important risk factor in developing cardiovascular or heart diseases, with approximately a tripling of risk with each decade of life. Coronary fatty streaks can begin to form in adolescence. It is estimated that 82 percent of people who die of coronary heart disease are 65 or older. Simultaneously, the risk of stroke doubles every decade after age 55.

'sex' (sex of the patient; 1 if male, 0 if female): Men are at greater risk of heart disease than pre-menopausal women. Once past menopause, it has been argued that a woman's risk is similar to a man's, although more recent data from the WHO and UN disputes this. If a female has diabetes, she is more likely to develop heart disease than a male with diabetes.

'cp' (chest pain type; 0: typical angina, 1: atypical angina, 2: non-anginal pain, 3: no symptom): Chest pain, also called angina, is discomfort caused when the heart muscle does not get enough oxygen-rich blood. It may feel like pressure or squeezing in the chest, and the discomfort can also occur in the shoulders, arms, neck, jaw, or back. Angina pain may even feel like indigestion.

'trestbps' (blood pressure at rest): Over time, high blood pressure can damage the arteries that feed the heart. High blood pressure that occurs together with other conditions, such as obesity, high cholesterol or diabetes, increases the risk even more.

'chol' (serum cholesterol level): A high level of low-density lipoprotein (LDL) cholesterol (the "bad" cholesterol) is most likely to narrow arteries. A high level of triglycerides, a type of blood fat related to diet, also raises the risk of a heart attack. However, a high level of high-density lipoprotein (HDL) cholesterol (the "good" cholesterol) lowers the risk of a heart attack.

'fbs' (fasting blood sugar greater than 120 mg/dl; 1 indicates true, 0 indicates false): Not producing enough of the hormone secreted by the pancreas (insulin), or not responding to insulin properly, causes the body's blood sugar levels to rise, increasing the risk of a heart attack.

'restecg' (resting electrocardiographic result; 0: nothing to note, 1: ST-T wave abnormality, 2: possible or definite left ventricular hypertrophy): For people at low risk of cardiovascular disease, the USPSTF concludes with moderate certainty that the potential harms of screening with resting or exercise ECG equal or exceed the potential benefits. For people at intermediate to high risk, current evidence is insufficient to assess the balance of benefits and harms of screening.

'thalach' (maximum heart rate achieved): The increase in cardiovascular risk associated with an accelerated heart rate is comparable to the increase in risk observed with high blood pressure. It has been shown that an increase in heart rate of 10 beats per minute is associated with an increase in the risk of cardiac death of at least 20%, similar to the increase in risk observed with a rise in systolic blood pressure of 10 mm Hg.

'exang' (exercise-induced angina; 1 if yes, 0 if no): The pain or discomfort associated with angina usually feels tight, gripping or squeezing, and can vary from mild to severe. Angina is usually felt in the centre of the chest but may spread to either or both shoulders, or the back, neck, jaw or arm; it can even be felt in the hands. The types of angina are (a) stable angina (angina pectoris), (b) unstable angina, (c) variant (Prinzmetal) angina and (d) microvascular angina.

'oldpeak' (ST depression induced by exercise relative to rest): The ST depression due to exercise relative to rest is observed in the ECG test.

'slope' (slope of the peak exercise ST segment; 0: upsloping, 1: flat, 2: downsloping): Slope of the ST depression.

'ca' (number of major vessels, indicated by 0 to 3): The number of major blood vessels that can be visualized using fluoroscopy ranges from 0 to 3.

'thal' (thallium stress test result; 3 if normal, 6 if the defect is fixed, 7 if the defect is reversible): Thalassemia is a blood disorder caused by abnormal haemoglobin production, with a score of 3 indicating normal production, 6 indicating a permanent (fixed) defect, and 7 signifying a temporary (reversible) impairment.

'target' (heart disease indicator of the patient; 0 if no, 1 if yes): This is the final indicator of the disease, also called the predicted class. It indicates a binary state.
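As an illustrative sketch, DS-I can be loaded and inspected with Pandas as follows; the local file name heart.csv is an assumption and may differ from the actual Kaggle download.

import pandas as pd

# Load the Cleveland heart disease dataset (hypothetical local file name)
df = pd.read_csv("heart.csv")

print(df.shape)                      # expected: (303, 14)
print(df["target"].value_counts())   # class balance of the predicted class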

5.3.2 The Heart Disease Dataset II (DS-II)

This heart disease dataset, also known as the Cardiovascular Disease dataset, is also available on the Kaggle web portal. It consists of the records of 70,000 patients, and the input features are categorised into three sections: Objective (factual information), Examination (results of medical examination) and Subjective (information given by the patient). As per the source of the dataset, all of the values were collected and recorded at the time of medical examination.
(Source: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset )
This dataset consists of 11 features and one target attribute.
Table 5.2: Description of dataset II
'age': age of the patient in days
'height': height of the patient in centimetres
'weight': weight of the patient in kilograms
'gender': sex of the patient (categorical information)
'ap_hi': upper (systolic) blood pressure (integer value)
'ap_lo': lower (diastolic) blood pressure (integer value)
'cholesterol': cholesterol level of the patient; 1 if normal, 2 if above normal, 3 if much higher than normal
'gluc': glucose level of the patient; 1 if normal, 2 if above normal, 3 if much higher than normal
'smoke': smoking habit of the patient (0 or 1)
'alco': alcohol consumption (0 or 1)
'active': physical activity of the patient (0 or 1)
'cardio': target attribute (0 or 1)

5.4 FEATURE SELECTION TECHNIQUES


Feature selection is the method of reducing the input variables to a model by using only relevant data and getting rid of noise in the data. It is the process of automatically choosing relevant features for our ML model based on the type of problem that we are trying to solve. We do this by including or excluding important features without changing them. It helps in cutting down the noise in our data and reducing the size of our input data. Feature selection is primarily focused on removing non-informative or redundant predictors from the model [501].

The input variables that we give to our machine learning models are called features. Each
column in our dataset constitutes a feature. To train an optimal model, we need to make sure
that we use only the essential features. If we have too many features, the model can capture
the unimportant patterns and learn from noise. The method of choosing the important
parameters of our data is called Feature Selection. Less complex models are less likely to
overfit the data.

5.4.1 Importance of Feature Selection


In practice, however, many things can go wrong with training when the inputs are irrelevant
or redundant – more on these two terms later. On top of this, there are many other reasons
why simply dumping all the available features into the model might not be a good idea.
Especially for high-dimensional data sets, it is necessary to filter out the irrelevant and
redundant features by choosing a suitable subset of relevant features in order to avoid over-
fitting and tackle the curse of dimensionality [513]. Let's look at the six most prominent ones.

1. Irrelevant and redundant features: Some features might be irrelevant to the


problem at hand. This means they have no relation with the target variable and are
completely unrelated to the task the model is designed to solve. Discarding
irrelevant features will prevent the model from picking up on spurious correlations
it might carry, thus fending off overfitting. Redundant features should be dropped,
as they might pose many problems during training, such as multicollinearity in
linear models.
2. Curse of dimensionality: Feature selection techniques are especially indispensable
in scenarios with many features but few training examples. Such cases suffer from
what is known as the curse of dimensionality: in a very high-dimensional space,
each training example is so far from all the other examples that the model cannot
learn any useful patterns. The solution is to decrease the dimensionality of the feature space, for instance, via feature selection.
3. Training time: The more features, the more training time. The specifics of this
trade-off depend on the particular learning algorithm being used, but in situations where retraining needs to happen in real time, one might need to limit oneself to a couple of the best features.
4. Deployment effort: The more features, the more complex the machine learning
system becomes in production. This poses multiple risks, including but not limited
to high maintenance effort, entanglement, undeclared consumers, or correction
cascades.
5. Interpretability: With too many features, we lose the explainability of the model.
While not always the primary modeling goal, interpreting and explaining the
model’s results are often important and, in some regulated domains, might even
constitute a legal requirement.
6. Data-model compatibility: Finally, there is the issue of data-model compatibility.
While, in principle, the approach should be data-first, which means collecting and
preparing high-quality data and then choosing a model which works well on this
data, real life may have it the other way around.

5.4.2 Feature Selection Models


Feature selection models are of two types:
 Supervised models: Supervised feature selection refers to methods which use the output label class for feature selection. They use the target variable to identify the features which can increase the efficiency of the model.
 Unsupervised models: Unsupervised feature selection refers to methods which do not need the output label class for feature selection. These are used for unlabelled data.

Feature selection techniques
 Supervised
o Filter method: Pearson's correlation, Chi-square test, ANOVA test, Linear discriminant analysis (LDA)
o Wrapper method: Forward selection, Backward selection, Exhaustive selection, Recursive selection
o Embedded method: Regularization (L1, L2), Random forest importance
 Unsupervised
Fig 5.1: Feature selection models
The supervised feature selection models can be further divided into three categories: the filter method, the wrapper method and the embedded method. Each of these methods includes a number of techniques that help to remove unimportant features from a dataset. In this study, we discuss a few techniques under the filter method.

5.4.3 Filter Method


In this method, features are filtered based on general characteristics of the dataset, such as their correlation with the dependent variable. The filter method is performed without any predictive model. It is faster and usually the better approach when the number of features is huge. It avoids overfitting, but may sometimes fail to select the best features. Filter methods perform a statistical analysis over the feature space to select a discriminative subset of features [514]. Here, we must evaluate the relevance of each feature using a statistical measure such as Pearson's correlation, linear discriminant analysis or the Chi-square method. The working of the filter method can be explained with the following steps.
 Rank the features based on their relevance scores.
 Select a threshold for relevance or a certain number of top-ranking features.
 Remove all features that fall below the threshold or are not in the top-ranking group.
 Use the remaining features to train the model.
 Repeat the process with different thresholds or numbers of top-ranking features to find
the optimal set of features.

A subset of features is selected based on their relationship to the target variable. The selection does not depend on any machine learning algorithm; instead, filter methods measure the "relevance" of the features to the output via statistical tests. Filter methods are found to be fast, scalable, computationally simple and independent of the classifier [512]. A number of techniques are available to remove unimportant features from a dataset; some of the most commonly used ones are
 Univariate selection (ANOVA: Analysis of variance)
 Chi Square
 Based on Pearson’s correlation

 Linear discriminant analysis (LDA): It is used to find a linear combination of features
that characterizes or separates two or more classes of a categorical variable.
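As an illustrative sketch of the ranking-and-thresholding workflow listed above, the following uses Scikit-Learn's SelectKBest with the ANOVA F-test; the file name heart.csv is an assumption, and the choice of k = 8 is arbitrary.

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

df = pd.read_csv("heart.csv")
X, y = df.drop(columns=["target"]), df["target"]

# Rank features by their ANOVA F-scores and keep only the top 8
selector = SelectKBest(score_func=f_classif, k=8)
X_selected = selector.fit_transform(X, y)

# Names of the retained top-ranking features
print(X.columns[selector.get_support()].tolist())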

5.4.3.1 Pearson’s Correlation


The Pearson correlation coefficient (r) is the most common way of measuring a linear
correlation between two variables. It varies from -1 to +1, where +1 corresponds to positive
linear correlation, 0 to no linear correlation, and −1 to negative linear correlation.

Table 5.3: Pearson’s correlation coefficient values and its interpretation


Pearson correlation Correlation type Interpretation
coefficient (r)
>=0 and <=1 Positive correlation When one variable changes, the other
variable changes in the same direction
0 No correlation There is no relationship between the
variables
<=0 and >=–1 Negative correlation When one variable changes, the other
variable changes in the opposite
direction

We can calculate the Pearson correlation coefficient (r) using the following formula.

r = [ n Σxy − (Σx)(Σy) ] / √( [ n Σx² − (Σx)² ] [ n Σy² − (Σy)² ] ) …………….. (5.1)

In the above formula, the variables are represented by x and y, where x is the independent variable and y is the dependent variable. In addition, n represents the sample size, and the summation of values is represented by the summation symbol Σ.
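As a sketch, the Pearson correlation of each feature with the target can be computed directly with Pandas, which implements equation (5.1); the file name heart.csv is again an assumption.

import pandas as pd

df = pd.read_csv("heart.csv")

# Pearson's r between every feature and the target, as in equation (5.1)
correlations = df.corr(method="pearson")["target"].drop("target")
print(correlations.abs().sort_values(ascending=False))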

5.4.3.2 Chi Square Test


The Chi-square test is used for categorical features in a dataset. We calculate the Chi-square statistic between each feature and the target and select the desired number of features with the best Chi-square scores. In order to correctly apply the Chi-square test to the relation between the various features in the dataset and the target variable, the following conditions have to be met: the variables have to be categorical and sampled independently. This method measures the independence between the categorical features and the target variable, i.e. the probability that the two variables are associated or dependent. The test calculates the Chi-square value, which is proportional to the difference between the observed frequencies and the expected frequencies of the categories. The higher the Chi-square value, the more dependent the two variables are, and therefore the more relevant the feature is. Alisha et al. [502] mentioned in their study that categories with a very low frequency of occurrence might affect the ranks calculated using the feature selection techniques, and this can hamper the final performance of the classifiers. The Chi-square value can be calculated using the following formula.

χ² = Σ [ (O − E)² / E ] ……………………….. (5.2)
Where O stands for observed or actual value and E represents expected value.
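A sketch of Chi-square-based feature scoring with Scikit-Learn follows; note that the chi2 scorer requires non-negative feature values, and the file name heart.csv is again an assumption.

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

df = pd.read_csv("heart.csv")
X, y = df.drop(columns=["target"]), df["target"]

# Score each (non-negative) feature against the target, as in equation (5.2)
selector = SelectKBest(score_func=chi2, k=8)
selector.fit(X, y)
print(dict(zip(X.columns, selector.scores_.round(2))))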

5.4.3.3 ANOVA
ANOVA (Analysis of Variance) is a statistical method commonly used in machine learning for comparing the means of two or more groups. It is also known as the Fisher analysis of variance, after its developer, who introduced it around 1930. In the context of machine learning, ANOVA can be used for
feature selection or to analyze the significance of different features in predicting the target
variable. ANOVA is particularly useful when dealing with categorical independent variables
and a continuous dependent variable. It helps in determining whether there are significant
differences in the means of the dependent variable across the categories of the independent
variable.
The ANOVA filter method is also known as ANOVA F-value feature selection. It is particularly useful for classification tasks with continuous input features, and for regression tasks where the target variable is continuous.
In general, there are three types of ANOVA:
 One-Way ANOVA: The one-way analysis of variance is also known as single-factor
ANOVA or simple ANOVA. It is suitable for experiments with only one independent
variable (factor) with two or more levels. One-Way-ANOVA method for feature
selection is a technique that analyzes the experimental data such that, one or more
response variables are calculated under various conditions identified by one or more
classification variables [504].
 Two-way ANOVA: It is also called full Factorial ANOVA and is used when there are
two or more independent variables. Each of these factors can have multiple levels. It can compare two or more factors, which means it can check the effect of two
independent variables on a single dependent variable.
 N-way ANOVA: An analysis of variance test is considered an N-way ANOVA test if more than two independent variables are used, where N represents the number of independent variables in the experiment. It should not be confused with the MANOVA test, which instead involves multiple dependent variables.

The mathematical implementation of ANOVA involves a number of calculations carried out step by step, which can be understood with the help of the following description.
Step 1: Calculate the overall mean (the mean of all observations in the sample).
Step 2: Find the sum of squared deviations of every observation from the overall mean.
Sum of squares total (SST) = Σ (X − X̄)²
Step 3: Find the sum of squared deviations of each observation from its own group mean.
Sum of squares within (SSW) = Σ (Xᵢ − X̄ᵢ)²
Step 4: Calculate the sum of squared deviations of every group mean from the overall mean.
Sum of squares between (SSB) = Σ Nᵢ (X̄ᵢ − X̄)²
Here, Nᵢ indicates the number of observations in each group,
X̄ᵢ indicates the mean of each group,
and X̄ is the overall mean.
Step 5: Find the degrees of freedom of each of the three sums of squares. The degrees of freedom (df) can be calculated as
df(SST) = N − 1
df(SSW) = N − k
df(SSB) = k − 1
Here, N is the total number of observations and k is the number of groups.
Step 6: To calculate the mean squares, divide each sum of squares by its respective degrees of freedom:
MSW = SSW / (N − k)
MSB = SSB / (k − 1)
Step 7: Calculate the F-statistic:
F = MSB / MSW
Step 8: Calculate the p-value and compare the result. If the p-value is less than a chosen significance level, this suggests that the means of the groups are significantly different.
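These steps can be verified with a small sketch: the computation below follows Steps 1 to 7 for three made-up groups, and scipy.stats.f_oneway is used as a cross-check.

import numpy as np
from scipy import stats

# Three illustrative groups (made-up values)
groups = [np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0]),
          np.array([8.0, 9.0, 10.0])]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()                                        # Step 1
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)             # Step 3
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # Step 4

n, k = len(all_obs), len(groups)
msw = ssw / (n - k)                                                # Step 6
msb = ssb / (k - 1)
f_manual = msb / msw                                               # Step 7

f_scipy, p_value = stats.f_oneway(*groups)                         # Step 8 cross-check
print(f_manual, f_scipy, p_value)  # the two F values should agree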

Nadir O.F. Elssied et al. (2014) applied feature selection based on a one-way ANOVA F-test statistics scheme in their research to determine the most important features contributing to e-mail spam classification [503]. It is important to note that filter methods are much faster than other methods, as they do not involve any training of models. The wrapper methods, by contrast, are computationally much more expensive, as they are entirely based on machine learning models. One more interesting fact about filter methods is that they might fail to find the best subset of features in many situations, whereas the wrapper methods are able to provide the best subset of features in any situation, although they are very time consuming.

5.5 SUMMARY
In this chapter, the details of the primary heart disease datasets used in this research are discussed. The Cleveland heart disease dataset is referred to here as DS-I, and the other dataset, also known as the Cardiovascular Disease dataset, is referred to as DS-II. The first dataset contains 303 records with 14 features, whereas the second contains 70,000 patient records with 11 features. The entire set of experiments conducted during this study was carried out with the help of the Python ML tool. The robust toolbox provided by Python's machine learning (ML) environment enables researchers and developers to create effective models for a variety of applications. The different libraries available in the Python programming language are also discussed here. Since the primary aim of this study is to improve classifier performance with a minimum number of features in the datasets, different feature selection techniques are also included. Feature selection is the process of selecting a subset of features from the original features in order to minimise the feature space as much as possible while meeting certain criteria. The different models of feature selection are explained with the help of a diagram, and various important techniques of the filter feature selection model are discussed mathematically. Finally, all the experimental results, figures, and comparison tables are presented in the next chapter (Chapter 6).

REFERENCES
[501] M. Kuhn and K. Johnson, Applied Predictive Modeling (2013), Springer, Page 488
[502] Alisha Sikri, N. P. Sing, Surjeet Dalal, International Journal of Intelligent Systems and
Applications in Engineering (IJISAE), 2023, ISSN:2147-6799
[503] Nadir Omer Fadl Elssied, Othman Ibrahim and Ahmed Hamza Osman, A Novel
Feature Selection Based on One-Way ANOVA F-Test for E-Mail Spam Classification,
Research Journal of Applied Sciences, Engineering and Technology, ISSN: 2040-7459, July
2014, Page: 625-638
[504] Arowolo, Abdulsalam, Saheed, Y.K. and Salawu, A Feature Selection Based on One-Way-ANOVA for Microarray Data Classification, Al-Hikmah Journal of Pure & Applied Sciences, Vol. 3 (2016), Page: 30-35

[505] Shah, D., Patel, S. & Bharti, S.K., Heart Disease Prediction using Machine Learning Techniques, SN Computer Science, 1, 345 (2020), Publisher: Springer, https://doi.org/10.1007/s42979-020-00365-y

[506] S. Mohan, C. Thirumalai and G. Srivastava, Effective Heart Disease Prediction Using
Hybrid Machine Learning Techniques, in IEEE Access, vol. 7, pp. 81542-81554, 2019, doi:
10.1109/ACCESS.2019.2923707.

[507] Kolla, Venkata Ravi Kiran, Heart Disease Diagnosis Using Machine Learning Techniques in Python: A Comparative Study of Classification Algorithms for Predictive Modeling (September 6, 2015), International Journal of Electronics and Communication Engineering & Technology, 2015, Available at SSRN: https://ssrn.com/abstract=4413723

[508] Chadha, R., Mayank, S. Prediction of heart disease using data mining techniques, CSI
Transactions on ICT 4, 193–198 (2016). Publisher: Springer, https://doi.org/10.1007/s40012-
016-0121-0

[509] Santhana. K.J. and G.S., Prediction of Heart Disease Using Machine Learning
Algorithms, 2019 1st International Conference on Innovations in Information and
Communication Technology (ICIICT), Chennai, India, 2019, pp. 1-5, doi:
10.1109/ICIICT1.2019.8741465.

[510] Archana Singh and R. Kumar, Heart Disease Prediction Using Machine Learning
Algorithms, 2020 International Conference on Electrical and Electronics Engineering (ICE3),
Gorakhpur, India, 2020, pp. 452-457, doi: 10.1109/ICE348803.2020.9122958.

[511] Akshit J. Dhruv, Reema Patel and Nishant Doshi, Python: The Most Advanced
Programming Language for Computer Science Applications, International Conference on
Culture Heritage, Education, Sustainable Tourism, and Innovation Technologies (CESIT
2020), pages 292-299, ISBN: 978-989-758-501-2, DOI: 10.5220/0010307900003051

[512] Yap B. Wah, N. Ibrahim, H.A. Hamid, S.A. Rahman and S. Fong, Feature Selection
Methods: Case of Filter and Wrapper Approaches for Maximising Classification Accuracy,
Pertanika J. Sci. & Technol. 26 (1): 329 - 340 (2018)

[513] Andrea Bommert, Xudong Sun, Bernd Bischl, Jörg Rahnenführer, Michel Lang,
Benchmark for filter methods for feature selection in high-dimensional classification data,
Computational Statistics & Data Analysis, Volume 143, 2020, 106839, ISSN 0167-9473,
https://doi.org/10.1016/j.csda.2019.106839.

[514] Mahdieh Labani, Parham Moradi, Fardin Ahmadizar, Mahdi Jalili, A novel multivariate
filter method for feature selection in text classification problems, Engineering Applications of
Artificial Intelligence, Volume 70, 2018, Pages 25-37, ISSN 0952-1976,
https://doi.org/10.1016/j.engappai.2017.12.014.
