Detecting Clinical Signs of Anaemia Using Machine Learning Report
HARINI S (312420104051)
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
MARCH 2024
ANNA UNIVERSITY : CHENNAI 600 025
BONAFIDE CERTIFICATE
Certified that this project report titled “Detecting Clinical Signs of Anaemia Using Machine Learning” is the bonafide work of HARINI S (312420104051), who carried out the project work under my supervision.
SIGNATURE SIGNATURE
Assistant Professor,
Computer Science and Engineering Computer Science and Engineering
We also take this opportunity to thank our respected and honorable
Chairman Dr. B. Babu Manoharan M.A., M.B.A., Ph.D. for the
guidance he offered during our tenure in this institution.
We wish to convey our sincere thanks to all the teaching and non-
teaching staff of the department of COMPUTER SCIENCE AND
ENGINEERING without whose cooperation this venture would not have
been a success.
CERTIFICATE OF EVALUATION
College Name : St. JOSEPH’S INSTITUTE OF TECHNOLOGY
Semester : VIII
Submitted for project review and viva voce exam held on___________
ABSTRACT
Anaemia represents a pressing global health issue,
disproportionately affecting children and pregnant women. It is a
widespread medical condition marked by a deficiency of red blood
corpuscles, posing significant health risks worldwide. Early detection is
crucial for effective intervention and management. According to a study
by WHO, approximately 42% of children under the age of 6 and 40% of
pregnant women globally suffer from anaemia. This condition impacts
approximately 33% of the world's total population, primarily due to iron
deficiency. Addressing anaemia is paramount to improving public health
outcomes and reducing the burden on healthcare systems worldwide.
Anaemia occurs when the number of red blood cells in the body decreases or when the structure of the red blood cells is damaged or weakened. Early detection of anaemia helps to prevent irreversible organ damage. Non-invasive techniques, such as machine learning algorithms, are increasingly used in the diagnosis and detection of clinical diseases, and anaemia detection is no exception. In this study, machine learning algorithms, namely Naïve Bayes, CNN, SVM, k-NN, and Decision Tree, were used to detect iron-deficiency anaemia. This enabled us to compare images of the conjunctiva of the eyes, the palpable palm, and the color of the fingernails to determine which of them gives higher accuracy for detecting anaemia in children. The technique used in this study was organized into three stages: collection of the datasets and preprocessing of the images, feature extraction, and segmentation of the region of interest of the images. Models were then developed for the detection of anaemia using the various algorithms.
TABLE OF CONTENTS

CHAPTER   TITLE                                                     PAGE NO.

          ABSTRACT                                                  iv
          LIST OF TABLES                                            ix
1.        INTRODUCTION                                              1
          1.1  OVERVIEW                                             2
          1.2  PROBLEM STATEMENT                                    2
          1.3  EXISTING SYSTEM                                      3
               1.3.1  Disadvantages of Existing System              3
          1.4  PROPOSED SYSTEM                                      4
2.        LITERATURE REVIEW                                         5
3.        SYSTEM DESIGN                                             11
          3.1  UNIFIED MODELLING LANGUAGE                           11
               3.1.1  Use Case Diagram of Anemia Detection          12
               3.1.2  Class Diagram of Anemia Detection             13
               3.1.3  Sequence Diagram of Anemia Detection          14
               3.1.4  Activity Diagram of Anemia Detection          15
               3.1.5  Deployment Diagram of Anemia Detection        16
4.        SYSTEM ARCHITECTURE                                       17
5.        SYSTEM IMPLEMENTATION                                     18
          5.1  MODULE DESCRIPTION                                   18
          5.2  MODULES                                              18
               5.2.1  Exploratory Data Analysis                     19
               5.2.2  Statistical Test Module                       21
               5.2.3  Feature Selection                             23
               5.2.4  Data Preprocessing                            25
               5.2.5  Class Imbalance and Data Leakage Handling     29
               5.2.6  Algorithm Implementation Module               31
               5.2.7  Hyper-Parameter Training and Cross-Validation 34
          REFERENCES                                                55
LIST OF FIGURES

FIGURE NO.   FIGURE NAME                              PAGE NO.
3.1          Use case diagram of anemia detection     12
LIST OF TABLES

TABLE NO.   TABLE NAME                                                  PAGE NO.
5.1         Results of Hyperparameter Tuning Module of Anemia Detection 35
LIST OF ABBREVIATIONS

ACRONYM     EXPANSION
LR          Logistic Regression
NB          Naïve Bayes
DT          Decision Tree
RF          Random Forest
HB          Hemoglobin
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
1.3 EXISTING SYSTEM
Clinicians are at risk of infection, since they are the ones who draw blood from patients.
The electrophoresis method cannot be performed if the current or voltage supply is unstable.
Test results vary from lab to lab, as different laboratories use different techniques, leaving patients unsure of which result to believe.
Sahli's method of haemoglobin screening uses acid haematin as a suspension, not a true solution. This method cannot measure all forms of haemoglobin, and the chances of visual error are high.
1.4 PROPOSED SYSTEM
effectiveness of the anemia detection process, ultimately contributing to
improved patient outcomes and healthcare management.
CHAPTER 2
LITERATURE REVIEW
blood samples were used to measure the Hb values of patients as an auxiliary dataset to the images. The decision tree algorithm achieved an accuracy of 95%, which was higher than the SVM and k-NN performances of 80% and 83%, respectively.
Azwad Tamir et al. (2017) [3] proposed a pioneering method for
anemia detection, crucial given the condition's prevalence affecting a
quarter of the world's population. By leveraging smartphone-captured
images of the eye's conjunctival color, Tamir's approach offers a non-
invasive, automated alternative to traditional diagnosis methods
involving invasive blood tests. Analyzing the color spectrum of the
conjunctival tissue, the system determines anemia status by comparing
it against a predefined threshold. Testing on 19 subjects revealed a
promising 78.9% accuracy rate, aligning with patients' blood reports in
15 cases.
Chayashree Patgiri et al. (2019) [4] observed that modern image processing techniques, particularly thresholding, are crucial in medical
abnormality detection. Adaptive thresholding is especially powerful in
analyzing medical images for diseases like Sickle Cell Disease (SCD),
where identifying distorted red blood cells is essential for diagnosis.
This study explores adaptive thresholding techniques such as Niblack,
Bernsen, Sauvola, and NICK for segmenting blood images to detect
SCD. By focusing on these methods, the paper aims to provide a
comparative analysis to enhance the accuracy and efficiency of SCD
diagnosis from microscopic blood images.
Enas Walid Abdulhay et al. (2021) [5] proposed a novel method
for diagnosing Malaria and various types of Anemia using
Convolutional Neural Networks (CNN) on high-resolution blood
sample images. By training the CNN on diverse microscopic images, it
can classify samples into normal blood cells, Malaria, Sickle cell
anemia, Megaloblastic anemia, or Thalassemia without requiring the
standard Complete Blood Count (CBC) test. The proposed approach
achieves a test accuracy of 93.4%, offering rapid and cost-effective
diagnosis without laboratory analysis. This method streamlines the
diagnostic process, potentially revolutionizing blood sample analysis.
Furkan Kiraci et al. (2018) [6] endeavored to streamline the
diagnosis of Sickle Cell Anemia by leveraging Image Processing
Algorithms. By meticulously isolating sickle cells from healthy cells
within blood tissue images, the study achieves commendable results,
boasting an accuracy of 91.11%, a precision rate of 92.9%, and a recall
score of 79.05%. Such precision not only accelerates the diagnostic
process but also substantially reduces the likelihood of misdiagnoses,
ensuring more accurate and efficient patient care.
Garima Vyas et al. (2016) [7] proposed a method that involved
acquisition of the thin blood smear microscopic images, pre-processing
by applying median filter, segmentation of overlapping erythrocytes
using marker-controlled watershed segmentation, applying
morphological operations to enhance the image, extraction of features
such as metric value, aspect ratio, radial signature and its variance, and
finally training the K-nearest neighbor classifier to test the images. The
algorithm processes the infected cells, increasing the speed, effectiveness, and efficiency of training and testing. The K-Nearest
Neighbour classifier is trained with 100 images to detect three different
types of distorted erythrocytes namely sickle cells, dacrocytes and
elliptocytes responsible for sickle cell anaemia and thalassemia with an
accuracy of 80.6% and sensitivity of 87.6%.
Jessie R. Balbin et al. (2019) [8] proposed a Raspberry Pi-based system to aid in
the identification of abnormal red blood cells (RBCs). By measuring
parameters such as area, perimeter, diameter, and shape geometric
factor (SGF), while also detecting central pallor and target flags, the
system offers a comprehensive approach to RBC analysis. Notably,
previous studies have explored different methodologies, including
Artificial Neural Networks (ANN) and radial basis function networks,
achieving accuracies of 90.54% and 83.3% respectively. Balbin et al.
opted for a Support Vector Machine (SVM) classifier, which achieved
an impressive accuracy of 93.33% in identifying seven distinct RBC
types, including normal cells and various abnormalities. This
classification capability is particularly valuable in diagnosing a range of
anemias, such as iron-deficiency anemia, thalassemia, and hereditary
spherocytosis, thereby aiding clinicians in early detection and treatment
planning. While the system serves as a valuable aid for initial
identification of abnormal RBCs, it is crucial to underscore that
conclusive diagnoses necessitate further confirmation through
laboratory examinations.
Joan et al. (2014) [9] proposed a portable device for point-of-care anemia detection that uses impedance analysis with custom electronics, software, and disposable sensors. Forty-eight whole blood samples
from hospitalized patients were collected, with 10 for calibration and 38
for validation. Calibration involved electrochemical impedance spectroscopy (EIS) to determine the impedance spectrum for accurate hematocrit detection. A protocol for instant impedance detection was developed, achieving less than 2% accuracy error for impedance variations. An algorithm based on impedance analysis was used for hematocrit detection, demonstrating effectiveness and robustness with a 1.75% accuracy error and a less than 5% coefficient of variation.
diagnostics, providing a more accessible and non-invasive approach to
identifying this prevalent condition.
deficiency anemia and hookworm disease. Diagnosis and monitoring of
anemia and SCD pose significant challenges in low and middle-income
countries due to limited laboratory infrastructure, skilled personnel, and
financial resources. To address these challenges, an extension of the
HemeChip system has been developed, termed HemeChip+, which
incorporates total hemoglobin quantification and anemia testing
capabilities. HemeChip+ boasts mass-producibility at low cost, offering
a pioneering single-test point-of-care (POC) platform.
Muljono et al. (2024) [15] proposed a non-invasive method using
conjunctival images to detect anemia early, aiming to overcome
limitations in current diagnostic methods. The SVM algorithm-
integrated MobileNetV2 method achieves 93% accuracy, 91%
sensitivity, and 94% specificity in categorizing anemic and healthy
patients. This approach offers promise for efficient and precise anemia
diagnosis in clinical settings, potentially improving healthcare by
identifying anemia earlier.
Pooja Tukaram Dalvi et al. (2016) [16] developed an efficient
machine learning classifier that can detect and classify anemia
accurately. In this paper, five ensemble learning methods (Stacking, Bagging, Voting, AdaBoost, and Bayesian Boosting) are applied to four classifiers: Decision Tree, Artificial Neural Network, Naïve Bayes, and K-Nearest Neighbor. The aim is to determine which individual classifier or combination of classifiers achieves maximum accuracy in red blood cell classification for anemia detection. From the results, it is evident that among the ensemble methods, the stacking ensemble achieves the highest accuracy. Among the individual classifiers, the Artificial Neural Network performs the best and K-Nearest Neighbor performs the worst. However, the combination of Decision Tree and K-Nearest Neighbor, when applied with the stacking ensemble, achieves an accuracy much higher than the Artificial Neural Network. This indicates that an ensemble of classifiers achieves much higher accuracy than individual classifiers. Hence, to achieve maximum accuracy in medical decision-making, an ensemble of classifiers should be used.
Pranati Rakshit et al. (2013) [17] focused on the identification of
morphological changes in red blood cells (RBCs) in haemolytic
anaemia, specifically caused by enzyme deficiencies like G-6-P-D
deficiency. The study employs image processing techniques on blood
smear images, including Wiener filtering for preprocessing and Sobel
edge detection for boundary identification. Additionally, a metric is
devised for abnormal RBC shape determination.
detection algorithm computed to detect sickle cell disease at an early stage in diagnosing patients. MATLAB software was able to demonstrate the abnormalities of human red blood cells (RBCs), including the shapes and quantities of sickle cells present in each dataset. Data samples of sickle cells from the government Ampang Hospital were contributed to this study to validate the results.
presents promising avenues for refining diagnostic processes and
prognostic assessments, ultimately fostering more effective patient care
strategies.
using the Principal Component Analysis (PCA) method and the K-Nearest Neighbor (K-NN) method. The best results, obtained with an image size of 256×128 pixels, a PCA percentage parameter of 40%, cityblock distance, and a value of K=9, gave the system an accuracy of 87.5% with a computing time of 1,317 seconds using 60 training images and 40 test images.
15
218 images of the conjunctiva of the eye from Italy and India were utilized in the study. It gives a comprehensive review of the research
work in this field. Effect of various factors such as age, gender, etc. on
hemoglobin levels is also discussed in this research article. This paper
gives a brief overview of various data collection and preprocessing
methods. It also gives a comprehensive analysis of technologies used to
detect anemia with the help of hemoglobin estimation. The study of
performance measures used for the evaluation of results is covered
while reviewing existing research work. This paper provides the input
for the novel research.
CHAPTER 3
SYSTEM DESIGN
complex systems. The Unified Modelling Language is a very important
part of developing object-oriented software and the software
development process.
The Unified Modelling Language uses mostly graphical notations to
express the design of software projects. Using the UML helps project
teams communicate, explore potential designs, and validate the
architectural design of the software. When you’re writing code, there
are thousands of lines in an application, and it’s difficult to keep track
of the relationships and hierarchies within a software system. UML
diagrams divide that software system into components and
subcomponents. Using UML diagrams, developers can easily visualize
the structure of their software, including classes, objects, relationships,
and behaviors.
diagrams. It represents how an entity from the external environment can
interact with a part of the system.
Figure 3.2: Class diagram for anaemia detection
message flow is represented by a vertical dotted line that extends down the page.
It is also termed an object-oriented flowchart. It encompasses activities composed of a set of actions or operations and is used to model the behaviour of the system.
Figure 3.5: Deployment diagram for anaemia detection
CHAPTER 4
SYSTEM ARCHITECTURE
Anaemia detection using hematological data is a medical
application that aims to identify and diagnose anaemia based on the
analysis of Hemoglobin levels. Anaemia is a condition characterized by
a deficiency of red blood cells or haemoglobin in the blood, leading to
reduced oxygen-carrying capacity and potential health issues. In this
chapter, the System Architecture for detecting clinical signs of anaemia
is shown.
CHAPTER 5
SYSTEM IMPLEMENTATION
5.1 MODULE DESCRIPTION
The objective is to develop a Python-based machine learning
system for classifying and detecting clinical signs of anemia using
hematological data. The system employs various supervised learning
models to classify instances as anaemic or non-anaemic based on their
attributes. A comparative study of model performance is conducted to
identify the most effective approach. Finally, the model's output is
displayed through a live stream web application, providing real-time
insights into the classification results to aid in clinical decision-making
and patient management.
5.2 MODULES
In addition to providing a systematic and organized approach to
our problem-solving process, these modules facilitate efficient
collaboration among team members by delineating clear responsibilities
and workflows. By breaking down the development process into
modular components, we can easily troubleshoot and iterate on specific
aspects of the system without disrupting the overall workflow. This
structured approach also enhances reproducibility and scalability,
allowing for seamless integration of new features or improvements as
the project evolves. The web application seamlessly updates with the
latest data, offering users a dynamic and interactive experience.
data types, and the presence of any missing values. The dataset contains 1421 records with 6 columns, all of which are non-null, suggesting no missing values. Summary statistics such as the mean, standard deviation, minimum, maximum, and quartile values were computed using `df.describe()` to understand the central tendency, dispersion, and shape of the dataset distribution. The data was examined for missing values using `df.isnull().values.sum()`; none were found, indicating a clean dataset. Additionally, the data types of the columns were checked using `df.dtypes`, and columns were renamed for visualization purposes. The distribution of the target variable ('Result') was examined using pie charts and count plots to identify any class imbalances.
Furthermore, the EDA process allowed us to identify potential
outliers or anomalies in the dataset, which could influence model
performance if left unaddressed. Techniques such as box plots and
scatter plots were utilized to visually inspect the distribution of feature
values and detect any unusual patterns. Additionally, correlation
analysis was conducted to assess the strength and direction of
relationships between variables, providing insights into potential
multicollinearity issues that may affect model interpretability.
Moreover, the EDA revealed intriguing trends, such as variations in
anemia prevalence across different age groups or geographical regions,
which could be further explored in subsequent analyses. Additionally, the examination of temporal trends in anemia rates over time may provide valuable insights into potential risk factors or interventions. The insights gained from the EDA process informed decisions regarding feature selection and engineering strategies, ensuring that only the most relevant and informative variables are included in the final predictive model. By thoroughly understanding the dataset's characteristics, we can develop a robust and reliable machine learning system for classifying and detecting the clinical signs of anemia.
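The EDA steps above can be summarized in a short sketch. This is illustrative only: the file name anemia.csv and the column names 'Result' and 'Hemoglobin' are assumptions based on the descriptions in this section.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the hematological dataset (file name is a placeholder).
df = pd.read_csv('anemia.csv')

df.info()                                    # expected: 1421 records x 6 columns, all non-null
print(df.describe())                         # mean, std, min, max, quartiles
print('Missing values:', df.isnull().values.sum())
print(df.dtypes)                             # dtypes is an attribute, not a method

# Class distribution of the target variable 'Result'.
result_counts = df['Result'].value_counts()
result_counts.plot.pie(autopct='%1.1f%%', title="Distribution of 'Result'")
plt.show()

# Box plot to inspect outliers, and a heatmap for correlations.
sns.boxplot(x='Result', y='Hemoglobin', data=df)
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()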
T-test for Hemoglobin Levels by Gender: The t-test reveals that, while there is a slight
difference favoring females, it is not statistically significant enough to
reject the null hypothesis of equal mean hemoglobin levels between
males and females. However, the application of logarithm
transformation to the hemoglobin data addresses skewness and ensures
the validity of the t-test assumptions. By transforming the data, we
mitigate the influence of extreme values and achieve a more symmetric
distribution, thereby enhancing the reliability of the statistical test
results. This approach allows us to accurately assess the gender-based
differences in hemoglobin levels, providing valuable insights into
potential disparities in health outcomes between male and female
populations. Despite the lack of statistical significance, this analysis
underscores the importance of considering gender-specific factors in
the assessment and management of anemia and related health
conditions.
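A minimal sketch of this test with scipy, continuing from the EDA sketch above and assuming 'Gender' is coded as 'Male'/'Female':

import numpy as np
from scipy import stats

# Log-transform haemoglobin to reduce skewness before testing;
# the 0.01 offset guards against log(0), as described in Section 5.2.4.
log_hb = np.log(df['Hemoglobin'] + 0.01)
male = log_hb[df['Gender'] == 'Male']
female = log_hb[df['Gender'] == 'Female']

# Two-sample t-test for equal mean log-haemoglobin between genders.
t_stat, p_value = stats.ttest_ind(male, female, equal_var=False)
print(f't = {t_stat:.3f}, p = {p_value:.3f}')   # p > 0.05: fail to reject H0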
Odds Ratio for Anemia by Gender: The calculated odds ratio of
2.86 signifies a notable difference in the likelihood of being anemic
between genders. Specifically, females exhibit 2.86 times higher odds
of being anemic compared to males, indicating a significant association
between gender and the risk of anemia. This finding underscores the
importance of considering gender-specific factors in the assessment and
management of anemia-related health conditions. The observed
disparity in anemia prevalence between males and females highlights
potential underlying physiological or socio-economic factors that may
contribute to differential health outcomes. Understanding these gender-
based differences is crucial for developing targeted interventions and
healthcare policies aimed at reducing the burden of anemia, particularly
among populations at higher risk, such as females. Overall, the odds
ratio analysis provides valuable insights into the gender-related
disparities in anemia prevalence.
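The odds ratio reported above can be reproduced from a 2×2 contingency table; a sketch, assuming 'Result' is coded 1 for anaemic and 0 otherwise:

import pandas as pd

# 2x2 table of Gender versus anemia Result.
table = pd.crosstab(df['Gender'], df['Result'])

# Odds of anemia within each gender, then their ratio.
odds_female = table.loc['Female', 1] / table.loc['Female', 0]
odds_male = table.loc['Male', 1] / table.loc['Male', 0]
print('Odds ratio (female vs. male):', odds_female / odds_male)   # reported: 2.86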
Chi-Square Test for Gender and Anemia Status: The results of
the Chi-Square Test for Gender and Anemia Status demonstrate a clear
and statistically significant association between these two variables.
With a chi-square statistic of 90.06 and a p-value less than 0.001, there
is robust evidence to reject the null hypothesis of independence. This
implies that gender and anemia status are dependent variables,
indicating a relationship between being female and having anemia. The
findings suggest that gender plays a significant role in determining the
likelihood of an individual experiencing anemia. This association
underscores the importance of considering gender-specific factors in
the assessment, diagnosis, and management of anemia-related health
conditions. Furthermore, these results provide valuable insights into the
demographic patterns of anemia prevalence and highlight the need for
targeted interventions to address gender-based disparities in healthcare.
Understanding the relationship between gender and anemia status is
crucial for developing effective public health strategies aimed at
reducing the burden of anemia and improving overall health equity. In
conclusion, the analyses suggest that while there isn't a significant
difference in mean hemoglobin levels between genders, there is a
notable association between gender and the likelihood of being anemic.
Specifically, females are more likely to be anemic compared to males.
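A sketch of the chi-square test of independence on the same contingency table:

from scipy.stats import chi2_contingency

table = pd.crosstab(df['Gender'], df['Result'])
chi2, p, dof, expected = chi2_contingency(table)
print(f'chi2 = {chi2:.2f}, p = {p:.4g}')   # reported: chi2 = 90.06, p < 0.001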
identifying and retaining only the most relevant features while
discarding the rest. This process not only reduces computational
complexity but also mitigates overfitting and improves model
generalization. In the supervised learning context, feature selection
methods can be broadly categorized into three types: wrapper, filter,
and intrinsic. The study incorporated correlation analysis, SelectKBest,
and Extra Tree Classifier methods to select the most informative
features for predicting anemia status.
Correlation analysis, specifically Pearson correlation
coefficient, was utilized to explore the linear relationship between each
feature and the target variable, i.e., anemia status. The correlation
matrix revealed the strength and direction of association between
features and the target variable. For instance, a positive correlation
indicates that an increase in one variable corresponds to an increase in
the other, while a negative correlation suggests the opposite. In this
study, hemoglobin exhibited a strong negative correlation (Pearson
correlation coefficient of -0.8) with anemia status, indicating that lower
hemoglobin levels are associated with a higher likelihood of anemia.
Conversely, gender showed a weak positive correlation (Pearson
correlation coefficient of 0.25) with anemia status, suggesting a slight
gender-related difference in anemia prevalence.
The correlation matrix visualization, in the form of a heatmap,
provided a comprehensive overview of the relationships between all
features and the target variable. This visualization facilitated the
identification of features with the highest correlation coefficients,
thereby guiding the subsequent feature selection process.
To corroborate the findings from correlation analysis and further refine feature selection, the study employed the SelectKBest
method, a statistical technique for univariate feature selection.
SelectKBest evaluates the significance of each feature individually by
applying a scoring function, in this case, the chi-squared test statistic.
This test measures the dependence between each feature and the target
variable, assessing whether the occurrences of a specific feature and a
specific class are independent based on their frequency distribution. The
SelectKBest method iteratively selects the top k features with the
highest test scores, where k is a predefined parameter. In this study,
different values of k were evaluated (2, 3, 4, and 5), and the optimal
value was determined based on the total score obtained from the
selected features. The feature scores were computed, and the top
features were identified based on their respective scores. For instance,
the best value of k was found to be 2, with a total score of 307.02,
indicating that the two selected features collectively contribute the most
predictive power for determining anemia status.
The feature selection process culminated in the identification of
the most informative features for predicting anemia status, namely
hemoglobin and gender. These features were selected based on their
strong associations with the target variable, as evidenced by their high
correlation coefficients and significant chi-squared test scores. The
inclusion of these features in the predictive model is expected to
enhance its performance and interpretability by focusing on the most
relevant predictors while disregarding less influential ones.
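A sketch of the two feature selection steps described above, assuming all features are numerically encoded and non-negative (a requirement of the chi-squared scorer):

from sklearn.feature_selection import SelectKBest, chi2

# Pearson correlation of each feature with the target
# (reported: Hemoglobin ~ -0.8, Gender ~ 0.25).
print(df.corr(numeric_only=True)['Result'].sort_values())

X = df.drop(columns=['Result'])
y = df['Result']

# Evaluate k = 2..5 and report the selected features and their total score.
for k in range(2, 6):
    selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
    chosen = list(X.columns[selector.get_support()])
    total = selector.scores_[selector.get_support()].sum()
    print(k, chosen, round(total, 2))   # reported best: k = 2, total score 307.02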
Log scaling, a commonly used data transformation technique,
plays a crucial role in preprocessing skewed data distributions, such as
those commonly encountered in real-world datasets. In our case, we
applied logarithmic transformation to the 'Haemoglobin' feature to
address its left-skewed distribution and mitigate the potential impact of
extreme values. The left-skewed distribution of the 'Haemoglobin'
feature indicates that the majority of data points are concentrated on the
higher end of the scale, with fewer observations at lower values. Such
distributions can lead to challenges in model training and interpretation,
as they may violate the assumptions of linear models and affect the
performance of certain machine learning algorithms. To address this
issue, we employed a logarithmic transformation, which involves
taking the logarithm of the feature values. However, taking the
logarithm of zero or negative values can result in undefined or complex
numbers.
Therefore, to avoid such errors, we added a small constant of
0.01 to the feature values before applying the logarithmic
transformation. The logarithmic transformation has several beneficial effects on the data. Compression of range: the logarithm compresses the range of the data, particularly for large values, thereby reducing the variability and making the distribution more symmetric. This can help stabilize the variance of the feature, making it more amenable to analysis and modeling. Down-weighting of extreme values: extreme values in the original feature distribution often exert a disproportionate influence on the model's behavior, potentially leading to biased estimates or overfitting.
Normalization of skewed distributions: logarithmic transformation can help normalize skewed distributions, making them more symmetric and closer to a normal distribution. This is particularly advantageous for certain statistical techniques and machine learning algorithms that assume the data to be normally distributed or to exhibit homoscedasticity. Overall, log scaling is a powerful technique for preprocessing skewed data distributions, such as those encountered in the 'Haemoglobin' feature. By addressing the skewness and mitigating the impact of extreme values, logarithmic transformation contributes to improved model performance, interpretability, and robustness in predictive modeling tasks.
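A sketch of the transformation just described, using the 0.01 offset mentioned above (the column name is an assumption):

import numpy as np

# Add a small constant before taking the log to avoid log(0).
df['Hemoglobin_log'] = np.log(df['Hemoglobin'] + 0.01)

# The log-scaled version should be noticeably more symmetric.
print('skewness before:', round(df['Hemoglobin'].skew(), 3))
print('skewness after: ', round(df['Hemoglobin_log'].skew(), 3))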
Standardization, a fundamental preprocessing technique in machine learning, plays a pivotal role in ensuring that features are appropriately scaled and comparable across different variables. It is achieved through the StandardScaler from the sklearn.preprocessing module.
Standardization involves centering the feature values around the
mean and scaling them to have a standard deviation of one. This
process mitigates the risk of any single feature disproportionately
influencing the model's objective function, thus promoting fair and
balanced model training.While standardization may not significantly
impact the performance of tree-based algorithms such as decision trees,
random forests, and boosted trees due to their inherent feature scaling
invariance, it still offers various benefits beyond mere model
optimization. By centering feature values to have a mean of zero,
standardization ensures that the standardized feature distribution aligns
with the origin of the coordinate system, facilitating clearer
interpretations of model coefficients in linear models. This alignment
enables coefficients to represent the change in the target variable per
standard deviation change in the feature, simplifying the understanding
of feature contributions to the model's predictions.
Furthermore, standardization aids in the identification of
influential features by enhancing the interpretability of model outputs.
Features with larger standardized coefficients are considered more
influential in determining the target variable, providing valuable
insights into the relative importance of different features in driving
model predictions. This feature interpretability is particularly beneficial
in scenarios where understanding the underlying factors contributing to
model decisions is crucial for informed decision-making and problem-
solving.
Despite its potential limited impact on certain algorithms,
standardization remains a cornerstone preprocessing technique in
machine learning, offering not only improved model performance but
also enhanced interpretability and insights into feature importance. By
standardizing feature values, machine learning models can effectively
learn from data and make accurate predictions, paving the way for more
robust and reliable solutions in various application domains.
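A sketch of standardization with StandardScaler, continuing from the earlier sketches; in practice the scaler should be fitted on the training split only (see Section 5.2.5):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Standardize haemoglobin: centre on the mean, scale to unit variance.
df['Hemoglobin_std'] = scaler.fit_transform(df[['Hemoglobin']])

# The standardized column has mean ~0 and standard deviation ~1.
print(round(df['Hemoglobin_std'].mean(), 3), round(df['Hemoglobin_std'].std(), 3))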
Normalization, also known as Min-Max scaling, is a
preprocessing technique aimed at rescaling feature values to a range
between 0 and 1, effectively compressing the data into a standardized
interval. Leveraging the MinMaxScaler from the sklearn.preprocessing
module, normalization ensures that all features are uniformly
distributed within the specified range, regardless of their initial
distribution. This transformation proves particularly beneficial when
feature values exhibit varying scales or non-normal distributions, as it
helps maintain the relative differences in the range of values across
different features. While normalization shares similarities with
standardization, it serves a distinct purpose by preserving the inherent
structure and relationships within the data without altering their
distribution shape. Despite its effectiveness, normalization may not be
essential for tree-based algorithms, such as decision trees, random
forests, and boosted trees, as these models are inherently robust to
variations in feature scales.
However, it remains a valuable preprocessing step in scenarios
where maintaining consistent feature ranges is critical for model
convergence and interpretability.

Following normalization, feature
engineering techniques were employed to enhance the understanding of
the dataset's characteristics and relationships with the target variable,
'Result.' Visualization played a crucial role in this process, with box
plots serving as a powerful tool for comparing the distribution of
feature values across different outcomes of the target variable.
Specifically, four different versions of the 'Hemoglobin' feature –
original, log-scaled, standardized, and normalized – were plotted
against the 'Result' variable. This comparative analysis provided
insights into how each scaling method influenced the distribution of
feature values and their respective relationships with the target variable.
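A sketch of Min-Max scaling and the four-way box-plot comparison described above, continuing from the earlier sketches:

from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Rescale haemoglobin into the [0, 1] interval.
df['Hemoglobin_norm'] = MinMaxScaler().fit_transform(df[['Hemoglobin']])

# Box plots of the original, log-scaled, standardized, and normalized
# versions of the feature against the target 'Result'.
versions = ['Hemoglobin', 'Hemoglobin_log', 'Hemoglobin_std', 'Hemoglobin_norm']
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, col in zip(axes, versions):
    sns.boxplot(x='Result', y=col, data=df, ax=ax)
plt.tight_layout()
plt.show()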
Finally, the preprocessed data underwent partitioning into training
and testing datasets to facilitate model training and evaluation. Utilizing
the train_test_split function from the sklearn.model_selection module,
the data was randomly divided into training (70%) and testing (30%)
sets. The resulting 'X_train' and 'X_test' datasets contained the predictor
variables, while the 'y_train' and 'y_test' datasets contained the
corresponding target variable values. This division ensured that the
model was trained on a sufficiently large portion of the data while
retaining a separate portion for unbiased evaluation. By adhering to best
practices in data splitting, we mitigate the risk of overfitting and ensure
the generalizability of the trained model to unseen data.
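A sketch of the 70/30 split; the stratify and random_state arguments are assumptions added here for reproducibility:

from sklearn.model_selection import train_test_split

X = df.drop(columns=['Result'])
y = df['Result']

# 70% for training, 30% held out for unbiased evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)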
5.2.5 Class Imbalance and Data Leakage Handling
The analysis begins with an exploration of the dataset,
identifying the target variable "Result" as the focal point. By examining
the distribution of classes within this variable, it's evident that one class
significantly outnumbers the other. This scenario is known as class
imbalance, which can adversely affect the performance of machine
learning models, particularly in classification tasks. Class imbalance
introduces challenges such as biased model predictions, where the
model tends to favor the majority class due to its prevalence in the
dataset. This can lead to misclassification of minority class instances,
which are often of greater interest in real-world applications (e.g.,
detecting rare diseases, fraud detection). To mitigate the effects of class
imbalance, several sampling techniques are employed:
The random oversampling technique involves duplicating examples from the minority class to balance the class distribution. However, it
may lead to overfitting for some models due to the replication of
minority class instances.
SMOTE (Synthetic Minority OverSampling Technique) generates
synthetic data points for the minority class, addressing class imbalance
without replicating existing instances.
ADASYN (Adaptive Synthetic Sampling Method for Imbalanced Data) is similar to SMOTE but focuses on generating more samples for difficult-to-learn instances.

Implementation and Evaluation:
Each sampling technique is implemented using appropriate libraries such as imblearn.over_sampling and imblearn.under_sampling. The impact of each technique on model performance is evaluated using metrics such as accuracy, precision, recall, F1 score, and the AUC-ROC curve.
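A sketch of the three oversampling techniques named above using imbalanced-learn, applied to the training split from the previous sketch:

from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

print('original:', Counter(y_train))
for sampler in (RandomOverSampler(random_state=42),
                SMOTE(random_state=42),
                ADASYN(random_state=42)):
    # Each sampler rebalances the class counts in its own way.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(type(sampler).__name__, Counter(y_res))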
Data Leakage
Data leakage is a critical issue in machine learning, wherein
information from outside the training dataset contaminates the training
process, resulting in overly optimistic performance evaluations. This
can transpire through inadvertent inclusion of test data during training
or by incorporating external data during model creation. Such leakage
compromises the model's generalization ability, as it learns patterns
based on information not representative of real-world scenarios.
Mitigating data leakage requires stringent separation of training and
testing datasets, as well as vigilant scrutiny of any external data sources
introduced during model development to ensure unbiased performance
evaluation and reliable predictive outcomes.
Techniques for Handling Data Leakage
Undersampling and oversampling techniques are indispensable
strategies for addressing class imbalance in machine learning datasets.
To prevent data leakage and ensure the model's integrity, these
sampling techniques are exclusively applied to the training data. By
doing so, the model is trained solely on data it has not previously
encountered, safeguarding the independence and integrity of the test
set. When applying various sampling techniques, such as random
undersampling or oversampling, logistic regression models are trained
and evaluated using the test set. This rigorous evaluation process
ensures that the model's performance is accurately assessed on unseen
data, reflecting its real-world generalization capability.
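A sketch of this leakage-safe workflow: resampling touches only the training data, and evaluation uses the untouched test set (SMOTE stands in here for any of the samplers above):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             cohen_kappa_score)
from imblearn.over_sampling import SMOTE

# Resample the TRAINING data only; the test set stays untouched.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# Evaluate on the original, unseen test set.
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print('accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', roc_auc_score(y_test, y_prob))
print('F1:', f1_score(y_test, y_pred))
print('kappa:', cohen_kappa_score(y_test, y_pred))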
In our experiments, all logistic regression models consistently
demonstrated high accuracy (>99%) and AUC (>99%), indicating
excellent overall performance in distinguishing between classes. These
metrics serve as reliable indicators of model quality and effectiveness
in classification tasks. Moreover, precision, recall, and F1 scores
consistently exhibit robust performance across different sampling
techniques. Precision measures the proportion of true positive
predictions among all positive predictions, recall assesses the
proportion of true positives correctly identified by the model, and F1
score provides a balanced assessment of precision and recall. The
consistent high values of these metrics signify the model's ability to
accurately classify instances from both classes, regardless of the
sampling technique employed. The kappa statistic, which measures the
agreement between predicted and actual class labels while accounting
for chance agreement, further reinforces the reliability of the logistic
regression models. The high kappa values obtained across different
sampling techniques affirm the model's robustness and consistency in
classification tasks.
In summary, the application of various sampling techniques in
logistic regression modeling, coupled with rigorous evaluation using
performance metrics, provides valuable insights into the model's
performance and generalization capability. The consistently high
performance metrics underscore the effectiveness of logistic regression
in addressing class imbalance and achieving accurate classification
outcomes in diverse real-world scenarios. Visualization techniques such
as ROC curves, precision-recall curves, and confusion matrices provide
additional insights into model performance and class distribution. In
conclusion, the implemented sampling techniques effectively address
class imbalance and data leakage issues in the machine learning
pipeline. By evaluating model performance metrics and visualizations,
we can confidently select the most suitable sampling technique for our
dataset, ensuring accurate and reliable predictions in real-world
applications.
Support Vector Machines (SVM)
SVM models achieved high performance across all sampling
techniques, with F1 scores ranging from 0.967 to 0.978.
Gaussian Naive Bayes (NB)
NB models achieved good performance across all sampling
techniques, with F1 scores ranging from 0.939 to 0.957.
computational resources when selecting the most suitable classifier for
a specific application. While DT and RF may exhibit perfect
performance, indicating potential overfitting, KNN, SVM, and NB
consistently demonstrate strong performance, suggesting their
reliability in handling imbalanced datasets. Additionally, further
analysis, such as model interpretation and validation on independent
datasets, is necessary to ensure the chosen classifier's reliability and
generalization capability.
5.2.7 Hyper-Parameter Training and Cross-Validation
Hyperparameter tuning stands as a pivotal phase in the machine
learning model development process, aiming to enhance model
performance by meticulously selecting the optimal combination of
hyperparameters. Employing the GridSearchCV technique within our
module enabled us to systematically explore a predefined grid of
hyperparameters, discerning the configuration that maximizes the
specified scoring metric for each classifier. This exhaustive search
approach ensures comprehensive coverage of the hyperparameter
space, facilitating the identification of the most effective parameter
combination.
Furthermore, our implementation incorporated a robust 5-fold
cross-validation strategy during the search process. This technique
partitions the data into five subsets, iteratively utilizing four subsets for
training and one for validation. By repeating this process across
different subsets, we ensure thorough evaluation of each parameter
combination's performance, guarding against potential overfitting and
enhancing the model's generalization capability.
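A sketch of the grid search with 5-fold cross-validation for one of the classifiers; this particular grid and the F1 scoring choice are illustrative assumptions, since the report tuned a separate grid per algorithm:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy'],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid,
                      scoring='f1',   # scoring metric assumed
                      cv=5)           # 5-fold cross-validation as described
search.fit(X_train, y_train)
print('best params:', search.best_params_)
print('best CV score:', round(search.best_score_, 3))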
In our analysis, we evaluated a diverse range of classifiers,
including Decision Tree, Random Forest, Support Vector Machine
(SVM), Gaussian Naive Bayes, Logistic Regression, and K-Nearest
Neighbors (KNN). Each of these classifiers offers unique strengths and
characteristics, necessitating tailored parameter grids specific to their
individual traits and complexities.
By customizing parameter grids for each classifier, we ensure a targeted
exploration of hyperparameter combinations, maximizing the model's
performance potential. This meticulous approach allows us to navigate
the extensive hyperparameter space with precision, honing in on
configurations that yield superior predictive capabilities. Through
systematic fine-tuning, we optimize the model's performance,
enhancing its ability to generalize and make accurate predictions across
diverse datasets. By leveraging this methodical strategy, we extract the
full potential of each machine learning algorithm, unlocking optimal
performance and achieving superior results in real-world applications.
This tailored approach not only boosts model performance but also
fosters a deeper understanding of the intricate relationships between
hyperparameters and model outcomes. Consequently, our systematic
fine-tuning process empowers us to harness the full predictive power of
machine learning algorithms, ensuring robust and reliable performance
across a wide range of applications.
The results are tabulated as follows:

TABLE 5.2.7.1: RESULTS OF HYPERPARAMETER TUNING MODULE

Algorithm       Best score      F1 score
Naïve Bayes     0.93            0.95
Naïve Bayes     0.93            0.01
various machine-learning tasks. Despite their promising performance,
selecting the best-performing classifier involves considering additional
factors beyond accuracy and F1 scores.
One crucial aspect is model complexity, which refers to the
number of parameters or features used by the model to represent the
underlying relationships in the data. While complex models may
achieve high accuracy on the training data, they are more prone to
overfitting and may struggle to generalize to unseen data. In contrast,
simpler models are more interpretable and may offer better
generalization performance, especially in scenarios with limited
training data or noisy features. Interpretability is another important
consideration, particularly in domains where model transparency and
explainability are paramount. Decision trees excel in this regard, as
they provide a clear and interpretable representation of the decision-
making process, allowing stakeholders to understand the factors driving
the model's predictions. In contrast, the ensemble nature of random
forests may complicate interpretability, as it involves aggregating
predictions from multiple decision trees.
Furthermore, computational efficiency and scalability should be
taken into account, especially when dealing with large datasets or real-
time applications. While decision trees are computationally efficient
and can handle large volumes of data, random forests may require more
resources due to the training of multiple trees in parallel. Therefore, the
choice of classifier should consider the computational constraints and
requirements of the specific application.
In conclusion, while the Decision Tree Classifier and Random
Forest Classifier showcase robust performance metrics such as
accuracy and F1 scores, identifying the most suitable classifier
demands a thorough examination. Beyond mere performance metrics,
factors such as model complexity, interpretability, and computational
efficiency wield significant influence in the selection process. Decision
trees and random forests, while exhibiting strong predictive power,
often come with higher complexity, potentially hindering
interpretability. Conversely, models like Logistic Regression or
Gaussian Naive Bayes may offer simpler interpretations but may
sacrifice some predictive performance.
CHAPTER 6
RESULTS AND CODING
6.1 TOOLS AND LANGUAGES
ranging from classification and regression to clustering and
dimensionality reduction. With its user-friendly API and extensive
documentation, Scikit-learn democratizes machine learning by making
complex algorithms accessible to both novice and experienced
practitioners. The library offers a modular architecture, allowing users
to seamlessly interchange algorithms and customize workflows to suit
specific requirements. Furthermore, Scikit-learn provides robust tools
for model evaluation, hyperparameter tuning, and cross-validation,
enabling rigorous experimentation and optimization of machine
learning models.
from sklearn.ensemble import RandomForestClassifier  # for implementing the random forest algorithm
from sklearn.svm import SVC  # for implementing the Support Vector Machine (SVM) algorithm
from imblearn.under_sampling import NearMiss  # for undersampling imbalanced datasets using the NearMiss algorithm
from imblearn.metrics import classification_report_imbalanced  # for generating a classification report for imbalanced datasets
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report  # for computing various performance metrics for classification models
from collections import Counter  # for counting the frequency of elements in a list
from sklearn.model_selection import KFold, StratifiedKFold  # for k-fold cross-validation
from sklearn.model_selection import cross_val_score  # for evaluating a model using cross-validation
from sklearn.metrics import cohen_kappa_score  # for computing Cohen's kappa score for inter-rater agreement

import warnings
warnings.filterwarnings("ignore")  # suppress warning output

pd.set_option('display.max_columns', 5000)  # show all columns when printing
df.info()      # dataframe structure
df.columns     # column names

result_counts = df_copy['Result'].value_counts()  # class counts of the target

# Annotate each countplot bar with its height (count); reconstructed from a fragment:
for p in ax.patches:
    ax.text(p.get_x() + p.get_width() / 2, p.get_height(),
            '{:d}'.format(int(p.get_height())), ha='center')
# Remove spines
sns.despine(left=True, bottom=True)
plt.show()

# Split the records by gender
male_data = df_copy[df_copy['Gender'] == 'Male']
female_data = df_copy[df_copy['Gender'] == 'Female']

# Plot horizontal violinplot using Seaborn
plt.title('Distribution of Haemoglobin Levels by Gender')
plt.xlabel('Haemoglobin Level')
plt.ylabel('Gender')
Figure 6.3.1: The countplot illustrates the distribution of individuals with and without anemia across genders. Each bar represents the count
of individuals categorized by their gender (male or female) and their
anemia status (with or without). The bars are annotated with the
respective counts, providing a visual representation of the prevalence of
anemia among different genders in the dataset.
The user-friendly design is highlighted as a key feature, ensuring
that even individuals with varying levels of technical expertise can
navigate the application effortlessly. The intuitive interface contributes
to a seamless user experience, making the diagnostic process more
accessible and efficient.
Figure 6.2: Initial screen of the anemia detection system
empowers users to make informed decisions regarding anemia
diagnosis and management.
under the curve (AUC), precision, recall, F1-score, and kappa statistics.
The graph visually represents the comparative performance of these
models, offering insights into their effectiveness in classification tasks.
Overall, the analysis highlights the robustness of the models and
provides valuable information for understanding their predictive
capabilities in real-world scenarios.
other models. The data imbalance handling techniques, including undersampling, oversampling, SMOTE, and ADASYN, do not seem to significantly impact the models' performance on this dataset, as all techniques yield high scores. In summary, the models demonstrate robust performance across the various techniques, indicating their effectiveness in classification tasks on the given dataset.
CHAPTER 7
CONCLUSION
Decision Trees (DT) and Random Forest (RF) performed well,
while Support Vector Machine (SVM) with ADASYN outperformed all
class imbalance methods, achieving an AUC of 0.984 and accuracy of
98%. k-Nearest Neighbors (KNN) without balancing also performed
strongly, with an accuracy of 97%. Important features for anaemia classification include Haemoglobin, Gender, and MCV. Specifically, females were found to be at a higher risk of anaemia compared to males, with an odds ratio of 2.86 for gender. Decision Trees and Random Forests were chosen as the final model due to their superior performance, achieving a 100% F1 score on the test datasets. This signifies the robustness and reliability of Decision Trees and Random Forests in anaemia detection.
REFERENCES
[1] Akmal Hafeel, H.S.M.H. Fernanado, M. Pravienth, Shashika
Lokuliyana, N. Kayanthan, and Anuradha Jayakody, “IoT Device
to Detect Anemia,” 2019 International Conference on
Advancements in Computing (ICAC), Malabe, Sri Lanka, 2019.
[2] Aparna V, T V Sarath, K.I. Ramachandran, “Simulation model
for anemia detection using RBC counting algorithms and
Watershed transform” 2017 International Conference on
Intelligent Computing, Instrumentation and Control Technologies
(ICICICT), Kerala, India, 2017.
[3] Azwad Tamir, Chowdry Jahan, Mohammed S. Saif, and
U. Zaman, “Detection of anemia from image of the anterior
conjunctiva of eye by image processing and thresholding” 2017
IEEE Region 10 Humanitarian Technology Conference (R10-
HTC).
[4] Chayashree Patgiri, Amrita Ganguly, “Comparative Study on
Different Local Thresholding Techniques for Detection of Sickle
Cell Anaemia from Microscopic Blood Images” 2019 IEEE 16th
India Council International Conference (INDICON), Rajkot,
India.
[5] Enas Walid Abdulhay, Ahmad Ghaith Allow, and Mohammad
Eyad Al-Jalouly presented their research on "Detection of Sickle
Cell, Megaloblastic Anemia, Thalassemia, and Malaria through
Convolutional Neural Network" at the 2021 Global Congress on
Electrical Engineering (GC-ElecEng) in Valencia, Spain.
[6] Furkan Kiraci, Batuhan Albayrak, Muazzez Buket Darici, Arif
Selçuk Öğrenci, Atilla Özmen, and Kerem Ertez presented “Orak
Hücreli Anemi Tespiti: Sickle Cell Anemia Detection” at the 2018
Medical Technologies National Congress (TIPTEKNO).
[7] Garima Vyas, Vishwas Sharma, and Adhiraj Rathore shared insights
on "Detection of Sickle Cell Anemia and Thalassemia Causing
Abnormalities in Thin Smear of Human Blood Sample Using
Image Processing" at the 2016 International Conference on
Inventive Computation Technologies (ICICT) in Coimbatore,
India.
[8] Jessie R. Balbin, Carlos C. Hortinela, Fausto, Paul Daniel C. Divina, and
John Philip T. Felices presented research on "Identification of
Abnormal Red Blood Cells and Diagnosing Specific Types of
Anemia Using Image Processing and Support Vector Machine" at
the 2019 IEEE 11th International Conference on Humanoid,
Nanotechnology, Information Technology, Communication and
Control, Environment, and Management (HNICEM) in Laoag,
Philippines.
[9] Joan Cid, Jaime Punter-Villagrasa, Jordi Colomer-Farrarons,
Ivón Rodríguez-Villarreal, and Pere Ll. Miribel-Català discussed
progress "Toward an Anemia Early Detection Device Based on
50-μL Whole Blood Sample" in the IEEE Transactions on
Biomedical Engineering.
[10] M. Kathirvelu, S. Keerthana, V. Keerthana, S. Lakshitha, and S.
Manikandan discussed "Early Detection of Sickle Cell Anemia
Among Tribal Inhabitants" at the 2023 8th International
Conference on Communication and Electronics Systems
(ICCES) in Coimbatore, India.
[11] R. Kumar, S. Guruprasad, Krity Kansara, K. N. Raghavendra
Rao, Murali Mohan, Manjunath Ramakrishna Reddy, Uday
Haleangadi Prabhu, P. Prakash, Sushovan Chakraborty, Sreetama
Das, and K. N. Madhusoodanan introduced “A
Novel Noninvasive Hemoglobin Sensing Device for Anemia
Screening" in the IEEE Sensors Journal, Volume 21, Issue 13,
published in 2021.
[12] Maileth Rivero-Palacio, Wilfredo Alfonso-Morales, and Eduardo
Caicedo-Bravo introduced a "Mobile Application for Anemia
Detection through Ocular Conjunctiva Images" at the 2021 IEEE
Colombian Conference on Applications of Computational
Intelligence (ColCACI) in Cali, Colombia.
[13] Megha Tyagi, Lalit Mohan, and Nidhi Dahyia shared insights on
"Detection of Poikilocyte Cells in Iron Deficiency Anemia using
Artificial Neural Network" during the 2016 International
Conference on Computation of Power, Energy Information, and
Communication (ICCPEIC) in Melmaruvathur, India.
[14] Muhammad Noman Hasan, Ran An, Yuncheng Man, and Umut
A. Gurkan unveiled an "Integrated Point-of-Care Device for
Anemia Detection and Hemoglobin Variant Identification" at the
2019 IEEE Healthcare Innovations and Point of Care
Technologies (HI-POCT) in Bethesda, MD, USA.
[15] Muljono, Sari Ayu Wulandari, Harun Al Azies, Muhammad
Naufal, Wisnu Adi Prasetyanto, and Fatima Az Zahra presented
groundbreaking research titled "Non-Invasive Anemia Detection
Empowered by AI: Pushing the Boundaries in Diagnosis" in
IEEE Access, Volume 12, published in 2024.
[16] Pooja Tukaram Dalvi and Nagaraj Vernekar presented research
on "Anemia Detection Using Ensemble Learning Techniques and
Statistical Models" at the 2016 IEEE International Conference on
Recent Trends in Electronics, Information & Communication
Technology (RTEICT) in Bangalore, India.
[17] Pranati Rakshit and Kriti Bhowmik demonstrated "Detection of
Abnormal Findings in Human RBC for Diagnosing G-6-P-D
Deficiency Hemolytic Anemia using Image Processing" at the
2013 IEEE 1st International Conference on Condition
Assessment Techniques in Electrical Systems (CATCON) in
Kolkata, India.
[18] Rita Magdalena, Yunendah Nur Fuadah, Sofia Sa'idah, Inung
Wijayanto, and Raditiana Patmasari presented research on
"Non-Invasive Anemia Detection in Pregnant Women Using
Digital Image Processing and K-Nearest Neighbor" at the 2020
3rd International Conference on Biomedical Engineering
(IBIOMED) held in Yogyakarta, Indonesia.
[19] Roszymah Hamzah, Ahmad Sabry Mohamad, Nur Syahirah
Abdul Halim, Muhammad Noor Nordin, and Jameela Sathar
presented findings on "Automated Detection of Human RBC in
Diagnosing Sickle Cell Anemia with Laplacian of Gaussian
Filter" at the 2018 IEEE Conference on Systems, Process and
Control (ICSPC) in Melaka, Malaysia.
[20] Sagnik Ghosal, Debanjan Das, Venkanna Udutalapally, Asoke K.
Talukder, and Sudip Misra introduced the innovative concept of
"sHEMO: Smartphone Spectroscopy for Blood Hemoglobin
Level Monitoring in Smart Anemia-Care" in the IEEE Sensors
Journal, Volume 21, Issue 6, published in 2021.
[21] Sasikala C, Ashwin M R, Dharanessh M D, and Dhanabalan M
presented “Curability Prediction Model for Anemia Using Machine
Learning” at the 2022 8th International Conference on Smart
Structures and Systems (ICSSS), Chennai, India.
[22] Sherif H. Elgohary, Zeyad Ayman Mohamed, Omar Ayman
Mohamed, and Ahmed Osama Ismail participated in the 2022
10th International Japan-Africa Conference on Electronics,
Communications, and Computations (JAC-ECC) held in
Alexandria, Egypt.
[23] Tajkia Saima Chy and Mohammad Anisur Rahaman
demonstrated "Automatic Sickle Cell Anemia Detection Using
Image Processing Technique" at the 2018 International
Conference on Advancement in Electrical and Electronic
Engineering (ICAEEE) in Gazipur, Bangladesh.
[24] Tiago Bonini Borchartt and Willian França Ribeiro showcased
"Automated Detection of Anemia in Small Ruminants Using
Non-Invasive Visual Analysis Based on BIC Descriptor" at the
2023 36th SIBGRAPI Conference on Graphics, Patterns, and
Images (SIBGRAPI) in Rio Grande, Brazil.
[25] Vinit P. Kharkar and Ajay P. Thakare provided a
"Comprehensive Review of Emerging Technologies for Anemia
Detection" at the 2022 8th International Conference on Signal
Processing and Communication (ICSC) in Noida, India.