Received 30 June 2024, accepted 8 August 2024, date of publication 14 August 2024, date of current version 22 August 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3443299

Artificial Intelligence-Based Early Detection of Dengue Using CBC Data
NUSRAT JAHAN RIYA, MRITUNJOY CHAKRABORTY, AND RIASAT KHAN
Department of Electrical and Computer Engineering, North South University, Dhaka 1229, Bangladesh
Corresponding author: Riasat Khan ([email protected])
This work involved human subjects or animals in its research. Approval of all ethical and experimental procedures and protocols was
granted by the Institutional Review Board (IRB)/ Ethics Review Committee (ERC) of North South University, Bangladesh under
Application No. 2024/OR-NSU/IRB/0308.

ABSTRACT Dengue fever is a tropical disease transmitted by the Aedes mosquito,
with humans serving as the primary host. Each year, densely populated countries such as Bangladesh,
Thailand, and India, particularly in the Southeast Asian region, experience the majority of dengue outbreaks
worldwide. Notably, in 2023, Bangladesh endured an unprecedented dengue outbreak, registering the
highest number of cases since 2000. This research aims to facilitate early detection
of dengue from patients’ complete blood count (CBC) medical laboratory reports collected from two
hospitals in Dhaka, Bangladesh. The custom-built dataset, comprising 320 samples and 14 hematology
features, is used to evaluate diverse artificial intelligence techniques. This dataset documents suspected
dengue cases in Bangladesh from May 2023 to October 2023, reflecting a significant outbreak period,
including a gender distribution ratio of 5:3 male to female patients. Various preprocessing steps, including handling
missing values and outliers, one-hot encoding, synthetic oversampling, and removing redundant features, are
applied to the dataset. Five feature selection methods and diverse machine learning algorithms,
along with ensemble learning and transformer-based models, are implemented. The stacking ensemble
classifier achieved the highest performance, with an accuracy of 96.88% and an F1 score of 0.9646. The
stacking technique has been built using the LightGBM meta-classifier and XGBoost, Logistic Regression,
and Multilayer Perceptron base learners. The collected CBC dengue dataset and the implementation
codes are available at: https://github.com/mritunjoychk17/Dengue-Prediction-in-Bangladesh-Using-Machine-Learning.

INDEX TERMS Complete blood count, dengue prediction, explainable AI, feature selection, machine
learning, ensemble learning, transformer model.

N. J. Riya et al.: Artificial Intelligence-Based Early Detection of Dengue Using CBC Data. VOLUME 12, 2024.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Ni.
© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

I. INTRODUCTION
The human body possesses its own defense mechanisms to combat external microbial threats. Nevertheless, humans frequently fall victim to viral or bacterial infections, some of which cause lethal diseases. Dengue fever, for instance, is a viral disease primarily transmitted to humans by the Aedes mosquito. Every year, millions across the globe suffer from dengue fever, with thousands falling victim to its consequences [1]. According to the World Health Organization (WHO) and the European Union, in 2023, over six million people in nearly 92 countries were affected by dengue fever. Bangladesh alone recorded more than 0.31 million cases and over 1,600 deaths from this hemorrhagic fever [2].

Dengue is most prevalent in urban or peri-urban areas within the tropical and subtropical regions of the world, a pattern attributed mainly to insufficient sanitation, haphazard development, and unplanned urbanization [3]. According to the latest review by the WHO, the countries in the African, Southeast Asian, and Western Pacific regions have the highest incidence of dengue fever. Among the countries in the Southeast Asian region, Bangladesh recorded the highest number of dengue cases between June and October. The number of affected patients and fatalities due to dengue in 2023 was the highest in recent decades [4]. Hence, early,
efficient, and rapid detection and response measures for this arboviral disease are crucial. The escalating trend in the number of affected individuals underscores the need for appropriate preventive measures so that future outbreaks do not surpass previous records in both affected patients and fatalities [5].

While dengue symptoms primarily arise from the bite of the Aedes mosquito, the virus typically remains dormant for a period before becoming apparent. Though not inherently fatal, dengue presents a range of debilitating symptoms similar to those of other diseases. Typically, individuals with dengue experience intense fever, excruciating bodily pain, nausea, loss of appetite, and various types of skin rashes. Despite the absence of specific symptoms in the initial two weeks post-infection, patients often experience a sudden deterioration in health [6]. Pathological tests usually reveal a decrease in platelets in the blood, indicating a critical condition. Dengue virus has four serotypes. When someone is infected with one serotype, the body develops long-term immunity to it. However, the consequences can be severe if they are subsequently infected with a different serotype. While dengue symptoms may not be severe initially, a second infection can lead to severe conditions such as shock syndrome, internal bleeding, or multiple organ failure.

Considering the recent surge in dengue infections, this study introduces artificial intelligence (AI) approaches that enable early detection of dengue by employing various critical hematologic features. In this work, a private dataset has been collected from two local hospitals in Dhaka, Bangladesh. The dataset comprises complete blood count (CBC) data for 320 individuals, classified as dengue 'positive' or 'negative,' and 14 attributes. The dataset has been preprocessed employing diverse techniques. Various machine learning models have been applied, i.e., Logistic Regression, Random Forest, SVM, LightGBM, XGBoost, and a stacking classifier. Additionally, we deployed five deep-learning models (MLP, ANN, CNN, Bi-LSTM, and GRU) and two advanced transformer models, TabPFN and TabTransformer. Hyperparameter tuning with GridSearchCV and the Keras Tuner framework, together with five feature selection methods, has been employed to extract essential features. The pivotal role of various features in decision-making, mainly focusing on the interpretability of black-box models, is investigated using the LIME-based explainable AI approach. The study offers several significant contributions, which can be summarized as follows:
1) A major contribution of this work is to present a private CBC hematology report-based dengue dataset comprising 320 samples and 14 characteristic features collected from two local hospitals in Dhaka, Bangladesh.
2) A stacking ensemble model constructed from a LightGBM meta-classifier and XGBoost, Logistic Regression, and Multilayer Perceptron base learners has been applied. TabPFN and TabTransformer-based transformers and advanced deep learning models are implemented.
3) GridSearchCV and Keras Tuner are applied to tune the hyperparameters of the applied machine learning and deep learning models. Five feature selection methods have also been used to identify the most salient features.
4) Employing an explainable AI tool, LIME, this research sheds light on the key features that most strongly influence the prediction of dengue-positive and dengue-negative cases.
5) The novelty of this work is to apply explainable stacking ensemble and transformer-based AI models and investigate significant features employing a private blood test report-based dengue dataset.

II. LITERATURE REVIEW
In recent years, the advancement of AI has facilitated rapid and accurate diagnosis of various diseases through machine learning techniques. Machine learning enables the accurate identification of diseases such as diabetes, Parkinson's, Alzheimer's, cardiovascular diseases, ocular diseases, etc. Machine learning primarily involves training algorithms with new data and providing insights about patterns. As a result, with the assistance of this vast repository of data, precise disease identification becomes possible. When dengue is suspected, individuals typically visit the nearest hospital or clinic and undergo multiple tests, such as a CBC, IgM/IgG antibody test, NS1 antigen test, etc. Following blood collection, various pathological procedures are performed, which generally take many hours and cost a fair sum of money [7]. A specialized doctor then reviews the reports to assess the severity of dengue fever based on the results. These procedures can be burdensome for the people of low-income countries like Bangladesh due to the scarcity of available specialized doctors and pathologists. As a result, many researchers are striving to make dengue prediction more efficient and cost-effective by presenting various approaches and ideas. This section provides a detailed overview of existing methods from recent articles for automatically detecting dengue fever using CBC hematology samples, blood smear images, environmental factors, and other relevant parameters.

Davi et al. [8] utilized the human genome data of 102 patients to predict dengue flavivirus using machine learning models. The authors investigated the patients at high risk of developing extreme phenotypes despite moderate symptoms. Among the applied machine learning algorithms, the ANN model demonstrated the best accuracy score of 86%, with a sensitivity of 98%, by extracting features using SVM RFE. Sarma et al. [9] designed an automatic dengue prediction model using machine learning algorithms based on the recent outbreak in Bangladesh. The researchers collected raw data from patients in Dhaka and Chittagong, Bangladesh's two largest and most densely populated cities. The decision tree algorithm achieved the highest accuracy of 79%.

Fernàndez et al. [10] applied a logistic regression model to diagnose dengue fever based on the features of approximately 550 patients with febrile illness. The applied logistic regression model attained 69.2% accuracy for the positive cases with 86.2% sensitivity and a 0.66 AUC score. Mayrose et al. [11] demonstrated an automated dengue prediction model using several machine learning techniques and blood smear image samples based on the lymphocyte nucleus and platelets. The authors achieved the best performance using the SVM classifier with 95.74% accuracy and a 0.96 F1 coefficient. Prome et al. [12] predicted the number of dengue cases in different areas of Bangladesh by employing machine learning approaches and a weather dataset. The SVM model attained the best performance with a mean absolute error of 3.865.

Mello-Román et al. [13] presented predictive models for dengue based on real data from patients admitted to Paraguay's health centers with dengue fever symptoms. The applied ANN attained the maximum accuracy of 96% with the highest sensitivity and specificity of 96% and 97%, respectively. Dey et al. [14] set out to predict dengue cases based on data from 11 states of Bangladesh. The authors empirically analyzed how environmental factors affect the rise and fall of dengue cases. The applied Support Vector Regression algorithm demonstrated the best results with an R² score of 0.75. The Multiple Linear Regression algorithm also performed well, with a 0.62 R² coefficient.

Abdualgalil et al. [15] utilized clinical data from a local medical center in Yemen to predict dengue using efficient machine learning techniques. They applied five machine learning algorithms that performed efficiently on the utilized clinical data. The Extra Tree Classifier algorithm demonstrated the best performance with 99.12% accuracy and a 0.99 F1 coefficient. Ong et al. [16] modeled the transmission rate of dengue with meteorological data by comparing different machine learning algorithms. This study used multiple variables, algorithms, vector indices, and meteorological data. An ensemble machine learning algorithm, XGBoost, with the Boruta feature selection technique, achieved the highest accuracy (81%) and 0.815 AUC.

Chaw et al. [17] developed an AI-based automatic model that predicts whether a dengue patient is likely to develop shock. They used physiological data from ill patients at the University of Malaya Medical Centre and trained the model on these collected data. Among the applied machine learning models, the decision tree approach attained the maximum F1 score of 0.92 and 0.64 AUC. Sarwar et al. [18] introduced a model to predict the number of dengue-affected patients. The authors considered various environmental factors in Dhaka, including humidity, temperature, and rainfall, as these variables critically influence dengue outbreaks. After implementing statistical algorithms, the SVM algorithm achieved the highest R-squared coefficient of determination of 0.92.

Akter et al. [19] conducted a comparative study to predict dengue fever in Dhaka city, evaluating the effectiveness of time series analysis and machine learning techniques. Based on the time series, the applied ARIMA model forecast dengue outbreaks with a mean absolute percentage error (MAPE) of 15.29. The neural network model demonstrated superior performance, achieving the lowest MAPE of 1.15. Majeed et al. [20] executed various hybrid AI models to predict cases of dengue viral fever in five regions of Malaysia. Various hybrid LSTM models have been applied by combining stacked, temporal, and spatial attention approaches. The spatial stacked attention with the LSTM technique demonstrated the best performance with the lowest RMSE of 3.17.

The reviewed articles show that considerable work has been done on automatic dengue prediction employing advanced machine learning and deep learning techniques. However, most of these works did not investigate the dengue virus's significant clinical and environmental features. Few of these articles applied state-of-the-art explainable AI techniques to interpret the AI model's predictions.

FIGURE 1. Working steps of the proposed automatic dengue prediction study.

III. METHODOLOGY
Figure 1 illustrates the working steps of the proposed automatic dengue prediction study. Initially, data collection is followed by data preprocessing, which includes categorical encoding and median imputation of null values. Exploratory data analysis and removal of outliers with Z-scores are conducted before splitting the data into training and testing sets. The training data undergoes SMOTE synthetic oversampling, feature selection using five methods, and classification
algorithm application with hyperparameter optimization. The selected best parameter combination is then applied, and explainable AI techniques are used to enhance model interpretability. Finally, the model predicts dengue disease on the test data, classifying results as positive or negative. The detailed methodology of this research is discussed in the subsequent paragraphs.

A. DATA COLLECTION
A major contribution of this work is the presentation of a private CBC hematology report-based dengue dataset comprising 320 samples and 14 characteristic features collected from two hospitals in Dhaka, Bangladesh. We focus on identifying key factors in these reports and understanding the correlations among various features. The attributes of the employed dataset comprise Serial, Date, Gender, Age, Haemoglobin, ESR, WBC, Neutrophil, Lymphocyte, Monocyte, Eosinophil, Basophil, RBC, and Platelets. The reports span from May 2023 to October 2023, a period marked by a high rate of dengue outbreaks. This outbreak was highly correlated with rainfall, temperature, and an increase in the breeding rate of Aedes mosquitoes. The proposed dataset includes a diverse age range of patients, from children as young as eight months to adults up to 81 years old, with a distribution of 200 males and 120 females among the 320 patients. The distribution of dengue classes of the employed dataset is illustrated in Figure 2.

FIGURE 2. Distribution of dengue classes of the employed dataset.

B. DATA PREPROCESSING
After collecting the data, the required preprocessing steps are completed, as briefly described below.

1) DROP LESS ESSENTIAL COLUMNS
First, we have dropped the two less essential attributes, i.e., Serial and Date, to simplify the workflow. This step is part of dataset cleaning and increases efficiency. Dropping these columns has helped to concentrate on the critical hematologic features of the employed dataset.

2) HANDLING NULL VALUES
The ESR attribute has the highest number of null values, with 49 missing entries. In contrast, the Neutrophil, Lymphocyte, Monocyte, Eosinophil, and RBC features each contain one null value. Additionally, two null values are identified in the Basophil feature. To address these missing values, the median imputation technique has been employed for the aforementioned features.

3) ONE-HOT ENCODING AND FEATURE SCALING
In this work, the one-hot encoding technique has been used to encode the Gender and Dengue Class attributes as '0' and '1.' The remaining numerical features are transformed to a comparable level employing a Gaussian distribution-based standard feature scaling framework.

4) SYNTHETIC OVERSAMPLING
Synthetic Minority Over-sampling Technique (SMOTE) has been used to address class imbalance by generating synthetic data points for minority class (negative dengue cases) training samples.

C. EXPLORATORY DATA ANALYSIS
Figure 2 represents the ratio between positive and negative instances within the employed dengue dataset. Here, 'positive' and 'negative' denote the outcomes corresponding to 'Result.' In this dataset, 211 instances are labeled as positive and 109 as negative dengue cases. In total, the CBC dataset contains 320 labeled instances.

Figure 3 portrays the gender distributions of the employed dataset. 62.5% and 37.5% of the instances correspond to males and females, respectively. The statistical descriptions of various numerical features of the employed private CBC dengue dataset are summarized in Table 1. In the case of Platelets, there is a vast difference between the minimum and the maximum values, which indicates a higher likelihood of positive compared to negative dengue cases. According to Figure 4, a variation of ages can be observed in the curated dengue dataset, ranging from eight months to 81 years.

TABLE 1. Summary of various numerical features of the employed private dataset.

The pair plot of the selected features from the dengue CBC report dataset displays the relationships between various hematologic parameters such as age, hemoglobin, WBC, neutrophil, lymphocyte, RBC, and platelets, distinguishing
between positive and negative dengue test results. Figure 5 reveals noticeable clustering patterns, particularly in WBC, platelets, and lymphocytes, where positive dengue cases (marked in purple) tend to have lower WBC and platelet counts, highlighting significant hematologic differences between positive and negative dengue patients.

FIGURE 3. Gender distributions of the employed dataset.

FIGURE 4. Age distributions of the employed dataset.

D. DATA CLEANING AND REMOVING OUTLIERS
Figure 6 presents a heat map of various numerical features from the dataset, using Pearson's correlation method to visualize inter-feature correlations. A threshold value of 0.85 is set to determine significant correlations. Features exceeding this threshold are considered highly correlated. Neutrophils and lymphocytes show an inverse correlation with a value of −0.96, but neither feature is eliminated. Another correlation, between hemoglobin and RBC, yields a value of 0.83, which is below the threshold, resulting in no feature elimination. This analysis ensures that none of the features are excluded based on the established correlation criteria.

Violin plots of four selected features are depicted in Figure 7; such plots are versatile tools for visualizing numerical data distributions. The violin plots for Haemoglobin and Red Blood Cells (RBC) demonstrate symmetric distributions for both positive and negative cases. Short tails on the Haemoglobin plots indicate few outliers in both groups. Similarly, neither the positive nor the negative RBC distribution exhibits notable outliers, and both are consistent with low variability.

The distributions show greater variance for platelets and ESR (erythrocyte sedimentation rate). The ESR distribution of positive cases is skewed, with a lower median at about 25 and a noticeable tail that indicates outliers at higher values. The ESR distribution is wider in negative cases, with outliers at the higher end and a median close to 50. In positive cases, platelets have a tight, symmetric distribution with a median of 150, indicating few outliers; in negative cases, the dispersion is larger, with a higher median of around 300 and more notable outliers, especially at higher values.

In this work, the Z score method has been used to address outliers, which can be expressed as:

Z < X − 2σ    (1)
Z > X + 2σ    (2)

where X represents the feature value and σ denotes the standard deviation. For the hemoglobin feature, the range of values gets more concentrated, indicating fewer extreme values, but both positive and negative cases stay symmetric and centered around the same medians. With a marginally smaller range, RBC distributions likewise maintain their symmetry, with medians centered around 5. Although the tails of the ESR distributions for both groups have shrunk, the median for negative cases remains higher, indicating that excessively high values have been eliminated. With positive instances remaining concentrated around 150 and negative cases around 300, platelet distributions become more uniform with a narrower spread, indicating a more consistent distribution free of notable outliers.

E. ALGORITHMS
1) LOGISTIC REGRESSION
Logistic regression [21] is a statistical technique that links a set of discrete or continuous independent variables to a binary dependent variable. It is a powerful tool that produces robust models. It predicts the dependent variable by examining its relationship with one or more independent variables.

2) SVM
Support vector machine [22] is an effective supervised learning technique for outlier identification, regression, and classification. To enable the prediction of labels from one or more feature vectors, it seeks to establish a decision boundary between two classes.

3) RANDOM FOREST
Random forest [23] is a classifier that trains multiple decision trees on different subsets of the input dataset and combines their results to increase predictive accuracy. Since a random forest uses numerous trees to predict the class, some decision trees may predict the right output while others may not.

4) NAIVE BAYES
The naive Bayes algorithm [24] relies on Bayes' theorem, and it assumes that any single feature of a dataset is conditionally
independent of the other features given the class output. The Gaussian Naive Bayes algorithm expresses the probability of a hypothesis. Though this algorithm is mostly useful for continuous data, it is also very effective for classification data. This model is quite simple yet effective.

FIGURE 5. Pair plot of various features of the employed dataset.

5) AdaBoost
Adaptive Boosting (AdaBoost) [25] is a technique for ensemble learning that was first developed to boost the performance of binary classifiers. The ensemble approach AdaBoost trains and deploys trees one after the other. Boosting is implemented in AdaBoost by connecting a series of weak classifiers. Each weak classifier attempts to correct samples that were incorrectly classified by the weak classifier before it.

6) XGBoost
The Extreme Gradient Boosting algorithm is known as XGBoost. It holds a prominent place among all the machine learning algorithms because of its performance
and speed. Its portability is also a remarkable feature: it can be run on many platforms and integrated with multiple systems. Compared with other boosting algorithms, it tends to run faster [26] because of its use of parallelization.

FIGURE 6. Heat map of various numerical features of the employed dataset.

7) MLP
A multilayer perceptron (MLP) is a type of artificial neural network made up of several layers of nodes: an input layer, an output layer, and one or more hidden (dense) layers in between. It is a powerful model capable of learning complex patterns and can handle various types of data, both structured and unstructured.
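
To make the layer structure concrete, the following is a minimal sketch of a forward pass through an MLP with one hidden layer, written in plain Python. The ReLU hidden activation, the sigmoid output unit, the layer sizes, and all weight values are assumptions chosen for illustration; they are not the configuration tuned in this study.

```python
import math


def relu(values):
    """Element-wise ReLU activation (assumed here for the hidden layer)."""
    return [max(0.0, v) for v in values]


def sigmoid(x):
    """Squash a real number into (0, 1) so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def mlp_forward(features, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden (dense) layer -> single sigmoid output unit."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    logit = dense(hidden, out_w, out_b)[0]
    return sigmoid(logit)


# Hypothetical weights for a 2-input, 2-hidden-unit, 1-output network.
hidden_w = [[0.5, -0.2], [-0.3, 0.8]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0]]
out_b = [0.0]

probability = mlp_forward([1.0, 2.0], hidden_w, hidden_b, out_w, out_b)
```

In practice the weights are learned by backpropagation rather than set by hand; the sketch only shows how an input vector flows through the layers to a class probability.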

8) LightGBM
Light Gradient Boosting Machine (LightGBM) [27] is an extended version of the gradient boosting algorithm with higher scalability. This algorithm requires less computation time than other algorithms and gives more accurate predictions with its leaf-wise tree growth approach. Its confined depth makes the algorithm more robust as well.

9) STACKING CLASSIFIER
A stacking classifier is an ensemble machine learning approach applied within our model, specifically designed for addressing ad hoc circumstances. Multiple top-performing algorithms are trained as the base models on the split data. Then, at the next level, the base models' outputs are treated as new features for the meta-classifier, which produces the final prediction. In this research, we have utilized the three top-performing models (MLP, XGBoost, and Logistic Regression) as the base models for their considerably better performance and scalability, while LightGBM has been used as the meta-classifier. The proposed stacking ensemble model applied in this automatic dengue prediction system is summarized in Algorithm 1.

FIGURE 7. Violin plot of four selected features.

10) ANN
The Artificial Neural Network (ANN) [28] algorithm is so named because its working mechanism loosely resembles that of the human brain. Its activation function is a vital component that generates each layer's output from the weighted sum of its inputs. Moreover, the inputs of the dataset are processed
in the forward direction. Last but not least, ANN is simpler than other neural networks because of its feed-forward characteristics.

Algorithm 1 Algorithm of the Proposed Stacking Ensemble Model
Start
Step 1: Split the dataset into training, validation, and test sets.
Step 2: Evaluate all models using cross-validation to obtain their accuracies.
Step 3: Select the best-performing models based on accuracy.
Step 4: Assign LightGBM (LGBM) as the meta-classifier and XGBoost (XGB), Logistic Regression (LR), and Multilayer Perceptron (MLP) as base models.
Step 5: Train the base models (XGB, LR, MLP) with the training samples.
Step 6: Generate a new dataset using the predictions from the base models.
Step 7: Train the meta-classifier (LGBM) with the validation samples.
Step 8: Perform inference on the test samples using the trained meta-classifier (LGBM).
End

11) CNN
Convolutional Neural Network (CNN) denotes a class of deep learning models whose core building block is the convolutional layer, which processes structured arrays of data. The CNN model, often referred to as ConvNet, is specifically designed to automatically and adaptively learn spatial hierarchies of features, primarily in image and video-based data.

12) GRU
In this research, a lightweight variant of the LSTM (Long Short-Term Memory) model, the GRU (Gated Recurrent Unit), has been implemented. It is distinct in its integration of both long-term and short-term memory within its hidden state. This modified algorithm features two essential gates: the update gate and the reset gate. The update gate is responsible for retaining memory information, while the reset gate facilitates memory-forgetting processes. The equations for both gates are nearly identical, but each gate has its own weights.

13) Bi-LSTM
For sequence-based classification problems, the Bi-LSTM algorithm performs conspicuously better than the LSTM model. This model is regarded as the extended version of the LSTM model. In this algorithm, the encoding is performed in both the forward and backward directions. Finally, the results from both directions are concatenated.

14) TabPFN
Tabular data Prior-data Fitted Network (TabPFN) [29] works significantly well on tabular classification datasets. The best feature of this model is that, without any dataset-specific training, feature selection, or hyperparameter tuning, it delivers high accuracy within a few seconds of application. It utilizes the in-context learning method that enables the model to learn the sequences from the given input.

15) TabTransformer
TabTransformer is a transformer-based model that specializes in tabular data. This model is widely used for its efficiency and scalability in handling tabular data without many intensive preprocessing tasks. For our study, we have used the TensorFlow data pipeline with a sigmoid activation function, and this activation function has been applied to the outputs of all the layers to bring the values within the range of 0 to 1. Moreover, this model's self-attention mechanism is well suited to capturing intricate interactions [30] and finding relationships among features of the employed dataset.

F. HYPERPARAMETERS OPTIMIZATION
Hyperparameter optimization is a fundamental procedure focused on identifying the optimal values for a machine learning model's hyperparameters. Prior to the application of this technique, conventional machine learning models are typically applied to the dataset with default settings, often resulting in suboptimal performance metrics. However, upon integrating hyperparameter optimization methodologies such as GridSearchCV and RandomizedSearchCV, statistical algorithms exhibit notable improvements in predictive performance compared to their untuned counterparts. Overall, this process plays a vital role in achieving the best performance metrics such as precision, recall, accuracy, and F1 score. For deep learning algorithms, the Keras Tuner optimization method is deployed in this work. This approach alleviates the need for extensive manual experimentation, resulting in enhanced model performance and improved accuracy rates. In this research, GridSearchCV has been utilized for all the applied machine learning techniques, MLP, and LightGBM models, whereas Keras Tuner has been employed to optimize the deep learning models, including ANN, CNN, Bi-LSTM, and GRU.

G. FEATURE SELECTION METHOD
After implementing multiple machine learning and deep learning algorithms on the dataset, five feature selection methods have been applied that increase interpretability and model accuracy. These five methods are as follows: Pearson
112362 VOLUME 12, 2024


N. J. Riya et al.: Artificial Intelligence-Based Early Detection of Dengue Using CBC Data

correlation, Recursive Feature Elimination, SelectKBest with


ANOVA F-value, Chi-Square Test, and Extra Trees Classifier.
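As a rough illustration of the five methods listed above, the following sketch applies each of them with scikit-learn to a synthetic stand-in for the CBC dataset (the real data are private). The data generator, the value k = 8, the logistic regression estimator inside RFE, and the ranking by absolute Pearson correlation with the label are all assumptions made for this example, not settings reported in this work.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic placeholder with the same shape as the dataset (320 samples, 14 features).
X, y = make_classification(n_samples=320, n_features=14, n_informative=6,
                           random_state=0)
k = 8  # hypothetical number of features to keep

# 1) Pearson correlation: rank features by |r| with the class label (one common variant).
pearson_r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
pearson_idx = np.sort(np.argsort(np.abs(pearson_r))[-k:])

# 2) RFE: repeatedly drop the weakest feature according to the estimator's coefficients.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X, y)
rfe_idx = np.flatnonzero(rfe.support_)

# 3) SelectKBest with the ANOVA F-value as the univariate score.
anova_idx = SelectKBest(f_classif, k=k).fit(X, y).get_support(indices=True)

# 4) Chi-square test: scikit-learn's chi2 needs non-negative inputs, so rescale to [0, 1].
X_nonneg = MinMaxScaler().fit_transform(X)
chi2_idx = SelectKBest(chi2, k=k).fit(X_nonneg, y).get_support(indices=True)

# 5) Extra Trees: keep the k features with the highest impurity-based importance.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
trees_idx = np.sort(np.argsort(forest.feature_importances_)[-k:])

selected = {
    "pearson": pearson_idx,
    "rfe": rfe_idx,
    "anova": anova_idx,
    "chi2": chi2_idx,
    "extra_trees": trees_idx,
}
for name, idx in selected.items():
    print(f"{name:12s} -> feature indices {idx.tolist()}")
```

Each entry of `selected` can then be used to slice the feature matrix (for example `X[:, selected["rfe"]]`) before refitting a classifier, which is one way to obtain a per-method accuracy comparison.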

1) PEARSON CORRELATION
The Pearson correlation-based feature selection method quantifies the linear relationship between two features. Given the input variables of the CBC dengue dataset, we evaluated this method with multiple machine learning models. Because the coefficient reflects how closely the plotted points of two variables follow a straight line, the method's effect is easy to visualize.

2) RFE
The Recursive Feature Elimination (RFE) feature selection method works by removing the least important feature from the dataset. This process repeats until the unimportant features have been eliminated and only the most important ones remain.

3) SelectKBest
SelectKBest is a filter-based (univariate) feature selection method in machine learning. It scores each feature with a statistical test such as the chi-square test or the ANOVA F-value. After scoring the features, it keeps only the k highest-scoring ones.

4) CHI-SQUARE TEST
A chi-square test is a statistical hypothesis test for categorical data. It determines whether there is a significant difference between the observed and expected data.

5) EXTRA TREES CLASSIFIER
The Extra Trees Classifier, a type of ensemble learning, is a powerful tool for classification tasks. It adds randomness to feature selection and combines results from multiple uncorrelated decision trees in a forest to predict outcomes.

IV. RESULTS AND DISCUSSIONS
This section presents the results of the applied AI models for the proposed automatic dengue detection system. Precision, recall, macro F1, and accuracy are the performance metrics reported in this section; they are computed using (3) to (6), respectively, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{3} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{4} \]
\[ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5} \]
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{6} \]

Table 2 presents the performance metrics of various machine learning models evaluated for dengue prediction without hyperparameter optimization. The TabPFN classifier achieved the highest accuracy of 94.79%, with a precision of 0.9387 and a macro F1 score of 0.9419. Other notable performances include the stacking classifier model with an accuracy of 93.75%, and the XGBoost model, which demonstrated high recall (0.9297) and precision (0.9122), achieving an accuracy of 92.71%.

TABLE 2. Performance metrics for various models (without hyperparameter optimization).

The training and validation accuracy and loss of the TabPFN model over the epochs are illustrated in Figure 8. The training and validation accuracy remains high and stable at 94% to 95% throughout the epochs. Likewise, the training and validation loss show minimal fluctuations, indicating a consistently performing model with good generalization capabilities.

FIGURE 8. Training and validation accuracy and loss vs. epochs for the TabPFN model.

A. FEATURE SELECTION TECHNIQUES RESULTS
Table 3 displays the dengue prediction accuracies of various algorithms after applying different feature selection methods, including Pearson Correlation, RFE, SelectKBest, Chi-Square Test, and Extra Trees Classifier. The Random Forest algorithm consistently performs well across all methods, with a maximum accuracy of 93.75% using both the Pearson Correlation and Chi-Square Test methods, demonstrating the robustness of this algorithm with feature selection. Similarly, the stacking classifier and XGB model both achieved the highest accuracy score of 93.75% for the RFE method, and the stacking classifier also matched this score for the SelectKBest method.

TABLE 3. Accuracy (%) for various algorithms after applying feature selection methods.

B. HYPERPARAMETER OPTIMIZATION RESULTS
Table 4 presents the accuracy of various machine learning algorithms after applying the GridSearchCV hyperparameter optimizer. The stacking classifier achieves the highest accuracy at 96.88%, indicating a significant improvement compared to its performance without optimization (93.75%). The accuracy of Logistic Regression improved from 90.62% to 92.95%, and XGBoost improved from 92.71% to 93.75%. Overall, most algorithms show improved accuracy post-optimization, highlighting the effectiveness of GridSearchCV in enhancing model performance.
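The GridSearchCV tuning step described above can be sketched as follows. This is only an illustrative setup under stated assumptions: the data are synthetic, and the logistic regression pipeline, the parameter grid, the 80/20 stratified split, and the 5-fold cross-validation are hypothetical choices, not the configuration used to produce Table 4.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder for the private CBC dataset (320 samples, 14 features).
X, y = make_classification(n_samples=320, n_features=14, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Standardization lives inside the pipeline so it is refit on every CV fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid; keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0],
    "logisticregression__class_weight": [None, "balanced"],
}

# Exhaustively evaluate every grid point with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

# The best configuration is refit on the full training split by default.
test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 4))
print("held-out accuracy:", round(test_accuracy, 4))
```

Comparing `test_accuracy` with the score of the same pipeline fitted with default hyperparameters gives the kind of before/after comparison reported for each algorithm.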
The accuracy and best-optimized hyperparameters of
various deep learning models after applying Keras Tuner for
hyperparameter optimization are summarized in Table 5. The
CNN model achieved the highest accuracy at 86.53%, while
both the Bi-LSTM and GRU models reached an accuracy of
83.33%.
The confusion matrix for the best-performing stacking
classifier model is illustrated in Figure 9, highlighting its effi-
cient predictive performance. The model accurately classified
20 instances of class 0 and 42 cases of class 1, with only
2 misclassifications, demonstrating its high effectiveness
in distinguishing between the two dengue classes for the
employed CBC dataset.
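The counts above can be connected to the reported metrics with a short calculation. The text gives 20 and 42 correct predictions and 2 errors but does not say how the two errors are split between false positives and false negatives. The sketch below assumes both errors are class 0 samples predicted as class 1; under that assumption, macro-averaging the per-class scores matches the precision, recall, and F1 values quoted for the stacking classifier in this paper. The helper name `per_class_scores` is hypothetical.

```python
def per_class_scores(tp, fp, fn):
    """Precision, recall, and F1 for one class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Stated in the text: 20 correct class 0, 42 correct class 1, 2 misclassifications.
tn, tp = 20, 42
# ASSUMPTION (not stated in the text): both errors are class 0 predicted as class 1.
fp, fn = 2, 0

accuracy = (tp + tn) / (tp + fp + tn + fn)

# Class 1 as positive, then class 0 as positive (the roles of FP and FN swap).
p1, r1, f1_1 = per_class_scores(tp, fp, fn)
p0, r0, f1_0 = per_class_scores(tn, fn, fp)

# Macro averages give each class equal weight.
macro_precision = (p0 + p1) / 2
macro_recall = (r0 + r1) / 2
macro_f1 = (f1_0 + f1_1) / 2

print(f"accuracy        = {accuracy:.6f}")
print(f"macro precision = {macro_precision:.6f}")
print(f"macro recall    = {macro_recall:.6f}")
print(f"macro F1        = {macro_f1:.6f}")
```

The accuracy of 62/64 follows directly from the stated counts; the macro precision and recall depend on the assumed error split and would swap if both errors were false negatives instead.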
TABLE 4. Accuracy for various ML algorithms after applying GridSearchCV hyperparameter optimizer.

TABLE 5. Accuracy for various deep learning models after applying Keras Tuner hyperparameter optimizer.

FIGURE 9. Confusion matrix for the best-performing stacking classifier model.

Figure 10 displays the Receiver Operating Characteristic (ROC) curve for the stacking classifier, which achieved an Area Under the Curve (AUC) score of 0.9919. This high AUC value indicates the applied stacking ensemble model's excellent ability to distinguish between the positive and negative dengue classes, confirming its robust predictive performance.

FIGURE 10. ROC-AUC curve for the stacking classifier.

In this work, Local Interpretable Model-agnostic Explanation (LIME)-based eXplainable AI (XAI) has been utilized to explain how the black-box stacking ensemble machine learning model predicts an outcome. This framework approximates the model's behavior locally around a given instance by training an interpretable linear model on various perturbed versions of that instance. This explanation provides insight into why the model classifies instances as either 'Positive' or 'Negative' in the employed dengue CBC report-based dataset. Figure 11 shows a confidence score of 0.99 for the positive case. Platelets, Neutrophil, Eosinophil, Gender, Haemoglobin, and WBC are the six most impactful features that played a significant role in determining the corresponding positive class. On the other hand, according to Figure 12, Platelets, Basophil, Lymphocyte, Monocyte, and Eosinophil act as the most prominent factors for predicting the negative dengue class.

FIGURE 11. Dengue prediction interpretation of a positive case instance using LIME explainable AI.

FIGURE 12. Dengue prediction interpretation of a negative case instance using LIME explainable AI.

Figure 13 presents a radar chart comparing the performance of the top five models, stacking, MLP, LightGBM, TabPFN, and AdaBoost, across different performance metrics. The stacking ensemble model demonstrates superior performance across all metrics, particularly in accuracy and recall. In contrast, the other models show competitive but slightly lower scores for the proposed dengue prediction system.

FIGURE 13. Radar chart of the best-performing models.

A comparative analysis of the proposed dengue prediction system with various other studies is presented in Table 6. The proposed system, using a LightGBM meta-classifier-based stacking ensemble technique and data from CBC reports in Dhaka, Bangladesh, achieved an accuracy of 96.88% and an F1 score of 0.9646, which is competitive with other high-performing models like the ANN (96% accuracy) and ETC (99.12% accuracy). Notably, the proposed system demonstrated strong precision and recall metrics (97.73% and 95.45%, respectively), highlighting its effectiveness in accurately predicting dengue cases compared to other studies.

TABLE 6. Comparison of the proposed system with similar dengue prediction studies.

V. CONCLUSION
This research introduces various AI techniques to predict dengue using a private CBC report dataset. The dataset comprises 320 samples and 14 hematology features collected from local hospitals in Dhaka, Bangladesh. Diverse preprocessing steps are applied to the dataset, i.e., handling missing values and outliers, one-hot encoding, feature standardization, synthetic oversampling, and removing redundant features. Various machine learning, deep learning, and transformer-based models are applied to predict positive and negative dengue cases. The hyperparameters of the applied models are optimized by employing the GridSearchCV and Keras Tuner frameworks. A stacking ensemble approach constructed with a LightGBM meta-classifier and XGBoost, Logistic Regression, and MLP base learners accomplishes the best performance among the machine learning models. The MLP neural network model performs best among the deep learning models. Finally, the LIME XAI approach has been applied to investigate the salient features and interpret the predictions provided by the stacking classifier. In the future, the employed dataset can be expanded by adding new data from a larger cohort of patients. A multimodal architecture can be applied using blood smear images for the same patient data. A multiclass problem can be defined using mild, moderate, severe positive, and negative dengue case samples.

REFERENCES
[1] M. B. Khan, Z.-S. Yang, C.-Y. Lin, M.-C. Hsu, A. N. Urbina, W. Assavalapsakul, W.-H. Wang, Y.-H. Chen, and S.-F. Wang, "Dengue overview: An updated systemic review," J. Infection Public Health, vol. 16, no. 10, pp. 1625–1642, Oct. 2023.
[2] N. Ali, "The recent burden of dengue infection in Bangladesh: A serious public health issue," J. Infection Public Health, vol. 17, no. 2, pp. 226–228, Feb. 2024.
[3] S. Roy, A. Biswas, M. T. A. Shawon, S. Akter, and M. M. Rahman, "Land use and meteorological influences on dengue transmission dynamics in Dhaka city, Bangladesh," Bull. Nat. Res. Centre, vol. 48, no. 1, pp. 1–24, Mar. 2024, doi: 10.1186/S42269-024-01188-0.
[4] N. Sharif, N. Sharif, A. Khan, and S. K. Dey, "The epidemiologic and clinical characteristics of the 2023 dengue outbreak in Bangladesh," Open Forum Infectious Diseases, vol. 11, no. 2, pp. 1–29, Feb. 2024, doi: 10.1093/OFID/OFAE066.


[5] M. E. H. Kayesh, I. Khalil, M. Kohara, and K. Tsukiyama-Kohara, "Increasing dengue burden and severe dengue risk in Bangladesh: An overview," Tropical Med. Infectious Disease, vol. 8, no. 1, p. 32, Jan. 2023.
[6] D. C. Kajeguka, F. M. Mponela, E. Mkumbo, A. N. Kaaya, D. Lasway, R. D. Kaaya, M. Alifrangis, E. Elanga-Ndille, B. T. Mmbaga, and R. Kavishe, "Prevalence and associated factors of dengue virus circulation in the rural community, Handeni district in Tanga, Tanzania," J. Tropical Med., vol. 2023, pp. 1–9, Nov. 2023.
[7] M. A. Kabir, H. Zilouchian, M. A. Younas, and W. Asghar, "Dengue detection: Advances in diagnostic tools from conventional technology to point of care," Biosensors, vol. 11, no. 7, p. 206, Jun. 2021.
[8] C. Davi, A. Pastor, T. Oliveira, F. B. d. L. Neto, U. Braga-Neto, A. W. Bigham, M. Bamshad, E. T. A. Marques, and B. Acioli-Santos, "Severe dengue prognosis using human genome data and machine learning," IEEE Trans. Biomed. Eng., vol. 66, no. 10, pp. 2861–2868, Oct. 2019.
[9] D. Sarma, S. Hossain, T. Mittra, Md. A. M. Bhuiya, I. Saha, and R. Chakma, "Dengue prediction using machine learning algorithms," in Proc. IEEE 8th R10 Humanitarian Technol. Conf., Dec. 2020, pp. 1–6.
[10] E. Fernández, M. Smieja, S. D. Walter, and M. Loeb, "A predictive model to differentiate dengue from other febrile illness," BMC Infectious Diseases, vol. 16, no. 1, pp. 1–7, Dec. 2016.
[11] H. Mayrose, G. M. Bairy, N. Sampathila, S. Belurkar, and K. Saravu, "Machine learning-based detection of dengue from blood smear images utilizing platelet and lymphocyte characteristics," Diagnostics, vol. 13, no. 2, p. 220, Jan. 2023.
[12] S. Sabrina Prome, T. Basak, T. Islam Plabon, and R. Khan, "Prediction of dengue cases in Bangladesh using explainable machine learning approach," in Proc. Int. Conf. Inventive Comput. Technol. (ICICT), Apr. 2024, pp. 1–5.
[13] J. D. Mello-Román, J. C. Mello-Román, S. Gómez-Guerrero, and M. García-Torres, "Predictive models for the medical diagnosis of dengue: A case study in Paraguay," Comput. Math. Methods Med., vol. 2019, pp. 1–7, Jul. 2019.
[14] S. K. Dey, M. M. Rahman, A. Howlader, U. R. Siddiqi, K. M. M. Uddin, R. Borhan, and E. U. Rahman, "Prediction of dengue incidents using hospitalized patients, metrological and socio-economic data in Bangladesh: A machine learning approach," PLoS One, vol. 17, no. 7, Jul. 2022, Art. no. e0270933.
[15] B. Abdualgalil, S. Abraham, and W. M. Ismael, "Early diagnosis for dengue disease prediction using efficient machine learning techniques based on clinical data," J. Robot. Control (JRC), vol. 3, no. 3, pp. 257–268, May 2022.
[16] S. Q. Ong, P. Isawasan, A. M. M. Ngesom, H. Shahar, A. M. Lasim, and G. Nair, "Predicting dengue transmission rates by comparing different machine learning models with vector indices and meteorological data," Sci. Rep., vol. 13, no. 1, pp. 1–10, Nov. 2023.
[17] J. K. Chaw, S. H. Chaw, C. H. Quah, S. Sahrani, M. C. Ang, Y. Zhao, and T. T. Ting, "A predictive analytics model using machine learning algorithms to estimate the risk of shock development among dengue patients," Healthcare Anal., vol. 5, pp. 1–17, Jul. 2024.
[18] M. T. Sarwar and M. A. Mamun, "Prediction of dengue using machine learning algorithms: Case study Dhaka," in Proc. 4th Int. Conf. Electr., Comput. Telecommun. Eng. (ICECTE), Dec. 2022, pp. 1–6.
[19] T. Akter, M. T. Islam, M. F. Hossain, and M. S. Ullah, "A comparative study between time series and machine learning technique to predict dengue fever in Dhaka city," Discrete Dyn. Nature Soc., vol. 2024, pp. 1–12, May 2024.
[20] M. A. Majeed, H. Z. M. Shafri, Z. Zulkafli, and A. Wayayok, "A deep learning approach for dengue fever prediction in Malaysia using LSTM with spatial attention," Int. J. Environ. Res. Public Health, vol. 20, no. 5, p. 4130, Feb. 2023.
[21] R. Real, A. M. Barbosa, and J. M. Vargas, "Obtaining environmental favourability functions from logistic regression," Environ. Ecol. Statist., vol. 13, no. 2, pp. 237–245, Jun. 2006.
[22] S. Huang, N. Cai, P. P. Pacheco, S. Narrandes, Y. Wang, and W. Xu, "Applications of support vector machine (SVM) learning in cancer genomics," Cancer Genomics Proteomics, vol. 15, pp. 41–51, Jul. 2018.
[23] A. Liaw and M. Wiener, "Classification and regression by randomforest," R News, vol. 2, pp. 18–22, Jul. 2002.
[24] D. J. Hand and K. Yu, "Idiot's bayes—Not so stupid after all?" Int. Stat. Rev., vol. 69, no. 3, pp. 385–398, Dec. 2001.
[25] A. Yulianto, P. Sukarno, and N. A. Suwastika, "Improving AdaBoost-based intrusion detection system (IDS) performance on CIC IDS 2017 dataset," J. Phys., Conf. Ser., vol. 1192, Mar. 2019, Art. no. 012018.
[26] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016, pp. 785–794.
[27] J. Zhang, D. Mucs, U. Norinder, and F. Svensson, "LightGBM: An effective and scalable algorithm for prediction of chemical toxicity–application to the Tox21 and mutagenicity data sets," J. Chem. Inf. Model., vol. 59, no. 10, pp. 4150–4158, 2019.
[28] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[29] N. Hollmann, S. Muller, K. Eggensperger, and F. Hutter, "TabPFN: A transformer that solves small tabular classification problems in a second," 2022, arXiv:2207.01848.
[30] T. Kiranbhai Vyas, "Deep learning with tabular data: A self-supervised approach," 2024, arXiv:2401.15238.

NUSRAT JAHAN RIYA received the B.Sc. degree in computer science and engineering from the Department of Electrical and Computer Engineering, North South University, Dhaka, Bangladesh. Her current research interests include artificial intelligence, machine learning, deep learning, and natural language processing.

MRITUNJOY CHAKRABORTY received the B.Sc. degree in computer science and engineering from the Department of Electrical and Computer Engineering, North South University, Dhaka, Bangladesh. He is currently a Research Assistant with the Department of ECE, North South University. His research interests include computer vision, natural language processing, and machine learning.

RIASAT KHAN received the B.Sc. degree in electrical and electronic engineering from the Islamic University of Technology, Bangladesh, in 2010, and the M.Sc. and Ph.D. degrees in electrical engineering from New Mexico State University, Las Cruces, NM, USA, in 2018. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, North South University, Dhaka, Bangladesh. His research interests include data science, machine learning, computational bioelectromagnetics, and power electronics.
