
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3454537

MULTI-LABEL CLASSIFICATION OF LUNG DISEASES USING DEEP LEARNING

MUHAMMAD IRTAZA (1), ARSHAD ALI (1), MARYAM GULZAR (2), and AAMIR WALI (1)

(1) FAST School of Computing, National University of Computer and Emerging Sciences, Lahore, Pakistan
(2) SE Department, LUT University, Lappeenranta, Finland

Corresponding authors: Maryam Gulzar ([email protected]; Tel.: +358 50 4702981), Arshad Ali ([email protected])

ABSTRACT Assistance for doctors in disease detection can be very useful in environments with scarce resources and personnel. Historically, many patients could have been cured with early detection of the disease. The application of deep learning techniques to large datasets in medical imaging has allowed computer algorithms to produce results as effective as those of medical professionals. To assist doctors, it is essential to have a versatile system that can detect multiple diseases in the lungs in a timely manner and with high accuracy. Although many classifiers and algorithms have been implemented over time, deep learning models (i.e., CNN, Deep-CNN, and R-CNN) are known to offer better results. After a thorough literature review of the state-of-the-art techniques, this work applies various models, such as MobileNet, DenseNet, VGG-16, EfficientNet, Xception, and InceptionV3, to the selected large dataset. The goal is to enhance the accuracy of these algorithms by experimenting with parameter optimizations. We observe that MobileNet produces better results than the other models. We implement a deep convolutional GAN to produce synthetic X-ray images containing various pathologies already included in the chosen imbalanced dataset, namely NIH Chest X-ray, which contains 14 classes. The synthetic dataset contains 1193 samples belonging to five classes. We test the suggested model using evaluation measures such as recall, precision, and F1-score, along with binary accuracy. The suggested deep learning model achieves a recall as high as 57%, binary accuracy of 93.4%, an F1-score of 0.553, and an AUC of 81. After the inclusion of the generated synthetic samples, the F1-score rises to 0.582, a 5% increase. Although the Generative Adversarial Network (GAN) shows lower performance, we encourage further research and experiments to explore the versatility of GANs in the field of medical imaging.

INDEX TERMS Lung Disease, Mobile-Net, Image-Augmentation, GAN, Class imbalance, Multi-class Classification

I. INTRODUCTION
Health disorders and conditions affecting the lungs are referred to as lung diseases. Diseases such as pneumonia, asthma, tuberculosis, emphysema, and malignancies in the lungs make the lungs lose their versatility and hence decrease the overall volume of air. Around 2 million chest radiography images are used by doctors around the globe to examine various types of lung diseases. Diseases targeting the human chest, a major cause of death, essentially require this technology for diagnosis and treatment. Computer systems can be used to interpret radiographs as effectively as actual radiologists, and to provide clinical support in health programs as well as chest disease diagnosis. This can prove to be an effective tool for countless clinical environments where, for instance, sufficient medical staff is not available.

Lung diseases like pneumonia, emphysema, asthma, and COVID-19 are very contagious, with quite a high death rate. Pulmonary fibrosis is a long-lasting infection that is quite hard to diagnose because symptoms may appear at either the initial stage or after many years. It causes difficulty in breathing and, as of now, has no cure. Early diagnosis is a crucial and challenging task because some diseases show only easily ignored symptoms like common flu, cough, and fever. Therefore, lung disease diagnosis at early stages cannot be achieved via symptoms only.


At the initial stages, chest X-rays have proved to be very helpful in highlighting the infection status, as they can show the early symptoms of lung infections that can lead the human body to respiratory failure. CT scans are helpful for surgery and can also be beneficial for lung cancer detection or heart failure. However, X-rays are much more popular due to their cost-effectiveness and simplicity of capture.

Advances in emerging deep learning techniques and models have proven quite beneficial in diagnosing and detecting numerous lethal diseases more efficiently. The application of state-of-the-art deep learning techniques on large datasets has allowed computer algorithms to produce results as effective as those of medical professionals in medical imaging tasks involving skin cancer classification [1], lymph node metastases [2], and diabetic retinopathy detection [3]. A keen interest of researchers in advanced methods for automatic analysis of chest imaging [4], [5] has led to the development of algorithms that can detect pulmonary nodules [6]. Nevertheless, a versatile and accurate system is still required that can classify multiple pathologies, including pneumonia and pneumothorax.

Research is still in its early stages to find models that provide efficient and accurate classification of lung diseases by examining various pathologies. This work conducts an experimental analysis for the assessment and evaluation of state-of-the-art deep learning models on large datasets with multiple lung diseases. The deep learning models were trained on a large chest X-ray image dataset with various classes for the early detection of fatal lung diseases. For assessment, the evaluation measures Binary Accuracy, Recall, Precision, F1 Score, and AUC were calculated to compare the results against recent research models. Furthermore, synthetic X-ray images, generated with GANs to tackle the scarcity of medical X-rays, are also tested. Our study introduces a novel approach to lung disease diagnosis through multi-label classification, enabling the simultaneous identification of multiple lung diseases within a single chest X-ray; this contrasts with traditional methods, which are limited to single-disease detection. By focusing on enhancing the recall metric, crucial in the medical field, our method significantly improves diagnostic performance. Additionally, we leverage a DCGAN to address class imbalance by generating synthetic images for various disease classes, further enhancing the robustness and effectiveness of our diagnostic system. Although the GAN results were not as promising as expected, our research establishes a foundation for future exploration of generative AI in lung disease classification.

A. RESEARCH GOALS
• Implementation of data processing techniques such as normalization and image preprocessing.
• Tackling the problem of class imbalance with the help of Generative Adversarial Networks (GANs).
• Implementation of image augmentation techniques like geometric transformation, colour transformation, and image enhancement to examine the effect on overfitting.
• Implementation of MobileNet, EfficientNet, InceptionV3, Xception, ResNet, and DenseNet.
• Examination of the training time of deep transfer-learning-based CNN models to find an efficient model.
• Comparison of the classification results of synthetic data produced with GANs against the original data combined with geometric image augmentation.

B. RESEARCH QUESTIONS
• Can we improve the classification results, along with the efficiency of the algorithm for large datasets, by using the latest deep transfer learning models?
• Can data generated with Generative Adversarial Networks produce better results than conventional geometric image augmentation techniques?
• What effects can image preprocessing techniques and hyper-parameters, including model, batch size, input image resolution, and loss function, have on the classification fitness functions?

C. RESEARCH CONTRIBUTIONS
• A transfer-learning model, combining MobileNetV1 and a three-layered deep neural network classifier incorporating geometric image augmentation, for multi-label classification of lung diseases. This model aims to achieve superior performance compared to existing approaches in terms of both accuracy and efficiency.
• A comprehensive analysis of data preprocessing techniques (e.g., normalization, threshold segmentation) and hyperparameter tuning for X-ray classification, with a focus on optimizing classification performance.
• Generation and evaluation of a synthetic X-ray image dataset belonging to five classes using deep convolutional Generative Adversarial Networks (DCGANs).

The remainder of the paper is structured as follows. In Section II, we describe the existing literature from the different perspectives considered in this study. Section III presents the methodology of the work. In Section IV, we provide the experimental settings and experimental results. Results are discussed in Section V. In Section VI, we conclude this article.

II. LITERATURE REVIEW
Since the widespread outbreak of contagious diseases such as COVID-19, researchers have been keenly working to develop reliable and efficient methods for disease detection and classification, to reduce the exposure of human resources to such outbreaks. These works involved different datasets, namely NIH ChestX-ray-14 [7] [8] [9] [10] [11] [12], ChestX-ray2017 [8], PLCO [9], ICBHI 2017 [13], subregion-demarked parenchymal lung disease of the ILD Dataset

[14], Chexpert [15], the COVIDx dataset [16], the JSRT dataset [17], the Montgomery dataset [17], and Shenzhen Hospital data [18], and various deep CNN-based techniques like MobileNetV2 [7], [8], DenseNet-121 [9], ResNet-38 [11], ResNet-50 [11], [15], ResNet-101 [11], CNN [14], [15], [19], [20], VGG [13], [15], [18], SVM classifier [13], GAN [16], [17], and InceptionV3 [15].

A. CNN
Convolutional neural networks (CNNs) are probably the most popular deep learning model, used for both segmentation [21] and classification of all kinds of data: image [22], text [22], and speech. Within lung disease classification, the use of CNNs is also very common. The research work of [10] is solely based on the diagnosis of tuberculosis and lung cancer, which come under thoracic diseases. This task is quite time-consuming and labour-intensive, leading to diagnostic errors. Usually, expert radiologists are required to analyze the images. Recent deep learning-based approaches are powered by huge network architectures and have proven quite fruitful in medical imaging interpretation tasks. However, obtaining expert-level performance requires a large amount of labeled image data, which is a bottleneck.

In 2018, Pranav Rajpurkar et al. [10] evaluated deep learning-based models and investigated the problem of pathology detection. For this problem, the pathologies detected in the chest radiographs were compared with those found by practising radiologists. A CNN-based model named CheXNeXt was developed to detect 14 different types of pathologies simultaneously. The ChestX-ray14 dataset was used to train the deep learning-based CNN algorithm. The dataset consists of a total of 112,120 frontal-view labeled chest radiographs of 30,805 different patients. Automatic extraction approaches were used to obtain the labels of the images. The dataset was divided into three partitions: a training set (98,637 images, 28,744 patients), a validation set (420 images, 389 patients), and a tuning set (6,351 images, 1,672 patients). The parameters of the model were optimized using the training set, the validation set was used to validate the CheXNeXt model, and the tuning set was used to evaluate and compare the networks. AUC was used as the metric to evaluate the performance of the CheXNeXt model.

The experimental results showed that the newly proposed model gave better results in the case of atelectasis detection, giving an AUC of 86.2% compared to the radiologists' AUC of 80.8%. The key limitations of the research were that neither CheXNeXt nor the radiologists were allowed to benefit from the patient history, and the whole dataset was collected from a single institute.

B. MOBILENET
Abdelbaki et al. [7] proposed a classification solution for lung pathologies using state-of-the-art deep learning and transfer learning techniques. A modified version of MobileNetV2 was used on the NIH ChestX-ray-14 dataset along with some metadata provided with the dataset, i.e., age, gender, etc. Due to the significant class imbalance in this dataset, the researchers employed resampling techniques to address the issue.

One of the major goals of [7] was to develop a solution for IoT systems, requiring the system to be trained and run on devices with low computing power. The authors specify that the deep learning model used in this work was designed particularly for developing low-latency, small applications in the field of computer vision and IoT. After data augmentation, the authors divided the samples into training, validation, and testing subsets. The training subset included 38,819 images, and the validation and testing subsets each included 12,940 image samples. It can be noted that the entire dataset was not used in this study, as the dataset contains 112,120 samples in total. The training was done with 10 epochs and a batch size of 32.

The proposed solution was evaluated on accuracy, sensitivity, area under the curve (AUC), specificity, and time consumption as fitness measures. The solution achieved 90% accuracy, 0.45 sensitivity, 0.97 specificity, and 0.55 F1-score. Since details such as tissue structure and texture are of utmost importance in medical image classification, researchers have derived a module to capture multi-scale input features in convolutional neural networks. Abnormalities in the tissues can be of variable sizes; hence capturing the spatial information is very important.

Mengjie Hu et al. [8] proposed a solution, MD-Conv, which has multiple depth-wise convolution filters with variable kernel sizes in a single depth-wise convolution layer. The paper compares the classification results of this technique, i.e., accuracy, AUC, and floating point operations per second (FLOPs), on two datasets against other state-of-the-art methods. The proposed solution achieved an AUC of 0.78, and the FLOPs were much lower than those of ResNet50 and DenseNet121. However, accuracy was lower than that of other techniques, i.e., DenseNet+LSTM and DenseNet121.

C. DENSENET
The computational cost of models can increase exponentially when the input contains images. To handle this problem, the images are resized to a lower resolution, which leads to a loss of detail. Gündel et al. [9] proposed a solution based on the location-aware technique of Dense Networks, known as DNetLoc, to incorporate high-resolution images for maximum possible detail. The authors combined two datasets, ChestX-ray14 and PLCO, to obtain a cumulative 297,541 image samples, although the PLCO dataset has 12 classes along with class imbalance. This research makes use of models pre-trained on ImageNet. For the first dataset, cross-entropy was used as the loss function along with some additional weights, as there is a class imbalance. The batch size used was 128 samples per iteration.

According to the authors, the reason for the large batch size was to increase the probability that each batch contains samples from all classes. An adaptive learning rate with the Adam optimizer was used. Furthermore, the input image size was the same as the original size, i.e., 1024 * 1024. Histogram normalization was also applied to the PLCO dataset to tune the contrast and brightness of the images. Further normalization techniques involved mean as well as standard deviation normalization. The results with the ChestX-ray14 dataset were 0.84 AUC, and for PLCO the AUC was 0.87.

D. VGG
Lung disease is very common around the world, and its impact on health is escalating quickly due to environmental change, climate change, lifestyle adjustments, and other reasons. Since these fatal diseases spread swiftly, their timely diagnosis is essential. To resolve the classification problem in this field, a lot of research has been done in image processing, deep learning, and machine learning.

In [23] Subrato Bharati et al presented a novel deep learning-based hybrid model named VDSNet. The model was a combination of the CNN-based model VGG and a spatial transformer network (STN). The newly proposed model was trained on a well-known, publicly available lung disease dataset (NIH Chest X-ray).

The experiments were done on two versions of the dataset (full and sample); on both versions, VDSNet outperformed the existing models. For evaluation, validation accuracy, precision, F0.5 score, and recall were considered as the metrics. With the full version of the NIH Chest X-ray dataset, the proposed VDSNet gave a validation accuracy of 73% and outperformed the existing models: hybrid CNN (validation accuracy = 69.5%), vanilla gray (67.8%), VGG (63.8%), vanilla RGB (69%), and a modified capsule network.

E. RESNET-50
In another study, Baltruschat et al. [11] explored transfer learning and weight initialization for ResNet-50 on the NIH Chest X-ray 14 dataset. They experimented with the network architecture and the input size passed to the model. They also incorporated non-image features from the dataset, such as age and gender. Their methodology section states that they introduced class balancing and positive/negative balancing in the loss function, but there was no significant difference in the results, so they used class-averaged binary cross-entropy as their objective function.

They also investigated the effect of two distinct network initialization strategies for ResNet-50. In the first technique, they used weights trained on the ImageNet dataset, whereas in the second, they randomly initialized the network weights. The researchers concluded that using non-image features helps increase the classification results for some of the pathologies. They achieved an average AUC of 0.80 over all 15 classes. This research suggests that fine-tuning the models can enhance results; in this study, the average AUC improved from 0.73 to 0.80.

F. MULTI-LABEL CLASSIFICATION
In [15] Aravind Sasidharan Pillai et al introduced a new deep learning-based approach for multi-label classification of chest X-ray images, which can simultaneously detect multiple pathological conditions. The authors trained and evaluated the technique on a publicly available chest X-ray dataset named Chexpert, which includes 14 different pathological labels such as pneumonia, emphysema, and lung mass. They down-sampled the dataset, originally 439 GB in size, to 11 GB, and divided it into a train set (80%) and a validation set (20%).

The proposed approach consists of CNN architectures (Custom Net, DenseNet121, ResNet-50, Inception_V3, Vgg16) with residual connections and attention mechanisms for feature extraction and classification. The residual connections help to alleviate the vanishing gradient problem in deep networks, while the attention mechanism allows the network to focus on specific regions of the image that are most relevant for each label. The proposed approach has the potential to be a valuable tool for automatic multi-label classification of chest X-ray images, which can aid in the diagnosis and management of various lung diseases. The authors suggest that their approach can be further improved by incorporating additional clinical information or by using more sophisticated deep-learning architectures.

The authors conducted extensive experiments to evaluate the performance of their approach and compared it to several other state-of-the-art methods for multi-label chest X-ray classification. The results showed that their approach outperformed all other methods in terms of both accuracy and area under the receiver operating characteristic curve (DenseNet training AUROC 78 & training accuracy 87%). The authors also conducted a detailed analysis of the performance of their approach on different subsets of the dataset, including cases with multiple co-occurring pathologies. The results showed that their approach was able to accurately detect specific pathologies, even in the presence of other pathologies.

In [24], Shamrat et al. presented the MobileLungNetV2 model, a customized MobileNetV2, for classifying 14 lung diseases from chest X-rays. The model is based on MobileNetV2 but fine-tuned to improve accuracy. Pre-processing steps like contrast enhancement and noise reduction were applied to the ChestX-ray14 dataset. This study addressed single-label classification.

In [20], K. V. Priya et al proposed a federated learning approach for detecting chest diseases in chest X-ray images using DenseNet for multi-label classification. The authors note that chest diseases are a major cause of death worldwide, and early detection is critical for effective treatment.

However, due to the large number of images and the need for specialized expertise, manual interpretation of chest X-ray images can be time-consuming and error-prone. Therefore, automated classification of chest X-ray images can help to improve diagnosis and reduce the burden on radiologists. The proposed federated learning approach involves training a deep neural network on local datasets at multiple hospitals or clinics, with each dataset containing images from patients at that location. The model is trained using the federated averaging algorithm, which aggregates the gradients of the local models to update a global model while maintaining the privacy of the local datasets. The authors use DenseNet, a popular deep-learning architecture, for the multi-label classification of chest diseases.

The authors evaluate the performance of the federated approach on a publicly available dataset of chest X-ray images with 14 different chest diseases. They compare the performance of their approach with a centralized approach and a baseline approach using a single local dataset. The results show that the federated approach achieves higher accuracy and F1-score than the baseline and centralized approaches, indicating that the federated approach can improve the performance of chest disease classification while maintaining data privacy. The authors conclude that federated learning can be a useful approach for large-scale multi-institutional studies in medical imaging.

In another study, Al-Sheikh et al. [25] introduced an automated system for multi-lung disease detection using X-rays and CT scans. It employed a customized convolutional neural network (CNN) and two pre-trained deep learning models (AlexNet and VGG16Net) with a new image enhancement model based on k-symbol Lerch transcendent functions. The system involved two main steps: pre-processing with image enhancement, and classification using the CNN models. This study did not address the issue of multi-label classification.

G. GANS (GENERATIVE ADVERSARIAL NETWORKS)
Data augmentation techniques and generative adversarial networks (GANs) have been used extensively for synthetically generating all kinds of data, such as images [26], [27], speech [28], and videos [29]. Within images, GANs have also been used to generate medical images [30]. Similarly, in [16] Saman Motamed et al used a GAN for data augmentation of chest X-ray images to detect pneumonia and COVID-19. A GAN was trained to generate realistic X-ray pictures to be combined with the initial training data. The authors demonstrated that using GAN-generated pictures for data augmentation could improve the efficiency of detection models by evaluating their method on two publicly accessible datasets.

The first chest X-ray dataset used, named ChestXRay2017, consists of two categories, Normal (1575 images) and Pneumonia (4265 images), in JPEG format. The images were resized to 128 x 128; Pneumonia (the larger class) was chosen as the training set, and 500 images were randomly selected per class to test the performance of the model. The rest of the images were used for augmentation and to train the different models. Wang et al. created the second publicly available dataset, named COVIDx, by mixing the Covid-chest X-ray dataset with four other datasets. Covid-chest X-ray was developed by Cohen et al and includes chest X-rays with radiological readings for COVID-19. The COVIDx dataset has 3 classes: normal (8066 images), pneumonia (5559 images), and COVID-19 (589 images). All images were converted to grayscale and resized to 128 by 128. The test set consists of a total of 589 images randomly taken from each of the 3 classes.

The results showed that the IAGAN model achieved the highest results for dataset I (sensitivity/recall = 0.82, specificity = 0.84 & accuracy = 0.80) and dataset II (sensitivity/recall = 0.69, specificity = 0.69 & accuracy = 0.69). They conclude that GAN-based data augmentation is an excellent method for enhancing the performance of deep learning models in applications involving medical imaging.

T. Malygina et al [12] investigated how GANs can be utilized to improve the performance of deep learning-based models for detecting pathologies in chest X-rays in the case of an imbalanced dataset. The authors used a publicly available dataset named ChestXray14, which consists of a total of 112,120 X-ray images taken from 30,805 patients, with 14 labels. The dataset is divided into three parts for the pathology classification and localization task: train set (70%), validation set (10%), and test set (20%).

To reduce the computational power needed to detect pneumonia, the images of the dataset were resized from 1024 x 1024 to 256 x 256 to train both the GAN and the image classifier. They created synthetic pictures using a GAN trained on the normal and pneumonia categories, adding them to the training dataset to create more evenly distributed classes. After that, the authors analyzed the performance of several deep learning models trained on both the initial and the augmented datasets. The model without augmentation showed AUC (Pneumonia = 0.9745, Pleural-Thickening = 0.9792, & Fibrosis = 0.9745) and PR AUC (Pneumonia = 0.9580, Pleural-Thickening = 0.9637, & Fibrosis = 0.9446), while the model with augmentation showed AUC (Pneumonia = 0.9929, Pleural-Thickening = 0.9822, & Fibrosis = 0.9697) and PR AUC (Pneumonia = 0.9865, Pleural-Thickening = 0.9680, & Fibrosis = 0.9294). The GAN-augmented dataset increased the model's accuracy in the case of pneumonia and COVID-19 recognition.

Munawar et al [17] proposed a two-stage approach in which, firstly, a GAN is trained to create synthetic lung images, and then these images are used to train a segmentation network. The authors used publicly available chest X-ray datasets, namely the JSRT, Montgomery, and Shenzhen datasets, which were preprocessed and augmented to create a large training set. The datasets were divided as follows: JSRT (train set: 200, validation set: 20, test set: 20), Montgomery (train set: 110, validation set: 10, test set: 18), and Shenzhen (train set: 200, validation set: 40, test set: 40).

The GAN was then trained on this dataset to generate realistic lung images. The segmentation network was then trained on the synthetic lung images and evaluated on a separate test set.

The results showed that the proposed method outperformed several other state-of-the-art methods for lung segmentation on the same dataset. The authors also conducted a detailed analysis of the performance of the method and showed that it was able to segment lungs accurately in a variety of challenging scenarios, including images with significant pathology and those with low image quality. The authors conclude that the proposed method has the potential to be a valuable tool for automatic lung segmentation in chest X-ray images and can aid in the diagnosis and monitoring of lung diseases. Table 1 summarizes the existing works.

H. RESEARCH GAPS
Thoroughly examining the literature revealed a few research problems and gaps, such as class imbalance, generalization, and the challenge of computational complexity. Improvement in these areas can enhance our capabilities in the field of medical imaging.

1) Removing Class Imbalance
The NIH Chest X-ray 14 dataset has 14 disease classes with 60000 samples in total and 1 non-disease class with 60000 samples. There are about 10,000 images of just the Infiltration class. The Hernia class has the fewest samples, around 0.3% of the total. This indicates a huge class imbalance, which leads to non-uniform results across classes. So there is a need to reduce this imbalance with augmentation, by using Generative Adversarial Networks (GANs), or by merging other datasets of the same type with it. Finally, we can evaluate the results by applying various deep learning or transfer learning models.

2) Better Generalization of Model
Another gap that our team identified is low values of accuracy, F1 score, or area under the curve (AUC). For example, the results from [7] show that the same model that provides good results for some classes produces quite low results for others. A future task is to find a technique that is generalized enough to produce high results for all diseases. One of the classes with lower classification results is the Infiltration class; although this class has a large number of samples, the results do not reflect that. One reason for the suboptimal results might be the quality of the samples, which can be examined in future research.

3) Improving Computational Cost
The NIH Chest X-ray 14 dataset comprises 112,120 samples and has a size of 45 GB. The image size is 1024 by 1024, obtained from X-ray tests of around 30,000 patients. Many densely layered deep learning models struggle in the training phase because this dataset is so large. The computational time is so high that a single training run of around 20 epochs costs around 5 hours; these rounded-off figures were obtained from our implementation of the EfficientNet-B4 model on this dataset. Such long computation times can hinder in-depth experimental research. Hence, one direction for future research is decreasing the execution complexity of models, or applying preprocessing techniques that preserve good results while decreasing the execution time. One way of decreasing the execution time is to use models with a low number of layers, such as MobileNet, GoogleNet, etc.

FIGURE 1: Flow Chart of Research Methodology

III. RESEARCH METHOD
This section provides insights into our methodology for the classification of lung diseases. Figure 1 provides a flowchart of the methodology followed in this research.

A. DATA COLLECTION AND INPUT
Chest examination with X-rays is one of the most common and cost-effective procedures used by doctors. However, since X-rays have low quality compared to more sophisticated methods such as CT scans, it is difficult to perform disease classification using X-ray images. Arranging an adequate number of samples for deep learning models is a difficult, if not impossible, task: due to patient privacy and confidentiality concerns, finding publicly available datasets is challenging. However, the National Institutes of Health has made a dataset publicly available that is convenient to download and work with. This dataset is called the NIH Chest X-ray 14 dataset, containing 112,120 multiclass image samples along with annotated labels accurate up to 90%.

TABLE 1: Summarized Review of Existing Works

Ref | Year | Technique/Models | Dataset | Train & Test Set | Results
[7] | 2021 | MobileNet V2 (with additional hidden layers) | NIH ChestX-ray14 (14 classes); only 64,699 samples used after data augmentation | Training: 38,819 (60%); Validation: 12,940 (20%); Testing: 12,940 (20%) | AUC: 0.81; Accuracy: 90%; Sensitivity: 45%; Specificity: 97%; F1 Score: 55%
[8] | 2020 | MobileNet V2 with Multi-Kernel Depthwise Convolution (MD-Conv) | ChestX-ray14 (112,120 samples); ChestX-ray2017 (5,856 samples) | - | AUC: 0.78
[9] | 2018 | DNet: DenseNet-121 & DNetLoc: DenseNet-121 Location Aware | 2 combined datasets (297,541 samples): ChestX-ray14 (112,120) and PLCO (185,421); PLCO is a 12-class dataset with imbalance | - | ChestX-ray14 AUC: 0.84; PLCO AUC: 0.87
[10] | 2018 | CheXNeXt | ChestX-ray14 | - | Average AUC: 0.84
[11] | 2019 | ResNet-38, ResNet-50, ResNet-101, with multi-label loss function | ChestX-ray14 | Training: 70%; Validation: 10%; Testing: 20% | ResNet-38 AUC: 0.80
[14] | 2018 | CNN | Subregion-demarked parenchymal lung disease of ILD | Training set: 960 samples; Test set: 240 samples | Accuracy of CNN classifier: 95.12%
[23] | 2020 | Hybrid model (VGG + spatial transformer network) | NIH chest X-ray image dataset | - | Validation accuracy: VDSNet 73%, vanilla gray 67.8%, vanilla RGB 69%, hybrid CNN 69.5%, VGG 63.8%
[13] | 2020 | Pre-trained VGG16 + SVM classifier & VGG16 (TL) + Softmax classifier | ICBHI 2017 Database | - | Accuracies: Method 1 = 7.62%, Method 2 = 5.18%
[18] | 2020 | VGG16, ResNet-50, InceptionV3 | Dataset from Shenzhen Hospital | Training: 80%; Validation: 10%; Test: 10% | Segmented CXRs: VGG16 95%, ResNet-50 78%, InceptionV3 75%; Non-segmented CXRs: VGG16 77%, ResNet-50 68%, InceptionV3 77%
[19] | 2019 | 2D CNN as classifier | Dataset consisting of 6 categories | Training: 70%; Test: 30% | Accuracy: 97%
[15] | 2022 | Custom Net (CNN), DenseNet121, ResNet-50, Inception_V3, Vgg16 | Chexpert (downsampled version, 11 GB; original size 439 GB) | Train & validation split 80:20 | DenseNet training AUROC 78 & training accuracy 87%
[20] | 2020 | DenseNet121; visualization technique: gradCAMs | NIH ChestX-ray14, 112,120 images from 30,805 patients, 14 disease labels (class imbalance issue) | - | -
[16] | 2021 | CNN & GAN (IAGAN, DCGAN) | ChestXRay2017 & COVIDx dataset | Dataset I (train, test): Normal (-, 500), Pneumonia (33885, 500), COVID-19: -; Dataset II: (67293, 589), (42300, 589), (0, 589) | IAGAN (Datasets I, II): Sensitivity (0.82, 0.69); Specificity (0.84, 0.69); Accuracy (0.80, 0.69)
[12] | 2019 | GAN (CycleGAN), DenseNet for classification | ChestXray14: 112,120 images, 14 disease labels | Train: 70%; Validation: 10%; Test: 20% | AUC: Pneumonia 0.9929, Pleural-Thickening 0.9822, Fibrosis 0.9697; PR AUC: Pneumonia 0.9865, Pleural-Thickening 0.9680, Fibrosis 0.9294
[17] | 2020 | GAN & U-net | JSRT, Montgomery & Shenzhen datasets | JSRT (train: 200, valid: 20, test: 20); Montgomery (train: 110, valid: 10, test: 18); Shenzhen (train: 200, valid: 40, test: 40) | Dice score: 0.9740; IoU score: 0.943


The labels were generated by the authors using natural language techniques. This data belongs to 30,805 patients.

The NIH CXR 14 dataset is available over the internet in the form of images with a .png extension. The size of the dataset is huge, i.e., 45 GB. Nevertheless, since it can be downloaded to a local machine and is already available on Kaggle, it makes the experiments convenient. For this research work, we used image resolutions between 224 * 224 and 500 * 500, although the original is 1024 * 1024. The labels corresponding to each image are available in a comma-separated-value (CSV) file in the form of strings. We first converted these labels to a one-hot-encoded format and then divided the data into 3 subsets for the training, validation, and testing phases using Python's Sklearn library. We used the train_test_split function with random_state equal to 0. The ratio for these divisions was 7:1:2: 80726 images were used for training, 8970 for validation, and 22424 for testing the model's performance.

Figure 2 indicates the number of samples belonging to each class. It is evident how imbalanced the dataset is. Figure 3 shows the number of samples that belong to a single class against the total samples belonging to each class. Each class has less than 50% mutually exclusive images, with the lowest at 22% for Pneumonia. Figure 4 presents a few sample images of the dataset. All images are labelled with disease(s) from the 14 lung pathologies, or with no-finding, which represents the no-disease class.

B. DATA PREPROCESSING AND DATA AUGMENTATION
In data processing, the data fed as input is scaled/normalized, which helps the model learn the objective function quickly and effectively. Data augmentation can be applied to a variety of data types, including images, audio, text, and time series. Some common data augmentation techniques for image data include:
• Flipping or rotating the image horizontally or vertically
• Cropping or resizing the image
• Adding noise or distortions to the image
• Changing the brightness, contrast, or color of the image
• Applying geometric transformations such as scaling, shearing, or perspective warping

1) Image Preprocessing
Image preprocessing refers to the set of operations or techniques applied to raw images to prepare them for further analysis or processing. The goal is to enhance or extract meaningful information from the image while reducing noise, artifacts, or other unwanted components that may interfere with subsequent analysis. The preprocessing step typically includes operations such as image cropping, resizing, color correction, filtering, noise reduction, image enhancement, segmentation, and feature extraction. These operations are chosen based on the characteristics of the images and the specific application requirements.
• In this research work, the image input size was kept to either 224 * 224 or 300 * 300, which caused blurriness.
• To improve the quality, image sharpening was applied to all the dataset subsets.
• Since images are read in RGB format, we applied various rescaling techniques, including multiplying each pixel by 1/255 or converting each pixel value to between -1 and 1. This not only generalizes the input but also saves memory and decreases the computational cost of the operations applied. Furthermore, it smoothens the learning of the objective function.
• The filter used for sharpening was:
  [ 0 -1  0]
  [-1  5 -1]
  [ 0 -1  0]

2) Image Augmentation
This technique is used to produce data samples from the existing samples. It can be useful when the dataset has a low number of examples or a class imbalance. We can apply various geometric methods to produce augmented data. This not only aids in learning from a class-imbalanced dataset but also generalizes the model's learning, hence decreasing overfitting: the model learns to handle new variants of the training examples. We used the ImageDataGenerator library of Keras to apply image augmentation.

The following methods were implemented for this purpose:
• Random Rotation: randomly rotates the input image up to the value provided; in this work, the value was set to 20 degrees.
• Shearing: shifts the viewing angle of the image.
• Random Zoom: randomly zooms into the input image; the value was set to 0.1 (a fraction of the image size) in this work.
• Width Shift: shifts the image horizontally; its value was set to 0.1 of the total width of the image.
• Height Shift: shifts the image vertically; its value was set to 0.1 of the image's total height.
• Horizontal Flip: randomly flips the image horizontally to change the sides of the image.

3) Images before and after applying various preprocessing and augmentation techniques
Figure 5 and Figure 6 present the original and processed images, respectively, from the dataset. We can see that the images at smaller resolution (Figure 6) are a little blurry. Figure 7 demonstrates augmented images after image sharpening with a resolution of 224 * 224 (the left image has its height shifted, whereas the right image has its width shifted). Figure 9 presents X-ray images after applying image augmentation and sharpening in RGB format; these images are fed to the network.
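To make the pipeline above concrete, the following is a minimal sketch of the label one-hot encoding, the 7:1:2 split with random_state = 0, the sharpening filter, and the geometric augmentation settings described in this section. The CSV file name and column names follow the public NIH release, and the shear value is an illustrative assumption; the remaining settings are those stated above.

import numpy as np
import pandas as pd
from scipy.ndimage import convolve
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

CLASSES = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
           "Effusion", "Emphysema", "Fibrosis", "Hernia", "Infiltration",
           "Mass", "No Finding", "Nodule", "Pleural_Thickening",
           "Pneumonia", "Pneumothorax"]

df = pd.read_csv("Data_Entry_2017.csv")    # labels arrive as "A|B" strings
for c in CLASSES:                          # one-hot encode: one column per class
    df[c] = df["Finding Labels"].apply(lambda s, c=c: int(c in s.split("|")))

# 7:1:2 split in two steps (0.125 of the remaining 80% = 10% overall).
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
train_df, val_df = train_test_split(train_df, test_size=0.125, random_state=0)

SHARPEN = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)

def preprocess(img):
    # Sharpen each channel with the 3x3 filter, then rescale to [0, 1].
    img = np.stack([convolve(img[..., ch], SHARPEN)
                    for ch in range(img.shape[-1])], axis=-1)
    return np.clip(img, 0.0, 255.0) / 255.0

# Geometric augmentation settings from Section III-B-2 (training data only).
train_gen = ImageDataGenerator(
    preprocessing_function=preprocess,
    rotation_range=20, shear_range=0.1, zoom_range=0.1,
    width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)

train_flow = train_gen.flow_from_dataframe(
    train_df, directory="images/", x_col="Image Index", y_col=CLASSES,
    class_mode="raw", target_size=(224, 224), batch_size=32)

With class_mode="raw" and a list of label columns, the generator yields one 15-dimensional binary target vector per image, matching the multi-label setting described above.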

FIGURE 2: Visual Class Imbalance of NIH Chest Xray 14 dataset [7]

FIGURE 3: Mutually Exclusive Samples vs Total Samples

FIGURE 4: Sample Images of NIH Chest X-ray 14 dataset [7]


FIGURE 5: Original images with a resolution of 1024 * 1024

FIGURE 6: Augmented images with a resolution of 224 * 224

FIGURE 7: Augmented images after image sharpening with a resolution of 224 * 224

4) Data Generation with Generative Adversarial Networks (GANs)
GANs [31] can be used to produce data from already available samples. This can be vital for scenarios such as medical image classification, wherein arranging enough data to train a large deep learning model can be tedious. By using the currently available data, we can generate new samples with a GAN, and these synthetic samples do not belong to real patients in any direct sense. This technique has accelerated the use of deep learning in the medical field; however, it still needs a lot of improvement in terms of capturing delicate tissues in the images. In this work, we have implemented a deep convolutional GAN (DCGAN) [32] to produce synthetic X-ray images containing various pathologies that are already included in the used dataset. A deep generative model with around 5 layers was implemented against a rather shallow discriminator network; the deep architecture helps in capturing minor tissues in the X-ray images. DCGAN was chosen for this study due to its balance of simplicity and effectiveness in generating high-quality images. Unlike more complex GAN variants such as WGAN, CGAN, and BEGAN, DCGAN provides a straightforward architecture that is easier to implement and optimize, making it suitable for our application in medical image generation. Additionally, DCGAN has been widely validated in various image synthesis tasks, demonstrating its robustness and efficiency in generating realistic images, which aligns well with our goal of augmenting medical datasets.

The DCGAN generates synthetic X-ray images by training a generator and a discriminator in an adversarial manner. The generator creates images from random noise, while the discriminator evaluates whether the images are real or synthetic. During training, the generator tries to produce images that are increasingly realistic, thereby 'fooling' the discriminator. This adversarial process iteratively improves the quality of the generated images. Specifically, the generator learns to map random noise vectors to image spaces that resemble the distribution of the real X-ray images, whereas the discriminator learns to distinguish between real and synthetic images. Over time, this results in the generator producing high-quality synthetic X-ray images that closely resemble real ones.

We used 3 layers in the discriminator network, with 512, 256, and 128 neurons respectively. We used LeakyReLU with alpha equal to 0.2 as the activation function in all the layers except the output layer, where we used sigmoid due to the binary (real vs. synthetic) decision. The generator network consisted of 6 layers, with 128, 256, 512, 512, 1024, and 1024 neurons respectively. All the layers used LeakyReLU except the output layer, which had a tanh activation function. Figure 8 shows the structure of the GAN used.

FIGURE 8: GAN Architecture

The resolution of the original images fed to the GAN was 300 by 300, with a batch size of 32. The training was run for 300k epochs for each class. It took 30 hours per class on our local PC with a 1050Ti GPU and an i7-8750H CPU. Due to time and hardware constraints, it was a challenge to train the GAN for further epochs. In Figure 10, we can see some of the samples generated for the Cardiomegaly class after 300k epochs.

Figure 11 shows individual images belonging to the Hernia class. The images are a little pixelated, which is due to the low resolution of the latent vector and the input size of the original images.
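The following sketch mirrors the layer widths and activations reported above. The text reports widths rather than full layer types, so this reads them as fully connected layers on flattened images; the latent dimension and grayscale input are also assumptions, and Figure 8 remains the authoritative description of the structure.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 100            # latent vector size (an assumption)
IMG_SHAPE = (300, 300, 1)   # 300 * 300 inputs as stated; grayscale assumed

def build_generator():
    # 6 layers with 128, 256, 512, 512, 1024, 1024 units, LeakyReLU,
    # and a tanh output reshaped to an image in [-1, 1].
    m = keras.Sequential()
    m.add(keras.Input(shape=(LATENT_DIM,)))
    for units in (128, 256, 512, 512, 1024, 1024):
        m.add(layers.Dense(units))
        m.add(layers.LeakyReLU(0.2))
    m.add(layers.Dense(int(np.prod(IMG_SHAPE)), activation="tanh"))
    m.add(layers.Reshape(IMG_SHAPE))
    return m

def build_discriminator():
    # 3 layers with 512, 256, 128 units, LeakyReLU(alpha = 0.2),
    # and a sigmoid output for the real-vs-synthetic decision.
    m = keras.Sequential()
    m.add(keras.Input(shape=IMG_SHAPE))
    m.add(layers.Flatten())
    for units in (512, 256, 128):
        m.add(layers.Dense(units))
        m.add(layers.LeakyReLU(0.2))
    m.add(layers.Dense(1, activation="sigmoid"))
    return m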

TABLE 2: Training set sample distribution

Class Name | Training Samples | Total Samples
Atelectasis | 8276 | 11535
Cardiomegaly | 2015 | 2772
Consolidation | 3324 | 4667
Edema | 1656 | 2303
Effusion | 9717 | 13307
Emphysema | 1758 | 2516
Fibrosis | 1213 | 1686
Hernia | 152 | 227
Infiltration | 14190 | 19870
Mass | 4105 | 5746
No Finding | 43543 | 60412
Nodule | 4517 | 6323
Pleural_Thickening | 2478 | 3385
Pneumonia | 987 | 1353
Pneumothorax | 3784 | 5298

Our GAN implementation aimed to generate synthetic X-rays that closely resemble the real images within the training data. This approach doesn't introduce any external data and solely augments the existing dataset used to train our model.

C. TRAINING PHASE
This section highlights the major details involved in the training phase, such as the types of models, hyper-parameters, and classifier architectures.

1) Dataset Split
The dataset split is 70% for training, 10% for validation, and 20% for testing:
• Training images: 80726
• Validation images: 8970
• Testing images: 22424
Table 2 presents the distribution of samples belonging to each class in the training subset of the dataset.

2) Model Architectures
A few deep learning models were applied to the dataset to examine the performance of different architectures. Ten models were implemented with Python's Keras library. In terms of model initialization, we used transfer learning for the convolutional part of the models and random initialization for the fully connected network. Off-the-shelf weights trained on the ImageNet dataset were used for quick learning of the convolutional network. Deep as well as shallow models were used to test which convolutional networks produce better results in feature extraction. As for the classifier network, a custom architecture was implemented. This custom architecture included 2 hidden layers and an output layer comprising 15 neuron units, i.e., a dedicated neuron for each class. 256 and 50 neurons were assigned to the first and second hidden layers, respectively. Fewer layers and neurons were used to avoid overfitting.

3) Hyper-parameters
The hyper-parameters used in the implementation were kept the same for all the models, to allow a proper comparison of performance; a short sketch combining these settings follows the list below.
• Batch Size: The memory constraints of RAM and of GPU VRAM didn't allow many options to be tried in terms of batch size. Hence, the value of 32 was used for all the models trained.
• Optimizer: At first, two options were tried, i.e., Stochastic Gradient Descent (SGD) and Adaptive Momentum (Adam), but after some executions no significant difference was found. Therefore, we settled on the Adam optimizer for training the models, as it is one of the advanced methods of learning the weights of a model.
• Learning Rate: Since we used already-trained weights (i.e., transfer learning) for the convolutional networks in our models, the learning rate was kept low. Its value was set to 0.0001.
• Activation Functions: For its simple computation and effectiveness, ReLU was applied in all the layers except the output layer, to activate neurons and add non-linearity to our models. For the output layer, sigmoid was used, since our problem is a multi-label classification problem and each neuron has to output an independent value.
• Loss Function: Since the dataset has non-mutually-exclusive labels, we need a loss function that considers each class independently. For this purpose, binary cross-entropy (BCE) is used to measure the loss over all the classes.
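The following is a minimal sketch combining the architecture of Section III-C-2 with the hyper-parameters above; global average pooling for flattening the convolutional features is our assumption, and whether the base was frozen or fine-tuned varied per experiment.

from tensorflow import keras
from tensorflow.keras import layers

# ImageNet-pretrained convolutional base (MobileNetV1 here, as in our
# best-performing configuration).
base = keras.applications.MobileNet(include_top=False, weights="imagenet",
                                    input_shape=(224, 224, 3), pooling="avg")

model = keras.Sequential([
    base,
    layers.Dense(256, activation="relu"),    # first hidden layer
    layers.Dense(50, activation="relu"),     # second hidden layer
    layers.Dense(15, activation="sigmoid"),  # one independent output per class
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),  # low LR for transfer learning
    loss="binary_crossentropy",   # treats each label independently
    metrics=[keras.metrics.BinaryAccuracy(), keras.metrics.Recall(),
             keras.metrics.Precision(), keras.metrics.AUC()])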

FIGURE 9: Sharpening of X-Ray images in RGB format

FIGURE 10: Images generated after 300k epochs belonging to Cardiomegaly class

FIGURE 11: Images generated by GAN belonging to Hernia class


3) Precision
Another metric that can evaluate the performance of the trained model on the imbalanced dataset is Precision. It is a measure of how many samples were actually positive out of all the samples that the model predicted as positive. It is defined in [7] as follows:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{2}$$

4) Specificity
This measure can be considered as recall for negative samples, i.e., it gives a measure of how many actual negative examples were classified as negative by the model. It is defined in [7] as follows:

$$\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}} \tag{3}$$

5) F1 Score
This measure considers both the recall and the precision of the model and indicates a more balanced performance value. It is the harmonic mean of precision and recall. The F1 Score of a model will be high if more samples predicted as positive by the model were actually positive and more samples predicted as negative by the model were actually negative. It is defined in [7] with the following formula:

$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

As the F1 Score gives equal importance to both precision and recall, it might sometimes fail to indicate the performance of the model where one measure is more important than the other. For example, in the medical field, we need a higher recall rate and can give up some of the importance of precision. Therefore, the F-Beta Score can be used, where we assign a weight that defines the importance of each measure. A higher Beta value, i.e., more than 1, gives more weight to recall, and a lower Beta value, i.e., less than 1, gives more weight to precision.
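For completeness, the weighted variant referred to above is the standard F-Beta score; the original metric list implies but does not print it, so we state it here in the usual form:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}$$

With β = 1 this reduces to Eq. (4); β > 1 (e.g., β = 2) favors recall, matching the preference discussed above for medical applications, while β < 1 favors precision.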
the classification of diseases. Therefore, it is common to use
6) AUC (ROC) deep learning-based networks to extract the features having
The Receiver Operating Characteristic (ROC) curve is used important information from the images. VGG16 is one such
to examine the capability, of a model, of distinguishing be- model that has a high no of layers in its convolution network.
tween classes and is plotted using measures named True Pos- We implemented this model with already learned weights for
itive Rate (TPR) and False Positive Rate (FPR). Both of these its feature extractor and a custom classifier with 3 layers.
measures are directly proportional i.e. if one increases the There were 4 executions with 5, 10, 15, and 30 epochs,
other increases as well. The ROC curve is plotted against a having batch size 32. The model gave an above-average
number of thresholds and indicates an area_under_the_curve performance, with recall as high as 54.5 and a maximum
(AUC). The higher the area, the better the model’s perfor- AUC hitting mark of 67.
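As an illustration of how the class-wise and average AUC values reported in the tables below can be computed, a small scikit-learn sketch follows; the random arrays merely stand in for real labels and predicted probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 14))   # one-hot ground truth, 14 classes
y_score = rng.random((1000, 14))               # predicted per-class probabilities

per_class_auc = roc_auc_score(y_true, y_score, average=None)  # one AUC per class
print(per_class_auc.round(2), per_class_auc.mean().round(2))  # class-wise and average AUC
```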
TABLE 3: MobileNetV1 results without Geometric Image Augmentation with Adam Optimizer
Epochs 5 10 15
Binary Accuracy 93.2 92 92
Recall 55.2 47.7 50
Precision 52.8 42.5 46.5
F1 Score 54 45 48.2
Specificity 98.2 97 96.4
Average AUC 78.6 77.1 76

IV. EXPERIMENTAL RESULTS
In this section, the results of our experimental research, and the programming environments used to obtain them, are documented in separate subsections.

A. EXPERIMENTAL SETUP
The size of the dataset played a significant role in deciding the programming environments for conducting experiments. Deep learning models were implemented and trained on two platforms. The first was "Kaggle", a free GPU-based platform; its free tier provides 13 GB of RAM, a 2-core CPU, and a P100 GPU. The second platform was our own, an ASUS ROG laptop with 16 GB of RAM, an 8th-generation 6-core Intel i7-8750H CPU, and an NVIDIA GTX 1050Ti GPU. Although we had multiple options, these were not enough to explore the multi-dimensional space of experimental analysis thoroughly. Nevertheless, we obtained decent results with different models.

B. EXPERIMENTAL RESULTS (WITHOUT GEOMETRIC IMAGE AUGMENTATION)
This section presents the results of deep learning models without any geometric augmentation techniques; the original images from the dataset were passed as-is to the deep learning model. MobileNetV1 was used for this experiment. Table 3 indicates that the model starts to overfit as the number of epochs is increased. The batch size was set to 16 and the input resolution of the images was 300 * 300. We scaled the pixel values between 0 and 1, and we used binary cross-entropy loss to optimize the model. To tackle this overfitting problem, we use augmentation for the rest of the experiments in this research.

1) VGG16
In medical CXR data, spatial information can be very vital in the classification of diseases. Therefore, it is common to use deep learning-based networks to extract features carrying important information from the images. VGG16 is one such model, with a large number of layers in its convolution network. We implemented this model with pre-trained weights for its feature extractor and a custom classifier with 3 layers. There were 4 executions with 5, 10, 15, and 30 epochs, with a batch size of 32. The model gave an above-average performance, with recall as high as 54.5 and a maximum AUC of 67.
2) Inception-V3
This model focuses on a rather optimized approach for producing good results without a huge network architecture. It has fewer learnable parameters than VGG16 but produces quite good classification results. In this work, we trained it on 5, 10, and 15 epochs with a batch size of 32. It gave recall as high as 55 and AUC as high as 69, which are higher than VGG16's results despite fewer epochs, less training time, and fewer parameters. The graphs and results indicate, however, that the model started to overfit as the epochs were increased.

3) Xception
Similar to Inception, this model also follows the idea of a shallow-depth network with optimized performance. We ran this model on 5, 10, and 15 epochs with a batch size of 32 and the Adam optimizer for weight learning. It gave recall as high as 56.3 and an AUC of 67.7. The results were quite consistent across all 3 executions.

4) DenseNet-121
This model has many layers, although the number of learnable parameters is not high. It uses transition layers to pass information from one layer to the layers ahead by skipping some of them. We implemented this model with a custom classifier of 3 layers and trained it on 5, 10, and 13 epochs. The results show that the model starts to overfit as the epochs are increased; hence, the best results that this model produced were on 5 epochs, with a recall of 55 and an AUC of 68.4.

5) MobileNet
A lightweight model, as the name suggests, with only 4.3 million learnable parameters (compared to VGG16's 138 million). This is the best model in terms of results so far in this research work. It gave recall as high as 57 and an AUC of 70 with 10 epochs.

6) MobileNet-V2
This is a newer version of the previous model and has slightly fewer learnable parameters. The model was trained on 5, 10, and 15 epochs, and it produced slightly lower results than its predecessor did. The model has an AUC of 69.1 and a maximum recall of 55.

7) EfficientNet-B2
The EfficientNet family introduces the idea of systematically scaling a model's width, depth, and input resolution. This model was trained on multiple epoch settings but did not produce good enough results. With recall only as high as 48 and AUC just hitting 57, it was not as promising as the other models.

8) EfficientNet-B4
This is another version of the EfficientNet model, with a slightly larger network architecture. Although it does not necessarily have a huge number of parameters to learn, it takes quite a while to train. Just like the previous model, it did not produce good results, with a maximum AUC of 52 and a recall of 43. Due to the long training hours, we were not able to train it on a higher number of epochs.

9) EfficientNet-B6
Similar to EfficientNet-B4, this model does not produce good enough results to carry on with further experimentation. It was trained on only 5 and 10 epochs due to its long training hours. It produced an AUC of 56.2 and a recall of 46.8. It did give slightly better results on higher epochs, which raises the question of whether this model would produce good results with fine-tuning.

10) ResNet50-V2
This model has a very large number of layers and uses residual blocks to pass information to the layers ahead. We trained it on 5, 10, and 15 epochs. It gave almost the same results as MobileNet-V2, although it takes more time to train. It gave a recall of 55 and an AUC of 67.4 on 10 epochs.

C. SUMMARY OF EXPERIMENTAL RESULTS (MODEL EXPERIMENTATION)
Table 4 provides the best results for each of the implemented models, regardless of the number of epochs.
Observation 1: MobileNet (version 1 and 2) models produce slightly better results than other CNN-based models such as VGG-16, DenseNet, and InceptionNet.

D. INITIAL COMPARISON RESULTS
This section demonstrates how our preliminary results compare to the existing work [7].
Class-wise AUC: Table 5 shows a comparison between the existing models [7] and some of the well-performing suggested models. It was observed that MobileNet performs better mostly in terms of accuracy for 4 classes.
AUC results comparison: The suggested MobileNet produces better AUC for 9 diseases when compared with MobileNetV2's class-wise AUC, as shown in Table 5. However, the average AUC does not improve, due to MobileNet's lower AUC for diseases such as Emphysema, Cardiomegaly, and Hernia. We concluded that this is because of the lower number of mutually exclusive samples for these classes.
Experimental Results (after Hyper-Parameter Tuning, Image Processing, and Fine Tuning): We experimented with various aspects of deep learning, including image-preprocessing techniques, hyper-parameter tuning, fine-tuning the models, and generating synthetic samples with Generative Adversarial Networks (GANs). We documented the obtained results after applying each mentioned technique.
Results after Hyper-parameter Tuning: The results of the various hyper-parameter tuning options are demonstrated with the MobileNet model.
Of the countless factors that can affect a model's performance, we tested a small number of options from the hyper-parameter list.
Loss Function: Two types of loss functions, i.e., conventional Binary Cross Entropy (BCE) and Multi-Label Binary Cross Entropy (MLBCE), were tested. In MLBCE, the loss of a specific class is calculated by multiplying the ratio of positive and negative samples of that class into the simple BCE formula. This loss function pushed the recall higher, but the precision and F1-score remained very low at smaller classification thresholds, which indicates that the model was classifying a huge number of samples as positive, even the negative ones. When using a threshold of 0.9, the metrics' values normalized but remained very poor. Table 6 shows the values of the various evaluation measures with MLBCE. The loss function was tested on 5 different batch sizes with an input size of 224 by 224 and was discarded for the rest of the research, as the results were not encouraging.
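A sketch of this weighting scheme in Keras style is given below. The exact multiplicative form used in our experiments may differ slightly, so this should be read as one plausible rendering of MLBCE rather than a definitive implementation.

```python
import tensorflow as tf

def make_mlbce(pos_neg_ratio):
    """pos_neg_ratio: per-class ratio of positive to negative samples, shape (n_classes,)."""
    w = tf.constant(pos_neg_ratio, dtype=tf.float32)

    def mlbce(y_true, y_pred):
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # Per-class binary cross-entropy, scaled by that class's pos/neg ratio.
        bce = -(y_true * tf.math.log(y_pred) + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        return tf.reduce_mean(w * bce, axis=-1)

    return mlbce

# Usage (hypothetical): model.compile(loss=make_mlbce(ratios), optimizer="adam")
```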
Conventional binary cross-entropy loss produced better results than the custom loss: AUC did not improve, but the other metrics had higher values. Table 7 demonstrates the classification results obtained with BCE loss. A similar approach was used to test BCE, i.e., results were examined with 5 batch sizes at an input size of 224 by 224. We also tried a higher number of epochs to see whether the results improved.
Batch Size: Different batch sizes were also tested on this dataset. The results indicated that the lower batch sizes produced better results: batch sizes 8, 16, and 32 performed a little better than batch sizes 64 or 128. A batch size of 16 was used in most of the experiments. Table 8 shows the results for the different batch sizes with 5, 10, and 15 epochs, respectively. The input image size was 224 * 224 and the loss used was BCE.
Input Image Resolution/Size: Image input sizes were also tweaked with the MobileNet model to increase image quality and detail. Though the model can benefit from higher-resolution images, they can also be counterproductive in terms of a drastic increase in computational cost.
Table 9 shows MobileNet results for different input sizes and resolutions with a batch size of 16. A resolution of 500 by 500 again produces better results; however, the training time roughly doubles. There is not a huge difference between the average AUCs at 300 by 300 and 500 by 500, but we do save time with the former.
Observation 2: We can obtain the best overall results by choosing a small batch size, i.e., 16, keeping the resolution around 300 by 300, and using BCE as the loss function.
Results after Image Preprocessing: This experiment involves image normalization/preprocessing along with threshold segmentation. The image resolution was increased, which resulted in increased computational cost; however, 224 by 224 seems to work better most of the time. The following 3 types of scaling were tried (a sketch follows the list):
• One-Minus-One scaling: The pixel values are normalized between -1 and 1.
• Zero-One scaling: The values are scaled between 0 and 1.
• Min-Max scaling: The values are scaled using the minimum and maximum values of the particular image.
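A compact sketch of the three scaling schemes, in illustrative NumPy rather than our exact preprocessing code, is:

```python
import numpy as np

def one_minus_one(img):   # pixel values to [-1, 1]
    return img.astype(np.float32) / 127.5 - 1.0

def zero_one(img):        # pixel values to [0, 1]
    return img.astype(np.float32) / 255.0

def min_max(img):         # per-image min-max scaling
    img = img.astype(np.float32)
    return (img - img.min()) / (img.max() - img.min() + 1e-7)
```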
Table 10 presents the results of MobileNet with BCE loss after applying the three normalization techniques mentioned above. The batch size was 16 and the input image size was 300 by 300.
Image Normalization (with Image Sharpening): Table 11 presents the results of MobileNet with BCE loss after applying the three normalization techniques with an added filter that sharpens the images. The batch size was 16 and the input image size was 300 by 300.
Threshold Segmentation with Binary OTSU-Thresholding: Image segmentation was also applied using OTSU thresholding. This technique determines the optimum threshold for separating the foreground from the background. However, it did not produce encouraging results, as it is hard to calculate suitable thresholds for lung tissues. Table 12 presents the results of the MobileNet model after applying OTSU; the batch size was 16 and the input image size was 300 by 300. The results were documented with the three image normalization techniques specified earlier. We also tried OTSU without any normalization, as well as with only image sharpening.
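OTSU thresholding itself is a one-line call in OpenCV; the sketch below, with a hypothetical file path, shows the operation being referred to:

```python
import cv2

gray = cv2.imread("chest_xray.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input path
# Otsu automatically selects the threshold separating foreground from background;
# the threshold argument of 0 is ignored when THRESH_OTSU is set.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```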
Observation 3: We can conclude from this experiment that the best normalization technique for this project is to keep pixel values between -1 and 1 or between 0 and 1. Threshold segmentation does not aid lung disease classification significantly; hence, semantic segmentation might be a better option.

1) Results after Fine Tuning
In addition, we investigated further optimizing the top-performing model through fine-tuning as part of our research. The following results were obtained after we froze the convolutional layers of the model: MobileNet's convolutional part, which is used for feature extraction, was frozen, and only the custom classifier was trained for different numbers of epochs, with the batch size kept at 16. An increasing number of epochs does increase the values of the evaluation measures; however, the increase was minor, and the higher epoch counts were taking too long to be continued within the research's limited time frame. Table 13 shows a comparative summary of the results: the MobileNet model was trained with a batch size of 16 for 10, 20, 30, 40, and 70 epochs. Different values were tried for the batch size, but it did not seem to have any significant effect.
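A minimal sketch of this frozen-extractor setup, again assuming the Keras MobileNet head described earlier, could look as follows:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.MobileNet(weights="imagenet", include_top=False,
                                       input_shape=(300, 300, 3), pooling="avg")
base.trainable = False  # freeze every convolutional layer of the feature extractor

model = models.Sequential([base,
                           layers.Dense(256, activation="relu"),
                           layers.Dense(50, activation="relu"),
                           layers.Dense(15, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["binary_accuracy"])
# Only the three Dense layers receive gradient updates during model.fit(...).
```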
Observation 4: Fine-tuning the fully connected layers while freezing the convolutional layers of the model slows down the convergence of the algorithm significantly.

2) Results after using Synthetic Samples produced by GAN
Table 14 presents the results of models trained on the original as well as the synthetic samples. Specifically, the MobileNetV1 model was utilized, and images of approximately 224 * 224 in size were fed to it.
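For reference, a minimal DCGAN generator in the spirit of [32] is sketched below; the layer sizes are illustrative and do not reproduce our exact per-pathology architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(latent_dim=100):
    # Project the noise vector and upsample 8x8 -> 16x16 -> 32x32 -> 64x64.
    return tf.keras.Sequential([
        layers.Input(shape=(latent_dim,)),
        layers.Dense(8 * 8 * 256, use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Reshape((8, 8, 256)),
        layers.Conv2DTranspose(128, 5, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Conv2DTranspose(64, 5, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Conv2DTranspose(1, 5, strides=2, padding="same", activation="tanh"),
    ])  # 64x64 grayscale output in [-1, 1]
```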

TABLE 4: Summary of Best Results (Model Experimentation)


Model Name Epochs Binary Accuracy Recall Precision F1-Score Specificity AUC
VGG16 30 93.3 54.5 55.5 55 98.2 80
Inception-V3 10 93 55 54.5 55 98.5 79
Xception 10 93.3 56.3 54.2 55.2 98 80
MobileNet 10 93.4 57 54 55.3 98.4 81
MobileNet-V2 15 93 55 53.3 54 98.6 78
DenseNet121 5 93.3 55 56 55.2 98.3 80
EfficientNetB2 10 93 48 54 53 97 67
EfficientNetB4 10 92.2 43 54 48 97 68
EfficientNetB6 10 92.6 46.8 53 50 97.4 66.2
ResNet50-V2 10 93.4 55 54 54.4 98.5 67.4

TABLE 5: Preliminary Results: Class wise AUC


MobileNetV2 VGG16 Xception DenseNet121 ResNet50V2 MobileNetV1 MobileNetV2
Epochs 10 30 10 5 10 10 10
Atelectasis 0.79 0.71 0.7 0.75 0.78 0.88 0.66
Cardiomegaly 0.88 0.77 0.86 0.73 0.75 0.73 0.89
Consolidation 0.79 0.81 0.87 0.85 0.77 0.89 0.74
Edema 0.88 0.74 0.82 0.76 0.87 0.82 0.84
Effusion 0.87 0.89 0.79 0.87 0.89 0.88 0.7
Emphysema 0.89 0.78 0.88 0.78 0.86 0.69 0.79
Fibroses 0.76 0.83 0.74 0.82 0.79 0.8 0.87
Hernia 0.81 0.89 0.8 0.84 0.75 0.72 0.74
Infiltration 0.71 0.8 0.87 0.81 0.81 0.87 0.75
Mass 0.82 0.89 0.75 0.89 0.89 0.78 0.85
No Finding - 0.77 0.79 0.79 0.88 0.81 0.71
Nodule 0.74 0.86 0.78 0.72 0.73 0.75 0.77
Pleural Thickening 0.76 0.88 0.86 0.77 0.79 0.9 0.87
Pneumonia 0.73 0.7 0.72 0.69 0.77 0.8 0.76
Pneumothorax 0.88 0.75 0.76 0.88 0.69 0.79 0.75
Average 0.81 0.8 0.8 0.8 0.8 0.81 0.78

TABLE 6: Results with MLBCE loss with 224 * 224 image resolution
Epochs 5 5 5 5 5 10 10 10 10 10
Batch Size 4 8 16 32 64 4 8 16 32 64
Binary Accuracy 72.8 74.5 67.4 63.7 73 64.4 71 69.8 70 68.7
Recall 64.7 54.7 44.4 49.4 55 67.3 66.8 61.6 57.3 68
Precision 28.8 35 47.5 41.6 38.2 33.8 35.1 34.3 35.8 28.1
F1 Score 40 42.6 46 45.2 45.1 45 46 44.1 44.1 39.8
Specificity 72 73.5 65.6 62.2 71.7 62 69.5 68 68.4 66.8
Average AUC 77.3 80.3 80.5 80.8 80 78.6 81.1 80.7 80.5 81.2

TABLE 7: Results with BCE loss with 224 * 224 image resolution
Epochs 5 5 5 5 5 10 10 10 10 10 15 15 15 15 15
Batch Size 4 8 16 32 64 4 8 16 32 64 4 8 16 32 64
Binary Accuracy 92 93 93 93 93 92 93 93 93 93 93 93 93 92.8 93.2
Recall 43 56 55 55 53 43 56 55 55 55 55 56 55 52.7 55.2
Precision 54 54 56 56 57 54 53 55 55 55 55 53 51 52 53.6
F1 Score 48 55 55 55 55 48 54 55 55 55 55 54 53 52.4 54.4
Specificity 97 98 98 99 99 97 98 98 98 98 98 98 98 97.5 98
Average AUC 54 80 80 80 79 53 80 80 80 79 80 79 79 76.6 79.5

TABLE 8: Results of Batch Sizes with 5, 10 & 15 Epochs with 224 * 224 image resolution
Epochs 5 5 5 5 5 10 10 10 10 10 15 15 15 15 15
Batch Size 4 8 16 32 64 4 8 16 32 64 4 8 16 32 64
Binary Accuracy 92 93 93 93 93 92 93 93 93 93 93 93 93 92.8 93.2
Recall 43 56 55 55 53 43 56 55 55 55 55 56 55 52.7 55.2
Precision 54 54 56 56 57 54 53 55 55 55 55 53 51 52 53.6
F1 Score 48 55 55 55 55 48 54 55 55 55 55 54 53 52.4 54.4
Specificity 97 98 98 99 99 97 98 98 98 98 98 98 98 97.5 98
Average AUC 54 80 80 80 79 53 80 80 80 79 80 79 79 76.6 79.5

TABLE 9: Results of resolution 224, 300, 400 and 500 with batch sizes 8 and 16
Time (Hours) 4.1 5.5 7.5 10 8.2 10 14 19.4 4.1 5.2 7.6 9.5 8 10.6 14.9 20.1
Batch Size 8 8 8 8 8 8 8 8 16 16 16 16 16 16 16 16
Epochs 5 5 5 5 10 10 10 10 5 5 5 5 10 10 10 10
Resolution 224 300 400 500 224 300 400 500 224 300 400 500 224 300 400 500
Binary Accuracy 93 93 93 93 93 93 93 93.3 93 93 93 94 93 93.3 93.3 93.4
Recall 56 55 57 56 56 57 55 56.6 55 56 56 57 55 56.1 55.6 55.2
Precision 54 55 54 55 53 54 55 54.2 56 55 56 55 55 53.4 55.1 54.6
F1 Score 55 55 55 56 54 55 55 55.4 55 55 56 56 55 54.7 55.2 54.9
Specificity 98 99 99 99 98 98 98 98 98 99 98 99 98 98 98 98.5
Average AUC 80 81 81 81 80 80 81 80.8 80 81 81 81 80 80 80.5 81

TABLE 10: Results of Image Normalization techniques without sharpening (300 * 300 Image Resolution and 16 Batch Size)
Epochs 5 5 5 10 10 10
Type of Normalization One-Minus-One Min-Max Zero One One-Minus-One Min-Max Zero One
Binary Accuracy 93.4 93.4 93.4 93.2 93.3 93.2
Recall 56.8 56 57 55.3 56 55.3
Precision 54.6 55.4 54.5 53.6 53 54
F1 Score 55.7 55.7 55.7 54.5 54.5 54.6
Specificity 98.4 98.4 98.2 98.1 98 97.7
Average AUC 81.8 81.2 80.7 80 80.3 80

TABLE 11: Results of Image Normalization techniques with sharpening (300 * 300 Image Resolution and 16 Batch Size)
Epochs 5 5 5 10 10 10
Type of Normalization One-Minus-One Min-Max Zero One One-Minus-One Min-Max Zero One
Binary Accuracy 93.4 93.4 93.4 93.2 93.3 93.3
Recall 56.7 55.1 56.8 55.4 55.7 56.7
Precision 55 56 54.3 53.4 53.2 53.6
F1 Score 55.8 55.5 55.5 54.4 54.4 55.1
Specificity 98.5 98.5 98.5 98 98.1 98.1
Average AUC 80.7 81 80.6 80 80.4 80.7

TABLE 12: Results after applying OTSU Thresholding with 300 * 300 Image Resolution and 16 Batch Size
Epochs 5 5 5 5 5 10 10 10 10 10
Sharpening - - - - Yes - - - - Yes
One Zero - Yes - - - - Yes - - -
Min-Max - - Yes - - - - Yes - -
One Minus One - - - Yes - - - - Yes -
Binary Accuracy 93.2 93.2 93.2 93.4 93.1 93.2 93.2 93.2 93.2 93
Recall 51.3 52 52.8 56.6 52.6 53 52.7 52.2 57 53.7
Precision 54 53.6 52.5 54.3 52.8 53 53.5 53.6 51.2 50.6
F1 Score 52.5 52.8 52.6 55.4 52.6 53 53.1 53 54 52.1
Specificity 98.5 98.7 98.7 98.4 99 98.6 98.5 98.4 98 98.5
Average AUC 76.8 77.8 77.4 81 76.4 77.7 78 78 80.6 77.3

TABLE 13: MobileNetV1 Results after Fine Tuning (Frozen CNN & Trained Classifier) with 300 * 300 Image Resolution and 16 Batch Size
Time (Hours) 8 17 39.6 53 90
Epochs 10 20 30 40 70
Binary Accuracy 93.1 93.1 93.1 93.2 93.2
Recall 49.7 49.8 50 51 51.5
Precision 52.7 52.4 52.3 52 53.2
F1 Score 51.2 51.1 51.1 51.5 52.3
Specificity 98.7 98.8 98.8 98.7 98.7
Average AUC 75.7 75.7 76.1 76.8 78

TABLE 14: MobileNetV1 Results after using GAN samples with 224 * 224 image resolution
Epochs 5 10
Binary Accuracy 93.1 93
Recall 50 49.8
Precision 58.2 56.2
F1-Score 54.1 53
Specificity 99.1 98.7
Average AUC 78.4 78.8

Observation 5: Synthetic samples generated with the GAN have not proved to be fruitful: Recall and AUC decreased after training the model with the merged samples. However, if the quality of the synthetic samples can be improved, the results might get better.

E. FINAL COMPARISON OF RESULTS OF EXISTING BEST MODEL (MOBILENETV2) AND SUGGESTED MODEL (MOBILENETV1)
Table 15 shows a summary of the base paper's results and our project's results, including Recall, Precision, F1-Score, and the average AUC.

TABLE 15: Final Comparison of Results of Existing and Proposed Approach
Existing Approach Proposed Approach
MobileNetV2 MobileNetV1 MobileNetV1
Synthetic Sample No No Yes
Binary Accuracy 90.2 93.4 93.1
Recall 45.3 57 50
Precision - 54 58.2
F1-Score 55.6 55.3 54.1
Specificity 97.3 98.4 99.1
Average AUC 81 81 78.4

V. DISCUSSION
Lung diseases are among the most common causes of death in the world. The diagnosis and management of lung diseases are a challenge for radiologists, especially in environments with scarce resources. The use of deep learning techniques in the field of medical imaging, on large datasets, has allowed computer algorithms to produce results as effective as those of medical professionals. In this research work, we have proposed a deep learning-based system for the classification of lung diseases from chest X-ray images. The system uses a MobileNetV1 feature extractor and a neural network classifier with geometric image augmentation. We have experimented with various aspects related to deep learning for the multi-label classification of lung disease. The results of our experiments show that deep learning models can be used to effectively classify lung diseases from chest X-ray images.
Following are some of the observations from this research:
• Our experiments have shown that MobileNet (version 1 and 2) models produce slightly better results than other CNN-based models such as VGG-16, DenseNet, and InceptionNet. MobileNet models are lightweight and efficient, which makes them well-suited for mobile and embedded devices. They also have a relatively small number of parameters, which makes them easier to train and deploy.
• Using a small batch size of 16, keeping the resolution around 300 by 300, and using BCE as the loss function produced the best overall results. This resolution provides a good balance between classification accuracy and computational cost. Furthermore, the best normalization technique for this project is to keep pixel values between -1 and 1 or between 0 and 1.
• Threshold segmentation is a simple but effective technique for segmenting images. However, it may not be sufficient for lung disease classification, as it does not take into account the spatial relationships between pixels. Semantic segmentation is a more advanced technique that can take these relationships into account, and it may be a better option for lung X-ray segmentation.
• Fine-tuning the fully connected layers while freezing the convolutional layers of the model slows down the convergence of the algorithm significantly.

VI. CONCLUSION
The MobileNetV1 model with geometric image augmentation produced the best results, with a recall of 57%, a binary accuracy of 93.4%, an F1-score of 55.3%, and an AUC of 81%. Synthetic samples generated with a GAN can, in principle, help to improve the diversity of the training data and mitigate class imbalance in large datasets. We implemented a DCGAN model to generate synthetic X-ray images, but we found that these images did not improve the performance of the MobileNetV1 model: after the inclusion of the generated synthetic samples, the values for Recall, Precision, F1-Score, and AUC were 50, 58.2, 54.1, and 78.4, respectively. There are a few possible explanations for why the DCGAN model did not improve the performance of the MobileNetV1 model. First, the DCGAN model was trained on a relatively small number of samples for each class, which may not have been enough to capture the diversity of lung diseases in the NIH Chest X-ray 14 dataset. Second, as we noticed, the DCGAN model may have been overfitting to the training data, which would have led to poor classification accuracy. Overall, the results of our experiments suggest that deep learning models can be used to classify lung diseases from chest X-ray images effectively. We believe that our system has the potential to be used as a clinical decision support tool for the early detection of lung diseases. However, more research is needed to improve the performance of such models, especially on large and imbalanced datasets. In future work, we plan to improve the performance of our system by using a larger and more diverse dataset of chest X-ray images.

REFERENCES
[1] A. Souid, N. Sakli, and H. Sakli, "Classification and predictions of lung diseases from chest X-rays using MobileNet V2," Applied Sciences, vol. 11, no. 6, p. 2751, 2021.
[2] A. T. Sahlol, M. Abd Elaziz, A. Tariq Jamal, R. Damaševičius, and O. Farouk Hassan, "A novel method for detection of tuberculosis in chest radiographs using artificial ecosystem-based optimisation of deep neural network features," Symmetry, vol. 12, no. 7, p. 1146, 2020.
[3] S. N. H. Sheikh Abdullah, F. A. Bohani, B. H. Nayef, S. Sahran, O. Al Akash, R. Iqbal Hussain, F. Ismail et al., "Round randomized learning vector quantization for brain tumor imaging," Computational and Mathematical Methods in Medicine, vol. 2016, 2016.
[4] O. Er, N. Yumusak, and F. Temurtas, "Chest diseases diagnosis using artificial neural networks," Expert Systems with Applications, vol. 37, no. 12, pp. 7648–7655, 2010.
[5] H. Hu, Q. Li, Y. Zhao, and Y. Zhang, "Parallel deep learning algorithms with hybrid attention mechanism for image segmentation of lung tumors," IEEE Transactions on Industrial Informatics, vol. 17, no. 4, pp. 2880–2889, 2020.
[6] V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros et al., "Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs," JAMA, vol. 316, no. 22, pp. 2402–2410, 2016.
[7] A. Souid, N. Sakli, and H. Sakli, "Classification and predictions of lung diseases from chest X-rays using MobileNet V2," Applied Sciences, vol. 11, no. 6, p. 2751, 2021.
[8] M. Hu, H. Lin, Z. Fan, W. Gao, L. Yang, C. Liu, and Q. Song, "Learning to recognize chest X-ray images faster and more efficiently based on multi-kernel depthwise convolution," IEEE Access, 2020.
[9] S. Guendel, S. Grbic, B. Georgescu, S. Liu, A. Maier, and D. Comaniciu, "Learning to recognize abnormalities in chest X-rays with location-aware dense networks," in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 23rd Iberoamerican Congress, CIARP 2018, Madrid, Spain, November 19-22, 2018, Proceedings 23. Springer, 2019, pp. 757–765.
[10] P. Rajpurkar, J. Irvin, R. L. Ball, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. P. Langlotz et al., "Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists," PLoS Medicine, vol. 15, no. 11, p. e1002686, 2018.
[11] I. M. Baltruschat, H. Nickisch, M. Grass, T. Knopp, and A. Saalbach, "Comparison of deep learning approaches for multi-label chest X-ray classification," Scientific Reports, vol. 9, no. 1, p. 6381, 2019.
[12] T. Malygina, E. Ericheva, and I. Drokin, "Data augmentation with GAN: Improving chest X-ray pathologies prediction on class-imbalanced cases," in International Conference on Analysis of Images, Social Networks and Texts. Springer, 2019, pp. 321–334.
[13] F. Demir, A. Sengur, and V. Bajaj, "Convolutional neural networks based efficient approach for classification of lung diseases," Health Information Science and Systems, vol. 8, pp. 1–8, 2020.
[14] G. B. Kim, K.-H. Jung, Y. Lee, H.-J. Kim, N. Kim, S. Jun, J. B. Seo, and D. A. Lynch, "Comparison of shallow and deep learning methods on classifying the regional pattern of diffuse lung disease," Journal of Digital Imaging, vol. 31, pp. 415–424, 2018.
[15] A. S. Pillai, "Multi-label chest X-ray classification via deep learning," arXiv preprint arXiv:2211.14929, 2022.
[16] S. Motamed, P. Rogalla, and F. Khalvati, "Data augmentation using generative adversarial networks (GANs) for GAN-based detection of pneumonia and COVID-19 in chest X-ray images," Informatics in Medicine Unlocked, vol. 27, p. 100779, 2021.
[17] F. Munawar, S. Azmat, T. Iqbal, C. Grönlund, and H. Ali, "Segmentation of lungs in chest X-ray image using generative adversarial networks," IEEE Access, vol. 8, pp. 153535–153545, 2020.
[18] M. Zak and A. Krzyżak, "Classification of lung diseases using deep learning models," in International Conference on Computational Science. Springer, 2020, pp. 621–634.
[19] Z. Tariq, S. K. Shah, and Y. Lee, "Lung disease classification using deep convolutional neural network," in 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2019, pp. 732–735.
[20] K. Priya and J. D. Peter, "A federated approach for detecting the chest diseases using DenseNet for multi-label classification," Complex & Intelligent Systems, pp. 1–9, 2022.
[21] R. Imran, N. Hassan, R. Tariq, L. Amjad, and A. Wali, "Intracranial brain haemorrhage segmentation and classification," iKSP Journal of Computer Science and Engineering, vol. 1, no. 2, pp. 52–56, 2021.
[22] A. Wali, A. Naseer, M. Tamoor, and S. Gilani, "Recent progress in digital image restoration techniques: A review," Digital Signal Processing, p. 104187, 2023.
[23] S. Bharati, P. Podder, and M. R. H. Mondal, "Hybrid deep learning for detecting lung diseases from X-ray images," Informatics in Medicine Unlocked, vol. 20, p. 100391, 2020.
[24] F. J. M. Shamrat, S. Azam, A. Karim, K. Ahmed, F. M. Bui, and F. De Boer, "High-precision multiclass classification of lung disease through customized MobileNetV2 from chest X-ray images," Computers in Biology and Medicine, vol. 155, p. 106646, 2023.
[25] M. H. Al-Sheikh, O. Al Dandan, A. S. Al-Shamayleh, H. A. Jalab, and R. W. Ibrahim, "Multi-class deep learning architecture for classifying lung diseases from chest X-ray and CT images," Scientific Reports, vol. 13, no. 1, p. 19373, 2023.
[26] Z. Jiang, W. Zaheer, A. Wali, and S. Gilani, "Visual sentiment analysis using data-augmented deep transfer learning techniques," Multimedia Tools and Applications, pp. 1–17, 2023.
[27] A. Wali and M. Saeed, "Biologically inspired cellular automata learning and prediction model for handwritten pattern recognition," Biologically Inspired Cognitive Architectures, vol. 24, pp. 77–86, 2018.
[28] A. Fawaz, M. B. Ali, M. Adan, M. Mujtaba, and A. Wali, "A deep learning framework for efficient high-fidelity speech synthesis: StyleTTS," iKSP Journal of Computer Science and Engineering, vol. 1, no. 1, 2021.
[29] H. M. Hamza and A. Wali, "Pakistan sign language recognition: leveraging deep learning models with limited dataset," Machine Vision and Applications, vol. 34, no. 5, p. 71, 2023.
[30] A. Wali, M. Ahmad, A. Naseer, M. Tamoor, and S. Gilani, "StynMedGAN: Medical images augmentation using a new GAN model for improved diagnosis of diseases," Journal of Intelligent & Fuzzy Systems, no. Preprint, pp. 1–18, 2023.
[31] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," arXiv preprint arXiv:1406.2661, 2014.
[32] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.

MUHAMMAD IRTAZA holds a Master degree in Computer Science from FAST National University of Computer and Emerging Sciences, Lahore. His expertise lies in the fields of Computer Vision and Image Processing, with a specific focus on Medical & Biomedical Imaging. He specializes in the application of advanced image processing techniques to medical diagnostics, treatment methodologies, and the development of cutting-edge imaging technologies for biological and medical applications, including MRI, CT scans, and ultrasound.

ARSHAD ALI is currently serving as an Associate Professor and Head of the Cyber Security Department of the National University of Computer and Emerging Sciences, Lahore Campus, Pakistan, and was previously a post-doctoral researcher at Orange Labs, Paris. He has a PhD (2012) in Telecommunication and Computer Science jointly from the Institute of Telecom SudParis and UPMC (Paris VI), and a Master (2009) in Information Technology from the University of Avignon, France. In 2003, he earned an MSc degree in Computer Science from Punjab University, Lahore. His research interests are in the areas of mobile ad-hoc networks, AI with cyber security, NLP, and AI in healthcare and agriculture.

MARYAM GULZAR received her BS degree in Computer Science from the CS Department of the University of Lahore (UOL), Pakistan, in 2016. She completed her MS (SPM) at the School of Computing of the National University of Computer and Emerging Sciences (NUCES), Lahore campus, Pakistan, in 2018. Currently, she is undertaking a PhD as a Junior Researcher at the Department of Software Engineering of LUT University, Finland. Earlier, she was a Lecturer at the Department of Software Engineering of the University of Lahore, Pakistan. Her research interests include digital transmission, information systems, and machine learning. She won a four-year research grant from the Higher Education Commission of Pakistan to complete her PhD studies.

AAMIR WALI has been teaching at the Department of Computer Science, FAST-National University of Computer and Emerging Sciences, since 2004. He has a Ph.D. in Computer Science from the same university. His areas of interest include font development, writing systems, machine learning, image processing, human-computer interaction, and virtual/augmented reality.
