
Soft Computing

(IT317)

Captioning Chest X Rays with Deep Learning

Submitted to:
Prof. Sunakshi Mehra
Department of Information Technology Delhi Technological University

Submitted by:
Krishna Poddar 2K21/CO/243

Delhi Technological University


Shahbad Daulatpur, Main Bawana Road, Delhi-110042

INDEX

1. Introduction
2. Proposed Methodology
3. Experimental Details
4. Code
5. Output
6. Bibliography

1) Introduction
The realm of deep learning has undergone a remarkable transformation, elevating the
capabilities of artificial intelligence in understanding and interpreting visual information.
In a world inundated with images, the ability to not only recognize objects and scenes but
also to convey that understanding through natural language is a monumental leap in
machine learning. One of the domains where this synergy of computer vision and natural
language processing holds immense potential is in the field of medical radiology.

Every day, thousands of radiological images, ranging from X-rays to MRIs, are generated
in healthcare settings around the globe. These images play a pivotal role in diagnosing
diseases, monitoring treatment progress, and providing critical insights into patients'
health. However, the wealth of visual information contained within these images is often
not easily accessible or interpretable.

This project embarks on a journey to bridge this gap by creating an automated image
captioning system specifically designed for medical radiology reports. We aim to harness
the power of deep learning, employing state-of-the-art neural network architectures, to
generate coherent and contextually accurate natural language descriptions of radiological
images. This innovative approach not only promises to save valuable time for healthcare
professionals but also enhances the accessibility and interpretability of medical images for
a wide range of stakeholders, from physicians and radiologists to patients themselves.

Challenges in Medical Radiology Reports:

Medical radiology reports are unique in their complexity. The images they contain are often
intricate and multifaceted, requiring a comprehensive understanding of anatomy,
pathology, and disease-specific patterns. Furthermore, radiology reports often include
intricate medical jargon, making them inaccessible to non-specialists. This combination of
visual complexity and linguistic specificity presents a considerable challenge for
automated interpretation.

Conventional methods for generating medical image reports often involve manual
interpretation and report writing by radiologists. This process is time-consuming, subject
to human error, and may lead to reporting backlogs in busy healthcare environments. It is
here that our deep learning-based image captioning system shines, as it can automatically
generate detailed and coherent descriptions of radiological images, alleviating the burden
on healthcare professionals and providing rapid, consistent, and understandable reports.

2) Proposed Methodology

a) Image Understanding:
Image understanding is a fundamental step. It entails teaching the model to comprehend
the contents of an image. This is crucial for generating coherent and contextually
accurate captions. Image understanding encompasses several key aspects, including:
i. Object Recognition: Identifying objects within the image, such as anatomical
structures or abnormalities in medical radiology images.
ii. Scene Recognition: Recognizing the broader context or scene in which the
image is situated, which is particularly important in medical images.
iii. Interrelationships: Understanding how objects and scenes relate to each other
within the image, enabling the model to generate descriptive captions that
reflect these relationships.

b) Language Used:

 Python: Python serves as the primary programming language for this project.
Python is an excellent choice due to its extensive support for machine learning
and deep learning libraries. Its simplicity, readability, and wide range of
libraries make it a preferred language for developing machine-learning models.
In this project, Python is used for various tasks, including data preprocessing,
model training, and caption generation.

c) Libraries Used:

i. TensorFlow: TensorFlow is a widely used deep learning framework that plays
a central role in this project. It provides the tools for building, training, and
deploying machine learning models. In this project, TensorFlow is used for
tasks such as creating image datasets, defining neural network architectures, and
training the captioning model.

ii. Keras: Keras is an open-source, high-level neural networks API that runs on top
of TensorFlow. It simplifies the process of building and training deep learning
models. In this project, Keras is used in conjunction with TensorFlow to
define the architecture of the recurrent neural network (RNN) decoder.

iii. NumPy: NumPy is a fundamental library for numerical computation in
Python. It is used to manipulate and process numerical data, making it an
essential tool in data preprocessing, especially for transforming image data into
feature vectors.

iv. OpenCV: OpenCV (Open Source Computer Vision Library) is used for image
processing tasks, such as loading, resizing, and pre-processing images before
they are fed into the deep learning model.

v. Natural Language Toolkit (NLTK): NLTK is a library for natural language
processing in Python. It is used for text-related tasks, such as tokenizing and
processing captions during training and caption generation.

vi. Scikit-Learn: Scikit-Learn is a versatile machine learning library for Python.
It is used for tasks such as splitting the dataset into training and testing sets
and evaluating model performance.

vii. Matplotlib: Matplotlib is a powerful data visualization library for Python. It
is employed to create visualizations that help assess the model's performance
and the quality of the generated image captions.

viii. Pandas: Pandas is another essential library for data manipulation and analysis.
It is used for organizing and preparing data, especially during the data
preprocessing phase.

ix. Pre-trained CNN models: Pre-trained CNN models such as ResNet50,
EfficientNet, Inception, or InceptionResNet are loaded through
tf.keras.applications with ImageNet weights. This allows the project to leverage
knowledge acquired from training on a very large dataset for feature extraction.

d) Feature Extraction:
 Feature extraction is the initial and pivotal step in this methodology. It involves
using Convolutional Neural Networks (CNN) architectures, such as ResNet50,
EfficientNet, Inception, or InceptionResNet, as feature extractors.
 These CNNs transform the raw pixel data of images into numerical feature
vectors. These vectors encode information about the image's content, including
objects, scenes, and their relationships.
 The output of the last convolutional layer of InceptionV3 is a feature map of
shape 8x8x2048. This feature map serves as the foundation for generating
descriptive captions (a minimal extraction sketch follows this list).
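
As a minimal sketch of this feature-extraction step (assuming TensorFlow 2.x and ImageNet
weights, consistent with the code later in this report; the random tensor is only a stand-in
for a real X-ray):

import tensorflow as tf

# InceptionV3 without its classification head; for a 299x299x3 input the last
# convolutional block outputs an 8x8x2048 feature map.
feature_extractor = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")

img = tf.random.uniform((1, 299, 299, 3), maxval=255.0)           # stand-in for a chest X-ray
img = tf.keras.applications.inception_v3.preprocess_input(img)    # scale pixels to [-1, 1]
features = feature_extractor(img)                                 # shape: (1, 8, 8, 2048)
features = tf.reshape(features, (features.shape[0], -1, 2048))    # shape: (1, 64, 2048)
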
e) Dataset Preparation:

 High-quality datasets are the foundation for training the image captioning
model. The choice of datasets significantly impacts the model's performance,
and it's essential to select datasets that are relevant to the application domain.
 In this methodology, two specific datasets are mentioned:
i. The National Institute of Health Chest X-Ray Dataset is chosen for
training the feature extractor. This dataset contains chest X-ray images
necessary to understand the content of medical radiology images.
ii. The Chest X-ray dataset from Indiana University is used for training the
captioning model. This dataset contains annotated captions for chest X-ray
images, enabling the model to learn how to describe medical images.
 The selection of these datasets aligns with the project's focus on medical
radiology reports and ensures that the model learns from relevant data sources.

f) Preprocessing:

 Data preprocessing is a meticulous and crucial task in this methodology. It
involves organizing the datasets so that they are in the required format for
model training.
 The data is structured into the TensorFlow image_dataset_from_directory
format, which organizes images into directories named after their classes or
categories (an illustrative layout is sketched after this list).
 This data structuring step ensures that the data is used efficiently during model
training: the model can access and learn from the data in a systematic manner,
which improves the effectiveness of the training process.
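
As an illustrative sketch of the directory layout that image_dataset_from_directory expects
(the class names below are placeholders, not the report's actual label set):

x_ray_train/
    class_a/
        image_001.png
        ...
    class_b/
        image_045.png
        ...
x_ray_test/
    class_a/
    class_b/

Each subdirectory name is treated as a class label, and every image inside it is assigned
that label.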

3) Experimental Details
a) Introduction to the Dataset
 The Indiana University chest X-ray dataset is a valuable resource for medical
image analysis and diagnosis. It contains approximately 7440 images of both
frontal and lateral views of patients' chests.
 The NIH Chest X-Ray dataset consists of 112,120 de-identified images of chest
X-rays with disease labels from 30,805 unique patients.
 These images are accompanied by detailed medical reports that include
findings, impressions, and information about the patient's chest X-ray.
b) Data Preprocessing

Data preprocessing is a crucial step to make the dataset suitable for model training.
In this case, the two provided files, one containing image paths and the other containing
captions, are merged into a single table, which makes the data easier to handle (a minimal
merge sketch follows the dataset-creation code below).
## Creating the training dataset
IMG_SIZE = (299, 299)
cls_train_dir = "/content/x_ray_train"

print("Training Images")
train_data = tf.keras.preprocessing.image_dataset_from_directory(
    directory=cls_train_dir,
    image_size=IMG_SIZE,
    label_mode="categorical",
    color_mode="rgb",
    batch_size=32)

print("Testing Images")
cls_test_dir = "/content/x_ray_test"
test_data = tf.keras.preprocessing.image_dataset_from_directory(
    directory=cls_test_dir,
    image_size=IMG_SIZE,
    label_mode="categorical",
    color_mode="rgb",
    batch_size=32)
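
For the merge step mentioned in (b), a minimal Pandas sketch (the file and column names are
assumptions based on the Indiana University dataset, not the report's exact code):

import pandas as pd

# Hypothetical file names from the Indiana University chest X-ray dataset.
projections = pd.read_csv("indiana_projections.csv")   # assumed columns: uid, filename, projection
reports = pd.read_csv("indiana_reports.csv")           # assumed columns: uid, indication, findings, impression

# Join image paths with their report text on the shared study identifier.
merged = projections.merge(reports, on="uid", how="inner")
merged = merged[["filename", "indication", "findings", "impression"]]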

c) Visualizing the Dataset


Visualizing the dataset is essential to gain an understanding of its contents.
The dataset is read and displayed using the Pandas library.
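
The original snippet is not reproduced here; a minimal stand-in using the merged dataframe
from the previous sketch (assuming the filename column holds readable image paths) might
look like:

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

print(merged.head())                        # inspect a few image/caption rows

# Show one chest X-ray together with the start of its findings text.
sample = merged.iloc[0]
plt.imshow(mpimg.imread(sample["filename"]), cmap="gray")
plt.title(str(sample["findings"])[:80])
plt.axis("off")
plt.show()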

d) Feature Extraction

Feature extraction involves using a pre-trained Convolutional Neural Network (CNN)
model, such as InceptionV3, to extract high-level features from the chest X-ray images.
The features are encoded into a feature vector that can be used for training the image
captioning model. The code for this is written in the next section.

e) Caption Data Preparation

 Preparing caption data is a critical step in image captioning. The captions must
be formatted in a way that can be used for training.
 In this case, captions are created by combining information from different parts
of the medical reports, namely the indications, findings, and impressions sections
(a minimal sketch of this caption assembly appears after this list).
 Caption generation is done with two components, a CNN encoder and an RNN decoder:

i) CNN Encoder:
 The model takes in a single raw image and generates a caption y, encoded
as a sequence of 1-of-K (one-hot) encoded words, where K is the size of the
vocabulary and C is the length of the caption.
 The model uses a Convolutional Neural Network (InceptionV3 in our case)
to extract a set of feature vectors, which the authors of the attention paper
call annotation vectors. The CNN outputs L vectors, each of D dimensions;
in our case the InceptionV3 feature extractor outputs a tensor of shape
8x8x2048, treated as L = 64 locations of D = 2048 dimensions each.

ii) RNN Decoder:
 The RNN decoder produces the caption word by word using recurrent cells.
 Context vectors, obtained from the attention mechanism, influence the word
generated at each step.
 In this implementation, GRU cells are used instead of LSTM cells for
sequential processing.
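
A minimal sketch of the caption assembly described above (the column names follow the earlier
Pandas sketch and are assumptions, not the report's exact code):

def build_caption(row):
    # Combine the report sections into one training caption, wrapping each
    # section in start/end tokens so the decoder learns section boundaries.
    parts = [
        "indications " + str(row["indication"]),
        "findings " + str(row["findings"]),
        "impressions " + str(row["impression"]),
    ]
    return "startseq " + " endseq startseq ".join(parts) + " endseq"

train_captions = merged.apply(build_caption, axis=1).tolist()
image_paths = merged["filename"].tolist()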

f) Text Vectorization

 Text vectorization is necessary to convert textual data into a format that can be
used by the model.
 In this process, a Text Vectorization layer is set up to encode the captions
numerically. This layer learns the vocabulary from the caption data.

g) Data Splitting

 Data splitting is an important part of preparing the dataset for training and
evaluation.
 The dataset is divided into training and validation sets using an 80-20 split. This
ensures that the model is trained on a representative sample of the data.

h) Architecture of the LSTM cell:

 (Figure: internal structure of the LSTM cell, not reproduced here.)

i) Architecture of the GRU cell:

 (Figure: internal structure of the GRU cell, not reproduced here.)
 In this implementation, a GRU cell is used, which is similar to an LSTM cell
but uses fewer gates (a short usage sketch follows).
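
As a minimal usage sketch of the GRU cell as it appears in the decoder (shapes follow the
hyperparameters used later in this report: batch size 64, embedding dimension 512, 1024 units):

import tensorflow as tf

gru = tf.keras.layers.GRU(1024,
                          return_sequences=True,   # emit an output at every time step
                          return_state=True,       # also return the final hidden state
                          recurrent_initializer="glorot_uniform")

# One decoding step: a batch of 64 single-token embeddings of dimension 512.
step_input = tf.random.normal((64, 1, 512))
output, state = gru(step_input)   # output: (64, 1, 1024), state: (64, 1024)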

j) Bahdanau Attention Mechanism

 The Bahdanau Attention mechanism is a key component of the RNN Decoder.


 It computes attention weights that determine the importance of different image
locations when generating words in the captions.
 The attention mechanism enhances the model's ability to focus on relevant parts
of the image (its score computation is summarized below).
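
For reference, the score computation (a summary in the notation of the Bahdanau attention
paper, using the same W1, W2, and V as the code in section 4; a_i are the L = 64 annotation
vectors and h is the previous decoder hidden state):

score_i = V^T · tanh(W1 · a_i + W2 · h)
alpha_i = softmax(score_1, ..., score_L)_i
context = Σ_i alpha_i · a_i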

k) Training

 Model training is a crucial phase where the CNN Encoder and RNN Decoder
are trained to work together.
 The Adam optimizer is used, and the loss is calculated using Sparse Categorical
Cross-Entropy.
 The training is performed for 20 epochs to allow the model to learn and
improve.

4) Code:
i) Feature Extraction from NIH Chest X-ray Dataset:
base_model = tf.keras.applications.inception_v3.InceptionV3(include_top=False,
                                                            weights='imagenet',
                                                            input_shape=(299, 299, 3))
base_model.trainable = True

## Making the model
model_0 = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(299, 299, 3), name="Input_layer"),
    base_model,
    tf.keras.layers.GlobalMaxPool2D(),
    tf.keras.layers.Dense(12, activation="softmax")
])

model_0.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

## Setting up callbacks
# Setup EarlyStopping callback to stop training if the model's val_loss
# doesn't improve for 3 epochs in a row
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                                  patience=3)

# Creating learning rate reduction callback
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                 factor=0.25,  # multiply the learning rate by 0.25 (reduce by 4x)
                                                 patience=2,
                                                 verbose=1,    # print out when the learning rate goes down
                                                 min_lr=1e-7)

history_0 = model_0.fit(train_data,
                        epochs=3,
                        steps_per_epoch=len(train_data),
                        validation_data=test_data,
                        validation_steps=int(0.25 * len(test_data)),
                        callbacks=[early_stopping, reduce_lr])

ii) Feature Extraction from Indiana University X-ray Dataset:


## Make the Image feature Extractor Model
# image_model: the pre-trained InceptionV3 base (include_top=False) described
# in the methodology; its definition is not shown in this excerpt.
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

## Create a function to preprocess the image before passing it to the network
def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.keras.layers.Resizing(299, 299)(img)
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

# Get unique images
encode_train = sorted(set(image_paths))
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(
    load_image, num_parallel_calls=tf.data.AUTOTUNE).batch(16)

# Extract features, reshape them to (batch, 64, channels) and cache them to disk
for img, path in tqdm(image_dataset):
    batch_features = image_features_extract_model(img)
    batch_features = tf.reshape(batch_features,
                                (batch_features.shape[0], -1, batch_features.shape[3]))
    for bf, p in zip(batch_features, path):
        path_of_feature = p.numpy().decode("utf-8")
        np.save(path_of_feature, bf.numpy())

iii) Text Vectorization:
caption_dataset = tf.data.Dataset.from_tensor_slices(train_captions)

# Max word count for a caption.
max_length = 100
# Limit the vocabulary size (12000 tokens here).
vocabulary_size = 12000

tokenizer = tf.keras.layers.TextVectorization(
    max_tokens=vocabulary_size,
    output_sequence_length=max_length)
# Learn the vocabulary from the caption data.
tokenizer.adapt(caption_dataset)

cap_vector = caption_dataset.map(lambda x: tokenizer(x))

## Create word-to-token and token-to-word mappings
word_to_index = tf.keras.layers.StringLookup(
    mask_token="",
    vocabulary=tokenizer.get_vocabulary())
index_to_word = tf.keras.layers.StringLookup(
    mask_token="",
    vocabulary=tokenizer.get_vocabulary(),
    invert=True)

iv) Data splitting and batching:


## Split data into training and testing
img_to_cap_vector = collections.defaultdict(list)
for img, cap in zip(image_paths, cap_vector):
    img_to_cap_vector[img].append(cap)

# Create training and validation sets using an 80-20 split randomly.
img_keys = list(img_to_cap_vector.keys())
random.shuffle(img_keys)

slice_index = int(len(img_keys) * 0.8)
img_name_train_keys, img_name_val_keys = img_keys[:slice_index], img_keys[slice_index:]

img_name_train = []
cap_train = []
for imgt in img_name_train_keys:
    capt_len = len(img_to_cap_vector[imgt])
    img_name_train.extend([imgt] * capt_len)
    cap_train.extend(img_to_cap_vector[imgt])

img_name_val = []
cap_val = []
for imgv in img_name_val_keys:
    capv_len = len(img_to_cap_vector[imgv])
    img_name_val.extend([imgv] * capv_len)
    cap_val.extend(img_to_cap_vector[imgv])

BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 512
units = 1024
num_steps = len(img_name_train) // BATCH_SIZE
features_shape = 2560
attention_features_shape = 64

# Load the numpy files
def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8') + '.npy')
    return img_tensor, cap

dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))

# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item1, item2: tf.numpy_function(
    map_func, [item1, item2], [tf.float32, tf.int64]),
    num_parallel_calls=tf.data.AUTOTUNE)

# Shuffle and batch
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)

v) Creating the CNN encoder:


class CNN_Encoder(tf.keras.Model):
    # Since the features have already been extracted and saved to disk,
    # this encoder just passes them through a fully connected layer.
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        # shape after fc == (batch_size, 64, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x

vi) Creating the RNN decoder:


class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        # defining attention as a separate model
        self.attention = BahdanauAttention(self.units)

    def call(self, x, features, hidden):
        context_vector, attention_weights = self.attention(features, hidden)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        # shape == (batch_size, max_length, hidden_size)
        x = self.fc1(output)
        # x shape == (batch_size * max_length, hidden_size)
        x = tf.reshape(x, (-1, x.shape[2]))
        # output shape == (batch_size * max_length, vocab)
        x = self.fc2(x)
        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))

vii) Creating the Bahdanau Attention Model:


class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features (CNN_Encoder output) shape == (batch_size, 64, embedding_dim)
        # hidden shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # attention_hidden_layer shape == (batch_size, 64, units)
        attention_hidden_layer = (tf.nn.tanh(self.W1(features) +
                                             self.W2(hidden_with_time_axis)))

        # score shape == (batch_size, 64, 1)
        # This gives an unnormalized score for each image feature location.
        score = self.V(attention_hidden_layer)

        # attention_weights shape == (batch_size, 64, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights

viii) Training:
encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, tokenizer.vocabulary_size())

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask out padding tokens (id 0) so they do not contribute to the loss
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)

checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
    # restoring the latest checkpoint in checkpoint_path
    ckpt.restore(ckpt_manager.latest_checkpoint)

loss_plot = []

@tf.function
def train_step(img_tensor, target):
    loss = 0
    # initializing the hidden state for each batch
    # because the captions are not related from image to image
    hidden = decoder.reset_state(batch_size=target.shape[0])
    dec_input = tf.expand_dims([word_to_index('<start>')] * target.shape[0], 1)

    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        for i in range(1, target.shape[1]):
            # passing the features through the decoder
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)
            # using teacher forcing
            dec_input = tf.expand_dims(target[:, i], 1)

    total_loss = (loss / int(target.shape[1]))
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, trainable_variables)
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    return loss, total_loss

EPOCHS = 20

for epoch in range(start_epoch, EPOCHS):
    start = time.time()
    total_loss = 0

    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss

        if batch % 100 == 0:
            average_batch_loss = batch_loss.numpy() / int(target.shape[1])
            print(f'Epoch {epoch+1} Batch {batch} Loss {average_batch_loss:.4f}')

    # storing the epoch end loss value to plot later
    loss_plot.append(total_loss / num_steps)

    if epoch % 5 == 0:
        ckpt_manager.save()

    print(f'Epoch {epoch+1} Loss {total_loss/num_steps:.6f}')
    print(f'Time taken for 1 epoch {time.time()-start:.2f} sec\n')

5) Output:
a) Training Loss Plot

 The training loss plot provides insights into how the loss evolves during training
(a minimal plotting sketch is given below).
 This information is valuable for assessing the model's convergence and performance.
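
A minimal plotting sketch, assuming the training loop in section 4 has populated loss_plot:

import matplotlib.pyplot as plt

plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training Loss Plot')
plt.show()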

b) Generating Captions
 After training, the model can be used to generate captions for chest X-ray images
(a minimal greedy-decoding sketch is given below).
 Real captions from the dataset and predicted captions are compared to assess the
model's performance.
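
A minimal greedy-decoding sketch, assuming the trained encoder, decoder, tokenizer mappings,
and feature extractor defined in section 4 (the report's exact evaluation code is not
reproduced here, and the start/end token strings must match those used in the training
captions):

def evaluate(image_path, max_length=100):
    # Extract image features and reshape them to (1, 64, channels).
    img, _ = load_image(image_path)
    features = image_features_extract_model(tf.expand_dims(img, 0))
    features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))
    features = encoder(features)

    hidden = decoder.reset_state(batch_size=1)
    dec_input = tf.expand_dims([word_to_index('<start>')], 0)
    result = []

    for _ in range(max_length):
        predictions, hidden, _ = decoder(dec_input, features, hidden)
        predicted_id = tf.argmax(predictions[0]).numpy()           # greedy: pick the most likely token
        word = index_to_word(predicted_id).numpy().decode('utf-8')
        if word == '<end>':                                        # stop at the end token
            break
        result.append(word)
        dec_input = tf.expand_dims([predicted_id], 0)              # feed the prediction back in

    return ' '.join(result)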
i) Sample input 1:

Real Caption:

 Indications: xxxx with xxxx endseq startseq


 Findings: stable cardiomediastinal silhouette no focal airspace
consolidation suspicious pulmonary opacity pneumothorax or
pleural effusion changes of right mastectomy sequelae of prior
granulomatous disease mild thoracic spine degenerative change.
 Impressions: no acute cardiopulmonary abnormality

Prediction Caption:
 indications xxxxyearold female followup endseq startseq
 findings normal heart size no focal consolidation is identified there
is minimal xxxx airspace disease in the left ventricle no focal
alveolar consolidation no definite pleural effusion or
pneumothoraces cardiomediastinal silhouette is normal for size and
contour degenerative changes in the inferior xxxx cardiomegaly and
small to previouschronic pulmonary arthritis

 impressions 1 pulmonary clinical correlation xxxx no xxxx old


fractures the previously seen left upper quadrant seen no xxxx soft
tissue since comparison examination there is some left base airspace
disease the visualized bony structures are intact endseq startseq
impressions no

ii) Sample input 2:

Real Caption:

 Indications: start startseq indications dyspnea endseq startseq


 Findings: stable the heart is top normal in size the mediastinum is
stable the aorta is atherosclerotic xxxx opacities are noted in the lung
bases compatible with scarring or atelectasis there is no acute
infiltrate or pleural effusion
 Impressions: chronic changes without acute disease

Prediction Caption:
 indications shortness of breath hypertension
 findings impressions ltthe heart size within normal limits no focal
consolidation pneumothorax or large pleural effusion visualized
bony structures are otherwise unremarkable in appearance of focal
airspace disease no pleural effusion or pneumothorax the bony
elements from elsewhere are no displaced rib fractures the lungs are
clear no pleural effusion

 impressions chest three total images to be grossly unremarkable no


suspicious pulmonary opacities mild degenerative changes of right
apex otherwise unremarkable exam negative for acute pulmonary
infiltrate endseq end

c) Results And Improvements:

In terms of the generated captions, the model demonstrates effective generalization to the
findings section, but there are noticeable misreads in the impressions and indications.
Additionally, there are misreads within the findings section. To enhance performance,
consider the following improvements:

i. Enhancing the accuracy of the feature extractor model.


ii. Augmenting the training data and incorporating more classes for the feature
extractor.
iii. Addressing imbalanced classes in the feature extractor model.
iv. Developing a more advanced decoder network.

6) Bibliography
a) NIH Chest X-ray dataset: https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community

b) Indiana University chest X-ray dataset: https://www.kaggle.com/datasets/raddar/chest-xrays-indiana-university?select=indiana_reports.csv

c) Bahdanau attention paper (Bahdanau, Cho, and Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate"): https://arxiv.org/abs/1409.0473
