Soft Computing
(IT317)
Submitted to:
Prof. Sunakshi Mehra
Department of Information Technology
Delhi Technological University
Submitted by:
Krishna Poddar 2K21/CO/243
INDEX
Sno. Topics
1 Introduction
2 Proposed Methodology
3 Experimental Details
4 Code
5 Output
6 Bibliography
1) Introduction
The realm of deep learning has undergone a remarkable transformation, elevating the
capabilities of artificial intelligence in understanding and interpreting visual information.
In a world inundated with images, the ability to not only recognize objects and scenes but
also to convey that understanding through natural language is a monumental leap in
machine learning. One of the domains where this synergy of computer vision and natural
language processing holds immense potential is in the field of medical radiology.
Every day, thousands of radiological images, ranging from X-rays to MRIs, are generated
in healthcare settings around the globe. These images play a pivotal role in diagnosing
diseases, monitoring treatment progress, and providing critical insights into patients'
health. However, the wealth of visual information contained within these images often
remains difficult to access and interpret.
This project embarks on a journey to bridge this gap by creating an automated image
captioning system specifically designed for medical radiology reports. We aim to harness
the power of deep learning, employing state-of-the-art neural network architectures, to
generate coherent and contextually accurate natural language descriptions of radiological
images. This innovative approach not only promises to save valuable time for healthcare
professionals but also enhances the accessibility and interpretability of medical images for
a wide range of stakeholders, from physicians and radiologists to patients themselves.
Medical radiology reports are unique in their complexity. The images they contain are often
intricate and multifaceted, requiring a comprehensive understanding of anatomy,
pathology, and disease-specific patterns. Furthermore, radiology reports often include
intricate medical jargon, making them inaccessible to non-specialists. This combination of
visual complexity and linguistic specificity presents a considerable challenge for
automated interpretation.
Conventional methods for generating medical image reports often involve manual
interpretation and report writing by radiologists. This process is time-consuming, subject
to human error, and may lead to reporting backlogs in busy healthcare environments. It is
here that our deep learning-based image captioning system shines, as it can automatically
generate detailed and coherent descriptions of radiological images, alleviating the burden
on healthcare professionals and providing rapid, consistent, and understandable reports.
2) Proposed Methodology
a) Image Understanding:
Image understanding is a fundamental step. It entails teaching the model to comprehend
the contents of an image. This is crucial for generating coherent and contextually
accurate captions. Image understanding encompasses several key aspects, including:
i. Object Recognition: Identifying objects within the image, such as anatomical
structures or abnormalities in medical radiology images.
ii. Scene Recognition: Recognizing the broader context or scene in which the
image is situated, which is particularly important in medical images.
iii. Interrelationships: Understanding how objects and scenes relate to each other
within the image, enabling the model to generate descriptive captions that
reflect these relationships.
b) Language Used:
Python: Python serves as the primary programming language for this project.
Python is an excellent choice due to its extensive support for machine learning
and deep learning libraries. Its simplicity, readability, and wide range of
libraries make it a preferred language for developing machine-learning models.
In this project, Python is used for various tasks, including data preprocessing,
model training, and caption generation.
c) Libraries Used:
i. TensorFlow: TensorFlow is the core open-source deep learning framework used in this
project for tasks like creating image datasets, defining neural network architectures,
and training the captioning model.
ii. Keras: Keras is an open-source high-level neural networks API that runs on top
of TensorFlow. It simplifies the process of building and training deep learning
models. In this project, Keras is used in conjunction with TensorFlow for
defining the architecture of the recurrent neural network (RNN) decoder.
iii. OpenCV: OpenCV (Open Source Computer Vision Library) is used for image
processing tasks, such as loading, resizing, and pre-processing images before
they are fed into the deep learning model.
iv. Pandas: Pandas is another essential library for data manipulation and analysis.
It is used for organizing and preparing data, especially during the data
preprocessing phase.
v. Keras Applications (for pre-trained CNN models): Pre-trained CNN models such as
ResNet50, EfficientNet, Inception, or InceptionResNet are loaded through Keras
Applications (tf.keras.applications) with ImageNet weights. This allows the knowledge
acquired during training on a vast dataset to be leveraged for feature extraction.
d) Feature Extraction:
Feature extraction is the initial and pivotal step in this methodology. It involves
using Convolutional Neural Networks (CNN) architectures, such as ResNet50,
EfficientNet, Inception, or InceptionResNet, as feature extractors.
These CNNs transform the raw pixel data of images into numerical feature
vectors. These vectors encode information about the image's content, including
objects, scenes, and their relationships.
The final feature map, obtained from the last convolutional layer of
InceptionV3, has a dimension of 8x8x2048, i.e. 64 spatial locations each
described by a 2048-dimensional feature vector. These features serve as the
foundation for generating descriptive captions.
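As an illustration, a minimal sketch of this step using InceptionV3 from tf.keras.applications
is shown below (the helper name image_features_extract_model and the random placeholder image
are illustrative, not taken from the project code):

import tensorflow as tf

# Minimal sketch: use InceptionV3 (ImageNet weights) as a feature extractor.
# For a 299x299 input, the last convolutional block outputs an 8x8x2048 feature map.
image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
image_features_extract_model = tf.keras.Model(image_model.input,
                                              image_model.layers[-1].output)

# Example: extract features for one preprocessed 299x299 RGB image
# (a random tensor stands in for a real chest X-ray here).
img = tf.random.uniform((1, 299, 299, 3), maxval=255.0)
img = tf.keras.applications.inception_v3.preprocess_input(img)
features = image_features_extract_model(img)                                   # (1, 8, 8, 2048)
features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))    # (1, 64, 2048)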
e) Dataset Preparation:
High-quality datasets are the foundation for training the image captioning
model. The choice of datasets significantly impacts the model's performance,
and it's essential to select datasets that are relevant to the application domain.
In this methodology, two specific datasets are mentioned:
i. The National Institutes of Health (NIH) Chest X-Ray Dataset is chosen for
training the feature extractor. Its labelled chest X-ray images give the
extractor the domain knowledge needed to understand the content of medical
radiology images.
ii. The Chest X-ray dataset from Indiana University is used for training the
captioning model. This dataset contains annotated captions for chest X-ray
images, enabling the model to learn how to describe medical images.
The selection of these datasets aligns with the project's focus on medical
radiology reports and ensures that the model learns from relevant data sources.
f) Preprocessing:
Before training, the images are resized to 299x299 pixels (the input size expected by
InceptionV3) and loaded as RGB, while the report text is cleaned and paired with the
corresponding image paths, as detailed in the experimental section below.
3) Experimental Details
a) Introduction to the Dataset
The Indiana University chest X-ray dataset is a valuable resource for medical
image analysis and diagnosis. It contains approximately 7440 images of both
frontal and lateral views of patients' chests.
The NIH Chest X-Ray dataset consists of 112,120 de-identified images of chest
X-rays with disease labels from 30,805 unique patients.
These images are accompanied by detailed medical reports that include
findings, impressions, and information about the patient's chest X-ray.
b) Data Preprocessing
Data preprocessing is a crucial step to make the dataset suitable for model training.
In this case, the two provided files, one containing image paths and the other
containing captions, were merged. This allows for easier handling of the data.
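The exact file and column names are not given in this report; as an illustrative sketch with
hypothetical names, the merge could be done with Pandas as follows:

import pandas as pd

# Hypothetical file and column names, for illustration only.
paths_df = pd.read_csv("indiana_projections.csv")   # e.g. columns: uid, filename, projection
reports_df = pd.read_csv("indiana_reports.csv")     # e.g. columns: uid, indication, findings, impression

# Join the two files on the shared report identifier so every image row carries its report text.
merged_df = paths_df.merge(reports_df, on="uid", how="inner")

# Build one caption per image by concatenating the report sections.
merged_df["caption"] = ("indications " + merged_df["indication"].fillna("") +
                        " findings " + merged_df["findings"].fillna("") +
                        " impressions " + merged_df["impression"].fillna(""))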
import tensorflow as tf

## Creating the training dataset
IMG_SIZE = (299, 299)

cls_train_dir = "/content/x_ray_train"
print("Training Images")
train_data = tf.keras.preprocessing.image_dataset_from_directory(directory=cls_train_dir,
                                                                 image_size=IMG_SIZE,
                                                                 label_mode="categorical",
                                                                 color_mode="rgb",
                                                                 batch_size=32)

print("Testing Images")
cls_test_dir = "/content/x_ray_test"
test_data = tf.keras.preprocessing.image_dataset_from_directory(directory=cls_test_dir,
                                                                image_size=IMG_SIZE,
                                                                label_mode="categorical",
                                                                color_mode="rgb",
                                                                batch_size=32)
d) Caption Preparation and Model Architecture
Preparing caption data is a critical step in image captioning. The captions must be
formatted in a way that can be used for training.
In this case, captions are created by combining information from different parts
of the medical reports, including the indications, findings, and impressions sections.
Captioning is done using two components, a CNN encoder and an RNN decoder:
i) CNN Encoder:
The model takes in a single raw image and generates a caption y = (y1, ..., yC),
encoded as a sequence of 1-of-K encoded words, where K is the size of the
vocabulary and C is the length of the caption.
f) Text Vectorization
Text vectorization is necessary to convert textual data into a format that can be
used by the model.
In this process, a Text Vectorization layer is set up to encode the captions
numerically. This layer learns the vocabulary from the caption data.
g) Data Splitting
Data splitting is an important part of preparing the dataset for training and
evaluation.
The dataset is divided into training and validation sets using an 80-20 split. This
ensures that the model is trained on a representative sample of the data.
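A minimal sketch of such a split (with illustrative placeholder data, not the project's
variables) is shown below:

import random

# Illustrative placeholders: image paths and captions aligned index-for-index.
img_name_vector = [f"image_{i}.png" for i in range(10)]
captions = [f"caption {i}" for i in range(10)]

keys = list(range(len(img_name_vector)))
random.shuffle(keys)

slice_index = int(len(keys) * 0.8)            # 80% of the data goes to training
train_keys, val_keys = keys[:slice_index], keys[slice_index:]

img_name_train = [img_name_vector[k] for k in train_keys]
cap_train = [captions[k] for k in train_keys]
img_name_val = [img_name_vector[k] for k in val_keys]
cap_val = [captions[k] for k in val_keys]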
j) Bahdanau Attention Mechanism
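At each decoding step, Bahdanau (additive) attention scores the 64 image feature vectors
(the flattened 8x8 grid from the encoder) against the decoder's previous hidden state, turns
the scores into weights with a softmax, and returns the weighted sum as a context vector for
predicting the next word. The project's exact implementation is not reproduced in this report;
a minimal illustrative sketch of such a layer is shown below:

import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    # Additive attention: score = V * tanh(W1*features + W2*hidden)
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 64, embedding_dim) encoded image features
        # hidden:   (batch, units) previous decoder hidden state
        hidden_with_time_axis = tf.expand_dims(hidden, 1)                  # (batch, 1, units)
        attention_hidden_layer = tf.nn.tanh(self.W1(features) +
                                            self.W2(hidden_with_time_axis))
        score = self.V(attention_hidden_layer)                             # (batch, 64, 1)
        attention_weights = tf.nn.softmax(score, axis=1)                   # weights over the 64 regions
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights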
k) Training
Model training is a crucial phase where the CNN Encoder and RNN Decoder
are trained to work together.
The Adam optimizer is used, and the loss is calculated using Sparse Categorical
Cross-Entropy.
The training is performed for 20 epochs to allow the model to learn and
improve.
4) Code:
i) Feature Extraction from NIH Chest X-ray Dataset:
base_model = tf.keras.applications.inception_v3.InceptionV3(include_top=False,
                                                            weights='imagenet',
                                                            input_shape=(299, 299, 3))
base_model.trainable = True

## Making the model
model_0 = tf.keras.Sequential(
    [
        tf.keras.layers.Input(shape=(299, 299, 3), name="Input_layer"),
        base_model,
        tf.keras.layers.GlobalMaxPool2D(),
        tf.keras.layers.Dense(12, activation="softmax")
    ]
)

model_0.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"]
)

## Setting up callbacks
# Setup EarlyStopping callback to stop training if the model's val_loss doesn't improve for 3 epochs
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss",  # watch the val loss metric
                                                  patience=3)  # stop if val loss fails to improve for 3 epochs in a row

# Creating learning rate reduction callback
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                 factor=0.25,  # multiply the learning rate by 0.25 (reduce by 4x)
                                                 patience=2,
                                                 verbose=1,  # print out when the learning rate goes down
                                                 min_lr=1e-7)

history_0 = model_0.fit(train_data,
                        epochs=3,
                        steps_per_epoch=len(train_data),
                        validation_data=test_data,
                        validation_steps=int(0.25 * len(test_data)),
                        callbacks=[early_stopping, reduce_lr])
iii) Text Vectorization:
caption_dataset = tf.data.Dataset.from_tensor_slices(train_captions)

# Max word count for a caption.
max_length = 100
# Use the top 12000 words for the vocabulary.
vocabulary_size = 12000

tokenizer = tf.keras.layers.TextVectorization(
    max_tokens=vocabulary_size,
    output_sequence_length=max_length)
# Learn the vocabulary from the caption data.
tokenizer.adapt(caption_dataset)

cap_vector = caption_dataset.map(lambda x: tokenizer(x))

## Create word-to-token and token-to-word mappings
word_to_index = tf.keras.layers.StringLookup(
    mask_token="",
    vocabulary=tokenizer.get_vocabulary())
index_to_word = tf.keras.layers.StringLookup(
    mask_token="",
    vocabulary=tokenizer.get_vocabulary(),
    invert=True)
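As an illustrative check (the sample sentence is made up), the fitted layers can round-trip a
caption between words and token ids:

# Illustrative round trip: caption text -> token ids -> words.
sample = tf.constant(["findings normal heart size no focal consolidation"])
token_ids = tokenizer(sample)                  # shape (1, max_length), zero-padded
words = index_to_word(token_ids)               # maps ids back to vocabulary strings
print(token_ids.numpy()[0][:8])
print([w.decode() for w in words.numpy()[0][:8]])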
class CNN_Encoder(tf.keras.Model):
    # Passes the pre-extracted image features through a fully connected layer.
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x
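The decoder class used by the training code below is not listed in this report. A minimal
sketch of a GRU decoder with Bahdanau attention, consistent with how the decoder is constructed
and called in the training step (layer choices are illustrative), could look like this; it relies
on a BahdanauAttention class such as the one sketched in Section 3 j):

class RNN_Decoder(tf.keras.Model):
    # GRU decoder that attends over the encoded image features at every step.
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, x, features, hidden):
        # Attend over the image features using the previous hidden state.
        context_vector, attention_weights = self.attention(features, hidden)
        x = self.embedding(x)                                        # (batch, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        x = self.fc2(x)                                              # logits over the vocabulary
        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))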
viii) Training:
encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, tokenizer.vocabulary_size())
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask out the padding tokens (id 0) so they do not contribute to the loss.
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)

checkpoint_path = "./checkpoints/train"
import time

ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
    # restoring the latest checkpoint in checkpoint_path
    ckpt.restore(ckpt_manager.latest_checkpoint)

loss_plot = []

@tf.function
def train_step(img_tensor, target):
    loss = 0
    # initializing the hidden state for each batch
    # because the captions are not related from image to image
    hidden = decoder.reset_state(batch_size=target.shape[0])
    dec_input = tf.expand_dims([word_to_index('<start>')] * target.shape[0], 1)

    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        for i in range(1, target.shape[1]):
            # passing the features through the decoder
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)
            # using teacher forcing
            dec_input = tf.expand_dims(target[:, i], 1)

    total_loss = (loss / int(target.shape[1]))
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, trainable_variables)
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    return loss, total_loss

EPOCHS = 20

for epoch in range(start_epoch, EPOCHS):
    start = time.time()
    total_loss = 0

    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss

        if batch % 100 == 0:
            average_batch_loss = batch_loss.numpy() / int(target.shape[1])
            print(f'Epoch {epoch+1} Batch {batch} Loss {average_batch_loss:.4f}')

    # storing the epoch end loss value to plot later
    loss_plot.append(total_loss / num_steps)

    if epoch % 5 == 0:
        ckpt_manager.save()

    print(f'Epoch {epoch+1} Loss {total_loss/num_steps:.6f}')
    print(f'Time taken for 1 epoch {time.time()-start:.2f} sec\n')
5) Output:
a) Training Loss Plot
The training loss plot provides insights into how the loss evolves during training.
This information is valuable for assessing the model's convergence and performance.
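The plot is produced from the loss_plot list accumulated at the end of every epoch in the
training loop; a minimal sketch using Matplotlib would be:

import matplotlib.pyplot as plt

plt.plot(loss_plot)          # one entry per epoch, appended in the training loop above
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.show()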
b) Generating Captions
After training, the model can be used to generate captions for chest X-ray images.
Real captions from the dataset and predicted captions are compared to assess the
model's performance.
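The generation code is not reproduced in this report; a minimal greedy-decoding sketch built on
the trained encoder and decoder is shown below. Here load_image and image_features_extract_model
are assumed helper names for loading a 299x299 image and extracting its 8x8x2048 InceptionV3
feature map, and greedy argmax decoding is used for simplicity:

def generate_caption(image_path):
    # Greedy decoding sketch: encode the image features, then repeatedly pick
    # the highest-probability next word until '<end>' or max_length is reached.
    hidden = decoder.reset_state(batch_size=1)

    img = tf.expand_dims(load_image(image_path), 0)          # assumed helper
    img_features = image_features_extract_model(img)         # assumed helper, (1, 8, 8, 2048)
    img_features = tf.reshape(img_features,
                              (img_features.shape[0], -1, img_features.shape[3]))
    features = encoder(img_features)

    dec_input = tf.expand_dims([word_to_index('<start>')], 0)
    result = []
    for _ in range(max_length):
        predictions, hidden, _ = decoder(dec_input, features, hidden)
        predicted_id = int(tf.argmax(predictions[0]))
        word = tf.compat.as_text(index_to_word(predicted_id).numpy())
        if word == '<end>':
            break
        result.append(word)
        dec_input = tf.expand_dims([predicted_id], 0)
    return ' '.join(result)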
i) Sample input 1:
Real Caption:
Prediction Caption:
indications xxxxyearold female followup endseq startseq
findings normal heart size no focal consolidation is identified there
is minimal xxxx airspace disease in the left ventricle no focal
alveolar consolidation no definite pleural effusion or
pneumothoraces cardiomediastinal silhouette is normal for size and
contour degenerative changes in the inferior xxxx cardiomegaly and
small to previouschronic pulmonary arthritis
ii) Sample input 2:
Real Caption:
Prediction Caption:
indications shortness of breath hypertension
findings impressions ltthe heart size within normal limits no focal
consolidation pneumothorax or large pleural effusion visualized
bony structures are otherwise unremarkable in appearance of focal
airspace disease no pleural effusion or pneumothorax the bony
elements from elsewhere are no displaced rib fractures the lungs are
clear no pleural effusion
c) Results and Improvements:
In terms of the generated captions, the model generalizes effectively to the findings section,
but there are noticeable misreads in the impressions and indications, as well as some misreads
within the findings section. Further refinement of the model and its training would be needed
to address these errors.
6) Bibliography
a) Link to NIH X-ray dataset: https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community