
A

SEMINAR REPORT

ON

Study and Survey on NLP and Image Processing’s


Integrated Multimodal Applications

SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE


IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE OF

BACHELOR OF ENGINEERING
INFORMATION TECHNOLOGY

BY

Soham Phatak
Roll No: 33261
Exam Seat No: T1900508688

Under the guidance of


Dr. Archana Ghotkar

DEPARTMENT OF INFORMATION TECHNOLOGY

PUNE INSTITUTE OF COMPUTER TECHNOLOGY
SR. NO. 27, PUNE-SATARA ROAD, DHANKAWADI
PUNE - 411 043.
AY: 2024-2025

P:F-SMR-UG/08/R0
SCTR’s PUNE INSTITUTE OF COMPUTER TECHNOLOGY
DEPARTMENT OF INFORMATION TECHNOLOGY

CERTIFICATE

This is to certify that the Seminar work entitled


Study and Survey on NLP and Image Processing’s
Integrated Multimodal Applications
Submitted by
Name : Soham Phatak
Exam Seat No: T1900508688

is a bonafide work carried out under the supervision of Dr. Archana Ghotkar and it
is submitted towards the partial fulfillment of the requirements of Savitribai Phule
Pune University, Pune for the award of the degree of Bachelor of Engineering
(Information Technology).

Dr. A. S. Ghotkar Dr. A. S. Ghotkar


Seminar Guide HOD IT

Dr. S. T. Gandhe
Principal

Date:
Place:

Acknowledgement

Firstly, I express my gratitude to our Director, Dr. P. T. Kulkarni, and Principal, Dr. S. T.
Gandhe, for allowing me to deliver this seminar. I am also deeply grateful to the Head of
Department, Dr. A. S. Ghotkar, for their unwavering support and guidance.
I would also like to acknowledge the contributions of the professors and faculty members, who
provided the resources and help to make this seminar a success. I would like to thank my
seminar guide Dr. Archana Ghotkar for guiding me in the process right from choosing the
seminar topic to the tips and tricks of delivering a good presentation. Next, I thank Prof. R.
A. Karnavat for helping me create a PowerPoint presentation that meets the standards of the
seminar.
Lastly, I extend my gratitude to all my peers and friends who helped me with this report and were always present to clear up any doubts. Your enthusiasm and curiosity truly made this a wonderful experience for me. Once again, I extend my sincere thanks to everyone who played a part in making this seminar a success.

Name : Soham Phatak


Exam Seat No: T1900508688

Abstract

The integration of Natural Language Processing (NLP) and Image Processing presents a combination that enhances the capabilities of multi-modal applications. While NLP involves the computational analysis and generation of human language, Image Processing involves manipulating and analyzing visual data. Their combination can create advanced systems for tasks such as image captioning, visual question answering, and text-based image retrieval. Despite significant advancements in NLP and Image Processing individually, integrating these domains remains challenging due to the distinct nature of visual and textual data. We explore this integration by providing an in-depth literature review, examining key use cases, technological frameworks, and tools essential for combining NLP and Image Processing, and discussing some real-world applications.

Keywords: Natural Language Processing (NLP), Image Processing, Multi-modal Applications, Real-world Applications.

Contents

Certificate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Literature Survey 3

3 Methodologies 4
3.1 Framework/Basic Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.2 Different Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.3 State-of-the-art Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

4 Applications 11
4.1 State-of-the-art Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.1 Image Captioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.2 Visual Question Answering . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.3 CAD In Oncology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5 Conclusion and Future Scope 19

List of Figures

3.1 Basic Architecture of Multimodal Models [7] . . . . . . . . . . . . . . . . . . 4


3.2 Attention Maps of some images [8] . . . . . . . . . . . . . . . . . . . . . . . . 5
3.3 Architecture of the CLIP Model [9] . . . . . . . . . . . . . . . . . . . . . . . 7
3.4 Architecture of the SBVQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.5 Architecture of the ViLT Model [10] . . . . . . . . . . . . . . . . . . . . . . . 8
3.6 Architecture of the UNITER Model [11] . . . . . . . . . . . . . . . . . . . . . 9

4.1 Image Captioning [13] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11


4.2 Visual Question Answering [12] . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.3 CAD in Oncology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

List of Tables

3.1 Comparison of some Multimodal Algorithms . . . . . . . . . . . . . . . . . . 9

4.1 Comparison of Image Captioning Algorithms . . . . . . . . . . . . . . . . . . 12


4.2 Components of a VQA System . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Models Combining NLP and Image Processing . . . . . . . . . . . . . . . . . 16

Abbreviations

NLP : Natural Language Processing

VQA : Visual Question Answering

CNN : Convolutional Neural Network

BERT : Bidirectional Encoder Representations from Transformers

GPT : Generative Pre-trained Transformer

CAD : Computer-Aided Diagnosis

CLIP : Contrastive Language-Image Pretraining

SBVQA : Speech-Based Visual Question Answering

ViLT : Vision-and-Language Transformer

UNITER : Universal Image-Text Representation


1. Introduction

1.1 Introduction
Natural Language Processing and Image Processing have emerged as technologies at the forefront of the multimodal space and have led to significant developments. NLP deals with the analysis and understanding of human language, the generation of text, and the interpretation of textual data [3], while Image Processing deals with visual information and its manipulation. Each of these two areas has succeeded spectacularly on its own, but combining them is far more complicated because of the differences between textual and visual information.

Examples of multimodal systems that integrate aspects of both NLP and Image Processing include image captioning [2], visual question answering [5], and retrieval of images from text descriptions. The merit of these systems is that they can interpret and generate far more substantial, contextual responses by drawing on insights from both text and images. However, this merging is not a trivial task: the two fields represent their data in very different modalities, and their interaction demands effective communication between them.

This report therefore focuses on the current state of this integration through a literature review, a survey of key technological frameworks and tools, and real-world use cases. The opportunities and challenges of integrating NLP and Image Processing are presented in depth, along with how the two fields can come together to create more intelligent and versatile applications.

1.2 Motivation
The integration of NLP and Image Processing can help revolutionize a very wide range of applications, offering powerful capabilities and new opportunities in both research and industry. Examples of advanced applications enabled by this integration include image captioning, where machines describe visual content in natural language, and visual question answering, which allows users to query images with text-based questions. Such multimodal applications have significant real-world impact, changing sectors from healthcare, where automated image analysis enhances diagnostics, to media, where smarter content management and access tools are emerging [1].


Bridging visual and textual data in an intelligent way is in high demand in industries whose operations increasingly depend on both kinds of data. With these points in mind, this study sheds light on different applications of the integration of NLP and Image Processing.

1.3 Objectives
1. Review the current advancements in Natural Language Processing (NLP) and Image Processing individually and assess their capabilities when combined.

2. Identify and evaluate key use cases where the integration of NLP and Image Processing has shown significant impact, such as image captioning, visual question answering, and text-based image retrieval.

3. Provide an overview of the technological frameworks and tools necessary for successfully combining NLP and Image Processing in real-world applications.

1.4 Scope
This survey covers methodologies and frameworks that enable the blending of NLP and Image Processing, as in [4], [5], [2]. It presents an analysis of the literature, technological tools, and approaches used in integrating these two fields to enhance multi-modal systems. The aim is to review relevant work and explore the various techniques used to combine textual and visual data, providing a holistic understanding of the issues, progress, and future opportunities in this interdisciplinary field.


2. Literature Survey
The merging of NLP and Image Processing has become an important area of research, most visibly in multimodal applications. Emphasizing the need for robustness in remote sensing image captioning, Ricci et al. argue that individual captioning algorithms tend to generalize poorly, especially when trained on limited datasets. To achieve robustness, the authors propose an ensemble approach that combines the outputs of a group of models, using advanced NLP methods such as BERT and CLIP to integrate textual and visual data and improve performance in challenging environments [2].
Alasmary et al. explore the merging of speech recognition with VQA. The authors present an end-to-end framework that unifies the two modalities, permitting more intuitive user interactions with open-ended questions. Their review identifies key limitations of conventional VQA systems that rely on textual input, which restrict their use in dynamic environments [5]. Khurana et al. give an overview of how NLP has developed, dividing it into two broad categories: natural language understanding and natural language generation. They discuss applications in machine translation and sentiment analysis while commenting on the ethics of AI breakthroughs, underlining the continuous innovation needed to tackle current issues in NLP [3].
Malashin et al. focus on extracting valuable information from medical reports using a blended approach that combines Optical Character Recognition with NLP techniques such as Named Entity Recognition. Their work can help improve the efficiency of text extraction in healthcare settings, where challenges in data quality and model interpretation remain [1]. Finally, Li et al. examine the application of NLP to computer-aided diagnosis in oncology. They systematically review the literature on electronic health records, identify trends and challenges in applying NLP techniques to extract insights from clinical texts, and highlight the need for collaboration between AI experts and clinical practitioners to design better tools in support of oncology [4].
Taken together, these works highlight the transformative potential of integrating NLP with Image Processing across domains such as healthcare, remote sensing, and user interaction systems. Further advances in this interdisciplinary field promise to increase both the robustness and the effectiveness of multimodal applications in real-world scenarios.


3. Methodologies
This chapter focuses on the methodologies explored in the study of integrated multimodal applications of NLP and Image Processing. The state-of-the-art algorithms and recent advances are covered here.

3.1 Framework/Basic Architecture


In multimodal systems that integrate NLP (text input) and Image Processing (image input), the general structure typically consists of several core layers designed to process each type of data. The framework of such a system mainly consists of:

Figure 3.1: Basic Architecture of Multimodal Models [7]

1. Input Modality Layers: Features from the two modalities (video/image and text) are extracted individually using distinct fully connected layers.

(a) Video Features: Derived from facial landmarks and orientation in the reference architecture [7]. These features are produced by models such as CNNs, which decompose input images into key components and thus capture objects, spatial relationships, and fine-grained visual details.

(b) Text Features: Derived from the text representation of speech transcripts, or more generally from natural language input, using models such as BERT or GPT. Text encoders transform unstructured textual data into semantic vectors that can then be combined with the image features for analysis.


2. Merge Layer: Combines the outputs of the image and text encoders into a unified representation, typically by concatenation. Techniques such as attention mechanisms ensure that textual and visual inputs are aligned and guide the final predictions.

3. Fully Connected Layer: The fully connected layer forms the central submodule in which the independent feature representations from both modalities are combined. It takes the concatenated output of the merge layer and applies multiple linear transformations, thereby effectively correlating the features.

4. Scaling Module: Scales the output predictions toward the actual magnitudes of the labels by applying scaling techniques such as Decimal Scaling, Min-Max Normalization, and Standard Deviation Scaling [7].
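To make this layered structure concrete, the following is a minimal PyTorch sketch of such a fusion model: per-modality input layers, a concatenation-based merge layer, fully connected layers, and a min-max scaling step. The feature dimensions and the assumption that image and text features have already been extracted (for example by a CNN and BERT) are illustrative choices, not details taken from [7].

```python
import torch
import torch.nn as nn

class SimpleMultimodalFusion(nn.Module):
    """Minimal sketch of the layered structure above: per-modality input
    layers, a merge (concatenation) layer, fully connected layers, and a
    scaling step. All sizes are arbitrary placeholder values."""

    def __init__(self, img_feat_dim=2048, txt_feat_dim=768, hidden_dim=512, num_outputs=1):
        super().__init__()
        # Input modality layers: one projection per modality
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)   # e.g. pooled CNN features
        self.txt_proj = nn.Linear(txt_feat_dim, hidden_dim)   # e.g. BERT [CLS] embedding
        # Fully connected layers applied to the merged representation
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_outputs),
        )

    def forward(self, img_feats, txt_feats):
        # Merge layer: concatenate the two projected modalities
        merged = torch.cat([self.img_proj(img_feats), self.txt_proj(txt_feats)], dim=-1)
        out = self.fc(merged)
        # Scaling module: min-max scaling of the raw predictions to [0, 1]
        out_min, out_max = out.min(), out.max()
        return (out - out_min) / (out_max - out_min + 1e-8)

# Dummy usage with random pre-extracted features for a batch of 4 samples
model = SimpleMultimodalFusion()
scores = model(torch.randn(4, 2048), torch.randn(4, 768))
print(scores.shape)  # torch.Size([4, 1])
```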

3.2 Different Approaches


Different approaches have been used to combine NLP and Image Processing for specific tasks in various domains. The approaches presented below cover attention mechanisms, ensemble models, and particular fusion strategies in tasks such as image captioning and VQA.

1. Attention Mechanisms: Attention-based systems focus on certain parts of the image or text during decision-making. For example, the image captioning model Show, Attend and Tell uses attention to align the words of the description with the appropriate regions of an image (a minimal sketch of such cross-modal attention is given after this list).

Figure 3.2: Attention Maps of some images [8]


2. Ensemble Methods: Another approach in multimodal systems is to combine different models, each handling a specific task. NLP-based fusion strategies improve robustness because the outputs of several models are integrated so that the best text-image match or the most accurate caption is chosen for the image [2].

3. NLP in Healthcare Diagnostics: NLP enables the retrieval and processing of unstructured data from clinical reports within a medical application. Integrated systems combine image processing with textual descriptions, for example in CAD for oncology, where image-text integration helps identify cancerous lesions in medical imaging [1].

4. VQA with Speech Input: Speech-based VQA adds another modality by allowing spoken questions. A model such as SBVQA 2.0 answers open-ended visual questions in real time using speech and images. Such a model must be robust to noise and speaker variability, which improves its practicality in real-world applications [5].
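The following sketch illustrates the cross-modal attention idea from approach 1 above: an encoded question vector attends over a set of image-region features using scaled dot-product attention. The dimensions and the random inputs are placeholders for whatever encoders a real system would provide.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(question_vec, region_feats):
    """Scaled dot-product attention of a text query over image regions.

    question_vec: (batch, dim)          -- encoded question
    region_feats: (batch, regions, dim) -- per-region image features
    Returns the attended image vector and the attention weights.
    """
    d = question_vec.size(-1)
    # Similarity between the question and every image region
    scores = torch.bmm(region_feats, question_vec.unsqueeze(-1)).squeeze(-1) / d ** 0.5
    weights = F.softmax(scores, dim=-1)                                  # (batch, regions)
    attended = torch.bmm(weights.unsqueeze(1), region_feats).squeeze(1)  # (batch, dim)
    return attended, weights

# Toy example: one question attending over 5 image regions of dimension 512
q = torch.randn(1, 512)
regions = torch.randn(1, 5, 512)
context, attn = cross_modal_attention(q, regions)
print(attn)  # weights over the 5 regions, summing to 1
```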

3.3 State-of-the-art Algorithms


In the domain of multimodal applications of NLP and Image Processing, many sophisticated algorithms have been developed, each designed to integrate visual and textual data as tightly as possible. Four algorithms used in integrated applications of NLP and Image Processing are described below: CLIP, SBVQA, ViLT, and UNITER.

1. Contrastive Language-Image Pretraining (CLIP):


CLIP is a versatile multimodal model that aligns images and text in a shared embedding space. With little or no extra training beyond pre-training, it is designed to generalize across many tasks. The model is pre-trained on a large-scale dataset of image-text pairs and can be adapted to tasks such as image captioning, VQA, and text-based image retrieval [2].

Strengths:

(a) Zero-shot learning ability: it can perform tasks it has never been explicitly trained on.

(b) The very large scale of pre-training makes it highly versatile across a wide range of multimodal tasks.


Figure 3.3: Architecture of the CLIP Model [9]

Applications:

(a) Image captioning: appropriate captions are selected or generated by aligning image features with textual data using CLIP.

(b) Text-based image retrieval: images can be queried with text, and CLIP retrieves the corresponding images. (A short usage sketch with CLIP is given after this list of algorithms.)

2. Speech-Based Visual Question Answering (SBVQA):


SBVQA 2.0 pushes the state of the art in VQA by introducing a new modality: speech. It listens to spoken questions about images and returns textual answers based on the visual and spoken input. Speech encoders such as Conformer-L or WavLM are integrated with an image encoder in a framework that is robust to noise and speaker variability.

Figure 3.4: Architecture of the SBVQA

Strengths:

(a) Resilience to speaker variability and background noise, making it suitable for real-world applications.

(b) Accessibility for visually impaired users, who can ask questions about their surroundings.

Applications:


(a) Open-ended spoken question answering about images (VQA with speech input).

(b) Real-time question answering in mobile apps.

3. Vision-and-Language Transformer (ViLT):


ViLT is a transformer-based architecture built to process visual and textual data jointly. Unlike most other models, such as CLIP, which encode images and text separately and then fuse their representations, ViLT directly processes concatenated image patches and text tokens with a single transformer.

Figure 3.5: Architecture of the ViLT Model [10]

Strengths:

(a) An end-to-end transformer architecture that simplifies training and inference.

(b) Image and text data are integrated without requiring a dedicated image encoder such as a CNN.

Applications:

(a) Text-to-image retrieval and generation.

(b) VQA, where the image and the question are processed jointly.

4. Universal Image-Text Representation (UNITER):


UNITER is another multimodal model, designed to learn universal image-text representations. It uses pre-training to learn aligned representations of image regions and text tokens. Through multiple objectives, such as masked language modeling, image-text matching, and word-region alignment, UNITER can be applied to a wide range of multimodal tasks.

Strengths:


Figure 3.6: Architecture of the UNITER Model [11]

(a) Excellent fine-grained alignment of image regions to text tokens.

(b) Because it is pre-trained on large datasets, the model generalizes well to other multimodal tasks.

Applications:

(a) Image-text matching and retrieval.

(b) Fine-grained VQA and image captioning.
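As a usage illustration for CLIP (the first algorithm above), the sketch below scores an image against a few candidate captions in zero-shot fashion. It assumes the Hugging Face transformers implementation and the publicly released openai/clip-vit-base-patch32 checkpoint; the image path and the candidate captions are placeholders.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load a publicly released CLIP checkpoint (Hugging Face 'transformers' implementation)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")          # any local image (hypothetical path)
candidate_captions = [
    "a photo of a dog playing in a park",
    "a satellite image of a river delta",
    "a chest X-ray of a patient",
]

inputs = processor(text=candidate_captions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each caption in the shared embedding space
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(candidate_captions, probs[0]):
    print(f"{p.item():.3f}  {caption}")
```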

3.4 Discussion

Algorithm | Architecture | Applications | Strengths | Drawbacks
CLIP | Separate image and text transformers with contrastive learning | Zero-shot learning, image captioning, text-based image retrieval | Generalization to unseen tasks, versatile | High computational cost, requires large datasets
SBVQA 2.0 | Speech, image, and fusion modules | Spoken VQA, real-time question answering | Robust to speaker variability and noise | Increased complexity from speech processing
ViLT | Single transformer for concatenated image-text data | VQA, text-to-image retrieval | Lightweight, end-to-end transformer architecture | Less accurate for complex visual tasks
UNITER | Region-based image encoder with a BERT-style text encoder | Fine-grained VQA, image-text matching | Effective fine-grained alignment | Requires pre-trained object detectors, slower than other models

Table 3.1: Comparison of some Multimodal Algorithms

CLIP distinguishes itself in zero-shot learning, so it generalizes well to tasks it has never been trained on. However, it relies on very large amounts of pre-training data and fairly complex transformer encoders, which makes it computationally expensive.

SBVQA accepts speech input alongside visual input, which broadens real-time access to and use of a VQA system. The advantage is greatest where spoken language is easier to use than text; the natural consequence is that the extra complexity of processing audio input increases the computational cost.

ViLT simplifies the pipeline by using one transformer for both visual and textual data, making it lightweight and efficient, though it can lose accuracy on complex visual tasks because it has no dedicated image encoder.

UNITER is very good at fine-grained tasks such as relating regions in an image to a textual description. On the other hand, the model is more specialized and does not generalize as well as models like CLIP.


4. Applications
4.1 State-of-the-art Applications

4.1.1 Image Captioning


Image captioning [2] is the process of creating a textual description of an image by integrating NLP with image processing techniques. In other words, it bridges the gap between visual data (images) and text, enabling machines to process and describe visual content. Typically, these techniques use CNNs for visual analysis while text generation is handled by an RNN or a Transformer. Image captioning involves two major stages:

1. Feature Extraction: A CNN processes the image and extracts meaningful features from it.

2. Text Generation: An RNN or Transformer uses those features to generate a relevant caption word by word, predicting each next word conditioned on the image features and the words generated so far.

Figure 4.1: Image Captioning [13]

Techniques and Algorithms Used: The primary algorithmic techniques used in image
captioning are:

1. CNN + RNN: A CNN such as VGG-16 or ResNet extracts features from the image, followed by an RNN-based architecture (LSTM or GRU) that generates the caption. (A minimal sketch of this pipeline is given after Table 4.1.)

2. Attention Mechanisms: Attention allows the model to choose where to focus in the image at each time step while generating each word, improving the relevance of the captions.

3. Transformers: Current architectures replace RNNs with Transformers, and the Vision Transformer (ViT) gives even better performance by modelling global relationships within an image.

Ensemble Methods and Sophisticated Models: Blending different captioning models improves robustness in image captioning applications where data is scarce, for example in remote sensing. Better captions are obtained by combining several captioning models using strategies such as CLIP-coherence and VAE fusion.
Algorithm | Feature Extraction | Text Generation
CNN + RNN | CNN (VGG/ResNet) | LSTM/GRU
CNN + Attention | CNN | Attention-based LSTM
Transformer-based | Vision Transformer (ViT) | Transformer Decoder

Table 4.1: Comparison of Image Captioning Algorithms
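A minimal sketch of the CNN + RNN pipeline from Table 4.1 is shown below. The vocabulary size, embedding dimensions, and the choice of ResNet-50 with an LSTM decoder are illustrative assumptions; the training loop and beam-search decoding are omitted.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    """Skeleton of the CNN + RNN captioning pipeline in Table 4.1:
    a ResNet encoder produces image features, an LSTM decodes a caption.
    Vocabulary size and dimensions are placeholder values."""

    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        resnet = models.resnet50(weights=None)            # pretrained weights optional
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier
        self.img_to_embed = nn.Linear(2048, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # 1) Feature extraction: each image becomes one feature vector
        feats = self.encoder(images).flatten(1)            # (batch, 2048)
        img_token = self.img_to_embed(feats).unsqueeze(1)  # treated as the first "word"
        # 2) Text generation: predict each next word given image + previous words
        words = self.word_embed(captions[:, :-1])
        hidden, _ = self.lstm(torch.cat([img_token, words], dim=1))
        return self.out(hidden)                            # logits over the vocabulary

model = CaptionModel()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # (2, 12, 10000): one word distribution per decoding step
```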

• Advantages of Image Captioning

1. Automation: It automates the description of images, greatly reducing the manual effort needed.

2. Multimodal Learning: It advances models that bring together visual and linguistic data, enhancing machine learning systems.

3. Accessibility: It enables visually impaired individuals to access text descriptions of images.

4. Scalability: It can process huge amounts of visual data in applications such as social media, surveillance, and healthcare.

• Drawbacks of Image Captioning

1. Data Dependence: It requires large annotated datasets for proper training, which are scarce in domains such as healthcare or remote sensing.

2. Generalization Difficulty: Performance degrades when models are applied outside their training domain, particularly on out-of-distribution data.

3. Computational Intensiveness: The approach is computationally demanding both for training and for real-time inference.


4.1.2 Visual Question Answering


Visual Question Answering (VQA) [5] is the task of applying natural language processing and image processing to answer a question about visual content. It combines image features with natural language (or spoken) queries to produce relevant answers. This kind of multimodal interaction has many real-life applications and exploits recent developments in both NLP and computer vision.

Figure 4.2: Visual Question Answering [12]

The typical VQA system encompasses the following four main parts:

1. Image Encoder: The image encoder extracts the salient features of a given input image, such as the objects present, their positions, and the relations between them. Common architectures for encoding images are CNNs such as ResNet and, more recently, Vision Transformers (ViTs). These models encode the image as a dense representation that captures the objects and the context of the scene. The quality of the image encoder determines how well the system can "understand" the image and recognize the visual cues needed to answer questions.

2. Question Encoder: This module converts the question, in text or speech form, into a vector representation of what it conveys semantically. Transformer models (e.g., BERT, GPT) are used for text-based questions, while for speech-based questions models such as Wav2Vec or ASR systems first convert the speech to text before encoding. Understanding the question is crucial, since it guides the system on where to concentrate in the image and which features to extract.

3. Feature Fusion Module: This module merges the encoded image and question features into a coherent representation, enabling the model to map the relevant parts of the image onto the context of the question. Multi-head attention (as used in transformers), simple concatenation, or element-wise multiplication is commonly used for this fusion step. Cross-modal understanding takes place here, so the system can use the information gathered from the visual data to answer the given question correctly.

4. Answer Generator: Predicts the answer from the fused question and image features. It is normally a feed-forward neural network or classification model that selects the most appropriate answer from a set of candidates; sometimes fully connected layers are used, and sometimes transformer-based networks operate on the fused feature representation to classify or generate the answer.

Component | Description
Image Encoder | Converts images into feature vectors using CNNs or transformers.
Question Encoder | Transforms text/speech into semantic feature vectors using NLP models.
Feature Fusion Module | Combines image and question features, often using attention mechanisms.
Answer Generator | Uses fused features to predict the final answer using a classifier or transformer.

Table 4.2: Components of a VQA System
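The sketch below ties the components of Table 4.2 together in the simplest possible way: pre-extracted image and question features are projected into a common space, fused by element-wise multiplication, and passed to a classifier over candidate answers. All dimensions and the candidate-answer vocabulary size are placeholder assumptions.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Sketch of the four VQA components in Table 4.2, assuming image and
    question features have already been produced by pretrained encoders
    (e.g. a CNN and BERT). Only the fusion and answer steps are shown."""

    def __init__(self, img_dim=2048, q_dim=768, hidden=1024, num_answers=3000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)    # image encoder output -> common space
        self.q_proj = nn.Linear(q_dim, hidden)        # question encoder output -> common space
        self.answer_head = nn.Sequential(             # answer generator (classifier)
            nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, img_feats, q_feats):
        # Feature fusion module: element-wise multiplication of the two modalities
        fused = self.img_proj(img_feats) * self.q_proj(q_feats)
        return self.answer_head(fused)                # logits over candidate answers

model = SimpleVQA()
logits = model(torch.randn(8, 2048), torch.randn(8, 768))
answer_ids = logits.argmax(dim=-1)                    # most likely answer index per sample
print(answer_ids.shape)  # torch.Size([8])
```

Element-wise multiplication is only one of the fusion options mentioned above; attention-based fusion, as in the earlier sketch, is typically more accurate at higher cost.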

• Advantages of VQA

1. VQA's ability to interpret both visual and linguistic inputs allows more versatile and dynamic responses.

2. Many VQA systems can work in real time and can therefore be applied in areas such as customer service or assistive technologies.

• Drawbacks of VQA

1. The processing involved is complex and costly, since VQA demands parallel processing of both text and images.

2. Like other machine learning systems, VQA models tend to inherit biases from their training datasets and can produce incorrect or biased answers, especially where categories of images are ambiguous or under-represented.

3. If the question is given as speech, performance can degrade due to background noise, speaker variability, and similar factors.


4.1.3 CAD In Oncology


Computer-aided diagnosis in oncology combines NLP and Image Processing to provide a powerful approach for enhancing diagnostic capabilities in cancer detection and treatment [4]. Medical images, such as CT scans, MRI, and pathology slides, are essential for cancer diagnosis, while textual data such as EHRs and EMRs carry equally important information. NLP extracts key insights from these text sources, complementing image-based analysis for a more comprehensive diagnostic process.
The integration works by correlating clinical information from text-based data (via NLP) with visual data (via image processing). In other words:

1. Data Extraction: NLP extracts key information from patient clinical notes, EHRs, and radiology reports to identify risk factors, symptoms, and relevant medical history.

2. Image Analysis: CT scans, MRIs, or pathology slides are analyzed with image processing techniques to identify abnormal findings such as tumours, cysts, or lesions.

3. Integrated Decision Support: The outputs of NLP and image processing are combined in one system, so the oncologist has more detail available for diagnosis and treatment planning.

Figure 4.3: CAD in Oncology

Here are some models that integrate NLP with Image Processing for CAD in oncology:


1. Holistic Multi-Modal Systems: These networks combine data from medical reports (via NLP) with imaging data from modalities such as MRI or CT, taking advantage of CNNs for the image data and transformers for the text data.

2. Medical Vision-Language Models: These models use transformer architectures for both images and text, with Vision Transformers (ViT) for the image data and a BERT-like model for the text. A shared attention mechanism allows interaction between image features and textual information.

3. Convolutional Neural Networks (CNN) + NLP Pipelines: CNNs are applied to the medical images and BERT to the textual information; the output features are concatenated and fed to a dense layer for final classification.

4. Attention Mechanism Models: These models apply attention mechanisms across both image and text: a CNN works on the images, BERT works on the text, and the attention mechanism concentrates on the relevant sections of both.

Model Name | Description | Data Type | Use Cases
Holistic Multi-Modal Systems | Combines NLP and CNN for joint diagnosis. | Clinical text, MRI/CT scans | Breast, lung cancer
Medical Vision-Language Models | Uses Vision Transformers and BERT with attention mechanism. | Radiology images, EHR | Lung cancer staging
CNN + NLP Pipelines | CNN for images, BERT for text, merged outputs. | Pathology images, notes | Colorectal cancer diagnosis
Hybrid Ensemble Models | CNN for images, LSTM for text, combined predictions. | Clinical records, scans | Brain, prostate cancer
Attention Mechanism Models | Attention on CNN and BERT for joint analysis. | Pathology reports, X-rays | Prostate cancer staging
RNN + Image Processing | CNN for images, RNN for sequential text. | CT/MRI scans, timelines | Pancreatic cancer growth

Table 4.3: Models Combining NLP and Image Processing
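As an illustration of the "CNN + NLP Pipelines" row of Table 4.3, the sketch below concatenates ResNet image features with BERT features extracted from report text and feeds them to a small classifier. It is purely illustrative: the checkpoint names, dimensions, and dummy report sentences are assumptions, not a validated clinical model.

```python
import torch
import torch.nn as nn
import torchvision.models as models
from transformers import AutoModel, AutoTokenizer

class ImageTextCAD(nn.Module):
    """Sketch of the 'CNN + NLP Pipelines' row of Table 4.3: a CNN encodes
    the scan, BERT encodes the report text, and the concatenated features
    feed a small classifier. Illustrative only, not a clinical system."""

    def __init__(self, num_classes=2):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])      # image -> 512-d vector
        self.bert = AutoModel.from_pretrained("bert-base-uncased")   # text -> 768-d vector
        self.classifier = nn.Linear(512 + 768, num_classes)

    def forward(self, images, input_ids, attention_mask):
        img_feats = self.cnn(images).flatten(1)                              # (batch, 512)
        txt_feats = self.bert(input_ids=input_ids,
                              attention_mask=attention_mask).pooler_output   # (batch, 768)
        return self.classifier(torch.cat([img_feats, txt_feats], dim=-1))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
notes = ["irregular mass in the left upper lobe", "no abnormality detected"]  # dummy report text
tokens = tokenizer(notes, padding=True, return_tensors="pt")
model = ImageTextCAD()
logits = model(torch.randn(2, 3, 224, 224), tokens["input_ids"], tokens["attention_mask"])
print(logits.shape)  # torch.Size([2, 2])
```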

• Advantages

1. Improved Diagnosis: Cancer diagnostic accuracy can be improved by combining image processing with NLP-derived insights, producing a more holistic picture.

2. Clinical Workflow Efficiency: NLP saves physicians time by automating information extraction from EHRs.


3. Accessibility: It enables visually impaired individuals to access text descriptions of images.

4. Personalized Treatment: Oncologists can predict treatment outcomes and recurrence risks using NLP-derived patterns from previous patient records.

• Drawbacks

1. Data Privacy: Managing large volumes of textual and visual patient data raises serious privacy issues.

2. Data Integration Challenges: Integrating structured data from medical images with unstructured data from texts makes the algorithms highly complex, and both data sources must be of high quality.

3. Model Interpretability: Deep learning models used in image processing are often regarded as black boxes; the interpretability problem is critical in a field like medicine.

4.2 Challenges
Challenges faced by the researchers in the above-mentioned applications are listed below:

1. Image Captioning [2]:

• Dataset Limitations: Public datasets are rare and imbalanced, especially for a given type of cancer.

• Data Homogeneity: Most datasets also come from a single site, which leaves the models with little ability to generalize to new data or situations.

• Use of Black-Box Models: The black-box nature of deep learning models makes them hard to interpret, which breeds mistrust among clinicians who cannot easily integrate these models into their medical workflows.

• Generalization Capability: Algorithms are generally unable to generalize across different types of images or scenarios, especially if they have been trained on a small or domain-specific dataset. This leads to undesirable behaviour on out-of-distribution samples.

2. Visual Question Answering [5]:

• Complexity of Remote Sensing Scenes: Natural images are already highly variable and contain many entities that are not easily identifiable objects; remote sensing images are often even more complex, with fewer determinable objects, which makes it harder for models to generate accurate and meaningful captions or answers.

• Robustness to Background Noise in Recording Environments: The system must be robust against background noise, which can greatly affect the accuracy of speech recognition in real-world applications.

• Handling Language Priors: Language priors can skew model responses, so the model has to rely on both the visual and the linguistic context to answer correctly.

3. CAD in Oncology [4]:

• Privacy Issues: Restrictive legal regulations on the handling of patient data limit the larger datasets that could be shared between hospitals for model development without the risk of breaching privacy.

• Performance Limitations of Single-Model Systems: Relying on a single captioning model is risky in challenging real-world scenarios. The authors address this by adopting ensembling methods; however, developing and tuning such strategies raises its own set of problems.


5. Conclusion and Future Scope


NLP and image processing are core domains in several multimodal applications and open up vast possibilities for transformation across different fields. Both technologies have matured and witnessed tremendous growth, yet challenges persist in areas such as generalization and data scarcity, and the complexity of images presents further hurdles that must be addressed for broader and more effective applications.

Despite these challenges, the ensemble approaches proposed in the literature give promising results in enhancing the robustness of these models. As research advances, integrated multimodal solutions are likely to play increasingly important roles in domains such as healthcare, surveillance, and e-commerce, where they can yield more accurate and reliable results.


Bibliography

[1] Malashin, I., Masich, I., Tynchenko, V., Gantimurov, A., Nelyub, V., and Borodulin, A. (2024). Image Text Extraction and Natural Language Processing of Unstructured Data from Medical Reports. Machine Learning and Knowledge Extraction, 6(2), 1361-1377.

[2] Ricci, R., Melgani, F., Junior, J. M., and Gonçalves, W. N. (2024). NLP-Based Fusion
Approach to Robust Image Captioning. IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing.

[3] Khurana, D., Koli, A., Khatter, K., and Singh, S. (2023). Natural language processing:
state of the art, current trends and challenges. Multimedia tools and applications, 82(3),
3713-3744.

[4] Li, C., Zhang, Y., Weng, Y., Wang, B., and Li, Z. (2023). Natural language processing
applications for computer-aided diagnosis in oncology. Diagnostics, 13(2), 286.

[5] Alasmary, F., and Al-Ahmadi, S. (2023). SBVQA 2.0: Robust End-to-End Speech-Based
Visual Question Answering for Open-Ended Questions. IEEE Access.

[6] da Silva Tavares, J. M. R. (2010). Image processing and analysis: applications and trends.
In AES-ATEMA’2010 Fifth International Conference.

[7] Ortega, J. D., Senoussaoui, M., Granger, E., Pedersoli, M., Cardinal, P., and Koerich, A. L.
(2019). Multimodal fusion with deep neural networks for audio-video emotion recognition.
arXiv preprint arXiv:1907.03196.

[8] Lu, J., Xiong, C., Parikh, D., and Socher, R. (2016). Knowing When to Look: Adaptive
Attention via a Visual Sentinel for Image Captioning. 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 3242-3250.

[9] Jabbar, M. S., Shin, J., and Cho, J. D. (2022). Ai ekphrasis: Multi-modal learning with
foundation models for fine-grained poetry retrieval. Electronics, 11(8), 1275.

[10] Ding, Y., Liu, M., and Luo, X. (2022). Safety compliance checking of construction behaviours using visual question answering. Automation in Construction, 144, 104580.


[11] Chen, Y. C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., and Liu, J. (2020, August).
Uniter: Universal image-text representation learning. In European conference on computer
vision (pp. 104-120). Cham: Springer International Publishing.

[12] K. Ganesh. (n.d.). Visual Question Answering Flask Application. GitHub. https://github.com/ckmganesh/Visual-Question-Answering-Flask-Application?tab=readme-ov-file

[13] Gan, Z., Li, C., Chen, C., Pu, Y., Su, Q., Carin, L. (2016). Scalable bayesian learning of
recurrent neural networks for language modeling. arXiv preprint arXiv:1611.08034.

PICT, Pune 21 Dept. of Information Technology


Document Information

Analyzed document 33261_Seminar_Report.pdf (D198891122)

Submitted 2024-10-11 10:19:00 UTC+02:00

Submitted by Aaghotkar

Submitter email [email protected]

Similarity 1%

Analysis address [email protected]

Sources included in the report

URL: https://spectrum.library.concordia.ca/id/eprint/992098/1/MohammadKhalid_MCompSc_S_2023.pdf
2
Fetched: 2024-10-11 10:20:00

URL: https://ojs.aaai.org/index.php/AAAI/article/download/28139/28281
1
Fetched: 2024-10-11 10:20:00

URL: https://dspace.lib.ntua.gr/xmlui/bitstream/handle/123456789/50134/Grounded%2520Visual%2520Question%2520Answering%2520Using%2520Sequence-
to-Sequence%2520Modeling.pdf?sequence=2 1
Fetched: 2024-10-11 10:20:00

Entire Document
Acknowledgement Firstly, I express my gratitude to our Director, Dr. P. T. Kulkarni, and Principal, Dr. S. T. Gandhe, for allowing me to deliver this seminar. I am also deeply grateful to
the Head of Department, Dr. A. S. Ghotkar, for their unwavering support and guidance. I would also like to acknowledge the contributions of the professors and faculty members,
who provided the resources and help to make this seminar a success. I would like to thank my seminar guide Dr. Archana Ghotkar for guiding me in the process right from choosing
the seminar topic to the tips and tricks of delivering a good presentation. Next, I thank Prof. R. A. Karnavat for helping me create a PowerPoint presentation that meets the standards
of the seminar. Lastly, I extend a hand of gratitude to all my peers and friends who helped me with this report and were always present to clear up any kinds of doubts. Your
enthusiasm and curiosity truly made this a wonderful experience for me. Once again, I extend my sincere thanks to everyone who played a part in making this seminar a success.
Name : Soham Phatak Exam Seat No: . ii
Abstract The integration of Natural Language Processing (NLP) and Image Processing presents a com- bination that enhances the capabilities of multi-modal applications. While NLP
involves the computational analysis and generation of human language, Image Processing involves manip- ulating and analyzing visual data. Their combination can create advanced
systems for tasks such as

100% MATCHING BLOCK 1/4

image captioning, visual question answering, and text-based image retrieval.

Despite significant advancements in NLP and Image Processing individually, integrating these domains remains challenging due to the distinct nature of visual and textual data. We
explore this in- tegration by providing an in-depth literature review, examining key use cases, technological frameworks, and tools essential for combining NLP and Image
Processing, and discussing some real-world applications. Keywords: Natural Language Processing (NLP), Image Processing, Multi-modal Applica- tions, Real-world Applications. iii
Seminar Report Multimodal Applications of NLP and Image Processing 1. Introduction 1.1 Introduction Primarily, Natural Language Processing and Image Processing, emerged as
technologies at the forefront of the multimodal space - have led to significant developments. NLP deals with the analysis and understanding of human language, the generation of
images, and the interpretation of textual data [3]. Image Processing deals more with visual information and its manipulation. Each of these two areas has exponentially and
spectacularly succeeded as separate areas but becomes far more complicated when dealt with together because of the difference in textual and visual information. Examples of
multimodal systems that integrate the aspects of both NLP and Image Pro- cessing include applications such as image captioning [2], visual question answering [5], or even retrieval
of images from text description. The merit of these systems is that they can in- terpret and generate much more substantial, contextual responses because of the insights from the
combination of both text and images. However, this merging is not trivially a task since the two represent either manner of their data modalities, and their interaction demands fluent
communication. Therefore more focus is placed on this integration’s current state, regarding literature re- view, key technological frameworks and tools, and real-world use cases.
The opportunities, as well as challenges of integrating NLP and Image Processing, are deeply presented in this work and how the two fields could come together to create even
more intelligent and versatile applications. 1.2 Motivation Integration of NLP and Image Processing can help revolutionize a very wide range of applica- tions, offering quite powerful
capabilities with new opportunities in both research and industry. In fact, some examples of such advanced applications enabled by the integration would include, image captioning,
where machines could describe visual content in natural language, and visual question answering, allowing users to query images with text-based questions. Such multi- modal
applications have direct significant real-world impacts, changing sectors from health- care, where automated image analysis enhances diagnostics, to media, where smarter content
management and access tools are emerging [1]. PICT, Pune 1 Dept. of Information Technology
Seminar Report Multimodal Applications of NLP and Image Processing Such bridging of visual and textual data with intelligent abatement of the gap between the two is in high
demand in industries where increasingly growing operations are based on both visual and textual data. Focusing on all the above-mentioned points this study sheds light on different
applications of the integration of NLP and Image processing. 1.3 Objectives 1. Review the current advancements in Natural Language Processing (NLP) and Image Pro- cessing
individually and assess their capabilities when combined. 2. Identify and evaluate key use cases where the integration of NLP and Image Processing has shown significant impact,
such as

100% MATCHING BLOCK 2/4

image captioning, visual question answering, and text- based image retrieval. 3.

Provide an overview of the technological frameworks and tools necessary for successfully combining NLP and Image Processing in real-world applications. 1.4 Scope This survey
covers study methodologies and frameworks that enable the blend of NLP and Image Processing as in [4], [5], [2]. This study therefore brings forth an analysis of literature,
technological tools, and approaches used in integrating these two fields to enhance multi-modal systems. The study aims to review relevant literature and explore various techniques
used in combining textual and visual data to provide a holistic understanding of the issues, progres- sions, and future opportunities in this interdisciplinary field. PICT, Pune 2 Dept. of
Information Technology
Seminar Report Multimodal Applications of NLP and Image Processing 2. Literature Survey NLP and Image Processing have, in this way, merged to find their place as an important
area of research, which emerges definitely with multimodal applications. Emphasizing the necessity for robustness in Remote sensing image captioning, Ricci et al. argue that
individual captioning algorithms tend to fail in generalization, especially if based on limited datasets. To achieve such robustness, the authors proposed an ensemble approach that
could bring together the outputs of a group of models by allowing the utilization of advanced NLP methods like BERT and CLIP to advance the integration of textual and visual data
to enhance performance in challenging environments. [2]. Alasmary et al. explore the merging of speech recognition with VQA. The authors pre- sented an end-to-end framework
that will unify both said modalities, thus permitting more intuitive user interactions with open-ended questions. With regard to this, the literature review that shall be referred to
herein identifies some key challenges in conventional VQA systems that rely on textual input further limiting their potential applications in dynamic environments. [5]. Khurana et al.
give an enabling overview of how NLP developed, pointing out that there were broad categories: one for natural language understanding and the other for generation. They allow
further readings on applications in machine translation and sentiment analysis while com- menting on ethics connected with AI breakthroughs. Their contribution aptly could be
charac- terized as underlining continuous innovative dynamism to tackle current issues in NLP. [3]. Tynchenko et al. approach focuses on valuable extraction information in medical
reports using a blended approach involving Optical Character Recognition combined with the usage of NLP techniques like Named Entity Recognition. Their paper can possibly help
to improve the efficiency in extracting texts in health settings whereby challenges in quality data and in- terpretation of the model might be portrayed. [1]. In conclusion, Wang et al.
Aexamine the application of NLP in computer-aided diagnosis in oncology. They systematically reviewed the relevant literature on electronic health records, identified trends and
challenges in applying NLP techniques to extract insights from clinical texts, and highlighted the necessity of collaboration between AI experts and clinical practitioners for better tool
design in support of oncology. [4]. In brief, these works together highlight the revolutionary possibility of integration of NLP with image processing across various domains such as
health care, remote sensing, and user interaction systems. Further advances in this interdisciplinary field promise to increase both the robustness and effectiveness of multimodal
applications in real-world scenarios. PICT, Pune 3 Dept. of Information Technology
Seminar Report Multimodal Applications of NLP and Image Processing 3. Methodologies This chapter focuses on the methodologies explored in the study of integrated multimodal
ap- plications of NLP and Image Processing. The state-of-the-art algorithms and recent advances are covered here. 3.1 Framework/Basic Architecture In multimodal systems that
integrate NLP (text input) and Image Processing (image input), the general structure typically consists of several core layers designed to process this type of data, mainly the
framework of this system consists of: Figure 3.1: Basic Architecture of Multimodal Models [7] 1. Input Modality Layers: Feature extraction from the two modalities: video and text are
extracted individually by using distinct fully connected layers. (a) Video Features: From facial landmarks and orientation. These features are based on models, like CNNs, that can
decompose input images to key components and thus discover objects, spatial relationships, and fine-grained visual details. (b) Text Features: From text representation of speech
transcripts. They are derived from natural language input using models like BERT, GPT, etc. In order to transform un- structured textual data into semantic vectors that can further be
combined with image features for the analysis, text encoders are used. PICT, Pune 4 Dept. of Information Technology
Seminar Report Multimodal Applications of NLP and Image Processing 2. Merge Layer: Outputs of the independent layers are concatenated into a unified representa- tion. Combines
outputs of image and text encoders into a unified representation. Techniques such as attention mechanisms ensure textual and visual inputs align and guide the final pre- dictions. 3.
Fully Connected Layer: The fully-connected layer forms the central submodule wherein independent feature representations from both modalities are combined. It takes the con-
catenated outputs of the merged layer and applies multiple linear transformations, thereby effectively correlating the features. 4. Scaling Module: Scale the output predictions toward
actual magnitudes of labels by ap- plying scaling techniques. Some of the scaling techniques are Decimal Scaling, Min-Max Normalization and Standard Deviation Scaling. [7] 3.2
3.2 Different Approaches

Different approaches have been used to combine NLP and Image Processing to solve specific tasks in various domains. The approaches presented below cover attention mechanisms, ensemble models, and particular fusion strategies in tasks such as image captioning and VQA.

1. Attention Mechanisms: Attention-based systems focus on selected parts of the image or text during decision-making. For example, the image captioning model Show, Attend, and Tell uses attention to align the words of the description with the appropriate regions of the image (a cross-attention sketch is given at the end of this list).

Figure 3.2: Attention Maps of some images [8]
2. Ensemble Methods: Another approach is to combine several models, each handling a specific task. NLP-based fusion strategies improve robustness because the outputs of different models are integrated so that the best text-image match or the most accurate caption is chosen for the image [2].

3. NLP in Healthcare Diagnostics: NLP enables the retrieval and processing of unstructured data from clinical reports within a medical application. Integrated systems combine image processing with textual descriptions for CAD applications in oncology, where image-text integration helps identify cancerous lesions in medical imaging [1].

4. VQA with Speech Input: Speech-based VQA introduces another modality by adding spoken questions to the system. A model such as SBVQA 2.0 answers open-ended visual questions in real time using speech and images. Such a model is robust to noise and speaker variability, improving its practicality in real-world applications [5].
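As a concrete illustration of the attention-based alignment mentioned in approach 1, the following is a minimal sketch of scaled dot-product cross-attention in PyTorch, in which decoder word states attend over CNN region features. The tensor shapes and dimensions are assumptions made for the example, not the exact configuration of Show, Attend, and Tell.

```python
import torch
import torch.nn.functional as F

def cross_attention(word_states, region_feats):
    """Scaled dot-product attention: each word state attends over image region features.

    word_states:  (B, T, D) - decoder hidden states, one per generated word
    region_feats: (B, R, D) - CNN features for R image regions
    Returns the attended context (B, T, D) and the attention map (B, T, R).
    """
    d = word_states.size(-1)
    scores = torch.matmul(word_states, region_feats.transpose(1, 2)) / d ** 0.5  # (B, T, R)
    attn = F.softmax(scores, dim=-1)             # each word distributes attention over regions
    context = torch.matmul(attn, region_feats)   # (B, T, D) weighted sum of region features
    return context, attn

# Toy usage: 49 regions (7x7 feature map), a 5-word caption, 256-dim features
words = torch.randn(1, 5, 256)
regions = torch.randn(1, 49, 256)
context, attn_map = cross_attention(words, regions)
print(attn_map.shape)  # torch.Size([1, 5, 49]) - one attention map per word, as in Figure 3.2
```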
3.3 State-of-the-art Algorithms

In the domain of multimodal applications of NLP and Image Processing, many sophisticated algorithms have been developed, each designed to integrate visual and textual data as tightly as possible. Four such algorithms are described below: CLIP, SBVQA, ViLT, and UNITER.

1. Contrastive Language-Image Pretraining (CLIP): CLIP is a versatile multimodal model that aligns images and text in a shared embedding space. The model is pre-trained on a large-scale dataset of image-text pairs and, with little or no extra training on top of this pre-training, can adapt to tasks such as image captioning, VQA, and text-based image retrieval [2] (a short usage sketch is given at the end of this section).
Strengths:
(a) Zero-shot learning ability - it can perform tasks it has never been trained on.
(b) The very large scale of pre-training makes it highly versatile across a wide range of multimodal tasks.
Figure 3.3: Architecture of the CLIP Model [9]
Applications:
(a) Image captioning - appropriate captions are generated by aligning image features with textual data.
(b) Text-based image retrieval - images can be queried with text, and CLIP retrieves the corresponding images.

2. Speech-Based Visual Question Answering (SBVQA): SBVQA 2.0 pushes the state of the art in VQA by introducing a new modality, namely speech. It listens to spoken questions about images and returns textual answers in response to the visual and spoken input. Speech encoders such as Conformer-L or WavLM are integrated with an image encoder in a framework that is robust to noise and speaker variability.

Figure 3.4: Architecture of the SBVQA Model
Strengths:
(a) Resilience to speaker variability and background noise, making it suitable for real-world applications.
(b) Accessibility for visually impaired users who ask questions about their surroundings.
Applications:
(a) Open-ended spoken question answering about images (VQA with speech input).
(b) Real-time question answering in mobile apps.

3. Vision-and-Language Transformer (ViLT): ViLT is a transformer-based architecture designed to integrate the processing of both visual and textual data.
Unlike the majority of other models, such as CLIP, which encode images and text separately and then fuse their representations, ViLT directly processes concatenated image patches and text tokens using a single transformer.

Figure 3.5: Architecture of the ViLT Model [10]
Strengths:
(a) End-to-end transformer architecture that simplifies training and inference.
(b) Image and text data are integrated without requiring a dedicated image encoder such as a CNN.
Applications:
(a) Text-to-image generation and retrieval.
(b) VQA, where the image and the question are processed jointly.

4. Universal Image-Text Representation (UNITER): UNITER is another class of multimodal model that learns universal image-text representations. It uses pre-training to learn aligned representations of image regions and text tokens. By employing multiple objectives such as masked language modelling, image-text matching, and word-region alignment, UNITER can be used to solve a wide range of multimodal tasks.

Figure 3.6: Architecture of the UNITER Model [11]
Strengths:
(a) Excellent fine-grained alignment of image regions to text tokens.
(b) Because it is pre-trained on large datasets, the model generalizes well to other multimodal tasks.
Applications:
(a) Image-text matching and retrieval.
(b) Fine-grained VQA and image captioning.
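As referenced in the CLIP description above, the following is a minimal sketch of zero-shot, text-based image matching with a publicly available CLIP checkpoint via the Hugging Face transformers library. The checkpoint name, the image file, and the candidate captions are assumptions for illustration only.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical example image and candidate captions
image = Image.open("street_scene.jpg")
captions = ["a dog crossing the road", "a crowded market", "an empty highway at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
best = probs.argmax(dim=-1).item()
print(f"Best matching caption: '{captions[best]}' (p={probs[0, best].item():.2f})")
```

Because no task-specific fine-tuning is involved, the same pattern covers zero-shot classification (captions as class names) and text-based image retrieval (scoring many images against one query).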
3.4 Discussion

Algorithm | Architecture | Applications | Strengths | Drawbacks
CLIP | Separate image and text transformers with contrastive learning | Zero-shot learning, image captioning, text-based image retrieval | Generalization to unseen tasks, versatile | High computational cost, requires large datasets
SBVQA 2.0 | Speech, image, and fusion modules | Spoken VQA, real-time question answering | Robust to speaker variability and noise | Increased complexity due to speech processing
ViLT | Single transformer for concatenated image-text data | VQA, text-to-image retrieval | Lightweight, end-to-end transformer architecture | Less accurate on complex visual tasks
UNITER | Region-based image encoder with a BERT-style text encoder | Fine-grained VQA, image-text matching | Effective fine-grained alignment | Requires pre-trained object detectors, slower than other models
Table 3.1: Comparison of some Multimodal Algorithms

CLIP distinguishes itself in zero-shot learning, generalizing well to tasks it has never been trained on; however, it relies on very large amounts of pre-training data and relatively heavy transformer encoders, which makes it computationally expensive. SBVQA accepts speech input alongside visual input, which widens real-time access to VQA systems; the advantage is most apparent where spoken language is easier to use than text, but the extra audio processing naturally increases the computational cost. ViLT simplifies the pipeline by using one transformer for both visual and textual data, making it lightweight and efficient, though it loses some performance on complex visual tasks because it has no dedicated image encoder. UNITER performs very well on fine-grained tasks such as matching image regions to a textual description, but the model is more specialized and does not generalize as well as models like CLIP.
4. Applications

4.1 State-of-the-art Applications

4.1.1 Image Captioning

Image captioning [2] is the process of creating a textual description of an image by integrating NLP with image processing techniques. In other words, it bridges the gap between visual data (images) and text, enabling machines to process and describe visual content. These techniques generally use CNNs for visual analysis, while text generation is done with either RNNs or Transformers. Image captioning includes two major stages:

1. Feature Extraction: A CNN processes the image and extracts meaningful features from it.
2. Text Generation: An RNN or Transformer architecture uses those features to generate a relevant caption word by word, predicting the next word conditioned on the image features and the words generated so far (a decoding sketch is given after Table 4.1 below).

Figure 4.1: Image Captioning [13]

Techniques and Algorithms Used: The primary algorithmic techniques used in image captioning are:
1. CNN + RNN: Uses a CNN such as VGG-16 or ResNet to extract features from the image, followed by an RNN-based architecture (LSTM or GRU) to generate the caption.
2. Attention Mechanisms: Attention allows the model to choose where to focus in the image at each time step when generating each word, improving the relevance of the captions.
3. Transformers: Transformers have emerged as the current architecture of choice in place of RNNs, and the Vision Transformer (ViT) achieves even better performance by modelling global relationships within an image.

Ensemble Methods and Sophisticated Models: Blending different captioning models improves robustness in image captioning applications where data is scarce, for example in remote sensing. Better captions are produced by combining several captioning models using strategies such as CLIP-coherence and VAE fusion.

Algorithm | Feature Extraction | Text Generation
CNN + RNN | CNN (VGG/ResNet) | LSTM/GRU
CNN + Attention | CNN | Attention-based LSTM
Transformer-based | Vision Transformer (ViT) | Transformer Decoder
Table 4.1: Comparison of Image Captioning Algorithms
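The following sketch illustrates the CNN encoder plus RNN decoder pattern from Table 4.1, with greedy word-by-word decoding. The vocabulary size, dimensions, and special-token indices are simplified assumptions; a real system would also need a trained embedding, a tokenizer, and usually beam search.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionDecoder(nn.Module):
    """Greedy LSTM decoder conditioned on CNN image features (illustrative only)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)    # image features -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, start_idx=1, end_idx=2, max_len=20):
        h = torch.tanh(self.init_h(img_feat))            # condition the decoder on the image
        c = torch.zeros_like(h)
        word = torch.full((img_feat.size(0),), start_idx, dtype=torch.long)
        caption = []
        for _ in range(max_len):                         # predict one word at a time
            h, c = self.lstm(self.embed(word), (h, c))
            word = self.out(h).argmax(dim=-1)            # greedy choice of the next word
            if (word == end_idx).all():
                break
            caption.append(word)
        return caption

# Feature extraction stage: ResNet-18 backbone with the classifier removed
resnet = models.resnet18(weights=None)
encoder = nn.Sequential(*list(resnet.children())[:-1])
img_feat = encoder(torch.randn(1, 3, 224, 224)).flatten(1)   # (1, 512)
decoder = CaptionDecoder(vocab_size=10000)
tokens = decoder(img_feat)   # list of predicted word indices (untrained, so random here)
```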
• Advantages of Image Captioning
1. Automation: It automates image descriptions, greatly reducing the manual effort required.
2. Multimodal Learning: It advances models that bring together visual and linguistic data, enhancing machine learning systems.
3. Accessibility: It enables visually impaired individuals to access text descriptions of images.
4. Scalability: It can process huge amounts of visual data in applications such as social media, surveillance, and healthcare.

• Drawbacks of Image Captioning
1. Data Dependence: It requires large annotated datasets for proper training, which are scarce in domains such as healthcare or remote sensing.
2. Generalization Difficulty: Performance degrades when models are applied outside their training domain, particularly on out-of-distribution data.
3. Computational Intensiveness: The approach is computationally demanding both for training and for real-time inference.
4.1.2 Visual Question Answering

Visual Question Answering (VQA) [5] is the task of applying natural language processing and image processing to answer a question about visual content. It combines image features with natural-language (or spoken) queries to arrive at relevant answers. This multimodal interaction has many real-life applications and exploits recent developments in both NLP and computer vision.

Figure 4.2: Visual Question Answering [12]

A typical VQA system encompasses the following four main parts (a usage sketch with a pre-trained model follows Table 4.2):

1. Image Encoder: The image encoder extracts the
salient features of a given input image, such as the kinds of objects present, their positions, and the relations between them. The most common architectures used for encoding images are CNNs such as ResNet and, more recently, Vision Transformers (ViTs). These models produce a dense image representation capturing the essence of the objects and the context of the scene. The quality of the image encoder determines how well the system can "understand" the image and recognize the visual cues needed to answer questions.

2. Question Encoder: This module converts the question, in text or speech form, into a vector representation of what it conveys semantically. Transformer models (e.g., BERT, GPT) are applied to text-based questions, while speech-based questions may first be converted to text by models such as Wav2Vec or ASR systems and then encoded. Understanding the question is crucial, since it guides the system on where to concentrate in the image and which features to extract.

3. Feature Fusion Module: This module merges the encoded image and question features into a coherent representation that enables the model to map the relevant parts of the image
onto the context of the question. In many cases multi-head attention (as used in transformers), or simple concatenation with element-wise multiplication, is used for the fusion step. Cross-modal understanding takes place here, allowing the system to use the information gathered from the visual data to answer the given question correctly.

4. Answer Generator: Predicts the answer on the basis of the fused question and image features. It is typically a feed-forward neural network or classification model that selects the most appropriate answer from a set of available options, sometimes using fully connected layers and sometimes transformer-based networks over the fused feature representation.

Component | Description
Image Encoder | Converts images into feature vectors using CNNs or transformers.
Question Encoder | Transforms text/speech into semantic feature vectors using NLP models.
Feature Fusion Module | Combines image and question features, often using attention mechanisms.
Answer Generator | Uses fused features to predict the final answer using a classifier or transformer.
Table 4.2: Components of a VQA System
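To show how these components appear in practice, the following is a minimal sketch of answer prediction with a pre-trained ViLT model fine-tuned for VQA, using the Hugging Face transformers library. The checkpoint name, image file, and question are assumptions for illustration; in ViLT the encoding and fusion stages are handled jointly by a single transformer, and the answer generator is a classifier over a fixed answer vocabulary.

```python
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Hypothetical inputs
image = Image.open("kitchen.jpg")
question = "How many cups are on the table?"

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# The processor tokenizes the question and prepares image patches in one step
inputs = processor(image, question, return_tensors="pt")
outputs = model(**inputs)

# The answer generator is a classifier over a fixed set of candidate answers
predicted = outputs.logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[predicted])
```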
• Advantages of VQA
1. The ability of VQA to interpret and understand both visual and linguistic inputs enables more versatile and dynamic responses.
2. Most VQA systems can work in real time and can therefore be applied in areas such as customer service or assistive technologies.

• Drawbacks of VQA
1. VQA involves complex and costly processing, since it demands parallel processing of both text and images.
2. Like other machine learning systems, VQA models are likely to inherit biases from their training datasets and may produce incorrect or biased answers wherever image categories are ambiguous or under-represented.
3. If the question is presented as speech, performance can degrade easily due to background noise, speaker variability, and similar factors.
4.1.3 CAD in Oncology

Combining NLP and Image Processing provides a potent approach to enhancing computer-aided diagnosis (CAD) in oncology (cancer detection and treatment) [4]. Medical images, including CT scans, MRI, and pathology slides, are essential components of a cancer diagnosis, but textual data such as EHRs and EMRs also carry important information. NLP enables the extraction of important insights from such text sources, thereby complementing image-based analysis for a more comprehensive diagnostic process. The integration works by correlating clinical information from text-based data (NLP) with visual data (image processing). In other words:

1. Data Extraction: NLP extracts key information from patient clinical notes, EHRs, and radiology reports to identify risk factors, symptoms, or relevant medical history.
2. Image Analysis: CT, MRI, or pathology slides are analyzed using image processing techniques to identify abnormal characteristics such as tumours, cysts, or lesions.
3. Integrated Decision Support: The outputs of NLP and image processing are integrated in the system, giving the oncologist more detail for diagnosis and treatment planning.

Figure 4.3: CAD in Oncology

Some models that integrate NLP with Image Processing for CAD in oncology are listed below:
1. Holistic Multi-Modal Systems: These networks combine data from medical reports (through NLP) with imaging data from modalities such as MRI or CT scans, taking advantage of CNNs for image data and transformers for text data.
2. Medical Vision-Language Models: These models integrate transformer architectures for both images and text, using a Vision Transformer (ViT) for image data and a BERT-like model for text. They share a common attention mechanism that allows interaction between image features and text information.
3. Convolutional Neural Networks (CNN) + NLP Pipelines: CNNs are applied to the medical images and BERT to the textual information. The output features are concatenated and fed to a dense layer for final classification (a sketch of this pattern is given after Table 4.3 below).
4. Attention Mechanism Models: These models apply attention mechanisms to both the image and the text, with a CNN working on images and BERT on text, while the attention mechanism helps the model focus on the relevant sections of both.

Model Name | Description | Data Type | Use Cases
Holistic Multi-Modal Systems | Combines NLP and CNN for joint diagnosis. | Clinical text, MRI/CT scans | Breast, lung cancer
Medical Vision-Language Models | Uses Vision Transformers and BERT with attention mechanism. | Radiology images, EHR | Lung cancer staging
CNN + NLP Pipelines | CNN for images, BERT for text, merged outputs. | Pathology images, notes | Colorectal cancer diagnosis
Hybrid Ensemble Models | CNN for images, LSTM for text, combined predictions. | Clinical records, scans | Brain, prostate cancer
Attention Mechanism Models | Attention on CNN and BERT for joint analysis. | Pathology reports, X-rays | Prostate cancer staging
RNN + Image Processing | CNN for images, RNN for sequential text. | CT/MRI scans, timelines | Pancreatic cancer growth
Table 4.3: Models Combining NLP and Image Processing
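As referenced in the CNN + NLP Pipelines entry above, the following is a minimal sketch of how image features from a CNN and text features from BERT could be concatenated and passed to a dense classification layer. The backbone choices, dimensions, example clinical note, and the binary output are assumptions for illustration, not a clinically validated design.

```python
import torch
import torch.nn as nn
from torchvision import models
from transformers import BertModel, BertTokenizer

class ImageTextCADClassifier(nn.Module):
    """Illustrative CNN + BERT pipeline: concatenated features -> dense layer -> class scores."""
    def __init__(self, num_classes=2):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])      # image branch -> (B, 512)
        self.bert = BertModel.from_pretrained("bert-base-uncased")   # text branch  -> (B, 768)
        self.classifier = nn.Linear(512 + 768, num_classes)          # dense layer on concatenation

    def forward(self, scan, input_ids, attention_mask):
        img_feat = self.cnn(scan).flatten(1)
        txt_feat = self.bert(input_ids=input_ids,
                             attention_mask=attention_mask).pooler_output
        return self.classifier(torch.cat([img_feat, txt_feat], dim=1))

# Toy usage with a random "scan" tensor and a short hypothetical clinical note
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
note = tokenizer("Patient reports persistent cough; prior nodule noted in right lung.",
                 return_tensors="pt")
model = ImageTextCADClassifier()
logits = model(torch.randn(1, 3, 224, 224), note["input_ids"], note["attention_mask"])
print(logits.shape)  # torch.Size([1, 2]) - e.g. lesion present vs. absent (untrained)
```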
• Advantages
1. Improved Diagnosis: Cancer diagnostic accuracy can be improved by augmenting image processing with NLP-derived insights, producing a more holistic picture.
2. Clinical Workflow Efficiency: NLP saves physicians time by automating information extraction from EHRs.
3. Accessibility: It enables visually impaired individuals to access text descriptions of medical images.
4. Personalized Treatment: Oncologists can predict treatment outcomes and recurrence risks using NLP-derived patterns from previous patient records.

• Drawbacks
1. Data Privacy: Managing voluminous textual and visual patient data raises serious privacy issues.
2. Data Integration Challenges: Integrating structured data from medical images with unstructured data from texts makes the algorithms highly complex, and the data must be of high quality.
3. Model Interpretability: Deep learning models used in image processing are often regarded as black boxes. The interpretability problem is critical in a field like medicine.

4.2 Challenges

Challenges faced by researchers in
the applications mentioned above are listed below:

1. Image Captioning [2]:
• Dataset Limitations: Public datasets are scarce and imbalanced, especially for a given type of cancer.
• Data Homogeneity: Most datasets also come from a single site, which reduces the models' ability to generalize to new data or situations.
• Use of Black-Box Models: The black-box nature of deep learning models makes them hard to interpret, which breeds mistrust among clinicians who cannot easily integrate these models into their medical workflows.
• Generalization Capability: Algorithms are generally unable to generalize across different types of images or scenarios, especially when they have been trained on a small or domain-specific dataset. This induces undesirable behaviour on out-of-distribution samples.

2. Visual Question Answering [5]:
• Complexity of Remote Sensing Scenes: Natural images are already highly complex and variable; remote sensing images are often of even higher complexity with fewer clearly determinable objects, making it more challenging for models to generate accurate and meaningful answers and captions.
• Robustness to Background Noise in Recording Environments: The system must remain robust against background noise, which can greatly impact the accuracy of speech recognition in real-world applications.
• Handling Language Priors: Language priors can skew model responses, so the model must rely on both the visual and the linguistic context to answer correctly.

3. CAD in Oncology [4]:
• Privacy Issues: Restrictive legal regulations on the handling of patient data limit the larger datasets that could be shared between hospitals for model development without fear of breaching privacy.
• Performance Limitations of Single-Model Systems: Relying on a single captioning model is risky in challenging real-world scenarios. The authors address this by adopting ensembling methods; however, developing and tuning such strategies raises its own set of problems.
5. Conclusion and Future Scope

NLP and image processing are core domains in several multimodal applications, opening vast possibilities for transformation across different fields. Both technologies have come of age, each witnessing tremendous growth, though challenges persist in areas such as generalization and scarcity of data. The complexity of images presents further hurdles that must be addressed for broader and more effective applications. Despite these challenges, the ensemble approaches proposed in the surveyed work give promising results in enhancing the robustness of the models. As research advances, integrated multimodal solutions will play increasingly important roles in domains such as healthcare, surveillance, and e-commerce, where they may yield much more accurate and reliable results.
