A Guide To Image Captioning. How Deep Learning Helps in Captioning
Photo by Liam Charmer on Unsplash
Several approaches have been proposed to solve this task. One of the most
notable was put forward by Andrej Karpathy, now Director of AI at Tesla,
during his Ph.D. at Stanford. In this article, we will talk about the most
widely used and well-known approaches proposed as solutions to this problem.
We will also look at a Python demo example on the Flickr dataset.
If we are told to describe it, we might say: “A puppy on a blue towel” or
“A brown dog playing with a green ball”. So, how do we do this? While
forming the description, we look at the image, but at the same time we try
to create a meaningful sequence of words. The first part is handled by CNNs
and the second by RNNs.
Now, there is one issue we might have overlooked here. We have seen that we
can describe the above images in several ways. So, how do we evaluate our
model? For sequence-to-sequence problems, like summarization, language
translation, or captioning, we use a metric called the BLEU score.
BLEU stands for Bilingual Evaluation Understudy. It is a metric for
evaluating a generated sentence against a reference sentence. A perfect
match scores 1.0 and a complete mismatch scores 0.0. You can read more about
the BLEU score in this blog post.
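As a quick, hedged illustration (the captions below are made up, not from any dataset), NLTK’s sentence-level BLEU implementation can be used to score a candidate caption against reference captions:

```python
# A minimal sketch of BLEU scoring with NLTK; the captions are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a brown dog playing with a green ball".split(),
    "a puppy on a blue towel".split(),
]
candidate = "a brown dog plays with a ball".split()

# Smoothing avoids a zero score when a higher-order n-gram has no match.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU score: {score:.3f}")  # 1.0 = perfect match, 0.0 = no overlap
```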
We have seen that we need to create a multimodal neural network that uses
feature vectors obtained from both a CNN and an RNN, so we will have two
inputs. One is the image we need to describe, fed to the CNN, and the other
is the sequence of words in the caption produced so far, fed to the RNN.
We are dealing with two types of information: language and image. So, the
question arises: how, or in what order, should we introduce these pieces of
information into our model? More concretely, we need an RNN language model
because we want to generate a word sequence, so when should we introduce the
image feature vectors into that language model? A paper by Marc Tanti and
Albert Gatt of the Institute of Linguistics and Language Technology,
University of Malta, presents a comparative study of the approaches. Let’s
look into them.
Types of Architectures
There are two basic types of architectures:
The first architecture is called the Injecting architecture and the second
the Merging architecture. FF denotes a feed-forward network.
In the Injecting Architecture, the image data is introduced along with the
language data, and the image and language data mixture is represented
together. The RNN trains on the mixture. So, at every step of training, the
RNN uses the mixture of both pieces of information to predict the next
word, and consequently, the RNN finetunes image information as well
during training.
In the Merging Architecture, the image data is not introduced in the RNN
network. So, the image and the language information are encoded
separately and introduced together in a feed-forward network, creating a
multimodal layer architecture.
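As a rough illustration, a merge architecture can be sketched in Keras roughly as below. The vocabulary size, caption length, feature dimension, and layer sizes are made-up values for illustration, not settings from any of the papers discussed.

```python
# Hypothetical merge architecture: image and text are encoded separately
# and combined only in a feed-forward decoder. All sizes are illustrative.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim = 5000, 34, 2048  # assumed values

# Image branch: a pre-extracted CNN feature vector, projected down.
img_in = Input(shape=(feat_dim,))
img_enc = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Language branch: the partial caption generated so far.
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_enc = LSTM(256)(Dropout(0.5)(txt_emb))

# Merge: the image never enters the RNN; the two encodings meet in a FF layer.
merged = add([img_enc, txt_enc])
out = Dense(vocab_size, activation="softmax")(Dense(256, activation="relu")(merged))

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```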
Par-Inject: In this case, at every step, we merge the word vector and the
image vector into a similar-sized embedding space and pass the result to the
RNN to be trained, as sketched below.
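A corresponding sketch of par-inject, under the same made-up sizes as the merge sketch above, repeats the projected image vector across timesteps and concatenates it with each word embedding before the RNN:

```python
# Hypothetical par-inject architecture: the image vector is repeated and
# concatenated with the word embedding at every timestep of the RNN.
from tensorflow.keras.layers import (Input, Dense, Embedding, LSTM,
                                     RepeatVector, Concatenate)
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim = 5000, 34, 2048  # assumed values

img_in = Input(shape=(feat_dim,))
img_proj = Dense(256, activation="relu")(img_in)
img_seq = RepeatVector(max_len)(img_proj)          # one copy per timestep

txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256)(txt_in)

# Inject: the RNN sees the image together with each word, so its hidden
# state is influenced by the image information at every step.
merged_seq = Concatenate(axis=-1)([img_seq, txt_emb])
hidden = LSTM(256)(merged_seq)
out = Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```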
In all these cases, the image feature vectors are generated by architectures
like Inception networks and complex ResNet structures. One thing common to
all the inject architectures is that the hidden state of the RNN is affected
by the image vector. This is the point of difference between the inject and
merge architectures.
Now, having seen the possible kinds of architectures, let’s look at some of
the most well-known proposed architectures and how they work.
Agenda 1
Suppose we, as humans, are describing the scene given above (Fig 1). We
first try to recognize the objects in the image, like the ball, the dog, and
the towel. Then we create the description, which naturally contains the words
representing those objects. Karpathy focused on the same logic. He pointed
out that every contiguous segment of words in the description corresponds to
a particular region of the image, but to the machine these correspondences
are unknown. So, his work used these mappings to create a description
generation model. If we think about it, we as humans also create chunks of
words by looking at the important, object-containing parts of the image and
then combine them to form the sentence. The approach given in the paper is
similar.
The proposed model creates a multimodal embedding space using the two
modalities to find alignments between segments of the sentences and the
corresponding regions of the image. An RCNN, or Region-based Convolutional
Neural Network, is used in the model to detect object regions, and a CNN
trained on the ImageNet dataset with 200 classes is used to recognize the
objects. For language modeling, the paper suggests a BRNN, or Bidirectional
Recurrent Neural Network, as it best represents the inter-modal relationships
among the n-grams of the sentences. A 300-dimensional word2vec embedding is
used to obtain the word vectors.
The above diagram expresses the approach. The RCNN creates bounding boxes, so
if we regard one of them as the i-th region of the image, its representation
is matched with every t-th word in the description. So, for every region and
word pair, the dot product v_i · s_t between the i-th region vector and the
t-th word vector is a measure of similarity.
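As a hedged illustration of this scoring idea (not the exact code or dimensions from the paper), the snippet below computes the dot-product similarity between every region vector and every word vector, assuming both have already been projected into a shared embedding space:

```python
# Illustrative region-word alignment scores via dot products.
# Both v (regions) and s (words) are assumed to live in a shared
# multimodal embedding space of dimension d; the sizes are made up.
import numpy as np

rng = np.random.default_rng(0)
num_regions, num_words, d = 5, 8, 300

v = rng.standard_normal((num_regions, d))  # region vectors (RCNN branch)
s = rng.standard_normal((num_words, d))    # word vectors (BRNN branch)

# similarity[i, t] = v_i . s_t : how well region i matches word t
similarity = v @ s.T

# One simple aggregation into an image-sentence score:
# for each word, take its best-matching region and sum.
image_sentence_score = similarity.max(axis=0).sum()
print(similarity.shape, image_sentence_score)
```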
Agenda 2:
The above image shows the proposed model. The model takes in the image
pixels I and the sequence of input word vectors (x1, x2, …, xn) and
calculates the sequence of hidden states (h1, h2, …, hn) to give the
sequence of outputs (y1, y2, …, yn). The image feature vector is passed in
only once, as part of the initial hidden state. So, the next hidden state is
calculated from the image vector I (at the first step), the previous hidden
state h(t-1), and the current input xt.
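In equation form, the recurrence described above can be written roughly as follows. This follows the multimodal RNN formulation in Karpathy’s paper, with the image term active only at the first step; f is the hidden-layer activation.

```latex
b_v = W_{hi}\,\mathrm{CNN}(I)
h_t = f\left(W_{hx} x_t + W_{hh} h_{t-1} + b_h + \mathbb{1}(t=1)\odot b_v\right)
y_t = \mathrm{softmax}\left(W_{oh} h_t + b_o\right)
```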
Next, let’s talk about two of the other most commonly used architectures.
Google’s Architecture
The architecture was proposed in a paper titled “Show and Tell: A Neural
Image Caption Generator” by Google in 2015.
The above image shows the architecture. The rest of its functioning is
similar to that of the model introduced by Karpathy. The image feature
vector I is inserted into the LSTM sequence only once, at t = -1. After
that, from t = 0 onwards, the sequence of word vectors is fed in. The output
word at each step is the word with the maximum probability obtained by
applying the output activation (a softmax) to the hidden state at that
particular step. The
governing equations are given below.
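The equation figure is not reproduced here; as given in the Show and Tell paper, the equations take roughly the following form, where S_t is the one-hot vector of the t-th word, W_e the word embedding matrix, and p_{t+1} the probability distribution over the next word:

```latex
x_{-1} = \mathrm{CNN}(I)
x_t = W_e\,S_t, \quad t \in \{0, \dots, N-1\}
p_{t+1} = \mathrm{LSTM}(x_t), \quad t \in \{0, \dots, N-1\}
```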
To obtain the best prediction for the caption, the architecture uses a beam
search method. The method keeps the k best sentences as candidates at each
time step t. To move to time step t+1, the k most probable next words are
considered for each candidate; with k sentences and k words, k² candidate
sentences arise, and from these the k best sentences are again chosen as the
candidates for time step t+1.
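A small, generic sketch of this beam search loop is given below; next_word_probs is a hypothetical stand-in for whatever captioning model returns a probability distribution over the next word given a partial caption (it is not from any of the papers discussed).

```python
import numpy as np

def beam_search(next_word_probs, start_token, end_token, k=3, max_len=20):
    """Generic beam search over a next-word distribution.

    next_word_probs(sequence) -> {word: probability} is a hypothetical
    callable wrapping the captioning model.
    """
    # Each candidate is (sequence, log-probability); start with the start token.
    beams = [([start_token], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == end_token:          # finished captions carry over unchanged
                candidates.append((seq, logp))
                continue
            probs = next_word_probs(seq)
            # Expand with the k most probable next words -> up to k*k candidates.
            for word in sorted(probs, key=probs.get, reverse=True)[:k]:
                candidates.append((seq + [word], logp + np.log(probs[word])))
        # Keep only the k best partial captions for the next time step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]
```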
Microsoft’s Architecture
Microsoft proposed its architecture in the paper titled “Rich Image
Captioning in the Wild”. It takes the ideas one step further than the two
architectures discussed above and is inspired by the encoder-decoder
framework of machine translation. The paper states that, since most image
captioning architectures produce generic descriptions without identifying
entities like famous landmarks or celebrities, their model takes these
factors into consideration separately.
References
Comparison of Architectures: https://arxiv.org/pdf/1703.09137.pdf
Yahoo’s Approach: https://papers.nips.cc/paper/2019/file/680390c55bbd9ce416d1d69a9ab4760d-Paper.pdf