
Convolutional Image Captioning

Jyoti Aneja∗, Aditya Deshpande∗, Alexander G. Schwing


University of Illinois at Urbana-Champaign
{janeja2, ardeshp2, aschwing}@illinois.edu

∗ Denotes equal contribution.

Abstract

Image captioning is an important task, applicable to virtual assistants, editing tools, image indexing, and support of the disabled. In recent years significant progress has been made in image captioning, using Recurrent Neural Networks powered by long-short-term-memory (LSTM) units. Despite mitigating the vanishing gradient problem, and despite their compelling ability to memorize dependencies, LSTM units are complex and inherently sequential across time. To address this issue, recent work has shown benefits of convolutional networks for machine translation and conditional image generation [9, 34, 35]. Inspired by their success, in this paper, we develop a convolutional image captioning technique. We demonstrate its efficacy on the challenging MSCOCO dataset and demonstrate performance on par with the LSTM baseline [16], while having a faster training time per number of parameters. We also perform a detailed analysis, providing compelling reasons in favor of convolutional language generation approaches.

1. Introduction

Image captioning, i.e., describing the content observed in an image, has received a significant amount of attention in recent years. It is applicable in various scenarios, e.g., recommendation in editing applications, usage in virtual assistants, for image indexing, and support of the disabled. With the availability of large datasets, deep neural network (DNN) based methods have been shown to achieve impressive results on image captioning tasks [16, 37]. These techniques are largely based on recurrent neural nets (RNNs), often powered by a Long-Short-Term-Memory (LSTM) [10] component.

LSTM nets have been considered the de-facto standard for the vision-language tasks of image captioning [5, 16, 37, 39, 38], visual question answering [3, 30, 28], question generation [14, 20], and visual dialog [7, 13], due to their compelling ability to memorize long-term dependencies through a memory cell. However, the complex addressing and overwriting mechanism, combined with inherently sequential processing and the significant storage required due to back-propagation through time (BPTT), poses challenges during training. Also, in contrast to CNNs, which are non-sequential, LSTMs often require more careful engineering when considering a novel task. Previously, CNNs have not matched up to the LSTM performance on vision-language tasks. Inspired by the recent successes of convolutional architectures on other sequence-to-sequence tasks – conditional image generation [34], machine translation [9, 35] – we study convolutional architectures for the vision-language task of image captioning. To the best of our knowledge, ours is the first convolutional network for image captioning that compares favorably to LSTM-based methods.

Our key contributions are: a) a convolutional (CNN-based) image captioning method that shows comparable performance to an LSTM-based method [16] (Section 6.2, Table 1 and Table 2); b) improved performance with a CNN model that uses an attention mechanism to leverage spatial image features. With attention, we outperform the attention baseline [39] and qualitatively demonstrate that our method finds salient objects in the image (Figure 5, Table 2); c) we analyze the characteristics of CNN and LSTM nets and provide useful insights, e.g., CNNs produce more entropy (useful for diverse predictions), better classification accuracy, and do not suffer from vanishing gradients (Section 6 and Figures 6, 7 and 8). We evaluate our architecture on the challenging MSCOCO [18] dataset, and compare it to an LSTM [16] and an LSTM+Attention baseline [39].

The paper is organized as follows: Section 2 gives our notation, Section 3 reviews the RNN based approach, Section 4 describes our convolutional method, Section 5 gives the details of the CNN architecture, Section 6 contains results and Section 7 discusses related work.
Figure 1: A sequential RNN powered by an LSTM cell. At each time step the output is conditioned on the previously generated word; the image is fed at the start only.

Figure 2: Our convolutional model for image captioning. We use a feed-forward network with masked convolutions. Unlike RNNs, our model operates over all words in parallel.

2. Problem Setup and Notation

For image captioning, we are given an input image I and we want to generate a sequence of words y = (y_1, ..., y_N). The possible words y_i at time-step i are subsumed in a discrete set Y of options. Its size, |Y|, easily reaches several thousands. Y contains special tokens that denote a start token (<S>), an end of sentence token (<E>), and an unknown token (<UNK>) which refers to all words not in Y.

Given a training set D = {(I, y*)} which contains pairs (I, y*) of input image I and corresponding ground-truth caption y* = (y_1*, ..., y_N*), consisting of words y_i* ∈ Y, i ∈ {1, ..., N}, we maximize w.r.t. parameters w a probabilistic model p_w(y_1, ..., y_N | I). A variety of probabilistic models have been considered (Section 7), from hidden Markov models [40] to recurrent neural networks.

3. RNN Approach

An illustration of a classical RNN architecture for image captioning is provided in Figure 1. It consists of three major components, all of which contain trainable parameters: the input word embeddings, the sequential LSTM units containing the memory cell, and the output word embeddings.

Inference. RNNs sequentially predict one word at a time, from y_1 up to y_N. At every time-step i, a conditional probability distribution p_{i,w}(y_i | h_i, I), which depends on parameters w, is predicted (see top of Figure 1). For modeling p_{i,w}(y_i | h_i, I), in the spirit of auto-regressive models, the dependence of word y_i on its ancestors y_{<i} is implicitly captured by a hidden representation h_i (see arrows in Figure 1). Formally, the probability is computed via

    p_{i,w}(y_i | h_i, I) = g_w(y_i, h_i, I),    (1)

where g_w can be any differentiable function/deep net. Note, image captioning techniques usually encode the image into the hidden representation h_0 (Figure 1).

Importantly, RNNs are described by a recurrence relation which governs computation of the hidden state h_i via

    h_i = f_w(h_{i-1}, y_{i-1}, I).    (2)

Again, f_w can be any differentiable function. For image captioning, long-short-term-memory (LSTM) [10] nets and variants thereof based on gated recurrent units (GRU) [6], or forward-backward LSTM nets, are used here.

Learning. Following classical supervised learning, it is common to find the parameters w of the word embeddings and the LSTM unit by minimizing the negative log-likelihood of the training data D, i.e., we optimize:

    min_w  − Σ_D Σ_{i=1}^{N} ln p_{i,w}(y_i* | h_i, I).    (3)

To compute the gradient of the objective given in Eq. (3), we use back-propagation through time (BPTT). BPTT is necessary due to the recurrence relationship encoded in f_w (Eq. (2)). Note, the gradients of the function f_w at time i depend on the gradients obtained in successive time-steps.

To avoid more complicated gradient flows through the recurrence relationship, during training, it is common to use

    h_i = f_w(h_{i-1}, y*_{i-1}, I),    (4)

rather than the form provided in Eq. (2). I.e., during training, when computing the latent representation h_i, we use the ground-truth symbol y*_{i-1} rather than the prediction y_{i-1}. This is termed as teacher forcing.
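To make this training setup concrete, the sketch below shows one teacher-forced negative log-likelihood step for an LSTM captioner in PyTorch. It is only an illustration of Eqs. (3) and (4); the module and tensor names (CaptionLSTM, img_feat, etc.) are ours and not those of the baseline implementation [16].

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 9221, 512, 512

class CaptionLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classify = nn.Linear(hidden_dim, vocab_size)

    def forward(self, words, img_feat):
        # the image enters only through the initial hidden state h_0
        h0 = img_feat.unsqueeze(0)                      # (1, batch, hidden_dim)
        c0 = torch.zeros_like(h0)
        h, _ = self.lstm(self.embed(words), (h0, c0))   # sequential in time (BPTT)
        return self.classify(h)                         # logits for p_{i,w}(y_i | h_i, I)

model = CaptionLSTM()
img_feat = torch.randn(4, hidden_dim)                   # stand-in for an image embedding
captions = torch.randint(0, vocab_size, (4, 15))        # <S> y*_1 ... y*_{N-1}

# teacher forcing (Eq. 4): inputs are ground-truth words, shifted by one position
logits = model(captions[:, :-1], img_feat)
loss = nn.functional.cross_entropy(                     # negative log-likelihood (Eq. 3)
    logits.reshape(-1, vocab_size), captions[:, 1:].reshape(-1))
loss.backward()
```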
Although highly successful, RNN-based techniques suffer from some drawbacks. First, the training process is inherently sequential for a particular image-caption pair. This results from unrolling the recurrent relation in time. Hence, the output at time-step i has a true dependency on the output at i − 1. Secondly, as we will show in our results for image captioning, RNNs tend to produce lower classification accuracy (Figure 6), and, despite LSTM units, they still suffer to some degree from vanishing gradients (Figure 8).

Next, we describe an alternative convolutional approach to image captioning which attempts to overcome some of these challenges.
Figure 3: Our convolutional architecture for image captioning. It has four components: (i) Input embedding layer, (ii) Image embedding, (iii) Convolutional module and (iv) Output embedding layer. Details of each component are in Section 5.

4. Convolutional Approach

Our model is based on the convolutional machine translation model used in [9]. Figure 2 provides an overview of our feed-forward convolutional (or CNN-based) approach for image captioning. As the figure illustrates, our technique contains three main components, similar to the RNN technique. The first and the last components are input and output word embeddings, respectively, in both cases. However, while the middle component contains LSTM or GRU units in the RNN case, masked convolutions are employed in our CNN-based approach. This component, unlike the RNN, is feed-forward without any recurrent function. We briefly review inference and learning of our model.

Inference. In contrast to the RNN formulation, where the probabilistic model is unrolled in time via the recurrence relation given in Eq. (2), we use a simple feed-forward deep net, f_w, for modeling p_{i,w}(y_i | I). Prediction of a word y_i relies on past words y_{<i} or their representations:

    p_{i,w}(y_i | y_{<i}, I) = f_w(y_i, y_{<i}, I).    (5)

To disallow convolution operations from using information of future word tokens, we use masked convolutional layers that operate only on 'past' data [9, 34].

Inference can now be performed sequentially, one word at a time. Hence, inference begins with the start token <S> and employs a feed-forward pass to generate p_{1,w}(y_1 | ∅, I). Afterwards, y_1 ∼ p_{1,w}(y_1 | ∅, I) is sampled. Note that it is possible to retrieve the maximizing argument or to perform beam search. After sampling, y_1 is fed back into the feed-forward network to generate subsequent words y_2, etc. Inference continues until the end token is predicted, or until we reach a fixed upper bound of N steps.

Learning. Similar to RNN training, we use the ground-truth y*_{<i} for past words, instead of using the predicted words. For prediction of the word probability p_{i,w}(y_i | y_{<i}, I), the considered feed-forward network is f_w(y_i, y_{<i}, I) and we optimize for parameters w using a likelihood similar to Eq. (3). Since there are no recurrent connections and all ground-truth words are available at any given time-step i, our CNN-based model can be trained in parallel for all words. In Section 5, we describe our convolutional architecture in detail.
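To make the contrast concrete: during learning the network scores all word positions in one parallel forward pass, while at test time it is applied in the sequential loop described in the Inference paragraph above. A minimal sketch of that loop is given below; the token ids and the model interface are placeholders rather than the released code.

```python
import torch

START, END, MAX_STEPS = 1, 2, 15          # assumed ids for <S> and <E>, N = 15

def generate_caption(model, image, greedy=True):
    words = [START]
    for _ in range(MAX_STEPS):
        logits = model(torch.tensor([words]), image)      # feed-forward pass over y_{<i}
        probs = torch.softmax(logits[0, -1], dim=-1)      # p_{i,w}(y_i | y_{<i}, I)
        y = int(probs.argmax()) if greedy else int(torch.multinomial(probs, 1))
        if y == END:                                      # stop at the end token
            break
        words.append(y)                                   # feed the word back in
    return words[1:]                                      # generated caption (word ids)
```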
5. Architecture

In Figure 3, we show a training iteration of our convolutional architecture with input (ground-truth) words {y_1*, ..., y_5*} = {a, woman, is, playing, tennis}. Additionally, we add the start token <S> at the beginning, and also the end of sentence token <E>.

These words are processed as follows: (1) they pass through an input embedding layer; (2) they are combined with the image embedding; (3) they are processed by the CNN module; and (4) the output embedding (or classification) layer produces output probability distributions (see {p_1, ..., p_6} at the top of Figure 3). Each of the four aforementioned steps is discussed below.

Input Embedding. For consistency with the RNN/LSTM baseline, we train (from scratch) an embedding layer over one-hot encoded input words. We use |Y| = 9221 and we embed the input words to 512-dimensional vectors, following the baseline. This embedding is concatenated to the image embedding (discussed next) and provided as input to the feed-forward CNN module.
MSCOCO Val Set
Method                           B1     B2     B3     B4     M      R      C      S
Baselines:
LSTM [16]                        .710   .535   .389   .281   .244   .521   .899   .169
LSTM + Attn (Soft) [39]          -      -      -      -      -      -      -      -
LSTM + Attn (Hard) [39]          -      -      -      -      -      -      -      -
Our CNN:
CNN                              .693   .518   .374   .268   .238   .511   .855   .167
CNN + Weight Norm.               .702   .528   .384   .279   .242   .517   .881   .169
CNN + WN + Dropout               .707   .532   .386   .278   .242   .517   .883   .171
CNN + WN + Dropout + Residual    .706   .532   .389   .284   .244   .519   .899   .173
CNN + WN + Drop. + Res. + Attn   .710   .537   .391   .281   .241   .519   .890   .171

MSCOCO Test Set
Method                           B1     B2     B3     B4     M      R      C      S
Baselines:
LSTM [16]                        .713   .541   .404   .303   .247   .525   .912   .172
LSTM + Attn (Soft) [39]          .707   .492   .344   .243   .239   -      -      -
LSTM + Attn (Hard) [39]          .718   .504   .357   .250   .230   -      -      -
Our CNN:
CNN                              .695   .521   .380   .276   .241   .514   .881   .171
CNN + Weight Norm.               .699   .525   .382   .276   .241   .516   .878   .170
CNN + WN + Dropout               .704   .532   .389   .283   .243   .520   .904   .173
CNN + WN + Dropout + Residual    .704   .532   .389   .284   .244   .520   .906   .175
CNN + WN + Drop. + Res. + Attn   .711   .538   .394   .287   .244   .522   .912   .175

Table 1: Comparison of different methods on standard evaluation metrics: BLEU-1 (B1), BLEU-2 (B2), BLEU-3 (B3), BLEU-4 (B4), METEOR (M), ROUGE (R), CIDEr (C) and SPICE (S). Our CNN with attention (attn) achieves comparable performance (equal CIDEr scores on the MSCOCO test set) to [16] and outperforms the LSTM+Attention baseline of [39]. We start with a CNN comprising masked convolutions and fully connected layers only. Then, we add weight normalization, dropout, residual connections and attention incrementally and show that performance improves with every addition. Here, for CNN and [16] we use the model that obtains the best CIDEr scores on the val set (over 30 epochs) and report its scores for the test set. For [39], we report all the available metrics for soft/hard attention from their paper (missing numbers are marked by -).

Image Embedding. Image features for image I are obtained from the fc7 layer of the VGG16 network [31]. The VGG16 is pre-trained on the ImageNet dataset [27]. We apply dropout and ReLU on the fc7 features and use a linear layer to obtain a 512-dimensional embedding. This is consistent with the image features used in the baseline LSTM method [16].

CNN Module. The CNN module operates on the combined input and image embedding vector. It performs three layers of masked convolutions. Consistent with [9, 34], we use gated linear unit (or GLU) activations for our conv layers. However, we did not observe a significant change in performance when using the standard ReLU activation. The feature dimension after the convolution layer and GLU is 512. We add weight normalization, residual connections and dropout in these layers as they help improve performance (Table 1). Our masked convolutions have a receptive field of 5 words in the past. We set N (steps or max-sentence length) to 15 for both CNN/RNN. The output of the CNN module after three layers is a 512-dimensional vector for each word.
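The following is a rough sketch of such a module (it is not the released convcap code): three masked convolutions implemented with left padding, GLU activations, weight normalization, dropout and residual connections, all at feature dimension 512.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm

class MaskedConvModule(nn.Module):
    def __init__(self, dim=512, kernel_size=5, layers=3, dropout=0.1):
        super().__init__()
        self.pad = kernel_size - 1                 # pad on the left only: no future words
        self.dropout = nn.Dropout(dropout)
        # each conv outputs 2*dim channels so that the GLU halves them back to dim
        self.convs = nn.ModuleList(
            [weight_norm(nn.Conv1d(dim, 2 * dim, kernel_size)) for _ in range(layers)])

    def forward(self, x):                          # x: (batch, dim, word positions)
        for conv in self.convs:
            residual = x
            h = conv(F.pad(self.dropout(x), (self.pad, 0)))   # masked (causal) convolution
            x = F.glu(h, dim=1) + residual         # GLU activation + residual connection
        return x                                   # a 512-d vector per word position

feats = torch.randn(2, 512, 15)                    # combined word + image embeddings
out = MaskedConvModule()(feats)                    # (2, 512, 15)
```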
Classification Layer. We use a linear layer to encode the 512-dimensional vectors obtained from the CNN module into a 256-dimensional representation per word. Then, we upsample this vector to a |Y|-dimensional activation via a fully connected layer, and pass it through a softmax to obtain the output word probabilities p_{i,w}(y_i | y_{<i}, I).

Training. We use a cross-entropy loss on the probabilities p_{i,w}(y_i | y_{<i}, I) to train the CNN module and the embedding layers. Consistent with [16], we start to fine-tune VGG16 along with our network after 8 training epochs. We optimize with RMSProp using an initial learning rate of 5e-5 and decay it by multiplying with a factor of .1 every 15 epochs. All methods were trained for 30 epochs and we evaluate the metrics (in Section 6.2) on the validation set after every epoch to pick the best model for all methods.
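This schedule maps directly onto a standard PyTorch optimizer setup; a sketch is shown below, with a single linear layer standing in for the full network.

```python
import torch

model = torch.nn.Linear(512, 9221)      # stand-in for the captioning network
optimizer = torch.optim.RMSprop(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

for epoch in range(30):
    # ... one pass over the training data, calling optimizer.step() per batch ...
    scheduler.step()                    # multiply the learning rate by 0.1 every 15 epochs
```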
5.1. Attention

In addition to the aforementioned CNN architecture, we also experiment with an attention mechanism, since attention benefited [9, 35]. We form an attended image vector of dimension 512 and add it to the word embedding at every layer (shown with red, green and blue arrows in Figure 3). We compute separate attention parameters and a separate attended vector for every word. To obtain this attended vector we predict 7×7 attention parameters over the VGG16 max-pooled conv-5 features of dimensions 7×7×512 [31]. We use attention on all three masked convolution layers in our CNN module. We continue to use the fc7 image embedding discussed above.

To discuss attention more formally, let d_j denote the embedding of word j in the conv module (i.e., its activations after the GLU shown in Figure 3), let W refer to a linear layer applied to d_j, let c_i denote a 512-dimensional spatial conv-5 feature at location i (in the 7×7 feature map), and let a_ij indicate the attention parameters. With this notation at hand, the attention parameter a_ij is computed via

    a_ij = exp(W(d_j)^T c_i) / Σ_i exp(W(d_j)^T c_i),

and the attended image vector for word j is obtained from Σ_i a_ij c_i. Note that [39] uses the LSTM hidden state to compute the attention parameters. Instead, we compute attention parameters using the conv-layer activations. This form of attention mechanism was first proposed in [4].
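A small sketch of this attention computation is shown below, assuming batched tensors for the word activations d_j and the 7×7 conv-5 features c_i; the variable names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, J, D = 2, 15, 512                    # batch, word positions, feature dimension
d = torch.randn(B, J, D)                # word activations after the GLU
c = torch.randn(B, 49, D)               # 7x7 = 49 spatial conv-5 features
W = nn.Linear(D, D)                     # the linear layer applied to d_j

scores = torch.bmm(W(d), c.transpose(1, 2))     # (B, J, 49): W(d_j)^T c_i
a = F.softmax(scores, dim=-1)                   # attention parameters a_ij
attended = torch.bmm(a, c)                      # (B, J, D): sum_i a_ij c_i
```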
Beam Size = 2
Method       B1     B2     B3     B4     M      R      C      S
LSTM [16]    .715   .545   .407   .304   .248   .526   .940   .178
CNN          .712   .541   .404   .303   .248   .527   .937   .178
CNN+Attn     .718   .549   .411   .306   .248   .528   .942   .177

Beam Size = 3
Method       B1     B2     B3     B4     M      R      C      S
LSTM [16]    .715   .544   .409   .310   .249   .528   .946   .178
CNN          .709   .538   .403   .303   .247   .525   .929   .176
CNN+Attn     .722   .553   .418   .316   .250   .531   .952   .179

Beam Size = 4
Method       B1     B2     B3     B4     M      R      C      S
LSTM [16]    .714   .543   .410   .311   .250   .529   .951   .179
CNN          .706   .533   .400   .302   .247   .522   .925   .175
CNN+Attn     .718   .550   .415   .314   .249   .528   .951   .179

Table 2: Comparison of different methods (metrics same as Table 1) with beam search on the output word probabilities. Our results show that with beam size = 3 our CNN outperforms LSTM [16] on all metrics. Note, compared to Table 1, the performance improves with beam search. We use the MS COCO test split for this experiment. For beam search, we pick one caption with maximum log probability (sum of log probability of words) from the top-k beams and report the above metrics for it. Beam = 1 is the same as the test set results reported in Table 1.

c5 (Beam = 1)
Method       B1     B2     B3     B4     M      R      C
LSTM         .704   .528   .384   .278   .241   .517   .876
CNN+Attn     .708   .534   .389   .280   .241   .517   .872

c40 (Beam = 1)
Method       B1     B2     B3     B4     M      R      C
LSTM         .880   .778   .656   .537   .321   .655   .898
CNN+Attn     .883   .786   .667   .545   .321   .657   .893

c5 (Beam = 3)
Method       B1     B2     B3     B4     M      R      C
LSTM         .710   .537   .399   .299   .246   .523   .904
CNN+Attn     .715   .545   .408   .304   .246   .525   .910

c40 (Beam = 3)
Method       B1     B2     B3     B4     M      R      C
LSTM         .889   .794   .681   .570   .334   .671   .912
CNN+Attn     .896   .805   .694   .582   .333   .673   .914

Table 3: Above, we show that the CNN outperforms the LSTM on BLEU metrics and gives comparable scores to the LSTM on other metrics for the test split on the MSCOCO evaluation server. Note, this hidden test split of 40,775 images on the evaluation server is different from the 5000-image test split used in Tables 1 and 2. We compare our CNN+Attn method to the LSTM baseline (metrics same as Table 1). The c5, c40 scores above are computed with 5 and 40 reference captions per test image respectively. We show comparison results for beam size 1 and beam size 3 for both methods.

6. Results and Analysis

In this section, we demonstrate the following results:

• Our convolutional (or CNN) approach performs on par with LSTM (or RNN) based approaches on image captioning metrics (Table 1). Our performance improves with beam search (Table 2).

• Adding attention to our CNN gives improvements on metrics and we outperform the LSTM+Attn baseline [39] (Table 1). Figure 5 shows that with attention we identify salient objects for the given image.

• We analyze the CNN and RNN approaches and show that the CNN (1) produces more entropy in the output probability distribution, (2) gives better word prediction accuracy (Figure 6), and (3) does not suffer as much from vanishing gradients (Figure 8).

• In Table 4, we show that a CNN with 1.5× more parameters can be trained in comparable time. This is because we avoid the sequential processing of RNNs.

The details of our experimental setup and these results are discussed below. The PyTorch implementation of our convolutional image captioning is available at https://github.com/aditya12agd5/convcap

6.1. Dataset and Baselines

We conducted experiments on the MS COCO dataset [18]. Our train/val/test splits follow [16, 39]. We use 113287 training images, 5000 images for validation, and 5000 for testing. Henceforth, we will refer to our approach as CNN, and to our approach with attention (Section 5.1) as CNN+Attn. We use the following naming convention for our baselines: [16] is denoted by LSTM and [39] is referred to as LSTM+Attn.

6.2. Comparison on Image Captioning Metrics

We consider multiple conventional evaluation metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4 [23], METEOR [8], ROUGE [17], CIDEr [36] and SPICE [1]. See Table 1 for the performance on all these metrics for our val/test splits. Note that we obtain comparable CIDEr scores and better SPICE scores than LSTM on the test set with our CNN+Attn method. Our BLEU, METEOR, ROUGE scores are lower than the LSTM ones, but the margin is very small. Our CNN+Attn method outperforms the LSTM+Attn baseline on the test set for all metrics reported in [39]. For Table 1, we form the caption by choosing the maximum probability word at each step, and report the metrics for this single caption.
Figure 4: Captions generated by our CNN are compared to the LSTM and ground-truth caption. In the examples above our CNN can describe things like the black and white photo, polar bear/white bowl, number of bears, and the sign in the donut image, which the LSTM fails to do. The last image (rightmost) shows a failure case for the CNN. Typically we observe that CNN and LSTM captions are of similar quality. We use our CNN+Attn method (Section 5.1) and the MSCOCO test split for these results. The five examples are:
(1) LSTM: "a man and a woman in a suit and tie"; CNN: "a black and white photo of a man and woman in a suit"; GT: "A man sitting next to a woman while wearing a suit."
(2) LSTM: "a cat is laying down on a bed"; CNN: "a polar bear is drinking water from a white bowl"; GT: "A white polar bear laying on top of a pool of water."
(3) LSTM: "a bear is standing on a rock in a zoo"; CNN: "two bears are walking on a rock in the zoo"; GT: "Two bears touching noses standing on rocks."
(4) LSTM: "a box of donuts with a variety of toppings"; CNN: "a box of doughnuts with sprinkles and a sign"; GT: "A bunch of doughnuts with sprinkles on them."
(5) LSTM: "a dog and a dog in a field"; CNN: "two cows are standing in a field of grass"; GT: "A dog and a horse standing near each other."

Instead of sampling the maximum probability words, we also perform beam search with different beam sizes, for both the LSTM and our CNN methods. With beam search, we pick the maximum probability caption (sum of log word probabilities in the beam). The results reported in Table 2 demonstrate that with a beam size of 3 we achieve better BLEU, ROUGE, CIDEr scores than the LSTM and equal METEOR and SPICE scores.

In Table 3, we show the results obtained on the MSCOCO evaluation server. These results are computed over a test set of 40,775 images for which the ground-truth is not publicly available. We demonstrate that our method does better on all BLEU metrics; especially with beam size 3, we perform better than the LSTM based method.
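As described above, when decoding with beam search we keep the caption whose summed word log-probability is largest among the top-k beams. A toy sketch of this selection rule, with made-up candidates and scores:

```python
beams = [  # (caption, per-word log probabilities) -- illustrative values only
    (["a", "man", "riding", "a", "horse"],    [-0.2, -0.5, -1.1, -0.3, -0.9]),
    (["a", "man", "on", "a", "horse"],        [-0.2, -0.5, -1.4, -0.3, -0.7]),
    (["a", "person", "riding", "a", "horse"], [-0.2, -1.6, -1.0, -0.3, -0.9]),
]

best_caption, _ = max(beams, key=lambda b: sum(b[1]))   # maximum summed log probability
print(" ".join(best_caption))
```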
Comparison to recent state-of-the-art. For better performance on the MSCOCO leader board we use ResNet features instead of VGG-16. Table 5 shows that ResNet boosts our performance on the MSCOCO split (cf. Table 1) and we compare it to the more recent methods [2] and [41]. We are almost as good as [41]; if we had access to their pre-trained attribute network, we might outperform it. [2] uses a sophisticated attention mechanism, which can be incorporated into our architecture as part of future work.
6.3. Qualitative Comparison

See Figure 4 for a qualitative comparison of captions generated by the CNN and the LSTM. In Figure 5, we overlay the attention parameters on the image for each word prediction. Note that our attention parameters are 7×7 as described in Section 5.1 and therefore the image is divided into a 7×7 grid. These results show that our attention focuses on salient objects such as man, broccoli, ocean, bench, etc., when predicting these respective words. Our results also show that the attention is uniform when predicting words such as a, of, on, etc., which are unrelated to the image content.

6.4. Analysis of CNN and RNN

In Table 4 we report the number of trainable parameters and the training time per epoch. CNNs with ∼1.5× the parameters can be trained in comparable time.

Tables 1, 2 and 3 show that we obtain comparable performance from both CNN and RNN/LSTM-based methods. Encouraged by this result, we analyze the characteristics of these two methods. For a fair comparison, we use our CNN without attention, since the RNN method does not use spatial image features. First, we compare the negative log-likelihoods (or cross-entropy loss) on a subset of the train set and the entire val set (see Figure 6 (a)). We find that the loss is higher for the CNN than the RNN. This is because CNNs are being penalized for producing less-peaky word probability distributions. To evaluate this further, we plot the entropy of the output probability distribution (Figure 6 (b)) and the classification accuracy, i.e., the number of times the maximum probability word is the ground truth (Figure 6 (c)). These plots show that RNNs are good at producing low-entropy and therefore peaky word probability distributions at the output, while CNNs produce less peaky distributions (and higher entropy). Less peaky distributions are not necessarily bad, particularly for a problem like image captioning, where multiple word predictions are possible. Despite less peaky distributions, Figure 6 (c) shows that the maximum probability word is correct more often on the train set and is within approx. 1% accuracy on the val set. Note, cross-entropy loss is a proxy for the classification accuracy, and we show that CNNs have higher cross-entropy loss but good classification accuracy. The less peaky posterior distributions provided by a CNN may be indicative of CNNs being more capable of predicting diverse captions.
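The entropy statistic behind Figure 6 (b) can be computed directly from the per-word output distributions; a sketch with random stand-in logits:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 15, 9221)                   # (batch, word positions, |Y|)
p = F.softmax(logits, dim=-1)                       # output word distributions
entropy = -(p * torch.log(p + 1e-12)).sum(dim=-1)   # entropy per word position
print(entropy.mean().item())                        # higher value => less peaky posterior
```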
Diversity. In Figure 7, we plot the unique words and 2/4-grams predicted at every word position or time-step. The plot is for word positions 1 to 13. It shows that the CNN produces more unique words at more word positions and consistently more unique 2/4-grams than the LSTM. This supports our analysis that CNNs have less peaky (or one-hot) posteriors and can therefore produce more diversity. For these diversity experiments, we perform a beam search with beam size 10 and use all of the top 10 beams.
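The statistic plotted in Figure 7 simply counts, for each word position, how many distinct n-grams start there across the candidate captions; a sketch with placeholder captions:

```python
def unique_ngrams_per_position(captions, n=2, max_pos=13):
    counts = []
    for pos in range(max_pos):
        grams = {tuple(c[pos:pos + n]) for c in captions if len(c) >= pos + n}
        counts.append(len(grams))                  # distinct n-grams starting at pos
    return counts

captions = [["a", "man", "riding", "a", "horse"],
            ["a", "man", "on", "a", "horse"],
            ["a", "person", "riding", "a", "brown", "horse"]]
print(unique_ngrams_per_position(captions, n=2))
```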

Figure 5: Attention parameters are overlayed on the image. These results show that we focus on salient regions such as broccoli and bench when predicting these words, and that the attention is uniform when predicting words such as a, of and on. The two examples are:
(top) CNN: "a plate of food with broccoli and rice"; GT: "A BBQ steak on a plate next to mashed potatoes and mixed vegetables."
(bottom) CNN: "a man sitting on a bench overlooking the ocean"; GT: "A man sitting on top of a bench near the ocean."
Method          # Parameters   Train time per epoch
LSTM [16]       13M            1529s
Our CNN         19M            1585s
Our CNN+Attn    20M            1620s

Table 4: We train a CNN faster per parameter than the LSTM. This is because the CNN is not sequential like the LSTM. We use the PyTorch implementation of [16] and our CNN-based method, and the timings are obtained on an Nvidia Titan X GPU.
Method              B1     B2     B3     B4     M      R      C
Our Resnet-101      .72    .549   .403   .293   .248   .527   .945
Our Resnet-152      .725   .555   .41    .299   .251   .532   .972
LSTM Resnet-152     .724   .552   .405   .294   .251   .532   .961
[41] Resnet-152     .731   .564   .426   .321   .252   .537   .984
[2] Resnet-101      .772   -      -      .362   .27    .564   1.13

Table 5: Comparison to recent state-of-the-art with Resnet.
the space of a caption into a product space of individual
to suffer from vanishing gradient problems, in Fig- words are compelling. The success of RNNs for image cap-
ure 8, we plot the gradient norm at the output embed- tioning is based on a key component, i.e., the Long-Short-
ding/classification layer and the gradient norm at the in- Term-Memory (LSTM) [10] or recent alternatives like the
put embedding layer. The values are averaged over 1 gated recurrent unit (GRU) [6]. These components capture
training epoch. These plots show that the gradients in long-term dependencies by adding a memory cell, and they
RNN/LSTM diminishes more than the ones in CNNs. address the vanishing or exploding gradient issue of classi-
Hence RNN/LSTM nets are more likely to suffer from van- cal RNNs to some degree.
ishing gradients, which stalls learning. If learning is stalled, Based on this success, [19] train a vision (or image)
for larger datasets than the ones we currently use for image CNN and a language RNN that shares a joint embedding
captioning, the performance of RNN and CNN may differ layer. [37] jointly train a vision (or image) CNN with a
significantly. language RNN to generate sentences, [39] extends [37]
with additional attention parameters and learns to iden-
7. Related Work tify salient objects for caption generation. [16] use a bi-
directional RNN along with a structured loss function in a
Describing the content of an observed image is related shared vision-language space. [41] use an additional net-
to a large variety of tasks. Object detection [25, 26, 42] and work trained on coco-attributes, and [2, 28] develop an at-
semantic segmentation [21, 29, 12] can be used to obtain tention mechanism for captioning. These recurrent neural
a list of objects. Detection of co-occurrence patterns and nets have found widespread use for captioning because they

5567
(a) CNN gives higher cross-entropy loss on the train/val set of MSCOCO compared to LSTM. But, as we show in (c), CNN obtains better % word accuracy than LSTM. Therefore, it assigns max. probability to the correct word. The CNN loss is high because its output probability distributions have more entropy than LSTM.
(b) The entropy of the softmax layer (or posterior probability distribution) of our CNN is higher than the LSTM. For ambiguous problems such as image captioning, it is desirable to have a less peaky (multi-modal) posterior (like ours) capable of producing multiple captions, rather than a peaky one (like LSTM).
(c) Even though the CNN training loss is higher than LSTM, its word prediction accuracy is better than LSTM on the train set. On the val set, the difference in accuracy between LSTM and CNN is small (only ∼1%).

Figure 6: In the figures above we plot (a) Cross-entropy loss, (b) Entropy of the softmax layer, (c) Word accuracy on train/val set. Blue lines denote our CNN and red lines denote the LSTM based method [16]. Solid/dotted lines denote the train/val set of MSCOCO respectively. For the train set, we randomly sample 10k images; we use the entire val set.
Panels: (a) Unique words at every position, (b) Unique 2-grams at every position, (c) Unique 4-grams at every position.

Figure 7: We perform beam search of beam size 10 with our best performing LSTM and CNN models. We use the top 10 beams to plot the unique words and 2/4-grams predicted for every word position. The CNN (blue) produces more unique words and 2/4-grams at more positions, and therefore more diversity, than the LSTM (red).
Figure 8: Gradient norm at the embedding and classification layers. Here, we plot the gradient norm at the input embedding (dotted line) and output embedding/classification (solid line) layer over training epochs. The gradient to the first layer of the LSTM decays by a factor of ∼100, in contrast to our CNN, where it decays by a factor of ∼10. There is prior evidence in the literature that, unlike CNNs, RNNs/LSTMs suffer from vanishing gradients [24, 33].
Despite the fact that the above RNNs based on LSTM/GRU deliver remarkable results, e.g., for image captioning, their training procedure is all but trivial. For instance, while the forward pass during training can be run in parallel across samples, it is inherently sequential in time, limiting the parallelism. To address this issue, [34] proposed a PixelCNN architecture for conditional image generation that approximates an RNN. [9] and [35] demonstrate that convolutional architectures with attention achieve state-of-the-art performance on machine translation tasks. Similar in spirit is our approach for image captioning, which is convolutional but addresses a different task.

8. Conclusion

We discussed a convolutional approach for image captioning and showed that it performs on par with existing LSTM techniques. We also analyzed the differences between RNN based learning and our method, and found gradients of lower magnitude as well as overly confident predictions to be existing LSTM network concerns.

Acknowledgments. We thank Arun Mallya for the implementation of [16], Tanmay Gangwani for the beam search code used for Figure 7, and David Forsyth for insightful discussions and his comments. This material is based upon work supported in part by the National Science Foundation under Grant No. 1718221, NSF IIS-1421521 and by ONR MURI Award N00014-16-1-2007 and Samsung. We thank NVIDIA for the GPUs used for this work.
References

[1] P. Anderson, B. Fernando, M. Johnson, and S. Gould. Spice: Semantic propositional image caption evaluation. In ECCV, 2016.
[2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. arXiv preprint arXiv:1707.07998, 2017.
[3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015.
[4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[5] X. Chen and C. L. Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2422–2431, June 2015.
[6] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.
[7] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. F. Moura, D. Parikh, and D. Batra. Visual dialog. 2017.
[8] M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014.
[9] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. CoRR, abs/1705.03122, 2017.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
[11] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Int. Res., 47(1), May 2013.
[12] Y.-T. Hu, J.-B. Huang, and A. G. Schwing. MaskRNN: Instance Level Video Object Segmentation. In Proc. NIPS, 2017.
[13] U. Jain, S. Lazebnik, and A. G. Schwing. Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering. In Proc. CVPR, 2018.
[14] U. Jain, Z. Zhang, and A. G. Schwing. Creativity: Generating diverse questions using variational autoencoders. In Computer Vision and Pattern Recognition, 2017.
[15] Y. Jia, M. Salzmann, and T. Darrell. Learning cross-modality similarity for multinomial data. In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, pages 2407–2414, Washington, DC, USA, 2011. IEEE Computer Society.
[16] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3128–3137, June 2015.
[17] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. July 2004.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context, pages 740–755. Springer International Publishing, Cham, 2014.
[19] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). CoRR, abs/1412.6632, 2014.
[20] N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende. Generating natural questions about an image. In ACL (1). The Association for Computer Linguistics, 2016.
[21] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3376–3385, June 2015.
[22] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2text: Describing images using 1 million captioned photographs. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, pages 1143–1151, USA, 2011. Curran Associates Inc.
[23] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.
[24] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML'13, pages III-1310–III-1318. JMLR.org, 2013.
[25] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
[26] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 91–99, Cambridge, MA, USA, 2015. MIT Press.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, Dec 2015.
[28] I. Schwartz, A. G. Schwing, and T. Hazan. High-Order Attention Models for Visual Question Answering. In Proc. NIPS, 2017.
[29] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, Apr. 2017.
[30] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In Computer Vision and Pattern Recognition, 2016.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[32] R. Socher, A. Karpathy, Q. Le, C. Manning, and A. Ng. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218, 2014.
[33] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML '11, pages 1017–1024, New York, NY, USA, June 2011. ACM.
[34] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with pixelcnn decoders. In NIPS, 2016.
[35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
[36] R. Vedantam, C. L. Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In CVPR, pages 4566–4575. IEEE Computer Society, 2015.
[37] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):652–663, Apr. 2017.
[38] L. Wang, A. G. Schwing, and S. Lazebnik. Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space. In Proc. NIPS, 2017.
[39] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 2048–2057. JMLR.org, 2015.
[40] Y. Yang, C. L. Teo, H. Daumé, III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 444–454, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[41] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. Boosting image captioning with attributes. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 4904–4912, 2017.
[42] R. A. Yeh, J. Xiong, W.-M. Hwu, M. Do, and A. G. Schwing. Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts. In Proc. NIPS, 2017.
