Figure 2. Overall framework of multi-level attention networks. Our framework consists of three components: (A) semantic attention, (B) context-aware visual attention and (C) joint attention learning. Here, we denote by v_q the representation of the question Q, and by v_img and v_c the representations of the image content on the visual and semantic level queried by the question, respectively. v_r and p_c^img are the activations of the last convolutional layer and the probability layer of the CNN.
Our work is different from co-attention and dual-attention in that we attend to high-level concepts extracted from the image, rather than to words from the question. The major advantage of using concepts over question words is that the concepts are a semantic representation of the content of the image and are not limited to the words in the question. In [9], Fukui et al. incorporate a powerful feature fusion method into visual attention and achieve impressive results on the VQA task. However, they have to keep a much higher dimension after fusion, at the cost of more computation and storage.

High-level Concepts. There is another branch that also shows a promising direction for addressing VQA problems. Instead of low-level or middle-level visual features, these methods leverage high-level concepts [7, 8, 29], image captions or even visual stories [16], and external knowledge bases [30, 31]. Each concept corresponds to a word mined from the training image descriptions and represents some attribute of the content of the image. These concepts act as semantic units between natural language and visual recognition, allowing us to exchange information between the two modalities [25]. However, the spatial information is completely lost in the procedure of high-level concept detection, which leads to inferior performance on the VQA task.

3. Multi-level Attention Networks

To simultaneously exploit higher-level semantic information and spatial information, we propose a novel multi-level attention network. The overall framework is presented in Figure 2. Our framework consists of three major components. Component (A), which we define as semantic attention, aims at finding question-related concepts from the image. Component (B), which we define as context-aware visual attention, aims at finding question-related regions and learning visual representations of these regions. Component (C) is designed to incorporate information from different-level layers of the CNN by joint attention learning. These three components are jointly optimized end-to-end, which bridges the semantic gap between language and vision and learns fine-grained representations from image regions.
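To make the data flow between the three components concrete, the following PyTorch sketch wires hypothetical question-encoder, semantic-attention and visual-attention modules into one trainable model. It is schematic only: the module names, the lazily-sized projection layers and, in particular, the final fusion of the two attended vectors (component (C), whose joint attention learning is specified later in the paper and not in this excerpt) are assumptions rather than the paper's exact formulation.

```python
# Schematic end-to-end wiring of components (A), (B) and (C); the fusion and
# classifier below are placeholder assumptions, not the paper's Section 3.3.
import torch
import torch.nn as nn

class MLANSkeleton(nn.Module):
    def __init__(self, question_encoder, semantic_attention, visual_attention,
                 joint_dim=1200, num_answers=3000):
        super().__init__()
        self.question_encoder = question_encoder      # produces the question vector v_q
        self.semantic_attention = semantic_attention  # component (A): returns v_c
        self.visual_attention = visual_attention      # component (B): returns v_img
        # Assumed stand-in for component (C): project both attended vectors into a
        # common space and sum them before answer classification.
        self.proj_c = nn.LazyLinear(joint_dim)
        self.proj_v = nn.LazyLinear(joint_dim)
        self.classifier = nn.Linear(joint_dim, num_answers)

    def forward(self, question_tokens, region_feats, concept_probs):
        v_q = self.question_encoder(question_tokens)
        v_c, _ = self.semantic_attention(v_q, concept_probs)   # semantic level
        v_img, _ = self.visual_attention(region_feats, v_q)    # visual level
        joint = torch.tanh(self.proj_c(v_c)) + torch.tanh(self.proj_v(v_img))
        return self.classifier(joint)                          # scores over candidate answers
```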
3.1. Semantic Attention

Semantic attention aims at finding the important concepts mined from the image for answering a question. For example, in Figure 1, although the concept detector has detected a set of objects and actions in the image (e.g., "group," "stand"), only those concepts which are semantically close to the question (i.e., "baseball," "game") should be highlighted by semantic attention. One of the core challenges in combining the visual and linguistic modalities is that they have different levels of abstraction: language usually refers to general categories, while hundreds of pixels in an image can point to one instance [25]. Previous works on image/video captioning [6, 22, 23, 35, 36] and visual question answering [30, 31] have shown that extracting explicit high-level concepts from images or videos can benefit the interaction of visual content and language at the semantic level.
Although an image can convey multiple semantics, not all of them are helpful for answering a particular question. Therefore, we propose to attend to concepts which are not only relevant to the image but also semantically close to the question. We achieve these goals in two steps.

In the first step, we train a concept detector with deep convolutional neural networks, which produces the probability of each semantic concept for an image. Similar to [30], we first build a concept vocabulary, where each concept is defined as a single word. The C most frequent words from the question-answer training pairs are collected into the concept vocabulary after stop-word removal. In addition, a multi-label image dataset based on these concepts is constructed from the COCO image captioning dataset [15] and used to train the concept detector. As a result, a fixed-length vector p_c^img is created for each image I by taking the activation of the prediction layer (fc) of the CNN, which represents the probability of each concept occurring in the image. We denote the process of concept detection as:

p_c^img = f_c(I).   (1)
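A minimal sketch of the concept detector f_c in Eq. (1) as a multi-label classifier: the prediction layer of an ImageNet-pretrained CNN is resized to the C concepts and fine-tuned with a sigmoid cross-entropy loss, as described for ResNet-152 in Section 4.3. The torchvision backbone choice, C = 256 and the helper names are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

C = 256  # concept vocabulary size (Section 4.3)

backbone = models.resnet152(weights="IMAGENET1K_V1")
backbone.fc = nn.Linear(backbone.fc.in_features, C)   # prediction layer over the C concepts

criterion = nn.BCEWithLogitsLoss()  # sigmoid cross-entropy for multi-hot concept labels

def concept_probabilities(images):
    """p_c^img = f_c(I): per-concept occurrence probabilities for a batch of images."""
    return torch.sigmoid(backbone(images))

# One fine-tuning step on (image, multi-hot concept target) pairs mined from COCO captions:
#   logits = backbone(images); loss = criterion(logits, targets); loss.backward()
```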
In the second step, we train an attention network to measure the semantic relevance between each concept in the vocabulary and the question. First, we represent the question with a recurrent neural network. Specifically, given the question Q = [q_1, q_2, ..., q_T], where q_t is the one-hot vector representation of the word at position t, we embed these words into a vector space through an embedding matrix W_eq. At each time step t, we feed the embedding vector x_t of word q_t into a Gated Recurrent Unit (GRU) layer and pick the last hidden state h_T as the question representation, denoted v_q. We formulate the question encoding model as:

x_t = W_eq q_t,   (2)
h_t = GRU(x_t, h_{t-1}),   (3)
v_q = h_T.   (4)
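A minimal sketch of the question encoder in Eqs. (2)-(4): word indices are embedded by W_eq, passed through a single GRU layer, and the last hidden state h_T is returned as v_q. The 620/2400 dimensions follow Section 4.3; feeding integer indices instead of explicit one-hot vectors and ignoring padding are simplifying assumptions.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=620, hidden_dim=2400):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # x_t = W_eq q_t
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, T) integer word indices of the question
        x = self.embedding(token_ids)          # (batch, T, embed_dim)
        _, h_T = self.gru(x)                   # h_T: (1, batch, hidden_dim)
        return h_T.squeeze(0)                  # v_q = h_T, shape (batch, hidden_dim)

# Usage: v_q = QuestionEncoder(vocab_size=9853)(token_ids)
```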
Besides, we use the same vocabulary and embedding matrix for our concepts and questions, so that they share the same semantic representation. Specifically, we represent the concept c with a semantic vector s_c through a two-layer stacked embedding. The first layer shares the word embedding layer of the question model, and the second layer projects the concept vector into the same dimension as the question representation:

s_c = W_ec (W_eq c),   (5)

where c is the one-hot vector representation of the concept, W_eq is the embedding matrix shared with the question model, and W_ec is the second embedding matrix, which embeds the concept into the same dimension as the question representation. Next, we take the dot product of the projected concept vector s_c with the question vector v_q and pass the resulting value through a sigmoid activation layer to get the relevance score between the concept c and the question Q. We then formulate the semantic attention weight of the concept c as the product of the concept-image relevance p_c^img and the concept-question relevance p_c^q:

p_c^q = sigmoid(v_q · s_c),   (6)
M_c = p_c^img p_c^q,   (7)

where the operator · denotes the dot product of two vectors, p_c^q is the relevance score measuring the semantic similarity between the question Q and the concept c, and M_c is the semantic attention weight over concepts. Finally, we represent the high-level semantic information of image I queried by question Q as a weighted sum over all concept representations:

v_c = Σ_{i=1}^{C} M_c(i) s_c(i).   (8)
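A minimal sketch of semantic attention, Eqs. (5)-(8): concept words reuse the question embedding W_eq, are projected by a second matrix W_ec, scored against v_q with a sigmoid dot product, reweighted by the detector output p_c^img, and pooled into v_c. The batching conventions and the way concept word indices are supplied are assumptions; the intermediate 512-d projection mentioned in Section 4.3 is omitted for brevity.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    def __init__(self, word_embedding: nn.Embedding, concept_token_ids, question_dim=2400):
        super().__init__()
        self.word_embedding = word_embedding                    # shared W_eq from the question model
        self.register_buffer("concept_ids", torch.as_tensor(concept_token_ids))
        self.W_ec = nn.Linear(word_embedding.embedding_dim, question_dim, bias=False)  # W_ec

    def forward(self, v_q, p_img):
        # v_q: (batch, question_dim); p_img: (batch, C) concept probabilities from the detector
        s_c = self.W_ec(self.word_embedding(self.concept_ids))  # (C, question_dim), Eq. (5)
        p_q = torch.sigmoid(v_q @ s_c.t())                       # (batch, C), Eq. (6)
        M_c = p_img * p_q                                        # (batch, C), Eq. (7)
        v_c = M_c @ s_c                                          # (batch, question_dim), Eq. (8)
        return v_c, M_c
```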
3.2. Context-aware Visual Attention

Although semantic attention bridges the semantic gap between questions and images, it ignores the spatial information in images, which is important for representing the spatial context of image regions and is thus crucial in the visual question answering task. Hence visual attention has been widely used in recent VQA frameworks, due to its success in fine-grained visual representation and visualization. Compared with human attention, recent work [4] finds that current VQA attention models do not seem to be "looking at" the same regions as humans do. One possible problem with current attention models is that they usually search image regions one after another, dividing the whole image into several isolated units. Although promising results have been achieved, further improvements are limited, because many concepts may interact with each other through action and position relations. For example, we should be aware of the spatial relationship between the cat and the toilet if we want to really understand and answer the question "what is the cat standing on." In this case, not only the regions containing the "cat" but also the regions below the "cat" should be looked at and understood. To address this issue, we introduce a context-aware visual attention mechanism into our VQA framework.

Specifically, we first incorporate the context information into the representation of each region by a bidirectional GRU encoder, which is illustrated in Figure 3. We use the fine-tuned CNN model for concept detection from
visual feature for each region, which overcomes the scale inconsistency problem in multi-modal feature pooling. The comparison experiment in Section 4.4 supports this assumption. Specifically, we formulate our visual attention process as:

h = tanh((W_Q v_q + b_Q) ⊗ (W_I v_r + b_I)),   (10)
M_r = softmax(W_p h + b),   (11)
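A minimal sketch of the context-aware visual attention in Eqs. (10)-(11): region features are first encoded by a bidirectional GRU so that each position carries context from its neighbours, then fused with the question vector by element-wise multiplication (⊗) and scored with a softmax. The feature sizes follow Section 4.3; treating the region grid as a row-major sequence and the final attention-weighted pooling into a single visual vector are assumptions, since the corresponding equations fall outside this excerpt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareVisualAttention(nn.Module):
    def __init__(self, region_dim=2048, context_dim=2400, question_dim=2400, att_dim=512):
        super().__init__()
        # Bidirectional GRU over the region sequence gives the context-aware v_r.
        self.context_gru = nn.GRU(region_dim, context_dim // 2, batch_first=True,
                                  bidirectional=True)
        self.W_I = nn.Linear(context_dim, att_dim)               # W_I v_r + b_I
        self.W_Q = nn.Linear(question_dim, att_dim)              # W_Q v_q + b_Q
        self.W_p = nn.Linear(att_dim, 1)                         # W_p h + b

    def forward(self, regions, v_q):
        # regions: (batch, num_regions, region_dim); v_q: (batch, question_dim)
        v_r, _ = self.context_gru(regions)                       # (batch, R, context_dim)
        h = torch.tanh(self.W_Q(v_q).unsqueeze(1) * self.W_I(v_r))  # Eq. (10), ⊗ fusion
        M_r = F.softmax(self.W_p(h).squeeze(-1), dim=1)          # Eq. (11), attention weights
        v_img = (M_r.unsqueeze(-1) * v_r).sum(dim=1)             # assumed weighted pooling
        return v_img, M_r
```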
4. Experiment

4.1. Dataset

We evaluate our model on two large-scale VQA datasets, i.e., the VQA and Visual7W datasets, chosen for their large number of training instances and the diversity of their question types.

VQA is a large-scale visual question answering dataset which contains 204,721 images from the COCO dataset and a newly created abstract scene dataset with 50,000 scene images. We evaluate our model only on the real images of this dataset. For each image in the VQA dataset, three questions are annotated, and each question has 10 answers from 10 different annotators. We report our results on two tasks, open-ended and multiple-choice. In the open-ended task, we select the answer with the highest activation from all possible outputs, and in the multiple-choice task, we pick the answer with the highest activation from the given choices. We collect the 3000 most frequent answers in the training data as the candidate answer set. We evaluate the proposed approach not only on the validation set but also on a test server, which is provided for blind evaluation on the test set for fair comparison [2].

Visual7W is a more recent VQA dataset built by [37], which is a subset of Visual Genome [3] (the largest visual QA dataset to date, with 1.7 million QA pairs). Visual7W contains 327,939 question-answer pairs on 47,300 COCO images. Each question-answer pair is associated with four human-generated multiple-choice answers, and only one of them is correct. There are two major highlights of Visual7W. First, Visual7W provides dense annotations of object-level groundings, establishing an explicit link between QA pairs and image regions. Second, Visual7W allows pointing questions with visual answers, where the correct answer is one of four image regions. We evaluate our model only in the multiple-choice setting on this dataset.

4.2. Evaluation Metrics

Visual QA is formulated as a multi-class classification problem on both datasets. We follow the evaluation metrics of the baseline approaches on the two datasets. For the VQA dataset, [2] provide a public evaluation server for blind evaluation on the test set. The test set is divided into four splits: test-dev, test-standard, test-challenge and test-reserve, each of which contains about 20K images. We evaluate our ablation models for experimental analysis on the test-dev set, and evaluate our best model on both the test-dev and test-standard sets. For the open-ended task, [2] use a voting mechanism to score the accuracy of a predicted answer:

acc(ans) = min{ (#humans that said ans) / 3, 1 },

where ans is the answer predicted by the visual QA model. For the Visual7W dataset, we use the evaluation code released by [37], which counts the model as correct on a question if it selects the correct answer candidate. Accuracy is used to measure the performance.
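A small sketch of the open-ended accuracy above: a predicted answer scores min(#matching human answers / 3, 1). The official evaluation additionally averages this score over leave-one-out subsets of the ten annotations and normalizes answer strings, which is omitted here for brevity.

```python
def vqa_accuracy(predicted, human_answers):
    """Open-ended VQA accuracy: an answer agreed on by >= 3 annotators counts as fully correct."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# vqa_accuracy("baseball", ["baseball"] * 7 + ["sports"] * 3)  -> 1.0
```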
Table 1. Ablation models on the test-dev set. The first three models only utilize semantic attention, while the middle three models only perform visual attention. MLAN denotes our full model, which applies attention to multi-level representations of images.

Ablation Model                         Accuracy
Att-CNN + LSTM [30]                    55.57
Q + Concept                            56.62
Q + Semantic Attention                 59.28
SAN [34]                               58.68
Q + Visual Attention                   62.29
Q + context-aware Visual Attention     62.50
MLAN (Ours)                            63.69

4.3. Experiment Setting

We describe our experimental settings, hyper-parameters and training process here. For the question model, we use the natural language toolkit NLTK (http://www.nltk.org/) to tokenize questions, cast all words into lowercase, and only keep those words appearing at least twice in the train-val set. We do not apply any additional preprocessing to those words, e.g., stop-word removal or stemming. In the end, we obtain a question vocabulary of 9853 words for the VQA dataset. As mentioned in Section 3.1, a single-layer GRU with 620-dimensional word vectors and 2400-dimensional hidden states is used to encode the question. We take the last hidden state of the GRU layer as the question representation, so the question feature vector has 2400 dimensions.

For the concept model, we select the 256 most frequent words appearing in the question-answer training pairs, after removing stop words, as our concept vocabulary. We detect concepts from images by taking the activation of the last layer of a ResNet model [11] fine-tuned on our multi-label dataset derived from the MSCOCO dataset. There are two major differences between our concept detector and that of [30]. First, we use a more powerful classification model, i.e., ResNet with 152 layers pre-trained on ImageNet, instead of VGGNet with 19 layers [27]. Second, we use the most common loss function for multi-label classification, "SigmoidCrossEntropyLoss," to fine-tune the network. For each concept, we use the same embedding vector as for the corresponding question word, i.e., 2400 dimensions. We project the question vector and the concept vectors into a 512-dimension space and then perform attention on the concepts.
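A minimal sketch of the preprocessing just described: NLTK tokenization and lowercasing, a question vocabulary keeping words seen at least twice, and a concept vocabulary of the 256 most frequent non-stopword words from the question-answer pairs. The function name, corpus handling and tie-breaking are assumptions.

```python
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords   # requires nltk.download("punkt") and nltk.download("stopwords")

def build_vocabularies(questions, qa_words, num_concepts=256, min_count=2):
    # Question vocabulary: lowercase tokens appearing at least `min_count` times.
    q_counts = Counter(w for q in questions for w in word_tokenize(q.lower()))
    question_vocab = [w for w, n in q_counts.items() if n >= min_count]

    # Concept vocabulary: most frequent non-stopword words from the QA training pairs.
    stops = set(stopwords.words("english"))
    c_counts = Counter(w for w in qa_words if w.lower() not in stops)
    concept_vocab = [w for w, _ in c_counts.most_common(num_concepts)]
    return question_vocab, concept_vocab
```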
Table 2. Comparison results on VQA dataset. We divide compared approaches into five categories based on different attention mechanisms. Category I does not use any attention. Category II uses only visual attention. Category III extracts high-level concepts for image representation. Category IV applies attention on both images and questions. Category V includes different variations of our approach. (Open Ended columns: Yes/No, Number, Other, All; MC column: All.)

                                            test-dev                            test-standard
Category  Approach                          Yes/No Number Other All   MC        Yes/No Number Other All   MC
I         LSTM Q + I [2]                    78.9   35.2   36.4  53.7  57.2      79.0   35.6   36.8  54.1  57.8
          deeper + norm [2]                 80.5   36.8   43.1  57.8  62.7      80.6   36.5   43.7  58.2  63.1
II        DPPnet [21]                       80.7   37.2   41.7  57.2  -         80.3   36.9   42.2  57.4  -
          SAN [34]                          79.3   36.6   46.1  58.7  -         -      -      -     58.9  -
          FDA [13]                          81.1   36.2   45.8  59.2  -         -      -      -     59.5  -
          DMN+ [32]                         80.5   36.8   48.3  60.3  -         -      -      -     60.4  -
          MCB + Att. [9]                    82.2   37.7   54.8  64.2  68.6      -      -      -     -     -
          MCB + Att. + GloVe [9]            82.5   37.6   55.6  64.7  69.1      -      -      -     -     -
          MCB + Att. + GloVe + VG [9]       82.3   37.2   57.4  65.4  69.9      -      -      -     -     -
III       AC [31]                           79.8   36.8   43.1  57.5  -         79.7   36.0   43.4  57.6  -
          ACK [31]                          81.0   38.4   45.2  59.2  -         81.1   37.1   45.8  59.4  -
IV        HieCoAtt [17]                     79.7   38.7   51.7  61.8  65.8      -      -      -     62.1  66.1
          DAN [20]                          83.0   39.1   53.9  64.3  69.1      82.8   39.1   54.0  64.2  69.0
V         MLAN (ResNet)                     82.9   39.2   52.8  63.7  68.9      -      -      -     -     -
          MLAN (ResNet, train+val)          83.8   40.2   53.7  64.6  69.8      83.7   40.9   53.7  64.8  69.9
          MLAN (ResNet, train+val+VG)       81.8   41.2   56.7  65.3  70.0      81.3   41.9   56.5  65.2  70.0
For the image model, we extract visual features from the last convolutional layer (i.e., "res5c") of the same ResNet-152 model used for concept detection. Each feature vector has a dimension of 2048 and corresponds to a 32 × 32 pixel region of the input image. As with attention on the semantic level, we embed the 2048-dimensional feature vectors into 2400 dimensions with a bidirectional GRU, project the image and this context-aware representation into the same 512-dimension space, and then perform attention on the visual representation.
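A minimal sketch of the region feature extraction just described: the activation of the last convolutional block of torchvision's ResNet-152 (the "res5c"-equivalent layer4) yields a grid of 2048-dimensional vectors, flattened into a sequence of region features. The 448 × 448 input resolution (giving a 14 × 14 grid at stride 32) is an assumption consistent with the stated 32 × 32 pixel coverage per vector.

```python
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet152(weights="IMAGENET1K_V1")
res5c = nn.Sequential(*list(resnet.children())[:-2])   # drop average pooling and fc

def region_features(images):
    # images: (batch, 3, 448, 448) normalized image tensors (assumed input size)
    fmap = res5c(images)                                # (batch, 2048, 14, 14)
    return fmap.flatten(2).transpose(1, 2)              # (batch, 196, 2048) region sequence
```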
In our experiments, we use stochastic gradient descent with momentum 0.9 as the solver. The batch size is fixed to 100. We set the base learning rate to 0.05. After 15 epochs, we drop the learning rate to one tenth of the previous value every 5 epochs. In addition, gradient clipping and dropout are used during training. For the Visual7W dataset, we use exactly the same parameter settings and training options as for the VQA dataset. We evaluate our model only in the multiple-choice setting, and split the dataset into train, validation and test following [37].
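A minimal sketch of the optimization schedule just described: SGD with momentum 0.9, batch size 100, a base learning rate of 0.05 dropped to one tenth every 5 epochs after epoch 15, with gradient clipping and dropout. The clipping norm, number of epochs, loss function and data-loader interface are not specified in this excerpt and are assumptions.

```python
import torch
import torch.nn.functional as F

def learning_rate(epoch, base_lr=0.05):
    # Base rate for the first 15 epochs, then one tenth of the previous rate every 5 epochs.
    if epoch < 15:
        return base_lr
    return base_lr * (0.1 ** ((epoch - 15) // 5 + 1))

def train(model, train_loader, num_epochs=30):            # num_epochs is an assumption
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
    for epoch in range(num_epochs):
        for group in optimizer.param_groups:
            group["lr"] = learning_rate(epoch)
        for images, questions, answers in train_loader:   # batches of size 100
            scores = model(images, questions)              # scores over the candidate answers
            loss = F.cross_entropy(scores, answers)        # multi-class classification (Section 4.2)
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # assumed norm
            optimizer.step()
```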
4.4. Ablation Model

To analyze the contribution of each component in our model and to demonstrate how multi-level attention works better than single-level attention, we ablate the full model and evaluate the effectiveness of each component:

• Att-CNN + LSTM [30]: the attribute representation is fed as the first input of the LSTM, followed by the question.
• Q + Concept: a simple version of semantic attention, taking the output of the concept detector as the attention weights, independent of the question.
• Q + Semantic Attention: the first component of our model, taking the relation of concepts to both the image and the question into the attention weights.
• SAN [34]: a visual attention model similar to our second component.
• Q + Visual Attention: our visual attention model without the context-aware visual representation.
• Q + context-aware Visual Attention: the second component of our model, i.e., the full model with semantic attention removed.
• Q + Multi-level Attention: our full model, fusing attention on different-level image representations.

We report the performance of our ablation models on the test-dev set of the VQA dataset in Table 1. These models are trained on the training set and half of the validation set, as in [34]. Further analysis is given in the next section.

4.5. Result and Analysis

We explain how each component works in our model through the ablation experiments shown in Table 1. It is observed that our multi-level attention model significantly outperforms every single-level attention model, i.e., attention on semantic-level concepts and attention on region-based visual features.

The first three rows in Table 1 compare our semantic attention model with models that use high-level concepts but no attention mechanism. We obtain a 2.7% performance gain when we attend to concepts related to both the image and the question. This demonstrates that attention on high-level concepts is effective and can find more important semantic information in the image while removing noisy information irrelevant to the question.

The middle three rows in Table 1 support our two contributions to the visual attention mechanism. We use element-wise multiplication to replace the addition in the SAN [34] model and obtain better performance, which supports our assumption that element-wise multiplication is a better multimodal fusion approach than addition for the visual question answering task.
Figure 4. Qualitative results of visual question answering with attention visualization. Both the image regions related to the question and the high-level concepts are highlighted. Examples in the first row show cases where correctly attended image regions lead to the true answer, while the second row shows cases where the answer can be found directly from the attended concepts.
The second contribution is that we incorporate contextual information from surrounding regions into the target regions, which benefits spatial inference in images. The improvement is more marginal than we expected. We conjecture that there might be two reasons. First, our current context encoding scheme suffers from the long-term dependency problems of the bidirectional GRU and is not symmetric for surrounding regions in the horizontal and vertical directions, because a bidirectional GRU can only model a sequence rather than a 2D spatial map. Second, most images in COCO contain only a few objects, so interactions among objects are not as common as in natural scenarios. We will verify this in future work.

The last row in Table 1 joins the different-level attentions into one unified framework and achieves a significant improvement over every single-level attention model. This demonstrates that attention mechanisms on different-level image features are complementary and can benefit each other.

We compare our model with the state-of-the-art methods on the two large datasets. The results are shown in Table 2 for the VQA dataset and Table 3 for the Visual7W dataset, respectively. For a fair comparison, we report single-model results under several settings. [9] achieve a performance comparable with ours when they add GloVe embeddings and additional training data. However, their method uses a much higher-dimensional fusion (16,000 dimensions vs. 2400 dimensions) and drops by about 1% if they use features of comparable dimension. Their model has to make a trade-off between effectiveness and efficiency. [17] and [20] are two methods that also exploit both visual attention and textual attention; the difference is that they perform textual attention on questions rather than on high-level concepts as in our model. We achieve better results than both of them because we exploit more concepts from the image than from the question itself.

Table 3. Results on Visual7W dataset. We report the independent and average accuracy on six question types, including "what, where, when, who, why and how."

Method           What   Where  When   Who    Why    How    Avg
LSTM-Att [37]    51.5   57.0   75.0   59.5   55.5   49.8   54.3
MCB+Att. [9]     60.3   70.4   79.5   69.2   58.2   51.1   62.2
MLAN (Ours)      60.5   71.2   79.6   69.4   58.0   50.8   62.4

5. Conclusion

We propose a novel Multi-level Attention Network that joins visual attention and semantic attention in an end-to-end framework for automatic visual question answering. Visual attention enables fine-grained visual understanding queried by the question, while semantic attention narrows the domain gap between questions and images. Our model makes use of the complementarity of attention mechanisms on different-level representations. Extensive experiments on two large datasets demonstrate that our model not only outperforms any single-level attention model, but also achieves top results with a simple yet effective framework. Future work includes further exploration of spatial encoding with context information, attention on sentence-level representations, and better fusion methods for joining different-level attentions.
References

[1] A. Agrawal, D. Batra, and D. Parikh. Analyzing the behavior of visual question answering models. In EMNLP, 2016.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015.
[3] A. Das, H. Agrawal, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. In IJCV, 2016.
[4] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? In EMNLP, 2016.
[5] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[6] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.
[7] J. Fu, J. Wang, Y. Rui, X.-J. Wang, T. Mei, and H. Lu. Image tag refinement with view-dependent concept representations. IEEE T-CSVT, 25(28):1409–1422, 2015.
[8] J. Fu, Y. Wu, T. Mei, J. Wang, H. Lu, and Y. Rui. Relaxing from vocabulary: Robust weakly-supervised deep learning for vocabulary-free image tagging. In ICCV, 2015.
[9] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
[10] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. In NIPS, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[13] I. Ilievski, S. Yan, and J. Feng. A focused dynamic attention model for visual question answering. In ECCV, 2016.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[15] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[16] Y. Liu, J. Fu, T. Mei, and C. W. Chen. Let your photos talk: Generating narrative paragraph for photo stream via bidirectional attention recurrent neural networks. In AAAI, pages 1445–1452, 2017.
[17] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.
[18] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In ICCV, 2015.
[19] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR, 2015.
[20] H. Nam, J. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471, 2016.
[21] H. Noh, P. H. Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR, 2016.
[22] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR, 2016.
[23] Y. Pan, T. Yao, H. Li, and T. Mei. Video captioning with transferred semantic attributes. arXiv preprint arXiv:1611.07675, 2016.
[24] M. Ren, R. Kiros, and R. S. Zemel. Exploring models and data for image question answering. In NIPS, 2015.
[25] M. Rohrbach. Attributes as semantic units between natural language and visual recognition. arXiv preprint arXiv:1604.03249, 2016.
[26] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In CVPR, 2016.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[28] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[29] J. Wang, J. Fu, T. Mei, and Y. Xu. Beyond object recognition: Visual sentiment analysis with deep coupled adjective and noun neural networks. In IJCAI, 2016.
[30] Q. Wu, C. Shen, L. Liu, A. Dick, and A. Hengel. What value do explicit high level concepts have in vision to language problems? In CVPR, 2016.
[31] Q. Wu, P. Wang, C. Shen, A. Dick, and A. Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, 2016.
[32] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016.
[33] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, 2016.
[34] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
[35] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. Boosting image captioning with attributes. arXiv preprint arXiv:1611.01646, 2016.
[36] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, 2016.
[37] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. In CVPR, 2016.
[38] C. L. Zitnick, A. Agrawal, S. Antol, M. Mitchell, D. Batra, and D. Parikh. Measuring machine intelligence through visual question answering. AI Magazine, 37(1):63–72, 2016.