
2017 IEEE Conference on Computer Vision and Pattern Recognition

Multi-level Attention Networks for Visual Question Answering∗

Dongfei Yu, Jianlong Fu, Tao Mei, Yong Rui


University of Science and Technology of China, Microsoft Research, Beijing, China
[email protected], {jianf, tmei}@microsoft.com, [email protected]

Abstract

Inspired by the recent success of text-based question answering, visual question answering (VQA) is proposed to automatically answer natural language questions with reference to a given image. Compared with text-based QA, VQA is more challenging because the reasoning process in the visual domain needs both effective semantic embedding and fine-grained visual understanding. Existing approaches predominantly infer answers from abstract low-level visual features, while neglecting the modeling of high-level image semantics and the rich spatial context of regions. To solve these challenges, we propose a multi-level attention network for visual question answering that can simultaneously reduce the semantic gap by semantic attention and benefit fine-grained spatial inference by visual attention. First, we generate semantic concepts from high-level semantics in convolutional neural networks (CNN) and select those question-related concepts as semantic attention. Second, we encode region-based middle-level outputs from the CNN into a spatially-embedded representation by a bidirectional recurrent neural network, and further pinpoint the answer-related regions by a multiple layer perceptron as visual attention. Third, we jointly optimize semantic attention, visual attention and question embedding by a softmax classifier to infer the final answer. Extensive experiments show the proposed approach outperforms the state of the art on two challenging VQA datasets.

∗ This work was performed when Dongfei Yu was visiting Microsoft Research as a research intern.

Figure 1. Overview of the multi-level attention network (MLAN). The proposed attention model highlights both question-related semantic concepts (i.e., "baseball," "game," "play") and image regions (i.e., the regions of "bat" and "glove").

1. Introduction

Visual question answering (VQA) has attracted extensive attention recently, since VQA is considered a step toward the milestone of "AI-complete" systems that enable a machine to reason across language and vision as humans do [38]. Compared with text-based QA systems in natural language processing (NLP), VQA takes one step further: it must answer a natural language question by considering the correspondence between the question and a reference image. The capability of automatic VQA can significantly promote mutual understanding between language and vision, and further benefit a variety of applications, such as assistive devices for the visually impaired, early education, service robots, and so on.

The challenges of visual question answering are two-fold: effective semantic embedding and fine-grained visual understanding. Most early works transfer the image captioning framework [5, 19, 28, 35] to VQA tasks [10, 18, 24] by combining convolutional neural networks (CNN) [14] and recurrent neural networks (RNN) [12]. Specifically, these works extract a global image representation from a pre-trained CNN model and a question representation from an RNN model. They then feed the joint embedding of language and vision into either a decoder RNN to generate free-form answers or a softmax classifier to infer the best answer from a predefined answer set (e.g., 1K answer categories in the VQA dataset [2]). Although promising results have been reported, further improvement suffers from the following limitations.



First, a human language question conveys strong high-level semantics with explicit query intention, while a real-world image with tens of thousands of pixels is relatively low-level and abstract, which poses grand challenges for deep image understanding due to the well-known semantic gap. Second, visual question answering requires fine-grained spatial inference, because some answers can only be inferred from highly-localized image regions for "what" and "where" questions.

To deal with these challenges, the state-of-the-art approaches proceed along two independent dimensions. First, some methods develop high-level semantic representations for images by introducing semantic concepts, image captions or even external knowledge bases into the typical CNN-RNN framework [30, 31]. Second, others focus on using region-based features to discover the most important regions for answering a question [9, 13, 17, 20, 26, 33, 34]. However, previous research still ignores using semantic attention to select the most discriminative concepts for a natural language question and using explicit spatial encoding for image regions.

To simultaneously learn semantic and spatial representations from images, we unify the two dimensions into a holistic learning framework. Specifically, we propose a novel multi-level attention network (MLAN) for visual question answering that highlights both question-related semantic concepts and local image regions in end-to-end training. Figure 1 shows the advantages of the proposed MLAN with an intuitive example. The proposed MLAN consists of three major components. First, semantic attention attends on the high-level image representation by discovering the concepts that are semantically close to the question in the same vocabulary set and joint embedding space. These concepts correspond to highly-frequent words in question/answer pairs and can represent a high-level understanding of image content. Specifically, a CNN-based recognizer is trained for each concept, and the distribution over the semantic output layer of the CNN constitutes the high-level representation of an image. Second, spatial attention is proposed to infer the image regions that should be attended by the question. Local region representations are first extracted from convolutional layers of the CNN and then fed into a bidirectional RNN model in a pre-defined order. Such a design enables the spatial information of a region to be encoded from its surrounding context. Attention scores for each region are further obtained by a multiple layer perceptron (MLP) taking both the context-aware visual representation and the question representation as input. Third, joint learning incorporates attended regions, attended concepts and question features by element-wise multiplication, followed by a softmax layer to predict the most probable answer from an answer set. We summarize the main contributions as follows:

• We address the challenges of automatic visual question answering by jointly learning multi-level attention, which can simultaneously reduce the semantic gap from vision to language and benefit fine-grained inference in VQA tasks.
• We introduce a novel spatial encoding approach for visual attention, which extracts context-aware visual features from ordered image regions by a bidirectional RNN model.
• We conduct experiments on two widely-used VQA datasets [2, 37], and obtain significant performance gains over both visual-only and semantic-only attention models.

2. Related Work

In this section, we first introduce the general CNN-RNN framework for both image captioning and visual question answering. Then, we summarize the most recent advances along two different dimensions.

CNN-RNN. Inspired by the success of the CNN-RNN framework in image captioning, most early works exploit variations of those models for the visual question answering task [2, 10, 18, 24]. They extract visual features from images via pre-trained convolutional neural networks (CNNs) and encode questions by recurrent neural networks (RNNs). Ren et al. [24] took their inspiration from [28], where the image is treated as the first token and fed into the RNN together with descriptions to learn a visual-semantic embedding. Instead of seeing the image once, Malinowski et al. [18] passed the image into the RNN at each time step when encoding the question, which is similar to the framework of [5] in automatic image captioning. Gao et al. [10] adapted m-RNN models [19] to deal with the VQA task in a multi-lingual setting. Agrawal et al. [2] released a large, human-annotated VQA dataset and evaluated several baseline models as well as human-level performance on it, which accelerated advances in this task. Although these early approaches show promising performance on VQA, they tend to fail on novel instances and rely heavily on the questions (i.e., they do not change the answer across images) [1].

Visual Attention. The visual attention mechanism is brought into VQA to address the "where to look" problem. Question-guided visual attention uses the semantic representation of a question as a query to search for the regions in an image that are related to the answer [9, 13, 17, 26, 34]. Two types of soft attention mechanism are well explored in the visual question answering task. The first type concatenates the question representation with each candidate region and puts them into a multiple layer perceptron (MLP) to compute the soft attention weights, while the second type obtains the attention score by the dot product of the two inputs [33]. Yang et al. [34] propose a stacked attention model which queries the image multiple times to infer the answer progressively. Lu et al. [17] exploit a question-image co-attention strategy to attend not only to related regions in images but also to important words in questions. Recently, Nam et al. [20] proposed Dual Attention Networks, which refine the visual and textual attention via multiple reasoning steps.

Figure 2. Overall framework of multi-level attention networks. Our framework consists of three components: (A) semantic attention, (B) context-aware visual attention and (C) joint attention learning. Here, we denote by v_q the representation of the question Q, and by v_{img}, v_c the representations of the image content on the visual and semantic level queried by the question, respectively. v_r and p_c^{img} are the activations of the last convolutional layer and the probability layer of the CNN.

Our work is different from co-attention and dual attention in that we attend to high-level concepts extracted from the image, rather than to words from the question. The major advantage of using concepts over questions is that the concepts are the semantic representation of the content in the image, not limited to the words in the question. In [9], Fukui et al. incorporate a powerful feature fusion method into visual attention and achieve impressive results on the VQA task. However, they have to keep a much higher dimension after fusion at the cost of more computation and storage.

High-level Concepts. Another branch of work also shows a promising direction for addressing VQA problems. Instead of low-level or middle-level visual features, these methods leverage high-level concepts [7, 8, 29], image captions or even visual stories [16], and external knowledge bases [30, 31]. Each concept corresponds to a word mined from the training image descriptions and represents some kind of attribute of the image content. These concepts act as semantic units between natural language and visual recognition, allowing the two modalities to exchange information [25]. However, the spatial information is completely lost in the procedure of high-level concept detection, which leads to inferior performance on the VQA task.

3. Multi-level Attention Networks

To simultaneously exploit higher-level semantic information and spatial information, we propose a novel multi-level attention network. The overall framework is presented in Figure 2. Our framework consists of three major components. Component (A), which is defined as semantic attention, aims at finding question-related concepts from the image. Component (B), which is defined as context-aware visual attention, aims at finding question-related regions and learning the visual representation of these regions. Component (C) is designed to incorporate information from different-level layers of the CNN by joint attention learning. These three components are jointly optimized end-to-end, which bridges the semantic gap between language and vision and learns fine-grained representations from image regions.

3.1. Semantic Attention

Semantic attention aims at finding the important concepts mined from the image to answer a question. For example, in Figure 1, although the concept detector has detected a set of objects and actions from the image (e.g., "group," "stand"), only those concepts which are semantically close to the question (i.e., "baseball," "game") should be highlighted by semantic attention. One of the core challenges in combining the visual and linguistic modalities is that they have different levels of abstraction, where language usually refers to general categories, while hundreds of pixels in the image can point to one instance [25]. Previous works on image/video captioning [6, 22, 23, 35, 36] and visual question answering [30, 31] have shown that extracting explicit high-level

concepts from images/videos can bring benefits to the interaction of visual content and language at the semantic level. Although an image can convey multiple semantics, not all of them are helpful for answering a particular question. Therefore, we propose to attend on concepts which are not only relevant to the image, but also semantically close to the question. We achieve these goals in two steps.

In the first step, we train a concept detector with deep convolutional neural networks, which produces the probability of each semantic concept for an image. Similar to [30], we first build a concept vocabulary, where each concept is defined as a single word. The top C highly-frequent words from the question-answer training pairs are collected into the concept vocabulary after stop-word removal. Besides, a multi-label image dataset based on these concepts is constructed from the COCO image captioning dataset [15] and used to train the concept detector. As a result, a fixed-length vector p_c^{img} is created for each image I by taking the activation f_c of the prediction layer of the CNN, which represents the probability of each concept occurring in the image. We denote the process of concept detection as:

p_c^{img} = f_c(I).                                   (1)

In the second step, we train an attention network to measure the semantic relevance between each concept in the vocabulary and the question. First, we represent the question by a recurrent neural network. Specifically, given the question Q = [q_1, q_2, ..., q_T], where q_t is the one-hot vector representation of the word at position t, we embed these words into a vector space through an embedding matrix W_{eq}. For each time step t, we feed the embedding vector x_t of word q_t to a Gated Recurrent Unit (GRU) layer, and pick the last hidden state h_T as the question representation, which is denoted as v_q. We use the following equations to formulate the question encoding model:

x_t = W_{eq} q_t,                                     (2)
h_t = GRU(x_t, h_{t-1}),                              (3)
v_q = h_T.                                            (4)
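As an illustration of Eqs. (2)-(4), a minimal PyTorch sketch of the question encoder is given below. The vocabulary size (9,853), word embedding size (620) and hidden size (2400) are taken from Section 4.3; the class and variable names are our own illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Encode a question into v_q = h_T with a word embedding and a GRU (Eqs. 2-4)."""
    def __init__(self, vocab_size=9853, embed_dim=620, hidden_dim=2400):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)           # W_eq
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, question_tokens):
        # question_tokens: (batch, T) integer word indices q_1 .. q_T
        x = self.embed(question_tokens)   # x_t = W_eq q_t
        _, h_T = self.gru(x)              # h_t = GRU(x_t, h_{t-1})
        return h_T.squeeze(0)             # v_q = h_T, shape (batch, hidden_dim)

# usage sketch: two questions of 12 tokens each
v_q = QuestionEncoder()(torch.randint(0, 9853, (2, 12)))
```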
Besides, we use the same vocabulary and embedding matrix for our concepts and questions, so that they can share the same semantic representation. Specifically, we represent the concept c with a semantic vector s_c through a two-layer stacked embedding:

s_c = W_{ec}(W_{eq} c),                               (5)

where c is the one-hot vector representation of the concept, W_{eq} is the word embedding matrix shared with the question model, and W_{ec} is the second embedding matrix, which projects the concept into the same dimension as the question representation. Next, we take the dot product of the projected concept vector s_c with the question vector v_q, and pass the resulting value through a sigmoid activation to get the relevance score between the concept c and the question Q. We then formulate the semantic attention weight of concept c as the product of the concept-image relevance p_c^{img} and the concept-question relevance p_c^q:

p_c^q = sigmoid(v_q · s_c),                           (6)
M_c = p_c^{img} p_c^q,                                (7)

where the operator · denotes the dot product of two vectors, p_c^q is the relevance score measuring the semantic similarity between the question Q and the concept c, and M_c is the semantic attention weight over concepts. Finally, we represent the high-level semantic information of image I queried by question Q as a weighted sum over all concept representations:

v_c = \sum_{i=1}^{C} M_c(i) s_c(i).                   (8)
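The semantic attention of Eqs. (5)-(8) could be sketched as follows. We assume the concept probabilities p_img come from the detector of Eq. (1), and that concept_token_ids holds the vocabulary indices of the C concept words so that the word embedding W_eq can be shared with the question encoder; dimensions follow Section 4.3, and all names are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Weight shared concept embeddings by concept-image and concept-question relevance (Eqs. 5-8)."""
    def __init__(self, word_embed, concept_token_ids, embed_dim=620, q_dim=2400):
        super().__init__()
        self.word_embed = word_embed                              # W_eq, shared with the question encoder
        self.project = nn.Linear(embed_dim, q_dim)                # W_ec
        self.register_buffer("concept_ids", concept_token_ids)   # LongTensor of the C concept word indices

    def forward(self, v_q, p_img):
        # v_q: (batch, q_dim) question vector; p_img: (batch, C) concept probabilities from Eq. (1)
        s_c = self.project(self.word_embed(self.concept_ids))    # (C, q_dim), Eq. 5
        p_q = torch.sigmoid(v_q @ s_c.t())                       # concept-question relevance, Eq. 6
        M_c = p_img * p_q                                        # semantic attention weights, Eq. 7
        v_c = M_c @ s_c                                          # weighted sum of concepts, Eq. 8
        return v_c, M_c
```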
3.2. Context-aware Visual Attention

Although semantic attention bridges the semantic gap between questions and images, it ignores the spatial information in images, which is important for representing the spatial context of image regions and is thus crucial in the visual question answering task. Hence, visual attention has been widely used in recent VQA frameworks, due to its success in fine-grained visual representation and visualization. Compared with human attention, recent work [4] finds that current VQA attention models do not seem to be "looking at" the same regions as humans do. One possible problem with current attention models is that they usually search for image regions one after another, dividing the whole image into several isolated units. Although promising results have been achieved, further improvements are limited, because many concepts may interact with each other through action and position relations. For example, we should be aware of the spatial relationship between the cat and the toilet if we want to really understand and answer the question "what is the cat standing on." In this case, not only the regions of the "cat" but also those regions at the bottom of the "cat" should be looked at and understood. To address this issue, we introduce a context-aware visual attention mechanism into our VQA framework.

Specifically, we first incorporate context information into the representation of each region by a bidirectional GRU encoder, which is illustrated in Figure 3. We use the CNN model fine-tuned for concept detection in the previous step to extract visual features for local regions.

Figure 3. An illustration of the context-aware visual representation for image regions by bidirectional GRU. Regions in the convolutional feature maps are encoded into the GRU in left-to-right and top-to-bottom order.

We take the feature map of the last convolutional layer of the CNN model as our visual representation, which preserves the complete spatial information of each region. We denote the visual representation of each region as {v_r, r = 1, 2, ..., R}, where v_r represents the feature vector of the r-th ordered region. We feed these feature vectors into the bidirectional GRU and combine the outputs from the forward and backward directions at each step to form a new feature vector for each region:

\overleftrightarrow{v}_r = GRU_f(v_r) + GRU_b(v_r),   (9)

where \overleftrightarrow{v}_r is the context-aware visual representation of image region r. The new feature vectors contain not only the visual information of the corresponding region but also the contextual information from surrounding regions. We set the dimension of the hidden state in each GRU to be the same as that of the question vector.
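A possible PyTorch sketch of the context encoding in Eq. (9) follows: region features from the last convolutional feature map are scanned left-to-right and top-to-bottom, and the forward and backward GRU outputs are summed. The region feature size (2048) follows Section 4.3; the class and variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ContextAwareRegions(nn.Module):
    """Encode ordered region features with a bidirectional GRU and sum both directions (Eq. 9)."""
    def __init__(self, feat_dim=2048, hidden_dim=2400):
        super().__init__()
        self.bigru = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, conv_map):
        # conv_map: (batch, feat_dim, H, W), activation of the last convolutional layer
        regions = conv_map.flatten(2).transpose(1, 2)   # (batch, R, feat_dim), row-major region order
        out, _ = self.bigru(regions)                    # (batch, R, 2 * hidden_dim)
        fwd, bwd = out.chunk(2, dim=-1)                 # GRU_f(v_r), GRU_b(v_r)
        return fwd + bwd                                # context-aware region features, (batch, R, hidden_dim)
```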
Second, we assign each region an attention score that models the relation between the region and the question. Different from semantic attention, which measures the semantic similarity between the question and the concept word by the dot product of two vectors, we align the question and each region by element-wise multiplication of the two vectors, and then feed them into a multiple layer perceptron (MLP). Such a design enables automatic learning of the attention function by parameter optimization in the MLP. More specifically, we search for regions via multi-step reasoning as in [34]. The main differences are two-fold. 1) We use the context-aware visual features obtained in the previous step to represent local regions, rather than the independent representation of each region from the convolutional neural network, which often lacks interactions between different regions. 2) We use element-wise multiplication instead of element-wise addition to align the question feature and the visual feature of each region, which overcomes the scale inconsistency problem in multi-modal feature pooling. The comparison experiment in Section 4.4 demonstrates our assumption. Specifically, we formulate our visual attention process as:

h = tanh((W_Q v_q + b_Q) ⊗ (W_I \overleftrightarrow{v}_r + b_I)),   (10)
M_r = softmax(W_p h + b),                                           (11)

where we denote ⊗ as the multiplication between a matrix and a vector, which is performed by element-wise multiplying each column of the matrix by the vector. W_Q and W_I are the corresponding embedding matrices, W_p is the parameter of the multiple layer perceptron, and M_r is the attention weight over image regions.

Similar to semantic attention, we pool these regions with a weighted sum to get the visual representation of image I queried by question Q:

v_{img} = \sum_{i=1}^{R} M_r(i) \overleftrightarrow{v}_r(i).        (12)

In practice, we repeat the above process once more, as in [34], using the sum of the question feature and the attended region feature as the guide. We omit the details for concision.
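Eqs. (10)-(12) might be implemented as in the sketch below: the question vector is broadcast over the regions, fused with the context-aware region features by element-wise multiplication, scored by an MLP, and used for weighted pooling. Only a single attention step is shown (the paper repeats it once, as in [34]); the 512-dimension attention space follows Section 4.3, and the remaining names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttention(nn.Module):
    """Question-guided attention over context-aware region features (Eqs. 10-12)."""
    def __init__(self, q_dim=2400, r_dim=2400, att_dim=512):
        super().__init__()
        self.W_Q = nn.Linear(q_dim, att_dim)   # W_Q, b_Q
        self.W_I = nn.Linear(r_dim, att_dim)   # W_I, b_I
        self.W_p = nn.Linear(att_dim, 1)       # W_p, b

    def forward(self, v_q, regions):
        # v_q: (batch, q_dim); regions: (batch, R, r_dim) context-aware features from Eq. (9)
        h = torch.tanh(self.W_Q(v_q).unsqueeze(1) * self.W_I(regions))   # Eq. 10
        M_r = F.softmax(self.W_p(h).squeeze(-1), dim=1)                  # Eq. 11, (batch, R)
        v_img = (M_r.unsqueeze(-1) * regions).sum(dim=1)                 # Eq. 12
        return v_img, M_r
```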
3.3. Joint Attention Learning

We use the question as a query to search for image information at different levels. At the low-level visual features, we focus on question-related regions by visual attention, while at the high-level semantic features, we focus on question-related concepts by semantic attention. The two levels of attention are combined by fusing their attended representations. In particular, we first add the question vector to the attended image features extracted from the different layers, and then use element-wise multiplication to combine the two types of attention. Finally, we feed the joint feature into a softmax layer to predict the probability over the predefined candidate answer set A. The candidate with the highest probability is determined as the final answer:

u = (v_q + v_{img}) ◦ (v_q + v_c),                    (13)
p_a = softmax(W u + b),                               (14)

where we denote ◦ as the element-wise multiplication between two vectors, and v_q, v_{img}, v_c are the representation of the question Q, the attended visual representation of the image I, and the attended semantic representation of the concepts C, respectively. u is the joint representation of the question, image and concepts. W and b are the parameters of the last fully connected layer, and p_a is the output of the softmax layer, i.e., the probability distribution over the answer candidates. The candidate with the maximum probability is picked as the predicted answer.
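The fusion and classification of Eqs. (13)-(14) reduce to a few lines; a minimal sketch is shown below, assuming v_q, v_img and v_c share one dimensionality and using the 3,000-answer candidate set of Section 4.1. Names are illustrative.

```python
import torch
import torch.nn as nn

class JointAttentionClassifier(nn.Module):
    """Fuse the attended visual and semantic representations with the question and classify (Eqs. 13-14)."""
    def __init__(self, dim=2400, num_answers=3000):
        super().__init__()
        self.fc = nn.Linear(dim, num_answers)   # W, b

    def forward(self, v_q, v_img, v_c):
        u = (v_q + v_img) * (v_q + v_c)          # Eq. 13, element-wise multiplication
        p_a = torch.softmax(self.fc(u), dim=-1)  # Eq. 14, distribution over answer candidates
        return p_a

# the predicted answer is the candidate with maximum probability: p_a.argmax(dim=-1)
```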

4. Experiment

4.1. Dataset

We evaluate our model on two large-scale VQA datasets, i.e., the VQA and Visual7W datasets, due to their large number of training instances and the diversity of question types.

VQA is a large-scale visual question answering dataset which contains 204,721 images from the COCO dataset and a newly created abstract scene dataset with 50,000 scene images. We evaluate our model on the real images only. For each image in the VQA dataset, three questions are annotated, and each question has 10 answers from 10 different annotators. We report our results on two different tasks, open-ended and multiple-choice. In the open-ended task, we select the answer with the highest activation from all possible outputs, and in the multiple-choice task, we pick the answer that has the highest activation among the given choices. We collect the most frequent 3000 answers in the training data as the candidate answer set. We evaluate the proposed approach not only on the validation set, but also on a test server, which is provided for blind evaluation on the test set for fair comparison [2].

Visual7W is a more recent VQA dataset built by [37], which is a subset of Visual Genome [3] (the largest visual QA dataset to date with 1.7 million QA pairs). Visual7W contains 327,939 question-answer pairs on 47,300 COCO images. Each question-answer pair is associated with 4 human-generated multiple choices, and only one of them is the correct answer. There are two major highlights of Visual7W. First, Visual7W provides dense annotations of object-level groundings, establishing an explicit link between QA pairs and image regions. Second, Visual7W allows pointing questions with visual answers, where the correct answer is one of four image regions. We evaluate our model only in the multiple-choice setting on this dataset.

Table 1. Ablation models on the test-dev set. The first three models only utilize semantic attention, while the middle three models only perform visual attention. MLAN denotes our full model, which applies attention on the multi-level representation of images.

Ablation Model                          Accuracy
Att-CNN + LSTM [30]                     55.57
Q + Concept                             56.62
Q + Semantic Attention                  59.28
SAN [34]                                58.68
Q + Visual Attention                    62.29
Q + context-aware Visual Attention      62.50
MLAN (Ours)                             63.69

4.2. Evaluation Metrics

Visual QA is formulated as a multi-class classification problem on both datasets. We follow the evaluation metrics of the baseline approaches on the two datasets. For the VQA dataset, [2] set up a public evaluation server for blind evaluation on the test set. The test set is divided into four splits, test-dev, test-standard, test-challenge and test-reserve, each of which contains about 20K images. We evaluate our ablation models for experimental analysis on the test-dev set, and evaluate our best model on both the test-dev and test-standard sets. For the open-ended task, [2] use a voting mechanism to score the accuracy of a predicted answer:

acc(ans) = min{ (#humans that said ans) / 3, 1 },

where ans is the answer predicted by the visual QA model. For the Visual7W dataset, we use the evaluation code released by [37], which considers the model correct on a question if it selects the correct answer candidate. Accuracy is used to measure the performance.
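The open-ended accuracy above can be computed as in the short sketch below. Note that the official evaluation script additionally normalizes answer strings and averages the score over subsets of 9 annotators; this sketch applies the formula exactly as written.

```python
def vqa_accuracy(predicted, human_answers):
    """acc(ans) = min(#humans that said ans / 3, 1), with 10 human answers per question."""
    votes = sum(answer == predicted for answer in human_answers)
    return min(votes / 3.0, 1.0)

# example: 3 of 10 annotators agree -> accuracy 1.0; 2 of 10 agree -> about 0.67
print(vqa_accuracy("blue", ["blue"] * 3 + ["navy"] * 7))            # 1.0
print(round(vqa_accuracy("blue", ["blue"] * 2 + ["navy"] * 8), 2))  # 0.67
```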
4.3. Experiment Setting

We describe our experimental settings, hyper-parameters and training process here. For the question model, we use the natural language toolkit NLTK (http://www.nltk.org/) to tokenize questions, cast all words into lowercase, and only keep those words appearing at least twice in the train-val set. We do not apply any additional preprocessing to those words, e.g., removing stop words or stemming. Finally, we obtain a question vocabulary of 9,853 words for the VQA dataset. As mentioned in Section 3.1, a single-layer GRU with 620-dimension word vectors and 2400-dimension hidden states is used to encode the question. We take the last hidden state of the GRU layer as the question representation, so the dimension of the question feature vector is 2400.

For the concept model, we select the most frequent 256 words appearing in the question-answer training pairs as our concept vocabulary after removing stop words. We detect concepts from images by taking the activation of the last layer of a ResNet model [11] fine-tuned on our multi-label dataset derived from the MSCOCO dataset. There are two major differences between our concept detector and that of [30]. First, we use a more powerful classification model, i.e., ResNet with 152 layers pre-trained on ImageNet, instead of VGGNet with 19 layers [27]. Second, we use the most common loss function for multi-label classification, sigmoid cross-entropy, to fine-tune the network. Each concept gets the same embedding vector as its corresponding question word, i.e., 2400 dimensions. We project the question vector and the concept vector into a 512-dimension space, and then perform attention on the concepts.
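The concept detector described above (ResNet-152 with a 256-way multi-label head trained with sigmoid cross-entropy) could be set up roughly as follows. This is a sketch using torchvision, not the authors' code; the construction of the binary concept targets from MSCOCO captions is assumed to be done elsewhere.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_concept_detector(num_concepts=256):
    """ImageNet-pretrained ResNet-152 with a multi-label concept prediction head."""
    net = models.resnet152(pretrained=True)
    net.fc = nn.Linear(net.fc.in_features, num_concepts)   # one logit per concept word
    return net

def finetune_step(net, optimizer, images, concept_targets):
    # concept_targets: (batch, 256) binary indicators of concept presence for each image
    criterion = nn.BCEWithLogitsLoss()   # sigmoid cross-entropy for multi-label classification
    loss = criterion(net(images), concept_targets.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# at test time, p_img = torch.sigmoid(net(image)) gives the concept probabilities of Eq. (1)
```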
For the image model, we extract visual features from the last convolutional layer (i.e., "res5c") of the same ResNet-152 model used for concept detection. Each feature vector has a dimension of 2048 and corresponds to a 32 × 32 pixel region of the input image. As with attention on the semantic level, we embed the 2048-dimension feature vectors to 2400 dimensions by the bidirectional GRU, project the image and this context-aware representation into the same 512-dimension space, and then perform attention on the visual representation.

Table 2. Comparison results on the VQA dataset. We divide the compared approaches into five categories based on their attention mechanisms. Category I does not use any attention. Category II uses only visual attention. Category III extracts high-level concepts for image representation. Category IV applies attention on both images and questions. Category V includes different variations of our approach.

Cat.  Approach                        test-dev Open-Ended        MC     test-standard Open-Ended   MC
                                      Y/N   Num   Other  All     All    Y/N   Num   Other  All     All
I     LSTM Q + I [2]                  78.9  35.2  36.4   53.7    57.2   79.0  35.6  36.8   54.1    57.8
I     deeper + norm [2]               80.5  36.8  43.1   57.8    62.7   80.6  36.5  43.7   58.2    63.1
I     DPPnet [21]                     80.7  37.2  41.7   57.2    -      80.3  36.9  42.2   57.4    -
II    SAN [34]                        79.3  36.6  46.1   58.7    -      -     -     -      58.9    -
II    FDA [13]                        81.1  36.2  45.8   59.2    -      -     -     -      59.5    -
II    DMN+ [32]                       80.5  36.8  48.3   60.3    -      -     -     -      60.4    -
II    MCB + Att. [9]                  82.2  37.7  54.8   64.2    68.6   -     -     -      -       -
II    MCB + Att. + GloVe [9]          82.5  37.6  55.6   64.7    69.1   -     -     -      -       -
II    MCB + Att. + GloVe + VG [9]     82.3  37.2  57.4   65.4    69.9   -     -     -      -       -
III   AC [31]                         79.8  36.8  43.1   57.5    -      79.7  36.0  43.4   57.6    -
III   ACK [31]                        81.0  38.4  45.2   59.2    -      81.1  37.1  45.8   59.4    -
IV    HieCoAtt [17]                   79.7  38.7  51.7   61.8    65.8   -     -     -      62.1    66.1
IV    DAN [20]                        83.0  39.1  53.9   64.3    69.1   82.8  39.1  54.0   64.2    69.0
V     MLAN (ResNet)                   82.9  39.2  52.8   63.7    68.9   -     -     -      -       -
V     MLAN (ResNet, train+val)        83.8  40.2  53.7   64.6    69.8   83.7  40.9  53.7   64.8    69.9
V     MLAN (ResNet, train+val+VG)     81.8  41.2  56.7   65.3    70.0   81.3  41.9  56.5   65.2    70.0

In our experiments, we use stochastic gradient descent with momentum 0.9 as the solver. The batch size is fixed to 100. We set the base learning rate to 0.05. After 15 epochs, we drop the learning rate to one tenth of the previous value every 5 epochs. In addition, gradient clipping and dropout are exploited during training. For the Visual7W dataset, we use exactly the same parameter settings and training options as for the VQA dataset. We evaluate our model only in the multiple-choice setting, and split the dataset into train, validation and test following [37].
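The optimization schedule described above could be configured roughly as in the following sketch: SGD with momentum 0.9, a base learning rate of 0.05, a ten-fold decay every 5 epochs after epoch 15, and gradient clipping. The placeholder model, the exact decay epochs and the clipping threshold are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(2400, 3000)   # placeholder standing in for the full MLAN model
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

def adjust_learning_rate(optimizer, epoch, base_lr=0.05):
    """Keep the base rate for the first 15 epochs, then divide by 10 every 5 epochs."""
    decay_steps = 0 if epoch < 15 else (epoch - 15) // 5 + 1
    for group in optimizer.param_groups:
        group["lr"] = base_lr * (0.1 ** decay_steps)

def training_step(inputs, targets, epoch):
    adjust_learning_rate(optimizer, epoch)
    loss = F.cross_entropy(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)   # clipping threshold assumed
    optimizer.step()
    return loss.item()
```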
4.4. Ablation model

To analyze the contribution of each component in our model and demonstrate how multi-level attention works better than single-level attention, we ablate the full model and examine the effectiveness of each component:

• Att-CNN + LSTM [30]: the attribute representation is fed as the first input of the LSTM, followed by the question
• Q + Concept: a simple version of semantic attention, taking the output of the concept detector as the attention weights, independent of the question
• Q + Semantic Attention: the first component of our model, taking the relation of the concepts to both the image and the question into the attention weights
• SAN [34]: a visual attention model similar to our second component
• Q + Visual Attention: our visual attention model without the context-aware visual representation
• Q + context-aware Visual Attention: the second component of our model, i.e., removing semantic attention from the full model
• Q + Multi-level Attention: our full model, fusing attention on different-level image representations

We report the performance of our ablation models on the test-dev set of the VQA dataset in Table 1. These models are trained on the training set and half of the validation set, as in [34]. Further analysis is given in the next section.

4.5. Result and Analysis

We explain how each component works in our model via the ablation experiments shown in Table 1. It is observed that our multi-level attention model significantly outperforms all single-level attention models, i.e., attention on semantic-level concepts and attention on the region-based visual features.

The first three rows in Table 1 compare our semantic attention model with the models that use high-level concepts but no attention mechanism. We obtain a 2.7% performance gain when we attend to concepts related to both the image and the question. This demonstrates that attention on high-level concepts is effective and can find more important semantic information from the image while removing noisy information irrelevant to the question.

The middle three rows in Table 1 support our two contributions to the visual attention mechanism. We use element-wise multiplication to replace the addition in the SAN [34] model and get better performance, which supports our assumption that element-wise multiplication is a better multimodal fusion approach than addition in the visual question answering task.
The second contribution is that we incorporate contextual information from surrounding regions into the target regions, which benefits spatial inference in images. The improvement is smaller than we expected. We conjecture that there might be two reasons. First, our current context encoding scheme suffers from the long-term dependency problem of the bidirectional GRU and is not symmetric for surrounding regions in the horizontal and vertical directions, because a bidirectional GRU can only model a sequence rather than a 2D spatial map. Second, most images from COCO contain only a few objects; therefore, interaction among objects is not as common as in natural scenarios. We will verify this in our future work.

The last row in Table 1 joins the different-level attentions into one unified framework and achieves significant improvement compared with any single-level attention model. This demonstrates that the attention mechanisms on different-level image features are complementary and can benefit each other.

We compare our model with the state-of-the-art methods on the two large datasets. The results are shown in Table 2 for the VQA dataset and Table 3 for the Visual7W dataset, respectively. For a fair comparison, we report the results of a single model under several settings. [9] achieve a performance comparable with ours when they add the GloVe trick and additional training data. However, their method uses a much higher-dimensional fusion (16,000 dims vs. 2,400 dims), and drops by roughly 1% if they use features of comparable dimension. Their model has to make a trade-off between effectiveness and efficiency. [17] and [20] are two methods that also exploit both visual attention and textual attention; the difference is that they perform textual attention on the questions rather than on high-level concepts as in our model. We achieve better results than both of them because we exploit more concepts from the image than the question itself.

Figure 4. Qualitative results from visual question answering with attention visualization. Both the image regions related to the question and the high-level concepts are highlighted. Examples in the first row show that correctly attended image regions lead to the true answer, while the second row shows cases where the answer can be found directly from the attended concepts.

Table 3. Results on the Visual7W dataset. We report the independent and average accuracy on six question types, including "what, where, when, who, why and how."

Method          Wht.   Whr.   Whn.   Who    Why    How    Avg
LSTM-Att [37]   51.5   57.0   75.0   59.5   55.5   49.8   54.3
MCB+Att. [9]    60.3   70.4   79.5   69.2   58.2   51.1   62.2
MLAN (Ours)     60.5   71.2   79.6   69.4   58.0   50.8   62.4

5. Conclusion

We propose a novel Multi-level Attention Network that joins visual attention and semantic attention into an end-to-end framework to address automatic visual question answering. Visual attention enables fine-grained visual understanding queried by the question, while semantic attention narrows the domain gap between questions and images. Our model makes use of the complementarity of attention mechanisms on different-level representations. Extensive experiments on two large datasets demonstrate that we not only outperform any single-level attention model, but also achieve top results with a simple but effective framework. Future work includes further exploration of spatial encoding with context information, attention on sentence-level representations, and better fusion methods to join the different levels of attention.

References

[1] A. Agrawal, D. Batra, and D. Parikh. Analyzing the behavior of visual question answering models. In EMNLP, 2016.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015.
[3] A. Das, H. Agrawal, et al. Visual genome: connecting language and vision using crowdsourced dense image annotations. In IJCV, 2016.
[4] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? In EMNLP, 2016.
[5] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[6] H. Fang, S. Gupta, F. Landola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.
[7] J. Fu, J. Wang, Y. Rui, X.-J. Wang, T. Mei, and H. Lu. Image tag refinement with view-dependent concept representations. IEEE T-CSVT, 25(28):1409–1422, 2015.
[8] J. Fu, Y. Wu, T. Mei, J. Wang, H. Lu, and Y. Rui. Relaxing from vocabulary: Robust weakly-supervised deep learning for vocabulary-free image tagging. In ICCV, 2015.
[9] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
[10] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. In NIPS, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[13] I. Ilievski, S. Yan, and J. Feng. A focused dynamic attention model for visual question answering. In ECCV, 2016.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[15] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[16] Y. Liu, J. Fu, T. Mei, and C. W. Chen. Let your photos talk: Generating narrative paragraph for photo stream via bidirectional attention recurrent neural networks. In AAAI, pages 1445–1452, 2017.
[17] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.
[18] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In ICCV, 2015.
[19] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR, 2015.
[20] H. Nam, J. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. In arXiv:1611.00471, 2016.
[21] H. Noh, P. H. Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR, 2016.
[22] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR, 2016.
[23] Y. Pan, T. Yao, H. Li, and T. Mei. Video captioning with transferred semantic attributes. In arXiv preprint arXiv:1611.07675, 2016.
[24] M. Ren, R. Kiros, and R. S. Zemel. Exploring models and data for image question answering. In NIPS, 2015.
[25] M. Rohrbach. Attributes as semantic units between natural language and visual recognition. In arXiv:1604.03249, 2016.
[26] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In CVPR, 2016.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[28] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[29] J. Wang, J. Fu, T. Mei, and Y. Xu. Beyond object recognition: Visual sentiment analysis with deep coupled adjective and noun neural networks. In IJCAI, 2016.
[30] Q. Wu, C. Shen, L. Liu, A. Dick, and A. Hengel. What value do explicit high level concepts have in vision to language problems? In CVPR, 2016.
[31] Q. Wu, P. Wang, C. Shen, A. Dick, and A. Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, 2016.
[32] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016.
[33] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, 2016.
[34] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
[35] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. Boosting image captioning with attributes. In arXiv preprint arXiv:1611.01646, 2016.
[36] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, 2016.
[37] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. In CVPR, 2016.
[38] C. L. Zitnick, A. Agrawal, S. Antol, M. Mitchell, D. Batra, and D. Parikh. Measuring machine intelligence through visual question answering. AI Magazine, 37(1):63–72, 2016.

