Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s
ABSTRACT The image caption generation task requires expressing image content in accurate natural language. In the existing encoder-decoder structure, the decoder generates words one by one in a front-to-back order and is therefore unable to exploit the complete contextual information. This paper employs a Bi-LSTM (Bi-directional Long Short-Term Memory) structure, which not only draws on past information but also captures subsequent information, so that image content is predicted from contextual clues. The visual information is fed into the F-LSTM decoder (forward LSTM decoder) and the B-LSTM decoder (backward LSTM decoder) to extract semantic information, and the two semantic outputs complement each other. Specifically, the subsidiary attention mechanism S-Att acts between F-LSTM and B-LSTM, and the semantic information of B-LSTM and F-LSTM is extracted using this attention mechanism. Meanwhile, the semantic interaction is measured according to similarity while aligning the hidden states, and the fused semantic information is output. We adopt a Bi-LSTM-s model capable of extracting contextual information and effectively realizing finer-grained image captioning. In the end, our model improves by 9.7% over the original LSTM. In addition, it effectively alleviates the inconsistency between forward and backward semantic information at the same time step, and achieves a BLEU-4 score of 37.5. The superiority of this approach is demonstrated experimentally on the MSCOCO dataset.
INDEX TERMS Bi-LSTM, image caption generation, semantic fusion, semantic similarity.
the context, it is hence imperative to consider information from both prior and subsequent moments. To address this issue, this paper employs the Bi-LSTM [17], [18] structure, which comprises two LSTM neural networks, one forward and one backward. In contrast to the traditional unidirectional LSTM network [19], [20], [21], the Bi-LSTM structure considers the inherent regularities of the forward and backward data simultaneously, predicting from both the past and the future. It employs two independent hidden layers to process the forward and backward semantic information respectively, and the forward and backward outputs are then combined by summation, so that content is extracted from both the forward and the backward LSTM. As illustrated in Fig. 1, the forward and backward passes extract semantic features such as "riding" and "wave", while the attention mechanism extracts salient regions.

When Bi-LSTM is employed as the decoder, the captions generated for the same image by the forward and backward generation passes are prone to vary widely, and the semantic contents at the same time step barely match. When the current word is generated in forward order, the backward generation pass fails to offer effective context information synchronously; similarly, when it is generated in backward order, the forward generation pass fails to provide valid context information synchronously. Therefore, aiming to fully utilize context information while addressing the out-of-sync issue between the forward and backward directions, this paper proposes S-Att, a subsidiary attention mechanism between F-LSTM and B-LSTM that extracts the correlation intensity of the F-LSTM and B-LSTM semantic information. As a result, the semantic information is aligned and output in a complementary manner. This method addresses the limitation that forward and backward semantics cannot be produced synchronously, contributing to more precise sentence predictions.

Consequently, our final model employs the CNN-Bi-LSTM-s encoder-decoder, as indicated in Fig. 2. The CNN is employed to extract features, and attention mechanisms extract salient regions. The Bi-LSTM is employed to extract contextual information, with S-Att introduced to fuse semantics and align the complementary outputs.
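To make the decoding flow concrete, the sketch below shows one plausible way to run a forward (F-LSTM) and a backward (B-LSTM) pass over the caption embeddings and collect the per-step hidden states that the attention modules would later fuse. It is a minimal PyTorch illustration under our own assumptions (class name, dimensions, and the omission of the visual features and S-Att for brevity), not the authors' released implementation.

```python
import torch
import torch.nn as nn

class BiLSTMDecoderSketch(nn.Module):
    """Illustrative sketch: forward (F-LSTM) and backward (B-LSTM) passes over
    caption embeddings, producing per-step hidden states that a subsidiary
    attention module could later align and fuse."""

    def __init__(self, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.f_lstm = nn.LSTMCell(embed_dim, hidden_dim)  # forward decoder
        self.b_lstm = nn.LSTMCell(embed_dim, hidden_dim)  # backward decoder

    def forward(self, word_embeds):
        # word_embeds: (batch, seq_len, embed_dim); visual features omitted here
        batch, seq_len, _ = word_embeds.shape
        hf = cf = word_embeds.new_zeros(batch, self.f_lstm.hidden_size)
        hb = cb = word_embeds.new_zeros(batch, self.b_lstm.hidden_size)
        forward_states, backward_states = [], []
        for t in range(seq_len):                      # front-to-back pass
            hf, cf = self.f_lstm(word_embeds[:, t], (hf, cf))
            forward_states.append(hf)
        for t in reversed(range(seq_len)):            # back-to-front pass
            hb, cb = self.b_lstm(word_embeds[:, t], (hb, cb))
            backward_states.insert(0, hb)             # keep time order
        return torch.stack(forward_states, 1), torch.stack(backward_states, 1)

# usage: per time step, h_f[:, t-1] (past) and h_b[:, t+1] (future) can be fused
decoder = BiLSTMDecoderSketch()
embeds = torch.randn(2, 16, 512)                      # dummy caption embeddings
h_f, h_b = decoder(embeds)                            # each (2, 16, 512)
```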
In summary, our main contributions are as follows:
• We adopt Bi-LSTM as the decoder to extract features in different directions and obtain more fine-grained contextual information.
• We adopt the subsidiary attention mechanism to refine the semantic information and align the forward and backward hidden states through the similarity module, improving the output accuracy.
• We fuse the features extracted by visual attention and subsidiary attention to obtain complementary and progressively finer-grained sentences.

II. RELATED WORKS
Following the different generation approaches, current image caption generation algorithms fall into three types [19]: module-based matching algorithms [22], [23], [24], migration-based algorithms [25], [26], and neural network-based algorithms [1], [2], [11].

The module-based matching algorithm first identifies the objects, attributes, actions, and other information present in the image using multiple classifiers, and then puts the detected information into a manually designed sentence module to generate image captions. Although this algorithm is straightforward and intuitive, it remains difficult to recognize more sophisticated image information and to generate sentences with more complicated structures, given the constraints of the classifiers or sentence modules [23].

The migration-based algorithm retrieves similar images from an existing database and then regards the caption of the similar image as the caption of the image to be queried. Since the sentences in the database are entirely human-generated, the migration-based algorithm produces grammatically correct sentences. However, because the retrieved image and the image to be queried are merely similar rather than identical, the sentences generated in this way may not accurately describe the content of the image to be queried.

In recent years, deep neural networks have been successfully applied to image retrieval [27] and machine translation [28]. Inspired by this trend, a variety of image caption generation algorithms based on deep neural networks have been proposed, leading to great breakthroughs. This type of algorithm extracts image features using a CNN and then decodes them into fluent sentences using an RNN [29]. Unlike module-based matching or migration-based algorithms, neural network-based algorithms not only eliminate the limitation of sentence modules but also generate novel sentences not available in existing databases, owing to the representation capability of CNNs combined with the efficient modeling capability of RNNs for variable-length sequences. A novel parallel-fusion LSTM structure [30] adopts hidden states based on two parallel LSTMs to make attribute and visual image information complementary and mutually enhanced at each time step. Another structure eliminates the redundancy that exists in the training set and introduces adaptive weights to improve the ability to generalize captions [31]. A more sophisticated attention mechanism [32] is employed to extract salient region features; it combines sentence-level attention models with word-level attention models to generate more accurate captions. The work on exploring region relationships [33] implicitly explores the relationships between related semantics and dynamically searches the visual relationships among multiple regions, making image captions more accurate. The attribute-driven image captioning model [34] selects a specific area of the image and then decides which attribute to focus on, improving the coverage of visual attributes. The excellent performance of Bi-LSTM in machine translation has encouraged many tasks to adopt bidirectional LSTMs. In the automatic language identification task [35], Bi-LSTM effectively extracts "future" speech sequences, and the effect is remarkable.
FIGURE 1. Features of F-LSTM and B-LSTM extracted by the decoder. (a) The original image. (b) The visual features extracted by F-LSTM. (c) The visual features extracted by B-LSTM.
FIGURE 2. Our proposed image caption attention framework. The model employs a CNN as the encoder to extract the visual region features of the image, and a Bi-LSTM as the decoder to extract the hidden state $h^f_{t-1}$ at the previous moment and the hidden state $h^b_{t+1}$ at the next moment. S-Att is introduced into the Bi-LSTM to fuse and complement the hidden states $h^f_t$ and $h^b_t$, thereby making further predictions.
The summation of $h^f_{t-1}$ and $h^b_{t+1}$ indicates that the Bi-LSTM extracts past and future information to obtain the hidden state $h^m_t$ based on visual attention; the summation of $h^f_{t-1}$ and $h^b_{t+1}$ corresponding to each position represents the hidden state $h^n_t$ of the context aligned to a fixed position via S-Att. Then, the final hidden state under dual attention is as follows:

$h_t = \lambda h^m_t + (1 - \lambda)\, h^n_t$  (15)
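As a small illustration of Eq. (15), the dual-attention fusion is a convex combination of the visual-attention state $h^m_t$ and the S-Att-aligned state $h^n_t$; the tensor shapes and the value of λ below are placeholders, not values from the paper.

```python
import torch

lam = 0.7                               # hyperparameter lambda in Eq. (15)
h_m = torch.randn(2, 512)               # hidden state from visual attention
h_n = torch.randn(2, 512)               # hidden state aligned by S-Att
h_t = lam * h_m + (1.0 - lam) * h_n     # final hidden state under dual attention
```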
The loss function of the bidirectional LSTM includes:

$L^{f}_{XE}(\theta) = -\sum_{t=1}^{T} \log p^{f}_{\theta}(y_t \mid y_{1:t-1})$  (16)

$L^{b}_{XE}(\theta) = -\sum_{t=1}^{T} \log p^{b}_{\theta}(y_t \mid y_{1:t-1})$  (17)

$L = L^{f}_{XE} + L^{b}_{XE}$  (18)
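The combined loss of Eqs. (16)–(18) can be sketched as follows, assuming each decoder outputs per-step vocabulary logits; the shared targets and the mean reduction are simplifications of our own (in practice the backward decoder would be trained on the reversed word order, and the paper sums over time steps).

```python
import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 9500, 2, 16
logits_f = torch.randn(batch, seq_len, vocab_size)   # F-LSTM word scores
logits_b = torch.randn(batch, seq_len, vocab_size)   # B-LSTM word scores
targets = torch.randint(0, vocab_size, (batch, seq_len))  # shared here for brevity

# Eq. (16)/(17): cross-entropy (negative log-likelihood) for each direction;
# mean reduction differs from the paper's sum over t only by a constant factor
loss_f = F.cross_entropy(logits_f.reshape(-1, vocab_size), targets.reshape(-1))
loss_b = F.cross_entropy(logits_b.reshape(-1, vocab_size), targets.reshape(-1))

# Eq. (18): final training loss
loss = loss_f + loss_b
```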
$L^{f}_{XE}$ and $L^{b}_{XE}$ stand for the loss functions of the F-LSTM and B-LSTM, respectively. The conventional cross-entropy error training strategy is employed, with $L$ as the final loss function. There is a distinction between training and testing conventional image captioning models, with testing relying on words previously generated by the model. When the preceding step's results are incorrect, the errors accumulate and succeeding words cannot be generated correctly. To address these issues, we approach image caption production as a reinforcement learning problem, directly optimizing sentence generation based on the model's evaluation metrics, with the ultimate goal of minimizing the negative expected return.

Each image in MSCOCO is collected from daily life, making it the primary experimental dataset for image captioning. Each image contains multiple target entities and is paired with five manually annotated captions. The dataset includes 91 target categories, 328,000 images, and 2.5 million labels. As the largest dataset with semantic segmentation, it provides 80 categories, over 330,000 images (200,000 of which are annotated), and over 1.5 million object instances. We adopt 110,000 images for training, 5,000 images for validation, and 5,000 images for testing [1].

B. EVALUATION METHODS
In our experiments, we adopt BLEU-1–4 [40], METEOR [42], ROUGE-L [43], CIDEr [39], and SPICE [3] to evaluate model performance; these metrics are widely used in image captioning.

C. DATA PRE-PROCESSING
In this paper, we implement a bidirectional LSTM with a subsidiary attention mechanism. Our parameter settings and experimental details are as follows.
First, we convert all words in the dataset to lowercase and truncate captions to a length of 16; words with a frequency of five or less are deleted, yielding a vocabulary of 9,500 words.
Second, we adopt the pre-trained ResNet-101 [44] to encode the image in the encoding phase, producing a visual feature map of size 14 × 14 with 2048 dimensions. The visual feature map is mainly applied to represent the fine-grained information of the image.
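The encoding step can be sketched with torchvision's pre-trained ResNet-101 by dropping its average-pooling and classification layers and keeping the last convolutional feature map; feeding 448 × 448 inputs to obtain the 14 × 14 grid is our assumption, since the paper does not state the input resolution.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Pre-trained ResNet-101 without its avgpool/fc head (recent torchvision API)
resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
encoder = nn.Sequential(*list(resnet.children())[:-2])
encoder.eval()

images = torch.randn(2, 3, 448, 448)      # 448 / 32 = 14, giving a 14 x 14 grid
with torch.no_grad():
    feats = encoder(images)               # (2, 2048, 14, 14)
    feats = feats.permute(0, 2, 3, 1)     # (2, 14, 14, 2048) region features
```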
TABLE 1. Verifying the performance of optimized CIDEr and auxiliary attention mechanism using MSCOCO.
TABLE 2. Verifying the performance of Bi-LSTM and parallel double-layer LSTM using MSCOCO.
E. EXPERIMENTAL RESULTS AND ANALYSIS
We design ablation experiments to evaluate the effectiveness of our proposed model in image captioning; all metric scores are computed on the MSCOCO Karpathy test split.
First, as shown in Table 1, the improvement brought by CIDEr optimization is validated on our model. XE represents training under the cross-entropy loss function, and RL indicates the result of optimizing the scoring index starting from the best XE-trained model. Second, two sets of models, Bi-LSTM and Bi-LSTM-s, are set up to verify the effectiveness of our auxiliary attention mechanism. Bi-LSTM solely employs the visual attention mechanism, while Bi-LSTM-s incorporates S-Att into Bi-LSTM. The training parameters are kept consistent in order to maintain fairness.
As illustrated in Tables 1–5, each table represents a different ablation experiment. In Table 1, we demonstrate that reinforcement learning brings a significant improvement in image captioning. In Tables 2 and 3, we prove the superiority of our model. In Table 4, we verify the effect of averaging versus taking the maximum of the inputs when semantic fusion occurs. Table 5 shows the influence of different hyperparameters on the experimental results.
The evaluation criteria B@1, B@2, B@3, B@4, M, R, S, and C represent BLEU-1–4, METEOR, ROUGE-L, SPICE, and CIDEr. BLEU calculates the n-gram similarity and penalizes sentences of insufficient length. METEOR focuses on the number of co-occurring words and establishes a penalty mechanism based on word-order changes to compute scores. ROUGE-L measures similarity by computing the longest common subsequence between the predicted sentence and the reference. SPICE encodes images into objects, attributes, and relationships, and then selects the highest-scoring statement based on scene graphs. CIDEr calculates similarity based on word frequency.
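As an illustration of how BLEU behaves, a single caption can be scored with NLTK (a third-party toolkit used here only for demonstration; the paper does not specify which metric implementation it uses):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "girl", "riding", "a", "wave", "on", "a", "surfboard"]]
candidate = ["a", "girl", "is", "riding", "a", "wave"]

# BLEU-4 with uniform n-gram weights; smoothing avoids zero scores on short captions
score = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```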
Firstly, as shown in Table 1, the cross-entropy loss function is compared with the optimized CIDEr training (using the CIDEr score as an example): the CIDEr score of our Bi-LSTM model climbed from 112.5 to 117.9, a gain of 5.4, and the CIDEr score of our Bi-LSTM-s model climbed from 118.6 to 121.3, a gain of 2.7, indicating that, as in the current leading methodologies, optimizing CIDEr on the basis of cross-entropy error brings a significant enhancement. Second, Bi-LSTM-s improved from 112.5 to 118.6 (by 6.1) in the cross-entropy loss experiment, and from 117.9 to 121.3 (by 3.4) in the optimized CIDEr experiment. This reflects that our subsidiary attention can efficiently extract and align the semantic relations of the forward and backward LSTMs and produce finer captions.

Secondly, as shown in Table 2, our ablation experiments are primarily utilized to validate our model's superiority, with the experimental training parameters held constant. To be specific, p-LSTM denotes the simultaneous superposition of two layers of LSTM, in which the hidden state $\tilde{h}^1_t$ of the first layer is learnt and transferred to the second layer; the input gate, forget gate, and output gate of the second-layer LSTM all employ $\tilde{h}^1_t$ as input. The final hidden state is as follows:

$\tilde{h}^2_t = \mathrm{LSTM}(\tilde{h}^1_t, \tilde{h}^2_{t-1})$  (21)

The final hidden state at time $t$ is derived from the first layer's hidden state $\tilde{h}^1_t$ and the second layer's hidden state at the previous moment, $\tilde{h}^2_{t-1}$. According to the optimized CIDEr score, the Bi-LSTM model has improved by 10.3 points. In our model, the hidden state computation $h_t$ is related not only to the current input but also to $h^f_{t-1}$ and $h^b_{t+1}$. Bi-LSTM considers previous and future information simultaneously; thus, it truly achieves context-based output.
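A minimal sketch of the stacked p-LSTM baseline of Eq. (21), in which the second layer consumes the first layer's hidden state $\tilde{h}^1_t$ at every step; the names and dimensions are illustrative assumptions rather than the configuration used in the experiments.

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, batch, seq_len = 512, 512, 2, 16
lstm1 = nn.LSTMCell(embed_dim, hidden_dim)     # first p-LSTM layer
lstm2 = nn.LSTMCell(hidden_dim, hidden_dim)    # second layer, fed by h1_t

x = torch.randn(batch, seq_len, embed_dim)     # dummy word embeddings
h1 = c1 = torch.zeros(batch, hidden_dim)
h2 = c2 = torch.zeros(batch, hidden_dim)

for t in range(seq_len):
    h1, c1 = lstm1(x[:, t], (h1, c1))          # h1_t from the current input
    h2, c2 = lstm2(h1, (h2, c2))               # Eq. (21): h2_t = LSTM(h1_t, h2_{t-1})
```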
Aiming to demonstrate the superiority of our model, we evaluate it against eight metrics and six prominent methods, as shown in Table 3. First, the foundational models are established. The most typical model, NIC, does not include an attention mechanism. The goal of Soft-Attention is to introduce a soft attention mechanism into the task. The attention mechanism is extended from the spatial to the channel dimension by SCA-CNN. SCST applies reinforcement learning to the optimization of sentence-level rewards.
TABLE 4. Forward and backward LSTM hidden state fusion on MSCOCO dataset.
Second, pLSTM-A-2, DAIC, and our model improve on the basis of the above models. pLSTM-A-2 encodes images using two separate encoders (MIML and CNN) and simultaneously merges the semantic information of the two decoders, resulting in more accurate and richer captions. DAIC feeds the encoder's image input to sentence-level and word-level attention respectively, while the final output combines sentence-level and word-level information to generate more accurate captions. Our model employs a bidirectional LSTM as the decoder, accepts both past and future information simultaneously, and truly achieves prediction based on contextual information. It also employs two attention mechanisms: one dynamically extracts visual information and integrates it with semantic information, while the other, the auxiliary attention, aligns the semantic information of the bidirectional LSTM, contributing to more diversified semantic information. Our model displays significant advantages in the scores.

As shown in Table 4, we consider the fusion of the forward and backward hidden states: Max takes the maximum of $h^f_{t-1}$ and $h^b_{t+1}$, while Mean is the weighted sum of $h^f_{t-1}$ and $h^b_{t+1}$. The data reveal that taking the average is slightly better. When fusing forward and backward semantics using Max, simply taking the forward or the backward value to obtain a single result causes insufficient semantics and the loss of partial semantics. On the other hand, Mean considers the shared scope of the forward and backward directions while retaining the original semantic information, thereby achieving fused semantics.
As shown in Table 5, under the combined effect of dual attention, an oversized selection of λ results in the extraction of unaligned forward and backward semantics, worsening the caption result; yet a small selection of λ leads to excessive reliance on similarity, so the prior semantics over-absorb the following semantics, also degrading the caption result.

F. VISUALIZATION
Fig. 4 depicts the visualization results, which allow us to better present our proposed approach, including the ground truth and the captions of F-LSTM, B-LSTM, and Bi-LSTM-s fused with semantic features based on the context. It also displays the interaction between Visual-Att, which focuses on key image regions, and the text dependencies, coupled with the extraction of keywords using the auxiliary attention mechanism; the corresponding visualization is presented at the same time. All of the image elements are derived from the MSCOCO dataset.
The figure reveals that our model can effectively extract fine-grained information, such as "polka dot", "wooden benches", and "red chairs". F-LSTM extracts "polka dot," and the semantic features are fused into the Bi-LSTM. F-LSTM extracts "wooden benches," with B-LSTM extracting "red chairs," and the one is complemented with the other for output. S-Att extracts "girl" and "women," presenting a dependency of 0.85; they are then fused to complement the output. Also, "surrounded" and "topped" have a dependency of 0.62, and the two are fused to complement the output. Fig. 5 depicts the fine-grained information extracted by our model. We set Bi-LSTM as the control group. The subsidiary
[4] J. Wu, T. Chen, H. Wu, Z. Yang, G. Luo, and L. Lin, "Fine-grained image captioning with global-local discriminative objective," IEEE Trans. Multimedia, vol. 23, pp. 2413–2427, 2021.
[5] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, "Advanced deep-learning techniques for salient and category-specific object detection: A survey," IEEE Signal Process. Mag., vol. 35, no. 1, pp. 84–100, Jan. 2018.
[6] L. Ruotsalainen, A. Morrison, M. Makela, J. Rantanen, and N. Sokolova, "Improving computer vision-based perception for collaborative indoor navigation," IEEE Sensors J., vol. 22, no. 6, pp. 4816–4826, Mar. 2022.
[7] S. Wu, D. Zhang, Z. Zhang, N. Yang, M. Li, and M. Zhou, "Dependency-to-dependency neural machine translation," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 11, pp. 2132–2141, Nov. 2018.
[8] M. A. Kastner, K. Umemura, I. Ide, Y. Kawanishi, T. Hirayama, K. Doman, D. Deguchi, H. Murase, and S. Satoh, "Imageability- and length-controllable image captioning," IEEE Access, vol. 9, pp. 162951–162961, 2021.
[9] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," 2014, arXiv:1406.1078.
[10] Y. Luo, J. Lu, X. Jiang, and B. Zhang, "Learning from architectural redundancy: Enhanced deep supervision in deep multipath encoder–decoder networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 9, pp. 4271–4284, Sep. 2022.
[11] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. Int. Conf. Mach. Learn., 2015, pp. 2048–2057.
[12] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, "Boosting image captioning with attributes," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4894–4902.
[13] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-up and top-down attention for image captioning and visual question answering," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6077–6086.
[14] L. Li, S. Tang, L. Deng, Y. Zhang, and Q. Tian, "Image caption with global-local attention," in Proc. 31st AAAI Conf. Artif. Intell. (AAAI), Feb. 2017, pp. 4133–4139.
[15] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4651–4659.
[16] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[17] T. Chen, R. Xu, Y. He, and X. Wang, "Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN," Expert Syst. Appl., vol. 72, pp. 221–230, Apr. 2017.
[18] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," 2015, arXiv:1508.01991.
[19] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, "Guiding the long-short term memory model for image caption generation," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2407–2415.
[20] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, "SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5659–5667.
[21] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing when to look: Adaptive attention via a visual sentinel for image captioning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 375–383.
[22] A. Gupta and P. Mannem, "From image annotation to image description," in Proc. 19th Int. Conf. Neural Inf. Process. (ICONIP), Doha, Qatar. Berlin, Germany: Springer, Nov. 2012, pp. 196–204.
[23] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "BabyTalk: Understanding and generating simple image descriptions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2891–2903, Dec. 2013.
[24] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every picture tells a story: Generating sentences from images," in Proc. 11th Eur. Conf. Comput. Vis. (ECCV), Heraklion, Greece. Berlin, Germany: Springer, Sep. 2010, pp. 15–29.
[25] P. Kuznetsova, V. Ordonez, A. Berg, T. Berg, and Y. Choi, "Collective generation of natural image descriptions," in Proc. 50th Annu. Meeting Assoc. Comput. Linguistics, vol. 1, 2012, pp. 359–368.
[26] V. Ordonez, X. Han, P. Kuznetsova, G. Kulkarni, M. Mitchell, K. Yamaguchi, K. Stratos, A. Goyal, J. Dodge, A. Mensch, H. Daumé, A. C. Berg, Y. Choi, and T. L. Berg, "Large scale retrieval and generation of image descriptions," Int. J. Comput. Vis., vol. 119, no. 1, pp. 46–59, Aug. 2016.
[27] S. Karaoglu, R. Tao, T. Gevers, and A. W. M. Smeulders, "Words matter: Scene text for image classification and retrieval," IEEE Trans. Multimedia, vol. 19, no. 5, pp. 1063–1076, May 2017.
[28] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," 2016, arXiv:1609.08144.
[29] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. Int. Conf. Mach. Learn., 2013, pp. 1310–1318.
[30] J. Zhang, K. Li, and Z. Wang, "Parallel-fusion LSTM with synchronous semantic and visual information for image captioning," J. Vis. Commun. Image Represent., vol. 75, Feb. 2021, Art. no. 103044.
[31] G. Sumbul, S. Nayak, and B. Demir, "SD-RSIC: Summarization-driven deep remote sensing image captioning," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 8, pp. 6922–6934, Aug. 2021.
[32] H. Wei, Z. Li, C. Zhang, and H. Ma, "The synergy of double attention: Combine sentence-level and word-level attention for image captioning," Comput. Vis. Image Understand., vol. 201, Dec. 2020, Art. no. 103068.
[33] Z. Zhang, Q. Wu, Y. Wang, and F. Chen, "Exploring region relationships implicitly: Image captioning with visual relationship attention," Image Vis. Comput., vol. 109, May 2021, Art. no. 104146.
[34] Y. Zhou, J. Long, S. Xu, and L. Shang, "Attribute-driven image captioning via soft-switch pointer," Pattern Recognit. Lett., vol. 152, pp. 34–41, Dec. 2021.
[35] H. S. Das and P. Roy, "A CNN-BiLSTM based hybrid model for Indian language identification," Appl. Acoust., vol. 182, Nov. 2021, Art. no. 108274.
[36] W. Li, L. Zhu, Y. Shi, K. Guo, and E. Cambria, "User reviews: Sentiment analysis using lexicon integrated two-channel CNN–LSTM family models," Appl. Soft Comput., vol. 94, Sep. 2020, Art. no. 106435.
[37] F. Guo, R. He, and J. Dang, "Implicit discourse relation recognition via a BiLSTM-CNN architecture with dynamic chunk-based max pooling," IEEE Access, vol. 7, pp. 169281–169292, 2019.
[38] G. Xu, Y. Meng, X. Zhou, Z. Yu, X. Wu, and L. Zhang, "Chinese event detection based on multi-feature fusion and BiLSTM," IEEE Access, vol. 7, pp. 134992–135004, 2019.
[39] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4566–4575.
[40] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, 2001, pp. 311–318.
[41] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. (ECCV). Berlin, Germany: Springer, 2014, pp. 740–755.
[42] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proc. ACL Workshop Intrinsic Extrinsic Eval. Measures Mach. Transl. Summarization, 2005, pp. 65–72.
[43] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Proc. Text Summarization Branches Out, 2004, pp. 74–81.
[44] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[45] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[46] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical sequence training for image captioning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7008–7024.