0% found this document useful (0 votes)
27 views

I Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

Uploaded by

Kubendiran KB
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views

I Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

Uploaded by

Kubendiran KB
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

Received 7 December 2022, accepted 24 December 2022, date of publication 26 December 2022, date of current version 2 January 2023.

Digital Object Identifier 10.1109/ACCESS.2022.3232508

Image Caption Generation Using Contextual


Information Fusion With Bi-LSTM-s
HUAWEI ZHANG , CHENGBO MA , ZHANJUN JIANG, AND JING LIAN, (Member, IEEE)
Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730000, China
Corresponding author: Huawei Zhang ([email protected])
This work was supported by the National Natural Science Foundation of China under Grant 62061023, Grant 61941109, and Grant
61961037.

ABSTRACT The image caption generation algorithm necessitates the expression of image content using
accurate natural language. Given the existing encoder-decoder algorithm structure, the decoder solely
generates words one by one in a front-to-back order and is unable to analyze integral contextual information.
This paper employs a Bi-LSTM (Bi-directional Long Short-Term Memory) structure, which not only draws
on past information but also captures subsequent information, resulting in the prediction of image content
subject to the context clues. The visual information is respectively fed into the F-LSTM decoder (forward
LSTM decoder) and B-LSTM decoder (backward LSTM decoder) to extract semantic information, along
with complementing semantic output. Specifically, the subsidiary attention mechanism S-Att acts between
F-LSTM and B-LSTM, while the semantic information of B-LSTM and F-LSTM is extracted using the
attention mechanism. Meanwhile, the semantic interaction is extracted pursuant to the similarity while
aligning the hidden states, resulting in the output of the fused semantic information. We adopt a Bi-LSTM-s
model capable of extracting contextual information and realizing finer-grained image captioning effectively.
In the end, our model improved by 9.7% on the basis of the original LSTM. In addition, our model effectively
solves the problem of inconsistent semantic information in the forward and backward direction of the
simultaneous order, and gets a score of 37.5 on BLEU-4. The superiority of this approach is experimentally
demonstrated on the MSCOCO dataset.

INDEX TERMS Bi-LSTM, image caption generation, semantic fusion, semantic similarity.

I. INTRODUCTION caption word generation. At present, the image captioning


Image captioning [1], [2], [3], [4] serves as a complex multi- model is primarily based on the encoder-decoder [9], [10]
modal scene understanding task involving two fields of study: approach, whose model solely examines the image’s global
computer vision [5], [6] and natural language processing [7], region while generating the image caption. The encoder trans-
[8], whose purpose is to automatically generate proximate forms the image as the average value of global area fea-
natural language captions for the salient visual content of tures, ignoring the image’s local saliency. As a consequence,
input images. This task requires the model to complete the the attention mechanism [11] is applied to image captions,
following actions: First, the model allows for comprehending the extracted visual features are normalized into a set of
the visual content in the image by identifying salient ele- weight values, and the external visual features of the encoder
ments in the image with their mutual correspondence. Sec- are corresponding to its internal semantic features, further
ond, on the basis of these visual understandings, the model improving the model’s interpretability. In recent years, visual
is also able to accurately describe these structured visual attention [12], [13], [14] and semantic attention [15] have
information word by word using natural language. Dynamic proved their superiority in this domain.
multi-modal analysis and reasoning are performed on the The common difficulty with most approaches lies in that
visual content, as well as generated words in the course of the deep neural network based on LSTM [16] simply con-
siders unidirectional data input while ignoring the impact
The associate editor coordinating the review of this manuscript and of the orientation of the sequence on prediction. Yet the
approving it for publication was Kumaradevan Punithakumar . prediction of a sentence is supposed to be determined by

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
134 VOLUME 11, 2023
H. Zhang et al.: Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

the context, it hence is imperative to consider both prior and types[19]: module-based matching algorithms [22], [23],
subsequent moment information. To address this issue, this [24], migration-based algorithms [25], [26], and neural
paper employs the Bi-LSTM [17], [18] structure, which com- network-based algorithms [1], [2], [11].
prises two LSTM neural networks, featured by one forward The module-based matching algorithm first identifies the
and one backward. In contrast to the traditional unidirec- objects, attributes, actions, coupled with other information
tional LSTM network [19], [20], [21], the Bi-LSTM structure present in the image using multiple classifiers, and then puts
considers the inherent laws of forward with backward data the detected information into a manually designed sentence
simultaneously while predicting from both the past and the module to generate image captions. Although this algorithm
future. Besides, it employs two independent hidden layers is considered straightforward and intuitive, it remains diffi-
to respectively process the forward and backward semantic culties to recognize more sophisticated image information
information. Then, the forward and backward outputs are and unable to generate sentences with more complicated
drawn upon summation. The content is extracted from the structures given the constraints of classifiers or sentence
forward and backward LSTM. As illustrated in Fig. 1, for- modules [23].
ward and backward extract semantic features about ‘‘riding’’, The migration-based algorithm retrieves similar images in
‘‘wave’’, and attention mechanisms extract salient regions. the existing database and then regards the caption of the sim-
When Bi-LSTM is employed as the decoder, the image ilar image as the caption of the image to be queried. Since the
captions generated by the forward and backward generation sentences in the database are entirely human-generated, the
approaches for the same image are prone to vary widely, and migration-based algorithm produces grammatically correct
the semantic contents of the same time step barely match. sentences. However, considering that the searched image and
When the current word is generated in forwarding order, the image to be queried are similar instead of being definitely
the backward generation approach fails to offer effective identical, the sentences directly generated in this case may not
context information synchronously; similarly. When it is gen- accurately describe the content of the image to be queried.
erated in backward order, the forward generation approach In recent years, deep neural networks have been applied
fails to provide valid context information synchronously to image retrieval [27] and machine translation [28] with
either. Therefore, aiming to fully utilize context information success. Inspired by this trend, a variety of image caption
while addressing the issue of out-of-sync between forward generation algorithms based on deep neural networks were
and backward directions, this paper proposes the S-Att, proposed, followed by great breakthroughs. This type of algo-
which employs the subsidiary attention mechanism between rithm extracts image features using CNN, and further decodes
F-LSTM and B-LSTM, extracting the correlation intensity of image features into fluent sentences using RNN [29]. Unlike
the F-LSTM and B-LSTM semantic information. As a result, module-based matching algorithms or migration-based algo-
the semantic information is aligned and output complemen- rithms, the neural network-based algorithms not only elim-
tary. This method addresses the limitation that forward and inate the limitation of sentence modules, but also generate
backward synchronization semantics are incompatible and novel sentences not available in current databases, which is
cannot be produced, contributing to more precise sentence due to the characterization capabilities of CNN combined
predictions. with the efficient modeling capabilities of RNN for variable-
Consequently, our final model employs the CNN-Bi- length sequences. In a novel parallel fusion LSTM structure
LSTM-s encoder-decoder, as indicated in Fig. 2. CNN is [30]. It adopts hidden states which are based on two parallel
employed to extract features and attention mechanisms to LSTMs to make attributes and visual image information com-
extract salient regions. Bi-LSTM is employed to extract con- plementary and enhanced at each time step. An innovative
textual information, with S-Att raised to fuse semantics and structure eliminates the redundancy that exists in the train-
align complementary outputs. ing set, increasing adaptive weights to increase the ability
In summary, our main contributions are shown as follows: to generalize captions [31]. A more sophisticated attention
• We adopt Bi-LSTM as a decoder to extract different mechanism [32] is employed to extract salient region features.
directional features to obtain more fine-grained contex- It combines sentence-level attention models with word-level
tual information. attention models to generate more accurate captions. Explor-
• We adopt the subsidiary attention mechanism to fix ing region relationships [33] implicitly explores the relation-
the semantic information, and align the forward and ship between related semantics and dynamically searches the
backward hidden states through the similarity module related visual relationships between multiple regions, making
to improve the output accuracy. the description of image captions more accurate. Attribute-
• We fuse the features extracted by visual attention and driven image captioning model [34], which selects a specific
subsidiary attention to obtain complementary and pro- area of the image and then decides which Attribute to focus
gressively finer grained sentences. on. This improves the coverage of visual attributes. The excel-
lent performance of Bi-LSTM in machine translation makes
II. RELATED WORKS many tasks try to use bidirectional LSTM. In Automatic lan-
Following various generation approaches, the current major guage identification task [35], Bi-LSTM effectively extracts
image caption generation algorithms are split into three ‘‘future’’ speech sequences, and the effect is remarkable.

VOLUME 11, 2023 135


H. Zhang et al.: Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

FIGURE 1. Demonstrates the features of F-LSTM and B-LSTM extracted by the decoder. (a) is the
original image. (b) is the visual feature extracted by F-LSTM, and (c) is the visual feature extracted
by B-LSTM.

FIGURE 2. Indicates our proposed image caption attention framework. The model employs CNN as the encoder to
extract the visual area features of the image, and Bi-LSTM as the decoder to extract the hidden state hft −1 at the
previous moment with the hidden state hb t +1
at the next moment. S-Att is introduced into Bi-LSTM to fuse and
complement the hidden states hft and hb
t
, thereby making further predictions.

In the sentiment analysis task [36], Bi-LSTM can effectively


extract the context information and obtain more accurate pre-
diction results. Many studies have demonstrated that features
can be extracted efficiently using Bi-LSTM. In the implicit
discourse relation recognition task [37], the discourse argu-
ments are encoded by Bi-LSTM to preserve contextual infor-
mation, and the final result is better than the performance of
LSTM. In the event detection task [38], the algorithm adopts
the Bi-LSTM model to capture contextual information, and
the final result is better than the result of LSTM.

III. PROPOSED METHOD


A bidirectional LSTM is introduced as a decoder in the image FIGURE 3. The S-Att attention model, which we propose in the middle of
the bidirectional LSTM, calculates the relevance by fixing the previous
caption, which efficiently extracts contextual information; moment hft −1 and the next moment hb , then outputting the relevance
t +1
meanwhile, the F-LSTM is aligned with the B-LSTM via with the softmax function before aligning.
subsidiary attention, followed by semantic complementary
output. The following elaborates on our model, as presented the formula:
in Fig. 3. The hidden state extracted by the fixed forward X
LSTM and the hidden state of the backward LSTM is seman- θ ∗ = arg max log p(Y |I ; θ) (1)
tically aligned by the similarity module. θ (I ,y)

θ represents the image caption parameter, typically applied


A. ENCODER-DECODER in the chain rule to model the joint probabilities:
When an image I is given, the image caption aims to gen- T
erate a sentence Y = {y1 , y2 . . . , yT } for describing the
X
log p(Y |I ) = log p(yt |I , y1:t−1 ). (2)
image. Thus, its purpose is to maximize the probability of t=1

136 VOLUME 11, 2023


H. Zhang et al.: Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

We employ a unified encoder-decoder framework to generate feature map.


captions:
ait = softmax(zi1 , zi2 , . . . ., zit ) (5)
Encoder-CNN: The pixel value of the input image I has
been fixed, and the input image I is encoded as a spatial It is normalized via softmax, thereby generating the atten-
vector using CNN, where V = CNN (I ) is CNN function to tion weight distribution.
obtain the spatial feature V = {v1 , v2 , . . . ., vz }. z represents X
the number of image space regions, and vi ∈ RD denotes the V̄t = ait vi (6)
n
image space region.
Decoder-LSTM: The conventional recurrent neural net- V̄t represents the generated visual attention feature.
work (RNN) comes with the issue of gradient vanishing
while processing time-series tasks; thus, we adopt the long C. BI-LSTM
short-term memory (LSTM) to replace the conventional RNN The conventional LSTM simply predicts the output of the
as the encoder. Compared with RNN, LSTM adds three next moment based on the temporal information of the present
threshold units (input gate, forget gate, output gate) to control moment. However, the output of the current moment is rel-
the flow of data. The forget gate connects the hidden state evant to both the state of the previous moment and the next
ht−1 of the previous moment with the input xt of the current moment. Predicting the exact word in a sentence, for instance,
moment as the total input of the sigmoid activation function should be judged not only on the prior text but also on the
to generate the forget mask ft . The product of ft and the following content, thereby realizing proper judgments based
memory information ct−1 of the previous moment achieves on context.
the purpose of removing the previous moment’s worthless f f
information. The input gate computes the output mask it ht = LSTM ([xt ; Ṽt ], ht−1 ) (7)
using the same way, and employs it to filter the memory infor- hbt = LSTM ([xt ; Ṽt ], hbt+1 ) (8)
mation c̃t at the current moment. Once ct−1 and c̃t are filtered, f
pt (yt |y1 , y2 , . . . ., yt−1 , I ) = softmax(ht )
f
(9)
they are further summed to obtain comprehensive memory
information ct . The output gate computes the output mask ot pbt (yt |y1 , y2 , . . . ., yt−1 , I ) = softmax(hbt ) (10)
following the same way as the first two threshold units, and f
The hidden state ht of the F-LSTM at time t depends on
the comprehensive memory information ct is multiplied with f
the hidden state ht−1 and input xt of the previous moment,
ot after going through the tanh activation function to obtain while the hidden state of the B-LSTM at time t depends on
the hidden state ht at the current moment. The computational f
the hidden state hbt+1 and input xt at the next moment. pt and
procedure is as follows: pbt represent the word conditional probability distribution of
it = σ (Wix xt + Wih ht−1 + bi ) F-LSTM and B-LSTM at time t, respectively.
ft = σ (Wfx xt + Wfh ht−1 + bf )
D. SUBSIDIARY ATTENTION GUIDE
ot = σ (Wox xt + Woh ht−1 + bo )
The hidden state ht at time t is obtained in Bi-LSTM by sum-
c̃t = tanh(Wcx xt + Wch ht−1 + bc ) f
ming the ht−1 and hbt+1 ; the former is obtained with F-LSTM,
ct = it c̃t + ft ct−1 while the latter is obtained with B-LSTM. F-LSTM and
ht = ot tanh(ct ) (3) B-LSTM output semantics are inconsistent and impossible to
be aligned, resulting in unsatisfactory output. Consequently,
B. VISUAL ATTENTION GUIDE the semantics are fixed as the F-LSTM output hidden state
f
In order to make the most of semantic and visual information, ht . The fixed semantics facilitates alignment, with the subse-
we incorporate the two using soft attention [11] in LSTM. quent subsidiary attention mechanism extracting the semantic
The primary task is to properly integrate semantic and visual similarity of the forward and backward LSTMs. Our semantic
information. Second, more focus is paid to different time similarity goal is to numerically indicate how similar hbt is to
f
steps under the two elements of information. As a result, the individual word vectors of ht . To indicate how much two
the visual output shifts from the same global image features vectors point in the same direction, we take a simple inner
to changing image local features as each word is generated. product of vectors, and hence utilize the inner product as the
Attention dynamically extract the attention from images in similarity of two vectors.
response to changes in the visual context. It is defined as f
zst = ht−1 · hbt+1 (11)
follows:
ast = softmax(zs1 , zs2 , . . . .zst ) (12)
zit = Wz tanh(Wv Vi + Wh ht ) (4)
ast represents the weight of similarity degree in the forward
Wz ∈ R1×k1 ,
Wv ∈ Rk1 ×k2
and Wh ∈ Rk1 ×k3
represent and backward direction at time t.
the trainable parameter (transition matrices), Wv is denoted f
hm b
t = ht−1 + ht+1 (13)
as drawing the visual feature Vi into a visual feature map.
f
Wh refers to plotting the semantic feature ht as a semantic hnt = ht−1 + hb∗ (14)

VOLUME 11, 2023 137


H. Zhang et al.: Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

f
The summation of ht−1 and hbt+1 indicates that Bi-LSTM image is collected from daily life, making it the primary
extracts past and future information to obtain the hidden state experimental dataset for image captioning. the individual
f f
hmt based on visual attention; the summation of ht−1 and ht−1 image contains a multi-entity target with five manual labels
corresponding to each position represents the hidden state hnt for labeling the caption. this dataset includes 91 targets,
of the context aligned by a fixed position via S-Att. Then, the 328,000 images, and 2.5 million labels. the largest dataset
final hidden state under dual attention is as follows: with semantic segmentation provides 80 categories, over
330,000 images, 200,000 of which are annotated, and over
ht = λhm
t + (1 − λ)ht
n
(15) 1.5 million individuals in the entire dataset. we adopt 110,000
The loss function of the bidirectional LSTM includes: photos for training, 5,000 images for validation, and 5,000
images for testing [1].
T
f f
X
LXE (θ) = − log pθ (yt |y1:t−1 ) (16)
B. EVALUATION METHODS
t=1
T
In experiments, we adopt bleu1-4 [40], meteor [42], rouge-l
b
X [43], cider [39], and spice [3] to evaluate our model perfor-
LXE (θ) =− log pbθ (yt |y1:t−1 ) (17)
mance metrics, which are widely used in image captioning.
t=1
f b
L = LXE + LXE (18) C. DATA PRE-PROCESSING
f
LXE and LXEb stand for the loss functions of the F-LSTM In this paper, we implement a bidirectional LSTM with a
and B-LSTM, respectively. The conventional cross-entropy subsidiary attention mechanism. Our parameter settings and
error training strategy is employed, with L as the final loss experimental details are as followers.
function. There is a distinction between training and testing First, we replace all words in the dataset with lowercase,
conventional image captioning models, with testing relying and truncate the sentence caption length to 16, among which
on words previously generated by the model. When the pre- the words with a frequency of less than or equal to 5 are
ceding period’s results are incorrect, the errors would accu- deleted, and finally a word list with a number of 9500 is
mulate and succeeding words cannot be generated correctly. obtained.
To address these issues, we approach image caption produc- Second, we adopt the pre-trained Resnet-101 [44] to
tion as a reinforcement learning problem, directly optimizing encode the image in the encoding phase, which encodes
sentence generation based on the model’s evaluation metrics, the image into a visual feature map of size 14 × 14 with
with the ultimate goal of minimizing the following negative 2048 dimensions. The visual feature map is mainly applied
expected returns: to represent the fine-grained information of the image.

LRL (θ ) = −EY s ∼pθ [r(Y s )] (19) D. DECODING PHASE


We employ a Bi-LSTM-s structure to decode visual feature
r(Y s ) serves as the reward obtained via CIDEr [39], BLEU maps into image captions with word embedding of dimension
[40] and other computing methods when the prediction is as 512. The forward LSTM, backward LSTM, and attention
over; besides, LSTM updates its internal hidden state atten- dimension are set to 512.
tion weights and other states. Finally, during the training phase, we train our model
The gradient can be approximated by the following with the Adam [45] optimizer under the cross-entropy loss.
formula: We fine-tune Resnet-101’s last convolutional layer to adjust
∇θ LRL (θ) ≈ −(r(ys ) − r(y∗ ))∇θ log pθ (ys ) (20) the appropriate training parameters. The learning rate is 1 ×
10−5 , which decreases by 0.5 every six epochs. The batch
y∗ represents the baseline score obtained at test time with size is set to 64, and the model is trained for 30 epochs.
beam search decoding. Subsequently, building on the training model, we employ
reinforcement learning-based methods in order to optimize
IV. EXPERIMENTS the CIDEr assessment metrics. At this phase, the learning rate
In order to demonstrate the effectiveness of the proposed bidi- is set to 5 × 10−5 , the batch size is set to 64, and the training
rectional LSTM model, we perform extensive experiments runs for 30 epochs. During the training phase, we evaluate
to test the model while also comparing it with the advanced our model on the validation machine at the conclusion of each
models. The elaborated material of the experiment is included epoch and save the model with the best current result. Then,
as follows, ranging from the dataset to assessment metrics, the next phase of training will continue on the model with
implementation details, and testing approach. the best performance from the previous phase. In terms of
testing, we select the model with the greatest CIDEr score
A. DATASET on the validation set, and we utilize beam search to produce
We evaluate our model on the widely-used mscoco [41] phrases with the beam size set to 5.
dataset, which acts as a large-scale dataset with diversi- If the performance fails to improve after 6 training epochs,
fied object identification, segmentation, and captioning; each the training would be terminated.

138 VOLUME 11, 2023


H. Zhang et al.: Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

TABLE 1. Verifying the performance of optimized CIDEr and auxiliary attention mechanism using MSCOCO.

TABLE 2. Verifying the performance of Bi-LSTM and parallel double-layer LSTM using MSCOCO.

E. EXPERIMENTAL RESULTS AND ANALYSIS from 112.5 to 117.9 by 5.4; the CIDEr score of our Bi-
We design ablation experiments to evaluate the effectiveness LSTM-S model climbed from 118.6 to 121.3 by 2.7, indi-
of our proposed model in image captioning; all metric scores cating that the current leading methodologies come with a
are designed using the MSCOCO Karparthy test segmenta- significant enhancement in optimizing CIDEr on the basis of
tion. cross-entropy error. Second, the Bi-LSTM-s improved from
First, as shown in Table 1, the improvement brought by 112.5 to 118.6 by 6.1 in the cross-entropy loss function
CIDEr optimization is validated on our model. XE repre- experiment, and 3.4 improved from 117.9 to 121.3 in the
sents training under the cross-entropy loss function, and RL optimized CIDEr experiment. Thus, it reflects that our sub-
indicates the result of optimizing the scoring index based on sidiary attention can efficiently extract, align, and produce
the optimal XE training model. Second, two sets of models, finer captions from the semantic relations of the forward
Bi-LSTM and Bi-LSTM-s, are set to verify the effective- LSTM and backward LSTM.
ness of our auxiliary attention mechanism. Bi-LSTM solely Secondly, as shown in Table 2. Our ablation experiments
employs the visual attention mechanism, while Bi-LSTM- are primarily utilized to validate our model’s superiority, with
s incorporates S-Att into Bi-LSTM. The following training constant experimental training parameters. To be specific, p-
results ensure that the parameters are consistent in order to LSTM denotes the simultaneous superposition of two layers
maintain fairness. of structure processing, in which the hidden state h̃1t of the
As illustrated in the table 1-5, each represents a different first layer of p-LSTM is learnt and h̃1t is transferred to the
ablation experiment. In Tables 1, we demonstrated that rein- second layer of LSTM; then, the input gate, forget gate, and
forcement learning improves significantly in image caption output gate of the second layer of LSTM will all employ h̃1t
tasks. In Tables 2 and 3, we proved the superiority of the as input. The following uncovers the final hidden state:
our model. In Table 4, we verified the effect of averaging
h̃2t = LSTM (h̃1t , h̃2t−1 ) (21)
and taking the maximum input on the result when semantic
fusion occurs. In Table 5, it is the influence of different The final hidden state at time t is derived from the
hyperparameters on the results of the experiment. first layer’s hidden state h̃1t and the second layer’s previous
The following evaluation criteria: B@1, B@2, B@3, B@4, moment h̃2t−1 . According to the optimized CIDEr score, the
M, R, S, C, represent BLEU1-4, METEOR, ROUGE-L, Bi-LSTM model has improved by 10.3 points. In our model,
SPICE, CIDEr. BLEU: It calculates the similarity and penal- our hidden state computation ht is related not only to the
f
izes sentences of insufficient length. METEOR: It focuses current input, but also to ht−1 and hbt+1 . Bi-LSTM considers
on the number of co-occurrences of words and establishes previous and future information simultaneously; thus, it truly
a penalty mechanism based on word order changes to get achieves context-based output.
scores. ROUGE-L: The similarity is measured by calculating Aiming to demonstrate the superiority of our model,
the longest common sequence between the predicted sentence we evaluate it against eight measures and six prominent
and the standard translation. methods. As shown in Table 3. First, the foundational model
SPICE: It encodes images into objects, attributes, and rela- is established. The most typical model, NIC, does not include
tionships, and then selects the highest scoring statement based an attention mechanism. The goal of Soft-Attention is to
on scene graphs. CIDEr: It calculates the similarity, which introduce a soft attention mechanism into difficult tasks. The
is based on the frequency of the words.Firstly, as shown attention mechanism is extended from spatial to channel by
in Table 1. The cross-entropy loss function is compared to SCA-CNN. SCST is the application of reinforcement learn-
the optimized CIDEr score (using the CIDEr score as an ing to the optimization of sentence-level rewards. Second,
example), the CIDEr score of our Bi-LSTM model climbed the pLSTM-A-2, DAIC and our model are improved on the

VOLUME 11, 2023 139


H. Zhang et al.: Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

TABLE 3. Comparison with advanced models on MSCOCO dataset.

TABLE 4. Forward and backward LSTM hidden state fusion on MSCOCO dataset.

TABLE 5. Dual attention weight coefficients.

basis of the above models. pLSTM-A-2 encodes images using As shown in Table 5. Under the combined effect of dual
two separate encoders (MIML and CNN) and simultaneously attention, an oversized selection of λ results in the extrac-
merges the semantic information of the two decoders, result- tion of unaligned semantics before-and-after, worsening the
ing in more accurate and richer captions. caption result. Yet, a small selection of λ leads to excessive
DAIC extracts the encoder’s image input to sentence- reliance on similarity; the prior semantics over-absorb the
level and word-level attention respectively, while the final following semantics, degrading the caption result.
output combines sentence-level and word-level information
to generate more accurate captions. Our model employs F. VISUALIZATION
bidirectional LSTM as the decoder, accepts both past and Fig. 4 depicts the visualization results, which allow us to
future information simultaneously, and truly achieves pre- better represent our proposed approach, including the ground
diction based on contextual information. It also employs truth. F-LSTM, B-LSTM, and Bi-LSTM-s fused with seman-
two attention mechanisms, one of which will dynamically tic features based on contextual output. It also displays the
extract visual information accompanied by integrating visual interaction among Visual-Att focusing on image key regions
information with semantic information; another auxiliary and text dependencies, coupled with the extraction of key-
attention aligns the semantic information of the bidirec- words using an auxiliary attention mechanism. Moreover,
tional LSTM, contributing to more diversified semantic infor- a visualization is presented at the same time. All of the image
mation. Our model has displayed significant advantages elements are derived from the MSCOCO dataset
in scoring. It reveals that our model can effectively extract fine-
As shown in Table 4. Considering the fusion of the forward grained information, such as ‘‘polka dot’’, ‘‘wooden
and backward hidden states: Max is the maximum value of benches’’, and ‘‘red chairs’’. F-LSTM extracts ‘‘polka dot,’’
f f
ht−1 and hbt+1 , while Mean is the weighted sum of ht−1 and the semantic features are fused into Bi-LSTM. F-LSTM
b
and ht+1 . The data reveals that the outcome of taking the extracts ‘‘wooden benches,’’ with B-LSTM extracting ‘‘red
average is slightly better. When fusing forward and backward chairs,’’ and the one is complemented with another for output.
semantics using Max, simply considering forward or back- S-Att extracts ‘‘girl’’ and ‘‘women,’’ presenting a dependency
ward to obtain a single result causes insufficient semantics of 0.85; then, they are fused to complement the output.
and the loss of partial semantics. On the other hand, using Also, ‘‘surrounded’’ and ‘‘topped’’ have a dependency of
Mean considers the shared scope of forward and backward 0.62, while the two are fused to complement the output.
while retaining the original semantic information, thereby Fig. 5 depicts the fine-grained information extracted by our
achieving fused semantics. model. We set Bi-LSTM as the control groups. The subsidiary

140 VOLUME 11, 2023


H. Zhang et al.: Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

FIGURE 4. Comparison with ground truth on MSCOCO.

VOLUME 11, 2023 141


H. Zhang et al.: Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

FIGURE 5. Supplemental experimental simulation diagrams on MSCOCO.

A model Bi-LSTM-s is hence created to efficiently extract


past and future information in order to fully extract context
information. Specifically, Bi-LSTM-s encodes the sentence
context as hidden states of F-LSTM and B-LSTM, respec-
tively. After that, S-Att obtains the word similarity between
the hidden states via the attention mechanism, performing
semantic alignment, semantic complementarity, as well as
semantic fusion output. With extensive experimental analysis
achieved on the MSCOCO dataset, our model allows us
to fully extract contextual information, together with fine-
grained information. Furthermore, we demonstrate the supe-
riority of this strategy using a range of evaluation metrics.
However, bidirectional LSTM still has its limitations.
First of all, bidirectional LSTM has too many parameters,
which may lead to prediction time delay for real-time tasks.
Secondly, two basic LSTM cells still work inside the bidirec-
tional LSTM, and the GRU with fewer parameters can be con-
sidered to replace the LSTM during training. At present, most
training features encourage the output of words with high
FIGURE 6. Supplementary experimental pictures are from real
photography.
frequency priority, which leads to the restriction of semantic
information. The further study, we will focus on generating
different constraints to produce fine-grained semantic infor-
attention mechanism effectively complements the forward mation from a global perspective.
and backward output hidden states with progressive output
to obtain fuller semantics, such as ‘‘very’’, ‘‘red’’, and pre- REFERENCES
dicting the fine-grained information ‘‘stainless steel stove’’, [1] A. Karpathy and L. Fei-Fei, ‘‘Deep visual-semantic alignments for gen-
erating image descriptions,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
the action will be more comprehensively such as ‘‘leaning Recognit. (CVPR), Jun. 2015, pp. 3128–3137.
against’’. Fig.6 shows that all the photos are taken from real [2] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, ‘‘Show and tell: A
life, and our model can extract fine-grained information. neural image caption generator,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit. (CVPR), Jun. 2015, pp. 3156–3164.
V. CONCLUSION [3] P. Anderson, B. Fernando, M. Johnson, and S. Gould, ‘‘SPICE: Semantic
propositional image caption evaluation,’’ in Proc. 14th Eur. Conf. Comput.
At present, the existing mainstream models simply take into Vis. (ECCV), Amsterdam, The Netherlands. Berlin, Germany: Springer,
account the impact of the previous information on sentences. Oct. 2016, pp. 382–398.
142 VOLUME 11, 2023
H. Zhang et al.: Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

[4] J. Wu, T. Chen, H. Wu, Z. Yang, G. Luo, and L. Lin, ‘‘Fine-grained [25] P. Kuznetsova, V. Ordonez, A. Berg, T. Berg, and Y. Choi, ‘‘Collective
image captioning with global-local discriminative objective,’’ IEEE Trans. generation of natural image descriptions,’’ in Proc. 50th Annu. Meeting
Multimedia, vol. 23, pp. 2413–2427, 2021. Assoc. Comput. Linguistics, vol. 1, 2012, pp. 359–368.
[5] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, ‘‘Advanced deep-learning [26] V. Ordonez, X. Han, P. Kuznetsova, G. Kulkarni, M. Mitchell,
techniques for salient and category-specific object detection: A survey,’’ K. Yamaguchi, K. Stratos, A. Goyal, J. Dodge, A. Mensch, H. Daumé,
IEEE Signal Process. Mag., vol. 35, no. 1, pp. 84–100, Jan. 2018. A. C. Berg, Y. Choi, and T. L. Berg, ‘‘Large scale retrieval and generation
[6] L. Ruotsalainen, A. Morrison, M. Makela, J. Rantanen, and N. Sokolova, of image descriptions,’’ Int. J. Comput. Vis., vol. 119, no. 1, pp. 46–59,
‘‘Improving computer vision-based perception for collaborative indoor Aug. 2016.
navigation,’’ IEEE Sensors J., vol. 22, no. 6, pp. 4816–4826, Mar. 2022. [27] S. Karaoglu, R. Tao, T. Gevers, and A. W. M. Smeulders, ‘‘Words matter:
[7] S. Wu, D. Zhang, Z. Zhang, N. Yang, M. Li, and M. Zhou, ‘‘Dependency- Scene text for image classification and retrieval,’’ IEEE Trans. Multimedia,
to-dependency neural machine translation,’’ IEEE/ACM Trans. Audio, vol. 19, no. 5, pp. 1063–1076, May 2017.
Speech, Language Process., vol. 26, no. 11, pp. 2132–2141, Nov. 2018. [28] Y. Wu et al., ‘‘Google’s neural machine translation system: Bridging the
[8] M. A. Kastner, K. Umemura, I. Ide, Y. Kawanishi, T. Hirayama, K. Doman, gap between human and machine translation,’’ 2016, arXiv:1609.08144.
D. Deguchi, H. Murase, and S. Satoh, ‘‘Imageability- and length- [29] R. Pascanu, T. Mikolov, and Y. Bengio, ‘‘On the difficulty of train-
controllable image captioning,’’ IEEE Access, vol. 9, pp. 162951–162961, ing recurrent neural networks,’’ in Proc. Int. Conf. Mach. Learn., 2013,
2021. pp. 1310–1318.
[9] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, [30] J. Zhang, K. Li, and Z. Wang, ‘‘Parallel-fusion LSTM with synchronous
H. Schwenk, and Y. Bengio, ‘‘Learning phrase representations using semantic and visual information for image captioning,’’ J. Vis. Commun.
RNN encoder–decoder for statistical machine translation,’’ 2014, Image Represent., vol. 75, Feb. 2021, Art. no. 103044.
arXiv:1406.1078. [31] G. Sumbul, S. Nayak, and B. Demir, ‘‘SD-RSIC: Summarization-driven
[10] Y. Luo, J. Lu, X. Jiang, and B. Zhang, ‘‘Learning from architectural deep remote sensing image captioning,’’ IEEE Trans. Geosci. Remote
redundancy: Enhanced deep supervision in deep multipath encoder– Sens., vol. 59, no. 8, pp. 6922–6934, Aug. 2021.
decoder networks,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 9, [32] H. Wei, Z. Li, C. Zhang, and H. Ma, ‘‘The synergy of double
pp. 4271–4284, Sep. 2022. attention: Combine sentence-level and word-level attention for image
[11] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, captioning,’’ Comput. Vis. Image Understand., vol. 201, Dec. 2020,
and Y. Bengio, ‘‘Show, attend and tell: Neural image caption gener- Art. no. 103068.
ation with visual attention,’’ in Proc. Int. Conf. Mach. Learn., 2015, [33] Z. Zhang, Q. Wu, Y. Wang, and F. Chen, ‘‘Exploring region relationships
pp. 2048–2057. implicitly: Image captioning with visual relationship attention,’’ Image Vis.
[12] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, ‘‘Boosting image captioning Comput., vol. 109, May 2021, Art. no. 104146.
with attributes,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017,
[34] Y. Zhou, J. Long, S. Xu, and L. Shang, ‘‘Attribute-driven image caption-
pp. 4894–4902.
ing via soft-switch pointer,’’ Pattern Recognit. Lett., vol. 152, pp. 34–41,
[13] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and
Dec. 2021.
L. Zhang, ‘‘Bottom-up and top-down attention for image captioning and
[35] H. S. Das and P. Roy, ‘‘A CNN-BiLSTM based hybrid model for Indian lan-
visual question answering,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
guage identification,’’ Appl. Acoust., vol. 182, Nov. 2021, Art. no. 108274.
Recognit., Jun. 2018, pp. 6077–6086.
[36] W. Li, L. Zhu, Y. Shi, K. Guo, and E. Cambria, ‘‘User reviews: Sentiment
[14] L. Li, S. Tang, L. Deng, Y. Zhang, and Q. Tian, ‘‘Image caption with global-
analysis using lexicon integrated two-channel CNN–LSTM family mod-
local attention,’’ in Proc. 31st AAAI Conf. Artif. Intell. (AAAI), Feb. 2017,
els,’’ Appl. Soft Comput., vol. 94, Sep. 2020, Art. no. 106435.
pp. 4133–4139.
[15] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, ‘‘Image captioning with [37] F. Guo, R. He, and J. Dang, ‘‘Implicit discourse relation recognition via
semantic attention,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. a BiLSTM-CNN architecture with dynamic chunk-based max pooling,’’
(CVPR), Jun. 2016, pp. 4651–4659. IEEE Access, vol. 7, pp. 169281–169292, 2019.
[16] S. Hochreiter and J. Schmidhuber, ‘‘Long short-term memory,’’ Neural [38] G. Xu, Y. Meng, X. Zhou, Z. Yu, X. Wu, and L. Zhang, ‘‘Chinese event
Comput., vol. 9, no. 8, pp. 1735–1780, 1997. detection based on multi-feature fusion and BiLSTM,’’ IEEE Access,
[17] T. Chen, R. Xu, Y. He, and X. Wang, ‘‘Improving sentiment analysis via vol. 7, pp. 134992–135004, 2019.
sentence type classification using BiLSTM-CRF and CNN,’’ Expert Syst. [39] R. Vedantam, C. L. Zitnick, and D. Parikh, ‘‘CIDEr: Consensus-based
Appl., vol. 72, pp. 221–230, Apr. 2017. image description evaluation,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
[18] Z. Huang, W. Xu, and K. Yu, ‘‘Bidirectional LSTM-CRF models for Recognit. (CVPR), Jun. 2015, pp. 4566–4575.
sequence tagging,’’ 2015, arXiv:1508.01991. [40] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, ‘‘BLEU: A method for
[19] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, ‘‘Guiding the long- automatic evaluation of machine translation,’’ in Proc. 40th Annu. Meeting
short term memory model for image caption generation,’’ in Proc. IEEE Assoc. Comput. Linguistics, 2001, pp. 311–318.
Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2407–2415. [41] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár,
[20] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, ‘‘SCA- and C. L. Zitnick, ‘‘Microsoft COCO: Common objects in context,’’ in
CNN: Spatial and channel-wise attention in convolutional networks for Proc. Eur. Conf. Comput. Vis. (ECCV). Berlin, Germany: Springer, 2014,
image captioning,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. pp. 740–755.
(CVPR), Jul. 2017, pp. 5659–5667. [42] S. Banerjee and A. Lavie, ‘‘METEOR: An automatic metric for MT
[21] J. Lu, C. Xiong, D. Parikh, and R. Socher, ‘‘Knowing when to look: Adap- evaluation with improved correlation with human judgments,’’ in Proc.
tive attention via a visual sentinel for image captioning,’’ in Proc. IEEE ACL Workshop Intrinsic Extrinsic Eval. Measures Mach. Transl. Summa-
Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 375–383. rization, 2005, pp. 65–72.
[22] A. Gupta and P. Mannem, ‘‘From image annotation to image description,’’ [43] C.-Y. Lin, ‘‘Rouge: A package for automatic evaluation of summaries,’’ in
in Proc. 19th Int. Conf. Neural Inf. Process. (ICONIP), Doha, Qatar. Berlin, Proc. Text Summarization Branches Out, 2004, pp. 74–81.
Germany: Springer, Nov. 2012, pp. 196–204. [44] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image
[23] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
and T. L. Berg, ‘‘BabyTalk: Understanding and generating simple image Jun. 2016, pp. 770–778.
descriptions,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, [45] D. P. Kingma and J. Ba, ‘‘Adam: A method for stochastic optimization,’’
pp. 2891–2903, Dec. 2013. 2014, arXiv:1412.6980.
[24] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, [46] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, ‘‘Self-critical
J. Hockenmaier, and D. Forsyth, ‘‘Every picture tells a story: Generating sequence training for image captioning,’’ in Proc. IEEE Conf. Comput. Vis.
sentences from images,’’ in Proc. 11th Eur. Conf. Comput. Vis. (ECCV), Pattern Recognit. (CVPR), Jul. 2017, pp. 7008–7024.
Heraklion, Greece. Berlin, Germany: Springer, Sep. 2010, pp. 15–29.

VOLUME 11, 2023 143

You might also like