
CATANIC: Automatic generation model of image captions based on multiple attention mechanism


Tingting Zhang (  [email protected] )
Yangtze University
Tao Zhang
Yangtze University
Yanhong Zhuo
Yangtze University
Feng Ma
Research Institute of Petroleum Exploration and Development Northwest (NWGI)

Research Article

Keywords: Image caption, DenseNet169, Transformer, AoA module

Posted Date: March 24th, 2023

DOI: https://doi.org/10.21203/rs.3.rs-2718040/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Read Full License

Additional Declarations: No competing interests reported.



CATANIC: Automatic generation model of image captions based on multiple attention mechanism

Tingting Zhang1, Tao Zhang1*, Yanhong Zhuo1 and Feng Ma2

1* School of Information and Mathematics, Yangtze University, Jingzhou, 434023, Hubei Province, China.
2 Research Institute of Petroleum Exploration and Development Northwest (NWGI), PetroChina, Lanzhou, 730020, Gansu Province, China.

*Corresponding author(s). E-mail(s): [email protected];


Contributing authors: [email protected]; [email protected];

Abstract
As a task at the intersection of computer vision and natural language processing, image caption generation has attracted wide research attention. It remains challenging because the automatically generated captions must conform both to the image content and to linguistic logic. In this paper, we propose a novel image caption generation model based on the pre-trained DenseNet169 network and a modified Transformer model, following an encoder-decoder architecture. First, we use the DenseNet169 network as the encoder to extract the initial features of the image, and then introduce an "Attention on Attention" (AoA) module between the encoder and the decoder to refine these initial features and filter out attention results that do not match the image content. We also introduce the AoA module into the internal decoder of the Transformer model, which strengthens the correlation between image features and textual descriptions, and we apply the modified Transformer model as the decoder to transform the refined image feature vectors into textual descriptions. Experimental results show that our model not only generates captions that match both the image content and linguistic logic, but also has fewer training parameters and a lower training time cost than a number of image captioning models that use pre-trained convolutional networks as encoders.

Keywords: Image caption, DenseNet169, Transformer, AoA module

1 Introduction

Image captioning has been flourishing and has become one of the most popular research topics in the field of computer vision. It is also one of the primary goals of the interdisciplinary tasks of computer vision and natural language processing: automatically generating the corresponding linguistic descriptions for given images. It necessitates not just identifying the objects in an image but also investigating their possible relationships, such as spatial relations and semantic interactions, which makes the task very challenging.

Image captioning is a sequence modeling problem that usually employs an encoder-decoder framework[1–3]. Inspired by the development of neural machine translation[4–7], attention mechanisms are frequently applied with positive outcomes in the most recent visual captioning models, facilitating the development of visual captioning[8]. In this architecture, a recurrent neural network (RNN) produces the natural language sentences after the encoder has extracted the image features into feature vectors, and the attention mechanism guides the decoding process by focusing on the important areas of the image and generating corresponding weights.

Since the quality of the generated captions depends directly on the attention outcome, the attention mechanism is essential for captioning images. However, due to a poor attention module or useless information selected from the candidate vectors, the attention result may be irrelevant or only weakly related to the query, and the decoder is then misled by the attention result and produces wrong output. To address this issue, we introduce the "Attention on Attention" (AoA) module[9], which adds another attention step on top of the traditional attention mechanism and is thus an extension of it. The AoA module helps to establish the relationship between attention outcomes and the query, filtering out the attention outcomes that are not relevant to the query and keeping only the useful ones, thereby improving the decoding accuracy of the decoder.

In previous studies, recurrent neural networks were usually used as decoders[10, 11]. The disadvantage of this approach is that it cannot parallelize the caption generation process, since the output of the recurrent neural network at each time step depends on the hidden state of the previous step. To address this problem, we improve the traditional sequence transduction model, the Transformer[12], by introducing an AoA module on its decoder side, and use the improved Transformer model as the decoder of the entire image caption model, completely abandoning recurrence and convolution and converting vectors into text based on the attention mechanism only. This model structure strengthens the link between the internal encoder and the internal decoder of the Transformer model, thus enhancing the correlation between image features and text descriptions and improving the quality of the generated image captions, making them more consistent with the image content and with language logic.

Based on the above, we propose a new image caption model, referred to as the CATANIC model, which combines the encoder with the AoA module and applies the modified Transformer model as the decoder. We conducted extensive experiments on the Flickr8k dataset and empirically fine-tuned the hyperparameters of the model. Our experimental results show that the CATANIC model is capable of generating high-quality captions for images. The contributions of this paper are as follows:

• An AoA module is introduced to process the feature vectors extracted by the encoder, in order to ascertain the correlation between queries and attention results.
• The performance of the CATANIC model with five different pre-trained neural networks as encoders was compared on the Flickr8k dataset, and it was found that using the DenseNet169 convolutional neural network as the encoder gives the best performance.
• In our image captioning approach, we apply the Transformer model with the AoA module to iteratively generate each word of the caption, efficiently enhancing the correlation between image features and text descriptions and improving the performance of image captioning models.
• We quantitatively and qualitatively validate the utility of the CATANIC model for generating image captions on the Flickr8k dataset, and perform an ablation analysis that strongly validates the role of the AoA module in the Transformer model.

The remainder of this paper is organized as follows. Related work is presented in Section 2. Section 3 describes the structure of our proposed model in detail. Finally, the experimental evaluation and the conclusions of the model are provided in Sections 4 and 5, respectively.

2 Related Work
2.1 Image Caption

Over the past few years, more and more researchers have paid attention to the topic of image caption generation. Research on image captioning can typically be divided into three main categories: template-based approaches, retrieval-based approaches, and deep learning-based approaches. Specifically, template-based methods first define a fixed template with many blank slots, and then use methods such as target detection, relationship prediction, and background recognition to fill the blank slots and generate a full description of a given image[13–15]. The retrieval-based approach, on the other hand, generates image descriptions by selecting the most semantically similar sentences from a pool of sentences[16, 17].

While both of these approaches are still being explored in depth by a number of researchers, deep learning-based sequence-to-sequence models for image captioning have received significant attention, inspired by the successful application of deep neural networks in machine translation. For example, Vinyals et al.[18] proposed a deep learning-based encoder-decoder framework for image captioning, where a convolutional neural network acts as an encoder to extract the visual features of an image and a recurrent neural network decodes the extracted image feature vector into the words of a descriptive sentence. Jia, Xu, et al.[19] proposed an improved image captioning model that incorporates semantic information to guide the LSTM network as a decoder to generate image descriptions. With the development of the field, researchers have gradually incorporated attention mechanisms into the encoder-decoder architecture.

2.2 Attention Mechanisms

Numerous sequence learning and computer vision problems have benefited significantly from the study of attention mechanisms inspired by human intuition. An attention mechanism first analyzes each candidate vector and obtains a correlation score, then uses a softmax function to normalize these scores into weights, and finally applies the weights to the candidate vectors to produce an attention result[8]. In order to improve the quality of generated captions, a number of researchers have applied various attention mechanisms to image caption generation models. For example, Chen, Zhang, et al. introduced spatial and channel attention mechanisms into the encoding side of the image caption model to extract more accurate image features[20]; in order to apply different attention strategies to different words, Lu, Parikh, et al. introduced adaptive attention mechanisms into the decoding side of the image caption model[21].

However, the above attention mechanisms have been applied without considering whether the attention results are related to the content of the image. To address this issue, Lun Huang, Wenmin Wang, et al.[9] proposed an extended attention mechanism, the AoA module, which uses a gating unit to filter out attention results that are not related to the image content.

2.3 Transformer

Vaswani et al. proposed the Transformer model for solving sequence tasks and used it for machine translation tasks in natural language processing[12]. In contrast to models based on recurrent neural networks, the self-attention part of the Transformer model is able to capture the long-term dependencies of sequence elements. As a result, researchers have gradually applied the Transformer model to image captioning tasks with good results. For instance, Muhammad Shah, Faisal, et al.[22] combine three different Bengali datasets to generate Bengali captions from images using the Transformer model. Tiantao Xian, Zhixin Li, et al.[23] propose a dual global enhanced Transformer model for the image captioning task, which is centered on incorporating global information in the encoding and decoding stages.
3 Model Architecture

In this paper, we propose a novel encoder-decoder model structure for generating a correlated description of a given image. Specifically, we first use the pre-trained DenseNet169[24] network as the encoder to extract the initial feature vectors, and then use the modified Transformer model as the decoder to transform the feature vectors into descriptive sentences. At the same time, an AoA module is introduced between the encoder and the decoder to explore the relationships between the initial feature vectors. The general framework of the model is shown in Fig.1.

Fig. 1 The overall structure of the CATANIC model

3.1 Encoder: Feature extraction

In the image coding stage, most researchers use pre-trained neural networks to obtain the initial feature representation of the image. One recent mainstream pre-trained convolutional network is the DenseNet169 network, which at its core trains a deeper convolutional neural network by establishing dense connections between all previous layers and later layers. This allows the initial feature representation of the extracted image to better match the content of the image, helping to improve the accuracy of convolutional neural networks on tasks such as target recognition, image understanding, and image classification. At the same time, DenseNet169 enables feature reuse by concatenating features along the channel dimension, significantly reducing the number of parameters and making the model simpler, more compact, and more computationally efficient.

We use the pre-trained convolutional neural network DenseNet169 as the encoder of the CATANIC model to obtain the initial feature representation of the image. In this stage, the DenseNet169 network can filter out most of the "redundant" information and extract initial feature representations that capture the main idea of the image, which improves accuracy and makes model training more efficient. Most pre-trained convolutional neural networks are trained for classification tasks on the ImageNet dataset; however, the goal of the encoder in this paper is to obtain a fixed-length initial information vector for each image. Therefore, in the CATANIC model, we make some modifications to the pre-trained DenseNet169 network: we remove the final softmax layer so that the model only extracts and represents the initial features of the image.
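For illustration, the following sketch shows one way to realize this encoding stage with the Keras DenseNet169 network. The 224x224 input size and the reshaping of the resulting 7x7 feature grid into a set of 49 feature vectors are our assumptions, not details reported in the paper.

# Minimal sketch of the encoding stage described in Section 3.1 (assumed:
# TensorFlow/Keras backend, 224x224 inputs, 7x7x1664 DenseNet169 output grid).
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import densenet

# include_top=False drops the final classification (softmax) layer, so the
# network acts purely as a feature extractor.
backbone = tf.keras.applications.DenseNet169(include_top=False, weights="imagenet")
backbone.trainable = False  # used as a frozen pre-trained encoder

def extract_initial_features(image_paths, image_size=(224, 224)):
    """Return an (N, 49, 1664) array: one set of initial feature vectors per image."""
    batch = []
    for path in image_paths:
        img = tf.keras.utils.load_img(path, target_size=image_size)
        batch.append(tf.keras.utils.img_to_array(img))
    batch = densenet.preprocess_input(np.stack(batch))
    feats = backbone.predict(batch, verbose=0)                  # (N, 7, 7, 1664)
    return feats.reshape(feats.shape[0], -1, feats.shape[-1])   # (N, 49, 1664)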
3.2 Capturing relevance: Attention on Attention

To enable our model to generate more accurate and logical image captions, we introduce the AoA module between the encoder and the decoder. The AoA module is an extension of the traditional attention mechanism and can be applied to any attention mechanism. It first obtains the attention results through the traditional attention mechanism and then filters out the attention results that are not relevant to the query using a gating unit. The gating unit transforms the attention result and the context vector linearly to produce an attention gate and an information vector, and then multiplies them element-wise to produce the pertinent information.

In the encoder of the CATANIC model, the AoA module first uses a multi-head attention mechanism to obtain the attribute relationships between individual features in the initial feature set of the image, then combines these relationships with the initial feature vectors, and finally obtains, through the gating unit, image feature vectors with strong correlation between the features. The detailed output is calculated as follows:

AoA(Q, K, V) = \sigma(W_g(Q + f_{m-att}(Q, K, V)) + b_g) \odot (W_i(Q + f_{m-att}(Q, K, V)) + b_i)   (1)

(Q, K, V) = I(W_q, W_k, W_v)   (2)

where

W_q, W_k, W_v, W_i, W_g \in \mathbb{R}^{D \times D}   (3)

b_i, b_g \in \mathbb{R}^{D}   (4)

and D is the dimension of the initial feature vector I. In addition, f_{m-att} is the multi-head attention function, \sigma denotes the sigmoid activation function, and \odot denotes element-wise multiplication.

The AoA module is then used again on the decoding side of the entire model to enhance the correlation between the extracted image features and the text description, improving the accuracy and logic of the generated image captions.
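A minimal sketch of Eq. (1) as a Keras layer is given below. Using tf.keras.layers.MultiHeadAttention for f_{m-att} and projecting the DenseNet169 features to the 512-dimensional space mentioned in Section 4.2 are our assumptions about one reasonable realization, not the authors' exact implementation.

# Sketch of the AoA module of Eq. (1), assuming TensorFlow/Keras and
# tf.keras.layers.MultiHeadAttention as the multi-head attention function f_m-att.
import tensorflow as tf

class AoA(tf.keras.layers.Layer):
    def __init__(self, d_model=512, num_heads=8, **kwargs):
        super().__init__(**kwargs)
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        self.w_info = tf.keras.layers.Dense(d_model)                        # W_i, b_i
        self.w_gate = tf.keras.layers.Dense(d_model, activation="sigmoid")  # sigma(W_g(.) + b_g)

    def call(self, query, key, value, attention_mask=None):
        att = self.mha(query=query, key=key, value=value,
                       attention_mask=attention_mask)          # f_m-att(Q, K, V)
        combined = query + att                                  # Q + f_m-att(Q, K, V)
        return self.w_gate(combined) * self.w_info(combined)    # attention gate (x) information vector

# Refining the encoder output (self-attention over the initial feature set):
# projected = tf.keras.layers.Dense(512)(initial_features)  # assumed 1664 -> 512 projection
# refined   = AoA(d_model=512, num_heads=8)(projected, projected, projected)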
3.3 Decoder: Language representation

Most competitive natural language processing models use a variant of recurrent neural networks, the LSTM, as the decoder. However, LSTM models have problems such as long training times and poor performance on longer sequences. To avoid these problems, a modified Transformer model is used as the decoder in the overall structure of this study. The traditional Transformer model is a neural sequence transduction model based solely on attention mechanisms and feed-forward neural networks, dispensing with recurrence and convolutions entirely, and it also follows an encoder-decoder structure. However, in order to further improve the quality of the generated image captions, we improve the traditional Transformer model.

In the CATANIC model, we use the sum of the image feature vector and its positional encoding as the input to the internal encoder of the modified Transformer model. The internal encoder is a stack of six identical encoder layers, where each encoder layer has two sub-layers: one consisting of a multi-head attention mechanism and the other of a position-wise fully connected feed-forward network. Meanwhile, the internal decoder takes as input the sum of the word vector of the target caption in the dataset and its positional encoding. The internal decoder is also a stack of six identical decoder layers, where each decoder layer has three sub-layers. The first is a multi-head attention mechanism with an added masking operation, mainly to restrict the current output to attend only to the words preceding it. The second sub-layer is a multi-head attention mechanism whose input is derived from the output of the encoder layers together with the output of the first sub-layer in the decoder. The third sub-layer is a position-wise fully connected feed-forward network. Finally, a softmax function is used to predict the most likely current output word, repeated until the entire predicted caption has been produced.

Since the input to the second sub-layer of the internal decoding layer of the Transformer model consists of the image feature vector after passing through the internal encoding layer and the word vector of the target caption after passing through the first sub-layer, we can consider this second sub-layer as a transition layer from images to captions. To further improve the accuracy of the generated captions, we introduce an AoA module into the second sub-layer of the internal decoding layer of the traditional Transformer model. Specifically, the AoA module in this case first uses a multi-head attention mechanism to obtain the attribute relationships between the image features and the word vector of the target caption, combines these relationships with the caption word vector, and then obtains, through the gating unit, a caption word vector with strong correlation to the image content. Fig.2 illustrates the encoder-decoder architecture of the modified Transformer in the CATANIC model.

Fig. 2 The encoder-decoder architecture of the modified Transformer in the CATANIC model
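To make the arrangement of the three sub-layers concrete, the sketch below assembles one such modified decoder layer, reusing the AoA layer sketched in Section 3.2 as the second sub-layer. The residual connections, layer-normalization placement, and feed-forward width follow the standard Transformer layout and are our assumptions rather than values reported by the authors.

# Sketch of one modified decoder layer: masked self-attention, AoA-based
# cross-attention over the image features, and a position-wise feed-forward
# network (assumed: TensorFlow/Keras 2.10+ for use_causal_mask).
import tensorflow as tf

class AoADecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model=512, num_heads=8, dff=2048, dropout=0.1):
        super().__init__()
        self.self_att = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.cross_aoa = AoA(d_model=d_model, num_heads=num_heads)  # AoA in the second sub-layer
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation="relu"),
            tf.keras.layers.Dense(d_model),
            tf.keras.layers.Dropout(dropout),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.norm3 = tf.keras.layers.LayerNormalization()

    def call(self, word_vecs, image_feats, training=False):
        # Sub-layer 1: masked multi-head self-attention over the caption prefix.
        x = self.norm1(word_vecs + self.self_att(word_vecs, word_vecs, use_causal_mask=True))
        # Sub-layer 2: AoA cross-attention; queries are caption word vectors,
        # keys/values are the encoded image features.
        x = self.norm2(x + self.cross_aoa(x, image_feats, image_feats))
        # Sub-layer 3: position-wise feed-forward network.
        return self.norm3(x + self.ffn(x, training=training))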
4 Experiments

4.1 Dataset

We evaluated the performance of the proposed CATANIC model on the popular Flickr8k dataset[25], a publicly available English dataset containing 8091 images. Each image has five different captions that describe the objects and events in the image. Following the general dataset partitioning guidelines, we partitioned the complete dataset in an approximate ratio of 6:1:1, with 6000 images for training, 1000 images for validation, and 1000 images for testing.
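A minimal sketch of this approximate 6:1:1 partition is given below; the seeded shuffle is our assumption added for reproducibility, since the paper only states the split sizes.

# Sketch of the approximate 6:1:1 Flickr8k split (6000/1000/1000 images).
import random

def split_flickr8k(image_ids, seed=0):
    ids = sorted(image_ids)               # the 8091 Flickr8k image identifiers
    random.Random(seed).shuffle(ids)      # assumed: seeded random shuffle
    return ids[:6000], ids[6000:7000], ids[7000:8000]   # train, val, test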

4.2 Experimental Settings

We use DenseNet169, a neural network pre-trained on the ImageNet dataset[26], to extract the initial image feature vectors. Then, we apply the AoA module to the initial image feature vectors to obtain relevant image feature vectors with dimension 512. Meanwhile, we set the number of heads of the multi-head attention mechanism in the modified Transformer model to 8, with six encoder layers and six decoder layers. To reduce overfitting during training, dropout with a probability of 0.1 was added at the end of each feed-forward layer. We trained the CATANIC model for 45 epochs on an NVIDIA GeForce RTX 3080 Ti GPU and used the Adam optimizer with a custom learning rate given by the following equation:

lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})   (5)

where

warmup\_steps = 4000   (6)

The variation of the learning rate with the number of training steps is shown in Fig.3. In addition, SparseCategoricalCrossentropy was used as the loss function. Fig.4 displays the loss and accuracy curves of the CATANIC model.

Fig. 3 The variation of the learning rate with the number of training steps

Fig. 4 Loss and accuracy of the CATANIC model
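Equation (5) is the warmup schedule used by the original Transformer; the sketch below expresses it as a Keras learning-rate schedule and pairs it with the Adam optimizer and the sparse categorical cross-entropy loss mentioned above. The Adam beta/epsilon values are assumptions borrowed from the original Transformer paper, not settings reported here.

# Sketch of the custom learning rate of Eq. (5) with warmup_steps = 4000 (Eq. 6).
import tensorflow as tf

class WarmupSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model=512, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps ** -1.5)

optimizer = tf.keras.optimizers.Adam(
    WarmupSchedule(d_model=512), beta_1=0.9, beta_2=0.98, epsilon=1e-9)
# Loss over target-word indices; padding positions would be masked out in training.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction="none")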
4.3 Evaluation Metrics

In order to evaluate the performance of the model more comprehensively, we use the training time of the model as a simple metric for comparing the encoder of the CATANIC model with the encoders of other image captioning models. We also apply the BLEU[27, 28], METEOR[29], ROUGE-L[30], and CIDEr-D[31] metrics to evaluate the captions produced by the CATANIC model and compare them with other methods. All metrics were calculated with the publicly available code at https://github.com/helloMickey/caption_eval.
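As an illustration, the sketch below computes the same family of scores with the pycocoevalcap reference implementations; using that package is our assumption about tooling (the paper used the repository linked above), but the underlying metric definitions are the same.

# Illustrative scoring sketch. gts maps an image id to its five reference
# captions; res maps the same id to the single generated caption.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def score_captions(gts, res):
    """gts: {img_id: [ref1, ..., ref5]}, res: {img_id: [generated_caption]}."""
    results = {}
    bleu_scores, _ = Bleu(4).compute_score(gts, res)
    results.update({f"BLEU-{i + 1}": s for i, s in enumerate(bleu_scores)})
    results["METEOR"], _ = Meteor().compute_score(gts, res)
    results["ROUGE-L"], _ = Rouge().compute_score(gts, res)
    results["CIDEr"], _ = Cider().compute_score(gts, res)
    return results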
4.4 Benchmarking Model

We compare the CATANIC image description model with image captioning models that have performed well in recent years on the Flickr8k dataset, including Show-Tell[18], SAT-NIC[8], Log Bilinear[32] and others. Among them, the Show-Tell model[18] was the first to use an encoder-decoder framework to automatically generate image captions: it first uses the ResNet101 pre-trained neural network as an encoder to extract feature vectors of images, and then uses an LSTM network as a decoder to transform the feature vectors into captions. The SAT-NIC model[8] adds a soft attention mechanism to the Show-Tell model, which improves the decoder's ability to learn global features of the image. Through continuous in-depth research on image captioning by domestic and international researchers, many other high-quality image captioning models have been produced, and these are used as our benchmark models for comparison experiments with the CATANIC model.

4.5 Quantitative Analysis

In this section, we present the results of a quantitative analysis in two respects to demonstrate the effectiveness of the CATANIC model: one is the size of the training parameters of the model's encoder, and the other is the quality of the image captions generated by the model. We perform multiple comparisons of the proposed CATANIC model with a variety of state-of-the-art models and quantitatively analyze the model using the evaluation metrics described above. In an encoder-decoder architecture, the choice of both encoder and decoder has an impact on the experimental results. To this end, we selected five commonly used pre-trained networks as image feature extractors: InceptionV3[33], ResNet50[34], InceptionResNetV2[35], VGG16[36], and DenseNet169[24]. Table 1 shows the performance of the different pre-trained networks as model encoders.

Table 1 The comparison of the CATANIC model with various encoders on the Flickr8k dataset

Model               BLEU-1   BLEU-4   METEOR   ROUGE-L   CIDEr-D
ResNet50            68.0     29.8     26.1     51.3      92.9
InceptionResNetV2   64.3     32.1     28.6     54.0      100.2
VGG16               62.9     25.0     24.6     49.9      81.5
InceptionV3         62.3     28.0     25.7     51.7      98.6
DenseNet169         78.8     46.7     34.4     63.8      136.5

From these five results, it can be seen that the DenseNet169 pre-trained convolutional neural network performs best as the encoder of the CATANIC model, which indicates that the DenseNet169 pre-trained network is able to extract richer image features during the encoding phase. In addition, the DenseNet169 pre-trained network has fewer training parameters and a more compact model structure than the other four pre-trained networks. Fig.5 shows the training parameters of the different pre-trained networks as encoders on the Flickr8k dataset.

Fig. 5 The training parameters of different pre-trained networks as the encoder of the CATANIC model
Subsequently, we evaluate the captions produced by the CATANIC model. In Table 2 we show the results of comparing the performance of the CATANIC model with other state-of-the-art image captioning models. From Table 2, we can see that the CATANIC model outperforms the other models on all metrics, especially the CIDEr-D metric, which reaches 136.5. This shows that our model is not only more accurate in identifying targets in images than other advanced models, but also better at capturing the relationships between features in images, resulting in accurate and logical image captions. In summary, our model structure is not only simple and compact but also generates high-quality image captions.

Table 2 The comparison of the performance of the CATANIC model on the Flickr8k dataset

Model               BLEU-1   BLEU-2   BLEU-3   BLEU-4   METEOR   ROUGE-L   CIDEr-D
Deep VS[37]         57.9     38.3     24.5     16       —        —         —
Show-Tell[18]       63       41       27       —        —        —         —
Log Bilinear[32]    65.6     42.4     27.7     17.7     17.3     —         —
Hard Attention[8]   67       45.7     31.4     21.3     20.3     —         —
SAT-NIC[8]          67       44.8     29.9     19.6     18.9     44.7      45.3
CATANIC             78.8     65.1     54.7     46.7     34.4     63.8      136.5

4.6 Qualitative Analysis

In Fig.6 we show some of the input images, the manually annotated captions, and the captions generated by the CATANIC model and the benchmark model.

Fig. 6 Examples of input images with manually annotated captions and the captions generated by the benchmark model and the CATANIC model

From these examples, we can see that the captions produced by the benchmark model are somewhat logical in terms of natural language, but still do not describe the images accurately and do not pay enough attention to detail. The captions produced by CATANIC are more accurate and comprehensive. In more detail, the CATANIC model has advantages in the following three aspects:

1. The CATANIC model pays more attention to the details in the image. In the first example, the benchmark model generated the description "white dog is running through the snow", which is too vague and ambiguous. Our improved model not only notices that the dog has a "stick", but also correctly identifies the dog's movement as "jumps" rather than "running".

2. The CATANIC model is more accurate in recognizing targets and colors in pictures. In the third example, the benchmark model generated the description "dog jumps to get blue ball", which accurately identified neither the color of the dog nor the type of ball in the picture. In contrast, the encoder output of the CATANIC model is processed by the AoA module, which determines that the color of the dog is "black" based on the approximate color of the target in the feature map, and that the ball in the picture is a "football" based on the approximate outline of the target. This is also demonstrated by the fifth example in Fig.6, where our proposed model accurately identifies the girl's eyes as "blue", which is ignored in the description generated by the benchmark model.

3. The CATANIC model is better able to capture the relationships between the targets in the image and is more logical. In the second example, the benchmark model identifies the skier as "lying" in the snow, which is a clear error in inferring the behavior of the person in the image, whereas our model accurately infers that the skier is about to "fall" and "fall face first into the snow" through the action of multiple attention.
The CATANIC model has these advantages because we have introduced an AoA module behind the traditional encoder, which allows the model to correctly identify the broad outline features of the targets in the image and also to filter out some "redundant" features through the action of the attention gate. This allows the extracted feature maps to be more relevant to the main content of the image, which helps to generate a more comprehensive textual description. Furthermore, the traditional Transformer model is improved to enhance the correlation between the image features and the target word vectors, so that the generated text descriptions are more relevant to the content of the images and correctly identify the behavior and actions of the characters. As a result, the CATANIC model is able to produce captions that are more consistent with the content of the images and more logical.

4.7 Ablative Analysis

To further illustrate the impact of the introduced AoA module on the entire image caption model, we compare the CATANIC model with ablation models with different settings. First, the image caption model without any AoA module is used as the "Base" model; its structure is that the pre-trained neural network DenseNet169 extracts image features, and the Transformer model then converts the extracted image features into text descriptions. Second, we introduce the AoA module after the encoder to correct the image features extracted by the pre-trained network and filter out some unimportant or incorrect image features. Finally, on this basis we improve the decoder of the entire image caption model and introduce the AoA module into the internal decoder of the Transformer model. Based on the above setup, we conducted experiments on the Flickr8k dataset, the results of which are shown in Table 3.

Table 3 Settings and results of ablation studies

Model       BLEU-1   BLEU-4   ROUGE-L   CIDEr-D
Base        70.9     29.5     53.5      104.4
+ Enc:AoA   70.4     31.3     56.8      115.0
CATANIC     78.8     46.7     63.8      136.5

By comparison, it was found that introducing the AoA module into the image captioning model, with the pre-trained neural network as the encoder and the Transformer model as the decoder, enhanced the accuracy and richness of the generated captions. On the Flickr8k dataset, the captions generated by introducing the AoA module only on the encoding side of the image caption model had a BLEU-4 score just 1.8 points higher than those of the "Base" model, indicating that introducing the AoA module on the encoding side corrects the image features extracted by the pre-trained convolutional network, but the correction is not particularly strong. In contrast, introducing the AoA module at both the encoder and the decoder of the image captioning model not only corrects the image features extracted by the pre-trained network, but also captures the correlation between the extracted image features and the word vectors, further improving the accuracy of the generated captions. At the same time, the captions generated by the CATANIC model have higher ROUGE-L and CIDEr-D scores than those generated by the other two ablation models, suggesting that introducing the AoA module on both the encoder and the decoder results in a higher correlation between the generated image captions and the images, as well as greater semantic richness.

5 Conclusions

In this research, we presented an encoder-decoder image caption generation model based on deep neural network architectures. We conducted extensive experiments on the Flickr8k dataset, comparing DenseNet169 with four other commonly used pre-trained convolutional neural networks, and finally chose DenseNet169 as the encoder of the CATANIC model to extract the initial features of the images and represent them as vectors. We then use the modified Transformer model as the decoder of the CATANIC model to transform the image feature vectors into an image caption. At the same time, in order to produce captions that better correspond to the image content, we introduced the AoA module at several points in the model. The CATANIC model is not only able to recognize the relationships between individual features in a picture, but also to filter out irrelevant relationships among them, improving the model's performance. In addition, our model requires fewer training parameters in the encoding phase and is more compact. Experiments show that our proposed CATANIC model has a simple structure, fewer training parameters, and generates more accurate and fluent image captions.

We hope that the outcomes of this study can inspire additional research on image description. We also expect that our model can be applied to datasets in different languages, allowing captions to be generated in a wide range of languages.
Declarations

• No funding was received to assist with the preparation of this manuscript.
• All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
• Data sharing not applicable to this article as no datasets were generated or analysed during the current study.

References

[1] Jie Shao and Runxia Yang. Controllable image caption with an encoder-decoder optimization structure. Applied Intelligence, pages 1–12, 2022.

[2] Alok Singh, Thoudam Doren Singh, and Sivaji Bandyopadhyay. An encoder-decoder based framework for Hindi image caption generation. Multimedia Tools and Applications, 80(28):35721–35740, 2021.

[3] Haoran Wang, Yue Zhang, and Xiaosheng Yu. An overview of image caption generation methods. Computational Intelligence and Neuroscience, 2020, 2020.

[4] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. The best of both worlds: Combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849, 2018.

[5] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[6] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

[7] Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073, 2016.

[8] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057. PMLR, 2015.

[9] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4634–4643, 2019.

[10] Priyanka Raut et al. An advanced image captioning using combination of CNN and LSTM. Turkish Journal of Computer and Mathematics Education (TURCOMAT), 12(1S):129–136, 2021.

[11] Prabhav Karve, Shalaka Thorat, Prasad Mistary, and Om Belote. Conversational image captioning using LSTM and YOLO for visually impaired. In Proceedings of Third International Conference on Communication, Computing and Electronics Systems, pages 851–862. Springer, 2022.

[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[13] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision, pages 15–29. Springer, 2010.

[14] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903, 2013.

[15] Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, 24, 2011.

[16] Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899, 2013.

[17] Chen Sun, Chuang Gan, and Ram Nevatia. Automatic concept discovery from parallel text and visual corpora. In Proceedings of the IEEE International Conference on Computer Vision, pages 2596–2604, 2015.

[18] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

[19] Xu Jia, Efstratios Gavves, Basura Fernando, and Tinne Tuytelaars. Guiding the long-short term memory model for image caption generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2407–2415, 2015.

[20] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5659–5667, 2017.

[21] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 375–383, 2017.

[22] Faisal Muhammad Shah, Mayeesha Humaira, Md Abidur Rahman Khan Jim, Amit Saha Ami, and Shimul Paul. Bornon: Bengali image captioning with transformer-based deep learning approach. SN Computer Science, 3:1–16, 2022.

[23] Tiantao Xian, Zhixin Li, Canlong Zhang, and Huifang Ma. Dual global enhanced transformer for image captioning. Neural Networks, 148:129–141, 2022.

[24] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[25] Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139–147, 2010.

[26] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[27] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[28] Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Weinan Zhang, Yong Yu, and Lei Li. Towards making the most of BERT in neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9378–9385, 2020.

[29] Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, 2014.

[30] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.

[31] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.

[32] Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648, 2007.

[33] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[35] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[37] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
