
Received: January 15, 2021. Revised: March 1, 2021.


Two-Tier LSTM Model for Image Caption Generation

Phyu Phyu Khaing1* May The` Yu2

1
Image Processing Lab, University of Computer Studies, Mandalay (UCSM), Myanmar
2
Faculty of Information Science, University of Computer Studies, Mandalay (UCSM), Myanmar
* Corresponding author’s Email: [email protected]

Abstract: Image captioning is the task of automatically generating a sentence that describes an image. In the past few years, the automatic creation of image captions has attracted great interest in the Artificial Intelligence (AI) field. Image captioning can be defined as the conjunction of image processing and natural language processing, at the input and output ends respectively. Image processing tasks such as image segmentation, object tracking, object detection, image recognition, and many others are mostly performed using Convolutional Neural Networks (CNNs). For natural language processing tasks such as semantic role labelling, neural machine translation, speech recognition, and question answering, Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are behind some of the biggest breakthroughs. This paper proposes an efficient encoder-decoder image captioning model, namely the Two-Tier LSTM (TT-LSTM) model. The TT-LSTM model is built on the encoder-decoder framework with two LSTM layers. The model is evaluated on the MSCOCO, Flickr30k, and Flickr8k datasets with standard evaluation metrics: ROUGE-L, CIDEr, and four BLEU scores. The outcomes of the experiments on these typical datasets reveal that the proposed model generates meaningful natural language sentences, improves sentence generation efficiency, and achieves better performance for image caption generation.
Keywords: Image caption generation, Deep learning, Encoder-decoder, Two-tier LSTM model.

1. Introduction

Image captioning is a system that produces a sentence description of an image to provide knowledge of the image's different components. These components include the objects or individuals in the image, the context on which the image focuses, and the associations between the objects, the other image entities, and their environment. The massive amount of knowledge existing in the world around us can be described with formal language or any other kind of interaction. In the same way, language can be used to convey a valuable and meaningful awareness of the scenes in images. This helps to understand a scene more clearly: creating image captions, and using those captions, allows the contents of images to be grasped completely.

The purpose of image description is to describe an image automatically in one or more natural language sentences. The main challenges occur when interpreting two separate, but often combined, fields: computer vision and natural language processing [1]. First, the objects in the scene need to be recognized and their relationships identified, followed by properly formed sentences expressing the contents of the image [2]. Image caption generation also varies significantly from plain image presentation, since people focus on common sense and observation, emphasize relevant details, and neglect obvious objects and the connections between them [3].

In the last few years, image captioning has become the automated production of a natural language sentence related to an input image, with significant improvements obtained by adding attention mechanisms to the encoder-decoder system [4, 5]. A Convolutional Neural Network (CNN) is commonly used in the encoding step to extract features from the image, and a Recurrent Neural Network (RNN) is implemented to construct the description of the input image [6, 7]. Caption generation that focuses on particular parts of an input image can then be produced with an attention mechanism [8-10].

This paper proposes the Two-Tier LSTM (TT-LSTM) captioning model, built upon the inject and merge models as an efficient encoder-decoder framework. Unlike previous works, the proposed model utilizes two LSTM models. In the proposed model, a Convolutional Neural Network (CNN) encodes the input image, while Long Short-Term Memory (LSTM) networks encode the sentence caption and generate the sentence as the decoder. The pretrained CNN (XceptionNet) is used to extract high-level visual semantic knowledge as the feature vector of the image, taken from the second-to-last activation layer of XceptionNet by dropping the last classification layer. A standard LSTM encodes the embedded word sequence as the language encoder, and a Bidirectional LSTM is utilized as the decoder, taking the blended input that combines the encoded image vector with the encoded sentence vector in order to capture both the previous and the next contexts.

Generally, this paper comprises three main technical contributions, as follows:
• To achieve precise generated captions, this paper develops a Two-Tier LSTM (TT-LSTM) model for image caption generation.
• Three benchmark datasets, MSCOCO [11], Flickr30k [12], and Flickr8k [13], are used for detailed experiments to judge the capability of the proposed model.
• Our methodology is assessed with standard evaluation measurements to illustrate comparative outcomes against state-of-the-art approaches.

The structure of this paper is as follows. Section 2 describes the related work concerning the proposed method. The proposed architecture of the system is demonstrated in Section 3. The experimental implementation of the proposed model is presented in Section 4. The summary of the paper is discussed in Section 5 as the conclusion.

2. Related work

This part discusses the literature of previous research works on image caption generation. Traditional image captioning methods and deep learning-based methods are the existing classes of methods for image captioning.

2.1 Traditional image captioning models

For traditional image captioning methods, there are two particular types. The first is image captioning based on templates, called template-based methods, and the second is image captioning based on retrieval processes, called retrieval-based methods.

Template-based methods for image captioning must predefine a template for the variety of relationships between the objects in the image and the labels of these objects. Such relationships and object names fill the empty slots of the template, from which the caption of the image is obtained. The phrases generated as image captions by these methods, however, are only one sentence per image, and that sentence is not a free-form expression. To generate relevant image descriptions, a Hidden Markov Model (HMM) is employed in [14] by selecting the highest log-likelihood from four corpora that consist of objects, verbs, scenes, and prepositions. Similarly, in [15], a Conditional Random Field (CRF) model is implemented with prediction of corresponding attributes, objects, and prepositions to construct the sentence according to predefined templates. A new subspace embedding approach for image caption generation, called Common Subspace for Model and Similarity (CoSMoS), is suggested in [16].

With retrieval methods, the image captioning process is treated as an information collection problem. The first step is to construct the image description from the training sample by retrieving a meaningful sentence for a similar image. After that, the image description is updated to express the information of the input image. The retrieval-based method reliably generates the caption of an image from a large dataset, but the generated image caption is restricted to the descriptions in the training dataset. An automated visual concept discovery (VCD) algorithm is introduced in [17], utilizing parallel text and image corpora for bidirectional image-sentence retrieval and image tagging tasks. To choose the consensus caption for an image, a simple K-Nearest Neighbour (KNN) retrieval model is utilized in [18], relying on a neural network to extract image features. In [19], the authors investigated natural language description generation methods for images by developing data-driven, retrieval-based strategies that applied two approaches: using global features, and utilizing the detection of objects, regions, and scenes.

2.2 Deep learning-based methods

The encoder-decoder framework is the most effective deep learning approach for generating image captions. Recent encoder-decoder frameworks have emerged and been successful, allowing image captions with fluent and varied phrases to be produced.

The encoder-decoder framework for generating image captions was suggested in [1]. Its widespread adoption for image captioning follows from its powerful effect on the challenges of machine translation. The Google NIC model, which applied Inception v3 as the encoder and Long Short-Term Memory (LSTM) as the decoder, was introduced by the authors of [5]; the image information was given as input only at the start of the LSTM network. In [20], the authors suggested an extension of the LSTM model that uses semantic knowledge to guide the LSTM network in producing a text description for the image.

In [21], a new image captioning framework, namely Neural Baby Talk, is proposed that applies object detection to generate the natural language sentence. The authors of [22] analysed two perspectives on the role of the RNN in an image caption generator, called the inject model and the merge model. In the inject model, both the image and the words are entered into the RNN; in the merge model, the RNN is used only to encode the word sequence. In our proposed method, RNNs are used for two purposes, as the word sequence encoder and as the sentence generator; therefore, our proposed model is more accurate than the previous research work.

In order to obtain clearer image descriptions, the authors of [23] implemented a coarse-to-fine image captioning model, which uses a stacked visual attention model in combination with multiple LSTM networks. In [24], a guiding network is developed that extends the encoder-decoder framework at the decoder step, and the guiding vector can be adjusted to integrate image and language information. An RNN-based reinforcement learning framework is proposed in [25] by integrating a novel multi-level policy function (word-level policy and sentence-level policy) and a multi-level reward function (vision-language reward and language-language reward). The language model for image captioning in [26], namely the character-level RNN (c-RNN), composes the descriptive sentence character by character. The c-RNN model is based on the inject variant of the encoder-decoder architecture, substituting a character-level for a word-level image captioning model. Although the character-level RNN model (c-RNN) is more efficient than other word-level language models, our proposed TT-LSTM model achieves better performance than c-RNN.

In [27], the authors proposed a structural comparison of the various forms of 'conditioning' language choices on the visual input, exploring their repercussions for the architecture of caption generators. The comparison covers the init-inject, pre-inject, par-inject, and merge models, all based on the encoder-decoder framework. The init-inject model uses the image vector as the initial state of the RNN together with the first word of the sequence; the pre-inject model applies the image vector as the initial word of the sequence; par-inject feeds the image vector together with the word at every time step; and the merge model uses the RNN only for word-sequence encoding. All four models use an RNN in one place, for different purposes, whereas we apply RNNs at two places in the proposed model.

An image captioning framework with semantic element discovery and embedding is developed in [28]. The element embedding framework comprises the integration of a CNN-LSTM model with object region detection, called LSTM (FD+RD). To bridge the gap between the visual image and the semantic caption, element embedding long short-term memory (EE-LSTM) uses both visual and semantic, local and global features. LSTM (FD+RD) and EE-LSTM are more efficient than baseline models for image understanding, but their sentence generation process is not fully developed; our proposed model therefore constructs a two-tier LSTM to be more efficient.

In [29], an image captioning framework is developed with a two-stage process based on the encoder-decoder architecture: scene graph generation from the image, and caption generation with pre-trained language generators. The authors of [30] developed a CNN+Transformer network for image captioning, namely the CaPtion TransformeR (CPTR), which builds global context at every encoder layer.

After the encoder-decoder model, the attention-based approach was added as an extension. In [8], the authors initially suggested developing image captioning with an attention-based approach by implementing a convolutional layer to derive location-based spatial characteristics. The authors of [31] proposed a semantic attention mechanism to extract text-related image features and integrated them with a bidirectional gLSTM (Bi-gLSTM) for the image captioning model. The work proposed in [32] uses top-down and bottom-up approaches, and the combination of these approaches with semantic concepts is suggested to incorporate the attention mechanism into the image caption generation model in [33]. For language description generation, the authors of [34] introduced a text-guided attention approach. In [35], a text-based approach is suggested to strengthen the existing guiding LSTM (gLSTM) and to utilize text-based knowledge to increase local attention. The authors of [36] also extended the attention functions to focus on the target item and other major regions, measured in a bottom-up and top-down manner. Although the attention mechanism is very popular, our proposed model uses only the encoder-decoder framework, and we will extend our model with an attention mechanism in a future study.

3 Proposed architecture of the system

Fig. 1 exhibits the efficient encoder-decoder model for image captioning, namely the Two-Tier LSTM (TT-LSTM) model. This model is focused on the structure of the encoder-decoder framework to strengthen the model for image captioning.

Figure 1. Two-Tier LSTM (TT-LSTM) model: the image is encoded by feature extraction with a CNN (Xception) followed by a dense embedding, the sentence is encoded by a word embedding followed by an LSTM, the two encodings are concatenated, and a decoder consisting of a Bidirectional LSTM and a fully connected layer produces the output sentence.

3.1 Encoder-decoder framework

The encoder-decoder framework is the basic deep learning framework for image captioning. It involves two elements:
• Encoder: a network model that reads the input image and encodes it into an internal representation as a fixed-length vector.
• Decoder: a network model that reads the encoded vector and generates the output text description.

In the encoder-decoder framework, there are two basic models, the inject model and the merge model [22], shown in Fig. 2 (a) and (b).

Figure 2. Encoder-decoder framework: (a) inject model and (b) merge model.

In the inject architecture, the encoder first encodes the image into a fixed-length vector, and the decoder uses the image and the word sequence together as the input of a text generation model to predict the next word in the sentence sequence. The inject model therefore integrates the encoded image with the word sequence inside the text generation model.

The merge model instead blends the encoded form of the image input with the encoded form of the text description generated so far. A basic decoder model is then applied to the two merged encodings to construct the next word in the sequence. It separates the modelling of the image input, the text input, and the combination of the two encoded outputs.
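To make the two wirings in Fig. 2 concrete, the sketch below builds an inject-style and a merge-style caption model with tf.keras. It is a minimal illustration rather than the authors' released code; the feature size (2048), maximum caption length (34), vocabulary size (5,000), and embedding size (256) are assumed values chosen to be consistent with Section 4.3.

```python
# Minimal sketch of the inject and merge baselines (Fig. 2), not the authors' code.
import tensorflow as tf
from tensorflow.keras import layers, Model

FEAT_DIM, MAX_LEN, VOCAB, EMB = 2048, 34, 5000, 256   # assumed sizes

def build_inject_model():
    # Inject: the image vector is folded into the word sequence itself,
    # so a single LSTM reads image + words together.
    img_in = layers.Input(shape=(FEAT_DIM,), name="image_features")
    img_vec = layers.Dense(EMB, activation="relu")(img_in)
    img_seq = layers.RepeatVector(1)(img_vec)            # treat the image as a pseudo-word

    txt_in = layers.Input(shape=(MAX_LEN,), name="word_ids")
    txt_emb = layers.Embedding(VOCAB, EMB)(txt_in)

    seq = layers.Concatenate(axis=1)([img_seq, txt_emb])  # image token + word tokens
    hidden = layers.LSTM(EMB)(seq)
    out = layers.Dense(VOCAB, activation="softmax")(hidden)
    return Model([img_in, txt_in], out, name="inject")

def build_merge_model():
    # Merge: the LSTM only encodes the words; the image vector is combined
    # with the LSTM output outside the RNN.
    img_in = layers.Input(shape=(FEAT_DIM,), name="image_features")
    img_vec = layers.Dense(EMB, activation="relu")(img_in)

    txt_in = layers.Input(shape=(MAX_LEN,), name="word_ids")
    txt_emb = layers.Embedding(VOCAB, EMB)(txt_in)
    txt_vec = layers.LSTM(EMB)(txt_emb)

    merged = layers.Concatenate()([img_vec, txt_vec])
    hidden = layers.Dense(EMB, activation="relu")(merged)
    out = layers.Dense(VOCAB, activation="softmax")(hidden)
    return Model([img_in, txt_in], out, name="merge")

if __name__ == "__main__":
    build_inject_model().summary()
    build_merge_model().summary()
```

The only structural difference is where the image vector meets the RNN: inside the word sequence (inject) or after the LSTM has already summarised the words (merge).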

3.2 Two-tier LSTM (TT-LSTM) model

The Two-Tier LSTM (TT-LSTM) model shown in Fig. 1 blends the two basic encoder-decoder variants, the inject model and the merge model. The proposed TT-LSTM model is constructed with two encoder models, one for the image and one for the text. The encodings of the query image and of the textual description generated so far are combined, and the combination is passed to a decoder model that constructs the word sequence. For the image encoder, a Convolutional Neural Network (CNN) is applied. There are many pre-trained CNN models; among them, Xception, a CNN model trained on the ImageNet dataset, is used in this study. Long Short-Term Memory (LSTM) is applied as the language encoder, and a Bidirectional LSTM is utilized as the decoder.

3.2.1. Convolutional neural network (CNN)

Convolutional Neural Networks (CNNs) [37] are advanced deep neural networks that can handle data whose input has the shape of a 2D matrix. An image is conveniently represented as a 2D matrix, so CNNs are highly useful for image work and are mostly used for the image feature extraction function. The typical constituents of a CNN are the input layer, convolution layer, pooling layer, activation layer, and fully connected layer:
• The input layer holds the image's raw pixel values and has three colour channels, R, G, and B.
• The convolution layer takes the output of the previous layers and convolves it with a specified number of filters to construct output feature maps; the number of output maps equals the given number of filters.
• The pooling layer performs down-sampling along the width and height (the spatial dimensions). Maximum pooling is primarily used for the pooling layer.
• For the CNN activation function, the ReLU function is generally applied. The ReLU layer applies an element-wise activation such as the max(0, x) threshold at zero.
• The class values are determined by the Fully Connected (FC) layer. As the name suggests, every unit in this layer is connected to all activations in the previous volume, as in ordinary neural networks.

3.2.2. Pre-trained convolutional neural network

Pre-trained convolutional neural networks are models that someone else has already trained to solve a similar problem. Pretrained models use several forward and backward iterations to learn the right weights for the network. Much deep learning software exists for pre-trained models; among them, this research uses Keras, written in Python. Many pre-trained models are included in Keras, such as DenseNet, InceptionNet, MobileNet, NASNet, ResNet, VGGNet, and XceptionNet, each with its associated preprocessing. This research uses the XceptionNet model for feature extraction from the images.

In this study, the XceptionNet [38] model is used to extract the image features. The Xception model is already trained on the ImageNet dataset and is an extension of the Inception architecture in which regular Inception units are replaced with depthwise-separable convolutions. The Xception model comes with weights pre-trained on ImageNet and has a default input size of 299x299. Its 36 convolution layers are organized into 14 modules, all of which, except the first and last modules, have linear residual connections. Fig. 3 demonstrates the architecture of the Xception model.

Figure 3. Xception architecture
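The following is a minimal sketch of this feature-extraction step using the standard Keras Xception model. It reflects one reading of the description above (drop the classification layer and keep a 2048-dimensional vector); the use of global average pooling is an assumption, not a detail stated in the paper.

```python
# Sketch: extract a 2048-d Xception feature vector for one image (assumed setup).
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False removes the final classification layer; global average pooling
# turns the last 10x10x2048 feature map into a single 2048-d vector.
extractor = Xception(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    img = image.load_img(img_path, target_size=(299, 299))   # Xception's default input size
    x = image.img_to_array(img)[np.newaxis, ...]              # add a batch dimension
    x = preprocess_input(x)                                   # Xception-specific preprocessing
    return extractor.predict(x)[0]                            # shape (2048,)
```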


3.2.3. Long-short term memory

Long Short-Term Memory (LSTM) [39] networks are a special kind of RNN that can be used to predict sequence problems and to learn long-term dependencies. The LSTM cell is composed of gates rather than plain memory: an input gate, a forget gate, an output gate, and a candidate layer. The sigmoid activation function is used for the forget gate, input gate, and output gate as single-layered neural networks, and the tanh function is the activation function of the candidate layer. The forget gate determines which knowledge is thrown away from the cell state, with a sigmoid layer making the decision; the cell state determines what relevant material will be processed; and the output gate determines what the output will be.

Fig. 4 shows the structure of the LSTM cell, and the following equations are applied to perform the LSTM cell:

i_t = σ(x_t × W_xi + h_{t-1} × W_hi)    (1)

f_t = σ(x_t × W_xf + h_{t-1} × W_hf)    (2)

c̄_t = tanh(x_t × W_xc + h_{t-1} × W_hc)    (3)

o_t = σ(x_t × W_xo + h_{t-1} × W_ho)    (4)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̄_t    (5)

h_t = o_t ⊙ tanh(c_t)    (6)

In the figure and equations, x_t is the cell input, i_t is the input gate, c̄_t is the input (candidate) activation, f_t is the forget gate, o_t is the output gate, c_{t-1} is the previous cell state, c_t is the current cell state, h_{t-1} is the previous hidden state, h_t is the next hidden state, W_x· are the input weights, and W_h· are the hidden-state weights. The sigmoid (σ) function and the hyperbolic tangent (tanh) function are element-wise nonlinear activation functions, and ⊙ denotes element-wise multiplication.

Figure 4. Structure of the LSTM cell

3.2.4. Bidirectional LSTM

Bidirectional LSTM [40] is an extension of the standard LSTM that increases model performance on sequence classification problems. Bi-LSTMs have a forward and a backward pass: instead of one LSTM over the input sequence, a Bi-LSTM trains two LSTMs, a backward (future) layer and a forward (past) layer. Using a Bi-LSTM does not make sense for every sequence prediction problem, but it can provide an advantage in areas where it obtains acceptable results with significant improvements. The network utilizes this additional context, and the result is better output. Fig. 5 displays the bidirectional LSTM structure.

Figure 5. Bidirectional LSTM structure
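For illustration, the toy NumPy sketch below performs one LSTM step exactly as written in Eqs. (1)-(6), with biases omitted as in the equations; it is not the training code, and the dimensions are arbitrary.

```python
# Toy sketch of one LSTM step following Eqs. (1)-(6); hypothetical sizes, no biases.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One time step of the LSTM cell. W holds the input (x*) and hidden (h*) weights."""
    i_t = sigmoid(x_t @ W["xi"] + h_prev @ W["hi"])      # Eq. (1) input gate
    f_t = sigmoid(x_t @ W["xf"] + h_prev @ W["hf"])      # Eq. (2) forget gate
    c_bar = np.tanh(x_t @ W["xc"] + h_prev @ W["hc"])    # Eq. (3) candidate activation
    o_t = sigmoid(x_t @ W["xo"] + h_prev @ W["ho"])      # Eq. (4) output gate
    c_t = f_t * c_prev + i_t * c_bar                     # Eq. (5) new cell state
    h_t = o_t * np.tanh(c_t)                             # Eq. (6) new hidden state
    return h_t, c_t

if __name__ == "__main__":
    d_in, d_hid = 4, 3
    rng = np.random.default_rng(0)
    W = {k: rng.standard_normal((d_in if k[0] == "x" else d_hid, d_hid))
         for k in ["xi", "hi", "xf", "hf", "xc", "hc", "xo", "ho"]}
    h, c = np.zeros(d_hid), np.zeros(d_hid)
    h, c = lstm_step(rng.standard_normal(d_in), h, c, W)
    print(h.shape, c.shape)   # (3,) (3,)
```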

4 Experiments

This section presents the datasets, evaluation metrics, and experimental results for the TT-LSTM model for image captioning.

4.1 Datasets

There are various types of datasets for image classification, recognition, and detection, and for the image caption generation process. We use the most well-known standard datasets for image captioning: MSCOCO [11], Flickr30k [12], and Flickr8k [13]. Fig. 6 shows sample image-caption pairs in the format of the MSCOCO dataset.

The MSCOCO dataset was published in 2014 by the authors of [11], with the accompanying paper in Computer Vision and Pattern Recognition (CVPR); updated versions of the dataset were published in 2015 and 2017. This research uses the 2017 dataset, which contains 123,287 images with five sentence descriptions each. It comprises 113,287 images, 5,000 images, and 5,000 images for the training, validation, and test processes respectively.

The Flickr30k dataset was published in 2014 by the authors of [12], with the accompanying paper in Transactions of the Association for Computational Linguistics. It contains five sentence descriptions for each of 31,783 images. It comprises 29,783 images, 1,000 images, and 1,000 images for the training, validation, and test processes respectively.

The Flickr8k dataset was published in 2013 by Hodosh, Young, and Hockenmaier [13], with the accompanying paper in the Journal of Artificial Intelligence Research. It contains 8,091 images with five sentence descriptions each. It comprises 6,000 images, 1,000 images, and 1,000 images for the training, validation, and test processes respectively.

Figure 6. Sample images and captions. Example reference captions include: (image 1) "Woman in swim suit holding parasol on sunny day."; "A woman posing for the camera, holding a pink, open umbrella and wearing a bright, floral, ruched bathing suit, by a life guard stand with lake, green trees, and a blue sky with a few clouds behind."; "A woman in a floral swimsuit holds a pink umbrella."; "A woman with an umbrella near the sea."; "A girl in a bathing suit with a pink umbrella."; (image 2) "A young boy in winter clothes skiing in a very snowy landscape."; "A little boy in a bright jacket on skis in the snow."; "There is a young boy that is riding his skies down hill."; "A little boy that is standing on ski."; "A young boy in an orange snow jacket is on skis."; (image 3) "Two giraffes standing together and looking towards an area of trees and bushes."; "A couple of giraffes walk next to some trees."; "Two giraffes standing in the grass near trees."; "Two giraffes side by side in the tall grass look into the shaded tree line."; "A couple of giraffes that are walking in the grass."

4.2 Evaluation metrics

In our experiments, automated evaluation metrics are used to assess the proposed model: ROUGE-L (RG-L) [41], CIDEr (CDR) [42], and BLEU (B-1, B-2, B-3, B-4) [43].

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measures the quality of a generated description automatically by contrasting it with human-generated summaries, with a focus on recall.

Consensus-based Image Description Evaluation (CIDEr) calculates how closely a candidate sentence matches how most people describe the image, using TF-IDF (Term Frequency-Inverse Document Frequency) weighting of n-grams.

The Bilingual Evaluation Understudy (BLEU) metric judges the similarity between the ground-truth sentences and the sentence generated by the machine as a fraction of matching n-grams (n = 1, 2, 3, 4).
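As an illustration of how the cumulative BLEU-1 to BLEU-4 scores can be computed, the sketch below uses NLTK's corpus_bleu on a made-up reference/hypothesis pair. It demonstrates the metric only and is not the paper's evaluation script (which also reports ROUGE-L and CIDEr).

```python
# Sketch: cumulative BLEU-1..BLEU-4 with NLTK on toy data (illustration only).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [  # one list of tokenized reference captions per test image
    [["a", "dog", "runs", "on", "the", "beach"],
     ["a", "dog", "is", "running", "along", "the", "sand"]],
]
hypotheses = [  # one generated caption per test image
    ["a", "dog", "runs", "along", "the", "beach"],
]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)          # uniform weights over 1..n grams
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```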

4.3 Experimental setup

The code for the TT-LSTM model is written in the Python programming language. Keras CNN pre-trained models are used for this study and run in the PyCharm editor. The models run on an RTX 2070 Super 8GB GPU, 32GB of memory, and a 64-bit Ubuntu operating system.

For the image encoder, we utilize the Xception model, the ReLU activation function, and a RepeatVector layer. A 2048-dimensional feature vector is extracted from the fully connected layer of the Xception model by removing the last classification layer. The ReLU activation is applied with an embedding size of 256, and the RepeatVector layer regenerates a 2D array as the input for the subsequent layer.

The language encoder is constructed with a word embedding, an LSTM model, and a TimeDistributed layer. The embedding size and the number of units are 256. The input vector size of the language encoding model is the maximum length of a sentence, which can differ according to the dataset. The TimeDistributed layer is added to the language encoder since it is very helpful for time-series processes such as sequences of video frames and sentences. Before decoding, a concatenation step combines the image encoder and language encoder outputs. In the decoder model, the concatenation result enters a bidirectional LSTM with 256 units, and finally a softmax activation layer is applied over the vocabulary size. The training model structure for the Flickr8k dataset is shown in Fig. 7.

The training process is conducted with the Adam optimization algorithm [44], a learning rate of 0.0001, and a batch size of 32. Categorical cross-entropy is used as the cost function. An early stopping parameter is used to halt training, measuring the validation performance after each training epoch and stopping as soon as the validation performance begins to decline. To assess the trained models, we also generated captions for the images in the testing dataset with beam search, using a beam width of 2 and clipping the caption at the maximum number of words in a sentence. Fig. 8 shows the training and validation loss for all datasets of our proposed model.

Figure 7. Training model structure for Flickr8k
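The sketch below assembles a training graph as one reading of this setup: a 2048-dimensional Xception feature vector projected to 256 units and repeated along the time axis, a word-embedding/LSTM/TimeDistributed language encoder, concatenation, a 256-unit bidirectional LSTM decoder, and a softmax over the vocabulary, trained with Adam (learning rate 0.0001), categorical cross-entropy, batch size 32, and early stopping. The layer sizes follow the text, but the concatenation axis and the example vocabulary size are assumptions, not released code.

```python
# Sketch of the TT-LSTM training graph as read from Section 4.3 (assumptions noted).
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_tt_lstm(vocab_size, max_len, feat_dim=2048, emb=256):
    # Image encoder: Xception feature vector -> dense projection -> repeated
    # along the time axis so it can be concatenated with the word sequence.
    img_in = layers.Input(shape=(feat_dim,), name="xception_features")
    img = layers.Dense(emb, activation="relu")(img_in)
    img = layers.RepeatVector(max_len)(img)                   # (max_len, 256)

    # Language encoder (first tier): embedding -> LSTM -> TimeDistributed dense.
    txt_in = layers.Input(shape=(max_len,), name="word_ids")
    txt = layers.Embedding(vocab_size, emb)(txt_in)
    txt = layers.LSTM(emb, return_sequences=True)(txt)
    txt = layers.TimeDistributed(layers.Dense(emb))(txt)      # (max_len, 256)

    # Decoder (second tier): concatenate both encodings, then a bidirectional LSTM.
    merged = layers.Concatenate()([img, txt])                 # (max_len, 512), feature-axis concat assumed
    dec = layers.Bidirectional(layers.LSTM(emb))(merged)
    out = layers.Dense(vocab_size, activation="softmax")(dec)

    model = Model([img_in, txt_in], out, name="tt_lstm")
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy")
    return model

# Training as described: batch size 32, early stopping on the validation performance.
# The data pipeline is not shown here, so the fit call is left as a commented example.
# model = build_tt_lstm(vocab_size=7500, max_len=34)   # hypothetical sizes
# model.fit([img_feats, padded_seqs], next_word_onehot, batch_size=32, epochs=50,
#           validation_data=val_data,
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=2,
#                                                       restore_best_weights=True)])
```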

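Section 4.3 also states that test captions are decoded with beam search using a beam width of 2. The sketch below shows one way such a decoding loop can look for the model above; model, tokenizer, max_len, and the startseq/endseq marker tokens are assumed to come from a surrounding pipeline, so it is illustrative rather than runnable on its own.

```python
# Schematic beam-search decoding (beam width 2); assumes a trained model,
# a Keras Tokenizer, and startseq/endseq markers from the training pipeline.
import numpy as np

def beam_search_caption(model, img_feat, tokenizer, max_len, beam_width=2):
    start, end = tokenizer.word_index["startseq"], tokenizer.word_index["endseq"]
    beams = [([start], 0.0)]                      # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                    # finished captions are carried over
                candidates.append((seq, score))
                continue
            padded = np.zeros((1, max_len))
            padded[0, :len(seq)] = seq
            probs = model.predict([img_feat[None, :], padded], verbose=0)[0]
            for w in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(w)], score + np.log(probs[w] + 1e-12)))
        # keep only the best `beam_width` partial captions
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best = beams[0][0]
    id_to_word = {i: w for w, i in tokenizer.word_index.items()}
    return " ".join(id_to_word[i] for i in best[1:] if i != end)
```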
Table 1. Comparative results with baseline model for the MSCOCO dataset
Models RG-L CDR B4 B3 B2 B1
Inject 54.5 79.0 22.0 32.0 46.1 68.0
Merge 54.3 79.0 21.7 31.5 45.7 67.9
TT-LSTM 55.0 79.1 22.5 32.9 47.6 69.4

Table 2. Comparative results with baseline model for the Flickr30k dataset
Models RG-L CDR B4 B3 B2 B1
Inject 48.9 39.8 16.8 26.2 38.5 62.5
Merge 49.1 40.1 17.0 26.6 40.0 63.3
TT-LSTM 50.2 40.3 20.0 29.7 44.2 66.7

Table 3. Comparative results with baseline model for the Flickr8k dataset
Models RG-L CDR B4 B3 B2 B1
Inject 49.3 43.3 15.3 24.2 37.0 58.2
Merge 49.2 41.1 15.2 24.5 37.9 60.1
TT-LSTM 51.9 49.1 18.4 28.9 43.5 65.8


Figure 8. Training and validation loss: (a) MSCOCO dataset, (b) Flickr30k dataset, and (c) Flickr8k dataset

4.4 Results

For the experimental results, we choose the inject model and the merge model as baseline methods to compare against the proposed efficient encoder-decoder model, and we analyse the Two-Tier LSTM (TT-LSTM) model with its standard (one-directional) LSTM and Bi-directional LSTM. Tables 1, 2, and 3 show the evaluation results of the TT-LSTM model on the MSCOCO, Flickr30k, and Flickr8k datasets, respectively.

Compared with the two baseline models, the proposed model achieves better results on all evaluation metrics. On the large MSCOCO dataset, all comparative results are slightly higher than the baseline models. On the Flickr30k dataset, the four BLEU scores (B4, B3, B2, and B1) of the proposed model are clearly improved over the baseline models. On the Flickr8k dataset, all evaluation results are higher than those of the inject model and the merge model. Therefore, the performance of the proposed model confirms the effectiveness of its caption generation process over the two baseline models, inject and merge, on all three datasets.

The proposed method is also compared with other classical models: Google NIC [5]; init-inject, pre-inject, and par-inject [27]; and LSTM (FD+RD) and EE-LSTM [28]. The comparative results with the classical models are given in Tables 4, 5, and 6 for MSCOCO, Flickr30k, and Flickr8k, respectively.

In Table 4, on the MSCOCO dataset, the proposed encoder-decoder framework remains competitive, but only BLEU-1 (B1), BLEU-3 (B3), and ROUGE-L (RG-L) show good results compared with the state-of-the-art. This is because the MSCOCO dataset contains many images with multiple and complex scenes whose contents are difficult to detect completely; the proposed method can generate sentences with correct words, but it is not completely correct for compound phrases.

On the Flickr30k and Flickr8k datasets, the proposed model achieves moderately improved results under the categorical cross-entropy loss. For all evaluation metrics except the BLEU-3 (B3) and BLEU-4 (B4) scores, TT-LSTM produces considerably improved outcomes over all competing approaches on the Flickr30k and Flickr8k datasets. Since long sequences of words are not completely correct in this study, the BLEU-3 (B3) and BLEU-4 (B4) evaluations, calculated with 3-grams and 4-grams of the sentence, are a little worse. The evaluation scores decrease markedly when the number of terms is long; since the model considers all words equally, the usage of words cannot be fully exploited. Generally, the proposed model consistently performs better than other state-of-the-art methods on almost all evaluations.

Fig. 9 shows sample results of image caption generation using the proposed approach. Because the proposed model is developed with two encoder models, one for the image and one for the language, the system can fully exploit the features of both the image and the language. Additionally, since the decoder model uses a bi-directional LSTM, the trained model can be more accurate. The proposed system model is therefore able to generate good captions for images automatically.

5 Conclusion

In this paper, we investigate an efficient encoder-decoder framework for image caption generation, namely the Two-Tier LSTM (TT-LSTM). The TT-LSTM model applies two encoder models, an image encoder and a language encoder, and one decoder model. For the image encoder, XceptionNet is utilized, and an LSTM is used for the language encoder. The two outputs from the image encoder and language encoder are then concatenated and entered into the decoder model, and a Bi-directional LSTM is processed as the decoder to generate the relevant caption for the query image. We implement learning schemes to train the proposed model on three benchmark datasets, MSCOCO, Flickr30k, and Flickr8k, to perform image caption generation. Our method enhances the quality of the generated captions, and, compared to related image captioning approaches, the proposed approach achieves reasonably competitive efficiency; this observation further demonstrates the efficiency and generalization of our model. In a future study, we will add an attention mechanism to the TT-LSTM model to achieve more accurate captions and better performance.

Table 4. Comparative results with classical models for the MSCOCO dataset
Models RG-L CDR B4 B3 B2 B1
GoogleNIC [5] - - 24.6 32.9 46.1 66.6
init-inject [27] 49.9 81.8 27.1 36.7 50.2 67.9
pre-inject [27] 49.8 80.7 26.7 36.6 50.1 67.7
par-inject [27] 49.3 77.4 26.5 35.9 49.2 66.7
LSTM (FD+RD) [28] - - 25.3 36.6 51.1 69.1
EE-LSTM [28] - - 26.9 36.4 49.8 67.5
TT-LSTM 55.0 79.1 22.5 36.7 47.6 69.4

Table 5. Comparative results with classical model for the Flickr30k dataset
Models RG-L CDR B4 B3 B2 B1
GoogleNIC [5] - - 18.3 27.7 42.3 66.3
init-inject [27] 42.5 38.3 19.1 28.3 41.9 61.3
pre-inject [27] 42.0 38.0 19.2 28.4 41.9 61.3
par-inject [27] 41.8 36.1 18.3 27.5 41.0 60.5
LSTM (FD+RD) [28] - - 20.5 30.9 43.2 64.0
EE-LSTM [28] - - 17.0 25.7 39.1 59.2
TT-LSTM 50.2 40.3 20.0 29.7 44.2 66.7

Table 6. Comparative results with classical model for the Flickr8k dataset
Models RG-L CDR B4 B3 B2 B1
GoogleNIC [5] - - - 27.0 41.0 63.0
init-inject [27] 44.5 48.1 19.1 28.5 42.4 61.1
pre-inject [27] 44.4 46.9 19.0 28.5 42.1 60.9
par-inject [27] 44.8 47.5 19.1 28.7 42.4 61.1
LSTM (FD+RD) [28] - - 19.5 33.9 42.8 63.8
EE-LSTM [28] - - 18.4 27.5 40.8 59.8
TT-LSTM 51.9 49.1 18.4 28.9 43.5 65.8


Figure 9. Example results of the proposed approach. Generated captions for four test images: "two giraffes standing in the middle of field"; "an airplane is flying through the sky"; "man riding bike on the side of street"; "group of people sitting on bench in front of building".

Conflicts of Interest

The authors declare no conflict of interest regarding the research, publishing, and/or authorship of this article.

Author Contributions

Phyu Phyu Khaing contributed to developing the system, carrying out the analysis, interpreting the findings, and composing the manuscript. This study was supervised by Dr. May The` Yu.

Acknowledgments

I am thankful to my supervisor, Dr. May The` Yu, for her supportive and constructive guidance on the planning and development of this research work.

References
[1] R. Kiros, R. Salakhutdinov, and R. Zemel, "Multimodal Neural Language Models", In: Proc. of International Conf. on Machine Learning, pp. 595-603, 2014.
[2] M. Ivasic-Kos, I. Ipsic, and S. Ribaric, "A Knowledge-Based Multi-Layered Image Annotation System", Expert Systems with Applications, Vol. 42, No. 24, pp. 9539-9553, 2015.
[3] M. Ivašić-Kos, M. Pavlić, and M. Pobar, "Analyzing the Semantic Level of Outdoor Image Annotation", In: Proc. of MIPRO 2009 - 32nd International Convention on Information and Communication Technology, Electronics and Microelectronics, pp. 293, 2009.
[4] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig, "From Captions to Visual Concepts and Back", In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1473-1482, 2015.
[5] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and Tell: A Neural Image Caption Generator", In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pp. 3156-3164, 2015.
[6] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", In: Proc. of the 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724-1734, 2014.
[7] A. Graves, "Generating Sequences with Recurrent Neural Networks", arXiv preprint arXiv:1308.0850, pp. 1-43, 2013.

[8] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", In: Proc. of International Conf. on Machine Learning, Lille, France, pp. 2048-2057, 2015.
[9] Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. W. Cohen, "Encode, Review, and Decode: Reviewer Module for Caption Generation", Computing Research Repository (CoRR), arXiv:1605.07912v3, pp. 1-9, 2016.
[10] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning", In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 375-383, 2017.
[11] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context", In: Proc. of European Conf. on Computer Vision, Springer, Cham, pp. 740-755, 2014.
[12] B. A. Plummer, L. Wang, M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, "Flickr30k Entities: Collecting Region-To-Phrase Correspondences for Richer Image-To-Sentence Models", In: Proc. of the IEEE International Conf. on Computer Vision, pp. 2641-2649, 2015.
[13] M. Hodosh, P. Young, and J. Hockenmaier, "Framing Image Description as A Ranking Task: Data, Models and Evaluation Metrics", Journal of Artificial Intelligence Research, Vol. 47, pp. 853-899, 2013.
[14] Y. Yang, C. Teo, H. Daumé III, and Y. Aloimonos, "Corpus-Guided Sentence Generation of Natural Images", In: Proc. of the 2011 Conf. on Empirical Methods in Natural Language Processing, pp. 444-454, 2011.
[15] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "Babytalk: Understanding and Generating Simple Image Descriptions", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 12, pp. 2891-2903, 2013.
[16] Y. Ushiku, M. Yamaguchi, Y. Mukuta, and T. Harada, "Common Subspace for Model and Similarity: Phrase Learning for Caption Generation from Images", In: Proc. of the IEEE International Conf. on Computer Vision, pp. 2668-2676, 2015.
[17] C. Sun, C. Gan, and R. Nevatia, "Automatic Concept Discovery from Parallel Text and Visual Corpora", In: Proc. of the IEEE International Conf. on Computer Vision, pp. 2596-2604, 2015.
[18] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell, "Language Models for Image Captioning: The Quirks and What Works", arXiv preprint arXiv:1505.01809, 2015.
[19] V. Ordonez, X. Han, P. Kuznetsova, G. Kulkarni, M. Mitchell, K. Yamaguchi, K. Stratos, A. Goyal, J. Dodge, A. Mensch, H. Daumé, A. C. Berg, Y. Choi, and T. L. Berg, "Large Scale Retrieval and Generation of Image Descriptions", International Journal of Computer Vision, Vol. 119, No. 1, pp. 46-59, 2016.
[20] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, "Guiding the Long-Short Term Memory Model for Image Caption Generation", In: Proc. of the IEEE International Conf. on Computer Vision, pp. 2407-2415, 2015.
[21] J. Lu, J. Yang, D. Batra, and D. Parikh, "Neural Baby Talk", In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 7219-7228, 2018.
[22] M. Tanti, A. Gatt, and K. P. Camilleri, "What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?", In: Proc. of the 10th International Conf. on Natural Language Generation, pp. 51-60, 2017.
[23] J. Gu, J. Cai, G. Wang, and T. Chen, "Stack-Captioning: Coarse-To-Fine Learning for Image Captioning", In: Proc. of the 32nd AAAI Conf. on Artificial Intelligence, pp. 6837-6844, 2018.
[24] W. Jiang, L. Ma, X. Chen, H. Zhang, and W. Liu, "Learning to Guide Decoding for Image Captioning", In: Proc. of the 33rd AAAI Conf. on Artificial Intelligence, pp. 6959-6966, 2018.
[25] N. Xu, H. Zhang, A. A. Liu, W. Nie, Y. Su, J. Nie, and Y. Zhang, "Multi-Level Policy and Reward-Based Deep Reinforcement Learning Framework for Image Captioning", IEEE Transactions on Multimedia, Vol. 22, No. 5, pp. 1372-1383, 2019.
[26] G. Huang and H. Hu, "c-RNN: A Fine-Grained Language Model for Image Captioning", Neural Processing Letters, Vol. 49, No. 2, pp. 683-691, 2019.
[27] M. Tanti, A. Gatt, and K. Camilleri, "Where to put the Image in an Image Caption Generator", Natural Language Engineering, Vol. 24, No. 3, pp. 467-489, 2018.
[28] X. Zhang, S. He, X. Song, R. W. H. Lau, J. Jiao, and Q. Ye, "Image Captioning via Semantic Element Embedding", Neurocomputing, Vol. 395, pp. 212-221, 2020.

Element Embedding”, Neurocomputing, Vol. International Workshop on Health Text Mining


395, pp. 212-221, 2020. and Information Analysis, pp. 17-27, 2016.
[29] S. Vishnubhatla, and N. Sinha, “Image [41] C.Y. Lin, “Rouge: A Package for Automatic
Captioning with Pretrained Language Evaluation of Summaries”, In: Text
Generators”, In: Proc. of 8th ACM IKDD CODS Summarization Branches Out, pp. 74-81, 2004.
and 26th COMAD, pp. 427-427, 2021. [42] R. Vedantam, C.L. Zitnick, and D. Parikh,
[30] W. Liu, S. Chen, L. Guo, X. Zhu, and J. Liu, “CIDEr: Consensus-based Image Description
“CPTR: Full Transformer Network for Image Evaluation”, In: Proc. of the IEEE Conf. on
Captioning”, arXiv preprint arXiv:2101.10804, Computer Vision and Pattern Recognition, pp.
2021. 4566-4575, 2015.
[31] P. Cao, Z. Yang, L. Sun, Y. Liang, M. Qu Yang, [43] K. Papineni, S. Roukos, T. Ward, and W.J. Zhu,
and R. Guan, “Image Captioning with “BLEU: A Method for Automatic Evaluation of
Bidirectional Semantic Attention-Based Machine Translation”, In: Proc. of the 40th
Guiding of Long Short-Term Memory”, Neural Annual Meeting of the Association for
Processing Letters, Vol. 50, No. 1, pp. 103-119, Computational Linguistics, pp. 311-318, 2002.
2019. [44] D. P. Kingma, and J. Ba, “Adam: A method for
[32] J. Chen, X. Song, L. Nie, X. Wang, H. Zhang, stochastic optimization”, In: Proc. of 3rd
and T.S. Chua, “Micro Tells Macro: Predicting International Conf. on Learning
the Popularity of Micro-Videos via A Representations (ICLR), pp. 1-15, 2015.
Transductive Model”, In: Proc. of the 24th ACM
International Conf. on Multimedia, pp. 898-907,
2016.
[33] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo,
“Image Captioning with Semantic Attention”,
In: Proc. of the IEEE Conf. on Computer Vision
and Pattern Recognition, pp. 4651-4659, 2016.
[34] J. Mun, M. Cho, and B. Han, “Text-Guided
Attention Model for Image Captioning”, In:
Proc. of the Thirty-First AAAI Conf. on
Artificial Intelligence, pp. 4233–4239, 2017.
[35] L. Zhou, C. Xu, P. Koch, and J. J. Corso, “Watch
What You Just Said: Image Captioning with
Text-Conditional Attention”, In: Proc. of the on
Thematic Workshops of ACM Multimedia, pp.
305-313, 2017.
[36] P. Anderson, X. He, C. Buehler, D. Teney, M.
Johnson, S. Gould, and L. Zhang, “Bottom-Up
and Top-Down Attention for Image Captioning
and Visual Question Answering”, In: Proc. of
the IEEE Conf. on Computer Vision and Pattern
Recognition, pp. 6077-6086, 2018.
[37] K. O'Shea and R. Nash, “An Introduction to
Convolutional Neural Networks”, arXiv
preprint arXiv:1511.08458, 2015.
[38] F. Chollet, “Xception: Deep learning with
Depthwise Separable Convolutions”, In: Proc.
of the IEEE Conf. on Computer Vision and
Pattern Recognition, pp. 1251-1258, 2017.
[39] S. Hochreiter, and J. Schmidhuber, “Long short-
term memory”, Neural Computation, Vol. 9, No.
8, pp.1735-1780, 1997.
[40] S. Cornegruta, R. Bakewell, S. Withey, and G.
Montana, “Modelling Radiological Language
with Bidirectional Long Short-Term Memory
Networks”, In: Proc. of the Seventh
