Two-Tier LSTM Model
1 Image Processing Lab, University of Computer Studies, Mandalay (UCSM), Myanmar
2 Faculty of Information Science, University of Computer Studies, Mandalay (UCSM), Myanmar
* Corresponding author’s Email: [email protected]
Abstract: Image captioning is the task of automatically generating a sentence that describes an image. In the past few years, the automatic generation of image captions has attracted great interest in the Artificial Intelligence (AI) field. Image captioning sits at the conjunction of image processing and natural language processing, taking an image as input and producing a sentence as output. Image processing tasks such as image segmentation, object tracking, object detection, and image recognition are mostly performed with Convolutional Neural Networks (CNNs), while Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are behind some of the biggest breakthroughs in natural language processing tasks such as semantic role labelling, neural machine translation, speech recognition, and question answering. This paper proposes an efficient encoder-decoder image captioning model, namely the Two-Tier LSTM (TT-LSTM) Model. The TT-LSTM model is implemented upon the encoder-decoder framework with two LSTM layers. The model is evaluated on the MSCOCO, Flickr30k, and Flickr8k datasets with standard evaluation metrics: ROUGE-L, CIDEr, and four BLEU scores. The experimental results on these benchmark datasets show that the proposed model generates meaningful natural language sentences, improves sentence generation efficiency, and achieves better performance for image caption generation.
Keywords: Image caption generation, Deep learning, Encoder-decoder, Two-tier LSTM model.
A Convolutional Neural Network (CNN) is commonly utilized at the encoding stage of the encoder-decoder system to extract features from the image, and a Recurrent Neural Network (RNN) is implemented to construct the description of the input image [6, 7]. The image caption can then be generated with an attention mechanism that focuses on particular parts of the input image [8-10].

This paper proposes the Two-Tier LSTM (TT-LSTM) captioning model, built upon the inject and merge models as an efficient encoder-decoder framework. Unlike previous works, the proposed model utilizes two LSTM layers. In the proposed model, a Convolutional Neural Network (CNN) encodes the input image, while Long Short-Term Memory (LSTM) networks encode the sentence caption and generate the sentence as the decoder. The pretrained CNN (XceptionNet) is used to extract high-level visual semantic knowledge as the image feature vector from the second-last activation layer of XceptionNet, dropping the last classification layer. A standard LSTM encodes the embedded word sequence as the language encoder. A Bidirectional LSTM is utilized as the decoder; it receives the blended input, which combines the encoded image vector with the encoded sentence vector, so that both previous and next contexts are captured.

Generally, this paper makes three main technical contributions:
• To achieve precise captions for image caption generation, this paper develops a Two-Tier LSTM (TT-LSTM) Model for generating image captions.
• Three benchmark datasets, MSCOCO [11], Flickr30k [12], and Flickr8k [13], are used in detailed experiments to judge the capability of the proposed model.
• Our methodology is assessed with standard evaluation measurements to illustrate comparative outcomes against state-of-the-art approaches.

The rest of this paper is structured as follows. Section 2 describes the related work concerning the proposed method. Section 3 demonstrates the proposed system architecture. Section 4 presents the experimental implementation of the proposed model. Section 5 concludes the paper.

2. Related work

This part discusses the literature of previous research works for image caption generation. Traditional image captioning methods and deep learning-based methods are the existing methods for image captioning.

2.1 Traditional image captioning models

Traditional image captioning methods fall into two particular types: template-based methods, which focus on filling a template, and retrieval-based methods, which focus on retrieval processes.

Template-based methods must predefine templates with a variety of relationships between the objects in the image and the labels of these objects. Such relationships and object names fill the empty slots, and the caption of the image is obtained from the filled slots. The phrases generated by these methods to describe the image, however, form only one sentence per image, and that sentence is not a distinct expression. To generate relevant image descriptions, a Hidden Markov Model (HMM) is employed in [14] by filtering the highest log-likelihood from four corpora that consist of objects, verbs, scenes, and prepositions. Similarly, in [15], a Conditional Random Field (CRF) model is implemented with attribute, object, and preposition prediction to construct the sentence according to predefined templates. A new subspace embedding approach for image caption generation, called Common Subspace for Model and Similarity (CoSMoS), is suggested in [16].

With retrieval methods, the image captioning problem is treated as an information collection problem. The first step is to construct the image description from the training samples by deriving a meaningful sentence for a similar image. After that, the image description is updated to express the information of the input image. The retrieval-based method undoubtedly generates the caption of the image depending on a large dataset, and the generated image caption is restricted to the descriptions in the training dataset. An automated visual concept discovery (VCD) algorithm is introduced in [17], utilizing parallel text and image corpora for bidirectional image and sentence retrieval tasks and image tagging tasks. To choose the consensus caption for an image, in [18], a simple K-Nearest Neighbor (KNN) retrieval model is utilized, relying on a neural network to extract image features. In [19], the authors investigated natural language description generation methods for images by developing data-driven retrieval-based strategies that applied two approaches: applying global features, and utilizing the detection of objects, regions, and scenes.
2.2 Deep learning-based methods

The encoder-decoder framework is the most efficient deep learning model for generating image captions. Recent encoder-decoder frameworks have emerged and been successful, allowing image captions with fluent and varied phrases to be produced.

The encoder-decoder framework for generating image captions was suggested in [1]. In image captioning, the widespread adoption of the encoder-decoder method stems from its powerful effect on the challenges of machine translation. The Google NIC model, which applied Inception v3 as the encoder and Long Short-Term Memory (LSTM) as the decoder, was introduced by the authors of [5]; the image information was given as input only at the start of the LSTM network. In [20], the authors suggested an extension of the LSTM model that implements semantic knowledge to guide the LSTM network in producing a text description for the image.

In [21], a new image captioning framework, namely Neural Baby Talk, is proposed by applying object detection to generate the natural language sentence. The authors of [22] examined two perspectives of the RNN's task in an image caption generator, called the inject model and the merge model. In the inject model, both the image and the words are entered into the RNN, whereas in the merge model the RNN is used only to encode the word sequence. In our proposed method, the RNN is used for two purposes, as the word sequence encoder and as the sentence generator; therefore, our proposed model is more accurate than the previous research work.

In order to obtain clearer image descriptions, the authors of [23] implemented a coarse-to-fine image captioning model that uses a stacked visual attention model in combination with multiple LSTM networks. In [24], a guiding network is developed that extends the encoder-decoder framework at the decoder step, and the guiding vector can be adjusted to integrate image and language information. An RNN-based reinforcement learning framework is proposed in [25] by integrating a novel multi-level policy function (word-level policy and sentence-level policy) and a multi-level reward function (vision-language reward and language-language reward). The language model for image captioning in [26], namely the character-level RNN (c-RNN), is developed by composing the descriptive sentence character by character. The c-RNN model is based on the inject model, one of the encoder-decoder architectures, substituting the character level for the word level. Although the character-level RNN model (c-RNN) is more efficient than other word-level language models, our proposed TT-LSTM model achieves better performance than c-RNN.

In [27], the authors proposed a structural comparison of the various forms of 'conditioning' language choices on the visual input, exploring their repercussions for the architecture of caption generators. The comparison, based on the encoder-decoder framework, covers the init-inject, pre-inject, par-inject, and merge models. The init-inject model uses the image vector as the initial state of the RNN together with the first word of the sequence; the pre-inject model uses the image vector as the initial word of the sequence; the par-inject model feeds the image vector and a word at every time step; and the merge model uses the RNN only to encode the word sequence. All four models use an RNN in one place for different purposes, whereas we apply RNNs at two places in the proposed model.

An image captioning framework with semantic element discovery and embedding is developed in [28]. The element embedding framework comprises the integration of a CNN-LSTM model and object region detection, called LSTM (FD+RD). To bridge the gap between the visual image and the semantic caption, the element embedding long short-term memory (EE-LSTM) uses both visual and semantic, local and global features. The LSTM (FD+RD) and EE-LSTM are more efficient than baseline models for image understanding, but their sentence generation process is not fully developed; our proposed model is therefore constructed with a two-tier LSTM to be more efficient.

In [29], an image captioning framework is developed with a two-stage process, scene graph generation from the image followed by caption generation with pre-trained language generators, based on the encoder-decoder architecture. In [30], the authors developed a CNN+Transformer network for image captioning, namely CaPtion TransformeR (CPTR), which builds global context at every encoder layer.

After the encoder-decoder model, the attention-based approach was added as an extension. In [8], the authors first suggested generating image captions with an attention-based approach.
The merge model blends the encoded form of the image input and the encoded form of the generated text description. A basic decoder model then merges these two encoded inputs to construct the next word in the sequence. It separates the modelling of the image input, the text input, and the combination of the encoded outputs.

3.2 Two-tier LSTM (TT-LSTM) model

The Two-Tier LSTM (TT-LSTM) Model shown in Fig. 1 blends the two basic encoder-decoder frameworks, the inject model and the merge model. The proposed TT-LSTM model is constructed with two encoder models, one for the image and one for the text. The encoded form of the query image and the encoded form of the textual description are combined, and a decoder model uses this combination to construct the word sequence. A Convolutional Neural Network (CNN) is applied as the image encoder. Among the many pre-trained CNN models, Xception, a CNN model trained on the ImageNet dataset, is used in this study. Long Short-Term Memory (LSTM) is applied as the language encoder, and a Bidirectional LSTM is utilized as the decoder.
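To make the data flow concrete, the following is a minimal Keras sketch of the TT-LSTM design described above. The layer sizes, vocabulary size, and maximum caption length are illustrative assumptions rather than values reported in this paper, and repeating the image vector before concatenating it with the encoded word sequence is one plausible reading of the "blended input".

```python
from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding, LSTM,
                                     Bidirectional, RepeatVector, Concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

vocab_size = 8000   # assumed caption vocabulary size
max_len = 34        # assumed maximum caption length (in words)
feat_dim = 2048     # size of the pooled Xception feature vector

# Image encoder: pre-extracted Xception features projected to a common size
# and repeated so they can be paired with every word position.
img_input = Input(shape=(feat_dim,), name="image_features")
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_input))
img_seq = RepeatVector(max_len)(img_vec)

# Language encoder (first LSTM tier): embed and encode the partial caption.
txt_input = Input(shape=(max_len,), name="caption_tokens")
txt_emb = Embedding(vocab_size, 256)(txt_input)
txt_seq = LSTM(256, return_sequences=True)(Dropout(0.5)(txt_emb))

# Decoder (second tier): the blended image/sentence representation is fed to a
# Bidirectional LSTM, and a softmax layer predicts the next word of the caption.
blended = Concatenate()([img_seq, txt_seq])
decoded = Bidirectional(LSTM(256))(blended)
output = Dense(vocab_size, activation="softmax")(decoded)

tt_lstm = Model(inputs=[img_input, txt_input], outputs=output)
tt_lstm.compile(loss="categorical_crossentropy",
                optimizer=Adam(learning_rate=1e-4))  # matches the training setup in Section 4
```

With this layout, the first LSTM tier encodes the caption prefix, and the second, bidirectional tier reads the image-conditioned sequence in both directions before the next word is predicted, which is how the previous and next contexts are captured.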
3.2.1. Convolutional neural network (CNN)

Convolutional Neural Networks (CNNs) [37] are advanced deep neural networks that can handle input data shaped like a 2D matrix. An image is conveniently represented as a 2D matrix, so CNNs are highly useful for image work and are mostly used for image feature extraction. The typical constituents of a CNN are the input layer, convolution layer, pooling layer, activation layer, and fully connected layer. These constituents are implemented as follows (a minimal sketch follows the list):
• The input layer holds the image's raw pixel values and has three colour channels: R, G, and B.
• The convolution layer takes the images (or feature maps) from the previous layers and convolves them with a specified number of filters to construct output feature maps; the number of output maps equals the given number of filters.
• The pooling layer conducts a down-sampling procedure over width and height (the spatial dimensions). Maximum pooling is primarily used for the pooling layer.
• The ReLU function is generally applied as the CNN activation function. The ReLU layer applies an element-wise activation such as the max(0, x) threshold at zero.
• The class values are determined by the Fully Connected (FC) layer. As the name suggests, every unit in this layer is connected to all values in the previous volume, as in ordinary neural networks.
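As a concrete illustration of these constituents, and not the network used in this work (which relies on the pre-trained Xception model described next), a small Keras CNN with assumed filter counts and an assumed 10-class output might look like this:

```python
from tensorflow.keras import layers, models

# A toy CNN showing the typical constituents: input, convolution, ReLU
# activation, max pooling, and a fully connected classification layer.
# Filter counts and the 10-class output are illustrative assumptions.
cnn = models.Sequential([
    layers.Input(shape=(224, 224, 3)),              # raw RGB pixel values
    layers.Conv2D(32, (3, 3), activation="relu"),   # 32 filters -> 32 output maps
    layers.MaxPooling2D((2, 2)),                    # down-sample the spatial dimensions
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),         # fully connected class scores
])
cnn.summary()
```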
3.2.2. Pre-trained convolutional neural network

Pre-trained convolutional neural networks are models that someone else has already trained to solve a similar problem. Pretrained models use many forward and backward iterations to learn the right weights for the network. Much deep learning software exists for pre-trained models; among them, this research uses Keras, written in Python. Keras includes many pre-trained models, such as DenseNet, InceptionNet, MobileNet, NASNet, ResNet, VGGNet, and XceptionNet, each with its associated preprocessing. This research uses the XceptionNet model for feature extraction from the images.

In this study, the XceptionNet [38] model is used as the CNN model to extract the image features. The Xception model is already trained on the ImageNet dataset and is an extension of the Inception architecture in which regular Inception units are replaced with depthwise separable convolutions. The Xception model comes pre-trained with ImageNet weights and has a default input size of 299x299. Its 36 convolution layers are organized into 14 modules, all of which, except the first and last modules, have linear residual connections. Fig. 3 demonstrates the architecture of the Xception model.
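A brief sketch of this feature-extraction step in Keras, with a hypothetical image path, could be:

```python
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

# Load Xception pre-trained on ImageNet and drop the final classification layer,
# keeping the pooled output of the second-last layer as the 2048-d feature vector.
base = Xception(weights="imagenet")
feature_extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

def extract_features(img_path):
    """Return a 2048-dimensional Xception feature vector for one image."""
    img = image.load_img(img_path, target_size=(299, 299))  # Xception's default input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return feature_extractor.predict(x)[0]

# Example call (hypothetical path):
# feats = extract_features("flickr8k/images/example.jpg")
```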
The model structure for the Flickr8k dataset is shown in Fig. 7.

The training process is conducted with the Adam optimization algorithm [45], using a hyper-parameter (learning rate) of 0.0001 and a batch size of 32. Categorical cross-entropy is used as the cost function. Early stopping is used to end the training process: the validation performance is measured after each training epoch, and training stops as soon as the validation performance begins to decline. To assess the trained models, we also generated captions for the images in the testing dataset with beam search, using a beam width of 2 and clipping the caption at the maximum number of words per sentence. Fig. 8 shows the training and validation loss for all datasets of our proposed model.
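The beam-search decoding step can be sketched as follows. This is a minimal version assuming a two-input Keras model of the kind sketched in Section 3.2, a Keras Tokenizer, and 'startseq'/'endseq' boundary tokens; the token names and helper details are assumptions, not values given in the paper.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_search_caption(model, photo_feats, tokenizer, max_len, beam_width=2):
    """Generate a caption with beam search (beam width 2, clipped at max_len words)."""
    start = [tokenizer.word_index["startseq"]]
    beams = [(start, 0.0)]                           # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == tokenizer.word_index["endseq"]:
                candidates.append((seq, score))      # finished beam, keep as is
                continue
            padded = pad_sequences([seq], maxlen=max_len)
            probs = model.predict([photo_feats.reshape(1, -1), padded], verbose=0)[0]
            for w in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(w)], score + np.log(probs[w] + 1e-12)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    best = beams[0][0]
    words = [tokenizer.index_word[i] for i in best if i in tokenizer.index_word]
    return " ".join(w for w in words if w not in ("startseq", "endseq"))
```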
Table 1. Comparative results with baseline models for the MSCOCO dataset
Models ROUGE-L CIDEr BLEU-4 BLEU-3 BLEU-2 BLEU-1
Inject 54.5 79.0 22.0 32.0 46.1 68.0
Merge 54.3 79.0 21.7 31.5 45.7 67.9
TT-LSTM 55.0 79.1 22.5 32.9 47.6 69.4
Table 2. Comparative results with baseline models for the Flickr30k dataset
Models ROUGE-L CIDEr BLEU-4 BLEU-3 BLEU-2 BLEU-1
Inject 48.9 39.8 16.8 26.2 38.5 62.5
Merge 49.1 40.1 17.0 26.6 40.0 63.3
TT-LSTM 50.2 40.3 20.0 29.7 44.2 66.7
Table 3. Comparative results with baseline models for the Flickr8k dataset
Models ROUGE-L CIDEr BLEU-4 BLEU-3 BLEU-2 BLEU-1
Inject 49.3 43.3 15.3 24.2 37.0 58.2
Merge 49.2 41.1 15.2 24.5 37.9 60.1
TT-LSTM 51.9 49.1 18.4 28.9 43.5 65.8
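For reference, the BLEU scores in these tables can be computed with a standard corpus-level implementation such as NLTK's; the sketch below uses made-up reference and candidate captions purely to show the call, and ROUGE-L and CIDEr, which need their own scorers, are not shown.

```python
from nltk.translate.bleu_score import corpus_bleu

# Each test image has several reference captions and one generated caption.
# The captions below are made-up examples, not items from the datasets.
references = [
    [["two", "giraffes", "stand", "in", "a", "field"],
     ["giraffes", "standing", "in", "the", "grass"]],
]
candidates = [["two", "giraffes", "standing", "in", "the", "field"]]

bleu1 = corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0))
bleu4 = corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-1: {bleu1:.3f}, BLEU-4: {bleu4:.3f}")  # reported as percentages in the tables
```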
The proposed TT-LSTM model performs better than other state-of-the-art methods on almost all evaluations.

Fig. 9 shows sample results of image caption generation using the proposed approach. Because the proposed model is developed with two encoder models, one for the image and one for the language, the system can fully exploit the features of both the image and the language. Additionally, since the decoder model uses a Bidirectional LSTM, the trained model can be more accurate. As a result, the proposed system is able to generate good captions for images automatically.

5. Conclusion

In this paper, we investigate an efficient encoder-decoder framework for image caption generation, namely the Two-Tier LSTM (TT-LSTM). The TT-LSTM model applies two encoder models, an image encoder and a language encoder, and one decoder model. XceptionNet is utilized for the image encoder, and an LSTM is used for the language encoder. The outputs of the image encoder and the language encoder are then concatenated and entered into the decoder model. A Bidirectional LSTM serves as the decoder model and generates the relevant caption for the query image. We implement learning schemes to train the proposed model on three benchmark datasets, MSCOCO, Flickr30k, and Flickr8k, to perform image caption generation. Our method enhances the quality of the generated captions, so, compared to related image captioning approaches, the proposed approach achieves reasonably competitive efficiency. This further demonstrates the efficiency and generalization of our model. In future work, we will add an attention mechanism to the TT-LSTM model to achieve more accurate captions and better performance.
Table 4. Comparative results with classical models for the MSCOCO dataset
Models ROUGE-L CIDEr BLEU-4 BLEU-3 BLEU-2 BLEU-1
GoogleNIC [5] - - 24.6 32.9 46.1 66.6
init-inject [27] 49.9 81.8 27.1 36.7 50.2 67.9
pre-inject [27] 49.8 80.7 26.7 36.6 50.1 67.7
par-inject [27] 49.3 77.4 26.5 35.9 49.2 66.7
LSTM (FD+RD) [28] - - 25.3 36.6 51.1 69.1
EE-LSTM [28] - - 26.9 36.4 49.8 67.5
TT-LSTM 55.0 79.1 22.5 36.7 47.6 69.4
Table 5. Comparative results with classical models for the Flickr30k dataset
Models ROUGE-L CIDEr BLEU-4 BLEU-3 BLEU-2 BLEU-1
GoogleNIC [5] - - 18.3 27.7 42.3 66.3
init-inject [27] 42.5 38.3 19.1 28.3 41.9 61.3
pre-inject [27] 42.0 38.0 19.2 28.4 41.9 61.3
par-inject [27] 41.8 36.1 18.3 27.5 41.0 60.5
LSTM (FD+RD) [28] - - 20.5 30.9 43.2 64.0
EE-LSTM [28] - - 17.0 25.7 39.1 59.2
TT-LSTM 50.2 40.3 20.0 29.7 44.2 66.7
Table 6. Comparative results with classical models for the Flickr8k dataset
Models ROUGE-L CIDEr BLEU-4 BLEU-3 BLEU-2 BLEU-1
GoogleNIC [5] - - - 27.0 41.0 63.0
init-inject [27] 44.5 48.1 19.1 28.5 42.4 61.1
pre-inject [27] 44.4 46.9 19.0 28.5 42.1 60.9
par-inject [27] 44.8 47.5 19.1 28.7 42.4 61.1
LSTM (FD+RD) [28] - - 19.5 33.9 42.8 63.8
EE-LSTM [28] - - 18.4 27.5 40.8 59.8
TT-LSTM 51.9 49.1 18.4 28.9 43.5 65.8
Figure 9. Example results of the proposed approach. Generated captions: "two giraffes standing in the middle of field"; "an airplane is flying through the sky"; "man riding bike on the side of street"; "group of people sitting on bench in front of building".
[8] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", In: Proc. of International Conf. on Machine Learning, Lille, France, pp. 2048-2057, 2015.
[9] Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. W. Cohen, "Encode, Review, and Decode: Reviewer Module for Caption Generation", Computing Research Repository (CoRR), arXiv:1605.07912v3, pp. 1-9, 2016.
[10] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning", In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 375-383, 2017.
[11] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context", In: Proc. of European Conf. on Computer Vision, Springer, Cham, pp. 740-755, 2014.
[12] B. A. Plummer, L. Wang, M. Cervantes, Juan C. Caicedo, J. Hockenmaier, and S. Lazebnik, "Flickr30k Entities: Collecting Region-To-Phrase Correspondences for Richer Image-To-Sentence Models", In: Proc. of the IEEE International Conf. on Computer Vision, pp. 2641-2649, 2015.
[13] M. Hodosh, P. Young, and J. Hockenmaier, "Framing Image Description as A Ranking Task: Data, Models and Evaluation Metrics", Journal of Artificial Intelligence Research, Vol. 47, pp. 853-899, 2013.
[14] Y. Yang, C. Teo, H. Daumé III, and Y. Aloimonos, "Corpus-Guided Sentence Generation of Natural Images", In: Proc. of the 2011 Conf. on Empirical Methods in Natural Language Processing, pp. 444-454, 2011.
[15] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "Babytalk: Understanding and Generating Simple Image Descriptions", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 12, pp. 2891-2903, 2013.
[16] Y. Ushiku, M. Yamaguchi, Y. Mukuta, and T. Harada, "Common Subspace for Model and Similarity: Phrase Learning for Caption Generation from Images", In: Proc. of the IEEE International Conf. on Computer Vision, pp. 2668-2676, 2015.
[17] C. Sun, C. Gan, and R. Nevatia, "Automatic Concept Discovery from Parallel Text and Visual Corpora", In: Proc. of the IEEE International Conf. on Computer Vision, pp. 2596-2604, 2015.
[18] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell, "Language Models for Image Captioning: The Quirks and What Works", arXiv preprint arXiv:1505.01809, 2015.
[19] V. Ordonez, X. Han, P. Kuznetsova, G. Kulkarni, M. Mitchell, K. Yamaguchi, K. Stratos, A. Goyal, J. Dodge, A. Mensch, H. Daumé, A. C. Berg, Y. Choi, and T. L. Berg, "Large Scale Retrieval and Generation of Image Descriptions", International Journal of Computer Vision, Vol. 119, No. 1, pp. 46-59, 2016.
[20] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, "Guiding the Long-Short Term Memory Model for Image Caption Generation", In: Proc. of the IEEE International Conf. on Computer Vision, pp. 2407-2415, 2015.
[21] J. Lu, J. Yang, D. Batra, and D. Parikh, "Neural Baby Talk", In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 7219-7228, 2018.
[22] M. Tanti, A. Gatt, and K. P. Camilleri, "What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?", In: Proc. of the 10th International Conf. on Natural Language Generation, pp. 51-60, 2017.
[23] J. Gu, J. Cai, G. Wang, and T. Chen, "Stack-Captioning: Coarse-To-Fine Learning for Image Captioning", In: Proc. of the 32nd AAAI Conf. on Artificial Intelligence, pp. 6837-6844, 2018.
[24] W. Jiang, L. Ma, X. Chen, H. Zhang, and W. Liu, "Learning to Guide Decoding for Image Captioning", In: Proc. of the 33rd AAAI Conf. on Artificial Intelligence, pp. 6959-6966, 2018.
[25] N. Xu, H. Zhang, A. A. Liu, W. Nie, Y. Su, J. Nie, and Y. Zhang, "Multi-Level Policy and Reward-Based Deep Reinforcement Learning Framework for Image Captioning", IEEE Transactions on Multimedia, Vol. 22, No. 5, pp. 1372-1383, 2019.
[26] G. Huang and H. Hu, "c-RNN: A Fine-Grained Language Model for Image Captioning", Neural Processing Letters, Vol. 49, No. 2, pp. 683-691, 2019.
[27] M. Tanti, A. Gatt, and K. Camilleri, "Where to put the Image in an Image Caption Generator", Natural Language Engineering, Vol. 24, No. 3, pp. 467-489, 2018.
[28] X. Zhang, S. He, X. Song, R. W. H. Lau, J. Jiao, and Q. Ye, "Image Captioning via Semantic