
Hybrid Image Captioning Model

2022 OPJU International Technology Conference on Emerging Technologies for Sustainable Development (OTCON) | DOI: 10.1109/OTCON56053.2023.10113957

Lipismita Panigrahi
Department of School of Computer Applications
KIIT University, Bhubaneswar, India
[email protected]

Raghab Ranjan Panigrahi
Department of Computer Science and Engineering
SOA University, Bhubaneswar, India
[email protected]

Saroj Kumar Chandra
Department of Computer Science and Engineering
OPJU University, Raigarh, India
[email protected]

Abstract— Image captioning is implemented using deep learning and NLP (Natural Language Processing), resulting in the production of a description of an image. The proposed model generates a caption for an image using a Convolutional Neural Network (CNN) together with a Recurrent Neural Network (RNN) and areas of attention. Previously, the image names were used as keys to map the images to descriptions. In order to achieve high performance, in the proposed model the image caption is based on the relationship between the areas of a picture (attention model), the words used in the caption, and the state of an RNN language model. The approach of progressive loading is employed for loading the image dataset. Further, for encoding the image dataset into a feature vector, VGG16, a pre-trained CNN, is used. The extracted feature vector is given as input to the RNN model. These image encodings are passed to a specific type of RNN model known as Long Short-Term Memory (LSTM) networks. Subsequently, the LSTM decodes the feature vector and predicts the sequence of words, resulting in the generation of descriptions or captions. The training performance is measured using the BLEU quantitative analysis metric.

Keywords— Convolutional Neural Network, image captions, Recurrent Neural Network, LSTM, attention model, encoder, decoder.

I. INTRODUCTION

Image captioning is one of the most cutting-edge problems in the study of machine learning (ML) and artificial intelligence (AI). The aim of image captioning is to describe a picture using idiomatic language. Image captioning has a variety of uses, such as helping people who are visually impaired, offering suggestions, modifying apps, creating virtual assistants, and retrieving photos more quickly. However, there are also challenges, such as the identification of many items in a picture, the discovery of their associations, the classification of objects, and the combining of words that may not follow standard language modeling. Automatic Image Captioning (AIC) is a field that is still under development. Utilising NLP, the image captioning technique combines object detection and language modeling into appropriate sentences.

Automatic caption creation from photographs has grown to be both a necessary chore and a fascinating study topic as a result of the constantly expanding multimedia content from online social networks. The method of creating an image caption involves first extracting the features of the image and then creating a textual description based on the extracted features. Further, this captioning model is extended with a new technique to attend to various image boundaries as the caption is being constructed phrase by phrase. These models have produced efficient outcomes in the area of caption creation. These outcomes demonstrate that our suggested model outperforms conventional models in terms of image captioning. Finally, qualitative and quantitative performance evaluation is used to score the quality of the information produced and assess the correctness of the produced caption.

When the target image is compared to the training images, this model uses the learned data to produce a respectable description. Using a Convolutional Neural Network (CNN) as the encoder, features are retrieved from the images. Long Short-Term Memory (LSTM) is employed to decode the description of the image. Finally, the BLEU metric is used to score the quality of the information produced and assess the correctness of the produced caption.

II. RELATED WORK

The application of ML and deep learning in image processing has been an area of immense interest for many researchers in the recent era [1-6]. The scene understanding present in the image plays a vital role in developing the caption for the picture and is important in many applications (such as searching using pictures, reading stories from collections, and assisting people who are visually impaired to understand content while browsing the internet). Many different picture captioning models have been developed over the last decades [7-9]. The use of encoder-decoder models for image captioning has recently received a lot of attention [10-13]. In its ordinary form, a CNN converts the input image into a vectorial representation by encoding it and then uses that representation as the beginning input for an RNN. Sequentially, the RNN predicts each word in the caption given the prior word, without the need to limit the temporal dependence to a predetermined order as in techniques based on n-grams. There are various ways to input the CNN image representation into the RNN. While some authors [12, 13] just utilize it to compute the RNN's initial state, others enter it during each iteration of the RNN [11, 15].

G. Sairam et al. [16] proposed a model of caption generation for images using deep neural networks. CNN and LSTM jointly worked to create a model that could generate a caption. The authors use the CNN as the encoder, and features are retrieved from the images. The LSTM is employed to decode the description of the image.

Further, Marco Pedersoli et al. [17] worked on an attention-based paradigm called "Areas of Attention" for image captioning. The proposed method uses three interactions to model the relationships between images, captions, and an RNN. The attention model mainly focuses on particular areas according to the current input while considering the normal network.


However, Steven J. Rennie et al. [18] proposed a model using reinforcement learning for automatic caption generation of images, showing performance improvements when optimizing systems on the MSCOCO task. Self-critical sequence training is used to construct the systems. The model has also been proposed to classify WBC cells into any of the five WBC classes, and the study concluded that there should be further development in the authentication methods, including the factors that affect the authentication process.

Further, Zhilin Yang et al. [19] suggested a review network, a brand-new addition to the encoder-decoder framework. In this paper, RNN decoders are considered together with both CNN and RNN encoders. The proposed network can improve the encoder-decoder model by making the model more flexible. After each review step, the review network generates a thought vector, which is then fed to the attention model in the decoder, a departure from the standard encoder-decoder framework.

Subsequently, in order to make the adaptation of the captioning model easier, W. Zhao et al. [20] describe a cross-modal retrieval aided approach to cross-domain picture captioning that makes use of a cross-modal retrieval model to construct fictitious pairs of images and sentences in the target domain.

Later on, C. Liu et al. [21] proposed a novel transformer-based remote sensing image change captioning (RSICC) model for producing human-like language explanations of the land-cover changes in multi-temporal RS images. This study is threefold: 1) a CNN-based feature extractor creates high-level features of RS image pairs, 2) a dual-branch Transformer encoder (DTE) enhances the feature classification of the alterations, and 3) a caption decoder creates phrases expressing the differences.

III. PROPOSED MODEL

This section outlines the procedures of the hybrid image captioning model, which is shown in Fig. 1. In this research, we create a methodological model for automatically identifying the caption of an image based on the relationship between the areas of a picture (attention model), the words used in the caption, and an RNN language model's current state.

The main objective of this study is image caption generation using natural language expressions. Extracting the features in the image and predicting the priority of captions is a difficult task when compared to other image processing tasks such as identification, segmentation, and localization, as NLP is also involved in the model for identifying various features and describing them in an image. Neural network-driven approaches are the most widely used method for solving the caption generation problem because of recent developments in training neural networks, the availability of GPU computing power, and enormous datasets.

In this study, the Flickr8k images are employed as the training dataset. It contains 8091 images. The objective of this study is twofold: 1) a VGG16 CNN - LSTM RNN encoder-decoder model, and 2) applying attention for prediction and feedback. The steps are described below.

A. VGG16 CNN - LSTM RNN encoder-decoder model

Initially, VGG16 [16], a pre-trained CNN, is utilized as an encoder to extract the features from the image I and compare the target image to the training images. The CNN is a particular type of feed-forward artificial neural network that operates on image data; CNNs are deep learning network models utilized for image data processing and are mostly used for identifying objects in images [16]. Eqs. (1)-(4) regulate the whole network.

$j_t = \sigma(M_{jx} X_t + M_{jw} w_{t-1})$  (1)

Where $j_t$ is the input gate at time t and M stands for the trained parameters. The sigmoid operation, which produces values between 0 and 1 to indicate how much of each component's output should be passed to the following component, uses the variable $w_{t-1}$ to denote the module's output at time t-1 [22].

$fg_t = \sigma(M_{fgx} X_t + M_{fgw} w_{t-1})$  (2)

Where $fg_t$ stands for the forget gate, which represents the value of the forgotten cell.

$og_t = \sigma(M_{ogx} X_t + M_{ogw} w_{t-1})$  (3)

Where $og_t$ stands for the output gate, which decides whether or not to pass the cell's updated value.

Additionally, in order to decode the visual description, the extracted feature vector is fed into the LSTM (a particular type of RNN model). RNNs are a specific kind of artificial network in which coordinated phases are formed by associations between the units. The advantage of using an RNN over standard network types is that the RNN uses its memory to deal with arbitrarily organized sequences of inputs. Although traditional RNNs have this advantage, the lack of consideration for long-term interdependence is one of their limitations [10]. It is possible that in some instances the typical RNN fails due to a significant discrepancy between the relevant data and the locations in which it is required.

To overcome this issue, this study uses the LSTM (shown in Fig. 2) proposed by Hochreiter and Schmidhuber [23]. LSTM is an improved version of the RNN that can store and remember data for a longer term than the RNN and is efficient in deep learning. It works on discrete or continuous data. The predicted output, which predicts the words that will be generated, is shown in Eq. (4).

$OP_{t+1} = \mathrm{Softmax}(w_t)$  (4)

Until an end sequence (.) is encountered, the whole LSTM network is repeated. These predicted words are all included in the description of the input image. The LSTM is created in such a way that it can only predict each word after viewing the entire image and the previously generated words. The sum of the losses evoked by the images I with the proportionate captions $w_t$ is minimized during training, as shown in Eq. (5). The detailed formulas and experiments can be found in [16, 17].
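To make the role of Eqs. (1)-(4) concrete, the listing below performs one decoding step in plain NumPy. It is a schematic sketch only: the weight names (M['jx'], M['jw'], ...), the output projection V, and the cell-state update are illustrative assumptions, since the paper writes out only the three gate equations and the final softmax and refers the remaining details to [16, 17, 23].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_step(x_t, w_prev, c_prev, M, V):
    """One decoding step following Eqs. (1)-(4).

    x_t    : current input vector (word embedding, or the image feature at t = 0)
    w_prev : module output at time t-1 (hidden state)
    c_prev : cell state at t-1 (standard LSTM component, not written out in the paper)
    M      : dict of trained weight matrices, e.g. M['jx'], M['jw'], ... (assumed names)
    V      : output projection onto the vocabulary (assumed)
    """
    j_t  = sigmoid(M['jx'] @ x_t + M['jw'] @ w_prev)    # input gate, Eq. (1)
    fg_t = sigmoid(M['fgx'] @ x_t + M['fgw'] @ w_prev)  # forget gate, Eq. (2)
    og_t = sigmoid(M['ogx'] @ x_t + M['ogw'] @ w_prev)  # output gate, Eq. (3)

    # Standard LSTM cell update from Hochreiter and Schmidhuber [23];
    # the paper leaves these two lines implicit.
    c_tilde = np.tanh(M['cx'] @ x_t + M['cw'] @ w_prev)
    c_t = fg_t * c_prev + j_t * c_tilde
    w_t = og_t * np.tanh(c_t)

    # Word prediction, Eq. (4): a softmax over the vocabulary.
    op_next = softmax(V @ w_t)
    return w_t, c_t, op_next
```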

Where θ refers to all of the CNN and RNN components' parameters collectively. Due to local minima in the loss, this leads to an approximate maximum likelihood estimation.

B. Apply attention for prediction and feedback

In this step, the VGG16 CNN - LSTM RNN encoder-decoder model is extended with a framework to attend to various image regions as the caption is being constructed phase by phase [24]. Subsequently, the caption words and image regions can be connected directly. These connections are generalized from image-level captions during training, rather than through weakly-supervised object detector training. By localising the appropriate regions during testing, these associations help to improve captioning.

The benefit of employing the attention mechanism is that it increases generalisation for identifying recognisable scene elements in novel compositions by linking words to localised picture region appearances rather than global image representations.

To extract the image regions as the caption is generated word by word, the marginal distribution over the regions is used to extract the region descriptions, as shown in Eq. (6). Here, $P_{sh} \in \mathbb{R}^{n_r}$ holds all location probabilities at time t. We generate Eq. (7) by updating Eq. (3) of the VGG16 CNN - LSTM RNN encoder-decoder model with this visual representation, which is concatenated to the generated word in the feedback signal of the state.

Finally, the BLEU metric technique is used to score the quality of the information generated and assess the correctness of the generated caption.
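A rough sketch of the attention step described above is given below. Because the exact forms of Eqs. (6) and (7) are deferred to [16, 17], the bilinear scoring used here (parameters W_r and W_h) is an assumption made purely for illustration; what matters is the shape of the computation: one probability per region, followed by an attention-weighted visual vector that is concatenated into the feedback signal of the state.

```python
import numpy as np

def attend_regions(region_feats, w_prev, W_r, W_h):
    """Schematic region attention at one time step.

    region_feats : (nr, d) matrix of nr localized region descriptors from the CNN
    w_prev       : RNN state / previous output used to score the regions
    W_r, W_h     : assumed scoring parameters (the exact formulas of
                   Eqs. (6)-(7) are deferred to [16, 17])
    """
    scores = region_feats @ (W_r @ (W_h @ w_prev))   # one score per region
    p = np.exp(scores - scores.max())
    p = p / p.sum()                                  # P in R^nr, location probabilities
    context = p @ region_feats                       # attention-weighted visual vector
    return p, context

# The context vector is then concatenated to the embedding of the previously
# generated word and fed back into the LSTM state update (the modified Eq. (3)).
```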

Fig. 1. Proposed Model.

Fig. 2. Architecture of LSTM layer.
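For readers who want to connect Fig. 1 to code, the following Keras sketch builds the base encoder-decoder of Section III-A in the common "merge" form used for Flickr8k: a 4096-dimensional VGG16 feature and a partial caption are combined, and the next word is predicted. The layer sizes, vocabulary size, and maximum caption length are assumed values for illustration and are not reported in the paper; the attention extension of Section III-B is omitted here.

```python
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from tensorflow.keras.models import Model

vocab_size = 7579   # assumed Flickr8k vocabulary size after cleaning
max_length = 34     # assumed maximum caption length in tokens

# Image-feature branch: 4096-d VGG16 fc2 vector -> 256-d representation.
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Language branch: partial caption -> embedding -> LSTM state.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Decoder: merge the two branches and predict the next word over the vocabulary.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```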

IV. EXPERIMENTAL RESULTS AND DISCUSSION

In this section, we evaluate the relative value of the various model elements, the efficiency of the various attentional areas, and the impact of jointly adjusting the CNN and RNN elements. The Flickr8k image dataset, which contains 8091 images, is used in the project. The proposed method is implemented in Python with the Keras, TensorFlow, and Matplotlib libraries.

Initially, the drive is mounted for data collection and the images are loaded into a folder in order to make data loading easier while training; the dataset is also verified for accurate contents (shown in Fig. 3). Next, features are extracted from each image. Given a directory name, the function extract_features (shown in Fig. 4) extracts the features of an image by loading the images into the VGG16 architecture. A 1-dimensional vector with 4,096 elements makes up each image's characteristics. The function returns the image features mapped to the image descriptions.

Then the already tokenized description text is cleaned in order to cut down on the number of words in the text. The clean_descriptions function goes through each description and cleans the wording after receiving the dictionary of image identifiers to descriptions. Further, the images are mapped to their respective descriptions. Since the image and description are mapped in a dictionary, the dictionary-mapped format is converted into a list format in order to load the data into the model, and the RNN is fed with the sequence of data created previously; this makes it easier to load data into the model for processing the image datasets with the description text for caption generation. Further, an attention-based model is used for prediction and feedback (all experiments can be found in [17]). Finally, the model is fitted with the training data for 20 epochs, as shown in Fig. 5. The model is trained for 20 epochs with a loss of 20.2% and an accuracy of 85%; the losses and the caption losses calculated by Eq. (5) are represented in Fig. 6. The sum of the losses evoked by the images I (x-axis) with the proportionate captions w_t (y-axis) is minimized during training.
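The cleaning and sequence-building steps described above can be sketched as follows. The function name clean_descriptions mirrors the one mentioned in the text, while create_sequences is an assumed helper; the exact token filters and the shape of the training pairs follow the standard Flickr8k preparation and are not code taken from the paper.

```python
import string
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def clean_descriptions(descriptions):
    """Lower-case, strip punctuation, and drop one-letter / numeric tokens in place."""
    table = str.maketrans('', '', string.punctuation)
    for key, desc_list in descriptions.items():
        for i, desc in enumerate(desc_list):
            tokens = desc.lower().translate(table).split()
            tokens = [w for w in tokens if len(w) > 1 and w.isalpha()]
            desc_list[i] = ' '.join(tokens)

def create_sequences(tokenizer, max_length, desc_list, photo_feature, vocab_size):
    """Turn one image's captions into (photo, partial caption) -> next-word pairs."""
    X1, X2, y = [], [], []
    for desc in desc_list:
        seq = tokenizer.texts_to_sequences([desc])[0]
        for i in range(1, len(seq)):
            in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
            out_seq = to_categorical([seq[i]], num_classes=vocab_size)[0]
            X1.append(photo_feature)
            X2.append(in_seq)
            y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)
```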

Fig. 3. Loading and verifying the datasets.
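A possible form of the loading and verification step in Fig. 3 is sketched below, assuming the standard Flickr8k layout (a folder of .jpg files and a Flickr8k.token.txt file with tab-separated "image_id#n  caption" lines); the paths are placeholders, and mounting a Google Drive, as the text suggests, would simply precede this snippet.

```python
import os

image_dir = 'Flickr8k_Dataset/Flicker8k_Dataset'      # placeholder path
caption_file = 'Flickr8k_text/Flickr8k.token.txt'      # placeholder path

image_files = [f for f in os.listdir(image_dir) if f.endswith('.jpg')]
print('Images found:', len(image_files))               # expected: 8091

# Build the image-id -> list-of-captions dictionary used in later steps.
descriptions = {}
with open(caption_file, encoding='utf-8') as fh:
    for line in fh:
        image_id, caption = line.strip().split('\t')
        image_id = image_id.split('.')[0]
        descriptions.setdefault(image_id, []).append(caption)
print('Images with captions:', len(descriptions))
```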

Fig. 4. Extracting features from an image.
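The extract_features function of Fig. 4 can be approximated as below: VGG16 is loaded with its classification head, the final 1000-way layer is dropped so that the 4096-dimensional fc2 output is returned, and every image in the directory is mapped to that vector. This is a hedged reconstruction of the described behaviour, not the authors' exact code.

```python
import os
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

def extract_features(directory):
    """Map each image file to a 4096-d feature vector from VGG16's fc2 layer."""
    base = VGG16()
    # Drop the final 1000-way classification layer, keep the 4096-d fc2 output.
    model = Model(inputs=base.inputs, outputs=base.layers[-2].output)
    features = {}
    for name in os.listdir(directory):
        image = load_img(os.path.join(directory, name), target_size=(224, 224))
        image = img_to_array(image)
        image = preprocess_input(np.expand_dims(image, axis=0))
        features[name.split('.')[0]] = model.predict(image, verbose=0)
    return features
```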

Fig. 5. Fitting the model with the training data for 20 epochs.
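A plausible shape for the training step in Fig. 5, with the progressive loading mentioned earlier expressed as a Python generator, is shown below. The per-epoch checkpointing and the use of one image per step are assumptions; descriptions, features, tokenizer, max_length, vocab_size, create_sequences, and model are taken from the earlier sketches.

```python
def data_generator(descriptions, features, tokenizer, max_length, vocab_size):
    """Yield training batches one image at a time (progressive loading)."""
    while True:
        for image_id, desc_list in descriptions.items():
            photo = features[image_id][0]
            X1, X2, y = create_sequences(tokenizer, max_length, desc_list,
                                         photo, vocab_size)
            yield [X1, X2], y

epochs = 20
steps = len(descriptions)            # one image per step (assumed)
for i in range(epochs):
    generator = data_generator(descriptions, features, tokenizer,
                               max_length, vocab_size)
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save('model_' + str(i) + '.h5')   # keep a checkpoint per epoch (assumed)
```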

Fig. 6. (a) The training losses and (b) the caption losses.
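Finally, caption generation and the BLEU scoring used for evaluation can be sketched as follows. The greedy word-by-word loop and the 'startseq'/'endseq' markers are assumptions in line with common Flickr8k practice, and NLTK's corpus_bleu is one possible BLEU implementation; the paper does not state which one it uses.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.translate.bleu_score import corpus_bleu

def generate_desc(model, tokenizer, photo, max_length):
    """Greedy decoding: predict one word at a time until 'endseq' or max_length.

    photo : a (1, 4096) feature vector produced by extract_features.
    """
    in_text = 'startseq'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = np.argmax(model.predict([photo, seq], verbose=0))
        word = tokenizer.index_word.get(yhat)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'endseq':
            break
    return in_text

# BLEU over a test split: `actual` holds the tokenized human captions per image,
# `predicted` the tokenized generated captions.
# print('BLEU-1:', corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
# print('BLEU-4:', corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
```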

V. CONCLUSION

In this research, a hybrid model based on a CNN and an LSTM with an attention network is developed for automatic image captioning. In the suggested paradigm, an encoder-decoder architecture was adopted. The CNN serves as the encoder in this process, converting the image into a vector feature representation. Then, the corresponding sentence is produced by an LSTM model (selected as the decoder). To enhance performance, the datasets are loaded into the model using the progressive loading technique. In order to improve the accuracy, a method called the "attention model" is used, which increases the accuracy and is also capable of taking video input and generating captions.

REFERENCES

[1] L. Panigrahi, K. Verma, and B. K. Singh, "Hybrid segmentation method based on multi-scale Gaussian kernel fuzzy clustering with spatial bias correction and region-scalable fitting for breast US images," IET Comput. Vis., vol. 12, pp. 1067–1077, 2018.
[2] B. K. Singh, K. Verma, L. Panigrahi, and A. S. Thoke, "Integrating radiologist feedback with computer aided diagnostic systems for breast cancer risk prediction in ultrasonic images: An experimental investigation in machine learning paradigm," Expert Syst. Appl., vol. 90, pp. 209–223, 2017.
[3] L. Panigrahi, K. Verma, and B. K. Singh, "An enhancement in automatic seed selection in breast cancer ultrasound images using texture features," in 2016 Int. Conf. Adv. Comput. Commun. Informatics, IEEE, pp. 1096–1102, 2016.
[4] L. Panigrahi, K. Verma, and B. K. Singh, "Evaluation of Image Features Within and Surrounding Lesion Region for Risk Stratification in Breast Ultrasound Images," IETE J. Res., pp. 1–12, 2019.
[5] L. Panigrahi, K. Verma, and B. K. Singh, "Ultrasound image segmentation using a novel multi-scale Gaussian kernel fuzzy clustering and multi-scale vector field convolution," Expert Systems with Applications, vol. 115, pp. 486–498, 2019.
[6] Y. Bafna, K. Verma, L. Panigrahi, and S. P. Sahu, "Automated boundary detection of breast cancer in ultrasound images using watershed algorithm," in Ambient Communications and Computer Systems, Springer, Singapore, pp. 729–738, 2018.
[7] Z. Wang, S. Shi, Z. Zhai, Y. Wu, and R. Yang, "ArCo: Attention-reinforced transformer with contrastive learning for image captioning," Image and Vision Computing, vol. 128, pp. 104570, 2022.
[8] Z. Zhang, H. Zhang, J. Wang, Z. Sun, and Z. Yang, "Generating news image captions with semantic discourse extraction and contrastive style-coherent learning," Computers and Electrical Engineering, vol. 104, pp. 108429, 2022.
[9] N. Hu, C. Fan, Y. Ming, and F. Feng, "MAENet: A Novel Multi-head Association Attention Enhancement Network for Completing Intra-modal Interaction in Image Captioning," Neurocomputing, 2022.
[10] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in NIPS, 2015.
[11] J. Donahue, L. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in CVPR, 2015.
[12] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in CVPR, 2015.
[13] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in CVPR, 2015.
[14] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in ICML, 2015.
[15] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," ICLR, 2015.
[16] G. Sairam, M. Mandha, Prashanth, and P. Swetha, "Image Captioning using CNN and LSTM," in 4th Smart Cities Symposium (SCS 2021), pp. 274–277, 2021.
[17] M. Pedersoli, T. Lucas, C. Schmid, and J. Verbeek, "Areas of attention for image captioning," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1242–1250, 2017.
[18] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical sequence training for image captioning," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1179–1195, 2017.
[19] Z. Yang, Y. Yuan, Y. Wu, W. W. Cohen, and R. R. Salakhutdinov, "Review networks for caption generation," in NIPS, 2016.
[20] W. Zhao, X. Wu, and J. Luo, "Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation," IEEE Transactions on Image Processing, vol. 30, pp. 1180–1192, 2021.
[21] C. Liu, R. Zhao, H. Chen, Z. Zou, and Z. Shi, "Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset," IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–20, 2022.
[22] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[23] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[24] V. Tiwari, K. Bapat, K. R. Shrimali, S. K. Singh, B. Tiwari, S. Jain, and H. K. Sharma, "Automatic generation of chest x-ray medical imaging reports using LSTM-CNN," in Proceedings of the International Conference on Data Science, Machine Learning and Artificial Intelligence, pp. 80–85, 2021.

