Detection and Recognition of Objects in Image Caption Generator System: A Deep Learning Approach
Abstract—Image Caption Generator deals with generating captions for a given image. The semantic meaning of the image is captured and converted into natural language. The capturing mechanism involves a tedious task that combines both image processing and computer vision. The mechanism must detect objects and establish relationships between objects, people, and animals. The aim of this paper is to detect, recognize, and generate worthwhile captions for a given image using deep learning. A Regional Object Detector (RODe) is used for detection, recognition, and caption generation. The proposed method focuses on deep learning to further improve upon existing image caption generator systems. Experiments are conducted on the Flickr 8k dataset using the Python language to demonstrate the proposed method.

Keywords—Image, capturing, generator, regional, detector, deep learning

I. INTRODUCTION

A basic ability of human beings is the tendency to describe an image with an ample amount of information about it from just a quick glance [1]. Creating a computer system that simulates this human ability has long been a research goal in the fields of machine learning and artificial intelligence. Several research advances have been made in the past, such as the detection of objects in a given image, attribute classification, image classification, and the classification of human actions. Making a computer system detect an image and produce a description in natural language is an exigent task; such a system is called an image caption generator. Generating a caption for an image involves various tasks, such as understanding the higher levels of semantics and describing those semantics in a sentence that humans can understand. In order to understand the higher levels of semantics, the computer system must learn the relationships between the objects in a given image. Communication between human beings usually occurs through natural language, so developing a system that produces descriptions understandable by human beings is a challenging goal. There are several steps to generating captions, such as understanding the visual representation of objects, establishing relationships among the objects, and generating captions that are both linguistically and semantically correct. This paper aims at detection, recognition, and caption generation using deep learning.

The paper is organized as follows. Section 2 describes the background study of the image caption generator, Section 3 presents the proposed methodology, Section 4 deals with the experimental analysis and findings, and Section 5 concludes the paper.

II. BACKGROUND STUDY

This section describes the background study on image caption generators.

An image caption generator deals with generating a caption for a given image. A. Kojima [2] used case structure, action hierarchy, and verb patterns to generate captions of human activities in a fixed environment. P. Hede [3] proposed a method for image caption generation that uses a series of object names stored in a database of dictionaries; such a method can generate captions for fixed image content, but it fails for real-world scenarios. An image caption generator system that uses deep neural networks to generate captions was proposed in [4]. A. Farhadi [5] proposed an information-retrieval-based image captioning system, where a score is generated for every object in an image and compared against other images to generate captions. M. Hodosh [6] proposed a ranking-based image captioning system, where captions are generated with the help of a sentence-based image caption ranking system. Y. Yang [7] proposed a sentence-making strategy that employs verbs, nouns, and prepositions to build a semantic sentence; the image content is detected with trained detectors, and an English corpus is used for the estimates. R. Socher [8] proposed decision-tree-based recursive neural networks to represent the visual meaning of an image. O. Vinyals [9] proposed a generative model that combines computer vision and machine learning to generate captions for a given image. Q. You et al. [10] proposed a model of semantic attention, which deals with the semantics stored in a hidden layer of a neural network and fuses them to obtain a more semantically rich sentence.
III. PROPOSED METHODOLOGY

The proposed methodology for generating captions with the detection and recognition of objects using deep learning is shown in Fig. 1. It consists of object detection, feature extraction, a Convolutional Neural Network (CNN) for feature extraction and scene classification, a Recurrent Neural Network (RNN) for human and object attributes, an RNN encoder, and a fixed-length RNN decoder. The resultant variable-length string is passed to the fixed-length decoder, which converts it into a fixed-length descriptive sentence.
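To make the encoder-decoder pipeline concrete, the following is a minimal sketch of a CNN-RNN captioning model in Keras. The paper does not specify layer sizes or a CNN backbone, so the 4096-dimensional feature input (e.g., VGG16 fc2 features), the 256-unit layers, and the function name define_model are illustrative assumptions; the region-based detection stage (RODe) is omitted from this sketch.

```python
# A minimal sketch of a CNN-RNN encoder-decoder captioning model.
# Layer sizes and the 4096-dim CNN feature input are assumptions,
# not specifics taken from the paper.
from tensorflow.keras.layers import (Input, Dense, Dropout,
                                     Embedding, LSTM, add)
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length):
    # Image branch: pre-extracted CNN features (e.g., VGG16 fc2).
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # Sequence branch: RNN encoder over the partial caption.
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Decoder: merge both branches and predict the next word.
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
```

At inference time the decoder of such a model is applied word by word: the image features stay fixed while the partial caption grows until an end token is produced, as described in the next section.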
IV. EXPERIMENTAL ANALYSIS

The aim of this paper is to propose a deep learning method for generating captions using neural networks. This section describes the dataset and the experimental evaluation of the proposed methodology, which is carried out on the Flickr 8k dataset obtained from [11]. Of the 8000 images, only three were subjected to the proposed methodology, for simplicity, and the results were recorded. Fig. 2 shows the input image for which a caption is to be generated. Fig. 3 illustrates the caption generation process: first, the input image is passed through feature extraction using the feature_extraction function to extract the features of the image.
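The paper names the feature_extraction step but does not describe its internals. The sketch below is one plausible implementation, assuming VGG16 from Keras with the final classification layer removed, so that the penultimate 4096-dimensional activation serves as the image representation.

```python
# A plausible sketch of the feature_extraction step named in the paper.
# The choice of VGG16 and the 224x224 input size are assumptions.
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

def feature_extraction(filename):
    # Re-purpose VGG16 as a feature extractor: drop the softmax layer
    # and keep the 4096-dim output of the last fully connected layer.
    cnn = VGG16()
    cnn = Model(inputs=cnn.inputs, outputs=cnn.layers[-2].output)
    # Load and preprocess the image the way VGG16 expects.
    image = load_img(filename, target_size=(224, 224))
    image = img_to_array(image)
    image = image.reshape((1,) + image.shape)
    image = preprocess_input(image)
    # One forward pass yields the feature vector for the decoder.
    return cnn.predict(image, verbose=0)
```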
A caption is then generated using the generate_desc function, which takes the model, tokenizer, extracted image features, and maximum caption length as parameters. The proposed model accurately generated the caption "a dog running through the water" for Fig. 2. The model was also evaluated on Fig. 4 and Fig. 6, for which it accurately generated the captions shown in Fig. 5 and Fig. 7.
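Again, the paper gives only the name and parameters of generate_desc. A common way to realize it is greedy decoding, sketched below under the assumption that training captions were wrapped in startseq/endseq marker tokens; the helper word_for_id is also an assumed utility, not something the paper defines.

```python
# A hedged sketch of greedy decoding for generate_desc(model,
# tokenizer, photo, max_length); startseq/endseq markers are assumed.
from numpy import argmax
from tensorflow.keras.preprocessing.sequence import pad_sequences

def word_for_id(integer, tokenizer):
    # Reverse lookup from the tokenizer's word index.
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_desc(model, tokenizer, photo, max_length):
    in_text = 'startseq'
    for _ in range(max_length):
        # Encode the caption generated so far and pad to fixed length.
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        # Predict the next word from image features + partial caption.
        yhat = argmax(model.predict([photo, sequence], verbose=0))
        word = word_for_id(yhat, tokenizer)
        if word is None or word == 'endseq':
            break
        in_text += ' ' + word
    return in_text
```

Under these assumptions, the loop would yield a string such as "startseq dog running through the water", from which the marker tokens are stripped before the caption is displayed.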
V. CONCLUSION

REFERENCES

[1] L. Fei-Fei, A. Iyer, C. Koch, P. Perona, "What do we perceive in a glance of a real-world scene?", J. Vis. 7 (1) (2007) 1–29.
[2] A. Kojima, T. Tamura, K. Fukunaga, "Natural language description of human activities from video images based on concept hierarchy of actions", Int. J. Comput. Vis. 50 (2002) 171–184.
[3] P. Hede, P. Moellic, J. Bourgeoys, M. Joint, C. Thomas, "Automatic generation of natural language descriptions for images", in: Proceedings of Recherche d'Information Assistée par Ordinateur, 2004.
[4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, "DeCAF: a deep convolutional activation feature for generic visual recognition", in: Proceedings of the Thirty-First International Conference on Machine Learning, 2014, pp. 647–655.
[5] A. Farhadi, M. Hejrati, M.A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, D. Forsyth, "Every picture tells a story: generating sentences from images", in: Proceedings of the European Conference on Computer Vision, 2010, pp. 15–29.
[6] M. Hodosh, P. Young, J. Hockenmaier, "Framing image description as a ranking task: data, models and evaluation metrics", J. Artif. Intell. Res. 47 (2013) 853–899.
[7] Y. Yang, C.L. Teo, H. Daumé III, Y. Aloimonos, "Corpus-guided sentence generation of natural images", in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011, pp. 444–454.
[8] R. Socher, A. Karpathy, Q.V. Le, C.D. Manning, A.Y. Ng, "Grounded compositional semantics for finding and describing images with sentences", TACL 2 (2014) 207–218.
[9] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, "Show and tell: a neural image caption generator", in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[10] Q. You, H. Jin, Z. Wang, C. Fang, J. Luo, "Image captioning with semantic attention", in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4651–4659.
[11] M. Hodosh, P. Young, J. Hockenmaier, "Framing image description as a ranking task: data, models and evaluation metrics", J. Artif. Intell. Res. 47 (2013) 853–899.