Uploaded by Thangaaruna M

International Journal of Computational Vision and Robotics

Image Caption Generator


Abstract:
Image captioning is the task of describing the contents of an image. In this paper, we generate captions for a user-supplied image using CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) models, deep learning models used to generate a description for a given image. A Convolutional Neural Network (ResNet) is used as an encoder that extracts the image features, and a Recurrent Neural Network (Long Short-Term Memory) is used as a decoder that produces captions for the image from the extracted features and the vocabulary that is built.
Keywords: CNN, LSTM, ResNet
Introduction
In the modern era of computing, automatically describing images has attracted interest in both computer vision and natural language processing. Image captioning is the task of generating a description of what is in an image. In the past this was a tedious task, and the generated content was often insufficient to describe the image, but advances in deep learning and neural networks have made it feasible and have increased the efficiency of caption generation. Such models are useful in image recognition, image classification, image captioning, and many other artificial intelligence applications, with major applications in assisting blind people and in self-driving cars. Essentially, the model takes an image as input and produces a caption for it.
Within deep learning there are several candidate models, such as the Inception model, the VGG model, the ResNet-LSTM model, and the traditional CNN-RNN model. In this paper we explain the model we have followed for captioning images, i.e., the ResNet-LSTM model.
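The encoder-decoder flow of the ResNet-LSTM model can be outlined with a toy sketch. The functions below are illustrative stand-ins, not the paper's trained networks: `encode_image` plays the role of ResNet (image to feature vector) and `decode_step` plays the role of the LSTM (features plus previous words to next word), with a fixed toy caption used in place of learned predictions.

```python
# Toy sketch of the ResNet-LSTM captioning loop: an encoder turns the image
# into a feature vector, and a decoder emits one word at a time until an
# end token. Both components are stand-ins, not real trained models.

def encode_image(image):
    # Stand-in for ResNet: reduce the image to a fixed-length feature vector.
    flat = [px for row in image for px in row]
    return [sum(flat) / len(flat)]  # 1-dim "feature" for illustration

def decode_step(features, prev_words):
    # Stand-in for the LSTM: pick the next word from a fixed toy sequence.
    toy_caption = ["a", "dog", "on", "grass", "<end>"]
    return toy_caption[len(prev_words)]

def generate_caption(image, max_len=20):
    features = encode_image(image)
    words = []
    while len(words) < max_len:
        word = decode_step(features, words)
        if word == "<end>":
            break
        words.append(word)
    return " ".join(words)

print(generate_caption([[0.1, 0.2], [0.3, 0.4]]))  # → a dog on grass
```

In the real model, `decode_step` would score the whole vocabulary at each position and the loop would pick the highest-scoring word (greedy decoding); the control flow, however, is the same.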
Images are among the most readily available data types on the Internet in the big data era, which has increased the need to annotate and label them. Because they must cope with huge volumes of data, image captioning systems are an illustration of big data problems. Early methods filled predefined text templates with features extracted from photos; current systems instead exploit deep learning techniques and the computational power now available.
Many contemporary deep learning approaches analyze an image by extracting feature matrices from the final convolutional layers of a pre-trained Convolutional Neural Network (CNN). This makes it easier to capture the various features of the objects and how they relate to one another in the picture, and helps to represent the image more effectively.
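As a small illustration of this idea (the 7×7×512 shape is an assumed example, not taken from the paper), the spatial feature map produced by a CNN's last convolutional layer can be collapsed into a single fixed-length descriptor by global average pooling:

```python
import numpy as np

# A pre-trained CNN's last convolutional layer yields a spatial feature map,
# e.g. 7x7 locations with 512 channels. Averaging over the spatial dimensions
# (global average pooling) gives one 512-dim descriptor for the whole image.
feature_map = np.random.rand(7, 7, 512)   # stand-in for real CNN output
features = feature_map.mean(axis=(0, 1))  # shape: (512,)
print(features.shape)  # → (512,)
```

Keeping the full 7×7 grid instead of pooling it is what lets attention-based captioners look at individual image regions while generating each word.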

This work investigates the hypothesis that exploiting such features can increase accuracy in image captioning, and that using all object features helps to accurately mimic human visual understanding of scenes. This paper presents a model that makes use of these features through a simple architecture and evaluates the results.

Researchers have long sought efficient ways to make better predictions, so we discuss a few methods for achieving good results. We use deep neural networks and machine learning techniques to build the model, which operates in two phases: feature extraction from the image using a Convolutional Neural Network (CNN), and generation of natural-language sentences based on the image using a Recurrent Neural Network (RNN). While human beings can describe images easily, a computer system needs a strong algorithm and a great deal of computational power to do so. Many attempts have been made to simplify the problem by breaking it into simpler sub-problems such as object detection, image classification, and text generation. A computer system takes input images as two-dimensional arrays, and a mapping is learned from images to captions or descriptive sentences. In recent years much attention has been drawn to the task of automatically generating captions for images; however, while new datasets often spur considerable innovation, benchmark datasets also require fast, accurate, and competitive evaluation metrics to encourage rapid progress. Automatically describing the content of a picture in properly formed English sentences is a very challenging task, but it could have an excellent impact, for example by helping visually impaired people better understand the content of images online. It is significantly harder than, for instance, the well-studied image classification or visual perception tasks that are a main focus of the computer vision community.
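The decoder side of the second phase can be made concrete with a single LSTM step. This is a minimal sketch, assuming random placeholder weights and arbitrary embedding/hidden sizes (8 and 4) rather than trained values: each step combines the previous word embedding with the carried hidden and cell states to produce new states, which is how the decoder accumulates sentence context word by word.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    # One LSTM step: the four gates are slices of a single affine transform.
    z = W @ x + U @ h + b
    n = h.size
    i = 1 / (1 + np.exp(-z[:n]))       # input gate
    f = 1 / (1 + np.exp(-z[n:2*n]))    # forget gate
    o = 1 / (1 + np.exp(-z[2*n:3*n]))  # output gate
    g = np.tanh(z[3*n:])               # candidate cell state
    c_new = f * c + i * g              # blend old memory with new candidate
    h_new = o * np.tanh(c_new)         # expose gated memory as hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 8, 4                     # word-embedding and hidden sizes
W = rng.standard_normal((4 * n_hid, n_in))
U = rng.standard_normal((4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
x = rng.standard_normal(n_in)          # stand-in word embedding
h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # → (4,)
```

In a full captioner, `h` would be projected onto the vocabulary at each step to score the next word, and the loop would repeat until an end token is produced.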
