
Image Captioning
KOLLA TEJA
2K19CSUN04013
VAIBHAV SINGH
2K19CSUN04029
BTECH CSE 4 DSML
INTRODUCTION

• What do you see in this picture?

• Well, some of you might say “A giraffe is eating leaves”, some may say “Giraffes roaming around the forest”, and yet others might say “Two tallest living terrestrial animals”.

• All of these captions are relevant for this image, and there may be others as well. The point I want to make is that it is so easy for us, as human beings, to glance at a picture and describe it in appropriate language. Even a 5-year-old could do this with ease.

• But can AI generate captions by taking an image as input?


MOTIVATION

We must first understand how important this problem is in real-world scenarios. Let’s see a few applications where a solution to this problem can be very useful.

Aid to the blind: We can create a product for the blind that guides them while travelling on roads without the support of anyone else. We can do this by first converting the scene into text and then the text into voice. Both are now well-known applications of deep learning.

Self-driving cars: Automatic driving is one of the biggest challenges, and if we can properly caption the scene around the car, it can give a boost to the self-driving system.
DATA COLLECTION
• There are many open-source datasets available for this problem, such as Flickr8k (containing 8k images), Flickr30k (containing 30k images), MS COCO (containing 180k images), etc.

• In this project we used the MS COCO dataset. It consists of images and annotations, and each image has 5 caption annotations associated with it.
EXAMPLE
TRAINING SET
• Group all captions together that have the same image ID.

• Select the first 6,000 image paths from the shuffled set.

• Each image ID has approximately 5 captions associated with it.

• That leads to roughly 30,000 training examples (a sketch of these steps follows below).
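
A minimal sketch of these steps, loosely following the TensorFlow image-captioning tutorial cited in the references; the annotation-file and image-folder paths and the `<start>`/`<end>` markers are assumptions, not part of the slides:

```python
import collections
import json
import random

# Assumed local paths to the MS COCO 2014 captions file and image folder.
annotation_file = 'annotations/captions_train2014.json'
image_folder = 'train2014/'

with open(annotation_file, 'r') as f:
    annotations = json.load(f)

# Group all captions that share the same image ID.
image_path_to_caption = collections.defaultdict(list)
for ann in annotations['annotations']:
    caption = '<start> ' + ann['caption'] + ' <end>'
    image_path = image_folder + 'COCO_train2014_%012d.jpg' % ann['image_id']
    image_path_to_caption[image_path].append(caption)

# Shuffle the image paths and keep the first 6,000.
image_paths = list(image_path_to_caption.keys())
random.shuffle(image_paths)
train_image_paths = image_paths[:6000]

# With ~5 captions per image this yields roughly 30,000 (image, caption) pairs.
train_captions, img_name_vector = [], []
for path in train_image_paths:
    caps = image_path_to_caption[path]
    train_captions.extend(caps)
    img_name_vector.extend([path] * len(caps))
```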
ARCHITECTURE
Preprocess the images using InceptionV3

You will use InceptionV3 (which is pretrained on ImageNet) to encode each image, extracting features from its last convolutional layer.

Initialize InceptionV3 and load the pretrained ImageNet weights

You'll create a tf.keras model where the output layer is the last convolutional layer in the InceptionV3 architecture. The shape of the output of this layer is 8x8x2048 (see the sketch below).
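
A minimal sketch of this step, following the TensorFlow tutorial cited in the references; the 299x299 resize is the input size InceptionV3 expects:

```python
import tensorflow as tf

# Load InceptionV3 pretrained on ImageNet, without its classification head,
# so the model's output is the last convolutional layer
# (8x8x2048 for a 299x299 input).
image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
image_features_extract_model = tf.keras.Model(image_model.input,
                                              image_model.output)

def load_image(image_path):
    """Read an image from disk and preprocess it for InceptionV3."""
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path
```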
Caching the features extracted from InceptionV3

You will pre-process each image with InceptionV3 and cache the output to disk. Caching the output in RAM would be faster but also memory intensive, requiring 8 * 8 * 2048 floats per image (see the sketch below).
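
A hedged sketch of the caching step, reusing `load_image`, `image_features_extract_model`, and `img_name_vector` from the earlier sketches; the batch size of 16 is an arbitrary assumption:

```python
import numpy as np
import tensorflow as tf

# Run each unique training image through InceptionV3 once and cache the
# (64, 2048) feature map to disk as a .npy file next to the image.
encode_train = sorted(set(img_name_vector))

image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(
    load_image, num_parallel_calls=tf.data.AUTOTUNE).batch(16)

for img_batch, path_batch in image_dataset:
    batch_features = image_features_extract_model(img_batch)  # (b, 8, 8, 2048)
    batch_features = tf.reshape(
        batch_features, (batch_features.shape[0], -1, batch_features.shape[3]))
    for feat, path in zip(batch_features, path_batch):
        np.save(path.numpy().decode('utf-8'), feat.numpy())
```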

Preprocess and tokenize the captions

You'll tokenize the captions (for example, by splitting on spaces). This gives us a vocabulary of all the unique words in the data (for example, "surfing", "football", and so on).
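
A minimal sketch of the tokenization step using `tf.keras.preprocessing.text.Tokenizer`, as in the referenced TensorFlow tutorial; the 5,000-word vocabulary cap is an assumption, and `train_captions` comes from the earlier grouping sketch:

```python
import tensorflow as tf

# Keep only the 5,000 most frequent words; everything else maps to <unk>.
top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=top_k, oov_token='<unk>',
    filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~')
tokenizer.fit_on_texts(train_captions)
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

# Convert each caption to a sequence of integer IDs and pad to equal length.
train_seqs = tokenizer.texts_to_sequences(train_captions)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs,
                                                           padding='post')
max_length = max(len(seq) for seq in train_seqs)
```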
Split the data into training and testing
• Create training and validation sets using a random 80-20 split.

• Create a tf.data dataset to use for training our model (sketched below).

• We have to tune the hyperparameters to get better results.
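
A sketch of the split and the tf.data pipeline, assuming the cached .npy features and the `img_name_vector`/`cap_vector` arrays from the earlier sketches; the batch and buffer sizes are assumed hyperparameters, and this simple split is over caption pairs (the tutorial splits by image so the same image never appears in both sets):

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# 80-20 random split into training and validation sets.
img_name_train, img_name_val, cap_train, cap_val = train_test_split(
    img_name_vector, cap_vector, test_size=0.2, random_state=0)

BATCH_SIZE = 64    # assumed hyperparameters; tune for your hardware
BUFFER_SIZE = 1000

def map_func(img_name, cap):
    """Load the cached InceptionV3 features saved earlier as .npy files."""
    img_tensor = np.load(img_name.decode('utf-8') + '.npy')
    return img_tensor, cap

dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))
dataset = dataset.map(
    lambda item1, item2: tf.numpy_function(
        map_func, [item1, item2], [tf.float32, tf.int32]),
    num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
```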


MODEL
• We used an attention-based model, which enables us to see which parts of the image the model focuses on as it generates a caption.

• In this example, you extract the features from the last convolutional layer of InceptionV3, giving a tensor of shape (8, 8, 2048).

• You squash that to a shape of (64, 2048).

• This tensor is then passed through the CNN encoder (which consists of a single fully connected layer).

• The RNN (here a GRU) attends over the image to predict the next word (see the sketch below).
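
A sketch of the encoder, attention module, and GRU decoder described above, following the referenced TensorFlow tutorial; `embedding_dim` and `units` are hyperparameters you would choose (e.g. 256 and 512):

```python
import tensorflow as tf

class CNN_Encoder(tf.keras.Model):
    """A single fully connected layer applied to the (64, 2048) image features."""
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        return tf.nn.relu(self.fc(x))          # (batch, 64, embedding_dim)

class BahdanauAttention(tf.keras.Model):
    """Additive attention over the 64 image regions."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        hidden_with_time = tf.expand_dims(hidden, 1)               # (batch, 1, units)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time))
        attention_weights = tf.nn.softmax(self.V(score), axis=1)   # (batch, 64, 1)
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights

class RNN_Decoder(tf.keras.Model):
    """GRU decoder that attends over the encoded image at every step."""
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, x, features, hidden):
        # Attend over the 64 image regions given the current hidden state.
        context_vector, attention_weights = self.attention(features, hidden)
        x = self.embedding(x)                                      # (batch, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        x = self.fc2(x)                                            # (batch, vocab_size)
        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
```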
Checkpoint / epoch loss plot
[Training loss curve shown on the slide]
Caption evaluation
• The evaluate function is similar to the training loop, except you don't use teacher forcing here. The input to the decoder at each time step is its previous prediction, along with the hidden state and the encoder output.

• Stop predicting when the model predicts the end token.

• Store the attention weights for every time step (see the sketch below).
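
A sketch of such an evaluate function, reusing the encoder, decoder, tokenizer, and feature extractor from the earlier sketches; it uses greedy argmax decoding (the referenced tutorial samples from the predicted distribution instead), and `max_length=50` is an assumed cap:

```python
import numpy as np
import tensorflow as tf

def evaluate(image, max_length=50):
    """Generate a caption for one image without teacher forcing."""
    attention_plot = np.zeros((max_length, 64))
    hidden = decoder.reset_state(batch_size=1)

    # Encode the input image exactly as during training.
    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(
        img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []
    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)
        attention_plot[i] = tf.reshape(attention_weights, (-1,)).numpy()

        # Feed the model's own previous prediction back in at the next step.
        predicted_id = int(tf.argmax(predictions[0]))
        word = tokenizer.index_word[predicted_id]
        if word == '<end>':            # stop at the end token
            return result, attention_plot
        result.append(word)
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, attention_plot
```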


Caption generation
[The slide shows an example input image and its generated caption.]
References
• https://arxiv.org/pdf/1502.03044.pdf

• https://paperswithcode.com/task/text-generation

• https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/

• https://www.tensorflow.org/tutorials/text/image_captioning
