
VISUAL IMAGE CAPTION GENERATOR

USING DEEP LEARNING

A PROJECT REPORT
Submitted by

MANIKANDAN
(Reg.No. 0122127038)

In partial fulfillment of the requirements


for the award of the degree

of

BACHELOR OF COMPUTER APPLICATION

IN

Computer Science

Under the Guidance of

VIGNESHNARTHI, M.Sc., M.Phil., Ph.D.

DEPARTMENT OF COMPUTER SCIENCE


ALAGAPPA GOVERNMENT ARTS COLLEGE
(Grade-I college and Re-accredited with “B”
Grade by NAAC)
KARAIKUDI – 630 003.

March 2025
ABSTRACT

The combination of computer vision and natural language processing in artificial intelligence has attracted considerable research interest in recent years, driven by the advent of deep learning. Image captioning automatically describes the content of a photograph in English: the computer learns to interpret the visual information of an image and express it in one or more sentences. Generating meaningful descriptions of high-level image semantics requires the ability to analyze the state and properties of objects and the relationships between them. In this work, we apply CNN-LSTM architectural models to image captioning in order to detect objects and inform users through text messages. To identify objects correctly, the input image is first converted to grayscale and then processed by a Convolutional Neural Network (CNN). The COCO 2017 dataset is used for training. The proposed method is also intended to be extended for blind and low-vision users by converting the generated captions into speech messages, helping them reach their full potential. In this project, we follow the key concepts and standard processes of image captioning and develop a generative CNN-LSTM model that aims to outperform human baselines.
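The grayscale preprocessing step described above can be illustrated with a short Python sketch. This is a minimal example assuming OpenCV and NumPy are available; the file name and the 224x224 input size are illustrative choices, not values taken from this report.

import cv2
import numpy as np

def preprocess_image(path, size=(224, 224)):
    """Load an image, reduce it to grayscale, and shape it for a CNN."""
    image = cv2.imread(path)                          # load the image as a BGR array
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # reduce to grayscale, as described above
    resized = cv2.resize(gray, size)                  # resize to the CNN input resolution
    normalized = resized.astype(np.float32) / 255.0   # scale pixel values to [0, 1]
    # add batch and channel axes so the array matches a typical CNN input shape
    return normalized[np.newaxis, ..., np.newaxis]

batch = preprocess_image("example.jpg")               # "example.jpg" is a placeholder path
print(batch.shape)                                    # (1, 224, 224, 1)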
EXISTING SYSTEM

The existing systems for image captioning primarily rely on traditional machine learning
models and rule-based algorithms, which lack the capability to generate meaningful captions
due to limited contextual understanding of images. These systems focus on identifying
objects but fail to analyze their relationships and interactions, resulting in generic and less
descriptive outputs. Additionally, they rely heavily on handcrafted features and basic natural
language processing techniques, making them inefficient and unsuitable for diverse datasets
like COCO 2017. Moreover, these systems are not designed with accessibility features,
limiting their usability for visually impaired individuals.

DISADVANTAGES

● Limited contextual understanding of images, leading to generic and less meaningful captions.
● Focuses only on object detection without analyzing relationships or interactions between objects.
● Heavy reliance on handcrafted features and traditional rule-based algorithms, reducing efficiency.
● Struggles with diverse and complex datasets like COCO 2017, limiting scalability.
● Lacks real-time processing capabilities for practical applications.
● Does not include accessibility features, making it unsuitable for visually impaired individuals.
PROPOSED SYSTEM

The proposed system addresses these limitations by utilizing a deep learning approach that
combines Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM)
networks. This system analyzes the context, properties, and relationships between objects in
an image to generate high-quality captions in natural language. By training on the COCO
2017 dataset, it adapts to diverse real-world scenarios, starting with feature extraction using
a CNN and passing the results to an LSTM for caption generation. Additionally, the system
incorporates speech generation to convert captions into audio, making it accessible to
visually impaired users. This innovative approach aims to provide meaningful descriptions
while outperforming human baselines in accuracy and efficiency, offering an inclusive and
intelligent solution.
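A minimal sketch of such a CNN-LSTM "merge" architecture, assuming TensorFlow/Keras, is shown below. The vocabulary size, maximum caption length, feature dimension, and layer sizes are placeholder values, not figures taken from this report.

import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size = 8000   # assumed vocabulary size after tokenizing the captions
max_length = 40     # assumed maximum caption length in tokens
feature_dim = 2048  # assumed size of the CNN feature vector per image

# Image branch: a dense projection of precomputed CNN features
image_input = layers.Input(shape=(feature_dim,))
image_embedding = layers.Dense(256, activation="relu")(layers.Dropout(0.5)(image_input))

# Text branch: embed the partial caption and run it through an LSTM
caption_input = layers.Input(shape=(max_length,))
caption_embedding = layers.Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_features = layers.LSTM(256)(layers.Dropout(0.5)(caption_embedding))

# Merge both branches and predict the next word of the caption
merged = layers.add([image_embedding, caption_features])
hidden = layers.Dense(256, activation="relu")(merged)
output = layers.Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()

At inference time, the caption is built word by word: the image features and the words generated so far are fed back into the model until an end-of-sequence token is produced or the maximum length is reached.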

ADVANTAGES

● Recommendations in editing applications
● Assistance for visually impaired users
● Social media posts
● Self-driving cars
● Robotics
● Easy to implement and connect to new data sources
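For the accessibility use case listed above, the generated caption can be converted into audio with a text-to-speech library. A minimal sketch follows, assuming the gTTS package is used for speech synthesis (the report does not name a specific speech library); the caption text and output file name are illustrative.

from gtts import gTTS

caption = "a dog is running across a grassy field"  # placeholder caption produced by the model
tts = gTTS(text=caption, lang="en")                 # synthesize the caption as English speech
tts.save("caption_audio.mp3")                       # save audio that can be played back to the user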
1. HARDWARE REQUIREMENTS

● System: i3 processor
● Hard disk: 500 GB
● Monitor: 15" LED
● Input devices: keyboard, mouse
● RAM: 4 GB

2. SOFTWARE REQUIREMENTS

● Platform: Google Colab
● Coding language: Python
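Since the project runs on Google Colab with Python, the environment can be prepared in a short setup cell. This is a hedged sketch; the exact package list is an assumption based on what a CNN-LSTM captioning pipeline typically needs and is not specified in this report.

# Colab notebook cell: shell commands are prefixed with "!"
!pip install -q tensorflow pillow gTTS pycocotools

import tensorflow as tf
print(tf.__version__)                                  # confirm the deep learning framework loads
print(tf.config.list_physical_devices("GPU"))          # check whether a Colab GPU is available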

3. ALGORITHMS

● Convolutional Neural Network (CNN)
● Long Short-Term Memory (LSTM)

4. METHODOLOGY

● Import the required libraries.
● Upload the COCO (Common Objects in Context) 2017 dataset and perform data preprocessing.
● Apply a CNN to identify the objects in each image.
● Preprocess and tokenize the captions.
● Use an LSTM to predict the next word of the sentence.
● Build a data generator (see the sketch after this list).
● View images with their generated captions.
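The caption preprocessing, tokenization, and data-generator steps above can be sketched as follows, assuming TensorFlow/Keras. The two example captions and the random feature vectors are placeholders standing in for the COCO 2017 annotations and the CNN features.

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Placeholder captions; real captions would come from the COCO 2017 annotations
captions = ["startseq a man rides a horse endseq",
            "startseq two dogs play in the snow endseq"]

# Tokenize the captions so each word maps to an integer index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1
max_length = max(len(c.split()) for c in captions)

def data_generator(captions, features):
    """Yield (image feature, partial caption) inputs with the next word as the target."""
    while True:
        for caption, feature in zip(captions, features):
            seq = tokenizer.texts_to_sequences([caption])[0]
            for i in range(1, len(seq)):
                in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]  # words seen so far
                out_word = seq[i]                                        # next word to predict
                yield (feature, in_seq), out_word

# Random vectors stand in for CNN feature vectors in this sketch
dummy_features = [np.random.rand(2048) for _ in captions]
generator = data_generator(captions, dummy_features)
print(next(generator))

The integer target returned by the generator matches the sparse categorical cross-entropy loss used in the model sketch above.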
THANK YOU
