Visual Image Caption Generator
A PROJECT REPORT
Submitted by
MANIKANDAN
(Reg.No. 0122127038)
of
in Computer Science
VIGNESHNARTHI, M.Sc., M.Phil., Ph.D.
March 2025
ABSTRACT
The combination of computer vision and natural language processing in artificial intelligence
has attracted considerable research interest in recent years, driven by the advent of deep
learning. Image captioning automatically describes the content of a photograph in English:
when a picture is captioned, the computer learns to interpret the visual information of the
image and express it in one or more sentences. Generating a meaningful description of
high-level image semantics requires the ability to analyze the state, properties, and
relationships of the objects in the scene. In this work, we apply CNN-LSTM architectural
models to image captioning with the goal of detecting objects and reporting them to users as
text messages. To identify objects correctly, the input image is first reduced to grayscale and
then processed by a Convolutional Neural Network (CNN). The COCO 2017 dataset was used.
The proposed method is intended to be extended for blind and visually impaired users by
converting the generated captions into speech messages, helping them reach their full
potential. In this project, we follow the key concepts of image captioning and its standard
pipeline, and develop a generative CNN-LSTM model that outperforms human baselines.
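Although the report does not include code, the grayscale preprocessing step described above can be illustrated with a minimal sketch. The example below assumes OpenCV and NumPy are available; the function name, file path, and target size are illustrative placeholders rather than values taken from this report.

```python
# Minimal sketch of the preprocessing step: reduce the input image to
# grayscale before it is passed to the CNN. Path and size are placeholders.
import cv2
import numpy as np

def preprocess_image(path, size=(224, 224)):
    """Load an image, convert it to grayscale, and scale pixel values to [0, 1]."""
    image = cv2.imread(path)                        # BGR image as a NumPy array
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # reduce to a single channel
    gray = cv2.resize(gray, size)                   # fixed input size for the CNN
    gray = gray.astype(np.float32) / 255.0          # normalise intensities
    return gray[..., np.newaxis]                    # add channel axis: (H, W, 1)

# Example usage: features = preprocess_image("example.jpg")
```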
EXISTING SYSTEM
The existing systems for image captioning primarily rely on traditional machine learning
models and rule-based algorithms, which lack the capability to generate meaningful captions
due to limited contextual understanding of images. These systems focus on identifying
objects but fail to analyze their relationships and interactions, resulting in generic and less
descriptive outputs. Additionally, they rely heavily on handcrafted features and basic natural
language processing techniques, making them inefficient and unsuitable for diverse datasets
like COCO 2017. Moreover, these systems are not designed with accessibility features,
limiting their usability for visually impaired individuals.
DISADVANTAGES
● Struggles with diverse and complex datasets like COCO 2017, limiting scalability.
PROPOSED SYSTEM
The proposed system addresses these limitations by utilizing a deep learning approach that
combines Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM)
networks. This system analyzes the context, properties, and relationships between objects in
an image to generate high-quality captions in natural language. By training on the COCO
2017 dataset, it adapts to diverse real-world scenarios, starting with feature extraction using
a CNN and passing the results to an LSTM for caption generation. Additionally, the system
incorporates speech generation to convert captions into audio, making it accessible to
visually impaired users. This innovative approach aims to provide meaningful descriptions
while outperforming human baselines in accuracy and efficiency, offering an inclusive and
intelligent solution.
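The CNN-LSTM pipeline described above can be sketched in Keras. The sketch below is illustrative only: the vocabulary size, maximum caption length, feature dimension, and layer sizes are assumptions, not values specified in this report.

```python
# Minimal sketch of a CNN-LSTM (encoder-decoder) captioning model in Keras.
# A CNN feature vector and the partial caption are merged to predict the next word.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed vocabulary size after tokenising the COCO captions
max_length = 34     # assumed maximum caption length in tokens
feature_dim = 2048  # assumed size of the CNN feature vector

# Image branch: project the CNN feature vector into the decoder space.
image_input = Input(shape=(feature_dim,))
image_dense = Dense(256, activation="relu")(Dropout(0.5)(image_input))

# Text branch: embed the previously generated words and encode them with an LSTM.
caption_input = Input(shape=(max_length,))
caption_embed = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_lstm = LSTM(256)(Dropout(0.5)(caption_embed))

# Merge both branches and predict the next word of the caption.
decoder = Dense(256, activation="relu")(add([image_dense, caption_lstm]))
output = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```

For the speech output intended for visually impaired users, the generated caption string could be passed to a text-to-speech library; pyttsx3 is used here only as an example, since the report does not name a specific library.

```python
# Hedged sketch of the speech step: read a generated caption aloud.
# pyttsx3 is an assumed choice; the report does not specify a TTS library.
import pyttsx3

engine = pyttsx3.init()
engine.say("a man riding a horse on the beach")  # example caption text
engine.runAndWait()
```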
ADVANTAGES
1. HARDWARE REQUIREMENTS
● System: i3 Processor
● Hard Disk: 500 GB
● Monitor: 15" LED
● Input Devices: Keyboard, Mouse
● RAM: 4 GB
2. SOFTWARE REQUIREMENTS
3. ALGORITHMS
4. METHODOLOGY