1.1 Introduction:-
American Sign Language (ASL) is a predominant sign language. Since the only disability deaf and mute (D&M) people have is communication related and they cannot use spoken languages, the only way for them to communicate is through sign language. Communication is the process of exchanging thoughts and messages in various ways such as speech, signals, behaviour and visuals. D&M people use their hands to express different gestures in order to convey their ideas to other people. Gestures are nonverbally exchanged messages, and these gestures are understood with vision. This nonverbal communication of deaf and mute people is called sign language.
In our project we focus on producing a model which can recognise fingerspelling-based hand gestures in order to form a complete word by combining each gesture. The gestures we aim to train are given in the image below.
1.3 Project Modules
1.3.1 Data Acquisition
1.3.2 Data Pre-processing and Feature Extraction
1.3.3 Gesture Classification
1.3.4 Text and Speech Translation
Chapter 2 : Literature Review
2.1 Motivation:-
In a world driven by communication, the ability to express oneself is not just a
convenience—it's a basic human right. Yet, millions of individuals who are hearing or
speech impaired often struggle to be heard, to be understood, and to participate fully in
daily life.
This project was born out of a deep desire to bridge the gap between the deaf-mute
community and the hearing world. Sign language is their voice—but not everyone
understands it. My goal is to translate their signs into text and speech, giving them a
powerful tool to communicate with ease and confidence.
The idea isn't just to build a software application—it's to create a bridge. A bridge where
gestures become words, where silence turns into voice, and where inclusion replaces
isolation.
With the help of real-time gesture recognition using MediaPipe and deep learning, this
project takes a step towards true accessibility. Every sign translated is a message made
louder. Every sentence spoken from a gesture is one more step toward equality.
Through this innovation, I hope not only to use technology for good, but also to send a
message:
Technology should empower everyone—not just those who can speak or hear.
2.2 Problem Statement:-
In the context of Indian Sign Language (ISL), most existing systems are either static, slow, or work only for single-word translation without capturing the flow of natural sentences. Therefore, there is a critical need for a real-time, efficient and accurate system that can recognize hand gestures and convert them into both text and speech, enabling seamless communication for the hearing and speech impaired.
This project aims to solve this problem by leveraging MediaPipe for hand tracking,
combined with Deep Learning (LSTM) for recognizing sentence-level gestures, and finally
converting the recognized text into audible speech using Text-to-Speech (TTS) technology.
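As a rough illustration of the sequence-recognition stage described above, the sketch below builds a small LSTM classifier over sequences of MediaPipe hand keypoints. The sequence length (30 frames), the 63-value keypoint vector (21 landmarks × x, y, z) and the gesture labels are assumptions for illustration, not the project's final configuration.

```python
# Hypothetical sketch: sentence-level gesture recognition with an LSTM over
# sequences of hand keypoints. Shapes and class names are illustrative only.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

SEQ_LEN = 30        # frames per gesture clip (assumed)
N_FEATURES = 63     # 21 hand landmarks x (x, y, z) from MediaPipe
CLASSES = ["hello", "thanks", "yes", "no"]   # placeholder gesture labels

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(SEQ_LEN, N_FEATURES)),
    LSTM(64),
    Dense(32, activation="relu"),
    Dense(len(CLASSES), activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# A dummy forward pass just to show the expected input/output shapes.
dummy_batch = np.zeros((1, SEQ_LEN, N_FEATURES), dtype="float32")
probs = model.predict(dummy_batch)
print(probs.shape)   # (1, 4): one probability per gesture class
```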
2.3 Aim
• Hearing-impaired individuals face communication barriers with non-signing people.
• Sign language is not widely understood by the general population.
• These barriers create challenges in workplaces, education, healthcare, and public spaces.
2.4 Objectives:-
More than 70 million deaf people around the world use sign languages to communicate. Sign language allows them to learn, work, access services, and be included in their communities. It is hard to make everybody learn sign language, yet people with disabilities should be able to enjoy their rights on an equal basis with others.
So, the aim is to develop a user-friendly human-computer interface (HCI) where the computer understands American Sign Language. This project will help deaf and mute people by making their life easier.
The objective is to create computer software and train a model using a CNN which takes an image of an American Sign Language hand gesture, shows the corresponding sign in text format, and converts it into audio format.
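As a small illustration of the text-to-audio step mentioned above, the snippet below uses the pyttsx3 library, one common offline TTS option; the report does not mandate a specific engine, so treat this as an assumed choice.

```python
# Minimal sketch of the text-to-speech step using pyttsx3 (an offline TTS
# engine). The recognized text here is a hard-coded placeholder.
import pyttsx3

def speak(text: str) -> None:
    engine = pyttsx3.init()   # initialise the TTS engine
    engine.say(text)          # queue the recognized text
    engine.runAndWait()       # block until speech is finished

if __name__ == "__main__":
    speak("A")        # e.g. a single recognized ASL letter
    speak("HELLO")    # or a word built from several letters
```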
2.5 Literature Review:-
Mukesh Patel School of Technology and Management Engineering, JVPD Scheme, Bhaktivedanta Marg, Vile Parle (W), Mumbai-400 056 (2013)
In this proposed system, they intend to recognize some very basic elements of sign language and to translate them to text. Firstly, the video is captured frame by frame; the captured video is processed and the appropriate image is extracted. This retrieved image is further processed using BLOB analysis and sent to the statistical database, where the captured image is compared with the ones saved in the database, and the matched image is used to determine the performed alphabet sign. They implement only American Sign Language fingerspellings and construct words and sentences with them. With the proposed method, they found that the probability of obtaining the desired output is around 93%, which is sufficient to make it suitable for use on a larger scale for the intended purpose.
figure 2.3 Translate the sign language
Sign Language to Text and Speech Translation in Real Time Using Convolutional Neural Network, by Ankit Ojha, Dept. of ISE, JSSATE, Bangalore, India.
They create a desktop application that uses a computer's webcam to capture a person signing American Sign Language (ASL) gestures and translate them into corresponding text and speech in real time. The translated sign language gesture is obtained as text, which is further converted into audio. In this manner they implement a fingerspelling sign language translator. To enable the detection of gestures, they make use of a convolutional neural network (CNN). A CNN is highly efficient in tackling computer vision problems and is capable of detecting the desired features with a high degree of accuracy upon sufficient training. The modules are image acquisition, hand region segmentation, hand detection and tracking, hand posture recognition, and display as text/speech. A fingerspelling sign language translator is obtained which has an accuracy of 95%.
Communication with hearing-impaired (deaf/mute) people is a great challenge in our society today; this can be attributed to the fact that their means of communication (sign language or hand gestures at a local level) requires an interpreter at every instance. This work converts ASL hand gestures into text as well as speech using unsupervised feature learning, to eliminate the communication barrier with the hearing impaired and to provide a teaching aid for sign language.
Sample images of different ASL signs were collected with a Kinect sensor using the image acquisition toolbox in MATLAB. About five hundred (500) data samples (with five to ten (5-10) samples per sign) were collected as training data. The reason for this is to make the algorithm robust for images of the same database in order to reduce the rate of misclassification. The combination of FAST and SURF with a KNN of 10 also showed that unsupervised learning classification could determine the best-matched feature from the existing database. In turn, the best match was converted to text as well as speech. The introduced system achieved 92% accuracy with supervised feature learning and 78% with unsupervised feature learning.
Another proposal optimizes the performance overhead through identification of 17 characters and 6 symbols based on image contours and convexity measurement of standard American Sign Language, without using complex algorithms or specialized hardware devices. Accuracy measurement was done through simulation, which shows how their proposal provides more accuracy with minimum complexity in comparison to other state-of-the-art works. The average accuracy is 86% overall.
Chapter 3 : Proposed System
3.1 Comparison Table
The table below presents a comparative study of various research efforts undertaken by different authors in
the field of sign language recognition. Each author has proposed a unique algorithm to tackle the challenge
of gesture recognition and translation, evaluated by the accuracy of their respective systems:
1. Mahesh Kumar (2018) implemented the Linear Discriminant Analysis (LDA) technique for
gesture classification. While simple and computationally efficient, the model achieved an accuracy
of 80%, which is moderate compared to more advanced deep learning methods.
2. Krishna Modi (2013) applied Blob Analysis, a classical image processing method. Despite its
simplicity, it delivered an impressive 93% accuracy, showcasing that traditional methods can still
be effective when well-optimized.
3. Bikash K. Yadav (2020) and Ayush Pandey (2020) both employed Convolutional Neural Networks (CNN), a deep learning-based approach known for its excellent performance in image-related tasks. Their models reached 95.8% and 95% accuracy respectively, indicating CNN's strong ability to extract complex features and learn from gesture images.
4. Victorial Adebimpe Akano (2018) used the K-Nearest Neighbors (KNN) algorithm, a classic
machine learning method. With an accuracy of 92%, the method proved to be reliable for
classification tasks when applied with proper feature extraction.
5. Rakesh Kumar (2021) opted for contour measurement, focusing on shape-based gesture analysis.
While effective, this method achieved 86% accuracy, slightly lower than deep learning methods but
still significant considering its interpretability and lower resource demand.
- This app is very user-friendly; users only require knowledge of American Sign Language.
- The system is operationally feasible, as it is very easy for end users to operate. It only needs basic familiarity with a Windows application.
- It must have a graphical user interface that assists users who are not from an IT background.
Features:
2. Flexibility.
Back-end selection: We have used Python as our back-end language, which has one of the widest library collections. Technical feasibility is frequently the most difficult area encountered at this stage; our app fits well within technical feasibility.
The system being developed must be justified by cost and benefit, so that effort is concentrated on a project which will give the best return at the earliest. One of the factors which affect the development of a new system is the cost it would require. Since the system is developed as part of project work, there is no manual cost to spend on the proposed system. Also, all the resources are already available, which indicates that the system is economically feasible to develop.
3.4 Timeline Chart
figure 3.0
3.5 Detailed Module description
3.5.1 Data Acquisition
The different approaches to acquiring data about the hand gesture are as follows:
Glove-based approach: electromechanical devices are used to provide the exact hand configuration and position. Different glove-based approaches can be used to extract this information, but they are expensive and not user friendly.
Vision-based approach: the computer webcam is the input device for observing information about the hands and/or fingers. Vision-based methods require only a camera, thus realizing a natural interaction between humans and computers without any extra devices, thereby reducing cost. The main challenge of vision-based hand detection is coping with the large variability of the human hand's appearance due to the huge number of possible hand movements, different skin-colour possibilities, and variations in viewpoint, scale, and the speed of the camera capturing the scene.
figure 3.2
In this method there are many loopholes: your hand must be in front of a clean, plain background and in proper lighting conditions, and only then will this method give accurate results. But in the real world we do not get a good background everywhere, and we do not get good lighting conditions either.
So, to overcome this situation, we tried different approaches and arrived at an interesting solution: first we detect the hand in the frame using MediaPipe and get the hand landmarks of the hand present in that image, and then we draw and connect those landmarks on a plain white image, as sketched below.
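A minimal sketch of this landmark-on-white-image idea is shown below. It assumes the standard `mediapipe` Hands solution and OpenCV; the frame source, canvas size and output path are illustrative choices, not the project's exact script.

```python
# Sketch: detect hand landmarks with MediaPipe and redraw the skeleton on a
# plain white image, so the background and lighting of the original frame
# no longer matter. Frame source and sizes are assumptions.
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)                       # webcam as input
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    ok, frame = cap.read()
    if ok:
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        result = hands.process(rgb)             # run hand detection
        white = np.full((400, 400, 3), 255, dtype=np.uint8)  # blank canvas
        if result.multi_hand_landmarks:
            for hand_landmarks in result.multi_hand_landmarks:
                # draw the 21 landmarks and their connections on the canvas
                mp_draw.draw_landmarks(white, hand_landmarks,
                                       mp_hands.HAND_CONNECTIONS)
            cv2.imwrite("skeleton.png", white)  # save the skeleton image
cap.release()
```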
figure 3.3 in this image we collected the sign language skeleton for "B"
Mediapipe Landmark System:
figure 3.7 the landmark points obtained for "B"
figure 3.8 how the image is displayed with its landmarks
figure 3.9 the landmark points obtained for "A"
figure 3.10 how the image is displayed with its landmarks
-By doing this we tackle the problem of background and lighting conditions, because the MediaPipe library gives us landmark points on almost any background and in most lighting conditions.
-We have collected 180 skeleton images for each alphabet from A to Z.
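For context, a collection loop along the lines of the hypothetical sketch below could organise such skeleton images into one folder per letter; the directory layout and file naming are assumptions, not the exact script used for this dataset.

```python
# Hypothetical sketch of organising collected skeleton images into one
# directory per alphabet (A-Z). Paths and file naming are assumptions.
import os
import cv2

DATA_DIR = "dataset"                 # assumed root folder
LETTERS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]

# create one folder per letter
for letter in LETTERS:
    os.makedirs(os.path.join(DATA_DIR, letter), exist_ok=True)

def save_sample(skeleton_image, letter: str, index: int) -> str:
    """Save one white-background skeleton image for the given letter."""
    path = os.path.join(DATA_DIR, letter, f"{letter}_{index}.png")
    cv2.imwrite(path, skeleton_image)
    return path
```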
3.5.3 Gesture Classification
Unlike regular Neural Networks, in the layers of CNN, the neurons are arranged in 3 dimensions: width,
height, depth.
The neurons in a layer will only be connected to a small region of the layer (window size) before it, instead of
all of the neurons in a fully-connected manner.
Moreover, the final output layer has dimensions equal to the number of classes, because by the end of the CNN architecture we reduce the full image into a single vector of class scores.
figure 3.11
1. Convolutional Layer:
In the convolutional layer we take a small window (typically of size 5×5) that extends through the depth of the input matrix. The layer consists of learnable filters of this window size. During every iteration we slide the window by the stride size (typically 1) and compute the dot product of the filter entries and the input values at that position.
As we continue this process we create a 2-dimensional activation map that gives the response of that filter at every spatial position. That is, the network learns filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some colour.
2. Pooling Layer:
We use a pooling layer to decrease the size of the activation matrix and ultimately reduce the number of learnable parameters.
a. Max Pooling: in max pooling we take a window (for example of size 2×2) and keep only the maximum of its 4 values. We slide this window across the matrix and continue the process, so we finally get an activation matrix half of its original size.
b. Average Pooling: in average pooling we take the average of all values in a window.
3. Fully Connected Layer:
In a convolutional layer neurons are connected only to a local region, while in a fully connected layer we connect all the inputs to every neuron.
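To make the convolution / pooling / fully-connected description above concrete, here is a minimal Keras sketch of such a CNN. The image size and filter counts are illustrative assumptions; the eight-class output reflects the grouping of the 26 letters into 8 classes described later in this section.

```python
# Minimal Keras sketch of a CNN with the layer types described above:
# convolution, max pooling and fully connected layers. Image size and
# filter counts are illustrative assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

IMG_SIZE = 128        # assumed input size (grayscale skeleton image)
NUM_CLASSES = 8       # gesture classes (26 letters grouped into 8 classes)

model = Sequential([
    Conv2D(32, (5, 5), activation="relu",
           input_shape=(IMG_SIZE, IMG_SIZE, 1)),   # 5x5 learnable filters
    MaxPooling2D((2, 2)),                          # halve spatial size
    Conv2D(64, (5, 5), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),                                     # image -> feature vector
    Dense(128, activation="relu"),                 # fully connected layer
    Dense(NUM_CLASSES, activation="softmax"),      # class probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```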
figure 3.13 The preprocessed 180 images per alphabet are fed to the Keras CNN model.
Because we got poor accuracy with 26 different classes, we divided the 26 alphabets into 8 classes, where each class contains visually similar alphabets:
figure 3.13c [g,h]
figure 3.13f [a,e,m,n,s,t]
The CNN outputs a probability for each class, and the label with the highest probability is treated as the predicted label. So when the model classifies [a,e,m,n,s,t] as one single class, we use mathematical operations on the hand landmarks to further classify it into a single alphabet: a, e, m, n, s, or t (illustrated in the sketch below).
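The exact landmark rules are not listed in the report; the sketch below only illustrates the general idea of a distance-based check on MediaPipe landmarks. The landmark indices follow MediaPipe's hand model (4 = thumb tip, 8 = index fingertip), while the comparison and threshold are made-up placeholders.

```python
# Hypothetical illustration of a landmark-based rule used to split a merged
# CNN class (e.g. [a, e, m, n, s, t]) into a single letter. The specific
# comparisons and threshold are placeholders, not the project's real rules.
import math

def distance(p1, p2) -> float:
    """Euclidean distance between two MediaPipe landmarks (x, y only)."""
    return math.hypot(p1.x - p2.x, p1.y - p2.y)

def refine_aemnst(landmarks) -> str:
    """Example rule: compare thumb tip (4) with index fingertip (8)."""
    thumb_tip, index_tip = landmarks[4], landmarks[8]
    if distance(thumb_tip, index_tip) < 0.05:   # placeholder threshold
        return "a"
    return "e"   # placeholder fallback; a real rule set covers m, n, s, t
```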
figure 3.6.1
The given diagram represents the conceptual architecture or flow of processes in a sign language
recognition system that converts hand gestures into text and speech. This system is designed to bridge the
communication gap between the speech/hearing-impaired and the general population.
figure 3.6.2
This diagram provides a simple overview of how a Sign Language to Text Converter System works. It
visually represents the communication flow between the user and the system.
DFD-Level 1
figure 3.6.3
• The user performs hand gestures representing specific alphabets, words, or phrases using Indian
Sign Language (ISL) or any other sign language standard.
• The system captures the hand gestures via a camera or sensor module.
• Using techniques such as MediaPipe, CNN models, or keypoint detection, the system identifies the
gesture.
• Each gesture is mapped to its corresponding character based on a pre-trained recognition model.
• This step may also include text-to-speech conversion for vocal output.
3.6.4 Sequence diagram
figure 3.6.4
This sequence diagram illustrates the step-by-step process of sign language recognition, starting from video
capture to final output generation. It explains how different components in the system interact to convert
hand gestures into corresponding text using a machine learning model.
Mode | Function
Deaf/Mute → Hearing | Sign → Text/Speech
Hearing → Deaf/Mute | Text/Speech → Sign (animation)
• Multilingual Support:-
Allow users to choose other languages (Marathi, Tamil, Telugu, Bengali, etc.).
How it works:
Convert signs to English (default).
Use the Google Translate API or any language model to translate the English text into the target language.
Use text-to-speech (TTS) in the selected language, as sketched below.
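A minimal sketch of this multilingual step is given below, using deep-translator and gTTS as one possible stack; the report only calls for a translation API plus TTS, so these library choices and the output file name are assumptions.

```python
# Sketch of the multilingual step: translate the recognized English text and
# speak it in the chosen language. deep-translator and gTTS are one possible
# stack, assumed here for illustration.
from deep_translator import GoogleTranslator
from gtts import gTTS

def speak_in_language(english_text: str, lang_code: str) -> None:
    # Translate the recognized English sentence into the target language.
    translated = GoogleTranslator(source="en", target=lang_code).translate(english_text)
    # Synthesize speech in that language and save it as an audio file.
    gTTS(text=translated, lang=lang_code).save("output.mp3")

speak_in_language("Hello, how are you?", "hi")   # Hindi
speak_in_language("Hello, how are you?", "ta")   # Tamil
```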
Example: the sign sequence for "Hello, how are you?" can be spoken as "नमस्ते, आप कैसे हैं?" (Hindi) or "வணக்கம், எப்படி இருக்கிறீர்கள்?" (Tamil).
Tech Stack:
Chapter 5 : Implementation and Testing
Here are some snapshots where the user shows hand gestures against different backgrounds and in different lighting conditions, and the system gives the corresponding prediction.
figure 5.1 the sign "A" is correctly predicted
figure 5.2 the sign "W" is correctly predicted
Here the hand gesture for the sign "W" is shown against a different background, and our model still predicts the correct letter.
figure 5.3 the sign "B" is correctly predicted
figure 5.4 the sign "D" is correctly predicted
After implementing the CNN algorithm, we built a GUI using Python Tkinter and added word suggestions to make the process smoother for the user (a minimal sketch follows).
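The sketch below shows the kind of Tkinter layout described: a label for the current sentence plus clickable word suggestions. The widget layout, the example text "DEE" and the suggestion list are assumptions, not the project's exact GUI.

```python
# Minimal Tkinter sketch of the described GUI: a label for the current
# sentence plus clickable word suggestions. Layout and suggestion source
# are assumptions.
import tkinter as tk

root = tk.Tk()
root.title("Sign Language to Text/Speech")

sentence = tk.StringVar(value="DEE")           # text built so far (example)
tk.Label(root, textvariable=sentence, font=("Arial", 24)).pack(pady=10)

def accept_suggestion(word: str) -> None:
    """Replace the partly-typed word with the chosen suggestion."""
    sentence.set(word)

for suggestion in ["DEER", "DEEP", "DEED"]:    # placeholder suggestions
    tk.Button(root, text=suggestion,
              command=lambda w=suggestion: accept_suggestion(w)).pack(side="left", padx=5)

root.mainloop()
```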
figure 5.5 the sentence "DEER" is predicted and spoken aloud
The sign shown below is used after each predicted alphabet to move on to the next character.
figure 5.6 showing the palm indicates that the next character should be captured
Chapter 6 : Conclusion and Future Work
Finally, we are able to predict any alphabet [a-z] with 97% accuracy (with or without a clean background and proper lighting conditions) using our method. If the background is clear and the lighting conditions are good, we get up to 99% accurate results.
In future work we will build an Android application that implements this algorithm for gesture prediction.