
© 2022 JETIR July 2022, Volume 9, Issue 7, www.jetir.org (ISSN-2349-5162)

SIGN LANGUAGE CONVERSION TO TEXT AND SPEECH

Medhini Prabhakar 1, Prasad Hundekar 1, Sai Deepthi B P 1, Shivam Tiwari 1, Vinutha M S 2
1 Department of Computer Science and Engineering, Dr. Ambedkar Institute of Technology, Bangalore, 560056
2 Assistant Professor, Department of Computer Science and Engineering, Dr. AIT

Keywords— Convolutional Neural Networks (CNN), Faster R-CNN (FRCNN), You Only Look Once (YOLO), MediaPipe

Abstract— This system presents a novel approach for analysing and recognising sign actions, generating a text description in the English language, and converting the generated text to speech. The training set used image samples of the 26 Indian Sign Language alphabets, while testing captures hand gestures from a live feed and predicts the class label using several trained models: CNN (Convolutional Neural Networks), FRCNN (Faster R-CNN), YOLO (You Only Look Once) and MediaPipe. Finally, the text description is generated in English and converted to speech. The average computation time is somewhat higher than expected because high-end GPU hardware was unavailable, but the recognition rate of the FRCNN model is acceptable. With the CNN model, recognition of hand gestures is fast enough for real-world applications, but identification accuracy is compromised. The YOLO model recognises sign language with good accuracy but with unsatisfactory speed, taking too long when a live feed of hand gestures is captured and converted; although it works inefficiently for real-time conversion, it performs very well when previously captured hand gestures are fed to it. The MediaPipe model meets all the requirements of our project: high accuracy together with real-time conversion of hand gestures to text and speech without delay.

I. INTRODUCTION

Sign language is the only form of communication for many deaf and mute people, a large proportion of whom are illiterate. However, interacting with hearing people without the aid of a human interpreter remains difficult, because most members of the general public are not eager to learn sign language. The deaf and hard of hearing become isolated as a result. Nevertheless, the development of technology makes it possible to overcome this obstacle and close the communication distance.

Various sign languages are used around the globe: around 300 are in use, because individuals from various ethnic groups created their sign languages naturally. There is no single common sign language in India; different regions have their own dialects and lexical variations of Indian Sign Language, although new initiatives to standardise Indian Sign Language (ISL) have been made. It is possible to train a machine to recognize gestures and translate them into text and voice. To facilitate communication between deaf-mute and vocal persons, the algorithm effectively and accurately categorizes hand gestures, and the identified sign's name is both displayed and spoken aloud; because its output is spoken, the same real-time detection approach can also assist visually impaired users.

This is a software-based project which uses the MediaPipe framework to detect hand gestures. Several other algorithms, namely FRCNN, CNN and YOLO, were also used to pursue the objective of the project, but each model had its own drawbacks. Finally, the MediaPipe model satisfies all requirements of the project and provides the expected output.

The project first captures hand gestures through a web camera and processes each captured frame through preprocessing and segmentation, thereby identifying the hand gesture shown by the signer in real time. The recognized sign is also converted to voice form, which further widens the range of use cases for the project.
II. RELATED WORKS

PAPER 1: An Efficient Approach for Interpretation of Indian Sign Language using Machine Learning
Abstract: This paper focuses on accurate translation of spoken English words into Indian Sign Language gestures, and of standard Indian Sign Language gestures into English. For this, various neural network classifiers are created and their effectiveness in recognising gestures is assessed. The suggested ISL interpretation system accomplishes two key tasks: converting text to gestures and converting speech to gestures. Feature extraction is performed on the pre-processed photos; this entails turning the raw data (pictures) into numerical characteristics that the classification algorithm can process, while the information contained in the original data is kept even though the image is translated to numerical form. The retrieved image features are fed as input to convolutional neural networks and other machine learning algorithms, such as the Support Vector Machine (SVM) and the Recurrent Neural Network (RNN). The Google Speech Recognition API and the PyAudio Python library are used to convert speech to text. The Keras framework was used to model and create a convolutional neural network. The classifier model was trained using about 30,240 photos, which represents 60% of the dataset's total image count, over various epoch counts; an average test accuracy of 88.89% was found, and the highest total testing accuracy attained was around 82.3 percent.

PAPER 2: Hand Gesture Recognition for Automated Speech Generation
Abstract: The field of gesture recognition has developed recently, and new tools, gadgets, models and algorithms have come into existence. Despite numerous advancements in this technology, humans still feel most comfortable making motions with their hands alone. This research focuses on a system that runs on a mobile computing device and provides automated translation of the Indian Sign Language system into speech in the English language, enabling bidirectional communication between people with vocal impairments and the general public. Since it operates on the gesture recognition concept, this system can be utilised in the near future for communication with individuals who do not comprehend sign language. The system uses an internal mobile camera to recognise and capture gestures, which are then analysed using algorithms such as the HSV model (skin colour detection), large-blob detection, flood fill and contour extraction. The system can identify one-handed signs representing the standard alphabets (A-Z) and numeric values (0-9). The results of this system's gesture processing and voice analysis are very accurate, reliable and efficient.
PAPER 3: Indian Sign Language Recognition - A Survey
Abstract: This paper surveys various methods for understanding Indian Sign Language. It reviews hand gesture recognition methods applied to Indian Sign Language recognition, covering the tools and algorithms involved, the difficulties encountered and potential directions for future research. The signer's image (the individual communicating through sign language) is acquired with a camera sensor, and the acquisition process can be started manually. Image scaling is used to cut down on the computational work required for processing images before skin detection. Skin detection classifies each pixel of the image as either part of human skin or not, producing a binary image in which the hand is defined by white pixels and all other pixels are black. Feature extraction reduces the computational effort without sacrificing precision; many features can be measured to identify sign language, including hand shape, hand orientation, texture, contour, motion, distance and centre of gravity. Principal Component Analysis (PCA) benefits from the decreased dimensionality, but PCA is highly susceptible to scaling, rotation and translation of the image, so the image needs to be normalised before PCA is applied. The vision-based technique is user-friendly because the signer does not need to wear bulky gloves. The way a gesture appears when it is being identified depends on a number of factors, including the camera's position and the signer's proximity to it. These techniques have been utilised to keep real-time accuracy and computational complexity in check.
III. METHODOLOGY

Fig: Methodology of the Sign Language Conversion to Text and Speech model

The model architecture is presented here. It covers the way the image is acquired and segmented and the flow of hand-gesture classification, showing how an image is captured and converted to speech as the output. The interaction of image data with the model is shown in the architecture diagram above.

The system takes the user's video and translates it into frames (multiple images); every single image is termed a frame. The web camera's video is continually recorded and divided into a number of frames, and video framing is the technique of selecting or discarding specific frames from a given video using video characteristics such as frame rate. Each frame then goes through preprocessing prior to feature extraction. Cropping, filtering, and brightness and contrast adjustments, among other things, are all included in image preprocessing; image enhancement, image cropping and image segmentation techniques are employed in this procedure. Captured images are in RGB format, so the first step is to convert the RGB photos to binary images and then crop the image to get rid of unnecessary parts; enhancements can then be applied to a specific chosen region. Edge detection techniques are used in image segmentation to locate the border of the cropped images, which is then employed in feature-extraction techniques. Classification assigns each sample a corresponding label with respect to groups of homogeneous characteristics, with the aim of discriminating multiple objects from each other within the image.
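The following is a minimal sketch of the capture-and-preprocessing loop just described, using OpenCV; the threshold value and the crop box are illustrative assumptions, not values taken from the paper.

import cv2

cap = cv2.VideoCapture(0)  # web camera providing the live feed

while True:
    ok, frame = cap.read()  # each captured image is one frame
    if not ok:
        break

    # Captured frames are RGB (BGR in OpenCV): convert to grayscale,
    # then to a binary image.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

    # Crop to a chosen region of interest to discard unnecessary parts
    # (the 100:400 box is an assumption for illustration).
    roi = binary[100:400, 100:400]

    # Edge detection locates the hand border for feature extraction.
    edges = cv2.Canny(roi, 100, 200)

    cv2.imshow("preprocessed", edges)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to stop capturing
        break

cap.release()
cv2.destroyAllWindows()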

Classification involves two steps:

Training step: using the matrix of training samples, the method calculates the parameters of a probability distribution, assuming the features are conditionally independent given the class.

Testing step: for any unseen sample, the method finds the posterior probability of the sample belonging to each class, and then classifies the test sample according to the largest posterior probability.
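The training and testing steps described above (a per-class probability distribution with conditionally independent features) correspond to a naive Bayes classifier; the sketch below uses scikit-learn's GaussianNB on stand-in data, since the paper does not give its feature matrix.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Stand-in data: one row of extracted features per training image,
# with a class label for each row (26 alphabet classes).
rng = np.random.default_rng(0)
X_train = rng.random((100, 20))
y_train = rng.integers(0, 26, size=100)

# Training step: estimate the distribution parameters per class,
# treating features as conditionally independent given the class.
model = GaussianNB()
model.fit(X_train, y_train)

# Testing step: posterior probability of each class for an unseen
# sample; the predicted label is the class with the largest posterior.
X_test = rng.random((1, 20))
posteriors = model.predict_proba(X_test)
print(model.predict(X_test)[0], posteriors.max())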

At last, when the classification process is completed, the equivalent grammatical text description is generated with the help of the class labels assigned during the training phase. Finally, the conversion of text to speech is done using the Google API.
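The paper only names the Google API for this last step; as an assumption, the sketch below uses the gTTS package, a common Python wrapper around Google's text-to-speech service.

from gtts import gTTS

# The class labels recovered during classification form the text output.
recognized_text = "Hello, how are you"

# Convert the generated English text to speech and save it as audio.
gTTS(text=recognized_text, lang="en").save("output.mp3")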

Fig: Flowchart

IV. IMPLEMENTATION
Step 1: The first stage is to segment the skin part of the image, as the remaining part can be regarded as noise with respect to the character-classification problem (see the sketch after this list).

Step 2: The second stage is to extract relevant features from the skin-segmented images which can prove significant for the next stage, i.e., learning and classification.

Step 3: The third stage, as mentioned above, is to use the extracted features as input to the algorithms for training, and then finally to use the trained models for classification.
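A minimal sketch of the skin segmentation in Step 1, done in HSV colour space with OpenCV; the HSV bounds are commonly used illustrative values, not thresholds taken from the paper.

import cv2
import numpy as np

frame = cv2.imread("gesture.jpg")  # one captured frame

# Step 1: keep only the skin-coloured pixels; everything else is noise
# for the character-classification problem.
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
lower_skin = np.array([0, 20, 70], dtype=np.uint8)     # assumed bound
upper_skin = np.array([20, 255, 255], dtype=np.uint8)  # assumed bound
mask = cv2.inRange(hsv, lower_skin, upper_skin)

# The masked image feeds the feature-extraction stage (Step 2).
skin = cv2.bitwise_and(frame, frame, mask=mask)
cv2.imwrite("skin_segmented.jpg", skin)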
Models used for implementation:

1. FRCNN for Indian Sign Language
We used a data set of double-handed gesture photos in Indian Sign Language comprising both alphabets and numbers; this data was given to the FRCNN model. The Fast R-CNN detector employs an algorithm similar to Edge Boxes to produce region proposals, just like the R-CNN detector does. However, the Fast R-CNN detector processes the complete image, as opposed to the R-CNN detector, which shrinks and resizes region proposals. Fast R-CNN pools the CNN features corresponding to each region proposal, whereas an R-CNN detector must categorise each region separately. Because computations for overlapping regions are shared, the Fast R-CNN detector is more efficient than R-CNN.
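The paper does not list its FRCNN code, so the sketch below shows one plausible setup: a Faster R-CNN detector from torchvision, pre-trained on COCO, which would then be fine-tuned on the two-handed ISL alphabet and number photos.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Faster R-CNN pre-trained on COCO; for ISL the classification head
# would be replaced and fine-tuned on the gesture photos.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("isl_gesture.jpg").convert("RGB"))

# Region proposal and classification run over the whole image at once;
# overlapping regions share computation, the source of the speed-up
# over the original R-CNN.
with torch.no_grad():
    detections = model([image])[0]

print(detections["boxes"].shape, detections["labels"], detections["scores"])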
2. CNN for American Sign Language
In the American Sign Language data set that we downloaded from Kaggle, the English alphabets and numbers were each represented by a single hand. A Convolutional Neural Network is a deep learning method that can take in an input image, assign significance to the various elements and objects in the image, and distinguish between them. Comparatively speaking, a ConvNet requires substantially less preprocessing than other classification techniques: ConvNets have the capacity to learn their filters and properties, whereas in primitive techniques the filters are hand-engineered. The goal of the convolution operation is to extract the input image's high-level characteristics, such as edges.
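A minimal sketch of a ConvNet of the kind described, written with Keras; the layer sizes and the 64x64 grayscale input are illustrative assumptions, and dataset loading is omitted.

from tensorflow import keras
from tensorflow.keras import layers

num_classes = 36  # 26 letters plus 10 digits in the single-handed set

# Stacked convolutions learn edge-like filters from the data instead
# of relying on hand-engineered features.
model = keras.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, epochs=10) would train on the Kaggle images.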
3. YOLO for American Sign Language
To meet the project's objective and obtain better results, we implemented the YOLO model for sign language conversion. YOLO is the acronym for You Only Look Once. This algorithm identifies and locates different objects in a picture. The class probabilities of the detected objects are provided by the object-identification process in YOLO, which is carried out as a regression problem. YOLO uses a CNN to recognise objects instantly: as the name implies, the approach needs just one forward propagation through a neural network to detect objects, so a single algorithm run performs prediction over the full image, with multiple class probabilities and bounding boxes predicted simultaneously. This project makes use of the YOLOv3 version.
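A hedged sketch of single-pass YOLOv3 inference using OpenCV's DNN module; yolov3.cfg and yolov3.weights are the standard Darknet files and stand in for whatever sign-language-trained weights the project actually used.

import cv2
import numpy as np

# One forward pass predicts bounding boxes and class probabilities
# for the whole image at once.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
out_names = net.getUnconnectedOutLayersNames()

frame = cv2.imread("asl_gesture.jpg")
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                             swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(out_names)  # the single run over the full image

# Each detection row holds box coordinates, objectness, then
# per-class scores.
for out in outputs:
    for det in out:
        scores = det[5:]
        class_id = int(np.argmax(scores))
        if scores[class_id] > 0.5:
            print("class", class_id, "confidence", float(scores[class_id]))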

4. Media Pipe for Customized Sign Language
In this model, the classes are defined in a custom manner for each hand gesture, and several phrases that people use in daily conversations are included in the training data set fed to the MediaPipe framework. For such jobs, MediaPipe, an open-source framework created especially for complicated perception pipelines utilising accelerated (GPU or CPU) inference, currently provides quick and precise, yet distinct, solutions. Combining numerous dependent neural networks into a semantically consistent end-to-end solution in real time is a particularly challenging task that necessitates simultaneous inference of those networks. Today, MediaPipe Holistic offers a fresh, cutting-edge human pose topology that opens up new use cases as a solution to this problem.
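A minimal sketch of hand-landmark extraction with the MediaPipe Hands solution; the custom gesture classes described above would be trained on top of the 21 landmarks returned per hand (that classifier is omitted here).

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
cap = cv2.VideoCapture(0)

with mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV delivers BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand in results.multi_hand_landmarks:
                # 21 (x, y, z) landmarks per hand: the features on which
                # the custom gesture classes are defined.
                coords = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
                print(len(coords), "landmarks")
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

cap.release()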
V. OBSERVATIONS
1. The ISL model using FRCNN shows good accuracy but does not hit the mark in terms of speed.
2. The ASL model using CNN is acceptable for real-time applications in terms of speed but is not satisfactory in terms of accuracy.
3. The YOLO model for ASL ticks the boxes for both accuracy and speed. Nonetheless, the model is efficient only when the input is fed to it manually, and falls behind when taking input in the form of a live feed.
4. The MediaPipe model achieves all the expectations and objectives of our project: sign language recognition and conversion to text and speech in real time.

VI. RESULTS
Real-time image capture from the camera is given to the algorithm for processing. The Python code uses the MediaPipe concept to identify and categorise the objects. It outlines the detected area with bounding boxes and displays the object's category index, which is made up of the class name and class id of the detected object; a text file is used to keep the category index of the objects that were detected. Following hand-gesture identification, the text representation of the hand gesture is transformed to speech. The user may easily transport this system because it is portable.
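A small sketch of the output stage described above, in which the category index (class id and class name) of a detection is appended to a text file and the recognised text is spoken; the detection dictionary, file name and use of gTTS are assumptions for illustration.

from gtts import gTTS

# Hypothetical detection result forming the category index.
category_index = {"class_id": 2, "class_name": "TWO"}

# Keep the category index of detected objects in a text file.
with open("detections.txt", "a") as f:
    f.write(f"{category_index['class_id']} {category_index['class_name']}\n")

# The text representation of the gesture is then transformed to speech.
gTTS(text=category_index["class_name"], lang="en").save("gesture.mp3")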
Fig: Detecting the number 2

Fig: Detecting the number 1

Fig: Detecting Sign C

VII. CONCLUSION AND FUTURE WORKS
This project uses four different algorithms to recognize sign language. The FRCNN algorithm crosses the mark of good accuracy but fails to achieve the speed required for real-world usage. The second algorithm used to implement this project was CNN for ASL; this model performed well in terms of speed but remained behind in terms of accuracy, which again did not hit the brief of our project objective. The YOLO algorithm was then used with the aim of achieving good accuracy and good speed at the same time, which was lacking in the first two models. Though that combination of accuracy and speed was achieved, the model did not collect input in real time. Keeping these hindrances in consideration, the project was finally implemented using the MediaPipe algorithm. It classifies the sign-language alphabet efficiently with good accuracy, and the identified hand gesture is converted to speech, so that the system can be used for communication not only between deaf-mute and vocal people but also, potentially, by visually impaired people. The proposed system is thus developed to solve communication problems for vocally disabled people, thereby encouraging all people to interact better and not feel isolated.

To improve the accuracy and applicability of this project, the work can be extended to server-based systems, which would improve coverage. High-range cameras can be used to get better sign detection. Interconnecting such systems with a central computer will help accumulate data, and the performance of the whole network will benefit from this exchange of data, since it will help train the algorithm better.

REFERENCES
[1] https://gilberttanner.com/blog/tensorflow-object-detection-with-tensorflow-2-creating-a-custom-model/
[2] https://learnopencv.com/introduction-to-mediapipe/
[3] https://towardsdatascience.com/yolo-object-detection-with-opencv-and-python-21e50ac599e9
[4] A. Thorat, V. Satpute, A. Nehe, T. Atre, Y. Ngargoje, "Indian Sign Language Recognition System for Deaf People," Int. J. Adv.
[5] Kusumika Krori Dutta, Sunny Arokia Swamy Bellary, "Machine Learning Techniques for Indian Sign Language Recognition," International Conference on Current Trends in Computer, Electrical, Electronics and Communication (ICCTCEEC-2017).
[6] A. Sharmila Konwar, B. Sagarika Borah, C. Dr. T. Tuithung, "An American Sign Language Detection System using HSV Color Model and Edge Detection," International Conference on Communication and Signal Processing, pp. 743-746, Melmaruvathur, India, April 3-5, 2014.
[7] Anuja V. Nair, Bindu V., "A Review on Indian Sign Language Recognition," International Journal of Computer Applications, pp. 33-38, July 2013.
[8] Ariya Thongtawee, Onamon Pinsanoh, Yuttana Kitjaidure, "A Novel Feature Extraction for American Sign Language Recognition Using Webcam."
[9] Dhivyasri S, Krishnaa Hari K B, Akash M, Sona M, Divyapriya S, Dr. Krishnaveni V, "An Efficient Approach for Interpretation of Indian Sign Language using Machine Learning," 2021 3rd International Conference on Signal Processing and Communication, 13-14 May 2021, Coimbatore.
[10] J. R. Pansare, S. H. Gawande, M. Ingle, "Real-Time Static Hand Gesture Recognition for American Sign Language in Complex Background," Journal of Signal and Information Processing, No. 3, pp. 364-367.
