Real-Time Sign Language to Speech Translation using Convolutional Neural Networks and Gesture Recognition
Abstract: The aim of this paper is to design a user-friendly system that assists individuals with hearing difficulties. Sign language serves as a vital communication tool for people with hearing and speech impairments. However, the lack of widespread understanding of sign language creates barriers between the deaf community and the general public. This paper presents a real-time sign language translation system that converts gestures into text and speech using advanced machine learning techniques. For those who are deaf or speech-impaired, sign language is often the primary mode of communication, and limited public knowledge of it creates communication barriers. This study examines how data science methods can be used to close this gap by translating sign language movements into speech.
The method comprises three steps: capturing hand gestures with a webcam, recognizing them as American Sign Language (ASL) signs, and converting the recognized text to speech using Google Text-to-Speech (gTTS) synthesis. The system focuses on delivering an effective real-time communication tool through the use of convolutional neural networks (CNNs) for gesture recognition. The project uses a machine learning pipeline that comprises data collection, preprocessing, model training, real-time detection, and speech synthesis. This paper details the methods, challenges, and future directions of sign-language-to-speech conversion, and the role data science plays in making communication more accessible.
Keywords: Sign Language Recognition, CNN, Text-to-Speech, Real-Time Translation, American Sign Language (ASL), Deep
Learning, Image Classification.
How to Cite: Gayatri Gangeshkumar Waghmare; Sakshee Satish Yande; Rajesh Dattatray Tekawade; Dr. Chetan Aher (2025)
Real-Time Sign Language to Speech Translation using Convolutional Neural Networks and Gesture Recognition. International
Journal of Innovative Science and Research Technology, 10(4), 2605-2609.
https://doi.org/10.38124/ijisrt/25apr1474
The objective of this project is to build a real-time Sign Language to Speech Translator that uses computer vision and deep learning to recognize and interpret American Sign Language (ASL) hand gestures. The system follows a formal machine learning process consisting of the following steps:
Data Acquisition: Collecting a dataset of ASL hand gestures using a webcam or other publicly available sources [3].
Preprocessing: Improving model accuracy by cleaning, normalizing, and augmenting the image data [4].
Model Training: Training Convolutional Neural Networks (CNNs) to identify hand gestures [5].
Real-Time Detection: Hand gestures are captured and processed in real time using OpenCV and MediaPipe [6].
Speech Synthesis: Using text-to-speech (TTS) technologies such as Google TTS or Tacotron to convert recognized text into speech [7].
This paper addresses the limitations of existing systems by proposing a real-time sign language translation system that combines CNN-based gesture recognition with natural language processing (NLP) for contextual accuracy. The system captures hand gestures through a standard webcam, processes them with a trained CNN model, and converts the output into text and speech. Key innovations include:
NLP integration: The system incorporates NLP to refine output grammar, ensuring meaningful communication [7].
Non-invasive hardware: Unlike data gloves or colored markers, our system uses only a camera, enhancing accessibility [8].
Dynamic gesture support: The CNN model is trained on both static and dynamic gestures, improving recognition accuracy [9].
By overcoming these limitations, our solution provides a practical, scalable, and user-friendly tool for sign language translation, fostering inclusivity for the deaf and hard-of-hearing community.
II. LITERATURE REVIEW

Recent advances in sign language recognition use computer vision and deep learning to bridge communication gaps. However, many existing approaches still face limitations.

Sakib et al. [4] used a CNN-LSTM model for temporal recognition, but at the cost of real-time performance; we prioritize speed and accuracy using a streamlined CNN. Garg and Aggarwal [3] achieved real-time ASL recognition but lacked integrated speech output, which our pipeline includes.

Buckley et al. [10] targeted BSL with a focus on gesture complexity. We contribute by supporting ASL with real-time audio output and improved preprocessing using adaptive thresholding.
III. METHODOLOGY

The Sign Language to Speech Translator is designed to bridge the communication gap between people who use sign language and those who rely on spoken language. The system captures hand gestures via a webcam, processes the image to recognize the corresponding American Sign Language (ASL) letter, and converts the recognized text into speech using a text-to-speech engine. The project follows a machine learning pipeline comprising data acquisition, preprocessing, model training, real-time detection, and speech synthesis, similar to approaches in prior works [11-13].
A. Data Acquisition
The training and testing sets of the dataset consist of prerecorded images of ASL signs. Each image corresponds to a particular letter or word in sign language. Datasets with labeled hand gestures are commonly used in gesture recognition systems [12,13].
Dataset Structure: Train Set – Used for training the model. Test Set – Used for evaluating model performance.
Images of ASL signs are stored in grayscale format.
Data Augmentation: Techniques such as flipping and rotation are used to improve generalization [13].
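To illustrate the augmentation step, the following is a minimal sketch that generates flipped and slightly rotated variants of a grayscale gesture image with OpenCV; the function name and the rotation angles are illustrative choices, not details taken from the original implementation.

import cv2

def augment(image):
    # Return simple augmented variants: horizontal flip and small rotations.
    variants = [image, cv2.flip(image, 1)]
    h, w = image.shape[:2]
    for angle in (-10, 10):
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        variants.append(cv2.warpAffine(image, m, (w, h)))
    return variants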
B. Image Preprocessing
To ensure consistency, each image undergoes preprocessing before being fed into the model:
Grayscale Conversion – Focuses on hand shape and reduces complexity.
Thresholding and Noise Removal – Enhances features with Adaptive Gaussian Thresholding [11].
Resizing and Normalization – Images are resized to 128 × 128 pixels and normalized to the [0, 1] range.
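To make the preprocessing chain concrete, here is a minimal sketch assuming the input is a BGR region of interest captured with OpenCV; the blur kernel size and the adaptive-threshold parameters are illustrative values not specified in the paper.

import cv2
import numpy as np

IMG_SIZE = 128  # target resolution fed to the CNN

def preprocess(roi_bgr):
    # Grayscale -> blur -> adaptive Gaussian threshold -> resize -> normalize.
    gray = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    thresh = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 2)
    resized = cv2.resize(thresh, (IMG_SIZE, IMG_SIZE))
    return resized.astype(np.float32) / 255.0  # pixels scaled to [0, 1]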
C. CNN Architecture
A Convolutional Neural Network (CNN) was used to classify hand gestures, following a structure that is widely adopted in the gesture recognition literature [13]:
Convolution Layers (3×3 filters) – Extract spatial features.
MaxPooling Layers (2×2) – Reduce dimensionality.
Flatten Layer – Produces a vector from the feature maps.
Fully Connected Layers – Used to predict the ASL character.
Softmax Output Layer – Classifies 27 categories (A-Z + space).
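The layer sequence above can be expressed in Keras as in the sketch below; the number of convolution blocks and the filter counts are assumptions, since the paper specifies only the layer types, kernel sizes, and the 27-way softmax output.

from tensorflow.keras import layers, models

NUM_CLASSES = 27  # A-Z plus space

def build_model(input_shape=(128, 128, 1)):
    # 3x3 convolutions with 2x2 max pooling, flattened into dense layers.
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])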
D. Model Training and Evaluation
The model was trained with techniques and hyperparameters comparable to those used in prior studies on gesture recognition:
Adam optimizer (learning rate = 0.001)
Categorical cross-entropy loss
Batch size = 32, epochs = 50
Early stopping with patience = 5 [13]
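A sketch of this training configuration is given below, assuming preprocessed training and validation arrays are already available; the array names, the monitored metric, and the saved file name are placeholders rather than details from the original code.

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

def train(model, x_train, y_train, x_val, y_val):
    # Adam (lr = 0.001), categorical cross-entropy, batch size 32, up to 50
    # epochs, early stopping with patience 5 on the validation loss.
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    early_stop = EarlyStopping(monitor="val_loss", patience=5,
                               restore_best_weights=True)
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        batch_size=32, epochs=50,
                        callbacks=[early_stop])
    model.save("asl_cnn.h5")  # .h5 file later loaded for real-time detection
    return history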
E. Real-Time ASL Detection
The system uses OpenCV (cv2.VideoCapture(0)) to capture real-time hand gestures. The extracted hand region is:
Pre-processed using grayscale conversion and thresholding.
Resized and fed into the trained CNN model.
Classified into the corresponding ASL letter [11,13].

F. Conversion from Text to Speech (Speech Synthesis)
Recognized text is converted to speech using:
Google Text-to-Speech (gTTS) API
MP3 output played via Python's playsound
Optional word-level or sentence-level synthesis [12]
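A minimal sketch of the speech synthesis step using gTTS and playsound follows; the helper name and the output file path are illustrative.

from gtts import gTTS
from playsound import playsound

def speak(text, mp3_path="output.mp3"):
    # Convert recognized text to an MP3 with Google Text-to-Speech and play it.
    if not text.strip():
        return
    gTTS(text=text, lang="en").save(mp3_path)
    playsound(mp3_path)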
G. Algorithm: Sign Language to Speech Conversion System

Step 1: Start the Webcam to Record Live Video. Begin by accessing the system's webcam to record an uninterrupted video feed. This enables the system to process the user's hand movements in real time.
Step 2: Define a Region of Interest (ROI). A rectangular region of interest is drawn on the video screen where the user places their hand to perform ASL gestures. This improves both speed and accuracy by ensuring that only the relevant part of the image is analyzed.
Step 3: Continuously Capture Frames from the Webcam. The program captures frames from the video stream in a loop. These frames are processed one by one to detect and classify hand gestures.
Step 4: Preprocess the Captured Image. The image inside the ROI is converted to grayscale to simplify processing. Gaussian blur is applied to remove background noise. Thresholding techniques, including adaptive and Otsu's thresholding, are used to enhance hand-region visibility. The image is then resized and normalized to prepare it for input to the model.
Step 5: Load the Trained Convolutional Neural Network (CNN). A pre-trained CNN model, stored in .h5 format, is loaded. This model has been trained on labeled ASL gesture data and can recognize alphabet gestures.
Step 6: Predict the ASL Gesture. The preprocessed image is fed into the trained CNN model, which predicts the gesture by assigning a probability to each possible letter. The letter with the highest probability is selected as the predicted output.
Step 7: Construct a Sentence from Predicted Letters. As the user performs gestures, the predicted letters are appended to a string to form words or sentences. To avoid misclassification, the string is updated only after a certain number of frames (e.g., every 50 frames).
Step 8: Display Prediction on Screen. The current predicted letter and the constructed sentence are shown on the live video feed, providing real-time feedback to the user.
Step 9: Terminate on Escape Key. The system continues capturing and predicting until the user presses the Esc key. At this point, the final sentence is processed and the loop ends.
Step 10: Convert the Final Sentence to Speech. The final sentence is passed to a text-to-speech processor (gTTS). If the sentence is valid, it is converted into an audio file and played using the system's default media player, enabling the ASL gestures to be heard as speech.
Step 11: End the Program. After playing the audio, the system releases the webcam and closes all display windows, completing the translation process.
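The steps above can be tied together in a single capture-and-predict loop. The sketch below reuses the preprocess and speak helpers from the earlier sketches and assumes a saved asl_cnn.h5 model; the ROI coordinates, label ordering, and 50-frame voting window are illustrative, not values taken from the original code.

import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("asl_cnn.h5")                      # Step 5: pre-trained CNN
LABELS = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + [" "]  # 27 classes

cap = cv2.VideoCapture(0)                             # Step 1: open the webcam
sentence, votes = "", []

while True:                                           # Step 3: frame loop
    ok, frame = cap.read()
    if not ok:
        break
    x1, y1, x2, y2 = 100, 100, 350, 350               # Step 2: fixed ROI
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    processed = preprocess(frame[y1:y2, x1:x2])       # Step 4: preprocessing
    probs = model.predict(processed.reshape(1, 128, 128, 1), verbose=0)[0]
    letter = LABELS[int(np.argmax(probs))]            # Step 6: most probable letter
    votes.append(letter)
    if len(votes) >= 50:                              # Step 7: accept one letter per ~50 frames
        sentence += max(set(votes), key=votes.count)
        votes = []
    cv2.putText(frame, letter + " | " + sentence, (10, 40),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 0, 0), 2)  # Step 8: feedback
    cv2.imshow("ASL to Speech", frame)
    if cv2.waitKey(1) & 0xFF == 27:                   # Step 9: Esc ends the loop
        break

speak(sentence)                                       # Step 10: speech synthesis
cap.release()                                         # Step 11: release resources
cv2.destroyAllWindows()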
IV. RESULTS

The model achieved 95.8% training accuracy and 93.5% validation accuracy (Fig. 1), indicating effective learning with slight overfitting. Testing on unseen data yielded 92.1% accuracy. Real-time testing revealed:
90.5% accuracy under ideal lighting
83.2% accuracy in variable lighting conditions
Common confusions: M/N (14%), D/Z (9%)
Average latency: 93 ms per frame

The system successfully recognized ASL letters and converted them into speech with 88.7% word-level accuracy. Identified performance limitations include:
Overfitting indicated by slight variation in the validation loss
Sensitivity to hand orientation and lighting
Misclassification of letters with similar handshapes

Incorporating techniques such as data augmentation, early stopping, or advanced architectures such as transformers could further improve the system's robustness. The system outperformed previous approaches in real-time detection and speech output capabilities (Table 1).
V. CONCLUSION AND FUTURE WORK

We developed a real-time ASL-to-speech translation system using CNN and computer vision techniques. The system reliably recognizes hand gestures through a webcam and converts them into audible speech using the gTTS engine. It offers good accuracy and real-time performance, and can serve as a valuable communication tool for the speech- and hearing-impaired. Future work includes:
Expanding to full ASL phrases and sentences
Adding LSTM layers for better temporal gesture recognition
Building a mobile app for portability
Supporting regional sign languages such as ISL and BSL
Enhancing gesture detection with depth sensing

The proposed system lays a solid foundation for future advances in sign language recognition and holds significant potential for inclusive and assistive communication technologies. This work illustrates the potential of deep learning for assistive technologies, paving the way for more inclusive human-computer interaction systems.
ACKNOWLEDGMENTS

We thank the Department of Computer Engineering at AISSMS Institute of Information Technology for infrastructure support, and the deaf community members who participated in system testing.

REFERENCES

[1]. World Health Organization. (2021). Deafness and hearing loss.
[2]. Sharma, P. et al. (2022). Translating Speech to Indian Sign Language. Future Internet, 14(9), 253.
[3]. Garg, H. and Aggarwal, R. (2020). Real-Time ASL Detection. JATIT.