Real-Time Sign Language to Speech Translation using Convolutional Neural Networks and Gesture Recognition
Abstract: The aim of this paper is to design a user-friendly system that assists individuals with hearing difficulties. Sign language serves as a vital communication tool for people with hearing and speech impairments. However, the lack of widespread understanding of sign language creates barriers between the deaf community and the general public. This paper presents a real-time sign language translation system that converts gestures into text and speech using advanced machine learning techniques. For those who are deaf or speech-impaired, sign language is often the primary mode of communication, and limited public knowledge of it creates communication barriers. This study examines how data science methods can be used to close this gap by translating sign language movements into speech.
The method comprises three steps: capturing hand gestures with a webcam, recognizing them as American Sign Language (ASL) signs, and converting the recognized text to speech using Google Text-to-Speech (gTTS) synthesis. The system focuses on delivering an effective real-time communication tool through the use of convolutional neural networks (CNNs) for gesture recognition. The project uses a machine learning pipeline that comprises data collection, preprocessing, model training, real-time detection, and speech synthesis. This paper details the methods, challenges, and future directions of sign-language-to-speech conversion, and the role data science plays in making communication more accessible.
Keywords: Sign Language Recognition, CNN, Text-to-Speech, Real-Time Translation, American Sign Language (ASL), Deep
Learning, Image Classification.
How to Cite: Gayatri Gangeshkumar Waghmare; Sakshee Satish Yande; Rajesh Dattatray Tekawade; Dr. Chetan Aher (2025)
Real-Time Sign Language to Speech Translation using Convolutional Neural Networks and Gesture Recognition. International
Journal of Innovative Science and Research Technology, 10(4), 2605-2609.
https://doi.org/10.38124/ijisrt/25apr1474
The objective of this project is to build a real-time Sign Language to Speech Translator that uses computer vision and deep learning to recognize and interpret American Sign Language (ASL) hand gestures. The system follows a formal machine learning process consisting of the following steps:
Data Acquisition: Collecting a dataset of ASL hand gestures using a webcam or other publicly available sources [3].
Preprocessing: Improving model accuracy by cleaning, normalizing, and augmenting the image data [4].
Model Training: Training Convolutional Neural Networks (CNNs) to identify hand gestures [5].
Real-Time Detection: Hand gestures are captured and processed in real time using OpenCV and MediaPipe [6].
Speech Synthesis: Using text-to-speech (TTS) technologies such as Google TTS or Tacotron to convert recognized text into speech [7].
This paper addresses the limitations of existing systems by proposing a real-time sign language translation system that combines CNN-based gesture recognition with natural language processing (NLP) for contextual accuracy. The system captures hand gestures through a standard webcam, processes them with a trained CNN model, and converts the output into text and speech. Key innovations include:
NLP integration: The system incorporates NLP to refine output grammar, ensuring meaningful communication [7].
Non-invasive hardware: Unlike data gloves or colored markers, our system uses only a camera, enhancing accessibility [8].
Dynamic gesture support: The CNN model is trained on both static and dynamic gestures, improving recognition accuracy [9].
By overcoming these limitations, our solution provides a practical, scalable, and user-friendly tool for sign language translation, fostering inclusivity for the deaf and hard-of-hearing community.
II. LITERATURE REVIEW

Recent advances in sign language recognition use computer vision and deep learning to bridge communication gaps. However, many existing approaches still face limitations.

Sakib et al. [4] used a CNN-LSTM model for temporal recognition, but at the cost of real-time performance; we prioritize speed and accuracy using a streamlined CNN. Garg and Aggarwal [3] achieved real-time ASL recognition but lacked integrated speech output, which our pipeline includes.

Buckley et al. [10] targeted BSL with a focus on gesture complexity. We contribute by supporting ASL with real-time audio output and improved preprocessing using adaptive thresholding.
III. METHODOLOGY

The Sign Language to Speech Translator is designed to bridge the communication gap between people who use sign language and those who rely on spoken language. The system captures hand gestures via a webcam, processes the image to recognize the corresponding American Sign Language (ASL) letter, and converts the recognized text into speech using a text-to-speech engine. The project follows a machine learning pipeline comprising data acquisition, preprocessing, model training, real-time detection, and speech synthesis, similar to approaches in prior works [11-13].
A. Data Acquisition
The training and testing sets of the dataset consist of prerecorded images of ASL signs. Each image corresponds to a particular letter or word in sign language. Datasets with labeled hand gestures are commonly used in gesture recognition systems [12,13].
Dataset Structure: Train Set – Used for training the model. Test Set – Used for evaluating model performance.
Images of ASL signs are stored in grayscale format.
Data Augmentation: Techniques such as flipping and rotation are used to improve generalization [13].
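To illustrate the augmentation step, the following is a minimal sketch that generates flipped and slightly rotated variants of a grayscale gesture image with OpenCV; the function name and the rotation angles are illustrative choices, not details taken from the original implementation.

import cv2

def augment(image):
    # Return simple augmented variants: horizontal flip and small rotations.
    variants = [image, cv2.flip(image, 1)]
    h, w = image.shape[:2]
    for angle in (-10, 10):
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        variants.append(cv2.warpAffine(image, m, (w, h)))
    return variants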
B. Image Preprocessing
To ensure consistency, each image undergoes preprocessing before being fed into the model:
Grayscale Conversion – Focuses on hand shape and reduces complexity.
Thresholding and Noise Removal – Enhances features with Adaptive Gaussian Thresholding [11].
Resizing and Normalization – Images are resized to 128 × 128 pixels and normalized to the [0, 1] range.
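To make the preprocessing chain concrete, here is a minimal sketch assuming the input is a BGR region of interest captured with OpenCV; the blur kernel size and the adaptive-threshold parameters are illustrative values not specified in the paper.

import cv2
import numpy as np

IMG_SIZE = 128  # target resolution fed to the CNN

def preprocess(roi_bgr):
    # Grayscale -> blur -> adaptive Gaussian threshold -> resize -> normalize.
    gray = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    thresh = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 2)
    resized = cv2.resize(thresh, (IMG_SIZE, IMG_SIZE))
    return resized.astype(np.float32) / 255.0  # pixels scaled to [0, 1]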
C. CNN Architecture
A Convolutional Neural Network (CNN) was used to classify hand gestures, following a structure that is widely adopted in the gesture recognition literature [13]:
Convolution Layers (3×3 filters) – Extract spatial features.
MaxPooling Layers (2×2) – Reduce dimensionality.
Flatten Layer – Produces a vector from the feature maps.
Fully Connected Layers – Used to predict the ASL character.
Softmax Output Layer – Classifies 27 categories (A-Z + space).
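The layer sequence above can be expressed in Keras as in the sketch below; the number of convolution blocks and the filter counts are assumptions, since the paper specifies only the layer types, kernel sizes, and the 27-way softmax output.

from tensorflow.keras import layers, models

NUM_CLASSES = 27  # A-Z plus space

def build_model(input_shape=(128, 128, 1)):
    # 3x3 convolutions with 2x2 max pooling, flattened into dense layers.
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])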
D. Model Training and Evaluation
The model was trained with techniques and hyperparameters comparable to those used in prior studies on gesture recognition:
Adam optimizer (learning rate = 0.001)
Categorical cross-entropy loss
Batch size = 32, epochs = 50
Early stopping with patience = 5 [13]
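A sketch of this training configuration is given below, assuming preprocessed training and validation arrays are already available; the array names, the monitored metric, and the saved file name are placeholders rather than details from the original code.

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

def train(model, x_train, y_train, x_val, y_val):
    # Adam (lr = 0.001), categorical cross-entropy, batch size 32, up to 50
    # epochs, early stopping with patience 5 on the validation loss.
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    early_stop = EarlyStopping(monitor="val_loss", patience=5,
                               restore_best_weights=True)
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        batch_size=32, epochs=50,
                        callbacks=[early_stop])
    model.save("asl_cnn.h5")  # .h5 file later loaded for real-time detection
    return history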
E. Real-Time ASL Detection
The system uses OpenCV (cv2.VideoCapture(0)) to capture real-time hand gestures. The extracted hand region is:
Pre-processed using grayscale conversion and thresholding.
Resized and fed into the trained CNN model.
Classified into the corresponding ASL letter [11,13].

F. Conversion from Text to Speech (Speech Synthesis)
Recognized text is converted to speech using:
Google Text-to-Speech (gTTS) API
MP3 output played via Python's playsound
Optional word-level or sentence-level synthesis [12]
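A minimal sketch of the speech synthesis step using gTTS and playsound follows; the helper name and the output file path are illustrative.

from gtts import gTTS
from playsound import playsound

def speak(text, mp3_path="output.mp3"):
    # Convert recognized text to an MP3 with Google Text-to-Speech and play it.
    if not text.strip():
        return
    gTTS(text=text, lang="en").save(mp3_path)
    playsound(mp3_path)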
G. Algorithm: Sign Language to Speech Conversion System

Step 1: Start the Webcam to Record Live Video. Begin by accessing the system's webcam to record an uninterrupted video feed. This enables the system to process the user's hand movements in real time.
Step 2: Define a Region of Interest (ROI). A rectangular region of interest is drawn on the video screen where the user places their hand to perform ASL gestures. This improves both speed and accuracy by ensuring that only the relevant part of the image is analyzed.
Step 3: Continuously Capture Frames from the Webcam. The program captures frames from the video stream in a loop. These frames are processed one by one to detect and classify hand gestures.
Step 4: Preprocess the Captured Image. The image inside the ROI is converted to grayscale to simplify processing. Gaussian blur is applied to remove background noise. Thresholding techniques, including adaptive and Otsu's thresholding, are used to enhance hand-region visibility. The image is then resized and normalized to prepare it for input to the model.
Step 5: Load the Trained Convolutional Neural Network (CNN). A pre-trained CNN model, stored in .h5 format, is loaded. This model has been trained on labeled ASL gesture data and can recognize alphabet gestures.
Step 6: Predict the ASL Gesture. The preprocessed image is fed into the trained CNN model, which predicts the gesture by assigning a probability to each possible letter. The letter with the highest probability is selected as the predicted output.
Step 7: Construct a Sentence from Predicted Letters. As the user performs gestures, the predicted letters are appended to a string to form words or sentences. To avoid misclassification, the string is updated only after a certain number of frames (e.g., every 50 frames).
Step 8: Display Prediction on Screen. The current predicted letter and the constructed sentence are shown on the live video feed, providing real-time feedback to the user.
Step 9: Terminate on Escape Key. The system continues capturing and predicting until the user presses the Esc key. At this point, the final sentence is processed and the loop ends.
Step 10: Convert the Final Sentence to Speech. The final sentence is passed to a text-to-speech processor (gTTS). If the sentence is valid, it is converted into an audio file and played using the system's default media player, enabling the ASL gestures to be heard as speech.
Step 11: End the Program. After playing the audio, the system releases the webcam and closes all display windows, completing the translation process.
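The steps above can be tied together in a single capture-and-predict loop. The sketch below reuses the preprocess and speak helpers from the earlier sketches and assumes a saved asl_cnn.h5 model; the ROI coordinates, label ordering, and 50-frame voting window are illustrative, not values taken from the original code.

import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("asl_cnn.h5")                      # Step 5: pre-trained CNN
LABELS = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + [" "]  # 27 classes

cap = cv2.VideoCapture(0)                             # Step 1: open the webcam
sentence, votes = "", []

while True:                                           # Step 3: frame loop
    ok, frame = cap.read()
    if not ok:
        break
    x1, y1, x2, y2 = 100, 100, 350, 350               # Step 2: fixed ROI
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    processed = preprocess(frame[y1:y2, x1:x2])       # Step 4: preprocessing
    probs = model.predict(processed.reshape(1, 128, 128, 1), verbose=0)[0]
    letter = LABELS[int(np.argmax(probs))]            # Step 6: most probable letter
    votes.append(letter)
    if len(votes) >= 50:                              # Step 7: accept one letter per ~50 frames
        sentence += max(set(votes), key=votes.count)
        votes = []
    cv2.putText(frame, letter + " | " + sentence, (10, 40),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 0, 0), 2)  # Step 8: feedback
    cv2.imshow("ASL to Speech", frame)
    if cv2.waitKey(1) & 0xFF == 27:                   # Step 9: Esc ends the loop
        break

speak(sentence)                                       # Step 10: speech synthesis
cap.release()                                         # Step 11: release resources
cv2.destroyAllWindows()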
IV. RESULTS

The model achieved 95.8% training accuracy and 93.5% validation accuracy (Fig. 1), indicating effective learning with slight overfitting. Testing on unseen data yielded 92.1% accuracy. Real-time testing revealed:
90.5% accuracy under ideal lighting
83.2% accuracy in variable lighting conditions
Common confusions: M/N (14%), D/Z (9%)
Average latency: 93 ms per frame

The system successfully recognized ASL letters and converted them into speech with 88.7% word-level accuracy. Identified performance limitations include:
Overfitting indicated by slight variation in the validation loss
Sensitivity to hand orientation and lighting
Misclassification of letters with similar handshapes

Incorporating techniques such as data augmentation, early stopping, or advanced architectures such as transformers could further improve the system's robustness. The system outperformed previous approaches in real-time detection and speech output capabilities (Table 1).
V. CONCLUSION AND FUTURE WORK

We developed a real-time ASL-to-speech translation system using CNN and computer vision techniques. The system reliably recognizes hand gestures through a webcam and converts them into audible speech using the gTTS engine. It offers good accuracy and real-time performance, and can serve as a valuable communication tool for the speech- and hearing-impaired. Future work includes:
Expanding to full ASL phrases and sentences
Adding LSTM layers for better temporal gesture recognition
Building a mobile app for portability
Supporting regional sign languages such as ISL and BSL
Enhancing gesture detection with depth sensing

The proposed system lays a solid foundation for future advances in sign language recognition and holds significant potential for inclusive and assistive communication technologies. This work illustrates the potential of deep learning for assistive technologies, paving the way for more inclusive human-computer interaction systems.
ACKNOWLEDGMENTS

We thank the Department of Computer Engineering at AISSMS Institute of Information Technology for infrastructure support, and the deaf community members who participated in system testing.

REFERENCES

[1]. World Health Organization. (2021). Deafness and hearing loss.
[2]. Sharma, P. et al. (2022). Translating Speech to Indian Sign Language. Future Internet, 14(9), 253.
[3]. Garg, H. and Aggarwal, R. (2020). Real-Time ASL Detection. JATIT.