Hand Gesture Recognition in Indian Sign Language Using Deep Learning †

Harsh Kumar Vashisth, Tuhin Tarafder, Rehan Aziz, Mamta Arora * and Alpana

Proceeding Paper. Eng. Proc. 2023, 59, 96. https://doi.org/10.3390/engproc2023059096
Academic Editors: Nithesh Naik, Rajiv Selvam, Pavan Hiremath, Suhas Kowshik CS and Ritesh Ramakrishna Bhat
Published: 21 December 2023
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Department of Computer Science and Technology, Manav Rachna University, Faridabad 121010, Haryana, India; [email protected] (H.K.V.); [email protected] (T.T.); [email protected] (R.A.); [email protected] (A.)
* Correspondence: [email protected]
† Presented at the International Conference on Recent Advances on Science and Engineering, Dubai, United Arab Emirates, 4–5 October 2023.

Abstract: Sign
languages are important for the deaf and hard-of-hearing communities, as they provide a means of
communication and expression. However, many people outside of the deaf community are not
familiar with sign languages, which can lead to communication barriers and exclusion. Each country
and culture has its own sign language, and some countries have multiple sign languages. Indian Sign
Language (ISL) is a visual language used by the deaf and hard-of-hearing community in India. It is a
complete language, with its own grammar and syntax, and is used to convey information through
hand gestures, facial expressions, and body language. Over time, ISL has evolved into its own distinct
language, with regional variations and dialects. Recognizing hand gestures in sign languages is a
challenging task due to the high variability in hand shapes, movements, and orientations. ISL uses a
combination of one-handed and two-handed gestures, which makes it fundamentally different from
other common sign languages like American Sign Language (ASL). This paper aims to address the
communication gap between specially abled (deaf) people who can only express themselves through
the Indian sign language and those who do not understand it, thereby improving accessibility and
communication for sign language users. This is achieved by using and implementing Convolutional
Neural Networks on our self-made dataset. This is a necessary step, as none of the existing datasets
fulfills the need for real-world images. We have achieved 0.0178 loss and 99% accuracy on our
dataset.

Keywords: CNN; AI; ISL; static hand gestures; deep learning

1. Introduction

Hand gestures
are a convenient and easy mode of communication for humans. An application of gestures in
communication is sign language. Sign language is a mode of non-verbal communication, mostly used
by people with speech or hearing impairments. There are many sign language variations that are
used by people to communicate. This paper works on Indian Sign Language. Sign languages are a convenient and easy way for people who know them to express their thoughts and emotions. But it is very difficult for people who do not understand sign language to communicate
with those who use it. This requires a trained translator to be present in many situations. A
computer-based system that can interpret sign language can be very helpful in this aspect. It can be
used to improve communication by helping people understand sign languages without specialized
training. Over the past few years, much research has been conducted on gesture recognition and sign
language recognition. But most of it focuses on other sign languages. There are approximately 44
million people who have disabling deafness and hearing loss [1]. This is more than the population of
the United States and Canada combined! Hence, this is the motivation to proceed with the idea of developing a machine
learning model to recognize sign languages in real time. This will help out a huge part of the
population in terms of communication with one another. This work addresses the following points:
1. Implementing machine learning and deep learning algorithms used in hand gesture recognition.
2. Fine-tuning the deep learning model in order to achieve maximum performance.
3. Identifying and discussing the flaws or drawbacks of this model that turn up in applications of the algorithms.
4. Determining how these flaws can be corrected, if they can be corrected currently, or what can be done about them in the future.

2. Literature Survey

Several studies have been conducted in the fields of gesture
recognition, face detection, and their applications in sign language recognition and facial expression.
Some of our reviewed papers have been listed here. Rosalina et al. [2] have taken about 3900 raw
image files covering 39 alphabet letters, numbers, and punctuation marks, in
accordance with SIBI (Sistem Isyarat Bahasa Indonesia), and the accuracy turned out to be 90%.
Computer vision was used to capture the image and then extract essential data from it. ANN
(Artificial Neural Network) was then used to classify the images, and at last, speech recognition was
used to translate the input speech in the form of NATO phonetic language and then translate it to
sign language. Image processing is conducted by morphological operators, mainly erosion and
dilation. The image is to be segmented. The whole process consists of four stages: image capture
(static photos in RGB color space), image processing (separate background from hand, HSV range to
separate the same, blurred, then dilated), feature extraction (finding contours that are basically
edges of the same color range), and classification (ANN will take B/W (Black and White) images).
Lastly, they translated the given letters into speech, which can be conducted using the NATO
alphabets.
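To illustrate the segmentation pipeline described for [2] (HSV thresholding, blurring, dilation, and contour extraction), the following OpenCV sketch shows one possible realization; the skin-tone bounds, kernel sizes, and function name are illustrative assumptions, not details taken from that paper.

import cv2
import numpy as np

def segment_hand(bgr_image):
    """Rough hand segmentation: HSV range mask -> blur -> dilate -> largest contour."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Illustrative skin-tone range in HSV; real systems tune these bounds.
    mask = cv2.inRange(hsv, np.array([0, 30, 60]), np.array([25, 180, 255]))
    mask = cv2.GaussianBlur(mask, (5, 5), 0)
    mask = cv2.dilate(mask, np.ones((3, 3), np.uint8), iterations=2)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)   # assume the largest contour is the hand
    return cv2.boundingRect(hand)               # (x, y, w, h) of the hand region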
B. Hangün et al. [3] compare the performance of functions related to image processing as implemented in the OpenCV library. Image processing requires more computational power than
regular use because of mathematical operations such as matrix inversion, transposition of a matrix,
matrix convolution, Fourier transform, etc. Images are taken as matrices, and as the image resolution
increases, so does the matrix order. CPUs and GPUs work on different principles; CPUs operate on
series processing (one task at a time) and GPUs on parallel processing (multiple tasks at the same
time). Although this does not completely hold true today, it is still practically true. CUDA (Compute
Unified Device Architecture) is a parallel programming model developed by NVIDIA that works on
NVIDIA GPUs. Performance in this paper is based on time vs. image size for tasks such as resizing,
thresholding (making the image black and white), histogram equalization (the process of adjusting
bad contrast), and edge detection. OpenCV is an important AI tool. CUDA is an API made in 2007. The
testing was conducted using a modest consumer-grade GTX 970 and i7 6700. Results showed GPUs
are faster at each of the four tasks, especially for higher-resolution images. S. Li et al. [4] performed a
review of the current state of deep learning applications in facial expression recognition. According
to the review, the increase in computational power and success in deep learning have made it
efficient for automatic Facial Expression Recognition (FER). The two issues are overfitting (due to a
lack of training data) and non-expression-related issues such as illumination, head position,
occlusion, identity bias, and age. Anger, fear, disgust, happiness, sadness, surprise, and contempt are
the seven basic expressions humans perceive. FER is of two types: static (1 image) and dynamic
(continuous frames). Earlier research used shallow learning, but the work reviewed here uses real-time images (due to the increase in data and computational power). There are many
available datasets to train the model to avoid overfitting as much as possible, such as CK+, MMI,
Oulu-CASIA, JAFFE, FER2013, AFEW, SFEW, etc. Image processing is similar to the previous paper,
such as V&J for face alignment (GAN), CNN for augmentation, face normalization for illumination (IS, DCT, histogram normalization), and pose (Hasner, GAN). Face classification uses SVM+CNN. D. M. Prasanna et al. [5] developed a real-time GUI-based face recognition system using OpenCV, with face detection (HOG+SVM), feature extraction (HOG), and recognition
as the three stages. Several use-case scenarios are mentioned here. The issues are similar to those in
the previous paper. DNN (Deep Neural Network) was used for classification. O. K. Oyedotun et al. [6]
proposed a model that recognizes all 24 hand gestures as present in Thomas Moeslund’s gesture
recognition database. Different hand gestures can look similar when viewed from different angles in
a 2D image. This makes recognizing hand gestures a challenging task. Input images are converted to
binary representations and then de-noised using a median filter. The segment of the image that
contains hand gestures is extracted and rescaled to a 32 × 32 image by pattern averaging. The
proposed model uses CNNs and stacked denoising auto-encoders. The applied network topology
combines a convolution and its sub-sampling layer together into a layer. Each convolution layer
generates convolutional feature maps. Then, feature reduction is applied to each convolutional
feature map by subsampling layers. An SDAE (Stacked Denoising Auto Encoder) is used to learn more
features from input images. More features can be learned by stacking more hidden layers. Data-
augmentation techniques are often used to generate additional training data and prevent model
overfitting. Some commonly used data augmentation techniques for images include translation, RGB
jittering and horizontal flipping, spatial augmentation on each frame for video input, and temporal
translations over spatial translations. F. Zhan [7] proposes a spatio-temporal data augmentation
method for better generalization and the use of 2D-CNNs to extract segments of images that contain
hand gestures. Images are converted to grayscale images and rescaled to 50 × 50. Horizontal
mirroring is used for data augmentation. MSE is used as a cost function, and SGD is used as an
optimization function. L. Pigou [8] applies recent advancements in deep learning like Temporal
Convolutions, Residual Networks, Batch Normalization, and Exponential Linear Units for the
framewise classification of gestures and sign languages in a continuous video stream. For pre-
processing, input RGB images are converted to grayscale and rescaled to 128 × 128. Static
information in consecutive frames is removed by taking the difference between the two frames. The
model uses CNN adapted for classification tasks along with back propagation for recurrent
architecture. Temporal convolutions and recurrence are used to work with video frames. CNNs allow
for learning multiple features hierarchically instead of extracting features manually. Separate
networks are used for gesture recognition and sign language recognition. Gesture Recognition uses a
many-to-many configuration. Mini-batch Gradient Descent with Adam is used for optimizing
parameters. Current methods used in head pose estimation require depth information in addition to
RGB information. It is not feasible to obtain depth information in all situations, thus limiting the
applications of these methods. Change in facial features is not linear with respect to change in
angles, and Euler angles, which are used to represent pose angles (yaw, pitch, roll), need additional
information (rotation order of axes), making it ambiguous. H.-W. Hsu et al. [9] propose several
methods to overcome these shortcomings. A multi-regression loss function, an L2 regression loss
combined with an ordinal regression loss, is proposed, which can improve recognition using just RGB
information. The ordinal regression loss helps solve the problem of non-linearity of facial features
and pose angles by learning features that can rank the different intervals of angles. Also, angles are
represented using the Quaternion number system instead of Euler angles. In this system, angles are
represented using four components: qx, qy, qz, and qw. This representation can avoid the Gimbal lock
problem as it is a 4D representation of a 3D rotation. It can also be interpolated to generate
smoother renderings of rotations. The methods proposed in this paper make head pose estimation
much more feasible by reducing the amount of information required. J. Gupta [10] covers hand
gesture recognition using various algorithms like deep learning, CNN, Morphological operation, and
emoji prediction. A database was created for this model using various hand gestures to be
recognized and further used to train the model and predict the emoji. This paper used a base of 11
gestures and filters to detect gestures and then used CNN to
categorize them. It includes morphological operations, contour extraction, and CNN. The database
consists of 1200 images corresponding to the 11 emojis. It uses a camera to monitor user-created
real-time gestures. Future work is to add more emojis to be predicted in real-time gesture
recognition. T. Mo [11] performs a review of gesture recognition and some of its key issues.
Commonly used features for gestures include shape information (position, outline, etc.), geometric
properties (length, distance, etc.), and binary images. For human skin recognition, commonly used
color spaces are RGB (Red, Green, and Blue), HSV (Hue, Saturation, and Value), and YCbCr. Gesture
recognition involves two steps: classification and recognition. Commonly used methods for this
purpose are the hidden Markov Model, Neural Networks, and Template Matching Method. But these
methods are time- and resource-heavy. This paper proposes SVM, which is simpler and has better
generalization capability. The YCbCr color space is used for human skin recognition. After performing
dimensionality reduction, SVM and PCA are used for gesture recognition. In H. Muthu Mariappan et
al. [12], real-time recognition is used to identify Indian Sign Language. It uses 1300 samples of images
to train the model and further predict the Indian Sign Language. The FCM (Fuzzy c-means)-based
real-time sign language recognition system can recognize 40 words of ISL at a time. This method is
used to enhance casual communication among people with hearing disabilities. The future work of
this paper is to add more words to the system for recognition. In Rastogi et al. [13], the key idea is to
enable communication between blind, deaf, and mute individuals using a sensor-enabled glove that
translates hand gestures into tactile vibrations. This allows effective two-way communication of
letters, words, and simple sentences between blind and deaf-mute pairs using coded vibro-tactile
stimuli. Nagpal [14] proposes a portable arm-mounted device with a camera, screen, vibrating pads,
and gesture recognition software. The camera captures gestures made by a mute user. The software
recognizes and converts the gestures into text displayed on the screen. This allows real-time, two-
way communication using a combination of gesture recognition, on-screen text, and vibrotactile
feedback. Ahire et al. [15] propose a device that enables two-way communication between deaf-
mute and hearing-speaking users. The device has a display, camera, microphone, and vibrational
pads on one side for the deaf-mute user. The other side has a screen, speaker, microphone, and
camera for the hearing user. When the hearing user speaks into the mic, speech recognition software
converts it to text displayed on the deaf-mute user’s screen. This concept could help bridge the
communication divide between deaf-mute and hearing-speaking communities when implemented
into a working prototype. Sharma et al. [16] use an ensemble model called Trbaggboost, which uses
a small amount of labeled data to label unlabeled data from a new subject. It uses three sources of
data: tri-axis gyroscopes, tri-axis accelerometers, and multichannel surface electromyograms. After
reviewing the related research, methods used in different stages of gesture classification were
identified. After collecting the dataset, images are first preprocessed. Preprocessing involves multiple
processes, including resizing, cropping, and converting images to a color format applicable to the
problem statement. Some color formats commonly used are: Grayscale [2,6–8,16], RGB [9], HSV [12],
and YCrCb [11,17]. A combination of one or more of these approaches can also be used to improve
performance. Gupta et al. [10] convert the original RGB images first to HSV, then to grayscale. Morph and contour operations are applied after further processing to obtain a mask of the hand gesture. After that, image augmentation is applied to increase the variation in training data and reduce
overfitting. Some common filters applied for augmenting images are shearing, zooming, horizontal
flips, and vertical and horizontal shifts. Dimensionality reduction techniques like Principal
Component Analysis (PCA) [11] and Histogram of Oriented Gradient (HOG) [5] can also be used to
further reduce the number of features. After data augmentation, the model is then trained. Both
classical machine learning models as well as deep learning models are used for this task. Some
machine learning models used include Fuzzy C-Means (FCM) [12], Support Vector Machine (SVM)
[5,11], etc. For deep learning models, CNN [7,10,17] and Deep CNN
(DCNN) [18] are primarily used as they are best suited for working with images. Other deep learning
models used include Temporal Residual Networks [8]. As training CNN models from scratch is time-
and resource-intensive, transfer learning is widely used for image classification purposes. Some
common pre-trained models used for transfer learning include ResNet34 [19], ResNet50 [16],
ResNet50V2 [16], XCeption [16], InceptionV3 [19], MobileNet [16], and MobileNetV2 [16].

3. Methodology

After reviewing the recently published results and their methods, the various steps that are involved in the complete process of gesture recognition have been determined and are as shown in Figure 1. The detailed information about each step is given in the subsections.

Figure 1. Process Flow Chart: Hand Gesture Recognition in the ISL Translation System.

3.1. Dataset

The dataset
used in this problem is self-made, i.e., three of the four authors performed hand signs for all the
images; it contains a total of 7800 images. Five burst shots for each alphabet were captured from a camera, with each burst comprising 20 images. These images have dimensions of 4000 × 3000 pixels. This dataset is categorized into three sets named test, train, and validate, which contain ~5500, 1180, and 1160 images, respectively, as shown in Figure 2. There are a total of 26 alphabets, and each alphabet has 300 images. The dataset is also fairly balanced, as shown in Figure 3.

Figure 2. Dataset File Structure.

Figure 3. Number of images
in each class.

3.2. Image Processing

For preprocessing, images have been rescaled to 80 by 60 pixels
(to maintain the original 4:3 aspect ratio). The images, originally in RGB format, were converted to
HSV format, as it better highlights hand segments in our dataset. Grayscale images for training were
also used, though they performed worse. Figure 4 shows some sample images after preprocessing in
HSV color space. Figure 5 shows an image preprocessed for both grayscale and HSV color spaces.

Figure 4. Some Preprocessed Images (in HSV colorspace).

Figure 5. Preprocessing is applied to the letter ‘P’.
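A minimal sketch of this preprocessing step, assuming OpenCV is used for the resizing and color conversion; the helper name, path handling, and the [0, 1] pixel scaling are illustrative and not stated in the paper.

import cv2
import numpy as np

def preprocess_image(path, use_hsv=True):
    """Resize a raw capture and convert it to HSV (or grayscale)."""
    bgr = cv2.imread(path)                       # OpenCV loads images in BGR order
    # cv2.resize takes (width, height); (60, 80) yields an array of shape (80, 60, 3),
    # matching the (80, 60, 3) input size reported in the paper.
    bgr = cv2.resize(bgr, (60, 80))
    if use_hsv:
        img = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)                       # 3-channel HSV
    else:
        img = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)[..., np.newaxis]     # 1-channel grayscale
    return img.astype(np.float32) / 255.0        # scale pixel values to [0, 1]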
3.3. Image Augmentation

After preprocessing, several augmentation
techniques were applied to generate more training data. Some augmentation techniques applied
include shearing, zooming, horizontal flips, and vertical and horizontal shifts. The effects of different filters on an image are shown in Figure 6.

Figure 6. Augmentation filters are applied to the letter ‘D’.
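A hedged sketch of such an augmentation pipeline using Keras' ImageDataGenerator; the parameter values, directory layout, batch size, and the HSV hook are illustrative assumptions rather than values reported in the paper.

from tensorflow.keras.preprocessing.image import ImageDataGenerator
import cv2

def rgb_to_hsv(img):
    # Assumed hook to apply the Section 3.2 color conversion on the fly
    # (img arrives as a float array in the 0-255 range after resizing/augmentation).
    return cv2.cvtColor(img.astype("uint8"), cv2.COLOR_RGB2HSV).astype("float32") / 255.0

augmenter = ImageDataGenerator(
    preprocessing_function=rgb_to_hsv,
    shear_range=0.2,                    # shearing
    zoom_range=0.2,                     # zooming
    horizontal_flip=True,               # horizontal flips
    width_shift_range=0.1,              # horizontal shifts
    height_shift_range=0.1,             # vertical shifts
)

train_gen = augmenter.flow_from_directory(
    "dataset/train",                    # hypothetical layout: one sub-folder per letter A-Z
    target_size=(80, 60),               # (height, width) -> arrays of shape (80, 60, 3)
    class_mode="categorical",
    batch_size=32,
)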
3.4. Model Training

This paper has used CNN
[16,19,20] for our task. A Convolutional Neural Network is a feedforward neural network that has
multiple layers. A CNN applies convolutions that can automatically apply preprocessing and extract
features from images. So, it has the added benefit of not having to perform preprocessing and
feature engineering on the image. It has multiple layers of convolutions, and each layer identifies
successively complex features. The architecture of the model is shown in Figure 7.

Figure 7. Model Architecture for Hand Gesture Recognition in ISL Translation.
The different layers used in the network are:
1. Conv2D: It applies a 2D convolution operation to the input data. A filter, which is a matrix whose dimensions are specified by the kernel size, is used to produce an output.
2. MaxPool2D: This is a pooling layer and is usually applied after a convolution layer. It applies a filter and selects a single value from each subregion of the specified dimension from the input data. In this case, the filter applied is Max, which selects the maximum value.
3. Flatten: It converts multi-dimensional data into a 1-D shape, i.e., it flattens the data.
4. Dense: A dense layer is one where each node is connected to every node of the previous layer. In this case, each node is connected to the flattened layer.
5. Dropout: The dropout layer randomly drops out nodes from its input layer at a specified rate. It is used to reduce overfitting.

All images are resized to 80 × 60 and
converted to HSV color space. So, the input layer has dimensions (80, 60, 3). Models trained using
HSV images provided better accuracy than grayscale images, as can be seen in Table 1. Next, three 2D convolution layers with ReLU as the activation function and a 2D max-pooling layer are used. Finally,
a flattening layer flattens its input matrix and passes the 1D vector to a Dense layer. The output layer
is a Dense layer with a SoftMax activation function and 26 nodes, corresponding to the 26 alphabets.
Because our task is a multiclass classification problem, SoftMax is used, as it can convert model
outputs to a probability distribution that can be easily used to select the predicted classes.
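A minimal Keras sketch consistent with the layer types, input size, and 26-class SoftMax output described above; the filter counts, kernel sizes, dense width, dropout rate, and loss function are illustrative assumptions, since only the RMSprop optimizer and the (80, 60, 3) input appear in Table 1.

from tensorflow.keras import layers, models

def build_isl_model(input_shape=(80, 60, 3), num_classes=26):
    """CNN with Conv2D + MaxPool2D blocks, Flatten, Dense, Dropout, and a SoftMax output."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPool2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPool2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPool2D((2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),                               # reduces overfitting
        layers.Dense(num_classes, activation="softmax"),   # 26 letters, A-Z
    ])
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",         # assumed multiclass loss
                  metrics=["accuracy"])
    return model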
Table 1. Comparison of Hand Gesture Recognition Models for ISL Translation.

System Configuration | Optimizer | Input Size | Training Time | Testing Loss | Testing Accuracy
Google Colab (GPU: Tesla K80, CPU: Intel(R) Xeon) | rmsprop | (100, 75, 1) Grayscale | 3 h 52 min | 0.0687 | 0.9713
Google Colab (GPU: Tesla K80, CPU: Intel(R) Xeon) | rmsprop | (80, 60, 3) HSV | 3 h 26 min | 0.0178 | 0.999
Local Machine (GPU: GTX 1650S, CPU: Intel(R) Core i5 10400F) | rmsprop | (100, 75, 1) Grayscale | 2 h 24 min | 0.1874 | 0.9383
Local Machine (GPU: GTX 1650S, CPU: Intel(R) Core i5 10400F) | adam | (200, 150, 1) Grayscale | 2 h 17 min | 0.166 | 0.9274
Local Machine (GPU: RTX 3050, CPU: Ryzen 9 5900HS) | rmsprop | (100, 75, 1) Grayscale | 2 h 11 min | 0.025 | 0.9899

The model has been trained for 25 epochs. The batch size was not explicitly stated. Training was limited to 100 steps per epoch, and validation was limited to 50 steps per epoch.
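A hypothetical training call matching this schedule; it reuses the build_isl_model() and train_gen sketches from the previous subsections and assumes a val_gen generator built the same way from the validation split.

# Illustrative training call: 25 epochs, 100 training steps and 50 validation steps per epoch.
model = build_isl_model(input_shape=(80, 60, 3), num_classes=26)
history = model.fit(
    train_gen,
    steps_per_epoch=100,
    epochs=25,
    validation_data=val_gen,   # assumed generator over the "validate" split
    validation_steps=50,
)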
3.5. Evaluating Performance

Hand gesture classification is a multi-class classification task. In our implementation,
there are 26 possible outputs, all the letters of the alphabet from A to Z. Because this is a multiclass
classification problem, SoftMax has been used as the activation function for the output layer. The F1
Score, Accuracy, and Confusion Matrix have been calculated to evaluate the model. The graph in
Figure 8 shows the evolution of loss and accuracy over the number of epochs.
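One possible way to compute these metrics with scikit-learn, assuming a test generator created with shuffle=False so the label order matches the predictions; the names carry over from the earlier illustrative sketches and are not from the paper.

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_prob = model.predict(test_gen)        # SoftMax probabilities, shape (N, 26)
y_pred = np.argmax(y_prob, axis=1)      # predicted letter indices
y_true = test_gen.classes               # ground-truth indices (requires shuffle=False)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Macro F1 :", f1_score(y_true, y_pred, average="macro"))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))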
Figure 8. Evolution of loss and accuracy.

4. Results and Discussions

The model was
run with different preprocessing steps and different machines, and the results are shown in Table 1. It
is observed that the model with HSV color space images as input shows the best results. Multiple
configurations of hyperparameters have been tested in the models to find the one that provides
optimal accuracy and loss. Doing this not only proves to be useful for further research but also
completes one of the objectives stated in the introduction. Table 1 shows the performance of
different models/preprocessing steps. Table 2 shows the performance comparison of this work and
existing works.

Table 2. Comparison between this paper's results and previously published results.

Paper | Task | Model | Accuracy
Muthu Mariappan et al. [12] | 40 ISL words and sentences in real time | Fuzzy c-means | 75%
Rosalina et al. [2] | 39 ASL signs (26 alphabet letters, 10 digits, and 3 punctuation marks) | ANN | 90%
Oyebade K. Oyedotun et al. [6] | 24 ASL hand gestures | CNN | 92.83%
Hsien-I Lin et al. [17] | 7 hand gestures | CNN | 99%
Gupta et al. [10] | 11 hand gestures | CNN | 99.6%
Proposed model | 26 ISL hand signs | CNN | 99%
From Table 1, we can conclude that using RMSprop as
the optimizer and images of size 80 × 60 in the HSV color space provided the best performance.
From Table 2, it can be deduced that the proposed model maintains comparable accuracy while
including more classes than any of the previous research.

5. Conclusions and Future Work

This paper has provided a better understanding of Indian Sign Language and of the applications of machine learning beyond the hype, backed by actual results. This work should be useful for the deaf community as well, as this research hopefully strives to become a helping hand for the community. The current implementation slightly overfits the dataset, with an accuracy of 99% and a loss of 0.0178, and requires further work on the dataset, as discussed below. This overfitting further results in poor performance on new images. The performance of the model can be
improved by taking into account hand pose/skeletons for the prediction. For future work, we aim to
improve this and develop a much better implementation with aspects such as better vocabulary, a
voice interface within a working client, and adding non-static signs (for instance, the letters y and j
are non-static in ISL). The dataset needs to also come from more individuals with different ethnicities
and backgrounds, skin tones, and varied shapes of hands and faces with their own variations of the
language; furthermore, the lighting and background need to vary to
avoid bias. The data were not gathered from real practitioners of the ISL and may not be completely
reflective of how a real practitioner uses the language. In addition, the model only works on static
gestures; dynamic gestures need to be added in further work. There are no specific hardware
requirements, and it performs reasonably well when deployed on smartphones with mediocre
hardware.

Author Contributions: Conceptualization, H.K.V. and R.A.; methodology, T.T.; software, R.A.; validation, M.A., R.A., and H.K.V.; formal analysis, M.A. and A.; investigation, R.A.; resources, T.T.; data curation, R.A.; writing—original draft preparation, H.K.V.; writing—review and editing, M.A.; visualization, T.T.; supervision, M.A.; project administration, M.A. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest: The authors declare no conflict of interest.

References
1. Rajput, L.; Gupta, S. Sentiment Analysis Using Latent Dirichlet Allocation for Aspect Term Extraction. J. Comput. Mech. Manag. 2023, 2, 8–13. [CrossRef]
2. Rosalina; Yusnita, L.; Hadisukmana, N.; Wahyu, R.B.; Roestam, R.; Wahyu, Y. Implementation of Real-Time Static Hand Gesture Recognition Using Artificial Neural Network. In Proceedings of the 2017 4th International Conference on Computer Applications and Information Processing Technology (CAIPT), Bali, Indonesia, 8–10 August 2017; pp. 1–6.
3. Hangün, B.; Eyecioğlu, Ö. Performance Comparison Between OpenCV Built in CPU and GPU Functions on Image Processing Operations. Int. J. Eng. Sci. Appl. 2017, 1, 34–41.
4. Li, S.; Deng, W. Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2020, 13, 1195–1215. [CrossRef]
5. Prasanna, D.M.; Reddy, C.G. Development of Real Time Face Recognition System Using OpenCV. Development 2017, 4, 791.
6. Oyedotun, O.K.; Khashman, A. Deep Learning in Vision-Based Static Hand Gesture Recognition. Neural Comput. Appl. 2017, 28, 3941–3951. [CrossRef]
7. Zhan, F. Hand Gesture Recognition with Convolution Neural Networks. In Proceedings of the 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), Los Angeles, CA, USA, 30 July–1 August 2019; pp. 295–298.
8. Pigou, L.; Van Herreweghe, M.; Dambre, J. Gesture and Sign Language Recognition with Temporal Residual Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 3086–3093.
9. Hsu, H.-W.; Wu, T.-Y.; Wan, S.; Wong, W.H.; Lee, C.-Y. QuatNet: Quaternion-Based Head Pose Estimation With Multiregression Loss. IEEE Trans. Multimed. 2019, 21, 1035–1046. [CrossRef]
10. Gupta, J. Hand Gesture Recognition for Emoji Prediction. Int. J. Res. Appl. Sci. Eng. Technol. 2020, 8, 1310–1317. [CrossRef]
11. Mo, T.; Sun, P. Research on Key Issues of Gesture Recognition for Artificial Intelligence. Soft Comput. 2020, 24, 5795–5803. [CrossRef]
12. Muthu Mariappan, H.; Gomathi, V. Real-Time Recognition of Indian Sign Language. In Proceedings of the 2019 International Conference on Computational Intelligence in Data Science (ICCIDS), Chennai, India, 21–23 February 2019; pp. 1–6.
13. Rastogi, R.; Mittal, S.; Agarwal, S. A Novel Approach for Communication among Blind, Deaf and Dumb People. In Proceedings of the 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 11–13 March 2015.
14. Nagpal, N. Design Issue and Proposed Implementation of Communication Aid for Deaf & Dumb People. Int. J. Recent Innov. Trends Comput. Commun. 2015, 3, 147–149.
15. Ahire, P.G.; Tilekar, K.B.; Jawake, T.A.; Warale, P.B. Two Way Communicator between Deaf and Dumb People and Normal People. In Proceedings of the 2015 International Conference on Computing Communication Control and Automation, Pune, India, 26–27 February 2015; pp. 641–644.
16. Sharma, S.; Gupta, R.; Kumar, A. Trbaggboost: An Ensemble-Based Transfer Learning Method Applied to Indian Sign Language Recognition. J. Ambient Intell. Humaniz. Comput. 2022, 13, 3527–3537. [CrossRef]
17. Kishore, C.R.; Pemula, R.; Vijaya Kumar, S.; Rao, K.P.; Chandra Sekhar, S. Deep Learning Models for Identification of COVID-19 Using CT Images. In Proceedings of the Soft Computing: Theories and Applications; Kumar, R., Ahn, C.W., Sharma, T.K., Verma, O.P., Agarwal, A., Eds.; Springer Nature: Singapore, 2022; pp. 577–588.
18. Lin, H.-I.; Hsu, M.-H.; Chen, W.-K. Human Hand Gesture Recognition Using a Convolution Neural Network. In Proceedings of the 2014 IEEE International Conference on Automation Science and Engineering (CASE), New Taipei, Taiwan, 18–22 August 2014; Volume 2014, pp. 1038–1043.
19. Sharma, C.M.; Tomar, K.; Mishra, R.K.; Chariar, V.M. Indian Sign Language Recognition Using Fine-Tuned Deep Transfer Learning Model. In Proceedings of the Innovations in Computer and Information Science (ICICIS), Ganzhou, China, 27–29 August 2021; pp. 62–67.
20. Arora, M.; Dhawan, S.; Singh, K. Exploring Deep Convolution Neural Networks with Transfer Learning for Transformation Zone Type Prediction in Cervical Cancer. In Proceedings of the Soft Computing: Theories and Applications; Pant, M., Sharma, T.K., Verma, O.P., Singla, R., Sikander, A., Eds.; Springer: Singapore, 2020; pp. 1127–1138.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are
solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s).
MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from
any ideas, methods, instructions or products referred to in the content.
ONLINE HAND GESTURE RECOGNITION AND CLASSIFICATION FOR DEAF AND DUMB

A Project Report Phase – II
Submitted in partial fulfillment of the requirements for the award of Bachelor of Engineering degree in Electronics and Communication Engineering

By
A KISHORE (39130227)
KAVETI ASHOK KUMAR YADAV (39130216)

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
SCHOOL OF ELECTRICAL AND ELECTRONICS
SATHYABAMA INSTITUTE OF SCIENCE AND TECHNOLOGY (DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI - 600 119
APRIL - 2023

BONAFIDE CERTIFICATE

SATHYABAMA INSTITUTE OF SCIENCE AND TECHNOLOGY (DEEMED TO BE UNIVERSITY)
Accredited with “A” grade by NAAC
Jeppiaar Nagar, Rajiv Gandhi Salai, Chennai – 600 119
www.sathyabama.ac.in
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

This is to certify that this Project Report is the bonafide work of A KISHORE (39130227) and KAVETI ASHOK KUMAR YADAV (39130216) who carried out the project entitled “ONLINE HAND GESTURE RECOGNITION AND CLASSIFICATION FOR DEAF AND DUMB” under our supervision from November 2022 to April 2023.

Internal Guide: Dr. K. V. KARTHIKEYAN, M.E., Ph.D., Professor
Head of the Department: Dr. T. RAVI, M.E., Ph.D., Professor

Submitted for Viva voce Examination held on

Internal Examiner                External Examiner

DECLARATION
We, A. KISHORE (Reg.no.39130227) and KAVETI ASHOK KUMAR YADAV (Reg.no.39130216) hereby
declare that the Project Report entitled “ONLINE HAND GESTURE RECOGNITION AND
CLASSIFICATION FOR DEAF AND DUMB PEOPLE” done by us under the guidance of Dr. K. V. KARTHIKEYAN, M.E., Ph.D., Professor, Department of Electronics and Communication Engineering at
SATHYABAMA INSTITUTE OF SCIENCE AND TECHNOLOGY, CHENNAI is submitted in partial fulfillment
of the requirements for the award of Bachelor of Engineering degree in Electronics and
Communication Engineering.

DATE:
PLACE:

SIGNATURE OF THE CANDIDATE(S)
1.
2.

ACKNOWLEDGEMENT

We are pleased to acknowledge our sincere thanks to the Board of
Management of SATHYABAMA for their kind encouragement in doing this project and for completing
it successfully. We are grateful to them. We convey our thanks to Dr. N.M. NANDHITHA, M.E., Ph.D.,
Professor & Dean, School of Electrical and Electronics and Dr. T. RAVI, M.E.,Ph.D., Professor & Head of
Department of Electronics and Communication Engineering for providing necessary support and
details at the right time during the progressive reviews. We would like to express our sincere and
deep sense of gratitude to our Project Guide Dr. K. V. KARTHIKEYAN, M.E., Ph.D., Professor,
Department of Electronics and Communication Engineering for his valuable guidance, suggestions
and constant encouragement, which paved the way for the successful completion of our project work. We
wish to express our thanks to all teaching and non-teaching staff members of the Department of
Electronics and Communication Engineering who were helpful in many ways for the completion of
the project.

ABSTRACT

Sign Language Recognition is a breakthrough for helping deaf-mute people
and has been researched for many years. Unfortunately, every study has its own limitations and is still unable to be used commercially. Some of the research is known to be successful in recognizing sign language, but requires a high cost to be commercialized. Nowadays, researchers have paid more attention to developing Sign Language Recognition that can be used commercially. Researchers do their research in various ways, starting with the data acquisition methods. The data acquisition method varies because of the cost needed for a good device, but a cheap method is
needed for the Sign Language Recognition System to be commercialized. The methods used in
developing Sign Language Recognition are also varied among researchers. Each method has its own
strength compared to other methods and researchers are still using different methods in developing
their own Sign Language Recognition. Each method also has its own limitations compared to other
methods. The aim of this paper is to review the sign language recognition approaches and find the
best method that has been used by researchers. Hence other researchers can get more information
about the methods used and could develop better Sign Language Application Systems in the future.
TABLE OF CONTENT

ABSTRACT
LIST OF FIGURES
1 INTRODUCTION
2 LITERATURE SURVEY
  2.1 Inferences from Literature Survey
  2.2 Open problems in Existing System
3 REQUIREMENTS ANALYSIS
  3.1 Feasibility Studies/Risk Analysis of the Project
  3.2 Software Requirements Specification Document
4 DESCRIPTION OF THE PROPOSED SYSTEM
  4.1 Selected Methodology or process model
  4.2 Architecture / Overall Design of Proposed System
  4.3 Description of Software for Implementation and Testing plan of the Proposed Model/System
  4.4 Project Management Plan
5 IMPLEMENTATION DETAILS
  5.1 Algorithms
  5.2 Testing
6 RESULTS
7 SUMMARY AND CONCLUSION
  7.1 Conclusion
  7.2 Future Enhancements
  7.3 Implementation
REFERENCES

LIST OF FIGURES

1.1 Hand shape and Hand orientation
4.1 Block diagram for proposed system
4.2 Hand Landmark
4.3 Alphabets hand gestures
4.4 Sign language hand gestures
4.5 System Architecture for sign language
6.1 Sequence Diagram
6.2 Data Flow Diagram

CHAPTER 1
INTRODUCTION
Typically, sign or signed languages are employed to communicate meaning visually. These dialects are
conveyed using non-manual features in addition to physical articulations. These are completely
natural languages with their own lexicon and vocabulary. Smart methods for hand motion detection
from photographs in hand signals are crucial in this situation. Many people throughout the world
who have speech disorders and loss of hearing can benefit from these clever techniques. For
accurately recognizing symbols and signs, various approaches have indeed been documented in the
literature. Innovations like movement and face recognition have gained significant traction in the
basic sign language field in recent years. Different movements known as gestures are utilized during
communication. Hand or body movements are used. Gestures used in sign language typically involve
visually communicated patterns. There are roughly 4,94,93,500 people with hearing
impairments worldwide. Some of the existing methods for translating sign language take into
account hand position, hand size, and finger movements. Each sign in sign language has a particular
meaning ascribed to it so that people can easily comprehend and interpret it. The people are
creating distinct and unique sign languages depending on their native tongues and geographic
locations. As with spoken language, there are several variations in sign language. Around the world, more
than 300 different sign languages are in use. They differ from one country to the next. Sign language
can have many diverse regional dialects that cause subtle variances in how people use and
comprehend signs, even in nations where the same language is spoken. There are significant
variances between some of the most popular sign languages, even though there are some
commonalities. Additionally, not just the signs differ.

General Background

The use of sign
language has not only been restricted to people with impaired hearing or speech to express
themselves with each other or non-sign-language speakers. It has often been considered as a
prominent medium of communication. Instead of acoustically conveyed sound patterns, sign
language uses manual communication to convey meaning. It combines hand gestures, and facial
expressions along with movements of other body parts such as eyes, legs, etc. This paper proposes a
design for recognizing signs used in ASL and interpreting them. Some of the problems faced by non-
speech and hard-of-hearing individuals in communication with other people were interaction-based,
disparity, education, behavioural patterns, mental health, and most importantly safety concerns. The
ways in which one can interact with the computer are either by using devices like a keyboard, mouse,
or via audio signals, while the former always needs physical contact and the latter is prone to noise
and disturbances. Physical action carried by the hand, eye or any part of the body can be considered
a gesture. Hand gestures are the most convenient and interpretable way for non-speaking humans to
interact. Nonverbal communication is important in our life as it conveys about 65% of messages in
comparison to verbal communication that contributes no more than 35% of our interactions.
Gestures can be categorized into hand and arm gestures (recognition of hand poses, sign languages,
and entertainment applications), head and face gestures (such as nodding or shaking of the head,
the direction of eye gaze, opening the mouth to speak, winking, and so on), and body gestures (involvement of full-body motion). There are numerous varieties of sign language used not only in the
UK but all over the world since the speaker's nonverbal cues, actions, and visual emotions may all
have a huge impact on how sign language is transmitted. Similar to spoken language, various groups,
and cultures create their own means of communication specific to the environment in which they exist.

DEEP LEARNING

Deep-learning networks are distinguished from the more commonplace
single-hidden-layer neural networks by their depth; that is, the number of node layers through which
data must pass in a multistep process of pattern recognition. Earlier versions of neural networks such
as the first perceptron were shallow, composed of one input and one output layer, and at most one
hidden layer in between. More than three layers (including input and output) qualify as “deep”
learning. So deep is not just a buzzword to make algorithms seem like they read Sartre and listen to
bands you haven’t heard of yet. It is a strictly defined term that means more than one hidden layer.
In deep-learning networks, each layer of nodes trains on a distinct set of features based on the
previous layer’s output. The further you advance into the neural net, the more complex the features
your nodes can recognize since they aggregate and recombine features from the previous layer. This
is known as feature hierarchy, and it is a hierarchy of increasing complexity and abstraction. It makes
deep-learning networks capable of handling very large, high-dimensional data sets with billions of
parameters that pass through nonlinear functions. Above all, these neural nets are capable of
discovering latent structures within unlabeled, unstructured data, which is the vast majority of data
in the world. Another word for unstructured data is raw media; i.e., pictures, texts, video, and audio
recordings. Therefore, one of the problems deep learning solves best is in processing and clustering
the world’s raw, unlabeled media, discerning similarities and anomalies in data that no human has
organized in a relational database or ever put a name to. For example, deep learning can take a
million images, and cluster them according to their similarities: cats in one corner, ice breakers in
another, and in a third all the photos of your grandmother. This is the basis of so-called smart photo
albums. Deep-learning networks perform automatic feature extraction without human intervention,
unlike most traditional machine-learning algorithms. Given that feature extraction is a task that can
take teams of data scientists years to accomplish, deep learning is a way to circumvent the
chokepoint of limited experts. It augments the powers of small data science teams, which by their
nature do not scale. When training on unlabeled data, each node layer in a deep network learns
features automatically by repeatedly trying to reconstruct the input from which it draws its samples,
attempting to minimize the difference between the network’s guesses and the probability
distribution of the input data itself. Restricted Boltzmann machines, for example, create so-called
reconstructions in this manner. In the process, these neural networks learn to recognize correlations
between certain relevant features and optimal results – they draw connections between feature
signals and what those features represent, whether it be a full reconstruction, or with labeled data. A
deep-learning network trained on labeled data can then be applied to unstructured data, giving it
access to machine-learning.

NEURAL NETWORKS

A neural network is a series of algorithms that
endeavors to recognize underlying relationships in a set of data through a process that mimics the
way the human brain operates. In this sense, neural networks refer to systems of neurons, either
organic or artificial in nature. Neural networks can adapt to changing input; so the network generates
the best possible result without needing to redesign the output criteria. The concept of neural
networks, which has its roots in artificial intelligence, is swiftly gaining popularity in the development
of trading systems. A neural network works similarly to the human brain’s neural network. A
“neuron” in a neural network is a mathematical function that collects and classifies information
according to a specific architecture. The network bears a strong resemblance to statistical methods
such as curve fitting and regression analysis. A neural network contains layers of interconnected
nodes. Each node is a perceptron and is similar to multiple linear regression. The perceptron feeds
the signal produced by a multiple linear regression into an activation function that may be nonlinear.
In a multi-layered perceptron (MLP), perceptrons are arranged in interconnected layers. The input
layer collects input patterns. The output layer has classifications or output signals to which input
patterns may map. Hidden layers fine-tune the input weightings until the neural network’s margin of
error is minimal. It is hypothesized that hidden layers extrapolate salient features in the input data
that have predictive power regarding the outputs.
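As a small illustration of the forward pass described above (each node computes a weighted sum of its inputs plus a bias and passes it through an activation function), the following sketch uses arbitrary weights and sizes; it is illustrative only and not taken from the report.

import numpy as np

def perceptron(x, w, b, activation=np.tanh):
    """One node: weighted sum of its inputs plus a bias, passed through an activation function."""
    return activation(np.dot(w, x) + b)

# Tiny multi-layer example: 3 inputs -> 4 hidden nodes -> 2 output signals.
rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])                                    # input pattern
hidden = perceptron(x, rng.standard_normal((4, 3)), np.zeros(4))  # hidden layer output
output = perceptron(hidden, rng.standard_normal((2, 4)), np.zeros(2), activation=lambda z: z)
print(output)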
CNN

A Convolutional Neural Network is an algorithm of Deep Learning that is used for Image Recognition and in Natural Language Processing.
Convolutional Neural Network (CNN) takes an image to identify its features and predict it. It captures
the spatial and temporal dependencies of the image. Each CNN layer learns filters of increasing
complexity. The first layers learn basic feature detection filters like edges, corners. The middle layer
learns filters that detect parts of objects. The last layers have higher representations they learn to
recognize full objects in different shapes and positions. Suppose, when you see some image of a Dog,
your brain focuses on certain features of the dog to identify. These features may be the dog’s ears,
eyes, or it may be anything else. Based on these features your brain gives you a signal that this is a
dog. Similarly, Convolutional Neural Network processes the image and identifies it based on certain
features. Convolutional Neural Network is gaining so much popularity over artificial neural networks.
Steps in a Convolutional Neural Network:
1. Convolution Operation
2. ReLU Layer
3. Pooling
4. Flattening
5. Full Connection

Generally, a convolutional neural network architecture has
several components:
● A convolution layer - you can think of this layer as “what relevant features are we picking up in an image?” In a convolutional neural network, we have multiple convolutional layers that extract low to high-level features depending on what specific layer we are focusing on. To give an (over-simplified) intuition, earlier convolutional layers pick up lower-level features (i.e., lines and edges) while later convolutional layers pick up higher-level features based on inputs from lower-level features (i.e., shapes, structures) - analogous to how vision works in the human brain.
● A pooling layer - convolutional neural networks are typically used for image classification. However, images are high-dimensional data, so we would prefer to reduce the dimensionality to minimize the possibility of overfitting. Pooling essentially reduces the spatial dimensions of the image based on certain mathematical operations such as average or max-pooling. We generally incorporate pooling since it (1) generally acts as a noise suppressant, (2) makes it invariant to translation movement for image classification, and (3) helps capture essential structural features of the represented images without being bogged down by the fine details.
● Fully-connected layer - you can think of a series of convolution and pooling operations as dimensionality reduction steps prior to passing this information over to the fully connected (Dense) layer. Essentially, what the fully connected layer does is take the "compressed" representation of the image and try to fit a basic NN (multi-layer perceptron) when doing classification.

The change in paradigm that the usage
of NNs encompasses is a very important one. Prior to the usage of NNs in image classification, the
practitioner had to use explicitly coded algorithms for detecting features. This was a job that often
involved a lot of work. Machine Learning was not used throughout the whole system, only the last
layer of the system, the classifier, had the capacity to learn and adapt itself to the specific problem.
NNs, and especially CNNs, dramatically change that. CNNs extract features from images and learn
how to do it. The practitioner doesn’t need to craft complicated hard-coded algorithms to extract
those features. Since the introduction of the first CNNs in 2012, all the winning teams for the classification task have used different types of CNNs. CNN systems now clearly outperform other
image classification systems such as the ones described. State-of-the-art CNNs have even crossed the
boundary of what is considered to be the error rate for human beings in image classification tasks.
Ever since their introduction in 2012, CNNs have been making steady progress in decreasing the error
rate in both tasks. This is due to the introduction of new strategies that further improve the
convergence and training speed of CNNs. CNNs are a particular kind of NNs that give their name to
the fact that they make extensive use of Convolutional Layers. Convolutional Layers consist of a
learnable filter of fixed size (kernel) to be applied to images. The name comes from the fact that at
each forward pass, the learned filter is convolved with the image, resulting in a 2-dimensional map.
Another type of layer widely used in CNNs are the pooling layers. These layers perform a type of
down sampling on the input signal. There are various types of Pooling layers, max-pooling being the
most common. ReLUs or Rectified Linear Units, are also widely used in CNNs. ReLUs are units that
apply a function similar to Equation 2.1. There are often alternatives for these types of units,
Equations 2.2 and 2.3 show two of them.

ReLU: f(x) = max(0, x)            (2.1)
Tanh: f(x) = tanh(x)              (2.2)
Sigmoid: f(x) = (1 + e^(-x))^(-1) (2.3)

Fully Connected Layers are used too, as in more standard NNs. These layers
consist of an array of neurons, each of which is connected to every single output of the previous
layer. Loss Layers are typically used at the end of the CNN to figure out the penalty for a given output
and to provide feedback used by the CNN to learn. In addition to these more standard set of layers,
modern CNNs make use of a very extensive set of techniques and strategies that greatly enhance
their performance. Dropout is one such technique. It consists of temporarily deactivating some
neurons within the net, preventing it from over-fitting. This temporary deactivation of neurons is
typically applied only to Fully-connected layers. CNNs nowadays also make extensive use of GPU processing power that allows for faster training, an idea pioneered in 2012 that is now a standard. Evidence shows that one way of improving a CNN's performance is by stacking more layers and making the network effectively deeper. The problem that often arises is that as more layers are
stacked, the performance saturates and eventually starts to decrease at some point. To address this
issue, the idea of deep residual networks has been proposed. Residual Networks try, with minor
changes to the architecture, and with virtually no impact on performance, to approximate residual
functions instead of standard functions. If we have a CNN with input x and we want the output to
approximate a function O(x), we can instead approximate F(x) = O(x) − x and then compute O(x) from F(x), since O(x) = F(x) + x. Evidence shows that CNNs train better and converge faster when the
approximated function is in fact F(x), a residual function. This approach opens the door to the usage of even deeper and more powerful models.
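A minimal sketch of such a residual block, where the convolutional path approximates F(x) and the skip connection restores O(x) = F(x) + x; the layer sizes and filter counts are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """The convolutional path learns F(x); the skip connection adds x back, giving O(x) = F(x) + x."""
    shortcut = x
    f = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, (3, 3), padding="same")(f)        # F(x)
    out = layers.Add()([f, shortcut])                             # O(x) = F(x) + x
    return layers.Activation("relu")(out)

# Example: apply one residual block to a feature map with 64 channels.
inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)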
Some preprocessing techniques are also used to improve the data fed to the CNN. These techniques include normalization and zero-centering of the data. These
techniques are mainly used in order to overcome the vanishing and exploding gradient problem, that
often affects NN systems’ training. Both problems arise when models are trained using gradient-based methods. Gradients are often computed using the chain rule. In the case of near-zero gradients, this leads
to several very small numbers being multiplied together in order to calculate the changes in the last
layers. This causes the last layers to suffer virtually no change at all in its outputs as the training
progresses. The opposite happens when activation functions have high derivative values, leading to the uncontrolled increase of the gradient of the last layers. As NN systems tend to
require a lot of examples, better results are obtained when data augmentation techniques are used.
CNN frameworks often incorporate mechanisms to augment data. Cropping is a very simple
technique where several crops of a single image are fed to the NN, augmenting the dataset by a
factor equal to the number of crops. Rotations [KSH12] of the image are also used, augmenting the
dataset by a factor equal to the number of rotations done.

Objective

Like all other time-variable signs and signals, it is not easy to compare gestures directly in the Euclidean space. This is because of the time dependencies and a large number of irrelevant areas in the frames. It is very difficult to find truly representative hand-engineered features for hand gestures. To work well with conventional
classifiers, the required features should involve robust descriptors for hand shape, position,
orientation, and temporal dependence between consecutive frames. These features should also be
robust against different circumstances such as occlusion and background clutter. For these reasons,
deep learning is employed in this article, since it seems to be a promising solution. CNNs have been successfully utilized for automatic feature learning and have achieved excellent results in image
classification, object recognition, and even in human activity recognition. The main contributions of this article are as follows: 1. Presenting a CNN model for learning spatio-temporal features for hand gestures. 2. Adapting a pretrained deep model to work well on hand gesture recognition.
Sign languages vary by region even where the spoken language is shared: for instance, native English speakers can be found in both Britain and the United States, yet there are considerable differences between American and British sign languages. This is where many companies and institutions still have trouble reaching out to Deaf and hard-of-hearing people. A few different types of sign language are described below.
Indian Sign Language
Indian Sign Language uses both the right and left hands to depict a variety of hand gestures. It serves both as a communication tool for the deaf and as a representation of their pride and identity. The Rights of Persons with Disabilities (RPwD) Act 2016, which took effect on 19 April 2017, acknowledges sign language as a form of interaction that is particularly helpful when speaking with people who have hearing loss. Authorities are also required under the Act to encourage the use of sign language so that people with hearing loss can take part in society and benefit from it. The ISL vocabulary has gradually grown from 4,000 words to 10,000 words, with the goal of being used more widely.
American Sign Language (ASL)
In ASL, on the other hand, one hand is all that is required to sign; as a result, implementing a recognition system is simpler. ASL has its own growth path and is independent of all spoken languages. The main sign language used by Deaf people in the United States and most of Anglophone Canada is American Sign Language (ASL), a natural language. ASL is a comprehensive and well-structured visual language that uses both manual and non-manual features to communicate. Dialects of ASL and creoles based on ASL are used around the world, including much of West Africa and portions of Southeast Asia, in addition to North America. As a lingua franca, ASL is also commonly studied as a second or foreign language. ASL is most closely related to French Sign Language. Although ASL exhibits characteristics untypical of creole languages, such as agglutinative morphology, it has been argued that ASL is a creole language derived from FSL.
French Sign Language (FSL)
The sign language used by deaf people in France and the French-speaking regions of Switzerland is known as French Sign Language (LSF). Ethnologue estimates that there are about 100,000 native signers. Dutch Sign Language (NGT), Flemish Sign Language (VGT), Belgian-French Sign Language (LSFB), Irish Sign Language (ISL), American Sign Language (ASL), Quebec (also known as French Canadian) Sign Language (LSQ), Brazilian Sign Language (LSB, LGB or LSCB), and Russian Sign Language (RSL) are all linked to and partly descended from French Sign Language. The invention of French Sign Language is usually, though incorrectly, credited to Charles Michel de l'Épée (abbé de l'Épée). In fact, he is said to have discovered the language entirely by accident: seeking shelter from the rain, he entered a nearby house where a pair of deaf twin sisters lived and was immediately struck by the depth and sophistication of the language they were using to communicate with one another and with the deaf community of Paris. The abbé began studying what is now known as Old French Sign Language, and eventually he founded a free school for deaf students. At this school he created a methodology for teaching his pupils to read and write, which he dubbed "methodical signs."
British Sign Language (BSL)
The Deaf community in the UK uses the sign
language known as British Sign Language (BSL), which is their first or chosen language. The British
Deaf Association calculates that there are 151,000 BSL users in the UK, of whom 87,000 are Deaf,
based on the percentage of people who reported "using British Sign Language at home" on the 2011
Scottish Census. In comparison, 15,000 persons who resided in England and Wales identified using
BSL as their primary language in the 2011 England and Wales Census. Individuals who are not deaf might also use BSL, for example as hearing relatives of deaf people, as basic sign language interpreters, or as a consequence of other interactions with the British Deaf community.
Elements of Sign Language
SL is a visual-spatial language based on positional and visual cues, including arm and body movements, the position and orientation of the hands, and the shape of the fingers and hands. Together, these elements are employed to communicate the meaning of an idea. There are typically five components to the phonological structure of SL; each gesture is made up of these five components. These five building blocks serve as SL's most valuable components, and automated, intelligent systems can use them to recognize SL.
Figure 1.1: Hand shape and hand orientation
Methods Of Sign Language Detection Scholarly approaches to overcoming challenges brought on by
disabilities are numerous, methodical, and context-specific. SLR systems, which are used to translate
the signals of SL into text or speech to establish communication with individuals who do not
understand these signs, are one of the crucial interventions. One of the most important projects
aimed at collecting data for the motion of human hands is the development of SLR systems based on
the sensory glove. To record hand layouts and identify the related meanings of gestures, three
strategies are used: vision-based, sensor-based, and a mix of the two.
Vision-Based Strategies
Cameras are the main equipment used by vision-based systems to gather the required input data. The primary benefit of employing a camera is that it eliminates the requirement for sensors in sensory gloves and lowers the system's construction expenses. Because a typical laptop web camera produces blurry images, many systems instead employ a higher-end camera, which is still relatively inexpensive. Despite the high-quality cameras found in most smartphones, a number of issues still hinder the creation of real-time recognition applications, including the limited field of view of the capturing device, high computational costs, and the requirement for multiple cameras in order to capture robust results (due to issues of complexity and occlusion).
Sensor-Based Strategies
An alternate method for
gathering information on gestures is to utilize a certain kind of instrumented glove that is equipped
with flexion (or bend) sensors, accelerometers (ACCs), proximity sensors, and abduction sensors. The
bend angles of the fingers, the abduction between the fingers, and the wrist's orientation (roll, pitch,
and yaw) are all measured using these sensors. Depending on the number of sensors incorporated into the glove, different degrees of freedom (DoF) can be achieved, ranging from 5 to 22. Glove-based systems have a significant benefit over vision-based systems in that they can immediately report the necessary and pertinent information (degree of bend, pitch, etc.) to the host computer in the form of voltage values, obviating the need to transform raw data into useful values. Conversely, the computational overhead of vision-based solutions is increased by the requirement to apply particular tracking and feature extraction techniques to raw streaming video.
Hybrid-Based Strategies
The third way integrates camera- and glove-based technologies in a hybrid design to
gather raw gesture data. To improve the accuracy and precision overall, this method employs mutual
mistake minimization. Unfortunately, due to the expense and computational cost of the complete
apparatus, little research has been done in this area. Furthermore, when combined with hybrid tracking techniques, immersive virtual reality solutions show promise.
CHAPTER 2 LITERATURE SURVEY
Soma Shrenika, Myneni Madhu Bala, "Sign Language Recognition Using Template Matching
Technique”, International Conference on Computer Science, Engineering and Applications (ICCSEA),
2020
Normal people cannot understand the signs used by the deaf because they do not know the meaning of each sign; addressing this is the goal of the system proposed here. This system employs a
camera to record various hand gestures. Following that, the image is processed using various
algorithms. Initially, the image is pre-processed. The edges are then determined using an edge
detection algorithm. Finally, the sign is identified and the text is displayed using a template-matching
algorithm. Because the output is text, the meaning of a specific sign can be easily interpreted. The
system is implemented in Python using OpenCV. H Muthu Mariappan, V Gomathi, “Real-Time
Recognition of Indian Sign Language”, International Conference on Computational Intelligence in
Data Science (ICCIDS), 2019 The skin segmentation feature of OpenCV is used to identify and track
the Regions of Interest (ROI) for recognizing the signs. The fuzzy c-means clustering machine learning
algorithm is used to train and predict hand gestures. Many applications for gesture recognition exist,
including gesture-controlled robots and automated homes, game control, Human-Computer
Interaction (HCI), and sign language interpretation. The proposed system can recognize real-time
signs. As a result, it is extremely beneficial for hearing and speech impaired people to communicate
with normal people. Wanbo Li, Hang Pu, Ruijuan Wang, “Sign Language Recognition Based on
Computer Vision”, IEEE International Conference on Artificial Intelligence and Computer Applications
(ICAICA), 2021
This system makes use of a PyQt-designed GUI interface. Users can choose between sign language recognition and translation, image capture with OpenCV, and special processing with the trained CNN neural network. Through LSTM decisions, the model can then identify American Sign Language. The user can also use the voice button to have the system convert the corresponding gesture image into the same pixels and write it to the video file based on the user's voice. The experimental results show that the rate of sign language recognition is 95.52% compared to similar algorithms [5, 6], and the rate of sign language recognition [7] (American Sign Language and Arabic numerals) is 90.3%.
Pushkar Kurhekar, Janvi Phadtare, Sourav Sinha, Kavita P.
Shirsat, “Real Time Sign Language Estimation System”, 3rd International Conference on Trends in
Electronics and Informatics (ICOEI), 2019 This paper describes a system that allows people with
speech and hearing impairments to communicate freely. The model can extract signs from videos by
processing them frame by frame in a minimally cluttered environment. This sign is then displayed in
text form. For webcam input and displaying the predicted sign, the system employs a Convolutional
Neural Network (CNN) and fastai - a deep learning library - in conjunction with OpenCV. The
experimental results show good sign segmentation in various backgrounds and high accuracy in
gesture recognition.
Aniket Kumar, Mehul Madaan, Shubham Kumar, Aniket Saha, Suman Yadav, "Indian Sign Language Gesture Recognition in Real-Time using Convolutional Neural Networks", 8th International Conference on Signal Processing and Integrated Networks (SPIN), 2021
This paper proposes a model for identifying and classifying Indian Sign Language gestures in real time using Convolutional Neural Networks (CNN). The model, which was built with OpenCV and Keras CNN implementations, aims to classify 36 ISL gestures representing the numbers 0-9 and the alphabets A-Z by transforming them to text equivalents. The dataset that was created and used consists of 300 images for each gesture, which were fed into the CNN model for training and testing. The proposed model was implemented successfully and achieved 99.91% accuracy for the test images.
Sai Nikhilesh Reddy
Karna, Jai Surya Kode, Suneel Nadipalli, Sudha Yadav, "American Sign Language Static Gesture Recognition using Deep Learning and Computer Vision", 2nd International Conference on Smart Electronics and Communication (ICOSEC), 2021
This study proposes a real-time hand-gesture recognition system based on the American Sign Language (ASL) dataset, with data captured using a webcam (BGR frames) and processed using computer vision (OpenCV). A Vision Transformer (ViT) model was trained on the 29 static gestures (the alphabet) from the official, standard ASL dataset. After being trained with 87,000 RGB samples (2,719 batches of 32 images each) for 1 epoch, the model had an accuracy rate of 99.99%.
Shirin Sultana Shanta, Saif Taifur Anwar, Md. Rayhanul Kabir,
“Bangla Sign Language Detection Using SIFT and CNN”, 9th International Conference on Computing,
Communication and Networking Technologies (ICCCNT), 2018
There have been very few studies on detecting Bangla Sign Language. The classifiers used by the majority of researchers in this field were SVM, ANN, or KNN. In this paper, the researchers attempt to develop a Bangla Sign Language system that uses SIFT feature extraction and Convolutional Neural Network (CNN) classification. The researchers also show that using the SIFT feature improves the CNN's detection of Bangla Sign Language.
Rajiv Ranjan, B Shivalal Patro, Mohammad Daim Khan, Manas Chandan Behera, Raushan Kumar,
Utsav Raj, “A Review on Sign Language Recognition Systems”, IEEE 2nd International Conference on
Applied Electromagnetics, Signal Processing, & Communication (AESPC), 2022
For the enhancement of the recognized gesture, OpenCV, Gaussian blur, and contour approximation techniques are used. The TensorFlow framework is best suited for accurate software prediction with a CNN model of VGG-16 architecture and the transfer learning training method. The use of these tools results in a more accurate prediction of sign language gestures in the English alphabet. Given the project's future scope, the technology stack is both robust and scalable. This review paper discusses various methods and structures for sign language recognition.
Kohsheen Tiku, Jayshree Maloo, Aishwarya Ramesh, Indra R, "Real-time Conversion of Sign Language to Text and Speech",
Second International Conference on Inventive Research in Computing Applications (ICIRCA), 2020
The purpose of this paper is to examine the performance of various techniques for converting sign language to text/speech format. Following the analysis, an Android application is created that can convert real-time ASL (American Sign Language) signs to text/speech.
Vishwa Hariharan Iyer, U.M Prakash, Aashrut Vijay, P. Sathishkumar, "Sign Language Detection using Action Recognition", 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), 2022
This proposal is centered on the continuous detection of image frames in real time, using action detection to detect the user's action. After recognising key points using MediaPipe Holistic, which includes face, pose, and hand features, the model employs an LSTM neural network. The proposed work involves collecting key-point values for training and testing, pre-processing the data, and creating labels and features. It saves the weights and uses a confusion matrix and accuracy to evaluate the model.
2.1 INFERENCES FROM LITERATURE SURVEY
Related works that
have been done have limitations as well as scope for further development. The main limitation of these papers is that there were no combined guidelines to help visually impaired and deaf people communicate in sign language effectively. Our proposed system is designed to address the limitations of these papers while providing proper guidance and also forewarning about obstacles faced at the same time.
2.2 OPEN PROBLEMS IN EXISTING SYSTEM
In the face of problems often found with other learning models, such as the presence of multiple local minima and the curse of dimensionality, existing methods have been significantly ineffective. The SVM method does not have any edge over the curse of dimensionality, so its practical importance can be
diminished. An immediate problem arising from the SVM’s original hyperplane formulation is that it
is not very obvious how to make the model applicable to more than two classes. Several approaches
have been proposed to overcome this limitation, two of them being known as the 1-vs-1 and 1-vs-all strategies for multi-class classification. For a decision problem over multiple classes, 1-vs-all requires the creation of one classifier per class, each trained to distinguish that class from all the others. The decision is then taken in a winner-takes-all approach. However, there is no clear indication that this approach results in an optimum decision. As a result, we have proposed a model that solves these problems and offers higher accuracy and reliability compared to traditional classifier methods.
CHAPTER 3 REQUIREMENT ANALYSIS
3.1 FEASIBILITY STUDIES/RISK ANALYSIS OF THE PROJECT
3.1.1 FEASIBILITY
STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is to be carried out. For feasibility analysis, some understanding of the major requirements for the system is essential. Three key considerations involved in the feasibility analysis are:
● Economic feasibility
● Technical feasibility
● Operational
feasibility
ECONOMIC FEASIBILITY
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, so the expenditures must be justified. The developed system is well within the budget, which was achieved because most of the technologies used are freely available; only the customized products had to be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system has modest requirements, and only minimal or no changes are required for implementing it.
OPERATIONAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system but must instead accept it as a necessity. The level of acceptance by the users solely depends on the methods employed to educate the user about the system and make them familiar with it. Their level of confidence must be raised so that they are also able to offer constructive criticism, which is welcomed, as they are the final users of the system.
3.2 SOFTWARE REQUIREMENTS SPECIFICATION DOCUMENT
S/W CONFIGURATION:
• Operating System: Windows / Linux / macOS
• Programming Language: Python 3.6
• IDE: Anaconda / Jupyter Notebook
• Packages/libraries: TensorFlow, NumPy, OpenCV
H/W CONFIGURATION:
• Processor: Intel i3 or better
• Hard Disk: 160 GB
• RAM: 8 GB
CHAPTER 4 DESCRIPTION OF PROPOSED SYSTEM
In
the proposed model, the objective is to recognize all 26 alphabets with the aid of hand gestures through a web camera. We will create an alphabet detection model using a TensorFlow dense model to detect angled and normal alphabets, trained on our dataset of alphabet gestures. This deep learning model will recognize the alphabet from gestures captured in real time via a webcam. We will also create a gesture detection model using a TensorFlow dense model to detect sign language hand gestures, trained on our dataset of hand gestures. This deep learning model will recognize sign language from hand gestures captured in real time via a webcam. For this system, we will create a Node JS web app to take webcam input using OpenCV and detect alphabets and gestures using our trained model. A very well-known technique that works effectively with small labelled data is transfer learning: a network trained on a source task is used as a feature extractor for the target task. There are many CNN models trained on ImageNet that are publicly available, such as VGG-16, VGG-19, Inception, and ResNet. The transferable feature representation learned by the CNN minimizes the effect of over-fitting on a small labelled set. For the implementation of this project, four algorithms were considered: VGG16, VGG19, ResNet50, and Inception V3. VGG16 is effective for object detection and classification because of its simple linear architecture and hence its compatibility with the required use case. Hence, the VGG16 model was best suited for implementation.
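The following is a minimal sketch of the transfer-learning setup described above, reusing an ImageNet-pretrained VGG16 as a frozen feature extractor with a small dense head in Keras; the input size, head width, and number of classes are illustrative assumptions, not the exact configuration of this project.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 26  # assumption: one class per alphabet gesture

# Load VGG16 pretrained on ImageNet, without its classification head
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # use the convolutional base purely as a feature extractor

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()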
Figure 4.1: Block diagram for proposed system
4.1 SELECTED METHODOLOGY OR PROCESS MODEL
4.1.1 Module 1: Data Gathering and
Data Preprocessing
Creating a Hand Landmark Creator
We will create a landmark creator function to detect the left and right hand in the frame and create landmarks for the hand. The location of hand landmarks is an important source of information for recognizing hand gestures, and it has been effectively exploited in a number of recent methods that work from depth maps. A hand landmark model processes a cropped bounding-box image and returns three-dimensional hand key points. The function locates 21 key points, each with a 3D hand-knuckle coordinate, inside the detected hand regions via regression, i.e., direct coordinate prediction by the hand landmark model.
Figure 4.2: Hand landmarks
Each hand-knuckle landmark coordinate is composed of x, y, and z, where x and y are normalized to [0.0, 1.0] by the image width and height, while z represents the depth of the landmark, measured relative to the wrist landmark, which acts as the origin. The closer a landmark is to the camera, the smaller its z value becomes.
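A minimal sketch of such a landmark creator is shown below, using the MediaPipe Hands solution together with OpenCV; the function name, confidence threshold, and webcam source are illustrative assumptions rather than the project's exact implementation.

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def create_landmarks(frame, hands):
    """Return a list of 21 (x, y, z) tuples per detected hand, or an empty list."""
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB input
    results = hands.process(rgb)
    landmarks = []
    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            landmarks.append([(lm.x, lm.y, lm.z) for lm in hand.landmark])
    return landmarks

cap = cv2.VideoCapture(0)  # assumes a webcam is available
with mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5) as hands:
    ok, frame = cap.read()
    if ok:
        print(create_landmarks(frame, hands))
cap.release()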
Data Gathering
The process of collecting data is crucial for the development of an effective ML model. The quality and quantity of the dataset have a direct effect on the decision-making of the AI model, and these two factors influence the robustness, precision, and performance of AI algorithms. As a consequence, data collection and structuring often take longer than model training. We will gather data using only a webcam for sign language, covering the alphabets (A-Z) and different gestures, e.g., Hello, Thank you, Eat, Goodbye, and many more.
Creating Labels for Detections
Image labeling is a type of data labeling that places emphasis on
the identification and labeling of specific details in an image. In computer vision, data labeling
involves adding tags to raw data like pictures and videos. Each tag stands for an object class
associated with the data. Supervised ML models use labels during learning to identify a specific
object class in unclassified data. It helps these models combine meaning with data, which can help
form a model. Image annotation is used to create datasets for computer vision models; these are divided into a training set, used to train the model, and test and validation sets, used to assess model performance. Data scientists use the dataset to shape and evaluate their model, and the model can then automatically label unlabeled and unseen data. We will create labels for each image
to classify the alphabets and gestures. Image labeling is an essential function of computer vision
algorithms. The ultimate goal of machine learning algorithms is to obtain automatic labeling, but to
train a model requires a large set of pre-labeled image data.
METHODS FOR IMAGE LABELING:
Manual image annotation: This is the process of manually defining labels for a complete image or drawing regions in an image and adding text descriptions for each region. Manual annotation is typically supported by tools that enable operators to rotate through a large number of images, draw regions on an image, and label and store these data in a standard format.
Automated image annotation:
Automated annotation tools can help manual annotators by trying to find object limits in an image
and providing a starting point for annotation. Automated annotation algorithms are not entirely
precise, but they can save time for human annotators by providing at least one partial map of the
objects in the image.
Synthetic image labeling: This is a precise, cost-effective technique that can replace manual annotation. It involves automatically generating images similar to actual data, in accordance with criteria defined by the operator. There are three common methods of producing synthetic images:
Variational Autoencoders (VAE): these are algorithms that start from existing data, create a new data distribution, and link it to the original space using an encoder-decoder method.
Generative Adversarial Networks (GAN): these are models composed of two neural networks. One network tries to create fake images, while the other tries to distinguish actual images from fake ones. Over time, the system can produce photorealistic images that are difficult to distinguish from real images.
Neural Radiance Fields (NeRF): this model takes a series of images describing a 3D scene and automatically generates additional new perspectives of the same scene. It works by evaluating a 5-dimensional radiance function to generate each voxel of the target image.
Data Preprocessing
Algorithms that learn from data are just statistical equations operating on database values. Thus, as the saying goes, "garbage in, garbage out". A data project can only succeed if the data entering the machines is of high quality. Data from real-world scenarios always contain noise
and missing values. This occurs due to manual mistakes, unexpected events, technical problems, or
various other obstacles. Incomplete and noisy data cannot be consumed by algorithms, as they are
generally not designed to handle missing values, and noise causes disruption to the actual sample
setup. Data preprocessing is a step in transforming raw data so that problems due to
incompleteness, inconsistency, and/or lack of proper representation of trends are resolved in order
to obtain a data set in a comprehensible format. In our model, for data preprocessing we will convert
our labels to categorical values using a One Hot Encoding method. As we will be taking raw webcam
frames, we won't do any other preprocessing method. Categorical Encoding Sometimes the data is in
a form that cannot be processed by machines. For example,a column with string values, like names,
will not mean anything for a model that depends solely on numbers. We, therefore, have to process
the data to assist the model in interpreting them. It is referred to as categorical encoding. There are
multipleways in which we can encode categories. Here we use the One Hot Encoding method. When
data is not in an inherent order, we can use only one hot encoding. A hot encoding generates a
column for each category and assigns a positive value, 1, in any row within that category, and 0 when
not present. The downside is that multiple features are generated from a single feature, which makes
the data large. It doesn't matter if we don't have too many features. 4.1.2 Module 2: Model Creation
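A minimal sketch of this one-hot encoding step with the Keras utility is shown below; the gesture label names are hypothetical examples.

import numpy as np
from tensorflow.keras.utils import to_categorical

# Hypothetical string labels for a few gesture classes
labels = ["Hello", "Thank you", "Eat", "Goodbye", "Hello"]
classes = sorted(set(labels))                        # fixed class ordering
indices = np.array([classes.index(l) for l in labels])

one_hot = to_categorical(indices, num_classes=len(classes))
print(one_hot)  # each row has a single 1 in the column of its class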
4.1.2 Module 2: Model Creation and Training
Alphabet Detection Model
The objective is to recognize all 26 alphabets with the aid of hand gestures through a web camera. We will create an alphabet detection model using the TensorFlow Dense layer.
Figure 4.3: Alphabet hand gestures
The dense layer is a densely connected neural network layer, which means that each neuron in the dense layer receives input from all neurons of the previous layer. A dense layer performs a matrix-vector multiplication, and the values in the matrix are parameters that can be trained and updated with the help of backpropagation. The output produced by the dense layer is a vector of dimension 'n'. A dense layer can be used to change the dimensions of the vector and to rotate, scale, and translate it.
Gesture Detection Model
We will create a gesture
detection Model using the Tensorflow Dense model to detect sign language hand gestures to train
our dataset that stands for hand gestures. This deep learning model will recognize the sign language
with hand gestures captured in real time via a webcam.
Figure 4.4: Sign language hand gestures
TensorFlow Dense is a kind of layer available in neural networks when implementing artificial intelligence and deep learning in the Python programming language. In dense layers there are deep connections between the neurons: each individual neuron receives input from all the neurons of the previous layer, forming a complex model. Dense is also a layer function in TensorFlow neural networks that produces its output by applying the activation function to the dot product of the kernel and the input and adding the bias term.
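A minimal sketch of such a dense classification model in Keras is shown below, assuming flattened hand-landmark features (21 landmarks x 3 coordinates) as input; the layer widths and class count are illustrative assumptions, not the project's exact configuration.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_GESTURES = 26                # assumption: one class per gesture/alphabet
INPUT_DIM = 21 * 3               # 21 hand landmarks, each with (x, y, z)

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(INPUT_DIM,)),  # every neuron sees every input value
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_GESTURES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()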
more subsets. Typically, with a two-part split, one part is used to evaluate or test the data and
another to form the model. Data splitting is an important part of data science, 30 particularly for the
creation of data-based models. This technique ensures that data and process models that utilize data
models, such as machine learning, are accurately created. In a basic two-part split, the training
dataset is used for training and model development. Training sets are routinely used to estimate
different parameters or to compare the performance of different models. The test data set is used at
the end of the training. Training and test data are compared to make sure the final model is working
properly. When using machine learning, data is usually divided into three or more sets. In three sets,
the extra set is the development set, which is used to modify the parameters of the learning process.
Organizations and data modelers may elect to split data using sampling methods such as the following three:
Random sampling: this method protects the data modeling process against bias towards particular data characteristics. However, random splitting can pose problems in terms of an uneven distribution of the data.
Stratified random sampling: this selects random data samples within specific strata, ensuring the data is properly distributed across the training and test sets.
Nonrandom sampling: this approach is generally used when data modelers want the latest data as a test set.
With data splitting, organizations do not have to choose between using data for statistical analysis and using it for analytics, as the same data can be used in the various processes.
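A minimal sketch of a stratified two-part split with scikit-learn is shown below; the feature matrix, labels, and split ratio are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (e.g., flattened landmarks) and integer labels
X = np.random.rand(500, 63)
y = np.random.randint(0, 26, size=500)

# Stratified split keeps the class proportions similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)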
Training Model
Model training is the phase of the data science lifecycle during which practitioners attempt to find the combination of weights and biases for a machine learning algorithm that minimizes a loss function over the prediction range. The goal of model
development is to construct the best mathematical representation of the relationship between data
features and a target label (in supervised learning) or among the features themselves (in
unsupervised learning). Model training is the key step of machine learning, leading to a model
ready for validation, testing, and deployment. The performance of the model determines the quality
of the applications which are constructed using it. The quality of the training data and the training
algorithm are significant assets in the training phase of the model. Training information is usually
split for training, validation, and testing. The training algorithm is selected as a function of the end-
use case. There are a number of tradeoff points for choosing the best algorithm – model complexity,
interpretability, performance, computational requirements, etc. All these aspects of model training
make it a process that is both involved and important for the overall development cycle of machine
learning. Saving the Best Model Saving our machine-learning models is an important step in the
machine-learning workflow, so we can reuse them in the future. For example, it is very likely that we
will have to compare models to determine which champion model to use in production. Saving the
models while they are being trained facilitates this process. The alternativewould be to form the
model whenever it is to be used, which can greatly affect productivity, especially if the model takes a
long time to form. 4.1.3 Module 3: Creating Node JS Web App An important part of building a
machine learning model is to share the model we havebuilt with others. No matter how many
models we create, if they remain offline, very few people will be able to see what we are achieving.
That is why we should deploy ourmodels, so that anyone can play with them through a nice User
Interface (UI). For this system, we will create a Node JS webapp to take webcam input using OpenCV
and detect alphabets and gestures using our trained model. Nodejs is a server-side platform powered
by Google Chrome JavaScript Engine (V8 Engine). Node.js is a platform built on Chrome JavaScript
runtime to build fast, scalable network applications with ease. Node.js uses a non-blocking, event-
driven I/O model that makes it lightweight and efficient, perfect for real-time, data-intensive
applications that operate on distributed devices. It is an open-source, cross-platform runtime environment for developing server-side and networking applications that run on OS X, Microsoft Windows, and Linux. Node.js also provides a rich library of JavaScript modules that greatly simplifies the development of web applications. Node.js = Runtime Environment + JavaScript Library. These are just a few of the key features that make Node.js a first choice for software architects:
Asynchronous and Event Driven: All APIs of the Node.js library are
asynchronous, meaning non-blocking. This basically means that a Node.js-based server never waits for an API to return data; the server moves on to the next API after calling it, and a Node.js event-notification mechanism helps the server get the response from the previous API call.
Very Fast: Being built on the Google Chrome V8 JavaScript Engine, the Node.js library is very fast in code execution.
Single-Threaded but Highly Scalable: Node.js uses a single-threaded model with an event loop. The event mechanism helps the server respond in a non-blocking manner and makes it highly scalable, as opposed to traditional servers that create a limited number of threads to handle requests. Node.js uses a single-threaded program, and the same program can serve many more requests than traditional servers like the Apache HTTP Server.
No Buffering: Node.js applications never buffer data; they simply output data in chunks.
License: Node.js is released under the MIT license.
4.2 ARCHITECTURE / OVERALL DESIGN OF THE PROPOSED SYSTEM
Fig 4.5: System Architecture for Sign Language
4.3 DESCRIPTION OF SOFTWARE FOR IMPLEMENTATION AND
TESTING PLAN OF THE PROPOSED MODEL/SYSTEM Anaconda is an open-source package manager for
Python and R. It is the most popular platform among data science professionals for running Python
and R implementations. There are over 300 libraries in data science, so having a robust distribution
system for them is a must for any professional in this field. Anaconda simplifies package deployment and management. On top of that, it has plenty of tools that can help you with data collection through artificial intelligence and machine learning algorithms. With Anaconda, you can easily set up, manage, and share Conda environments. Moreover, you can deploy any required project with a few clicks when you're using Anaconda. There are many advantages to using Anaconda, and the following are the most prominent among them: Anaconda is free and open source, which means you can use
it without spending any money. In the data science sector, Anaconda is an industry staple. It is open-
source too, which has made it widely popular. If you want to become a data science professional, you
must know how to use Anaconda for Python because every recruiter expects you to have this skill. It
is a must-have for data science. It has more than 1500 Python and R data science packages, so you
don’t face any compatibility issues while collaborating with others. For example, suppose your
colleague sends you a project which requires packages called A and B but you only have package A.
Without having package B, you wouldn’t be able to run the project. Anaconda mitigates the chances
of such errors. You can easily collaborate on projects without worrying about any compatibility
issues. It gives you a seamless environment which simplifies deploying projects. You can deploy any
project with just a few clicks and commands while managing the rest. Anaconda has a thriving
community of data scientists and machine learning professionals who use it regularly. If you
encounter an issue, chances are the community has already answered it. You can also ask people in the community about the issues you face; it's a very helpful community, ready to help new learners. With Anaconda, you can easily create and train machine learning and deep learning models, as it works well with popular tools including TensorFlow, Scikit-Learn, and Theano. You can create visualizations using Bokeh, HoloViews, Matplotlib, and Datashader while using Anaconda.
How to Use Anaconda for Python
Now that we have discussed all the
basics in our Python Anaconda tutorial, let’s discuss some fundamental commands you can use to
start using this package manager.
Listing All Environments
To begin using Anaconda, you need to see how many Conda environments are present on your machine: conda env list
This will list all the available Conda environments on your machine.
Creating a New Environment
You can create a new Conda environment by going to the required directory and using this command: conda create -n <env_name>
Replace <env_name> with the name of your environment. After entering this command, conda will ask you if you want to proceed, to which you should reply with y: Proceed ([y]/n)?
If you want to create an environment with a particular version of Python, you should use the following command: conda create -n <env_name> python=3.6
Similarly, if you want to create an environment with a particular package, you can use the following command: conda create -n <env_name> <pack_name>
Here, you can replace <pack_name> with the name of the package you want to use. If you have a .yml file, you can use the following command to create a new Conda environment based on that file: conda env create -n <env_name> -f <file_name>.yml
We also discuss how you can export an existing Conda environment to a .yml file later in this article.
You can activate a Conda environment by using the following command: conda activate <env_name>
You should activate the environment before you start working in it. Replace <env_name> with the name of the environment you want to activate. If you want to deactivate an environment, use the following command: conda deactivate
Installing Packages in an Environment
Now that you have an activated environment, you can install packages into it by using the following command: conda install <package_name>
Replace <package_name> with the name of the package you want to install in your Conda environment.
Updating Packages in an Environment
If you want to update the packages present in a particular Conda environment, you should use the following command: conda update --all
The above command will update all the packages present in the environment. However, if you want to update a package to a certain version, you will need to use the following command: conda install <package_name>=<version>
Exporting an Environment Configuration
Suppose you want to share your project with someone else (a colleague, a friend, etc.). While you can share the directory on GitHub, it would contain many Python packages, making the transfer process very challenging. Instead, you can create an environment configuration .yml file and share it with that person, who can then create an environment like yours by using the .yml file. To export the environment to a .yml file, you first have to activate the environment and run the following command: conda env export > <file_name>.yml
The person you want to share the environment with then only has to use the exported file with the 'Creating a New Environment' command shared before.
Removing a Package from an Environment
If you want to uninstall a package from a specific Conda environment, use the following command: conda remove -n <env_name> <package_name>
If you want to uninstall a package from the currently activated environment, use the following command: conda remove <package_name>
Deleting an Environment
Sometimes you don't need to add a new environment but remove one. In such cases, you must know how to delete a Conda environment, which you can do with the following command: conda env remove --name <env_name>
The above command will delete the Conda environment right away.
4.4
PROJECT MANAGEMENT PLAN
Introduction: October 7-21
Literature Survey: October 21-23
System Design: November 1-15
System Implementation: November 16-29
UML Diagrams: December 1-25
Conclusions: January 1-15
References and Future Work: February 1-20
Testing: March 1-25
CHAPTER 5 IMPLEMENTATION DETAILS
5.1 Algorithms
5.1.1 Scale-Invariant Feature Transform (SIFT)
The
SIFT (Scale-Invariant Feature Transform) algorithm is a computer vision algorithm used for detecting
and describing local features in images. It was introduced by David Lowe in 1999 and has since
become a popular algorithm in computer vision and image processing. The SIFT algorithm consists of
the following steps:
Scale-space extrema detection: The first step involves identifying potential interest points in the image at different scales using the Difference of Gaussian (DoG) function. The DoG function is obtained by taking the difference of two Gaussian-blurred images at different scales. The local minima and maxima of the DoG function are then identified as potential interest points.
Key point localization: Once the potential interest points are identified, the algorithm refines their location and scale by fitting a 3D quadratic function to the DoG function at each point.
Orientation assignment: The algorithm assigns an orientation to each key point based on the local image gradient directions. This helps make the descriptor rotationally invariant.
Key point descriptor generation: A descriptor is generated for each key point by considering the local image gradient directions and magnitudes in a surrounding region. This descriptor is a vector of length 128, which is invariant to scale, rotation, and illumination changes.
Key point matching: Finally, the descriptors of two images are compared using a distance metric (e.g., Euclidean distance) to find matching key points between the two images.
The SIFT algorithm has been widely used in various applications, such as image stitching, object recognition, and 3D reconstruction. However, due to its computational complexity, it may not be suitable for real-time applications.
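As a general illustration of the steps above (not code from this project), the following minimal sketch detects and matches SIFT key points with OpenCV; the image paths are hypothetical.

import cv2

# Hypothetical input images; replace with real grayscale gesture frames
img1 = cv2.imread("gesture_a.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("gesture_b.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)  # key points + 128-dim descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching on Euclidean (L2) distance between descriptors
matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} matches found")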
5.1.2 Histogram of Oriented Gradients (HOG)
The "HOG algorithm" typically refers to the Histogram of Oriented Gradients (HOG) feature extraction method,
which is used in computer vision and image processing for object detection and recognition tasks.
The HOG algorithm works by computing the gradients of an image, which represent the intensity and
direction of local edges. These gradients are then quantized into orientation bins, which are used to
construct a histogram of gradient orientations within a given image region. The resulting histogram
can be used as a feature descriptor to represent the local texture and shape information of an object.
HOG is commonly used in combination with a support vector machine (SVM) classifier to perform
object detection and recognition. The SVM classifier is trained on a set of positive and negative
examples of the object of interest, using the HOG features as input. During testing, the HOG features
of an image are extracted and fed into the SVM, which predicts whether the object is present or not.
HOG has been successfully applied to a wide range of object recognition tasks, including pedestrian
detection, face detection, and vehicle detection. However, it is computationally expensive and may
not perform well in situations with significant variations in scale and orientation.
5.1.3 Support Vector Machine
Support Vector Machines (SVM) is a supervised machine learning algorithm used for classification and regression analysis. SVM aims to find the optimal boundary (hyperplane) that
separates the data into different classes while maximizing the margin between the classes. In the
case of binary classification, the SVM algorithm works by finding the hyperplane that best separates
the data into two classes by maximizing the margin. The margin is defined as the distance between
the hyperplane and the nearest data points of each class. SVM uses a kernel function to map the
input data into a high-dimensional feature space where it is easier to find a hyperplane that
separates the data. Common kernel functions include linear, polynomial, radial basis function (RBF),
and sigmoid. SVMs are known for their ability to handle high-dimensional data and for their
robustness against overfitting. However, SVMs can be sensitive to the choice of kernel function and
the regularization parameter. Overall, SVM is a powerful algorithm that has been widely used in
various applications such as image classification, text classification, and bioinformatics.
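As a general illustration of the HOG-plus-SVM pipeline described in the two sections above (not code from this project), the following minimal sketch extracts HOG features with scikit-image and classifies them with a scikit-learn SVM; the image size, HOG parameters, and labels are illustrative assumptions.

import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Hypothetical grayscale gesture images (64x64) and integer class labels
images = np.random.rand(200, 64, 64)
labels = np.random.randint(0, 5, size=200)

# HOG descriptor: gradient-orientation histograms over local cells
features = np.array([
    hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    for img in images
])

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

clf = SVC(kernel="rbf", C=1.0)       # RBF-kernel SVM on the HOG features
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))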
5.2 Testing
• UNIT TESTING
• INTEGRATION TESTING
• FUNCTIONAL TESTING
• WHITE BOX TESTING
• BLACK BOX TESTING
5.2.1 Unit Testing
o Unit testing is used to ensure that each modular component of the
project is working.
o The smallest unit of the software design is the subject of unit testing.
o The mentioned project underwent a progressive examination of unit testing.
o The unit testing findings were good and encouraging.
5.2.2 Integration Testing
o Integration testing is a systematic
methodology for building the software architecture while also running tests to detect faults related
to the interface.
o Integration testing, in other words, is the comprehensive testing of the product's collection of modules.
o For the described project, a sequential analysis of integration testing was undertaken.
o The findings of the integration testing were positive and encouraging.
5.2.3 Functional
Testing
o Functional test cases included running the code with nominal input values for which the anticipated outputs were known, as well as boundary values and unusual values such as logically linked inputs, files with identical components, and empty files.
o For the project under consideration, a functional testing sequential analysis was carried out.
o The outcomes of functional testing were positive and encouraging.
5.2.4 White Box Testing
o This kind of testing is also known as glass box
testing.
o By understanding the precise tasks that a product has been designed to accomplish, testing can be performed to verify that each function is completely operational while also looking for flaws in each function.
o It is a test case design approach that derives test cases from the procedural design's control structure.
o White box testing is used for basis path testing.
o For the project under consideration, a sequential assessment of white box testing was carried out.
o White box testing yielded positive and encouraging findings.
5.2.5 Black Box Testing
o By understanding a product's internal operation, testing may be carried out to guarantee that "all gears mesh," that is, that the internal operation works according to specifications and all internal components have been appropriately exercised.
o For the described project, a sequential analysis of black box testing was undertaken.
o The outcomes of black box testing were positive and encouraging.
CHAPTER 6
RESULTS The online hand gesture recognition and classification system will be able to recognize a
range of hand gestures in real-time. The system will be able to translate the recognized gestures into
text or speech, depending on the user's preferences. The system will be tested using a range of hand
gestures, and the accuracy of the system will be evaluated. The accuracy of the system will be
measured in terms of the percentage of correctly recognized gestures.
Figure 6.1: Sequence Diagram
Figure 6.2: Data Flow Diagram
In this work, a deep learning based CNN model has been proposed to identify static hand gestures. Traditional models have been found to be restrictive in the sense that they need preprocessing of the image to achieve good accuracy. Deep learning models, however, have an advantage over traditional methods in that they do not need prior feature extraction, which makes them very robust. The DCNN model we employed is therefore very attractive for gesture recognition applications. The trained model has shown good results with
cluttered backgrounds, poor lighting conditions, and different hand morphologies. The robustness of
the model has been attributed to the capacity of DCNNs to learn complex features in images.
Another important feature of the present model is that it is very accurate: the model has achieved 99.6% test accuracy, which is among the highest reported test accuracies. The high accuracy can be attributed to careful data collection, data augmentation, and the network architecture. While collecting data, care was taken to include various scenarios. Good data collection coupled with data augmentation led to good generalization of the network.
Figure 6.3: Output Screenshot
CHAPTER 7
7.1 Conclusion:
This project introduces a CNN-based approach for the
recognition and classification of sign language using deep learning. Our proposed system was able to achieve 99% training accuracy, with testing accuracies of 90.04% in letter recognition, 93.44% in number recognition, and 97.52% in static word recognition, obtaining an average of 93.667% for gesture recognition within limited time. As for future work, the proposed system can be generalized to a wider class of hand gestures. Furthermore, the impact of different parameters on the performance of the system and the accuracy of the results can be investigated further. Time-frame selection techniques can be improved, more optimization can be added, and the loss function can be studied more deeply. Using OpenCV, we offer a technique for identifying and
recognizing signals in this paper. Throughout the training of each system, each letter, number and
word gesture were utilized 50 times. This method has substantially better accuracy and significantly
fewer false positives than the others. The identification of hand gestures has seen significant
advancement in the last several decades thanks to the efforts of numerous experts. Sign language
interpretation devices, augmented or virtual reality, robot control, and the ability to recognize hand
gestures are just a few of the numerous areas where this technology has proven useful. Since the
introduction of the next generation of gesture interface technology, the value of gesture recognition
has skyrocketed. Most people who use sign language are deaf or unable to speak and hence cannot communicate verbally. It is difficult to express emotions clearly when communicating with someone who can see but does not share a common sign language. OpenCV provides a fast and convenient way of recognizing and classifying hand gestures. This is especially suitable for the deaf and hard-of-hearing community, who have limited access to other forms of communication. The cost of implementation is also quite low, making it an ideal solution. In conclusion, hand gesture recognition and classification can bring about transformative and life-changing improvements in the lives of people with hearing and speech impairments. With the use of machine learning and open-source libraries,
it is becoming increasingly easier to develop projects centered around hand gesture recognition and
classification.
7.2 Future Enhancements
As for future work, the proposed system can be generalized to a wider class of hand gestures. Furthermore, the impact of different parameters on the performance of the system and the accuracy of the results can be investigated further. Time-frame selection techniques can be improved, more optimization can be added, and the loss function can be studied more deeply.
7.3 Implementation
• Image Pre-Processing: The captured objects are pre-processed. Pre-processing is done to remove extraneous objects.
• Feature Extraction and Recognition: An algorithm is used to determine an image's characteristics. To retrieve the best-matching image from the database, the algorithm is applied to the captured image.
REFERENCES:-
[1] Soma Shrenika, Myneni Madhu Bala, "Sign Language Recognition Using Template Matching Technique", International Conference on Computer Science, Engineering and Applications (ICCSEA), 2020
[2] H Muthu Mariappan, V Gomathi, "Real-Time Recognition of Indian Sign Language", International Conference on Computational Intelligence in Data Science (ICCIDS), 2019
[3] Wanbo Li, Hang Pu, Ruijuan Wang, "Sign Language Recognition Based on Computer Vision", IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), 2021
[4] Pushkar Kurhekar, Janvi Phadtare, Sourav Sinha, Kavita P. Shirsat, "Real Time Sign Language Estimation System", 3rd International Conference on Trends in Electronics and Informatics (ICOEI), 2019
[5] Aniket Kumar, Mehul Madaan, Shubham Kumar, Aniket Saha, Suman Yadav, "Indian Sign Language Gesture Recognition in Real-Time using Convolutional Neural Networks", 8th International Conference on Signal Processing and Integrated Networks (SPIN), 2021
[6] Sai Nikhilesh Reddy Karna, Jai Surya Kode, Suneel Nadipalli, Sudha Yadav, "American Sign Language Static Gesture Recognition using Deep Learning and Computer Vision", 2nd International Conference on Smart Electronics and Communication (ICOSEC), 2021
[7] Shirin Sultana Shanta, Saif Taifur Anwar, Md. Rayhanul Kabir, "Bangla Sign Language Detection Using SIFT and CNN", 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2018
[8] Rajiv Ranjan, B Shivalal Patro, Mohammad Daim Khan, Manas Chandan Behera, Raushan Kumar, Utsav Raj, "A Review on Sign Language Recognition Systems", IEEE 2nd International Conference on Applied Electromagnetics, Signal Processing, & Communication (AESPC), 2022
[9] Kohsheen Tiku, Jayshree Maloo, Aishwarya Ramesh, Indra R, "Real-time Conversion of Sign Language to Text and Speech", Second International Conference on Inventive Research in Computing Applications (ICIRCA), 2020
[10] Vishwa Hariharan Iyer, U.M Prakash, Aashrut Vijay, P. Sathishkumar, "Sign Language Detection using Action Recognition", 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), 2022
ISLTranslate: Dataset for Translating Indian Sign Language Abhinav Joshi1 Susmit Agrawal2 Ashutosh
Modi1 1 Indian Institute of Technology Kanpur (IIT Kanpur) 2 Indian Institute of Technology
Hyderabad (IIT Hyderabad) {ajoshi, ashutoshm}@cse.iitk.ac.in, [email protected] Abstract
Sign languages are the primary means of communication for many hard-of-hearing people
worldwide. Recently, to bridge the communication gap between the hard-of-hearing community and
the rest of the population, several sign language translation datasets have been proposed to enable
the development of statistical sign language translation systems. However, there is a dearth of sign
language resources for the Indian sign language. This resource paper introduces ISLTranslate, a
translation dataset for continuous Indian Sign Language (ISL) consisting of 31k ISL-English
sentence/phrase pairs. To the best of our knowledge, it is the largest translation dataset for
continuous Indian Sign Language. We provide a detailed analysis of the dataset. To validate the
performance of existing end-to-end Sign language to spoken language translation systems, we
benchmark the created dataset with a transformer-based model for ISL translation. 1 Introduction
There are about 430 million hard-of-hearing people worldwide (footnote 1), of which 63 million are in India (footnote 2).
Sign Language is a primary mode of communication for the hard-of-hearing community. Although
natural language processing techniques have shown tremendous improvements in the last five years,
primarily, due to the availability of annotated resources and large language models (Tunstall et al.,
2022), languages with bodily modalities like sign languages still lack efficient language-processing
systems. Recently, research in sign languages has started attracting attention in the NLP community
(Yin et al., 2021; Koller et al., 2015; Sincan and Keles, 2020; Xu et al., 2022; Albanie et al., 2020; Jiang et al., 2021; Moryossef et al., 2020; Joshi et al., 2022).
Footnote 1: https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss
Footnote 2: https://nhm.gov.in/index1.php?lang=1&level=2&sublinkid=1051&lid=606
Figure 1: An example showing the translation of the phrase "Let's discuss" in Indian Sign Language.
The availability of translation datasets has improved the study
and development of NLP systems for sign languages like ASL (American Sign Language) (Li et al.,
2020), BSL (British Sign Language) (Albanie et al., 2021), and DGS (Deutsche Gebärdensprache)
(Camgoz et al., 2018). On the other hand, there is far less work focused on Indian Sign Language. The primary reason is the unavailability of large annotated datasets for Indian Sign Language (ISL). ISL, being a communication medium for a large, diverse population in India, still faces a shortage of certified translators (only 300 certified sign language interpreters in India; footnote 3), making the gap between spoken and sign language more prominent in India. This paper aims to bridge this gap by curating a new translation dataset for Indian Sign Language: ISLTranslate, comprising 31,222 ISL-
English pairs. Due to fewer certified sign language translators for ISL, there is a dearth of educational
material for the hard-of-hearing community. Many government and non-government organizations
in India have recently started bridging this gap by creating standardized educational content in ISL.
The created content helps build basic vocabulary for hard-of-hearing children and helps people use
spoken languages to learn and teach ISL to children. Considering the standardized representations
and simplicity in the vocabulary, we choose these contents for curating an ISL-English translation dataset.
Footnote 3: The statistic is as per the Indian government organization, the Indian Sign Language Research and Training Centre (ISLRTC): http://islrtc.nic.in/
We
choose the content specific to education material that is standardized and used across India for
primary-level education. Consequently, the vocabulary used in the created content covers diverse
topics (e.g., Maths, Science, English) using common daily words. ISL is a low-resource language, and
the presence of bodily modality for communication makes it more resource-hungry from the point of
view of training machine learning models. Annotating sign languages at the gesture level (grouping
similar gestures in different sign sentences) is challenging and not scalable. Moreover, in the past,
researchers have tried translating signs into gloss representation and gloss to written language
translation (Sign2Gloss2Text (Camgoz et al., 2018)). A gloss is a text label given to a signed gesture.
The presence of gloss labels for sign sentences in a dataset helps translation systems to work at a
granular level of sign translation. However, generating gloss representation for a signed sentence is
an additional challenge for data annotation. For ISLTranslate, we propose the task of end-to-end ISL
to English translation. Figure 1 shows an example of an ISL sign video from ISLTranslate. The example
shows a translation for the sentence “Let’s discuss”, where the signer does the sign for the word
“discuss” by circularly moving the hands with a frown face simultaneously followed by palm lifted
upwards for conveying “let’s.” The continuity present in the sign video makes it more challenging
when compared to the text-to-text translation task, as building a tokenized representation for the
movement is a challenging problem. Overall, in this resource paper, we make the following
contributions: • We create a large ISL-English translation dataset with 31,222 ISL-English sentence/phrase pairs. The dataset covers a wide range of daily communication words with a vocabulary size of 11,655. We believe making this dataset available for the NLP community will
facilitate future research in sign languages with a significant societal impact. Moreover, though not
attempted in this paper, we hope that ISLTranslate could also be useful for sign language generation
research. The dataset is made available at: https://github.com/Exploration-Lab/ISLTranslate. • We
propose a baseline model for end-to-end ISL-English translation inspired by sign language
transformer (Camgoz et al., 2020). 2 Related Work In contrast to spoken natural languages, sign
languages use bodily modalities, which include hand shapes and locations, head movements (like
nodding/shaking), eye gazes, finger-spelling, and facial expressions. As features from hand, eye,
head, and facial expressions are produced in parallel, sign language is richer in this respect than spoken language, where a continuous spoken sentence can be seen as a concatenation of articulated sound units. Moreover, translating from continuous movement in three dimensions makes sign language
translation more challenging and exciting from a linguistic and research perspective. Sign Language
Translation Datasets: Various datasets for sign language translation have been proposed in recent
years (Yin et al., 2021). Specifically for American Sign Language (ASL), there have been some early
works on creating datasets (Martinez et al., 2002; Dreuw et al., 2007), where the datasets were
collected in the studio by asking native signers to sign content. Other datasets have been proposed
for Chinese sign language (Zhou et al., 2021), Korean sign language (Ko et al., 2018), Swiss German
Sign Language - Deutschschweizer Gebardensprache (DSGS) and Flemish Sign Language - Vlaamse
Gebarentaal (VGT) (Camgöz et al., 2021). In this work, we specifically target Indian Sign Language and
propose a dataset with ISL videos-English translation pairs. End-to-End Sign Language Translation
Systems: Most of the existing approaches for sign language translation (Camgoz et al., 2018; De
Coster and Dambre, 2022; De Coster et al., 2021) depend on intermediate gloss labels for
translations. As glosses are aligned to video segments, they provide fine one-to-one mapping that
facilitates supervised learning in learning effective video representations. Previous work (Camgoz et
al., 2018) has reported a drop of about 10 BLEU-4 points without gloss labels. However, considering the annotation cost of gloss-level labels, it becomes imperative to consider gloss-free sign language translation approaches. Moreover, the gloss mapping in continuous sign language
might remove the grammatical aspects of the sign language.
Figure 2: A sample from ISLTranslate: "Sign Language is a visual language consisting of signs, gestures, fingerspelling and facial expressions."
Table 1: Comparison of continuous sign language translation datasets.
Dataset | Lang. | Sentences | Vocab.
Purdue RVL-SLLL (Martinez et al., 2002) | ASL | 2.5k | 104
Boston 104 (Dreuw et al., 2007) | ASL | 201 | 103
How2Sign (Duarte et al., 2021) | ASL | 35k | 16k
OpenASL (Shi et al., 2022) | ASL | 98k | 33k
BOBSL (Albanie et al., 2021) | BSL | 1.9k | 78k
CSL Daily (Zhou et al., 2021) | CSL | 20.6k | 2k
Phoenix-2014T (Camgoz et al., 2018) | DGS | 8.2k | 3k
SWISSTXT-Weather (Camgöz et al., 2021) | DSGS | 811 | 1k
SWISSTXT-News (Camgöz et al., 2021) | DSGS | 6k | 10k
KETI (Ko et al., 2018) | KSL | 14.6k | 419
VRT-News (Camgöz et al., 2021) | VGT | 7.1k | 7k
ISL-CSLRT (Elakkiya and Natarajan, 2021) | ISL | 100 | -
ISLTranslate (ours) | ISL | 31k | 11k
Other recent works on sign language translation include Voskou et al. (2021) and Yin and Read (2020), which try to remove the requirement for a gloss sequence during training and propose transformer-based architectures for end-to-end translation. We also follow a gloss-free approach for
ISL translation. 3 ISLTranslate ISLTranslate is a dataset created from publicly available educational
videos produced by the ISLRTC organization and made available over YouTube. These videos were
created to provide school-level education to hard-of-hearing children. The videos cover the NCERT (footnote 4) standardized English educational content in ISL.
Footnote 4: https://en.wikipedia.org/wiki/National_Council_of_Educational_Research_and_Training
As the targeted
viewers for these videos are school children and parents, the range of words covered in the videos is
beginner-level. Hence, it provides a good platform for building communication skills in ISL. The videos
cover various NCERT educational books for subjects like science, social science, and literature. A
single video (about 15-30 minutes) usually covers one chapter of a book and simultaneously provides
an audio voice-over (in English) conveying the same content. Apart from ISLRTC’s educational sign
videos which make up a significant portion of ISLTranslate, we also use another resource from Deaf
Enabled Foundations (DEF) (https://def.org.in/). DEF videos consist of words with respective
descriptions and example sentences for the same words, along with the text transcriptions available
in the descriptions (footnote 5). We split the DEF sign videos into multiple segments using visual heuristics for
separating segments corresponding to words, descriptions, and examples. In total, ISLTranslate
consists of 2685 videos (8.6%) from DEF, and the remaining 28537 videos (91.4%) are from ISLRTC.
ISLTranslate Creation: We use the audio voiceover (by analyzing the speech and silence parts) to split
the videos into multiple segments. Further, these segments are passed through the SOTA speech-to-
text model (Radford et al., 2022) to generate the corresponding text. As the generated text is the
same as present in the book chapters’ text, verifying the generated sample was easy and was done
by manually matching them with the textbook. In general, we found automatically transcribed text to
be of high quality; nevertheless, incorrectly generated text was manually corrected with the help of the content in the books.
Footnote 5: Example video: https://www.youtube.com/watch?v=429wv1kvK_c
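A minimal sketch of this kind of segmentation-and-transcription pipeline is shown below, assuming the voice-over has already been extracted to a WAV file; it uses pydub for silence-based splitting and the open-source Whisper package (Radford et al., 2022) for transcription, and the silence thresholds and file names are illustrative rather than the exact settings used for ISLTranslate.

```python
import whisper                              # openai-whisper
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_wav("lesson_voiceover.wav")   # extracted audio track (illustrative name)

# Speech segments separated by pauses; thresholds are illustrative.
spans = detect_nonsilent(audio, min_silence_len=700,
                         silence_thresh=audio.dBFS - 16)

model = whisper.load_model("base")
for i, (start_ms, end_ms) in enumerate(spans):
    clip = f"segment_{i:04d}.wav"
    audio[start_ms:end_ms].export(clip, format="wav")
    text = model.transcribe(clip)["text"].strip()        # candidate English sentence
    print(start_ms, end_ms, text)                         # later checked against the textbook
```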
Table 2: A sample of English translations present in ISLTranslate compared to sentences translated by the ISL signer for the respective ISL videos. The exact ISL-signer translations were used as reference sentences for computing the translation metric scores reported in Table 3; rows where the two columns differ show where the semi-automatically generated English deviates from the gold sentences produced by the ISL instructor.
ISLTranslate Translations | ISL-Signer Translations (references)
Birbal started smiling. | Birbal started smiling.
When it was his turn, he went near the line. | He turned towards the drawn line.
Discuss with your partner what Birbal would do. | Discuss with your partner what Birbal would do.
Birbal drew a longer line under the first one. | Under the drawn line, Birbal drew a longer line.
saw what he drew and said. | and wondered
That's true, the first line is shorter now. | That's true, the first line is shorter now.
One day, Akbar drew a line on the floor and ordered. | One day, Akbar drew a line and ordered
Make this line shorter. | Make this line shorter.
Rita is shorter than Radha. | Rita is short and the other is Radha.
Rajat is taller than Raj. | Rajat is taller and the other is Raj.
but don't rub out any part of it. | but don't rub out any part of it.
Try to draw Rajat's taller than Raj. | First draw Rajat as taller, then draw Raj on the right.
No one knew what to do. | No one knew what to do.
Each minister looked at the line and was puzzled. | Each minister looked at the line and was puzzled.
No one could think of any way to make it longer. | No one could think of any way to make it longer.
Have you seen the fine wood carving? | Look at its architecture.
Most houses had a separate Beding area. | Most houses had a separate bathing area.
and some had wells to supply water. | and some had wells to supply water.
Many of these cities had covered drains. | Many of these cities had covered drains.
Notice how carefully these were laid out in straight lines. | Notice how carefully these were laid out in straight lines.
Table 3: Translation scores for a random sample of 291 pairs from ISLTranslate compared to references translated by the ISL instructor.
Metric | Score
BLEU-1 | 60.65
BLEU-2 | 55.07
BLEU-3 | 51.43
BLEU-4 | 48.93
METEOR | 57.33
WER | 61.88
ROUGE-L | 60.44
Figure 2 shows an example (from ISLTranslate) of a long sentence and its
translation in ISL. The frames in the figure are grouped into English words and depict the continuous
communication in ISL. Notice the similar words in the sentence, “sign/signs” and “language.” (also
see a visual representation of Sign Language; footnote 6). As is common in other natural languages,
representations (characters/gestures) of different lengths are required for communicating different
words. In ISLTranslate, we restrict to the sentence/phrase level translations. The dataset is divided
into train, validation, and test splits (Details in App. A). App. Figure 3 shows the distribution of the
number of samples in the various splits.
Footnote 6: https://www.youtube.com/watch?v=SInKhy-06qA
Comparison with Other Continuous Sign Language Datasets: We primarily compare with video-based datasets
containing paired continuous signing videos and the corresponding translation in written languages
in Table 1. To the best of our knowledge, we propose the largest dataset for ISL. Data Cleaning and
Preprocessing: The videos (e.g., App. Fig. 4) contain the pictures corresponding book pages. We crop
the signer out of the video by considering the face location as the reference point and removing the
remaining background in the videos. Noise Removal in ISLTranslate: As the ISLTranslate consists of
videos clipped from a longer video using pauses in the available audio signal, there are multiple ways
in which the noises in the dataset might creep in. While translating the text in the audio, a Signer
may use different signs that may not be the word-to-word translation of the respective English
sentence. Moreover, though the audio in the background is aligned with the corresponding signs in
the video, it could happen in a few cases that the audio was fast compared to the corresponding sign
representation and may miss a few words at the beginning or the end of the sentence. We also found
a few cases where while narrating a story, the person in the audio takes the character role by
modifying speech to sound like the designated character speaking the sentence. For example, in a
story where a mouse is talking, instead of saying the sentence followed by the “said the mouse”
statement, the speakers may change their voice and increase the pitch to simulate dialogue spoken
by the mouse. In contrast, in the sign language video, a person may or may not take the role of the
mouse while translating the sentence to ISL. ISLTranslate Validation: To verify the reliability of the
sentence/phrase ISL-English pairs present in the dataset, we take the help of a certified ISL signer.
Due to the limited availability of certified ISL signers, we could only use a small randomly selected sample of sign-text pairs (291 pairs) for human translation and validation. We ask an ISL instructor to
translate the videos (details in App. C). Each video is provided with one reference translation by the
signers. Table 2 shows a sample of sentences created by the ISL instructor. To quantitatively estimate
the reliability of the translations in the dataset, we compare the English translation text present in
the dataset with the ones provided by the ISL instructor. Table 3 shows the translation scores for 291
sentences in ISLTranslate. Overall, the BLEU-4 score is 48.93, ROUGE-L (Lin, 2004) is 60.44, and the WER (Word Error Rate) is 61.88.
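The paper does not specify the tooling used to compute these scores; the following is a minimal sketch using the sacrebleu and jiwer packages, with two sentence pairs from Table 2 standing in for the full 291-pair sample.

```python
import sacrebleu
from jiwer import wer

# Two pairs from Table 2 as a stand-in for the full 291-pair evaluation sample.
hypotheses = ["Birbal started smiling.", "Make this line shorter."]
references = [["Birbal started smiling.", "Make this line shorter."]]  # one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)   # corpus-level BLEU (BLEU-4 by default)
print(f"BLEU: {bleu.score:.2f}")
print(f"WER:  {100 * wer(references[0], hypotheses):.2f}")
```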
To provide a reference for comparison, for text-to-text translation the BLEU score of human translations ranges from 30 to 50 (as reported by Papineni et al. (2002): on a test
corpus of about 500 sentences from 40 general news stories, a human translator scored 34.68
against four references). Given a BLEU score of 48.93 against the reference translations provided by a certified ISL signer, we therefore consider the translations in ISLTranslate to be highly reliable. Ideally, it would be better to have multiple reference translations available for the same
signed sentence in a video; however, the high annotation effort along with the lower availability of
certified ISL signers makes it a challenging task. 4 Baseline Results Given a sign video for a sentence,
the task of sign language translation is to translate it into a spoken language sentence (English in our
case). For benchmarking ISLTranslate, we create a baseline architecture for ISL-to-English translation.
We propose an ISL-pose-to-English translation baseline (referred to as Pose-SLT) inspired by the Sign Language Transformer (SLT) (Camgoz et al., 2020).
Table 4: Translation scores obtained for the baseline model.
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L
Pose-SLT | 13.18 | 8.77 | 7.04 | 6.09 | 12.91
The Sign Language Transformer uses image features with transformers for generating text
translations from a signed video. However, considering the faster real-time inference of pose estimation models (Selvaraj et al., 2022), we use pose instead of images as input. We use the MediaPipe pose estimation pipeline (footnote 7). A similar SLT-based pose-to-text approach was used by
Saunders et al. (2020), which proposes Progressive Transformers for End-to-End Sign Language
Production and uses SLT-based pose-to-text for validating the generated key points via back
translation (generated pose key points to text translations). Though the pose-based approaches are
faster to process, they often perform worse than image-based methods. For the choice of holistic key points, we follow Selvaraj et al. (2022), which returns the 3D coordinates of 75 key points (excluding the face mesh). Further, we normalize every frame's key points by translating them so that the midpoint of the shoulder key points lies at the center, and by scaling them using the distance between the nose key point and the midpoint of the shoulders.
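A minimal NumPy sketch of this per-frame normalization is given below; the landmark indices follow the MediaPipe pose convention (nose = 0, shoulders = 11 and 12) but should be treated as illustrative, since the exact 75-key-point layout depends on the OpenHands/MediaPipe configuration used.

```python
import numpy as np

# Illustrative indices (MediaPipe pose convention); the actual layout of the
# 75 holistic key points depends on the OpenHands/MediaPipe configuration.
NOSE, L_SHOULDER, R_SHOULDER = 0, 11, 12

def normalize_frame(keypoints: np.ndarray) -> np.ndarray:
    """Center a (75, 3) frame on the shoulder midpoint and scale it by the
    nose-to-shoulder-midpoint distance."""
    mid_shoulder = (keypoints[L_SHOULDER] + keypoints[R_SHOULDER]) / 2.0
    centered = keypoints - mid_shoulder                       # translate to the origin
    scale = np.linalg.norm(keypoints[NOSE] - mid_shoulder)    # reference body length
    return centered / (scale + 1e-8)                          # guard against division by zero
```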
We use standard BLEU and ROUGE scores to evaluate the obtained English translations (model hyperparameter details in App. D). Table 4 shows the results obtained for the proposed architecture. The poor BLEU-4 result highlights the challenging nature of the
ISL translation task. The results motivate incorporating ISL linguistic priors into data-driven models to
develop better sign language translation systems. 5 Conclusion We propose ISLTranslate, a dataset of 31k ISL-English pairs for ISL. We provide detailed insights into the proposed dataset and benchmark it using a sign language transformer-based ISL-pose-to-English architecture. Our experiments
highlight the poor performance of the baseline model, pointing towards a significant scope for
improvement for end-to-end Indian sign language translation systems. We hope that ISLTranslate will
create excitement in the sign language research community and have a significant societal impact.
Footnote 7: https://ai.googleblog.com/2020/12/mediapipe-holistic-simultaneous-face.html
Limitations This
resource paper proposes a new dataset and experiments with a baseline model only. We do not
focus on creating new models and architectures. In the future, we plan to create models that
perform much better on the ISLTranslate dataset. Moreover, the dataset has only 31K video-sentence
pairs, and we plan to extend this to enable more reliable data-driven model development. In the
future, we would also like to incorporate ISL linguistic knowledge in data-driven models. Ethical
Concerns We create a dataset from publicly available resources without violating copyright. We are
not aware of any ethical concerns regarding our dataset. Moreover, the dataset involves people of
Indian origin and is created mainly for Indian Sign Language translation. The annotator involved in
the dataset validation is a hard-of-hearing person and an ISL instructor, and they performed the
validation voluntarily. Acknowledgements We want to thank anonymous reviewers for their insightful
comments. We want to thank Dr. Andesha Mangla (https://islrtc.nic.in/ dr-andesha-mangla) for
helping in translating and validating a subset of the ISLTranslate dataset. References Samuel Albanie,
Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, and Andrew Zisserman.
2020. BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues. In ECCV.
Samuel Albanie, Gül Varol, Liliane Momeni, Hannah Bull, Triantafyllos Afouras, Himel Chowdhury,
Neil Fox, Bencie Woll, Rob Cooper, Andrew McParland, and Andrew Zisserman. 2021. BOBSL: BBC-
Oxford British Sign Language Dataset. Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann
Ney, and Richard Bowden. 2018. Neural sign language translation. In 2018 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 7784–7793. Necati Cihan Camgoz, Oscar Koller,
Simon Hadfield, and Richard Bowden. 2020. Sign language transformers: Joint end-to-end sign
language recognition and translation. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). Necati Cihan Camgöz, Ben Saunders, Guillaume Rochette, Marco Giovanelli,
Giacomo Inches, Robin Nachtrab-Ribback, and Richard Bowden. 2021. Content4all open research
sign language translation datasets. In 2021 16th IEEE International Conference on Automatic Face
and Gesture Recognition (FG 2021), page 1–5. IEEE Press. Mathieu De Coster and Joni Dambre. 2022.
Leveraging frozen pretrained written language models for neural sign language translation.
Information, 13(5). Mathieu De Coster, Karel D’Oosterlinck, Marija Pizurica, Paloma Rabaey, Severine
Verlinden, Mieke Van Herreweghe, and Joni Dambre. 2021. Frozen pretrained transformers for neural
sign language translation. In 1st International Workshop on Automated Translation for Signed and
Spoken Languages. Philippe Dreuw, David Rybach, Thomas Deselaers, Morteza Zahedi, and Hermann
Ney. 2007. Speech recognition techniques for a sign language recognition system. In Proc.
Interspeech 2007, pages 2513– 2516. Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti
Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, and Xavier Giro-i Nieto. 2021. How2Sign:
A Large-scale Multimodal Dataset for Continuous American Sign Language. In Conference on
Computer Vision and Pattern Recognition (CVPR). R Elakkiya and B Natarajan. 2021. Isl-csltr: Indian
sign language dataset for continuous sign language translation and recognition. Mendeley Data.
Songyao Jiang, Bin Sun, Lichen Wang, Yue Bai, Kunpeng Li, and Yun Fu. 2021. Skeleton aware
multimodal sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) Workshops. Abhinav Joshi, Ashwani Bhat, Pradeep S, Priya
Gole, Shashwat Gupta, Shreyansh Agarwal, and Ashutosh Modi. 2022. CISLR: Corpus for Indian Sign
Language recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural
Language Processing (EMNLP). Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980. Sang-Ki Ko, Chang Jo Kim, Hyedong Jung, and
Choong Sang Cho. 2018. Neural sign language translation based on human keypoint estimation.
ArXiv, abs/1811.11436. Oscar Koller, Jens Forster, and Hermann Ney. 2015. Continuous sign language
recognition: Towards large vocabulary statistical recognition systems handling multiple signers.
Computer Vision and Image Understanding, 141:108–125. Dongxu Li, Cristian Rodriguez, Xin Yu, and
Hongdong Li. 2020. Word-level deep sign language recognition from video: A new large-scale dataset
and methods comparison. In The IEEE Winter Conference on Applications of Computer Vision, pages
1459–1469. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text
Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational
Linguistics. Aleix M. Martinez, Ronnie B. Wilbur, Robin Shay, and Avinash C. Kak. 2002. Purdue rvl-slll
asl database for automatic recognition of american sign language. Proceedings. Fourth IEEE
International Conference on Multimodal Interfaces, pages 167–172. Amit Moryossef, Ioannis
Tsochantaridis, Roee Aharoni, Sarah Ebling, and Srini Narayanan. 2020. Real-time sign language
detection using human pose estimation. In European Conference on Computer Vision, pages 237–
248. Springer. Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for
automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.
Association for Computational Linguistics. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman,
Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak
supervision. Ben Saunders, Necati Cihan Camgoz, and Richard Bowden. 2020. Progressive
Transformers for End-to-End Sign Language Production. In Proceedings of the European Conference
on Computer Vision (ECCV). Prem Selvaraj, Gokul Nc, Pratyush Kumar, and Mitesh Khapra. 2022.
OpenHands: Making sign language recognition accessible with pose-based pretrained models across
languages. In Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 2114– 2133, Dublin, Ireland. Association for
Computational Linguistics. Bowen Shi, Diane Brentari, Greg Shakhnarovich, and Karen Livescu. 2022.
Open-domain sign language translation learned from online video. Ozge Mercanoglu Sincan and
Hacer Yalim Keles. 2020. Autsl: A large scale multi-modal turkish sign language dataset and baseline
methods. IEEE Access, 8:181340–181355. Lewis Tunstall, Leandro von Werra, and Thomas Wolf.
2022. Natural language processing with transformers. " O’Reilly Media, Inc.". Andreas Voskou,
Konstantinos P. Panousis, Dimitrios I. Kosmopoulos, Dimitris N. Metaxas, and Sotirios P. Chatzis. 2021.
Stochastic transformer networks with linear competing units: Application to end-to-end sl
translation. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11926–
11935. Chenchen Xu, Dongxu Li, Hongdong Li, Hanna Suominen, and Ben Swift. 2022. Automatic
gloss dictionary for sign language learners. In Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics: System Demonstrations, pages 83–92, Dublin, Ireland.
Association for Computational Linguistics. Kayo Yin, Amit Moryossef, Julie Hochgesang, Yoav
Goldberg, and Malihe Alikhani. 2021. Including signed languages in natural language processing. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the
11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages
7347– 7360, Online. Association for Computational Linguistics. Kayo Yin and Jesse Read. 2020. Better
sign language translation with STMC-transformer. In Proceedings of the 28th International
Conference on Computational Linguistics, pages 5975–5989, Barcelona, Spain (Online). International
Committee on Computational Linguistics. Hao Zhou, Wen gang Zhou, Weizhen Qi, Junfu Pu, and
Houqiang Li. 2021. Improving sign language translation with monolingual data by sign back-
translation. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages
1316–1325.
Appendix A Data Splits
Data splits for ISLTranslate are shown in Table 5.
Table 5: The train, validation, and test split for ISLTranslate.
Split | Train | Validation | Test
# Pairs | 24978 (80%) | 3122 (10%) | 3122 (10%)
B ISLTranslate Word Distribution
Figure 3 (bar chart; x-axis: number of words, y-axis: number of samples; series: train, validation, test): Distribution of the number of samples in the train, validation, and test splits of ISLTranslate.
C Annotation Details
We
asked a certified ISL instructor to translate and validate a random subset from the dataset. The
instructor is a hard-of-hearing person and uses ISL for communication; hence they are aware of the
subtleties of ISL. Moreover, the instructor is an assistant professor of sign language linguistics. The
instructor is employed with ISLRTC, the organization involved in creating the sign language content;
however, the instructor did not participate in videos in ISLTranslate. The instructor performed the
validation voluntarily. It took the instructor about 3 hours to validate 100 sentences. They generated
the English translations by looking at the video. D Hyperparameters and Training We follow the code
base of SLT (Camgoz et al., 2020) to train and develop the proposed SLT-based pose-to-text
architecture by modifying the input features to be sign-pose sequences generated by MediaPipe.
Figure 4: The figure shows an example of the educational content video where the signer signs for the corresponding textbook.
The model architecture is a transformer-based encoder-decoder consisting of 3 transformer layers each for the encoder and the decoder. We use the Adam optimizer
(Kingma and Ba, 2014) with a learning rate of 0.0001, β = (0.9, 0.999) and weight decay of 0.0001 for
training the proposed baseline with a batch size of 32.
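The description above pins down only the number of layers, the optimizer settings, and the batch size; the following PyTorch sketch fills in the remaining pieces (model width, number of heads, vocabulary size, and the omission of positional encodings) with illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

# Illustrative sizes; only the 3+3 layers, Adam settings, and batch size of 32
# come from the paper. Positional encodings are omitted for brevity.
POSE_DIM, D_MODEL, N_HEADS, VOCAB = 75 * 3, 512, 8, 12000

class PoseSLT(nn.Module):
    def __init__(self):
        super().__init__()
        self.pose_proj = nn.Linear(POSE_DIM, D_MODEL)   # embed flattened pose frames
        self.tok_embed = nn.Embedding(VOCAB, D_MODEL)   # embed target English tokens
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=N_HEADS,
            num_encoder_layers=3, num_decoder_layers=3, batch_first=True)
        self.out = nn.Linear(D_MODEL, VOCAB)            # per-token logits

    def forward(self, poses, tokens):
        # poses: (batch, frames, POSE_DIM); tokens: (batch, seq_len)
        src = self.pose_proj(poses)
        tgt = self.tok_embed(tokens)
        causal = self.transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        return self.out(self.transformer(src, tgt, tgt_mask=causal))

model = PoseSLT()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)
```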
A Comparative Analysis of Techniques and Algorithms for Recognising Sign Language Rupesh Kumar
Department of CSE Galgotias College of Engineering and Technology, AKTU Greater Noida, India
[email protected] S.K Singh Professor Department of CSE Galgotias College of Engineering
and Technology, AKTU, Greater Noida, India, [email protected] Ashutosh Bajpai, Department of CSE, Galgotias College of Engineering and Technology, AKTU, Greater Noida, India, [email protected] Ayush Sinha, Department of CSE, Galgotias College of Engineering and Technology, Greater Noida, India, [email protected] Abstract—Sign language is a
visual language that enhances communication between people and is frequently used as the primary
form of communication by people with hearing loss. Even so, not many people with hearing loss use
sign language, and they frequently experience social isolation. Therefore, it is necessary to create
human-computer interface systems that can offer hearing-impaired people a social platform. Most
commercial sign language translation systems now on the market are sensor-based, pricey, and
challenging to use. Although vision-based systems are desperately needed, they must first overcome
several challenges. Earlier continuous sign language recognition techniques used hidden Markov
models, which have a limited ability to include temporal information. To get over these restrictions,
several machine learning approaches are being applied to transform hand and sign language motions
into spoken or written language. In this study, we compare various deep learning techniques for
recognising sign language. Our survey aims to provide a comprehensive overview of the most recent
approaches and challenges in this field. Keywords— sign language recognition, deep learning,
convolutional neural network, vision-based systems, continuous sign language recognition. I.
INTRODUCTION The study of sign language is a significant and intriguing field of study that has
effects on both the deaf population and society at large. Researchers are interested in studying sign
language for several reasons. Firstly, a better understanding of sign language can lead to better
communication between the deaf community and the general public. Secondly, researching sign
languages can provide insight into the nature of language itself and the interaction between
language and the brain. Lastly, a better understanding of sign language can lead to the development
of more effective sign language instruction techniques for both deaf and hearing people. Sign
languages are used in many countries, and each has its own distinctive vocabulary and grammar,
such as British Sign Language (BSL) in the UK and American Sign Language (ASL) in the US. However,
they all share commonalities in the way that meaning is expressed through hand and body gestures.
Sign languages have their own set of phonological, morphological, and syntactic principles, and they
are not just visual representations of spoken languages. They are independent languages with their
own syntax, vocabulary, and grammar, and their many gestures, facial expressions, and body
positions convey different meanings. The significance of sign language is highlighted by the World
Health Organization's estimate that there are more than 70 million deaf or hearing-impaired persons
worldwide. Many nations, including the United States, Canada, and Australia, recognize sign
language as an official language, underscoring the importance of recognizing, researching, and
preserving sign language as a unique and independent language. The use of sign language can help
reduce linguistic barriers, promote inclusion, and improve accessibility for deaf and hearing-impaired
people in various contexts. In conclusion, sign language is a vital tool for the global deaf community.
Understanding and acknowledging sign language as a unique and independent language is crucial to
promoting accessibility and inclusivity. Sign language research can shed light on language and the
brain, and effective sign language instruction techniques can benefit both deaf and hearing people.
This study attempts to analyse recent developments in deep learning-based sign language
recognition (SLR), including methods and prospective applications. The authors reviewed deep
learning architectures and algorithms for SLR and evaluated their performance, highlighting the importance of SLR for the deaf community. The study provides valuable information about the latest
developments in SLR and its potential applications, serving as a guide for researchers and
practitioners in the field. II. RELATED WORK In this section, we review relevant research papers on sign
language recognition techniques, classifying them into glove-based and vision-based approaches.
The glove-based techniques involve wearable devices like data gloves to capture hand movements
and gestures, while vision-based techniques utilize computer vision algorithms to analyse visual cues.
By examining these studies, we aim to provide a comprehensive overview of the existing techniques
and their implications for advancing sign language recognition technology. A. Glove-based approach
In their paper, Khomami et al. [21] developed wearable hardware using surface electromyography
(sEMG) and Inertial Measurement Unit (IMU) sensors for Persian Sign Language (PSL) recognition.
The system's ability to accurately capture signs was increased by fusing these two sensors. By
extracting and classifying the 25 highest-ranked features using the KNN classifier, they achieved an
average accuracy of 96.13%. Their affordable and user-friendly hardware, consisting of an Arduino Due, extended EMG shields, and an MPU-6050, shows promise for practical PSL communication.
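The feature-selection-plus-KNN pattern described above can be illustrated with a short scikit-learn sketch; the synthetic feature matrix, the ANOVA scoring function, and k = 3 neighbours are assumptions for illustration, not the exact pipeline of Khomami et al.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for windowed sEMG/IMU features: 500 gesture windows,
# 60 candidate features, 20 gesture classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 60))
y = rng.integers(0, 20, size=500)

# Keep the 25 highest-ranked features, then classify with KNN.
clf = make_pipeline(SelectKBest(f_classif, k=25),
                    KNeighborsClassifier(n_neighbors=3))
print("mean cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```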
A thorough analysis of wearable sensor-based systems for SLR was done by Kudrinko et al. [22]. The analysis covered 72 studies published between 1991 and 2019, looking at factors such as sensor set-up, recognition model, lexicon size, and recognition accuracy. The assessment identified gaps in the field, including problems with sign boundary detection, scalability to larger lexicons, and model convergence. The study's findings may help guide the creation of wearable sensor-based sign language recognition technology; to make such devices acceptable, user comfort, wireless transmission, and Deaf users' input must all be taken into account. A unique multimodal framework for sensor-based
SLR combining Microsoft Kinect and Leap motion sensors was proposed by Kumar et al. [23]. Their
system extracts elements for recognition by capturing finger and palm postures from different
viewpoints. Separate classifiers using the Hidden Markov Model (HMM) and the Bidirectional Long
Short-Term Memory Neural Network (BLSTM-NN) were utilised, and their outcomes were integrated
to increase accuracy. Testing on a dataset of 7500 ISL gestures showed that fusing data from both
sensors outperformed single sensor-based recognition, with accuracies of 97.85% and 94.55%
achieved for single and double-handed signs, respectively. The study emphasizes the robustness and
potential of the proposed multimodal framework for SLR systems. Brazilian Sign Language feature
extraction using RGBD sensors was proposed by Moreira Almeida et al. [24]. From RGB-D photos,
they derived seven vision-based traits and connected them to structural aspects based on hand
movement, shape, and location. Support vector machines (SVM) were used to classify signs, and the
average accuracy was above 80%. By characterising each sign language's unique phonological
structure, the concept may be applied to other sign language systems and shows potential for SLR
systems. RGB-D sensors have the potential to improve image processing algorithms and recognise
hand gestures. Amin et al. [25] conducted a comparative review on the applications of different
sensors for SLR. Their review focused on various techniques and sensors used for SLR, with an
emphasis on sensor-based smart gloves for capturing hand movements. The authors analyzed
existing systems, categorized authors based on their work, and discussed trends and deficiencies in
SLR. The comparative analysis provides valuable insights for researchers and offers guidance for
developing translation systems for different sign languages. Additionally, the review emphasizes the
potential of generated datasets from these sensors for gesture recognition tasks. In their study,
Rajalakshmi et al. [26] proposed a hybrid deep neural net methodology for recognizing Indian and
Russian sign gestures. The system used a variety of methods to extract multi-semantic features like
non-manual components and manual co-articulations, such as 3D deep neural net with atrous
convolutions, attention-based BiLSTM, and modified autoencoders. The hDNN-SLR achieved
accuracies of 98.75%, 98.02%, 97.94%, and 97.54% for the respective WLASL datasets, surpassing
other baseline architectures. B. Vision-based approach In a study by Matyáš Boháček et al. [1], a transformer-based neural network was employed for word-level SLR. The authors achieved an
accuracy of 63.18% on the WLASL dataset and 100% on the LSA64 dataset, focusing on pose-based
recognition using transformers. Another study by Atyaf Hekmat Mohammed Ali et al. [2] developed a
real-time SLR system using a Convolutional Neural Network (CNN) with the SqueezeNet module for
feature extraction. They achieved 100% accuracy in off-time testing and 97.5% accuracy in real-time
using the ASL dataset. This approach emphasized real-time recognition and feature extraction using
SqueezeNet. A forthcoming paper by Rajalakshmi E et al. [3] proposed a hybrid approach combining
transformer-based neural networks and CNNs for continuous SLR and translation. Although accuracy
results are not yet reported, this approach stands out for its use of a hybrid neural network and
translation capabilities, leveraging the ISLW dataset and the Phoenix14T Weather dataset. In a study
by Aashir Hafeez et al. [4], a SLR system was developed using various deep learning techniques,
including Artificial Neural Network (ANN), K-Nearest Neighbor (KNN), SVM, and CNN. Accuracy rates
of 41.95%, 60.79%, 84.18%, and 88.38% were achieved using the ASL dataset, focusing on improving
accuracy rates using multiple deep learning techniques. Katerina Papadimitriou et al. [5] proposed an
SLR system using modulated graph convolutional networks. While no accuracy results were reported,
this approach stood out for its utilization of 3D convolutions and modulated graph convolutional
networks. Kaushal Goyal et al. [6] developed a SLR system using LSTM and CNN, achieving accuracy
rates of 85% and 97% using the ISL dataset. This approach distinguished itself using LSTM and CNN
for SLR. Soumen Das et al. [8] developed an occlusion-robust SLR system using keyframe + VGG19 +
BiLSTM, achieving an accuracy of 94.654% on a publicly available ISL dataset. This approach focused on occlusion-robust recognition and utilized keyframes. C J Sruthi et al. [9] developed a deep
learning-based SLR system using CNN, achieving an accuracy of 98.6% with the ISL dataset. This
approach stood out for its focus on the Indian sign language and the high accuracy achieved using a
CNN-based approach. This system holds potential for enhancing accessibility and communication for the deaf community in India.
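As a concrete reference point for the CNN-based systems surveyed above, the following is a minimal PyTorch sketch of a static-gesture classifier over fixed-size image crops; the 64x64 input resolution, layer sizes, and 36-class output are illustrative assumptions and do not reproduce any specific model discussed here.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 36  # illustrative: e.g. digits plus letters of a static alphabet

# A small CNN over 64x64 RGB gesture crops; all sizes are illustrative.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 64 -> 32
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32 -> 16
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
    nn.Flatten(),
    nn.Linear(128 * 8 * 8, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, NUM_CLASSES),
)

# One training step on a dummy batch to show the intended usage.
images = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, NUM_CLASSES, (8,))
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()
```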
III. SIGN LANGUAGE RECOGNITION TECHNIQUES ANALYSIS
SLR research encompasses various techniques and methodologies to improve accuracy and performance in
recognizing sign gestures. One approach is the utilization of Transformer-based Neural Networks, as
demonstrated by Matyáš Boháček et al. [1]. Their work involved employing a Transformer-based
Neural Network for SLR and achieved an accuracy of 63.18% on the WLASL dataset and a perfect
accuracy of 100% on the LSA64 dataset. Similarly, Rajalakshmi E et al. [3] incorporated a Transformer-
based Neural Network along with CNN in their SLR framework. While the specific accuracy is not
mentioned, they used the Word-level ISL (ISLW) dataset and Phoenix14T Weather dataset for
evaluation. Convolutional Neural Networks (CNN) have proven to be effective in SLR. Atyaf Hekmat
Mohammed Ali et al. [2] employed CNN with the SqueezeNet module for feature extraction in their
research. Their model achieved an outstanding accuracy of 100% in off-time testing and an
impressive accuracy of 97.5% in real-time testing on the ASL dataset. Similarly, C J, Sruthi et al. [9]
leveraged CNN and obtained a remarkable accuracy of 98.6% on the ISL dataset. Another study by K.
Nimisha et al. [14] involved the use of YOLO, PCA, SVM, ANN, and CNN for SLR, achieving an accuracy
of 90% on the ASL dataset. Zhou, H., Zhou et al. [15] proposed a multi-cue framework for SLR and
translation, employing CNN as part of their approach, and achieved an accuracy of 95.9%. Long
Short-Term Memory (LSTM) networks have also been applied in SLR research. Kaushal Goyal et al. [6]
utilized LSTM and CNN in their framework. The LSTM network achieved an accuracy of 85%, while
the CNN achieved an accuracy of 97% on the ISL dataset. Sensor-based approaches have shown
promise in SLR. Khomami et al. [21] developed a wearable system using sEMG and IMU sensors,
combined with a KNN classifier, for Persian Sign Language (PSL) recognition. Their research achieved
an accuracy of 95.03% (SD: 0.76%) and an improved accuracy of 96.13% (SD: 0.46%) on the PSL
dataset. Additionally, Kumar et al. [23] incorporated Microsoft Kinect and Leap motion sensors to
capture hand movements and employed an HMM and BLSTM classifier. On the ISL dataset, they
obtained 97.85% accuracy for single-handed signs and 94.55% accuracy for double-handed signs.
Other techniques in SLR include hybrid approaches and unique methodologies. Aloysius, N. et al. [10]
utilized a hybrid approach combining PCNN and GM, focusing on frame-label alignment techniques.
Their specific accuracy is not mentioned. Rajalakshmi et al. [26] proposed a 3D deep neural net with
atrous convolutions and Attention-based Bi-LSTM for Indo-Russian Sign Language recognition. Their
research yielded impressive accuracies of 98.75%, 98.02%, 97.94%, and 97.54% on the WLASL100,
WLASL300, WLASL1000, and WLASL2000 datasets, respectively. These diverse techniques and
methodologies reflect the ongoing efforts in the field of SLR to improve the accuracy and
performance of sign gesture recognition systems. Each approach brings unique contributions and
advancements, contributing to the overall progress of SLR research.
TABLE I: ANALYSIS OF SIGN LANGUAGE DETECTION TECHNIQUES
Paper | Techniques/Observations | Year | Accuracy Achieved | Dataset
Matyáš Boháček et al. [1] | Transformer-based neural network | 2022 | WLASL: 63.18%, LSA64: 100% | WLASL, LSA64
Atyaf Hekmat Mohammed Ali et al. [2] | CNN with SqueezeNet module for feature extraction | 2022 | 100% in off-time testing, 97.5% in real time | ASL (a)
Rajalakshmi E et al. [3] | Transformer-based neural network, CNN | 2023 | - | Word-level ISL (ISLW), Phoenix14T Weather
Aashir Hafeez et al. [4] | ANN, KNN, SVM, CNN | 2023 | 41.95%, 60.79%, 84.18%, 88.38% | ASL (a)
Katerina Papadimitriou et al. [5] | 3D-CNN model | 2023 | - | AUTSL (c), ITI GSL (d)
Kaushal Goyal et al. [6] | LSTM, CNN | 2023 | 85%, 97% | ISL (b)
Gero Strobel et al. [7] | TNN | 2023 | - | Video dataset
Soumen Das et al. [8] | Keyframe + VGG-19 + BiLSTM | 2023 | 94.65% | ISL (b), publicly available
C J Sruthi et al. [9] | CNN | 2019 | 98.6% | ISL (b)
Aloysius, N. et al. [10] | Hybrid PCNN and GM; frame-label alignment technique of deep learning | 2020 | - | -
K. Amrutha et al. [11] | Convex hull feature extraction, KNN | 2021 | 65% | ASL (a), ISL (b)
R. Cui, H. Liu et al. [12] | CNN followed by temporal convolutional and pooling layers | 2019 | 91.93% | RWTH-Phoenix-Weather 2014
Papastratis, I. et al. [13] | 2D-CNN with weights learned from the ImageNet dataset | 2021 | - | RWTH-Phoenix-Weather 2014, CSL (f), GSL (d)
K. Nimisha et al. [14] | YOLO, PCA, SVM, ANN and CNN | 2021 | 90% | ASL (a)
Zhou, H., Zhou et al. [15] | Multi-cue framework for SLR and translation (STMC framework) | 2021 | 95.9% | -
Shirbhate, Radha S. et al. [16] | KNN, SVM for feature extraction | 2020 | 100% | ISL (b)
M. Xie et al. [17] | RNN | 2019 | 99.4% | ISL (b)
Selvaraj, P. et al. [18] | ST-GCN | 2022 | 94.7% | ISL (b)
Kumar, D. A. et al. [19] | Adaptive graph matching (intra-GM) to extract signs alone, discarding ME | 2018 | 98.32% | ISL (b)
Korban, Matthew et al. [20] | HMM and hybrid KNN-DTW algorithm | 2018 | 92.4% | PSL (e)
Khomami et al. [21] | sEMG and IMU sensors, KNN classifier | 2020 | 95.03% (SD: 0.76%), 96.13% (SD: 0.46%) | PSL (e)
Kudrinko et al. [22] | Wearable sensor-based system | 2020 | - | ASL (a), ISL (b)
Kumar et al. [23] | Microsoft Kinect and Leap Motion sensors to record hand movement, HMM and BLSTM classifiers | 2016 | 97.85% for single-handed signs, 94.55% for double-handed signs | ISL (b)
Moreira Almeida et al. [24] | RGB-D sensor, SVM classifier | 2014 | 80% | Brazilian Sign Language
Amin et al. [25] | Sensor-based smart gloves | 2022 | - | ASL (a)
Rajalakshmi et al. [26] | 3D deep neural net with atrous convolutions and attention-based Bi-LSTM | 2023 | 98.75% on WLASL100, 98.02% on WLASL300, 97.94% on WLASL1000, 97.54% on WLASL2000 | Indo-Russian Sign Language Dataset
(a) American Sign Language, (b) Indian Sign Language, (c) Ankara University Turkish Sign Language Dataset, (d) German Sign Language, (e) Persian Sign Language, (f) Chinese Sign Language
Based on extensive inquiry and
analysis, the authors provide several prospective routes for beginners, academics, and researchers to
take in order to further their work. Some areas that need improvement include:
• Tiny sample sizes: some researchers have trained and evaluated their models on tiny datasets, which may not fully reflect the variety of sign languages and signers.
• Cultural prejudice: several studies have concentrated on recognizing sign languages in particular cultural contexts, which may not be relevant to other cultures or nations.
• Measurement biases: the use of certain measurement techniques, which may not be completely trustworthy or valid, may have affected the findings of some research.
• Lack of control groups: some studies have not included a control group in their analyses, which makes it challenging to assess whether a model's success is primarily attributable to the algorithm or to other variables.
• Limited scope: studies that are restricted to a few sign languages or simple signs may not be relevant to other sign languages or more complicated signs.
IV. FUTURE SCOPE
• Accessibility: SLR models can help the community become more accessible by allowing the deaf population to communicate more effectively with hearing individuals.
• Education: sign language recognition models may be used to create instructional materials and software that make it easier for people to learn sign language.
• Healthcare: SLR models can be employed in healthcare settings to facilitate communication between patients who use sign language and healthcare professionals.
• Entertainment: SLR models may be used to provide more accessible entertainment material, such as sign language interpretation in motion pictures and television programmes.
• Interaction with autonomous systems: sign language recognition models can be used to develop robots and other autonomous systems that can interact with people who communicate primarily through sign language.
• Accessibility in public areas: sign language recognition models can be utilised to make public areas such as railway stations, airports, and other transportation hubs more accessible for those who use sign language.
• Services for sign language interpretation: by using SLR models, it is possible to provide more precise and effective sign language interpretation services while also lowering the demand for human interpreters.
V. CONCLUSION
SLR is a significant research area with implications for communication, accessibility,
and inclusion for the deaf community. The analysis of recent developments in SLR techniques
highlights a range of approaches, including glove-based and vision-based methods. Glove-based
approaches utilize wearable devices to capture hand movements and gestures, while vision-based
approaches leverage computer vision algorithms to analyze visual cues. Both approaches have shown
promising results in recognizing sign gestures. In the glove-based approach, wearable sensor-based
systems using sensors such as sEMG and IMUs have been developed. These systems can accurately
capture hand movements and gestures, leading to high recognition accuracies. However, challenges
still exist in terms of sign boundary detection and scalability to larger sign lexicons. Future research in
this area should focus on addressing these challenges and ensuring user comfort and acceptance.
Vision-based approaches, particularly those using deep learning techniques, have also achieved
significant progress in sign language recognition. CNNs have been widely used for feature extraction
and classification of sign gestures, demonstrating high accuracies. Other deep learning techniques
like LSTM networks and Transformer-based Neural Networks have also been explored, showing
promising results in capturing the sequential nature of sign gestures. Sensor-based approaches
utilizing RGB-D sensors, Microsoft Kinect, or Leap motion sensors have also contributed to sign
language recognition. These technologies provide valuable information about hand movements and
enable accurate feature extraction. However, they may come with additional hardware requirements
and cost considerations. Hybrid approaches that combine different methodologies, such as deep
neural networks, pulse-coupled neural networks (PCNN), graph matching (GM), and frame-label
alignment techniques, have also shown improved performance in recognizing sign gestures. These
hybrid approaches leverage the strengths of different techniques and provide a more comprehensive
solution. Overall, the field of SLR is advancing rapidly, driven by advancements in machine learning,
computer vision, and wearable technologies. Future research should focus on addressing the
challenges in sign boundary detection, scalability, user comfort, and acceptance. Additionally, efforts
should be made to include the perspectives and input of the deaf community in the development of
SLR systems to ensure their effectiveness and suitability for real-world applications. By developing
accurate and robust SLR systems, we can enhance communication accessibility for the deaf community, promote inclusivity, and facilitate their participation in various domains of life. VI.
ACKNOWLEDGMENT As we conclude this research, we gratefully acknowledge the invaluable support
and contributions of numerous individuals and organizations. We extend our sincere appreciation to
all those who have been part of this journey. We wish to express our deepest gratitude to Dr. S.K
Singh, our esteemed mentor, for his unwavering guidance, steadfast support, and constant
motivation throughout this endeavour. His wealth of expertise and experience were instrumental in
helping us achieve our objectives and surmount any obstacles that arose. Dr. S.K Singh has been an
invaluable member of our team, and we are profoundly grateful for his outstanding contributions.
REFERENCES [1] M. Boháček and M. Hrúz, "Sign Pose-based Transformer for Word-level Sign Language
Recognition," in 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops
(WACVW), 4-8 January 2022. DOI: 10.1109/WACVW54805.2022.00024. [2] Mohammedali, A. H.,
Abbas, H. H., & Shahadi, H. I. (2022). Realtime sign language recognition system. International
Journal of Health Sciences, 6(S4), 10384–10407. https://doi.org/10.53730/ijhs.v6nS4.12206. [3] E.
Rajalakshmi, R. Elakkiya, V. Subramaniyaswamy, K. Kotecha, M. Mehta, and V. Palade, "Continuous
Sign Language Recognition and Translation Using Hybrid Transformer-Based Neural Network,"
available at SSRN: https://ssrn.com/abstract=4424708 or http://dx.doi.org/10.2139/ssrn.4424708.
[4] A. Hafeez, S. Singh, U. Singh, P. Agarwal, and A. K. Jayswal, "Sign Language Recognition System
Using Deep-Learning for Deaf and Dumb," International Research Journal of Modernization in
Engineering Technology and Science, vol. 5, no. 4, pp., Apr. 2023. DOI: 10.56726/IRJMETS36063. [5]
K. Papadimitriou and G. Potamianos, "Sign Language Recognition via Deformable 3D Convolutions
and Modulated Graph Convolutional Networks," in Proceedings of the IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), 2023. [6] Dr. V. G and K. Goyal, “Indian Sign
Language Recognition Using Mediapipe Holistic.” arXiv, 2023. doi: 10.48550/ARXIV.2304.10256. [7]
Strobel, G., Schoormann, T., Banh, L., & Möller, F. (in press). Artificial Intelligence for Sign Language
Translation – A Design Science Research Study. Communications of the Association for Information
Systems, 52, pp-pp. Retrieved from https://aisel.aisnet.org/cais/vol52/iss1/33 [8] S. DAS, S. kr.
Biswas, and B. Purkayastha, “Occlusion Robust Sign Language Recognition System for Indian Sign
Language Using CNN and Pose Features.” Research Square Platform LLC, Apr. 20, 2023. doi:
10.21203/rs.3.rs-2801772/v1. [9] S. C J and L. A, “Signet: A Deep Learning based Indian Sign
Language Recognition System,” 2019 International Conference on Communication and Signal
Processing (ICCSP). IEEE, Apr. 2019. doi: 10.1109/iccsp.2019.8698006. [10] Aloysius, N., Geetha, M.
Understanding vision-based continuous sign language recognition. Multimed Tools Appl 79, 22177–
22209 (2020). https://doi.org/10.1007/s11042-020-08961-z [11] K. Amrutha and P. Prabu, "ML Based
Sign Language Recognition System," 2021 International Conference on Innovative Trends in
Information Technology (ICITIIT), 2021, pp. 1-6. [12] R. Cui, H. Liu and C. Zhang, "A Deep Neural
Framework for Continuous Sign Language Recognition by Iterative Training," in IEEE Transactions on
Multimedia, vol. 21, no. 7, pp. 1880-1891, July 2019. [13] Papastratis, I.; Dimitropoulos, K.; Daras, P.
Continuous Sign Language Recognition through a Context-Aware Generative Adversarial Network.
Sensors 2021, 21, 2437. [14] K. Nimisha and A. Jacob, "A Brief Review of the Recent Trends in Sign
Language Recognition," 2020 International Conference on Communication and Signal Processing
(ICCSP), 2020, pp. 186-190. [15] Zhou, H., Zhou, W., Zhou, Y., & Li, H. (2021). Spatial-Temporal Multi-
Cue Network for Sign Language Recognition and Translation. IEEE Transactions on Multimedia. [16]
Shirbhate, Radha S., Mr. Vedant D. Shinde, Ms. Sanam A. Metkari, Ms. Pooja U. Borkar and Ms.
Mayuri A. Khandge. “Sign language Recognition Using Machine Learning Algorithm.” (2020). [17] M.
Xie and X. Ma, "End-to-End Residual Neural Network with Data Augmentation for Sign Language
Recognition," 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control
Conference (IAEAC), 2019, pp. 1629-1633. [18] Selvaraj, P., Gokul N., C., Kumar, P., & Khapra, M.M.
(2022). OpenHands: Making Sign Language Recognition Accessible with Pose-based Pretrained
Models across Languages. ArXiv, abs/2110.05877. [19] Kumar, D. A., Sastry, A. S. C. S., Kishore, P. V. V.,
& Kumar, E. K. (2018). Indian sign language recognition using graph matching on 3D motion captured
signs. Multimedia Tools and Applications. [20] Korban, Matthew & Nahvi, Manoochehr. (2018). An
algorithm on sign words extraction and recognition of continuous Persian sign language based on
motion and shape features of hands. Formal Pattern Analysis & Applications. 21. 10.1007/s10044-
016-0579-2. [21] Khomami, S. A., & Shamekhi, S. (2021). Persian sign language recognition using IMU
and surface EMG sensors. In Measurement (Vol. 168, p. 108471). Elsevier BV.
https://doi.org/10.1016/j.measurement.2020.108471 [22] K. Kudrinko, E. Flavin, X. Zhu, and Q. Li,
“Wearable Sensor-Based Sign Language Recognition: A Comprehensive Review,” IEEE Reviews in
Biomedical Engineering, vol. 14. Institute of Electrical and Electronics Engineers (IEEE), pp. 82–97,
2021. doi: 10.1109/rbme.2020.3019769. [23] P. Kumar, H. Gauba, P. Pratim Roy, and D. Prosad Dogra,
“A multimodal framework for sensor based sign language recognition,” Neurocomputing, vol. 259.
Elsevier BV, pp. 21–38, Oct. 2017. doi: 10.1016/j.neucom.2016.08.132 [24] S. G. Moreira Almeida, F.
G. Guimarães, and J. Arturo Ramírez, “Feature extraction in Brazilian Sign Language Recognition
based on phonological structure and using RGB-D sensors,” Expert Systems with Applications, vol. 41,
no. 16. Elsevier BV, pp. 7259– 7271, Nov. 2014. doi: 10.1016/j.eswa.2014.05.024 [25] M. S. Amin, S.
T. H. Rizvi, and Md. M. Hossain, “A Comparative Review on Applications of Different Sensors for Sign
Language Recognition,” Journal of Imaging, vol. 8, no. 4. MDPI AG, p. 98, Apr. 02, 2022. doi:
10.3390/jimaging8040098. [26] E. Rajalakshmi et al., “Multi-Semantic Discriminative Feature
Learning for Sign Gesture Recognition Using Hybrid Deep Neural Architecture,” IEEE Access, vol. 11.
Institute of Electrical and Electronics Engineers (IEEE), pp. 2226–2238, 2023. doi:
10.1109/access.2022.3233671.