Sign language emotion and alphabet recognition with hand gestures using convolution neural network
Corresponding Author:
Varsha K. Patil
Department of Electronics and Telecommunication Engineering, AISSMS Institute of Information Technology
Pune, India
Email: [email protected]
1. INTRODUCTION
Nonverbal communication [1] allows individuals to convey a great deal without speaking. The message
in nonverbal communication is sent through facial expressions, poses, and movements of body parts.
These movements are called gestures [2]. When hard-of-hearing people interact with each other, they use a
well-structured form of communication called sign language. American sign language (ASL) is one of the
most popular sign languages, in which emotions and characters are expressed through hand movements that
follow a defined set of rules. Understanding sign language communication normally requires a trained person
who has acquired signing skills. To ease social interaction between hard-of-hearing people and people with
normal hearing, there is a need for a technical implementation that recognizes emotions automatically.
This paper presents the methods, performance factors, and results for automatically
recognizing ASL. The key to this communication lies in understanding the hand gestures involved. This article
aims to automatically recognize emotions and the alphabet from standardized hand gestures and positions. The
highlight of this work is that, to the best of our knowledge, it is the first article to present a method that
recognizes the happy, love, sad, peace, confused, and together emotions with 92% or higher accuracy.
2. LITERATURE SURVEY
This section is divided into two parts. The first part discusses the broad domain of signal
processing for ASL. The second part deals with methods for increasing the accuracy of sign language recognition.
2.2. Literature on methods for increasing the accuracy of ASL processing
This part lists methods for increasing the accuracy of ASL processing. Table 1 lists and briefly
describes the strategies proposed by various authors for improving the accuracy of automatic ASL processing,
including the use of large datasets and techniques such as deep learning and glove-based methods.
3. METHOD
From the literature survey in the previous section, the general pipeline of automatic sign language
recognition starts with capturing hand gesture images, followed by preprocessing. After preprocessing comes
the crucial step of the core algorithm; approaches such as transfer learning and deep learning are used in the
literature. In this article, the method used is a CNN. To interpret the meaning of hand gesture frames from
the input video, we implement sign recognition in three steps: i) dataset creation,
ii) real-time data capturing and feature extraction by the CNN method, and iii) training and testing.
Figure 1 summarizes these steps with a block diagram consisting of preprocessing, feature extraction,
training, and testing. The output is expressed in terms of performance parameters such as the confusion matrix,
score, and accuracy. The other form of system output is the classification of emotions and alphabets. The
system shown in Figure 1 works for the confused, love, together, sad, peace, and happy emotions. The
implementation steps of the proposed system shown in Figure 1 are elaborated as follows.
Step 1: dataset creation: Initially, the input hand gesture images are captured as per ASL guidelines;
a teachable machine is used for data capturing and augmentation. The emotion dataset contains 18,000 hand
gesture images categorized into 6 emotion classes. The total size of the created dataset is 94,000 images; of
these, 18,000 images are related to the 6 emotions and the remaining 76,000 images are related to the English
alphabet, with 3,000 images captured for each alphabet. The main point of dataset creation is that we followed
ASL guidelines for representing the alphabet with hand gestures. The alphabet and emotion images are kept in
separate folders for processing.
Each alphabet class that we created contains 3,000 images. The emotion folder has subfolders for the
confused, sad, happy, together, peace, and love classes. The labeled data in the alphabet folder covers 24 English
letters, excluding J and Z. To mitigate the limitations of this dataset size, we utilized data augmentation through
Keras's ImageDataGenerator, with a width shift range of 0.1, a height shift range of 0.1, a zoom range
of 0.2, a shear range of 0.1, and a rotation range of 10 degrees. These augmentations enhance the diversity of the
dataset and improve model generalization, with augmented image batches generated at a size of 20. These
preprocessing methods address the dataset size constraints and bolster the model's robustness. The
data is labeled into folders containing emotions and alphabets; in this way we created our own dataset.
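As a concrete illustration of this augmentation setup, the following sketch configures Keras's ImageDataGenerator with the parameter values listed above; the rescaling factor, the directory path "dataset/emotions", and the 32×32 target size are assumptions added for completeness rather than details taken from the paper.

```python
# Minimal sketch of the data augmentation described above (Keras ImageDataGenerator).
# Directory name, rescaling, and target size are illustrative assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rescale=1.0 / 255,        # normalize pixel values (assumed)
    rotation_range=10,        # rotation range of 10 degrees
    width_shift_range=0.1,    # width shift range of 0.1
    height_shift_range=0.1,   # height shift range of 0.1
    shear_range=0.1,          # shear range of 0.1
    zoom_range=0.2,           # zoom range of 0.2
)

# Generate augmented batches of 20 images from the emotion folder,
# assuming one subfolder per emotion class ("dataset/emotions" is hypothetical).
emotion_batches = augmenter.flow_from_directory(
    "dataset/emotions",
    target_size=(32, 32),     # matches the 32x32 CNN input size used later
    batch_size=20,
    class_mode="categorical",
)
```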
Step 2: real-time data capturing and feature extraction by the CNN method: the hand gestures indicating
alphabet and emotion signs are captured from the live video feed. Image preprocessing tasks such
as resizing and zooming are performed, and the features are then extracted by the CNN. A dataset of 94,000 images
(captured and processed by us) is used as a reference. The dataset creation and preprocessing process is
already explained in step 1. Figure 2 shows a representative picture of the proposed system for identifying hand
gestures, where a preprocessed image is provided as input to the CNN and the output image shows
recognition of the peace emotion with 98.95% accuracy.
As shown in Figure 2, we have designed a three-layer CNN for sign language that takes real-time
hand gestures as input. Our system preprocesses the input image by resizing and edge processing. The output
image shows the "Peace" emotion demonstrated by hand gestures, indicated with a bounding box
(region of interest (ROI)) in the emotion identification experiment. A high accuracy of 98.56% is
observed for this experiment.
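A minimal sketch of how such a real-time capture and preprocessing loop could look with OpenCV is given below; the fixed ROI coordinates, the window name, and the 32×32 input size are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of real-time hand-gesture capture and preprocessing (OpenCV).
# The fixed ROI coordinates and 32x32 input size are assumptions for illustration.
import cv2
import numpy as np

cap = cv2.VideoCapture(0)          # live video feed from the default camera
while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Bounding box (region of interest) around the hand; coordinates are illustrative.
    x, y, w, h = 100, 100, 200, 200
    roi = frame[y:y + h, x:x + w]

    # Preprocessing: resize to the CNN input size and normalize.
    sample = cv2.resize(roi, (32, 32)).astype(np.float32) / 255.0
    sample = np.expand_dims(sample, axis=0)      # shape (1, 32, 32, 3) for prediction

    # prediction = model.predict(sample)         # 'model' is the trained CNN from step 3

    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("Sign recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```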
The CNN processes input images of size 32×32 pixels with 3 color channels. In the
preprocessing step, the dataset is imported and the class labels are converted (encoded) for training. The CNN
has three convolutional layers: the first two layers use 32 filters of size 3×3 pixels with ReLU activation,
followed by a layer of 64 filters of size 3×3 pixels.
A MaxPooling2D layer with a pool size of 2×2 pixels is placed after the second convolutional layer;
this reduces the dimensionality of the feature map and helps avoid overfitting. The model thus includes
convolutional and pooling layers, followed by a dense layer with 512 units and a ReLU activation function
and an output dense layer with softmax activation. The network is trained with the
Adam optimizer using a learning rate of 0.001 and the categorical cross-entropy loss function. A dropout layer
with a rate of 0.25 is applied after the second convolutional layer to enhance generalization, and a further
dropout of 0.2 is used with the dense layer. Flattening the feature maps, adding a dense layer with
ReLU activation and dropout, and following it with a softmax layer are the significant design steps that add
nonlinearity and yield the best results. These steps are represented in Figures 3 and 4.
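The following Keras sketch assembles a model along the lines of this description (two 32-filter 3×3 convolutions, a 64-filter 3×3 convolution, 2×2 max pooling and 0.25 dropout after the second convolution, a 512-unit dense layer with 0.2 dropout, and a softmax output, compiled with Adam at a learning rate of 0.001 and categorical cross-entropy). The exact layer ordering and the number of output classes are assumptions where the text leaves them open.

```python
# Minimal sketch of the CNN described in the text; details the paper leaves open
# (e.g. exact dropout placement) follow one reasonable reading.
from tensorflow.keras import layers, models, optimizers

num_classes = 6  # 6 emotion classes (24 for the alphabet model)

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                  # 32x32 RGB input images
    layers.Conv2D(32, (3, 3), activation="relu"),     # first conv layer: 32 filters, 3x3
    layers.Conv2D(32, (3, 3), activation="relu"),     # second conv layer: 32 filters, 3x3
    layers.MaxPooling2D(pool_size=(2, 2)),            # 2x2 pooling after the second conv
    layers.Dropout(0.25),                             # dropout of 0.25 for generalization
    layers.Conv2D(64, (3, 3), activation="relu"),     # third conv layer: 64 filters, 3x3
    layers.Flatten(),
    layers.Dense(512, activation="relu"),             # dense layer with 512 units
    layers.Dropout(0.2),                              # dropout of 0.2 before the output
    layers.Dense(num_classes, activation="softmax"),  # softmax output layer
])

model.compile(
    optimizer=optimizers.Adam(learning_rate=0.001),   # Adam with learning rate 0.001
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```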
Our proposed model's general and specific features are shown in Figures 3 and 4. The steps are to import
the libraries, design the CNN with its different layers, and apply the feature extraction. As shown in Figure 3, the general
signal processing areas in image processing are manual localization, segmentation, morphology, and feature
extraction. We used the TensorFlow platform for sign language recognition with the CNN
approach. In our experiments with TensorFlow for hand signals, the following parameters are used for detection.
Step 3: training and testing: our model is trained on the hand gesture images for alphabet and emotion
recognition. The model offers two types of results: performance factors such as the confusion matrix and
accuracy, and the classification output itself. The data is split with a ratio of 80:20 for training and testing;
for each class, 2,400 images are used for training and 600 images for testing.
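A hedged sketch of this 80:20 split and the training call is shown below, reusing the model from the previous sketch; the use of validation_split to realize the split, the directory path, and the epoch count are illustrative assumptions.

```python
# Hedged sketch of the 80:20 train/test split and training.
# Directory path, epoch count, and the validation_split mechanism are assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)  # 80:20 split

train_gen = datagen.flow_from_directory(
    "dataset/emotions", target_size=(32, 32), batch_size=20,
    class_mode="categorical", subset="training",
)
test_gen = datagen.flow_from_directory(
    "dataset/emotions", target_size=(32, 32), batch_size=20,
    class_mode="categorical", subset="validation", shuffle=False,
)

# 'model' is the CNN defined in the earlier sketch; the epoch count is illustrative.
history = model.fit(train_gen, validation_data=test_gen, epochs=25)
```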
Figure 3. General steps flowchart for the sign language approach
Figure 4. Flowchart for CNN-based emotion recognition with the sign language approach
4. RESULTS
The next subsections focus on two key aspects: on-screen results and the confusion matrix.
They highlight the impact of the ROI and the number of epochs on performance. Multiple graphs illustrate their
effects on the accuracy and loss metrics, and these visualizations provide insight into model evaluation.
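To illustrate how these evaluation outputs can be produced, the following sketch plots epoch-versus-accuracy/loss curves from the Keras training history and builds a confusion matrix over the test split; it reuses the model, history, and test_gen names from the earlier sketches and is an assumption-based illustration, not the authors' exact evaluation code.

```python
# Hedged sketch of producing the reported outputs: epoch-vs-accuracy/loss curves
# and a confusion matrix. 'model', 'history', and 'test_gen' come from the earlier sketches.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Epoch vs. accuracy and epoch vs. loss (as in Figure 7).
plt.plot(history.history["accuracy"], label="train accuracy")
plt.plot(history.history["val_accuracy"], label="test accuracy")
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="test loss")
plt.xlabel("epoch")
plt.legend()
plt.show()

# Confusion matrix over the test split (as in Figure 12); test_gen uses shuffle=False
# so predictions align with test_gen.classes.
test_gen.reset()
probs = model.predict(test_gen)
y_pred = np.argmax(probs, axis=1)
y_true = test_gen.classes
cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm, display_labels=list(test_gen.class_indices)).plot()
plt.show()
```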
Figure 5. Emotions like "happy" (94.45%) and "together" (94.05%) are recognized from hand gestures
Figure 7. Graphs for alphabet and emotion for epoch vs loss and epoch vs accuracy
Figure 8. Alphabet’s sign: accuracy increases with the ROI for 24 alphabets
Figure 9. Emotion sign recognition: accuracy comparison with and without ROI
Figure 10. “R”, “W”, and “Q” alphabets recognized (e.g. “W” accuracy is 99.89%)
Figure 12. Confusion matrix for recognition of alphabets shown by hand gesture
5. DISCUSSION
We conducted a study using a self-created dataset comprising 94,000 hand gesture images to advance
the recognition of sign language alphabets and emotions. This dataset was processed using a CNN. While many
studies focus on alphabet recognition, few address emotion recognition via sign language, making this research
a pioneering effort in the field. Our system analyzes six emotions (confused, happy, sad, love, peace, and
together), evaluating accuracy with and without ROI application. Results show that incorporating the ROI
improves accuracy to up to 100%, while its absence reduces accuracy to as low as 60%. For alphabet recognition,
24 letters (excluding "J" and "Z") were analyzed, with the ROI significantly enhancing performance. Recognition
accuracy exceeded 95% for "P" and "Q", whereas it dropped to about 75% for "D" and "K".
The system achieved overall accuracy rates of 96%-98% for alphabets and 97%-99.8% for emotions.
However, limitations include dependency on factors such as camera angle, distance, and sign presentation
consistency. Future work could involve integrating non-invasive communication methods and advanced signal
processing to further optimize system efficiency and broaden its applicability in diverse real-world scenarios.
6. CONCLUSION
ASL can be understood by trained and skilled individuals; however, ASL emotions and alphabets are
not recognized by the layman. Hence there is a gap in the automatic recognition of the English alphabet using
ASL. In this article, hand gestures and movements are recognized with a CNN. The main contribution of this
work is that our system is capable of identifying emotions such as peace, togetherness, love, and happiness
with significant confidence (greater than 91%). We added the ROI to increase the accuracy.
Graphical and tabular output is presented for sign language-based emotion recognition. The ROI is a signal
processing and localization technique that places a bounding box around the hand. However, the system has
limitations in the repeatability of the readings: the very same accuracy is not observed after repeated readings.
The image characteristics influence the sequence and content of the sign language
processing steps. Additionally, the quality of the segmentation may depend on factors such as lighting
conditions, camera angle, and the complexity of the hand gesture being performed. Despite these
limitations, our system provides an average accuracy greater than 91%.
REFERENCES
[1] E. L. W. Keutchafo, J. Kerr, and O. B. Baloyi, “A model for effective nonverbal communication between nurses and older patients:
a grounded theory inquiry,” Healthcare, vol. 10, no. 11, 2022, doi: 10.3390/healthcare10112119.
[2] E. D. Stefani and D. D. Marco, “Language, gesture, and emotional communication: an embodied view of social interaction,”
Frontiers in Psychology, vol. 10, 2019, doi: 10.3389/fpsyg.2019.02063.
[3] M. Baca, P. Grd, and T. Fotak, “Basic principles and trends in hand geometry and hand shape biometrics,” New Trends and
Developments in Biometrics, 2012, doi: 10.5772/51912.
[4] S. K. Ko, J. G. Son, and H. Jung, “Sign language recognition with recurrent neural network using human keypoint detection,” in
The 2018 Research in Adaptive and Convergent Systems, RACS 2018, 2018, pp. 326–328, doi: 10.1145/3264746.3264805.
[5] T. Starner, J. Weaver, and A. Pentland, “Real-time American sign language recognition using desk and wearable computer based video,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1371–1375, 1998, doi: 10.1109/34.735811.
[6] W. Sandler, “Prosody and syntax in sign languages,” Transactions of the Philological Society, vol. 108, no. 3, pp. 298–328, 2010,
doi: 10.1111/j.1467-968X.2010.01242.x.
[7] L. Pigou, M. V. Herreweghe, and J. Dambre, “Sign classification in sign language corpora with deep neural networks,” in LREC -
7th Workshop on the Representation and Processing of Sign Languages: Corpus Mining, 2016, pp. 175–178.
[8] Y. Wang and L. Guan, “Recognizing human emotion from audiovisual information,” in ICASSP, IEEE International Conference
on Acoustics, Speech and Signal Processing - Proceedings, 2005, doi: 10.1109/ICASSP.2005.1415607.
[9] M. Kipp, A. Heloir, and Q. Nguyen, “Sign language avatars: animation and comprehensibility,” in Intelligent Virtual Agents, 2011,
pp. 113–126, doi: 10.1007/978-3-642-23974-8_13.
[10] R. K. Pathan, M. Biswas, S. Yasmin, M. U. Khandaker, M. Salman, and A. A. F. Youssef, “Sign language recognition using the
fusion of image and hand landmarks through multi-headed convolutional neural network,” Scientific Reports, vol. 13, no. 1, 2023,
doi: 10.1038/s41598-023-43852-x.
[11] M. Al-Qurishi, T. Khalid, and R. Souissi, “Deep learning for sign language recognition: current techniques, benchmarks, and open
issues,” IEEE Access, vol. 9, pp. 126917–126951, 2021, doi: 10.1109/ACCESS.2021.3110912.
[12] K. Li, Z. Zhou, and C. H. Lee, “Sign transition modeling and a scalable solution to continuous sign language recognition for real-
world applications,” ACM Transactions on Accessible Computing, vol. 8, no. 2, 2016, doi: 10.1145/2850421.
[13] D. Kothadiya, C. Bhatt, K. Sapariya, K. Patel, A. B. Gil-González, and J. M. Corchado, “Deepsign: sign language detection and
recognition using deep learning,” Electronics, vol. 11, no. 11, 2022, doi: 10.3390/electronics11111780.
[14] N. Al Mudawi et al., “Innovative healthcare solutions: robust hand gesture recognition of daily life routines using 1D CNN,”
Frontiers in Bioengineering and Biotechnology, vol. 12, 2024, doi: 10.3389/fbioe.2024.1401803.
[15] P. Vyavahare, S. Dhawale, P. Takale, V. Koli, B. Kanawade, and S. Khonde, “Detection and interpretation of Indian sign language
using LSTM networks,” Journal of Intelligent Systems and Control, vol. 2, no. 3, pp. 132–142, 2023, doi: 10.56578/jisc020302.
[16] I. A. Adeyanju, O. O. Bello, and M. A. Adegboye, “Machine learning methods for sign language recognition: a critical review and
analysis,” Intelligent Systems with Applications, vol. 12, 2021, doi: 10.1016/j.iswa.2021.200056.
[17] B. Subramanian, B. Olimov, S. M. Naik, S. Kim, K. H. Park, and J. Kim, “An integrated mediapipe-optimized GRU model for
Indian sign language recognition,” Scientific Reports, vol. 12, no. 1, 2022, doi: 10.1038/s41598-022-15998-7.
[18] X. Jiang, M. Lu, and S. H. Wang, “An eight-layer convolutional neural network with stochastic pooling, batch normalization and
dropout for fingerspelling recognition of Chinese sign language,” Multimedia Tools and Applications, vol. 79, no. 21–22, pp.
15697–15715, 2020, doi: 10.1007/s11042-019-08345-y.
[19] H. Zahid et al., “A computer vision-based system for recognition and classification of Urdu sign language dataset for differently
abled people using artificial intelligence,” Mobile Information Systems, vol. 2023, pp. 1–17, 2023, doi: 10.1155/2023/1060135.
[20] H. K. Vashisth, T. Tarafder, R. Aziz, M. Arora, and Alpana, “Hand gesture recognition in Indian sign language using deep learning,”
in Engineering Proceedings, 2023, doi: 10.3390/engproc2023059096.
[21] A. Akoum and N. Al Mawla, “Hand gesture recognition approach for ASL language using hand extraction algorithm,” Journal of
Software Engineering and Applications, vol. 8, no. 8, pp. 419–430, 2015, doi: 10.4236/jsea.2015.88041.
[22] S. Mohsin, B. W. Salim, A. K. Mohamedsaeed, B. F. Ibrahim, and S. R. M. Zeebaree, “American sign language recognition based on transfer
learning algorithms,” International Journal of Intelligent Systems and Applications in Engineering, vol. 12, no. 5s, pp. 390–399, 2024.
[23] S. Srivastava, R. Jaiswal, R. Ahmad, and V. Maddheshiya, “Sign language recognition,” SSRN Electronic Journal, 2024, doi:
10.2139/ssrn.4778501.
[24] W. Lin, C. Li, and Y. Zhang, “Interactive application of data glove based on emotion recognition and judgment system,” Sensors,
vol. 22, no. 17, 2022, doi: 10.3390/s22176327.
[25] D. Satybaldina and G. Kalymova, “Deep learning based static hand gesture recognition,” Indonesian Journal of Electrical
Engineering and Computer Science, vol. 21, no. 1, pp. 398–405, 2021, doi: 10.11591/ijeecs.v21.i1.pp398-405.
[26] United Nations, “International day of sign languages,” United Nations. Accessed: Jun. 15, 2024. [Online]. Available:
https://www.un.org/en/observances/sign-languages-day
BIOGRAPHIES OF AUTHORS
Varsha K. Patil received an M.Tech. in electronics in 2009 from the Government
College of Engineering, Pune. She has filed 02 patents and 1 copyright in the technical field. She
has published more than 25 papers. She has been invited as a resource person for 10 faculty
development programs. Her areas of interest are image processing, AIoT, the internet of things,
and electronics in agriculture. She has worked to establish the Center of Excellence of Texas
Innovation Lab and IEEE Affordable Agriculture Lab. She has worked on two research grant-
funded projects and one QIP Grant. She is a life member of ISTE and IETE. She can be contacted
at email: [email protected].
Vijaya R. Pawar completed her M.E. in electronics in 2002 from Shivaji University,
Kolhapur. Bharati Vidyapeeth Deemed University, Pune, awarded her a Ph.D. degree in
electronics engineering in 2015. She has teaching experience of 28 years and research experience
of 10 years. She has filed 02 patents and 2 copyrights in the technical field. She has published
more than 47 papers, of which 28 papers are in international journals. Her 25 papers are indexed
in Scopus and 12 papers are indexed in SCI. She was invited as session chair for 12 national and
international conferences. She has also been invited as a resource person by 06 colleges for
delivering sessions on “Digital signal processing” and “Biomedical signal
processing”. She has received one research grant and 08 development grants. She is a life
member of ISTE, IEI, and IETE. She can be contacted at email: [email protected].
Vinayak Bairagi completed his M.E. in electronics in 2007. The University of Pune
awarded him a Ph.D. degree in engineering in 2013. He has teaching experience of 17 years
and research experience of 12 years. He has filed 12 patents and 5 copyrights in the technical
field. He has published more than 81 papers, of which 38 papers are in international journals.
His 71 papers are indexed in Scopus and 32 papers are indexed in SCI. He was invited as session
chair for 21 national and international conferences. He has also been invited as a resource person
by 31 colleges for invited talks. He has received four research grants. He was Chair of the IEEE
Signal Processing Society Pune chapter (2020-23). He can be contacted at email:
[email protected].