Conference Paper · October 2022 · DOI: 10.1109/ICRITO56286.2022.9964940
Indian Sign Language Recognition System for Dynamic Signs

Ankita Wadhawan, Department of Computer Science and Engineering, Lovely Professional University, Jalandhar, Phagwara, Punjab, India, [email protected]
Ajay Vikram Singh, Department of Computer Science and Engineering, Amity University Uttar Pradesh, Noida, India, [email protected]
Usha Mittal, Department of Computer Science and Engineering, Lovely Professional University, Jalandhar, Phagwara, Punjab, India, [email protected]
Ahmed Al Ahdal, Department of Computer Science and Engineering, Lovely Professional University, Jalandhar, Phagwara, Punjab, India, [email protected]
Manik Rakhra, Department of Computer Science and Engineering, Lovely Professional University, Jalandhar, Phagwara, Punjab, India, [email protected]

Abstract— Sign language is a means of
communication utilising manual gestures (movement of hands and wrists) and non-manual gestures
(expressions of face and body language). There are many different sign languages in the world, each
with its own collection of words and signs. This study focuses on the implementation of an Indian Sign Language Recognition System (ISLRS), which helps deaf people communicate with other persons. In this paper, a model for Sign Language Recognition (SLR) of dynamic signs using a Convolutional Neural Network (CNN) is proposed. The proposed model has been trained and tested on video clips of dynamic signs and achieved a training accuracy of 70%. This study should serve as a road map for deciding which model to implement and lay a foundation for future research and for enhancing model accuracy, allowing the sign language community to communicate and share their ideas more effectively. This work also helps to reduce the educational gap among hearing-impaired persons.
Keywords—Sign Language Recognition, Classification, Feature Extraction, Machine Learning, Deep Learning, Artificial Neural Network (ANN), Support Vector Machine (SVM).

I. INTRODUCTION
Language is crucial in communication and interaction between two or more people; however, deaf individuals rely on Sign Language (SL) because they do not have the gift of speech. The
deaf and dumb community rely heavily on sign language to communicate with other persons. Instead
of lipreading or reading facial emotions, sign language is a good choice for them to consider.
Additionally, they use nonverbal cues such as facial expressions to better convey their emotions, making it easier for them to express their feelings. The various steps used to carry out the process
of recognition and making the system understandable by the computer are data pre-processing,
segmentation, feature extraction, pattern matching, class recognition, computer vision, natural
language processing (NLP), and selecting a suitable model. The models are designed in such a manner that the recognition system can be deployed in real time for improved human-to-human and human-computer interaction, which also reduces the educational gap among hearing-impaired persons. It also makes the hearing-impaired people more independent
and self-reliant in public spaces like buses, markets, hotels, offices etc. [15-24]. The main challenge
faced by the researchers during this research is collecting data from people with different backgrounds. This is the most time-consuming part of the whole work because, without data, the model cannot be trained or tested. The second challenge is selecting between a deep learning and a machine learning approach, then choosing a model to implement; if its accuracy turns out to be very low, another model is tried or the accuracy is improved.

II. TYPES OF SIGN
LANGUAGE AND THEIR CLASSIFICATION Sign language was invented in the early 1970s with different
signs for alphabets, numbers, words, and symbols. The signs are conveyed using hands, wrists, and
non-manual expressions. The orientation of the hands and wrists specifies different signs and letters. Sign language gestures are classified into 3 categories, i.e., signs with one hand, non-manual signs, and signs with both hands; these are further categorized into static and dynamic signs as shown in Figure 1. Single-handed signs are associated with a single dominant hand and can be static or dynamic; static signs are those that do not involve any motion of the hand, whereas dynamic signs involve the motion of the hands or other body parts [25-32]. Double-handed signs are also divided into static and dynamic signs, which are further subdivided into T1 and T0, i.e., Type1 and Type0. T1 signs use the dominant hand more than the non-dominant hand, and T0 is vice versa. Non-manual signs are those which are expressed without using hands, instead using facial expressions, body postures, or mouth gestures [33-38].

Figure 1: Indian Sign Language hierarchy.

III. LITERATURE REVIEW AND PLANNING
Figure
2 depicts the various criteria used in the comparison. Our plan for the literature review is to
compare machine learning and deep learning techniques using isolated signs (single sign gestures)
versus continuous signs (a sentence using ISL), the number of signs, the total dataset size, feature
extraction technique to extract hidden characteristics, classification technique to classify the data in
the respective sign, and the accuracy of the model they used. The main objective of this section is to
go through latest research articles on Indian Sign Language (ISL) recognition systems, thus we only
considered papers with the keyword ISL from 2017 to 2021. Figure 2: Reviewing classification IV.
MACHINE LEARNING TECHNIQUES
In 2017, Ekbote and Mahasweta Joshi developed an Indian sign language categorization system using support vector machines and neural networks. The features in this research were extracted using different techniques like Scale Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), and shape descriptors. The experiments were
performed on 10 isolated single-handed signs (0 to 9 numerals) and classified using ANN and SVM.
The experimental findings revealed that both SVM and ANN had a 93% accuracy [1]. In 2017, Jadav
and Rokade developed an ISLRS using 26 one handed and double-handed alphabetical signs and
extracted characteristics using Distance Transformation, Discrete Fourier Transform, and the central
probability distribution property, then categorised them using ANN and SVM. The results of the
experiments revealed that ANN outperformed SVM [2]. In 2017, Hore et al. developed an optimized
neural network based Sign Language Recognition System (SLRS). They gathered 22 one-handed and
double-handed gestures. They used a Genetic algorithm technique and Euclidean distance for
extracting the characteristics, and K-nearest neighbour (KNN) and Euclidean distance for
classification. The NNPSO beat the other techniques in the experiments, with 99.96 accuracies,
precision of 99.98, 98.29 recall, 99.63 F-Measure, and kappa statistics to be 0.9956 [3]. Rao and
Kishore proposed selfie video-based continuous ISLRS in 2018. They have collected and used 50
continuous single-handed signs. They extracted the features using Sobel adaptive thresholding and morphological differencing; for classification, the models used are Minimum Distance and Artificial Neural Network. The results show that the ANN performed best, with 90% accuracy [4]. Shenoy et al.
(2018) proposed a "Real-time Indian Sign Language (ISL) Recognition" system. They used isolated and continuous signs comprising 33 hand postures and 12 gestures, employing single and double hands. Grid-based Feature Extraction, Feature Vector, and HOG are the feature extraction techniques employed, while
the classification model used is KNN, Hidden Markov Model (HMM). The results reveal a 99.7%
accuracy for postures with static hand and a 97.23% accuracy for the recognition gestures [7].
Mariappan and Gomathi developed an SLRS to recognize signs in real time in 2019. They collected 80 word signs and 50 continuous sentences for their dataset using both single and double hands. For feature extraction they used an orientation histogram and an eigenvector-based technique, and the classification technique used is the eigenvalue-weighted Euclidean distance technique. The classification accuracy rate thus obtained is 86% [8]. Athira et al. presented a signer-independent SLRS with
elimination of co-articulation from live videos in 2019. They collected 46 isolated ISL signs and used SVM for classification. Experimental results show that SVM produced an accuracy of 92.88% [10]. In 2020,
Raghuveera et al. proposed a depth-based ISLRS using Microsoft Kinect. They have collected 140
unique isolated and continuous gestures using single-hand and double-hand. Feature extraction has
been done by Speeded Up Robust Features, HOG, and Local Binary Patterns; and the collected signs
were categorized by using SVM. Experimental results state that the model produced an accuracy up
to 71.85% [12]. Badhe and Kulkarni proposed an artificial neural network based ISLRS using
handcrafted features' in 2020. They have collected 46 different continuous signs using single and
double hand; for feature extraction the techniques used are HOG, Principal Curvature Based Region
(PCBR), and Wavelet Packet Decomposition (WPD), Principal Component Analysis (PCA), or Scale
Invariant Feature Transform (SIFT). The classification technique used is Artificial Neural Network
(ANN) and after rigorous work secured training accuracy 98%, validation accuracy 63%. Author
Isolated/ Continuous Single/ Double Handed Signs Total Dataset Size Feature Extraction Technique
Classification Technique Accuracy Ekbote and Mahasweta (2017) Isolated Singlehanded 10 Signs (0 to
9 numerals) 1000 images, SIFT and HOG Artificial Neural Networks (ANN) and Support Vector
Machine (SVM) 93% Jadav and Rokade (2017) Isolated Both Signs of 26 Alphabets 260 images are
distance transformation, Discrete Fourier Transform, Probability distribution property that is central
moments Artificial Neural Network and SVM 94.37% (ANN), 92.12% (SVM) Hore et al. (2017) Isolated
Both 22 ISL gestures 220 images Genetic algorithm and Euclidean distance . K-nearest neighbour and
Euclidean distance 99.96 accuracy, 99.98 precision, 98.29 recall, 99.63 FMeasure, and 0.9956 Kappa
Statistic Rao and Kishore (2018) Continuous Single-handed 50 Signs 2500 samples Morphological
differencing and sobel threshold Minimum Distance, Artificial Neural Network 85.58% (MDC) and
90% (ANN) TABLE 1. MACHINE LEARNING TECHNIQUES V. DEEP LEARNING-BASED TECHNIQUES Rao
et al. (2018) proposed a "Deep Convolutional Neural Networks for Sign Language Recognition". They
employed a Convolution Neural Network (CNN) with four convolutional layers, two stochastic pooling
layers, and Softmax regression to classify 200 continuous sign images taken with single and double
hands. It has a recognition rate of 92.88 percent, according to experimental results [5]. "Indian Sign
Language Numeral Recognition Using Region of Interest Convolutional Neural Network" was
proposed by Sajanraj T D and Beena M V in 2018. They gathered 36 isolated single and double-handed signs, using a CNN as the classification model together with the HSV colour space and Contrast Limited Adaptive Histogram Equalization (CLAHE). The experiment produced 99.56% accuracy, and 97.26% accuracy in low-light conditions [6]. In 2019, Ankita Wadhawan and Parteek Kumar suggested
a "Deep learning-based static sign language identification system." They employed single and double
hands to gather 100 static continuous signs, and the model they used was a CNN with max-pooling.
On coloured and grayscale images, the experiment yielded the highest training accuracy of 99.72 %
and 99.90 %, respectively [9]. Neel Kamal Bhagat, Vishnusai Y, Rathna G N proposed “Indian Sign
Language Gesture Recognition using Image Processing and Deep Learning” in 2019. They collected
36 isolated static gestures and 10 dynamic signs using single and double hands; the classification model used is CNN, Long Short-Term Memory (LSTM), and Connectionist Temporal Classification, along with handcrafted feature techniques like Scale Invariant Feature Transform (SIFT) and Gradient Hough Transform (GHT). The experiment produced an accuracy of 97.71% [11]. In 2021, R. Dhiman, G. Joshi,
and C. Rama Krishna proposed "A deep learning strategy for classification of Indian sign language
gestures with varied backgrounds." They employed single and double hands to gather 15 isolated signs; the classification model was a CNN with a softmax layer, using different optimization techniques and activation functions. The accuracy for NUS dataset-I, NUS dataset-II, and the mixed dataset is 100 percent, 95.95 percent, and 97.22 percent, respectively [14].

TABLE 1. MACHINE LEARNING TECHNIQUES (continued)
Shenoy et al. (2018) | Both | Both | 33 hand poses and 12 gestures | 24,624 images | Grid-based Feature Extraction, Feature Vector, Histogram of Oriented Gradients (HOG) | KNN and HMM | 99.7% (static hand postures) and 97.23% (gestures)
Mariappan and Gomathi (2019) | Continuous | Both | 80 word signs and 50 sentence signs | 800 (words) + 500 (sentences) = 1300 | Orientation Histogram, eigenvector-based technique | Eigenvalue-weighted Euclidean distance technique | 86%
Athira et al. (2019) | Both | Both | 46 ISL signs | 900 static images and 700 videos | HOG, HOEF and Zernike moments | Support Vector Machine (SVM) | 92.88%
Raghuveera et al. (2020) | Both | Both | 140 unique gestures | 4600 images | HOG and Local Binary Patterns | Support Vector Machine (SVM) | Accuracy up to 71.85%
Badhe and Kulkarni (2020) | Continuous | Both | 46 different signs | 500 videos of 10 different sign gestures, 235 images of 36 signs | HOG, PCBR and WPD, PCA, SIFT | Artificial Neural Network | Training accuracy 98%, validation accuracy 63%

TABLE 2. DEEP LEARNING-BASED TECHNIQUES
Author | Isolated/Continuous | Single/Double Handed | Signs | Total Dataset Size | Feature Extraction Technique | Classification Technique | Accuracy
Rao et al. (2018) | Continuous | Both | 200 signs | 120000 sign images | Softmax regression | CNN | 92.88%
Sajanraj and Beena (2018) | Isolated | Both | 36 signs | 3000 images | CLAHE (Contrast Limited Adaptive Histogram Equalization) | CNN, HSV colour space | 99.56% and 97.26% (low-light condition)
Wadhawan and Kumar (2019) | Continuous | Both | 100 static signs | 35,000 sign images | Max-pooling | CNN | 99.90%
Bhagat et al. (2019) | Isolated | Both | 36 static gestures, 10 dynamic signs | 45,000 images from a depth camera and an RGB camera, including images without background segmentation | Handcrafted feature techniques like SIFT, Gradient Hough Transform (GHT) | CNN, LSTM, Connectionist temporal classification | 97.71%
Dhiman et al. (2021) | Isolated | Both | 15 signs | 2240 images | Convolutional Neural Network (CNN) model softmax layer | CNN | 100% (NUS dataset-I), 95.95% (NUS dataset-II), and 97.22% (mixed dataset)
Venugopalan and Reghunadhan (2021) | Continuous | Double-handed | 9 hand gesture classes | 900 video samples of nine hand gesture classes | CNN, GoogleNet | Bidirectional LSTM (BiLSTM) | 76.21%

Adithya Venugopalan and Rajesh Reghunadhan proposed “Applying deep neural networks for
the automatic recognition of sign language words: A communication aid to deaf agriculturists” in
2021. They have collected 9 continuous double-handed gestures; the classification model used is a
CNN, GoogleNet, and bidirectional LSTM (BiLSTM) sequence network. The assessment produced an
average classification accuracy of 76.21% [15].

VI. PROPOSED METHODOLOGY
1. DATA ACQUISITION
The initial stage of the research is data collection, which is the most significant aspect of the experiment. We acquired dynamic video data of 50 signs, as shown in Figure 3 (Figure 3: the 50 signs we collected). The signs were recorded using mobile phones from 18 different people, with each sign performed 50 times (18 x 50 x 50); in total more than 25,000 sign videos were collected and sorted into their respective folders, as shown in Figure 4 (Figure 4: people performing the signs).

2. WORKFLOW AND MODEL
Initially, as discussed earlier, we collect the videos from 18 people and sort them into their respective folders. Then we pre-process and normalize the data. The system flow is given in Figure 5 (Figure 5: system flow).
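To make this data-acquisition and sorting step concrete, the Python sketch below loads such a folder-per-sign video collection with OpenCV, sampling a fixed number of frames per clip and normalising them. The folder layout (dataset/<sign>/<clip>.mp4), frame count, frame size, and function names are illustrative assumptions rather than the authors' exact pipeline.

```python
import os
import cv2
import numpy as np

# Assumed layout: dataset/<sign_name>/<clip>.mp4  (hypothetical paths)
DATASET_DIR = "dataset"
FRAME_SIZE = (64, 64)      # assumed input resolution for the CNN
FRAMES_PER_CLIP = 30       # assumed fixed number of frames sampled per video

def load_clip(path, n_frames=FRAMES_PER_CLIP, size=FRAME_SIZE):
    """Read a sign video, sample n_frames evenly, resize and normalise to [0, 1]."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = set(np.linspace(0, max(total - 1, 0), n_frames, dtype=int))
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i in idxs:
            frame = cv2.resize(frame, size)
            frames.append(frame.astype(np.float32) / 255.0)
        i += 1
    cap.release()
    return np.stack(frames) if frames else None

def load_dataset(root=DATASET_DIR):
    """Walk the per-sign folders and return (clips, integer labels)."""
    clips, labels = [], []
    for label, sign in enumerate(sorted(os.listdir(root))):
        for fname in os.listdir(os.path.join(root, sign)):
            clip = load_clip(os.path.join(root, sign, fname))
            if clip is not None and len(clip) == FRAMES_PER_CLIP:
                clips.append(clip)
                labels.append(label)
    return np.array(clips), np.array(labels)
```

A conventional train/test partition (for example with scikit-learn's train_test_split) can then be applied to the returned arrays before training.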
We have prepared a CNN model, picking up the best features from the above-listed papers. The procedure we followed is to generate the dataset, partition it into training and testing sets, and create a model with ReLU activations, pooling layers, and a softmax layer; further details of the model are given below. Figure 6: Created model. Figure 7: Layers and features of the model.
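The paper states only that the model uses ReLU activations, pooling, and a softmax layer. As a hedged illustration, the Keras sketch below builds one plausible CNN for fixed-length frame sequences (a small 3D CNN); the input shape, layer sizes, and optimiser are assumptions and do not reproduce the exact architecture of Figure 6.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_SIGNS = 50                   # 50 dynamic signs in the collected dataset
INPUT_SHAPE = (30, 64, 64, 3)    # assumed (frames, height, width, channels)

def build_model(input_shape=INPUT_SHAPE, num_classes=NUM_SIGNS):
    """A small 3D CNN: ReLU convolutions, max-pooling, and a softmax output."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv3D(16, kernel_size=3, activation="relu", padding="same"),
        layers.MaxPooling3D(pool_size=2),
        layers.Conv3D(32, kernel_size=3, activation="relu", padding="same"),
        layers.MaxPooling3D(pool_size=2),
        layers.Conv3D(64, kernel_size=3, activation="relu", padding="same"),
        layers.MaxPooling3D(pool_size=2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage with the (clips, labels) arrays from the data-loading sketch above:
# model = build_model()
# model.fit(clips, labels, validation_split=0.2, epochs=20, batch_size=8)
```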
VII. RESULT ANALYSIS
Analyzing the data on ISL research in recent times, up to 2021, as shown in Figure 8, it is seen that the most papers were published in 2018 and 2019 and the fewest in 2021. Figure 8: Trendline of papers published over the years. From the data
summary, it is seen that isolated and continuous signs are used roughly equally over the years, but datasets combining both are not used much, as shown in Figure 9. Figure 9: Research work on isolated and continuous signs. The datasets used mostly contain both single-handed and double-handed signs, covering both static and dynamic gestures, as shown in Figure 10. Figure 10: Research work for single/double-handed signs. Figure 11 gives a pie-chart representation of all the classification techniques used in the reviewed ISL recognition work. It is seen that the CNN model is used the most among deep learning techniques, with a classification accuracy of more than 96% in the studies above, while ANN is used the most among machine learning techniques, with an accuracy of more than 90%. This clearly suggests that the CNN model is the better choice. Figure 11: Research work on techniques used for recognition of signs.

VIII.
CONCLUSION AND FUTURE SCOPE
This paper analysed 15 research articles on Indian sign language published in recent times up to 2021, which offer a wide range of applications that will benefit the deaf community. It aims to summarise research on Indian Sign Language recognition based on isolated/continuous,
single/double-handed motions, signs, total dataset size, feature extraction approach, classification
technique, and accuracy. This article offers sufficient information to allow anyone to get a head start on their work without having to repeat it; the following are the findings:
• It has been observed that
the most papers were published in the years 2018 and 2019, and that the number of papers released
after that has declined, and that individuals have begun to research using deep-learning approaches
more frequently than machine-learning techniques in the past. • The majority of people have utilised
both single and double-handed motions for their dataset, which was filmed using a camera or a
mobile phone utilising both single and double hands isolated and continuous. • According to the
findings, deep-learning based approaches such as CNN outperformed other approaches in terms of
feature extraction and classification with an accuracy of over 95%, which is exceptional, whereas
other traditional machine-learning techniques produced lower accuracy.
Using the model we constructed based on the features we found in the articles, we were able to achieve a training accuracy of 70%. It is lower because the videos were filmed in various lighting and background situations with smartphones of varying camera quality, and some of the clips were horizontal while others were vertical, making feature extraction much more difficult; however, this reflects a genuine real-world situation. By eliminating the noise in the videos, and with the use of GPUs, better feature extraction techniques, and data standardisation, the model's accuracy and performance can be improved. This research also helps hearing-impaired people to come forward, get an education, and interact with other persons in society. It also leads to a reduction in the educational gap that exists among hearing-impaired persons in society.

References
[1] J. N. Cheltha C, M. Rakhra, R. Kumar, and H. Walia, “A Review
on Data hiding using Steganography and Cryptography,” 2021 9th Int. Conf. Reliab. Infocom Technol.
Optim. (Trends Futur. Dir. ICRITO 2021, pp. 5–8, 2021, doi: 10.1109/ICRITO51393.2021.9596531. [2]
K. F. Tasneem et al., “Affordable black box: A Smart Accident Detection System for cars,” 2021 9th Int.
Conf. Reliab. Infocom Technol. Optim. (Trends Futur. Dir. ICRITO 2021, pp. 3–7, 2021, doi:
10.1109/ICRITO51393.2021.9596106. [3] S. Takkar et al., “Advanced ATM Security System Using
Arduino Uno,” 2021 9th Int. Conf. Reliab. Infocom Technol. Optim. (Trends Futur. Dir. ICRITO 2021, pp.
1–5, 2021, doi: 10.1109/ICRITO51393.2021.9596341. [4] N. Kaur, N. Nazir, and Manik, “A Review of
Local Binary Pattern Based texture feature extraction,” 2021 9th Int. Conf. Reliab. Infocom Technol.
Optim. (Trends Futur. Dir. ICRITO 2021, pp. 9–12, 2021, doi: 10.1109/ICRITO51393.2021.9596485. [5]
A. Srivastava, “Analysis of Cluster-Based and Position-based Routing Protocol in VANET,” pp. 1–5,
2021. [6] S. Thorupunoori et al., “Camera Based Drunks Detection Mechanism Integrated with D-L
(Deep Learning),” 2021 9th Int. Conf. Reliab. Infocom Technol. Optim. (Trends Futur. Dir. ICRITO 2021,
pp. 21–24, 2021, doi: 10.1109/ICRITO51393.2021.9596283. [7] Isha, S. Dixit, M. K. Ahirwar, D.
Sakethnath, and M. Rakha, “Stock Prediction by Analyzing the Past Market Trend,” 2021 9th Int. Conf.
Reliab. Infocom Technol. Optim. (Trends Futur. Dir. ICRITO 2021, pp. 4–7, 2021, doi:
10.1109/ICRITO51393.2021.9596263. [8] M. Rakhra and R. Singh, “Economic and Social Survey on
Renting and Hiring of Agricultural Equipment of Farmers in Punjab,” 2021 9th Int. Conf. Reliab.
Infocom Technol. Optim. (Trends Futur. Dir. ICRITO 2021, pp. 1–5, 2021, doi:
10.1109/ICRITO51393.2021.9596343. [9] C. C. Prajapati, H. Kaur, and M. Rakhra, “Role of IoT and Fog
Computing in Diagnosis of Coronavirus (COVID-19),” 2021 9th Int. Conf. Reliab. Infocom Technol.
Optim. (Trends Futur. Dir. ICRITO 2021, pp. 1–6, 2021, doi: 10.1109/ICRITO51393.2021.9596203. [10]
A. A. Ahdal, M. Rakhra, S. Badotra and T. Fadhaeel, "An integrated Machine Learning Techniques for
Accurate Heart Disease Prediction," 2022 International Mobile and Embedded Technology
Conference (MECON), 2022, pp. 594-598, doi: 10.1109/MECON53876.2022.9752342. [11] Sushovan
Chaudhury, Manik Rakhra, Naz Memon, Kartik Sau, Melkamu Teshome Ayana, "Breast Cancer
Calcifications: Identification Using a Novel Segmentation Approach", Computational and
Mathematical Methods in Medicine, vol. 2021, Article ID 9905808, 13 pages, 2021.
https://doi.org/10.1155/2021/9905808. [12] Murugesan G., Tousief Irshad Ahmed, Jyoti Bhola,
Mohammad Shabaz, Jimmy Singla, Manik Rakhra, Sujeet More, Issah Abubakari Samori, "Fuzzy Logic-
Based Systems for the Diagnosis of Chronic Kidney Disease", BioMed Research International, vol.
2022, Article ID 2653665, 15 pages, 2022. https://doi.org/10.1155/2022/2653665 [13] Manik Rakhra,
Amitabh Bhargava, Deepshikha Bhargava, Ramandeep Singh, Astha Bhanot, Abdul Wahab Rahmani,
"Implementing Machine Learning for SupplyDemand Shifts and Price Impacts in Farmer Market for
Tool and Equipment Sharing", Journal of Food Quality, vol. 2022, Article ID 4496449, 19 pages, 2022.
https://doi.org/10.1155/2022/4496449 [14] Manik Rakhra, Sumaya Sanober, Noorulhasan Naveed
Quadri, Neha Verma, Samrat Ray, Evans Asenso, "Implementing Machine Learning for Smart Farming
to Forecast Farmers’ Interest in Hiring Equipment", Journal of Food Quality, vol. 2022, Article ID
4721547, 17 pages, 2022. https://doi.org/10.1155/2022/4721547. [15] Manik Rakhra, Ramandeep
Singh, Tarun Kumar Lohani, Mohammad Shabaz, "Metaheuristic and Machine Learning-Based Smart
Engine for Renting and Sharing of Agriculture Equipment", Mathematical Problems in Engineering,
vol. 2021, Article ID 5561065, 13 pages, 2021. https://doi.org/10.1155/2021/5561065 [16] M.
Rakhra, N. Verma, and S. Bhatia, “Structural, Morphological, Optical, Electrical and Agricultural
Properties of Solvent/ZnO Nanoparticles in the Photodegradation of DR-23 Dye,” J. Electron. Mater.,
vol. 49, no. 1, pp. 643–649, 2020, doi: 10.1007/s11664-019- 07760-z. [17] Rakhra, M., Verma, N. Tri-
doped co-annealed zinc oxide semi-conductor synthesis and characterization: photodegradation of
dyes and gas sensing applications. J Mater Sci: Mater Electron 32, 17588–17601 (2021).
https://doi.org/10.1007/s10854-021-06292-9 [18] P. Thapar, M. Rakhra and A. Singh, "Comparing
Image Feature Extraction Methods Using Dermoscopy Noisy Images," 2022 International Mobile and
Embedded Technology Conference (MECON), 2022, pp. 559-562, doi:
10.1109/MECON53876.2022.9751935. [19] Shruti, R. Soumya, M. Rakhra, N. S. Chintagunti, B. Singh
and D. Singh, "Modern Data Mining Approach to Handle Multivariate Data and to Implement Best
Saving Services for Potential Investor," 2022 International Mobile and Embedded Technology
Conference (MECON), 2022, pp. 611-616, doi: 10.1109/MECON53876.2022.9752101. [20] M. Rakhra
et al., "An Analytical Study of the Types of Implements used by Farmers in Mechanized Agriculture,"
2022 International Mobile and Embedded Technology Conference (MECON), 2022, pp. 683-687, doi:
10.1109/MECON53876.2022.9751983 [21] A. A. Ahdal, M. Rakhra, S. Badotra and T. Fadhaeel, "An
integrated Machine Learning Techniques for Accurate Heart Disease Prediction," 2022 International
Mobile and Embedded Technology Conference (MECON), 2022, pp. 594-598, doi:
10.1109/MECON53876.2022.9752342. [22] Arun Malik, Gayatri Vaidya, Vishal Jagota, Sathyapriya
Eswaran, Akash Sirohi, Isha Batra, Manik Rakhra, Evans Asenso, "Design and Evaluation of a Hybrid
Technique for Detecting Sunflower Leaf Disease Using Deep Learning Approach", Journal of Food
Quality, vol. 2022, Article ID 9211700, 12 pages, 2022. https://doi.org/10.1155/2022/9211700 [23] L.
Gundaboina, S. Badotra, S. Tanwar and Manik, "Reducing Resource and Energy Consumption in
Cryptocurrency Mining by using both Proof-of-Stake Algorithm and Renewable Energy," 2022
International Mobile and Embedded Technology Conference (MECON), 2022, pp. 605-610, doi:
10.1109/MECON53876.2022.9752365. [24] Monika Hooda, Chhavi Rana, Omdev Dahiya, Jayashree
Premkumar Shet, Bhupesh Kumar Singh, "Integrating LA and EDM for Improving Students Success in
Higher Education Using FCN Algorithm", Mathematical Problems in Engineering, vol. 2022, Article ID
7690103, 12 pages, 2022. https://doi.org/10.1155/2022/7690103 [25] W. Yan, M. Shabaz, and M.
Rakhra, “Research on Nonlinear Distorted Image Recognition Based on Artificial Neural Network
Algorithm,” J. Interconnect. Networks, pp. 1–12, 2022, doi: 10.1142/s0219265921480029. [26] Neha
Verma, Vishal Jagota, Arnold C. Alguno, Alimuddin, Manik Rakhra, Pawan Kumar, Betty Nokobi
Dugbakie, "Characterization of Fabricated Gold-Doped ZnO Nanospheres and Their Use as a
Photocatalyst in the Degradation of DR-31 Dye", Journal of Nanomaterials, vol. 2022, Article ID
7532332, 8 pages, 2022. https://doi.org/10.1155/2022/7532332 [27] M. Rakhra and R. Singh,
“Materials Today : Proceedings A study of machinery and equipment used by farmers to develop an
uberized model for renting and sharing,” Mater. Today Proc., no. xxxx, 2020, doi:
10.1016/j.matpr.2020.11.784. [28] D. Venkatesh and M. Rakhra, “Agile adoption issues in large scale
organizations: A review,” Mater. Today Proc., no. xxxx, 2020, doi: 10.1016/j.matpr.2020.11.308. [29]
G. S. Panesar, D. Venkatesh, M. Rakhra, K. Jairath, and M. Shabaz, “Agile software and business
development using artificial intelligence,” Ann. Rom. Soc. Cell Biol., vol. 25, no. 2, pp. 1851–1857,
2021. [30] M. Arora et al., “Agile Umbrella Methodologies and its Global Impact Introduction to
Scrum,” vol. 25, no. 4, pp. 2990–3003, 2021. [31] N. Verma, M. Rakhra, M. W. Bhatt, and U. Garg,
“Engineering technology characterization of source solution for ZnO and their data analytics effect
with aloe vera extract,” Neurosci. Informatics, vol. 2, no. 3, p. 100015, 2022, doi:
10.1016/j.neuri.2021.100015. [32] M. Rakhra et al., “Materials Today : Proceedings Crop Price
Prediction Using Random Forest and Decision Tree Regression : -A Review,” Mater. Today Proc., no.
xxxx, 2021, doi: 10.1016/j.matpr.2021.03.261. [33] P. Rattan, M. Arora, M. Rakhra, V. Goel, A. Mishra,
and M. Shabaz, “A Neoteric Approach of Prioritizing Regression Test Suites Using Hybrid ESDG
Models,” vol. 25, no. 4, pp. 2965–2973, 2021. [34] W. Yan, M. Shabaz, and M. Rakhra, “Research on
Nonlinear Distorted Image Recognition Based on Artificial Neural Network Algorithm,” J.
Interconnect. Networks, pp. 1–12, 2022, doi: 10.1142/s0219265921480029. [35] M. Rakhra and R.
Singh, “Materials Today : Proceedings Smart data in innovative farming,” Mater. Today Proc., no. xxxx,
2021, doi: 10.1016/j.matpr.2021.01.237. [36] Neha Verma, Vishal Jagota, Alimuddin, Arnold C.
Alguno, Manik Rakhra, Chanthirasekaran K., Betty Nokobi Dugbakie, "Morphological, Structural, and
Optical Properties of Doped/Codoped ZnO Nanocrystals Film Prepared by Spin Coating Technique
and Their Gas Sensing Application", Journal of Nanomaterials, vol. 2022, Article ID 6250706, 10
pages, 2022. https://doi.org/10.1155/2022/6250706 [37] Neha Verma, Vishal Jagota, Arnold C.
Alguno, Alimuddin, Manik Rakhra, Pawan Kumar, Betty Nokobi Dugbakie, "Characterization of
Fabricated Gold-Doped ZnO Nanospheres and Their Use as a Photocatalyst in the Degradation of DR-
31 Dye", Journal of Nanomaterials, vol. 2022, Article ID 7532332, 8 pages, 2022.
https://doi.org/10.1155/2022/7532332 [38] A. A. Ahdal, D. Prashar, M. Rakhra and A. Wadhawan,
"Machine Learning-Based Heart Patient Scanning, Visualization, and Monitoring," 2021 International
Conference on Computing Sciences (ICCS), 2021, pp. 212-215, doi: 10.1109/ICCS54944.2021.00049.
Real-time Indian Sign Language (ISL) Recognition

Kartik Shenoy, Tejas Dastane, Varun Rao, Devendra Vyavaharkar
Department of Computer Engineering, K. J. Somaiya College of Engineering, University of Mumbai, Mumbai, India
[email protected], [email protected], [email protected], [email protected]

Abstract—This paper presents a system which
can recognise hand poses & gestures from the Indian Sign Language (ISL) in real-time using grid-
based features. This system attempts to bridge the communication gap between the hearing and
speech impaired and the rest of the society. The existing solutions either provide relatively low
accuracy or do not work in real-time. This system provides good results on both the parameters. It
can identify 33 hand poses and some gestures from the ISL. Sign Language is captured from a
smartphone camera and its frames are transmitted to a remote server for processing. The use of any
external hardware (such as gloves or the Microsoft Kinect sensor) is avoided, making it user-friendly.
Techniques such as Face detection, Object stabilisation and Skin Colour Segmentation are used for
hand detection and tracking. The image is further subjected to a Grid-based Feature Extraction
technique which represents the hand's pose in the form of a Feature Vector. Hand poses are then
classified using the kNearest Neighbours algorithm. On the other hand, for gesture classification, the
motion and intermediate hand poses observation sequences are fed to Hidden Markov Model chains
corresponding to the 12 pre-selected gestures defined in ISL. Using this methodology, the system is
able to achieve an accuracy of 99.7% for static hand poses, and an accuracy of 97.23% for gesture
recognition. Keywords—Indian Sign Language Recognition; Gesture Recognition; Sign Language
Recognition; Grid-based feature extraction; k-Nearest Neighbours (k-NN); Hidden Markov Model
(HMM); Kernelized Correlation Filter (KCF) Tracker; Histogram of Oriented Gradients (HOG). I.
INTRODUCTION Indian Sign Language (ISL) is a sign language used by hearing and speech impaired
people to communicate with other people. The research presented in this paper pertains to ISL as
defined in the Talking Hands website [1]. ISL uses gestures for representing complex words and
sentences. It contains 33 hand poses including 10 digits, and 23 letters. Amongst the letters in ISL,
the letters ‘h’, ‘j’ are represented by gestures and the letter ‘v’ is similar to digit 2. The system is
trained with the hand poses in ISL as shown in Fig. 1. Most people find it difficult to comprehend ISL
gestures. This has created a communication gap between people who understand ISL and those who
do not. One cannot always find an interpreter to translate these gestures when needed. To facilitate
this communication, a potential solution was implemented which would translate hand poses and
gestures from ISL in real-time. It comprises an Android smartphone camera to capture hand poses
and gestures, and a server to process the frames received from the smartphone camera. The
purpose of the system is to implement a fast and accurate recognition technique. Fig. 1. Hand poses
in ISL The system described in this paper successfully classifies all the 33 hand poses in ISL. For the
initial research, gestures containing only one hand were considered. The solution described can be
easily extended to two-handed gestures. In the next section of this paper, the related work
pertaining to sign language translation is discussed. Section III explains the techniques used to
process each frame and translate a hand pose/gesture. Section IV discusses the experimental results
after implementing the techniques discussed in section III. Section V describes the Android
application developed for the system that enables real-time Sign Language Translation. Section VI
discusses the future work that can be carried out in ISL translation. II. RELATED WORK There has
been considerable work in the field of Sign Language recognition with novel approaches towards
gesture recognition. Different methods such as use of gloves or Microsoft Kinect sensor for tracking
hand, etc., have been employed earlier. A study of many different existing systems has been done to design a system that is more efficient and robust than the rest. A Microsoft Kinect sensor is used in [2] for
recognising sign languages. The sensor creates depth frames; a gesture is viewed as a sequence of
these depth frames. T. Pryor et al [3] designed a pair of gloves, called SignAloud which uses
embedded sensors in gloves to track the position and movement of hands, thus converting gestures
to speech. R. Hait-Campbell et al [4] developed MotionSavvy, a technology that uses Windows tablet
and Leap Motion accelerator AXLR8R to recognise the hand, arm skeleton. Sceptre [5] uses Myo
gesture-control armbands that provide accelerometer, gyroscope and electromyography (EMG) data
for signs & gestures classification. These hardware solutions provide good accuracy but are usually expensive and are
not portable. Our system eliminates the need of external sensors by relying on an Android phone
camera. Now for software-based solutions, there are coloured glove based [6, 7] and skin colour-
based solutions. R. Y. Wang et al [6] have used multi-coloured glove for accurate hand pose
reconstruction but the sign demonstrator, while demonstrating the sign language, has to wear this
each time. Skin colour-based solutions may use RGB colour space with some motion cues [8] or HSV
[9, 10, 11], YCrCb [12] colour space for luminosity invariance. G. Awad et al [13] have used the initial
frames of the video sequence to train the SVM for skin colour variations for the further frames. But
to speed up the skin segmentation, they have used Kalman filter for further prediction of position of
skin coloured objects thus reducing the search space. Z. H. Al-Tairi et al [14] have used YUV and RGB
colour space for skin segmentation, and the colour ranges that they have used handle a good variation of skin tones across people. After obtaining the segmented hand image, A. B. Jmaa et al [12] have used the rules
defined in the hand anthropometry study of comparative measurements of human body for
localizing and eliminating the palm. They have then used the rest of the segmented image containing
only fingers to create skin-pixel histogram with respect to palm centroid. This histogram is fed to
decision tree classifier. In [15], from the segmented hand image, hand contour was obtained, which
was then used for fitting a convex hull and convexity defects were found out. Using this, the fingers
were identified and the angles between the adjacent ones were determined. This feature set of
angles was fed to SVM for classification. [10] have used distance transform to identify hand centroid
followed by elimination of palm and using angles between fingers for classification. Fourier
Descriptors have been used to describe hand contours by [8, 16]. [16] has used RBF on these Fourier
Descriptors for hand pose classification. S. C. Agarwal et al [17] have used a combination of
geometric features (eccentricity, aspect ratio, orientation, solidity), Histogram of Oriented Gradients
(HOG) and Scale Invariant Feature Transform (SIFT) key points as feature vectors. The accuracy obtained using geometric features drops considerably as the number of hand poses increases. [18] has
used Local Binary Patterns (LBP) as features. Our paper is mainly inspired from [9]. They have trained
the k-NN model using the binary segmented hand images directly. This technique provides great
speed when combined with fast indexing methods, thus making it suitable for real-time applications.
But to handle the variations in hand poses, more data needs to be captured. With the use of grid-
based features in our system, the model will become more user-invariant. For gesture recognition,
hand centroid tracking is done which provides motion information [16]. Gesture recognition can be
done using the Finite State Machine [19] which has to be defined for each gesture. C. Y. Kao et al [20]
have used 2 hand gestures for training HMM that will be used for gesture recognition. They defined
directive gestures such as up, left, right, down for their 2 hands and a time series of these pairs was
input to the HMM for gesture recognition. C. W. Ng et al [16] used a combination of HMM and RNN
classifiers. The HMM Gesture recognition that we have used in our system is mainly inspired from
[16]. They were using 5 hand poses and the same 4 directive gestures. This 9-element vector was
used as input to the HMM classifier. Training of HMM was done using Baum-Welch re-estimation
formulas. III. IMPLEMENTATION Using an Android smartphone, gestures and signs performed by the
person using ISL are captured and their frames are transmitted to the server for processing. To make
the frames ready for recognition of gestures and hand poses, they need to be pre-processed. The
pre-processing first involves face removal, stabilisation and skin colour segmentation to remove
background details and later morphology operations to reduce noise. The hand of the person is
extracted and tracked in each frame. For recognition of hand poses, features are extracted from the
hand and fed into a classifier. The recognised hand pose class is sent back to the Android device. For
classification of hand gestures, the intermediate hand poses are recognised and using these
recognised poses and their intermediate motion, a pattern is defined which is represented in tuples.
This is encoded for HMM and fed to it. The gesture whose HMM chain gives the highest score with
forward-backward algorithm is determined to be the recognized gesture for this pattern. An
overview of this process is described in Fig. 2. Fig. 2. Flow diagram for Gesture Recognition. A.
Dataset used For the digits 0 to 9 in ISL, an average of 1450 images per digit were captured. For
letters in ISL excluding ‘h’, ‘j’ and ‘v’, about 300 images per letter were captured. For the 9 gesture-
related intermediate hand poses such as Thumbs_Up, Sun_Up, about 500 images per pose were
captured. The dataset contains a total of 24,624 images. All the images consist of the sign
demonstrator wearing a full sleeve shirt. Most of these images were captured from an ordinary webcam and a few
of them were captured from a smartphone camera. The images are of varying resolutions. For
training HMMs, 15 gesture videos were captured for each of the 12 one-handed pre-selected
gestures defined in [1] (After, All The Best, Apple, Good Afternoon, Good Morning, Good Night, I Am
Sorry, Leader, Please Give Me Your Pen, Strike, That is Good, Towards). These videos have slight
variations in sequences of hand poses and hand motion so as to make the HMMs robust. These
videos were captured from a smartphone camera and also involve the sign demonstrator wearing a
full sleeve shirt. B. Pre-processing 1) Face detection and elimination The hand poses and gestures in
ISL can be represented by particular movements of the hands, and facial features are not necessary. Also, the face of the person creates an issue during the hand extraction process. To resolve this issue, face
detection was carried out using Histogram of Oriented Gradients (HOG) descriptors followed by a
linear SVM classifier. It uses an image pyramid and sliding window to detect faces in an image, as
described in [21]. HOG feature extraction combined with a linear classifier reduces false positive rates by more than an order of magnitude compared with the best Haar wavelet-based detector [21]. After
detection of face, the face contour region is identified, and the entire face-neck region is blackened
out, as demonstrated in Fig. 3 (Fig. 3. Face detection and elimination operation).
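One way to implement this face detection and elimination step is with dlib's frontal face detector, which is itself a HOG descriptor with a linear SVM applied over an image pyramid. The paper does not name a particular library, so the detector choice, the neck margin, and the helper name below are illustrative assumptions.

```python
import cv2
import dlib

# dlib's frontal face detector = HOG descriptor + linear SVM over an image pyramid
detector = dlib.get_frontal_face_detector()

def remove_face(frame_bgr, neck_margin=0.6):
    """Blacken the detected face (and an assumed neck margin below it) in a BGR frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)           # upsample once to catch smaller faces
    out = frame_bgr.copy()
    face_box = None
    for rect in faces:
        x1, y1, x2, y2 = rect.left(), rect.top(), rect.right(), rect.bottom()
        # extend downwards to also cover the neck region (margin is an assumption)
        y2_ext = min(out.shape[0], int(y2 + neck_margin * (y2 - y1)))
        out[max(0, y1):y2_ext, max(0, x1):x2] = 0
        face_box = (x1, y1, x2 - x1, y2 - y1)   # kept for initialising the tracker later
    return out, face_box
```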
2) Skin colour segmentation
To identify skin-like regions in the image, the YUV and RGB based skin colour
segmentation is used, which provides great results. This model has been chosen since it gives the
best results among the options considered: HSV, YCbCr, RGB, YIQ, YUV and few pairs of these colour
spaces [14]. The frame is converted from RGB to YUV colour space using the equation mentioned in [22]. This is specified in equation (1):

Y = 0.299R + 0.587G + 0.114B
U = -0.147R - 0.289G + 0.436B + 128
V = 0.615R - 0.515G - 0.100B + 128      (1)

[14] mentions the criteria for determining if a pixel is to be classified as skin using the following conditions:

80 < U < 130,  136 < V < 200,  V > U,  |R - G| > 15,  R > 80,  G > 30,  B > 15      (2)

The segmentation mask thus obtained contains less noise and fewer false positive results. An illustration of the segmentation mask is shown
in Fig. 4. Fig. 4. Skin segmentation mask and effect of morphology operations. (Left) Segmentation
mask; (Right) Mask after application of Morphology operations. 3) Morphology operations
Morphology operations were performed to remove any noise generated after skin colour
segmentation. There are 2 types of errors in skin colour segmentation: 1. Non-skin pixels classified as
skin 2. Skin pixels classified as non-skin Morphology involves 2 basic sub-operations: 1. Erosion: Here,
the active areas in the mask (which are white) are reduced in size 2. Dilation: Here, the active areas
in the mask (which are white) are increased in size Morphology Open operation handles the 1st
error. It involves erosion followed by dilation. The 2nd error is handled by Morphology Close
operation. This involves dilation followed by erosion. The result of applying morphology operations
can be seen in Fig. 4.
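A minimal OpenCV/NumPy sketch of the skin colour segmentation and morphology clean-up is given below, following equations (1) and (2) as reconstructed above; the kernel size and function name are assumptions.

```python
import cv2
import numpy as np

def skin_mask(frame_bgr, kernel_size=5):
    """Binary skin mask via the YUV + RGB rules of eq. (1)-(2), cleaned by morphology."""
    b, g, r = cv2.split(frame_bgr.astype(np.float32))
    # Equation (1): RGB -> YUV (Y is not used by the skin rule)
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.147 * r - 0.289 * g + 0.436 * b + 128
    v = 0.615 * r - 0.515 * g - 0.100 * b + 128
    # Equation (2): per-pixel skin conditions
    mask = ((u > 80) & (u < 130) &
            (v > 136) & (v < 200) &
            (v > u) &
            (np.abs(r - g) > 15) &
            (r > 80) & (g > 30) & (b > 15)).astype(np.uint8) * 255
    # Morphology: opening removes non-skin pixels classified as skin,
    # closing fills skin pixels classified as non-skin
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```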
4) Object stabilisation using Facial reference
For tracking hand motion accurately, a steady camera position is desired. Movement of the camera caused by shaky hands is
common. If the sign demonstrator, that is, the person using ISL does not move his hand but the
person capturing the video shakes his hand, false movements will get detected. This problem is
tackled using object stabilisation. Under the assumption that a person’s face is always included in the
gesture video, the face of the sign demonstrator is tracked to stabilise hand movements. The tracker
is initialised with the co-ordinates which were extracted from Face Detection before removing the
face. The tracker detects the location of the facial blob and shifts the entire frame in the opposite
direction of the detected motion of facial coordinates. The system uses the Kernelized Correlation
Filter (KCF) tracker implemented in the OpenCV library to track the face in each frame. The operation
of tracking is performed on the image before the face is blackened. Fig. 5 demonstrates object
stabilisation.
Fig. 5. (a) Actual motion of camera. The subject’s position with respect to the frame changes. (b)
Effect after stabilisation – The position of subject with respect to the frame remains constant.
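A rough sketch of this stabilisation step is shown below, assuming the opencv-contrib build of OpenCV, which exposes a KCF tracker (the exact constructor location varies between OpenCV versions); the frame is translated opposite to the detected motion of the tracked face, as described above. The class and parameter names are illustrative.

```python
import cv2
import numpy as np

class FaceStabiliser:
    """Shift each frame opposite to the tracked face motion (KCF tracker, opencv-contrib)."""

    def __init__(self, first_frame, face_box):
        # face_box = (x, y, w, h) from the face detector, before the face is blackened
        self.tracker = cv2.TrackerKCF_create()   # assumes opencv-contrib-python is installed
        self.tracker.init(first_frame, face_box)
        x, y, w, h = face_box
        self.ref = (x + w / 2.0, y + h / 2.0)    # reference face centre

    def stabilise(self, frame):
        ok, box = self.tracker.update(frame)
        if not ok:
            return frame                          # tracking lost: return frame unchanged
        x, y, w, h = box
        cx, cy = x + w / 2.0, y + h / 2.0
        dx, dy = self.ref[0] - cx, self.ref[1] - cy
        # translate the whole frame so the face stays where it was in the first frame
        m = np.float32([[1, 0, dx], [0, 1, dy]])
        h_img, w_img = frame.shape[:2]
        return cv2.warpAffine(frame, m, (w_img, h_img))
```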
C. Hand extraction and tracking
As all ISL hand poses and gestures can be represented using hand movements, hand extraction and tracking is an important part of the system. After pre-processing each
frame, a black and white image is obtained, where white areas represent skin. These skin areas do
not contain facial regions. They contain parts of the hand and other skin-like parts in the original
image. Since each frame contains only 1 hand (the other hand is not visible) or both hands are
touching each other, the only prominent contour present in the frame will be the person’s hand.
Thus, areas of all contours in the frame are calculated and the contour with the largest area is
extracted. Since the only prominent contour left is hand contour, the extracted contour is the hand
region. Fig. 6 illustrates the importance of face elimination operation. If face was not eliminated, the
face region would have been the largest contour area in the image, thus, being classified as hand as
shown in Fig. 6. Fig. 6. Importance of eliminating face before hand extraction. For tracking hand
motion, centroid of hand is calculated in each frame. If there is movement of hand, the co-ordinates
of centroid of hand will change. Slope of the line formed by the centroid of hand in the current frame
and centroid of hand in previous frame is then determined. Depending upon the value of slope, the
motion of hand was determined as follows:
• If -1 < slope < 1 and the difference between the x coordinates of both centroids is positive, the hand moved leftwards.
• If -1 < slope < 1 and the difference between the x coordinates of both centroids is negative, the hand moved rightwards.
• If |slope| > 1 and the difference between the y coordinates of both centroids is positive, the hand moved upwards.
• If |slope| > 1 and the difference between the y coordinates of both centroids is negative, the hand moved downwards.
The above
slope-based motion determination is illustrated in Fig. 7. It is to be noted that the hand motion
observed by the camera will be opposite to the actual motion performed by the sign demonstrator.
The system uses the OpenCV library to calculate area of contour and centroid of hand contour. To
reduce noise during hand tracking, an imaginary circle with a 20-pixel radius around the previous
hand centroid pixel is considered. If the new centroid lies within this radius, this shift is termed as
noise and motion is neglected. The previous co-ordinate considered during comparison is not
updated in this case. If the current centroid lies outside the 20-pixel threshold, this shift is considered
as movement of hand. In the subsequent frames, the radius is set to 7 pixels instead of 20 pixels until
there is no movement after which the radius is restored to 20 pixels. This use of an imaginary circle
reduces noise to a greater extent and gives highly accurate tracking of hand movements.
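Putting the largest-contour hand extraction, centroid computation, and slope rules together, a simplified OpenCV sketch might look as follows; the helper names are illustrative, and the 20-pixel/7-pixel noise radii from the text are passed in by the caller.

```python
import cv2
import numpy as np

def largest_contour_centroid(mask):
    """Return (contour, centroid) of the largest skin contour, assumed to be the hand."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None, None
    hand = max(contours, key=cv2.contourArea)
    m = cv2.moments(hand)
    if m["m00"] == 0:
        return hand, None
    return hand, (m["m10"] / m["m00"], m["m01"] / m["m00"])

def classify_motion(prev_c, cur_c, radius):
    """Map the centroid shift to a direction using the slope rules; None means no motion.

    radius: 20 px while the hand is still, 7 px while a movement is in progress (per the text).
    Directions follow the paper's convention: the observed motion is mirrored w.r.t. the signer.
    """
    dx, dy = cur_c[0] - prev_c[0], cur_c[1] - prev_c[1]
    if dx * dx + dy * dy < radius * radius:
        return None                                  # within the noise circle: no movement
    slope = dy / dx if dx != 0 else float("inf")
    if -1 < slope < 1:
        return "Leftwards" if dx > 0 else "Rightwards"
    return "Upwards" if dy > 0 else "Downwards"
```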
Fig. 7. Determining hand motion using slope. Fig. 8. The hand pose ‘a’ in ISL fragmented by a 3x3 grid. Fig. 9. ISL hand poses' data
visualised using PCA and TSNE dimensionality reduction. D. Feature extraction using Grid-based
fragmentation technique
Using an M x N grid, the extracted hand sample is fragmented into M*N blocks.
Using this grid, a feature vector is obtained containing M*N feature values where each block
provides a feature value. In each block, the feature value is calculated as the percentage of hand
contour present. This is specified in equation (3):

Feature value = (Area of hand contour within the block) / (Area of the block)      (3)

If no contour is found, the feature value is 0. In Fig. 8, a 3 x 3 grid is
constructed on a sample for the purpose of illustration. The advantage of using this approach is the
features generated vary with the orientation of each hand pose. Different hand poses occupy
different number of grids and different fragment areas. The feature vector thus accurately represents
the shape and position of the hand. Using these M*N features, each hand pose is represented by
different clusters. Fig. 9 supports these arguments. The data visualised in Fig. 9 was subjected to a 10
x 10 grid and 100 features were extracted per sample. Using Principal Component Analysis (PCA) and
t-Distributed Stochastic Neighbour Embedding (tSNE) [23], the dimensionality of this data was
reduced from 100 to 2. This was done by first applying PCA to reduce the dimensionality to 40
features and then t-SNE to reduce it further to 2. As seen in Fig. 9, separate clusters representing different orientations of each hand pose can be seen after visualising the extracted features.
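The grid-based fragmentation of equation (3) can be sketched as below for a binary hand image; the cropping to the hand's bounding box and the resize-based block splitting are implementation assumptions, and the 10 x 10 default follows the grid size selected in section IV.

```python
import cv2
import numpy as np

def grid_features(hand_mask, m=10, n=10):
    """Split the binary hand image into an m x n grid and return M*N per-block features.

    Each feature is the fraction of the block covered by the hand (eq. (3));
    blocks with no hand pixels contribute 0.
    """
    # crop to the bounding box of the hand so features depend on pose, not position
    ys, xs = np.nonzero(hand_mask)
    if len(xs) == 0:
        return np.zeros(m * n, dtype=np.float32)
    crop = hand_mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    crop = cv2.resize(crop, (n * 10, m * 10), interpolation=cv2.INTER_NEAREST)
    bh, bw = crop.shape[0] // m, crop.shape[1] // n
    features = []
    for i in range(m):
        for j in range(n):
            block = crop[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            features.append(np.count_nonzero(block) / block.size)
    return np.asarray(features, dtype=np.float32)
```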
E. Classification
1) Recognition of ISL Hand poses using k-NN
After observing the graph plotted in Fig. 9,
it can be observed that data is organised into clusters. There is more than one cluster for the same
hand pose. For classification, an algorithm was needed which can distinguish clustered data
efficiently. K-Nearest Neighbours (k-NN) was found suitable for such distribution of data. Hand
extracted from each frame of the live feed is subject to feature extraction using the previously
discussed Grid-based fragmentation. This sample is then represented in an M*N dimensional feature
map. Using Euclidean distance as a distance metric, k nearest samples, which are fitted previously in
the classifier, are computed. The distance computation can be performed using a brute force
approach, wherein Euclidean distance between the sample and each fitted sample in the classifier is
calculated and the k lowest distances are selected. Other optimal approaches include the KD-tree
and Ball Tree. The most suitable approach for distance computation is dependent on size of the data.
Brute force works well on small data size [24]. For low dimensional data, KD Tree works well whereas
for high dimensional data, Ball Tree works best [24]. From these k nearest neighbours, the classifier selects the class occurring most frequently.
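As a sketch of this classification step, the grid features can be fed to scikit-learn's k-NN implementation, which offers the brute-force, KD-tree, and ball-tree strategies discussed above; the value of k is an assumption, the 30% test split follows section IV, and the function name is illustrative.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X: array of grid feature vectors (one per hand image), y: hand-pose labels
def train_pose_classifier(X, y, k=5):
    """Fit a k-NN classifier on grid-based features (Euclidean distance)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)   # 30% held out, as in section IV
    clf = KNeighborsClassifier(n_neighbors=k,
                               metric="euclidean",
                               algorithm="auto")   # auto picks brute force / KD-tree / ball tree
    clf.fit(X_train, y_train)
    print("Hand-pose accuracy:", clf.score(X_test, y_test))
    return clf
```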
2) Gesture Classification using HMM
There are always some variations present while demonstrating gestures even if performed by the same sign
demonstrator. For handling such variations, some kind of statistical model is needed. Hidden Markov
Models (HMMs) are
one type of statistical model that can handle such variations well [25]. There are two types of HMMs:
continuous and discrete HMM. In continuous HMM, the number of observation symbols possible in
each element of the observation sequences is infinite, but in discrete HMM, it is finite. The HMM can
either be ergodic or left-to-right. In left-to-right HMM, the transition can occur only in 1 direction i.e.
if the HMM moves to the next state it cannot go back to the previous state as shown in Fig. 10. But in
ergodic HMM, transition is possible from any one state to any other state. The initial state
probabilities (π) and the transition probabilities for left-to-right HMM have been shown in Fig. 10.
Fig. 10. HMM chain for gesture with 3 hidden states (E.g. Good Afternoon) A human brain perceives
gestures as a combination of a few intermediate hand poses and hand movements executed in a
particular order. Using this idea, ISL gestures are composed of intermediate stationary hand poses and the hand movements between these. Thus, in this system, a discrete left-to-right HMM has been used. This uses the segmented hand centroid motion and pose classification results for gesture
classification of provided observation sequence as belonging to one of the 12 predefined gestures.
The input to this HMM is an observation sequence extracted from the video feed. The number of
observation symbols possible is the sum of the number of directions tracked and the number of
intermediate stationary hand poses with which the system is trained. In this system, there are 4 directions being tracked (as described in section III C) and the system is trained with 9 intermediate stationary hand poses such as ‘Thumbs_Up’, ‘Sun_Up’, and ‘Fist’, as shown in Fig. 11. Fig. 11. Thumbs_Up and Sun_Up stationary hand poses. The intermediate hand pose recognition also uses Grid-based
feature extraction. The intermediate hand pose recognition is similar to recognition of hand poses for
recognizing letters and digits but is carried out only when there is no movement. Thus, the total
number of observation symbols are 13 i.e. the observation sequence can have elements with values
0-12. At each frame, a tuple of the form <M, S> is generated, where M represents the Motion of the hand
relative to the previous frame and S represents the pose classified if there was no motion detected. If
motion is detected, the motion is mapped to the corresponding observation symbol ('Upwards':0,
'Rightwards':1, 'Leftwards':2, 'Downwards':3). If no motion is detected, the hand pose is accordingly
mapped. This mapping is predecided. Therefore, for each frame there will be an observation symbol
and the video sequence will generate an observation sequence which encodes motion and hand
pose information. For example, a time series of such tuples represents the gesture “Good Afternoon”. After encoding it into an observation sequence, we get [ 4, 4, 4, 0, 0, 0, 5, 5, 5 ]. A few
frames of this gesture are illustrated in Fig. 12. Fig. 12. 4 frames of Gesture “Good Afternoon” The
The gesture recognition used in this system involves 12 HMM chains (one chain per gesture). The number of hidden states in each of these chains is determined by breaking the gesture into a combination of hand poses and the motions between them. For example, the Good Afternoon gesture, as shown in Fig. 12, can be said to have 3 states: the 'Thumbs_Up' hand pose, the 'Upwards' motion, and the 'Sun_Up' hand pose. Since all these HMM chains are left-to-right HMMs, their initial state probabilities and initial state transition probabilities are similar to those shown in Fig. 10. For n hidden states in an HMM, the state transition probability matrix will be of order n x n and the emission probability matrix will be of order n x 13.
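As a rough illustration of the frame-level encoding described above, the sketch below maps each frame's (M, S) tuple to a single observation symbol. The helper lists and the symbol numbering simply follow the 4-direction-plus-9-pose scheme stated in the paper; encode_frame and encode_sequence are hypothetical helper names, not functions from the authors' code.

# Sketch: encode per-frame (motion, pose) tuples into discrete HMM observation symbols.
MOTION_SYMBOLS = {'Upwards': 0, 'Rightwards': 1, 'Leftwards': 2, 'Downwards': 3}
POSE_NAMES = ['Thumbs_Up', 'Sun_Up', 'Fist']          # ... 9 stationary poses in total
POSE_SYMBOLS = {name: 4 + i for i, name in enumerate(POSE_NAMES)}  # symbols 4..12

def encode_frame(motion, pose):
    """Map one frame's (M, S) tuple to a single symbol in the range 0-12."""
    if motion is not None:                 # motion detected: use the direction symbol
        return MOTION_SYMBOLS[motion]
    return POSE_SYMBOLS[pose]              # otherwise use the stationary-pose symbol

def encode_sequence(frames):
    """frames is an iterable of (motion, pose) tuples, one per video frame."""
    return [encode_frame(m, s) for m, s in frames]

# Example: three 'Thumbs_Up' frames, three 'Upwards' motion frames, three 'Sun_Up' frames
frames = [(None, 'Thumbs_Up')] * 3 + [('Upwards', None)] * 3 + [(None, 'Sun_Up')] * 3
print(encode_sequence(frames))             # [4, 4, 4, 0, 0, 0, 5, 5, 5]

Running the example reproduces the observation sequence given for "Good Afternoon" above.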
The emission probability matrices were initialized with probabilities determined empirically by subjectively looking at the similarity between hand poses and the closest motions possible. This is to increase the chances of the HMM model converging to the global maximum after training. After all the parameters described in [26] are initialized as explained above, the emission probabilities and state transition probabilities of the HMM chains are trained using the Baum-Welch algorithm [26, 27].
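A minimal sketch of this initialization, Baum-Welch training, and likelihood scoring is given below, assuming hmmlearn's CategoricalHMM (named MultinomialHMM in older releases). The 0.5/0.5 transition values and the uniform emission seed are illustrative stand-ins for the paper's empirically chosen initial probabilities.

import numpy as np
from hmmlearn.hmm import CategoricalHMM   # discrete-observation HMM

N_SYMBOLS = 13   # 4 motion directions + 9 stationary hand poses

def make_left_to_right_hmm(n_states):
    model = CategoricalHMM(n_components=n_states, n_iter=50,
                           init_params="", params="te")   # keep our pi/A init, train A and B
    model.startprob_ = np.eye(n_states)[0]                # always start in the first state
    # Left-to-right transitions: stay or move one state forward, never go back.
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = A[i, i + 1] = 0.5
    A[-1, -1] = 1.0
    model.transmat_ = A
    # Emission matrix (n_states x 13); in the paper this is seeded empirically.
    model.emissionprob_ = np.full((n_states, N_SYMBOLS), 1.0 / N_SYMBOLS)
    return model

def train_chain(model, sequences):
    """sequences: list of observation-symbol lists for one gesture (Baum-Welch fit)."""
    X = np.concatenate(sequences).reshape(-1, 1)
    model.fit(X, lengths=[len(s) for s in sequences])
    return model

def recognize(chains, observation):
    """Return the gesture name whose chain gives the highest log-likelihood."""
    obs = np.asarray(observation).reshape(-1, 1)
    return max(chains, key=lambda name: chains[name].score(obs))

Because the forward-backward zeros in the transition matrix are preserved during re-estimation, the trained chains remain left-to-right.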
The HMM chains are trained using a gesture database that was created, the details of which have been specified in section III A. After training the HMM chains, a new observation sequence is fed to these HMM chains, and the HMM chain giving the highest score with the forward-backward algorithm [26] is determined to correspond to the recognized gesture.
3) Temporal Segmentation
The gesture recognition module needs to be given video segments that correspond to the gesture only. Without temporal segmentation, continuous gesture recognition is not possible. We have used a rule wherein if the hand goes out of the frame, it marks the end of the current gesture and gesture recognition is performed over the frame sequence obtained for that gesture. When the hand again comes within the frame, it marks the beginning of a new gesture. Temporal segmentation is achieved using this rule.
IV. EXPERIMENTAL RESULTS
The results discussed in this section were obtained from a personal
computer with 8 GB of RAM, an Intel i7 processor with 4 virtual CPUs and a 2 GB NVIDIA GPU. The
operating system used was Ubuntu 16.04. For better classification of hand poses, an appropriate grid
size had to be determined for fragmentation of hand and extraction of features. We applied 6
different grid sizes – 5x5, 10x10, 10x15, 15x15, 15x20 and 20x20 to extract features from the same
training data. These features were then fitted into a k-NN classifier. Features were then extracted
from the testing data using the grid sizes in consideration. The features extracted from the testing
data were then classified using the k-NN classifier trained previously. An ideal grid size would form
distinct clusters of hand poses so that the k-NN classifier can recognise them with high precision. To
determine this ideal grid size, the accuracies of the trained k-NN classifiers were calculated using the testing data and were compared. The accuracy was calculated as specified in equation (4):

$$\text{Accuracy} = \frac{\text{Number of samples classified correctly}}{\text{Total number of samples}} \qquad (4)$$

The comparison was plotted on a graph as shown in Fig. 13. From the graph, it can be inferred that the
grid size of 10x10 gives the highest accuracy of 99.714%. In our implementation, the 10x10 grid was thus selected for extracting features from the hand image. The average time required to extract features from an image of size 300x300 using a 10x10 grid was found to be about 1 millisecond. The testing data was 30% of the original dataset of hand poses. Fig. 15 shows the confusion matrix of the k-NN classifier when tested with the hand poses' data. It can be inferred from the confusion matrix that the model is able to distinguish between each hand pose accurately. The time taken for each frame has been tabulated in TABLE I. Thus, our application achieves a frame rate of 5.3 fps. After the frames have been processed and the time series has been extracted from the gesture frame sequence, the average time taken by the HMM model (consisting of 12 HMM chains) for gesture classification is 3.7 ms.
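A sketch of the grid-size comparison described above is given below. Here extract_grid_features is a hypothetical stand-in for the grid-based feature extractor of Section III, the hand_images and labels variables are placeholders for the hand-pose dataset, and k = 3 is an illustrative neighbour count not stated in the paper.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

GRID_SIZES = [(5, 5), (10, 10), (10, 15), (15, 15), (15, 20), (20, 20)]

def compare_grid_sizes(hand_images, labels, extract_grid_features, k=3):
    """Train/test a k-NN classifier on features from each grid size and report accuracy."""
    results = {}
    for rows, cols in GRID_SIZES:
        X = [extract_grid_features(img, rows, cols) for img in hand_images]
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.3, random_state=42)   # 30% held out, as in the paper
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        results[(rows, cols)] = accuracy_score(y_test, knn.predict(X_test))
    return results

The grid size with the highest held-out accuracy (10x10 in the reported experiments) would then be selected for the deployed feature extractor.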
Fig. 13. Comparison of accuracy of k-NN classification on features extracted using various grid sizes on hand poses' data.

TABLE I. COMPUTATIONAL TIME FOR EACH PHASE
Sr. No. | Phase                                                  | Time taken (in ms)
1       | Data Transfer over WLAN                                | 46.2
2       | Skin Colour Segmentation and Morphological Operations  | 12.3
3       | Face Detection and Elimination                         | 99.9
4       | Object Stabilization                                   | 14.8
5       | Feature Extraction                                     | 11.2
6       | Hand Pose Classification                               | 1.7
        | Average Time per Frame                                 | 186.1

Fig. 14. Heatmap of Confusion matrix of ISL gestures.
The 12 gestures and their abbreviations considered for testing are as follows: After (AFT), All the Best (ATB), Apple (APL), Good Afternoon (GA), Good Morning (GM), Good Night (GN), I Am Sorry (IAS), Leader (LDR), Please Give Me Your Pen (PGP), Strike (STR), That is good (TIG), and Towards (TWD). Also, testing was done for Wrong Gestures (WR), that is, gestures that are not amongst those listed above. A 10x10 grid and a k-NN classifier were used to recognise the intermediate
hand poses, similar to hand pose recognition. 50 real-time trials were performed for each gesture class in good lighting conditions with the sign demonstrator wearing a full-sleeve shirt. Fig. 14 depicts a confusion matrix generated after obtaining the results of these trials. Each cell in the confusion matrix contains the percentage of trials that gave the corresponding output. It clearly shows that correct classification was obtained more than 94% of the time, and 97.2% on average. It can thus be inferred from the results that the hand motion tracking and hand pose classification were accurate enough to generate appropriate time series for the Hidden Markov Models.
Fig. 15. Heatmap of Confusion matrix for k-NN classifier tested on ISL hand poses.
V. THE APPLICATION
The system is implemented as an Android
application. The application uses the smartphone’s camera to capture the sign language used by the
person. The frames were captured at a rate of 5 frames per second. Each frame is continuously sent
to a remote server. The processing is performed on the server side. After each pose or gesture is classified, the result is sent back to the application and displayed in the top portion of the screen. Currently, sockets are used to simulate a client-server connection. Fig. 16 shows screenshots of the application.
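A minimal sketch of the client side of this setup is shown below, written in Python for brevity even though the paper's client is an Android application. The server address, port, and length-prefixed wire format are assumptions, since the paper does not specify them.

import socket
import struct
import time
import cv2   # OpenCV for camera capture and JPEG encoding

SERVER = ("192.168.1.10", 5000)   # placeholder address/port for the processing server

def stream_frames(fps=5):
    """Capture frames at roughly 5 fps and send each JPEG-encoded frame over a TCP socket."""
    cap = cv2.VideoCapture(0)
    sock = socket.create_connection(SERVER)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            _, jpeg = cv2.imencode(".jpg", frame)
            data = jpeg.tobytes()
            sock.sendall(struct.pack(">I", len(data)) + data)   # length-prefixed payload
            result = sock.recv(1024)              # classification result returned as text
            print(result.decode(errors="ignore"))
            time.sleep(1.0 / fps)                 # crude pacing to the target frame rate
    finally:
        cap.release()
        sock.close()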
Fig. 16. (Left) Hand Pose ‘a’ recognized by the application; (Right) Gesture ‘Good Afternoon’
recognised by the application.
VI. FUTURE WORK
Currently, only single-handed gestures in ISL were considered for research. With the use of advanced hand extraction algorithms, this approach can be extended to two-handed gestures as well. Also, with the use of Natural Language Processing
algorithms, this system can be extended to recognise sentences in ISL, by recognition of multiple
gestures in the same video capture. Hand extraction is currently dependent on skin colour segmentation. This means that hand extraction requires the subject to wear a full-sleeve shirt for accurate recognition. Although the system could help the hearing and speech impaired community, where full-sleeve shirts are frequently used, the system may not work in general conditions. This approach could be further extended using Object Detection techniques to extract the hand region from the image. The only limitation in implementing Object Detection techniques is the requirement of a very wide variety of annotated hand samples so that hands can be detected in almost any position, orientation and background. The current approach also requires that the lighting conditions should be optimal – neither too dark nor too bright. The use of even better skin colour segmentation techniques which can perform well under a wider variety of lighting conditions can lead to better segmentation results and in turn aid in feature extraction.
CONCLUSION
From the
results, it can be inferred that the system presented in this paper is accurately able to track hand
movements of the sign demonstrator using techniques such as Object Stabilisation, Face elimination,
Skin colour Extraction and then Hand extraction. It can classify all 33 hand poses in ISL with an
accuracy of 99.7%. The system was also able to classify 12 gestures with an average accuracy of
97.23%. The approach uses an HMM chain for each gesture and a k-NN model to classify each hand
pose. The time required for recognition of hand pose is about 0.2s and that for gesture is 0.0037s.
From the results, it can be concluded that the system can recognise hand poses and gestures in ISL
with precision and in real-time. The system provides higher accuracy and faster recognition in sign
language recognition than other approaches discussed in the literature. The approach discussed is inspired by various systems described in the Related Works section and utilises their strengths to make classification more precise. This approach is generic and can be extended to other single-handed and two-handed gestures. The system presented in this paper can also be extended to other Sign Languages, if a dataset satisfying the current requirements of the system is available.
REFERENCES
[1] TalkingHands.co.in, "Talking Hands," 2014. [Online]. Available:
http://www.talkinghands.co.in/. [Accessed: 21- Jul- 2017]. [2] A. Agarwal and M. K. Thakur, “Sign
Language Recognition using Microsoft Kinect,” Sixth International Conference on Contemporary
Computing (IC3), September 2013. [3] MailOnline, ‘'SignAloud gloves translate sign language
gestures into spoken English,” 2016. [Online]. Available:
http://www.dailymail.co.uk/sciencetech/article-3557362/SignAloudgloves-translate-sign-language-
movements-spoken-English.html. [Accessed: 10- Feb- 2018]. [4] Alexia Tsotsis, "MotionSavvy Is A
Tablet App That Understands Sign Language,” 2014. [Online]. Available:
https://techcrunch.com/2014/06/06/motionsavvy-is-a-tablet-app-thatunderstands-sign-language/.
[Accessed: 10 – Feb- 2018]. [5] P. Paudyal, A. Banerjee and S. K. S. Gupta, “SCEPTRE: a Pervasive,
Non-Invasive, and Programmable Gesture Recognition Technology," Proceedings of the 21st
International Conference on Intelligent User Interfaces, pp. 282-293, 2016. [6] R. Y. Wang and J.
Popovic, “Real-Time Hand-Tracking with a Color Glove,” ACM transactions on graphics (TOG), vol. 28,
no. 3, 2009. [7] R. Akmeliawati , M. P. L. Ooi and Y. C. Kuang, “Real-Time Malaysian Sign Language
Translation using Colour Segmentation and Neural Network,” Instrumentation and Measurement
Technology Conference Proceedings, 2007. [8] F. S. Chen, C. M. Fu and C. L. Huang, "Hand gesture
recognition using a real-time tracking method and hidden Markov models", Image and vision
computing, vol. 21, pp. 745-758, 2003. [9] M. A. Rahaman , M. Jasim, M. H. Ali and M.
Hasanuzzaman, "Real-Time Computer Vision-Based Bengali Sign Language Recognition," 17th
International Conference on Computer and Information Technology (ICCIT), 2014. [10] S. Padmavathi,
M. S. Saipreethy and V. Valliammai, “Indian Sign Language Character Recognition using Neural
Networks,” IJCA Special Issue on Recent Trends in Pattern Recognition and Image Analysis RTPRIA,
2013. [11] A. Chaudhary, J. L. Raheja and S. Raheja, “A Vision based Geometrical Method to find
Fingers Positions in Real Time Hand Gesture Recognition,” JSW, pp. 861-869, 2012. [12] A. B. Jmaa, W.
Mahdi, Y.B. Jemaa and A.B. Hamadou, “Hand Localization and Fingers Features Extraction:
Application to Digit Recognition in Sign Language,” International Conference on Intelligent Data
Engineering and Automated Learning, pp. 151-159, 2009. [13] G. Awad, J. Han and A. Sutherland, “A
Unified System for Segmentation and Tracking of Face and Hands in Sign Language Recognition,” 18th
International Conference on Pattern Recognition, 2006. [14] Z. H. Al-Tairi, R. W. Rahmat, M.I. Saripan
and P.S. Sulaiman, “Skin Segmentation Using YUV and RGB Color Spaces,” J Inf Process Syst, vol. 10,
no. 2, pp. 283-299, June 2014. [15] H. Lahiani , M. Elleuch and M. Kherallah, “Real Time Hand
Gesture Recognition System for Android Devices,” 15th International Conference on Intelligent
Systems Design and Applications (ISDA), 2015. [16] C. W. Ng and S. Ranganath, “Real-time gesture
recognition system and application,” Image and Vision computing, 2002. [17] S. C. Agrawal, A. S. Jalal
and C. Bhatnagar, “Recognition of Indian Sign Language using Feature Fusion,” 4th International
Conference on Intelligent Human Computer Interaction (IHCI), 2012. [18] M. Hrúz, J. Trojanová and
M. Železný, “Local Binary Pattern based features for Sign Language Recognition,” Pattern Recognition
and Image Analysis, 2012. [19] J. Davis and M. Shah, “Visual gesture recognition,” IEE
Proceedings - Vision, Image and Signal Processing, 1994. [20] C. Y. Kao and C. S. Fahn, "A Human-
Machine Interaction Technique: Hand Gesture Recognition based on Hidden Markov Models with
Trajectory of Hand Motion,” Procedia Engineering, vol. 15, pp. 3739- 3743, 2011. [21] N. Dalal and B.
Triggs, “Histogram of Oriented Gradients for Human Detection,” Computer Vision and Pattern
Recognition (CVPR), vol. 2, pp. 886-893, 2005. [doi = 10.1109/CVPR.2005.177] [22] B. C. Ennehar, O.
Brahim, and T. Hicham, “An appropriate color space to improve human skin detection,” INFOCOMP
Journal of Computer Science, vol. 9, no. 4, pp. 1-10, 2010. [23] L. Maaten and G. Hinton, “Visualizing
Data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579-2605, November 2008.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel et al, “1.6. Nearest
Neighbours – scikit-learn 0.19.1 documentation,” 2011. [Online]. Available:
http://scikitlearn.org/stable/modules/neighbors.html#nearest-neighboralgorithms. [Accessed: 12-
Sep- 2017]. [25] C. Vogler, D. Metaxas, “Handshapes and movements: Multiple channel ASL
Recognition," Gesture-Based Communication in Human-Computer Interaction, pp. 247-258, 2004.
[26] L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition,” Proceedings of the IEEE, vol. 77, no. 2, February 1989. [27] L. Baum, “An Inequality and
Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov
Process," Inequalities III: Proceedings of the Third Symposium on Inequalities, vol. 3, pp. 1-8, 1972.
INDIAN JOURNAL OF SCIENCE AND TECHNOLOGY RESEARCH ARTICLE OPEN ACCESS Received: 11-10-
2023 Accepted: 30-10-2023 Published: 05-12-2023 Citation: Duraisamy P, Abinayasrijanani A, Amrit
Candida M, Dinesh Babu P (2023) Transforming Sign Language into Text and Speech through Deep
Learning Technologies. Indian Journal of Science and Technology 16(45): 4177-4185. https://doi.org/
10.17485/IJST/v16i45.2583 ∗ Corresponding author. [email protected] Funding: None
Competing Interests: None Copyright: © 2023 Duraisamy et al. This is an open access article
distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and
source are credited. Published By Indian Society for Education and Environment (iSee) ISSN Print:
0974-6846 Electronic: 0974-5645 Transforming Sign Language into Text and Speech through Deep
Learning Technologies Premkumar Duraisamy1∗, A Abinayasrijanani2 , M Amrit Candida2 , P Dinesh
Babu2 1 Assistant Professor (Sr. G.), Department of Computer Science and Engineering, KPR Institute
of Engineering and Technology, Coimbatore, Tamil Nadu, India 2 B.E Student, Department of
Computer Science and Engineering, KPR Institute of Engineering and Technology, Coimbatore, Tamil
Nadu, India Abstract Objective: The goal of the proposed work is to leverage deep learning
technologies to create an efficient and accurate system for transforming sign language into text and
speech. People deliver their ideas, feelings, and experiences to others around them via their
interactions with each other. The hand gesture plays a significant role since it reflects the user’s
thoughts more rapidly than other motions (head, face, eye, and body). For deaf-mute people with
disabilities, this is still not the case. Sign language facilitates communication among deaf-mute
individuals. An individual who is deaf-mute can communicate without the use of acoustic sounds by
using sign language. Methods: Convolutional neural networks (CNNs) are generally used to recognize
and extract characteristics from sign language motions. These neural networks are employed to
recognize and extract critical features from sign language gestures. These features are processed by
natural language processing models for textual translation. Finally, neural text-to-speech (TTS)
technology is used to translate the textual translations into synthesized speech, thereby bridging the
communication gap for the Deaf community. To establish an inclusive and accessible communication
system, this technique combines computer vision, natural language processing, and speech
synthesis. Findings: The datasets used in this technique include hand gesture images, which contain
different hand poses and expressions. It is used to train and assess the model. The experiment
findings show an accuracy of 97.6% with a precision of 94.1%, a recall of 96.8%, and an F1-score of
95.9%. Novelty: This approach displays a cogent translation from text to speech and achieves an
outstanding translation accuracy of 97.6% from sign language to text, producing a natural and
understandable output. Keywords: Sign Language Translation; Deep Learning; Convolutional Neural Networks; Sequence-to-Sequence Models; Attention Mechanisms; Neural Text-to-Speech
1 Introduction
Communication plays a vital role in establishing human
connections as it enables the interchange of thoughts, feelings, and knowledge(1) . Verbal
communication, while fundamental, does not encompass all means of conveying messages. For
numerous individuals worldwide who encounter auditory deficiencies, sign languages such as
American Sign Language (ASL) and British Sign Language (BSL) assume the predominant function in
their communicative undertakings. These languages offer a multifaceted and comprehensive mode
of interaction, integrating hand gestures, facial expressions, and bodily movements to convey
nuanced messages(2) . SLTST (Sign Language to Text and Speech Translation) is a field of study that
focuses on bridging the communication gap between the deaf and hard-of-hearing communities and
the hearing population. Despite tremendous technological improvements, significant hurdles remain
in developing efficient and inclusive communication systems for the Deaf community(3). ASL, a complete natural language(4), shares several linguistic characteristics with spoken languages, and its syntax resembles that of English. Despite advancements in sign language translation systems, a seamless
and precise translation between sign language and spoken language remains a challenge. The first and most important factor is the richness and complexity of sign language itself, which is a multifaceted mode of communication that is difficult to interpret, with its complicated hand gestures and body postures.
Second, achieving high accuracy in sign language recognition and natural text-to-speech conversion
remains a technical challenge. Existing systems frequently encounter precision, hardware
requirements, and availability issues(5) . The present research addresses such problems by utilizing
deep learning technologies, especially convolutional neural networks (CNNs) and natural language
processing models, to develop an SLTST system. The effort attempts to create an open and accessible
means of communication for Deaf and hard-of-hearing people by emphasizing the relevance of
technology in encouraging inclusivity and breaking down communication obstacles. This research
aims to improve the quality of life and opportunities for the deaf population while developing an efficient, accurate, and user-friendly SLTST system, in addition to advancing the field of accessible communication technology(6). By tackling these issues, this study hopes to greatly advance the area
of sign language-to-text and speech translation, ultimately fostering a more connected and inclusive
society.

Table 1. Differentiate between the Proposed System and Existing System
Comparison                                 | Proposed Method                                                                  | Existing Method
Integration of Models                      | Using CNNs and transformer-based models together to interpret text and gestures | Individual models are used frequently in current techniques for analyzing gestures and speech.
Identification and Addressing Limitations  | Strategically analyzes the shortcomings of earlier approaches and suggests fresh remedies | May not specifically address or get around limitations
Practical Implementation and Accessibility | Real-time communication receives priority, and hardware dependence is decreased | Some techniques might be hardware-intensive, which would limit accessibility

Table 1 describes the key differences between a proposed method
and an existing method for integrating models to interpret text and gestures in the context of sign
language. It highlights three main aspects of comparison: integration of models, identification and
addressing of limitations, and practical implementation and accessibility.
2 Methodology
Different
levels of visual input are processed on each layer, which is a major benefit of deep learning models
for translating sign language to text. Lower layers analyze and identify very local characteristics, like
discrete curve segments(7) . Figure 1 shows the American Sign Language (ASL) alphabet letters, and it
serves as a crucial tool for the deaf and hard-of-hearing communities, enabling them to interact with one another. The features become more complex at higher layers. Furthermore, the network's functionality can still be interpreted(8). Often, only the higher layers are retrained to infer the features for the particular scenario when a model is being adapted to a new purpose, leaving the lower
layers alone(9) . The training is substantially expedited by this. Figure 2 below shows the steps
involved in using a hand gesture recognition system.
2.1 Processing of an Image
To guarantee the
accuracy and robustness of the system, a broad range of hand stances and motions must be covered
throughout data collection(10) . To accommodate the diverse hand positioning approaches adopted
by individuals, it is imperative to amass data from an array of perspectives, alignments, and separations(11).
Fig 1. Hand signs for each English alphabet
Fig 2. Flowchart of the process
Figure 3 shows the conversion of the raw hand image into a grayscale image.
Fig 3. Conversion into Grayscale Image
Movements of the hands and/or fingers are captured on a computer by a webcam.
The fundamental difficulty in vision-based hand detection is that the human hand appears significantly different due to numerous hand movements, various skin tones, various views, various scales, and various camera shutter rates. After obtaining the region of interest (ROI), Gaussian blur(12) is used. Next, the image is cropped and transformed into grayscale using OpenCV. By lowering noise and making object boundaries more distinct, Gaussian blur can aid in boosting the accuracy of up to 95% of object detection algorithms(13,14). When dealing with backdrops that are cluttered or photographs that are noisy, this is quite helpful(15). A grayscale image must initially be partitioned into two distinct regions: one for foreground objects, typically portrayed as white, and the other for the background, frequently depicted as black.
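A rough OpenCV sketch of this preprocessing chain, from ROI crop through blur, grayscale conversion, and binarization, is shown below; the ROI coordinates, kernel size, and adaptive-threshold parameters are illustrative assumptions, and the thresholding alternatives are discussed next.

import cv2

def preprocess_roi(frame, x, y, w, h):
    """Crop the region of interest, blur it, convert to grayscale, and binarize it."""
    roi = frame[y:y + h, x:x + w]                      # region of interest around the hand
    blurred = cv2.GaussianBlur(roi, (5, 5), 0)         # suppress noise, smooth boundaries
    gray = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)   # grayscale conversion
    # Global (Otsu) threshold: white foreground (hand), black background.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Adaptive thresholding is an alternative when lighting varies across the image.
    adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 11, 2)
    return gray, binary, adaptive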
The act of thresholding is a frequently employed
technique for accomplishing this objective. It encompasses two fundamental methodologies: global
thresholding and adaptive thresholding(16) . In global thresholding, the foreground and background
areas are divided by selecting a single threshold value based on the intensity values of the grayscale
image. On the other hand, when the image has varying lighting conditions across different areas,
adaptive thresholding proves helpful. This technique calculates distinct threshold values for different
regions of the image to account for regional lighting changes(17) . After converting the hand’s image
to grayscale, preprocessing techniques such as scaling and normalization are applied. The pre-
processed image is then fed into the landmark detection model of the MediaPipe Landmark
System(18). Google made MediaPipe, an open-source, cross-platform framework, available for processing perceptual information from several modalities, including audio and video(19). Face recognition and posture assessment are just two of the solutions offered in MediaPipe.
Fig 4. Hand movement tracking using Mediapipe
The model undergoes considerable training to identify the
landmarks—important points—on the hand, including the fingertips, knuckles, and the palm’s
center(20). An example is shown in Figure 4 above. Typically, these recognized landmarks are shown as points overlaid on the underlying grayscale image.
Fig 5. Live hand tracking
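As a hedged sketch of this landmark-detection step, the snippet below uses MediaPipe's Hands solution to extract the 21 hand keypoints and optionally draw the skeleton overlay; the confidence threshold and single-hand setting are assumptions, not the paper's reported configuration.

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

def extract_hand_landmarks(bgr_image):
    """Return the 21 (x, y, z) hand landmarks detected by MediaPipe, or None."""
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1,
                        min_detection_confidence=0.5) as hands:
        results = hands.process(cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    hand = results.multi_hand_landmarks[0]
    # Optionally overlay the detected skeleton on the image, as in Fig 4 / Fig 5.
    mp_draw.draw_landmarks(bgr_image, hand, mp_hands.HAND_CONNECTIONS)
    return [(lm.x, lm.y, lm.z) for lm in hand.landmark]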
Figure 5 shows the hand tracking pipeline that predicts the hand skeleton, which is used as data to train the model(21).
2.2 Data Set
The Deep Learning Sign Language to Text Conversion Model
was trained and validated using a dataset of 5040 skeletal images of American Sign Language (ASL). This dataset is composed of 28 labels, with the letter A represented as 0, the letter B as 1, and the label 'nothing' as 28. Datasets are usually divided into three sub-sets: the training set, the validation set, and the test set, which are used to train the model, refine hyperparameters, and track performance during training. The test set is used to assess the performance of the final model on unseen data.
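A minimal sketch of such a split is given below; the 70/15/15 proportions are illustrative, since the paper does not state its exact ratios, and stratification simply keeps the label proportions balanced across subsets.

from sklearn.model_selection import train_test_split

def split_dataset(images, labels, val_frac=0.15, test_frac=0.15, seed=42):
    """Split the skeletal-image dataset into training, validation, and test sets."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        images, labels, test_size=val_frac + test_frac,
        random_state=seed, stratify=labels)             # keep label proportions balanced
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=test_frac / (val_frac + test_frac),
        random_state=seed, stratify=y_rest)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)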
2.3 Model Architecture
CNN-based deep learning techniques
demonstrate remarkable efficacy in the realm of computer vision applications(22) . Figure 6 shows
the layers of the CNN model. These methodologies demonstrate exceptional performance in various
tasks, including the categorization of images, the identification of objects, and the division of images
into different segments(23) . This is primarily due to their innate ability to extract hierarchical
representations of features from visual data. It performs computations by adjusting weights to
iteratively scan all of the pixel values in the image using a filter or kernel to identify a specific trait. A few of the layers available in a CNN, including the convolution layer, max pooling layer, dense layer, flatten layer, and dropout layer, together with a fully connected neural network, are shown in Figure 6 below. In CNN layers, neurons arrange themselves in three dimensions—width, height, and
depth—as opposed to regular neural networks. Instead of being completely connected, neurons in a
layer will be connected to only a small portion of the layer preceding it (window size).
Fig 6. Layers of CNN
Convolutional layers create a new image $A^{(m)}$, made up of $O_m$ channels, from the input image $A^{(m-1)}$ (with $K_m$ channels). The output of each channel is a feature map, which is calculated as

$$A^{(m)}_o = g_m\Big(\sum_{k} W^{(m)}_{ok} * A^{(m-1)}_k + b^{(m)}_o\Big) \qquad (1)$$

where $*$ denotes the (2D) convolution operation

$$(W_{ok} * A_k)[s,t] = \sum_{p,q} A_k[s+p,\, t+q]\; W_{ok}[P-1-p,\, Q-1-q] \qquad (2)$$

where $W^{(m)}_{ok}$ is a matrix of shape $P_m \times Q_m$ and $b^{(m)}_o \in \mathbb{R}$. The matrix defines a spatial filter's parameters, which the layer can use to identify or enhance a feature in an incoming image. The precise behavior of this filter is dynamically learned from data throughout the network's training process. By pooling layers, the size of the feature maps
is decreased. As a result, it lessens both the number of parameters that must be learned and the workload placed on the network.
2.3.1 Max pooling
Max pooling is a type of pooling that selects the maximum element from the feature map area that the filter has covered. Thus, a feature map containing the most noticeable features from the earlier feature map would be the result of the max-pooling layer.
2.3.2 Average pooling
Average pooling is used to get the average of the elements in the feature map area that the filter is focusing on. As a result, while max pooling supplies the most prominent attribute in a particular patch of the feature map, average pooling provides the average of the attributes present in a patch.
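The difference between the two operations can be reproduced with a few lines of NumPy; the 4x4 input below is arbitrary and is not the example shown in Figure 7.

import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping 2x2 pooling over a square feature map."""
    h, w = feature_map.shape
    patches = feature_map.reshape(h // size, size, w // size, size)
    return patches.max(axis=(1, 3)) if mode == "max" else patches.mean(axis=(1, 3))

fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 2],
               [7, 8, 1, 0],
               [2, 1, 3, 4]])
print(pool2d(fm, mode="max"))   # [[6 5]
                                #  [8 4]]
print(pool2d(fm, mode="avg"))   # [[3.5 2.5]
                                #  [4.5 2. ]]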
Fig 7. Concept of pooling
The input feature map in
Figure 7 is a 4x4 grid of numbers. A 2x2 pooling zone is used with the max pooling operation(24) .
This means that each feature map patch is 2x2 pixels in size. The maximum value in each patch is
then chosen as the output value for that patch. The max pooling operation produces a 2x2 grid of
numbers. The numbers in this grid represent the highest values obtained from each patch of the
input feature map(24) . The average pooling operation is also used with a 2x2 pooling region. This
means that each feature map patch is 2x2 pixels in size. The average value of each patch is then
determined, and this number becomes the patch’s output value. Fig 8. Relationship between entities
To connect the neurons between two layers, the Fully Connected (FC) layer, which also includes
weights and biases, is utilized. These layers make up the final few layers of a CNN architecture and
are often positioned before the output layer(25) . The pooling layer, depicted in Figure 8, is utilized to
reduce the spatial dimensions of the feature maps from 4x4 to 2x2. It reduces the number of
parameters in the next layer by half. By making the CNN less sensitive to tiny changes in the input
data, the pooling layer also helps to prevent overfitting. The Keras CNN model will be fed the pre-
processed images and alphabets. The stages involved in a text-to-Speech system are shown in Figure
9 above. The system uses Natural Language Processing (NLP) techniques to first translate the input
text into a phonetic representation. After that, the system converts the phonetic representation into
a voice waveform using Digital Signal Processing (DSP) techniques. Ultimately, a speaker outputs the
voice waveform.
2.4 Model Training
To reduce the discrepancy between the model's predictions and the ground truth labels, which are the matching text representations of the sign language movements, the model is tuned during training. This calls for a labelled data set in which each sequence of skeletal gestures is associated with a corresponding textual description(25).
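A hedged Keras sketch of a model with the layer types named in Figure 6, together with a standard compile-and-fit training loop, is shown below. The filter counts, input size, epoch count, and batch size are illustrative assumptions, not the paper's reported configuration.

from tensorflow.keras import layers, models

NUM_CLASSES = 28   # matches the label count stated for the dataset

def build_model(input_shape=(64, 64, 1)):
    """Small CNN using the layer types listed in Figure 6 (sizes are illustrative)."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",   # integer class labels
                  metrics=["accuracy"])
    return model

# model = build_model()
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20, batch_size=32)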
Fig 9.
Neural Text-To-Speech Architecture The difference between predicted text and real text is measured
by the loss function that is employed during training. The model develops the ability to translate
complex hand movement patterns into meaningful textual representations. Convolutional, recurrent,
and fully connected layers in the CNN allow it to comprehend both temporal and spatial links in the
gesture sequences. Once the model is trained and achieves satisfactory performance on validation
data, it can be deployed to convert unseen sign language gestures into text and voice. Given a new
gesture captured through the MediaPipe landmark system, the model processes the skeleton
sequence and generates the corresponding text output. This text can subsequently be converted using text-to-speech technology, thereby offering a holistic solution that helps bridge the communication gap between individuals proficient in sign language and those who are not.
Fig 10. Recognition of a Hand Gesture
Figure 10 above shows the outcome of the model, which recognizes the hand gestures, and the gestures are converted into text
and displayed on the screen. The text-to-speech (TTS) conversion component subsequently creates
audible speech from the translated text. The spoken version of the signs can be heard thanks to the
voice output, which is the audio representation of the translated text. These results indicate the
effectiveness and robustness of the model in accurately translating sign language gestures into text.
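The final speech step can be realized with an off-the-shelf TTS engine; the pyttsx3 call below is only a simple stand-in for the neural TTS component described above, not the paper's actual synthesizer, and the speaking rate is an arbitrary choice.

import pyttsx3   # offline text-to-speech engine

def speak(translated_text):
    """Convert the recognized sign text into audible speech."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)   # speaking rate in words per minute
    engine.say(translated_text)
    engine.runAndWait()

# speak("good afternoon")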
In this section, the model's performance results are shown, and, by comparing them with earlier studies in the field, its novelty is emphasized. When the model's accuracy and overall performance are
compared to the available literature, the findings reveal a significant improvement. Several earlier
studies on sign language to text and voice translation showed rates of accuracy ranging from 85% to
90%. The model’s accuracy of 97.6% exceeds these reported levels, indicating considerable
development in the domain. To create a better and more accurate Sign Language to Text and Speech
Translation (SLTST) system, a thorough sensitivity analysis helps locate possible flaws and areas for
system development. Environmental factors, such as variable illumination levels, background noise,
and camera angles, are used to evaluate how well the system operates in diverse environments.
Real-world usability depends heavily on environmental sensitivity. Assessing the system’s capacity to
identify complex and delicate signals, such as quick finger motions and body postures, is known as
gestural complexity. Accurate translations require awareness of the subtleties of sign language. It
involves evaluating the computational and operational complexities associated
with the entire process. First, computational complexity evaluates the
system’s need for computational resources, such as memory, GPU capabilities, and processor power.
Convolutional neural networks (CNNs) are used in text-to-speech (TTS) and sign recognition models,
which evaluate the model sizes and computing requirements considering the intricacy of deep
learning models. The system’s second analysis is Latency and Real-Time Processing, which looks at
how long it takes to identify sign language motions and translate them into English. The response time limits for efficient communication are taken into account while evaluating the trade-off between accuracy and real-time performance. The use of MediaPipe for hand gesture identification utilizing
skeletal data is a crucial factor contributing to the model’s improved performance. This method
enables more accurate translation by precisely tracking hand movements. Deep learning techniques
like convolutional neural networks (CNNs) and effective preprocessing processes such as Gaussian
blur and grayscale conversion improve the model’s efficiency and accuracy. Furthermore, the
research emphasizes lightweight model architectures, which reduce hardware requirements while
improving accessibility. The proposed model achieves a balance between precision and productivity,
making it well-suited for live applications and accessible to a broader spectrum of individuals.
3 Conclusion
This research uniquely integrates convolutional neural networks (CNNs) for sign language
recognition, natural language processing models for textual translation, and neural text-to-speech
(TTS) technology for synthesized speech. This work highlights the revolutionary potential of deep
learning in boosting sign language identification and, as a result, improving inclusivity and
communication for people. Continuous study and development in this subject hold the possibility of
generating more precise and accessible sign language-to-text and voice conversion systems as
technology improves. The successful integration of transformer-based text generation models with
convolutional neural networks (CNNs) for reliable gesture detection is a notable accomplishment.
Notably, the seamless translation of detected gestures into natural speech using neural text-to-
Speech (TTS) synthesis is a breakthrough, producing coherent and natural output that closely
resembles human speech while retaining a high translation accuracy of 98%. This discovery highlights
the power of deep learning to overcome accessibility issues and promote diversity. The strategy
outperforms current approaches by reaching a greater accuracy rate of 97.6%, demonstrating the
potency of the suggested method. The study adds to the body of literature by outlining the practical
ramifications of bridging the communication gap between the hearing and deaf communities, in
addition to offering a technical solution.
Future Enhancement
Multimodal integration, incorporating
audio, visual, and contextual cues, plays a pivotal role in enhancing the comprehension of sign
language. By amalgamating multiple sensory inputs, individuals can achieve a more comprehensive
understanding of the nuances and intricacies of sign language communication. Real-time feedback
mechanisms further bolster this understanding by aiding signers in refining the precision of their
movements and honing their overall communicative proficiency. These mechanisms facilitate
immediate adjustments and corrections, allowing for continuous improvement in the quality and
accuracy of 98% of sign language expression.
References
1) Buttar AM, Ahmad U, Gumaei AH, Assiri
A, Akbar MA, Alkhamees BF. Deep Learning in Sign Language Recognition: A Hybrid Approach for the
Recognition of Static and Dynamic Signs. Mathematics. 2023;11(17):3729. Available from:
https://doi.org/10.3390/math11173729. 2) Duraisamy P, Natarajan Y, Jeya IJS, Niranjani V. Sensor
Automation for Industrial Applications and Prediction of Energy Level for Home Appliances Using
Machine Learning Algorithm. In: 2023 9th International Conference on Advanced Computing and
Communication Systems (ICACCS). IEEE. 2023;p. 1648–1654. Available from:
https://doi.org/10.1109/ICACCS57279.2023.10112766. 3) Abraham E, Nayak A, Iqbal A. Real-Time
Translation of Indian Sign Language using LSTM. 2019 Global Conference for Advancement in
Technology (GCAT). 2019;p. 1–5. Available from: https://doi.org/10.1109/GCAT47503.2019.8978343.
4) Uyyala P. Sign Language Recognition Using Convolutional Neural Networks. Journal of
interdisciplinary cycle research. 2022;14:1198–1207. Available from: https://doi.org/10.17613/47ga-
zw60. 5) Sreenath S, Daniels DI, Ganesh ASD, Kuruganti YS, Chittawadigi RG. Monocular Tracking of
Human Hand on a Smart Phone Camera using MediaPipe and its Application in Robotics. 2021 IEEE
9th Region 10 Humanitarian Technology Conference (R10-HTC). 2021. Available from:
https://doi.org/10.1109/ R10-HTC53172.2021.9641542. 6) Bohra T, Sompura S, Parekh K, Raut P.
Real-Time Two Way Communication System for Speech and Hearing Impaired Using Computer Vision
and Deep Learning. 2019 International Conference on Smart Systems and Inventive Technology
(ICSSIT). 2019. Available from: https://doi.org/10.1109/ICSSIT46314. 2019.8987908. 7) Elboushaki A,
Hannane R, Afdel K, Koutti L. MultiD-CNN: A multi-dimensional feature learning approach based on
deep convolutional networks for gesture recognition in RGB-D image sequences. Expert Systems with
Applications. 2020;139:112829. Available from: https://doi.org/10.1016/j.eswa.2019. 112829.
8) Liu JE, An FP. Image Classification Algorithm Based on Deep Learning-
Kernel Function. Scientific Programming. 2020;p. 1–14. Available from:
https://doi.org/10.1155/2020/7607612. 9) Oudah M, Al-Naji A, Chahl J. Hand Gesture Recognition
Based on Computer Vision: A Review of Techniques. Journal of Imaging. 2020;6(8):73. Available from:
https://doi.org/10.3390/jimaging6080073. 10) Al-Hammadi M, Muhammad G, Abdul W, Alsulaiman
M, Hossain MS. Hand Gesture Recognition Using 3D-CNN Model. IEEE Consumer Electronics
Magazine. 2020;9(1):95–101. Available from: https://doi.org/10.1109/MCE.2019.2941464. 11)
Vaidya O, Gandhe S, Sharma A, Bhate A, Bhosale V, Mahale R. Design and Development of Hand
Gesture based Communication Device for Deaf and Mute People. 2020 IEEE Bombay Section
Signature Conference (IBSSC). 2020. Available from:
https://doi.org/10.1109/IBSSC51096.2020.9332208. 12) Li Z, Liu F, Yang W, Peng S, Zhou J. A Survey
of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Transactions on Neural
Networks and Learning Systems. 2022;33(12):6999–7019. Available from:
https://doi.org/10.1109/TNNLS.2021.3084827. 13) Smedt D, Quentin H, Wannous JP, Vandeborre.
Heterogeneous hand gesture recognition using 3D dynamic skeletal data. Computer Vision and Image
Understanding. 2019;181:60–72. Available from: https://doi.org/10.1016/j.cviu.2019.01.008. 14)
Sharma A, Mittal A, Singh S, Awatramani V. Hand Gesture Recognition using Image Processing and
Feature Extraction Techniques. Procedia Computer Science. 2020;173:181–190. Available from:
https://doi.org/10.1016/j.procs.2020.06.022. 15) Duraisamy P, Natarajan Y, Niranjani V, Parvathy K.
Optimized Detection of Continuous Gravitational-Wave Signals using Convolutional Neural Network.
In: 2023 3rd International conference on Artificial Intelligence and Signal Processing (AISP). IEEE.
2023;p. 1–5. Available from: https: //doi.org/10.1109/AISP57993.2023.10134809. 16) Jie HJ, Wanda
P. RunPool: A Dynamic Pooling Layer for Convolution Neural Network. International Journal of
Computational Intelligence Systems. 2020;13(1):66. Available from:
https://doi.org/10.2991/ijcis.d.200120.002. 17) Shokat S, Riaz R, Rizvi SS, Khan K, Riaz F, Kwon SJ.
Analysis and Evaluation of Braille to Text Conversion Methods. Mobile Information Systems. 2020;p.
1–14. Available from: https://doi.org/10.1155/2020/3461651. 18) Rahaman MA, Hossain MDP, Rana
MM, Rahman MDA, Akter T. A Rule Based System for Bangla Voice and Text to Bangla Sign Language
Interpretation. 2020 2nd International Conference on Sustainable Technologies for Industry 40 (STI).
2020. Available from: https://doi.org/10.1109/STI50764.2020.9350468. 19) Duraisamy P, Natarajan Y,
Preethaa KRS, Mouthami K. Sentiment Analysis on Drug Reviews Using Diverse Classification
Techniques. In: 2022 3rd International Conference on Communication, Computing and Industry 4.0
(C2I4). IEEE. 2022;p. 1–5. Available from: https://doi.org/10.1109/C2I456876. 2022.10051399. 20)
Shashidhar R, Hegde SR, Chinmaya K, Priyesh A, Manjunath AS, Arunakumari BN. Indian Sign
Language to Speech Conversion Using Convolutional Neural Network. In: 2022 IEEE 2nd Mysore Sub
Section International Conference (MysuruCon). ;p. 1–5. Available from: https://doi.org/10.1109/
MysuruCon55714.2022.9972574. 21) Qin W, Mei X, Chen Y, Zhang Q, Yao Y, Hu S. Sign Language
Recognition and Translation Method based on VTN. In: 2021 International Conference on Digital
Society and Intelligent Systems (DSInS). IEEE. 2021;p. 111–115. Available from:
https://doi.org/10.1109/DSInS54396.2021.9670588. 22) Shokoori AF, Shinwari M, Popal JA, Meena J.
Sign Language Recognition and Translation into Pashto Language Alphabets. In: 2022 6th
International Conference on Computing Methodologies and Communication (ICCMC). IEEE. 2022;p.
1401–1405. Available from: https://doi.org/10. 1109/ICCMC53470.2022.9753959. 23) Duraisamy P,
Yuvaraj S, Natarajan Y, Niranjani V. An Overview of Different Types of Recommendations Systems - A
Survey. In: 2023 4th International Conference on Innovative Trends in Information Technology
(ICITIIT). IEEE. 2023;p. 1–5. Available from: https://doi.org/10.1109/ICITIIT57246.2023. 10068631.
24) Kim CJ, Park HM. Per-frame Sign Language Gloss Recognition. In: 2021 International Conference
on Information and Communication Technology Convergence (ICTC). IEEE. 2021;p. 1125–1127.
Available from: https://doi.org/10.1109/ICTC52510.2021.9621167. 25) Datta PK, Biswas A, Ghosh A,
Chaudhury N. Creation of Image Segmentation Classifiers for Sign Language Processing for Deaf and
Dumb. In: 2020 8th International Conference on Reliability, Infocom Technologies and Optimization
(Trends and Future Directions) (ICRITO). IEEE. 2020;p. 772–775. Available from:
https://doi.org/10.1109/ICRITO48877.2020.9197978.
Journal of Intelligent Systems and Control https://www.acadlore.com/journals/JISC Detection and
Interpretation of Indian Sign Language Using LSTM Networks Piyusha Vyavahare1 , Sanket Dhawale1 ,
Priyanka Takale1 , Vikrant Koli1 , Bhavana Kanawade1* , Shraddha Khonde2 1 Information
Technology, International Institute of Information Technology, 411057 Pune, India 2 Department of
Computer Engineering, M. E. S. College of Engineering, S.P. Pune University, 411001 Pune, India *
Correspondence: Bhavana Kanawade ([email protected]) Received: 06-10-2023 Revised:
07-05-2023 Accepted: 07-13-2023 Citation: P. Vyavahare, S. Dhawale, P. Takale, V. Koli, B. Kanawade,
and S. Khonde, “Detection and interpretation of Indian Sign Language using LSTM networks,” J. Intell
Syst. Control, vol. 2, no. 3, pp. 132–142, 2023. https://doi.org/10.56578/jisc020302. © 2023 by the author(s). Published by Acadlore Publishing Services Limited, Hong Kong. This article is available for
free download and can be reused and cited, provided that the original published version is credited,
under the CC BY 4.0 license. Abstract: Sign language plays a crucial role in communication for
individuals with speech or hearing difficulties. However, the lack of a comprehensive Indian Sign
Language (ISL) corpus impedes the development of text-to-ISL conversion systems. This study proposes a deep learning-based sign language detection system tailored specifically for Indian Sign Language (ISL). The proposed system utilizes Long Short-Term Memory (LSTM) networks
to detect and recognize actions from dynamic ISL gestures captured in videos. Initially, the system
employs computer vision algorithms to extract relevant features and representations from the input
gestures. Subsequently, an LSTM-based deep learning architecture is employed to capture the
temporal dependencies and patterns within the gestures. LSTM models excel in sequential data
processing, making them well-suited for analyzing the dynamic nature of sign language gestures. To
assess the effectiveness of the proposed system, extensive experimentation and evaluation were
conducted. A customized dataset was curated, encompassing a diverse range of ISL sign language
actions. This dataset was created by collecting video recordings of native ISL users performing
various actions, ensuring comprehensive coverage of gestures and expressions. These videos were
meticulously annotated and labelled with corresponding textual representations of the gestures. The
dataset was then split into training and testing sets to train the LSTM-based model and evaluate its
performance. The proposed system yielded promising results during the validation process, achieving
a training accuracy of 96% and a test accuracy of 87% for ISL recognition. These results outperformed
previous approaches in the field. The system’s ability to effectively detect and recognize actions from
dynamic ISL gestures, facilitated by the deep learning-based approach utilizing LSTM networks,
demonstrates the potential for more accurate and robust sign language recognition systems.
However, it is important to acknowledge the limitations of the system. Currently, the system’s
primary focus is on recognizing individual words rather than full sentences, indicating the need for
further research to enhance sentence-level interpretations. Additionally, variations in lighting
conditions, camera angles, and hand orientations can potentially impact the system’s accuracy,
particularly in the context of ISL. Keywords: Indian Sign Language; Open CV; Deep learning models;
LSTM
1 Introduction
Effective communication is vital for individuals to interact and thrive in society.
However, individuals with limitations in speech or hearing face significant communication challenges.
Sign language, as a visual language, plays a crucial role in enabling these individuals to express and
understand communication through hand gestures, facial expressions, and body movements. Despite
the importance of sign language, there remains a gap in effective communication between sign
language users and those who do not understand sign language. This gap necessitates the
development of sign language translation systems. This research paper aims to address the need for
improved sign language translation by proposing a real-time system that leverages image processing
and deep learning techniques. The primary objectives of this study are twofold: first, to develop a
real-time sign language translation system, and second, to enhance the system’s response time and
accuracy in recognizing and translating sign language gestures. By
achieving these objectives, the research aims to contribute
to the field of assistive technology, promoting inclusivity and improving communication accessibility
for individuals with hearing impairments. To establish the foundation for the proposed system, an
extensive review of related work has been conducted. While previous research has explored sign
language translation systems, limitations persist in terms of real-time translation, accurate gesture
recognition, and accommodating the wide range of sign language variations and expressions. These
research gaps and limitations motivate this study to explore novel approaches that can overcome
these challenges and improve the effectiveness of sign language translation. The proposed
methodology encompasses the utilization of image processing algorithms to analyze live sign
motions and extract distinctive indicators, including hand gestures, facial expressions, and body
movements. These indicators are then processed using classifiers to convert them into meaningful
representations. An integral component of the proposed system is the incorporation of Long Short-
Term Memory (LSTM) networks, which are renowned for their ability to capture long-term
dependencies. By leveraging LSTM networks, the system aims to improve gesture recognition
accuracy by effectively modeling the sequential nature of sign language expressions. To train the
system, a customized dataset focused primarily on Indian Sign Language (ISL) has been meticulously
curated. This dataset encompasses a diverse range of sign language gestures, encompassing different
hand shapes, orientations, movements, and positions. The inclusion of high-quality data in the
dataset enables the system to learn and generalize from various sign language expressions
effectively. The structure of the paper is as follows: Section 2 provides a comprehensive review of the
related literature, highlighting the significance of gesture detection in sign language, discussing
existing approaches, and identifying the research gaps and limitations in current methods. Section 3
presents the proposed methodology, outlining the image processing techniques, the integration of
LSTM networks, and the details of the customized Indian Sign Language dataset. Section 4 describes
the experimental setup and the evaluation of the system’s performance. Finally, Section 5 concludes
the paper, discussing the contributions of the research, potential future directions, and the impact on communication accessibility for individuals with hearing impairments.
2 Related Work
Each sign in a sign language is distinct due to changes in hand form, motion profile, and location of the hand, face, and other body parts, according to Rastgoo et al. [1]. It is challenging to identify visual sign
languages using computer vision. The authors of the survey looked at the preceding five years’ worth
of deep learning- and vision-based models for sign language recognition. The proposed models
demonstrate a notable increase in the taxonomy-based recognition accuracy of sign language. A set
of artificial intelligence technologies for sign language was proposed by Papastratis et al. [2]. The
authors of the survey made an effort to provide a comprehensive overview of cutting-edge methods
for sign language recording, recognition, translation, and representation while stressing both their
advantages and disadvantages. The survey lists several uses while also outlining the major difficulties
facing sign language technologies. In the survey, deep learning models are used to identify and detect words in a person's motion. Deep learning models are able to distinguish signs from single
Indian Sign Language (ISL) video frames. They employ the feedback-based learning models LSTM and
GRU. The four consecutive LSTM and GRU combinations (there are two layers of LSTM and two levels
of GRU) were investigated by Kothadiya et al. [3] with their own dataset, IISL 2020. Over 11 different
signals, the proposed algorithm, which consists of a single layer of LSTM followed by GRU, achieves
approximately 97% accuracy. Adaloglou et al. [4] reviews the most current developments in
convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers for the
recognition of sign language. The authors also identified the key challenges and open research
questions in this field, such as the lack of standardized datasets and the need for real-time
recognition in practical applications. The authors proposed a two-stage recognition approach, which
first detects the hand region in the input video frames and then recognizes the isolated hand sign
from the detected hand region using a deep neural network. Rastgoo et al. [5] evaluated the
proposed approach on two public datasets, namely, RWTH-BOSTON-50 and Polish Sign Language
Alphabet (PSL). The paper concludes that the proposed deep cascaded model can be an effective
solution for recognizing isolated hand signs in sign language and can be further extended to handle
continuous sign language recognition. Using large-scale video datasets like kinetics, Sarhan and
Frintrop [6] optimize pre-trained action recognition models for sign language recognition tasks. The
proposed method is evaluated on two publicly available sign language datasets, RWTH-PHOENIX-
Weather and DGS-Corpus, and compared to state-of-the-art methods. The authors also analyzed the
transferability of different pre-trained models and demonstrate that models trained on action
recognition tasks that involve similar hand movements and poses as sign language can be effectively
transferred for sign language recognition. The system is designed to recognize vernacular sign
languages, which are sign languages specific to a particular region or community and may have limited resources for language recognition. Halder and Tayade [7] use the MediaPipe
framework to extract hand and pose features from video input, and then apply machine learning
algorithms, specifically Support Vector Machines (SVMs) and Random Forest (RF), for classification.
The system is trained on a dataset of vernacular sign language gestures and evaluated on real-time
video input. The results show that the proposed system achieves high accuracy in recognizing
vernacular sign language gestures in real-time. The authors also compare the performance of SVM
and RF classifiers and analyze the effect of different parameters on the system’s performance.
Caliwag et al. [8] extracted unique spatial and temporal information from each motion; the extracted features are then used to train a neural network to classify the gestures. Short, medium, and long motions were used in sign language videos as part of the movement-in-a-video recognition method. The authors’ method achieved accuracy rates of 90.33% and 40%, compared to 69% and 43.7% for previous methods. Continuous sign language recognition (SLR) aims to turn a series of signs into a complete sentence. To learn multiple levels of semantic information from the data and overcome these issues, Yang et al. [9] presented the Structured Feature Network, or SF-
Net, which gradually encodes data at the frame, gloss, and sentence levels into the feature
representation. The accuracy and adaptability of the SF-Net were evaluated using two sizable public
SLR datasets collected from diverse continuous SLR scenarios. The results suggest that the SF-Net’s
accuracy and adaptability are superior to those of earlier sequence level supervision-based
techniques. Liao et al. [10] have introduced a new multimodal approach for dynamic sign language
recognition, utilizing a system called BLSTM-3D residual network (B3D ResNet). This system combines
bi-directional LSTM networks and a deep 3-dimensional residual ConvNet to automatically extract
spatiotemporal features from video sequences and generate an intermediate score for each action in
the video. The B3D ResNet system is composed of three main components: hand object localization,
feature analysis, and video clip categorization. The researchers tested the system on two datasets,
the DEVISIGN-D and SLR datasets and obtained impressive recognition accuracy: the B3D ResNet system achieved 86.9% accuracy on the SLR Dataset and 89.8% accuracy on DEVISIGN-D. An American Sign
Language alphabet dataset was utilized by Thakur et al. [11] in their study. After being processed
with Python libraries and tools such as OpenCV and skimage, the preprocessed gesture datasets are
then trained using the VGG-16 CNN model. Since the technique explains the meaning of the hand signs
used for conversing with someone who does not understand sign language, it allowed for one-way
communication in the survey. An innovative method was introduced by Starner et al. [12] with the
title Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based
Video System. Two real-time hidden Markov model-based systems were demonstrated, each of
which tracks bare hands using a single-color camera in order to recognize continuous sentences of
American Sign Language (ASL). The first system, which uses a desk-mounted camera to observe the user, achieves 92 percent accuracy, while the second, which uses a camera mounted on the user’s cap, achieves 98%. A 40-word lexicon is
used. It’s worth noting that the development of real-time ASL recognition systems has great potential
to improve communication. The use of wearable technology in this context is particularly promising,
as it allows for more natural and unobtrusive interaction with technology. Rastgoo et al. [13] suggest a novel deep learning-based pipeline architecture that uses the Single Shot Detector (SSD), 2D
Convolutional Neural Network (2DCNN), 3D Convolutional Neural Network (3DCNN), and Long Short-
Term Memory (LSTM) to automatically recognise hand sign language. The hand skeleton is then
constructed using the midpoint algorithm and the approximated key points. The authors used
3DCNNs to extract discriminant local spatio-temporal characteristics from the stacked inputs, such as
pixel-level, multi-view hand skeleton, and heatmap features. Authors discovered that 3DCNNs
perform better than 2DCNNs at capturing the spatio-temporal dynamics of the hands after
comparing their performances with various quantities of stacked inputs. Rastgoo et al. [13] employed
a brand-new, sizable hand sign language dataset called RKS-PERSIANSIGN, which is made up of
10,000 RGB videos of 100 Persian sign words, to show their model. Another important contribution is
the new RKS-PERSIANSIGN dataset, which offers a sizable real-world resource for developing and
testing hand sign language recognition algorithms. To address the limitations of current hybrid HMM
and CTC models for frame or word level alignment, Guo et al. [14] proposed a novel hierarchical-
LSTM (HLSTM) encoder-decoder model with visual content and word embedding for Sign Language
Translation (SLT). The HLSTM model is designed to handle different granularities, such as frames,
clips, and viseme units, by capturing spatio-temporal transitions among them. After exploring the
spatiotemporal cues of video clips using a 3D CNN, the authors packed the right visemes via online
key clip mining. By combining the recurrent outputs of the top layer of the HLSTM, a temporal
attention-aware weighting mechanism is used to balance the intrinsic link between the viseme
source positions. Finally, viseme vector processing and semantic meaning translation are done
separately using two LSTM layers. This new approach has the potential to improve the accuracy of
Sign Language Translation by effectively capturing the spatio-temporal dynamics of sign
language. The HLSTM model with its visual content and word embedding enables the model to
capture both the visual and semantic aspects of sign language, and the temporal attention-aware
weighting mechanism further improves the model’s ability to align and translate the spatio-temporal
features. Rastgoo et al. [15] proposed a novel model for recognizing isolated hand sign language
using deep learning techniques from two input modalities, namely RGB and depth movies. The
proposed model incorporates hand posture features and employs Scene Flow (SF) for depth video
inputs and Optical Flow (OF) for flow information in RGB video inputs. The system employs a step-by-
step investigation of various combinations of spatial information and several recurrent models, such
as LSTM and GRU, to evaluate their impact on identification performance. The model achieved a
competitive result on the isoGD dataset, which was only 0.22% lower than the most recent state-of-
the-art model. This approach has the potential to enhance the accuracy of recognizing isolated hand
sign language by effectively combining the strengths of different deep learning techniques and input
modalities. For the extraction of spatiotemporal features and sequence learning, Cui et al. [16] suggested a recurrent convolutional neural network to address the problem of aligning glosses to video segments. The authors developed three stages of optimization using their proposed system
architecture. The first stage involved building an end-to-end sequence learning framework with
connectionist temporal classification (CTC) as the objective function, followed by suggesting an
alignment. In the second stage, the alignment proposal was used as greater supervision to fine-tune
the feature extractor. Finally, the authors presented the revised feature representations and
optimized the sequence learning model. This approach has the potential to improve the accuracy of
gloss-to-video segment alignment and enhance the performance of sign language recognition
systems. For weakly supervised continuous sign language recognition (SLR), Wei et al. [17] suggested a novel approach where only the ordered gloss labels are supplied for each sign video, without temporal boundaries for the frames. The suggested method uses a semantic boundary detection technique based on reinforcement learning to solve the problem of precisely aligning video frames with sign words in a sign video. It formulates semantic boundary identification as a reinforcement learning problem and uses a multi-level perceptual model to build discriminative representations for videos. The location of a semantic boundary is treated as an action, whereas the feature representation of a video segment serves as the state. A quantitative performance measure between the predicted sentence and the ground-truth sentence is used to determine the reward. The
policy gradient algorithm is used to train the policy network. On the CSL Split II and RWTH-PHOENIX-
Weather 2014 datasets, the suggested technique has been thoroughly tested, and the results show
its superiority and efficacy. Word error rate (WER) is used as the primary assessment metric in a new
continuous sign language recognition (SLR) architecture described by Pu et al. [18] that processes
unaligned video-text pairs. Because the WER is not differentiable, learning models are typically optimized with connectionist temporal classification (CTC) as a surrogate objective. However, the sentence predicted with the highest decoding probability may not always be the one favored by the WER metric. The authors suggested a new architecture with increased data diversity as a solution to this issue. They introduced augmented data that mimics the WER calculation process by substituting, deleting, and inserting segments of the associated videos and text labels. They proposed multiple loss terms to reduce the gap between the video and the ground-truth label and to differentiate between the real and generated modalities online using these real and created pseudo video-text pairs. Other CTC-based continuous SLR architectures that
are already in use can easily be added to the framework. Extensive experimentation on two
continuous SLR benchmarks, notably RWTH-PHOENIX-Weather and CSL, confirms the effectiveness of
the presented technique. A fully convolutional network (FCN) for online sign language recognition (SLR) was created by Cheng et al. [19]; it is capable of learning spatial and temporal information from weakly labelled video sequences with only sentence-level annotations. The authors added a gloss feature enhancement (GFE) module to the proposed network to improve sequence alignment learning, and the network was trained end-to-end without any pre-training. The results of the experiments
demonstrated the effectiveness of the method and showed that it performed well in online
recognition. Sign language recognition has received a lot of attention in recent years because it can
facilitate communication and increase access for the deaf. To address the challenges of low-resource
sign language recognition, the OpenHands system, which applies four key ideas from the natural language processing (NLP) field to word-level sign language recognition, was introduced by Selvaraj et al. [20]. To provide implementation benchmarks and checkpoints, the system trained and released four pose-based isolated sign language recognition (ISLR) models, along with pose-based datasets, in six sign languages: American, Argentine, Chinese, Greek, Indian, and Turkish. Evaluation of the OpenHands system showed that ST-GCN, a graph-based method, is accurate and
effective in sign language recognition. From the above discussion, it is observed that there is scope for research on database creation, data preprocessing, and feature selection, and that the existing techniques reviewed could not handle large datasets with high accuracy. This issue is addressed in the proposed system using an LSTM neural network architecture. The strategy in the proposed system is based on one-way communication, and our dataset is built on dynamic ISL. The number of actions in our customized dataset (40 actions) is higher than in the surveyed work. Each action contains 30 videos, and each video contains 30 frames. In this experimentation, the OpenCV Python library is used for gesture preprocessing. The system uses face and shoulder movements along
with hand gestures. 3 Methodology The study utilizes the Indian Sign Language dataset, which
encompasses a diverse collection of 40 different actions performed using sign language. Each action is captured through 30 videos of 30 frames each, ensuring a comprehensive representation. The dataset
includes recordings from a varied group of subjects, ensuring representation across different genders
and signing abilities. To capture precise hand movements and facial expressions, high-resolution
cameras were used during the acquisition process. The subjects were instructed to perform the
actions in controlled environments with adequate lighting and minimal distractions, and multiple
takes were recorded to ensure data quality. However, it is important to acknowledge the limitations
of the study. The dataset’s size may restrict the representation of the entire spectrum of actions and
variations observed in real-world sign language communication, potentially hindering the capture of
diverse signing styles. Additionally, the subjects included may not fully represent the range of
diversity found within the larger population of sign language users, potentially limiting the
generalizability of the findings. The use of specialized equipment and controlled environments may
fail to replicate the nuances of real-life sign language interactions, introducing biases related to
factors such as lighting conditions, camera angles, and limited contextual information. Moreover, the
subjective choices made during data collection, including action selection and framing, may further
influence the dataset and impact the performance of the models. These limitations highlight the
need for future research to address these issues by expanding the dataset, involving diverse subjects,
incorporating more realistic environments, and adopting rigorous protocols to minimize bias and
enhance the generalizability of the findings. Furthermore, the accuracy of the action recognition
system relies heavily on the quality of the training data. Insufficient, biased, or incorrectly labeled
training data can lead to poor performance and inaccurate predictions. Inaccurate detections or
tracking, influenced by factors like lighting conditions, camera quality, occlusions, and variations in
human poses, can also affect the accuracy of the MediaPipe Holistic model used for pose estimation
and hand detection. Incorrect key point extraction resulting from inaccurate detections or tracking
can consequently impact the action recognition results. Moreover, the performance and speed of the
action recognition system are influenced by available hardware resources, including the CPU, GPU,
and memory. Limited resources can result in increased processing time and impact the overall
system performance. 3.1 System Architecture Data processing: The first step, as shown in Figure 1,
is to collect and process the data. This involves gathering video footage of people signing and
labelling the footage with information about the gestures being made. The data is then pre-
processed, cleaned, and transformed as necessary for use in the subsequent steps. Image Processing:
The next step is to use computer vision algorithms to analyze the video footage and identify regions
of interest, such as the hands and face of the signer. Image processing techniques such as image
enhancement, segmentation, and feature extraction could then be used to extract relevant
information about the hand gestures. Figure 1. System architecture using LSTM Body Gesture
(Landmark Detection): Landmark detection techniques could be used to identify specific points on
the hands, such as the fingertips or knuckles, to track the movement of the hands over time. This
could involve using deep learning techniques such as convolutional neural networks (CNNs) or
recurrent neural networks (RNNs) to analyse the video frames and identify key landmarks. Text +
voice: Once the hand gestures have been identified, they can be translated into text or spoken
language using natural language processing (NLP) techniques. To interpret the movements and
provide the necessary output, one might need to combine rule-based and machine learning-based approaches. LSTM Neural Network: An LSTM neural network is used to learn patterns in the
sequence of hand gestures over time. This involves training the network on a large dataset of sign
language video footage and using backpropagation to update the weights of the network over
multiple training epochs. Feature Extraction: Finally, feature extraction techniques are used to extract
relevant features from the input data and transform them into a format that can be used by the
LSTM network. This might involve techniques such as principal component analysis (PCA) or wavelet
analysis to reduce the dimensionality of the incoming data and emphasize important features. Overall, a
sign language recognition system using action detection would involve a combination of data
processing, image processing, body gesture analysis, text and voice recognition, LSTM neural
networks, and feature extraction techniques to identify and interpret hand gestures in sign language.
3.2 System Design The primary feature of this system is to translate the sign language into text.
Initially, widely used gestures have been tracked to train the system. The system works at word-level
translations. The captured images need to be pre-processed; the system modifies the captured images and trains the LSTM model to classify the signs into labels. 3.2.1 User interfaces The
output screen will display the processed video stream, and the bottom of the video display window
will show the predicted words. Three words will be displayed, arranged from left to right in decreasing order of likelihood. A vibrant outline will be used to highlight the word with the highest likelihood. Figures 2-5 show what the actual interface looks like when Kinect
completes the task. As shown in Figure 2, Kinect is carrying out a time-related action, and the
findings are shown at the very top of the screen. Likewise, the results of the activities known as
driving, saying sorry, and brushing are shown in Figure 3, Figure 4, and Figure 5. Figure 2. Action for
time Figure 3. Action for driving Figure 4. Action for sorry Figure 5. Action for brush
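The on-screen behaviour described in Section 3.2.1, three candidate words ordered by likelihood with the most likely one highlighted, could be rendered on the video stream with OpenCV as in the sketch below; the function name, colours, and pixel offsets are illustrative assumptions.

import cv2
import numpy as np

def draw_predictions(frame, probs, actions, top_k=3):
    # Show the top-k most likely actions, left to right in decreasing likelihood.
    order = np.argsort(probs)[::-1][:top_k]
    x = 10
    for rank, idx in enumerate(order):
        label = f"{actions[idx]} ({probs[idx]:.2f})"
        color = (0, 255, 0) if rank == 0 else (255, 255, 255)  # highlight the most likely word
        cv2.putText(frame, label, (x, frame.shape[0] - 20),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, color, 2, cv2.LINE_AA)
        x += 220
    return frame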
3.2.2 Hardware interfaces If the device lacks an integrated camera, an external camera sensor, as well as
the driver required to enable the functionality on that specific operating system and hardware
platform, will be required. 3.2.3 Software interfaces OpenCV: It is used to capture frames from the input stream, which are then fed to the MediaPipe interface. MediaPipe: It extracts the feature key points from the frames captured by OpenCV and then feeds them to the LSTM model for prediction. TensorFlow: It is an open-source
software library for high-performance numerical computing. Computing may be easily implemented
across a variety of platforms (CPUs, GPUs, and TPUs), from desktop PCs to server clusters, due to its
modular design. 3.3 Dataset For testing and training, we produced the customized dataset. Both
male and female participants made up the dataset, which was compiled with different subjects
between the ages of 20 and 25. For specific dynamic-based isolated gestures, we were able to collect
the appropriate data. The average frame rate (FPS) across all video samples is 40, and their average
length is 4 seconds at a resolution of 1920 × 1080. The dataset comprises 40 words; for each action, 30 videos were recorded, and each video contains 30 frames [21].
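To illustrate how such a dataset could be organized for sequence modelling, the sketch below assumes a hypothetical folder layout (ISL_Data/<action>/<video>/<frame>.npy) holding the per-frame keypoint vectors and loads them into arrays suitable for training.

import os
import numpy as np

DATA_PATH = "ISL_Data"                      # hypothetical root folder
actions = sorted(os.listdir(DATA_PATH))     # 40 action names
label_map = {action: idx for idx, action in enumerate(actions)}

sequences, labels = [], []
for action in actions:
    for video in sorted(os.listdir(os.path.join(DATA_PATH, action))):      # 30 videos per action
        window = [np.load(os.path.join(DATA_PATH, action, video, f"{frame}.npy"))
                  for frame in range(30)]                                   # 30 frames per video
        sequences.append(window)
        labels.append(label_map[action])

X = np.array(sequences)   # shape: (40 * 30, 30, number of keypoint features)
y = np.array(labels)      # one integer label per sequence
print(X.shape, y.shape)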
4 LSTM Deep learning uses LSTMs, also known as long short-term memory networks, a class of recurrent neural network (RNN) with the
capacity to learn long-term dependencies, especially in tasks involving sequence prediction. In
addition to processing single data points such as images, an LSTM has feedback connections that allow it to process entire data sequences. This is beneficial for speech recognition and machine translation,
among other things. The LSTM, a special RNN variant, performs exceptionally well on a wide range of problems; it uses forgetting components and is usually suggested to address the vanishing gradient problem, since this mechanism lets the network learn the optimal time lags and decide when to forget certain information. Given
the properties of the known methods and the strategy suggested in this work, it is anticipated that a
combination of these methods will have the ability to understand and distinguish sign language
motions from a source of video and emit the corresponding English word. The LSTM model’s
structure is governed by a chain of repeating modules, although the structure of a repeating module may change based on the application; instead of a single neural network layer, each repeating module contains four interacting layers. Figure 6 depicts the LSTM architecture, in which x_t denotes the input vector, c_t and h_t the current memory cell state and output, and c_{t-1} and h_{t-1} the memory cell state and output of the previous step. A GRU, in contrast, has no separate memory cell and uses gating units to control the flow of
information inside the unit. The mentioned method uses ISL gesture sequences in the video source
that has been provided as an input. The major goal of this system is to recognize words from real-life
signing movements. The first step is to divide the video file carrying the sequence of ISL motions for
different words into individual sub-sections comprising different words. The proposed approach is a
real-time system that uses image processing to process live sign gestures. The translated output
would then display text and voice after classifiers were used to distinguish between different signs.
To train on the data set, deep learning methods will be applied. With the use effective algorithms
and high-quality data, its objective is to make the current system in this field more precise and
quicker to respond. The potential of LSTM networks to learn long-term dependencies led to its study
and implementation for the classification of gesture data. The developed model classifies the
motions with high accuracy, demonstrating the viability of employing LSTM-based neural networks
for sign language translation with the same parameters. The input sequence is first processed by the 1536-unit LSTM layer, and its output is passed to a GRU layer with the same configuration, including a dropout of 0.3 and an L2 kernel regularizer. The outputs are then fed to a fully connected dense layer, followed by a dropout layer with a rate of 0.3. Figure 6. Architecture of LSTM
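A minimal Keras sketch of the layer configuration described in this section is shown below. The ordering of the LSTM and GRU layers follows our reading of the text, while the dense-layer width and the L2 regularization strength are assumed values.

import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

NUM_FRAMES, NUM_FEATURES, NUM_ACTIONS = 30, 1662, 40   # 30-frame sequences, 40 sign actions

model = models.Sequential([
    layers.LSTM(1536, return_sequences=True, dropout=0.3,
                kernel_regularizer=regularizers.l2(1e-4),          # assumed L2 strength
                input_shape=(NUM_FRAMES, NUM_FEATURES)),
    layers.GRU(1536, dropout=0.3,
               kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(256, activation="relu"),                          # fully connected layer (assumed width)
    layers.Dropout(0.3),
    layers.Dense(NUM_ACTIONS, activation="softmax"),               # one probability per action
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Training then amounts to calling model.fit on the keypoint sequences and labels described in Section 3.3, monitoring the training and validation curves discussed in the next section.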
5 Results and Discussion The analysis of the training loss and validation loss over the number
of epochs provides valuable insights into the performance of the machine learning model. When
both losses decrease together, it indicates that the model is effectively learning from the training
data and generalizing well to new, unseen data. This is a positive sign as it suggests that the model is
not overfitting the training data and has the potential to make accurate predictions on new data
points. On the other hand, if the validation loss starts to rise while the training loss continues to
decrease, it may indicate that the model is overfitting and is not able to generalize well. This
information can guide us in making architectural considerations and necessary adjustments to
improve the model’s performance. Similarly, the relationship between training accuracy and
validation accuracy over the epochs provides further insights into the model’s capabilities. The fact
that the training accuracy reached 87% indicates that the model is able to correctly classify a high
proportion of the training data. However, it is important to note that the validation accuracy may
differ from the training accuracy, as it measures the model’s performance on unseen data. By
examining the graph depicting the training accuracy and validation accuracy, we can observe
whether the model is able to maintain a similar level of performance on new data or if there is a
significant drop in accuracy. This understanding is crucial in determining the model’s practical
significance and its ability to make accurate predictions in real-world scenarios. Overall,
analyzing the training loss, validation loss, training accuracy, and validation accuracy provides us with
valuable insights into the model’s learning and generalization capabilities. These insights help us
make informed decisions regarding the model’s architecture, identify potential issues such as
overfitting, and make necessary adjustments to improve its performance on unseen data. By
continuously monitoring and evaluating these metrics, we can ensure that the model is effectively
learning and producing reliable predictions, thus making it more practical and impactful in real-world
applications. Figure 7. Training and validation loss By comparing the training loss and validation loss
over the number of epochs, we can get insights into the model’s performance. If the training loss and
validation loss decrease together, it’s a good sign that the model is learning from the training data
and generalizing well to new data. The training loss vs. validation loss is displayed against the number
of epochs in Figure 7 as a graph. This assisted us in making defensible judgements regarding the
necessary architectural considerations. The graph in Figure 8 shows the relationship between
training accuracy and validation accuracy over the number of epochs. The accuracy attained with reference to this graph is 87%. Figure 8. Training and validation accuracy 6 Conclusion and
Future Scope This study presents a technique for sign language action detection using LSTM on a
proprietary dataset, with the objective of recognizing complete sign actions. The findings
demonstrate that the proposed approach achieves a sign-action recognition accuracy of up to 87%.
However, it is important to acknowledge limitations stemming from model uncertainties, dataset
biases, and experimental errors that may restrict generalization. To enhance the proposed approach,
theoretical improvements can be made by exploring advanced neural network architectures and
incorporating transfer learning. Practical recommendations include augmenting the dataset with
more diverse examples, evaluating real-time scenarios, and engaging with sign language experts and
communities to ensure accuracy and cultural sensitivity. To build on the current study, future work
can focus on several directions. Firstly, improving the image processing component can enhance the
accuracy of hand and joint detection, enabling more precise sign language action recognition.
Secondly, further advancements can be made in translating sequences of sign language movements
into text and speech, refining the system’s ability to facilitate bidirectional communication between
sign language and spoken language users. Increasing the size and variability of the dataset would
contribute to better generalization and robustness of the model. Additionally, validating the results
with additional subjects and videos would provide a more comprehensive evaluation of the
proposed technique. These future endeavors would strengthen the practical implications of the
study by refining the accuracy, inclusivity, and accessibility of sign language recognition systems,
ultimately enabling more effective and inclusive communication between sign language and spoken
language users. Data Availability The data used to support the findings of this study are available
from the corresponding author upon request. Conflicts of Interest The authors declare no conflict of
interest.
References
[1] R. Rastgoo, K. Kiani, and S. Escalera, “Sign language recognition: A deep survey,” Expert Syst. Appl., vol. 164, p. 113794, 2020. https://doi.org/10.1016/j.eswa.2020.113794
[2] I. Papastratis, C. Chatzikonstantinou, D. Konstantinidis, K. Dimitropoulos, and P. Daras, “Artificial intelligence technologies for sign language,” Sensors, vol. 21, no. 17, p. 5843, 2021. https://doi.org/10.3390/s21175843
[3] D. Kothadiya, C. Bhatt, K. Sapariya, K. Patel, A. Gil-González, and J. M. Corchado, “DeepSign: Sign language detection and recognition using deep learning,” Electronics, vol. 11, no. 11, p. 1780, 2022. https://doi.org/10.3390/electronics11111780
[4] N. Adaloglou, T. Chatzis, I. Papastratis, A. Stergioulas, G. T. Papadopoulos, V. Zacharopoulou, G. J. Xydopoulos, K. Atzakas, D. Papazachariou, and P. Daras, “A comprehensive study on deep learning-based methods for sign language recognition,” IEEE Trans. Multimedia, vol. 24, pp. 1750–1762, 2021. https://doi.org/10.48550/arXiv.2007.12530
[5] R. Rastgoo, K. Kiani, and S. Escalera, “Video-based isolated hand sign language recognition using a deep cascaded model,” Multimed. Tools Appl., vol. 79, pp. 22965–22987, 2020. https://doi.org/10.1007/s11042-020-09048-5
[6] N. Sarhan and S. Frintrop, “Transfer learning for videos: From action recognition to sign language recognition,” in 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 2020, pp. 1811–1815. https://doi.org/10.1109/ICIP40778.2020.9191289
[7] A. Halder and A. Tayade, “Real-time vernacular sign language recognition using MediaPipe and machine learning,” Int. J. Res. Publ. Rev., vol. 2, pp. 9–17, 2021.
[8] A. C. Caliwag, H. J. Hwang, S. H. Kim, and W. Lim, “Movement-in-a-video detection scheme for sign language gesture recognition using neural network,” Appl. Sci., vol. 12, no. 20, p. 10542, 2022.
[9] Z. Yang, Z. Shi, X. Shen, and Y.-W. Tai, “SF-Net: Structured feature network for continuous sign language recognition,” arXiv preprint arXiv:1908.01341, 2019. https://doi.org/10.48550/arXiv.1908.01341
[10] Y. Liao, P. Xiong, W. Min, W. Min, and J. Lu, “Dynamic sign language recognition based on video sequence with BLSTM-3D residual networks,” IEEE Access, vol. 7, pp. 38044–38054, 2019. https://doi.org/10.1109/ACCESS.2019.2904749
[11] A. Thakur, P. Budhathoki, S. Upreti, S. Shrestha, and S. Shakya, “Real time sign language recognition and speech generation,” J. Innov. Image Process., vol. 2, no. 2, pp. 65–76, 2020. https://doi.org/10.36548/jiip.2020.2.001
[12] T. Starner, J. Weaver, and A. Pentland, “Real-time American Sign Language recognition using desk and wearable computer based video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 12, pp. 1371–1375, 1998. https://doi.org/10.1109/34.735811
[13] R. Rastgoo, K. Kiani, and S. Escalera, “Hand sign language recognition using multi-view hand skeleton,” Expert Syst. Appl., vol. 150, p. 113336, 2020. https://doi.org/10.1016/j.eswa.2020.113336
[14] D. Guo, W. Zhou, H. Li, and M. Wang, “Hierarchical LSTM for sign language translation,” in Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, vol. 32, no. 1, 2018, pp. 6845–6852.
[15] R. Rastgoo, K. Kiani, and S. Escalera, “Hand pose aware multimodal isolated sign language recognition,” Multimed. Tools Appl., vol. 80, pp. 127–163, 2021. https://doi.org/10.1007/s11042-020-09700-0
[16] R. Cui, H. Liu, and C. Zhang, “Recurrent convolutional neural networks for continuous sign language recognition by staged optimization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 7361–7369. https://doi.org/10.1109/CVPR.2017.175
[17] C. Wei, J. Zhao, W. Zhou, and H. Li, “Semantic boundary detection with reinforcement learning for continuous sign language recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 3, pp. 1138–1149, 2020. https://doi.org/10.1109/TCSVT.2020.2999384
[18] J. Pu, W. Zhou, H. Hu, and H. Li, “Boosting continuous sign language recognition via cross modality augmentation,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1497–1505. https://doi.org/10.48550/arXiv.2010.05264
[19] K. L. Cheng, Z. Yang, Q. Chen, and Y. W. Tai, “Fully convolutional networks for continuous sign language recognition,” in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, 2020, pp. 697–714. https://doi.org/10.48550/arXiv.2007.12402
[20] P. Selvaraj, G. NC, P. Kumar, and M. Khapra, “OpenHands: Making sign language recognition accessible with pose-based pretrained models across languages,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 2022, pp. 2114–2133. http://dx.doi.org/10.18653/v1/2022.acl-long.150
[21] V. Koli, “Action recognition in sign language detection using LSTM,” 2023. https://github.com/Vikrantkoli5/ACTION-RECOGNITION-IN-SIGN-LANGUAGE-DETECTION-USING-LSTM
Published in Transactions on Machine Learning Research (12/2023) IndicTrans2: Towards High-
Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages Jay Gala∗ 1
Pranjal A. Chitale∗ 1,2 Raghavan AK1,2 Varun Gumma3† Sumanth Doddapaneni1,2 Aswanth
Kumar6† Janki Nawale1 Anupama Sujatha1 Ratish Puduppully7 Vivek Raghavan1,4‡ Pratyush
Kumar1,2,3§ Mitesh M. Khapra1,2¶ Raj Dabre5 Anoop Kunchukuttan1,2,3 1Nilekani Centre at
AI4Bharat 2 Indian Institute of Technology Madras 3Microsoft 4EkStep Foundation 5National
Institute of Information and Communications Technology, Kyoto, Japan 6Flipkart 7 Institute for
Infocomm Research (I2R), A∗STAR, Singapore Reviewed on OpenReview:
https://openreview.net/forum?id=vfT4YuzAYA Abstract India has a rich linguistic landscape, with
languages from 4 major language families spoken by over a billion people. 22 of these languages
listed in the Constitution of India (referred to as scheduled languages) are the focus of this work.
Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are
essential in a country like India. Before this work, there was (i) no parallel training data spanning all
22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant
to India, and (iii) no existing translation models that support all 22 scheduled languages of India. In
this work, we aim to address this gap by focusing on the missing pieces required for enabling wide,
easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We
identify four key areas of improvement: curating and creating larger training datasets, creating
diverse and high-quality benchmarks, training multilingual models, and releasing models with open
access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the
largest publicly available parallel corpora for Indic languages. BPCC contains a total of 230M bitext
pairs, of which a total of 126M were newly added, including 644K manually translated sentence pairs
created as part of this work. Our second contribution is the release of the first n-way parallel
benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and
conversational test sets. Next, we present IndicTrans2, the first translation model to support all 22
languages, surpassing existing models in performance on multiple existing and new benchmarks
created as a part of this work. Lastly, to promote accessibility and collaboration, we release our
models and associated data with permissive licenses at https://github.com/AI4Bharat/IndicTrans2. 1
Introduction India is a linguistically diverse region, with 1,369 distinct mother tongues identified in
the census conducted in 2011. Of these, 22 languages have been listed in the 8th Schedule of the
Constitution of India. Approximately 97% of the population of India speaks one of these 22 languages
as their first language. English is widely spoken and serves as the default medium of formal
communication in many areas, particularly in business, education, government, and judiciary. ∗ Equal
Contribution. All author contributions listed in Section 10. † Work done as a Master’s student at
Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras. ‡Work done while at Nilekani
Centre at AI4Bharat and EkStep Foundation. §Work done while at Nilekani Centre at AI4Bharat,
Indian Institute of Technology Madras and Microsoft. ¶ Corresponding Author: Mitesh Khapra
([email protected]). With
such linguistic diversity, the importance in India of language translation for effective communication,
social inclusion, equitable access, and national integrity cannot be over-emphasized. For example, for
effective dissemination of information about government policies and welfare schemes, it is
necessary to translate official documents and websites into regional languages. In the context of the
judiciary, it is crucial to translate court proceedings and judgments into regional languages so that
the petitioners, accused, and witnesses can understand and better participate in the judicial process.
Similarly, in the context of education, translation can ensure that high-quality content becomes
accessible to more learners in their regional languages. Lastly, translation also plays a vital role in
national integration by ensuring that people migrating/traveling to and from different parts of the
country can communicate better with people in their new locations. The last decade has seen rapid
progress in Neural Machine Translation, with the latest neural models (Johnson et al., 2017; Liu et al.,
2020a; Fan et al., 2020; Kim et al., 2021; Lepikhin et al., 2021; Ramesh et al., 2022; Costa-jussà et al.,
2022; Siddhant et al., 2022) supporting hundreds of languages and thousands of translation
directions. However, these models either do not have a good coverage of Indian languages, or their
performance on Indian languages is poor, or both. Further, none of these models are evaluated on a
diverse set of domains or content of Indian origin, as there are no robust benchmarks designed
explicitly for Indian languages. Another evidence of the neglect of Indian languages is that in the past
16 years since its inception, the shared tasks run under the Workshop on Machine Translation (WMT)
have only covered a total of 4 Indian languages summed across all these years.1 While the Workshop
on Asian Translation (WAT) (Nakazawa et al., 2022) and the Workshop on Speech and Language
Technologies for Dravidian Languages (Madasamy et al., 2022) have made significant contributions,
they have not garnered the same level of popularity or academic participation as the WMT. As a
result, despite the rapid progress in the broader field of Machine Translation, no single commercial
or open-source translation model supports all the 22 languages listed in the Constitution. In this
paper, we pose the following question: What are the missing pieces required for enabling wide and
easy access to high-quality machine translation for all 22 scheduled Indian languages? We believe
there are four axes of improvement required: (a) curation and creation of significantly larger training
datasets, (b) creation of high quality and diverse benchmarks, (c) training and evaluation of
multilingual models, and (d) releasing of models with open access. For axis (a) training datasets, we
need to create high-quality “seed data” comprising manually translated parallel sentences for all 22
languages with representation from diverse domains. It is to be noted that for several of the 22
languages, no publicly available translation data exists. This manually created data has to be
supplemented with a higher volume of semi-automatically generated data by bitext mining from
web-scale monolingual corpora and multilingual documents. For axis (b) benchmarks, we need
expert-created highly accurate benchmarks for all 22 languages across variations such as formality of
language, length of sentences, domain of text, and source originality. For axis (c) models, we need to
train accurate multilingual models that exploit the similarity between Indian languages and
particularly benefit low-resource languages. We also need to improve processes for the evaluation of
models by choosing robust metrics that are shown to correlate with human evaluation for Indian
languages. In addition, we need to evaluate models with other metrics, such as improvement in post-
editing performance. Finally, for axis (d) open access, created models must have permissive licenses
that can be commercially deployed. For instance, Meta’s NLLB models, though released in the open,
have a CC-BY-NC license precluding commercial usage. In this paper, we contribute across these four
axes with many notable firsts that we highlight below. Training datasets. We release the largest
publicly available parallel corpora for Indic languages, the Bharat Parallel Corpus Collection (BPCC).
As summarized in Table 1, BPCC contains a total of ~230M bitext pairs, of which a total of ~126M
were newly added as part of this work. BPCC includes the following: • Seed training data containing
human translations of English sentences to all 22 Indic languages spanning multiple domains. This
has a total of 644K En-X translation pairs across all languages, including 7 languages for which no
manually created parallel data existed before this work. • Bitext pairs from existing collections such
as Samanantar (Ramesh et al., 2022) and NLLB (Costa-jussà et al., 2022) which were further filtered
using LaBSE (Feng et al., 2022) based cosine similarity thresholds. 1This is, of course, not a comment
on the organizers of WMT but a reflection of the lack of academic interest in Indian languages due to
the lack of sufficient training and evaluation data. • New bitext pairs mined from additional monolingual sources such as
archive.org and IndicCorp v2 (Doddapaneni et al., 2023) which were not covered in the existing
collections mentioned above. • New bitext pairs mined from additional document-aligned parallel
sources such as NPTEL, UGCResources, Prabhupada Vani, etc. which were not covered in the existing
collections mentioned above. • A very large set of ~800 million back-translated sentences from
diverse sources such as IndicCorp v2 (Doddapaneni et al., 2023), monolingual side of NLLB data
(Costa-jussà et al., 2022) and CC-Matrix (Schwenk et al., 2021b). We visualize these types of data in
BPCC in Figure 7, to highlight the language coverage and our contributions in relation to existing
data. As can be seen, for many languages, BPCC makes the first available datasets, and for all
languages, it makes a significant increase in the datasets available. Benchmarks. We create IN22, the
first n-way parallel benchmark covering all 22 Indian languages with the English side being source-
original. For benchmarks to be of high quality, they must represent content from diverse domains.
We visualize the diversity of our created benchmark in Figure 8. Our benchmark contains high-quality
human translations for sentences taken from India-specific articles belonging to 13 different
domains, viz., Culture, Economy, Education, Entertainment, Geography, Government, Health,
Industry, Legal, News, Religion, Sports, and Tourism (see left chart of Figure 8). We refer to this
subset as IN22-Gen. Our benchmark has another subset, IN22-Conv, that contains translations for
sentences taken from everyday conversations in the Indian context from 16 different domains, which
were manually created by in-house experts starting from carefully created conversation prompts (see
right chart of Figure 8). Models. We release IndicTrans2 (IT2), the first translation model to support
all the 22 scheduled Indian languages, trained on the BPCC dataset. The progress made in the quality
of translation in this work with existing open models is captured in Figure 1. The plot shows the
chrF++ metric for English to different languages (which is usually the more challenging translation
direction for low-resource languages). Each language is represented by circles, where the size of the
circle represents the number of speakers in that language. As can be seen, with IndicTrans2, we
made progress in translation quality across languages and now support moderate to high-quality
translation for most speakers in India. Later in the paper, we also report COMET scores, comparisons
with commercial models, and human evaluations of our translations. We find that IT2 is the first
model for Indian languages, which performs at par not only with open-source models like NLLB
(Costa-jussà et al., 2022) but also with commercial models from Google and Microsoft. We release
IndicTrans2-M2M, the first model to support direct translations between all the 22 scheduled Indic
languages, supporting 462 translation directions. Open Access. We aim to promote wider access to
accurate translation models for all Indian languages. Therefore, we will release IndicTrans2 and its
derivatives (IndicTrans2-M2M, IndicTrans2-Dist) under an open-source license, along with all training
data, source code, and tools to enable replication and further improvements by the research
community. Additionally, we provide IndicTrans2-Dist, approximately 1/5 the size of IndicTrans2
(~211M) with comparable performance to reduce deployment costs. We hope our paper will serve
as a starting point for future research on Indic machine translation. Figure 2 provides a
comprehensive overview of the entire workflow, which involved the development of requisite human
infrastructure, building high-quality seed datasets and robust India-centric benchmarks, and
culminates with the release of IndicTrans2, which is the first model to support all the 22 scheduled
languages. Section 3 describes the process followed for the creation of high-quality benchmarks and
seed training data, which entails the establishment of a human infrastructure, followed by a detailed
account of the translation workflow and the quality control procedures implemented. Subsequently,
Section 4 outlines our bitext mining pipeline, incorporating both manual and automated checks that
employ toxicity and language filters. After the creation of the benchmarks and training data, the next
task, as covered in Section 5 is the training of IndicTrans2 with ablation of model architecture,
dataset selections, and training procedures. Furthermore, Section 6 describes the robust evaluation
of IndicTrans2 across existing benchmarks such as FLORES and the benchmarks we create, across
diverse metrics and against both open-source and commercial models. 3 Published in Transactions
on Machine Learning Research (12/2023) M2M-100 (2020) IT1 (2021) NLLB MoE (2022) IT2 (2023) 0
20 40 60 chrF++ Assamese Bengali Bodo Dogri Konkani Gujarati Hindi Kannada Kashmiri Maithili
Malayalam Marathi Manipuri Nepali Odia Punjabi Sanskrit Santali Sindhi Tamil Telugu Urdu Assamese
Bengali Bodo Dogri Konkani Gujarati Hindi Kannada Kashmiri Maithili Malayalam Marathi Manipuri
Nepali Odia Punjabi Sanskrit Santali Sindhi Tamil Telugu Urdu Assamese Bengali Bodo Dogri Konkani
Gujarati Hindi Kannada Kashmiri Maithili Malayalam Marathi Manipuri Nepali Odia Punjabi Sanskrit
Santali Sindhi Tamil Telugu Urdu Assamese Bengali Bodo Dogri Konkani Gujarati Hindi Kannada
Kashmiri Maithili Malayalam Marathi Manipuri Nepali Odia Punjabi Sanskrit Santali Sindhi Tamil
Telugu Urdu High chrF++ score Moderate chrF++ score Low chrF++ score No MT System Figure 1: A
visual representation of the advancements in machine translation systems for Indic languages using
the IN22-Gen Evaluation set in the En-Indic direction. The depicted values have been subjected to
minor adjustments to enhance readability; however, they accurately convey the overall trend.
Thresholds are utilized to estimate performance boundaries for various systems across languages.
The size of each language bubble is proportional to the speaker count for that language (see Table
57). The paper concludes with a comprehensive summary and outlines potential future research
directions. The Appendices provide supplementary results and additional details, including model
and dataset cards. 2 Related Work Languages of India. India, with a population of more than 1.4
billion, is a diverse country known for its rich linguistic heritage, and home to some of the world’s
most widely spoken languages. According to the Census of India 2011, 1369 mother tongues have
been identified of which 121 languages have at least 10,000 speakers and 31 languages have at least
a million speakers.2 22 of these languages have been listed in the 8th Schedule of the Constitution of
India3 , recognizing them as the scheduled languages of the Republic of India. According to the
schedule, the Government of India is under an obligation to take measures to develop these
languages such that they become an effective means of communication. Nine of the Indic languages
are amongst the most spoken languages across the globe4 : Hindi (4th), Bengali (6th), Marathi (13th),
Telugu (14th), Tamil (17th), Urdu (20th), Punjabi (22nd), Gujarati (24th) and Bhojpuri (26th). Some of
these languages are also widely spoken and/or are official languages in neighboring countries viz.,
Bangladesh, Nepal, and Pakistan. Indian languages are also fast-growing across the globe, particularly
in North America, the United Kingdom, Australia, and the Middle East. Beyond the Indic languages,
English is also widely used in India.
2 https://en.wikipedia.org/wiki/Languages_of_India
3 https://rajbhasha.gov.in/en/languages-included-eighth-schedule-indian-constitution
4 https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers
Table 1: Overall statistics for data collated from different sources (in thousands) for Indian languages and resources in this work. In this document, each language is identified with a BCP 47 tag sequence comprised of an ISO 639-3 language subtag and an ISO 15924 script subtag.
Name | Language | Existing Mined: Samanantar | Existing Mined: NLLB | Existing Human: NLLB | Existing Human: ILCI | Existing Human: MASSIVE | BPCC New Mined: Monolingual | BPCC New Mined: Comparable | BPCC New Human: Wiki | BPCC New Human: Daily
Assamese | asm_Beng | 58.8 | 506.3 | - | 82.1 | - | 712.5 | 37.8 | 44.7 | 11.3
Bengali | ben_Beng | 2,946.3 | 13,580.5 | - | 123.8 | 16.5 | 16,055.1 | 258.2 | 48.0 | 8.5
Bodo | brx_Deva | - | - | - | 83.2 | - | -
We discard sentences that are too short (< 4 words) or too long (> 40 words), as we found
that the quality and reliability of embeddings deteriorate beyond these lengths. After this quality
control procedure, we apply strict deduplication to eliminate any potential duplicates on the
normalized sentences in the monolingual corpora. Sentence Embedding Model. Prior work such as
Samanantar (Ramesh et al., 2022) and NLLB (Costa-jussà et al., 2022) have employed the LaBSE (Feng
et al., 2022) and LASER3 (Heffernan et al., 2022) models for bitext mining respectively. However, to
determine the optimal sentence embedding model for our mining purposes, we analyze the
correlation of the Semantic Textual Similarity Rating (Agirre et al., 2016) with the cosine similarity
scores obtained using both sentence embedding models. We consider the STS dataset released by
Ramesh et al. (2022) with a human rating for a set of 11 languages. Our analysis suggests that the
cosine similarity scores of LaBSE sentence embeddings exhibit a stronger correlation with the human
ratings on a macro scale, as shown in Table 5. Therefore, we adopt the LaBSE model as the primary
sentence embedding model for our bitext mining and filtering pipeline and only fall back to LASER3
for the languages not supported by LaBSE. We use LASER3 for languages such as Kashmiri
(Devanagari), Kashmiri (Arabic), Maithili, Manipuri (Bengali), Nepali, Sanskrit, and Sindhi (Arabic).
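As a minimal illustration of this comparison, the sketch below (assuming the sentence-transformers release of LaBSE) computes cosine similarities for candidate sentence pairs; these scores can then be correlated with the human STS ratings using, for example, a Pearson or Spearman coefficient. The example sentences are placeholders.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

english = ["The farmer harvested the wheat in March.",
           "The court adjourned the hearing until Monday."]
indic = ["<candidate translation 1>", "<candidate translation 2>"]   # placeholder sentences

emb_en = model.encode(english, normalize_embeddings=True)
emb_in = model.encode(indic, normalize_embeddings=True)

# With L2-normalized embeddings, the dot product equals the cosine similarity.
cosine = np.sum(emb_en * emb_in, axis=1)
print(cosine)   # compare these scores against the human STS ratings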
Indexing. To ensure a common embedding space for all languages, we utilized LaBSE (Feng et al.,
2022) to compute the sentence embeddings for all the sentences. Our approach for mining parallel
sentences involves searching through English; thus we indexed all the English sentences and treated
the Indic language sentences as queries. To accommodate the large corpus of 429 million English
sentences, we partitioned them into 5 shards and indexed each shard separately. In line with
previous work (Ramesh et al., 2022), we utilized a FAISS Index13 with 100K clusters and employed
Product Quantization (Jégou et al., 2011) to reduce the dimensionality of the embeddings from 768
to 64, with each dimension represented by an 8-bit integer value.
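A simplified sketch of this indexing configuration with the faiss library is given below; the file names are hypothetical, only a single shard is handled, and in practice the index would be trained on a large sample before adding all sentences.

import faiss
import numpy as np

d, n_clusters, pq_m = 768, 100_000, 64   # LaBSE dimension, IVF clusters, 64 PQ sub-vectors of 8 bits each

# L2-normalized LaBSE embeddings of the English sentences in one shard (float32, shape (N, 768)).
english_emb = np.load("english_shard_0_labse.npy").astype("float32")   # hypothetical file

quantizer = faiss.IndexFlatIP(d)   # inner product equals cosine similarity on normalized vectors
index = faiss.IndexIVFPQ(quantizer, d, n_clusters, pq_m, 8, faiss.METRIC_INNER_PRODUCT)
index.train(english_emb)
index.add(english_emb)

# Query with Indic-side embeddings; candidates are later re-scored with the full 768-d vectors.
index.nprobe = 1024   # number of clusters searched per query, as in the retrieval step described below
queries = np.load("indic_batch_labse.npy").astype("float32")            # hypothetical file
scores, ids = index.search(queries, 1)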
13 https://github.com/facebookresearch/faiss
Table 6: URLs and domains of the sources used for comparable corpora mining.
Source | URL | Domain
isha | https://isha.sadhguru.org/in/en/wisdom | Religion, Education, Culture
mkb | https://www.pmindia.gov.in/en/mann-ki-baat | Government, News, Education
nios | https://nios.ac.in/online-course-material.aspx | Education
nptel | https://nptel.ac.in/courses | Education
pib | https://pib.gov.in/AllRelease.aspx | Government, News, Legal
spoken-tutorial | https://spoken-tutorial.org/tutorial-search | Education
ugc | http://ugceresources.in | Education
vanipedia | https://tinyurl.com/2sf547tn | Religion, Education, Culture
Retrieval. To retrieve parallel sentence pairs
for a given query sentence (SL) in language L, we use LaBSE (Feng et al., 2022) to compute the
embedding of the query sentence and perform a search on the FAISS Index constructed from the
English sentences. First, we retrieve the top k (k = 1024) clusters by computing the cosine similarity
between the cluster centroids and the query embedding. Subsequently, we search for ANNs within
these clusters to retrieve the closest match. However, as pointed out by Ramesh et al. (2022), the
similarity scores can vary when using quantized vectors (64d) while preserving the relative ranking
among the sentence pairs. To ensure high-quality matches, we recompute the cosine similarity using
the original 768d vectors and only retain pairs with a similarity score above a threshold of 0.80,
indicating a strong semantic match. The process is repeated on each of the 5 English partitions, and
only the highest-scoring match is retained. 4.2 Mining from Comparable Corpora For Indian
languages, we explore the mining of parallel corpora from comparable sources, i.e., multilingual
websites containing high-quality parallel documents. We first align potentially parallel documents
using heuristics to reduce the search space, followed by the extraction of high-quality parallel
sentences from aligned documents. Data Curation. We first identify several websites that publish
content in multiple Indic languages. The articles on these websites are aligned across different
languages, indicating they are exact translations of each other. Owing to this, the search space is
reduced considerably as compared to monolingual corpus mining. The selected sources are diverse in
domains covering a range of topics like Education, Legal, Religion, etc., and of high quality as verified
by language experts. An overview of the sources is available in Table 6. We follow the same pre-
processing steps to segment the documents into sentences, followed by language identification and
toxicity filters. Indexing. Similar to monolingual corpora, we use the LaBSE (Feng et al., 2022) model
to index both the source and target sentences. Since the search space is much smaller in comparable
corpora, we perform a full search over the entire target sentences in the corresponding document.
Retrieval. Let S = {s_1, s_2, . . . , s_m} be the set of source sentences and T = {t_1, t_2, . . . , t_n} be the set of target sentences. Let f(s_i, t_j) be the scoring function for calculating the semantic similarity between a source and a target sentence. Given that
m and n are considerably smaller than the size of the monolingual corpus, we perform a total of m ×
n scoring computations. Following Artetxe & Schwenk (2019), we use the margin-based scoring
(Equation 2) to find the closest semantic match between a given source and target sentences. The
sentences under consideration are represented by the pair (x, y). We denote the k unique nearest
neighbors of x and y in the other language as NN_k(x) and NN_k(y), respectively. We perform margin-
based mining in both forward and backward directions to eliminate the candidate pairs with
inconsistent alignment and retain only those that intersect, resulting in high-quality bitext pairs.
Following Costa-jussà et al. (2022) we use a margin threshold of 1.06 with 4 nearest neighbors.
Additionally, we set a cosine threshold of 0.80 for the high-resource languages and perform LID
filtering to remove substandard sentence pairs. Considering the high memory requirements and the
high variability of margin scores based on cluster sizes when operating in shards, employing margin-
based mining for monolingual corpus with the current infrastructure was not feasible.

margin(x, y) = \frac{\cos(x, y)}{\sum_{z \in NN_k(x)} \frac{\cos(x, z)}{2k} + \sum_{z \in NN_k(y)} \frac{\cos(y, z)}{2k}}    (2)
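As a minimal illustration of Equation (2), the sketch below computes the ratio margin for one candidate pair; the neighbor arrays are assumed to come from the LaBSE-based nearest-neighbor search described above, and the thresholds mirror the values stated in the text.

```python
# Minimal sketch of ratio-margin scoring (Equation 2); embeddings are assumed to be
# L2-normalized LaBSE vectors, so a dot product equals cosine similarity.
import numpy as np

def margin_score(x, y, nn_x, nn_y, k=4):
    """x, y: 1D normalized embeddings of the candidate pair; nn_x, nn_y: arrays of the
    k nearest neighbors of x and y in the *other* language, shape [k, dim]."""
    cos_xy = float(np.dot(x, y))
    denom = (nn_x @ x).sum() / (2 * k) + (nn_y @ y).sum() / (2 * k)
    return cos_xy / denom

# A pair is kept only if it is the mutual best match in both mining directions,
# its margin exceeds 1.06, and (for high-resource languages) its cosine is >= 0.80.
```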
Following mining from comparable corpora, we extract 4.5 million sentence pairs across 17 Indic languages. The statistics and the sources for the mined bitext are available in Table 7.

Table 7: Statistics of the bitext mining from comparable corpora (till Oct 2022).

Language  | Source                                                       | Extracted Pairs
asm_Beng  | mkb, nios, pib, spoken-tutorial, vanipedia                   | 38,656
ben_Beng  | isha, mkb, nios, nptel, pib, spoken-tutorial, ugc, vanipedia | 263,394
brx_Deva  | spoken-tutorial                                              | 700
guj_Gujr  | isha, mkb, nios, nptel, pib, spoken-tutorial, ugc, vanipedia | 594,847
hin_Deva  | isha, mkb, nios, nptel, pib, spoken-tutorial, ugc, vanipedia | 891,464
kan_Knda  | isha, mkb, nios, nptel, pib, spoken-tutorial, ugc, vanipedia | 386,408
mai_Deva  | spoken-tutorial                                              | 84
mal_Mlym  | isha, mkb, nios, nptel, pib, spoken-tutorial, ugc, vanipedia | 365,893
mar_Deva  | isha, mkb, nios, nptel, pib, spoken-tutorial, ugc, vanipedia | 453,371
mni_Beng  | mkb, pib                                                     | 22,322
npi_Deva  | isha, spoken-tutorial, vanipedia                             | 6,247
ory_Orya  | mkb, nios, pib, spoken-tutorial, vanipedia                   | 125,143
pan_Guru  | mkb, nios, pib, spoken-tutorial, vanipedia                   | 216,108
san_Deva  | spoken-tutorial                                              | 702
tam_Taml  | isha, mkb, nios, nptel, pib, spoken-tutorial, ugc, vanipedia | 455,965
tel_Telu  | isha, mkb, nios, nptel, pib, spoken-tutorial, ugc, vanipedia | 449,239
urd_Arab  | mkb, nios, pib, vanipedia                                    | 232,496
Total     |                                                              | 4,503,039

4.3 Filtering Existing Mined Parallel Corpora

Over the years, several parallel corpora have been released for Indic
languages (Kunchukuttan et al., 2018; Nakazawa et al., 2021b; Philip et al., 2021; Tiedemann, 2012)
inter alia. The corpora are of varying quality and created using different approaches. We filter these
existing corpora using some of the well-known practices to ensure we retain a high-quality subset for
model training. Particularly, a large collection of parallel corpora was mined as part of the NLLB
project (Costa-jussà et al., 2022) using LASER3 embeddings (Heffernan et al., 2022). The corpus was
mined using the margin-based threshold described in Equation (2), with a threshold of 1.06. The
original dataset was not released by the authors of Costa-jussà et al. (2022); however, Allen AI14 has replicated their mining effort and released a dataset that closely matches the reported statistics. Going forward, we use this dataset and refer to it as Allen-NLLB15. The corpus contains 448 million sentence pairs across 19 Indic languages, with more than 10 million sentence pairs in 12 of them. However, a manual inspection of the bitext revealed that a large fraction of the sentence pairs were misaligned or only loosely parallel. Therefore, before using this corpus for
training MT models, it is important to filter the corpus to remove the noisy sentence pairs. Following
our bitext mining in Section 4.1 and Section 4.2, we use the LaBSE model (Feng et al., 2022) with a
cosine similarity threshold of 0.80 to filter the Allen-NLLB corpus. We also use the LASER3 model
(Heffernan et al., 2022) as a fallback model for languages that are not supported by LaBSE (viz.
Nepali, Maithili, Sanskrit, Sindhi (Arabic), Kashmiri (Devanagari), Kashmiri (Arabic), Santali).

14 https://allenai.org/
15 https://huggingface.co/datasets/allenai/nllb

Table 8: Statistics of pre-filtering and post-filtering on existing mined parallel corpora consisting of NLLB (Costa-jussà et al., 2022) and Samanantar (Ramesh et al., 2022).

Language  | Pre-Filtering | Post-Filtering | Proportion (%)
asm_Beng  | 5,285,401     | 565,282        | 10.70
ben_Beng  | 70,400,333    | 16,514,684     | 23.46
guj_Gujr  | 14,458,054    | 8,442,476      | 58.39
hin_Deva  | 43,149,229    | 11,056,172     | 25.62
kan_Knda  | 38,368,723    | 10,532,571     | 27.45
kas_Arab  | 647,348       | 125,243        | 19.35
kas_Deva  | 1,042,450     | 194,528        | 18.66
mai_Deva  | 4,438,382     | 62,359         | 1.40
mal_Mlym  | 49,599,699    | 10,832,342     | 21.84
mar_Deva  | 35,585,104    | 7,742,065      | 21.76
mni_Beng  | 490,089       | 347,108        | 70.83
npi_Deva  | 19,624,054    | 1,583,922      | 8.07
ory_Orya  | 14,700,484    | 2,887,960      | 19.65
pan_Guru  | 14,057,042    | 3,391,710      | 24.13
san_Deva  | 3,095,396     | 244,367        | 7.89
snd_Arab  | 8,924,699     | 2,129,054      | 23.86
tam_Taml  | 47,777,362    | 10,489,852     | 21.96
tel_Telu  | 51,248,532    | 11,826,104     | 23.08
urd_Arab  | 25,303,579    | 5,322,290      | 21.03
Total     | 448,195,960   | 104,290,089    | 23.27

Table 8 shows that upon filtering, the
dataset is reduced from 448.1 million sentence pairs to 104.2 million sentence pairs, i.e., close to 76% of the data is dropped by quality filtering. For Santali, LASER3-based filtering removed the vast majority of sentence pairs, and post-hoc human evaluation confirmed that most of the Santali-English parallel data in Allen-NLLB is noisy. We
see the highest drops in Maithili, Sanskrit, and Nepali, which are considered to be low-resource
languages. Surprisingly, even in high-resource languages like Hindi and Bengali, we see that close to
75% of the data has been dropped during filtering. Similarly, we also apply the same filtering criteria
to Samanantar Corpus (Ramesh et al., 2022), as it was noted that Samanantar was mined with an
older version of the LaBSE model (Feng et al., 2022). Section 7.2 describes our analysis of the data quality vs. scale trade-off.
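The sketch below shows the core of this LaBSE-based filter under the stated 0.80 threshold; loading the model via sentence-transformers and the helper name filter_bitext are our own choices, and the batching, sharding, and LASER3 fallback of the actual pipeline are omitted.

```python
# Hedged sketch of the cosine-similarity quality filter (threshold 0.80).
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_bitext(src_sents, tgt_sents, threshold=0.80):
    """Keep only sentence pairs whose LaBSE cosine similarity is >= threshold."""
    src_emb = model.encode(src_sents, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, normalize_embeddings=True)
    sims = np.sum(src_emb * tgt_emb, axis=1)   # row-wise cosine (vectors are normalized)
    return [(s, t, float(c)) for s, t, c in zip(src_sents, tgt_sents, sims) if c >= threshold]
```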
5 Modeling

5.1 Training Data

To train our translation models, we utilize a range of data sources, including data mined from text corpora (monolingual corpora & comparable sources),
human-annotated collections (BPCC-H-Wiki and BPCC-H-Daily), and filtered versions of existing
corpora (Ramesh et al., 2022; Costa-jussà et al., 2022). We describe our filtering techniques in
Section 4.3. While these sources constitute the majority of our training corpus, we also incorporate
additional human-labeled seed data from NLLB-Seed (Costa-jussà et al., 2022; Maillard et al., 2023),
ILCI (Jha, 2010; Choudhary & Jha, 2011), and MASSIVE (FitzGerald et al., 2022), totaling
approximately 1.47 million sentence pairs. The ILCI (Jha, 2010; Choudhary & Jha, 2011) data is
primarily distributed across domains such as health, tourism, agriculture, and entertainment, and
contributes around 1.34 million parallel sentences across 16 languages. Furthermore, we augment
our data with the Indic portions of MASSIVE (FitzGerald et al., 2022), which was released as Spoken
Language Understanding data and closely resembles the data in BPCC-H-Daily. Professional
annotators manually translate the sentences in this dataset and contribute 139,000 sentence pairs
across seven languages. In total, we have approximately 230.5 million sentence pairs, out of which
2.2 million are gold sentence pairs that are manually annotated by professional translators. The
distribution of the data sources across all languages is presented in Table 1.

Table 9: Statistics of the bi-text training data after deduplication with benchmarks.

Language  | Dataset Size   Language  | Dataset Size
asm_Beng  | 1,443,125      mni_Beng  | 386,916
ben_Beng  | 32,725,076     mni_Mtei  | 42,753
brx_Deva  | 113,839        npi_Deva  | 1,687,436
doi_Deva  | 24,160         ory_Orya  | 5,834,074
gom_Deva  | 97,660         pan_Guru  | 9,816,009
guj_Gujr  | 20,491,094     san_Deva  | 278,374
hin_Deva  | 39,144,013     sat_Olck  | 25,128
kan_Knda  | 23,285,105     snd_Arab  | 2,128,391
kas_Arab  | 135,843        snd_Deva  | 10,503
kas_Deva  | 200,094        tam_Taml  | 20,740,179
mai_Deva  | 87,888         tel_Telu  | 23,250,217
mal_Mlym  | 23,521,937     urd_Arab  | 6,176,951
mar_Deva  | 18,932,834     Total     | 230,579,599

5.2 Preprocessing

We follow the steps below in sequential order for our data
preprocessing pipeline. Standard Preprocessing. We apply standard preprocessing, which includes
removing redundant spaces, removing special characters, and normalizing the punctuations.
Additionally, we convert the Indic numerals to English numerals using a dictionary-based mapping.
This facilitates the use of English numerals both at the input and output stages of our model.
However, a post-processing stage can be used to map English numerals back to their Indic
equivalents, if required.
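As an illustration of the numeral mapping, the sketch below converts Devanagari digits to ASCII digits; the actual dictionary covers the numeral ranges of all supported scripts.

```python
# Sketch of dictionary-based numeral mapping (Devanagari digits shown only).
DEVANAGARI_TO_ASCII = str.maketrans("०१२३४५६७८९", "0123456789")

def map_numerals(text: str) -> str:
    """Replace Devanagari digits with their ASCII equivalents."""
    return text.translate(DEVANAGARI_TO_ASCII)

# map_numerals("१२ मार्च २०२३")  ->  "12 मार्च 2023"
```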
Data Deduplication. To prevent any potential data leakages, we apply strict deduplication with all the available benchmarks mentioned in Table 2. Our deduplication process
involves standard preprocessing steps as mentioned above, followed by text lowercasing, removal of
all punctuations, removal of spaces, and identification of potential matches on the monolingual side
of both source and target sentences with the benchmarks. Correspondingly, any bi-text pairs
associated with these monolingual matches are discarded, and only the remaining data is considered
for training our models. As a result of this deduplication, our processed dataset contains a total of
~230.5M bi-text pairs. The per-language distribution is presented in Table 9.
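A minimal sketch of such a deduplication key is shown below; it handles ASCII punctuation only, whereas the actual pipeline also strips Indic punctuation marks.

```python
# Sketch of the match key used for benchmark deduplication: lowercase, strip
# punctuation and whitespace, so near-identical sentences collide on the same key.
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def dedup_key(sentence: str) -> str:
    return sentence.lower().translate(_PUNCT_TABLE).replace(" ", "")

# Any bitext pair whose source or target key matches a benchmark key is dropped:
# kept = [(s, t) for s, t in pairs
#         if dedup_key(s) not in bench_keys and dedup_key(t) not in bench_keys]
```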
Additional Preprocessing. Based on human evaluation of the IndicTrans1 model (Ramesh et al., 2022), it was observed that the
model exhibits poor performance in dealing with special cases: emails, URLs, dates, numbers, and
special characters like percentages. These special cases share a common characteristic indicating that
they should ideally not be translated by the model but should be reproduced as is in the
translation. To address this issue, we employ regular expression patterns to identify text spans
corresponding to these special cases. Subsequently, we wrap these spans of text with special tags on the input side of the model, thereby providing implicit supervision to the model to
retain these special cases in their original form in the translation. Note that, during training, we wrap
the text spans within special tags only if they appear in both the source and target sentences.
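A minimal sketch of this wrapping step follows; the regular expressions and the <dnt>/</dnt> tag strings are illustrative stand-ins, since the exact patterns and tag tokens are not reproduced in this rendering of the text.

```python
# Illustrative sketch: wrapping non-translatable spans with placeholder tags.
import re

PATTERNS = [
    r"[\w.+-]+@[\w-]+\.[\w.]+",            # emails
    r"https?://\S+|www\.\S+",              # URLs
    r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",  # simple numeric dates
    r"\d+(?:\.\d+)?%",                     # percentages
    r"\b\d+(?:\.\d+)?\b",                  # plain numbers
]
SPECIAL = re.compile("|".join(f"(?:{p})" for p in PATTERNS))

def wrap_special_spans(text, open_tag="<dnt>", close_tag="</dnt>"):
    """Surround matched spans with tags so the model learns to copy them verbatim."""
    return SPECIAL.sub(lambda m: f"{open_tag} {m.group(0)} {close_tag}", text)

# e.g. wrap_special_spans("Write to abc@example.com before 01/01/2024.")
```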
Script Unification. Many Indic languages use scripts from the Brahmi family. To facilitate better transfer
learning, wherever feasible, we apply rule-based script conversion using IndicNLP library
(Kunchukuttan, 2020) to represent most of these languages in a single script (Devanagari). Thus,
effectively our models are trained with five scripts, namely Perso-Arabic (Sindhi, Urdu, Kashmiri), Ol
Chiki (Santali), Meitei (Manipuri), Latin (English), and Devanagari (all the rest of the languages).
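A minimal sketch of this script conversion with the IndicNLP library is given below, assuming the library's two-letter language codes and that its resources are set up as per its documentation; the per-language routing of the full pipeline is omitted.

```python
# Sketch of rule-based script unification: map Brahmi-script text to Devanagari ("hi").
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

def to_devanagari(text: str, lang: str) -> str:
    """Convert text from a Brahmi-derived script (e.g. 'pa' for Gurmukhi, 'ta' for Tamil)
    into Devanagari before training; the inverse mapping is applied at postprocessing."""
    return UnicodeIndicTransliterator.transliterate(text, lang, "hi")

# e.g. to_devanagari("ਸਤ ਸ੍ਰੀ ਅਕਾਲ", "pa")  ->  the same string rendered in Devanagari
```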
5.3 Tokenization

Subword-level
tokenization (Sennrich et al., 2016b; Kudo & Richardson, 2018) is an effective approach for
segmenting text into smaller sub-word units to build neural machine translation (NMT) systems that
are robust against out-of-vocabulary (OOV) issues. In this work, we train two separate tokenizers with
the byte-pair-encoding (BPE) algorithm (Sennrich et al., 2016b) using SentencePiece16 library (Kudo
& Richardson, 2018) for English and Indic languages using a sampled corpus comprising monolingual
sentences from IndicCorp v2 (Doddapaneni et al., 2023) and NLLB data (Costa-jussà et al., 2022). We
chose SentencePiece library because of its in-built support for normalization. To ensure fair
representation for each language, we upsample the low-resource languages and limit the high-
resource languages to 3M sentences each. We use a vocab size of 32K and 128K for our English and
Indic SPM models, respectively. We prepare the monolingual data for training our English and Indic
SPM models using the preprocessing pipeline described in Section 5.2, except for the additional preprocessing. We also add the special tags to the trained SPM models. After tokenization, we
prepend special indicator tags following prior multilingual NMT models (Johnson et al., 2017; Tan et
al., 2019; Tang et al., 2021). In our case, we add both the source and target language tags to indicate
the translation direction. Specifically, when translating text from English to Hindi, we format the
sample as eng_Latn hin_Deva {processed text}.
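The sketch below illustrates the two pieces described here, training a BPE SentencePiece model and prepending direction tags, with placeholder file names and a Hindi-to-English example; the exact SentencePiece training options of the released models are not shown in this text.

```python
# Sketch: training a 128K BPE SentencePiece model for Indic text and formatting a
# Hindi -> English training sample with direction tags.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="indic_monolingual_sample.txt",  # hypothetical sampled IndicCorp v2 + NLLB file
    model_prefix="indic_bpe",
    vocab_size=128_000,                    # the English-side model uses 32K instead
    model_type="bpe",
    character_coverage=1.0,                # full character coverage for Indic scripts
)

sp = spm.SentencePieceProcessor(model_file="indic_bpe.model")
tokens = sp.encode("यह एक उदाहरण वाक्य है।", out_type=str)

# Source and target language tags are prepended to indicate the translation direction.
sample = " ".join(["hin_Deva", "eng_Latn"] + tokens)
```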
5.4 Architecture

We train our English-centric neural models based on the transformer encoder-decoder architecture (Vaswani et al., 2017) using the
fairseq library17 (Ott et al., 2019). Our architecture comprises 18 encoder layers and 18 decoder
layers, an input dimension of 1024, pre-normalization (Xiong et al., 2020) for all modules, a
feedforward dimension of 8192, and 16 attention heads. The total parameter count is 1.1B.
Additionally, we use the GELU activation (Hendrycks & Gimpel, 2016) instead of ReLU (Nair & Hinton,
2010).

5.5 Training

To perform well across a wide range of domains, we adopt FLORES-200 (Costa-
jussà et al., 2022) multi-domain development set as our validation set rather than combining
development sets from different benchmarks. However, this development set does not cover all the
languages supported by our models. As a result, we extend the FLORES-200 development (Costa-
jussà et al., 2022) set to additionally incorporate five more languages (viz. Bodo, Dogri, Konkani,
Sindhi (Devanagari), Manipuri (Meitei)) to have a complete validation set to jointly optimize and
achieve superior performance on all the 22 scheduled Indic languages (including 25 language script
combinations). We also make the expanded version of the FLORES-200 development set (Costa-jussà
et al., 2022) publicly available, and this has also been integrated into the official FLORES repository
18 . We employ the BLEU metric specifically for checkpointing purposes, using validation BLEU scores
to indicate the model’s performance on the aforementioned validation set. This choice is motivated
by BLEU providing valuable insights into the model’s macro-level performance, making it a useful
diagnostic tool for tracking the model’s progress during training. However, it may not be the most
suitable choice for fine-grained evaluations. This differs from IndicTrans1 (Ramesh et al., 2022),
which utilizes validation loss for checkpointing. By incorporating the checkpointing based on
validation BLEU scores, we can ensure that the training of our models progresses based on their
performance on the validation set, leading to an overall improved model. Our model training
paradigm comprises two distinct phases: auxiliary training and downstream training, which are
described below. Auxiliary Training. The first phase of our model training paradigm, termed auxiliary
training, involves training intermediate models to augment large amounts of monolingual corpora
through back translation.

16 https://github.com/google/sentencepiece
17 https://github.com/facebookresearch/fairseq
18 https://github.com/openlanguagedata/flores

Table 10: Details of the hyperparameters used for stage 1 training and stage 2 fine-tuning. Please note that we reset the learning scheduler, dataloaders, and optimizer for stage 2 fine-tuning.

Hyperparameters                         | Stage 1 training         | Stage 2 fine-tuning
Optimizer                               | Adam (Kingma & Ba, 2014) | Adam (Kingma & Ba, 2014)
Beta values (β1, β2)                    | (0.9, 0.98)              | (0.9, 0.98)
Learning rate                           | 5e-4                     | 3e-5
Scheduler                               | Inverse sqrt             | Inverse sqrt
Criterion                               | Cross-entropy            | Cross-entropy
Label smoothing (Szegedy et al., 2016)  | 0.1                      | 0.1
Warmup learning rate                    | 1e-7                     | 1e-7
Warmup steps                            | 4,000                    | 2,000
Gradient clipping                       | 1.0                      | 1.0
Dropout (Srivastava et al., 2014)       | 0.2                      | 0.2
Patience                                | 10                       | 10
Effective batch size                    | 262K                     | 32K
Mixed precision training                | FP16                     | FP16
Maximum update steps                    | 1M                       | 1M
Validation interval                     | 2,500                    | 1,000
Maximum sequence length                 | 256                      | 256
Checkpoint metric                       | BLEU @ beam = 1          | BLEU @ beam = 1

Back-translation (Sennrich et al., 2016a; Edunov et al., 2018) is
a technique that is effective in improving the performance of machine translation models. We adopt
a deterministic curriculum strategy as proposed by Mohiuddin et al. (2022), wherein we first train
the models from scratch on the entire parallel corpora listed in Table 1, followed by stage 2 fine-
tuning on high-quality seed data including BPCC-H-Wiki and the NLLB seed (Costa-jussà et al., 2022;
Maillard et al., 2023), to improve the models further. Our approach differs from theirs in that we
exclusively consider high-quality human-generated data for stage 2 model fine-tuning rather than
selecting the top p% of bitext pairs from the original data based on a quality measure. Another
prominent advantage of using our human-generated data is that it provides multi-domain coverage,
thereby allowing us to optimize across multiple domains, which may not be feasible when selecting a
subset of bitext pairs based on quality. We list all the hyperparameters used in both stage 1 and stage
2 training in Table 10. Downstream Training. In the second phase, we train our models on the
augmented parallel corpora that combine original data with back-translated data. Mainly, we follow
tagged back translation (Caswell et al., 2019) to provide additional supervision to the model to
distinguish between the different data sources during training. We prepend the special symbol to the
synthetically augmented data while keeping the original data intact. We follow the same training
hyperparameters and two-stage training strategy as the auxiliary training. Table 10 shows all the
hyperparameters used in both stage 1 and stage 2 training.
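The sketch below illustrates tagged back-translation as used here; the <BT> token and its position relative to the language tags are our own placeholders, since the exact symbol is not reproduced in this rendering of the text.

```python
# Sketch of tagged back-translation (Caswell et al., 2019): synthetic source sides get a
# special marker so the model can distinguish them from original bitext.
def format_pair(src_tokens, src_lang, tgt_lang, is_back_translated):
    prefix = ["<BT>"] if is_back_translated else []   # "<BT>" is a hypothetical tag string
    return " ".join(prefix + [src_lang, tgt_lang] + src_tokens)

# Original bitext:        format_pair(toks, "eng_Latn", "hin_Deva", False)
# Synthetic (BT) bitext:  format_pair(toks, "eng_Latn", "hin_Deva", True)
```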
5.6 Data Augmentation

Using existing parallel corpora as training data may eventually lead to saturation in model performance. To address
this, researchers have proposed data augmentation techniques to enhance data diversity and
improve model performance. One such approach involves augmenting pseudo-parallel corpora by
leveraging diverse monolingual corpora. Back translation (Sennrich et al., 2016a; Edunov et al., 2018)
is a widely used technique to synthetically augment training data for improving translation models.
Given the large scale of our models, we adopt this approach and generate back-translated data, which
is approximately 1.75 times the size of the original training data. To generate back translation data,
we first identify potential sources of monolingual data for English and Indic languages, intending to
maximize both domain coverage and distributional diversity to improve the models. We use the
intermediate checkpoints of IndicTrans2 to generate the back-translated data and combine the
augmented data along with the training data to further improve our models.

Table 11: Statistics of the monolingual data used for back-translation.

Language  | English BT Data | Indic BT Data    Language  | English BT Data | Indic BT Data
asm_Beng  | 14,569,760      | 5,433,796        mni_Beng  | 17,437,961      | 60,224
ben_Beng  | 17,928,856      | 34,987,743       mni_Mtei  | 17,709,470      | 33,233
brx_Deva  | 17,597,825      | 144,246          npi_Deva  | 20,567,992      | 29,997,511
doi_Deva  | 18,157,864      | 44,291           ory_Orya  | 19,528,727      | 15,341,924
gom_Deva  | 13,478,802      | 2,937,179        pan_Guru  | 17,476,704      | 29,968,101
guj_Gujr  | 21,447,703      | 29,994,809       san_Deva  | 11,198,794      | 9,744,059
hin_Deva  | 20,648,256      | 37,472,261       sat_Olck  | 9,799,342       | 32,346
kan_Knda  | 10,970,576      | 32,496,971       snd_Arab  | 8,918,509       | 4,298,898
kas_Arab  | 12,717,571      | 44,276           snd_Deva  | 6,479,694       | 25,264
kas_Deva  | 11,599,085      | 154,465          tam_Taml  | 22,647,544      | 32,488,783
mai_Deva  | 15,598,363      | 1,813,669        tel_Telu  | 21,767,767      | 32,494,937
mal_Mlym  | 17,888,824      | 32,495,047       urd_Arab  | 20,006,656      | 33,471,969
mar_Deva  | 15,849,536      | 34,994,281       Total     | 401,992,181     | 400,970,283

English Data for Back Translation.
For back translation, we source English data from several sources, including the English side of
IndicCorp v2 (Doddapaneni et al., 2023), the English side of the Indic subset of the NLLB data (Costa-
jussà et al., 2022), and English data from a few high-resource pairs (eng_Latn - {fra_Latn, por_Latn,
spa_Latn, ces_Latn}) of NLLB data (Costa-jussà et al., 2022), along with additional miscellaneous
sources like Simple Wikipedia19 and DD News.20 We subjected this set of English sentences to
standard preprocessing, as outlined in Section 5.2, and then filtered the set to retain only sentences
with a minimum of five and a maximum of 100 words. As described in Section 5.2, we deduplicate
this set of sentences with all the benchmarks available. Additionally, we deduplicate this set with the
training data to ensure more diversity in English data and sample candidate sentences from a non-
overlapping set. From this reduced candidate set, we randomly sampled approximately 400 million
sentences for back translation, following an approximate distribution of 55% IndicCorp, 20% NLLB
Indic, 20% NLLB HighRes, and 5% Miscellaneous sources. To ensure language-script diversity, we
randomly subdivide the 400 million set into 25 parts, corresponding to the supported language-script
combinations. We utilize the En-Indic model with a beam value of 5 to generate back-translated data.
We proportionally distribute the English data across the different language-script combinations based on the normalized chrF++ (Popović, 2017) scores, computed on the expanded version of the FLORES-200 validation set (Goyal et al., 2022; Costa-jussà et al., 2022) described in Section 5.5, as given in Equation (3). Table 11 describes the distribution of the English data we consider for back-translation for each language-script combination.

Count(lang_i) = \frac{\text{chrF++}(lang_i)}{\sum_j \text{chrF++}(lang_j)} \times N    (3)

Here, chrF++(lang_i) represents the normalized chrF++ score for language-script combination lang_i, and N is the total number of English monolingual sentences to
be used for back translation.
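A small sketch of this allocation rule (Equation 3) follows; the chrF++ values in the example are made up, not the validation scores actually used.

```python
# Sketch of Equation (3): allocate English back-translation sentences to each
# language-script combination in proportion to its validation chrF++ score.
def allocate_bt_sentences(chrf_scores, total_sentences):
    """chrf_scores: dict mapping language-script code -> chrF++ on the dev set."""
    denom = sum(chrf_scores.values())
    return {lang: round(total_sentences * score / denom) for lang, score in chrf_scores.items()}

# Example with placeholder scores and the ~400M English pool described above:
allocation = allocate_bt_sentences({"hin_Deva": 59.6, "asm_Beng": 43.3, "sat_Olck": 28.4},
                                   400_000_000)
```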
Indic Data for Back Translation. We source the Indic monolingual data from IndicCorp v2 (Doddapaneni et al., 2023) and the Indic side of the NLLB data (Costa-jussà et al.,
2022) to generate back-translated data to improve our En-Indic model. However, it is essential to
note that our sources for Indic monolingual data are limited, which limits the amount of data we can
sample from each language-script combination. As a result, we do not adopt any proportional
sampling based on the model’s performance on the FLORES-200 validation set, as we do when
generating back-translated data from monolingual English data. Therefore, we follow a simple
strategy to include all the available monolingual data from languages, where the availability of
diverse monolingual data is scarce (less than 20 million
sentences) and uniformly sample from the
high-resource languages. We apply the same preprocessing and data deduplication steps as
described above for back-translation from English. We use the Indic-En model with a beam value of 5
for generating back-translation data. We provide the details of the Indic monolingual data
distribution used for back translation in Table 11.

19 https://simple.wikipedia.org/wiki/Main_Page
20 https://ddnews.gov.in/

5.7 Postprocessing

Since our En-Indic model is
trained on script-unified data, the output it generates must be mapped back to the native script of
the target language. Therefore, we perform rule-based script conversion using the IndicNLP library
(Kunchukuttan, 2020) and map the script-unified output to the corresponding native Indic script.
Importantly, this postprocessing is only necessary for the En-Indic model, as the outputs of the Indic-
En model are already in the desired format.

6 Evaluation

6.1 Models Compared

We compare our trained models with publicly and commercially available existing models and systems:

• IndicTrans1. Ramesh et al. (2022) curated large parallel corpora by large-scale mining and trained multilingual transformer models (474M parameters) on this mined Samanantar dataset. These models support only 11 major Indian languages.

• NLLB. Costa-jussà et al. (2022) trained a multi-way many-to-many 54.5B Mixture of Experts (MoE) model supporting 200 languages. This model supports 20 language-script combinations from the set of scheduled Indic languages, providing coverage in at least one script for 19 of the 22 scheduled Indic languages.

• M2M-100. Fan et al. (2020) released many-to-many models supporting translation between 100 languages with language-family-specific decoders trained using English-centric data and non-English-centric data. We use their best model (12B parameters), supporting 12 of the 22 scheduled Indic languages, for our comparison.

• Microsoft Azure Translate.21 Microsoft Azure Translate is a commercial translation engine supporting translation between 16 out of the 22 scheduled Indic languages at the time of writing.

• Google Translate.22 Google Translate is a commercial translation engine supporting translation between 19 out of the 22 scheduled Indic languages at the time of writing.

• GPT-3.5. GPT-3.5 is a commercially available large language model developed by OpenAI,23 based on the GPT-3 architecture (Brown et al., 2020), but with additional improvements and optimizations like instruction fine-tuning, reinforcement learning with human feedback (Ouyang et al., 2022), and enhanced conversational support. It is a decoder-only model trained using the causal language modeling objective and is currently available as a proprietary system accessible via a paid API. We evaluate the gpt-3.5-turbo model, which accepts chat-format messages, on our IN22 benchmark in a zero-shot setting.

For proprietary models, it is difficult to make fair comparisons since little information is available about the models and their training. Thus, the reported results should be seen as a reasonable approximation. In this
work, we will henceforth adopt the specific shorthand notations: the IndicTrans1 model will be
referred to as IT1, the M2M-100 model as M100, the NLLB 1.2B distilled model as N1.2, the NLLB
54.5B MoE model as N54, Google Translate as Goog, Microsoft Azure Translate as Az, and our
IndicTrans2 model as IT2. The predictions of Microsoft Azure, Google Translate, and GPT-3.5 were
generated using the respective APIs, with data retrieved on 10th May 2023.
21 https://azure.microsoft.com/en-us/products/cognitive-services/translator
22 https://cloud.google.com/translate
23 https://platform.openai.com/docs/models/gpt-3-5

6.2 Benchmarks

We evaluate our
trained models (auxiliary and downstream) on our IN22 benchmark and all the publicly available
benchmarks: FLORES-200 (Goyal et al., 2022; Costa-jussà et al., 2022), WAT 2020 (Nakazawa et al.,
2020), WAT 2021 (Nakazawa et al., 2021a), WMT 2014 (Bojar et al., 2014), WMT 2019 (Barrault et al.,
2019), WMT 2020 (Barrault et al., 2020), UFAL (Ramasamy et al., 2012) and NTREX (Federmann et al.,
2022). We list the details of the existing benchmarks below.

• IN22 is a comprehensive benchmark for evaluating machine translation performance in multi-domain, n-way parallel contexts across 22 Indic languages. It comprises three distinct subsets, namely IN22-Wiki, IN22-Web, and IN22-Conv. The Wikipedia and Web sources subsets offer diverse content spanning news, entertainment, culture, legal, and India-centric topics. Meanwhile, the conversation domain subset is designed to assess translation quality in typical day-to-day conversational-style applications. From now on, we merge the Wikipedia and Web Sources subsets to create a consolidated set referred to as IN22-Gen for translation evaluation. Our motivation for this is that these two subsets share a common language style, albeit with varying topics, whereas the Conversation subset is different in both language style and usage context.

• FLORES-101/200 (Goyal et al., 2022; Costa-jussà et al., 2022) is a multi-domain general-purpose benchmark designed for evaluating translations across 200 languages, including 19 Indic languages. The English sentences are source-original and have been translated into other languages. It comprises sentences sourced from Wikimedia entities with equal portions of news, travel, and non-fiction content from children's books. Tables 2 and 54 provide further details on the statistics and fine-grained domain coverage.

• NTREX (Federmann et al., 2022) is a news-domain benchmark that expands coverage of languages of test data from WMT 2019 (Barrault et al., 2019) to 128 languages. Out of these, 13 are scheduled Indic languages.

• WMT has created benchmarks for selected Indic languages as part of shared tasks in 2014 (Hindi) (Bojar et al., 2014), 2019 (Gujarati) (Barrault et al., 2019) and 2020 (Tamil) (Barrault et al., 2020).

• WAT 2020/2021 (Nakazawa et al., 2020; 2021a) included support for translations for 8 Indic languages in the news domain. In addition, they released data for Hindi-English in Information Technology and WikiNews domains. WAT 2021 (Nakazawa et al., 2021a) created a benchmark for translation between 10 Indic languages and English.

• UFAL (Ramasamy et al., 2012) is an English-Tamil bilingual benchmark created from publicly available websites. The benchmark consists of English sentences from domains such as cinema, news, and some biblical sources.

Moving forward, we consider IN22 and FLORES-200 (Costa-
jussà et al., 2022) as the primary benchmarks to evaluate all the translation models. The results
obtained from these benchmarks are reported and discussed in Section 7. Additionally, the
performance of the models on other benchmarks is presented in Appendix B. Note that almost all
the test sets are English-original, but have been used for Indic-to-English evaluation as well as Indic-
Indic evaluation.

6.3 Metrics

Several metrics have been developed over the years for automatically assessing translation quality, including string-based metrics such as BLEU (Papineni et al., 2002), chrF (Popović, 2015), and chrF++ (Popović, 2017), and model-based metrics such as BLEURT (Sellam et al.,
2020), COMET (Rei et al., 2020; 2022) and PRISM (Thompson & Post, 2020). Recent research (Kocmi
et al., 2021; Freitag et al., 2021; 2022) has shown that model-based metrics tend to exhibit a
stronger correlation with human judgment. However, these model-based metrics are limited to
languages represented in the underlying pre-trained model. They are trained on human judgment
data from a few languages, and their performance on many low-resource languages has not been
evaluated. We briefly describe all the metrics used in our work below.

BLEU. BLEU (Papineni et al., 2002) has been a standard and
widely used metric for evaluating machine translation quality. However, a significant limitation of the
standard BLEU metric is its tokenization dependency. To overcome this, sacreBLEU24 (Post, 2018)
provides standardization in terms of tokenization to ensure a fair comparison. We use sacreBLEU for
evaluating our En-Indic and Indic-En trained models. We use the in-built default mteval-v13a
tokenizer25 for Indic-En26 and Indic tokenizer from IndicNLP (Kunchukuttan, 2020) for En-Indic27
evaluations. Therefore, we first tokenize the machine translations and reference translations using
Indic tokenizers from IndicNLP28 (version 0.92) and Urduhack29 (Ali, 2019) libraries before running
sacreBLEU.

chrF++. chrF++ (Popović, 2017) is an extension of the chrF metric (Popović, 2015) that additionally considers word unigrams and bigrams and is better correlated with human judgments; we use sacreBLEU to compute chrF++ scores. Similar to the tokenizers used for BLEU, for Indic-En30
evaluation, we use the in-built default mteval-v13a tokenizer, while for En-Indic31 evaluation, we use
Indic tokenizers from IndicNLP and Urduhack libraries to tokenize the machine translations and
reference translations before running sacreBLEU. COMET. COMET is a model-based machine
translation evaluation metric introduced by Rei et al. (2020) to address some of the limitations of
existing metrics such as BLEU. However, one of the prominent concerns about COMET is its
extensibility to low-resource languages. Therefore, in this study, we report COMET-DA scores for the
top 13 Indian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali,
Odia, Punjabi, Tamil, Telugu, and Urdu that are supported by the XLM-RoBERTa (Conneau et al.,
2020) model. Specifically, we conduct a reference-based evaluation using the COMET-22 DA model32
(Rei et al., 2022). Choosing the Primary Metric. COMET, the most recommended model-based metric
(Kocmi et al., 2021), does not support all the 22 Indic languages since they are not represented in
XLM-R (Conneau et al., 2020) which is the underlying model on which COMET is based. Conversely,
BLEU has several significant limitations, including its tokenization dependency and preferential bias
towards translations that are closer to the reference translations in terms of lexical choice and word order
(Ananthakrishnan et al., 2006). Particularly in the context of morphologically rich Indian languages,
BLEU is limited in addressing morphological variants since it relies on exact word matches.
Furthermore, chrF++ is more suitable for evaluating translation quality in languages with complex
morphology and inflections, such as Indian languages. In this work, we therefore rely on chrF++ as our primary metric for evaluating translation quality. We also report additional metrics
such as BLEU (Papineni et al., 2002) and COMET (Rei et al., 2022). In addition, we also perform paired
bootstrap resampling-based statistical significance tests (Koehn, 2004) for all the metrics following
the default configurations.
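The sketch below shows how such scores can be computed with sacreBLEU's Python API; the sentences are toy examples, and the Indic-side pre-tokenization described above is assumed to have been applied already.

```python
# Sketch of metric computation with the sacreBLEU Python API.
import sacrebleu

hyps = ["यह एक परीक्षण वाक्य है।"]
refs = [["यह एक परीक्षा वाक्य है।"]]   # one reference stream, aligned with hyps

chrfpp = sacrebleu.corpus_chrf(hyps, refs, word_order=2)   # chrF++ (word n-grams up to 2)
bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="none")  # tok:none on pre-tokenized text
print(chrfpp.score, bleu.score)
```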
6.4 Generation

To generate predictions using IndicTrans2, initially, we preprocess and tokenize the source sentences from the benchmark test set, following the steps
described in Section 5.2 and Section 5.3, respectively. Subsequently, we feed the tokenized sentences
into the trained models as input to generate candidate translations. We utilize beam search with a
beam value of 5 for our trained models. Finally, we employ post-processing techniques, as detailed in
Section 5.7, to map the script unified output to the corresponding native script. For other baseline
systems, we follow their documented inference procedure. For all the open-source baseline models,
we use the same beam size of 5.
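A hedged sketch of this generation step using fairseq's hub interface follows; the checkpoint and data paths and the pre-tokenized input string are placeholders rather than the released configuration.

```python
# Hedged sketch of beam-search generation with the fairseq hub interface.
from fairseq.models.transformer import TransformerModel

model = TransformerModel.from_pretrained(
    "checkpoints/en-indic",                 # hypothetical checkpoint directory
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin/en-indic",  # binarized data with the joint dictionaries
)
model.eval()

# The input is assumed to be preprocessed, SPM-tokenized, and prefixed with direction tags
# (Sections 5.2 and 5.3); the output is then detokenized and script-converted (Section 5.7).
hypothesis = model.translate("eng_Latn hin_Deva ▁This ▁is ▁a ▁test .", beam=5)
```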
24 https://github.com/mjpost/sacrebleu
25 https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl
26 Indic-En sacreBLEU BLEU signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1
27 En-Indic sacreBLEU BLEU signature: nrefs:1|case:mixed|eff:no|tok:none|smooth:exp|version:2.3.1
28 https://github.com/anoopkunchukuttan/indic_nlp_library
29 https://github.com/urduhack/urduhack
30 Indic-En sacreBLEU chrF++ signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.3.1
31 En-Indic sacreBLEU chrF++ signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.3.1
32 https://huggingface.co/Unbabel/wmt22-comet-da
Table 12: chrF++ scores of all the systems on the IN22-Gen Evaluation set in the En-Indic and Indic-En directions. The best-performing system is bolded, while underlined results indicate significant performance difference where IT2 outperforms the system. The row Avg. means the average score of all the languages that system X supports. ∆ represents the difference between the average scores of IT2 and the average scores of system X for the subset of languages that both X and IT2 support. A positive value for ∆ indicates IT2 is better than X and vice-versa. † indicates completely off-target translations.

          |                 En-Indic                 |                 Indic-En
Language  | IT1   M100  N1.2  N54   IT2   Goog  Az   | IT1   M100  N1.2  N54   IT2   Goog  Az
asm_Beng  | 35.9  -     41.7  42.9  47.1  45.5  45.0 | 56.1  -     63.1  66.5  65.8  65.1  60.8
ben_Beng  | 48.6  40.6  47.8  49.2  51.8  49.9  49.8 | 58.4  52.8  60.8  63.5  63.2  64.1  60.2
brx_Deva  | -     -     -     -     47.8  -     -    | -     -     -     -     62.1  -     -
doi_Deva  | -     -     -     -     57.8  47.8  -    | -     -     -     -     72.6  67.3  -
gom_Deva  | -     -     -     -     45.2  41.4  41.1 | -     -     -     -     59.2  57.8  51.1
guj_Gujr  | 47.2  19.9  48.3  49.5  53.5  52.2  50.8 | 60.3  11.8  63.9  66.3  66.5  66.5  62.4
hin_Deva  | 53.3  47.1  52.8  53.9  56.7  54.6  54.1 | 60.7  54.9  62.2  64.8  65.4  64.8  62.0
kan_Knda  | 46.7  15.3  47.3  48.6  51.0  48.1  49.4 | 58.8  12.6  62.4  65.1  64.2  64.5  61.7
kas_Arab  | -     -     34.6  35.4  40.2  -     -    | -     -     54.9  58.2  60.4  -     -
mai_Deva  | -     -     44.9  44.7  48.7  38.3  45.2 | -     -     62.1  65.1  64.8  64.0  61.0
mal_Mlym  | 45.7  31.2  45.4  46.7  50.9  49.0  48.6 | 56.9  44.8  59.8  62.8  64.5  62.7  60.4
mar_Deva  | 44.3  34.5  44.7  46.1  51.0  47.1  48.2 | 57.7  46.9  60.9  63.6  63.7  64.4  60.3
mni_Mtei  | -     -     -     -     44.6  35.0  -    | -     -     -     -     57.9  50.7  -
npi_Deva  | -     17.7  44.8  44.8  49.0  45.5  46.3 | -     40.1  65.0  68.0  67.7  69.0  63.8
ory_Orya  | 40.3  8.2   42.4  41.5  43.9  40.5  45.4 | 60.0  14.4  63.7  66.7  66.2  64.6  61.1
pan_Guru  | 48.0  25.0  48.5  49.5  50.6  52.7  50.4 | 57.2  38.2  60.4  63.1  63.4  62.7  58.5
san_Deva  | -     -     25.5  28.1  38.8  32.0  -    | -     -     48.2  51.3  54.8  53.8  -
sat_Olck  | -     -     1.0†  25.5  33.4  -     -    | -     -     36.3  41.4  45.3  -     -
snd_Deva  | -     -     -     -     36.6  -     -    | -     -     -     -     57.3  -     -
tam_Taml  | 45.5  12.3  47.0  47.5  49.5  48.5  49.4 | 53.9  26.3  56.9  59.1  59.8  59.6  56.8
tel_Telu  | 46.5  -     48.1  49.5  52.4  50.8  50.6 | 57.7  -     61.3  64.4  64.8  64.6  61.2
urd_Arab  | -     45.0  62.1  63.7  68.2  63.9  69.0 | -     52.6  68.3  71.2  73.0  71.8  68.2
Avg.      | 45.6  27.0  42.8  45.1  48.6  46.8  49.6 | 58.0  35.9  59.4  62.4  63.1  63.2  60.6
∆         | 5.2   25.4  6.4   4.1   -     4.2   1.7  | 6.3   29.3  3.7   0.7   -     1.1   4.2
6.5 Evaluation

Following the generation of candidate translations, we evaluate their quality using the
automatic metrics mentioned in Section 6.3. We apply standard processing techniques to compute
the evaluation metrics, followed by running sacreBLEU. We use the standard Moses tokenizer for
English, while for Indic languages, we perform tokenization using IndicNLP and Urduhack libraries.
We release our evaluation procedure and scripts to ensure reproducibility. We follow the same
evaluation procedure for all systems listed in Section 6.1.

7 Results and Discussion

7.1 Comparison with Existing Systems

Evaluation on IN22-Gen Set. We evaluate the translation quality of multiple En-
Indic and Indic-En MT models on the IN22-Gen set. The results are presented in Table 12. We
observe that IndicTrans2 significantly improves translation quality over IndicTrans1 (Ramesh et al.,
2022) with an average improvement of 5.2 points in the En-Indic direction and 6.3 points
improvement in the Indic-En direction. The proposed model outperforms the best commercial and
open-source models for En-Indic translation by 1.7 and 4.1 points, respectively. For Indic-En
translation, the IndicTrans2 is comparable to existing models, with a delta of +0.7 and +1.1 for best
open-source and commercial models, respectively. The results further highlight the substantial
improvements made on low-resource languages such as Dogri (+10), Konkani (+3.8), Kashmiri
(+4.8), Maithili (+3.8), Manipuri (+9.6) for En-Indic and Dogri (+5.3), Manipuri (+7.2), Santali (+3.9) for
Indic-En translations when compared to the next best model. The observed gains can be attributed
to using high-quality human-annotated BPCC-H Wiki data for training MT models. These findings
suggest that the proposed model is well-suited for adoption in the Indian subcontinent, aligning with
the objective of building models suitable for Indian languages. Additionally, we also report the
COMET (Rei et al., 2022) and BLEU (Papineni et al., 2002) scores for our models in Table 41 and Table
44 (in Appendix B) where we observe similar trends, indicating that the observations are robust
across different metrics.

Table 13: chrF++ scores of all the systems on the FLORES-200 devtest set in the En-Indic and Indic-En directions. The best-performing system is bolded, while underlined results indicate significant performance difference where IT2 outperforms the system. Avg. means the average score of all the languages that system X supports. ∆ represents the difference between the average scores of IT2 and the average scores of system X for the subset of languages that both X and IT2 support. A positive value for ∆ indicates IT2 is better than X and vice-versa. † indicates completely off-target translations.

          |                 En-Indic                 |                 Indic-En
Language  | IT1   M100  N1.2  N54   IT2   Goog  Az   | IT1   M100  N1.2  N54   IT2   Goog  Az
asm_Beng  | 33.5  -     38.6  39.0  43.3  40.9  42.8 | 48.1  -     55.3  57.8  56.9  57.7  53.4
ben_Beng  | 49.5  44.3  50.1  52.2  54.3  53.8  53.4 | 56.9  54.7  60.3  62.2  62.4  63.2  59.9
guj_Gujr  | 50.4  21.9  52.0  53.6  56.0  55.5  55.6 | 58.7  12.1  65.2  66.6  67.0  68.0  62.9
hin_Deva  | 56.6  53.2  56.5  58.2  59.6  60.2  59.6 | 61.3  60.0  65.0  66.5  67.5  68.0  65.3
kan_Knda  | 50.9  16.5  53.0  54.3  56.1  56.2  56.1 | 54.6  12.0  59.5  61.0  61.5  62.1  58.6
kas_Arab  | -     -     37.2  38.0  39.7  -     -    | -     -     57.8  60.2  59.7  -     -
kas_Deva  | -     -     18.7  18.8  19.2  -     -    | -     -     47.7  50.6  48.3  -     -
mai_Deva  | -     -     46.1  47.5  50.5  41.4  51.0 | -     -     66.6  68.3  69.5  68.8  65.2
mal_Mlym  | 49.8  37.8  49.2  52.6  57.3  57.3  56.8 | 57.2  51.7  61.8  62.9  64.3  64.5  61.3
mar_Deva  | 45.9  38.6  46.5  48.3  51.3  51.4  49.4 | 56.4  50.4  61.6  63.8  64.3  65.3  61.5
mni_Beng  | -     -     37.1  42.1  38.2  -     -    | -     -     50.5  50.7  52.9  -     -
npi_Deva  | -     15.5  49.2  46.4  57.2  55.7  53.4 | -     41.1  65.2  66.9  68.1  68.7  63.9
ory_Orya  | 44.2  8.5   47.6  47.0  49.2  53.9  50.2 | 55.5  14.3  61.8  64.4  64.9  64.3  60.5
pan_Guru  | 50.6  26.8  50.9  51.3  53.5  54.3  54.2 | 60.0  44.5  64.5  66.3  66.4  67.1  62.7
san_Deva  | -     -     25.8  27.1  31.6  31.3  -    | -     -     47.8  50.7  51.6  51.2  -
sat_Olck  | -     -     0.9†  27.0  28.4  -     -    | -     -     38.7  44.3  39.3  -     -
snd_Arab  | -     28.6  48.9  49.6  44.9  50.4  51.1 | -     19.6  64.0  66.3  65.1  66.6  59.8
tam_Taml  | 49.5  13.2  53.3  54.0  57.2  56.0  56.1 | 54.1  33.0  58.9  60.8  61.3  61.5  57.9
tel_Telu  | 52.6  -     55.0  56.5  59.4  59.0  57.5 | 58.2  -     63.4  65.5  66.1  66.7  63.4
urd_Arab  | -     39.9  49.4  50.3  52.2  51.3  51.6 | -     48.8  60.9  62.9  62.0  63.7  59.3
Avg.      | 48.5  28.7  43.3  45.7  48.0  51.8  53.3 | 56.5  36.9  58.8  60.9  61.0  64.2  61.0
∆         | 5.8   25.4  4.7   2.3   -     0.3   0.2  | 7.4   27.7  2.2   0.1   -     -0.5  3.5

Evaluation on FLORES-200. We also evaluate the MT models on the FLORES-
200 benchmark (Costa-jussà et al., 2022). Through this evaluation, we aim to assess the model’s
translation quality on more general content, complementing the evaluation on our IN22 test set
which is India-centric. Therefore, by evaluating our models on both IN22 and FLORES-200, we can
effectively gauge the model’s translation quality in different settings. The results in Table 13 obtained
from the FLORES-200 test set show a similar trend as IN22, with IndicTrans2 being the best open-
source model performing competitively with commercial models. The results also show a significant
improvement from IndicTrans1 to IndicTrans2, with +5.8 and +7.4 points improvement in En-Indic
and Indic-En translations, respectively. We also report the COMET and BLEU scores for the FLORES-
200 benchmark in Table 43 and Table 46 (in Appendix B). Evaluation on IN22-Conv Set. While both
the IN22-Gen Set and FLORES-200 (Costa-jussà et al., 2022) focus on written sentences, the real-
world usage of MT is often task-oriented and involves conversational language. To address this, all
the models are further evaluated on the IN22-Conv Set, which is designed to test the translation
quality of MT models on conversational language and daily use scenarios. The results of all the
models on the IN22-Conv Set are 26 Published in Transactions on Machine Learning Research
(12/2023) Table 14: chrF++ scores of all the systems on the IN22-Conv Evaluation set in the En-Indic
and Indic-En directions. The best performing system is bolded, while underlined results indicate
significant performance difference where IT2 outperforms the system. Avg. means the average score
of all the languages that system X supports. ∆ represents the difference between the average scores
of IT2 and the average scores of system X for the subset of languages that both X and IT2 support. A
positive value for ∆ indicates IT2 is better than X and vice-versa. † indicates completely off-target
translations. En-Indic Indic-En language IT1 M100 N1.2 N54 IT2 Goog Az IT1 M100 N1.2 N54 IT2 Goog
Az asm_Beng 36.4 - 42.6 43.4 46.8 43.6 46.6 52.5 - 58.7 59.8 62.9 64.0 62.1 ben_Beng 47.5 39.7 47.1
48.5 49.7 48.9 48.8 55.2 48.1 55.4 57.0 58.4 59.6 58.3 brx_Deva - - - - 45.3 - - - - - - 56.3 - - doi_Deva -
- - - 53.9 40.1 - - - - - 65.0 62.9 - gom_Deva - - - - 42.5 40.3 38.7 - - - - 51.7 51.6 46.1 guj_Gujr 49.1
21.0 48.7 49.8 53.1 51.9 51.8 56.9 6.5 60.8 61.4 62.0 62.2 61.1 hin_Deva 48.6 42.7 47.6 48.3 49.6
50.6 48.7 57.4 50.6 58.7 59.7 60.1 60.0 59.3 kan_Knda 32.6 13.7 32.2 33.3 33.8 33.1 33.5 44.0 7.2
45.3 46.2 47.5 48.0 48.1 kas_Arab - - 25.7 27.1 35.6 - - - - 44.6 45.2 52.6 - - mai_Deva - - 41.6 41.0
44.3 35.6 38.2 - - 55.2 56.7 57.8 59.1 55.8 mal_Mlym 43.8 32.0 40.9 40.8 45.7 45.2 44.9 50.6 38.8
51.0 52.6 54.3 54.6 54.4 mar_Deva 43.7 33.9 44.8 47.3 48.6 46.6 46.3 54.2 40.4 56.2 57.5 58.5 59.4
58.3 mni_Mtei - - - - 40.2 31.2 - - - - - 52.5 46.3 - npi_Deva - 15.3 44.9 44.3 51.5 46.1 46.4 - 21.0 59.9
60.6 63.0 63.9 62.0 ory_Orya 38.9 7.6 41.3 40.9 40.2 37.7 42.1 55.6 11.5 59.3 59.8 60.3 59.0 58.7
pan_Guru 54.0 25.4 54.3 55.5 57.8 61.1 56.8 58.1 32.4 60.1 61.4 62.7 61.1 61.1 san_Deva - - 26.4
30.3 35.5 32.8 - - - 38.9 40.2 48.3 49.2 - sat_Olck - - 0.8 18.0 34.6 - - - - 33.6 37.4 43.5 - - snd_Deva - -
- - 30.3 - - - - - - 49.6 - - tam_Taml 37.7 19.2 37.2 37.1 39.1 38.7 39.1 44.1 22.5 45.7 46.8 45.8 46.8
46.4 tel_Telu 42.5 - 39.9 40.5 45.5 44.6 44.9 48.5 - 51.3 53.3 52.9 53.9 53.6 urd_Arab - 42.5 55.9 55.5
61.6 60.6 59.6 - 47.9 61.5 62.3 65.5 65.3 64.9 Avg. 43.2 26.6 39.5 41.3 44.8 43.8 45.8 52.5 29.7 52.7
54.0 56.0 57.1 56.7 ∆ 3.2 21.6 5.7 3.9 - 2.8 1.5 4.4 28.3 3.3 2.0 - 0.1 0.9 presented in Table 14. Across
the board, the results show moderately strong translation quality by all the models. Overall, a similar
trend is observed for En-Indic translations, with IndicTrans2 outperforming the best open-source
models and commercial models. Similarly, in the case of Indic-En translations, IndicTrans2
outperforms the best open-source models and performs competitively with commercial models. The
results further highlight significant improvements in the quality of translations for low-resource
languages such as Dogri (+13.8), Kashmiri (+8.5), Manipuri Meitei (+9), Sanskrit (+2.7), and Santali
(+16.6) in the En-Indic direction, and Kashmiri (+7.4) and Santali (+6.1) in the Indic-En direction,
respectively, compared to the best available existing systems. Given that IndicTrans2 supports all 22
scheduled languages and performs well across all of them, the model is expected to have good
usability in both informational and conversational settings. Additionally, we also report the COMET
(Rei et al., 2022) and BLEU (Papineni et al., 2002) scores for our models in the Table 42 and Table 45
(in Appendix B). Evaluation on Other Benchmarks. We perform evaluations on other publicly
available benchmarks and the detailed results are presented in Appendix B, while a summary of the
observations is presented in this section. Specifically, we evaluate the models on WAT 2020
(Nakazawa et al., 2020) and WAT2021 (Nakazawa et al., 2021a), which were created from the
PMIndia corpus containing data from speeches and news from the Prime Minister of India. Across
the board, the results presented in Table 32 and Table 33 show that IndicTrans2 outperforms all
open-source and commercial models in both Indic-En and En-Indic translation directions, with the
exception of IndicTrans1. However, it is important to note that performance improvement for
IndicTrans1 stems from the fact that their validation set consisted of the development sets of various
shared task benchmarks like WAT, WMT, and FLORES-200. On the contrary, our work used the FLORES-200 development set
as the validation set with the aim of attaining strong performance across multiple domains. Along the
same lines, we evaluate our models on the NTREX (Federmann et al., 2022) Evaluation set, which is
derived from the news domain. The results presented in Table 29 and Table 30 show similar findings
with IndicTrans2 performing the best among all the compared models with +3 and +2.6 points
improvement over the best open-source model in En-Indic and Indic-En directions respectively.
However, on the UFAL test set involving Tamil language, among open-source models, we observe
that our model lags behind the IndicTrans1 and NLLB 1.2B model in the En-Indic direction (Table 38).
Best Open-Source Model. Our study evaluated the translation quality of IndicTrans2 and other open-
source models on various benchmarks. While IN22 and FLORES-200 (Costa-jussà et al., 2022)
evaluated the models on diverse domain content such as sports, news, and conversational texts, we
further tested the models on WAT2020 (Nakazawa et al., 2020), WAT2021 (Nakazawa et al., 2021a),
and NTREX (Federmann et al., 2022). Across all multi-domain benchmarks, we observed that
IndicTrans2 consistently outperformed other open-source models, demonstrating its better
translation capabilities. However, it is important to note that performance improvement for
IndicTrans1 on WAT2020 (Nakazawa et al., 2020) and WAT2021 (Nakazawa et al., 2021a) can be
attributed to explicit optimization across different benchmarks by incorporating development
sets of various shared tasks, in addition to FLORES-200. In contrast, our development set only
comprises FLORES-200. Detailed results for all the benchmarks and models are presented in
Appendix B (refer Tables 29, 32 and 33). Additionally, IndicTrans2 has the highest coverage of
languages and written scripts, with support for 22 Indic languages and 25 language-script
combinations. Further, while the current SOTA open-source model, the NLLB 54B MoE model (Costa-
jussà et al., 2022), is impressive in its capabilities, it is impractical for deployment due to its high
latency and resource requirements. Our study addresses this challenge by developing comparatively
compact models that can compete with large-scale models even when trained on smaller datasets,
emphasizing quality and cost-effectiveness. Results on different benchmarks confirm the robust
performance of our model across various domains and distributions. Therefore, we can conclude
that our model has fair generalization capabilities, performing well across most of the benchmarks.
Supporting New Languages and Scripts. Our work bridges the gap left by existing open-source and
commercial systems by extending IndicTrans1 (Ramesh et al., 2022) to support all 22 scheduled Indic
languages, including low-resource languages and multiple scripts. We train the first open-source
model with reasonable performance for the following languages: Bodo, Dogri, and Konkani. For some
languages, we support translation in scripts that were hitherto unsupported like Sindhi (Devanagari
script) or are only supported by commercial systems like Manipuri (Meitei). In addition, we also
improve translation quality significantly for low-resource languages such as Dogri, Maithili, Manipuri
(Meitei), and Nepali. The human-annotated seed parallel data (refer Table 1) for these languages
help us outperform other models which rely on unsupervised methods and/or mined data for these
low-resource languages. This suggests that investments in creating small parallel corpora for low-
resource languages can substantially improve translation quality, corroborating findings from Costa-
jussà et al. (2022). Comparison across language families. Our analysis reveals that on low-resource
languages from the Sino-Tibetan and Austroasiatic language families, models tend to consistently
underperform compared to mid and high-resource languages in the Indo-Aryan and Dravidian
families. Conversely, on mid and high-resource languages, all models seem to exhibit comparable
performance. These observations suggest that the major differences in performance are coming
from the low-resource language families. Notably, no other open-source or commercial model covers
all four language families. The results for all the models on our primary benchmarks are presented in
Figure 5. Additionally, we conduct a small-scale human evaluation exercise to verify if the quality of
our model outputs correlates with the improvements observed using automatic metrics. This
preliminary human evaluation exercise focused on the En-Indic direction and included 50 examples
each from the Wikipedia and Web sources subset to yield a total of 100 sentence pairs from IN22-
Gen and is described in Appendix C. However, future efforts should focus on large-scale human
evaluation to understand the potential biases and shortcomings of our IndicTrans2 models and
assess their feasibility in practical use-case scenarios.

Figure 5: Average performance improvements in terms of chrF++ across language families (Indo-Aryan, Dravidian, Sino-Tibetan, Austroasiatic) on the IN22 and FLORES-200 (Costa-jussà et al., 2022) benchmarks. The panels report average chrF++ for IT2, NLLB 54B MoE, Google, and Azure on FLORES-200, IN22-Gen, and IN22-Conv in both the En-Indic and Indic-En directions.

7.2 Understanding Data Scale vs Quality Tradeoff

Prior works such as NLLB (Costa-jussà et al., 2022) have focused on scaling
the data to improve the model performance. They use a margin-based mining approach with a
threshold of 1.06. However, from an in-house manual inspection, it was observed that the data was
noisy. As a result, we conducted an ablation study to understand the trade-off between data scale
and quality for effectively training multilingual MT models. In this ablation, we consider existing
mined parallel corpora such as Samanantar (Ramesh et al., 2022) and NLLB (Costa-jussà et al., 2022)
and specifically focus on the subset of 11 languages that are common to both. We apply an
additional quality filter, where we eliminate the bitext pairs whose LaBSE (Feng et al., 2022) cosine similarity falls below a threshold of 0.80. This reduced the data from 384M (unfiltered) to 94M (filtered) bitext pairs in total. Subsequently, we train two separate models with the same architecture
(refer to Section 5.4) and stage 1 hyperparameters (refer to Table 10) as our final IndicTrans2 models
on filtered and unfiltered versions of the data. The results shown in Table 15 demonstrate that the
models trained on the high-quality filtered subset perform on par or even superior to the model
trained on the unfiltered data. This suggests that eliminating the noisy and suboptimal bitext pairs
through this additional filter improves the model performance and accelerates model convergence.
We, therefore, adopt this filtering threshold for our final training, ensuring that our model benefits
from the improved data quality.
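A minimal sketch of this quality-filtering step is shown below; it assumes the publicly available LaBSE checkpoint from the sentence-transformers library and illustrative in-memory lists rather than the actual corpus files used in our pipeline.

# Minimal sketch: drop bitext pairs whose LaBSE cosine similarity is below 0.80.
from sentence_transformers import SentenceTransformer
import numpy as np

def filter_bitext(src_sents, tgt_sents, threshold=0.80):
    model = SentenceTransformer("sentence-transformers/LaBSE")
    # With normalized embeddings, the cosine similarity of an aligned pair is a dot product.
    src_emb = model.encode(src_sents, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, normalize_embeddings=True)
    sims = np.sum(src_emb * tgt_emb, axis=1)
    return [(s, t) for s, t, sim in zip(src_sents, tgt_sents, sims) if sim >= threshold]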
7.3 Impact of Sequential Training with Human Annotated Data

We train our models sequentially, where stage 1 involves training on a combination of all the existing
data, mined data, and high-quality seed data, while stage 2 involves fine-tuning with high-quality
seed data (as described in Section 5.5). Our seed data involves a combination of NLLB Seed (Costa-
jussà et al., 2022; Maillard et al., 2023) and our human-annotated data BPCC-H-Wiki (refer Table 1).
As seed data for Sindhi (Arabic) is not present in both the sources, we use the Sangam transliteration API33 (Lehal & Saini, 2014) to transliterate the Sindhi BPCC-H-Wiki data (~10.5K) from Devanagari script to Perso-Arabic script.

Table 15: chrF++ scores of the models trained on unfiltered (pre-filtering) and filtered data (post-filtering), on the FLORES-200 Evaluation set in the En-Indic and Indic-En directions. The best-performing system is bolded. ∆ represents the difference between the scores of the model trained on filtered data and unfiltered data. A positive value for ∆ indicates that the model trained on filtered data (post-filtering) is better than unfiltered (pre-filtering) and vice-versa.

language | Dataset Size (Pre-Filter, Post-Filter) | En-Indic (Pre-Filter, Post-Filter, ∆) | Indic-En (Pre-Filter, Post-Filter, ∆)
asm_Beng | 5.3M 0.5M | 34.6 39.0 4.4 | 49.2 51.9 2.7
ben_Beng | 70.4M 16.5M | 52.2 53.1 0.9 | 60.0 60.2 0.2
guj_Gujr | 14.4M 8.4M | 51.4 52.4 1.0 | 64.0 63.9 -0.1
hin_Deva | 43.1M 11M | 58.1 58.7 0.6 | 64.4 64.6 0.2
kan_Knda | 38.3M 10.5M | 52.7 53.3 0.6 | 58.6 58.7 0.1
mal_Mlym | 49.6M 10.8M | 52.8 55.1 2.3 | 60.2 61.1 0.9
mar_Deva | 35.6M 7.74M | 46.9 48.5 1.6 | 60.6 60.7 0.1
ory_Orya | 14.7M 2.9M | 42.6 46.1 3.5 | 58.8 60.0 1.2
pan_Guru | 14M 3.3M | 49.1 50.6 1.5 | 62.7 63.1 0.4
tam_Taml | 47.7M 10.4M | 53.3 55.3 2.0 | 58.0 58.2 0.2
tel_Telu | 51.2M 11.8M | 56.0 56.8 0.8 | 63.0 63.2 0.2
Avg. | - - | 50.0 51.7 1.7 | 60.0 60.5 0.6

Table 16: Performance improvements of En-Indic and Indic-En models on chrF++ metric on our primary evaluation benchmarks w.r.t. sequential training.

Benchmark | En-Indic | Indic-En
FLORES-200 | +1.5 | +0.6
IN22-Gen | +2.2 | +0.5
IN22-Conv | +2.7 | +1.9
Average | +2.1 | +1.0

We observe that fine-tuning our models with high-quality seed data is
beneficial and leads to an average improvement of 2.1 points and 1 point in En-Indic and Indic-En
directions, respectively, on our primary evaluation benchmarks in terms of chrF++ metric (see Table
16). These findings align with previous works (Mohiuddin et al., 2022), which show that a deterministic data selection curriculum, in which pretraining on general-domain corpora is followed by fine-tuning on a high-quality subset of those corpora, results in solid performance improvements over the preliminary models. A critical distinction from the above approach is that we only use the
human-annotated seed data for fine-tuning, rather than retrieval of top p% samples from training
data based on lexical similarity. Our observations indicate that although sequential training yields
gains on an aggregate level, it is important to note that for specific languages such as Sindhi (Arabic)
(where we use transliterated data), our En-Indic model tends to degrade (~3 points in chrF++) in
terms of performance, highlighting that it is crucial to use high-quality human annotated data for
fine-tuning. Furthermore, Table 17 reports the performance of IndicTrans2 models for various
training stages on IN22-Gen Set. Notably, the highest improvement was observed in Santali for the
En-Indic direction in both ∆1 and ∆2. It is also worth highlighting that the human-annotated seed
data from previous work and our current work serves as the primary and most influential source for
mid-resource and low-resource languages, including Dogri, Konkani, Sindhi (Devanagari), Santali, and
Manipuri (Meitei) as shown in Table 1. Despite the smaller size of seed data compared to mined
corpora, fine-tuning on it leads to superior performance across different benchmarks (refer Tables
12 to 14). Although ∆1 and ∆2 may be smaller for a few languages due to the saturation of the data
diversity during multi-stage training, the seed data proves to be beneficial on an aggregate level,
further reinforcing its positive impact.

33 https://sangam.learnpunjabi.org/

Table 17: chrF++ score on IN22-Gen Evaluation Set for various training stages. OG refers to the model trained on the original training corpora, while OG-Seed refers to the seed data fine-tuned version of the OG model. ∆1 represents the gains obtained by fine-tuning the original model with seed data. DA refers to the model trained on the combination of original training data with augmented data, while DA-Seed refers to the seed data fine-tuned version of the DA model. ∆2 represents the gains obtained by fine-tuning on seed data after data augmentation.

language | En-Indic (OG, OG-Seed, ∆1, DA, DA+Seed, ∆2) | Indic-En (OG, OG-Seed, ∆1, DA, DA+Seed, ∆2)
asm_Beng | 43.4 45.6 2.2 44.8 47.1 2.3 | 61.9 62.1 0.2 64.9 65.8 0.9
ben_Beng | 48.2 50.3 2.1 48.8 51.8 3.0 | 60.6 60.8 0.2 62.4 63.2 0.8
brx_Deva | 44.5 47.1 2.6 46.3 47.8 1.5 | 58.1 58.4 0.3 61.9 62.1 0.2
doi_Deva | 55.4 55.7 0.3 56.2 57.8 1.6 | 68.6 68.5 -0.1 72.7 72.6 -0.1
gom_Deva | 42.2 43.8 1.6 43.2 45.2 2.0 | 55.9 56.5 0.6 58.7 59.2 0.5
guj_Gujr | 49.4 51.6 2.2 50.0 53.5 3.5 | 64.0 63.9 -0.1 65.7 66.5 0.8
hin_Deva | 53.5 54.6 1.1 53.6 56.7 3.1 | 62.8 63.4 0.6 64.7 65.4 0.7
kan_Knda | 47.3 49.7 2.4 47.7 51.0 3.3 | 61.7 62.0 0.3 63.2 64.2 1.0
kas_Arab | 37.7 38.8 1.1 38.3 40.2 1.9 | 55.6 56.1 0.5 60.0 60.4 0.4
mai_Deva | 45.9 47.3 1.4 46.2 48.7 2.5 | 62.1 61.9 -0.2 64.6 64.8 0.2
mal_Mlym | 47.9 49.7 1.8 48.4 50.9 2.5 | 60.7 61.5 0.8 63.1 64.5 1.4
mar_Deva | 45.7 48.6 2.9 46.6 51.0 4.4 | 60.7 61.1 0.4 62.3 63.7 1.4
mni_Mtei | 39.6 41.3 1.7 41.8 44.6 2.8 | 53.2 53.3 0.1 57.6 57.9 0.3
npi_Deva | 44.5 47.5 3.0 45.4 49.0 3.6 | 64.4 64.4 0.0 67.1 67.7 0.6
ory_Orya | 40.1 41.9 1.8 41.0 43.9 2.9 | 63.1 63.4 0.3 65.3 66.2 0.9
pan_Guru | 49.5 50.6 1.1 50.2 50.6 0.4 | 61.0 61.4 0.4 62.9 63.4 0.5
san_Deva | 35.9 37.7 1.8 36.9 38.8 1.9 | 50.9 51.1 0.2 54.4 54.8 0.4
sat_Olck | 24.2 27.3 3.1 26.5 33.4 6.9 | 43.6 43.8 0.2 44.5 45.3 0.8
snd_Deva | 34.8 36.2 1.4 35.3 36.6 1.3 | 53.6 53.7 0.1 56.5 57.3 0.8
tam_Taml | 47.3 48.7 1.4 47.9 49.5 1.6 | 57.2 57.5 0.3 59.1 59.8 0.7
tel_Telu | 49.6 51.3 1.7 50.0 52.4 2.4 | 62.3 62.6 0.3 64.0 64.8 0.8
urd_Arab | 63.8 67.1 3.3 65.4 68.2 2.8 | 69.5 69.9 0.4 72.5 73.0 0.5

Table 18: Comparison of average chrF++ scores between our stage 2 auxiliary model and the best open-source baseline on FLORES-200 (Costa-jussà et al., 2022) Evaluation set at the end of stage 2 auxiliary training. OG-Seed denotes the model trained on the original data followed by fine-tuning with seed data. ∆ denotes the difference between the scores of our stage 2 auxiliary model and the best open-source baseline.

Direction | N54 | OG-Seed | ∆
xx-eng_Latn | 60.9 | 58.1 | -2.8
eng_Latn-xx | 45.7 | 47.8 | 2.1

7.4 Impact of Data Augmentation

Section 5.6 describes the procedure
and heuristics for synthetic data generation to further improve our auxiliary models. Initially, we
adopted the back-translation approach for generating the augmented data. We primarily base our
decision to start with an auxiliary En-Indic model for generating back-translation data for Indic-En
translation due to its competitive or better performance compared to the best open-source baseline
(see Table 18). We combine the original data and the English back-translated data, obtained using
our auxiliary En-Indic model, to train our new Indic-En model from scratch, followed by high-quality
seed data fine-tuning. In this case, following prior work (Caswell et al., 2019), we use “__bt__” indicator tags to provide some supervision to the model to distinguish original data from back-translated data. As shown in Figure 6, we observe a considerable performance improvement across all our primary evaluation benchmarks for our Indic-En model when we perform training on the
combination of original and back-translated data (refer Table 17).
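A minimal sketch of this data-preparation step is given below; the exact tag placement and data handling in our pipeline may differ, so this should be read as an illustration of the tagging idea from Caswell et al. (2019) rather than the exact implementation.

# Minimal sketch: mark back-translated pairs with a "__bt__" tag on the (synthetic) source
# side so the model can distinguish them from original bitext during training.
def tag_backtranslated(original_pairs, backtranslated_pairs, tag="__bt__"):
    """Both arguments are lists of (source_sentence, target_sentence) tuples."""
    tagged_bt = [(f"{tag} {src}", tgt) for src, tgt in backtranslated_pairs]
    # Original pairs stay untagged; the two sets are concatenated (and later shuffled).
    return original_pairs + tagged_bt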
Figure 6: Average performance of our En-Indic and Indic-En models across different stages in terms of chrF++ metric on our primary evaluation sets. [Figure: for each benchmark (FLORES-200, IN22-Gen, IN22-Conv, in both En-Indic and Indic-En directions), average chrF++ bars for the Original, Original + Seed FT, Original + Data Augmentation, and Original + Data Augmentation + Seed FT models.]

Following iterative back translation (Hoang et al., 2018), we use the
stage 2 fine-tuned downstream Indic-En model to generate the back-translation data due to its
superior performance compared to the auxiliary Indic-En model. Similarly, we combine the Indic
back-translated data along with the original data using indicator tags and train our new En-Indic
model from scratch, followed by fine-tuning with seed data. However, we do not observe any gains for the new En-Indic model compared to the stage 2 auxiliary fine-tuned En-Indic model. Further
investigation is needed to determine the exact reasons for the performance limitations of our newly
trained En-Indic model, but we suspect that unlike for Indic-En translation, the increase in the Indic
target side data is insufficient, both in terms of domain coverage and amount. This conjecture is
based on the fact that a significant portion of both the original training corpus and the back-translated data is sourced from the news domain, resulting in considerable overlap in their distributional coverage. The lack of domain diversity may potentially hinder the model from reaching its optimal capabilities. Furthermore, for Indic-En translation, the amount of target-side English data almost triples when back-translated data is added to the original parallel corpus. However, in the case of English-Indic translation, where multiple target languages are
involved, the relative augmentation per language is comparatively lower, which might potentially
explain the marginal enhancement observed in the English-Indic direction. Increased availability of
Indic language monolingual corpora, ideally from various domains, should help remedy this issue.
Since back-translation did not help in the En-Indic direction, we looked at the findings from distillation works such as Kim & Rush (2016) and Gumma et al. (2023), and trained an En-Indic model on the combination of original data and forward-translated (distillation) data, obtained by flipping the English back-translation data (see the sketch at the end of this subsection). In this case, we use “__ft__” indicator tags instead of “__bt__” indicator tags. Here, we
observe marginal performance improvements for our newly trained En-Indic model on combining
original data and forward-translated data, as shown in Figure 6 (refer Table 17). Although this model is not substantially better than the one obtained using back-translation, it does exhibit slightly better performance, and thus we consider it as our final En-Indic model. Overall, our En-Indic model is
competitive or better when compared to the baselines, but further research is necessary to explore
effective methods to improve the En-Indic model.
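For completeness, a minimal sketch of deriving the forward-translation data described above by flipping the English back-translation pairs is shown below; placing the tag on the English source side is an assumption for illustration and may not match the exact preprocessing used in our pipeline.

# Minimal sketch: turn the English back-translation pairs used for Indic-En training into
# forward-translation ("distillation") data for En-Indic training by flipping the pair and
# switching the indicator tag to "__ft__".
def flip_bt_to_ft(indic_en_bt_pairs, tag="__ft__"):
    """indic_en_bt_pairs: (synthetic_indic_source, original_english_target) tuples."""
    return [(f"{tag} {english}", synthetic_indic)
            for synthetic_indic, english in indic_en_bt_pairs]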
7.5 Indic-Indic Evaluation

Our IndicTrans2 models have exhibited strong performance across various benchmarks, as detailed in Section 7.1. Building
upon these findings, we aim to conduct a comprehensive evaluation of the Indic-Indic translation
capabilities of our IndicTrans2 models in both pivot-based and direct setups.

Table 19: chrF++ scores of Indic-Indic evaluation on FLORES-200 (Costa-jussà et al., 2022) of our IndicTrans2-Pivot (IT2-Pivot) model, IndicTrans2-M2M (IT2-M2M) model, compressed IndicTrans2-M2M (IndicTrans2-Dist-M2M) model and NLLB 54B MoE model. “xx-{lang}” and “{lang}-xx” denote the average chrF++ scores to that language and from that language, respectively.

language | xx-{lang} (N54, IT2-Pivot, IT2-M2M, IT2-Dist-M2M) | {lang}-xx (N54, IT2-Pivot, IT2-M2M, IT2-Dist-M2M)
asm_Beng | 36.7 38.0 37.9 37.4 | 39.5 41.0 39.7 39.3
ben_Beng | 44.5 45.7 44.7 43.7 | 41.4 43.0 42.1 41.6
guj_Gujr | 44.8 45.9 44.8 44.2 | 43.4 44.9 43.8 43.3
hin_Deva | 48.4 48.6 47.7 46.8 | 42.9 44.6 43.8 43.6
kan_Knda | 46.6 47.3 45.9 45.1 | 40.6 42.3 41.2 40.8
kas_Arab | 32.6 33.8 33.1 32.8 | 40.7 41.7 39.9 39.2
mai_Deva | 37.9 41.5 40.5 40.4 | 45.0 45.9 44.9 44.7
mal_Mlym | 45.7 47.8 46.2 45.1 | 41.2 43.3 42.0 41.5
mar_Deva | 41.9 43.6 42.5 41.7 | 42.4 44.1 43.0 42.5
npi_Deva | 43.6 46.9 45.8 45.4 | 43.1 45.0 44.0 43.5
ory_Orya | 41.1 41.6 40.8 40.2 | 42.7 44.3 43.3 42.8
pan_Guru | 44.4 44.6 43.8 43.1 | 43.4 44.6 43.5 43.2
san_Deva | 25.6 28.9 28.7 28.6 | 35.7 38.1 36.5 35.9
sat_Olck | 25.7 26.6 26.3 26.1 | 32.4 31.4 32.5 31.5
tam_Taml | 47.3 48.7 47.3 46.1 | 40.1 41.7 40.1 39.7
tel_Telu | 47.0 48.5 47 46 | 41.9 43.7 42.6 41.8
urd_Arab | 43.7 44.4 43.9 43.1 | 41.1 42.7 41.6 41

7.5.1 Pivoting

Pivoting (Gispert & Mariño, 2006; Utiyama & Isahara, 2007; Bertoldi et al., 2008) is a widely used approach in non-English-centric translation scenarios, where direct parallel corpora are
limited or unavailable. It involves utilizing a high-resource language as an intermediary, translating
from the source to the pivot language and then to the target language. The pivot method is a strong
baseline for non-English centric translation compared to many other methods proposed to address
this task (Freitag & Firat, 2020; Chen et al., 2017; Firat et al., 2016; Arivazhagan et al., 2019;
AlShedivat & Parikh, 2019). In our study, we leverage our Indic-En model followed by the En-Indic
model to facilitate Indic-Indic translation, as our IndicTrans2 models are trained using English-centric
parallel corpora and use English as the pivot language.
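A minimal sketch of the pivoting procedure is shown below; translate_indic_to_en and translate_en_to_indic are hypothetical wrappers around the two English-centric models and are not part of any released API.

# Minimal sketch: Indic-Indic translation by pivoting through English with the two
# English-centric models. The translate_* callables are hypothetical wrappers.
def translate_pivot(sentences, src_lang, tgt_lang,
                    translate_indic_to_en, translate_en_to_indic):
    # Step 1: source Indic language -> English (pivot) with the Indic-En model.
    english_pivot = translate_indic_to_en(sentences, src_lang=src_lang)
    # Step 2: English (pivot) -> target Indic language with the En-Indic model.
    return translate_en_to_indic(english_pivot, tgt_lang=tgt_lang)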
To assess the Indic-Indic translation performance, we evaluate our IndicTrans2 models on n-way parallel test sets such as FLORES-200
(Costa-jussà et al., 2022) and IN22 benchmarks. The generation and evaluation procedure for Indic-
Indic translations is the same as described in Section 6.4 and Section 6.5. The performance in Indic-
Indic translation for our pivot-based IndicTrans2 and NLLB (Costa-jussà et al., 2022) is shown in Table
19 for FLORES-200, Table 20 for IN22-Gen and Table 21 for IN22-Conv, using average chrF++ scores
over common languages across NLLB, our pivot as well as direct systems described in Section 7.5.2.
For each language (lang), “xx-{lang}” denotes the average scores from all the common languages in
that language, whereas “{lang}-xx” denotes the average scores from that language into all the
common languages. Table 19 shows that our pivot-based IndicTrans2 outperforms or is on par with the multi-way trained NLLB 54B MoE model across all Indic-Indic directions on FLORES-200 (Costa-jussà et al., 2022). It is important to note that we directly evaluate the NLLB 54B model by using the
translation outputs34 released by Costa-jussà et al. (2022). However, for the evaluation on the IN22
benchmark, we use the NLLB 1.2B distilled model instead of the NLLB 54B MoE model because of resource constraints arising from the sheer number of translation directions. Our pivot-based IndicTrans2 significantly outperforms the NLLB 1.2B distilled model, as shown in Tables 20 and 21. The NLLB 1.2B distilled model provides a lower-bound estimate of the performance. However, we anticipate a smaller difference between our pivot-based IndicTrans2 and the best NLLB 54B MoE model. Based on
our previous results, we expect IndicTrans2 scores to be comparable if not
better than the best NLLB 54B MoE model. This highlights the effectiveness of our robust English-centric models and their potential in Indic-Indic translation scenarios.

34 https://tinyurl.com/nllbflorestranslations

Table 20: chrF++ scores of Indic-Indic evaluation on IN22-Gen test set of our IndicTrans2-Pivot (IT2-Pivot) model, IndicTrans2-M2M (IT2-M2M) model, compressed IndicTrans2-M2M (IndicTrans2-Dist-M2M) model and NLLB 1.2B distilled model. “xx-{lang}” and “{lang}-xx” denote the average chrF++ scores to that language and from that language, respectively. † indicates completely off-target translations.

language | xx-{lang} (N1.2, IT2-Pivot, IT2-M2M, IT2-Dist-M2M) | {lang}-xx (N1.2, IT2-Pivot, IT2-M2M, IT2-Dist-M2M)
asm_Beng | 35.5 40.7 40.5 39.4 | 38.8 44.0 42.7 40.9
ben_Beng | 39.9 45.1 44.8 43.2 | 37.4 43.2 42.3 41.2
guj_Gujr | 39.2 45.4 44.3 42.9 | 39.0 43.8 43.2 39.9
hin_Deva | 43.7 49.2 48.8 47.1 | 39.1 43.4 43.0 42.3
kan_Knda | 39.4 44.6 44.5 43 | 38.4 43.9 43.1 39.8
kas_Arab | 28.5 35.4 34.8 33.7 | 35.6 41.8 41.3 39.8
mai_Deva | 36.6 42.0 41.9 40.3 | 39.1 44.2 43.7 42.8
mal_Mlym | 38.5 44.9 43.5 42 | 36.4 42.9 42.2 40.6
mar_Deva | 37.6 44.4 43.6 41.5 | 38.2 43.8 43.0 42.4
npi_Deva | 37.3 41.4 41.1 39.6 | 39.0 44.8 44.0 43.1
ory_Orya | 36.1 38.2 38.0 36.8 | 39.4 44.9 44.3 41.3
pan_Guru | 39.0 43.2 42.2 40.9 | 36.8 41.7 40.5 39.1
san_Deva | 23.3 35.8 35.8 34.6 | 32.8 39.8 39.0 37.6
sat_Olck | 0.0† 31.2 31.2 30 | 0.0† 35.0 37.2 35.8
tam_Taml | 40.1 45.0 44.1 42.6 | 35.4 41.3 40.2 39.3
tel_Telu | 40.0 45.7 44.5 42.9 | 37.5 43.2 42.5 41.9
urd_Arab | 47.7 54.6 53.2 50.8 | 39.4 45.2 44.4 43.6

Table 21: chrF++ scores of Indic-Indic evaluation on IN22-Conv test set of our IndicTrans2-Pivot (IT2-Pivot) model, IndicTrans2-M2M (IT2-M2M) model, compressed IndicTrans2-M2M (IndicTrans2-Dist-M2M) model and NLLB 1.2B distilled model. “xx-{lang}” and “{lang}-xx” denote the average chrF++ scores to that language and from that language, respectively. † indicates completely off-target translations.

language | xx-{lang} (N1.2, IT2-Pivot, IT2-M2M, IT2-Dist-M2M) | {lang}-xx (N1.2, IT2-Pivot, IT2-M2M, IT2-Dist-M2M)
asm_Beng | 33.7 38.6 38.6 37.7 | 35.8 41.1 40.6 39.9
ben_Beng | 37.6 41.6 41.5 40.5 | 34.8 39.8 39.8 39.1
guj_Gujr | 38.1 43.5 43.1 42.1 | 36.5 40.9 40.5 39.7
hin_Deva | 39.9 42.5 42.4 41.7 | 36.3 40.5 40.4 39.9
kan_Knda | 28.2 30.8 30.7 30.1 | 30.8 35.5 34.6 33.7
kas_Arab | 18.6 30.7 31.1 30.7 | 30.5 37.4 37.4 35.7
mai_Deva | 32.2 37.9 38.4 37.8 | 34.8 40.0 39.6 38.9
mal_Mlym | 34.9 39.7 39.0 38.0 | 32.9 37.7 37.3 36.4
mar_Deva | 35.6 41.0 40.4 39.2 | 35.7 40.0 39.9 39.4
npi_Deva | 35.6 42.2 42.0 41.2 | 36.2 41.1 40.9 40.1
ory_Orya | 33.7 34.4 34.5 33.9 | 36.2 41.2 40.7 39.9
pan_Guru | 40.9 45.5 45 44.0 | 35.6 40.3 39.8 39.2
san_Deva | 22.3 31.8 32 31.5 | 26.8 34.8 34.4 33.1
sat_Olck | 0.0† 30.7 31.2 30.4 | 0.0† 32.1 34.7 33.8
tam_Taml | 33.2 36.2 35.6 34.9 | 30.7 34.3 34.0 33.3
tel_Telu | 35.0 39.6 39.1 37.9 | 33.2 37.5 37.3 36.6
urd_Arab | 43.7 49.2 48.8 47.9 | 36.5 41.7 41.4 40.7

7.5.2 Direct Models

While the pivot-based solution demonstrates strong Indic-Indic performance, its inherent
sequential dual-model pipeline doubles the inference time compared to the English-centric model. To address this limitation, it is essential to build direct Indic-Indic
(IndicTrans2-M2M) models that facilitate Indic-Indic translation with nearly the same inference cost
as the English-centric model. However, the scarcity of Indic-Indic data makes training such models from
scratch challenging. As a result, inspired by prior works (Kim et al., 2019; Ma et al., 2020), we
leverage pre-trained components from our English-centric models to initialize the IndicTrans2-M2M
model. Specifically, we initialize the IndicTrans2-M2M model using the Encoder from the Indic-En
model and the Decoder from the En-Indic model. It is important to note that these two pre-trained
components undergo independent training and lack synchronization, resulting in a lack of zero-shot
performance post-initialization. Nevertheless, these pre-trained components serve as strong
initializations to start with and can be further adapted with limited data.
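A minimal sketch of this initialization is given below; it assumes fairseq-style checkpoints whose parameter names carry "encoder." and "decoder." prefixes, and the checkpoint file names are placeholders.

# Minimal sketch: initialize the direct Indic-Indic model from the Indic-En encoder and the
# En-Indic decoder. Assumes fairseq-style checkpoints with "encoder."/"decoder." prefixes;
# checkpoint paths are placeholders, not released artifact names.
import torch

indic_en = torch.load("indic_en_checkpoint.pt", map_location="cpu")["model"]
en_indic = torch.load("en_indic_checkpoint.pt", map_location="cpu")["model"]

m2m_init = {k: v for k, v in indic_en.items() if k.startswith("encoder.")}
m2m_init.update({k: v for k, v in en_indic.items() if k.startswith("decoder.")})
torch.save({"model": m2m_init}, "indic_indic_init.pt")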
The BPCC-Wiki subset contains 9.2M bitext pairs spanning 462 Indic-Indic directions. This seed corpus is not completely n-
way in the current form (see Section 3.3), and the data scales might be extremely low for some
language pairs. As a result, we leverage data augmentation to synthetically generate n-way parallel corpora with just n inference passes instead of nC2 (one per language pair), as sketched below. Specifically, we use our IndicTrans2 En-Indic
model to generate 100K synthetic bitext pairs for each translation direction by selecting 100K English
monolingual sentences from IndicCorpv2 (Doddapaneni et al., 2023). This amounts to a total of
46.2M pairs across 462 Indic-Indic language pairs. Our fine-tuning dataset for adapting the
IndicTrans2-M2M model consists of seed corpus and synthetic corpus, resulting in a total of 55.4M
bitext pairs across 462 directions.
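The sketch below illustrates the n-pass generation scheme referenced above; translate_en_to_indic is a hypothetical wrapper around the En-Indic model, and the pairing step simply reads off any Indic-Indic direction from the shared English pivot.

# Minimal sketch: build (pseudo) n-way parallel Indic-Indic data with n decoding passes.
# Translating the same English sentences into every Indic language makes the outputs
# mutually parallel, so any Indic-Indic pair can be formed without n-choose-2 translation runs.
def build_nway_pairs(english_sentences, indic_langs, translate_en_to_indic):
    translations = {lang: translate_en_to_indic(english_sentences, tgt_lang=lang)
                    for lang in indic_langs}  # n inference passes in total
    pairs = {}
    for src in indic_langs:
        for tgt in indic_langs:
            if src != tgt:
                pairs[(src, tgt)] = list(zip(translations[src], translations[tgt]))
    return pairs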
It is important to note that our IndicTrans2-M2M model covers all 22 scheduled languages but lacks direct support for script variants like Kashmiri (Devanagari),
Manipuri (Bengali), and Sindhi (Arabic) due to the unavailability of seed data for these scripts. Tables 19 to 21 show that our IndicTrans2-M2M achieves competitive performance with a 1-point decrease
in the chrF++ metric compared to the pivot-based approach at half of the inference cost.
Furthermore, we also apply the same recipe to IndicTrans2-Dist (described in Section 7.6) to improve
the inference latency and compress it to about 350M parameters while achieving competitive
performance with the IndicTrans2-M2M 1.2B parameter model (see Tables 19 to 21).

7.6 Distilled Models

We distill our IndicTrans2 (1.1B parameters, 12 GB size) models into smaller, efficient counterparts called IndicTrans2-Dist (211M parameters, 2 GB size) to enhance deployment feasibility in low-infrastructure settings. Following the deep and thin architecture approach (Gumma et al., 2023), we retain the encoder-decoder layer count but reduce the other fully-connected dimensions.
Acknowledging the robustness of our teacher model, we leverage a smaller, representative dataset
subset of ~110 million pairs across all 22 languages for a more data-efficient distillation process. We
adopt Word-Level distillation (Hinton et al., 2015; Kim & Rush, 2016), facilitating direct student
model training without a separate distilled dataset. The student model is initially distilled from
IndicTrans2 and subsequently fine-tuned using the BPCC seed data.
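A minimal sketch of a word-level distillation objective is shown below; the mixing weight alpha and temperature T are illustrative hyperparameters and are not values taken from our training setup.

# Minimal sketch of word-level knowledge distillation (Kim & Rush, 2016): the student is
# trained to match the teacher's per-token output distribution, optionally mixed with the
# usual cross-entropy on the reference tokens.
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, targets, pad_id, alpha=0.5, T=1.0):
    # student_logits, teacher_logits: (batch, seq_len, vocab); targets: (batch, seq_len)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="none").sum(-1)
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets,
                         ignore_index=pad_id, reduction="none")
    mask = (targets != pad_id).float()
    kd = (kd * mask).sum() / mask.sum()
    ce = (ce * mask).sum() / mask.sum()
    return alpha * kd * (T ** 2) + (1.0 - alpha) * ce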
Tables 49 to 51 in Appendix D list the hyperparameters and architecture of IndicTrans2-Dist models. In adherence to metrics used
before, we report chrF++ scores of the distilled models on IN22-Gen in Table 22. The chrF++ scores
on FLORES-200 and IN22-Conv are presented in Tables 52 and 53 in Appendix D, respectively. In contrast to our earlier findings, we find that fine-tuning with seed data was not as beneficial for the distilled models. Our distilled models trained with word-level distillation perform competitively with our best IT2 models and show an average drop of 0.87 chrF++ on Indic-En and 0.17 on En-Indic across all three benchmarks. It is important to note that we do not use any back-translation data for distillation. Notably, we observe higher gains due to distillation on IN22-Conv than on IN22-Gen and
FLORES-200 in the Indic-En direction. Low-resource languages like Dogri, Bodo and Arabic script
languages like Kashmiri and Urdu face a drop of more than 2.5 chrF++ points in the Indic-En
direction, whereas Santali has a gain of 2.7 points in IN22-Gen and 2.8 points in IN22-Conv as
compared to the Indic-En teacher model. Almost all high-resource languages like Hindi and Bengali
observe a negligible reduction in performance with distillation.

Table 22: chrF++ scores of Indic-En and En-Indic distilled models on IN22-Gen. Distilled (Dist) is the model trained with word-level KD. ∆ is the difference between the distilled model fine-tuned on seed data (Dist-Seed) and IT2. Higher values of ∆ are preferable.

language | Indic-En (IT2, Dist, Dist-Seed, ∆) | En-Indic (IT2, Dist, Dist-Seed, ∆)
asm_Beng | 65.8 65.6 65.6 -0.2 | 47.1 46.4 47.1 0.0
ben_Beng | 63.2 63.1 63.3 0.1 | 51.8 51.5 51.6 -0.2
brx_Deva | 62.1 59.3 59.3 -2.8 | 47.8 47.6 47.7 -0.1
doi_Deva | 72.6 70.2 70.2 -2.4 | 57.8 56.3 56.8 -1.0
gom_Deva | 59.2 57.3 57.2 -2.0 | 45.2 44.5 44.8 -0.4
guj_Gujr | 66.5 65.5 65.5 -1.0 | 53.5 52.9 53.2 -0.3
hin_Deva | 65.4 63.7 63.8 -1.6 | 56.7 56.4 56.7 0.0
kan_Knda | 64.2 64.3 64.3 0.1 | 51.0 50.4 50.9 -0.1
kas_Arab | 60.4 57.6 57.8 -2.6 | 40.2 39.0 39.5 -0.7
mai_Deva | 64.8 64.4 64.4 -0.4 | 48.7 48.5 48.7 0.0
mal_Mlym | 64.5 63.2 63.3 -1.2 | 50.9 50.4 50.8 -0.1
mar_Deva | 63.7 63.2 63.3 -0.4 | 51.0 50.4 50.6 -0.4
mni_Mtei | 57.9 58.0 58.0 0.1 | 44.6 43.2 43.6 -1.0
npi_Deva | 67.7 67.6 67.5 -0.2 | 49.0 48.7 49.0 0.0
ory_Orya | 66.2 65.8 65.9 -0.3 | 43.9 43.5 43.9 0.0
pan_Guru | 63.4 62.0 61.9 -1.5 | 50.6 50.6 50.4 -0.2
san_Deva | 54.8 53.8 53.9 -0.9 | 38.8 37.9 38.2 -0.6
sat_Olck | 45.3 47.5 48.0 2.7 | 33.4 33.0 33.8 0.4
snd_Deva | 57.3 56.0 56.6 -0.7 | 36.6 36.6 36.6 0.0
tam_Taml | 59.8 58.4 58.4 -1.4 | 49.5 49.3 49.3 -0.2
tel_Telu | 64.8 63.0 63.0 -1.8 | 52.4 52.4 52.4 0.0
urd_Arab | 73.0 70.8 70.9 -2.1 | 68.2 67.8 67.8 -0.4
Average | 62.8 61.8 61.9 -0.9 | 48.6 48.1 48.3 -0.3

In contrast to the findings of Gumma et al. (2023), we observe that the most significant
factor in developing compact student models that are comparable to the teacher is a robust teacher model coupled with high-quality, diverse data. However, extensive experiments are needed to further
validate and strengthen these observations in the future.

8 Conclusion

In this paper, we presented
our efforts on building machine translation systems supporting all 22 languages in the 8th schedule
of the Constitution of India. We created the multi-domain IN22 benchmark and the BPCC parallel
corpus, both of which are first-of-their-kind evaluation and training corpora, the latter consisting of
~230M bitext pairs, covering 22 Indic languages. We trained and evaluated robust English-centric
models containing 1.1B parameters as well as their compact versions with 211M parameters, which
can be used in compute-heavy as well as compute-scarce settings. Additionally, we repurpose pre-
trained components from our English-centric models for efficient training of a direct Indic-Indic
model containing 1.2B parameters as well as its compact version with 350M parameters. Our
evaluations focus on multiple automatic metrics such as BLEU, chrF++ (primary), and COMET which
show that our models are comparable, if not better, than publicly available open and commercial
systems. To summarize, our contributions comprehensively cover all three axes for translation
systems, namely models, data, and benchmarks. We will open-source the data, benchmarks, and
model artifacts publicly and hope that our work will serve as a foundation as well as a guide for
further advancements in translation systems for Indic as well as low-resource languages.

9 Limitations and Future Work
Our work has several significant positive outcomes, including the release of the first open-source
model that is competitive with commercial models and supports all 22 scheduled Indian languages.
However, some limitations open up avenues for future research across each of the following axes:
Data, Models, Benchmark, Evaluation, and Deployment. Data. One of the foremost challenges is the
scarcity of high-quality human-annotated data for mid-resource or low-resource languages, making it
difficult to develop robust models on these languages. Furthermore, the limited availability of
content in these languages on the web prevents the use of mining-based approaches to overcome
data scarcity effectively. As a result, our IndicTrans2 models demonstrate limited generalization
capabilities for languages such as Manipuri (Meitei), Santali, and Sindhi (Devanagari). Another
important concern is the limited effectiveness of existing sentence embedding models when applied
to Indic languages, which can lead to noisy and suboptimal pairs. To address these challenges, it is
crucial to calibrate sentence embedding models using human-annotated data to improve their
correlation with human annotations. Moreover, expanding the language coverage of these sentence
embedding models to encompass all 22 scheduled languages will be pivotal in facilitating mining
efforts for mid-resource or low-resource languages. Modeling. Our current work serves as an initial
effort to develop IndicTrans2 models supporting 22 scheduled Indic languages, including low-
resource ones. Although consistently outperforming baseline systems, a performance gap exists
between low-resource and high-resource languages (as shown in Section 7.1). To bridge this gap, we
need to explore effective methods to leverage language relatedness for cross-lingual transfer and
improve generalization in low-resource settings. Furthermore, while our IndicTrans2 models released
with this work prioritize general-purpose use cases, it is equally important to investigate sparse
parameter-efficient approaches for effective domain adaptation while also preserving the model’s
general-purpose utility. In addition, our current IndicTrans2 supports translations across 22
scheduled Indic languages, encompassing multiple scripts that cater to a vast majority of Indian
speakers. However, numerous Indic languages remain unincorporated, and exploring techniques to
extend the current models without catastrophic forgetting is an important research direction.
Benchmark. Accurate evaluation of translation models requires original test sets that encompass a
wide range of linguistic phenomena and translation challenges. The currently released test sets are constructed n-way with English as the original language, which is a common approach for including numerous languages. This implies that when we evaluate Indic to English translation on
benchmarks like FLORES-200 or IN22, our source is translationese instead of original. Prior research
has emphasized the importance of utilizing source-original test sets to get a fair evaluation of
translation performance (Zhang & Toral, 2019; Federmann et al., 2022). Moreover, the development
of an Indic original benchmark would provide an additional aspect for assessing whether the
subtleties of Indic language original sentences are accurately captured in English translations.
Therefore, we are currently working towards creating Indic-original benchmarks to facilitate the fair
evaluation of Indic-En translations. Soon, we intend to release Indic-original to English translation benchmarks for all 22 scheduled Indic languages. Evaluation. Evaluation of translation models is
critical for understanding their strengths and weaknesses and guiding further improvements. This
evaluation typically involves two main approaches: human evaluation and automatic evaluation. Our
current work includes a preliminary human evaluation study on a sample of 100 sentences from our
IN22-Gen benchmark for En-Indic translations. However, future efforts should focus on conducting a
broad and largescale human evaluation study that focuses on the free-form evaluation and task-
oriented contexts to understand the potential biases and shortcomings of our IndicTrans2 models
and assess their feasibility in practical use-case scenarios, thereby identifying areas for improvement.
Additionally, developing better automatic evaluation metrics, particularly suited for Indic languages,
is vital for achieving a more comprehensive and quantitative assessment of translation quality and
facilitating model improvements. Current model-based metrics may not fully support certain
languages, emphasizing the need to explore effective ways to calibrate them for Indic languages and
improve the correlation with human judgments.

Fairness. Our IndicTrans2 models are trained on extensive data collected from
the web, which may introduce social biases. To ensure broader and safer accessibility, it is crucial to
thoroughly identify and address these biases. Prior works demonstrate that distilled models can
further propagate or amplify biases from the teacher model (Ahn et al., 2022; Gupta et al., 2022;
Dhar et al., 2021), underscoring the importance of conducting a comprehensive study and
developing alignment methods to mitigate such biases.

10 Author Contributions

This project is a
large team effort, with immense contributions from all the people involved. To list down the
contributions of the authors, we document the areas and list the authors contributing significantly to
each of these areas. In each area, the contributors are listed sorted by last name. The lead authors, Jay Gala and Pranjal A. Chitale, have contributed across multiple areas and co-ordinated many activities.

Parallel Corpus Collection and Mining: Raghavan AK, Jay Gala, and Aswanth Kumar.
Human Translation: Pranjal A. Chitale, Jay Gala, Mitesh M. Khapra, Pratyush Kumar, Anoop Kunchukuttan, Janki Nawale, and Anupama Sujatha.
Model Training: Pranjal A. Chitale, Raj Dabre, Jay Gala, and Varun Gumma.
Distillation: Pranjal A. Chitale, Raj Dabre, Jay Gala, and Varun Gumma.
Model Evaluation: Pranjal A. Chitale, Raj Dabre, Sumanth Doddapaneni, Jay Gala, Varun Gumma, Anoop Kunchukuttan, and Ratish Puduppully.
Research Leads: Raj Dabre, Mitesh M. Khapra, Pratyush Kumar, and Anoop Kunchukuttan.
Project Conceptualization and Direction: Mitesh M. Khapra, Pratyush Kumar, Anoop Kunchukuttan, and Vivek Raghavan.

Acknowledgements

Embarking on this mission
was only possible due to the support of numerous organizations, individuals and members of the
Indian language technology ecosystem. We would like to take a few sentences to thank all of them.
Sponsors/Donors: First and foremost, we thank the Ministry of Electronics and Information
Technology (MeitY), Government of India, for setting up the ambitious Digital India Bhashini Mission
with the goal of advancing Indian language technology. The human infrastructure, comprising a large team of translators, reviewers and language experts who worked on this project, was
supported by the generous grant given by Digital India Bhashini Mission to IIT Madras to serve as the
Data Management Unit for the mission. We are indebted to Shri Nandan Nilekani and Shrimati Rohini
Nilekani for believing in us and supporting our work through generous grants from EkStep
Foundation and Nilekani Philanthropies. These grants were used for (i) supporting many of the
students, research associates, and developers who worked on this project, (ii) fulfilling many of our
compute needs, and (iii) recruiting project managers to oversee the massive pan-India data collection
activity undertaken as a part of this work. We thank Microsoft for their grant to support the creation
of benchmarks for Indian languages. We thank the Centre for Development and Advancement of
Computing, Pune (CDAC Pune) for access to its Param Siddhi super-computer which was used for
mining bitext pairs at scale. IIT Madras: We thank Prof. V Kamakoti (Director, IIT Madras), Prof.
Mahesh V Panchagnula (Dean, IIT Madras), Prof. Ravindra Gettu (Dean, IIT Madras) and Prof. Manu
Santhanam (Dean, IIT Madras) for their constant encouragement and administrative support. In particular, we are thankful for
the office space provided to AI4Bharat which houses some of our students, researchers, language
experts and administrative team. Indian language technology community: We extend our heartfelt
gratitude to the expansive Indian language technology community, comprising academia, startups,
and the deep tech industry, both within India and across the globe. It is with immense gratitude that
we acknowledge the incredible foundation laid by the giants of this community, whose pioneering
work has paved the way for our endeavors. We are truly grateful for the knowledge, insights, and
advancements that we have built upon, as we stand on the shoulders of these remarkable
contributors. In particular, we thank Prof. Rajeev Sangal (Professor Emeritus, IIIT Hyderabad), Prof.
Pushpak Bhattacharyya (IIT Bombay), Prof. Dipti Mishra (IIIT Hyderabad), Prof. Hema Murthy (IIT
Madras), Prof. Umesh S (IIT Madras), Prof. Rajat Moona (IIT Gandhinagar), Prof. Ganesh
Ramakrishnan (IIT Bombay), Partha Talukdar (Google Research India), Dr. Swaran Lata (MeitY), Dr.
Sobha L (AU-KBC) and Dr. Ritesh Kumar (Dr. B.R. Ambedkar University) for their critical insights and
constructive feedback in improving the translation guidelines used for creating the datasets released
as a part of this work (we apologize if we have missed anyone). Research organisations: We thank
Google for open-sourcing the LaBSE embeddings which we used extensively for mining and filtering
bitext pairs. We thank Meta for open-sourcing their semantic search infrastructure, FAISS, which we
use for indexing and mining bitext pairs. We thank Allen-AI for reproducing the work of NLLB and
releasing a large mined parallel corpus for Indian languages. Language Experts: We express our
deepest gratitude to our exceptional and highly dedicated team of language experts, including
translators and reviewers, whose invaluable contributions have been instrumental in the creation of
the seed data and benchmark data. Their unwavering commitment to adhering to guidelines and
their remarkable ability to work seamlessly as a cohesive unit, despite being geographically
dispersed, is truly commendable. The quality and accuracy of the manual datasets developed as part
of this endeavor owes much to their unwavering efforts. We extend our heartfelt thanks to every
member of our remarkable language team for their outstanding dedication and invaluable
contributions. Administration Team: We are profoundly thankful to the remarkable individuals,
Krishnan Karunganni S and Ravishankar Venkateswaran, for their exceptional dedication, patience,
and extraordinary leadership in managing such an expansive team of talented translators. Their
unwavering commitment to orchestrating and guiding this diverse group of language experts is truly
commendable. Through their exceptional organizational skills and expertise, they ensured seamless
coordination and maintained the highest standards of quality throughout the translation process. We
also thank our support staff Shanthi S, Bhanumathy M, Bhavana R, Suganya Kumaresan, and
Kalaivanan A, who helped with recruitment and procurement. Development Team: We also thank
our development team, comprising our in-house engineers as well as engineers from Tarento, for
building Shoonya which enabled all the manual translation work. In the absence of Shoonya, it would
have been impossible to manage such a diverse team spread across the country working towards a
common goal. We thank members of our development team for their patience in working with the
language experts and building features that helped improve both the speed and quality of
translation. Partners: We would also like to thank our start-up partners, viz., Desicrew, Devanagari,
Language Services Bureau and Keypoint Technologies, who helped in meeting some of our manual
translation goals. Reviewers: We would like to thank Dr. Benjamin Marie (4i) for reviewing the
modeling and evaluation sections of our paper and helping us gain confidence in the credibility of
our evaluation process. NICT: Raj Dabre would like to thank Dr. Eiichiro Sumita and Dr. Masao
Utiyama of ASTREC at NICT, for the freedom and encouragement to collaborate with AI4Bharat. Last,
but not the least, we thank the Almighty for giving us the courage to embark on this mission! The
Team Behind the Scenes This work was possible because the efforts put in by all the remarkable
individuals listed below. 39 Published in Transactions on Machine Learning Research (12/2023)
Administrative Team • Krishnan Karunganni S (Chief of Operations and Delivery, AI4Bharat) •
Ravishankar Venkateswaran (Delivery Head, AI4Bharat) • Shanthi S (Operations, AI4Bharat) •
Bhanumathy M (Recruitment, AI4Bharat) • Suganya Kumaresan (Recruitment, AI4Bharat) • Bhavana
R R (ex-Recruitment, AI4Bharat) Shoonya Team Developers Designation Tarento Aravinth Bheemaraj
Project Manager Alpana Majhi Frontend Lead Ganavi Kumaraswamy Frontend Developer Mrigank
Shekhar Shringi Frontend Developer Dheeraj Gujral Backend Developer Jagadeesh Lachchanagiri
Backend Developer Umme Nusrath Backend Developer AI4Bharat Ishvinder Sethi Backend Lead and
Lead Coordinator Aparna A Lead Coordinator and Manager Abhigyan Raman DevOps Lead Gokul NC
Tech Lead Kaushal Bhogale Backend Developer and Architect Chetan Gudagamanal Frontend
Developer Kunal Tiwary Backend Developer Interns Aditya Mitra Full-Stack Developer Anirudh
Prabhakar Full-Stack Developer Atharva Naphde Full-Stack Developer Aviral Goel Full-Stack
Developer Pranav Agarwal Full-Stack Developer Rugved Somwanshi Full-Stack Developer Aavaig
Malhotra Frontend Developer Ayush Panwar Frontend Developer Rajat Maheshwari Frontend
Developer Yogesh Bhat Frontend Developer Akshat Sharma Backend Developer Anuran Roy Backend
Developer Debraj Bhal Backend Developer Nishant Nayak Backend Developer Prakhar Rathi Backend
Developer Saish Mendke Backend Developer

Translation Team

Language Language Experts Designation Assamese Devanga
Pallav Saikia Language Lead, Senior Translator Bikash Chandra Senior Project Manager Bishnu Prasad
Barman Translator Dimpi Sarma Translator Bonya Baruah Translator Bikash Chetia Translator
Kangkana Deka Translator Lelina Barman Translator Bengali Sounak Dutta Language Lead, Senior
Translator Shambhobi Ghosh Senior Translator Srija Mukherjee Translator Shreerupa Chattopadhyay
Translator Natasha Ahmed Translator Kathakali Bhoumik Das Translator Atrayee Dutta Translator
Bodo Prafulla Basumatry Language Lead, Senior Translator Bihung Brahma Senior Translator Bikash
Chandra Senior Project Manager Sidwma Brahma Translator Sansuma Brahma Translator Jeetumoni
Basumatry Translator Ria Borah Sonowal Translator Dogri Preeti Dubey Senior Project Manager Lalit
Mangotra Senior Translator Veena Gupta Senior Translator Shashi Pathania Senior Translator Anju
Bala Translator Monika chandel Translator Kulbhushan Jasrotia Translator Gujarati Pranav Pandya
Language Lead, Translator Jayesh Adhyaru Translator Naresh Kapadia Senior Translator Faiz Masi
Translator Jimal Patel Translator Hindi Jaya Sarawati Language Lead, Senior Translator Sufiya Pathan
Senior Translator Deepika Agarwal Senior Translator Aakansha Dubey Translator Rakshi Ghai
Translator Neha Bhakal Translator Ayesha Pereira Translator Veda Bharti Translator Kannada Anagha
H. N. Language Lead, Senior Translator Adithi Raveendranath Translator Abhigna Joshi Translator
Shivakumar R. M. Translator Arun Kumar Translator Goutham M Translator T.R. Nagesh Translator Kashmiri Vijay Wali
Senior Translator Shafi Shauq Senior Translator Ambreen Farooq Translator Meer Bismah Translator
Syed Samreen Translator Sumaya Jehangir Translator Nazima Mehdi Senior Project Manager Ishfaq
Nisar Translator Konkani Pradeep Padgaonkar Senior Translator Pradnya Bhagat Senior Project
Manager Sandesh Prabhudesai Senior Translator Sharat Raikar Senior Translator Anwesha Singbal
Translator Cia Fernandes Translator Ashwini Kamat Translator Maithili Sanjay Jha Language Lead,
Translator Avinash Kumar Senior Project Manager Yogendra Pathak Senior Translator Dr.
Chandramani Jha Senior Translator Vikas Vineet Jha Translator Priyeshi Kumari Translator Rahul
Kumar Jha Translator Vijay Deo Jha Translator Manoj Kumar Pathak Translator Tulika Swati Translator
Prashant Kumar Jha Translator Nandan Kumar Translator Kishore Keshav Translator Sanjeev Kumar
Jha Translator Deepak Kumar Translator Juli Jha Translator Swati Jha Translator Aditya Bhushan
Mishra Translator Malayalam Jebi Mariam Kurian Language Lead, Translator Manoj Varma Senior
Translator C. V. Sudheendran Senior Translator Jihad M. Translator Jiza Mariam Kurian Translator Ann
Mary Thomas Translator Srilekha Padmakuma Nambiar Translator Marathi Kunal Gandhi Language
Lead,Translator Paresh Prabhu Senior Translator Vrinda Sarkar Senior Translator Ranjana Pathak
Senior Translator Saee Kodolikar Senior Translator Prasad Jog Translator Shweta Deshmukh Translator
Bhushan Oke Translator Neha
Satish Bandekar Translator Radhika Deshpande Translator Manipuri Reena Ashem Language Lead,
Senior Translator Yasin Khan Senior Project Manager Chingtham Diana Devi Senior Translator Diana
Thingujam Translator Jahir Hussain Translator Sanju Pukhrambam Translator Alfina Khaidem
Translator Kshetrimayum Momo Translator Padmabati Achom Translator Nepali Sunita Dahal
Language Lead, Senior Translator Bikash Chandra Senior Project Manager Dhaka Ram Kafle Senior
Translator Lekhnath Chhetri Senior Translator Tika Ram Rai Senior Translator D. Ghimiray Translator
Dr Srijana Sharma Translator Dr Khagen Sharma Translator Odia Satyabrata Barik Senior Translator
Pramodini Pradhan Senior Translator Sai Sudeep Das Translator Abhishek Parija Translator
Suchishraba Sarangi Language Lead Bhimasena Bhol Translator Surendra Chandra Tripathy Translator
Punjabi Armin Virk Language Lead, Translator Pallavi Kaushal Translator Shallu Rani Translator
Parneet Kaur Translator Sanskrit Harisha H. M. Language Lead, Senior Translator Dr. Suresha Senior
Translator Suprith S. Translator Sailaja Nittala Translator Vasudev Aital Translator Vivaswini Translator
Dr. Narayan Dutt Mishra Senior Translator Santali Kamala Murmu Senior Project Manager Baren Kisku
Senior Translator Prasanta Kumar Hansda Senior Translator Baburam Murmu Senior Translator Sripati
Tudu Senior Translator Urmila Murmu Translator Raju Mardi Translator Churki Hansda Translator
Promila Hansda Translator Sova Tudu Translator Sanjiban Murmu Translator Satya Hembram
Translator Guna Hembram
Translator Sagen Murmu Translator Sindhi Armin Virk Language Lead, Translator Dr. Nalini Senior
Translator Prakash Tejwani Translator Bharati Chainani Translator Karan Vanni Translator Tamil Shakir
Azeem Language Lead, Senior Translator Leema Rajavarman Senior Translator Shivapriya Murali
Translator Sharmila Grahadurai Translator V Sayeelakshmi Rajaganapathy Translator Telugu Shakir
Azeem Language Lead, Senior Translator Karuna Vempati Senior Translator N. Sujatha Senior
Translator Srimoukthika Translator Srilakshmi B. Translator Urdu Dr. Irfan Ahmed Senior Translator
Nazima Mehdi Senior Project Manager Aishwarya Diwakar Translator Anwar Wajhiuddin Translator
Muhammad Anzar Translator Hasan Akram Translator Dr. Javaid Aziz Bhat Translator Hafsah Faquih
Translator Habeebunnisa Translator Mohammad Afaan Translator Naziya Rasool Translator
Published in Transactions on Machine Learning Research (12/2023) References Eneko Agirre, Carmen
Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce
Wiebe. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 497–
511, San Diego, California, June 2016. Association for Computational Linguistics. doi:
10.18653/v1/S16-1081. URL https:// aclanthology.org/S16-1081. Roee Aharoni, Melvin Johnson, and
Orhan Firat. Massively multilingual neural machine translation. In Jill Burstein, Christy Doran, and
Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019,
Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 3874–3884.
Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1388. URL
https://doi.org/10.18653/v1/n19-1388. Jaimeen Ahn, Hwaran Lee, Jinhwa Kim, and Alice Oh. Why
knowledge distillation amplifies gender bias and how to mitigate from the perspective of DistilBERT.
In Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pp.
266–272, Seattle, Washington, July 2022. Association for Computational Linguistics. doi:
10.18653/v1/2022.gebnlp-1.27. URL https://aclanthology.org/2022.gebnlp-1.27. Maruan Al-Shedivat
and Ankur Parikh. Consistency by agreement in zero-shot neural machine translation. In Proceedings
of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1184–1197,
Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi:
10.18653/v1/N19-1121. URL https://aclanthology.org/N19- 1121. Ikram ALi. Urduhack library.
https://github.com/urduhack/urduhack, 2019. Ananthakrishnan, Pushpak Bhattacharyya, M.
Sasikumar, and Ritesh M. Shah. Some issues in automatic evaluation of english-hindi mt : More blues
for bleu. 2006. Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and
Wolfgang Macherey. The missing ingredient in zero-shot neural machine translation. CoRR,
abs/1903.07091, 2019. URL http://arxiv.org/abs/ 1903.07091. Mikel Artetxe and Holger Schwenk.
Margin-based parallel corpus mining with multilingual sentence embeddings. In Proceedings of the
57th Annual Meeting of the Association for Computational Linguistics, pp. 3197–3203, Florence, Italy,
July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1309. URL https:
//aclanthology.org/P19-1309. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural
machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann LeCun
(eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA,
May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/ abs/1409.0473. Marta
Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L.
Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere,
Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins,
and Jaume Zaragoza. ParaCrawl: Webscale acquisition of parallel corpora. In Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, pp. 4555–4567, Online, July 2020.
Association for Computational Linguistics. doi: 10.18653/v1/2020. acl-main.417. URL
https://aclanthology.org/2020.acl-main.417. Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat,
Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey,
Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao, Mia Xu Chen, Klaus Macherey, Maxim Krikun,
Pidong Wang, Alexander Gutkin, Apurva Shah, Yanping Huang, Zhifeng Chen, Yonghui Wu, and
Macduff Hughes. Building machine translation systems for the next thousand languages. CoRR,
abs/2205.03983, 2022. doi: 10.48550/arXiv.2205.03983. URL https://doi.org/10.48550/arXiv.
2205.03983. 45 Published in Transactions on Machine Learning Research (12/2023) Loïc Barrault,
Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow,
Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt
Post, and Marcos Zampieri. Findings of the 2019 conference on machine translation (WMT19). In
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1),
pp. 1–61, Florence, Italy, August 2019. Association for Computational Linguistics. doi:
10.18653/v1/W19-5301. URL https://aclanthology.org/ W19-5301. Loïc Barrault, Magdalena
Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman
Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo,
Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal,
Matt Post, and Marcos Zampieri. Findings of the 2020 conference on machine translation (WMT20).
In Proceedings of the Fifth Conference on Machine Translation, pp. 1–55, Online, November 2020.
Association for Computational Linguistics. URL https://aclanthology.org/2020.wmt-1.1. Nicola
Bertoldi, Madalina Barbaiani, Marcello Federico, and Roldano Cattoni. Phrase-based statistical
machine translation with pivot languages. In Proceedings of the 5th International Workshop on
Spoken Language Translation: Papers, pp. 143–149, Waikiki, Hawaii, October 20-21 2008. URL
https://aclanthology.org/2008.iwsltpapers.1. Ondřej Bojar, Christian Buck, Christian Federmann,
Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve
Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. Findings of the 2014 workshop on
statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine
Translation, pp. 12–58, Baltimore, Maryland, USA, June 2014. Association for Computational
Linguistics. doi: 10.3115/v1/W14-3302. URL https://aclanthology.org/W14-3302. Tom Brown,
Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner,
Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot
learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural
Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL
https://proceedings.neurips.cc/paper_files/paper/ 2020/file/1457c0d6bfcb4967418bfb8ac142f64a-
Paper.pdf. Isaac Caswell, Ciprian Chelba, and David Grangier. Tagged back-translation. In Proceedings
of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pp. 53–63, Florence,
Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-5206. URL
https://aclanthology.org/W19-5206. Yun Chen, Yang Liu, Yong Cheng, and Victor O.K. Li. A teacher-
student framework for zero-resource neural machine translation. In Proceedings of the 55th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1925–1935,
Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1176.
URL https://aclanthology.org/P17-1176. Narayan Choudhary and Girish Nath Jha. Creating
multilingual parallel corpora in indian languages. In Zygmunt Vetulani and Joseph Mariani (eds.),
Human Language Technology Challenges for Computer Science and Linguistics - 5th Language and
Technology Conference, LTC 2011, Poznań, Poland, November 25-27, 2011, Revised Selected Papers,
volume 8387 of Lecture Notes in Computer Science, pp. 527–537. Springer, 2011. doi: 10.1007/978-
3-319- 08958-4\_43. URL https://doi.org/10.1007/978-3-319-08958-4_43. Seamless Communication,
Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, PaulAmbroise Duquenne,
Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel
Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi
Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip
Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi,
Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin,
Mohamed Ramadan, Abinesh
Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang,
Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia
Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri,
Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang,
and Skyler Wang. Seamlessm4t-massively multilingual & multimodal machine translation. arXiv
preprint arXiv: 2308.11596, 2023. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav
Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and
Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the
58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Online, July
2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. URL
https://aclanthology.org/2020.acl-main.747. Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha
Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard,
Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia
Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe,
Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela
Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko,
Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. No language left behind:
Scaling humancentered machine translation, 2022. Raj Dabre, Chenhui Chu, and Anoop
Kunchukuttan. A survey of multilingual neural machine translation. ACM Comput. Surv., 53(5):99:1–
99:38, 2021. doi: 10.1145/3406095. URL https://doi.org/10.1145/3406095. Raj Dabre, Himani
Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh Khapra, and Pratyush Kumar. Indicbart: A
pre-trained model for indic natural language generation. In Smaranda Muresan, Preslav Nakov, and
Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, Dublin,
Ireland, May 22- 27, 2022, pp. 1849–1863. Association for Computational Linguistics, 2022. doi:
10.18653/v1/2022.findings-acl.145. URL https://doi.org/10.18653/v1/2022.findings-acl.145. Peter T
Daniels and William Bright. The world’s writing systems. Oxford University Press on Demand, 1996.
Prithviraj Dhar, Joshua Gleason, Aniket Basu Roy, Carlos Domingo Castillo, P. Jonathon Phillips, and
Ramalingam Chellappa. Distill and de-bias: Mitigating bias in face recognition using knowledge
distillation. ArXiv, abs/2112.09786, 2021. URL https://api.semanticscholar.org/CorpusID:245334459.
Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal, Mitesh M. Khapra, Anoop
Kunchukuttan, and Pratyush Kumar. Towards leaving no Indic language behind: Building monolingual
corpora, benchmark and models for Indic languages. In Proceedings of the 61st Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12402–12426, Toronto,
Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.693.
URL https://aclanthology.org/2023.acl-long.693. Abteen Ebrahimi, Manuel Mager, Arturo Oncevay,
Vishrav Chaudhary, Luis Chiruzzo, Angela Fan, John Ortega, Ricardo Ramos, Annette Rios, Ivan
Vladimir Meza Ruiz, Gustavo Giménez-Lugo, Elisabeth Mager, Graham Neubig, Alexis Palmer, Rolando
Coto-Solano, Thang Vu, and Katharina Kann. AmericasNLI: Evaluating zero-shot natural language
understanding of pretrained multilingual models in truly low-resource languages. In Proceedings of
the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pp. 6279–6299, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:
10.18653/v1/2022.acl-long.435. URL https://aclanthology.org/2022.acl-long.435. Sergey Edunov,
Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. In Proceedings
of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 489–500,
Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi:
10.18653/v1/D18-1045. URL https://aclanthology.org/D18-1045. Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, and
Philipp Koehn. CCAligned: A massive collection of cross-lingual web-document pairs. In Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5960–
5969, Online, November 2020. Association for Computational Linguistics. doi:
10.18653/v1/2020.emnlp-main.480. URL https://aclanthology.org/2020.emnlp-main.480. Murray B.
Emeneau. India as a lingustic area. Language, 32:3, 1956. Angela Fan, Shruti Bhosale, Holger
Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume
Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard
Grave, Michael Auli, and Armand Joulin. Beyond english-centric multilingual machine translation,
2020. Christian Federmann, Tom Kocmi, and Ying Xin. NTREX-128 – news test references for MT
evaluation of 128 languages. In Proceedings of the First Workshop on Scaling Up Multilingual
Evaluation, pp. 21–24, Online, November 2022. Association for Computational Linguistics. URL
https://aclanthology.org/2022.sumeval-1.4. Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen
Arivazhagan, and Wei Wang. Language-agnostic BERT sentence embedding. In Proceedings of the
60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.
878–891, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/
v1/2022.acl-long.62. URL https://aclanthology.org/2022.acl-long.62. Orhan Firat, Baskaran Sankaran,
Yaser Al-onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. Zero-resource translation with multi-
lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in
Natural Language Processing, pp. 268–277, Austin, Texas, November 2016. Association for
Computational Linguistics. doi: 10.18653/v1/D16-1026. URL https://aclanthology.org/D16-1026. Jack
FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash,
Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter
Leeuwis, Gokhan Tur, and Prem Natarajan. Massive: A 1m-example multilingual natural language
understanding dataset with 51 typologically-diverse languages. arXiv preprint arXiv: Arxiv-
2204.08582, 2022. Markus Freitag and Orhan Firat. Complete multilingual neural machine
translation. In Proceedings of the Fifth Conference on Machine Translation, pp. 550–560, Online,
November 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.wmt-
1.66. Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie,
and Ondřej Bojar. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based
human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine
Translation, pp. 733–774, Online, November 2021. Association for Computational Linguistics. URL
https://aclanthology.org/2021.wmt-1.73. Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig
Stewart, Eleftherios Avramidis, Tom Kocmi, George Foster, Alon Lavie, and André F. T. Martins. Results
of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust. In
Proceedings of the Seventh Conference on Machine Translation (WMT), pp. 46–68, Abu Dhabi,
United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. URL
https://aclanthology.org/2022.wmt-1.2. Timnit Gebru, Jamie Morgenstern, Briana Vecchione,
Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III au2, and Kate Crawford. Datasheets for
datasets, 2021. A. Gispert and José B. Mariño. Catalan-english statistical machine translation without
parallel corpus : Bridging through spanish. Proceedings of The Language Resources and Evaluation
Conference (LREC), 2006. Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume
Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. The
Flores-101 evaluation benchmark for low-resource and multilingual machine translation.
Transactions of the Association for Computational Linguistics, 10:522–538, 2022. doi:
10.1162/tacl_a_00474. URL https://aclanthology.org/2022.tacl-1.30. Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin
Zobel. Continuous measurement scales in human evaluation of machine translation. In Proceedings
of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp. 33–41, Sofia,
Bulgaria, August 2013. Association for Computational Linguistics. URL https://aclanthology.org/W13-
2305. Varun Gumma, Raj Dabre, and Pratyush Kumar. An empirical study of leveraging knowledge
distillation for compressing multilingual neural machine translation models. In Proceedings of the
24th Annual Conference of the European Association for Machine Translation, pp. 103–114,
Tampere, Finland, June 2023. European Association for Machine Translation. URL
https://aclanthology.org/2023.eamt-1.11. Umang Gupta, Jwala Dhamala, Varun Kumar, Apurv Verma,
Yada Pruksachatkun, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Greg Ver Steeg, and Aram
Galstyan. Mitigating gender bias in distilled language models via counterfactual role reversal. In
Findings of the Association for Computational Linguistics: ACL 2022, pp. 658–678, Dublin, Ireland,
May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.55. URL
https://aclanthology.org/2022.findings-acl.55. Barry Haddow and Faheem Kirefu. PMIndia – A
Collection of Parallel Corpora of Languages of India. arXiv e-prints, art. arXiv:2001.09907, Jan 2020.
Kevin Heffernan, Onur Çelebi, and Holger Schwenk. Bitext mining using distilled sentence
representations for lowresource languages. In Findings of the Association for Computational
Linguistics: EMNLP 2022, pp. 2101–2112, Abu Dhabi, United Arab Emirates, December 2022.
Association for Computational Linguistics. doi: 10.18653/v1/ 2022.findings-emnlp.154. URL
https://aclanthology.org/2022.findings-emnlp.154. Dan Hendrycks and Kevin Gimpel. Bridging
nonlinearities and stochastic regularizers with gaussian error linear units. ArXiv, abs/1606.08415,
2016. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network,
2015. Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. Iterative back-
translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine
Translation and Generation, pp. 18– 24, Melbourne, Australia, July 2018. Association for
Computational Linguistics. doi: 10.18653/v1/W18-2703. URL https://aclanthology.org/W18-2703.
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson.
XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization.
CoRR, abs/2003.11080, 2020. URL https://arxiv.org/abs/2003.11080. Girish Nath Jha. The TDIL
program and the Indian langauge corpora intitiative (ILCI). In Proceedings of the Seventh
International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, May
2010. European Language Resources Association (ELRA). URL http://www.lrec-
conf.org/proceedings/lrec2010/pdf/ 874_Paper.pdf. Melvin Johnson, Mike Schuster, Quoc V. Le,
Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg
Corrado, Macduff Hughes, and Jeffrey Dean. Google’s multilingual neural machine translation system:
Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–
351, 2017. doi: 10.1162/tacl_a_00065. URL https://aclanthology.org/Q17-1024. Pratik Joshi, Sebastin
Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity
and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pp. 6282–6293, Online, July 2020. Association for Computational
Linguistics. doi: 10.18653/v1/2020.acl-main.560. URL https://aclanthology.org/2020.acl-main.560.
Herve Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011. doi:
10.1109/TPAMI.2010.57.
Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M.
Khapra, and Pratyush Kumar. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-
trained multilingual language models for Indian languages. In Findings of the Association for
Computational Linguistics: EMNLP 2020, pp. 4948–4961, Online, November 2020. Association for
Computational Linguistics. doi: 10.18653/v1/2020.findingsemnlp.445. URL
https://aclanthology.org/2020.findings-emnlp.445. Huda Khayrallah and Philipp Koehn. On the
impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop
on Neural Machine Translation and Generation, pp. 74–83, Melbourne, Australia, July 2018.
Association for Computational Linguistics. doi: 10.18653/v1/W18-2709. URL https://aclanthology.
org/W18-2709. Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp.
1317–1327, Austin, Texas, November 2016. Association for Computational Linguistics. doi:
10.18653/v1/D16-1139. URL https://aclanthology.org/D16-1139. Young Jin Kim, Ammar Ahmad
Awan, Alexandre Muzio, Andrés Felipe Cruz-Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari,
Yuxiong He, and Hany Hassan Awadalla. Scalable and efficient moe training for multitask multilingual
models. CoRR, abs/2109.10465, 2021. URL https://arxiv.org/abs/2109.10465. Yunsu Kim, Petre
Petrov, Pavel Petrushkov, Shahram Khadivi, and Hermann Ney. Pivot-based transfer learning for
neural machine translation between non-English languages. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP), pp. 866–876, Hong Kong, China, November 2019.
Association for Computational Linguistics. doi: 10.18653/v1/D19-1080. URL
https://aclanthology.org/D19-1080. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic
optimization. International Conference On Learning Representations, 2014. Tom Kocmi, Christian
Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul
Menezes. To ship or not to ship: An extensive evaluation of automatic metrics for machine
translation. In Proceedings of the Sixth Conference on Machine Translation, pp. 478–494, Online,
November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.wmt-
1.57. Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of
the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 388–395, Barcelona,
Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-3250.
Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of
Machine Translation Summit X: Papers, pp. 79–86, Phuket, Thailand, September 13-15 2005. URL
https://aclanthology.org/ 2005.mtsummit-papers.11. Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan
Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov,
Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara
Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Javier Ortiz Suárez, Iroro Orife,
Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller,
Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov,
Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny,
Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Balli, Stella
Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime,
Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, and
Mofetoluwa Adeyemi. Quality at a glance: An audit of web-crawled multilingual datasets. Trans.
Assoc. Comput. Linguistics, 10:50–72, 2022. doi: 10.1162/tacl\_a\_00447. URL
https://doi.org/10.1162/tacl_a_00447. Taku Kudo and John Richardson. SentencePiece: A simple and
language independent subword tokenizer and detokenizer for neural text processing. In Proceedings
of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71, Brussels,
Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-2012.
URL https://aclanthology.org/D18-2012. Sneha Kudugunta, Ankur Bapna, Isaac Caswell, and Orhan
Firat. Investigating multilingual NMT representations at scale. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP), pp. 1565–1575, Hong Kong, China, November 2019.
Association for Computational Linguistics. doi: 10.18653/v1/D19-1167. URL https:
//aclanthology.org/D19-1167. Anoop Kunchukuttan. The IndicNLP Library.
https://github.com/anoopkunchukuttan/indic_nlp_ library/blob/master/docs/indicnlp.pdf, 2020.
Anoop Kunchukuttan and Pushpak Bhattacharyya. Utilizing language relatedness to improve machine
translation: A case study on languages of the indian subcontinent, 2020. Anoop Kunchukuttan, Pratik
Mehta, and Pushpak Bhattacharyya. The IIT Bombay English-Hindi parallel corpus. In Proceedings of
the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki,
Japan, May 2018. European Language Resources Association (ELRA). URL https://aclanthology.
org/L18-1548. Gurpreet Singh Lehal and Tejinder Singh Saini. Sangam: A Perso-Arabic to Indic script
machine transliteration model. In Proceedings of the 11th International Conference on Natural
Language Processing, pp. 232–239, Goa, India, December 2014. NLP Association of India. URL
https://aclanthology.org/W14-5135. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen,
Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant
models with conditional computation and automatic sharding. In 9th International Conference on
Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
URL https://openreview.net/forum?id=qrwe7XHTmYb. Daniel Licht, Cynthia Gao, Janice Lam,
Francisco Guzman, Mona Diab, and Philipp Koehn. Consistent human evaluation of machine
translation across language pairs. In Proceedings of the 15th biennial conference of the Association
for Machine Translation in the Americas (Volume 1: Research Track), pp. 309–321, Orlando, USA,
September 2022. Association for Machine Translation in the Americas. URL
https://aclanthology.org/2022.amtaresearch.24. Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey
Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training
for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–
742, 2020a. doi: 10.1162/tacl_a_00343. URL https://aclanthology.org/ 2020.tacl-1.47. Yinhan Liu,
Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke
Zettlemoyer. Multilingual denoising pre-training for neural machine translation. Trans. Assoc.
Comput. Linguistics, 8: 726–742, 2020b. doi: 10.1162/tacl\_a\_00343. URL
https://doi.org/10.1162/tacl_a_00343. Shuming Ma, Jian Yang, Haoyang Huang, Zewen Chi, Li Dong,
Dongdong Zhang, Hany Hassan Awadalla, Alexandre Muzio, Akiko Eriguchi, Saksham Singhal, Xia
Song, Arul Menezes, and Furu Wei. Xlm-t: Scaling up multilingual machine translation with pretrained
cross-lingual transformer encoders, 2020. Anand Kumar Madasamy, Asha Hegde, Shubhanker
Banerjee, Bharathi Raja Chakravarthi, Ruba Priyadharshini, Hosahalli Shashirekha, and John McCrae.
Overview of the shared task on machine translation in Dravidian languages. In Proceedings of the
Second Workshop on Speech and Language Technologies for Dravidian Languages, pp. 271–278,
Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:
10.18653/v1/2022.dravidianlangtech1.41. URL https://aclanthology.org/2022.dravidianlangtech-
1.41. Yash Madhani, Sushane Parthan, Priyanka Bedekar, Gokul Nc, Ruchi Khapra, Anoop
Kunchukuttan, Pratyush Kumar, and Mitesh Khapra. Aksharantar: Open Indic-language transliteration
datasets and models for the next billion users. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association
for Computational Linguistics: EMNLP 2023, pp. 40–57, Singapore, December 2023. Association for
Computational Linguistics. doi: 10.18653/ v1/2023.findings-emnlp.4. URL
https://aclanthology.org/2023.findings-emnlp.4. Jean Maillard, Cynthia Gao, Elahe Kalbassi, Kaushik
Ram Sadagopan, Vedanuj Goswami, Philipp Koehn, Angela Fan, and Francisco Guzman. Small data,
big impact: Leveraging minimal data for effective machine translation. In Proceedings of the 61st
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2740–
2756, Toronto, Canada, July 2023. Association for Computational Linguistics. doi:
10.18653/v1/2023.acllong.154. URL https://aclanthology.org/2023.acl-long.154. Vukosi Marivate,
Tshephisho Sefara, Vongani Chabalala, Keamogetswe Makhaya, Tumisho Mokgonyane, Rethabile
Mokoena, and Abiodun Modupe. Investigating an approach for low resource language dataset
creation, curation and classification: Setswana and sepedi. In Proceedings of the first workshop on
Resources for African Indigenous Languages, pp. 15–20, Marseille, France, May 2020. European
Language Resources Association (ELRA). ISBN 979-10-95546-60-3. URL
https://aclanthology.org/2020.rail-1.3. Kaushal Kumar Maurya, Rahul Kejriwal, Maunendra Sankar
Desarkar, and Anoop Kunchukuttan. Utilizing lexical similarity to enable zero-shot machine
translation for extremely low-resource languages, 2023. Margaret Mitchell, Simone Wu, Andrew
Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and
Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness,
Accountability, and Transparency, FAT* ’19, pp. 220–229, New York, NY, USA, 2019. Association for
Computing Machinery. ISBN 9781450361255. doi: 10.1145/3287560.3287596. URL
https://doi.org/10. 1145/3287560.3287596. Nikita Moghe, Tom Sherborne, Mark Steedman, and
Alexandra Birch. Extrinsic evaluation of machine translation metrics. CoRR, abs/2212.10297, 2022.
doi: 10.48550/arXiv.2212.10297. URL https://doi.org/10.48550/ arXiv.2212.10297. Tasnim
Mohiuddin, Philipp Koehn, Vishrav Chaudhary, James Cross, Shruti Bhosale, and Shafiq Joty. Data
selection curriculum for neural machine translation. In Findings of the Association for Computational
Linguistics: EMNLP 2022, pp. 1569–1582, Abu Dhabi, United Arab Emirates, December 2022.
Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.113. URL
https://aclanthology.org/2022.findingsemnlp.113. Vinod Nair and Geoffrey E. Hinton. Rectified linear
units improve restricted boltzmann machines. In International Conference on Machine Learning,
2010. Toshiaki Nakazawa, Katsuhito Sudoh, Shohei Higashiyama, Chenchen Ding, Raj Dabre, Hideya
Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, and Sadao Kurohashi. Overview of the 5th
workshop on Asian translation. In Proceedings of the 32nd Pacific Asia Conference on Language,
Information and Computation: 5th Workshop on Asian Translation: 5th Workshop on Asian
Translation, Hong Kong, 1–3 December 2018. Association for Computational Linguistics. URL
https://aclanthology.org/Y18-3001. Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre,
Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida,
Ondřej Bojar, and Sadao Kurohashi. Overview of the 7th workshop on Asian translation. In
Proceedings of the 7th Workshop on Asian Translation, pp. 1–44, Suzhou, China, December 2020.
Association for Computational Linguistics. URL https://aclanthology.org/2020.wat-1.1. Toshiaki
Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao
Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi,
Kaori Abe, Yusuke Oda, and Sadao Kurohashi. Overview of the 8th workshop on Asian translation. In
Proceedings of the 8th Workshop on Asian Translation (WAT2021), pp. 1–45, Online, August 2021a.
Association for Computational Linguistics. doi: 10.18653/v1/2021.wat-1.1. URL
https://aclanthology.org/2021.wat-1.1. Toshiaki Nakazawa, Hideki Nakayama, Isao Goto, Hideya Mino, Chenchen Ding, Raj Dabre,
Anoop Kunchukuttan, Shohei Higashiyama, Hiroshi Manabe, Win Pa Pa, Shantipriya Parida, Ondřej
Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Katsuhito Sudoh, Sadao Kurohashi, and
Pushpak Bhattacharyya (eds.). Proceedings of the 8th Workshop on Asian Translation (WAT2021),
Online, August 2021b. Association for Computational Linguistics. URL
https://aclanthology.org/2021.wat-1.0. Toshiaki Nakazawa, Hideya Mino, Isao Goto, Raj Dabre,
Shohei Higashiyama, Shantipriya Parida, Anoop Kunchukuttan, Makoto Morishita, Ondřej Bojar,
Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, and Sadao Kurohashi. Overview of the 9th
workshop on Asian translation. In Proceedings of the 9th Workshop on Asian Translation, pp. 1–36,
Gyeongju, Republic of Korea, October 2022. International Conference on Computational Linguistics.
URL https://aclanthology.org/2022.wat-1.1. OpenAI. Gpt-4 technical report. ARXIV.ORG, 2023. doi:
10.48550/arXiv.2303.08774. Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan
Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In
Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics (Demonstrations), pp. 48–53, Minneapolis, Minnesota, June 2019.
Association for Computational Linguistics. doi: 10.18653/v1/N19-4009. URL https:
//aclanthology.org/N19-4009. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright,
Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob
Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano,
Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,
2022. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic
evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics, pp. 311– 318, Philadelphia, Pennsylvania, USA, July 2002. Association for
Computational Linguistics. doi: 10.3115/1073083. 1073135. URL https://aclanthology.org/P02-1040.
Jerin Philip, Shashank Siripragada, Vinay P. Namboodiri, and C. V. Jawahar. Revisiting low resource
status of indian languages in machine translation. In Jayant R. Haritsa, Shourya Roy, Manish Gupta,
Sharad Mehrotra, Balaji Vasan Srinivasan, and Yogesh Simmhan (eds.), CODS-COMAD 2021: 8th ACM
IKDD CODS and 26th COMAD, Virtual Event, Bangalore, India, January 2-4, 2021, pp. 178–187. ACM,
2021. doi: 10.1145/3430984.3431026. URL https://doi.org/10.1145/3430984.3431026. Maja
Popović. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth
Workshop on Statistical Machine Translation, pp. 392–395, Lisbon, Portugal, September 2015.
Association for Computational Linguistics. doi: 10.18653/v1/W15-3049. URL
https://aclanthology.org/W15-3049. Maja Popović. chrF++: words helping character n-grams. In
Proceedings of the Second Conference on Machine Translation, pp. 612–618, Copenhagen, Denmark,
September 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4770. URL
https://aclanthology.org/W17-4770. Matt Post. A call for clarity in reporting BLEU scores. In
Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191,
Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-
6319. URL https://aclanthology.org/W18-6319. Ratish Puduppully and Mirella Lapata. Data-to-text
generation with macro planning. Transactions of the Association for Computational Linguistics,
9:510–527, 2021. doi: 10.1162/tacl_a_00381. URL https://aclanthology.org/ 2021.tacl-1.31. Ratish
Puduppully, Yao Fu, and Mirella Lapata. Data-to-text generation with variational sequential planning.
Transactions of the Association for Computational Linguistics, 10:697–715, 2022. doi:
10.1162/tacl_a_00484. URL https://aclanthology.org/2022.tacl-1.40. Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson.
Data cards: Purposeful and transparent dataset documentation for responsible ai, 2022. Pranav
Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100, 000+ questions for machine
comprehension of text. In Jian Su, Xavier Carreras, and Kevin Duh (eds.), Proceedings of the 2016
Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA,
November 1-4, 2016, pp. 2383–2392. The Association for Computational Linguistics, 2016. doi:
10.18653/v1/d16-1264. URL https://doi.org/10.18653/v1/d16-1264. Loganathan Ramasamy, Ondřej
Bojar, and Zdeněk Žabokrtský. Morphological processing for English-Tamil statistical machine
translation. In Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages,
pp. 113–122, Mumbai, India, December 2012. The COLING 2012 Organizing Committee. URL
https://aclanthology.org/W12-5611. Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj,
Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J,
Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan,
Anoop Kunchukuttan, Pratyush Kumar, and Mitesh Shantadevi Khapra. Samanantar: The largest
publicly available parallel corpora collection for 11 Indic languages. Transactions of the Association
for Computational Linguistics, 10:145–162, 2022. doi: 10.1162/tacl_a_00452. URL https:
//aclanthology.org/2022.tacl-1.9. Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. COMET: A
neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pp. 2685– 2702, Online, November 2020. Association for
Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.213. URL
https://aclanthology.org/2020.emnlp-main.213. Ricardo Rei, José G. C. de Souza, Duarte Alves,
Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins.
COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh
Conference on Machine Translation (WMT), pp. 578–585, Abu Dhabi, United Arab Emirates (Hybrid),
December 2022. Association for Computational Linguistics. URL https://aclanthology. org/2022.wmt-
1.52. Philip Resnik and Noah A. Smith. The web as a parallel corpus. Comput. Linguistics, 29(3):349–
380, 2003. doi: 10.1162/089120103322711578. URL https://doi.org/10.1162/089120103322711578.
Hammam Riza, Michael Purwoadi, Gunarso, Teduh Uliniansyah, Aw Ai Ti, Sharifah Mahani Aljunied,
Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thai, Vichet Chea, Rapid Sun, Sethserey Sam, Sopheap
Seng, Khin Mar Soe, Khin Thandar Nwet, Masao Utiyama, and Chenchen Ding. Introduction of the
asian language treebank. In 2016 Conference of The Oriental Chapter of International Committee for
Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA),
pp. 1–6, 2016. doi: 10.1109/ICSDA.2016.7918974. Holger Schwenk, Vishrav Chaudhary, Shuo Sun,
Hongyu Gong, and Francisco Guzmán. Wikimatrix: Mining 135m parallel sentences in 1620 language
pairs from wikipedia. In Paola Merlo, Jörg Tiedemann, and Reut Tsarfaty (eds.), Proceedings of the
16th Conference of the European Chapter of the Association for Computational Linguistics: Main
Volume, EACL 2021, Online, April 19 - 23, 2021, pp. 1351–1361. Association for Computational
Linguistics, 2021a. doi: 10.18653/v1/2021.eacl-main.115. URL
https://doi.org/10.18653/v1/2021.eacl-main.115. Holger Schwenk, Guillaume Wenzek, Sergey
Edunov, Edouard Grave, Armand Joulin, and Angela Fan. CCMatrix: Mining billions of high-quality
parallel sentences on the web. In Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), pp. 6490–6500, Online, August 2021b. Association for
Computational Linguistics. doi: 10.18653/v1/2021.acl-long.507. URL
https://aclanthology.org/2021.acl-long.507. Thibault Sellam, Dipanjan Das, and Ankur Parikh.
BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, pp. 7881–7892, Online, July 2020. Association for
Computational Linguistics. doi: 10.18653/v1/2020.acl-main.704. URL https:
//aclanthology.org/2020.acl-main.704. Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation
models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pp. 86–96, Berlin, Germany, August 2016a.
Association for Computational Linguistics. doi: 10.18653/v1/ P16-1009. URL
https://aclanthology.org/P16-1009. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural
machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, Berlin,
Germany, August 2016b. Association for Computational Linguistics. doi: 10.18653/v1/P16- 1162. URL
https://aclanthology.org/P16-1162. Aditya Siddhant, Ankur Bapna, Orhan Firat, Yuan Cao, Mia Xu
Chen, Isaac Caswell, and Xavier Garcia. Towards the next 1000 languages in multilingual machine
translation: Exploring the synergy between supervised and selfsupervised learning. arXiv preprint
arXiv: 2201.03110, 2022. Shashank Siripragada, Jerin Philip, Vinay P. Namboodiri, and C V Jawahar. A
multilingual parallel corpora collection effort for Indian languages. In Proceedings of the Twelfth
Language Resources and Evaluation Conference, pp. 3743–3751, Marseille, France, May 2020.
European Language Resources Association. ISBN 979-10-95546-34-4. URL
https://aclanthology.org/2020.lrec-1.462. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya
Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from
overfitting. J. Mach. Learn. Res., 15(1):1929–1958, jan 2014. ISSN 1532-4435. Karumuri Venkata
Subbarao. South asian languages : a syntactic typology. 2012. Christian Szegedy, Vincent Vanhoucke,
Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer
vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826,
2016. doi: 10.1109/CVPR.2016.308. Xu Tan, Jiale Chen, Di He, Yingce Xia, Tao Qin, and Tie-Yan Liu.
Multilingual neural machine translation with language clustering. In Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 963–973, Hong Kong, China,
November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1089. URL
https://aclanthology.org/D19-1089. Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal,
Vishrav Chaudhary, Jiatao Gu, and Angela Fan. Multilingual translation from denoising pre-training. In
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3450–3466, Online,
August 2021. Association for Computational Linguistics. doi: 10. 18653/v1/2021.findings-acl.304. URL
https://aclanthology.org/2021.findings-acl.304. Brian Thompson and Matt Post. Paraphrase
generation as zero-shot multilingual translation: Disentangling semantic similarity from lexical and
syntactic diversity. In Proceedings of the Fifth Conference on Machine Translation, pp. 561–570,
Online, November 2020. Association for Computational Linguistics. URL https://aclanthology.
org/2020.wmt-1.67. Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In Nicoletta
Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani,
Jan Odijk, and Stelios Piperidis (eds.), Proceedings of the Eighth International Conference on
Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23- 25, 2012, pp. 2214–2218.
European Language Resources Association (ELRA), 2012. URL
http://www.lrecconf.org/proceedings/lrec2012/summaries/463.html. Masao Utiyama and Hitoshi
Isahara. A comparison of pivot methods for phrase-based statistical machine translation. In Human
Language Technologies 2007: The Conference of the North American Chapter of the Association for
Computational Linguistics; Proceedings of the Main Conference, pp. 484–491, Rochester, New York,
April 2007. Association for Computational Linguistics. URL https://aclanthology.org/N07-1061. Ashish Vaswani, Noam Shazeer,
Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.
Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob
Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing
Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9,
2017, Long Beach, CA, USA, pp. 5998–6008, 2017. URL
https://proceedings.neurips.cc/paper/2017/hash/ 3f5ee243547dee91fbd053c1c4a845aa-
Abstract.html. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R.
Bowman. GLUE: A multitask benchmark and analysis platform for natural language understanding. In
Tal Linzen, Grzegorz Chrupala, and Afra Alishahi (eds.), Proceedings of the Workshop: Analyzing and
Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1,
2018, pp. 353–355. Association for Computational Linguistics, 2018. doi: 10.18653/v1/w18-5446.
URL https://doi.org/10.18653/v1/w18-5446. Alex Wang, Yada Pruksachatkun, Nikita Nangia,
Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier
benchmark for general-purpose language understanding systems. In Hanna M. Wallach, Hugo
Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (eds.),
Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information
Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 3261–
3275, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/
4496bf24afe7fab6f046bf4923da8de6-Abstract.html. Guillaume Wenzek, Marie-Anne Lachaux, Alexis
Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet:
Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth
Language Resources and Evaluation Conference, pp. 4003–4012, Marseille, France, May 2020.
European Language Resources Association. ISBN 979-10-95546-34-4. URL
https://aclanthology.org/2020.lrec1.494. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin
Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization
in the transformer architecture. In Proceedings of the 37th International Conference on Machine
Learning, ICML’20. JMLR.org, 2020. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-
Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-
to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, pp. 483–498, Online, June
2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL
https://aclanthology.org/2021.naacl-main.41. Mike Zhang and Antonio Toral. The effect of
translationese in machine translation test sets. In Proceedings of the Fourth Conference on Machine
Translation (Volume 1: Research Papers), pp. 73–81, Florence, Italy, August 2019. Association for
Computational Linguistics. doi: 10.18653/v1/W19-5208. URL https://aclanthology.org/ W19-5208.
Michal Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. The united nations parallel corpus
v1.0. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente
Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, and Stelios Piperidis (eds.),
Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016,
Portorož, Slovenia, May 23-28, 2016. European Language Resources Association (ELRA), 2016. URL
http://www.lrec-conf.org/proceedings/lrec2016/summaries/1195.html.

A Data Contribution and Coverage

[Figure 7: three bar charts (Seed Data, Mined Data, and Backtranslation Data) showing bitext pairs (in K) per language, each split into existing and newly added data; the backtranslation panel distinguishes Indic BT from English BT.]
Figure 7: Overview of our training data contributions across different axes: Seed, Mined, and Backtranslation. Indic BT indicates bitext pairs with the English side as synthetic and the Indic side as original, whereas English BT indicates vice versa.

[Figure 8: pie charts of domain proportions; IN22-Gen covers 13 domains (culture, economy, education, entertainment, geography, governments, health, industry, legal, news, religion, sports, tourism) and IN22-Conv covers 16 domains (arts, banking, college_life, culture, daily_dialogue, entertainment, geography, government, healthcare, history, hobbies, insurance, legal, school_life, sports, tourism).]
Figure 8: Overview of the domain coverage of our newly created IN22-Gen (left) and IN22-Conv (right) benchmarks.

B Additional Results

B.1 Zero-Shot Translation Capabilities of IndicTrans2 Through Cross-Lingual Transfer

Zero-shot translation (Johnson et al., 2017) is a
challenging task, but it is becoming increasingly feasible with the development of more powerful MT
models. Zero-shot translation refers to the ability of an MT model to translate from a source
language to a target language, even if it has never seen any training data for the language pair
before. This is primarily attributed to cross-lingual transfer learning that involves knowledge transfer
from one language to another. There are several benefits to good zero-shot performance. First, it
indicates that the MT model has good generalization capabilities, which means that the model is able
to learn the underlying structure of languages rather than simply memorizing specific translation
pairs. Second, it suggests that the MT model can learn language representations shared across
different languages. In addition, this makes it easier to extend the model to new languages, even
with limited data.

Table 25: chrF++ scores of our IT2 in the zero-shot setting in the Indic-En direction on Indic languages on the FLORES-200 evaluation set. The best-performing system is bolded, and ∆ represents the difference between the zero-shot score of IT2 and the score of the SOTA model. IT2 results are presented based on decoding with the Maithili language tag.

language    N1.2   N54    IT2    ∆
awa_Deva    63.2   65.4   62.4   -3.0
bho_Deva    57.3   58.5   53.6   -4.9
hne_Deva    70.6   72.2   62.1   -10.1
mag_Deva    70.3   72.0   67.4   -4.6

In this study, we investigate the cross-lingual transfer and
generalizability of our IndicTrans2 models. Our focus lies on performing zero-shot evaluation on a set
of additional low-resource Indic languages, which are supported by the NLLB (Costa-jussà et al.,
2022) models (1.2B distilled and 54B MoE) and are included as part of the FLORES-200 (Costa-jussà
et al., 2022) evaluation set. Specifically, we restrict our evaluations to the Indic-En model, because the structure and syntax of these low-resource languages are unseen by the model as translation targets and would therefore result in off-target translations. However, in the case of the Indic-En direction, such an
analysis is feasible since the target language, English, is supported by the model. We consider
languages like Awadhi, Bhojpuri, Chhattisgarhi, and Magahi that are written in the Devanagari script,
which is the prominent script supported by our models. We also have test sets available in FLORES-200 for evaluation. We employ a top-down approach based on language similarity to facilitate zero-shot decoding. Specifically, we select the top-3 related languages that are closest to the
aforementioned languages under consideration. Using this approach, we identify Hindi, Maithili, and
Nepali as the three closest languages and leverage their language codes for zero-shot decoding of
the new Indic languages. We follow the same generation and evaluation procedure mentioned in
Section 6.4 and Section 6.5. We observe that decoding with the language tag of Maithili yields the
best performance on the test set across all four languages, followed by Hindi and Nepali. This finding
highlights that Maithili is closer to these languages in the embedding space than Hindi or Nepali.
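To make the tag-substitution procedure concrete, a minimal sketch is given below: sentences of an unsupported language are simply routed through the tag of a closely related supported language before decoding. The "<src_tag> <tgt_tag> <text>" input format and the helper names are illustrative assumptions, not the exact IndicTrans2 preprocessing pipeline.

# Illustrative sketch: zero-shot decoding of an unsupported language (e.g. Awadhi)
# by borrowing the language tag of a closely related supported language.
# The "<src_tag> <tgt_tag> <text>" format is assumed for illustration only;
# the actual IndicTrans2 preprocessing may differ.
CANDIDATE_TAGS = ["mai_Deva", "hin_Deva", "npi_Deva"]  # closest supported languages
TARGET_TAG = "eng_Latn"

def tag_input(sentence: str, proxy_src_tag: str, tgt_tag: str = TARGET_TAG) -> str:
    """Prefix a sentence of the unsupported language with a proxy source tag."""
    return f"{proxy_src_tag} {tgt_tag} {sentence}"

awadhi_sentence = "<Awadhi sentence in Devanagari>"  # placeholder input
for tag in CANDIDATE_TAGS:
    # Each tagged variant is decoded separately; the tag giving the best dev-set
    # chrF++ (Maithili in our experiments) is then used for the full test set.
    print(tag_input(awadhi_sentence, tag))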
Table 25 shows that, except for Chhattisgarhi, our IndicTrans2 model trails the NLLB 54B MoE model, which explicitly uses sentence pairs of these languages in training, by around 4 points on average. Overall, our IndicTrans2 model shows promising results
in zero-shot performance on low-resource languages, highlighting the potential for extending to new
languages with limited data in the future.

B.2 Translation Capabilities of Zero-Shot Prompted LLMs

Large language models (LLMs) such as GPT (Brown et al., 2020; OpenAI, 2023) have recently shown
impressive zero-shot performance on various tasks. In this work, we compare the zero-shot
translation capabilities of GPT3.5 (as described in Section 6.1) with our best IndicTrans2 model. The
prompt template “Translate the following sentence into {{lang}}\n {{text}}” was used for evaluation.
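The sketch below shows one plausible way to issue such zero-shot requests with this prompt template using the OpenAI Python client; the client version, temperature, and other request settings behind the reported numbers are not specified here, so this is an illustrative setup rather than the exact evaluation script.

# Illustrative sketch of zero-shot prompting gpt-3.5-turbo with the template above.
# Requires the openai>=1.0 client and an OPENAI_API_KEY in the environment;
# decoding settings are left at their defaults and may differ from those used
# for the reported scores.
from openai import OpenAI

client = OpenAI()

def gpt_translate(text: str, lang: str, model: str = "gpt-3.5-turbo") -> str:
    prompt = f"Translate the following sentence into {lang}\n {text}"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(gpt_translate("The weather is pleasant today.", "Hindi"))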
Table 26 demonstrates that our IndicTrans2 models outperform GPT3.5 by a significant margin on
both the IN22-Gen and IN22-Conv sets in both En-Indic and Indic-En directions.

Table 26: chrF++ scores of GPT3.5 (gpt-3.5-turbo) on the IN22-Gen (left) and IN22-Conv (right) evaluation sets in the En-Indic and Indic-En directions. Avg. means the average score of all the top-13 languages. ∆ represents the difference between the scores of IT2 and GPT3.5; a positive ∆ indicates IT2 is better than GPT3.5 and vice versa.

              IN22-Gen                              IN22-Conv
              En-Indic          Indic-En            En-Indic          Indic-En
language      GPT3.5 IT2   ∆    GPT3.5 IT2   ∆      GPT3.5 IT2   ∆    GPT3.5 IT2   ∆
asm_Beng      25.9  47.1  21.2  46.9  65.8  18.9    27.2  46.8  19.6  43.6  62.9  19.3
ben_Beng      39.9  51.8  11.9  52.1  63.2  11.1    39.9  49.7   9.8  52.9  58.4   5.5
guj_Gujr      35.6  53.5  17.9  51.7  66.5  14.8    36.0  53.1  17.1  50.9  62.0  11.1
hin_Deva      47.1  56.7   9.6  57.7  65.4   7.7    46.0  49.6   3.6  57.0  60.1   3.1
kan_Knda      34.5  51.0  16.5  51.7  64.2  12.5    27.9  33.8   5.9  42.1  47.5   5.4
mal_Mlym      31.6  50.9  19.3  47.8  64.5  16.7    30.4  45.7  15.3  44.0  54.3  10.3
mar_Deva      33.9  51.0  17.1  50.3  63.7  13.4    34.0  48.6  14.6  47.6  58.5  10.9
npi_Deva      37.2  49.0  11.8  54.2  67.7  13.5    38.3  51.5  13.2  52.0  63.0  11.0
ory_Orya      27.8  43.9  16.1  48.0  66.2  18.2    25.6  40.2  14.6  45.2  60.3  15.1
pan_Guru      36.2  50.6  14.4  51.7  63.4  11.7    40.6  57.8  17.2  53.3  62.7   9.4
tam_Taml      34.0  49.5  15.5  41.3  59.8  18.5    29.7  39.1   9.4  38.0  45.8   7.8
tel_Telu      34.3  52.4  18.1  46.5  64.8  18.3    32.1  45.5  13.4  42.4  52.9  10.5
urd_Arab      47.6  68.2  20.6  58.8  73.0  14.2    49.0  61.6  12.6  57.1  65.5   8.4
Avg.          35.8  52.0  16.2  50.7  65.2  14.6    35.1  47.9  12.8  48.2  58.0   9.8

However, it is important to note that this gap is comparatively lower on the IN22-Conv set, likely because GPT3.5 was fine-tuned
towards fluency in conversational and interactive contexts. In addition, the average ∆ across both the
IN22-Gen and IN22-Conv sets is lower for high-resource languages such as Hindi (+5.4 for Indic-En and +6.6 for En-Indic, i.e., the means of the per-set ∆ values in Table 26) than for low-resource languages such as Assamese (+19.1 for Indic-En and +20.4 for En-Indic). Overall, our IndicTrans2 models outperform GPT3.5 by an average of 12.2 points and 14.5 points in the Indic-En and En-Indic directions, respectively, on our IN22 benchmark. Even though
LLMs show promising zero-shot capabilities in multilingual settings, we observe that these still lag
behind the task-specific models, particularly for low-resource languages. Exploring how richer
translations can be extracted from LLMs is an open problem and can be a worthy future study.
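All comparisons in this appendix are corpus-level chrF++ scores. A minimal sketch of computing such scores with the sacrebleu library is shown below, assuming sentence-aligned hypothesis and reference lists and sacrebleu's default chrF parameters with word_order=2; the exact configuration behind the reported numbers may differ.

# Illustrative sketch: corpus-level chrF++ with sacrebleu (word_order=2 turns
# chrF into chrF++). Default char_order and beta are assumed.
import sacrebleu

hypotheses = ["This is a system translation."]       # one string per test sentence
references = [["This is a reference translation."]]  # one inner list per reference set

chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)
print(f"chrF++ = {chrf.score:.1f}")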
B.3 Comparison with SeamlessM4T Multimodal Translation Model

SeamlessM4T (Communication et al.,
2023) is a recently released multimodal translation model supporting 16 Indic languages. In the
interest of the community, we report preliminary results of this model on our primary benchmarks,
such as FLORES-200, IN22-Gen and IN22-Conv in Tables 27 and 28. We use the SeamlessM4T-Large
and SeamlessM4T-Large v2 variants, which are both 2.3B-parameter models and are the best models released as part of that work.

B.4 Results on NTREX

NTREX (Federmann et al., 2022) is a news-
domain benchmark that expands coverage of languages of test data from WMT 2019 (Barrault et al.,
2019) to 128 languages. Out of these, 13 are scheduled Indic languages. The detailed results are
reported in Tables 29 to 31.

B.5 Results on WAT2020 & WAT2021

WAT (Nakazawa et al., 2020; 2021a)
included support for translations for 8 Indic languages in the news domain. In addition, they released
Hindi-English data in the IT and WikiNews domains. WAT 2021 (Nakazawa et al., 2021a) created
a benchmark for translation between 10 Indic languages and English. The detailed results are
reported in Tables 32 to 37.

Table 27: chrF++ scores of SM4T (SeamlessM4T-Large), SM4Tv2 (SeamlessM4T-Large v2) and IT2 on the FLORES-200 evaluation set in the En-Indic and Indic-En directions. Avg. means the average score of all the supported languages.

              En-Indic                Indic-En
language      SM4T   SM4Tv2  IT2     SM4T   SM4Tv2  IT2
asm_Beng      41.4   38.7    43.3    56.4   55.9    56.9
ben_Beng      52.0   50.3    54.3    61.9   60.3    62.4
guj_Gujr      53.3   51.9    56.0    66.2   65.0    67.0
hin_Deva      58.3   57.5    59.6    66.0   62.7    67.5
kan_Knda      54.2   52.4    56.1    60.4   59.1    61.5
mai_Deva      46.8   43.1    50.5    66.7   66.1    69.5
mal_Mlym      53.0   50.3    57.3    63.2   60.9    64.3
mar_Deva      48.8   46.3    51.3    63.0   62.1    64.3
mni_Beng      38.2   39.0    38.2    50.9   50.0    52.9
npi_Deva      52.8   50.7    57.2    66.3   65.5    68.1
ory_Orya      49.9   46.0    49.2    63.2   62.7    64.9
pan_Guru      52.6   50.6    53.5    65.4   64.2    66.4
snd_Arab      51.8   49.8    44.9    64.3   61.0    65.1
tam_Taml      54.8   52.6    57.2    59.4   57.7    61.3
tel_Telu      56.7   54.6    59.4    65.1   62.8    66.1
urd_Arab      50.1   49.4    52.2    62.0   59.9    62.0
Avg.          50.9   49.0    52.5    62.5   61.0    63.8

Table 28: chrF++ scores of SM4T (SeamlessM4T-Large), SM4Tv2 (SeamlessM4T-Large v2) and IT2 on the IN22-Gen (left) and IN22-Conv (right) evaluation sets in the En-Indic and Indic-En directions. Avg. means the average score of all the supported languages.

              IN22-Gen                                    IN22-Conv
              En-Indic             Indic-En               En-Indic             Indic-En
language      SM4T  SM4Tv2  IT2    SM4T  SM4Tv2  IT2      SM4T  SM4Tv2  IT2    SM4T  SM4Tv2  IT2
asm_Beng      43.8  40.6    47.1   63.8  62.4    65.8     45.5  43.9    46.8   60.5  60.6    62.9
ben_Beng      47.9  46.2    51.8   61.7  58.7    63.2     47.7  46.8    49.7   57.8  57.2    58.4
guj_Gujr      49.1  47.5    53.5   64.9  62.5    66.5     49.7  48.9    53.1   61.7  60.8    62.0
hin_Deva      53.5  52.8    56.7   62.6  59.8    65.4     47.1  47.1    49.6   59.5  58.3    60.1
kan_Knda      47.5  46.4    51.0   62.7  59.9    64.2     32.3  32.0    33.8   46.3  45.2    47.5
mai_Deva      45.4  41.9    48.7   63.2  61.4    64.8     42.8  41.6    44.3   56.5  55.6    57.8
mal_Mlym      46.9  45.1    50.9   61.6  57.9    64.5     42.0  41.3    45.7   53.4  51.4    54.3
mar_Deva      45.3  43.3    51.0   61.6  60.1    63.7     46.0  44.5    48.6   57.6  57.1    58.5
npi_Deva      46.8  44.2    49.0   66.0  64.8    67.7     47.7  46.5    51.5   61.1  61.2    63.0
ory_Orya      45.2  40.9    43.9   64.2  62.6    66.2     42.7  41.0    40.2   60.2  60.0    60.3
pan_Guru      49.7  48.0    50.6   61.1  59.4    63.4     56.5  55.2    57.8   61.6  61.2    62.7
tam_Taml      47.5  45.9    49.5   57.9  54.8    59.8     37.4  37.0    39.1   46.2  45.4    45.8
tel_Telu      49.1  47.2    52.4   62.6  58.9    64.8     39.8  39.3    45.5   52.8  51.6    52.9
urd_Arab      62.7  60.5    68.2   69.4  66.2    73.0     55.0  54.2    61.6   62.8  62.7    65.5
Avg.          48.6  46.5    51.7   63.1  60.7    65.2     45.2  44.2    47.7   57.0  56.3    58.0

B.6 Results on WMT & UFAL

WMT has created benchmarks for
selected Indic languages as part of shared tasks in 2014 (Hindi) (Bojar et al., 2014), 2019 (Gujarati)
(Barrault et al., 2019) and 2020 (Tamil) (Barrault et al., 2020).
Table 29: chrF++ scores of all the systems on the NTREX
(Federmann et al., 2022) Evaluation set in the En-Indic and Indic-En direction. The best-performing
system is bolded, while underlined results indicate significant performance difference where IT2
outperforms the system. Avg means the average score of all the languages that system X supports. ∆
represents the difference between the average scores of IT2 and the average scores of system X for
the subset of languages that both X and IT2 support. A positive value for ∆ indicates IT2 is better than
X and vice-versa. En-Indic Indic-En Language IT1 M100 N1.2 IT2 Goog Az IT1 M100 N1.2 IT2 Goog Az
ben_Beng 48.4 45.8 50.8 54.0 53.5 52.0 55.9 53.8 60.4 62.9 63.3 59.9 guj_Gujr 44.4 19.5 47.8 49.6
49.3 49.7 57.5 10.9 63.7 66.8 66.7 61.9 hin_Deva 50.0 48.0 51.6 53.3 53.7 52.1 57.4 55.9 61.5 63.7
63.6 59.7 kan_Knda 49.2 14.3 51.2 54.1 54.0 54.1 52.6 12.0 57.9 61.2 61.3 57.3 mal_Mlym 43.4 32.6
41.7 48.6 48.0 47.0 51.9 47.3 56.7 59.6 60.0 56.5 mar_Deva 40.6 36.5 43.5 47.0 46.4 44.5 54.0 48.3
59.7 62.7 63.0 57.5 npi_Deva - 14.2 41.7 45.0 44.7 41.5 - 37.4 62.2 64.4 65.5 59.8 pan_Guru 47.5
27.7 49.1 50.3 51.6 50.3 56.7 43.0 61.8 64.9 65.0 60.4 snd_Arab - 25.1 39.7 43.3 42.1 41.1 - 17.8 55.8
58.2 58.5 52.1 tam_Taml 41.8 14.8 43.7 45.9 45.4 45.4 49.4 29.5 54.5 57.0 57.2 53.4 tel_Telu 42.0 -
43.9 46.7 46.8 43.8 48.7 - 53.1 55.6 55.8 52.2 urd_Arab - 41.7 51.4 53.7 53.1 52.9 - 48.2 60.6 62.5
63.0 59.6 Avg. 45.3 29.1 46.3 49.3 49.1 47.9 53.8 36.7 59.0 61.6 61.9 57.6 ∆ 4.6 20.4 3.0 - 0.2 1.4 7.7
25.5 2.6 - -0.3 4.0 Table 30: COMET scores of all the systems on the NTREX (Federmann et al., 2022)
Evaluation set in the En-Indic and Indic-En direction. The best performing system is bolded, while
underlined results indicate significant performance difference where IT2 outperforms the system. En-
Indic Indic-En Language IT1 M100 N1.2 IT2 Goog Az IT1 M100 N1.2 IT2 Goog Az ben_Beng 85.3 82.6
86.1 86.4 85.2 86.7 86.5 85.8 88.4 88.9 89.3 88.0 guj_Gujr 86.8 61.6 86.7 87.9 87.1 87.5 86.3 36.4
88.8 89.6 89.7 87.6 hin_Deva 77.6 74.3 77.9 78.7 77.9 78.7 86.2 85.6 88.0 88.4 88.5 86.9 kan_Knda
84.1 51.6 84.5 85.6 84.2 85.8 84.5 35.9 86.7 87.7 87.7 86.2 mal_Mlym 85.7 75.4 86.3 87.5 86.5 87.4
85.5 81.5 87.6 88.3 88.7 87.3 mar_Deva 71.8 65.8 73.9 74.6 73.1 74.5 85.4 80.4 87.6 88.4 88.5 86.5
npi_Deva - 51.8 79.1 80.6 79.8 79.9 - 68.4 89.1 89.4 90.1 87.8 pan_Guru 82.4 60.5 83.1 83.0 82.9
83.2 84.3 73.6 86.9 87.6 87.8 85.4 tam_Taml 85.5 53.4 86.0 86.4 86.2 87.3 82.9 62.6 85.2 85.9 86.2
84.2 tel_Telu 83.5 - 83.4 85.1 84.4 85.3 83.6 - 86.2 87.0 87.1 85.3 urd_Arab - 72.7 81.0 82.2 82.3 83.7
- 79.9 87.3 87.8 88.0 86.9 UFAL (Ramasamy et al., 2012) is an English-Tamil bilingual benchmark
created from publicly available websites. The benchmark consists of English sentences from domains
such as cinema, news, and some biblical sources. Detailed results are reported in Tables 38 to 40.
B.7 COMET Scores for IN22 & FLORES
We report COMET (Rei et al., 2022) scores for IN22 and FLORES (Costa-jussà et al., 2022) in Tables 41 to 43.
B.8 BLEU Scores for IN22 & FLORES
We report BLEU (Papineni et al., 2002) scores for IN22 and FLORES (Costa-jussà et al., 2022) in Tables 44 to 46.
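As a concrete reference for how corpus-level BLEU and chrF++ scores of the kind reported in these tables can be computed, here is a hedged sketch using the sacrebleu library; the hypotheses and references below are placeholders, and the exact evaluation options behind the paper's numbers may differ.

```python
# Hedged sketch: corpus BLEU and chrF++ with sacrebleu (placeholder data).
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["This is a sample system translation."]       # system outputs
references = [["This is an example reference translation."]]  # one list per reference set

bleu = BLEU()
chrf_pp = CHRF(word_order=2)  # word_order=2 corresponds to chrF++

print(bleu.corpus_score(hypotheses, references).score)
print(chrf_pp.corpus_score(hypotheses, references).score)
```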
Table 31: BLEU scores of all the
systems on the NTREX (Federmann et al., 2022). Evaluation set in the En-Indic and Indic-En direction.
The best performing system is bolded, while underlined results indicate significant performance
difference where IT2 outperforms the system. En-Indic Indic-En Language IT1 M100 N1.2 IT2 Goog Az
IT1 M100 N1.2 IT2 Goog Az ben_Beng 17.7 15.3 19.8 22.9 23 20.8 30.7 27.7 36.6 40.2 40.5 35.4
guj_Gujr 15.4 3.1 18.7 20.4 20.7 20.5 32.1 0.5 40.5 45.2 44.6 37.4 hin_Deva 26.4 24.3 28.2 30.5 31.2
29 31.1 28.9 37.7 41.3 40.7 34.6 kan_Knda 16.4 1.1 18.5 22 22.7 22.7 26.9 0.5 33.8 38.6 38.6 31.8
mal_Mlym 11 5.4 8.1 14.5 14.6 14 25.6 19.5 31.4 34.8 35.9 28.9 mar_Deva 10.5 8.2 12.2 14.6 15.1
12.7 28 21.8 35.6 40.1 40.1 31.1 npi_Deva - 0.8 11.5 13.7 13.7 10.8 - 10.3 38.9 42.4 43.4 35.1
pan_Guru 22.6 8.5 24.5 25.5 26.8 24.6 31.5 14 38.6 43.3 43.2 35.9 snd_Arab - 6.4 13.7 18.7 16 15 - 2
33.3 37.1 37 28.5 tam_Taml 9 0.8 9.9 11.8 11.9 11.4 23.5 6.1 30.2 33.4 33.6 26.9 tel_Telu 11 - 12.1
15.4 15.6 12 22.8 - 28.7 32.6 32.6 27.1 urd_Arab - 18.3 27.7 30.5 30.1 29.5 - 22 36.5 39.4 39.6 35.1
Avg. 15.6 8.4 17.1 20 20.1 18.6 28 13.9 35.2 39 39.2 32.3 ∆ 4.1 12.1 2.9 - -0.1 1.4 10.8 25.7 3.8 - -0.2
6.7 Table 32: chrF++ scores of all the systems on the WAT-2020 (Nakazawa et al., 2020). Evaluation
set in the En-Indic and Indic-En direction. The best performing system is bolded, while underlined
results indicate significant performance difference where IT2 outperforms the system. Avg means the
average score of all the languages that system X supports. ∆ represents the difference between the
average scores of IT2 and the average scores of system X for the subset of languages that both X and
IT2 support. A positive value for ∆ indicates IT2 is better than X and vice-versa. En-Indic Indic-En
Language IT1 M100 N1.2 IT2 Goog Az IT1 M100 N1.2 IT2 Goog Az ben_Beng 38.9 31.7 36.6 37.9 36.4
37.5 44.0 36.1 42.5 44.2 42.9 43.2 guj_Gujr 42.7 18.2 41.2 41.9 41.2 45.4 48.6 9.2 48.1 49.3 48.0 49.6
hin_Deva 43.3 35.6 41.3 41.8 42.7 41.7 48.1 40.5 47.0 48.9 49.5 49.5 mal_Mlym 38.4 29.8 36.2 38.8
38.2 38.6 44.5 31.9 43.1 45.0 43.5 45.5 mar_Deva 41.6 31.9 39.8 41.0 40.5 40.7 44.9 34.4 44.2 45.8
45.0 44.9 tam_Taml 37.5 15.1 36.4 37.9 36.8 37.9 42.9 19.8 41.7 42.9 41.3 43.0 tel_Telu 37.2 - 36.7
37.7 36.8 38.0 43.0 - 42.2 43.7 42.5 43.8 Avg. 39.9 27.1 38.3 39.6 38.9 40.0 45.1 28.7 44.1 45.7 44.7
45.6 ∆ -0.3 12.8 1.4 - 0.7 -0.4 0.6 17.3 1.6 - 1.0 0.1
Table 33: chrF++ scores of all the systems on the WAT-2021 (Nakazawa et al.,
2021a). Evaluation set in the En-Indic and Indic-En direction. The best performing system is bolded,
while underlined results indicate significant performance difference where IT2 outperforms the
system. Avg means the average score of all the languages that system X supports. ∆ represents the
difference between the average scores of IT2 and the average scores of system X for the subset of
languages that both X and IT2 support. A positive value for ∆ indicates IT2 is better than X and vice-
versa. En-Indic Indic-En Language IT1 M100 N1.2 IT2 Goog Az IT1 M100 N1.2 IT2 Goog Az ben_Beng
45.4 34.7 41.4 42.4 39.1 41.6 53.7 42.5 51.6 52.5 49.9 50.0 guj_Gujr 53.9 21.4 51.8 52.1 48.9 58.2
62.8 8.0 61.2 62.9 59.9 62.1 hin_Deva 60.8 51.0 59.2 59.7 59.3 59.6 66.1 54.2 63.4 65.1 64.9 65.9
kan_Knda 52.5 17.6 50.2 50.9 49.0 51.8 60.0 8.9 58.3 60.3 57.0 55.0 mal_Mlym 49.5 32.7 44.9 49.2
47.5 46.4 58.4 35.0 56.2 58.3 55.2 57.7 mar_Deva 50.4 36.2 47.8 49.0 47.5 48.0 57.1 40.2 55.2 57.1
54.3 55.1 ory_Orya 48.5 7.4 47.5 44.2 40.2 45.4 57.2 13.2 55.7 56.8 52.9 56.0 pan_Guru 56.1 25.6
53.0 54.2 52.6 58.7 65.2 31.9 62.9 64.8 62.2 63.5 tam_Taml 48.8 14.3 46.0 47.5 45.7 47.2 56.6 18.6
54.0 55.6 51.6 54.0 tel_Telu 46.7 - 44.9 45.3 43.0 43.0 59.7 - 56.5 59.6 56.0 58.3 Avg. 51.3 26.8 48.7
49.4 47.3 50.0 59.7 28.1 57.5 59.3 56.4 57.8 ∆ -1.9 23.1 0.7 - 2.1 -0.6 31.2 1.8 - 2.9 1.5 Table 34:
COMET scores of all the systems on the WAT 2020 (Nakazawa et al., 2020). Evaluation set in the En-
Indic and Indic-En direction. The best performing system is bolded, while underlined results indicate
significant performance difference where IT2 outperforms the system. En-Indic Indic-En Language IT1
M100 N1.2 IT2 Goog Az IT1 M100 N1.2 IT2 Goog Az ben_Beng 86.4 82.7 86.1 86.6 85.6 86.5 83.6
78.2 83.6 83.9 83.8 83.5 guj_Gujr 90.2 66.6 89.9 90.4 90.1 90.5 86.4 35.8 86.6 86.8 86.4 86.5
hin_Deva 81.5 77.0 81.3 81.5 81.7 81.3 84.2 76.8 83.8 84.1 84.0 84.0 mal_Mlym 87.9 80.5 88.2 88.1
87.6 88.5 83.9 73.0 84.0 84.6 84.3 84.5 mar_Deva 77.6 69.5 77.1 77.8 77.2 77.7 83.7 72.9 83.9 84.2
84.0 83.9 tam_Taml 89.0 57.1 88.7 89.4 88.8 89.3 82.4 55.7 82.7 82.8 82.5 82.5 tel_Telu 86.3 - 86.1
86.9 86.3 86.9 83.1 - 83.2 83.7 83.4 83.4 Table 35: COMET scores of all the systems on the WAT 2021
(Nakazawa et al., 2021a). Evaluation set in the En-Indic and Indic-En direction. The best performing
system is bolded, while underlined results indicate significant performance difference where IT2
outperforms the system. En-Indic Indic-En Language IT1 M100 N1.2 IT2 Goog Az IT1 M100 N1.2 IT2
Goog Az ben_Beng 88.2 84.4 87.5 87.9 86.6 87.6 86.9 82.6 87.0 87.1 86.7 86.5 guj_Gujr 92.2 70.5
92.0 92.1 91.5 92.6 90.4 35.0 90.6 90.8 90.3 90.5 hin_Deva 86.4 82.3 86.1 86.1 86.1 86.2 90.5 86.2
90.3 90.7 90.5 90.7 kan_Knda 90.2 60.7 89.9 90.1 89.3 90.2 88.2 34.7 88.5 88.7 88.1 87.1 mal_Mlym
90.9 82.8 91.1 90.9 90.3 91.5 88.7 74.6 88.7 89.3 88.5 89.0 mar_Deva 81.2 72.9 80.7 80.9 80.1 80.7
87.8 77.3 87.8 88.1 87.5 87.6 ory_Orya 88.0 41.6 88.1 83.5 83.0 87.7 88.0 39.8 88.3 88.4 87.4 88.0
pan_Guru 89.3 70.3 89.0 88.9 88.9 89.6 90.0 69.7 90.0 90.2 89.7 89.7 tam_Taml 92.1 53.6 91.7 91.9
91.3 91.8 87.1 53.5 87.1 87.4 86.4 86.6 tel_Telu 86.6 - 86.3 86.4 85.8 86.3 88.3 - 87.9 88.7 87.9 88.1
Table 36: BLEU scores of all
the systems on the WAT-2020 (Nakazawa et al., 2020). Evaluation set in the En-Indic and Indic-En
direction. The best performing system is bolded, while underlined results indicate significant
performance difference where IT2 outperforms the system. Avg means the average score of all the
languages that system X supports. ∆ represents the difference between the average scores of IT2 and
the average scores of system X for the subset of languages that both X and IT2 support. A positive
value for ∆ indicates IT2 is better than X and vice-versa. En-Indic Indic-En Language IT1 M100 N1.2
IT2 Goog Az IT1 M100 N1.2 IT2 Goog Az ben_Beng 12 6.1 9.7 9.8 8.5 9.7 19.9 12.7 18.1 19.5 17.5
18.1 guj_Gujr 15.5 2.4 14 14.2 13.5 18.6 24.1 0.3 23 24.2 22.1 24.3 hin_Deva 20.1 12.3 18 18 19.5
17.9 23.6 15.7 22.2 24.2 24.3 24.6 mal_Mlym 7.3 2.9 5.1 6.9 6.5 6.7 20.4 9 18.8 20.5 18.5 20.7
mar_Deva 13.2 6.4 11.5 11.7 11.4 11.6 20.4 11.2 19.3 20.6 19.2 19.5 tam_Taml 6.2 0.7 5.4 5.9 5.5 6.1
18.2 2 16.8 17.9 16 17.1 tel_Telu 8 - 7.4 7.5 7 8.4 18.5 - 17.5 18.8 17.4 18.5 Avg. 11.8 5.1 10.2 10.6
10.3 11.3 20.7 8.5 19.4 20.8 19.3 20.4 ∆ -1.2 6 0.4 - 0.3 -0.7 0.1 12.7 1.4 - 1.5 0.4 Table 37: BLEU
scores of all the systems on the WAT-2021 (Nakazawa et al., 2021a). Evaluation set in the En-Indic
and Indic-En direction. The best performing system is bolded, while underlined results indicate
significant performance difference where IT2 outperforms the system. Avg means the average score
of all the languages that system X supports. ∆ represents the difference between the average scores
of IT2 and the average scores of system X for the subset of languages that both X and IT2 support. A
positive value for ∆ indicates IT2 is better than X and vice-versa. En-Indic Indic-En Language IT1 M100
N1.2 IT2 Goog Az IT1 M100 N1.2 IT2 Goog Az ben_Beng 15.8 7.4 12.1 12.6 9.5 12.1 29.5 15.4 25.9
25.7 22 22.5 guj_Gujr 25.8 3.5 23.8 23.9 20.4 32.7 40.2 0.1 37.2 38.8 34.7 36.9 hin_Deva 38.8 27.3
36.7 37.6 37.2 36.4 43.9 28.8 39.7 41.6 40.6 43.1 kan_Knda 19.2 1.1 16.6 16.7 14.9 18.3 36.5 0 33.8
36.3 31.3 29.5 mal_Mlym 15.1 3.9 9.2 13.7 12.4 9.5 34.6 9.7 31.2 33.6 29.2 32.4 mar_Deva 20.3 8.6
17.5 18.1 16.9 17.3 33.5 14.4 30.2 32.2 28 29.7 ory_Orya 19.1 0.1 17.9 13.6 10.6 15.1 34.4 0.2 31.6
32.7 27.7 30.6 pan_Guru 33.9 6.7 30 31.1 29.7 37.7 43.2 6.2 39.3 41.5 37.6 38.9 tam_Taml 13.6 0.8
11.4 12.3 11.1 12.6 33.1 1.8 29.1 31.1 25.6 27 tel_Telu 14.5 - 12.9 12 10.2 9.6 36.1 - 31.6 34.4 29 31.1
Avg. 21.6 6.6 18.8 19.2 17.3 20.1 36.5 8.5 33 34.8 30.6 32.2 ∆ -2.4 13.4 0.4 - 1.9 -0.9 -1.7 26.3 1.8 -
4.2 2.6
Table 38: chrF++ scores of all the systems on the WMT (Bojar et al., 2014; Barrault et al., 2019; 2020) shared tasks and UFAL (Ramasamy et al., 2012) in the En-Indic and Indic-En direction. The best performing system is bolded, while underlined results indicate significant performance difference where IT2 outperforms the system.
Benchmark | Language | En-Indic: IT1 M100 N1.2 IT2 Goog Az | Indic-En: IT1 M100 N1.2 IT2 Goog Az
UFAL | tam_Taml | 45.5 15.4 44.9 43.9 43.9 45.7 | 53.3 25.3 52.4 53.2 51.2 53.8
WMT14 | hin_Deva | 50.5 45.9 50.7 52.1 52.7 51.9 | 56.6 53.6 60.4 62.1 62.7 60.4
WMT19 | guj_Gujr | 48.8 20.7 55.4 56.3 56.8 62.2 | 50.5 7.9 56.4 57 58.4 58.3
WMT20 | tam_Taml | 45.7 14.4 47.5 49.2 48.2 49.2 | 45.8 17.5 48.1 51.3 53.5 52.3
Table 39: COMET scores of all the systems on the WMT (Bojar et al., 2014; Barrault et al., 2019; 2020) shared tasks and UFAL (Ramasamy et al., 2012) in the En-Indic and Indic-En direction. The best performing system is bolded, while underlined results indicate significant performance difference where IT2 outperforms the system.
Benchmark | Language | En-Indic: IT1 M100 N1.2 IT2 Goog Az | Indic-En: IT1 M100 N1.2 IT2 Goog Az
UFAL | tam_Taml | 85.8 54.3 86.3 85.8 85.4 86.8 | 82.0 55.9 82.7 83.0 82.5 82.7
WMT14 | hin_Deva | 81.2 77.5 81.3 81.7 81.6 81.7 | 84.1 80.2 86.2 86.8 86.4 85.4
WMT19 | guj_Gujr | 86.4 61.0 86.7 87.8 87.3 88.3 | 82.6 30.6 84.9 85.9 85.7 85.1
WMT20 | tam_Taml | 87.7 53.4 87.9 88.4 87.8 89.1 | 81.2 46.0 83.1 84.4 84.0 83.5
Table 40: BLEU scores of all the systems on the WMT (Bojar et al., 2014; Barrault et al., 2019; 2020) shared tasks and UFAL (Ramasamy et al., 2012) in the En-Indic and Indic-En direction. The best performing system is bolded, while underlined results indicate significant performance difference where IT2 outperforms the system.
Benchmark | Language | En-Indic: IT1 M100 N1.2 IT2 Goog Az | Indic-En: IT1 M100 N1.2 IT2 Goog Az
UFAL | tam_Taml | 10.9 0.9 10.6 8.9 9.6 10.8 | 30.2 4.3 28.5 28.8 25.7 28.3
WMT14 | hin_Deva | 25.6 21.0 25.8 27.8 28.1 27.0 | 29.7 26.5 35.1 37.5 37.2 34.1
WMT19 | guj_Gujr | 19.5 4.2 26.0 26.6 27.9 33.8 | 25.1 0.5 31.1 31.6 33.2 33.2
WMT20 | tam_Taml | 10.3 0.7 10.9 12.6 12.0 12.1 | 18.5 1.7 20.6 23.2 25.5 22.4
Table 41: COMET scores of all the systems on the IN22-Gen Evaluation set in the En-Indic and Indic-
En direction. The best performing system is bolded, while underlined results indicate significant
performance difference where IT2 outperforms the system. En-Indic Indic-En Language IT1 M100
N1.2 N54 IT2 Goog Az IT1 M100 N1.2 N54 IT2 Goog Az asm_Beng 81.1 - 83.4 83.2 84.7 84.0 83.5 84.1
- 87.3 88.3 87.5 87.7 85.7 ben_Beng 85.4 80.6 85.6 85.7 86.8 85.2 86.2 86.8 83.8 87.8 88.6 88.1 88.7
87.5 guj_Gujr 87.5 62.0 87.6 87.6 88.6 87.7 88.0 88.0 34.7 89.3 89.9 89.7 89.5 88.1 hin_Deva 79.5
75.2 79.6 80.0 80.5 79.2 79.3 87.8 84.7 88.4 89.1 89.2 88.7 87.9 kan_Knda 84.0 52.7 84.5 84.9 85.7
83.6 85.3 86.5 34.1 87.9 88.5 87.9 88.0 86.8 mal_Mlym 86.1 73.7 86.4 87.1 87.7 86.7 87.3 86.5 77.2
87.6 88.5 88.9 87.9 86.9 mar_Deva 73.4 65.0 73.7 74.7 76.1 73.7 75.3 85.7 77.2 87.1 87.9 87.5 87.6
86.3 npi_Deva - 54.2 80.3 78.6 82.7 80.7 81.6 - 69.5 89.6 90.4 89.8 90.6 88.9 ory_Orya 82.2 39.1 82.9
82.8 79.5 77.4 83.6 87.4 38.2 88.5 89.4 89.0 88.4 86.7 pan_Guru 82.5 60.8 82.6 82.8 83.0 82.8 82.8
84.6 67.5 86.2 87.0 87.0 86.3 84.5 tam_Taml 87.1 45.0 87.3 87.5 88.2 87.5 88.5 84.9 56.4 86.2 87.0
87.0 87.2 86.0 tel_Telu 85.1 - 85.9 86.2 87.1 86.0 86.9 86.4 - 87.8 88.6 88.7 88.6 87.1 urd_Arab - 73.8
84.2 84.6 85.3 85.0 86.5 - 79.0 88.2 89.0 89.2 88.9 87.9
Table 42: COMET scores of all the systems on the IN22-Conv Evaluation
set in the En-Indic and Indic-En direction. The best performing system is bolded, while underlined
results indicate significant performance difference where IT2 outperforms the system. En-Indic Indic-
En Language IT1 M100 N1.2 N54 IT2 Goog Az IT1 M100 N1.2 N54 IT2 Goog Az asm_Beng 83.2 - 85.7
85.6 86.5 84.7 85.8 84.7 - 87.9 88.1 89.3 90.4 88.7 ben_Beng 89.5 85.7 89.4 89.7 90.1 88.3 89.8 88.3
84.6 89.0 89.5 89.7 89.9 89.7 guj_Gujr 91.4 70.2 90.7 91.2 92.1 91.6 91.5 90.1 38.3 91.3 91.7 91.7
91.9 91.1 hin_Deva 85.0 81.3 83.9 83.3 85.2 85.1 84.7 89.9 85.5 90.5 90.8 90.8 90.8 90.5 kan_Knda
84.2 58.9 83.7 84.7 85.1 84.3 84.9 81.6 36.7 81.7 82.0 84.0 83.2 83.4 mal_Mlym 89.4 82.4 89.7 90.2
90.1 89.4 90.0 87.2 79.4 87.9 88.3 88.5 88.8 88.5 mar_Deva 80.2 72.5 81.0 82.1 81.9 80.9 81.3 87.8
77.3 88.8 89.1 89.4 90.0 89.3 npi_Deva - 57.0 84.6 83.5 86.8 85.1 85.0 - 57.6 91.0 90.9 91.4 92.2 91.4
ory_Orya 86.2 46.4 87.1 87.3 82.9 82.3 86.8 88.9 41.6 90.3 90.6 90.4 89.5 89.3 pan_Guru 88.2 67.4
88.3 88.8 88.8 89.1 88.6 88.6 72.7 89.6 90.1 90.2 90.2 89.2 tam_Taml 87.6 67.2 85.9 84.5 88.0 87.6
88.3 83.6 63.9 84.8 85.5 85.0 85.6 84.9 tel_Telu 88.1 - 84.2 83.0 89.6 89.0 89.6 85.8 - 87.3 88.0 87.7
88.5 87.8 urd_Arab - 79.6 85.9 85.1 88.7 89.4 89.0 - 80.9 90.0 90.3 90.9 91.2 90.8 Table 43: COMET
scores of all the systems on the FLORES-200 (Costa-jussà et al., 2022) Evaluation set in the En-Indic
and Indic-En direction. The best performing system is bolded, while underlined results indicate
significant performance difference where IT2 outperforms the system. En-Indic Indic-En Language IT1
M100 N1.2 N54 IT2 Goog Az IT1 M100 N1.2 N54 IT2 Goog Az asm_Beng 79.4 - 82.2 81.5 83.7 82.6
82.6 81.0 - 85.4 86.6 85.8 86.4 83.5 ben_Beng 86.1 82.6 86.3 87.1 87.5 86.6 87.2 87.3 85.8 88.7 89.3
89.1 89.6 88.3 guj_Gujr 87.9 62.0 87.6 87.7 89.1 88.7 88.9 88.3 36.1 90.2 90.8 90.7 91.1 89.4
hin_Deva 80.6 77.6 81.1 81.2 81.5 81.3 80.9 88.4 87.6 89.8 90.3 90.3 90.4 89.5 kan_Knda 85.5 52.7
86.3 86.4 87.4 86.5 87.4 85.8 34.4 88.0 88.6 88.5 88.7 87.2 mal_Mlym 87.1 77.4 87.7 88.3 89.5 89.0
89.3 87.2 83.1 89.0 89.5 89.6 89.9 88.4 mar_Deva 73.7 67.7 74.7 75.6 76.4 75.9 76.2 86.4 81.4 88.4
89.0 88.8 89.3 87.9 npi_Deva - 51.2 80.0 74.9 84.5 83.5 83.0 - 72.8 90.7 91.1 91.2 91.5 89.8 ory_Orya
83.6 38.8 84.4 84.2 79.9 80.9 84.8 86.8 38.5 89.1 89.9 89.8 89.6 87.9 pan_Guru 83.8 61.2 84.1 83.6
84.4 84.5 84.6 87.6 76.2 89.3 89.9 89.7 89.9 88.2 tam_Taml 88.0 44.6 89.1 88.6 89.9 89.5 89.9 85.1
64.3 87.4 88.0 87.7 88.2 86.2 tel_Telu 85.9 - 86.5 86.8 87.8 87.5 87.8 86.6 - 88.6 89.4 89.3 89.5 88.0
urd_Arab - 73.4 82.2 81.8 82.6 83.0 83.6 - 79.0 87.5 88.3 87.7 88.4 86.4
Table 44: BLEU scores of all the systems on the IN22-Gen
Evaluation set in the En-Indic and Indic-En direction. The best performing system is bolded, while
underlined results indicate significant performance difference where IT2 outperforms the system.
Avg means the average score of all the languages that system X supports. ∆ represents the difference
between the average scores of IT2 and the average scores of system X for the subset of languages
that both X and IT2 support. A positive value for ∆ indicates IT2 is better than X and vice-versa. †
indicates completely off-target translations. En-Indic Indic-En Language IT1 M100 N1.2 N54 IT2 Goog
Az IT1 M100 N1.2 N54 IT2 Goog Az asm_Beng 9.9 - 13.9 15.4 19.4 16.9 16.2 32.5 - 40.4 44.6 43.1 42
35.4 ben_Beng 18.1 11.3 16.6 18.3 20.8 18.3 18 33.4 26.3 36.1 39.3 39 39.8 34.9 brx_Deva - - - - 16.9
- - - - - - 40.2 - - doi_Deva - - - - 33.5 22.2 - - - - - 53.5 45.1 - gom_Deva - - - - 18.8 11.6 11.5 - - - - 35.3
33 25.8 guj_Gujr 17.9 3.9 18.7 20.3 25.7 23.3 21.2 36.3 0.4 40.2 43.4 43.7 43 37.1 hin_Deva 28.3 22.1
27.6 28.9 33.5 30.2 29.2 36.1 27.1 37.4 41 42.5 39.8 36.1 kan_Knda 13.4 1 13.4 14.9 18 14.2 15.2
34.8 0.1 39 42.7 40.8 41 35 kas_Arab - - 9.9 10.5 14.4 - - - - 31.5 35 38.3 - - mai_Deva - - 15.5 15.1
19.3 9.3 14.5 - - 37.9 41.8 40.8 39.8 36.2 mal_Mlym 13.9 4.4 11.9 13.1 16.4 13.7 13.6 31.4 17.5 34.8
38.6 41.4 37.9 33.3 mar_Deva 13.9 7 14.5 15.6 21.7 16.2 17.5 33.5 20 37.2 40.8 40.2 40.6 35.1
mni_Mtei - - - - 17.5 10.8 - - - - - 35.1 27.5 - npi_Deva - 2.6 14.4 14.8 16.8 13.8 14.4 - 12.8 42.2 46
45.1 46.8 39.9 ory_Orya 10.2 0.1 12.2 11.8 14.5 10.7 14.1 36.7 0 40.7 44.7 43.8 40.4 34.7 pan_Guru
23.5 7.2 23.9 25.3 25.5 29.9 25.2 33.5 10.4 37.4 40.6 41.4 39.6 34.7 san_Deva - - 3.7 4.3 11.1 5.5 - - -
24.2 27.2 29.8 28.6 - sat_Olck - - 0.0 † 3.8 5.5 - - - - 12.3 18.7 21.8 - - snd_Deva - - - - 14 - - - - - - 35 - -
tam_Taml 11.9 1.4 12.6 13 14.4 14 14.5 28.9 4.9 32.5 35 35.9 34.9 29.4 tel_Telu 15.5 - 15.1 17.1 19.4
17.7 17.7 33.5 - 37.6 41.5 42.3 41.3 35.7 urd_Arab - 23.1 42 43.8 49.7 44.1 51.4 - 26.5 46.5 50.5 53.7
50.9 46.3 Avg. 16 7.6 15.6 16.8 20.3 17.9 19.6 33.7 13.3 35.8 39.5 40.1 39.6 35.3 ∆ 7.3 15.8 4.8 1.7 -
4.4 2.7 8.6 29.2 4.4 0.9 - 2.3 6.6
Table 45: BLEU scores of all the systems on the IN22-Conv Evaluation set in the En-Indic
and Indic-En direction. The best performing system is bolded, while underlined results indicate
significant performance difference where IT2 outperforms the system. Avg means the average score
of all the languages that system X supports. ∆ represents the difference between the average scores
of IT2 and the average scores of system X for the subset of languages that both X and IT2 support. A
positive value for ∆ indicates IT2 is better than X and vice-versa. † indicates completely off-target
translations. En-Indic Indic-En Language IT1 M100 N1.2 N54 IT2 Goog Az IT1 M100 N1.2 N54 IT2
Goog Az asm_Beng 11.6 - 16.7 17.8 19.7 17.6 19.5 31.3 - 38.6 40.4 43.8 44.6 41.7 ben_Beng 20.1 13
19.3 20.7 21.3 21.5 20.3 32.9 25.8 33.3 35 36.4 37.6 36 brx_Deva - - - - 15.4 - - - - - - 35.5 - - doi_Deva
- - - - 32.4 17.6 - - - - - 45.6 42.6 - gom_Deva - - - - 14.2 12 10.4 - - - - 29.9 29.5 23.7 guj_Gujr 23.2 4
22.8 24.1 27.2 26.7 25.7 34.7 0.3 39.7 39.9 41.1 41 39.1 hin_Deva 28.4 22.3 27.1 28.4 30.1 30.7 28
35.5 28.3 37.3 38.4 39.3 38.3 37.7 kan_Knda 6.1 0.5 5.8 6.5 6.7 6.3 6.2 21.1 0.2 22.5 22.9 24.9 24.4
23.6 kas_Arab - - 4.5 4.6 11.3 - - - - 23.3 24.1 31.8 - - mai_Deva - - 15.4 15.7 18.9 10.6 11.9 - - 32.6
33.8 35.3 36.6 32.1 mal_Mlym 11.1 3.9 8.3 7.6 11.3 11.1 10.8 27.6 16.2 28 29.5 31.6 31.1 30.8
mar_Deva 15.5 8.9 16.9 18.6 19.4 17.7 17.6 32.2 18.9 34.1 35.7 36.7 37.7 35.9 mni_Mtei - - - - 14.2
6.9 - - - - - 31.9 25.7 - npi_Deva - 1.6 15.7 16.4 21.2 16.4 15.8 - 3.5 38.9 39.6 42.4 43.1 40.8 ory_Orya
11.3 0.3 13.8 13.8 12.3 10.3 14.1 33.6 0.2 38.4 38.9 38.8 37.4 35.3 pan_Guru 32 7 32.1 33.8 35.7 41.3
33.2 36.8 7.3 39.6 41.1 43 39.5 40.7 san_Deva - - 2.8 4.7 6.3 5.2 - - - 17.8 17.2 26.1 26.7 - sat_Olck - -
0.0 † 3 6.6 - - - - 11.3 17.8 23.1 - - snd_Deva - - - - 7.4 - - - - - - 27.5 - - tam_Taml 7.7 1.5 7.1 7.4 7.6 8
8.4 20.8 4 23 24.1 22.7 23.3 22.8 tel_Telu 12 - 9.8 10.5 14.1 13.4 13.8 26.3 - 29.5 31.6 31 31.5 31.1
urd_Arab - 19.4 35.6 35.3 43.7 42.2 40.1 - 26.5 40.3 41.7 45.9 45.6 44.9 Avg. 16.3 7.5 14.9 15.8 18
17.5 18.4 30.3 11.9 31.1 32.5 34.7 35.3 34.4 ∆ 4.5 14 3.5 2.8 - 2.7 1.8 6 24.7 3.8 3.4 - 0.9 1.8
Table 46: BLEU scores of all the
systems on the FLORES-200 (Costa-jussà et al., 2022) devtest set in the En-Indic and Indic-En
direction. The best performing system is bolded, while underlined results indicate significant
performance difference where IT2 outperforms the system. Avg means the average score of all the
languages that system X supports. ∆ represents the difference between the average scores of IT2 and
the average scores of system X for the subset of languages that both X and IT2 support. A positive
value for ∆ indicates IT2 is better than X and vice-versa. † indicates completely off-target translations.
En-Indic Indic-En Language IT1 M100 N1.2 N54 IT2 Goog Az IT1 M100 N1.2 N54 IT2 Goog Az
asm_Beng 7.6 - 11.4 11.7 14 12.2 13.8 23.4 - 31.3 33.9 32.5 32.9 27.1 ben_Beng 19.7 15.3 20.2 22.1
24.7 24.3 23.4 31.8 29.4 36.6 38.7 38.6 39.7 35.3 guj_Gujr 22.1 4.8 23.9 25.2 27.8 27.1 26.6 34.1 1.2
42.5 44.6 45.3 46.2 38.6 hin_Deva 34.5 30.9 34.3 36.7 38.6 39 38.4 37.5 35.6 42.1 44.4 46.1 46.4 43.1
kan_Knda 18.3 1.6 20.6 22.1 24.1 24.6 24.2 28.7 0.7 34.8 36.9 37.8 38.4 32.5 kas_Arab - - 10 10.5
11.9 - - - - 33.7 36.7 36.1 - - kas_Deva - - 1.9 2 2.2 - - - - 23.9 27 25.1 - - mai_Deva - - 16.5 18.2 19 11.8
20.8 - - 44.1 46.7 48.2 46.6 41.8 mal_Mlym 15.9 7.9 14.1 18.3 22 22.4 22 31.4 25.3 37.6 39.1 41 41
35.8 mar_Deva 15.8 10.1 16.2 17.9 19.9 20.7 18.3 31 24.6 37.1 40.3 41.1 42.1 37.3 mni_Beng - - 7.7
10.4 8.6 - - - - 27 27.5 28.5 - - npi_Deva - 1.7 18.7 18.5 25.5 23.9 20.9 - 14 42.3 44.5 46.3 46.5 39.8
ory_Orya 13.6 0.3 17.1 16.9 17.3 24.4 18.6 29.8 0.5 38.2 41.6 42.4 41.6 35.1 pan_Guru 26.7 8.6 27.1
27.7 29.6 31.1 30.1 35.8 15.2 42.2 44.8 44.9 45.8 38.2 san_Deva - - 2.2 2.3 3.2 3.4 - - - 23.3 26.1 26.6
25 - sat_Olck - - 0.1 † 4.9 4.1 - - - - 14.5 21.7 16.7 - - snd_Arab - 10.8 25.3 26.4 20.2 27.3 27.7 - 2.7 42
45 43.6 45.5 36.3 tam_Taml 15.6 0.9 18.6 19.8 22.6 21.1 21.3 28.4 8.3 34.4 36.8 37.8 37.7 31.1
tel_Telu 21.3 - 23.1 25.3 27.8 27.2 25.3 33.4 - 40.9 43.6 44.7 45.1 39.6 urd_Arab - 16.9 25.8 27.2 29.1
28.2 28.2 - 22.2 36.8 39.6 38.1 40 34.4 Avg. 19.2 9.2 16.7 18.2 19.6 23.0 24 31.4 15 35.3 38 38.1 41.3
36.4 ∆ 5.2 15.9 2.9 1.4 - -0.2 0.1 9.7 26.9 2.8 0.1 - -0.3 5.5
C Human Evaluation
Automated evaluation metrics provide a
convenient and quick way to evaluate MT systems. However, as reported by previous works (Kocmi
et al., 2021; Moghe et al., 2022), the degree of correlation between automatic evaluation metrics
and human ratings is not particularly strong. To obtain a more comprehensive understanding of the
model’s performance, it is imperative to conduct human evaluations (Kocmi et al., 2021). We conduct
a small-scale human evaluation exercise to verify if the quality of our model outputs correlates with
the improvements observed using automatic metrics. This exercise focused on the En-Indic direction
and included 50 examples each from the Wikipedia and Web Sources subsets, yielding a total of 100 sentence pairs from IN22-Gen. We seek to cover sentences of diverse lengths (see Figure 10) and uniformly sample sentences from each length bucket. Our human evaluators belong to
the same pool of translators who created the IN22 benchmark. They are fluent speakers of English
and the respective native language under study. Based on the availability of annotators, we conduct
human evaluation studies for the following languages: Assamese, Bengali, Bodo, Dogri, Konkani,
Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Santali, Tamil, Telugu and Urdu. We
compare IndicTrans2 model outputs with those of NLLB (Costa-jussà et al., 2022), Google Translate, and Azure Translate. The annotators were not aware of which output was
generated by which system. We use the XSTS methodology proposed by Licht et al. (2022) and
adopted by Costa-jussà et al. (2022) for comparing different multilingual machine translation (MT)
systems. XSTS relies on human raters to assess translations without using reference translations,
focusing more on adequacy (meaning preservation) than fluency. This approach is particularly
suitable for low-resource languages with relatively lower translation quality. XSTS also exhibits better
inter-annotator agreement than Direct Assessment (Graham et al., 2013) as demonstrated by prior
research (Licht et al., 2022). Brief instructions for human annotators are provided below. Raters choose scores between 1 and 5. We refer readers to Figure 1 in Licht et al. (2022) for the detailed definition of the scores.
• A score of 1 indicates the sentences are unrelated to each other, or are on similar topics but differ in more than half of their core concepts.
• A score of 2 indicates that the sentences are about similar topics, but some key details about the main subject, verb, or object are different or absent.
• A score of 3 indicates that the sentences are equivalent to each other but with unimportant differences.
• A score of 4 indicates that the sentences are paraphrases of each other but have minor differences in emphasis, formality, idioms, etc.
• A score of 5 indicates the sentences mean the same with no difference in emphasis, formality, idioms, etc.
It is known that there is some variance among human evaluators, with some being overly critical while others being
excessively generous when assessing MT outputs. Recent studies by Licht et al. (2022) and Costa-
jussà et al. (2022) emphasize the importance of having a calibration set to ensure that XSTS scores
are comparable across languages. To address this concern, our evaluation methodology employs a
sample of the calibration set, comprising pairs of English sentences released by NLLB Team (Costa-
jussà et al., 2022). From each of the 5 scoring classes described in Licht et al. (2022), we uniformly
sample 10 sentences, forming a calibration set with 50 sentence pairs. The task framework employed
for this purpose closely aligns with the approach suggested in Costa-jussà et al. (2022). To account
for extreme calibration shifts, we use the moderated calibration adjustment as proposed in Costa-
jussà et al. (2022).
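To make the calibration step concrete, the following is a simplified, illustrative sketch only: it shifts each rater's scores by the gap between their mean rating on the shared calibration pairs and the pooled mean on those pairs. The moderated adjustment actually used follows Costa-jussà et al. (2022) and differs in its exact formula; the function and variable names here are hypothetical.

```python
# Illustrative sketch only: simple mean-shift calibration of XSTS scores.
# This is NOT the moderated calibration adjustment of Costa-jussà et al. (2022).
def calibrate(eval_scores, calib_scores_by_rater):
    """eval_scores: {rater: [XSTS scores on evaluation items]}
    calib_scores_by_rater: {rater: [XSTS scores on the 50 calibration pairs]}"""
    all_calib = [s for scores in calib_scores_by_rater.values() for s in scores]
    pooled_mean = sum(all_calib) / len(all_calib)
    adjusted = {}
    for rater, scores in eval_scores.items():
        calib = calib_scores_by_rater[rater]
        shift = pooled_mean - sum(calib) / len(calib)  # > 0 for harsh raters
        # Clamp adjusted scores back to the valid 1-5 XSTS range.
        adjusted[rater] = [min(5.0, max(1.0, s + shift)) for s in scores]
    return adjusted
```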
Overall results. Our findings indicate that IndicTrans2 significantly outperforms Google and NLLB 54B, and performs comparably with Azure. Statistical significance is computed using ANOVA with a post-hoc Tukey HSD test (p ≤ 0.05), following similar human evaluations in data-to-text generation (Puduppully & Lapata, 2021; Puduppully et al., 2022). However, it should be acknowledged that the sample size of sentences used for human evaluation is limited, and therefore these results must be interpreted with caution. Future work should expand the human evaluation to cover all 22 Indic languages and also include the IN22-Conv set to gain more fine-grained insights.
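The significance test described above can be reproduced in outline as follows; this is a hedged sketch with placeholder ratings, not the collected annotation data or the authors' analysis script.

```python
# Hedged sketch: one-way ANOVA followed by a post-hoc Tukey HSD test over
# per-sentence XSTS ratings of several systems. Placeholder data only.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

ratings = {
    "IT2":   np.array([4.5, 4.0, 4.5, 5.0, 4.0]),
    "Azure": np.array([4.0, 4.5, 4.0, 4.5, 4.0]),
    "Goog":  np.array([3.5, 4.0, 3.5, 4.0, 4.0]),
    "N54B":  np.array([4.0, 3.5, 4.0, 4.0, 3.5]),
}

# One-way ANOVA across systems.
f_stat, p_value = f_oneway(*ratings.values())
print(f"ANOVA: F={f_stat:.3f}, p={p_value:.4f}")

# Tukey HSD identifies which pairs of systems differ at alpha = 0.05.
scores = np.concatenate(list(ratings.values()))
groups = np.repeat(list(ratings.keys()), [len(v) for v in ratings.values()])
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```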
[Figure 9: Distribution of XSTS scores for low, medium and high resource languages in IN22. Panel (a): XSTS score distributions (median score 1-5 vs. frequency) for low and medium resource languages (Assamese, Bodo, Dogri, Konkani, Nepali, Punjabi, Sanskrit, Urdu). Panel (b): the same for high resource languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu).]
High vs. Low Resource Languages. Figure 9 in
Appendix C depicts the trends in the distribution of ratings for a selected set of languages, with low
and medium resource languages in the upper half, and high resource languages in the lower half.
IndicTrans2 outperforms other models significantly in low-resource languages like Konkani, Sanskrit,
and Nepali. Most languages supported by IndicTrans2 achieve close to a 4 XSTS rating. High-resource
languages, such as Hindi, Bengali, and Telugu, show a right-skewed distribution with many sentence
pairs receiving higher ratings. On the other hand, medium-performance languages like Bodo exhibit a more symmetrical distribution of ratings.
Table 47: Post-calibration results of human evaluation for En-XX language pairs using the XSTS methodology. We compare four model outputs: Azure, Google, NLLB (N54B) (Costa-jussà et al., 2022), and IndicTrans2 (IT2). – indicates languages not supported by a model. The † after a value indicates a statistically significant difference from IndicTrans2 using ANOVA with a post-hoc Tukey HSD test (p ≤ 0.05). ∆ is the difference between the post-calibration and pre-calibration XSTS scores for IndicTrans2, with a positive value indicating an improvement in scores post-calibration and vice-versa. Overall, IndicTrans2 is the top-ranked system, comparable with Azure and significantly better than Google and NLLB.
language | Azure | Google | N54B | IT2 | ∆
asm_Beng | 3.44 | 3.63 | 3.83 | 3.62 | -0.61
ben_Beng | 3.94† | 3.92† | 4.05 | 4.18 | -0.18
brx_Deva | – | – | – | 3.75 | 0.04
doi_Deva | – | 4.13 | – | 4.32 | 0.08
gom_Deva | – | 3.84† | – | 4.45 | -0.09
guj_Gujr | 4.50 | 4.26† | 4.28† | 4.53 | 0.12
hin_Deva | 4.27† | 4.23† | 4.40 | 4.56 | 0.05
kan_Knda | 4.12 | 3.86 | 4.01 | 4.18 | 0.08
mal_Mlym | 3.96 | 3.73† | 3.84 | 4.06 | 0.23
mar_Deva | 4.27 | 3.89† | 4.12 | 4.41 | -0.11
npi_Deva | 3.89† | 3.87† | 3.81† | 4.41 | 0.13
pan_Guru | 4.09 | 3.94 | 4.10 | 4.25 | -0.45
san_Deva | – | 2.87† | 2.83† | 3.68 | -0.37
tam_Taml | 4.00 | 3.79 | 3.79 | 3.90 | 0.40
tel_Telu | 4.29 | 3.94† | 4.24 | 4.29 | 0.03
urd_Arab | 4.06 | 3.68† | 3.76† | 4.25 | 0.24
Average | 4.07 | 3.84† | 3.93† | 4.18 | -0.02
Calibration. The column ∆
in Table 47 indicates the revision in scores post-calibration for IndicTrans2, with a positive value
indicating improvement in scores and vice-versa. We present the results comparing pre and post-
calibration scores for all the models in Table 48. We see that calibration adjusts the scores of individual languages. Assamese and Bengali are two related languages written in the same script and
sharing substantial vocabulary. At the same time, Bengali is high-resource in comparison to
Assamese; Bengali belongs to class 5 whereas Assamese belongs to class 2 in terms of the language
resourcefulness classification (Joshi et al., 2020). From Table 48, we see that the scores for Assamese
and Bengali are comparable pre-calibration; however, after calibration, the scores for Assamese drop
compared to that of Bengali. Among languages for which the scores change by more than 0.2 points,
Punjabi and Sanskrit scores drop post-calibration, whereas Malayalam, Tamil, and Urdu scores
improve. These findings underscore the significance of calibration in ensuring the reliability and
comparability of XSTS scores across different languages and models. Overall, we see an average
change of 0.23 in XSTS scores for IndicTrans2. Importantly, the relative ranking of the machine
translation models based on XSTS scores remains unchanged, with IndicTrans2 outperforming NLLB,
and Google, and comparable with Azure. Correlation with Automatic Metrics. The correlation
between XSTS scores and automatic metrics is an important aspect of evaluating machine translation
performance. Our analysis reveals that XSTS scores for IndicTrans2 exhibit moderate correlation with
two widely used automatic metrics, namely BLEU, and chrF++. Specifically, we observe Spearman
rank correlations of 0.49 and 0.12, respectively, with BLEU and chrF++ across all languages, but the
correlations increase to 0.67 and 0.25, respectively, when Urdu is excluded from the analysis. This
observation can be partly attributed to the influence of Urdu tokenization, which had a greater
impact on the BLEU and chrF++ scores when compared to other languages. This can be due to the
higher fertility of Urdu when using the UrduHack tokenizer, which led to inflated scores for both
metrics. As a result, the correlation between these metrics and the actual quality of the translations was reduced, deviating from the trend observed in other languages. In contrast, we find no
correlation between XSTS scores and the COMET metric, which is designed to assess the fluency and
adequacy of machine translations. Additionally, we observe no correlation between BLEU/chrF++ and
COMET scores, indicating that these metrics capture different aspects of machine translation quality. Nonetheless, further investigation is necessary to gain a deeper understanding of the relationship between these metrics.
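For reference, the rank-correlation computation described above can be sketched as follows; the score lists are placeholders rather than the paper's per-language values.

```python
# Hedged sketch: Spearman rank correlation between per-language human XSTS
# scores and an automatic metric (e.g., BLEU or chrF++). Placeholder values.
from scipy.stats import spearmanr

xsts = [4.2, 4.5, 4.6, 4.1, 4.4, 3.9, 4.3, 4.2]   # human ratings per language
bleu = [21.0, 26.0, 33.0, 16.0, 22.0, 14.0, 19.0, 50.0]   # automatic metric per language

rho, p_value = spearmanr(xsts, bleu)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```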
Table 48: Comparison of XSTS scores before and after applying calibration.
language | Pre-Calibration: Azure Google N54B IT2 | Post-Calibration: Azure Google N54B IT2
asm_Beng | 4.05 4.24 4.44 4.23 | 3.44 3.63 3.83 3.62
ben_Beng | 4.13 4.11 4.24 4.36 | 3.94 3.92 4.05 4.18
brx_Deva | - - - 3.71 | - - - 3.75
doi_Deva | - 4.03 - 4.24 | - 4.13 - 4.32
gom_Deva | - 3.93 - 4.54 | - 3.84 - 4.45
guj_Gujr | 4.37 4.10 4.12 4.41 | 4.50 4.26 4.28 4.53
hin_Deva | 4.20 4.15 4.34 4.51 | 4.27 4.23 4.40 4.56
kan_Knda | 4.04 3.77 3.92 4.10 | 4.12 3.86 4.01 4.18
mal_Mlym | 3.72 3.47 3.59 3.83 | 3.96 3.73 3.84 4.06
mar_Deva | 4.38 4.00 4.23 4.52 | 4.27 3.89 4.12 4.41
npi_Deva | 3.71 3.69 3.63 4.28 | 3.89 3.87 3.81 4.41
pan_Guru | 4.54 4.39 4.54 4.70 | 4.09 3.94 4.10 4.25
san_Deva | - 3.23 3.19 4.05 | - 2.87 2.83 3.68
tam_Taml | 3.61 3.39 3.39 3.50 | 4.00 3.79 3.79 3.90
tel_Telu | 4.26 3.90 4.21 4.26 | 4.29 3.94 4.24 4.29
urd_Arab | 3.80 3.39 3.47 4.01 | 4.06 3.68 3.76 4.25
Average | 4.07 3.85 3.95 4.20 | 4.07 3.84 3.93 4.18
D Distilled Models
This section presents a detailed description of our
student model architecture and the distillation training hyperparameters. We share the weights of the decoder embedding and the output projection to compress the student models as much as possible. This also allows us to have equal-sized student models for both directions. This is particularly useful for the En-Indic model, as the output projection accounts for a significant fraction of the model parameters (≈ 60M). Tables 52 and 53 present the comparison between the student and teacher models on FLORES-200 and IN22-Conv, respectively.
Table 49: Student architecture description. Specifically, our student models use the base18L architecture, following Gumma et al. (2023).
Hyperparameter | Value
Model dim | 512
FFN dim | 2048
Encoder Layers | 18
Decoder Layers | 18
Activation | GELU (Hendrycks & Gimpel, 2016)
Pre-Normalization | True (Xiong et al., 2020)
Embedding LayerNorm | True
Share decoder input output embed | True
Table 50: Number of parameters in the teacher and distilled student models.
#Params | Indic-En | En-Indic
Teacher | 1.02B | 1.11B
Student | 211.77M | 211.77M
Table 51: Hyperparameter set for knowledge distillation. The rest of the parameters not mentioned in the table are the same as the ones used for training IT2 (see Table 10).
Hyperparameter | Stage 1 Distillation | Stage 2 fine-tuning
Learning rate | 7e-4 (en-xx), 1e-3 (xx-en) | 3e-5
Criterion | KL-Divergence | Cross-entropy
Label smoothing (Szegedy et al., 2016) | − | 0.1
Effective batch size | 262K | 8K
Checkpoint metric | BLEU @ beam = 5 | BLEU @ beam = 5
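To make the Stage 1 distillation criterion and the embedding sharing concrete, a minimal sketch follows; attribute names such as embed_tokens and output_projection are illustrative assumptions, not the exact IndicTrans2/fairseq module names, and this is not the authors' training code.

```python
# Minimal sketch: word-level KD with a KL-divergence criterion, plus tying the
# decoder input embedding to the output projection. Illustrative only.
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, non_pad_mask):
    """student_logits, teacher_logits: (batch, tgt_len, vocab)
    non_pad_mask: (batch, tgt_len) float mask, 1 for real target tokens."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach(), dim=-1)  # teacher is frozen
    # KL(teacher || student) per target position, summed over the vocabulary.
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
    return (kl * non_pad_mask).sum() / non_pad_mask.sum()

def tie_decoder_embeddings(student_decoder):
    # Share the decoder embedding with the output projection so the large
    # target-vocabulary matrix is stored only once (illustrative attribute names).
    student_decoder.output_projection.weight = student_decoder.embed_tokens.weight
```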
Table 52: chrF++ scores of Indic-En and En-Indic distilled models on FLORES-200.
Distilled (Dist) is the model trained with Word-level KD. ∆ is the difference between the distilled
Model fine-tuned on seed data (Dist-Seed) & IT2. Higher values of ∆ are preferable. Indic-En En-Indic
language IT2 Dist Dist-Seed ∆ IT2 Dist Dist-Seed ∆ asm_Beng 56.9 56.9 56.1 -0.8 43.3 42.7 43.0 -0.3
ben_Beng 62.4 61.4 61.4 -1.0 54.3 54.0 54.0 -0.3 guj_Gujr 67.0 65.5 65.6 -1.4 56.0 55.9 55.8 -0.2
hin_Deva 67.5 66.0 66.0 -1.5 59.6 59.3 59.3 -0.3 kan_Knda 61.5 60.0 60.2 -1.3 56.1 56.0 55.9 -0.2
kas_Arab 59.7 57.6 57.6 -2.1 39.7 40.0 40.1 0.4 kas_Deva 48.3 45.3 45.6 -2.7 19.2 19.3 19.8 0.6
mai_Deva 69.5 67.3 67.3 -2.2 50.5 50.8 51.0 0.5 mal_Mlym 64.3 62.7 62.8 -1.5 57.3 57.0 57.1 -0.2
mar_Deva 64.3 63.1 63.1 -1.2 51.3 51.3 51.1 -0.2 mni_Beng 52.9 51.0 50.9 -2.0 38.2 37.5 37.2 -1.0
npi_Deva 68.1 66.2 66.2 -1.9 57.2 57.1 57.2 0.0 ory_Orya 64.9 63.2 63.2 -1.7 49.2 48.6 48.7 -0.5
pan_Guru 66.4 64.8 65.0 -1.4 53.5 53.5 53.5 0.0 san_Deva 51.6 49.9 49.9 -1.7 31.6 31.5 31.3 -0.3
sat_Olck 39.3 40.5 40.9 1.6 28.4 28.2 28.6 0.2 snd_Arab 65.1 63.4 63.5 -1.6 44.9 45.1 45.0 0.1
tam_Taml 61.3 59.4 59.3 -2.0 57.2 57.0 57.0 -0.2 tel_Telu 66.1 64.9 64.8 -1.3 59.4 59.4 59.5 0.1
urd_Arab 62.0 60.6 60.7 -1.3 52.2 52.0 52.2 0.0 Average 61.0 59.4 59.5 -1.5 48.0 47.8 47.9 -0.1
Table 53: chrF++ scores of Indic-
En and En-Indic distilled models on IN22-Conv. Distilled (Dist) is the model trained with Word-level
KD. ∆ is the difference between the distilled Model fine-tuned on seed data (Dist-Seed) & IT2. Higher
values of ∆ are preferable. Indic-En En-Indic language IT2 Dist Dist-Seed ∆ IT2 Dist Dist-Seed ∆
asm_Beng 62.9 62.3 63.0 0.1 46.8 46.1 46.6 -0.2 ben_Beng 58.4 58.3 58.7 0.3 49.7 49.6 49.8 0.1
brx_Deva 56.3 55.2 55.2 -1.1 45.3 45.4 45.4 0.1 doi_Deva 65.0 63.8 63.5 -1.5 53.9 52.3 53.0 -0.9
gom_Deva 51.7 50.8 50.8 -0.9 42.5 41.8 41.7 -0.8 guj_Gujr 62.0 61.5 62.0 0.0 53.1 53.1 53.1 0.0
hin_Deva 60.1 59.9 60.3 0.2 49.6 49.3 49.4 -0.2 kan_Knda 47.5 48.3 48.6 1.1 33.8 33.6 33.8 0.0
kas_Arab 52.6 50.2 50.5 -2.1 35.6 33.6 34.9 -0.7 mai_Deva 57.8 56.9 57.2 -0.6 44.3 43.8 44.3 0.0
mal_Mlym 54.3 53.8 53.9 -0.4 45.7 45.5 45.6 -0.1 mar_Deva 58.5 58.4 58.8 0.3 48.6 48.4 48.7 0.1
mni_Mtei 52.5 51.4 51.6 -0.9 40.2 39.5 40.0 -0.2 npi_Deva 63.0 63.1 63.3 0.3 51.5 51.1 51.3 -0.2
ory_Orya 60.3 60.3 60.7 0.4 40.2 39.8 40.0 -0.2 pan_Guru 62.7 61.7 62.1 -0.6 57.8 57.6 57.5 -0.3
san_Deva 48.3 46.9 46.9 -1.4 35.5 34.6 34.8 -0.7 sat_Olck 43.5 45.9 46.3 2.8 34.6 34.2 34.8 0.2
snd_Deva 49.6 50.1 50.5 0.9 30.3 30.1 30.0 -0.3 tam_Taml 45.8 45.7 45.7 -0.1 39.1 38.6 38.7 -0.4
tel_Telu 52.9 52.3 53.0 0.1 45.5 44.7 45.1 -0.4 urd_Arab 65.5 64.4 64.5 -1.0 61.6 61.5 61.4 -0.2
Average 56.0 55.5 55.8 -0.2 44.8 44.3 44.5 -0.3
E Additional details about IN22 Benchmark
This section provides a detailed
overview of the source and domain diversity of the different subsets of the IN22 benchmark.
[Figure 10: Domain vs. length distribution for the sentences from the Web Sources (top) and Wikipedia (bottom) subsets of IN22. For each subset (IN22-Web, IN22-Wiki), the number of sentences is shown per domain (Culture, Economy, Education, Entertainment, Geography, Governments, Health, Industry, Legal, News, Religion, Sports, Tourism) and per length bucket (1-10, 11-17, 18-25, 26-45, 46-60, 61-80 words).]
Table 54: Comparison of diversity of domains in FLORES-200 and IN22
FLORES domain | IN22 domain
crime, disasters, politics | news
entertainment | entertainment
geography | geography
health | health
nature, science | education
sports | sports
travel | tourism
- | culture
politics | government
- | industry
- | economy
- | legal
- | religion
Table 55: Statistics of the Conversational Subset of IN22
Statistic | Value
Number of unique conversations | 44
Average turns per conversation ± std dev | 34.2 ± 4.9
Number of unique topics | 23
Randomly selected 5 topics | ‘Government schemes’, ‘Movies’, ‘Historical Architectures’, ‘Geography of India’, ‘Legal Affidavit/documents’
Number of unique domains | 16
Randomly selected 5 domains | ‘arts’, ‘history’, ‘school life’, ‘healthcare’, ‘legal’
Number of unique prompts | 44
Randomly selected 5 prompts | ‘Joint Affidavit for Registration of Marriage’, ‘Diploma in web designing’, ‘Qutub Minar - visiting time and student discounts’, ‘How do you take out time for your hobbies?’, ‘Social and Economic inequalities’
Number of unique scenarios | 37
Randomly selected 5 scenarios | ‘How to apply for a loan’, ‘Asking for the date/timing of the voting date’, ‘Housing/Colony’, ‘Learning Music’, ‘Challenges/Issues in Sports sector’
Avg number of speakers per conversation ± std dev | 2.0 ± 0.0
E.1 Source Selection
For the Wikipedia subset, we carefully chose English source sentences from various
Wikipedia categories to ensure broad coverage across different domains. Initially, we selected article
pages within those categories and identified all their sentences as potential candidates. For each of these sentences, we construct a context window with a block size of 3, which typically includes one sentence before and one after the candidate sentence. To satisfy the length criteria, we filter out sentences that are shorter than 6 words or longer than 80 words. To minimize overlaps with the FLORES-200 test set (Costa-jussà et al., 2022), we discard sentences that share a 4-gram or higher overlap with any sentence in the FLORES-200 dev and devtest sets. The candidate sentence domains are manually annotated as described above. Following this, we randomly select the final set of sentences based on domain and length constraints; the detailed buckets are presented in Figure 10. It is important to note that we did not translate all the sentences within the context block, deviating from the approach followed in FLORES. This deviation was necessary to ensure that the length and domain diversity constraints were met.
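A minimal sketch of the length and 4-gram overlap filters described above; the tokenization choice and function names are illustrative assumptions, not the original selection code.

```python
# Illustrative sketch of the candidate filters: length between 6 and 80 words,
# and no 4-gram overlap with the FLORES-200 dev/devtest sentences.
import re

def four_grams(text):
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)}

def filter_candidates(candidates, flores_sentences, min_len=6, max_len=80):
    flores_grams = set()
    for sent in flores_sentences:
        flores_grams |= four_grams(sent)
    kept = []
    for sent in candidates:
        n_words = len(sent.split())
        if not (min_len <= n_words <= max_len):
            continue  # outside the 6-80 word range
        if four_grams(sent) & flores_grams:
            continue  # shares a 4-gram with FLORES-200 dev/devtest
        kept.append(sent)
    return kept
```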
For the Web Sources subset, we identified various Govt. of India websites and digital libraries that could be sources of multi-domain content with a focus on Indian topics.
Many benchmarks, such as FLORES (Costa-jussà et al., 2022) and NTREX (Federmann et al., 2022), do not have a fair representation of India-centric content, and we try to address this in the creation of this subset.
We relied on PDF format documents to discover sentences that are hopefully not part of publicly
available crawls like CommonCrawl (Xue et al., 2021; Conneau et al., 2020) or IndicCorp (Kakwani et
al., 2020; Doddapaneni et al., 2023). The selection of sentences for translation follows a similar
procedure to the Wikipedia subset. Figure 10 provides the bucket-wise and domain-wise distribution.
For the Conversation subset, we first create English conversations with a set of prompts and
scenarios. The prompts are predefined topics or themes that are used to initiate a conversation. A
prompt can be thought of as the starting point of a conversation, which sets the tone and direction
for the interaction between the two speakers. For example, a prompt could be “Travel plans for the
summer” or “Discussing a new project at work”. The prompt is designed to encourage the speakers
to discuss a particular topic or theme, and it serves as the foundation for the conversation. On the
other hand, a scenario is a specific situation or context in which the conversation takes place. It
provides additional context for the speakers and helps to shape the conversation. For example, a
scenario could be “Planning a family vacation to Europe” or “Brainstorming ideas for a marketing
campaign”. The scenario provides a specific context for the prompt, which guides the speakers in
their conversation. To create a conversation, two annotators from our annotator team played out the two speaker roles. Once a conversation is ready, it is then translated into 22 Indic languages. During translation, the translators have the entire conversation context available to them.
Table 56: An example from the Conversation subset of IN22 featuring a conversation between two speakers: a kid and his mother. The example belongs to the cultural domain, with festivities as a topic, the prompt of 14th April being a holiday, and the scenario being ‘Historical importance’. Note that the speaker information is part of the metadata and is not part of the text to be translated. Each turn in the conversation is a distinct instance from the benchmark. It is possible to reconstruct a conversation using the metadata released along with the translations.
Speaker 1: Mom, let’s go for a movie tomorrow.
Speaker 1: I don’t have to go to school.
Speaker 1: It is a holiday.
Speaker 2: Oh, tomorrow is the 14th of April right?
Speaker 2: Your dad will also have the day off from work.
Speaker 2: We can make a movie plan!
Speaker 1: That’s a good news!
Speaker 1: Why is it a holiday though?
Speaker 1: Are all schools, colleges and offices closed tomorrow?
Speaker 2: It is Ambedkar Jayanti tomorrow!
Speaker 2: This day is celebrated annually to mark the birth of Dr. B. R Ambedkar.
Speaker 2: Have you heard of him?
Speaker 1: I think I have seen him in my History and Civics book.
Speaker 1: Is he related to our Constitution?
Speaker 2: Absolutely! He is known as the father of the Indian Constitution.
Speaker 2: He was a civil rights activist who played a major role in formulating the Constitution.
Speaker 2: He played a crucial part in shaping the vibrant democratic structure that India prides itself upon.
Speaker 1: I remember now!
F Translation Workflow
F.1 Translation Stages
Source Sentence Selection Stage. The workflow begins
with the selection of sentences to be translated based on various criteria such as domain coverage, length distribution, and licensing constraints. This helps ensure that the right set of sentences required for the project is shortlisted for translation. To ensure a broader vocabulary
coverage, the sentences are taken from multiple domains such as News, Tourism, Business,
Entertainment, History, Geography, Culture, Sports, and Health. Source Verification Stage. Once the
candidate pool of source sentences is created, it is verified by annotators to ensure the correctness
of the source sentences and metadata. This ensures that the sentences selected are valid, of good
quality, and translatable. Shoonya efficiently supports a verification workflow where the annotator
reads a sentence (with context) and selects any one of the given tags: 1. Clean, 2. Difficult
vocabulary, 3. Context Incomplete, 4. Ambiguous sentence, and 5. Profane. Sentences with minor
errors, such as spelling mistakes and punctuation errors, are corrected manually. If any sentence in a paragraph is discarded, the whole paragraph gets rejected, as context-agnostic translations might turn
out ambiguous. In addition, the annotators might also add metadata like the domain and the topic to
the source sentences. Translation Stage. The selected source sentences are translated by translators
across all 22 Indic languages. To ensure quality, standard translation guidelines have been developed
and iterated before starting the translation task. There is an active discussion amongst translators to
ensure consistency. Translators of one language team help translators of another language team who
are from the same language family or share geographical boundaries. This ensures the authenticity
of transliterated words and cross-cultural nuances and gives a human touch to the output. The
translator is provided with:
• The source sentence and three context sentences around the source
sentence to help resolve translation ambiguities.
• Translation outputs from one of the following engines (IndicTrans1, with fallback to Google Translate for unsupported languages), which can be post-
edited. Translators could post-edit, translate from scratch, or use any alternative MT system as a
starting point. Note that post-editing support is provided only for the creation of training data.
Providing MT as a reference helps translators speed up and overcome the existing mistakes in current
translation models. A few low-resource languages like Kashmiri, Konkani, and Santali, where MT
systems are not available, are supported by the output of other related languages such as Urdu,
Marathi, and Bengali. This helps translators of low-resource languages to reuse syntactic structures
and vocabulary from related languages (as long as such vocabulary is acceptable in the target
language). To create test sets, the translators are expected to translate the sentences from scratch
and are not shown any outputs from an MT system.
• To help translate technical vocabulary, the
translators can consult dictionaries and glossaries using IndicGlossary35. IndicGlossary contains
approximately 2 million glossary items across 13 different Indic language pairs and about 20 domains
aggregated from various sources. These glossaries are sourced from the Commission for Scientific
and Technical Terminology (CSTT) and Technology Development for Indian Languages (TDIL) which
are the recommended sources for translation terminologies for different domains (Science,
Engineering/Technology, Medical Science, Humanities, Social Sciences, Agricultural Science, and
Veterinary Science). For some low-resource languages, some translators were not proficient in
English but had proficiency in another Indic language (called the pivot language). For these
languages, the translators are provided with the pivot language translation, which they use to
translate into their native language. We used this method for the following languages: Dogri (pivot
Hindi), Konkani (pivot Marathi), Maithili (pivot Hindi), and Santali (pivot Bengali).
35 https://github.com/AI4Bharat/Indic-Glossaries
Quality Check Stage. Simultaneous review of the translated sentences is
required, as it helps provide feedback to the translators and improves the overall quality. For this, we
have dedicated reviewers in each language who are translators with 5+ years of experience. The job
of the reviewer is to improve the overall quality of the translation by correcting grammatical errors (if
any), choosing better syntactic structures (if required), and rectifying inappropriate dialectical
features to make the translations more standard. The reviewer manually verifies and corrects each
translated sentence (if needed) to ensure adherence to the guidelines by selecting any one of the
options on Shoonya: 1. Accepted, 2. Accepted with Changes, and 3. Rejected. Rejected sentences go back to the translator with a note from the reviewer. The translator then corrects the translation based on the inputs provided in the reviewer's note. The corrected sentences then go back to the reviewer for a second round of review.
F.2 Translation Guidelines
We developed an extensive
set of translation guidelines to help the translators and ensure translation consistency and quality
across annotators and languages. These have been developed starting with the guidelines prepared
by LDC36 for the BOLT Chinese-English translation task and further adapted for our scenarios and
tasks. It is challenging to translate in 22 languages from 4 different language families, following the
same set of rules, as syntax and availability of resources vary drastically across them. However, the
guidelines were created considering the main goal of “getting as natural translations as possible”. In
the guidelines, we ensured the inclusion of all unique linguistic features and challenges, such as distinct word orders (SVO in Kashmiri), person-number-gender (PNG) agreement, tense/aspect differentiation in Manipuri, sociocultural nuances, extreme dialectal variation, right-to-left writing (Urdu), scripts like Meitei Mayek and Ol Chiki, languages which do not support long syntactic structures like those of English (e.g., sentences with many subordinate clauses), languages spoken in multiple regions such as Sindhi, the unavailability of modern vocabulary in languages like Sanskrit, the inaccessibility of domain-specific dictionaries and glossaries in languages like Bodo and Santali, and reviving the original form of languages like Assamese and Odia, which are highly influenced by high-resource languages in the same area (e.g., Bengali). The
detailed guidelines are published as a standalone document here37. Some key highlights from these guidelines:
• The general principle is that the translation should maintain the meaning, style, tone, and register of the source. No information should be added or deleted.
• Official native scripts of the languages should be used.
• Named entities and borrowed words can either be translated or transliterated. The exact choice depends on the accepted convention in the language, if both choices exist. We avoid coining new translations if none exist, and the words are transliterated instead.
• Numbers, dates, and units are to be handled as per natural conventions in the target language.
• In the context of historical events/people, translators can use more formal/older conventions or terms. For more recent events/people, using more casual/colloquial conventions or terms is preferred.
• For test sets, sentences would be translated from scratch without aid from any MT output to avoid bias towards outputs of any MT system.
F.3 Shoonya Translation Interface
Translations are performed
using the translation task supported in Shoonya8 . Shoonya has helped improve translator
productivity and project management by providing features like transliteration support, context view,
post-editing, quality control, and cross-lingual support. Performing reviews in real-time has helped
the team improve the quality of translations whilst rectifying their mistakes. Shoonya supports right-
to-left writing, which helps Urdu and Kashmiri translators to speed up their typing.
36https://catalog.ldc.upenn.edu/docs/LDC2008T18/
37https://github.com/AI4Bharat/IndicTrans2/blob/main/translation_guidelines.pdf
Simple
features like ‘Find and Replace’, marking sentences as drafts, getting randomized sentences across
domains, daily progress tab, etc. helped translators improve their productivity and collaborate
closely with their peers. Below is a summary of Shoonya’s features that have benefitted the translation task.
• Transliteration Support: Romanized input with automatic transliteration to native scripts to help translators not proficient in native script keyboards. The transliteration is powered by the open-source IndicXlit models (Madhani et al., 2023), which provide transliteration support for 20+ Indian languages (a small usage sketch follows this list).
• Context View: When translating a sentence, it helps to have the context in which the sentence is being translated to resolve any ambiguities. Shoonya allows translators to see paragraph-level context when translating an individual sentence.
• Post-Editing: Shoonya enables populating automatic translations from IndicTrans1 models, currently supporting 11 Indic languages. The translators can post-edit these initial translations.
• Quality Control: Shoonya offers various automated maker-checker flows to evaluate the quality of translated data. To further ensure quality, we implement a two-level maker-checker paradigm, in which an experienced reviewer verifies each translation for conformance to the translation guidelines. This approach involves two levels of processing for each sentence, providing a robust mechanism for ensuring high translation quality.
• Cross-lingual Support: For low-resource languages, Shoonya supports showing annotators translations in other related languages. For instance, given the task of translating English to Santali, the translators may have difficulty fully understanding the English sentence. In such cases, we also show the translators a Bengali translation (a language they are proficient in) of the same sentence to aid them with the task. This is a common scenario for many low-resource languages (Costa-jussà et al., 2022; Ebrahimi et al., 2022; Marivate et al., 2020).
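The transliteration feature can be reproduced outside Shoonya with the open-source IndicXlit release. The sketch below is a minimal example assuming the publicly distributed ai4bharat-transliteration package and its XlitEngine interface; the package name, constructor arguments, and output format are assumptions based on that release, not details specified in this paper.

```python
# Hedged sketch: romanized input -> native-script candidates with IndicXlit.
# Assumes the `ai4bharat-transliteration` package; API details may differ.
from ai4bharat.transliteration import XlitEngine

engine = XlitEngine("hi", beam_width=4, rescore=True)  # Hindi engine (assumed arguments)

# Top-3 Devanagari candidates for a romanized word typed by an annotator.
candidates = engine.translit_word("bharat", topk=3)
print(candidates)  # expected to be a dict keyed by language code, e.g. {"hi": ["भारत", ...]}
```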
G Languages of India
This section provides an overview of Indian languages based on the 2011 census data, employing the language classification by Joshi et al. (2020).
Table 57: Overview of Indian languages. Number of native speakers as per the 2011 census. Language classification is done according to the taxonomy introduced by Joshi et al. (2020), which classifies languages into 6 classes from 0 to 5, with 0 indicating an extremely low-resource and 5 a high-resource language. Many of these languages are spoken across multiple states in the country. The Sample column shows the word "Bharat" written in the different scripts.
Language Code | Name | Family | Sub-family | Script | Sample | Class | #Native Speakers
asm_Beng | Assamese | Indo-Aryan | Eastern Indo-Aryan | Bengali | ভাৰত | 2 | 15.3M
ben_Beng | Bengali | Indo-Aryan | Eastern Indo-Aryan | Bengali | ভারত | 5 | 97.2M
brx_Deva | Bodo | Sino-Tibetan | Boroic | Devanagari | भारत | 1 | 1.4M
doi_Deva | Dogri | Indo-Aryan | Northern Indo-Aryan | Devanagari | भारत | 1 | 2.5M
gom_Deva | Konkani | Indo-Aryan | Southern Indo-Aryan | Devanagari | भारत | 1 | 2.2M
guj_Gujr | Gujarati | Indo-Aryan | Western Indo-Aryan | Gujarati | ભારત | 4 | 55.4M
hin_Deva | Hindi | Indo-Aryan | Central Indo-Aryan | Devanagari | भारत | 5 | 528.3M
kan_Knda | Kannada | Dravidian | South Dravidian | Kannada | ಭಾರತ | 5 | 43.7M
kas_Arab, kas_Deva | Kashmiri | Indo-Aryan | Northern Indo-Aryan | Perso-Arabic, Devanagari | بھارت, भारत | 1 | 6.7M
mai_Deva | Maithili | Indo-Aryan | Eastern Indo-Aryan | Devanagari | भारत | 1 | 13.5M
mal_Mlym | Malayalam | Dravidian | Southern Dravidian | Malayalam | ഭാരത് | 4 | 34.8M
mar_Deva | Marathi | Indo-Aryan | Southern Indo-Aryan | Devanagari | भारत | 4 | 83.0M
mni_Beng, mni_Mtei | Manipuri | Sino-Tibetan | Central Tibeto-Burman | Bengali, Meitei | ভারত, ꯚꯥꯔꯠ | 1 | 1.7M
npi_Deva | Nepali | Indo-Aryan | Northern Indo-Aryan | Devanagari | भारत | 2 | 2.9M
ory_Orya | Odia | Indo-Aryan | Eastern Indo-Aryan | Odia | ଭାରତ | 3 | 37.5M
pan_Guru | Punjabi | Indo-Aryan | North Western Indo-Aryan | Gurmukhi | ਭਾਰਤ | 3 | 33.1M
san_Deva | Sanskrit | Indo-Aryan | Indo-Aryan | Devanagari | भारत | 2 | 0.02M
sat_Olck | Santali | Austroasiatic | Munda | Ol Chiki | ᱵᱷᱟᱨᱚᱛ | 1 | 7.3M
snd_Arab, snd_Deva | Sindhi | Indo-Aryan | North Western Indo-Aryan | Arabic, Devanagari | ڀارت, भारत | 1 | 2.7M
tam_Taml | Tamil | Dravidian | South Dravidian | Tamil | பாரத | 4 | 69.0M
tel_Telu | Telugu | Dravidian | South Central Dravidian | Telugu | భారత్ | 4 | 81.1M
urd_Arab | Urdu | Indo-Aryan | Central Indo-Aryan | Urdu | بھارت | 5 | 50.7M
H Examples of language translation
This section shows a sentence translated into all Indian languages as an illustrative example.
Table 59: An example of the same sentence translated into all 22 Indian languages. The sentence translated is "All human beings are born free and equal in dignity and rights." from the UN Declaration on Human Rights. The romanized sentences are listed per language-script combination.
asm_Beng: Xokolu manuh swadhin hoi jonmogrohon kore aru marjyada aru odhikaar xokolure xoman.
ben_Beng: Manush jonmogotobhabe swadhin ebong somman o odhikar sobari soman.
brx_Deva: Gasibw subungyanw udangywi jwnwm jadwng arw gasibw subungni man arw mwnthaiphwra soman.
doi_Deva: Sabhe manukh maan-pratishtha teh adhikaarein de sandarbh ch sutaintar teh ek barobar paida haundey n.
gom_Deva: Sagle manis swatantra mhun jalmak yetat and pratishtha and hakkache nadren samaan astat.
guj_Gujr: Badha manushyo swatantra janme chhe ane garima ane adhikaro ma saman hoy chhe.
hin_Deva: Sabhi manushya swatantra paida hote hain aur garima aur adhikaron mein samaan hote hain.
kan_Knda: Ellā manushyaru huttinindalē svatantrarū ghanate hāgū hakkugala drishtiyinda samānarū āgiruttāre.
kas_Deva: Salim insaan chi janmi Azad be braaber dignity be haq.
kas_Arab: Tamaam insaan chi Azad ta yezath ta haqooqs manz braaber.
mai_Deva: Sabh manukh swatantra paida hoyat achhi aaor adhikaar aa prtishtha me barabar hoyat achhi.
mal_Mlym: Ellaa manushyarum swathanthrarayi janichavarum oppam anthassilum avakaashangalilum thulyarumaanu.
mar_Deva: Sarv manushya swatantrya vyakti mhanun janmala yetat aani pratishtha ani hakkanchya dushtikonatun samaan astat.
mni_Beng: Mioiba khudingmak ningtamba amasung ikaikhumnaba amasung hakselgi lamda chap mannana leiminnari.
mni_Mtei: Mioiba pumnamak ningtamba amasung ikaikhumnaba amasung haksing mannana pok I.
npi_Deva: Sabai maanis swatrantra janmanchan ra sammaan tathaa adhikaarmaa samaan chan.
ory_Orya: Samasta manushya janmagata bhaabe swadhina o sammaana tathaa adhikaara drushtiru samaan.
pan_Guru: Sabh manukh azaad paida hunde han ate mann saman te adhekara vich barabar han.
san_Deva: Sarve maanavajivinah janmanah eva svatantraah, maanadrishtyaa adhikaaradrishtyaa samaanaashca.
sat_Olck: Sanam manmi ge phurgal ated ku janamog-a ar man ar aydar re ku soman giya.
snd_Arab: Sabh insaan aazad paida thiya aahin, ain izzat ain hakkan mein barabar aahin.
snd_Deva: Sabhai Insan aazad, ain maan ain hakan mein hik jahida javal aahin.
tam_Taml: Manitharkal pirappaal suthanthiramaanavarkal, matrum sama urimaykalum kanniyamum kondavarkal.
tel_Telu: Manushulantaa svecchagaa gouravamaryadalu, hakkulalo samantvamto pudataru.
urd_Arab: Tamam insan azad paida hue hain aur izzat aur huqooq ke lihaz se barabar hain.
I Language Coverage of various MT models
This section provides an overview of the Indian languages supported by different open-source and commercial NMT systems.
Table 61: Coverage of the 22 languages listed in the 8th Schedule of the Constitution of India by various NMT systems.
Language | IndicTrans1 | IndicTrans2 | Azure | NLLB-200 | Google Translate
asm_Beng | ✓ | ✓ | ✓ | ✓ | ✓
ben_Beng | ✓ | ✓ | ✓ | ✓ | ✓
brx_Deva | ✗ | ✓ | ✗ | ✗ | ✗
doi_Deva | ✗ | ✓ | ✗ | ✗ | ✓
gom_Deva | ✗ | ✓ | ✓ | ✗ | ✓
guj_Gujr | ✓ | ✓ | ✓ | ✓ | ✓
hin_Deva | ✓ | ✓ | ✓ | ✓ | ✓
kan_Knda | ✓ | ✓ | ✓ | ✓ | ✓
kas_Arab | ✗ | ✓ | ✗ | ✓ | ✗
kas_Deva | ✗ | ✓ | ✗ | ✓ | ✗
mai_Deva | ✗ | ✓ | ✓ | ✓ | ✓
mal_Mlym | ✓ | ✓ | ✓ | ✓ | ✓
mar_Deva | ✓ | ✓ | ✓ | ✓ | ✓
mni_Beng | ✗ | ✓ | ✗ | ✓ | ✗
mni_Mtei | ✗ | ✓ | ✗ | ✗ | ✓
npi_Deva | ✗ | ✓ | ✓ | ✓ | ✓
ory_Orya | ✓ | ✓ | ✓ | ✓ | ✓
pan_Guru | ✓ | ✓ | ✓ | ✓ | ✓
san_Deva | ✗ | ✓ | ✗ | ✓ | ✓
sat_Olck | ✗ | ✓ | ✗ | ✓ | ✗
snd_Arab | ✗ | ✓ | ✓ | ✓ | ✓
snd_Deva | ✗ | ✓ | ✗ | ✗ | ✗
tam_Taml | ✓ | ✓ | ✓ | ✓ | ✓
tel_Telu | ✓ | ✓ | ✓ | ✓ | ✓
urd_Arab | ✗ | ✓ | ✓ | ✓ | ✓
# languages | 11 | 22 | 16 | 19 | 19
# language-script combinations | 11 | 25 | 16 | 20 | 19
J Model Card - IndicTrans2
Following Mitchell et al. (2019), we provide a model card for our
IndicTrans2 models. J.1 Model Details • Person or organization developing model: IndicTrans2
models are multilingual translation models developed by AI4Bharat.38 • Model data: IndicTrans2
models were released on May 26, 2023. • Model version: IndicTrans2 models described in this paper
are version 1.0.0. • Model type: IndicTrans2 models are 18-layer encoder-decoder transformer
models with 1.1B parameters each, one for the English to Indic translation direction and the other for Indic to English translation. • Information about training algorithms, parameters, fairness constraints or
other applied approaches, and features: IndicTrans2 models were trained with the exact training
configuration described in Section 5.5 and the training data described in Section 5.1. Sections 5.2, 5.3
and 5.7 describe the data preprocessing, tokenization, and postprocessing steps, respectively, that were followed during training. • Paper or other resources for more information: See
the rest of this paper for more details on IndicTrans2 models. More details are also available in
IndicTrans2, 39 our open-source GitHub repository. • License: IndicTrans2 models are made available
through a permissive MIT license.40 • Where to send questions or comments about the model:
Please open an issue41 on our open-source GitHub repository.
J.2 Intended Use
• Primary intended
uses: IndicTrans2 models are machine translation models designed for research and commercial
purposes, especially for Indic languages. These models enable the translation of the text, either in
single or batch format, across 22 different Indic languages, including 25 language-script combinations
to and from English. In addition, these models offer support for translation between Indic languages
using a pivot-based approach (a minimal sketch of this pivoting appears at the end of this subsection). Further information on how to use the models can be found at
IndicTrans2, our open-source GitHub repository. • Primary intended users: The primary intended
users of the IndicTrans2 models are: – Researchers aiming to explore and advance language
technologies for Indic languages. – Individuals seeking to translate content to and from the
supported Indic languages for various day-to-day use cases. – Organizations interested in translating
their proprietary or internal content into the supported Indic languages. • Out-of-scope use cases:
IndicTrans2 models are released under MIT license without any usage limitations.
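Since Indic-Indic translation is served by pivoting through English, the rough sketch below shows the control flow under stated assumptions: the indic_to_en and en_to_indic callables are hypothetical stand-ins for the Indic-En and En-Indic IndicTrans2 models (loaded as described in the GitHub repository) and are not part of any documented API.

```python
# Hedged sketch of pivot-based Indic-to-Indic translation: source -> English -> target.
# `indic_to_en` / `en_to_indic` are hypothetical callables wrapping the two
# IndicTrans2 directions; see the open-source repository for the real interface.
from typing import Callable, List

TranslateFn = Callable[[List[str], str], List[str]]  # (sentences, language tag) -> translations

def pivot_translate(sentences: List[str], src_lang: str, tgt_lang: str,
                    indic_to_en: TranslateFn, en_to_indic: TranslateFn) -> List[str]:
    english = indic_to_en(sentences, src_lang)   # step 1: source Indic language -> English pivot
    return en_to_indic(english, tgt_lang)        # step 2: English pivot -> target Indic language

# Example with hypothetical wrappers:
# pivot_translate(batch, "guj_Gujr", "tam_Taml", guj_to_en_model, en_to_tam_model)
```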
38https://ai4bharat.iitm.ac.in/ 39https://github.com/ai4Bharat/IndicTrans2
40https://opensource.org/license/mit/
41https://github.com/AI4Bharat/IndicTrans2/issues
J.3 Data, Metrics, Limitations,
and Recommendations • Training dataset: Section 5.1 described the parallel corpora used for training
our models. Table 1 provides the statistics of the bitext pairs from different sources. In addition, we
augment the training data with synthetic data as described in Section 5.6 generated from the
intermediate IndicTrans2 models for training the final versions of the IndicTrans2 models. • Fine-
tuning dataset: BPCC-H-Wiki (see Section 3.3) and NLLB-Seed (Costa-jussà et al., 2022) were used for
the fine-tuning of the IndicTrans2 models as described in Section 5.5. Table 1 provides the statistics
of the gold-standard bitext pairs from different sources. • Evaluation dataset: Section 6.2 describes
all the benchmarks including FLORES-200 and our IN22 considered for evaluation of our IndicTrans2
models. The generation and evaluation procedure for the IndicTrans2 models are outlined in Section
6.4 and Section 6.5. Additionally, the baselines compared in this paper also follow the same
evaluation procedure as the IndicTrans2 models. • Metrics: IndicTrans2 models were evaluated using
several metrics such as chrF++, BLEU and COMET as described in Section 6.3. We use chrF++ as our
primary metric (a brief chrF++ scoring example follows at the end of this section). In addition, we also conduct a human evaluation with the XSTS protocol on a small
portion of IN22 Combined evaluation set (see Appendix C). • Limitations: Section 9 describes the
known caveats of the IndicTrans2 models. • Recommendations for future work: IndicTrans2 serves as
a strong foundational translation model with extensive support for all 22 scheduled Indic languages.
However, it is important to acknowledge that there are additional languages that are currently not
supported (see Section 2). There is a potential to expand IndicTrans2 models to support more
languages or improve the performance of the existing supported languages through minimal fine-
tuning.
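As a small illustration of the primary metric, the snippet below scores a toy hypothesis set with chrF++ using the sacrebleu library; the sentences are placeholders unrelated to IN22 or FLORES-200, and the exact sacrebleu API version is an assumption.

```python
# Hedged example: corpus-level chrF++ (chrF with word n-grams, word_order=2) via sacrebleu.
import sacrebleu

hypotheses = ["the committee approved the new policy"]          # system outputs (placeholders)
references = [["the committee has approved the new policy"]]    # one reference stream

chrf_pp = sacrebleu.CHRF(word_order=2)   # word_order=2 turns chrF into chrF++
print(chrf_pp.corpus_score(hypotheses, references))
```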
K Model Card - IndicTrans2-M2M
Following Mitchell et al. (2019), we provide a model card for IndicTrans2-M2M
models. K.1 Model Details • Person or organization developing model: IndicTrans2-M2M models are
multilingual translation models developed by AI4Bharat.42 • Model data: IndicTrans2-M2M models
were released on Dec 01, 2023. • Model version: IndicTrans2-M2M models described in this paper
are version 1.0.0. • Model type: IndicTrans2-M2M and IndicTrans2-Dist-M2M models are 18-layer
encoder-decoder transformer models with 1.2B parameters and the compact variant with 350M
parameters, respectively, supporting Indic to Indic translation. • Information about training
algorithms, parameters, fairness constraints or other applied approaches, and features: IndicTrans2-
M2M and IndicTrans2-Dist-M2M models were trained with the exact training configuration and the
training data described in Section 7.5.2. Sections 5.2, 5.3 and 5.7 describe the data preprocessing, tokenization, and postprocessing steps, respectively, that were followed during training.
• Paper or other resources for more information: See the rest of this paper for more details on
IndicTrans2- M2M and IndicTrans2-Dist-M2M models. More details are also available in IndicTrans2,
43 our open-source GitHub repository. • License: IndicTrans2-M2M and IndicTrans2-Dist-M2M
models are made available through a permissive MIT license.44 • Where to send questions or
comments about the model: Please open an issue45 on our open-source GitHub repository. K.2
Intended Use • Primary intended uses: IndicTrans2-M2M and IndicTrans2-Dist-M2M models are
machine translation models designed for research and commercial purposes, especially for Indic
languages. These models enable the translation of the text, either in single or batch format, between
22 different Indic languages. Further information on how to use the models can be found at
IndicTrans2, our open-source GitHub repository. • Primary intended users: The primary intended
users of the IndicTrans2-M2M and IndicTrans2-Dist-M2M models are: – Researchers aiming to
explore and advance language technologies for Indic languages. – Individuals seeking to translate
content to and from the supported Indic languages for various day-to-day use cases. – Organizations
interested in translating their proprietary or internal content into the supported Indic languages. •
Out-of-scope use cases: IndicTrans2-M2M and IndicTrans2-Dist-M2M are released under MIT license
without any usage limitations. 42https://ai4bharat.iitm.ac.in/
43https://github.com/ai4Bharat/IndicTrans2 44https://opensource.org/license/mit/
45https://github.com/AI4Bharat/IndicTrans2/issues
K.3 Data, Metrics, Limitations, and Recommendations
• Training
dataset: Section 7.5.2 described the parallel corpora used for training our models. • Evaluation
dataset: Section 6.2 describes all the benchmarks including FLORES-200 and our IN22 considered for
evaluation of our IndicTrans2-M2M and IndicTrans2-Dist-M2M models. The generation and
evaluation procedure for the IndicTrans2-Dist models are outlined in Section 6.4 and Section 6.5.
Additionally, the baselines compared in this paper also follow the same evaluation procedure as the
IndicTrans2-M2M and IndicTrans2-Dist-M2M models. • Metrics: IndicTrans2-M2M and IndicTrans2-
Dist-M2M models were evaluated using several metrics such as chrF++, BLEU and COMET as
described in Section 6.3. We use chrF++ as our primary metric. • Limitations: Section 9 describes the
known caveats of the IndicTrans2-M2M and IndicTrans2-Dist-M2M models. We also do not
conduct an XSTS human evaluation for the IndicTrans2-Dist models. • Recommendations for future
work: IndicTrans2-M2M and IndicTrans2-Dist-M2M models serve as strong foundational translation models with extensive support for all 22 scheduled Indic languages. However, it is
important to acknowledge that there are additional languages that are currently not supported (see
Section 2). There is a potential to expand IndicTrans2-M2M and IndicTrans2-Dist-M2M models to
support more languages or improve the performance of the existing supported languages through
minimal fine-tuning.
L Model Card - IndicTrans2-Dist
Following Mitchell et al. (2019), we provide a model card for IndicTrans2-Dist
models. L.1 Model Details • Person or organization developing model: IndicTrans2-Dist models are
multilingual translation models developed by AI4Bharat.46 • Model data: IndicTrans2-Dist models
were released on Dec 01, 2023. • Model version: IndicTrans2-Dist models described in this paper are
version 1.0.0. • Model type: IndicTrans2-Dist models are 18-layer encoder-decoder transformer
models with 211M parameters each, one for the English to Indic translation direction and the other for Indic to English translation. • Information about training algorithms, parameters, fairness constraints or
other applied approaches, and features: IndicTrans2-Dist models were trained with the exact training
configuration described in Appendix D and the training data described in Section 7.6. Sections 5.2,
5.3 and 5.7 describe the data preprocessing, tokenization, and postprocessing steps, respectively, that were followed during training. • Paper or other resources for more
information: See the rest of this paper for more details on IndicTrans2- Dist models. More details are
also available in IndicTrans2, 47 our open-source GitHub repository. • License: IndicTrans2-Dist
models are made available through a permissive MIT license.48 • Where to send questions or
comments about the model: Please open an issue49 on our open-source GitHub repository. L.2
Intended Use • Primary intended uses: IndicTrans2-Dist models are machine translation models
designed for research and commercial purposes, especially for Indic languages. These models enable
the translation of the text, either in single or batch format, across 22 different Indic languages,
including 25 language-script combinations to and from English. In addition, these models offer
support for translation between Indic languages using a pivot-based approach. Further information
on how to use the models can be found at IndicTrans2, our open-source GitHub repository. • Primary
intended users: The primary intended users of the IndicTrans2-Dist models are: – Researchers aiming
to explore and advance language technologies for Indic languages. – Individuals seeking to translate
content to and from the supported Indic languages for various day-to-day use cases. – Organizations
interested in translating their proprietary or internal content into the supported Indic languages. •
Out-of-scope use cases: IndicTrans2 models are released under MIT license without any usage
limitations. 46https://ai4bharat.iitm.ac.in/ 47https://github.com/ai4Bharat/IndicTrans2
48https://opensource.org/license/mit/
49https://github.com/AI4Bharat/IndicTrans2/issues
L.3 Data, Metrics, Limitations,
and Recommendations • Training dataset: Section 7.6 described the parallel corpora used for training
our models. • Fine-tuning dataset: BPCC-H-Wiki (see Section 3.3) and NLLB-Seed (Costa-jussà et al.,
2022) were used for the fine-tuning of the IndicTrans2-Dist models as described in Section 5.5. Table
1 provides the statistics of the gold-standard bitext pairs from different sources. • Evaluation dataset:
Section 6.2 describes all the benchmarks including FLORES-200 and our IN22 considered for
evaluation of our IndicTrans2-Dist models. The generation and evaluation procedure for the
IndicTrans2-Dist models are outlined in Section 6.4 and Section 6.5. Additionally, the baselines
compared in this paper also follow the same evaluation procedure as the IndicTrans2-Dist models. •
Metrics: IndicTrans2-Dist models were evaluated using several metrics such as chrF++, BLEU and
COMET as described in Section 6.3. We use chrF++ as our primary metric. • Limitations: Section 9
describes the known caveats of the IndicTrans2-Dist models. We also do not conduct an XSTS human
evaluation for the IndicTrans2-Dist models. • Recommendations for future work: IndicTrans2-Dist
serves as a compact yet strong foundational translation model with extensive support for all 22
scheduled Indic languages. However, it is important to acknowledge that there are additional
languages that are currently not supported (see Section 2). There is a potential to expand
IndicTrans2-Dist models to support more languages or improve the performance of the existing
supported languages through minimal fine-tuning.
M Dataset Card
Following Gebru et al. (2021); Pushkarna et al. (2022), we
provide a dataset card for our Bharat Parallel Corpus Collection, the dataset used to train IndicTrans2
as well as IN22, our benchmark testsets for Indic languages. M.1 Dataset Description • Dataset
summary: – Bharat Parallel Corpus Collection (BPCC) is a comprehensive and publicly accessible
parallel corpus comprising existing and newly added data for all 22 scheduled Indic languages. It
consists of two components: BPCC-Mined and BPCC-Human. BPCC contains a total of ~230M bitext
pairs (see Table 1). BPCC-Mined comprises ~228 million pairs, with around ~126 million pairs newly
mined as part of this work. On the other hand, BPCC-Human consists of 2.2 million gold standard En-
X pairs, with additional contributions of 644K bitext pairs from English sentences sourced from
Wikipedia (forming the Bharat Parallel Corpus Collection-H-Wiki subset) and 139K sentences covering
content from day-to-day use cases (forming the Bharat Parallel Corpus Collection-H-Daily subset). It is
worth highlighting that BPCC provides the first available datasets for many languages and
significantly increases the available data for all languages covered. – IN22 is a newly created
comprehensive benchmark for evaluating machine translation performance in multi-domain, n-way
parallel contexts across 22 Indic languages. It has been created from three distinct subsets, namely
IN22-Wiki, IN22-Web, and IN22-Conv. The Wikipedia and Web sources subsets offer diverse content
spanning news, entertainment, culture, legal, and India-centric topics. IN22-Wiki and IN22-Web have
been combined and considered for evaluation purposes and released as IN22-Gen. Meanwhile, the
conversation domain subset IN22-Conv is designed to assess translation quality in typical day-to-day
conversational-style applications. • How to use the data? We provide the links to access the data and
directions for usage in the README of IndicTrans2, 50 our open-sourced GitHub repository. •
Supported tasks and leaderboards: The provided data is primarily intended for training machine
translation models. It serves as a valuable training corpus for developing and improving such models.
Furthermore, the IN22 benchmark dataset is included, which serves as a robust evaluation set for
assessing the performance of machine translation models. Initial results of the IndicTrans2 models
and the existing baselines are available in our open-source GitHub repository, providing insights and
comparisons as of the release date. • Languages: The dataset covers a total of 22 scheduled Indic
languages with multiple scripts, amounting to a total of 25 language-script combinations. M.2
Dataset Creation • Curation rationale: – BPCC is created to train and improve machine translation
models. It is a valuable resource for developing and improving such models. Section 4 provides a
detailed description of the procedure followed for the mined data collection, referred to as BPCC-
Mined. Similarly, Section 3.3 outlines the motivation and annotation procedure for creating high-
quality seed data, referred to as BPCC-Human. Additionally, section 4.3 provides insights into the
curation and filtration process of the existing mined parallel corpora. – IN22 benchmark dataset is
created to serve as a reliable evaluation set for assessing the performance of machine translation
models. Section 3.2 provides a comprehensive discussion about the subsets of the evaluation
benchmark and the creation procedure for each subset. • Source data:
50https://github.com/AI4Bharat/IndicTrans2
– Table 1 summarizes the parallel corpora from different sources. BPCC-Mined
(see Section 4) component is typically mined from IndicCorp v2 (Doddapaneni et al., 2023) and the
Internet Archive.51 BPCCHuman-Wiki is a multi-domain seed data collection that is sourced from
Wikipedia articles whereas BPCC-Human-Daily is an in-house created dataset covering content from
day-to-day conversations and use cases (see Section 3.3). – IN22 benchmark is a comprehensive
multi-domain benchmark that consists of three subsets: IN22-Wiki, sourced from Wikipedia articles;
IN22-Web, sourced from PDFs available on various government websites and open-source books;
and IN22-Conv, an in-house benchmark created through the interplay between annotators (see
Section 3.2). • Annotations: The annotations were performed on the Shoonya platform.52 Section 3
provides further details about the procedure and the guidelines followed for annotations. • Personal
and sensitive information: Given that a substantial portion of the dataset is mined from publicly
available websites at a large scale, we acknowledge the possibility of unintentional inclusion of
personal and sensitive information. If there are any concerns regarding potential leakages of such
information, please reach out to [email protected] for further assistance and resolution. M.3
Considerations for Using the Data • Social impact of the dataset: The dataset has a notable social
impact, as it is specifically constructed and curated to improve the translation quality of all 22
scheduled Indic languages. It also provides support for low-resourced languages that utilize multiple
scripts. This dataset contributes to the overall improvement of translation models, benefiting a wide
range of users, helping bridge language barriers, and facilitating better communication and
understanding across diverse linguistic communities. • Discussion of biases: The current work does
not explicitly examine biases in the data. However, we acknowledge the importance of studying
biases and hope to conduct further investigations in this area in the future. M.4 Additional
Information • Dataset Curators: The dataset was curated by AI4Bharat, who collected data from
various existing sources, including contributions from the existing dataset contributors. AI4Bharat
also releases mined data, in-house created seed data, and benchmarks. • Licensing Information:
Bharat Parallel Corpus Collection (BPCC) consists of the largest publicly available parallel corpora for
Indic languages. It includes various types of corpora obtained from different sources. The licensing
information for each category is as follows: – Existing Mined Corpora (NLLB & Samanantar): These
corpora are released under the CC0 license.53 – Existing Seed Corpora (NLLB, ILCI, MASSIVE): The
seed corpora are also released under the CC0 license.53 – Newly added mined corpora
(Samanantar++ & Comparable): The newly added mined corpora are also released under the CC0
license.53 – Newly added seed corpora (Wiki & Daily): The newly added seed corpora are released
under the CC BY 4.0 license.54 – Newly created IN-22 test set: The IN22 test set is released under the
CC BY 4.0 license.54 51https://archive.org 52https://ai4bharat.iitm.ac.in/shoonya
53https://creativecommons.org/share-your-work/public-domain/cc0/
54http://creativecommons.org/licenses/by/4.0 95

0.7 Hand Gesture Recognition in Indian Sign Language Using Deep Learning †
Harsh Kumar Vashisth, Tuhin Tarafder, Rehan Aziz, Mamta Arora * and Alpana
Department of Computer Science and Technology, Manav Rachna University, Faridabad 121010, Haryana, India; [email protected] (H.K.V.); [email protected] (T.T.); [email protected] (R.A.); [email protected] (A.)
* Correspondence: [email protected]
† Presented at the International Conference on Recent Advances on Science and Engineering, Dubai, United Arab Emirates, 4–5 October 2023.
Proceeding Paper. Published: 21 December 2023. https://doi.org/10.3390/engproc2023059096
Citation: Vashisth, H.K.; Tarafder, T.; Aziz, R.; Arora, M.; A. Hand Gesture Recognition in Indian Sign Language Using Deep Learning. Eng. Proc. 2023, 59, 96. https://doi.org/10.3390/engproc2023059096
Academic Editors: Nithesh Naik, Rajiv Selvam, Pavan Hiremath, Suhas Kowshik CS and Ritesh Ramakrishna Bhat
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract: Sign
languages are important for the deaf and hard-of-hearing communities, as they provide a means of
communication and expression. However, many people outside of the deaf community are not
familiar with sign languages, which can lead to communication barriers and exclusion. Each country
and culture have its own sign language, and some countries have multiple sign languages. Indian Sign
Language (ISL) is a visual language used by the deaf and hard-of-hearing community in India. It is a
complete language, with its own grammar and syntax, and is used to convey information through
hand gestures, facial expressions, and body language. Over time, ISL has evolved into its own distinct
language, with regional variations and dialects. Recognizing hand gestures in sign languages is a
challenging task due to the high variability in hand shapes, movements, and orientations. ISL uses a
combination of one-handed and two-handed gestures, which makes it fundamentally different from
other common sign languages like American Sign Language (ASL). This paper aims to address the
communication gap between specially abled (deaf) people who can only express themselves through
the Indian sign language and those who do not understand it, thereby improving accessibility and
communication for sign language users. This is achieved by using and implementing Convolutional
Neural Networks on our self-made dataset. This is a necessary step, as none of the existing datasets
fulfills the need for real-world images. We have achieved 0.0178 loss and 99% accuracy on our
dataset. Keywords: CNN; AI; ISL; static hand gestures; deep learning 1. Introduction Hand gestures
are a convenient and easy mode of communication for humans. An application of gestures in
communication is sign language. Sign language is a mode of non-verbal communication, mostly used
by people with speech or hearing impairments. There are many sign language variations that are
used by people to communicate. This paper works on Indian Sign Language. For people who know sign languages, they are a convenient and easy way to express thoughts and emotions. But it is very difficult for people who do not understand sign language to communicate
with those who use it. This requires a trained translator to be present in many situations. A
computer-based system that can interpret sign language can be very helpful in this aspect. It can be
used to improve communication by helping people understand sign languages without specialized
training. Over the past few years, much research has been conducted on gesture recognition and sign
language recognition. But most of it focuses on other sign languages. There are approximately 44 million people who have disabling deafness and hearing loss [1]. Hence, this is the motivation to proceed with the idea of developing a machine learning model to recognize sign languages in real time. This will help a huge part of the population communicate with one another. This work addresses the following points:
1. Implementation of machine learning and deep learning algorithms used in hand gesture recognition.
2. Fine-tuning of the deep learning model to achieve maximum performance.
3. Identification and discussion of the flaws or drawbacks of this model that turn up in applications of the algorithms.
4. How these flaws can be corrected currently, or what can be done about them in the future.
2. Literature Survey
Several studies have been conducted in the fields of gesture
recognition, face detection, and their applications in sign language recognition and facial expression.
Some of our reviewed papers have been listed here. Rosalina et al. [2] have taken about 3900 raw
image files to achieve the same with over 39 alphabets, numbers, and punctuation marks, in
accordance with SIBI (Sistem Isyarat Bahasa Indonesia), and the accuracy turned out to be 90%.
Computer vision was used to capture the image and then extract essential data from it. ANN
(Artificial Neural Network) was then used to classify the images, and at last, speech recognition was
used to translate the input speech in the form of NATO phonetic language and then translate it to
sign language. Image processing is conducted by morphological operators, mainly erosion and
dilation. The image is to be segmented. The whole process consists of four stages: image capture
(static photos in RGB color space), image processing (separate background from hand, HSV range to
separate the same, blurred, then dilated), feature extraction (finding contours that are basically
edges of the same color range), and classification (ANN will take B/W (Black and White) images).
Lastly, they translated the given letters into speech, which can be conducted using the NATO
alphabets. B. Hangün et al. [3] compare the performance of functions related to image processing as
implemented in the OpenCV library. Image processing requires more computational power than
regular use because of mathematical operations such as matrix inversion, transposition of a matrix,
matrix convolution, Fourier transform, etc. Images are taken as matrices, and as the image resolution
increases, so does the matrix order. CPUs and GPUs work on different principles; CPUs operate on
series processing (one task at a time) and GPUs on parallel processing (multiple tasks at the same
time). Although this does not completely hold true today, it is still practically true. CUDA (Compute
Unified Device Architecture) is a parallel programming model developed by NVIDIA that works on
NVIDIA GPUs. Performance in this paper is based on time vs. image size for tasks such as resizing,
thresholding (making the image black and white), histogram equalization (the process of adjusting
bad contrast), and edge detection. OpenCV is an important AI tool. CUDA is an API made in 2007. The
testing was conducted using a modest consumer-grade GTX 970 and i7 6700. Results showed GPUs
are faster at each of the four tasks, especially at higher-quality images. S. Li [4] et al. performed a
review of the current state of deep learning applications in facial expression recognition. According
to the review, the increase in computational power and success in deep learning have made it
efficient for automatic Facial Expression Recognition (FER). The two issues are overfitting (due to a
lack of training data) and non-expression-related issues such as illumination, head position,
occlusion, identity bias, and age. Anger, fear, disgust, happiness, sadness, surprise, and contempt are
the seven basic expressions humans perceive. FER is of two types: static (1 image) and dynamic
(continuous frames). In the earlier research, shallow learning was used, but now in this paper, real-
time images (due to the increase in data and computational power) are used. There are many
available datasets to train the model to avoid overfitting as much as possible, such as CK+, MMI,
Oulu-CASIA, JAFFE, FER2013, AFEW, SFEW, etc. Image processing is similar to the previous paper,
such as V&J for face alignment (GAN), CNN for augmentation, face normalization for illumination (IS, DCT, histogram normalization), and pose (Hasner, GAN), and face classification using SVM+CNN. D. M. Prasanna et al. [5] developed a real-time GUI-based face
recognition system using open-face Detection (HOG+SVM), feature extraction (HOG), and recognition
as the three stages. Several use-case scenarios are mentioned here. The issues are similar to those in
the previous paper. DNN (Deep Neural Network) was used for classification. O. K. Oyedotun et al. [6]
proposed a model that recognizes all 24 hand gestures as present in Thomas Moeslund’s gesture
recognition database. Different hand gestures can look similar when viewed from different angles in
a 2D image. This makes recognizing hand gestures a challenging task. Input images are converted to
binary representations and then de-noised using a media filter. The segment of the image that
contains hand gestures is extracted and rescaled to a 32 × 32 image by pattern averaging. The
proposed model uses CNNs and stacked denoising auto-encoders. The applied network topology
combines a convolution and its sub-sampling layer together into a layer. Each convolution layer
generates convolutional feature maps. Then, feature reduction is applied to each convolutional
feature map by subsampling layers. An SDAE (Stacked Denoising Auto Encoder) is used to learn more
features from input images. More features can be learned by stacking more hidden layers. Data-
augmentation techniques are often used to generate additional training data and prevent model
overfitting. Some commonly used data augmentation techniques for images include translation, RGB
jittering and horizontal flipping, spatial augmentation on each frame for video input, and temporal
translations over spatial translations. F. Zhan [7] proposes a spatio-temporal data augmentation
method for better generalization and the use of 2D-CNNs to extract segments of images that contain
hand gestures. Images are converted to grayscale images and rescaled to 50 × 50. Horizontal
mirroring is used for data augmentation. MSE is used as a cost function, and SGD is used as an
optimization function. L. Pigou [8] applies recent advancements in deep learning like Temporal
Convolutions, Residual Networks, Batch Normalization, and Exponential Linear Units for the
framewise classification of gestures and sign languages in a continuous video stream. For pre-
processing, input RGB images are converted to grayscale and rescaled to 128 × 128. Static
information in consecutive frames is removed by taking the difference between the two frames. The
model uses CNN adapted for classification tasks along with back propagation for recurrent
architecture. Temporal convolutions and recurrence are used to work with video frames. CNNs allow
for learning multiple features hierarchically instead of extracting features manually. Separate
networks are used for gesture recognition and sign language recognition. Gesture Recognition uses a
many-to-many configuration. Mini-batch Gradient Descent with Adam is used for optimizing
parameters. Current methods used in head pose estimation require depth information in addition to
RGB information. It is not feasible to obtain in-depth information in all situations, thus limiting the
applications of these methods. Change in facial features is not linear with respect to change in
angles, and Euler angles, which are used to represent pose angles (yaw, pitch, roll), need additional
information (rotation order of axes), making it ambiguous. H.-W. Hsu et al. [9] propose several
methods to overcome these shortcomings. A multi-regression loss function, an L2 regression loss
combined with an ordinal regression loss, is proposed, which can improve recognition using just RGB
information. The ordinal regression loss helps solve the problem of non-linearity of facial features
and pose angles by learning features that can rank the different intervals of angles. Also, angles are
represented using the Quaternion number system instead of Euler angles. In this system, angles are
represented using 4 components, q_x, q_y, q_z, and q_w. This representation can avoid the Gimbal lock
problem as it is a 4D representation of a 3D rotation. It can also be interpolated to generate
smoother renderings of rotations. The methods proposed in this paper make head pose estimation
much more feasible by reducing the amount of information required. J. Gupta [10] covers hand
gesture recognition using various algorithms like deep learning, CNN, Morphological operation, and
emoji prediction. A database was created for this model using various hand gestures to be
recognized and further used to train the model and predict the emoji. This paper used a base of 11
gestures and filters to detect gestures and then used CNN to
categorize them. It includes morphological operations, contour extraction, and CNN. The database
consists of 1200 images, corresponding to every 11 emoji. It uses a camera to monitor user-created
real-time gestures. Future work is to add more emojis to be predicted in real-time gesture
recognition. T. Mo [11] performs a review of gesture recognition and some of its key issues.
Commonly used features for gestures include shape information (position, outline, etc.), geometric
properties (length, distance, etc.), and binary images. For human skin recognition, commonly used
color spaces are RGB (red, Green, and Blue), HSV (Hue, Saturation, and Value), and YCbCr. Gesture
recognition involves two steps: classification and recognition. Commonly used methods for this
purpose are the hidden Markov Model, Neural Networks, and Template Matching Method. But these
methods are time- and resource-heavy. This paper proposes SVM, which is simpler and has better
generalization capability. The YCbCr color space is used for human skin recognition. After performing
dimensionality reduction, SVM and PCA are used for gesture recognition. In H. Muthu Mariappan et
al. [12], real-time recognition is used to identify Indian Sign Language. It uses 1300 samples of images
to train the model and further predict the Indian Sign Language. The FCM (Fuzzy c-means)-based
real-time sign language recognition system can recognize 40 words of ISL at a time. This method is
used to enhance casual communication among people with hearing disabilities. The future work of
this paper is to add more words to the system for recognition. In Rastogi et al. [13], the key idea is to
enable communication between blind, deaf, and mute individuals using a sensor-enabled glove that
translates hand gestures into tactile vibrations. This allows effective two-way communication of
letters, words, and simple sentences between blind and deaf-mute pairs using coded vibro-tactile
stimuli. Nagpal [14] proposes a portable arm-mounted device with a camera, screen, vibrating pads,
and gesture recognition software. The camera captures gestures made by a mute user. The software
recognizes and converts the gestures into text displayed on the screen. This allows real-time, two-
way communication using a combination of gesture recognition, on-screen text, and vibrotactile
feedback. Ahire et al. [15] propose a device that enables two-way communication between deaf-
mute and hearing-speaking users. The device has a display, camera, microphone, and vibrational
pads on one side for the deaf-mute user. The other side has a screen, speaker, microphone, and
camera for the hearing user. When the hearing user speaks into the mic, speech recognition software
converts it to text displayed on the deaf-mute user’s screen. This concept could help bridge the
communication divide between deaf-mute and hearing-speaking communities when implemented
into a working prototype. Sharma et al. [16] use an ensemble model called Trbaggboost, which uses
a small amount of labeled data to label unlabeled data from a new subject. It uses three sources of
data: tri-axis gyroscopes, tri-axis accelerometers, and multichannel surface electromyograms. After
reviewing the related research, methods used in different stages of gesture classification were
identified. After collecting the dataset, images are first preprocessed. Preprocessing involves multiple
processes, including resizing, cropping, and converting images to a color format applicable to the
problem statement. Some color formats commonly used are: Grayscale [2,6–8,16], RGB [9], HSV [12],
and YCrCb [11,17]. A combination of one or more of these approaches can also be used to improve
performance. Gupta et al. [10] convert the original RGB images first to HSV, then to grayscale. Morph
and Contour operations are applied after further processing to obtain a mask of the hand gesture.
After that, image augmentation is applied to increase the variation in training data and reduce
overfitting. Some common filters applied for augmenting images are shearing, zooming, horizontal
flips, and vertical and horizontal shifts. Dimensionality reduction techniques like Principal
Component Analysis (PCA) [11] and Histogram of Oriented Gradient (HOG) [5] can also be used to
further reduce the number of features. After data augmentation, the model is then trained. Both
classical machine learning models as well as deep learning models are used for this task. Some
machine learning models used include Fuzzy C-Means (FCM) [12], Support Vector Machine (SVM)
[5,11], etc. For deep learning models, CNN [7,10,17] and Deep CNN
(DCNN) [18] are primarily used as they are best suited for working with images. Other deep learning
models used include Temporal Residual Networks [8]. As training CNN models from scratch is time-
and resource-intensive, transfer learning is widely used for image classification purposes. Some
common pre-trained models used for transfer learning include ResNet34 [19], ResNet50 [16],
ResNet50V2 [16], XCeption [16], InceptionV3 [19], MobileNet [16], and MobileNetV2 [16].
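As a concrete illustration of this transfer-learning recipe, the sketch below stacks a small classification head on a frozen, ImageNet-pretrained MobileNetV2 backbone in Keras; the input size, class count, and hyperparameters are illustrative assumptions rather than values reported in the cited works.

```python
# Hedged transfer-learning sketch: frozen MobileNetV2 backbone + small classifier head.
import tensorflow as tf

NUM_CLASSES = 26                      # e.g. one class per alphabet letter (assumed)
INPUT_SHAPE = (128, 128, 3)           # illustrative input size

base = tf.keras.applications.MobileNetV2(input_shape=INPUT_SHAPE,
                                         include_top=False, weights="imagenet")
base.trainable = False                # keep ImageNet features fixed initially

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```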
3. Methodology
After reviewing the recently published results and their methods, the various steps
that are involved in the complete process of gesture recognition have been determined and are as
shown in Figure 1. The detailed information about each step is given in the subsections.
Figure 1. Process Flow Chart: Hand Gesture Recognition in the ISL Translation System.
3.1. Dataset
The dataset
used in this problem is self-made, i.e., three of the four authors performed hand signs for all the
images; it contains a total of 7800 images. Five burst shots for each alphabet were captured from a
camera, which comprises 20 images in each burst. These images have dimensions of 4000 × 3000
pixels. This dataset is categorized into three sets named test, train, and validate, each of which
contains ~5500, 1180, and 1160 images, respectively, as shown in Figure 2. There are a total of 26
alphabets, and each alphabet has 300 images. The dataset is also fairly balanced, as shown in Figure
3.
Figure 2. Dataset File Structure.
Figure 3. Number of images in each class.
3.2. Image Processing
For preprocessing, images have been rescaled to 80 by 60 pixels
(to maintain the original 4:3 aspect ratio). The images, originally in RGB format, were converted to
HSV format, as it better highlights hand segments in our dataset. Grayscale images for training were
also used, though they performed worse. Figure 4 shows some sample images after preprocessing in
HSV color space. Figure 5 shows an image preprocessed for both grayscale and HSV color spaces.
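A minimal sketch of this preprocessing step is given below, assuming OpenCV is used; the file path is a placeholder and the library choice is an assumption, since the paper does not name its imaging toolkit.

```python
# Hedged preprocessing sketch: rescale to 80x60 and convert to HSV (and grayscale).
import cv2

img = cv2.imread("sample_sign.jpg")             # placeholder path; OpenCV reads BGR
img = cv2.resize(img, (80, 60))                 # (width, height) keeps the 4:3 ratio
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)      # HSV better separates the hand region
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)    # grayscale variant used for comparison
```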
Figure 4. Some Preprocessed Images (in HSV colorspace).
Figure 5. Preprocessing applied to the letter ‘P’.
3.3. Image Augmentation
After preprocessing, several augmentation
techniques were applied to generate more training data. Some augmentation techniques applied
include shearing, zooming, horizontal flips, and vertical and horizontal shifts. The effects of different filters on an image are shown in Figure 6.
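The augmentation filters listed above can be expressed compactly with Keras’ ImageDataGenerator, as in the hedged sketch below; the specific parameter values are assumptions, since the paper does not report them.

```python
# Hedged augmentation sketch covering shear, zoom, horizontal flips and shifts.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    shear_range=0.2,         # shearing
    zoom_range=0.2,          # zooming
    horizontal_flip=True,    # horizontal flips
    width_shift_range=0.1,   # horizontal shifts
    height_shift_range=0.1,  # vertical shifts
)
# Typical usage (directory layout assumed):
# train_gen = augmenter.flow_from_directory("dataset/train", target_size=(60, 80), batch_size=32)
```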
Figure 6. Augmentation filters applied to the letter ‘D’.
3.4. Model Training
This paper has used CNN
[16,19,20] for our task. A Convolutional Neural Network is a feedforward neural network that has
multiple layers. A CNN applies convolutions that can automatically apply preprocessing and extract
features from images. So, it has the added benefit of not having to perform preprocessing and
feature engineering on the image. It has multiple layers of convolutions, and each layer identifies
successively complex features. The architecture of the model is shown in Figure 7.
Figure 7. Model Architecture for Hand Gesture Recognition in ISL Translation.
The different layers used in the network are:
1. Conv2D: It applies a 2D convolution operation to the input data. A filter, which is a matrix whose dimensions are specified by the kernel size, is used to produce an output.
2. MaxPool2D: This is a pooling layer and is usually applied after a convolution layer. It applies a filter and selects a single value from each subregion of the specified dimension from the input data. In this case, the filter applied is Max, which selects the maximum value.
3. Flatten: It converts multi-dimensional data into a 1-D shape, i.e., flattens the data.
4. Dense: A dense layer is one where each node is connected to every node of the previous layer. In this case, each node is connected to the flattened layer.
5. Dropout: The dropout layer randomly drops out nodes from its input layer at a specified rate. It is used to reduce overfitting.
All images are resized to 80 × 60 and
converted to the HSV color space, so the input layer has dimensions (80, 60, 3). Models trained using HSV images provided better accuracy than those trained on grayscale images, as can be seen in Table 1. Next, three 2D-convolution layers with ReLU as the activation function and a 2D max-pooling layer are used. Finally, a flattening layer flattens its input matrix and passes the 1D vector to a Dense layer. The output layer is a Dense layer with a SoftMax activation function and 26 nodes, corresponding to the 26 alphabets. Because our task is a multiclass classification problem, SoftMax is used, as it converts the model outputs to a probability distribution from which the predicted class can easily be selected.
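A sketch of what such an architecture could look like in Keras is given below. Only the input shape (80, 60, 3), the use of three Conv2D stages with ReLU and max pooling, the Flatten/Dense/Dropout stages, and the 26-way SoftMax output are taken from the description above; the filter counts, kernel sizes, dense-layer width, dropout rate, and the placement of a pooling layer after every convolution are illustrative assumptions.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(80, 60, 3), num_classes=26):
    """CNN for static ISL alphabet recognition (layer sizes are illustrative)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPool2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPool2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPool2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                              # reduces overfitting
        layers.Dense(num_classes, activation="softmax"),  # 26 alphabet classes
    ])
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training could then follow the schedule described later in this section (25 epochs, 100 steps per epoch, 50 validation steps) with a call such as model.fit(train_generator, epochs=25, steps_per_epoch=100, validation_data=val_generator, validation_steps=50).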
Table 1. Comparison of Hand Gesture Recognition Models for ISL Translation.
System Configuration | Optimizer | Input Size | Training Time | Testing Loss | Testing Accuracy
Google Colab (GPU: Tesla K80, CPU: Intel(R) Xeon) | rmsprop | (100, 75, 1) Grayscale | 3 h 52 min | 0.0687 | 0.9713
Google Colab (GPU: Tesla K80, CPU: Intel(R) Xeon) | rmsprop | (80, 60, 3) HSV | 3 h 26 min | 0.0178 | 0.999
Local Machine (GPU: GTX 1650S, CPU: Intel(R) Core i5 10400F) | rmsprop | (100, 75, 1) Grayscale | 2 h 24 min | 0.1874 | 0.9383
Local Machine (GPU: GTX 1650S, CPU: Intel(R) Core i5 10400F) | adam | (200, 150, 1) Grayscale | 2 h 17 min | 0.166 | 0.9274
Local Machine (GPU: RTX 3050, CPU: Ryzen 9 5900HS) | rmsprop | (100, 75, 1) Grayscale | 2 h 11 min | 0.025 | 0.9899
The model has been trained for 25 epochs. The batch size was not explicitly stated. Training
was limited to 100 steps per epoch, and validation was limited to 50 steps per epoch.
3.5. Evaluating Performance
Hand gesture classification is a multi-class classification task. In our implementation, there are 26 possible outputs, one for each letter of the alphabet from A to Z. Because this is a multiclass classification problem, SoftMax has been used as the activation function for the output layer. The F1 score, accuracy, and confusion matrix have been calculated to evaluate the model. The graph in Figure 8 shows the evolution of loss and accuracy over the number of epochs.
Figure 8. Evolution of loss and accuracy.
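A sketch of how these metrics could be computed with scikit-learn is shown below; the variable names for the held-out test images and labels are placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# x_test: preprocessed test images; y_test: integer class labels (0-25)
y_prob = model.predict(x_test)           # SoftMax probabilities, shape (N, 26)
y_pred = np.argmax(y_prob, axis=1)       # predicted class indices

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Macro F1:", f1_score(y_test, y_pred, average="macro"))
print(confusion_matrix(y_test, y_pred))  # 26 x 26 matrix of predictions
```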
4. Results and Discussions
The model was run with different preprocessing steps and on different machines, and the results are shown in Table 1. It is observed that the model with HSV color space images as input shows the best results. Multiple configurations of hyperparameters have been tested in the models to find the one that provides optimal accuracy and loss. Doing this not only proves useful for further research but also completes one of the objectives stated in the introduction. Table 1 shows the performance of the different models and preprocessing steps, and Table 2 shows the performance comparison between this work and existing works.
Table 2. Comparison between this paper's results and previously published results.
Paper | Task | Model | Accuracy
Muthu Mariappan et al. [12] | 40 ISL words and sentences in real time | Fuzzy c-means | 75%
Rosalina et al. [2] | 39 ASL signs (26 alphabet letters, 10 digits, and 3 punctuation marks) | ANN | 90%
Oyebade K. Oyedotun et al. [6] | 24 ASL hand gestures | CNN | 92.83%
Hsien-I Lin et al. [17] | 7 hand gestures | CNN | 99%
Gupta et al. [10] | 11 hand gestures | CNN | 99.6%
Proposed model | 26 ISL hand signs | CNN | 99%
From Table 1, we can conclude that using RMSprop as
the optimizer, images of size 80 × 60 in the HSV color space provided the best performance. From Table 2, it can be deduced that the proposed model maintains comparable accuracy while including more classes than any of the previous research.
5. Conclusions and Future Work
This paper
has provided a better understanding of Indian Sign Language and of the applications of machine learning, after getting past the hype and seeing actual results. This work should also be useful for the deaf community, as this research strives to become a helping hand for that community. The current implementation slightly overfits the dataset: although it reaches an accuracy of 99% with a loss of 0.0178, this overfitting results in poorer performance on new images and requires further work on the dataset. The performance of the model could be improved by taking hand pose/skeletons into account for the prediction. For future work, we aim to improve on this and develop a much better implementation with aspects such as a larger vocabulary, a voice interface within a working client, and support for non-static signs (for instance, the letters Y and J are non-static in ISL). The dataset also needs to come from more individuals with different ethnicities and backgrounds, skin tones, and varied shapes of hands and faces, each with their own variations of the language; furthermore, the lighting and background need to vary to avoid bias. The data were not gathered from real practitioners of ISL and may not be completely reflective of how a real practitioner uses the language. In addition, the model only works on static gestures; dynamic gestures need to be added in further work. There are no specific hardware requirements, and the model performs reasonably well when deployed on smartphones with mediocre hardware.
Author Contributions: Conceptualization, H.K.V. and R.A.; methodology, T.T.; software,
R.A.; validation, M.A., R.A., and H.K.V.; formal analysis, M.A. and A.; investigation, R.A.; resources,
T.T.; data curation, R.A.; writing—original draft preparation, H.K.V.; writing—review and editing, M.A.;
visualization, T.T.; supervision, M.A.; project administration, M.A. All authors have read and agreed to
the published version of the manuscript. Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable. Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available upon request from the
corresponding author. Conflicts of Interest: The authors declare no conflict of interest. References 1.
Rajput, L.; Gupta, S. Sentiment Analysis Using Latent Dirichlet Allocation for Aspect Term Extraction.
J. Comput. Mech. Manag. 2023, 2, 8–13. [CrossRef] 2. Rosalina; Yusnita, L.; Hadisukmana, N.; Wahyu,
R.B.; Roestam, R.; Wahyu, Y. Implementation of Real-Time Static Hand Gesture Recognition Using
Artificial Neural Network. In Proceedings of the 2017 4th International Conference on Computer
Applications and Information Processing Technology (CAIPT), Bali, Indonesia, 8–10 August 2017; pp.
1–6. 3. Hangün, B.; Eyecioğlu, Ö. Performance Comparison Between OpenCV Built in CPU and GPU
Functions on Image Processing Operations. Int. J. Eng. Sci. Appl. 2017, 1, 34–41. 4. Li, S.; Deng, W.
Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2020, 13, 1195–1215.
[CrossRef] 5. Prasanna, D.M.; Reddy, C.G. Development of Real Time Face Recognition System Using
OpenCV. Development 2017, 4, 791. 6. Oyedotun, O.K.; Khashman, A. Deep Learning in Vision-Based
Static Hand Gesture Recognition. Neural Comput. Appl. 2017, 28, 3941–3951. [CrossRef] 7. Zhan, F.
Hand Gesture Recognition with Convolution Neural Networks. In Proceedings of the 2019 IEEE 20th
International Conference on Information Reuse and Integration for Data Science (IRI), Los Angeles,
CA, USA, 30 July–1 August 2019; pp. 295–298. 8. Pigou, L.; Van Herreweghe, M.; Dambre, J. Gesture
and Sign Language Recognition with Temporal Residual Networks. In Proceedings of the 2017 IEEE
International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October
2017; pp. 3086–3093. 9. Hsu, H.-W.; Wu, T.-Y.; Wan, S.; Wong, W.H.; Lee, C.-Y. QuatNet: Quaternion-
Based Head Pose Estimation With Multiregression Loss. IEEE Trans. Multimed. 2019, 21, 1035–1046.
[CrossRef] 10. Gupta, J. Hand Gesture Recognition for Emoji Prediction. Int. J. Res. Appl. Sci. Eng.
Technol. 2020, 8, 1310–1317. [CrossRef] 11. Mo, T.; Sun, P. Research on Key Issues of Gesture
Recognition for Artificial Intelligence. Soft Comput. 2020, 24, 5795–5803. [CrossRef] 12. Muthu
Mariappan, H.; Gomathi, V. Real-Time Recognition of Indian Sign Language. In Proceedings of the
2019 International Conference on Computational Intelligence in Data Science (ICCIDS), Chennai,
India, 21–23 February 2019; pp. 1–6. 13. Rastogi, R.; Mittal, S.; Agarwal, S. A Novel Approach for
Communication among Blind, Deaf and Dumb People. In Proceedings of the 2015 2nd International
Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 11–13
March 2015. 14. Nagpal, N. Design Issue and Proposed Implementation of Communication Aid for
Deaf & Dumb People. Int. J. Recent Innov. Trends Comput. Commun. 2015, 3, 147–149. 15. Ahire,
P.G.; Tilekar, K.B.; Jawake, T.A.; Warale, P.B. Two Way Communicator between Deaf and Dumb People
and Normal People. In Proceedings of the 2015 International Conference on Computing
Communication Control and Automation, Pune, India, 26–27 February 2015; pp. 641–644. 16.
Sharma, S.; Gupta, R.; Kumar, A. Trbaggboost: An Ensemble-Based Transfer Learning Method Applied
to Indian Sign Language Recognition. J. Ambient Intell. Humaniz. Comput. 2022, 13, 3527–3537.
[CrossRef] 17. Kishore, C.R.; Pemula, R.; Vijaya Kumar, S.; Rao, K.P.; Chandra Sekhar, S. Deep Learning
Models for Identification of COVID-19 Using CT Images. In Proceedings of the Soft Computing:
Theories and Applications; Kumar, R., Ahn, C.W., Sharma, T.K., Verma, O.P., Agarwal, A., Eds.; Springer
Nature: Singapore, 2022; pp. 577–588. 18. Lin, H.-I.; Hsu, M.-H.;
Chen, W.-K. Human Hand Gesture Recognition Using a Convolution Neural Network. In Proceedings
of the 2014 IEEE International Conference on Automation Science and Engineering (CASE), New
Taipei, Taiwan, 18–22 August 2014; Volume 2014, pp. 1038–1043. 19. Sharma, C.M.; Tomar, K.;
Mishra, R.K.; Chariar, V.M. Indian Sign Language Recognition Using Fine-Tuned Deep Transfer
Learning Model. In Proceedings of the Innovations In Computer And Information Science (ICICIS),
Ganzhou, China, 27–29 August 2021; pp. 62–67. 20. Arora, M.; Dhawan, S.; Singh, K. Exploring Deep
Convolution Neural Networks with Transfer Learning for Transformation Zone Type Prediction in
Cervical Cancer. In Proceedings of the Soft Computing: Theories and Applications; Pant, M., Sharma,
T.K., Verma, O.P., Singla, R., Sikander, A., Eds.; Springer: Singapore, 2020; pp. 1127–1138.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are
solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s).
MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from
any ideas, methods, instructions or products referred to in the content.

ONLINE HAND GESTURE RECOGNITION AND CLASSIFICATION FOR DEAF AND DUMB
A Project Report Phase – II Submitted in partial fulfillment of the requirements for the award of Bachelor of Engineering degree in Electronics and Communication Engineering
By A KISHORE (39130227) and KAVETI ASHOK KUMAR YADAV (39130216)
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, SCHOOL OF ELECTRICAL AND ELECTRONICS, SATHYABAMA INSTITUTE OF SCIENCE AND TECHNOLOGY (DEEMED TO BE UNIVERSITY), Accredited with Grade "A" by NAAC, JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI - 600 119, APRIL - 2023
SATHYABAMA INSTITUTE OF SCIENCE AND TECHNOLOGY (DEEMED TO BE UNIVERSITY), Accredited with "A" grade by NAAC, Jeppiaar Nagar, Rajiv Gandhi Salai, Chennai – 600 119, www.sathyabama.ac.in
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of A KISHORE (39130227) and KAVETI ASHOK KUMAR YADAV (39130216), who carried out the project entitled "ONLINE HAND GESTURE RECOGNITION AND CLASSIFICATION FOR DEAF AND DUMB" under our supervision from November 2022 to April 2023. Internal Guide: Dr. K. V. KARTHIKEYAN, M.E., Ph.D., Professor. Head of the Department: Dr. T. RAVI, M.E., Ph.D., Professor. Submitted for Viva voce Examination held on: Internal Examiner, External Examiner.
DECLARATION
We, A. KISHORE (Reg. no. 39130227) and KAVETI ASHOK KUMAR YADAV (Reg. no. 39130216), hereby declare that the Project Report entitled "ONLINE HAND GESTURE RECOGNITION AND CLASSIFICATION FOR DEAF AND DUMB PEOPLE", done by us under the guidance of Dr. K. V. KARTHIKEYAN, M.E., Ph.D., Professor, Department of Electronics and Communication Engineering at SATHYABAMA INSTITUTE OF SCIENCE AND TECHNOLOGY, CHENNAI, is submitted in partial fulfillment of the requirements for the award of Bachelor of Engineering degree in Electronics and Communication Engineering. DATE: 1. 2. PLACE: SIGNATURE OF THE CANDIDATE(S)
ACKNOWLEDGEMENT
We are pleased to acknowledge our sincere thanks to the Board of Management of SATHYABAMA for their kind encouragement in doing this project and for completing it successfully. We are grateful to them. We convey our thanks to Dr. N.M. NANDHITHA, M.E., Ph.D., Professor & Dean, School of Electrical and Electronics, and Dr. T. RAVI, M.E., Ph.D., Professor & Head of the Department of Electronics and Communication Engineering, for providing necessary support and details at the right time during the progressive reviews. We would like to express our sincere and deep sense of gratitude to our Project Guide, Dr. K. V. KARTHIKEYAN, M.E., Ph.D., Professor, Department of Electronics and Communication Engineering, whose valuable guidance, suggestions and constant encouragement paved the way for the successful completion of our project work. We wish to express our thanks to all teaching and non-teaching staff members of the Department of Electronics and Communication Engineering who were helpful in many ways for the completion of the project.
ABSTRACT
Sign Language Recognition is a breakthrough for helping deaf-mute people
and has been researched for many years. Unfortunately, every piece of research has its own limitations and is still unable to be used commercially. Some of the research is known to have been successful in recognizing sign language, but requires an expensive setup to be commercialized. Nowadays, developing Sign Language Recognition that can be used commercially is receiving more attention from researchers. Researchers do their research in various ways, starting with the data acquisition methods. The data acquisition method varies because of the cost of a good device, but a cheap method is needed for a Sign Language Recognition system to be commercialized. The methods used in developing Sign Language Recognition also vary among researchers. Each method has its own strengths compared to other methods, and researchers are still using different methods in developing their own Sign Language Recognition systems. Each method also has its own limitations compared to other methods. The aim of this work is to review sign language recognition approaches and find the best method that has been used by researchers, so that other researchers can get more information about the methods used and develop better Sign Language Application Systems in the future.
TABLE OF CONTENT
Chapter No. TITLE Page No.
ABSTRACT iv
LIST OF FIGURES vi
1 INTRODUCTION 1
2 LITERATURE SURVEY 15
2.1 Inferences from Literature Survey 19
2.2 Open Problems in Existing System 19
3 REQUIREMENTS ANALYSIS 20
3.1 Feasibility Studies/Risk Analysis of the Project 20
3.2 Software Requirements Specification Document 21
4 DESCRIPTION OF THE PROPOSED SYSTEM 22
4.1 Selected Methodology or Process Model 24
4.2 Architecture / Overall Design of Proposed System 33
4.3 Description of Software for Implementation and Testing Plan of the Proposed Model/System 34
4.4 Project Management Plan 37
5 IMPLEMENTATION DETAILS 38
5.1 Algorithms 38
5.2 Testing 40
6 RESULTS 43
7 SUMMARY AND CONCLUSION 47
7.1 Conclusion 47
7.2 Future Enhancements 48
7.3 Implementation 48
REFERENCES 49
LIST OF FIGURES
Figure No. Figure Name Page No.
1.1 Hand shape and Hand orientation 12
4.1 Block diagram for proposed system 23
4.2 Hand Landmark 24
4.3 Alphabets hand gestures 28
4.4 Sign language hand gestures 29
4.5 System Architecture for sign language 33
6.1 Sequence Diagram 43
6.2 Data Flow Diagram 44
CHAPTER 1
INTRODUCTION
Typically, sign or signed languages are employed to communicate meaning visually. These languages are conveyed using non-manual features in addition to physical articulations. They are completely natural languages with their own lexicon and vocabulary. Smart methods for hand motion detection from photographs of hand signals are crucial in this situation. Many people throughout the world who have speech disorders or loss of hearing can benefit from these techniques. Various approaches for accurately recognizing symbols and signs have been documented in the literature. Innovations like movement and face recognition have gained significant traction in the sign language field in recent years. Different movements known as gestures are utilized during communication; hand or body movements are used. Gestures used in sign language typically involve visually communicated patterns. There are roughly 4,94,93,500 people with hearing impairments worldwide. Some of the existing methods for translating sign language take into account hand position, hand size, and finger movements. Each sign in sign language has a particular meaning ascribed to it so that people can easily comprehend and interpret it. People create distinct and unique sign languages depending on their native tongues and geographic locations. As with spoken language, there are several variations in sign language. Around the world, more than 300 different sign languages are in use, and they differ from one country to the next. Sign language can have many diverse regional dialects that cause subtle variances in how people use and comprehend signs, even in nations where the same language is spoken. There are significant variances between some of the most popular sign languages, even though there are some commonalities. Additionally, it is not just the signs that differ.
General Background
The use of sign
language has not been restricted only to people with impaired hearing or speech expressing themselves to each other or to non-sign-language speakers. It has often been considered a prominent medium of communication. Instead of acoustically conveyed sound patterns, sign language uses manual communication to convey meaning. It combines hand gestures and facial expressions along with movements of other body parts such as the eyes, legs, etc. This work proposes a design for recognizing signs used in ASL and interpreting them. Some of the problems faced by non-speech and hard-of-hearing individuals in communicating with other people are interaction barriers, disparity, education, behavioural patterns, mental health, and, most importantly, safety concerns. The ways in which one can interact with a computer are either by using devices like a keyboard or mouse, or via audio signals; the former always needs physical contact, and the latter is prone to noise and disturbances. A physical action carried out by the hand, eye, or any other part of the body can be considered a gesture. Hand gestures are the most convenient and interpretable way for non-speaking humans to interact. Nonverbal communication is important in our lives, as it conveys about 65% of messages, whereas verbal communication contributes no more than 35% of our interactions. Gestures can be categorized into hand and arm gestures (recognition of hand poses, sign languages, and entertainment applications), head and face gestures (such as nodding or shaking of the head, the direction of eye gaze, opening the mouth to speak, winking, and so on), and body gestures (involving full-body motion). There are numerous varieties of sign language used not only in the UK but all over the world, since the speaker's nonverbal cues, actions, and visual emotions may all have a huge impact on how sign language is transmitted. Similar to spoken language, various groups and cultures create their own means of communication specific to the environment in which they exist.
DEEP LEARNING
Deep-learning networks are distinguished from the more commonplace
single-hidden-layer neural networks by their depth; that is, the number of node layers through which
data must pass in a multistep process of pattern recognition. Earlier versions of neural networks such
as the first perceptron were shallow, composed of one input and one output layer, and at most one
hidden layer in between. More than three layers (including input and output) qualify as “deep”
learning. So deep is not just a buzzword to make algorithms seem like they read Sartre and listen to
bands you haven’t heard of yet. It is a strictly defined term that means more than one hidden layer.
In deep-learning networks, each layer of nodes trains on a distinct set of features based on the
previous layer’s output. The further you advance into the neural net, the more complex the features
your nodes can recognize since they aggregate and recombine features from the previous layer. This
is known as feature hierarchy, and it is a hierarchy of increasing complexity and abstraction. It makes
deep-learning networks capable of handling very large, high-dimensional data sets with billions of
parameters that pass through nonlinear functions. Above all, these neural nets are capable of
discovering latent structures within unlabeled, unstructured data, which is the vast majority of data
in the world. Another word for unstructured data is raw media; i.e., pictures, texts, video, and audio
recordings. Therefore, one of the problems deep learning solves best is in processing and clustering
the world’s raw, unlabeled media, discerning similarities and anomalies in data that no human has
organized in a relational database or ever put a name to. For example, deep learning can take a
million images, and cluster them according to their similarities: cats in one corner, ice breakers in
another, and in a third all the photos of your grandmother. This is the basis of so-called smart photo
albums. Deep-learning networks perform automatic feature extraction without human intervention,
unlike most traditional machine-learning algorithms. Given that feature extraction is a task that can
take teams of data scientists years to accomplish, deep learning is a way to circumvent the
chokepoint of limited experts. It augments the powers of small data science teams, which by their
nature do not scale. When training on unlabeled data, each node layer in a deep network learns
features automatically by repeatedly trying to reconstruct the input from which it draws its samples,
attempting to minimize the difference between the network’s guesses and the probability
distribution of the input data itself. Restricted Boltzmann machines, for example, create so-called
reconstructions in this manner. In the process, these neural networks learn to recognize correlations
between certain relevant features and optimal results – they draw connections between feature
signals and what those features represent, whether it be a full reconstruction, or with labeled data. A deep-learning network trained on labeled data can then be applied to unstructured data, giving it access to much more input than traditional machine-learning methods.
NEURAL NETWORKS
A neural network is a series of algorithms that
endeavors to recognize underlying relationships in a set of data through a process that mimics the
way the human brain operates. In this sense, neural networks refer to systems of neurons, either
organic or artificial in nature. Neural networks can adapt to changing input; so the network generates
the best possible result without needing to redesign the output criteria. The concept of neural
networks, which has its roots in artificial intelligence, is swiftly gaining popularity in the development
of trading systems. A neural network works similarly to the human brain’s neural network. A
“neuron” in a neural network is a mathematical function that collects and classifies information
according to a specific architecture. The network bears a strong resemblance to statistical methods
such as curve fitting and regression analysis. A neural network contains layers of interconnected
nodes. Each node is a perceptron and is similar to multiple linear regression. The perceptron feeds
the signal produced by a multiple linear regression into an activation function that may be nonlinear.
In a multi-layered perceptron (MLP), perceptrons are arranged in interconnected layers. The input layer collects input patterns. The output layer has classifications or output signals to which input patterns may map. Hidden layers fine-tune the input weightings until the neural network's margin of error is minimal. It is hypothesized that hidden layers extrapolate salient features in the input data that have predictive power regarding the outputs.
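As a toy illustration of the perceptron described above (a weighted sum, i.e., a multiple linear regression, passed through a nonlinear activation), consider the following sketch; the inputs, weights, and bias are made-up numbers.

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of inputs (like multiple linear regression) passed through a sigmoid."""
    z = np.dot(w, x) + b               # linear combination of the inputs
    return 1.0 / (1.0 + np.exp(-z))    # nonlinear activation

x = np.array([0.5, -1.2, 3.0])         # example input pattern
w = np.array([0.4, 0.1, -0.7])         # example learned weights
print(perceptron(x, w, b=0.2))         # single node's output in (0, 1)
```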
CNN
A Convolutional Neural Network is a Deep Learning algorithm that is used for image recognition and in Natural Language Processing. A Convolutional Neural Network (CNN) takes an image, identifies its features, and makes a prediction from them. It captures the spatial and temporal dependencies of the image. Each CNN layer learns filters of increasing complexity. The first layers learn basic feature-detection filters, such as edges and corners. The middle layers learn filters that detect parts of objects. The last layers have higher representations: they learn to recognize full objects in different shapes and positions. Suppose you see an image of a dog: your brain focuses on certain features of the dog to identify it. These features may be the dog's ears, eyes, or anything else. Based on these features, your brain gives you a signal that this is a dog. Similarly, a Convolutional Neural Network processes the image and identifies it based on certain features. Convolutional Neural Networks are gaining much popularity over plain artificial neural networks. The steps in a Convolutional Neural Network are:
1. Convolution Operation
2. ReLU Layer
3. Pooling
4. Flattening
5. Full Connection
Generally, a convolutional neural network architecture has
several components: ● A convolution layer - you can think of this layer as “what relevant features are
we picking up in an image?” In a convolutional neural network, we have multiple convolutional layers
that extract low to high-level features depending on what specific layer we are focusing on. To give
an (over-simplified) intuition, earlier convolutional layers pick up lower-level features (i.e. like lines
and edges) while later convolutional layers pick up higher-level features based on inputs from lower-
level features (i.e. shapes, structures) - analogous to how vision works in the human brain. ● A
pooling layer - convolutional neural networks are typically used for image classification. However,
images are high-dimensional data - so we would prefer to reduce the dimensionality to minimize the
possibility of overfitting. Pooling essentially reduces the spatial dimensions of the image based on
certain mathematical operations such as average or max-pooling. We
generally incorporate pooling since it (1) generally, acts as a noise suppressant (2) makes it invariant
to translation movement for image classification and (3) helps capture essential structural features of
the represented images without being bogged down by the fine details. ● Fully-connected layer - You
can think of a series of convolution and pooling operations as dimensionality reduction steps prior to
passing this information over to the fully connected (Dense) layer. Essentially, what the fully
connected layer does is that it takes the "compressed" representation of the image and it tries to fit a
basic NN (multi-layer perceptron) when doing classification. The change in paradigm that the usage
of NNs encompasses is a very important one. Prior to the usage of NNs in image classification, the
practitioner had to use explicitly coded algorithms for detecting features. This was a job that often
involved a lot of work. Machine Learning was not used throughout the whole system, only the last
layer of the system, the classifier, had the capacity to learn and adapt itself to the specific problem.
NNs, and especially CNNs, dramatically change that. CNNs extract features from images and learn
how to do it. The practitioner doesn’t need to craft complicated hard-coded algorithms to extract
those features. Since the introduction of the first CNNs in 2012, all the winning teams for the
classification task have used different types of CNNs. CNN systems now clearly outperform other
image classification systems such as the ones described. State-of-the-art CNNs have even crossed the
boundary of what is considered to be the error rate for human beings in image classification tasks.
Ever since their introduction in 2012, CNNs have been making steady progress in decreasing the error
rate in both tasks. This is due to the introduction of new strategies that further improve the
convergence and training speed of CNNs. CNNs are a particular kind of NNs that give their name to
the fact that they make extensive use of Convolutional Layers. Convolutional Layers consist of a
learnable filter of fixed size (kernel) to be applied to images. The name comes from the fact that at
each forward pass, the learned filter is convolved with the image, resulting in a 2-dimensional map.
Another type of layer widely used in CNNs are the pooling layers. These layers perform a type of
down sampling on the input signal. There are various types of Pooling layers, max-pooling being the
most common. ReLUs, or Rectified Linear Units, are also widely used in CNNs. ReLUs are units that apply a function similar to Equation 2.1. There are often alternatives to these types of units; Equations 2.2 and 2.3 show two of them.
ReLU: f(x) = max(0, x) (2.1)
Tanh: f(x) = tanh(x) (2.2)
Sigm: f(x) = (1 + e^(-x))^(-1) (2.3)
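These three activations can be written directly in code; the sketch below mirrors Equations 2.1 to 2.3 using NumPy.

```python
import numpy as np

def relu(x):      # Equation 2.1
    return np.maximum(0, x)

def tanh(x):      # Equation 2.2
    return np.tanh(x)

def sigmoid(x):   # Equation 2.3
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), tanh(x), sigmoid(x))
```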
Fully Connected Layers are used too, as in more standard NNs. These layers consist of an array of neurons, each of which is connected to every single output of the previous layer. Loss Layers are typically used at the end of the CNN to figure out the penalty for a given output and to provide the feedback used by the CNN to learn. In addition to this more standard set of layers, modern CNNs make use of a very extensive set of techniques and strategies that greatly enhance their performance. Dropout is one such technique. It consists of temporarily deactivating some neurons within the net, preventing it from over-fitting. This temporary deactivation of neurons is typically applied only to fully-connected layers. CNNs nowadays also make extensive use of GPU processing power, which allows for faster training, an idea pioneered in 2012 that is now a standard. Evidence shows that one way of improving a CNN's performance is by stacking more layers and making the network effectively deeper; the problem that often arises is that as more layers are stacked, the performance saturates and eventually starts to decrease at some point. To address this issue, the idea of deep residual networks has been proposed. Residual Networks try, with minor changes to the architecture and with virtually no impact on performance, to approximate residual functions instead of standard functions. If we have a CNN with input x and we want the output to approximate a function O(x), we can instead approximate F(x) = O(x) − x and then compute O(x) from F(x), since O(x) = F(x) + x. Evidence shows that CNNs train better and converge faster when the approximated function is in fact F(x), a residual function. This approach opens the door for the usage of even deeper and more powerful models.
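A minimal sketch of this idea in Keras is shown below: the convolutional branch learns F(x), and the block output is F(x) + x via a skip connection. The filter count and kernel size are illustrative assumptions.

```python
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Learn F(x) with two convolutions, then output O(x) = F(x) + x.
    Assumes x already has `filters` channels so the addition is valid."""
    f = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, (3, 3), padding="same")(f)  # this branch is F(x)
    out = layers.Add()([f, x])                              # skip connection: F(x) + x
    return layers.Activation("relu")(out)
```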
Some preprocessing techniques are also used to improve the data fed to the CNN. These techniques include normalization and zero-centering of the data. They are mainly used in order to overcome the vanishing and exploding gradient problems that often affect the training of NN systems. Both problems arise when models are trained using gradient-based methods. Gradients are often computed using the chain rule. In the case of near-zero gradients, this leads to several very small numbers being multiplied together in order to calculate the changes in the last layers. This causes the last layers to suffer virtually no change at all in their outputs as the training progresses. The opposite happens when activation functions have high derivative values, leading to an uncontrolled increase of the gradient in the last layers. As NN systems tend to require a lot of examples, better results are obtained when data augmentation techniques are used. CNN frameworks often incorporate mechanisms to augment data. Cropping is a very simple technique where several crops of a single image are fed to the NN, augmenting the dataset by a factor equal to the number of crops. Rotations [KSH12] of the image are also used, augmenting the dataset by a factor equal to the number of rotations performed.
Objective
Like all other time-variable signs
and signals, it is not easy to compare gestures directly in Euclidean space. This is because of the time dependencies and the large number of irrelevant areas in the frames. It is very difficult to find truly representative hand-engineered features for hand gestures. To work well with conventional classifiers, the required features should involve robust descriptors for hand shape, position, orientation, and the temporal dependence between consecutive frames. These features should also be robust against different circumstances such as occlusion and background clutter. For these reasons, deep learning is employed in this work, since it seems to be a promising solution. The CNN has been successfully utilized for automatic feature learning. It has achieved excellent results in image classification, object recognition, and even in human activity recognition. The main contributions of this work are as follows:
1. Presenting a CNN model for learning spatio-temporal features for hand gestures.
2. Adapting a pretrained deep model to work well on hand gesture recognition.
For
instance, native English speakers can be found in both Britain and the United States. However, there are considerable differences between American and British sign languages. This is where a lot of companies and institutions still have trouble reaching out to Deaf and Hard of Hearing people. A few different types of sign language are described below.
Indian Sign Language
Indian Sign Language uses both the right and left hands to depict a variety of hand gestures. Indian Sign Language serves as both a communication tool for the deaf and a representation of their pride and identity. The Rights of Persons with Disabilities (RPwD) Act 2016, which took effect on April 19, 2017, acknowledges sign language as a form of interaction that is particularly helpful when speaking with people who have hearing loss. Authorities are also required under the Act to encourage the use of sign language so that people with hearing loss can take their role in society and benefit from it. The ISL vocabulary gradually grew from 4000 words to 10,000 words, with the goal of being used more widely.
American Sign Language (ASL)
On the other hand, one hand is all that is required to sign in ASL. As a result, putting the system into place is simple. ASL has its own growth path and is independent of all spoken languages. The main sign language used by Deaf people in the United States and most of Anglophone Canada is American Sign Language (ASL), a natural language. ASL is a comprehensive and well-structured visual language that uses both manual and non-manual characteristics to communicate. Dialects of ASL and creoles based on ASL are spoken around the world, including much of West Africa and portions of Southeast Asia, besides North America. As a
lingua franca, ASL is also commonly studied as a second or foreign language. French Sign Language is the language most closely related to ASL. Although ASL exhibits characteristics untypical of creole languages, like agglutinative morphology, it has been argued that ASL is a creole language of FSL.
French Sign Language (FSL)
The sign language used by the deaf in France and the French-speaking regions of Switzerland is known as French Sign Language (LSF). Ethnologue estimates that about 100,000 native signers are involved. Dutch Sign Language (NGT), Flemish Sign Language (VGT), Belgian-French Sign Language (LSFB), Irish Sign Language (ISL), American Sign Language (ASL), Quebec (also known as French Canadian) Sign Language (LSQ), Brazilian Sign Language (LSB, LGB or LSCB), and Russian Sign Language (RSL) are all linked to and partly descended from French Sign Language. The invention of French Sign Language is usually, though incorrectly, credited to Charles Michel de l'Épée (abbé de l'Épée). A pair of deaf twin sisters were inside a nearby house when he entered, seeking shelter from the rain. He was immediately struck by the depth and sophistication of the language they were using to communicate with one another and with the deaf society in Paris. In fact, he is said to have discovered the language by complete accident. The abbé began studying what is now known as Old French Sign Language, and eventually he founded a free school for the hearing impaired. At this school, he created a methodology to instruct his pupils how to read and write that he dubbed "methodical signs."
British Sign Language (BSL)
The Deaf community in the UK uses the sign
language known as British Sign Language (BSL), which is their first or preferred language. The British Deaf Association calculates that there are 151,000 BSL users in the UK, of whom 87,000 are Deaf, based on the percentage of people who reported "using British Sign Language at home" in the 2011 Scottish Census. In comparison, 15,000 persons who resided in England and Wales identified BSL as their primary language in the 2011 England and Wales Census. Individuals who are not deaf might also use BSL, as hearing relatives of deaf people, as basic sign translators, or as a consequence of other interactions with the British Deaf community.
Elements of Sign Language
SL is a visual-spatial language based on positional and visual cues, including arm and body movements, the position and orientation of the hands, and the shape of the fingers and hands. Together, these elements are employed to communicate an idea's meaning. There are typically five components to the phonological structure of SL. In SL, each motion is made up of five different components. These five building blocks serve as SL's most valuable components, and automated, intelligent systems can use them to recognize SL.
Fig 1.1 Hand shape and Hand Orientation
Methods of Sign Language Detection
Scholarly approaches to overcoming challenges brought on by disabilities are numerous, methodical, and context-specific. SLR systems, which are used to translate the signals of SL into text or speech to establish communication with individuals who do not understand these signs, are one of the crucial interventions. One of the most important projects aimed at collecting data on the motion of human hands is the development of SLR systems based on the sensory glove. To record hand layouts and identify the related meanings of gestures, three strategies are used: vision-based, sensor-based, and a mix of the two.
Vision-Based Strategies
Lenses are the main equipment used by vision-based systems to gather the required input data. The primary benefit of employing a camera is that it eliminates the requirement for sensors in sensory gloves and lowers the system's construction expenses. Due to the blur produced by a web camera, most laptops employ a high-end camera, which is quite inexpensive. Despite the high-quality cameras that are found in most smartphones, there are still a number of issues that hinder the creation of real-time recognition applications. These issues include the limited field of view of the capturing device, high computational costs, and the requirement for multiple cameras in order to capture robust results (due to issues of complexity and obstruction).
Sensory-Based Strategies
An alternate method for gathering information on gestures is to utilize a certain kind of instrumented glove that is equipped with flexion (or bend) sensors, accelerometers (ACCs), proximity sensors, and abduction sensors. The bend angles of the fingers, the abduction between the fingers, and the wrist's orientation (roll, pitch, and yaw) are all measured using these sensors. Depending on the number of sensors incorporated into the glove, different degrees of freedom (DoF) can be achieved using such gloves, ranging from 5 to 22. Glove-based systems have a significant benefit over vision-based systems in that they may immediately report the necessary and pertinent information (degree of bend, pitch, etc.) to the portable computer in the form of voltage values, obviating the need to transform raw data into useful values. Contrarily, the computational overhead of vision-based solutions is increased by the requirement to apply particular tracking and feature extraction techniques to raw streaming video.
Hybrid-Based Strategies
The third way integrates camera- and glove-based technologies in a hybrid design to gather raw gesture data. To improve the overall accuracy and precision, this method employs mutual mistake minimization. Unfortunately, due to the expense and computational cost of the complete apparatus, little research has been done in this area. Furthermore, when combined with hybrid tracking techniques, immersive virtual reality solutions show promise.
CHAPTER 2
LITERATURE
SURVEY
Soma Shrenika, Myneni Madhu Bala, "Sign Language Recognition Using Template Matching
Technique”, International Conference on Computer Science, Engineering and Applications (ICCSEA),
2020 Normal people cannot understand the signs used by the deaf because they do not understand
the meaning of each sign. Bridging this gap is the goal of the system proposed here. This system employs a
camera to record various hand gestures. Following that, the image is processed using various
algorithms. Initially, the image is pre-processed. The edges are then determined using an edge
detection algorithm. Finally, the sign is identified and the text is displayed using a template-matching
algorithm. Because the output is text, the meaning of a specific sign can be easily interpreted. The
system is implemented in Python using OpenCV. H Muthu Mariappan, V Gomathi, “Real-Time
Recognition of Indian Sign Language”, International Conference on Computational Intelligence in
Data Science (ICCIDS), 2019 The skin segmentation feature of OpenCV is used to identify and track
the Regions of Interest (ROI) for recognizing the signs. The fuzzy c-means clustering machine learning
algorithm is used to train and predict hand gestures. Many applications for gesture recognition exist,
including gesture-controlled robots and automated homes, game control, Human-Computer
Interaction (HCI), and sign language interpretation. The proposed system can recognize real-time
signs. As a result, it is extremely beneficial for hearing and speech impaired people to communicate
with normal people. Wanbo Li, Hang Pu, Ruijuan Wang, “Sign Language Recognition Based on
Computer Vision”, IEEE International Conference on Artificial Intelligence and Computer Applications
(ICAICA), 2021 16 This system makes use of a PyQt-designed GUI interface. Users can choose
between sign language recognition and translation, image capture with OpenCV, and special processing with the trained CNN neural network. Through LSTM decisions, the model can then
identify American sign language. The user can also use the voice button to have the system convert
the corresponding gesture image into the same pixels and write it to the video file based on the
user's voice. The experimental results show that the rate of sign language recognition is 95.52%
when compared to similar algorithms [5, 6], and the rate of sign language [7] (American sign
language and Arabic numerals) is 90.3%. Pushkar Kurhekar, Janvi Phadtare, Sourav Sinha, Kavita P.
Shirsat, “Real Time Sign Language Estimation System”, 3rd International Conference on Trends in
Electronics and Informatics (ICOEI), 2019 This paper describes a system that allows people with
speech and hearing impairments to communicate freely. The model can extract signs from videos by
processing them frame by frame in a minimally cluttered environment. This sign is then displayed in
text form. For webcam input and displaying the predicted sign, the system employs a Convolutional
Neural Network (CNN) and fastai - a deep learning library - in conjunction with OpenCV. The
experimental results show good sign segmentation in various backgrounds and high accuracy in
gesture recognition. Aniket Kumar, Mehul Madaan, Shubham Kumar, Aniket Saha, Suman Yadav,
“Indian Sign Language Gesture Recognition in Real-Time using Convolutional Neural Networks”, 8th
International Conference on Signal Processing and Integrated Networks (SPIN), 2021 Using Convolutional Neural Networks (CNN), this paper proposes a model for identifying and classifying Indian Sign Language gestures in real time. The model, which was built with OpenCV and Keras CNN implementations, aims to classify 36 ISL gestures representing the 0-9 numbers and A-Z alphabets by transforming them to text equivalents. The dataset that was created and used consists of 300 images for each gesture, which were fed into the CNN model for training and testing. The proposed model was implemented successfully and achieved 99.91% accuracy for the test images. Sai Nikhilesh Reddy
Karna, Jai Surya Kode, Suneel Nadipalli, Sudha Yadav, “American Sign Language Static Gesture
Recognition using Deep Learning and Computer Vision", 2nd International Conference on Smart Electronics and Communication (ICOSEC), 2021 This study proposes a real-time hand-gesture recognition system based on the American Sign Language (ASL) dataset, with data captured using a BGR webcam and processed using Computer Vision (OpenCV). The Vision Transformer (ViT) model was used to train on the 29 static gestures (the alphabet) from the official, standard ASL dataset. After being trained with 87,000 RGB samples (2719 batches of 32 images each) for 1 epoch, the model had an accuracy rate of 99.99%. Shirin Sultana Shanta, Saif Taifur Anwar, Md. Rayhanul Kabir,
“Bangla Sign Language Detection Using SIFT and CNN”, 9th International Conference on Computing,
Communication and Networking Technologies (ICCCNT), 2018 There have been very few studies on
detecting Bangla sign language. The classifiers used by the majority of researchers in this field were
SVM, ANN, or KNN. In this paper, the researchers attempt to develop a Bangla sign language system
that uses SIFT feature extraction and Convolutional Neural Network (CNN) classification. The
researchers also show that using the SIFT feature improves the CNN's detection of Bangla Sign Language.
Rajiv Ranjan, B Shivalal Patro, Mohammad Daim Khan, Manas Chandan Behera, Raushan Kumar,
Utsav Raj, “A Review on Sign Language Recognition Systems”, IEEE 2nd International Conference on
Applied Electromagnetics, Signal Processing, & Communication (AESPC), 2022 For the
enhancement of the recognized gesture, OpenCV, Gaussian blur, and contour approximation
techniques are used. The TensorFlow framework is best suited for accurate software prediction with
a CNN model of VGG-16 architecture and the Transfer Learning training method. The use of these
tools results in a more accurate prediction of sign language gestures in the English alphabet. Given
the project's future scope, the technology stack is both robust and scalable. This review paper discusses various methods and structures for sign language recognition. Kohsheen Tiku, Jayshree
Maloo, Aishwarya Ramesh, Indra R, “Real-time Conversion of Sign Language to Text and Speech”,
Second International Conferenceon Inventive Research in Computing Applications (ICIRCA), 2020 The
purpose of this paper is to examine the performance of various techniques for converting sign
language to text/speech format. Following analysis, an Android application is created that can
convert real-time ASL (American Sign Language) signs to text/speech. Vishwa Hariharan Iyer, U.M
Prakash, Aashrut Vijay, P. Sathishkumar, “Sign Language Detection using Action Recognition”, 2nd
International Conference on Advance Computing and Innovative Technologies in Engineering
(ICACITE), 2022 This proposal is centered on the continuous detection of image frames in real-time
using action detection to detect the user's action. After recognising key points using MediaPipe
holistic, which includes face, pose, and hand features, the model employs an LSTM neural network
model. Collecting key value points for training and testing, pre-processing the data, and creating
labels and features are all part of the proposed work. It saves the weights and uses confusion matrix
accuracy to evaluate the model.
2.1 INFERENCES FROM LITERATURE SURVEY
The related works that have been done have limitations as well as scope for further development. The main limitation of these papers is that there were no combined guidelines to help visually impaired people and deaf people to communicate in sign language effectively. Our proposed system is designed to solve the limitations of these papers by providing a proper guideline and by always forewarning of the obstacles faced simultaneously.
2.2 OPEN PROBLEMS IN EXISTING SYSTEM
In the face of the problems often found with other learning models, such as the presence of multiple local minima and the curse of dimensionality, existing methods have therefore been significantly ineffective. The SVM method does not have any edge over the curse of dimensionality, so its practical importance can be diminished. An immediate problem arising from the SVM's original hyperplane formulation is that it is not very obvious how to make the model applicable to more than two classes. Several approaches have been proposed to overcome this limitation, two of them being known as the 1-vs-1 and 1-vs-all strategies for multi-class classification. For a decision problem over multiple classes, 1-vs-all requires the creation of one classifier per class, each trained to distinguish one class from the others. The decision is then taken in a winner-takes-all approach. However, there is no clear indication that this approach results in an optimum decision. As a result, we have proposed a model that solves these problems and offers higher accuracy and reliability compared to traditional classifier methods.
CHAPTER 3
REQUIREMENT ANALYSIS
3.1 FEASIBILITY STUDIES/RISK ANALYSIS OF THE PROJECT
3.1.1 FEASIBILITY STUDY
The feasibility of the project is analysed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is to be carried out. For feasibility analysis, some understanding of the major requirements for the system is essential. Three key considerations involved in the feasibility analysis are:
● Economic feasibility
● Technical feasibility
● Operational feasibility
ECONOMIC FEASIBILITY
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, and the expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available; only the customized products had to be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
OPERATIONAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make the user familiar with it. The user's level of confidence must be raised so that they are also able to make some constructive criticism, which is welcomed, as they are the final user of the system.
3.2 SOFTWARE REQUIREMENTS SPECIFICATION DOCUMENT
S/W CONFIGURATION:
• Operating System: Windows/Linux/Mac
• Programming Language: Python 3.6
• IDE: Anaconda / Jupyter Notebook
• Packages/libraries: TensorFlow, NumPy, OpenCV
H/W CONFIGURATION:
• Processor: i3/Intel Processor
• Hard Disk: 160 GB
• RAM: 8 GB
CHAPTER 4
DESCRIPTION OF PROPOSED SYSTEM
In
the proposed model, the objective is to recognize all 26 alphabets with the aid of hand gestures through a web camera. We will create an alphabet detection model using a TensorFlow dense model, trained on our alphabet dataset, to detect angled and normal alphabets. This deep learning model will recognize the alphabet from gestures captured in real time via a webcam. We will also create a gesture detection model using a TensorFlow dense model, trained on our hand gesture dataset, to detect sign language hand gestures. This deep learning model will recognize the sign language from hand gestures captured in real time via a webcam. For this system, we will create a Node JS web app to take webcam input using OpenCV and detect alphabets and gestures using our trained model. A very well-known technique which has worked effectively in the case of small labelled data is transfer learning. A network which is trained on a source task is used as a feature extractor for the target task. There are many CNN models trained on ImageNet which are available publicly, such as VGG-16, VGG-19, Inception, and ResNet. The transferable feature representation learned by a CNN minimizes the effect of over-fitting in the case of a small labelled set. For the implementation of this project, 4 algorithms were considered: VGG16, VGG19, ResNet50, and Inception V3. VGG16 is effective in terms of its object detection (car) capability and classification (severity and location) because of its simple linear architecture, and hence its compatibility with the required use case. Hence the VGG16 model was the most suitable for implementation.
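A sketch of how VGG16 could be reused as a frozen feature extractor in Keras is shown below; the input size, dense-head width, and class count are illustrative assumptions rather than values specified in this report.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Load VGG16 trained on ImageNet, without its original classification head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # freeze: use only as a feature extractor

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # small trainable head (illustrative size)
    layers.Dropout(0.5),
    layers.Dense(26, activation="softmax"), # e.g., 26 alphabet classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```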
FIGURE 4.1 BLOCK DIAGRAM FOR PROPOSED SYSTEM
4.1 SELECTED METHODOLOGY OR PROCESS MODEL
4.1.1 Module 1: Data Gathering and Data Preprocessing
Creating a Hand Landmark Creator
We will create a landmark creator function to detect the left and right hand in the frame and create landmarks for the hand. The location of hand landmarks is an important source of information for recognizing hand gestures, effectively exploited in a number of recent methods that work from depth maps. A hand landmark model processes a cropped bounding-box image and returns three-dimensional key points on the hand. This function finds the exact location of 21 key points as 3D hand-knuckle coordinates inside the detected hand regions; the hand landmark model predicts these coordinates directly by regression. Figure 4.2: Hand landmark. Each hand-knuckle landmark coordinate is composed of x, y, and z, where x and y are normalized to [0.0, 1.0] by the image width and height, while z represents the depth of the landmark. The depth is measured relative to the landmark on the wrist, which serves as the origin: the closer the landmark is to the camera, the smaller the value becomes.
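One way such a landmark creator function could be written with the MediaPipe Hands solution is sketched below; the function name and the confidence threshold are assumptions for illustration.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_landmarks(frame_bgr, hands):
    """Return a list of 21 (x, y, z) tuples per detected hand, or an empty list."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)   # MediaPipe expects RGB
    results = hands.process(rgb)
    all_hands = []
    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            # x, y are normalized to [0, 1]; z is depth relative to the wrist
            all_hands.append([(lm.x, lm.y, lm.z) for lm in hand.landmark])
    return all_hands

# Usage (hypothetical): detect up to two hands in webcam frames.
# with mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5) as hands:
#     landmarks = extract_landmarks(frame, hands)
```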
Data Gathering
The process of collecting data is crucial for the development of an effective ML model. The quality and quantity of your dataset have a direct effect on the decision-making process of the AI model, and these two factors influence the robustness, precision and performance of AI algorithms. As a consequence, data collection and structuring often take longer than model training. We will gather data using only a webcam, for sign language that will contain the alphabets (A-Z) and different gestures, i.e. Hello, Thank you, Eat, Goodbye and many more.
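A minimal sketch of this webcam-based collection step with OpenCV is given below; the output directory, key bindings, and image count are illustrative choices, not specifications from this report.

```python
import os
import cv2

def collect_samples(label, out_dir="dataset", num_images=300):
    """Capture webcam frames for one class; press 'c' to save a frame, 'q' to quit."""
    os.makedirs(os.path.join(out_dir, label), exist_ok=True)
    cap = cv2.VideoCapture(0)               # default webcam
    saved = 0
    while saved < num_images:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imshow("collector", frame)      # live preview window
        key = cv2.waitKey(1) & 0xFF
        if key == ord("c"):                 # capture the current frame
            path = os.path.join(out_dir, label, f"{label}_{saved:04d}.jpg")
            cv2.imwrite(path, frame)
            saved += 1
        elif key == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()

# Example: collect_samples("Hello")
```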
the identification and labeling of specific details in an image. In computer vision, data labeling
involves adding tags to raw data like pictures and videos. Each tag stands for an object class
associated with the data. Supervised ML models use labels during learning to identify a specific
object class in unclassified data. It helps these models combine meaning with data, which can help
form a model. Image annotation is used to create data sets for computer vision models that are
divided into training sets, originally used to train the model and test and validation sets used to
assess model performance. Data scientists use the dataset to shape and evaluate their model, then
the model can automatically label unlabeled and unseen data. We will create labels for each image
to classify the alphabets and gestures. Image labeling is an essential function of computer vision
algorithms. The ultimate goal of machine learning algorithms is to obtain automatic labeling, but to
train a model, a large set of pre-labeled image data is required.
METHODS FOR IMAGE LABELING:
Manual image annotation: This is the process of manually defining labels for a complete image or drawing regions in an image and adding text descriptions for each region. Manual annotation is typically supported by tools that enable operators to rotate through a large number of images, draw regions on an image, and label and store these data in a standard format.
Automated image annotation:
Automated annotation tools can help manual annotators by trying to find object limits in an image
and providing a starting point for annotation. Automated annotation algorithms are not entirely
precise, but they can save time for human annotators by providing at least one partial map of the
objects in the image.
Synthetic image labeling: This is a precise, cost-effective technique that can replace
manual annotations. This involves automatically generating images similar to actual data, in
accordance with the criteria defined by the operator. There are three common methods of producing
synthetic images:
Variational Autoencoders (VAE): these are algorithms that start from existing data, create a new data distribution, and link it back to the original space using an encoder-decoder method.
Generative Adversarial Networks (GAN): these are models built as a contest between two neural networks. One neural
network tries to create fake images, while the other tries to distinguish actual images from fake ones.
Over time, the system can produce photorealistic images which are difficult to distinguish from real
images.
Neural Radiance Fields (NeRF): This model takes a series of images describing a 3D scene and automatically generates additional new perspectives of the same scene. It works by calculating a 5-dimensional radiance function to generate each voxel of the target image.
Data Preprocessing
Algorithms which learn from data are just statistical equations operating on database values. Thus, as the saying goes, "garbage in, garbage out". Your data project can only achieve success if the data entering the machines is of high quality. Data from real-world scenarios always contains noise
and missing values. This occurs due to manual mistakes, unexpected events, technical problems, or
various other obstacles. Incomplete and noisy data cannot be consumed by algorithms, as they are
generally not designed to handle missing values, and noise causes disruption to the actual sample
setup. Data preprocessing is a step in transforming raw data so that problems due to
incompleteness, inconsistency, and/or lack of proper representation of trends are resolved in order
to obtain a data set in a comprehensible format. In our model, for data preprocessing we will convert our labels to categorical values using the one-hot encoding method. As we will be taking raw webcam frames, we won't apply any other preprocessing step.
Categorical Encoding
Sometimes the data is in
a form that cannot be processed by machines. For example, a column with string values, like names, will not mean anything to a model that depends solely on numbers. We therefore have to process the data to assist the model in interpreting it; this is referred to as categorical encoding. There are multiple ways in which we can encode categories; here we use the one-hot encoding method, which is appropriate when the data has no inherent order. One-hot encoding generates a column for each category and assigns the value 1 in any row belonging to that category and 0 otherwise. The downside is that multiple features are generated from a single feature, which makes the data larger; this does not matter when there are not too many categories.
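As a brief, hedged illustration (the label names here are hypothetical), the gesture labels could be converted to one-hot vectors with Keras utilities:

from tensorflow.keras.utils import to_categorical

labels = ["hello", "thank_you", "eat", "goodbye"]          # example gesture label names
label_to_index = {name: i for i, name in enumerate(sorted(set(labels)))}
indices = [label_to_index[name] for name in labels]
one_hot = to_categorical(indices, num_classes=len(label_to_index))
# each row of one_hot has a single 1 in the column of its class and 0 elsewhere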
4.1.2 Module 2: Model Creation and Training
Alphabet Detection Model
The objective is to recognize all 26 alphabets with the aid of hand gestures through a web camera. We will create an alphabet detection model using the TensorFlow Dense model.
Figure 4.3: Alphabet hand gestures
The dense layer is a neural network layer with a
deep connection, which means that each neuron in the dense layer receives the input of all neurons
from its previous layer. Dense Layer does a matrix-vector multiplication, and the values used in the
matrix are parameters that can be trained and updated with the help of backpropagation. The output
produced by the dense layer is a vector of dimension 'n'. A dense layer can thus be used to change the dimensions of, rotate, scale, and translate the vector.
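A minimal sketch of such a Dense classification model, assuming the flattened 21 x 3 hand-landmark vector (63 values) is used as input; the layer sizes are illustrative, not the project's exact configuration:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(63,)),                # 21 landmarks x 3 coordinates per frame
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(26, activation="softmax"),   # one output per alphabet A-Z
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])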
Gesture Detection Model
We will create a gesture
detection Model using the Tensorflow Dense model to detect sign language hand gestures to train
our dataset that stands for hand gestures. This deep learning model will recognize the sign language
with hand gestures captured in real-time via a webcam.
Figure 4.4: Sign language hand gestures
TensorFlow's Dense is both a layer type and a function available when implementing artificial intelligence and deep learning in the Python programming language. In dense layers, deep connections exist between the neurons of the network: each individual neuron receives input from all the neurons of the previous layer, forming a complex model. As a function, Dense produces its output by applying the activation to the dot product of the kernel and the input and adding a bias term, i.e., output = activation(dot(input, kernel) + bias).
Data Splitting
Data splitting is the process by which data is divided into two or
more subsets. Typically, with a two-part split, one part is used to evaluate or test the model and the other to train it. Data splitting is an important part of data science, particularly for the
creation of data-based models. This technique ensures that data and process models that utilize data
models, such as machine learning, are accurately created. In a basic two-part split, the training
dataset is used for training and model development. Training sets are routinely used to estimate
different parameters or to compare the performance of different models. The test data set is used at
the end of the training. Training and test data are compared to make sure the final model is working
properly. When using machine learning, data is usually divided into three or more sets. In three sets,
the extra set is the development set, which is used to modify the parameters of the learning process.
Organizations and data modelers may elect to split data using sampling methods such as the following three:
Random sampling: This data sampling method protects the data modeling process against bias toward particular data characteristics. However, random splitting can pose problems in terms of uneven distribution of data.
Stratified random sampling: This selects random data samples within specific parameters and ensures data is properly distributed across training and test sets.
Nonrandom sampling: This approach is generally used when data modelers are looking for the latest data as a test set.
With data splitting, organizations do not have to choose
between using data for statistical analysis and analytics, as the same data can be used in the various processes.
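As a small, hedged sketch of such a split (the arrays X and y stand for the hypothetical landmark features and one-hot labels from the earlier steps), scikit-learn's train_test_split supports both random and stratified sampling:

from sklearn.model_selection import train_test_split

# hold out 20% of the data for testing; stratify on the class index so every
# sign is proportionally represented in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y.argmax(axis=1), random_state=42)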
Training Model
Model training is the lifecycle phase of data science development during
which practitioners attempt to figure out the best combination of weight and bias to a machine
learning algorithm to minimize a loss function on the prediction range. The goal of model
development is to construct the best mathematical representation of the relationship between data
features and a target label (in supervised learning) or among the features themselves (in
unsupervised learning). Model training is the key step of machine learning leading to a model
ready for validation, testing, and deployment. The performance of the model determines the quality
of the applications which are constructed using it. The quality of the training data and the training
algorithm are significant assets in the training phase of the model. Training information is usually
split for training, validation, and testing. The training algorithm is selected as a function of the end-
use case. There are a number of tradeoff points for choosing the best algorithm – model complexity,
interpretability, performance, computational requirements, etc. All these aspects of model training
make it a process that is both involved and important for the overall development cycle of machine
learning. Saving the Best Model Saving our machine-learning models is an important step in the
machine-learning workflow, so we can reuse them in the future. For example, it is very likely that we
will have to compare models to determine which champion model to use in production. Saving the
models while they are being trained facilitates this process. The alternativewould be to form the
model whenever it is to be used, which can greatly affect productivity, especially if the model takes a
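One way to do this, sketched here under the assumption that the Keras models above are used (the file name and monitored metric are illustrative), is a ModelCheckpoint callback that keeps only the best-performing weights seen during training:

import tensorflow as tf

# save the model with the best validation accuracy observed so far
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_gesture_model.h5", monitor="val_accuracy",
    save_best_only=True, mode="max")

model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          epochs=50, callbacks=[checkpoint])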
4.1.3 Module 3: Creating Node JS Web App
An important part of building a
machine learning model is to share the model we have built with others. No matter how many models we create, if they remain offline, very few people will be able to see what we are achieving. That is why we should deploy our models, so that anyone can use them through a nice User Interface (UI). For this system, we will create a Node JS web app to take webcam input using OpenCV
and detect alphabets and gestures using our trained model. Node.js is a server-side platform powered by Google Chrome's JavaScript engine (the V8 engine). It is a platform built on the Chrome JavaScript runtime to build fast, scalable network applications with ease. Node.js uses a non-blocking, event-driven I/O model that makes it lightweight and efficient, perfect for real-time, data-intensive applications that operate across distributed devices. It is an open-source, cross-platform runtime environment for developing server-side and networking applications, and Node.js applications run on OS X, Microsoft Windows, and Linux. Node.js also provides a rich library of JavaScript modules that greatly simplifies the development of web applications. Node.js = Runtime Environment + JavaScript Library. These are just a few of the key features that make Node.js the first
choice for software architects:
Asynchronous and Event Driven: All APIs of the Node.js library are asynchronous, meaning non-blocking. This basically means that a Node.js-based server never waits for an API to return data. The server moves to the next API after calling it, and a Node.js event notification mechanism helps the server get a response from the previous API call.
Very Fast: Being built on Google Chrome's V8 JavaScript engine, the Node.js library is very fast in code execution.
Single-Threaded but Highly Scalable: Node.js has a single-threaded model with an event loop. The event mechanism helps the server respond in a non-blocking manner and makes the server highly scalable, as opposed to traditional servers that create a limited number of threads to manage requests. Node.js uses a single-threaded program, and the same program can serve many more requests than traditional servers like the Apache HTTP Server.
No Buffering: Node.js applications never buffer data; they simply output data in chunks.
License: Node.js is released under the MIT license.
4.2 ARCHITECTURE / OVERALL DESIGN OF THE PROPOSED SYSTEM
Fig 4.5: System
Architecture for Sign Language
4.3 DESCRIPTION OF SOFTWARE FOR IMPLEMENTATION AND TESTING PLAN OF THE PROPOSED MODEL/SYSTEM
Anaconda is an open-source package manager for
Python and R. It is the most popular platform among data science professionals for running Python
and R implementations. There are over 300 libraries used in data science, so having a robust distribution system for them is a must for any professional in this field. Anaconda simplifies package deployment and management. On top of that, it has plenty of tools that can help you with everything from data collection through to artificial intelligence and machine learning algorithms. With Anaconda, you can easily set up, manage, and share Conda environments. Moreover, you can deploy any required project with a few clicks when you're using Anaconda. There are many advantages to using Anaconda, and the following are the most prominent ones:
Anaconda is free and open source, which means you can use it without spending any money. In the data science sector, Anaconda is an industry staple. It is open-
source too, which has made it widely popular. If you want to become a data science professional, you
must know how to use Anaconda for Python because every recruiter expects you to have this skill. It
is a must-have for data science. It has more than 1500 Python and R data science packages, so you
don’t face any compatibility issues while collaborating with others. For example, suppose your
colleague sends you a project which requires packages called A and B but you only have package A.
Without having package B, you wouldn’t be able to run the project. Anaconda mitigates the chances
of such errors. You can easily collaborate on projects without worrying about any compatibility
issues. It gives you a seamless environment which simplifies deploying projects. You can deploy any
project with just a few clicks and commands while managing the rest. Anaconda has a thriving
community of data scientists and machine learning professionals who use it regularly. If you
encounter an issue, chances are, the community has already answered the same. On the other hand,
you can also ask people in the community about the issues you face there, it’s a very helpful
community ready to help new learners. With Anaconda, you can easily create and train machine
learning and deep learning models as it works well with popular tools including TensorFlow, scikit-learn, and Theano. You can create visualizations using Bokeh, HoloViews, Matplotlib, and Datashader while using Anaconda.
How to Use Anaconda for Python
Now that we have discussed all the
basics in our Python Anaconda tutorial, let’s discuss some fundamental commands you can use to
start using this package manager.
Listing All Environments
To begin using Anaconda, you first need to see how many Conda environments are present on your machine:
conda env list
This will list all the available Conda environments on your machine.
Creating a New Environment
You can create a new Conda environment by going to the required directory and using this command (replace <env_name> with the name of your environment):
conda create -n <env_name>
After entering this command, conda will ask you whether you want to proceed, to which you should reply with y:
proceed ([y]/n)?
If you want to create an environment with a particular version of Python, use the following command:
conda create -n <env_name> python=3.6
Similarly, if you want to create an environment with a particular package, you can use the following command, replacing pack_name with the name of the package you want to use:
conda create -n <env_name> pack_name
If you have a .yml file, you can use the following command to create a new Conda environment based on that file:
conda env create -n <env_name> -f <file_name>.yml
We have also discussed how you can export an existing Conda environment to a .yml file later in this section.
You can activate a Conda environment by using the following command, replacing <env_name> with the name of the environment you want to activate:
conda activate <env_name>
You should activate an environment before you start working in it. If you want to deactivate an environment, use the following command:
conda deactivate
Installing Packages in an Environment
Now that you have an activated environment, you can install packages into it by using the following command, replacing <package_name> with the name of the package you want to install:
conda install <package_name>
Updating Packages in an Environment
If you want to update the packages present in a particular Conda environment, use the following command, which updates all the packages present in the environment:
conda update
However, if you want to update a package to a certain version, you will need to use the following command:
conda install <package_name>=<version>
Exporting an Environment Configuration
Suppose you want to share your project with someone else (a colleague, a friend, etc.). While you could share the directory on GitHub, it would contain many Python packages, making the transfer process very challenging. Instead, you can create an environment configuration .yml file and share it with that person, who can then create an environment like yours from the .yml file. To export the environment to a .yml file, first activate it and run the following command:
conda env export > <file_name>.yml
The person you want to share the environment with only has to use the exported file with the 'Creating a New Environment' command shown before.
Removing a Package from an Environment
If you want to uninstall a package from a specific Conda environment, use the following command:
conda remove -n <env_name> <package_name>
If you want to uninstall a package from the currently activated environment, use the following command:
conda remove <package_name>
Deleting an Environment
Sometimes you don't need to add a new environment but rather remove one. In such cases, you must know how to delete a Conda environment, which you can do with the following command:
conda env remove --name <env_name>
The above command deletes the Conda environment right away.
4.4 PROJECT MANAGEMENT PLAN
Introduction: October 7-21
Literature Survey: October 21-23
System Design: November 1-15
System Implementation: November 16-29
UML Diagrams: December 1-25
Conclusions: January 1-15
References and Future Work: February 1-20
Testing: March 1-25
CHAPTER 5 IMPLEMENTATION DETAILS
5.1 Algorithms
5.1.1 Scale-Invariant Feature Transform (SIFT)
The
SIFT (Scale-Invariant Feature Transform) algorithm is a computer vision algorithm used for detecting
and describing local features in images. It was introduced by David Lowe in 1999 and has since
become a popular algorithm in computer vision and image processing. The SIFT algorithm consists of
the following steps: Scale-space extrema detection: The first step involves identifying potential
interest points in the image at different scales using the Difference of Gaussian (DoG) function. The
DoG function is obtained by taking the difference of two Gaussian-blurred images at different scales.
The local minima and maxima of the DoG function are then identified as potential interest points.
Key point localization: Once the potential interest points are identified, the algorithm refines their
location and scale by fitting a 3D quadratic function to the DoG function at each point. Orientation
assignment: The algorithm assigns an orientation to each key point based on the local image gradient
directions. This helps in making the descriptor rotationally invariant. Key point descriptor generation:
A descriptor is generated for each key point by considering the local image gradient directions and
magnitudes in a surrounding region. This descriptor is a vector of length 128, which is invariant to
scale, rotation, and illumination changes. Key point matching: Finally, the descriptors of two images
are compared using a distance metric (e.g., Euclidean distance) to find matching key points between
the two images. The SIFT algorithm has been widely used in various applications, such as image
stitching, object recognition, and 3D reconstruction. However, due to its computational complexity, it may not be suitable for real-time applications.
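For illustration, and assuming the OpenCV implementation is used (the image paths and the matching threshold are placeholders, not project specifics), SIFT keypoints can be detected and matched as follows:

import cv2

img1 = cv2.imread("sign_a.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("sign_b.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)   # keypoints and 128-dimensional descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)            # Euclidean-distance matching
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe's ratio test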
5.1.2 Histogram of Oriented Gradients (HOG)
The "HOG
algorithm" typically refers to the Histogram of Oriented Gradients (HOG) feature extraction method,
which is used in computer vision and image processing for object detection and recognition tasks.
The HOG algorithm works by computing the gradients of an image, which represent the intensity and
direction of local edges. These gradients are then quantized into orientation bins, which are used to
construct a histogram of gradient orientations within a given image region. The resulting histogram
can be used as a feature descriptor to represent the local texture and shape information of an object.
HOG is commonly used in combination with a support vector machine (SVM) classifier to perform
object detection and recognition. The SVM classifier is trained on a set of positive and negative
examples of the object of interest, using the HOG features as input. During testing, the HOG features
of an image are extracted and fed into the SVM, which predicts whether the object is present or not.
HOG has been successfully applied to a wide range of object recognition tasks, including pedestrian
detection, face detection, and vehicle detection. However, it is computationally expensive and may not perform well in situations with significant variations in scale and orientation.
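A minimal sketch of HOG feature extraction, assuming scikit-image is used (the image path and parameter values are illustrative, not taken from this project):

from skimage.feature import hog
from skimage import io, color

image = color.rgb2gray(io.imread("hand_gesture.jpg"))
# 9 orientation bins, 8x8-pixel cells, 2x2-cell blocks with L2-Hys normalization
features = hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")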
5.1.3 Support
Vector Machine Support Vector Machines (SVM) is a supervised machine learning algorithm used for
classification and regression analysis. SVM aims to find the optimal boundary (hyperplane) that
separates the data into different classes while maximizing the margin between the classes. In the
case of binary classification, the SVM algorithm works by finding the hyperplane that best separates
the data into two classes by maximizing the margin. The margin is defined as the distance between
the hyperplane and the nearest data points of each class. SVM uses a kernel function to map the
input data into a high-dimensional feature space where it is easier to find a hyperplane that
separates the data. Common kernel functions include linear, polynomial, radial basis function (RBF),
and sigmoid. SVMs are known for their ability to handle high-dimensional data and for their
robustness against overfitting. However, SVMs can be sensitive to the choice of kernel function and
the regularization parameter. Overall, SVM is a powerful algorithm that has been widely used in various applications such as image classification, text classification, and bioinformatics.
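A brief, hedged sketch of an SVM classifier over such features using scikit-learn (variable names like X_hog_train are placeholders for HOG feature matrices and their labels):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# RBF-kernel SVM with feature scaling; C and gamma would normally be tuned
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_hog_train, y_train_labels)
print("test accuracy:", clf.score(X_hog_test, y_test_labels))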
5.2 Testing
• UNIT TESTING
• INTEGRATION TESTING
• FUNCTIONAL TESTING
• WHITE BOX TESTING
• BLACK BOX TESTING
5.2.1 Unit Testing
o Unit testing is used to ensure that each modular component of the
project is working.
o The smallest unit of the software design is the subject of unit testing.
o The mentioned project underwent a progressive examination of unit testing.
o The unit testing findings were good and encouraging.
5.2.2 Integration Testing
o Integration testing is a methodical
methodology for building the software architecture while also running tests to detect faults related to the interface.
o Integration testing, in other words, is the comprehensive testing of the product's collection of modules.
o For the described project, a sequential analysis of integration testing was undertaken.
o The findings of the integration testing were positive and encouraging.
5.2.3 Functional Testing
o Functional test cases included running the code with nominal input values for which the anticipated outputs were known, as well as boundary values and unusual values such as logically linked inputs, files with identical components, and empty files.
o For the project under consideration, a sequential analysis of functional testing was carried out.
o The outcomes of functional testing were positive and encouraging.
5.2.4 White Box Testing
o This kind of testing is also known as glass box testing.
o By understanding a product's internal operation, testing can be carried out to guarantee that "all gears mesh," that is, that the internal operation performs according to specification and all internal components have been appropriately exercised.
o It is a test case design approach that derives test cases from the procedural design's control structure.
o White box testing is used for basis path testing.
o For the project under consideration, a sequential assessment of white box testing was carried out.
o White box testing yielded positive and encouraging findings.
5.2.5 Black Box Testing
o By understanding the precise tasks that a product has been designed to accomplish, testing can be performed to verify that each function is completely operational while also looking for flaws in each function.
o For the described project, a sequential analysis of black box testing was undertaken.
o The outcomes of black box testing were positive and encouraging.
CHAPTER 6
RESULTS
The online hand gesture recognition and classification system will be able to recognize a
range of hand gestures in real-time. The system will be able to translate the recognized gestures into
text or speech, depending on the user's preferences. The system will be tested using a range of hand
gestures, and the accuracy of the system will be evaluated. The accuracy of the system will be
measured in terms of the percentage of correctly recognized gestures.
Figure 6.1: Sequence Diagram
Figure 6.2: Data Flow Diagram
In this work, a deep learning based CNN model has been proposed
to identify static hand gestures. Traditional models have been found to be restrictive in the sense that they need preprocessing of the image to achieve good accuracy. Deep learning models, however, have an advantage over traditional methods in that they do not need prior, hand-crafted feature extraction, which makes them very robust. The DCNN model we employed is therefore very
attractive for gesture recognition applications. The model thus trained has shown good results with
cluttered backgrounds, poor lighting conditions, and different hand morphologies. The robustness of
the model has been attributed to the capacity of DCNNs to learn complex features in images.
Another important feature of the present model is that it is very accurate. The model achieved 99.6% test accuracy, which is among the highest reported test accuracies. The high accuracy can be attributed to careful data collection, data augmentation, and the network architecture. While collecting data, care was taken to include various scenarios. Good data collection coupled with data augmentation led to good generalization of the network.
Figure 6.3: Output Screenshot
CHAPTER 7
7.1 Conclusion:
This project introduces a CNN-based approach for the
recognition and classification of sign language using deep learning. Our proposed system was able to achieve 99% training accuracy, with testing accuracies of 90.04% in letter recognition, 93.44% in number recognition, and 97.52% in static word recognition, obtaining an average of 93.667% for gesture recognition within limited time. As for future work, the evolution of the proposed
system can be generalized for a wider class of hand gestures. Furthermore, the impact of different
parameters on the performance of the system and the accuracy of results can be further
investigated. Time frame selection techniques can be improved, more optimization can be added, and the loss function can be studied more deeply. Using OpenCV, we offer a technique for identifying and recognizing signs in this paper. Throughout the training of each system, each letter, number, and word gesture was utilized 50 times. This method has substantially better accuracy and significantly
fewer false positives than the others. The identification of hand gestures has seen significant
advancement in the last several decades thanks to the efforts of numerous experts. Sign language
interpretation devices, augmented or virtual reality, robot control, and the ability to recognize hand
gestures are just a few of the numerous areas where this technology has proven useful. Since the
introduction of the next generation of gesture interface technology, the value of gesture recognition
has skyrocketed. Most people who use sign language are deaf or mute and hence unable to
communicate verbally. It is difficult to express emotions clearly when communicating with someone who can see but does not share a common hand sign language. OpenCV provides a fast and convenient
way of recognizing and classifying hand gestures. This is especially suitable for the Deaf and Dumb
community, who have limited access to other forms of communication. The cost of implementation
is also quite low, making it an ideal solution. In conclusion, hand gesture recognition and
classification for the deaf and dumb can bring about transformative, life-changing improvements in the lives of people with hearing impairments. With the use of machine learning and open-source libraries, it is becoming increasingly easier to develop projects centered around hand gesture recognition and classification.
7.2 Future Enhancements
As for future work, the evolution of the proposed system can
be generalized for a wide class of hand gestures. Furthermore, the impact of different parameters on
the performance of the system and the accuracy of results can be further investigated. Time frame
selection techniques can be bettered, more optimization can be added, and the loss function can be
studied more deeply.
7.3 Implementation
• Image Pre-Processing: The captured objects are pre-processed. Pre-processing is done to remove the extra objects.
• Feature Extraction and Recognition: An algorithm is used to determine an image's characteristics. To retrieve the best feature-matched image from the database, the algorithm is applied to the captured image.
REFERENCES:
[1] Soma Shrenika, Myneni Madhu Bala, “Sign Language Recognition Using Template Matching
Technique”, International Conference on Computer Science Engineering and Applications (ICCSEA),
2020 [2] H Muthu Mariappan, V Gomathi, “Real-Time Recognition of Indian Sign Language”,
International Conference on Computational Intelligence in Data Science (ICCIDS), 2019 [3] Wanbo Li,
Hang Pu, Ruijuan Wang, “Sign Language Recognition Based on Computer Vision”, IEEE International
Conference on Artificial Intelligence and Computer Applications (ICAICA), 2021 [4] Pushkar Kurhekar,
Janvi Phadtare, Sourav Sinha, Kavita P. Shirsat, “Real Time Sign Language Estimation System”, 3rd
International Conference on Trends in Electronics and Informatics (ICOEI), 2019 [5] Aniket Kumar,
Mehul Madaan, Shubham Kumar, Aniket Saha, Suman Yadav, “Indian Sign Language Gesture
Recognition in Real-Time using Convolutional Neural Networks”, 8th International Conference on
Signal Processing and Integrated Networks (SPIN), 2021 [6] Sai Nikhilesh Reddy Karna, Jai Surya Kode,
Suneel Nadipalli, Sudha Yadav, “American Sign Language Static Gesture Recognition using Deep
Learning and Computer Vision”, 2nd International Conference on Smart Electronics and
Communication (ICOSEC), 2021 [7] Shirin Sultana Shanta, Saif Taifur Anwar, Md. Rayhanul Kabir,
“Bangla Sign Language Detection Using SIFT and CNN”, 9th International Conference on Computing,
Communication and Networking Technologies (ICCCNT), 2018 [8] Rajiv Ranjan, B Shivalal Patro,
Mohammad Daim Khan, Manas Chandan Behera, Raushan Kumar, Utsav Raj, “A Review on Sign
Language Recognition Systems”, IEEE 2nd International Conference on Applied Electromagnetics,
Signal Processing, & Communication (AESPC), 2022 [9] Kohsheen Tiku, Jayshree Maloo, Aishwarya
Ramesh, Indra R, “Real-time Conversion of Sign Language to Text and Speech”, Second International
Conference on Inventive Research in Computing Applications (ICIRCA), 2020 [10] Vishwa Hariharan
Iyer, U.M Prakash, Aashrut Vijay, P. Sathishkumar, “Sign Language Detection using Action
Recognition”, 2nd International Conference on Advance Computing and Innovative Technologies in
Engineering (ICACIT)

ISLTranslate: Dataset for Translating Indian Sign Language Abhinav Joshi1 Susmit Agrawal2 Ashutosh
Modi1 1 Indian Institute of Technology Kanpur (IIT Kanpur) 2 Indian Institute of Technology
Hyderabad (IIT Hyderabad) {ajoshi, ashutoshm}@cse.iitk.ac.in, [email protected] Abstract
Sign languages are the primary means of communication for many hard-of-hearing people
worldwide. Recently, to bridge the communication gap between the hard-of-hearing community and
the rest of the population, several sign language translation datasets have been proposed to enable
the development of statistical sign language translation systems. However, there is a dearth of sign
language resources for the Indian sign language. This resource paper introduces ISLTranslate, a
translation dataset for continuous Indian Sign Language (ISL) consisting of 31k ISL-English
sentence/phrase pairs. To the best of our knowledge, it is the largest translation dataset for
continuous Indian Sign Language. We provide a detailed analysis of the dataset. To validate the
performance of existing end-to-end Sign language to spoken language translation systems, we
benchmark the created dataset with a transformer-based model for ISL translation. 1 Introduction
There are about 430 million hard-of-hearing people worldwide1 of which 63 million are in India2 .
Sign Language is a primary mode of communication for the hard-of-hearing community. Although
natural language processing techniques have shown tremendous improvements in the last five years,
primarily, due to the availability of annotated resources and large language models (Tunstall et al.,
2022), languages with bodily modalities like sign languages still lack efficient language-processing
systems. Recently, research in sign languages has started attracting attention in the NLP community
(Yin et al., 2021; Koller et al., 2015; Sincan and Keles, 2020; Xu et al., 2022; Albanie et al., 2020; Jiang
et al., 2021; Moryossef et al., 2020; Joshi 1 https://www.who.int/news-room/fact-sheets/
detail/deafness-and-hearing-loss 2 https://nhm.gov.in/index1.php?lang=1&level=2&
sublinkid=1051&lid=606 Figure 1: An example showing the translation of the phrase “Let’s discuss" in
Indian Sign Language. et al., 2022). The availability of translation datasets has improved the study
and development of NLP systems for sign languages like ASL (American Sign Language) (Li et al.,
2020), BSL (British Sign Language) (Albanie et al., 2021), and DGS (Deutsche Gebärdensprache)
(Camgoz et al., 2018). On the other hand, there is less amount of work focused on Indian Sign
Language. The primary reason is the unavailability of large annotated datasets for Indian Sign
Language (ISL). ISL being a communication medium for a large, diverse population in India, still faces
the deficiency of certified translators (only 300 certified sign language interpreters in India3 ), making
the gap between spoken and sign language more prominent in India. This paper aims to bridge this
gap by curating a new translation dataset for Indian Sign Language: ISLTranslate, having 31,222 ISL-
English pairs. Due to fewer certified sign language translators for ISL, there is a dearth of educational
material for the hard-of-hearing community. Many government and non-government organizations
in India have recently started bridging this gap by creating standardized educational content in ISL.
The created content helps build basic vocabulary for hard-of-hearing children and helps people use
spoken languages to learn and teach ISL to children. Considering the standardized representations
and simplicity in the vocabulary, we choose these contents for curating an ISL-English translation dataset. (The figure of 300 certified interpreters is as per the Indian government organization Indian Sign Language Research and Training Centre (ISLRTC): http://islrtc.nic.in/) We
choose the content specific to education material that is standardized and used across India for
primary-level education. Consequently, the vocabulary used in the created content covers diverse
topics (e.g., Maths, Science, English) using common daily words. ISL is a low-resource language, and
the presence of bodily modality for communication makes it more resource hungry from the point of
view of training machine learning models. Annotating sign languages at the gesture level (grouping
similar gestures in different sign sentences) is challenging and not scalable. Moreover, in the past,
researchers have tried translating signs into gloss representation and gloss to written language
translation (Sign2Gloss2Text (Camgoz et al., 2018)). A gloss is a text label given to a signed gesture.
The presence of gloss labels for sign sentences in a dataset helps translation systems to work at a
granular level of sign translation. However, generating gloss representation for a signed sentence is
an additional challenge for data annotation. For ISLTranslate, we propose the task of end-to-end ISL
to English translation. Figure 1 shows an example of an ISL sign video from ISLTranslate. The example
shows a translation for the sentence “Let’s discuss”, where the signer does the sign for the word
“discuss” by circularly moving the hands with a frown face simultaneously followed by palm lifted
upwards for conveying “let’s.” The continuity present in the sign video makes it more challenging
when compared to the text-to-text translation task, as building a tokenized representation for the
movement is a challenging problem. Overall, in this resource paper, we make the following
contributions: • We create a large ISL-English translation dataset with more than 31,222 ISL-English sentence/phrase pairs. The dataset covers a wide range of daily communication words with a vocabulary size of 11,655. We believe making this dataset available to the NLP community will
facilitate future research in sign languages with a significant societal impact. Moreover, though not
attempted in this paper, we hope that ISLTranslate could also be useful for sign language generation
research. The dataset is made available at: https://github.com/ Exploration-Lab/ISLTranslate. • We
propose a baseline model for end-to-end ISL-English translation inspired by sign language
transformer (Camgoz et al., 2020). 2 Related Work In contrast to spoken natural languages, sign
languages use bodily modalities, which include hand shapes and locations, head movements (like
nodding/shaking), eye gazes, finger-spelling, and facial expressions. As features from hand, eye,
head, and facial expressions go in parallel, it becomes richer when compared to spoken languages,
where a continuous spoken sentence can be seen as a concatenated version of the sound articulated
units. Moreover, translating from a continuous movement in 3 dimensions makes sign language
translation more challenging and exciting from a linguistic and research perspective. Sign Language
Translation Datasets: Various datasets for sign language translation have been proposed in recent
years (Yin et al., 2021). Specifically for American Sign Language (ASL), there have been some early
works on creating datasets (Martinez et al., 2002; Dreuw et al., 2007), where the datasets were
collected in the studio by asking native signers to sign content. Other datasets have been proposed
for Chinese sign language (Zhou et al., 2021), Korean sign language (Ko et al., 2018), Swiss German
Sign Language - Deutschschweizer Gebardensprache (DSGS) and Flemish Sign Language - Vlaamse
Gebarentaal (VGT) (Camgöz et al., 2021). In this work, we specifically target Indian Sign Language and
propose a dataset with ISL videos-English translation pairs. End-to-End Sign Language Translation
Systems: Most of the existing approaches for sign language translation (Camgoz et al., 2018; De
Coster and Dambre, 2022; De Coster et al., 2021) depend on intermediate gloss labels for
translations. As glosses are aligned to video segments, they provide fine one-to-one mapping that
facilitates supervised learning in learning effective video representations. Previous work (Camgoz et
al., 2018) has reported a drop of about 10.0 in BLEU-4 scores without gloss labels. However,
considering the annotation cost of gloss-level annotations, it becomes imperative to consider gloss-
free sign language translation approaches. Moreover, the gloss mapping in continuous sign language
might remove the grammatical aspects from the sign language. Other recent works on Sign language
translation include Voskou et al. (2021) and Yin and Read (2020), which try to remove the requirement for a gloss sequence for training and propose a transformer-based architecture for end-to-end translations. We also follow a gloss-free approach for ISL translation.
Figure 2: A sample from ISLTranslate: "Sign Language is a visual language consisting of signs, gestures, fingerspelling and facial expressions."
Dataset | Lang. | Sentences | Vocab.
Purdue RVL-SLLL (Martinez et al., 2002) | ASL | 2.5k | 104
Boston 104 (Dreuw et al., 2007) | ASL | 201 | 103
How2Sign (Duarte et al., 2021) | ASL | 35k | 16k
OpenASL (Shi et al., 2022) | ASL | 98k | 33k
BOBSL (Albanie et al., 2021) | BSL | 1.9k | 78k
CSL Daily (Zhou et al., 2021) | CSL | 20.6k | 2k
Phoenix-2014T (Camgoz et al., 2018) | DGS | 8.2k | 3k
SWISSTXT-Weather (Camgöz et al., 2021) | DSGS | 811 | 1k
SWISSTXT-News (Camgöz et al., 2021) | DSGS | 6k | 10k
KETI (Ko et al., 2018) | KSL | 14.6k | 419
VRT-News (Camgöz et al., 2021) | VGT | 7.1k | 7k
ISL-CSLRT (Elakkiya and Natarajan, 2021) | ISL | 100 | -
ISLTranslate (ours) | ISL | 31k | 11k
Table 1: Comparison of continuous sign language translation datasets.
3 ISLTranslate
ISLTranslate is a dataset created from publicly available educational
videos produced by the ISLRTC organization and made available over YouTube. These videos were
created to provide school-level education to hard-of-hearing children. The videos cover the NCERT-standardized English educational content in ISL (NCERT: https://en.wikipedia.org/wiki/National_Council_of_Educational_Research_and_Training). As the targeted
viewers for these videos are school children and parents, the range of words covered in the videos is
beginner-level. Hence, it provides a good platform for building communication skills in ISL. The videos
cover various NCERT educational books for subjects like science, social science, and literature. A
single video (about 15-30 minutes) usually covers one chapter of a book and simultaneously provides
an audio voice-over (in English) conveying the same content. Apart from ISLRTC’s educational sign
videos which make up a significant portion of ISLTranslate, we also use another resource from Deaf
Enabled Foundations (DEF) (https://def.org.in/). DEF videos consist of words with respective
descriptions and example sentences for the same words, along with the text transcriptions available
in the descriptions5 . We split the DEF Sign videos into multiple segments using visual heuristics for
separating segments corresponding to words, descriptions, and examples. In total, ISLTranslate
consists of 2685 videos (8.6%) from DEF, and the remaining 28537 videos (91.4%) are from ISLRTC.
ISLTranslate Creation: We use the audio voiceover (by analyzing the speech and silence parts) to split
the videos into multiple segments. Further, these segments are passed through the SOTA speech-to-
text model (Radford et al., 2022) to generate the corresponding text. As the generated text is the
same as present in the book chapters’ text, verifying the generated sample was easy and was done
by manually matching them with the textbook. In general, we found automatically transcribed text to
be of high quality; nevertheless, incorrectly generated text was manually corrected with the help of content in the books. (Example DEF video: https://www.youtube.com/watch?v=429wv1kvK_c)
ISLTranslate Translations | ISL-Signer Translations (references)
Birbal started smiling. When it was his turn, he went near the line.
Birbal started smiling. He turned towards the drawn line. Discuss with your partner what Birbal
would do. Discuss with your partner what Birbal would do. Birbal drew a longer line. Under the
drawn line, Birbal under the first one. drew a longer line. saw what he drew and said. and wondered
That’s true, the first line is shorter now. That’s true, the first line is shorter now. One day, Akbar drew
a line on the floor and ordered. One day, Akbar drew a line and ordered Make this line shorter. Make
this line shorter. Rita is shorter than Radha. Rita is short and the other is Radha. Rajat is taller than
Raj. Rajat is taller and the other is Raj. but don’t rub out any part of it. but don’t rub out any part of
it. Try to draw Rajat’s taller than Raj. First draw Rajat as taller, then draw Raj on the right. No one
knew what to do. No one knew what to do. Each minister looked at the line and was puzzled. Each
minister looked at the line and was puzzled. No one could think of any way to make it longer. No one
could think of any way to make it longer. Have you seen the fine wood carving? Look at its
architecture. Most houses had a separate Most houses had a separate bathing Beding area. separate
and some had wells to supply water. and some had wells to supply water. Many of these cities had
covered drains. Many of these cities had covered drains. Notice how carefully these were laid out in
straight lines. Notice how carefully these were laid out in straight lines. Table 2: The Table shows a
sample of English translations present in ISLTranslate compared to sentences translated by ISL Signer
for the respective ISL videos. The exact ISL-Signer Translations were used as reference sentences for
computing translation metric scores reported in Table 3. Blue and Red colored text highlight the
difference between semi-automatically generated English sentences and gold sentences generated
by the ISL instructor.
Metric | Score
BLEU-1 | 60.65
BLEU-2 | 55.07
BLEU-3 | 51.43
BLEU-4 | 48.93
METEOR | 57.33
WER | 61.88
ROUGE-L | 60.44
Table 3: Translation scores for a random sample of 291 pairs from ISLTranslate when compared to references translated by the ISL instructor.
Figure 2 shows an example (from ISLTranslate) of a long sentence and its
translation in ISL. The frames in the figure are grouped into English words and depict the continuous
communication in ISL. Notice the similar words in the sentence, “sign/signs” and “language.” (also
see a visual representation of Sign Language6 ). As common to other natural languages,
representations (characters/gestures) of different lengths are required for communicating different
words. In ISLTranslate, we restrict to the sentence/phrase level translations. The dataset is divided
into train, validation, and test splits (Details in App. A). App. Figure 3 shows the distribution of the
number of samples in various splits. (Visual representation of Sign Language: https://www.youtube.com/watch?v=SInKhy-06qA)
Comparison with Other Continuous Sign Language Datasets: We primarily compare with video-based datasets
containing paired continuous signing videos and the corresponding translation in written languages
in Table 1. To the best of our knowledge, we propose the largest dataset for ISL. Data Cleaning and
Preprocessing: The videos (e.g., App. Fig. 4) contain pictures of the corresponding book pages. We crop
the signer out of the video by considering the face location as the reference point and removing the
remaining background in the videos. Noise Removal in ISLTranslate: As ISLTranslate consists of videos clipped from longer videos using pauses in the available audio signal, there are multiple ways in which noise might creep into the dataset. While translating the text in the audio, a signer
may use different signs that may not be the word-to-word translation of the respective English
sentence. Moreover, though the audio in the background is aligned with the corresponding signs in
the video, it could happen in a few cases that the audio was fast compared to the corresponding sign
representation and may miss a few words at the beginning or the end of the sentence. We also found
a few cases where while narrating a story, the person in the audio takes the character role by
modifying speech to sound like the designated character speaking the sentence. For example, in a
story where a mouse is talking, instead of saying the sentence followed by the “said the mouse”
statement, the speakers may change their voice and increase the pitch to simulate dialogue spoken
by the mouse. In contrast, in the sign language video, a person may or may not take the role of the
mouse while translating the sentence to ISL. ISLTranslate Validation: To verify the reliability of the
sentence/phrase ISL-English pairs present in the dataset, we take the help of a certified ISL signer.
Due to the limited availability of certified ISL signers, we could only use a small randomly selected sample of sign-text pairs (291 pairs) for human translation and validation. We ask an ISL instructor to
translate the videos (details in App. C). Each video is provided with one reference translation by the
signers. Table 2 shows a sample of sentences created by the ISL instructor. To quantitatively estimate
the reliability of the translations in the dataset, we compare the English translation text present in
the dataset with the ones provided by the ISL instructor. Table 3 shows the translation scores for 291
sentences in ISLTranslate. Overall, the BLEU-4 score is 48.94, ROUGE-L (Lin, 2004) is 60.44, and WER
(Word Error Rate) is 61.88. To provide a reference for comparison, for text-to-text translations the BLEU score of human translations ranges from 30-50 (as reported by Papineni et al. (2002): on a test corpus of about 500 sentences from 40 general news stories, a human translator scored 34.68 against four references). This suggests high reliability of the translations present in ISLTranslate, with a BLEU score of 48.93 when compared against the reference translations provided by a certified ISL
Signer. Ideally, it would be better to have multiple reference translations available for the same
signed sentence in a video; however, the high annotation effort along with the lower availability of
certified ISL signers makes it a challenging task. 4 Baseline Results Given a sign video for a sentence,
the task of sign language translation is to translate it into a spoken language sentence (English in our
case). For benchmarking ISLTranslate, we create a baseline architecture for ISL-to-English translation.
We propose an ISL-pose to English translation baseline (referred to as Pose-SLT) inspired by Sign
Language Transformer (SLT) (Camgoz et
al., 2020). Sign language transformer uses image features with transformers for generating text
translations from a signed video. However, considering the faster realtime inference of pose
estimation models (Selvaraj et al., 2022), we use pose instead of images as input. We use the
Mediapipe pose estimation pipeline7 . A similar SLT-based pose-to-Text approach was used by
Saunders et al. (2020), which proposes Progressive Transformers for End-to-End Sign Language
Production and uses SLT-based pose-to-text for validating the generated key points via back
translation (generated pose key points to text translations). Though the pose-based approaches are
faster to process, they often perform less than the image-based methods. For the choice of holistic
key points, we follow Selvaraj et al. (2022), which returns the 3D coordinates of 75 key points
(excluding the face mesh). Further, we normalize every frame's key points by placing the midpoint of the shoulder key points at the center and scaling the key points by the distance between the nose key point and the shoulder midpoint.
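A minimal sketch of this per-frame normalization (not the authors' released code; the keypoint array shape and landmark indices are assumptions for illustration):

import numpy as np

def normalize_frame(kp, nose=0, left_shoulder=11, right_shoulder=12):
    # kp: array of shape (75, 3) holding the 3D holistic key points of one frame
    shoulder_mid = (kp[left_shoulder] + kp[right_shoulder]) / 2.0
    scale = np.linalg.norm(kp[nose] - shoulder_mid) + 1e-8
    # center on the shoulder midpoint and scale by the nose-to-shoulder-midpoint distance
    return (kp - shoulder_mid) / scale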
We use standard BLEU and ROUGE scores to evaluate the
obtained English translations (model hyperparameter details in App. D). Table 4 shows the results obtained for the proposed architecture:
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L
Pose-SLT | 13.18 | 8.77 | 7.04 | 6.09 | 12.91
Table 4: Translation scores obtained for the baseline model.
The poor BLEU-4 result highlights the challenging nature of the
ISL translation task. The results motivate incorporating ISL linguistic priors into data-driven models to
develop better sign language translation systems.
5 Conclusion
We propose ISLTranslate, a dataset of 31k ISL-English pairs for ISL. We provide detailed insight into the proposed dataset and benchmark it using a sign language transformer-based ISL-pose-to-English architecture. Our experiments
highlight the poor performance of the baseline model, pointing towards a significant scope for
improvement for end-to-end Indian sign language translation systems. We hope that ISLTranslate will
create excitement in the sign language research community and have a significant societal impact. (MediaPipe Holistic: https://ai.googleblog.com/2020/12/mediapipe-holistic-simultaneous-face.html)
Limitations
This
resource paper proposes a new dataset and experiments with a baseline model only. We do not
focus on creating new models and architectures. In the future, we plan to create models that
perform much better on the ISLTranslate dataset. Moreover, the dataset has only 31K video-sentence
pairs, and we plan to extend this to enable more reliable data-driven model development. In the
future, we would also like to incorporate ISL linguistic knowledge in data-driven models. Ethical
Concerns We create a dataset from publicly available resources without violating copyright. We are
not aware of any ethical concerns regarding our dataset. Moreover, the dataset involves people of
Indian origin and is created mainly for Indian Sign Language translation. The annotator involved in
the dataset validation is a hard-of-hearing person and an ISL instructor, and they performed the
validation voluntarily. Acknowledgements We want to thank anonymous reviewers for their insightful
comments. We want to thank Dr. Andesha Mangla (https://islrtc.nic.in/ dr-andesha-mangla) for
helping in translating and validating a subset of the ISLTranslate dataset.
References
Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, and Andrew Zisserman. 2020. BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues. In ECCV.
Samuel Albanie, Gül Varol, Liliane Momeni, Hannah Bull, Triantafyllos Afouras, Himel Chowdhury, Neil Fox, Bencie Woll, Rob Cooper, Andrew McParland, and Andrew Zisserman. 2021. BOBSL: BBC-Oxford British Sign Language Dataset.
Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. 2018. Neural sign language translation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7784–7793.
Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. 2020. Sign language transformers: Joint end-to-end sign language recognition and translation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Necati Cihan Camgöz, Ben Saunders, Guillaume Rochette, Marco Giovanelli, Giacomo Inches, Robin Nachtrab-Ribback, and Richard Bowden. 2021. Content4All open research sign language translation datasets. In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), pages 1–5. IEEE Press.
Mathieu De Coster and Joni Dambre. 2022. Leveraging frozen pretrained written language models for neural sign language translation. Information, 13(5).
Mathieu De Coster, Karel D'Oosterlinck, Marija Pizurica, Paloma Rabaey, Severine Verlinden, Mieke Van Herreweghe, and Joni Dambre. 2021. Frozen pretrained transformers for neural sign language translation. In 1st International Workshop on Automated Translation for Signed and Spoken Languages.
Philippe Dreuw, David Rybach, Thomas Deselaers, Morteza Zahedi, and Hermann Ney. 2007. Speech recognition techniques for a sign language recognition system. In Proc. Interspeech 2007, pages 2513–2516.
Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, and Xavier Giro-i Nieto. 2021. How2Sign: A large-scale multimodal dataset for continuous American Sign Language. In Conference on Computer Vision and Pattern Recognition (CVPR).
R Elakkiya and B Natarajan. 2021. ISL-CSLTR: Indian sign language dataset for continuous sign language translation and recognition. Mendeley Data.
Songyao Jiang, Bin Sun, Lichen Wang, Yue Bai, Kunpeng Li, and Yun Fu. 2021. Skeleton aware multimodal sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
Abhinav Joshi, Ashwani Bhat, Pradeep S, Priya Gole, Shashwat Gupta, Shreyansh Agarwal, and Ashutosh Modi. 2022. CISLR: Corpus for Indian Sign Language recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Sang-Ki Ko, Chang Jo Kim, Hyedong Jung, and Choong Sang Cho. 2018. Neural sign language translation based on human keypoint estimation. ArXiv, abs/1811.11436.
Oscar Koller, Jens Forster, and Hermann Ney. 2015. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding, 141:108–125.
Dongxu Li, Cristian Rodriguez, Xin Yu, and Hongdong Li. 2020. Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In The IEEE Winter Conference on Applications of Computer Vision, pages 1459–1469.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Aleix M. Martinez, Ronnie B. Wilbur, Robin Shay, and Avinash C. Kak. 2002. Purdue RVL-SLLL ASL database for automatic recognition of American Sign Language. In Proceedings of the Fourth IEEE International Conference on Multimodal Interfaces, pages 167–172.
Amit Moryossef, Ioannis Tsochantaridis, Roee Aharoni, Sarah Ebling, and Srini Narayanan. 2020. Real-time sign language detection using human pose estimation. In European Conference on Computer Vision, pages 237–248. Springer.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision.
Ben Saunders, Necati Cihan Camgoz, and Richard Bowden. 2020. Progressive transformers for end-to-end sign language production. In Proceedings of the European Conference on Computer Vision (ECCV).
Prem Selvaraj, Gokul Nc, Pratyush Kumar, and Mitesh Khapra. 2022. OpenHands: Making sign language recognition accessible with pose-based pretrained models across languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2114–2133, Dublin, Ireland. Association for Computational Linguistics.
Bowen Shi, Diane Brentari, Greg Shakhnarovich, and Karen Livescu. 2022. Open-domain sign language translation learned from online video.
Ozge Mercanoglu Sincan and Hacer Yalim Keles. 2020. AUTSL: A large scale multi-modal Turkish sign language dataset and baseline methods. IEEE Access, 8:181340–181355.
Lewis Tunstall, Leandro von Werra, and Thomas Wolf. 2022. Natural Language Processing with Transformers. O'Reilly Media, Inc.
Andreas Voskou, Konstantinos P. Panousis, Dimitrios I. Kosmopoulos, Dimitris N. Metaxas, and Sotirios P. Chatzis. 2021. Stochastic transformer networks with linear competing units: Application to end-to-end SL translation. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11926–11935.
Chenchen Xu, Dongxu Li, Hongdong Li, Hanna Suominen, and Ben Swift. 2022. Automatic gloss dictionary for sign language learners. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 83–92, Dublin, Ireland. Association for Computational Linguistics.
Kayo Yin, Amit Moryossef, Julie Hochgesang, Yoav Goldberg, and Malihe Alikhani. 2021. Including signed languages in natural language processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7347–7360, Online. Association for Computational Linguistics.
Kayo Yin and Jesse Read. 2020. Better sign language translation with STMC-transformer. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5975–5989, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Hao Zhou, Wengang Zhou, Weizhen Qi, Junfu Pu, and Houqiang Li. 2021. Improving sign language translation with monolingual data by sign back-translation. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1316–1325.
Appendix A Data Splits
Data splits for ISLTranslate are shown in Table 5.
Split | Train | Validation | Test
# Pairs | 24978 (80%) | 3122 (10%) | 3122 (10%)
Table 5: The table shows the train, validation, and test split for ISLTranslate.
B ISLTranslate Word Distribution
Figure 3: Distribution of the number of samples in the train, validation, and test splits of ISLTranslate (x-axis: number of words; y-axis: number of samples).
C Annotation Details
We
asked a certified ISL instructor to translate and validate a random subset from the dataset. The
instructor is a hard-of-hearing person and uses ISL for communication; hence they are aware of the
subtleties of ISL. Moreover, the instructor is an assistant professor of sign language linguistics. The
instructor is employed with ISLRTC, the organization involved in creating the sign language content;
however, the instructor did not participate in videos in ISLTranslate. The instructor performed the
validation voluntarily. It took the instructor about 3 hours to validate 100 sentences. They generated
the English translations by looking at the video. D Hyperparameters and Training We follow the code
base of SLT (Camgoz et al., 2020) to train and develop the proposed SLT-based pose-to-text
architecture by modifying the input features to be the sign-pose sequences generated by MediaPipe. The model architecture is a transformer-based encoder-decoder consisting of 3 transformer layers each for the encoder and the decoder. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001, β = (0.9, 0.999), and a weight decay of 0.0001 for training the proposed baseline with a batch size of 32.
Figure 4: The figure shows an example of an educational content video where the signer signs for the corresponding textbook.
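For concreteness, the sketch below shows one way such a pose-to-text baseline could be wired up in PyTorch with the hyperparameters stated above (3 encoder and 3 decoder layers, Adam with lr = 0.0001, β = (0.9, 0.999), weight decay = 0.0001, batch size 32). This is not the authors' released code: the pose feature size, vocabulary size, model width, and the omission of positional encodings are simplifying assumptions made only for illustration.

```python
# Minimal sketch of a pose-to-text transformer baseline (illustrative, not the SLT codebase).
import torch
import torch.nn as nn

POSE_DIM = 543 * 3   # assumption: MediaPipe Holistic landmarks flattened to (x, y, z)
VOCAB_SIZE = 8000    # assumption: size of the target (English) vocabulary
D_MODEL = 512        # assumption: model width

class PoseToTextTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.pose_proj = nn.Linear(POSE_DIM, D_MODEL)       # embed pose frames
        self.tok_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)  # embed target tokens
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=3, num_decoder_layers=3,      # 3 layers each, as in Appendix D
            batch_first=True,
        )
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)
        # Positional encodings are omitted here for brevity.

    def forward(self, pose_seq, tgt_tokens):
        # pose_seq: (batch, frames, POSE_DIM); tgt_tokens: (batch, tgt_len)
        memory_in = self.pose_proj(pose_seq)
        tgt = self.tok_embed(tgt_tokens)
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        hidden = self.transformer(memory_in, tgt, tgt_mask=tgt_mask)
        return self.out(hidden)                              # (batch, tgt_len, VOCAB_SIZE)

model = PoseToTextTransformer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)

# Dummy forward pass with a toy batch of 2 clips (150 frames) and 20 target tokens.
logits = model(torch.randn(2, 150, POSE_DIM), torch.randint(0, VOCAB_SIZE, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 8000])
```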

A Comparative Analysis of Techniques and Algorithms for Recognising Sign Language Rupesh Kumar
Department of CSE Galgotias College of Engineering and Technology, AKTU Greater Noida, India
[email protected] S.K Singh Professor Department of CSE Galgotias College of Engineering
and Technology, AKTU Greater Noida, India [email protected] Ashutosh Bajpai
Department of CSE Galgotias College of Engineering and Technology, AKTU Greater Noida, India
[email protected] Ayush Sinha Department of CSE Galgotias College of Engineering and
Technology, Greater Noida, India [email protected] Abstract—Sign language is a
visual language that enhances communication between people and is frequently used as the primary
form of communication by people with hearing loss. Even so, not many people with hearing loss use
sign language, and they frequently experience social isolation. Therefore, it is necessary to create
human-computer interface systems that can offer hearing-impaired people a social platform. Most
commercial sign language translation systems now on the market are sensor-based, pricey, and
challenging to use. Although vision-based systems are desperately needed, they must first overcome
several challenges. Earlier continuous sign language recognition techniques used hidden Markov
models, which have a limited ability to include temporal information. To get over these restrictions,
several machine learning approaches are being applied to transform hand and sign language motions
into spoken or written language. In this study, we compare various deep learning techniques for
recognising sign language. Our survey aims to provide a comprehensive overview of the most recent
approaches and challenges in this field. Keywords— sign language recognition, deep learning,
convolutional neural network, vision-based systems, continuous sign language recognition. I.
INTRODUCTION The study of sign language is a significant and intriguing field of study that has
effects on both the deaf population and society at large. Researchers are interested in studying sign
language for several reasons. Firstly, a better understanding of sign language can lead to better
communication between the deaf community and the general public. Secondly, researching sign
languages can provide insight into the nature of language itself and the interaction between
language and the brain. Lastly, a better understanding of sign language can lead to the development
of more effective sign language instruction techniques for both deaf and hearing people. Sign
languages are used in many countries, and each has its own distinctive vocabulary and grammar,
such as British Sign Language (BSL) in the UK and American Sign Language (ASL) in the US. However,
they all share commonalities in the way that meaning is expressed through hand and body gestures.
Sign languages have their own set of phonological, morphological, and syntactic principles, and they
are not just visual representations of spoken languages. They are independent languages with their
own syntax, vocabulary, and grammar, and their many gestures, facial expressions, and body
positions convey different meanings. The significance of sign language is highlighted by the World
Health Organization's estimate that there are more than 70 million deaf or hearing-impaired persons
worldwide. Many nations, including the United States, Canada, and Australia, recognize sign
language as an official language, underscoring the importance of recognizing, researching, and
preserving sign language as a unique and independent language. The use of sign language can help
reduce linguistic barriers, promote inclusion, and improve accessibility for deaf and hearing-impaired
people in various contexts. In conclusion, sign language is a vital tool for the global deaf community.
Understanding and acknowledging sign language as a unique and independent language is crucial to
promoting accessibility and inclusivity. Sign language research can shed light on language and the
brain, and effective sign language instruction techniques can benefit both deaf and hearing people.
This study attempts to analyse recent developments in deep learning-based sign language
recognition (SLR), including methods and prospective applications. The authors reviewed deep
learning architectures and algorithms for SLR and evaluated their performance, highlighting its
importance for the deaf community. The study provides valuable information about the latest
developments in SLR and its potential applications, serving as a guide for researchers and
practitioners in the field. II. RELATED WORK In this section, we review relevant research papers on sign
language recognition techniques, classifying them into glove-based and vision-based approaches.
The glove-based techniques involve wearable devices like data gloves to capture hand movements
and gestures, while vision-based techniques utilize computer vision algorithms to analyse visual cues.
By examining these studies, we aim to provide a comprehensive overview of the existing techniques
and their implications for advancing sign language recognition technology. A. Glove-based approach
In their paper, Khomami et al. [21] developed wearable hardware using surface electromyography
(sEMG) and Inertial Measurement Unit (IMU) sensors for Persian Sign Language (PSL) recognition.
The system's ability to accurately capture signs was increased by fusing these two sensors. By
extracting and classifying the 25 highest-ranked features using the KNN classifier, they achieved an
average accuracy of 96.13%. Their affordable and user-friendly hardware, consisting of Arduino Due,
extended EMG shields, and MPU-6050, shows promise for practical PSL communication.
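To make the feature-ranking-plus-KNN pipeline just described more concrete, the minimal scikit-learn sketch below ranks features and classifies signs with a K-Nearest Neighbours classifier; the synthetic feature matrix, the number of sign classes, and the choice of k are placeholders for illustration and are not taken from Khomami et al.

```python
# Illustrative sketch of feature selection + KNN classification over sEMG/IMU features.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# X: one row per gesture window of hand-crafted sEMG + IMU features (synthetic placeholder).
X = np.random.rand(1000, 60)
y = np.random.randint(0, 20, size=1000)   # 20 hypothetical sign classes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(
    SelectKBest(score_func=f_classif, k=25),  # keep the 25 highest-ranked features
    KNeighborsClassifier(n_neighbors=5),      # k=5 is an assumed value, not from the paper
)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```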
A thorough analysis of wearable sensor-based systems for SLR was carried out by Kudrinko et al. [22]. The analysis covered 72 studies published between 1991 and 2019 and examined factors such as sensor set-up, recognition model, lexicon size, and recognition accuracy. The review identified gaps in the field, including problems with sign boundary detection, scalability to larger lexicons, and model convergence. The study's findings may help in the creation of wearable sensor-based sign language recognition technology; to make these devices practical, user comfort, wireless transmission, and Deaf users' input must all be taken into account. A unique multimodal framework for sensor-based
SLR combining Microsoft Kinect and Leap motion sensors was proposed by Kumar et al. [23]. Their
system extracts elements for recognition by capturing finger and palm postures from different
viewpoints. Separate classifiers using the Hidden Markov Model (HMM) and the Bidirectional Long
Short-Term Memory Neural Network (BLSTM-NN) were utilised, and their outcomes were integrated
to increase accuracy. Testing on a dataset of 7500 ISL gestures showed that fusing data from both
sensors outperformed single sensor-based recognition, with accuracies of 97.85% and 94.55%
achieved for single and double-handed signs, respectively. The study emphasizes the robustness and
potential of the proposed multimodal framework for SLR systems.
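The idea of combining the two classifiers' outputs can be illustrated with a small late-fusion sketch; averaging per-class probabilities, as below, is an assumed combination rule used only for illustration and not necessarily the exact fusion strategy of the HMM and BLSTM-NN scores used in the paper.

```python
# Toy late-fusion sketch: average two classifiers' class probabilities and take the argmax.
import numpy as np

def fuse_predictions(probs_a: np.ndarray, probs_b: np.ndarray) -> np.ndarray:
    """Average two (num_samples, num_classes) probability matrices and pick the best class."""
    fused = (probs_a + probs_b) / 2.0
    return fused.argmax(axis=1)

# Example: 3 test gestures, 4 sign classes (values are made up).
probs_hmm = np.array([[0.70, 0.10, 0.10, 0.10],
                      [0.20, 0.50, 0.20, 0.10],
                      [0.25, 0.25, 0.25, 0.25]])
probs_blstm = np.array([[0.60, 0.20, 0.10, 0.10],
                        [0.10, 0.20, 0.60, 0.10],
                        [0.10, 0.10, 0.70, 0.10]])
print(fuse_predictions(probs_hmm, probs_blstm))  # fused class index per gesture
```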
Brazilian Sign Language feature extraction using RGB-D sensors was proposed by Moreira Almeida et al. [24]. From RGB-D images, they derived seven vision-based features and connected them to structural aspects based on hand movement, shape, and location. Support vector machines (SVM) were used to classify signs, and the
average accuracy was above 80%. By characterising each sign language's unique phonological
structure, the concept may be applied to other sign language systems and shows potential for SLR
systems. RGB-D sensors have the potential to improve image processing algorithms and recognise
hand gestures. Amin et al. [25] conducted a comparative review on the applications of different
sensors for SLR. Their review focused on various techniques and sensors used for SLR, with an
emphasis on sensor-based smart gloves for capturing hand movements. The authors analyzed
existing systems, categorized authors based on their work, and discussed trends and deficiencies in
SLR. The comparative analysis provides valuable insights for researchers and offers guidance for
developing translation systems for different sign languages. Additionally, the review emphasizes the
potential of generated datasets from these sensors for gesture recognition tasks. In their study,
Rajalakshmi et al. [26] proposed a hybrid deep neural net methodology for recognizing Indian and
Russian sign gestures. The system used a variety of methods to extract multi-semantic features like
non-manual components and manual co-articulations, such as 3D deep neural net with atrous
convolutions, attention-based BiLSTM, and modified autoencoders. The hDNN-SLR achieved
accuracies of 98.75%, 98.02%, 97.94%, and 97.54% for the respective WLASL datasets, surpassing
other baseline architectures. B. Vision-based approach In a study by Matyáš Boháček et al. [1], a transformer-based neural network was employed for word-level SLR. The authors achieved an
accuracy of 63.18% on the WLASL dataset and 100% on the LSA64 dataset, focusing on pose-based
recognition using transformers. Another study by Atyaf Hekmat Mohammed Ali et al. [2] developed a
real-time SLR system using a Convolutional Neural Network (CNN) with the SqueezeNet module for
feature extraction. They achieved 100% accuracy in off-time testing and 97.5% accuracy in real-time
using the ASL dataset. This approach emphasized real-time recognition and feature extraction using
SqueezeNet. A forthcoming paper by Rajalakshmi E et al. [3] proposed a hybrid approach combining
transformer-based neural networks and CNNs for continuous SLR and translation. Although accuracy
results are not yet reported, this approach stands out for its use of a hybrid neural network and
translation capabilities, leveraging the ISLW dataset and the Phoenix14T Weather dataset. In a study
by Aashir Hafeez et al. [4], a SLR system was developed using various deep learning techniques,
including Artificial Neural Network (ANN), K-Nearest Neighbor (KNN), SVM, and CNN. Accuracy rates
of 41.95%, 60.79%, 84.18%, and 88.38% were achieved using the ASL dataset, focusing on improving
accuracy rates using multiple deep learning techniques. Katerina Papadimitriou et al. [5] proposed an
SLR system using modulated graph convolutional networks. While no accuracy results were reported,
this approach stood out for its utilization of 3D convolutions and modulated graph convolutional
networks. Kaushal Goyal et al. [6] developed a SLR system using LSTM and CNN, achieving accuracy
rates of 85% and 97% using the ISL dataset. This approach distinguished itself using LSTM and CNN
for SLR. Soumen Das et al. [8] developed an occlusion-robust SLR system using keyframe + VGG19 +
BiLSTM, achieving an accuracy of 94.654% using the ISL Publicly Available dataset. This approach
focused on occlusion-robust recognition and utilized keyframes. C J Sruthi et al. [9] developed a deep
learning-based SLR system using CNN, achieving an accuracy of 98.6% with the ISL dataset. This
approach stood out for its focus on the Indian sign language and the high accuracy achieved using a
CNN-based approach. This system holds potential in enhancing accessibility and communication for
the deaf community in India.
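As a concrete illustration of the CNN-based recognisers surveyed above, the following minimal PyTorch sketch classifies cropped sign images; the input resolution, channel widths, and number of classes are illustrative assumptions and do not correspond to any specific cited system.

```python
# Minimal CNN classifier sketch for static sign images (illustrative only).
import torch
import torch.nn as nn

class SignCNN(nn.Module):
    def __init__(self, num_classes: int = 26):  # e.g. fingerspelling letters (assumption)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):                        # x: (batch, 3, 64, 64)
        return self.classifier(self.features(x))

logits = SignCNN()(torch.randn(4, 3, 64, 64))    # -> (4, 26) class scores
print(logits.shape)
```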
III. SIGN LANGUAGE RECOGNITION TECHNIQUES ANALYSIS
SLR research encompasses various techniques and methodologies to improve accuracy and performance in
recognizing sign gestures. One approach is the utilization of Transformer-based Neural Networks, as
demonstrated by Matyáš Boháček et al. [1]. Their work involved employing a Transformer-based
Neural Network for SLR and achieved an accuracy of 63.18% on the WLASL dataset and a perfect
accuracy of 100% on the LSA64 dataset. Similarly, Rajalakshmi E et al. [3] incorporated a Transformer-
based Neural Network along with CNN in their SLR framework. While the specific accuracy is not
mentioned, they used the Word-level ISL (ISLW) dataset and Phoenix14T Weather dataset for
evaluation. Convolutional Neural Networks (CNN) have proven to be effective in SLR. Atyaf Hekmat
Mohammed Ali et al. [2] employed CNN with the SqueezeNet module for feature extraction in their
research. Their model achieved an outstanding accuracy of 100% in off-time testing and an
impressive accuracy of 97.5% in real-time testing on the ASL dataset. Similarly, C J, Sruthi et al. [9]
leveraged CNN and obtained a remarkable accuracy of 98.6% on the ISL dataset. Another study by K.
Nimisha et al. [14] involved the use of YOLO, PCA, SVM, ANN, and CNN for SLR, achieving an accuracy
of 90% on the ASL dataset. Zhou, H., Zhou et al. [15] proposed a multi-cue framework for SLR and
translation, employing CNN as part of their approach, and achieved an accuracy of 95.9%. Long
Short-Term Memory (LSTM) networks have also been applied in SLR research. Kaushal Goyal et al. [6]
utilized LSTM and CNN in their framework. The LSTM network achieved an accuracy of 85%, while
the CNN achieved an accuracy of 97% on the ISL dataset. Sensor-based approaches have shown
promise in SLR. Khomami et al. [21] developed a wearable system using sEMG and IMU sensors,
combined with a KNN classifier, for Persian Sign Language (PSL) recognition. Their research achieved
an accuracy of 95.03% (SD: 0.76%) and an improved accuracy of 96.13% (SD: 0.46%) on the PSL
dataset. Additionally, Kumar et al. [23] incorporated Microsoft Kinect and Leap motion sensors to
capture hand movements and employed an HMM and BLSTM classifier. On the ISL dataset, they
obtained 97.85% accuracy for single-handed signs and 94.55% accuracy for double-handed signs.
Other techniques in SLR include hybrid approaches and unique methodologies. Aloysius, N. et al. [10]
utilized a hybrid approach combining PCNN and GM, focusing on frame-label alignment techniques.
Their specific accuracy is not mentioned. Rajalakshmi et al. [26] proposed a 3D deep neural net with
atrous convolutions and Attention-based Bi-LSTM for Indo-Russian Sign Language recognition. Their
research yielded impressive accuracies of 98.75%, 98.02%, 97.94%, and 97.54% on the WLASL100,
WLASL300, WLASL1000, and WLASL2000 datasets, respectively. These diverse techniques and
methodologies reflect the ongoing efforts in the field of SLR to improve the accuracy and
performance of sign gesture recognition systems. Each approach brings unique contributions and
advancements, contributing to the overall progress of SLR research.
TABLE I: ANALYSIS OF SIGN LANGUAGE DETECTION TECHNIQUE
Paper | Techniques/Observations | Year | Accuracy Achieved | Dataset
Matyáš Boháček et al. [1] | Transformer-based Neural Network | 2022 | WLASL: 63.18%, LSA64: 100% | WLASL, LSA64
Atyaf Hekmat Mohammed Ali et al. [2] | CNN with SqueezeNet module for feature extraction | 2022 | 100% in off-time testing, 97.5% in real-time | ASL (a)
Rajalakshmi E et al. [3] | Transformer-based Neural Network, CNN | 2023 | - | Word-level ISL (ISLW) dataset, Phoenix14T Weather dataset
Aashir Hafeez et al. [4] | ANN, KNN, SVM, CNN | 2023 | 41.95%, 60.79%, 84.18%, 88.38% | ASL (a)
Katerina Papadimitriou et al. [5] | 3D-CNN model | 2023 | - | AUTSL (c), ITI GSL (d)
Kaushal Goyal et al. [6] | LSTM, CNN | 2023 | 85%, 97% | ISL (b)
Gero Strobel et al. [7] | TNN | 2023 | - | Video dataset
Soumen Das et al. [8] | Keyframe + VGG-19 + BiLSTM | 2023 | 94.65% | ISL (b), publicly available
C J, Sruthi et al. [9] | CNN | 2019 | 98.6% | ISL (b)
Aloysius, N. et al. [10] | Hybrid PCNN and GM; frame-label alignment technique of deep learning | 2020 | - | -
K. Amrutha et al. [11] | Convex hull feature extraction, KNN | 2021 | 65% | ASL (a), ISL (b)
R. Cui, H. Liu et al. [12] | CNN followed by temporal convolutional and pooling layers | 2019 | 91.93% | RWTH-PHOENIX-Weather 2014 database
Papastratis, I. et al. [13] | 2D-CNN with weights learned from ImageNet dataset | 2021 | - | RWTH-PHOENIX-Weather 2014, CSL (f), GSL (d)
K. Nimisha et al. [14] | YOLO, PCA, SVM, ANN and CNN | 2021 | 90% | ASL (a)
Zhou, H., Zhou et al. [15] | Multi-cue framework for SLR and translation; STMC framework | 2021 | 95.9% | -
Shirbhate, Radha S. et al. [16] | KNN, SVM for feature extraction | 2020 | 100% | ISL (b)
M. Xie et al. [17] | RNN | 2019 | 99.4% | ISL (b)
Selvaraj, P. et al. [18] | ST-GCN | 2022 | 94.7% | ISL (b)
Kumar, D. A. et al. [19] | Adaptive graph matching; intra-GM to extract signs alone, discarding ME | 2018 | 98.32% | ISL (b)
Korban, Matthew et al. [20] | HMM and hybrid KNN-DTW algorithm | 2018 | 92.4% | PSL (e)
Khomami et al. [21] | sEMG and IMU sensors, KNN classifier | 2020 | 95.03% (SD: 0.76%), 96.13% (SD: 0.46%) | PSL (e)
Kudrinko et al. [22] | Wearable sensor-based systems | 2020 | - | ASL (a), ISL (b)
Kumar et al. [23] | Microsoft Kinect and Leap Motion sensors to record hand movement, HMM and BLSTM classifier | 2016 | 97.85% for single-handed signs, 94.55% for double-handed signs | ISL (b)
Moreira Almeida et al. [24] | RGB-D sensor, SVM classifier | 2014 | 80% | Brazilian Sign Language
Amin et al. [25] | Sensor-based smart gloves | 2022 | - | ASL (a)
Rajalakshmi et al. [26] | 3D deep neural net with atrous convolutions and Attention-based Bi-LSTM | 2023 | 98.75% on WLASL100, 98.02% on WLASL300, 97.94% on WLASL1000, 97.54% on WLASL2000 | Indo-Russian Sign Language Dataset
a. American Sign Language; b. Indian Sign Language; c. Ankara University Turkish Sign Language Dataset; d. German Sign Language; e. Persian Sign Language; f. Chinese Sign Language.
Based on extensive inquiry and analysis, the authors identify several prospective routes for beginners, academics, and researchers to further this work. Areas that need improvement include:
Tiny sample sizes: Some researchers have trained and evaluated their models using tiny datasets, which may not fully reflect the variety of sign languages and signers.
Cultural bias: Several studies have concentrated on identifying sign languages in particular cultural contexts, and their findings may not transfer to other cultures or nations.
Measurement biases: The use of certain measurement techniques, which may not be completely reliable or valid, may have affected the findings of some studies.
Lack of control groups: Some studies have not included a control group in their analyses, which makes it challenging to assess whether a model's success is primarily attributable to the algorithm or to other variables.
Limited scope: Studies that are restricted to a few sign languages or simple signs may not be relevant to other sign languages or more complicated signs.
IV. FUTURE SCOPE
Accessibility: SLR models can help the
community become more accessible by allowing the deaf population to communicate more effectively with hearing individuals.
Education: Sign language recognition models may be used to create instructional materials and software that will make it easier for people to learn sign language.
Healthcare: To facilitate communication between patients who use sign language and healthcare professionals, SLR models can be employed in healthcare settings.
Entertainment: SLR models may be used to provide more accessible entertainment materials, such as sign language interpretation in motion pictures and television programmes.
Interaction with Autonomous Systems: Models for the recognition of sign language can be used to develop robots and other autonomous systems that can interact with people who communicate primarily through sign language.
Accessibility in public areas: To make public areas like railway stations, airports, and other transportation hubs more accessible for those who use sign language, sign language recognition models can be utilised.
Services for sign language interpretation: By using SLR models, it is possible to provide more precise and effective sign language interpretation services while also lowering the demand for human interpreters.
V. CONCLUSION
SLR is a significant research area with implications for communication, accessibility,
and inclusion for the deaf community. The analysis of recent developments in SLR techniques
highlights a range of approaches, including glove-based and vision-based methods. Glove-based
approaches utilize wearable devices to capture hand movements and gestures, while vision-based
approaches leverage computer vision algorithms to analyze visual cues. Both approaches have shown
promising results in recognizing sign gestures. In the glove-based approach, wearable sensor-based
systems using sensors such as sEMG and IMUs have been developed. These systems can accurately
capture hand movements and gestures, leading to high recognition accuracies. However, challenges
still exist in terms of sign boundary detection and scalability to larger sign lexicons. Future research in
this area should focus on addressing these challenges and ensuring user comfort and acceptance.
Vision-based approaches, particularly those using deep learning techniques, have also achieved
significant progress in sign language recognition. CNNs have been widely used for feature extraction
and classification of sign gestures, demonstrating high accuracies. Other deep learning techniques
like LSTM networks and Transformer-based Neural Networks have also been explored, showing
promising results in capturing the sequential nature of sign gestures. Sensor-based approaches
utilizing RGB-D sensors, Microsoft Kinect, or Leap motion sensors have also contributed to sign
language recognition. These technologies provide valuable information about hand movements and
enable accurate feature extraction. However, they may come with additional hardware requirements
and cost considerations. Hybrid approaches that combine different methodologies, such as deep
neural networks, pulse-coupled neural networks (PCNN), graph matching (GM), and frame-label
alignment techniques, have also shown improved performance in recognizing sign gestures. These
hybrid approaches leverage the strengths of different techniques and provide a more comprehensive
solution. Overall, the field of SLR is advancing rapidly, driven by advancements in machine learning,
computer vision, and wearable technologies. Future research should focus on addressing the
challenges in sign boundary detection, scalability, user comfort, and acceptance. Additionally, efforts
should be made to include the perspectives and input of the deaf community in the development of
SLR systems to ensure their effectiveness and suitability for real-world applications. By developing
accurate and robust SLR systems, we can enhance communication accessibility for the deaf community, promote inclusivity, and facilitate their participation in various domains of life. VI.
ACKNOWLEDGMENT As we conclude this research, we gratefully acknowledge the invaluable support
and contributions of numerous individuals and organizations. We extend our sincere appreciation to
all those who have been part of this journey. We wish to express our deepest gratitude to Dr. S.K
Singh, our esteemed mentor, for his unwavering guidance, steadfast support, and constant
motivation throughout this endeavour. His wealth of expertise and experience were instrumental in
helping us achieve our objectives and surmount any obstacles that arose. Dr. S.K Singh has been an
invaluable member of our team, and we are profoundly grateful for his outstanding contributions.
REFERENCES
[1] M. Boháček and M. Hrúz, "Sign Pose-based Transformer for Word-level Sign Language Recognition," in 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 4-8 January 2022. DOI: 10.1109/WACVW54805.2022.00024.
[2] Mohammedali, A. H., Abbas, H. H., & Shahadi, H. I. (2022). Real-time sign language recognition system. International Journal of Health Sciences, 6(S4), 10384–10407. https://doi.org/10.53730/ijhs.v6nS4.12206.
[3] E. Rajalakshmi, R. Elakkiya, V. Subramaniyaswamy, K. Kotecha, M. Mehta, and V. Palade, "Continuous Sign Language Recognition and Translation Using Hybrid Transformer-Based Neural Network," available at SSRN: https://ssrn.com/abstract=4424708 or http://dx.doi.org/10.2139/ssrn.4424708.
[4] A. Hafeez, S. Singh, U. Singh, P. Agarwal, and A. K. Jayswal, "Sign Language Recognition System Using Deep-Learning for Deaf and Dumb," International Research Journal of Modernization in Engineering Technology and Science, vol. 5, no. 4, Apr. 2023. DOI: 10.56726/IRJMETS36063.
[5] K. Papadimitriou and G. Potamianos, "Sign Language Recognition via Deformable 3D Convolutions and Modulated Graph Convolutional Networks," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023.
[6] Dr. V. G and K. Goyal, "Indian Sign Language Recognition Using Mediapipe Holistic." arXiv, 2023. doi: 10.48550/ARXIV.2304.10256.
[7] Strobel, G., Schoormann, T., Banh, L., & Möller, F. (in press). Artificial Intelligence for Sign Language Translation – A Design Science Research Study. Communications of the Association for Information Systems, 52. Retrieved from https://aisel.aisnet.org/cais/vol52/iss1/33
[8] S. Das, S. Kr. Biswas, and B. Purkayastha, "Occlusion Robust Sign Language Recognition System for Indian Sign Language Using CNN and Pose Features." Research Square Platform LLC, Apr. 20, 2023. doi: 10.21203/rs.3.rs-2801772/v1.
[9] S. C J and L. A, "Signet: A Deep Learning based Indian Sign Language Recognition System," 2019 International Conference on Communication and Signal Processing (ICCSP). IEEE, Apr. 2019. doi: 10.1109/iccsp.2019.8698006.
[10] Aloysius, N., Geetha, M. Understanding vision-based continuous sign language recognition. Multimedia Tools and Applications, 79, 22177–22209 (2020). https://doi.org/10.1007/s11042-020-08961-z
[11] K. Amrutha and P. Prabu, "ML Based Sign Language Recognition System," 2021 International Conference on Innovative Trends in Information Technology (ICITIIT), 2021, pp. 1-6.
[12] R. Cui, H. Liu and C. Zhang, "A Deep Neural Framework for Continuous Sign Language Recognition by Iterative Training," IEEE Transactions on Multimedia, vol. 21, no. 7, pp. 1880-1891, July 2019.
[13] Papastratis, I., Dimitropoulos, K., Daras, P. Continuous Sign Language Recognition through a Context-Aware Generative Adversarial Network. Sensors 2021, 21, 2437.
[14] K. Nimisha and A. Jacob, "A Brief Review of the Recent Trends in Sign Language Recognition," 2020 International Conference on Communication and Signal Processing (ICCSP), 2020, pp. 186-190.
[15] Zhou, H., Zhou, W., Zhou, Y., & Li, H. (2021). Spatial-Temporal Multi-Cue Network for Sign Language Recognition and Translation. IEEE Transactions on Multimedia.
[16] Shirbhate, Radha S., Vedant D. Shinde, Sanam A. Metkari, Pooja U. Borkar and Mayuri A. Khandge. "Sign Language Recognition Using Machine Learning Algorithm." (2020).
[17] M. Xie and X. Ma, "End-to-End Residual Neural Network with Data Augmentation for Sign Language Recognition," 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), 2019, pp. 1629-1633.
[18] Selvaraj, P., Gokul N. C., Kumar, P., & Khapra, M. M. (2022). OpenHands: Making Sign Language Recognition Accessible with Pose-based Pretrained Models across Languages. ArXiv, abs/2110.05877.
[19] Kumar, D. A., Sastry, A. S. C. S., Kishore, P. V. V., & Kumar, E. K. (2018). Indian sign language recognition using graph matching on 3D motion captured signs. Multimedia Tools and Applications.
[20] Korban, Matthew & Nahvi, Manoochehr. (2018). An algorithm on sign words extraction and recognition of continuous Persian sign language based on motion and shape features of hands. Pattern Analysis and Applications, 21. doi: 10.1007/s10044-016-0579-2.
[21] Khomami, S. A., & Shamekhi, S. (2021). Persian sign language recognition using IMU and surface EMG sensors. Measurement, vol. 168, p. 108471. Elsevier BV. https://doi.org/10.1016/j.measurement.2020.108471
[22] K. Kudrinko, E. Flavin, X. Zhu, and Q. Li, "Wearable Sensor-Based Sign Language Recognition: A Comprehensive Review," IEEE Reviews in Biomedical Engineering, vol. 14, pp. 82–97, 2021. doi: 10.1109/rbme.2020.3019769.
[23] P. Kumar, H. Gauba, P. Pratim Roy, and D. Prosad Dogra, "A multimodal framework for sensor based sign language recognition," Neurocomputing, vol. 259, pp. 21–38, Oct. 2017. Elsevier BV. doi: 10.1016/j.neucom.2016.08.132
[24] S. G. Moreira Almeida, F. G. Guimarães, and J. Arturo Ramírez, "Feature extraction in Brazilian Sign Language Recognition based on phonological structure and using RGB-D sensors," Expert Systems with Applications, vol. 41, no. 16, pp. 7259–7271, Nov. 2014. doi: 10.1016/j.eswa.2014.05.024
[25] M. S. Amin, S. T. H. Rizvi, and Md. M. Hossain, "A Comparative Review on Applications of Different Sensors for Sign Language Recognition," Journal of Imaging, vol. 8, no. 4, p. 98, Apr. 2022. doi: 10.3390/jimaging8040098.
[26] E. Rajalakshmi et al., "Multi-Semantic Discriminative Feature Learning for Sign Gesture Recognition Using Hybrid Deep Neural Architecture," IEEE Access, vol. 11, pp. 2226–2238, 2023. doi: 10.1109/access.2022.3233671.
