SLR Paper
Table 1: Overview of word-level datasets in other languages.

Datasets | #Gloss | #Videos | #Signers | Type | Sign Language
LSA64 [51] | 64 | 3,200 | 10 | RGB | Argentinian
PSL Kinect 30 [34] | 30 | 300 | - | RGB, depth | Polish
PSL ToF [34] | 84 | 1,680 | - | RGB, depth | Polish
DEVISIGN [15] | 2,000 | 24,000 | 8 | RGB, depth | Chinese
GSL [24] | 20 | 840 | 6 | RGB | Greek
DGS Kinect [3] | 40 | 3,000 | 15 | RGB, depth | German
LSE-sign [27] | 2,400 | 2,400 | 2 | RGB | Spanish

2,742 words (i.e., glosses) with 9,794 examples (3.6 examples per gloss on average). Although the dataset has large coverage of the vocabulary, more than 2,000 glosses have at most three examples, which is unsuitable for training thousand-way classifiers. RWTH-BOSTON-50 [78] contains 483 samples of 50 different glosses performed by 2 signers. Moreover, RWTH-BOSTON-104 provides 200 continuous sentences signed by 3 signers, which in total cover 104 signs/words. RWTH-BOSTON-400, as a sentence-level corpus, consists of 843 sentences including around 400 signs, and those sentences are performed by 5 signers. DEVISIGN is a large-scale word-level Chinese Sign Language dataset consisting of 2,000 words and 24,000 examples performed by 8 non-native signers in a controlled lab environment. Word-level sign language datasets also exist for other languages, as summarized in Table 1.

All the previously mentioned datasets have their own properties and represent different attempts to tackle the word-level sign recognition task. However, they fail to capture the difficulty of the task due to insufficient numbers of instances and signers. To overcome the above issues in sign recognition, we propose a large-scale word-level ASL dataset, coined the WLASL database. Since our dataset consists of RGB-only videos, algorithms trained on our dataset can be easily applied to real-world cases with minimal equipment requirements. Moreover, we provide a set of baselines using state-of-the-art methods for sign recognition to facilitate the evaluation of future works.

2.2. Sign Language Recognition Approaches

Existing word-level sign recognition models are mainly trained and evaluated on either private [26, 38, 77, 28, 48] or small-scale datasets with fewer than one hundred words [?, 38, 77, 28, 48, 42, 46, 70]. These sign recognition approaches mainly consist of three steps: feature extraction, temporal-dependency modeling and classification. Previous works first employ different hand-crafted features to represent static hand poses, such as SIFT-based features [71, 74, 63], HOG-based features [43, 8, 20] and features in the frequency domain [4, 7]. Hidden Markov Models (HMMs) [60, 59] are then employed to model the temporal relationships in video sequences. Dynamic Time Warping (DTW) [41] is also exploited to handle differences in sequence lengths and frame rates. Classification algorithms, such as Support Vector Machines (SVMs) [47], are used to label the signs with the corresponding words.

Similar to action recognition, some recent works [55, 35] use CNNs to extract holistic features from image frames and then use the extracted features for classification. Several approaches [37, 36] first extract body keypoints and then concatenate their locations as a feature vector. The extracted features are then fed into a stacked GRU for recognizing signs. These methods demonstrate the effectiveness of using human poses in the word-level sign recognition task. Instead of encoding the spatial and temporal information separately, recent works also employ 3D CNNs [28, 75] to capture spatial-temporal features together. However, these methods are only tested on small-scale datasets. Thus, the generalization ability of those methods remains unknown. Moreover, due to the lack of a standard word-level large-scale sign language dataset, the results of different methods evaluated on different small-scale datasets are not comparable and might not reflect the practical usefulness of models.
Table 2: Comparisons of our WLASL dataset with existing ASL datasets. Column "Mean" indicates the average number of video samples per gloss.

Datasets | #Gloss | #Videos | Mean | #Signers | Year
Purdue RVL-SLLL [69] | 39 | 546 | 14 | 14 | 2006
RWTH-BOSTON-50 [78] | 50 | 483 | 9.7 | 3 | 2005
Boston ASLLVD [6] | 2,742 | 9,794 | 3.6 | 6 | 2008
WLASL100 | 100 | 2,038 | 20.4 | 97 | 2019
WLASL300 | 300 | 5,117 | 17.1 | 109 | 2019
WLASL1000 | 1,000 | 13,168 | 13.2 | 116 | 2019
WLASL2000 | 2,000 | 21,083 | 10.5 | 119 | 2019

3. Our Proposed WLASL Dataset

In this section, we introduce our proposed Word-Level American Sign Language dataset (WLASL). We first explain the data sources and the data collection process, followed by a description of our annotation process, which combines automatic detection procedures with manual annotation to ensure the correctness of the correspondence between signs and their annotations. Finally, we provide statistics of WLASL.

3.1. Dataset Collection

In order to construct a large-scale signer-independent ASL dataset, we resort to two main sources on the Internet. First, there are multiple educational sign language websites, such as ASLU [2] and ASL-LEX [14], which provide a lookup function for ASL signs. The mappings between glosses and signs from those websites are accurate since the videos have been checked by experts before being uploaded. The other main source is ASL tutorial videos on YouTube. We select videos whose titles clearly describe the gloss of the sign. In total, we access 68,129 videos of 20,863 ASL glosses from 20 different websites. In each video, a signer performs only one sign (possibly with multiple repetitions) in a nearly frontal view against varying backgrounds.

After collecting all the resources for the dataset, we remove videos whose gloss annotations are composed of more than two words in English, to ensure that the dataset contains words only. If the number of videos for a gloss is less than seven, we also remove that gloss to guarantee that enough samples can be split into the training and testing sets. Since most of the websites cover daily used words, a small number of video samples for a gloss may imply that the word is not frequently used. Therefore, removing glosses with few video samples will not affect the usefulness of our dataset in practice. After this preliminary selection procedure, we have 34,404 video samples of 3,126 glosses for further annotation.

3.2. Annotations

In addition to providing a gloss label for each video, some meta information, including the temporal boundary, body bounding box, signer annotation and sign dialect/variation annotation, is also given in our dataset.

Temporal boundary: A temporal boundary is used to indicate the start and end frames of a sign. When a video does not contain repetitions of a sign, the boundaries are labelled as the first and last frames of the sign. Otherwise, we manually label the boundaries between the repetitions. For videos containing repetitions, we only keep one sample of the repeated sign, to ensure that samples in which the same signer performs the same sign will not appear in both the training and testing sets. Thus, we prevent learned models from overfitting to the testing set.

Body Bounding-box: In order to reduce side-effects caused by backgrounds and let models focus on the signers, we use YOLOv3 [50] as a person detection tool to identify the body bounding-boxes of signers in videos. Note that, since the size of the bounding-box changes as a person signs, we use the largest bounding-box to crop the person from the video.
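To make this cropping step concrete, the minimal sketch below assumes per-frame person boxes have already been obtained from a detector such as YOLOv3; it fixes a single crop window that encloses every per-frame detection so that the signer stays inside the crop for the whole video. The function and argument names are hypothetical and not part of any released WLASL tooling.

```python
import numpy as np

def crop_signer(frames, boxes):
    """Crop a whole video with one fixed box that covers every per-frame
    person detection, so the signer never leaves the crop.

    frames: list of HxWx3 frames
    boxes:  list of (x1, y1, x2, y2) person boxes, one per frame
            (e.g., from an off-the-shelf YOLOv3 detector)
    """
    boxes = np.asarray(boxes, dtype=int)
    x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()
    x2, y2 = boxes[:, 2].max(), boxes[:, 3].max()
    return [f[y1:y2, x1:x2] for f in frames]
```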
Signer Diversity: A good sign recognition model should be robust to inter-signer variations in the input data, e.g., signer appearance and signing pace, in order to generalize well to real-world scenarios. For example, as shown in Fig. 2c, the same sign is performed with slightly different hand positioning by two signers. From this perspective, sign datasets should have a diversity of signers. Therefore, we identify the signers in our collected dataset and provide their IDs as meta information of the videos. To this end, we first employ the face detector and the face embedding provided by FaceNet [53] to encode the faces in the dataset, and then compare the Euclidean distances among the face embeddings. If the distance between two embeddings is lower than a pre-defined threshold (i.e., 0.9), we consider the two videos to be signed by the same person. After this automatic labeling, we also manually check the identification results and correct the mislabelled ones.
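A minimal sketch of this automatic signer-identification step, assuming one face embedding (e.g., a FaceNet vector) has already been extracted per video. The 0.9 distance threshold comes from the text above, while the union-find grouping used to turn pairwise matches into signer IDs is our own assumption, not necessarily the authors' exact procedure.

```python
import numpy as np

def assign_signer_ids(embeddings, threshold=0.9):
    """Group videos by signer: two videos get the same ID when the
    Euclidean distance between their face embeddings is below `threshold`.

    embeddings: (V, D) array, one face embedding per video.
    Returns a list of V integer signer IDs.
    """
    V = len(embeddings)
    parent = list(range(V))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(V):
        for j in range(i + 1, V):
            if np.linalg.norm(embeddings[i] - embeddings[j]) < threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(V)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]
```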
Dialect Variation Annotation: Similar to natural languages, ASL signs also have dialect variations [45], and those variations may contain different sign primitives, such as hand-shapes and motions. To avoid the situation where a dialect variation only appears in the testing set, we manually label the variations for each gloss. Our annotators receive training in advance to ensure that they understand the basics of ASL and can distinguish signer variations from dialect variations. To speed up the annotation process and control the annotation quality, we design an interface which lets the annotators compare signs from only two videos displayed simultaneously. We then count the number of dialects and assign labels for the different dialects automatically. After the dialect annotation, we also give each video a dialect label. With the help of the dialect labels, we can guarantee that the dialect signs in the testing set have corresponding training samples. We also discard sign variations with fewer than five examples
since there are not enough samples to be split into the training, validation and testing sets. Furthermore, we notice that these variations are usually not commonly used in daily life.

3.3. Dataset Arrangement

After obtaining all the annotations for each video, we obtain videos with lengths ranging from 0.36 to 8.12 seconds, and the average length of all the videos is 2.41 seconds. The average intra-class standard deviation of the video lengths is 0.85 seconds.

We sort the glosses in descending order of the number of samples per gloss. To provide a better understanding of the difficulty of the word-level sign recognition task and the scalability of sign recognition methods, we conduct experiments on datasets with different vocabulary sizes. In particular, we select the top-K glosses with K = {100, 300, 1000, 2000}, and organize them into four subsets, named WLASL100, WLASL300, WLASL1000 and WLASL2000, respectively.

In Table 2, we present statistics of the four subsets of WLASL. As indicated by Table 2, we acquire 21,083 video samples with a total duration of around 14 hours for WLASL2000, and each gloss in WLASL2000 has 10.5 samples on average, which is almost three times more than in the existing large-scale dataset Boston ASLLVD. We show example frames of our dataset in Fig. 3.

4. Method Comparison on WLASL

Signing, as a class of human actions, shares similarities with human action recognition and pose estimation. In this section, we first introduce some relevant works on action recognition and human pose estimation. Inspired by network architectures for action recognition, we employ image-appearance based and pose-based baseline models for word-level sign recognition. By doing so, we not only investigate the usability of our collected dataset but also examine the sign recognition performance of deep models based on different modalities.

4.1. Image-appearance based Baselines

Early approaches employ handcrafted features to represent the spatial-temporal information of image frames and then ensemble them into a high-dimensional code for classification [40, 68, 54, 39, 21, 65, 67].

Benefiting from the powerful feature extraction ability of deep neural networks, the works [56, 65] exploit deep neural networks to generate a holistic representation for each input frame and then use these representations for recognition. To better establish the temporal relationship among the extracted visual features, Donahue et al. [22] and Yue et al. [76] employ recurrent neural networks (e.g., LSTMs). Some works [23, 10] also employ the joint locations as guidance to extract local deep features around the joint regions.

Sign language recognition, especially word-level recognition, needs to focus on detailed differences between signs, such as the orientation of the hands and the movement direction of the arms, while the background context does not provide any clue for recognition. Motivated by the action recognition methods, we employ two image-based baselines to model the temporal and spatial information of videos in different manners.

Figure 4: Illustrations of our baseline architectures: (a) 2D Conv. RNN, (b) 3D Conv., (c) Pose RNN, (d) Pose TGCN.

4.1.1 2D Convolution with Recurrent Neural Networks

2D Convolutional Neural Networks (CNNs) are widely used to extract spatial features from input images, while Recurrent Neural Networks (RNNs) are employed to capture long-term temporal dependencies among inputs. Thus, our first baseline is constructed from a CNN and an RNN to capture spatio-temporal features from input video frames. In particular, we use VGG16 [57] pretrained on ImageNet to extract spatial features and then feed the extracted features to a stacked GRU [17]. This baseline is referred to as 2D Conv RNN, and the network architecture is illustrated in Figure 4.

To avoid overfitting the training set, the hidden sizes of the GRU for the four subsets are set to 64, 96, 128 and 256 respectively, and the number of stacked recurrent layers in the GRU is set to 2. In the training phase, we randomly select at most 50 consecutive frames from each video. A cross-entropy loss is imposed on the output at every time step as well as on the average pooling of all the output features. In testing, we consider all the frames in the video and make predictions based on the average pooling of all the output features.
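A minimal PyTorch sketch of this 2D Conv RNN (VGG-GRU) baseline, assuming a hidden size of 64 (the WLASL100 setting) and a global average pool to turn the VGG16 convolutional features into a 512-d frame descriptor; the exact pooling and loss wiring shown here are our assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGG_GRU(nn.Module):
    """Sketch of the 2D Conv RNN baseline: per-frame VGG16 features
    fed to a 2-layer GRU, classified at every time step."""

    def __init__(self, num_classes, hidden_size=64):
        super().__init__()
        vgg = models.vgg16(pretrained=True)
        self.backbone = vgg.features            # convolutional layers only
        self.pool = nn.AdaptiveAvgPool2d(1)      # 512-d feature per frame
        self.gru = nn.GRU(512, hidden_size, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, clip):                     # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        feats = self.pool(self.backbone(clip.flatten(0, 1))).flatten(1)
        feats = feats.view(B, T, -1)             # (B, T, 512)
        out, _ = self.gru(feats)                 # (B, T, hidden)
        logits = self.fc(out)                    # per-step logits (B, T, C)
        return logits, logits.mean(dim=1)        # and their temporal average
```

The cross-entropy loss would then be applied both to the per-step logits and to their temporal average, mirroring the training scheme described above.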
4.1.2 3D Convolutional Networks

3D convolutional networks [13, 65, 62, 30] are able to establish not only a holistic representation of each frame but also the temporal relationships between frames, in a hierarchical fashion. Carreira et al. [13] inflate the 2D filters of the Inception network [61] trained on ImageNet [52], thus obtaining well-initialized 3D filters. The inflated 3D filters are then fine-tuned on the Kinetics dataset [13] to better capture the spatial-temporal information in a video.

In this paper, we employ the network architecture of I3D [13] as our second image-appearance based baseline; the network architecture is illustrated in Figure 4. As mentioned above, the original I3D network is trained on ImageNet [52] and fine-tuned on Kinetics-400 [13]. In order to model the temporal and spatial information of sign language, such as the hand shapes and orientations as well as the arm movements, we need to fine-tune the pre-trained I3D. In this way, the fine-tuned I3D can better capture the spatio-temporal information of signs. Since the class number varies across our WLASL subsets, only the last classification layer is modified in accordance with the class number.
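Since only the classification layer changes across the four subsets, the head swap can be sketched as below; `backbone` stands in for a pretrained I3D trunk that returns a pooled feature vector (1024-d in the standard I3D), and the module layout is hypothetical because it depends on which I3D port is used.

```python
import torch.nn as nn

class SignClassifier(nn.Module):
    """Generic head swap: keep a pretrained spatio-temporal backbone
    (e.g., an I3D trunk up to its globally pooled feature) and attach a
    freshly initialised classification layer sized to the subset
    vocabulary (100 / 300 / 1000 / 2000)."""

    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone                  # pretrained, kept as-is
        self.classifier = nn.Linear(feat_dim, num_classes)  # re-initialised

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        feat = self.backbone(clip)                # (B, feat_dim)
        return self.classifier(feat)
```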
4.2. Pose-based Baselines

Human pose estimation aims at localizing the keypoints or joints of human bodies from a single image or from videos. Traditional approaches employ probabilistic graphical models [73] or pictorial structures [49] to estimate single-person poses. Recently, deep learning techniques have boosted the performance of pose estimation significantly. There are two mainstream approaches: regressing the keypoint positions [64, 11], and estimating keypoint heatmaps followed by a non-maximal suppression technique [9, 19, 18, 72]. However, pose estimation only provides the locations of the body keypoints, while the spatial dependencies among the estimated keypoints are not explored.

Several works [29, 66] exploit human poses to recognize actions. The works [29, 66] represent the locations of body joints as a feature representation for recognition. These methods can obtain high recognition accuracy when oracle annotations of the joint locations are provided. In order to exploit the pose information for SLR, the spatial and temporal relationships among all the keypoints require further investigation.

4.2.1 Pose based Recurrent Neural Networks

Pose-based approaches mainly utilize RNNs [44] to model pose sequences for analyzing human motions. Inspired by this idea, our first pose-based baseline employs an RNN to model the temporal sequential information of the pose movements, and the representation output by the RNN is used for sign recognition.

In this work, we extract 55 body and hand 2D keypoints from each frame of WLASL using OpenPose [9]. These keypoints include 13 upper-body joints and 21 joints for each of the left and right hands, as defined in [9]. We then concatenate the 2D coordinates of all the joints as the input feature and feed it to a stacked GRU of 2 layers. In the design of the GRUs, we use empirically optimized hidden sizes of 64, 64, 128 and 128 for the four subsets, respectively. Similar to the training and testing protocols in Section 4.1.1, 50 consecutive frames are randomly chosen from the input video. A cross-entropy loss is employed for training. In testing, all the frames in a video are used for classification.
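A minimal PyTorch sketch of this Pose-GRU baseline under the description above: 55 keypoints whose (x, y) coordinates are concatenated into a 110-d per-frame feature, a 2-layer GRU (hidden size 64 for WLASL100), and predictions averaged over time. Averaging the per-step logits, rather than taking the last step, is our assumption.

```python
import torch
import torch.nn as nn

class PoseGRU(nn.Module):
    """Sketch of the Pose-GRU baseline: 55 OpenPose keypoints
    (13 upper-body + 21 per hand), their (x, y) coordinates concatenated
    into a 110-d frame feature and fed to a 2-layer GRU."""

    def __init__(self, num_classes, num_joints=55, hidden_size=64):
        super().__init__()
        self.gru = nn.GRU(num_joints * 2, hidden_size,
                          num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, keypoints):         # keypoints: (B, T, 55, 2)
        B, T = keypoints.shape[:2]
        x = keypoints.reshape(B, T, -1)   # concatenate (x, y) per frame
        out, _ = self.gru(x)              # (B, T, hidden)
        return self.fc(out).mean(dim=1)   # average predictions over time
```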
4.2.2 Pose Based Temporal Graph Neural Networks

We introduce a novel pose-based approach to ISLR using Temporal Graph Convolutional Networks (TGCN). Consider the input pose sequence X_{1:N} = [x_1, x_2, x_3, ..., x_N] over N sequential frames, where x_i ∈ R^K represents the concatenated 2D keypoint coordinates of dimension K. We propose a new graph-network-based architecture that models the spatial and temporal dependencies of the pose sequence. Different from existing works on human pose estimation, which usually model motions using 2D joint angles, we encode the temporal motion information as a holistic representation of the trajectories of the body keypoints.

Motivated by the recent work on human pose forecasting [16], we view a human body as a fully-connected graph with K vertices and represent the edges in the graph as a weighted adjacency matrix A ∈ R^{K×K}. Although a human body is only partially connected, we construct the human body as a fully-connected graph in order to learn the dependencies among joints via a graph network. In a deep graph convolutional network, the n-th graph layer is a function G_n that takes as input a feature matrix H_n ∈ R^{K×F}, where F is the feature dimension output by the previous layer. In the first layer, the network takes as input the K × 2N matrix of body keypoint coordinates. Given this formulation and a set of trainable weights W_n ∈ R^{F×F'}, a graph convolutional layer is expressed as:

H_{n+1} = G_n(H_n) = σ(A_n H_n W_n),   (1)

where A_n is a trainable adjacency matrix for the n-th layer and σ(·) denotes the tanh(·) activation function. A residual graph convolutional block stacks two graph convolutional layers with a residual connection, as shown in Fig. 5. Our
proposed TGCN stacks multiple residual graph convolutional blocks and takes the average pooling result along the temporal dimension as the feature representation of the pose trajectories. A softmax layer following the average pooling is then employed for classification.

Figure 5: Residual Graph Convolution Block (the network stacks 3 such blocks, each consisting of stacked graph convolution layers).
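A minimal PyTorch sketch of Eq. (1) and the residual block of Fig. 5: a trainable adjacency A_n, a trainable weight matrix W_n and a tanh activation follow the text, while the identity initialisation of A_n and the scaled-Gaussian initialisation of W_n are our assumptions.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolutional layer from Eq. (1):
    H_{n+1} = tanh(A_n H_n W_n), with a trainable adjacency A_n."""

    def __init__(self, num_nodes, in_feats, out_feats):
        super().__init__()
        self.A = nn.Parameter(torch.eye(num_nodes))               # K x K, learned
        self.W = nn.Parameter(torch.randn(in_feats, out_feats) * 0.01)

    def forward(self, H):                                          # H: (B, K, F)
        return torch.tanh(self.A @ H @ self.W)

class ResGraphConvBlock(nn.Module):
    """Residual block: two stacked graph conv layers plus a skip connection."""

    def __init__(self, num_nodes, feats):
        super().__init__()
        self.g1 = GraphConv(num_nodes, feats, feats)
        self.g2 = GraphConv(num_nodes, feats, feats)

    def forward(self, H):
        return H + self.g2(self.g1(H))
```

A full TGCN would stack several such blocks, average-pool the resulting node features along the temporal dimension, and feed the pooled representation to a softmax classifier, as described above.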
4.3. Training and Testing Protocol

4.3.1 Data Pre-processing and Augmentation

We resize all original video frames such that the diagonal size of the person bounding-box is 256 pixels. For training VGG-GRU and I3D, we randomly crop a 224 × 224 patch from each input frame and apply horizontal flipping with a probability of 0.5. Note that the same crop and flipping operations are applied to the entire video rather than in a frame-wise manner. Similar to [12], when training VGG-GRU, Pose-GRU and Pose-TGCN, 50 consecutive frames are randomly selected from each video and the models are asked to predict labels based on only these partial observations of the input video. In doing so, we increase the discriminativeness of the learned models. For I3D, we follow its original training configuration.
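A minimal sketch of this video-level augmentation: one random 224 × 224 crop location and one coin-flip horizontal flip are sampled per clip and then applied to every frame, so that the augmentation stays temporally consistent. Frames are assumed to be numpy arrays already resized as described, and at least 224 pixels on each side.

```python
import random

def augment_video(frames, crop=224):
    """Apply ONE random crop and ONE coin-flip horizontal flip to every
    frame of a clip, keeping the augmentation consistent across time.

    frames: list of HxWx3 numpy arrays (already resized so the person-box
            diagonal is ~256 px, as described above).
    """
    H, W = frames[0].shape[:2]
    top = random.randint(0, H - crop)
    left = random.randint(0, W - crop)
    flip = random.random() < 0.5
    out = []
    for f in frames:
        patch = f[top:top + crop, left:left + crop]
        out.append(patch[:, ::-1] if flip else patch)
    return out
```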
4.3.2 Implementation details

The models, i.e., VGG-GRU, Pose-GRU, Pose-TGCN and I3D, are implemented in PyTorch. Note that we use the I3D pre-trained weights provided by Carreira et al. [13]. We train all the models with the Adam optimizer [34]. Although I3D was originally trained with stochastic gradient descent (SGD) in [12], I3D does not converge when using SGD to fine-tune it in our experiments. Thus, Adam is employed to fine-tune I3D. All the models are trained for 200 epochs on each subset, and we terminate the training process when the validation accuracy stops increasing.

We split the samples of each gloss into the training, validation and testing sets following a ratio of 4:1:1. We also ensure each split has at least one sample per gloss. The split information will be released publicly as part of WLASL.
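A minimal sketch of the per-gloss 4:1:1 split with at least one sample per split; the shuffling, rounding and fixed seed shown here are our assumptions (the official split released with WLASL should be used for comparable experiments).

```python
import random

def split_gloss(videos, ratio=(4, 1, 1), seed=0):
    """Split one gloss's videos into train/val/test with a 4:1:1 ratio,
    keeping at least one sample in every split (each gloss has >= 7 videos)."""
    rng = random.Random(seed)
    videos = videos[:]
    rng.shuffle(videos)
    n = len(videos)
    n_val = max(1, round(n * ratio[1] / sum(ratio)))
    n_test = max(1, round(n * ratio[2] / sum(ratio)))
    n_train = n - n_val - n_test
    return (videos[:n_train],
            videos[n_train:n_train + n_val],
            videos[n_train + n_val:])
```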
4.3.3 Evaluation Metric

We evaluate the models using the mean scores of top-K classification accuracy, with K = {1, 5, 10}, over all the sign instances. As seen in Figure 2, different meanings may have very similar sign gestures, and those gestures may cause errors in the classification results. However, some of the erroneous classifications can be rectified by contextual information. Therefore, it is more reasonable to use the top-K predicted labels for word-level sign language recognition.
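A minimal PyTorch sketch of this metric: a prediction counts as correct at rank K if the ground-truth gloss appears among the K highest-scoring classes, and the scores are averaged over all sign instances.

```python
import torch

def topk_accuracy(logits, labels, ks=(1, 5, 10)):
    """Mean top-K classification accuracy over all sign instances.

    logits: (N, C) class scores; labels: (N,) ground-truth gloss indices.
    """
    maxk = max(ks)
    _, pred = logits.topk(maxk, dim=1)          # (N, maxk), best score first
    correct = pred.eq(labels.unsqueeze(1))      # (N, maxk) boolean matches
    return {k: correct[:, :k].any(dim=1).float().mean().item() for k in ks}
```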
4.4. Discussion

4.4.1 Performance Evaluation of Baseline Networks

Table 3 reports the performance of our baseline models based on poses and image appearance. The results demonstrate that our pose-based TGCN improves the classification accuracy in comparison to the pose-based sign recognition method Pose-GRU. This indicates that our proposed Pose-TGCN captures both the spatial and temporal relationships of the body keypoints, whereas Pose-GRU mainly explores the temporal dependencies of the keypoints for classification. On the other hand, our fine-tuned I3D model achieves better performance than the other image-appearance based model, VGG-GRU, since I3D has larger network capacity and is pretrained not only on ImageNet but also on Kinetics.

Although I3D is larger than our TGCN, Pose-TGCN still achieves results comparable to I3D in top-5 and top-10 accuracy on the large-scale subset WLASL2000. This demonstrates that our TGCN effectively encodes human motion information. Since we use an off-the-shelf pose estimator [9], erroneous pose estimates may degrade the recognition performance. In contrast, the image-appearance based baselines are trained in an end-to-end fashion for sign recognition, and thus the errors residing in the spatial features can be reduced during training. Therefore, training the pose-based baselines in an end-to-end fashion could further improve their recognition performance.

4.4.2 Effect of Vocabulary Size

As seen in Table 3, our baseline methods achieve relatively high classification accuracy on the small-size subsets, i.e., WLASL100 and WLASL300. However, the subset WLASL2000 is much closer to the real-world word-level classification scenario due to its large vocabulary. Pose-GRU, Pose-TGCN and I3D achieve similar performance on WLASL2000. This implies that recognition performance on small-vocabulary datasets does not reflect model performance on large-vocabulary datasets, and that large-scale sign language recognition is very challenging.

We also evaluate how the class number, i.e., the vocabulary size, impacts the model performance. There are two
factors mainly affecting the performance: (i) deep models themselves favor simple and easy tasks, and thus they perform better on smaller datasets. As indicated in Table 3, the models trained on smaller vocabularies perform better than those trained on larger ones (comparing along columns); (ii) the dataset itself has ambiguity. Some signs, as shown in Figure 2, are hard to recognize even for humans, and thus deep models will also be misled by those classes. As the number of classes increases, there will be more ambiguous signs.

Table 3: Top-1, top-5 and top-10 accuracy (%) achieved by each model (by row) on the four WLASL subsets.

Table 4: Top-10 accuracy (%) of I3D and Pose-TGCN when trained (row) and tested (column) on different WLASL subsets.

In order to examine the impact of the second factor, we dissect the models, i.e., I3D and Pose-TGCN, trained on WLASL2000. Here, we test these models on WLASL100, WLASL300, WLASL1000 and WLASL2000. As seen in Table 4, when the number of test classes is smaller, the models achieve higher accuracy (comparing along rows). The experiments imply that as the number of classes decreases, the number of ambiguous signs becomes smaller, thus making classification easier.

4.4.3 Effect of Sample Numbers

As the class number in the dataset increases, training a deep model requires more samples. However, as illustrated in Table 1, although in our dataset each gloss contains more samples than in other datasets, the number of training examples per class is still relatively small compared to some large-scale generic activity recognition datasets [25]. This brings some difficulties to the network training. Note that the average number of training samples per gloss in WLASL100 is twice as large as that in WLASL2000. Therefore, models obtain better classification performance on the glosses with more samples, as indicated in Table 3 and Table 4. Crowdsourcing via Amazon Mechanical Turk (AMT) is a popular way to collect data. However, annotating ASL requires specific domain knowledge, which makes crowdsourcing infeasible.

5. Conclusion

In this paper, we proposed a large-scale Word-Level ASL (WLASL) dataset covering a wide range of daily words and evaluated the performance of deep learning based methods on it. To the best of our knowledge, our dataset is the largest public ASL dataset in terms of vocabulary size and the number of samples per class. Since understanding sign language requires very specific domain knowledge, labelling a large number of samples per class is unaffordable. After comparing deep sign recognition models on WLASL, we conclude that developing word-level sign language recognition algorithms on such a large-scale dataset requires more advanced learning algorithms, such as few-shot learning. In our future work, we also aim at utilizing the word-level annotations to facilitate sentence-level and story-level machine sign translation.

Acknowledgement

This research is supported in part by the Australian Research Council ARC Centre of Excellence for Robotics Vision (CE140100016), ARC-Discovery (DP 190102261) and ARC-LIEF (190100080). The authors gratefully acknowledge the GPU gift donated by NVIDIA Corporation. We thank all anonymous reviewers for their constructive comments.

References

[1] The 20bn-jester dataset-v1. https://20bn.com/datasets/jester. Accessed: 2019-07-16.
[2] Asl university. http://asluniversity.com/. Accessed: 2019-07-16.
[3] Kinect gesture dataset. https://www.microsoft.com/en-us/download/details.aspx?id=52283. Accessed: 2019-07-16.
[4] M. Al-Rousan, K. Assaleh, and A. Talaa. Video-based signer-independent arabic sign language recognition using hidden markov models. Applied Soft Computing, 9(3):990–999, 2009.
[5] A. Amir, B. Taba, D. Berg, T. Melano, J. McKinstry, C. Di Nolfo, T. Nayak, A. Andreopoulos, G. Garreau, M. Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7243–7252, 2017.
[6] V. Athitsos, C. Neidle, S. Sclaroff, J. Nash, A. Stefan, Q. Yuan, and A. Thangali. The american sign language lexicon video dataset. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8. IEEE, 2008.
[7] P. C. Badhe and V. Kulkarni. Indian sign language translator using gesture recognition algorithm. In 2015 IEEE International Conference on Computer Graphics, Vision and Information Security (CGVIS), pages 195–200. IEEE, 2015.
[8] P. Buehler, A. Zisserman, and M. Everingham. Learning sign language by watching tv (using weakly aligned subtitles). In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2961–2968. IEEE, 2009.
[9] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008, 2018.
[10] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using Part Affinity Fields. In CVPR, 2017.
[11] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4733–4742, 2016.
[12] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[13] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, 2017.
[14] N. K. Caselli, Z. S. Sehyr, A. M. Cohen-Goldberg, and K. Emmorey. Asl-lex: A lexical database of american sign language. Behavior Research Methods, 49(2):784–801, 2017.
[15] X. Chai, H. Wang, M. Zhou, G. Wu, H. Li, and X. Chen. Devisign: Dataset and evaluation for 3d sign language recognition. Technical report, Beijing, Tech. Rep, 2015.
[16] H.-k. Chiu, E. Adeli, B. Wang, D.-A. Huang, and J. C. Niebles. Action-agnostic human pose forecasting. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1423–1432. IEEE, 2019.
[17] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[18] X. Chu, W. Ouyang, H. Li, and X. Wang. Structured feature learning for pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4715–4723, 2016.
[19] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang. Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1831–1840, 2017.
[20] H. Cooper, E.-J. Ong, N. Pugeault, and R. Bowden. Sign language recognition using sub-units. Journal of Machine Learning Research, 13(Jul):2205–2231, 2012.
[21] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. 2005.
[22] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
[23] W. Du, Y. Wang, and Y. Qiao. Rpan: An end-to-end recurrent pose-attention network for action recognition in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 3725–3734, 2017.
[24] E. Efthimiou and S.-E. Fotinea. Gslc: Creation and annotation of a greek sign language corpus for hci. In HCI, 2007.
[25] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
[26] K. Grobel and M. Assan. Isolated sign language recognition using hidden markov models. In 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, volume 1, pages 162–167. IEEE, 1997.
[27] E. Gutierrez-Sigut, B. Costello, C. Baus, and M. Carreiras. Lse-sign: A lexical database for spanish sign language. Behavior Research Methods, 48(1):123–137, 2016.
[28] J. Huang, W. Zhou, H. Li, and W. Li. Sign language recognition using 3d convolutional neural networks. In 2015 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2015.
[29] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3192–3199, 2013.
[30] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2012.
[31] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. Thumos challenge: Action recognition with a large number of classes, 2014.
[32] H. R. V. Joze and O. Koller. Ms-asl: A large-scale data set and benchmark for understanding american sign language. arXiv preprint arXiv:1812.01053, 2018.
[33] T. Kapuscinski, M. Oszust, M. Wysocki, and D. Warchol. Recognition of hand gestures observed by depth cameras. International Journal of Advanced Robotic Systems, 12(4):36, 2015.
[34] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[35] P. Kishore, G. A. Rao, E. K. Kumar, M. T. K. Kumar, and D. A. Kumar. Selfie sign language recognition with convolutional neural networks. International Journal of Intelligent Systems and Applications, 10(10):63, 2018.
[36] S.-K. Ko, C. J. Kim, H. Jung, and C. Cho. Neural sign language translation based on human keypoint estimation. Applied Sciences, 9(13):2683, 2019.
[37] S.-K. Ko, J. G. Son, and H. Jung. Sign language recognition with recurrent neural network using human keypoint detection. In Proceedings of the 2018 Conference on Research in Adaptive and Convergent Systems, pages 326–328. ACM, 2018.
[38] V. S. Kulkarni and S. Lokhande. Appearance based recognition of american sign language using gesture segmentation. International Journal on Computer Science and Engineering, 2(03):560–565, 2010.
[39] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.
[40] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. 2008.
[41] J. F. Lichtenauer, E. A. Hendriks, and M. J. Reinders. Sign language recognition by combining statistical dtw and independent classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):2040–2046, 2008.
[42] K. M. Lim, A. W. Tan, and S. C. Tan. Block-based histogram of optical flow for isolated sign language recognition. Journal of Visual Communication and Image Representation, 40:538–545, 2016.
[43] S. Liwicki and M. Everingham. Automatic recognition of fingerspelled words in british sign language. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 50–57. IEEE, 2009.
[44] J. Martinez, M. J. Black, and J. Romero. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2891–2900, 2017.
[45] C. McCaskill, C. Lucas, R. Bayley, and J. Hill. The hidden treasure of Black ASL: Its history and structure. Gallaudet University Press, Washington, DC, 2011.
[46] D. Metaxas, M. Dilsizian, and C. Neidle. Scalable asl sign recognition using model-based machine learning and linguistically annotated corpora. In 8th Workshop on the Representation & Processing of Sign Languages: Involving the Language Community, Language Resources and Evaluation Conference 2018, 2018.
[47] S. Nagarajan and T. Subashini. Static hand gesture recognition for sign language alphabets using edge oriented histogram and multi class svm. International Journal of Computer Applications, 82(4), 2013.
[48] L. Pigou, M. Van Herreweghe, and J. Dambre. Gesture and sign language recognition with temporal residual networks. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.
[49] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Poselet conditioned pictorial structures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 588–595, 2013.
[50] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[51] F. Ronchetti, F. Quiroga, C. A. Estrebou, L. C. Lanzarini, and A. Rosete. Lsa64: an argentinian sign language dataset. In XXII Congreso Argentino de Ciencias de la Computación (CACIC 2016), 2016.
[52] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[53] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[54] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional sift descriptor and its application to action recognition. In Proceedings of the 15th ACM International Conference on Multimedia, pages 357–360. ACM, 2007.
[55] H. Shin, W. J. Kim, and K.-a. Jang. Korean sign language recognition based on image and convolution neural network. In Proceedings of the 2nd International Conference on Image and Graphics Processing, pages 52–55. ACM, 2019.
[56] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[57] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[58] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[59] T. Starner, J. Weaver, and A. Pentland. Real-time american sign language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371–1375, 1998.
[60] T. E. Starner. Visual recognition of american sign language using hidden markov models. Technical report, Massachusetts Inst Of Tech Cambridge Dept Of Brain And Cognitive Sciences, 1995.
[61] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[62] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In European Conference on Computer Vision, pages 140–153. Springer, 2010.
[63] A. Tharwat, T. Gaber, A. E. Hassanien, M. K. Shahin, and B. Refaat. Sift-based arabic sign language recognition system. In Afro-European Conference for Industrial Advancement, pages 359–370. Springer, 2015.
[64] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, pages 24–27, 2014.
[65] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[66] C. Wang, Y. Wang, and A. L. Yuille. An approach to pose-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 915–922, 2013.
[67] H. Wang, A. Kläser, C. Schmid, and L. Cheng-Lin. Action recognition by dense trajectories. 2011.
[68] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. 2009.
[69] R. Wilbur and A. C. Kak. Purdue rvl-slll american sign language database. 2006.
[70] Q. Xue, X. Li, D. Wang, and W. Zhang. Deep forest-based monocular visual sign language recognition. Applied Sciences, 9(9):1945, 2019.
[71] Q. Yang. Chinese sign language recognition based on video sequence appearance modeling. In 2010 5th IEEE Conference on Industrial Electronics and Applications, pages 1537–1542. IEEE, 2010.
[72] W. Yang, W. Ouyang, H. Li, and X. Wang. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3073–3082, 2016.
[73] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR 2011, pages 1385–1392. IEEE, 2011.
[74] F. Yasir, P. C. Prasad, A. Alsadoon, and A. Elchouemi. Sift based approach on bangla sign language recognition. In 2015 IEEE 8th International Workshop on Computational Intelligence and Applications (IWCIA), pages 35–39. IEEE, 2015.
[75] Y. Ye, Y. Tian, M. Huenerfauth, and J. Liu. Recognizing american sign language gestures from within continuous videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2064–2073, 2018.
[76] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.
[77] Z. Zafrulla, H. Brashear, T. Starner, H. Hamilton, and P. Presti. American sign language recognition with the kinect. In Proceedings of the 13th International Conference on Multimodal Interfaces, pages 279–286. ACM, 2011.
[78] M. Zahedi, D. Keysers, T. Deselaers, and H. Ney. Combination of tangent distance and an image distortion model for appearance-based sign language recognition. In Joint Pattern Recognition Symposium, pages 401–408. Springer, 2005.