SLR Paper
Table 1: Overview of word-level datasets in other languages.

Datasets | #Gloss | #Videos | #Signers | Type | Sign Language
LSA64 [51] | 64 | 3,200 | 10 | RGB | Argentinian
PSL Kinect 30 [34] | 30 | 300 | - | RGB, depth | Polish
PSL ToF [34] | 84 | 1,680 | - | RGB, depth | Polish
DEVISIGN [15] | 2,000 | 24,000 | 8 | RGB, depth | Chinese
GSL [24] | 20 | 840 | 6 | RGB | Greek
DGS Kinect [3] | 40 | 3,000 | 15 | RGB, depth | German
LSE-sign [27] | 2,400 | 2,400 | 2 | RGB | Spanish

2,742 words (i.e., glosses) with 9,794 examples (3.6 examples per gloss on average). Although the dataset has large coverage of the vocabulary, more than 2,000 glosses have at most three examples, which is unsuitable for training thousand-way classifiers. RWTH-BOSTON-50 [78] contains 483 samples of 50 different glosses performed by 2 signers. Moreover, RWTH-BOSTON-104 provides 200 continuous sentences signed by 3 signers, which in total cover 104 signs/words. RWTH-BOSTON-400, as a sentence-level corpus, consists of 843 sentences including around 400 signs, and those sentences are performed by 5 signers. DEVISIGN is a large-scale word-level Chinese Sign Language dataset consisting of 2,000 words and 24,000 examples performed by 8 non-native signers in a controlled lab environment. Word-level sign language datasets also exist for other languages, as summarized in Table 1.

All the previously mentioned datasets have their own properties and represent different attempts to tackle the word-level sign recognition task. However, they fail to capture the difficulty of the task due to insufficient numbers of instances and signers. To overcome the above issues in sign recognition, we propose a large-scale word-level ASL dataset, coined the WLASL database. Since our dataset consists of RGB-only videos, algorithms trained on our dataset can be easily applied to real-world cases with minimal equipment requirements. Moreover, we provide a set of baselines using state-of-the-art methods for sign recognition to facilitate the evaluation of future works.

2.2. Sign Language Recognition Approaches

Existing word-level sign recognition models are mainly trained and evaluated on either private [26, 38, 77, 28, 48] or small-scale datasets with fewer than one hundred words [?, 38, 77, 28, 48, 42, 46, 70]. These sign recognition approaches mainly consist of three steps: feature extraction, temporal-dependency modeling and classification. Previous works first employ different hand-crafted features to represent static hand poses, such as SIFT-based features [71, 74, 63], HOG-based features [43, 8, 20] and features in the frequency domain [4, 7]. Hidden Markov Models (HMMs) [60, 59] are then employed to model the temporal relationships in video sequences. Dynamic Time Warping (DTW) [41] is also exploited to handle differences in sequence lengths and frame rates. Classification algorithms, such as Support Vector Machines (SVMs) [47], are used to label the signs with the corresponding words.

Similar to action recognition, some recent works [55, 35] use CNNs to extract holistic features from image frames and then use the extracted features for classification. Several approaches [37, 36] first extract body keypoints and then concatenate their locations as a feature vector. The extracted features are then fed into a stacked GRU for recognizing signs. These methods demonstrate the effectiveness of using human poses in the word-level sign recognition task. Instead of encoding the spatial and temporal information separately, recent works also employ 3D CNNs [28, 75] to capture spatial-temporal features together. However, these methods are only tested on small-scale datasets. Thus, the generalization ability of those methods remains unknown. Moreover, due to the lack of a standard word-level large-scale sign language dataset, the results of different methods evaluated on different small-scale datasets are not comparable and might not reflect the practical usefulness of models.
Table 2: Comparisons of our WLASL dataset with existing ASL datasets. Column "Mean" indicates the average number of video samples per gloss.

Datasets | #Gloss | #Videos | Mean | #Signers | Year
Purdue RVL-SLLL [69] | 39 | 546 | 14 | 14 | 2006
RWTH-BOSTON-50 [78] | 50 | 483 | 9.7 | 3 | 2005
Boston ASLLVD [6] | 2,742 | 9,794 | 3.6 | 6 | 2008
WLASL100 | 100 | 2,038 | 20.4 | 97 | 2019
WLASL300 | 300 | 5,117 | 17.1 | 109 | 2019
WLASL1000 | 1,000 | 13,168 | 13.2 | 116 | 2019
WLASL2000 | 2,000 | 21,083 | 10.5 | 119 | 2019

3. Our Proposed WLASL Dataset

In this section, we introduce our proposed Word-Level American Sign Language dataset (WLASL). We first explain the data sources and the data collection process, followed by a description of our annotation process, which combines automatic detection procedures with manual annotation to ensure the correctness of the correspondence between signs and their annotations. Finally, we provide statistics of WLASL.

3.1. Dataset Collection

In order to construct a large-scale signer-independent ASL dataset, we resort to two main sources on the Internet. First, there are multiple educational sign language websites, such as ASLU [2] and ASL-LEX [14], which provide a lookup function for ASL signs. The mappings between glosses and signs from those websites are accurate since the videos have been checked by experts before being uploaded. The other main source is ASL tutorial videos on YouTube. We select videos whose titles clearly describe the gloss of the sign. In total, we access 68,129 videos of 20,863 ASL glosses from 20 different websites. In each video, a signer performs only one sign (possibly with multiple repetitions) in a nearly frontal view against varying backgrounds.

After collecting all the resources for the dataset, we remove videos whose gloss annotations are composed of more than two words in English, to ensure that the dataset contains words only. If the number of videos for a gloss is less than seven, we also remove that gloss to guarantee that enough samples can be split into the training and testing sets. Since most of the websites cover daily used words, a small number of video samples for a gloss may imply that the word is not frequently used. Therefore, removing glosses with few video samples will not affect the usefulness of our dataset in practice. After this preliminary selection procedure, we have 34,404 video samples of 3,126 glosses for further annotation.

3.2. Annotations

In addition to providing a gloss label for each video, some meta information, including the temporal boundary, body bounding box, signer annotation and sign dialect/variation annotation, is also given in our dataset.

Temporal boundary: A temporal boundary is used to indicate the start and end frames of a sign. When a video does not contain repetitions of a sign, the boundaries are labelled as the first and last frames of the sign. Otherwise, we manually label the boundaries between the repetitions. For videos containing repetitions, we only keep one sample of the repeated sign, to ensure that samples in which the same signer performs the same sign will not appear in both the training and testing sets. Thus, we prevent learned models from overfitting to the testing set.

Body Bounding-box: In order to reduce side-effects caused by backgrounds and let models focus on the signers, we use YOLOv3 [50] as a person detection tool to identify the body bounding-boxes of signers in videos. Note that, since the size of the bounding-box changes as a person signs, we use the largest bounding-box to crop the person from the video.
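To make this cropping step concrete, the minimal sketch below assumes per-frame person boxes have already been obtained from a detector such as YOLOv3; it fixes a single crop window that encloses every per-frame detection so that the signer stays inside the crop for the whole video. The function and argument names are hypothetical and not part of any released WLASL tooling.

```python
import numpy as np

def crop_signer(frames, boxes):
    """Crop a whole video with one fixed box that covers every per-frame
    person detection, so the signer never leaves the crop.

    frames: list of HxWx3 frames
    boxes:  list of (x1, y1, x2, y2) person boxes, one per frame
            (e.g., from an off-the-shelf YOLOv3 detector)
    """
    boxes = np.asarray(boxes, dtype=int)
    x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()
    x2, y2 = boxes[:, 2].max(), boxes[:, 3].max()
    return [f[y1:y2, x1:x2] for f in frames]
```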
Signer Diversity: A good sign recognition model should be robust to inter-signer variations in the input data, e.g., signer appearance and signing pace, in order to generalize well to real-world scenarios. For example, as shown in Fig. 2c, the same sign is performed with slightly different hand positioning by two signers. From this perspective, sign datasets should have a diversity of signers. Therefore, we identify the signers in our collected dataset and provide their IDs as meta information of the videos. To this end, we first employ the face detector and the face embedding provided by FaceNet [53] to encode the faces in the dataset, and then compare the Euclidean distances among the face embeddings. If the distance between two embeddings is lower than a pre-defined threshold (i.e., 0.9), we consider the two videos to be signed by the same person. After this automatic labeling, we also manually check the identification results and correct the mislabelled ones.
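A minimal sketch of this automatic signer-identification step, assuming one face embedding (e.g., a FaceNet vector) has already been extracted per video. The 0.9 distance threshold comes from the text above, while the union-find grouping used to turn pairwise matches into signer IDs is our own assumption, not necessarily the authors' exact procedure.

```python
import numpy as np

def assign_signer_ids(embeddings, threshold=0.9):
    """Group videos by signer: two videos get the same ID when the
    Euclidean distance between their face embeddings is below `threshold`.

    embeddings: (V, D) array, one face embedding per video.
    Returns a list of V integer signer IDs.
    """
    V = len(embeddings)
    parent = list(range(V))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(V):
        for j in range(i + 1, V):
            if np.linalg.norm(embeddings[i] - embeddings[j]) < threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(V)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]
```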
Dialect Variation Annotation: Similar to natural languages, ASL signs also have dialect variations [45], and those variations may contain different sign primitives, such as hand-shapes and motions. To avoid the situation where a dialect variation only appears in the testing set, we manually label the variations for each gloss. Our annotators receive training in advance to ensure that they understand the basics of ASL and can distinguish signer variations from dialect variations. To speed up the annotation process and control the annotation quality, we design an interface which lets the annotators compare signs from only two videos displayed simultaneously. We then count the number of dialects and assign labels for the different dialects automatically. After the dialect annotation, we also give each video a dialect label. With the help of the dialect labels, we can guarantee that the dialect signs in the testing set have corresponding training samples. We also discard sign variations with fewer than five examples
since there are not enough samples to be split into the training, validation and testing sets. Furthermore, we notice that these variations are usually not commonly used in daily life.

3.3. Dataset Arrangement

After obtaining all the annotations for each video, we obtain videos with lengths ranging from 0.36 to 8.12 seconds, and the average length of all the videos is 2.41 seconds. The average intra-class standard deviation of the video lengths is 0.85 seconds.

We sort the glosses in descending order of the number of samples per gloss. To provide a better understanding of the difficulty of the word-level sign recognition task and the scalability of sign recognition methods, we conduct experiments on datasets with different vocabulary sizes. In particular, we select the top-K glosses with K = {100, 300, 1000, 2000}, and organize them into four subsets, named WLASL100, WLASL300, WLASL1000 and WLASL2000, respectively.

In Table 2, we present statistics of the four subsets of WLASL. As indicated by Table 2, we acquire 21,083 video samples with a total duration of around 14 hours for WLASL2000, and each gloss in WLASL2000 has 10.5 samples on average, which is almost three times more than in the existing large-scale dataset Boston ASLLVD. We show example frames of our dataset in Fig. 3.

4. Method Comparison on WLASL

Signing, as a class of human actions, shares similarities with human action recognition and pose estimation. In this section, we first introduce some relevant works on action recognition and human pose estimation. Inspired by network architectures for action recognition, we employ image-appearance based and pose-based baseline models for word-level sign recognition. By doing so, we not only investigate the usability of our collected dataset but also examine the sign recognition performance of deep models based on different modalities.

4.1. Image-appearance based Baselines

Early approaches employ handcrafted features to represent the spatial-temporal information of image frames and then ensemble them into a high-dimensional code for classification [40, 68, 54, 39, 21, 65, 67].

Benefiting from the powerful feature extraction ability of deep neural networks, the works [56, 65] exploit deep neural networks to generate a holistic representation for each input frame and then use these representations for recognition. To better establish the temporal relationship among the extracted visual features, Donahue et al. [22] and Yue et al. [76] employ recurrent neural networks (e.g., LSTMs). Some works [23, 10] also employ the joint locations as guidance to extract local deep features around the joint regions.

Sign language recognition, especially word-level recognition, needs to focus on detailed differences between signs, such as the orientation of the hands and the movement direction of the arms, while the background context does not provide any clue for recognition. Motivated by the action recognition methods, we employ two image-based baselines to model the temporal and spatial information of videos in different manners.

Figure 4: Illustrations of our baseline architectures: (a) 2D Conv. RNN, (b) 3D Conv., (c) Pose RNN, (d) Pose TGCN.

4.1.1 2D Convolution with Recurrent Neural Networks

2D Convolutional Neural Networks (CNNs) are widely used to extract spatial features from input images, while Recurrent Neural Networks (RNNs) are employed to capture long-term temporal dependencies among inputs. Thus, our first baseline is constructed from a CNN and an RNN to capture spatio-temporal features from input video frames. In particular, we use VGG16 [57] pretrained on ImageNet to extract spatial features and then feed the extracted features to a stacked GRU [17]. This baseline is referred to as 2D Conv RNN, and the network architecture is illustrated in Figure 4.

To avoid overfitting the training set, the hidden sizes of the GRU for the four subsets are set to 64, 96, 128 and 256 respectively, and the number of stacked recurrent layers in the GRU is set to 2. In the training phase, we randomly select at most 50 consecutive frames from each video. A cross-entropy loss is imposed on the output at every time step as well as on the average pooling of all the output features. In testing, we consider all the frames in the video and make predictions based on the average pooling of all the output features.
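A minimal PyTorch sketch of this 2D Conv RNN (VGG-GRU) baseline, assuming a hidden size of 64 (the WLASL100 setting) and a global average pool to turn the VGG16 convolutional features into a 512-d frame descriptor; the exact pooling and loss wiring shown here are our assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGG_GRU(nn.Module):
    """Sketch of the 2D Conv RNN baseline: per-frame VGG16 features
    fed to a 2-layer GRU, classified at every time step."""

    def __init__(self, num_classes, hidden_size=64):
        super().__init__()
        vgg = models.vgg16(pretrained=True)
        self.backbone = vgg.features            # convolutional layers only
        self.pool = nn.AdaptiveAvgPool2d(1)      # 512-d feature per frame
        self.gru = nn.GRU(512, hidden_size, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, clip):                     # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        feats = self.pool(self.backbone(clip.flatten(0, 1))).flatten(1)
        feats = feats.view(B, T, -1)             # (B, T, 512)
        out, _ = self.gru(feats)                 # (B, T, hidden)
        logits = self.fc(out)                    # per-step logits (B, T, C)
        return logits, logits.mean(dim=1)        # and their temporal average
```

The cross-entropy loss would then be applied both to the per-step logits and to their temporal average, mirroring the training scheme described above.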
4.1.2 3D Convolutional Networks

3D convolutional networks [13, 65, 62, 30] are able to establish not only a holistic representation of each frame but also the temporal relationships between frames, in a hierarchical fashion. Carreira et al. [13] inflate the 2D filters of the Inception network [61] trained on ImageNet [52], thus obtaining well-initialized 3D filters. The inflated 3D filters are then fine-tuned on the Kinetics dataset [13] to better capture the spatial-temporal information in a video.

In this paper, we employ the network architecture of I3D [13] as our second image-appearance based baseline; the network architecture is illustrated in Figure 4. As mentioned above, the original I3D network is trained on ImageNet [52] and fine-tuned on Kinetics-400 [13]. In order to model the temporal and spatial information of sign language, such as the hand shapes and orientations as well as the arm movements, we need to fine-tune the pre-trained I3D. In this way, the fine-tuned I3D can better capture the spatio-temporal information of signs. Since the class number varies across our WLASL subsets, only the last classification layer is modified in accordance with the class number.
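Since only the classification layer changes across the four subsets, the head swap can be sketched as below; `backbone` stands in for a pretrained I3D trunk that returns a pooled feature vector (1024-d in the standard I3D), and the module layout is hypothetical because it depends on which I3D port is used.

```python
import torch.nn as nn

class SignClassifier(nn.Module):
    """Generic head swap: keep a pretrained spatio-temporal backbone
    (e.g., an I3D trunk up to its globally pooled feature) and attach a
    freshly initialised classification layer sized to the subset
    vocabulary (100 / 300 / 1000 / 2000)."""

    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone                  # pretrained, kept as-is
        self.classifier = nn.Linear(feat_dim, num_classes)  # re-initialised

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        feat = self.backbone(clip)                # (B, feat_dim)
        return self.classifier(feat)
```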
4.2. Pose-based Baselines

Human pose estimation aims at localizing the keypoints or joints of human bodies from a single image or from videos. Traditional approaches employ probabilistic graphical models [73] or pictorial structures [49] to estimate single-person poses. Recently, deep learning techniques have boosted the performance of pose estimation significantly. There are two mainstream approaches: regressing the keypoint positions [64, 11], and estimating keypoint heatmaps followed by a non-maximal suppression technique [9, 19, 18, 72]. However, pose estimation only provides the locations of the body keypoints, while the spatial dependencies among the estimated keypoints are not explored.

Several works [29, 66] exploit human poses to recognize actions. The works [29, 66] represent the locations of body joints as a feature representation for recognition. These methods can obtain high recognition accuracy when oracle annotations of the joint locations are provided. In order to exploit the pose information for SLR, the spatial and temporal relationships among all the keypoints require further investigation.

4.2.1 Pose based Recurrent Neural Networks

Pose-based approaches mainly utilize RNNs [44] to model pose sequences for analyzing human motions. Inspired by this idea, our first pose-based baseline employs an RNN to model the temporal sequential information of the pose movements, and the representation output by the RNN is used for sign recognition.

In this work, we extract 55 body and hand 2D keypoints from each frame of WLASL using OpenPose [9]. These keypoints include 13 upper-body joints and 21 joints for each of the left and right hands, as defined in [9]. We then concatenate the 2D coordinates of all the joints as the input feature and feed it to a stacked GRU of 2 layers. In the design of the GRUs, we use empirically optimized hidden sizes of 64, 64, 128 and 128 for the four subsets, respectively. Similar to the training and testing protocols in Section 4.1.1, 50 consecutive frames are randomly chosen from the input video. A cross-entropy loss is employed for training. In testing, all the frames in a video are used for classification.
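A minimal PyTorch sketch of this Pose-GRU baseline under the description above: 55 keypoints whose (x, y) coordinates are concatenated into a 110-d per-frame feature, a 2-layer GRU (hidden size 64 for WLASL100), and predictions averaged over time. Averaging the per-step logits, rather than taking the last step, is our assumption.

```python
import torch
import torch.nn as nn

class PoseGRU(nn.Module):
    """Sketch of the Pose-GRU baseline: 55 OpenPose keypoints
    (13 upper-body + 21 per hand), their (x, y) coordinates concatenated
    into a 110-d frame feature and fed to a 2-layer GRU."""

    def __init__(self, num_classes, num_joints=55, hidden_size=64):
        super().__init__()
        self.gru = nn.GRU(num_joints * 2, hidden_size,
                          num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, keypoints):         # keypoints: (B, T, 55, 2)
        B, T = keypoints.shape[:2]
        x = keypoints.reshape(B, T, -1)   # concatenate (x, y) per frame
        out, _ = self.gru(x)              # (B, T, hidden)
        return self.fc(out).mean(dim=1)   # average predictions over time
```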
4.2.2 Pose Based Temporal Graph Neural Networks

We introduce a novel pose-based approach to ISLR using Temporal Graph Convolutional Networks (TGCN). Consider the input pose sequence X_{1:N} = [x_1, x_2, x_3, ..., x_N] over N sequential frames, where x_i ∈ R^K represents the concatenated 2D keypoint coordinates of dimension K. We propose a new graph-network-based architecture that models the spatial and temporal dependencies of the pose sequence. Different from existing works on human pose estimation, which usually model motions using 2D joint angles, we encode the temporal motion information as a holistic representation of the trajectories of the body keypoints.

Motivated by the recent work on human pose forecasting [16], we view a human body as a fully-connected graph with K vertices and represent the edges in the graph as a weighted adjacency matrix A ∈ R^{K×K}. Although a human body is only partially connected, we construct the human body as a fully-connected graph in order to learn the dependencies among joints via a graph network. In a deep graph convolutional network, the n-th graph layer is a function G_n that takes as input a feature matrix H_n ∈ R^{K×F}, where F is the feature dimension output by the previous layer. In the first layer, the network takes as input the K × 2N matrix of body keypoint coordinates. Given this formulation and a set of trainable weights W_n ∈ R^{F×F'}, a graph convolutional layer is expressed as:

H_{n+1} = G_n(H_n) = σ(A_n H_n W_n),   (1)

where A_n is a trainable adjacency matrix for the n-th layer and σ(·) denotes the tanh(·) activation function. A residual graph convolutional block stacks two graph convolutional layers with a residual connection, as shown in Fig. 5. Our
proposed TGCN stacks multiple residual graph convolutional blocks and takes the average pooling result along the temporal dimension as the feature representation of the pose trajectories. A softmax layer following the average pooling is then employed for classification.

Figure 5: Residual Graph Convolution Block (the network stacks 3 such blocks, each consisting of stacked graph convolution layers).
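A minimal PyTorch sketch of Eq. (1) and the residual block of Fig. 5: a trainable adjacency A_n, a trainable weight matrix W_n and a tanh activation follow the text, while the identity initialisation of A_n and the scaled-Gaussian initialisation of W_n are our assumptions.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolutional layer from Eq. (1):
    H_{n+1} = tanh(A_n H_n W_n), with a trainable adjacency A_n."""

    def __init__(self, num_nodes, in_feats, out_feats):
        super().__init__()
        self.A = nn.Parameter(torch.eye(num_nodes))               # K x K, learned
        self.W = nn.Parameter(torch.randn(in_feats, out_feats) * 0.01)

    def forward(self, H):                                          # H: (B, K, F)
        return torch.tanh(self.A @ H @ self.W)

class ResGraphConvBlock(nn.Module):
    """Residual block: two stacked graph conv layers plus a skip connection."""

    def __init__(self, num_nodes, feats):
        super().__init__()
        self.g1 = GraphConv(num_nodes, feats, feats)
        self.g2 = GraphConv(num_nodes, feats, feats)

    def forward(self, H):
        return H + self.g2(self.g1(H))
```

A full TGCN would stack several such blocks, average-pool the resulting node features along the temporal dimension, and feed the pooled representation to a softmax classifier, as described above.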
4.3. Training and Testing Protocol

4.3.1 Data Pre-processing and Augmentation

We resize all original video frames such that the diagonal size of the person bounding-box is 256 pixels. For training VGG-GRU and I3D, we randomly crop a 224 × 224 patch from each input frame and apply horizontal flipping with a probability of 0.5. Note that the same crop and flipping operations are applied to the entire video rather than in a frame-wise manner. Similar to [12], when training VGG-GRU, Pose-GRU and Pose-TGCN, 50 consecutive frames are randomly selected from each video and the models are asked to predict labels based on only these partial observations of the input video. In doing so, we increase the discriminativeness of the learned models. For I3D, we follow its original training configuration.
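A minimal sketch of this video-level augmentation: one random 224 × 224 crop location and one coin-flip horizontal flip are sampled per clip and then applied to every frame, so that the augmentation stays temporally consistent. Frames are assumed to be numpy arrays already resized as described, and at least 224 pixels on each side.

```python
import random

def augment_video(frames, crop=224):
    """Apply ONE random crop and ONE coin-flip horizontal flip to every
    frame of a clip, keeping the augmentation consistent across time.

    frames: list of HxWx3 numpy arrays (already resized so the person-box
            diagonal is ~256 px, as described above).
    """
    H, W = frames[0].shape[:2]
    top = random.randint(0, H - crop)
    left = random.randint(0, W - crop)
    flip = random.random() < 0.5
    out = []
    for f in frames:
        patch = f[top:top + crop, left:left + crop]
        out.append(patch[:, ::-1] if flip else patch)
    return out
```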
4.3.2 Implementation details

The models, i.e., VGG-GRU, Pose-GRU, Pose-TGCN and I3D, are implemented in PyTorch. Note that we use the I3D pre-trained weights provided by Carreira et al. [13]. We train all the models with the Adam optimizer [34]. Although I3D was originally trained with stochastic gradient descent (SGD) in [12], I3D does not converge when using SGD to fine-tune it in our experiments. Thus, Adam is employed to fine-tune I3D. All the models are trained for 200 epochs on each subset, and we terminate the training process when the validation accuracy stops increasing.

We split the samples of each gloss into the training, validation and testing sets following a ratio of 4:1:1. We also ensure each split has at least one sample per gloss. The split information will be released publicly as part of WLASL.
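A minimal sketch of the per-gloss 4:1:1 split with at least one sample per split; the shuffling, rounding and fixed seed shown here are our assumptions (the official split released with WLASL should be used for comparable experiments).

```python
import random

def split_gloss(videos, ratio=(4, 1, 1), seed=0):
    """Split one gloss's videos into train/val/test with a 4:1:1 ratio,
    keeping at least one sample in every split (each gloss has >= 7 videos)."""
    rng = random.Random(seed)
    videos = videos[:]
    rng.shuffle(videos)
    n = len(videos)
    n_val = max(1, round(n * ratio[1] / sum(ratio)))
    n_test = max(1, round(n * ratio[2] / sum(ratio)))
    n_train = n - n_val - n_test
    return (videos[:n_train],
            videos[n_train:n_train + n_val],
            videos[n_train + n_val:])
```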
4.3.3 Evaluation Metric

We evaluate the models using the mean scores of top-K classification accuracy, with K = {1, 5, 10}, over all the sign instances. As seen in Figure 2, different meanings may have very similar sign gestures, and those gestures may cause errors in the classification results. However, some of the erroneous classifications can be rectified by contextual information. Therefore, it is more reasonable to use the top-K predicted labels for word-level sign language recognition.
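A minimal PyTorch sketch of this metric: a prediction counts as correct at rank K if the ground-truth gloss appears among the K highest-scoring classes, and the scores are averaged over all sign instances.

```python
import torch

def topk_accuracy(logits, labels, ks=(1, 5, 10)):
    """Mean top-K classification accuracy over all sign instances.

    logits: (N, C) class scores; labels: (N,) ground-truth gloss indices.
    """
    maxk = max(ks)
    _, pred = logits.topk(maxk, dim=1)          # (N, maxk), best score first
    correct = pred.eq(labels.unsqueeze(1))      # (N, maxk) boolean matches
    return {k: correct[:, :k].any(dim=1).float().mean().item() for k in ks}
```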
4.4. Discussion

4.4.1 Performance Evaluation of Baseline Networks

Table 3 reports the performance of our baseline models based on poses and image appearance. The results demonstrate that our pose-based TGCN improves the classification accuracy in comparison to the pose-based sign recognition method Pose-GRU. This indicates that our proposed Pose-TGCN captures both the spatial and temporal relationships of the body keypoints, whereas Pose-GRU mainly explores the temporal dependencies of the keypoints for classification. On the other hand, our fine-tuned I3D model achieves better performance than the other image-appearance based model, VGG-GRU, since I3D has larger network capacity and is pretrained not only on ImageNet but also on Kinetics.

Although I3D is larger than our TGCN, Pose-TGCN still achieves results comparable to I3D in top-5 and top-10 accuracy on the large-scale subset WLASL2000. This demonstrates that our TGCN effectively encodes human motion information. Since we use an off-the-shelf pose estimator [9], erroneous pose estimates may degrade the recognition performance. In contrast, the image-appearance based baselines are trained in an end-to-end fashion for sign recognition, and thus the errors residing in the spatial features can be reduced during training. Therefore, training the pose-based baselines in an end-to-end fashion could further improve their recognition performance.

4.4.2 Effect of Vocabulary Size

As seen in Table 3, our baseline methods achieve relatively high classification accuracy on the small-size subsets, i.e., WLASL100 and WLASL300. However, the subset WLASL2000 is much closer to the real-world word-level classification scenario due to its large vocabulary. Pose-GRU, Pose-TGCN and I3D achieve similar performance on WLASL2000. This implies that recognition performance on small-vocabulary datasets does not reflect model performance on large-vocabulary datasets, and that large-scale sign language recognition is very challenging.

We also evaluate how the class number, i.e., the vocabulary size, impacts the model performance. There are two
factors mainly affecting the performance: (i) deep models themselves favor simple and easy tasks, and thus they perform better on smaller datasets. As indicated in Table 3, the models trained on smaller vocabularies perform better than those trained on larger ones (comparing along columns); (ii) the dataset itself has ambiguity. Some signs, as shown in Figure 2, are hard to recognize even for humans, and thus deep models will also be misled by those classes. As the number of classes increases, there will be more ambiguous signs.

Table 3: Top-1, top-5 and top-10 accuracy (%) achieved by each model (by row) on the four WLASL subsets.

Table 4: Top-10 accuracy (%) of I3D and Pose-TGCN when trained (row) and tested (column) on different WLASL subsets.

In order to examine the impact of the second factor, we dissect the models, i.e., I3D and Pose-TGCN, trained on WLASL2000. Here, we test these models on WLASL100, WLASL300, WLASL1000 and WLASL2000. As seen in Table 4, when the number of test classes is smaller, the models achieve higher accuracy (comparing along rows). The experiments imply that as the number of classes decreases, the number of ambiguous signs becomes smaller, thus making classification easier.

4.4.3 Effect of Sample Numbers

As the class number in the dataset increases, training a deep model requires more samples. However, as illustrated in Table 1, although in our dataset each gloss contains more samples than in other datasets, the number of training examples per class is still relatively small compared to some large-scale generic activity recognition datasets [25]. This brings some difficulties to the network training. Note that the average number of training samples per gloss in WLASL100 is twice as large as that in WLASL2000. Therefore, models obtain better classification performance on the glosses with more samples, as indicated in Table 3 and Table 4. Crowdsourcing via Amazon Mechanical Turk (AMT) is a popular way to collect data. However, annotating ASL requires specific domain knowledge, which makes crowdsourcing infeasible.

5. Conclusion

In this paper, we proposed a large-scale Word-Level ASL (WLASL) dataset covering a wide range of daily words and evaluated the performance of deep learning based methods on it. To the best of our knowledge, our dataset is the largest public ASL dataset in terms of vocabulary size and the number of samples per class. Since understanding sign language requires very specific domain knowledge, labelling a large number of samples per class is unaffordable. After comparing deep sign recognition models on WLASL, we conclude that developing word-level sign language recognition algorithms on such a large-scale dataset requires more advanced learning algorithms, such as few-shot learning. In our future work, we also aim at utilizing the word-level annotations to facilitate sentence-level and story-level machine sign translation.

Acknowledgement

This research is supported in part by the Australian Research Council ARC Centre of Excellence for Robotics Vision (CE140100016), ARC-Discovery (DP 190102261) and ARC-LIEF (190100080). The authors gratefully acknowledge the GPU gift donated by NVIDIA Corporation. We thank all anonymous reviewers for their constructive comments.

References

[1] The 20bn-jester dataset-v1. https://20bn.com/datasets/jester. Accessed: 2019-07-16.
[2] Asl university. http://asluniversity.com/. Accessed: 2019-07-16.
[3] Kinect gesture dataset. https://www.microsoft.com/en-us/download/details.aspx?id=52283. Accessed: 2019-07-16.
[4] M. Al-Rousan, K. Assaleh, and A. Talaa. Video-based signer-independent arabic sign language recognition using hidden markov models. Applied Soft Computing, 9(3):990–999, 2009.
[5] A. Amir, B. Taba, D. Berg, T. Melano, J. McKinstry, C. Di Nolfo, T. Nayak, A. Andreopoulos, G. Garreau, M. Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7243–7252, 2017.
[6] V. Athitsos, C. Neidle, S. Sclaroff, J. Nash, A. Stefan, Q. Yuan, and A. Thangali. The american sign language lexicon video dataset. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8. IEEE, 2008.
[7] P. C. Badhe and V. Kulkarni. Indian sign language translator using gesture recognition algorithm. In 2015 IEEE International Conference on Computer Graphics, Vision and Information Security (CGVIS), pages 195–200. IEEE, 2015.
[8] P. Buehler, A. Zisserman, and M. Everingham. Learning sign language by watching tv (using weakly aligned subtitles). In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2961–2968. IEEE, 2009.
[9] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008, 2018.
[10] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using Part Affinity Fields. In CVPR, 2017.
[11] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4733–4742, 2016.
[12] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[13] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, 2017.
[14] N. K. Caselli, Z. S. Sehyr, A. M. Cohen-Goldberg, and K. Emmorey. Asl-lex: A lexical database of american sign language. Behavior Research Methods, 49(2):784–801, 2017.
[15] X. Chai, H. Wang, M. Zhou, G. Wu, H. Li, and X. Chen. Devisign: Dataset and evaluation for 3d sign language recognition. Technical report, Beijing, Tech. Rep, 2015.
[16] H.-k. Chiu, E. Adeli, B. Wang, D.-A. Huang, and J. C. Niebles. Action-agnostic human pose forecasting. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1423–1432. IEEE, 2019.
[17] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[18] X. Chu, W. Ouyang, H. Li, and X. Wang. Structured feature learning for pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4715–4723, 2016.
[19] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang. Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1831–1840, 2017.
[20] H. Cooper, E.-J. Ong, N. Pugeault, and R. Bowden. Sign language recognition using sub-units. Journal of Machine Learning Research, 13(Jul):2205–2231, 2012.
[21] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. 2005.
[22] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
[23] W. Du, Y. Wang, and Y. Qiao. Rpan: An end-to-end recurrent pose-attention network for action recognition in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 3725–3734, 2017.
[24] E. Efthimiou and S.-E. Fotinea. Gslc: Creation and annotation of a greek sign language corpus for hci. In HCI, 2007.
[25] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
[26] K. Grobel and M. Assan. Isolated sign language recognition using hidden markov models. In 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, volume 1, pages 162–167. IEEE, 1997.
[27] E. Gutierrez-Sigut, B. Costello, C. Baus, and M. Carreiras. Lse-sign: A lexical database for spanish sign language. Behavior Research Methods, 48(1):123–137, 2016.
[28] J. Huang, W. Zhou, H. Li, and W. Li. Sign language recognition using 3d convolutional neural networks. In 2015 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2015.
[29] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3192–3199, 2013.
[30] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2012.
[31] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. Thumos challenge: Action recognition with a large number of classes, 2014.
[32] H. R. V. Joze and O. Koller. Ms-asl: A large-scale data set and benchmark for understanding american sign language. arXiv preprint arXiv:1812.01053, 2018.
[33] T. Kapuscinski, M. Oszust, M. Wysocki, and D. Warchol. Recognition of hand gestures observed by depth cameras. International Journal of Advanced Robotic Systems, 12(4):36, 2015.
[34] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[35] P. Kishore, G. A. Rao, E. K. Kumar, M. T. K. Kumar, and D. A. Kumar. Selfie sign language recognition with convolutional neural networks. International Journal of Intelligent Systems and Applications, 10(10):63, 2018.
[36] S.-K. Ko, C. J. Kim, H. Jung, and C. Cho. Neural sign language translation based on human keypoint estimation. Applied Sciences, 9(13):2683, 2019.
[37] S.-K. Ko, J. G. Son, and H. Jung. Sign language recognition with recurrent neural network using human keypoint detection. In Proceedings of the 2018 Conference on Research in Adaptive and Convergent Systems, pages 326–328. ACM, 2018.
[38] V. S. Kulkarni and S. Lokhande. Appearance based recognition of american sign language using gesture segmentation. International Journal on Computer Science and Engineering, 2(03):560–565, 2010.
[39] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.
[40] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. 2008.
[41] J. F. Lichtenauer, E. A. Hendriks, and M. J. Reinders. Sign language recognition by combining statistical dtw and independent classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):2040–2046, 2008.
[42] K. M. Lim, A. W. Tan, and S. C. Tan. Block-based histogram of optical flow for isolated sign language recognition. Journal of Visual Communication and Image Representation, 40:538–545, 2016.
[43] S. Liwicki and M. Everingham. Automatic recognition of fingerspelled words in british sign language. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 50–57. IEEE, 2009.
[44] J. Martinez, M. J. Black, and J. Romero. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2891–2900, 2017.
[45] C. McCaskill, C. Lucas, R. Bayley, and J. Hill. The hidden treasure of Black ASL: Its history and structure. Gallaudet University Press, Washington, DC, 2011.
[46] D. Metaxas, M. Dilsizian, and C. Neidle. Scalable asl sign recognition using model-based machine learning and linguistically annotated corpora. In 8th Workshop on the Representation & Processing of Sign Languages: Involving the Language Community, Language Resources and Evaluation Conference 2018, 2018.
[47] S. Nagarajan and T. Subashini. Static hand gesture recognition for sign language alphabets using edge oriented histogram and multi class svm. International Journal of Computer Applications, 82(4), 2013.
[48] L. Pigou, M. Van Herreweghe, and J. Dambre. Gesture and sign language recognition with temporal residual networks. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.
[49] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Poselet conditioned pictorial structures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 588–595, 2013.
[50] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[51] F. Ronchetti, F. Quiroga, C. A. Estrebou, L. C. Lanzarini, and A. Rosete. Lsa64: an argentinian sign language dataset. In XXII Congreso Argentino de Ciencias de la Computación (CACIC 2016), 2016.
[52] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[53] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[54] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional sift descriptor and its application to action recognition. In Proceedings of the 15th ACM International Conference on Multimedia, pages 357–360. ACM, 2007.
[55] H. Shin, W. J. Kim, and K.-a. Jang. Korean sign language recognition based on image and convolution neural network. In Proceedings of the 2nd International Conference on Image and Graphics Processing, pages 52–55. ACM, 2019.
[56] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[57] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[58] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[59] T. Starner, J. Weaver, and A. Pentland. Real-time american sign language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371–1375, 1998.
[60] T. E. Starner. Visual recognition of american sign language using hidden markov models. Technical report, Massachusetts Inst Of Tech Cambridge Dept Of Brain And Cognitive Sciences, 1995.
[61] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[62] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In European Conference on Computer Vision, pages 140–153. Springer, 2010.
[63] A. Tharwat, T. Gaber, A. E. Hassanien, M. K. Shahin, and B. Refaat. Sift-based arabic sign language recognition system. In Afro-European Conference for Industrial Advancement, pages 359–370. Springer, 2015.
[64] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, pages 24–27, 2014.
[65] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[66] C. Wang, Y. Wang, and A. L. Yuille. An approach to pose-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 915–922, 2013.
[67] H. Wang, A. Kläser, C. Schmid, and L. Cheng-Lin. Action recognition by dense trajectories. 2011.
[68] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. 2009.
[69] R. Wilbur and A. C. Kak. Purdue rvl-slll american sign language database. 2006.
[70] Q. Xue, X. Li, D. Wang, and W. Zhang. Deep forest-based monocular visual sign language recognition. Applied Sciences, 9(9):1945, 2019.
[71] Q. Yang. Chinese sign language recognition based on video sequence appearance modeling. In 2010 5th IEEE Conference on Industrial Electronics and Applications, pages 1537–1542. IEEE, 2010.
[72] W. Yang, W. Ouyang, H. Li, and X. Wang. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3073–3082, 2016.
[73] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR 2011, pages 1385–1392. IEEE, 2011.
[74] F. Yasir, P. C. Prasad, A. Alsadoon, and A. Elchouemi. Sift based approach on bangla sign language recognition. In 2015 IEEE 8th International Workshop on Computational Intelligence and Applications (IWCIA), pages 35–39. IEEE, 2015.
[75] Y. Ye, Y. Tian, M. Huenerfauth, and J. Liu. Recognizing american sign language gestures from within continuous videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2064–2073, 2018.
[76] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.
[77] Z. Zafrulla, H. Brashear, T. Starner, H. Hamilton, and P. Presti. American sign language recognition with the kinect. In Proceedings of the 13th International Conference on Multimodal Interfaces, pages 279–286. ACM, 2011.
[78] M. Zahedi, D. Keysers, T. Deselaers, and H. Ney. Combination of tangent distance and an image distortion model for appearance-based sign language recognition. In Joint Pattern Recognition Symposium, pages 401–408. Springer, 2005.