Large-Scale Multiview 3D Hand Pose Dataset
Abstract
Accurate hand pose estimation at joint level has several uses in human-robot
interaction, user interfacing and virtual reality applications, yet it remains
an open problem. Novel deep learning techniques could bring great improvements
in this respect, but they require enormous amounts of annotated data. The hand
pose datasets released so far are hard to use with deep learning methods, as
they present issues such as a limited number of samples, annotations at a high
level of abstraction, or samples consisting only of depth maps.
In this work, we introduce a multiview hand pose dataset in which we
provide color images of hands and different kinds of annotations for each of
them, i.e. the bounding box and the 2D and 3D locations of the joints in the hand.
Furthermore, we introduce a simple yet accurate deep learning architecture
for real-time, robust 2D hand pose estimation.
1. Introduction
gesture recognition, so they did not provide hand joint level annotations, but
a class for each sample. Then, with the emergence of low-cost range sensors,
datasets began to set aside color information and provide depth maps
instead. Nonetheless, most deep learning approaches rely on color information
to tackle classification and regression problems.
This paper presents a novel hand pose dataset. It provides images of hands
from multiple points of view. The ground truth is composed of different
kinds of data. For each image, the hand location and pose are provided: its
bounding box, the X and Y image coordinates of each hand joint, and the
X, Y and Z real-world coordinates of each hand joint. The dataset contains
images of several individuals at different instants in time. There are even
samples captured under special conditions, which will challenge the
generalization capabilities of the algorithms that use this dataset.
Training and test splits are also provided.
The rest of the paper is organized as follows. Section 2 reviews state-of-
the-art works related to existing hand pose datasets. Next, in Section
3, the details of the capture device used to compose the dataset are
given. Then, Section 4 explains relevant details of the dataset itself. Section
5 describes a system for hand pose regression trained on this dataset, intended
to be used as a baseline by other hand pose estimation proposals. Section 6
presents the main conclusions of this work and suggests some lines of future
work.
2. Related Works
color images are available at 160 × 120 pixels. The hand postures are
selected in such a way that the inter-class variation in the appearance of the
postures is minimized, which makes the recognition task challenging; however,
the background of the images in this dataset is uniform.
Again, this dataset does not provide annotations of the joints of the hand,
but a class for each sample, so it is not intended for continuous
hand pose estimation at joint level. Moreover, the background is uniform,
which is an undesirable feature for learning algorithms.
The dataset for hand gesture recognition introduced in [6] contains ges-
tures from Polish Sign Language and American Sign Language. In addition,
some special signs were included. The database consists of three series of ges-
tures which include the following data: original RGB images at different
resolutions but with high image quality, ground-truth binary skin presence
masks, and hand feature point locations. In total, this dataset contains over
1600 samples.
Despite its high resolution and high-quality ground truth, this dataset
contains an insufficient number of samples. There is a further and bigger draw-
back: the background is the same across all samples. This might
harm the generalization capabilities of the learning system, as in a real-world
scenario the background is constantly changing, and it is clearly not always grey
with a great deal of contrast between the hand and the background.
Another dataset was released by Molina et al. [7]. This dataset is com-
posed of natural and synthetic depth maps. Each depth map represents a
segmented hand in a certain pose. The annotations include the 2D position
of each joint in the depth map coordinate frame, and the depth value at that
point. It includes 65 natural samples divided into 6 different dictionaries
according to their content: occlusion level, pose-based, motion-based, or
compound. The synthetic data was generated by taking a natural seed and
performing up to 200 random rotations.
This is a highly representative hand pose dataset, but it has a major draw-
back: it only provides depth maps, which forces whoever wants to use it
to acquire a 3D sensor. The number of samples is also insufficient for deep
learning algorithms, and the synthetic samples do not capture the nature
of real-world data.
Following the depth trend spurred by the emergence of low-cost range
sensors, the NYU Hand Pose dataset [8] was created. This dataset contains
8252 test-set and 72757 training-set frames of captured RGBD data with
ground-truth hand-pose information. For each frame, the RGBD data from
3 Kinects is provided: a frontal view and 2 side views. The training set
contains samples from a single user only, whilst the test set contains samples
from two users. A synthetic re-creation (rendering) of the hand pose is also
provided for each view.
The main drawback of this dataset is that the RGB images provided are
projections of the point clouds obtained by a 3D sensor. This causes the 2D
color images to be rendered in low quality, with many unknown regions that
fall out of range of the 3D sensor. Thus, a 3D sensor is also required in order
to use this dataset.
Another depth-based dataset is proposed in [9]. This dataset contains
sequences of 14 hand gestures performed in two ways: using one finger and
using the whole hand. Each gesture is performed between 1 and 10 times by 28
participants, resulting in 2800 sequences. Sequences are labelled according to
their gesture, the number of fingers used, the performer and the trial. Each
frame of a sequence contains a depth image and the coordinates of 22 joints both
in the 2D depth image space and in the 3D world space, forming a full hand
skeleton. Once again, this dataset does not include the corresponding RGB
images, so a 3D sensor is required in order to use it.
In view of this review of the state of the art, we consider that there is a
need for a large-scale hand pose dataset composed of color images and their
corresponding 2D and 3D annotations.
3. Capture Device
Figure 1: We created a custom structure in order to generate the dataset. This device
holds the color cameras and the Leap Motion Controller, and allows us to quickly generate
large amounts of ground truth data.
The Leap Motion Controller is a device that accurately captures the pose
of a person’s hands. It provides sub-millimeter hand tracking at 200 Hz. This
device supplies, among other data, the 3D positions of the joints of a hand.
This information is used to automatically annotate the positions
of the joints in the color images. The device is attached at the bottom of the
structure, facing upwards. This configuration creates an overlapping space
that is captured by both the cameras and the Leap Motion Controller.
This setup can capture an enormous amount of accurate ground truth
with very little effort.
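As an illustration, a minimal sketch of how per-frame joint data might be read from the device is shown below. It assumes the Leap Motion SDK v2 Python bindings; module and attribute names vary between SDK versions and should be taken as assumptions, not as the exact acquisition code used for the dataset.

# Minimal sketch (assumed Leap Motion SDK v2 Python bindings): reading the 3D
# joint data that is later projected onto the color images. Not the exact
# acquisition code used for the dataset.
import Leap

controller = Leap.Controller()

def read_hand():
    """Return palm position, palm normal and fingertip positions (in mm)."""
    frame = controller.frame()
    if frame.hands.is_empty:
        return None
    hand = frame.hands[0]
    fingertips = []
    for finger in hand.fingers:
        # The end of the distal bone corresponds to the fingertip.
        fingertips.append(finger.bone(Leap.Bone.TYPE_DISTAL).next_joint)
    return {
        "palm_position": hand.palm_position,  # Leap.Vector, millimeters
        "palm_normal": hand.palm_normal,
        "fingertips": fingertips,
    }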
4. Dataset Description
We provide a 3D point for each knuckle of each finger, plus the fingertips.
Additionally, we also provide the palm position and the normal of the palm.
Table 1: Joints of the hand present in the dataset. The Joint Number columns reference
the numerical label shown in Figure 3.
tortion). To deal with this problem, and to provide the best ground truth
possible, an undistortion process was carried out for each image as de-
scribed in [10]. No further preprocessing was applied to the images.
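As a reference, this step can be reproduced with standard tools; the following is a minimal sketch using OpenCV, with placeholder intrinsics and distortion coefficients standing in for the values estimated by the calibration procedure of [10].

# Minimal sketch: undistorting a captured color image with OpenCV. The camera
# matrix and distortion coefficients below are placeholders; the real values
# come from the calibration described in [10].
import cv2
import numpy as np

K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])           # intrinsic matrix (placeholder)
dist = np.array([-0.2, 0.05, 0.0, 0.0, 0.0])    # k1, k2, p1, p2, k3 (placeholder)

img = cv2.imread("frame_cam1.png")               # hypothetical file name
undistorted = cv2.undistort(img, K, dist)
cv2.imwrite("frame_cam1_undistorted.png", undistorted)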
As stated before, the dataset provides four color images for each frame.
These four images picture a hand at the same time instant but from different
perspectives.
given a set of n 3D points in the Leap Motion reference frame and their cor-
responding 2D image projections, as well as the intrinsic camera parameters
K, we determine the six degree-of-freedom (6DoF) pose of the camera in
the form of its rotation R and translation T with respect to the Leap Motion
coordinate frame. We use the Levenberg-Marquardt optimization algorithm
to minimize the final reprojection error. Figure 3 illustrates the reprojection
of the 3D points after the calibration procedure.
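This calibration can be sketched with OpenCV's PnP solver, whose iterative mode refines the pose with Levenberg-Marquardt by minimizing the reprojection error; the point correspondences and intrinsics below are placeholders, not values from the dataset.

# Minimal sketch: extrinsic calibration by solving the Perspective-n-Point
# problem. The 3D points live in the Leap Motion reference frame and the 2D
# points are their projections in the image; values here are placeholders.
import cv2
import numpy as np

object_points = np.random.rand(20, 3).astype(np.float32) * 100.0  # Leap frame, mm
image_points = np.random.rand(20, 2).astype(np.float32) * 640.0   # pixels

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])       # intrinsic parameters (placeholder)
dist = np.zeros(5)                     # images assumed already undistorted

# SOLVEPNP_ITERATIVE refines (R, T) with Levenberg-Marquardt.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)
R, _ = cv2.Rodrigues(rvec)             # 3x3 rotation matrix

# Reproject the 3D points to check the calibration quality (cf. Figure 3).
reprojected, _ = cv2.projectPoints(object_points, rvec, tvec, K, dist)
mean_error = np.linalg.norm(reprojected.reshape(-1, 2) - image_points, axis=1).mean()
print("Mean reprojection error: %.2f px" % mean_error)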
Figure 2: Calibration setup used for extrinsic parameter estimation by solving the
Perspective-n-Point problem.
Figure 3: Once the devices are calibrated and the transformation matrix computed, new
examples can be annotated in an automatic fashion in order to generate reliable ground
truth.
Finally, we also provide the exact location of the hand in the scene for each
sample through its bounding box. The bounding box is computed by projecting
the 3D points to the camera coordinate frame and then extracting the maximum
and minimum values of the X and Y coordinates. A fixed offset of 30 pixels
is added in each direction so that the whole hand is included. Since there
are 4 color images per frame, the dataset provides 4 different bounding boxes,
one per color image.
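A minimal sketch of this computation, assuming the joints have already been projected to pixel coordinates (the clamping to the image borders is an added safety step, not stated above):

# Minimal sketch: bounding box from the projected 2D joints, with the fixed
# 30-pixel offset described above and clamping to the image borders.
import numpy as np

def hand_bounding_box(joints_2d, image_width, image_height, offset=30):
    x_min, y_min = joints_2d.min(axis=0) - offset
    x_max, y_max = joints_2d.max(axis=0) + offset
    x_min, y_min = max(int(x_min), 0), max(int(y_min), 0)
    x_max = min(int(x_max), image_width - 1)
    y_max = min(int(y_max), image_height - 1)
    return x_min, y_min, x_max, y_max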
contrast to gesture-based datasets in which only some fixed poses are considered.
In addition, some sequences were carried out under special conditions: there
are two sequences in which the subject wore different gloves, and a sequence
in which the subject wore a mask. Finally, it is worth remarking that all the
subjects were right-handed.
In total, we provide over 20,500 different frames distributed in 21 se-
quences. For each frame, 4 color images and 9 different annotations are
provided, so the dataset comprises over 80,000 color images and over 184,500
annotations.
In order to generate the training and test splits, we shuffled all the samples
and took 20% of them for the test split and the remaining 80% for the
training split.
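The splitting strategy amounts to the following sketch; the frame identifiers and the random seed are illustrative, since the released splits are fixed in the dataset files.

# Minimal sketch: shuffled 80/20 train/test split over the frames. The frame
# identifiers and the random seed are placeholders, not those of the release.
import random

frame_ids = list(range(20500))
random.seed(0)
random.shuffle(frame_ids)

n_test = int(0.2 * len(frame_ids))
test_ids = frame_ids[:n_test]
train_ids = frame_ids[n_test:]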
Figure 4: Some images extracted from the proposed dataset. Notice the high variability
of the samples.
5. Baseline - 2D Joints Estimation
Figure 5: Overall pipeline of the proposed method. First, an image is captured using
a regular web camera. This image is forwarded to a hand detector in order to obtain
the bounding box of the hand. Then, the hand is cropped and passed to the hand pose
regressor to finally estimate the position of the hand joints.
Notice that there are two critical stages in this pipeline: hand detection
(bounding box estimation) and hand pose regression. These
subsystems are detailed in the following subsections.
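Before detailing each stage, the following minimal sketch shows how the two stages are chained; detector and regressor are hypothetical wrappers around the trained networks described in the next subsections, not an actual API of our release.

# Minimal sketch of the two-stage pipeline: detect the hand, crop it, and
# regress the 2D joints on the crop. `detector` and `regressor` are
# hypothetical wrappers around the trained networks.
import cv2
import numpy as np

def estimate_hand_pose(frame, detector, regressor, input_size=224):
    # 1) Hand detection: bounding box in frame coordinates.
    x_min, y_min, x_max, y_max = detector.detect(frame)

    # 2) Crop the hand and resize it to the regressor input resolution.
    crop = cv2.resize(frame[y_min:y_max, x_min:x_max], (input_size, input_size))

    # 3) Joint regression: 40 values in [0, 1], i.e. (x, y) for 20 joints.
    joints = regressor.predict(crop).reshape(20, 2)

    # 4) Map the normalized joints back to frame coordinates.
    joints[:, 0] = x_min + joints[:, 0] * (x_max - x_min)
    joints[:, 1] = y_min + joints[:, 1] * (y_max - y_min)
    return joints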
Finally, it is worth noting that this pipeline provides highly accurate hand
pose estimation while maintaining a reasonable computational cost. These
features allow the system to run under real-time constraints. Further results
and experimentation are detailed in Section 5.3.
Figure 6: The Faster R-CNN network is able to accurately estimate the bounding box of
a certain object; in our case, the network was trained to detect and localize hands.
overfitting to a specific environment that would produce a lack of generaliza-
tion.
Figure 6 shows estimated bounding boxes over some test images once
the Faster R-CNN network was trained.
neurons of the last fully connected layer was adjusted to fit our problem.
As there are 20 joints to locate, each composed of an X and a Y value,
the number of output neurons was set to 40. We also changed the default
softmax activation function for a linear one.
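As an illustration, a regression head of this kind can be written as follows in Keras (shown in current tf.keras syntax rather than the Keras 1.2 used for the experiments); the small convolutional backbone is a placeholder, not the architecture used in this work.

# Minimal sketch of a joint-regression head: 40 linear outputs, i.e. (x, y)
# for each of the 20 joints. The convolutional backbone below is a placeholder
# standing in for the actual architecture described in the text.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(256, activation="relu"),
    Dense(40, activation="linear"),  # linear instead of softmax for regression
])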
This architecture was trained from scratch using the training split of the
proposed dataset. First, the hands were cropped using the bounding boxes,
and the joint annotations were remapped accordingly and normalized to the
range between 0 and 1. The network was trained on the resulting dataset
for 200 iterations, reaching a training loss of 0.0000607571 and a test loss of
0.00031547. The test loss was used as an early stopping criterion. The optimizer
of choice was Adam with a learning rate of 0.001. The loss function we used
was the mean squared error.
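Continuing the model sketched above, the target normalization and training configuration would look roughly as follows (Adam with learning rate 0.001 and a mean squared error loss, as stated; data loading and the exact iteration schedule are omitted).

# Minimal sketch: normalize the joint targets to [0, 1] within the hand crop
# and train with Adam (lr = 0.001) and a mean squared error loss, as described.
import numpy as np
from tensorflow.keras.optimizers import Adam

def normalize_joints(joints_2d, bbox):
    """Map (20, 2) joint pixel coordinates into [0, 1] w.r.t. the crop."""
    x_min, y_min, x_max, y_max = bbox
    out = np.empty_like(joints_2d, dtype=np.float32)
    out[:, 0] = (joints_2d[:, 0] - x_min) / float(x_max - x_min)
    out[:, 1] = (joints_2d[:, 1] - y_min) / float(y_max - y_min)
    return out.reshape(-1)  # 40-dimensional regression target

model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")
# model.fit(X_train, y_train, validation_data=(X_test, y_test), ...)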
5.3. Experimentation
frame. However, with a naive motion estimation technique, bounding boxes
could be tracked along the sequence and the hand position would only be
recomputed when the system detects large displacements, thus reducing the
computation time.
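One possible realization of this heuristic is sketched below; the displacement measure and the threshold value are assumptions and are not specified in this work.

# Minimal sketch of the bounding-box tracking heuristic described above: the
# detector is only re-run when the hand appears to have moved far enough,
# measured here by the displacement of the joint centroid between frames.
# Both the displacement measure and the threshold are assumptions.
import numpy as np

class TrackedHandDetector:
    def __init__(self, detector, threshold_px=20.0):
        self.detector = detector          # hypothetical full detector wrapper
        self.threshold_px = threshold_px
        self.bbox = None
        self.centroid = None

    def detect(self, frame, previous_joints=None):
        centroid = None if previous_joints is None else previous_joints.mean(axis=0)
        moved = (self.centroid is not None and centroid is not None and
                 np.linalg.norm(centroid - self.centroid) > self.threshold_px)
        if self.bbox is None or moved:
            self.bbox = self.detector.detect(frame)   # recompute the bounding box
        if centroid is not None:
            self.centroid = centroid
        return self.bbox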
The frameworks of choice were Caffe 0.14-RC3 and Keras 1.2.0 with Ten-
sorFlow 0.12 as the backend, running on Ubuntu 14.04.02. Both were
compiled using CMake 2.8.7, g++ 4.8.2, CUDA 8.0, and cuDNN v5.0.
As described earlier, there are two trainable subsystems in our pipeline.
Both the hand detector and the hand pose estimator were trained on our
dataset.
In order to validate our baseline approach, the full pipeline was run on
the test split and achieved a validation mean absolute error of
0.06939% over the input hand image (224 × 224), which corresponds to an
error of 10 pixels.
Figure 7 shows the system performance for several hand poses. The last image
shows a failure case caused by an incorrect hand detection. As this approach is
a two-stage system with dependent stages, if the first stage fails to detect the
whole hand, the second stage tries to regress the joints on incorrect data, so it
is very likely to fail.
6. Conclusions
Figure 7: Our proposed system is able to accurately estimate hand joint positions for a
variety of hand poses. Nonetheless, it presents some failure cases, as shown in the last
image.
1 http://www.rovit.ua.es/dataset/mhpdataset/
Acknowledgements
References
[5] P. K. Pisharady, P. Vadakkepat, L. A. Poh, Hand Posture and Face
Recognition Using Fuzzy-Rough Approach, Springer Singapore, Singa-
pore, 2014, pp. 63–80.
[11] Z. Zhang, A flexible new technique for camera calibration, IEEE Trans.
Pattern Anal. Mach. Intell. 22 (11) (2000) 1330–1334. doi:10.1109/34.888718.
[13] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image
recognition, in: The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2016.