James Steven Supančič III et al.
Fig. 3 Our new test data challenges methods with clutter (a), object manipulation (b), low resolution (c), and various viewpoints (d). We collected data in diverse environments (8 offices, 4 homes, 4 public spaces, 2 vehicles, and 2 outdoor locations) using time-of-flight (Intel/Creative Gesture Camera) and structured-light (ASUS Xtion Pro) depth cameras. Ten subjects (3 female and 7 male) were given prompts to perform natural interactions with objects in the environment, as well as to display 24 random and 24 canonical poses.
This is important because any practical application must handle diverse subjects, scenes, and clutter.
3 Training Data
Here we discuss various approaches for generating training data (see Table 2). Real annotated training data has long been the gold standard for supervised learning. However, the generally accepted wisdom (for hand pose estimation) is that the space of poses is too large to annotate manually. This motivates approaches that leverage synthetically generated training data, discussed further below.
Real data + manual annotation: Arguably, the space of hand poses exceeds what can be sampled with real data. Our experiments identify a second problem: perhaps surprisingly, human annotators often disagree on pose annotations. For example, in our testset, human annotators disagree on 20% of pose annotations (considering a 20mm threshold), as plotted in Fig. 19. These disagreements arise from limitations in the raw sensor data, due either to poor resolution or to occlusions. We found that low resolution consistently corresponds to annotation ambiguities across test sets (see Sec. 5.2 for further discussion and examples). These ambiguities are often mitigated by placing the hand close to the camera (Xu and Cheng 2013, Tang et al. 2014, Qian et al. 2014, Oberweger et al. 2016). As an illustrative example, we evaluate the ICL training set (Tang et al. 2014).

Dataset        Generation             Viewpoint    Views  Size        Subj.
ICL [59]       Real + manual annot.   3rd Pers.    1      331,000     10
NYU [63]       Real + auto annot.     3rd Pers.    3      72,757      1
HandNet [69]   Real + auto annot.     3rd Pers.    1      12,773      10
UCI-EGO [44]   Synthetic              Egocentric   1      10,000      1
libhand [67]   Synthetic              Generic      1      25,000,000  1

Table 2 Training data sets: We broadly categorize training datasets by the method used to generate the data and annotations: real data + manual annotations, real data + automatic annotations, or synthetic data (with automatic annotations). Most existing datasets are viewpoint-specific (tuned for 3rd-person or egocentric recognition) and limited in size to tens of thousands of examples. NYU is unique in that it is a multiview dataset collected with multiple cameras, while ICL contains shape variation due to multiple (10) subjects. To explore the effect of training data, we use the public libhand animation package to generate a massive training set of 25 million examples.
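The disagreement criterion above can be made concrete: two annotations of the same frame count as a disagreement when any corresponding joint pair lies farther apart than the threshold. A minimal sketch (function and array names are our own, not from any released codebase):

```python
import numpy as np

def disagreement_rate(annot_a, annot_b, threshold_mm=20.0):
    """Fraction of frames on which two annotators disagree.

    annot_a, annot_b: arrays of shape (num_frames, num_joints, 3)
    holding 3D joint positions in millimeters from two annotators.
    A frame counts as a disagreement if any joint differs by more
    than threshold_mm.
    """
    # Per-joint Euclidean distances, shape (num_frames, num_joints).
    dists = np.linalg.norm(annot_a - annot_b, axis=-1)
    # A frame disagrees if its worst joint exceeds the threshold.
    frame_disagrees = dists.max(axis=-1) > threshold_mm
    return frame_disagrees.mean()
```

Under this convention, a single badly placed fingertip marks the whole frame as ambiguous, which matches the per-pose (rather than per-joint) disagreement rate quoted above.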
Real data + automatic annotation: Data gloves directly obtain automatic pose annotations for real data (Xu and Cheng 2013). However, they require painstaking per-user calibration. Magnetic markers can partially alleviate calibration difficulties (Wetzler et al. 2015) but still distort the hand shape that is observed in the depth map. When evaluating depth-only systems, colored markers can provide ground truth through the RGB channel (Sharp et al. 2015). Alternatively, one could use a "passive" motion capture system. We evaluate the larger NYU training set (Tompson et al. 2014), which annotates real data by fitting (offline) a skinned 3D hand model to high-quality 3D measurements. Finally, integrating model fitting with tracking lets one leverage a small set of annotated reference frames to annotate an entire video (Oberweger et al. 2016).
Quasi-synthetic data: Augmenting real data with geometric computer graphics models provides an attractive solution. For example, one can apply geometric transformations (e.g., rotations) to both real data and its annotations (Tang et al. 2014). If multiple depth cameras are used to collect real data (that is then registered to a model), one can synthesize a larger set of varied viewpoints (Sridhar et al. 2015, Tompson et al. 2014). Finally, mimicking the noise and artifacts of real data is often important when using synthetic data. Domain transfer methods (D. Tang and Kim 2013) learn the relationship between a small real dataset and a large synthetic one.
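The transformation-based augmentation described above hinges on applying the same rigid motion to the observed geometry and to its labels. A minimal sketch, assuming camera-space 3D points recovered from a depth map (names and the single-axis setup are illustrative, not the cited authors' code):

```python
import numpy as np

def rotate_example(points, joints, angle_rad):
    """Apply one in-plane rotation to a depth point cloud and its labels.

    points: (N, 3) camera-space 3D points from a depth map (mm).
    joints: (J, 3) annotated 3D joint positions (mm).
    Rotating both by the same matrix keeps the annotation valid, so
    one labeled frame yields many augmented training examples.
    """
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    # Rotation about the camera's optical (z) axis.
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return points @ R.T, joints @ R.T
```

In-plane rotations are the safe case: they expose no previously occluded surface, so the synthesized depth map remains physically plausible, whereas out-of-plane rotations generally require a registered model to fill in unseen geometry.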
Synthetic data: Another option is to use data rendered by a computer graphics system. Graphical synthesis