dramatic differences in the contextual appearance (see
Figure 2). For training data, we generate realistic synthetic
depth images of humans of many shapes and sizes in
highly varied poses sampled from a large motion capture
database. The classifier used is a deep randomized decision forest, which is well suited to our multi-class scenario
and admits an extremely high-speed implementation. The
primary challenge imposed by this choice is the need for
large amounts of training data, easily obtained given our
use of synthetic imagery. The further challenge of building
a distributed infrastructure for decision tree training was
important to the success of our approach, but is beyond
the scope of this paper.
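As a rough illustration of the classification stage (not the engine-optimized implementation described above), the sketch below trains a small randomized decision forest to assign each depth pixel one of several body part labels, using scikit-learn's RandomForestClassifier as a stand-in. The feature vectors, the number of part classes, and the forest hyperparameters are placeholder assumptions; the actual per-pixel depth features and training scale are described later in the paper.

```python
# Minimal sketch, not the production classifier: per-pixel body part
# labeling with a randomized decision forest (scikit-learn stand-in).
# Feature vectors, the number of parts, and all hyperparameters below
# are placeholder assumptions for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in for per-pixel features and part labels harvested from
# synthetic depth renders (shapes here are arbitrary).
n_pixels, n_features, n_parts = 10_000, 32, 20
X_train = rng.normal(size=(n_pixels, n_features))
y_train = rng.integers(0, n_parts, size=n_pixels)

# Deep trees are affordable because synthetic data is plentiful; the
# tree count and depth here are chosen only to keep the sketch fast.
forest = RandomForestClassifier(n_estimators=3, max_depth=20, n_jobs=-1)
forest.fit(X_train, y_train)

# At test time every pixel receives a distribution over body parts,
# which later stages aggregate into 3D joint proposals.
X_test = rng.normal(size=(5, n_features))
part_probabilities = forest.predict_proba(X_test)  # (5, n_parts)
print(part_probabilities.shape)
```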
An optimized implementation of our algorithm runs in
under 5 ms per frame on the Xbox 360 GPU, at least one
order of magnitude faster than existing approaches. It
works frame by frame across dramatically differing body
shapes and sizes, and the learned discriminative approach
naturally handles self-occlusions and poses cropped by
the image frame. We evaluate on both real and synthetic
depth images, containing challenging poses of a varied set
of subjects. Even without exploiting temporal or kinematic constraints, the 3D joint proposals are both accurate and stable. We investigate the effect of several training
parameters and show how very deep trees can still avoid
overfitting due to the large training set. Further, results on
silhouette images suggest more general applicability of
our approach.
1.4. Contributions
Our main contribution is to treat pose estimation as object
recognition using a novel intermediate body parts representation designed to spatially localize joints of interest at low
computational cost and high accuracy. Our experiments also
carry several insights: (i) synthetic depth training data is an
excellent proxy for real data; (ii) scaling up the learning problem with varied synthetic data is important for high accuracy;
and (iii) our parts-based approach generalizes better than
even an oracular whole-image nearest neighbor algorithm.
1.5. Sensor characteristics
Before presenting our algorithm in detail, we explain how we generate training data for human pose
estimation. To do so, we first describe the characteristics of the depth sensor we employ, as those characteristics
must be replicated in the synthetic data generation.
As described above, the camera produces a 640 × 480
array of depth values, with the following characteristics; an illustrative sketch of imposing such artifacts on synthetic renders follows the list.
• Certain materials do not reflect infrared wavelengths of
light effectively, and so ‘drop out’ pixels can be common. This particularly affects hair and shiny surfaces.
• In bright sunlight, the ambient infrared can swamp the
active signal, preventing any depth inference.
• The depth range is limited by the power of the emitter,
and safety considerations result in a typical operating
range of about 4 m.
• The depth noise level ranges from a few millimeters
close up to a few centimeters for more distant pixels.
Figure 2. Example renderings focusing on one hand, showing
the range of appearances a single point on the body may exhibit.
• The sensor operates on the principle of stereo matching
between an emitter and camera, which must be offset
by some baseline. Consequently, an occlusion shadow
appears on one side of objects in the depth image.
• The occluding contours of objects are not precisely
delineated and can flicker between foreground and
background.
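Since these artifacts must be reproduced when rendering synthetic training images, one plausible way to degrade a clean synthetic depth map is sketched below. This is an assumption-laden illustration only: the dropout rate, the depth-dependent noise model, and the 4 m cutoff are guesses standing in for calibrated sensor parameters, and the occlusion shadow and flickering contours are not modeled here.

```python
# Illustrative sketch only: impose depth-camera-like artifacts on a clean
# synthetic depth map so training data better matches the real sensor.
# All constants (dropout rate, noise scale, 4 m range limit) are assumed
# for illustration, not measured sensor parameters.
import numpy as np

def degrade_depth(depth_m: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """depth_m: (H, W) depth in meters; 0 marks invalid/background pixels."""
    noisy = depth_m.copy()
    valid = noisy > 0

    # Depth-dependent noise: millimeters near the camera, centimeters far away.
    sigma = 0.002 + 0.005 * noisy            # hypothetical noise model
    noisy[valid] += rng.normal(0.0, sigma[valid])

    # Random 'drop out' pixels (e.g., hair or shiny surfaces reflecting poorly).
    dropout = rng.random(noisy.shape) < 0.02  # assumed 2% dropout
    noisy[dropout] = 0.0

    # Limited operating range of roughly 4 m.
    noisy[noisy > 4.0] = 0.0
    return noisy

rng = np.random.default_rng(1)
clean = rng.uniform(1.0, 5.0, size=(480, 640))  # stand-in synthetic render
degraded = degrade_depth(clean, rng)
print(np.mean(degraded == 0.0))  # fraction of invalid pixels after degradation
```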
2. Training Data
Pose estimation research has often focused on techniques to overcome lack of training data,13 because of two
problems. First, generating realistic intensity images
using computer graphics techniques14,15,19 is hampered
by the huge color and texture variability induced by clothing, hair, and skin, often meaning that the data is reduced
to 2D silhouettes.1 Although depth cameras significantly
reduce this difficulty, considerable variation in body and
clothing shape remains. The second limitation is that synthetic body pose images are of necessity fed by motion-capture (‘mocap’) data, which is expensive and
time-consuming to obtain. Although techniques exist to
simulate human motion (e.g., Sidenbladh et al.23), they do
not yet produce the range of volitional motions of a
human subject.
2.1. Motion capture data
The human body is capable of an enormous range of poses,
which are difficult to simulate. Instead, we capture a large
database of motion capture of human actions. Our aim was
to span the wide variety of poses people would make in an
entertainment scenario. The database consists of approximately 500,000 frames in a few hundred sequences including actions such as driving, dancing, kicking, running, and
navigating menus.
We expect our semi-local body part classifier to
generalize somewhat to unseen poses. In particular, we need not
record all possible combinations of the different limbs; in
practice, a wide range of poses proves sufficient. Further, we
need not record mocap with variation in rotation about the