vertical axis, mirroring left–right, scene position, body
shape and size, or camera pose, all of which can be added
post hoc.
Since the classifier uses no temporal information, we are
interested only in static poses and not motion. Often,
changes in pose from one mocap frame to the next are so
small as to be insignificant. We thus discard many similar,
redundant poses from the initial mocap data using ‘furthest
neighbor’ clustering10 where the distance between poses
$q_1$ and $q_2$ is defined as $\max_j \lVert q_{1,j} - q_{2,j} \rVert_2$, the maximum
Euclidean distance over body joints $j$. We use a subset of
100,000 poses such that no two poses are closer than 5 cm.
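As a rough sketch of this subsampling step (not the authors' implementation), a greedy furthest-point selection under the max-over-joints distance yields a subset in which no two retained poses are closer than the 5 cm threshold. The function names, and the assumption that each pose is stored as a [num_joints, 3] array of joint positions, are ours; only the 100,000-pose budget and the 5 cm threshold come from the text above.

```python
import numpy as np

def pose_distance(p, q):
    """Distance between two poses stored as [num_joints, 3] arrays:
    the maximum over body joints of the Euclidean joint distance."""
    return np.max(np.linalg.norm(p - q, axis=1))

def furthest_point_subsample(poses, max_poses=100_000, min_dist=0.05):
    """Greedy furthest-point selection: repeatedly add the pose furthest
    from the current subset, stopping when max_poses are kept or every
    remaining pose lies within min_dist (5 cm) of the subset."""
    selected = [0]  # start from an arbitrary pose
    # Distance from every pose to its nearest selected pose so far.
    nearest = np.array([pose_distance(poses[0], p) for p in poses])
    while len(selected) < max_poses:
        i = int(np.argmax(nearest))
        if nearest[i] < min_dist:   # all remaining poses are already covered
            break
        selected.append(i)
        nearest = np.minimum(nearest,
                             [pose_distance(poses[i], p) for p in poses])
    return [poses[i] for i in selected]
```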
We have found it necessary to iterate the process of
motion capture, sampling from our model, training the
classifier, and testing joint prediction accuracy in order to
refine the mocap database with regions of pose space that
had previously been missed. Our early experiments
employed the CMU mocap database,5 which gave acceptable results though it covered far less of the pose space.
2.2. Generating synthetic data
We have built a randomized rendering pipeline from which
we can sample fully labeled training images. Our goals in
building this pipeline were twofold: realism and variety.
For the learned model to work well, the samples must
closely resemble real camera images and contain good coverage of the appearance variations we hope to recognize at
test time. While depth/scale and translation variations are
handled explicitly in our features (see below), other invariances cannot be encoded efficiently. Instead, we learn
invariances—to camera pose, body pose, and body size and
shape—from the data.
The synthesis pipeline first randomly samples a pose
from the mocap database, and then uses standard computer graphics techniques to render depth and (see below)
body parts images from texture-mapped 3D meshes. The
pose is retargeted to each of 15 base meshes (see Figure 3)
spanning the range of body shapes and sizes. Further, slight
random variation in height and weight gives extra coverage
of body shapes. Other randomized parameters include
camera pose, camera noise, clothing, and hairstyle. Figure
4 compares the varied output of the pipeline to hand-labeled real camera images.
In detail, the variations are as follows:
Base character. We use 3D models of 15 varied base characters, both male and female, from child to adult, short to
tall, and thin to fat. Some examples are shown in Figure 3
(top row). A given render will pick uniformly at random from
the characters.
Pose. Having discarded redundant poses from the mocap
data, we retarget the remaining poses to each base character
and choose uniformly at random. The pose is also mirrored
left–right with probability ½ to prevent a left or right bias.
Rotation and translation. The character is rotated about
the vertical axis and translated in the scene, uniformly at
random.
Hair and clothing. We add mesh models of several hairstyles and items of clothing chosen at random; some examples are shown in Figure 3 (bottom row).
Figure 3. Renders of several base character models. Top row: bare
models. Bottom row: with random addition of hair and clothing.
Weight and height variation. The base characters already
have a wide variety of weights and heights. To add further
variety, we add an extra variation in height (vertical scale
±10%) and weight (overall scale ±10%).
Camera position and orientation. The camera height,
pitch, and roll are chosen uniformly at random within a
range believed to be representative of an entertainment scenario in a home living room.
Camera noise. Real depth cameras exhibit noise. We
distort the clean computer graphics renders with dropped-out pixels, depth shadows, spot noise, and disparity quantization to match the camera output as closely as possible. In
practice, however, we found that this noise addition had little effect on accuracy, perhaps due to the quality of the cameras or to the more important appearance variations caused by
other factors such as pose.
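The paper does not specify the exact noise model, but as a hedged illustration, one simple way to apply such distortions to a clean depth render might look like the sketch below. The dropout rate, spot-noise standard deviation, and quantization step are placeholder assumptions, and depth shadows (which depend on the sensor geometry) are omitted.

```python
import numpy as np

def add_synthetic_depth_noise(depth_m, rng,
                              dropout_rate=0.02,   # assumed fraction of dropped pixels
                              spot_sigma_m=0.01,   # assumed spot-noise std dev (metres)
                              disp_step=0.002):    # assumed inverse-depth quantization step
    """Distort a clean synthetic depth map (in metres) with dropped-out
    pixels, additive spot noise, and disparity-style quantization."""
    noisy = depth_m.astype(np.float64).copy()

    # Dropped-out pixels: mark a random subset as invalid (depth = 0).
    dropped = rng.random(noisy.shape) < dropout_rate
    noisy[dropped] = 0.0
    valid = noisy > 0

    # Spot noise: small Gaussian jitter on the valid depths.
    noisy[valid] += rng.normal(0.0, spot_sigma_m, size=valid.sum())

    # Disparity quantization: quantize inverse depth, mimicking the
    # discrete disparity levels of a structured-light depth sensor.
    inv = 1.0 / np.clip(noisy[valid], 0.1, None)   # clamp to avoid divide-by-zero
    inv = np.round(inv / disp_step) * disp_step
    noisy[valid] = 1.0 / np.maximum(inv, disp_step)
    return noisy
```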
We use a standard graphics rendering pipeline to generate
the scenes, each consisting of a depth image paired with its
body parts label image. Examples are given in Figure 4.
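Taken together, the randomized sampling of one render's parameters can be summarized roughly as in the sketch below. The 15 base characters, the ±10% height/weight scales, and the ½ mirroring probability come from the description above; the rotation, translation, camera, and mesh-count ranges are illustrative assumptions, and the returned dictionary would then drive the (not shown) retargeting and rendering steps.

```python
import random

def sample_render_parameters(num_characters=15, num_poses=100_000,
                             rng=random):
    """Sample the randomized parameters for one synthetic render."""
    return {
        "character": rng.randrange(num_characters),     # base character, uniform
        "pose": rng.randrange(num_poses),                # index into clustered mocap poses
        "mirror_lr": rng.random() < 0.5,                 # mirror left-right with probability 1/2
        "rotation_y_deg": rng.uniform(0.0, 360.0),       # rotation about the vertical axis
        "translation_xz_m": (rng.uniform(-2.0, 2.0),     # scene position (assumed range, metres)
                             rng.uniform(1.0, 4.0)),
        "height_scale": rng.uniform(0.9, 1.1),           # ±10% vertical scale
        "weight_scale": rng.uniform(0.9, 1.1),           # ±10% overall scale
        "camera_height_m": rng.uniform(0.8, 2.0),        # assumed living-room camera range
        "camera_pitch_deg": rng.uniform(-20.0, 20.0),
        "camera_roll_deg": rng.uniform(-5.0, 5.0),
        "hairstyle": rng.randrange(4),                   # index into hairstyle meshes (count assumed)
        "clothing": rng.randrange(4),                    # index into clothing meshes (count assumed)
    }
```

A reproducible draw would be `sample_render_parameters(rng=random.Random(0))`.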
3. BODY PART INFERENCE AND JOINT PROPOSALS
In this section, we describe our intermediate body parts
representation, detail the discriminative depth image features, review decision forests and their application to body
part recognition, and finally discuss how a mode finding
algorithm is used to generate joint position proposals.
3.1. Body part labeling
A key innovation of this work is the form of our intermediate body parts representation. We define several localized
body part labels that densely cover the body, as color-coded in Figure 4. The parts are defined by assigning a
label to each triangle of the mesh used for rendering of
the synthetic data. Because each model is in vertex-to-vertex correspondence, each triangle is associated with the
same part of the body in each rendered image. The precise definitions of the body parts are somewhat arbitrary:
the number of parts was chosen at 31 after some initial
experimentation with smaller numbers, and it is