Figure 6. Randomized decision forests. A forest is an ensemble of T trees. Each tree consists of split nodes (blue) and leaf nodes (green). The red arrows indicate the different paths that might be taken by different trees for a particular input (I, x); tree 1 through tree T each yield a distribution P1(c), …, PT(c).
Mean shift is used to find modes in this density (Eq. (7)) efficiently. All pixels above a learned probability threshold λc are used as starting points for part c. A final confidence estimate is
given as a sum of the pixel weights reaching each mode. This
proved more reliable than taking the modal density
estimate.
The detected modes lie on the surface of the body. Each mode is therefore pushed back into the scene by a learned z offset ζc to produce a final joint position proposal. This simple, efficient approach works well in practice. The bandwidth bc, probability threshold λc, and surface-to-interior z offset ζc are optimized per part by grid search on a hold-out validation set of 5000 images. (As an indication, this resulted in a mean bandwidth of 0.065 m, a probability threshold of 0.14, and a z offset of 0.039 m.)
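As an illustration only, the following sketch runs this weighted mean shift procedure for a single part. The function name, data layout, and the mode-merging radius are our own assumptions, not details from the paper.

```python
import numpy as np

def find_part_modes(points, probs, weights, bandwidth, prob_threshold,
                    n_iters=30, tol=1e-4, merge_radius=0.05):
    """Weighted mean shift for a single body part c.
    points: (N, 3) pixels reprojected into world space (metres),
    probs: (N,) inferred P(c|I, xi), weights: (N,) pixel weights wic,
    bandwidth: learned bc, prob_threshold: learned lambda_c."""
    start_mask = probs > prob_threshold
    modes, confidences = [], []
    for x0, w0 in zip(points[start_mask], weights[start_mask]):
        x = x0.copy()
        for _ in range(n_iters):
            # Gaussian kernel of Eq. (7) around the current estimate.
            d2 = np.sum(((points - x) / bandwidth) ** 2, axis=1)
            k = weights * np.exp(-d2)
            x_new = (k[:, None] * points).sum(axis=0) / k.sum()
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        # Merge trajectories that converge to (nearly) the same mode;
        # a mode's confidence is the sum of the starting pixel
        # weights that reached it.
        for m, mode in enumerate(modes):
            if np.linalg.norm(mode - x) < merge_radius:
                confidences[m] += w0
                break
        else:
            modes.append(x)
            confidences.append(w0)
    return modes, confidences
```

Each returned mode would then still be pushed back into the scene by the learned z offset ζc to give the final joint proposal.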
Training. Each tree is trained on a different set of randomly synthesized images. A random subset of 2000 example pixels from each image is chosen to ensure a roughly even distribution across body parts. Each tree is trained using the algorithm of Lepetit et al.11 To keep training times down, we employ a distributed implementation. Training three trees to depth 20 from 1 million images takes about a day on a 1000-core cluster.
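For concreteness, here is a minimal sketch of the greedy per-node split selection that such a training algorithm performs; eval_feature, the threshold-sampling scheme, and the stopping rule are illustrative assumptions standing in for the details of Lepetit et al.11

```python
import numpy as np

def entropy(labels, n_parts):
    """Shannon entropy of the body-part label distribution."""
    p = np.bincount(labels, minlength=n_parts) / max(len(labels), 1)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def best_split(examples, labels, eval_feature, n_parts,
               n_features=2000, n_thresholds=50):
    """Pick the (feature, threshold) pair with maximum information
    gain over the part labels at one node. labels: (M,) int array;
    eval_feature(f, examples) -> (M,) responses for candidate f."""
    h_root = entropy(labels, n_parts)
    best = (-np.inf, None, None)
    for f in range(n_features):
        responses = eval_feature(f, examples)
        # Sample candidate thresholds from the observed responses.
        for t in np.random.choice(responses, size=n_thresholds):
            left = responses < t
            h_split = (left.mean() * entropy(labels[left], n_parts)
                       + (~left).mean() * entropy(labels[~left], n_parts))
            if h_root - h_split > best[0]:
                best = (h_root - h_split, f, t)
    return best  # recurse on both subsets until depth 20 or a pure node
```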
4. Experiments
In this section, we describe the experiments performed to
evaluate our method. We show both qualitative and quanti-
tative results on several challenging datasets and compare
with both nearest-neighbor approaches and the state of the
art. 8 We provide further results in the supplementary mate-
rial. Unless otherwise specified, parameters below were set
as follows: 3 trees of depth 20, 300k training images per tree, 2000 training example pixels per image, 2000 candidate features f, and 50 candidate thresholds t per feature.
Test data. We use challenging synthetic and real depth images to evaluate our approach. For our synthetic test set, we synthesize 5000 depth images, together with ground truth body part labels and joint positions. The original
mocap poses used to generate these images are held out
from the training data. Our real test set consists of 8808
frames of real depth images over 15 different subjects, hand-
labeled with dense body parts and seven upper body joint
positions. We also evaluate on the real depth data from
Ganapathi et al. 8 The results suggest that effects seen on syn-
thetic data are mirrored in the real data, and further that our
synthetic test set is by far the ‘hardest’ due to the extreme
variability in pose and body shape. For most experiments,
we limit the rotation of the user to ±120° in both training
and synthetic test data, since the user faces the camera (0°)
in our main entertainment scenario, though we also evalu-
ate the full 360° scenario.
Error metrics. We quantify both classification and
joint prediction accuracy. For classification, we report the
average per-class accuracy: the average of the diagonal of
the confusion matrix between the ground truth part label
and the most likely inferred part label. This metric
weights each body part equally despite their varying sizes,
though mislabelings on the part boundaries reduce the
absolute numbers.
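This metric can be computed as in the following sketch, assuming flattened per-pixel label arrays; it is an illustration, not the paper's evaluation code.

```python
import numpy as np

def average_per_class_accuracy(gt, pred, n_parts):
    """Average of the confusion-matrix diagonal over body parts.
    gt, pred: flattened (N,) int arrays of per-pixel part labels.
    Each row is normalized by its ground-truth pixel count, so small
    parts weigh as much as large ones."""
    conf = np.zeros((n_parts, n_parts))
    np.add.at(conf, (gt, pred), 1)   # fill the confusion matrix
    per_class = np.diag(conf) / np.maximum(conf.sum(axis=1), 1)
    return per_class.mean()
```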
For joint proposals, we generate recall-precision curves
as a function of confidence threshold. We quantify accuracy
as average precision per joint, or mean average precision
(mAP) over all joints. The first joint proposal within D
meters of the ground truth position is taken as a true positive, while other proposals also within D meters count as
false positives. This penalizes multiple spurious detections
near the correct position, which might slow a downstream tracking algorithm.
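Below is a sketch of the resulting per-joint average precision computation, assuming one visible ground-truth position per frame; the D value and data layout shown are illustrative.

```python
import numpy as np

def average_precision(proposals, gt_positions, D=0.1):
    """Average precision for a single joint.
    proposals: per-frame lists of (confidence, position) pairs,
    gt_positions: per-frame ground-truth 3D positions (arrays).
    Only the first (highest-confidence) proposal within D metres of
    the ground truth is a true positive; all remaining proposals,
    including further ones within D metres, are false positives."""
    scored = []  # (confidence, is_true_positive) across all frames
    for frame_props, gt in zip(proposals, gt_positions):
        matched = False
        for conf, pos in sorted(frame_props, key=lambda p: -p[0]):
            tp = (not matched) and np.linalg.norm(np.asarray(pos) - gt) < D
            matched = matched or tp
            scored.append((conf, tp))
    # Sweep a confidence threshold from high to low to trace the
    # recall-precision curve, then integrate it.
    scored.sort(key=lambda s: -s[0])
    tp_cum = np.cumsum([tp for _, tp in scored])
    precision = tp_cum / np.arange(1, len(scored) + 1)
    recall = tp_cum / len(gt_positions)
    return float(np.sum(precision * np.diff(np.concatenate(([0.0], recall)))))
```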
3.4. Joint position proposals
Body part recognition as described above infers per-pixel
information. This information must now be pooled across
pixels to generate reliable proposals for the positions of 3D
skeletal joints. These proposals are the final output of our
algorithm and could be used by a tracking algorithm to self-initialize and recover from failure.
A simple option is to accumulate the global 3D centers of
probability mass for each part, using the known calibrated
depth. However, outlying pixels severely degrade the quality
of such a global estimate. We consider two algorithms: a
fast algorithm based on simple bottom-up clustering and a
more accurate algorithm based on mean shift, which shall
now be described.
We employ a local mode-finding approach based on
mean shift6 with a weighted Gaussian kernel. We define a
density estimator per body part as
fc(x̂) ∝ Σi=1..N wic exp( −‖ (x̂ − x̂i) / bc ‖² ),  (7)

where x̂ is a coordinate in 3D space, N is the number of image pixels, wic is a pixel weighting, x̂i is the reprojection of image pixel xi into world space given depth dI(xi), and bc is a learned per-part bandwidth. The pixel weighting wic considers both the inferred body part probability at the pixel and the world surface area of the pixel:

wic = P(c|I, xi) · dI(xi)².
This ensures that density estimates are depth invariant, and it gives a small but significant improvement in joint prediction
accuracy. Depending on the definition of body parts, the
posterior P(c|I, x) can be pre-accumulated over a small set of
parts. For example, in our experiments the four body parts
covering the head are merged to localize the head joint.
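As a concrete reading of these definitions, the sketch below reprojects valid depth pixels into world space and forms the weights wic, pre-accumulating the posterior over a merged set of parts; the pinhole camera intrinsics and array layout are assumed, not specified in the paper.

```python
import numpy as np

def pixel_weights(depth, posterior, part_ids, fx, fy, cx, cy):
    """Compute world-space points x̂i and weights wic = P(c|I, xi) · dI(xi)²
    for one (possibly merged) body part. depth: (H, W) in metres,
    posterior: (H, W, C) per-pixel part distribution, part_ids: parts
    pre-accumulated for this joint (e.g. the four head parts)."""
    v, u = np.nonzero(depth > 0)                  # valid foreground pixels
    d = depth[v, u]
    # Pinhole back-projection of pixel (u, v) at depth d into world space.
    points = np.stack(((u - cx) * d / fx, (v - cy) * d / fy, d), axis=1)
    p = posterior[v, u][:, part_ids].sum(axis=1)  # pre-accumulated P(c|I, xi)
    return points, p, p * d**2                    # d² factor gives depth invariance
```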