tracking algorithm. Any joint proposals outside D meters
also count as false positives. Note that all proposals (not
just the most confident) are counted in this metric. Joints
invisible in the image are not penalized as false negatives.
Although final applications may well require these joints, it
is assumed that their prediction is more the task of the
sequential tracker of Eq. (3). We set D = 0.1 m below, approximately the accuracy of the hand-labeled real test data
ground truth. The strong correlation of classification and
joint prediction accuracy (the blue curves in Figures 8(a)
and 10(a)) suggests that the trends observed below for one
also apply for the other.
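To make this metric concrete, the following Python sketch scores joint proposals against ground truth under a distance threshold D. It assumes, consistent with the description above, that the first sufficiently confident proposal within D meters of the ground truth is taken as the true positive and that further proposals count as false positives; the data layout and function name are our own and not taken from the paper's code.

import numpy as np

def proposal_precision(proposals, ground_truth, D=0.1):
    # proposals:    dict joint -> list of (confidence, 3D position); ALL
    #               proposals are scored, not just the most confident one.
    # ground_truth: dict joint -> 3D position, or None if the joint is
    #               invisible in the image (then it is simply not scored,
    #               i.e. no false negative is charged).
    tp, fp = 0, 0
    for joint, props in proposals.items():
        gt = ground_truth.get(joint)
        if gt is None:
            continue  # invisible joint: not penalized (assumption: its proposals are ignored)
        matched = False
        for conf, pos in sorted(props, key=lambda cp: -cp[0]):
            close = np.linalg.norm(np.asarray(pos) - np.asarray(gt)) <= D
            if close and not matched:
                tp += 1      # first proposal within D meters is the true positive
                matched = True
            else:
                fp += 1      # further or distant proposals count as false positives
    return tp / max(tp + fp, 1)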
4.1. Qualitative results
Figure 7 shows example inferences of our algorithm. Note the high accuracy of both classification and joint prediction
across large variations in body and camera pose, depth in
scene, cropping, and body size and shape (e.g., small
child versus heavy adult). The bottom row shows some
failure modes of the body part classification. The first
example shows a failure to distinguish subtle changes in
the depth image such as the crossed arms. Often (as with
the second and third failure examples), the most likely
body part is incorrect, but there is still sufficient correct
probability mass in distribution P(c|I, x) that an accurate
proposal can be generated. The fourth example shows a
failure to generalize well to an unseen pose, but the confidence gates bad proposals, maintaining high precision at
the expense of recall.
4.2. Classification accuracy
We investigate the effect of several training parameters on
classification accuracy. The trends are highly correlated
between the synthetic and real test sets, and the real test set
appears consistently ‘easier’ than the synthetic test set,
probably due to the less varied poses present.
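For reference, here is a minimal sketch of one standard way to compute average per-class accuracy from pixel-wise predictions (the mean over body parts of the per-part recall); this is generic evaluation code, not the paper's, and the integer label encoding is an assumption.

import numpy as np

def average_per_class_accuracy(true_labels, pred_labels, num_parts):
    # true_labels, pred_labels: integer body-part labels per test pixel.
    conf = np.zeros((num_parts, num_parts), dtype=np.int64)
    for t, p in zip(true_labels, pred_labels):
        conf[t, p] += 1
    per_class = [conf[c, c] / conf[c].sum()        # recall for part c
                 for c in range(num_parts) if conf[c].sum() > 0]
    return 100.0 * float(np.mean(per_class))       # percent, as in Figure 8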
Number of training images. In Figure 8(a), we show how
test accuracy increases approximately logarithmically with
the number of randomly generated training images, though it starts to tail off at around 100,000 images. As shown below,
this saturation is likely due to the limited model capacity of
a 3-tree, 20-deep decision forest.
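As a rough, hypothetical analogue of that capacity limit (not the paper's training pipeline, which trains the forest on randomly sampled pixels using the depth-comparison features of Eq. (4)), a forest of the same shape could be configured with scikit-learn as follows:

from sklearn.ensemble import RandomForestClassifier

# Same shape as the forest discussed above: 3 trees, depth 20.
# X and y below are placeholders for per-pixel feature vectors and body part labels.
forest = RandomForestClassifier(n_estimators=3, max_depth=20, n_jobs=-1)
# forest.fit(X, y)
# part_posteriors = forest.predict_proba(X_test)   # per-pixel P(c | I, x)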
Silhouette images. We also show in Figure 8(a) the quality of our approach on synthetic silhouette images, where the features in Eq. (4) are either given scale (as the mean depth) or not
(a fixed constant depth). For the corresponding joint prediction using a 2D metric with a 10 pixel true positive threshold,
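To illustrate the two silhouette variants, here is a minimal sketch of a depth-comparison feature, assuming Eq. (4) has the form f(I, x) = d_I(x + u / d_I(x)) − d_I(x + v / d_I(x)); the probe depth used to normalize the pixel offsets u, v is the depth at x for real depth images, the mean depth for the 'scale' silhouette variant, and a fixed constant for the 'no scale' variant. Function and variable names here are ours, not the paper's.

import numpy as np

BACKGROUND = 1e6  # large constant depth for background and out-of-image probes

def depth_at(depth_image, p):
    row, col = int(round(p[0])), int(round(p[1]))
    h, w = depth_image.shape
    if 0 <= row < h and 0 <= col < w and depth_image[row, col] > 0:
        return float(depth_image[row, col])
    return BACKGROUND

def feature(depth_image, x, u, v, probe_depth=None):
    # probe_depth: depth at x (default), the mean depth ("scale" silhouettes),
    # or a fixed constant ("no scale" silhouettes).
    d = probe_depth if probe_depth is not None else depth_at(depth_image, x)
    x = np.asarray(x, dtype=float)
    return depth_at(depth_image, x + np.asarray(u) / d) - \
           depth_at(depth_image, x + np.asarray(v) / d)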
Figure 7. Example inferences. Synthetic (top row), real (middle), and failure modes (bottom). Left column: ground truth for a neutral pose as a reference. In each example, we see the depth image, the inferred most likely body part labels, and the joint proposals shown as front, right, and top views (overlaid on a depth point cloud). Only the most confident proposal for each joint above a fixed, shared threshold is shown.
Figure 8. Training parameters versus classification accuracy. (a) Number of training images (log scale). (b) Depth of trees, for 900k and 15k training images. (c) Maximum probe offset (pixel meters). [Plots show average per-class accuracy (%) on the synthetic and real test sets; panel (a) also includes results on silhouette images with and without scale.]