Figure 4. Synthetic and real data. Pairs of depth image and ground truth body parts, shown for synthetic (train and test) and real (test) data. Note wide variety in pose, shape, clothing, and crop.
convenient to fit the label in 5 bits. The definitions of
some of the parts are in terms of particular skeletal joints
of interest, for example, ‘all triangles intersecting the
sphere of radius 10 cm centered on the left hand.’ Other
parts fill the gaps between them. Despite these
apparently arbitrary choices, later attempts to optimize
the part distribution have not produced sets significantly
better than the one described in this paper.
For the experiments in this paper, the parts used are
named as follows: lu/ru/lw/rw head, neck, l/r shoulder,
lu/ru/lw/rw arm, l/r elbow, l/r wrist, l/r hand, lu/ru/
lw/rw torso, lu/ru/lw/rw leg, l/r knee, l/r ankle, l/r foot
(l = left, r = right, u = upper, w = lower). Distinct parts for left and right
allow the classifier to disambiguate the left and right sides
of the body. Even though this distinction may sometimes be ambiguous, the probabilistic labels we output can still make use of ambiguous evidence.
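To make the label set concrete, the following short Python sketch (not from the paper) enumerates the 31 parts named above; the ordering is an assumption made for illustration, and the count confirms the labels fit in the 5 bits mentioned earlier.

```python
# Hypothetical enumeration of the 31 body part labels named in the text;
# the exact ordering is an assumption for illustration only.
SIDES = ("l", "r")
QUADS = ("lu", "ru", "lw", "rw")  # left/right crossed with upper/loWer

BODY_PARTS = (
    [q + " head" for q in QUADS]
    + ["neck"]
    + [s + " shoulder" for s in SIDES]
    + [q + " arm" for q in QUADS]
    + [s + " elbow" for s in SIDES]
    + [s + " wrist" for s in SIDES]
    + [s + " hand" for s in SIDES]
    + [q + " torso" for q in QUADS]
    + [q + " leg" for q in QUADS]
    + [s + " knee" for s in SIDES]
    + [s + " ankle" for s in SIDES]
    + [s + " foot" for s in SIDES]
)
assert len(BODY_PARTS) == 31 < 2**5  # 31 labels fit comfortably in 5 bits
```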
Of course, the precise definition of these parts could be
changed to suit a particular application. For example, in an
upper body tracking scenario, all the lower body parts could
be merged. Parts should be sufficiently small to accurately
localize body joints, but not so numerous as to waste the capacity of the classifier.
3.2. Depth image features
We employ simple depth comparison features, inspired by
those in Lepetit et al.11 At a given pixel with 2D coordinates x,
the features compute

f_\theta(I, \mathbf{x}) = d_I\left(\mathbf{x} + \frac{\mathbf{u}}{d_I(\mathbf{x})}\right) - d_I\left(\mathbf{x} + \frac{\mathbf{v}}{d_I(\mathbf{x})}\right),    (4)

where d_I(x) is the depth at pixel x in image I, and parameters θ = (u, v) describe offsets u and v. The normalization of the offsets by 1/d_I(x) ensures that the features are depth invariant: at a given point on the body, a fixed world space offset will result whether the pixel is close or far from the camera. The features are thus 3D translation invariant (modulo perspective effects). If an offset pixel lies on the background or outside the bounds of the image, the depth probe d_I(x′) is given a large positive constant value.
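As an illustration of Eq. (4), here is a minimal Python sketch; it is not the authors' implementation, and the probe() helper, the (row, col) array layout, and the BACKGROUND_DEPTH constant are assumptions.

```python
import numpy as np

# Large positive constant returned for background or out-of-image probes,
# as described in the text (value chosen arbitrarily here).
BACKGROUND_DEPTH = 1e6

def probe(depth, p):
    """Read d_I(p); off-image or zero-depth (background) pixels get BACKGROUND_DEPTH."""
    r, c = int(round(p[0])), int(round(p[1]))
    if not (0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]):
        return BACKGROUND_DEPTH
    d = depth[r, c]
    return d if d > 0 else BACKGROUND_DEPTH  # assumes 0 encodes background

def depth_feature(depth, x, u, v):
    """Depth comparison feature f_theta(I, x) of Eq. (4).

    `depth` is a 2D array of depths; `x`, `u`, `v` are 2D (row, col) vectors.
    Dividing the offsets by d_I(x) makes the feature depth invariant.
    """
    x = np.asarray(x, dtype=float)
    d_x = probe(depth, x)
    return (probe(depth, x + np.asarray(u, dtype=float) / d_x)
            - probe(depth, x + np.asarray(v, dtype=float) / d_x))
```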
Figure 5. Depth image features. The yellow crosses indicate the pixel x being classified. The red circles indicate the offset pixels as defined in Eq. (4). (a) The two example features f1 and f2 give a large depth difference response. (b) The same two features at new image locations give a much smaller response.
Individually, these features provide only a weak signal
about which part of the body the pixel belongs to, but in
combination in a decision forest they are sufficient to accurately disambiguate all trained parts. The design of these
features was strongly motivated by their computational
efficiency: no preprocessing is needed; each feature
needs to read at most three image pixels and perform at most
five arithmetic operations; and the features can be straightforwardly implemented on the GPU. Given a larger computational budget, one could employ potentially more
powerful features based on, for example, depth integrals
over regions, curvature, or local descriptors such as
shape contexts.3
3.3. Randomized decision forests
Randomized decision trees and forests2,4 have proven fast
and effective multi-class classifiers for many tasks, and can
be implemented efficiently on the GPU.20 As illustrated in
Figure 6, a forest is an ensemble of T decision trees, each
consisting of split and leaf nodes. Each split node consists
of a feature f_θ and a threshold τ. To classify pixel x in image
I, the current node is set to the root, and then Eq. (4) is evaluated. The current node is then updated to the left or right
child according to the comparison f_θ(I, x) < τ, and the process
is repeated until a leaf node is reached. At the leaf node reached
in tree t, a learned distribution P_t(c|I, x) over body part labels
c is stored. The distributions are averaged together over all
trees in the forest to give the final classification

P(c \mid I, \mathbf{x}) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid I, \mathbf{x}).    (5)
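To make the traversal and averaging concrete, here is a minimal Python sketch of Eqs. (4) and (5), reusing the hypothetical depth_feature() above; the node classes and field names are assumptions, not the authors' data structures.

```python
import numpy as np

class SplitNode:
    """Internal node: depth feature offsets (u, v) and threshold tau."""
    def __init__(self, u, v, tau, left, right):
        self.u, self.v, self.tau = u, v, tau
        self.left, self.right = left, right

class LeafNode:
    """Leaf node: learned distribution P_t(c | I, x) over body part labels."""
    def __init__(self, distribution):
        self.distribution = np.asarray(distribution, dtype=float)

def classify_pixel(forest, depth, x):
    """Traverse each of the T trees with Eq. (4) comparisons, then
    average the reached leaf distributions as in Eq. (5)."""
    dists = []
    for root in forest:  # forest is a list of T root nodes
        node = root
        while isinstance(node, SplitNode):
            if depth_feature(depth, x, node.u, node.v) < node.tau:
                node = node.left
            else:
                node = node.right
        dists.append(node.distribution)
    return np.mean(dists, axis=0)  # P(c | I, x)
```

Note that each pixel's traversals are independent of every other pixel's, which is what makes the efficient GPU implementation mentioned above straightforward.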