Figure 10. Comparisons. (a) Comparison with nearest-neighbor matching: mean average precision vs. number of training images (log scale), for ground truth skeleton NN, chamfer NN, and our algorithm. (b) Comparison with Ganapathi et al.8: even without the kinematic and temporal constraints exploited by Ganapathi et al.,8 our algorithm is able to more accurately localize body joints (per-joint average precision; our result per frame vs. Ganapathi et al. tracking).
5. Discussion
We have seen how accurate proposals for the 3D locations of body joints can be estimated in super-real-time from single depth images. We introduced body part recognition as an intermediate representation for human pose estimation. Use of a highly varied synthetic training set allowed us to train very deep decision forests using simple depth-invariant features without overfitting, learning invariance to both pose and shape.
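To make the depth-invariant features concrete, the following is a minimal sketch of a depth-difference split feature in which the pixel offsets are normalized by the depth at the probe pixel; the function name, array layout, and background-depth constant are illustrative assumptions, not the production implementation.

# Large constant depth for background / off-image probes (an assumption
# standing in for the paper's background handling).
BG_DEPTH = 1e6

def depth_feature(depth, x, u, v):
    # f(I, x) = d(x + u / d(x)) - d(x + v / d(x)), where depth is a 2D
    # array of depth values and x is a (row, col) probe pixel.
    # Dividing the offsets u, v by the depth at x makes the feature
    # invariant to how far the subject stands from the camera: a body
    # part twice as far away subtends half as many pixels.
    d_x = float(depth[x])

    def probe(offset):
        r = int(x[0] + offset[0] / d_x)
        c = int(x[1] + offset[1] / d_x)
        if 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]:
            return float(depth[r, c])
        return BG_DEPTH  # off-image probes read as distant background

    return probe(u) - probe(v)

Such a feature is cheap enough to evaluate at every pixel independently, which is what makes per-frame, parallelizable inference possible.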
Detecting modes in a density function gives the final set of confidence-weighted 3D joint proposals (see the mean-shift sketch after this paragraph). Our
results show high correlation between real and synthetic
data, and between the intermediate classification and the
final joint proposal accuracy. We have highlighted the
importance of breaking the whole skeleton into parts, and
shown state-of-the-art accuracy on a competitive test set.
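As a concrete illustration of that final aggregation step, here is a minimal sketch of weighted mean-shift mode detection over the 3D votes for a single joint. The Gaussian kernel, bandwidth, restart strategy, and weight definition (e.g., part probability scaled by squared depth) are illustrative assumptions, not the exact production procedure.

import numpy as np

def mean_shift_modes(points, weights, bandwidth, n_starts=20, n_iters=30, seed=0):
    # points  : (N, 3) world-space positions of pixels voting for one joint
    # weights : (N,) per-point weights (e.g., part probability x depth^2)
    # Returns a list of (mode, confidence) pairs; nearby converged modes
    # are not merged here, which a production version would do.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=min(n_starts, len(points)), replace=False)
    modes = []
    for m in points[idx]:
        m = m.astype(float)
        for _ in range(n_iters):
            d2 = np.sum((points - m) ** 2, axis=1)
            k = weights * np.exp(-d2 / (2.0 * bandwidth ** 2))
            total = k.sum()
            if total == 0.0:
                break
            # Weighted mean-shift step: move to the kernel-weighted mean.
            m = (k[:, None] * points).sum(axis=0) / total
        density = np.sum(weights * np.exp(-np.sum((points - m) ** 2, axis=1)
                                          / (2.0 * bandwidth ** 2)))
        modes.append((m, density))  # summed density acts as the confidence
    return modes

The summed kernel density at each converged mode serves as a natural confidence score for the corresponding joint proposal.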
As future work, we plan further study of the variability in
the source mocap data, the properties of the generative
model underlying the synthesis pipeline, and the particular
part definitions. Whether a similarly efficient approach can
directly regress joint positions is also an open question.
Perhaps a global estimate of latent variables such as coarse
person orientation could be used to condition the body part
inference and remove ambiguities in local pose estimates.
Acknowledgments
We thank the many skilled engineers in Xbox, particularly
Robert Craig, Matt Bronder, Craig Peeper, Momin
Al-Ghosien, and Ryan Geiss, who built the Kinect tracking
system on top of this research. We also thank John Winn,
Duncan Robertson, Antonio Criminisi, Shahram Izadi, Ollie
Williams, and Mihai Budiu for help and valuable discussions, and Varun Ganapathi and Christian Plagemann for
providing their test data.
References
1. Agarwal, A., Triggs, B. 3D human pose from silhouettes by relevance vector regression. In Proceedings of CVPR (2004).
2. Amit, Y., Geman, D. Shape quantization and recognition with randomized trees. Neural Computation 9, 7 (1997), 1545–1588.
3. Belongie, S., Malik, J., Puzicha, J. Shape matching and object recognition using shape contexts. IEEE Trans. PAMI 24, 4 (2002), 509–522.
4. Breiman, L. Random forests. Mach. Learn. 45, 1 (2001), 5–32.
5. CMU Mocap Database. http://mocap.cs.cmu.edu.
6. Comaniciu, D., Meer, P. Mean shift: A robust approach toward feature space analysis. IEEE Trans. PAMI 24, 5 (2002).
7. Fergus, R., Perona, P., Zisserman, A. Object class recognition by unsupervised scale-invariant learning. In Proceedings of CVPR (2003).
8. Ganapathi, V., Plagemann, C., Koller, D., Thrun, S. Real time motion capture using a single time-of-flight camera. In Proceedings of CVPR (2010).
9. Gavrila, D. Pedestrian detection from a moving vehicle. In Proceedings of ECCV (June 2000).
10. Gonzalez, T. Clustering to minimize the maximum intercluster distance. Theor. Comp. Sci. 38 (1985).
11. Lepetit, V., Lagger, P., Fua, P. Randomized trees for real-time keypoint recognition. In Proceedings of CVPR (2005).
12. Moeslund, T., Hilton, A., Krüger, V. A survey of advances in vision-based human motion capture and analysis. CVIU 104, 2–3 (2006), 90–126.
13. Navaratnam, R., Fitzgibbon, A.W., Cipolla, R. The joint manifold model for semi-supervised multi-valued regression. In Proceedings of ICCV (2007).
14. Ning, H., Xu, W., Gong, Y., Huang, T.S. Discriminative learning of visual words for 3D human pose estimation. In Proceedings of CVPR (2008).
15. Okada, R., Soatto, S. Relevant feature selection for human pose estimation and localization in cluttered images. In Proceedings of ECCV (2008).
16. Plagemann, C., Ganapathi, V., Koller, D., Thrun, S. Real-time identification and localization of body parts from depth images. In Proceedings of ICRA (2010).
17. Poppe, R. Vision-based human motion analysis: An overview. CVIU 108, 1–2 (2007), 4–18.
18. Ramanan, D., Forsyth, D. Finding and tracking people from the bottom up. In Proceedings of CVPR (2003).
19. Shakhnarovich, G., Viola, P., Darrell, T. Fast pose estimation with parameter sensitive hashing. In Proceedings of ICCV (2003).
20. Sharp, T. Implementing decision trees and forests on a GPU. In Proceedings of ECCV (2008).
21. Shotton, J., Winn, J., Rother, C., Criminisi, A. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings of ECCV (2006).
22. Siddiqui, M., Medioni, G. Human pose estimation from a single view point, real-time range sensor. In IEEE International Workshop on Computer Vision for Computer Games (2010).
23. Sidenbladh, H., Black, M., Sigal, L. Implicit probabilistic models of human motion for synthesis and tracking. In Proceedings of ECCV (2002).
24. Sigal, L., Bhatia, S., Roth, S., Black, M., Isard, M. Tracking loose-limbed people. In Proceedings of CVPR (2004).
25. Urtasun, R., Darrell, T. Local probabilistic regression for activity-independent human pose inference. In Proceedings of CVPR (2008).
26. Wang, R., Popović, J. Real-time hand-tracking with a color glove. In Proceedings of ACM SIGGRAPH (2009).
27. Winn, J., Shotton, J. The layout consistent random field for recognizing and segmenting partially occluded objects. In Proceedings of CVPR (2006).
28. Zhu, Y., Fujimura, K. Constrained optimization for human pose estimation from depth sequences. In Proceedings of ACCV (2007).
Jamie Shotton, Toby Sharp, Andrew Fitzgibbon, Andrew Blake, and Mat Cook ({jamiesho, tsharp, awf, ablake, a-macook}@microsoft.com), Microsoft Research, Cambridge, UK.
Alex Kipman and Mark Finocchio ({akipman, markfi}@microsoft.com), Xbox Incubation.
Richard Moore, ST-Ericsson.
© 2013 ACM 0001-0782/13/01