Real-Time Human Pose
Recognition in Parts
from Single Depth Images
By Jamie Shotton, Toby Sharp, Alex Kipman, Andrew Fitzgibbon,
Mark Finocchio, Andrew Blake, Mat Cook, and Richard Moore
We propose a new method to quickly and accurately predict
human pose—the 3D positions of body joints—from a single
depth image, without depending on information from preceding frames. Our approach is strongly rooted in current object
recognition strategies. By designing an intermediate
representation in terms of body parts, the difficult pose estimation problem is transformed into a simpler per-pixel classification problem, for which efficient machine learning techniques
exist. By using computer graphics to synthesize a very large
dataset of training image pairs, one can train a classifier that
estimates body part labels from test images invariant to pose,
body shape, clothing, and other irrelevances. Finally, we generate confidence-scored 3D proposals of several body joints by
reprojecting the classification result and finding local modes.
The system runs in under 5ms on the Xbox 360. Our evaluation shows high accuracy on both synthetic and real test
sets, and investigates the effect of several training parameters. We achieve state-of-the-art accuracy in our comparison
with related work and demonstrate improved generalization
over exact whole-skeleton nearest neighbor matching.
Robust interactive human body tracking has applications
including gaming, human–computer interaction, security,
telepresence, and health care. Human pose estimation from
video has generated a vast literature (surveyed in Moeslund
et al. [12] and Poppe [17]). Early work used standard video cameras,
but the task has recently been greatly simplified by the introduction of real-time depth cameras.
Depth imaging technology has advanced dramatically
over the last few years, finally reaching a consumer price
point with the launch of Kinect for Xbox 360. Pixels in a
depth image record depth in the scene, rather than a measure of intensity or color. The Kinect camera gives a
640 × 480 image at 30 frames per second with depth resolution of a few centimeters. Depth cameras offer several
advantages over traditional intensity sensors: they work
in low light levels, give a calibrated scale estimate, and are color and texture invariant. They also
greatly simplify the task of background subtraction, which
we assume in this work. Importantly for our approach, it is
rather easier to use computer graphics to synthesize realistic depth images of people than to synthesize color
images, and thus to build a large training dataset cheaply.
However, even the best existing depth-based systems for
human pose estimation [16, 22] still exhibit limitations. In particular, until the launch of Kinect for Xbox 360, of which the
algorithm described in this paper is a key component, none
ran at interactive rates on consumer hardware while
handling a full range of human body shapes and sizes undergoing general body motions.
1.1. Problem overview
The overall problem we wish to solve is stated as follows.
The input is a stream of depth images; that is, the image I_t
at time t comprises a 2D array of N distance measurements from the camera to the scene. Specifically, an
image I encodes a function d_I(x) which maps 2D coordinates x to the distance to the first opaque surface along
the pixel's viewing direction. The output of the system is a
stream of 3D skeletons, each skeleton being a vector of
about 30 numbers representing the body configuration
(e.g., joint angles) of each person in the corresponding
input image. Denoting the output skeleton(s) at frame t
by q_t, the goal is to define a function F such that q_t = F(I_t,
I_{t−1}, …). This is a standard formulation, in which the output at time t may depend on information from earlier
images as well as from the current image. Our solution,
illustrated in Figure 1, pipelines the function F into two
Figure 1. System overview. From a single input depth image, a
per-pixel body part distribution is inferred. (Colors indicate the most
likely part labels at each pixel and correspond in the joint proposals.)
Local modes of this signal are estimated to give high-quality
proposals for the 3D locations of body joints, even for multiple
users. Finally, the joint proposals are input to skeleton fitting, which
outputs the 3D skeleton for each user.
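To make the "local modes" step concrete: after the per-pixel classifier has labeled each depth pixel with a body part distribution, the labeled pixels can be reprojected into 3D world space and the modes of their weighted density taken as joint proposals. A standard way to find such modes is mean shift; the sketch below is an illustrative implementation under assumed parameter values (the bandwidth, iteration count, and scoring scheme here are placeholders, not the shipped system's settings).

```python
import numpy as np

def joint_proposals(points, weights, bandwidth=0.06, iters=30):
    """Illustrative mean-shift sketch, not the production implementation.

    points  : (N, 3) array of reprojected 3D world positions of pixels
              labeled as one body part.
    weights : (N,) per-pixel classifier confidences.
    Returns proposals as dicts with a 3D position and a confidence score,
    sorted by score (highest first).
    """
    modes = []
    for seed in points:
        m = seed.copy()
        for _ in range(iters):
            # Gaussian-kernel weights of all points around the current estimate
            d2 = np.sum((points - m) ** 2, axis=1)
            k = weights * np.exp(-d2 / (2.0 * bandwidth ** 2))
            if k.sum() == 0.0:
                break
            m_new = (k[:, None] * points).sum(axis=0) / k.sum()
            if np.linalg.norm(m_new - m) < 1e-4:   # converged to a mode
                m = m_new
                break
            m = m_new
        modes.append(m)
    # Merge seeds that converged to (nearly) the same mode; the number of
    # merged seeds serves as a simple confidence score for the proposal.
    proposals = []
    for m in modes:
        for p in proposals:
            if np.linalg.norm(m - p["pos"]) < bandwidth:
                p["score"] += 1.0
                break
        else:
            proposals.append({"pos": m, "score": 1.0})
    return sorted(proposals, key=lambda p: -p["score"])
```

Running this on pixels from two spatially separated clusters yields two proposals, with the denser cluster scored higher; the per-part proposals would then feed the skeleton-fitting stage described in the caption above.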
The original version of this paper appeared in the
Proceedings of the 2011 Conference on Computer Vision and
Pattern Recognition, 1297–1304.