Technical Perspective
Finding People in Depth
By James M. Rehg
DOI: 10.1145/2398356.2398380
When the Microsoft Kinect for Xbox
360 was introduced in November
2010, it was an instant success. Via
the Kinect, users can control their
Xbox through natural body gestures
and commands thanks to a depth
camera that enables gesture recognition. In contrast to a conventional
camera, which measures the color at
each pixel location, a depth camera
returns the distance to that point in
the scene. Depth cameras make it
easy to separate the Xbox user from
the background of the room, and reduce the complexities caused by color
variation, for example, in clothing.
While the role of the depth camera
in the success of the Kinect is well-known, what is less well-known is the
innovative computer vision technology that underlies the Kinect’s gesture
recognition capabilities. The following article by Shotton et al. describes
a landmark computer vision system
that takes a single depth image containing a person and automatically
estimates the pose of the person’s
body in 3D. This novel method for
pose estimation is the key to the Kinect’s success.
Three important ideas define the
Kinect architecture: tracking by detection, data-driven learning, and
discriminative part models. These
ideas have their origin in object recognition and tracking research from
the computer vision community over
the past 10 years. Their development
in the Kinect has led to some exciting
and innovative work on feature representations and training methods.
The resulting system is a dramatic
improvement over the previous state
of the art.
In order to recognize a user's gesture, the Kinect must track the user's motion in a sequence of depth images. An important aspect of the Kinect architecture is that body poses are detected independently in each frame, without incorporating information from previous frames. This tracking by detection approach has the potential for greater robustness because errors made over time are less likely to accumulate. It is enabled by an extremely efficient and reliable solution to the pose estimation problem.
That solution rests on the second idea, data-driven learning: rather than relying on a hand-built body model, the system is trained on a very large corpus of labeled depth images, many of them synthesized from motion-capture data, so that it generalizes across body shapes, sizes, poses, and clothing.
The final idea is the use of
discriminative part models to represent the
body pose. Parts are crucial. They decompose the problem of predicting
the pose into a series of independent
subproblems: given an input depth
image, each pixel is labeled with its
corresponding part, and the parts
are grouped into hypotheses about
joint locations. Each pixel can be
processed independently in this approach, making it possible to leverage
the Xbox GPU and obtain real-time
performance. This efficiency is enhanced by a clever feature design based on simple depth comparisons.
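The per-pixel labeling step can be illustrated with a toy sketch. The Python below follows the general form of the features described by Shotton et al. (a difference of two depth probes whose offsets are scaled by inverse depth, making the feature invariant to how far the person stands from the camera), but the tree encoding, part names, and all numeric values are hypothetical, not the shipped implementation:

```python
import numpy as np

BACKGROUND = 1e6  # sentinel depth for background / off-image pixels

def depth_difference_feature(depth, x, y, u, v):
    """Depth-invariant offset feature: probe the depth map at two
    offsets (u, v) scaled by 1/depth at the reference pixel, and
    return the difference of the two probed depths."""
    d = depth[y, x]

    def probe(offset):
        ox = x + int(round(offset[0] / d))
        oy = y + int(round(offset[1] / d))
        h, w = depth.shape
        if 0 <= oy < h and 0 <= ox < w:
            return depth[oy, ox]
        return BACKGROUND  # off-image probes read as background

    return probe(u) - probe(v)

def classify_pixel(depth, x, y, tree):
    """Walk one decision tree (hypothetical dict encoding): each
    internal node stores two probe offsets and a threshold; each
    leaf stores a body-part label for the pixel."""
    node = tree
    while "part" not in node:
        f = depth_difference_feature(depth, x, y, node["u"], node["v"])
        node = node["left"] if f < node["tau"] else node["right"]
    return node["part"]

# Toy example (all values hypothetical): a square "person" at 2 m depth.
toy_depth = np.full((10, 10), BACKGROUND)
toy_depth[2:8, 2:8] = 2.0
# One-split tree: probing upward and hitting background suggests "head".
toy_tree = {
    "u": (0, -4), "v": (0, 0), "tau": 100.0,
    "left": {"part": "torso"}, "right": {"part": "head"},
}
```

Because every pixel is classified by the same short sequence of array lookups and comparisons, with no dependence on its neighbors, the real system can run this loop for all pixels in parallel on the GPU; a forest of such trees then votes on each pixel's label.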
The Kinect’s impact has extended
well beyond the gaming market. It has
become a popular sensor in the robotics community, where its low cost and
ability to support human-robot interaction are hugely appealing. A survey
of the two main robotics conferences
in 2012 (IROS and ICRA) reveals that
among the more than 1,600 papers,
9% mentioned the Kinect. At Georgia
Tech, we are using the Kinect to measure children’s behavior, in order to
support the research and treatment of
autism, and other developmental and
behavioral disorders.
In summary, the Kinect is a potent
combination of innovative hardware
and software design, informed by decades of computer vision research.
The proliferation of depth camera
technology in the coming years will
enable new advances in vision-based
sensing and support an increasingly
diverse set of applications.
James M. Rehg (rehg@gatech.edu) is a professor in the
School of Interactive Computing at the Georgia Institute of
Technology, Atlanta, where he directs the Center for Behavior
Imaging and co-directs the Computational Perception Lab.
© 2013 ACM 0001-0782/13/01