Visual Object Detection with
Deformable Part Models
By Pedro felzenszwalb, Ross Girshick, David McAllester, and Deva Ramanan
We describe a state-of-the-art system for finding objects in
cluttered images. Our system is based on deformable models that represent objects using local part templates and geometric constraints on the locations of parts. We reduce object
detection to classification with latent variables. The latent
variables introduce invariances that make it possible to detect
objects with highly variable appearance. We use a generalization of support vector machines to incorporate latent information during training. This has led to a general framework
for discriminative training of classifiers with latent variables.
Discriminative training benefits from large training datasets. In practice we use an iterative algorithm that alternates
between estimating latent values for positive examples and
solving a large convex optimization problem. Practical optimization of this large convex problem can be done using active
set techniques for adaptive subsampling of the training data.
Object recognition is a fundamental challenge in computer
vision. Consider the problem of detecting objects from a
category, such as people or cars, in static images. This is a
difficult problem because objects in each category can vary
greatly in appearance. Variations arise from changes in
illumination, viewpoint, and intra-class variability of shape
and other visual properties among object instances. For
example, people wear different clothes and take a variety of
poses while cars come in various shapes and colors.
Early approaches to object recognition focused on three-dimensional geometric models and invariant features. 22, 24, 25
More contemporary methods tend to be based on appearance-based representations that directly model local image
features. 21, 27 Machine learning techniques have been
very successful in training appearance-based models in
restricted settings such as face detection29, 30 and handwritten digit recognition. 23 Our system uses new machine
learning techniques to train models that combine local
appearance models with geometric constraints.
To apply machine learning techniques to object detection
we can reduce the problem to binary classification. Consider a
classifier that takes as input an image and a position and scale
within the image. The classifier determines whether or not
there is an instance of the target category at the given position
and scale. Detection is performed by evaluating the classifier
at a dense set of positions and scales within an image. This
approach is commonly called “sliding window” detection. Let
x specify an image and a position and scale within the image.
In the case of a linear classifier we threshold a score b · Φ(x),
where b is a vector of model parameters (often seen as a
template) and Φ(x) is a feature vector summarizing the appear-
ance of an image region defined by x. A difficulty with this
approach is that a linear classifier is likely to be insufficient to
model objects that can have significant appearance variation.
One representation, designed to handle greater variabil-
ity in object appearance is that of a pictorial structure, 14, 18
where objects are described by a collection of parts arranged
in a deformable configuration. In a pictorial structure
model each part encodes local appearance properties of an
object, and the deformable configuration is characterized
by spring-like connections between certain pairs of parts.
Deformable models such as pictorial structures can capture
significant variations in appearance but a single deformable
model still cannot represent many interesting object catego-
ries. Consider modeling the appearance of bicycles. Bicycles
come in different types (e.g., mountain bikes, tandems, penny-
farthings with one big wheel and one small wheel) and we can
view them from different directions (e.g., frontal versus side
views). We use mixtures of deformable models to deal with
these more significant variations.
Our classifiers treat mixture component choice and part
locations as latent variables. Let x denote an image and a
position and scale within the image. Our classifiers compute
a score of the form
Here b is a vector of model parameters, z are latent values,
and Φ(x, z) is a feature vector. If the score is above a threshold, our model will produce a detection at x. Associated with
every detection are the inferred latent values, z = argmaxz
b⋅Φ(x, z), which specify a mixture component choice and
the locations of the parts associated with that component.
Figure 1 shows two detections and the inferred part locations in each case. We note that ( 1) can handle very general
forms of latent information. For example, z could specify a
derivation under a rich visual grammar. 15
One challenge in training deformable part models is that
it is often difficult to obtain training data with part annota-
tions. Annotating parts can be time consuming and genu-
inely ambiguous. For example, what are the right parts for
a sofa model? We train our models from weakly labeled
data in the form of bounding boxes around target objects.
Part structure and latent part locations are automatically
This work first appeared in the IEEE CVPR 2008 Conference
and in the IEEE Transactions on Pattern Analysis and
Machine Intelligence, Vol. 32, No. 9, September 2010.