are d-dimensional feature vectors computed on a dense grid
of image locations (e.g., every 8 × 8 pixels). Each feature vector
describes a small image patch while introducing some invariance. The framework described here is independent of the specific choice of features. In practice we use a low-dimensional
variation of the histogram of oriented gradient (HOG) features
from Dalal and Triggs.7 HOG features introduce invariance to
photometric transformations and small image deformations.
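To make the feature-map construction concrete, the following is a minimal sketch of a HOG-style feature map: one histogram of unsigned gradient orientations per cell of a grayscale image. This is illustrative only; the actual features used in the paper include contrast normalization and a dimensionality reduction not shown here, and the function name and parameters are ours.

```python
import numpy as np

def orientation_histogram_features(image, cell=8, bins=9):
    """Toy HOG-style feature map: one histogram of gradient
    orientations per cell x cell block of a grayscale image.
    Illustrative sketch; the real features also include
    normalization and dimensionality reduction."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    h, w = image.shape
    rows, cols = h // cell, w // cell
    fmap = np.zeros((rows, cols, bins))
    for i in range(rows):
        for j in range(cols):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            b = bin_idx[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            for k in range(bins):
                fmap[i, j, k] = m[b == k].sum()
    return fmap
```

The result is a feature map whose entries are d-dimensional vectors (here d = 9), one per 8 × 8 cell, matching the dense-grid structure described above.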
A linear filter is defined by a w × h array of d-dimensional
weight vectors. Intuitively, a filter is a template that is tuned
to respond to an iconic arrangement of image features.
Filters are typically much smaller than feature maps and can
be applied at different locations within a feature map. The
score, or response, of a filter F at a particular feature map
location is obtained by taking the dot product of F’s array of
weight vectors, concatenated into a single long vector, with
the concatenation of the feature vectors extracted from a
w × h window of the feature map. Because objects appear at
a wide range of scales, we apply the same filter to multiple
feature maps, each computed from a rescaled version of
the original image. Figure 2 shows some examples of filters,
feature maps, and filter responses. To fix notation, let I be
an image and p = (x, y, s) specify a position and scale in the
image. We write F ⋅ f (I, p) for the score obtained by applying
filter F at the position and scale specified by p.
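The filter response described above can be sketched directly as a dot product over a window of the feature map. This is a minimal illustration at a single pyramid level; in practice the same filter is applied to feature maps computed at multiple scales of the image, and the function names here are ours.

```python
import numpy as np

def filter_score(fmap, F, y, x):
    """Score of filter F (h x w x d weights) at location (y, x) of a
    feature map: the dot product of F's concatenated weight vectors
    with the concatenated features in the corresponding window."""
    h, w, _ = F.shape
    window = fmap[y:y + h, x:x + w, :]
    return float(np.dot(F.ravel(), window.ravel()))

def all_scores(fmap, F):
    """Dense response map: the filter evaluated at every valid
    location (a cross-correlation of F with the feature map)."""
    H, W, _ = fmap.shape
    h, w, _ = F.shape
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = filter_score(fmap, F, y, x)
    return out
```

Scale search then amounts to calling `all_scores` once per level of a feature pyramid built from rescaled copies of the image.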
2.2. Deformable part models
To combine a set of filters into a deformable model we
define spring-like connections between some pairs of filters.
Thinking of filters as vertices and their pairwise connections
as edges, a model is defined by a graph. Here we consider
models represented by star graphs, where one filter acts as
the hub, or root, to which all other filters are connected.
In our star models, a low-resolution root filter, which
approximately covers an entire object, serves as the star’s
hub. Higher-resolution part filters, which cover smaller regions
of the object, are connected to the root. Figure 1 illustrates
a star model for detecting pedestrians and its two highest
scoring detections in a test image.
We have found that using higher resolution features for
defining part filters is essential for obtaining high recognition performance. With this approach the part filters capture finer-resolution features that are localized with greater
accuracy when compared to the features captured by the
root filter. Consider building a model for a face. The root
filter might capture a coarse appearance model for the face
as a whole while the part filters might capture the detailed
appearance of face parts such as eyes, nose, and mouth.
The model for an object with n parts is defined by a set of
parameters (F0, (F1, d1), …, (Fn, dn), b) where F0 is a root filter, Fi is
a part filter, di is a vector of deformation parameters, and b is
a scalar bias term. The vector di specifies the coefficients of a
quadratic function that scores a position for filter i relative
to the root filter’s position. We use a quadratic deformation
model because it is relatively flexible while still amenable to
efficient computations. A quadratic score over relative positions can be thought of as a spring that connects a part filter
to the root filter. The rest position and rigidity of the spring
are determined by di.
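A part's placement score is its filter response minus a quadratic deformation cost determined by di. The sketch below assumes the common parameterization in which the deformation features are (dx, dy, dx², dy²), where (dx, dy) is the part's displacement from its anchor relative to the root; the exact feature vector is an assumption here, not spelled out in this section.

```python
def deformation_cost(d, dx, dy):
    """Quadratic deformation cost for a part displaced by (dx, dy)
    from its anchor relative to the root. d = (a1, a2, a3, a4) are
    the learned coefficients; together they determine the rest
    position and rigidity of the 'spring' connecting the part to
    the root. (Feature vector (dx, dy, dx^2, dy^2) assumed.)"""
    a1, a2, a3, a4 = d
    return a1 * dx + a2 * dy + a3 * dx ** 2 + a4 * dy ** 2

# A part's placement score under this model:
#   score_i(p) = F_i . f(I, p) - deformation_cost(d_i, dx, dy)
```

With the linear terms zero and positive quadratic terms, the cost grows symmetrically with displacement, i.e., the spring's rest position is at the anchor; nonzero linear terms shift that rest position.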
The part locations are not labeled in the training data; they
are treated as latent variables inferred during learning. To
handle this, we developed a general framework for discrimi-
native training of latent-variable classifiers of the form in (1).
This leads to a formalism that we call latent support vector
machine (LSVM).
Sliding window detection leads to imbalanced classifica-
tion problems. There are vastly more negative examples than
positive ones. To obtain high performance using discrimina-
tive training it is often important to make exhaustive use of
large training sets. This motivates a data subsampling pro-
cess that searches through all of the negative instances to find
the hard negative examples and then trains a model relative
to those instances. A heuristic methodology of data mining
for hard negatives was adopted by Dalal and Triggs7 and goes
back at least to the training methods used by Schneiderman
and Kanade28 and Viola and Jones.30 We developed simple
data mining algorithms for subsampling the training data
for SVMs and LSVMs that are guaranteed to converge to the
optimal model defined in terms of the entire training set.
We formally define our models in Section 2. We describe
a general framework for learning classifiers with latent
variables in Section 3. Section 4 describes how we use this
framework to train object detection models. We present
experimental results in Section 5 and conclude by discussing
related work in Section 6.
A core component of our models is a set of templates, or filters, that
capture the appearance of object parts based on local image
features. Filters define scores for placing parts at different
image positions and scales. These scores are combined
using a deformation model that scores an arrangement of
parts based on geometric relationships. Detection involves
searching over arrangements of parts using efficient
algorithms. This is done separately for each component in a
mixture of deformable part models.
2.1. Filters
Our models are built from linear filters that are applied to
dense feature maps. A feature map is an array whose entries
Figure 1. Detections obtained with a single component person
model. The model is defined by a coarse root filter (a), several higher
resolution part filters (b), and a spatial model for the location of each
part relative to the root (c). The filters specify weights for histogram
of oriented gradients features. Their visualization shows the positive
weights at different orientations. The visualization of the spatial
models reflects the “cost” of placing the center of a part at different
locations relative to the root.