An object hypothesis is given by a configuration vector z = (p0, …, pn), where pi = (xi, yi, si) specifies the position and scale of the i-th filter. The score of a hypothesis is given by the scores of each filter at their respective locations (the data term) minus a deformation cost that depends on the relative position of each part with respect to the root (the spatial prior), plus the bias,

score(I, z) = ∑_{i=0}^{n} f(I, pi) − ∑_{i=1}^{n} di ⋅ ψ(pi, p0) + b, (2)

where ψ(pi, p0) = (dxi, dyi, dxi², dyi²), with dxi = xi − x0 and dyi = yi − y0.

Each term in the second summation in (2) can be interpreted as a spring deformation model that anchors part i to some ideal location relative to the root.

The score of a hypothesis z can be expressed in terms of a dot product, β ⋅ Φ(I, z), between a vector of model parameters β and a feature vector Φ(I, z),

β = (F0, …, Fn, d1, …, dn, b), (3)
Φ(I, z) = (φ(I, p0), …, φ(I, pn), −ψ(p1, p0), …, −ψ(pn, p0), 1). (4)

This makes a connection between deformable part models and linear classifiers. We use this representation for learning the model parameters with the latent SVM framework.

2.3. Detection
To detect objects in an image we compute an accumulated score for each root filter location p0 according to the best possible placement of the parts relative to p0.
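The dot-product representation in equations (3) and (4) can be checked on toy numbers. This is a minimal sketch: the filter dimensions, the random values, and the identity f(I, pi) = Fi ⋅ φ(I, pi) for the filter scores are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 2                     # number of parts (toy value)
d_feat = 4                # toy feature dimension

# Toy model: root/part filters F_i, deformation weights d_i, bias b.
F = [rng.standard_normal(d_feat) for _ in range(n + 1)]
d = [rng.random(4) for _ in range(n)]       # weights for (dx, dy, dx^2, dy^2)
b = 0.5

# Toy hypothesis: features phi(I, p_i) at each placement, plus part offsets.
phi = [rng.standard_normal(d_feat) for _ in range(n + 1)]
offsets = [(1.0, -2.0), (0.5, 0.0)]         # (dx_i, dy_i) relative to the root

def psi(dx, dy):
    """Deformation features psi(p_i, p_0) = (dx, dy, dx^2, dy^2)."""
    return np.array([dx, dy, dx * dx, dy * dy])

# Equation (2): data term minus deformation cost plus bias,
# with filter scores f(I, p_i) = F_i . phi(I, p_i).
score = sum(F[i] @ phi[i] for i in range(n + 1)) \
      - sum(d[i] @ psi(*offsets[i]) for i in range(n)) + b

# Equations (3)-(4): the same score as a dot product beta . Phi(I, z).
beta = np.concatenate(F + d + [[b]])
Phi  = np.concatenate(phi + [-psi(*o) for o in offsets] + [[1.0]])

assert np.isclose(score, beta @ Phi)
```

The trailing 1 in Φ pairs with the bias b in β, which is what lets a single linear classifier absorb the bias term.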
Figure 2. Detection at one scale. Responses from the root and part filters are computed on different resolution feature maps. Distance transforms are used to solve equation (7) efficiently for all possible part placements. The transformed responses are combined to yield a final score for each root location. We show the responses and transformed responses for the “head” and “right shoulder” parts. Note how the “head” filter is more discriminative. The combined scores clearly show two good hypotheses for the object at this scale.
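The distance transforms mentioned in the caption compute, for every root location, the best part placement under the quadratic deformation cost of ψ. A brute-force 1-D stand-in makes the computed quantity concrete; this is a sketch with made-up names (`resp`, `d1`, `d2`), and the real method is a linear-time lower-envelope algorithm rather than this O(k²) loop.

```python
import numpy as np

def transformed_response_1d(resp, d1, d2):
    """For each root-relative location x, return
        D[x] = max_{x'} ( resp[x'] - d1*(x' - x) - d2*(x' - x)**2 )
    and the argmax A[x], i.e. the best part placement for that location.
    This O(k^2) loop only illustrates the quantity; a generalized
    distance transform computes the same D and A in O(k) time."""
    k = len(resp)
    D = np.empty(k)
    A = np.empty(k, dtype=int)
    for x in range(k):
        dx = np.arange(k) - x          # candidate displacements x' - x
        vals = resp - d1 * dx - d2 * dx**2
        A[x] = int(np.argmax(vals))
        D[x] = vals[A[x]]
    return D, A

# A single strong part response at index 2 spreads out under the
# quadratic deformation penalty:
D, A = transformed_response_1d(np.array([0., 0., 5., 0., 0.]), 0.0, 1.0)
# D = [1, 4, 5, 4, 1]; every root location picks placement 2.
```

Because the deformation features (dx, dy, dx², dy²) are separable in x and y, the 2-D transform used in detection can be obtained by applying a 1-D transform along each axis in turn.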