Research Highlights
DOI: 10.1145/2500468.2500480

Technical Perspective
Progress in Visual Categorization
By Pietro Perona

Our visual system helps us carry out our daily business: walking, driving, reading, playing sports, or socializing. It is difficult to think of an activity that does not depend on vision. Our eyes and brain help us by measuring shapes, trajectories, and distances in the world around us, and by recognizing materials, objects, and scenes. How is this done? Can we reproduce these abilities in a machine?
The following paper by Felzenszwalb et al. describes what is currently the best system for detecting object categories (a pedestrian, a bottle, a cat) in images. Like much work in computer vision, their system builds on insights from a diverse set of areas of science and engineering: biological vision, geometry, signal processing, machine learning, and computer algorithms.
Three ingredients make their system successful. First, objects are described as collections of visually distinctive parts (for example, eyes, nose, and mouth in a face) that appear in a consistent, although not rigid, mutual position, or shape. This idea may be traced back to Fischler and Elschlager,6 although much work was necessary to make it work in practice: for example, making representations invariant to scale, representing the fact that parts are sometimes occluded and thus invisible, and giving shape and occlusion a probabilistic interpretation.2
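To make the idea concrete, here is a minimal Python sketch of how such a part-based score might be assembled, assuming each part already comes with a map of appearance scores over image locations; the variable names and the simple quadratic deformation penalty are illustrative rather than the authors' exact formulation, which learns a per-part deformation cost:

import numpy as np

def star_model_score(part_scores, anchors, deform=0.1, radius=4):
    # part_scores: list of HxW arrays; appearance score of each part
    #              at every image location (higher = better match)
    # anchors:     ideal (dy, dx) offset of each part from the object root
    # deform:      weight of the quadratic penalty for drifting off anchor
    h, w = part_scores[0].shape
    total = np.zeros((h, w))
    for scores, (ay, ax) in zip(part_scores, anchors):
        best = np.full((h, w), -np.inf)
        # brute force over small displacements around the anchor;
        # wraparound at the image border is ignored for simplicity
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                shifted = np.roll(scores, (-(ay + dy), -(ax + dx)), axis=(0, 1))
                best = np.maximum(best, shifted - deform * (dy * dy + dx * dx))
        total += best  # each part votes for the root locations it supports
    return total       # peaks mark likely object positions

Every location thus scores the best trade-off, for each part, between how well the part matches and how far it had to drift from its preferred position.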
The second ingredient is representing parts (eyes, among others) using patterns of local orientations in the image. This simple idea makes a big difference: it turns out that orientation is less sensitive to changes in lighting conditions and viewpoint than raw pixel values. This observation comes from studying biological vision systems4 and is the foundation of the most successful descriptors for image patches: shape contexts, SIFT, and HOG.1,3,7 The authors here add one twist to the idea: rather than building detectors based on what the part looks like, it is better to build detectors as discriminative classifiers, that is, to optimize their ability to tell the difference between a given part (for example, the head of a pedestrian) and the environment that typically surrounds it (bookshelves, or the shoulders and arms of the pedestrian).
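As a flavor of such a descriptor, here is a toy Python sketch of an orientation-histogram feature in the spirit of HOG;3 it omits the block normalization and interpolation of the real scheme, and the parameter values are merely illustrative:

import numpy as np

def orientation_histograms(image, cell=8, bins=9):
    # gradient magnitude and unsigned orientation in [0, pi)
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    ch, cw = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((ch, cw, bins))
    for i in range(ch * cell):
        for j in range(cw * cell):
            # each pixel votes for its orientation bin, weighted by magnitude
            hist[i // cell, j // cell, bin_idx[i, j]] += mag[i, j]
    # per-cell normalization provides the robustness to lighting changes
    hist /= np.linalg.norm(hist, axis=2, keepdims=True) + 1e-6
    return hist

A part detector is then a linear classifier over such histograms, trained, as described above, to separate the part from its typical surroundings.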
The third ingredient is an efficient search algorithm, originating in Felzenszwalb's thesis,5 which detects an object in a handful of seconds by focusing computation only on the most promising areas of the image.
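The key primitive, from the pictorial structures work,5 is a generalized distance transform that finds the best displacement for every candidate location in time linear in the number of locations, rather than quadratic as in the brute-force sketch above. A one-dimensional version (run along rows, then columns, in 2D) looks roughly like this:

def dt1d(f):
    # computes g[p] = min over q of (p - q)**2 + f[q] in O(n) by
    # maintaining the lower envelope of the parabolas rooted at each q
    n = len(f)
    v = [0] * n                                 # parabola locations
    z = [-float('inf')] + [float('inf')] * n    # envelope boundaries
    k = 0
    for q in range(1, n):
        s = ((f[q] + q * q) - (f[v[k]] + v[k] ** 2)) / (2.0 * (q - v[k]))
        while s <= z[k]:                        # new parabola hides old ones
            k -= 1
            s = ((f[q] + q * q) - (f[v[k]] + v[k] ** 2)) / (2.0 * (q - v[k]))
        k += 1
        v[k], z[k], z[k + 1] = q, s, float('inf')
    g, k = [0.0] * n, 0
    for p in range(n):
        while z[k + 1] < p:
            k += 1
        g[p] = (p - v[k]) ** 2 + f[v[k]]
    return g

Run over a part's (negated) score map, this yields the best deformation for every root position at once, which is what keeps the search over all placements of all parts tractable.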
Is detecting visual categories a solved problem? The reader will be amused by how poorly our best algorithms work. A quick perusal of Table 1 in Felzenszwalb et al. will reveal that, on a good day, less than half of the people in the PASCAL VOC dataset are detected. Boats and birds are even more difficult to find. This is precisely what makes computer vision an exciting field of research today: there is much progress to be made; we are still a few big ideas away from the ultimate design. Twenty years ago we had only nebulous ideas about how to approach visual categorization, and 10 years ago the performance numbers would probably have been in the low single digits.
What is missing? Quite a few things; I will mention a handful. First of all, our models are purely phenomenological, based on statistics of how objects look in 2D images. We do not take into account 3D geometry, nor the properties and materials of surfaces. Second, today's goal is to recognize widely different categories: bottle vs. cat vs. person. There is a whole world of fine distinctions, for example, Anopheles vs. Culex mosquito, Siamese vs. Burmese cat. We do not yet know how to handle such fine-grained classifications. Third, people can learn to recognize new categories from just a few training examples; how many femurs does a medical student need to see to learn the category? Our algorithms must see thousands of training examples to become halfway decent. The mother of all challenges is scaling: there are millions of meaningful visual categories to recognize (10⁵ vertebrate species, 10⁷ insect species, not to speak of shoes, wristwatches, and handbags). We need to develop systems able to train themselves using information available on the Web, and able to tap into the expertise of knowledgeable humans by asking them intelligent questions.
A growing number of talented researchers are hard at work tackling
these questions. It is an exciting moment for computer vision. Stay tuned.
References
1. Belongie, S., Malik, J., and Puzicha, J. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 4 (2002), 509–522.
2. Burl, M., Weber, M., and Perona, P. A probabilistic approach to object recognition using local photometry and global geometry. European Conference on Computer Vision, II (1998), 628–641.
3. Dalal, N. and Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society (2005), 886–893.
4. Edelman, S., Intrator, N., and Poggio, T. Complex cells and object recognition. Unpublished; http://cogprints.org/561/2/199710003.ps.
5. Felzenszwalb, P.F. and Huttenlocher, D.P. Pictorial structures for object recognition. International Journal of Computer Vision 61, 1 (2005), 55–79.
6. Fischler, M. and Elschlager, R. The representation and matching of pictorial structures. IEEE Transactions on Computers 22 (1973), 67–92.
7. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60, 2 (2004), 91–110.
Pietro Perona is the Allen E. Puckett Professor of Electrical Engineering and Computational and Neural Systems at the California Institute of Technology, Pasadena, CA, and director of the Ph.D. program in Computation and Neural Systems at Caltech.
Copyright held by owner/author(s).