Technical Perspective
Seeing the trees, the forest,
and much more
By Pietro Perona
Your portable phone can beat you at
chess, but can it recognize a horse?
Bristling with cameras, microphones,
and other sensors, today’s machines
are nevertheless essentially deaf and
blind; they do not have senses to interact with their environment. In the
meantime, vast amounts of valuable
sensory data are captured, transmitted,
and inexpensively stored every day. TV
programs and movies, fMRI scans,
planetary surveys, footage from security cameras, and digital photographs
pile up and lie fallow on hard drives
around the globe. It is all too much
for humans to organize and access
by hand. Someone has appropriately
called this the “data deluge.” Automating the process of analyzing sensory
data and transforming it into actionable information is one of the most
useful and difficult challenges of modern engineering.
How shall we go about building machines that can see, hear, smell, touch?
Sensory tasks come in all shapes and
forms: reading books, recognizing
people, or hitting tennis balls. It is
expeditious to approach each one as
a separate problem. However, one remarkable fact about our own senses is that
they adapt easily to new environments
and tasks. Our senses evolved to help
us navigate and forage among trees,
rocks, and grass, as well as enable us
to socialize with people. Despite this
history, we can train ourselves to read
text, to recognize galaxies in telescope
images, and to drive fast-moving vehicles. Discovering general laws and
principles that underlie sensory processing might one day allow us to design and build flexible and adaptable
sensory systems for our machines.
In the following paper, Torralba,
Murphy, and Freeman are concerned
with visual recognition. They explore
one principle that has general validity:
the use of context. The authors propose
an elegant and compelling demonstration showing that context is crucial for
recognizing an object when the image
has poor resolution and, as a result,
the object’s picture is ambiguous. That
context may be useful in visual recognition is rather intuitive. However, to design a machine that makes use of context, we must first define what context
is, exactly how one should measure it,
and how these measurements may be
used to recognize objects.
Most researchers to date have side-stepped this baffling chicken-and-egg
issue.
The authors avoid computing explicit scene semantic information.
They start instead by considering easy-to-compute, image-like quantities
that correlate with context. Inspired
by what we know about the human visual system, they compute statistics of
the output of wavelet-like linear filters
applied to the image. These statistics
capture some aspects of the visual statistics of the scene that, in turn, are
indicative of its overall nature: for example, long and vertical structure in
a forest, sparse horizontal structure
in open grassland. Filter statistics are
thus correlated to scene type. Torralba, Murphy, and Freeman call the ensemble of their measurements “gist,”
a term used in psychology to denote
the overall visual meaning of a scene,
which has been shown to be perceived
quickly by human observers [1, 2].
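As a rough illustration of the idea, the following Python sketch computes a gist-like descriptor: it filters an image with a small bank of oriented, wavelet-like filters and averages the response magnitude over a coarse grid. It is a minimal sketch under stated assumptions, not the authors' implementation. The choice of Gabor filters, the filter sizes, the 4 x 4 pooling grid, and the function names (`gabor_kernel`, `filter_fft`, `gist_like_descriptor`) are all illustrative assumptions.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Real-valued Gabor filter: a sinusoid at angle theta under a Gaussian window."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Coordinate along the direction in which the sinusoid oscillates.
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    kernel = envelope * np.cos(2.0 * np.pi * xr / wavelength)
    return kernel - kernel.mean()  # zero mean, so flat regions give no response

def filter_fft(image, kernel):
    """Circular convolution of image with kernel via the FFT (output has image's shape)."""
    padded = np.zeros_like(image, dtype=float)
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    # Move the kernel centre to the origin so the output is not shifted.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))

def gist_like_descriptor(image, n_orientations=4, wavelengths=(4, 8), grid=4):
    """Mean filter-response magnitude in each cell of a grid x grid layout.

    Assumed parameters (not from the article): 2 scales, 4 orientations, 4 x 4 grid.
    The image must be at least as large as the largest kernel (33 x 33 here).
    """
    h, w = image.shape
    features = []
    for wavelength in wavelengths:
        for k in range(n_orientations):
            theta = np.pi * k / n_orientations
            kernel = gabor_kernel(4 * wavelength + 1, wavelength, theta, sigma=wavelength)
            magnitude = np.abs(filter_fft(image, kernel))
            # Crop so the image divides evenly, then average within each grid cell.
            cropped = magnitude[: h - h % grid, : w - w % grid]
            cells = cropped.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
            features.append(cells.ravel())
    return np.concatenate(features)

# Toy input: vertical stripes with an 8-pixel period, a crude stand-in for tree trunks.
cols = np.arange(64)
stripes = np.tile(np.cos(2.0 * np.pi * cols / 8.0), (64, 1))
descriptor = gist_like_descriptor(stripes)
```

On this toy input, the filter tuned to intensity changes along the horizontal direction (theta = 0) responds far more strongly than the one tuned to changes along the vertical direction. This is the kind of cue that would separate a forest of trunks from open grassland.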
The authors find that, surprisingly,
their filter-based gist is rather good at
predicting the number of instances of
a given object category that might be
present in the scene, as well as their
likely position along the y-axis. Combining this with information coming
from object detectors operating independently at each location produces
an overall score for the presence of an
object of a given class at location (x, y).
This is more reliable than using the detectors alone. It looks like it is finally
open season on visual context.
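The combination step can be sketched in a few lines of Python. The sketch below assumes that a context model has already produced, from the gist, a probability that the object is present and an expected image row with some spread. It then multiplies these into the map of local detector outputs. The product rule, the Gaussian preference over rows, and the names `contextual_score`, `presence_prob`, `expected_row`, and `row_spread` are illustrative assumptions and should not be read as the authors' model.

```python
import numpy as np

def contextual_score(detector_prob, presence_prob, expected_row, row_spread):
    """Weight local detector outputs by a context-based preference over rows.

    detector_prob : 2-D array, detector confidence in [0, 1] at each (row, column)
    presence_prob : scalar in [0, 1], context estimate that the object is in the scene
    expected_row  : row (y position) where context expects the object
    row_spread    : standard deviation, in rows, of that expectation
    """
    rows = np.arange(detector_prob.shape[0], dtype=float)
    # Gaussian-shaped weight over rows, equal to 1 at the expected row.
    row_prior = np.exp(-0.5 * ((rows - expected_row) / row_spread) ** 2)
    # Broadcast the per-row weight across all columns.
    return detector_prob * presence_prob * row_prior[:, np.newaxis]

# Toy example: the detector fires equally strongly at two places in one column.
detector = np.full((16, 16), 0.05)
detector[2, 5] = 0.9    # near the top of the image
detector[12, 5] = 0.9   # near the bottom of the image

# Hypothetical context output: object probably present, expected around row 12.
combined = contextual_score(detector, presence_prob=0.8, expected_row=12.0, row_spread=2.0)
best_row, best_col = np.unravel_index(np.argmax(combined), combined.shape)
```

The detector alone cannot choose between the two candidates. Once the context weight is applied, the candidate at the expected height keeps its score and the other is suppressed.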
References
1. Biederman, I. Perceiving real-world scenes. Science
177 (1972), 77-80.
2. Fei-Fei, L., Iyer, A., Koch, C., and Perona, P. What
do we perceive in a glance of a real-world scene?
Journal of Vision 7, 1 (2007), 1-29.
Pietro Perona is the Allen E. Puckett Professor of
Electrical Engineering at the California Institute of
Technology, Pasadena, where he directs Computation and
Neural Systems, a Ph.D. program centered on the study of
biological brains and intelligent machines.