Technology | DOI: 10.1145/2001269.2001276
Tom Geller
Seeing Is Not Enough
A new DARPA program is teaching cameras visual
intelligence—how to spot and understand human behavior.
A man waits by the public square of a war-torn city. A woman walks by and hands him something, then quickly walks away.
He waits there another 15 minutes before someone else gives him another
package and sprints off. The subject
hurriedly puts together the two pieces,
walks to the center of the square, puts
down the new assemblage, and leaves.
What are the pieces? What did they
compose? Why did the man leave after putting them together? The answers to these questions could presage something horrible—like a bomb
explosion—or something ordinary,
like a city worker installing signage. A
human observer would know to follow
up, perhaps by examining the placed
object or detaining the man who put
it there. But for a surveillance camera,
these actions are no more suspicious
than those of an ice cream vendor or of
kids playing soccer. The camera sees,
but it does not understand.
Image courtesy of DARPA
The U.S. Defense Advanced Research Projects Agency (DARPA)
aims to change that with its Mind’s
Eye program, which first sought participants in March 2010, launched
that September, and announced its
15 contractors in January 2011. The
five-year program provides funding
for 12 research teams to develop
“fundamental machine-based visual intelligence” (VI), as well as three implementation teams that will integrate VI
technologies with portable, camera-bearing, combat-ready unmanned
ground vehicles. Funding for the first
year totals about $5 million, or an average of about $333,000 per contractor. It will be followed by $10 million
the next year and $16 million the year after,
with further funding to be determined
as the program progresses. To ensure
the final products have military usefulness, DARPA has engaged the Army
Research Laboratory as its “customer”
throughout the process.
Figure: The Mind’s Eye project has the ambitious goal of teaching cameras to recognize and name
the “nouns” and “verbs” of actions, such as one person giving an object to another person.
According to DARPA Program Manager James Donlon, action recognition
research has traditionally focused on
narrowly defined problems, solved
with incrementally higher degrees
of performance. By comparison, the
Mind’s Eye project aims to markedly
advance the field by converting video
streams to simple descriptions of the
actions they depict. “DARPA’s in the
perfect role to identify problems that
are almost ridiculously difficult, compared to the current state of the art,”
Donlon says. “We know that performance will be lower than you’d have on
a set of tightly controlled data. But then
the questions are, What did we learn as
a result? What needs to be developed
next to get better performance?”
Seeing Things as They Are
The Mind’s Eye project requires researchers
to attack four tasks: recognition
of actions in a scene; description of
the actions being performed; gap-filling
to make accurate assumptions of what’s
left out of a scene, including predictions
of what came before and what will follow
after it; and anomaly detection to
identify actions that are unusual in the
context of the entire video. It builds on
past achievements in object recognition
(the “nouns” of VI) to establish
methods to recognize the “verbs” of action.
To focus efforts, DARPA has identified
48 specific verbs of interest such as
“approach,” “fly,” and “walk.”
Action recognition is a surprisingly
difficult task for a VI system, although
humans do it without thinking. First,
the system must separate active objects
from the background, a task the world
makes difficult with such distractions
as tree branches blowing in the breeze.
Even after a VI system discounts irrelevant
movement, the scene could contain
multiple active objects that require
examination, and the crucial action
could depend on one, some, or all of
them. As Lauren Barghout, founder of
the vision technology firm Eyegorithm,
describes it, “You can refer to something
as a group of objects or nouns: ‘two
cupcakes’. If you employ the ‘spotlight
theory of attention,’ you pay attention
to the area they occupy, but you might
find that the center of that area is just
an empty space. Or you could follow the
‘object-based’ theory, in which case you
have to determine whether the ‘object’
is one cupcake or both cupcakes.”
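Barghout's cupcake example can be made concrete with a toy sketch. The Python below is an illustration only, not anything the Mind's Eye teams are known to use: the binary grid, the function names, and the choice of a bounding-box center for the "spotlight" and 4-connected grouping for the "object-based" view are all assumptions made for this example.

```python
# Toy binary scene (hypothetical data): 1 = cupcake pixel, 0 = background.
SCENE = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def foreground_pixels(scene):
    """Return (row, col) for every non-background pixel."""
    return [(r, c) for r, row in enumerate(scene) for c, v in enumerate(row) if v]


def spotlight_center(scene):
    """'Spotlight' view: center of the bounding box around all foreground pixels."""
    pixels = foreground_pixels(scene)
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return ((min(rows) + max(rows)) // 2, (min(cols) + max(cols)) // 2)


def label_objects(scene):
    """'Object-based' view: group foreground pixels into 4-connected components."""
    unvisited = set(foreground_pixels(scene))
    objects = []
    while unvisited:
        stack = [unvisited.pop()]
        component = set(stack)
        while stack:
            r, c = stack.pop()
            for neighbor in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    component.add(neighbor)
                    stack.append(neighbor)
        objects.append(component)
    return objects


if __name__ == "__main__":
    r, c = spotlight_center(SCENE)
    print("spotlight center:", (r, c), "on an object?", bool(SCENE[r][c]))
    print("objects found:", len(label_objects(SCENE)))
```

In this toy scene the spotlight lands on background between the two cupcakes, while the object-based grouping finds two separate objects and leaves open the question Barghout raises: whether the "object" of interest is one of them or the pair.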
The visual system must also recognize
and deconstruct those objects.
Takeo Kanade, a professor at Carnegie