Mellon University’s Robotics Institute,
points out that some objects are easier
to recognize than others. “We can train
for limited sets of objects pretty well,
once we know their expected appearances—the human face, the human
body, cars, things like that. But there
are lots of objects whose appearance is
completely unknown or unspecifiable.
Take ‘package’. People who use that
word only know that it’s three-dimensional, probably box-like, and made of
paper. It tends to be from hand-size to
body-size, because we call it something
else if it’s bigger.” The human face,
on the other hand, is patterned after a
well-defined template and is therefore
easier for VI systems to identify.
From Object to Actor
Once the VI system recognizes objects
with some accuracy, it must understand
how their movements and interactions
constitute actions. Bruce Draper,
an associate professor of computer
science at Colorado State University,
believes a successful system will need
to learn that without explicit training.
“You cannot go in and predefine every
object, every action, every event,” he
says. “You don’t know where the system’s
going to be deployed, and the
world changes all the time. We need to
build systems that learn from watching
the video stream without ever being
told what’s in it.” The key, he says,
is for the system to recognize repeated
actions. “If we had only one video of
someone throwing a football, there’d
be no repeated pattern to learn from,”
says Draper. “But with several, it can
learn that pattern as unique.”
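Draper's point about repetition can be illustrated with a toy sketch. This is not his group's method. It assumes each video clip has already been reduced to a small numeric descriptor (the tuples below are invented), and it uses a simple greedy grouping in place of a real unsupervised learner. The function name `find_repeated_patterns` and the parameters `radius` and `min_repeats` are hypothetical.

```python
from math import dist

def find_repeated_patterns(clips, radius=1.0, min_repeats=2):
    """Group clips greedily by distance to each group's first member.

    Only groups with at least `min_repeats` clips are returned, so a
    motion seen once is never reported as a learned pattern.
    """
    groups = []  # each group is a list of clip indices; index 0 is the seed
    for i, features in enumerate(clips):
        for group in groups:
            if dist(clips[group[0]], features) <= radius:
                group.append(i)
                break
        else:
            groups.append([i])  # no nearby group: start a new one
    return [g for g in groups if len(g) >= min_repeats]

# Invented descriptors: three similar motions and one unrelated one-off.
clips = [
    (0.0, 0.0),
    (0.2, 0.1),
    (5.0, 5.0),
    (0.1, 0.3),
]
patterns = find_repeated_patterns(clips)
```

Here the three similar clips form one group, while the one-off motion is discarded because nothing repeats it.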
After the VI system can recognize actions,
it still has to name them. NASA
Jet Propulsion Laboratory Principal
Investigator Michael Burl believes
that a VI system should do more than
just flash “run” and “approach” on
the screen when those actions occur.
“We want to go beyond text descriptions
by extracting and transmitting a
‘script’ that can be used to regenerate
what happened in the video,” he says.
To that end, Burl and co-investigator
Russell Knight are using planning-executing
agent (PEA) graphical models
to provide an abstract representation
of various behaviors. “Take ‘throw,’”
says Burl. “It takes two arguments:
The agent doing the throwing, and the
object being thrown. For the action itself
you’d expect to see transitions between
several states, such as a windup,
forward motion of the arm, then the
concepts of ‘separate’ and ‘fly’ being
applied to the object. PEA models are
hierarchical so complex actions can be
composed from simpler ones. By identifying
the PEA models being used by
the agents, we obtain a compact, generative
script that provides a summary
of the full video.”
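The structure Burl describes for "throw" can be sketched as a small data structure: an action with arguments that is composed of simpler actions and can be flattened into a script. This is only an illustration of the hierarchical idea, not the PEA formalism itself and not a recognizer. The class `Action`, the function `throw`, and the step name `arm_forward` are hypothetical; only the windup, forward arm motion, "separate," and "fly" stages come from the description above.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    args: tuple
    steps: list = field(default_factory=list)  # sub-actions; empty means primitive

    def script(self):
        """Flatten the hierarchy into an ordered list of primitive steps."""
        if not self.steps:
            return [f"{self.name}({', '.join(self.args)})"]
        lines = []
        for step in self.steps:
            lines.extend(step.script())
        return lines

def throw(agent, obj):
    """Compose 'throw' from simpler actions, taking an agent and an object."""
    return Action("throw", (agent, obj), [
        Action("windup", (agent,)),
        Action("arm_forward", (agent,)),
        Action("separate", (obj, agent)),
        Action("fly", (obj,)),
    ])

script = throw("person1", "ball").script()
```

Because each step is itself an `Action`, a more complex behavior could list `throw(...)` among its own steps and still flatten to a single ordered script.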
Beyond the Battlefield
DARPA’s Donlon says the target verbs
were chosen to be both relevant and
wide-ranging. “Some verbs are what
soldiers would need to know on the
battlefield, but the list has a lot of diversity,” he notes. In fact, all of the listed
verbs have some applicability in non-military situations, raising the question: What will VI technology be like
when it reaches the civilian sphere?
“We’re getting a lot of interest
in non-military applications,” says
Draper. “For example, the National
Institutes of Health wants to figure
out what kids are doing on the playground.
Not what any individual child
is doing, but which pieces of equipment
are encouraging them to be active.
And there’s an existing surveillance
market that detects motion and
determines whether it’s caused by a
person. That’s wonderful to protect a
perimeter, but what if someone grabs
their chest and falls down in the middle
of a public square? That’s an activity
that’s wrong, not just a matter of
‘someone who shouldn’t be there.’”
However, the presence of intelligent
cameras in the public sector
could have a chilling effect, says Jay
Stanley, senior policy analyst in the
Speech, Privacy and Technology Program
of the American Civil Liberties
Union. “It’s fine to use this technology
in military applications on overseas
battlefields or in certain law enforcement
situations where there’s probable
cause and a warrant,” he says. “And
perhaps it could be used by individuals
if it becomes integrated into consumer
products, the way face recognition
has in a limited way. But there are
two classes of concerns. The first is
that it works really well, and so intensifies
existing concerns about surveillance.
The second is that it works very
poorly. False alarms can be just as bad
for people as accurate ones, and there
are all kinds of gradations in what a
false positive is. Perhaps the computer
correctly interprets your behavior but
there’s a perfectly innocent explanation.
Or perhaps the computer totally
misunderstands your behavior. And
there are a lot of points in between.”
Donlon believes the benefits of
the Mind’s Eye project will outweigh
such risks. “One of the things inspiring
about visual intelligence is that, if
we can solve this range of tasks on this
range of verbs—even partially—there
will be a wide range of both military
and commercial applications,” he
says. “We use one potential military
outcome as the ultimate goal, but it’s
just one exemplar, really, of the benefit
we’ll get. It just seems intuitively
obvious that there’s a very rich potential
commercial market for smart
cameras—for commercial security
applications, or for loss prevention
in retail. For all of them, attending to
alerts and dismissing false alarms is a
lot less eyeball-intensive than staring
at a video feed 24/7.”
Further Reading
Barghout, L.
Empirical data on the configural
architecture of human scene perception,
Journal of Vision 9, 8, August 5, 2009.
Draper, B.
Early results in micro-action detection,
http://www.cs.colostate.edu/~draper/newsite/index.php/research/visual-intelligence-through-latent-geometry-and-selective-guidan/early-micro-action-detection/,
Colorado State University video,
January 2011.
Lui, Y., Beveridge, J., and Kirby, M.
Action classification on product manifolds,
2010 IEEE Conference on Computer Vision
and Pattern Recognition, San Francisco, CA,
June 13–18, 2010.
O’Hara, S., Lui, Y., and Draper, B.A.
Unsupervised learning of human
expressions, gestures, and actions, 2011
IEEE Conference on Automatic Face and
Gesture Recognition, Santa Barbara, CA,
March 21–25, 2011.
Poppe, R.
A survey on vision-based human action
recognition, Image and Vision Computing
28, 6, June 2010.
Tom Geller is an Oberlin, OH-based science, technology,
and business writer.