How does the model29 perform real-world recognition tasks? And how does it compare to state-of-the-art artificial-intelligence systems? Given the specific biological constraints the theory must satisfy (such as using only biophysically plausible operations, receptive field sizes, and a range of invariances), it was not clear how well the model implementation would perform compared to systems heuristically engineered for these complex tasks.
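Roughly speaking, the biophysically plausible operations the theory allows come down to two: Gaussian-like tuning of a unit to a stored template, and MAX-like pooling over afferents to build invariance. A minimal numpy sketch of these two operations (sizes, names, and parameter values are illustrative, not the published implementation):

```python
import numpy as np

def tuning(patch, templates, sigma=1.0):
    """Gaussian-like template matching (simple-cell-like operation): the
    response falls off with the distance between the input patch and each
    stored template."""
    d2 = ((templates - patch.ravel()) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

def max_pool(responses, size=2):
    """MAX pooling (complex-cell-like operation): invariance is built by
    taking the maximum over a local pool of afferent responses."""
    n = (len(responses) // size) * size
    return responses[:n].reshape(-1, size).max(axis=1)

rng = np.random.default_rng(0)
templates = rng.normal(size=(8, 16))   # 8 stored 4x4 templates (hypothetical sizes)
patch = rng.normal(size=(4, 4))        # one input patch
s = tuning(patch, templates)           # 8 tuned ("S"-type) responses
c = max_pool(s, size=2)                # 4 pooled ("C"-type) responses
```

Alternating these two operations in a hierarchy trades selectivity (tuning) against invariance (pooling) at each stage.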
Several years ago, we were surprised to find the model capable of recognizing complex images,27 performing at a level comparable to some of the best existing systems on the Caltech-101 image database of 101 object categories, with a recognition rate of about 55% (chance level < 1%); see Serre et al.27 and Mutch and Lowe.19 A related system with fewer layers, less invariance, and more units had an even better recognition rate on the Caltech data set.20
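For context on those numbers: the chance level quoted above is simply 1/101, and Caltech-101 results are conventionally reported as per-class accuracy. A small illustrative sketch, in which a random predictor and toy label arrays stand in for a real classifier and test set:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 101

# Chance level for 101-way classification: 1/101, i.e., below 1%.
chance = 1.0 / n_classes

# Per-class accuracy: average the accuracy of each class separately, so
# categories with many test images do not dominate the overall score.
y_true = np.repeat(np.arange(n_classes), 20)           # 20 test images per class
y_pred = rng.integers(0, n_classes, size=y_true.size)  # random guesser
per_class = [np.mean(y_pred[y_true == c] == c) for c in range(n_classes)]
score = float(np.mean(per_class))                      # hovers near `chance`
```

Against this baseline, a rate of about 55% is more than fifty times chance.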
We also developed an automated system for parsing street-scene images27 based in part on the class of models described earlier. The system recognizes seven different object categories (cars, pedestrians, bikes, skies, roads, buildings, and trees) in natural images of street scenes, despite very large variations in shape (such as trees in summer and winter, and SUVs and compact cars seen from any point of view).
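The published system's details are in the cited work; purely as an illustration of the general shape of such a parser, a sliding-window labeling loop might look like the following (the window size, stride, and placeholder classifier are all hypothetical):

```python
import numpy as np

CATEGORIES = ["car", "pedestrian", "bike", "sky", "road", "building", "tree"]

def classify_patch(patch):
    """Stand-in for a real classifier: pick the category with the highest
    score. A real system would score model-derived features of the patch."""
    scores = patch.mean() * np.ones(len(CATEGORIES))  # placeholder scores
    return int(np.argmax(scores))

def parse_scene(image, win=32, stride=32):
    """Assign one of the seven categories to each window of the image."""
    h, w = image.shape[:2]
    labels = {}
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            labels[(y, x)] = CATEGORIES[classify_patch(image[y:y+win, x:x+win])]
    return labels

labels = parse_scene(np.zeros((64, 64)))  # a 2x2 grid of windows
```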
Content-based recognition and search in videos is an emerging application of computer vision, where neuroscience may again suggest an avenue for approaching the problem. In 2007, we developed an initial model for recognizing biological motion and actions from video sequences based on the organization of the dorsal stream of the visual cortex,13 which is critically linked to the processing of motion information, from V1 and MT to the higher motion-selective areas MST/FST and STS. The system relies on computational principles similar to those in the model of the ventral stream described earlier, but starts with spatio-temporal filters modeled after motion-sensitive cells in the primary visual cortex.
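Spatio-temporal filters of this kind are commonly modeled as space-time Gabor functions, that is, a drifting sinusoid under a Gaussian envelope. A minimal sketch, with illustrative parameter values rather than the published ones:

```python
import numpy as np

def spacetime_gabor(size=9, frames=5, fx=0.15, ft=0.2, sigma=2.5):
    """A drifting space-time Gabor: a sinusoid whose phase advances across
    frames (preferred speed = ft/fx), under a Gaussian envelope."""
    x = np.arange(size) - size // 2
    t = np.arange(frames) - frames // 2
    X, T = np.meshgrid(x, t, indexing="ij")            # shape (size, frames)
    envelope = np.exp(-(X**2 + T**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * (fx * X + ft * T))
    return envelope * carrier

def filter_response(clip, filt):
    """Dot product of a space-time image patch with the filter."""
    return float((clip * filt).sum())

filt = spacetime_gabor()
preferred = filter_response(filt, filt)            # stimulus at preferred velocity
reversed_ = filter_response(filt[:, ::-1], filt)   # same stimulus, time-reversed
```

A filter built this way responds more strongly to a stimulus drifting in its preferred direction than to the same stimulus played in reverse, which is the direction selectivity attributed to motion-sensitive cells in V1.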
Black corresponds to data used to derive the parameters of the model, red to data consistent with the model (not used to fit model parameters), and blue to actual correct predictions by the model. Notations: PFC (prefrontal cortex), V1 (visual area I, or primary visual cortex), V4 (visual area IV), and IT (inferotemporal cortex). Data from these areas correspond to monkey electrophysiology studies. LOC (lateral occipital complex) involves fMRI with humans. The psychological studies are psychophysics on human subjects.

Area    Type of data                                        Ref. biol. data   Ref. model data
Psych.  Rapid animal categorization                         (1)               (1)
        Face inversion effect                               (2)               (2)
LOC     Face processing (fMRI)                              (3)               (3)
PFC     Differential role of IT and PFC in categorization   (4)               (5)
IT      Tuning and invariance properties                    (6)               (5)
        Read out for object category                        (7)               (8, 9)
        Average effect in IT                                (10)              (10)
V4      MAX operation                                       (11)              (5)
        Tuning for two-bar stimuli                          (12, 13)          (8, 9)
        Tuning for boundary conformation                    (14)              (8, 15)
        Tuning for Cartesian and non-Cartesian gratings     (16)              (8)
V1      Simple and complex cells tuning properties          (17–19)           (8)
        MAX operation in subset of complex cells            (20)              (5)

1. Serre, T., Oliva, A., and Poggio, T. Proc. Natl. Acad. Sci. 104, 6424 (Apr. 2007).
2. Riesenhuber, M. et al. Proc. Biol. Sci. 271, S448 (2004).
3. Jiang, X. et al. Neuron 50, 159 (2006).
4. Freedman, D.J., Riesenhuber, M., Poggio, T., and Miller, E.K. Journ. Neurosci. 23, 5235 (2003).
5. Riesenhuber, M. and Poggio, T. Nature Neuroscience 2, 1019 (1999).
6. Logothetis, N.K., Pauls, J., and Poggio, T. Curr. Biol. 5, 552 (May 1995).
7. Hung, C.P., Kreiman, G., Poggio, T., and DiCarlo, J.J. Science 310, 863 (Nov. 2005).
8. Serre, T. et al. MIT AI Memo 2005-036 / CBCL Memo 259 (2005).
9. Serre, T. et al. Prog. Brain Res. 165, 33 (2007).
10. Zoccolan, D., Kouh, M., Poggio, T., and DiCarlo, J.J. Journ. Neurosci. 27, 12292 (2007).
11. Gawne, T.J. and Martin, J.M. Journ. Neurophysiol. 88, 1128 (2002).
12. Reynolds, J.H., Chelazzi, L., and Desimone, R. Journ. Neurosci. 19, 1736 (Mar. 1999).
13. Taylor, K., Mandon, S., Freiwald, W.A., and Kreiter, A.K. Cereb. Cortex 15, 1424 (2005).
14. Pasupathy, A. and Connor, C. Journ. Neurophysiol. 82, 2490 (1999).
15. Cadieu, C. et al. Journ. Neurophysiol. 98, 1733 (2007).
16. Gallant, J.L. et al. Journ. Neurophysiol. 76, 2718 (1996).
17. Schiller, P.H., Finlay, B.L., and Volman, S.F. Journ. Neurophysiol. 39, 1288 (1976).
18. Hubel, D.H. and Wiesel, T.N. Journ. Physiol. 160, 106 (1962).
19. De Valois, R.L., Albrecht, D.G., and Thorell, L.G. Vision Res. 22, 545 (1982).
20. Lampl, I., Ferster, D., Poggio, T., and Riesenhuber, M. Journ. Neurophysiol. 92, 2704 (2004).

We evaluated system performance for recognizing actions (human and animal) in real-world video sequences,13 finding that the model of the dorsal stream competed with a state-of-the-art action-recognition system (one that had outperformed many other systems) on all three data sets.13 A direct extension of this approach led to a computer system for the automated monitoring and analysis of rodent behavior for behavioral-phenotyping applications that performs on par with human manual scoring. We also found that learning in this model produced a large dictionary of optic-flow patterns that seems consistent with the response properties of cells in the middle temporal (MT) area in response to both isolated gratings and plaids, or two gratings superimposed on one another.
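For readers unfamiliar with the stimuli: a plaid is just the superposition of two drifting sinusoidal gratings at different orientations. The frequencies and angles below are arbitrary:

```python
import numpy as np

def drifting_grating(shape, theta, sf=0.1, tf=0.1, t=0, contrast=0.5):
    """Sinusoidal grating at orientation theta (radians), spatial frequency
    sf (cycles/pixel), drifting at temporal frequency tf (cycles/frame)."""
    y, x = np.mgrid[0:shape[0], 0:shape[1]]
    phase = 2 * np.pi * (sf * (x * np.cos(theta) + y * np.sin(theta)) - tf * t)
    return contrast * np.cos(phase)

shape = (64, 64)
g1 = drifting_grating(shape, theta=np.deg2rad(+60))
g2 = drifting_grating(shape, theta=np.deg2rad(-60))
plaid = g1 + g2   # two gratings superimposed on one another
```

In the classic MT experiments, pattern cells respond to the coherent motion of the plaid as a whole, while component cells respond to the motion of each constituent grating.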
Demonstrating that a model designed
to mimic known anatomy and physiol-