stream of the visual cortex, neurons
become selective for stimuli that are
increasingly complex—from simple
oriented bars and edges in early visual
area V1 to moderately complex features in intermediate areas (such as a
combination of orientations) to complex objects and faces in higher visual
areas (such as IT). Along with this increase in complexity of the preferred
stimulus, the invariance properties of
neurons seem to also increase. Neurons become more and more tolerant
with respect to the exact position and
scale of the stimulus within their receptive fields. As a result, the receptive
field size of neurons increases from
about one degree or less in V1 to several degrees in IT.
Compelling evidence suggests that
IT, which has been critically linked
with a monkey’s ability to recognize
objects, provides a representation of
the image that facilitates recognition
tolerant of image transformations. For
instance, Logothetis et al.16 showed
that monkeys can be trained to recognize paperclip-like wireframe objects
at a specific location and scale. After
training, recordings in their IT cortex revealed significant selectivity for
the trained objects. Because monkeys
were unlikely to have been in contact
with the specific paperclip prior to
training, this experiment provides indirect evidence of learning. More important, Logothetis et al.16 found that the selective neurons also exhibited a range
of invariance with respect to the exact
position (two to four degrees) and
scale (around two octaves) of the stimulus, which had never been presented at these new positions and scales before testing. In 2005, Hung et al.12 showed it
was possible to train a (linear) classifier to robustly read out from a population of IT neurons the category information of a briefly flashed stimulus.
Hung et al. also showed the classifier
was able to generalize to a range of
positions and scales (similar to Logothetis et al.’s data) not presented during the training of the classifier. This
generalization suggests the observed
tolerance to 2D transformation is a
property of the population of neurons
learned from visual experience but
available for a novel object without
object-specific learning, depending
on task difficulty.
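The logic of such a population read-out can be illustrated with a toy simulation. The sketch below uses synthetic population responses, not the actual IT recordings of Hung et al.; a change of stimulus position is modeled crudely as a gain change that preserves each object's response pattern across units (all numbers are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_trials = 50, 100

# Each object evokes a characteristic pattern across the population;
# a position/scale change is modeled as a gain that preserves the pattern.
pattern_a, pattern_b = rng.random(n_units), rng.random(n_units)

def population_response(pattern, gain):
    # Trial-to-trial variability modeled as small additive noise
    return gain * pattern + 0.05 * rng.standard_normal(n_units)

# Train a linear read-out at the trained position only (gain = 1.0)
X = np.array([population_response(p, 1.0)
              for _ in range(n_trials) for p in (pattern_a, pattern_b)])
y = np.tile([1.0, -1.0], n_trials)
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares linear classifier

# Test at an untrained position (gain = 0.6)
X_test = np.array([population_response(p, 0.6)
                   for _ in range(n_trials) for p in (pattern_a, pattern_b)])
accuracy = np.mean(np.sign(X_test @ w) == y)
```

Because a pure gain change never flips the sign of the linear decision value, the classifier trained at one position generalizes to the untrained one — a cartoon of the population-level tolerance described above.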
Computational Models of Object Recognition in Cortex
We developed26,29 (in close cooperation with experimental labs) an initial
quantitative model of feedforward hierarchical processing in the ventral
stream of the visual cortex (see Figure
2). The resulting model effectively integrates the large body of neuroscience
data (summarized earlier) characterizing the properties of neurons along
the object-recognition processing hierarchy. The model also mimics human
performance in difficult visual-recognition tasks28 while performing at least as well as most current computer-vision systems.27
Feedforward hierarchical models have a long history, beginning in
the 1970s with Marko and Giebel’s
homogeneous multilayered architecture17 and later Fukushima's Neocognitron.6 One of their key computational mechanisms originates from
the pioneering physiological studies and models of Hubel and Wiesel
(http://serre-lab.clps.brown.edu/resources/ACM2010). The basic idea is
to build an increasingly complex and
invariant object representation in a
hierarchy of stages by progressively
integrating, or pooling, convergent
inputs from lower levels. Building on
existing models (see supplementary notes http://serre-lab.clps.brown.edu/resources/ACM2010), we have
been developing24,29 a similar computational theory that attempts to quantitatively account for a host of recent anatomical and physiological data; see also Mutch and Lowe19 and Masquelier et al.18
The feedforward hierarchical model in Figure 2 assumes two classes of
functional units: simple and complex.
Simple units act as local template-matching
operators, increasing the complexity of
the image representation by pooling
over local afferent units with selectivity for different image features (such as
edges at different orientations). Complex units increase the tolerance of the representation with respect to 2D transformations by pooling over afferent units with similar selectivity but slightly different positions and scales.
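The two unit types can be sketched in a few lines of Python. This is an illustrative toy, not the model's actual implementation; the Gaussian tuning width, template, and image here are arbitrary choices:

```python
import numpy as np

def simple_unit(patch, template, sigma=1.0):
    # Template matching: Gaussian tuning peaks when patch equals template
    return np.exp(-np.sum((patch - template) ** 2) / (2 * sigma ** 2))

def complex_unit(image, template):
    # Invariance: max-pool over simple units that share a template
    # but are centered at different positions in the receptive field
    h, w = template.shape
    return max(simple_unit(image[i:i + h, j:j + w], template)
               for i in range(image.shape[0] - h + 1)
               for j in range(image.shape[1] - w + 1))

template = np.array([[1., -1.], [1., -1.]])  # vertical-edge template
img = np.zeros((4, 4))
img[1:3, 2:4] = template                     # edge placed off-center

weak = simple_unit(img[0:2, 0:2], template)  # unit at the "wrong" position
strong = complex_unit(img, template)         # pooling finds the best match
```

Because the complex unit takes the maximum over positions, its response (`strong`, here 1.0) stays high wherever the edge appears, while any single simple unit (`weak`) responds only at its own location. Pooling over scales works the same way, with simple units tuned to rescaled copies of the template.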
Learning and plasticity. How the
organization of the visual cortex is influenced by development vs. genetics
is a matter of debate. An fMRI study21
showed the patterns of neural activity
elicited by certain ecologically important classes of objects (such as faces and places) in monozygotic twins are significantly more similar than in dizygotic twins. These results suggest
that genes may play a significant role
in the way the visual cortex is wired to
process certain object classes. Meanwhile, several electrophysiological
studies have demonstrated learning
and plasticity in the adult monkey;
see, for instance, Li and DiCarlo.15
Learning is likely to be both faster and
easier to elicit in higher visually responsive areas (such as PFC and IT15)
than in lower areas.
This learning result makes intuitive sense. For the visual system to remain stable, the time scale for learning
should increase as one ascends the ventral
stream.d In the Figure 2 model, we assumed unsupervised learning from V1
to IT happens during development in
a sequence starting with the lower areas. In reality, learning might continue
throughout adulthood, certainly at the
level of IT and perhaps in intermediate
and lower areas as well.
Unsupervised learning in the ventral
stream of the visual cortex. With the exception of the task-specific units at the
top of the hierarchy (“visual routines”),
learning in the model in Figure 2 is unsupervised, thus closely mimicking a
developmental learning stage.
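A minimal sketch of such unsupervised learning is template "imprinting": units simply store patches sampled at random from unlabeled images, which then serve as their preferred stimuli. The patch size, counts, normalization, and random frames below are arbitrary illustrative choices standing in for natural video:

```python
import numpy as np

def imprint_templates(images, n_templates, size, rng):
    """Store randomly sampled patches as unit templates -- no labels needed."""
    templates = []
    for _ in range(n_templates):
        img = images[rng.integers(len(images))]
        i = rng.integers(img.shape[0] - size + 1)
        j = rng.integers(img.shape[1] - size + 1)
        patch = img[i:i + size, j:j + size].astype(float)
        patch -= patch.mean()              # discard mean luminance
        norm = np.linalg.norm(patch)
        templates.append(patch / norm if norm > 0 else patch)
    return templates

rng = np.random.default_rng(0)
frames = [rng.random((32, 32)) for _ in range(10)]  # stand-in for video frames
units = imprint_templates(frames, n_templates=100, size=4, rng=rng)
```

Each stored template then defines the selectivity of one simple-type unit; no category labels enter at any point, which is what makes the scheme a plausible cartoon of developmental learning.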
As emphasized by several authors,
statistical regularities in natural visual
scenes may provide critical cues to the
visual system for learning with very
limited or no supervision. A key goal of
the visual system may be to adapt to the
statistics of its natural environment
through visual experience and perhaps
evolution, too. In the Figure 2 model,
the selectivity of simple and complex
units can be learned from natural video sequences (see supplementary ma-
d In the hierarchical model in Figure 1, learning
proceeds layer by layer, starting at the bottom,
a process similar to recent work by Hinton11
but is quite different from the original neural networks, which used back-propagation to learn all layers simultaneously. Our implementation includes the unsupervised learning of features from natural images but not the learning of position and scale tolerance, which is thus hardwired in the model; see Masquelier et al.18 for an initial attempt at learning position and scale tolerance in the model.