the problem in (A). Learning to categorize the data-points in (B) requires far fewer training examples than in (A) and may be done with as few as two examples. The key problem in vision is thus what can be learned effectively with only a small number of examples.c
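The role of the representation in sample complexity can be made concrete with a toy sketch (an assumption for illustration, not the article's actual figure): when a representation separates the two classes well, even a single training example per class suffices to generalize. The data and the nearest-prototype rule here are hypothetical.

```python
# Toy illustration of low sample complexity: in a representation where the
# two classes are well separated, one training example per class already
# generalizes. All data and parameters here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated classes in a "good" 2-D representation.
class_a = rng.normal(loc=[-3.0, 0.0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[+3.0, 0.0], scale=0.5, size=(50, 2))

# Train on just two examples: the first point of each class.
proto_a, proto_b = class_a[0], class_b[0]

def predict(x):
    """Assign x to the class of the nearer of the two training examples."""
    return 0 if np.linalg.norm(x - proto_a) < np.linalg.norm(x - proto_b) else 1

test = np.vstack([class_a[1:], class_b[1:]])
labels = np.array([0] * 49 + [1] * 49)
accuracy = np.mean([predict(x) for x in test] == labels)
print(f"accuracy with 2 training examples: {accuracy:.2f}")
```

With six standard deviations between the class means, the two-example classifier labels all held-out points correctly; with overlapping, "tangled" classes the same rule would need many more examples.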
More recent work in computer vision suggests a hierarchical architecture may provide a better solution to the problem; see also Bengio and Le Cun1 for a related argument. For instance, Heisele et al.10 designed a hierarchical system for the detection and recognition of faces, an approach based on a hierarchy of "component experts" performing a local search for one facial component (such as an eye or a nose) over a range of positions and scales. Experimental evidence from Heisele et al.10 suggests such hierarchical systems based exclusively on linear (SVM) classifiers significantly outperform a shallow architecture that tries to classify a face as a whole, even when the latter relies on more complex kernels.
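The component-expert idea can be sketched in a few lines. This is a hedged toy in the spirit of the architecture just described, not Heisele et al.'s trained system: each component detector is a linear filter scanned over the window (the local search that buys position tolerance), and a second linear stage combines the component scores. The filter sizes and random weights are placeholders.

```python
# Hedged sketch of a two-level "component experts" detector: linear component
# filters are scanned over the window, and their maximum responses feed a
# second linear classifier. Weights are random placeholders, not trained SVMs.
import numpy as np

rng = np.random.default_rng(1)

PATCH = 8          # component filter size (hypothetical)
IMAGE = 32         # input window size (hypothetical)
N_COMPONENTS = 3   # e.g. left eye, right eye, nose

# Level 1: one linear filter per facial component.
component_filters = rng.normal(size=(N_COMPONENTS, PATCH, PATCH))

# Level 2: linear combination classifier over the pooled component scores.
combiner_w = rng.normal(size=N_COMPONENTS)
combiner_b = 0.0

def component_scores(image):
    """Max response of each component filter over all positions
    (the 'local search' that provides tolerance to position)."""
    scores = np.empty(N_COMPONENTS)
    for k, f in enumerate(component_filters):
        best = -np.inf
        for i in range(IMAGE - PATCH + 1):
            for j in range(IMAGE - PATCH + 1):
                best = max(best, np.sum(f * image[i:i + PATCH, j:j + PATCH]))
        scores[k] = best
    return scores

def classify(image):
    """Face / non-face decision from the pooled component scores."""
    return combiner_w @ component_scores(image) + combiner_b > 0

image = rng.normal(size=(IMAGE, IMAGE))
print(classify(image))
```

The design point is that each expert stays linear and cheap; tolerance to where a component sits in the window comes from the max over positions, not from a more complex kernel.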
The visual system may be using a similar strategy to recognize objects, with the goal of reducing the sample complexity of the classification problem. In this view, the visual cortex transforms the raw image into a position- and scale-tolerant representation through a hierarchy of processing stages, whereby each layer gradually increases the tolerance to position and scale of the image representation. After several layers of such processing stages, the resulting image representation can be used much more efficiently for task-dependent learning and classification by higher brain areas.

c The idea of sample complexity is related to the point made by DiCarlo and Cox4 that the main goal of processing information from the retina to higher visual areas is "untangling object representations," so that a simple linear classifier can discriminate between any two classes of objects.
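How a layer can gradually increase position tolerance is easy to sketch. The following toy (an HMAX-flavored assumption, not the article's specific model) alternates a template-matching stage with a max-pooling stage in one dimension: after pooling over neighborhoods of positions, a small shift of the input leaves the representation unchanged.

```python
# Minimal sketch of how alternating template matching and max pooling build
# position tolerance (a hypothetical 1-D toy, not a cortical model).
import numpy as np

rng = np.random.default_rng(2)

def s_layer(x, template):
    """'Simple' stage: template matching at every valid position."""
    n = len(template)
    return np.array([x[i:i + n] @ template for i in range(len(x) - n + 1)])

def c_layer(r, pool=4):
    """'Complex' stage: max pooling over local neighborhoods of positions."""
    return np.array([r[i:i + pool].max() for i in range(0, len(r) - pool + 1, pool)])

template = rng.normal(size=5)
signal = np.zeros(64)
signal[20:25] = template           # embed the pattern at position 20
shifted = np.roll(signal, 2)       # same pattern, shifted by 2 positions

out = c_layer(s_layer(signal, template))
out_shifted = c_layer(s_layer(shifted, template))

# The pooled response peaks at the same unit despite the shift.
print(np.argmax(out) == np.argmax(out_shifted))
```

Stacking several such pairs widens the pooling range at each level, which is the sense in which tolerance to position (and, with multi-scale templates, to scale) increases gradually through the hierarchy.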
These stages can be learned during development from temporal streams of natural images by exploiting the statistics of natural environments in two ways: correlations over images that provide information-rich features at various levels of complexity and sizes; and correlations over time used to learn equivalence classes of these features under transformations (such as shifts in position and changes in scale). The combination of these two learning processes allows efficient sharing of visual features between object categories and makes learning new objects and categories easier, since they inherit the invariance properties of the representation learned from previous experience in the form of basic features common to other objects. In the following sections, we review evidence for this hierarchical architecture and the two correlation mechanisms described earlier.
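The second mechanism, learning from correlations over time, can be sketched with a trace-style Hebbian update (an assumption here, in the spirit of trace learning rules, not the article's specific algorithm): a unit's weights move toward a slowly decaying temporal average of its inputs, so successive transformed views of one object become associated with the same unit.

```python
# Hedged sketch of learning invariance from temporal correlations with a
# trace rule (hypothetical parameters; not the article's algorithm).
import numpy as np

rng = np.random.default_rng(3)

DIM, ETA, DECAY = 16, 0.1, 0.8

pattern = rng.normal(size=DIM)
# A temporal stream: the same pattern drifting over time, as if an
# object were translating across the visual field.
stream = [np.roll(pattern, t) for t in range(5)]

w = rng.normal(size=DIM) * 0.01   # unit's weights, small random start
trace = np.zeros(DIM)             # low-pass temporal trace of the input

for x in stream:
    trace = DECAY * trace + (1 - DECAY) * x
    w += ETA * trace              # Hebbian-like step toward the trace

w /= np.linalg.norm(w)

# The learned weight vector is a temporally weighted mixture of the
# successive views, tying the shifted versions to one unit.
responses = [w @ x for x in stream]
print(np.round(responses, 2))
```

Because the trace bridges consecutive time steps, views that follow each other in the stream (the shifted copies of the pattern) all contribute to the same weight vector; this is one way temporal continuity can define the equivalence classes mentioned above.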
Hierarchical Architecture and Invariant Recognition
Several lines of evidence (from both human psychophysics and monkey electrophysiology studies) suggest the primate visual system exhibits at least some invariance to position and scale. While the precise amount of invariance is still under debate, there is general agreement that there is at least some generalization to position and scale.

The neural mechanisms underlying such invariant visual recognition have been the subject of much computational and experimental work since the early 1990s. One general class of computational models postulates that the hierarchical organization of the visual cortex is key to this process; see also Hegdé and Felleman9 for an alternative view. The processing of shape information in the visual cortex follows a series of stages, starting with the retina and proceeding through the lateral geniculate nucleus (LGN) of the thalamus to primary visual cortex (V1) and extrastriate visual areas V2, V4, and the inferotemporal (IT) cortex. In turn, IT provides a major source of input to the prefrontal cortex (PFC), which is involved in linking perception to memory and action; see Serre et al.29 for references.