ing visual recognition, then explore a computational neuroscience model (representative of a class of older models) that implements these principles, including some of the evidence in its favor. When tested with natural images, the model performs robust object recognition on par with computer-vision systems and with human performance for a specific class of quick visual-recognition tasks. The initial success of this research is a case in point for the argument that, over the next decade, progress in computer vision and artificial intelligence promises to benefit directly from progress in neuroscience.
Goal of the Visual System
A key computational issue in object recognition^a is the specificity-invariance trade-off: recognition must be able to finely discriminate between different objects or object classes (such as the faces in Figure 1) while remaining tolerant of object transformations (such as scaling, translation, changes in illumination and viewpoint, and clutter), as well as of non-rigid transformations (such as variations in shape within a class), as in the changes of facial expression encountered when recognizing faces.
A key challenge posed by the visual cortex is how well it deals with the poverty-of-stimulus problem, that is, the simple lack of sufficient visual information. Primates are able to learn to recognize an object across quite different images from far fewer labeled examples than our present learning theory and algorithms predict. For instance, discriminative algorithms (such as support vector machines, or SVMs) can learn a complex object-recognition task from a few hundred labeled images. This number is small compared to the apparent dimensionality of the problem (millions of pixels), yet a child, or even a monkey, is apparently able to learn the same task from just a handful of examples. As an example of the prototypical problem in visual recognition, imagine that a (naïve) machine is shown an image of a given person and an image of another person. The system's task is to discriminate future images of these two people without seeing further images of them, though it has seen many images of other people and objects and their transformations and may have learned from them in an unsupervised way. Can the system learn to perform the classification task correctly with just two (or a few) labeled examples?
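To make the question concrete, here is a minimal Python sketch (using NumPy and scikit-learn; the data and the transformed helper are synthetic stand-ins invented for illustration, not part of the article). A linear SVM is trained on a single labeled example per "person" and tested on transformed copies:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Two 100-dimensional patterns stand in for images of two people.
    person_a = rng.normal(0.0, 1.0, 100)
    person_b = rng.normal(0.0, 1.0, 100)

    # The entire training set: one labeled example per class.
    clf = SVC(kernel="linear").fit(np.stack([person_a, person_b]), [0, 1])

    # "Future images": shifted, noisy copies crudely mimicking transformations.
    def transformed(x, shift, noise=0.3):
        return np.roll(x, shift) + rng.normal(0.0, noise, x.shape)

    X_test = np.stack([transformed(p, s)
                       for p in (person_a, person_b) for s in (0, 2, 5)])
    y_test = [0, 0, 0, 1, 1, 1]
    print("accuracy on transformed images:", clf.score(X_test, y_test))

Because shifting decorrelates these raw, pixel-like vectors, the shifted copies tend to fall on either side of the learned boundary at near-chance rates; learning from two examples can succeed only if the representation itself absorbs the transformations.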
In the setting of Figure 1, the issue is how readily a separation can be found between the two classes. It has been shown that certain learning algorithms (such as SVMs with Gaussian kernels) can solve any discrimination task of arbitrary difficulty (in the limit of an infinite number of training examples). That is, with certain classes of learning algorithms we are guaranteed to be able to find a separation for the problem at hand, irrespective of the difficulty of the recognition task. However, learning to solve the problem may require a prohibitively large number of training examples.
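This trade-off can be seen in a small experiment (a sketch under assumed parameters; the checkerboard data set and the sample sizes are illustrative, not from the article). An SVM with a Gaussian (RBF) kernel can represent an arbitrarily tangled boundary, but its test accuracy approaches that of the true boundary only as the number of training examples grows:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)

    # A hard, highly non-linear 2D task: class = parity of a checkerboard cell.
    X = rng.uniform(-3, 3, size=(4000, 2))
    y = ((np.floor(X[:, 0]) + np.floor(X[:, 1])) % 2).astype(int)
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=1000, random_state=0)

    for n in (20, 200, 2000):
        clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X_pool[:n], y_pool[:n])
        print(f"n={n:5d}  test accuracy = {clf.score(X_test, y_test):.2f}")

The Gaussian kernel guarantees that a separation of the training data exists at every n; what changes with n is how well that separation generalizes, which is exactly the sample-complexity cost.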
In separating the two classes, the two representations in panels (A) and (B) are not equal; the one in (B) is far superior to the one in (A). With no prior assumption on the class of functions to be learned, the "simplest" classifier that can separate the data in (B) is much simpler than the "simplest" classifier that separates the data in (A). The number of wiggles in the separation line (related to the number of parameters to be learned) gives a rough, intuitive estimate of the complexity of a classifier. The sample complexity of the problem derived from the invariant representation in (B) is thus much lower than that of the problem derived from the representation in (A).
a Within recognition, one distinguishes between identification and categorization. From a computational point of view, both involve classification and represent two points on a spectrum of generalization levels.

Figure 1. Sample complexity.
A hypothetical 2D (face) classification problem (red line). One class is represented with + and the other with − symbols. Insets are 2D transformations (translations and scales) applied to examples from the two categories. Panels (A) and (B) are two different representations of the same set of images. (B), which is tolerant with respect to the exact position and scale of the object within the image, leads to a simpler decision function (such as a linear classifier) and requires fewer training examples to achieve similar performance, thus lowering the sample complexity of the classification problem. In the limit, learning in (B) could be done with only two training examples (blue).