Perhaps one of the broadest applications of these systems today is in user interfaces, such as automated technical support and the commanding of software systems, as in phone and navigation systems in vehicles. These systems fail often: try saying something that is not very prototypical, or try not hiding your accent if you have one. But when these systems fail, they hand the user back to a human operator or force the user to command the software through classical means; some users even adjust their speech to get the systems to work. Again, while the performance of these systems has improved according to the adopted metrics, they are today embedded in new contexts and governed by new modes of operation that can tolerate a lack of robustness or intelligence. Moreover, as with text, improving their performance against current metrics is not necessarily directed toward, nor requires addressing, the challenge of comprehending speech.l
Moving to vision applications, it has been noted that some object-recognition systems based on neural networks surpass human performance in recognizing certain objects in images. But reports also indicate how making simple changes to images may sometimes hinder the ability of neural networks to recognize objects correctly. Some transformations or deformations of objects in images, which preserve the human ability to recognize them, can also hinder the ability of networks to recognize them. While this does not measure up to the expectations of early AI researchers, or even contemporary vision researchers, as far as robustness and intelligence are concerned, we still manage to benefit from these technologies in a number of applications. These include recognizing faces during autofocus in smart cameras (people do not normally deform their faces, but if they do, bad luck: an unfocused image); looking up images that contain cats in online search (it is OK if you end up getting a dog instead); and localizing surrounding vehicles in an image taken by
l An anonymous reviewer suggested that transcription is perhaps the main application of speech systems today, with substantial progress made toward the preferred metric of "word error rate." The same observation applies to this class of applications.
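To make the "word error rate" metric mentioned in the footnote concrete, here is a minimal sketch (mine, not from the article) that computes it as a word-level edit distance; the function name and toy strings are illustrative assumptions.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance (substitutions,
    insertions, deletions) divided by the reference word count."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / len(r)
```

One word substituted out of three gives a rate of 1/3; note that driving this number down says nothing about whether the system comprehends what was said.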
from 100% compared to humans, and successful translation was predicated on the ability to comprehend text. Government intelligence was a main driving application; a failure to translate correctly could potentially lead to a political crisis. Today, the main application of machine translation is to webpages and social-media content, leading to a new mode of operation and a different measure of success. In the new context, there is no explicit need for a translation system to comprehend text, only to perform well based on the adopted metrics. From a consumer's viewpoint, success is effectively measured in terms of how far a system's accuracy is from 0%. If I am looking at a page written in French, a language I do not speak, I am happy with any translation that gives me a sense of what the page is saying. In fact, the machine-translation community rightfully calls this "gist translation." It can work impressively well on prototypical sentences that appear often in the data (such as in social media) but can fail badly on novel text (such as poetry). It is still very valuable, yet it corresponds to a task that is significantly different from the one tackled by early AI researchers. We did indeed make significant progress recently with function-based translation, thanks to deep learning. But this progress has not been directed toward the classical challenge of comprehending text, which aimed to acquire knowledge from text to enable reasoning about its content,j instead of just translating it.k
Similar observations can be made
about speech-recognition systems.
j There are other views as to what "comprehension" might mean, as in, say, what might be revealed about language from the internal encodings of learned translation functions.
k With regard to the observation that the represent-and-reason approach is considered to have failed on machine translation, Stuart Russell of the University of California, Berkeley, pointed out to me that this is probably a correct description of an incorrect diagnosis, as not enough effort was directed toward pursuing an adequate represent-and-reason approach, particularly one that is trainable, since language has too many quirks to be captured by hand. This observation is part of a broader perspective I subscribe to, which calls for revisiting represent-and-reason approaches while augmenting them with advances in machine learning. This task would, however, require a new generation of researchers well versed in both approaches; see the section in this article on the power of success for hints as to what might stand in the way of having this breed of researchers.
Some seemingly complex abilities that are typically associated with perception or cognition can be captured and reproduced to a reasonable extent by simply fitting functions to data.
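This claim can be illustrated with a toy sketch (my own, not from the article): a seemingly perceptual ability, telling which side of a hidden boundary a point lies on, is reproduced purely by fitting a logistic function to labeled examples; the system is never told the rule. All names, constants, and the hidden rule are illustrative assumptions.

```python
import math
import random

# A toy "perceptual" task: points are labeled by a hidden rule
# (which side of the line x + 2y = 0 they fall on). We capture the
# ability purely by fitting a logistic function to labeled examples.
random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
labels = [1 if x + 2 * y > 0 else 0 for x, y in points]  # hidden rule

w = [0.0, 0.0]  # learned weights
b = 0.0         # learned bias
lr = 0.5
for _ in range(500):  # stochastic gradient descent on the log loss
    for (x, y), t in zip(points, labels):
        p = 1 / (1 + math.exp(-(w[0] * x + w[1] * y + b)))
        g = p - t  # gradient of the log loss w.r.t. the logit
        w[0] -= lr * g * x
        w[1] -= lr * g * y
        b -= lr * g

def predict(x, y):
    return 1 if w[0] * x + w[1] * y + b > 0 else 0

accuracy = sum(predict(x, y) == t
               for (x, y), t in zip(points, labels)) / len(points)
```

The fitted function reproduces the labeling behavior almost perfectly on data like what it was trained on, which is exactly the sense of "capture and reproduce" at issue; nothing in the procedure resembles comprehending why the boundary is where it is.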