community studying counterfactual
reasoning for ML.
Misuse of language. There are three
common avenues of language misuse
in machine learning: suggestive definitions, overloaded terminology, and
suitcase words.
Suggestive definitions. In the first avenue, a new technical term is coined
that has a suggestive colloquial meaning, thus sneaking in connotations
without the need to argue for them.
This often manifests in anthropomorphic characterizations of tasks (reading comprehension and music composition) and techniques (curiosity and
fear—I (Zachary) am responsible for
the latter). A number of papers name
components of proposed models in
a manner suggestive of human cognition (for example, thought vectors
and the consciousness prior). Our goal
is not to rid the academic literature
of all such language; when properly
qualified, these connections might
communicate a fruitful source of inspiration. When a suggestive term is
assigned technical meaning, however,
each subsequent paper has no choice
but to confuse its readers, either by
embracing the term or by replacing it.
Describing empirical results with
loose claims of “human-level” performance can also convey a false sense
of current capabilities. Take, for example, the “dermatologist-level classification of skin cancer” reported
in a 2017 paper by Esteva et al.12 The
comparison with dermatologists concealed the fact that classifiers and dermatologists perform fundamentally
different tasks. Real dermatologists
encounter a wide variety of circumstances and must perform their jobs
despite unpredictable changes. The
machine classifier, however, achieved low error only on independent,
identically distributed (IID) test data.
In contrast, claims of human-level
performance in work by He et al.16 are
better qualified to refer to the ImageNet classification task (rather than
object recognition more broadly).
Even in this case, one careful paper
(among many less careful) was insufficient to put the public discourse back
on track. Popular articles continue to
characterize modern image classifiers
as “surpassing human abilities and effectively proving that bigger data leads
to better decisions,” as explained by
Dave Gershgorn,13 despite demonstrations that these networks rely on
spurious correlations (for example,
misclassifying “Asians dressed in red”
as ping-pong balls, as reported by Stock
and Cisse43).
Deep-learning papers are not the
sole offenders; misuse of language
plagues many subfields of ML. Lipton, Chouldechova, and McAuley23
discuss how the recent literature on
fairness in ML often overloads terminology borrowed from complex legal
doctrine, such as disparate impact, to
name simple equations expressing
particular notions of statistical parity. This has resulted in a literature
where “fairness,” “opportunity,” and
“discrimination” denote simple statistics of predictive models, confusing
researchers who become oblivious to
the difference and policymakers who
become misinformed about the ease
of incorporating ethical desiderata
into ML.
Overloading technical terminology.
A second avenue of language misuse
consists of taking a term that holds
precise technical meaning and using
it in an imprecise or contradictory
way. Consider the case of deconvolu-
tion, which formally describes the
process of reversing a convolution,
but is now used in the deep-learning
literature to refer to transpose convo-
lutions (also called upconvolutions) as
commonly found in auto-encoders
and generative adversarial networks.
This term first took root in deep
learning in a paper that does address
deconvolution but was later overgen-
eralized to refer to any neural archi-
tecture using upconvolutions. Such
overloading of terminology can create
lasting confusion. New ML papers re-
ferring to deconvolution might be in-
voking its original meaning, describ-
ing upconvolution, or attempting to
resolve the confusion, as in a paper by
Hazirbas, Leal-Taixé, and Cremers,
15
which awkwardly refers to “upconvo-
lution (deconvolution).”
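To make the terminological gap concrete, here is a minimal sketch (the array and kernel values are arbitrary illustrations) contrasting deconvolution in its formal sense, which exactly reverses a convolution, with the transpose convolutions that deep-learning frameworks expose under the same name:

    import numpy as np
    from scipy.signal import deconvolve

    x = np.array([1.0, 2.0, 3.0, 4.0])
    k = np.array([1.0, 0.5])

    y = np.convolve(x, k)            # forward convolution
    x_rec, _ = deconvolve(y, k)      # deconvolution in the formal sense: recovers x
    assert np.allclose(x_rec, x)

    # By contrast, the "deconvolution" layers of deep-learning libraries
    # (for example, torch.nn.ConvTranspose2d) are transpose convolutions:
    # learned upsampling operators that invert no particular convolution.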
As another example, generative models are traditionally models of either the input distribution p(x) or the joint distribution p(x,y). In contrast, discriminative models address the conditional distribution p(y|x) of the label given the inputs. In recent works, however, generative model imprecisely refers to any model that produces realistic-looking structured data. On the surface, this may seem consistent with the p(x) definition, but it obscures several shortcomings: for example, the inability of GANs (generative adversarial networks) or VAEs (variational autoencoders) to perform conditional inference (for example, sampling from p(x2|x1), where x1 and x2 are two distinct input features). Bending the term further, some discriminative models are now referred to as generative models on account of producing structured outputs, a mistake that I (Lipton), too, have made. Seeking to resolve the confusion and provide historical context, Mohamed and Lakshminarayanan30 distinguish between prescribed and implicit generative models.
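The traditional definitions matter precisely because a prescribed model of the joint density supports conditional inference by simple renormalization, p(x2|x1) = p(x1,x2)/p(x1), whereas an implicit model that only produces samples offers no tractable handle on this conditional, and a discriminative model of p(y|x) gives no access to p(x) at all.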
Revisiting batch normalization, Ioffe and Szegedy18 described covariate shift as a change in the distribution of model inputs. In fact, covariate shift refers to a specific type of shift where, although the input distribution p(x) might change, the labeling function p(y|x) does not. Moreover, as a result of the influence of Ioffe and Szegedy, Google Scholar lists batch normalization as the first reference on searches for “covariate shift.”
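For a minimal sketch of the definition (hypothetical data and a made-up labeling rule, used only for illustration), the snippet below constructs training and test sets that exhibit covariate shift: p(x) moves, while the rule standing in for p(y|x) is held fixed.

    import numpy as np

    rng = np.random.default_rng(0)

    def label(x):
        # A fixed labeling rule standing in for p(y|x).
        return (x > 0.5).astype(int)

    x_train = rng.normal(0.0, 1.0, 1000)   # training inputs, p_train(x)
    x_test = rng.normal(2.0, 1.0, 1000)    # shifted test inputs, p_test(x) != p_train(x)
    y_train, y_test = label(x_train), label(x_test)
    # Covariate shift: only p(x) changed; the rule mapping x to y did not.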
Among the consequences of misusing language is the possibility (as with generative models) of concealing lack of progress by redefining an unsolved task to refer to something easier. This often combines with suggestive definitions via anthropomorphic naming.
Language understanding and reading
comprehension, once grand challenges
of AI, now refer to making accurate
predictions on specific datasets.
Suitcase words. Finally, ML papers tend to overuse suitcase words.
Coined by Marvin Minsky in the 2007
book The Emotion Machine,29 suitcase
words pack together a variety of meanings. Minsky described mental processes such as consciousness, thinking, attention, emotion, and feeling
that may not share “a single cause or
origin.” Many terms in ML fall into
this category. For example, I (Lipton)
noted in a 2016 paper that interpretability holds no universally agreed-upon meaning and often references disjoint methods and desiderata.22 As