Misaligned incentives. Reviewers are
not alone in providing poor incentives
for authors. As ML research garners
increased media attention and ML
startups become commonplace, to
some degree incentives are provided
by the press (“What will they write
about?”) and by investors (“What will
they invest in?”). The media provides
incentives for some of these trends.
Anthropomorphic descriptions
of ML algorithms provide fodder for
popular coverage. Take, for example, a
2014 article by Cade Metz in Wired,28
that characterized an autoencoder as
a “simulated brain.” Hints of human-level performance tend to be sensationalized in newspaper coverage—
for example, an article in the New York
Times by John Markoff described a
deep-learning image-captioning system as “mimicking human levels of
understanding.”25
Investors, too, have shown a strong
appetite for AI research, funding startups sometimes on the basis of a single paper. In my (Lipton's) experience
working with investors, they are sometimes attracted to startups whose research has received media coverage,
a dynamic that attaches financial incentives to media attention. Note that
recent interest in chatbot startups
co-occurred with anthropomorphic
descriptions of dialogue systems and
reinforcement learners both in papers
and in the media, although it may be
difficult to determine whether the
lapses in scholarship caused the interest of investors or vice versa.
Suggestions. Suppose we want to intervene to counter these trends. How?
Besides merely suggesting that each author abstain from these patterns, what
can we do as a community to raise the
level of experimental practice, exposition, and theory? And how can we more
readily distill the knowledge of the community and disabuse researchers and
the wider public of misconceptions?
What follows are a number of preliminary suggestions based on personal experiences and impressions.
For Authors, Publishers,
and Reviewers
We encourage authors to ask “What
worked?” and “Why?” rather than just
“How well?” Except in extraordinary
cases, raw headline numbers provide
limited value absent insight into what drives them.
As a consequence, even papers that appear
to be in dialogue with each other
may have different concepts in mind.
As another example,
generalization has both a specific technical
meaning (generalizing from training to testing) and a more colloquial
meaning that is closer to the notion
of transfer (generalizing from one
population to another) or of external validity (generalizing from an experimental setting to the real world).
Conflating these notions leads to
overestimating the capabilities of
current systems.
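A minimal sketch (illustrative only, and not drawn from any of the works discussed here; it assumes scikit-learn and NumPy, and simulates a population shift crudely by perturbing the test features) makes the distinction concrete:

    # Illustrative sketch: two senses of "generalization."
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic source population.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Technical sense: generalization from training data to an i.i.d. test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    iid_accuracy = accuracy_score(y_test, model.predict(X_test))

    # Colloquial sense (closer to transfer or external validity): the same model
    # evaluated on a shifted population, simulated here by perturbing the features.
    rng = np.random.default_rng(0)
    X_shifted = X_test + rng.normal(loc=1.0, scale=1.0, size=X_test.shape)
    shifted_accuracy = accuracy_score(y_test, model.predict(X_shifted))

    print(f"i.i.d. test accuracy:        {iid_accuracy:.3f}")
    print(f"shifted-population accuracy: {shifted_accuracy:.3f}")

A high number on the first line says little about the second; a claim of “generalization” should specify which of the two senses is meant.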
Suggestive definitions and overloaded terminology can contribute to
the creation of new suitcase words.
In the fairness literature, where legal, philosophical, and statistical
language are often overloaded, terms
such as bias become suitcase words
that must be subsequently unpacked.
In common speech and as aspirational terms, suitcase words can serve
a useful purpose. Sometimes a suitcase word might reflect an overarching aspiration that unites the various meanings. For example, artificial
intelligence might be well suited as
an aspirational name to organize an
academic department. On the other
hand, using suitcase words in technical arguments can lead to confusion.
For example, in his 2014 book
Superintelligence,3 Nick Bostrom wrote an
equation (Box 4) involving the terms
intelligence and optimization power,
implicitly assuming these suitcase
words can be quantified with a one-dimensional scalar.
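For reference, the Box 4 relation takes roughly the following form (paraphrased; it may differ in wording from Bostrom's exact formulation):

    \[
      \frac{d\,(\text{Intelligence})}{dt}
        = \frac{\text{Optimization power}}{\text{Recalcitrance}}
    \]

Written this way, the equation treats intelligence and optimization power as scalar quantities that can be divided and differentiated, which is precisely the unpacking these suitcase words are never given.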
Speculation on Causes
Behind the Trends
Do the patterns mentioned here represent a trend, and if so, what are the
underlying causes? We speculate that
these patterns are on the rise and suspect several possible causal factors:
complacency in the face of progress,
the rapid expansion of the community, the consequent thinness of the
reviewer pool, and misaligned incentives that reward short-term
measures of success over sound scholarship.
Complacency in the face of progress.
The apparent rapid progress in ML has
at times engendered an attitude that
strong results excuse weak arguments.
Authors with strong results may feel
licensed to insert arbitrary unsupported
stories (see “Explanation vs. Speculation”)
regarding the factors driving the
results; to omit experiments aimed at
disentangling those factors (see “Failure
to Identify the Sources of Empirical
Gains”); to adopt exaggerated terminology
(see “Misuse of Language”);
or to take less care to avoid mathiness
(see “Mathiness”).
At the same time, the single-round
nature of the reviewing process may
cause reviewers to feel they have no
choice but to accept papers with
strong quantitative findings. Indeed,
even if the paper is rejected, there is
no guarantee the flaws will be fixed or
even noticed in the next cycle, so reviewers
may conclude that accepting a
flawed paper is the best option.
Growing pains. Since around 2012,
the ML community has expanded rapidly because of increased popularity
stemming from the success of deep-learning methods. While the rapid
expansion of the community can be
seen as a positive development, it can
also have side effects.
To protect junior authors, we have
preferentially referenced our own
papers and those of established researchers. And certainly, experienced
researchers exhibit these patterns.
Newer researchers, however, may be
even more susceptible. For example,
authors unaware of previous terminology are more likely to misuse or redefine language (as discussed earlier).
Rapid growth can also thin the reviewer pool in two ways: by increasing the ratio of submitted papers
to reviewers and by decreasing the
fraction of experienced reviewers.
Less-experienced reviewers may
be more likely to demand architectural novelty, be fooled by spurious
theorems, and let pass serious but
subtle issues such as misuse of
language, thus either incentivizing
or enabling several of the trends
described here. At the same time,
experienced but overburdened reviewers may revert to a “checklist”
mentality, rewarding more formulaic papers at the expense of more
creative or intellectually ambitious
work that might not fit a preconceived
template. Moreover, overworked reviewers may not have enough time to
fix—or even to notice—all of the issues in a submitted paper.