the papers leading to the current success of deep learning were careful
empirical investigations characterizing principles for training deep
networks. This includes the advantage of random over sequential
hyperparameter search, the behavior of different activation functions, and
an understanding of unsupervised pretraining.
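To make the random-search point concrete, here is a minimal sketch of our own, not drawn from the cited investigations. The objective `validation_loss` is a made-up stand-in for a real training run, and grid search serves as the sequential baseline: when only one hyperparameter strongly affects the outcome, random search tries many more distinct values of it under the same budget.

```python
# Toy comparison of grid (sequential) vs. random hyperparameter search.
# The objective below is a hypothetical stand-in for a real training run:
# it depends strongly on the learning rate and only weakly on momentum.
import random

def validation_loss(learning_rate, momentum):
    return (learning_rate - 0.07) ** 2 + 0.001 * (momentum - 0.9) ** 2

budget = 16  # total number of training runs we can afford

# Grid search: a 4x4 grid spends the budget on only 4 distinct values per axis.
grid_lrs = [0.001, 0.01, 0.1, 1.0]
grid_moms = [0.0, 0.5, 0.9, 0.99]
grid_best = min(validation_loss(lr, m) for lr in grid_lrs for m in grid_moms)

# Random search: the same budget yields 16 distinct values along each axis.
random.seed(0)
random_best = min(
    validation_loss(10 ** random.uniform(-3, 0), random.uniform(0.0, 0.99))
    for _ in range(budget)
)

print(f"grid best loss:   {grid_best:.6f}")
print(f"random best loss: {random_best:.6f}")
```

Because the grid commits to only four learning rates while the random sampler draws sixteen, random search usually lands closer to the sweet spot; this is the effect the empirical investigations above documented at scale.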
Second, flawed scholarship already negatively impacts the research
community and broader public discourse. The "Troubling Trends" section of
this article gives examples of unsupported claims being cited thousands of
times, lineages of purported improvements being overturned by simple
baselines, datasets that appear to test high-level semantic reasoning but
actually test low-level syntactic fluency, and terminology confusion that
muddles the academic dialogue. This final issue also affects public
discourse. For example, the European Parliament passed a report considering
regulations to apply if "robots become or are made self-aware."10 While ML
researchers are not responsible for all misrepresentations of our work, it
seems likely that anthropomorphic language in authoritative peer-reviewed
papers is at least partly to blame.

Greater rigor in exposition, science, and theory is essential both for
scientific progress and for fostering productive discourse with the broader
public. Moreover, as practitioners apply ML in critical domains such as
health, law, and autonomous driving, a calibrated awareness of the
abilities and limits of ML systems will help us to deploy ML responsibly.

Countervailing Considerations
There are a number of countervailing considerations to the suggestions set
forth in this article. Several readers of earlier drafts of this paper
noted that stochastic gradient descent tends to converge faster than
gradient descent; in other words, perhaps a faster, noisier process that
ignores our guidelines for producing "cleaner" papers results in a faster
pace of research. For example, the breakthrough paper on ImageNet
classification proposed multiple techniques without ablation studies,
several of which were subsequently determined to be unnecessary. At the
time, however, the results were so significant
and the experiments so computationally expensive to run that waiting for
ablations to complete might not have been worth the cost to the community.
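The SGD analogy can likewise be made concrete. The sketch below is our own illustration on a synthetic least-squares problem, not an experiment from any cited paper: given the same number of passes over the data, the noisy per-example updates of SGD typically reach a low loss long before the exact full-batch steps do.

```python
# Our toy comparison of full-batch gradient descent vs. SGD on least squares.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=n)

def loss(w):
    return np.mean((X @ w - y) ** 2)

# Full-batch gradient descent: one exact update per pass over all n examples.
w_gd = np.zeros(d)
for _ in range(5):  # 5 passes over the data
    grad = 2 * X.T @ (X @ w_gd - y) / n
    w_gd -= 0.1 * grad

# SGD: one noisy update per example, that is, n updates per pass.
w_sgd = np.zeros(d)
for _ in range(5):  # the same 5 passes over the data
    for i in rng.permutation(n):
        grad_i = 2 * (X[i] @ w_sgd - y[i]) * X[i]
        w_sgd -= 0.01 * grad_i

print(f"loss after 5 passes, GD:  {loss(w_gd):.4f}")
print(f"loss after 5 passes, SGD: {loss(w_sgd):.4f}")
```

The analogy, of course, concerns the research process rather than the optimizer: noisy-but-frequent updates can outpace careful-but-slow ones, at least early on.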
A related concern is that high standards might impede the publication of
original ideas, which are more likely to be
unusual and speculative. In other fields,
such as economics, high standards result in a publishing process that can take
years for a single paper, with lengthy revision cycles consuming resources that
could be deployed toward new work.
Finally, perhaps there is value
in specialization: The researchers
generating new conceptual ideas or
building new systems need not be the
same ones who carefully collate and
distill knowledge.
These are valid considerations, and
the standards we are putting forth
here are at times exacting. In many
cases, however, they are straightforward to implement, requiring only
a few extra days of experiments and
more careful writing. Moreover, they
are being presented as strong heuristics rather than unbreakable rules—
if an idea cannot be shared without
violating these heuristics, the idea
should be shared and the heuristics
set aside.
We have almost always found attempts to adhere to these standards to
be well worth the effort. In short, the
research community has not achieved
a Pareto optimal state on the growth-quality frontier.
Historical Antecedents
The issues discussed here are unique
neither to machine learning nor to this
moment in time; they instead reflect
problems that recur cyclically throughout academia. As far back as 1964, the
physicist John R. Platt34 discussed related concerns in his paper on strong inference, where he identified adherence
to specific empirical standards as responsible for the rapid progress of molecular biology and high-energy physics
relative to other areas of science.
There have been similar discussions in AI. As noted in the introduction to
this article, McDermott26 criticized a (mostly pre-ML) AI community in 1976
on a number of issues, including suggestive definitions and a failure to
separate out speculation from technical claims. In 1988, Cohen and