were reluctant to label a core part of the
argument as speculative.
In contrast to these examples,
Srivastava et al.39 separate speculation from fact. While this 2014 paper,
which introduced dropout regularization, speculates at length on connections between dropout and sexual
reproduction, a designated
“Motivation” section clearly quarantines this
discussion. This practice avoids confusing readers while allowing authors
to express informal ideas.
In another positive example, Yoshua Bengio2 presents practical guidelines for training neural networks.
Here, the author carefully conveys
uncertainty. Instead of presenting
the guidelines as authoritative, the
paper states: “Although such recommendations come … from years of
experimentation and to some extent
mathematical justification, they
should be challenged. They constitute a good starting point … but very
often have not been formally validated, leaving open many questions that
can be answered either by theoretical
analysis or by solid comparative experimental work.”
Failure to identify the sources of
empirical gains. The ML peer-review
process places a premium on technical novelty. Perhaps to satisfy reviewers, many papers emphasize both
complex models (addressed here) and
fancy mathematics (to be discussed in
the “Mathiness” section). While complex
models are sometimes justified, empirical advances often come about in
other ways: through clever problem
formulations, scientific experiments,
optimization heuristics, data-preprocessing techniques, extensive hyperparameter tuning, or applying existing methods to interesting new tasks.
Sometimes a number of proposed
techniques together achieve a significant empirical result. In these cases,
it serves the reader to elucidate which
techniques are necessary to realize the
reported gains.
Too frequently, authors propose
many tweaks absent proper ablation
studies, obscuring the source of
empirical gains. Sometimes, just one of
the changes is actually responsible for
the improved results. This can give the
false impression that the authors did
more work (by proposing several
improvements), when in fact they did not
do enough (by not performing proper
ablations). Moreover, this practice
misleads readers into believing that all of
the proposed changes are necessary.
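To make the idea concrete, a proper ablation simply re-runs the experiment with each proposed change switched off in turn. The sketch below (Python; the train_and_evaluate function and the tweak names are hypothetical placeholders, not any particular paper's pipeline) illustrates the pattern:

# Minimal ablation sketch. `train_and_evaluate` is a hypothetical
# placeholder for a full training-and-validation run; the tweak names
# are made up for illustration.

def train_and_evaluate(config):
    """Placeholder: train a model under `config` and return a validation score."""
    raise NotImplementedError  # substitute a real training pipeline here

def ablation_study(proposed_changes):
    """Compare the full model against a leave-one-out variant for each change."""
    full = {change: True for change in proposed_changes}
    results = {"full model": train_and_evaluate(full)}
    for change in proposed_changes:
        variant = dict(full, **{change: False})  # switch off exactly one change
        results["without " + change] = train_and_evaluate(variant)
    return results

# Usage, once train_and_evaluate is implemented:
# ablation_study(["new_attention", "aux_loss", "fancy_init"])

Comparing the full model against each leave-one-out variant, under matched tuning budgets, makes clear which changes actually carry the reported gain.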
In 2018, Melis, Dyer, and Blunsom27
demonstrated that a series of
published improvements in language
modeling, originally attributed to
complex innovations in network
architectures, were actually the result
of better hyperparameter tuning. On
equal footing, vanilla long short-term
memory (LSTM) networks, hardly
modified since 1997, topped the
leaderboard. The community might
have benefited more by learning the
details of the hyperparameter
tuning without the distractions. Similar
evaluation issues have been observed
for deep reinforcement learning17 and
generative adversarial networks.24 See
Sculley et al.38 for more discussion of
lapses in empirical rigor and
resulting consequences.
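For readers unfamiliar with what “better hyperparameter tuning” involves in practice, the following sketch shows a simple random search over a vanilla LSTM's hyperparameters; the search space and the evaluate_lstm function are illustrative assumptions, not the protocol of Melis, Dyer, and Blunsom:

import random

# Illustrative random search over common LSTM hyperparameters.
# `evaluate_lstm` is a hypothetical stand-in for training a vanilla LSTM
# language model and returning its validation perplexity.

SEARCH_SPACE = {
    "hidden_size": [256, 512, 1024],
    "num_layers": [1, 2, 3],
    "dropout": [0.0, 0.2, 0.4, 0.6],
    "learning_rate": [3e-4, 1e-3, 3e-3],
}

def evaluate_lstm(config):
    """Placeholder: train an LSTM under `config`, return validation perplexity."""
    raise NotImplementedError  # substitute a real training run here

def random_search(num_trials, seed=0):
    """Sample configurations at random and keep the best one found."""
    rng = random.Random(seed)
    best_config, best_ppl = None, float("inf")
    for _ in range(num_trials):
        config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        ppl = evaluate_lstm(config)
        if ppl < best_ppl:
            best_config, best_ppl = config, ppl
    return best_config, best_ppl

The lesson of the language-modeling result is that giving every baseline the same tuning budget, however simple the search, is a precondition for attributing gains to architecture.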
In contrast, many papers perform
good ablation analyses, and even
retrospective attempts to isolate the
source of gains can lead to new
discoveries. However, ablation is neither
necessary nor sufficient for
understanding a method, and can even
be impractical given computational
constraints. Understanding can also
come from robustness checks (as in
Cotterell et al.,9 which discovers that
existing language models handle
inflectional morphology poorly), as well
as qualitative error analysis.
Empirical study aimed at
understanding can be illuminating even
absent a new algorithm. For example,
probing the behavior of neural
networks led to identifying their
susceptibility to adversarial perturbations.44
Careful study also often reveals
limitations of challenge datasets while
yielding stronger baselines. A 2016
paper by Chen, Bolton, and Manning6
studied a task designed for reading
comprehension of news passages and
found that 73% of the questions can
be answered by looking at a single
sentence, while only 2% required looking
at multiple sentences (the remaining
25% of examples were either
ambiguous or contained coreference errors).
In addition, simpler neural networks
and linear classifiers outperformed
complicated neural architectures that