Howe8 addressed an AI community
that at that point “rarely publish[ed]
performance evaluations” of their
proposed algorithms and instead
only described the systems. They suggested establishing sensible metrics
for quantifying progress, and analyzing the following: “Why does it work?”
“Under what circumstances won’t it
work?” and “Have the design decisions been justified?”—questions that
continue to resonate today.
Finally, in 2009 Armstrong et al.
discussed the empirical rigor of information-retrieval research, noting a
tendency of papers to compare against
the same weak baselines, producing a
long series of improvements that did
not accumulate to meaningful gains.
In other fields, an unchecked decline in scholarship has led to crisis.
A landmark study in 2015 suggested a
significant portion of findings in the
psychology literature may not be reproducible.33
In a few historical cases,
enthusiasm paired with undisciplined
scholarship led entire communities
down blind alleys. For example, following the discovery of X-rays, a related discipline on N-rays emerged
before it was eventually debunked.
The reader might rightly suggest these
problems are self-correcting. We
agree. However, the community self-corrects precisely through recurring
debate about what constitutes reasonable standards for scholarship. We
hope that this paper contributes constructively to the discussion.
We thank Asya Bergal, Kyunghyun
Cho, Moustapha Cisse, Daniel Dewey,
Danny Hernandez, Charles Elkan, Ian
Goodfellow, Moritz Hardt, Tatsunori
Hashimoto, Sergey Ioffe, Sham Kakade, David Kale, Holden Karnofsky,
Pang Wei Koh, Lisha Li, Percy Liang,
Julian McAuley, Robert Nishihara,
Noah Smith, Balakrishnan “Murali”
Narayanaswamy, Ali Rahimi, Christopher Ré, and Byron Wallace. We also
thank the ICML Debates organizers.
1. Armstrong, T.G., Moffat, A., Webber, W. and Zobel, J.
Improvements that don’t add up: ad-hoc retrieval
results since 1998. In Proceedings of the 18th ACM
Conf. Information and Knowledge Management, 2009,
2. Bengio, Y. Practical recommendations for gradient-based training of deep architectures. Neural Networks:
Tricks of the Trade. G. Montavon, G.B. Orr, K.-R. Müller,
eds. LNCS 7700 (2012). Springer, Berlin, Heidelberg,
3. Bostrom, N. Superintelligence. Dunod, Paris, France, 2017.
4. Bottou, L. et al. Counterfactual reasoning and learning
systems: The example of computational advertising. J.
Machine Learning Research 14, 1 (2013), 3207–3260.
5. Bray, A.J. and Dean, D.S. Statistics of critical points of
Gaussian fields on large-dimensional spaces. Physical
Review Letters 98, 15 (2007), 150201; https://journals.
6. Chen, D., Bolton, J. and Manning, C.D. A thorough
examination of the CNN/Daily Mail reading
comprehension task. In Proceedings of the 54th
Annual Meeting of Assoc. Computational Linguistics,
7. Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B. and
LeCun, Y. The loss surfaces of multilayer networks.
In Proceedings of the 18th Intern. Conf. Artificial
Intelligence and Statistics, 2015.
8. Cohen, P.R. and Howe, A.E. How evaluation guides AI
research: the message still counts more than the
medium. AI Magazine 9, 4 (1988), 35.
9. Cotterell, R., Mielke, S.J., Eisner, J. and Roark, B. Are
all languages equally hard to language-model? In
Proceedings of Conf. North American Chapt. Assoc.
Computational Linguistics: Human Language
Technologies, Vol. 2, 2018.
10. Council of the European Union. Motion for a European
Parliament Resolution with Recommendations to the
Commission on Civil Law Rules on Robotics, 2016;
11. Dauphin, Y.N. et al. Identifying and attacking the
saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information
Processing Systems, 2014, 2933–2941.
12. Esteva, A. et al. Dermatologist-level classification of
skin cancer with deep neural networks. Nature 542,
7639 (2017), 115–118.
13. Gershgorn, D. The data that transformed AI
research—and possibly the world. Quartz, 2017;
14. Goodfellow, I.J., Vinyals, O. and Saxe, A.M.
Qualitatively characterizing neural network
optimization problems. In Proceedings of the Intern.
Conf. Learning Representations, 2015.
15. Hazirbas, C., Leal-Taixé, L. and Cremers, D. Deep depth
from focus. arXiv Preprint, 2017; arXiv:1704.01085.
16. He, K., Zhang, X., Ren, S. and Sun, J. Delving deep into
rectifiers: Surpassing human-level performance on
ImageNet classification. In Proceedings of the IEEE
Intern. Conf. Computer Vision, 2015, 1026–1034.
17. Henderson, P. et al. Deep reinforcement learning
that matters. In Proceedings of the 32nd Assoc.
Advancement of Artificial Intelligence Conf., 2018.
18. Ioffe, S. and Szegedy, C. Batch normalization:
accelerating deep network training by reducing
internal covariate shift. In Proceedings of the 32nd
Intern. Conf. Machine Learning 37, 2015; http://
19. Kingma, D.P. and Ba, J. Adam: A method for stochastic
optimization. In Proceedings of the 3rd Intern. Conf.
Learning Representations, 2015.
20. Knuth, D.E., Larrabee, T. and Roberts, P.M.
Mathematical writing, 1987; https://bit.ly/2TmxyNq
21. Langley, P. and Kibler, D. The experimental study of
machine learning, 1991; http://www.isle.org/~langley/
22. Lipton, Z.C. The mythos of model interpretability.
Intern. Conf. Machine Learning Workshop on Human
23. Lipton, Z.C., Chouldechova, A. and McAuley, J. Does
mitigating ML’s impact disparity require treatment
disparity? Advances in Neural Inform. Process. Syst.
2017, 8136–8146. arXiv Preprint arXiv:1711.07076.
24. Lucic, M., Kurach, K., Michalski, M., Gelly, S. and Bousquet,
O. Are GANs created equal? A large-scale study. In
Proceedings of the 32nd Conf. Neural Information
Processing Syst. arXiv Preprint 2017; arXiv:1711.10337.
25. Markoff, J. Researchers announce advance in image-recognition software. NYT (Nov. 17, 2014); https://nyti.
26. McDermott, D. Artificial intelligence meets natural
stupidity. ACM SIGART Bulletin 57 (1976), 4–9.
27. Melis, G., Dyer, C. and Blunsom, P. On the state of
the art of evaluation in neural language models.
In Proceedings of the Intern. Conf. Learning
28. Metz, C. You don’t have to be Google to build an
artificial brain. Wired (Sept. 26, 2014); https://www.
29. Minsky, M. The Emotion Machine: Commonsense
Thinking, Artificial Intelligence, and the Future of the
Human Mind. Simon & Schuster, New York, NY, 2006.
30. Mohamed, S. and Lakshminarayanan, B. Learning in
implicit generative models. arXiv Preprint, 2016;
31. Noh, H., Hong, S. and Han, B. Learning deconvolution
network for semantic segmentation. In Proceedings of
the Intern. Conf. Computer Vision, 2015, 1520–1528.
32. Nye, M.J. N-rays: An episode in the history and
psychology of science. Historical Studies in the
Physical Sciences 11, 1 (1980), 125–56.
33. Open Science Collaboration. Estimating the
reproducibility of psychological science. Science 349,
6251 (2015), aac4716.
34. Platt, J.R. Strong inference. Science 146, 3642 (1964),
35. Reddi, S.J., Kale, S. and Kumar, S. On the convergence
of Adam and beyond. In Proceedings of the Intern.
Conf. Learning Representations, 2018.
36. Romer, P.M. Mathiness in the theory of economic
growth. Amer. Econ. Rev. 105, 5 (2015), 89–93.
37. Santurkar, S., Tsipras, D., Ilyas, A. and Madry, A. How
does batch normalization help optimization? (No, it is
not about internal covariate shift). In Proceedings of
the 32nd Conf. Neural Information Processing Systems;
38. Sculley, D., Snoek, J., Wiltschko, A. and Rahimi, A.
Winner’s curse? On pace, progress, and empirical
rigor. In Proceedings of the 6th Intern. Conf. Learning
Representations, Workshop Track, 2018.
39. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever,
I. and Salakhutdinov, R. Dropout: A simple way to
prevent neural networks from overfitting. J. Machine
Learning Research 15, 1 (2014), 1929–1958; https://
40. Steinhardt, J. and Liang, P. Learning fast-mixing
models for structured prediction. In Proceedings of
the 32nd Intern. Conf. Machine Learning 37 (2015),
41. Steinhardt, J. and Liang, P. Reified context models. In
Proceedings of the 32nd Intern. Conf. Machine Learning
37, (2015), 1043–1052; https://dl.acm.org/citation.
42. Steinhardt, J., Koh, P.W. and Liang, P.S. Certified
defenses for data poisoning attacks. In Proceedings of
the 31st Conf. Neural Information Processing Systems,
43. Stock, P. and Cisse, M. ConvNets and ImageNet
beyond accuracy: Explanations, bias detection,
adversarial examples and model criticism. arXiv
Preprint, 2017, arXiv:1711.11443.
44. Szegedy, C. et al. Intriguing properties of neural
networks. Intern. Conf. Learning Representations.
arXiv Preprint, 2013, arXiv:1312.6199.
45. Zellers, R., Yatskar, M., Thomson, S. and Choi, Y. Neural
motifs: Scene graph parsing with global context. In
Proceedings of the IEEE Conf. Computer Vision and
Pattern Recognition, 2018, 5831–5840.
46. Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals,
O. Understanding deep learning requires rethinking
generalization. In Proceedings of the Intern. Conf.
Learning Representations, 2017.
Zachary C. Lipton is an assistant professor at Carnegie
Mellon University in the Tepper School of Business with
appointments in the Machine Learning Department and
the Heinz School of Public Policy. He also collaborates
with Amazon, where he helped to grow AWS’ Amazon
AI team and contributed to the Apache MXNet deep
learning framework. Find him at zacklipton.com,
Twitter @zacharylipton, or GitHub @zackchase.
Jacob Steinhardt will be joining UC Berkeley as an
assistant professor of statistics. He is a technical advisor
for the Open Philanthropy Project and has collaborated
with policy researchers to understand and avoid potential
misuses of machine learning.
Copyright held by owners/authors.
Publication rights licensed to ACM.