DOI: 10.1145/1378704.1378723

Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century

By Stephen M. Blackburn, Kathryn S. McKinley, Robin Garner, Chris Hoffmann, Asjad M. Khan, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann

Abstract

Evaluation methodology underpins all innovation in experimental computer science. It requires relevant workloads, appropriate experimental design, and rigorous analysis. Unfortunately, methodology is not keeping pace with the changes in our field. The rise of managed languages such as Java, C#, and Ruby in the past decade and the imminent rise of commodity multicore architectures for the next decade pose new methodological challenges that are not yet widely understood. This paper explores the consequences of our collective inattention to methodology on innovation, makes recommendations for addressing this problem in one domain, and provides guidelines for other domains. We describe benchmark suite design, experimental design, and analysis for evaluating Java applications. For example, we introduce new criteria for measuring and selecting diverse applications for a benchmark suite. We show that the complexity and nondeterminism of the Java runtime system make experimental design a first-order consideration, and we recommend mechanisms for addressing complexity and nondeterminism. Drawing on these results, we suggest how to adapt methodology more broadly. To continue to deliver innovations, our field needs to significantly increase participation in and funding for developing sound methodological foundations.

1. Introduction

Methodology is the foundation for judging innovation in experimental computer science. It therefore directs and misdirects our research. Flawed methodology can make good ideas look bad or bad ideas look good. Like any infrastructure, such as bridges and power lines, methodology is often mundane and thus vulnerable to neglect. While systemic misdirection of research is not as dramatic as a bridge collapse [11] or complete power failure [10], the scientific and economic cost may be considerable. Sound methodology includes using appropriate workloads, principled experimental design, and rigorous analysis. Unfortunately, many of us struggle to adapt to the rapidly changing computer science landscape. We use archaic benchmarks, outdated experimental designs, and/or inadequate data analysis. This paper explores the methodological gap, its consequences, and some solutions. We use the commercial uptake of managed languages over the past decade as the driving example.

Many developers today choose managed languages, which provide: (1) memory and type safety, (2) automatic memory management, (3) dynamic code execution, and (4) well-defined boundaries between type-safe and unsafe code (e.g., JNI and P/Invoke). Many such languages are also object-oriented. Managed languages include Java, C#, Python, and Ruby. C and C++ are not managed languages; they are compiled ahead of time, not garbage collected, and unsafe. Unfortunately, managed languages add at least three new degrees of freedom to experimental evaluation: (1) a space–time trade-off due to garbage collection, in which heap size is a control variable, (2) nondeterminism due to adaptive optimization and sampling technologies, and (3) system warm-up due to dynamic class loading and just-in-time (JIT) compilation.
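To make the three degrees of freedom concrete, the following Python sketch lays out an experiment plan that treats each one as an explicit variable. It is only an illustration under stated assumptions: the `Trial` type, the `plan_trials` helper, and all numeric values (minimum heap, heap multiples, invocation and warm-up counts) are hypothetical and are not taken from the paper. The only real interface it relies on is the JVM's `-Xms` and `-Xmx` heap flags.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trial:
    """One planned JVM launch (hypothetical structure, for illustration)."""
    heap_mb: int       # fixed heap size: the space-time control variable
    invocation: int    # index of a fresh JVM launch: exposes nondeterminism
    warmup_iters: int  # in-process iterations discarded before timing

    def jvm_heap_flags(self):
        # -Xms and -Xmx set the initial and maximum heap; using the same
        # value for both pins the heap size for the whole run.
        return [f"-Xms{self.heap_mb}m", f"-Xmx{self.heap_mb}m"]


def plan_trials(min_heap_mb, heap_multiples, invocations, warmup_iters):
    """Cross every heap size with every invocation index."""
    trials = []
    for multiple, inv in product(heap_multiples, range(invocations)):
        heap_mb = round(min_heap_mb * multiple)
        trials.append(Trial(heap_mb, inv, warmup_iters))
    return trials


# Assumed values: a 32 MB minimum heap, four heap multiples,
# five JVM invocations per heap size, three warm-up iterations each.
trials = plan_trials(min_heap_mb=32, heap_multiples=[1.5, 2, 3, 6],
                     invocations=5, warmup_iters=3)
```

Expressing heap sizes as multiples of a benchmark's minimum heap is one way to keep the space–time trade-off comparable across benchmarks; the specific multiples here are placeholders.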

Although programming language researchers have embraced managed languages, many have not evolved their evaluation methodologies to address these additional degrees of freedom. As we shall show, weak methodology leads to incorrect findings. Equally problematic, most architecture and operating systems researchers do not use appropriate workloads. Most ignore managed languages entirely, despite their commercial prominence. They continue to use C and C++ benchmarks, perhaps because of the significant cost and challenges of developing expertise in new infrastructure. Regardless of the reasons, the current state of methodology for managed languages often provides bad results or no results.

To combat this neglect, computer scientists must be vigilant in their methodology. This paper describes how we addressed some of these problems for Java and makes recommendations for other domains. We discuss how benchmark designers can create forward-looking and diverse workloads and how researchers should use them. We then present a set of experimental design guidelines that accommodate complex and nondeterministic workloads. We show that managed languages make it much harder to produce meaningful results and suggest how to identify and explore control variables. Finally, we discuss the importance of rigorous analysis [8] for complex nondeterministic systems that are not amenable to trivial empirical methods.
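As a small illustration of what rigorous analysis can mean for nondeterministic measurements, the sketch below summarizes repeated invocations as a mean with a 95% confidence interval rather than a single number. It is a minimal example, not the analysis used in the paper: the function names are hypothetical, the timings are invented purely for illustration, and the critical values come from a short hard-coded Student's t table with a normal approximation (1.96) as a fallback for larger samples.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def mean_with_ci(times):
    """Return (mean, half-width of the 95% confidence interval)."""
    n = len(times)
    if n < 2:
        raise ValueError("need at least two invocations")
    mean = statistics.mean(times)
    sem = statistics.stdev(times) / math.sqrt(n)  # standard error of the mean
    t = T_CRIT_95.get(n - 1, 1.96)  # normal approximation beyond the table
    return mean, t * sem


def intervals_overlap(a, b):
    """True when two (mean, half-width) intervals overlap."""
    (mean_a, half_a), (mean_b, half_b) = a, b
    return abs(mean_a - mean_b) <= half_a + half_b


# Invented execution times in seconds, one per JVM invocation.
baseline = mean_with_ci([10.2, 10.5, 9.9, 10.4, 10.0])
candidate = mean_with_ci([9.8, 10.3, 10.1, 9.7, 10.2])
inconclusive = intervals_overlap(baseline, candidate)
```

When the intervals overlap, as they do for these invented samples, the data at this sample size do not support claiming that one configuration is faster; reporting only the two means would hide that.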

We address neglect in one domain, at one point in time, but the broader problem is widespread and growing. For example, researchers and industry are pouring resources into and exploring new approaches for embedded systems, multicore architectures, and concurrent programming models. However, without consequent investments
