DOI: 10.1145/1378704.1378723
Wake Up and Smell the Coffee:
Evaluation Methodology
for the 21st Century
By Stephen M. Blackburn, Kathryn S. McKinley, Robin Garner, Chris Hoffmann, Asjad M. Khan, Rotem Bentzur, Amer
Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee,
J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann
Abstract
Evaluation methodology underpins all innovation in experimental computer science. It requires relevant workloads,
appropriate experimental design, and rigorous analysis.
Unfortunately, methodology is not keeping pace with the
changes in our field. The rise of managed languages such
as Java, C#, and Ruby in the past decade and the imminent
rise of commodity multicore architectures for the next decade pose new methodological challenges that are not yet
widely understood. This paper explores the consequences
of our collective inattention to methodology on innovation,
makes recommendations for addressing this problem in
one domain, and provides guidelines for other domains.
We describe benchmark suite design, experimental design,
and analysis for evaluating Java applications. For example,
we introduce new criteria for measuring and selecting diverse applications for a benchmark suite. We show that the
complexity and nondeterminism of the Java runtime system
make experimental design a first-order consideration, and
we recommend mechanisms for addressing complexity and
nondeterminism. Drawing on these results, we suggest how
to adapt methodology more broadly. To continue to deliver
innovations, our field needs to significantly increase participation in and funding for developing sound methodological
foundations.
1. Introduction
Methodology is the foundation for judging innovation in
experimental computer science. It therefore directs and
misdirects our research. Flawed methodology can make
good ideas look bad or bad ideas look good. Like any infrastructure, such as bridges and power lines, methodology is
often mundane and thus vulnerable to neglect. While systemic misdirection of research is not as dramatic as a bridge
collapse [11] or complete power failure [10], the scientific and
economic cost may be considerable. Sound methodology
includes using appropriate workloads, principled experimental design, and rigorous analysis. Unfortunately, many
of us struggle to adapt to the rapidly changing computer science landscape. We use archaic benchmarks, outdated experimental designs, and/or inadequate data analysis. This
paper explores the methodological gap, its consequences,
and some solutions. We use the commercial uptake of managed languages over the past decade as the driving example.
Many developers today choose managed languages, which
provide: (1) memory and type safety, (2) automatic memory
management, (3) dynamic code execution, and (4) well-defined boundaries between type-safe and unsafe code (e.g., JNI
and P/Invoke). Many such languages are also object-oriented.
Managed languages include Java, C#, Python, and Ruby. C
and C++ are not managed languages; they are compiled ahead of time, not garbage collected, and unsafe. Unfortunately, managed languages add at least three new degrees of
freedom to experimental evaluation: (1) a space–time trade-off
due to garbage collection, in which heap size is a control variable, (2) nondeterminism due to adaptive optimization and
sampling technologies, and (3) system warm-up due to dynamic class loading and just-in-time (JIT) compilation.
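To make these three degrees of freedom concrete, the following is a minimal sketch of a timing harness, not the harness used by the authors. It repeats a workload within one JVM invocation, separates warm-up iterations from later ones, and reports the mean with an approximate 95% confidence interval. The class name `SteadyStateTimer`, the stand-in workload, the choice of 10 warm-up iterations, and the use of the normal critical value 1.96 are all assumptions made for illustration; a small sample would more properly use a Student's t value.

```java
import java.util.Arrays;

class SteadyStateTimer {
    // Volatile sink so the JIT cannot discard the workload's result as dead code.
    static volatile long sink;

    /** Runs the workload repeatedly in this JVM; returns each wall-clock time in nanoseconds. */
    public static double[] timeIterations(Runnable workload, int iterations) {
        double[] times = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            workload.run();
            times[i] = System.nanoTime() - start;
        }
        return times;
    }

    /** Discards the first `warmup` measurements (class loading, JIT compilation). */
    public static double[] dropWarmup(double[] times, int warmup) {
        return Arrays.copyOfRange(times, warmup, times.length);
    }

    public static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    /** Sample standard deviation, dividing by n - 1. */
    public static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double sumSquares = 0.0;
        for (double x : xs) {
            sumSquares += (x - m) * (x - m);
        }
        return Math.sqrt(sumSquares / (xs.length - 1));
    }

    /** Half-width of an approximate 95% confidence interval (normal approximation). */
    public static double ci95HalfWidth(double[] xs) {
        return 1.96 * sampleStdDev(xs) / Math.sqrt(xs.length);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in workload: allocates short-lived strings to exercise the collector.
        Runnable workload = () -> {
            long total = 0;
            for (int i = 0; i < 200_000; i++) {
                total += Integer.toString(i).length();
            }
            sink = total;
        };

        double[] all = timeIterations(workload, 30);
        double[] steady = dropWarmup(all, 10); // assumption: 10 warm-up iterations suffice

        // Record the heap limit, since heap size is a control variable.
        System.out.printf("max heap: %d MB%n",
                Runtime.getRuntime().maxMemory() / (1024 * 1024));
        System.out.printf("first iteration: %.2f ms%n", all[0] / 1e6);
        System.out.printf("later iterations: %.2f ms +/- %.2f ms (approx. 95%% CI, n=%d)%n",
                mean(steady) / 1e6, ci95HalfWidth(steady) / 1e6, steady.length);
    }
}
```

Running the sketch with different heap limits, for example `java -Xmx64m SteadyStateTimer` and `java -Xmx256m SteadyStateTimer`, exposes the space–time trade-off. Repeating whole JVM invocations, rather than only iterations within one invocation, would be needed to observe nondeterminism in adaptive optimization.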
Although programming language researchers have embraced managed languages, many have not evolved their
evaluation methodologies to address these additional degrees of freedom. As we shall show, weak methodology leads
to incorrect findings. Equally problematic, most architecture
and operating systems researchers do not use appropriate
workloads. Most ignore managed languages entirely, despite
their commercial prominence. They continue to use C and
C++ benchmarks, perhaps because of the significant cost and
challenges of developing expertise in new infrastructure. Regardless of the reasons, the current state of methodology for
managed languages often provides bad results or no results.
To combat this neglect, computer scientists must be
vigilant in their methodology. This paper describes how
we addressed some of these problems for Java and makes
recommendations for other domains. We discuss how
benchmark designers can create forward-looking and diverse
workloads and how researchers should use them. We then
present a set of experimental design guidelines that accommodate complex and nondeterministic workloads. We show
that managed languages make it much harder to produce
meaningful results and suggest how to identify and explore
control variables. Finally, we discuss the importance of
rigorous analysis [8] for complex nondeterministic systems that
are not amenable to trivial empirical methods.
We address neglect in one domain, at one point in time,
but the broader problem is widespread and growing. For
example, researchers and industry are pouring resources
into and exploring new approaches for embedded systems, multicore architectures, and concurrent programming models. However, without consequent investments