a single aggregate score over a suite, this methodology is brittle because the result depends entirely on vagaries of the suite composition.18 For example, while it is tempting to cite only your best result ("we outperform X by up to 1000%"), reporting an aggregate together with the best and worst results is more honest and more insightful.
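To make the recommendation concrete, here is a minimal sketch in Java (the paper's domain) that reports a geometric-mean aggregate alongside the best and worst per-benchmark results. The class name, benchmark names, and speedup values are invented for illustration only; the geometric mean is just one possible aggregate, not the paper's prescribed metric.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal sketch: report an aggregate (geometric mean) together with the
    // best and worst per-benchmark speedups, rather than the aggregate alone.
    // All numbers below are hypothetical, not measured results.
    public class SpeedupReport {
        public static void main(String[] args) {
            Map<String, Double> speedups = new LinkedHashMap<>();
            speedups.put("antlr", 1.05);   // hypothetical speedup of system A over X
            speedups.put("bloat", 0.92);
            speedups.put("fop",   1.40);
            speedups.put("pmd",   10.0);   // one outlier can dominate an "up to" claim

            double logSum = 0.0;
            String best = null, worst = null;
            for (Map.Entry<String, Double> e : speedups.entrySet()) {
                logSum += Math.log(e.getValue());
                if (best == null || e.getValue() > speedups.get(best)) best = e.getKey();
                if (worst == null || e.getValue() < speedups.get(worst)) worst = e.getKey();
            }
            double geomean = Math.exp(logSum / speedups.size());

            System.out.printf("geometric mean speedup: %.2fx%n", geomean);
            System.out.printf("best case:  %s (%.2fx)%n", best, speedups.get(best));
            System.out.printf("worst case: %s (%.2fx)%n", worst, speedups.get(worst));
        }
    }

Printing all three numbers makes it immediately visible when a single outlier, such as the 10x entry above, is driving the headline claim.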
5. Conclusion
Methodology plays a strategic role in experimental computer science research and development by creating a common ground for evaluating ideas and products. Sound methodology relies on relevant workloads, principled experimental design, and rigorous analysis. Evaluation methodology can therefore have a significant impact on a research field, potentially accelerating, retarding, or misdirecting energy and innovation. However, we work within a fast-changing environment, and our methodologies must adapt to remain sound and relevant.

Prompted by concerns among ourselves and others about the state of the art, we spent thousands of hours at eight institutions examining and addressing the problems of evaluating Java applications. The lack of direct funding, the perception that methodology is mundane, and the magnitude of the effort surely explain why such efforts are uncommon.

We address the neglect of evaluation methodology concretely, in one domain at one point in time, and draw broader lessons for experimental computer science. The development and maintenance of the DaCapo benchmark suite and associated methodology have brought some much-needed improvement to our evaluations and to our particular field. However, experimental computer science cannot expect the upkeep of its methodological foundations to fall to ad hoc volunteer efforts. We encourage stakeholders such as industry and granting agencies to be forward-looking and to make a systemic commitment to stemming methodological neglect. Invest in the foundations of our innovation.
Acknowledgments
We thank Andrew Appel, Randy Chow, Frans Kaashoek, and Bill Pugh, who encouraged this project at our three-year NSF ITR review. We thank Mark Wegman, who initiated the public availability of Jikes RVM, and the developers of Jikes RVM. We gratefully acknowledge Fahad Gilani, who wrote the original version of the measurement infrastructure for his ANU Master's thesis; Xianglong Huang and Narendran Sachindran, who helped develop the replay methodology; and Jungwoo Ha and Magnus Gustafsson, who helped develop the multi-iteration replay methodology. We thank Tom Horn for his proofreading, and Guy Steele for his careful reading and suggestions.

This work was supported by NSF ITR CCR-0085792, NSF CNS-0719966, NSF CCF-0429859, NSF EIA-0303609, DARPA F33615-03-C-4106, ARC DP0452011, ARC DP0666059, Intel, IBM, and Microsoft. Any opinions, findings, and conclusions expressed herein are the authors' and do not necessarily reflect those of the sponsors.
References
1. Adamson, A., Dagastine, D., and Sarne, S. SPECjbb2005 – A year in the life of a benchmark. 2007 SPEC Benchmark Workshop, SPEC, Jan. 2007.
2. Blackburn, S.M., Cheng, P., and McKinley, K.S. Myths and realities: The performance impact of garbage collection. Proceedings of the ACM Conference on Measurement and Modeling of Computer Systems, pp. 25–36, New York, NY, June 2004.
3. Blackburn, S.M., Garner, R., Hoffmann, C., Khan, A.M., McKinley, K.S., Bentzur, R., Diwan, A., Feinberg, D., Frampton, D., Guyer, S.Z., Hirzel, M., Hosking, A., Jump, M., Lee, H., Moss, J.E.B., Phansalkar, A., Stefanović, D., VanDrunen, T., von Dincklage, D., and Wiedermann, B. The DaCapo benchmarks: Java benchmarking development and analysis (extended version). Technical Report TR-CS-06-01, Dept. of Computer Science, Australian National University, 2006. http://www.dacapobench.org.
4. Blackburn, S.M., Garner, R., Hoffmann, C., Khan, A.M., McKinley, K.S., Bentzur, R., Diwan, A., Feinberg, D., Frampton, D., Guyer, S.Z., Hirzel, M., Hosking, A., Jump, M., Lee, H., Moss, J.E.B., Phansalkar, A., Stefanović, D., VanDrunen, T., von Dincklage, D., and Wiedermann, B. The DaCapo benchmarks: Java benchmarking development and analysis. ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pp. 169–190, Oct. 2006.
5. Bodden, E., Hendren, L., and Lhoták, O. A staged static program analysis to improve the performance of runtime monitoring. 21st European Conference on Object-Oriented Programming, July 30–August 3, 2007, Berlin, Germany, number 4609 in Lecture Notes in Computer Science, pp. 525–549, Springer-Verlag, 2007.
6. Chidamber, S.R. and Kemerer, C.F. A metrics suite for object-oriented design. IEEE Transactions on Software Engineering, 20(6):476–493, 1994.
7. Dunteman, G.H. Principal Components Analysis. Sage Publications, Newbury Park, CA, USA, 1989.
8. Georges, A., Buytaert, D., and Eeckhout, L. Statistically rigorous Java performance evaluation. ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pp. 57–76, Montreal, Quebec, Canada, 2007.
9. Huang, X., Blackburn, S.M., McKinley, K.S., Moss, J.E.B., Wang, Z., and Cheng, P. The garbage collection advantage: Improving mutator locality. ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pp. 69–80, Vancouver, BC, 2004.
10. Leyland, B. Auckland central business district supply failure. Power Engineering Journal, 12(3):109–114, 1998.
11. National Transportation Safety Board. NTSB urges bridge owners to perform load capacity calculations before modifications; I-35W investigation continues. SB-08-02. http://www.ntsb.gov/Pressrel/2008/080115.html, Jan. 2008.
12. Neelakantam, N., Rajwar, R., Srinivas, S., Srinivasan, U., and Zilles, C. Hardware atomicity for reliable software speculation. ACM/IEEE International Symposium on Computer Architecture, pp. 174–185, ACM, New York, NY, USA, 2007.
13. Perez, D.G., Mouchard, G., and Temam, O. MicroLib: A case for the quantitative comparison of micro-architecture mechanisms. International Symposium on Microarchitecture, pp. 43–54, Portland, OR, Dec. 2004.
14. Standard Performance Evaluation Corporation. SPECjvm98 Documentation, release 1.03 edition, March 1999.
15. Standard Performance Evaluation Corporation. SPECjbb2000 (Java Business Benchmark) Documentation, release 1.01 edition, 2001.
16. Stefanović, D. Properties of Age-Based Automatic Memory Reclamation Algorithms. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst, Massachusetts, Dec. 1998.
17. Yi, J.J., Vandierendonck, H., Eeckhout, L., and Lilja, D.J. The exigency of benchmark and compiler drift: Designing tomorrow's processors with yesterday's tools. International Conference on Supercomputing, pp. 75–86, Cairns, Queensland, Australia, July 2006.
18. Yoo, R.M., Lee, H.-H.S., Lee, H., and Chow, K. Hierarchical means: Single number benchmarking with workload cluster analysis. IEEE 10th International Symposium on Workload Characterization (IISWC 2007), pp. 204–213, IEEE, 2007.
Stephen M. Blackburn, Robin Garner, Daniel Frampton, Australian National University
Kathryn S. McKinley, Aashish Phansalkar, Ben Wiedermann, Maria Jump, University of Texas at Austin
Chris Hoffmann, Asjad M. Khan, J. Eliot B. Moss, University of Massachusetts, Amherst
Rotem Bentzur, Daniel Feinberg, Darko Stefanović, University of New Mexico
Amer Diwan, Daniel von Dincklage, University of Colorado
Samuel Z. Guyer, Tufts University
Martin Hirzel, IBM
Antony Hosking, Purdue University
Han Lee, Intel
Thomas VanDrunen, Wheaton College

© 2008 ACM 0001-0782/08/0800 $5.00