DOI: 10.1145/1378704.1378722
Technical Perspective
A Methodology for Evaluating Computer System Performance
By William Pugh
Computer science has long had a solid foundation for evaluating the performance of algorithms. The asymptotic complexity of the time required
by an algorithm is well defined and
usually tractable, allowing for a clear
evaluation of whether one algorithm
provides a fundamental improvement
over another. More nuanced and alternative evaluations, such as amortized
and randomized analysis, provide additional insights into the fundamental
advantages of different algorithms.
Unfortunately, the situation is far murkier when evaluating the performance of a computer system, whether
that system is a computer architecture, a compiler, a graphics processor,
or a runtime system. Given a specific
application, it is often fairly straightforward to execute the application on
various systems and evaluate which
system offers faster execution of that
application on the provided input. Of
course, once an application has been
run on a particular input, one generally does not need to rerun it on that
same input.
What programmers really want is
some way to evaluate which system is
likely to provide better performance
on applications and data sets run in
the future, thus making it the “better” system. Benchmarks are intended to provide exactly that kind of forward-looking measure; they also provide a way to examine how various system components behave and interact under load. Benchmarks should give repeatable results, even when rerun by
an independent researcher or testing
organization. A benchmark can be
either a real or a synthetic application.
A synthetic application doesn’t
compute anything useful but is designed to have performance characteristics that are representative of a range
of real applications.
Benchmarks have an established
history in computer science. The first
widely used synthetic benchmark was
the Whetstone benchmark, written in the Algol 60 programming language in
1972 and later translated into many
other languages. Several benchmarks
became well known and widely used
in research or commercial settings, or
both. Examples include the Livermore
loops, the Dhrystone benchmark, the
Linpack benchmark, and the Perfect
Club benchmarks.
The design and selection of benchmarks, however, has traditionally been a matter of art and taste as much as of science or engineering. The following paper by the DaCapo team is the best effort I've seen at providing a sound basis for selecting benchmarks. Historically, there
has not been any standard methodology for deciding whether or not a benchmark did indeed provide a representative measure of a system’s performance
within a particular domain. A more
serious problem with benchmarks is
that they age poorly. Benchmarks often do a reasonable job of evaluating
the performance of applications at the
time they are proposed. However, three
things tend to make benchmarks grow
less useful over time:
• As machines and memories grow
faster and larger, the sizes of application data sets grow as well. What was
considered a reasonable problem size
when a benchmark was proposed soon
becomes a trivial example that fits in
the on-processor cache.
• The actual applications that people
use systems for evolve over time, and
benchmarks that were once representative become less so.
• The weight attached to benchmark
performance encourages developers of computer systems to optimize,
tune, and tweak their systems in ways
that improve their performance on the
benchmarks but not more generally,
making the benchmarks—again—less
representative.
Almost every systems researcher
and commercial software developer
has a personal horror story about a
poorly designed benchmark that was
difficult to use, produced misleading
results, or focused attention on the
wrong problem for too long. One such
story in my own experience involves the
SPEC JVM98 db benchmark, which was intended to represent a database application.
Several early papers on removing redundant or useless synchronization
from Java programs focused on this
benchmark, since removing such synchronization could produce a 20% to
30% speed improvement in the benchmark. However, closer examination revealed that more than 70% of the CPU
time for this benchmark was spent in a
badly written 20-line Shell sort; replacing the handwritten sort with a call to
the built-in sort function doubled the
execution speed, even without removing the useless synchronization.
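The general point, that a single hot spot can dominate a benchmark and mask the effect under study, is easy to reproduce. The sketch below is a minimal illustration in Python, not the Java source of the db benchmark; it merely times a hand-rolled shell sort against the language's built-in sort on synthetic data.

    import random
    import time

    def shell_sort(a):
        # Naive shell sort using the simple "halve the gap" sequence, standing
        # in for the kind of hand-rolled sort that dominated the db profile.
        gap = len(a) // 2
        while gap > 0:
            for i in range(gap, len(a)):
                value, j = a[i], i
                while j >= gap and a[j - gap] > value:
                    a[j] = a[j - gap]
                    j -= gap
                a[j] = value
            gap //= 2

    data = [random.random() for _ in range(50_000)]

    start = time.perf_counter()
    hand_rolled = data.copy()
    shell_sort(hand_rolled)
    middle = time.perf_counter()
    built_in = sorted(data)
    end = time.perf_counter()

    print(f"hand-rolled shell sort: {middle - start:.2f}s")
    print(f"built-in sort:          {end - middle:.2f}s")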
The DaCapo research group offers
what seems to be an exceptionally well-engineered set of benchmarks for evaluating Java computer systems. This effort includes not only selecting the benchmark applications, but also designing a substantial infrastructure to support executing the benchmarks and evaluating the results.
Far more important than the actual selection of the benchmarks and the engineering infrastructure, however, is the DaCapo team's excellent description of best practices for using benchmarks to evaluate Java system performance, along with a principled approach for evaluating whether a suite of benchmark applications is, in fact, sufficiently diverse. This approach
involves measuring a number of characteristics of each application, and
then applying principal component
analysis (PCA) to determine whether
the applications do have fundamental differences or whether they essentially measure the same aspects of a system. I
hope the methodology described in
the paper will allow the DaCapo benchmark suite, and others, to be evaluated
so they can evolve in ways that make
them useful as well as meaningful for
more than just a moment in time.
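To make the diversity check concrete, here is a minimal sketch assuming each benchmark is summarized by a vector of measured characteristics; the data is synthetic and the characteristic names are only illustrative, not the DaCapo metrics themselves.

    import numpy as np

    # Toy data, purely illustrative: 8 hypothetical benchmarks, each described
    # by 5 measured characteristics (e.g., allocation rate, cache miss rate).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 5))

    # Standardize each characteristic so that no single metric dominates.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Principal component analysis via eigendecomposition of the covariance.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    explained = eigvals[order] / eigvals.sum()

    print("fraction of variance explained per component:", np.round(explained, 3))
    # If one or two components account for almost all of the variance, the
    # applications are largely measuring the same aspects of the system; a
    # diverse suite spreads its variance across many components.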
William Pugh (pugh@cs.umd.edu) is a professor in the
Department of Computer Science at the University of
Maryland, College Park.