Figure 2: Gaming your results. Four ways to compare two garbage collectors.
[Plots omitted. Panels: (a) _209_db, Pentium-M; (b) hsqldb, Pentium-M; (c) pseudojbb, AMD; (d) pseudojbb, PPC. Each panel plots SemiSpace and MarkSweep against heap size relative to minimum heap size (x-axis, 1 to 6); y-axis ticks run from 1.1 to 1.5.]
experimental system is clearly important. For a classic comparison of Fortran, C, or C++ systems, there are at least two
degrees of freedom to control: (a) the host platform (hardware and operating system) and (b) the language runtime
(compiler and associated libraries). Over the years, researchers have evolved solid methodologies for evaluating compiler, library, and architectural enhancements that target these
languages. Consider a compiler optimization for improving
cache locality. Accepted practice is to compile with and without the optimization and report how often the compiler applied the optimization. To eliminate interference from other
processes, one runs the versions standalone on one or more
architectures and measures miss rates with either performance counters or a simulator. This methodology evolved over time,
but is now extremely familiar. Once researchers invest in a
methodology, the challenge is to notice when the world has
changed, and to figure out how to adapt.
Modern managed runtimes such as Java add at least three
more degrees of freedom: (c) heap size, (d) nondeterminism,
and (e) warm-up of the runtime system.
Heap Size: Managed languages use garbage collection to
detect unreachable objects, rather than relying on the programmer to explicitly delete objects. Garbage collection is
fundamentally a space–time trade-off between the efficacy
of space reclamation and time spent reclaiming objects;
heap size is the key control variable. The smaller the heap
size, the more often the garbage collector will be invoked
and the more work it will perform.
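The relationship between heap size and collection frequency can be sketched with a deliberately simplified model. The model is an assumption of this example, not something measured in the text: the program allocates a fixed volume, a constant amount of data is live at every collection, and each collection frees everything else. The function name and all numbers are hypothetical.

```python
import math

def estimated_collections(total_allocated_mb, live_mb, heap_mb):
    """Estimate the number of collections under a toy model (assumption):
    every collection leaves exactly live_mb in the heap, so each cycle
    makes heap_mb - live_mb available for new allocation."""
    free_per_cycle = heap_mb - live_mb
    if free_per_cycle <= 0:
        raise ValueError("heap must be larger than the live data")
    # A collection is triggered each time the free space fills up.
    return math.floor(total_allocated_mb / free_per_cycle)

if __name__ == "__main__":
    live = 100          # hypothetical minimum heap size, in MB
    allocated = 10_000  # hypothetical total allocation, in MB
    for multiple in (1.5, 2, 3, 4, 6):
        heap = live * multiple
        print(multiple, estimated_collections(allocated, live, heap))
```

In this model the collection count falls roughly as 1 / (heap size - live size), which is one reason to report results across a range of heap sizes relative to the minimum rather than at a single point.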
Nondeterminism: Deterministic profiling metrics are expensive. High-performance JVMs therefore use approximate
execution frequencies computed by low-overhead dynamic
sampling to select which methods the JIT compiler will optimize and how. For example, a method may happen to be
sampled N times in one invocation and N + 3 in another; if
the optimizer uses a hot-method threshold of N + 1, it will
make different choices. Due to this nondeterminism, code
quality usually does not reach the same steady state on a deterministic workload across independent JVM invocations.
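The threshold effect described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not how any particular JVM samples: each sample lands on a method with probability proportional to a hypothetical "true" execution weight, and the random seed stands in for run-to-run timing noise. The method names, weights, and threshold are invented for illustration.

```python
import random

def sampled_hot_methods(true_weights, num_samples, threshold, seed):
    """Simulate sampling-based profiling: each sample lands on a method with
    probability proportional to its weight. Methods whose sample count
    reaches the threshold are selected for optimization."""
    rng = random.Random(seed)  # the seed models timing noise (assumption)
    methods = list(true_weights)
    weights = [true_weights[m] for m in methods]
    counts = dict.fromkeys(methods, 0)
    for m in rng.choices(methods, weights=weights, k=num_samples):
        counts[m] += 1
    return {m for m, c in counts.items() if c >= threshold}

if __name__ == "__main__":
    # Hypothetical workload, identical in every simulated invocation.
    workload = {"parse": 0.30, "lookup": 0.25, "hash": 0.24, "log": 0.21}
    for seed in range(5):
        print(seed, sorted(sampled_hot_methods(workload, 100, 25, seed)))
```

Because several methods sit close to the threshold, different seeds may select different sets of hot methods even though the workload itself is unchanged, which mirrors the N versus N + 3 example in the text.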
Warm-Up: A single invocation of the JVM will often execute
the same application repeatedly. The first iteration of the application usually includes the largest amount of dynamic
compilation. Later iterations usually have both less compilation and better application code quality. Eventually, code
quality may reach a steady state. Code quality thus “warms
up.” Steady state is the most frequent use-case. For example,
application servers run their code many times in the same
JVM invocation and thus care most about steady-state performance. Controlling for code warm-up is an important aspect
of experimental design for high-performance runtimes.
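One way to control for warm-up is to time every iteration within a single invocation and report the early iterations separately from the later ones. The harness below is a minimal sketch of that structure. The function names and the fixed warm-up cut-off are assumptions of this example, and the Python interpreter used to run it does not necessarily exhibit the JIT warm-up behavior described above; only the shape of the measurement carries over.

```python
import time

def time_iterations(workload, iterations, clock=time.perf_counter):
    """Run workload() repeatedly in one process; return per-iteration times."""
    times = []
    for _ in range(iterations):
        start = clock()
        workload()
        times.append(clock() - start)
    return times

def split_warmup(times, warmup):
    """Separate the first `warmup` iterations from the rest.
    The cut-off is chosen by the experimenter (assumption); this sketch
    does not detect steady state automatically."""
    if not 0 <= warmup < len(times):
        raise ValueError("warmup must leave at least one measured iteration")
    return times[:warmup], times[warmup:]

if __name__ == "__main__":
    def workload():
        # Hypothetical stand-in for one iteration of the application.
        sum(i * i for i in range(100_000))

    times = time_iterations(workload, iterations=10)
    warm, steady = split_warmup(times, warmup=3)
    print("first iteration:", times[0])
    print("mean of later iterations:", sum(steady) / len(steady))
```

Reporting the first iteration and the later iterations separately keeps compilation-heavy start-up behavior from being averaged into the steady-state figure.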
3.3. Case study
We consider performance evaluation of a new garbage collector as an example of experimental design. We describe
the context and then show how to control the factors described above to produce a sound experimental design.
Two key context-specific factors for garbage-collection
evaluation are (a) the space–time trade-off as discussed
above and (b) the relationship between the collector and
mutator (the term for the application itself in the garbage-collection literature). For simplicity, we consider
a stop-the-world garbage collector, in which the collector
and the mutator never overlap in execution. This separation eases measurement of the mutator and collector.
Some collector-specific code mixes with the mutator: object allocation and write barriers, which identify pointers
that cross between independently collected regions. This
code impacts both the mutator and the JIT compiler. Furthermore, the collector greatly affects mutator locality, due
to the allocation policy and any movement of objects at collection time.
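Because a stop-the-world collector never runs concurrently with the mutator, total run time can be split by subtracting the collector pauses. The sketch below assumes a hypothetical log of (start, end) pause timestamps in seconds; the function name and input format are illustrative and not taken from any particular VM.

```python
def split_times(total_seconds, gc_pauses):
    """Split run time into (mutator, collector) for a stop-the-world collector.
    gc_pauses is a list of (start, end) timestamps; since pauses never
    overlap mutator execution, mutator time is the remainder."""
    collector = sum(end - start for start, end in gc_pauses)
    if collector > total_seconds:
        raise ValueError("pauses exceed total run time")
    return total_seconds - collector, collector

if __name__ == "__main__":
    # Hypothetical run: 10 s total with two collection pauses.
    mutator, collector = split_times(10.0, [(1.0, 1.5), (4.0, 4.25)])
    print("mutator:", mutator, "collector:", collector)
```

The mutator figure obtained this way still includes the allocation and write-barrier code mentioned above, as well as any locality effects of the collector, so it is not independent of the collector under test.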
Meaningful Baseline: Comparing against the state of the
art is ideal, but practical only when researchers make their
implementations publicly available. Researchers can then
implement their approaches using the same tools or control for infrastructure differences to make apples-to-apples
comparisons. Garbage-collection evaluations often use generational MarkSweep collectors as a baseline because these
collectors are widely used in high-performance VMs and
Host Platform: Garbage collectors exhibit architecture-dependent performance properties that are best revealed
with an evaluation across multiple architectures, as shown
in Figures 2(c) and 2(d). These properties include locality,
the cost of write barriers, and the cost of synchronization
Language Runtime: The language runtime, libraries, and
JIT compiler directly affect memory load, and so should be
controlled. Implementing various collectors in a common
toolkit factors out common shared mechanisms and focuses the comparison on the algorithmic differences between