pare performance across a range of benchmark-specific
relative heap sizes, starting at the smallest heap in which
any of the measured collectors can run, as shown in Figure
2. Each evaluated system must experience the same memory
load, which requires forcing collections between iterations
to normalize the heap and controlling the JIT compiler.
Nondeterminism: Nondeterministic JIT optimization
plans lead to nondeterministic mutator performance. JIT
optimization of collector-specific code, optimizations that
elide allocations, and the fraction of time spent in collection may affect mutator behavior in ways that cannot be predicted or repeated. For example, in Jikes RVM, a Java-in-Java
VM widely used by researchers, JIT compiler activity directly
generates garbage collection load because the compiler allocates and executes in the same heap as the application.
These effects make nondeterminism even more acute.
Warm-Up: For multi-iteration experiments, as the system
warms up, mutator speeds increase, and JIT compiler activity
decreases, the fraction of time spent in collection typically
grows. Steady-state execution therefore accentuates the impact of the garbage collector as compared to start-up. Furthermore, the relative impact of collector-specific code will
change as the code is more aggressively optimized. Evaluations must therefore control for code quality and warm-up.
3.4. Controlling Nondeterminism
Of the three new degrees of freedom outlined in Section 3.2,
we find dealing with nondeterminism to be the most methodologically challenging. Over time, we have adopted and
recommend three different strategies: (a) use deterministic
replay of optimization plans, which requires JVM support;
(b) take multiple measurements in a single JVM invocation,
after reaching steady state and turning off the JIT compiler;
and (c) generate sufficient data points and apply suitable
statistical analysis.8 Depending on the experiment, the researcher may apply one, two, or all three of these strategies. The first two reduce nondeterminism for analysis
purposes by controlling its sources. Statistical analysis of
results from (a) and (b) will reveal whether differences from
the remaining nondeterminism are significant. The choice
of (c) accommodates larger factors of nondeterminism (see
Section 4) and may be more realistic, but requires significantly more data points, at the expense of other experiments.
Replay Compilation: Replay compilation collects profile data and a compilation plan from one or more training runs, forms an optimization plan, and then replays
it in subsequent, independent timing invocations.9 This
methodology deterministically applies the JIT compiler,
but requires modifications to the JVM. It isolates the JIT
compiler activity, since replay eagerly compiles to the
plan’s final optimization level instead of lazily relying on
dynamic recompilation triggers. Researchers can measure the first iteration for deterministic characterization
of start-up behavior. Replay also removes most profiling
overheads associated with the adaptive optimization system, which is turned off. As far as we are aware, production JVMs do not support replay compilation.
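The two-phase structure of replay can be sketched as a small driver: training invocations record profile data and a compilation plan, and independent timing invocations reuse it. The JVM binary, flag names, and benchmark class below are placeholders, not options of any real JVM.

```java
// Hypothetical driver illustrating the two phases of replay compilation:
// training invocations record an optimization plan, and separate timing
// invocations replay it.  The JVM binary and flags are placeholders; real
// replay support in a research JVM has its own option names.
import java.util.List;

public class ReplayDriver {
    static void run(List<String> command) throws Exception {
        new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        String jvm = "research-jvm";              // placeholder JVM binary
        String plan = "plan.dat";                 // recorded optimization plan

        // Phase 1: training run that records profile data and a compilation plan.
        run(List.of(jvm, "-RecordOptimizationPlan=" + plan,     // placeholder flag
                    "Benchmark", "--train"));

        // Phase 2: independent timing runs that eagerly compile to the plan's
        // final optimization levels, with the adaptive system turned off.
        for (int i = 0; i < 5; i++) {
            run(List.of(jvm, "-ReplayOptimizationPlan=" + plan, // placeholder flag
                        "-DisableAdaptiveSystem",               // placeholder flag
                        "Benchmark", "--time"));
        }
    }
}
```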
Multi-Iteration Determinism: An alternative approach that does not depend on runtime support is to run multiple
measurement iterations of a benchmark in a single invocation, after the runtime has reached steady state. Unlike
replay, this approach does not support deterministic measurement of warm-up. We use this approach when gathering data from multiple hardware performance counters,
which requires multiple distinct measurements of the same
system. We first perform N – 1 unmeasured iterations of a
benchmark while the JIT compiler warms up the code. We
then turn the JIT compiler off and execute the Nth iteration
unmeasured to drain any JIT work queues. We measure the
next K iterations. On each iteration, we gather different performance counters of interest. Since the code quality has
reached steady state, it should be a representative mix of
optimized and unoptimized code. Since the JIT compiler is
turned off, the variation between the subsequent iterations
should be low. The variation can be measured and verified.
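This methodology can be expressed as a small harness along the following lines; disableJit() and readCounters() are hypothetical hooks standing in for JVM- and OS-specific mechanisms, and the iteration counts are assumed values.

```java
// Sketch of the steady-state methodology described above: N-1 unmeasured
// warm-up iterations, one unmeasured iteration with the JIT compiler off
// to drain its work queues, then K measured iterations, each reading a
// different set of performance counters.  disableJit() and readCounters()
// are hypothetical hooks for JVM- and OS-specific mechanisms.
public class SteadyStateHarness {
    static final int N = 10;   // warm-up iterations (assumed value)
    static final int K = 4;    // measured iterations, one counter set each

    public static void main(String[] args) {
        for (int i = 0; i < N - 1; i++) {
            runIteration();                    // unmeasured: JIT warms up the code
        }
        disableJit();                          // hypothetical JVM-specific hook
        runIteration();                        // unmeasured: drain JIT work queues

        for (int k = 0; k < K; k++) {
            long[] before = readCounters(k);   // hypothetical counter read
            long start = System.nanoTime();
            runIteration();                    // measured iteration
            long elapsed = System.nanoTime() - start;
            long[] after = readCounters(k);
            report(k, elapsed, before, after);
        }
    }

    static void runIteration() { /* benchmark body (placeholder) */ }
    static void disableJit() { /* JVM-specific; not available on stock JVMs */ }
    static long[] readCounters(int set) { return new long[0]; /* e.g., via perf/PAPI bindings */ }
    static void report(int set, long ns, long[] before, long[] after) {
        System.out.printf("counter set %d: %.1f ms%n", set, ns / 1e6);
    }
}
```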
3.5. Experimental Design in Other Settings
In each experimental setting, the relative influence of the
degrees of freedom, and how to control them, will vary. For
example, when evaluating a new compiler optimization, researchers should hold the garbage-collection activity constant to keep it from obscuring the effect of the optimization.
Comparing on multiple architectures is best, but is limited
by the compiler back-end. When evaluating a new architecture, vary the garbage-collection load and JIT compiler activity, since both have distinctive execution profiles. Since architecture evaluation often involves very expensive simulation,
eliminating nondeterminism is particularly important.
4. Analysis
Researchers use data analysis to identify and articulate the
significance of experimental results. This task becomes more challenging as systems and their evaluation grow more complex and the sheer volume of results increases. The primary data analysis task is one of aggregation: (a) across repeated experiments to defeat experimental noise and (b) across
diverse experiments to draw conclusions.
Aggregating data across repeated experiments is a standard technique for increasing confidence in a noisy environment.8 In the limit, this approach is in tension with
tractability, because researchers have only finite resources. Reducing
sources of nondeterminism with sound experimental design
improves tractability. Since noise cannot be eliminated altogether, multiple trials are inevitably necessary. Researchers must aggregate data from multiple trials and provide
evidence such as confidence intervals to reveal whether the
findings are significant. Georges et al.8 use a survey to show
that current practice lacks statistical rigor and explain the
appropriate tests for comparing alternatives.
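As a concrete illustration, a 95% confidence interval for the mean of repeated trials can be computed as below; the critical value 1.96 assumes a sample large enough for the normal approximation, and small samples should use a Student's t value instead. The timings shown are invented example data.

```java
// Sketch: aggregate repeated trials into a mean and a 95% confidence
// interval.  The critical value 1.96 assumes the normal approximation is
// acceptable; small samples should use the Student's t distribution.
public class ConfidenceInterval {
    public static void main(String[] args) {
        double[] trials = {102.1, 98.7, 101.4, 99.9, 100.6, 103.2};  // example timings (ms)

        double mean = 0;
        for (double t : trials) mean += t;
        mean /= trials.length;

        double variance = 0;
        for (double t : trials) variance += (t - mean) * (t - mean);
        variance /= (trials.length - 1);                  // sample variance

        double stderr = Math.sqrt(variance / trials.length);
        double halfWidth = 1.96 * stderr;                 // 95% CI, normal approximation

        System.out.printf("mean = %.2f ms, 95%% CI = [%.2f, %.2f]%n",
                          mean, mean - halfWidth, mean + halfWidth);
    }
}
```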
Section 2.3 exhorts researchers not to cherry-pick benchmarks. Still, researchers need to convey results from diverse
experiments succinctly, which necessitates aggregation. We
encourage researchers (a) to include complete results and (b)
to use appropriate summaries. For example, using the geometric mean dampens the skewing effect of one excellent
result. Although industrial benchmarks will often produce