Figure 1: Two time-varying selection metrics. Pointer distance (top) and heap composition (bottom) as a function of time. [Plots of pointer distances (%) and heap volume (MB) versus time in millions of pointer mutations.] (a) SPECjvm98 _209_db. (b) DaCapo eclipse.
there is no satisfactory answer, it is time to form or join a consortium and create new suitable workloads and supporting
infrastructure.
Do Not Cherry-Pick! A well-designed benchmark suite reflects a range of behaviors and should be used as a whole.
Perez et al. demonstrate with alarming clarity that cherry-picking changes the results of performance evaluation.13
They simulate 12 previously published cache architecture
optimizations in an apples-to-apples evaluation on a suite of
26 SPECcpu benchmarks. There is one clear winner with all
26 benchmarks. There is a choice of 2 different winners with
a suitable subset of 23 benchmarks, 6 winners with subsets
of 18, and 11 winners with 7. When methodology allows researchers a choice among 11 winners from 12 candidates,
the risk of incorrect conclusions, by either mischief or error,
is too high. Section 3.1 shows that Java is equally vulnerable
to subsetting.
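The danger is easy to reproduce. The following sketch (hypothetical speedup numbers, benchmark names, and class name) scores two candidate optimizations by geometric-mean speedup, first over a full suite and then over a cherry-picked subset; the apparent winner flips.

import java.util.*;

// Sketch: how reporting only a subset of a suite can change the "winner".
public class SubsetSensitivity {
    // Geometric mean of per-benchmark speedups over the chosen benchmarks.
    static double geoMean(Map<String, Double> speedups, Set<String> benchmarks) {
        double logSum = 0.0;
        for (String b : benchmarks)
            logSum += Math.log(speedups.get(b));
        return Math.exp(logSum / benchmarks.size());
    }

    public static void main(String[] args) {
        // Hypothetical speedups over a common baseline for two optimizations.
        Map<String, Double> optA = Map.of("db", 1.20, "eclipse", 0.95, "hsqldb", 1.02);
        Map<String, Double> optB = Map.of("db", 1.05, "eclipse", 1.10, "hsqldb", 1.04);

        Set<String> fullSuite = Set.of("db", "eclipse", "hsqldb");
        Set<String> subset = Set.of("db", "hsqldb");   // "eclipse" quietly dropped

        System.out.printf("Full suite: A=%.3f B=%.3f%n",
                geoMean(optA, fullSuite), geoMean(optB, fullSuite));   // B wins
        System.out.printf("Subset:     A=%.3f B=%.3f%n",
                geoMean(optA, subset), geoMean(optB, subset));         // A "wins"
    }
}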
Run every benchmark. If it is impossible to report results
for every benchmark because of space or time constraints,
bugs, or relevance, explain why. For example, if you are proposing an optimization for multithreaded Java workloads,
you may wish to exclude benchmarks that do not exhibit
concurrency. In this case, we recommend reporting all the
results but highlighting the most pertinent. Otherwise, readers are left guessing as to the impact of the “optimization”
on the omitted workloads—with key data omitted, readers
and reviewers should not give researchers the benefit of the
doubt.
3. EXPERIMENTAL DESIGN
Sound experimental design requires a meaningful baseline and comparisons that control key parameters. Most
researchers choose and justify a baseline well, but identifying which parameters to control and how to control them is
challenging.
3.1. Gaming your results
The complexity and degrees of freedom inherent in these
systems make it easy to produce misleading results through
errors, omissions, or mischief. Figure 2 presents four results
from a detailed comparison of two garbage collectors. The
JVM, architecture, and other evaluation details appear in the
original paper.4 More garbage collector implementation details are in Blackburn et al.2 Each graph shows normalized
time (lower is better) across a range of heap sizes that expose
the space–time tradeoff for implementations of two canonical garbage collector designs, SemiSpace and MarkSweep.
Subsetting Figure 2 badly misleads us in at least three
ways: (1) Figure 2(c) shows that selecting a single heap size rather than plotting a continuum can produce
diametrically opposite conclusions. At 2.1× maximum heap
size, MarkSweep performs much better than SemiSpace,
while at 6.0× maximum heap size, SemiSpace performs better.
Figures 2(a) and 2(d) exhibit this same dichotomy, but have
different crossover points. Unfortunately, some researchers are still evaluating the performance of garbage-collected
languages without varying heap size. (2) Figures 2(a) and 2(b)
confirm the need to use an entire benchmark suite. Although
_209_db and hsqldb are established in-memory database
benchmarks, SemiSpace performs better for _209_db in large
heaps, while MarkSweep is always better for hsqldb. (3) Figures
2(c) and 2(d) show that the architecture significantly impacts
conclusions at these heap size ranges. MarkSweep is better at
more heap sizes for AMD hardware as shown in Figure 2(c).
However, Figure 2(d) shows SemiSpace is better at more heap
sizes for PowerPC (PPC) hardware. This example of garbage-collection evaluation illustrates a small subset of the pitfalls
in evaluating the performance of managed languages.
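A simple harness avoids the single-heap-size trap. The sketch below (hypothetical jar name, workload, and minimum heap value) reruns one workload at fixed multiples of its minimum heap size, producing the continuum of points that plots such as Figure 2 require rather than one arbitrary -Xmx setting.

import java.io.IOException;

// Sketch: sweep heap sizes instead of measuring at a single -Xmx value.
public class HeapSweep {
    public static void main(String[] args) throws IOException, InterruptedException {
        int minHeapMB = 20;   // assumed minimum heap in which the workload completes
        double[] multiples = {1.0, 1.25, 1.5, 2.0, 3.0, 4.0, 6.0};

        for (double m : multiples) {
            int heapMB = (int) Math.ceil(minHeapMB * m);
            ProcessBuilder pb = new ProcessBuilder(
                    "java", "-Xms" + heapMB + "m", "-Xmx" + heapMB + "m",  // pin the heap
                    "-jar", "dacapo.jar", "eclipse");                      // hypothetical jar/workload
            pb.inheritIO();
            long start = System.nanoTime();
            int exit = pb.start().waitFor();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("heap=%dMB (%.2fx min) exit=%d time=%dms%n",
                    heapMB, m, exit, elapsedMs);
        }
    }
}

In a real evaluation each heap size would also be measured over enough repetitions to report a confidence interval.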
3.2. Control in a changing world