allocated to live memory, and heap mutation rate. These new metrics included summaries and time series of allocated and live object size demographics, of pointer distances, and of mutation distances. Pointer distance and mutation
distance time-series metrics summarize the lengths of the
edges that form the application’s object graph. We designed
these metrics and their means of collection to be abstract,
so that the measurements are VM-neutral.⁴
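As a rough illustration of what a pointer-distance metric captures, the sketch below tags each object with an allocation index and reports the gap, in allocation order, between the holder of each reference and its referent. The allocate helper, the toy object graph, and the use of allocation order rather than addresses or bytes allocated are all illustrative assumptions; the real measurements were taken inside an instrumented VM.

import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a pointer-distance summary: objects are tagged with an
// allocation index, and each reference contributes the absolute difference
// between the indices of its source and target objects.
public class PointerDistanceSketch {
    static final Map<Object, Long> allocIndex = new IdentityHashMap<>();
    static long nextIndex = 0;

    // Stand-in for a VM-level allocation hook.
    static <T> T allocate(T obj) {
        allocIndex.put(obj, nextIndex++);
        return obj;
    }

    static long pointerDistance(Object source, Object target) {
        return Math.abs(allocIndex.get(source) - allocIndex.get(target));
    }

    public static void main(String[] args) {
        // Build a tiny object graph: several nodes all pointing at one
        // early-allocated object, plus one pointer between neighbors.
        Object shared = allocate(new Object());
        Object previous = shared;
        List<Long> distances = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            Object node = allocate(new Object());
            distances.add(pointerDistance(node, shared));   // long edge
            distances.add(pointerDistance(node, previous)); // short edge
            previous = node;
        }
        // Summarize the edge lengths, as the demographic metrics do.
        long max = 0, sum = 0;
        for (long d : distances) { max = Math.max(max, d); sum += d; }
        System.out.printf("edges=%d  mean=%.1f  max=%d%n",
                distances.size(), (double) sum / distances.size(), max);
    }
}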
Figure 1 qualitatively illustrates the temporal complexity of heap composition and pointer distance metrics for
two benchmarks, _209_db and eclipse. With respect to our
metrics, eclipse from DaCapo is qualitatively richer than
_209_db from SPECjvm98. Our original paper explains how
to read these graphs and includes dozens of graphs, representing mountains of data.⁴ Furthermore, it shows that the
DaCapo benchmarks substantially improve over SPECjvm98
on all measured metrics. To confirm the diversity of the
suite, we applied principal component analysis (PCA)⁷ to the summary metrics. PCA is a multivariate statistical technique that reduces a large N-dimensional space to a lower-dimensional uncorrelated space. If the benchmarks are uncorrelated in the lower-dimensional space, then they are also uncorrelated in the higher-dimensional space. The analysis
shows that the DaCapo benchmarks are diverse, nontrivial
real-world applications with significant memory load, code
complexity, and code size.
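For readers unfamiliar with PCA, the sketch below shows the flavor of such an analysis on made-up summary metrics: center the data, form the covariance matrix, and extract the dominant principal component, here via power iteration rather than a statistics package. The metric values, the number of benchmarks, and the single-component projection are illustrative assumptions, not the data or tooling used in the paper.

import java.util.Arrays;

// Illustrative PCA on made-up benchmark metrics: center, form the covariance
// matrix, and find its dominant eigenvector (the first principal component)
// by power iteration, then project each benchmark onto it.
public class BenchmarkPcaSketch {
    public static void main(String[] args) {
        // Rows: benchmarks; columns: hypothetical summary metrics.
        double[][] metrics = {
            {1.2, 0.8, 3.1, 0.4},
            {2.5, 1.9, 0.7, 1.1},
            {0.3, 2.2, 1.8, 2.6},
            {3.0, 0.5, 2.4, 0.9},
        };
        double[][] centered = center(metrics);
        double[] pc1 = powerIteration(covariance(centered), 1000);
        System.out.println("first principal component: " + Arrays.toString(pc1));
        // Each benchmark's score along the direction of greatest variance;
        // well-spread scores suggest the benchmarks differ on these metrics.
        for (double[] row : centered) {
            System.out.printf("score = %+.3f%n", dot(row, pc1));
        }
    }

    static double[][] center(double[][] x) {
        int n = x.length, d = x[0].length;
        double[] mean = new double[d];
        for (double[] row : x)
            for (int j = 0; j < d; j++) mean[j] += row[j] / n;
        double[][] c = new double[n][d];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++) c[i][j] = x[i][j] - mean[j];
        return c;
    }

    static double[][] covariance(double[][] c) {
        int n = c.length, d = c[0].length;
        double[][] cov = new double[d][d];
        for (double[] row : c)
            for (int j = 0; j < d; j++)
                for (int k = 0; k < d; k++) cov[j][k] += row[j] * row[k] / (n - 1);
        return cov;
    }

    // Power iteration converges to the eigenvector with the largest eigenvalue.
    static double[] powerIteration(double[][] m, int iterations) {
        int d = m.length;
        double[] v = new double[d];
        Arrays.fill(v, 1.0 / Math.sqrt(d));
        for (int t = 0; t < iterations; t++) {
            double[] next = new double[d];
            for (int j = 0; j < d; j++)
                for (int k = 0; k < d; k++) next[j] += m[j][k] * v[k];
            double norm = Math.sqrt(dot(next, next));
            for (int j = 0; j < d; j++) next[j] /= norm;
            v = next;
        }
        return v;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}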
Because the applications come from active projects,
they include unresolved performance anomalies, both typical and unusual programming idioms, and bugs. Although
not our intention, their rich use of Java features uncovered
bugs in some commercial JVMs. The suite notably omits
Java application servers, embedded Java applications, and
numerically intensive applications. Only a few benchmarks
are explicitly concurrent. To remain relevant, we plan to update the DaCapo benchmarks every two years, bringing each application to its latest version, adding new applications, and deleting applications that have become less relevant. This relatively tight schedule should reduce the extent to which vendors may tune their products to the benchmarks (which is standard practice, notably for SPECjbb2000¹).
As far as we know, we are the first to use quantitative metrics and PCA to ensure that our suite is diverse and
nontrivial. The designers of future suites should choose
additional aggregate and time-varying metrics that directly address the domain of interest. For example, metrics
for concurrent or embedded applications might include a
measure of the fraction of time spent executing purely sequential code, maximum and time-varying degree of parallelism, and a measure of sharing between threads.
2.2. Suitable for research
We decided that making the benchmarks tractable, standardized, and suitable for research was a high priority.
While not technically deep, good packaging is extremely
time-consuming and affects usability. Researchers need
tractable workloads because they often run thousands of executions for a single experiment. Consider comparing four
garbage collectors over 16 heap sizes—that is, we need 64
combinations to measure. Teasing apart the performance
differences with multiple hardware performance monitors may add eight or more differently instrumented runs
per combination. Using five trials to ensure statistical significance requires a grand total of 2560 test runs. If a single
benchmark test run takes as long as 20 min (the time limit is 30 min on SPECjbb¹⁵), we would need over a month on one machine for just one benchmark comparison, and surely we should test the four garbage collectors on many benchmarks, not just one.
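Spelled out, the arithmetic behind these figures (reusing the 20 min per run from the example, with the final conversion rounded) is:

$$
4 \times 16 \times 8 \times 5 = 2560 \ \text{runs},
\qquad 2560 \times 20\,\text{min} = 51{,}200\,\text{min} \approx 36\,\text{days}.
$$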
Moreover, time-limited workloads do not hold work constant, which makes results harder to reproduce and the load placed on the JIT compiler and the garbage collector harder to control. Cycle-accurate simulation, which slows execution
down by orders of magnitude, further amplifies the need for
tractability. We therefore provide work-limited benchmarks
with three input sizes: small, default, and large. For some of
the benchmarks, large and default are the same. The large inputs typically executed in around a minute on circa-2006 commodity high-performance architectures.
We make simplicity our priority for packaging; we ship the
suite as a single self-contained Java jar file. The file contains
all benchmarks, a harness, input data, and checksums for
correctness. The harness checksums the output of each iteration and compares it to a stored value. If the values do not
match, the benchmark fails. We provide extensive configuration options: the number of iterations, running to convergence under customized convergence criteria, and callback hooks before and after every iteration.
For example, the user-defined callbacks can turn hardware
performance counters on and off, or switch a simulator in
and out of detailed simulation mode. We use these features
extensively and are heartened to see others using them.¹²
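The sketch below shows the shape of such a hook using a hypothetical HarnessCallback interface and a stand-in PerfCounters class; the actual DaCapo harness exposes its own Callback class whose method names and signatures differ across releases, so treat this as an illustration of the mechanism rather than the real API.

// Illustration of a per-iteration callback that brackets each benchmark
// iteration with measurement on/off actions, e.g., toggling hardware
// performance counters or a simulator's detailed mode.
public class CounterCallbackSketch {
    // Hypothetical hook interface, invoked around every benchmark iteration.
    interface HarnessCallback {
        void beforeIteration(String benchmark, int iteration);
        void afterIteration(String benchmark, int iteration);
    }

    // Stand-in for whatever mechanism starts and stops the measurement.
    static class PerfCounters {
        static void start() { System.out.println("counters on"); }
        static void stop()  { System.out.println("counters off"); }
    }

    static final HarnessCallback CALLBACK = new HarnessCallback() {
        @Override public void beforeIteration(String benchmark, int iteration) {
            PerfCounters.start();  // measure only while the iteration runs
        }
        @Override public void afterIteration(String benchmark, int iteration) {
            PerfCounters.stop();
        }
    };

    public static void main(String[] args) {
        // Simulate the harness driving three iterations of one benchmark.
        for (int i = 1; i <= 3; i++) {
            CALLBACK.beforeIteration("eclipse", i);
            // ... the benchmark iteration would run here ...
            CALLBACK.afterIteration("eclipse", i);
        }
    }
}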
For standardization and analytical clarity, our benchmarks
require only a single host and we avoid components that
require user configuration. By contrast, SPECjAppServer, which models real-world application servers, requires multiple hosts and depends on third-party components that the user must configure, such as a database. Here we traded some relevance
for control and analytical clarity.
We provide a separate “source” jar to build the entire
suite from scratch. For licensing reasons, the source jar automatically downloads the Java code from the licensor. With
assistance from our users,⁵ our packaging now facilitates static whole-program analysis, which is not required for
standard Java implementations. Since the entire suite and
harness are open-source, we happily accept contributions
from our users.
2.3. The researcher
Appropriate workload selection is a task for the community, consortia, the workload designer, and the researcher.
Researchers make a workload selection, either implicitly or
explicitly, when they conduct an experiment. This selection
is often automatic: “Let’s use the same thing we used last
time!” Since researchers invest heavily in their evaluation
methodology and infrastructure, this path offers the least
resistance. Instead, we need to identify the workloads and
methodologies that best serve the research evaluation. If