in methodology, how can we confidently evaluate these
approaches? The community must take responsibility for
methodology. For example, many Java evaluations still use
SPECjvm98, which is badly out of date. Out-of-date benchmarks are problematic because they pose last year’s problems and can lead to different conclusions.17 To ensure a
solid foundation for future innovation, the community
must make continuous and substantial investments. Establishing community standards and sustaining these investments require open software infrastructures containing the consequent artifacts.
For our part, we developed a new benchmark suite and
new methodologies. We estimate that we have spent 10,000
person-hours to date developing the DaCapo suite and associated infrastructure, none of it directly funded. Such a major undertaking would be impossible without a large number
of contributing institutions and individuals. Just as NSF and
DARPA have invested in networking infrastructure to foster
the past and future generations of the Internet, our community needs foundational investment in methodological infrastructure to build next-generation applications, software
systems, and architectures. Without this investment, what
will be the cost to researchers, industry, and society in lost
opportunities?
2. WORKLOAD DESIGN AND USE
The DaCapo research group embarked on building a Java
benchmark suite in 2003 after we highlighted the dearth
of realistic Java benchmarks to an NSF review panel. The
panel suggested we solve our own problem, but our grant
was for dynamic optimizations. NSF did not provide additional funds for benchmark development, but we forged
ahead regardless. The standard workloads at the time,
SPECjvm98 and SPECjbb2000,14,15 were out of date. For example, SPECjvm98 and SPECjbb2000 make meager use of
Java language features, and SPECjvm98 has a tiny code and
memory footprint. (SPEC measurements are in a technical
report3.) We therefore set out to create a suite suitable for research, a goal that adds new requirements beyond SPEC’s
goal of product comparisons. Our goals were:
Relevant and diverse workload: A diverse, widely used set
of nontrivial applications that provide a compelling platform for innovation.
Suitable for research: A controlled, tractable workload
amenable to analysis and experiments.
We selected the following benchmarks for the initial release
of the DaCapo suite, based on criteria described below.
antlr Parser generator and translator generator
bloat Java bytecode-level optimization and analysis tool
chart Graph-plotting toolkit and PDF renderer
eclipse Integrated development environment (IDE)
fop Output-device-independent print formatter
hsqldb SQL relational database engine written in Java
jython Python interpreter written in Java
luindex Text-indexing tool
lusearch Text-search tool
pmd Source code analyzer for Java
xalan XSLT transformer for XML documents
2.1. Relevance and diversity
No workload is definitive, but a narrow scope makes it possible to attain some coverage. We limited the DaCapo suite
to nontrivial, actively maintained real-world Java applications. We solicited and collected candidate applications.
Because source code supports research, we considered only
open-source applications. We first packaged candidates
into a prototype DaCapo harness and tuned them with inputs that produced tractable execution times suitable for experimentation, that is, around a minute on 2006 commodity
hardware. Section 2.2 describes how the DaCapo packaging
provides tractability and standardization.
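To make the harness's role concrete, the sketch below shows, in ordinary Java, what such packaging amounts to: a fixed input, a timed iteration, and a single place to report results. The class and method names (SimpleHarness, runOnce) and the toy workload are illustrative assumptions only, not the actual DaCapo harness API.

import java.util.concurrent.Callable;

// Minimal harness sketch: run one timed iteration of a workload over a
// fixed input and report elapsed time. Names are illustrative only.
public class SimpleHarness {
    static <T> T runOnce(String name, Callable<T> workload) throws Exception {
        long start = System.nanoTime();
        T result = workload.call();                      // the benchmark's work
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(name + ": " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload; a real benchmark would process its bundled data set.
        runOnce("toy-workload", () -> {
            long checksum = 0;
            for (int i = 0; i < 10_000_000; i++) {
                checksum += Integer.toBinaryString(i).length();
            }
            return checksum;
        });
    }
}

The point of the real harness is this same fixed-input, repeatable-timing discipline, applied uniformly across all of the benchmarks above.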
We then quantitatively and qualitatively evaluated each
candidate. Table 1 lists the static and dynamic metrics we
used to ensure that the benchmarks were relevant and diverse. Our original paper4 presents the DaCapo metric data
and our companion technical report3 adds SPECjvm98 and
SPECjbb2000. We compared against SPEC as a reference point
and compared candidates with each other to ensure diversity.
We used new and standard metrics. Our standard metrics included the static CK metrics, which measure code
complexity of object-oriented programs6; dynamic heap
composition graphs, which measure time-varying lifetime
properties of the heap16; and architectural characteristics
such as branch misprediction rates and instruction mix.
We introduced new metrics to capture domain-specific
characteristics of Java such as allocation rate, ratio of
Table 1: Quantitative selection metrics.

Metric               Description

Code metrics
CK metrics6          Object-oriented programming metrics measuring source code complexity
Code size            Numbers of classes loaded, methods declared, total bytecodes compiled
Code footprint       Instruction cache and I-TLB misses
Optimization         Number of methods compiled, number optimized, percentage hot

Heap metrics
Allocation           Total bytes/objects allocated, average object size
Heap footprint       Maximum live bytes/objects, nursery survival rate
Fan-out/fan-in       Mean incoming and outgoing pointers per object
Pointer distance     Mean distance in bytes of each pointer encountered in a snapshot traversal of an age-ordered heap
Mutation distance    Mean distance in bytes of each pointer dynamically created/mutated by the application in an age-ordered heap

Architecture metrics
Instruction mix      Mix of branches, ALU, and memory instructions
Branches             Branch mispredictions per instruction for PMM predictor
Register dependence  Register dependence distances
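As a concrete illustration of one heap metric from Table 1, the following sketch computes mean fan-out, the average number of outgoing pointers per object, from a heap snapshot represented as an adjacency map. The snapshot representation and the names (FanOutMetric, meanFanOut) are assumptions made for illustration; the DaCapo measurements themselves were gathered inside the virtual machine.

import java.util.List;
import java.util.Map;

// Sketch: mean fan-out over a heap snapshot, i.e., the average number of
// outgoing pointers per object. The Map-based snapshot is an illustrative
// stand-in for data a JVM-level heap tracer would collect.
public class FanOutMetric {
    static double meanFanOut(Map<Long, List<Long>> snapshot) {
        if (snapshot.isEmpty()) return 0.0;
        long pointers = 0;
        for (List<Long> outgoing : snapshot.values()) {
            pointers += outgoing.size();     // outgoing references of one object
        }
        return (double) pointers / snapshot.size();
    }

    public static void main(String[] args) {
        // Tiny example: object 1 points to 2 and 3; object 2 points to 3; 3 points nowhere.
        Map<Long, List<Long>> snapshot = Map.of(
            1L, List.of(2L, 3L),
            2L, List.of(3L),
            3L, List.of());
        System.out.println("mean fan-out = " + meanFanOut(snapshot));  // prints 1.0
    }
}

Fan-in, pointer distance, and mutation distance follow the same pattern, summarizing a per-object or per-pointer quantity over a heap snapshot or a trace.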