Viewpoint
DOI: 10.1145/2658987

The Real Software Crisis: Repeatability as a Core Value

Sharing experiences running artifact evaluation committees for five major conferences.
Where is the software in programming language research? In our field, software artifacts play a central role: they are the embodiments of our ideas and contributions. Yet when we publish, we are evaluated on our ability to describe those artifacts informally in prose. Often, such prose gives only a partial, and sometimes overly rosy, view of the work. This is especially so when the object of discourse is made up of tens of thousands of lines of code that interact in subtle ways with different parts of the software and hardware stack on which it is deployed. Over the years, our community's culture has evolved to value originality above everything else, and our main conferences and journals^a deny software its rightful place.

Science advances faster when we can build on existing results, and when new ideas can easily be measured against the state of the art. This is exceedingly difficult in an environment that does not reward the production of reusable software artifacts. Our goal is to get to the point where any published idea that has been evaluated, measured, or benchmarked is accompanied by the artifact that embodies it. Just as formal results are increasingly expected to come with mechanized proofs, empirical results should come with code.

Conversations about this topic inevitably get mired in discussions of reproducibility, which is the act of creating a fresh system from first principles to duplicate an existing result under different experimental conditions. Reproducibility is an expensive undertaking, and not something we are advocating. We are after repeatability, which is simply the act of checking the claims made in the paper, usually, but not only, by re-running a bundled software artifact. Repeatability is an inexpensive and easy test of a paper's artifacts, and it clarifies the scientific contribution of the paper. We believe repeatability should become a standard feature of the dissemination of research results.

Of course, not all results are repeatable, but many are. Researchers cannot be expected to develop industrial-quality software; there will always be a difference between research prototypes and production software. It is therefore important to set the right standard. We argue the right measure is not some absolute notion of quality, but rather how the artifact stacks up against the expectations set by the paper. Also, clearly, not all papers need artifacts. Even in software conferences, some papers contain valuable theoretical results or profound observations that do not lend themselves to artifacts. These papers should, of course, remain welcome. But if a paper makes, or implies, claims that require software, those claims should be backed up. In short, a paper should not mislead readers: if an idea has not been evaluated, this should be made clear, both so program committees can judge the paper on its actual merits, and to allow subsequent authors to get credit for performing a rigorous empirical evaluation of the paper's ideas. Lastly, artifacts can include data sets, proofs, and any other by-product of the research process.

a Our central argument applies just as well, and perhaps even more strongly, to journals. However, we do not have experience creating an artifact evaluation process for journals; we also imagine that some journals might be concerned that their submission rate is sufficiently low that further obstacles would be unwelcome, though this is a weak argument for not performing a more thorough review.