The artifact evaluation process. Several ACM SIGPLAN conferences (OOPSLA, PLDI, and POPL) and closely related conferences (SAS, ECOOP, and ESEC/FSE) have begun an experiment intended to move in the direction outlined here. They have initiated an artifact evaluation process that allows authors of accepted papers to submit software as well as many kinds of non-software entities (such as data sets, test suites, and models) that might back up their results.b Since 2011 we have run, or helped with, six artifact evaluation committees (AECs). The results so far are encouraging. In 2011, the ESEC/FSE conference had 14 artifact submissions (for 34 accepted papers), and seven of those met or exceeded expectations. In 2013, at ECOOP, nine out of 13 artifacts were found to meet expectations. The same year, ESEC/FSE saw a big jump in artifact submissions, with 22 artifacts, of which 12 were validated. At SAS, 11 out of 23 papers had artifacts. The 2014 OOPSLA conference had 21 artifacts out of 50 accepted papers, and all but three were judged adequate. In 2014, all the preceding conferences had an artifact evaluation process.

What are the mechanics of artifact evaluation? The design of the first artifact evaluation process (conducted by the first author with Carlo Ghezzic) involved discussions with leaders of the software engineering community, and it met with more resistance than expected. There was concern that introducing artifact evaluation into the decision-making process would be an abrupt and significant cultural change. As a result, we erected a strict separation between paper acceptance and artifact evaluation in the simplest possible way: a temporal barrier. Only accepted papers could be submitted for evaluation, and their acceptance status was guaranteed to remain unchanged. This was a necessary compromise to get the process approved at all. In time, it is conceivable that artifact evaluation will become a part of the evaluation of most scientific results.

Initially, we judged artifacts on a five-point scale, with crisp, declarative sentences (inspired by Identify the Champion,d which many evaluators are already familiar with) accompanying each level:

˲ The artifact greatly exceeds the expectations set by the paper.
˲ The artifact exceeds the expectations set by the paper.
˲ The artifact meets the expectations set by the paper.
˲ The artifact falls below the expectations set by the paper.
˲ The artifact falls greatly below the expectations set by the paper.

Over time we have come to think this is too fine-grained, and we have settled for the simpler criterion of whether the artifact passes muster or not. Here, "expectations" is interpreted as the claims made in the paper. For instance, if a paper claimed the implementation of a new compiler for the Java programming language, it would be reasonable for the evaluators to expect the artifact to be able to process an arbitrary Java program; on the other hand, if the paper claimed only a subset of the language, say "all loop-free Java programs," then evaluators would have to lower their expectations accordingly.

In addition to "running" the artifact, the evaluators must read the paper and try to tweak provided inputs or create new ones, to test the limits of the artifact. The amount of effort to be invested is intended to be comparable to the time reviewers spend evaluating a paper; in practice, evaluators have reported spending between one and two days per artifact. Just as when reading a paper, the goal is not to render a definitive judgment but rather to provide a best-effort expert opinion.

Who should evaluate artifacts? Some have argued that evaluating artifacts is a job for the conference program committee itself. However, we believe this sits at odds with the reality of scientific reviewing. Due to high submission volumes, program
b For pragmatic and social reasons, artifact evaluation is limited to accepted papers. Integrating artifact evaluation with paper reviewing was felt to be risky, as the standards of what constitutes a valid artifact are still evolving. From a practical perspective, the effort of evaluating a large number of artifacts would overwhelm the committee. On average, an artifact takes a day and a half to evaluate by each of the three evaluators. The process would be difficult to scale to hundreds of submissions.