The third column of Table 1 records the number of
lines of specification (and lines of custom predicate code)
added to each benchmark. Overall, the specification burden is quite small. Indeed, for the majority of the programs,
an author was able to add deterministic assertions in only
5 to 10minutes per benchmark, despite being unfamiliar
with the code. In particular, it was typically not difficult to
both identify regions of code performing parallel computation and to determine from documentation, comments, or
source code which results were intended to be deterministic. Figure 2 shows the assertions added to the mandelbrot
benchmark.
The added assertions were correct on the first attempt for
all but two benchmarks. For phylogeny, the resulting phylogenetic tree was erroneously specified as deterministic,
when, in fact, there are many correct optimal trees. The specification was modified to assert only that the optimal score
must be deterministic. For sparsematmult, we incorrectly
identified the variable to which the output was written. This
error was identified during later work on automatically inferring deterministic specifications. 6
The two predicates provided by our assertion library were
sufficient for all but one of the benchmarks. For the JGF
montecarlo benchmark, the authors had to write a custom
equals and hashCode method for two classes— 34 total
lines of code—in order to assume and assert that two sets,
one of initial tasks and one of results, are equivalent across
executions.
Discussion: More experience, or possibly user studies, would
be needed to conclude decisively that our assertions are easier to use than existing techniques for specifying that parallel
code is correctly deterministic. However, we believe our experience is quite promising. In particular, writing assertions for
the full functional correctness of the parallel regions of these
programs seemed to be quite difficult, perhaps requiring
implementing a sequential version of the code and asserting
that it produces the same result. Further, there seemed to be
no obvious simpler, traditional assertions that would aid in
catching nondeterministic parallelism.
Despite these difficulties, we found that specifying the
natural deterministic behavior of the benchmarks with our
bridge assertions required little effort.
5. 2. effectiveness
To evaluate the utility of our deterministic specifications in
finding true parallelism bugs, we used a modified version
of the CalFuzzer14, 16 tool to find real races in the benchmark programs, both data races and higher level races (such
as races to acquire a lock). For each such race, we ran 10
trials using CalFuzzer to create real executions with these
races and to randomly resolve the races (i.e., randomly pick
a thread to “win”). We turned on run-time checking of our
deterministic assertions for these trials, and recorded all
found violations.
Table 1 summarizes the results of these experiments. For
each benchmark, we indicate the number of real data races
and higher-level races we observed. Further, we indicate
how many of these races led to determinism violations in
any execution.
In these experiments, the primary computational cost
is from CalFuzzer, which typically has an overhead in the
range of 2x–20x for these kinds of compute bound applications. We have not carefully measured the computational
cost of our deterministic assertion library. For most benchmarks, however, the cost of serializing and comparing a
computation’s inputs and outputs is dwarfed by the cost of
the computation itself—e.g., consider the cost of checking
that two fractal images are identical versus the cost of computing each fractal in the first place.
Determinism Violations: We found two cases of nondeterministic behavior. First, a known data race in the raytracer
benchmark, due the use of the wrong lock to protect a shared
sum, can yield an incorrect final answer.
Second, the pi benchmark can yield a nondeterministic
answer given the same random seed because of insufficient
synchronization of a shared random number generator.
In each Monte Carlo sample, two successive calls to java.
util.Random.nextDouble() are made. A context switch
between these calls changes the set of samples generated.
Similarly, nextDouble() itself makes two calls to java.
util.Random.next(), which atomically generates up to
32 pseudorandom bits. A context switch between these two
calls changes the generated sequence of pseudorandom
doubles. Thus, although java.util.Random.next-Double()is thread-safe and free of data races, scheduling
nondeterminism can still lead to a nondeterministic result.
(This behavior is known—the PJ Library provides several
versions of this benchmark, one of which does guarantee a
deterministic result for any given random seed.)
Benign races: The high number of real data races for these
benchmarks is largely due to benign races on volatile variables
used for synchronization—e.g., to implement a tournament
barrier or a custom lock. Although CalFuzzer does not understand these sophisticated synchronization schemes, our deterministic assertions automatically provide some confidence
that these races are benign because, over the course of many
experimental runs, they did not lead to nondeterministic final
results.
Note that it can be quite challenging to verify by hand that
these races are benign. On inspecting the benchmark code
and these data races, an author several times believed he
had found a synchronization bug. But on deeper inspection,
the code was found to be correct in all such cases.
The number of high-level races is low for the JGF benchmarks because all the benchmarks except montecarlo
exclusively use volatile variables (and thread joins) for synchronization. Thus, all observable scheduling nondeterminism is due to data races.
The number of high-level races is low for the PJ benchmarks because they primarily use a combination of volatile variables and atomic compare-and-set operations for
synchronization. Currently, the only kind of high-level
race our modified CalFuzzer recognizes is a lock race.
Thus, while we believe there are many (benign) races
in the ordering of these compare-and-set operations,
CalFuzzer does not report them. The one high-level race
for pi, indicated in the table and described above, was
confirmed by hand.