software correctness tools (such as
Microsoft’s CHESS14) assume SC. Verifying software correctness under SC is
already difficult, and the state space
balloons if non-SC interleavings need
to be verified as well. In the next few
years, we expect that correctness-veri-fication tools will play a larger role as
more parallel software is developed.
Using them in combination with an
SC platform would make them most
effective.
A final reason for the programmability of an SC platform is that it
would make the memory model of
safe languages (such as Java) easier
to understand and verify. The need to
provide safety guarantees and enable
performance at the same time has resulted in an increasingly complex and
unintuitive memory model over the
years. A high-performance SC memory model would trivially ensure Java’s
safety properties related to memory
ordering, improving its security and
usability.
The Bulk Multicore’s second feature is a set of hardware primitives
that can be used to engineer a sophisticated program-development-and-debugging environment that is always
“on,” even during production runs.
The key insight is that chunks and
signatures free development and debugging tools from having to record
or be concerned with individual loads
and stores. As a result, the amount of
bookkeeping and state required by
the tools is substantially reduced, as
is the time overhead. Here, we give
three examples of this benefit in the
areas of deterministic replay of parallel programs, data-race detection, and
high-speed disambiguation of sets of
addresses.
Note, too, that chunks provide an
excellent primitive for supporting
popular atomic-section-based techniques for programmability (such as
thread-level speculation17 and transactional memory6).
Deterministic replay of parallel programs with practically no log.
Hard-ware-assisted deterministic replay
of parallel programs is a promising
technique for debugging parallel
programs. It involves a two-step process.
20 In the recording step, while
the parallel program executes, special hardware records into a log the
the Bulk multicore
supports
high-performance
sequential memory
consistency at
low hardware
complexity.
order of data dependences observed
among the multiple threads. The log
effectively captures the “interleaving”
of the program’s threads. Then, in the
replay step, while the parallel program
is re-executed, the system enforces
the interleaving orders encoded in the
log.
In most proposals of deterministic replay schemes, the log stores individual data dependences between
threads or groups of dependences
bundled together. In the Bulk Multicore, the log must store only the total
order of chunk commits, an approach
we call DeLorean.
13 The logged information can be as minimalist as a list
of committing-processor IDs, assuming the chunking is performed in a
deterministic manner; therefore, the
chunk sizes can be deterministically
reproduced on replay. This design,
which we call OrderOnly, reduces the
log size by nearly an order of magnitude over previous proposals.
The Bulk Multicore can further reduce the log size if, during the recording step, the arbiter enforces a certain
order of chunk commit interleaving
among the different threads (such as
by committing one chunk from each
processor round robin). In this case
of enforced chunk-commit order, the
log practically disappears. During the
replay step, the arbiter reinforces the
original commit algorithm, forcing
the same order of chunk commits as
in the recording step. This design,
which we call PicoLog, typically incurs
a performance cost because it can
force some processors to wait during
recording.
Figure 3a outlines a parallel execution in which the boxes are chunks
and the arrows are the observed cross-thread data dependences. Figure 3b
shows a possible resulting execution
log in OrderOnly, while Figure 3c
shows the log in PicoLog.
Data-race detection at production-run speed. The Bulk Multicore can
support an efficient data-race detector based on the “happens-before”
method10 if it cuts the chunks at synchronization points, rather than at
arbitrary dynamic points. Synchronization points are easily recognized by
hardware or software, since synchronization operations are executed by
special instructions. This approach