speculative values of the variables—
namely, the values before Thread 0’s
updates. When Thread 0’s chunk commits, the hardware sends signature W0
to Thread 1, and W0 and R0 are cleared.
At the processor where Thread 1 runs,
the hardware intersects W0 with the
ongoing chunk’s R1 and W1. Since W0
∩ R1 is not null, the chunk in Thread
1 is squashed and must re-execute.
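The signature mechanism just described can be sketched in software as a Bloom filter. This is an illustrative model only; the hash function, signature width, and variable names are assumptions, not the actual Bulk hardware encoding:

```python
# Sketch of signature-based conflict detection (hypothetical model).
# A signature is a Bloom filter: a small bit vector into which each
# accessed address is hashed. Intersection is a bitwise AND; a non-zero
# result means a *possible* overlap (Bloom filters can give false
# positives but never false negatives, so no real conflict is missed).

SIG_BITS = 64  # assumed signature width, for illustration

def sig(addresses):
    """Hash-encode a set of addresses into a fixed-size bit vector."""
    s = 0
    for a in addresses:
        s |= 1 << (hash(a) % SIG_BITS)
    return s

def intersects(s1, s2):
    return (s1 & s2) != 0

# Thread 0's chunk wrote B and C; Thread 1's ongoing chunk read B and C.
W0 = sig(["B", "C"])
R1 = sig(["B", "C"])
W1 = sig(["T"])

# When Thread 0's chunk commits, Thread 1's processor checks:
must_squash = intersects(W0, R1) or intersects(W0, W1)
print(must_squash)  # True: Thread 1's chunk is squashed and re-executed
```

Because signatures are fixed-size regardless of how many addresses a chunk touches, this check is cheap and sits outside the processor core.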
The commit of chunks is serialized globally. In a bus-based machine,
serialization is given by the order in
which W signatures are placed on the
bus. With a general interconnect, serialization is enforced by a (potentially
distributed) arbiter module.2 W signatures are sent to the arbiter, which
quickly acknowledges whether the
chunk can be considered committed.
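To make the arbiter’s role concrete, the following is a toy software model of a centralized commit arbiter. The class and method names are hypothetical, and a real design may be distributed and far more elaborate:

```python
import threading

class CommitArbiter:
    """Toy model of a centralized commit arbiter (illustrative only).

    Processors send the W signature of a finishing chunk; the arbiter
    serializes commits by handling one request at a time and assigns
    each committed chunk a position in the global commit order.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._commit_order = []

    def request_commit(self, cpu_id, w_sig):
        # Taking the lock serializes all commit requests globally,
        # playing the role that bus order plays in a bus-based machine.
        with self._lock:
            self._commit_order.append((cpu_id, w_sig))
            return len(self._commit_order)  # global commit position

arbiter = CommitArbiter()
t0 = arbiter.request_commit(0, 0b1010)
t1 = arbiter.request_commit(1, 0b0110)
print(t0, t1)  # 1 2: chunk commits receive a total order
```

The point of the sketch is only that every commit passes through one serialization point, which is what gives chunks a single global order.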
Since chunks execute atomically
and in isolation, commit in program
order in each processor, and there is
a global commit order of chunks, the
Bulk Multicore supports sequential
consistency (SC)9 at the chunk level.
As a consequence, the machine also
supports SC at the instruction level.
More important, it supports high-performance SC at low hardware complexity.
The performance of this SC implementation is high because (within
a chunk) the Bulk Multicore allows
memory access reordering and overlap and instruction optimization. As
we discuss later, synchronization instructions induce no reordering constraint within a chunk.
Meanwhile, hardware-implementation complexity is low because memory-consistency enforcement is largely
decoupled from processor structures.
In a conventional processor that issues memory accesses out of order,
supporting SC requires intrusive processor modifications. For example,
from the time the processor executes
a load to line L out of order until the
load reaches its commit time, the
hardware must check for writes to L
by other processors—in case an inconsistent state was observed. Such
checking typically requires sending,
for each external coherence event, a
signal up the cache hierarchy. The signal snoops the load queue to check for
an address match. Additional modifications involve preventing cache displacements that could risk missing a
Figure 1. A simple way to generate a signature.

Figure 2. Executing chunks atomically and in isolation with signatures. In the example, W0 = sig(B,C), R0 = sig(X,Y), W1 = sig(T), R1 = sig(B,C); the squash check is (W0 ∩ R1) ∨ (W0 ∩ W1).
coherence event. Consequently, load
queues, L1 caches, and other critical
processor components must be augmented with extra hardware.
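The load-queue snooping just described can be modeled roughly as follows. This is a simplified illustrative sketch with invented names; real hardware performs this check with associative search logic, not software:

```python
# Sketch of the conventional-SC check: an out-of-order processor must
# watch for external writes to any line it has speculatively loaded.

class LoadQueue:
    def __init__(self):
        self.inflight = []  # lines loaded out of order, not yet committed

    def record_load(self, line):
        self.inflight.append(line)

    def on_external_write(self, line):
        """Coherence event: another processor wrote `line`.

        If the line matches an in-flight load, that load may have
        observed an inconsistent value and must be replayed.
        """
        return line in self.inflight  # True => squash and replay

lq = LoadQueue()
lq.record_load("L")
print(lq.on_external_write("L"))  # True: replay needed
print(lq.on_external_write("M"))  # False: no match, proceed
```

The contrast with the Bulk Multicore is that this per-load, per-coherence-event check lives inside timing-critical core structures, whereas signature intersection happens once per chunk, outside the core.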
In the Bulk Multicore, SC enforcement and violation detection are performed with simple signature intersections outside the processor core.
Additionally, caches are oblivious to
what data is speculative, and their tag
and data arrays are unmodified.
Finally, note that the Bulk Multicore’s execution mode is not like
transactional memory.6 While one
could intuitively view the Bulk Multicore as an environment with transactions occurring all the time, the key
difference is that chunks are dynamic
entities, rather than static, and invisible to the software.
Since chunked execution is invisible
to the software, it places no restriction
on programming model, language,
or runtime system. However, it does
enable a highly programmable environment by virtue of providing two
features: high-performance SC at the
hardware level and several novel hardware primitives that can be used to
build a sophisticated program-development-and-debugging environment.
Unlike current architectures, the
Bulk Multicore supports high-performance SC at the hardware level.
If we generate code for the Bulk Multicore using an SC compiler (such as
the BulkCompiler1), we attain a high-performance, fully SC platform. The
resulting platform is highly programmable for several reasons. The first is
that debugging concurrent programs
with data races would be much easier.
This is because the possible outcomes
of the memory accesses involved in
the bug would be easier to reason
about, and the debugger would in
fact be able to reproduce the buggy
interleaving. Second, most existing