novel compiler optimizations that require dynamic disambiguation of sets
of addresses (see the sidebar “Making
Signatures Visible to Software”).
Figure 3. Parallel execution in the Bulk multicore (a), with a possible
order-only execution log (b) and PicoLog execution log (c).
The Bulk Multicore also has advantages in performance and in hardware
simplicity. It delivers high performance because the processor hardware can reorder and overlap all memory accesses within a chunk—except,
of course, those that participate in
single-thread dependences. In particular, in the Bulk Multicore, synchronization instructions do not constrain
memory access reordering or overlap.
Indeed, fences inside a chunk are
transformed into null instructions.
Fences’ traditional functionality of
delaying execution until certain references are performed is useless; by
construction, no other processor observes the actual order of instruction
execution within a chunk.
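As a toy illustration of this fence elision, the following sketch models a chunk as a list of operations and simply drops the fences; the op-list encoding is an invented teaching device, not how the Bulk hardware (which does this dynamically) represents instructions:

```python
# Toy model (illustrative only): within a chunk, fence instructions can be
# discarded because no other processor ever observes the internal ordering
# of the chunk's memory accesses.

def elide_fences(chunk_ops):
    """Return the chunk's operations with intra-chunk fences removed.

    `chunk_ops` is a list of (opcode, operand) tuples; "FENCE" entries
    model memory-barrier instructions that become null operations.
    """
    return [op for op in chunk_ops if op[0] != "FENCE"]

chunk = [
    ("ST", "flag"),   # store inside the chunk
    ("FENCE", None),  # barrier a programmer or compiler inserted
    ("LD", "data"),   # load that may now freely reorder and overlap
]
print(elide_fences(chunk))  # the fence disappears from the chunk
```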
Moreover, a processor can concurrently execute multiple chunks from
the same thread, and memory accesses from these chunks can also overlap.
Each concurrently executing chunk
in the processor has its own R and W
signatures, and individual accesses
update the corresponding chunk’s
signatures. As long as chunks within
a processor commit in program order
(if a chunk is squashed, its successors are also squashed), correctness is
guaranteed. Such concurrent chunk
execution in a processor hides the chunk-commit overhead.
Bulk Multicore performance increases further if the compiler generates the chunks, as in the BulkCompiler.¹ In this case, the compiler can
aggressively optimize the code within
each chunk, recognizing that no other
processor sees intermediate states
within a chunk.
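The in-order commit rule described above (chunks from one processor commit in program order, and squashing a chunk also squashes its in-flight successors) can be sketched as follows; the `Chunk` class and its flags are a deliberately simplified assumption for illustration, not the actual pipeline state:

```python
# Minimal sketch (assumed model, not the real microarchitecture) of
# in-order chunk commit with a squash cascade.

class Chunk:
    def __init__(self, cid):
        self.cid = cid
        self.done = False      # finished executing speculatively
        self.conflict = False  # a cross-thread conflict was detected

def commit_in_order(chunks):
    """Commit finished chunks front-to-back; on a conflict, squash the
    conflicting chunk and every later (successor) chunk."""
    committed, squashed = [], []
    for i, ch in enumerate(chunks):
        if ch.conflict:
            squashed = [c.cid for c in chunks[i:]]  # squash cascade
            break
        if not ch.done:
            break  # successors must wait: commit is in program order
        committed.append(ch.cid)
    return committed, squashed

c0, c1, c2 = Chunk(0), Chunk(1), Chunk(2)
c0.done = True
c1.done, c1.conflict = True, True          # c1 saw a conflict
print(commit_in_order([c0, c1, c2]))       # ([0], [1, 2])
```

Because commit order equals program order, correctness is preserved even though the chunks executed (and their accesses overlapped) concurrently.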
Finally, the Bulk Multicore needs
simpler processor hardware than current machines. As discussed earlier,
much of the responsibility for memory-consistency enforcement is taken
away from critical structures in the
core (such as the load queue and L1
cache) and moved to the cache hierarchy, where signatures detect violations.² For example, this property could enable a new environment in which cores and accelerators are designed without concern for how to satisfy a particular set of access-ordering constraints.
Figure 4. Forming chunks for data-race detection in the presence
of a lock (a), flag (b), and barrier (c).
This ability allows hardware designers to focus on the novel
aspects of their design, rather than
on the interaction with the target machine’s legacy memory-consistency
model. It also motivates the development of commodity accelerators.
Numerous proposals for multiprocessor architecture designs focus on
improving programmability. In particular, architectures for thread-level
speculation (TLS)¹⁷ and transactional
memory (TM)⁶ have received significant attention over the past 15 years.
These techniques share key primitive
mechanisms with the Bulk Multicore,
notably speculative state buffering
and undo, and detection of cross-thread conflicts. However, they also
have a different goal, namely to simplify
code parallelization, either by parallelizing
the code transparently to the user
software (in TLS) or by annotating the
user code with constructs for mutual
exclusion (in TM). On the other hand,
the Bulk Multicore aims to provide a
broadly usable architectural platform
that is easier to program for while delivering advantages in performance
and hardware simplicity.
Two architecture proposals involve processors continuously executing blocks of instructions atomically
and in isolation. One of them, called
Transactional Memory Coherence and
Consistency (TCC),⁵ is a TM environment with transactions occurring all
the time. TCC mainly differs from the
Bulk Multicore in that its transactions