Synchronization. Given the difficulty of determining
which iteration depends on which in non-numerical programs, compilers typically make the conservative assumption that an iteration depends on all of its predecessor
iterations. Therefore, a core cannot execute sequential code
until it is unblocked by its predecessor. 5, 20 Moreover, an iteration unblocks its successor only if both it and its predecessors have executed this sequential segment or if they are not
going to. This execution model leads to a chain of signal
propagation across loop iterations that includes unnecessary synchronization: even if an iteration is not going to execute sequential code, it still needs to synchronize with its
predecessor before unblocking its successor.
HELIX-RC removes this synchronization overhead by enabling an iteration to detect the readiness of all predecessor iterations, not just one. Therefore, once an iteration forgoes executing the sequential segment, it immediately notifies its successor without waiting for its predecessor. Unfortunately, while HELIX-RC removes unnecessary synchronization, it increases the number of signals that can be in flight simultaneously.
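For concreteness, here is a minimal sketch of that relaxed scheme, written as a C11/pthreads model of our own (the names and the spin-waiting are illustrative; HELIX-RC implements this in hardware via the ring cache): an iteration that skips the sequential segment declares its readiness immediately, and an iteration that executes the segment first checks the readiness of all of its predecessors.

    /* Illustrative model, not HELIX-RC code: one thread per iteration. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ITERS 8

    /* ready[i] == 1 once iteration i has run the segment or skipped it. */
    static atomic_int ready[ITERS];
    static const bool takes_path[ITERS] = {true, false, true, false,
                                           false, true, false, false};

    static void run_sequential_segment(int i)
    {
        printf("iter %d: sequential segment\n", i);
    }

    static void *iteration(void *arg)
    {
        int i = (int)(intptr_t)arg;
        if (!takes_path[i]) {
            /* Skipping iteration: declare readiness at once, without
               first synchronizing with its predecessor. */
            atomic_store(&ready[i], 1);
            return NULL;
        }
        /* Segment-taking iteration: wait until every predecessor has
           finished the segment or declared it will skip it. */
        for (int p = 0; p < i; p++)
            while (atomic_load(&ready[p]) == 0)
                ; /* spin; hardware would block or use the ring cache */
        run_sequential_segment(i);
        atomic_store(&ready[i], 1);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[ITERS];
        for (intptr_t i = 0; i < ITERS; i++)
            pthread_create(&t[i], NULL, iteration, (void *)i);
        for (int i = 0; i < ITERS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }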
HELIX-RC relies on the signal instruction to handle synchronization signals efficiently. Synchronization between a producer core and a consumer core comprises generation of a signal by the producer, a request for that signal by the consumer, and transmission of the signal between the two. On a conventional multicore processor that relies on a demand-driven memory system for communication, signal transmission is inherently lazy, and signal request and transmission become serialized. With HELIX-RC, on the other hand, signal instructs the ring cache to proactively forward the signal to all other nodes in the ring without interrupting any of the cores, thereby decoupling signal transmission from synchronization.
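A back-of-the-envelope model, using assumptions of our own rather than figures from the paper, shows why proactive forwarding shortens the critical path on a unidirectional ring:

    #include <stdio.h>

    #define N 16  /* number of ring nodes; illustrative */

    /* Hops from node `from` to node `to` around a unidirectional ring. */
    static int hops(int from, int to) { return (to - from + N) % N; }

    int main(void)
    {
        int producer = 0, consumer = 5;

        /* Demand-driven: the consumer's request travels to the producer,
           then the signal travels back, strictly in sequence. */
        int lazy = hops(consumer, producer) + hops(producer, consumer);

        /* Proactive: the signal begins circulating when generated; if it
           arrives before the consumer's wait executes, zero hops remain
           on the critical path. */
        int eager_worst = hops(producer, consumer);

        printf("demand-driven critical path: %d hops\n", lazy);
        printf("proactive critical path:     0 to %d hops\n", eager_worst);
        return 0;
    }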
Code example. The code in Figure 3(a), abstracted for
clarity, represents a small hot loop from 175.vpr of SPEC
CINT2000. It is responsible for 55% of the total execution
time of that program. The loop body has two execution paths.
The left path entails a loop-carried data dependence because
during a typical loop iteration, instruction 1 uses the value
of variable a produced by a previous iteration. The right path
does not depend on prior data. Owing to complex control
flow, the compiler cannot predict the execution path taken
during a particular iteration, so it must assume that instruction 1 may depend on the previous iteration.
In a conventional implementation coupling communication with computation, the compiler would add wait 1
and signal 1 instructions to the right path, as shown in
Figure 3(a), to synchronize each iteration with its predecessor and successor iterations. If shared values and signals
were communicated on demand, the resulting sequential signal chain would be the one highlighted in red in Figure 3(b). If we assume that only iterations 0 and 2, running on cores 0 and 2, respectively, take the left path and execute instruction 1, then the sequential signal chain shown in Figure 3(b) is unnecessarily long: iteration 1 executes only parallel code, so its wait instruction is unnecessary and merely stalls the signal. Iterations 0 and 2, in order to update a, must load the value of a produced by the most recent prior iteration that executed instruction 1.
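Figure 3(a) itself is not reproduced here; the sketch below is our own rendering consistent with the description, with hypothetical helix_wait/helix_signal stand-ins for the wait 1 and signal 1 instructions and a made-up condition in place of 175.vpr's actual control flow:

    #include <stdio.h>

    /* Hypothetical stand-ins for the wait/signal instructions; stubbed
       as no-ops so the sketch runs single-threaded. */
    static void helix_wait(int segment)   { (void)segment; }
    static void helix_signal(int segment) { (void)segment; }

    static int a;  /* loop-carried; HCCv3 would map it to shared memory */

    /* Models the unpredictable control flow of the hot loop. */
    static int takes_left_path(int i) { return (i * 2654435761u) % 3 == 0; }

    void hot_loop(int n)
    {
        for (int i = 0; i < n; i++) {
            if (takes_left_path(i)) {
                helix_wait(1);   /* sequential segment 1 begins */
                a = a + i;       /* "instruction 1": uses prior a */
                helix_signal(1);
            } else {
                helix_wait(1);   /* conservative scheme: keeps the  */
                helix_signal(1); /* signal chain intact, no real work */
            }
        }
    }

    int main(void)
    {
        hot_loop(10);
        printf("a = %d\n", a);
        return 0;
    }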
Compiler. HCCv3 automatically generates parallel
threads from sequential programs by distributing successive loop iterations across adjacent cores organized as a unidirectional ring within a single multicore processor. HCCv3
parallelizes the loops most likely to improve performance when their iterations execute in parallel. Only one loop runs in parallel at a time.
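Concretely, the distribution can be pictured as a round-robin assignment of successive iterations to the cores on the ring; the sketch below is illustrative only (NUM_CORES and the iteration count are arbitrary):

    #include <stdio.h>

    #define NUM_CORES 4

    int main(void)
    {
        int total_iters = 12;
        /* Core c runs iterations c, c + NUM_CORES, c + 2*NUM_CORES, ...
           and signals the adjacent core on the unidirectional ring. */
        for (int c = 0; c < NUM_CORES; c++) {
            printf("core %d:", c);
            for (int i = c; i < total_iters; i += NUM_CORES)
                printf(" iter %d", i);
            printf("  (successor: core %d)\n", (c + 1) % NUM_CORES);
        }
        return 0;
    }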
To preserve the sequential semantics of the original loop,
the code that implements a loop-carried data dependence,
that is, one spanning loop iterations, must run in a sequential segment whose instances in parallel threads execute in
iteration order. Variables and other data structures involved
in such dependences—even those normally allocated to registers in sequential code—are mapped to specially allocated
memory locations shared between cores. HCCv3 guarantees
that accesses to those shared memory locations always
occur within sequential segments.
ISA. We introduce a pair of instructions—wait and signal—that mark the beginning and end of a sequential segment. Each has an integer operand that identifies the
particular sequential segment. A wait 3 instruction, for
example, blocks execution of the core that issues it until all
other cores running earlier iterations have finished executing the sequential segment labeled 3, which they signify by
executing signal 3. Figure 2 shows a sequential segment
with label A being executed by the core attached to the leftmost ring node. Between wait A and signal A, a store
instruction sends the new value 5 for the shared location at
address 0x00A to the ring node for caching and circulation
to its successor nodes. The signal A instruction that ends
the segment also signals subsequent nodes that the value
generated by segment A is ready.
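Schematically, the segment of Figure 2 reduces to three instructions. The assembly syntax below is illustrative only; the paper specifies the wait and signal mnemonics and their integer operands, not a concrete encoding:

    wait   A           ; block until all earlier iterations have finished
                       ; sequential segment A
    store  [0x00A], 5  ; new value cached at the local ring node and
                       ; circulated to successor nodes
    signal A           ; notify successors that segment A's value is ready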
A core forwards all memory accesses within sequential
segments to its local ring node. All other memory accesses
(not within a sequential segment) go through the core's private L1 cache as usual.
Memory. Each ring node has a cache array that satisfies
both loads and stores received from its attached core during
a sequential segment. HELIX-RC does not require other
changes to the existing memory hierarchy because the ring
cache orchestrates interactions with it. To avoid any changes
to conventional cache coherence protocols, the ring cache
permanently maps each memory address to a unique ring
node. All accesses from the distributed ring cache to the
next cache level (L1) go through the associated node for a given address.
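The paper does not specify the mapping function; one plausible choice, sketched below, interleaves addresses across nodes at cache-line granularity so that every line has exactly one owning node:

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_NODES 16  /* illustrative ring size */
    #define LINE_BITS 6   /* 64-byte cache lines, an assumption */

    /* Hypothetical permanent address-to-node mapping. */
    static unsigned node_for_address(uintptr_t addr)
    {
        return (addr >> LINE_BITS) % NUM_NODES;
    }

    int main(void)
    {
        uintptr_t a = 0x00A;  /* the shared address from Figure 2 */
        printf("address 0x%03lx -> node %u\n",
               (unsigned long)a, node_for_address(a));
        return 0;
    }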
3.2. Overlapping communication with computation
Because shared values produced by a sequential segment
and the signal that marks its end are propagated through
the ring node as soon as they are generated, this communication between iterations is decoupled from computation
taking place on the cores.
Shared data communication. Once a ring node receives a
store, it records the new value and proactively forwards its
address and value to an adjacent node in the ring cache, all
without interrupting the execution of the attached core. The
value then propagates from node to node through the rest of
the ring without interrupting the computation of any core.