[Figure 3. Example illustrating benefits of decoupling communication from computation: (a) parallel code (a = a + 1); (b) coupled communication among Core 0, Core 1, and Core 2; (c) decoupled communication.]

With coupled communication (Figure 3(b)), each iteration must read the previous value of the shared variable first, using a regular load. So lazy forwarding of this shared data leads to data stalls, because the transfer only begins when demanded by a load, rather than when generated by a store.

In HELIX-RC, however, a wait A unblocks when all predecessor iterations have signaled that segment A is finished. That allows HCCv3 to omit the wait 1 on the right path through the loop body. That optimization, combined with HELIX-RC's proactive communication between cores, leads to the more efficient scenario shown in Figure 3(c). The sequential chain in red now only includes the delay required to satisfy the dependence—communication updating a shared value.

The decoupled execution model of HELIX-RC described so far is possible given the tight co-design of the compiler and architecture. In this section, we focus on compiler-guaranteed code properties that enable a lightweight ring cache design, and follow up with code optimizations that make use of the ring cache.

4.1. Code properties
HCCv3 generates parallelized code that satisfies the following properties:
• Only 1 loop can run in parallel at a time. Apart from a dedicated core responsible for executing code outside parallel loops, each core is either executing an iteration of the current loop or waiting for the start of the next one.
• Successive loop iterations are distributed to threads in a round-robin manner. Since each thread is pinned to a predefined core, and cores are organized in a unidirectional ring, successive iterations form a logical ring.
• Communication between cores executing a parallelized loop occurs only within sequential segments.
• Different sequential segments always access different shared data. HCCv3 only generates multiple sequential segments when there is no intersection of shared data. Consequently, instances of distinct sequential segments may run in parallel.
• At most 2 signals per sequential segment emitted by a given core can be in flight at any time. Hence, only 2 signals per segment need to be tracked by the ring cache.
This last property allows the elimination of unnecessary
wait instructions while keeping the architectural enhancement simple. Eliminating waits allows a core to execute a
later loop iteration than its successor (significantly boosting parallelism). Future iterations, however, produce signals that must be buffered. The last code property prevents
a core from getting more than one “lap” ahead of its successor. So when buffering signals, each ring cache node only
needs to recognize 2 types—those from the past and those
from the future.
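To make these properties concrete, the following sketch emulates the wait/signal protocol in software with POSIX threads. It is a minimal sketch, not HELIX-RC's mechanism: the helper names (wait_seg_a, signal_seg_a) are invented, and in the real system the ring cache implements waiting and signaling in hardware. The sketch shows round-robin distribution of iterations over cores and a wait that unblocks only once all predecessor iterations have signaled completion of sequential segment A, which serializes the a = a + 1 update of Figure 3 while the rest of each iteration runs in parallel.

    /* Software emulation of wait/signal; hypothetical helpers, for
       illustration only. Compile with: cc -std=c11 -pthread demo.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define CORES 4
    #define ITERS 16

    static atomic_int seg_a_done;   /* iterations that finished segment A */
    static long shared_a;           /* the shared variable of Figure 3 */

    /* 'wait A': block until every predecessor iteration has signaled A. */
    static void wait_seg_a(int iter) {
        while (atomic_load(&seg_a_done) != iter)
            ;                       /* spin; hardware would not busy-wait */
    }

    /* 'signal A': tell the successor iteration it may enter A. */
    static void signal_seg_a(int iter) {
        atomic_store(&seg_a_done, iter + 1);
    }

    static void *core(void *arg) {
        int id = (int)(long)arg;
        /* Round-robin: core id runs iterations id, id+CORES, id+2*CORES, ... */
        for (int i = id; i < ITERS; i += CORES) {
            /* ...parallel portion of iteration i executes here... */
            wait_seg_a(i);              /* enter segment A in iteration order */
            shared_a = shared_a + 1;    /* a = a + 1 */
            signal_seg_a(i);            /* unblock iteration i + 1 */
            /* ...more parallel work; the core may run ahead from here... */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[CORES];
        for (long i = 0; i < CORES; i++)
            pthread_create(&t[i], NULL, core, (void *)i);
        for (int i = 0; i < CORES; i++)
            pthread_join(t[i], NULL);
        printf("a = %ld (expected %d)\n", shared_a, ITERS);
        return 0;
    }

After signaling, a core immediately proceeds to the parallel part of its next assigned iteration; that run-ahead is exactly what forces signal buffering, and the one-lap bound from the last property is what keeps the buffering down to 2 signals per segment per ring node.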
4.2. Code optimizations
In addition to conventional optimizations specifically tuned
to extract Thread Level Parallelism (TLP) (e.g., code scheduling, method inlining, loop unrolling), HCCv3 includes optimizations
that are essential for best performance of non-numerical
programs on a ring-cache-enhanced architecture: aggressive splitting of sequential segments into smaller code
blocks; identification and selection of small hot loops; and
elimination of unnecessary wait instructions.
Sizing sequential segments poses a tradeoff. Additional
segments created by splitting run in parallel with others, but
extra segments entail extra synchronization, which adds
communication overhead. Thanks to decoupling, HCCv3
can split aggressively to efficiently extract TLP. Note that segments cannot be split indefinitely—each shared location
must be accessed by only 1 segment.
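As a concrete illustration of splitting (again a software sketch with invented names, reusing the emulation style above; HCCv3 performs this transformation on compiler IR, not source), the loop body below updates two shared locations, count and maxval, that never intersect. Giving each its own sequential segment lets segment instances from different iterations overlap: iteration i + 1 can be inside segment A while iteration i is still inside segment B, at the price of one extra wait/signal pair per iteration.

    /* Two disjoint sequential segments; hypothetical helpers. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define CORES 4
    #define ITERS 32

    static atomic_int seg_a_done, seg_b_done;
    static long count;              /* shared data of segment A only */
    static long maxval;             /* shared data of segment B only */

    static void seg_wait(atomic_int *done, int iter) {
        while (atomic_load(done) != iter)
            ;
    }
    static void seg_signal(atomic_int *done, int iter) {
        atomic_store(done, iter + 1);
    }

    static long work(int i) { return (long)(i * 2654435761u % 1001); }

    static void *core(void *arg) {
        int id = (int)(long)arg;
        for (int i = id; i < ITERS; i += CORES) {
            long v = work(i);               /* parallel part */
            seg_wait(&seg_a_done, i);       /* segment A: only 'count' */
            count += 1;
            seg_signal(&seg_a_done, i);     /* iter i+1 may now enter A */
            seg_wait(&seg_b_done, i);       /* segment B: only 'maxval' */
            if (v > maxval) maxval = v;
            seg_signal(&seg_b_done, i);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[CORES];
        for (long i = 0; i < CORES; i++)
            pthread_create(&t[i], NULL, core, (void *)i);
        for (int i = 0; i < CORES; i++)
            pthread_join(t[i], NULL);
        printf("count=%ld max=%ld\n", count, maxval);
        return 0;
    }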
To identify small hot loops that are most likely to speed
up when their iterations run in parallel, HCCv3 profiles the
program being compiled, using representative inputs.
Instrumentation code emulates execution with the ring
cache during profiling, which produces an estimate of time
saved by parallelization. Finally, HCCv3 uses a loop nesting
graph, annotated with the profiling results, to choose the
most promising loops.
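The sketch below illustrates the selection step under simplifying assumptions: the loop nesting graph is a tree, each loop carries a single profiled estimate of the time saved if parallelized, and (per the code properties above) a loop cannot be parallelized together with a loop nested inside it. The structure, names, and numbers are invented for illustration; HCCv3's actual analysis is more detailed.

    /* Profile-annotated loop nesting graph; hypothetical example. */
    #include <stdio.h>

    typedef struct Loop {
        const char *name;
        double saved;                 /* profiled time saved if parallelized */
        struct Loop *child, *sibling; /* loop nesting graph (here a tree) */
    } Loop;

    /* Best achievable savings in the subtree rooted at l: parallelize l
       itself, or instead the best choice among the loops nested in it. */
    static double best(const Loop *l) {
        double kids = 0.0;
        for (const Loop *c = l->child; c; c = c->sibling)
            kids += best(c);
        return l->saved > kids ? l->saved : kids;
    }

    int main(void) {
        Loop inner2 = {"inner2", 2.5, NULL, NULL};
        Loop inner1 = {"inner1", 3.0, NULL, &inner2};
        Loop outer  = {"outer",  4.0, &inner1, NULL};
        /* outer alone saves 4.0, but its two inner loops together save
           5.5, so the inner pair is the more promising choice. */
        printf("best savings: %.1f\n", best(&outer));
        return 0;
    }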
5. ARCHITECTURE ENHANCEMENTS
Adding a ring cache to a multicore architecture enables the
proactive circulation of data and signals that boost parallelization. This section describes the design of the ring cache
and its constituent ring nodes. The design is guided by the following requirements.
Low-latency communication. HELIX-RC relies on fast communication between cores in a multicore processor for synchronization and for data sharing between loop iterations.
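As rough intuition for proactive circulation, the toy model below (single-threaded, with invented structures; real ring nodes are hardware buffers, and this is not HELIX-RC's implementation) forwards a stored value to every node on the ring as soon as the store executes, so a later load by any core hits in its local node instead of triggering the kind of remote round trip that stalls Figure 3(b).

    /* Toy model of proactive value circulation on a unidirectional ring. */
    #include <stdio.h>

    #define NODES 4

    typedef struct { long addr, value; int valid; } Line;

    static Line ring[NODES];        /* one cached line per ring node */

    /* A store is injected at the producer's node and forwarded hop by
       hop around the ring without waiting for any consumer load. */
    static void store_shared(int core, long addr, long value) {
        for (int hop = 0; hop < NODES; hop++) {
            Line *l = &ring[(core + hop) % NODES];
            l->addr = addr; l->value = value; l->valid = 1;
        }
    }

    /* A later load hits in the core's local ring node. */
    static long load_shared(int core, long addr) {
        Line *l = &ring[core];
        return (l->valid && l->addr == addr) ? l->value : -1;
    }

    int main(void) {
        store_shared(0, 0x100, 42);                          /* core 0 produces */
        printf("core 2 loads %ld\n", load_shared(2, 0x100)); /* local hit */
        return 0;
    }

Because the value travels as soon as the store executes, the consumer pays the latency of a local hit rather than a demand-driven transfer.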