a significant increase in core utilization, and concomitant
overall speedup, if the compiler were able to freely select
small hot loops for parallelization. Realizing that potential
requires understanding the characteristics of such loops
and optimizing for them.
2.2. Low latency challenge
To illustrate the need for low-latency communication,
Figure 1a plots a cumulative distribution of average iteration execution times on a single Atom-like core (described in
our work4) for the set of hot loops from SPECint2000 chosen
for parallelization by HELIX-RC. The shaded portion of the
plot shows that more than half of the loop iterations complete within 25 clock cycles. The figure also shows the measured core-to-core round-trip communication latencies for three modern multicore processors. Even the shortest of these latencies, 75 cycles for Ivy Bridge, is too heavy a communication penalty for the majority of these short loops.
2.3. Broadcast challenge
Loops within non-numerical programs generate values that
are consumed by later iterations, but the compiler cannot
know which iterations will use which values. So when the
compiler distributes the iterations of a loop to separate
cores, shared values that result from loop-carried dependences need to be accessible to any of those cores soon after they are generated.
For loops parallelized by HELIX-RC, most communication of shared values is not between cores executing successive loop iterations, which HELIX-RC assigns to adjacent
cores. Figure 1b charts the distribution of value communication distances (defined as the undirected distance between
the core that produces a value and the first one that consumes it) on a platform with 16 cores organized in a ring.
Only 15% of those transfers are between adjacent cores.
Moreover, Figure 1c shows that most (86%) of the loop-carried values in parallelized loops are consumed by multiple
cores. Since consumers of shared values are not known at
compile time, especially for non-numerical workloads, broadcasting is the most appropriate communication protocol.
It is well known that implementing low-latency broadcast
is challenging for a large set of cores. HELIX-RC uses a hardware mechanism that achieves proactive delayed broadcast
of data and signals to all cores in the ring for a loop. Such
proactive communication is the cornerstone of the
HELIX-RC approach, which allows the communication
needed for sharing data between cores to overlap with computation that the cores carry out in parallel.
3. THE HELIX-RC SOLUTION
To run the iterations of small hot loops efficiently in parallel, HELIX-RC replaces communication-on-demand with
proactive communication. It decouples value forwarding
between threads from value consumption by the receiving
thread. It also decouples transmission of synchronizing signals from the code that enforces sequential semantics.
Extensions of conventional microprocessor architecture
make this decoupling possible. Reliance on compiler-guaranteed machine code properties keeps those architectural
extensions simple and efficient.
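A toy cycle model makes the benefit of that decoupling concrete. The numbers below are invented for illustration, not measurements: under communication-on-demand the consumer stalls for a full round trip, whereas with proactive forwarding the one-way transfer overlaps with whatever independent work separates the producer's write from the consumer's read.

#include <stdio.h>

/* Toy cycle accounting, not a measurement: compares the consumer-side
 * stall under communication-on-demand versus proactive forwarding. */
int main(void) {
    int one_way = 35;  /* hypothetical one-way core-to-core latency (cycles) */
    int slack   = 20;  /* hypothetical independent work between the producer's
                          write and the consumer's read (cycles) */

    /* On demand: the consumer requests the value and waits for the reply. */
    int on_demand_stall = 2 * one_way;

    /* Proactive: the value is forwarded as soon as it is produced, so the
       transfer overlaps with the slack; only the remainder stalls the core. */
    int proactive_stall = one_way > slack ? one_way - slack : 0;

    printf("on-demand stall: %d cycles, proactive stall: %d cycles\n",
           on_demand_stall, proactive_stall);
    return 0;
}

Under these made-up numbers the on-demand stall (70 cycles) dwarfs a 25-cycle iteration, while the proactive stall shrinks to 15 cycles, which is why overlapping communication with computation matters for such short loops.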
3.1. Approach
HELIX-RC is a co-design that binds its compiler (HCCv3) to a
processor architecture enhancement called ring cache.
When the compiler generates a set of parallel threads to run
on separate cores, they are rarely completely independent.
While most of each thread’s code can run concurrently with
other threads, there are segments of that code that must
execute in strict sequence across the thread set. We call
these sequential segments. The main role of the ring cache is
to accelerate the communication of values and synchronizing signals needed to implement sequential segments.
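As a software-only illustration of a sequential segment, the sketch below distributes a loop with a loop-carried update of sum across threads. The ss_enter/ss_exit calls are hypothetical stand-ins for the compiler-inserted segment boundaries, emulated here with a shared turn counter rather than by the ring cache.

#include <pthread.h>
#include <stdio.h>

#define NUM_CORES 4
#define N 64

static long sum = 0;                 /* loop-carried value              */
static int  turn = 0;                /* next iteration allowed to enter */
static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

static long work(int i) { return (long)i * i; }   /* independent part */

static void ss_enter(int i) {        /* wait until it is iteration i's turn */
    pthread_mutex_lock(&m);
    while (turn != i) pthread_cond_wait(&cv, &m);
}

static void ss_exit(void) {          /* hand the turn to the next iteration */
    turn++;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&m);
}

static void *run(void *arg) {
    int core = (int)(long)arg;
    for (int i = core; i < N; i += NUM_CORES) {
        long v = work(i);            /* runs concurrently on all cores  */
        ss_enter(i);                 /* sequential segment begins       */
        sum += v;                    /* ordered across cores            */
        ss_exit();                   /* sequential segment ends         */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NUM_CORES];
    for (long c = 0; c < NUM_CORES; c++)
        pthread_create(&t[c], NULL, run, (void *)c);
    for (int c = 0; c < NUM_CORES; c++)
        pthread_join(t[c], NULL);
    printf("sum = %ld\n", sum);      /* same result as the sequential loop */
    return 0;
}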
The ring cache is a ring network linking ring nodes, each
of which is attached to a core in the processor. During
sequential segments, this ring serves as a distributed first-level cache preceding the private L1 cache (Figure 2). HCCv3
marks the entry and exit points of sequential segments
using two instructions that extend the instruction set. As a
result, each core knows whether or not it is currently executing the sequential segment of a parallel thread, and it
accesses the cache hierarchy accordingly.
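The ring cache's store-and-forward behavior (right side of Figure 2) can be approximated by a toy, single-threaded model; the node count, propagation direction, and one-hop-per-cycle timing below are assumptions for illustration, not the hardware design.

#include <stdio.h>
#include <stdbool.h>

#define NODES 16   /* assumed ring size, for illustration only */

int main(void) {
    bool present[NODES] = { false };
    int producer = 0, consumer = 8;     /* nodes 8 hops apart on the ring */

    present[producer] = true;           /* producer's node caches the new value */

    /* Each pass moves the value one hop around the ring, with no core involvement. */
    int hops = 0;
    while (!present[consumer]) {
        bool wrap = present[NODES - 1];
        for (int i = NODES - 1; i > 0; i--)
            present[i] = present[i] || present[i - 1];
        present[0] = present[0] || wrap;
        hops++;
    }
    printf("value reached node %d after %d hops\n", consumer, hops);
    return 0;
}

Because this forwarding proceeds while all cores keep computing, a consumer several hops away typically finds the value already waiting in its local node when it reaches its wait, as in the Figure 2 example.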
Figure 1. Small hot loops have short iterations that send data over multiple hops and to multiple cores. (a) Short loop iterations.
Figure 2. The ring cache is a ring network that connects ring nodes
attached to each core. It operates during sequential segments as
a distributed first-level cache that precedes the private L1 cache
(left side). Ring nodes propagate newly-generated values without
involving their attached cores (right side). In this example, data
generated by the rightmost core is available at the leftmost core
when needed, so wait A incurs no delay.