The ring cache meets the sequential consistency requirement by leveraging the unidirectional data flow guaranteed by the compiler. Sequential consistency must also be preserved when ring cache values reach
lower-level caches, but the consistency model provided by
conventional memory hierarchies is weaker. We resolve this
difference by introducing a single serialization point per
memory location, namely a unique owner node responsible
for all interactions with the rest of the memory hierarchy.
When a shared value is moved between the ring cache and
L1 caches (owing to occasional ring cache load misses and
evictions), only its owner node can perform the required L1
cache accesses. This solution preserves existing consistency
models with minimal impact on performance.
Cache flush. Finally, to guarantee coherence between parallelized loops and the serial code that runs between loop invocations, each ring node flushes the dirty values of the memory locations it owns to L1 once a parallel loop has finished execution.
This is equivalent to executing a distributed fence at the end
of loops. In a multiprogram scenario, signal buffers must
also be flushed/restored at program context switches.
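The end-of-loop flush can be sketched as follows, assuming a simple per-node table of cached lines; the ring_line structure and flush_node function are hypothetical names introduced only for illustration.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINES_PER_NODE 4

    /* Illustrative per-node entry in the ring cache. */
    struct ring_line {
        uintptr_t addr;
        bool      valid;
        bool      dirty;
        bool      owned;   /* this node is the unique owner of the location */
    };

    /* End-of-loop flush: each node writes back the dirty values it owns,
     * which together act as a distributed fence before serial code resumes. */
    static void flush_node(int node, struct ring_line *lines, int n) {
        for (int i = 0; i < n; i++) {
            if (lines[i].valid && lines[i].dirty && lines[i].owned) {
                printf("node %d: write back %#lx to L1\n",
                       node, (unsigned long)lines[i].addr);
                lines[i].dirty = false;
            }
        }
    }

    int main(void) {
        struct ring_line lines[LINES_PER_NODE] = {
            { 0x1000, true,  true,  true  },
            { 0x2000, true,  true,  false },  /* dirty but owned elsewhere: skipped */
            { 0x3000, true,  false, true  },
            { 0,      false, false, false },
        };
        flush_node(2, lines, LINES_PER_NODE);
        return 0;
    }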
By co-designing the compiler along with the architecture,
HELIX-RC more than triples the performance of parallelized
code when compared to a compiler-only solution. This section investigates HELIX-RC’s performance benefits and
their sensitivity to ring cache parameters. We confirm that
the majority of speedups come from decoupling all types of
communication and synchronization. We conclude by analyzing the remaining overheads of the execution model.
6.1. Experimental setup
We ran experiments on two sets of architectures. The first
relies on a conventional memory hierarchy to share data
among the cores. The second relies on the ring cache.
Simulated conventional hardware. We simulate a multicore, in-order x86 processor by adding multicore support
to the XIOSim simulator. We also simulate out-of-order
cores modeled after Intel Nehalem.
The simulated cache hierarchy has two levels: a per-core 32 KB, 8-way associative L1 cache and a shared 8 MB, 16-bank L2 cache. We vary the core count from 1 to 16, but keep the L2 capacity fixed at 8 MB for all configurations. Scaling the cache size along with the core count would make it difficult to distinguish the benefits of parallelizing a workload from the benefits of fitting its working set into a larger cache, yielding misleading results. Finally, we use
DRAMSim2 for cycle-accurate simulation of memory controllers and DRAM.
We extended XIOSim with a cache coherence protocol assuming an optimistic cache-to-cache latency of 10 clock cycles. This 10-cycle latency is low even compared to research prototypes of low-latency coherence.11 We use this low-latency model only to simulate conventional hardware, and later (Section 6.2) we show that low latency alone is not enough to compensate for the lazy nature of its communication.
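For reference, the simulated machine described above can be summarized in a plain C struct; the struct and field names are illustrative and do not correspond to XIOSim's or DRAMSim2's actual configuration interfaces, and a power-of-two core-count sweep is assumed.

    #include <stdio.h>

    /* Summary of the simulated machine described in the text; names are
     * illustrative, not XIOSim's or DRAMSim2's real configuration format. */
    struct sim_config {
        int cores;       /* swept from 1 to 16 */
        int l1_kb;       /* per-core L1 size */
        int l1_ways;     /* L1 associativity */
        int l2_mb;       /* shared L2, fixed regardless of core count */
        int l2_banks;
        int c2c_cycles;  /* optimistic cache-to-cache latency */
    };

    int main(void) {
        for (int n = 1; n <= 16; n *= 2) {   /* power-of-two sweep assumed */
            struct sim_config cfg = { n, 32, 8, 8, 16, 10 };
            printf("%2d cores: L1 %d KB / %d-way, L2 %d MB / %d banks, "
                   "cache-to-cache %d cycles\n",
                   cfg.cores, cfg.l1_kb, cfg.l1_ways,
                   cfg.l2_mb, cfg.l2_banks, cfg.c2c_cycles);
        }
        return 0;
    }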
To ensure correctness under network contention, the
ring cache is sometimes forced to stall all messages (data
and signals) traveling along the ring. The only events that
can cause contention and stalls are ring cache misses and
evictions, which may then need to fetch data from a remote
L1 cache. While these ring stalls are necessary to guarantee
correctness, they are infrequent.
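A minimal sketch of the stall condition, assuming an illustrative per-node state record: a node lets ring traffic pass unless one of these infrequent L1-side transactions is outstanding.

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative per-node state; field names are assumptions, not
     * details taken from the actual design. */
    struct ring_node_state {
        bool l1_fill_pending;      /* ring cache miss being serviced from L1 */
        bool l1_writeback_pending; /* eviction being written back to L1 */
    };

    /* A node stalls passing messages only while an L1-side transaction
     * is outstanding; otherwise traffic flows freely. */
    static bool must_stall(const struct ring_node_state *n) {
        return n->l1_fill_pending || n->l1_writeback_pending;
    }

    int main(void) {
        struct ring_node_state idle = { false, false };
        struct ring_node_state busy = { true,  false };
        printf("idle node stalls ring? %s\n", must_stall(&idle) ? "yes" : "no");
        printf("node with pending fill stalls ring? %s\n",
               must_stall(&busy) ? "yes" : "no");
        return 0;
    }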
The ring cache relies on credit-based flow control9 and is deadlock-free. Each ring node has at least two buffers attached to the incoming links to guarantee forward progress, and the network maintains the invariant that there is always at least one empty buffer per set of links somewhere in the ring. To preserve this invariant, a node injects new data from its associated core into the ring only when there is no data from a predecessor node to forward.
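The forwarding-priority rule can be sketched as a per-cycle arbitration step, assuming two-entry input buffers and a simple credit counter; credit replenishment by the downstream node is omitted, and the structure and function names are illustrative.

    #include <stdio.h>

    #define BUF_SLOTS 2   /* at least two incoming buffers per node */

    struct node {
        int in_buf[BUF_SLOTS];  /* data buffered from the predecessor node */
        int in_count;
        int local_pending;      /* value the local core wants to inject (0 = none) */
        int credits;            /* credits for the downstream node's buffers */
    };

    /* One arbitration step: forwarded traffic has priority over local
     * injection, and nothing is sent without a downstream credit. */
    static void step(struct node *n, int id) {
        if (n->credits == 0) {
            printf("node %d: no downstream credit, stall\n", id);
            return;
        }
        if (n->in_count > 0) {
            int v = n->in_buf[0];
            n->in_buf[0] = n->in_buf[1];
            n->in_count--;
            n->credits--;
            printf("node %d: forward %d\n", id, v);
        } else if (n->local_pending != 0) {
            printf("node %d: inject local value %d\n", id, n->local_pending);
            n->local_pending = 0;
            n->credits--;
        } else {
            printf("node %d: idle\n", id);
        }
    }

    int main(void) {
        struct node n = { { 7, 9 }, 2, 42, 2 };
        step(&n, 0);  /* forwards 7: predecessor data wins over local 42 */
        step(&n, 0);  /* forwards 9 */
        step(&n, 0);  /* no credit left, so the local value 42 must wait */
        return 0;
    }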
Node-core integration. Ring nodes are connected to their
respective cores as the closest level in the cache hierarchy
(Figure 4). The core’s interface to the ring cache is through
regular loads and stores for memory accesses in sequential segments.
As previously discussed, wait and signal instructions
delineate code within a sequential segment. A thread that
needs to enter a sequential segment first executes a wait,
which only returns from the associated ring node when
matching signals have been received from all other cores
executing prior loop iterations. The signal buffer within the
ring node enforces this. Specialized core logic detects the
start of the sequential segment and routes memory operations to the ring cache. Finally, executing the corresponding
signal marks the end of the sequential segment.
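As an illustration of how a sequential segment is delineated, the sketch below shows the body of one parallelized loop iteration; helix_wait and helix_signal are hypothetical stand-ins for the wait and signal instructions, implemented here as stubs so the example runs on its own.

    #include <stdio.h>

    /* Hypothetical intrinsics standing in for the wait and signal
     * instructions; real compiled code would emit the actual instructions. */
    static void helix_wait(int seg)   { printf("wait(seg %d)\n", seg); }
    static void helix_signal(int seg) { printf("signal(seg %d)\n", seg); }

    static int shared_sum;            /* value shared across loop iterations */

    /* Body of one parallelized loop iteration, as the compiler might emit it. */
    static void iteration(int i) {
        int local = i * i;            /* parallel work, no shared state */

        helix_wait(0);                /* block until prior iterations signal */
        shared_sum += local;          /* sequential segment 0: shared access,
                                         routed by the core to the ring cache */
        helix_signal(0);              /* end of sequential segment 0 */
    }

    int main(void) {
        for (int i = 0; i < 4; i++)   /* in HELIX-RC these would run on
                                         different cores; run in order here
                                         only to show the wait/signal pattern */
            iteration(i);
        printf("sum = %d\n", shared_sum);
        return 0;
    }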
The wait and signal instructions require special treatment in out-of-order cores. Since they may have system-wide
side effects, these instructions must issue non-speculatively
from the core’s store queue and regular loads and stores
cannot be reordered around them. Our implementation
reuses logic from load-store queues for memory disambiguation and holds a lightweight local fence in the load queue
until the wait returns to the senior store queue. This is not
a concern for in-order cores.
5.2. Memory hierarchy integrationa
The ring cache is a level within the cache hierarchy and as
such must not break any consistency guarantees that the
hierarchy normally provides. Consistency between the ring
cache and the conventional memory hierarchy results from
the following invariants: (i) shared memory can only be
accessed within sequential segments through the ring cache
(compiler-enforced); (ii) only a uniquely assigned owner node
can read or write a particular shared memory location
through the L1 cache on a ring cache miss (ring cache-enforced); and (iii) the cache coherence protocol preserves
the order of stores to a memory location through a particular L1 cache.b
Sequential consistency. To preserve the semantics of a
parallelized single-threaded program, memory operations
on shared values require sequential consistency. The ring cache meets this requirement through the compiler-guaranteed unidirectional data flow and the per-location owner nodes described above.
a This feature may add one multiplexer delay to the critical delay path from
the core to L1.
b Most cache coherence protocols (including Intel, AMD, and ARM implementations) provide this minimum guarantee.