Low-latency communication. Shared data must reach consuming cores quickly enough to avoid stalling parallelized loops between consecutive iterations. Since low-latency communication is possible between physically adjacent cores in modern processors, the ring cache implements a simple unidirectional ring network with one ring node attached to each core.
Caching shared values. A compiler cannot easily guarantee whether and when shared data generated by a loop iteration will be consumed by other cores running subsequent iterations. Hence, the ring cache must cache shared data. Keeping shared data on local ring nodes provides quick access for the associated cores. As with data, it is also important to buffer signals in each ring node for immediate use.
Easy integration. The ring cache is a minimally invasive extension to existing multicore systems, easy to adopt and integrate. It does not require modifications to the existing memory hierarchy or to cache coherence protocols.
With these objectives in mind, we now describe the internals of the ring cache and its interaction with the rest of the memory hierarchy.

5.1. Ring cache architecture
The ring cache architecture relies on properties of compiled code, which imply that the data involved in timing-critical dependences that potentially limit overall performance are both produced and consumed in the same order as loop iterations. Furthermore, a ring network topology captures this data flow, as sketched in Figure 4.
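To see why a ring fits this data flow, consider the round-robin distribution of loop iterations across cores used by HELIX-style parallelization. The short Python sketch below (all names illustrative) checks that the core running iteration i + 1 is always the ring successor of the core running iteration i:

N = 8  # number of cores (illustrative)

def core_of(iteration):
    # Round-robin assignment of successive loop iterations to cores.
    return iteration % N

def successor(core):
    # Unidirectional ring: each node forwards only to the next core.
    return (core + 1) % N

# A value produced in iteration i is first needed in iteration i + 1,
# which runs on the successor core, so forwarding around the ring
# visits consumers in exactly the order iterations execute.
assert all(core_of(i + 1) == successor(core_of(i)) for i in range(1000))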
The following paragraphs describe the structure and purpose of each ring cache component.

Ring node structure. The internal structure of a per-core ring node is shown in the right half of Figure 4. Parts of this structure resemble a simple network router. Unidirectional links connect a node to its two neighbors to form the ring backbone. Bidirectional connections to the core and private L1 cache allow injection of data into and extraction of data from the ring. There are three separate sets of data links and buffers. A primary set forwards data and signals between cores. Two other sets manage infrequent traffic for integration with the rest of the memory hierarchy (see Section 5.2). Separating these three traffic types simplifies the design and avoids deadlock. Finally, signals move in lockstep with forwarded data to ensure that a shared memory location is not accessed before the data arrives.
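As a concrete illustration, the Python sketch below models the queues of one ring node; the three disjoint queue sets mirror the three traffic classes just described, and each flit carries its signal with its data so the two move in lockstep. The names and types are illustrative assumptions, not details of the actual hardware:

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Flit:
    """One ring transfer: a data word plus its lockstep signal."""
    address: int
    value: int
    signal: bool = True

@dataclass
class RingNode:
    """Queues of one per-core ring node (illustrative)."""
    node_id: int
    cache: dict = field(default_factory=dict)          # stands in for the cache array
    signal_buffer: list = field(default_factory=list)  # holds signals until consumed
    # Primary link set: core-to-core data and signals.
    data_in: deque = field(default_factory=deque)      # arriving from the predecessor
    core_out: deque = field(default_factory=deque)     # injected by the local core
    # Two further link sets carry infrequent memory-hierarchy traffic;
    # keeping the three classes on disjoint buffers avoids deadlock.
    mem_request: deque = field(default_factory=deque)
    mem_response: deque = field(default_factory=deque)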
In addition to these router-like elements, a ring node also contains structures more common to caches. A set-associative cache array stores all data values (and their tags) received by the ring node, whether from a predecessor node or from its associated core. The line size of this cache array is kept at one machine word. While so small a line is contrary to typical cache designs, it ensures that independent values can never falsely share a line. The final structural component of the ring node is the signal buffer, which stores signals until they are consumed.
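A minimal sketch of such an array follows, assuming a set-associative organization with FIFO replacement (the replacement policy is not specified here). Since each line holds exactly one word, independent addresses can never occupy the same line:

class RingCacheArray:
    """Set-associative cache array with one-machine-word lines (sketch)."""

    def __init__(self, num_sets=64, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        # One dict of tag -> value per set; insertion order gives FIFO age.
        self.sets = [dict() for _ in range(num_sets)]

    def write(self, address, value):
        lines = self.sets[address % self.num_sets]
        tag = address // self.num_sets
        if tag not in lines and len(lines) == self.ways:
            lines.pop(next(iter(lines)))  # evict the oldest resident line
        lines[tag] = value                # one word per line: no false sharing

    def read(self, address):
        lines = self.sets[address % self.num_sets]
        return lines.get(address // self.num_sets)  # None on miss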
Node-to-node connection. The main purpose of the ring cache is to proactively provide many-to-many core communication in a scalable and low-latency manner. In the unidirectional ring formed by the ring nodes, data propagates by value circulation. Once a ring node receives an (address, value) pair, either from its predecessor or from its associated core, it stores a local copy in its cache array and propagates the same pair to its successor node. The pair eventually propagates through the entire ring (stopping after a full cycle) so that any core can consume the data value from its local ring node, as needed.
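Continuing the illustrative sketches above, one step of value circulation might look as follows; the origin tag used to detect a completed trip is an assumption, since the text only states that propagation stops after a full cycle:

def circulate_step(node, ring):
    """Move one (origin, flit) pair through one ring node (sketch)."""
    if not node.data_in:
        return
    origin, flit = node.data_in.popleft()
    node.cache[flit.address] = flit.value   # keep a local copy
    node.signal_buffer.append(flit.signal)  # and buffer its signal
    successor = ring[(node.node_id + 1) % len(ring)]
    if successor.node_id != origin:         # stop once the trip is complete
        successor.data_in.append((origin, flit))

Under this rule, a pair injected at its origin is cached by every other node exactly once and is dropped just before returning, completing the full cycle described above.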
This value circulation mechanism allows the ring cache to communicate between cores faster than reactive systems (like most coherent cache hierarchies). In a reactive system, data transfer begins only once the receiver requests the shared data, which adds transfer latency to an already latency-critical code path. In contrast, a proactive scheme overlaps transfer latency with computation to reduce the receiver's wait time.
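The benefit is easy to see in a back-of-the-envelope model; the toy formula below illustrates the overlap argument and is not taken from the paper:

def receiver_stall(transfer_cycles, overlap_cycles, proactive):
    """Cycles a receiver waits for one shared value (toy model).

    Reactive: the transfer starts only at the request, so it sits
    entirely on the critical path (a real reactive system also pays
    the request's own trip, omitted here). Proactive: the transfer
    starts when the value is produced, so any part that overlaps
    the receiver's independent work is hidden.
    """
    if proactive:
        return max(0, transfer_cycles - overlap_cycles)
    return transfer_cycles

# An 8-cycle transfer fully hidden behind 10 cycles of independent work:
assert receiver_stall(8, 10, proactive=True) == 0
assert receiver_stall(8, 10, proactive=False) == 8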
The ring cache prioritizes the common case, where data generated within sequential segments must propagate to all other nodes as quickly as possible. Assuming no contention over the network and single-cycle node-to-node latency, the design shown in Figure 4 allows us to bound the latency of a full trip around the ring to N clock cycles, where N is the number of cores. Each ring node prioritizes data received from the ring and stalls injection from its local core.
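This priority rule amounts to a one-line arbiter. In the sketch below (reusing the RingNode model above), traffic already on the ring always wins the output link:

def arbitrate(node):
    """Select what a ring node sends this cycle (sketch)."""
    if node.data_in:      # traffic already on the ring always wins
        return node.data_in.popleft()
    if node.core_out:     # the local core injects only on an idle link
        return node.core_out.popleft()
    return None           # nothing to send

Because circulating pairs are never delayed by local injections, each hop takes one cycle and a full trip is bounded by N cycles.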
To eliminate delays in forwarding data between ring nodes, the number of write ports in each node's cache array must match the link bandwidth between two nodes. While this may seem like an onerous design constraint for the cache array, Section 6 shows that just one write port is sufficient to reap more than 99% of the ideal-case benefits.
Figure 4. Ring cache architecture overview. From left to right: overall system; single core slice; ring node internal structure.