systems fill the gap between a private
cache and a large, distributed shared
cache, allowing the cluster cache to deliver faster access to data shared within
the cluster. An additional benefit is
that coherence requests may be satisfied entirely within the cluster (such
as by a sibling node caching the block),
which can be significant if the software is
aware of the hierarchy.
The same techniques described
earlier—inclusion, integrating tracking state with caches, recall messages,
and explicit eviction notifications—are
straightforward to apply recursively to
provide coherence across a hierarchical system. Rather than just embed
tracking state at a single shared cache,
each intermediate shared cache also
tracks sharers—but just for the caches
included by it in the hierarchy. Consider a chip (see Figure 2) in which
each core has its own private cache,
each cluster of cores has a cluster
cache, and the chip has a single shared
last-level cache. Each cluster cache is
shared among the cores in the cluster
and serves the same role for coherence
as the shared cache in nonhierarchical systems; that is, the cluster cache
tracks which private caches within
the cluster have the block. The shared
last-level cache tracks which cluster
caches are caching the block but not
which specific private cache(s) within
the cluster are caching it. For example, a balanced 256-core system might
consist of 16 clusters of 16 cores each
with a 16KB first-level cache, a 512KB
second-level shared cluster cache, and
a 16MB third-level (last-level) cache
shared among all clusters.
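The two levels of tracking described above can be sketched as follows. This is a minimal illustration, not the design of any particular chip: the class names, the `record_read` helper, and the use of plain dictionaries in place of cache tags are all assumptions made for the example. Each cluster cache keeps one bit per private cache in its own cluster, while the last-level cache keeps one bit per cluster.

```python
from dataclasses import dataclass, field

CORES_PER_CLUSTER = 16   # matches the 16 x 16 example in the text
NUM_CLUSTERS = 16


@dataclass
class ClusterCache:
    # block address -> bit vector with one bit per private cache in this cluster
    sharers: dict = field(default_factory=dict)


@dataclass
class LastLevelCache:
    # block address -> bit vector with one bit per cluster (not per core)
    sharers: dict = field(default_factory=dict)


def record_read(llc, clusters, core_id, block):
    """Hypothetical helper: note that core_id now caches block at both levels."""
    cluster_id, local_id = divmod(core_id, CORES_PER_CLUSTER)
    cluster = clusters[cluster_id]
    cluster.sharers[block] = cluster.sharers.get(block, 0) | (1 << local_id)
    llc.sharers[block] = llc.sharers.get(block, 0) | (1 << cluster_id)


llc = LastLevelCache()
clusters = [ClusterCache() for _ in range(NUM_CLUSTERS)]
for core in (3, 5, 250):
    record_read(llc, clusters, core_id=core, block=0x40)
```

After these three reads, the last-level cache knows only that clusters 0 and 15 hold the block; the identity of the individual cores is recorded solely in the corresponding cluster caches.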
Such a hierarchical organization
has some disadvantages (extra complexity
and layers of cache lookups)
but also two key benefits for coherence:
First, the hierarchy naturally provides
a simple form of fan-out invalidation
and acknowledgment combining. For
example, consider a block cached by all
cores; when a core issues a write miss
to this block, the cluster cache lacks
write permission for the block, so it forwards
the request to the shared last-level cache.
The shared last-level cache then sends
an invalidation message to each cluster
(not to each core), triggering the cluster
cache to perform an analogous invalidation
operation within the cluster.
The cluster then sends a single invalidation
acknowledgment independent
of the number of cores in the cluster
that were caching the block. Compared
to a flat protocol, in which each sharer must individually
receive an invalidation and return an acknowledgment,
the total cross-chip traffic is reduced,
and the protocol avoids the bottleneck
of sequentially injecting hundreds or
thousands of invalidation messages
and later sequentially processing the
same number of acknowledgments.
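A rough message count makes the effect of the invalidation flow above concrete. The sketch below is a simplification under stated assumptions: it counts one invalidation and one acknowledgment per destination, ignores the requesting core's own copy, and uses hypothetical function names. It is not a model of any specific protocol.

```python
def flat_messages(num_sharers):
    """Flat protocol: one invalidation and one acknowledgment per sharing core."""
    return 2 * num_sharers


def hierarchical_cross_chip_messages(sharers_per_cluster):
    """Hierarchy: one invalidation and one combined ack per cluster holding the block."""
    clusters_with_sharers = sum(1 for n in sharers_per_cluster if n > 0)
    return 2 * clusters_with_sharers


def intra_cluster_messages(sharers_per_cluster):
    """Messages that stay inside clusters (short-distance traffic)."""
    return sum(2 * n for n in sharers_per_cluster)


# A block cached by all 256 cores of a 16 x 16 system.
all_cores = [16] * 16
flat_total = flat_messages(sum(all_cores))                 # 512
cross_chip = hierarchical_cross_chip_messages(all_cores)   # 32
local_only = intra_cluster_messages(all_cores)             # 512
```

Under these assumptions the shared last-level cache injects 16 invalidations and processes 16 acknowledgments rather than 256 of each; the remaining messages are handled in parallel inside the clusters.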
4,096-core 3-level system would have 16
clusters, each with 16 subclusters of 16
cores, with storage overhead of only 3%.
Conclusion. Hierarchy combined
with inclusion enables efficient scaling
of the storage cost for exact encoding
of sharers.
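The storage figure can be checked with a short calculation. The sketch assumes 64-byte cache blocks and one tracking bit per child cache in each tag; both are assumptions for this example, as is the function name. Under them, a 16-way fan-out at every level gives an overhead close to the 3% quoted above, whereas a flat bit vector over 4,096 cores would dwarf the data it tracks.

```python
BLOCK_BITS = 64 * 8  # assumption: 64-byte cache blocks


def tracking_overhead(children_per_cache, block_bits=BLOCK_BITS):
    """Tracking bits per block, as a fraction of the block's data bits."""
    return children_per_cache / block_bits


fan_out = 16
levels = 3
cores = fan_out ** levels                         # 4096 cores reachable
hierarchical_overhead = tracking_overhead(fan_out)  # 16 / 512 = 0.03125
flat_overhead = tracking_overhead(cores)            # 4096 / 512 = 8.0

print(f"{cores} cores: hierarchical {hierarchical_overhead:.1%}, flat {flat_overhead:.0%}")
```

The hierarchical overhead depends only on the fan-out at each level, not on the total core count, which is why adding a level scales the system without growing the per-block tracking cost.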
Concern 3: maintaining inclusion
In the system model covered here,
we initially choose to require that the
shared cache maintain inclusion with
respect to the private caches. Maintaining an inclusive shared cache
allows efficient tracking of blocks
in private caches by embedding the
tracking information in the tags of the
shared cache, and is why we use this
design point. Inclusion also simplified our earlier analysis of communication and storage.
Inclusion requires that if a block is
cached in any private cache, it must also
be cached in the shared cache. When
the shared cache evicts a block with
nonempty tracking bits, it is required
to send a recall message to each private
cache that is caching the block, adding
to system traffic. More insidiously, such
recalls can increase the cache miss rate
by forcing cores to evict hot blocks they
are actively using [11]. To ensure scalability, we seek a system that makes recall
messages vanishingly rare.
Recalls occur when the shared cache
is forced to evict a block with one or
more sharers. To reduce the number
of recalls, the shared cache always
chooses to evict nonshared blocks over
shared blocks. Because the capacity
of an inclusive shared cache often exceeds the aggregate capacity of the private caches (for example, the ratio is 8
for the four-core Intel Core i7 with 8MB
shared cache and four 256KB second-level private caches), it is highly likely
that a nonshared block will be available
to evict whenever an eviction occurs.
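The replacement preference described above can be sketched as a victim-selection routine. The `Line` record, the LRU timestamps used to break ties, and the function name are illustrative assumptions; the text specifies only that nonshared blocks are evicted in preference to shared ones. The capacity ratio from the Core i7 example is computed alongside.

```python
from dataclasses import dataclass


@dataclass
class Line:
    tag: int
    sharers: int    # tracking bit vector; 0 means no private cache holds the block
    last_use: int   # assumption: LRU timestamp, smaller means older


def choose_victim(cache_set):
    """Prefer a nonshared line; fall back to a shared one, which forces a recall."""
    unshared = [line for line in cache_set if line.sharers == 0]
    candidates = unshared or cache_set
    victim = min(candidates, key=lambda line: line.last_use)
    needs_recall = victim.sharers != 0
    return victim, needs_recall


mixed_set = [Line(tag=1, sharers=0b0101, last_use=1),
             Line(tag=2, sharers=0, last_use=5),
             Line(tag=3, sharers=0, last_use=3)]
victim, needs_recall = choose_victim(mixed_set)   # tag 3, no recall

# Capacity ratio from the text: 8MB shared cache vs. four 256KB private caches.
ratio = (8 * 1024) // (4 * 256)                   # 8
```

In this sketch the oldest line overall (tag 1) is skipped because it has sharers, so no recall is needed; a recall arises only when every line in the set is shared.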
Unfortunately, the shared cache
sometimes lacks sufficient information to differentiate between a block
possibly being cached and certainly
being cached by a core. That is, the
tracking bits in the shared cache are
updated when a block is requested, but
the shared cache in some systems does
not always know when a private cache
has evicted the block. In such systems,
clean blocks (those not written during
their lifetime in the cache) are evicted