in the original inclusive shared cache),
and the relative overhead becomes
larger if the hardware designer opts for
a smaller shared cache.
To be concrete, let S1 be the sum
of private cache sizes, S2 the shared
cache size, D the directory entry size
relative to the size of a private cache
block and tag, and R the ratio of the
number of directory entries to the total number of private cache blocks. R
should be greater than 1 to keep recalls
rare, as discussed earlier in the section
on maintaining inclusion. Directory
storage adds R×S1×D to cache storage
S1+S2, for a relative overhead of
(R×S1×D)/(S1+S2) = (R×D)/(1+S2/S1).
Assume that R = 2 and D = 64b/(48b+512b).
If S2/S1 is 8, as in Core i7,
then directory storage overhead is only
2.5%. Shrinking S2/S1 to 4, 2, and 1 increases relative overhead to 4.6%, 7.6%,
and 11%, respectively.
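
The arithmetic above can be checked with a short sketch. The function name directory_overhead and its parameter names are hypothetical, chosen here for illustration; the formula and the values R = 2 and D = 64b/(48b+512b) are the ones given in the text. The last case evaluates to about 11.4%, which the text rounds to 11%.

```python
def directory_overhead(r, d, s2_over_s1):
    """Relative directory storage overhead, (R*D) / (1 + S2/S1).

    r          -- R, directory entries per private cache block
    d          -- D, directory entry size relative to a private block plus tag
    s2_over_s1 -- shared cache size divided by the sum of private cache sizes
    """
    return (r * d) / (1.0 + s2_over_s1)


# Values assumed in the text: R = 2, 64-bit entries,
# 48-bit tag plus 512-bit block.
R = 2.0
D = 64 / (48 + 512)

for ratio in (8, 4, 2, 1):
    print(f"S2/S1 = {ratio}: {directory_overhead(R, D, ratio):.1%}")
```

Because S2/S1 appears only in the denominator, halving the shared cache less than doubles the overhead as long as the ratio stays above zero.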
The use of hierarchy adds another
level of directory and an L3 cache. Without inclusion, the new directory level
must point to an L2 bank if a block is
either in the L2 bank or in its co-located directory. For cache size ratio Z = S3/
S2 = S2/S1 = 8, the storage overhead for
reaching 256 cores is 3.1%. Shrinking Z
to 4, 2, or 1 at most doubles the relative
overhead to 6.5%, 13%, or 23%, respectively. Furthermore, such storage overheads translate into relatively lower
overheads in terms of overall chip area,
as caches are only part of the chip area.
Overall, we find that directory storage
is still reasonable when the cache size
ratio Z > 1.
Caveats and Criticisms
We have described a coherence protocol
based on known ideas to show the
costs of on-chip coherence grow slowly
with core count. Our design uses a hierarchy
of inclusive caches with embedded
coherence state whose tracking information
is kept precise with explicit
cache-replacement messages. Using
amortized analysis, we have shown that
for every cache miss request and data
response, the interconnection network
traffic per miss is independent of the
number of cores and thus scales. Embedding
coherence state in an inclusive
cache hierarchy keeps coherence’s
storage costs small; for example, 512
cores can be supported with 5% extra
cache area with two cache levels or 2%
with three levels. Coherence adds negligible
latency to cache hits, off-chip
accesses, and misses to blocks not actively
shared; miss latency for actively
shared blocks is higher, but the ratio
of the latencies for these misses is tolerable
today and independent of the
number of cores. Energy overheads of
coherence are correlated with traffic
and storage, so we find no reason for
energy overheads to limit the scalability
of coherence. Extensions to a non-inclusive
shared cache show larger but
manageable storage costs when the shared
cache size is larger than the sum of private
cache sizes. With coherence’s costs
shown to scale, we expect on-chip coherence
is here to stay due to the programmability
and compatibility benefits
it delivers.