as four cores to 16 cores), the storage cost is negligible; a 16-core system adds just 16b for each 64B cache
block in the shared cache, or approximately 3% more bits. For a miss to a
block not cached by other private
caches, the miss latency and energy
consumed incur the negligible overhead of checking a couple of state bits
in the shared cache rather than just a
single valid bit. As we show later, even
when blocks are shared, the traffic per
miss is limited and independent of
the number of cores. Overall, this approach is reasonably low cost in terms
of traffic, storage, latency, and energy,
and its design complexity is tractable.
Nevertheless, the question for architects is: Does this system model scale
to future manycore chips?
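
The storage figure quoted above can be checked with a short calculation. The sketch below is illustrative only: it assumes one presence bit per core for every shared-cache block (a full bit vector), ignores the handful of state bits, and the function name `sharer_vector_overhead` is hypothetical.

```python
BLOCK_BYTES = 64  # cache block size used in the text


def sharer_vector_overhead(num_cores: int, block_bytes: int = BLOCK_BYTES) -> float:
    """Return the extra storage, as a fraction of the block's data bits,
    when each shared-cache block carries one presence bit per core.

    Assumption: a full bit vector (one bit per core); state bits are ignored.
    """
    data_bits = block_bytes * 8
    return num_cores / data_bits


if __name__ == "__main__":
    for cores in (4, 16):
        extra = sharer_vector_overhead(cores)
        print(f"{cores:>2} cores: {cores} extra bits per block ({extra:.2%} of data bits)")
```

For 16 cores this gives 16 extra bits on top of 512 data bits, which matches the "approximately 3%" quoted in the text.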
Scalability
Some prognosticators forecast that the era of cache coherence is nearing
its end [5, 10, 13] due primarily to an alleged lack of scalability. However,
when we examined state-of-the-art coherence mechanisms, we found them to be
more scalable than we expected.
We view a coherent system as “scalable” when the cost of providing coherence
grows (at most) slowly as core count increases. We focus exclusively on the
cache-coherence aspects of multicore scaling, whereas a fully scalable system
(coherent or otherwise) also requires scalability from other hardware (such as
memory and on-chip interconnection network) and software (operating system and
applications) components.
Here, we examine five potential concerns when scaling on-chip coherence:
• Traffic on the on-chip interconnection network;
• Storage cost for tracking sharers;
• Inefficiencies caused by maintaining inclusion (as inclusion is assumed
by our base system);
• Latency of cache misses; and
• Energy overheads.
The following five sections address these concerns in sequence and present our
analysis, indicating that existing design approaches can be employed such that
none of these concerns would present a fundamental barrier to scaling
coherence. We then discuss extending the analysis to noninclusive caches and
address some caveats and potential criticisms of this work.
Concern 1: Traffic
Here we tackle the concerns regarding
the scalability of coherence traffic on
the on-chip interconnection network.
To perform a traffic analysis, we consider for each cache miss how many
bytes must be transferred to obtain and
relinquish the given block. We divide
the analysis into two parts: in the absence of sharing and with sharing. This
analysis shows that when sharers are
tracked precisely, the traffic per miss
is independent of the number of cores.
Thus, if coherence’s traffic is acceptable for today’s systems with relatively
few cores, it will continue to be acceptable as the number of cores scales up.
We conclude with a discussion of how
coherence’s per-miss traffic compares
to that of a system without coherence.
Without sharing. We first analyze
the worst-case traffic in the absence
of sharing. Each miss in a private
cache requires at least two messages:
a request from the private cache to the
shared cache and a response from the
shared cache to provide the data to the
requestor. If the block is written during
the time it is in the cache, the block is
“dirty” and must be written explicitly
back to the shared cache upon eviction.
Even without sharing, the traffic depends on the specific coherence protocol
implementation. In particular, we consider protocols that require a private
cache to send an explicit eviction notification message to the shared cache
whenever it evicts a block, even when evicting a clean block. (This decision
to require explicit eviction notifications benefits implementation of
inclusive caching, as discussed later in the section on maintaining
inclusion.) We also conservatively assume that coherence requires the shared
cache to send an acknowledgment message in response to each eviction
notification. Fortunately, clean eviction messages are small (enough to, say,
hold an 8B address) and can occur only subsequent to cache misses, each of
which transfers, say, a 64B cache block. Coherence’s additional traffic per
miss is thus modest and, most important, independent of the number of cores.
Based on 64B cache blocks, the table here shows that coherence’s traffic is
96B/miss for clean blocks and 160B/miss for dirty blocks.
Traffic cost of cache misses.
To calculate traffic, we must assume values for the size of addresses and cache blocks (such as 8B
physical addresses and 64B cache blocks). Request and acknowledgment messages are typically
short (such as 8B) because they contain mainly a block address and a message type field. A data
message is significantly larger because it contains both an entire data block and a block address
(such as 64B + 8B = 72B).
                            Clean block                           Dirty block
Without coherence           (Req+Data) + 0 = 80B/miss             (Req+Data) + Data = 152B/miss
With coherence              (Req+Data) + (Evict+Ack) = 96B/miss   (Req+Data) + (Data+Ack) = 160B/miss
Per-miss traffic overhead   20%                                   5%
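
The arithmetic behind the table can be reproduced with a few lines of code. This is a minimal sketch under the table's stated assumptions (8B addresses, 64B blocks, 8B request, eviction, and acknowledgment messages, 72B data messages); the constant and function names are hypothetical and chosen only for illustration.

```python
ADDR_BYTES = 8    # assumed physical address size, as in the table note
BLOCK_BYTES = 64  # assumed cache block size, as in the table note

REQ = ADDR_BYTES                 # request: block address plus type field
EVICT = ADDR_BYTES               # clean eviction notification
ACK = ADDR_BYTES                 # acknowledgment of an eviction
DATA = BLOCK_BYTES + ADDR_BYTES  # data message: block plus address = 72B


def traffic_per_miss(dirty: bool, coherent: bool) -> int:
    """Bytes moved to obtain a block and later relinquish it (no sharing)."""
    fetch = REQ + DATA
    if dirty:
        # A dirty block is always written back; coherence adds only the Ack.
        relinquish = DATA + (ACK if coherent else 0)
    else:
        # A clean block is dropped silently unless coherence requires an
        # explicit eviction notification and its acknowledgment.
        relinquish = (EVICT + ACK) if coherent else 0
    return fetch + relinquish


def coherence_overhead(dirty: bool) -> float:
    """Extra per-miss traffic due to coherence, relative to no coherence."""
    base = traffic_per_miss(dirty, coherent=False)
    return (traffic_per_miss(dirty, coherent=True) - base) / base


if __name__ == "__main__":
    for dirty in (False, True):
        kind = "dirty" if dirty else "clean"
        print(kind,
              traffic_per_miss(dirty, coherent=False),
              traffic_per_miss(dirty, coherent=True),
              f"{coherence_overhead(dirty):.0%}")
```

Note that neither function takes a core count as a parameter, which mirrors the observation that per-miss traffic without sharing does not depend on the number of cores.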