ter) maps easily to a shared memory
system with caches.
Third, we are aware of the complexity challenge posed by coherence. We do not underestimate the importance of managing complexity, but we also note that the chip-design industry has a long history of doing so. Many companies have sold many systems with hardware cache coherence. Designing and validating the coherence protocols in them is not easy, but industry continues to overcome these challenges. Moreover, the complexity of coherence protocols does not necessarily scale up with the number of cores. Adding more cores to an existing multicore design has little effect on the conceptual complexity of a coherence protocol, though it may increase the amount of time necessary to validate the protocol.
However, even the validation effort
may not pose a scalability problem;
research shows it is possible to design hierarchical coherence protocols
that can be formally verified with an
amount of effort that is independent
of the number of cores.
Furthermore, the complexity of the alternative to hardware coherence—software-implemented coherence—is non-zero. As when assessing hardware coherence's overheads—storage, traffic, latency, and energy—chip architects must be careful not to implicitly assume the alternative to coherence is free. Forcing software to use software-managed coherence or explicit message passing does not remove the complexity but rather shifts it from hardware to software.
Fourth, we assumed a single-chip (socket) system and did not explicitly address chip-to-chip coherence in today's multisocket servers. The same sort of tagged tracking structures can be applied to small-scale multisocket systems, essentially adding one more level to the coherence hierarchy. Moreover, providing coherence across multisocket systems may become less important, because single-chip solutions solve more needs, and "scale out" solutions are required in any case (such as for data centers), but that is an argument for another article.
Finally, even if coherence itself scales, we did not address other issues that might prevent practical multicore scaling, such as die-area limitations, scalability of the on-chip interconnect, and critical problems of software non-scalability. Despite advances in scaling operating systems and applications, many applications do not (yet) effectively scale to many cores. This article does not improve that situation. Nevertheless, we have shown that on-chip hardware coherence can be made to scale gracefully, freeing application and system software developers from having to reimplement coherence (such as knowing when to flush and refetch data) or orchestrate explicit communication via message passing.
We thank James Balfour, Colin
Blundell, Derek Hower, Steve Keckler,
Alvy Lebeck, Steve Lumetta, Steve Reinhardt, Mike Swift, and David Wood.
This material is based on work supported by the National Science Foundation (CNS-0720565, CNS-0916725,
CNS-1117280, CCF-0644197, CCF-
0905464, CCF-0811290, and CCF-
1017650); Sandia/Department of Energy (MSN123960/DOE890426); and the
Semiconductor Research Corporation
(2009-HJ-1881). Any opinions, findings, and conclusions or recommendations expressed here are those of
the authors and do not necessarily reflect the views of the National Science
Foundation, Sandia/DOE, or SRC. The
authors have also received research
funding from AMD, Intel, and NVIDIA.
Hill has a significant financial interest
1. Agarwal, A., Simoni, R., Horowitz, M., and Hennessy, J. An evaluation of directory schemes for cache coherence. In Proceedings of the 15th Annual International Symposium on Computer Architecture (Honolulu, May). IEEE Computer Society Press, Los Alamitos, CA, 1988, 280–298.
2. Boyd-Wickizer, S., Clements, A.T., Mao, Y., Pesterev, A., Kaashoek, M.F., Morris, R., and Zeldovich, N. An analysis of Linux scalability to many cores. In Proceedings of the Ninth USENIX Symposium on Operating Systems Design and Implementation (Vancouver, Oct. 4–6). USENIX Association, Berkeley, CA, 2010, 1–8.
3. Bryant, R. Scaling Linux to the extreme. In Proceedings of the Linux Symposium (Boston, June 27–July 2, 2004), 133–148.
4. Butler, M., Barnes, L., Sarma, D.D., and Gelinas, B. Bulldozer: An approach to multithreaded compute performance. IEEE Micro 31, 2 (Mar./Apr. 2011), 6–15.
Milo M.K. Martin (email@example.com) is an associate professor in the Computer and Information Science Department of the University of Pennsylvania,
Mark D. Hill (firstname.lastname@example.org) is a professor in both the Computer Sciences Department and the Electrical and Computer Engineering Department of the University of
Daniel J. Sorin (email@example.com) is an associate professor in the Electrical and Computer Engineering and Computer Science Departments of Duke University,