On-chip hardware coherence can scale
gracefully as the number of cores increases.
By Milo M.K. Martin, Mark D. Hill, and Daniel J. Sorin
Why On-Chip Cache Coherence Is Here to Stay
Shared memory is the dominant low-level
communication paradigm in today’s mainstream
multicore processors. In a shared-memory system,
the (processor) cores communicate via loads and
stores to a shared address space. The cores use caches
to reduce the average memory latency and memory
traffic. Caches are thus beneficial, but private caches
lead to the possibility of cache incoherence. The
mainstream solution is to provide shared memory
and prevent incoherence through a hardware cache
coherence protocol, making caches functionally
invisible to software. The incoherence problem and
basic hardware coherence solution are outlined in
the sidebar, “The Problem of Incoherence,” page 86.
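To make the incoherence problem concrete, the following toy Python sketch (our illustration, not code from the article or its sidebar) models two cores that share memory through private write-through caches. A store either does or does not invalidate the other core's cached copy; the names `Core` and `demo` are hypothetical.

```python
class Core:
    """A core with a private write-through cache over a shared memory."""
    def __init__(self, memory, coherent):
        self.memory = memory      # shared memory, modeled as a dict
        self.cache = {}           # private cache: address -> value
        self.peers = []           # other cores sharing this memory
        self.coherent = coherent  # whether stores invalidate peer copies

    def load(self, addr):
        if addr not in self.cache:                 # miss: fetch from memory
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]                    # hit: value may be stale

    def store(self, addr, value):
        if self.coherent:
            for peer in self.peers:                # invalidation protocol:
                peer.cache.pop(addr, None)         # remove other copies first
        self.cache[addr] = value
        self.memory[addr] = value                  # write through to memory

def demo(coherent):
    memory = {0x10: 1}
    c0, c1 = Core(memory, coherent), Core(memory, coherent)
    c0.peers, c1.peers = [c1], [c0]
    c1.load(0x10)         # core 1 caches the old value, 1
    c0.store(0x10, 2)     # core 0 writes a new value
    return c1.load(0x10)  # stale without coherence, fresh with it
```

Here `demo(coherent=False)` returns the stale value 1, while `demo(coherent=True)` returns 2: invalidating remote copies before a write is the essence of an invalidation-based coherence protocol, and it is what keeps the private caches functionally invisible to software.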
Cache-coherent shared memory is provided by
mainstream servers, desktops, laptops, and mobile
devices and is available from all major vendors,
including AMD, ARM, IBM, Intel, and Oracle (Sun).
Cache coherence has come to dominate the market for technical, as well as
for legacy, reasons. Technically, hardware cache coherence provides performance generally superior to what is
achievable with software-implemented
coherence. Cache coherence’s legacy
advantage is that it provides backward
compatibility for a long history of software, including operating systems,
written for cache-coherent shared-memory systems.
Although coherence delivers value
in today’s multicore systems, the conventional wisdom is that on-chip cache
coherence will not scale to the large
number of cores expected to be found
on future processor chips.5,10,13 Coherence’s alleged lack of scalability arises from claims of unscalable storage
and interconnection network traffic
and concerns over latency and energy.
Such claims lead to the conclusion that
cores in future multicore chips will not
employ coherence but instead communicate with software-managed coherence, explicitly managed scratchpad
memories, and/or message passing
(without shared memory).
Here, we seek to refute this conventional wisdom by presenting one
way to scale on-chip cache coherence
in which coherence overheads—
traffic, storage, latency, and energy—grow
slowly with core count and are similar
to the overheads deemed acceptable in
today’s systems. To do this, we synergistically combine known techniques, including shared caches augmented to track cached copies, explicit cache eviction notifications, and hierarchical design.
For the same reason system designers
will not abandon compatibility for
the sake of eliminating minor costs,
they likewise will not abandon cache coherence. Continued coherence support lets
programmers concentrate on what
matters for parallel speedups: finding
work to do in parallel with no undue
communication and synchronization.