Figure 3. OpenPiton’s memory hierarchy datapath.
in the OpenSPARC T1 to connect the cores, L2 cache, FPU,
I/O, etc. In OpenPiton, the L1.5 and FPU are connected to
the core by CCX.
2.2. Core
OpenPiton uses the open-source OpenSPARC T1 core with
modifications. This core was chosen because of its industry-hardened design, multi-threaded capability, simplicity, and
modest silicon area requirements. Equally important, the
OpenSPARC framework has a stable code base, implements
a mature ISA with compiler and OS support, and comes with
a large test suite.
In the default configuration for OpenPiton, as used in
Piton, the number of threads is reduced from four to two
and the stream processing unit (SPU) is removed from the
core to save area. The default Translation Lookaside Buffer
(TLB) size is 16 entries but can be increased to 32 or 64, or
decreased down to 8 entries.
Additional configuration registers were added to enable
extensibility within the core. They are useful for adding
functionality to the core that can be configured from
software, for example, enabling or disabling features or
selecting among different modes of operation.
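As a sketch of how such a software-configurable register might look, the following Python model packs an enable bit and a mode field into one register value. The field names and bit positions are illustrative assumptions, not OpenPiton's actual register layout.

```python
# Hypothetical model of a software-visible configuration register
# with an enable bit and a mode field, as described above.
# Bit positions are illustrative, not OpenPiton's actual layout.

ENABLE_BIT = 0                    # bit 0: enable/disable the added functionality
MODE_SHIFT, MODE_MASK = 1, 0b11   # bits 2:1: mode of operation

def encode(enabled: bool, mode: int) -> int:
    """Pack the fields into a register value."""
    assert 0 <= mode <= MODE_MASK
    return (int(enabled) << ENABLE_BIT) | (mode << MODE_SHIFT)

def decode(value: int):
    """Unpack a register value into (enabled, mode)."""
    return bool(value & (1 << ENABLE_BIT)), (value >> MODE_SHIFT) & MODE_MASK

# Example: enable the unit in mode 2.
reg = encode(True, 2)
assert decode(reg) == (True, 2)
```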
2.3. Cache hierarchy
OpenPiton’s cache hierarchy is composed of three cache
levels. Each tile in OpenPiton contains private L1 and L1.5
caches and a slice of the distributed, shared L2 cache. The
data path of the cache hierarchy is shown in Figure 3.
The memory subsystem maintains cache coherence
using our coherence protocol, called P-Mesh. It adheres to
the memory consistency model used by the OpenSPARC T1.
Coherence messages between the L1.5 and L2 caches
travel over three NoCs, carefully designed to
ensure deadlock-free operation.
L1 caches. The L1 caches are reused from the OpenSPARC
T1 design with extensions for configurability. They are composed of separate L1 instruction and L1 data caches, both
of which are write-through and 4-way set-associative. By
default, the L1 data cache is an 8KB cache and its line size
is 16 bytes. The 16KB L1 instruction cache has a 32-byte line size.
L1.5 data cache. The L1.5 (comparable to L2 caches in
other processors) both transduces the OpenSPARC T1’s
CCX protocol to P-Mesh’s NoC coherence packet formats,
and acts as a write-back layer, caching stores from the write-through L1 data cache. Its parameters match the L1 data
cache by default.
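For concreteness, the default L1 data cache geometry above (8KB, 4-way, 16-byte lines, which the L1.5 matches by default) implies 128 sets. The following Python sketch shows the resulting address decomposition; the RTL's exact bit slicing may differ.

```python
# Index/tag decomposition for the default 8KB, 4-way, 16-byte-line
# L1 data cache described above. A sketch of the arithmetic, not the
# actual RTL bit slicing.

SIZE, WAYS, LINE = 8 * 1024, 4, 16
SETS = SIZE // (WAYS * LINE)          # 8192 / 64 = 128 sets

OFFSET_BITS = LINE.bit_length() - 1   # 4 bits of byte offset
INDEX_BITS = SETS.bit_length() - 1    # 7 bits of set index

def split(addr: int):
    """Return (tag, set_index, byte_offset) for a physical address."""
    offset = addr & (LINE - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# Round-trip check: the three fields reassemble into the address.
tag, idx, off = split(0x12345)
assert (tag << (OFFSET_BITS + INDEX_BITS)) | (idx << OFFSET_BITS) | off == 0x12345
```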
The L1.5 communicates to and from the core through
the CCX bus, preserved from the OpenSPARC T1. When a
memory request results in a miss, the L1.5 translates and
forwards it to the L2 through the NoC channels. Generally,
the L1.5 issues requests on NoC1, receives data on NoC2,
and writes back modified cache lines on NoC3, as shown in Figure 3.
The L1.5 is inclusive of the L1 data cache; each can
be independently sized with independent eviction policies. For space and performance, the L1.5 does not cache
instructions; these cache lines are bypassed directly to the
L2 cache.
L2 cache. The L2 cache (comparable to a last-level L3
cache in other processors) is a distributed, write-back
cache shared by all tiles. The default cache configuration is
64KB per tile and 4-way set associativity, but both the cache
size and associativity are configurable. The cache line size
is 64 bytes, larger than the line sizes of caches lower in the
hierarchy. The integrated directory cache has 64 bits per
entry, so it can precisely keep track of up to 64 sharers by
default.
The L2 cache is inclusive of the private caches (L1 and
L1.5). Cache line way mapping between the L1.5 and the L2
is independent and is entirely subject to the replacement
policy of each cache. Since the L2 is distributed, cache lines
consecutively mapped in the L1.5 are likely to be distributed
across multiple L2 tiles (L2 tile referring to a portion of the
distributed L2 cache in a single tile).
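The 64-bit directory entry and the line-interleaved distribution described above can be sketched as follows. The home-tile function is a plausible line interleaving assumed for illustration, not necessarily the hash OpenPiton's RTL uses.

```python
# Sketch of the L2's full-map directory entry: one bit per sharer,
# 64 bits per entry as described above. The home-tile mapping is an
# assumed line interleaving, not necessarily OpenPiton's actual hash.

LINE_SIZE = 64
MAX_SHARERS = 64

def home_tile(addr: int, num_tiles: int) -> int:
    # Consecutive cache lines map to consecutive L2 tiles (assumption).
    return (addr // LINE_SIZE) % num_tiles

def add_sharer(entry: int, tile: int) -> int:
    assert tile < MAX_SHARERS
    return entry | (1 << tile)

def remove_sharer(entry: int, tile: int) -> int:
    return entry & ~(1 << tile)

def sharers(entry: int):
    return [t for t in range(MAX_SHARERS) if entry & (1 << t)]

e = 0
e = add_sharer(e, 3)
e = add_sharer(e, 17)
assert sharers(e) == [3, 17]       # each sharer tracked precisely
e = remove_sharer(e, 3)
assert sharers(e) == [17]
# Adjacent lines land on different L2 tiles under this interleaving.
assert home_tile(0x0, 25) != home_tile(0x40, 25)
```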
The L2 is the point of coherence for all cacheable memory
requests. All cacheable memory operations (including atomic
operations such as compare-and-swap) are ordered, and the
L2 strictly follows this order when servicing requests. The L2
also keeps the instruction and data caches coherent, per the
OpenSPARC T1’s original design. When a line is present in
a core’s L1 instruction cache and is loaded as data, the L2
sends invalidations to the relevant instruction caches before
servicing the load.
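A simplified behavioral model of that ordering follows, with the NoC handshakes abstracted into a callback; the function and argument names are hypothetical.

```python
# Sketch of the L2 ordering described above: before servicing a data
# load to a line that some instruction caches hold, the L2 invalidates
# those I-cache copies. Simplified model; the real handshakes go over
# the NoCs and the names here are illustrative.

def service_data_load(line, icache_sharers, send_inval):
    # Invalidate every I-cache copy first...
    for tile in sorted(icache_sharers):
        send_inval(tile, line)
    icache_sharers.clear()
    # ...then service the load, preserving the L2's strict ordering.
    return ("data", line)

log = []
sharers = {1, 4}
result = service_data_load(0x80, sharers, lambda t, l: log.append((t, l)))
assert log == [(1, 0x80), (4, 0x80)]   # invalidations precede the reply
assert result == ("data", 0x80)
assert sharers == set()                # no stale I-cache copies remain
```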
2.4. P-Mesh network on-chip
There are three P-Mesh NoCs in an OpenPiton chip. The
NoCs provide communication between the tiles for cache
coherence, I/O, memory traffic, and inter-core interrupts.
They also route traffic destined for off-chip to the chip
bridge. The packet format contains 29 bits of core addressability, making it scalable up to 500 million cores.
To ensure deadlock-free operation, the L1.5 cache, L2
cache, and memory controller give different priorities to
different NoC channels; NoC3 has the highest priority, next
is NoC2, and NoC1 has the lowest priority. Thus, NoC3 will
never be blocked. In addition, all hardware components
are designed such that consuming a high priority packet is
never dependent on lower priority traffic.
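The priority rule can be sketched as an endpoint that always drains higher-priority channels first; this is a behavioral model of the policy, not the actual router logic.

```python
# Behavioral sketch of the fixed NoC priority described above: an
# endpoint always drains NoC3 before NoC2 before NoC1, so a NoC3
# packet never waits on lower-priority traffic. Queues are plain FIFOs.

from collections import deque

class Endpoint:
    def __init__(self):
        # Index 0 = NoC1 (lowest priority) ... index 2 = NoC3 (highest).
        self.channels = [deque(), deque(), deque()]

    def enqueue(self, noc: int, pkt):
        self.channels[noc - 1].append(pkt)

    def next_packet(self):
        # Consume in strict priority order: NoC3, then NoC2, then NoC1.
        for ch in reversed(self.channels):
            if ch:
                return ch.popleft()
        return None

ep = Endpoint()
ep.enqueue(1, "req")
ep.enqueue(3, "writeback")
ep.enqueue(2, "data")
assert ep.next_packet() == "writeback"  # NoC3 first
assert ep.next_packet() == "data"       # then NoC2
assert ep.next_packet() == "req"        # NoC1 last
```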
Classes of coherence operations are mapped to NoCs
based on the following rules, as depicted in Figure 3:
• NoC1 messages are initiated by requests from the private cache (L1.5) to the shared cache (L2).
• NoC2 messages are initiated by the shared cache (L2) to
the private cache (L1.5) or memory controller.
• NoC3 messages are responses from the private cache
(L1.5) or memory controller to the shared cache (L2).
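The three rules above can be summarized as a lookup from message class to NoC; the message-type names below are illustrative, not OpenPiton's exact mnemonics.

```python
# Sketch mapping the coherence message classes above to their NoCs.
# Message-type names are illustrative, not OpenPiton's actual mnemonics.

def noc_for(msg_type: str) -> int:
    routing = {
        "load_req": 1,    # L1.5 -> L2 request
        "store_req": 1,
        "inval_req": 2,   # L2 -> L1.5 (or memory controller)
        "data_fill": 2,   # L1.5 receives data on NoC2
        "inval_ack": 3,   # L1.5/memory controller -> L2 response
        "writeback": 3,
    }
    return routing[msg_type]

assert noc_for("load_req") == 1
assert noc_for("data_fill") == 2
assert noc_for("writeback") == 3
```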