4.1. Internal research case studies
Execution Drafting. Execution Drafting11 (ExecD) is an energy-saving microarchitectural technique for multi-threaded processors that leverages duplicate computation across threads. ExecD takes over the thread selection decision from the OpenSPARC T1’s default thread selection policy and instruments the front-end to achieve energy savings. ExecD required modifications to the OpenSPARC T1 core and thus was not as simple as plugging a standalone module into the OpenPiton system: the core microarchitecture needed to be understood, and the implementation tightly integrated with the core. Implementing ExecD in
OpenPiton revealed several implementation details that
had been abstracted away in simulation, such as tricky
divergence conditions in the thread synchronization
mechanisms. This reiterates the importance of taking
research designs to implementation in an infrastructure
like OpenPiton.
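As a rough illustration of the drafting idea, the following sketch is a behavioral toy model under our own assumptions, not the ExecD hardware or its OpenSPARC T1 integration: it simply counts how many front-end fetches are needed in a cycle when duplicate threads are synchronized versus diverged.

```python
# Toy model of the Execution Drafting intuition: when threads running duplicate
# programs sit at the same PC, one front-end fetch/decode can be shared by all
# of them; once they diverge, each thread needs its own fetch again.
# Thread PCs and the notion of a "cycle" here are illustrative assumptions.

def frontend_fetches(thread_pcs):
    """Fetches needed this cycle with drafting vs. without."""
    baseline = len(thread_pcs)                      # one fetch per thread
    drafted = 1 if len(set(thread_pcs)) == 1 else baseline
    return drafted, baseline

# Two duplicate threads in lockstep: one fetch serves both.
print(frontend_fetches([0x40, 0x40]))   # (1, 2)
# The threads have diverged: no sharing until they re-synchronize.
print(frontend_fetches([0x40, 0x44]))   # (2, 2)
```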
ExecD must be enabled by an ExecD-aware operating
system. Our public Linux kernel and OpenPiton hypervisor
repositories contain patches intended to add support for
ExecD. These patches were developed as part of a single-semester undergraduate OS research project.
Coherence Domain Restriction. Coherence Domain Restriction8 (CDR) is a novel cache coherence framework designed to enable large-scale shared memory with low storage and energy overhead. CDR restricts cache coherence of an application or page to a subset of cores, rather than keeping global coherence over potentially millions of cores. To implement it in OpenPiton, the TLB is extended with extra fields, and both the L1.5 and L2 caches are modified to fit CDR into the existing cache coherence protocol. CDR is specifically designed for large-scale shared memory systems such as OpenPiton. In fact, OpenPiton’s million-core scalability is not feasible without CDR, because directory storage overhead otherwise grows with the total core count.
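To make the storage argument concrete, the sketch below uses invented TLB fields and a purely illustrative home-core hash, not OpenPiton’s actual encodings: restricting a page to a small coherence domain keeps sharer tracking proportional to the domain size rather than the total core count.

```python
# Illustrative sketch of Coherence Domain Restriction's storage effect.
# Field names and the home-core hash are assumptions for illustration;
# they are not OpenPiton's actual TLB or directory formats.
from dataclasses import dataclass

@dataclass
class TlbEntry:
    vpn: int
    ppn: int
    domain_base: int   # first core of this page's coherence domain (assumed field)
    domain_size: int   # number of cores allowed to cache this page (assumed field)

def home_core(entry: TlbEntry, phys_addr: int) -> int:
    """Pick the directory (home) core inside the restricted domain,
    instead of hashing across the whole machine."""
    return entry.domain_base + ((phys_addr >> 6) % entry.domain_size)

def sharer_bits_per_line(total_cores, domain_size):
    """Full-map sharer-vector bits per cache line: global vs. domain-restricted."""
    return total_cores, domain_size

entry = TlbEntry(vpn=0x1234, ppn=0xabcd, domain_base=64, domain_size=16)
print(home_core(entry, 0x8000_0040))          # always lands in cores 64..79
print(sharer_bits_per_line(1_000_000, 16))    # (1000000, 16) bits per line
```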
Memory Inter-arrival Time Traffic Shaper. The Memory Inter-arrival Time Traffic Shaper23 (MITTS) enables a manycore system or an IaaS cloud system to provision memory bandwidth in the form of a memory request inter-arrival time distribution on a per-core or per-application basis. A runtime system configures MITTS knobs to optimize different metrics (e.g., throughput, fairness). MITTS sits at the egress of the L1.5 cache, monitoring memory requests and stalling the L1.5 when it uses bandwidth outside its allocated distribution. MITTS has been integrated with OpenPiton and works at per-core granularity, though it could easily be modified to operate per-thread.
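The behavioral sketch below captures this shaping policy in simplified form; the bin boundaries, credit counts, replenish period, and credit-spending rule are simplifications of our own, not the MITTS hardware.

```python
# Simplified behavioral model of MITTS-style traffic shaping at the L1.5 egress.
# Bin edges, credit counts, the replenish period, and the credit-spending rule
# are illustrative simplifications; the real MITTS credit machinery is richer.

class MittsShaper:
    def __init__(self, bin_edges, credits, replenish_period):
        self.bin_edges = bin_edges            # upper bound (in cycles) of each bin
        self.allocation = list(credits)       # credits allotted to each bin
        self.credits = list(credits)
        self.replenish_period = replenish_period
        self.last_request = None
        self.next_replenish = replenish_period

    def _bin_index(self, gap):
        for i, edge in enumerate(self.bin_edges):
            if gap <= edge:
                return i
        return len(self.bin_edges) - 1        # the largest gaps fall in the last bin

    def allow(self, now):
        """True if a memory request issued at cycle `now` fits the allocated
        inter-arrival distribution; False models MITTS stalling the L1.5."""
        if now >= self.next_replenish:
            self.credits = list(self.allocation)
            self.next_replenish = now + self.replenish_period
        if self.last_request is None:
            self.last_request = now
            return True
        idx = self._bin_index(now - self.last_request)
        if self.credits[idx] > 0:
            self.credits[idx] -= 1
            self.last_request = now
            return True
        return False                          # outside the allocation: stall

# A core allowed mostly long gaps must stall on back-to-back requests.
shaper = MittsShaper(bin_edges=[10, 100, 1000], credits=[1, 2, 8],
                     replenish_period=10_000)
print([shaper.allow(t) for t in (0, 5, 6, 500)])   # [True, True, False, True]
```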
MITTS must also be supported by the OS. Our public Linux kernel and OpenPiton hypervisor repositories contain patches for supporting the MITTS hardware. With these patches, developed as an undergraduate thesis project, Linux processes can be assigned memory inter-arrival time distributions, as they would be in an IaaS environment where the customer paid for a particular distribution corresponding to their application’s behavior. The OS configures the MITTS bins to correspond to each process’s allocated distribution, and MITTS enforces the distribution accordingly.
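A minimal sketch of that OS-side step might look as follows; the helper below is hypothetical and is not taken from the actual kernel or hypervisor patches.

```python
# Hypothetical OS-side helper (not from the actual Linux/hypervisor patches):
# turn the inter-arrival time distribution a process has been assigned into the
# per-bin credit counts the OS would program into that core's MITTS bins.

def credits_from_distribution(fractions, requests_per_period):
    """fractions: share of requests expected in each inter-arrival bin (sums to 1)."""
    assert abs(sum(fractions) - 1.0) < 1e-9, "distribution must sum to 1"
    return [round(f * requests_per_period) for f in fractions]

# A latency-tolerant process might purchase mostly long inter-arrival times.
print(credits_from_distribution([0.1, 0.3, 0.6], requests_per_period=100))
# -> [10, 30, 60]
```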
4.2. External research use
A number of external researchers have already made considerable use of OpenPiton. In a CAD context, Lerner et al.10 present a development workflow for improving processor lifetime, based on OpenPiton and the gem5 simulator, which improves the design’s reliability time by 4.1×.
OpenPiton has also been used in a security context as a
testbed for hardware trojan detection. OpenPiton’s FPGA
emulation enabled Elnaggar et al.5 to boot full-stack Debian
Linux and extract performance counter information while
running SPEC benchmarks. This project moved quickly
from adopting OpenPiton to an accepted publication in a
matter of months, thanks in part to the full-stack OpenPiton
system that can be emulated on FPGA.
Oblivious RAM (ORAM)7 is a memory controller designed
to eliminate memory side channels. An ORAM controller
was integrated into the 25-core Piton processor, providing
the opportunity for secure access to off-chip DRAM. The controller was directly connected to OpenPiton’s NoC, making
the integration straightforward: only a handful of files were needed to wrap an existing ORAM implementation, and once connected, the integration was verified in simulation
using the OpenPiton test suite.
4.3. Educational use
We have been using OpenPiton in coursework at Princeton, in
particular our senior undergraduate Computer Architecture
and graduate Parallel Computation classes. A few of the
resulting student projects are described here.
Core replacement. Internally, we have tested replacing the OpenSPARC T1 core with two other open source cores. These modifications replaced the CCX interface to the L1.5 cache with shims that translate to the L1.5’s interface signals. These shims require very
little logic but provide the cores with fully cache-coherent
memory access through P-Mesh. We are using these cores
to investigate manycore processors with heterogeneous
ISAs.
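The sketch below gives a rough functional sense of what such a shim does; all field names and encodings are invented for illustration, and the real shims are hardware modules driving the L1.5’s interface.

```python
# Rough functional sketch of a core-to-L1.5 shim: translate a core-native memory
# request into the fields an L1.5-style transaction interface expects, so a new
# core gains coherent access through P-Mesh without touching the cache hierarchy.
# All field names and encodings here are invented for illustration.

REQ_TYPES = {"load": 0, "store": 1, "ifill": 2}   # assumed encoding

def core_req_to_l15(core_req):
    """core_req: dict in a hypothetical core-native request format."""
    return {
        "rqtype":   REQ_TYPES[core_req["op"]],
        "address":  core_req["addr"],
        "size":     core_req["bytes"],
        "data":     core_req.get("wdata", 0),
        "threadid": core_req.get("tid", 0),
    }

print(core_req_to_l15({"op": "load", "addr": 0x8000_1000, "bytes": 8}))
```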
Multichip network topology exploration. A senior undergraduate thesis project investigated the impact of interchip
network topologies for large manycore processors. Figure 7
shows multiple FPGAs connected over a high-speed serial
interface, carrying standard P-Mesh packets at 9 gigabits
per second. The student developed a configurable P-Mesh
router for this project, which is now integrated as a standard
OpenPiton component.
MIAOW. A student project integrated the MIAOW
open source GPU2 with OpenPiton. An OpenPiton core
and a MIAOW core can both fit onto a VC707 FPGA
with the OpenPiton core acting as a host, in place of
the Microblaze that was used in the original MIAOW
release. The students added MIAOW to the chipset
crossbar with a single entry in its XML configuration.
Once they implemented a native P-Mesh interface to
replace the original AXI-Lite interface, MIAOW could