contributed articles
DOI: 10.1145/1610252.1610271
Easing the programmer’s burden does not
compromise system performance or increase
the complexity of hardware implementation.
By Josep Torrellas, Luis Ceze, James Tuck,
Calin Cascaval, Pablo Montesinos, Wonsun Ahn,
and Milos Prvulovic
The Bulk Multicore Architecture for Improved Programmability
Multicore chips as commodity architecture
for platforms ranging from handhelds to
supercomputers herald an era when parallel
programming and computing will be the norm.
While the computer science and engineering
community has periodically focused on advancing
the technology for parallel processing,8 this time
around the stakes are truly high, since there is
no obvious route to higher performance other
than through parallelism. However, for parallel
computing to become widespread, breakthroughs
are needed in all layers of the computing stack,
including languages, programming models,
compilation and runtime software, programming
and debugging tools, and hardware architectures.
At the hardware-architecture layer, we need to
change the way multicore architectures are designed.
In the past, architectures were designed primarily for performance or
for energy efficiency. Looking ahead,
one of the top priorities must be for
the architecture to enable a programmable environment. In practice, programmability is a notoriously difficult
metric to define and measure. At the
hardware-architecture level, programmability implies two things: First, the
architecture is able to attain high efficiency while relieving the programmer from having to manage low-level
tasks; second, the architecture helps
minimize the chance of (parallel) programming errors.
In this article, we describe a
novel, general-purpose multicore
architecture—the Bulk Multicore—
we designed to enable a highly programmable environment. In it, the
programmer and runtime system
are relieved of having to manage the
sharing of data thanks to novel support for scalable hardware cache coherence. Moreover, to help minimize
the chance of parallel-programming
errors, the Bulk Multicore provides
software with high-performance sequential memory consistency and also
introduces several novel hardware
primitives. These primitives can be
used to build a sophisticated program-development-and-debugging environment, including low-overhead data-race detection, deterministic replay
of parallel programs, and high-speed
disambiguation of sets of addresses.
The primitives have an overhead low
enough to always be “on” during production runs.
The key idea in the Bulk Multicore is twofold: First, the hardware
automatically executes all software
as a series of atomic blocks of thousands of dynamic instructions called
Chunks. Chunk execution is invisible
to the software and, therefore, puts no
restriction on the programming language or model. Second, the Bulk Multicore introduces the use of Hardware
Address Signatures as a low-overhead
mechanism to ensure atomic and isolated execution of chunks and help