Models of Parallelism
Parallelism in software comes in a variety of forms and granularities. The most important form for most CPUs is ILP (instruction-level parallelism). Superscalar architectures are specifically designed to take advantage of ILP. They translate the architectural instruction encodings into something more akin to a static single-assignment form (ironically, the compiler spends a lot of effort translating from such a form into a finite-register encoding) so they can identify independent instructions and dispatch them in parallel.
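To make the dependence constraint concrete, here is a minimal C sketch (the function names and the four-way split are illustrative assumptions, not from the article). Both loops perform the same additions, but the first forms a single dependency chain, while the second exposes four independent chains that a superscalar core can dispatch in parallel:

    #include <stddef.h>

    /* One long dependency chain: each addition needs the previous
       result, so the additions cannot be issued in parallel no
       matter how wide the core is. */
    double dependent_sum(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Four independent chains (assumes, for brevity, that n is a
       multiple of 4): the four additions in each iteration do not
       depend on one another, so a superscalar core can dispatch
       them in the same cycle. */
    double independent_sums(const double *a, size_t n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }

Hardware register renaming can remove false dependencies between iterations, but it cannot break the true data dependency through s; only restructuring the code (or a compiler operating under relaxed floating-point rules) exposes the extra ILP.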
Although the theoretical limit for ILP can be very high (as much as 150 independent instructions [8]), the amount that is practical to extract is significantly lower. VLIW (very long instruction word) processors simplify their logic by making it the compiler's job to identify ILP and bundle instructions.

The problem with the VLIW approach is that ILP is a dynamic property. Recall that the RISC I heuristic was that branches occur, on average, every seven instructions. This means the compiler can, at most, provide seven-way ILP, because it cannot identify ILP that spans basic blocks: if the compiler statically knew the target of a branch, then it would most likely not insert a branch.
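A hypothetical sketch (ignoring if-conversion and predication) of why static scheduling stops at the branch:

    /* Static ILP stops at basic-block boundaries. A VLIW compiler
       can bundle the independent operations inside each block, but
       it cannot bundle operations from the two sides of the branch
       into one static schedule, because which side runs is known
       only at runtime. */
    int scale_or_diff(int x, int y, int flag) {
        int a = x + 1;      /* block 1: a and b are independent and */
        int b = y * 3;      /*   could share a single VLIW bundle   */
        if (flag)           /* the branch ends the basic block      */
            return a + b;   /* block 2 */
        return a - b;       /* block 3: never bundled with block 2  */
    }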
VLIW processors have been successful as DSPs (digital signal processors) and GPUs, but not in the kind of code for which “general-purpose” processors are optimized, in spite of several attempts by Intel (i860, Itanium). This means that statically predictable ILP is relatively low in the code these processors are expected to run. Superscalar processors do not have the same limitation because, in conjunction with a branch predictor, they can extract ILP from dynamic flow control spanning basic blocks and even across functions.
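As a small, hypothetical illustration of that dynamic advantage (the function name is assumed):

    #include <stddef.h>

    /* The branch below depends on the data, so no compiler can
       schedule around it statically. On a superscalar, out-of-order
       core, though, the branch predictor learns it almost perfectly
       when data[] is sorted, and the core speculates past it,
       overlapping many iterations; shuffled data defeats the
       predictor and serializes execution. */
    long sum_large_values(const int *data, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++)
            if (data[i] >= 128)     /* data-dependent branch */
                total += data[i];
        return total;
    }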
It is worth noting that, in spite of occupying four times the die area and consuming four times the power, clock-for-clock the ARM Cortex-A15 (three-issue, superscalar, out-of-order) achieves only 75%–100% more performance than the (two-issue, in-order) Cortex-A7, despite being (theoretically) able to exploit far more ILP.
The other implicit assumption with regard to parallelism in most CPUs is that communication among parallel threads of execution is both rare and implicit. The latter attribute comes from the C shared-everything model of concurrency, where the only way for threads to communicate is by modifying some shared state. A large amount of cache-coherency logic is required to make this efficient, yet it is relevant only for shared-memory concurrency.
Implementing message passing (as embodied both by early languages such as Occam and Actor Smalltalk, and by newer ones such as Erlang and Go) on such processors typically involves algorithms that bounce cache-line ownership between cores and generate large amounts of bus traffic.
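Here is a minimal sketch of that effect, assuming C11 <threads.h> and <stdatomic.h> (the mailbox layout and names are illustrative, not any particular library's API):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    struct mailbox {
        _Alignas(64) atomic_int full; /* flag and payload share one
                                         cache line */
        int payload;
    };

    static struct mailbox box;

    /* Producer: writing the payload and setting the flag pulls the
       cache line into this core's cache in exclusive state,
       invalidating the consumer's copy. */
    int producer(void *arg) {
        (void)arg;
        for (int i = 1; i <= 1000; i++) {
            while (atomic_load_explicit(&box.full, memory_order_acquire))
                ;                           /* wait until drained */
            box.payload = i;
            atomic_store_explicit(&box.full, 1, memory_order_release);
        }
        return 0;
    }

    /* Consumer: reading the payload and clearing the flag pulls the
       line straight back, so ownership ping-pongs once per message. */
    int consumer(void *arg) {
        long sum = 0;
        (void)arg;
        for (int i = 0; i < 1000; i++) {
            while (!atomic_load_explicit(&box.full, memory_order_acquire))
                ;                           /* spin on the line */
            sum += box.payload;
            atomic_store_explicit(&box.full, 0, memory_order_release);
        }
        printf("sum = %ld\n", sum);
        return 0;
    }

    int main(void) {
        thrd_t p, c;
        thrd_create(&p, producer, NULL);
        thrd_create(&c, consumer, NULL);
        thrd_join(p, NULL);
        thrd_join(c, NULL);
        return 0;
    }

Each handoff moves the cache line holding full and payload from one core's cache to the other's, so the coherence protocol carries a round trip of bus traffic per message; padding or batching can reduce, but not eliminate, the ownership bouncing.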
The general-purpose processors of today are highly specialized, designed for running applications compiled from low-level C-like languages. The code they are optimized for is virtualized using time-division multiplexing, contains mostly predictable branches roughly every seven instructions, and exhibits a high degree of locality of reference and a low degree of fine-grained parallelism. Although this describes a lot of programs, it is by no means exhaustive.
Because processors optimized for these cases have been the cheapest processing elements that consumers can buy, many algorithms have been coerced into exhibiting some of these properties. With the appearance of cheap programmable GPUs, this has started to change, and naturally data-parallel algorithms are increasingly run on GPUs.
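To see what “naturally data parallel” means, consider the textbook SAXPY kernel (shown here as plain C purely for illustration): every iteration is independent, so the loop maps directly onto thousands of GPU threads, one element per thread:

    /* SAXPY: y = a*x + y. No iteration reads anything another
       iteration writes, so all n elements can be computed at once;
       this is the shape of computation GPUs are built for. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }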
Now that GPUs are cheaper per FLOPS (floating-point operations per second) than CPUs, the trend is increasingly toward coercing algorithms that are not naturally data parallel to run on GPUs.

The problem of dark silicon (the portion of a chip that must be left unpowered) means it is going to be increasingly viable to have lots of different cores on the same die, as long as most of them are not constantly powered. Efficient designs in such a world will require admitting that there is no one-size-fits-all processor design and that there is a large spectrum, with different trade-offs at different points.

References
1. Baumann, A., Barham, P., Dagand, P.-E., Harris, T., Isaacs, R., Peter, S., Roscoe, T., Schüpbach, A., and Singhania, A. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (2009), 29–44.
2. Denning, P.J. The working-set model for program behavior. Commun. ACM 11, 5 (May 1968), 323–333.
3. Gschwind, M., Hofstee, H.P., Flachs, B., Hopkins, M., Watanabe, Y., and Yamazaki, T. Synergistic processing in Cell's multicore architecture. IEEE Micro 26, 2 (2006), 10–24.
4. Kim, J.M., Seo, S.K., and Chung, S.W. Looking into heterogeneity: When simple is faster. In Proceedings of the 2nd International Workshop on Parallelism in Mobile Platforms (2014).
5. Patterson, D.A. and Séquin, C.H. RISC I: A reduced instruction set VLSI computer. In Proceedings of the 8th Annual Symposium on Computer Architecture (1981), 443–457.
6. Popek, G.J. and Goldberg, R.P. Formal requirements for virtualizable third-generation architectures. Commun. ACM 17, 7 (July 1974), 412–421.
7. SPARC International Inc. The SPARC Architecture Manual: Version 8. Prentice-Hall, Upper Saddle River, NJ, 1992.
8. Wall, D.W. Limits of instruction-level parallelism. Technical report, Digital Equipment Corporation Western Research Laboratory, 1993.

David Chisnall is a researcher at the University of Cambridge, where he works on programming-language design and implementation. He spent several years consulting, during which time he also wrote books on Xen and the Objective-C and Go programming languages.