development that parallel programming is difficult. This would come as
a surprise to Alan Kay, who was able
to teach an actor-model language to
young children, with which they wrote
working programs with more than 200
threads. It comes as a surprise to Erlang programmers, who commonly
write programs with thousands of parallel components. It is more accurate
to say that parallel programming in
a language with a C-like abstract machine is difficult, and given the prevalence of parallel hardware, from multicore CPUs to many-core GPUs, that is
just another way of saying that C doesn’t
map to modern hardware very well.
Related articles on queue.acm.org
The Challenge of Cross-Language Interoperability
Finding More than One Worm in the Apple
Coding for the Code, by Friedrich Steimann and Thomas Kühne
1. C Defect Report 260, 2004; http://www.open-std.org/
2. Chadwick, G.A. Communication centric, multi-core, fine-grained processor architecture. Technical Report 832, University of Cambridge, Computer Laboratory.
3. Memarian, K., Matthiesen, J., Lingard, J., Nienhuis, K., Chisnall, D., Watson, R.N.M. and Sewell, P. Into the depths of C: Elaborating the de facto standards. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2016, 1–15; http://dl.acm.org/authorize?N04455.
4. Ou, A., Nguyen, Q., Lee, Y. and Asanović, K. A case for MVPs: Mixed-precision vector processors. In Proceedings of the 2nd Intern’l Workshop on Parallelism in Mobile Platforms at the 41st Intern’l Symposium on Computer Architecture, 2014.
5. Perlis, A. Epigrams on programming. ACM SIGPLAN Notices 17, 9 (1982).
David Chisnall is a researcher at the University of
Cambridge, where he works on programming language
design and implementation. He has authored books on Xen
and the Objective-C and Go programming languages, and
contributes to the LLVM, Clang, FreeBSD, GNUstep, and
Étoilé open source projects.
Approved for public release; distribution is unlimited.
Sponsored by the Defense Advanced Research Projects
Agency (DARPA) and the Air Force Research Laboratory
(AFRL) under contract FA8750-10-C-0237 (“CTSRD”)
as part of the DARPA CRASH research program. The
views, opinions, and/or findings contained in this report
are those of the authors and should not be interpreted
as representing the official views or policies, either
expressed or implied, of the Department of Defense or
the U.S. government.
Copyright held by owner/author.
Publication rights licensed to ACM. $15.00.
ample, security vulnerabilities have
been observed from signed integer
overflow (undefined behavior in C) and
from code that dereferenced a pointer before checking it against null. The early dereference told the compiler that the pointer could not be null, because dereferencing a null pointer is undefined behavior in C and can therefore be assumed never to happen, so the compiler removed the subsequent check (CVE-2009-1897).
In light of such issues, it is difficult to argue that a programmer can
be expected to understand exactly
how a C program will map to an underlying architecture.
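A minimal C sketch of the pattern behind that CVE may make it concrete; the struct and function names here are hypothetical, not the actual kernel code:

```c
#include <stddef.h>

/* Because the dereference below is undefined behavior when p is
 * NULL, the compiler may conclude that p cannot be NULL and
 * silently delete the later check, turning a defensive test into
 * dead code. */
struct packet { size_t len; };

size_t packet_len(struct packet *p) {
    size_t len = p->len;   /* dereference happens first */
    if (p == NULL)         /* compiler is entitled to remove this */
        return 0;
    return len;
}
```

Calling the function with a valid pointer is well defined; the danger is that the null path the programmer wrote has quietly ceased to exist.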
Imagining a Non-C Processor
The proposed fixes for Spectre and
Meltdown impose significant performance penalties, largely offsetting the
advances in microarchitecture in the
past decade. Perhaps it is time to stop
trying to make C code fast and instead
think about what programming models would look like on a processor designed to be fast.
For inspiration, we can look at a number of designs that have not focused on traditional C code. For example, highly multithreaded chips, such as Sun/Oracle’s
UltraSPARC Tx series, do not require
as much cache to keep their execution
units full. Research processors2 have
extended this concept to very large
numbers of hardware-scheduled
threads. The key idea behind these
designs is that with enough high-level parallelism, you can suspend
the threads that are waiting for data
from memory and fill your execution
units with instructions from others. The problem with such designs is that C programs tend to have few busy threads.
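The scheduling idea can be sketched as a toy model in C (all names here are hypothetical): each cycle, the core issues an instruction from the next hardware thread that is not waiting on memory, so a cache miss stalls one thread rather than the pipeline.

```c
#include <stddef.h>

#define NTHREADS 4

/* One hardware thread context in a barrel-style core. */
struct hwthread {
    int stall_cycles;  /* cycles left waiting on memory; 0 = runnable */
    int issued;        /* instructions issued so far */
};

/* Issue one instruction round-robin from the next runnable thread.
 * Returns the thread chosen, or -1 for a pipeline bubble (every
 * thread is waiting on memory). */
int issue_cycle(struct hwthread t[], int nthreads, int *rr) {
    for (int tried = 0; tried < nthreads; tried++) {
        int i = (*rr + tried) % nthreads;
        if (t[i].stall_cycles == 0) {
            t[i].issued++;
            *rr = (i + 1) % nthreads;
            return i;
        }
    }
    return -1;
}
```

With enough runnable threads, `issue_cycle` never returns -1, which is exactly why these designs need less cache: latency is hidden by other threads' work instead of by data already being on chip.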
ARM’s Scalable Vector Extension (SVE)—and similar work from Berkeley4—provides another glimpse at a
better interface between program
and hardware. Conventional vector
units expose fixed-sized vector opera-
tions and expect the compiler to try
to map the algorithm to the available
unit size. In contrast, the SVE inter-
face expects the programmer to de-
scribe the degree of parallelism avail-
able and relies on the hardware to
map it down to the available number
of execution units. Using this from C
is complex, because the autovector-
izer must infer the available parallel-
ism from loop structures. Generating
code for it from a functional-style map operation is trivial: the length of the mapped array is the degree of available parallelism.
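A rough C sketch of that contract follows; `hw_vector_lanes()` stands in for the hardware's reported vector length, and the names and fixed lane count are assumptions for illustration, not the SVE API:

```c
#include <stddef.h>

/* Stand-in for asking the hardware how many lanes it has (SVE does
 * this with dedicated instructions); fixed here so the sketch runs
 * anywhere. */
static size_t hw_vector_lanes(void) { return 8; }

/* A map over an array: the array length expresses the degree of
 * available parallelism, and the loop strip-mines it down to
 * whatever the hardware provides, predicating the final partial
 * vector. */
void map_add(float *dst, const float *a, const float *b, size_t n) {
    size_t vl = hw_vector_lanes();
    for (size_t i = 0; i < n; i += vl) {
        size_t active = (n - i < vl) ? (n - i) : vl;  /* tail predicate */
        for (size_t j = 0; j < active; j++)  /* one "vector" operation */
            dst[i + j] = a[i + j] + b[i + j];
    }
}
```

The same binary runs unchanged on a machine with 2 lanes or 512: the parallelism is described, not compiled in.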
Caches are large, but their size isn’t
the only reason for their complexity.
The cache coherency protocol is one of
the hardest parts of a modern CPU to
make both fast and correct. Most of
the complexity involved comes from
supporting a language in which data
is expected to be both shared and mu-
table as a matter of course. Consider
in contrast an Erlang-style abstract machine, where every object is either thread-local or immutable (Erlang simplifies this even further, with only one mutable object per thread). A cache coherency protocol for such a system would need only two cases: thread-local mutable data, which generates no coherency traffic, and shared immutable data, which never needs invalidating.
A software thread being migrated to
a different processor would need its
cache explicitly invalidated, but that
is a relatively uncommon operation.
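A toy model may make the two-case discipline concrete; everything here (type, field, and function names) is hypothetical:

```c
#include <stdbool.h>

/* An object is either mutable and owned by exactly one thread
 * (never visible to others, so never coherent) or immutable and
 * shared (cached copies can never go stale). */
enum obj_state { THREAD_LOCAL_MUTABLE, SHARED_IMMUTABLE };

struct obj {
    enum obj_state state;
    int owner;   /* owning thread while mutable */
    int value;
};

/* A write is legal only from the owning thread while mutable. */
bool try_write(struct obj *o, int thread, int v) {
    if (o->state != THREAD_LOCAL_MUTABLE || o->owner != thread)
        return false;
    o->value = v;
    return true;
}

/* Publishing freezes the object: after this no write can succeed,
 * so any cached copy stays valid forever. */
void publish(struct obj *o) { o->state = SHARED_IMMUTABLE; }
```

Every invalidation case a MESI-style protocol must handle in hardware is ruled out here by construction, before the data ever reaches a cache.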
Immutable objects can simplify
caches even more, as well as making
several operations even cheaper. Sun
Labs’ Project Maxwell noted that the
objects in the cache and the objects
that would be allocated in a young
generation are almost the same set.
If objects are dead before they need
to be evicted from the cache, then
never writing them back to main
memory can save a lot of power.
Project Maxwell proposed a young-generation garbage collector (and
allocator) that would run in the
cache and allow memory to be recycled quickly. With immutable
objects on the heap and a mutable
stack, a garbage collector becomes
a very simple state machine that is
trivial to implement in hardware and
allows for more efficient use of a relatively small cache.
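A bump-pointer nursery of roughly cache size can sketch the idea; the sizes and names here are illustrative assumptions, not Project Maxwell's actual design, and evacuating survivors to an old generation is elided:

```c
#include <stddef.h>

/* Allocation is a pointer bump inside a small arena. When most
 * objects die before the arena fills, a minor collection resets it
 * and the memory is recycled in place, so dead objects are never
 * written back to main memory. */
#define NURSERY_BYTES (32 * 1024)   /* roughly an L1 data cache */

static unsigned char nursery[NURSERY_BYTES];
static size_t bump;

void *nursery_alloc(size_t n) {
    n = (n + 7u) & ~(size_t)7;      /* keep 8-byte alignment */
    if (bump + n > NURSERY_BYTES)
        return NULL;                /* would trigger a minor collection */
    void *p = nursery + bump;
    bump += n;
    return p;
}

/* A minor collection that finds no survivors recycles the whole
 * arena with a single store. */
void nursery_reset(void) { bump = 0; }
```

The allocator's state is one pointer and one bound, which is why such a collector is plausible as a small hardware state machine.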
A processor designed purely for
speed, not for a compromise between
speed and C support, would likely support large numbers of threads, have
wide vector units, and have a much
simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would be unlikely to be a commercial success.
There is a common myth in software