and the need for a new programming
model (or language).
Before trying to answer these questions, we need to discuss the eternal
issue of productivity of the programmer vs. performance of the generated
software. The common wisdom used to
be that many aspects of the hardware
needed to be hidden from the programmer to increase productivity. Writing
in Python makes you more productive
than writing in C, which is more productive than writing in assembly, right?
The answer is not that easy, because
many Python routines, for example, are
just C wrappers. With the proliferation of heterogeneous machines, performance programmers will create more and more libraries for use by productivity programmers. Even productivity
programmers, however, need to make
some hard decisions: how to decompose the application into threads (or
processes) suitable for the hardware at
hand (this may require experimenting
with different algorithms), and which
parts of the program do not require
high performance and can be executed
in lower-power-consumption mode (for
example, parts that require I/O).
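To make the decomposition point concrete, here is a minimal sketch in C (the arrays, the file name, and the choice of OpenMP are illustrative assumptions, not taken from the text): the performance-critical loop is fanned out across whatever cores are available, while the latency-tolerant I/O stays on a single thread that the runtime or operating system can keep in a lower-power mode.

/*
 * Minimal sketch (hypothetical example): the compute-heavy part is
 * decomposed across the available cores, while the I/O part does not
 * need high performance and can run on one slow or low-power core.
 */
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

static void compute_part(void)
{
    /* performance-critical: spread the work across the cores at hand */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i] + c[i];
}

static void io_part(const char *path)
{
    /* does not need high performance: one thread is enough, and it can
       be scheduled on a low-power core without hurting overall speed */
    FILE *f = fopen(path, "wb");
    if (f) {
        fwrite(c, sizeof(double), N, f);
        fclose(f);
    }
}

int main(void)
{
    compute_part();
    io_part("result.bin");      /* hypothetical output file */
    return 0;
}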
Defining the measures of success
poses a number of challenges for
both productivity and performance
programmers. What are the measures of success of a program written
for a heterogeneous machine? Many
of these measures have characteristics in common with those of traditional parallel code for homogeneous
machines. The first, of course, is performance. How much speedup do you
get relative to the sequential version
and relative to the parallel version running on a homogeneous machine?
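Stated as formulas (these are the standard definitions of speedup; the notation is ours, not from the text):

\[
S_{\text{seq}} = \frac{T_{\text{sequential}}}{T_{\text{heterogeneous}}},
\qquad
S_{\text{hom}} = \frac{T_{\text{parallel, homogeneous}}}{T_{\text{heterogeneous}}}
\]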
The second measure is scalability.
Does your program scale as more cores
are added? Scalability in heterogeneous computing is more complicated
than in the homogeneous case. For the
latter, you just add more of the same.
For heterogeneous machines, you have
more options: adding more cores of
some type, or more GPUs, or maybe FPGAs. How does the program behave in
each case?
The third measure of success is reliability. As transistors get smaller, they
become more susceptible to faults, both
transient and permanent. Do you leave
the handling of faults to the hardware or the system software, or should the programmer have some say? Each
strategy has its pros and cons. On the
one hand, if it is left to the hardware or
the system software, the programmer
will be more productive. On the other
hand, the programmer is better informed than the system about how to achieve graceful degradation in performance if the number of cores decreases as a result of failure, or if a thread produces the wrong result because of a transient fault. The programmer can have, for example, two versions of the same subroutine: one to be executed on a GPU and the other on several traditional cores.
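A minimal sketch of that two-version idea in C (the function names, the fallback policy, and the use of OpenMP for the CPU path are assumptions; the GPU path is a stub standing in for an OpenCL or CUDA kernel launch):

/*
 * Two versions of one subroutine: a GPU path and a multicore CPU path,
 * with the programmer deciding how to degrade gracefully at run time.
 */
#include <stdio.h>

/* CPU version: runs on several traditional cores. */
static void scale_cpu(float *x, int n, float s)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] *= s;
}

/* GPU version: stub standing in for a real kernel launch. Returns 0 on
   success, nonzero if no device is found, the launch fails, or a sanity
   check on the result does not pass. */
static int scale_gpu(float *x, int n, float s)
{
    (void)x; (void)n; (void)s;
    return -1;               /* pretend the GPU path is unavailable */
}

/* The dispatcher is where the programmer, not the system, encodes the
   graceful-degradation strategy. */
static void scale(float *x, int n, float s)
{
    if (scale_gpu(x, n, s) != 0) {
        fprintf(stderr, "GPU path failed; falling back to CPU cores\n");
        scale_cpu(x, n, s);
    }
}

int main(void)
{
    enum { N = 8 };
    float x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    scale(x, N, 2.0f);
    for (int i = 0; i < N; i++)
        printf("%g ", x[i]);
    printf("\n");
    return 0;
}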
Portability is another issue. If you
are writing a niche program for a well-defined machine, then the first three
measures are enough. But if you are
writing a program for public use on
many different heterogeneous computing machines, then you need to ensure
portability. What happens if your code
runs on a machine with an FPGA instead of a GPU, for example? This scenario is not unlikely in the near future.
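One way the portability question shows up in practice: with a runtime such as OpenCL, the program can ask at run time what the machine actually contains and choose a target accordingly. A minimal discovery sketch in C follows (error handling is trimmed; it assumes an OpenCL implementation is installed and that an FPGA board, if present, is exposed as an OpenCL accelerator device, as FPGA vendors' OpenCL toolchains typically do):

/* Enumerate the devices of the first OpenCL platform and report
   whether each is a CPU, a GPU, or an accelerator (e.g., an FPGA). */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint ndev = 0;

    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS)
        return 1;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &ndev);

    for (cl_uint i = 0; i < ndev; i++) {
        char name[256];
        cl_device_type type;
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof name, name, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_TYPE, sizeof type, &type, NULL);
        printf("%s: %s\n", name,
               (type & CL_DEVICE_TYPE_GPU)         ? "GPU" :
               (type & CL_DEVICE_TYPE_ACCELERATOR) ? "accelerator (e.g., FPGA)" :
               (type & CL_DEVICE_TYPE_CPU)         ? "CPU" : "other");
    }
    return 0;
}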
The Best Strategy
Given these questions and considerations, what is the best strategy?
Should we introduce new programming models (and languages), or
should we fix/update current ones?
Psychology has something to say. The
more choices a person has, the better—until some threshold is reached.
Beyond that, people become overwhelmed and will stick to whatever
language they are using. But we have to
be very careful about fixing a language.
Perl used to be called a “write-only language.” We don’t want to fall into the
same trap. Deciding which language to fix or modify is very difficult, and a wrong choice would come at a very
high cost. For heterogeneous computing, OpenCL (Open Computing Language) seems like a good candidate
for shared-memory machines, but it
must be more user-friendly. How about
distributed memory? Is MPI (Message
Passing Interface) good enough? Do
any of the currently available languages/paradigms consider reliability as a
measure of success?
The best scheme seems to be twofold: new paradigms invented and tested in academia, while the filtering happens in industry. How does the filtering happen? It happens when an inflection point occurs in the computing world. Examples of two previous inflection points are moving from single
core to multicore and the rise of GPUs.
We are currently witnessing a couple of
inflection points at the same time: getting close to exascale computing and the rise of the Internet of Things. Heterogeneous computing is the enabling
technology for both.
Heterogeneous computing is already
here, and it will stay. Making the best
use of it will require revisiting the whole
computing stack. At the algorithmic
level, keep in mind that computation
is now much cheaper than memory access and data movement. Programming
models need to deal with productivity vs. performance. Compilers need to
learn to use heterogeneous nodes. They
have a long way to go, because compilers are not yet as mature in the parallel-computing arena in general as they are
in sequential programming. Operating
systems must learn new tricks. Computer architects need to decide which nodes
to put together to get the most effective
machines, how to design the memory
hierarchy, and how best to connect all
these modules. At the circuit level and
the process technology level, we have a
long wish list of reliability, power, compatibility, and cost. There is plenty of low-hanging fruit at all levels of the computing stack, ripe for the picking if we can get past the thorns.
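As one small algorithmic-level example of computation being cheaper than data movement (the arrays and operations below are made up for illustration): fusing two passes over a large array into one trades a second sweep through memory for a little extra arithmetic that stays in registers.

/* Sketch of trading memory traffic for (cheap) arithmetic. */
#include <stdlib.h>

/* Two passes: the intermediate array y is written to memory and then
   read back, doubling the traffic over the data. */
static void two_pass(const float *x, float *y, float *z, size_t n)
{
    for (size_t i = 0; i < n; i++) y[i] = 2.0f * x[i];
    for (size_t i = 0; i < n; i++) z[i] = y[i] + 1.0f;
}

/* One fused pass: the intermediate value never leaves a register, so
   the extra multiply-add costs far less than the second sweep over
   memory it replaces. */
static void fused(const float *x, float *z, size_t n)
{
    for (size_t i = 0; i < n; i++) z[i] = 2.0f * x[i] + 1.0f;
}

int main(void)
{
    size_t n = 1u << 24;                 /* roughly 16M elements */
    float *x = malloc(n * sizeof *x);
    float *y = malloc(n * sizeof *y);
    float *z = malloc(n * sizeof *z);
    if (!x || !y || !z) return 1;
    for (size_t i = 0; i < n; i++) x[i] = (float)i;

    two_pass(x, y, z, n);
    fused(x, z, n);

    free(x); free(y); free(z);
    return 0;
}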
Related articles
on queue.acm.org
Computing without Processors
Satnam Singh
http://queue.acm.org/detail.cfm?id=2000516
FPGA Programming for the Masses
David F. Bacon, Rodric Rabbah, and Sunil Shukla
http://queue.acm.org/detail.cfm?id=2443836
A Conversation with John Hennessy and
David Patterson
http://queue.acm.org/detail.cfm?id=1189286
Mohamed Zahran is a clinical associate professor of
computer science at New York University. His research
interests span several areas of computer architecture
and hardware/software interaction.