late 1980s and early 1990s, and incrementally improved over the course of
the decade.
The upshot of these trends was that
by the end of the 1990s, concurrent systems had displaced their uniprocessor
forebears as high-performance computers. When the Top500 list of supercomputers was first drawn up in 1993,
the highest-performing uniprocessor
in the world was just #34, with over 80%
of the Top500 being multiprocessors of
one flavor or another. By 1997, uniprocessors were off the list entirely. Beyond
the supercomputing world, many transaction-oriented applications scaled
with CPU, allowing users to realize the
dream of expanding a system without
revisiting architecture.
The rise of concurrent systems in the
1990s coincided with another trend:
while CPU clock rate continued to increase, the speed of main memory was
not keeping up. To cope with this relatively slower memory, microprocessor
architects incorporated deeper (and
more complicated) pipelines, caches,
and prediction units. Even then, the
clock rates themselves were quickly becoming something of a fib: while the
CPU might be able to execute at the advertised rate, only a slim fraction of code
could actually achieve (let alone surpass)
the rate of one cycle per instruction—
most code was mired, spending three,
four, five (or many more) cycles per instruction. Many saw these two trends—
the rise of concurrency and the futility of
increasing clock rate—and came to the
logical conclusion: instead of spending
transistor budget on “faster” CPUs that
weren’t actually yielding much in terms
of performance gains (and had terrible
cost in terms of power, heat, and area),
why not take advantage of the rise of
concurrent software and use transistors
to effect multiple (simpler) cores per
die? That it was the success of concurrent software that contributed to the
genesis of chip multiprocessing is an incredibly important historical point, and
bears reemphasis: there is a perception
that microprocessor architects have—
out of malice, cowardice, or despair—
inflicted concurrency on software.7 In
reality, the opposite is the case: it was
the maturity of concurrent software that
led architects to consider concurrency
on the die. (Readers are referred to one
of the earliest chip multiprocessors—
DEC’s Piranha—for a detailed discussion of this motivation.1) Were software
not ready, these microprocessors would
not be commercially viable today. So if
anything, the “free lunch” that some decry as being “over” is in fact, at long last,
being served—one need only be hungry
and know how to eat!
Concurrency Is for Performance
The most important conclusion from
our foray into the history of concurrency
is that concurrency has always been employed for one purpose: to improve the
performance of the system. This seems
almost too obvious to make explicit.
Why else would we want concurrency
if not to improve performance? And yet
for all its obviousness, concurrency’s
raison d’être is seemingly forgotten, as
if the proliferation of concurrent hardware has awakened an anxiety that all
software must use all available physical
resources. Just as no programmer felt a
moral obligation to eliminate pipeline
stalls on a superscalar microprocessor,
no software engineer should feel responsible for using concurrency simply
because the hardware supports it. Rather, concurrency should be considered
and used for one reason only: because
it is needed to yield an acceptably performing system.
There are three fundamental ways
in which concurrent execution can improve performance: to reduce latency
(that is, to make a unit of work execute
faster); to hide latency (that is, to allow
the system to continue doing work during a long-latency operation); or to increase throughput (that is, to make the
system able to perform more work).
Using concurrency to reduce latency
is highly problem-specific in that it requires a parallel algorithm for the task
at hand. For some kinds of problems
—especially those found in scientific
computing—this is straightforward:
work can be divided a priori, and multiple compute elements set on the task.
But many of these problems are so parallelizable that they do not require the
tight coupling of a shared memory—
and they can often execute more economically on grids of small machines than on a smaller number of
highly concurrent ones. Further, using
concurrency to reduce latency requires
that a unit of work be long enough in its
execution to amortize the substantial
costs of coordinating multiple compute elements: one can envision using
concurrency to parallelize a sort of 40
million elements—but a sort of a mere
40 elements is unlikely to take enough
compute time to pay the overhead of
parallelism. In short, the degree to which one
can use concurrency to reduce latency
depends much more on the problem
than on those endeavoring to solve it—and
many important problems are simply
not amenable to it.
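As a concrete illustration of this point, consider a minimal sketch (not drawn from the article) of using concurrency to reduce latency: a large summation is divided a priori among a handful of POSIX threads, each summing its own slice before the partial results are combined. The thread count, element count, and the trivial summation task are illustrative assumptions only; for a small array, the cost of creating and joining the threads would dwarf any gain.

/*
 * Sketch: reducing latency by dividing work a priori among threads.
 * Compile with -lpthread.  All sizes here are arbitrary assumptions.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4
#define NELEMS   40000000UL

typedef struct {
	const int *base;	/* start of this thread's slice */
	size_t nelems;		/* number of elements in the slice */
	long long sum;		/* partial result written by the thread */
} chunk_t;

static void *
sum_chunk(void *arg)
{
	chunk_t *c = arg;
	long long s = 0;

	for (size_t i = 0; i < c->nelems; i++)
		s += c->base[i];

	c->sum = s;
	return (NULL);
}

int
main(void)
{
	int *data = malloc(NELEMS * sizeof (int));
	pthread_t tids[NTHREADS];
	chunk_t chunks[NTHREADS];
	long long total = 0;

	if (data == NULL)
		return (1);

	for (size_t i = 0; i < NELEMS; i++)
		data[i] = 1;

	size_t per = NELEMS / NTHREADS;

	for (int t = 0; t < NTHREADS; t++) {
		chunks[t].base = data + t * per;
		chunks[t].nelems = (t == NTHREADS - 1) ?
		    NELEMS - t * per : per;
		(void) pthread_create(&tids[t], NULL, sum_chunk, &chunks[t]);
	}

	/* Join the threads and combine the partial sums sequentially. */
	for (int t = 0; t < NTHREADS; t++) {
		(void) pthread_join(tids[t], NULL);
		total += chunks[t].sum;
	}

	printf("sum = %lld\n", total);
	free(data);
	return (0);
}

The same structure applied to 40 elements would spend far more time in pthread_create() and pthread_join() than in the loop itself—which is precisely the amortization argument made above.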
For long-running operations that
cannot be parallelized, concurrent execution can instead be used to perform
useful work while the operation is pending. In this model, the latency of the
operation is not reduced, but it is
hidden by the progression of the system.
Using concurrency to hide latency is
particularly tempting when the operations themselves are likely to block on
entities outside of the program—for
example, a disk I/O operation or a DNS
lookup. Tempting though it may be, one
must be very careful when considering
using concurrency to merely hide latency: having a parallel program can become a substantial complexity burden
to bear just for improved responsiveness. Further, concurrent execution is
not the only way to hide system-induced
latencies: one can often achieve the
same effect by employing non-blocking
operations (for example, asynchronous
I/O) and an event loop (for example, the
poll()/select() calls found in Unix)
in an otherwise sequential program.
Programs that wish to hide latency
should therefore consider concurrent
execution as an option, not as a foregone conclusion.
When problems resist parallelization or have no appreciable latency to
hide, there is a third way to use concurrent execution to improve performance:
concurrency may also be used to increase the throughput of the system.
That is, instead of using parallel logic to
make a single operation faster, one can
employ multiple concurrent executions
of sequential logic to be able to accommodate more simultaneous work. Importantly, a system using concurrency
to increase throughput need not consist
exclusively (or even largely) of multithreaded code. Rather, those components of the system that share no state
can be left entirely sequential, with the
system executing multiple instances