That it was the success of concurrent software that
contributed to the genesis of chip multiprocessing is an
incredibly important historical point and bears reemphasis. There is a perception that microprocessor architects
have—out of malice, cowardice, or despair—inflicted
concurrency on software. 3 In reality, the opposite is the
case: it was the maturity of concurrent software that
led architects to consider concurrency on the die. (The
reader is referred to one of the earliest chip multiprocessors—DEC’s Piranha—for a detailed discussion of this
motivation. 4) Were software not ready, these microprocessors would not be commercially viable today. If anything,
the “free lunch” that some decry as being over is in fact,
at long last, being served. One need only be hungry and
know how to eat!
CONCURRENC Y IS FOR PERFORMANCE
The most important conclusion from this foray into history is that concurrency has always been employed for
one purpose: to improve the performance of the system.
This seems almost too obvious to make explicit—why else
would we want concurrency if not to improve performance?—yet for all its obviousness, concurrency’s raison
d’être is increasingly forgotten, as if the proliferation of
concurrent hardware has awakened an anxiety that all
software must use all available physical resources. Just as
no programmer felt a moral obligation to eliminate pipeline stalls on a superscalar microprocessor, no software
engineer should feel responsible for using concurrency
simply because the hardware supports it. Rather, concurrency should be thought about and used for one reason
and one reason only: because it is needs to yield an
acceptably performing system.
Concurrent execution can improve performance in
three fundamental ways: it can reduce latency (that is,
make a unit of work execute faster); it can hide latency
(that is, allow the system to continue doing work during
a long-latency operation); or it can increase throughput
(that is, make the system able to perform more work).
Using concurrency to reduce latency is highly prob-lem-specific in that it requires a parallel algorithm for
the task at hand. For some kinds of problems—especially
those found in scientific computing—this is straightforward: work can be divided a priori, and multiple compute elements set on the task. Many of these problems,
however, are often so parallelizable that they do not
require the tight coupling of a shared memory—and they
are often able to execute more economically on grids of
small machines instead of a smaller number of highly
concurrent ones. Further, using concurrency to reduce
latency requires that a unit of work be long enough in its
execution to amortize the substantial costs of coordinating multiple compute elements: one can envision using
concurrency to parallelize a sort of 40 million elements—
but a sort of a mere 40 elements is unlikely to take
enough compute time to pay the overhead of parallelism.
In short, the degree to which one can use concurrency to
reduce latency depends much more on the problem than
on those endeavoring to solve it—and many important
problems are simply not amenable to it.
For long-running operations that cannot be parallelized, concurrent execution can instead be used to perform
useful work while the operation is pending; in this model,
the latency of the operation is not reduced, but it is
hidden by the progression of the system. Using concurrency
to hide latency is particularly tempting when the operations themselves are likely to block on entities outside
of the program—for example, a disk I/O operation or a
DNS lookup. Tempting though it may be, one must be
very careful when considering using concurrency merely
to hide latency: having a parallel program can become a
substantial complexity burden to bear just for improved
responsiveness. Further, concurrent execution is not the
only way to hide system-induced latencies: one can often
achieve the same effect by employing nonblocking operations (e.g., asynchronous I/O) and an event loop (e.g.,
the poll()/select() calls found in Unix) in an otherwise
sequential program. Programmers who wish to hide
latency should therefore consider concurrent execution as
an option, not as a foregone conclusion.
When problems resist parallelization or have no
appreciable latency to hide, the third way that concurrent execution can improve performance is to increase
the throughput of the system. Instead of using parallel
logic to make a single operation faster, one can employ
multiple concurrent executions of sequential logic to
accommodate more simultaneous work. Importantly, a
system using concurrency to increase throughput need
not consist exclusively (or even largely) of multithreaded
code. Rather, those components of the system that
share no state can be left entirely sequential, with the
system executing multiple instances of these components concurrently. The sharing in the system can then
be offloaded to components explicitly designed around
parallel execution on shared state, which can ideally be
reduced to those elements already known to operate well
in concurrent environments: the database and/or the
operating system.