late 1980s and early 1990s, and incrementally improved over the course of the decade.

The upshot of these trends was that by the end of the 1990s, concurrent systems had displaced their uniprocessor forebears as high-performance computers. When the Top500 list of supercomputers was first drawn up in 1993, the highest-performing uniprocessor in the world was just #34, with over 80% of the Top 500 being multiprocessors of one flavor or another. By 1997, uniprocessors were off the list entirely. Beyond the supercomputing world, many trans-action-oriented applications scaled with CPU, allowing users to realize the dream of expanding a system without revisiting architecture.

The rise of concurrent systems in the 1990s coincided with another trend: while CPU clock rate continued to increase, the speed of main memory was not keeping up. To cope with this relatively slower memory, microprocessor architects incorporated deeper (and more complicated) pipelines, caches and prediction units. Even then, the clock rates themselves were quickly becoming something of a fib: while the CPU might be able to execute at the advertised rate, only a slim fraction of code could actually achieve (let alone surpass) the rate of one cycle per instruction— most code was mired spending three, four, five (or many more) cycles per instruction. Many saw these two trends— the rise of concurrency and the futility of increasing clock rate—and came to the logical conclusion: instead of spending transistor budget on “faster” CPUs that weren’t actually yielding much in terms of performance gains (and had terrible cost in terms of power, heat, and area), why not take advantage of the rise of concurrent software and use transistors to effect multiple (simpler) cores per die? That it was the success of concurrent software that contributed to the genesis of chip multiprocessing is an incredibly important historical point, and bears reemphasis: there is a perception that microprocessor architects have— out of malice, cowardice, or despair— inflicted concurrency on software. 7 In reality, the opposite is the case: it was the maturity of concurrent software that led architects to consider concurrency on the die. (Readers are referred to one of the earliest chip multiprocessors—

DEC’s Piranha—for a detailed discussion of this motivation. 1) Were software not ready, these microprocessors would not be commercially viable today. So if anything, the “free lunch” that some decry as being “over” is in fact, at long last, being served—one need only be hungry and know how to eat!

concurrency is for Performance

The most important conclusion from our foray into the history of concurrency is that concurrency has always been employed for one purpose: to improve the performance of the system. This seems almost too obvious to make explicit. Why else would we want concurrency if not to improve performance? And yet for all its obviousness, concurrency’s raison d’être is seemingly forgotten, as if the proliferation of concurrent hardware has awakened an anxiety that all software must use all available physical resources. Just as no programmer felt a moral obligation to eliminate pipeline stalls on a superscaler microprocessor, no software engineer should feel responsible for using concurrency simply because the hardware supports it. Rather, concurrency should be considered and used for one reason only: because it is needed to yield an acceptably performing system.

There are three fundamental ways in which concurrent execution can improve performance: to reduce latency (that is, to make a unit of work execute faster); to hide latency (that is, to allow the system to continue doing work during a long latency operation); or to increase throughput (that is, to make the system able to perform more work).

Using concurrency to reduce latency is highly problem-specific in that it requires a parallel algorithm for the task at hand. For some kinds of problems —especially those found in scientific computing—this is straightforward: work can be divided a priori, and multiple compute elements set on the task. But many of these problems are often so parallelizable they do not require the tight coupling of a shared memory— and they are often able to more economically execute on grids of small machines instead of a smaller number of highly concurrent ones. Further, using concurrency to reduce latency requires that a unit of work be long enough in its execution to amortize the substantial

costs of coordinating multiple compute elements: one can envision using concurrency to parallelize a sort of 40 million elements—but a sort of a mere 40 elements is unlikely to take enough compute time to pay the overhead of parallelism. In short, the degree to one can use concurrency to reduce latency depends much more on the problem than those endeavoring to solve it—and many important problems are simply not amenable to it.

For long-running operations that cannot be parallelized, concurrent execution can instead be used to perform useful work while the operation is pending. In this model, the latency of the operation is not reduced, but it is hidden by the progression of the system. Using concurrency to hide latency is particularly tempting when the operations themselves are likely to block on entities outside of the program—for example, a disk I/O operation or a DNS lookup. Tempting though it may be, one must be very careful when considering using concurrency to merely hide latency: having a parallel program can become a substantial complexity burden to bear for just improved responsiveness. Further, concurrent execution is not the only way to hide system-induced latencies: one can often achieve the same effect by employing non-blocking operations (for example, asynchronous I/O) and an event loop (for example, the poll()/select() calls found in Unix) in an otherwise sequential program. Programs that wish to hide latency should therefore consider concurrent execution as an option, not as a foregone conclusion.

When problems resist parallelization or have no appreciable latency to hide, there is a third way to use concurrent execution to improve performance: concurrency may also be used to increase the throughput of the system. That is, instead of using parallel logic to make a single operation faster, one can employ multiple concurrent executions of sequential logic to be able to accommodate more simultaneous work. Importantly, a system using concurrency to increase throughput need not consist exclusively (or even largely) of multithreaded code. Rather, those components of the system that share no state can be left entirely sequential, with the system executing multiple instances

References:

Archives