Although we assert that less—much less—code needs to be parallel than some might fear, it is nonetheless true that writing parallel code remains something of a black art. We also therefore give specific implementation techniques for developing a highly parallel system. As such, this article is somewhat dichotomous: we try both to argue that most code can (and should) achieve concurrency without explicit parallelism, and at the same time to elucidate techniques for those who must write explicitly parallel code. This article is half stern lecture on the merits of abstinence and half Kama Sutra.
Before we discuss concurrency with respect to today’s applications, it would be helpful to explore the history of concurrent execution. Even by the 1960s—when the world was still wet with the morning dew of the computer age—it was becoming clear that a single central processing unit executing a single instruction stream would result in unnecessarily limited system performance. While computer designers experimented with different ideas to circumvent this limitation, it was the introduction of the Burroughs B5000 in 1961 that proffered the idea that ultimately proved to be the way forward: disjoint CPUs concurrently executing different instruction streams but sharing a common memory. In this regard (as in many) the B5000 was at least a decade ahead of its time. It was not until the 1980s that the need for multiprocessing became clear to a wider body of researchers, who over the course of the decade explored cache coherence protocols (e.g., the Xerox Dragon and DEC Firefly), prototyped parallel operating systems (e.g., multiprocessor Unix running on the AT&T 3B20A), and developed parallel databases (e.g., Gamma at the University of Wisconsin).
In the 1990s, the seeds planted by researchers in the 1980s bore the fruit of practical systems, with many computer companies (e.g., Sun, SGI, Sequent, Pyramid) placing big bets on symmetric multiprocessing. These bets on concurrent hardware necessitated corresponding bets on concurrent software—if an operating system cannot execute in parallel, neither can much else in the system—and these companies independently came to
the realization that their operating systems would need to be rewritten around the notion of concurrent execution. These rewrites took place early in the 1990s, and the resulting systems were polished over the decade; much of the resulting technology can today be seen in open source operating systems such as OpenSolaris, FreeBSD, and Linux.
Just as several computer companies made big bets around multiprocessing, several database vendors made bets around highly parallel relational databases; upstarts including Oracle, Teradata, Tandem, Sybase, and Infor-mix needed to use concurrency to achieve a performance advantage over the mainframes that had dominated transaction processing until that time. 2 As in operating systems, this work was conceived in the late 1980s and early 1990s, and incrementally improved over the course of the decade.
The upshot of these trends was that by the end of the 1990s, concurrent systems had displaced their uniprocessor forebears as high-performance computers: when the TOP500 list of supercomputers was first drawn up in 1993, the highest-performing uniprocessor in the world was just #34, and more than 80 percent of the top 500 were multiprocessors of one flavor or another. By 1997, uniprocessors were off the list entirely. Beyond the super-computing world, many transaction-oriented applications scaled with CPU, allowing users to realize the dream of expanding a system without revisiting architecture.
The rise of concurrent systems in the 1990s coincided with another trend: while CPU clock rate continued to increase, the speed of main memory was not keeping up. To cope with this relatively slower memory, microprocessor architects incorporated deeper (and more complicated) pipelines, caches, and prediction units. Even then, the clock rates themselves were quickly becoming something of a fib: while the CPU might be able to execute at the advertised rate, only a slim fraction of code could actually achieve (let alone surpass) the rate of one cycle per instruction—most code was mired spending three, four, five (or many more) cycles per instruction.
Many saw these two trends—the rise of concurrency and the futility of increasing clock rate—and came to the logical conclusion: instead of spending transistor budget on “faster” CPUs that weren’t actually yielding much in terms of performance gains (and had terrible costs in terms of power, heat, and area), why not take advantage of the rise of concurrent software and use transistors to effect multiple (simpler) cores per die?
References:
Archives