Parallel Algorithms
The following two examples explore how these algorithms look and the opportunities
and benefits they provide to systems developers and programmers.
Example 1. Given are two variables A and B, each containing some value. The
exchange problem involves exchanging their values; for example, if the input to the
exchange problem is A = 2 and B = 5, then the output is A = 5 and B = 2. The standard
algorithm for this problem uses an auxiliary variable X and works in three steps:
X := A
A := B
B := X
In order not to overwrite A and lose its content, the content of A is first stored in X,
B is then copied to A, and finally the original content of A is copied from X to B. The
work in this algorithm is three operations, the depth is three time units, and the space
requirement (beyond input and output) is one word.
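The same three-step exchange extends to arrays: given two arrays A and B of length n, each position i can run its own exchange independently, so all n exchanges can proceed concurrently. A minimal sketch in Python (a serial loop stands in for the concurrent WD-style loop, since the n iterations are independent):

```python
def exchange_arrays(A, B):
    """Exchange the contents of arrays A and B of equal length n.

    Each position i performs the three-step exchange independently,
    so all n iterations could run concurrently: 3n work, depth 3.
    The auxiliary array X accounts for extra space of 2n words
    (X plus the n concurrent values of i)."""
    n = len(A)
    X = [None] * n
    for i in range(n):  # conceptually: "for 1 <= i <= n pardo"
        X[i] = A[i]
        A[i] = B[i]
        B[i] = X[i]
```

For example, with A = [2, 4, 6] and B = [5, 7, 9], calling exchange_arrays(A, B) leaves A = [5, 7, 9] and B = [2, 4, 6].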
This parallel algorithm (the three-step exchange applied concurrently to every position of two arrays of length n) requires 3n work, as in the serial algorithm. Its depth has
improved from 3n to 3. If the size of the array n is 1,000 words, this would constitute
a speedup by a factor of 1,000 relative to the serial algorithm. The increase in space to 2n
(for the auxiliary array X and the n concurrent values of i) demonstrates a cost of parallelism.
Example 2. Given is the directed graph with nodes representing all commercial
airports in the world. An edge connects node u to node v if there is a nonstop flight from
airport u to airport v, and s is one of these airports. The problem is to find the smallest
number of nonstop flights from s to any other airport. The work-depth (WD) algorithm works as
follows: Suppose the first i steps compute the fewest number of nonstop flights from s
to all airports that can be reached from s in at most i flights, while all other airports are
marked “unvisited.”
Step i+1 concurrently finds the destination of every outgoing flight from any airport
to which the fewest number of flights from s is exactly i, and for every such destination
still marked “unvisited” records that it requires i+1 flights from s. Note that some “unvisited” nodes may
have more than one incoming edge. In such a case, the arbitrary concurrent-read concurrent-write (CRCW) convention
implies that one of the attempting writes succeeds. While we don’t know which one, we
do know all writes would enter the number i+1; in general, however, arbitrary CRCW
also allows different values.
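The synchronous steps just described can be sketched in Python; the inner loops here serially simulate the concurrent writes of each step, and the adjacency-dictionary representation of the flight graph is an illustrative assumption:

```python
def fewest_flights(adj, s):
    """Fewest nonstop flights from airport s to every reachable airport.

    Step i+1 processes all airports whose fewest number of flights
    from s is exactly i; every still-"unvisited" destination of their
    outgoing flights is assigned i+1. All writes to a destination in
    the same step carry the same value i+1, so it does not matter
    which write "wins" under the arbitrary CRCW convention."""
    dist = {s: 0}        # airports absent from dist are "unvisited"
    frontier = [s]       # airports at distance exactly i from s
    i = 0
    while frontier:
        next_frontier = []
        for u in frontier:              # conceptually concurrent
            for v in adj.get(u, ()):
                if v not in dist:       # "unvisited": record i+1
                    dist[v] = i + 1
                    next_frontier.append(v)
        frontier = next_frontier
        i += 1
    return dist
```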
The standard serial algorithm for this problem9 is called breadth-first search,
and the parallel algorithm described earlier is basically breadth-first search with one
difference: Step i+1 described earlier allows concurrent writes. In the serial version,
breadth-first search also operates by marking all nodes whose shortest path from s
requires i+1 edges after all nodes whose shortest path from s requires i edges. The
serial version then proceeds to impose a serial order: each newly visited node is placed
in a first-in-first-out queue data structure.
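The serial version with its FIFO queue can be sketched as follows (again assuming an adjacency-dictionary representation of the flight graph):

```python
from collections import deque

def fewest_flights_serial(adj, s):
    """Serial breadth-first search: each newly visited airport enters
    a first-in-first-out queue, which imposes a serial order even on
    airports whose fewest number of flights from s is the same."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:           # still "unvisited"
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```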
Three lessons are drawn from this example: First, the serial order obstructs
the parallelism in breadth-first search; the freedom to process in any order the nodes for
which the shortest path from s has the same length is lost. Second, programmers
trained to incorporate such serial data structures into their programs acquire bad
serial habits that are difficult to uproot; it may be better to preempt the problem by teaching
parallel programming and parallel algorithms early. And third, to demonstrate the
performance advantage of the parallel algorithm over the serial algorithm, assume
that the number of edges in the graph is 600,000 (the number of nonstop flight links)
and the smallest number of flights from airport s to any other airport is no more than
five. While the serial algorithm requires 600,000 basic steps, the parallel algorithm
requires only six. Each of the six steps may require longer wall-clock time
than each of the 600,000 steps, but the factor of 600,000/6 provides leeway for speedups
by a proper architecture.
I begin by proposing the Immediate
Concurrent Execution (ICE) abstraction, followed by two contributions
I have led that support this abstraction:
XMT. A general-purpose many-core
explicit multi-threaded (XMT) computer architecture designed from the
ground up to capitalize on the on-chip
resources becoming available to support the formidable body of knowledge known as parallel random-access machine (PRAM)
algorithmics, and the latent, though
not widespread, familiarity with it;
and
Workflow. A programmer’s workflow links ICE, PRAM algorithmics,
and XMT programming. The ICE abstraction of an algorithm is followed
by a description of the algorithm for
the synchronous PRAM, allowing ease
of reasoning about correctness and
complexity, which is followed by multithreaded programming that relaxes
this synchrony for the sake of implementation. Directly reasoning about
soundness and performance of multithreaded code is generally known
to be error-prone. To circumvent the
likelihood of errors, the workflow incorporates multiple levels of abstraction; the programmer must establish
only that multithreaded program
behavior matches the synchronous
PRAM-like algorithm it implements,
a much simpler task. Current XMT
hardware and software prototypes and
demonstrated ease-of-programming
and strong speedups suggest that CS
may be much better prepared for the
challenges ahead than many of our
colleagues realize.
A notable rudimentary abstraction—that any single instruction available for execution in a serial program
executes immediately—made serial
computing simple. Abstracting away
a hierarchy of memories, each with
greater capacity but slower access
time than its predecessor, along with
different execution time for different
operations, this Immediate Serial Execution (ISE) abstraction has been used
by programmers for years to conceptualize serial computing and ensure
support by hardware and compilers. A
program provides the instruction to be
executed next at each step (inductively).
The left side of Figure 1 outlines
serial execution as implied by this ISE