Merging with a Single Spawn-Join
the merging problem takes as
input two sorted arrays A = A[ 1 . . . n]
and B = B[ 1 . . . n]. each of these 2n
elements must then be mapped into
an array C = C[ 1 . . . 2n] that is also
sorted. I first review the Shiloach-Vishkin two-step PRam algorithm
for merging, then discuss its related
XmtC programming:
Step 1. Partitioning. this step
selects some number x of elements
from A at equal distances. In the
example in the figure here, suppose
the x = 4 elements 4, 16, 20, and 27
are selected and ranked relative to
array B using x concurrent binary
searches. Similarly, x elements from
B at equal distances, say, elements 1,
7, 13, and 24, are also selected, then
ranked relative to array A using x = 4
concurrent binary searches. the step
takes O(log n) time. these ranked
elements partition the merging job
that must be completed into 2x = 8
“strips”; in the figure, step 2 includes
eight such strips.
Step 2. Actual work. for each
strip the remaining job is to merge
a subarrary of A with a subarray of
B, mapping their elements into a
subarray of C. Since these 2x merging
jobs are mutually independent, each
is able to concurrently apply the
standard linear-time serial merging
algorithm.
Consider the following
complexity analysis of this algorithm:
Since each strip has at most n/x
elements from A and n/x elements
from B, the depth (or parallel time) of
the second step is O(n/x). If x ≤ n/ log
n, the first step and the algorithm as
a whole does O(n) work. In the PRam
model, this algorithm requires O(n/x
+ log n) time. a simplistic XmtC
program requires as many spawn
(and respective join) commands
as the number of PRam steps. the
reasons I include this example here
are that it involves a way to use only
a single spawn (and a single join)
command to represent the whole
merging algorithm and, as I explain
in the Conclusion, to demonstrate
an Xmt advantage over current
hardware by comparing it with
the parallel merging algorithm in
Cormen et al. 9
main steps of the ranking/merging algorithm.
A
B
4
6
8
9
16
17
18
19
20
21
23
25
27
29
31
32
1
2
3
5
7
10
11
12
13
14
15
22
24
26
28
30
Step 1
Partitioning
Merging in XMTC. an XmtC
program spawns 2x concurrent
threads, one for each of the selected
elements in array A or B. Using binary
search, each thread first ranks its array
element relative to the other array,
then proceeds directly (without a join
operation) to merge the elements in its
strip, terminating just before setting
the merging result of another selected
element because the merging result is
computed by another thread.
to demonstrate the operation of a
thread, consider the thread of element 20.
Starting with binary search on array B the
A
B
6
8
9
2
3
5
17
18
19
10
11
12
21
23
25
14
15
22
29
31
32
26
28
30
Step 2
Actual Work
thread finds that 20 ranks as 11 in B; 11
is the index of 15 in B. Since the index of
20 in a is 9, element 20 ranks 20 in C. the
thread then compares 21 to 22 and ranks
element 21 (as 21), then compares 23 to
22 to rank 22, 23 to 24 to rank 23, and 24
to 25 but terminates since the thread of 24
ranks 24, concluding the example.
our experience is that, with little
effort, Xmt-type threading requires
fewer synchronizations than implied
by the original PRam algorithm.
the current merging example
demonstrates that synchronization
reduction is sometimes significant.
The primitive is especially useful
when several threads perform a prefix-
sum simultaneously against a com-
mon base, because multiple prefix-
sum operations can be combined by
the hardware to form a very fast multi-
operand prefix-sum operation. Be-
cause each prefix-sum is atomic, each
thread returns a different R value. This
way, the parallel prefix-sum command
can be used to implement efficient
and scalable inter-thread synchroniza-
tion by arbitrating an ordering among
the threads.
XMTC is an extension of standard C,
augmenting C with a small number
of commands (such as spawn, join,
and prefix-sum). Each parallel region is delineated by spawn and join
statements, and synchronization is
achieved through the prefix-sum and
join commands. Every thread execut-