throughput-oriented processors can be
more significant; these processors are
designed to reach a higher peak computational throughput and may have a
higher peak throughput-to-bandwidth
ratio than latency-oriented processors.
More important, they seek to tolerate
rather than avoid latency. Hiding the latency of frequent data movement to and from main memory requires either more threads or more work per thread, which generally means larger data sets.
The best performance is typically
achieved when calculation is more
common than data access. Performing
roughly 10 to 20 operations per word of
data loaded from memory is ideal, and
it may be preferable to locally recompute values rather than store frequently
needed values in external memory. Consider a simple example of computing a
moderately expensive function like sin
θ for 256 unique values of θ. Tabulating
all 256 possible values would require
little space, but accessing them from external memory would require hundreds
of cycles. In the same amount of time, a
thread could execute perhaps 50 to 100
instructions that could be used to compute the result and leave the memory
bandwidth available for other uses.
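To make the trade-off concrete, the two CUDA kernels below contrast the table-lookup approach with per-thread recomputation. The kernel names, array names, and sizes are illustrative assumptions rather than code from the article; the intent is only a sketch of the pattern.

// Lookup variant: each access to sin_table may go to off-chip memory,
// costing hundreds of cycles and consuming scarce bandwidth.
__global__ void apply_sin_lookup(const float* sin_table,
                                 const unsigned char* indices,
                                 float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sin_table[indices[i]];   // memory-bound
}

// Recompute variant: a handful of arithmetic instructions per element,
// leaving memory bandwidth free for other data.
__global__ void apply_sin_compute(const unsigned char* indices,
                                  float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float theta = indices[i] * (2.0f * 3.14159265f / 256.0f);
        out[i] = sinf(theta);             // compute-bound
    }
}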
Divide and conquer. Divide-and-conquer methods often yield effective parallel algorithms, even for apparently serial problems. Consider the merging of two sorted sequences A and B, a common problem for which most computer science students learn a sequential solution like this:
function merge1(A, B):
    if empty(A): return B
    if empty(B): return A
    if first(A) < first(B):
        return first(A) + merge1(rest(A), B)
    else:
        return first(B) + merge1(A, rest(B))
A related divide-and-conquer algorithm picks an element s from either A or B and partitions both sequences into those elements A1, B1 that are less than s and elements A2, B2 that are not less than s. Having split the input sequences, constructing the merged sequence is simply a matter of recursively merging Ai with Bi. The code for doing it would look like this:

function merge2(A, B):
    if empty(A): return B
    if empty(B): return A
    s = select_an_element(A, B)
    A1, A2 = partition(A, s)
    B1, B2 = partition(B, s)
    return merge2(A1, B1) + [s] + merge2(A2, B2)

This approach is reminiscent of quicksort, though more efficient since A and B are both sorted. If s is drawn from A, "partitioning" A is trivial, since the elements less than s are simply those preceding s, and the corresponding point at which s splits B can be found through binary search.

This divide-and-conquer method can lead to an inherently parallel algorithm by picking a sorted sequence of k splitting elements s1, ..., sk. These splitters partition both A and B into k+1 subsequences that can be merged independently like this:

function merge3(A, B):
    if empty(A): return B
    if empty(B): return A
    // Pick k elements from A and B, in sorted order
    S = [s1, ..., sk] = select_elements(A, B, k)
    [A0, ..., Ak] = partition(A, S)
    [B0, ..., Bk] = partition(B, S)
    return merge3(A0, B0) + [s1] + ... + [sk] + merge3(Ak, Bk)
[Figure 4. Double-precision throughput of SpMV strategies on an NVIDIA GeForce GTX 285 GPU; all matrices have exactly 4 x 2^20 nonzeros but differing row counts. Series: 1 thread/nonzero, 1 thread/row, 32 threads/row; y-axis: GFLOP/s; x-axis: matrix rows, 1 to 4M.]
Since each recursive merge is independent, this is an intrinsically parallel algorithm. Merge algorithms of this
form have been used in the parallel programming literature for decades14 and
can be used to build efficient merge sort
routines in CUDA.27
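To make that independence concrete, here is a hedged CUDA sketch of the final step of a merge3-style merge. It assumes the splitters have already been chosen and the partition boundaries (aBegin/aEnd, bBegin/bEnd, outBegin, all hypothetical names) have already been computed, for example by binary search against the splitters; each block then merges one (Ai, Bi) pair with no coordination with any other block. A production merge, such as the CUDA merge sort cited above, would also parallelize the work within each pair.

// One partition per block; thread 0 performs a simple sequential merge of its
// pair. Deliberately simplified to show the independence of the k+1 sub-merges,
// not to be an efficient implementation. The splitter elements themselves are
// assumed to be written to their output positions separately.
__global__ void merge_partitions(const int* A, const int* aBegin, const int* aEnd,
                                 const int* B, const int* bBegin, const int* bEnd,
                                 int* out, const int* outBegin)
{
    if (threadIdx.x != 0) return;
    int p = blockIdx.x;
    int ia = aBegin[p], ib = bBegin[p], io = outBegin[p];
    while (ia < aEnd[p] && ib < bEnd[p])
        out[io++] = (A[ia] < B[ib]) ? A[ia++] : B[ib++];
    while (ia < aEnd[p]) out[io++] = A[ia++];
    while (ib < bEnd[p]) out[io++] = B[ib++];
}
// Launched as merge_partitions<<<k + 1, 1>>>(...), one block per subsequence pair.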
Hierarchical synchronization. More
often than not, parallel threads must
synchronize with one another at various
times, but excessive synchronization
can undermine the efficiency of parallel programs. Synchronization must be
treated carefully on massively parallel
throughput-oriented processors where
thousands of threads could potentially
contend for a lock.
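As a sketch of the problem and one standard remedy, consider the CUDA global-sum kernel below (an assumed example, not one from the article). Letting every thread update a single global value would serialize thousands of threads; instead, threads combine results within their block using the fast block-level barrier, and only one thread per block touches the shared global value.

__global__ void hierarchical_sum(const float* x, int n, float* total)
{
    __shared__ float partial[256];              // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();                            // barrier among this block's threads only

    // Tree reduction in on-chip shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // One global atomic per block rather than one per thread, so thousands
    // of threads never contend for the same location.
    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);           // requires hardware float atomics
}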
To avoid unnecessary synchronization, threads should synchronize hierarchically, keeping parallel computations independent as long as possible.
When parallel programs are decomposed hierarchically, as in divide-and-conquer methods, the synchronization
of threads can also be organized in a hierarchical fashion. For example, when
evaluating merge3(A,B) earlier, each
recursive parallel merge step can pro-