Idx and the blockIdx of its block. It
increments the value of x_i by 1 if i < n, a conditional check required since n need not be a multiple of 256.
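A minimal CUDA sketch of such a kernel (the function name and the launch shown in the comment are illustrative, not taken verbatim from the article) might look like this:

    __global__ void increment_kernel(float *x, int n)
    {
        // Each thread derives its global index from its block and thread indices.
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Guard against indices past the end of the array, since n need not be
        // a multiple of the 256-thread block size.
        if (i < n)
            x[i] += 1.0f;
    }

    // Example launch: enough 256-thread blocks to cover all n elements.
    // increment_kernel<<<(n + 255) / 256, 256>>>(d_x, n);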
Throughput-Oriented Programming
Scalability is the programmer’s central
concern in designing efficient algorithms for throughput-oriented machines. Today’s architectural trends
clearly favor increasing parallelism, and
effective algorithmic techniques must
scale with hardware parallelism. Some
techniques suitable for four parallel
threads may be entirely unsuitable for
4,000 parallel threads. Running thousands of threads at a time, GPUs are a
powerful platform for exploring scalable
algorithms and a leading indicator for
algorithm design on future throughput-oriented architectures.
Abundant parallelism.
Throughput-oriented programs must expose substantial amounts of fine-grain parallelism, fulfilling the expectations of the
architecture. Exploiting multicore CPUs
obviously requires exposing parallelism as well, but a programmer’s mental
model of parallelism on a throughput-oriented processor is qualitatively different from multicore. A four-core CPU
can be fully utilized by four to eight
threads. Thread creation and scheduling are computationally heavyweight,
since they can involve the saving and
restoration of processor state and relatively expensive calls to the operating
system kernel. In contrast, a GPU typically requires thousands of threads to
cover memory latency and reach full
utilization, while thread scheduling is
essentially without cost.
Consider computing the product
y = Ax, where A is an n×n matrix and x is an
n-element vector. For sparse problems,
because the vast majority of matrix entries are 0, A is best represented using a
data structure that stores only its non-zero elements. The algorithm for sparse
matrix-vector multiplication (SpMV)
would look like this:
procedure spmv(y, A, x):
    for each row i:
        y[i] = 0
        for each non-zero column j:
            y[i] += A[i,j] * x[j]
Since each row is processed indepen-
dently, a simple CUDA implementation
would assign a separate thread to each
row. For large matrices, this could easily expose millions of threads of parallelism. However, for smaller matrices
with only a few thousand rows, this level
of parallelism might be insufficient, so
an efficient implementation could instead assign multiple threads to process
each row. In the most extreme case, each
non-zero element could be assigned to a
separate thread.
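To make the thread-per-row mapping concrete, the following is a minimal CUDA sketch assuming the matrix is stored in compressed sparse row (CSR) form; the array names (row_ptr, col_idx, values) are illustrative rather than taken from the article:

    __global__ void spmv_csr_scalar(int num_rows, const int *row_ptr,
                                    const int *col_idx, const float *values,
                                    const float *x, float *y)
    {
        // One thread computes the dot product for one row.
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows) {
            float sum = 0.0f;
            // row_ptr[row] .. row_ptr[row+1] bounds this row's stored non-zeros.
            for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; ++jj)
                sum += values[jj] * x[col_idx[jj]];
            y[row] = sum;
        }
    }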
Figure 4 plots an experiment measuring the performance of three different parallel granularities: one thread/
row, 32 threads/row, and one thread/
non-zero. 3 These tests use synthetic matrices with a constant number of entries
distributed across a variable number of
rows ranging from one row with four
million entries on the left to four million rows of one entry each on the right.
The maximal parallelism resulting from
assigning one thread per non-zero element yields the most efficient implementation when there are few rows
but suffers from lower absolute performance due to its need for inter-thread
synchronization. For intermediate row
counts, assigning 32 threads per row is
the best solution, while assigning one
thread per row is best when the number
of rows is sufficiently large.
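As a rough illustration of the intermediate granularity, a warp-per-row (32 threads/row) variant might be sketched as follows; it assumes the same CSR layout as the earlier sketch and combines the 32 partial sums with warp shuffles, one of several possible in-warp reduction mechanisms:

    __global__ void spmv_csr_warp(int num_rows, const int *row_ptr,
                                  const int *col_idx, const float *values,
                                  const float *x, float *y)
    {
        // One 32-thread warp cooperates on one row.
        // Assumes blockDim.x is a multiple of 32.
        int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
        int lane = threadIdx.x % 32;

        if (warp < num_rows) {
            float sum = 0.0f;
            // Lanes stride through the row's non-zeros 32 at a time.
            for (int jj = row_ptr[warp] + lane; jj < row_ptr[warp + 1]; jj += 32)
                sum += values[jj] * x[col_idx[jj]];

            // Reduce the 32 partial sums within the warp.
            for (int offset = 16; offset > 0; offset /= 2)
                sum += __shfl_down_sync(0xffffffff, sum, offset);

            if (lane == 0)
                y[warp] = sum;
        }
    }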
Calculation is cheap. Computation
generally costs considerably less than
memory transfers, particularly external
memory transfers that frequently require hundreds of cycles to complete.
The fact that the cost of memory access
has continued to increase and is now
quite high relative to the cost of computation is often referred to as the “memory wall.” The energy required to move
data between the chip and external
DRAM is also far higher than required to
operate an on-chip functional unit. In a
45nm process, a 64-bit integer addition
unit expends roughly 1pJ (picojoule),
and a 64-bit floating-point fused multiply-add, or FMA, unit requires around
100pJ. In contrast, reading a 64-bit value
from external DRAM requires on the order of 2,000pJ. 8
The high relative cost of accessing memory affects both latency- and
throughput-oriented processors, since
the cost is the result of the physical properties of semiconductor technology.
However, the performance consequences of external memory references for