order, combining results generated by multiple blocks
must in general be done by launching a second kernel
on a new grid of thread blocks. However, multiple thread
blocks can coordinate their work using atomic operations
on global memory (e.g., to manage a data structure).
Recursive function calls are not allowed in CUDA
kernels. Recursion is unattractive in a massively parallel kernel because providing stack space for the tens of
thousands of threads that may be active would require
substantial amounts of memory. Serial algorithms that are
normally expressed using recursion, such as quicksort, are
typically best implemented using nested data parallelism
rather than explicit recursion.
To support a heterogeneous system architecture
combining a CPU and a GPU, each with its own memory
system, CUDA programs must copy data and results
between host memory and device memory. The overhead
of CPU–GPU interaction and data transfers is minimized
by using DMA block-transfer engines and fast intercon-nects. Of course, problems large enough to need a GPU
performance boost amortize the overhead better than
small problems.
RELATED WORK
Although the first CUDA implementation targets NVIDIA
GPUs, the CUDA abstractions are general and useful
for programming multicore CPUs and scalable parallel
systems. Coarse-grained thread blocks map naturally to
separate processor cores, while fine-grained threads map
to multiple-thread contexts, vector operations, and pipelined loops in each core. Stratton et al. have developed
a prototype source-to-source translation framework that
compiles CUDA programs for multicore CPUs by mapping a thread block to loops within a single CPU thread.
They found that CUDA kernels compiled in this way
perform and scale well. 4
CUDA uses parallel kernels similar to recent GPGPU
programming models, but differs by providing flexible thread creation, thread blocks, shared memory,
global memory, and explicit synchronization. Streaming languages apply parallel kernels to data records
from a stream. Applying a stream kernel to one record
is analogous to executing a single CUDA kernel thread,
but stream programs do not allow dependencies among
kernel threads, and kernels communicate only via FIFO
(first-in, first-out) streams. Brook for GPUs differentiates
between FIFO input/output streams and random-access
gather streams, and it supports parallel reductions. Brook
is a good fit for earlier-generation GPUs with random
access texture units and raster pixel operation units. 5
Pthreads and Java provide fork-join parallelism but
are not particularly convenient for data-parallel applications. OpenMP targets shared memory architectures with
parallel execution constructs, including “parallel for”
and teams of coarse-grained threads. Intel’s C++ Threading Building Blocks provide similar features for multicore
CPUs. MPI targets distributed memory systems and uses
message passing rather than shared memory.
CUDA APPLICATION EXPERIENCE
The CUDA programming model extends the C language
with a small number of additional parallel abstractions.
Programmers who are comfortable developing in C can
quickly begin writing CUDA programs.
In the relatively short period since the introduction
of CUDA, a number of real-world parallel application
codes have been developed using the CUDA model.
These include FHD-spiral MRI reconstruction, 6 molecular
dynamics, 7 and n-body astrophysics simulation. 8 Running
on Tesla-architecture GPUs, these applications were able
to achieve substantial speedups over alternative implementations running on serial CPUs: the MRI reconstruction was 263 times faster; the molecular dynamics code
was 10–100 times faster; and the n-body simulation was
50–250 times faster. These large speedups are a result of
the highly parallel nature of the Tesla architecture and its
high memory bandwidth.
C ompressed Sparse Row (CSR) Matrix
a . sample matrix A
3010
0000
0241
1001
b . CSR representation of matrix
row 0 row 2 row 3
Av[ 7]={ 31 2 4 1 1 1}
Aj[ 7]={02 1 2 3 0 3}
Ap[ 5]={ 0 2 2 5 7 }
FIG 4