izations of SPMD, because each kernel call dynamically
creates a new grid with the right number of thread blocks
and threads for that application step. The programmer can use a convenient degree of parallelism for each
kernel, rather than having to design all phases of the
computation to use the same number of threads.
Figure 3 shows an example of a SPMD-like CUDA
code sequence. It first instantiates kernelF on a 2D grid
of 3× 2 blocks where each 2D thread block consists of 5× 3
threads. It then instantiates kernelG on a 1D grid of four
1D thread blocks with
six threads each. Because
kernelG depends on the
results of kernelF, they are
separated by an inter-kernel synchronization
barrier.
The concurrent threads
of a thread block express
fine-grained data and
thread parallelism. The
independent thread blocks
of a grid express coarse-grained data parallelism.
Independent grids express
coarse-grained task parallelism. A kernel is simply C
code for one thread of the
hierarchy.
hardware management and scheduling of threads and
thread blocks.
Task parallelism can be expressed at the thread-block
level, but blockwide barriers are not well suited for
supporting task parallelism among threads in a block.
To enable CUDA programs to run on any number of
processors, communication between thread blocks within
the same kernel grid is not allowed—they must execute
independently. Since CUDA requires that thread blocks
be independent and allows blocks to be executed in any
K ernel, Barrier, Kernel Sequence Kernel Sequence
k ernelF 2D grid is 3 x 2 thread blocks;
s equence e ach block is 5 x 3 threads
block 0,0 block 1,0 block 2,0
kernelF <<< ( 3, 2),( 5, 3)>>> (params);
block 0, 1 block 1, 1 block 1, 2
block 1, 1
thread 0,0 thread 1,0 thread 2,0 thread 3,0 thread 4,0
thread 0, 1 thread 1, 1 thread 2, 1 thread 3, 1 thread 4, 1
RESTRICTIONS
When developing CUDA
programs, it is important
to understand the ways in
which the CUDA model
is restricted, largely for
reasons of efficiency.
Threads and thread blocks
may be created only by
invoking a parallel kernel,
not from within a parallel
kernel. Together with the
required independence of
thread blocks, this makes it
possible to execute CUDA
programs with a simple
scheduler that introduces
minimal runtime overhead. In fact, the Tesla
architecture implements
thread 0, 2 thread 1, 2 thread 2, 2 thread 3, 2 thread 4, 2
inter-kernel synchronization barrier
kernelG 1D grid is 4 thread blocks;
each block is 6 threads
block 0 block 1 block 2
block 3
kernelG <<< 4, 6 >>> (params);
thread 0 thread 1
block 2
thread 2 thread 3
thread 4 thread 5
FIG 3