izations of SPMD, because each kernel call dynamically creates a new grid with the right number of thread blocks and threads for that application step. The programmer can use a convenient degree of parallelism for each kernel, rather than having to design all phases of the computation to use the same number of threads.

Figure 3 shows an example of a SPMD-like CUDA code sequence. It first instantiates kernelF on a 2D grid of 3× 2 blocks where each 2D thread block consists of 5× 3 threads. It then instantiates kernelG on a 1D grid of four 1D thread blocks with six threads each. Because kernelG depends on the results of kernelF, they are separated by an inter-kernel synchronization barrier.

The concurrent threads of a thread block express fine-grained data and thread parallelism. The independent thread blocks of a grid express coarse-grained data parallelism. Independent grids express coarse-grained task parallelism. A kernel is simply C code for one thread of the hierarchy.

hardware management and scheduling of threads and thread blocks.

Task parallelism can be expressed at the thread-block level, but blockwide barriers are not well suited for supporting task parallelism among threads in a block. To enable CUDA programs to run on any number of processors, communication between thread blocks within the same kernel grid is not allowed—they must execute independently. Since CUDA requires that thread blocks be independent and allows blocks to be executed in any

K ernel, Barrier, Kernel Sequence Kernel Sequence k ernelF 2D grid is 3 x 2 thread blocks; s equence e ach block is 5 x 3 threads block 0,0 block 1,0 block 2,0

kernelF <<< ( 3, 2),( 5, 3)>>> (params);

block 0, 1 block 1, 1 block 1, 2

block 1, 1
thread 0,0 thread 1,0 thread 2,0 thread 3,0 thread 4,0

thread 0, 1 thread 1, 1 thread 2, 1 thread 3, 1 thread 4, 1

RESTRICTIONS When developing CUDA programs, it is important to understand the ways in which the CUDA model is restricted, largely for reasons of efficiency. Threads and thread blocks may be created only by invoking a parallel kernel, not from within a parallel kernel. Together with the required independence of thread blocks, this makes it possible to execute CUDA programs with a simple scheduler that introduces minimal runtime overhead. In fact, the Tesla architecture implements

thread 0, 2 thread 1, 2 thread 2, 2 thread 3, 2 thread 4, 2

inter-kernel synchronization barrier

kernelG 1D grid is 4 thread blocks;

each block is 6 threads block 0 block 1 block 2

block 3

kernelG <<< 4, 6 >>> (params);

thread 0 thread 1

block 2
thread 2 thread 3

thread 4 thread 5

FIG 3

References:

http://www.acmqueue.com

Archives