finer pieces that can be solved cooperatively in parallel. The programming model scales transparently to large numbers of processor cores: a compiled CUDA program executes on any number of processors, and only the runtime system needs to know the physical processor count.
CUDA is a minimal extension of the C and C++ programming languages. The programmer writes a serial program that calls parallel kernels, which may be simple functions or full programs. A kernel executes in parallel across a set of parallel threads. The programmer organizes these threads into a hierarchy of grids of thread blocks. A thread block is a set of concurrent threads that can cooperate among themselves through barrier synchronization and shared access to a memory space private to the block. A grid is a set of thread blocks that may each be executed independently and thus may execute in parallel.
When invoking a kernel, the programmer specifies the number of threads per block and the number of blocks
SIMT (single-instruction, multiple-thread).[3] The SM maps each thread to one SP scalar core, and each scalar thread executes independently with its own instruction address and register state. The SM SIMT unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. (This term originates from weaving, the first parallel thread technology.) Individual threads composing a
Continued on the next page
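The thread hierarchy described above (a grid of independent thread blocks, each block cooperating through barrier synchronization and block-private shared memory) can be illustrated with a minimal sketch. The kernel name `blockSum`, the host helper `blockSums`, and the block size of 128 threads are assumptions chosen for illustration, not taken from the text; error checking of the CUDA runtime calls is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed block size: a power of two (needed by the reduction loop below).
constexpr int kThreadsPerBlock = 128;

// Each thread block sums its own slice of `in` and writes one partial
// sum to out[blockIdx.x]. Blocks never communicate with each other,
// so the grid's blocks may run in any order or in parallel.
__global__ void blockSum(const float* in, float* out, int n) {
    // Memory space private to the block, shared by its threads.
    __shared__ float partial[kThreadsPerBlock];

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // barrier: all loads are visible to the whole block

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();  // barrier between reduction steps
    }
    if (threadIdx.x == 0) {
        out[blockIdx.x] = partial[0];
    }
}

// Hypothetical host helper: the serial program that calls the kernel.
std::vector<float> blockSums(const std::vector<float>& hostIn) {
    int n = static_cast<int>(hostIn.size());
    if (n == 0) return {};
    int numBlocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, numBlocks * sizeof(float));
    cudaMemcpy(dIn, hostIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation: number of blocks, then threads per block.
    blockSum<<<numBlocks, kThreadsPerBlock>>>(dIn, dOut, n);

    std::vector<float> hostOut(numBlocks);
    cudaMemcpy(hostOut.data(), dOut, numBlocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hostOut;
}
```

Calling `blockSums` on 300 elements launches three blocks, the last of which is only partly filled; the bounds check `i < n` pads it with zeros. With 32-thread warps, each 128-thread block in this sketch corresponds to four warps. The launch configuration is chosen at run time on the host, and nothing in the kernel depends on how many processors execute the blocks.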
[Figure: GPU block diagram. Labeled blocks: system memory; GPU; host interface; input assemble; vertex work distribution; pixel work distribution; compute work distribution; setup/raster/Zcull; texture units with tex L1; SM (MTIU, SPs, SFUs, shared memory); interconnection network; ROP and L2 units; memory.]