finer pieces that can be solved cooperatively in parallel.
The programming model scales transparently to large
numbers of processor cores: a compiled CUDA program
executes on any number of processors, and only the runtime system needs to know the physical processor count.
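A minimal sketch of what this transparency looks like in practice, using the kernel and thread-block constructs that the next section introduces. The kernel name `scale`, the host helper `scale_on_gpu`, and the block size of 256 are illustrative assumptions, not taken from the article. Note that neither the kernel nor the launch refers to the number of physical processor cores.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread scales one array element. The kernel is written purely in
// terms of block and thread indices; the processor count never appears.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // the last block may be only partly used
        data[i] *= factor;
}

// Hypothetical host helper: copy to the device, run the kernel, copy back.
// Returns true if every CUDA call succeeded.
bool scale_on_gpu(std::vector<float> &host, float factor)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return true;
    size_t bytes = n * sizeof(float);

    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dev, host.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;

    // The launch is sized by the problem (n), not by the hardware.
    int threadsPerBlock = 256;      // illustrative choice
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(dev, factor, n);
    ok = ok && cudaGetLastError() == cudaSuccess;

    ok = ok && cudaMemcpy(host.data(), dev, bytes,
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    return ok;
}

int main()
{
    std::vector<float> v(1000, 1.0f);
    if (!scale_on_gpu(v, 2.0f)) {
        std::fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    std::printf("v[0] = %.1f, v[999] = %.1f\n", v[0], v[999]);
    return 0;
}
```

The same compiled binary can be run on a GPU with few or many cores; the runtime decides how the blocks are distributed across whatever processors are present.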
THE CUDA PARADIGM
CUDA is a minimal extension of the C and C++ programming languages. The programmer writes a serial program
that calls parallel kernels, which may be simple functions
or full programs. A kernel executes in parallel across a
set of parallel threads. The programmer organizes these
threads into a hierarchy of grids of thread blocks. A thread
block is a set of concurrent threads that can cooperate
among themselves through barrier synchronization and
shared access to a memory space private to the block. A
grid is a set of thread blocks that may each be executed
independently and thus may execute in parallel.
When invoking a kernel, the programmer specifies the
number of threads per block and the number of blocks

SIMT (single-instruction, multiple-thread).³ The SM maps
each thread to one SP scalar core, and each scalar thread
executes independently with its own instruction address and
register state. The SM SIMT unit creates, manages, schedules,
and executes threads in groups of 32 parallel threads
called warps. (This term originates from weaving, the first
parallel thread technology.) Individual threads composing a
NVIDIA Tesla GPU with 112 Streaming Processor Cores
[Block diagram, image not reproduced. Labeled components: host CPU,
system memory, GPU host interface, input assembler, vertex work
distribution, pixel work distribution, compute work distribution,
setup/raster/Zcull, SMs (each drawn with an MT IU, eight SPs, two SFUs,
and shared memory), texture units with tex L1 caches, interconnection
network, and ROP units with L2 caches attached to memory.]
FIG A
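The thread hierarchy described in the main text (kernels, thread blocks, and grids) can be sketched in CUDA as follows. This is a minimal illustration under stated assumptions: the names `block_sum` and `sum_on_gpu` and the block size of 256 are hypothetical, and error checking is omitted for brevity. Threads within a block cooperate through a per-block shared array and barrier synchronization, while each block writes its own partial result and never communicates with other blocks, so the blocks may run in any order or in parallel.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative block size; the reduction loop below assumes a power of two.
constexpr int THREADS_PER_BLOCK = 256;

// Each block sums its own slice of the input and writes one partial sum.
__global__ void block_sum(const float *in, float *partial, int n)
{
    // Memory space private to the block, shared by its threads.
    __shared__ float tile[THREADS_PER_BLOCK];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                // barrier: all loads are now visible

    // Tree reduction inside the block, one barrier per step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];   // blocks never touch each other's slot
}

// Hypothetical host helper (no error checking, for brevity).
float sum_on_gpu(const std::vector<float> &host)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    float *in = nullptr, *partial = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&partial, blocks * sizeof(float));
    cudaMemcpy(in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The launch specifies the number of blocks and threads per block.
    block_sum<<<blocks, THREADS_PER_BLOCK>>>(in, partial, n);

    std::vector<float> result(blocks);
    cudaMemcpy(result.data(), partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(partial);

    float total = 0.0f;             // combine per-block results on the host
    for (float p : result) total += p;
    return total;
}

int main()
{
    std::vector<float> v(1000, 1.0f);
    std::printf("sum = %.1f\n", sum_on_gpu(v));
    return 0;
}
```

Combining the partial sums on the host keeps the blocks fully independent; a second kernel launch could do the same job on the device.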