finer pieces that can be solved cooperatively in parallel. The programming model scales transparently to large numbers of processor cores: a compiled CUDA program executes on any number of processors, and only the runtime system needs to know the physical processor count.

THE CUDA PARADIGM

CUDA is a minimal extension of the C and C++ programming languages. The programmer writes a serial program that calls parallel kernels, which may be simple functions or full programs. A kernel executes in parallel across a set of parallel threads. The programmer organizes these threads into a hierarchy of grids of thread blocks. A thread block is a set of concurrent threads that can cooperate among themselves through barrier synchronization and shared access to a memory space private to the block. A grid is a set of thread blocks that may each be executed independently and thus may execute in parallel.

When invoking a kernel, the programmer specifies the number of threads per block and the number of blocks

SIMT (single-instruction, multiple-thread).3 The SM maps each thread to one SP scalar core, and each scalar thread executes independently with its own instruction address and register state. The SM SIMT unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. (This term originates from weaving, the first parallel thread technology.) Individual threads composing a
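The kernel, thread block, and grid hierarchy described above can be sketched in CUDA as follows. This is a minimal illustration, not code from the article: the names blockSum, sumOnGpu, and kThreadsPerBlock are hypothetical, error checking is omitted for brevity, and the block size of 256 threads (eight warps of 32) is an arbitrary choice. Each block sums its own slice of the input using block-private shared memory and barrier synchronization, and the blocks of the grid run independently of one another.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical constant for this sketch: threads per block.
// It must be a power of two for the tree reduction below.
constexpr int kThreadsPerBlock = 256;

// Kernel: every thread block cooperatively sums its slice of `in`
// and writes one partial sum to out[blockIdx.x].
__global__ void blockSum(const float* in, float* out, int n)
{
    // Memory space private to the block, shared by all of its threads.
    __shared__ float partial[kThreadsPerBlock];

    int tid = threadIdx.x;                            // index within the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // index within the grid

    // Each thread loads one element (or zero past the end of the input).
    partial[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();  // barrier: all loads are visible before reducing

    // Tree reduction in shared memory. The barrier sits outside the
    // conditional so that every thread of the block reaches it.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One thread per block publishes the block's result.
    if (tid == 0)
        out[blockIdx.x] = partial[0];
}

// Hypothetical host helper: launches the kernel, then adds the
// per-block partial sums on the CPU. Error checking is omitted.
float sumOnGpu(const std::vector<float>& host)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;

    // The grid size depends only on the problem size, not on how many
    // processors the GPU physically has.
    int numBlocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, numBlocks * sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation: <<<number of blocks, threads per block>>>.
    blockSum<<<numBlocks, kThreadsPerBlock>>>(dIn, dOut, n);

    // This blocking copy on the default stream waits for the kernel.
    std::vector<float> partials(numBlocks);
    cudaMemcpy(partials.data(), dOut, numBlocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;
    for (float p : partials) total += p;
    return total;
}
```

Calling sumOnGpu(std::vector<float>(1000, 1.0f)) launches four blocks of 256 threads. Because no block depends on another block's result, the runtime is free to schedule them on however many processors are available, which is why the same compiled program can run unchanged on GPUs of different sizes.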


FIG A: NVIDIA Tesla GPU with 112 Streaming Processor Cores
[Block diagram. Labeled blocks: host CPU, system memory, GPU host interface, input assembler, vertex work distribution, pixel work distribution, compute work distribution, setup/raster/Zcull, texture units with tex L1 caches, an SM detail showing MT IU, eight SP cores, two SFUs, and shared memory, the interconnection network, and ROP/L2 units attached to memory.]

References:

http://www.acmqueue.com

Archives