number of cores, as well as across a variety of parallel architectures. It also helps to avoid the possibility of deadlock.

An application may execute multiple grids either independently or dependently. Independent grids may execute concurrently given sufficient hardware resources. Dependent grids execute sequentially, with an implicit inter-kernel barrier between them, thus guaranteeing that all blocks of the first grid will complete before any block of the second dependent grid is launched.

Threads may access data from multiple memory spaces during their execution. Each thread has a private local memory. CUDA uses this memory for thread-private variables that do not fit in the thread’s registers, as well as for stack frames and register spilling. Each thread block has a shared memory visible to all threads of the block that has the same lifetime as the block. Finally, all threads have access to the same global memory. Programs declare variables in shared and global memory with the __shared__ and __device__ type qualifiers. On a Tesla-architecture GPU, these memory spaces correspond to physically separate memories: per-block shared memory is a low-latency on-chip RAM, while global memory resides in the fast DRAM on the graphics board.

Shared memory is expected to be a low-latency memory near each processor, much like an L1 cache. It can, therefore, provide for high-performance communication and data sharing among the threads of a thread block. Since it has the same lifetime as its corresponding thread block, kernel code will typically initialize data in shared variables, compute using shared variables, and copy shared memory results to global memory. Thread blocks of sequentially dependent grids communicate via global memory, using it to read input and write results.

Figure 2 diagrams the nested levels of threads, thread blocks, and grids of thread blocks. It shows the corresponding levels of memory sharing: local, shared, and global memories for per-thread, per-thread-block, and per-application data sharing.
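
The shared-memory pattern and the hand-off between sequentially dependent grids described above can be sketched in CUDA C. What follows is a minimal illustration, not code from the article: the kernel names blockSum and finalSum, the host helper sumOnDevice, the __device__ variable totalSum, and the block size of 256 are all hypothetical choices, and error checking is omitted for brevity. Both kernels are launched into the default stream, which is assumed here to give the sequential, dependent ordering.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // illustrative; must be a power of two for the reductions below

// Result variable declared in global memory with the __device__ qualifier.
__device__ float totalSum;

// First grid: each block sums its slice of the input.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float tile[BLOCK_SIZE];  // per-block shared memory

    // Initialize data in shared variables.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Compute using shared variables (tree reduction within the block).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    // Copy the shared memory result to global memory.
    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];
}

// Second, dependent grid: one block reads the partial sums that the
// first grid left in global memory and reduces them to a single value.
__global__ void finalSum(const float *partial, int numPartials)
{
    __shared__ float tile[BLOCK_SIZE];

    float acc = 0.0f;
    for (int i = threadIdx.x; i < numPartials; i += blockDim.x)
        acc += partial[i];
    tile[threadIdx.x] = acc;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        totalSum = tile[0];
}

// Host helper: allocates global memory, copies the input to the device,
// launches the two dependent grids, and copies the result back.
float sumOnDevice(const float *hostIn, int n)
{
    if (n <= 0)
        return 0.0f;

    int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    float *devIn = NULL;
    float *devPartial = NULL;
    cudaMalloc((void **)&devIn, n * sizeof(float));
    cudaMalloc((void **)&devPartial, numBlocks * sizeof(float));
    cudaMemcpy(devIn, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);

    // Same stream, so finalSum starts only after every block of blockSum is done.
    blockSum<<<numBlocks, BLOCK_SIZE>>>(devIn, devPartial, n);
    finalSum<<<1, BLOCK_SIZE>>>(devPartial, numBlocks);

    float result = 0.0f;
    cudaMemcpyFromSymbol(&result, totalSum, sizeof(float));

    cudaFree(devIn);
    cudaFree(devPartial);
    return result;
}
```

A caller would pass a host array and its length, for example sumOnDevice(data, 1000). The two-kernel split is one way to let blocks cooperate: blocks of the first grid never communicate with each other directly, and the second grid sees their results only through global memory.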

[FIG 2: Levels of Parallel Granularity and Memory Sharing. Labels: thread, per-thread local memory; thread block, per-block shared memory; sequence of grid 0 and grid 1, global memory.]

A program manages the global memory space visible to kernels through calls to the CUDA runtime, such as cudaMalloc() and cudaFree(). Kernels may execute on a physically separate device, as is the case when running kernels on the GPU. Consequently, the application must use cudaMemcpy() to copy data between the allocated space and the host system memory.

The CUDA programming model is similar in style to the familiar SPMD (single-program multiple-data) model; it expresses parallelism explicitly, and each kernel executes on a fixed number of threads. CUDA, however, is more flexible than most real-
