To manage its large population of
threads efficiently, the GPU employs
a single-instruction, multiple-thread,
or SIMT, architecture in which threads
resident on a single SM are executed in
groups of 32, called warps, each executing a single instruction at a time across
all its threads. Warps are the basic unit
of thread scheduling, and in any given
cycle the SM is free to issue an instruction from any runnable warp. The
threads of a warp are free to follow their
own execution path, and all such execution divergence is handled automatically in hardware. However, it is obviously
more efficient for threads to follow the
same execution path for the bulk of the
computation. Different warps may follow different execution paths without
penalty.
While SIMT architectures share
many performance characteristics with
SIMD vector machines, they are, from
the programmer’s perspective, qualitatively different. Vector machines are
typically programmed with either vector
intrinsics explicitly operating on vectors
of some fixed width or compiler auto-vectorization of loops. In contrast, SIMT
machines are programmed by writing a
scalar program describing the action of
a single thread. A SIMT machine implicitly executes groups of independent scalar threads in a SIMD fashion, whereas a
vector machine explicitly encodes SIMD
execution in the vector operations in the
instruction stream it is given.
CUDA programming model. The CUDA
programming model23, 26 provides a
minimalist set of abstractions for parallel programming on massively multi-threaded architectures like the NVIDIA
GPU. A CUDA program is organized into
one or more threads executing on a host
processor and one or more parallel kernels that can be executed by the host
thread(s) on a parallel device.
Individual kernels execute a scalar
sequential program across a set of par-
allel threads. The programmer orga-
nizes the kernel’s threads into thread
blocks, specifying for each kernel
launch the number of blocks and num-
ber of threads per block to be created.
CUDA kernels are thus similar in style
to a blocked form of the familiar sin-
gle-program, multiple-data, or SPMD,
paradigm. However, CUDA is somewhat
more flexible than most SPMD sys-
tems in that the host program is free to
customize the number of threads and
blocks launched for a particular kernel
at each invocation. A thread block is a
group of parallel threads that may syn-
chronize with one another at a per-block
barrier and communicate among them-
selves through per-block shared mem-
ory. Threads from different blocks may
coordinate with one another via atomic
operations on variables in the global
memory space visible to all threads.
There is an implicit barrier between suc-
cessive dependent kernels launched by
the host program.
figure 3. trivial cuDa c kernel for incrementing each element of an array.
__global__ void increment(float *x, int n)
{
// Each thread will process 1 element, which
// is determined from the thread’s index.
int i = blockIdx.x*blockDim.x + threadIdx.x;
if( i<n ) x[i] = x[i] + 1;
}
__host__ void parallel_increment(float *x, int n)
{
// Launch increment() kernel with 1 thread
// per element, grouped into ⎡n/256⎤ blocks
// of 256 threads each.
increment<<<ceil(n/256), 256>>>(x, n);
}
capacity of each sm over three GPu generations.
Registers (32-bit)
co-resident threads
independent warps
shared memory (KB)
L1 cache (KB)
L2 cache (KB per chip)
G8x/G9x
8192
768
24
16
—
—
Gt2xx
16384
1024
32
16
—
—
Gf100
32768
1536
48
48/16
16/48
768