ing an instruction across multiple threads with identical
PCs. In either SIMD implementation, the complexity of
processing an instruction stream and the cost of circuits
and structures to control ALUs are amortized across multiple ALUs. The result is both power- and area-efficient
chip execution.
CPU designs have converged on a SIMD width of four
as a balance between providing increased throughput and
retaining high single-threaded performance. Characteristics of the shading workload make it beneficial for GPUs
to employ significantly wider SIMD processing (widths
ranging from 32 to 64) and to support a rich set of operations. It is common for GPUs to support SIMD implementations of reciprocal square root, trigonometric functions,
and memory gather/scatter operations.
The efficiency of wide SIMD processing allows GPUs
to pack many cores densely with ALUs. For example, the
NVIDIA GeForce 8800 Ultra GPU contains 128 single-precision ALUs operating at 1.5 GHz. These ALUs are
organized into 16 processing cores and yield a peak rate
of 384 Gflops (each ALU retires one 32-bit multiply-add
per clock). In comparison, a high-end 3-GHz Intel Core 2
CPU contains four cores, each with eight SIMD floating-point ALUs (two 4-width vector instructions per clock),
and is capable of, at most, 96 Gflops of peak performance.
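
The peak rates quoted above follow from a simple product of ALU count, clock rate, and floating-point operations retired per ALU per clock. The sketch below reproduces that arithmetic. The function name peak_gflops is illustrative, and counting one flop per CPU ALU per clock is an assumption chosen to match the 96-Gflops figure in the text (two 4-wide vector instructions per clock per core).

```python
def peak_gflops(alus: int, clock_ghz: float, flops_per_alu_per_clock: int) -> float:
    """Peak rate in Gflops = ALUs x clock (GHz) x flops retired per ALU per clock."""
    return alus * clock_ghz * flops_per_alu_per_clock


# GeForce 8800 Ultra: 128 ALUs at 1.5 GHz, each retiring one
# multiply-add (counted as 2 flops) per clock.
gpu_peak = peak_gflops(alus=128, clock_ghz=1.5, flops_per_alu_per_clock=2)

# 3-GHz Core 2: 4 cores x 8 SIMD ALUs. Assumption: 1 flop per ALU per clock,
# which matches the peak figure quoted in the text.
cpu_peak = peak_gflops(alus=4 * 8, clock_ghz=3.0, flops_per_alu_per_clock=1)

print(f"GPU peak: {gpu_peak:.0f} Gflops")
print(f"CPU peak: {cpu_peak:.0f} Gflops")
print(f"Ratio:    {gpu_peak / cpu_peak:.1f}x")
```

These are peak figures only; sustained throughput depends on how well a workload keeps the ALUs busy.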
GPUs execute groups of shader invocations in parallel to take advantage of SIMD processing. Dynamic
per-entity control flow is implemented by executing all
control paths taken by the shader invocations. SIMD
operations that do not apply to all invocations, such as
those inside conditional or loop blocks of shader code, are
partially nullified using write-masks. In this implementation, when shader control flow diverges, fewer SIMD
ALUs do useful work. Thus, on a chip with width-S SIMD
processing, worst-case behavior yields performance equal to 1/S of the chip’s peak rate. Fortunately, shader workloads
exhibit sufficient levels of instruction stream sharing to
justify wide SIMD implementations. Additionally, GPU
ISAs contain special instructions that make it possible for
shader compilers to transform per-entity control flow into
efficient sequences of SIMD operations.
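
The effect of write-masked execution on ALU utilization can be illustrated with a small software model. This is a sketch, not a description of any particular GPU: the function run_masked and its parameters are hypothetical, and it assumes each control path costs exactly one SIMD operation. Every path taken by at least one lane is executed across all lanes, and a per-path write-mask decides which lanes commit a result.

```python
def run_masked(values, path_of, path_ops):
    """Execute every path taken by at least one lane, gating writes with a mask.

    values   -- one input per SIMD lane
    path_of  -- function mapping a lane's value to the key of the path it takes
    path_ops -- dict mapping each path key to that path's operation
    Returns (per-lane results, fraction of lane-slots that did useful work).
    """
    taken = [path_of(v) for v in values]
    results = list(values)
    useful = 0
    passes = 0
    for key, op in path_ops.items():
        mask = [t == key for t in taken]   # write-mask for this path
        if not any(mask):
            continue                       # no lane takes this path: skip it
        passes += 1
        for lane, active in enumerate(mask):
            if active:                     # masked-off lanes keep their old value
                results[lane] = op(values[lane])
                useful += 1
    total_slots = passes * len(values)
    utilization = useful / total_slots if total_slots else 1.0
    return results, utilization


lanes = [1, 2, 3, 4]                       # a width-4 SIMD group (S = 4)
two_way = {0: lambda v: v * 10, 1: lambda v: -v}

# All lanes take the same path: one pass, every ALU does useful work.
coherent, util_coherent = run_masked(lanes, lambda v: 0, two_way)

# Lanes split between two paths: two passes, half the ALU slots are wasted.
diverged, util_two_way = run_masked(lanes, lambda v: v % 2, two_way)

# Every lane takes its own path: S passes, utilization falls to 1/S.
per_lane = {v: (lambda x: x + 100) for v in lanes}
worst, util_worst = run_masked(lanes, lambda v: v, per_lane)

print(util_coherent, util_two_way, util_worst)
```

In this model the worst case matches the 1/S bound stated above, while coherent control flow keeps all lanes busy.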
Hardware Multithreading = High ALU Utilization.
Thread stalls pose an additional challenge to high-performance shader execution. Threads stall (or block) when
the processor cannot dispatch the next instruction in
an instruction stream because of a dependency on an
outstanding instruction. High-latency off-chip memory
accesses, most notably those generated by fragment
shader texturing operations, cause thread stalls lasting
hundreds of cycles (recall that while shader input and
output records lend themselves to streaming prefetch,
texture accesses do not).
Leaving ALUs idle while a thread is stalled is inefficient. Instead, GPUs maintain
more execution contexts on chip than they can simultaneously execute, and they issue instructions from
runnable threads while others are stalled. Hardware
scheduling logic determines which context(s) to execute
in each processor cycle. This technique of overprovisioning cores with thread contexts to hide the latency of
thread stalls is called hardware multithreading. GPUs use
multithreading to hide both memory access and instruction pipeline latencies.
The latency-hiding ability of GPU multithreading
depends on the ratio of hardware thread contexts to
the number of threads that can be simultaneously executed in a clock (value T from table 1). Support for more
thread contexts allows the GPU to hide longer or more
frequent stalls. All modern GPUs maintain large numbers of execution contexts on chip to provide maximal
memory latency-hiding ability (T ranges from 16 to 96).
This represents a significant departure from CPU designs,
which attempt to avoid or minimize stalls using large,
low-latency data caches and complicated out-of-order
execution logic. Current Intel Core 2 and AMD Phenom
processors maintain one thread per core, and even high-end models of Sun’s multithreaded UltraSPARC T2 processor manage only four times the number of threads they
can simultaneously execute.
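
The relationship between the number of thread contexts and latency hiding can be sketched with a toy simulation. The model below is an assumption made for illustration, not a description of real scheduling hardware: a core issues at most one instruction per cycle, every thread repeats a fixed pattern of alu_cycles instructions followed by a stall of stall_cycles, and the scheduler picks the lowest-numbered runnable context. The name alu_utilization and its parameters are hypothetical.

```python
def alu_utilization(num_contexts, alu_cycles, stall_cycles, total_cycles=10_000):
    """Fraction of cycles in which the core issues an instruction.

    The core issues from at most one runnable context per cycle, so
    num_contexts plays the role of the ratio T in this simplified model.
    """
    ready_at = [0] * num_contexts             # cycle when each context is runnable
    remaining = [alu_cycles] * num_contexts   # instructions left before next stall
    busy = 0
    for cycle in range(total_cycles):
        for ctx in range(num_contexts):
            if ready_at[ctx] <= cycle:        # first runnable context issues
                remaining[ctx] -= 1
                busy += 1
                if remaining[ctx] == 0:       # thread now stalls (e.g., on a texture fetch)
                    ready_at[ctx] = cycle + 1 + stall_cycles
                    remaining[ctx] = alu_cycles
                break
    return busy / total_cycles


# 10 instructions of work per 90-cycle stall (illustrative numbers only).
for contexts in (1, 4, 10, 16):
    util = alu_utilization(contexts, alu_cycles=10, stall_cycles=90)
    print(f"{contexts:2d} contexts -> ALU utilization {util:.2f}")
```

Under these assumptions utilization grows roughly in proportion to the number of contexts until the stalls are fully covered, after which extra contexts add nothing. Longer or more frequent stalls push that saturation point higher, which is why supporting more contexts lets the GPU hide more latency.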
Note that in the absence of stalls, the throughput of
single- and multithreaded processors is equivalent. Multithreading does not increase the number of processing
resources on a chip. Rather, it is a strategy that interleaves
execution of multiple threads in order to use existing
resources more efficiently (improve throughput). On average, a multithreaded core operating at its peak rate runs
each thread 1/T of the time.
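
Both points can be checked with a minimal model of a stall-free core that issues one instruction per cycle. Round-robin selection is an assumption made here for simplicity (the text does not specify a scheduling policy), and the name round_robin_issue is hypothetical.

```python
from collections import Counter


def round_robin_issue(num_threads, total_cycles):
    """Issue one instruction per cycle, rotating through stall-free threads."""
    issued = Counter()
    for cycle in range(total_cycles):
        issued[cycle % num_threads] += 1
    return issued


CYCLES = 960
single = round_robin_issue(num_threads=1, total_cycles=CYCLES)
multi = round_robin_issue(num_threads=16, total_cycles=CYCLES)   # T = 16

# Without stalls, total throughput is the same in both cases.
print(sum(single.values()), sum(multi.values()))

# Each of the 16 threads runs 1/T of the time.
print(multi[0] / CYCLES)
```

Total instructions issued are identical in both configurations; multithreading changes only how the issue slots are divided among threads.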