the duration of processing for a single frame, different
stages will dominate overall execution, often resulting
in bandwidth- and compute-intensive phases of execution. Maintaining an efficient mapping of the graphics
pipeline to a GPU’s resources in the face of this variability
is a significant challenge, as it requires processing and
on-chip storage resources to be dynamically reallocated to
pipeline stages, depending on current load.
Mixture of predictable and unpredictable data access.
The graphics pipeline rigidly defines inter-stage data flows
using streams of entities. This predictability presents
opportunities for aggregate prefetching of stream data
records and highly specialized hardware management of
on-chip storage resources. In contrast, buffer and texture
accesses performed by shaders are fine-grained memory
operations on dynamically computed addresses, making
prefetch difficult. As both forms of data access are critical
to maintaining high throughput, shader programming
models explicitly differentiate stream from buffer/texture
memory accesses, permitting specialized hardware solutions for both types of accesses.
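The contrast between the two access patterns can be sketched in a few lines of Python. This is an illustrative toy, not a description of any GPU's memory system: the function names, the fixed record size, and the row-major texture layout are all assumptions made for the example. The point is only that stream addresses are fully determined before any shader runs, while texture addresses depend on coordinates a shader computes at run time.

```python
def stream_addresses(base, record_size, count):
    """Addresses of consecutive stream records (hypothetical layout).

    Every address is known up front, so a whole batch could be prefetched.
    """
    return [base + i * record_size for i in range(count)]


def texture_addresses(base, texel_size, width, coords):
    """Addresses of texels in an assumed row-major texture.

    `coords` stands in for (x, y) pairs produced by shader code, so the
    addresses cannot be listed until the shader has executed.
    """
    return [base + (y * width + x) * texel_size for (x, y) in coords]


# Stream records: a regular, predictable stride.
print(stream_addresses(base=0, record_size=16, count=4))

# Texture fetches: the pattern depends entirely on computed coordinates.
print(texture_addresses(base=0, texel_size=4, width=256,
                        coords=[(3, 0), (200, 17), (5, 1)]))
```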
Opportunities for instruction stream sharing. While
the shader programming model permits each shader
invocation to follow a unique stream of control, in
practice, shader execution on nearby stream elements
often results in the same dynamic control-flow decisions.
As a result, multiple shader invocations can likely share
an instruction stream. Although GPUs must accommodate situations where this is not the case, instruction
stream sharing across multiple shader invocations is a key
optimization in the design of GPU processing cores and is
accounted for in algorithms for pipeline scheduling.
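Why coherent control flow matters for instruction stream sharing can be shown with a toy cost model. The sketch below assumes a common scheme in which a group of invocations sharing one instruction stream must issue both sides of a branch whenever its members disagree; the function name and the cost numbers are hypothetical, and real hardware differs in detail.

```python
def group_cost(takes_branch, cost_if, cost_else):
    """Instruction slots issued for one group sharing an instruction stream.

    takes_branch: one boolean per shader invocation in the group.
    Assumption: a path is issued once if at least one invocation needs it.
    """
    cost = 0
    if any(takes_branch):        # someone needs the "if" side
        cost += cost_if
    if not all(takes_branch):    # someone needs the "else" side
        cost += cost_else
    return cost


# Nearby stream elements usually agree, so only one path is issued.
coherent = group_cost([True] * 8, cost_if=10, cost_else=6)

# When the group disagrees, both paths are issued for the whole group.
divergent = group_cost([True] * 4 + [False] * 4, cost_if=10, cost_else=6)

print(coherent, divergent)
```

Under this model the shared stream costs no more than a single invocation when decisions agree, and the penalty for disagreement is bounded by the sum of both paths.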
PROGRAMMABLE PROCESSING RESOURCES
A large fraction of a GPU’s resources exist within
programmable processing cores responsible for executing
shader functions. While substantial implementation
differences exist across vendors and product lines, all
modern GPUs maintain high efficiency through the use of
multicore designs that employ both hardware multithreading
and SIMD (single instruction, multiple data)
processing. As shown in table 1, these throughput-computing
techniques are not unique to GPUs (top two rows).
In comparison with CPUs, however, GPU designs push
these ideas to extreme scales.
Multicore + SIMD Processing = Lots of ALUs. A thread
of control is realized by a stream of processor instructions
that execute within a processor-managed environment,
called an execution (or thread) context. This context consists of states such as a program counter, a stack pointer,
general-purpose registers, and virtual memory mappings.
A multicore processor replicates processing resources
(both ALUs and execution contexts) and organizes them
into independent cores. When an application features
multiple threads of control, multicore architectures provide increased throughput by executing these instruction
streams on each core in parallel. For example, an Intel
Core 2 Quad contains four cores and can execute four
instruction streams simultaneously. As significant parallelism exists across shader invocations, GPU designs easily
push core counts higher. High-end models contain up to
16 cores per chip.
Even higher performance is possible by populating
each core with multiple floating-point ALUs. This is done
efficiently with SIMD processing, which uses each ALU to
perform the same operation on a different piece of data.
The most common implementation of SIMD processing
is via explicit short-vector instructions, similar to those
provided by the x86 SSE or PowerPC Altivec ISA extensions. These extensions provide a SIMD width of four,
with instructions that control the operation of four ALUs.
Alternative implementations, such as NVIDIA’s 8-series
architecture, perform SIMD execution by implicitly shar-
TABLE 1 Tale of the Tape: Throughput Architectures
Type  Processor              Cores/Chip  ALUs/Core [3]  SIMD width  Max T [4]
GPUs  AMD Radeon HD 2900     4           80             64          48
      NVIDIA GeForce 8800    16          8              32          96
CPUs  Intel Core 2 Quad [1]  4           8              4           1
      STI Cell BE [2]        8           4              4           1
      Sun UltraSPARC T2      8           1              1           4
[1] SSE processing only; does not account for x86 FPU.
[2] Stream processing (SPE) cores only; does not account for PPU cores.
[3] 32-bit, floating point (all ALUs are multiply-add except the Intel Core 2 Quad).
[4] The ratio of core thread contexts to simultaneously executable threads. We use the ratio T (rather
than the total number of per-core thread contexts) to describe the extent to which processor cores
automatically hide thread stalls via hardware multithreading.
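To make the "lots of ALUs" point concrete, the Cores/Chip and ALUs/Core columns of table 1 can be multiplied to give a per-chip ALU count. The short Python sketch below does only that arithmetic; the dictionary layout and the function name are illustrative choices, and the figures are copied from the table rather than measured.

```python
# (cores per chip, ALUs per core), taken from table 1.
TABLE_1 = {
    "AMD Radeon HD 2900": (4, 80),
    "NVIDIA GeForce 8800": (16, 8),
    "Intel Core 2 Quad": (4, 8),    # SSE units only (table footnote)
    "STI Cell BE": (8, 4),          # SPE cores only (table footnote)
    "Sun UltraSPARC T2": (8, 1),
}


def alus_per_chip(processor):
    """Total 32-bit floating-point ALUs implied by table 1 for one chip."""
    cores, alus_per_core = TABLE_1[processor]
    return cores * alus_per_core


for name in TABLE_1:
    print(f"{name:22s} {alus_per_chip(name):4d} ALUs")
```

By this count the two GPUs carry 320 and 128 ALUs per chip, against 32 or fewer for each of the CPUs listed.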