Large-scale multithreading requires execution contexts
to be compact in order to fit many contexts within on-chip memories. The number of thread contexts supported
by a GPU core is shader-program dependent and typically limited by the size of on-chip storage. GPUs require
compiled shader binaries to declare input and output
entity sizes, as well as bounds on temporary storage and
scratch registers required for execution. At runtime, GPUs
use these bounds to partition unspillable on-chip storage
(including data registers) dynamically among execution
contexts. Thus, GPUs support many thread contexts (up
to an architecture-specific bound) and, correspondingly,
provide maximal latency-hiding ability when shaders
use fewer resources. When shaders require large amounts
of storage, the number of execution contexts provided
by a GPU drops. (The accompanying sidebar details an
example of the efficient execution of a fragment shader
on a GPU core.)
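The trade-off between per-shader storage and the number of resident contexts can be sketched with a small calculation. This is an illustrative model only: the function name, the 32-Kbyte storage size, and the 64-context limit are hypothetical values chosen for the example, not figures for any particular GPU.

```python
def resident_contexts(storage_bytes, bytes_per_context, max_contexts):
    """Estimate how many thread contexts fit in a core's on-chip storage.

    storage_bytes     -- unspillable on-chip storage available for contexts
    bytes_per_context -- bound declared by the compiled shader binary
    max_contexts      -- architecture-specific upper limit
    """
    if bytes_per_context <= 0:
        raise ValueError("bytes_per_context must be positive")
    # Storage is partitioned evenly; the architectural bound caps the result.
    return min(max_contexts, storage_bytes // bytes_per_context)


# Hypothetical core: 32 Kbytes of context storage, at most 64 contexts.
STORAGE_BYTES = 32 * 1024
MAX_CONTEXTS = 64

light = resident_contexts(STORAGE_BYTES, 256, MAX_CONTEXTS)   # small shader
heavy = resident_contexts(STORAGE_BYTES, 2048, MAX_CONTEXTS)  # storage-hungry shader
print(light, heavy)
```

Under these assumed numbers, the small shader reaches the architectural limit, while the storage-hungry shader leaves the core with only a quarter as many contexts available to hide latency.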
FIXED-FUNCTION PROCESSING RESOURCES
A GPU’s programmable cores interoperate with a collection of specialized fixed-function processing units that
provide high-performance, power-efficient implementations of nonshader stages. These components do not
simply augment programmable processing; they perform
sophisticated operations and account for hundreds of
additional gigaflops of processing power. Two of the
most important operations performed via fixed-function
hardware are texture filtering and rasterization (fragment
generation).
Texturing is handled almost entirely by fixed-function
logic. A texturing operation samples a continuous 1D, 2D,
or 3D signal (a texture) that is discretely represented by a
multidimensional array of color values (2D texture data is
simply an image). A GPU texture-filtering unit accepts a
point within the texture’s parameterization (represented
by a floating-point tuple, such as {.5, .75}) and loads array
values surrounding the coordinate from memory. The values are then filtered to yield a single result that represents
the texture’s value at the specified coordinate. This value
is returned to the calling shader function. Sophisticated
texture filtering is required for generating high-quality
images. As graphics APIs provide a finite set of filtering
kernels, and because filtering kernels are computationally
expensive, texture filtering is well suited for fixed-function processing.
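As a concrete illustration of the filtering step, the following is a minimal software sketch of bilinear filtering, one of the simplest kernels a texture unit might apply. It assumes a single-channel 2D texture stored as a list of rows, texel centers at half-integer positions, and clamp-to-edge addressing; the function name and these conventions are assumptions for the example, and actual hardware supports more elaborate kernels.

```python
import math


def sample_bilinear(texture, u, v):
    """Return the filtered value of a 2D texture at normalized (u, v).

    texture -- list of rows, each a list of floats (one color channel)
    u, v    -- coordinates in [0, 1] within the texture's parameterization
    """
    height = len(texture)
    width = len(texture[0])

    # Map normalized coordinates to texel space (texel centers at i + 0.5).
    x = u * width - 0.5
    y = v * height - 0.5
    x0 = math.floor(x)
    y0 = math.floor(y)
    fx = x - x0  # horizontal blend weight
    fy = y - y0  # vertical blend weight

    def texel(ix, iy):
        # Clamp-to-edge addressing for samples that fall outside the array.
        ix = min(max(ix, 0), width - 1)
        iy = min(max(iy, 0), height - 1)
        return texture[iy][ix]

    # Load the four array values surrounding the coordinate and blend them.
    top = texel(x0, y0) * (1 - fx) + texel(x0 + 1, y0) * fx
    bottom = texel(x0, y0 + 1) * (1 - fx) + texel(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy


image = [[0.0, 1.0],
         [2.0, 3.0]]
print(sample_bilinear(image, 0.5, 0.75))
```

Even this simple kernel requires four memory loads and several multiply-adds per lookup, which suggests why more sophisticated filters are attractive candidates for dedicated hardware.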
Primitive rasterization in the FG stage is another key
pipeline operation implemented by fixed-function components. Rasterization involves densely sampling a primitive (at least once per output image pixel) to determine
which pixels the primitive overlaps. This process involves
interpolating the location of the surface at each sample
point and then generating fragments for all sample points
covered by the primitive. Bounding-box computations
and hierarchical techniques optimize the rasterization
process. Nonetheless, rasterization involves significant
computation.
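The sampling and interpolation steps can be sketched in software as follows. This is a simplified illustration, not a description of any GPU's rasterizer: it assumes 2D screen-space triangles, one sample at each pixel center, and a bounding-box scan with edge-function coverage tests. The function names are hypothetical, and the sketch omits hierarchical traversal and the fill rules that real rasterizers use to avoid generating duplicate fragments on shared edges.

```python
import math


def edge(ax, ay, bx, by, px, py):
    """Signed-area test of point P against the directed edge A->B."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(tri, width, height):
    """Generate fragments (px, py, barycentric weights) for one triangle."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return []  # degenerate triangle covers no samples

    # Bounding box of the triangle, clipped to the output image.
    min_x = max(math.floor(min(x0, x1, x2)), 0)
    max_x = min(math.ceil(max(x0, x1, x2)), width - 1)
    min_y = max(math.floor(min(y0, y1, y2)), 0)
    max_y = min(math.ceil(max(y0, y1, y2)), height - 1)

    fragments = []
    for py in range(min_y, max_y + 1):
        for px in range(min_x, max_x + 1):
            sx, sy = px + 0.5, py + 0.5  # sample at the pixel center
            w0 = edge(x1, y1, x2, y2, sx, sy)
            w1 = edge(x2, y2, x0, y0, sx, sy)
            w2 = edge(x0, y0, x1, y1, sx, sy)
            if area > 0:
                covered = w0 >= 0 and w1 >= 0 and w2 >= 0
            else:
                covered = w0 <= 0 and w1 <= 0 and w2 <= 0
            if covered:
                # Normalized weights interpolate per-vertex surface attributes.
                fragments.append((px, py, (w0 / area, w1 / area, w2 / area)))
    return fragments


frags = rasterize([(0, 0), (4, 0), (0, 4)], width=4, height=4)
print(len(frags))
```

Each candidate pixel costs three edge evaluations plus the interpolation arithmetic, so even with the bounding box limiting the scan, the work grows with the area the primitive covers on screen.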
In addition to the components for texturing and rasterization, GPUs contain dedicated hardware components
for operations such as surface visibility determination,
output pixel compositing, and data compression/decompression.
THE MEMORY SYSTEM
Parallel-processing resources place extreme load on a
GPU’s memory system, which services memory requests
from both fixed-function and programmable components.
These requests include a mixture of fine-granularity and bulk prefetch operations and may even require
realtime guarantees (such as display scan out).
Recall that a GPU’s programmable cores tolerate large
memory latencies via hardware multithreading and that
interstage stream data accesses can be prefetched. As a
result, GPU memory systems are architected to deliver
high-bandwidth, rather than low-latency, data access.
High throughput is obtained through the use of wide