These components do not simply augment programmable processing; they
perform sophisticated operations and
contribute hundreds of additional gigaflops of processing power. Two of the
most important operations performed
by fixed-function hardware are texture
filtering and rasterization (fragment
generation).
Texturing is handled almost entirely
by fixed-function logic. A texturing operation samples a continuous 1D, 2D,
or 3D signal (a texture) that is discretely
represented by a multidimensional array of color values (2D texture data is
simply an image). A GPU texture-filtering unit accepts a point within the texture’s parameterization (represented by
a floating-point tuple, such as {.5, .75})
and loads array values surrounding the
coordinate from memory. The values
are then filtered to yield a single result
that represents the texture’s value at
the specified coordinate. This value
is returned to the calling shader function. Sophisticated texture filtering is
required for generating high-quality images. Because graphics APIs provide a finite
set of filtering kernels, and because these kernels are computationally expensive, texture filtering is well suited
to fixed-function processing.
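
The filtering step described above can be sketched in software. The following is a minimal illustration of one common kernel, bilinear filtering, and not a description of any particular GPU's hardware. The function name `sample_bilinear`, the texel-center convention, and the clamp-to-edge addressing are assumptions made for this example.

```python
import math

def sample_bilinear(texture, u, v):
    """Filter a 2D texture (list of rows of floats) at normalized (u, v)."""
    height = len(texture)
    width = len(texture[0])

    # Map normalized coordinates to texel space. This sketch assumes
    # texel centers sit at integer + 0.5.
    x = u * width - 0.5
    y = v * height - 0.5
    x0 = math.floor(x)
    y0 = math.floor(y)
    fx = x - x0  # horizontal blend weight
    fy = y - y0  # vertical blend weight

    def texel(ix, iy):
        # Clamp-to-edge addressing (an assumption of this sketch).
        ix = min(max(ix, 0), width - 1)
        iy = min(max(iy, 0), height - 1)
        return texture[iy][ix]

    # Load the four array values surrounding the coordinate and blend them.
    top = (1 - fx) * texel(x0, y0) + fx * texel(x0 + 1, y0)
    bottom = (1 - fx) * texel(x0, y0 + 1) + fx * texel(x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom

# A 2x2 single-channel texture sampled at the coordinate used in the text.
image = [[0.0, 1.0],
         [2.0, 3.0]]
value = sample_bilinear(image, 0.5, 0.75)
```

Even this simple kernel needs four memory loads and several multiply-adds per sample, which suggests why higher-quality kernels are attractive targets for dedicated logic.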
Primitive rasterization in the FG
stage is another key pipeline operation currently implemented by fixed-function components. Rasterization
involves densely sampling a primitive
(at least once per output image pixel)
to determine which pixels the primitive
overlaps. This process involves computing the location of the surface at each
sample point and then generating fragments for all sample points covered by
the primitive. Bounding-box computations and hierarchical techniques
optimize the rasterization process.
Nonetheless, rasterization involves significant computation.
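
As a rough software analogue of this process, the sketch below samples a triangle once per pixel center, restricting the search to the triangle's bounding box. It uses edge functions for the coverage test, which is one well-known approach and not necessarily what any given GPU implements. The names `edge` and `rasterize_triangle` are hypothetical, and the sketch omits the fill rules and hierarchical traversal that real rasterizers use.

```python
import math

def edge(ax, ay, bx, by, px, py):
    """Signed area term: positive when P lies to the left of edge A->B."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize_triangle(v0, v1, v2, width, height):
    """Return (x, y) pixels whose centers are covered by the triangle."""
    # Bounding-box optimization: only visit pixels near the primitive.
    min_x = max(math.floor(min(v0[0], v1[0], v2[0])), 0)
    max_x = min(math.ceil(max(v0[0], v1[0], v2[0])), width - 1)
    min_y = max(math.floor(min(v0[1], v1[1], v2[1])), 0)
    max_y = min(math.ceil(max(v0[1], v1[1], v2[1])), height - 1)

    if edge(*v0, *v1, *v2) == 0:
        return []  # degenerate triangle covers no area

    fragments = []
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            w0 = edge(*v1, *v2, px, py)
            w1 = edge(*v2, *v0, px, py)
            w2 = edge(*v0, *v1, px, py)
            # Covered when the sample is on the same side of all three
            # edges; accepting both signs handles either winding order.
            inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
                     (w0 <= 0 and w1 <= 0 and w2 <= 0)
            if inside:
                fragments.append((x, y))
    return fragments

frags = rasterize_triangle((0, 0), (4, 0), (0, 4), width=4, height=4)
```

Each candidate pixel costs three edge evaluations here, so the work grows with the screen area of the bounding box, consistent with the observation that rasterization remains computationally significant even when optimized.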
In addition to the components for
texturing and rasterization, GPUs contain dedicated hardware components for
operations such as surface visibility determination, output pixel compositing, and
data compression/decompression.
The memory system
Parallel-processing resources place extreme load on a GPU’s memory system,
which services memory requests from
both fixed-function and programmable
components. These requests include
a mixture of fine-granularity and bulk
prefetch operations and may even require real-time guarantees (such as display scan out).
Recall that a GPU’s programmable
cores tolerate large memory latencies
via hardware multithreading and that
interstage stream data accesses can be
prefetched. As a result, GPU memory
systems are architected to deliver high-bandwidth, rather than low-latency,
data access. High throughput is obtained through the use of wide memory
buses and specialized GDDR (graphics
double data rate) memories that operate most efficiently when memory access granularities are large. Thus, GPU
memory controllers must buffer, reorder, and then coalesce large numbers
of memory requests to synthesize large
operations that make efficient use of
the memory system. As an example, the
ATI Radeon HD 4870 memory controller
manages thousands of outstanding
requests to deliver 115 GB per second of
bandwidth from GDDR5 memories attached to a 256-bit bus.
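
The idea of coalescing can be illustrated with a toy model. The sketch below groups fine-grained byte addresses into aligned bursts so that several small requests are served by one large memory operation. The function name `coalesce` and the 64-byte burst size are assumptions chosen for illustration; real memory controllers use far more elaborate buffering and scheduling policies.

```python
def coalesce(addresses, burst_bytes=64):
    """Group byte addresses by the aligned burst that contains them.

    burst_bytes is a hypothetical burst size, not a figure from any
    specific memory controller.
    """
    bursts = {}
    for addr in addresses:
        base = (addr // burst_bytes) * burst_bytes  # start of aligned burst
        bursts.setdefault(base, []).append(addr)
    # Issue bursts in address order, a simple stand-in for reordering.
    return dict(sorted(bursts.items()))

# Seven small requests arriving out of order.
requests = [0, 4, 8, 68, 12, 64, 200]
bursts = coalesce(requests)
print(len(requests), "requests ->", len(bursts), "bursts")
```

In this model the benefit depends entirely on how many buffered requests fall into the same burst, which is one reason a controller gains from keeping many requests outstanding before issuing memory operations.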
GPU data caches serve different
needs than CPU caches. GPUs employ
relatively small, read-only caches (without
cache coherence) that filter requests destined for the memory controller and reduce bandwidth requirements placed on main memory. Thus,
GPU caches typically serve to amplify
total bandwidth to processing units
rather than decrease latency of memory
accesses. Interleaved execution of many
threads renders large read-write caches inefficient because of severe cache
thrashing. Instead, GPUs benefit from
small caches that capture spatial locality
across simultaneously executed shader
invocations. Such locality is common,
as texture accesses performed while
processing fragments in close screen
proximity are likely to have overlapping
texture-filter support regions.
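
A small counting experiment makes the overlap concrete. The sketch below assumes a bilinear filter with a 2x2 texel footprint and compares the number of texel reads requested by a 2x2 block of neighboring fragments with the number of distinct texels touched. The function names and the sample coordinates are hypothetical, and the model ignores cache line size and replacement policy.

```python
import math

def bilinear_footprint(u, v, width, height):
    """Return the set of (ix, iy) texels a bilinear filter reads at (u, v)."""
    x0 = math.floor(u * width - 0.5)
    y0 = math.floor(v * height - 0.5)
    return {
        (min(max(x0 + dx, 0), width - 1), min(max(y0 + dy, 0), height - 1))
        for dx in (0, 1)
        for dy in (0, 1)
    }

def fetch_stats(coords, width, height):
    """Return (total texel reads requested, distinct texels touched)."""
    total = 0
    unique = set()
    for u, v in coords:
        footprint = bilinear_footprint(u, v, width, height)
        total += len(footprint)
        unique |= footprint
    return total, len(unique)

# Four neighboring fragments sampling an 8x8 texture roughly one texel apart.
quad = [(u, v) for v in (0.4625, 0.5875) for u in (0.4625, 0.5875)]
total, distinct = fetch_stats(quad, width=8, height=8)
```

When the distinct count is well below the total, a small cache shared by the fragments can satisfy the repeated reads without sending them to the memory controller.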
Although most GPU caches are small,
this does not imply that GPUs contain little on-chip storage. Significant
amounts of on-chip storage are used to
hold entity streams, execution contexts,
and thread scratch data.
Pipeline scheduling and control
Mapping the entire graphics pipeline
efficiently onto GPU resources is a challenging problem that requires dynamic