data parallelism (stages operate independently
on stream entities), making
parallel processing a viable strategy for
increasing throughput. Despite abundant potential parallelism, however, the
unpredictable cost of shader execution
and constraints on the order of PO stage
processing introduce dynamic, fine-grained dependencies that complicate
parallel implementation throughout
the pipeline. Although output image
contributions from most fragments can
be applied in parallel, those that contribute to the same pixel cannot.
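This ordering constraint is easiest to see as a read-modify-write on per-pixel state. The following C++ sketch (illustrative only, not any GPU's actual implementation; all type and function names are made up) models a depth-tested pixel update: fragments aimed at distinct pixels could be applied concurrently, but the two fragments that alias pixel (1, 1) must be applied one at a time, in submission order.

```cpp
// Toy model of why pixel operations (PO) serialize fragments that
// target the same pixel: each update reads and rewrites that pixel's
// depth/color, and the API requires primitive-submission order.
#include <cstdio>
#include <vector>

struct Fragment { int x, y; float depth; unsigned color; };

struct Framebuffer {
    int width, height;
    std::vector<float> depth;      // one depth value per pixel
    std::vector<unsigned> color;   // one color value per pixel
    Framebuffer(int w, int h)
        : width(w), height(h), depth(w * h, 1.0f), color(w * h, 0) {}

    // Read-modify-write: the outcome depends on the pixel's current
    // state, so two fragments hitting the same (x, y) must be
    // applied one after the other, in submission order.
    void applyFragment(const Fragment& f) {
        int i = f.y * width + f.x;
        if (f.depth < depth[i]) {  // depth test against stored value
            depth[i] = f.depth;    // write back new depth
            color[i] = f.color;    // and new color
        }
    }
};

int main() {
    Framebuffer fb(4, 4);
    // The first and third fragments alias pixel (1, 1).
    Fragment stream[] = { {1, 1, 0.8f, 0xff0000},
                          {2, 3, 0.5f, 0x00ff00},
                          {1, 1, 0.3f, 0x0000ff} };
    for (const Fragment& f : stream) fb.applyFragment(f);
    std::printf("pixel (1,1) color = %06x\n", fb.color[1 * 4 + 1]);
}
```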
Extreme variations in pipeline load.
Although the number of stages and the data
flows of the graphics pipeline are fixed,
the computational and bandwidth requirements of all stages vary significantly depending on the behavior of shader
functions and the properties of the scene. For
example, primitives that cover large regions of the screen generate many more
fragments than vertices. In contrast,
many small primitives result in high vertex-processing demands. Applications
frequently reconfigure the pipeline to
use different shader functions that vary
from tens of instructions to a few hundred. For these reasons, over the duration of processing for a single frame,
different stages will dominate overall
execution, often resulting in bandwidth-
and compute-intensive phases of execution. Dynamic load balancing is required
to maintain an efficient mapping of the
graphics pipeline to a GPU’s resources
in the face of this variability, and GPUs
employ sophisticated heuristics for reallocating execution and on-chip storage resources among pipeline stages
depending on load.
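The heuristics real GPUs use are proprietary; the toy C++ sketch below (all names and numbers hypothetical) conveys only the basic idea of reallocating a fixed pool of cores in proportion to each stage's queued work, so that a vertex-heavy phase and a fragment-heavy phase receive very different allocations.

```cpp
// Toy illustration of dynamic load balancing across pipeline stages.
// A fixed pool of cores is reassigned in proportion to the number of
// entities queued at each stage's input.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Assign totalCores among stages in proportion to queued entities,
// guaranteeing every non-empty stage at least one core.
std::vector<int> balance(const std::vector<long>& queued, int totalCores) {
    long total = std::accumulate(queued.begin(), queued.end(), 0L);
    std::vector<int> cores(queued.size(), 0);
    if (total == 0) return cores;
    int assigned = 0;
    for (size_t i = 0; i < queued.size(); ++i) {
        if (queued[i] == 0) continue;
        cores[i] = static_cast<int>(
            std::max(1L, queued[i] * totalCores / total));
        assigned += cores[i];
    }
    // Hand any rounding remainder (or excess) to the busiest stage.
    size_t busiest = 0;
    for (size_t i = 1; i < queued.size(); ++i)
        if (queued[i] > queued[busiest]) busiest = i;
    cores[busiest] += totalCores - assigned;
    return cores;
}

int main() {
    // Queue depths for {VP, rasterization, FP} in two phases.
    std::vector<long> vertexHeavy = {9000, 500, 100};  // many tiny triangles
    std::vector<long> fragHeavy   = {200, 500, 20000}; // few large triangles
    for (const auto& q : {vertexHeavy, fragHeavy}) {
        std::vector<int> c = balance(q, 16);
        std::printf("cores: VP=%d raster=%d FP=%d\n", c[0], c[1], c[2]);
    }
}
```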
Fixed-function stages encapsulate difficult-to-parallelize work.
Programmable stages are trivially parallelizable by
executing shader function logic simultaneously on multiple stream entities.
In contrast, the pipeline’s nonprogrammable stages involve multiple entity
interactions (such as ordering dependencies in PO or vertex grouping in PG)
and stateful processing. Isolating this
non-data-parallel work into fixed stages
allows the GPU’s programmable processing components to be highly specialized for data-parallel execution and
keeps the shader programming model
simple. In addition, the separation enables difficult aspects of the graphics
computation to be encapsulated in optimized, fixed-function hardware components.
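As a concrete illustration of such stateful, order-dependent work, the sketch below (illustrative only; real PG hardware handles many topologies, restart indices, and more) assembles triangles from a triangle strip, where each output primitive depends on the two previously seen vertices, so the computation carries state across the input stream rather than treating entities independently.

```cpp
// Sketch of the stateful vertex grouping done by a stage like PG:
// assembling a triangle strip, where triangle i reuses the two
// vertices that preceded vertex i in the stream.
#include <cstdio>
#include <vector>

struct Vertex { float x, y, z; };
struct Triangle { Vertex v0, v1, v2; };

std::vector<Triangle> assembleStrip(const std::vector<Vertex>& verts) {
    std::vector<Triangle> tris;
    for (size_t i = 2; i < verts.size(); ++i) {
        // Winding order alternates so all triangles face the same way.
        if (i % 2 == 0)
            tris.push_back({verts[i - 2], verts[i - 1], verts[i]});
        else
            tris.push_back({verts[i - 1], verts[i - 2], verts[i]});
    }
    return tris;
}

int main() {
    std::vector<Vertex> strip = {
        {0, 0, 0}, {1, 0, 0}, {0, 1, 0}, {1, 1, 0}
    };
    std::vector<Triangle> tris = assembleStrip(strip); // 4 vertices -> 2 triangles
    std::printf("%zu triangles\n", tris.size());
}
```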
Mixture of predictable and unpredictable data access. The graphics pipeline
rigidly defines inter-stage data flows
using streams of entities. This predictability presents opportunities for
aggregate prefetching of stream data
records and highly specialized hardware management of on-chip storage
resources. In contrast, buffer and texture accesses performed by shaders
are fine-grained memory operations on
dynamically computed addresses, making prefetch difficult. As both forms of
data access are critical to maintaining
high throughput, shader programming
models explicitly differentiate stream
from buffer/texture memory accesses,
permitting specialized hardware solutions for both types of accesses.
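The contrast can be seen in miniature in the following sketch (all names are invented for illustration): stream records are consumed at addresses known in advance, namely the entity index, so an entire batch can be fetched ahead of execution, while each texture read uses an address computed at run time from per-fragment data.

```cpp
// Predictable stream access vs. unpredictable texture access.
#include <cstdio>
#include <vector>

struct FragmentRecord { float u, v; };        // one stream entity
std::vector<float> texture(256 * 256, 0.5f);  // 256x256 texel grid

float shadeFragment(const FragmentRecord& f) {
    // Texture address depends on data inside the record; it is not
    // known until the fragment is processed, defeating prefetch.
    int tx = static_cast<int>(f.u * 255.0f) & 255;
    int ty = static_cast<int>(f.v * 255.0f) & 255;
    return texture[ty * 256 + tx];
}

int main() {
    std::vector<FragmentRecord> stream(1024, {0.25f, 0.75f});
    float sum = 0.0f;
    // Stream access: record i is simply the i-th element, so the
    // addresses for the whole batch are known before execution and
    // could be prefetched in aggregate.
    for (size_t i = 0; i < stream.size(); ++i)
        sum += shadeFragment(stream[i]);
    std::printf("sum = %f\n", sum);
}
```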
Opportunities for instruction stream
sharing. While the shader programming
model permits each shader invocation
to follow a unique stream of control, in
practice, shader execution on nearby
stream elements often results in the
same dynamic control-flow decisions.
As a result, multiple shader invocations
can likely share an instruction stream.
Although GPUs must accommodate
situations where this is not the case, the
use of SIMD-style execution to exploit
shared control-flow across multiple
shader invocations is a key optimization
in the design of GPU processing cores
and is accounted for in algorithms for
pipeline scheduling.
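A minimal software model of this idea appears below (widths and details are invented for illustration): several invocations share one instruction stream, and when a branch diverges, the two paths are issued one after the other under a per-lane execution mask. When all lanes agree, which is the common case noted above, one of the two passes can be skipped entirely.

```cpp
// Toy model of SIMD execution of shader invocations with divergent
// control flow handled by a per-lane execution mask.
#include <cstdio>

const int WIDTH = 8;  // shader invocations sharing one instruction stream

int main() {
    float input[WIDTH]  = {0.1f, 0.9f, 0.4f, 0.8f, 0.2f, 0.7f, 0.3f, 0.6f};
    float output[WIDTH];

    // One "if (input > 0.5) A else B" evaluated across the group.
    bool mask[WIDTH];
    for (int lane = 0; lane < WIDTH; ++lane)
        mask[lane] = input[lane] > 0.5f;

    // Taken path: instructions issued for all lanes; results are
    // kept only where the mask is set.
    for (int lane = 0; lane < WIDTH; ++lane)
        if (mask[lane]) output[lane] = input[lane] * 2.0f;

    // Not-taken path: the same instruction stream is issued again
    // with the mask inverted.
    for (int lane = 0; lane < WIDTH; ++lane)
        if (!mask[lane]) output[lane] = input[lane] + 1.0f;

    for (int lane = 0; lane < WIDTH; ++lane)
        std::printf("%.2f ", output[lane]);
    std::printf("\n");
}
```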
Programmable Processing Resources
A large fraction of a GPU’s resources
exist within programmable processing
cores responsible for executing shader
functions. While substantial implementation differences exist across
vendors and product lines, all modern
GPUs maintain high efficiency through
the use of multicore designs that employ both hardware multithreading and
SIMD (single instruction, multiple data)
processing. As shown in the table here,
these throughput-computing techniques are not unique to GPUs (top two
rows). In comparison with CPUs, however, GPU designs push these ideas to
extreme scales.
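A back-of-the-envelope calculation shows the scale difference. The core counts and SIMD widths below are hypothetical, chosen only to illustrate how the same two techniques yield orders of magnitude more ALUs in flight on a GPU than on a CPU.

```cpp
// Peak ALU operations in flight = cores x SIMD width.
// All figures are hypothetical, for illustration only.
#include <cstdio>

int main() {
    int cpuCores = 4,  cpuSimd = 4;   // a few cores, modest SIMD width
    int gpuCores = 16, gpuSimd = 32;  // many cores, wide SIMD
    std::printf("CPU ALUs: %d\n", cpuCores * cpuSimd);  // 16
    std::printf("GPU ALUs: %d\n", gpuCores * gpuSimd);  // 512
}
```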
Multicore + SIMD Processing = Lots
of ALUs. A logical thread of control is
realized by a stream of processor instructions.
Figure 2. Graphics pipeline operations.
(a) Six vertices from the VG output stream
define the scene position and orientation of
two triangles. (b) Following VP and PG, the
vertices have been transformed into their
screen-space positions and grouped into
two triangle primitives, p0 and p1. (c) FG
samples the two primitives, producing a set
of fragments corresponding to p0 and p1. (d)
FP computes the appearance of the surface
at each sample location. (e) PO updates the
output image with contributions from the
fragments, accounting for surface visibility.
In this example, p1 is nearer to the camera
than p0; as a result, p0 is occluded by p1.