and adaptive techniques. A unique aspect of GPU computing is that hardware
logic assumes a major role in mapping
and scheduling computation onto chip
resources. GPU hardware “scheduling”
logic extends beyond the thread-scheduling responsibilities discussed in previous sections. GPUs automatically assign computations to threads, clean up
after threads complete, size and manage buffers that hold stream data, guarantee ordered processing when needed,
and identify and discard unnecessary
pipeline work. This logic relies heavily
on specific upfront knowledge of graphics workload characteristics.
Conventional thread programming
uses operating-system or threading API
mechanisms for thread creation, completion, and synchronization on shared
structures. Large-scale multithreading
coupled with the brevity of shader function execution (at most a few hundred
instructions), however, means GPU
thread management must be performed
entirely by hardware logic.
GPUs minimize thread launch costs
by preconfiguring execution contexts to
run one of the pipeline’s three types of
shader functions and reusing the configuration multiple times for shaders
of the same type. GPUs prefetch shader
input records and launch threads when
a shader stage’s input stream contains
a sufficient number of entities. Similar hardware logic commits records to
the output stream buffer upon thread
completion. The distribution of execution contexts to shader stages is reprovisioned periodically as pipeline needs
change and stream buffers drain or approach capacity.
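
To make the launch policy concrete, the following minimal C++ sketch treats a shader stage as a queue of input records and launches a batch of threads only when enough entities have accumulated. All type and function names (ShaderStage, pump, and so on) are hypothetical; this is a software analogy of the hardware logic described above, not a real GPU interface.

    // Software analogy of hardware thread-launch logic. All names are hypothetical.
    #include <cstddef>
    #include <queue>
    #include <vector>

    struct InputRecord  { int payload; };   // e.g., an unshaded vertex or fragment
    struct OutputRecord { int payload; };

    struct ShaderStage {
        std::queue<InputRecord>   input_stream;   // prefetched input records
        std::vector<OutputRecord> output_stream;  // stream buffer for results
        std::size_t contexts = 32;                // execution contexts currently
                                                  // provisioned to this stage
                                                  // (reprovisioned as needs change)
    };

    // Launch a batch of shader "threads" once enough entities have accumulated,
    // then commit their results to the output stream buffer on completion.
    void pump(ShaderStage& stage) {
        if (stage.input_stream.size() < stage.contexts) return;  // not enough work yet
        for (std::size_t i = 0; i < stage.contexts; ++i) {
            InputRecord in = stage.input_stream.front();
            stage.input_stream.pop();
            OutputRecord out{ in.payload };       // stands in for running the shader
            stage.output_stream.push_back(out);   // commit on thread completion
        }
    }
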
GPUs leverage upfront knowledge of
pipeline entities to identify and skip unnecessary computation. For example,
vertices shared by multiple primitives
are identified and VP results cached to
avoid duplicate vertex processing. GPUs
also discard fragments prior to FP when
the fragment will not alter the value of
any image pixel. Early fragment discard
is triggered when a fragment’s sample
point is occluded by a previously processed surface located closer to the
camera.
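
The fragment-discard case can be sketched in a few lines of C++: a depth buffer records the closest surface seen so far at each sample point, and a fragment that fails the depth comparison is dropped before fragment processing ever runs. The names below and the eager depth update are simplifications of what real hardware does.

    // Software analogy of early fragment discard (early depth test).
    // Names (DepthBuffer, Fragment, should_discard) are hypothetical.
    #include <cfloat>
    #include <vector>

    struct Fragment { int x, y; float depth; };   // smaller depth = closer to camera

    struct DepthBuffer {
        int width, height;
        std::vector<float> z;  // closest depth seen so far at each sample point
        DepthBuffer(int w, int h) : width(w), height(h), z(w * h, FLT_MAX) {}
        float& at(int x, int y) { return z[y * width + x]; }
    };

    // A fragment is discarded before fragment processing (FP) when its sample
    // point is occluded by a previously processed, closer surface.
    bool should_discard(const Fragment& f, DepthBuffer& db) {
        float& closest = db.at(f.x, f.y);
        if (f.depth >= closest) return true;   // occluded: skip FP entirely
        closest = f.depth;                     // new closest surface at this sample
        return false;
    }
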
Another class of hardware optimizations reorganizes fine-grained operations for more efficient processing. For
example, rasterization orders fragment
generation to maximize screen proximity of samples. This ordering improves
texture cache hit rates, as well as instruction stream sharing across shader
invocations. The GPU memory controller also performs automatic reorganization when it reorders memory requests
to optimize memory bus and DRAM utilization.
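
One well-known way to keep successively generated fragments close together on screen is to visit samples along a Morton (Z-order) curve. The sketch below shows only the index computation; whether a particular GPU uses this exact ordering is an implementation detail.

    // Interleave the bits of 16-bit x and y to form a Morton (Z-order) index.
    // Visiting fragments in increasing Morton order keeps neighboring samples
    // close together on screen, the kind of locality described above.
    #include <cstdint>

    static uint32_t spread_bits(uint32_t v) {   // abcd -> 0a0b0c0d
        v &= 0x0000FFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    uint32_t morton_index(uint32_t x, uint32_t y) {
        return spread_bits(x) | (spread_bits(y) << 1);
    }
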
GPUs enforce inter-fragment PO ordering dependencies using hardware
logic. Implementations use structures
such as post-FP reorder buffers or
scoreboards that delay fragment thread
launch until the processing of overlapping fragments is complete.
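
A scoreboard of this kind can be approximated in software as a set of pixel positions with fragments still in flight: a new fragment may launch only if no earlier fragment covering the same position remains unfinished. The structure and names below are illustrative, not a description of any specific GPU.

    // Software analogy of a fragment scoreboard. Names are hypothetical.
    #include <cstdint>
    #include <unordered_set>

    struct Scoreboard {
        std::unordered_set<uint64_t> in_flight;   // pixels with unfinished fragments

        static uint64_t key(int x, int y) {
            return (static_cast<uint64_t>(static_cast<uint32_t>(y)) << 32) |
                   static_cast<uint32_t>(x);
        }

        // Returns true if the fragment at (x, y) may launch now; otherwise the
        // hardware would hold it until the overlapping fragment retires.
        bool try_launch(int x, int y) {
            return in_flight.insert(key(x, y)).second;
        }

        // Called when a fragment thread completes and its PO update is done.
        void retire(int x, int y) { in_flight.erase(key(x, y)); }
    };
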
GPU hardware can take responsibility for sophisticated scheduling decisions because semantics and invariants
of the graphics pipeline are known a priori. Hardware implementation enables
fine-granularity logic that is informed
by precise knowledge of both the graphics pipeline and the underlying GPU
implementation. As a result, GPUs are
highly efficient at using all available resources. The drawback of this approach
is that GPUs execute only those computations for which these invariants and
structures are known.
Graphics programming is becoming increasingly versatile. Developers
constantly seek to incorporate more
sophisticated algorithms and leverage
more configurable graphics pipelines.
Simultaneously, the growing popularity of GPU-based computing for nongraphics applications has led to new
interfaces for accessing GPU resources.
Given both of these trends, the extent
to which GPU designers can embed a
priori knowledge of computations into
hardware scheduling logic will inevitably decrease over time.
A major challenge in the evolution
of GPU programming involves preserving GPU performance levels and ease
of use while increasing the generality
and expressiveness of application interfaces. The designs of “GPU-compute”
interfaces, such as NVIDIA’s CUDA and
AMD’s CAL, are evidence of how difficult
this challenge is. These frameworks abstract computation as large batch operations that involve many invocations of
a kernel function operating in parallel.
The resulting computations execute on
GPUs efficiently only under conditions
of massive data parallelism. Programs
that attempt to implement non-data-parallel algorithms perform poorly.
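
For illustration, here is a minimal, generic CUDA example of this batch abstraction: the saxpy kernel is invoked once per array element, and the launch maps one thread to each invocation. The computation runs efficiently precisely because the invocations are independent and massively data-parallel; the function names are illustrative.

    // Batch abstraction: one kernel invocation per array element.
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this invocation's element
        if (i < n) y[i] = a * x[i] + y[i];
    }

    void run_saxpy(int n, float a, const float* x, float* y) {
        float *dx, *dy;
        cudaMalloc((void**)&dx, n * sizeof(float));
        cudaMalloc((void**)&dy, n * sizeof(float));
        cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;   // one thread per element
        saxpy<<<blocks, threads>>>(n, a, dx, dy);    // launch the batch

        cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dx);
        cudaFree(dy);
    }
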
GPU-compute programming models
are simple to use and permit well-written programs to make good use of both
GPU programmable cores and (if needed) texturing resources. Programs using
these interfaces, however, cannot use
powerful fixed-function components of
the chip, such as those related to compression, image compositing, or rasterization. Also, when these interfaces are
enabled, much of the logic specific to
graphics-pipeline scheduling is simply
turned off. Thus, current GPU-compute
programming frameworks significantly restrict computations so that their
structure, as well as their use of chip resources, remains sufficiently simple for
GPUs to run these programs in parallel.
GPU and CPU Convergence
The modern graphics processor is a powerful computing platform that resides
at the extreme end of the design space
of throughput-oriented architectures.
A GPU’s processing resources and accompanying memory system are heavily
optimized to execute large numbers of
operations in parallel. In addition, specialization to the graphics domain has
enabled the use of fixed-function processing and made hardware scheduling of a parallel computation practical. With this design, GPUs deliver
unsurpassed levels of performance to
challenging workloads while maintaining a simple and convenient programming interface for developers.
Today, commodity CPU designs are
adopting features common in GPU
computing, such as increased core
counts and hardware multithreading.
At the same time, each generation of
GPU evolution adds flexibility to previous high-throughput GPU designs. Given these trends, software developers in
many fields are likely to take interest in
the extent to which CPU and GPU architectures and, correspondingly, CPU and
GPU programming systems, ultimately
converge.
Kayvon Fatahalian (kayvonf@gmail.com) and Mike
Houston are Ph.D. candidates in computer science in the
Computer Graphics Laboratory at Stanford University.
A previous version of this article was published in the
March 2008 issue of ACM Queue.