the graphics pipeline are known a priori. Hardware implementation enables fine-granularity logic that is informed
by precise knowledge of both the graphics pipeline and
the underlying GPU implementation. As a result, GPUs
are highly efficient at using all available resources. The
drawback of this approach is that GPUs execute only
those computations for which these invariants and structures are known.
Graphics programming is becoming increasingly
versatile. Developers constantly seek to incorporate more
sophisticated algorithms and leverage more configurable
graphics pipelines. Simultaneously, the growing popularity of GPGPU (general-purpose computing using GPU
platforms) has led to new interfaces for accessing GPU
resources. Given both of these trends, the extent to which
GPU designers can embed a priori knowledge of computations into hardware scheduling logic will inevitably
decrease over time.
A major challenge in the evolution of GPU programming involves preserving GPU performance levels while
increasing the generality and expressiveness of application interfaces. The designs of GPGPU interfaces, such
as NVIDIA’s CUDA and AMD’s CAL, are evidence of how
difficult this challenge is. These frameworks abstract
computation as large batch operations that involve many
invocations of a kernel function operating in parallel. The
resulting computations execute on GPUs efficiently only
under conditions of massive data parallelism. Programs
that attempt to implement non-data-parallel algorithms
perform poorly.
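The kernel-per-element model described above can be illustrated with a small CPU-side sketch. This is not CUDA or CAL code; the names `saxpy_kernel`, `launch`, and `follow_chain` are hypothetical, and a thread pool merely stands in for the GPU's parallel cores. The first two functions show a batch operation made of independent kernel invocations; the third shows a computation whose steps depend on one another and so, as written, exposes no data parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_kernel(i, a, x, y, out):
    """Hypothetical kernel: one invocation computes exactly one output element."""
    out[i] = a * x[i] + y[i]


def launch(kernel, n, *args, workers=4):
    """Hypothetical batch launch: invoke `kernel` once for each index in 0..n-1.

    No invocation reads another invocation's result, so they may run in any
    order or concurrently. A thread pool stands in for GPU cores here.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all invocations and re-raises any worker exception.
        list(pool.map(lambda i: kernel(i, *args), range(n)))


def follow_chain(next_index, start, steps):
    """Counter-example: each step needs the previous step's result.

    Written this way, there is no batch of independent invocations to launch.
    """
    current = start
    for _ in range(steps):
        current = next_index[current]
    return current


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    y = [10.0, 20.0, 30.0, 40.0]
    out = [0.0] * len(x)
    launch(saxpy_kernel, len(x), 2.0, x, y, out)
    print(out)
    print(follow_chain([2, 0, 1], start=0, steps=3))
```

In the first case the work grows with the number of elements and can be spread across as many cores as are available; in the second, adding cores does not help because only one step is ever ready to run.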
GPGPU programming models are simple to use and
permit well-written programs to make good use of both
GPU programmable cores and (if needed) texturing
resources. Programs using these interfaces, however, cannot use powerful fixed-function components of the chip,
such as those related to compression, image compositing,
or rasterization. Also, when these interfaces are enabled,
much of the logic specific to graphics-pipeline scheduling
is simply turned off. Thus, current GPGPU programming
frameworks restrict computations so that their structure,
as well as their use of chip resources, remains sufficiently
simple for GPUs to run these programs in parallel.
GPU AND CPU CONVERGENCE
The modern graphics processor is a powerful computing
platform that resides at the extreme end of the design
space of throughput-oriented architectures. A GPU’s processing resources and accompanying memory system are
heavily optimized to execute large numbers of operations
in parallel. In addition, specialization to the graphics
domain has enabled the use of fixed-function processing
and made hardware scheduling of parallel computation practical. With this design, GPUs deliver unsurpassed levels of performance on challenging workloads
while maintaining a simple and convenient programming
interface for developers.
Today, commodity CPU designs are adopting features
common in GPU computing, such as increased core
counts and hardware multithreading. At the same time,
each generation of GPU evolution adds flexibility to previous high-throughput GPU designs. Given these trends,
software developers in many fields are likely to take
interest in the extent to which CPU and GPU architectures and, correspondingly, CPU and GPU programming
systems, ultimately converge.
KAYVON FATAHALIAN is a Ph.D. candidate in computer
science in the Computer Graphics Laboratory at Stanford
University. His research interests include programming
systems for commodity parallel architectures and computer
graphics/animation systems for the interactive and film
domains. His thesis research seeks to enable execution of
more flexible rendering pipelines on future GPUs and multicore PCs. He will soon be looking for a job.
MIKE HOUSTON is a Ph.D. candidate in computer science
in the Computer Graphics Laboratory at Stanford University.
His research interests include programming models, algorithms, and runtime systems for parallel architectures including GPUs, Cell, multicore CPUs, and clusters. His dissertation
includes the Sequoia runtime system, a system for programming hierarchical memory machines. He received his B.S. in
computer science from UCSD in 2001 and is a recipient of
the Intel Graduate Fellowship.
© 2008 ACM 1542-7730/08/0300 $5.00