a large factor, at least on appropriate workloads. 3 GPU
evolution has been driven by 3D rendering, an embarrassingly data-parallel problem, which makes the GPU an
excellent target for data-parallel code. As a result of this
significantly different workload design point (processing
model, I/O patterns, and locality of reference), the GPU
has a substantially different processor architecture and
memory subsystem design, typically featuring a broader
SIMD (single instruction, multiple data) width and a
higher-latency, higher-bandwidth streaming memory system. The processing model exposed via a graphics API is
a task-serial pipeline made up of a few data-parallel stages
that use no interthread communication mechanisms at
all. While separate stages appear for processing vertices or
pixels, the actual architecture is somewhat simpler.
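
The processing model described above can be illustrated with a small sketch. This is a toy model, not a graphics API: the names run_stage, vertex_kernel, and pixel_kernel are hypothetical, and the arithmetic inside each kernel is arbitrary. It only shows that each stage is a map over independent elements (data-parallel), and that one stage finishes before the next consumes its output (task-serial).

```python
# Toy model of a task-serial pipeline built from data-parallel stages.
# All names are hypothetical illustrations, not part of any real API.

def run_stage(kernel, elements):
    # Each output depends only on its own input element, so these calls
    # could run in any order, or all at once, with no communication
    # between them.
    return [kernel(e) for e in elements]

def vertex_kernel(v):
    # Arbitrary per-vertex work: scale and translate a 2D position.
    x, y = v
    return (2.0 * x + 1.0, 2.0 * y - 1.0)

def pixel_kernel(p):
    # Arbitrary per-pixel work: a brightness value clamped to [0, 1].
    x, y = p
    return max(0.0, min(1.0, 0.25 * (x + y)))

vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Stage 1 runs to completion before stage 2 starts (task-serial).
# A real pipeline would place an interpolator between the two stages;
# it is omitted here to keep the sketch short.
transformed = run_stage(vertex_kernel, vertices)
shaded = run_stage(pixel_kernel, transformed)
```

Because no kernel reads another element's result, processing the inputs in a different order yields the same per-element outputs, which is what lets the hardware spread the work across a wide processor array.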
[Figure 1: A Modern GPU. Labeled elements: input stream, input data arrays, texture sampler, triangle interpolator, processor array, output blender, output data array. Legend: data array, specialized unit, processor.]
As shown in figure 1, a modern DirectX10-class GPU
has a single array of processors that perform the computational
work of each stage in conjunction with specialized
hardware. After polygon-vertex processing, a specialized
hardware interpolator unit is used to turn each polygon
into pixels for the pixel-processing stage. This unit can
be thought of as an address generator. At the end of the
pipeline, another specialized unit blends completed pixels
into the image buffer. This hardware is often useful in
accumulating results into a destination array. Further, all
processing stages have access to a dedicated texture-sampling unit that performs linearly interpolated reads on
1D, 2D, or 3D source arrays in a variety of data-element
formats.
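
As a rough illustration of what two of these specialized units compute, the sketch below models a linearly interpolated read on a 1D source array and an additive blend into a destination array. It is a software analogy under stated assumptions: the function names sample_linear and blend_add are hypothetical, clamping at the array edges is only one of several possible edge policies, and additive blending is only one possible blend operation.

```python
import math

def sample_linear(array, u):
    """Linearly interpolated read of a 1D array at fractional index u."""
    # Assumption: out-of-range coordinates are clamped to the array edges.
    u = max(0.0, min(float(len(array) - 1), u))
    i0 = int(math.floor(u))               # element at or below u
    i1 = min(i0 + 1, len(array) - 1)      # next element, clamped at the end
    t = u - i0                            # fractional distance from i0 to i1
    return (1.0 - t) * array[i0] + t * array[i1]

def blend_add(dest, index, value):
    # Additive blend: accumulate the new value into the destination element.
    dest[index] += value

source = [0.0, 10.0, 20.0, 30.0]
dest = [0.0, 0.0]

# Two results land on the same destination element and are accumulated.
blend_add(dest, 0, sample_linear(source, 0.5))
blend_add(dest, 0, sample_linear(source, 2.25))
blend_add(dest, 1, sample_linear(source, 3.0))
```

In this sketch the blend step is what turns many independent per-element results into a sum at a shared location, which is the sense in which the blending hardware helps accumulate results into a destination array.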
Shaped by these special workload requirements, the
modern GPU has:
• Ten times the GFLOPS of CPU chips for similar price
and power consumption
• Thousands of threads distributed over hundreds of
single-precision floating-point ALUs
• A dedicated streaming-memory system with 10 times
the memory bandwidth of a CPU
• Dedicated memory capacity similar to the CPU system
memory capacity
• Specialized cores for filtering, blending, rasterizing, and
video processing
A GPU’s memory subsystem is designed to accept higher
I/O latency in exchange for increased throughput. It assumes
only very limited data reuse (locality in read/write access),
featuring small input and output caches designed more
as FIFO (first in, first out) buffers than as mechanisms to
avoid round-trips to memory.
Recent research has applied these processors to algorithms beyond 3D rendering, and some
applications have shown significant benefits
over CPU code. In general, those that most closely match
the original design workload of 3D graphics (such as
image processing) and can find a way to leverage either
the tenfold compute advantage or the tenfold bandwidth
advantage have done well. (Much of this work is cataloged on the Web at http://www.gpgpu.org.)
This research has identified interesting algorithms. For
example, compacting an array of variable-length records
is a task that has a data-parallel implementation based on the
parallel prefix sum, or scan. The prefix-sum algorithm
computes the sum of all previous array elements (i.e.,