streams, etc.) allows code to become compute-bound
instead of memory-bound (which can be as much as 10
times faster on GPUs). This is because memory is typically organized into pages, and there is some overhead in
switching between pages. Grouping data elements and
threads so that many results can be read from (or written
to) the same page helps with performance.
Many types of trees and other sparse-data structures
have data-parallel-friendly array-based implementations.
Although using these structures is quite conventional,
their implementations are nonintuitive to developers
trained on pointer-based schemes. 5
The most important characteristic of the GPU memory
subsystem is the cache architecture. Unlike a CPU, the
GPU has hardly any read/write cache. It is assumed that
so much data will be streaming through the processor
that it will overflow just about any cache. As a result,
the only caches present are separate read-through and
write-through buffers that smooth out the data flow.
Therefore, it is critical to select algorithms that do not
rely on reuse of data at scales larger than the few local
registers available. For example, histogram computation
requires more read/write storage to contain the histogram
bins than typical register allocation supports. Upcoming
GPU architectures are beginning to add read/write caches
so that more algorithms will work, including reasonably
sized histograms, but since these caches are still 10 to 100
times smaller than those on the CPU, this will remain a
key criterion when choosing an algorithm.
GPUS AS DATA-PARALLEL HARDWARE
GPU systems are cheap and widely available, and many
programmers (such as game developers) have identified
key approaches to programming them efficiently.
First, it can be important to leverage all the silicon
on the die. Applications that don’t light up the graphics-specific gates are already at a disadvantage compared
with a CPU. For example, Govindaraju’s sort implementations show significant benefits from using the blending
hardware. 6
Another way to ensure programming efficiency is
to keep the data elements small. This extra hardware
is assuming graphics data types that are optimal when
they are 16 or fewer bytes in size, and ideally four bytes.
If you can make your data look like what a GPU usually
processes, you will get large benefits.
Unfortunately, the GPU’s high-speed memory system
( 10 times faster throughput than the CPU front side bus)
is typically connected to the CPU by a link that is 10
times slower than CPU memory. Minimizing data and
control traffic through this link is vital to GPU performance in low-latency scenarios. The secret is to keep data
in the GPU’s memory as long as possible, bringing it back
to the CPU only for persistent storage. Sometimes this
may involve executing a small non-data-parallel task on
the GPU because the cost of sending the required data
across to the CPU, synchronizing it, and sending it back
may be even greater.
GPU GENERALIT Y
With shorter design cycles, GPUs have been evolving
more rapidly than CPUs. This evolution has typically
been in the direction of increased generality. Now we are
seeing GPU generality growing beyond the needs of basic
rendering to more general applications. For example, in
the past year new GPU environments have become available that expose features that the graphics APIs do not.
Some now support sharing of data among threads and
more flexible memory-access options.
This enables entirely new classes of algorithms on
GPUs. Most obviously, more general approaches to 3D
processing are becoming feasible, including manipulation
of acceleration data structures for ray tracing, radiosity,
or collision detection. Other obvious applications are in
media processing (photo, video, and audio data) where
the data types are similar to those of 3D rendering. Other
F undamental Algorithms Sorted by
Granularity of Parallelism
serial entropic encoding
sparse structure editing
sparse linear algebra
sparse structure creation
sparse structure query
dense linear algebra
video motion estimation
bitonic sort
histogram generation
reduce
explicit finite difference
image convolution
pixel lighting
map
granularity
coarse