a large factor, at least on appropriate workloads.[3] GPU evolution has been driven by 3D rendering, an embarrassingly data-parallel problem, which makes the GPU an excellent target for data-parallel code. As a result of this significantly different workload design point (processing model, I/O patterns, and locality of reference), the GPU has a substantially different processor architecture and memory subsystem design, typically featuring a broader SIMD (single instruction, multiple data) width and a higher-latency, higher-bandwidth streaming memory system. The processing model exposed via a graphics API is a task-serial pipeline made up of a few data-parallel stages that use no interthread communication mechanisms at all. While separate stages appear for processing vertices or pixels, the actual architecture is somewhat simpler.
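The processing model described above can be sketched in a few lines of Python. This is only an illustration, not how any graphics API is implemented: the stage functions `vertex_stage` and `pixel_stage` and the driver `run_pipeline` are hypothetical names, and the sketch leaves out the specialized hardware that sits between real stages. What it shows is the shape of the model: stages run one after another, while within a stage every element is processed independently.

```python
# Hypothetical stage functions. Each one sees a single element and nothing
# else, so the elements of a stage could be processed in any order or all
# at once; no element communicates with another.
def vertex_stage(v):
    x, y = v
    return (2.0 * x, 2.0 * y)   # e.g., a uniform scale transform


def pixel_stage(p):
    x, y = p
    return x + y                # e.g., reduce a coordinate pair to one value


def run_pipeline(stages, data):
    # Stages execute in sequence (task-serial) ...
    for stage in stages:
        # ... and each stage is an independent map over its input
        # (data-parallel). A GPU would spread this map across its
        # processor array; here it is an ordinary list comprehension.
        data = [stage(element) for element in data]
    return data


result = run_pipeline([vertex_stage, pixel_stage], [(1.0, 2.0), (3.0, 4.0)])
```

Because no stage function reads or writes anything outside its own element, the inner map can be split across any number of processors without synchronization.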
[Figure 1: A Modern GPU. Labeled elements: input data arrays, input stream, texture sampler, triangle interpolator, processor array, output blender, output data array. Legend: data array, specialized unit, processor.]

As shown in figure 1, a modern DirectX10-class GPU has a single array of processors that perform the computational work of each stage in conjunction with specialized hardware. After polygon-vertex processing, a specialized hardware interpolator unit is used to turn each polygon into pixels for the pixel-processing stage. This unit can be thought of as an address generator. At the end of the pipeline, another specialized unit blends completed pixels into the image buffer. This hardware is often useful in accumulating results into a destination array. Further, all processing stages have access to a dedicated texture-sampling unit that performs linearly interpolated reads on 1D, 2D, or 3D source arrays in a variety of data-element formats.
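To make the two specialized units concrete, here is a minimal software sketch of what they compute, restricted to the 1D case. The function names `sample_linear_1d` and `blend_add` are hypothetical, and the clamp-to-edge behavior and additive blend are assumptions chosen for the sketch; real hardware offers several addressing and blending modes.

```python
import math


def sample_linear_1d(array, coord):
    """Read `array` at a fractional index by blending the two nearest elements."""
    # Assumption of this sketch: out-of-range coordinates clamp to the edges.
    coord = min(max(coord, 0.0), len(array) - 1.0)
    i0 = int(math.floor(coord))
    i1 = min(i0 + 1, len(array) - 1)
    t = coord - i0                      # fractional distance from element i0
    return (1.0 - t) * array[i0] + t * array[i1]


def blend_add(dest, index, value):
    """Accumulate a completed value into the destination array (additive blend)."""
    dest[index] += value


source = [0.0, 10.0, 20.0]
midpoint = sample_linear_1d(source, 0.5)    # halfway between elements 0 and 1

accumulator = [0.0, 0.0]
blend_add(accumulator, 1, 2.5)
blend_add(accumulator, 1, 2.5)              # results accumulate in place
```

On a GPU both operations are performed by fixed-function hardware, so a data-parallel program gets the interpolation and the accumulation without spending processor-array instructions on them.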
Shaped by these special workload requirements, the modern GPU has:
• Ten times the GFLOPS of CPU chips for similar price and power consumption
• Thousands of threads distributed over hundreds of single-precision floating-point ALUs
• A dedicated streaming-memory system with 10 times the memory bandwidth of a CPU
• Dedicated memory capacity similar to the CPU system memory capacity
• Specialized cores for filtering, blending, rasterizing, and video processing
A GPU’s memory subsystem is designed for higher I/O latency to achieve increased throughput. It assumes only very limited data reuse (locality in read/write access), featuring small input and output caches designed more as FIFO (first in, first out) buffers than as mechanisms to avoid round-trips to memory.
Recent research has looked into applying these processors to algorithms beyond 3D rendering. Some applications have shown significant benefits over CPU code. In general, those that most closely match the original design workload of 3D graphics (such as image processing) and can leverage either the tenfold compute advantage or the tenfold bandwidth advantage have done well. (Much of this work is cataloged on the Web at http://www.gpgpu.org.)
This research has identified interesting algorithms. For example, compacting an array of variable-length records is a task that has a data-parallel implementation based on the parallel prefix sum, or scan. The prefix-sum algorithm computes the sum of all previous array elements (i.e.,