structions that execute within a processor-managed environment, called an execution (or thread) context. This context consists of state such as a program counter, a stack pointer, general-purpose registers, and virtual memory mappings. A single-core processor managing a single execution context can run one thread of control at a time. A multicore processor replicates processing resources (ALUs, control logic, and execution contexts) and organizes them into independent cores. When an application features multiple threads of control, multicore architectures provide increased throughput by executing these instruction streams on each core in parallel. For example, an Intel Core 2 Quad contains four cores and can execute four instruction streams simultaneously. As significant parallelism exists across shader invocations in a graphics pipeline, GPU designs easily push core counts higher.
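As a concrete illustration (a minimal sketch of our own, not from the article), the following C++ program creates four independent threads of control; on a quad-core chip such as the Core 2 Quad, the operating system can schedule each thread on its own core so that all four instruction streams execute in parallel.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // Four independent instruction streams; a quad-core CPU can run
    // all four simultaneously, one per core.
    std::vector<std::thread> workers;
    for (int core = 0; core < 4; ++core) {
        workers.emplace_back([core] {
            long long sum = 0;
            for (long long i = 0; i < 100000000; ++i) sum += i;  // independent work
            std::printf("stream %d finished (sum=%lld)\n", core, sum);
        });
    }
    for (auto& t : workers) t.join();  // wait for all streams to complete
    return 0;
}
```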
Even higher performance is possible by populating each core with multiple floating-point ALUs. This is done efficiently through SIMD (single instruction, multiple data) processing, where several ALUs perform the same operation on different pieces of data. SIMD processing amortizes the complexity of decoding an instruction stream and the cost of ALU control structures across multiple ALUs, resulting in both power- and area-efficient chip execution.
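To make the contrast concrete, here is a minimal sketch (ours, not the article's) of the same loop in scalar form and in 4-wide SIMD form using x86 SSE intrinsics; the SIMD version fetches and decodes one instruction per four additions.

```cpp
#include <immintrin.h>  // x86 SSE intrinsics

// Scalar form: one decoded instruction drives one ALU operation.
void add_scalar(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// SIMD form: one decoded instruction drives four ALU lanes at once,
// amortizing instruction decode and control across the lanes.
// (Assumes n is a multiple of 4, for brevity.)
void add_simd(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            // load four floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); // four adds, one instruction
    }
}
```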
The most common implementation
of SIMD processing is via explicit short-vector instructions, similar to those
provided by the x86 SSE or PowerPC AltiVec ISA extensions. These extensions
provide instructions that control the
operation of four ALUs (SIMD width
of 4). Alternatively, most GPUs realize
the benefits of SIMD execution by
implicitly sharing an instruction stream
across threads with identical PCs. In
this implementation, the SIMD width
of the machine is not explicitly made
visible to the programmer. CPU designers have chosen a SIMD width of four as
a balance between providing increased
throughput and retaining high single-threaded performance. Characteristics
of the shading workload make it beneficial for GPUs to employ significantly
wider SIMD processing (widths ranging
from 32 to 64) and to support a rich set
of operations. It is common for GPUs
to support SIMD implementations of
reciprocal square root, trigonometric
functions, and memory gather/scatter
operations.
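On the CPU side, the explicit short-vector style looks like the sketch below (our example): SSE exposes even the approximate reciprocal square root as a 4-wide instruction, _mm_rsqrt_ps, so one instruction performs the operation on four values.

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    // Four reciprocal square roots computed by a single SSE instruction.
    // (_mm_rsqrt_ps is a fast approximation, accurate to roughly 12 bits.)
    __m128 x = _mm_set_ps(16.0f, 9.0f, 4.0f, 1.0f);  // highest to lowest lane
    __m128 r = _mm_rsqrt_ps(x);

    float out[4];
    _mm_storeu_ps(out, r);
    std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    // Expected (approximately): 1.0 0.5 0.333 0.25
    return 0;
}
```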
The efficiency of wide SIMD processing allows GPUs to pack many cores densely with ALUs. For example, the NVIDIA GeForce GTX 280 GPU contains 240 ALUs operating at 1.3 GHz. These ALUs are organized into 30 processing cores and yield a peak rate of 933 GFLOPS. In comparison, a high-end 3 GHz Intel Core 2 Quad CPU contains four cores, each with eight SIMD floating-point ALUs (two 4-wide vector instructions per clock), and is capable of, at most, 96 GFLOPS of peak performance.
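The peak numbers follow from simple arithmetic: ALUs per chip, times clock rate, times floating-point operations per ALU per clock. The sketch below reproduces both figures under the assumption (ours, not the article's) that each GTX 280 ALU can retire three flops per clock via dual-issued multiply-add plus multiply, while each CPU SIMD ALU retires one.

```cpp
#include <cstdio>

int main() {
    // Peak rate = ALUs/chip * clock (Hz) * flops per ALU per clock.
    // The 3 flops/clock factor for the GPU is an assumption (MAD + MUL
    // dual issue); CPU ALUs are assumed to retire 1 flop per clock.
    double gpu_peak = 30 * 8 * 1.3e9 * 3;  // 30 cores * 8 ALUs/core
    double cpu_peak = 4 * 8 * 3.0e9 * 1;   // 4 cores * 8 ALUs/core
    std::printf("GPU peak: ~%.0f GFLOPS\n", gpu_peak / 1e9);  // ~936 (quoted: 933, at 1.296 GHz)
    std::printf("CPU peak: ~%.0f GFLOPS\n", cpu_peak / 1e9);  // 96
    return 0;
}
```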
Recall that a shader function defines
processing on a single pipeline entity.
GPUs execute multiple invocations of
the same shader function in parallel
to take advantage of SIMD processing.

Table 1. Tale of the tape: throughput architectures.

Type  Processor                Cores/chip  ALUs/core [3]  SIMD width  Max T [4]
GPUs  AMD Radeon HD 4870           10           80            64          25
      NVIDIA GeForce GTX 280       30            8            32         128
CPUs  Intel Core 2 Quad [1]         4            8             4           1
      STI Cell BE [2]               8            4             4           1
      Sun UltraSPARC T2             8            1             1           4

[1] SSE processing only; does not account for the traditional FPU.
[2] Stream processing (SPE) cores only; does not account for the PPU cores.
[3] 32-bit floating-point operations.
[4] Max T is defined as the maximum ratio of hardware-managed thread execution contexts to simultaneously executable threads (not an absolute count of hardware-managed execution contexts). This ratio is a measure of a processor's ability to automatically hide thread stalls using hardware multithreading.

Dynamic per-entity control flow is
implemented by executing all control
paths taken by the shader invocations
in the group. SIMD operations that do
not apply to all invocations, such as
those within shader code conditional or
loop blocks, are partially nullified using
write-masks. In this implementation,
when shader control flow diverges, fewer SIMD ALUs do useful work. Thus, on
a chip with width-S SIMD processing,
worst-case behavior yields performance equal to 1/S of the chip's peak rate. Fortunately, shader workloads exhibit
sufficient levels of instruction stream
sharing to justify wide SIMD implementations. Additionally, GPU ISAs contain
special instructions that make it possible for shader compilers to transform
per-entity control flow into efficient
sequences of explicit or implicit SIMD
operations.
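The following toy model (a sketch of the mechanism, assuming an 8-wide machine; real GPUs implement this in hardware) shows how both sides of a per-entity branch are issued for the whole group, with a write-mask nullifying lanes on the path they did not take.

```cpp
#include <array>
#include <cstdio>

constexpr int S = 8;                 // SIMD width of this toy machine
using Vec  = std::array<float, S>;
using Mask = std::array<bool, S>;

// Both control paths execute for the whole group; the write-mask
// nullifies lanes to which a path does not apply. In the worst case
// only one lane per path does useful work: 1/S of peak throughput.
Vec shade(const Vec& x) {
    Vec out{};
    Mask mask{};
    for (int i = 0; i < S; ++i) mask[i] = (x[i] > 0.5f);  // per-lane condition

    for (int i = 0; i < S; ++i)      // "if" path, masked
        if (mask[i]) out[i] = x[i] * 2.0f;

    for (int i = 0; i < S; ++i)      // "else" path, masked with the complement
        if (!mask[i]) out[i] = x[i] + 1.0f;
    return out;
}

int main() {
    Vec x{0.1f, 0.9f, 0.3f, 0.7f, 0.2f, 0.8f, 0.4f, 0.6f};
    Vec y = shade(x);
    for (float v : y) std::printf("%.2f ", v);
    std::printf("\n");
    return 0;
}
```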
Hardware Multithreading = High ALU
Utilization. Thread stalls pose an additional challenge to high-performance
shader execution. Threads stall (or
block) when the processor cannot dispatch the next instruction in an instruction stream due to a dependency on an
outstanding instruction. High-latency
off-chip memory accesses, most notably those generated by texture access
operations, cause thread stalls lasting
hundreds of cycles (recall that while
shader input and output records lend
themselves to streaming prefetch, texture accesses do not).
Allowing ALUs to remain idle while a thread is stalled
is inefficient. Instead, GPUs maintain
more execution contexts on chip than
they can simultaneously execute, and
they perform instructions from runnable threads when others are stalled.
Hardware scheduling logic determines
which context(s) to execute in each processor cycle. This technique of overprovisioning cores with thread contexts to
hide the latency of thread stalls is called
hardware multithreading. GPUs use
multithreading as the primary mechanism to hide both memory access and
instruction pipeline latencies.
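The scheduling idea can be sketched in a few lines (a toy model with assumed numbers: a 30-cycle memory stall after every 10th instruction, and four contexts sharing one issue slot). Each cycle, the scheduler issues from any context that is not waiting on an outstanding access; with enough contexts, the issue slot stays busy nearly every cycle.

```cpp
#include <cstdio>
#include <vector>

struct Context {
    int  pc = 0;        // next instruction to issue
    long ready_at = 0;  // context is stalled until this cycle
};

int main() {
    const int  INSTRS      = 100;  // instructions per thread
    const long MEM_LATENCY = 30;   // assumed stall after every 10th instruction
    std::vector<Context> ctx(4);   // 4 contexts, 1 issue slot per cycle (T = 4)

    long cycle = 0, issued = 0;
    const long total = INSTRS * (long)ctx.size();
    while (issued < total) {
        for (auto& c : ctx) {                       // scan for a runnable context
            if (c.pc >= INSTRS || cycle < c.ready_at) continue;
            ++c.pc;
            if (c.pc % 10 == 0) c.ready_at = cycle + MEM_LATENCY;  // "texture fetch"
            ++issued;
            break;                                  // one issue slot per cycle
        }
        ++cycle;
    }
    // With four contexts, each thread's stalls overlap with the others'
    // work and utilization approaches 100%; try ctx(1) to see it drop sharply.
    std::printf("issue-slot utilization: %.0f%%\n", 100.0 * issued / cycle);
    return 0;
}
```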
The amount of stall latency a GPU can tolerate via multithreading depends on the ratio of hardware thread
contexts to the number of threads that
are simultaneously executed in a clock
(we refer to this ratio as T). Support for