Supporting more thread contexts allows the GPU to hide longer or more frequent stalls. All
modern GPUs maintain large numbers
of execution contexts on chip to provide
maximal memory latency-hiding ability
(T reaches 128 in modern GPUs—see
the table). This represents a significant
departure from CPU designs, which attempt to avoid or minimize stalls primarily using large, low-latency data
caches and complicated out-of-order
execution logic. Current Intel Core 2
and AMD Phenom processors maintain
one thread per core, and even high-end
models of Sun’s multithreaded UltraSPARC T2 processor manage only four
times the number of threads they can
simultaneously execute.
Note that in the absence of stalls, the
throughput of single- and multithreaded
processors is equivalent. Multithreading does not increase the number of
processing resources on a chip. Rather,
it is a strategy that interleaves execution
of multiple threads to use existing resources more efficiently, improving throughput. On average, a multithreaded core operating at its peak rate runs
each thread 1/T of the time.
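This arithmetic can be made concrete with a back-of-the-envelope model (a sketch, not from the article; the assumption that every thread alternates a fixed run of compute cycles with a fixed-latency stall is mine, with numbers borrowed from the sidebar later in this section):

```python
def alu_utilization(compute_cycles, stall_cycles, num_threads):
    """Fraction of time the core's ALUs stay busy when num_threads
    contexts are interleaved (idealized, uniform workload)."""
    # A single thread keeps the ALUs busy for compute_cycles out of
    # every (compute_cycles + stall_cycles) cycles; interleaving T
    # threads multiplies that fraction, capped at 100%.
    busy_fraction = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, num_threads * busy_fraction)

# 20 cycles of math between 50-cycle stalls: one thread yields ~29%
# utilization, while four threads are enough to reach the peak rate.
print(alu_utilization(20, 50, 1))   # 0.2857...
print(alu_utilization(20, 50, 4))   # 1.0
```

At the saturation point the model agrees with the 1/T observation above: each of the four threads occupies the ALUs a quarter of the time.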
To achieve large-scale multithreading, execution contexts must be compact. The number of thread contexts
supported by a GPU core is limited by
the size of on-chip execution context
storage. GPUs require compiled shader
binaries to statically declare input and
output entity sizes, as well as bounds on
temporary storage and scratch registers
required for their execution. At runtime,
GPUs use these bounds to dynamically
partition on-chip storage (including
data registers) to support the maximum
possible number of threads. As a result, the latency-hiding ability of a GPU is shader-dependent. GPUs can manage many thread contexts (and provide
maximal latency-hiding ability) when
shaders use fewer resources. When
shaders require large amounts of storage, the number of execution contexts
(and latency-hiding ability) provided by
a GPU drops.
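The following sketch illustrates this partitioning trade-off. The function and its register counts are hypothetical (the 16-register pool happens to match the example core in the sidebar), not the allocation scheme of any particular GPU:

```python
def max_thread_contexts(pool_registers, shader_registers, hw_max):
    """Thread contexts a core can hold when its register pool is
    divided statically among threads (illustrative model only)."""
    return min(hw_max, pool_registers // shader_registers)

# A lean shader leaves room for many contexts (good latency hiding);
# a register-hungry shader squeezes the count down.
print(max_thread_contexts(16, shader_registers=2, hw_max=8))  # -> 8
print(max_thread_contexts(16, shader_registers=8, hw_max=8))  # -> 2
```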
Fixed-Function Processing Resources
A GPU's programmable cores interoperate with a collection of specialized fixed-function processing units that provide high-performance, power-efficient implementations of nonshader stages.
Running a Fragment Shader on a GPU Core
Shader compilation to SIMD (single instruction, multiple data) instruction sequences, coupled with dynamic hardware thread scheduling, leads to efficient execution of a fragment shader on the simplified single-core GPU shown in Figure A.
[Figure A: Example GPU core, showing 32 ALUs (SIMD operation), a general register file (r0 through r15) partitioned among threads, and four execution (thread) contexts (t0 through t3).]

- The core executes an instruction from at most one thread each processor clock, but maintains state for four threads on-chip simultaneously (T = 4).
- Core threads issue explicit width-32 SIMD vector instructions; 32 ALUs simultaneously execute a vector instruction in a single clock.
- The core contains a pool of 16 general-purpose vector registers (each containing a vector of 32 single-precision floats) partitioned among thread contexts.
- The only source of thread stalls is texture access; texture operations have a maximum latency of 50 cycles.
Shader compilation by the graphics driver produces a GPU binary from high-level fragment shader source. The resulting vector instruction sequence performs 32 invocations of the fragment shader simultaneously by carrying out each invocation in a single lane of the width-32 vectors. The compiled binary requires four vector registers for temporary results and contains 20 arithmetic instructions between each texture access operation.
At runtime, the GPU executes a copy of the shader binary on each of its four thread
contexts, as illustrated in Figure B. The core executes T0 (thread 0) until it detects
a stall resulting from texture access in cycle 20. While T0 waits for the result of the
texturing operation, the core continues to execute its remaining three threads.
The result of T0’s texture access becomes available in cycle 70. Upon T3’s stall in
cycle 80, the core immediately resumes T0. Thus, at no point during execution are
ALUs left idle.
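This schedule can be reproduced with a short simulation. The run length (20 instructions), texture latency (50 cycles), and thread count (4) come from the example above; the scheduling policy (resume the lowest-numbered ready thread) is an assumption of the sketch, though it reproduces the trace in Figure B exactly:

```python
RUN_CYCLES, TEX_LATENCY, NUM_THREADS = 20, 50, 4

ready_at = [0] * NUM_THREADS   # cycle when each thread can run again
cycle = 0
while cycle < 160:
    runnable = [t for t in range(NUM_THREADS) if ready_at[t] <= cycle]
    if runnable:
        t = runnable[0]                    # resume first ready thread
        print(f"cycle {cycle:3d}: T{t} runs {RUN_CYCLES} instructions")
        cycle += RUN_CYCLES                # run until the texture stall
        ready_at[t] = cycle + TEX_LATENCY  # result arrives 50 cycles on
    else:
        cycle += 1                         # all stalled: ALUs sit idle

# Output: T0 at cycle 0, T1 at 20, T2 at 40, T3 at 60, then T0 again
# at cycle 80 (its texture result arrived at 70). The else branch is
# never taken: the ALUs are never idle.
```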
When executing the shader program for this example, a minimum of four threads
is needed to keep core ALUs busy. Each thread operates simultaneously on 32
fragments; thus, 4 × 32 = 128 fragments are required for the chip to achieve peak
performance. As memory latencies on real GPUs involve hundreds of cycles,
modern GPUs must contain support for significantly more threads to sustain
high utilization. If we extend our simple GPU to a more realistic size of 16
processing cores and provision each core with storage for 16 execution contexts,
then simultaneous processing of 8,192 fragments is needed to approach peak
processing rates. Clearly, GPU performance relies heavily on the abundance of
parallel shading work.
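The required degree of parallelism is simply the product of the machine's dimensions (all figures from the text above):

```python
simd_width = 32   # fragments per thread, one per vector lane
contexts   = 16   # execution contexts stored per core
cores      = 16   # processing cores on the chip
print(simd_width * contexts * cores)   # -> 8192 fragments in flight
```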
[Figure B: Thread execution on the example GPU core. The timeline marks cycles 0 through 80; each thread t0 through t3 is shown at each point as executing, ready (not executing), or stalled.]