Running a Fragment Shader on a GPU Core
Shader compilation to SIMD (single instruction, multiple
data) instruction sequences coupled with dynamic hardware
thread scheduling leads to efficient execution of a fragment
shader on the simplified single-core GPU shown in figure A.
• The core executes an instruction from at most one thread
each processor clock, but maintains state for four threads
on-chip simultaneously (T = 4).
• Core threads issue explicit width-32 SIMD vector instructions;
32 ALUs simultaneously execute a vector instruction
in a single clock.
• The core has a pool of 16 general-purpose vector registers
(R0 to R15) that are partitioned among thread contexts.
The elements of each length-32 vector are 32-bit values.
• The only source of thread stalls is texture access; each
access has a maximum latency of 50 cycles.
Shader compilation by the graphics driver produces a
GPU binary from a high-level fragment shader source. The
resulting vector instruction sequence performs 32 invocations
of the fragment shader simultaneously by carrying out
each invocation in a single lane of the width-32 vectors. The
compiled binary requires four vector registers for temporary
results and contains 20 arithmetic instructions between
consecutive texture access operations.
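The register budget described above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the function names are illustrative assumptions, and the numbers are the ones given for the example core. It suggests why the 16-register pool accommodates exactly four thread contexts when each compiled shader needs four vector registers.

```python
# Parameters of the example core, as stated in the text.
TOTAL_VECTOR_REGISTERS = 16   # R0 to R15
REGISTERS_PER_THREAD = 4      # temporaries required by the compiled shader
SIMD_WIDTH = 32               # elements per vector register
BYTES_PER_ELEMENT = 4         # each element is a 32-bit value


def max_resident_threads(total_regs: int, regs_per_thread: int) -> int:
    """Thread contexts whose registers fit in the partitioned register file."""
    return total_regs // regs_per_thread


def register_file_bytes(total_regs: int, simd_width: int, bytes_per_element: int) -> int:
    """Total storage held by the vector register file."""
    return total_regs * simd_width * bytes_per_element


print(max_resident_threads(TOTAL_VECTOR_REGISTERS, REGISTERS_PER_THREAD))          # 4
print(register_file_bytes(TOTAL_VECTOR_REGISTERS, SIMD_WIDTH, BYTES_PER_ELEMENT))  # 2048
```

Under this model, a shader that needed eight registers would leave room for only two resident threads on the same core.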
At runtime, the GPU executes a copy of the shader binary
on each of its four thread contexts, as illustrated in figure
B. The core executes T0 (thread 0) until it detects a stall
resulting from texture access in cycle 20. While T0 waits for
the result of the texturing operation, the core continues to
execute its remaining three threads. The result of T0’s texture
access becomes available in cycle 70. Upon T3’s stall in cycle
80, the core immediately resumes T0. Thus, at no point during execution are ALUs left idle.
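The schedule described above can be reproduced with a small cycle-level simulation. This is a sketch under simplifying assumptions that go beyond the text: issuing the texture access consumes no ALU cycle, every access takes the full 50-cycle latency, and the core picks the ready thread that has been ready longest. The function name and its parameters are hypothetical.

```python
def simulate_core(num_threads, arith_per_texture=20, texture_latency=50,
                  total_cycles=400):
    """Return (ALU utilization, list of (thread, cycle at which its stall is detected))."""
    ready_at = [0] * num_threads                    # first cycle each thread may issue
    remaining = [arith_per_texture] * num_threads   # arithmetic left before next texture access
    current = None                                  # thread currently issuing instructions
    busy_cycles = 0
    stall_log = []
    for cycle in range(total_cycles):
        if current is None:
            ready = [t for t in range(num_threads) if ready_at[t] <= cycle]
            if ready:
                # Assumed policy: resume the thread that became ready earliest.
                current = min(ready, key=lambda t: ready_at[t])
        if current is None:
            continue  # all threads are waiting on texture results; ALUs idle
        busy_cycles += 1
        remaining[current] -= 1
        if remaining[current] == 0:
            # Texture access follows the last arithmetic instruction;
            # the stall is detected at the start of the next cycle.
            stall_log.append((current, cycle + 1))
            ready_at[current] = cycle + 1 + texture_latency
            remaining[current] = arith_per_texture
            current = None
    return busy_cycles / total_cycles, stall_log


utilization, stalls = simulate_core(num_threads=4)
print(utilization)   # 1.0
print(stalls[:4])    # [(0, 20), (1, 40), (2, 60), (3, 80)]
print(simulate_core(num_threads=3)[0])   # 0.875
```

With four threads the stalls fall at cycles 20, 40, 60, and 80, and T0 (ready since cycle 70) resumes at cycle 80, matching the description. With only three threads, the same model leaves the ALUs idle for 10 of every 70 cycles.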
When executing the shader program for this example, a
minimum of four threads is needed to keep core ALUs busy.
Each thread operates simultaneously on 32 fragments; thus,
4 × 32 = 128 fragments are required for the chip to achieve
peak performance.
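The four-thread minimum follows from the ratio of texture latency to arithmetic work. As a rough sketch (the formula below is inferred from the example's numbers and is not stated in the text; the function name is hypothetical), the other threads must supply at least 50 cycles of arithmetic while one thread waits:

```python
import math


def min_threads_to_hide_latency(arith_per_texture: int, texture_latency: int) -> int:
    """One stalled thread plus enough others to cover its texture latency."""
    return 1 + math.ceil(texture_latency / arith_per_texture)


SIMD_WIDTH = 32  # fragments processed per thread

threads = min_threads_to_hide_latency(arith_per_texture=20, texture_latency=50)
print(threads)               # 4
print(threads * SIMD_WIDTH)  # 128
```

Under the same assumption, a shader with only 10 arithmetic instructions between texture accesses would need six threads, which illustrates how arithmetic-light shaders demand more resident contexts.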
As memory latencies on real GPUs involve hundreds of
cycles, modern GPUs must contain support for significantly
more threads to sustain high utilization. If we extend our
simple GPU to a more realistic size of eight processing cores
and provision each core with storage for 16 execution
contexts, then simultaneous processing of 4,096 fragments
is needed to approach peak processing rates. Clearly, GPU
performance relies heavily on the abundance of parallel
shading work.
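The 4,096 figure is the product of cores, contexts per core, and SIMD width. A minimal sketch of that arithmetic (the function name is an illustrative assumption):

```python
def fragments_in_flight(cores: int, contexts_per_core: int, simd_width: int) -> int:
    """Fragments that must be processed simultaneously to fill every context."""
    return cores * contexts_per_core * simd_width


print(fragments_in_flight(cores=1, contexts_per_core=4, simd_width=32))   # 128
print(fragments_in_flight(cores=8, contexts_per_core=16, simd_width=32))  # 4096
```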
[Figure A: Example GPU Core. Shows 32 ALUs (SIMD operation, lanes 0 to 31), four execution (thread) contexts (T0 to T3), and the general register file (R0 to R15, partitioned among threads).]
[Figure B: Thread Execution on the Example GPU Core. Timeline over cycles 0 to 80 showing T0 to T3 as executing, ready (not executing), or stalled, with stalls at cycles 20, 40, 60, and 80.]