Running a Fragment Shader on a GPU Core
Shader compilation to SIMD (single instruction, multiple data) instruction sequences, coupled with dynamic hardware thread scheduling, leads to efficient execution of a fragment shader on the simplified single-core GPU shown in figure A.
• The core executes an instruction from at most one thread each processor clock, but maintains state for four threads on-chip simultaneously (T = 4).
• Core threads issue explicit width-32 SIMD vector instructions; 32 ALUs simultaneously execute a vector instruction in a single clock.
• The core has a pool of 16 general-purpose vector registers (R0 to R15) that are partitioned among thread contexts. The elements of each length-32 vector are 32-bit values.
• The only source of thread stalls is texture access, which has a maximum latency of 50 cycles.
Shader compilation by the graphics driver produces a GPU binary from a high-level fragment shader source. The resulting vector instruction sequence performs 32 invocations of the fragment shader simultaneously by carrying out each invocation in a single lane of the width-32 vectors. The compiled binary requires four vector registers for temporary results and contains 20 arithmetic instructions between each texture access operation.
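The lane mapping and register budget described above can be sketched in a few lines of Python. This is only an illustrative model, not the driver's or hardware's actual behavior: the function names `contexts_that_fit` and `vector_madd` are hypothetical, and the multiply-add stands in for whatever arithmetic the compiled shader really contains. The text does not say that registers are what limits the core to four contexts; the sketch merely shows that 16 registers at four per thread is consistent with T = 4.

```python
SIMD_WIDTH = 32       # lanes per vector instruction (from the text)
REGISTER_POOL = 16    # vector registers R0..R15 (from the text)
REGS_PER_THREAD = 4   # temporaries needed by the compiled shader (from the text)


def contexts_that_fit(register_pool, regs_per_thread):
    """Number of thread contexts whose registers fit in the shared pool."""
    return register_pool // regs_per_thread


def vector_madd(a, b, c):
    """Model one vector instruction: lane i computes a[i] * b[i] + c[i].

    Each lane holds the data of one fragment shader invocation, so a single
    instruction advances SIMD_WIDTH invocations at once. The multiply-add
    itself is an assumed example operation.
    """
    if not (len(a) == len(b) == len(c) == SIMD_WIDTH):
        raise ValueError("every operand must have one element per lane")
    return [x * y + z for x, y, z in zip(a, b, c)]


print(contexts_that_fit(REGISTER_POOL, REGS_PER_THREAD))  # 4
```

Calling `vector_madd` with three length-32 lists returns one result per fragment, which is the sense in which a single instruction stream serves 32 shader invocations.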
At runtime, the GPU executes a copy of the shader binary on each of its four thread contexts, as illustrated in figure B. The core executes T0 (thread 0) until it detects a stall resulting from texture access in cycle 20. While T0 waits for the result of the texturing operation, the core continues to execute its remaining three threads. The result of T0’s texture access becomes available in cycle 70. Upon T3’s stall in cycle 80, the core immediately resumes T0. Thus, at no point during execution are ALUs left idle.
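The timeline in this paragraph can be reproduced with a small cycle-level simulation. The sketch below makes several simplifying assumptions that the text does not spell out: a stalled thread is detected in the cycle after its last arithmetic instruction, issuing the texture access costs no ALU cycle, every access takes the full 50-cycle latency, and the core picks the next ready thread in round-robin order. The names `simulate` and `utilization` are hypothetical.

```python
def simulate(num_threads=4, arith_per_texture=20, texture_latency=50,
             total_cycles=160):
    """Return (schedule, stalls) for a single core.

    schedule[c] is the thread that issues in cycle c, or None if the ALUs idle.
    stalls[t] lists the cycles in which thread t stalled on a texture access.
    """
    ready_at = [0] * num_threads                    # earliest cycle each thread may issue
    remaining = [arith_per_texture] * num_threads   # arithmetic left before next texture access
    stalls = [[] for _ in range(num_threads)]
    schedule = []
    current, last = None, -1

    for cycle in range(total_cycles):
        # The running thread has reached its texture access: it stalls now.
        if current is not None and remaining[current] == 0:
            stalls[current].append(cycle)
            ready_at[current] = cycle + texture_latency
            remaining[current] = arith_per_texture
            last, current = current, None

        # Pick the next ready thread, round-robin after the last one that ran.
        if current is None:
            for step in range(1, num_threads + 1):
                candidate = (last + step) % num_threads
                if ready_at[candidate] <= cycle:
                    current = candidate
                    break

        schedule.append(current)
        if current is not None:
            remaining[current] -= 1

    return schedule, stalls


def utilization(schedule):
    """Fraction of cycles in which some thread issued an instruction."""
    return sum(t is not None for t in schedule) / len(schedule)


schedule, stalls = simulate()
print([s[0] for s in stalls])   # [20, 40, 60, 80]
print(schedule[80])             # 0: T0 resumes when T3 stalls
print(utilization(schedule))    # 1.0
```

Under these assumptions the defaults match the description: the first stalls fall in cycles 20, 40, 60, and 80, and T0 issues again in cycle 80. Rerunning with `num_threads=3` leaves idle cycles, because two other threads supply only 40 cycles of work against a 50-cycle latency.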
When executing the shader program for this example, a minimum of four threads is needed to keep core ALUs busy. Each thread operates simultaneously on 32 fragments; thus, 4 × 32 = 128 fragments are required for the chip to achieve peak performance.
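The minimum of four threads follows from a short calculation: while one thread waits up to 50 cycles, each of the other threads can contribute 20 cycles of arithmetic, so three other threads are needed to cover the wait. The sketch below encodes that reasoning; the function names are hypothetical, and the formula assumes the simplified model of this example (texture access is the only stall, and every access takes the maximum latency).

```python
def min_threads(texture_latency, arith_per_texture):
    """Smallest thread count that keeps the ALUs busy in the simplified model.

    One thread is stalled; the others must together supply at least
    texture_latency cycles of arithmetic work.
    """
    # Ceiling division using integers only.
    covering_threads = -(-texture_latency // arith_per_texture)
    return 1 + covering_threads


def fragments_for_peak(threads, simd_width=32):
    """Fragments that must be in flight when every thread fills all lanes."""
    return threads * simd_width


threads = min_threads(texture_latency=50, arith_per_texture=20)
print(threads)                      # 4
print(fragments_for_peak(threads))  # 128
```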
As memory latencies on real GPUs involve hundreds of cycles, modern GPUs must contain support for significantly more threads to sustain high utilization. If we extend our simple GPU to a more realistic size of eight processing cores and provision each core with storage for 16 execution contexts, then simultaneous processing of 4,096 fragments is needed to approach peak processing rates. Clearly, GPU performance relies heavily on the abundance of parallel shading work.
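The 4,096 figure is the product of cores, contexts per core, and SIMD width. The sketch below checks that arithmetic and, as a rough illustration, estimates how much latency a given number of contexts could hide. The second function is an assumption layered on the text: it reuses the example shader's 20 arithmetic instructions per texture access, which need not hold for real workloads, and both function names are hypothetical.

```python
def fragments_in_flight(cores, contexts_per_core, simd_width):
    """Fragments processed simultaneously when every context is occupied."""
    return cores * contexts_per_core * simd_width


def hideable_latency(contexts_per_core, arith_per_texture):
    """Cycles of stall one thread can wait while the others keep the ALUs busy.

    Assumes the simplified model: each of the other contexts supplies
    arith_per_texture cycles of work before it stalls in turn.
    """
    return (contexts_per_core - 1) * arith_per_texture


print(fragments_in_flight(cores=8, contexts_per_core=16, simd_width=32))  # 4096
print(hideable_latency(contexts_per_core=16, arith_per_texture=20))       # 300
print(hideable_latency(contexts_per_core=4, arith_per_texture=20))        # 60
```

With the example shader, 16 contexts would cover roughly 300 cycles of latency, compared with 60 cycles for the four-context core, which is consistent with the need for more contexts when latencies reach hundreds of cycles.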
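Figure A: The simplified single-core GPU: 32 ALUs (SIMD operation, lanes 0 to 31), a general register file (R0 to R15) partitioned among threads, and four execution (thread) contexts, T0 to T3.
Figure B: Execution timeline for threads T0 to T3 over cycles 0 to 80. Each thread is shown as executing, ready (not executing), or stalled; stalls are marked at cycles 20, 40, 60, and 80.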