Running a Fragment Shader on a GPU Core

Shader compilation to SIMD (single instruction, multiple data) instruction sequences coupled with dynamic hardware thread scheduling leads to efficient execution of a fragment shader on the simplified single-core GPU shown in figure A.

• The core executes an instruction from at most one thread each processor clock, but maintains state for four threads on-chip simultaneously (T = 4).

• Core threads issue explicit width-32 SIMD vector instructions; 32 ALUs simultaneously execute a vector instruction in a single clock.

• The core has a pool of 16 general-purpose vector registers (R0 to R15) that are partitioned among thread contexts. The elements of each length-32 vector are 32-bit values.

• Texture access is the only source of thread stalls; each access has a maximum latency of 50 cycles.

Shader compilation by the graphics driver produces a GPU binary from a high-level fragment shader source. The resulting vector instruction sequence performs 32 invocations of the fragment shader simultaneously by carrying out each invocation in a single lane of the width-32 vectors. The compiled binary requires four vector registers for temporary results and contains 20 arithmetic instructions between consecutive texture access operations.
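The lane-per-invocation mapping can be illustrated with a minimal sketch. The shader body (albedo * light + ambient) and the helper names vec_mul, vec_add, shade_scalar, and shade_vector are hypothetical, chosen only for illustration; Python lists stand in for the width-32 vector registers.

```python
SIMD_WIDTH = 32  # lanes per vector instruction on the example core

def vec_mul(a, b):
    # One vector instruction: all 32 lanes multiply in the same clock.
    return [x * y for x, y in zip(a, b)]

def vec_add(a, b):
    # One vector instruction: all 32 lanes add in the same clock.
    return [x + y for x, y in zip(a, b)]

def shade_scalar(albedo, light, ambient):
    # Hypothetical shader as written for a single fragment.
    return albedo * light + ambient

def shade_vector(albedo, light, ambient):
    # Same shader after compilation to vector instructions:
    # lane i carries invocation i, so two instructions shade 32 fragments.
    r0 = vec_mul(albedo, light)
    r1 = vec_add(r0, ambient)
    return r1

# Illustrative inputs: one value per fragment (lane).
albedo = [i for i in range(SIMD_WIDTH)]
light = [2] * SIMD_WIDTH
ambient = [1] * SIMD_WIDTH

vector_result = shade_vector(albedo, light, ambient)
scalar_result = [shade_scalar(a, l, m) for a, l, m in zip(albedo, light, ambient)]
```

Both paths compute the same 32 results; the vector form simply issues each instruction once for all lanes.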

At runtime, the GPU executes a copy of the shader binary on each of its four thread contexts, as illustrated in figure B. The core executes T0 (thread 0) until it detects a stall resulting from texture access in cycle 20. While T0 waits for the result of the texturing operation, the core continues to execute its remaining three threads. The result of T0’s texture access becomes available in cycle 70. Upon T3’s stall in cycle 80, the core immediately resumes T0. Thus, at no point during execution are ALUs left idle.
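The schedule described above can be reproduced with a small cycle-level simulation. This is a sketch under simplifying assumptions that go beyond the text: every texture access takes the full 50-cycle maximum, the access itself consumes no issue slot, and the core picks the thread that has been ready longest. The function name simulate and its parameters are illustrative.

```python
def simulate(num_threads=4, arith_per_texture=20, texture_latency=50,
             total_cycles=400):
    """Return (ALU utilization, per-cycle list of running thread or None)."""
    ready_at = [0] * num_threads                   # cycle each thread may next issue
    remaining = [arith_per_texture] * num_threads  # arithmetic left before next texture access
    current = None
    timeline = []
    for cycle in range(total_cycles):
        if current is None:
            ready = [t for t in range(num_threads) if ready_at[t] <= cycle]
            if ready:
                # Assumed policy: resume the thread that has been ready longest.
                current = min(ready, key=lambda t: (ready_at[t], t))
        timeline.append(current)
        if current is not None:
            remaining[current] -= 1
            if remaining[current] == 0:
                # The stall is detected in the next cycle; the texture result
                # arrives texture_latency cycles after that.
                ready_at[current] = cycle + 1 + texture_latency
                remaining[current] = arith_per_texture
                current = None
    busy = sum(1 for t in timeline if t is not None)
    return busy / total_cycles, timeline

util4, timeline4 = simulate(num_threads=4)
util3, timeline3 = simulate(num_threads=3)
```

With four threads, T0 runs in cycles 0 to 19, T1 takes over in cycle 20, and T0 resumes in cycle 80 with no idle cycles. With three threads under the same assumptions, the ALUs sit idle from cycle 60 until T0's result returns in cycle 70.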

When executing the shader program for this example, a minimum of four threads is needed to keep core ALUs busy. Each thread operates simultaneously on 32 fragments; thus, 4 × 32 = 128 fragments are required for the chip to achieve peak performance.
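The four-thread minimum follows from a short calculation: while one thread waits up to 50 cycles, the remaining T - 1 threads must together supply at least 50 cycles of arithmetic at 20 cycles each. A minimal sketch, assuming worst-case latency on every access; the function names are illustrative.

```python
import math

def min_threads_to_hide_latency(arith_per_texture, texture_latency):
    # One thread is stalled; the other (T - 1) threads must cover the
    # latency: (T - 1) * arith_per_texture >= texture_latency.
    return 1 + math.ceil(texture_latency / arith_per_texture)

def contexts_that_fit(register_pool, registers_per_thread):
    # The register file is partitioned among thread contexts.
    return register_pool // registers_per_thread

threads = min_threads_to_hide_latency(arith_per_texture=20, texture_latency=50)
fragments = threads * 32  # each thread shades 32 fragments
resident = contexts_that_fit(register_pool=16, registers_per_thread=4)
```

Here threads evaluates to 4 and fragments to 128, and the 16-register pool holds exactly the four contexts needed, since each copy of the shader uses four registers.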

As memory latencies on real GPUs involve hundreds of cycles, modern GPUs must contain support for significantly more threads to sustain high utilization. If we extend our simple GPU to a more realistic size of eight processing cores and provision each core with storage for 16 execution contexts, then simultaneous processing of 4,096 fragments is needed to approach peak processing rates. Clearly, GPU performance relies heavily on the abundance of parallel shading work.
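The 4,096 figure is the product of cores, contexts per core, and SIMD width. The sketch below also estimates the register storage this implies per core, under the assumption (not stated for the larger design) that each context still needs four width-32 registers of 32-bit values; the function names are illustrative.

```python
def fragments_in_flight(cores, contexts_per_core, simd_width):
    # Every context on every core processes simd_width fragments at once.
    return cores * contexts_per_core * simd_width

def register_file_bytes(contexts_per_core, registers_per_thread,
                        simd_width, bytes_per_element=4):
    # Assumption: each context needs registers_per_thread vector registers.
    return contexts_per_core * registers_per_thread * simd_width * bytes_per_element

total_fragments = fragments_in_flight(cores=8, contexts_per_core=16, simd_width=32)
per_core_bytes = register_file_bytes(contexts_per_core=16, registers_per_thread=4,
                                     simd_width=32)
```

Under these assumptions, total_fragments is 4096 and each core would need 8,192 bytes of vector register storage.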

Figure A: Example GPU Core. The core contains four execution (thread) contexts (T0 to T3), 32 ALUs (lanes 0 to 31) performing SIMD operations, and a general register file (R0 to R15) partitioned among threads.

Figure B: Thread Execution on the Example GPU Core. Timeline of threads T0 to T3 over cycles 0 to 80, with each thread marked as executing, ready (not executing), or stalled; stalls occur at cycles 20, 40, 60, and 80.
