will present for programmers. Most of the programming
challenges discussed here will be applicable to all future
graphics architectures, even those that are somewhat different from the one I am expecting.
END OF THE HARDWARE-DEFINED PIPELINE
Graphics processors will evolve toward a programming
model similar to that illustrated in figure 2b. User-written
software specifies the overall structure of the computation, expressed in an extremely flexible parallel programming model similar to that used to program today’s
multicore CPUs. The user-written software may optionally use specialized hardware to accelerate specific tasks
such as texture mapping. The specialized hardware may
be accessed via a combination of instructions in the ISA
(instruction set architecture), special memory-mapped
registers, and special inter-processor messages.
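To make this concrete, here is a minimal sketch, using CUDA’s classic texture-reference API, of user-written code handing one specific task (a texture fetch) to specialized hardware while the rest of the computation runs in the general programming model. The kernel and buffer names are illustrative only.

#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texData;  // bound to a linear buffer

// General-purpose user code; tex1Dfetch compiles to a texture
// instruction in the ISA and executes on the dedicated texture units.
__global__ void scale(float* out, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * tex1Dfetch(texData, i);
}

int main()
{
    const int n = 256;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaBindTexture(0, texData, d_in, n * sizeof(float));  // route reads through texture hardware
    scale<<<(n + 127) / 128, 128>>>(d_out, n, 2.0f);
    cudaDeviceSynchronize();
    cudaUnbindTexture(texData);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}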
The latest generation of GPUs (graphics processing units) from NVIDIA and AMD has already taken a
significant step toward this future graphics programming
model by supporting a separate programming model for
nongraphics computations that is more flexible than
the programming model used for graphics. This second
programming model is an assembly-level parallel-programming model with some capabilities for fine-grained
synchronization and data sharing across hardware
threads. NVIDIA calls its model PTX (Parallel Thread
Execution), and AMD’s is known as CTM (Close to Metal).
Note that NVIDIA’s C-like CUDA language (see “Scalable
Parallel Programming with CUDA” in this issue) is a layer
on top of the assembly-level PTX. It is important to realize, however, that PTX and CTM remain significantly more restrictive than traditional general-purpose parallel-programming models, especially in their memory and concurrency models.
These limitations become obvious when comparing
PTX and CTM with the programming models supported
by other single-chip highly parallel processors, such as
Sun’s Niagara server chips. I believe that the programming model of future graphics architectures will be
substantially more flexible than PTX and CTM.
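To give a flavor of those capabilities, here is a minimal CUDA sketch, added for illustration, of the fine-grained data sharing and synchronization that PTX exposes: the threads of a block cooperate through on-chip shared memory and barriers to produce a partial sum.

#include <cuda_runtime.h>

// Each 256-thread block reduces 256 elements to one partial sum.
// Shared memory provides data sharing across the block’s threads;
// __syncthreads() provides fine-grained barrier synchronization.
__global__ void blockSum(const float* in, float* out)
{
    __shared__ float buf[256];          // assumes blockDim.x == 256
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                    // all loads visible block-wide

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                // barrier between reduction levels
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];       // one partial sum per block
}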
TASK PARALLELISM AND MULTITHREADING
The parallelism supported by current GPUs primarily
takes the form of data parallelism—that is, the GPU operates simultaneously on many data elements (such as vertices or pixels or elements in an array). In contrast, task
parallelism is not supported well, except for the specific
case of concurrent processing of pixels and vertices. Since
better support for task parallelism is necessary to execute
user-defined rendering pipelines efficiently, I expect that
future GPUs will support task parallelism much more
aggressively. In particular, multiple tasks will be able to
execute asynchronously from each other and from the
CPU, and will be able to communicate and synchronize
with each other. These changes will require a substantially
more sophisticated software runtime environment than
the one used for today’s GPUs and will introduce significant complexity into the hardware/software interactions
for thread management.
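As an illustration of the anticipated model, today’s CUDA streams already approximate asynchronous tasks. In the sketch below (taskA and taskB are hypothetical kernels), two independent tasks are issued into separate streams, may overlap with each other and with CPU work, and are synchronized explicitly.

#include <cuda_runtime.h>

// Two independent "tasks," each a trivial kernel for illustration.
__global__ void taskA(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;
}

__global__ void taskB(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *dA, *dB;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));

    cudaStream_t sA, sB;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);

    // Kernels in different streams are issued asynchronously: they may
    // overlap with each other on the GPU and with work on the CPU.
    taskA<<<(n + 255) / 256, 256, 0, sA>>>(dA, n);
    taskB<<<(n + 255) / 256, 256, 0, sB>>>(dB, n);

    // ... CPU work can proceed here while both tasks run ...

    cudaStreamSynchronize(sA);   // explicit per-task synchronization
    cudaStreamSynchronize(sB);

    cudaStreamDestroy(sA);
    cudaStreamDestroy(sB);
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}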
As with today’s GPUs and Sun’s Niagara processor, each core will use hardware multithreading [3], possibly augmented by additional software multithreading along the lines of that used by programmers of the Cell architecture. This multithreading serves two purposes:
• First, it allows the core to remain fully utilized even if
each individual instruction has a pipeline latency of
several cycles—the core just executes an instruction
from another thread.
• Second, it allows the core to remain fully utilized even if
one or more of the threads on the core stalls because of
an off-chip DRAM access such as those that occur when
fetching data from a texture.

Programmers will face the challenge of exposing parallelism for multiple cores and for multiple threads on each core. This challenge is already starting to appear with programming models such as NVIDIA’s CUDA.
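A minimal sketch of that two-level challenge, assuming CUDA’s grid/block decomposition: the blocks of the grid spread work across cores, while a surplus of threads gives each core something to execute whenever other threads stall.

#include <cuda_runtime.h>

// SAXPY exposing both levels of parallelism: many blocks for many cores,
// many threads per block so each core can hide pipeline and DRAM latency
// by switching among threads.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    // Grid-stride loop: correct for any n and any launch configuration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];   // loads may stall; other threads run meanwhile
}

// Launch with enough blocks and threads to oversubscribe the cores, e.g.:
//   saxpy<<<128, 256>>>(n, 2.0f, d_x, d_y);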
SIMD EXECUTION WITHIN EACH CORE
An important concern in the design of graphics hardware
is obtaining the maximum possible performance using a
fixed number of transistors on a chip. If one instruction
cache/fetch/decode unit can be shared among several
arithmetic units, the die area and power requirements
of the hardware are reduced, as compared with a design
that has one instruction unit per arithmetic unit. That