is, a SIMD (single instruction, multiple data) execution
model increases efficiency as long as most of the elements
in the SIMD vectors are kept active most of the time. A
SIMD execution model also provides a simple form of
fine-grained synchronization that helps to ensure that
memory accesses have good locality.
Current graphics hardware uses a SIMD execution model, although it is sometimes hidden from the
programmer behind a scalar programming interface, as
in NVIDIA’s hardware. One area of ongoing debate and
change is likely to be the underlying hardware SIMD
width; there is a tension between the efficiency gained
for regular computations as SIMD width increases and
the efficiency gained for irregular computations as SIMD
width decreases. NVIDIA GPUs (GeForce 8000 and 9000
series) have an effective SIMD width of 32, but the trend
has been for the SIMD width of GPUs to decrease to
improve the efficiency of algorithms with irregular control flow.
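The cost of irregular control flow at different SIMD widths can be sketched with a toy model. This is only an illustration under simplifying assumptions that are not from the article: each work item takes one of two branch paths, both paths cost one instruction slot, and a SIMD group must issue every path that at least one of its lanes takes. The function name and the workload are hypothetical, and the model ignores the control-overhead savings that favor wider SIMD for regular code.

```python
def simd_efficiency(branch_taken, width):
    """Fraction of issued lane-slots that do useful work.

    branch_taken: list of bools, one per scalar work item. True means the
    item follows the 'then' path, False the 'else' path. Both paths are
    assumed to cost one instruction slot (a simplifying assumption).
    """
    useful = 0
    issued = 0
    for start in range(0, len(branch_taken), width):
        group = branch_taken[start:start + width]
        # A group must issue each path that at least one of its lanes takes.
        paths = (1 if any(group) else 0) + (1 if not all(group) else 0)
        issued += paths * width
        useful += len(group)  # each item is useful on exactly one path
    return useful / issued


# Hypothetical workloads: one with no divergence, one that switches
# paths every 8 items (coherent in small groups, divergent in large ones).
uniform = [True] * 64
runs_of_8 = [(i // 8) % 2 == 0 for i in range(64)]

for w in (1, 4, 32):
    print(w, simd_efficiency(uniform, w), simd_efficiency(runs_of_8, w))
```

In this model the uniform workload keeps every lane busy at any width, while the workload that changes path every 8 items stays fully efficient at widths 1 and 4 but drops to half efficiency at width 32, because each 32-wide group has to issue both paths.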
There is also debate about how to expose the SIMD
execution model. It can be directly exposed to the programmer with register-SIMD instructions, as is done with
x86 SSE instructions, or it may be nominally hidden from
the programmer behind a scalar programming model, as
is the case with NVIDIA’s GeForce 9000 series. If the SIMD
execution model is hidden, the conversion from the
scalar programming model to the SIMD hardware may be
performed by the hardware (as in the GeForce 9000
series), by a compiler, or by some combination of the two.
Regardless of which strategy is used, programmers who
are concerned with performance will need to be aware of
the underlying SIMD execution model and width.
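To make the idea of a hidden SIMD model concrete, the following sketch shows one way a scalar program with a branch could be mapped onto groups of lanes using a per-lane mask. It is a plain Python simulation, not a description of how any particular GPU or compiler performs the conversion; `scalar_kernel` and `simd_execute` are hypothetical names chosen for illustration.

```python
def scalar_kernel(x):
    # The program as the programmer writes it: one element at a time.
    if x < 0:
        return -x
    return 2 * x


def simd_execute(xs, width):
    """Run the same computation over groups of `width` lanes using masks."""
    out = [None] * len(xs)
    for start in range(0, len(xs), width):
        lanes = range(start, min(start + width, len(xs)))
        mask = [xs[i] < 0 for i in lanes]  # per-lane branch condition
        # 'then' side: issued for the whole group if any lane needs it,
        # but only lanes whose mask bit is set write a result.
        if any(mask):
            for i, m in zip(lanes, mask):
                if m:
                    out[i] = -xs[i]
        # 'else' side: issued if any lane did not take the 'then' side.
        if not all(mask):
            for i, m in zip(lanes, mask):
                if not m:
                    out[i] = 2 * xs[i]
    return out


data = [3, -1, 0, -7, 5]
print(simd_execute(data, 4) == [scalar_kernel(x) for x in data])
```

The results match the scalar program, but a group whose lanes disagree on the branch pays for both sides, which is why the underlying width still matters to a programmer who only sees the scalar interface.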
SMALL AMOUNTS OF LOCAL STORAGE
One of the most important differences between GPUs and
CPUs is that GPUs devote a greater fraction of their transistors to arithmetic units, whereas CPUs devote a greater
fraction of their transistors to cache. This difference is
one of the primary reasons that the peak performance of
a GPU is much higher than that of a CPU.
I expect that this difference will continue in the
future. The impact on programmers will be significant:
although the overall programming model of future GPUs
will become much closer to that of today’s CPUs, programmers will need to manage data locality much more
carefully on future GPUs than they do on today’s CPUs.
This problem is made even more challenging by
multithreading; if there are N threads on each core, the
amount of local storage per thread per core is effectively
1/N of the core’s total local storage. This issue can be
mitigated if the N threads on a core are sharing a working
set, but to do this the programmer must think of the N
threads as being closely coupled to each other. Similarly,
programmers will have to think about how to share a
working set across threads on different cores.
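The 1/N effect, and the benefit of a shared working set, can be worked through with a small calculation. The sizes below are hypothetical and chosen only to make the arithmetic easy to follow; the function name and the even split of non-shared storage are assumptions of this sketch.

```python
def storage_per_thread(total_bytes, n_threads, shared_bytes=0):
    """Local storage effectively available to each thread on one core.

    shared_bytes is the part of the core's local storage that holds a
    working set used by all n_threads, so every thread benefits from all
    of it. The remainder is assumed to be split evenly among the threads.
    """
    if not 0 <= shared_bytes <= total_bytes:
        raise ValueError("shared_bytes must be between 0 and total_bytes")
    private = (total_bytes - shared_bytes) // n_threads
    return shared_bytes + private


# Hypothetical core: 16 KiB of local storage shared by 32 threads.
TOTAL = 16 * 1024
print(storage_per_thread(TOTAL, 32))
print(storage_per_thread(TOTAL, 32, shared_bytes=12 * 1024))
```

With no sharing, each of the 32 threads effectively sees 512 bytes. If 12 KiB of the storage holds data that all threads use, each thread effectively sees 12,416 bytes, which is the motivation for treating the threads on a core as closely coupled.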
These considerations are already becoming apparent
with CUDA. The constraints are likely to be frustrating
to programmers who are accustomed to the large caches
of CPUs, but they need to realize that extra local storage
would come at the cost of fewer ALUs (arithmetic logic
units), and they will need to work closely with hardware
designers to determine the optimum balance between
cache and ALUs.
CACHE-COHERENT SHARED MEMORY
The most important aspect of any parallel architecture is
its overall memory and communication model. To illustrate the importance of this aspect of the design, consider
four of the many possible alternatives (hybrids
and enhancements of these models are, of course, also possible):
• A message-passing architecture, in which each processor
core has its own memory space and all communication
occurs through explicit message passing. Most large-scale supercomputers (those with 100-plus processors)
use this model.
• An architecture such as the Sony/Toshiba/IBM Cell with
a noncached, noncoherent shared memory. In such an
architecture, all transfers of data between a core’s small
private memory and the global memory must be orchestrated through explicit memory-transfer commands.
• An architecture such as NVIDIA’s GeForce 8800 with
what amounts to a minimally cached, noncoherent
shared memory, with support for load/store to this
memory.
• An architecture such as modern multicore CPUs, with
cached, coherent shared memory. In such architectures,
hardware mechanisms manage transfer of data between
cache and main memory and ensure that data in caches
of different processors remains consistent.
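The practical difference between the noncoherent and coherent cached models in the list above can be seen in a toy simulation. This sketch assumes a write-through cache with a simple write-invalidate rule, which is far simpler than the protocols used in real multicore CPUs; the `Core` class and `run` function are hypothetical and exist only for illustration.

```python
class Core:
    def __init__(self, memory, peers, coherent):
        self.memory = memory      # shared main memory: address -> value
        self.peers = peers        # list of all cores, including this one
        self.coherent = coherent  # whether stores invalidate other caches
        self.cache = {}           # private cache: address -> value

    def load(self, addr):
        if addr not in self.cache:        # miss: fetch from main memory
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def store(self, addr, value):
        self.cache[addr] = value
        self.memory[addr] = value         # write-through, for simplicity
        if self.coherent:
            for other in self.peers:      # invalidate other cores' copies
                if other is not self:
                    other.cache.pop(addr, None)


def run(coherent):
    memory = {0: 1}
    cores = []
    for _ in range(2):
        cores.append(Core(memory, cores, coherent))
    a, b = cores
    b.load(0)          # core b caches the old value
    a.store(0, 2)      # core a updates the same address
    return b.load(0)   # what does core b see now?


print("noncoherent:", run(False))
print("coherent:", run(True))
```

Without coherence, core b keeps reading its stale cached copy until software explicitly refreshes it; with the invalidation rule, the hardware model forces b to refetch the new value.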
There is considerable debate within the graphics architecture community as to which memory and communication model would be best for future architectures, and