tasks follow the same execution trace and can suffer when heterogeneous tasks follow completely different execution traces. The efficiency of SIMD architectures depends on the availability of sufficient amounts of uniform work. In practice, sufficient uniformity is often present in abundantly parallel workloads, since a pool of 10,000 concurrent tasks is far more likely to consist of a small number of task types than of 10,000 completely disparate computations.
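The cost of non-uniform work can be made concrete with a small sketch (hypothetical, not from the original text): a SIMD group must execute every branch path taken by any of its lanes, so a divergent branch costs two serialized passes where a uniform one costs one.

```python
def simd_branch_cost(lane_conditions):
    # A SIMD group executes the "taken" path if any lane takes it,
    # and the "not taken" path if any lane skips it. Divergent lanes
    # therefore serialize both paths: 2 passes instead of 1.
    runs_taken_path = any(lane_conditions)
    runs_skipped_path = not all(lane_conditions)
    return int(runs_taken_path) + int(runs_skipped_path)

uniform = [True] * 8            # all 8 lanes agree: one pass
divergent = [True, False] * 4   # lanes disagree: both paths run
```

With uniform work the group finishes in a single pass; with divergent work its effective throughput is halved, which is why SIMD hardware depends on large pools of similar tasks.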
GPUs
Programmable GPUs are the leading exemplars of aggressively throughput-oriented processors, taking the emphasis on throughput further than the vast majority of other processors and thus offering tremendous potential performance on massively parallel problems.13
Historical perspective. Modern GPUs have evolved according to the needs of real-time computer graphics, two aspects of which are of particular importance to understanding the development of GPU designs: it is an extremely parallel problem, and throughput is its paramount measure of performance.
Visual applications generally model the environments they display through a collection of geometric primitives, with triangles the most common. The most widely used techniques for producing images from these primitives proceed through several stages where processing is performed on each triangle, triangle corner, and pixel covered by a triangle. At each stage, individual triangles/vertices/pixels can be processed independently of all others. An individual scene can easily paint millions of pixels at a time, thus generating a great deal of completely parallel work. Furthermore, processing an element generally involves launching a thread to execute a program—usually called a shader—written by the developer. Consequently, GPUs are specifically designed to execute literally billions of small user-written programs per second.
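As an illustration of this processing model (a hypothetical sketch; the `shade` function and its toy color formula are invented here, not taken from the article), a per-pixel shader is simply a small program mapped independently over every pixel, so the whole image is a pool of completely parallel tasks:

```python
def shade(pixel):
    # Toy "shader": computes a color value for one pixel using only
    # that pixel's own coordinates, with no dependence on other pixels.
    x, y = pixel
    return (x * 31 + y * 17) % 256

width, height = 4, 3
# Each list element is an independent invocation of the shader; on a GPU
# these invocations would run as thousands of concurrent threads.
image = [shade((x, y)) for y in range(height) for x in range(width)]
```

Because no invocation reads another's result, the hardware is free to schedule them in any order and in any degree of parallelism, which is exactly the freedom GPU designs exploit.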
Figure 2. NVIDIA GPU consisting of an array of multithreaded multiprocessors.

[Figure: block diagram of a GPU containing thread scheduling logic, a host interface on the PCIe bus, and an array of streaming multiprocessors (SMs), each with SIMT control, processing elements, and on-chip L1 memory, connected through an interconnection network to a global L2 cache and a memory interface leading to off-chip DRAM.]