accesses but still relies on extensive
multithreading for latency tolerance.
Many simple processing units. The
high transistor density in modern semiconductor technologies makes it feasible for a single chip to contain multiple
processing units, raising the question of
how to use the available area on the chip
to achieve optimal performance: one
very large processor, a handful of large
processors, or many small processors?
Designing increasingly large single-processor chips is unattractive [6]. The
strategies used to obtain progressively
higher scalar performance (such as
out-of-order execution and aggressive speculation) come at the price of
rapidly increasing power consumption; incremental performance gains
incur increasingly large power costs [15].
Thus, while increasing the power consumption of a single-threaded core is
physically possible, the potential performance improvement from more aggressive speculation appears insignificant
by comparison. This analysis has led to
an industrywide transition toward multicore chips, though their designs remain fundamentally latency-oriented.
Individual cores maintain roughly comparable scalar performance to earlier
generations of single-core chips.
Throughput-oriented processors
achieve even higher levels of performance by using many simple, and hence
small, processing cores [10]. The individual processing units of a throughput-oriented chip typically execute instructions
in the order they appear in the program,
rather than trying to dynamically reorder instructions for out-of-order execution. They also generally avoid speculative execution and branch prediction.
These architectural simplifications often reduce the speed with which a single thread completes its computation.
However, the resulting savings in chip
area allow for more parallel processing
units and correspondingly higher total
throughput on parallel workloads.
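
The area trade-off described above can be made concrete with a small back-of-the-envelope model. This is only a sketch: the chip area, the core sizes, and above all the assumption that single-thread performance grows with the square root of core area are hypothetical choices for illustration, not figures from the article.

```python
import math

def total_throughput(chip_area, core_area, perf_of_area):
    """Aggregate throughput when the chip is tiled with identical cores."""
    num_cores = chip_area // core_area
    return num_cores * perf_of_area(core_area)

# Hypothetical scaling assumption (not from the article): single-thread
# performance grows only with the square root of the area spent on a core.
def per_core_perf(core_area):
    return math.sqrt(core_area)

CHIP_AREA = 64  # arbitrary area units, chosen for illustration

one_large = total_throughput(CHIP_AREA, 64, per_core_perf)   # 1 core
few_medium = total_throughput(CHIP_AREA, 16, per_core_perf)  # 4 cores
many_small = total_throughput(CHIP_AREA, 1, per_core_perf)   # 64 cores

print(one_large, few_medium, many_small)  # 8.0 16.0 64.0
```

Under these assumptions a single thread runs eight times faster on the one large core than on a small one, yet the chip built from small cores delivers eight times the total throughput, provided the workload supplies enough parallel work to keep all 64 cores busy.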
SIMD execution. Parallel processors
frequently employ some form of single-instruction, multiple-data, or SIMD,
execution [12] to improve their aggregate
throughput. Issuing a single instruction
in a SIMD machine applies the given
operation to potentially many data operands; SIMD addition might, for example, perform pairwise addition of two
64-element sequences. As with multithreading, SIMD execution has a long
history dating to at least the 1960s.
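
The 64-element addition mentioned above can be modeled in a few lines. The sketch below only imitates SIMD semantics in software (the function names and the `issued` counter are illustrative inventions); it counts how many instructions a scalar machine and a SIMD machine of a given width would need to issue for the same work.

```python
def scalar_add(a, b):
    """Scalar model: one add instruction is issued per element pair."""
    result, issued = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        issued += 1
    return result, issued

def simd_add(a, b, width):
    """SIMD model: each issued instruction adds up to `width` element pairs."""
    result, issued = [], 0
    for start in range(0, len(a), width):
        lanes_a = a[start:start + width]
        lanes_b = b[start:start + width]
        # Conceptually all lanes are processed by the same single instruction.
        result.extend(x + y for x, y in zip(lanes_a, lanes_b))
        issued += 1
    return result, issued

a = list(range(64))
b = [10] * 64
scalar_result, scalar_issued = scalar_add(a, b)
simd_result, simd_issued = simd_add(a, b, width=64)

print(scalar_issued, simd_issued)  # 64 1
```

Both versions compute the same 64 sums; the difference is that the SIMD model issues a single instruction where the scalar model issues 64.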
Most SIMD machines can be classified into two basic categories. First is
the SIMD processor array, typified by the
ILLIAC IV developed at the University of
Illinois [7], the Thinking Machines CM-2 [29],
and the MasPar Computer Corp. MP-1 [5].
All consisted of a large array of processing elements (hundreds or thousands)
and a single control unit that would
consume a single instruction stream.
The control unit would broadcast each
instruction to all processing elements,
each of which would then execute the instruction
on its own local data.
The second category is the vector
processor, exemplified by the Cray-1 [25]
and numerous other machines [11] that
augment a traditional scalar instruction set with additional vector instructions operating on data vectors of some
fixed width—64-element vectors in the
Cray-1 and four-element vectors in the
most current vector extensions (such as
the x86 Streaming SIMD Extensions, or
SSE). The operation of a vector instruction, like vector addition, may be performed in a pipelined fashion (as on the
Cray-1) or in parallel (as in current SSE
implementations). Several modern processor families, including x86 processors from Intel and AMD and the ARM
Cortex-A series, provide vector SIMD
instructions that operate in parallel
on 128-bit values (such as four 32-bit integers).
Programmable GPUs have long
made aggressive use of SIMD; current
NVIDIA GPUs have a SIMD width of 32.
Many recent research designs, including the Vector IRAM [19], SCALE [20], and
Imagine and Merrimac streaming processors [9, 16], have also used SIMD architectures to improve efficiency.
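
To illustrate what "four 32-bit integers in a 128-bit value" means in practice, the following sketch packs four lanes into 16 bytes and adds two such values lane by lane. It is a software model only, assuming unsigned lanes that wrap around independently, as packed integer addition typically does; the helper names `pack128` and `vector_add` are hypothetical and do not correspond to any real instruction set API.

```python
import struct

LANES = 4            # four 32-bit lanes
MASK = 0xFFFFFFFF    # keeps each lane within 32 bits
FMT = f"<{LANES}I"   # little-endian, four unsigned 32-bit integers

def pack128(values):
    """Pack four unsigned 32-bit integers into 16 bytes (128 bits)."""
    return struct.pack(FMT, *values)

def vector_add(reg_a, reg_b):
    """Lane-wise add of two 128-bit values; each lane wraps modulo 2**32."""
    a = struct.unpack(FMT, reg_a)
    b = struct.unpack(FMT, reg_b)
    # No carry crosses from one lane into its neighbor.
    return pack128([(x + y) & MASK for x, y in zip(a, b)])

r = vector_add(pack128([1, 2, 3, 0xFFFFFFFF]), pack128([10, 20, 30, 1]))

print(len(r) * 8)             # 128
print(struct.unpack(FMT, r))  # (11, 22, 33, 0)
```

The last lane overflows and wraps to 0 without disturbing the other three, which is what distinguishes a four-lane vector addition from a single 128-bit integer addition.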
SIMD execution is attractive because,
among other things, it increases the
amount of resources that can be devoted to functional units rather than control logic. For instance, 32 floating-point
arithmetic units coupled with a single
control unit take less chip area than
32 arithmetic units with 32 separate
control units. The desire to amortize
the cost of control logic over numerous
functional units was the key motivating
factor behind even the earliest SIMD machines.
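
The amortization argument can be checked with simple arithmetic. In the sketch below, the relative areas are hypothetical (one control unit is assumed to cost as much area as one arithmetic unit); only the 32-unit configurations come from the text.

```python
def functional_fraction(num_alus, num_control_units,
                        alu_area=1.0, control_area=1.0):
    """Fraction of total area spent on arithmetic units rather than control."""
    alu_total = num_alus * alu_area
    control_total = num_control_units * control_area
    return alu_total / (alu_total + control_total)

# Hypothetical areas: a control unit is as large as one arithmetic unit.
shared = functional_fraction(32, 1)        # 32 units share one control unit
independent = functional_fraction(32, 32)  # each unit has its own control

print(round(shared, 3), independent)  # 0.97 0.5
```

With these assumed areas, sharing one control unit leaves about 97% of the area for arithmetic, compared with 50% when every arithmetic unit carries its own control logic. Different area ratios change the numbers but not the direction of the comparison.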
However, devoting less space to control comes at a cost. SIMD execution delivers peak performance when parallel