Throughput-Oriented Processors
Two fundamental measures of processor performance are task latency (the time elapsed between the initiation and completion of some task) and throughput (the total amount of work completed per unit time). Processor architects make many carefully calibrated trade-offs between latency and throughput optimization, since improving one could degrade the other. Real-world processors tend to emphasize one over the other, depending on the workloads they are expected to encounter.
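A toy calculation can make the trade-off concrete. The numbers below are hypothetical and not from the article; they simply show how one fast complex core and many slow simple cores built from the same silicon budget can each win on a different metric.

```python
# Hypothetical hardware budgets (illustrative numbers only):
# one complex core finishing a task in 1 ms, versus eight simple
# cores that each need 4 ms per task.
complex_core_latency_ms = 1.0   # assumed single-task latency, fast core
simple_core_latency_ms = 4.0    # assumed single-task latency, slow core
num_simple_cores = 8

# Latency view: time to finish ONE task.
latency_fast = complex_core_latency_ms   # 1.0 ms -- the complex core wins
latency_slow = simple_core_latency_ms    # 4.0 ms

# Throughput view: tasks completed per millisecond when work is abundant
# enough to keep every core busy.
throughput_fast = 1 / complex_core_latency_ms                # 1.0 task/ms
throughput_slow = num_simple_cores / simple_core_latency_ms  # 2.0 tasks/ms -- the simple cores win
```

Under these assumed numbers, the latency-oriented design finishes any single task 4x sooner, while the throughput-oriented design completes 2x more total work per unit time.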
Traditional scalar microprocessors are essentially latency-oriented architectures. Their goal is to minimize the running time of a single sequential program by avoiding task-level latency whenever possible. Many architectural techniques, including out-of-order execution, speculative execution, and sophisticated memory caches, have been developed to serve this goal. This traditional design approach is predicated on the conservative assumption that the parallelism available in the workload presented to the processor is fundamentally scarce. Single-core scalar CPUs typified by the Intel Pentium 4 were aggressively latency-oriented. More recent multicore CPUs (such as the Intel Core 2 Duo and Core i7) reflect a trend toward somewhat less-aggressive designs that expect a modest amount of parallelism.
Throughput-oriented processors, in contrast, arise from the assumption that they will be presented with workloads in which parallelism is abundant. This fundamental difference leads to architectures that differ from traditional sequential machines. Broadly speaking, throughput-oriented processors rely on three key architectural features: emphasis on many simple processing cores, extensive hardware multithreading, and use of single-instruction, multiple-data (SIMD) execution. Aggressively throughput-oriented processors, exemplified by the GPU, willingly sacrifice single-thread execution speed to increase total computational throughput across all threads.
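The role of hardware multithreading in this design can be sketched with some simple arithmetic (the cycle counts below are assumptions for illustration, not figures from the article). Each thread alternates a few cycles of compute with a long memory stall; while one thread waits, the core switches to another, so with enough resident threads the stalls are fully hidden.

```python
# Assumed per-thread behavior (illustrative numbers only):
C = 4    # compute cycles a thread runs before it stalls on memory
M = 20   # cycles the memory stall lasts

def utilization(threads):
    """Fraction of cycles the core does useful work with `threads`
    resident threads, assuming zero-cost hardware thread switching.
    Each thread contributes C useful cycles per (C + M)-cycle period,
    and utilization saturates at 1.0."""
    return min(1.0, threads * C / (C + M))

# A single thread leaves the core idle during every stall:
# utilization(1) == 4/24, about 17% busy.
# Full utilization is reached once threads * C >= C + M,
# i.e. at (C + M) / C == 6 resident threads.
```

This is why throughput-oriented cores keep many threads resident per core rather than spending transistors on caches and speculation to shorten any one thread's stalls.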
No successful processor can afford to optimize aggregate task throughput while completely ignoring single-task latency, or vice versa. Different processors may also vary in the degree to which they emphasize one over the other; for instance, individual throughput-oriented architectures may not use all three architectural features just listed. Also worth noting is that several architectural strategies, including pipelining, multiple issue, and out-of-order execution, reduce task-level latency by improving instruction-level throughput.
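Pipelining illustrates this last point with standard textbook arithmetic (the stage and instruction counts below are assumptions, not from the article): no single instruction gets faster, yet the task as a whole finishes sooner because instructions complete at a higher rate.

```python
# Assumed pipeline parameters (illustrative numbers only):
STAGES = 5   # depth of the instruction pipeline
N = 1000     # instructions in the task

# Unpipelined: instructions run one at a time, STAGES cycles each.
unpipelined_cycles = STAGES * N          # 5000 cycles

# Pipelined: STAGES cycles to fill the pipe, then one instruction
# completes per cycle (assuming no hazards or stalls).
pipelined_cycles = STAGES + (N - 1)      # 1004 cycles

# Each instruction still spends STAGES cycles in flight, but
# instruction-level throughput rises ~5x, cutting task latency.
speedup = unpipelined_cycles / pipelined_cycles
```

Under these idealized assumptions the speedup approaches the pipeline depth as N grows, which is why techniques that raise instruction throughput double as latency optimizations for the task overall.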