ceed independently of all the others.
Any synchronization required within
these subtasks can be localized to the
subtasks. Only at the end, when all subsequent operations are merged, must
the parallel subtasks be synchronized
with one another. Organizing synchronization hierarchically also aligns well
with the physical cost of synchronizing
threads spread across different sections
of a given system. It is natural to expect
that threads executing on a single core
can be synchronized much more cheaply than threads spread across an entire
processor, just as threads on a single
machine can be synchronized more
cheaply than threads across the multiple nodes of a cluster.
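The hierarchical pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the article: the function name `hierarchical_sum`, the grouping parameters, and the use of `threading.Lock` as the "local" synchronization mechanism are all hypothetical choices. Threads within a group contend only on their group's lock, and the groups are synchronized with one another exactly once, at the final merge. In CPython the global interpreter lock means this shows the structure of the synchronization rather than any real speedup.

```python
import threading


def hierarchical_sum(data, num_groups=4, threads_per_group=4):
    """Sum `data` with two levels of synchronization (illustrative sketch)."""
    group_totals = [0] * num_groups
    # One lock per group: synchronization stays local to each subtask.
    group_locks = [threading.Lock() for _ in range(num_groups)]

    def worker(group_id, chunk):
        partial = sum(chunk)              # independent work, no synchronization
        with group_locks[group_id]:       # local sync: only this group's threads contend
            group_totals[group_id] += partial

    threads = []
    group_size = (len(data) + num_groups - 1) // num_groups
    for g in range(num_groups):
        group_data = data[g * group_size:(g + 1) * group_size]
        chunk_size = max(1, (len(group_data) + threads_per_group - 1) // threads_per_group)
        for t in range(threads_per_group):
            chunk = group_data[t * chunk_size:(t + 1) * chunk_size]
            th = threading.Thread(target=worker, args=(g, chunk))
            threads.append(th)
            th.start()

    # Global sync: all subtasks are joined once, at the end, before merging.
    for th in threads:
        th.join()
    return sum(group_totals)


if __name__ == "__main__":
    print(hierarchical_sum(list(range(1000))))
```

The design point is that the expensive, system-wide synchronization (the joins and the final merge) happens once, while the frequent synchronization is confined to small groups where it is expected to be cheaper.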
The transition from single-core to multicore processors and the increasing use
of throughput-oriented architectures
signal greater emphasis on parallelism
as the driving force for higher computational performance. Yet these two kinds
of processors differ in the degree of parallelism they expect to encounter in a
typical workload. Throughput-oriented
processors assume parallelism is abundant, rather than scarce, and their paramount design goal is maximizing total
throughput of all parallel tasks rather
than minimizing the latency of a single task.
Emphasizing total throughput over
the running time of a single task leads
to a number of architectural design decisions. Among them, the three primary
architectural trends typical of throughput-oriented processors are hardware
multithreading, many simple processing elements, and SIMD execution.
Hardware multithreading makes managing the expected abundant parallelism cheap. Simple in-order cores forgo
out-of-order execution and speculation,
and SIMD execution increases the ratio
of functional units to control logic. Simple core design and SIMD execution reduce the area and power cost of control
logic, leaving more resources for parallel functional units.
These design decisions are all predicated on the assumption that sufficient
parallelism exists in the workloads the processor is expected to handle. The
performance of a program with insufficient parallelism may therefore suffer.
A fully general-purpose chip (such as a CPU) cannot afford to aggressively trade
away single-thread performance for increased total performance. The
spectrum of workloads presented to it is simply too broad, and not all
computations are parallel. For computations that are largely sequential,
latency-oriented processors perform better than throughput-oriented processors.
On the other hand, a processor specifically intended for parallel computation
can accept this trade-off and realize significantly greater total throughput
on parallel problems as a result.
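The trade-off can be made concrete with a toy model based on Amdahl's law. Everything numeric here is a hypothetical assumption chosen for illustration, not a measurement from the article: a "latency-oriented" chip is modeled as 4 cores of relative speed 1.0, and a "throughput-oriented" chip as 64 cores of relative speed 0.25. The function `run_time` is likewise an illustrative name.

```python
def run_time(parallel_fraction, cores, core_speed):
    """Normalized running time of one unit of work under Amdahl's law.

    The serial part runs on a single core; the parallel part is divided
    evenly across all cores. `core_speed` is relative single-thread speed.
    """
    serial = (1.0 - parallel_fraction) / core_speed
    parallel = parallel_fraction / (cores * core_speed)
    return serial + parallel


# Hypothetical configurations (assumptions, not real hardware figures).
LATENCY_ORIENTED = {"cores": 4, "core_speed": 1.0}
THROUGHPUT_ORIENTED = {"cores": 64, "core_speed": 0.25}

if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99, 1.0):
        t_lat = run_time(p, **LATENCY_ORIENTED)
        t_thr = run_time(p, **THROUGHPUT_ORIENTED)
        winner = "latency-oriented" if t_lat < t_thr else "throughput-oriented"
        print(f"parallel fraction {p:.2f}: {t_lat:.4f} vs {t_thr:.4f} -> {winner}")
```

Under these assumed parameters the latency-oriented configuration finishes sooner when half the work is sequential, while the throughput-oriented one finishes sooner once the work is almost entirely parallel. The exact crossover depends entirely on the chosen numbers.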
Michael Garland (email@example.com) is a senior
research scientist in NVIDIA Research, Santa Clara, CA.
David B. Kirk (firstname.lastname@example.org) is an NVIDIA Fellow and
former chief scientist of NVIDIA Research, Santa Clara, CA.
© 2010 ACM 0001-0782/10/1100 $10.00