intensity—the number of floating-point operations per byte transferred
from DRAM—is an important parameter for both the kernels and the multi-core computers.
We applied Roofline to four kernels
from among the Seven Dwarfs [4, 11] on
four recent multicore designs: AMD
Opteron X4, Intel Xeon, IBM Cell, and
Sun T2+. The ridge point—the minimum operational intensity to achieve
maximum performance—proved to
be a better predictor of performance
than clock rate or peak performance.
Cell offered the highest attained performance (GFlops/sec) on these kernels, but T2+ was the easiest computer
on which to achieve its highest performance. One reason is that the
ridge point of the Roofline Model for
T2+ was the lowest.
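The relationship between operational intensity, the roofline, and the ridge point can be sketched in a few lines of Python. The bound itself, the lower of peak floating-point performance and peak memory bandwidth times operational intensity, is the Roofline Model's; the function names and the machine numbers below are hypothetical placeholders, not measurements of the four computers studied here.

```python
def attainable_gflops(peak_gflops, peak_bw_gbytes, intensity):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, peak_bw_gbytes * intensity)


def ridge_point(peak_gflops, peak_bw_gbytes):
    """Minimum operational intensity (Flops/Byte) needed to reach peak."""
    return peak_gflops / peak_bw_gbytes


# Hypothetical machine, chosen only for illustration.
PEAK_GFLOPS = 64.0   # peak floating-point performance, GFlops/sec
PEAK_BW = 16.0       # peak DRAM bandwidth, GBytes/sec

ridge = ridge_point(PEAK_GFLOPS, PEAK_BW)
for oi in (0.5, 2.0, 4.0, 8.0):
    bound = attainable_gflops(PEAK_GFLOPS, PEAK_BW, oi)
    regime = "memory-bound" if oi < ridge else "compute-bound"
    print(f"OI={oi:4.1f} Flops/Byte -> {bound:5.1f} GFlops/sec ({regime})")
```

A lower ridge point means that kernels with modest operational intensity already sit under the flat part of the roof, which is one way to read the observation about T2+.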
Just as the graphical Roofline Model offers insights into the difficulty of
achieving the peak performance of a
computer, it also makes obvious when
a computer is imbalanced. The operational ridge points for the two x86 computers were 4.4 and 6.7 Flops/Byte (roughly 35
to 55 Flops for every 8-byte operand that accesses
DRAM), yet the operational intensities for the 16 combinations of kernels
and computers in Table 4 ranged from
0.25 to just 1.64, with a median of 0.60
Flops/Byte. Architects should keep the
ridge point in mind if they want programs to reach peak performance on
their new designs.
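A short calculation makes the imbalance concrete. The sketch below uses only the numbers quoted above (ridge points of 4.4 and 6.7, intensities of 0.25, 0.60, and 1.64) and assumes a plain roofline with no additional ceilings, so that below the ridge point the attainable fraction of peak is simply intensity divided by ridge point. The 8-byte operand size is an assumption corresponding to double-precision values, and the function names are illustrative.

```python
BYTES_PER_OPERAND = 8  # assumption: double-precision operands


def flops_per_operand(ridge):
    """Flops needed per DRAM operand to sit exactly at the ridge point."""
    return ridge * BYTES_PER_OPERAND


def fraction_of_peak(intensity, ridge):
    """Upper bound on the fraction of peak, ignoring lower ceilings."""
    return min(1.0, intensity / ridge)


RIDGE_POINTS = (4.4, 6.7)  # the two x86 computers
INTENSITIES = {"lowest": 0.25, "median": 0.60, "highest": 1.64}

for ridge in RIDGE_POINTS:
    print(f"ridge {ridge}: about {flops_per_operand(ridge):.0f} Flops per operand")
    for label, oi in INTENSITIES.items():
        share = fraction_of_peak(oi, ridge)
        print(f"  {label} intensity {oi}: at most {share:.0%} of peak")
```

Under these assumptions even the highest-intensity combination stays well below half of peak on either x86 computer.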
We measured the roofline and ceilings using microbenchmarks but
could have used performance counters (see online Appendix A.1 and
A.3). There may indeed be a synergistic relationship between performance
counters and the Roofline Model. The
requirements for automatic creation
of a Roofline model could guide the
designer as to which metrics should
be collected when faced with literally
hundreds of candidates but only a limited hardware budget [6].
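As a rough illustration of what such counters would need to provide, the sketch below derives operational intensity and attained performance from a few raw counts. The counter names, the 64-byte line size, and the sample values are all hypothetical; real hardware exposes different events, and this is not the procedure described in the online appendix.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # assumption: DRAM traffic is counted in 64-byte lines


@dataclass
class CounterSample:
    """Hypothetical counter readings for one kernel run."""
    flops: int               # floating-point operations executed
    dram_lines_read: int     # lines fetched from DRAM
    dram_lines_written: int  # lines written back to DRAM
    seconds: float           # elapsed time of the run


def dram_bytes(s):
    return (s.dram_lines_read + s.dram_lines_written) * LINE_BYTES


def operational_intensity(s):
    """Flops per byte of DRAM traffic."""
    return s.flops / dram_bytes(s)


def attained_gflops(s):
    return s.flops / s.seconds / 1e9


# Made-up readings, for illustration only.
sample = CounterSample(flops=4_000_000_000,
                       dram_lines_read=100_000_000,
                       dram_lines_written=25_000_000,
                       seconds=2.0)
print(f"operational intensity: {operational_intensity(sample):.2f} Flops/Byte")
print(f"attained performance:  {attained_gflops(sample):.2f} GFlops/sec")
```

Only three kinds of event are needed here (floating-point operations, DRAM traffic, and time), which hints at how a Roofline-driven list of metrics could be much shorter than the full set of candidates.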
Roofline offers insights into other
types of multicore systems (such as vector processors and graphical processing units); other kernels (such as sort
and ray tracing); other computational
metrics (such as pair-wise sorts per
second and frames per second); and
other traffic metrics (such as L3 cache
bandwidth and I/O bandwidth). Alas,
there are many more opportunities
for Roofline-oriented research than we
can pursue. We thus invite others to
join us in the exploration of the effectiveness of the Roofline Model.
Acknowledgments
This research was sponsored in part by
the Universal Parallel Computing Research Center funded by Intel and Microsoft and in part by the Office of Advanced Scientific Computing Research
in the U.S. Department of Energy Office
of Science under contract number DE-AC02-05CH11231. We’d like to thank
FZ-Jülich and Georgia Tech for access
to Cell blades; Joseph Gebis, Leonid
Oliker, John Shalf, Katherine Yelick,
and the rest of the Par Lab for feedback on Roofline; and Jike Chong,
Kaushik Datta, Mark Hoemmen, Matt
Johnson, Jae Lee, Rajesh Nishtala, Heidi Pan, David Wessel, Mark Hill, and
the anonymous reviewers for insightful
feedback on early drafts.
References
1. Adve, V. Analyzing the Behavior and Performance
of Parallel Programs. Ph.D. thesis, University
of Wisconsin, 1993;
www.cs.wisc.edu/techreports/1993/tr1201.pdf.
2. AMD. Software Optimization Guide for AMD Family
10h Processors, Publication 40546, Apr. 2008;
www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf.
3. Amdahl, G. Validity of the single processor approach
to achieving large-scale computing capabilities.
In Proceedings of the AFIPS Conference, 1967,
483–485.
4. Asanovic, K., Bodik, R., Catanzaro, B., Gebis, J.,
Keutzer, K., Patterson, D., Plishker, W., Shalf, J.,
Williams, S., and Yelick, K. The Landscape of
Parallel Computing Research: A View from Berkeley.
Technical Report UCB/EECS-2006-183. EECS,
University of California, Berkeley, Dec. 2006.
5. Bienia, C., Kumar, S., Singh, J., and Li, K. The PARSEC
Benchmark Suite: Characterization and Architectural
Implications. Technical Report TR-811-008.
Princeton University, Jan. 2008.
6. Bird, S., Waterman, A., Klues, K., Datta, K., Liu, R.,
Nishtala, R., Williams, S., Asanovic, K., Demmel, J.,
Patterson, D., and Yelick, K. A case for sensible
performance counters. Submitted to the First
USENIX Workshop on Hot Topics in Parallelism
(Berkeley, CA, Mar. 30–31, 2009);
www.usenix.org/events/hotpar09/.
7. Boyd, E., Azeem, W., Lee, H., Shih, T., Hung, S., and
Davidson, E. A hierarchical approach to modeling
and improving the performance of scientific
applications on the KSR1. In Proceedings of the 1994
International Conference on Parallel Processing,
1994, 188–192.
8. Callahan, D., Cocke, J., and Kennedy, K. Estimating
interlock and improving balance for pipelined
machines. Journal of Parallel Distributed Computing
5 (1988), 334–358.
9. Carr, S. and Kennedy, K. Improving the ratio of
memory operations to floating-point operations
in loops. ACM Transactions on Programming
Languages and Systems 16, 4 (Nov. 1994).
10. Chong, J. Private communication on financial PDE
solvers, 2008.
11. Colella, P. Defining Software Requirements for
Scientific Computing. Presentation, 2004.
12. Datta, K., Murphy, M., Volkov, V., Williams, S., Carter,
J., Oliker, L., Patterson, D., Shalf, J., and Yelick, K.
Stencil computation optimization and autotuning
on state-of-the-art multicore architectures.
In Proceedings of the 2008 ACM/IEEE SC08
Conference (Austin, TX, Nov. 15–21). IEEE Press,
Piscataway, NJ, 2008, 1–12.
13. Demmel, J., Dongarra, J., Eijkhout, V., Fuentes, E.,
Petitet, A., Vuduc, R., Whaley, R., and Yelick, K. Self-adapting linear algebra algorithms and software.
Proceedings of the IEEE: Special Issue on Program
Generation, Optimization, and Adaptation 93, 2
(2005).
14. Dubois, M. and Briggs, F.A. Performance of
synchronized iterative processes in multiprocessor
systems. IEEE Transactions on Software Engineering
SE-8, 4 (July 1982), 419–431.
15. Frigo, M. and Johnson, S. The design and
implementation of FFTW3. Proceedings of the IEEE:
Special Issue on Program Generation, Optimization,
and Platform Adaptation 93, 2 (2005).
16. Harris, M. Mapping computational concepts to
GPUs. In ACM SIGGRAPH Courses, Chapter 31 (Los
Angeles, July 31–Aug. 4). ACM Press, New York,
2005.
17. Hennessy, J. and Patterson, D. Computer
Architecture: A Quantitative Approach, Fourth
Edition. Morgan Kaufmann Publishers, Boston, MA,
2007.
18. Hill, M. and Marty, M. Amdahl’s law in the multicore
era. IEEE Computer (July 2008), 33–38.
19. Hill, M. and Smith, A. Evaluating associativity in CPU
caches. IEEE Transactions on Computers 38, 12
(Dec. 1989), 1612–1630.
20. Lazowska, E., Zahorjan, J., Graham, S., and Sevcik, K.
Quantitative System Performance: Computer System
Analysis Using Queueing Network Models. Prentice
Hall, Upper Saddle River, NJ, 1984.
21. Little, J.D.C. A proof of the queueing formula L = λW.
Operations Research 9, 3 (1961), 383–387.
22. McCalpin, J. STREAM: Sustainable Memory
Bandwidth in High-Performance Computers, 1995;
www.cs.virginia.edu/stream.
23. Patterson, D. Latency lags bandwidth. Commun.
ACM 47, 10 (Oct. 2004).
24. Thomasian, A. and Bay, P. Analytic queueing network
models for parallel processing of task systems. IEEE
Transactions on Computers C-35, 12 (Dec. 1986),
1045–1054.
25. Tikir, M., Carrington, L., Strohmaier, E., and Snavely,
A. A genetic algorithms approach to modeling the
performance of memory-bound computations. In
Proceedings of the SC07 Conference (Reno, NV, Nov.
10–16). ACM Press, New York, 2007.
26. Vuduc, R., Demmel, J., Yelick, K., Kamil, S., Nishtala,
R., and Lee, B. Performance optimizations and
bounds for sparse matrix-vector multiply. In
Proceedings of the ACM/IEEE SC02 Conference
(Baltimore, MD, Nov. 16–22). IEEE Computer Society
Press, Los Alamitos, CA, 2002.
27. Williams, S. Autotuning Performance on Multicore
Computers. Ph.D. thesis. University of California,
Berkeley, Dec. 2008;
www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-164.html.
28. Williams, S., Carter, J., Oliker, L., Shalf, J., and Yelick,
K. Lattice Boltzmann simulation optimization on
leading multicore platforms. In Proceedings of the
IEEE International Symposium on Parallel and
Distributed Processing Symposium (Miami, FL, Apr.
14–18, 2008), 1–14.
29. Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K.,
and Demmel, J. Optimization of sparse matrix-vector
multiplication on emerging multicore platforms.
In Proceedings of the ACM/IEEE SC07 Conference
(Reno, NV, Nov. 10–16). ACM Press, New York, 2007.
30. Woo, S., Ohara, M., Torrie, E., Singh, J.-P., and Gupta,
A. The SPLASH-2 programs: Characterization and
methodological considerations. In Proceedings of the
22nd Annual International Symposium on Computer
Architecture. ACM Press, New York, 1995, 24–37.
Samuel Williams (SWWilliams@lbl.gov) is a research
scientist at Lawrence Berkeley National Laboratory,
Berkeley, CA.
Andrew Waterman (waterman@eecs.berkeley.edu) is a
graduate student researcher in the Parallel Computing
Laboratory of the University of California, Berkeley.
David Patterson (pattrsn@eecs.berkeley.edu) is Director
of the Parallel Computing Laboratory of the University of
California, Berkeley, and a past president of ACM.
© 2009 ACM 0001-0782/09/0400 $5.00
76 Communications of the ACM | April 2009 | Vol. 52 | No. 4