already difficult jobs of programmers,
compiler writers, and even architects.
Hence, an easy-to-understand model
that offers performance guidelines
would be especially valuable.
Such a model need not be perfect,
just insightful. The 3Cs (compulsory,
capacity, and conflict misses) model
for caches is an analogy.19 It is not perfect, as it ignores potentially important
factors like block size, block-allocation
policy, and block-replacement policy.
It also has quirks; for example, a miss
might be labeled “capacity” in one design and “conflict” in another cache
of the same size. Yet the 3Cs model
has been popular for nearly 20 years
precisely because it offers insight into
the behavior of programs, helping programmers, compiler writers, and architects improve their respective designs.
Here, we propose one such model
we call Roofline, demonstrating it on
four diverse multicore computers using four key floating-point kernels.
Stochastic analytical models4,24 and statistical performance models7,25 can accurately predict program performance on multiprocessors but rarely provide insight into how to improve the performance of programs, compilers, and computers,1 and they can be difficult for nonexperts to use.

An alternative, simpler approach is “bound and bottleneck analysis.” Rather than trying to predict performance, it provides “valuable insight into the primary factors affecting the performance of computer systems. In particular, the critical influence of the system bottleneck is highlighted and quantified.”

The best-known example of a performance bound is surely Amdahl’s Law,3 which says the performance gain of a parallel computer is limited by the serial portion of a parallel program; it was recently applied to heterogeneous multicore computers.

For the foreseeable future, off-chip memory bandwidth will often be the constraining resource in system performance.23 Hence, we want a model that relates processor performance to off-chip memory traffic. Toward this goal, we use the term “operational intensity” to mean operations per byte of DRAM traffic, defining total bytes accessed as those bytes that go to main memory after they have been filtered by the cache hierarchy. That is, we measure traffic between the caches and memory rather than between the processor and the caches. Thus, operational intensity predicts the DRAM bandwidth needed by a kernel on a particular computer.

We say “operational intensity” instead of, say, “arithmetic intensity”8,9 or “machine balance” for two reasons. First, arithmetic intensity and machine balance measure traffic between the processor and the cache, whereas efficiency-level programmers want to measure traffic between the caches and DRAM. This subtle change allows them to include the memory optimizations of a computer in our bound-and-bottleneck model. Second, we think the model will work with kernels where the operations are not arithmetic, as discussed later, so we needed a more general term than “arithmetic.”

The proposed Roofline model ties together floating-point performance, operational intensity, and memory performance in a 2D graph. Peak floating-point performance can be found through hardware specifications or microbenchmarks. The working sets of the kernels we consider here do not fit fully in on-chip caches, so peak memory performance is defined by the memory system behind the caches. Although one can find memory performance through the STREAM benchmark,22 for this work we wrote a series of progressively optimized microbenchmarks designed to determine sustainable DRAM bandwidth. They include all the techniques we know of to get the best memory performance.

Figure 1: Roofline model for (a) AMD Opteron X2 and (b) Opteron X2 vs. Opteron X4. [Axes: operational intensity (flops/byte) vs. attainable performance; the rooflines are formed by peak floating-point performance and peak memory bandwidth (STREAM), with example kernels at operational intensities 1 and 2.]
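The bound-and-bottleneck structure behind the Roofline model can be sketched in a few lines: attainable performance is the minimum of peak floating-point throughput and the product of peak memory bandwidth and operational intensity. The peak numbers below are illustrative placeholders for a hypothetical machine, not measurements of any computer discussed here.

```python
def attainable_gflops(peak_gflops: float, peak_gb_per_s: float,
                      operational_intensity: float) -> float:
    """Roofline bound: a kernel is limited either by peak floating-point
    performance or by memory bandwidth times its operational intensity
    (flops per byte of DRAM traffic), whichever is smaller."""
    return min(peak_gflops, peak_gb_per_s * operational_intensity)


# Illustrative peaks (hypothetical machine, not measured values).
PEAK_GFLOPS = 17.6   # peak floating-point performance, GFlops/s
PEAK_BW = 15.0       # sustainable DRAM bandwidth, GB/s

# Kernels below the "ridge point" (PEAK_GFLOPS / PEAK_BW flops/byte) are
# memory-bound; kernels above it are compute-bound.
for oi in (0.25, 1.0, 4.0):
    print(f"OI={oi:5.2f} flops/byte -> "
          f"{attainable_gflops(PEAK_GFLOPS, PEAK_BW, oi):.2f} GFlops/s")
```

Plotting this bound against operational intensity on log-log axes produces the sloped bandwidth line that flattens at peak floating-point performance, the characteristic "roofline" shape.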
66 Communications of the ACM | April 2009 | Vol. 52 | No. 4