prefetching and data alignment. (See
Section A.1 in the online Appendix^a for
more detail on how to measure processor and memory performance and operational intensity.)
Figure 1a outlines the model for a
2.2GHz AMD Opteron X2 model 2214
in a dual-socket system. The graph is
on a log-log scale. The y-axis is attainable floating-point performance. The
x-axis is operational intensity, varying
from 0.25 Flops/DRAM byte-accessed
to 16 Flops/DRAM byte-accessed.
The system being modeled has peak
double precision floating-point performance of 17.6 GFlops/sec and peak
memory bandwidth of 15GB/sec from
our benchmark. This latter measure is
the steady-state bandwidth potential
of the memory in a computer, not the
pin bandwidth of the DRAM chips.
One can plot a horizontal line showing peak floating-point performance
of the computer. The actual floating-point performance of a floating-point
kernel can be no higher than the horizontal line, since this line is the hardware limit.
How might we plot peak memory
performance? Since the x-axis is Flops
per Byte and the y-axis is GFlops/sec,
gigabytes per second (GB/sec)—or
(GFlops/sec)/(Flops/Byte)—is just a
line of unit slope in Figure 1. Hence,
we can plot a second line that bounds
the maximum floating-point performance that the memory system of
the computer can support for a given
operational intensity. This formula
drives the two performance limits in
the graph in Figure 1a:
\[
\text{Attainable GFlops/sec} = \min\left(\text{Peak Floating-Point Performance},\ \text{Peak Memory Bandwidth} \times \text{Operational Intensity}\right)
\]
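As a minimal sketch of the formula, the following Python evaluates the bound across the operational-intensity range plotted on the x-axis, using the 17.6 GFlops/sec and 15GB/sec figures quoted above. The function and constant names are illustrative assumptions, not part of the original model description.

```python
def attainable_gflops(peak_gflops, peak_bw_gb_per_s, operational_intensity):
    """Roofline bound: the lower of the compute ceiling and
    (memory bandwidth x operational intensity)."""
    return min(peak_gflops, peak_bw_gb_per_s * operational_intensity)


# Values quoted in the text for the dual-socket Opteron X2 system.
PEAK_GFLOPS = 17.6   # peak double-precision GFlops/sec
PEAK_BW = 15.0       # measured steady-state GB/sec

# Sweep the x-axis range of Figure 1a (0.25 to 16 Flops/Byte) in powers of two.
for oi in (0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0):
    bound = attainable_gflops(PEAK_GFLOPS, PEAK_BW, oi)
    print(f"{oi:5.2f} Flops/Byte -> at most {bound:5.2f} GFlops/sec")
```

On a log-log plot, the left part of this sweep traces the diagonal and the right part the horizontal line.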
The two lines intersect at the point
of peak computational performance
and peak memory bandwidth. Note that
these limits are created once per multi-core computer, not once per kernel.
For a given kernel, we can find a
point on the x-axis based on its operational intensity. If we draw a vertical
line (the pink dashed line in the figures) through that point, the performance of the kernel on that computer
must lie somewhere along that line.
a. Please go to doi.acm.org/10.1145/1498765.1498785#supp
The horizontal and diagonal lines
give this bound model its name. The
Roofline sets an upper bound on performance of a kernel depending on
the kernel’s operational intensity. If
we think of operational intensity as a
column that hits the roof, either it hits
the flat part of the roof, meaning performance is compute-bound, or it hits
the slanted part of the roof, meaning
performance is ultimately memory-bound. In Figure 1a, a kernel with
operational intensity 2.0 Flops/Byte
is compute-bound and a kernel with
operational intensity 1.0 Flops/Byte is
memory-bound. Given a Roofline, you
can use it repeatedly on different kernels, since the Roofline doesn’t vary.
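The two example kernels can be checked with a short sketch that compares the two roofs at a given operational intensity. The helper name `classify` is hypothetical, and treating a kernel exactly at the intersection as compute-bound is an arbitrary choice made here for illustration.

```python
def classify(peak_gflops, peak_bw_gb_per_s, operational_intensity):
    """Return (regime, bound in GFlops/sec) for one kernel on one machine."""
    memory_roof = peak_bw_gb_per_s * operational_intensity
    if memory_roof < peak_gflops:
        # The column hits the slanted part of the roof.
        return "memory-bound", memory_roof
    # The column hits the flat part of the roof (ties land here by convention).
    return "compute-bound", peak_gflops


# Same Roofline (17.6 GFlops/sec, 15 GB/sec), reused for two kernels.
for oi in (1.0, 2.0):
    regime, bound = classify(17.6, 15.0, oi)
    print(f"OI {oi} Flops/Byte: {regime}, bound {bound} GFlops/sec")
```

The machine parameters are fixed once; only the operational intensity changes from kernel to kernel.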
Note that the ridge point (where the
diagonal and horizontal roofs meet) offers insight into the computer’s overall
performance. The x-coordinate of the
ridge point is the minimum operational intensity required to achieve maximum performance. If the ridge point is
far to the right, then only kernels with
very high operational intensity can
achieve the maximum performance
of that computer. If it is far to the left,
then almost any kernel can potentially
hit maximum performance. As we explain later, the ridge point suggests
the level of difficulty for programmers
and compiler writers to achieve peak
performance.
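The x-coordinate of the ridge point follows from setting the two bounds equal. A worked instance, using only the rounded peak figures quoted earlier for the Opteron X2 system (so the value read off the plotted figure may differ slightly), might look like this:

```latex
\[
\text{Peak Memory Bandwidth} \times \text{OI}_{\text{ridge}}
  = \text{Peak Floating-Point Performance}
\quad\Longrightarrow\quad
\text{OI}_{\text{ridge}}
  = \frac{17.6\ \text{GFlops/sec}}{15\ \text{GB/sec}}
  \approx 1.17\ \text{Flops/Byte}
\]
```

This is consistent with the earlier examples: a kernel at 1.0 Flops/Byte sits to the left of the ridge point and one at 2.0 Flops/Byte sits to the right.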
To illustrate, we compare the Opteron X2 with two cores in Figure 1a to its
successor, the Opteron X4 with four
cores. To simplify board design, they
share the same socket. Hence, they
have the same DRAM channels and
can thus have the same peak memory
bandwidth, although prefetching is
better in the X4. In addition to doubling the number of cores, the X4
also has twice the peak floating-point
performance per core; X4 cores can
issue two floating-point SSE2 instructions per clock cycle, whereas X2 cores
can issue two instructions every other
clock. As the clock rate is slightly faster (2.2GHz for X2 vs. 2.3GHz for X4),
the X4 is able to achieve slightly more
than four times the peak floating-point
performance of the X2 with the same
memory bandwidth.
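The "slightly more than four times" claim can be reproduced with a back-of-the-envelope sketch. Two assumptions are made here that the text does not state explicitly: each SSE2 instruction performs two double-precision operations, and the X4 system is dual-socket like the X2 system. Under the first assumption the X2 calculation reproduces the 17.6 GFlops/sec quoted earlier. The function and parameter names are illustrative.

```python
def peak_gflops(sockets, cores_per_socket, clock_ghz,
                sse2_instr_per_cycle, flops_per_instr=2):
    """Peak double-precision GFlops/sec.

    flops_per_instr=2 is an ASSUMPTION (two doubles per SSE2 instruction),
    not a figure taken from the text.
    """
    return (sockets * cores_per_socket * clock_ghz
            * sse2_instr_per_cycle * flops_per_instr)


# X2: two SSE2 instructions every other clock, i.e. one per cycle on average.
x2 = peak_gflops(sockets=2, cores_per_socket=2, clock_ghz=2.2,
                 sse2_instr_per_cycle=1.0)

# X4: two SSE2 instructions per clock; dual-socket is ASSUMED to match the X2 system.
x4 = peak_gflops(sockets=2, cores_per_socket=4, clock_ghz=2.3,
                 sse2_instr_per_cycle=2.0)

print(f"X2 peak: {x2:.1f} GFlops/sec")
print(f"X4 peak: {x4:.1f} GFlops/sec")
print(f"X4 / X2: {x4 / x2:.2f}x")
```

The factor of four comes from doubling both the core count and the per-core issue rate; the clock-rate difference supplies the remainder.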
Figure 1b compares the Roofline
models for these two systems. As expected, the ridge point shifts right
APRIL 2009 | VOL. 52 | NO. 4 | COMMUNICATIONS OF THE ACM 67