table 3: Kernel optimizations. 12,
28,
29
Memory affinity. reduce accesses to Dram
memory attached to the other socket.
Long unit-stride accesses. change loop
structures to generate long unit-stride ac-
cesses to engage the prefetchers;
also reduces tlb misses.
Software prefetching. software and hardware
prefetching both used to get the most
from memory systems.
Reduce conflict misses. Pad arrays to improve
cache-hit rates.
Unroll and reorder loops. to expose sufficient
parallelism and improve cache utilization,
unroll and reorder loops to group
statements with similar addresses;
improves code quality, reduces register
pressure, facilitates simD.
“SIMD-ize” code. the x86 compilers didn’t
generate good sse code, so we made a code
generator to produce sse intrinsics.
Compress data structures (SpMV only). since
bandwidth limits performance, we used
smaller data structures: 16b vs. 32b
index and smaller representations of
non-zero subblocks.
27
and X4, while the bottleneck is almost
evenly split for T2+ and Cell. For FFT,
the surrounding ceilings are memory-bound for Xeon and X4, but compute-bound for T2+ and Cell.
fallacies About Roofline
We have presented this material in several venues, prompting a number of
misconceptions we address here:
Fallacy: The model does not account
for all features of modern processors
(such as caches and prefetching). The
definition of operational intensity we
use here does indeed factor-in caches;
memory accesses are measured between the caches and memory, not between the processor and caches. In our
discussion of performance models, we
showed that the memory bandwidth
measures of the computer include
prefetching and any other optimization (such as blocking) that can im-
prove memory performance. Similarly,
some of the optimizations in Table 3
explicitly involve memory. Moreover,
in our discussion on tying the 3Cs to
operational intensity, we demonstrated the optimizations’ effect on increasing operational intensity by reducing
capacity and conflict misses.
Fallacy: Doubling cache size increases
operational intensity. Autotuning three
of the four kernels gets very close to
the compulsory memory traffic; the resultant working set is sometimes only
a small fraction of the cache. Increasing cache size helps only with capacity
misses and possibly conflict misses,
so a larger cache has no effect on the
operational intensity for the three kernels. However, for 128 3D FFT, a larger
3
cache could capture a whole plane of a
3D cube, improving operational intensity by reducing capacity and conflict
misses.
Fallacy: The model doesn’t account
for the long memory latency. The ceilings for no software prefetching in
Figures 3 and 4 are at lower memory
bandwidth precisely because they cannot hide the long memory latency.
Fallacy: The model ignores integer
units in floating-point programs, possibly limiting performance. For the example kernels we’ve outlined here, the
amount of integer code and integer
performance can affect performance.
For example, the Sun UltraSPARC T2+
fetches two instructions per core per
clock cycle and doesn’t implement the
SIMD instructions of the x86 that can
operate on two double-precision floating-point operands at a time. Relative
to other processors, the T2+ expends a
larger fraction of its instruction issue
bandwidth on integer instructions and
executes them at a lower rate, hurting
overall performance.
Fallacy: The model has nothing to do
with multicore. Little’s Law17, 20, 21
dictates that considerable concurrency is
necessary to really push the limits of
the memory system. This concurrency
is more easily satisfied in a multicore
than in a uniprocessor. While the bandwidth orientation of the Roofline model certainly works for uniprocessors, it
is even more helpful for multicores.
Fallacy: You need to recalculate the
Roofline model for every kernel. The
Roofline needs to be calculated for given performance metrics and comput-
ers just once; it then guides the implementation for any program for which
that metric is the critical performance
metric. The kernels we’ve explored
here use floating-point operations and
main memory traffic. The ceilings are
measured once but can be reordered
depending on whether or not multiplies and adds are naturally balanced
in the kernel (see the earlier discussion on adding ceilings to the model).
Note that the heights of the ceilings
we discuss here document the maximum potential gain of a code performing this optimization. An interesting
future direction is to use performance
counters to adjust the height of the
ceilings and the order of the ceilings
for a particular kernel to show the actual benefits of each optimization and
the recommended order to try them
(see online Appendix A. 3).
Fallacy: The model is limited to easily optimized kernels that never hit in the
cache. These kernels do indeed hit in
the cache; for example, the cache-hit
rates of our three multicores with on-chip caches are at least 94% for Stencil and 98% for FFT. Moreover, if the
Seven Dwarfs were easy to optimize, it
would bode well for the future of multicores. However, our experience is that
it is not easy to create the fastest version of these numerical methods on
the divergent multicore architectures
discussed here. Indeed, three of the results were judged significant enough
to be accepted for publication at major
conferences.
122, 28, 29
Fallacy: The model is limited to floating-point programs. Our focus here has
been on floating-point programs, so
the two axes of the model are floating-point operations per second and the
floating-point operational intensity of
accesses to main memory. However,
the Roofline model can work for other
kernels where performance is a function of different performance metrics.
A concrete example is the transpose
phase of 3D FFT, which performs no
floating-point operations at all. Figure
5 shows a Roofline model for just this
phase on Cell, with exchanges replacing Flops in the model. One exchange
involves reading and writing 16B, so
its operational intensity is 1/32 pair-wise Exchanges/Byte. Despite the computational metric being memory exchanges, there is still a computational
74 communicAtionS of the Acm | APriL 2009 | voL. 52 | no. 4