from 1.0 Flops/Byte in the Opteron X2
to 4.4 in the Opteron X4. Hence, to realize a performance gain using the X4,
kernels need an operational intensity
greater than 1.0 Flops/Byte.
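The ridge-point argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: the peak and bandwidth figures are assumed values chosen so that the ridge points come out at 1.0 and 4.4 Flops/Byte, not measured Opteron numbers, and the function names are hypothetical.

```python
def attainable_gflops(peak_gflops, peak_bw_gbs, oi):
    """Roofline bound: the lower of peak compute and bandwidth * intensity."""
    return min(peak_gflops, peak_bw_gbs * oi)


def ridge_point(peak_gflops, peak_bw_gbs):
    """Operational intensity (Flops/Byte) at which the two limits meet."""
    return peak_gflops / peak_bw_gbs


# Assumed, illustrative machines sharing one memory system (same bandwidth).
OLD = {"peak_gflops": 16.0, "peak_bw_gbs": 16.0}   # ridge point 1.0
NEW = {"peak_gflops": 70.4, "peak_bw_gbs": 16.0}   # ridge point 4.4

if __name__ == "__main__":
    for oi in (0.5, 1.0, 2.0, 8.0):
        old = attainable_gflops(OLD["peak_gflops"], OLD["peak_bw_gbs"], oi)
        new = attainable_gflops(NEW["peak_gflops"], NEW["peak_bw_gbs"], oi)
        print(f"OI={oi}: old={old} Gflops/s, new={new} Gflops/s")
```

Under these assumptions the two machines are bound by the same value for any operational intensity up to 1.0 Flops/Byte, and the faster one pulls ahead only beyond that point.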
Figure 2: Roofline model with ceilings for Opteron X2: (a) Computational Ceilings, (b) Bandwidth Ceilings, (c) Optimization Regions.
Adding ceilings to the model
The Roofline model provides an upper
bound to performance. Suppose a program performs far below its Roofline.
What optimizations should one implement and in what order? Another
advantage of bound-and-bottleneck
analysis is that “a number of alternatives can be treated together, with a
single bounding analysis providing
useful information about them all.” [20]
We leverage this insight to add multiple ceilings to the Roofline model to
guide which optimizations to implement. It is similar to the guidelines
loop balance gives the compiler. We
can think of each optimization as a
“performance ceiling” below the appropriate Roofline, meaning you cannot break through a ceiling without
first performing the associated optimization.
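One way to read the ceiling idea is as a minimum over the limits that have not yet been lifted. The following is a minimal sketch under assumed values: the ceiling heights, the peak figures, and the function name `bounded_gflops` are hypothetical, and only the optimization labels are taken from Figure 2.

```python
def bounded_gflops(oi, peak_gflops, peak_bw_gbs,
                   compute_ceilings, bandwidth_ceilings, done=frozenset()):
    """Bound on performance given the set of optimizations already performed.

    Each ceiling maps an optimization name to the limit that applies until
    that optimization is performed (assumed semantics, following the text).
    """
    compute = min([peak_gflops] +
                  [v for name, v in compute_ceilings.items() if name not in done])
    bandwidth = min([peak_bw_gbs] +
                    [v for name, v in bandwidth_ceilings.items() if name not in done])
    return min(compute, bandwidth * oi)


# Hypothetical ceiling heights, for illustration only.
COMPUTE = {"ILP or SIMD": 4.0, "floating-point balance": 8.0}        # Gflops/s
BANDWIDTH = {"unit stride accesses": 4.0, "memory affinity": 8.0,
             "software prefetching": 12.0}                            # GB/s

if __name__ == "__main__":
    for done in (set(), {"ILP or SIMD"}, set(COMPUTE)):
        bound = bounded_gflops(4.0, 16.0, 16.0, COMPUTE, BANDWIDTH, done)
        print(sorted(done), "->", bound, "Gflops/s")
```

In this sketch a kernel at 4 Flops/Byte gains from each computational optimization in turn, while a kernel at 0.5 Flops/Byte stays pinned by the bandwidth ceilings until those are addressed.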
For example, to reduce computational bottlenecks on the Opteron X2,
almost any kernel can be helped with
two optimizations:

Improve instruction-level parallelism (ILP) and apply SIMD. For superscalar
architectures, the highest performance comes when fetching, executing, and
committing the maximum number of instructions per clock cycle. The goal is
to improve the code from the compiler to increase ILP. The highest
performance comes from completely covering the functional unit latency.
One way to hide instruction latency is by unrolling loops. For x86-based
architectures, another way is using floating-point SIMD instructions whenever
possible, since a SIMD instruction operates on pairs of adjacent operands; and

Balance floating-point operation mix.
The best performance requires that
a significant fraction of the instruction mix be floating-point operations
(discussed later). Peak floating-point
performance typically also requires
an equal number of simultaneous
floating-point additions and multiplications, since many computers have
multiply-add instructions or an equal
number of adders and multipliers.
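Both computational optimizations can be approximated with simple throughput models. The sketch below rests on simplifying assumptions that are not taken from the article: a fully pipelined functional unit that starts one operation per cycle, and a core with one adder and one multiplier that can each issue once per cycle. The function names are hypothetical.

```python
def latency_cover_fraction(independent_ops, latency_cycles, issue_per_cycle=1):
    """Fraction of peak issue rate reached with a given number of
    independent operations in flight (for example, separate accumulators
    created by unrolling a reduction loop).

    Assumes a fully pipelined unit: covering the latency needs
    latency_cycles * issue_per_cycle independent operations.
    """
    needed = latency_cycles * issue_per_cycle
    return min(1.0, independent_ops / needed)


def balance_fraction(adds, muls):
    """Fraction of peak reached for a given add/multiply mix.

    Assumes one adder and one multiplier, each issuing once per cycle, so
    the run takes max(adds, muls) cycles and peak is 2 flops per cycle.
    """
    if adds + muls == 0:
        return 0.0
    return (adds + muls) / (2 * max(adds, muls))


if __name__ == "__main__":
    # One dependent chain on a 4-cycle unit versus four unrolled chains.
    print(latency_cover_fraction(1, 4), latency_cover_fraction(4, 4))
    # Balanced mix versus additions only.
    print(balance_fraction(100, 100), balance_fraction(100, 0))
```

Under these assumptions a single dependent chain on a four-cycle unit reaches a quarter of the issue rate, and a kernel made only of additions can reach at most half of peak.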
Figure 2 panels (each plots attainable Gflops/sec, 1/2 to 128, against operational intensity in Flops/Byte, 1/8 to 16, under the peak floating-point performance and peak memory bandwidth (stream) rooflines):
(a) Computational Ceilings: TLP only; 1. ILP or SIMD; 2. floating-point balance.
(b) Bandwidth Ceilings: 3. unit stride accesses only; 4. memory affinity; 5. software prefetching.
(c) Optimization Regions: the ceilings of (a) and (b) combined, with kernel 1 and kernel 2 marked.
Memory bottlenecks can be reduced
with the help of three optimizations:

Restructure loops for unit-stride accesses. Optimizing for unit-stride
memory accesses engages hardware prefetching, significantly increasing
memory bandwidth;

Ensure memory affinity. Most microprocessors today include a memory
controller on the same chip with the
68 Communications of the ACM | April 2009 | Vol. 52 | No. 4