point of operational intensity is 0.65
Flops/Byte.
Here, we demonstrate the Roofline
model on four diverse multicore architectures running four kernels representative of some of the Seven Dwarfs:
Sparse matrix-vector multiplication.
The first example kernel of the sparse
matrix computational dwarf is Sparse
Matrix-Vector multiply (SpMV); the
computation is y = A*x, where A is a
sparse matrix and x and y are dense
vectors. SpMV is popular in scientific
computing, economic modeling, and
information retrieval. Alas, conventional implementations often run at
less than 10% of peak floating-point
performance in uniprocessors. One
reason is the irregular accesses to
memory, which might be expected
from sparse matrices. The operational
intensity varies from 0.17 Flops/Byte
before a register-blocking optimization to 0.25 Flops/Byte afterward [26] (see
online Appendix A.1).
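
To make the SpMV numbers concrete, the following is a minimal sketch of y = A*x over a compressed sparse row (CSR) layout, together with a back-of-the-envelope intensity estimate. The CSR layout, the function names, and the byte accounting (an 8-byte value plus a 4-byte column index per nonzero, ignoring vector and row-pointer traffic) are assumptions made for illustration, not details taken from this section. Under that accounting, two flops per nonzero gives about 0.17 Flops/Byte, and amortizing the index away entirely (the limiting case for register blocking) gives 0.25.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A*x for a sparse matrix A stored in CSR form (assumed layout)."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Nonzeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is indexed through col_idx: this is the irregular memory access.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def spmv_intensity(value_bytes=8, index_bytes=4):
    """Flops per byte for one nonzero: one multiply and one add (assumed accounting)."""
    return 2.0 / (value_bytes + index_bytes)


# Illustrative matrix A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]].
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = spmv_csr(values, col_idx, row_ptr, x)
unblocked = spmv_intensity()              # index traffic paid per nonzero
blocked = spmv_intensity(index_bytes=0)   # index cost fully amortized (limit case)
```

The inner loop does two flops for every value and index it streams from memory, which is why the kernel sits so far to the left on the Roofline.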
Given that the operational intensity
of SpMV was below the ridge point of
all four multicores in Figure 3, most
optimizations involve the memory system. Table 3 summarizes the optimizations used by SpMV and the rest of
the kernels. Many are associated with
the ceilings in Figure 3, and the height
of the ceilings suggests the potential
benefit of these optimizations.
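
The reasoning above follows from the Roofline bound itself: attainable performance is the lesser of peak floating-point performance and peak memory bandwidth times operational intensity. The sketch below evaluates that bound for the two SpMV intensities. The machine parameters are hypothetical placeholders chosen for round numbers; they are not the measured values of any of the four multicores.

```python
def attainable_gflops(peak_gflops, peak_gbytes_per_s, intensity):
    """Roofline bound: min(peak compute, peak bandwidth * operational intensity)."""
    return min(peak_gflops, peak_gbytes_per_s * intensity)


def ridge_point(peak_gflops, peak_gbytes_per_s):
    """Operational intensity at which the bandwidth and compute roofs meet."""
    return peak_gflops / peak_gbytes_per_s


# Hypothetical machine, for illustration only.
PEAK_GFLOPS = 64.0   # assumed peak double-precision Gflops/s
PEAK_GBYTES = 16.0   # assumed peak memory bandwidth in GBytes/s

ridge = ridge_point(PEAK_GFLOPS, PEAK_GBYTES)

# SpMV before and after register blocking (intensities quoted in the text).
bounds = {oi: attainable_gflops(PEAK_GFLOPS, PEAK_GBYTES, oi) for oi in (0.17, 0.25)}
memory_bound = all(oi < ridge for oi in bounds)
```

Because both intensities fall left of the ridge point on this assumed machine, the bound scales with bandwidth, so only memory-system optimizations can raise it.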
Lattice-Boltzmann Magnetohydrodynamics. Like SpMV, LBMHD tends to
achieve a small fraction of peak performance on uniprocessors due to the
complexity of the data structures and
the irregularity of memory access patterns. The Flop-to-Byte ratio is 0.70
vs. 0.25 or less in SpMV. By using the
no-allocate store optimization, a programmer can improve the operational
intensity of LBMHD to 1.07 Flops/Byte.
Both x86 multicores offer this
cache optimization, but Cell does not
have this problem since it uses DMA.
Hence, T2+ is the only one of the four
computers with the lower intensity of
0.70 Flops/Byte.
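
The effect of the no-allocate store can be sketched with simple traffic accounting: on a write-allocate cache, every stored line is first read from memory, so each written byte costs two bytes of traffic. The per-point counts below (1,300 flops, 73 doubles read, 79 doubles written) are assumptions that do not appear in this section. They are used only because they reproduce the two ratios quoted above.

```python
BYTES_PER_DOUBLE = 8


def operational_intensity(flops, doubles_read, doubles_written, write_allocate):
    """Flops per byte of memory traffic for one lattice-point update."""
    read_bytes = doubles_read * BYTES_PER_DOUBLE
    write_bytes = doubles_written * BYTES_PER_DOUBLE
    # With write-allocate, each written line is first filled from memory.
    fill_bytes = write_bytes if write_allocate else 0
    return flops / (read_bytes + write_bytes + fill_bytes)


# Assumed per-point counts for LBMHD (not stated in this section).
FLOPS, READS, WRITES = 1300, 73, 79

with_allocate = operational_intensity(FLOPS, READS, WRITES, write_allocate=True)
no_allocate = operational_intensity(FLOPS, READS, WRITES, write_allocate=False)
```

Under these assumptions the two values round to 0.70 and 1.07 Flops/Byte, so the whole gain comes from eliminating the fill traffic.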
Figures 3 and 4 show that the operational intensity of LBMHD is high
enough that both computational and
memory bandwidth optimizations
make sense on all multicores, except
the T2+, where the Roofline ridge point
is below the operational intensity of LBMHD. The T2+
reaches its performance ceiling using
only the computational optimizations.

[Figure 3d–3f: Roofline model for Intel Xeon, AMD Opteron X4, and IBM Cell. Panels: (d) AMD Opteron X4 (Barcelona); (e) IBM Cell (QS20); (f) IBM Cell (QS20). Axes: Gflops/s versus operational intensity (Flops/Byte), both on logarithmic scales.]

Stencil. In general, a stencil on a
structured grid is defined as a function
that updates a point based on the values of its neighbors. The stencil structure remains constant as it moves from
one point in space to the next. For this
work, we use the stencil derived from
the explicit heat equation, a partial differential equation on a uniform 256³
3D grid [12]. The stencil’s neighbors are
the nearest six points along each axis,
as well as the center point itself. This
stencil performs eight floating-point
operations for every 24B of compulsory memory traffic on write-allocate
72 Communications of the ACM | April 2009 | Vol. 52 | No. 4