(This assumes that the operational intensity is fixed, though this is not always the case; for example, for some kernels the operational intensity increases with problem size, as for Dense Matrix and N-body problems.)
Caches filter the number of accesses that go to memory, so optimizations that improve cache performance increase operational intensity. Thus, we may couple the 3Cs model to the Roofline model. Compulsory misses set the minimum memory traffic and hence the highest possible operational intensity. Memory traffic from conflict and capacity misses can considerably lower the operational intensity of a kernel, so we should try to eliminate such misses.
For example, we can reduce traffic from conflict misses by padding arrays to change cache-line addressing. A second example is that some computers have a non-allocating store instruction, so stores go directly to memory and do not affect the caches. This approach prevents loading a cache block with data that will simply be overwritten, thereby reducing memory traffic. It also avoids displacing useful items in the cache with data that will not be read, thereby eliminating conflict misses.
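Both optimizations remove memory traffic and thereby raise operational intensity. As a concrete sketch of the first, the C fragment below pads two large arrays so that a column-wise sweep no longer maps every access to the same cache sets. The sizes and the one-cache-line pad are illustrative assumptions; the right pad depends on the cache geometry of the target machine.

    #include <stdio.h>

    #define N   1024   /* power-of-two leading dimension: without padding,
                          successive rows map to the same cache sets      */
    #define PAD 8      /* one 64-byte cache line of doubles staggers the
                          set mapping from row to row                     */

    static double a[N][N + PAD];
    static double b[N][N + PAD];

    /* Column-wise sweep: with PAD = 0, a[i][j], b[i][j], and successive
       rows contend for the same sets of a low-associativity cache and
       evict one another. The PAD columns are never touched; they only
       shift each row's starting address. */
    void add_columns(void)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] += b[i][j];
    }

    int main(void)
    {
        add_columns();
        printf("%f\n", a[0][0]);
        return 0;
    }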
This rightward shift of operational intensity could put a kernel in a different optimization region. Generally, we advise improving the operational intensity of a kernel before implementing other optimizations.
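To see what such a shift means quantitatively, the small C example below evaluates the Roofline bound, min(peak performance, peak bandwidth × operational intensity), for a vector update with and without write-allocate traffic. The machine numbers are made up for illustration and describe no particular system.

    #include <stdio.h>

    /* Roofline bound: attainable GFlops/s = min(peak, bandwidth * OI). */
    static double roofline(double peak_gflops, double bw_gbs, double oi)
    {
        double mem_bound = bw_gbs * oi;
        return mem_bound < peak_gflops ? mem_bound : peak_gflops;
    }

    int main(void)
    {
        double peak = 74.0, bw = 16.0;  /* hypothetical machine */

        /* z[i] = a*x[i] + y[i]: 2 flops per element.
           With write-allocate stores, 32 bytes move per element
           (read x, read y, allocate z, write z): OI = 2/32.
           A non-allocating store removes the 8-byte allocate: OI = 2/24. */
        double oi_alloc   = 2.0 / 32.0;
        double oi_noalloc = 2.0 / 24.0;

        printf("write-allocate stores:  %.2f GFlops/s\n",
               roofline(peak, bw, oi_alloc));
        printf("non-allocating stores:  %.2f GFlops/s\n",
               roofline(peak, bw, oi_noalloc));
        return 0;
    }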
Demonstrating the Model

To demonstrate the Roofline model's utility, we now construct Roofline models for four recent multicore computers and then optimize four floating-point kernels. We'll then show that the ceilings and rooflines bound the observed performance for all computers and kernels.

Four diverse multicore computers. Given the lack of conventional wisdom concerning multicore architecture, it's not surprising that there are as many different designs as there are chips. Table 1 lists the key characteristics of the four multicore computers, all dual-socket systems, that we discuss here.

The Intel Xeon uses relatively sophisticated processors, capable of executing two SIMD instructions per clock cycle that can each perform two double-precision floating-point operations. It is the only one of the four machines with a front-side bus connecting to a common north bridge chip and memory controller. The other three have the memory controller on chip.

The Opteron X4 also uses sophisticated cores with high peak floating-point performance but is the only computer of the four with on-chip L3 caches. The two sockets communicate over separate, dedicated HyperTransport links, making it possible to build a "glueless" multi-chip system.

The Sun UltraSPARC T2+ uses relatively simple processors at a modest clock rate compared to the other three, allowing it to have twice as many cores per chip. It is also highly multithreaded, with eight hardware-supported threads per core. It has the highest memory bandwidth of the four, as each chip has two dual-channel memory controllers that can drive four sets of DDR2/FB-DIMMs.

The clock rate of the IBM Cell QS20 is the highest of the four multicores at 3.2GHz. It is also the most unusual of the four, with a heterogeneous design: a relatively simple PowerPC core plus eight synergistic processing elements (SPEs) with their own unique SIMD-style instruction set. Each SPE also has its own local memory instead of a cache. An SPE must transfer data from main memory into the local memory to operate on it and then back to main memory when the computation is completed, using direct memory access (DMA), which has some similarity to software prefetching. The lack of caches means porting programs to Cell is more challenging.
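The SPE staging pattern can be made concrete with a schematic double-buffering loop in C. The dma_get, dma_put, and dma_wait routines below are placeholders of our own, implemented here as plain memcpys so the sketch compiles and runs anywhere; they are not the Cell SDK's MFC interface, and on real hardware the transfers would proceed asynchronously, overlapping with the compute loop in the spirit of software prefetching.

    #include <stdio.h>
    #include <string.h>
    #include <stddef.h>

    #define CHUNK 1024  /* elements staged per local-store buffer */

    /* Stand-ins for asynchronous DMA commands; synchronous memcpys here. */
    static void dma_get(void *local, const void *remote, size_t n)
        { memcpy(local, remote, n); }
    static void dma_put(void *remote, const void *local, size_t n)
        { memcpy(remote, local, n); }
    static void dma_wait(int tag) { (void)tag; }  /* no-op in this sketch */

    static float buf[2][CHUNK];  /* double buffer in "local memory" */

    /* Scale an array living in "main memory"; n must be a multiple of
       CHUNK. While one buffer is computed on, the next chunk's transfer
       has (conceptually) already been issued. */
    static void scale(float *data, size_t n, float s)
    {
        int cur = 0;
        dma_get(buf[cur], data, CHUNK * sizeof(float));
        for (size_t off = 0; off < n; off += CHUNK) {
            int nxt = cur ^ 1;
            if (off + CHUNK < n)  /* issue the next transfer early */
                dma_get(buf[nxt], data + off + CHUNK, CHUNK * sizeof(float));
            dma_wait(cur);        /* ensure the current chunk has arrived */
            for (size_t i = 0; i < CHUNK; i++)
                buf[cur][i] *= s;
            dma_put(data + off, buf[cur], CHUNK * sizeof(float));
            cur = nxt;
        }
    }

    int main(void)
    {
        static float data[4 * CHUNK];
        for (size_t i = 0; i < 4 * CHUNK; i++) data[i] = 1.0f;
        scale(data, 4 * CHUNK, 2.0f);
        printf("%f\n", (double)data[0]);  /* prints 2.000000 */
        return 0;
    }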
Four diverse floating-point kernels. Rather than pick programs from a standard parallel benchmark suite (such as Parsec [5] and Splash-2 [30]), we were inspired by the work of Phil Colella [11], an expert in scientific computing at Lawrence Berkeley National Laboratory, who identified seven numerical methods he believes will be important for computational science and engineering for at least the next decade. Because he identified seven, they are called the Seven Dwarfs and are specified at a high level of abstraction to allow reasoning about their behavior across a broad range of implementations. The widely read "Berkeley View" report [4] found that if the data types were changed from floating point to integer, the same Seven Dwarfs would also be found in many other programs. Note that the claim is not that the Dwarfs are easy to parallelize but that they will be important to computing in most current and future applications; designers are thus advised to make sure they run well on the systems they create, whether or not they pick the Dwarfs as benchmarks.
Table 2. Characteristics of four floating-point kernels.

    Kernel    Operational intensity   Description
    SpMV      0.17 to 0.25            Sparse matrix-vector multiply, y = A*x, where A is a sparse
                                      matrix and x, y are dense vectors; multiplies and adds are equal.
    LBMHD     0.70 to 1.07            Lattice-Boltzmann magnetohydrodynamics, a structured-grid
                                      code with a series of time steps.
    Stencil   0.33 to 0.50            A multigrid kernel that updates seven nearby points in a 3D
                                      stencil for a 256³ problem.
    3-D FFT   1.09 to 1.64            Three-dimensional fast Fourier transform (two sizes: 128³ and 512³).
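The low ratio for SpMV follows directly from the kernel's structure, as the minimal compressed sparse row (CSR) sketch below shows; CSR is one common layout, assumed here for illustration. Each nonzero contributes one multiply and one add (2 flops) but costs at least 12 bytes of compulsory traffic (an 8-byte value plus a 4-byte column index), about 0.17 flops per byte, the low end of the range in Table 2.

    #include <stdio.h>

    /* y = A*x with A in compressed sparse row (CSR) form. */
    static void spmv(int nrows, const int *rowptr, const int *col,
                     const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            /* one multiply and one add per nonzero; val[k] and col[k]
               are each read exactly once, so memory traffic stays high */
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[col[k]];
            y[i] = sum;
        }
    }

    int main(void)
    {
        /* 2x2 example: A = [[4, 1], [0, 3]], x = (1, 2) */
        int    rowptr[] = {0, 2, 3};
        int    col[]    = {0, 1, 1};
        double val[]    = {4.0, 1.0, 3.0};
        double x[]      = {1.0, 2.0}, y[2];

        spmv(2, rowptr, col, val, x, y);
        printf("y = (%g, %g)\n", y[0], y[1]);  /* y = (6, 6) */
        return 0;
    }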