image filtering operation used for effects such as blur.¹⁵ For a given 2D image and a 5 × 5 spatial filter, each pixel computes and stores the weighted sum of a 5 × 5 neighborhood of pixels, where the weights are the corresponding values in the filter.
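To make the computation concrete, here is a minimal scalar sketch of such a kernel in C; the function name, the border handling, and the row-major layout are our illustrative choices rather than the benchmark's actual code:

    /* 5x5 convolution: each output pixel stores the weighted sum of its
     * 5x5 neighborhood, with weights taken from the filter k. Assumes a
     * row-major image whose 2-pixel border is left untouched so the
     * stencil never reads out of bounds. */
    void conv5x5(const float *in, float *out, int W, int H,
                 const float k[5][5]) {
        for (int y = 2; y < H - 2; y++) {
            for (int x = 2; x < W - 2; x++) {
                float sum = 0.0f;
                for (int dy = -2; dy <= 2; dy++)
                    for (int dx = -2; dx <= 2; dx++)
                        sum += k[dy + 2][dx + 2] * in[(y + dy) * W + (x + dx)];
                out[y * W + x] = sum;
            }
        }
    }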
11. Volume Rendering: Volume Rendering is commonly used in fields such as medical imaging.²⁴ Given a 3D grid and a 2D image location, the benchmark spawns rays perpendicular to the image plane through the 3D grid; each ray accumulates color and opacity along its path to compute the final color of its pixel in the image.
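A minimal sketch of the per-pixel ray march follows, using front-to-back compositing; the nearest-neighbor sampling and the grayscale transfer function are deliberately simplistic stand-ins, not the benchmark's implementation:

    #include <stddef.h>

    typedef struct { float r, g, b, a; } RGBA;

    /* Deliberately simple stand-ins: nearest-neighbor volume lookup and a
     * grayscale transfer function mapping density to color and opacity. */
    static float sample(const float *vol, int N, int x, int y, int z) {
        return vol[((size_t)z * N + y) * N + x];
    }
    static RGBA transfer(float d) {
        RGBA s = { d, d, d, 0.05f * d };
        return s;
    }

    /* March a ray perpendicular to the image plane (fixed x, y; increasing
     * z), compositing color and opacity front to back. */
    RGBA render_pixel(const float *vol, int N, int x, int y) {
        RGBA out = { 0.0f, 0.0f, 0.0f, 0.0f };
        for (int z = 0; z < N; z++) {
            RGBA s = transfer(sample(vol, N, x, y, z));
            float w = (1.0f - out.a) * s.a;   /* remaining transparency */
            out.r += w * s.r;
            out.g += w * s.g;
            out.b += w * s.b;
            out.a += w;
            if (out.a > 0.995f) break;        /* early ray termination */
        }
        return out;
    }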
Ninja Performance: Table 1 provides details of the
representative dataset sizes for each of the benchmarks.
For each, there exists a corresponding best-performing code whose performance numbers have been previously citedᵈ on platforms different from those used in our study. To perform a fair comparison, we
implemented and aggressively optimized the benchmarks ourselves (including the use of intrinsics/assembly code), and obtained performance comparable to the best reported numbers on the corresponding platforms. This code was then executed on our platforms to obtain the best optimized performance numbers used in this paper. Table 1 (column 3) shows the best optimized (Ninja)
performance for all the benchmarks on Intel® Core™ i7
X980. For the rest of the paper, Ninja Performance refers
to the performance numbers obtained by executing this
code on our platforms.
3. BRIDGING THE NINJA GAP
In this section, we take each of the benchmarks described
in Section 2 and attempt to bridge the Ninja gap, starting from naively written code requiring low programming effort. For a detailed performance analysis, we refer the reader to our original paper.
Platform: We measured performance on a 3.3 GHz 6-core Intel® Core™ i7 X980 (code-named Westmere; peak compute: 158 GFlops, peak bandwidth: 30 GBps). The cores feature an out-of-order super-scalar micro-architecture with 2-way Simultaneous Multi-Threading (SMT) and 4-wide SIMD units that support a wide range of instructions. Each core has a private 32 KB L1 cache and a 256 KB L2 cache.
The cores share a 12 MB last-level cache (LLC). Our system
has 12 GB RAM and runs SuSE Enterprise Linux (ver. 11). We
use the Intel® Composer XE 2011 compiler.
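As a sanity check on these peak numbers: assuming each core issues one 4-wide single-precision SSE multiply and one 4-wide add per cycle, peak compute is 3.3 GHz × 6 cores × 8 flops/cycle ≈ 158.4 GFlops, matching the figure above.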
Methodology: For each benchmark, we first attempt to obtain good single-thread performance by exploiting instruction- and data-level parallelism. To exploit data-level parallelism, we measure the SIMD scaling of each benchmark by running the code with auto-vectorization enabled and disabled (the -no-vec compiler flag disables it). If SIMD scaling is not close to peak, we analyze the generated code to identify architectural bottlenecks. We then obtain thread-level parallelism by adding OpenMP pragmas to parallelize the benchmark and evaluate thread scaling, again analyzing bottlenecks. Finally, we make the necessary algorithmic changes to overcome these bottlenecks.
Compiler pragmas used: We use OpenMP for thread-level parallelism, and for data parallelism we use the auto-vectorizer or recent technologies such as the array notations introduced as part of Intel® Cilk™ Plus (hereafter referred to as array notations). Details about the specific compiler techniques are available in Tian et al.²⁶ The compiler directives we add to the code and command line are the following:
• ILP optimizations: We use the #pragma unroll directive before an innermost loop, and the #pragma unroll_and_jam directive before an outer loop. Both optionally accept the number of times a loop is to be unrolled (see the combined sketch after this list).
• Vectorizing at the innermost loop level: If auto-vectorization fails, the programmer can force vectorization using #pragma simd, a recent feature introduced in the Intel® Composer XE compiler.
• Vectorizing at outer loop levels: This can be done in two different ways: (1) directly vectorize at an outer loop level, or (2) stripmine outer loop iterations and change each statement in the loop body to operate on the strip. In this study, we used the second approach with array notations.
• Parallelization: We use OpenMP pragmas to parallelize loops, typically applying the #pragma omp parallel for construct to an outer for loop.
• Fast math: We selectively apply the -fimf-precision flag to our benchmarks, depending on their precision needs.
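To show how these directives compose, here is a minimal sketch in C on an illustrative loop nest; the function names, loop bounds, and strip width are our own, and only the pragmas and the array-notation section syntax (a[start : length : stride]) come from the toolchain described above:

    #define STRIP 8   /* illustrative strip width; assumes rows % STRIP == 0 */

    /* Inner-loop vectorization plus ILP and threading directives. */
    void scale_rows(float *out, const float *in, const float *coef,
                    int rows, int cols) {
        #pragma omp parallel for            /* thread-level parallelism */
        for (int r = 0; r < rows; r++) {
            #pragma simd                    /* force inner-loop vectorization */
            #pragma unroll(4)               /* unroll the innermost loop 4 times */
            for (int c = 0; c < cols; c++)
                out[r * cols + c] += coef[r] * in[r * cols + c];
        }
    }

    /* Outer-loop vectorization by stripmining with array notations: each
     * statement operates on a section a[start : length : stride], that is,
     * on the same column of STRIP consecutive rows at once. */
    void scale_rows_stripmined(float *out, const float *in, const float *coef,
                               int rows, int cols) {
        #pragma omp parallel for
        for (int r = 0; r < rows; r += STRIP)
            for (int c = 0; c < cols; c++)
                out[r * cols + c : STRIP : cols] +=
                    coef[r : STRIP] * in[r * cols + c : STRIP : cols];
    }

In the second function, each statement operates on STRIP rows at a time, which exposes the outer loop dimension to the vectorizer.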
1. Nbody: We implemented Nbody on a dataset of 1 million bodies (16 MB of memory). Figure 2 shows the breakdown of the various optimizations. The code consists of two loops that iterate over all pairs of bodies. We first performed unrolling optimizations to improve ILP, which gives a benefit of 1.4X. The compiler auto-vectorizes the code well with no programmer intervention and provides a 3.7X SIMD scaling. We obtained a parallel scaling of only 3.1X on six cores, which motivates our algorithmic optimization of blocking the data structures to fit in the L3 cache (1-D blocking, code in Section 4.1). Once blocking is done, we obtain an additional 1.9X thread scaling, leaving a 1.1X performance gap between compiled and best-optimized code.
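As an illustration of the 1-D blocking idea (the paper's actual blocked kernel appears in Section 4.1), the following minimal sketch blocks the j loop so that one block of bodies stays resident in the last-level cache while every i body streams against it; the Body layout, the BLOCK size, and the softening constant are our illustrative choices:

    #include <math.h>

    /* Illustrative block size: 32K bodies x 16 B = 512 KB, well within the
     * 12 MB LLC of the Core i7 X980. */
    #define BLOCK 32768

    typedef struct { float x, y, z, m; } Body;

    void nbody_forces(const Body *b, float *fx, float *fy, float *fz, int n) {
        for (int jb = 0; jb < n; jb += BLOCK) {        /* 1-D blocking over j */
            int jend = (jb + BLOCK < n) ? jb + BLOCK : n;
            #pragma omp parallel for
            for (int i = 0; i < n; i++) {
                float ax = 0.0f, ay = 0.0f, az = 0.0f;
                for (int j = jb; j < jend; j++) {      /* block stays cache resident */
                    float dx = b[j].x - b[i].x;
                    float dy = b[j].y - b[i].y;
                    float dz = b[j].z - b[i].z;
                    float r2 = dx * dx + dy * dy + dz * dz + 1e-6f; /* softening */
                    float inv = 1.0f / sqrtf(r2);
                    float s = b[j].m * inv * inv * inv;
                    ax += s * dx; ay += s * dy; az += s * dz;
                }
                fx[i] += ax; fy[i] += ay; fz[i] += az;
            }
        }
    }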
Table 1. Various benchmarks and the respective datasets used, along with best optimized (Ninja) performance on Core i7 X980.

Benchmark               Dataset                      Ninja performance
NBody¹                  10⁶ bodies                   7.5 × 10⁹ Pairs/sec
BackProjection¹³        500 images on 1K³            1.9 × 10⁹ Proj./sec
7-Point 3D Stencil¹⁷    512³ grid                    4.9 × 10⁹ Up./sec
LBM¹⁷                   256³ grid                    2.3 × 10⁸ Up./sec
LIBOR                   10 M paths on 15 options     8.2 × 10⁵ Paths/sec
Complex 1D Conv.        8 K on 1.28 M pixels         1.9 × 10⁶ Pixels/sec
BlackScholes            1 M call + put options       8.1 × 10⁸ Options/sec
TreeSearch¹⁴            100 M queries on 64 M tree   7.1 × 10⁷ Queries/sec
MergeSort⁶              256 M elements               2.1 × 10⁸ Data/sec
2D 5×5 Convolution¹⁵    2 K × 2 K image              2.2 × 10⁹ Pixels/sec
Volume Rendering²⁴      512³ volume                  2.0 × 10⁸ Rays/sec
ᵈ The best reported numbers are cited from recent top-tier publications in the areas of databases, HPC, image processing, etc. To the best of our knowledge, there is no faster-performing code for any of the benchmarks.