alternative SIMD-friendly algorithm. An important class of
algorithmic changes involves blocking the data structures
to fit in the cache, thus reducing the memory bandwidth
pressure. Another class of changes involves eliminating
the use of memory gather/scatter operations. Such irregular memory operations can increase both latency and bandwidth usage, and can limit the scope of compiler
vectorization. A common data layout change is to convert
data structures written in an Array of Structures (AOS) representation to a Structure of Arrays (SOA) representation.
This helps prevent gathers when accessing one field of the
structure across the array elements, and helps the compiler
vectorize loops that iterate over the array. Finally, in some
cases, the code cannot be vectorized due to back-to-back
dependencies between loop iterations, and in those cases a
different SIMD-friendly algorithm may need to be chosen.
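As an illustration of the AOS-to-SOA transformation discussed above, the sketch below contrasts the two layouts for a simple particle structure; the type and function names are hypothetical:

```c
#include <stddef.h>

/* AOS: one struct per particle; accessing all x fields is a strided walk. */
struct BodyAOS { float x, y, z; };

/* SOA: one contiguous array per field; unit-stride loads vectorize cleanly. */
struct BodiesSOA { float *x, *y, *z; };

/* Illustrative kernel: shift every particle along the x axis. */
void shift_x_aos(struct BodyAOS *b, size_t n, float dx) {
    for (size_t i = 0; i < n; i++)
        b[i].x += dx;            /* stride = sizeof(struct BodyAOS) bytes */
}

void shift_x_soa(struct BodiesSOA *b, size_t n, float dx) {
    for (size_t i = 0; i < n; i++)
        b->x[i] += dx;           /* stride = sizeof(float): SIMD-friendly */
}
```

In the AOS loop, consecutive iterations touch memory `sizeof(struct BodyAOS)` bytes apart, forcing gather-like access; the SOA loop reads a contiguous float array that the compiler can vectorize with unit-stride loads.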
Performing algorithmic changes does require programmer effort and insight, and we expect that education
and training in parallel programming will play a big role
in enabling programmers to develop better parallel algorithms. The payoffs are large—we show that after performing algorithmic changes, we have an average performance
gap of only 1.3X between best-optimized and compiler-generated code. Moreover, this effort can be amortized
across different processor generations and also across different computing platforms such as GPUs. Since these optimizations target the underlying hardware trends of increasing core counts, increasing SIMD width, and slowly growing bandwidth, a small and predictable performance gap will remain across
future architectures. We demonstrate this by repeating
our experiments for the Intel® Xeon Phi™ Knights Corner
co-processor architecture, a recent x86-based manycore platform. We show that the Ninja gap is almost the same (1.2X).
In fact, the addition of hardware gather support makes
programmability easier for at least one benchmark. The
combination of algorithmic changes coupled with modern
compiler technology is an important step toward enabling
programmers to ride the trend of parallel processing using traditional programming.
2. BENCHMARK DESCRIPTION
For our study, we analyze compute and memory characteristics of recently proposed benchmark suites,2,3,5
and choose a representative set of benchmarks from the
suite of throughput computing applications. Throughput
workloads deal with processing large amounts of data in
a given amount of time, and require a fast response time
for all the data processed. These include workloads from
the areas of High Performance Computing, Financial Services,
Image Processing, Databases, etc.5 Throughput computing
applications have plenty of data- and thread-level parallel-
ism, and have been identified as one of the most important
classes of future applications with compute and memory
characteristics influencing the design of current and
upcoming multi-/many-core processors. They also offer the
most opportunity for exploiting architectural resources—
leading to large Ninja gaps if naive code does not take
advantage of increasing computational resources. We
formulated a representative set of benchmarks described
below that cover this wide range of application domains of
throughput computing. We capture the key computational kernels where most of the time is spent in throughput computing applications. As such, reducing the Ninja gap on our benchmarks will also translate to the applications themselves.
1. NBody: NBody computations are used in many scientific applications, including the fields of astrophysics and
statistical learning algorithms.1 For given N bodies, the
basic computation is an O(N²) algorithm that has two loops
over the bodies, and computes and accumulates pair-wise
interactions between them.
2. BackProjection: Backprojection is a commonly used
kernel in performing cone-beam reconstruction of CT data.13 A set of 2D images is “back-projected” onto a 3D
volume in order to construct the grid of density values. For
each input image, each 3D grid point is projected onto the
image, and the density from the neighboring 2 × 2 pixels is
bilinearly interpolated and accumulated.
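The bilinear interpolation step for one projected point might look like the following sketch, assuming a row-major single-channel image; all names are illustrative:

```c
/* Bilinear interpolation of the 2x2 pixel neighborhood around a projected
   point (u, v) in a row-major image of the given width. Bounds checking is
   omitted for brevity; the caller must keep (u, v) inside the image. */
float bilerp(const float *img, int width, float u, float v) {
    int u0 = (int)u, v0 = (int)v;          /* top-left pixel of the 2x2 */
    float fu = u - u0, fv = v - v0;        /* fractional offsets */
    const float *p = img + v0 * width + u0;
    float top = p[0]     * (1.0f - fu) + p[1]         * fu;
    float bot = p[width] * (1.0f - fu) + p[width + 1] * fu;
    return top * (1.0f - fv) + bot * fv;
}
```

Note that the four pixel reads land at two unrelated addresses per row, which is exactly the gather pattern the text identifies as a vectorization obstacle.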
3. 7-Point Stencil: Stencil computation is used for a wide
range of scientific disciplines.8 The computation involves
multiple sweeps over a spatial input 3D grid of points, where
each sweep computes the weighted sum of each grid point
and its neighbors, and stores the computed value to the output grid.
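A single sweep of the 7-point stencil over the interior of the grid can be sketched as below; the weights and the flat indexing scheme are illustrative:

```c
#include <stddef.h>

/* One sweep of a 7-point stencil on an nx*ny*nz grid: each interior output
   point is a weighted sum of the input point and its six face neighbors.
   The weights c0/c1 and the flat row-major indexing are illustrative. */
void stencil7(const float *in, float *out, int nx, int ny, int nz,
              float c0, float c1) {
    #define IDX(i, j, k) ((size_t)(k)*ny*nx + (size_t)(j)*nx + (i))
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++)
                out[IDX(i,j,k)] = c0 * in[IDX(i,j,k)]
                                + c1 * (in[IDX(i-1,j,k)] + in[IDX(i+1,j,k)]
                                      + in[IDX(i,j-1,k)] + in[IDX(i,j+1,k)]
                                      + in[IDX(i,j,k-1)] + in[IDX(i,j,k+1)]);
    #undef IDX
}
```

Each output point re-reads six neighbors, so repeated sweeps are bandwidth-bound; this is the kind of kernel where the cache blocking discussed earlier pays off.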
4. Lattice Boltzmann Method (LBM): LBM is a class of
computational fluid dynamics methods capable of modeling complex flow problems.25 It simulates the evolution of particle
distribution functions over a 3D lattice over many time-steps. For each time-step, at each grid point, the computation involves the directional density values for the
grid point and its face (6) and edge (12) neighbors (also
referred to as D3Q19).
5. LIBOR Monte Carlo: The LIBOR market model4 is used
to price a portfolio of swaptions. It models a set of forward
rates as a log-normal distribution. A typical Monte Carlo
approach would generate random samples for this distribution and compute the derivative price using a large number
of independent paths.
6. Complex 1D Convolution: This is widely used in application areas like image processing, radar tracking, etc. It
performs 1D convolution on complex 1D images with a large filter.
7. BlackScholes: The Black–Scholes model provides a
partial differential equation (PDE) for the evolution of an
option price. For European options, there is a closed form
expression for the solution of the PDE.20 This involves a
number of math operations: exponent, log,
square-root, and division computations.
8. TreeSearch: In-memory tree structured index search
is commonly used in commercial databases.14 This application involves multiple parallel searches over a tree with different queries, with each query tracing a path through the
tree depending on the results of comparison to the node
value at each tree level.
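One query over an implicit (array-stored) binary search tree can be sketched as below; this layout is a common choice for in-memory index search, not necessarily the one used in the benchmark:

```c
#include <stddef.h>

/* Search an implicit binary search tree stored in an array: node i has
   children 2i+1 and 2i+2. Returns 1 if query is present, 0 otherwise.
   The comparison at each level decides the path, as described above. */
int tree_search(const int *keys, size_t n_nodes, int query) {
    size_t i = 0;
    while (i < n_nodes) {
        if (query == keys[i]) return 1;              /* found */
        i = (query < keys[i]) ? 2*i + 1 : 2*i + 2;   /* descend one level */
    }
    return 0;                                        /* fell off the tree */
}
```

Since each query traces its own root-to-leaf path independently, many queries can be processed in parallel across threads and SIMD lanes.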
9. MergeSort: MergeSort is commonly used in the area of
databases,6 and has also been shown to be the sorting algorithm
of choice for future architectures.22 It sorts an array of N elements using log N merge passes over the complete array.
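A bottom-up merge sort makes the log N passes explicit: each pass merges sorted runs of doubling width. This is a minimal scalar sketch, not a SIMD-friendly merging-network variant:

```c
#include <stdlib.h>
#include <string.h>

/* Bottom-up merge sort: each pass over the array merges adjacent sorted
   runs of length `width`, doubling `width` until the array is one run.
   For N elements this takes ceil(log2 N) passes. */
void merge_sort(int *a, size_t n) {
    int *tmp = malloc(n * sizeof *tmp);
    for (size_t width = 1; width < n; width *= 2) {      /* log N passes */
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            size_t i = lo, j = mid, k = lo;
            while (i < mid && j < hi)
                tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
            while (i < mid) tmp[k++] = a[i++];
            while (j < hi)  tmp[k++] = a[j++];
        }
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}
```

Each pass streams the whole array once, so the algorithm makes log N full sweeps over the data, which is why its performance is sensitive to memory bandwidth.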
10. 2D 5 × 5 Convolution: Convolution is a common