compiler generated extra spill/fill instructions, which resulted in a performance gap of 1.4X.
5. LIBOR: LIBOR4 has an outer loop over all paths of the Monte Carlo simulation, and an inner loop over the forward rates on a single path. Figure 3 shows a performance benefit of only 1.5X from auto-vectorization, since the current compiler only attempts to vectorize the inner loop, which has back-to-back dependencies and can only be partially vectorized. To solve this issue, we performed an algorithmic change to convert the layout from AOS to SOA, and used the array notations technology to express outer-loop vectorization (code in Figure 7b). These changes allow the outer loop to be vectorized and provide an additional 2.5X SIMD scaling (net 3.8X). The performance is similar to the best-optimized code.
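As a rough illustration of this restructuring, the sketch below stores the forward rates in SOA (rate-major) order so that the outer loop over paths sees unit-stride data. The paper expresses the outer-loop vectorization with array notation (Figure 7b); here the standard OpenMP simd pragma is used instead to keep the sketch self-contained, and the payoff computation and all names are placeholders rather than the benchmark's actual kernel.

/* Sketch only: SOA layout plus outer-loop (per-path) vectorization. */
#include <stddef.h>

void libor_outer_vec(const float *rates_soa, /* [n_rates][n_paths], rate-major */
                     float *payoff,          /* [n_paths] */
                     size_t n_paths, size_t n_rates)
{
    #pragma omp simd
    for (size_t p = 0; p < n_paths; ++p) {          /* outer loop: paths */
        float acc = 1.0f;
        for (size_t r = 0; r < n_rates; ++r) {      /* inner loop: forward rates */
            /* SOA indexing: consecutive paths are adjacent in memory,
               so the vectorized path loop issues unit-stride loads. */
            acc *= 1.0f + rates_soa[r * n_paths + p];
        }
        payoff[p] = acc - 1.0f;                     /* placeholder payoff */
    }
}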
6. Complex 1D Convolution: We use an image with 12.8 million points and a kernel size of 8K. The first bar
in Figure 3 shows the performance achieved by the unrolling enabled by the compiler, which results in 1.4X scaling.
The auto-vectorizer only achieves a scaling of 1.1X. The
TLP achieved is 5.8X. In order to improve the SIMD performance, we rearrange the data from AOS to SOA format. As a result, the compiler produces efficient SSE code, and the performance scales up by a further 2.9X. Our overall performance is about 1.6X slower than the best-optimized numbers (due to the inability of the compiler to block the computation).
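A minimal sketch of the AOS-to-SOA change for complex data, assuming interleaved single-precision (re, im) pairs; the function and variable names are illustrative, not the benchmark's.

#include <stddef.h>

/* Split interleaved complex values (AOS) into separate real and
   imaginary arrays (SOA) so the compiler can emit unit-stride SSE
   loads instead of gathers. */
void split_complex_aos_to_soa(const float *aos, /* re0, im0, re1, im1, ... */
                              float *re, float *im, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        re[i] = aos[2 * i];
        im[i] = aos[2 * i + 1];
    }
}

/* One output point of the (correlation-form) complex convolution in
   SOA layout: both inner accesses are contiguous and vectorize cleanly.
   Inputs must provide at least i + klen valid elements. */
void conv_point(const float *xr, const float *xi,
                const float *kr, const float *ki,
                float *yr, float *yi, size_t i, size_t klen)
{
    float sr = 0.0f, si = 0.0f;
    for (size_t j = 0; j < klen; ++j) {
        sr += xr[i + j] * kr[j] - xi[i + j] * ki[j];
        si += xr[i + j] * ki[j] + xi[i + j] * kr[j];
    }
    yr[i] = sr;
    yi[i] = si;
}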
7. BlackScholes: BlackScholes computes the call and put
options together. The total computation is 200 ops, while the bandwidth required is 36 bytes. Figure 3 shows a SIMD speedup
of 1.1X using auto-vectorization. The low scaling is primarily
due to the AOS layout, which results in gather operations.
To improve performance, we changed the data layout from
AOS to SOA. As a result, the auto-vectorizer generated SVML
(short vector math library) code, increasing the scaling by a further 2.7X (total 3.0X). The net performance is within 1.1X of the best-performing code.
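The layout change amounts to giving each option parameter its own array; a hedged sketch follows, using the standard Black-Scholes call-price formula with erff for the normal CDF. Whether the vectorized math calls are lowered to SVML depends on the compiler (the paper's results use the Intel compiler); all names here are illustrative.

#include <math.h>
#include <stddef.h>

/* SOA layout: one contiguous array per option parameter, so the loop
   over options vectorizes and the math calls can be mapped to a
   vector math library. Link with -lm. */
void black_scholes_call(const float *S, const float *K, const float *T,
                        float *call, size_t n, float r, float sigma)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        float sqrtT = sqrtf(T[i]);
        float d1 = (logf(S[i] / K[i]) + (r + 0.5f * sigma * sigma) * T[i])
                   / (sigma * sqrtT);
        float d2 = d1 - sigma * sqrtT;
        float Nd1 = 0.5f * (1.0f + erff(d1 * 0.70710678f)); /* N(d1) */
        float Nd2 = 0.5f * (1.0f + erff(d2 * 0.70710678f)); /* N(d2) */
        call[i] = S[i] * Nd1 - K[i] * expf(-r * T[i]) * Nd2;
    }
}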
8. TreeSearch: The binary tree is laid out in a breadth-first fashion.
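The breadth-first layout stores the tree implicitly in an array, with the children of node i at 2i + 1 and 2i + 2, so each level is contiguous in memory. The sketch below shows only this indexing idea for a single query; the benchmark's actual node format and SIMD search are not reproduced here, and all names are illustrative.

#include <stddef.h>

/* Descend a complete binary search tree stored in level order
   (breadth-first): left child at 2*i + 1, right child at 2*i + 2.
   Returns the index of the leaf reached for the query key. */
size_t tree_search(const float *keys, size_t n_nodes, float query)
{
    size_t i = 0;
    while (2 * i + 1 < n_nodes)
        i = (query < keys[i]) ? 2 * i + 1 : 2 * i + 2;
    return i;
}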
2. BackProjection: We back-project 500 images of dimension 2048 × 2048 pixels onto a 1024 × 1024 × 1024 3D grid. Backprojection requires 80 ops per grid point. Both the image (16 MB) and the volume (4 GB) are too large to reside
in cache. Figure 2 shows that we get poor SIMD scaling of
1.2X from auto-vectorization. Moreover, parallel scaling
is only 1.8X. This is because the code is bandwidth-bound (1.6 bytes/flop). We perform blocking over the 3D volume to reduce bandwidth (3D blocking in Figure 2). Due to spatial locality, the image working set shrinks accordingly, and the code becomes compute-bound. However, due to gathers, which cannot be vectorized on the CPU, SIMD scaling only improved by an additional 1.6X (total 1.8X). We obtained an additional 4.4X thread scaling (total 7.9X), showing the benefits of SMT. The net performance is 1.1X off the best-optimized code.
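The 3D blocking mentioned above is loop tiling over the volume so that the image region touched by one tile stays in cache; a minimal sketch under that assumption follows, with tile sizes and the per-point update as placeholders rather than the benchmark's values.

#include <stddef.h>

#define BX 64
#define BY 64
#define BZ 64

/* Stand-in for the ~80-op per-grid-point backprojection update. */
static void update_point(float *vol, size_t idx) { vol[idx] += 1.0f; }

/* Tile the triple loop over the 3D volume; within a tile, the image
   working set shrinks and stays cache-resident. */
void blocked_volume_traversal(float *vol, size_t nx, size_t ny, size_t nz)
{
    for (size_t zb = 0; zb < nz; zb += BZ)
      for (size_t yb = 0; yb < ny; yb += BY)
        for (size_t xb = 0; xb < nx; xb += BX)
          for (size_t z = zb; z < zb + BZ && z < nz; ++z)
            for (size_t y = yb; y < yb + BY && y < ny; ++y)
              for (size_t x = xb; x < xb + BX && x < nx; ++x)
                update_point(vol, (z * ny + y) * nx + x);
}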
3. 7-Point 3D Stencil: The application iterates over a 3D grid of points and performs 8 flops of computation per point. A 3D dataset with 512 × 512 × 512 grid points is used. Figure 2 shows that we get a poor SIMD scaling of 1.8X from auto-vectorization (bandwidth bound). In order to improve the scaling, we perform both spatial and temporal blocking.17 The resultant code performs four time-steps simultaneously and improves the DLP by a further 1.7X (net SIMD scaling of 3.1X, lower than 4X due to the overhead of repeated computation on the boundary). The thread scaling is further boosted by 2.5X (overall 5.3X). The net performance is within 10.3% of the best-optimized code.
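For reference, a baseline (unblocked) 7-point stencil update is sketched below: each interior point is computed from itself and its six face neighbours, giving the 8 flops per point cited above. The spatial and temporal blocking wraps tile and time-step loops around this kernel; the coefficient names are illustrative.

#include <stddef.h>

void stencil_7pt(const float *in, float *out,
                 size_t nx, size_t ny, size_t nz, float c0, float c1)
{
    #define IDX(x, y, z) (((z) * ny + (y)) * nx + (x))
    for (size_t z = 1; z + 1 < nz; ++z)
      for (size_t y = 1; y + 1 < ny; ++y)
        for (size_t x = 1; x + 1 < nx; ++x)
          /* 2 multiplies + 6 additions = 8 flops per grid point */
          out[IDX(x, y, z)] =
              c0 * in[IDX(x, y, z)] +
              c1 * (in[IDX(x - 1, y, z)] + in[IDX(x + 1, y, z)] +
                    in[IDX(x, y - 1, z)] + in[IDX(x, y + 1, z)] +
                    in[IDX(x, y, z - 1)] + in[IDX(x, y, z + 1)]);
    #undef IDX
}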
4. Lattice Boltzmann Method (LBM): The computational
pattern is similar to the stencil kernel. We used a 256 × 256
× 256 dataset. Figure 2 shows that our initial code (SPEC CPU2006) does not achieve any SIMD scaling, and only 2.9X core scaling. The reason for the lack of SIMD scaling is the AOS data layout, which results in gather operations. In order to improve
performance, we perform the following two algorithmic
changes. Firstly, we perform an AOS to SOA conversion
of the data. The resultant auto-vectorized code improves
SIMD scaling to 1.65X. Secondly, we perform 3.5D blocking. The resultant code further boosts SIMD scaling by 1.3X,
achieving a net scaling of 2.2X. The resultant thread scaling was further increased by 1.95X (total 5.7X).
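The AOS-to-SOA conversion for LBM gives each distribution direction its own contiguous array instead of packing all of a cell's values together. A sketch of the two layouts and the conversion follows, assuming a D3Q19-style discretization with 19 directions; the direction count and names are illustrative, not taken from the SPEC source.

#include <stddef.h>

#define NDIR 19  /* assumed D3Q19 discretization */

/* AOS: all distribution values of a cell stored together -> strided
   access per direction, which forces gathers when vectorizing over cells. */
typedef struct { float f[NDIR]; } cell_aos;

/* SOA: one array per direction -> unit-stride access over cells. */
typedef struct { float *f[NDIR]; } lattice_soa;

void copy_aos_to_soa(const cell_aos *aos, lattice_soa *soa, size_t n_cells)
{
    for (size_t d = 0; d < NDIR; ++d)
        for (size_t i = 0; i < n_cells; ++i)
            soa->f[d][i] = aos[i].f[d];
}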
Figure 2. Breakdown of the Ninja Performance Gap in terms of Instruction (ILP), Task (TLP), and Data Level Parallelism (DLP) before and after algorithm changes for NBody, BackProjection, Stencil, and LBM. The algorithm change involves blocking.

Figure 3. Breakdown of the Ninja Gap for (a) TreeSearch and MergeSort and (b) 2D convolution and VR. The benchmarks in (a) require rethinking algorithms to be more SIMD-friendly, while those in (b) do not require any algorithmic changes.