4. ANALYSIS AND SUMMARY
In this section, we identify the steps taken to bridge the
Ninja gap with low programmer effort. The key steps are,
first, to perform a set of well-known algorithmic optimizations to overcome scaling bottlenecks, and second, to
use the latest compiler technology for vectorization and parallelization. We now summarize our findings with respect to
the gains achieved in each step.
4.1 Algorithmic Changes
Algorithmic changes do require programmer effort and
some insight, but are essential to avoid vectorization issues
and bandwidth bottlenecks. Figure 5 shows the performance improvements due to the set of well-known algorithmic
optimizations that we describe below.
AOS to SOA conversion: A common optimization that
helps prevent gathers and scatters in vectorized code is to
convert data structures from Array-Of-Structures (AOS)
to Structure-Of-Arrays (SOA). Separate arrays for each field
allow contiguous memory accesses when vectorization
is performed. AOS structures require gathers and scatters, which both reduce SIMD efficiency and introduce
extra bandwidth and latency for memory accesses. The
presence of a hardware gather/scatter mechanism does
not eliminate the need for this transformation; gather/scatter accesses commonly incur significantly higher bandwidth and latency than contiguous loads. Such transformations are advocated for a variety of architectures.19 Figure 5 shows that for our benchmarks, AOS to
SOA conversion helped by an average of 1.4X.
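As an illustration, the transformation can be sketched as follows; the struct fields and function names here are hypothetical, not the paper's benchmark code:

```c
#include <stddef.h>

/* Array-of-Structures: fields of consecutive bodies are interleaved,
   so a vectorized loop over the x field must gather every third float. */
typedef struct { float x, y, z; } BodyAOS;

/* Structure-of-Arrays: each field is stored contiguously, so a
   vectorized loop performs unit-stride loads instead of gathers. */
typedef struct { float *x, *y, *z; } BodiesSOA;

/* Copy n bodies from AOS to SOA layout (out arrays hold n floats each). */
void aos_to_soa(const BodyAOS *in, BodiesSOA *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```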
Blocking: Blocking is a well-known optimization that
can help avoid bandwidth bottlenecks in a number of applications. The key idea is to exploit the inherent data reuse
available in the application by ensuring that data remains
in caches across multiple uses, in both the spatial (1-D, 2-D,
or 3-D) and temporal domains.
In terms of code change, blocking involves a combination of loop splitting and interchange. The code snippet in
Figure 6a shows an example of blocking NBody code. There
are two loops (body1 and body2) iterating over all bodies.
The original code on the top streams through the entire
set of bodies in the inner loop, and must load the body2
value from memory in each iteration. The blocked code
instead iterates over body2 in cache-sized blocks, reusing
each block across the body1 iterations.
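Since the Figure 6a snippet is not reproduced here, the following is a minimal sketch of the transformation, with a stand-in accumulation in place of the actual force computation; the names and block size are hypothetical:

```c
enum { N = 64, BLOCK = 8 };  /* BLOCK sized so a block stays in cache */

/* Original structure: body2 streams over all N bodies per body1. */
void accumulate_naive(const float *pos, float *acc) {
    for (int i = 0; i < N; i++)            /* body1 */
        for (int j = 0; j < N; j++)        /* body2: streams all bodies */
            acc[i] += pos[j] - pos[i];     /* stand-in for the force */
}

/* Blocked: the body2 loop is split, and the block loop is interchanged
   outward, so each BLOCK of pos[] is reused across all body1 iterations
   while it is still resident in cache. */
void accumulate_blocked(const float *pos, float *acc) {
    for (int jb = 0; jb < N; jb += BLOCK)          /* split body2 loop */
        for (int i = 0; i < N; i++)                /* body1 (interchanged) */
            for (int j = jb; j < jb + BLOCK; j++)  /* body2 within block */
                acc[i] += pos[j] - pos[i];
}
```

Both versions perform the same additions in the same order for each body, so the results match exactly; only the memory access pattern changes.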
of 1.4X (Figure 4a). This is because it operates simultaneously on 4 queries, and each query may traverse down a
different path, resulting in a gather operation. In order to
improve performance, we performed an algorithmic change
and traversed 2 levels at a time (similar to SIMD width
blocking14). However, the compiler did not generate the described
code sequence, resulting in a 1.55X Ninja gap.
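As a rough scalar sketch of the "2 levels at a time" idea, assuming the traversal is over a binary search tree laid out as an array heap (children of node i at 2i+1 and 2i+2); this is an illustration, not the benchmark's code:

```c
/* Descend two tree levels per iteration: the first comparison picks a
   child, the second picks a grandchild, so two levels' worth of nodes
   can be fetched and compared together before the next (gathered)
   memory access. Returns the index of key in tree[], or -1. */
int find(const int *tree, int n, int key) {
    int i = 0;
    while (i < n) {
        if (tree[i] == key) return i;
        int c = 2 * i + 1 + (key > tree[i]);   /* level 1 */
        if (c >= n) return -1;                 /* ran off the tree */
        if (tree[c] == key) return c;
        i = 2 * c + 1 + (key > tree[c]);       /* level 2 */
    }
    return -1;  /* key not present */
}
```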
9. MergeSort: Our analysis is done for sorting an input
array of 256M elements. Figure 4a shows that we get
1.2X scaling from auto-vectorization. This is due to gather
operations for merging four pairs of lists. Parallel scaling
is only 4.1X because the last few merge phases are bandwidth bound. In order to improve performance, we perform
the following two algorithmic changes. First, we implement merging of lists using a merging network6 (code in
Section 4.1). Second, in order to reduce the bandwidth
requirement, we perform multiple merge phases together.
The parallel scaling of the resultant code further speeds up
by 1.9X. The resultant performance is within 1.3X of the
best-optimized code.
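A merging network replaces data-dependent branches with a fixed pattern of compare-exchange operations, each of which maps onto SIMD min/max. The following scalar sketch merges two sorted 4-element runs with a bitonic merge network; it illustrates the general idea and is not the paper's code:

```c
static void cmpswap(int *a, int *b) {
    if (*a > *b) { int t = *a; *a = *b; *b = t; }
}

/* Merge sorted runs a[4] and b[4] into out[8]. Concatenating a with
   the reverse of b yields a bitonic sequence, which the fixed
   three-stage network below sorts; in a SIMD implementation each
   stage becomes branch-free min/max operations. */
void merge_network_4x4(const int a[4], const int b[4], int out[8]) {
    for (int i = 0; i < 4; i++) { out[i] = a[i]; out[7 - i] = b[i]; }
    for (int stride = 4; stride >= 1; stride /= 2)
        for (int i = 0; i < 8; i++)
            if ((i & stride) == 0)               /* fixed pattern */
                cmpswap(&out[i], &out[i + stride]);
}
```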
10. 2D Convolution: We perform convolution of a 2K × 2K
image with a 5 × 5 kernel. The code consists of four loops.
Figure 4b shows that we obtained a benefit of 1.2X from loop
unrolling. We implemented the two inner loops using the
array notation technology, which enabled vectorization of the
outer loop; the code scaled 3.8X with SIMD width. Thread-level
parallel scaling was 6.2X. Our net performance was within
1.3X of the best-optimized code.
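The array notation referred to above is Intel Cilk Plus syntax; as a portable approximation, the sketch below writes the two 5 × 5 kernel loops with fixed bounds so a compiler can fully unroll them and vectorize the column loop. The function and parameter names are hypothetical, and boundary handling is omitted (valid-region output):

```c
/* Valid-region 5x5 convolution of an H x W image: out is
   (H-4) x (W-4). The ki/kj loops have constant trip counts and are
   candidates for full unrolling; the j loop then has unit-stride
   accesses and can be vectorized. */
void conv5x5(const float *img, int H, int W,
             float k[5][5], float *out) {
    for (int i = 0; i <= H - 5; i++)
        for (int j = 0; j <= W - 5; j++) {       /* vectorizable loop */
            float s = 0.0f;
            for (int ki = 0; ki < 5; ki++)
                for (int kj = 0; kj < 5; kj++)
                    s += k[ki][kj] * img[(i + ki) * W + (j + kj)];
            out[i * (W - 4) + j] = s;
        }
}
```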
11. Volume Rendering: As shown in Figure 4b, we achieve
a TLP scaling of 8.7X (including 1.5X from SMT). As far as DLP is
concerned, earlier compiler versions did not vectorize the
code due to its various control-intensive statements. Recent
compilers, however, vectorize the code by generating a mask
value for each branch and executing both paths of the branch
under the appropriate masks. Only a small 1.3X Ninja
performance gap remains.
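The mask-based execution described above can be sketched in scalar form as follows; in the vectorized code each lane gets its own mask and the blend becomes a SIMD select. The kernel here is a hypothetical stand-in, not the volume-rendering code:

```c
/* Branchy form:  if (opacity[i] > 0.5f) color[i] *= 2.0f;
 *                else                    color[i] *= 0.5f;
 * If-converted form: compute both paths, then blend by the mask, so
 * no data-dependent branch remains to block vectorization. */
void shade(const float *opacity, float *color, int n) {
    for (int i = 0; i < n; i++) {
        float mask  = (opacity[i] > 0.5f) ? 1.0f : 0.0f; /* per-element mask */
        float taken = color[i] * 2.0f;   /* "then" path, always computed */
        float other = color[i] * 0.5f;   /* "else" path, always computed */
        color[i] = mask * taken + (1.0f - mask) * other; /* blend */
    }
}
```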
Summary: In this section, we analyzed each benchmark
and reduced the Ninja gap to within 1.1–1.6X by applying the
necessary algorithmic changes coupled with the latest compiler technology.
Figure 5. Benefit of three different algorithmic changes to our
benchmarks normalized to code before any algorithmic change.
The effect of algorithmic changes is cumulative.
Figure 4. Breakdown of Ninja Performance Gap for Libor, Complex 1D
convolution, and BlackScholes. All benchmarks require AOS to SOA
conversion to obtain good SIMD scaling.