86 COMMUNICATIONS OF THE ACM | MAY 2015 | VOL. 58 | NO. 5
using low programmer effort. A previous version of this paper was published in Satish et al.23
Production compilers have recently started to support parallelization and vectorization technology that has been published in
compiler research. Examples of such technology include
OpenMP for parallelization available in recent GCC and
ICC compilers, as well as auto-vectorization technology,
dealing with alignment constraints and outer-loop vectorization. These technologies are exposed through straightforward pragmas and language extensions such as array
notations, a part of Intel® Cilk™ Plus.
However, naively written code may not scale even with compiler support, since it can be bottlenecked by architectural features such as memory bandwidth and gathers/scatters, or because the algorithm cannot be vectorized. In such cases, algorithmic changes such as blocking, structure-of-arrays (SOA) conversion, and SIMD-friendly algorithms are required.
Various techniques have been proposed to address these algorithmic changes, via compiler-assisted optimization, cache-oblivious algorithms, or specialized languages. Such changes usually require programmer intervention and effort, but carry over across
a number of architectures and generations. For instance, a
number of papers have shown the impact of similar algorithmic optimizations on GPUs.
While our work focuses on traditional programming
models, there have been radical programming model
changes proposed to bridge the gap. Recent suggestions
include Domain Specific Languages (a survey is available
at Fowler9), the Berkeley View project2 and OpenCL for
programming heterogeneous systems. There have also
been library oriented approaches proposed such as Intel®
Threading Building Blocks, Intel® Math Kernel Library,
Microsoft Parallel Patterns Library (PPL), etc. We believe these are orthogonal to, and can be used in conjunction with, traditional models.
There is also a body of literature adopting auto-tuning as an approach to bridging the gap.8, 25 Auto-tuned results can be significantly worse than the best-optimized code: for the auto-tuned stencil,8 our best-optimized code is 1.5X better in performance. Since our Ninja gap for stencil is only 1.1X, our compiled code performs about 1.3X (1.5/1.1) better than the auto-tuned
code. We expect our compiled results to be in general
competitive with autotuned results, while offering the
advantages of using standard tool-chains that can ease
portability across processor generations.
In this work, we showed that there is a large Ninja performance gap of 24X for a set of real-world throughput computing benchmarks on a recent multi-core processor. This gap, if left unaddressed, will inevitably increase. We showed how a set of simple and well-known algorithmic techniques, coupled with advancements in modern compiler technology, can bring the Ninja gap down to an average of just 1.3X. These changes require only low programming effort as compared to the very high effort in hand-optimized Ninja code.
Nadathur Satish, Mikhail Smelyanskiy,
and Pradeep Dubey, Parallel Computing
Lab, Intel Corp.
Changkyu Kim, Google Inc.
Jatin Chhugani, eBay Inc.
Hideki Saito, Rakesh Krishnaiyer, and
Milind Girkar, Intel Compiler Lab, Intel
© 2015 ACM 0001-0782/15/05 $15.00
1. Arora, N., Shringarpure, A., Vuduc, R. W.
Direct N-body Kernels for
multicore platforms. In ICPP
2. Asanovic, K., Bodik, R., Catanzaro, B.,
Gebis, J., Husbands, P., Keutzer, K.,
Patterson, D. A., Plishker, W.L., Shalf, J.,
et al. The Landscape of Parallel
Computing Research: A View from
Berkeley. Technical Report UCB/
3. Bienia, C., Kumar, S., Singh, J.P., Li, K.
The PARSEC benchmark suite:
Characterization and architectural
implications. In PACT (2008), 72–81.
4. Brace, A., Gatarek, D., Musiela, M.
The market model of interest rate
dynamics. Mathematical Finance 7, 2
5. Chen, Y.K., Chhugani, J., et al.
Convergence of recognition,
mining and synthesis workloads
and its implications. IEEE 96, 5
6. Chhugani, J., Nguyen, A. D., et al.
Efficient implementation of
sorting on multi-core SIMD CPU
architecture. PVLDB 1, 2 (2008),
7. Dally, W.J. The end of denial
architecture and the rise of
throughput computing. In Keynote
Speech at Design Automation
8. Datta, K. Auto-tuning Stencil Codes
for Cache-based Multicore Platforms.
PhD thesis, EECS Department,
University of California, Berkeley
9. Fowler, M. Domain Specific
Languages, 1st edn. Addison-Wesley
Professional, Boston, MA 2010.
10. Giles, M.B. Monte Carlo Evaluation of
Sensitivities in Computational Finance.
Technical report. Oxford University
Computing Laboratory, 2007.
11. Intel. A quick, easy and reliable way
to improve threaded performance,
12. Ismail, L., Guerchi, D. Performance
evaluation of convolution on the cell
broadband engine processor. IEEE
PDS 22, 2 (2011), 337–351.
13. Kachelrieß, M., Knaup, M.,
Bockenbach, O. Hyperfast perspective
cone-beam backprojection. IEEE
Nuclear Science 3, (2006), 1679–1683.
14. Kim, C., Chhugani, J., Satish, N., et al.
FAST: fast architecture sensitive tree
search on modern CPUs and GPUs.
In SIGMOD (2010), 339–350.
15. Lee, V. W., Kim, C., Chhugani, J.,
Deisher, M., Kim, D., Nguyen, A. D.,
Satish, N., et al. Debunking the 100X
GPU vs. CPU myth: An evaluation of
throughput computing on CPU and
GPU. In ISCA (2010), 451–460.
16. Mudge, T.N. Power: A first-class
architectural design constraint. IEEE
Computer 34, 4 (2001), 52–58.
17. Nguyen, A., Satish, N., et al. 3.5-D
blocking optimization for stencil
computations on modern CPUs and
GPUs. In SC10 (2010), 1–13.
18. Nuzman, D., Henderson, R. Multi-platform auto-vectorization. In CGO
19. Nvidia. CUDA C Best Practices Guide 3,
20. Podlozhnyuk, V. Black–Scholes option
pricing. Nvidia, 2007. http://developer.
21. Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S.,
Stone, S. S., Kirk, D.B., Hwu, W.M. W.
Optimization principles and
application performance evaluation
of a multithreaded GPU using CUDA.
In PPoPP (2008), 73–82.
22. Satish, N., Kim, C., Chhugani, J., et al.
Fast sort on CPUs and GPUs:
A case for bandwidth oblivious
SIMD sort. In SIGMOD (2010).
23. Satish, N., Kim, C., Chhugani, J., Saito, H.,
Krishnaiyer, R., Smelyanskiy, M.,
et al. Can traditional programming
bridge the Ninja performance
gap for parallel computing
applications? In ISCA (2012).
24. Smelyanskiy, M., Holmes, D., et al.
Mapping high-fidelity volume
rendering to CPU, GPU and many-core. IEEE TVCG 15, 6 (2009),
25. Sukop, M.C., Thorne, D. T., Jr. Lattice
Boltzmann Modeling: An Introduction
for Geoscientists and Engineers,
26. Tian, X., Saito, H., Girkar, M., Preis, S.,
Kozhukhov, S., Cherkasov, A. G.,
Nelson, C., Panchenko, N., Geva, R.
Compiling C/C++ SIMD extensions
for function and loop vectorization
on multicore-SIMD processors.
In IPDPS Workshops (Springer, NY,