Fast Fourier transform implements small 2D FFTs, based
on an algorithm adapted for GPUs.13 Small 2D tiled FFTs are
used in many image processing workloads. The strategy in this
implementation is to perform Cooley-Tukey FFTs on columns,
which is convenient for vectorizing along the rows. The data is
transposed repeatedly such that all of the FFTs are performed
on columns; after the FFT is complete, the data is transposed
back to the original orientation. Real FFTs are implemented
by computing pairs of columns of FFTs simultaneously.
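To make the strategy concrete, the sketch below is not the FFT implementation itself; it is a minimal illustration, with assumed function and variable names, of a combining pass applied to columns, an explicit transpose, and a schedule that vectorizes along x so that each vector lane handles a different column.

    #include "Halide.h"
    using namespace Halide;

    // Stand-in for one combining pass of a column FFT: each output sample
    // mixes two samples from the same column (fixed x, varying y).
    Func column_pass(Func f, int rows) {
        Var x("x"), y("y");
        Func g;
        g(x, y) = f(x, y) + f(x, (y + rows / 2) % rows);
        return g;
    }

    // Swap the two dimensions so the next pass again operates on columns.
    Func transpose2d(Func f) {
        Var x("x"), y("y");
        Func t;
        t(x, y) = f(y, x);
        return t;
    }

    Func column_fft_pattern(Func input, int size) {
        Func a = column_pass(input, size);          // pass over columns
        Func b = column_pass(transpose2d(a), size); // same pass over the former rows
        Func out = transpose2d(b);                  // back to original orientation

        // Vectorizing along x processes many columns per vector instruction.
        Var x("x");
        a.compute_root().vectorize(x, 8);
        b.compute_root().vectorize(x, 8);
        return out;
    }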
We compare the performance of our FFT implementation
to FFTW.11 For 16×16, 32×32, and 48×48 complex-to-complex
FFTs, the Halide FFT is about 1.3 times faster than FFTW. For
similarly sized real FFTs, Halide is between 1.5 and 2.1
times faster than FFTW on x86, and up to 4.6 times faster than
FFTW on ARM. For larger FFTs, the Halide FFT's performance
begins to drop relative to FFTW, becoming comparable to
FFTW at 64×64 and dropping further from there, as we have
not implemented a vectorization strategy more suitable for
these larger FFTs.
In addition to the raw performance, the Halide FFT has a
number of other advantages over FFT libraries. Because the
FFT is expressed in pure Halide, the operations being performed in the Fourier domain can be fused into the FFT itself,
improving locality. The implementation is also more easily
modified, for example to compute fixed-point or zero-padded
FFTs. Because the FFT is expressed at a high level, with the
optimizations expressed in the schedule, it can deliver state-of-the-art performance across a variety of platforms.
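As a hedged illustration of that fusion, the sketch below (with assumed names; spectrum stands in for the output of a forward transform stage and filter for a precomputed frequency response) schedules a pointwise Fourier-domain multiply so that the transform output is produced per tile of the consumer rather than in a separate pass over memory.

    #include "Halide.h"
    using namespace Halide;

    Func fuse_pointwise_into_transform(Func spectrum, Func filter) {
        Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

        // Pointwise operation in the Fourier domain.
        Func filtered;
        filtered(x, y) = spectrum(x, y) * filter(x, y);

        // Tile the consumer and compute the transform output per tile, so the
        // multiply reads it while it is still hot in cache.
        filtered.tile(x, y, xo, yo, xi, yi, 16, 16).vectorize(xi, 8);
        spectrum.compute_at(filtered, xo);
        return filtered;
    }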
Camera pipeline transforms the raw data recorded by a camera sensor into a photograph. It performs four tasks: hot pixel
suppression, demosaicking, color correction, and global tone
mapping (i.e., gamma correction and contrast enhancement).
These steps contain a variety of operations including stencils,
color space transformations, and table lookups. The demosaicking alone is a combination of 21 inter-dependent stencils.
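To give a flavor of these stages, here is a minimal Halide sketch of a hot-pixel-suppression stencil; the exact stencil and the names are assumptions for illustration, not the benchmarked pipeline's code.

    #include "Halide.h"
    using namespace Halide;

    // Clamp each raw sample to the maximum of its same-color neighbors two
    // pixels away, suppressing isolated hot pixels in the Bayer mosaic.
    Func hot_pixel_suppression(Func raw) {
        Var x("x"), y("y");
        Expr neighbor_max = max(max(raw(x - 2, y), raw(x + 2, y)),
                                max(raw(x, y - 2), raw(x, y + 2)));
        Func denoised;
        denoised(x, y) = clamp(raw(x, y), 0, neighbor_max);
        return denoised;
    }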
The reference comparison is a single carefully tiled and
fused loop nest from the Frankencamera, expressed in 463
lines of C++ with ARM NEON intrinsics and inline assembly.1
All producer-consumer communication is staged through
scratch buffers. The Halide implementation is 170 lines describing 32 functions and 22 different stencils, literally translated from the pseudocode in the comments explaining the
original source. With 70 lines of combined schedule code, the
Halide algorithm can be compiled to target 32- and 64-bit x86
and ARM, is trivially parallelized across CPU cores, and can
also be compiled to run on a Qualcomm Hexagon DSP.
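To suggest what such schedule code looks like (the stage names below are hypothetical stand-ins, not the pipeline's real Funcs), a few lines suffice to parallelize across cores and pick a vector width appropriate to whichever target the pipeline is compiled for.

    #include "Halide.h"
    using namespace Halide;

    // `curved` stands in for the final tone-mapped stage, `demosaiced` for one
    // of its producers.
    void schedule_for_cpu(Func curved, Func demosaiced) {
        Var x("x"), y("y");
        Target t = get_target_from_environment();
        int vec = t.natural_vector_size<uint16_t>();  // e.g., 8 for 128-bit SIMD

        curved.parallel(y)               // spread output rows across cores
              .vectorize(x, vec);        // use the target's natural vector width
        demosaiced.compute_at(curved, y) // compute producers per row of output
                  .vectorize(x, vec);
    }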
We benchmarked this workload on a Google Pixel smartphone, which contains a Hexagon 680 DSP with HVX
(Hexagon Vector Extensions), a standard feature of the popular Snapdragon 820 SoC. The Halide CPU code performs
almost identically to the ARM reference implementation on a single core. Using all four CPU cores, the Halide implementation is 2.7 times faster than the single-core reference implementation. (The CPU cores on this processor are asymmetric, with a theoretical linear speedup of 3.5 times across all four.)
The Halide Hexagon implementation performs similarly to the four-core CPU implementation, which is in line with the theoretical throughput of the two HVX processing clusters.
Bilateral grid is an efficient implementation of the bilateral filter, which smooths an image while preserving its main edges. It first scatters the image data into a coarse 3D grid, effectively building a windowed histogram in each column of the grid, then blurs the grid along each of its axes with 5-point stencils. Finally, the output image is constructed by trilinear interpolation within the grid at locations determined by the input image.
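A minimal sketch of the grid-construction (scatter) step, assuming a single-channel floating-point input, follows; it traces the shape of the published algorithm but is not the benchmarked code.

    #include "Halide.h"
    using namespace Halide;

    // Build a windowed histogram over s_sigma x s_sigma blocks of the image:
    // channel 0 accumulates intensity, channel 1 counts samples, with samples
    // binned by intensity along z.
    Func build_grid(ImageParam input, int s_sigma, float r_sigma) {
        Var x("x"), y("y"), z("z"), c("c");
        Func clamped = BoundaryConditions::repeat_edge(input);

        RDom r(0, s_sigma, 0, s_sigma);
        Expr val = clamped(x * s_sigma + r.x, y * s_sigma + r.y);
        Expr zi = cast<int>(val * (1.0f / r_sigma) + 0.5f);

        Func histogram;
        histogram(x, y, z, c) = 0.0f;
        histogram(x, y, zi, c) += select(c == 0, val, 1.0f);
        return histogram;
    }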
The CPU reference code is a tuned but clean implementation from the original authors in 122 lines of C++. It is partially autovectorized by GCC, but is nontrivial to multithread (a naïve OpenMP parallelization of major stages results in a slowdown on our benchmark CPU), so the reference is single-threaded. The Halide algorithm is 34 lines, and compiles
to an implementation 11 times faster than the original. The
speedup comes from a combination of parallelism, tile-level
fusion of some stages, and careful reordering of dimensions
to control parallel grain size in the grid.
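For example, a couple of schedule lines of the kind involved (with a hypothetical stage name standing in for one of the grid blurs): reordering keeps the small z and c loops innermost, while parallelizing over y keeps each parallel task coarse enough to amortize scheduling overhead.

    #include "Halide.h"
    using namespace Halide;

    // `blurz` stands in for one of the 4D (x, y, z, c) grid blur stages.
    void schedule_grid_blur(Func blurz) {
        Var x("x"), y("y"), z("z"), c("c");
        blurz.compute_root()
             .reorder(c, z, x, y)   // small c and z loops innermost
             .parallel(y)           // coarse-grained parallelism over y
             .vectorize(x, 8);
    }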
We also compared the Halide version to a hand-tuned GPU
implementation from the original authors, written in 370 lines
of CUDA code. The same Halide algorithm, paired with a different schedule, was 17% faster than the hand-written CUDA, at a fraction of the total code size. The Halide compiler generates
similar GPU code to the reference, but with Halide we quickly
found a new point in the schedule space. It sacrifices some parallelism in the grid construction step to reduce synchronization overhead, and uses a tiled fusion strategy which passes
intermediate results through GPU scratchpad memory to
improve locality through the blur steps at the expense of redundant computation. These tradeoffs were counter-intuitive to
the original author, and much harder to express in CUDA, but
are easily described by our schedule representation.
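A rough sketch of that kind of schedule, with hypothetical stage names rather than the benchmarked ones: the consumer is tiled over GPU thread blocks, and a producer is computed per block so its intermediate values can be staged in on-chip scratchpad memory, recomputed at tile edges instead of round-tripping through DRAM.

    #include "Halide.h"
    using namespace Halide;

    // `blurred` consumes `grid`; both are placeholders for real pipeline stages.
    void schedule_for_gpu(Func grid, Func blurred) {
        Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

        // Each 32x32 tile of the output becomes one GPU thread block.
        blurred.compute_root()
               .gpu_tile(x, y, xo, yo, xi, yi, 32, 32);

        // Compute the producer per block, so it can live in fast on-chip
        // memory; tiles overlap, trading redundant work for locality.
        grid.compute_at(blurred, xo)
            .gpu_threads(x, y);
    }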
Local Laplacian filters uses a multi-scale approach to tone
map images and enhance local contrast in an edge-respecting
fashion.2 It is used in the clarity, tone mapping, and other filters in Adobe Photoshop and Lightroom. The algorithm
builds and manipulates several image pyramids. The filter
output is produced by a data-dependent resampling from the
processed pyramids. With the parameters we used, the pipeline contains 99 different stages, operating at many scales,
and with several different computational patterns.
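One pyramid level takes only a few lines of Halide. The sketch below is a separable [1 3 3 1]/8 downsample of the kind used to build such pyramids, shown for illustration rather than as the benchmarked code.

    #include "Halide.h"
    using namespace Halide;

    // One Gaussian pyramid level: low-pass filter and subsample by two,
    // first along x, then along y.
    Func downsample(Func f) {
        Var x("x"), y("y");
        Func downx, downy;
        downx(x, y) = (f(2*x - 1, y) + 3.0f * (f(2*x, y) + f(2*x + 1, y)) + f(2*x + 2, y)) / 8.0f;
        downy(x, y) = (downx(x, 2*y - 1) + 3.0f * (downx(x, 2*y) + downx(x, 2*y + 1)) + downx(x, 2*y + 2)) / 8.0f;
        return downy;
    }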
The reference implementation is 262 lines of C++, developed at Adobe. It is carefully parallelized with OpenMP, and offloads the most intensive kernels to tuned assembly routines from Intel Integrated Performance Primitives (IPP). It has very similar performance to a version deployed in their products, which took several months to develop, including 2–3 weeks dedicated to optimization. It is 10 times faster than an algorithmically identical reference version written by the authors in pure C++, without IPP or OpenMP.
The Halide version was written in one day, in 52 lines of code.
It compiles to an implementation which is 2.3 times faster than the highly optimized expert implementation (at least 20 times faster than the clean C++ without IPP and OpenMP). The resulting schedule is complex, mixing different fusion, tiling, vectorization, and multithreading strategies throughout the 99-stage graph. In C, it would correspond to hundreds of loops over thousands of lines of code, but in Halide it is just 11 lines.
The same program compiles with a different schedule to
a hybrid CPU/GPU program with 23 unique GPU kernels,
each representing a different subset of the overall graph. It
runs 9.1 times faster than the hand-tuned Adobe implementation, and is four times faster than the best parallel and vectorized implementation on the CPU.