Most compiler optimizations for numerical programs are
based on loop analysis and transformation, including auto-vectorization, loop interchange, fusion, and tiling. 3 The polyhedral model is a powerful tool for modeling and
transforming looping imperative programs. 10 Halide’s model
considers only axis-aligned bounding regions, not general
polytopes—a practical simplification for image processing
and many other applications. Interval analysis is simpler
than modern polyhedral analysis, but it can effectively analyze through a wide range of expressions, and it is trivial to
generate high quality loop nests from intervals. This simple
generality is essential for Halide’s design in which all bounds
are inferred from context. Most traditional loop optimizations also do not consider recomputation of values, but in
image processing this can be a large performance win and is
central to the choices we consider during optimization.
Many data-parallel languages have been proposed. Chapel
and its predecessors focus on computations over regions of
regular grids, much like Halide, with explicit primitives to
control distribution of grid regions across memory and processors. 6 Popular today, CUDA and OpenCL expose an imperative, single program-multiple data programming model which can
target both GPUs and multicore CPUs with SIMD units. 5, 19 Like C,
these languages allow the specification of very high performance
implementations for many algorithms, but because their semantics closely model the underlying machine, they also deeply
conflate algorithms with optimization.
Streaming languages encode data and task parallelism in
graphs of kernels, which compilers automatically schedule
using tiling, fusion, fission, and 1D stencil optimization. 12
Halide’s model of scheduling addresses the problem of multidimensional stencils with parallelism, where recomputation versus storage becomes a critical but complex choice.
A separate line of research creates explicit languages for
choices of how problems are mapped into physical execution,
much like Halide’s decoupling of schedules from algorithms.
SPIRAL uses a domain-specific language to specify linear signal processing operations at the level of mathematical operators, 20 and a separate algebra of rewrite rules describes ways
these operations can be turned into efficient code for a particular architecture. Sequoia defines a model where a user-defined “mapping file” describes how to execute tasks on a
tree-like memory hierarchy. 9 The CHiLL framework exposes
polyhedral program transformations and code generation
choices via a dedicated scripting language. 24
7. CONCLUSION AND PERSPECTIVE
Our initial hope in designing Halide was to allow expert-level
manual control over the organization of computation in
image processing code, which was at once far more control-
lable and powerful than traditional compiler flags and auto-
vectorization, and far more productive than hand coding in
low-level languages. After more than 4 years in production,
we found this works even better than we first hoped. Across a
range of applications and target architectures, we find that
Halide’s scheduling representation is able to model, and its
compiler is able to generate, implementations which deliver
state-of-the-art performance from surprisingly little code.
This performance comes from careful navigation of the high
dimensional space of tradeoffs between locality, parallelism,
and redundant recomputation in image processing pipelines.
Expressing these tradeoffs in traditional languages is challeng-
ing enough, as shown by the much greater complexity of hand-
written implementations, but finding the ideal balance is
daunting when each change a programmer might want to try can
require completely rewriting a complex loop nest hundreds of
lines long. The performance advantage of the Halide implemen-
tations is a direct result of simply making it easy to test many
more points in the space than a human programmer could manu-
ally describe using explicit loops. We have found this design to
generalize well from its initial CPU and GPU targets to new, more
specialized (and notoriously hard-to-program) image processing
accelerators and DSPs. We think it helps point the way towards
the “OpenGL” of future specialized image processing hardware.
Still, while Halide’s model of schedules is powerful and
productive in the hands of experts, we have found it challenging for novices and those unfamiliar with high performance
programming to master. Even for experts, optimized schedules grow difficult to maintain for algorithms beyond moderately sized blocks, and there’s no complete answer for what
should be done if those blocks are meant to be reused in different places or get moved around a larger pipeline (as with a
library of reusable components). Fundamentally, schedules
describe how to interleave computation across composition
boundaries. Decoupling them from the fundamental algorithm gives simpler, more modular algorithm code, but
modularizing scheduling choices without sacrificing performance remains an open problem. This led to our attempts to
automate the process of scheduling in Halide, 17, 23 which have
shown promise but remain ongoing work. An automatic
scheduling system could both remove the burden from novice programmers, and schedule globally across composition
boundaries even in large applications.
Finally, we have also found that, while the constraints and
style of the Halide language were motivated by examples in
modern image processing, the programming model is significantly more general: it expresses arbitrary bounded computations over multidimensional regular grids, and it has been
applied to algorithms in linear algebra, scientific simulation,
and machine learning. Really, Halide should be thought of as
specific to the “data structure” of regular grids or multidimensional arrays, not to the “domain” of image processing.
As an open source project, Halide has received contributions from many people. Most notably, Zalman Stern, Steven
Johnson, and Patricia Suriana are full-time developers on
the project at Google and are responsible for a large amount
of the current code. The Hexagon backend was developed
with Pranav Bhandarkar, Ron Lieberman, Dan Palermo, and
Anshuman Dasgupta at the Qualcomm Innovation Center.
Eric Chan provided feedback and inspiration throughout
the original design of Halide. This work was supported by
DOE Award DE-SC0005288, NSF grant 0964004, grants from
Intel and Quanta, and gifts from Cognex and Adobe.