for several applications, including some presented above.
To converge in reasonable time on real programs, we
found that higher-level heuristics are critical to guide the
search, because of the high cost of compiling and running
each new test. Drawing on this experience, we have since
found that using a simple cost model in place of recompiling
and benchmarking each configuration, and restricting the
search to greedily consider variants of common interleaving
patterns, gave a much faster and more scalable automatic
scheduling method. 17 The cost model is based directly on the
tradeoff we observe between parallelism, locality, and redundant work. The tool is also easier to use since it directly analyzes the algorithm, rather than requiring a runnable
benchmark program and data set.
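The flavor of this approach can be conveyed with a toy sketch (this is *not* Halide's actual cost model or auto-scheduler; the choice names, stage names, and weights are invented for illustration): each producer stage is greedily assigned the interleaving whose estimated cost is lowest, where the cost estimate balances redundant work against memory traffic rather than benchmarking anything.

```python
# Illustrative sketch only -- not Halide's real auto-scheduler. A greedy
# search scores candidate interleavings with a simple analytic cost model
# instead of compiling and benchmarking each configuration.

# Hypothetical interleaving choices for a producer stage, mirroring the
# parallelism/locality/redundant-work tradeoff described in the text.
CHOICES = ("inline", "compute_at_tile", "compute_root")

def cost(choice, stencil_width, tile_size, image_size):
    """Toy cost model: estimated work plus a memory-traffic penalty."""
    if choice == "inline":
        # Recomputes the producer at every consumer use: best locality,
        # but redundant work grows with the stencil width.
        return image_size * stencil_width
    if choice == "compute_at_tile":
        # Computes per tile: a small redundant halo, good locality.
        halo = stencil_width - 1
        tiles = image_size / tile_size
        return image_size + tiles * halo + 0.1 * image_size
    # compute_root: no redundant work, but poor locality, so we charge
    # an extra full pass of memory traffic for the intermediate.
    return image_size + 1.0 * image_size

def greedy_schedule(stages, tile_size=64, image_size=1 << 20):
    """Greedily pick the cheapest interleaving for each stage."""
    return {stage: min(CHOICES,
                       key=lambda c: cost(c, width, tile_size, image_size))
            for stage, width in stages.items()}

# Wide stencils favor tile-level interleaving; pointwise stages inline.
print(greedy_schedule({"blur_x": 3, "blur_y": 3, "sharpen": 1}))
```

Because each candidate is scored analytically, evaluating thousands of interleavings costs microseconds rather than a compile-and-run cycle each, which is what makes this style of search fast and scalable.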
6. RELATED WORK
Graphics & image processing languages
Domain-specific languages for image processing go back at least
as far as Bell Labs’ Pico and POPI. 16 Most prior image processing
languages have focused on efficient expression of individual kernels. Pan introduced a functional model for image processing
much like our own, in which images are functions from coordinates to values. 8 Recently, others have demonstrated automatic
optimization of image processing code in a subset of Halide's
algorithm model using the polyhedral model. 18
Elsewhere in graphics, the real-time graphics pipeline has
been a hugely successful abstraction precisely because the
schedule is separated from the specification of the shaders. 4 This
allows GPUs and drivers to efficiently execute a wide range of
programs with little programmer control over parallelism and
memory management. This separation of concerns is extremely
effective, but it is specific to the design of a single pipeline.
throughput using up to 10 times less power.
Porting the camera pipeline to Hexagon required modifying the schedule to use a sliding window on the rows of the image, instead of tiling. Hexagon has extremely large vectors and benefits heavily from loop pipelining, so 2D tiles large enough to be efficient are too large to fit in cache.
Halide allowed us to easily rewrite the camera pipeline from
a tiled schedule to a sliding window schedule, without
changing the algorithm description. In contrast, rewriting
the handwritten ARM implementation to use a sliding window would be a serious undertaking, as the tiling logic is
inextricably linked to the algorithm description, all the way
into the indexing arithmetic in the assembly inner loops.
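To make the contrast concrete, here is a minimal plain-Python sketch (not Halide code; the two-stage pipeline, `producer`, and the 3-wide stencil are invented for illustration) of the sliding-window strategy: only a three-row circular window of the intermediate stage stays live while output rows are produced, instead of a full 2D tile of intermediate storage.

```python
# Illustrative sketch only -- a two-stage pipeline (horizontal blur,
# then vertical blur) scheduled with a sliding window over rows.
# At most 3 rows of the intermediate stage are ever buffered.

W, H = 8, 6  # toy image dimensions

def producer(y):
    # First stage: horizontal 3-tap blur of a synthetic input row,
    # with clamped boundary handling.
    row = [(x + y) % 7 for x in range(W)]
    return [(row[max(x - 1, 0)] + row[x] + row[min(x + 1, W - 1)]) // 3
            for x in range(W)]

def sliding_window_pipeline():
    window = {}  # y -> buffered producer row; holds at most 3 rows
    out = []
    for y in range(H):
        # Ensure the 3 producer rows the consumer needs are buffered.
        for yy in (max(y - 1, 0), y, min(y + 1, H - 1)):
            if yy not in window:
                window[yy] = producer(yy)
        # Second stage: vertical 3-tap blur over the buffered rows.
        a = window[max(y - 1, 0)]
        b = window[y]
        c = window[min(y + 1, H - 1)]
        out.append([(a[x] + b[x] + c[x]) // 3 for x in range(W)])
        window.pop(y - 1, None)  # slide: drop the row no longer needed
        assert len(window) <= 3  # intermediate storage stays bounded
    return out

rows = sliding_window_pipeline()  # same output, O(3 * W) intermediate storage
```

The point of the sketch is that switching between this interleaving and a tiled one changes only the loop structure and buffering, not the two stage definitions, which is exactly the separation the schedule rewrite on Hexagon exploited.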
Scheduling new algorithms
To better understand how long it takes an expert Halide developer to schedule an algorithm when starting from scratch, we
recruited two professional Halide developers (authors on this
paper) into a scheduling “race.” They selected three new, non-trivial applications they had never scheduled before (
lensblur, nlmeans, and maxfilter) and implemented the
original Halide algorithm for these programs. For each benchmark, each programmer independently developed a schedule
in a single programming session. The programmer stopped
optimizing after converging on a solution they considered their
reasonable best. While developing the schedules, the developers documented their progress by measuring the performance of their current schedule at various points in the session.
In each case, the developers started from relatively simple baseline schedules, which generally still included at least SIMD vectorization and multicore parallelism. Nonetheless, the speedup from this baseline to their best optimized performance was always severalfold.
Results of the race are shown in Figure 4. The X-axis in each
of the graphs indicates development time (in minutes) for the
schedules. The Y-axis shows the performance of the benchmark
(measured as pixel throughput, so higher is better). The yellow
and gray lines each correspond to one of the programmers.
Arriving at a good schedule requires significant optimization effort, even for experts. For these applications, it took
on the order of an hour of iteration and experimentation for
the developers to feel their optimization had reasonably
converged. The resulting schedules, however, are only a few
lines of code, and dramatically simpler than equivalent code
in explicit C++ loops.
Automatically determining good schedules
To further accelerate development of high-performance
code, and to make Halide accessible to programmers without optimization experience, we have also worked on ways to
automatically optimize Halide programs.
In the original version of this paper we demonstrated a
prototype autotuner which automatically searched the
(exponentially large) space of schedules for a program using
stochastic search with a genetic algorithm. 23 For each generated schedule, the autotuner benchmarked its actual runtime by compiling and running a complete test program
given by the user. In a few hours to days of tuning, it was able
to generate schedules competitive with hand-tuned results
[Figure 4, "Optimization of manually authored schedules": three graphs, one per benchmark, plotting performance against schedule development time (minutes), with one line per programmer (Programmer 1, Programmer 2).]
Figure 4. Two professional Halide developers were tasked with
developing schedules for three new programs. The graphs plot the performance of their schedules over time as they developed them.
The developers converged to what they considered “reasonable”
performance in on the order of an hour per program.