3.1 Scheduling within a function
The order of evaluation of the points within a function is
defined by a family of common transformations applied to a
default (sequential, row-major) loop nest over the grid of
points in the function’s domain. Loop dimensions can be
split, merged, and reordered. Because the regions computed
are simple intervals (axis-aligned bounding boxes within the
grid), the result is always a perfect loop nest. In addition, the
resulting loops can be unrolled, or mapped to parallel
threads, SIMD vector lanes, and GPU kernel dimensions.
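The loop transformations above can be made concrete with a small simulation. The following is an illustrative Python sketch (not Halide code; the helper names are ours): splitting both dimensions of a default row-major nest and reordering the resulting loops yields the classic tiled traversal, which visits exactly the same set of points in a different order.

```python
def default_order(width, height):
    """Default sequential, row-major traversal of the grid."""
    return [(x, y) for y in range(height) for x in range(width)]

def tiled_order(width, height, tx, ty):
    """Split x and y by factors tx, ty, then reorder to (xi, yi, xo, yo):
    a tiled traversal. For simplicity, assumes tx, ty divide the extents."""
    order = []
    for yo in range(height // ty):      # outer, per-tile loops
        for xo in range(width // tx):
            for yi in range(ty):        # inner, intra-tile loops
                for xi in range(tx):
                    order.append((xo * tx + xi, yo * ty + yi))
    return order

# Both traversals cover exactly the same set of points; only the
# order (and hence locality and parallelism) differs.
assert set(default_order(8, 8)) == set(tiled_order(8, 8, 4, 4))
assert default_order(8, 8) != tiled_order(8, 8, 4, 4)
```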
The dimensions of the storage layout for a function’s
results can similarly be reordered, allowing common transformations such as column- versus row-major layouts.
Storage can also be folded across a dimension into a circular
buffer of a fixed size.
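Storage folding can be sketched the same way. The following hypothetical Python model (our names, not Halide's API) keeps only `fold` producer rows live at once by indexing storage modulo the fold factor, as a circular buffer does:

```python
def fold_rows(producer_row, height, fold):
    """Stream `height` rows through a circular buffer of `fold` slots;
    row y lives in slot y % fold, overwriting rows no longer needed.
    Returns the buffer's final contents as a demonstration."""
    buf = [None] * fold
    for y in range(height):
        buf[y % fold] = producer_row(y)  # only `fold` rows live at a time
    return buf

# After streaming 8 rows through a 4-row buffer, slot i holds row 4+i.
rows = fold_rows(lambda y: [y] * 4, height=8, fold=4)
assert rows == [[4] * 4, [5] * 4, [6] * 4, [7] * 4]
```

This is legal only when consumers never reach back further than `fold` rows, which is exactly the condition Halide checks before folding storage.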
The dimensions of an RDom in an update step may also be
reordered, parallelized, etc., but only if the compiler can prove
that different update iterations across those dimensions do
not interact. Recent work has added support for splitting associative reductions to create parallel reduction trees.25
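As a sketch of why associativity enables such splitting (this models the idea only, not Halide's actual transformation): a single sequential chain of additions can be cut into independent partial reductions, each parallelizable, followed by a small combining stage.

```python
def serial_sum(values):
    """The original update: one sequential chain of dependent additions."""
    acc = 0
    for v in values:
        acc = acc + v
    return acc

def split_sum(values, chunks):
    """Split the reduction domain into `chunks` independent pieces.
    The partial sums have no dependence on one another, so the first
    stage could run in parallel; a short reduction combines them.
    Associativity guarantees the result matches the serial chain."""
    n = len(values)
    step = (n + chunks - 1) // chunks
    partials = [sum(values[i:i + step]) for i in range(0, n, step)]
    return sum(partials)  # combine stage

data = list(range(256))
assert split_sum(data, 8) == serial_sum(data) == 255 * 256 // 2
```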
3.2 Scheduling across functions
The most distinctive part of Halide's model of schedules is how
it transforms computation and storage across functions.
Consider the simple two-stage blur algorithm. When
scheduling across these two stages, we call the first stage,
bh, the "producer," and the second stage, bv, its "consumer."
So far, the scheduled order of computation within
each function (discussed above) defines a perfect loop
nest for each function. For example, the default schedule
for the output stage bv gives a simple row-major loop nest:
for bv.y in bv.min.y to bv.max.y:
  for bv.x in bv.min.x to bv.max.x:
    compute and store bv(bv.x, bv.y)
The schedule across functions specifies at what level in this
loop nest the values of the producer bh should be computed and stored.
[Code from the histogram-equalization example: an RDom update defining a running sum.]
RDom ri(0, 255);
cdf(i) = 0;                             // initial value
cdf(ri) = cdf(ri - 1) + histogram(ri);  // update
out(x, y) = cdf(in(x, y));
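The cdf update above is a serial scan: each iteration reads the value written by the previous one, so this reduction dimension cannot legally be parallelized. An illustrative Python simulation (our helper name; the out-of-range read at ri = 0 falls back to the pure definition's initial value of 0):

```python
def compute_cdf(histogram):
    """Sequential scan over the reduction domain ri = 0..bins-1,
    mirroring cdf(ri) = cdf(ri - 1) + histogram(ri)."""
    bins = len(histogram)
    cdf = [0] * bins                      # cdf(i) = 0 (initial value)
    for ri in range(bins):                # strictly sequential update
        prev = cdf[ri - 1] if ri > 0 else 0
        cdf[ri] = prev + histogram[ri]    # depends on the prior iteration
    return cdf

# A flat histogram yields a linear running total.
assert compute_cdf([1] * 4) == [1, 2, 3, 4]
```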
3. SCHEDULING IMAGE PROCESSING PIPELINES
A complete Halide algorithm is a DAG of functions over regular grids. Actually evaluating an algorithm requires iterating over and computing all of the required points in each
function. But in what order should we compute these
points? And where should we store and load the results of
intermediate stages to communicate them between stages?
Unlike a looping program in a traditional language, the
Halide algorithm does not specify these choices. Instead,
they’re specified by a separate schedule.
For each stage, we think about these choices in four parts:
1. In what order should we iterate over the points in that stage?
2. In what order should we lay out those points in memory to store their results?
3. At what granularity should we interleave the computation of that stage with the computation of the downstream stages which consume its results?
4. At what granularity should we store blocks of the function for reuse across iterations of its consumers?
These choices affect performance through locality (e.g.,
cache behavior), exposed parallelism, and the need for
recomputation (at grain boundaries). Together, they specify a set of nested loops (invisible to the programmer) that
our compiler uses to generate executable code. Scheduling
instructions are concise and can be written independently of the definition of the algorithm.
[Figure 2 (diagram): within-stage orders (column-major, row-major, tiled, vectorized, parallel tiles) and three granularities of interleaving across the stages in, bh, and bv: coarsest (whole images), moderate (per-tile), and finest (per-pixel).]
Figure 2. A Halide schedule specifies the order of computation within each stage in an image processing algorithm (top, discussed in
Sec. 3.1), and the granularity of interleaving across producer and consumer stages (bottom, discussed in Sec. 3.2). In the two-stage blur
algorithm, coarse-grained interleaving (bottom-left) computes whole stages at a time, sacrificing producer-consumer locality. Finer-grained
interleaving improves locality but introduces redundant computation where the stencils for the grains overlap (denoted by the hatched regions
in the bottom-middle). In the extreme case of per-pixel interleaving (bottom-right), values are consumed as soon as they are
computed (maximizing locality), but all values of the producer stages are redundantly recomputed in each place they are reused (here,
3 times in bh and 9 times transitively in in). Whereas this shows a single order and granularity choice applied uniformly throughout a
simple pipeline, all of these choices can be made separately for each function in a whole algorithm.
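The two interleaving extremes for the blur can be sketched in a small Python model (not Halide; the clamp-to-edge boundary and unnormalized 3-tap sums are our simplifications for brevity). Both schedules produce identical results; only the count of bh evaluations differs, which is exactly the locality-versus-recomputation trade-off above.

```python
def blur(image, w, h, inline):
    """Two-stage blur: bh is a 3-tap horizontal sum of `image`,
    bv a 3-tap vertical sum of bh. `inline` selects per-pixel
    interleaving (recompute bh on demand) vs whole-stage computation.
    Returns (bv, number of bh evaluations)."""
    calls = [0]

    def in_(x, y):  # clamp-to-edge boundary (an assumption, for brevity)
        return image[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    def bh(x, y):
        calls[0] += 1
        return in_(x - 1, y) + in_(x, y) + in_(x + 1, y)

    if inline:
        # Finest granularity: recompute bh wherever bv needs it.
        bv = [[bh(x, y - 1) + bh(x, y) + bh(x, y + 1)
               for x in range(w)] for y in range(h)]
    else:
        # Coarsest granularity: compute all of bh, then all of bv.
        # bh_buf[i] holds bh row y = i - 1, covering y = -1 .. h.
        bh_buf = [[bh(x, y) for x in range(w)] for y in range(-1, h + 1)]
        bv = [[bh_buf[y][x] + bh_buf[y + 1][x] + bh_buf[y + 2][x]
               for x in range(w)] for y in range(h)]
    return bv, calls[0]

img = [[x * y for x in range(8)] for y in range(8)]
coarse, n_coarse = blur(img, 8, 8, inline=False)
fine, n_fine = blur(img, 8, 8, inline=True)
assert coarse == fine           # same algorithm result under both schedules
assert n_coarse == 8 * (8 + 2)  # each bh value computed exactly once
assert n_fine == 3 * 8 * 8      # each bv pixel recomputes 3 bh values
```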