bv.split(x, tx, xi, tile_size)
  .split(y, ty, yi, tile_size)
  .reorder(xi, yi, tx, ty)
which can be abbreviated simply as:
bv.tile(x, y, tx, ty, xi, yi, tile_size, tile_size)
This generates the loop nest:
// simplified to ignore non-evenly divisible tile sizes
for bv.ty in bv.min.y/tile_size to bv.max.y/tile_size:
  for bv.tx in bv.min.x/tile_size to bv.max.x/tile_size:
    for bv.yi in 0 to tile_size - 1:
      for bv.xi in 0 to tile_size - 1:
        compute bv(bv.tx*tile_size + bv.xi,
                   bv.ty*tile_size + bv.yi)
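For concreteness, here is a minimal sketch of how this tiled schedule attaches to the two-stage blur when written against Halide's C++ front end. The Func and Var names mirror the text, but the input size, the repeat_edge boundary handling, the uint16 element type, and the tile size of 32 are illustrative assumptions rather than details from the paper.

#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y"), tx("tx"), ty("ty"), xi("xi"), yi("yi");

    // Placeholder 256x256 input; repeat_edge clamps accesses such as in(x - 1, y).
    Buffer<uint16_t> input(256, 256);
    Func in = BoundaryConditions::repeat_edge(input);

    // The two-stage blur from the text: horizontal pass bh, vertical pass bv.
    Func bh("bh"), bv("bv");
    bh(x, y) = in(x - 1, y) + in(x, y) + in(x + 1, y);
    bv(x, y) = bh(x, y - 1) + bh(x, y) + bh(x, y + 1);

    // Tile bv's loop nest; equivalent to the split/split/reorder chain above.
    const int tile_size = 32;
    bv.tile(x, y, tx, ty, xi, yi, tile_size, tile_size);

    Buffer<uint16_t> out = bv.realize({256, 256});
    return 0;
}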
Now, by computing bh at the granularity of tiles of bv—at
the bv.tx loop level—we get a less fine-grained interleaving
than individual pixels, but one more fine-grained than the
whole image. In Halide’s scheduling language, we specify
this by saying bh.compute_at(bv, tx). This is the key
scheduling choice in the optimized implementations as
shown in Figure 1(b) and (c). This generates the loop nest:
for bv.ty in bv.min.y/tile_size to bv.max.y/tile_size:
  for bv.tx in bv.min.x/tile_size to bv.max.x/tile_size:
    // compute tile of bh
    for bh.y in (bv.ty*tile_size - 1) to ((bv.ty + 1)*tile_size):
      for bh.x in (bv.tx*tile_size) to ((bv.tx + 1)*tile_size - 1):
        bh(bh.x, bh.y) = in(bh.x - 1, bh.y)
                       + in(bh.x    , bh.y)
                       + in(bh.x + 1, bh.y)
    // compute tile of bv
    for bv.yi in 0 to tile_size - 1:
      for bv.xi in 0 to tile_size - 1:
        bv.y = bv.ty*tile_size + bv.yi
        bv.x = bv.tx*tile_size + bv.xi
        bv(bv.x, bv.y) = bh(bv.x, bv.y - 1)
                       + bh(bv.x, bv.y    )
                       + bh(bv.x, bv.y + 1)
This organization recomputes only one row on the top and
bottom of each tile, rather than every row, while improving
locality to the level of tiles rather than whole images. By
increasing or decreasing the size of these tiles, we can easily
trade off between locality and computation. Common
choices size the tiles so that they fit in a given level of cache.
The ideal choice depends on the relative cost of recomputing a pixel of the producer function versus storing and loading it from each level of the memory hierarchy.
It is important to realize that this is not just a constant-factor tweak in the resulting code: if the tile size is a constant, we have asymptotically improved the locality relative
to the naïve interleaving, by reducing the reuse distance for
intermediate values from O(n) to O(1).
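Since the best tile size is machine- and pipeline-dependent, one common approach is simply to time a few candidates. The sketch below does this with Halide's JIT; make_blur is a hypothetical helper written for this illustration (not a Halide API), and the image size, candidate tile sizes, and element type are assumptions.

#include "Halide.h"
#include <chrono>
#include <cstdio>
using namespace Halide;

// Hypothetical helper: build the two-stage blur with bv tiled and bh
// computed per tile of bv, for a given tile size.
Func make_blur(Func in, int tile_size) {
    Var x("x"), y("y"), tx("tx"), ty("ty"), xi("xi"), yi("yi");
    Func bh("bh"), bv("bv");
    bh(x, y) = in(x - 1, y) + in(x, y) + in(x + 1, y);
    bv(x, y) = bh(x, y - 1) + bh(x, y) + bh(x, y + 1);
    bv.tile(x, y, tx, ty, xi, yi, tile_size, tile_size);
    bh.compute_at(bv, tx);  // interleave bh per tile of bv, as in the text
    return bv;
}

int main() {
    Buffer<uint16_t> input(2048, 2048);
    Func in = BoundaryConditions::repeat_edge(input);
    // Candidate tile sizes; each divides 2048 evenly, so no ragged tiles.
    for (int ts : {16, 32, 64, 128, 256}) {
        Func bv = make_blur(in, ts);
        bv.realize({2048, 2048});  // first call JIT-compiles; treat as warm-up
        auto t0 = std::chrono::steady_clock::now();
        bv.realize({2048, 2048});
        auto t1 = std::chrono::steady_clock::now();
        printf("tile_size %4d: %.3f ms\n", ts,
               std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return 0;
}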
We can also choose to compute bh at any of the other loop
levels in bv: the schedule of computation across stages in
Halide is specified, for each producer function, as any single
level in the scheduled loop nest of the downstream pipeline.
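As a sketch of what those alternatives look like, assuming the bh and bv definitions from the text and the tiled loop nest above (exactly one of these would be chosen):

// bh.compute_at(bv, ty);  // once per horizontal strip of tiles: less recompute,
//                         // but a larger working set
bh.compute_at(bv, tx);     // once per tile: the choice used above
// bh.compute_at(bv, yi);  // once per row within each tile: finer interleaving,
//                         // more rows of bh recomputed

Moving the compute level outward reduces redundant work at the cost of locality; moving it inward does the opposite.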
The fourth and final core part of the schedule similarly
specifies the granularity of storage across functions. The
granularity of storage determines at what loop level in the
evaluation of the consuming pipeline the results of a producer
function should be stored for reuse; on architectures such as
GPUs, different levels are also mapped to different layers of
the memory hierarchy. The granularity of computation, by
contrast, determines at what level in the loop nest around the
consumer stage the program should evaluate the required part
of the producer stage (here, bh), and therefore how finely the
computation of the different functions is interleaved.
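The text gives no code for this storage choice, but Halide exposes it through store_root and store_at. A minimal sketch, not taken from the paper, assuming bv is left in its default row-major order rather than the tiled order above:

// Allocate storage for bh once, outside all of bv's loops, but compute its
// values only as each row of bv first needs them. Rows of bh produced for
// one row of bv are then reused rather than recomputed by the next row, and
// Halide can fold the allocation down to a small rolling window of rows.
bh.store_root()
  .compute_at(bv, y);

Decoupling storage granularity from computation granularity in this way is what allows schedules that avoid redundant work while keeping the working set small.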
For example, Halide’s default schedule for producers is
inline, which means that if we leave the default schedule for
bh, it is evaluated directly everywhere it is called:
for bv.y in bv.min.y to bv.max.y:
  for bv.x in bv.min.x to bv.max.x:
    bv(bv.x, bv.y) = // bh(x, y - 1) + bh(x, y) + bh(x, y + 1) =
        (in(bv.x - 1, bv.y - 1) + in(bv.x, bv.y - 1) + in(bv.x + 1, bv.y - 1))
      + (in(bv.x - 1, bv.y    ) + in(bv.x, bv.y    ) + in(bv.x + 1, bv.y    ))
      + (in(bv.x - 1, bv.y + 1) + in(bv.x, bv.y + 1) + in(bv.x + 1, bv.y + 1))
This maximizes locality between the producer and consumer, since each value of bh is immediately used as soon as
it is computed, so it does not need to be stored and loaded in
far-away memory. However, because the stencils of neighboring pixels overlap, every pixel in bh is computed three
times instead of once: when the outer loop moves to the next
row down, each pixel of bv recomputes two of the same pixels of bh that were just computed by the pixel right above it.
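In the C++ embedding this default corresponds to simply not scheduling bh at all. A minimal sketch, with the input size, element type, and boundary handling chosen here only for illustration:

#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y");
    Buffer<uint16_t> input(256, 256);
    Func in = BoundaryConditions::repeat_edge(input);

    Func bh("bh"), bv("bv");
    bh(x, y) = in(x - 1, y) + in(x, y) + in(x + 1, y);
    bv(x, y) = bh(x, y - 1) + bh(x, y) + bh(x, y + 1);

    // No scheduling calls on bh: it is inlined into bv, so each bh pixel is
    // recomputed by the three bv pixels that read it, as described above.
    bv.realize({256, 256});
    return 0;
}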
The other obvious choice we could make is to not interleave these two stages at all. Instead, we compute all of bh
before computing any of bv. We express this in Halide by
saying that bh should be computed at the very outermost, or
“root” level, outside all loops over its consumer. This is done
by calling bh.compute_root(), generating the complete
equivalent program:
// compute bh over slightly enlarged window,
// to fulfill all uses in bv
for bh.y in bv.min.y - 1 to bv.max.y + 1:
  for bh.x in bv.min.x to bv.max.x:
    bh(bh.x, bh.y) = in(bh.x - 1, bh.y)
                   + in(bh.x    , bh.y)
                   + in(bh.x + 1, bh.y)
// compute bv using bh
for bv.y in bv.min.y to bv.max.y:
  for bv.x in bv.min.x to bv.max.x:
    bv(bv.x, bv.y) = bh(bv.x, bv.y - 1)
                   + bh(bv.x, bv.y    )
                   + bh(bv.x, bv.y + 1)
This only computes each pixel in each stage exactly once,
wasting no work, but it destroys any producer-consumer
locality between the two stages: an entire image of intermediate results has to be computed between where values are
produced in the first stage and where they are consumed in
the second. This schedule is equivalent to the clean C++ as
shown in Figure 1(a), which suffers from the same problem.
Looking at these two examples, we see a tension between
locality and (excess, or redundant) computation. This tension is fundamental to computations with stencil dependencies, since stencils overlap between iterations and
prevent simple loop fusion without duplicating work.
However, this tension can be balanced along a continuum
between these extremes.
For example, we can change the schedule within bv to iterate in a tiled
order by splitting its two dimensions by a chosen tile size and then
reordering the four resulting dimensions, and then compute bh at the
granularity of those tiles. This is precisely the bv.tile and
bh.compute_at(bv, tx) schedule described above.
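As a compact summary, here is a sketch of bh's schedule at a few points along that continuum, assuming the Func and Var names from the text; the tile size of 64 is arbitrary, and only one option would be active at a time:

// 1. Default (fully inline): best producer-consumer locality, roughly 3x
//    redundant computation of bh.
//      (no scheduling calls on bh)
// 2. Breadth-first: no redundant computation, but no locality between stages.
//      bh.compute_root();
// 3. In between: tile bv and compute bh per tile, recomputing only one row
//    of bh at the top and bottom of each tile.
bv.tile(x, y, tx, ty, xi, yi, 64, 64);
bh.compute_at(bv, tx);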