operator previously described in Section 2. Note how SAD
fits quite naturally to a CE abstraction: the map function is
absolute difference and the reduce function is summation.
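As an illustrative sketch (not the CE hardware or the authors' code), the map/reduce view of SAD can be written as a generic sliding-stencil routine. The names `generalized_conv2d` and `sad` are hypothetical, and the tiny search window is made-up data.

```python
from functools import reduce
import operator


def generalized_conv2d(image, stencil, map_fn, reduce_fn):
    """Slide `stencil` over `image`; at each position apply `map_fn`
    to every (pixel, stencil value) pair and fold the results with `reduce_fn`."""
    sh, sw = len(stencil), len(stencil[0])
    ih, iw = len(image), len(image[0])
    out = []
    for y in range(ih - sh + 1):
        row = []
        for x in range(iw - sw + 1):
            mapped = [map_fn(image[y + j][x + i], stencil[j][i])
                      for j in range(sh) for i in range(sw)]
            row.append(reduce(reduce_fn, mapped))
        out.append(row)
    return out


def sad(search_window, block):
    """SAD: the map is absolute difference, the reduce is summation."""
    return generalized_conv2d(search_window, block,
                              lambda a, b: abs(a - b), operator.add)


# Hypothetical 3 x 4 search window and 2 x 2 reference block.
search = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
block = [[6, 7], [10, 11]]
scores = sad(search, block)  # the zero entry marks the exact match
```

Swapping the two function arguments for multiply and add would turn the same routine into an ordinary filter, which is the point of the abstraction.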
Fractional Motion Estimation (FME): FME refines the initial match obtained at the IME step to a quarter-pixel resolution. It first up-samples the block selected by IME, and then
performs a slightly modified variant of the SAD operation.
Up-sampling also fits nicely to the convolution abstraction
and actually includes two convolution operations: first,
the image block is up-sampled by two using a six-tap separable 2D filter. This part is purely convolution. Second, the
resulting image is up-sampled by another factor of two by
interpolating adjacent pixels, which can be defined as a map
operator (to generate the new pixels) with no reduce.
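The two up-sampling stages can be sketched in one dimension as follows. This is a hedged illustration: the text only says "six-tap", so the tap values below are an assumption taken from the H.264 half-sample luma filter, and the edge replication, the rounding, and the function names are choices made for this example. A separable 2D version would apply the same pass to rows and then to columns.

```python
# Assumed taps: the H.264 half-sample luma filter (1, -5, 20, 20, -5, 1) / 32.
TAPS = (1, -5, 20, 20, -5, 1)


def clip8(value):
    """Clamp to the 8-bit pixel range."""
    return max(0, min(255, value))


def half_pel_row(row):
    """Up-sample by two: keep each integer pixel and insert one six-tap
    filtered sample between neighbours (map = multiply, reduce = add)."""
    n = len(row)

    def px(i):
        # Replicate edge pixels outside the row (an assumption of this sketch).
        return row[max(0, min(n - 1, i))]

    out = []
    for x in range(n - 1):
        acc = sum(t * px(x - 2 + k) for k, t in enumerate(TAPS))
        out.append(row[x])
        out.append(clip8((acc + 16) >> 5))  # divide by 32 with rounding
    out.append(row[-1])
    return out


def quarter_pel_row(half_row):
    """Up-sample by two again: average adjacent samples (a map with no reduce)."""
    out = []
    for a, b in zip(half_row, half_row[1:]):
        out.append(a)
        out.append((a + b + 1) >> 1)
    out.append(half_row[-1])
    return out


half = half_pel_row([0, 0, 32, 32])
quarter = quarter_pel_row(half)
```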
3.2. SIFT
Scale Invariant Feature Transform (SIFT) looks for distinctive
features in an image.[10] To ensure scale invariance, Gaussian
blurring and down-sampling are performed on the image to
create a pyramid of images at coarser and coarser scales.
A Difference-of-Gaussian (DoG) pyramid is then created by
computing the difference between every two adjacent image
scales. Features of interest are then found by looking at the
scale-space extrema in the DoG pyramid.
Gaussian blurring and down-sampling are naturally 2D
convolution operations. Finding scale-space extrema is a 3D
stencil computation, but we can convert it into a 2D stencil
operation by interleaving rows from different images into a
single buffer. The extrema operation is mapped to convolution using compare as a map operator and logical AND as the reduce operator.
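A minimal sketch of the DoG and extrema steps, assuming three scales and a particular interleaving order (row r of scale s is stored at buffer row 3r + s). The function names and the interleaving layout are assumptions for illustration, not the paper's exact buffer format. Only the maximum test is shown; the minimum test is symmetric.

```python
def difference(img_a, img_b):
    """DoG step: per-pixel subtract of two adjacent scales (map only, no reduce)."""
    return [[b - a for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]


def interleave_rows(scales):
    """Interleave rows of equally sized images into one 2D buffer:
    row r of scale s lands at buffer row r * len(scales) + s."""
    height = len(scales[0])
    return [scales[s][r] for r in range(height) for s in range(len(scales))]


def is_scale_space_max(buf, r, c):
    """Test pixel (r, c) of the middle scale against its 3 x 3 x 3 neighbourhood.
    In the interleaved buffer that neighbourhood is a 9-row by 3-column window."""
    center_row = 3 * r + 1
    center = buf[center_row][c]
    top = 3 * (r - 1)
    flags = [center > buf[y][x]                 # map: compare
             for y in range(top, top + 9)
             for x in range(c - 1, c + 2)
             if (y, x) != (center_row, c)]
    return all(flags)                            # reduce: logical AND


# Hypothetical DoG values for three adjacent scales.
scales = [
    [[1, 2, 1], [2, 3, 2], [1, 2, 1]],
    [[2, 3, 2], [3, 9, 3], [2, 3, 2]],
    [[1, 2, 1], [2, 4, 2], [1, 2, 1]],
]
buf = interleave_rows(scales)
found = is_scale_space_max(buf, 1, 1)
```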
3.3. Demosaic
Camera sensor output is typically a red, green, and blue
(RGB) color mosaic laid out in a Bayer pattern.[3] At each
location, the two missing color values are then interpolated
using the luminance and color values in surrounding
cells. Because the color information is undersampled,
the interpolation is tricky; any linear approach yields color
fringes. We use an implementation of Demosaic that is
based upon adaptive color plane interpolation (ACPI),
which computes image gradients and then uses a three-tap
filter in the direction of smallest gradient. While this
fits the generalized convolution flow, it requires a complex
“reduction” tree to implement the gradient-based selection.
The data access pattern is also nontrivial since individual
color values from the mosaic must be separated
before performing interpolation.
4. CONVOLUTION ENGINE
Convolution operators are highly compute-intensive, particularly for large stencil sizes, and being data-parallel they
lend themselves to vector processing. However, as explained
earlier, existing SIMD units are limited in the extent to
which they can exploit the inherent parallelism and locality of convolution due to the organization of their register
files. The CE overcomes these limitations with the help of
shift register structures. As shown in Figure 3 for the 2D convolution case, when such a storage structure is augmented
with an ability to generate multiple shifted versions of the
input data, it can fill 128 ALUs from just a small 16 × 8 2D
register with low access energy as well as area. Similar gains
are possible for 1D horizontal and 1D vertical convolutions.
As we will see shortly, the CE facilitates further reductions in
energy overheads by creating fused super-instructions introduced in Section 3.
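The benefit of generating shifted versions of the register contents can be sketched in software for the 1D case. The register width, the three-tap filter, and the function names below are illustrative assumptions and do not reflect the CE's actual configuration; the sketch only shows that a small number of stored pixels can supply a much larger number of ALU operands.

```python
def shifted_windows(register, taps):
    """All windows of width `taps` visible in the register, one per output pixel."""
    return [register[i:i + taps] for i in range(len(register) - taps + 1)]


def convolve_step(register, coeffs):
    """One step: every shifted window feeds its own group of multipliers,
    and each group is followed by an add reduction."""
    return [sum(p * k for p, k in zip(window, coeffs))
            for window in shifted_windows(register, len(coeffs))]


register = list(range(16))   # 16 pixels held in the register, each loaded once
coeffs = [1, 2, 1]           # hypothetical three-tap filter
outputs = convolve_step(register, coeffs)

# Each output consumes len(coeffs) pixel operands, so 16 stored pixels
# supply 14 * 3 = 42 multiplier inputs in this example.
alu_operands = len(outputs) * len(coeffs)
```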
The CE is developed as a domain-specific hardware extension to Tensilica’s extensible RISC cores.[6] The extension
hardware is developed using Tensilica’s TIE language. The
next sections discuss the key blocks in the CE extension
hardware, depicted in Figure 4.
4.1. Register files
The 2D shift register is used for vertical and 2D convolution
flows and supports vertical row shift: one new row of pixel
data is shifted in as the 2D stencil moves vertically down
into the image. The 2D shift register provides simultaneous
access to all of its elements enabling the interface unit to
feed any data element to the ALUs. The 1D shift register is used
to supply data for the horizontal convolution flow. New image
pixels are shifted horizontally into the 1D register as the 1D
stencil moves over an image row.
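The vertical row shift can be modelled in software as below. This is a behavioural sketch only: the class `ShiftRegister2D`, its methods, and the sizes used are hypothetical and are not taken from the CE's TIE description.

```python
from collections import deque


class ShiftRegister2D:
    """Behavioural model of a 2D shift register with vertical row shift."""

    def __init__(self, rows, cols):
        self.cols = cols
        # A bounded deque drops the oldest row when a new one is appended.
        self.rows = deque([[0] * cols for _ in range(rows)], maxlen=rows)

    def shift_in_row(self, new_row):
        """Shift one new row of pixels in; the oldest row falls out."""
        if len(new_row) != self.cols:
            raise ValueError("row width does not match register width")
        self.rows.append(list(new_row))

    def window(self, col, width):
        """Any element is readable at once: return a sub-block of all rows."""
        return [row[col:col + width] for row in self.rows]


def vertical_sums(image, taps):
    """Vertical box filter: each image row is loaded into the register once
    and reused for `taps` consecutive output rows."""
    reg = ShiftRegister2D(taps, len(image[0]))
    out = []
    for y, row in enumerate(image):
        reg.shift_in_row(row)
        if y >= taps - 1:  # register now holds `taps` valid rows
            out.append([sum(col) for col in zip(*reg.rows)])
    return out


result = vertical_sums([[1, 2], [3, 4], [5, 6], [7, 8]], 3)
```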
The 2D Coefficient Register stores data that does not
change as the stencil moves across the image. This can be
filter coefficients, current image pixels in IME for performing SAD, or pixels at the center of Windowed Min/Max stencils. The results of convolution operations are either written
back to the 2D Shift Register or the Output Register. A later
Table 1. Mapping kernels to convolution abstraction.
Kernel                    | Map      | Reduce      | Stencil sizes | Data flow
--------------------------|----------|-------------|---------------|---------------------------------------
IME SAD                   | Abs diff | Add         | 4 × 4         | 2D convolution
FME 1/2 pixel up-sampling | Multiply | Add         | 6             | 1D horizontal and vertical
FME 1/4 pixel up-sampling | Average  | None        | –             | 2D matrix operation
SIFT Gaussian blur        | Multiply | Add         | 9, 13, 15     | 1D horizontal and vertical convolution
SIFT DoG                  | Subtract | None        | –             | 2D matrix operation
SIFT extrema              | Compare  | Logical AND | 9 × 3         | 2D convolution
Demosaic interpolation    | Multiply | Complex     | 3             | 1D horizontal and vertical
Some kernels, such as subtraction, operate on single pixels and thus have no stencil size defined. We call these matrix operations. There is no reduce step for these operations.