pattern and the functional units perform the arithmetic.
Interface units. The Interface Units (IF) arrange data from
the register files into a specific pattern needed by the map
operation. For 2D convolutions, multiple shifted 2D subblocks can be simultaneously accessed from the 2D register. Multiple block sizes (2 × 2, 4 × 4, 8 × 8, and so on) are
supported, and the appropriate size is selected based on the
convolution kernel size. Similarly for vertical convolution,
multiple 2D register columns can be accessed in parallel,
with support for multiple column access sizes. Finally, the
1D IF supports accessing multiple shifted 1D blocks from
the 1D shift register for horizontal convolution. We are also
exploring a more generalized permutation layer to support
Functional units. Since all data rearrangement is handled
by the interface unit, the functional units are just an array of
short fixed-point, two-input ALUs. In addition to
multipliers, we support absolute difference (to facilitate SAD)
as well as other typical arithmetic operations such as addition,
subtraction, and comparison. The output of the ALUs is fed
to the Reduce stage.
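As a rough illustration of how the interface unit and the ALU array divide the map work, the following Python sketch models the 1D horizontal case. The function names (`shifted_blocks`, `map_stage`), the list-based shift register, and the operation table are assumptions made for this example and are not part of the hardware described here.

```python
# Illustrative software model only; all names and the data layout are assumptions.
ALU_OPS = {
    "mul": lambda a, b: a * b,
    "absdiff": lambda a, b: abs(a - b),   # used for SAD
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "cmp_lt": lambda a, b: int(a < b),    # stands in for a comparison operation
}

def shifted_blocks(shift_reg, block_size, num_blocks):
    """Interface unit model: return num_blocks windows, each shifted by one element."""
    return [shift_reg[s:s + block_size] for s in range(num_blocks)]

def map_stage(shift_reg, coeffs, op, num_blocks):
    """Apply one two-input ALU op element-wise between each shifted block and coeffs."""
    alu = ALU_OPS[op]
    blocks = shifted_blocks(shift_reg, len(coeffs), num_blocks)
    return [[alu(x, c) for x, c in zip(block, coeffs)] for block in blocks]

pixels = [1, 2, 3, 4, 5, 6]
# Four overlapping 3-element windows, each multiplied by the same 3-tap kernel.
products = map_stage(pixels, [1, 0, -1], "mul", num_blocks=4)
```

Each inner list of `products` holds the per-tap results for one output pixel; summing them is left to the reduce stage.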
Reduce unit. The reduce part of the map-reduce operation
is handled by a programmable reduce stage. Based upon
the needs of our applications, we currently support arithmetic and logical reduction stages. The degree of reduction
depends on the kernel size: for example, a 4 × 4 2D kernel requires a 16-to-1 reduction, whereas an 8-tap 1D kernel
needs an 8-to-1 reduction. Thus, the reduction stage is
implemented as a combining tree and outputs can be tapped
out from multiple stages of the tree.
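A minimal sketch of such a combining tree, written in Python for illustration, is shown here. The function name `reduce_tree` and the choice to return every level as a list are assumptions of this example; the sketch also assumes a power-of-two number of inputs.

```python
def reduce_tree(values, op):
    """Combine values pairwise and return the outputs of every tree level."""
    if not values or len(values) & (len(values) - 1):
        raise ValueError("this sketch assumes a power-of-two input count")
    levels = []
    current = list(values)
    while len(current) > 1:
        # One tree level: each node combines two adjacent inputs.
        current = [op(current[i], current[i + 1]) for i in range(0, len(current), 2)]
        levels.append(current)
    return levels

add = lambda a, b: a + b
levels = reduce_tree(list(range(16)), add)
sixteen_to_one = levels[-1]   # single output, e.g. one 4 x 4 2D kernel
eight_to_one = levels[2]      # two outputs, e.g. two 8-tap 1D kernels
```

Tapping `levels[2]` instead of `levels[-1]` yields two independent 8-to-1 results from the same 16 inputs, which is the reason for exposing intermediate stages.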
To enable the creation of the “super instructions” described
in Section 3, we augment the combining tree to
handle noncommutative operations by adding support for
diverse arithmetic operations at different levels of the tree.
This fusion increases the computational efficiency by reducing the number of required instructions and by eliminating temporary storage of intermediate data in register files.
Because this more complex data combination need not be
commutative, the right data (output of the map operation)
must be placed on each input to the combining network.
Thus, a “Data Shuffle Stage” is also added to the CE in the
form of a very flexible swizzle network that provides permutations of the input data.
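To make the role of the shuffle concrete, the sketch below models a fused tree whose first level subtracts and whose second level adds. It is an illustrative Python model under stated assumptions: `shuffle`, `fused_tree`, and the permutation encoding are hypothetical names and conventions, and the input count is assumed to halve cleanly at every level.

```python
def shuffle(values, perm):
    """Data shuffle model: output position i takes the input at index perm[i]."""
    return [values[p] for p in perm]

def fused_tree(values, level_ops):
    """Combining tree that applies a (possibly different) op at each level."""
    current = list(values)
    for op in level_ops:
        current = [op(current[i], current[i + 1]) for i in range(0, len(current), 2)]
    return current

sub = lambda a, b: a - b   # noncommutative: operand order matters
add = lambda a, b: a + b

mapped = [10, 20, 13, 27]  # outputs of the map stage
# Goal: (mapped[2] - mapped[0]) + (mapped[3] - mapped[1]) in a single pass.
result = fused_tree(shuffle(mapped, [2, 0, 3, 1]), [sub, add])
unshuffled = fused_tree(mapped, [sub, add])
```

Because subtraction is order-sensitive, `result` and `unshuffled` differ; the permutation is what routes each operand to the correct tree input.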
4.3. Other hardware
To facilitate vector operations on the convolution output,
we have added a 32-element SIMD unit. This unit interfaces
with the 2D Output Register and uses it as a Vector Register
file. This unit is wider than typical SIMD units, as it operates on intermediate data generated by the convolution data
path and thus is not constrained by data memory accesses.
Despite being wider, the vector unit is still lightweight as
it only supports basic vector add and subtract type operations and has no support for higher cost operations such as
Because an application may perform computation that
conforms neither to the convolution block nor to the vector
unit, or may otherwise benefit from a fixed function implementation. If the designer wishes to build a customized
section shows how the output register file also works as the
vector register file for the vector unit shown in Figure 4.
4.2. Map and reduce logic
As described earlier, we abstract convolution as a map and
reduce step that transforms each input pixel into an output
pixel. In our implementation, interface units and ALUs
work together to implement the map operation; the interface units arrange the data as needed for the particular map
[Figure 3 diagram: 2D Shift Register (reference block, elements R) and 2D Register (current block, elements C).]
Figure 3. Implementation of 8 × 8 2D SAD operation that exploits
parallelism in all four loops of Listing 1. The reference block resides
in a 2D shift register while the current block is stored in a 2D
register. Because both registers allow 2D access of the 8 × 8 block,
64 ALUs can operate in parallel. To enable an even larger degree of
parallelism and to exploit data-reuse in the horizontal direction, the
shift register generates pairs of multiple overlapping 8 × 8 blocks
which are then fed to the ALU through a multiplexer. These pairs
allow parallel execution of 128 ALUs generating two outputs in
parallel. After the generation of four pairs of horizontal outputs, the
shift register shifts up by one to make room for a new row of the search
window, achieving vertical data-reuse.
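The dataflow in Figure 3 can be mimicked in software. The sketch below is a plain Python model, not the hardware implementation: `sad_block` and `sad_pair` are hypothetical names, and the nested lists stand in for the 2D shift register (reference window) and the 2D register (current block).

```python
def sad_block(ref_window, cur_block, row, col, n=8):
    """Sum of absolute differences against the n x n reference block at (row, col)."""
    return sum(
        abs(ref_window[row + y][col + x] - cur_block[y][x])
        for y in range(n)
        for x in range(n)
    )

def sad_pair(ref_window, cur_block, row, col, n=8):
    """Two horizontally adjacent candidates, modeling 2 x 64 ALUs working in parallel."""
    return (
        sad_block(ref_window, cur_block, row, col, n),
        sad_block(ref_window, cur_block, row, col + 1, n),
    )

# Toy data: every pixel equals its column index (8 rows, 9 columns of reference).
ref_window = [[x for x in range(9)] for _ in range(8)]
cur_block = [[x for x in range(8)] for _ in range(8)]
pair = sad_pair(ref_window, cur_block, row=0, col=0)
```

The two candidates in `sad_pair` share all but one column of reference data, which is the horizontal reuse the overlapping blocks exploit.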
[Figure 4 diagram blocks: 2D Shift Register; 1D Shift Register; interface units (2D IF, 1D IF); ALU Input Port 1; ALU Input Port 2; Instruction Graph Fusion/Multi-level Reduction Tree.]
Figure 4. Block diagram of convolution engine. The interface units
(IF) connect the register files to the functional units and provide shifted
broadcast to facilitate convolution. The data shuffle (DS) stage combined
with the instruction graph fusion (IGF) stage creates the generalized
reduction unit, called the complex graph fusion unit.