Figure 9. FME upsampling unit. Customized shift registers, directly
wired to function logic, result in efficient upsampling. Ten integer
pixels from local memory are used for row upsampling in RFIR
blocks. Half upsampled pixels along with appropriate integer pixels
are loaded into shift registers. CFIR accesses six shift registers in
each column simultaneously to perform column upsampling.
[Figure 9 diagram. Labels: ten integer pixels loaded from local memory; five RFIR (row upsampling) blocks; eleven CFIR (column upsampling) blocks; integer buffer, row half buffer, column half buffer.]
which, in turn, lets us reduce the I-cache from a 16KB 4-way
cache to a 2KB direct-mapped cache. Due to the abundance
of short-lived data, we remove the vector register files and
replace them with custom storage buffers. The “magic”
instruction reduces the instruction cache energy by 54×
and processor fetch and decode energy by 14×. Finally, as
Figure 7 shows, 35% of the energy is now going into the functional units, and again the energy efficiency of this unit is
close to that of an ASIC.
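The row upsampling step described in the Figure 9 caption can be sketched in software as follows. This is a minimal sketch, not the authors' hardware: it assumes the standard H.264 luma half-sample filter taps (1, -5, 20, 20, -5, 1) with rounding and a shift by 5, and the names `TAPS`, `clip_pixel`, and `row_half_pels` are hypothetical. With ten integer pixels as input, a six-tap window yields five half-pixel outputs, one per RFIR block.

```python
# Assumed: H.264 luma half-sample filter taps; their sum is 32.
TAPS = (1, -5, 20, 20, -5, 1)


def clip_pixel(value):
    """Clamp a filtered value to the 8-bit pixel range."""
    return max(0, min(255, value))


def row_half_pels(pixels):
    """Slide a six-tap window over one row of integer pixels.

    Each output is the half-pixel sample between the two middle
    pixels of the window, rounded and normalized by 32.
    """
    outputs = []
    for i in range(len(pixels) - len(TAPS) + 1):
        window = pixels[i:i + len(TAPS)]
        acc = sum(t * p for t, p in zip(TAPS, window))
        outputs.append(clip_pixel((acc + 16) >> 5))
    return outputs


flat_row = [100] * 10            # ten integer pixels of constant value
ramp_row = list(range(0, 100, 10))  # 0, 10, ..., 90
flat_out = row_half_pels(flat_row)
ramp_out = row_half_pels(ramp_row)
```

In the hardware unit, the five windows are evaluated in parallel from the same ten loaded pixels; the loop here only models the arithmetic, not the timing.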
CABAC Strategy: CABAC originally consumed less than 2%
of the total energy, but after data-parallel components are
accelerated by “magic” instructions, CABAC dominates the
total energy. However, it requires a different set of optimizations because it is control oriented and not data parallel.
Thus, for CABAC, we are more interested in control fusion
than operation fusion.
A critical part of CABAC is the arithmetic encoding stage,
which is a serial process with small amounts of computation, but complex control flow. We break arithmetic coding down into a simple pipeline and drastically change it
from the reference code implementation, reducing the
binary encoding of each symbol to five instructions. While
there are several if–then–else conditionals reduced to
single instructions (or with several compressed into one),
the most significant reduction came in the encoding loop,
as shown in Figure 10a. Each iteration of this loop may or
may not trigger execution of an internal loop that outputs
an indefinite number of encoded bits. By fundamentally
changing the algorithm, the while loop was reduced to a
single constant-time instruction (ENCODE_PIPE_5) and a
rarely executed while loop, as shown in Figure 10b.
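To illustrate what collapsing a data-dependent loop into a constant-time step can look like, the sketch below takes only the range renormalization part of an arithmetic coder. It is an illustration of control fusion in general, not a description of what ENCODE_PIPE_5 does internally. It assumes a 9-bit range register that must be shifted left until it is at least 256, as in H.264 CABAC; the function names are hypothetical, and bit output is deliberately left out.

```python
def renorm_loop(rng):
    """Reference-style renormalization: shift one bit per iteration."""
    shifts = 0
    while rng < 256:
        rng <<= 1
        shifts += 1
    return rng, shifts


def renorm_fused(rng):
    """Same result without a loop.

    The shift count is derived from the position of the leading one,
    which hardware can compute with a priority encoder in fixed time.
    """
    shifts = max(0, 9 - rng.bit_length())
    return rng << shifts, shifts


# Compare both versions over every nonzero value of a 9-bit range register.
all_match = all(renorm_loop(r) == renorm_fused(r) for r in range(1, 511))
```

The looped version takes between zero and eight iterations depending on the input, while the fused version always does the same amount of work, which is the property that makes a single fixed-latency instruction possible.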
Figure 10. CABAC arithmetic encoding loop. (a) H.264 reference
code. (b) After insertion of “magic” instructions. Much of the control
logic in the main loop has been reduced to one constant-time
instruction, ENCODE_PIPE_5.
[Figure 10 flowcharts. Labels: START, Done?, Y, Output?, ENCODE_PIPE_5, Seldom Executed.]
The other critical part of CABAC is the conversion of
non-binary-valued DCT coefficients to binary codes in the
binarization stage. To improve the efficiency of this step, we
create a 16-entry LIFO structure to store DCT coefficients.
To each LIFO entry, we add a single-bit flag to identify zero-valued DCT coefficients. These structures, along with their
corresponding logic, reduce register file energy by bringing
the most frequently used values out of the register file and
into custom storage buffers. Using “magic” instructions, we
produce Unary and Exponential-Golomb codes using simple operations, which help reduce datapath energy. These
modifications are inspired by the ASIC implementation
described in Shojania and Sudharsanan.17 CABAC is optimized to achieve the bit rate required for H.264 level 3.1 at
720p video resolution.
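A software sketch of the binarization structures described above may help make them concrete. Everything here is an assumption made for illustration: the class `CoeffLifo`, the `Entry` tuple, and the functions `unary` and `exp_golomb` are hypothetical names, and the code models only the behavior (a 16-entry LIFO with a per-entry zero flag, plus the two code families), not the hardware. The Exponential-Golomb routine uses the ones-then-zero prefix form found in CABAC binarization.

```python
from collections import namedtuple

# One LIFO slot: the coefficient plus a single-bit "is zero" flag.
Entry = namedtuple("Entry", ["coeff", "is_zero"])


class CoeffLifo:
    """Fixed-depth LIFO for DCT coefficients (depth taken from the text)."""

    DEPTH = 16

    def __init__(self):
        self._entries = []

    def push(self, coeff):
        if len(self._entries) >= self.DEPTH:
            raise OverflowError("LIFO is full")
        # The flag is computed once on entry, so later stages can
        # skip zero coefficients without comparing the value again.
        self._entries.append(Entry(coeff, coeff == 0))

    def pop(self):
        return self._entries.pop()

    def __len__(self):
        return len(self._entries)


def unary(n):
    """Unary code: n ones followed by a terminating zero."""
    return [1] * n + [0]


def exp_golomb(value, k=0):
    """k-th order Exp-Golomb code with a ones-then-zero prefix."""
    bits = []
    while value >= (1 << k):
        bits.append(1)
        value -= 1 << k
        k += 1
    bits.append(0)
    # Emit the remaining value as a k-bit suffix, most significant bit first.
    for shift in range(k - 1, -1, -1):
        bits.append((value >> shift) & 1)
    return bits


lifo = CoeffLifo()
for c in (7, 0, -3):
    lifo.push(c)
last = lifo.pop()
second = lifo.pop()
```

Both code generators need only comparisons, subtractions, and shifts, which is consistent with the text's point that the codes can be produced with simple operations.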
Magic Instructions Summary: To summarize, the magic
instructions perform up to hundreds of operations each
time they are executed, so the overhead of the instruction
is better balanced by the work performed. Of course this is
hard to do in a general way, since bandwidth requirements
and utilization of a larger SIMD array would be problematic.
Therefore, we solved this problem by building custom storage units tailored to the application, and then directly connecting the necessary functional units to these storage units.
These custom storage units greatly amplified the register
fetch bandwidth, since data in the storage units is used for
many different computations. In addition, since the intra-storage and functional unit communications were fixed and
local, they could be managed at ASIC-like energy costs.
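As a rough, back-of-the-envelope illustration of this bandwidth amplification, the sketch below counts operand reads for the row filter of Figure 9. It assumes ten pixels held in custom storage and six-tap filters evaluated at every window position; the function name `operand_reuse` is hypothetical, and the ratio it returns is simple arithmetic under those assumptions, not a measured result.

```python
def operand_reuse(num_pixels, num_taps):
    """Count filter outputs and operand reads for one sliding-window pass.

    Returns (outputs, operand_reads, reads_per_stored_pixel).
    """
    outputs = num_pixels - num_taps + 1   # one output per window position
    operand_reads = outputs * num_taps    # every tap consumes one operand
    return outputs, operand_reads, operand_reads / num_pixels


outputs, reads, amplification = operand_reuse(num_pixels=10, num_taps=6)
```

Under these assumptions, ten stored pixels supply thirty filter operands, so each value loaded into the storage unit is consumed three times on average without being fetched again.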
After this effort, the processors optimized for data-parallel algorithms have a total speedup of up to 600× and an
energy reduction of 60–350× compared to our base CMP. For
CABAC, total performance gain is 17× and energy gain is 8×.
Figure 7 provides the final energy breakdowns. The efficiencies found in these custom datapaths are impressive, since,
in H.264 at least, they take advantage of data sharing patterns and create very efficient multiple-input operations.
This means that even if researchers are able to create a processor that decreases the instruction and data fetch parts
of a processor by more than 10×, these solutions will not be
as efficient as solutions with “magic” instructions.
Achieving ASIC-like efficiency required 2–3 special
hardware units for each subalgorithm, which is significant
customization work. Some might even say we are just building an ASIC in our processor. While we agree that creating
“magic” instructions requires a thorough understanding of