In the base system, we map this four-stage macro-block
partition to a four-processor CMP system where each processor has 16KB 2-way set associative instruction and data
caches. Figure 2 highlights the large efficiency gap between
our base CMP and the reference ASIC for individual 720p
HD H.264 subalgorithms. The energy required for each
RISC instruction is roughly the same across tasks, so the energy
consumed by each task (shown in Figure 3) is proportional to the
cycles spent on that task. The RISC implementation of
IME, which is the major contributor to performance and
energy consumption, has a performance gap of 525× and
an energy gap of over 700× compared to the ASIC. IME
and FME dominate the overall energy and thus need to be
aggressively optimized. However, we also note that while IP,
DCT, Quant, and CABAC are much smaller parts of the total
energy/delay, even they need about 100× energy improvement to reach ASIC levels.
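The priority on IME and FME follows from simple Amdahl-style arithmetic, sketched below. The energy fractions here are illustrative assumptions (chosen only so that IME and FME dominate, as the text states), not measured values from the paper.

```python
# Assumed energy split; only the dominance of IME/FME matters here.
fractions = {"IME": 0.55, "FME": 0.35, "other": 0.10}

def overall_gain(improvements):
    """Overall energy reduction given per-component improvement factors."""
    remaining = sum(f / improvements.get(name, 1.0)
                    for name, f in fractions.items())
    return 1.0 / remaining

# Improving only the small components 100x barely moves the total:
small_only = overall_gain({"other": 100})                       # ~1.1x
# Large overall gains require attacking IME and FME directly:
all_opt = overall_gain({"IME": 700, "FME": 700, "other": 100})  # ~440x
```

However small the remaining components are individually, the un-optimized fraction bounds the achievable overall gain, which is why the dominant subalgorithms must be optimized aggressively.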
At approximately 8.6B instructions to process 1 frame,
our base system consumes about 140 pJ per instruction—a
reasonable value for a general-purpose system. To further
analyze the energy efficiency of this base CMP implementation, we break the processor's energy into different
functional units as shown in Figure 4. This data makes it
clear how far we need to go to approach ASIC efficiency.
The energy spent in instruction fetch (IF) is an overhead
due to the programmable nature of the processors and is
absent in a custom hardware state machine, but eliminating all this overhead only increases the energy efficiency
by less than one third. Even if we eliminate everything but
the functional unit energy, we still end up with energy savings of only 20×—not nearly enough to reach ASIC levels.
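These observations can be checked with back-of-the-envelope arithmetic. The sketch below uses the figures quoted above (8.6B instructions per frame, roughly 140 pJ per instruction); the per-unit energy shares are illustrative assumptions chosen to be consistent with the text, not the measured values of Figure 4.

```python
INSTRUCTIONS_PER_FRAME = 8.6e9
ENERGY_PER_INSTR_PJ = 140.0

# Total energy for one 720p frame on the base CMP: ~1.2 J.
frame_energy_j = INSTRUCTIONS_PER_FRAME * ENERGY_PER_INSTR_PJ * 1e-12

# Illustrative per-instruction energy shares (assumed; sum to 1):
shares = {"IF": 0.24, "D-$": 0.25, "P Reg": 0.21, "Ctrl": 0.15,
          "RF": 0.10, "FU": 0.05}

def efficiency_gain(kept_units):
    """Energy-efficiency gain if only `kept_units` still consume energy."""
    return 1.0 / sum(shares[u] for u in kept_units)

# Eliminating IF alone leaves everything else paying energy:
gain_no_if = efficiency_gain([u for u in shares if u != "IF"])  # under 4/3
# Keeping only the functional units:
gain_fu_only = efficiency_gain(["FU"])  # 20x, far short of the ~500x needed
```

The arithmetic makes the structural point: because no single overhead dominates, removing any one of them yields only a modest gain, and even the idealized functional-unit-only processor falls two orders of magnitude short of the ASIC.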
Figure 2. The performance and energy gap for base CMP implementation when compared to an equivalent ASIC. Intra combines IP, DCT, and Quant.
Figure 3. Processor energy breakdown for base implementation, over the different H.264 subalgorithms. Intra combines IP, DCT, and Quant.
The next section explores what customizations are needed
to reach the efficiency goals.
4. Customization Results
At first, we restrict our customizations to datapath extensions inspired by GPUs and Intel’s SSE instructions. Such
extensions are relatively general-purpose data-parallel
optimizations and consist of single instruction, multiple
data (SIMD) and multiple instruction issue per cycle (we
use long instruction word, or LIW), with a limited degree
of algorithm-specific customization coming in the form
of operation fusion—the creation of new instructions that
combine frequently occurring sequences of instructions.
However, much like their SSE and GPU counterparts, these
new instructions are constrained to the existing instruction formats and datapath structures. This step represents
the datapaths in current state-of-the-art optimized CPUs. In
the next step, we replace these generic datapaths by custom
units, and allow unrestricted tailoring of the datapath by
introducing arbitrary new compute operations as well as by
adding custom register file structures.
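To make operation fusion concrete, the sketch below models the idea in scalar Python on the absolute-difference-accumulate pattern that recurs in SAD-based motion estimation. The specific fused operation and the 16-wide SIMD grouping are illustrative assumptions, not the paper's exact ISA extensions.

```python
def sad_baseline(ref, cur):
    """Base RISC: subtract, absolute value, and accumulate are
    separate operations for every pixel (3 ops per pixel)."""
    total = 0
    for r, c in zip(ref, cur):
        d = r - c    # op 1: subtract
        d = abs(d)   # op 2: absolute value
        total += d   # op 3: accumulate
    return total

def absdiff_acc(r, c, acc):
    """Hypothetical fused instruction: acc + |r - c| in one operation."""
    return acc + abs(r - c)

def sad_fused_simd(ref, cur, width=16):
    """SIMD plus fusion: one fused op covers `width` pixels per
    'instruction', so the dynamic instruction count of the inner
    loop drops by roughly 3 * width versus the baseline."""
    total = 0
    for i in range(0, len(ref), width):
        lane_acc = 0
        for r, c in zip(ref[i:i + width], cur[i:i + width]):
            lane_acc = absdiff_acc(r, c, lane_acc)
        total += lane_acc  # horizontal reduction across the SIMD lanes
    return total
```

Both functions compute the same sum of absolute differences; what changes is the number of dynamic "instructions" issued, which is precisely what SIMD width and operation fusion reduce.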
The results of these customizations are shown in
Figures 5 through 7. The rest of this section describes these
results in detail and evaluates the effectiveness of the
three customization strategies. Collectively, these optimizations
improve energy efficiency by 170× over the baseline of Section 3.
Figure 4. Processor energy breakdown for base implementation. IF is instruction fetch/decode. D-$ is data cache. P Reg includes the pipeline registers, buses, and clocking. Ctrl is miscellaneous control. RF is register file. FU is the functional units.
Figure 5. Each set of bar graphs represents energy consumption (mJ) at each stage of optimization for IME, FME, IP, and CABAC, respectively. The first bar in each set represents base RISC energy, followed by RISC augmented with SSE/GPU-style extensions, and then RISC augmented with "magic" instructions. The last bar in each group indicates energy consumption by the ASIC.