4.1. SSE/GPU style enhancements
Using Tensilica’s TIE extensions we add LIW instructions
and SIMD execution units with vector register files of custom depths and widths. A single SIMD instruction performs
multiple operations (8 for IP, 16 for IME, and 18 for FME),
reducing the number of instructions and consequently
the IF energy. LIW instructions execute 2 or 3 operations per cycle, further reducing cycle count. Moreover,
SIMD operations perform wider register file and data cache
accesses, which are more energy efficient than narrower accesses. Therefore, these enhancements reduce all components of the instruction
energy depicted in Figure 4.
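As a rough illustration of why wider SIMD instructions and LIW bundles cut instruction and cycle counts, the sketch below counts instructions for a fixed amount of work. The SIMD widths are the ones quoted above; the workload size of 256 operations and the idealized assumption that every SIMD lane and LIW slot is always filled are ours, not measurements from this study.

```python
import math

# Operations per SIMD instruction, as quoted in the text.
SIMD_WIDTH = {"IP": 8, "IME": 16, "FME": 18}


def instruction_count(num_ops: int, ops_per_instruction: int) -> int:
    """Instructions needed to issue num_ops operations at a given SIMD width."""
    return math.ceil(num_ops / ops_per_instruction)


def cycle_count(num_instructions: int, liw_slots: int) -> int:
    """Cycles needed if liw_slots operations issue together every cycle.

    Idealized: assumes every LIW slot can always be filled.
    """
    return math.ceil(num_instructions / liw_slots)


if __name__ == "__main__":
    num_ops = 256  # hypothetical workload size, chosen only for illustration
    print("scalar instructions:", instruction_count(num_ops, 1))
    for name, width in SIMD_WIDTH.items():
        simd = instruction_count(num_ops, width)
        print(name, "SIMD instructions:", simd,
              "cycles with 2-wide LIW:", cycle_count(simd, 2))
```

Real kernels will fall short of these counts whenever data alignment or dependencies leave lanes or slots empty.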
Figure 6. Speedup at each stage of optimization for IME, FME, IP and CABAC.
[Bar chart, log-scale axis from 0.1 to 1000. Series: RISC, SSE/GPU, Magic, ASIC. Groups: IME, FME, IP, CABAC, Total.]
We further augment these enhancements with operation fusion, in which we fuse together frequently occurring
complex instruction sub-graphs for both RISC and SIMD
instructions. To prevent the number of register file ports from increasing, these instructions are restricted to at most two input
operands and one output. Operation
fusion improves energy efficiency by reducing the number
of instructions and also the number of register file
accesses, since short-lived intermediate
data is consumed internally. Additionally, fusion gives us the ability to create more
energy-efficient hardware implementations of the fused
operations, e.g., multiplication implemented using shifts
and adds. The reductions due to operation fusion are less
than 2× in energy and less than 2.5× in performance.
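To make the fusion constraints concrete, the following is a minimal sketch of a fused operation that takes two inputs, produces one output, and replaces constant multiplications with shifts and adds. The constants 20 and 5 and the function names are hypothetical choices for illustration; the text does not specify which sub-graphs were actually fused.

```python
def mul5(x: int) -> int:
    """5 * x computed with one shift and one add."""
    return (x << 2) + x


def mul20(x: int) -> int:
    """20 * x computed with two shifts and one add."""
    return (x << 4) + (x << 2)


def fused_20a_minus_5b(a: int, b: int) -> int:
    """Hypothetical fused operation: two inputs, one output.

    The intermediate products 20*a and 5*b stay inside the fused unit,
    so they would never be written to or read from the register file.
    """
    return mul20(a) - mul5(b)


if __name__ == "__main__":
    print(fused_20a_minus_5b(3, 4))
```

Issued as separate instructions, the same computation would need several register file reads and writes for the intermediate values; the fused form needs two reads and one write.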
With SIMD, LIW and op fusion support, the IME, FME and
IP processors achieve speedups of around 15×, 30× and 10×,
respectively. CABAC is not data parallel and benefits only
from LIW and op fusion, with a speedup of merely 1.1× and
almost no change in energy per operation. Overall, the application gets an energy efficiency gain of almost 10×, but still
uses more than 50× the energy of an ASIC. To reach
ASIC levels of efficiency, we need a different approach.
4.2. Algorithm specific instructions
The root cause of the large energy difference is that the
basic operations in H.264 are very simple and low energy.
They only require 8–16 bit integer operations, so the fundamental energy per operation bound is on the order of hundreds of femtojoules in a 90 nm process. All other costs in a
processor (IF, register fetch, data fetch, control, and pipeline registers) are much larger, at 140 pJ, and dominate overall power. Standard SIMD and simple fused instructions
can only go so far to improve the performance and energy
efficiency. It is hard to aggregate more than 10–20 operations into an instruction without incurring growing inefficiencies, and with tens of operations per cycle we still have
a machine where around 90% of the energy is going into
overhead functions. It is now easy to see how an ASIC can
use 2–3 orders of magnitude less energy than a processor.
For computationally limited applications with low-energy
operations, an ASIC can implement hardware which both
has low overheads and is a perfect structural match to the
application. These features allow it to exploit large amounts
of parallelism efficiently.
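The amortization argument can be checked with a back-of-the-envelope model. The sketch below takes the 140 pJ overhead figure from the text and treats it as a fixed per-instruction cost; the 0.5 pJ energy per operation is our assumption standing in for "hundreds of femtojoules", and holding the overhead constant as the instruction gets wider is a deliberate simplification.

```python
OVERHEAD_PJ = 140.0   # per-instruction overhead, figure quoted in the text
OP_ENERGY_PJ = 0.5    # assumption: one value within "hundreds of femtojoules"


def useful_fraction(ops_per_instruction: int,
                    op_energy_pj: float = OP_ENERGY_PJ,
                    overhead_pj: float = OVERHEAD_PJ) -> float:
    """Fraction of instruction energy spent on the operations themselves.

    Simplified model: the overhead is constant regardless of how many
    operations the instruction carries.
    """
    useful = ops_per_instruction * op_energy_pj
    return useful / (useful + overhead_pj)


if __name__ == "__main__":
    for n in (1, 20, 280):
        print(n, "ops per instruction:", round(100 * useful_fraction(n), 1), "% useful")
```

Under these assumptions an instruction carrying 20 operations still spends well over 90% of its energy on overhead, and it takes a few hundred operations per instruction before useful work reaches half of the total.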
To match these results in a processor we must amortize
the per-instruction energy overheads over hundreds of these
Figure 7. Processor energy breakdown for H.264. IF is instruction fetch/decode. D-$ is data cache. Pip is the pipeline registers, buses, and
clocking. Ctl is random control. RF is the register file. FU is the functional elements. Only the top bar or two (FU, RF) contribute useful work in
the processor. For this application it is hard to achieve much more than 10% of the power in the FU without adding custom hardware units.
[Stacked bar chart, 0% to 100% of processor energy, split into FU, RF, Ctl, Pip, D-$ and IF, for the RISC, SSE/GPU and Magic configurations of IME, FME, IP and CABAC.]