area by 14× and energy efficiency by 10×. Despite these customizations, the resulting solution is still 50× less energy
efficient than an ASIC. An examination of the energy breakdown in the paper clearly demonstrates why. Basic arithmetic
operations are typically 8–16 bits wide, and even when performing more than 10 such operations per cycle, arithmetic
unit energy comprises less than 10% of the total. One must
consider the energy cost of the desired operation compared
with the energy cost of one processor cycle: for highly efficient
machines, these energies should be similar.
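As a rough illustration of this comparison, the Python sketch below computes the share of one processor cycle's energy that goes to the arithmetic itself. The function name and all numeric values are hypothetical placeholders chosen only to show the calculation; they are not measurements from this paper.

```python
def arithmetic_fraction(energy_per_op_pj, ops_per_cycle, energy_per_cycle_pj):
    """Fraction of one processor cycle's energy spent on useful arithmetic."""
    if energy_per_cycle_pj <= 0:
        raise ValueError("energy_per_cycle_pj must be positive")
    return (energy_per_op_pj * ops_per_cycle) / energy_per_cycle_pj


# Hypothetical values (assumptions, not data from the paper):
# a narrow arithmetic op costing 0.5 pJ, 12 such ops per cycle,
# and a whole processor cycle costing 80 pJ.
fraction = arithmetic_fraction(0.5, 12, 80.0)
overhead = 1.0 - fraction
print(f"arithmetic: {fraction:.1%}, overhead: {overhead:.1%}")
```

With these assumed numbers the arithmetic accounts for well under 10% of the cycle energy, even at 12 operations per cycle; a machine in which the two energies were similar would show a fraction close to 1.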
The next section provides the background needed to
understand the rest of the paper. Section 3 then presents
our experimental methodology, describing our baseline,
generic H.264 implementation on a Tensilica CMP. The performance and efficiency gains are described in Section 4,
which also explores the causes of the overheads and different methods for addressing them. Using the insight gained
from our results, Section 5 discusses the broader implications for efficient computing and supporting application-driven
design.
2. Background
We first review the basic ways one can analyze power, and
some previous work in creating energy-efficient processors.
With this background, we then provide an overview of H.264
encoding and its main compute stages. The section ends
by comparing existing hardware and software implementations of an H.264 encoder.
2.1. Power-constrained design and energy efficiency
Power is defined as energy per second, which can be factored into two terms: energy/op × ops/second. Thus there
are two primary means by which a designer can reduce
power consumption: reduce the number of operations
per second or reduce the energy per operation. The first
approach—reducing the operations per second—simply
reduces performance to save power. This approach is analogous to slowing down a factory’s assembly line to save electricity costs; although power consumption is reduced, the
factory output is also reduced and the energy used (i.e., the
electricity bill) per unit of output remains unchanged. If, on
the other hand, a designer wishes to maintain or improve
the performance under a fixed power budget, a reduction
in the fundamental energy per operation is required. It is
this reduction in energy per operation—not power—that
represents real gains in efficiency.
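The two options can be sketched numerically. The following Python fragment is a minimal illustration under assumed numbers (a 1 nJ operation and a rate of 2 billion operations per second, both hypothetical); the helper names are ours and carry no special meaning.

```python
def power_watts(energy_per_op_j, ops_per_second):
    """Power = energy/op x ops/second."""
    return energy_per_op_j * ops_per_second


def max_throughput(power_budget_w, energy_per_op_j):
    """Highest ops/second sustainable under a fixed power budget."""
    return power_budget_w / energy_per_op_j


# Hypothetical baseline (assumed numbers, for illustration only).
baseline_energy = 1e-9  # joules per operation
baseline_rate = 2e9     # operations per second
baseline_power = power_watts(baseline_energy, baseline_rate)

# Option 1: halve the operation rate. Power halves, but the energy
# per operation (and so the energy per unit of output) is unchanged.
throttled_power = power_watts(baseline_energy, baseline_rate / 2)

# Option 2: halve the energy per operation. Under the original power
# budget the design can now sustain twice the original throughput.
improved_rate = max_throughput(baseline_power, baseline_energy / 2)

print(f"baseline power:  {baseline_power:.2f} W")
print(f"throttled power: {throttled_power:.2f} W at half the throughput")
print(f"improved rate:   {improved_rate / baseline_rate:.1f}x baseline at the same power")
```

Only the second option changes the energy spent per unit of work; the first merely trades performance for power.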
This distinction between power and energy is an important one. Even though designers typically face physical
power constraints, increasing efficiency requires that the
fundamental energy of operations be reduced. Although one
might be tempted to report power numbers when discussing
power efficiency, this can be misleading if the performance
is not also reported. What may seem like a power efficiency
gain may just be a modulation in performance. Using energy
per operation, however, is a performance-invariant metric that represents the fundamental efficiency of the work
being done. Thus, even though the designer may be facing a
power constraint, it is energy per operation that the designer
needs to focus on improving.
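To see how reporting power alone can mislead, consider the small Python sketch below. The two designs and their numbers are invented for illustration and do not correspond to any system discussed in this paper.

```python
from dataclasses import dataclass


@dataclass
class Design:
    name: str
    power_w: float         # reported power, in watts
    ops_per_second: float  # reported performance

    @property
    def energy_per_op_j(self) -> float:
        """Energy per operation = power / (ops/second)."""
        return self.power_w / self.ops_per_second


# Hypothetical designs (assumed numbers, not measured data).
a = Design("A", power_w=1.0, ops_per_second=1.0e9)
b = Design("B", power_w=0.5, ops_per_second=0.25e9)

lowest_power = min([a, b], key=lambda d: d.power_w)
most_efficient = min([a, b], key=lambda d: d.energy_per_op_j)

print(f"lowest power:   {lowest_power.name}")
print(f"most efficient: {most_efficient.name}")
```

Design B draws half the power of design A, yet it spends twice the energy on each operation, because its performance dropped by a factor of four. Ranking by energy per operation removes the dependence on the chosen performance point.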
Reducing the energy required for the basic operation
can be achieved through a number of techniques, all of
which fundamentally reduce the overhead associated with
the work being done. As one simple example, clock gating
improves energy efficiency by eliminating spurious activity
in a chip that otherwise causes energy waste [8]. As another
example, customized hardware can increase efficiency by
eliminating overheads. The next section further discusses
the use of customization.
2.2. Related work in efficient computing
Processors are often customized to improve their efficiency
for specific application domains. For example, SIMD architectures achieve higher performance for multimedia and
other data-parallel applications, while DSP processors are
tailored for signal-processing tasks. More recently, ELM [1]
and AnySP [24] have been optimized for embedded and mobile
signal processing applications, respectively, by reducing
processor overheads. While these strategies target a broad
spectrum of applications, special instructions are sometimes added to speed up specific applications. For example,
Intel’s SSE4 [10] includes instructions to accelerate matrix
transpose and sum-of-absolute-differences.
Customizable processors allow designers to take the
next step, and create instructions tailored to applications.
Extensible processors such as Tensilica’s Xtensa provide
a base design that the designer can extend with custom
instructions and datapath units [15]. Tensilica provides an
automated ISA extension tool [20], which achieves speedups
of 1.2× to 3× for EEMBC benchmarks and signal processing algorithms [21]. Other tools have similarly demonstrated
significant gains from automated ISA extension [4, 5]. While
automatic ISA extensions can be very effective, manually
creating ISA extensions gives even larger gains: Tensilica
reports speedups of 40× to 300× for kernels such as FFT,
AES, and DES encryption [18, 19, 22].
Recently, researchers have proposed another approach
for achieving energy efficiency: reducing the cost of creating customized hardware rather than customizing a processor. Examples of this approach include using higher levels of
abstraction (e.g., C-to-RTL [13]) and even full chip generators
using extensible processors [16]. Independent of whether one
customizes a processor, or creates customized hardware, it
is important to understand in quantitative terms the types
and magnitudes of energy overheads in processors.
While previous studies have demonstrated significant
improvements in performance and efficiency moving from
general-purpose processors to ASICs, we explore the reasons
for these gains, which is essential to determine the nature
and degree of customization necessary for future systems.
Our approach starts with a generic CMP system. We incrementally customize its memory system and processors to
determine the magnitude and sources of overhead eliminated in each step toward achieving a high-efficiency 720p
HD H.264 encoder. We explore the basic computation in
H.264 next.
2.3. H.264 computational motifs