Understanding Sources of Inefficiency in General-Purpose Chips
Abstract
Scaling the performance of a power-limited processor
requires decreasing the energy expended per instruction
executed, since energy/op * op/second is power. To better
understand what improvement in processor efficiency is
possible, and what must be done to capture it, we quantify
the sources of the performance and energy overheads of a
720p HD H.264 encoder running on a general-purpose four-processor chip multiprocessor (CMP) system. The initial overheads are large: the
CMP was 500× less energy efficient than an Application
Specific Integrated Circuit (ASIC) doing the same job. We
explore methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding.
Broadly applicable optimizations like single instruction,
multiple data (SIMD) units improve CMP performance by
14× and energy by 10×, which is still 50× worse than an ASIC.
The problem is that the basic operation costs in H.264 are
so small that even with a SIMD unit doing over 10 ops per
cycle, 90% of the energy is still overhead. Achieving ASIC-like performance and efficiency requires algorithm-specific
optimizations. For each subalgorithm of H.264, we create a
large, specialized functional/storage unit capable of executing hundreds of operations per instruction. This improves
energy efficiency by 160× (instead of 10×), and the final customized CMP reaches the same performance and within 3×
of an ASIC solution’s energy in comparable area.
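
The ratios quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are invented here, and the amortization model (a fixed per-instruction overhead shared by all the operations that instruction performs, with each useful op costing one energy unit) is a simplifying assumption, not the paper's energy model.

```python
# Figures quoted in the abstract, all relative to an ASIC doing the same job.
BASELINE_GAP = 500.0  # baseline CMP: 500x less energy efficient than the ASIC
SIMD_GAIN = 10.0      # energy gain from broadly applicable optimizations
CUSTOM_GAIN = 160.0   # energy gain from algorithm-specific units


def remaining_gap(baseline_gap: float, gain: float) -> float:
    """Energy gap to the ASIC left after an efficiency gain of `gain`."""
    return baseline_gap / gain


def overhead_fraction(ops_per_instr: float, overhead_per_instr: float) -> float:
    """Share of an instruction's energy that is overhead.

    Hypothetical model: each useful op costs 1 energy unit and every
    instruction pays a fixed overhead (fetch, decode, register file, ...).
    """
    return overhead_per_instr / (overhead_per_instr + ops_per_instr)


if __name__ == "__main__":
    print(remaining_gap(BASELINE_GAP, SIMD_GAIN))    # gap after SIMD-style optimizations
    print(remaining_gap(BASELINE_GAP, CUSTOM_GAIN))  # gap after custom units

    # 10 ops per instruction with 90% overhead implies 90 units of overhead.
    overhead = 90.0
    for ops in (10, 100, 500):  # op counts beyond 10 are illustrative
        print(ops, round(overhead_fraction(ops, overhead), 2))
```

Under these assumptions the gap shrinks from 500× to 50× and then to roughly 3×, and the overhead share drops substantially only once a single instruction performs hundreds of operations. A real wide unit adds overhead of its own, so the toy model shows the trend rather than predicting measured values.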
1. Introduction
Most computing systems today are power limited, whether
it is the 1 W limit of a cell phone system on a chip (SoC), or
the 100 W limit of a processor in a server. Since power is
ops/second × energy/op, we need to decrease the energy cost of
each op if we want to continue to scale performance at constant power. Traditionally, chip designers were able to make
increasingly complex designs both by increasing the system
power, and by leveraging the energy gains from technology
scaling. Historically each factor of 2 in scaling made each
gate evaluation take 8× less energy.⁷ However, technology
scaling no longer provides the energy savings it once did,⁹ so
designers must turn to other techniques to scale energy cost.
Most designs use processor-based solutions because of their
flexibility and low design costs; however, these are usually not
the most energy-efficient solutions. A shift to multi-core systems has helped improve the efficiency of processor systems,
but that approach is also expected to hit a limit soon.⁸
On the other hand, using hardware that has been customized
for a specific application (an Application Specific Integrated
Circuit or ASIC) can be three orders of magnitude better than
a processor in both energy/op and ops/area.⁶ This paper compares ASIC solutions to processor-based solutions, to try to
understand the sources of inefficiency in general-purpose
processors. We hope this information will prove to be useful both for building more energy-efficient processors and
understanding why and where customization must be used
for efficiency.
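
The relation between power, energy per op, and throughput can be made concrete with a small calculation. This is a minimal sketch: the function names and the 1 nJ/op figure are hypothetical, and the cubic energy scaling assumes classical constant-field scaling (capacitance and supply voltage both shrink with feature size, and switching energy goes as C·V²), which the text does not spell out.

```python
def ops_per_second(power_watts: float, energy_per_op_joules: float) -> float:
    """Throughput sustainable at a fixed power budget: power / (energy/op)."""
    return power_watts / energy_per_op_joules


def gate_energy_ratio(scale_factor: float) -> float:
    """Energy per gate evaluation after scaling, relative to before.

    Assumes constant-field scaling: C and V each shrink by `scale_factor`,
    so E = C * V**2 shrinks by scale_factor**3.
    """
    return 1.0 / scale_factor ** 3


if __name__ == "__main__":
    budget_w = 1.0        # cell phone SoC limit mentioned in the text
    energy_per_op = 1e-9  # hypothetical 1 nJ per op
    print(ops_per_second(budget_w, energy_per_op))

    # Halving energy/op doubles throughput at the same power.
    print(ops_per_second(budget_w, energy_per_op / 2))

    # A factor-of-2 scaling step gives 1/8 of the energy per gate evaluation.
    print(gate_energy_ratio(2.0))
```

At a fixed power budget, throughput is inversely proportional to energy per op, which is why performance can only keep scaling if energy per op keeps falling.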
To build this understanding, we start with a single video
compression application, 720p HD H.264 video encode,
and transform the hardware it runs on from a generic multiprocessor to a custom multiprocessor with ASIC-like specialized hardware units. On this task, a general-purpose
software solution takes 500× more energy per frame and
500× more area than an ASIC to reach the same performance. We choose H.264 because it demonstrates the large
energy advantage of ASIC solutions (500×) and because
there exist commercial ASICs that can serve as a benchmark.
Moreover, H.264 contains a variety of computational motifs,
from highly data-parallel algorithms (motion estimation) to
control-intensive ones (Context Adaptive Binary Arithmetic
Coding [CABAC]).
To better understand the potential of producing general-purpose chips with better efficiency, we consider two broad
strategies for customized hardware. The first extends the
current trend of creating general data-parallel engines on
our processors. This approach mimics the addition of SSE
instructions, or the recent work in merging graphics processors on die to help with other applications. We claim these
are similar to general functional units since they typically
have some special instructions for important applications,
but are still generally useful. The second approach creates
application-specific data storage fused with functional
units. In the limit this should be an ASIC-like solution. The
first has the advantage of being a programmable solution,
while the second provides potentially greater efficiency.
The results are striking. Starting from a 500× energy penalty,
adding relatively wide SSE-like parallel execution engines
and rewriting the code to use them improves performance/
A previous version of this paper was published in
Proceedings of the 37th Annual International Symposium on
Computer Architecture (2010), ACM, NY.