the application as well as hardware, we feel that adding this
hardware in an extensible processor framework has many
advantages over just designing an ASIC. These advantages
come from the constrained processor design environment
and the software, compiler, and debugging tools available in
this environment. Many of the low-level issues, like interface
design and pipelining, are automatically handled. In addition, since all hardware is wrapped in a general-purpose processor, the application developer retains enough flexibility
in the processor to make future algorithmic modifications.
5. Energy-Efficient Computers
It is important to remember that the “overhead” of using a
processor depends on the energy required for the desired
operation. Floating point (FP) energy costs are about 10×
the small integer ops we have explored in this paper, so
machines with 10-wide FP units will not be far from the
maximum efficiency possible for that class of applications. Similarly, customizing the hardware will not have
a large impact on the energy efficiency of an application
dominated by memory costs; an ASIC and a processor’s
energy will not be that different. For these applications,
optimization that restructures the algorithm and/or the
memory system is needed to reduce energy, and can yield
large savings.²³
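The amortization argument above can be illustrated with a small calculation. This is a minimal sketch, not the paper's energy model: it assumes a fixed per-instruction overhead that does not grow with the number of operations per instruction, and the `OVERHEAD` value is a hypothetical number chosen only for illustration. The function name `useful_fraction` is likewise introduced here. Only the roughly 10× ratio between FP and small integer operations comes from the text.

```python
def useful_fraction(op_energy: float, width: int, overhead: float) -> float:
    """Fraction of one instruction's energy spent in the functional units.

    op_energy: energy of one arithmetic operation (arbitrary units)
    width:     operations performed per instruction (e.g., SIMD lanes)
    overhead:  per-instruction energy for fetch, decode, register access, etc.
               (assumed independent of width, which is a simplification)
    """
    useful = op_energy * width
    return useful / (useful + overhead)


INT_OP = 1.0            # small integer op, used as the unit of energy
FP_OP = 10.0 * INT_OP   # "about 10x" a small integer op, per the text
OVERHEAD = 20.0         # HYPOTHETICAL overhead, for illustration only

scalar_int = useful_fraction(INT_OP, 1, OVERHEAD)   # 1 / 21
wide_int = useful_fraction(INT_OP, 10, OVERHEAD)    # 10 / 30
wide_fp = useful_fraction(FP_OP, 10, OVERHEAD)      # 100 / 120

for name, value in [("scalar int", scalar_int),
                    ("10-wide int", wide_int),
                    ("10-wide FP", wide_fp)]:
    print(f"{name:12s} useful energy fraction = {value:.3f}")
```

Under these assumed numbers, the 10-wide FP case spends most of its energy on useful work, while the same width applied to small integer operations still leaves overhead dominant. The qualitative ordering, not the specific values, is the point.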
Unfortunately, as we drive to more energy-efficient solutions, we will find ways to transform FP code to fixed point
operations, and restructure our algorithms to minimize the
memory fetch costs. Said differently, if we want ASIC-like
energy efficiencies (100× to 1000× more energy efficient
than general-purpose CPUs), we will have to transform
our algorithms to be dominated by the simple, low-energy
operations we have been studying in this paper. Since the
energy of these operations is very low, any overhead, from
the register fetch to the pipeline registers in a processor, is
likely to dominate energy costs. The good news is that this
large overhead per instruction makes estimating the energy
savings easy—you simply look at the performance gains—
but the bad news is that adding state-of-the-art data-parallel
hardware like wide SIMD units and media extensions will
still leave you far from the desired efficiency.
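The observation that energy savings track performance gains can be made explicit with a simple model. This is a sketch under stated assumptions, not a derivation taken from the paper: suppose a program executes $N$ instructions, each paying a fixed overhead $E_{\text{ovh}}$ plus an average functional-unit energy $E_{\text{op}}$. These symbols are introduced here for illustration.

```latex
\[
  E_{\text{total}} = N \,\bigl(E_{\text{ovh}} + E_{\text{op}}\bigr)
\]
\[
  \frac{E_{\text{before}}}{E_{\text{after}}}
  = \frac{N_{\text{before}}\,\bigl(E_{\text{ovh}} + E_{\text{op,before}}\bigr)}
         {N_{\text{after}}\,\bigl(E_{\text{ovh}} + E_{\text{op,after}}\bigr)}
  \approx \frac{N_{\text{before}}}{N_{\text{after}}}
  \qquad \text{when } E_{\text{ovh}} \gg E_{\text{op}}
\]
```

If the time per instruction is roughly constant, $N_{\text{before}}/N_{\text{after}}$ is also the speedup, so the energy ratio and the performance ratio coincide for as long as overhead continues to dominate.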
It is encouraging that we were able to achieve ASIC
energy levels in a customized processor by creating customized hardware that easily fit inside a processor framework.
Extending a processor instead of building an ASIC seems
like the correct approach, since it provides a number of
software development advantages and the energy cost of
this option seems small. However, building such custom
datapaths still requires a significant effort and thus the key
challenge now is to build a design system that lets application designers create and exploit such customizations with
much greater ease. The key is to find a parameterization of
the space which makes sense to application designers in a
specific application domain.
For example, often a number of algorithms in a domain
share similar data flow and computation structures. In
H.264 a common computational motif is based on a convolution-like data flow: apply a function to all the data, then
perform a reduction, then shift the data and add a small
amount of new data, and repeat. A similar pattern of convolution-like computations also exists in a number of other
image processing and media processing algorithms. While
the exact computation is going to be different for each particular algorithm, we believe that by exploiting the common data-flow structure of these algorithms we can create
a generalized convolution abstraction which application
designers can customize. If this abstraction is useful for
application designers, one can imagine implementing it by
creating a flexible hardware unit that is significantly more
efficient than a generic SIMD/SSE unit. We also believe that
similar patterns exist in other domains that may allow us to
create a set of customized units for each domain.
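The convolution-like motif described above (map, reduce, shift in a little new data, repeat) can be sketched in software as follows. This is an illustrative model of the data flow only, not the hardware unit the text envisions; the function name `conv_like` and its parameters are assumptions introduced here. The application designer's customization corresponds to the choice of `map_fn` and `reduce_fn`.

```python
from collections import deque
from functools import reduce
from operator import add, mul


def conv_like(stream, coeffs, map_fn, reduce_fn):
    """Apply a generalized convolution over a stream of samples.

    For each window position: apply map_fn to every (sample, coefficient)
    pair, combine the results with reduce_fn, then shift by one sample.
    """
    window = deque(maxlen=len(coeffs))  # acts as a shift register
    results = []
    for sample in stream:
        window.append(sample)           # shift in new data; oldest drops out
        if len(window) == len(coeffs):  # produce output once the window is full
            mapped = [map_fn(x, c) for x, c in zip(window, coeffs)]
            results.append(reduce(reduce_fn, mapped))
    return results


# Multiply-and-sum gives an ordinary FIR filter.
fir = conv_like([1, 2, 3, 4], [1, 1], mul, add)

# Absolute difference and sum gives a sum-of-absolute-differences (SAD)
# search against a reference block, as used in motion estimation.
sad = conv_like([1, 2, 3, 4], [2, 2], lambda x, r: abs(x - r), add)

print(fir)  # [3, 5, 7]
print(sad)  # [1, 1, 3]
```

The two calls share the same data movement and differ only in the per-element function, which is the commonality a generalized convolution abstraction would exploit.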
Even if we could come up with such a set of customized functional units, it is likely that some degree of per-algorithm
configurability will be required. For example,
in a convolution engine, the convolution size and resulting datapath size could vary from algorithm to algorithm
and thus potentially need to be tuned on a per-processor basis. This leads to the idea of creating a two-step
design process. The first step is when a set of chip experts
design a processor generator platform. This is a meta-level
design which “knows” about the special functional units
and control optimization, and provides the application
designer an application-tailored interface. The application designers can then co-optimize their code and the
interface parameters to meet their requirements. After
this co-optimization, an optimized implementation based
on these parameters is automatically generated. In fact,
such a platform will also help in building the more generic
domain customized functional units mentioned earlier by
facilitating the process of rapidly creating and evaluating
new designs.
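The second step of this process, in which the application designer tunes the exposed parameters, might look like the sketch below. Everything in it is hypothetical: the `EngineParams` fields, the `choose_params` helper, and the use of datapath width as a stand-in for cost are assumptions made for illustration and do not describe an existing generator or its interface.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class EngineParams:
    """Hypothetical knobs a processor generator might expose."""
    window_size: int  # number of taps in the convolution window
    data_bits: int    # bit width of each data element

    @property
    def datapath_bits(self) -> int:
        # Datapath size grows with the convolution size, as the text notes.
        return self.window_size * self.data_bits


def choose_params(
    candidates: Iterable[EngineParams],
    meets_requirements: Callable[[EngineParams], bool],
    cost: Callable[[EngineParams], float],
) -> Optional[EngineParams]:
    """Return the cheapest candidate that satisfies the application, if any."""
    feasible = [p for p in candidates if meets_requirements(p)]
    return min(feasible, key=cost, default=None)


# Example: an algorithm needing at least a 6-tap window on 8-bit data.
candidates = [EngineParams(w, b) for w, b in product((4, 6, 8, 16), (8, 16))]
best = choose_params(
    candidates,
    meets_requirements=lambda p: p.window_size >= 6 and p.data_bits >= 8,
    cost=lambda p: p.datapath_bits,  # ASSUMED proxy for area and energy
)
print(best)
```

In a real flow the cost function would come from the generated implementation rather than from a proxy, and the requirements check would involve running the co-optimized application code.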
A reconfigurable processor generator alone is not a sufficient solution, since one still needs to take one or more
of these processors and create a working chip system.
Designing and validating a chip is an extremely hard and
expensive task. If application customization will be needed
for efficiency—and our data indicates it will be—we need to
start creating systems that will efficiently allow savvy application experts to create these optimized chip level solutions.
This will require extending the ideas for extensible processors to full chip generation systems. We are currently working on creating this type of system.¹⁶
Acknowledgments
This work would not have been possible without great
support and cooperation from many people at Tensilica
including Chris Rowen, Dror Maydan, Bill Huffman, Nenad
Nedeljkovic, David Heine, Govind Kamat, and others. The
authors acknowledge the support of the C2S2 Focus Center,
one of six research centers funded under the Focus Center
Research Program (FCRP), a Semiconductor Research
Corporation subsidiary, and earlier support from DARPA.
This material is based upon work partially supported
under a Sequoia Capital Stanford Graduate Fellowship.
The National Science Foundation under Grant #0937060
to the Computing Research Association also supports this
material for the CIFellows Project. Any opinions, findings,