cessor-performance scaling faces new
challenges (see Table 1) precluding
use of energy-inefficient microarchitecture innovations developed over the
past two decades. Further, chip architects must face these challenges with
an ongoing industry expectation of a
30x performance increase in the next
decade and 1,000x increase by 2030
(see Table 2).
As the transistor scales, supply
voltage scales down, and the threshold voltage of the transistor (the
voltage at which the transistor starts conducting) also
scales down. But the transistor is not
a perfect switch: it leaks a small
amount of current when turned off,
and this leakage increases exponentially as the threshold voltage is reduced. In addition, the exponentially increasing
transistor-integration capacity exacerbates the effect; as a result, a substantial portion of power consumption is
due to leakage. To keep leakage under
control, the threshold voltage cannot
be lowered further and, indeed, must
increase, reducing transistor performance.10
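The exponential dependence of leakage on threshold voltage can be illustrated with a minimal sketch. It assumes the common textbook subthreshold model, I_off = I0 * 10^(-Vt/S), which the article does not state; the swing `SWING_V_PER_DECADE`, the prefactor `I0`, and both function names are hypothetical and chosen only for illustration.

```python
# Sketch: how off-state (subthreshold) leakage grows as threshold voltage drops.
# ASSUMPTION: textbook model I_off = I0 * 10 ** (-Vt / S); I0 and S below are
# hypothetical values chosen for illustration, not figures from the article.

SWING_V_PER_DECADE = 0.1  # assumed subthreshold swing: 100 mV per 10x current
I0 = 1.0                  # assumed leakage at Vt = 0, in arbitrary units


def off_current(vt_volts, swing=SWING_V_PER_DECADE, i0=I0):
    """Off-state leakage (arbitrary units) for a given threshold voltage."""
    return i0 * 10 ** (-vt_volts / swing)


def leakage_ratio(vt_old, vt_new, swing=SWING_V_PER_DECADE):
    """Factor by which leakage changes when Vt moves from vt_old to vt_new."""
    return off_current(vt_new, swing) / off_current(vt_old, swing)


if __name__ == "__main__":
    for vt in (0.5, 0.4, 0.3, 0.2):
        print(f"Vt = {vt:.1f} V -> relative leakage {off_current(vt):.1e}")
    ratio = leakage_ratio(0.4, 0.3)
    print(f"Lowering Vt from 0.4 V to 0.3 V multiplies leakage by {ratio:.1f}")
```

Under these assumed parameters, each 100 mV reduction in threshold voltage multiplies off-state leakage by roughly ten, which is why the threshold voltage cannot keep scaling down with the supply voltage.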
As transistors have reached atomic
dimensions, lithography and variability pose further scaling challenges, affecting supply-voltage scaling.11 With
limited supply-voltage scaling, energy
and power reduction is limited, adversely affecting further integration
of transistors. Therefore, transistor-integration capacity will continue with
scaling, though with limited performance and power benefit. The challenge for chip architects is to use this
integration capacity to continue to improve performance.
Package power/total energy consumption limits number of logic transistors. If chip architects simply add
more cores as transistor-integration
capacity becomes available and operate the chips at the highest frequency the transistors and designs can
achieve, then the power consumption
of the chips would be prohibitive (see
Figure 7). Chip architects must limit
frequency and number of cores to keep
power within reasonable bounds, but
doing so severely limits improvement
in microprocessor performance.
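The trade-off between core count and frequency under a fixed power budget can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard dynamic-power relation P = a * C * V^2 * f per core, ignores leakage, and every numeric value (activity factor, switched capacitance, supply voltage) is hypothetical rather than taken from the article. Only the 65-watt budget echoes the desktop power envelope mentioned in the text.

```python
# Sketch: how many cores fit in a power budget at a given clock frequency.
# ASSUMPTIONS: per-core dynamic power P = activity * C * Vdd**2 * f, leakage
# ignored; all parameter values below are hypothetical, for illustration only.


def core_power_watts(activity, capacitance_f, vdd_volts, freq_hz):
    """Dynamic power of one core in watts."""
    return activity * capacitance_f * vdd_volts ** 2 * freq_hz


def max_cores_within_budget(budget_w, activity, capacitance_f, vdd_volts, freq_hz):
    """Largest whole number of cores whose combined power stays within budget."""
    per_core = core_power_watts(activity, capacitance_f, vdd_volts, freq_hz)
    return int(budget_w // per_core)


if __name__ == "__main__":
    budget_w = 65.0    # desktop power envelope cited in the article
    activity = 0.1     # assumed average switching activity
    cap_f = 20e-9      # assumed switched capacitance per core (20 nF)
    vdd = 1.0          # assumed supply voltage
    for freq_ghz in (3.0, 2.0):
        n = max_cores_within_budget(budget_w, activity, cap_f, vdd, freq_ghz * 1e9)
        print(f"{freq_ghz:.1f} GHz: up to {n} cores within {budget_w:.0f} W")
```

In this toy model, lowering the frequency leaves room for more cores in the same budget, but neither knob can be raised freely, which is the constraint the paragraph above describes.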

Death of 90/10 Optimization, Rise of 10×10 Optimization
Traditional wisdom suggests investing maximum transistors in the 90% case, with
the goal of using precious transistors to increase single-thread performance that can
be applied broadly. In the new scaling regime typified by slow transistor performance
and energy improvement, it often makes no sense to add transistors to a single core
as energy efficiency suffers. Using additional transistors to build more cores produces
a limited benefit: increased performance for applications with thread parallelism.
In this world, 90/10 optimization no longer applies. Instead, optimizing with an
accelerator for a 10% case, then another for a different 10% case, then another 10%
case can often produce a system with better overall energy efficiency and performance.
We call this “10×10 optimization,”14 as the goal is to attack performance as a set of
10% optimization opportunities. It is a different way of thinking about transistor cost:
operating the chip with 10% of the transistors active and 90% inactive, but with a
different 10% active at each point in time.
Historically, transistors on a chip were expensive due to the associated design
effort, validation and testing, and ultimately manufacturing cost. But 20 generations
of Moore’s Law and advances in design and validation have shifted the balance.
Building systems where the 10% of the transistors that can operate within the energy
budget are configured optimally (an accelerator well-suited to the application) may
well be the right solution. The choice of 10 cases is illustrative, and a 5×5, 7×7, 10×10,
or 12×12 architecture might be appropriate for a particular design.

Consider the transistor-integration
capacity affordable in a given power
envelope for a reasonable die size. For
regular desktop applications the power
envelope is around 65 watts, and
the die size is around 100mm². Figure
8 outlines a simple analysis for the 45nm
process technology node; the x-axis is
the number of logic transistors integrated
on the die, and the two y-axes
are the amount of cache that would fit
and the power the die would consume.
As the number of logic transistors on
the die increases (x-axis), the size of the
cache decreases, and power dissipation
increases. This analysis assumes
average activity factor for logic and
cache observed in today’s microprocessors.
If the die integrates no logic at
all, then the entire die could be populated
with about 16MB of cache and
consume less than 10 watts of power,
since caches consume less power than
logic (Case A). On the other hand, if it
integrates no cache at all, then it could
integrate 75 million transistors for logic,
consuming almost 90 watts of power
(Case B). For 65 watts, the die could
integrate 50 million transistors for
logic and about 6MB of cache (Case C).
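The three cases can be approximated with a short sketch that interpolates linearly between Case A and Case B. The linear model is an assumption made here for illustration and is not stated in the article. The endpoint values come from the text and the figure annotation (8 W for Case A, and "almost 90 watts" rounded to 90 for Case B), and the function names are hypothetical.

```python
# Sketch: linear interpolation between the two endpoints of the Figure 8 analysis.
# ASSUMPTION: cache size and power vary linearly with logic transistor count;
# the article gives only the endpoints and Case C, not the shape of the curves.

MAX_LOGIC_M = 75.0    # Case B: 75 million logic transistors, no cache
MAX_CACHE_MB = 16.0   # Case A: about 16MB of cache, no logic
POWER_A_W = 8.0       # Case A: 8 W per the figure annotation
POWER_B_W = 90.0      # Case B: "almost 90 watts", rounded here


def cache_and_power(logic_millions):
    """Return (cache in MB, power in watts) for a given logic transistor count."""
    frac = logic_millions / MAX_LOGIC_M
    cache_mb = MAX_CACHE_MB * (1.0 - frac)
    power_w = POWER_A_W + (POWER_B_W - POWER_A_W) * frac
    return cache_mb, power_w


def logic_for_power(budget_w):
    """Logic transistors (millions) that exactly consume the given power budget."""
    frac = (budget_w - POWER_A_W) / (POWER_B_W - POWER_A_W)
    return frac * MAX_LOGIC_M


if __name__ == "__main__":
    cache_mb, power_w = cache_and_power(50.0)
    print(f"50M logic transistors: {cache_mb:.1f} MB cache, {power_w:.1f} W")
    print(f"65 W budget: {logic_for_power(65.0):.1f}M logic transistors")
```

The linear approximation lands near, but not exactly on, the article's Case C (50 million logic transistors, about 6MB of cache, 65 watts), so the published curves are presumably not perfectly linear.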
Figure 8. Transistor integration capacity at a fixed power envelope (2008, 45nm, 100mm²). X-axis: logic transistors (millions); left y-axis: total power (watts); right y-axis: cache (MB). Curves: power dissipation and cache size. Annotated points: Case A (0 logic, 16MB of cache, 8 W); Case B; Case C (50M transistors of logic, 6MB cache).