Figures 10(a) and (b) reveal a striking similarity in power
and energy savings between the Core (65nm/45nm) and
Nehalem (45nm/32nm) die shrinks. This data suggests that
Intel maintained the same rate of energy reduction across the
two most recent generations. As a point of comparison, the
models used by the International Technology Roadmap for
Semiconductors (ITRS) predicted a 9% increase in frequency
and a 34% reduction in power from 45nm to 32nm.
Figure 10(a) is both more and less encouraging. Clock speed increased
by 26% in the stock configurations of the i7 (45) to the i5 (32)
with an accompanying 14% increase in performance, but power
fell by only 23%, less than the 34% predicted. A deeper
understanding of die shrink efficiency on modern processors
requires measuring more processors in each technology node.
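As a back-of-the-envelope illustration of these numbers, the sketch below combines the quoted power and performance changes into a relative energy figure. It assumes energy is power multiplied by execution time and that execution time is the inverse of performance; the function and variable names are illustrative and do not come from the measurements themselves. The result (roughly 0.68, that is, about a one-third energy reduction) may differ from the energy reported in Figure 10, because averaging per-benchmark energy is not the same as dividing averaged power by averaged performance.

```python
def relative_energy(relative_power: float, relative_performance: float) -> float:
    """Return the new/old energy ratio.

    Assumes energy = power * time and time = 1 / performance,
    so the energy ratio is the power ratio divided by the performance ratio.
    """
    if relative_performance <= 0:
        raise ValueError("relative_performance must be positive")
    return relative_power / relative_performance


# Figures quoted in the text for the i7 (45) to i5 (32) die shrink.
measured_power = 1.0 - 0.23        # 23% power reduction
measured_performance = 1.0 + 0.14  # 14% performance increase

# ITRS-predicted power reduction quoted in the text (34%).
itrs_power = 1.0 - 0.34

measured_energy = relative_energy(measured_power, measured_performance)
power_shortfall = measured_power - itrs_power  # how far power missed the prediction

print(f"relative energy (new/old): {measured_energy:.3f}")
print(f"power shortfall versus ITRS prediction: {power_shortfall:.2f}")
```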
5.4. Gross microarchitecture change
This section explores the power and performance effect of
gross microarchitectural change by comparing microarchitectures while matching features such as processor clock, degree
of hardware parallelism, process technology, and cache size.
Figure 11 compares the Nehalem i7 (45) with the NetBurst
Pentium 4 (130), Bonnell Atom D (45), and Core 2D (45) microarchitectures, and it compares the Nehalem i5 (32) with the
Core 2D (65). Each comparison configures the Nehalems to
match the clock speed, number of cores, and hardware threads
of the other architecture. Both the i7 (45) and i5 (32) comparisons to the Core show that the move from Core to Nehalem
yields a small 14% performance improvement. This finding is
not inconsistent with Nehalem’s stated primary design goals,
that is, delivering scalability and memory performance.
Figure 11. Gross microarchitecture: a comparison of Nehalem with
four other microarchitectures. In each comparison, the Nehalem
is configured to match the other processor as closely as possible.
(a) Impact of microarchitecture change with respect to performance,
power, and energy, averaged over all four workloads. (b) Energy
impact of microarchitecture for each workload. The most recent
microarchitecture, Nehalem, is more energy efficient than the
others, including the low-power Bonnell (Atom).
Comparisons shown: Bonnell: i7 (45)/AtomD (45); Core: i7 (45)/C2D (45);
NetBurst: i7 (45)/Pentium4 (130); Core: i5 (32)/C2D (65).
Finding: Controlling for technology, hardware parallelism,
and clock speed, the out-of-order architectures have
energy efficiency similar to that of the in-order ones.
The comparisons between the i7 (45) and Atom D (45) and
Core 2D (45) hold process technology constant at 45 nm. All
three processors are remarkably similar in energy consumption. This outcome is all the more interesting because the i7
(45) is disadvantaged: it uses fewer hardware contexts
here than in its stock configuration. Furthermore, the i7 (45)
integrates more services on-die, such as the memory controller, that are off-die on the other processors, and thus outside
the scope of the power meters. The i7 (45) improves upon
the Core 2D (45) and Atom D (45) with a more scalable, much
higher bandwidth on-chip interconnect, which is not exercised heavily by our workloads. It is impressive that, despite
all of these factors, the i7 (45) delivers similar energy efficiency to its two 45 nm peers, particularly when compared to
the low-power in-order Atom D (45). It is unsurprising that
the i7 (45) performs 2.6× faster than the Pentium 4 (130),
while consuming one-third the power, when controlling for
clock speed and hardware parallelism (but not for factors
such as memory speed). Much of the 50% power improvement is attributable to process technology advances. This
speedup of 2.6× over 7 years is, however, substantially less than
the historical factor-of-8 improvement experienced in every
prior 7-year interval from 1970 through the early
2000s. This difference in improvements marks the beginning of the power-constrained architecture design era.
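To make the gap between the two 7-year factors concrete, the sketch below converts each into an equivalent compound annual rate. It assumes smooth compound growth across the interval, which is a simplification introduced here rather than a claim of the measurements, and the function name is illustrative. Under that assumption, a 2.6× speedup over 7 years corresponds to roughly 15% per year, against roughly 35% per year for the historical factor of 8.

```python
def annualized_factor(total_factor: float, years: float) -> float:
    """Return the per-year growth factor that compounds to total_factor over years.

    Assumes constant compound growth, i.e. annual ** years == total_factor.
    """
    if total_factor <= 0 or years <= 0:
        raise ValueError("total_factor and years must be positive")
    return total_factor ** (1.0 / years)


YEARS = 7
measured_annual = annualized_factor(2.6, YEARS)    # i7 (45) versus Pentium 4 (130)
historical_annual = annualized_factor(8.0, YEARS)  # historical factor quoted in the text

print(f"measured:   {(measured_annual - 1) * 100:.1f}% per year")
print(f"historical: {(historical_annual - 1) * 100:.1f}% per year")
```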
6. Related Work
The processor design literature is full of performance measurement and analysis. Despite power’s growing importance, power measurements are still relatively rare.7, 10, 12 Here,
we summarize related power measurement and simulation
work. Our original paper contains a fuller treatment.
Power measurement. Isci and Martonosi combine a clamp
ammeter with performance counters for per-unit power estimation of the Intel Pentium 4 on SPEC CPU2000.10 Fan et al.
estimate whole system power for large-scale data centers.
They find that even the most power-consuming workloads
draw less than 60% of peak possible power consumption.
We measure chip power and support their results by showing that TDP does not predict measured chip power. Our
work is the first to compare microarchitectures, technology
generations, individual benchmarks, and workloads in the
context of power and performance.
Power modeling. Power modeling is necessary to thoroughly explore architecture design.1, 13, 14 Measurement
complements simulation by providing validation. For example, some prior simulators used TDP, but our measurements show that it is not accurate. As we look to the future,
we believe that programmers will need to tune their applications for power and energy, not only performance. Just
as hardware performance counters provide insight into