and their average to produce an energy/performance scatter plot (not shown here). We next pick off the frontier—the
points that are not dominated in performance or energy efficiency by any other point—and fit them with a polynomial
curve. Figure 5 plots these polynomial curves for each workload and the average. The rightmost curve delivers the best
performance for the least energy.
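As a minimal sketch of the frontier selection described above: a point is kept if no other point is at least as good in both performance and energy and strictly better in one. The `Point` type, the function names, and the sample values are hypothetical and chosen only for illustration; they are not the study's data or tooling.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str           # hypothetical configuration label
    performance: float  # normalized performance, higher is better
    energy: float       # normalized energy, lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.performance >= b.performance and a.energy <= b.energy
    strictly_better = a.performance > b.performance or a.energy < b.energy
    return no_worse and strictly_better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the non-dominated points, sorted by increasing performance."""
    frontier = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(frontier, key=lambda p: p.performance)


# Made-up measurements for illustration only; not data from the study.
samples = [
    Point("A", 1.0, 0.30),
    Point("B", 2.0, 0.35),
    Point("C", 2.0, 0.50),  # dominated by B: same performance, more energy
    Point("D", 3.5, 0.60),
    Point("E", 3.0, 0.80),  # dominated by D: slower and more energy
]

frontier = pareto_frontier(samples)
print([p.name for p in frontier])
```

The surviving points could then be passed to a curve-fitting routine such as `numpy.polyfit` to obtain the polynomial curve. That step is omitted here to keep the sketch dependency-free.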
Each row of Figure 6 corresponds to one of the five curves
in Figure 5. The check marks identify the Pareto-efficient
configurations that define the bounding curve and include
15 of 29 configurations. Somewhat surprisingly, none of
the Atom D (45) configurations are Pareto efficient. Notice
the following: (1) Native non-scalable shares only one choice
with any other workload. (2) Java scalable and the average
share all the same choices. (3) Only two of eleven choices
for Java non-scalable and Java scalable are common to both.
(4) Native non-scalable does not include the Atom (45) in
its frontier. This last finding contradicts prior simulation
work, which concluded that dual-issue in-order cores and
dual-issue out-of-order cores are Pareto optimal for native
non-scalable benchmarks [1]. Instead, we find that all of the Pareto-efficient
points for native non-scalable in this design space are
quad-issue out-of-order i7 (45) configurations.
Figure 5. Energy/performance Pareto frontiers (45 nm). The energy/performance optimal designs are application dependent and significantly deviate from the average case. (Axes: normalized workload energy versus workload performance / workload reference performance.)

Figure 6. Pareto-efficient processor configurations for each workload. Stock configurations are bold. Each "✔" indicates that the configuration is on the energy/performance Pareto-optimal curve. Native non-scalable has almost no overlap with any other workload. (Configurations listed: Atom (45) 1C2T @ 1.7GHz; Core2D (45) 2C1T @ 1.6GHz; Core2D (45) 2C1T @ 3.1GHz; i7 (45) 1C1T @ 2.7GHz No TB; i7 (45) 1C1T @ 2.7GHz; i7 (45) 1C2T @ 1.6GHz; i7 (45) 1C2T @ 2.4GHz; i7 (45) 2C1T @ 1.6GHz; i7 (45) 2C2T @ 1.6GHz; i7 (45) 4C1T @ 2.7GHz No TB; i7 (45) 4C1T @ 2.7GHz; i7 (45) 4C2T @ 1.6GHz; i7 (45) 4C2T @ 2.1GHz; i7 (45) 4C2T @ 2.7GHz No TB; i7 (45) 4C2T @ 2.7GHz. The per-workload check marks are not reproduced here.)

Figure 5 starkly shows that each workload deviates substantially from the average. Even when the workloads share
points, the points fall in different places on the curves
because each workload exhibits a different energy/performance trade-off. Compare the scalable and non-scalable
benchmarks at 0.40 normalized energy on the y-axis. These
architectures exploit software parallelism remarkably well,
pushing the curves to the right and
increasing normalized performance from about 3 to about 7 while holding
energy constant. This measured behavior confirms prior
model-based observations about the role of software parallelism in extending the energy/performance curve to the right.
Finding: Energy-efficient architecture design is very sensitive to
workload. Configurations in the native non-scalable
Pareto frontier differ substantially from all other
workloads.
In summary, architects should use a variety of workloads
and, in particular, should avoid using only native non-scalable
workloads.
5. FEATURE ANALYSIS
Our original paper evaluates the energy effect of a range of
hardware features: clock frequency, die shrink, memory
hierarchy, hardware parallelism, and gross microarchitecture. This analysis resulted in a large number of findings
and insights. Reader and reviewer feedback yielded a diversity of opinions as to which findings were most surprising
and interesting. This section presents results exploring chip
multiprocessing (CMP), simultaneous multithreading (SMT),
technology scaling with a die shrink, and gross microarchitecture, to give a flavor of our analysis.
5.1. Chip multiprocessors
Figure 7 shows the average power, performance, and energy
effects of chip multiprocessors (CMPs) by comparing one
core to two cores for the two most recent processors in our
study. We disable Turbo Boost in these analyses because
it adjusts power dynamically based on the number of idle
cores. We disable Simultaneous Multithreading (SMT)
to maximally expose thread-level parallelism to the CMP
hardware feature. Figure 7(a) compares relative power, performance, and energy as a weighted average of the workloads. Figure 7(b) shows a breakdown of the energy as a
function of workload. While average energy is reduced by
9% when adding a core to the i5 (32), it is increased by 12%
when adding a core to the i7 (45). Figure 7(a) shows that the
source of this difference is that the i7 (45) experiences twice
the power overhead for enabling a core as the i5 (32), while
producing roughly the same performance improvement.
Finding: Comparing one core to two, enabling a core is not
consistently energy efficient.
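The link between the power, performance, and energy effects can be sketched as follows. For a fixed amount of work, energy is average power times execution time, and execution time scales inversely with performance, so the energy ratio is the power ratio divided by the performance ratio. The function name and all numbers below are hypothetical, chosen only to show how doubling the power overhead at the same speedup can flip the sign of the energy effect; they are not measurements from the study, and the relation need not hold exactly for values aggregated as weighted averages.

```python
def relative_energy(relative_power: float, relative_performance: float) -> float:
    """Energy ratio for a fixed amount of work.

    Energy = average power x execution time, and execution time scales as
    1 / performance, so the energy ratio is the power ratio divided by the
    performance ratio.
    """
    if relative_performance <= 0:
        raise ValueError("relative_performance must be positive")
    return relative_power / relative_performance


# Illustrative values only, not measurements from the study: both
# hypothetical chips gain the same speedup from a second core, but chip Y
# pays twice the power overhead of chip X (50% versus 25%).
speedup = 1.40
chip_x = relative_energy(relative_power=1.25, relative_performance=speedup)
chip_y = relative_energy(relative_power=1.50, relative_performance=speedup)

print(f"chip X relative energy: {chip_x:.2f}")  # below 1.0: second core saves energy
print(f"chip Y relative energy: {chip_y:.2f}")  # above 1.0: second core costs energy
```

Under these assumed values, the second core pays for itself in energy only when the speedup exceeds the power overhead, which is the pattern the finding describes.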
Figure 7(b) shows that native non-scalable and Java non-scalable suffer the most energy overhead with the addition
of another core on the i7 (45). As expected, performance
for native non-scalable is unaffected. However, turning on
an additional core for native non-scalable leads to a power
increase of 4% and 14%, respectively, for the i5 (32) and