ensuring that SMT is the sole opportunity for thread-level
parallelism. Figure 9(a) shows that the performance advantage of SMT is significant. Notably, on the i5 (32) and Atom
(45), SMT improves average performance significantly without much cost in power, leading to net energy savings.
finding: SMT delivers substantial energy savings for recent
hardware and for in-order processors.
Figure 10. Die shrink: microarchitectures compared across technology
nodes. “Core” shows Core 2D (65)/Core 2D (45) while “Nehalem”
shows i7 (45)/i5 (32) when two cores are enabled. (a) Each processor
uses its native clock speed. (b) Clock speeds are matched in each
comparison. (c) Energy impact with matched clocks, as a function of
workload. Both die shrinks deliver substantial energy reductions.
Given that SMT was and continues to be motivated by the
challenge of filling issue slots and hiding latency in wide-issue
superscalars, it may appear counterintuitive that
performance on the dual-issue in-order Atom (45) should
benefit so much more from SMT than performance on the quad-issue
i7 (45) and i5 (32). One explanation is that the in-order
pipelined Atom (45) is more restricted in its capacity to fill
issue slots. Compared to other processors in this study, the
Atom (45) also has much smaller caches. These features accentuate
the need to hide latency, and therefore the value of
SMT. The performance improvements on the Pentium 4
(130) due to SMT are one-half to one-third of those on more recent
processors, and consequently, there is no net energy advantage.
This result is not so surprising given that the Pentium
4 (130) is the first commercial implementation of SMT.
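The break-even reasoning behind "no net energy advantage" can be sketched as follows. This is a minimal illustration, assuming energy is average power multiplied by execution time, so that enabling SMT saves energy only when its speedup exceeds its power increase. The function names and the input values are hypothetical and chosen for illustration; they are not measurements from this study.

```python
def smt_relative_energy(speedup: float, power_ratio: float) -> float:
    """Energy with SMT enabled, relative to SMT disabled.

    Assumes energy = average power * execution time, with
    speedup = time_disabled / time_enabled and
    power_ratio = power_enabled / power_disabled.
    """
    if speedup <= 0 or power_ratio <= 0:
        raise ValueError("speedup and power_ratio must be positive")
    return power_ratio / speedup


def smt_saves_energy(speedup: float, power_ratio: float) -> bool:
    """True when the speedup outweighs the added power."""
    return smt_relative_energy(speedup, power_ratio) < 1.0


# Illustrative inputs only (hypothetical, not taken from the figures).
# A large speedup at modest power cost yields net savings...
large_gain = smt_relative_energy(speedup=1.25, power_ratio=1.10)
# ...while a speedup a fraction of that size, at the same power cost, does not.
small_gain = smt_relative_energy(speedup=1.05, power_ratio=1.10)

print(f"large speedup: relative energy {large_gain:.2f}")
print(f"small speedup: relative energy {small_gain:.2f}")
```

Under these assumptions a relative energy below 1.0 indicates a net saving, which is the sense in which a small SMT speedup can fail to pay for its power cost.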
Figure 9(b) shows that, as expected, the native non-scalable workload experiences very little energy overhead due
to enabling SMT, whereas Figure 7(b) shows that enabling a
core incurs a significant power and thus energy penalty. The
scalable workloads unsurprisingly benefit most from SMT.
The energy efficiency of SMT on recent processors is impressive
compared to CMP, particularly given
its very low die footprint. Compare Figures 7 and 9. SMT
provides less performance improvement than CMP (about half
as much on average)
but incurs much less power cost. The results on the modern
processors show SMT in a much more favorable light
than in Sasanka et al.’s model-based comparative study of
the energy efficiency of SMT and CMP.
5.3. Die shrink
We use processor pairs from the Core (Core 2D (65)/Core
2D (45)) and Nehalem (i7 (45)/i5 (32)) microarchitectures to
explore die shrink effects. These hardware comparisons are
imperfect because they are not straightforward die shrinks.
To limit the differences, we control for hardware parallelism
by limiting the i7 (45) to two cores. The tools and processors
at our disposal do not let us control the cache size, nor do
they let us control for other microarchitecture changes that
accompany a die shrink. We compare at stock clock speeds
and control for clock speed by running both Cores at 2.4 GHz
and both Nehalems at 2.66 GHz. We do not directly control
for core voltage, which differs across technology nodes
for the same frequency. Although imperfect, these are the
first published comparisons of measured energy efficiency
across technology nodes.
finding: Two recent die shrinks deliver similar and surprising
reductions in energy, even when controlling for clock speed.
Figure 10(a) shows the power and performance effects of
the die shrinks with the stock clock speeds for all the processors.
Figure 10(b) shows the same comparison with
matched clock speeds, and Figure 10(c) breaks down the
workloads for the matched clock speeds. At their higher stock
clock speeds, the newer processors are significantly faster
and significantly more power efficient. Down-clocking the
newer processors to match the frequency of their older peers,
as in Figure 10(b), improves their relative power
and energy advantage even further. Note that, as expected,
the die-shrunk processors offer no performance advantage
once the clocks are matched; indeed, the i5 (32) performs
10% slower than the i7 (45). However, power consumption
is reduced by 47%. This result is consistent with expectations,
given the lower voltage and reduced capacitance at the
smaller feature size.
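To see how the matched-clock performance and power figures combine into an energy figure, the arithmetic can be sketched as below. This is an illustrative calculation, not a result reported in the text. It assumes that energy is average power multiplied by execution time, that "10% slower" means 0.90 times the performance (so execution time scales as 1/0.90), and that the 47% power reduction applies to the same i7 (45)/i5 (32) pair. The function and constant names are hypothetical.

```python
def die_shrink_relative_energy(perf_ratio: float, power_ratio: float) -> float:
    """Energy of the shrunk part relative to its predecessor.

    Assumes energy = average power * execution time and that
    execution time scales as 1 / perf_ratio.
    """
    if perf_ratio <= 0 or power_ratio <= 0:
        raise ValueError("ratios must be positive")
    return power_ratio / perf_ratio


# Values quoted in the text for matched clocks.
PERF_RATIO = 1.0 - 0.10   # i5 (32) performs 10% slower than i7 (45)
POWER_RATIO = 1.0 - 0.47  # power consumption reduced by 47%

energy = die_shrink_relative_energy(PERF_RATIO, POWER_RATIO)
print(f"relative energy:  {energy:.2f}")
print(f"energy reduction: {(1.0 - energy) * 100:.0f}%")
```

Under these assumptions the shrunk part uses roughly 0.59 times the energy of its predecessor, a reduction of about 41%, despite the lower performance at matched clocks.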