ent approach to achieve performance
improvements. The multicore era was
thus born.
Multicore shifted responsibility for
identifying parallelism and deciding
how to exploit it to the programmer
and to the language system. Multicore
does not resolve the challenge of
energy-efficient computation that was
exacerbated by the end of Dennard scaling.
Each active core burns power whether
or not it contributes effectively to the
computation. A primary hurdle is an
old observation, called Amdahl’s Law,
stating that the speedup from a
parallel computer is limited by the portion
of a computation that is sequential.
To appreciate the importance of this
observation, consider Figure 5,
showing how much faster an application
runs with up to 64 cores compared to
a single core, assuming different
portions of serial execution, where only
one processor is active. For example,
when only 1% of the time is serial, the
speedup for a 64-processor
configuration is about 35. Unfortunately, the
power needed is proportional to 64
processors, so approximately 45% of
the energy is wasted.
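
The relationship between serial fraction, core count, and wasted energy can be sketched with the textbook form of Amdahl’s Law. The function names below are hypothetical, and the energy estimate rests on a simplifying assumption that is ours, not the article’s: every core draws full power for the entire run, so the useful share of energy is the speedup divided by the core count. Note that this idealized formula gives a speedup of about 39 for a 1% serial fraction on 64 cores, somewhat higher than the figure of about 35 quoted above, so the numbers are illustrative only.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Idealized Amdahl's Law speedup over a single core."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # The serial part runs on one core; the parallel part is split evenly.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def wasted_energy_fraction(serial_fraction: float, cores: int) -> float:
    """Share of energy not converted into speedup.

    Assumption (for illustration): all cores burn full power for the
    whole run, so the useful share is speedup / cores.
    """
    return 1.0 - amdahl_speedup(serial_fraction, cores) / cores


if __name__ == "__main__":
    # Illustrative serial fractions; these are not taken from Figure 5.
    for serial in (0.01, 0.02, 0.04, 0.08):
        speedup = amdahl_speedup(serial, 64)
        wasted = wasted_energy_fraction(serial, 64)
        print(f"serial={serial:.0%}  speedup={speedup:5.1f}  wasted={wasted:.0%}")
```

Even small serial fractions leave a large share of a 64-core machine idle, which is the point the paragraph above makes about energy.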
Real programs have more complex
structures, of course, with portions
that allow varying numbers of processors to be used at any given moment.
Nonetheless, the need to communicate and synchronize periodically
means most applications have some
portions that can effectively use only
a fraction of the processors. Although
Amdahl’s Law is more than 50 years
old, it remains a difficult hurdle.
With the end of Dennard scaling,
increasing the number of cores on a
chip meant power also increased
at nearly the same rate. Unfortunately,
the power that goes into a processor
must also be removed as heat. Multicore processors are thus limited by
the thermal design power (TDP),
the average amount of power the package and cooling system can remove.
Although some high-end data centers
may use more advanced packages and
cooling technology, no computer users would want to put a small heat
exchanger on their desks or wear a radiator on their backs to cool their cellphones. The limit of TDP led directly
to the era of “dark silicon,” whereby
processors would slow the clock
rate and turn off idle cores to prevent
overheating. Another way to view this
approach is that some chips can reallocate their precious power from the
idle cores to the active ones.
An era without Dennard scaling,
along with a slowing Moore’s Law and
Amdahl’s Law in full effect, means
inefficiency limits improvement in
performance to only a few percent
per year (see Figure 6). Achieving
higher rates of performance improvement, as was seen in the 1980s and
1990s, will require new architectural approaches that use the integrated-circuit capability much more
efficiently. We will return to what approaches might work after discussing
another major shortcoming of modern computers—their support, or
lack thereof, for computer security.
predicting the outcome of 15 branches.
If a processor architect wants to limit
wasted work to only 10% of the time,
the processor must predict each branch
correctly 99.3% of the time. Few general-purpose programs have branches that
can be predicted so accurately.
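
The 99.3% figure can be reproduced with a short calculation. This is a minimal sketch under an assumption that is ours for illustration: the 15 in-flight branch predictions are treated as independent, so work is useful only when all of them are correct. The function names are hypothetical.

```python
def useful_work_fraction(accuracy: float, branches_in_flight: int) -> float:
    """Probability that every in-flight branch was predicted correctly.

    Assumes each prediction succeeds independently with probability `accuracy`.
    """
    return accuracy ** branches_in_flight


def required_accuracy(max_wasted_fraction: float, branches_in_flight: int) -> float:
    """Per-branch accuracy needed to keep wasted work under the given bound."""
    return (1.0 - max_wasted_fraction) ** (1.0 / branches_in_flight)


if __name__ == "__main__":
    needed = required_accuracy(0.10, 15)
    print(f"required per-branch accuracy: {needed:.1%}")
    print(f"useful work at 99.3% accuracy: {useful_work_fraction(0.993, 15):.1%}")
```

Because the accuracy is raised to the 15th power, even a small drop in per-branch accuracy compounds quickly into wasted work.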
To appreciate how this wasted work
adds up, consider the data in Figure 4,
showing the fraction of instructions
that are effectively executed but turn
out to be wasted because the processor speculated incorrectly. On average,
19% of the instructions are wasted for
these benchmarks on an Intel Core i7.
The amount of wasted energy is greater, however, since the processor must
use additional energy to restore the
state when it speculates incorrectly.
Measurements like these led many to
conclude architects needed a differ-
Figure 6. Growth of computer performance using integer programs (SPECintCPU).
[Plot: performance vs. VAX11-780 (log scale, 1 to 100,000), 1980 to 2015. Annotated growth rates: CISC, 2X/2.5 years (22%/year); RISC, 2X/1.5 years (52%/year); End of Dennard Scaling ⇒ Multicore, 2X/3.5 years (23%/year); Amdahl’s Law ⇒ 2X/6 years (12%/year); End of the Line ⇒ 2X/20 years (3%/year).]

Figure 7. Potential speedup of matrix multiply in Python for four optimizations.
[Bar chart: matrix multiply speedup over native Python (log scale, 1 to 100,000). Python: 1; C: 47; + parallel loops: 366; + memory optimization: 6,727; + SIMD instructions: 62,806.]