vance (such as introducing an on-die cache, by comparing the 486 to the 386 in 1μ technology, and a superscalar microarchitecture, by comparing the Pentium in 0.7μ technology with the 486).
This data shows that on-die caches
and pipeline architectures used transistors well, providing a significant
performance boost without compromising energy efficiency. In this era,
superscalar and out-of-order architectures provided sizable performance
benefits at a cost in energy efficiency.
Of these architectures, deep-pipelined design seems to have delivered
the lowest performance increase for
the same area and power increase as
out-of-order and speculative design,
incurring the greatest cost in energy
efficiency. The term “deep pipelined
architecture” describes a deeper pipeline, as well as other circuit and microarchitectural techniques (such as
trace cache and self-resetting domino
logic) employed to achieve even higher frequency. Evident from the data is
that reverting to a non-deep pipeline
reclaimed energy efficiency by dropping these expensive and inefficient
techniques.
When transistor performance increases the frequency of operation, the performance of a well-tuned system generally increases, with frequency subject to the performance limits of other parts of the system. Historically, microarchitecture techniques exploiting the growth in available transistors have delivered performance increases empirically described by Pollack’s Rule,32 whereby performance increases (when not limited by other parts of the system) as the square root of the number of transistors or area of a processor (see Figure 3). According to Pollack’s Rule, each new technology generation doubles the number of transistors on a chip, enabling a new microarchitecture that delivers a 40% performance increase. The faster transistors provide an additional 40% performance (increased frequency), almost doubling overall performance within the same power envelope (per scaling theory). In practice, however, implementing a new microarchitecture every generation is difficult, so microarchitecture gains are typically less. In recent microprocessors, the increasing drive for energy efficiency has caused designers to forgo many of these microarchitecture techniques.
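The arithmetic behind this generational doubling can be sketched numerically. This is a minimal illustration of Pollack’s Rule combined with the ~40% frequency gain the text attributes to faster transistors; the function name is ours, not from the article:

```python
import math

def pollack_perf_gain(transistor_ratio):
    """Pollack's Rule: microarchitecture performance scales roughly as
    the square root of the transistor count (or die area) devoted to it."""
    return math.sqrt(transistor_ratio)

# One process generation: transistor count doubles.
uarch_gain = pollack_perf_gain(2.0)   # ~1.41x from the new microarchitecture
freq_gain = 1.4                       # ~40% from faster transistors (per the text)
total_gain = uarch_gain * freq_gain   # ~1.98x, i.e., nearly doubled performance
print(f"microarchitecture: {uarch_gain:.2f}x, total: {total_gain:.2f}x")
```

This reproduces the article’s claim: a sqrt(2) ≈ 1.4x microarchitecture gain compounded with a 1.4x frequency gain comes to roughly 2x per generation.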
Unaddressed, the memory-latency gap
would have eliminated and could still
eliminate most of the benefits of processor improvement.
The reason for the slow improvement of DRAM speed is practical, not technological. It is a misconception that DRAM technology, based on capacitor storage, is inherently slow; rather, DRAM is organized for density and low cost, and that organization makes it slow. The DRAM market has demanded large capacity at minimum cost over speed, relying on small, fast caches on the microprocessor die to emulate high-performance memory by providing the necessary bandwidth and low latency based on data locality.
The emergence of sophisticated, yet
effective, memory hierarchies allowed
DRAM to emphasize density and cost
over speed. At first, processors used a single level of cache but, as processor speed increased, two or three levels of cache were introduced to span the growing speed gap between the processor and memory.
Figure 3. Increased performance vs. area in the same process technology follows Pollack’s Rule. [Log-log plot of integer performance (x) vs. area (x) for the 386-to-486, 486-to-Pentium, Pentium-to-P6, P6-to-Pentium 4, and Pentium 4-to-Core transitions, against a performance = sqrt(area) line of slope 0.5.]
Figure 4. DRAM density and performance, 1980–2010. [Panel (a): relative DRAM density and DRAM speed vs. CPU speed on a log scale, showing the growing gap; panel (b): CPU clocks per DRAM latency over the same period.]
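How a multi-level cache hierarchy spans this gap can be illustrated with the standard average-memory-access-time (AMAT) recurrence. The hit rates and cycle counts below are illustrative assumptions for the sketch, not measurements from the article:

```python
def amat(levels, dram_latency):
    """Average memory access time (in cycles) for a multi-level cache
    hierarchy. `levels` is a list of (hit_time_cycles, hit_rate) pairs,
    ordered from L1 outward."""
    # Work from the outermost level inward: a miss at each level pays
    # the average cost of the next level down (ultimately DRAM).
    penalty = dram_latency
    for hit_time, hit_rate in reversed(levels):
        penalty = hit_time + (1.0 - hit_rate) * penalty
    return penalty

# Illustrative (assumed) numbers: L1 4 cycles / 95% hits,
# L2 12 cycles / 80%, L3 40 cycles / 60%, DRAM 200 cycles.
hierarchy = [(4, 0.95), (12, 0.80), (40, 0.60)]
print(f"AMAT: {amat(hierarchy, 200):.1f} cycles vs. 200 for DRAM alone")
```

With these assumed numbers the hierarchy brings average access time down from 200 cycles to under 6, which is the sense in which small, fast on-die caches "emulate" high-performance memory when data locality holds.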