tion has been slowed by continued improvement in microprocessor single-thread performance. Developers of software applications had little incentive to customize for accelerators that might be available on only a fraction of the machines in the field and for which the performance advantage might soon be overtaken by advances in the traditional microprocessor. With slowing improvement in single-thread performance, this landscape has changed significantly, and for many applications, accelerators may be the only path toward increased performance or energy efficiency (see Table 4).
Figure 11. On-die interconnect delay and energy (45nm). (a) Wire delay (ps) and wire energy (pJ/bit) versus on-die interconnect length (mm). (b) On-die network energy per bit (pJ), measured and extrapolated, across process generations from 0.5µm to 8nm.
Figure 12. Hybrid switching for network-on-a-chip: cores (C) within a cluster share a bus; clusters are then connected either by a second-level bus (hierarchy of busses) or by a second-level network of routers (R) (hierarchy of networks).
Table 5. Data movement challenges, trends, directions.

Challenge | Near-term | Long-term
Parallelism | Increased parallelism | Heterogeneous parallelism and customization; hardware/runtime placement, migration, and adaptation for locality and load balance
Data movement/locality | More complex, more exposed hierarchies; new abstractions for control over movement and “snooping” | New memory abstractions and mechanisms for efficient vertical data-locality management with low programming effort and energy
Resilience | More aggressive energy reduction, compensated by recovery for resilience | Radical new memory technologies (new physics) and resilience techniques
Energy-proportional communication | Fine-grain power management in packet fabrics | Exploitation of wide data, slow clock, and circuit-based techniques
Reduced energy | Low-energy address translation | Efficient multi-level naming and memory-hierarchy management
But such software customization is difficult, especially for large programs (see the sidebar “Decline of 90/10 Optimization, Rise of 10x10 Optimization”).
Orchestrating data movement: Memory hierarchies and interconnects. In future microprocessors, the energy expended for data movement will have a critical effect on achievable performance. Every nanojoule of energy used to move data up and down the memory hierarchy, as well as to synchronize across processors and move data between them, takes away from the limited budget, reducing the energy available for the actual computation. In this context, efficient memory hierarchies are critical, as the energy to retrieve data from a local register or cache is far less than the energy to go to DRAM or to secondary storage. In addition, data must be moved between processing units efficiently, and task placement and scheduling must be optimized against an interconnection network with high locality. Here, we examine energy and power associated with data movement on the processor die.
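As a rough illustration of why this hierarchy matters, the short sketch below compares the energy of fetching one 64-bit operand from successively more distant levels against the energy of the arithmetic itself. All of the per-access numbers are assumptions chosen only to convey orders of magnitude; they are not measurements from this article.

```python
# Illustrative sketch only: assumed energies for one 64-bit access at each
# level of the memory hierarchy, compared with an assumed ALU operation.
# None of these numbers come from the article; they are placeholders that
# show the orders-of-magnitude gap that makes locality so important.
ACCESS_ENERGY_PJ = {
    "register file":     0.5,    # assumed
    "L1 cache":         10.0,    # assumed
    "last-level cache": 100.0,   # assumed
    "off-chip DRAM":   2000.0,   # assumed
}
ALU_OP_PJ = 1.0                  # assumed energy of the 64-bit operation itself

for level, pj in ACCESS_ENERGY_PJ.items():
    print(f"{level:>16}: {pj:7.1f} pJ per 64b access "
          f"(~{pj / ALU_OP_PJ:.0f}x the ALU op)")
```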
Today’s processor performance is on the order of 100 Giga-op/sec, and a 30x increase over the next 10 years would raise this to 3 Tera-op/sec. At roughly three 64-bit operands per operation (for example, two sources and one result), this boost requires, at minimum, 9 Tera-operands, or 64b x 9 Tera-operands (576 Tera-bits), to be moved each second from registers or memory to the arithmetic logic, consuming energy.
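The arithmetic behind the 576 Tera-bits/sec figure can be checked with a minimal sketch, assuming three 64-bit operands per operation (the assumption implied by the 9 Tera-operand figure):

```python
# Minimal sketch of the operand-bandwidth arithmetic above.
ops_per_sec      = 3e12    # 3 Tera-op/sec performance target
operands_per_op  = 3       # assumption: e.g., two sources plus one result
bits_per_operand = 64

operand_rate = ops_per_sec * operands_per_op      # = 9e12 operands/sec
bit_rate     = operand_rate * bits_per_operand    # = 5.76e14 bits/sec
print(f"{operand_rate:.0e} operands/sec -> {bit_rate / 1e12:.0f} Tera-bits/sec")
```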
Figure 11(a) outlines the typical wire delay and energy consumed in moving one bit of data on the die. If the operands move on average 1mm (10% of die size), then at the rate of 0.1pJ/bit, the 576 Tera-bits/sec of movement consumes almost 58 watts, with hardly any energy budget left for computation. If most operands are kept local to the execution units (such as in register files) and the data movement is far less than 1mm, say on the order of only 0.1mm, then the power consumption is only around 6 watts, leaving ample energy budget for the computation.
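The same estimate can be expressed as a one-line power calculation: watts are simply bits per second multiplied by joules per bit, with the energy per bit assumed to scale linearly with on-die distance at the roughly 0.1 pJ/bit-per-mm rate implied by Figure 11(a).

```python
# Sketch of the power estimate in the text: watts = bits/sec x joules/bit,
# with energy per bit scaling linearly with on-die distance (assumed
# ~0.1 pJ/bit per mm, per Figure 11(a)).
BIT_RATE = 576e12            # bits/sec, from the 3 Tera-op/sec estimate
PJ_PER_BIT_PER_MM = 0.1

for label, mm in [("1.0 mm (10% of die size)", 1.0),
                  ("0.1 mm (operands kept local)", 0.1)]:
    watts = BIT_RATE * (PJ_PER_BIT_PER_MM * mm) * 1e-12
    print(f"{label}: ~{watts:.0f} W spent just moving data")
# -> ~58 W at 1.0 mm vs. ~6 W at 0.1 mm, matching the figures in the text
```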
Cores in a many-core system are typically connected through a network-on-a-chip to move data around the cores.40 Here, we examine the ef-