Figure 3. PCM as a DRAM alternative. (a) Application delay and memory energy for a 2048B×1 buffer, normalized to DRAM. (b) PCM array writes per buffer read: the percentage of buffer evictions that require array writes. Results are shown for cg, is, mg, fft, rad, oce, art, equ, and swi, plus their average.
are evicted. On average, we observe a 2.2× energy penalty.
The end-to-end delay and energy penalties are more modest than the underlying technology parameters might suggest. Even memory-intensive workloads mix computation with memory accesses. Furthermore, the long-latency, high-energy array writes manifest much less often in PCM than in DRAM; nondestructive PCM reads do not require subsequent writes, whereas destructive DRAM reads do. Figure 3b indicates that only 28% of PCM array reads first require an array write of a dirty buffer.
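To make this accounting concrete, the following sketch contrasts the array-write traffic implied by destructive and nondestructive reads. The 0.28 dirty-eviction fraction is the average from Figure 3b; the function itself is an illustrative assumption, not the evaluation infrastructure used in the study.

```python
# Sketch: array writes triggered per buffer miss (array read).
# DRAM's destructive reads force a restore write after every array read;
# PCM writes back only dirty evicted buffers.

def array_writes_per_miss(destructive_reads, p_dirty):
    """Expected array writes caused by one buffer miss."""
    if destructive_reads:
        return 1.0       # DRAM: every read destroys the row and must restore it
    return p_dirty       # PCM: only a dirty evicted buffer is written back

dram = array_writes_per_miss(destructive_reads=True,  p_dirty=0.28)
pcm  = array_writes_per_miss(destructive_reads=False, p_dirty=0.28)
print(f"array writes per miss: DRAM {dram:.2f}, PCM {pcm:.2f}")   # 1.00 vs. 0.28
```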
To enable PCM for use below the lowest-level processor cache in general-purpose systems, we must close the delay and energy gap between PCM and DRAM. Nondestructive
PCM reads help mitigate underlying delay and energy disadvantages by default. We seek to eliminate the remaining
PCM-DRAM differences with architectural solutions. In particular, the baseline analysis considers a single 2048B-wide
buffer per bank. Such wide buffering is inexpensive in
DRAM, but incurs unnecessary energy costs in PCM given
the expensive current injection required when writing buffer
contents back into the array.
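A back-of-envelope sketch of why buffer width matters for PCM writebacks follows. The per-bit energies are placeholder assumptions, not the technology parameters used in this study; the point is only that writeback energy grows linearly with the number of bits injected back into the array.

```python
# Sketch: writeback energy scales with buffer width because each programmed
# bit requires current injection. Per-bit energies below are illustrative.

PCM_WRITE_PJ_PER_BIT  = 16.0    # assumed PCM programming energy per bit
DRAM_WRITE_PJ_PER_BIT = 1.0     # assumed DRAM restore energy per bit

def writeback_nj(buffer_bytes, pj_per_bit):
    """Energy (nJ) to write one full buffer row back into the array."""
    return buffer_bytes * 8 * pj_per_bit / 1000.0

for width in (2048, 512, 64):
    print(f"{width:5d}B buffer: "
          f"PCM {writeback_nj(width, PCM_WRITE_PJ_PER_BIT):7.1f} nJ vs. "
          f"DRAM {writeback_nj(width, DRAM_WRITE_PJ_PER_BIT):6.1f} nJ")
```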
3.2. Buffer organization
We examine whether PCM subsystems can close the gap with DRAM in application performance and memory subsystem energy. To be a viable DRAM alternative, buffer organizations must hide long PCM latencies while minimizing PCM energy costs.
To achieve area neutrality across buffer organizations,
we consider narrower buffers and additional buffer rows.
The number of sense amplifiers decreases linearly with buffer width, significantly reducing area as fewer of these large
circuits are required. We utilize this area by implementing
multiple rows with latches much smaller than the removed
sense amplifiers. Narrow widths reduce PCM write energy
but negatively impact spatial locality, opportunities for write
coalescing, and application performance. However, these
penalties may be mitigated by the additional buffer rows.
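The area argument can be sketched as follows. The relative sizes of a sense amplifier and a latch are illustrative assumptions, not the circuit figures behind the study; the sketch only shows that narrowing the buffer frees enough area to add latch rows while staying within the baseline budget.

```python
# Sketch of area neutrality: sense amplifiers scale with buffer width, and the
# area freed by narrowing the buffer pays for additional rows of latches.
# Relative per-bit areas are assumptions for illustration.

SENSE_AMP_AREA = 10.0    # assumed area units per bit of sense amplifier
LATCH_AREA     = 1.0     # assumed area units per bit of buffer latch

def buffer_area(width_bytes, rows):
    """Sense amps sized to the buffer width, plus 'rows' rows of latches."""
    bits = width_bytes * 8
    return bits * SENSE_AMP_AREA + rows * bits * LATCH_AREA

baseline = buffer_area(2048, 1)                  # original 2048B x 1 organization
for width in (2048, 512, 64):
    # largest row count (capped at 32) that stays within the baseline area
    rows = max(r for r in range(1, 33) if buffer_area(width, r) <= baseline)
    print(f"{width:5d}B x {rows:2d} rows: "
          f"{buffer_area(width, rows) / baseline:.2f} of baseline area")
```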
We consider buffer widths ranging from the original
2048B to 64B, which is the line size of the lowest-level cache.
We consider buffer rows ranging from the original single row to a maximum of 32 rows. At present, we consider a fully associative buffer; full associativity likely becomes intractable beyond 32 rows. Buffers with multiple rows use a least recently used (LRU) eviction policy implemented in the memory controller.
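A minimal sketch of such a fully associative, LRU-managed set of buffer rows follows. The trace format and dirty-bit bookkeeping are assumptions for illustration, not the simulator used in the evaluation.

```python
# Sketch: fully associative row buffers with LRU eviction, as managed by the
# memory controller. Returns the array traffic generated by a simple trace.

from collections import OrderedDict

def simulate_buffers(trace, rows=4, width_bytes=256):
    """Return (array_reads, array_writes) for a width_bytes x rows organization.

    trace: iterable of (byte_address, is_write) pairs.
    """
    shift = width_bytes.bit_length() - 1    # log2(width) for power-of-two widths
    buffers = OrderedDict()                 # row tag -> dirty bit, kept in LRU order
    array_reads = array_writes = 0

    for addr, is_write in trace:
        tag = addr >> shift
        if tag not in buffers:
            array_reads += 1                # miss: fetch the row from the PCM array
            if len(buffers) >= rows:        # evict the least recently used row
                _, dirty = buffers.popitem(last=False)
                if dirty:
                    array_writes += 1       # only dirty rows are written back
            buffers[tag] = False
        buffers.move_to_end(tag)            # mark the row most recently used
        if is_write:
            buffers[tag] = True             # writes coalesce in the buffer

    return array_reads, array_writes
```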
3.3. Buffer design space
Buffer reorganizations impact the degree of exploited locality and the energy costs associated with array reads and writes. Figure 4 illustrates the delay and energy characteristics of the buffer design space, averaged across the memory-intensive benchmarks. Triangles illustrate the PCM and DRAM baselines, which implement a single 2048B buffer. Circles illustrate various buffer organizations. Reorganizing a single, wide
Figure 4. Pareto analysis for PCM buffer organizations: memory energy versus delay, both normalized to DRAM, for the PCM buffer organizations against the PCM and DRAM baselines.
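The Pareto view in Figure 4 keeps only the organizations that no other organization matches or beats in both delay and memory energy. A minimal sketch of that filtering follows; the sample configurations and their numbers are hypothetical, not the measured results.

```python
# Sketch: Pareto filtering over (delay, energy) points, both normalized to DRAM.
# A configuration survives if no other configuration is at least as good in both
# dimensions and different in at least one.

def pareto_frontier(points):
    """points: dict of name -> (delay, energy). Returns the non-dominated subset."""
    def dominated(a, b):
        return b[0] <= a[0] and b[1] <= a[1] and b != a
    return {name: p for name, p in points.items()
            if not any(dominated(p, q) for q in points.values())}

configs = {                      # hypothetical (delay, energy) per organization
    "2048B x 1": (1.60, 2.20),
    "512B x 4":  (1.25, 1.10),
    "256B x 8":  (1.20, 0.95),
    "64B x 32":  (1.35, 0.90),
    "64B x 1":   (1.90, 1.00),
}
print(sorted(pareto_frontier(configs)))   # the organizations on the frontier
```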