[Figure 5. Memory subsystem trends from PCM buffer organizations, for Rp = 1, 2, and 4 buffer rows. (a) Array reads (normalized to PCM 2048B×1) versus row buffer width (B): array reads increase sublinearly with buffer width. (b) Write coalescing (normalized to PCM 2048B×1) versus row buffer width (B): array write coalescing opportunities increase with buffer rows.]
buffer into multiple, narrow buffers reduces both energy costs and delay. Examining the Pareto frontier, we observe that Pareto optima shift PCM delay and energy into the neighborhood of the DRAM baseline. Among these Pareto optima, we observe a knee that minimizes both energy and delay; this organization uses four 512B-wide buffers to reduce PCM's delay and energy disadvantages from 1.6× and 2.2× to a more modest 1.2× and 1.0×.
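To make the frontier selection concrete, the sketch below filters dominated design points and picks a knee. The two labeled points (1×2048B and 4×512B) use the delay and energy figures quoted above; the other organizations and the product-based knee heuristic are illustrative assumptions, not the paper's methodology.

```python
# Sketch: Pareto filtering over buffer organizations, each summarized
# as (delay, energy) normalized to the DRAM baseline.

def pareto_frontier(points):
    """Keep points not dominated (worse or equal in both delay and energy)."""
    frontier = []
    for name, d, e in points:
        dominated = any(d2 <= d and e2 <= e and (d2, e2) != (d, e)
                        for _, d2, e2 in points)
        if not dominated:
            frontier.append((name, d, e))
    return frontier

organizations = [
    ("1 x 2048B", 1.6, 2.2),   # single wide buffer (figures from the text)
    ("4 x 512B",  1.2, 1.0),   # knee organization (figures from the text)
    ("8 x 256B",  1.1, 1.3),   # hypothetical: lower delay, higher energy
    ("2 x 1024B", 1.4, 1.6),   # hypothetical dominated point
]

frontier = pareto_frontier(organizations)
# A simple knee heuristic: the frontier point minimizing delay * energy.
knee = min(frontier, key=lambda p: p[1] * p[2])
print("Pareto frontier:", [name for name, _, _ in frontier])
print("Knee:", knee[0])
```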
The number of array reads is a measure of locality. Figure 5a shows that the number of array reads increases very slowly as buffer width decreases exponentially from 2048B to 64B. For a single buffered row (Rp = 1), a 32× reduction in buffer width produces only a 2× increase in array reads, suggesting very little spatial locality within wide rows for the memory-intensive workloads we consider. The single row is evicted too quickly after its first access, limiting opportunities for spatial reuse. However, we do observe significant temporal locality: a 2048B-wide buffer with two rows (Rp = 2) requires only 0.4× as many array reads as a 2048B-wide buffer with a single row (Rp = 1).
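As a concrete illustration of this metric, the following sketch counts array reads under a simple row-buffer model: a buffer of `rows` rows, each `width` bytes, with LRU replacement, where every miss fills a row from the array. The model and the address trace are our own simplifications, not the simulator used in the evaluation.

```python
# Sketch: counting array reads for a hypothetical row-buffer organization.

from collections import OrderedDict

def array_reads(trace, width, rows):
    buffer = OrderedDict()          # maps row tag -> None, in LRU order
    reads = 0
    for addr in trace:
        tag = addr // width         # which array row this access falls in
        if tag in buffer:
            buffer.move_to_end(tag) # row buffer hit: refresh LRU position
        else:
            reads += 1              # miss: read the row from the array
            buffer[tag] = None
            if len(buffer) > rows:
                buffer.popitem(last=False)  # evict least recently used row
    return reads

trace = [0, 64, 4096, 128, 0, 8192, 64]   # hypothetical byte addresses
print(array_reads(trace, width=2048, rows=1))  # more misses with one row
print(array_reads(trace, width=2048, rows=2))  # temporal reuse cuts reads
```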
Writes are coalesced if multiple writes modify the buffer before its contents are evicted to the array. Thus the number of array writes per buffer write is a metric for write coalescing. Figure 5b illustrates increasing opportunities for coalescing as the number of rows increases. As the number of rows in a 2048B-wide buffer increases from one to two and four rows, array writes per buffer write fall to 0.51× and 0.32×, respectively; the buffers coalesce 49% and 68% of memory writes. Coalescing opportunities fall as buffer widths narrow beyond 256B. Since we use 64B lines in the lowest-level cache, there are no coalescing opportunities from spatial locality within a 64B row buffered for a write. Increasing the number of 64B rows has no impact since additional rows exploit temporal locality, but any temporal locality in writes is already exploited by coalescing in the 64B lines of the lowest-level cache.
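The arithmetic connecting these two views of the data is worth making explicit: if array writes per buffer write fall to a ratio r, then a fraction 1 − r of buffer writes were coalesced.

```python
# Worked check of the ratios quoted above: falling to r array writes
# per buffer write means a fraction 1 - r of writes were coalesced.

for rows, ratio in [(2, 0.51), (4, 0.32)]:
    print(f"{rows} rows: {1 - ratio:.0%} of writes coalesced")
# -> 2 rows: 49% of writes coalesced
# -> 4 rows: 68% of writes coalesced
```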
Thus, narrow buffers mitigate high-energy PCM writes, and multiple rows exploit locality. This locality not only
improves performance, but also reduces energy by exposing
additional opportunities for write coalescing. We evaluate
PCM buffering using technology parameters at 90nm. As
PCM technology matures, baseline PCM latencies may
improve. Moreover, process technology scaling will drive
linear reductions in PCM energy.
3.4. Scaling comparison
DRAM scaling faces many significant technical challenges as scaling attacks weaknesses in both components of the one-transistor, one-capacitor cell. Capacitor scaling is constrained by the DRAM storage mechanism, which requires maintaining charge on a capacitor. In the future, process scaling is constrained by challenges in manufacturing small capacitors that store sufficient charge for reliable sensing despite large parasitic capacitances on the bitline.
The scaling scenarios are also bleak for the access transistor. As this transistor scales down, increasing subthreshold leakage will make it increasingly difficult to ensure DRAM retention times. Not only is less charge stored in the capacitor, but that charge is also stored less reliably. These trends impact the reliability and energy efficiency of DRAM in future process technologies. According to ITRS, "manufacturable solutions are not known" for DRAM beyond 40nm.17
In contrast, ITRS projects that PCM scaling mechanisms will extend to 32nm, after which other scaling mechanisms might apply.17 Such PCM scaling has already been demonstrated with a novel device structure fabricated by Raoux.15 Although both DRAM and PCM are expected to be viable at 40nm technologies, energy scaling trends strongly favor PCM, with a 2.4× reduction in PCM energy from 80 to 40nm, as illustrated in Figure 6a. In contrast, ITRS projects that DRAM energy falls by only 1.5× at 40nm, which reflects the technical challenges of DRAM scaling.17
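A rough sanity check ties these scaling factors together, under the simplifying assumptions that the buffered PCM organization starts near energy parity with DRAM (the 1.0× figure above) and that the quoted factors apply directly despite the 90nm versus 80nm baseline mismatch:

```python
# Back-of-the-envelope: how the projected scaling factors shift the
# PCM:DRAM energy ratio, ignoring the 90nm vs. 80nm baseline mismatch.

ratio_baseline = 1.0                 # PCM energy relative to DRAM (knee org.)
pcm_scaling, dram_scaling = 2.4, 1.5 # energy reductions quoted above
ratio_40nm = ratio_baseline * dram_scaling / pcm_scaling
print(f"PCM:DRAM energy at 40nm ~ {ratio_40nm:.1%}")   # ~62.5%
```

This crude estimate lands close to the measured subsystem comparison reported next.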
Since PCM energy scales down faster than DRAM
energy, PCM subsystems significantly outperform DRAM
subsystems at 40nm. Figure 6b indicates PCM subsystem
energy is 61.3% that of DRAM averaged across workloads.
Switching from DRAM to PCM reduces energy costs by at