Figure 6. PCM scalability. (a) PCM energy scaling: RESET energy versus process technology (nm), from a survey of empirical prototypes by Lai and an analytical analysis by Pirovano et al. (b) PCM energy at 40 nm with a 512B×4 buffer: memory energy projections, normalized to DRAM, for cg, is, mg, fft, rad, oce, art, equ, swi, and their average (avg).
least 22.1% (art) and by as much as 68.7% (swim). Note this
analysis does not account for refresh energy, which would
further increase DRAM energy costs. Although ITRS projects a constant retention time of 64 ms as DRAM scales to
40 nm [17], less effective access transistor control may reduce
retention times. If retention times fall, DRAM refresh
energy will increase as a fraction of total energy costs.
Table 1. Endurance model parameters.
Writes per second per bit
Memory module lifetime (s)
Logical capacity (Gb)
Memory bus bandwidth
Memory bus frequency (MHz)
Processor frequency multiplier
Burst length (blocks)
Number of writes, reads
Execution time (cy)
Buffer width (b), rows
Buffer, array writes
Fraction of buffer written to array
1e+08
4. Memory Lifetimes
In addition to architecting PCM to offer competitive delay
and energy characteristics relative to DRAM, we must also
consider PCM wear mechanisms. To mitigate these effects,
we propose partial writes, which reduce the number of
writes to the PCM array by tracking modified data from the
L1 cache to the memory banks. This architectural solution
adds a modest amount of cache state to reduce the number
of bits written. We derive an analytical model to estimate
memory module lifetime from a combination of fundamental PCM technology parameters and measured application
characteristics. Partial writes, combined with an effective
buffer organization, increase memory module lifetimes to
a degree that makes PCM in main memory feasible.
4.1. Partial writes
Partial writes track data modifications, propagating this
information from the L1 cache down to the buffers at the
memory banks. When a buffered row is evicted and contents
written to the PCM array, only modified data is written. We
consider partial writes at two granularities: lowest level cache
line size (64B) and word size (4B).
These granularities are least invasive since modified
words are tracked by store instructions from the microprocessor pipeline. In contrast, bit-level granularity requires
knowledge of previous data values and expensive comparators. We analyze a conservative implementation of partial
writes, which does not exploit cases where stores write the
same data values already stored.
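The word-granularity tracking described above can be illustrated with a minimal sketch. It assumes a 64B line divided into sixteen 4B words, with one dirty bit per word held in an integer mask; the function names `mark_store` and `words_to_write` are hypothetical and are not taken from the paper. In keeping with the conservative implementation, a store always marks its words dirty, even if it rewrites the value already stored.

```python
WORD_BYTES = 4    # dirty-word granularity given in the text
LINE_BYTES = 64   # lowest-level cache line size given in the text


def mark_store(dirty_mask, offset, size):
    """Return the mask with a dirty bit set for every 4B word the store touches.

    `offset` and `size` are in bytes, relative to the start of the line.
    No data comparison is done: the store marks words dirty unconditionally.
    """
    if size <= 0 or offset < 0 or offset + size > LINE_BYTES:
        raise ValueError("store must fall within a single cache line")
    first_word = offset // WORD_BYTES
    last_word = (offset + size - 1) // WORD_BYTES
    for word in range(first_word, last_word + 1):
        dirty_mask |= 1 << word
    return dirty_mask


def words_to_write(dirty_mask):
    """List the word indices that a partial write would send to the array."""
    num_words = LINE_BYTES // WORD_BYTES
    return [w for w in range(num_words) if (dirty_mask >> w) & 1]


mask = 0
mask = mark_store(mask, offset=8, size=4)    # aligned store: word 2
mask = mark_store(mask, offset=30, size=4)   # unaligned store: words 7 and 8
print(words_to_write(mask))                  # [2, 7, 8]
```

In this example only 3 of the 16 words in the line would be written to the array on eviction; the remaining 13 words are left untouched.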
Partial writes are supported by adding state to each cache
line, tracking stores using fine-grained dirty bits. At the dirty
line granularity, 64B modifications are tracked beginning at
the lowest level cache and require only 1b per 64B L2 line.
At the dirty word granularity, 4B modifications are tracked
beginning at the L1 cache with 8b per 32B L1 line and are
propagated to the L2 cache with 16b per 64B L2 line. Overheads
are 0.2% and 3.1% of each cache line when tracking dirty
lines and words, respectively.
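The quoted overheads follow directly from the ratio of dirty bits to data bits in a line. The short check below is a sketch of that arithmetic; the helper name `dirty_bit_overhead` is hypothetical, and the calculation assumes overhead is measured against the data bits only (not tags or other cache state).

```python
def dirty_bit_overhead(dirty_bits, line_bytes):
    """Dirty-bit storage as a fraction of the data bits in one cache line."""
    return dirty_bits / (line_bytes * 8)


line_overhead = dirty_bit_overhead(1, 64)    # 1b per 64B L2 line  -> 1/512
word_overhead = dirty_bit_overhead(16, 64)   # 16b per 64B L2 line -> 16/512
l1_overhead = dirty_bit_overhead(8, 32)      # 8b per 32B L1 line  -> 8/256

print(f"dirty lines: {line_overhead:.4f}")
print(f"dirty words: {word_overhead:.4f} (L2), {l1_overhead:.4f} (L1)")
```

The fractions 1/512 and 16/512 correspond to roughly 0.2% and 3.1%, matching the figures in the text. The L1 and L2 word-granularity overheads are equal because both use one dirty bit per 4B word.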
4.2. Endurance
Equation 1 estimates the write intensity observed by a
memory module driven with access patterns observed in
our memory-intensive workloads. Table 1 summarizes
the model parameters. The model estimates the number
of writes per second Ŵ for any given bit. We first estimate
memory bus occupancy, which has a theoretical peak command bandwidth of $f_m \cdot (B/2)^{-1}$. Each command requires
$B/2$ bus cycles to transmit its burst length $B$ in a DDR interface, which prevents commands from issuing at the memory bus speed $f_m$. We then scale this peak bandwidth by
application-specific utilization. Utilization is computed by