2.3. Process scaling
PCM scaling reduces the required programming current injected via the electrode–storage contact. As the contact area decreases with feature size, thermal resistance increases and the volume of phase change material that must be cooled into an amorphous state during a reset to completely block current flow decreases. These effects enable smaller access devices for current injection. Pirovano et al. outline PCM scaling rules,14 which are confirmed empirically in a survey by Lai.9 Specifically, as feature size scales linearly (1/k), contact area decreases quadratically (1/k²). Reduced contact area causes resistance to increase linearly (k), which causes programming current to decrease linearly (1/k).
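These scaling rules compose multiplicatively across a generation. The following sketch applies them for a scaling factor k; the starting values (feature size, contact area, resistance, current) are illustrative placeholders, not measured device data from the cited works.

```python
# Hedged sketch of the PCM scaling rules summarized above.
# All starting values are assumptions chosen for illustration only.

def scale_pcm(k, feature_nm=90.0, contact_area_nm2=8100.0,
              resistance_ohm=1e3, current_ua=300.0):
    """Apply one generation of linear scaling by factor k (k > 1)."""
    return {
        "feature_nm": feature_nm / k,                 # feature size: 1/k
        "contact_area_nm2": contact_area_nm2 / k**2,  # contact area: 1/k^2
        "resistance_ohm": resistance_ohm * k,         # resistance: k
        "current_ua": current_ua / k,                 # programming current: 1/k
    }

scaled = scale_pcm(k=2.0)
print(scaled)
```

Halving the feature size (k = 2) thus quarters the contact area while halving the required programming current, which is what permits smaller access devices.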
Operational issues arise with aggressive PCM technology
scaling. As contact area decreases, lateral thermal coupling
may cause programming currents for one cell to influence the states of adjacent cells. Lai’s survey of PCM finds
these effects negligible in measurement and simulation.9
Temperatures fall exponentially with distance from the programmed cell, suggesting no appreciable impact from thermal coupling. Increasing resistance from smaller contact areas may reduce signal strength (i.e., a smaller resistivity difference between the crystalline and amorphous states).
However, these signal strengths are well within the sense
circuit capabilities of modern memory architectures.9
2.4. Array architecture
As shown in Figure 2, PCM cells might be hierarchically
organized into banks, blocks, and subblocks. Despite similarities to conventional memory architectures, PCM-specific
issues must be addressed. For example, PCM reads are nondestructive whereas DRAM reads are destructive and require
mechanisms to replenish discharged capacitors.
Sense amplifiers detect the change in bitline state when
a memory row is accessed. Choice of bitline sense amplifiers affects array read access time. Voltage sense amplifiers
are cross-coupled inverters which require differential discharging of bitline capacitances. In contrast, current sense
amplifiers rely on current differences to create a differential
voltage at the amplifier’s output nodes. Current sensing is
faster but requires larger circuits.18
In DRAM, sense amplifiers serve a dual purpose, both
sensing and buffering data using cross-coupled inverters. In contrast, we explore PCM architectures with separate sensing and buffering; sense amplifiers drive banks of
explicit latches. These latches provide greater flexibility in
row buffer organization by enabling multiple buffered rows.
However, these latches incur area overheads. Separate sensing and buffering enables multiplexed sense amplifiers.
Multiplexing also enables buffer widths narrower than the
array width, which is defined by the total number of bitlines.
Buffer width is a critical design parameter, determining the
required number of expensive current sense amplifiers.
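With separate sensing and buffering, the number of current sense amplifiers tracks the buffer width rather than the array width. The sketch below makes this trade-off concrete; the 2048B array row width is taken from the baseline in Section 3.1, while the candidate buffer widths and the one-amplifier-per-buffered-bit assumption are illustrative.

```python
# Illustrative sketch: narrower row buffers need fewer current sense
# amplifiers when sensing is multiplexed across the array's bitlines.
# Buffer widths and the 1-S/A-per-buffered-bit model are assumptions.

def sense_amp_count(buffer_width_bytes):
    """One current sense amplifier per buffered bit."""
    return buffer_width_bytes * 8

array_width_bits = 2048 * 8  # bitlines, assuming a 2048B array row

for width in (2048, 512, 64):  # candidate buffer widths in bytes
    n = sense_amp_count(width)
    mux = array_width_bits // n
    print(f"{width:5d}B buffer -> {n:6d} sense amps, {mux}:1 multiplexing")
```

Narrowing the buffer from 2048B to 64B cuts the sense-amplifier count 32-fold, at the cost of steering bitlines through a 32:1 multiplexer.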
3. A DRAM Alternative
We express PCM device and circuit characteristics within
conventional DDR timing and energy parameters, thereby
quantifying PCM in the context of more familiar DRAM
parameters to facilitate a direct comparison.10
Figure 2. Array architecture. A hierarchical memory organization includes banks, blocks, and subblocks with local and global decoding for row and column addresses. Sense amplifiers (S/A) and word drivers (W/D) are multiplexed across blocks.
We evaluate a four-core chip multiprocessor using the
SESC simulator.16 The 4-way superscalar, out-of-order cores
operate at 4.0 GHz. This datapath is supported by 32KB,
direct-mapped instruction and 32KB, 4-way data L1 caches,
which may be accessed in two to three cycles. A 4MB, 8-way
L2 cache with 64B lines is shared between the four cores and
may be accessed in 32 cycles.
Below the caches is a 400 MHz SDRAM memory subsystem modeled after Micron's DDR2-800 technical specifications.12 We consider one channel, one rank, and four ×16 chips per rank to achieve the standard 8B interface.
Internally, each chip is organized into four banks to facilitate throughput as data are interleaved across banks and
accessed in parallel. We model a burst length of eight blocks.
The memory controller has a 64-entry transaction queue.
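The configuration above can be checked with a little arithmetic: four ×16 chips form the 8B (64-bit) channel, a burst of eight transfers moves exactly one 64B L2 line, and DDR2-800's 800 MT/s rate yields a 6.4 GB/s peak. A minimal sketch of that consistency check, under those stated figures:

```python
# Sanity-check arithmetic for the modeled DDR2-800 memory subsystem.
chips_per_rank = 4
bits_per_chip = 16                                    # x16 parts

interface_bytes = chips_per_rank * bits_per_chip // 8  # 8B channel
burst_length = 8
burst_bytes = interface_bytes * burst_length           # one 64B L2 line

transfers_per_sec = 800e6                              # DDR: 2 x 400 MHz
peak_gb_per_sec = transfers_per_sec * interface_bytes / 1e9

print(interface_bytes, burst_bytes, peak_gb_per_sec)
```

The burst size matching the 64B L2 line means each cache fill consumes exactly one burst.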
We consider parallel workloads from the SPLASH-2 suite
(fft, radix, ocean), the SPEC OpenMP suite (art, equake, swim),
and the NAS parallel benchmarks (cg, is, mg).1, 2, 19 Regarding
input sets, we use 1M points for fft, a 514×514 grid for ocean,
and 2M integers for radix. SPEC OpenMP workloads run with
the MinneSpec-Large data set, and the NAS parallel benchmarks run
with Class A problem sizes. Applications are compiled using
gcc and Fortran compilers at the -O3 optimization level.
3.1. Baseline comparison
We consider a PCM baseline architecture, which implements DRAM-style buffering with a single 2048B-wide
buffer. Figure 3a illustrates end-to-end application performance when PCM replaces DRAM as main memory.
Application delay increases, with penalties relative to DRAM
ranging from 1.2× (radix) to 2.2× (ocean, swim). On average, we observe a 1.6× delay penalty. The energy penalties
are larger, ranging from 1.4× (cg) to 3.4× (ocean), due to the
highly expensive array writes required when buffer contents