The performance of flash is a bit unusual in that it’s
highly asymmetric, posing a challenge for using it in a
storage system. A block of flash must be erased before it
can be written, which takes on the order of 1-2 ms for a
block, and writing to erased flash requires around 200-300
µs. For this reason flash devices try to maintain a pool of
previously erased blocks so that the latency of a write is
just that of the program operation. Read operations are
much faster: approximately 25 µs for 4 KB. By comparison,
raw DRAM is even faster, able to perform reads and writes
in much less than a microsecond. Disk-drive latency
depends on the rotational speed of the drive: on average
4.2 ms for 7200 RPM, 3 ms for 10,000 RPM, and 2 ms for
15,000 RPM. Adding in the seek time bumps these latencies up an additional 3-10 ms depending on the quality of
the mechanical components.
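
The rotational figures above follow from simple arithmetic: on average the platter must turn half a revolution before the requested sector passes under the head. The sketch below reproduces those numbers; the function names are illustrative, and the seek time is a caller-supplied assumption rather than a measured value.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in ms: the time for half a revolution."""
    ms_per_revolution = 60_000.0 / rpm  # 60,000 ms in one minute
    return ms_per_revolution / 2.0


def avg_access_latency_ms(rpm: float, seek_ms: float) -> float:
    """Rotational latency plus an assumed average seek time, in ms."""
    return avg_rotational_latency_ms(rpm) + seek_ms


if __name__ == "__main__":
    for rpm in (7200, 10_000, 15_000):
        print(f"{rpm:>6} RPM: {avg_rotational_latency_ms(rpm):.1f} ms")
```

With the 3-10 ms seek range quoted above, a 7200-RPM drive would land somewhere between roughly 7 and 14 ms per random access under this model.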
SLC flash is typically rated to sustain 1 million program/erase cycles per block. As flash cells are stressed, they
lose their ability to record and retain values. Because of
the limited lifetime, flash devices must take care to ensure
that cells are stressed uniformly so that “hot” cells don’t
cause premature device failure. This is done through a process known as wear leveling. Just as disk drives keep a pool
of spare blocks for bad-block remapping, flash devices
typically present themselves to the operating system as
significantly smaller than the amount of raw flash to
maintain a reserve of spare blocks (and pre-erased blocks
for performance). Most flash devices are also capable of
estimating their own remaining lifetimes so systems can
anticipate failure and take prophylactic action.
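
As a rough illustration of wear leveling, the toy allocator below always hands out the free block with the fewest erase cycles, so no single block is stressed far ahead of the rest. This is a minimal sketch, not how any particular device firmware works: the class name, the heap-based policy, and the lifetime estimate (based on the most-worn block) are all assumptions made for the example.

```python
import heapq


class WearLevelingAllocator:
    """Toy model: always allocate the least-worn free block (hypothetical)."""

    def __init__(self, num_blocks: int, rated_cycles: int = 1_000_000):
        self.rated_cycles = rated_cycles
        self.erase_counts = [0] * num_blocks
        # Min-heap of (erase_count, block_id) for blocks that are free.
        self.free = [(0, block) for block in range(num_blocks)]
        heapq.heapify(self.free)

    def allocate(self) -> int:
        """Return the free block with the fewest erase cycles so far."""
        _, block = heapq.heappop(self.free)
        return block

    def erase_and_release(self, block: int) -> None:
        """Erase a block (costing one cycle) and return it to the free pool."""
        self.erase_counts[block] += 1
        heapq.heappush(self.free, (self.erase_counts[block], block))

    def remaining_life_fraction(self) -> float:
        """Assumed estimate: life left in the most-worn block."""
        worst = max(self.erase_counts)
        return max(0.0, 1.0 - worst / self.rated_cycles)


if __name__ == "__main__":
    allocator = WearLevelingAllocator(num_blocks=4, rated_cycles=10)
    for _ in range(8):
        allocator.erase_and_release(allocator.allocate())
    print(allocator.erase_counts, allocator.remaining_life_fraction())
```

Even if a workload rewrites the same logical data repeatedly, the erases in this model are spread evenly across the physical blocks.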
THE STORAGE HIERARCHY OF TODAY
TABLE 1: Power Consumption Comparison

Device                                 Power
DRAM DIMM module (1 GB)                5 W
15,000-RPM drive (300 GB)              17.2 W
7200-RPM drive (750 GB)                12.6 W
High-performance flash SSD (128 GB)    2 W

Whether over a network or for local access, primary
storage can be succinctly summarized as a head unit
containing CPUs and DRAM attached to drives either in
storage arrays or JBODs (just a bunch of disks). The disks
are the primary repository for data (typical modern data
sets range from a few hundred gigabytes up to a petabyte
or more), while DRAM acts as a very fast cache. Clients
communicate via read and write operations. Read operations are always synchronous in that the client is blocked
until the operation is serviced, whereas write operations
may be either synchronous or asynchronous depending
on the application. For example, video streams may write
data blocks asynchronously and verify only at the end
of the stream that all data has been quiesced; databases,
however, typically use synchronous writes to ensure that
every transaction has been committed to stable storage.
On a typical system, the speed of a synchronous write
is bounded by the latency of nonvolatile storage, as writes
must be committed before they can be acknowledged.
Read operations first check the DRAM cache, which provides
very low-latency service times, but cache misses must also
wait for the slow procession of data around the spindle.
Since it’s quite common to have working sets larger than
the meager DRAM available, even the best prefetching
algorithms will leave many read operations blocked on
the disk.
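
One way to see why cache misses dominate is a simple weighted-average model of read latency. The sketch below is an illustration under stated assumptions: the default DRAM figure (0.001 ms) is an upper bound taken from "much less than a microsecond", and the default disk figure (7.2 ms) assumes a 7200-RPM drive with a 3-ms seek. Neither value is a measurement.

```python
def expected_read_latency_ms(
    hit_rate: float,
    cache_ms: float = 0.001,  # assumed DRAM service time (upper bound)
    disk_ms: float = 7.2,     # assumed 4.2 ms rotation + 3 ms seek
) -> float:
    """Average read latency for a given DRAM cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_ms + (1.0 - hit_rate) * disk_ms


if __name__ == "__main__":
    for hit_rate in (0.5, 0.9, 0.99):
        latency = expected_read_latency_ms(hit_rate)
        print(f"hit rate {hit_rate:.0%}: {latency:.3f} ms")
```

Under these assumptions, even a 99 percent hit rate leaves the average read dozens of times slower than a pure DRAM access, because each miss costs thousands of times more than a hit.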
A brute-force solution for improving latency is simply
to spin the platters faster to reduce rotational latency,
using 15,000-RPM drives rather than 10,000- or 7,200-
RPM drives. This will improve both read and write
latency, but only by a factor of two or so. For example,
a 10-TB data set on a 7,200-RPM drive (from a major
vendor, at current prices) would cost about $3,000 and
dissipate 112 watts; the same data set on a 15,000-RPM
drive would cost $22,000 and dissipate 473 watts—all
for a latency improvement of a bit more than a factor of
two. The additional cost and power overhead make this
an unsatisfying solution, though it is widely employed
absent a clear alternative.
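
The trade-off in this example can be made explicit with a few ratios. The sketch below uses the cost and power figures quoted above and, as a simplifying assumption, counts only rotational latency (half a revolution) and ignores seek time. The `Config` type and `compare` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """A storage configuration for the 10-TB example (illustrative)."""
    rpm: int
    cost_usd: float
    power_w: float

    @property
    def rotational_latency_ms(self) -> float:
        # Half a revolution on average: 60,000 ms per minute / rpm / 2.
        return 30_000.0 / self.rpm


def compare(slow: Config, fast: Config) -> dict:
    """Ratios of latency gain, cost, and power when moving to faster drives."""
    return {
        "latency_gain": slow.rotational_latency_ms / fast.rotational_latency_ms,
        "cost_ratio": fast.cost_usd / slow.cost_usd,
        "power_ratio": fast.power_w / slow.power_w,
    }


if __name__ == "__main__":
    slow = Config(rpm=7200, cost_usd=3_000, power_w=112)
    fast = Config(rpm=15_000, cost_usd=22_000, power_w=473)
    for name, value in compare(slow, fast).items():
        print(f"{name}: {value:.1f}x")
```

By this measure the faster drives cost about 7.3 times as much and draw about 4.2 times the power for roughly a 2.1x reduction in rotational latency.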
A focused solution for improving the performance of
synchronous writes is to add NVRAM (nonvolatile RAM)
in the form of battery-backed DRAM, usually on a PCI
card. Writes are committed to the NVRAM ring buffer and
immediately acknowledged to the client while the data
is asynchronously written out to the drives. Once the
data has been committed to disk, the corresponding
record can be freed in the NVRAM. This technique
allows for a tremendous improvement for synchronous
writes, but suffers some downsides. NVRAM is quite
expensive; batteries fail (or leak or, worse,