committed to stable storage.
On a typical system, the speed of a
synchronous write is bounded by the
latency of nonvolatile storage, as writes
must be committed before they can
be acknowledged. Read operations
first check the DRAM cache, which provides very low-latency service times, but
cache misses must wait for the slow
procession of data around the spindle.
Since it’s quite common to have working sets larger than the meager DRAM
available, even the best prefetching
algorithms will leave many read operations blocked on the disk.
A brute-force solution for improving latency is simply to spin the platters faster to reduce rotational latency,
using 15,000RPM drives rather than
10,000 or 7,200RPM drives. This will
improve both read and write latency,
but only by a factor of two or so. For example, using drives from a major vendor, at current prices, a 10TB data set
on a 7,200RPM drive would cost about
$3,000 and dissipate 112 watts; the
same data set on a 15,000RPM drive
would cost $22,000 and dissipate 473
watts—all for a latency improvement
of a bit more than a factor of two. The
additional cost and power overhead
make this an unsatisfying solution,
though it is widely employed absent a
clear alternative.
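The factor-of-two figure follows directly from the physics: average rotational latency is half a revolution, so it scales inversely with spindle speed. A quick sketch of the arithmetic (rotational latency only, ignoring seek time):

```python
# Average rotational latency is half a revolution: (60 / rpm) / 2 seconds.
# Spinning the platters roughly twice as fast buys roughly half the
# latency, which is why 15,000-RPM drives improve on 7,200-RPM drives
# by only a bit more than a factor of two.
def avg_rotational_latency_ms(rpm: float) -> float:
    return (60.0 / rpm / 2.0) * 1000.0

for rpm in (7200, 10000, 15000):
    print(f"{rpm:>6} RPM: {avg_rotational_latency_ms(rpm):.2f} ms")
```

For 7,200, 10,000, and 15,000 RPM this prints 4.17, 3.00, and 2.00 ms respectively: a bit more than a factor of two between the extremes, matching the latency improvement cited above.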
A focused solution for improving the
performance of synchronous writes is
to add nonvolatile RAM (NVRAM) in the
form of battery-backed DRAM, usually
on a PCI card. Writes are committed to
the NVRAM ring buffer and immediately acknowledged to the client while
the data is asynchronously written out to the drives.

[Figure 2: Flash cost per GB, 2003–2007.]

Once the data has been
committed to disk, the corresponding
record can be freed in the NVRAM. This
technique allows for a tremendous improvement in synchronous write latency, but
suffers from some downsides. NVRAM is
quite expensive; batteries fail (or leak,
or, worse, explode); and the maximum
size of NVRAM tends to be small (2GB–
4GB)—small enough that workloads
can fill the entire ring buffer before it
can be flushed to disk.
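The write path just described, commit to the ring buffer, acknowledge immediately, flush to disk in the background, and free the record, can be sketched as a toy model. This is illustrative Python, not actual driver code; the `NVRAMLog` name and its interface are invented for the example:

```python
from collections import deque

class NVRAMLog:
    """Toy model of the battery-backed NVRAM write path: a write is
    acknowledged as soon as it lands in the (nonvolatile) ring buffer,
    and its record is freed only after the asynchronous flush to disk."""
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.ring = deque()

    def write(self, data: bytes) -> bool:
        # If the workload outruns the flush, the ring buffer fills and
        # writes stall -- the downside noted above for small NVRAMs.
        if self.used + len(data) > self.capacity:
            return False  # caller must wait for a flush
        self.ring.append(data)
        self.used += len(data)
        return True  # acknowledged to the client immediately

    def flush_one(self, disk: list) -> None:
        # Asynchronously commit the oldest record to disk, then free it.
        if self.ring:
            data = self.ring.popleft()
            disk.append(data)       # now on stable storage
            self.used -= len(data)  # record freed in NVRAM
```

The `write` path returning `False` models exactly the failure mode above: a 2GB–4GB buffer that fills faster than it can drain.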
Flash as a Log Device
One use of flash is as a stand-in for
NVRAM: a log device that can improve write performance. To that end
you need a device that mimics the important properties of NVRAM (fast,
persistent writes), while avoiding the
downsides (cost, size, battery power).
Recall, however, that while achieving
good write bandwidth is fairly easy, the
physics of flash dictate that individual
writes exhibit relatively high latency.
Even so, it’s possible to build a flash-based device that can service write
operations very quickly by inserting
a DRAM write cache and then treating that write cache as nonvolatile by
adding a supercapacitor to provide the
necessary power to flush outstanding
data in the DRAM to flash in the case of
power loss.
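The device just described can be modeled the same way: writes are acknowledged at DRAM speed, and the supercapacitor guarantees the cache can always be drained to flash. Again a hedged sketch, with the `FlashLogDevice` name and methods invented for illustration:

```python
class FlashLogDevice:
    """Toy model of the write path sketched above: writes land in a small
    DRAM write cache and are acknowledged at DRAM latency; a supercapacitor
    guarantees the cache can be drained to flash if power is lost."""
    def __init__(self):
        # Fast but volatile; made effectively nonvolatile by the supercap.
        self.dram_cache = []
        # Persistent, but with relatively high per-write latency.
        self.flash = []

    def write(self, data: bytes) -> None:
        self.dram_cache.append(data)  # acknowledged here, in microseconds

    def drain(self) -> None:
        # Background path (or supercap-powered path on power loss):
        # move everything outstanding in DRAM down to flash.
        self.flash.extend(self.dram_cache)
        self.dram_cache.clear()

    def on_power_loss(self) -> None:
        self.drain()  # the supercapacitor supplies the energy for this
```

The design choice worth noting: the DRAM cache only needs to be large enough to keep the flash busy, which is what keeps the power (and supercapacitor) requirements small.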
Many applications such as databases can use a dedicated log device
as a way of improving the performance
of write operations; for these applications, such a device can be dropped in
easily. To bring the benefits of a flash
log device to primary storage, and
therefore to a wide array of applications, we need similar functionality in
a general-purpose file system. Sun’s
ZFS provides a useful context for the
use of flash. ZFS, an enterprise-class
file system designed for the scale and
requirements of modern systems, was
implemented from scratch starting
in 2001. It discards the model of a file
system sitting on a volume manager in
favor of pooled storage both for simplicity of management and greater flexibility for optimizing performance. ZFS
maintains its on-disk data structures
in a way that is always consistent, eliminating the need for consistency checking after an unexpected power failure.
Furthermore, it is flexible enough to
accommodate new technological advances, such as new uses of flash. (For a
complete description of ZFS, see http://
opensolaris.org/os/community/zfs.)
ZFS provides for the use of a separate intent-log device (a slog in ZFS
jargon) to which synchronous writes
can be quickly written and acknowledged to the client before the data is
written to the storage pool. The slog
is used only for small transactions,
while large transactions use the main
storage pool—it’s tough to beat the
raw throughput of large numbers
of disks. The flash-based log device
would be ideally suited for a ZFS slog.
The write buffer on the flash device has
to be only large enough to saturate the
bandwidth to flash. Its DRAM size requirements—and therefore the power
requirements—are quite small. Note
also that the write buffer is much smaller
than the DRAM required in a battery-backed NVRAM device. There are effectively no constraints on the amount
of flash that could be placed on such a
device, but experimentation has shown
that 10GB of delivered capacity is more
than enough for the vast majority of
use cases.
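The division of labor described above, small synchronous writes to the slog and large transactions to the main pool, amounts to a simple routing policy. A hypothetical sketch; the 32KB threshold is illustrative only and is not ZFS's actual cutoff:

```python
# Illustrative routing policy: small synchronous writes go to the fast
# log device (slog), while large transactions go straight to the main
# pool, whose aggregate disk bandwidth is hard to beat.
SLOG_THRESHOLD = 32 * 1024  # hypothetical cutoff, not ZFS's real value

def route_sync_write(size_bytes: int) -> str:
    return "slog" if size_bytes <= SLOG_THRESHOLD else "pool"

print(route_sync_write(4 * 1024))     # small transaction -> "slog"
print(route_sync_write(1024 * 1024))  # large transaction -> "pool"
```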
Using such a device with ZFS in a
test system, we measured latencies
in the range of 80–100 µs. This approaches the performance of NVRAM
and has many other benefits. A common concern for flash is its longevity.
SLC flash is often rated for one million
write/erase cycles, but beyond several
hundred thousand, the data-retention