explode); and the maximum size of NVRAM tends to be
small (2-4 GB), small enough that workloads can fill the
entire ring buffer before it can be flushed to disk.
FLASH AS A LOG DEVICE
One use of flash is as a stand-in for NVRAM that can
improve write performance as a log device. To that end
you need a device that mimics the important properties
of NVRAM (fast, persistent writes), while avoiding the
downsides (cost, size, battery power). Recall, however,
that while achieving good write bandwidth is fairly easy,
the physics of flash dictate that individual writes exhibit
relatively high latency. Even so, it is possible to build a
flash-based device that can service write operations very
quickly. This is done by inserting a DRAM write cache and
then treating it as nonvolatile by adding a supercapacitor
that, in case of power loss, provides the necessary power
to flush outstanding data in the DRAM to flash.
Many applications, such as databases, can use a dedicated log device as a way of improving the performance
of write operations; for these applications, such a device
can be easily dropped in. To bring the benefits of a flash
log device to primary storage, and therefore to a wide
array of applications, we need similar functionality in a
general-purpose file system. Sun’s ZFS provides a useful context for the use of flash. ZFS, an enterprise-class
file system designed for the scale and requirements of
modern systems, was implemented from scratch starting
in 2001. It discards the model of a file system sitting on a
volume manager in favor of pooled storage for both simplicity of management and greater flexibility for optimizing performance. ZFS maintains its on-disk data structures
in a way that is always consistent, eliminating the need
for consistency checking after an unexpected power failure. Furthermore, it is flexible enough to accommodate
new technological advances, such as new uses of flash.
(For a complete description of ZFS, see http://opensolaris.org/os/community/zfs.)
ZFS provides for the use of a separate intent-log device
(a slog in ZFS jargon) to which synchronous writes can be
quickly written and acknowledged to the client before the
data is written to the storage pool. The slog is used only
for small transactions, while large transactions use the
main storage pool; it’s tough to beat the raw throughput
of large numbers of disks. The flash-based log device
would be ideally suited for a ZFS slog. The write buffer on
the flash device has to be only large enough to saturate
the bandwidth to flash. Its DRAM size requirements—and
therefore the power requirements—are quite small.
Note also that the write buffer is much smaller than the
required DRAM in a battery-backed NVRAM device. There
are effectively no constraints on the amount of flash that
could be placed on such a device, but experimentation
has shown that 10 GB of delivered capacity is more than
enough for the vast majority of use cases.
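
The split between small and large synchronous writes described above can be sketched as a simple routing decision. This is an illustrative sketch only, not ZFS code: the function name `choose_log_target` and the 32-KB cutoff `SLOG_MAX_BYTES` are assumptions introduced here, and the article does not state what size separates "small" from "large" transactions.

```python
# Hypothetical cutoff between "small" and "large" transactions.
# The article gives no figure; 32 KB is an assumption for illustration.
SLOG_MAX_BYTES = 32 * 1024


def choose_log_target(write_bytes: int, has_slog: bool,
                      slog_max_bytes: int = SLOG_MAX_BYTES) -> str:
    """Return where a synchronous write is logged first.

    Small writes go to the separate intent-log device (slog) so they can
    be acknowledged quickly; large writes go to the main storage pool,
    where many disks together provide more raw throughput.
    """
    if has_slog and write_bytes <= slog_max_bytes:
        return "slog"
    return "pool"


if __name__ == "__main__":
    for size in (4 * 1024, 8 * 1024, 1024 * 1024):
        print(size, choose_log_target(size, has_slog=True))
```

With no slog present, every write in this sketch falls through to the pool, which mirrors the behavior of a system without a dedicated log device.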
Using such a device with ZFS in a test system, we measured write latencies in the range of 80-100 µs. This approaches
the performance of NVRAM and has many other benefits.
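
To put the measured latency in perspective, the following sketch converts a per-write latency into the number of back-to-back synchronous writes a single client could complete per second. It assumes only one write is outstanding at a time and that latency is the only cost; the function name `sync_writes_per_second` is introduced here for illustration.

```python
def sync_writes_per_second(latency_us: float) -> float:
    """Upper bound on serialized synchronous writes per second.

    Assumes one outstanding write at a time, so each write must be
    acknowledged before the next one is issued.
    """
    return 1_000_000 / latency_us


if __name__ == "__main__":
    for latency_us in (80, 100):
        print(latency_us, "us ->", sync_writes_per_second(latency_us), "writes/s")
```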
A common concern about flash is its longevity. SLC flash
is often rated for 1 million write/erase cycles, but beyond
several hundred thousand, the data-retention period can
drop to just a few weeks. ZFS will write to this device
as a slog in 8-KB chunks with each operation taking 80
µs. On a device with 10 GB of raw flash, wearing every block through its rated 1 million cycles at that pace takes
about 3½ years of constant use. A flash device with a
formatted capacity of 10 GB will, however, typically have
20-50 percent more flash held in reserve, easily taking
the longevity of such a device under constant use to five
years. The device itself can report its expected remaining
lifetime as it counts down its dwindling reserve of spare
blocks. Further, data need be retained only long enough
for the system to recover from a fatal error; a reasonable
standard is 72 hours, so a few weeks of data retention,
even for very old flash cells, is more than adequate and a
vast improvement on NVRAM.
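
The longevity estimate can be reproduced with a few lines of arithmetic. The sketch below is a rough model under stated assumptions: perfect wear leveling (every chunk is erased equally often), no write amplification, binary units (GiB and KiB), and writes issued back to back with no idle time. The function and constant names are introduced here for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
GIB = 1024 ** 3
KIB = 1024


def years_of_constant_writes(flash_bytes: int, chunk_bytes: int,
                             rated_cycles: int,
                             seconds_per_write: float) -> float:
    """Years until every chunk has used up its rated write/erase cycles.

    Assumes ideal wear leveling and one write per chunk per cycle.
    """
    chunks = flash_bytes / chunk_bytes
    total_writes = chunks * rated_cycles
    return total_writes * seconds_per_write / SECONDS_PER_YEAR


if __name__ == "__main__":
    # 10 GB of raw flash, 8-KB chunks, 1 million cycles, 80 us per write.
    raw = years_of_constant_writes(10 * GIB, 8 * KIB, 1_000_000, 80e-6)
    # The same device with 50 percent more flash held in reserve.
    with_reserve = years_of_constant_writes(15 * GIB, 8 * KIB, 1_000_000, 80e-6)
    print(f"raw: {raw:.1f} years, with 50% reserve: {with_reserve:.1f} years")
```

Under these assumptions the raw figure comes out a little over three years, and the upper end of the 20-50 percent reserve range brings it to roughly five, in line with the estimates in the text.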
FLASH AS A CACHE
The other half of this performance picture is read latency.
Storage systems typically keep a DRAM cache of data the
system determines a consumer is likely to access so that
it can service read requests from that cache rather than
wait for the disk. In ZFS, this subsystem is called the ARC
(adaptive replacement cache). The policies that determine
which data is present in the ARC attempt to anticipate
future needs, but read requests can still miss the cache as
a result of bad predictions or because the working set is
simply larger than the cache can hold—or