period can drop to just a few weeks.
ZFS will write to this device as a slog in
8KB chunks with each operation taking 80 µs. On a device with 10GB of raw
flash, this equates to about 3½ years of
constant use. A flash device with a formatted capacity of 10GB will, however,
[Figure. Top: a 10MB CompactFlash card from 1996. Bottom: a 2GB SD flash card from 2008.]

[Table. Power consumption for typical components.
Device                                  Approximate power consumption
DRAM DIMM module (1GB)                  5W
15,000 RPM drive (300GB)                17.2W
7,200 RPM drive (750GB)                 12.6W
High-performance flash SSD (128GB)      2W]
typically have 20%–50% more flash held
in reserve, easily taking the longevity of
such a device under constant use to five
years, and the device itself can easily report its expected remaining lifetime as
it counts down its dwindling reserve of
spare blocks. Further, data needs to be
retained only long enough for the system to recover from a fatal error; a reasonable standard is 72 hours, so a few
weeks of data retention, even for very
old flash cells, is more than adequate
and a vast improvement on NVRAM.
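As a back-of-the-envelope check on that longevity estimate, the arithmetic can be sketched as follows. The figure of roughly one million write/erase cycles per cell and the assumption of perfect wear leveling are illustrative assumptions, not taken from any particular device's data sheet:

    # Rough slog endurance estimate (illustrative assumptions: about
    # 1,000,000 write/erase cycles per cell and perfect wear leveling).
    WRITE_SIZE = 8 * 1024      # bytes per slog write
    WRITE_TIME = 80e-6         # seconds per write operation
    RAW_FLASH = 10e9           # 10GB of raw flash
    CYCLES = 1_000_000         # assumed write/erase cycles per cell

    write_bandwidth = WRITE_SIZE / WRITE_TIME             # ~100MB/s of constant writes
    seconds_per_pass = RAW_FLASH / write_bandwidth        # time to cycle the whole device once
    lifetime_years = CYCLES * seconds_per_pass / (365 * 24 * 3600)
    print(f"{lifetime_years:.1f} years of constant use")  # roughly 3 years, in the neighborhood
                                                           # of the figure cited above
    # With 50% of additional flash held in reserve:
    print(f"{lifetime_years * 1.5:.1f} years with 50% spare capacity")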
Flash as a Cache
The other half of this performance picture is read latency. Storage systems
typically keep a DRAM cache of data
the system determines a consumer is
likely to access so that it can service
read requests from that cache rather
than waiting for the disk. In ZFS, this
subsystem is called the adaptive replacement cache (ARC). The policies
that determine which data is present
in the ARC attempt to anticipate future
needs, but read requests can still miss
the cache as a result of bad predictions
or because the working set is simply
larger than the cache can hold—or
even larger than the maximum configurable amount of DRAM on a system. Flash is well suited for acting as
a new second-level cache in between
memory and disk in terms of capacity
and performance. In ZFS, this is called
the L2ARC.
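Conceptually, the read path simply falls through the cache layers in order. A minimal sketch of that lookup, with hypothetical names and structure rather than the actual ZFS code, might look like this:

    # Schematic read path through a two-level cache hierarchy
    # (hypothetical structure; not the actual ZFS implementation).
    arc = {}     # DRAM-based ARC: block id -> data
    l2arc = {}   # flash-based L2ARC: block id -> data
    disk = {i: f"block-{i}" for i in range(1000)}   # stand-in for the pool's disks

    def read_block(blkid):
        if blkid in arc:                 # hottest data lives in DRAM
            return arc[blkid]
        if blkid in l2arc:               # second-level cache on flash
            data = l2arc[blkid]
            arc[blkid] = data            # promote back into the ARC
            return data
        data = disk[blkid]               # last resort: wait for the disk
        arc[blkid] = data
        return data

    print(read_block(42))   # misses both caches, served from disk
    print(read_block(42))   # now served from the ARC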
ZFS fills the L2ARC using large, asynchronous writes and uses the cache to
seamlessly satisfy read requests from
clients. The requirements here are a
perfect fit for flash, which inherently
has sufficient write bandwidth and fantastic read latency. Since these devices
can be external—rather than being attached to the main board, as is the case
with DRAM—the size of the L2ARC is
limited only by the amount of DRAM
required for bookkeeping (at a ratio
of 50:1 in the current ZFS implementation). For example, the maximum
memory configuration on a four-socket machine is usually around 128GB;
such a system can easily accommodate 768GB or more using flash SSDs
in its internal drive bays. ZFS’s built-in checksums catch cache inconsistencies and mean that defective flash
blocks simply lead to fewer cache hits
rather than data loss.
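To put that 50:1 ratio in concrete terms, the bookkeeping arithmetic for the example above can be sketched as follows:

    # DRAM bookkeeping cost for a large L2ARC at the article's ~50:1 ratio.
    L2ARC_SIZE = 768e9          # 768GB of flash SSDs in the drive bays
    RATIO = 50                  # bytes of L2ARC per byte of DRAM bookkeeping

    dram_overhead = L2ARC_SIZE / RATIO
    print(f"{dram_overhead / 1e9:.1f} GB of DRAM")   # ~15GB out of a 128GB system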
In the context of the memory hierarchy, caches are often populated as entries are evicted from the previous layer—in an exclusive cache architecture,
on-chip caches are evicted to off-chip
caches, and so on. With a flash-based
cache, however, the write latency is so
poor the system could easily be bogged
down waiting for evictions. Accordingly, the L2ARC uses an evict-ahead
policy: it aggregates ARC entries and
predictively pushes them out to flash,
thus amortizing the cost over large
operations and ensuring that there is
no additional latency when the time
comes to evict an entry from the ARC.
The L2ARC iterates over its space as a
ring, starting back at the beginning
once it reaches the end, thereby avoiding any potential for fragmentation. Although this technique does mean that
entries in the L2ARC that may soon be
accessed could be overwritten prematurely, bear in mind that the hottest
data will still reside in the DRAM-based
ARC. ZFS will write to the L2ARC slowly, meaning that it can take some time
to warm up; but once warm, it should
remain so, as long as the writes to the
cache can keep up with data churn on
the system.
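A toy model of the evict-ahead, ring-style feed might look like the following sketch. It is hypothetical and greatly simplified; the real L2ARC feed logic handles far more detail:

    # Toy model of an evict-ahead policy feeding a flash cache laid out as a ring
    # (hypothetical and simplified relative to the real L2ARC feed thread).
    DEVICE_SIZE = 100 * 1024    # capacity of the cache device, in bytes
    WRITE_BATCH = 8 * 1024      # aggregate this many bytes before issuing a write

    pending = []                # sizes of ARC entries predicted to be evicted soon
    write_offset = 0            # current position in the ring

    def feed(entry_size, issue_write):
        """Aggregate soon-to-be-evicted entries and push them to flash ahead of time."""
        global write_offset
        pending.append(entry_size)
        if sum(pending) >= WRITE_BATCH:
            batch = sum(pending)
            pending.clear()
            if write_offset + batch > DEVICE_SIZE:
                write_offset = 0                 # wrap around: overwrite the oldest entries
            issue_write(write_offset, batch)     # one large, asynchronous write
            write_offset += batch

    feed(4096, lambda off, size: print(f"write {size} bytes at offset {off}"))
    feed(4096, lambda off, size: print(f"write {size} bytes at offset {off}"))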
It’s worth noting that to this point
the L2ARC hasn’t even taken advantage of what is usually considered to
be a key feature of flash: nonvolatility.
Under normal operation, the L2ARC
treats flash as cheap and vast storage.
As it writes blocks of data to populate
the cache devices, however, the L2ARC
includes a directory so that after a
power loss, the contents of the cache
can be identified, thus pre-warming
the cache. Although resets are rare,
system failures, power failures, and
downtime due to maintenance are all
inevitable; the instantly warmed cache
reduces the slow performance ramp
typical of a system after a reset. Since