period can drop to just a few weeks. ZFS will write to this device as a slog in 8KB chunks with each operation taking 80 µs. On a device with 10GB of raw flash, this equates to about 3½ years of constant use. A flash device with a formatted capacity of 10GB will, however, typically have 20%–50% more flash held in reserve, easily taking the longevity of such a device under constant use to five years, and the device itself can easily report its expected remaining lifetime as it counts down its dwindling reserve of spare blocks. Further, data needs to be retained only long enough for the system to recover from a fatal error; a reasonable standard is 72 hours, so a few weeks of data retention, even for very old flash cells, is more than adequate and a vast improvement on NVRAM.

Top: a 10MB CompactFlash card from 1996. Bottom: a 2GB SD flash card from 2008.

Power consumption for typical components.

| Device | Approximate power consumption |
|---|---|
| DRAM DIMM module (1GB) | 5W |
| 15,000 RPM drive (300GB) | 17.2W |
| 7,200 RPM drive (750GB) | 12.6W |
| High-performance flash SSD (128GB) | 2W |

Flash as a cache

The other half of this performance picture is read latency. Storage systems typically keep a DRAM cache of data the system determines a consumer is likely to access so that it can service read requests from that cache rather than waiting for the disk. In ZFS, this subsystem is called the adaptive replacement cache (ARC). The policies that determine which data is present in the ARC attempt to anticipate future needs, but read requests can still miss the cache as a result of bad predictions or because the working set is simply larger than the cache can hold, or even larger than the maximum configurable amount of DRAM on a system. Flash is well suited for acting as a new second-level cache in between memory and disk in terms of capacity and performance. In ZFS, this is called the L2ARC.

ZFS fills the L2ARC using large, asynchronous writes and uses the cache to seamlessly satisfy read requests from clients. The requirements here are a perfect fit for flash, which inherently has sufficient write bandwidth and fantastic read latency. Since these devices can be external (rather than being attached to the main board, as is the case with DRAM), the size of the L2ARC is limited only by the amount of DRAM required for bookkeeping (at a ratio of 50:1 in the current ZFS implementation). For example, the maximum memory configuration on a four-socket machine is usually around 128GB; such a system can easily accommodate 768GB or more using flash SSDs in its internal drive bays. ZFS’s built-in checksums catch cache inconsistencies and mean that defective flash blocks simply lead to fewer cache hits rather than data loss.
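To make the 50:1 bookkeeping ratio concrete, the following is a minimal sketch of the arithmetic for the 128GB DRAM / 768GB flash example. It assumes the ratio means roughly 50 bytes of L2ARC for every byte of DRAM spent tracking it; the function names and the constant are illustrative, not part of ZFS.

```python
GB = 10**9

# Assumption: 50 bytes of flash cache per byte of DRAM bookkeeping,
# reading the "50:1" figure in the text as flash:DRAM.
L2ARC_TO_DRAM_RATIO = 50


def bookkeeping_bytes(l2arc_bytes, ratio=L2ARC_TO_DRAM_RATIO):
    """DRAM needed to track an L2ARC of the given size (hypothetical helper)."""
    return l2arc_bytes / ratio


def max_l2arc_bytes(dram_budget_bytes, ratio=L2ARC_TO_DRAM_RATIO):
    """Largest L2ARC that a given DRAM bookkeeping budget can track."""
    return dram_budget_bytes * ratio


dram = 128 * GB
flash = 768 * GB
overhead = bookkeeping_bytes(flash)

print(f"bookkeeping: {overhead / GB:.2f} GB of DRAM")
print(f"share of DRAM: {overhead / dram:.0%}")
```

Under this reading, 768GB of L2ARC costs about 15GB of DRAM, or roughly 12 percent of the 128GB system, which is why the cache can grow well beyond the machine's memory.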

In the context of the memory hierarchy, caches are often populated as entries are evicted from the previous layer—in an exclusive cache architecture, on-chip caches are evicted to off-chip caches, and so on. With a flash-based cache, however, the write latency is so poor the system could easily be bogged down waiting for evictions. Accordingly, the L2ARC uses an evict-ahead policy: it aggregates ARC entries and predictively pushes them out to flash, thus amortizing the cost over large operations and ensuring that there is no additional latency when the time comes to evict an entry from the ARC. The L2ARC iterates over its space as a ring, starting back at the beginning once it reaches the end, thereby avoiding any potential for fragmentation. Although this technique does mean that entries in the L2ARC that may soon be accessed could be overwritten prematurely, bear in mind that the hottest data will still reside in the DRAM-based ARC. ZFS will write to the L2ARC slowly, meaning that it can take some time to warm up; but once warm, it should remain so, as long as the writes to the cache can keep up with data churn on the system.
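The evict-ahead and ring-write behavior described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the first-level cache is a plain LRU rather than ZFS's adaptive replacement policy, and every class and method name here is hypothetical.

```python
from collections import OrderedDict


class ToyL2Ring:
    """Fixed-size ring of slots, always written in order (no fragmentation)."""

    def __init__(self, slots):
        self.slots = [None] * slots  # each slot holds (key, value) or None
        self.index = {}              # key -> slot number
        self.hand = 0                # next slot to overwrite

    def write_batch(self, entries):
        for key, value in entries:
            old = self.slots[self.hand]
            # Forget the entry being overwritten, if the index still points here.
            if old is not None and self.index.get(old[0]) == self.hand:
                del self.index[old[0]]
            self.slots[self.hand] = (key, value)
            self.index[key] = self.hand
            self.hand = (self.hand + 1) % len(self.slots)  # wrap to the start

    def get(self, key):
        slot = self.index.get(key)
        return None if slot is None else self.slots[slot][1]


class ToyCache:
    def __init__(self, arc_size, l2_slots, feed_batch):
        self.arc = OrderedDict()     # least recently used entry first
        self.arc_size = arc_size
        self.l2 = ToyL2Ring(l2_slots)
        self.feed_batch = feed_batch

    def feed(self):
        """Evict-ahead: copy entries nearest eviction to flash in one batch."""
        batch = []
        for key, value in self.arc.items():
            if len(batch) == self.feed_batch:
                break
            if key not in self.l2.index:
                batch.append((key, value))
        self.l2.write_batch(batch)

    def insert(self, key, value):
        self.arc[key] = value
        self.arc.move_to_end(key)
        while len(self.arc) > self.arc_size:
            self.arc.popitem(last=False)  # eviction never waits on flash

    def read(self, key, disk):
        if key in self.arc:
            self.arc.move_to_end(key)
            return self.arc[key], "arc"
        value = self.l2.get(key)
        if value is not None:
            self.insert(key, value)
            return value, "l2arc"
        value = disk[key]
        self.insert(key, value)
        return value, "disk"


disk = {n: f"block-{n}" for n in range(10)}
cache = ToyCache(arc_size=2, l2_slots=4, feed_batch=2)
cache.read(0, disk)
cache.read(1, disk)
cache.feed()                 # blocks 0 and 1 are copied to the ring ahead of time
cache.read(2, disk)          # block 0 falls out of the first-level cache
result = cache.read(0, disk)
print(result)                # ('block-0', 'l2arc')
```

The point of the design is visible in `insert`: dropping an entry from the first-level cache involves no flash write, because `feed` already copied it in a batch. The cost is that the ring's write hand may overwrite an entry that would have been read again soon.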

It’s worth noting that to this point the L2ARC hasn’t even taken advantage of what is usually considered to be a key feature of flash: nonvolatility. Under normal operation, the L2ARC treats flash as cheap and vast storage. As it writes blocks of data to populate the cache devices, however, the L2ARC includes a directory so that after a power loss, the contents of the cache can be identified, thus pre-warming the cache. Although resets are rare, system failures, power failures, and downtime due to maintenance are all inevitable; the instantly warmed cache reduces the slow performance ramp typical of a system after a reset. Since
