committed to stable storage.
On a typical system, the speed of a synchronous write is bounded by the latency of nonvolatile storage, as writes must be committed before they can be acknowledged. Read operations first check the DRAM cache, which provides very low-latency service times, but cache misses must also wait for the slow procession of data around the spindle. Since it’s quite common to have working sets larger than the meager DRAM available, even the best prefetching algorithms will leave many read operations blocked on the disk.
A brute-force solution for improving latency is simply to spin the platters faster to reduce rotational latency, using 15,000RPM drives rather than 10,000 or 7,200RPM drives. This will improve both read and write latency, but only by a factor of two or so. For example, using drives from a major vendor, at current prices, a 10TB data set on a 7,200RPM drive would cost about $3,000 and dissipate 112 watts; the same data set on a 15,000RPM drive would cost $22,000 and dissipate 473 watts—all for a latency improvement of a bit more than a factor of two. The additional cost and power overhead make this an unsatisfying solution, though it is widely employed absent a clear alternative.
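The trade-off quoted above can be checked with a few lines of arithmetic. The sketch below assumes that latency is dominated by average rotational latency (half a revolution) and ignores seek time, which is a simplification not stated in the text; the cost and power figures are the ones given for the 10TB data set.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: the time for half a revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm

# Figures quoted in the text for a 10TB data set.
slow = {"rpm": 7_200, "cost_usd": 3_000, "watts": 112}
fast = {"rpm": 15_000, "cost_usd": 22_000, "watts": 473}

# Ratios of the faster configuration relative to the slower one.
latency_gain = avg_rotational_latency_ms(slow["rpm"]) / avg_rotational_latency_ms(fast["rpm"])
cost_ratio = fast["cost_usd"] / slow["cost_usd"]
power_ratio = fast["watts"] / slow["watts"]

print(f"latency improvement: {latency_gain:.2f}x")
print(f"cost increase:       {cost_ratio:.2f}x")
print(f"power increase:      {power_ratio:.2f}x")
```

Under that assumption the faster drives improve latency by roughly 2.1x while costing about 7.3x as much and drawing about 4.2x the power.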
Figure 2: Flash cost per GB (y-axis 0 to 225; x-axis years 2003 to 2007).
A focused solution for improving the performance of synchronous writes is to add nonvolatile RAM (NVRAM) in the form of battery-backed DRAM, usually on a PCI card. Writes are committed to the NVRAM ring buffer and immediately acknowledged to the client while the data is asynchronously written out to the drives. Once the data has been committed to disk, the corresponding record can be freed in the NVRAM. This technique allows for a tremendous improvement for synchronous writes, but suffers some downsides. NVRAM is quite expensive; batteries fail (or leak, or, worse, explode); and the maximum size of NVRAM tends to be small (2GB–4GB), small enough that workloads can fill the entire ring buffer before it can be flushed to disk.
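The commit, acknowledge, and free cycle of an NVRAM log can be illustrated with a toy model. This is only a sketch: the class name `NvramLog`, its methods, and the byte-counting capacity check are hypothetical and stand in for whatever a real NVRAM card and driver do.

```python
from collections import deque


class NvramLog:
    """Toy model of an NVRAM write log with a fixed capacity in bytes (hypothetical)."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.pending = deque()  # records committed to NVRAM but not yet on disk

    def write(self, data: bytes) -> bool:
        """Commit a record to NVRAM and acknowledge it; False means the log is full."""
        if self.used + len(data) > self.capacity:
            return False  # the client must wait until a flush frees space
        self.pending.append(data)
        self.used += len(data)
        return True

    def flush_one(self, disk: list) -> bool:
        """Write the oldest record to disk, then free its space in NVRAM."""
        if not self.pending:
            return False
        data = self.pending.popleft()
        disk.append(data)
        self.used -= len(data)
        return True


disk = []
log = NvramLog(capacity_bytes=8)
print(log.write(b"abcd"), log.write(b"efgh"))  # both fit and are acknowledged
print(log.write(b"i"))                         # rejected: the log is full
log.flush_one(disk)                            # oldest record reaches disk
print(log.write(b"i"))                         # space has been freed
```

The rejected write in the middle is the failure mode described above: when the workload outruns the flush rate, a small log fills and synchronous writes fall back to disk speed.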
One use of flash is as a stand-in for NVRAM that can improve write performance as a log device. To that end you need a device that mimics the important properties of NVRAM (fast, persistent writes), while avoiding the downsides (cost, size, battery power). Recall, however, that while achieving good write bandwidth is fairly easy, the physics of flash dictate that individual writes exhibit relatively high latency. Even so, it’s possible to build a flash-based device that services write operations very quickly: insert a DRAM write cache, then treat that cache as nonvolatile by adding a supercapacitor that provides the power needed to flush outstanding data from DRAM to flash in the case of power loss.
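A minimal sketch of that design follows, assuming a device that acknowledges a write once it reaches DRAM and drains the cache to flash in the background. The class `FlashLogDevice` and its method names are hypothetical, and the power-loss path simply assumes the supercapacitor holds enough energy to drain everything outstanding.

```python
class FlashLogDevice:
    """Toy model: a DRAM write cache in front of flash (hypothetical names)."""

    def __init__(self):
        self.dram = []   # acknowledged writes not yet programmed into flash
        self.flash = []  # writes that have reached persistent flash

    def write(self, data: bytes) -> str:
        # The write is acknowledged as soon as it lands in DRAM.
        self.dram.append(data)
        return "ack"

    def drain(self, max_records: int = 1) -> None:
        # Normal operation: move the oldest cached writes to flash.
        for _ in range(min(max_records, len(self.dram))):
            self.flash.append(self.dram.pop(0))

    def on_power_loss(self) -> None:
        # Assumption: the supercapacitor powers a flush of everything outstanding.
        self.drain(max_records=len(self.dram))


dev = FlashLogDevice()
dev.write(b"a")
dev.write(b"b")
dev.drain()          # one record reaches flash during normal operation
dev.on_power_loss()  # the remainder is flushed on stored energy
print(dev.flash, dev.dram)
```

The client only ever waits for the DRAM insert, so the high latency of an individual flash write stays off the acknowledgment path.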
Many applications such as databases can use a dedicated log device as a way of improving the performance of write operations; for these applications, such a device can be dropped in easily. To bring the benefits of a flash log device to primary storage, and therefore to a wide array of applications, we need similar functionality in a general-purpose file system. Sun’s ZFS provides a useful context for the use of flash. ZFS, an enterprise-class file system designed for the scale and requirements of modern systems, was implemented from scratch starting in 2001. It discards the model of a file system sitting on a volume manager in favor of pooled storage both for simplicity of management and greater flexibility for optimizing performance. ZFS maintains its on-disk data structures in a way that is always consistent, eliminating the need for consistency checking after an unexpected power failure. Furthermore, it is flexible enough to accommodate new technological advances, such as new uses of flash. (For a complete description of ZFS, see http://opensolaris.org/os/community/zfs.)
ZFS provides for the use of a separate intent-log device (a slog in ZFS jargon) to which synchronous writes can be quickly written and acknowledged to the client before the data is written to the storage pool. The slog is used only for small transactions, while large transactions use the main storage pool, since it’s tough to beat the raw throughput of large numbers of disks. The flash-based log device would be ideally suited for a ZFS slog. The write buffer on the flash device has to be only large enough to saturate the bandwidth to flash, so its DRAM size requirements, and therefore its power requirements, are quite small. Note also that the write buffer is much smaller than the required DRAM in a battery-backed NVRAM device. There are effectively no constraints on the amount of flash that could be placed on such a device, but experimentation has shown that 10GB of delivered capacity is more than enough for the vast majority of use cases.
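The split between small and large transactions can be pictured as a simple size-based routing decision. The sketch below is illustrative only: the function `route_sync_write` and the cutoff `SLOG_MAX_BYTES` are hypothetical, and the 32KB value is chosen for the example rather than taken from ZFS.

```python
SLOG_MAX_BYTES = 32 * 1024  # hypothetical cutoff, chosen only for illustration


def route_sync_write(size_bytes: int, have_slog: bool = True) -> str:
    """Return where a synchronous write is logged in this toy model."""
    if have_slog and size_bytes <= SLOG_MAX_BYTES:
        return "slog"  # small transaction: low-latency log device
    return "pool"      # large transaction, or no slog: main storage pool


for size in (4 * 1024, 32 * 1024, 1024 * 1024):
    print(size, route_sync_write(size))
```

Small writes benefit most from the log device's low latency, while large writes are limited by bandwidth, where the many disks of the pool have the advantage.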
Using such a device with ZFS in a test system, we measured latencies in the range of 80–100 µs. This approaches the performance of NVRAM and has many other benefits. A common concern for flash is its longevity. SLC flash is often rated for one million write/erase cycles, but beyond several hundred thousand, the data-retention