The Five-minute Rule
20 Years Later
and How Flash Memory Changes the Rules
structures capture not only traditional clustered and
nonclustered indexes but also bitmap indexes, columnar
storage, contents indexes, XML indexes, catalogs (
metadata), and allocation data structures.
With respect to transactional guarantees, we assume
traditional write-ahead logging of both contents changes
(such as inserting or deleting a record) and structural
changes (such as splitting B-tree nodes). Efficient log-based
recovery after failures is enabled by checkpoints that force
dirty data from the buffer pool to persistent storage.
Other assumptions include variations such as “
second-chance” or fuzzy checkpoints. In addition, nonlogged
(allocation-only logged) execution is permitted for some
operations such as index creation. These operations
require appropriate write ordering and a “force” buffer-pool policy. 6
FLASH MEMOR Y
We assume that hardware and device drivers hide many
implementation details such as the specific hardware
interface to flash memory. For example, flash memory
might be mounted on the computer’s motherboard, a
DIMM slot, a PCI board, or within a standard disk enclosure. In all cases, DMA (direct memory access) transfers
(or something better) are assumed between RAM and
flash memory. Moreover, the assumption is that there is
either efficient DMA data transfer between flash and disk
or a transfer buffer in RAM. The size of such a transfer
buffer should be, in a first approximation, about equal to
the product of transfer bandwidth and disk latency. If it
is desirable that disk writes should never delay disk reads,
the increased write-behind latency must be included in
the calculation.
Another assumption is that transfer bandwidths of
flash memory and disk are comparable. While flash write
bandwidth has lagged behind read bandwidth, some
products claim a difference of less than a factor of two
(e.g., Samsung’s Flash-based solid-state disk in table 1).
If necessary, the transfer bandwidth can be increased
by using array arrangements, as is well known for disk
drives. 7 Even redundant arrangement of flash memory
may prove advantageous in some cases. 8
Since the reliability of current NAND flash suffers after
100,000 to 1 million erase-and-write cycles, we assume
that some mechanisms for “wear leveling” are provided.
These mechanisms ensure that all pages or blocks of pages
are written with about the same frequency. It is important
to recognize the similarity between wear-leveling algorithms and log-structured file systems, 9, 10 although the
former also move stable, unchanged data such that their
locations can absorb some of the erase-and-write cycles.
Also note that traditional disk drives do not support
more write operations, albeit for different reasons. For
example, six years of continuous and sustained writing
at 100 MB per second overwrites an entire 250-GB disk
fewer than 80,000 times. In other words, assuming a log-structured file system as appropriate for RAID- 5 or RAID- 6
arrays, the reliability of current NAND flash seems comparable. Similarly, overwriting a 32-GB flash disk 100,000
times at 30 MB per second takes about 3½ years.
In addition to wear leveling, we assume that there is
an asynchronous agent that moves fairly stale data from
flash memory to disk and immediately erases the freed-up
space in flash memory to prepare it for write operations
without further delay. This activity also has an immediate
equivalence in log-structured file systems—namely, the
cleanup activity that prepares space for future log writing. The difference is that disk contents must merely be
moved, whereas flash contents must also be erased before
the next write operation at that location.
In either file systems or database systems, we assume
separate mechanisms for page tracking and page replacement. A traditional buffer pool, for example, provides
both, but it uses two different data structures for these
two purposes. The standard design relies on an LRU list
for page replacement and on a hash table for tracking
pages (i.e., which pages are present in the buffer pool and
in which buffer frames). Alternative algorithms and data
structures also separate page tracking and replacement
management.
The data structures for the replacement algorithm are
assumed to be small and high-traffic, and are therefore
kept in RAM. We also assume that page tracking must be
as persistent as the data; thus, a buffer pool’s hash table is
reinitialized during a system reboot, but tracking information for pages on a persistent store such as a disk must be
stored with the data.
As previously mentioned, we assume page replacement
on demand. There may also be automatic policies and
mechanisms for prefetch, read-ahead, and write-behind.
Based on these considerations, we assume that the
contents of flash memory are pretty much the same,