flash memory should exploit the same
methods as a traditional buffer pool.
For truly comparable and competitive
performance and administration costs,
a similar approach seems advisable
when flash memory is used as an
extended disk.
File systems. Our research assumed a
fairly traditional file system. Although
many file systems differ from this
model, most still generally follow it.
In our traditional system, each file
is a large byte stream. Files are often
read in their entirety, their contents
manipulated in memory, and the entire
file replaced if it is updated. Archiving,
version retention, hierarchical storage
management, data movement using
removable media, among others, all
seem to follow this model as well.
Based on this model, space allocation
on disk attempts to use contiguous disk
blocks for each file. Metadata is limited
to directories, a few standard tags such
as a creation time, and data structures
for space management.
Consistency of these on-disk data
structures is achieved by careful write
ordering, fairly quick write-back of
updated data blocks, and expensive
file-system checks after any less-than-perfect shutdown or media removal.
In other words, we assume the
absence of transactional guarantees
and transactional logging, at least for
file contents. If log-based recovery is
supported for file contents such as
individual pages or records within
pages, then a number of the arguments
presented here need to be revisited.
Database systems. We assume fairly
traditional database systems with
B-tree indexes as the workhorse storage
structure. Similar tree structures
capture not only traditional clustered
and nonclustered indexes, but also
bitmap indexes, columnar storage,
contents indexes, XML indexes,
catalogs (metadata), and allocation
data structures.
With respect to transactional
guarantees, we assume traditional
write-ahead logging of both contents
changes (such as inserting or deleting
a record) and structural changes (such
as splitting B-tree nodes). Efficient
log-based recovery after failures is
enabled by checkpoints that force
dirty data from the buffer pool to
persistent storage.
figure 1: Caching and indexing page locations.
Cached
Buffer
RAM
Index
Cached
tracking
information
Tracking
information
flash
memory
Variations such as “second-chance”
or fuzzy checkpoints fit within our
assumptions. In addition, nonlogged
(allocation-only logged) execution is
permitted for some operations such
as index creation. These operations
require appropriate write ordering and
a “force” buffer pool policy. 18
Flash memory. Hardware and device
drivers are assumed to hide many
implementation details such as the
specific hardware interface to flash
memory. For example, flash memory
might be mounted on the computer’s
motherboard, a DIMM slot, a PCI
board, or within a standard disk
enclosure. In all cases, DMA transfers
(or something better) are assumed
between RAM and flash memory.
Moreover, we assume there is either
efficient DMA data transfer between
flash and disk or a transfer buffer in
RAM. The size of such a transfer buffer
should be, in a first approximation,
about equal to the product of transfer
bandwidth and disk latency. If it is
desirable that disk writes should never
delay disk reads, the increased write-behind latency must be included in
the calculation.
Another assumption is that transfer
bandwidths of flash memory and disk
are comparable. While flash write
bandwidth has lagged behind read
bandwidth, some products claim a
difference of less than a factor of two
(for example, Samsung’s Flash-based
solid-state disk used in Table 1). If
necessary, the transfer bandwidth
can be increased by using array
arrangements, as is well known for disk
drives; even redundant arrangement of
flash memory may prove advantageous
in some cases.
6
Since the reliability of current NAND
flash suffers after 100,000– 1,000,000
erase-and-write cycles, we assume that
some mechanisms for wear leveling are
provided. These mechanisms ensure
that all pages or blocks of pages are
written similarly often. It is important to
recognize the similarity between wear-leveling algorithms and log-structured
file systems, 22, 27 although the former
also move stable, unchanged data such
that their locations can absorb some of
the erase-and-write cycles.
Note that traditional disk drives do
not support more write operations,
albeit for different reasons. For
example, six years of continuous
and sustained writing at 100Mbps
overwrites an entire 250GB disk fewer
than 80,000 times. In other words,
assuming that a log-structured file
system is appropriate for RAID- 5
or RAID- 6 arrays, the reliability of
current flash seems comparable.
Similarly, overwriting a 32GB flash
disk 100,000 times with a sustained
average bandwidth of 30Mbps takes
about 3. 5 years.