figure 2: A write-optimized B-tree with fence keys instead of neighbor pointers.
and worthwhile read-ahead actions
without the benefit of such hints.
If flash memory is part of the
persistent storage, page movement
between flash memory and disk is
similar to page movement during
defragmentation, both in file systems
and database systems. The most
significant difference is how page
movement and current page locations
are tracked in these two kinds of systems.
Tracking Page Locations
The mechanisms for tracking page
locations are quite different in file
systems and database systems. In file
systems, pointer pages keep track of
data pages or runs of contiguous data
pages. Moving an individual page may
require breaking up a run. It always
requires updating and then writing a
pointer page.
In database systems, most data is
stored in B-tree indexes, including
clustered (primary, nonredundant) and
nonclustered (secondary, redundant)
indexes on tables, materialized views,
and database catalogs. Bitmap indexes,
columnar storage, and master-detail
clustering can be readily and efficiently
representedinB-trees. Treestructures
12
derived from B-trees are also used for
blobs (binary large objects) and are
similar to the storage structures of
some file systems. 5, 25
For B-trees, moving an individual
page can be very expensive or very
cheap. The most efficient mechanisms
are usually found in utilities for
defragmentation or reorganization.
Cost or efficiency results from two
aspects of B-tree implementation—
namely, maintenance of neighbor
pointers, and logging for recovery.
First, if physical neighbor pointers
are maintained in each B-tree page,
moving a single page requires updating
two neighbors in addition to the
parent node. If the neighbor pointers
are logical using fence keys, only the
parent page requires an update during
a page movement. 10 Figure 2 shows
such a B-tree, with neighbor pointers
replaced by copies of the separator
keys propagated to the parent node
during leaf splits. If the parent page
is in memory, perhaps even pinned
in the buffer pool, recording the new
location is rather like updating an in-memory indirection array. The pointer
change in the parent page is logged in
the recovery log, but there is no need
to force the log immediately to stable
storage because this change is merely
a structural change, not a database
contents change.
Second, database systems log
changes in the physical database, and
in the extreme case both the deleted
page image and the newly created
page image are logged. Thus, an
inefficient implementation fills two
log pages whenever a single data page
moves from one location to another.
A more efficient implementation logs
only allocation actions and delays de-allocation of the old page image until
the new image is safely written in its
intended location. In other words,
10
moving a page from one location (for
example, on persistent flash memory)
to another (for example, on disk)
requires only a few bytes in the database
recovery log.
The difference between traditional
file systems and database systems
is the efficiency of updates enabled
by the recovery log. In a file system,
the new page location must be saved
as soon as possible by writing a
new image of the pointer page. In a
database system, only a single log
record or a few short log records must
be added to the log buffer. Thus, the
overhead for a page movement in a
file system is writing an entire pointer
page using a random access, whereas
a database system adds a log record
of a few dozen bytes to the log buffer
that will eventually be written using
large sequential write operations.
If a file system uses flash memory
as persistent storage, moving a page
between a flash memory location and
an on-disk location adds substantial
overhead. Thus, most file-system
designers will probably prefer flash
memory as an extension to the buffer
pool rather than as an extension of the
disk, thus avoiding this overhead.
A database system, however, has
built-in mechanisms that can easily
track page movements. These mechanisms are inherent in the “workhorse”
data structure, B-tree indexes. Compared with file systems, these mechanisms permit efficient page movement,
each one requiring only a fraction of
a sequential write (in the recovery log)
rather than a full random write.
Moreover, the database mechanisms are reliable. Should a failure
occur during a page movement, database recovery is driven by the recovery
log, whereas a traditional file system
requires checking the entire volume
during reboot.
Checkpoint Processing
To ensure fast recovery after a system
failure, database systems use checkpoints. Their effect is that recovery
needs to consider database activity only
from the most recent checkpoint, plus
some limited activity explicitly indicated in the checkpoint information. The
main effort is writing dirty pages from
the buffer pool to persistent storage.
If pages in flash memory are part
of the buffer pool, dirty pages must
be written to disk during database
checkpoints. Common checkpoint
intervals are measured in seconds or
minutes. Alternatively, if checkpoints
are not truly points but intervals, it is
reasonable to flush pages and perform
checkpoint activities continuously,
starting the next checkpoint as soon
as one finishes. With flash memory as
part of the buffer pool, many writes to
flash memory require a write to disk
soon thereafter as part of checkpoint
processing, and flash memory as the
intermediate level in the memory
hierarchy fails to absorb write activity.
Recall, this effect may be exacerbated