if RAM is kept small because of the
presence of flash memory.
If, on the other hand, flash memory
is considered persistent storage,
writing to flash memory is sufficient.
Write-through to disk is required only
as part of page replacement (such
as, when a page’s usage suggests
placement on disk rather than in flash
memory). Thus, checkpoints do not
incur the cost of moving data from
flash memory to disk.
Checkpoints might even be faster
in systems with flash memory because
dirty pages in RAM need to be written
merely to flash memory, not to disk.
Given the very fast random access in
flash memory relative to disk drives,
this difference might speed up checkpoints.
To summarize, database systems
benefit if flash memory is treated as
part of the system’s persistent storage.
In contrast, traditional file systems do
not have systemwide checkpoints that
flush the recovery log and any dirty
data from the buffer pool. Instead,
they rely on carefully writing modified
file-system pages because of the lack of
a recovery log protecting file contents.
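The checkpointing policy summarized above can be sketched as a toy buffer-pool model. The class, method, and tier names below are illustrative assumptions, not any particular system's design; the point is only that a checkpoint touches flash, while disk I/O is deferred to page replacement.

```python
# Sketch of a buffer pool that treats flash as part of persistent storage
# (an illustrative model, not any particular system's implementation).

class BufferPool:
    def __init__(self):
        self.ram = {}      # page_id -> (data, dirty?)
        self.flash = {}    # flash tier: counts as persistent storage
        self.disk = {}     # disk tier

    def write(self, page_id, data):
        self.ram[page_id] = (data, True)          # page is now dirty in RAM

    def checkpoint(self):
        # Because flash is persistent, flushing dirty RAM pages to flash
        # is sufficient -- the checkpoint incurs no disk I/O at all.
        for page_id, (data, dirty) in list(self.ram.items()):
            if dirty:
                self.flash[page_id] = data
                self.ram[page_id] = (data, False)

    def evict_from_flash(self, page_id):
        # Write-through to disk happens only during page replacement,
        # e.g., when a page's usage suggests it belongs on disk.
        self.disk[page_id] = self.flash.pop(page_id)

pool = BufferPool()
pool.write(7, b"hello")
pool.checkpoint()
assert 7 in pool.flash and 7 not in pool.disk   # checkpoint touched flash only
pool.evict_from_flash(7)
assert 7 in pool.disk                           # disk written only on eviction
```

In this model a checkpoint's cost is bounded by flash write speed regardless of disk performance, which is the source of the speedup discussed above.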
In addition to tuning based on the five-minute rule, another optimization based on access performance is the sizing of B-tree nodes. The optimal page size minimizes the time spent on I/O during a root-to-leaf search. It balances a short I/O (that is, the desire for small pages) with a high reduction in remaining search space (that is, the desire for large pages).
Assuming binary search within each B-tree node, the reduction in remaining search space is measured by the logarithm of the number of records within each node. This measure was called a node's utility in our earlier work. This optimization is essentially equivalent to one described in the original research on B-trees.3
Table 4 illustrates this optimization for 20-byte records (typical with prefix and suffix truncation4) and for nodes filled at about 70%.
Table 4: Page utility for B-tree nodes on disk.
Not surprisingly, the optimal node size for B-tree indexes on modern high-bandwidth disks is much larger than the page sizes in traditional database systems. With those disks, the access time dominates for all small page sizes, such that additional byte transfer and thus additional utility are almost free. B-tree nodes of 256KB are near optimal. For those, Table 3 indicates a break-even time for RAM and disk of 88 seconds. For a $400 flash disk and a traditional rotating hard disk, Table 4 indicates 337 seconds or just over five minutes. This is the first of the two new five-minute rules.
Table 5 illustrates the same calculations for B-trees on flash memory. Because there is no mechanical seeking or rotation, transfer time dominates access time even for small pages. The optimal page size for B-trees on flash memory is 2KB, much smaller than for traditional disk drives. In Table 3, the break-even interval for 4KB pages is 351 seconds. This is the second new five-minute rule.
Table 5: Page utility for B-tree nodes on flash memory.
The implication of two different optimal page sizes is that any uniform node size for B-trees on flash memory and on traditional rotating hard disks is suboptimal. Optimizing page sizes for both media requires a change in buffer management, space allocation, and some of the B-tree logic.
Fortunately, Patrick O'Neil of the University of Massachusetts already designed a space-allocation scheme for B-trees in which neighboring leaf nodes usually reside within the same contiguous extent of pages.23 When a new page is needed for a node split, another page within the same extent is allocated. When an extent overflows, half its pages are moved to a newly allocated extent. In other words, the split-in-half logic of B-tree nodes is applied to extents of pages as well.
Figure 3: Pages and extents in an SB-tree.
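The extent-splitting scheme described above can be sketched as follows. The class names and the extent capacity are illustrative assumptions for this sketch; O'Neil's SB-tree, cited in the text, is the actual design.

```python
# Sketch of SB-tree-style space allocation: pages for node splits come
# from the same extent, and a full extent itself splits in half
# (illustrative model; parameters and names are assumptions).

class Extent:
    """A contiguous run of pages holding up to `capacity` B-tree nodes."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pages = []

class SBTreeAllocator:
    def __init__(self, capacity=4):
        self.extents = [Extent(capacity)]

    def allocate_page(self, extent):
        """Allocate a page for a node split; returns the extent it landed in."""
        if len(extent.pages) < extent.capacity:
            extent.pages.append(object())       # room within the same extent
            return extent
        # Extent overflow: move half its pages to a newly allocated extent,
        # mirroring the split-in-half logic of B-tree nodes themselves.
        new = Extent(extent.capacity)
        half = extent.capacity // 2
        new.pages = extent.pages[half:]
        extent.pages = extent.pages[:half]
        self.extents.append(new)
        new.pages.append(object())              # the requested page
        return new

alloc = SBTreeAllocator(capacity=4)
ext = alloc.extents[0]
for _ in range(5):                              # fifth allocation overflows
    ext = alloc.allocate_page(ext)
assert len(alloc.extents) == 2                  # the extent split in half
```

Because neighboring leaves stay in the same extent, a range scan reads contiguous pages, which is exactly the property that lets one index serve both media reasonably well.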
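The node-size optimization earlier in this section can be sketched numerically: maximize a node's utility (the logarithm of its record count) per unit of I/O time. The device parameters below are illustrative assumptions chosen to show the qualitative contrast, not the measurements behind Tables 4 and 5.

```python
import math

def utility_per_io_ms(page_kb, access_ms, mb_per_s, record_bytes=20, fill=0.7):
    """A node's utility (log2 of its record count, i.e., the reduction in
    remaining search space) divided by the time to read it: access + transfer."""
    records = page_kb * 1024 * fill / record_bytes
    io_ms = access_ms + page_kb / 1024 / mb_per_s * 1000
    return math.log2(records) / io_ms

def best_page_kb(access_ms, mb_per_s):
    # Search power-of-two page sizes from 1KB to 2MB.
    return max((2 ** k for k in range(12)),
               key=lambda kb: utility_per_io_ms(kb, access_ms, mb_per_s))

# Illustrative device parameters (assumptions, not the paper's figures):
print(best_page_kb(access_ms=12.0, mb_per_s=200.0))  # rotating disk -> 256 (KB)
print(best_page_kb(access_ms=0.2, mb_per_s=30.0))    # flash -> 2 (KB)
```

With a mechanical access delay, large pages amortize the seek and rotation; without one, transfer time dominates and small pages win, reproducing the two different optima discussed above.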