“split in half when full” logic of B-trees
is applied not only to pages containing
records, but also to contiguous disk
extents containing pages.
Using O’Neil’s SB-trees (S meaning
sequential), 256KB extents can be the
units of transfer between flash memory
and disk, whereas 4KB pages can be the
unit of transfer between RAM and flash
memory. Figure 3 shows pages within
two extents. Child pointers in a B-tree
(also shown) refer to individual pages.
If multiple neighboring child pointers
refer to neighboring pages (as shown),
the pointer values can be represented
compactly with run-length encoding
applied not to a set of duplicate key
values but to a series of values with
constant increments. For example,
the five child pointers in extent 75. 1 in
Figure 3 can be represented by the page
identifier 93.0 and the counter 5.
Similar notions of self-similar B-trees
have also been proposed for higher
levels in the memory hierarchy, for
example, in the form of B-trees of cache
lines for the indirection vector within a
large page. 19 Given that there are at least
three levels of B-trees and three node
sizes now (cache lines, flash memory
pages, and disk pages), research into
cache-oblivious B-trees2 might be
promising.
Database-Query Processing
Self-similar designs apply both to data
structures such as B-trees and to algorithms. For example, sort algorithms
already use algorithms similar to traditional external merge sorts in multiple
ways—to merge runs not only on disk
but also in memory, where the initial
runs are sized to limit run creation to
the CPU cache. 11, 21
The same technique might be applied
three times instead of twice: first, cachesize runs in memory are merged into
memory-size runs in memory; second,
in larger sort operations, memory-size
runs in flash memory are merged into
runs on disk; and third, runs on disk
are merged to form the final sorted
result. Read-ahead, forecasting, write-behind, and page sizes all deserve
a new look in a multilevel memory
hierarchy consisting of cache, RAM,
flash memory, and traditional disk
drives. These page sizes can then
inform the break-even calculation for
page retention versus I/O and thus
guide the optimal capacities at each
level of the memory hierarchy.
We can surmise that a variation of
this sort algorithm will be not only
The 20-year-old
fast but also energy efficient. While
energy efficiency has always been
five-minute rule for
crucial for battery-powered devices,
RAM and disks still
research into energy-efficient query
processing on server machines is only
holds, but for ever-
now beginning. 24 For example, for
both flash memory and disks, energy-
larger disk pages.
optimal page sizes might well differ
Moreover, it should
from performance-optimal page sizes.
The I/O pattern of an external
be augmented by
merge sort is similar (albeit in the
two new five-minute
opposite direction) to the I/O pattern
of an external distribution sort. Figure
rules: one for small
4 illustrates how merging combines
pages moving
many small files into a large file, with
many seek operations in the small files
between RAM and
as demanded by the merge logic, and
how partitioning divides a single large
flash memory and
file into many small files, with many
one for large pages
seek operations in the small files as
demanded by the partitioning function.
moving between
The I/O pattern of a distribution sort
flash memory and
is equal to that of partitioning during
hash join and hash aggregation. 8 All of
traditional disks.
these algorithms require reevaluation
and redesign in a three-level memory
hierarchy, or even a four-level hierarchy
if CPU caches are also considered. 26
Flash memory with its very fast
access times may well revive interest in
index-based query execution. 7, 9 Instead
of large scans and memory-intensive
operations such as sorting and hash
join, fast accesses to index pages shift
the break-even point toward index-to-index navigation. For example, assume
a table with 100 million rows of 100
bytes and table scans at 100MB per
second. A table scan takes 100 seconds.
Searching a secondary index requires
fetching individual rows from the table.
If the table is stored on a traditional
disk, then a period of 100 seconds
permits fetching about 10,000 records.
If more than 10,000 rows satisfy the
query predicate, then the table scan is
faster. If, however, the table is stored
on a flash device, 100 seconds will
permit fetching about 500,000 records.
Thus, flash storage shifts the break-even point between table scan and
index search from 10,000 to 500,000
rows satisfying the query predicate,
and many more query execution plans
will rely on index-to-index navigation
rather than large scans and hash joins.