“split in half when full” logic of B-trees is applied not only to pages containing records, but also to contiguous disk extents containing pages.

Using O’Neil’s SB-trees (S meaning sequential), 256KB extents can be the units of transfer between flash memory and disk, whereas 4KB pages can be the unit of transfer between RAM and flash memory. Figure 3 shows pages within two extents. Child pointers in a B-tree (also shown) refer to individual pages. If multiple neighboring child pointers refer to neighboring pages (as shown), the pointer values can be represented compactly with run-length encoding applied not to a set of duplicate key values but to a series of values with constant increments. For example, the five child pointers in extent 75. 1 in Figure 3 can be represented by the page identifier 93.0 and the counter 5.

Similar notions of self-similar B-trees have also been proposed for higher levels in the memory hierarchy, for example, in the form of B-trees of cache lines for the indirection vector within a large page. 19 Given that there are at least three levels of B-trees and three node sizes now (cache lines, flash memory pages, and disk pages), research into cache-oblivious B-trees2 might be promising.

Database-Query Processing

Self-similar designs apply both to data structures such as B-trees and to algorithms. For example, sort algorithms already use algorithms similar to traditional external merge sorts in multiple ways—to merge runs not only on disk but also in memory, where the initial runs are sized to limit run creation to the CPU cache. 11, 21

The same technique might be applied three times instead of twice: first, cachesize runs in memory are merged into memory-size runs in memory; second, in larger sort operations, memory-size runs in flash memory are merged into runs on disk; and third, runs on disk are merged to form the final sorted result. Read-ahead, forecasting, write-behind, and page sizes all deserve a new look in a multilevel memory hierarchy consisting of cache, RAM, flash memory, and traditional disk drives. These page sizes can then inform the break-even calculation for page retention versus I/O and thus

guide the optimal capacities at each level of the memory hierarchy.

We can surmise that a variation of this sort algorithm will be not only

The 20-year-old
fast but also energy efficient. While
energy efficiency has always been
five-minute rule for
crucial for battery-powered devices,
RAM and disks still
research into energy-efficient query
processing on server machines is only
holds, but for ever-
now beginning. 24 For example, for
both flash memory and disks, energy-
larger disk pages.
optimal page sizes might well differ
Moreover, it should
from performance-optimal page sizes.
The I/O pattern of an external
be augmented by
merge sort is similar (albeit in the
two new five-minute
opposite direction) to the I/O pattern
of an external distribution sort. Figure

rules: one for small

4 illustrates how merging combines

pages moving
many small files into a large file, with
many seek operations in the small files
between RAM and
as demanded by the merge logic, and
how partitioning divides a single large
flash memory and
file into many small files, with many
one for large pages
seek operations in the small files as
demanded by the partitioning function.

moving between

The I/O pattern of a distribution sort

flash memory and
is equal to that of partitioning during
hash join and hash aggregation. 8 All of
traditional disks.
these algorithms require reevaluation
and redesign in a three-level memory
hierarchy, or even a four-level hierarchy
if CPU caches are also considered. 26

Flash memory with its very fast access times may well revive interest in index-based query execution. 7, 9 Instead of large scans and memory-intensive operations such as sorting and hash join, fast accesses to index pages shift the break-even point toward index-to-index navigation. For example, assume a table with 100 million rows of 100 bytes and table scans at 100MB per second. A table scan takes 100 seconds. Searching a secondary index requires fetching individual rows from the table. If the table is stored on a traditional disk, then a period of 100 seconds permits fetching about 10,000 records. If more than 10,000 rows satisfy the query predicate, then the table scan is faster. If, however, the table is stored on a flash device, 100 seconds will permit fetching about 500,000 records. Thus, flash storage shifts the break-even point between table scan and index search from 10,000 to 500,000 rows satisfying the query predicate, and many more query execution plans will rely on index-to-index navigation rather than large scans and hash joins.

References:

Archives