if RAM is kept small because of the presence of flash memory.

If, on the other hand, flash memory is considered persistent storage, writing to flash memory is sufficient. Write-through to disk is required only as part of page replacement (such as, when a page’s usage suggests placement on disk rather than in flash memory). Thus, checkpoints do not incur the cost of moving data from flash memory to disk.

Checkpoints might even be faster in systems with flash memory because dirty pages in RAM need to be written merely to flash memory, not to disk. Given the very fast random access in flash memory relative to disk drives, this difference might speed up checkpoints significantly.

To summarize, database systems benefit if flash memory is treated as part of the system’s persistent storage. In contrast, traditional file systems do not have systemwide checkpoints that flush the recovery log and any dirty data from the buffer pool. Instead, they rely on carefully writing modified file-system pages because of the lack of a recovery log protecting file contents.

the access time dominates for all small page sizes, such that additional byte transfer and thus additional utility are almost free.

B-tree nodes of 256KB are near optimal. For those, Table 3 indicates a break-even time for RAM and disk of 88 seconds. For a $400 flash disk and a traditional rotating hard disk, Table 4 indicates 337 seconds or just over five minutes. This is the first of the two new five-minute rules.

Table 5 illustrates the same cal– culations for B-trees on flash memory. Because there is no mechanical seeking or rotation, transfer time dominates access time even for small pages. The optimal page size for B-trees on flash memory is 2KB, much smaller than for traditional disk drives. In Table 3, the break-even interval for 4KB pages

is 351 seconds. This is the second new five-minute rule.

The implication of two different optimal page sizes is that any uniform node size for B-trees on flash memory and traditional rotating hard disks is suboptimal. Optimizing page sizes for both media requires a change in buffer management, space allocation, and some of the B-tree logic.

Fortunately, Patrick O’Neil of the University of Massachusetts already designed a space allocation scheme for B-trees in which neighboring leaf nodes usually reside within the same contiguous extent of pages. 23 When a new page is needed for a node split, another page within the same extent is allocated. When an extent overflows, half its pages are moved to a newly allocated extent. In other words, the

Table 4: Page utility for B-tree nodes on disk.

Page sizes

In addition to tuning based on the five-minute rule, another optimization based on access performance is sizing of B-tree nodes. The optimal page size minimizes the time spent on I/O during a root-to-leaf search. It balances a short I/O (that is, the desire for small pages) with a high reduction in remaining search space (that is, the desire for large pages).

Assuming binary search within each B-tree node, the reduction in remaining search space is measured by the logarithm of the number of records within each node. This measure was called a node’s utility in our earlier work. This optimization is essentially

14

equivalent to one described in the original research on B-trees. 3

Table 4 illustrates this optimization for 20-byte records (typical with prefix and suffix truncation4) and for nodes filled at about 70%.

Not surprisingly, the optimal node size for B-tree indexes on modern high-bandwidth disks is much larger than the page sizes in traditional database systems. With those disks,

Page size

4KB 16KB 64KB 128KB 256KB 512KB 1MB

Records per page 140

560 2,240 4,480 8,960 17,920 35,840

node utility 7 0 11 12 13 14 15

Access time
12.0ms
12.1ms
12.2ms
12.4ms
12.9ms
13.7ms
15.4ms

utility/time
0.58
0.75
0.90
0.97
1.01
1.02
0.97

Table 5: Page utility for B-tree nodes on flash memory.

 

Page size 1KB 2KB 4KB 8KB 16KB 64KB

Records per page 35 70 140 280

560 2,240

node utility 5 6 7 8 9 11

Access time
0.11ms
0.13ms
0.16ms
0.22ms
0.34ms
1.07ms

utility/time
43. 4
46. 1
43. 6
36. 2
26. 3
10. 3

figure 3: Pages and extents in an sB-tree.

extent 75

Page 75.0

Page 75. 1

Page 75. 2

Page 75. 3

Page 75. 4

Page 75. 5

Page 93.0

Page 93. 1

Page 93. 2

Page 93. 3

Page 93. 4

extent 93

References:

Archives