RAM or persistent storage. These
models are called extended buffer pool
and extended disk here. Both models
may seem viable in operating systems,
file systems, and in database systems.
The different characteristics of each
of these systems, however, will require
different usage models.
In both models, contents of RAM
and flash will be governed by LRU-like
replacement algorithms that attempt
to keep the most valuable pages in
RAM and the least valuable pages on
traditional disks. The linked list or
other data structure implementing the
replacement policy for flash memory
will be maintained in RAM.
Operating systems and traditional
file systems will use flash memory
mostly as transient memory (for
example, as a fast backup store for
virtual memory and as a secondary
file-system cache). Both of these
applications fall into the extended
buffer-pool model. During an orderly
system shutdown, the flash memory
contents must be written to persistent
storage. During a system crash,
however, the RAM-based description of
flash-memory contents will be lost and
must be reconstructed by a contents
analysis similar to a traditional file-system check. Alternatively, flash-memory contents can be voided and
reloaded on demand.
Database systems, on the other
hand, will employ flash memory as
persistent storage, using the extended
disk model. The current contents
will be described in persistent data
structures, such as parent pages in
B-tree indexes. Traditional durability
mechanisms—in particular, logging
and checkpoints—ensure consistency
and efficient recovery after system
crashes. Checkpoints and orderly
system shutdowns have no need to
write flash memory contents to disk,
and the pre-shutdown of flash contents
is required for a complete restart.
There are two reasons for these
different usage models for flash
memory. First, database systems rely
on regular checkpoints during which
dirty pages are flushed from the buffer
pool to persistent storage. If a dirty page
is moved from RAM to the extended
buffer pool in flash memory, it creates
substantial overhead during the next
checkpoint. A free buffer must be found
in RAM, the page contents must be
read from flash memory into RAM, and
then the page must be written to disk.
Adding such overhead to checkpoints
is not attractive in database systems
with frequent checkpoints. Operating
systems and traditional file systems,
on the other hand, do not rely on
checkpoints and thus can exploit flash
memory as an extended buffer pool.
Second, the principal persistent
data structures of databases, B-tree
indexes, provide precisely the mapping
and location-tracking mechanisms
needed to complement frequent page
movement and replacement. Thus,
tracking a data page when it moves
between disk and flash relies on the
same data structure maintained for
efficient database search. In addition,
avoiding indirection in locating a
page also makes database searches as
efficient as possible.
Finally, as the ratio of access
latencies and transfer bandwidth is
very different for flash memory and
disks, different B-tree node sizes are
optimal. O’Neil’s SB-tree exploits
two different node sizes as needed in
a multilevel storage hierarchy. The
required inexpensive mechanisms for
moving individual pages are the same
as those required when moving pages
between flash memory and disk.
Acknowledgments
This article is dedicated to Jim Gray,
who suggested this research and
helped the author and many others
many times in many ways. Barb Peters,
Lily Jow, Harumi Kuno, José Blakeley,
Mehul Shah, the DaMoN 2007 reviewers, and particularly Harumi Kuno suggested multiple improvements after
reading earlier versions of this work.
References
1. Ailamaki, A., De Witt, D. J. and hill, M.D. Data page
layouts for relational databases on deep memory
hierarchies. VLDB Journal 11, 3 (2002), 198–215.
2. Bender, M. A. Demaine, E. D. and Farach-Colton, M.
Cache-oblivious B-trees. SIAM Journal on Computing
35, 2 (2005), 341–358.
3. Bayer, R. and McCreight, E.M. Organization and
maintenance of large ordered indexes. SIGFI-DET
Workshop (1970), 107–141.
4. Bayer, R. and Unterauer, k. Prefix B-trees. ACM
Transactions on Database Systems 2, 1 (1977), 11–26.
5. Carey, M.J., De Witt, D.J., Richardson, J.E. and Shekita,
E.J. Storage management in EXODUS. In
Object-Oriented Concepts, Databases, and Applications. W.
kim and F. Lochovsky, Eds. ACM, N. Y., 1989, 341–369.
6. Chen, P.M., Lee, E.L. Gibson, G.A., katz, R.h. and Patterson,
D.A. 1994. RAID: high-performance, reliable secondary
storage. ACM Computing Surveys 26( 2): 145–185.
7. De Witt, D. J., Naughton, J. F. and Burger, J. Nested
loops revisited. Parallel and Distributed Information
Systems (1993), 230–242.
8. Graefe, G. Query evaluation techniques for large
databases. ACM Computing Surveys 25, 2 (1993), 73–170.
9. Graefe, G. Executing nested queries. Database Systems
for Business, Technology and Web (2003), 58–77.
10. Graefe, G. Write-optimized B-trees. VLDB Journal
(2004), 672–683.
11. Graefe, G. Implementing sorting in database systems.
ACM Computing Surveys 38, 3 (2006), 69–106.
12. Graefe, G. Master-detail clustering using merged
indexes. Informatik–Forschung und Entwicklun, 2007.
13. Gray, J. and Fitzgerald, B. 2007. Flash disk opportunity
for server-applications; http://research.microsoft.
com/~gray/papers/ FlashDiskPublic.doc.
14. Gray, J., Graefe, G. 1997. The five-minute rule ten
years later, and other computer storage rules of
thumb. SIGMOD Record 26, 4 (1997), 63–68.
15. Gray, J. and Putzolu, G. R. The 5-minute rule for
trading memory for disk accesses and the 10-byte
rule for trading memory for CPU time. SIGMOD
Journal (1987), 395–398.
16. härder, T. Implementing a generalized access path
structure for a relational database system. ACM
Transactions on Database Systems 3, 3 (1978), 285–298.
17. hamilton, J. An architecture for modular data centers.
In Proceedings of the Conference on Innovative Data
Systems Research, 2007.
18. härder, T. and Reuter, A. Principles of transaction-oriented database recovery. ACM Computing Surveys
15, 4 (1983), 287–317.
19. Lomet, D.B. The evolution of effective B-tree page
organization and techniques: a personal account.
SIGMOD Record 30, 3, 64–69.
20. Larus, J.R. and Rajwar, R. Transactional Memory.
Synthesis Lectures on Computer Architecture. Morgan
& Claypool, 2007.
21. Nyberg, C., Barclay, T., Cvetanovic, z., Gray, J. and
Lomet, D.B. AlphaSort: A cache-sensitive parallel
external sort. VLDB Journal (1995), 603–627.
22. Ousterhout, J.k. and Douglis, F. Beating the I/O
bottleneck: A case for log-structured file systems.
Operating Systems Review 23, 1 (1989), 11–28.
23. O’Neil, P. W. The SB-tree: An index-sequential structure
for high-performance sequential access. Acta
Informatica 29, 3 (1992), 241–265.
24. Rivoire, S., Shah, M., Ranganathan, P. and kozyrakis, C.
JouleSort: A balanced energy-efficiency benchmark.
SIGMOD Record, 2007.
25. Stonebraker, M. Operating system support for database
management. Commun. ACM 24, 7 (July 1981), 412–418.
26. Shatdal, A., kant, C. and Naughton, J.F. Cache-conscious algorithms for relational query processing.
VLDB Journal (1994), 510–521.
27. Woodhouse, D. JFFS: The Journaling Flash File
System. Ottawa Linux Symposium, Red hat Inc., 2001.
Related articles
on queue.acm.org
Flash Storage Today
Adam Leventhal
http://queue.acm.org/detail.cfm?id=1413262
Flash Disk Opportunity
for Server Applications
Jim Gray, Bob Fitzgerald
http://queue.acm.org/detail.cfm?id=1413261
Enterprise SSDs
Mark Moshayedi, Patrick Wilkison
http://queue.acm.org/detail.cfm?id=1413263
Goetz Graefe (Goetz. Graefe@hP.com) joined hewlett-Packard Laboratories after seven years as an academic
researcher and teacher followed by 12 years as a product
architect and developer at Microsoft. he was recently
named an hP Fellow. his Volcano research project
was awarded the 10-year Test-of-Time Award at ACM
SIGMOD 2000 for work on query execution.
An earlier version of this article was originally published
in Proceedings of the Third International Workshop on
Data Management on New Hardware (June 15, 2007),
Beijing, China.
© 2009 ACM 0001-0782/09/0700 $10.00