RAM or persistent storage. These models are called extended buffer pool and extended disk here. Both models may seem viable in operating systems, file systems, and in database systems. The different characteristics of each of these systems, however, will require different usage models.

In both models, contents of RAM and flash will be governed by LRU-like replacement algorithms that attempt to keep the most valuable pages in RAM and the least valuable pages on traditional disks. The linked list or other data structure implementing the replacement policy for flash memory will be maintained in RAM.

Operating systems and traditional file systems will use flash memory mostly as transient memory (for example, as a fast backup store for virtual memory and as a secondary file-system cache). Both of these applications fall into the extended buffer-pool model. During an orderly system shutdown, the flash memory contents must be written to persistent storage. During a system crash, however, the RAM-based description of flash-memory contents will be lost and must be reconstructed by a contents analysis similar to a traditional file-system check. Alternatively, flash-memory contents can be voided and reloaded on demand.

Database systems, on the other hand, will employ flash memory as persistent storage, using the extended disk model. The current contents will be described in persistent data structures, such as parent pages in B-tree indexes. Traditional durability mechanisms—in particular, logging and checkpoints—ensure consistency and efficient recovery after system crashes. Checkpoints and orderly system shutdowns have no need to write flash memory contents to disk, and the pre-shutdown of flash contents is required for a complete restart.

There are two reasons for these different usage models for flash memory. First, database systems rely on regular checkpoints during which dirty pages are flushed from the buffer pool to persistent storage. If a dirty page is moved from RAM to the extended buffer pool in flash memory, it creates substantial overhead during the next checkpoint. A free buffer must be found

in RAM, the page contents must be read from flash memory into RAM, and then the page must be written to disk. Adding such overhead to checkpoints is not attractive in database systems with frequent checkpoints. Operating systems and traditional file systems, on the other hand, do not rely on checkpoints and thus can exploit flash memory as an extended buffer pool.

Second, the principal persistent data structures of databases, B-tree indexes, provide precisely the mapping and location-tracking mechanisms needed to complement frequent page movement and replacement. Thus, tracking a data page when it moves between disk and flash relies on the same data structure maintained for efficient database search. In addition, avoiding indirection in locating a page also makes database searches as efficient as possible.

Finally, as the ratio of access latencies and transfer bandwidth is very different for flash memory and disks, different B-tree node sizes are optimal. O’Neil’s SB-tree exploits two different node sizes as needed in a multilevel storage hierarchy. The required inexpensive mechanisms for moving individual pages are the same as those required when moving pages between flash memory and disk.

Acknowledgments

This article is dedicated to Jim Gray, who suggested this research and helped the author and many others many times in many ways. Barb Peters, Lily Jow, Harumi Kuno, José Blakeley, Mehul Shah, the DaMoN 2007 reviewers, and particularly Harumi Kuno suggested multiple improvements after reading earlier versions of this work.

 

References

1. Ailamaki, A., De Witt, D. J. and hill, M.D. Data page layouts for relational databases on deep memory hierarchies. VLDB Journal 11, 3 (2002), 198–215.

2. Bender, M. A. Demaine, E. D. and Farach-Colton, M. Cache-oblivious B-trees. SIAM Journal on Computing 35, 2 (2005), 341–358.

3. Bayer, R. and McCreight, E.M. Organization and maintenance of large ordered indexes. SIGFI-DET Workshop (1970), 107–141.

4. Bayer, R. and Unterauer, k. Prefix B-trees. ACM Transactions on Database Systems 2, 1 (1977), 11–26.

5. Carey, M.J., De Witt, D.J., Richardson, J.E. and Shekita, E.J. Storage management in EXODUS. In Object-Oriented Concepts, Databases, and Applications. W. kim and F. Lochovsky, Eds. ACM, N. Y., 1989, 341–369.

6. Chen, P.M., Lee, E.L. Gibson, G.A., katz, R.h. and Patterson, D.A. 1994. RAID: high-performance, reliable secondary storage. ACM Computing Surveys 26( 2): 145–185.

7. De Witt, D. J., Naughton, J. F. and Burger, J. Nested loops revisited. Parallel and Distributed Information

Systems (1993), 230–242.

8. Graefe, G. Query evaluation techniques for large databases. ACM Computing Surveys 25, 2 (1993), 73–170.

9. Graefe, G. Executing nested queries. Database Systems for Business, Technology and Web (2003), 58–77.

10. Graefe, G. Write-optimized B-trees. VLDB Journal (2004), 672–683.

11. Graefe, G. Implementing sorting in database systems. ACM Computing Surveys 38, 3 (2006), 69–106.

12. Graefe, G. Master-detail clustering using merged indexes. Informatik–Forschung und Entwicklun, 2007.

13. Gray, J. and Fitzgerald, B. 2007. Flash disk opportunity for server-applications; http://research.microsoft. com/~gray/papers/ FlashDiskPublic.doc.

14. Gray, J., Graefe, G. 1997. The five-minute rule ten years later, and other computer storage rules of thumb. SIGMOD Record 26, 4 (1997), 63–68.

15. Gray, J. and Putzolu, G. R. The 5-minute rule for trading memory for disk accesses and the 10-byte rule for trading memory for CPU time. SIGMOD Journal (1987), 395–398.

16. härder, T. Implementing a generalized access path structure for a relational database system. ACM Transactions on Database Systems 3, 3 (1978), 285–298.

17. hamilton, J. An architecture for modular data centers. In Proceedings of the Conference on Innovative Data Systems Research, 2007.

18. härder, T. and Reuter, A. Principles of transaction-oriented database recovery. ACM Computing Surveys 15, 4 (1983), 287–317.

19. Lomet, D.B. The evolution of effective B-tree page organization and techniques: a personal account. SIGMOD Record 30, 3, 64–69.

20. Larus, J.R. and Rajwar, R. Transactional Memory. Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2007.

21. Nyberg, C., Barclay, T., Cvetanovic, z., Gray, J. and Lomet, D.B. AlphaSort: A cache-sensitive parallel external sort. VLDB Journal (1995), 603–627.

22. Ousterhout, J.k. and Douglis, F. Beating the I/O bottleneck: A case for log-structured file systems. Operating Systems Review 23, 1 (1989), 11–28.

23. O’Neil, P. W. The SB-tree: An index-sequential structure for high-performance sequential access. Acta Informatica 29, 3 (1992), 241–265.

24. Rivoire, S., Shah, M., Ranganathan, P. and kozyrakis, C. JouleSort: A balanced energy-efficiency benchmark. SIGMOD Record, 2007.

25. Stonebraker, M. Operating system support for database management. Commun. ACM 24, 7 (July 1981), 412–418.

26. Shatdal, A., kant, C. and Naughton, J.F. Cache-conscious algorithms for relational query processing. VLDB Journal (1994), 510–521.

27. Woodhouse, D. JFFS: The Journaling Flash File System. Ottawa Linux Symposium, Red hat Inc., 2001.

 

Related articles on queue.acm.org

Flash Storage Today

Adam Leventhal

http://queue.acm.org/detail.cfm?id=1413262

Flash Disk Opportunity for Server Applications Jim Gray, Bob Fitzgerald

http://queue.acm.org/detail.cfm?id=1413261

Enterprise SSDs

Mark Moshayedi, Patrick Wilkison

http://queue.acm.org/detail.cfm?id=1413263

Goetz Graefe (Goetz. Graefe@hP.com) joined hewlett-Packard Laboratories after seven years as an academic researcher and teacher followed by 12 years as a product architect and developer at Microsoft. he was recently named an hP Fellow. his Volcano research project was awarded the 10-year Test-of-Time Award at ACM SIGMOD 2000 for work on query execution. An earlier version of this article was originally published in Proceedings of the Third International Workshop on Data Management on New Hardware (June 15, 2007), Beijing, China.

© 2009 ACM 0001-0782/09/0700 $10.00

References:

http://queue.acm.org

http://queue.acm.org/detail.cfm?id=1413262

http://queue.acm.org/detail.cfm?id=1413261

http://queue.acm.org/detail.cfm?id=1413263

mailto:Goetz.Graefe@HP.com

http://research.microsoft.com/~gray/papers/FlashDiskPublic.doc

http://research.microsoft.com/~gray/papers/FlashDiskPublic.doc

Archives