data structures and algorithms we have found useful in disk-based computation. The first method is based on external sort—a well-known disk-based sort that avoids random access at the cost of performing several passes through the data. New states discovered during the breadth-first search are saved to disk without checking for duplicates. When an entire level of the search is completed, the new states are externally sorted and merged into a sorted list containing all previously discovered states.
In this way, we eliminate random-access data structures, using sorted lists in their place. Eliminating random access comes at the cost of having to maintain the sorted order of the lists. Further, this method requires that we save all known states. For our Rubik’s Cube computation, storing all configurations would require 11TB, not counting the buffer space for newly generated states.
The second method avoids storing all seen states and also removes the need for expensive external sorting operations. Instead of explicitly storing the known states, we use a disk-based table to record the previously discovered states. To avoid random access, we split this table into contiguous pieces such that each piece fits into RAM. When performing duplicate detection, we load one piece of the table into RAM at a time and remove duplicate states that correspond to that portion of the search space.
Even though this method avoids storing explored states, it still requires the storage of the open list of new states from which duplicates have not been removed. For our Rubik’s Cube computation, the open list has a maximum size of 50TB. To avoid this limitation, we use a technique we call implicit open list to encode the open states using a hash table, rather than an explicit list. This allows us to complete the computation using just 7TB of disk space.
ORGANIZING PRINCIPLE
A unified framework is required to broaden the appeal of disk-based computation beyond Rubik’s Cube. Our team is now searching for an organizing
principle that will allow for the construction of a software library or language extension that does for disk-based computation what numerical libraries have done for numerical analysis. As an initial step, we have begun a comparative analysis of eight different techniques for disk-based enumeration [ 2]. This analysis is based on the methods we used for Rubik’s Cube, along with our solutions to several model problems in computational group theory.
The search cuts across many areas of computer science. For example, in systems and architecture, how can we design disk-based computations to balance the use of CPU, RAM, network, and disk? In theory and algorithms, what class of computations can be converted to efficient disk-based computation? In software engineering and programming languages, how can we separate disk-specific data structures and algorithms from problem-specific concerns? By answering such questions, we will advance the use of disk-based computation, enabling solutions to problems requiring even petabytes of memory. c
REFERENCES
1. Kunkle, D. and Cooperman, G. Twenty-six moves suffice for Rubik’s Cube. In Proceedings of the 2007 International Symposium on Symbolic and Algebraic Computation (Waterloo, Ontario, Canada, July 29–Aug. 1). ACM Press, New York, 2007, 235–242.
2. Robinson, R., Kunkle, D., and Cooperman, G. A comparative analysis of parallel disk-based methods for enumerating implicit graphs. In Proceedings of the 2007 International Workshop on Parallel Symbolic Computation (London, Ontario, Canada, July 27– 28). ACM Press, New York, 2007, 78– 87.
3. Vitter, J. External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys 33, 2 (June 2001), 209–271.
DANIEL KUNKLE ( kunkle@ccs.neu.edu) is a Ph.D. candidate in computer science in the College of Computer and Information Science at Northeastern University, Boston, MA. GENE COOPERMAN ( gene@ccs.neu.edu) is a professor in the College of Computer and Information Science at Northeastern University, Boston, MA., where he is also the director of the Institute for Complex Scientific Software and the head of the High Performance Computing Laboratory.
© 2008 ACM 0001-0782/08/0400 $5.00
References:
Archives