Reconstruction: The Data Log contains all the information necessary to reconstruct the Hash Index from scratch.
As an optimization, FAWN-DS periodically checkpoints the
index by writing the Hash Index and a pointer to the last
log entry to flash. After a failure, FAWN-DS uses the checkpoint as a starting point to reconstruct the in-memory Hash Index.
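As a rough sketch of this recovery path, the Go code below assumes a hypothetical on-flash format (a serialized index plus the log offset it covers, and a 13-byte log entry header of key, value length, and delete flag); it is not the actual FAWN-DS implementation.

```go
// Sketch of checkpoint-based recovery (hypothetical format, not the actual
// FAWN-DS code): the checkpoint holds a serialized hash index plus the Data
// Log offset it covers; entries appended after that offset are replayed to
// bring the in-memory index up to date.
package fawnds

import (
	"encoding/binary"
	"encoding/gob"
	"io"
	"os"
)

type HashIndex map[uint64]int64 // key -> offset of latest entry in the Data Log

type checkpoint struct {
	Index     HashIndex
	LogOffset int64 // the index above is valid for the log up to this offset
}

// RecoverIndex rebuilds the in-memory Hash Index after a failure.
func RecoverIndex(checkpointPath, logPath string) (HashIndex, error) {
	cp := checkpoint{Index: make(HashIndex)}
	if f, err := os.Open(checkpointPath); err == nil {
		gob.NewDecoder(f).Decode(&cp) // a missing checkpoint means a full scan
		f.Close()
	}
	log, err := os.Open(logPath)
	if err != nil {
		return nil, err
	}
	defer log.Close()
	pos := cp.LogOffset
	if _, err := log.Seek(pos, io.SeekStart); err != nil {
		return nil, err
	}
	// Replay each 13-byte entry header (key, value length, delete flag)
	// written after the checkpoint, skipping over the value bytes.
	hdr := make([]byte, 13)
	for {
		if _, err := io.ReadFull(log, hdr); err != nil {
			break // end of log
		}
		key := binary.LittleEndian.Uint64(hdr[0:8])
		vlen := binary.LittleEndian.Uint32(hdr[8:12])
		if hdr[12] != 0 {
			delete(cp.Index, key) // delete entry: forget the key
		} else {
			cp.Index[key] = pos // point the index at this entry
		}
		next, err := log.Seek(int64(vlen), io.SeekCurrent)
		if err != nil {
			break
		}
		pos = next
	}
	return cp.Index, nil
}
```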
Virtual IDs and semi-random writes: A physical node has
a separate FAWN-DS datastore file for each of its virtual IDs,
and FAWN-DS appends new or updated data items to the
appropriate datastore. Sequentially appending to a small
number of files is termed semi-random writes. With many
flash devices, these semi-random writes are nearly as fast
as a single sequential append.15 We take advantage of this
property to retain fast write performance while allowing
key ranges to be stored in independent files to speed the
maintenance operations described below.
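A minimal sketch of this write path, assuming one append-only file per virtual ID and an illustrative routing routine (the names are ours, not FAWN-DS's):

```go
// Minimal sketch of "semi-random" writes (hypothetical layout): each virtual
// ID owns its own append-only datastore file, and a put simply appends to the
// file whose key range covers the key. Sequential appends to a handful of
// open files remain close to a single sequential stream on many flash devices.
package fawnds

import "os"

type vidStore struct {
	rangeStart, rangeEnd uint64 // inclusive key range owned by this VID
	file                 *os.File
}

type node struct {
	stores []vidStore // one datastore file per virtual ID
}

// append writes the serialized entry to the datastore owning key.
func (n *node) append(key uint64, entry []byte) error {
	for i := range n.stores {
		s := &n.stores[i]
		if key >= s.rangeStart && key <= s.rangeEnd {
			_, err := s.file.Write(entry) // sequential append within this file
			return err
		}
	}
	return os.ErrNotExist // no local VID owns this key
}
```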
3.3.1. Basic functions: Store, lookup, delete
Store appends an entry to the log, updates the corresponding hash table entry to point to the offset of the newly
appended entry within the Data Log, and sets the valid bit
to true. If the key written already existed, the old value is
now orphaned (no hash entry points to it) for later garbage collection.
Lookup retrieves the hash entry containing the offset,
indexes into the Data Log, and returns the data blob.
Delete invalidates the hash entry corresponding to the
key and writes a Delete entry to the end of the data file. The
delete entry is necessary for fault tolerance—the invalidated
hash table entry is not immediately committed to non-volatile storage to avoid random writes, so a failure following a delete requires a log to ensure that recovery will delete
the entry upon reconstruction. Because of its log structure,
FAWN-DS deletes are similar to store operations with 0
byte values. Deletes do not immediately reclaim space and
require compaction to perform garbage collection. This
design defers the cost of a random write to a later sequential write performed during compaction.
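The sketch below illustrates all three basic operations against a single datastore, under the simplifying assumption that the in-memory index is a plain map from full keys to log offsets (the real FAWN-DS index is a compact hash table keyed by a fragment of the key):

```go
// Sketch of the three basic operations against one datastore (hypothetical
// types and layout). Store and Delete only append to the log and touch the
// in-memory index; Lookup follows the index to a single read from flash.
package fawnds

import (
	"bytes"
	"encoding/binary"
	"errors"
	"io"
	"os"
)

type DataStore struct {
	log   *os.File
	index map[uint64]int64 // key -> offset of latest entry in the log
}

// appendEntry serializes key, value length, delete flag, and value bytes,
// and appends them to the end of the Data Log.
func (d *DataStore) appendEntry(key uint64, value []byte, tombstone bool) (int64, error) {
	off, err := d.log.Seek(0, io.SeekEnd)
	if err != nil {
		return 0, err
	}
	del := uint8(0)
	if tombstone {
		del = 1
	}
	var buf bytes.Buffer
	binary.Write(&buf, binary.LittleEndian, key)
	binary.Write(&buf, binary.LittleEndian, uint32(len(value)))
	binary.Write(&buf, binary.LittleEndian, del)
	buf.Write(value)
	_, err = d.log.Write(buf.Bytes())
	return off, err
}

// Store appends the entry, then points the index at it; any previous value
// for the key becomes orphaned until compaction reclaims it.
func (d *DataStore) Store(key uint64, value []byte) error {
	off, err := d.appendEntry(key, value, false)
	if err != nil {
		return err
	}
	d.index[key] = off
	return nil
}

// Lookup reads the offset from the index and the data blob from the log.
func (d *DataStore) Lookup(key uint64) ([]byte, error) {
	off, ok := d.index[key]
	if !ok {
		return nil, errors.New("not found")
	}
	hdr := make([]byte, 13) // 8-byte key + 4-byte length + 1-byte delete flag
	if _, err := d.log.ReadAt(hdr, off); err != nil {
		return nil, err
	}
	vlen := binary.LittleEndian.Uint32(hdr[8:12])
	val := make([]byte, vlen)
	_, err := d.log.ReadAt(val, off+13)
	return val, err
}

// Delete invalidates the index entry and logs a tombstone so that recovery
// also forgets the key; space is reclaimed later by Compact.
func (d *DataStore) Delete(key uint64) error {
	if _, err := d.appendEntry(key, nil, true); err != nil {
		return err
	}
	delete(d.index, key)
	return nil
}
```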
3.3.2. Maintenance: Split, merge, compact
Inserting a new virtual node into the ring causes one key
range to split into two, with the new virtual node gaining
responsibility for the first part of it. Nodes handling these
VIDs must therefore Split their datastore into two datastores, one for each key range. When a virtual node departs the
system, two adjacent key ranges must similarly Merge into
a single datastore. In addition, a virtual node must periodically Compact its datastores to clean up stale or orphaned
entries created by Split, Store, and Delete.
These maintenance functions are designed to work well
on flash, requiring only scans of one datastore and sequential writes into another.
Split parses the Data Log sequentially, writing each
entry to a new datastore if its key falls in the new datastore's range.
Merge writes every log entry from one datastore into the
other datastore; because the key ranges are independent,
it does so as an append. Split and Merge propagate delete
entries into the new datastore.
Compact cleans up entries in a datastore, similar to
garbage collection in a log-structured filesystem. It skips
entries that fall outside of the datastore’s key range, which
may be left over after a split. It also skips orphaned entries
that no in-memory hash table entry points to, and then skips
any delete entries corresponding to those entries. It writes all
other valid entries into the output datastore.
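The following sketch shows how all three maintenance functions reduce to one sequential scan of a source datastore plus sequential appends into a destination. It reuses the hypothetical DataStore layout from the earlier sketch and an assumed scanEntries helper defined here; the Compact rule for tombstones is simplified relative to the paper's description.

```go
// Sketch of Split, Merge, and Compact as sequential scans over one datastore
// with appends into another (reusing the hypothetical DataStore above; not
// the actual FAWN-DS code).
package fawnds

import "encoding/binary"

type entry struct {
	Key       uint64
	Value     []byte
	Tombstone bool
	Offset    int64 // where this entry's header starts in the source log
}

// scanEntries walks the source Data Log from the beginning, decoding each
// 13-byte header (key, value length, delete flag) and the value that follows.
func scanEntries(src *DataStore, fn func(entry) error) error {
	var off int64
	hdr := make([]byte, 13)
	for {
		if _, err := src.log.ReadAt(hdr, off); err != nil {
			return nil // reached the end of the log
		}
		vlen := binary.LittleEndian.Uint32(hdr[8:12])
		val := make([]byte, vlen)
		if vlen > 0 {
			if _, err := src.log.ReadAt(val, off+13); err != nil {
				return err
			}
		}
		e := entry{
			Key:       binary.LittleEndian.Uint64(hdr[0:8]),
			Value:     val,
			Tombstone: hdr[12] != 0,
			Offset:    off,
		}
		if err := fn(e); err != nil {
			return err
		}
		off += 13 + int64(vlen)
	}
}

// Split copies every entry whose key falls in the new datastore's range,
// including tombstones, so that deletes are propagated.
func Split(src, dst *DataStore, lo, hi uint64) error {
	return scanEntries(src, func(e entry) error {
		if e.Key < lo || e.Key > hi {
			return nil // stays with the old datastore
		}
		if e.Tombstone {
			return dst.Delete(e.Key)
		}
		return dst.Store(e.Key, e.Value)
	})
}

// Merge appends every entry of src into dst; the two key ranges are
// disjoint, so each copy is a plain append with no conflicts.
func Merge(src, dst *DataStore) error {
	return scanEntries(src, func(e entry) error {
		if e.Tombstone {
			return dst.Delete(e.Key)
		}
		return dst.Store(e.Key, e.Value)
	})
}

// Compact keeps only live entries: in range and still pointed to by the
// in-memory index. Out-of-range, orphaned, and tombstone entries are dropped
// (simplified relative to the paper's description).
func Compact(src, dst *DataStore, lo, hi uint64) error {
	return scanEntries(src, func(e entry) error {
		if e.Key < lo || e.Key > hi || e.Tombstone {
			return nil
		}
		if off, live := src.index[e.Key]; !live || off != e.Offset {
			return nil // orphaned: a newer store or a delete superseded it
		}
		return dst.Store(e.Key, e.Value)
	})
}
```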
3.3.3. Concurrent maintenance and operation
All FAWN-DS maintenance functions allow concurrent reads
and writes to the datastore. Stores and Deletes only
modify hash table entries and write to the end of the log.
Maintenance operations (Split, Merge, and Compact)
sequentially parse the Data Log, which may be growing due
to deletes and stores. Because the log is append only, a log
entry once parsed will never be changed. These operations
each create one new output datastore logfile. The maintenance operations run until they reach the end of the log, and
then briefly lock the datastore, ensure that all values flushed
to the old log have been processed, update the FAWN-DS
datastore list to point to the newly created log, and release
the lock (Figure 2c).
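A sketch of that final catch-up-and-swap step, with hypothetical names (StoreList, swapAfterScan); the long sequential scan runs without the lock, and only the short tail processing and the pointer swap hold it:

```go
// Sketch of the catch-up-then-swap step at the end of a maintenance scan
// (hypothetical structure, not the actual FAWN-DS code). Writers keep
// appending during the unlocked scan; the lock is held only while the small
// tail is processed and the datastore list is updated.
package fawnds

import "sync"

type StoreList struct {
	mu     sync.Mutex
	active *DataStore // datastore currently serving this key range
}

// swapAfterScan is called by Split/Merge/Compact once the unlocked scan has
// processed the old log up to `scanned` bytes. process handles any entries
// that writers appended after that point.
func (l *StoreList) swapAfterScan(old, fresh *DataStore, scanned int64,
	process func(from int64) error) error {
	l.mu.Lock() // writers briefly blocked
	defer l.mu.Unlock()
	// Handle the (usually small) tail appended during the unlocked scan.
	if err := process(scanned); err != nil {
		return err
	}
	l.active = fresh // the newly created log now serves reads and writes
	_ = old          // the old log can be deleted once no reader still uses it
	return nil
}
```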
3.4. The FAWN key-value system
In FAWN-KV, client applications send requests to front ends
using a standard put/get interface. Front ends send the
request to the back-end node that owns the key space for the
request. The back-end node satisfies the request using its
FAWN-DS and replies to the front ends.
3.4.1. Consistent hashing: Key ranges to nodes
A typical FAWN cluster will have several front ends and
many back ends. FAWN-KV organizes the back-end VIDs
into a storage ring structure using consistent hashing.18
Front ends maintain the entire node membership list and
directly forward queries to the back-end node that contains
a particular data item.
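A minimal consistent-hashing lookup as a front end might perform it, with assumed types and an ownerOf helper (membership changes and the choice of hash function are omitted):

```go
// Minimal consistent-hashing sketch for the front end's routing decision
// (hypothetical helper, not the FAWN-KV implementation). Each back-end VID
// owns the keys between its predecessor's position and its own position on
// the ring.
package fawnkv

import "sort"

type ringEntry struct {
	position uint64 // hash of the VID
	nodeAddr string // back-end node owning this VID
}

type Ring struct {
	entries []ringEntry // kept sorted by position
}

// ownerOf returns the back end responsible for keyHash: the first VID at or
// after keyHash on the ring, wrapping around past the largest position.
func (r *Ring) ownerOf(keyHash uint64) string {
	i := sort.Search(len(r.entries), func(i int) bool {
		return r.entries[i].position >= keyHash
	})
	if i == len(r.entries) {
		i = 0 // wrap around the ring
	}
	return r.entries[i].nodeAddr
}
```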
Each front-end node manages the VID membership list
and queries for a large contiguous chunk of the key space.
A front end receiving queries for keys outside of its range
forwards the queries to the appropriate front-end node.
This design either requires clients to be roughly aware of
the front-end mapping or doubles the traffic that front ends
must handle, but it permits front ends to cache values without a cache consistency protocol.
The key space is allocated to front ends by a single management node; we envision this node being replicated
using a small Paxos cluster,13 but we have not (yet) implemented this. There would be 80 or more back-end nodes
per front-end node with our current hardware prototypes,
so the amount of information this management node
maintains is small and changes infrequently—a list of 125
front ends would suffice for a 10,000-node FAWN cluster.
When a back-end node joins, it obtains the list of front-end IDs. It uses this list to determine which front ends to
contact to join the ring, one VID at a time. We chose this
design so that the system would be robust to front-end node
failures: The back-end node identifier (and thus, what keys