can be used to test whether an element is a member of the set. It can
produce false-positive matches (that
is, state that the element is a member
of set, while it is not, in fact, present
there) but cannot produce false negatives (that is, if a negative match is returned, the element is guaranteed not
to be a member of the set). In other
words, a Bloom Filter is used to tell if
the key “might be in an SSTable” or “is
definitely not in an SSTable.” SSTables
for which a Bloom Filter has returned
a negative match are skipped during
the query.
LSM maintenance. Since SSTables
are immutable, they are written sequentially and hold no reserved empty space for in-place modifications.
This means insert, update, or delete
operations would require rewriting
the whole file. All operations modifying the database state are “batched”
in the memory-resident table. Over
time, the number of disk-resident
tables will grow (data for the same
key located in several files, multiple
versions of the same record, redundant records that got shadowed by
deletes), and the reads will continue
getting more expensive.
To reduce the cost of reads, reconcile space occupied by shadowed
records, and reduce the number of
disk-resident tables, LSM-trees require
a compaction process that reads complete SSTables from disk and merges
them. Because SSTables are sorted by
key and compaction works like merge-sort, this operation is very efficient:
records are read from several sources
sequentially, and merged output can
be appended to the results file right
away, also sequentially. One of the advantages of merge-sort is that it can
work efficiently even for merging large
files that don not fit in memory. The resulting table preserves the order of the
original SSTables.
During this process, merged SSTables
are discarded and replaced with their
“compacted” versions, as shown in
Figure 8. Compaction takes multiple
SSTables and merges them into one.
Some database systems logically group
the tables of the same size to the same
“level” and start the merge process
whenever enough tables are on a par-
ticular level. After compaction, the
number of SSTables that have to be
ery value item in an SSTable has a time-
stamp associated with it. This specifies
the write time for inserts and updates
(which are often indistinguishable)
and removal time for deletes.
SS Tables have some nice properties:
˲ Point-queries (that is, finding a value by key) can be done quickly by looking up the primary index.
˲ Scans (that is, iterating over all key/
value pairs in a specified key range) can
be done efficiently simply by reading
key/value pairs sequentially from the
data block.
An SSTable represents a snapshot of
all database operations over a period of
time, as the SSTable is created by the
flush process from the memory-resident table that served as a buffer for
operations against the database state
for this period.
Lookups. Retrieving data requires
searching all SSTables on disk, checking the memory-resident tables, and
merging their contents before returning the result. The merge step during
the read is required since the searched
data can reside in multiple SSTables.
The merge step is also necessary to
ensure the deletes and updates work.
Deletes in an LSM-tree insert place-holders (often called tombstones),
specifying which key was marked for
deletion. Similarly, an update is just
a record with a later timestamp. During the read, the records that get shadowed by deletes are skipped and not
returned to the client. A similar thing
happens with the updates: out of two
records with the same key, only the
one with the later timestamp is returned. Figure 7 shows a merge step
reconciling the data stored in separate tables for the same key: as shown
here, the record for Alex was written
with timestamp 100 and updated with
a new phone and timestamp 200; the
record for John was deleted. The other
two entries are taken as is, as they are
not shadowed.
To reduce the number of searched
SSTables and to avoid checking every
SSTable for the searched key, many
storage systems employ a data structure known as a Bloom Filter. 2 This
is a probabilistic data structure that
Figure 8. Compaction.
Figure 9. The RUM Conjecture.
read
(read optimized)
update
(write optimized)
memory
(space optimized)
differential DS
compressible/
approximate DS
point:
indexes
adaptive