indexing we did, the more our ability to update became slower than molasses. I learned this is a common trade-off. Reading fast frequently means writing slower.

Row-store vs. column-store. I have focused most of my misspent career on distributed systems and online transaction processing (OLTP)-style databases. It's natural for me to associate high-performance updates with what today is called a row-store.

Another approach is to organize data by columns: Take a bunch of rows and organize the data by its column values. Every row containing the state of California, for example, keeps just that single column's data together. Columnar databases are super fast for queries because many logical rows with the same value are physically close to each other.

However, updating a column-store is not as easy. Typically, updates are kept separately in an integrated row-store. Queries check the small row-store in a fashion that's relatively fast because it's small. These results are combined with the results from the column-store to give a unified, accurate answer. Periodically, the new row-store updates are merged with the column-store to make a new column-store. This may be done in a cascading fashion somewhat like the merges in a log-structured merge (LSM) tree, described in the next section.

When inserting into a column-store (or really its attached row-store), you are incurring a debt to be paid later. This debt to rewrite and integrate the new data is a form of write amplification, where a single write turns into more writes later.

Log-structured merge trees. LSM trees were first proposed in 1996.6 The idea is to track changes to a key-value store as transactions, with new values kept in memory. As transactions commit, the sorted collection of recent key-value pairs can be written to disk in a uniquely named file. This file contains the sorted key-value pairs along with an index into the keys in the file. Once written to disk, the newly committed changes do not need to be kept in memory.

Now, if you keep doing this, looking up values by key starts looking like what happens to me when I try to find something I set down in some random place. Linear searches for your wallet might be tractable in a small apartment but not so much when the search space gets bigger in a larger home in the suburbs. To reduce the read perspiration, LSM trees invest energy to organize the data by rewriting it as you go.

When a new file is freshly written from the storage engine, it has a bunch of key-value pairs. To make it easy to find keys, these are merged with files that were written earlier. Each LSM tree has some form of fan-out, where lower levels of the tree (with keys written earlier) are kept across more files. For example, you may have 10 times as many files at level 1 as at the brand-new level 0. Each file at level 1 represents approximately one-tenth as large a key range but approximately 10 times the span of update time. Similarly, moving down to level 2 results in 100 files, each with a narrower key range and a longer time range.

The depth of an LSM tree depends on the fan-out, the size of each file, and the number of key-value pairs in the tree. In general, most of the storage is in the lowest level of the tree.

So, within this basic LSM structure that is gaining so much popularity, there are varieties of implementation:

˲ Leveling merges. When a new file is added to a level, pick the next file in the round-robin traversal and merge it with the files in the next level below. Suppose you pick a fan-out of 10; you will find that the key range in the file dropping down typically covers the key range of about 10 files in the level below. You merge 11 files together as one drops down onto 10, and you get 11 files out. Now the next level has gotten fatter by one file, so you repeat and merge down again.

˲ Tiering merges. In this different but related approach, you let a bunch of files stack up on each level before doing the merge. Say you stack up 10 files before you merge down at each level. That dramatically reduces the amount of merging required.

Leveling merges have a large write amplification. Each write of a new key-value pair to level 0 will be rewritten 10 or 11 times at each level it moves through. On the other hand,
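The merge step described above — newer files combined with older ones, with the most recent value for a key winning — can be sketched in a few lines. This is an illustrative toy, not any real engine's merge routine; the function name and run representation are assumptions made for the example.

```python
# Minimal sketch of an LSM-style merge: combine sorted runs of
# key-value pairs, letting the newest run win on duplicate keys.
# (Toy illustration only; `merge_runs` is a hypothetical helper.)

def merge_runs(runs):
    """`runs` is ordered newest-first; each run maps keys to values."""
    merged = {}
    for run in reversed(runs):      # apply oldest first...
        merged.update(run)          # ...so newer runs overwrite older values
    return dict(sorted(merged.items()))  # keep the merged run sorted by key

newest = {"b": 2, "c": 30}
older = {"a": 1, "c": 3}
print(merge_runs([newest, older]))  # {'a': 1, 'b': 2, 'c': 30}
```

Note that the newest run's value for key "c" survives, just as a freshly written key-value pair shadows an earlier one as it drops down the tree.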
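The fan-out arithmetic and the leveling-versus-tiering trade-off can also be put in back-of-envelope form. The model below is only a rough sketch under stated assumptions (a fixed fan-out, each pair rewritten about fan-out times per level under leveling and about once per level under tiering); it is not a formula from any particular system.

```python
# Back-of-envelope model of LSM fan-out and write amplification.
# All names and formulas are illustrative assumptions for this sketch.

def files_per_level(fan_out: int, depth: int) -> list:
    """Level 0 holds one unit of files; each lower level holds fan_out times more."""
    return [fan_out ** level for level in range(depth)]

def leveling_write_amp(fan_out: int, depth: int) -> int:
    """Leveling: a pair is rewritten roughly fan_out times at each level."""
    return fan_out * depth

def tiering_write_amp(depth: int) -> int:
    """Tiering: files stack up, so a pair is rewritten roughly once per level."""
    return depth

print(files_per_level(10, 4))       # [1, 10, 100, 1000]
print(leveling_write_amp(10, 4))    # 40
print(tiering_write_amp(4))         # 4
```

With a fan-out of 10 and four levels, the sketch shows why tiering dramatically reduces merging: roughly 4 rewrites per pair instead of roughly 40 — at the cost, as the text goes on to discuss, of more read perspiration.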