addressed is reduced, making queries
Atomicity and Durability
To reduce the number of I/O operations and make them sequential,
both B-trees and LSM-trees batch operations in memory before making
an actual update. This means data
integrity is not guaranteed during
failure scenarios and atomicity (
applying a series of changes atomically,
as if they were a single operation, or
not applying them at all) and
durability (ensuring that in the face of a
process crash or power loss, data has
reached persistent storage) properties are not ensured.
To solve that problem, most modern
storage systems employ WAL (
write-ahead logging). The key idea behind
WAL is that all the database state
modifications are first durably persisted in the append-only log on disk.
If the process crashes in the middle of
an operation, the log is replayed, ensuring no data is lost and all changes
In B-trees, using WAL can be understood as writing changes to data files
only after they have been logged. Usually log sizes for B-tree storage systems are relatively small: as soon as
changes are applied to the persisted
storage, they can be discarded. WAL
serves as a backup for the in-flight operations: any changes that were not
applied to data pages can be redone
from the log records.
In LSM-trees, WAL is used to persist
changes that have reached the memtables but have not yet been fully flushed
on disk. As soon as a memtable is fully
flushed and switched so that read operations can be served from the newly
created SSTable, the WAL segment
holding the data for the flushed memtable can be discarded.
One of the biggest differences between
the B-tree and LSM-tree data structures
is what they optimize for and what implications these optimizations have.
Let’s compare the properties of B-trees with LSM-trees. In summary, B-trees have the following properties:
˲ They are mutable, which allows
for in-place updates by introducing
some space overhead and a more in-
volved write path, although it does
not require complete file rewrites or
˲ They are read-optimized, meaning
they do not require reading from (and
subsequently merging) multiple sources,
thus simplifying the read path.
˲ Writes might trigger a cascade of
node splits, making some write opera-
tions more expensive.
˲ They are optimized for paged envi-
ronments (block storage), where byte ad-
dressing is not possible.
˲ Fragmentation, caused by frequent
updates, might require additional main-
tenance and block rewrites. B-trees, how-
ever, usually require less maintenance
than LSM-tree storage.
˲ Concurrent access requires reader/
writer isolation and involves chains of
locks and latches.
LSM-trees have these properties:
˲ They are immutable. SSTables are
written on disk once and never updat-
ed. Compaction is used to reconcile
space occupied by removed items and
merge same-key data from multiple
data files. Merged SSTables are dis-
carded and removed after a successful
merge as part of the compaction pro-
cess. Another useful property coming
from immutability is that flushed tables
can be accessed concurrently.
˲ They are write optimized, meaning
that writes are buffered and flushed on
disk sequentially, potentially allowing for
spatial locality on the disk.
˲ Reads might require accessing data
from multiple sources, since data for the
same key, written during different times,
might land in different data files. Records
have to go through the merge process be-
fore being returned to the client.
˲ Maintenance/compaction is required,
as buffered writes are flushed on disk.
Evaluating Storage Systems
Developing storage systems always
presents the same challenges and factors to consider. Deciding what to optimize for has a substantial influence
on the result. You can spend more time
during write in order to lay out structures for more efficient reads, reserve
extra space for in-place updates, facilitate faster writes, and buffer data in
memory to ensure sequential write operations. It is impossible, however, to
do this all at once. An ideal storage system would have the lowest read cost,
One of the biggest
B-tree and LSM-tree
data structures is
what they optimize
for and what