Algorithms Behind Modern Storage Systems

Different uses for read-optimized B-trees and write-optimized LSM-trees

ALEX PETROV

The amounts of data processed by applications are constantly growing. With this growth, scaling storage becomes more challenging. Every database system has its own tradeoffs.
Understanding them is crucial, as it helps in
selecting the right one from so many available choices.
Every application is different in terms of read/write
workload balance, consistency requirements, latencies,
and access patterns. Familiarizing yourself with database
and storage internals facilitates architectural decisions,
helps explain why a system behaves a certain way, helps
troubleshoot problems when they arise, and fine-tunes the
database for your workload.
It’s impossible to optimize a system in all directions. In an ideal world there would be data structures guaranteeing the best read and write performance with no storage overhead, but, of course, in practice that is not possible.
B-TREES
B-trees are a popular read-optimized indexing data structure and a generalization of binary trees. They come in many variations and are used in many databases (including MySQL InnoDB [4] and PostgreSQL [7]) and even file systems (HFS+ [8], HTrees in ext4 [9]). The B in B-tree stands for Bayer, the author of the original data structure, or Boeing, where he worked at the time.
In a binary tree, every node has two children (referred to as the left and the right child). The left and right subtrees hold keys that are less than and greater than the current node’s key, respectively. To keep the tree depth to a minimum, a binary tree has to be balanced: when randomly ordered keys are added to the tree, one side of the tree will naturally end up deeper than the other.
One way to rebalance a binary tree is to use so-called rotation: rearranging nodes by pushing the parent node of the longer subtree down below its child and pulling this child up, effectively putting it in its parent’s place. Figure 1 is an example of rotation used for balancing in a binary tree.
On the left, a binary tree is unbalanced after adding node
2 to it. In order to balance it, node 3 is used as a pivot (the
tree is rotated around it). Then node 5, previously a root
node and a parent node for 3, becomes its child node. After
the rotation step is done, the height of the left subtree
decreases by one and the height of the right subtree
increases by one. The maximum depth of the tree has
decreased.
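In code, the rotation step can be sketched as follows. This is a minimal Python illustration of the right rotation from figure 1; the Node class and rotate_right helper are purely illustrative names, not part of any particular database implementation.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def rotate_right(root):
    """Pull the left child up and push the old root down to the right.

    The left child's right subtree (keys between the two nodes)
    is reattached as the new left subtree of the old root."""
    pivot = root.left
    root.left = pivot.right    # keys between pivot.key and root.key
    pivot.right = root         # old root becomes the pivot's right child
    return pivot               # pivot is the new subtree root

# The unbalanced tree from figure 1: 5 -> 3 -> 2 leans to the left.
root = Node(5, left=Node(3, left=Node(2)))
root = rotate_right(root)      # node 3 becomes the root
assert (root.key, root.left.key, root.right.key) == (3, 2, 5)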
Binary trees are most useful as in-memory data
structures. Because of balancing (the need to keep the
depth of all subtrees to a minimum) and low fanout (a
maximum of two pointers per node), they don’t work
well on disk. B-trees allow for storing more than two
pointers per node and work well with block devices by
matching the node size to the page size (e.g., 4 KB). Some
implementations today use larger node sizes, spanning multiple pages.
B-trees have the following properties:
• Sorted. This allows sequential scans and simplifies lookups.
• Self-balancing. There’s no need to balance the tree during insertion and deletion: when a B-tree node is full, it is split in two, and when the occupancy of neighboring nodes falls below a threshold, the nodes are merged. This also means that all leaves remain equidistant from the root.
Lookups
When performing lookups, the search starts at the root
node and follows internal nodes recursively down to the
leaf level. On each level, the search space is reduced to
the child subtree (the range of this subtree includes the
searched value) by following the child pointer. Figure 3
shows a lookup in a B-tree making a single root-to-leaf
pass, following the pointers “between” the two keys, one of
which is greater than (or equal to) the searched term, and
the other of which is less than the searched term. When
a point query is performed, the search is complete after
locating the leaf node. During a range scan, the keys and values of the found leaf are traversed first, and then those of the sibling leaves, until the end of the range is reached.
In terms of complexity, B-trees guarantee log(n) lookup, because finding a key within a node is performed using binary search, shown in figure 4.
FIGURE 4: Binary search of a B-tree
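A rough sketch of such a lookup, assuming a B+-tree-style layout where the values live in the leaves; BTreeNode and lookup are hypothetical names, and real implementations search fixed-size on-disk pages rather than in-memory Python objects.

from bisect import bisect_left, bisect_right

class BTreeNode:
    """A node holds sorted keys. In an internal node, children[i]
    covers keys that sort before keys[i], and children[-1] covers
    keys >= keys[-1]. Leaf nodes hold the values."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys
        self.children = children or []
        self.values = values or []

def lookup(root, key):
    node = root
    # Follow the child pointer "between" the separator that is <= key
    # and the separator that is > key, down to the leaf level.
    while node.children:
        node = node.children[bisect_right(node.keys, key)]
    # Binary search within the leaf node.
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

# A two-level tree: the root separator 7 splits the two leaves.
leaf1 = BTreeNode(keys=[1, 3, 4, 6], values=["a", "b", "c", "d"])
leaf2 = BTreeNode(keys=[7, 8, 10], values=["e", "f", "g"])
root = BTreeNode(keys=[7], children=[leaf1, leaf2])
assert lookup(root, 8) == "f"    # single root-to-leaf pass
assert lookup(root, 5) is None   # key absent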
LSM-TREES
The log-structured merge-tree (LSM-tree) [5] is an immutable, disk-resident, write-optimized data structure. It is most useful in systems where writes are more frequent than lookups that retrieve the records. LSM-trees have been getting more attention because they can eliminate random insertions, updates, and deletions.
Retrieving data requires searching all disk-resident parts of the tree, checking the in-memory table, and merging their contents before returning the result. Figure 5 shows the structure of an LSM-tree: a memory-resident table used for writes. Whenever the memory table is large enough, its sorted contents are written on disk. Reads are served by hitting both the disk-resident tables and the memory table, and merging the results.
FIGURE 5: The structure of an LSM-tree: the memory-resident table is flushed to sorted disk-resident tables
FIGURE 6: SSTable structure: an index block mapping keys to their offsets, and a data block holding the key/value pairs

Disk-resident tables are typically stored as SSTables (sorted string tables), shown in figure 6: the data block holds key/value pairs ordered by key, and the index block maps each key to its offset in the data block. Record timestamps determine the write time for inserts and updates (which are often indistinguishable) and removal time for deletes.
SSTables have some nice properties:
• Point queries (i.e., finding a value by key) can be done quickly by looking up the primary index.
• Scans (i.e., iterating over all key/value pairs in a specified key range) can be done efficiently simply by reading key/value pairs sequentially from the data block.
An SSTable represents a snapshot of all database
operations over a period in time, as the SSTable is created
by the flush process from the memory-resident table that
served as a buffer for operations against the database
state for this period.
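A flush and an index-based point query might look roughly like the following toy sketch: it writes a single-file SSTable as JSON lines with an in-memory index, whereas production systems use compact binary blocks and often Bloom filters [10] to skip tables that cannot contain a key.

import json

def flush(memtable, path):
    """Write the sorted contents of the memory-resident table to disk
    as a toy SSTable: one JSON record per line in the data file, plus
    an index mapping each key to its byte offset in that file."""
    index = {}
    with open(path, "w") as data:
        for key in sorted(memtable):
            index[key] = data.tell()    # offset of this record
            data.write(json.dumps([key, memtable[key]]) + "\n")
    return index

def point_query(path, index, key):
    """Find a value by key via the primary index: seek straight to
    the record instead of scanning the whole file."""
    if key not in index:
        return None
    with open(path) as data:
        data.seek(index[key])
        _, value = json.loads(data.readline())
        return value

memtable = {"key3": "c", "key1": "a", "key2": "b"}   # buffered writes
index = flush(memtable, "sstable-0.json")            # memtable large enough
assert point_query("sstable-0.json", index, "key2") == "b"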
Lookups
Retrieving data requires searching all SSTables on disk,
checking the memory-resident tables, and merging their
contents together before returning the result. The merge
step during the read is required since the searched data
can reside in multiple SSTables.
The merge step is also necessary to ensure that the
deletes and updates work. Deletes in an LSM-tree insert
placeholders (often called tombstones), specifying which
key was marked for deletion. Similarly, an update is just
a record with a bigger timestamp. During the read, the
records that get shadowed by deletes are skipped and not
returned to the client. A similar thing happens with the
updates: out of two records with the same key, only the
one with the later timestamp is returned. Figure 7 shows a merge step reconciling the data stored in separate tables for the same key: as shown here, the record for Alex was updated, and only the version carrying the later timestamp is returned to the client.
LSM-tree maintenance
Since SSTables are immutable, they are written
sequentially and hold no reserved empty space for in-
place modifications. This means insert, update, or delete
operations would require rewriting the whole file. All
operations modifying the database state are “batched” in
the memory-resident table. Over time, the number of disk-
resident tables will grow (data for the same key located
in several files, multiple versions of the same record,
redundant records that got shadowed by deletes), and the
reads will continue getting more expensive.
To reduce the cost of reads, reclaim the space occupied by shadowed records, and reduce the number of disk-resident tables, LSM-trees require a compaction process that reads complete SSTables from disk and merges them.
Because SSTables are sorted by key and compaction works
like merge-sort, this operation is very efficient: records
are read from several sources sequentially, and merged
output can be appended to the results file right away, also
sequentially. One of the advantages of merge-sort is that it
can work efficiently even for merging large files that don’t
fit in memory. The resulting table preserves the sorted order of the original SSTables.
During this process, merged SSTables are discarded
and replaced with their “compacted” versions, shown in
figure 8. Compaction takes multiple SSTables and merges them into one. Some database systems logically group tables of the same size into the same “level” and start the compaction process once the number of tables on a level reaches a threshold.
FIGURE 8: Compaction
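The merge itself can be sketched as a multiway merge over sorted runs; compact here is a hypothetical helper, and dropping tombstones as shown is safe only when the compaction covers every table that could hold older versions of a key.

from heapq import merge
from itertools import groupby
from operator import itemgetter

def compact(*sstables):
    """Merge several SSTables (lists of (key, timestamp, value) records,
    each sorted by key) into one, keeping only the latest version of
    every key, exactly like a merge-sort pass over sorted runs."""
    merged = merge(*sstables, key=itemgetter(0))   # sequential multiway merge
    out = []
    for key, versions in groupby(merged, key=itemgetter(0)):
        latest = max(versions, key=itemgetter(1))  # highest timestamp wins
        if latest[2] is not None:                  # drop tombstoned keys
            out.append(latest)
    return out

t1 = [("alex", 100, "555-0100"), ("john", 100, "555-0199")]
t2 = [("alex", 200, "555-0042"), ("john", 200, None)]  # update + tombstone
assert compact(t1, t2) == [("alex", 200, "555-0042")]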
SUMMARIZING
One of the biggest differences between the B-tree and
LSM-tree data structures is what they optimize for and
what implications these optimizations have.
Let’s compare the properties of B-trees with LSM-trees.
In summary, B-trees have the following properties:
• They are mutable, which allows for in-place updates by introducing some space overhead and a more involved write path, although it does not require complete file rewrites or multisource merges.
• They are read-optimized, meaning they do not require reading from (and subsequently merging) multiple sources, thus simplifying the read path.
• Writes might trigger a cascade of node splits, making some write operations more expensive.
• They are optimized for paged environments (block storage), where byte addressing is not possible.
• Fragmentation, caused by frequent updates, might require additional maintenance and block rewrites.
B-trees, however, usually require less maintenance than
LSM-tree storage.
[FIGURE: The RUM conjecture [2] tradeoff: read-optimized structures (point indexes), write-optimized structures (differential data structures), and memory/space-optimized structures (compressible and approximate data structures), with adaptive data structures between the three extremes]
References
1. Comer, D. 1979. The ubiquitous B-tree. Computing Surveys 11(2): 121-137; http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.6637&rep=rep1&type=pdf.
2. Data Systems Laboratory at Harvard. The RUM Conjecture; http://daslab.seas.harvard.edu/rum-conjecture/.
3. Graefe, G. 2011. Modern B-tree techniques. Foundations and Trends in Databases 3(4): 203-402; http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.219.7269&rep=rep1&type=pdf.
4. MySQL 5.7 Reference Manual. The physical structure of an InnoDB index; https://dev.mysql.com/doc/refman/5.7/en/innodb-physical-structure.html.
5. O’Neil, P., Cheng, E., Gawlick, D., O’Neil, E. 1996. The log-structured merge-tree (LSM-tree). Acta Informatica 33(4): 351-385; http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf.
6. Pelkonen, T., Franklin, S., Teller, J., Cavallaro, P., Huang, Q., Meza, J., Veeraraghavan, K. 2015. Gorilla: a fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment 8(12): 1816-1827; http://www.vldb.org/pvldb/vol8/p1816-teller.pdf.
7. Suzuki, H. 2015-2018. The internals of PostgreSQL; http://www.interdb.jp/pg/pgsql01.html.
8. Apple. HFS Plus volume format; https://developer.apple.com/legacy/library/technotes/tn/tn1150.html#BTrees.
9. Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., Vivier, L. 2007. The new ext4 filesystem: current status and future plans. Proceedings of the Linux Symposium; http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.111.798&rep=rep1&type=pdf.
10. Bloom, B. H. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13(7): 422-426.