Move RFC doc here.

From https://github.com/zenithdb/rfcs/pull/17

Author: Heikki Linnakangas
Date: 2022-03-12 11:34:43 +02:00
parent ee40297758
commit 4bd557ca61

# Why LSM trees?
In general, an LSM tree has the nice property that random updates are
fast, but the disk writes are sequential. When a new file is created,
it is immutable. New files are created and old ones are deleted, but
existing files are never modified. That fits well with storing the
files on S3.
Currently, we create a lot of small files. That is mostly a problem
with S3, because each GET/PUT operation is expensive. The files are
"archived" together into larger checkpoint files before they are
uploaded to S3, but garbage collecting data from the archive files
would be difficult, and we have not implemented it. This proposal
addresses that problem.
# Overview
```
^ LSN
|
| Memtable: +-----------------------------+
| | |
| +-----------------------------+
|
|
| L0: +-----------------------------+
| | |
| +-----------------------------+
|
| +-----------------------------+
| | |
| +-----------------------------+
|
| +-----------------------------+
| | |
| +-----------------------------+
|
| +-----------------------------+
| | |
| +-----------------------------+
|
|
| L1: +-------+ +-----+ +--+ +-+
| | | | | | | | |
| | | | | | | | |
| +-------+ +-----+ +--+ +-+
|
| +----+ +-----+ +--+ +----+
| | | | | | | | |
| | | | | | | | |
| +----+ +-----+ +--+ +----+
|
+--------------------------------------------------------------> Page ID
+---+
| | Layer file
+---+
```
# Memtable
When new WAL arrives, it is first put into the Memtable. Despite the
name, the Memtable is not a purely in-memory data structure. It can
spill to a temporary file on disk if the system is low on memory, and
is accessed through a buffer cache.
If the page server crashes, the Memtable is lost. It is rebuilt by
reprocessing the WAL that is newer than the latest layer in L0.
The size of the Memtable equals the "checkpoint distance": the
amount of WAL that we need to keep in the safekeeper.
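The rebuild-from-WAL idea above can be sketched as follows. This is a minimal illustration, not the actual pageserver code: the `Memtable` type, the `(page ID, LSN)` keying, and the `rebuild` helper are all assumptions made for the example.

```rust
use std::collections::BTreeMap;

// Hypothetical sketch of the Memtable: an ordered map from
// (page ID, LSN) to the corresponding WAL record. The real data
// structure can also spill to disk; this sketch ignores that.
struct Memtable {
    entries: BTreeMap<(u64, u64), Vec<u8>>, // (page_id, lsn) -> WAL record
}

impl Memtable {
    fn new() -> Self {
        Memtable { entries: BTreeMap::new() }
    }

    fn put(&mut self, page_id: u64, lsn: u64, rec: Vec<u8>) {
        self.entries.insert((page_id, lsn), rec);
    }

    /// After a crash, rebuild the Memtable by replaying only the WAL
    /// that is newer than the latest layer already flushed to L0.
    fn rebuild(wal: &[(u64, u64, Vec<u8>)], last_flushed_lsn: u64) -> Self {
        let mut m = Memtable::new();
        for (page_id, lsn, rec) in wal {
            if *lsn > last_flushed_lsn {
                m.put(*page_id, *lsn, rec.clone());
            }
        }
        m
    }
}
```

Because the map is ordered by `(page ID, LSN)`, all versions of one page are adjacent, which is convenient when flushing the Memtable out as a sorted layer file.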
# L0
When the Memtable fills up, it is written out to a new file in L0. The
files are immutable; when a file is created, it is never
modified. Each file in L0 is roughly 1 GB in size (*). Like the
Memtable, each file in L0 covers the whole key range.
When enough files have been accumulated in L0, compaction
starts. Compaction processes all the files in L0 and reshuffles the
data to create a new set of files in L1.
(*) except in corner cases like if we want to shut down the page
server and want to flush out the memtable to disk even though it's not
full yet.
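Compaction of L0 into L1 can be sketched as a merge of sorted runs: each L0 file covers the whole key range, so all of them are combined into one key-ordered stream, which is then cut into L1 files. A minimal sketch, assuming `(page ID, LSN)` keys that are unique across files; the types and the `merge_l0` name are illustrative, not the actual implementation:

```rust
use std::collections::BTreeMap;

// (page ID, LSN) uniquely identifies one page version, so entries from
// different L0 files never collide on the same key.
type Key = (u64, u64);

/// Merge all L0 files (each a sorted run over the whole key range)
/// into a single key-ordered stream. In the real system this stream
/// would then be cut into ~1 GB L1 files partitioned by key range.
fn merge_l0(l0_files: Vec<BTreeMap<Key, Vec<u8>>>) -> BTreeMap<Key, Vec<u8>> {
    let mut merged = BTreeMap::new();
    for file in l0_files {
        merged.extend(file);
    }
    merged
}
```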
# L1
L1 consists of ~ 1 GB files like L0. But each file covers only part of
the overall key space, and a larger range of LSNs. This speeds up
searches. When you're looking for a given page, you need to check all
the files in L0 to see if they contain a version of the requested
page. But in L1, you only need to check the files whose key range
covers the requested page.
Partitioning by key range also helps with garbage collection. If only a
part of the database is updated, we will accumulate more files for
the hot part in L1, and old files can be removed without affecting the
cold part.
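The search order described above can be sketched as candidate selection over layer descriptors: every L0 file is a candidate, but L1 files are filtered by key range. The `Layer` struct and `layers_to_check` function are hypothetical names for illustration only.

```rust
// Hypothetical descriptor: the key range (range of page IDs) that a
// layer file covers. L0 files cover the whole key range; L1 files
// cover only a part of it.
struct Layer {
    key_start: u64, // inclusive
    key_end: u64,   // exclusive
}

/// Collect the layer files that a search for `page` must examine:
/// all of L0, plus the L1 files whose key range covers the page.
fn layers_to_check<'a>(l0: &'a [Layer], l1: &'a [Layer], page: u64) -> Vec<&'a Layer> {
    let mut result: Vec<&Layer> = l0.iter().collect();
    result.extend(
        l1.iter()
            .filter(|l| l.key_start <= page && page < l.key_end),
    );
    result
}
```

This also makes the cost model visible: the number of L0 files adds directly to every lookup, which is why compaction into key-partitioned L1 files speeds up searches.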
# Image layers
So far, we've only talked about delta layers. In addition to the delta
layers, we create image layers, when "enough" WAL has been accumulated
for some part of the database. Each image layer covers a 1 GB range of
key space. It contains images of the pages at a single LSN, a snapshot
if you will.
The exact heuristic for what "enough" means is not clear yet. Maybe
create a new image layer when 10 GB of WAL has been accumulated for a
1 GB segment.
The image layers limit the number of layers that a search needs to
check. That puts a cap on read latency, and it also allows garbage
collecting layers that are older than the GC horizon.
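The tentative heuristic above (a new image layer once ~10 GB of WAL has accumulated for a 1 GB segment) reduces to a single threshold check. The constants follow the numbers in the text, but they are explicitly not settled values:

```rust
// Assumed constants from the text: 1 GB key-space segments, and a new
// image layer once ~10x the segment size in WAL has accumulated since
// the last image. Neither number is a tuned or final value.
const SEGMENT_SIZE: u64 = 1 << 30; // 1 GB
const WAL_FACTOR: u64 = 10;

/// Should a new image layer be created for a segment, given how much
/// WAL has accumulated for it since its last image layer?
fn needs_image_layer(wal_bytes_since_last_image: u64) -> bool {
    wal_bytes_since_last_image >= WAL_FACTOR * SEGMENT_SIZE
}
```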
# Partitioning scheme
When compaction happens and creates a new set of files in L1, how do
we partition the data into the files?
- Goal is that each file is ~ 1 GB in size
- Try to align partition boundaries with relation boundaries. (See [1]
  for how PebblesDB does this, and why that's important)
- Greedy algorithm
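The three goals above can be combined into one greedy pass. A minimal sketch, assuming the compaction input is a key-ordered list of relations with known sizes: a new file is started whenever the current one would exceed the ~1 GB target, so cuts always land on relation boundaries, and a relation larger than the target becomes a file of its own. Purely illustrative, not the proposed implementation.

```rust
const TARGET_FILE_SIZE: u64 = 1 << 30; // ~1 GB per L1 file

/// Greedily group relations into L1 files.
/// Input: (relation ID, size in bytes), ordered by key.
/// Output: one group of relation IDs per L1 file.
fn partition(relations: &[(u32, u64)]) -> Vec<Vec<u32>> {
    let mut files = Vec::new();
    let mut current = Vec::new();
    let mut current_size = 0u64;
    for &(rel, size) in relations {
        // Cut before this relation if adding it would overshoot the
        // target (never cut in the middle of a relation).
        if !current.is_empty() && current_size + size > TARGET_FILE_SIZE {
            files.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current.push(rel);
        current_size += size;
    }
    if !current.is_empty() {
        files.push(current);
    }
    files
}
```

A greedy pass like this only looks at one boundary at a time; smarter variants could balance file sizes globally, but the simple version already satisfies the "~1 GB, aligned at relation boundaries" goals.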
# Next steps
- Allow delta layers to cover a range of keys instead of a single segment.
- Implement a two-level LSM tree (or three-level, if you count the
  "memtable"), by adding L0.
# Additional Reading
[1] Paper on PebblesDB and how it does partitioning.
https://www.cs.utexas.edu/~rak/papers/sosp17-pebblesdb.pdf