diff --git a/docs/rfcs/014-storage-lsm.md b/docs/rfcs/014-storage-lsm.md new file mode 100644 index 0000000000..f91ccda6c0 --- /dev/null +++ b/docs/rfcs/014-storage-lsm.md @@ -0,0 +1,146 @@ +# Why LSM trees? + +In general, an LSM tree has the nice property that random updates are +fast, but the disk writes are sequential. When a new file is created, +it is immutable. New files are created and old ones are deleted, but +existing files are never modified. That fits well with storing the +files on S3. + +Currently, we create a lot of small files. That is mostly a problem +with S3, because each GET/PUT operation is expensive. Currently, the +files "archived" together into larger checkpoint files before they're +uploaded to S3, but garbage collecting data from the archive files +would be difficult and we have not implemented it. This proposal +addresses that problem. + + +# Overview + + +``` +^ LSN +| +| Memtable: +-----------------------------+ +| | | +| +-----------------------------+ +| +| +| L0: +-----------------------------+ +| | | +| +-----------------------------+ +| +| +-----------------------------+ +| | | +| +-----------------------------+ +| +| +-----------------------------+ +| | | +| +-----------------------------+ +| +| +-----------------------------+ +| | | +| +-----------------------------+ +| +| +| L1: +-------+ +-----+ +--+ +-+ +| | | | | | | | | +| | | | | | | | | +| +-------+ +-----+ +--+ +-+ +| +| +----+ +-----+ +--+ +----+ +| | | | | | | | | +| | | | | | | | | +| +----+ +-----+ +--+ +----+ +| ++--------------------------------------------------------------> Page ID + + ++---+ +| | Layer file ++---+ +``` + + +# Memtable + +When new WAL arrives, it is first put into the Memtable. Despite the +name, the Memtable is not a purely in-memory data structure. It can +spill to a temporary file on disk if the system is low on memory, and +is accessed through a buffer cache. + +If the page server crashes, the Memtable is lost. It is rebuilt by +processing again the WAL that's newer than the latest layer in L0. + +The size of the Memtable is equal to the "checkpoint distance", or the +amount of WAL that we need to keep in the safekeeper. + +# L0 + +When the Memtable fills up, it is written out to a new file in L0. The +files are immutable; when a file is created, it is never +modified. Each file in L0 is roughly 1 GB in size (*). Like the +Memtable, each file in L0 covers the whole key range. + +When enough files have been accumulated in L0, compaction +starts. Compaction processes all the files in L0 and reshuffles the +data to create a new set of files in L1. + + +(*) except in corner cases like if we want to shut down the page +server and want to flush out the memtable to disk even though it's not +full yet. + + +# L1 + +L1 consists of ~ 1 GB files like L0. But each file covers only part of +the overall key space, and a larger range of LSNs. This speeds up +searches. When you're looking for a given page, you need to check all +the files in L0, to see if they contain a page version for the requested +page. But in L1, you only need to check the files whose key range covers +the requested page. + +Partitioning by key range also helps with garbage collection. If only a +part of the database is updated, we will accumulate more files for +the hot part in L1, and old files can be removed without affecting the +cold part. + + +# Image layers + +So far, we've only talked about delta layers. In addition to the delta +layers, we create image layers, when "enough" WAL has been accumulated +for some part of the database. Each image layer covers a 1 GB range of +key space. It contains images of the pages at a single LSN, a snapshot +if you will. + +The exact heuristic for what "enough" means is not clear yet. Maybe +create a new image layer when 10 GB of WAL has been accumulated for a +1 GB segment. + +The image layers limit the number of layers that a search needs to +check. That put a cap on read latency, and it also allows garbage +collecting layers that are older than the GC horizon. + + +# Partitioning scheme + +When compaction happens and creates a new set of files in L1, how do +we partition the data into the files? + +- Goal is that each file is ~ 1 GB in size +- Try to match partition boundaries at relation boundaries. (See [1] + for how PebblesDB does this, and for why that's important) +- Greedy algorithm + +# Next steps + +- Allow delta layers to cover a range keys instead of a single segment. + +- Implement a two-level LSM tree (or three-leveled, if you count the +"memtable"), by adding L0. + +# Additional Reading + +[1] Paper on PebblesDB and how it does partitioning. +https://www.cs.utexas.edu/~rak/papers/sosp17-pebblesdb.pdf