neon/docs/rfcs/014-storage-lsm.md

Commit 07342f7519: Major storage format rewrite
This is a backwards-incompatible change. The new pageserver cannot
read repositories created with an old pageserver binary, or vice
versa.

Simplify Repository to a value-store
------------------------------------

Move the responsibility of tracking relation metadata, such as which
relations exist and what their sizes are, from Repository to a new
module, pgdatadir_mapping.rs. The interface to Repository is now a set
of simple key-value PUT/GET operations.

It's still not just any old key-value store, though. A Repository is
still responsible for handling branching, and every GET operation comes
with an LSN.
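
As a rough illustration, the interface now boils down to something like
the sketch below. The trait name, method signatures, and types are
assumptions for illustration, not the actual pageserver code.

```rust
// A minimal sketch of the simplified Repository interface; names and
// signatures are illustrative only.
type Lsn = u64;
type Value = Vec<u8>;

/// Stand-in for the multi-field Key struct described in the next section.
type Key = u128;

pub trait Repository {
    /// Store a new value (a page image or a WAL record) for a key at a given LSN.
    fn put(&mut self, key: Key, lsn: Lsn, value: Value) -> Result<(), String>;

    /// Look up the value of a key as it was at the given LSN.
    fn get(&self, key: Key, lsn: Lsn) -> Result<Value, String>;

    /// Branching is still the repository's responsibility.
    fn branch_at(&mut self, ancestor_lsn: Lsn) -> Result<(), String>;
}
```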

Mapping from Postgres data directory to keys/values
---------------------------------------------------

All the data is now stored in the key-value store. The
'pgdatadir_mapping.rs' module handles the mapping from PostgreSQL
objects, like relation pages and SLRUs, to key-value pairs.

The key to the Repository key-value store is a Key struct, which
consists of a few integer fields. It's wide enough to store a full
RelFileNode, fork and block number, and to distinguish those from
metadata keys.
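
For illustration, the Key could be laid out along these lines. The field
names, widths, and encoding below are assumptions, not the definitive
definition.

```rust
// Hypothetical layout of the Key struct; wide enough for a full
// RelFileNode, fork and block number, plus a tag that separates relation
// data from metadata keys.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct Key {
    pub kind: u8,     // tag distinguishing relation data from metadata keys
    pub spcnode: u32, // tablespace OID (part of RelFileNode)
    pub dbnode: u32,  // database OID (part of RelFileNode)
    pub relnode: u32, // relation OID (part of RelFileNode)
    pub forknum: u8,  // fork number (main, FSM, VM, ...)
    pub blknum: u32,  // block number within the fork
}
```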

'pgdatadir_mapping.rs' is also responsible for maintaining a
"partitioning" of the keyspace. Partitioning means splitting the
keyspace so that each partition holds a roughly equal number of keys.
The partitioning is used when new image layer files are created, so
that each image layer file is roughly the same size.

The partitioning is also responsible for reclaiming space used by
deleted keys. The Repository implementation doesn't have any explicit
support for deleting keys. Instead, the deleted keys are simply
omitted from the partitioning, and when a new image layer is created,
the omitted keys are not copied over to the new image layer. We might
want to implement tombstone keys in the future, to reclaim space
faster, but this will work for now.
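
A sketch of the idea, assuming the live keys are available as a sorted
list; the real code works on key ranges and layer metadata, and the names
below are made up.

```rust
use std::ops::Range;

/// Split a sorted list of live keys into partitions holding a roughly equal
/// number of keys. Deleted keys are simply absent from `keys`, so they drop
/// out of the partitioning and are never copied into new image layers.
fn partition_keyspace(keys: &[u128], keys_per_partition: usize) -> Vec<Range<u128>> {
    assert!(keys_per_partition > 0);
    keys.chunks(keys_per_partition)
        .map(|chunk| *chunk.first().unwrap()..*chunk.last().unwrap() + 1)
        .collect()
}
```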

Changes to low-level layer file code
------------------------------------

The concept of a "segment" is gone. Each layer file can now store an
arbitrary range of Keys.

Checkpointing, compaction
-------------------------

The background tasks are somewhat different now. Whenever
checkpoint_distance is reached, the WAL receiver thread "freezes" the
current in-memory layer, and creates a new one. This is a quick
operation and doesn't perform any I/O yet. It then launches a
background "layer flushing thread" to write the frozen layer to disk,
as a new L0 delta layer. This mechanism takes care of durability. It
replaces the checkpointing thread.
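
In outline, the flow looks roughly like this; the types and function
names are made up for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Illustrative stand-ins for the real layer types.
struct InMemoryLayer { /* WAL buffered since the last freeze */ }
struct FrozenLayer { /* immutable snapshot of an InMemoryLayer */ }

/// Called from the WAL receiver thread after ingesting WAL.
fn maybe_checkpoint(
    open_layer: &Arc<Mutex<InMemoryLayer>>,
    wal_since_last_freeze: u64,
    checkpoint_distance: u64,
) {
    if wal_since_last_freeze >= checkpoint_distance {
        // Freezing is quick: swap in a fresh in-memory layer, no I/O yet.
        let frozen = freeze_and_replace(open_layer);
        // The expensive part, writing the new L0 delta layer, happens in the
        // background "layer flushing thread".
        thread::spawn(move || flush_to_l0(frozen));
    }
}

fn freeze_and_replace(_open: &Arc<Mutex<InMemoryLayer>>) -> FrozenLayer {
    FrozenLayer {}
}

fn flush_to_l0(_frozen: FrozenLayer) {
    // write a new immutable L0 delta layer file to disk
}
```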

Compaction is a new background operation that takes a bunch of L0
delta layers, and reshuffles the data in them. It runs in a separate
compaction thread.

Deployment
----------

This also contains changes to the Ansible scripts that make it possible
to run multiple different pageservers at the same time in the staging
environment. We will use that to keep an old version of the pageserver
running, for clusters created with the old version, alongside a
pageserver running the new binary.

Author: Heikki Linnakangas
Author: Konstantin Knizhnik <knizhnik@zenith.tech>
Author: Andrey Taranik <andrey@zenith.tech>
Reviewed-by: Matthias Van De Meent <matthias@zenith.tech>
Reviewed-by: Bojan Serafimov <bojan@zenith.tech>
Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Reviewed-by: Anton Shyrabokau <antons@zenith.tech>
Reviewed-by: Dhammika Pathirana <dham@zenith.tech>
Reviewed-by: Kirill Bulatov <kirill@zenith.tech>
Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech>
Reviewed-by: Alexey Kondratov <alexey@zenith.tech>

Why LSM trees?

In general, an LSM tree has the nice property that random updates are fast, but the disk writes are sequential. Files are immutable: new files are created and old ones are deleted, but an existing file is never modified. That fits well with storing the files on S3.

Currently, we create a lot of small files. That is a problem mostly with S3, because each GET/PUT operation is expensive, and a LIST operation only returns 1000 objects at a time and isn't free either. To alleviate that, the files are currently "archived" together into larger checkpoint files before they're uploaded to S3, but garbage collecting data from the archive files would be difficult, and we have not implemented it. This proposal addresses that problem.

Overview

^ LSN
|
|      Memtable:     +-----------------------------+
|                    |                             |
|                    +-----------------------------+
|
|
|            L0:     +-----------------------------+
|                    |                             |
|                    +-----------------------------+
|
|                    +-----------------------------+
|                    |                             |
|                    +-----------------------------+
|
|                    +-----------------------------+
|                    |                             |
|                    +-----------------------------+
|
|                    +-----------------------------+
|                    |                             |
|                    +-----------------------------+
|
|
|           L1:      +-------+ +-----+ +--+  +-+
|                    |       | |     | |  |  | |
|                    |       | |     | |  |  | |
|                    +-------+ +-----+ +--+  +-+
|
|                       +----+ +-----+ +--+  +----+
|                       |    | |     | |  |  |    |
|                       |    | |     | |  |  |    |
|                       +----+ +-----+ +--+  +----+
|
+--------------------------------------------------------------> Page ID


+---+
|   |   Layer file
+---+

Memtable

When new WAL arrives, it is first put into the Memtable. Despite the name, the Memtable is not a purely in-memory data structure. It can spill to a temporary file on disk if the system is low on memory, and is accessed through a buffer cache.

If the page server crashes, the Memtable is lost. It is rebuilt by re-processing the WAL that's newer than the latest layer in L0.

The size of the Memtable is configured by the "checkpoint distance" setting. Because anything that hasn't been flushed to disk and uploaded to S3 yet needs to be kept in the safekeeper, the "checkpoint distance" also determines the amount of WAL that needs to be kept in the safekeeper.

L0

When the Memtable fills up, it is written out to a new file in L0. The files are immutable; when a file is created, it is never modified. Each file in L0 is roughly 1 GB in size (*). Like the Memtable, each file in L0 covers the whole key range.

When enough files have been accumulated in L0, compaction starts. Compaction processes all the files in L0 and reshuffles the data to create a new set of files in L1.

(*) Except in corner cases, for example when we want to shut down the page server and have to flush the Memtable to disk even though it's not full yet.

L1

L1 consists of ~ 1 GB files like L0. But each file covers only part of the overall key space, and a larger range of LSNs. This speeds up searches. When you're looking for a given page, you need to check all the files in L0, to see if they contain a page version for the requested page. But in L1, you only need to check the files whose key range covers the requested page. This is particularly important at cold start, when checking a file means downloading it from S3.
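
To illustrate the difference, a lookup only has to consider layers whose key range covers the requested page; something along these lines, with types and names invented for the example.

```rust
use std::ops::Range;

/// Minimal stand-in for a delta layer file's metadata.
struct LayerFile {
    key_range: Range<u128>, // an L0 layer spans the whole key space
    lsn_range: Range<u64>,
}

/// Return the layers that could hold a version of `key` at or below `lsn`.
/// Every L0 file qualifies (full key range); in L1 only the files whose key
/// range covers the page need to be checked -- or downloaded from S3.
fn candidate_layers<'a>(layers: &'a [LayerFile], key: u128, lsn: u64) -> Vec<&'a LayerFile> {
    layers
        .iter()
        .filter(|l| l.key_range.contains(&key) && l.lsn_range.start <= lsn)
        .collect()
}
```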

Partitioning by key range also helps with garbage collection. If only a part of the database is updated, we will accumulate more files for the hot part in L1, and old files can be removed without affecting the cold part.

Image layers

So far, we've only talked about delta layers. In addition to the delta layers, we create image layers when "enough" WAL has been accumulated for some part of the database. Each image layer covers a 1 GB range of the key space. It contains images of the pages at a single LSN, a snapshot if you will.

The exact heuristic for what "enough" means is not clear yet. Maybe create a new image layer when 10 GB of WAL has been accumulated for a 1 GB segment.
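
As a placeholder for that heuristic, the check might look something like this; the constants are assumptions, not decided values.

```rust
/// Size of the key-space partition covered by one image layer.
const PARTITION_SIZE: u64 = 1024 * 1024 * 1024; // 1 GB

/// Hypothetical threshold: materialize a new image layer for a partition
/// once roughly 10x its size in WAL has accumulated for it.
const IMAGE_LAYER_WAL_THRESHOLD: u64 = 10 * PARTITION_SIZE;

fn needs_new_image_layer(wal_bytes_for_partition: u64) -> bool {
    wal_bytes_for_partition >= IMAGE_LAYER_WAL_THRESHOLD
}
```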

The image layers limit the number of layers that a search needs to check. That puts a cap on read latency, and it also allows garbage collecting layers that are older than the GC horizon.

Partitioning scheme

When compaction happens and creates a new set of files in L1, how do we partition the data into the files?

  • The goal is that each file is ~1 GB in size
  • Try to align partition boundaries with relation boundaries. (See [1] for how PebblesDB does this, and why that's important)
  • Greedy algorithm (see the sketch below)
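
A minimal sketch of such a greedy pass, assuming the input is a list of relations and their sizes in key order. The names and the exact policy are illustrative; the real compaction code works on key ranges and layer files rather than relation names.

```rust
/// Target size for each new layer file produced by compaction.
const TARGET_FILE_SIZE: u64 = 1024 * 1024 * 1024; // ~1 GB

/// Greedily pack relations into partitions of up to ~1 GB, placing partition
/// boundaries only at relation boundaries. A relation larger than the target
/// simply gets a partition of its own in this sketch.
fn plan_partitions(relation_sizes: &[(String, u64)]) -> Vec<Vec<String>> {
    let mut partitions = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut current_size: u64 = 0;
    for (rel, size) in relation_sizes {
        if current_size > 0 && current_size + size > TARGET_FILE_SIZE {
            partitions.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current.push(rel.clone());
        current_size += size;
    }
    if !current.is_empty() {
        partitions.push(current);
    }
    partitions
}
```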

Additional Reading

[1] Paper on PebblesDB and how it does partitioning. https://www.cs.utexas.edu/~rak/papers/sosp17-pebblesdb.pdf