rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-24 08:30:37 +00:00

Heikki Linnakangas 07342f7519 Major storage format rewrite.

This is a backwards-incompatible change. The new pageserver cannot
read repositories created with an old pageserver binary, or vice
versa.

Simplify Repository to a value-store
------------------------------------

Move the responsibility of tracking relation metadata, like which
relations exist and what their sizes are, from Repository to a new
module, pgdatadir_mapping.rs. The interface to Repository now consists
of simple key-value PUT/GET operations.

It's still not any old key-value store though. A Repository is still
responsible for handling branching, and every GET operation comes
with an LSN.
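
A minimal sketch of what "every GET comes with an LSN" can look like, assuming plain u64 stand-ins for the key and LSN types and ignoring branching entirely. The names VersionedStore, put and get are hypothetical and are not the Repository API:

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins; the real Key and Lsn types are richer than this.
type Key = u64;
type Lsn = u64;

/// Toy versioned key-value store: every PUT is tagged with an LSN and
/// every GET asks for the value as of a given LSN.
#[derive(Default)]
struct VersionedStore {
    // Ordered by (key, lsn), so all versions of one key are adjacent.
    entries: BTreeMap<(Key, Lsn), Vec<u8>>,
}

impl VersionedStore {
    fn put(&mut self, key: Key, lsn: Lsn, value: Vec<u8>) {
        self.entries.insert((key, lsn), value);
    }

    /// Returns the newest value of `key` written at or before `lsn`,
    /// or None if the key had no version yet at that LSN.
    fn get(&self, key: Key, lsn: Lsn) -> Option<&Vec<u8>> {
        self.entries
            .range((key, 0)..=(key, lsn))
            .next_back()
            .map(|(_, value)| value)
    }
}
```

With versions written at LSN 10 and LSN 20, a GET at LSN 15 sees the LSN 10 version, and a GET at LSN 5 sees nothing.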

Mapping from Postgres data directory to keys/values
---------------------------------------------------

All the data is now stored in the key-value store. The
'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects
like relation pages and SLRUs, to key-value pairs.

The key to the Repository key-value store is a Key struct, which
consists of a few integer fields. It's wide enough to store a full
RelFileNode, fork and block number, and to distinguish those from
metadata keys.

'pgdatadir_mapping.rs' is also responsible for maintaining a
"partitioning" of the keyspace. Partitioning means splitting the
keyspace so that each partition holds a roughly equal number of keys.
The partitioning is used when new image layer files are created, so
that each image layer file is roughly the same size.

The partitioning is also responsible for reclaiming space used by
deleted keys. The Repository implementation doesn't have any explicit
support for deleting keys. Instead, the deleted keys are simply
omitted from the partitioning, and when a new image layer is created,
the omitted keys are not copied over to the new image layer. We might
want to implement tombstone keys in the future, to reclaim space
faster, but this will work for now.
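
The partitioning idea can be sketched roughly as follows, assuming keys are plain u64 values and that the caller passes only the live (non-deleted) keys in sorted order. The function name partition_keyspace and its parameters are hypothetical, not taken from pgdatadir_mapping.rs:

```rust
use std::ops::Range;

/// Splits the sorted list of live keys into contiguous key ranges that
/// each cover at most `keys_per_partition` live keys. Deleted keys are
/// simply absent from `live_keys`, so no partition is sized for them.
/// Assumes no key equals u64::MAX (the exclusive range end would overflow).
fn partition_keyspace(live_keys: &[u64], keys_per_partition: usize) -> Vec<Range<u64>> {
    assert!(keys_per_partition > 0, "partition size must be positive");
    live_keys
        .chunks(keys_per_partition)
        // chunks() never yields an empty slice, so indexing is safe here.
        .map(|chunk| chunk[0]..chunk[chunk.len() - 1] + 1)
        .collect()
}
```

An image layer writer working from such ranges would copy only the keys it is handed, which is how omitted keys stop taking up space in this sketch.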

Changes to low-level layer file code
------------------------------------

The concept of a "segment" is gone. Each layer file can now store an
arbitrary range of Keys.

Checkpointing, compaction
-------------------------

The background tasks are somewhat different now. Whenever
checkpoint_distance is reached, the WAL receiver thread "freezes" the
current in-memory layer, and creates a new one. This is a quick
operation and doesn't perform any I/O yet. It then launches a
background "layer flushing thread" to write the frozen layer to disk,
as a new L0 delta layer. This mechanism takes care of durability. It
replaces the checkpointing thread.
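
A rough, single-threaded sketch of the freeze step, assuming checkpoint_distance is measured in buffered value bytes and that a queue stands in for the hand-off to the flushing thread. All type and method names here are hypothetical:

```rust
use std::collections::{BTreeMap, VecDeque};

/// Hypothetical in-memory layer: (key, lsn) -> value, plus a size counter.
#[derive(Default)]
struct InMemoryLayer {
    entries: BTreeMap<(u64, u64), Vec<u8>>,
    bytes: usize,
}

struct Timeline {
    open: InMemoryLayer,
    // Frozen layers waiting to be written out as L0 delta layers.
    frozen: VecDeque<InMemoryLayer>,
    checkpoint_distance: usize,
}

impl Timeline {
    fn new(checkpoint_distance: usize) -> Self {
        Timeline {
            open: InMemoryLayer::default(),
            frozen: VecDeque::new(),
            checkpoint_distance,
        }
    }

    fn put(&mut self, key: u64, lsn: u64, value: Vec<u8>) {
        self.open.bytes += value.len();
        self.open.entries.insert((key, lsn), value);
        if self.open.bytes >= self.checkpoint_distance {
            self.freeze();
        }
    }

    /// Swaps in a fresh open layer and queues the full one. No I/O here.
    fn freeze(&mut self) {
        let full = std::mem::take(&mut self.open);
        self.frozen.push_back(full);
    }

    /// Stand-in for the flushing thread: takes the oldest frozen layer and
    /// returns its sorted contents, which a real flusher would write to disk.
    fn flush_one(&mut self) -> Option<Vec<((u64, u64), Vec<u8>)>> {
        self.frozen
            .pop_front()
            .map(|layer| layer.entries.into_iter().collect())
    }
}
```

The point of the split is that put() never waits on disk: freezing is only a pointer swap, and the expensive write happens later in flush_one().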

Compaction is a new background operation that takes a bunch of L0
delta layers, and reshuffles the data in them. It runs in a separate
compaction thread.

Deployment
----------

This also contains changes to the ansible scripts that enable having
multiple different pageservers running at the same time in the staging
environment. We will use that to keep an old version of the pageserver
running, for clusters created with the old version, alongside a new
pageserver running the new binary.

Author: Heikki Linnakangas
Author: Konstantin Knizhnik <knizhnik@zenith.tech>
Author: Andrey Taranik <andrey@zenith.tech>
Reviewed-by: Matthias Van De Meent <matthias@zenith.tech>
Reviewed-by: Bojan Serafimov <bojan@zenith.tech>
Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Reviewed-by: Anton Shyrabokau <antons@zenith.tech>
Reviewed-by: Dhammika Pathirana <dham@zenith.tech>
Reviewed-by: Kirill Bulatov <kirill@zenith.tech>
Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech>
Reviewed-by: Alexey Kondratov <alexey@zenith.tech>

2022-03-28 05:41:15 -05:00

4.3 KiB

Non-implementation details

This document describes the current state of the backup system in pageserver, its existing limitations and concerns, why some things are done the way they are, and the future development plans. A detailed description of how the synchronization works and how it fits into the rest of the pageserver can be found in the storage module and its submodules. Ideally, this document should disappear once the current implementation concerns are mitigated, with the remaining useful bits of knowledge moved into rustdocs.

Approach

Backup functionality is a new component that appeared well after the core DB functionality was implemented. Pageserver layer functionality is also quite volatile at the moment, so there's a risk that its local file management changes over time.

To avoid adding more chaos to that, backup functionality is currently designed as a relatively standalone component, with the majority of its logic placed in a standalone async loop. This way, the backups are managed in the background without directly affecting other pageserver parts: the backup and restoration process may lag behind, but it eventually catches up with reality. To track that, a set of prometheus metrics is exposed from pageserver.

What's done

Current implementation:

provides remote storage wrappers for AWS S3 and local FS
synchronizes the differences between local timelines and remote state as fast as possible
uploads new layer files
downloads and registers timelines that are found on the remote storage but missing locally, if they are requested via pageserver (e.g. http api, gc)
uses compression when dealing with files, for better S3 usage
maintains an index of what's stored remotely
evicts failing tasks and stops the corresponding timelines
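
To illustrate the first item, here is a minimal sketch of a storage wrapper with a local FS backend. It assumes a synchronous, flat namespace for brevity; the trait name RemoteStorage, its methods and the LocalFs struct are hypothetical here and should not be read as the pageserver's actual interface:

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical common interface that both an S3 and a local FS
/// backend could implement.
trait RemoteStorage {
    fn upload(&self, name: &str, data: &[u8]) -> io::Result<()>;
    fn download(&self, name: &str) -> io::Result<Vec<u8>>;
    fn list(&self) -> io::Result<Vec<String>>;
}

/// Local FS backend: every "remote" object is a file under `root`.
struct LocalFs {
    root: PathBuf,
}

impl RemoteStorage for LocalFs {
    fn upload(&self, name: &str, data: &[u8]) -> io::Result<()> {
        fs::create_dir_all(&self.root)?;
        fs::write(self.root.join(name), data)
    }

    fn download(&self, name: &str) -> io::Result<Vec<u8>> {
        fs::read(self.root.join(name))
    }

    fn list(&self) -> io::Result<Vec<String>> {
        let mut names = Vec::new();
        for entry in fs::read_dir(&self.root)? {
            names.push(entry?.file_name().to_string_lossy().into_owned());
        }
        // Sort so that callers get a stable order.
        names.sort();
        Ok(names)
    }
}
```

Keeping the sync loop written against such a trait is what lets the local FS backend stand in for S3 in tests.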

The tasks are delayed with every retry and the retries are capped, to avoid poisonous tasks. After any task eviction, or any error at startup checks (e.g. obviously different and wrong local and remote states for the same timeline), the timeline has to be stopped from submitting further checkpoint upload tasks, which is done along with the corresponding timeline status change.
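
A minimal sketch of the delay-and-cap policy described above. The exponential growth, the constants MAX_RETRIES and BASE_DELAY, and the names SyncTask, RetryDecision and on_failure are all assumptions made for illustration; the text only says that delays grow with every retry and that retries are capped:

```rust
use std::time::Duration;

// Hypothetical values chosen for the example only.
const MAX_RETRIES: u32 = 5;
const BASE_DELAY: Duration = Duration::from_millis(500);

struct SyncTask {
    retries: u32,
}

#[derive(Debug, PartialEq)]
enum RetryDecision {
    /// Put the task back into the queue after this delay.
    RetryAfter(Duration),
    /// Give up: evict the task and stop the corresponding timeline.
    Evict,
}

/// Called when a task fails. Doubles the delay on every retry
/// (an assumed growth rule) and evicts the task once the cap is exceeded.
fn on_failure(task: &mut SyncTask) -> RetryDecision {
    task.retries += 1;
    if task.retries > MAX_RETRIES {
        return RetryDecision::Evict;
    }
    let factor = 2u32.pow(task.retries - 1);
    RetryDecision::RetryAfter(BASE_DELAY * factor)
}
```

With these constants a task is retried after 0.5 s, 1 s, 2 s, 4 s and 8 s, and is evicted on the sixth failure.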

No real optimisation or performance testing has been done yet; the feature is disabled by default and gets polished over time. The plan is to deal with all questions that are currently on and prepare the feature to be enabled by default in cloud environments.

Peculiarities

As mentioned, the backup component is rather new and under development currently, so not all things are done properly from the start. Here's the list of known compromises with comments:

Remote storage file model is currently a custom archive format that cannot be deserialized without our own Rust code (including serde). We also don't optimize the archiving and pack every timeline checkpoint separately, so the size of the resulting blob that ends up on S3 can be arbitrary. Still, it's a single blob, which is way better than storing ~780 small files separately.
Archive index restoration requires reading every blob's head. This could be avoided by a background thread/future storing the serialized index in the remote storage.
no proper file comparison

No file checksum assertion is done currently, but it should be (AWS S3 returns file checksums during the list operation).

sad rust-s3 api

rust-s3 is not very pleasant to use:

it returns anyhow::Result, which makes it hard to distinguish "missing file" cases from "no connection" ones, for instance
at least one function in its API that we need (get_object_stream) is declared async yet blocks (!), see details here
it's a prerelease library with unclear maintenance status
it's noisy on debug level

But it's already used in the project, so for now it's reused to avoid bloating the dependency tree. Based on previous evaluation, even rusoto-s3 could be a better choice over this library, but needs further benchmarking.

gc is ignored

So far, we don't adjust the remote storage based on the results of the GC thread loop; only the checkpointer loop affects the remote storage. The index module could be used as a base for implementing a deferred GC mechanism: a "defragmentation" that repacks archives into new ones after GC is done removing the files from the archives.
