This is a backwards-incompatible change. The new pageserver cannot read repositories created with an old pageserver binary, or vice versa.

Simplify Repository to a value-store
------------------------------------

Move the responsibility of tracking relation metadata, like which relations exist and what their sizes are, from Repository to a new module, pgdatadir_mapping.rs. The interface to Repository now consists of simple key-value PUT/GET operations.

It's still not any old key-value store though. A Repository is still responsible for handling branching, and every GET operation comes with an LSN.

Mapping from Postgres data directory to keys/values
---------------------------------------------------

All the data is now stored in the key-value store. The 'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects like relation pages and SLRUs, to key-value pairs.

The key to the Repository key-value store is a Key struct, which consists of a few integer fields. It's wide enough to store a full RelFileNode, fork and block number, and to distinguish those from metadata keys.

'pgdatadir_mapping.rs' is also responsible for maintaining a "partitioning" of the keyspace. Partitioning means splitting the keyspace so that each partition holds a roughly equal number of keys. The partitioning is used when new image layer files are created, so that each image layer file is roughly the same size.

The partitioning is also responsible for reclaiming space used by deleted keys. The Repository implementation doesn't have any explicit support for deleting keys. Instead, the deleted keys are simply omitted from the partitioning, and when a new image layer is created, the omitted keys are not copied over to the new image layer. We might want to implement tombstone keys in the future, to reclaim space faster, but this will work for now.

Changes to low-level layer file code
------------------------------------

The concept of a "segment" is gone. Each layer file can now store an arbitrary range of Keys.

Checkpointing, compaction
-------------------------

The background tasks are somewhat different now. Whenever checkpoint_distance is reached, the WAL receiver thread "freezes" the current in-memory layer, and creates a new one. This is a quick operation and doesn't perform any I/O yet. It then launches a background "layer flushing thread" to write the frozen layer to disk, as a new L0 delta layer. This mechanism takes care of durability. It replaces the checkpointing thread.

Compaction is a new background operation that takes a bunch of L0 delta layers, and reshuffles the data in them. It runs in a separate compaction thread.

Deployment
----------

This also contains changes to the ansible scripts that enable having multiple different pageservers running at the same time in the staging environment. We will use that to keep an old version of the pageserver running, for clusters created with the old version, at the same time as a new pageserver with the new binary.

Author: Heikki Linnakangas
Author: Konstantin Knizhnik <knizhnik@zenith.tech>
Author: Andrey Taranik <andrey@zenith.tech>
Reviewed-by: Matthias Van De Meent <matthias@zenith.tech>
Reviewed-by: Bojan Serafimov <bojan@zenith.tech>
Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Reviewed-by: Anton Shyrabokau <antons@zenith.tech>
Reviewed-by: Dhammika Pathirana <dham@zenith.tech>
Reviewed-by: Kirill Bulatov <kirill@zenith.tech>
Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech>
Reviewed-by: Alexey Kondratov <alexey@zenith.tech>
Non-implementation details
This document describes the current state of the backup system in pageserver, its existing limitations and concerns, why some things are done the way they are, and the future development plans. A detailed description of how the synchronization works and how it fits into the rest of the pageserver can be found in the storage module and its submodules. Ideally, this document should disappear once the current implementation concerns are mitigated, with the remaining useful bits of knowledge moved into rustdocs.
Approach
Backup functionality is a new component that appeared well after the core DB functionality was implemented. The pageserver layer functionality is also quite volatile at the moment, so there's a risk that its local file management changes over time.
To avoid adding more chaos to that, backup functionality is currently designed as a relatively standalone component, with the majority of its logic placed in a standalone async loop. The backups are therefore managed in the background and do not directly affect other pageserver parts: the backup and restoration process may lag behind, but eventually catches up with reality. To track that, a set of prometheus metrics is exposed from pageserver.
What's done
Current implementation
- provides remote storage wrappers for AWS S3 and local FS
- synchronizes the differences between local timelines and remote state as fast as possible
- uploads new layer files
- downloads and registers timelines that are found on the remote storage but missing locally, when they are requested through the pageserver (e.g. HTTP API, GC)
- uses compression when dealing with files, for better S3 usage
- maintains an index of what's stored remotely
- evicts failing tasks and stops the corresponding timelines
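The comparison between local files and the remote index can be pictured as a two-way set difference. The following is only a minimal sketch under stated assumptions: `SyncPlan` and `plan_sync` are hypothetical names invented for illustration, and layer files are identified by plain strings, which is a simplification rather than how the pageserver necessarily models them.

```rust
use std::collections::BTreeSet;

/// What a single synchronization pass would have to do for one timeline.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub struct SyncPlan {
    /// Layer files present locally but absent from the remote index.
    pub to_upload: Vec<String>,
    /// Layer files listed in the remote index but missing locally.
    pub to_download: Vec<String>,
}

/// Compares the local layer file names with the names recorded in the
/// remote index and returns the difference in both directions.
pub fn plan_sync(local: &BTreeSet<String>, remote_index: &BTreeSet<String>) -> SyncPlan {
    SyncPlan {
        // `difference` yields the elements of `self` that are not in the argument.
        to_upload: local.difference(remote_index).cloned().collect(),
        to_download: remote_index.difference(local).cloned().collect(),
    }
}
```

Keeping the plan as plain data makes it straightforward to log, or to turn into separate upload and download tasks.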
The tasks are delayed with every retry and the retries are capped, to avoid poisonous tasks. After any task eviction, or any error during startup checks (e.g. obviously different and wrong local and remote states for the same timeline), the timeline has to be stopped from submitting further checkpoint upload tasks, which is done along with the corresponding timeline status change.
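A capped retry policy of this kind could look roughly like the sketch below. The text above only says that the delay grows with every retry and that retries are capped; the exponential growth, the constants and the names `RetryDecision` and `on_failure` are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Outcome of a failed task. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum RetryDecision {
    /// Re-queue the task after waiting for the given delay.
    RetryAfter(Duration),
    /// Give up: the task is evicted and its timeline stops submitting uploads.
    Evict,
}

// Assumed limits; the real values are not stated in this document.
const MAX_RETRIES: u32 = 5;
const BASE_DELAY: Duration = Duration::from_secs(1);
const MAX_DELAY: Duration = Duration::from_secs(10);

/// Decides what to do with a task that has already failed `failed_attempts` times.
/// The delay doubles with every failure (an assumption) and is clamped to `MAX_DELAY`.
pub fn on_failure(failed_attempts: u32) -> RetryDecision {
    if failed_attempts >= MAX_RETRIES {
        return RetryDecision::Evict;
    }
    // 2^failed_attempts, saturating so that large counters cannot overflow.
    let factor = 2u32.saturating_pow(failed_attempts);
    let delay = BASE_DELAY.saturating_mul(factor).min(MAX_DELAY);
    RetryDecision::RetryAfter(delay)
}
```

With these assumed constants, the delays are 1, 2, 4, 8 and 10 seconds, and the following failure evicts the task.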
No serious optimisation or performance testing has been done yet; the feature is disabled by default and is being polished over time. The plan is to resolve the currently open questions and prepare the feature to be enabled by default in cloud environments.
Peculiarities
As mentioned, the backup component is rather new and still under development, so not everything was done properly from the start. Here's the list of known compromises, with comments:
- The remote storage file model is currently a custom archive format that cannot be deserialized without our own Rust code (including serde). We also don't optimize the archiving and pack every timeline checkpoint separately, so the size of the resulting blob that lands on S3 can be arbitrary. Still, it's a single blob, which is far better than storing ~780 small files separately.
- Archive index restoration requires reading every blob's head. This could be avoided by a background thread/future storing the serialized index in the remote storage.
- no proper file comparison
No file checksum assertion is done currently, but should be (AWS S3 returns file checksums during the list operation)
- sad rust-s3 api
rust-s3 is not very pleasant to use:
- it returns anyhow::Result, which makes it hard to distinguish a "missing file" case from a "no connection" one, for instance
- at least one function in its API that we need (get_object_stream) has the async keyword and blocks (!), see details here
- it's a prerelease library with unclear maintenance status
- noisy on debug level
But it's already used in the project, so for now it's reused to avoid bloating the dependency tree.
Based on previous evaluation, even rusoto-s3 could be a better choice over this library, but needs further benchmarking.
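To illustrate why an opaque error result is inconvenient here, the sketch below shows the kind of typed error a storage wrapper could expose instead. It does not reflect the API of rust-s3 or rusoto-s3; `DownloadError` and `is_retryable` are hypothetical names used only for this example.

```rust
use std::fmt;

/// Hypothetical error type for a remote storage wrapper.
#[derive(Debug, PartialEq, Eq)]
pub enum DownloadError {
    /// The requested object does not exist remotely.
    NotFound,
    /// The request did not complete (for example, no connection).
    Transport(String),
}

impl fmt::Display for DownloadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DownloadError::NotFound => write!(f, "remote object not found"),
            DownloadError::Transport(msg) => write!(f, "transport failure: {}", msg),
        }
    }
}

impl std::error::Error for DownloadError {}

/// Retrying only makes sense for failures that may go away on their own;
/// a missing object will still be missing on the next attempt.
pub fn is_retryable(err: &DownloadError) -> bool {
    matches!(err, DownloadError::Transport(_))
}
```

With separate variants, the sync loop can retry transport failures while treating a missing file as a definite answer.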
- gc is ignored
So far, we don't adjust the remote storage based on the results of the GC thread loop; only the checkpointer loop affects the remote storage. The index module could be used as a base for implementing a deferred GC mechanism: a "defragmentation" that repacks archives into new ones after GC has removed files that are still stored in those archives.
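The bookkeeping side of such a defragmentation step might be sketched as follows. This is a speculative illustration, not a description of existing code: `repack` is a hypothetical function, archives are reduced to the sets of file names the index records for them, and the actual reading and writing of archive blobs is left out.

```rust
use std::collections::BTreeSet;

/// Merges the surviving files of several archives into one new archive listing.
/// `archives` holds the file names recorded per archive in the index;
/// `removed_by_gc` holds the names that GC has deleted locally.
pub fn repack(
    archives: &[BTreeSet<String>],
    removed_by_gc: &BTreeSet<String>,
) -> BTreeSet<String> {
    archives
        .iter()
        .flatten() // walk the file names of every archive
        .filter(|name| !removed_by_gc.contains(*name)) // drop what GC removed
        .cloned()
        .collect()
}
```

The returned set lists the files the new archive would contain; the old archives could then be deleted from the remote storage once the new one has been uploaded.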