commit 07342f7519: Major storage format rewrite

This is a backwards-incompatible change. The new pageserver cannot
read repositories created with an old pageserver binary, or vice
versa.

Simplify Repository to a value-store
------------------------------------

Move the responsibility of tracking relation metadata, like which
relations exist and what their sizes are, from Repository to a new
module, pgdatadir_mapping.rs. The interface to Repository now consists
of simple key-value PUT/GET operations.

It's still not just any key-value store, though: a Repository is still
responsible for handling branching, and every GET operation comes
with an LSN.
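
Roughly, the new interface has this shape (a minimal sketch with made-up
names and simplified types, not the actual trait):

    // Minimal sketch; the real trait differs in detail.
    type Lsn = u64;                                    // a WAL position
    struct Key;                                        // described below
    enum Value { Image(Vec<u8>), WalRecord(Vec<u8>) }  // page image or WAL record

    trait ValueStore {
        // Store a value for a key at a given LSN.
        fn put(&mut self, key: Key, lsn: Lsn, value: Value) -> std::io::Result<()>;
        // Read the latest value of the key at or before the given LSN,
        // following the timeline's branch history if necessary.
        fn get(&self, key: Key, lsn: Lsn) -> std::io::Result<Value>;
    }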

Mapping from Postgres data directory to keys/values
---------------------------------------------------

All the data is now stored in the key-value store. The
'pgdatadir_mapping.rs' module handles the mapping from PostgreSQL
objects, like relation pages and SLRUs, to key-value pairs.

The key to the Repository key-value store is a Key struct, which
consists of a few integer fields. It's wide enough to store a full
RelFileNode, fork and block number, and to distinguish those from
metadata keys.
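
For illustration, the idea is roughly this (field names are made up for
the sketch; the real Key layout differs):

    struct Key {
        kind: u8,       // distinguishes relation blocks from metadata keys
        spc_node: u32,  // tablespace OID  \
        db_node: u32,   // database OID     > together form the RelFileNode
        rel_node: u32,  // relation OID    /
        fork_num: u8,   // main fork, FSM, visibility map, ...
        block_num: u32, // block number within the fork
    }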

'pgdatadir_mapping.rs' is also responsible for maintaining a
"partitioning" of the keyspace. Partitioning means splitting the
keyspace so that each partition holds a roughly equal number of keys.
The partitioning is used when new image layer files are created, so
that each image layer file is roughly the same size.

The partitioning is also responsible for reclaiming space used by
deleted keys. The Repository implementation doesn't have any explicit
support for deleting keys. Instead, the deleted keys are simply
omitted from the partitioning, and when a new image layer is created,
the omitted keys are not copied over to the new image layer. We might
want to implement tombstone keys in the future, to reclaim space
faster, but this will work for now.
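
Conceptually, the partitioning boils down to something like this (a
simplified sketch with keys shown as plain integers, not the actual
code):

    // Split the sorted list of live keys into chunks of roughly equal size.
    // Deleted keys are simply absent from `live_keys`, so the image layers
    // created from these partitions will not contain them.
    fn partition(live_keys: &[u64], target_keys_per_partition: usize) -> Vec<Vec<u64>> {
        live_keys
            .chunks(target_keys_per_partition.max(1))
            .map(|chunk| chunk.to_vec())
            .collect()
    }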

Changes to low-level layer file code
------------------------------------

The concept of a "segment" is gone. Each layer file can now store an
arbitrary range of Keys.

Checkpointing, compaction
-------------------------

The background tasks are somewhat different now. Whenever
checkpoint_distance is reached, the WAL receiver thread "freezes" the
current in-memory layer, and creates a new one. This is a quick
operation and doesn't perform any I/O yet. It then launches a
background "layer flushing thread" to write the frozen layer to disk,
as a new L0 delta layer. This mechanism takes care of durability. It
replaces the checkpointing thread.
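
In rough, illustrative Rust (made-up names, not the actual functions),
the flow looks like this:

    struct FrozenLayer;                       // an in-memory layer that no longer accepts writes
    struct Timeline { open_layer_size: u64 }  // bytes of WAL in the open in-memory layer

    impl Timeline {
        // Seal the current in-memory layer and open a new, empty one.
        // Quick: no I/O happens here.
        fn freeze_open_layer(&mut self) -> FrozenLayer {
            self.open_layer_size = 0;
            FrozenLayer
        }
    }

    // Write the frozen layer to disk as a new L0 delta layer (the heavy I/O).
    fn flush_frozen_layer(_layer: FrozenLayer) {}

    // Called from the WAL receiver after ingesting WAL.
    fn maybe_freeze(timeline: &mut Timeline, checkpoint_distance: u64) {
        if timeline.open_layer_size >= checkpoint_distance {
            let frozen = timeline.freeze_open_layer();
            std::thread::spawn(move || flush_frozen_layer(frozen));
        }
    }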

Compaction is a new background operation that takes a bunch of L0
delta layers, and reshuffles the data in them. It runs in a separate
compaction thread.
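
One way to picture the reshuffling (an illustrative sketch, with the
simplifying assumption that each L0 layer covers the whole key range for
a short LSN range): merge the L0 layers and regroup the same records by
key range.

    #[derive(Clone)]
    struct DeltaRecord { key: u64, lsn: u64, value: Vec<u8> }

    // Merge a set of L0 delta layers and split the result by key range,
    // so that each output layer covers one partition of the keyspace.
    fn compact(
        l0_layers: Vec<Vec<DeltaRecord>>,
        partitions: &[std::ops::Range<u64>],
    ) -> Vec<Vec<DeltaRecord>> {
        let mut all: Vec<DeltaRecord> = l0_layers.into_iter().flatten().collect();
        all.sort_by_key(|r| (r.key, r.lsn));
        partitions
            .iter()
            .map(|range| all.iter().filter(|r| range.contains(&r.key)).cloned().collect())
            .collect()
    }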

Deployment
----------

This also contains changes to the Ansible scripts that allow running
multiple different pageservers at the same time in the staging
environment. We will use that to keep an old pageserver version running
for clusters created with the old version, alongside a pageserver
running the new binary.

Author: Heikki Linnakangas
Author: Konstantin Knizhnik <knizhnik@zenith.tech>
Author: Andrey Taranik <andrey@zenith.tech>
Reviewed-by: Matthias Van De Meent <matthias@zenith.tech>
Reviewed-by: Bojan Serafimov <bojan@zenith.tech>
Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Reviewed-by: Anton Shyrabokau <antons@zenith.tech>
Reviewed-by: Dhammika Pathirana <dham@zenith.tech>
Reviewed-by: Kirill Bulatov <kirill@zenith.tech>
Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech>
Reviewed-by: Alexey Kondratov <alexey@zenith.tech>
Date: 2022-03-28 05:41:15 -05:00

Page server architecture

The Page Server has a few different duties:

  • Respond to GetPage@LSN requests from the Compute Nodes
  • Receive WAL from WAL safekeeper
  • Replay WAL that's applicable to the chunks that the Page Server maintains
  • Backup to S3

S3 is the main fault-tolerant storage of all data, as there are no Page Server replicas. We use a separate fault-tolerant WAL service to reduce latency. It keeps track of WAL records which are not synced to S3 yet.

The Page Server consists of multiple threads that operate on a shared repository of page versions:

                              | WAL
                              V
                      +--------------+
                      |              |
                      | WAL receiver |
                      |              |
                      +--------------+
                                                                +----+
                +---------+                      ..........     |    |
 GetPage@LSN    |         |                      .        .     |    |
--------------> |  Page   |      repository      . backup . --> | S3 |
                | Service |                      .        .     |    |
      page      |         |                      ..........     |    |
<-------------- +---------+                                     +----+
                      +--------------------+
                      | Checkpointing /    |
                      | Garbage collection |
                      +--------------------+

Legend:

+--+
|  |   A thread or multi-threaded service
+--+

....
.  .   Component at its early development phase
....

--->
<---   Data flow

Page Service

The Page Service listens for GetPage@LSN requests from the Compute Nodes, and responds with pages from the repository.

WAL Receiver

The WAL receiver connects to the external WAL safekeeping service (or directly to the primary) using PostgreSQL physical streaming replication, and continuously receives WAL. It decodes the WAL records, and stores them to the repository.

Repository

The repository stores all the page versions, or WAL records needed to reconstruct them. Each tenant has a separate Repository, which is stored in the .zenith/tenants/ directory.

Repository is an abstract trait, defined in repository.rs. It is implemented by the LayeredRepository object in layered_repository.rs. There is only that one implementation of the Repository trait, but it's still a useful abstraction that keeps the interface for the low-level storage functionality clean. The layered storage format is described in layered_repository/README.md.

Each repository consists of multiple Timelines. Timeline is a workhorse that accepts page changes from the WAL, and serves get_page_at_lsn() and get_rel_size() requests. Note: this has nothing to do with the PostgreSQL WAL timeline. The term "timeline" is mostly interchangeable with "branch": there is a one-to-one mapping from branch to timeline. A timeline has a unique ID within the tenant, represented as a 16-byte hex string that never changes, whereas a branch is a user-given name for a timeline.
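
For illustration, the core of that interface looks roughly like this (a
sketch with simplified types, not the actual code):

    struct Lsn(u64);  // a PostgreSQL WAL position
    struct RelTag { spcnode: u32, dbnode: u32, relnode: u32, forknum: u8 }

    trait TimelineApi {
        // The 8 KB image of the given relation block, as it was at `lsn`.
        fn get_page_at_lsn(&self, rel: RelTag, blknum: u32, lsn: Lsn) -> std::io::Result<Vec<u8>>;
        // The size of the relation in blocks, as it was at `lsn`.
        fn get_rel_size(&self, rel: RelTag, lsn: Lsn) -> std::io::Result<u32>;
    }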

Each repository also has a WAL redo manager associated with it, see walredo.rs. The WAL redo manager is used to replay PostgreSQL WAL records whenever we need to reconstruct a page version from WAL to satisfy a GetPage@LSN request, or to avoid accumulating too much WAL for a page. The WAL redo manager uses a Postgres process running in a special zenith wal-redo mode to do the actual WAL redo, and communicates with the process over a pipe.

Checkpointing / Garbage Collection

Periodically, the checkpointer thread wakes up and performs housekeeping duties on the repository. It has two duties:

Checkpointing

Flush the WAL that has accumulated in memory to disk, so that the old WAL can be truncated away in the WAL safekeepers and memory is freed up for receiving new WAL. This process is called "checkpointing". It is similar to checkpointing in PostgreSQL and other DBMSs, but in the page server, checkpointing happens on a per-segment basis.

Garbage collection

Remove old on-disk layer files that are no longer needed according to the PITR retention policy.

Backup service

The backup service is responsible for storing pageserver recovery data externally.

Currently, the pageserver stores its files in a filesystem directory it's pointed to. That working directory can be rather ephemeral, for example when the pageserver runs as a Kubernetes pod with no persistent volumes attached. Therefore, the server interacts with external, more reliable storage to back up and restore its state.

The storage support code is extensible and can work with arbitrary storage backends, as long as they implement a certain Rust trait (a sketch of its shape follows the list below). The following implementations are present:

  • local filesystem - used mainly in tests
  • AWS S3 - used in production
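
Roughly, a backend has to provide operations along these lines (an
illustrative sketch, not the real trait):

    use std::path::Path;

    trait RemoteStorage {
        // Upload a local layer file to remote storage under the given remote path.
        fn upload(&self, local_path: &Path, remote_path: &str) -> std::io::Result<()>;
        // Download a previously uploaded file back into the working directory.
        fn download(&self, remote_path: &str, local_path: &Path) -> std::io::Result<()>;
        // List stored files under a prefix, e.g. when deciding what to restore.
        fn list(&self, prefix: &str) -> std::io::Result<Vec<String>>;
    }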

Implementation details are covered in the backup readme and the corresponding Rust file docs; parameter documentation can be found in the settings docs.

The backup service is disabled by default and can be enabled to interact with a single remote storage.

CLI examples:

  • Local FS: ${PAGESERVER_BIN} -c "remote_storage={local_path='/some/local/path/'}"
  • AWS S3 : ${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/',access_key_id='SOMEKEYAAAAASADSAH*#',secret_access_key='SOMEsEcReTsd292v'}"

For Amazon AWS S3, the key id and secret access key can be found in ~/.aws/credentials if awscli was ever configured to work with the desired bucket, or on the AWS settings page for a given user. Also note that bucket names do not contain any protocol prefix when used on AWS. For local S3 installations, refer to their documentation for the name format and credentials.

Similar to other pageserver settings, a TOML config file can be used to configure either of the storages as a backup target. The required sections are:

[remote_storage]
local_path = '/Users/someonetoignore/Downloads/tmp_dir/'

or

[remote_storage]
bucket_name = 'some-sample-bucket'
bucket_region = 'eu-north-1'
prefix_in_bucket = '/test_prefix/'
access_key_id = 'SOMEKEYAAAAASADSAH*#'
secret_access_key = 'SOMEsEcReTsd292v'

Also, the AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID environment variables can be used to specify the credentials instead of any of the ways above.

TODO: Sharding

We should be able to run multiple Page Servers that handle sharded data.