From a3f3d46016bb4d2e1e89144a8b96a35bd558d80c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Mon, 30 Aug 2021 11:42:25 +0300
Subject: [PATCH] Misc doc updates

---
 docs/README.md     |  2 ++
 docs/glossary.md   | 27 +++++++-------
 docs/sourcetree.md |  6 ++--
 pageserver/README  | 88 ++++++++++++++++++++++++++--------------------
 4 files changed, 70 insertions(+), 53 deletions(-)

diff --git a/docs/README.md b/docs/README.md
index 35f66e2d6f..000cea8355 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -4,7 +4,9 @@
 - [authentication.md](authentication.md) — pageserver JWT authentication.
 - [docker.md](docker.md) — Docker images and building pipeline.
+- [glossary.md](glossary.md) — Glossary of the terms used in the codebase.
 - [multitenancy.md](multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI.
+- [sourcetree.md](sourcetree.md) — Overview of the source tree layout.
 - [pageserver/README](/pageserver/README) — pageserver overview.
 - [postgres_ffi/README](/postgres_ffi/README) — Postgres FFI overview.
 - [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview.

diff --git a/docs/glossary.md b/docs/glossary.md
index 0506456973..43c6329ea2 100644
--- a/docs/glossary.md
+++ b/docs/glossary.md
@@ -26,8 +26,8 @@ A checkpoint record in the WAL marks a point in the WAL sequence at which it is
 NOTE: This is an overloaded term.
 
 Whenever enough WAL has been accumulated in memory, the page server []
-writes out the changes in memory into new snapshot files[]. This process
-is called "checkpointing". The page server only creates snapshot files for
+writes out the changes in memory into new layer files[]. This process
+is called "checkpointing". The page server only creates layer files for
 relations that have been modified since the last checkpoint.
 
 ### Compute node
@@ -41,6 +41,15 @@ Stateless Postgres node that stores data in pageserver.
 
 Each of the separate segmented file sets in which a relation is stored. The main fork is where the actual data resides. There also exist two secondary forks for metadata: the free space map and the visibility map. Each PostgreSQL fork is considered a separate relish.
 
+### Layer file
+
+The layered repository's on-disk format is based on immutable files.
+The files are called "layer files". Each file corresponds to one 10 MB
+segment of a PostgreSQL relation fork. There are two kinds of layer
+files: image files and delta files. An image file contains a
+"snapshot" of the segment at a particular LSN, and a delta file
+contains WAL records applicable to the segment, in a range of LSNs.
+
 ### Layered repository
 
 ### LSN
@@ -102,6 +111,10 @@ Repository stores multiple timelines, forked off from the same initial call to '
 and has associated WAL redo service.
 One repository corresponds to one Tenant.
 
+### Retention policy
+
+The retention policy determines how much history must be kept around for PITR and read-only nodes.
+
 ### SLRU
 
 SLRUs include pg_clog, pg_multixact/members, and
@@ -110,16 +123,6 @@ they don't need to be stored permanently (e.g. pg_subtrans), or we do
 not support them in zenith yet (pg_commit_ts).
 Each SLRU segment is considered a separate relish[].
 
-### Snapshot file
-
-Layered repository on-disk format is based on immutable files.
-The files are called "snapshot files".
-Each snapshot file contains a full snapshot, that is, full copy of all
-pages in the relation, as of the "start LSN". It also contains all WAL
-records applicable to the relation between the start and end
-LSNs.
-Each snapshot file corresponds to one 10 MB slice of a PostgreSQL relation fork.
-
 ### Tenant (Multitenancy)
 
 Tenant represents a single customer, interacting with Zenith. Wal redo[] activity, timelines[], snapshots[] are managed for each tenant independently.

diff --git a/docs/sourcetree.md b/docs/sourcetree.md
index 4e1aa0b2ae..be0e7db655 100644
--- a/docs/sourcetree.md
+++ b/docs/sourcetree.md
@@ -13,7 +13,7 @@ Intended to be used in integration tests and in CLI tools for local installation
 
 Documentaion of the Zenith features and concepts. Now it is mostly dev documentation.
 
-`monitoring`:
+`/monitoring`:
 
 TODO
 
@@ -72,9 +72,9 @@ The workspace_hack crate exists only to pin down some dependencies.
 
 Main entry point for the 'zenith' CLI utility.
 TODO: Doesn't it belong to control_plane?
 
-`zenith_metrics`:
+`/zenith_metrics`:
 
-TODO
+Helpers for exposing Prometheus metrics from the server.
 
 `/zenith_utils`:

diff --git a/pageserver/README b/pageserver/README
index 6fe4330fb8..7595fb7738 100644
--- a/pageserver/README
+++ b/pageserver/README
@@ -8,10 +8,11 @@ The Page Server has a few different duties:
 - Backup to S3
 
-The Page Server consists of multiple threads that operate on a shared
-cache of page versions:
+The Page Server consists of multiple threads that operate on a shared
+repository of page versions:
+
 
 
                                            | WAL
                                            V
                                    +--------------+
@@ -23,16 +24,14 @@
 +---------+                                            ..........
 |         |                 |              |           .        .
 |         |  GetPage@LSN    |              |           . backup .  ------->  | S3 |
-------------->              |  Page        |  page cache   .        .       |    |
+------------->              |  Page        |  repository   .        .       |    |
 | Service |                                            ..........
 |         |  page           |              |
 +----+  <-------------      |              |
- +---------+
-
- ...................................
- .                                 .
- . Garbage Collection / Compaction .
- ...................................
+ +---------+                 +--------------------+
+                             | Checkpointing /    |
+                             | Garbage collection |
+                             +--------------------+
 
 
 Legend:
@@ -52,7 +51,7 @@ Page Service
 ------------
 
 The Page Service listens for GetPage@LSN requests from the Compute Nodes,
-and responds with pages from the page cache.
+and responds with pages from the repository.
 
 WAL Receiver
 ------------
@@ -61,46 +60,59 @@ The WAL receiver connects to the external WAL safekeeping service (or
 directly to the primary) using PostgreSQL physical streaming
 replication, and continuously receives WAL. It decodes the WAL records,
-and stores them to the page cache repository.
+and stores them to the repository.
 
 
-Page Cache
+Repository
 ----------
 
-The Page Cache is a switchboard to access different Repositories.
+The repository stores all the page versions, or WAL records needed to
+reconstruct them. Each tenant has a separate Repository, which is
+stored in the .zenith/tenants/ directory.
 
-#### Repository
-Repository corresponds to one .zenith directory.
-Repository is needed to manage Timelines.
-Each repository has associated WAL redo service.
-
-There is currently only one implementation of the Repository trait,
-LayeredRepository, but it's still a useful abstraction that keeps the
+Repository is an abstract trait, defined in `repository.rs`. It is
+implemented by the LayeredRepository object in
+`layered_repository.rs`. There is only that one implementation of the
+Repository trait, but it's still a useful abstraction that keeps the
 interface for the low-level storage functionality clean.
 
 The layered storage format is described in layered_repository/README.md.
 
-#### Timeline
-Timeline is a page cache workhorse that accepts page changes
-and serves get_page_at_lsn() and get_rel_size() requests.
-Note: this has nothing to do with PostgreSQL WAL timeline.
+Each repository consists of multiple Timelines. A Timeline is the
+workhorse that accepts page changes from the WAL and serves
+get_page_at_lsn() and get_rel_size() requests. Note: this has nothing
+to do with PostgreSQL WAL timelines. The term "timeline" is mostly
+interchangeable with "branch"; there is a one-to-one mapping from
+branch to timeline. A timeline has a unique ID within the tenant,
+represented as a 16-byte hex string that never changes, whereas a
+branch is a user-given name for a timeline.
 
-#### Branch
-We can create branch at certain LSN.
-Each Branch lives in a corresponding timeline and has an ancestor.
-
-To get full snapshot of data at certain moment we need to traverse timeline and its ancestors.
-
-#### WAL redo service
-WAL redo service - service that runs PostgreSQL in a special wal_redo mode
-to apply given WAL records over an old page image and return new page image.
+Each repository also has a WAL redo manager associated with it; see
+`walredo.rs`. The WAL redo manager is used to replay PostgreSQL WAL
+records whenever we need to reconstruct a page version from WAL to
+satisfy a GetPage@LSN request, or to avoid accumulating too much WAL
+for a page. The WAL redo manager uses a Postgres process running in a
+special zenith wal-redo mode to do the actual WAL redo, and
+communicates with the process using a pipe.
 
 
-TODO: Garbage Collection / Compaction
--------------------------------------
+Checkpointing / Garbage Collection
+----------------------------------
 
-Periodically, the Garbage Collection / Compaction thread runs
-and applies pending WAL records, and removes old page versions that
-are no longer needed.
+Periodically, the checkpointer thread wakes up and performs
+housekeeping on the repository. It has two duties:
+
+### Checkpointing
+
+Flush WAL that has accumulated in memory to disk, so that the old WAL
+can be truncated away in the WAL safekeepers, and to free up memory
+for receiving new WAL. This process is called "checkpointing". It's
+similar to checkpointing in PostgreSQL or other DBMSs, but in the page
+server, checkpointing happens on a per-segment basis.
+
+### Garbage collection
+
+Remove old on-disk layer files that are no longer needed according to
+the PITR retention policy.
 
 
 TODO: Backup service
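The GetPage@LSN mechanism that the updated pageserver/README text describes is easy to miss in prose, so here is a minimal, self-contained Rust sketch of the idea: every page version is stored under the LSN that produced it, and a read at a given LSN returns the newest version at or before that point. The ToyTimeline type, put_page_image(), and the u32/u64 key types below are invented for illustration; they are not the actual Repository or Timeline interfaces in repository.rs.

```rust
// Minimal sketch of the GetPage@LSN idea from pageserver/README.
// Illustrative only: ToyTimeline and put_page_image() are invented here
// and are not the real pageserver Repository/Timeline traits.

use std::collections::BTreeMap;
use std::ops::Bound::Included;

type Lsn = u64; // position in the WAL stream
type BlockNum = u32; // block number within a relation fork

/// Toy single-relation timeline: keeps every page version in memory,
/// keyed by (block number, LSN at which that version was created).
struct ToyTimeline {
    versions: BTreeMap<(BlockNum, Lsn), Vec<u8>>,
}

impl ToyTimeline {
    fn new() -> Self {
        ToyTimeline { versions: BTreeMap::new() }
    }

    /// Remember a new version of a page, produced by WAL at `lsn`.
    fn put_page_image(&mut self, blknum: BlockNum, lsn: Lsn, img: Vec<u8>) {
        self.versions.insert((blknum, lsn), img);
    }

    /// GetPage@LSN: return the newest version of `blknum` with LSN <= `lsn`.
    /// A real pageserver would also replay WAL records on top of an older
    /// image here; this toy version only stores full page images.
    fn get_page_at_lsn(&self, blknum: BlockNum, lsn: Lsn) -> Option<&Vec<u8>> {
        self.versions
            .range((Included((blknum, 0)), Included((blknum, lsn))))
            .next_back()
            .map(|(_, img)| img)
    }
}

fn main() {
    let mut tl = ToyTimeline::new();
    tl.put_page_image(0, 100, b"page as of LSN 100".to_vec());
    tl.put_page_image(0, 200, b"page as of LSN 200".to_vec());

    // A read-only node at LSN 150 sees the older version...
    assert_eq!(tl.get_page_at_lsn(0, 150), Some(&b"page as of LSN 100".to_vec()));
    // ...while a node at LSN 250 sees the newer one.
    assert_eq!(tl.get_page_at_lsn(0, 250), Some(&b"page as of LSN 200".to_vec()));
    println!("GetPage@LSN sketch OK");
}
```

Keying versions by (block number, LSN) in an ordered map turns "newest version at or before this LSN" into a bounded range lookup; the real repository answers the same kind of question, but from immutable image and delta layer files on disk rather than from an in-memory map.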