Mirror of https://github.com/neondatabase/neon.git, synced 2025-12-22 21:59:59 +00:00
Reorganize, expand, improve internal documentation
Reorganize existing READMEs and other documentation files into mdbook format. The resulting Table of Contents is a mix of placeholders for docs that we should write, and documentation files that we already had, dropped into the most appropriate places. Update the Pageserver overview diagram. Add sections on thread management and WAL redo processes. Add all the RFCs to the mdbook Table of Contents, too. Per GitHub issue #1979.
Committed by Heikki Linnakangas
parent a69fdb0e8e
commit 0b14fdb078
1 docs/.gitignore vendored Normal file
@@ -0,0 +1 @@
+book
docs/README.md
@@ -1,14 +0,0 @@
-# Zenith documentation
-
-## Table of contents
-
-- [authentication.md](authentication.md) — pageserver JWT authentication.
-- [docker.md](docker.md) — Docker images and building pipeline.
-- [glossary.md](glossary.md) — Glossary of all the terms used in codebase.
-- [multitenancy.md](multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI.
-- [sourcetree.md](sourcetree.md) — Overview of the source tree layout.
-- [pageserver/README.md](/pageserver/README.md) — pageserver overview.
-- [postgres_ffi/README.md](/libs/postgres_ffi/README.md) — Postgres FFI overview.
-- [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview.
-- [safekeeper/README.md](/safekeeper/README.md) — WAL service overview.
-- [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core
84 docs/SUMMARY.md Normal file
@@ -0,0 +1,84 @@
+# Summary
+
+[Introduction]()
+- [Separation of Compute and Storage](./separation-compute-storage.md)
+
+# Architecture
+
+- [Compute]()
+  - [WAL proposer]()
+  - [WAL Backpressure]()
+  - [Postgres changes](./core_changes.md)
+- [Pageserver](./pageserver.md)
+  - [Services](./pageserver-services.md)
+  - [Thread management](./pageserver-thread-mgmt.md)
+  - [WAL Redo](./pageserver-walredo.md)
+  - [Page cache](./pageserver-pagecache.md)
+  - [Storage](./pageserver-storage.md)
+    - [Datadir mapping]()
+    - [Layer files]()
+    - [Branching]()
+    - [Garbage collection]()
+    - [Cloud Storage]()
+  - [Processing a GetPage request](./pageserver-processing-getpage.md)
+  - [Processing WAL](./pageserver-processing-wal.md)
+  - [Management API]()
+  - [Tenant Rebalancing]()
+- [WAL Service](walservice.md)
+  - [Consensus protocol](safekeeper-protocol.md)
+  - [Management API]()
+  - [Rebalancing]()
+- [Control Plane]()
+- [Proxy]()
+- [Source view](./sourcetree.md)
+  - [docker.md](./docker.md) — Docker images and building pipeline.
+  - [Error handling and logging]()
+  - [Testing]()
+    - [Unit testing]()
+    - [Integration testing]()
+    - [Benchmarks]()
+- [Glossary](./glossary.md)
+
+# Uncategorized
+
+- [authentication.md](./authentication.md)
+- [multitenancy.md](./multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI.
+- [settings.md](./settings.md)
+
+#FIXME: move these under sourcetree.md
+#- [pageserver/README.md](/pageserver/README.md)
+#- [postgres_ffi/README.md](/libs/postgres_ffi/README.md)
+#- [test_runner/README.md](/test_runner/README.md)
+#- [safekeeper/README.md](/safekeeper/README.md)
+
+# RFCs
+
+- [RFCs](./rfcs/README.md)
+  - [002-storage](rfcs/002-storage.md)
+  - [003-laptop-cli](rfcs/003-laptop-cli.md)
+  - [004-durability](rfcs/004-durability.md)
+  - [005-zenith_local](rfcs/005-zenith_local.md)
+  - [006-laptop-cli-v2-CLI](rfcs/006-laptop-cli-v2-CLI.md)
+  - [006-laptop-cli-v2-repository-structure](rfcs/006-laptop-cli-v2-repository-structure.md)
+  - [007-serverless-on-laptop](rfcs/007-serverless-on-laptop.md)
+  - [008-push-pull](rfcs/008-push-pull.md)
+  - [009-snapshot-first-storage-cli](rfcs/009-snapshot-first-storage-cli.md)
+  - [009-snapshot-first-storage](rfcs/009-snapshot-first-storage.md)
+  - [009-snapshot-first-storage-pitr](rfcs/009-snapshot-first-storage-pitr.md)
+  - [010-storage_details](rfcs/010-storage_details.md)
+  - [011-retention-policy](rfcs/011-retention-policy.md)
+  - [012-background-tasks](rfcs/012-background-tasks.md)
+  - [013-term-history](rfcs/013-term-history.md)
+  - [014-safekeepers-gossip](rfcs/014-safekeepers-gossip.md)
+  - [014-storage-lsm](rfcs/014-storage-lsm.md)
+  - [015-storage-messaging](rfcs/015-storage-messaging.md)
+  - [016-connection-routing](rfcs/016-connection-routing.md)
+  - [cluster-size-limits](rfcs/cluster-size-limits.md)
5 docs/book.toml Normal file
@@ -0,0 +1,5 @@
+[book]
+language = "en"
+multilingual = false
+src = "."
+title = "Neon architecture"
docs/core_changes.md
@@ -1,3 +1,12 @@
+# Postgres core changes
+
+This lists all the changes that have been made to the PostgreSQL
+source tree, as a somewhat logical set of patches. The long-term goal
+is to eliminate all these changes, by submitting patches to upstream
+and refactoring code into extensions, so that you can run unmodified
+PostgreSQL against Neon storage.
+
 1. Add t_cid to XLOG record
 - Why?
 The cmin/cmax on a heap page is a real bummer. I don't see any other way to fix that than bite the bullet and modify the WAL-logging routine to include the cmin/cmax.
9 docs/pageserver-page-service.md Normal file
@@ -0,0 +1,9 @@
+# Page Service
+
+The Page Service listens for GetPage@LSN requests from the Compute Nodes,
+and responds with pages from the repository. On each GetPage@LSN request,
+it calls into the Repository function
+
+A separate thread is spawned for each incoming connection to the page
+service. The page service uses the libpq protocol to communicate with
+the client. The client is a Compute Postgres instance.
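The per-request flow described in this file can be sketched with a toy wire format. This is purely illustrative: the real page service speaks the libpq protocol and exchanges richer messages, and `encode_getpage`/`decode_getpage` are hypothetical names, not pageserver APIs.

```rust
// Toy framing for a GetPage@LSN-style request: a block number and an LSN,
// each as a big-endian u64. Illustrative only; the real page service
// carries its requests over the libpq protocol.
fn encode_getpage(blkno: u64, lsn: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(16);
    buf.extend_from_slice(&blkno.to_be_bytes());
    buf.extend_from_slice(&lsn.to_be_bytes());
    buf
}

fn decode_getpage(buf: &[u8]) -> Option<(u64, u64)> {
    if buf.len() != 16 {
        return None; // malformed request
    }
    let blkno = u64::from_be_bytes(buf[0..8].try_into().ok()?);
    let lsn = u64::from_be_bytes(buf[8..16].try_into().ok()?);
    Some((blkno, lsn))
}

fn main() {
    // A connection-handling thread would decode each such request, then look
    // up the page in the repository at the requested LSN.
    let wire = encode_getpage(42, 0x16B_9188);
    assert_eq!(decode_getpage(&wire), Some((42, 0x16B_9188)));
}
```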
8 docs/pageserver-pagecache.md Normal file
@@ -0,0 +1,8 @@
+# Page cache
+
+TODO:
+
+- shared across tenants
+- store pages from layer files
+- store pages from "in-memory layer"
+- store materialized pages
4 docs/pageserver-processing-getpage.md Normal file
@@ -0,0 +1,4 @@
+# Processing a GetPage request
+
+TODO:
+- sequence diagram that shows how a GetPage@LSN request is processed
5 docs/pageserver-processing-wal.md Normal file
@@ -0,0 +1,5 @@
+# Processing WAL
+
+TODO:
+- diagram that shows how incoming WAL is processed
+- explain durability, what is fsync'd when, disk_consistent_lsn
docs/pageserver-services.md
@@ -1,15 +1,4 @@
-## Page server architecture
-
-The Page Server has a few different duties:
-
-- Respond to GetPage@LSN requests from the Compute Nodes
-- Receive WAL from WAL safekeeper
-- Replay WAL that's applicable to the chunks that the Page Server maintains
-- Backup to S3
-
-S3 is the main fault-tolerant storage of all data, as there are no Page Server
-replicas. We use a separate fault-tolerant WAL service to reduce latency. It
-keeps track of WAL records which are not synced to S3 yet.
+# Services
 
 The Page Server consists of multiple threads that operate on a shared
 repository of page versions:
@@ -21,18 +10,22 @@ repository of page versions:

The updated diagram:

                +--------------+
                | WAL receiver |
                |              |
                +--------------+
                                                          ......
                +---------+                 +--------+    .    .
                |         |                 |        |    .    .
GetPage@LSN     |         |                 | backup | -->. S3 .
------------->  |  Page   |  repository     |        |    .    .
                | Service |                 +--------+    .    .
     page       |         |                               ......
<-------------  |         |
                +---------+  +-----------+  +--------------------+
                             | WAL redo  |  | Checkpointing,     |
                +----------+ | processes |  | Garbage collection |
                |          | +-----------+  +--------------------+
                | HTTP     |
                | mgmt API |
                |          |
                +----------+
@@ -40,28 +33,77 @@ Legend:
 |  |  A thread or multi-threaded service
 +--+
 
+....
+.  .  Component at its early development phase.
+....
+
 --->  Data flow
 <---
 ```
 
-Page Service
-------------
+## Page Service
 
 The Page Service listens for GetPage@LSN requests from the Compute Nodes,
-and responds with pages from the repository.
+and responds with pages from the repository. On each GetPage@LSN request,
+it calls into the Repository function
+
+A separate thread is spawned for each incoming connection to the page
+service. The page service uses the libpq protocol to communicate with
+the client. The client is a Compute Postgres instance.
+
+## WAL Receiver
+
+The WAL receiver connects to the external WAL safekeeping service
+using PostgreSQL physical streaming replication, and continuously
+receives WAL. It decodes the WAL records, and stores them to the
+repository.
 
-WAL Receiver
-------------
+## Backup service
 
-The WAL receiver connects to the external WAL safekeeping service (or
-directly to the primary) using PostgreSQL physical streaming
-replication, and continuously receives WAL. It decodes the WAL records,
-and stores them to the repository.
+The backup service is responsible for storing pageserver recovery data externally.
+
+Currently, the pageserver stores its files in a filesystem directory it's pointed to.
+That working directory could be rather ephemeral, for cases such as "a pageserver pod running in k8s with no persistent volumes attached".
+Therefore, the server interacts with external, more reliable storage to back up and restore its state.
+
+The code for storage support is extensible and can support arbitrary storages, as long as they implement a certain Rust trait.
+The following implementations are present:
+* local filesystem — used mainly in tests
+* AWS S3 — used in production
+
+Implementation details are covered in the [backup readme](./src/remote_storage/README.md) and the corresponding Rust file docs; parameter documentation can be found in the [settings docs](../docs/settings.md).
+
+The backup service is disabled by default and can be enabled to interact with a single remote storage.
+
+CLI examples:
+* Local FS: `${PAGESERVER_BIN} -c "remote_storage={local_path='/some/local/path/'}"`
+* AWS S3: `env AWS_ACCESS_KEY_ID='SOMEKEYAAAAASADSAH*#' AWS_SECRET_ACCESS_KEY='SOMEsEcReTsd292v' ${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/'}"`
+
+For Amazon AWS S3, the key id and secret access key can be found in `~/.aws/credentials` (if awscli was ever configured to work with the desired bucket) or on the AWS Settings page for a given user. Also note that bucket names do not contain any protocol prefix when used on AWS.
+For local S3 installations, refer to their documentation for name format and credentials.
+
+Similar to other pageserver settings, a toml config file can be used to configure either of the storages as a backup target.
+Required sections are:
+
+```toml
+[remote_storage]
+local_path = '/Users/someonetoignore/Downloads/tmp_dir/'
+```
+
+or
+
+```toml
+[remote_storage]
+bucket_name = 'some-sample-bucket'
+bucket_region = 'eu-north-1'
+prefix_in_bucket = '/test_prefix/'
+```
+
+The `AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` env variables can be used to specify the S3 credentials if needed.
+
+## Repository background tasks
+
+The Repository also has a few background threads and tokio tasks that perform
+background duties like dumping accumulated WAL data from memory to disk, reorganizing
+files for performance (compaction), and garbage collecting old files.
 
 Repository
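The "certain Rust trait" that storage backends implement can be imagined roughly as follows. This is an illustrative sketch only: the real trait in `remote_storage` has a different, richer interface (async, streaming, listing), and `RemoteStorage`/`MemStorage` here are hypothetical names.

```rust
use std::collections::HashMap;

// Illustrative storage-backend abstraction: anything that can upload and
// download named blobs can serve as a backup target.
trait RemoteStorage {
    fn upload(&mut self, path: &str, data: Vec<u8>) -> Result<(), String>;
    fn download(&self, path: &str) -> Result<Vec<u8>, String>;
}

// In-memory stand-in for the "local filesystem" implementation used in tests.
struct MemStorage {
    files: HashMap<String, Vec<u8>>,
}

impl MemStorage {
    fn new() -> Self {
        Self { files: HashMap::new() }
    }
}

impl RemoteStorage for MemStorage {
    fn upload(&mut self, path: &str, data: Vec<u8>) -> Result<(), String> {
        self.files.insert(path.to_string(), data);
        Ok(())
    }
    fn download(&self, path: &str) -> Result<Vec<u8>, String> {
        self.files
            .get(path)
            .cloned()
            .ok_or_else(|| format!("{path} not found"))
    }
}

fn main() {
    // Back up a "layer file" and restore it, as the backup service would.
    let mut storage = MemStorage::new();
    storage.upload("timelines/0001/layer", vec![0xAB; 4]).unwrap();
    assert_eq!(storage.download("timelines/0001/layer").unwrap(), vec![0xAB; 4]);
}
```

An S3-backed implementation would provide the same two operations against a bucket, which is what lets the rest of the pageserver stay agnostic about where backups live.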
@@ -116,48 +158,6 @@ Remove old on-disk layer files that are no longer needed according to the
 PITR retention policy
 
-### Backup service
-
-The backup service, responsible for storing pageserver recovery data externally.
-
-Currently, pageserver stores its files in a filesystem directory it's pointed to.
-That working directory could be rather ephemeral for such cases as "a pageserver pod running in k8s with no persistent volumes attached".
-Therefore, the server interacts with external, more reliable storage to back up and restore its state.
-
-The code for storage support is extensible and can support arbitrary ones as long as they implement a certain Rust trait.
-There are the following implementations present:
-* local filesystem — to use in tests mainly
-* AWS S3 - to use in production
-
-Implementation details are covered in the [backup readme](./src/remote_storage/README.md) and corresponding Rust file docs, parameters documentation can be found at [settings docs](../docs/settings.md).
-
-The backup service is disabled by default and can be enabled to interact with a single remote storage.
-
-CLI examples:
-* Local FS: `${PAGESERVER_BIN} -c "remote_storage={local_path='/some/local/path/'}"`
-* AWS S3: `env AWS_ACCESS_KEY_ID='SOMEKEYAAAAASADSAH*#' AWS_SECRET_ACCESS_KEY='SOMEsEcReTsd292v' ${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/'}"`
-
-For Amazon AWS S3, a key id and secret access key could be located in `~/.aws/credentials` if awscli was ever configured to work with the desired bucket, on the AWS Settings page for a certain user. Also note, that the bucket names does not contain any protocols when used on AWS.
-For local S3 installations, refer to the their documentation for name format and credentials.
-
-Similar to other pageserver settings, toml config file can be used to configure either of the storages as backup targets.
-Required sections are:
-
-```toml
-[remote_storage]
-local_path = '/Users/someonetoignore/Downloads/tmp_dir/'
-```
-
-or
-
-```toml
-[remote_storage]
-bucket_name = 'some-sample-bucket'
-bucket_region = 'eu-north-1'
-prefix_in_bucket = '/test_prefix/'
-```
-
-`AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` env variables can be used to specify the S3 credentials if needed.
 
 TODO: Sharding
 --------------------
docs/pageserver-storage.md
@@ -1,4 +1,4 @@
-# Overview
+# Pageserver storage
 
 The main responsibility of the Page Server is to process the incoming WAL, and
 reprocess it into a format that allows reasonably quick access to any page
26 docs/pageserver-thread-mgmt.md Normal file
@@ -0,0 +1,26 @@
+## Thread management
+
+Each thread in the system is tracked by the `thread_mgr` module. It
+maintains a registry of threads, and which tenant or timeline they are
+operating on. This is used for safe shutdown of a tenant, or the whole
+system.
+
+### Handling shutdown
+
+When a tenant or timeline is deleted, we need to shut down all threads
+operating on it, before deleting the data on disk. A thread registered
+in the thread registry can check if it has been requested to shut down,
+by calling `is_shutdown_requested()`. For async operations, there's also
+a `shutdown_watcher()` async task that can be used to wake up on shutdown.
+
+### Sync vs async
+
+The primary programming model in the page server is synchronous,
+blocking code. However, there are some places where async code is
+used. Be very careful when mixing sync and async code.
+
+Async is primarily used to wait for incoming data on network
+connections. For example, all WAL receivers have a shared thread pool,
+with one async Task for each connection. Once a piece of WAL has been
+received from the network, the thread calls the blocking functions in
+the Repository to process the WAL.
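The cooperative-shutdown pattern this file describes can be sketched with a shared flag. This is a minimal illustration, not the actual `thread_mgr` API: `spawn_worker` is a hypothetical name, and the flag poll stands in for `is_shutdown_requested()`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// A worker does bounded chunks of work and re-checks the shared shutdown
// flag between chunks, analogous to a registered thread polling
// `is_shutdown_requested()`.
fn spawn_worker(shutdown: Arc<AtomicBool>) -> thread::JoinHandle<u64> {
    thread::spawn(move || {
        let mut chunks_done = 0u64;
        while !shutdown.load(Ordering::Relaxed) {
            chunks_done += 1; // one bounded chunk of work
            thread::yield_now();
        }
        chunks_done // exits promptly once shutdown is requested
    })
}

fn main() {
    let shutdown = Arc::new(AtomicBool::new(false));
    let worker = spawn_worker(Arc::clone(&shutdown));
    shutdown.store(true, Ordering::Relaxed); // e.g. the tenant is being deleted
    worker.join().unwrap(); // only after join returns is it safe to delete on-disk data
}
```

Joining every registered worker before touching the data on disk is exactly why the registry tracks which tenant or timeline each thread operates on.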
77 docs/pageserver-walredo.md Normal file
@@ -0,0 +1,77 @@
+# WAL Redo
+
+To reconstruct a particular page version from an image of the page and
+some WAL records, the pageserver needs to replay the WAL records. This
+happens on-demand, when a GetPage@LSN request comes in, or as part of
+background jobs that reorganize data for faster access.
+
+It's important that data cannot leak from one tenant to another, and
+that a corrupt WAL record on one timeline doesn't affect other tenants
+or timelines.
+
+## Multi-tenant security
+
+If you have direct access to the WAL directory, or if you have
+superuser access to a running PostgreSQL server, it's easy to
+construct a malicious or corrupt WAL record that causes the WAL redo
+functions to crash, or to execute arbitrary code. That is not a
+security problem for PostgreSQL; if you have superuser access, you
+have full access to the system anyway.
+
+The Neon pageserver, however, is multi-tenant. It needs to execute WAL
+belonging to different tenants in the same system, and malicious WAL
+in one tenant must not affect other tenants.
+
+A separate WAL redo process is launched for each tenant, and the
+process uses the seccomp(2) system call to restrict its access to the
+bare minimum needed to replay WAL records. The process does not have
+access to the filesystem or network. It can only communicate with the
+parent pageserver process through a pipe.
+
+If an attacker creates a malicious WAL record and injects it into the
+WAL stream of a timeline, he can take control of the WAL redo process
+in the pageserver. However, the WAL redo process cannot access the
+rest of the system. And because there is a separate WAL redo process
+for each tenant, the hijacked WAL redo process can only see WAL and
+data belonging to the same tenant, which the attacker would have
+access to anyway.
+
+## WAL-redo process communication
+
+The WAL redo process runs the 'postgres' executable, launched with a
+Neon-specific command-line option to put it into WAL-redo process
+mode. The pageserver controls the lifetime of the WAL redo processes,
+launching them as needed. If a tenant is detached from the pageserver,
+any WAL redo processes for that tenant are killed.
+
+The pageserver communicates with each WAL redo process over its
+stdin/stdout/stderr. It works in a request-response model with a simple
+custom protocol, described in walredo.rs. To replay a set of WAL
+records for a page, the pageserver sends the "before" image of the
+page and the WAL records over 'stdin', followed by a command to
+perform the replay. The WAL redo process responds with an "after"
+image of the page.
+
+## Special handling of some records
+
+Some WAL record types are handled directly in the pageserver, by
+bespoke Rust code, and are not sent over to the WAL redo process.
+This includes SLRU-related WAL records, like commit records. SLRUs
+don't use the standard Postgres buffer manager, so dealing with them
+in the Neon WAL redo mode would require quite a few changes to
+Postgres code and special handling in the protocol anyway.
+
+Some record types that include a full-page image (e.g. XLOG_FPI) are
+also handled specially when the incoming WAL is processed, and are
+stored as page images rather than WAL records.
+
+## Records that modify multiple pages
+
+Some Postgres WAL records modify multiple pages. Such WAL records are
+duplicated, so that a copy is stored for each affected page. This is
+somewhat wasteful, but because most WAL records only affect one page,
+the overhead is acceptable.
+
+WAL redo always happens for one particular page. If the WAL record
+contains changes to other pages, they are ignored.
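The request-response exchange over the child's stdin/stdout can be sketched schematically. This is not the actual walredo.rs protocol: `cat` stands in for the WAL redo process (assuming a Unix-like environment where `cat` is on `PATH`), and `roundtrip` is a hypothetical helper; the real protocol frames page images, WAL records, and a replay command.

```rust
use std::io::{Read, Write};
use std::process::{Command, Stdio};

// Send a request down a child process's stdin and read the response from
// its stdout, mirroring the pageserver <-> WAL redo process pipe.
fn roundtrip(request: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut child = Command::new("cat") // stand-in for 'postgres' in WAL-redo mode
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Write the "request" (for WAL redo: before-image + WAL records + command).
    // Dropping the stdin handle closes the pipe, signalling end of input.
    child.stdin.take().unwrap().write_all(request)?;

    // Read the "response" (for WAL redo: the after-image of the page).
    let mut response = Vec::new();
    child.stdout.take().unwrap().read_to_end(&mut response)?;
    child.wait()?;
    Ok(response)
}

fn main() {
    let page = vec![0u8; 8192]; // stand-in for an 8 KB page image
    let reply = roundtrip(&page).unwrap();
    assert_eq!(reply.len(), 8192);
}
```

The real pageserver keeps the process alive across many requests instead of closing stdin each time, which is why the protocol needs explicit message framing.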
11 docs/pageserver.md Normal file
@@ -0,0 +1,11 @@
+# Page server architecture
+
+The Page Server has a few different duties:
+
+- Respond to GetPage@LSN requests from the Compute Nodes
+- Receive WAL from WAL safekeeper, and store it
+- Upload data to S3 to make it durable, download files from S3 as needed
+
+S3 is the main fault-tolerant storage of all data, as there are no Page Server
+replicas. We use a separate fault-tolerant WAL service to reduce latency. It
+keeps track of WAL records which are not synced to S3 yet.
8 docs/separation-compute-storage.md Normal file
@@ -0,0 +1,8 @@
+# Separation of Compute and Storage
+
+TODO:
+
+- Read path
+- Write path
+- Durability model
+- API auth