Reorganize, expand, improve internal documentation

Reorganize existing READMEs and other documentation files into mdbook format. The resulting Table of Contents is a mix of placeholders for docs that we should write, and documentation files that we already had, dropped into the most appropriate place. Update the Pageserver overview diagram. Add sections on thread management and WAL redo processes. Add all the RFCs to the mdbook Table of Content too. Per github issue #1979
2026-05-26 17:40:37 +00:00 · 2022-07-18 13:28:39 +03:00
parent a69fdb0e8e
commit 0b14fdb078
17 changed files with 326 additions and 93 deletions
--- a/docs/.gitignore
+++ b/docs/.gitignore
@@ -0,0 +1 @@
+book
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,14 +0,0 @@
-# Zenith documentation
-
-## Table of contents
-
- [authentication.md](authentication.md) — pageserver JWT authentication.
- [docker.md](docker.md) — Docker images and building pipeline.
- [glossary.md](glossary.md) — Glossary of all the terms used in codebase.
- [multitenancy.md](multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI.
- [sourcetree.md](sourcetree.md) — Overview of the source tree layout.
- [pageserver/README.md](/pageserver/README.md) — pageserver overview.
- [postgres_ffi/README.md](/libs/postgres_ffi/README.md) — Postgres FFI overview.
- [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview.
- [safekeeper/README.md](/safekeeper/README.md) — WAL service overview.
- [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -0,0 +1,84 @@
+# Summary
+
+[Introduction]()
+- [Separation of Compute and Storage](./separation-compute-storage.md)
+
+# Architecture
+
+- [Compute]()
+  - [WAL proposer]()
+  - [WAL Backpressure]()
+  - [Postgres changes](./core_changes.md)
+
+- [Pageserver](./pageserver.md)
+    - [Services](./pageserver-services.md)
+    - [Thread management](./pageserver-thread-mgmt.md)
+    - [WAL Redo](./pageserver-walredo.md)
+    - [Page cache](./pageserver-pagecache.md)
+    - [Storage](./pageserver-storage.md)
+        - [Datadir mapping]()
+        - [Layer files]()
+        - [Branching]()
+        - [Garbage collection]()
+    - [Cloud Storage]()
+    - [Processing a GetPage request](./pageserver-processing-getpage.md)
+    - [Processing WAL](./pageserver-processing-wal.md)
+	- [Management API]()
+	- [Tenant Rebalancing]()
+
+- [WAL Service](walservice.md)
+  - [Consensus protocol](safekeeper-protocol.md)
+  - [Management API]()
+  - [Rebalancing]()
+
+- [Control Plane]()
+
+- [Proxy]()
+
+- [Source view](./sourcetree.md)
+  - [docker.md](./docker.md) — Docker images and building pipeline.
+  - [Error handling and logging]()
+  - [Testing]()
+    - [Unit testing]()
+    - [Integration testing]()
+    - [Benchmarks]()
+
+
+- [Glossary](./glossary.md)
+
+# Uncategorized
+
+- [authentication.md](./authentication.md)
+- [multitenancy.md](./multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI.
+- [settings.md](./settings.md)
+#FIXME: move these under sourcetree.md
+#- [pageserver/README.md](/pageserver/README.md)
+#- [postgres_ffi/README.md](/libs/postgres_ffi/README.md)
+#- [test_runner/README.md](/test_runner/README.md)
+#- [safekeeper/README.md](/safekeeper/README.md)
+
+
+# RFCs
+
+- [RFCs](./rfcs/README.md)
+
+- [002-storage](rfcs/002-storage.md)
+- [003-laptop-cli](rfcs/003-laptop-cli.md)
+- [004-durability](rfcs/004-durability.md)
+- [005-zenith_local](rfcs/005-zenith_local.md)
+- [006-laptop-cli-v2-CLI](rfcs/006-laptop-cli-v2-CLI.md)
+- [006-laptop-cli-v2-repository-structure](rfcs/006-laptop-cli-v2-repository-structure.md)
+- [007-serverless-on-laptop](rfcs/007-serverless-on-laptop.md)
+- [008-push-pull](rfcs/008-push-pull.md)
+- [009-snapshot-first-storage-cli](rfcs/009-snapshot-first-storage-cli.md)
+- [009-snapshot-first-storage](rfcs/009-snapshot-first-storage.md)
+- [009-snapshot-first-storage-pitr](rfcs/009-snapshot-first-storage-pitr.md)
+- [010-storage_details](rfcs/010-storage_details.md)
+- [011-retention-policy](rfcs/011-retention-policy.md)
+- [012-background-tasks](rfcs/012-background-tasks.md)
+- [013-term-history](rfcs/013-term-history.md)
+- [014-safekeepers-gossip](rfcs/014-safekeepers-gossip.md)
+- [014-storage-lsm](rfcs/014-storage-lsm.md)
+- [015-storage-messaging](rfcs/015-storage-messaging.md)
+- [016-connection-routing](rfcs/016-connection-routing.md)
+- [cluster-size-limits](rfcs/cluster-size-limits.md)
--- a/docs/book.toml
+++ b/docs/book.toml
@@ -0,0 +1,5 @@
+[book]
+language = "en"
+multilingual = false
+src = "."
+title = "Neon architecture"
--- a/docs/core_changes.md
+++ b/docs/core_changes.md
@@ -1,3 +1,12 @@
+# Postgres core changes
+
+This lists all the changes that have been made to the PostgreSQL
+source tree, as a somewhat logical set of patches. The long-term goal
+is to eliminate all these changes, by submitting patches to upstream
+and refactoring code into extensions, so that you can run unmodified
+PostgreSQL against Neon storage.
+
+
 1. Add t_cid to XLOG record
 - Why?
  The cmin/cmax on a heap page is a real bummer. I don't see any other way to fix that than bite the bullet and modify the WAL-logging routine to include the cmin/cmax.
--- a/docs/pageserver-page-service.md
+++ b/docs/pageserver-page-service.md
@@ -0,0 +1,9 @@
+# Page Service
+
+The Page Service listens for GetPage@LSN requests from the Compute Nodes,
+and responds with pages from the repository. On each GetPage@LSN request,
+it calls into the Repository function
+
+A separate thread is spawned for each incoming connection to the page
+service. The page service uses the libpq protocol to communicate with
+the client. The client is a Compute Postgres instance.
--- a/docs/pageserver-pagecache.md
+++ b/docs/pageserver-pagecache.md
@@ -0,0 +1,8 @@
+# Page cache
+
+TODO:
+
+- shared across tenants
+- store pages from layer files
+- store pages from "in-memory layer"
+- store materialized pages
--- a/docs/pageserver-processing-getpage.md
+++ b/docs/pageserver-processing-getpage.md
@@ -0,0 +1,4 @@
+# Processing a GetPage request
+
+TODO:
+- sequence diagram that shows how a GetPage@LSN request is processed
--- a/docs/pageserver-processing-wal.md
+++ b/docs/pageserver-processing-wal.md
@@ -0,0 +1,5 @@
+# Processing WAL
+
+TODO:
+- diagram that shows how incoming WAL is processed
+- explain durability, what is fsync'd when, disk_consistent_lsn
--- a/docs/pageserver-services.md
+++ b/docs/pageserver-services.md
@@ -0,0 +1,165 @@
+# Services
+
+The Page Server consists of multiple threads that operate on a shared
+repository of page versions:
+```
+                                           | WAL
+                                           V
+                                   +--------------+
+                                   |              |
+                                   | WAL receiver |
+                                   |              |
+                                   +--------------+
+                                                                                 ......
+                  +---------+                              +--------+            .    .
+                  |         |                              |        |            .    .
+ GetPage@LSN      |         |                              | backup |  ------->  . S3 .
+------------->    |  Page   |         repository           |        |            .    .
+                  | Service |                              +--------+            .    .
+   page           |         |                                                    ......
+<-------------    |         |
+                  +---------+     +-----------+     +--------------------+
+                                  | WAL redo  |     | Checkpointing,     |
+                  +----------+    | processes |     | Garbage collection |
+                  |          |    +-----------+     +--------------------+
+                  |   HTTP   |
+                  | mgmt API |
+                  |          |
+                  +----------+
+
+Legend:
+
+--+
+|  |   A thread or multi-threaded service
+--+
+
+--->   Data flow
+<---
+```
+
+## Page Service
+
+The Page Service listens for GetPage@LSN requests from the Compute Nodes,
+and responds with pages from the repository. On each GetPage@LSN request,
+it calls into the Repository function
+
+A separate thread is spawned for each incoming connection to the page
+service. The page service uses the libpq protocol to communicate with
+the client. The client is a Compute Postgres instance.
+
+## WAL Receiver
+
+The WAL receiver connects to the external WAL safekeeping service
+using PostgreSQL physical streaming replication, and continuously
+receives WAL. It decodes the WAL records, and stores them to the
+repository.
+
+
+## Backup service
+
+The backup service, responsible for storing pageserver recovery data externally.
+
+Currently, pageserver stores its files in a filesystem directory it's pointed to.
+That working directory could be rather ephemeral for such cases as "a pageserver pod running in k8s with no persistent volumes attached".
+Therefore, the server interacts with external, more reliable storage to back up and restore its state.
+
+The code for storage support is extensible and can support arbitrary ones as long as they implement a certain Rust trait.
+There are the following implementations present:
+* local filesystem — to use in tests mainly
+* AWS S3           - to use in production
+
+Implementation details are covered in the [backup readme](./src/remote_storage/README.md) and corresponding Rust file docs, parameters documentation can be found at [settings docs](../docs/settings.md).
+
+The backup service is disabled by default and can be enabled to interact with a single remote storage.
+
+CLI examples:
+* Local FS: `${PAGESERVER_BIN} -c "remote_storage={local_path='/some/local/path/'}"`
+* AWS S3  : `env AWS_ACCESS_KEY_ID='SOMEKEYAAAAASADSAH*#' AWS_SECRET_ACCESS_KEY='SOMEsEcReTsd292v' ${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/'}"`
+
+For Amazon AWS S3, a key id and secret access key could be located in `~/.aws/credentials` if awscli was ever configured to work with the desired bucket, on the AWS Settings page for a certain user. Also note, that the bucket names does not contain any protocols when used on AWS.
+For local S3 installations, refer to the their documentation for name format and credentials.
+
+Similar to other pageserver settings, toml config file can be used to configure either of the storages as backup targets.
+Required sections are:
+
+```toml
+[remote_storage]
+local_path = '/Users/someonetoignore/Downloads/tmp_dir/'
+```
+
+or
+
+```toml
+[remote_storage]
+bucket_name = 'some-sample-bucket'
+bucket_region = 'eu-north-1'
+prefix_in_bucket = '/test_prefix/'
+```
+
+`AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` env variables can be used to specify the S3 credentials if needed.
+
+
+## Repository background tasks
+
+The Repository also has a few different background threads and tokio tasks that perform
+background duties like dumping accumulated WAL data from memory to disk, reorganizing
+files for performance (compaction), and garbage collecting old files.
+
+
+Repository
+----------
+
+The repository stores all the page versions, or WAL records needed to
+reconstruct them. Each tenant has a separate Repository, which is
+stored in the .neon/tenants/<tenantid> directory.
+
+Repository is an abstract trait, defined in `repository.rs`. It is
+implemented by the LayeredRepository object in
+`layered_repository.rs`. There is only that one implementation of the
+Repository trait, but it's still a useful abstraction that keeps the
+interface for the low-level storage functionality clean. The layered
+storage format is described in layered_repository/README.md.
+
+Each repository consists of multiple Timelines. Timeline is a
+workhorse that accepts page changes from the WAL, and serves
+get_page_at_lsn() and get_rel_size() requests. Note: this has nothing
+to do with PostgreSQL WAL timeline. The term "timeline" is mostly
+interchangeable with "branch", there is a one-to-one mapping from
+branch to timeline. A timeline has a unique ID within the tenant,
+represented as 16-byte hex string that never changes, whereas a
+branch is a user-given name for a timeline.
+
+Each repository also has a WAL redo manager associated with it, see
+`walredo.rs`. The WAL redo manager is used to replay PostgreSQL WAL
+records, whenever we need to reconstruct a page version from WAL to
+satisfy a GetPage@LSN request, or to avoid accumulating too much WAL
+for a page. The WAL redo manager uses a Postgres process running in
+special Neon wal-redo mode to do the actual WAL redo, and
+communicates with the process using a pipe.
+
+
+Checkpointing / Garbage Collection
+----------------------------------
+
+Periodically, the checkpointer thread wakes up and performs housekeeping
+duties on the repository. It has two duties:
+
+### Checkpointing
+
+Flush WAL that has accumulated in memory to disk, so that the old WAL
+can be truncated away in the WAL safekeepers. Also, to free up memory
+for receiving new WAL. This process is called "checkpointing". It's
+similar to checkpointing in PostgreSQL or other DBMSs, but in the page
+server, checkpointing happens on a per-segment basis.
+
+### Garbage collection
+
+Remove old on-disk layer files that are no longer needed according to the
+PITR retention policy
+
+
+
+TODO: Sharding
+--------------------
+
+We should be able to run multiple Page Servers that handle sharded data.
--- a/docs/pageserver-storage.md
+++ b/docs/pageserver-storage.md
@@ -0,0 +1,518 @@
+# Pageserver storage
+
+The main responsibility of the Page Server is to process the incoming WAL, and
+reprocess it into a format that allows reasonably quick access to any page
+version. The page server slices the incoming WAL per relation and page, and
+packages the sliced WAL into suitably-sized "layer files". The layer files
+contain all the history of the database, back to some reasonable retention
+period. This system replaces the base backups and the WAL archive used in a
+traditional PostgreSQL installation. The layer files are immutable, they are not
+modified in-place after creation. New layer files are created for new incoming
+WAL, and old layer files are removed when they are no longer needed.
+
+The on-disk format is based on immutable files. The page server receives a
+stream of incoming WAL, parses the WAL records to determine which pages they
+apply to, and accumulates the incoming changes in memory. Whenever enough WAL
+has been accumulated in memory, it is written out to a new immutable file. That
+process accumulates "L0 delta files" on disk. When enough L0 files have been
+accumulated, they are merged and re-partitioned into L1 files, and old files
+that are no longer needed are removed by Garbage Collection (GC).
+
+The incoming WAL contains updates to arbitrary pages in the system. The
+distribution depends on the workload: the updates could be totally random, or
+there could be a long stream of updates to a single relation when data is bulk
+loaded, for example, or something in between.
+
+```
+Cloud Storage                   Page Server                           Safekeeper
+                        L1               L0             Memory            WAL
+
+----+               +----+----+
+|AAAA|               |AAAA|AAAA|      +---+-----+         |
+----+               +----+----+      |   |     |         |AA
+|BBBB|               |BBBB|BBBB|      |BB | AA  |         |BB
+----+----+          +----+----+      |C  | BB  |         |CC
+|CCCC|CCCC|  <----   |CCCC|CCCC| <--- |D  | CC  |  <---   |DDD     <----   ADEBAABED
+----+----+          +----+----+      |   | DDD |         |E
+|DDDD|DDDD|          |DDDD|DDDD|      |E  |     |         |
+----+----+          +----+----+      |   |     |
+|EEEE|               |EEEE|EEEE|      +---+-----+
+----+               +----+----+
+```
+
+In this illustration, WAL is received as a stream from the Safekeeper, from the
+right.  It is immediately captured by the page server and stored quickly in
+memory. The page server memory can be thought of as a quick "reorder buffer",
+used to hold the incoming WAL and reorder it so that we keep the WAL records for
+the same page and relation close to each other.
+
+From the page server memory, whenever enough WAL has been accumulated, it is flushed
+to disk into a new L0 layer file, and the memory is released.
+
+When enough L0 files have been accumulated, they are merged together and sliced
+per key-space, producing a new set of files where each file contains a more
+narrow key range, but larger LSN range.
+
+From the local disk, the layers are further copied to Cloud Storage, for
+long-term archival. After a layer has been copied to Cloud Storage, it can be
+removed from local disk, although we currently keep everything locally for fast
+access. If a layer is needed that isn't found locally, it is fetched from Cloud
+Storage and stored in local disk. L0 and L1 files are both uploaded to Cloud
+Storage.
+
+# Layer map
+
+The LayerMap tracks what layers exist in a timeline.
+
+Currently, the layer map is just a resizeable array (Vec). On a GetPage@LSN or
+other read request, the layer map scans through the array to find the right layer
+that contains the data for the requested page. The read-code in LayeredTimeline
+is aware of the ancestor, and returns data from the ancestor timeline if it's
+not found on the current timeline.
+
+# Different kinds of layers
+
+A layer can be in different states:
+
+- Open - a layer where new WAL records can be appended to.
+- Closed - a layer that is read-only, no new WAL records can be appended to it
+- Historic: synonym for closed
+- InMemory: A layer that needs to be rebuilt from WAL on pageserver start.
+To avoid OOM errors, InMemory layers can be spilled to disk into ephemeral file.
+- OnDisk: A layer that is stored on disk. If its end-LSN is older than
+  disk_consistent_lsn, it is known to be fully flushed and fsync'd to local disk.
+- Frozen layer: an in-memory layer that is Closed.
+
+TODO: Clarify the difference between Closed, Historic and Frozen.
+
+There are two kinds of OnDisk layers:
+- ImageLayer represents a snapshot of all the keys in a particular range, at one
+  particular LSN. Any keys that are not present in the ImageLayer are known not
+  to exist at that LSN.
+- DeltaLayer represents a collection of WAL records or page images in a range of
+  LSNs, for a range of keys.
+
+# Layer life cycle
+
+LSN range defined by start_lsn and end_lsn:
+- start_lsn is inclusive.
+- end_lsn is exclusive.
+
+For an open in-memory layer, the end_lsn is MAX_LSN. For a frozen in-memory
+layer or a delta layer, it is a valid end bound. An image layer represents
+snapshot at one LSN, so end_lsn is always the snapshot LSN + 1
+
+Every layer starts its life as an Open In-Memory layer. When the page server
+receives the first WAL record for a timeline, it creates a new In-Memory layer
+for it, and puts it to the layer map. Later, when the layer becomes full, its
+contents are written to disk, as an on-disk layers.
+
+Flushing a layer is a two-step process: First, the layer is marked as closed, so
+that it no longer accepts new WAL records, and a new in-memory layer is created
+to hold any WAL after that point. After this first step, the layer is a Closed
+InMemory state. This first step is called "freezing" the layer.
+
+In the second step, a new Delta layers is created, containing all the data from
+the Frozen InMemory layer. When it has been created and flushed to disk, the
+original frozen layer is replaced with the new layers in the layer map, and the
+original frozen layer is dropped, releasing the memory.
+
+# Layer files (On-disk layers)
+
+The files are called "layer files". Each layer file covers a range of keys, and
+a range of LSNs (or a single LSN, in case of image layers). You can think of it
+as a rectangle in the two-dimensional key-LSN space. The layer files for each
+timeline are stored in the timeline's subdirectory under
+`.neon/tenants/<tenantid>/timelines`.
+
+There are two kind of layer files: images, and delta layers. An image file
+contains a snapshot of all keys at a particular LSN, whereas a delta file
+contains modifications to a segment - mostly in the form of WAL records - in a
+range of LSN.
+
+image file:
+
+```
+    000000067F000032BE0000400000000070B6-000000067F000032BE0000400000000080B6__00000000346BC568
+              start key                          end key                           LSN
+```
+
+
+The first parts define the key range that the layer covers. See
+pgdatadir_mapping.rs for how the key space is used. The last part is the LSN.
+
+delta file:
+
+Delta files are named similarly, but they cover a range of LSNs:
+
+```
+    000000067F000032BE0000400000000020B6-000000067F000032BE0000400000000030B6__000000578C6B29-0000000057A50051
+              start key                          end key                          start LSN     end LSN
+```
+
+A delta file contains all the key-values in the key-range that were updated in
+the LSN range. If a key has not been modified, there is no trace of it in the
+delta layer.
+
+
+A delta layer file can cover a part of the overall key space, as in the previous
+example, or the whole key range like this:
+
+```
+    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000578C6B29-0000000057A50051
+```
+
+A file that covers the whole key range is called a L0 file (Level 0), while a
+file that covers only part of the key range is called a L1 file. The "level" of
+a file is not explicitly stored anywhere, you can only distinguish them by
+looking at the key range that a file covers. The read-path doesn't need to
+treat L0 and L1 files any differently.
+
+
+## Notation used in this document
+
+FIXME: This is somewhat obsolete, the layer files cover a key-range rather than
+a particular relation nowadays. However, the description on how you find a page
+version, and how branching and GC works is still valid.
+
+The full path of a delta file looks like this:
+
+```
+    .neon/tenants/941ddc8604413b88b3d208bddf90396c/timelines/4af489b06af8eed9e27a841775616962/rel_1663_13990_2609_0_10_000000000169C348_0000000001702000
+```
+
+For simplicity, the examples below use a simplified notation for the
+paths.  The tenant ID is left out, the timeline ID is replaced with
+the human-readable branch name, and spcnode+dbnode+relnode+forkum+segno
+with a human-readable table name. The LSNs are also shorter. For
+example, a base image file at LSN 100 and a delta file between 100-200
+for 'orders' table on 'main' branch is represented like this:
+
+```
+    main/orders_100
+    main/orders_100_200
+```
+
+
+# Creating layer files
+
+Let's start with a simple example with a system that contains one
+branch called 'main' and two tables, 'orders' and 'customers'. The end
+of WAL is currently at LSN 250. In this starting situation, you would
+have these files on disk:
+
+```
+	main/orders_100
+	main/orders_100_200
+	main/orders_200
+	main/customers_100
+	main/customers_100_200
+	main/customers_200
+```
+
+In addition to those files, the recent changes between LSN 200 and the
+end of WAL at 250 are kept in memory. If the page server crashes, the
+latest records between 200-250 need to be re-read from the WAL.
+
+Whenever enough WAL has been accumulated in memory, the page server
+writes out the changes in memory into new layer files. This process
+is called "checkpointing" (not to be confused with the PostgreSQL
+checkpoints, that's a different thing). The page server only creates
+layer files for relations that have been modified since the last
+checkpoint. For example, if the current end of WAL is at LSN 450, and
+the last checkpoint happened at LSN 400 but there hasn't been any
+recent changes to 'customers' table, you would have these files on
+disk:
+
+	main/orders_100
+	main/orders_100_200
+	main/orders_200
+	main/orders_200_300
+	main/orders_300
+	main/orders_300_400
+	main/orders_400
+	main/customers_100
+	main/customers_100_200
+	main/customers_200
+
+If the customers table is modified later, a new file is created for it
+at the next checkpoint. The new file will cover the "gap" from the
+last layer file, so the LSN ranges are always contiguous:
+
+```
+	main/orders_100
+	main/orders_100_200
+	main/orders_200
+	main/orders_200_300
+	main/orders_300
+	main/orders_300_400
+	main/orders_400
+	main/customers_100
+	main/customers_100_200
+	main/customers_200
+	main/customers_200_500
+	main/customers_500
+```
+
+## Reading page versions
+
+Whenever a GetPage@LSN request comes in from the compute node, the
+page server needs to reconstruct the requested page, as it was at the
+requested LSN. To do that, the page server first checks the recent
+in-memory layer; if the requested page version is found there, it can
+be returned immediately without looking at the files on
+disk. Otherwise the page server needs to locate the layer file that
+contains the requested page version.
+
+For example, if a request comes in for table 'orders' at LSN 250, the
+page server would load the 'main/orders_200_300' file into memory, and
+reconstruct and return the requested page from it, as it was at
+LSN 250. Because the layer file consists of a full image of the
+relation at the start LSN and the WAL, reconstructing the page
+involves replaying any WAL records applicable to the page between LSNs
+200-250, starting from the base image at LSN 200.
+
+# Multiple branches
+
+Imagine that a child branch is created at LSN 250:
+
+```
+            @250
+    ----main--+-------------------------->
+               \
+                +---child-------------->
+```
+
+
+Then, the 'orders' table is updated differently on the 'main' and
+'child' branches. You now have this situation on disk:
+
+```
+    main/orders_100
+    main/orders_100_200
+    main/orders_200
+    main/orders_200_300
+    main/orders_300
+    main/orders_300_400
+    main/orders_400
+    main/customers_100
+    main/customers_100_200
+    main/customers_200
+    child/orders_250_300
+    child/orders_300
+    child/orders_300_400
+    child/orders_400
+```
+
+Because the 'customers' table hasn't been modified on the child
+branch, there is no file for it there. If you request a page for it on
+the 'child' branch, the page server will not find any layer file
+for it in the 'child' directory, so it will recurse to look into the
+parent 'main' branch instead.
+
+From the 'child' branch's point of view, the history for each relation
+is linear, and the request's LSN identifies unambiguously which file
+you need to look at. For example, the history for the 'orders' table
+on the 'main' branch consists of these files:
+
+```
+    main/orders_100
+    main/orders_100_200
+    main/orders_200
+    main/orders_200_300
+    main/orders_300
+    main/orders_300_400
+    main/orders_400
+```
+
+And from the 'child' branch's point of view, it consists of these
+files:
+
+```
+    main/orders_100
+    main/orders_100_200
+    main/orders_200
+    main/orders_200_300
+    child/orders_250_300
+    child/orders_300
+    child/orders_300_400
+    child/orders_400
+```
+
+The branch metadata includes the point where the child branch was
+created, LSN 250. If a page request comes with LSN 275, we read the
+page version from the 'child/orders_250_300' file. We might also
+need to reconstruct the page version as it was at LSN 250, in order
+to replay the WAL up to LSN 275, using 'main/orders_200_300' and
+'main/orders_200'. The page versions between 250-300 in the
+'main/orders_200_300' file are ignored when operating on the child
+branch.
+
+Note: It doesn't make any difference if the child branch is created
+when the end of the main branch was at LSN 250, or later when the tip of
+the main branch had already moved on. The latter case, creating a
+branch at a historic LSN, is how we support PITR in Zenith.
+
+
+# Garbage collection
+
+In this scheme, we keep creating new layer files over time. We also
+need a mechanism to remove old files that are no longer needed,
+because disk space isn't infinite.
+
+What files are still needed? Currently, the page server supports PITR
+and branching from any branch at any LSN that is "recent enough" from
+the tip of the branch.  "Recent enough" is defined as an LSN horizon,
+which by default is 64 MB.  (See DEFAULT_GC_HORIZON). For this
+example, let's assume that the LSN horizon is 150 units.
+
+Let's look at the single branch scenario again. Imagine that the end
+of the branch is LSN 525, so that the GC horizon is currently at
+525-150 = 375
+
+```
+	main/orders_100
+	main/orders_100_200
+	main/orders_200
+	main/orders_200_300
+	main/orders_300
+	main/orders_300_400
+	main/orders_400
+	main/orders_400_500
+	main/orders_500
+	main/customers_100
+	main/customers_100_200
+	main/customers_200
+```
+
+We can remove the following files because the end LSNs of those files are
+older than GC horizon 375, and there are more recent layer files for the
+table:
+
+```
+	main/orders_100       DELETE
+	main/orders_100_200   DELETE
+	main/orders_200       DELETE
+	main/orders_200_300   DELETE
+	main/orders_300       STILL NEEDED BY orders_300_400
+	main/orders_300_400   KEEP, NEWER THAN GC HORIZON
+	main/orders_400       .. 
+	main/orders_400_500   .. 
+	main/orders_500       .. 
+	main/customers_100      DELETE
+	main/customers_100_200  DELETE
+	main/customers_200      KEEP, NO NEWER VERSION
+```
+
+'main/customers_200' is old enough, but it cannot be
+removed because there is no newer layer file for the table.
+
+Things get slightly more complicated with multiple branches. All of
+the above still holds, but in addition to recent files we must also
+retain older snapshot files that are still needed by child branches.
+For example, if child branch is created at LSN 150, and the 'customers'
+table is updated on the branch, you would have these files:
+
+```
+	main/orders_100        KEEP, NEEDED BY child BRANCH
+	main/orders_100_200    KEEP, NEEDED BY child BRANCH
+	main/orders_200        DELETE
+	main/orders_200_300    DELETE
+	main/orders_300        KEEP, NEWER THAN GC HORIZON
+	main/orders_300_400    KEEP, NEWER THAN GC HORIZON
+	main/orders_400        KEEP, NEWER THAN GC HORIZON
+	main/orders_400_500    KEEP, NEWER THAN GC HORIZON
+	main/orders_500        KEEP, NEWER THAN GC HORIZON
+	main/customers_100       DELETE
+	main/customers_100_200   DELETE
+	main/customers_200       KEEP, NO NEWER VERSION
+	child/customers_150_300  DELETE
+	child/customers_300      KEEP, NO NEWER VERSION
+```
+
+In this situation, 'main/orders_100' and 'main/orders_100_200' cannot
+be removed, even though they are older than the GC horizon, because
+they are still needed by the child branch. 'main/orders_200'
+and 'main/orders_200_300' can still be removed.
+
+If 'orders' is modified later on the 'child' branch, we will create a
+new base image and delta file for it on the child:
+
+```
+	main/orders_100
+	main/orders_100_200
+
+	main/orders_300
+	main/orders_300_400
+	main/orders_400
+	main/orders_400_500
+	main/orders_500
+	main/customers_200
+	child/customers_300
+	child/orders_150_400
+	child/orders_400
+```
+
+After this, the 'main/orders_100' and 'main/orders_100_200' file could
+be removed. It is no longer needed by the child branch, because there
+is a newer layer file there. TODO: This optimization hasn't been
+implemented! The GC algorithm will currently keep the file on the
+'main' branch anyway, for as long as the child branch exists.
+
+TODO:
+Describe GC and checkpoint interval settings.
+
+# TODO: On LSN ranges
+
+In principle, each relation can be checkpointed separately, i.e. the
+LSN ranges of the files don't need to line up. So this would be legal:
+
+```
+	main/orders_100
+	main/orders_100_200
+	main/orders_200
+	main/orders_200_300
+	main/orders_300
+	main/orders_300_400
+	main/orders_400
+	main/customers_150
+	main/customers_150_250
+	main/customers_250
+	main/customers_250_500
+	main/customers_500
+```
+
+However, the code currently always checkpoints all relations together.
+So that situation doesn't arise in practice.
+
+It would also be OK to have overlapping LSN ranges for the same relation:
+
+	main/orders_100
+	main/orders_100_200
+	main/orders_200
+	main/orders_200_300
+	main/orders_300
+	main/orders_250_350
+	main/orders_350
+	main/orders_300_400
+	main/orders_400
+
+The code that reads the layer files should cope with this, but this
+situation doesn't arise either, because the checkpointing code never
+does that.  It could be useful, however, as a transient state when
+garbage collecting around branch points, or explicit recovery
+points. For example, if we start with this:
+
+```
+	main/orders_100
+	main/orders_100_200
+	main/orders_200
+	main/orders_200_300
+	main/orders_300
+```
+
+And there is a branch or explicit recovery point at LSN 150, we could
+replace 'main/orders_100_200' with 'main/orders_150' to keep a
+layer only at that exact point that's still needed, removing the
+other page versions around it. But such compaction has not been
+implemented yet.
--- a/docs/pageserver-thread-mgmt.md
+++ b/docs/pageserver-thread-mgmt.md
@@ -0,0 +1,26 @@
+## Thread management
+
+Each thread in the system is tracked by the `thread_mgr` module. It
+maintains a registry of threads, and which tenant or timeline they are
+operating on. This is used for safe shutdown of a tenant, or the whole
+system.
+
+### Handling shutdown
+
+When a tenant or timeline is deleted, we need to shut down all threads
+operating on it, before deleting the data on disk. A thread registered
+in the thread registry can check if it has been requested to shut down,
+by calling `is_shutdown_requested()`. For async operations, there's also
+a `shudown_watcher()` async task that can be used to wake up on shutdown.
+
+### Sync vs async
+
+The primary programming model in the page server is synchronous,
+blocking code. However, there are some places where async code is
+used. Be very careful when mixing sync and async code.
+
+Async is primarily used to wait for incoming data on network
+connections. For example, all WAL receivers have a shared thread pool,
+with one async Task for each connection. Once a piece of WAL has been
+received from the network, the thread calls the blocking functions in
+the Repository to process the WAL.
--- a/docs/pageserver-walredo.md
+++ b/docs/pageserver-walredo.md
@@ -0,0 +1,77 @@
+# WAL Redo
+
+To reconstruct a particular page version from an image of the page and
+some WAL records, the pageserver needs to replay the WAL records. This
+happens on-demand, when a GetPage@LSN request comes in, or as part of
+background jobs that reorganize data for faster access.
+
+It's important that data cannot leak from one tenant to another, and
+that a corrupt WAL record on one timeline doesn't affect other tenants
+or timelines.
+
+## Multi-tenant security
+
+If you have direct access to the WAL directory, or if you have
+superuser access to a running PostgreSQL server, it's easy to
+construct a malicious or corrupt WAL record that causes the WAL redo
+functions to crash, or to execute arbitrary code. That is not a
+security problem for PostgreSQL; if you have superuser access, you
+have full access to the system anyway.
+
+The Neon pageserver, however, is multi-tenant. It needs to execute WAL
+belonging to different tenants in the same system, and malicious WAL
+in one tenant must not affect other tenants.
+
+A separate WAL redo process is launched for each tenant, and the
+process uses the seccomp(2) system call to restrict its access to the
+bare minimum needed to replay WAL records. The process does not have
+access to the filesystem or network. It can only communicate with the
+parent pageserver process through a pipe.
+
+If an attacker creates a malicious WAL record and injects it into the
+WAL stream of a timeline, he can take control of the WAL redo process
+in the pageserver. However, the WAL redo process cannot access the
+rest of the system. And because there is a separate WAL redo process
+for each tenant, the hijacked WAL redo process can only see WAL and
+data belonging to the same tenant, which the attacker would have
+access to anyway.
+
+## WAL-redo process communication
+
+The WAL redo process runs the 'postgres' executable, launched with a
+Neon-specific command-line option to put it into WAL-redo process
+mode.  The pageserver controls the lifetime of the WAL redo processes,
+launching them as needed. If a tenant is detached from the pageserver,
+any WAL redo processes for that tenant are killed.
+
+The pageserver communicates with each WAL redo process over its
+stdin/stdout/stderr. It works in request-response model with a simple
+custom protocol, described in walredo.rs. To replay a set of WAL
+records for a page, the pageserver sends the "before" image of the
+page and the WAL records over 'stdin', followed by a command to
+perform the replay. The WAL redo process responds with an "after"
+image of the page.
+
+## Special handling of some records
+
+Some WAL record types are handled directly in the pageserver, by
+bespoken Rust code, and are not sent over to the WAL redo process.
+This includes SLRU-related WAL records, like commit records. SLRUs
+don't use the standard Postgres buffer manager, so dealing with them
+in the Neon WAL redo mode would require quite a few changes to
+Postgres code and special handling in the protocol anyway.
+
+Some record types that include a full-page-image (e.g. XLOG_FPI) are
+also handled specially when incoming WAL is processed already, and are
+stored as page images rather than WAL records.
+
+
+## Records that modify multiple pages
+
+Some Postgres WAL records modify multiple pages. Such WAL records are
+duplicated, so that a copy is stored for each affected page. This is
+somewhat wasteful, but because most WAL records only affect one page,
+the overhead is acceptable.
+
+The WAL redo always happens for one particular page. If the WAL record
+coantains changes to other pages, they are ignored.
--- a/docs/pageserver.md
+++ b/docs/pageserver.md
@@ -0,0 +1,11 @@
+# Page server architecture
+
+The Page Server has a few different duties:
+
+- Respond to GetPage@LSN requests from the Compute Nodes
+- Receive WAL from WAL safekeeper, and store it
+- Upload data to S3 to make it durable, download files from S3 as needed
+
+S3 is the main fault-tolerant storage of all data, as there are no Page Server
+replicas. We use a separate fault-tolerant WAL service to reduce latency. It
+keeps track of WAL records which are not synced to S3 yet.
--- a/docs/safekeeper-protocol.md
+++ b/docs/safekeeper-protocol.md
@@ -0,0 +1,279 @@
+# WAL proposer-safekeeper communication consensus protocol.
+
+## General requirements and architecture
+
+There is single stateless master and several safekeepers. Number of safekeepers is determined by redundancy level.
+To minimize number of changes in Postgres core, we are using standard streaming replication from master (through WAL sender).
+This replication stream is initiated by the WAL proposer process that runs in the PostgreSQL server, which broadcasts the WAL generated by PostgreSQL to safekeepers.
+To provide durability we use synchronous replication at master (response to the commit statement is sent to the client
+only when acknowledged by WAL receiver). WAL proposer sends this acknowledgment only when LSN of commit record is confirmed by quorum of safekeepers.
+
+WAL proposer tries to establish connections with safekeepers.
+At any moment of time each safekeeper can serve exactly once proposer, but it can accept new connections.
+
+Any of safekeepers can be used as WAL server, producing replication stream. So both `Pagers` and `Replicas`
+(read-only computation nodes) can connect to safekeeper to receive WAL stream. Safekeepers is streaming WAL until
+it reaches min(`commitLSN`,`flushLSN`). Then replication is suspended until new data arrives from master.
+
+
+## Handshake
+The goal of handshake is to collect quorum (to be able to perform recovery)
+and avoid split-brains caused by simultaneous presence of old and new master.
+Procedure of handshake consists of the following steps:
+
+1. Broadcast information about server to all safekeepers (wal segment size, system_id,...)
+2. Receive responses with information about safekeepers.
+3. Once quorum of handshake responses are received, propose new `NodeId(max(term)+1, server.uuid)`
+to all of them.
+4. On receiving proposed nodeId, safekeeper compares it with locally stored nodeId and if it is greater or equals
+then accepts proposed nodeId and persists this choice in the local control file.
+5. If quorum of safekeepers approve proposed nodeId, then server assumes that handshake is successfully completed and switch to recovery stage.
+
+## Recovery
+Proposer computes max(`restartLSN`) and max(`flushLSN`) from quorum of attached safekeepers.
+`RestartLSN` - is position in WAL which is known to be delivered to all safekeepers.
+In other words: `restartLSN` can be also considered as cut-off horizon (all preceding WAL segments can be removed).
+`FlushLSN` is position flushed by safekeeper to the local persistent storage.
+
+If max(`restartLSN`) != max(`flushLSN`), then recovery has to be performed.
+Proposer creates replication channel with most advanced safekeeper (safekeeper with the largest `flushLSN`).
+Then it downloads all WAL messages between max(`restartLSN`)..max(`flushLSN`).
+Messages are inserted in L1-list (ordered by LSN). Then we locate position of each safekeeper in this list according
+to their `flushLSN`s. Safekeepers that are not yet connected (out of quorum) should start from the beginning of the list
+(corresponding to `restartLSN`).
+
+We need to choose max(`flushLSN`) because voting quorum may be different from quorum committed the last message.
+So we do not know whether records with max(`flushLSN`) was committed by quorum or not. So we have to consider it committed
+to avoid loose of committed data.
+
+Calculated max(`flushLSN`) is called `VCL` (Volume Complete LSN). As far as it is chosen among quorum, there may be some other offline safekeeper with larger
+`VCL`. Once it becomes online, we need to overwrite its WAL beyond `VCL`. To support it, each safekeeper maintains
+`epoch` number. `Epoch` plays almost the same role as `term`, but algorithm of `epoch` bumping is different.
+`VCL` and new epoch are received by safekeeper from proposer during voting.
+But safekeeper doesn't switch to new epoch immediately after voting.
+Instead of it, safekeepers waits record with LSN > Max(`flushLSN`,`VCL`) is received.
+It means that we restore all records from old generation and switch to new generation.
+When proposer calculates max(`FlushLSN`), it first compares `Epoch`. So actually we compare (`Epoch`,`FlushLSN`) pairs.
+
+Let's looks at the examples. Consider that we have three safekeepers: S1, S2, S3. Si(N) means that i-th safekeeper has epoch=N.
+Ri(x) - WAL record for resource X with LSN=i. Assume that we have the following state:
+
+```
+S1(1): R1(a)
+S2(1): R1(a),R2(b)
+S3(1): R1(a),R2(b),R3(c),R4(d)  - offline
+```
+
+Proposer choose quorum (S1,S2). VCL for them is 2. We download S2 to proposer and schedule its write to S1.
+After receiving record R5 the picture can be:
+
+```
+S1(2): R1(a),R2(b),R3(e)
+S2(2): R1(a),R2(b),R3(e)
+S3(1): R1(a),R2(b),R3(c),R4(d)  - offline
+```
+
+Now if server is crashed or restarted, we perform new voting and
+doesn't matter which quorum we choose: (S1,S2), (S2,S3)...
+in any case VCL=3, because S3 has smaller epoch.
+R3(c) will be overwritten with R3(e):
+
+```
+S1(3): R1(a),R2(b),R3(e)
+S2(3): R1(a),R2(b),R3(e)
+S3(1): R1(a),R2(b),R3(e),R4(d)
+```
+
+Epoch of S3 will be adjusted once it overwrites R4:
+
+```
+S1(3): R1(a),R2(b),R3(e),R4(f)
+S2(3): R1(a),R2(b),R3(e),R4(f)
+S3(3): R1(a),R2(b),R3(e),R4(f)
+```
+
+Crash can happen before epoch was bumped. Let's return back to the initial position:
+
+```
+S1(1): R1(a)
+S2(1): R1(a),R2(b)
+S3(1): R1(a),R2(b),R3(c),R4(d)  - offline
+```
+
+Assume that we start recovery:
+
+```
+S1(1): R1(a),R2(b)
+S2(1): R1(a),R2(b)
+S3(1): R1(a),R2(b),R3(c),R4(d)  - offline
+```
+
+and then crash happens. During voting we choose quorum (S3,S3).
+Now them belong to the same epoch and S3 is most advanced among them.
+So VCL is set to 4 and we recover S1 and S2 from S3:
+
+```
+S1(1): R1(a),R2(b),R3(c),R4(d)
+S2(1): R1(a),R2(b),R3(c),R4(d)
+S3(1): R1(a),R2(b),R3(c),R4(d)
+```
+
+## Main loop
+Once recovery is completed, proposer switches to normal processing loop: it receives WAL stream from Postgres and appends WAL
+messages to the list. At the same time it tries to push messages to safekeepers. Each safekeeper is associated
+with some element in message list and once it acknowledged receiving of the message, position is moved forward.
+Each queue element contains acknowledgment mask, which bits corresponds to safekeepers.
+Once all safekeepers acknowledged receiving of this message (by setting correspondent bit),
+then element can be removed from queue and `restartLSN` is advanced forward.
+
+Proposer maintains `restartLSN` and `commitLSN` based on the responses received by safekeepers.
+`RestartLSN` equals to the LSN of head message in the list. `CommitLSN` is `flushLSN[nSafekeepers-Quorum]` element
+in ordered array with `flushLSN`s of safekeepers. `CommitLSN` and `RestartLSN` are included in requests
+sent from proposer to safekeepers and stored in safekeepers control file.
+To avoid overhead of extra fsync, this control file is not fsynced on each request. Flushing this file is performed
+periodically, which means that `restartLSN`/`commitLSN` stored by safekeeper may be slightly deteriorated.
+It is not critical because may only cause redundant processing of some WAL record.
+And `FlushLSN` is recalculated after node restart by scanning local WAL files.
+
+## Fault tolerance
+If the WAL proposer process looses connection to safekeeper it tries to reestablish this connection using the same nodeId.
+
+Restart of PostgreSQL initiates new round of voting and switching new epoch.
+
+## Limitations
+Right now message queue is maintained in main memory and is not spilled to the disk.
+It can cause memory overflow in case of presence of lagging safekeepers.
+It is assumed that in case of losing local data by some safekeepers, it should be recovered using some external mechanism.
+
+
+## Glossary
+* `CommitLSN`: position in WAL confirmed by quorum safekeepers.
+* `RestartLSN`: position in WAL confirmed by all safekeepers.
+* `FlushLSN`: part of WAL persisted to the disk by safekeeper.
+* `NodeID`: pair (term,UUID)
+* `Pager`: Neon component restoring pages from WAL stream
+* `Replica`: read-only computation node
+* `VCL`: the largest LSN for which we can guarantee availability of all prior records.
+
+## Algorithm
+
+```python
+process WalProposer(safekeepers,server,curr_epoch,restart_lsn=0,message_queue={},feedbacks={})
+    function do_recovery(epoch,restart_lsn,VCL)
+        leader = i:safekeepers[i].state.epoch=epoch and safekeepers[i].state.flushLsn=VCL
+        wal_stream = safekeepers[leader].start_replication(restart_lsn,VCL)
+        do
+            message = wal_stream.read()
+            message_queue.append(message)
+        while message.startPos < VCL
+
+        for i in 1..safekeepers.size()
+            for message in message_queue
+                if message.endLsn < safekeepers[i].state.flushLsn
+                    message.delivered += i
+                else
+                    send_message(i, message)
+                    break
+    end function
+
+    function send_message(i,msg)
+        msg.restartLsn = restart_lsn
+        msg.commitLsn = get_commit_lsn()
+        safekeepers[i].send(msg, response_handler)
+    end function
+
+    function do_broadcast(message)
+        for i in 1..safekeepers.size()
+            if not safekeepers[i].sending()
+                send_message(i, message)
+    end function
+
+    function get_commit_lsn()
+        sorted_feedbacks = feedbacks.sort()
+        return sorted_feedbacks[safekeepers.size() - quorum]
+    end function
+
+    function response_handler(i,message,response)
+        feedbacks[i] = if response.epoch=curr_epoch then response.flushLsn else VCL
+        server.write(get_commit_lsn())
+
+        message.delivered += i
+        next_message = message_queue.next(message)
+        if next_message
+            send_message(i, next_message)
+
+        while message_queue.head.delivered.size() = safekeepers.size()
+            if restart_lsn < message_queue.head.beginLsn
+                restart_lsn = message_queue.head.endLsn
+            message_queue.pop_head()
+    end function
+
+    server_info = server.read()
+
+    safekeepers.write(server_info)
+    safekeepers.state = safekeepers.read()
+    next_term = max(safekeepers.state.nodeId.term)+1
+    restart_lsn = max(safekeepers.state.restartLsn)
+    epoch,VCL = max(safekeepers.state.epoch,safekeepers.state.flushLsn)
+    curr_epoch = epoch + 1
+
+    proposal = Proposal(NodeId(next_term,server.id),curr_epoch,VCL)
+    safekeepers.send(proposal)
+    responses = safekeepers.read()
+    if any responses.is_rejected()
+        exit()
+
+    for i in 1..safekeepers.size()
+        feedbacks[i].flushLsn = if epoch=safekeepers[i].state.epoch then safekeepers[i].state.flushLsn else restart_lsn
+
+    if restart_lsn != VCL
+        do_recovery(epoch,restart_lsn,VCL)
+
+    wal_stream = server.start_replication(VCL)
+    for ever
+        message = wal_stream.read()
+        message_queue.append(message)
+        do_broadcast(message)
+end process
+
+process safekeeper(gateway,state)
+    function handshake()
+        proposer = gateway.accept()
+        server_info = proposer.read()
+        proposer.write(state)
+        proposal = proposer.read()
+        if proposal.nodeId < state.nodeId
+            proposer.write(rejected)
+            return null
+        else
+            state.nodeId = proposal.nodeId
+            state.proposed_epoch = proposal.epoch
+            state.VCL = proposal.VCL
+            write_control_file(state)
+            proposer.write(accepted)
+            return proposer
+    end function
+
+    state = read_control_file()
+    state.flushLsn = locate_end_of_wal()
+
+    for ever
+        proposer = handshake()
+        if not proposer
+            continue
+        for ever
+            req = proposer.read()
+            if req.nodeId != state.nodeId
+                break
+            save_wal_file(req.data)
+            state.restartLsn = req.restartLsn
+            if state.epoch < state.proposed_epoch and req.endPos > max(state.flushLsn,state.VCL)
+                state.epoch = state.proposed_epoch
+            if req.endPos > state.flushLsn
+                state.flushLsn = req.endPos
+            save_control_file(state)
+            resp = Response(state.epoch,req.endPos)
+            proposer.write(resp)
+            notify_wal_sender(Min(req.commitLsn,req.endPos))
+end process
+```
--- a/docs/separation-compute-storage.md
+++ b/docs/separation-compute-storage.md
@@ -0,0 +1,8 @@
+# Separation of Compute and Storage
+
+TODO:
+
+- Read path
+- Write path
+- Durability model
+- API auth
--- a/docs/walservice.md
+++ b/docs/walservice.md
@@ -0,0 +1,125 @@
+# WAL service
+
+The neon WAL service acts as a holding area and redistribution
+center for recently generated WAL. The primary Postgres server streams
+the WAL to the WAL safekeeper, and treats it like a (synchronous)
+replica. A replication slot is used in the primary to prevent the
+primary from discarding WAL that hasn't been streamed to the WAL
+service yet.
+
+```
+--------------+              +------------------+
+|              |     WAL      |                  |
+| Compute node |  ----------> |   WAL Service    |
+|              |              |                  |
+--------------+              +------------------+
+                                     |
+                                     |
+                                     | WAL
+                                     |
+                                     |
+                                     V
+                              +--------------+
+                              |              |
+                              | Pageservers  |
+                              |              |
+                              +--------------+
+```
+
+
+The WAL service consists of multiple WAL safekeepers that all store a
+copy of the WAL. A WAL record is considered durable when the majority
+of safekeepers have received and stored the WAL to local disk. A
+consensus algorithm based on Paxos is used to manage the quorum.
+
+```
+  +-------------------------------------------+
+  | WAL Service                               |
+  |                                           |
+  |                                           |
+  |  +------------+                           |
+  |  | safekeeper |                           |
+  |  +------------+                           |
+  |                                           |
+  |  +------------+                           |
+  |  | safekeeper |                           |
+  |  +------------+                           |
+  |                                           |
+  |  +------------+                           |
+  |  | safekeeper |                           |
+  |  +------------+                           |
+  |                                           |
+  +-------------------------------------------+
+```
+
+The primary connects to the WAL safekeepers, so it works in a "push"
+fashion.  That's different from how streaming replication usually
+works, where the replica initiates the connection. To do that, there
+is a component called the "WAL proposer". The WAL proposer is a
+background worker that runs in the primary Postgres server. It
+connects to the WAL safekeeper, and sends all the WAL. (PostgreSQL's
+archive_commands works in the "push" style, but it operates on a WAL
+segment granularity. If PostgreSQL had a push style API for streaming,
+WAL propose could be implemented using it.)
+
+The Page Server connects to the WAL safekeeper, using the same
+streaming replication protocol that's used between Postgres primary
+and standby. You can also connect the Page Server directly to a
+primary PostgreSQL node for testing.
+
+In a production installation, there are multiple WAL safekeepers
+running on different nodes, and there is a quorum mechanism using the
+Paxos algorithm to ensure that a piece of WAL is considered as durable
+only after it has been flushed to disk on more than half of the WAL
+safekeepers. The Paxos and crash recovery algorithm ensures that only
+one primary node can be actively streaming WAL to the quorum of
+safekeepers.
+
+See README_PROTO.md for a more detailed description of the consensus
+protocol. spec/ contains TLA+ specification of it.
+
+# Q&A
+
+Q: Why have a separate service instead of connecting Page Server directly to a
+   primary PostgreSQL node?
+A: Page Server is a single server which can be lost. As our primary
+   fault-tolerant storage is S3, we do not want to wait for it before
+   committing a transaction. The WAL service acts as a temporary fault-tolerant
+   storage for recent data before it gets to the Page Server and then finally
+   to S3. Whenever WALs and pages are committed to S3, WAL's storage can be
+   trimmed.
+
+Q: What if the compute node evicts a page, needs it back, but the page is yet
+   to reach the Page Server?
+A: If the compute node has evicted a page, changes to it have been WAL-logged
+   (that's why it is called Write Ahead logging; there are some exceptions like
+   index builds, but these are exceptions). These WAL records will eventually
+   reach the Page Server. The Page Server notes that the compute node requests
+   pages with a very recent LSN and will not respond to the compute node until a
+   corresponding WAL is received from WAL safekeepers.
+
+Q: How long may Page Server wait for?
+A: Not too long, hopefully. If a page is evicted, it probably was not used for
+   a while, so the WAL service have had enough time to push changes to the Page
+   Server. To limit the lag, tune backpressure using `max_replication_*_lag` settings.
+
+Q: How do WAL safekeepers communicate with each other?
+A: They may only send each other messages via the compute node, they never
+   communicate directly with each other.
+
+Q: Why have a consensus algorithm if there is only a single compute node?
+A: Actually there may be moments with multiple PostgreSQL nodes running at the
+   same time. E.g. we are bringing one up and one down. We would like to avoid
+   simultaneous writes from different nodes, so there should be a consensus on
+   who is the primary node.
+
+# Terminology
+
+WAL service - The service as whole that ensures that WAL is stored durably.
+
+WAL safekeeper - One node that participates in the quorum. All the safekeepers
+together form the WAL service.
+
+WAL acceptor, WAL proposer - In the context of the consensus algorithm, the Postgres
+compute node is also known as the WAL proposer, and the safekeeper is also known
+as the acceptor. Those are the standard terms in the Paxos algorithm.