Mirror of https://github.com/neondatabase/neon.git, synced 2025-12-22 21:59:59 +00:00
Reorganize, expand, improve internal documentation
Reorganize existing READMEs and other documentation files into mdbook format. The resulting Table of Contents is a mix of placeholders for docs that we should write, and documentation files that we already had, dropped into the most appropriate places. Update the Pageserver overview diagram. Add sections on thread management and WAL redo processes. Add all the RFCs to the mdbook Table of Contents, too. Per GitHub issue #1979.
Committed by Heikki Linnakangas
parent a69fdb0e8e
commit 0b14fdb078
1 docs/.gitignore vendored Normal file
@@ -0,0 +1 @@
+book
docs/README.md
@@ -1,14 +0,0 @@
-# Zenith documentation
-
-## Table of contents
-
-- [authentication.md](authentication.md) — pageserver JWT authentication.
-- [docker.md](docker.md) — Docker images and building pipeline.
-- [glossary.md](glossary.md) — Glossary of all the terms used in codebase.
-- [multitenancy.md](multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI.
-- [sourcetree.md](sourcetree.md) — Overview of the source tree layout.
-- [pageserver/README.md](/pageserver/README.md) — pageserver overview.
-- [postgres_ffi/README.md](/libs/postgres_ffi/README.md) — Postgres FFI overview.
-- [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview.
-- [safekeeper/README.md](/safekeeper/README.md) — WAL service overview.
-- [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core
84 docs/SUMMARY.md Normal file
@@ -0,0 +1,84 @@
+# Summary
+
+[Introduction]()
+- [Separation of Compute and Storage](./separation-compute-storage.md)
+
+# Architecture
+
+- [Compute]()
+  - [WAL proposer]()
+  - [WAL Backpressure]()
+  - [Postgres changes](./core_changes.md)
+- [Pageserver](./pageserver.md)
+  - [Services](./pageserver-services.md)
+  - [Thread management](./pageserver-thread-mgmt.md)
+  - [WAL Redo](./pageserver-walredo.md)
+  - [Page cache](./pageserver-pagecache.md)
+  - [Storage](./pageserver-storage.md)
+    - [Datadir mapping]()
+    - [Layer files]()
+    - [Branching]()
+    - [Garbage collection]()
+    - [Cloud Storage]()
+  - [Processing a GetPage request](./pageserver-processing-getpage.md)
+  - [Processing WAL](./pageserver-processing-wal.md)
+  - [Management API]()
+  - [Tenant Rebalancing]()
+- [WAL Service](walservice.md)
+  - [Consensus protocol](safekeeper-protocol.md)
+  - [Management API]()
+  - [Rebalancing]()
+- [Control Plane]()
+- [Proxy]()
+- [Source view](./sourcetree.md)
+  - [docker.md](./docker.md) — Docker images and building pipeline.
+  - [Error handling and logging]()
+  - [Testing]()
+    - [Unit testing]()
+    - [Integration testing]()
+    - [Benchmarks]()
+- [Glossary](./glossary.md)
+
+# Uncategorized
+
+- [authentication.md](./authentication.md)
+- [multitenancy.md](./multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI.
+- [settings.md](./settings.md)
+
+#FIXME: move these under sourcetree.md
+#- [pageserver/README.md](/pageserver/README.md)
+#- [postgres_ffi/README.md](/libs/postgres_ffi/README.md)
+#- [test_runner/README.md](/test_runner/README.md)
+#- [safekeeper/README.md](/safekeeper/README.md)
+
+# RFCs
+
+- [RFCs](./rfcs/README.md)
+  - [002-storage](rfcs/002-storage.md)
+  - [003-laptop-cli](rfcs/003-laptop-cli.md)
+  - [004-durability](rfcs/004-durability.md)
+  - [005-zenith_local](rfcs/005-zenith_local.md)
+  - [006-laptop-cli-v2-CLI](rfcs/006-laptop-cli-v2-CLI.md)
+  - [006-laptop-cli-v2-repository-structure](rfcs/006-laptop-cli-v2-repository-structure.md)
+  - [007-serverless-on-laptop](rfcs/007-serverless-on-laptop.md)
+  - [008-push-pull](rfcs/008-push-pull.md)
+  - [009-snapshot-first-storage-cli](rfcs/009-snapshot-first-storage-cli.md)
+  - [009-snapshot-first-storage](rfcs/009-snapshot-first-storage.md)
+  - [009-snapshot-first-storage-pitr](rfcs/009-snapshot-first-storage-pitr.md)
+  - [010-storage_details](rfcs/010-storage_details.md)
+  - [011-retention-policy](rfcs/011-retention-policy.md)
+  - [012-background-tasks](rfcs/012-background-tasks.md)
+  - [013-term-history](rfcs/013-term-history.md)
+  - [014-safekeepers-gossip](rfcs/014-safekeepers-gossip.md)
+  - [014-storage-lsm](rfcs/014-storage-lsm.md)
+  - [015-storage-messaging](rfcs/015-storage-messaging.md)
+  - [016-connection-routing](rfcs/016-connection-routing.md)
+  - [cluster-size-limits](rfcs/cluster-size-limits.md)
5 docs/book.toml Normal file
@@ -0,0 +1,5 @@
+[book]
+language = "en"
+multilingual = false
+src = "."
+title = "Neon architecture"
docs/core_changes.md
@@ -1,3 +1,12 @@
+# Postgres core changes
+
+This lists all the changes that have been made to the PostgreSQL
+source tree, as a somewhat logical set of patches. The long-term goal
+is to eliminate all these changes, by submitting patches to upstream
+and refactoring code into extensions, so that you can run unmodified
+PostgreSQL against Neon storage.
+
 1. Add t_cid to XLOG record
 - Why?
 The cmin/cmax on a heap page is a real bummer. I don't see any other way to fix that than bite the bullet and modify the WAL-logging routine to include the cmin/cmax.
9 docs/pageserver-page-service.md Normal file
@@ -0,0 +1,9 @@
+# Page Service
+
+The Page Service listens for GetPage@LSN requests from the Compute Nodes,
+and responds with pages from the repository. On each GetPage@LSN request,
+it calls into the Repository function
+
+A separate thread is spawned for each incoming connection to the page
+service. The page service uses the libpq protocol to communicate with
+the client. The client is a Compute Postgres instance.
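The per-request flow described in this file can be sketched with a toy wire format. This is purely illustrative: the real page service speaks the libpq protocol and exchanges richer messages, and `encode_getpage`/`decode_getpage` are hypothetical names, not pageserver APIs.

```rust
// Toy framing for a GetPage@LSN-style request: a block number and an LSN,
// each as a big-endian u64. Illustrative only; the real page service
// carries its requests over the libpq protocol.
fn encode_getpage(blkno: u64, lsn: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(16);
    buf.extend_from_slice(&blkno.to_be_bytes());
    buf.extend_from_slice(&lsn.to_be_bytes());
    buf
}

fn decode_getpage(buf: &[u8]) -> Option<(u64, u64)> {
    if buf.len() != 16 {
        return None; // malformed request
    }
    let blkno = u64::from_be_bytes(buf[0..8].try_into().ok()?);
    let lsn = u64::from_be_bytes(buf[8..16].try_into().ok()?);
    Some((blkno, lsn))
}

fn main() {
    // A connection-handling thread would decode each such request, then look
    // up the page in the repository at the requested LSN.
    let wire = encode_getpage(42, 0x16B_9188);
    assert_eq!(decode_getpage(&wire), Some((42, 0x16B_9188)));
}
```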
8 docs/pageserver-pagecache.md Normal file
@@ -0,0 +1,8 @@
+# Page cache
+
+TODO:
+
+- shared across tenants
+- store pages from layer files
+- store pages from "in-memory layer"
+- store materialized pages
4 docs/pageserver-processing-getpage.md Normal file
@@ -0,0 +1,4 @@
+# Processing a GetPage request
+
+TODO:
+- sequence diagram that shows how a GetPage@LSN request is processed
5 docs/pageserver-processing-wal.md Normal file
@@ -0,0 +1,5 @@
+# Processing WAL
+
+TODO:
+- diagram that shows how incoming WAL is processed
+- explain durability, what is fsync'd when, disk_consistent_lsn
docs/pageserver-services.md
@@ -1,15 +1,4 @@
-## Page server architecture
-
-The Page Server has a few different duties:
-
-- Respond to GetPage@LSN requests from the Compute Nodes
-- Receive WAL from WAL safekeeper
-- Replay WAL that's applicable to the chunks that the Page Server maintains
-- Backup to S3
-
-S3 is the main fault-tolerant storage of all data, as there are no Page Server
-replicas. We use a separate fault-tolerant WAL service to reduce latency. It
-keeps track of WAL records which are not synced to S3 yet.
+# Services
 
 The Page Server consists of multiple threads that operate on a shared
 repository of page versions:
@@ -21,18 +10,22 @@ repository of page versions:

The updated diagram:

                +--------------+
                | WAL receiver |
                |              |
                +--------------+
                                                          ......
                +---------+                 +--------+    .    .
                |         |                 |        |    .    .
GetPage@LSN     |         |                 | backup | -->. S3 .
------------->  |  Page   |  repository     |        |    .    .
                | Service |                 +--------+    .    .
     page       |         |                               ......
<-------------  |         |
                +---------+  +-----------+  +--------------------+
                             | WAL redo  |  | Checkpointing,     |
                +----------+ | processes |  | Garbage collection |
                |          | +-----------+  +--------------------+
                | HTTP     |
                | mgmt API |
                |          |
                +----------+
@@ -40,28 +33,77 @@ Legend:
 |  |  A thread or multi-threaded service
 +--+
 
+....
+.  .  Component at its early development phase.
+....
+
 --->  Data flow
 <---
 ```
 
-Page Service
-------------
+## Page Service
 
 The Page Service listens for GetPage@LSN requests from the Compute Nodes,
-and responds with pages from the repository.
+and responds with pages from the repository. On each GetPage@LSN request,
+it calls into the Repository function
+
+A separate thread is spawned for each incoming connection to the page
+service. The page service uses the libpq protocol to communicate with
+the client. The client is a Compute Postgres instance.
+
+## WAL Receiver
+
+The WAL receiver connects to the external WAL safekeeping service
+using PostgreSQL physical streaming replication, and continuously
+receives WAL. It decodes the WAL records, and stores them to the
+repository.
 
-WAL Receiver
-------------
+## Backup service
 
-The WAL receiver connects to the external WAL safekeeping service (or
-directly to the primary) using PostgreSQL physical streaming
-replication, and continuously receives WAL. It decodes the WAL records,
-and stores them to the repository.
+The backup service is responsible for storing pageserver recovery data externally.
+
+Currently, the pageserver stores its files in a filesystem directory it's pointed to.
+That working directory could be rather ephemeral, for cases such as "a pageserver pod running in k8s with no persistent volumes attached".
+Therefore, the server interacts with external, more reliable storage to back up and restore its state.
+
+The code for storage support is extensible and can support arbitrary storages, as long as they implement a certain Rust trait.
+The following implementations are present:
+* local filesystem — used mainly in tests
+* AWS S3 — used in production
+
+Implementation details are covered in the [backup readme](./src/remote_storage/README.md) and the corresponding Rust file docs; parameter documentation can be found in the [settings docs](../docs/settings.md).
+
+The backup service is disabled by default and can be enabled to interact with a single remote storage.
+
+CLI examples:
+* Local FS: `${PAGESERVER_BIN} -c "remote_storage={local_path='/some/local/path/'}"`
+* AWS S3: `env AWS_ACCESS_KEY_ID='SOMEKEYAAAAASADSAH*#' AWS_SECRET_ACCESS_KEY='SOMEsEcReTsd292v' ${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/'}"`
+
+For Amazon AWS S3, the key id and secret access key can be found in `~/.aws/credentials` (if awscli was ever configured to work with the desired bucket) or on the AWS Settings page for a given user. Also note that bucket names do not contain any protocol prefix when used on AWS.
+For local S3 installations, refer to their documentation for name format and credentials.
+
+Similar to other pageserver settings, a toml config file can be used to configure either of the storages as a backup target.
+Required sections are:
+
+```toml
+[remote_storage]
+local_path = '/Users/someonetoignore/Downloads/tmp_dir/'
+```
+
+or
+
+```toml
+[remote_storage]
+bucket_name = 'some-sample-bucket'
+bucket_region = 'eu-north-1'
+prefix_in_bucket = '/test_prefix/'
+```
+
+The `AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` env variables can be used to specify the S3 credentials if needed.
+
+## Repository background tasks
+
+The Repository also has a few background threads and tokio tasks that perform
+background duties like dumping accumulated WAL data from memory to disk, reorganizing
+files for performance (compaction), and garbage collecting old files.
 
 Repository
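The "certain Rust trait" that storage backends implement can be imagined roughly as follows. This is an illustrative sketch only: the real trait in `remote_storage` has a different, richer interface (async, streaming, listing), and `RemoteStorage`/`MemStorage` here are hypothetical names.

```rust
use std::collections::HashMap;

// Illustrative storage-backend abstraction: anything that can upload and
// download named blobs can serve as a backup target.
trait RemoteStorage {
    fn upload(&mut self, path: &str, data: Vec<u8>) -> Result<(), String>;
    fn download(&self, path: &str) -> Result<Vec<u8>, String>;
}

// In-memory stand-in for the "local filesystem" implementation used in tests.
struct MemStorage {
    files: HashMap<String, Vec<u8>>,
}

impl MemStorage {
    fn new() -> Self {
        Self { files: HashMap::new() }
    }
}

impl RemoteStorage for MemStorage {
    fn upload(&mut self, path: &str, data: Vec<u8>) -> Result<(), String> {
        self.files.insert(path.to_string(), data);
        Ok(())
    }
    fn download(&self, path: &str) -> Result<Vec<u8>, String> {
        self.files
            .get(path)
            .cloned()
            .ok_or_else(|| format!("{path} not found"))
    }
}

fn main() {
    // Back up a "layer file" and restore it, as the backup service would.
    let mut storage = MemStorage::new();
    storage.upload("timelines/0001/layer", vec![0xAB; 4]).unwrap();
    assert_eq!(storage.download("timelines/0001/layer").unwrap(), vec![0xAB; 4]);
}
```

An S3-backed implementation would provide the same two operations against a bucket, which is what lets the rest of the pageserver stay agnostic about where backups live.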
@@ -116,48 +158,6 @@ Remove old on-disk layer files that are no longer needed according to the
 PITR retention policy
 
-### Backup service
-
-The backup service, responsible for storing pageserver recovery data externally.
-
-Currently, pageserver stores its files in a filesystem directory it's pointed to.
-That working directory could be rather ephemeral for such cases as "a pageserver pod running in k8s with no persistent volumes attached".
-Therefore, the server interacts with external, more reliable storage to back up and restore its state.
-
-The code for storage support is extensible and can support arbitrary ones as long as they implement a certain Rust trait.
-There are the following implementations present:
-* local filesystem — to use in tests mainly
-* AWS S3 - to use in production
-
-Implementation details are covered in the [backup readme](./src/remote_storage/README.md) and corresponding Rust file docs, parameters documentation can be found at [settings docs](../docs/settings.md).
-
-The backup service is disabled by default and can be enabled to interact with a single remote storage.
-
-CLI examples:
-* Local FS: `${PAGESERVER_BIN} -c "remote_storage={local_path='/some/local/path/'}"`
-* AWS S3: `env AWS_ACCESS_KEY_ID='SOMEKEYAAAAASADSAH*#' AWS_SECRET_ACCESS_KEY='SOMEsEcReTsd292v' ${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/'}"`
-
-For Amazon AWS S3, a key id and secret access key could be located in `~/.aws/credentials` if awscli was ever configured to work with the desired bucket, on the AWS Settings page for a certain user. Also note, that the bucket names does not contain any protocols when used on AWS.
-For local S3 installations, refer to the their documentation for name format and credentials.
-
-Similar to other pageserver settings, toml config file can be used to configure either of the storages as backup targets.
-Required sections are:
-
-```toml
-[remote_storage]
-local_path = '/Users/someonetoignore/Downloads/tmp_dir/'
-```
-
-or
-
-```toml
-[remote_storage]
-bucket_name = 'some-sample-bucket'
-bucket_region = 'eu-north-1'
-prefix_in_bucket = '/test_prefix/'
-```
-
-`AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` env variables can be used to specify the S3 credentials if needed.
 
 TODO: Sharding
 --------------------
docs/pageserver-storage.md
@@ -1,4 +1,4 @@
-# Overview
+# Pageserver storage
 
 The main responsibility of the Page Server is to process the incoming WAL, and
 reprocess it into a format that allows reasonably quick access to any page
26 docs/pageserver-thread-mgmt.md Normal file
@@ -0,0 +1,26 @@
+## Thread management
+
+Each thread in the system is tracked by the `thread_mgr` module. It
+maintains a registry of threads, and which tenant or timeline they are
+operating on. This is used for safe shutdown of a tenant, or the whole
+system.
+
+### Handling shutdown
+
+When a tenant or timeline is deleted, we need to shut down all threads
+operating on it, before deleting the data on disk. A thread registered
+in the thread registry can check if it has been requested to shut down,
+by calling `is_shutdown_requested()`. For async operations, there's also
+a `shutdown_watcher()` async task that can be used to wake up on shutdown.
+
+### Sync vs async
+
+The primary programming model in the page server is synchronous,
+blocking code. However, there are some places where async code is
+used. Be very careful when mixing sync and async code.
+
+Async is primarily used to wait for incoming data on network
+connections. For example, all WAL receivers have a shared thread pool,
+with one async Task for each connection. Once a piece of WAL has been
+received from the network, the thread calls the blocking functions in
+the Repository to process the WAL.
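The cooperative-shutdown pattern this file describes can be sketched with a shared flag. This is a minimal illustration, not the actual `thread_mgr` API: `spawn_worker` is a hypothetical name, and the flag poll stands in for `is_shutdown_requested()`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// A worker does bounded chunks of work and re-checks the shared shutdown
// flag between chunks, analogous to a registered thread polling
// `is_shutdown_requested()`.
fn spawn_worker(shutdown: Arc<AtomicBool>) -> thread::JoinHandle<u64> {
    thread::spawn(move || {
        let mut chunks_done = 0u64;
        while !shutdown.load(Ordering::Relaxed) {
            chunks_done += 1; // one bounded chunk of work
            thread::yield_now();
        }
        chunks_done // exits promptly once shutdown is requested
    })
}

fn main() {
    let shutdown = Arc::new(AtomicBool::new(false));
    let worker = spawn_worker(Arc::clone(&shutdown));
    shutdown.store(true, Ordering::Relaxed); // e.g. the tenant is being deleted
    worker.join().unwrap(); // only after join returns is it safe to delete on-disk data
}
```

Joining every registered worker before touching the data on disk is exactly why the registry tracks which tenant or timeline each thread operates on.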
77 docs/pageserver-walredo.md Normal file
@@ -0,0 +1,77 @@
+# WAL Redo
+
+To reconstruct a particular page version from an image of the page and
+some WAL records, the pageserver needs to replay the WAL records. This
+happens on-demand, when a GetPage@LSN request comes in, or as part of
+background jobs that reorganize data for faster access.
+
+It's important that data cannot leak from one tenant to another, and
+that a corrupt WAL record on one timeline doesn't affect other tenants
+or timelines.
+
+## Multi-tenant security
+
+If you have direct access to the WAL directory, or if you have
+superuser access to a running PostgreSQL server, it's easy to
+construct a malicious or corrupt WAL record that causes the WAL redo
+functions to crash, or to execute arbitrary code. That is not a
+security problem for PostgreSQL; if you have superuser access, you
+have full access to the system anyway.
+
+The Neon pageserver, however, is multi-tenant. It needs to execute WAL
+belonging to different tenants in the same system, and malicious WAL
+in one tenant must not affect other tenants.
+
+A separate WAL redo process is launched for each tenant, and the
+process uses the seccomp(2) system call to restrict its access to the
+bare minimum needed to replay WAL records. The process does not have
+access to the filesystem or network. It can only communicate with the
+parent pageserver process through a pipe.
+
+If an attacker creates a malicious WAL record and injects it into the
+WAL stream of a timeline, he can take control of the WAL redo process
+in the pageserver. However, the WAL redo process cannot access the
+rest of the system. And because there is a separate WAL redo process
+for each tenant, the hijacked WAL redo process can only see WAL and
+data belonging to the same tenant, which the attacker would have
+access to anyway.
+
+## WAL-redo process communication
+
+The WAL redo process runs the 'postgres' executable, launched with a
+Neon-specific command-line option to put it into WAL-redo process
+mode. The pageserver controls the lifetime of the WAL redo processes,
+launching them as needed. If a tenant is detached from the pageserver,
+any WAL redo processes for that tenant are killed.
+
+The pageserver communicates with each WAL redo process over its
+stdin/stdout/stderr. It works in a request-response model with a simple
+custom protocol, described in walredo.rs. To replay a set of WAL
+records for a page, the pageserver sends the "before" image of the
+page and the WAL records over 'stdin', followed by a command to
+perform the replay. The WAL redo process responds with an "after"
+image of the page.
+
+## Special handling of some records
+
+Some WAL record types are handled directly in the pageserver, by
+bespoke Rust code, and are not sent over to the WAL redo process.
+This includes SLRU-related WAL records, like commit records. SLRUs
+don't use the standard Postgres buffer manager, so dealing with them
+in the Neon WAL redo mode would require quite a few changes to
+Postgres code and special handling in the protocol anyway.
+
+Some record types that include a full-page image (e.g. XLOG_FPI) are
+also handled specially when the incoming WAL is processed, and are
+stored as page images rather than WAL records.
+
+## Records that modify multiple pages
+
+Some Postgres WAL records modify multiple pages. Such WAL records are
+duplicated, so that a copy is stored for each affected page. This is
+somewhat wasteful, but because most WAL records only affect one page,
+the overhead is acceptable.
+
+WAL redo always happens for one particular page. If the WAL record
+contains changes to other pages, they are ignored.
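The request-response exchange over the child's stdin/stdout can be sketched schematically. This is not the actual walredo.rs protocol: `cat` stands in for the WAL redo process (assuming a Unix-like environment where `cat` is on `PATH`), and `roundtrip` is a hypothetical helper; the real protocol frames page images, WAL records, and a replay command.

```rust
use std::io::{Read, Write};
use std::process::{Command, Stdio};

// Send a request down a child process's stdin and read the response from
// its stdout, mirroring the pageserver <-> WAL redo process pipe.
fn roundtrip(request: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut child = Command::new("cat") // stand-in for 'postgres' in WAL-redo mode
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Write the "request" (for WAL redo: before-image + WAL records + command).
    // Dropping the stdin handle closes the pipe, signalling end of input.
    child.stdin.take().unwrap().write_all(request)?;

    // Read the "response" (for WAL redo: the after-image of the page).
    let mut response = Vec::new();
    child.stdout.take().unwrap().read_to_end(&mut response)?;
    child.wait()?;
    Ok(response)
}

fn main() {
    let page = vec![0u8; 8192]; // stand-in for an 8 KB page image
    let reply = roundtrip(&page).unwrap();
    assert_eq!(reply.len(), 8192);
}
```

The real pageserver keeps the process alive across many requests instead of closing stdin each time, which is why the protocol needs explicit message framing.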
11 docs/pageserver.md Normal file
@@ -0,0 +1,11 @@
+# Page server architecture
+
+The Page Server has a few different duties:
+
+- Respond to GetPage@LSN requests from the Compute Nodes
+- Receive WAL from WAL safekeeper, and store it
+- Upload data to S3 to make it durable, download files from S3 as needed
+
+S3 is the main fault-tolerant storage of all data, as there are no Page Server
+replicas. We use a separate fault-tolerant WAL service to reduce latency. It
+keeps track of WAL records which are not synced to S3 yet.
8 docs/separation-compute-storage.md Normal file
@@ -0,0 +1,8 @@
+# Separation of Compute and Storage
+
+TODO:
+
+- Read path
+- Write path
+- Durability model
+- API auth