diff --git a/docs/.gitignore b/docs/.gitignore new file mode 100644 index 0000000000..7585238efe --- /dev/null +++ b/docs/.gitignore @@ -0,0 +1 @@ +book diff --git a/docs/README.md b/docs/README.md deleted file mode 100644 index 60114c5fd5..0000000000 --- a/docs/README.md +++ /dev/null @@ -1,14 +0,0 @@ -# Zenith documentation - -## Table of contents - -- [authentication.md](authentication.md) — pageserver JWT authentication. -- [docker.md](docker.md) — Docker images and building pipeline. -- [glossary.md](glossary.md) — Glossary of all the terms used in codebase. -- [multitenancy.md](multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI. -- [sourcetree.md](sourcetree.md) — Overview of the source tree layout. -- [pageserver/README.md](/pageserver/README.md) — pageserver overview. -- [postgres_ffi/README.md](/libs/postgres_ffi/README.md) — Postgres FFI overview. -- [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview. -- [safekeeper/README.md](/safekeeper/README.md) — WAL service overview. 
-- [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md new file mode 100644 index 0000000000..cf29ee3c6a --- /dev/null +++ b/docs/SUMMARY.md @@ -0,0 +1,84 @@ +# Summary + +[Introduction]() +- [Separation of Compute and Storage](./separation-compute-storage.md) + +# Architecture + +- [Compute]() + - [WAL proposer]() + - [WAL Backpressure]() + - [Postgres changes](./core_changes.md) + +- [Pageserver](./pageserver.md) + - [Services](./pageserver-services.md) + - [Thread management](./pageserver-thread-mgmt.md) + - [WAL Redo](./pageserver-walredo.md) + - [Page cache](./pageserver-pagecache.md) + - [Storage](./pageserver-storage.md) + - [Datadir mapping]() + - [Layer files]() + - [Branching]() + - [Garbage collection]() + - [Cloud Storage]() + - [Processing a GetPage request](./pageserver-processing-getpage.md) + - [Processing WAL](./pageserver-processing-wal.md) + - [Management API]() + - [Tenant Rebalancing]() + +- [WAL Service](walservice.md) + - [Consensus protocol](safekeeper-protocol.md) + - [Management API]() + - [Rebalancing]() + +- [Control Plane]() + +- [Proxy]() + +- [Source view](./sourcetree.md) + - [docker.md](./docker.md) — Docker images and building pipeline. + - [Error handling and logging]() + - [Testing]() + - [Unit testing]() + - [Integration testing]() + - [Benchmarks]() + + +- [Glossary](./glossary.md) + +# Uncategorized + +- [authentication.md](./authentication.md) +- [multitenancy.md](./multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI. 
+- [settings.md](./settings.md) +#FIXME: move these under sourcetree.md +#- [pageserver/README.md](/pageserver/README.md) +#- [postgres_ffi/README.md](/libs/postgres_ffi/README.md) +#- [test_runner/README.md](/test_runner/README.md) +#- [safekeeper/README.md](/safekeeper/README.md) + + +# RFCs + +- [RFCs](./rfcs/README.md) + +- [002-storage](rfcs/002-storage.md) +- [003-laptop-cli](rfcs/003-laptop-cli.md) +- [004-durability](rfcs/004-durability.md) +- [005-zenith_local](rfcs/005-zenith_local.md) +- [006-laptop-cli-v2-CLI](rfcs/006-laptop-cli-v2-CLI.md) +- [006-laptop-cli-v2-repository-structure](rfcs/006-laptop-cli-v2-repository-structure.md) +- [007-serverless-on-laptop](rfcs/007-serverless-on-laptop.md) +- [008-push-pull](rfcs/008-push-pull.md) +- [009-snapshot-first-storage-cli](rfcs/009-snapshot-first-storage-cli.md) +- [009-snapshot-first-storage](rfcs/009-snapshot-first-storage.md) +- [009-snapshot-first-storage-pitr](rfcs/009-snapshot-first-storage-pitr.md) +- [010-storage_details](rfcs/010-storage_details.md) +- [011-retention-policy](rfcs/011-retention-policy.md) +- [012-background-tasks](rfcs/012-background-tasks.md) +- [013-term-history](rfcs/013-term-history.md) +- [014-safekeepers-gossip](rfcs/014-safekeepers-gossip.md) +- [014-storage-lsm](rfcs/014-storage-lsm.md) +- [015-storage-messaging](rfcs/015-storage-messaging.md) +- [016-connection-routing](rfcs/016-connection-routing.md) +- [cluster-size-limits](rfcs/cluster-size-limits.md) diff --git a/docs/book.toml b/docs/book.toml new file mode 100644 index 0000000000..f83ac2a6aa --- /dev/null +++ b/docs/book.toml @@ -0,0 +1,5 @@ +[book] +language = "en" +multilingual = false +src = "." 
+title = "Neon architecture" diff --git a/docs/core_changes.md b/docs/core_changes.md index 82c5addd16..86fdc420f7 100644 --- a/docs/core_changes.md +++ b/docs/core_changes.md @@ -1,3 +1,12 @@ +# Postgres core changes + +This page lists all the changes that have been made to the PostgreSQL +source tree, as a somewhat logical set of patches. The long-term goal +is to eliminate all these changes, by submitting patches upstream +and refactoring code into extensions, so that you can run unmodified +PostgreSQL against Neon storage. + + 1. Add t_cid to XLOG record - Why? The cmin/cmax on a heap page is a real bummer. I don't see any other way to fix that than bite the bullet and modify the WAL-logging routine to include the cmin/cmax. diff --git a/docs/pageserver-page-service.md b/docs/pageserver-page-service.md new file mode 100644 index 0000000000..cea9e5a637 --- /dev/null +++ b/docs/pageserver-page-service.md @@ -0,0 +1,9 @@ +# Page Service + +The Page Service listens for GetPage@LSN requests from the Compute Nodes, +and responds with pages from the repository. On each GetPage@LSN request, +it calls into the Repository to look up the requested page version. + +A separate thread is spawned for each incoming connection to the page +service. The page service uses the libpq protocol to communicate with +the client. The client is a Compute Postgres instance.
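The thread-per-connection model described above can be sketched with the standard library. This toy server echoes a parsed request back instead of speaking the real libpq protocol, and the `get_page@lsn <n>` request format is invented for the example:

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

// Parse the toy request format "get_page@lsn <n>" (invented for this
// sketch; the real page service speaks the libpq protocol).
fn parse_lsn(request: &str) -> Option<u64> {
    request.trim().strip_prefix("get_page@lsn ")?.parse().ok()
}

// One thread per incoming connection, as in the page service.
fn handle_connection(stream: TcpStream) {
    let mut reader = BufReader::new(stream.try_clone().expect("clone stream"));
    let mut writer = stream;
    let mut line = String::new();
    while reader.read_line(&mut line).unwrap_or(0) > 0 {
        // A real handler would look the page up in the repository here.
        let reply = match parse_lsn(&line) {
            Some(lsn) => format!("page@{}\n", lsn),
            None => "error\n".to_string(),
        };
        writer.write_all(reply.as_bytes()).expect("write reply");
        line.clear();
    }
}

fn main() {
    // Bind to an ephemeral port and accept connections in the background.
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind");
    let addr = listener.local_addr().expect("local addr");
    thread::spawn(move || {
        for stream in listener.incoming().flatten() {
            thread::spawn(move || handle_connection(stream));
        }
    });

    // Act as a compute node: send one request, read one reply.
    let mut client = TcpStream::connect(addr).expect("connect");
    client.write_all(b"get_page@lsn 1234\n").expect("send request");
    let mut reply = String::new();
    BufReader::new(client).read_line(&mut reply).expect("read reply");
    assert_eq!(reply, "page@1234\n");
}
```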
diff --git a/docs/pageserver-pagecache.md b/docs/pageserver-pagecache.md new file mode 100644 index 0000000000..d9b120bbb9 --- /dev/null +++ b/docs/pageserver-pagecache.md @@ -0,0 +1,8 @@ +# Page cache + +TODO: + +- shared across tenants +- store pages from layer files +- store pages from "in-memory layer" +- store materialized pages diff --git a/docs/pageserver-processing-getpage.md b/docs/pageserver-processing-getpage.md new file mode 100644 index 0000000000..be99ab82d4 --- /dev/null +++ b/docs/pageserver-processing-getpage.md @@ -0,0 +1,4 @@ +# Processing a GetPage request + +TODO: +- sequence diagram that shows how a GetPage@LSN request is processed diff --git a/docs/pageserver-processing-wal.md b/docs/pageserver-processing-wal.md new file mode 100644 index 0000000000..f8c43b6085 --- /dev/null +++ b/docs/pageserver-processing-wal.md @@ -0,0 +1,5 @@ +# Processing WAL + +TODO: +- diagram that shows how incoming WAL is processed +- explain durability, what is fsync'd when, disk_consistent_lsn diff --git a/pageserver/README.md b/docs/pageserver-services.md similarity index 75% rename from pageserver/README.md rename to docs/pageserver-services.md index cb752881af..4e85413513 100644 --- a/pageserver/README.md +++ b/docs/pageserver-services.md @@ -1,15 +1,4 @@ -## Page server architecture - -The Page Server has a few different duties: - -- Respond to GetPage@LSN requests from the Compute Nodes -- Receive WAL from WAL safekeeper -- Replay WAL that's applicable to the chunks that the Page Server maintains -- Backup to S3 - -S3 is the main fault-tolerant storage of all data, as there are no Page Server -replicas. We use a separate fault-tolerant WAL service to reduce latency. It -keeps track of WAL records which are not synced to S3 yet. +# Services The Page Server consists of multiple threads that operate on a shared repository of page versions: @@ -21,18 +10,22 @@ repository of page versions: | WAL receiver | | | +--------------+ - +----+ - +---------+ .......... 
| | - | | . . | | - GetPage@LSN | | . backup . -------> | S3 | --------------> | Page | repository . . | | - | Service | .......... | | - page | | +----+ + ...... + +---------+ +--------+ . . + | | | | . . + GetPage@LSN | | | backup | -------> . S3 . +-------------> | Page | repository | | . . + | Service | +--------+ . . + page | | ...... <------------- | | - +---------+ +--------------------+ - | Checkpointing / | - | Garbage collection | - +--------------------+ + +---------+ +-----------+ +--------------------+ + | WAL redo | | Checkpointing, | + +----------+ | processes | | Garbage collection | + | | +-----------+ +--------------------+ + | HTTP | + | mgmt API | + | | + +----------+ Legend: @@ -40,28 +33,77 @@ Legend: | | A thread or multi-threaded service +--+ -.... -. . Component at its early development phase. -.... - ---> Data flow <--- ``` -Page Service ------------- +## Page Service The Page Service listens for GetPage@LSN requests from the Compute Nodes, -and responds with pages from the repository. +and responds with pages from the repository. On each GetPage@LSN request, +it calls into the Repository to look up the requested page version. + +A separate thread is spawned for each incoming connection to the page +service. The page service uses the libpq protocol to communicate with +the client. The client is a Compute Postgres instance. + +## WAL Receiver + +The WAL receiver connects to the external WAL safekeeping service +using PostgreSQL physical streaming replication, and continuously +receives WAL. It decodes the WAL records, and stores them to the +repository. -WAL Receiver ------------- +## Backup service -The WAL receiver connects to the external WAL safekeeping service (or -directly to the primary) using PostgreSQL physical streaming -replication, and continuously receives WAL. It decodes the WAL records, -and stores them to the repository. +The backup service is responsible for storing pageserver recovery data externally.
+ +Currently, the pageserver stores its files in a filesystem directory it's pointed to. +That working directory can be rather ephemeral, for example when the pageserver runs as a k8s pod with no persistent volumes attached. +Therefore, the server interacts with external, more reliable storage to back up and restore its state. + +The storage support code is extensible: arbitrary backends can be added as long as they implement a certain Rust trait. +The following implementations are present: +* local filesystem — used mainly in tests +* AWS S3 — used in production + +Implementation details are covered in the [backup readme](./src/remote_storage/README.md) and the corresponding Rust file docs; parameter documentation can be found in the [settings docs](../docs/settings.md). + +The backup service is disabled by default and can be enabled to interact with a single remote storage. + +CLI examples: +* Local FS: `${PAGESERVER_BIN} -c "remote_storage={local_path='/some/local/path/'}"` +* AWS S3 : `env AWS_ACCESS_KEY_ID='SOMEKEYAAAAASADSAH*#' AWS_SECRET_ACCESS_KEY='SOMEsEcReTsd292v' ${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/'}"` + +For Amazon S3, the key id and secret access key can be found in `~/.aws/credentials` (if awscli was ever configured for the desired bucket) or on the AWS settings page for the user. Also note that bucket names do not include any protocol prefix when used on AWS. +For local S3 installations, refer to their documentation for the name format and credentials. + +As with other pageserver settings, a toml config file can be used to configure either of the storages as a backup target.
+Required sections are: + +```toml +[remote_storage] +local_path = '/Users/someonetoignore/Downloads/tmp_dir/' +``` + +or + +```toml +[remote_storage] +bucket_name = 'some-sample-bucket' +bucket_region = 'eu-north-1' +prefix_in_bucket = '/test_prefix/' +``` + +`AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` env variables can be used to specify the S3 credentials if needed. + + +## Repository background tasks + +The Repository also has a few different background threads and tokio tasks that perform +background duties like dumping accumulated WAL data from memory to disk, reorganizing +files for performance (compaction), and garbage collecting old files. Repository @@ -116,48 +158,6 @@ Remove old on-disk layer files that are no longer needed according to the PITR retention policy -### Backup service - -The backup service, responsible for storing pageserver recovery data externally. - -Currently, pageserver stores its files in a filesystem directory it's pointed to. -That working directory could be rather ephemeral for such cases as "a pageserver pod running in k8s with no persistent volumes attached". -Therefore, the server interacts with external, more reliable storage to back up and restore its state. - -The code for storage support is extensible and can support arbitrary ones as long as they implement a certain Rust trait. -There are the following implementations present: -* local filesystem — to use in tests mainly -* AWS S3 - to use in production - -Implementation details are covered in the [backup readme](./src/remote_storage/README.md) and corresponding Rust file docs, parameters documentation can be found at [settings docs](../docs/settings.md). - -The backup service is disabled by default and can be enabled to interact with a single remote storage. 
- -CLI examples: -* Local FS: `${PAGESERVER_BIN} -c "remote_storage={local_path='/some/local/path/'}"` -* AWS S3 : `env AWS_ACCESS_KEY_ID='SOMEKEYAAAAASADSAH*#' AWS_SECRET_ACCESS_KEY='SOMEsEcReTsd292v' ${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/'}"` - -For Amazon AWS S3, a key id and secret access key could be located in `~/.aws/credentials` if awscli was ever configured to work with the desired bucket, on the AWS Settings page for a certain user. Also note, that the bucket names does not contain any protocols when used on AWS. -For local S3 installations, refer to the their documentation for name format and credentials. - -Similar to other pageserver settings, toml config file can be used to configure either of the storages as backup targets. -Required sections are: - -```toml -[remote_storage] -local_path = '/Users/someonetoignore/Downloads/tmp_dir/' -``` - -or - -```toml -[remote_storage] -bucket_name = 'some-sample-bucket' -bucket_region = 'eu-north-1' -prefix_in_bucket = '/test_prefix/' -``` - -`AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` env variables can be used to specify the S3 credentials if needed. 
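The backup service section above says storage backends must implement "a certain Rust trait" without spelling it out. As a hypothetical sketch of the shape such an abstraction could take (the trait name, method names, and signatures below are invented for illustration, not the real pageserver API):

```rust
use std::collections::HashMap;
use std::io;

// Hypothetical remote-storage abstraction; the real pageserver trait
// and its method names may differ.
trait RemoteStorage {
    fn upload(&mut self, path: &str, data: &[u8]) -> io::Result<()>;
    fn download(&self, path: &str) -> io::Result<Vec<u8>>;
}

// In-memory stand-in for the "local filesystem" backend used in tests.
#[derive(Default)]
struct InMemoryStorage {
    objects: HashMap<String, Vec<u8>>,
}

impl RemoteStorage for InMemoryStorage {
    fn upload(&mut self, path: &str, data: &[u8]) -> io::Result<()> {
        self.objects.insert(path.to_string(), data.to_vec());
        Ok(())
    }
    fn download(&self, path: &str) -> io::Result<Vec<u8>> {
        self.objects
            .get(path)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, path.to_string()))
    }
}

fn main() {
    // Back up a file, then restore it, as the backup service would
    // do around pageserver restarts.
    let mut storage = InMemoryStorage::default();
    storage.upload("timelines/0001/layer_0", b"layer bytes").unwrap();
    let restored = storage.download("timelines/0001/layer_0").unwrap();
    assert_eq!(restored, b"layer bytes");
}
```

An S3 backend would implement the same trait, which is what lets the rest of the pageserver stay agnostic about which storage is configured.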
 TODO: Sharding -------------------- diff --git a/pageserver/src/layered_repository/README.md b/docs/pageserver-storage.md similarity index 99% rename from pageserver/src/layered_repository/README.md rename to docs/pageserver-storage.md index bd5fa59257..8d03e68ac7 100644 --- a/pageserver/src/layered_repository/README.md +++ b/docs/pageserver-storage.md @@ -1,4 +1,4 @@ -# Overview +# Pageserver storage The main responsibility of the Page Server is to process the incoming WAL, and reprocess it into a format that allows reasonably quick access to any page diff --git a/docs/pageserver-thread-mgmt.md b/docs/pageserver-thread-mgmt.md new file mode 100644 index 0000000000..9ee3e40085 --- /dev/null +++ b/docs/pageserver-thread-mgmt.md @@ -0,0 +1,26 @@ +## Thread management + +Each thread in the system is tracked by the `thread_mgr` module. It +maintains a registry of threads, and which tenant or timeline they are +operating on. This is used for safe shutdown of a tenant, or the whole +system. + +### Handling shutdown + +When a tenant or timeline is deleted, we need to shut down all threads +operating on it, before deleting the data on disk. A thread registered +in the thread registry can check if it has been requested to shut down, +by calling `is_shutdown_requested()`. For async operations, there's also +a `shutdown_watcher()` async task that can be used to wake up on shutdown. + +### Sync vs async + +The primary programming model in the page server is synchronous, +blocking code. However, there are some places where async code is +used. Be very careful when mixing sync and async code. + +Async is primarily used to wait for incoming data on network +connections. For example, all WAL receivers have a shared thread pool, +with one async task for each connection. Once a piece of WAL has been +received from the network, the thread calls the blocking functions in +the Repository to process the WAL.
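The cooperative shutdown check described above can be sketched with a shared atomic flag standing in for the real `thread_mgr` registry. The helper name mirrors `is_shutdown_requested()` from the text; everything else is illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Stand-in for thread_mgr's per-thread shutdown flag; the real
// registry also tracks which tenant/timeline each thread serves.
fn is_shutdown_requested(flag: &AtomicBool) -> bool {
    flag.load(Ordering::Relaxed)
}

fn main() {
    let shutdown = Arc::new(AtomicBool::new(false));
    let worker_flag = Arc::clone(&shutdown);

    // A worker checks the flag between units of work, as a
    // registered pageserver thread would.
    let worker = thread::spawn(move || {
        let mut iterations = 0u32;
        loop {
            iterations += 1; // one unit of work
            if is_shutdown_requested(&worker_flag) {
                break;
            }
            thread::sleep(Duration::from_millis(1));
        }
        iterations
    });

    thread::sleep(Duration::from_millis(20));
    shutdown.store(true, Ordering::Relaxed); // request shutdown
    let iterations = worker.join().expect("worker panicked");
    assert!(iterations > 0);
}
```

Joining the worker before deleting on-disk data is the key point: the data is only removed once every thread operating on the tenant has observed the flag and exited.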
diff --git a/docs/pageserver-walredo.md b/docs/pageserver-walredo.md new file mode 100644 index 0000000000..1de9c177cc --- /dev/null +++ b/docs/pageserver-walredo.md @@ -0,0 +1,77 @@ +# WAL Redo + +To reconstruct a particular page version from an image of the page and +some WAL records, the pageserver needs to replay the WAL records. This +happens on-demand, when a GetPage@LSN request comes in, or as part of +background jobs that reorganize data for faster access. + +It's important that data cannot leak from one tenant to another, and +that a corrupt WAL record on one timeline doesn't affect other tenants +or timelines. + +## Multi-tenant security + +If you have direct access to the WAL directory, or if you have +superuser access to a running PostgreSQL server, it's easy to +construct a malicious or corrupt WAL record that causes the WAL redo +functions to crash, or to execute arbitrary code. That is not a +security problem for PostgreSQL; if you have superuser access, you +have full access to the system anyway. + +The Neon pageserver, however, is multi-tenant. It needs to execute WAL +belonging to different tenants in the same system, and malicious WAL +in one tenant must not affect other tenants. + +A separate WAL redo process is launched for each tenant, and the +process uses the seccomp(2) system call to restrict its access to the +bare minimum needed to replay WAL records. The process does not have +access to the filesystem or network. It can only communicate with the +parent pageserver process through a pipe. + +If an attacker creates a malicious WAL record and injects it into the +WAL stream of a timeline, he can take control of the WAL redo process +in the pageserver. However, the WAL redo process cannot access the +rest of the system. And because there is a separate WAL redo process +for each tenant, the hijacked WAL redo process can only see WAL and +data belonging to the same tenant, which the attacker would have +access to anyway. 
+ +## WAL-redo process communication + +The WAL redo process runs the 'postgres' executable, launched with a +Neon-specific command-line option to put it into WAL-redo process +mode. The pageserver controls the lifetime of the WAL redo processes, +launching them as needed. If a tenant is detached from the pageserver, +any WAL redo processes for that tenant are killed. + +The pageserver communicates with each WAL redo process over its +stdin/stdout/stderr. It works in a request-response model with a simple +custom protocol, described in walredo.rs. To replay a set of WAL +records for a page, the pageserver sends the "before" image of the +page and the WAL records over 'stdin', followed by a command to +perform the replay. The WAL redo process responds with an "after" +image of the page. + +## Special handling of some records + +Some WAL record types are handled directly in the pageserver, by +bespoke Rust code, and are not sent over to the WAL redo process. +This includes SLRU-related WAL records, like commit records. SLRUs +don't use the standard Postgres buffer manager, so dealing with them +in the Neon WAL redo mode would require quite a few changes to +Postgres code and special handling in the protocol anyway. + +Some record types that include a full-page image (e.g. XLOG_FPI) are +also handled specially when the incoming WAL is processed, and are +stored as page images rather than WAL records. + + +## Records that modify multiple pages + +Some Postgres WAL records modify multiple pages. Such WAL records are +duplicated, so that a copy is stored for each affected page. This is +somewhat wasteful, but because most WAL records only affect one page, +the overhead is acceptable. + +WAL redo always happens for one particular page. If the WAL record +contains changes to other pages, they are ignored.
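The walredo.rs protocol itself is not reproduced in this document. The following toy sketch only illustrates the general shape of a replay (a "before" page image plus WAL records in, an "after" image out), with a record format invented for the example; the real exchange happens over the redo process's stdin/stdout:

```rust
// Toy model of WAL redo: apply a list of byte-patch "WAL records"
// to a "before" page image to produce the "after" image. The record
// format here is invented for illustration only.
const PAGE_SIZE: usize = 8192;

struct ToyWalRecord {
    offset: usize,
    bytes: Vec<u8>,
}

fn apply_wal_records(mut page: Vec<u8>, records: &[ToyWalRecord]) -> Vec<u8> {
    assert_eq!(page.len(), PAGE_SIZE);
    for rec in records {
        // Redo always targets one particular page; a record's changes
        // to any other page would simply be ignored. Here every record
        // targets this page.
        page[rec.offset..rec.offset + rec.bytes.len()].copy_from_slice(&rec.bytes);
    }
    page
}

fn main() {
    let before = vec![0u8; PAGE_SIZE];
    let records = vec![
        ToyWalRecord { offset: 0, bytes: b"hello".to_vec() },
        ToyWalRecord { offset: 5, bytes: b" world".to_vec() },
    ];
    let after = apply_wal_records(before, &records);
    assert_eq!(&after[..11], b"hello world");
}
```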
diff --git a/docs/pageserver.md b/docs/pageserver.md new file mode 100644 index 0000000000..ee70032396 --- /dev/null +++ b/docs/pageserver.md @@ -0,0 +1,11 @@ +# Page server architecture + +The Page Server has a few different duties: + +- Respond to GetPage@LSN requests from the Compute Nodes +- Receive WAL from WAL safekeeper, and store it +- Upload data to S3 to make it durable, download files from S3 as needed + +S3 is the main fault-tolerant storage of all data, as there are no Page Server +replicas. We use a separate fault-tolerant WAL service to reduce latency. It +keeps track of WAL records which are not synced to S3 yet. diff --git a/safekeeper/README_PROTO.md b/docs/safekeeper-protocol.md similarity index 100% rename from safekeeper/README_PROTO.md rename to docs/safekeeper-protocol.md diff --git a/docs/separation-compute-storage.md b/docs/separation-compute-storage.md new file mode 100644 index 0000000000..f07fa8b6dc --- /dev/null +++ b/docs/separation-compute-storage.md @@ -0,0 +1,8 @@ +# Separation of Compute and Storage + +TODO: + +- Read path +- Write path +- Durability model +- API auth diff --git a/safekeeper/README.md b/docs/walservice.md similarity index 100% rename from safekeeper/README.md rename to docs/walservice.md