## Page server architecture

The Page Server is responsible for all operations on a number of
"chunks" of relation data. A chunk corresponds to a PostgreSQL
relation segment (i.e. one max. 1 GB file in the data directory), but
it holds all the different versions of every page in the segment that
are still needed by the system.

Currently we do not specifically organize data in chunks.
All page images and corresponding WAL records are stored as entries in a key-value store,
where the StorageKey is a zenith_timeline_id + BufferTag + LSN.

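The key layout above can be sketched as a Rust struct. The field names and types below are assumptions for illustration, not the actual pageserver definitions:

```rust
// Hypothetical sketch of the composite storage key described above.
// Field names and types are assumptions, not the real pageserver code.

/// Identifies one page of one relation fork, like PostgreSQL's BufferTag.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct BufferTag {
    spc_node: u32,  // tablespace OID
    db_node: u32,   // database OID
    rel_node: u32,  // relation OID
    fork_num: u8,   // main fork, FSM, visibility map, ...
    block_num: u32, // page number within the fork
}

/// Key for one page version: timeline, page, and the LSN of that version.
/// Deriving Ord gives lexicographic order (timeline, page, LSN), so all
/// versions of one page on one timeline sort together, ordered by LSN.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct StorageKey {
    timeline_id: [u8; 16], // zenith_timeline_id
    tag: BufferTag,
    lsn: u64,
}
```

With this ordering, a lookup for "latest version of page P at or below LSN X" is a single descending range scan in the key-value store.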
The Page Server has a few different duties:

- Respond to GetPage@LSN requests from the Compute Nodes
- Receive WAL from the WAL safekeeper
- Replay WAL that's applicable to the chunks that the Page Server maintains
- Back up to S3

The Page Server consists of multiple threads that operate on a shared
cache of page versions:

                                          | WAL
                                          V
                                  +--------------+
                                  |              |
                                  | WAL receiver |
                                  |              |
                                  +--------------+
                                                                +----+
                +---------+                ..........           |    |
                |         |                .        .           |    |
 GetPage@LSN    |         |                . backup . --------> | S3 |
--------------> |  Page   |   page cache   .        .           |    |
                | Service |                ..........           |    |
     page       |         |                                     +----+
<-------------- |         |
                +---------+

                 ...................................
                 .                                 .
                 . Garbage Collection / Compaction .
                 ...................................

Legend:

+--+
|  |   A thread or multi-threaded service
+--+

....
.  .   Component that we will need, but doesn't exist at the moment. A TODO.
....

--->   Data flow
<---

Page Service
------------

The Page Service listens for GetPage@LSN requests from the Compute Nodes,
and responds with pages from the page cache.

WAL Receiver
------------

The WAL receiver connects to the external WAL safekeeping service (or
directly to the primary) using PostgreSQL physical streaming
replication, and continuously receives WAL. It decodes the WAL records
and stores them in the page cache.

Page Cache
----------

The Page Cache is a switchboard to access different Repositories.

#### Repository

A Repository corresponds to one .zenith directory.
A Repository is needed to manage Timelines.

#### Timeline

A Timeline is the page cache workhorse: it accepts page changes
and serves get_page_at_lsn() and get_rel_size() requests.
Note: this has nothing to do with a PostgreSQL WAL timeline.

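The two requests above can be sketched as a trait, with a toy in-memory implementation. The trait, type names, and signatures are assumptions for illustration, not the real pageserver API:

```rust
// Hypothetical sketch of the Timeline interface; names are illustrative.
use std::collections::BTreeMap;

type Lsn = u64;

/// Toy relation identifier (the real one carries more fields).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct RelTag {
    rel_node: u32,
}

trait Timeline {
    /// Ingest one page change: a new image for a page at a given LSN.
    fn put_page_image(&mut self, rel: RelTag, blkno: u32, lsn: Lsn, img: Vec<u8>);
    /// Return the page contents as of `lsn` (the latest version <= lsn).
    fn get_page_at_lsn(&self, rel: RelTag, blkno: u32, lsn: Lsn) -> Option<Vec<u8>>;
    /// Return the relation size, in blocks, as of `lsn`.
    fn get_rel_size(&self, rel: RelTag, lsn: Lsn) -> Option<u32>;
}

/// Trivial in-memory timeline: a sorted map of (rel, blkno, lsn) -> image.
struct MemTimeline {
    pages: BTreeMap<(RelTag, u32, Lsn), Vec<u8>>,
}

impl Timeline for MemTimeline {
    fn put_page_image(&mut self, rel: RelTag, blkno: u32, lsn: Lsn, img: Vec<u8>) {
        self.pages.insert((rel, blkno, lsn), img);
    }
    fn get_page_at_lsn(&self, rel: RelTag, blkno: u32, lsn: Lsn) -> Option<Vec<u8>> {
        // Latest version of this page at or below the requested LSN.
        self.pages
            .range((rel, blkno, 0)..=(rel, blkno, lsn))
            .next_back()
            .map(|(_, img)| img.clone())
    }
    fn get_rel_size(&self, rel: RelTag, lsn: Lsn) -> Option<u32> {
        // Highest block number ever written at or below the LSN, plus one.
        self.pages
            .keys()
            .filter(|(r, _, l)| *r == rel && *l <= lsn)
            .map(|(_, b, _)| b + 1)
            .max()
    }
}
```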
#### Branch

We can create a branch at a certain LSN.
Each Branch lives in a corresponding timeline and has an ancestor.

To get a full snapshot of the data at a certain moment, we need to traverse the timeline and its ancestors.

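A minimal sketch of that ancestor traversal, assuming each timeline records its ancestor together with the branch-point LSN (all names here are illustrative, not the real types):

```rust
// Hypothetical sketch: if a page version is not found on a branch's own
// timeline at or below the requested LSN, the lookup continues on the
// ancestor, capped at the LSN where the branch was created.
use std::collections::BTreeMap;

struct Timeline {
    /// (blkno, lsn) -> page image, for pages written on this timeline.
    pages: BTreeMap<(u32, u64), Vec<u8>>,
    /// Ancestor timeline and the LSN at which this branch was created.
    ancestor: Option<(Box<Timeline>, u64)>,
}

fn get_page_at_lsn(tl: &Timeline, blkno: u32, lsn: u64) -> Option<Vec<u8>> {
    // Latest version on this timeline at or below `lsn`.
    if let Some((_, img)) = tl.pages.range((blkno, 0)..=(blkno, lsn)).next_back() {
        return Some(img.clone());
    }
    // Not found here: look on the ancestor, but never past the branch point.
    match &tl.ancestor {
        Some((parent, branch_lsn)) => get_page_at_lsn(parent, blkno, lsn.min(*branch_lsn)),
        None => None,
    }
}
```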
#### ObjectRepository

ObjectRepository implements Repository and has an associated ObjectStore and WAL redo service.

#### ObjectStore

ObjectStore is an interface to a key-value store for page images and WAL records.
Currently it has one implementation: RocksDB.

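A sketch of what such an interface might look like, with a trivial in-memory implementation standing in for the RocksDB one. The trait and method names are assumptions, not the actual ObjectStore API:

```rust
// Hypothetical sketch of a key-value interface for page data.

/// A stored value is either a materialized page image or a WAL record
/// that must be replayed on top of an older image.
#[derive(Clone, PartialEq, Debug)]
enum PageEntry {
    Image(Vec<u8>),
    WalRecord(Vec<u8>),
}

trait ObjectStore {
    /// Store an entry under a serialized StorageKey.
    fn put(&mut self, key: Vec<u8>, value: PageEntry);
    /// Fetch an entry by exact key.
    fn get(&self, key: &[u8]) -> Option<PageEntry>;
}

/// Trivial in-memory implementation (the RocksDB stand-in).
struct MemStore {
    map: std::collections::BTreeMap<Vec<u8>, PageEntry>,
}

impl ObjectStore for MemStore {
    fn put(&mut self, key: Vec<u8>, value: PageEntry) {
        self.map.insert(key, value);
    }
    fn get(&self, key: &[u8]) -> Option<PageEntry> {
        self.map.get(key).cloned()
    }
}
```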
#### WAL redo service

The WAL redo service runs PostgreSQL in a special wal_redo mode
to apply given WAL records over an old page image and return the new page image.

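The reconstruction step can be sketched as follows. The real service delegates the redo to a PostgreSQL process running in wal_redo mode; here that process is replaced by a stand-in closure so only the control flow is shown, and all names are illustrative:

```rust
// Hypothetical sketch: reconstruct a page by starting from a base image and
// applying, in order, every WAL record newer than the base and at or below
// the requested LSN. The `apply` closure stands in for the wal_redo process.

fn reconstruct_page(
    base: (u64, Vec<u8>),             // (lsn, page image)
    wal_records: &[(u64, u8)],        // (lsn, record) pairs, sorted by LSN
    target_lsn: u64,
    apply: impl Fn(&mut Vec<u8>, u8), // stand-in for the wal_redo process
) -> Vec<u8> {
    let (base_lsn, mut page) = base;
    for (lsn, rec) in wal_records {
        if *lsn > base_lsn && *lsn <= target_lsn {
            apply(&mut page, *rec);
        }
    }
    page
}
```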
TODO: Garbage Collection / Compaction
-------------------------------------

Periodically, the Garbage Collection / Compaction thread runs,
applies pending WAL records, and removes old page versions that
are no longer needed.

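One plausible pruning rule (an assumption, not a policy the source specifies): for each page, keep every version newer than the GC horizon, plus the newest version at or before the horizon, since that one is still needed to reconstruct the page at the horizon itself. A sketch:

```rust
// Hypothetical sketch of per-page version pruning against a GC horizon.
// The rule and names are illustrative, not the implemented policy.

/// Prune one page's version LSNs (sorted ascending) against a horizon.
fn prune_versions(lsns: &[u64], horizon: u64) -> Vec<u64> {
    // Newest version at or before the horizon, if any: must be retained.
    let keep_base = lsns.iter().copied().filter(|l| *l <= horizon).max();
    lsns.iter()
        .copied()
        .filter(|l| *l > horizon || Some(*l) == keep_base)
        .collect()
}
```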
TODO: Backup service
--------------------

The backup service is responsible for periodically pushing the chunks to S3.

TODO: How/when do we restore from S3? Whenever we get a GetPage@LSN request for
a chunk we don't currently have? Or when an external Control Plane tells us?

TODO: Sharding
--------------

We should be able to run multiple Page Servers that handle sharded data.