Files
neon/pageserver
Heikki Linnakangas 19fcea99da If too much memory is being used for in-memory layers, flush oldest one.
The old policy was to flush all in-memory layers to disk every 10 seconds.
That was a pretty dumb policy, unnecessarily aggressive. This commit
changes the policy so that we only flush layers where the oldest WAL
record is older than 16 MB from the last valid LSN on the timeline. That's
still pretty aggressive, but it's a step in the right direction. We do
need a limit on how old the oldest in-memory layer is allowed to be,
because that determines how much WAL the safekeepers need to hold onto,
and how much WAL we need to reprocess in case of a page server crash.
16 MB is surely still too aggressive for that, but it's easy to change
the setting later.

To support that, keep all in-memory layers in a binary heap, so that we
can easily find the one with the oldest LSN.

This tracks and a new LSN value in the metadata file: 'disk_consistent_lsn'.
Before, on page server restart we restarted the WAL processing from the
'last_record_lsn' value, but now that we don't flush everything to disk in
one go, the 'last_record_lsn' tracked in memory is usually ahead of the
last record that's been flushed to disk. Even though we track that oldest
LSN now, the crash recovery story isn't really complete. We don't do
fsync()s anywhere, and thing will break if a snapshot file isn't complete,
as there's no CRC on them. That's not new, and it's a TODO.
2021-08-25 11:20:47 +03:00
..
2021-08-24 19:05:00 +03:00
2021-08-23 10:21:10 +03:00

## Page server architecture

The Page Server has a few different duties:

- Respond to GetPage@LSN requests from the Compute Nodes
- Receive WAL from WAL safekeeper
- Replay WAL that's applicable to the chunks that the Page Server maintains
- Backup to S3


The Page Server consists of multiple threads that operate on a shared
cache of page versions:


                                           | WAL
                                           V
                                   +--------------+
                                   |              |
                                   | WAL receiver |
                                   |              |
                                   +--------------+
                                                                                 +----+
                  +---------+                              ..........            |    |
                  |         |                              .        .            |    |
 GetPage@LSN      |         |                              . backup .  ------->  | S3 |
------------->    |  Page   |         page cache           .        .            |    |
                  | Service |                              ..........            |    |
   page           |         |                                                    +----+
<-------------    |         |
                  +---------+

                             ...................................
                             .                                 .
                             . Garbage Collection / Compaction .
                             ...................................

Legend:

+--+
|  |   A thread or multi-threaded service
+--+

....
.  .   Component that we will need, but doesn't exist at the moment. A TODO.
....

--->   Data flow
<---


Page Service
------------

The Page Service listens for GetPage@LSN requests from the Compute Nodes,
and responds with pages from the page cache.


WAL Receiver
------------

The WAL receiver connects to the external WAL safekeeping service (or
directly to the primary) using PostgreSQL physical streaming
replication, and continuously receives WAL. It decodes the WAL records,
and stores them to the page cache repository.


Page Cache
----------

The Page Cache is a switchboard to access different Repositories.

#### Repository
Repository corresponds to one .zenith directory.
Repository is needed to manage Timelines.
Each repository has associated WAL redo service.

Now we have two implementations of Repository:
- ObjectRepository uses RocksDB as a storage
- LayeredRepository uses custom storage format, described in layered_repository/README.md

#### Timeline
Timeline is a page cache workhorse that accepts page changes
and serves get_page_at_lsn() and get_rel_size() requests.
Note: this has nothing to do with PostgreSQL WAL timeline.

#### Branch
We can create branch at certain LSN.
Each Branch lives in a corresponding timeline and has an ancestor.

To get full snapshot of data at certain moment we need to traverse timeline and its ancestors.

#### ObjectRepository
ObjectRepository implements Repository and has associated ObjectStore and WAL redo service.

#### ObjectStore
ObjectStore is an interface for key-value store for page images and wal records.
Currently it has one implementation - RocksDB.

#### WAL redo service
WAL redo service - service that runs PostgreSQL in a special wal_redo mode
to apply given WAL records over an old page image and return new page image.


TODO: Garbage Collection / Compaction
-------------------------------------

Periodically, the Garbage Collection / Compaction thread runs
and applies pending WAL records, and removes old page versions that
are no longer needed.


TODO: Backup service
--------------------

The backup service is responsible for periodically pushing the chunks to S3.

TODO: How/when do restore from S3? Whenever we get a GetPage@LSN request for
a chunk we don't currently have? Or when an external Control Plane tells us?

TODO: Sharding
--------------------

We should be able to run multiple Page Servers that handle sharded data.