## Page server architecture

The Page Server has a few different duties:

- Respond to GetPage@LSN requests from the Compute Nodes
- Receive WAL from WAL safekeeper
- Replay WAL that's applicable to the chunks that the Page Server maintains
- Backup to S3

S3 is the main fault-tolerant storage of all data, as there are no Page Server
replicas. We use a separate fault-tolerant WAL service to reduce latency; it
keeps track of WAL records that have not yet been synced to S3.

The Page Server consists of multiple threads that operate on a shared
repository of page versions:

                                           | WAL
                                           V
                                   +--------------+
                                   |              |
                                   | WAL receiver |
                                   |              |
                                   +--------------+
                                                                                 +----+
                  +---------+                              ..........            |    |
                  |         |                              .        .            |    |
 GetPage@LSN      |         |                              . backup .  ------->  | S3 |
------------->    |  Page   |         repository           .        .            |    |
                  | Service |                              ..........            |    |
   page           |         |                                                    +----+
<-------------    |         |
                  +---------+      +--------------------+
                                   |   Checkpointing /  |
                                   | Garbage collection |
                                   +--------------------+

Legend:

+--+
|  |   A thread or multi-threaded service
+--+

....
.  .   Component that we will need, but doesn't exist at the moment. A TODO.
....

--->   Data flow
<---


Page Service
------------

The Page Service listens for GetPage@LSN requests from the Compute Nodes,
and responds with pages from the repository.
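
As a rough illustration, a request handler for this could look something like
the sketch below. The wire format, port number, and `lookup_page()` helper are
made up; the real service speaks the PostgreSQL wire protocol and fetches
pages through the repository.

```rust
// A minimal sketch of a GetPage@LSN request loop. Everything here is
// illustrative, not the actual pageserver protocol.

use std::convert::TryInto;
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};

/// Placeholder for the repository lookup: return the image of block `blkno`
/// of relation `rel` as it was at `lsn`.
fn lookup_page(_rel: u32, _blkno: u32, _lsn: u64) -> [u8; 8192] {
    [0u8; 8192]
}

fn serve_connection(mut stream: TcpStream) -> std::io::Result<()> {
    // One request per iteration: 4-byte relation id, 4-byte block number,
    // 8-byte LSN; the response is the raw 8 KB page image.
    let mut req = [0u8; 16];
    while stream.read_exact(&mut req).is_ok() {
        let rel = u32::from_be_bytes(req[0..4].try_into().unwrap());
        let blkno = u32::from_be_bytes(req[4..8].try_into().unwrap());
        let lsn = u64::from_be_bytes(req[8..16].try_into().unwrap());
        stream.write_all(&lookup_page(rel, blkno, lsn))?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:6400")?; // arbitrary port
    for conn in listener.incoming() {
        serve_connection(conn?)?;
    }
    Ok(())
}
```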


WAL Receiver
------------

The WAL receiver connects to the external WAL safekeeping service (or
directly to the primary) using PostgreSQL physical streaming
replication, and continuously receives WAL. It decodes the WAL records,
and stores them in the repository.
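
Conceptually, the receive loop looks something like the sketch below.
`WalStream`, `WalRecord`, and the `Timeline` methods are made-up stand-ins for
the replication connection and the repository interface.

```rust
// A conceptual sketch of the WAL receiver loop; all names are illustrative.

/// Simplified decoded WAL record: which block it touches and where it ends.
struct WalRecord {
    end_lsn: u64,
    rel: u32,
    blkno: u32,
    data: Vec<u8>, // raw record bytes, kept for later replay
}

/// Source of decoded WAL records (stands in for the replication connection).
trait WalStream {
    fn next_record(&mut self) -> Option<WalRecord>;
}

/// The subset of the timeline interface the receiver needs.
trait Timeline {
    /// Store the record keyed by the page it modifies, so that GetPage@LSN
    /// can later find all WAL needed to reconstruct that page.
    fn put_wal_record(&self, rel: u32, blkno: u32, rec: WalRecord);
    /// Advance the last-received LSN so readers know how far WAL has arrived.
    fn advance_last_record_lsn(&self, lsn: u64);
}

fn walreceiver_loop(stream: &mut dyn WalStream, timeline: &dyn Timeline) {
    while let Some(rec) = stream.next_record() {
        let end_lsn = rec.end_lsn;
        timeline.put_wal_record(rec.rel, rec.blkno, rec);
        timeline.advance_last_record_lsn(end_lsn);
    }
}
```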


Repository
----------

The repository stores all the page versions, or WAL records needed to
reconstruct them. Each tenant has a separate Repository, which is
stored in the .zenith/tenants/<tenantid> directory.

Repository is an abstract trait, defined in `repository.rs`. It is
implemented by the LayeredRepository object in
`layered_repository.rs`. There is only that one implementation of the
Repository trait, but it's still a useful abstraction that keeps the
interface for the low-level storage functionality clean. The layered
storage format is described in layered_repository/README.md.

Each repository consists of multiple Timelines. A Timeline is the
workhorse that accepts page changes from the WAL, and serves
get_page_at_lsn() and get_rel_size() requests. Note: this has nothing
to do with a PostgreSQL WAL timeline. The term "timeline" is mostly
interchangeable with "branch"; there is a one-to-one mapping between
branches and timelines. A timeline has a unique ID within the tenant,
represented as a 16-byte hex string that never changes, whereas a
branch is a user-given name for a timeline.
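
As a rough sketch of the shape of this interface (names and signatures are
simplified; see `repository.rs` for the real definitions):

```rust
// Illustrative sketch only: a Repository owns Timelines, and a Timeline
// answers page lookups at a given LSN.

use std::sync::Arc;

type ZTimelineId = [u8; 16]; // 16-byte timeline ID, shown as hex in paths/logs
type Lsn = u64;              // position in the WAL stream
type RelTag = u32;           // simplified relation identifier

trait Timeline {
    /// Return the 8 KB image of the given block, as it was at `lsn`. This may
    /// involve replaying WAL records on top of an older base image.
    fn get_page_at_lsn(&self, rel: RelTag, blkno: u32, lsn: Lsn) -> Result<Vec<u8>, String>;

    /// Return the size of the relation, in blocks, as of `lsn`.
    fn get_rel_size(&self, rel: RelTag, lsn: Lsn) -> Result<u32, String>;
}

trait Repository {
    /// Look up an existing timeline of this tenant.
    fn get_timeline(&self, timeline_id: ZTimelineId) -> Result<Arc<dyn Timeline>, String>;

    /// Create a new timeline branched off `src` at `start_lsn`.
    fn branch_timeline(
        &self,
        src: ZTimelineId,
        dst: ZTimelineId,
        start_lsn: Lsn,
    ) -> Result<(), String>;
}
```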

Each repository also has a WAL redo manager associated with it, see
`walredo.rs`. The WAL redo manager is used to replay PostgreSQL WAL
records whenever we need to reconstruct a page version from WAL to
satisfy a GetPage@LSN request, or to avoid accumulating too much WAL
for a page. The WAL redo manager uses a Postgres process running in a
special zenith wal-redo mode to do the actual WAL redo, and
communicates with the process using a pipe.
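
The pattern is roughly the following. The command-line flag and the framing
of requests and responses are placeholders; see `walredo.rs` for the real
protocol.

```rust
// Sketch of the process-and-pipe pattern: launch a helper process and
// exchange requests/responses over its stdin/stdout.

use std::io::{Read, Write};
use std::process::{Child, Command, Stdio};

struct WalRedoProcess {
    child: Child,
}

impl WalRedoProcess {
    fn launch() -> std::io::Result<WalRedoProcess> {
        let child = Command::new("postgres")
            .arg("--wal-redo") // placeholder for the special wal-redo mode
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        Ok(WalRedoProcess { child })
    }

    /// Send a base page image plus WAL records to apply, and read back the
    /// reconstructed 8 KB page. The framing is made up for illustration.
    fn apply_wal(&mut self, base_img: &[u8], records: &[Vec<u8>]) -> std::io::Result<Vec<u8>> {
        let stdin = self.child.stdin.as_mut().expect("stdin was piped");
        stdin.write_all(base_img)?;
        for rec in records {
            stdin.write_all(rec)?;
        }
        stdin.flush()?;

        let mut page = vec![0u8; 8192];
        let stdout = self.child.stdout.as_mut().expect("stdout was piped");
        stdout.read_exact(&mut page)?;
        Ok(page)
    }
}
```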


Checkpointing / Garbage Collection
----------------------------------

Periodically, the checkpointer thread wakes up and performs housekeeping
on the repository. It has two duties:

### Checkpointing

Flush WAL that has accumulated in memory to disk, so that the old WAL
can be truncated away in the WAL safekeepers, and to free up memory for
receiving new WAL. This process is called "checkpointing". It's similar
to checkpointing in PostgreSQL or other DBMSs, but in the page server,
checkpointing happens on a per-segment basis.
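
Conceptually, the checkpointer loop looks something like the sketch below.
The `checkpoint()` method is a hypothetical stand-in, and the sketch flushes
per timeline where the real code works segment by segment.

```rust
// Conceptual sketch of the checkpointer loop; names are illustrative.

use std::thread;
use std::time::Duration;

trait Timeline {
    /// Write in-memory WAL/page data that has grown past a threshold to
    /// on-disk layer files, so the safekeepers can truncate the old WAL.
    fn checkpoint(&self) -> Result<(), String>;
}

fn checkpointer_loop(timelines: Vec<Box<dyn Timeline>>, period: Duration) {
    loop {
        thread::sleep(period);
        for tl in &timelines {
            if let Err(err) = tl.checkpoint() {
                eprintln!("checkpoint failed: {}", err);
            }
        }
    }
}
```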

### Garbage collection

Remove old on-disk layer files that are no longer needed according to
the PITR (point-in-time recovery) retention policy.
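
A simplified sketch of the decision, assuming a hypothetical `LayerFile`
descriptor and a precomputed cutoff LSN derived from the retention window.
The real logic additionally checks that newer layers make an old file
redundant before deleting it.

```rust
// Illustrative GC decision: drop layer files whose versions are all older
// than the retention cutoff and that no branch point depends on.

use std::path::PathBuf;

struct LayerFile {
    start_lsn: u64, // oldest page version in this file
    end_lsn: u64,   // newest page version in this file
    path: PathBuf,
}

fn gc_layers(layers: Vec<LayerFile>, gc_cutoff_lsn: u64, branch_points: &[u64]) {
    for layer in layers {
        // Keep layers that may still be needed:
        //  - they contain versions newer than the retention cutoff, or
        //  - a child branch was forked from an LSN inside their range.
        let needed_for_pitr = layer.end_lsn >= gc_cutoff_lsn;
        let needed_for_branch = branch_points
            .iter()
            .any(|&lsn| layer.start_lsn <= lsn && lsn <= layer.end_lsn);

        if !needed_for_pitr && !needed_for_branch {
            let _ = std::fs::remove_file(&layer.path);
        }
    }
}
```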


TODO: Backup service
--------------------

The backup service is responsible for periodically pushing the chunks to S3.

TODO: How/when do we restore from S3? Whenever we get a GetPage@LSN request
for a chunk we don't currently have, or when an external Control Plane tells us?

TODO: Sharding
--------------------

We should be able to run multiple Page Servers that handle sharded data.