
Page Server
===========


How to test
-----------


1. Compile and install Postgres from this repository (there are
   modifications, so vanilla Postgres won't do)

    ./configure --prefix=/home/heikki/zenith-install
    make && make install

2. Compile the page server

    cd pageserver
    cargo build

3. Create a "dummy" cluster that the page server will use when it applies the WAL
   records (we shouldn't really need this; getting rid of it is a TODO):

    /home/heikki/zenith-install/bin/initdb -D /data/zenith-dummy


4. Initialize and start a new postgres cluster

    /home/heikki/zenith-install/bin/initdb -D /data/zenith-test-db --username=postgres
    /home/heikki/zenith-install/bin/postgres -D /data/zenith-test-db

5. In another terminal, start the page server.

    PGDATA=/data/zenith-dummy PATH=/home/heikki/zenith-install/bin:$PATH ./target/debug/pageserver

   It should connect to the postgres instance using streaming replication, and print something
   like this:

    $ PGDATA=/data/zenith-dummy PATH=/home/heikki/zenith-install/bin:$PATH ./target/debug/pageserver
    Starting WAL receiver
    connecting...
    Starting page server on 127.0.0.1:5430
    connected!
    page cache is empty

6. You can now open another terminal and run some SQL commands. The generated WAL
   records will be streamed to the page server and attached to the blocks they apply
   to in its page cache:

    $ psql postgres -U postgres
    psql (14devel)
    Type "help" for help.
    
    postgres=# create table mydata (i int4);
    CREATE TABLE
    postgres=# insert into mydata select g from generate_series(1,100) g;
    INSERT 0 100
    postgres=# 

7. The GetPage@LSN interface to the compute nodes isn't working yet. To simulate it,
   the page server generates a test GetPage@LSN call every 5 seconds on a random
   block in the page cache. After a few seconds, you should see output like this:

    testing GetPage@LSN for block 0
    WAL record at LSN 23584576 initializes the page
    2021-03-19 11:03:13.791 EET [11439] LOG:  applied WAL record at 0/167DF40
    2021-03-19 11:03:13.791 EET [11439] LOG:  applied WAL record at 0/167DF80
    2021-03-19 11:03:13.791 EET [11439] LOG:  applied WAL record at 0/167DFC0
    2021-03-19 11:03:13.791 EET [11439] LOG:  applied WAL record at 0/167E018
    2021-03-19 11:03:13.791 EET [11439] LOG:  applied WAL record at 0/167E058
    2021-03-19 11:03:13.791 EET [11439] LOG:  applied WAL record at 0/167E098
    2021-03-19 11:03:13.791 EET [11439] LOG:  applied WAL record at 0/167E0D8
    2021-03-19 11:03:13.792 EET [11439] LOG:  applied WAL record at 0/167E118
    2021-03-19 11:03:13.792 EET [11439] LOG:  applied WAL record at 0/167E158
    2021-03-19 11:03:13.792 EET [11439] LOG:  applied WAL record at 0/167E198
    applied 10 WAL records to produce page image at LSN 18446744073709547246



Architecture
============

The Page Server is responsible for all operations on a number of
"chunks" of relation data. A chunk corresponds to a PostgreSQL
relation segment (i.e. a single file of at most 1 GB in the data
directory), but unlike a plain segment file it holds all the different
versions of every page in the segment that are still needed by the
system.
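
To make that concrete, here is a rough sketch of how a chunk might be
identified. The names and fields below are illustrative only, not the
actual types in the code:

    // Illustrative sketch only; the real identifiers in the code may differ.
    // A chunk covers one relation segment: up to 1 GB (131072 8 kB blocks)
    // of one fork of one relation, plus all the page versions kept for it.
    struct ChunkTag {
        spcnode: u32, // tablespace OID
        dbnode: u32,  // database OID
        relnode: u32, // relation OID (relfilenode)
        forknum: u8,  // main fork, FSM, visibility map, or init fork
        segno: u32,   // segment number within the relation
    }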

Determining which chunks each Page Server holds is handled elsewhere. (TODO:
currently there is only one Page Server, which holds all chunks.)

The Page Server has a few different duties:

- Respond to GetPage@LSN requests from the Compute Nodes
- Receive WAL from the WAL safekeeper
- Replay WAL that's applicable to the chunks that the Page Server maintains
- Back up chunks to S3


The Page Server consists of multiple threads that operate on a shared
cache of page versions:


                                           | WAL
                                           V
                                   +--------------+
                                   |              |
                                   | WAL receiver |
                                   |              |
                                   +--------------+
                                                                                 +----+
                  +---------+                              ..........            |    |
                  |         |                              .        .            |    |
 GetPage@LSN      |         |                              . backup .  ------->  | S3 |
------------->    |  Page   |         page cache           .        .            |    |
                  | Service |                              ..........            |    |
   page           |         |                                                    +----+
<-------------    |         |
                  +---------+

                             ...................................
                             .                                 .
                             . Garbage Collection / Compaction .
                             ...................................

Legend:

+--+
|  |   A thread or multi-threaded service
+--+

....
.  .   Component that we will need, but doesn't exist at the moment. A TODO.
....

--->   Data flow
<---


Page Service
------------

The Page Service listens for GetPage@LSN requests from the Compute Nodes,
and responds with pages from the page cache.
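
One plausible way to structure this is a dedicated thread per
connection. The following is a minimal sketch under that assumption;
the actual request encoding is not shown here:

    use std::io::Read;
    use std::net::TcpListener;
    use std::thread;

    // Minimal sketch: accept connections from Compute Nodes and serve each
    // one from its own thread. The GetPage@LSN wire protocol is left out.
    fn page_service_main() -> std::io::Result<()> {
        let listener = TcpListener::bind("127.0.0.1:5430")?;
        for stream in listener.incoming() {
            let mut stream = stream?;
            thread::spawn(move || {
                let mut buf = [0u8; 1024];
                while let Ok(n) = stream.read(&mut buf) {
                    if n == 0 {
                        break; // connection closed
                    }
                    // ... decode the GetPage@LSN request, look the page up
                    // in the page cache, and write the page image back ...
                }
            });
        }
        Ok(())
    }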


WAL Receiver
------------

The WAL receiver connects to the external WAL safekeeping service (or
directly to the primary) using PostgreSQL physical streaming
replication, and continuously receives WAL. It decodes the WAL records
and stores them in the page cache.
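
Conceptually, each decoded record ends up attached to every page it
modifies. A rough sketch of that step, with placeholder types (the
replication client itself is not shown):

    use std::collections::BTreeMap;

    type Lsn = u64;

    // Placeholder for a decoded WAL record; the real decoder output differs.
    struct DecodedRecord {
        lsn: Lsn,
        blocks: Vec<u32>, // block numbers this record modifies
        payload: Vec<u8>, // the raw WAL record bytes
    }

    // One WAL record can touch several pages (e.g. an index page split),
    // so it is stored under each of them for later replay.
    fn store_record(
        cache: &mut BTreeMap<(u32, Lsn), Vec<u8>>,
        rec: &DecodedRecord,
    ) {
        for &blk in &rec.blocks {
            cache.insert((blk, rec.lsn), rec.payload.clone());
        }
    }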


Page Cache
----------

The Page Cache is a data structure that holds all the different page
versions. All the other threads access it to perform their duties.

Currently, the page cache is kept entirely in memory. TODO: Store it
on disk and define a file format.
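
For illustration, a minimal in-memory sketch of the lookup side, with
placeholder types (the real structure covers many relations and holds
more state):

    use std::collections::BTreeMap;

    type Lsn = u64;

    // Placeholder: a page version is either a full page image or a WAL
    // record that must be replayed on top of an older image.
    enum PageVersion {
        Image(Vec<u8>),
        WalRecord(Vec<u8>),
    }

    // Versions of one block, keyed by LSN. To answer GetPage@LSN, take the
    // newest Image at or below the requested LSN and replay the WalRecords
    // between it and the requested LSN.
    fn versions_needed(
        versions: &BTreeMap<Lsn, PageVersion>,
        lsn: Lsn,
    ) -> Vec<&PageVersion> {
        let mut needed = Vec::new();
        for (_, v) in versions.range(..=lsn).rev() {
            needed.push(v);
            if let PageVersion::Image(_) = v {
                break; // found a full image to start from
            }
        }
        needed.reverse(); // oldest first: the image, then records to apply
        needed
    }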


TODO: Garbage Collection / Compaction
-------------------------------------

The Garbage Collection / Compaction thread runs periodically, applying
pending WAL records and removing old page versions that are no longer
needed.
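
A very rough sketch of the removal part, assuming the versions of a
block are kept sorted by LSN and that everything below some GC horizon,
except the newest base image, can be dropped (all names here are
placeholders):

    type Lsn = u64;

    // 'versions' is sorted by LSN, oldest first. Keep the newest version
    // at or below 'horizon' (still needed as the base image for that point
    // in time) and everything newer; drop the older ones.
    fn trim_old_versions(versions: &mut Vec<(Lsn, Vec<u8>)>, horizon: Lsn) {
        let keep_from = versions.iter().rposition(|(lsn, _)| *lsn <= horizon);
        if let Some(keep_from) = keep_from {
            versions.drain(..keep_from);
        }
    }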


TODO: Backup service
--------------------

The backup service is responsible for periodically pushing the chunks to S3.

TODO: How/when do we restore from S3? Whenever we get a GetPage@LSN request for
a chunk we don't currently have, or when an external Control Plane tells us to?