Major changes and new concepts:
Simplify Repository to a value-store
------------------------------------
Move the responsibility of tracking relation metadata, like which relations
exist and what are their sizes, from Repository to a new module,
pgdatadir_mapping.rs. The interface to Repository now consists of simple
key-value PUT/GET operations.
It's still not any old key-value store though. A Repository is still
responsible for handling branching, and every GET operation comes with
an LSN.
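To make the "every GET comes with an LSN" idea concrete, here is a minimal
sketch of a versioned key-value store. The names (VersionedStore, put, get)
and the use of plain u64 for keys and LSNs are assumptions made for
illustration, not the actual Repository interface, and branching is left
out entirely.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for an LSN; the real type is not shown here.
type Lsn = u64;

/// Toy versioned store: each key maps to its versions, ordered by LSN.
#[derive(Default)]
struct VersionedStore {
    versions: BTreeMap<u64, BTreeMap<Lsn, Vec<u8>>>,
}

impl VersionedStore {
    /// PUT: record a new value for `key` as of `lsn`.
    fn put(&mut self, key: u64, lsn: Lsn, value: Vec<u8>) {
        self.versions.entry(key).or_default().insert(lsn, value);
    }

    /// GET: return the newest value of `key` written at or before `lsn`.
    fn get(&self, key: u64, lsn: Lsn) -> Option<&[u8]> {
        self.versions
            .get(&key)?
            .range(..=lsn)
            .next_back()
            .map(|(_, value)| value.as_slice())
    }
}
```

In this sketch, a GET at an LSN between two PUTs returns the older value,
and a GET at an LSN before the first PUT returns nothing.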
Key
---
The key to the Repository key-value store is a Key struct, which consists
of a few integer fields. It's wide enough to store a full RelFileNode,
fork and block number, and to distinguish those from metadata keys.
See pgdatadir_mapping.rs for how relation blocks and metadata keys are
mapped to the Key struct.
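As a rough illustration of such a key, the sketch below packs a relation
block address into a struct of integer fields and uses a leading tag to
keep metadata keys apart. The field names, widths, tag values and helper
functions are all hypothetical; see pgdatadir_mapping.rs for the real
mapping.

```rust
/// Hypothetical key layout; the real field names and widths may differ.
/// The derived ordering compares fields in declaration order, so all
/// blocks of one relation fork sort next to each other.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Key {
    kind: u8,     // tag separating relation blocks from metadata keys
    spcnode: u32, // tablespace part of the RelFileNode
    dbnode: u32,  // database part of the RelFileNode
    relnode: u32, // relation part of the RelFileNode
    forknum: u8,  // fork number
    blknum: u32,  // block number within the fork
}

// Assumed tag values, for illustration only.
const KIND_REL_BLOCK: u8 = 0;
const KIND_METADATA: u8 = 1;

/// Key for one block of a relation fork.
fn rel_block_key(spcnode: u32, dbnode: u32, relnode: u32, forknum: u8, blknum: u32) -> Key {
    Key { kind: KIND_REL_BLOCK, spcnode, dbnode, relnode, forknum, blknum }
}

/// Key for a per-relation metadata entry, such as the relation size.
fn rel_size_key(spcnode: u32, dbnode: u32, relnode: u32, forknum: u8) -> Key {
    Key { kind: KIND_METADATA, spcnode, dbnode, relnode, forknum, blknum: 0 }
}
```

With an ordering like this, a contiguous range of Keys covers consecutive
blocks of a relation, which is what lets a range of Keys stand in for the
old per-relation grouping.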
Store arbitrary key-ranges in the layer files
---------------------------------------------
The concept of a "segment" is gone. Each layer file can store an arbitrary
range of Keys.
TODO:
- Deleting keys, to reclaim space. This isn't visible to Postgres; dropping
or truncating a relation works as you would expect if you look at it from
the compute node. If you drop a relation, for example, the relation is
removed from the metadata entry, so that it appears to be gone. However,
the layered repository implementation never reclaims the storage.
- Tracking "logical database size", for disk space quotas. That ought to
be reimplemented now in pgdatadir_mapping.rs, or perhaps in walingest.rs.
- LSM compaction. The logic for checkpointing and creating image layers is
very dumb. AFAIK the *read* code could deal with a full-fledged LSM tree
now consisting of the delta and image layers. But there's no code to
take a bunch of delta layers and compact them, and the heuristics for
when to create image layers are pretty dumb.
- The code to track the layers is inefficient. All layers are just stored in
a vector, and whenever we need to find a layer, we do a linear search in
it.
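The linear search mentioned in the last item can be pictured roughly as
follows. This is a simplified sketch: the Layer struct, its fields, and the
rule "pick the covering layer with the highest start LSN not after the
requested LSN" are assumptions for illustration, with plain u64 standing in
for Key and LSN.

```rust
use std::ops::Range;

/// Hypothetical, simplified layer descriptor.
struct Layer {
    name: &'static str,
    key_range: Range<u64>, // keys covered by this layer file
    lsn_range: Range<u64>, // LSNs covered by this layer file
}

/// Scan every layer and return the newest one that covers `key` and
/// starts at or before `lsn`. The cost is O(number of layers) per lookup.
fn find_layer(layers: &[Layer], key: u64, lsn: u64) -> Option<&Layer> {
    layers
        .iter()
        .filter(|l| l.key_range.contains(&key) && l.lsn_range.start <= lsn)
        .max_by_key(|l| l.lsn_range.start)
}
```

An index keyed on the key and LSN ranges (an interval tree, for example)
would avoid visiting every layer on each lookup.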
This patch adds attach/detach HTTP endpoints to pageservers, along with
some changes to callmemaybe handling inside the safekeeper and an
integration test that checks migration with and without load. There are
still some rough edges that will be addressed in follow-up patches.
This introduces a new module to handle thread creation and shutdown.
All page server threads are now registered in a global hash map, and
there's a function to request individual threads to shut down gracefully.
A shutdown request is signalled to the thread with a flag, as well
as a Future that can be used to wake up async operations if shutdown is
requested. Use that facility to have the libpq listener thread respond
to pageserver shutdown, based on Kirill's earlier prototype
(https://github.com/zenithdb/zenith/pull/1088). That addresses
https://github.com/zenithdb/zenith/issues/1036: previously, the libpq
listener thread would not exit until one more connection arrived.
This also eliminates a resource leak in the accept() loop. Previously,
we added the JoinHandle of each new thread to a vector, but old handles
for threads that had already exited were never removed.
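The registry-plus-flag approach can be sketched as below. This is a
simplified illustration, not the module itself: ThreadRegistry, spawn and
shutdown are hypothetical names, the registry is an owned struct rather
than a global, and the Future used to wake up async operations is omitted.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// One registered thread: its shutdown flag and its join handle.
struct ThreadEntry {
    shutdown_requested: Arc<AtomicBool>,
    handle: JoinHandle<()>,
}

/// Hypothetical registry of running threads, keyed by an internal id.
#[derive(Default)]
struct ThreadRegistry {
    next_id: u64,
    threads: HashMap<u64, ThreadEntry>,
}

impl ThreadRegistry {
    /// Spawn a thread whose body receives the flag it should poll.
    fn spawn<F>(&mut self, body: F) -> u64
    where
        F: FnOnce(Arc<AtomicBool>) + Send + 'static,
    {
        let flag = Arc::new(AtomicBool::new(false));
        let flag_for_thread = Arc::clone(&flag);
        let handle = thread::spawn(move || body(flag_for_thread));
        let id = self.next_id;
        self.next_id += 1;
        self.threads.insert(id, ThreadEntry { shutdown_requested: flag, handle });
        id
    }

    /// Ask one thread to stop, wait for it, and drop its entry so that
    /// handles of finished threads do not accumulate.
    /// Returns false if the id is unknown or the thread panicked.
    fn shutdown(&mut self, id: u64) -> bool {
        match self.threads.remove(&id) {
            Some(entry) => {
                entry.shutdown_requested.store(true, Ordering::SeqCst);
                entry.handle.join().is_ok()
            }
            None => false,
        }
    }

    /// Number of threads currently registered.
    fn len(&self) -> usize {
        self.threads.len()
    }
}
```

A thread body in this sketch would loop on
`!flag.load(Ordering::SeqCst)` and return once the flag is set. A blocking
accept() call cannot poll a flag, which is why a wake-up mechanism such as
the Future described above is also needed.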
The 'anyhow' crate can include a backtrace in all errors when the
'backtrace' feature is enabled. Enable it, and change the places that used
'{:#}' or '{}' to '{:?}', so that the backtrace is printed.
Out-of-scope LSNs include pre-initdb LSNs and LSNs prior to
latest_gc_cutoff.
To get there, there were also two cleanups:
* Fix error handling in the Execute message handler. This fixes the
behaviour when basebackup returned an error. Previously the pageserver
thread just died.
* Remove the "ancestor" file, which previously contained the ancestor id
and branch lsn. The same data can now be obtained from the metadata file.
The way the ancestor file was handled in the code also introduced a case
where, when branching fails, the timeline directory is created but
contains no data except the ancestor file. That confused gc, because it
scans directories. So it is better to just remove the ancestor file and
move the timeline directory creation so that it happens after all
validity checks have passed.
This introduces a new timeline field, latest_gc_cutoff. It is updated
before each gc iteration. A new check is added to branch_timelines to
prevent branch creation with a start point less than latest_gc_cutoff.
This also adds a check to get_page_at_lsn which asserts that the lsn at
which the page is requested has not been garbage collected. This check
is currently triggered for read-only nodes that are pinned to a specific
lsn: because they are not tracked in the pageserver, garbage collection
can remove data that might still be referenced. This is a bug and will
be fixed separately.
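The two scope checks described above can be sketched as a single
validation function. The Timeline struct here is a toy with only the
fields needed for the check; initdb_lsn, LsnError, check_lsn_in_scope and
advance_gc_cutoff are hypothetical names, and u64 stands in for the real
LSN type.

```rust
/// Hypothetical stand-in for an LSN.
type Lsn = u64;

/// Why a requested LSN is out of scope.
#[derive(Debug, PartialEq)]
enum LsnError {
    BeforeInitdb,
    GarbageCollected,
}

/// Toy timeline holding only what the check needs.
struct Timeline {
    initdb_lsn: Lsn,       // assumed field: LSN at which initdb finished
    latest_gc_cutoff: Lsn, // data below this LSN may have been removed
}

impl Timeline {
    /// Reject LSNs that precede initdb or fall below the gc cutoff.
    /// An LSN equal to the cutoff is still accepted.
    fn check_lsn_in_scope(&self, lsn: Lsn) -> Result<(), LsnError> {
        if lsn < self.initdb_lsn {
            return Err(LsnError::BeforeInitdb);
        }
        if lsn < self.latest_gc_cutoff {
            return Err(LsnError::GarbageCollected);
        }
        Ok(())
    }

    /// Called before a gc iteration; in this sketch the cutoff never
    /// moves backwards.
    fn advance_gc_cutoff(&mut self, new_cutoff: Lsn) {
        self.latest_gc_cutoff = self.latest_gc_cutoff.max(new_cutoff);
    }
}
```

Both branch creation and page requests could call the same check with the
branch start point or the requested lsn, respectively.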