When 'checkpoint_distance' is reached, freeze the current in-memory
layer directly in the WAL receiver thread. To flush the frozen layer
to disk, launch a separate "layer flushing thread". This leaves only
the compaction duty to the checkpoint thread.
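The split between freezing and flushing can be sketched roughly as follows.
This is only an illustration, not the actual pageserver code: `FrozenLayer`
and `ingest` are hypothetical names, the "layer" is just a list of byte
buffers, and the flush thread only records sizes instead of writing files.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical frozen layer: an immutable batch of WAL records.
pub struct FrozenLayer {
    pub records: Vec<Vec<u8>>,
}

/// Feeds `records` through a receiver loop, freezing the open layer whenever
/// `checkpoint_distance` bytes have accumulated. Returns the size in bytes of
/// each layer handled by the flush thread. Records still in the open layer
/// at the end are not flushed.
pub fn ingest(records: Vec<Vec<u8>>, checkpoint_distance: usize) -> Vec<usize> {
    let (tx, rx) = mpsc::channel::<FrozenLayer>();

    // "Layer flushing thread": its only job is to write frozen layers out.
    let flusher = thread::spawn(move || {
        let mut flushed: Vec<usize> = Vec::new();
        for layer in rx {
            // A real implementation would write a layer file here.
            flushed.push(layer.records.iter().map(|r| r.len()).sum::<usize>());
        }
        flushed
    });

    // "WAL receiver" loop: freezes directly and hands the layer off,
    // so it never waits for disk I/O itself.
    let mut open: Vec<Vec<u8>> = Vec::new();
    let mut open_bytes = 0usize;
    for rec in records {
        open_bytes += rec.len();
        open.push(rec);
        if open_bytes >= checkpoint_distance {
            let frozen = FrozenLayer {
                records: std::mem::take(&mut open),
            };
            tx.send(frozen).expect("flush thread exited early");
            open_bytes = 0;
        }
    }

    drop(tx); // close the channel so the flusher's loop ends
    flusher.join().expect("flush thread panicked")
}
```

The channel is what decouples the two threads: the receiver side only pays
for the freeze, and flushing proceeds in the background.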
We create a ClearVisibilityMapFlags record for the VM page when a heap
WAL record indicates that the VM bit needs to be cleared. However,
sometimes the VM block does not exist. It seems that PostgreSQL
sometimes sets the clear-VM bit on WAL records even though the
corresponding VM page hasn't been initialized yet. There's no point in
trying to clear a bit on a non-existent page, so just skip emitting the
record if the VM page doesn't exist.
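A minimal sketch of that check, assuming the caller already knows the current
size of the VM fork in blocks. `ClearVmFlags` and `clear_vm_record` are
hypothetical stand-ins, not the real record type or ingest function:

```rust
/// Hypothetical stand-in for a ClearVisibilityMapFlags record.
#[derive(Debug, PartialEq)]
pub struct ClearVmFlags {
    pub vm_blkno: u32,
    pub flags: u8,
}

/// Returns the record to emit, or `None` if the VM page does not exist.
/// `vm_nblocks` is the current size of the VM fork in blocks (assumed to be
/// looked up by the caller).
pub fn clear_vm_record(vm_nblocks: u32, vm_blkno: u32, flags: u8) -> Option<ClearVmFlags> {
    if vm_blkno >= vm_nblocks {
        // The VM page was never initialized, so there is no bit to clear.
        return None;
    }
    Some(ClearVmFlags { vm_blkno, flags })
}
```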
I'm not entirely sure why we're only seeing this bug with this PR; I
think it existed before. Maybe we were sloppier before and returned
an all-zeros page?
Have separate classes for the KeySpace, a partitioning of the KeySpace
(KeyPartitioning), and a builder object used to construct the KeySpace.
Previously, KeyPartitioning did all those things, and it was a bit
confusing.
Major changes and new concepts:
Simplify Repository to a value-store
------------------------------------
Move the responsibility of tracking relation metadata, like which relations
exist and what their sizes are, from Repository to a new module,
pgdatadir_mapping.rs. The interface to Repository now consists of simple
key-value PUT/GET operations.
It's still not any old key-value store though. A Repository is still
responsible for handling branching, and every GET operation comes with
an LSN.
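To illustrate what "every GET comes with an LSN" means, here is a rough
sketch of a versioned key-value store. It is not the Repository interface:
`VersionedStore` is a hypothetical name, the key is flattened to a `u64` for
brevity, and branching is left out entirely.

```rust
use std::collections::BTreeMap;

pub type Key = u64; // placeholder for the real Key struct
pub type Lsn = u64;

/// Hypothetical versioned store: each PUT is recorded at an LSN.
#[derive(Default)]
pub struct VersionedStore {
    versions: BTreeMap<(Key, Lsn), Vec<u8>>,
}

impl VersionedStore {
    pub fn put(&mut self, key: Key, lsn: Lsn, value: Vec<u8>) {
        self.versions.insert((key, lsn), value);
    }

    /// Returns the latest value of `key` written at or before `lsn`.
    pub fn get(&self, key: Key, lsn: Lsn) -> Option<&Vec<u8>> {
        self.versions
            .range((key, 0)..=(key, lsn))
            .next_back()
            .map(|(_, value)| value)
    }
}
```

A GET at an older LSN keeps returning the older value even after a newer
PUT, which is the property that sets this apart from a plain key-value map.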
Key
---
The key to the Repository key-value store is a Key struct, which consists
of a few integer fields. It's wide enough to store a full RelFileNode,
fork and block number, and to distinguish those from metadata keys.
See pgdatadir_mapping.rs for how relation blocks and metadata keys are
mapped to the Key struct.
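For illustration only, a Key along these lines could look like the sketch
below. The field names, widths, and the `kind` tag are assumptions made for
this example; see pgdatadir_mapping.rs for the real layout and mapping.

```rust
/// Hypothetical Key layout; the real struct may differ.
/// Deriving `Ord` compares fields in declaration order, so all blocks of
/// one relation fork sort next to each other.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Key {
    pub kind: u8, // assumed: 0 = relation block, anything else = metadata
    pub spcnode: u32,
    pub dbnode: u32,
    pub relnode: u32,
    pub forknum: u8,
    pub blknum: u32,
}

/// Maps a relation block to a Key (assumed mapping, for illustration).
pub fn rel_block_to_key(spcnode: u32, dbnode: u32, relnode: u32, forknum: u8, blknum: u32) -> Key {
    Key { kind: 0, spcnode, dbnode, relnode, forknum, blknum }
}

/// Distinguishes relation block keys from metadata keys.
pub fn is_rel_block_key(key: &Key) -> bool {
    key.kind == 0
}
```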
Store arbitrary key-ranges in the layer files
---------------------------------------------
The concept of a "segment" is gone. Each layer file can store an arbitrary
range of Keys.
TODO:
- Deleting keys, to reclaim space. This isn't visible to Postgres: dropping
  or truncating a relation works as you would expect if you look at it from
  the compute node. If you drop a relation, for example, the relation is
  removed from the metadata entry, so that it appears to be gone. However,
  the layered repository implementation never reclaims the storage.
- Tracking "logical database size", for disk space quotas. That ought to
  be reimplemented now in pgdatadir_mapping.rs, or perhaps in walingest.rs.
- LSM compaction. The logic for checkpointing and creating image layers is
  very dumb. AFAIK the *read* code could now deal with a full-fledged LSM
  tree consisting of the delta and image layers. But there's no code to
  take a bunch of delta layers and compact them, and the heuristics for
  when to create image layers are pretty dumb.
- The code to track the layers is inefficient. All layers are just stored in
  a vector, and whenever we need to find a layer, we do a linear search in
  it.
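As a rough picture of the last item, the lookup amounts to something like the
sketch below: each layer covers an arbitrary key range and an LSN range, and
finding one is an O(n) scan. `LayerDesc` and `find_layer` are hypothetical
names, and keys and LSNs are flattened to `u64` for brevity.

```rust
use std::ops::Range;

/// Hypothetical layer descriptor: an arbitrary key range plus an LSN range.
#[derive(Debug, Clone, PartialEq)]
pub struct LayerDesc {
    pub key_range: Range<u64>,
    pub lsn_range: Range<u64>,
}

/// Linear search over all layers for one that covers `key` at `lsn`.
pub fn find_layer(layers: &[LayerDesc], key: u64, lsn: u64) -> Option<&LayerDesc> {
    layers
        .iter()
        .find(|l| l.key_range.contains(&key) && l.lsn_range.contains(&lsn))
}
```

An index keyed on key range and LSN would avoid the scan. That is the kind
of improvement this TODO item points at.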
* Add --id argument to safekeeper setting its unique u64 id.
In preparation for storage node messaging. IDs are supposed to be monotonically
assigned by the console. In tests they are issued by ZenithEnv; at the zenith cli
level and in fixtures, the string name is completely replaced by the integer id.
Example TOML configs are adjusted accordingly.
Sequential ids are chosen over Zid mainly because they are compact and easy to
type/remember.
* add node id to pageserver
This adds a node id parameter to the pageserver configuration. Also, I use a
simple builder to construct the pageserver config struct, to avoid setting
the node id to some temporary invalid value. Some of the changes in test
fixtures are needed to split the init and start operations for the environment.
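The builder idea, sketched with hypothetical names and fields (the real
config struct has more fields, and the default address here is made up for
the example):

```rust
/// Hypothetical, trimmed-down config: `id` is always valid once built.
#[derive(Debug, PartialEq)]
pub struct PageServerConf {
    pub id: u64,
    pub listen_addr: String,
}

/// Builder that holds optional values until `build` is called.
#[derive(Default)]
pub struct PageServerConfBuilder {
    id: Option<u64>,
    listen_addr: Option<String>,
}

impl PageServerConfBuilder {
    pub fn id(mut self, id: u64) -> Self {
        self.id = Some(id);
        self
    }

    pub fn listen_addr(mut self, addr: &str) -> Self {
        self.listen_addr = Some(addr.to_string());
        self
    }

    /// Fails if no node id was supplied, instead of inventing a placeholder.
    pub fn build(self) -> Result<PageServerConf, String> {
        let id = self.id.ok_or_else(|| "missing node id".to_string())?;
        let listen_addr = self
            .listen_addr
            .unwrap_or_else(|| "127.0.0.1:0".to_string()); // made-up default
        Ok(PageServerConf { id, listen_addr })
    }
}
```

Because the id lives in an `Option` only inside the builder, the finished
config never carries a temporary invalid value.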
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>