When 'checkpoint_distance' is reached, freeze the current in-memory
layer directly in the WAL receiver thread. To flush the frozen layer
to disk, launch a separate "layer flushing thread". This leaves only
the compaction duty to the checkpoint thread.
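
The hand-off described above can be sketched roughly as follows. This is
only an illustration, not the pageserver code: the function name
`ingest_and_flush`, the use of a channel, and the modelling of a layer as
a plain vector of record sizes are all assumptions.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical sketch: the caller plays the role of the WAL receiver
/// thread. Each element of `record_sizes` stands in for one WAL record.
/// Returns the layers that the flushing thread "wrote", in order.
pub fn ingest_and_flush(record_sizes: &[u64], checkpoint_distance: u64) -> Vec<Vec<u64>> {
    let (tx, rx) = mpsc::channel::<Vec<u64>>();

    // Stand-in for the "layer flushing thread": it only collects the
    // frozen layers instead of writing them to disk.
    let flusher = thread::spawn(move || {
        let mut flushed = Vec::new();
        for frozen in rx {
            flushed.push(frozen);
        }
        flushed
    });

    let mut open_layer: Vec<u64> = Vec::new();
    let mut open_bytes = 0u64;
    for &size in record_sizes {
        open_layer.push(size);
        open_bytes += size;
        if open_bytes >= checkpoint_distance {
            // Freeze in the receiver itself, then hand the frozen layer off.
            let frozen = std::mem::take(&mut open_layer);
            tx.send(frozen).expect("flushing thread exited early");
            open_bytes = 0;
        }
    }

    // Closing the channel lets the flushing thread finish. Whatever is
    // still in `open_layer` stays in memory and is not flushed here.
    drop(tx);
    flusher.join().expect("flushing thread panicked")
}
```

Freezing is cheap, so it can stay in the receiver; only the slow disk
write is moved off the ingest path.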
We create a ClearVisibilityMapFlags record for the VM page when a heap
WAL record indicates that the VM bit needs to be cleared. However,
sometimes the VM block does not exist. It seems that PostgreSQL
sometimes sets the clear-VM bit on WAL records, even though the
corresponding VM page hasn't been initialized yet. There's no point in
trying to clear a bit on a non-existent page, so just skip emitting the
record if the VM page doesn't exist.
I'm not entirely sure why we're only seeing this bug with this PR; I
think it existed before. Maybe we were sloppier before and returned
an all-zeros page?
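
A minimal sketch of the skip, assuming the caller knows the current size
of the VM fork in blocks. The `Record` enum and `clear_vm_record` are
hypothetical names for illustration, not the real walingest types:

```rust
/// Hypothetical stand-in for a record produced during WAL ingest.
#[derive(Debug, PartialEq)]
pub enum Record {
    ClearVisibilityMapFlags { vm_blkno: u32, flags: u8 },
}

/// Returns the record to emit, or `None` when the VM page does not exist.
/// `vm_nblocks` is the current size of the VM fork in blocks (assumed to
/// be available to the caller).
pub fn clear_vm_record(vm_blkno: u32, vm_nblocks: u32, flags: u8) -> Option<Record> {
    if vm_blkno >= vm_nblocks {
        // The VM page was never initialized, so there is no bit to clear.
        return None;
    }
    Some(Record::ClearVisibilityMapFlags { vm_blkno, flags })
}
```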
Have separate classes for the KeySpace, a partitioning of the KeySpace
(KeyPartitioning), and a builder object used to construct the KeySpace.
Previously, KeyPartitioning did all those things, and it was a bit
confusing.
Major changes and new concepts:
Simplify Repository to a value-store
------------------------------------
Move the responsibility of tracking relation metadata, like which relations
exist and what their sizes are, from Repository to a new module,
pgdatadir_mapping.rs. The interface to Repository now consists of simple
key-value PUT/GET operations.
It's still not any old key-value store though. A Repository is still
responsible for handling branching, and every GET operation comes with
an LSN.
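
To illustrate what "every GET comes with an LSN" means, here is a toy
versioned store. It is a sketch under assumptions: `VersionedStore` is a
made-up type, keys are plain u64 values, and branching is not modelled
at all.

```rust
use std::collections::BTreeMap;

pub type Lsn = u64;

/// Hypothetical toy store: every PUT is tagged with the LSN at which the
/// value became current, and old versions are kept.
#[derive(Default)]
pub struct VersionedStore {
    versions: BTreeMap<(u64, Lsn), Vec<u8>>,
}

impl VersionedStore {
    pub fn put(&mut self, key: u64, lsn: Lsn, value: Vec<u8>) {
        self.versions.insert((key, lsn), value);
    }

    /// Returns the newest version of `key` written at or before `lsn`.
    pub fn get(&self, key: u64, lsn: Lsn) -> Option<&[u8]> {
        self.versions
            .range((key, 0)..=(key, lsn))
            .next_back()
            .map(|(_, value)| value.as_slice())
    }
}
```

A read at an older LSN sees the older value, which is the property that
branching at an arbitrary LSN relies on.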
Key
---
The key to the Repository key-value store is a Key struct, which consists
of a few integer fields. It's wide enough to store a full RelFileNode,
fork and block number, and to distinguish those from metadata keys.
See pgdatadir_mapping.rs for how relation blocks and metadata keys are
mapped to the Key struct.
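
As a rough picture of such a struct, consider the sketch below. The field
names, the field widths, and the `kind` discriminator are assumptions
made for illustration; the actual layout is whatever pgdatadir_mapping.rs
defines.

```rust
/// Hypothetical key layout. Deriving `Ord` compares fields in declaration
/// order, so all blocks of one relation fork form a contiguous key range.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Key {
    pub kind: u8,     // assumed discriminator: relation block vs. metadata
    pub spcnode: u32, // RelFileNode: tablespace OID
    pub dbnode: u32,  // RelFileNode: database OID
    pub relnode: u32, // RelFileNode: relation file number
    pub forknum: u8,  // fork (main, FSM, VM, ...)
    pub blknum: u32,  // block number within the fork
}

pub const KIND_REL_BLOCK: u8 = 0;
pub const KIND_METADATA: u8 = 1;

/// Maps a relation block to a key (illustrative helper, not the real API).
pub fn rel_block_to_key(spcnode: u32, dbnode: u32, relnode: u32, forknum: u8, blknum: u32) -> Key {
    Key { kind: KIND_REL_BLOCK, spcnode, dbnode, relnode, forknum, blknum }
}

pub fn is_rel_block_key(key: &Key) -> bool {
    key.kind == KIND_REL_BLOCK
}
```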
Store arbitrary key-ranges in the layer files
---------------------------------------------
The concept of a "segment" is gone. Each layer file can store an arbitrary
range of Keys.
TODO:
- Deleting keys, to reclaim space. This isn't visible to Postgres: dropping
or truncating a relation works as you would expect if you look at it from
the compute node. If you drop a relation, for example, the relation is
removed from the metadata entry, so that it appears to be gone. However,
the layered repository implementation never reclaims the storage.
- Tracking "logical database size", for disk space quotas. That ought to
be reimplemented now in pgdatadir_mapping.rs, or perhaps in walingest.rs.
- LSM compaction. The logic for checkpointing and creating image layers is
very dumb. AFAIK the *read* code could deal with a full-fledged LSM tree
now consisting of the delta and image layers. But there's no code to
take a bunch of delta layers and compact them, and the heuristics for
when to create image layers are equally simplistic.
- The code to track the layers is inefficient. All layers are just stored in
a vector, and whenever we need to find a layer, we do a linear search in
it.
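
For illustration, the linear lookup mentioned in the last item might look
roughly like this. `LayerDesc` and `find_layer` are hypothetical names,
keys and LSNs are reduced to u64, and the selection rule (newest layer
that covers the key and starts at or before the requested LSN) is an
assumption:

```rust
use std::ops::Range;

/// Hypothetical layer descriptor: the key range and LSN range it covers.
#[derive(Debug, PartialEq)]
pub struct LayerDesc {
    pub key_range: Range<u64>,
    pub lsn_range: Range<u64>,
    pub name: &'static str,
}

/// Scans every layer, so each lookup is O(n) in the number of layers.
pub fn find_layer(layers: &[LayerDesc], key: u64, lsn: u64) -> Option<&LayerDesc> {
    layers
        .iter()
        .filter(|l| l.key_range.contains(&key) && l.lsn_range.start <= lsn)
        .max_by_key(|l| l.lsn_range.start)
}
```

An index ordered by key and LSN would avoid the full scan, at the cost of
more bookkeeping when layers are added or removed.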