Major changes and new concepts:
Simplify Repository to a value-store
------------------------------------
Move the responsibility of tracking relation metadata, like which relations
exist and what are their sizes, from Repository to a new module,
pgdatadir_mapping.rs. The interface to Repository now consists of simple
key-value PUT/GET operations.
It's still not any old key-value store though. A Repository is still
responsible for handling branching, and every GET operation comes with
an LSN.
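
A minimal sketch of what "GET with an LSN" means, assuming a plain u64
key and a u64 LSN. The VersionedStore type and its methods are
hypothetical and only illustrate the lookup rule; branching is not
modeled here.

```rust
use std::collections::BTreeMap;

/// Log sequence number; assumed to be a plain 64-bit integer in this sketch.
type Lsn = u64;

/// Hypothetical stand-in for a repository: every (key, LSN) pair is a version.
#[derive(Default)]
struct VersionedStore {
    versions: BTreeMap<(u64, Lsn), Vec<u8>>,
}

impl VersionedStore {
    /// PUT: record a new version of `key` as of `lsn`.
    fn put(&mut self, key: u64, lsn: Lsn, value: Vec<u8>) {
        self.versions.insert((key, lsn), value);
    }

    /// GET: return the newest version of `key` written at or before `lsn`.
    fn get(&self, key: u64, lsn: Lsn) -> Option<&Vec<u8>> {
        self.versions
            .range((key, 0)..=(key, lsn))
            .next_back()
            .map(|(_, value)| value)
    }
}
```

A GET at an LSN between two PUTs returns the older version, and a GET
before the first PUT returns nothing.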
Key
---
The key to the Repository key-value store is a Key struct, which consists
of a few integer fields. It's wide enough to store a full RelFileNode,
fork and block number, and to distinguish those from metadata keys.
See pgdatadir_mapping.rs for how relation blocks and metadata keys are
mapped to the Key struct.
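
For illustration, one possible shape for such a key. The field names,
widths and tag values below are assumptions made for this sketch, not
the layout used in pgdatadir_mapping.rs.

```rust
/// Hypothetical key layout: a tag plus enough integers for a RelFileNode,
/// fork number and block number. The derived Ord compares fields in
/// declaration order, so blocks of one relation fork sort next to each other.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Key {
    kind: u8,     // distinguishes relation blocks from metadata keys
    spcnode: u32, // tablespace OID
    dbnode: u32,  // database OID
    relnode: u32, // relation file node
    forknum: u8,
    blknum: u32,
}

// Assumed tag values, chosen only for this example.
const KIND_REL_BLOCK: u8 = 0x00;
const KIND_METADATA: u8 = 0x01;

/// Map one block of a relation fork to a key.
fn rel_block_to_key(spcnode: u32, dbnode: u32, relnode: u32, forknum: u8, blknum: u32) -> Key {
    Key { kind: KIND_REL_BLOCK, spcnode, dbnode, relnode, forknum, blknum }
}

/// Map the "size of this relation fork" metadata entry to a key.
fn rel_size_to_key(spcnode: u32, dbnode: u32, relnode: u32, forknum: u8) -> Key {
    Key { kind: KIND_METADATA, spcnode, dbnode, relnode, forknum, blknum: 0 }
}
```

Because the tag is the first field, a metadata key can never collide
with a relation block key, even when all other fields are equal.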
Store arbitrary key-ranges in the layer files
---------------------------------------------
The concept of a "segment" is gone. Each layer file can store an arbitrary
range of Keys.
TODO:
- Deleting keys, to reclaim space. This isn't visible to Postgres: dropping
or truncating a relation works as you would expect if you look at it from
the compute node. If you drop a relation, for example, the relation is
removed from the metadata entry, so that it appears to be gone. However,
the layered repository implementation never reclaims the storage.
- Tracking "logical database size", for disk space quotas. That ought to
be reimplemented now in pgdatadir_mapping.rs, or perhaps in walingest.rs.
- LSM compaction. The logic for checkpointing and creating image layers is
very dumb. AFAIK the *read* code could deal with a full-fledged LSM tree
now consisting of the delta and image layers. But there's no code to
take a bunch of delta layers and compact them, and the heuristics for
when to create image layers are pretty dumb.
- The code to track the layers is inefficient. All layers are just stored in
a vector, and whenever we need to find a layer, we do a linear search in
it.
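
A rough sketch of the linear search described in the last item, under
simplifying assumptions: keys and LSNs are plain u64 values, and the
Layer struct and find_layer function are hypothetical names. The
selection rule (newest layer that covers the key and starts at or before
the requested LSN) is also an assumption made for illustration.

```rust
use std::ops::Range;

type Key = u64; // simplification: the real key is a struct
type Lsn = u64;

/// Hypothetical layer descriptor: an arbitrary key range and an LSN range.
struct Layer {
    name: &'static str,
    key_range: Range<Key>,
    lsn_range: Range<Lsn>,
}

/// Scan every layer and pick the one that covers `key` and has the greatest
/// start LSN not exceeding `lsn`. The cost is O(number of layers) per lookup.
fn find_layer(layers: &[Layer], key: Key, lsn: Lsn) -> Option<&Layer> {
    layers
        .iter()
        .filter(|l| l.key_range.contains(&key) && l.lsn_range.start <= lsn)
        .max_by_key(|l| l.lsn_range.start)
}
```

An index ordered by key and LSN would avoid visiting every layer on each
lookup; the sketch only shows why the vector-based approach scales
poorly.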
Introduce the concept of a "ZenithWalRecord", which can be a Postgres WAL
record that is replayed with the Postgres WAL redo process, or a built-in
type that is handled entirely by pageserver code.
Replace the special code to replay Postgres XACT commit/abort records
with new Zenith WAL records. A separate zenith WAL record is created for
each modified CLOG page. This allows removing the 'main_data_offset'
field from stored PostgreSQL WAL records, which saves some memory and
some disk space in delta layers.
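
A sketch of the two ideas above: a record type that is either a Postgres
WAL record or a built-in type, and the splitting of a commit's XIDs into
one record per CLOG page. The variant names, their fields and the
function name are assumptions for illustration. The page arithmetic uses
PostgreSQL's CLOG layout of two status bits per transaction with the
default 8192-byte block size.

```rust
use std::collections::BTreeMap;

const BLCKSZ: u32 = 8192; // default PostgreSQL block size
const CLOG_XACTS_PER_BYTE: u32 = 4; // two status bits per transaction
const CLOG_XACTS_PER_PAGE: u32 = BLCKSZ * CLOG_XACTS_PER_BYTE;

/// Hypothetical record type: variant names and fields are illustrative.
#[derive(Debug, PartialEq)]
enum ZenithWalRecord {
    /// Raw Postgres WAL record, replayed by the Postgres WAL redo process.
    Postgres { rec: Vec<u8> },
    /// Built-in records, applied entirely by pageserver code.
    ClogSetCommitted { xids: Vec<u32> },
    ClogSetAborted { xids: Vec<u32> },
}

impl ZenithWalRecord {
    /// Only raw Postgres records need the external redo process.
    fn needs_postgres_redo(&self) -> bool {
        matches!(self, ZenithWalRecord::Postgres { .. })
    }
}

/// Group committed XIDs by CLOG page and emit one record per modified page.
fn clog_records_for_commit(xids: &[u32]) -> BTreeMap<u32, ZenithWalRecord> {
    let mut by_page: BTreeMap<u32, Vec<u32>> = BTreeMap::new();
    for &xid in xids {
        by_page.entry(xid / CLOG_XACTS_PER_PAGE).or_default().push(xid);
    }
    by_page
        .into_iter()
        .map(|(page, xids)| (page, ZenithWalRecord::ClogSetCommitted { xids }))
        .collect()
}
```

Since each record names only the XIDs on its own page, it can be applied
to that page without referring back to the original commit record.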
Introduce zenith WAL records for updating bits in the visibility map.
Previously, when e.g. a heap insert cleared the VM bit, we duplicated the
heap insert WAL record for the affected VM page. That was very wasteful.
The heap WAL record could be massive, containing a full page image in
the worst case. This addresses github issue #941.
Change control plane code to call `postgres --sync-safekeepers` before
compute node start when safekeepers are enabled. Now `pg create` will
create an empty data directory with the proper config file. Subsequent
`pg start` will run `sync-safekeepers` and will call basebackup with
the resulting LSN. Also change a few tests to accommodate this new behavior.
Follow PostgreSQL logic: remove Twophase files when a prepared transaction is committed/aborted.
Always store Twophase segments as materialized page images (no WAL records).
- Add new subdir postgres_ffi/samples/ for config file samples.
- Don't copy WAL to the new branch on zenith init or zenith branch.
- Import_timeline_wal on zenith init.
Add back code to parse transaction commit and abort records, and in
particular the list of dropped relations in them. Add 'put_unlink'
function to the Timeline trait and implementation. We had the code to
handle dropped relations in the GC code and elsewhere in ObjectRepository
already, but there was nothing to create the RelationSizeEntry::Unlink
tombstone entries until now. Also add a test to check that GC correctly
removes all page versions of a dropped relation.
Implements https://github.com/zenithdb/zenith/issues/232, except for the
"orphaned" rels.
Reviewed-by: Konstantin Knizhnik
Some of these were related to handling various WAL records that are not
related to any relations, like pg_multixact updates. These should have
been removed in the revert commit 6a9c036ac1, but I missed them.
Also, we didn't do anything with commit/abort records. We will start
parsing commit/abort records in the next commit, but it seems better to
add that from a clean slate.
Reviewed-by: Konstantin Knizhnik
- Previously, we checked on first use of a timeline, whether there is
a snapshot and WAL for the timeline, and loaded it all into the
(rocksdb) repository. That's a waste of effort if we had done that
earlier already, and stopped and restarted the server. Track the
last LSN that we have loaded into the repository, and only load the
recent missing WAL after that.
- When you create a new zenith repository with "zenith init",
immediately load the initial empty postgres cluster into the rocksdb
repository. Previously, we only did that on the first connection. This
way, we don't need any "load from filesystem" codepath during normal
operation, we can assume that the repository for a timeline is always
up to date. (We might still want to use the functionality to import an
existing PostgreSQL data directory into the repository in the future,
as a separate Import feature, but not today.)
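
The first item above can be sketched as follows, assuming the WAL is
available as a slice of entries sorted by ascending LSN. The Timeline
and WalEntry types and their fields are hypothetical.

```rust
type Lsn = u64;

/// Hypothetical WAL entry: an LSN plus an opaque payload.
struct WalEntry {
    lsn: Lsn,
    payload: Vec<u8>,
}

/// Hypothetical per-timeline state that remembers how far loading got.
#[derive(Default)]
struct Timeline {
    last_loaded_lsn: Lsn,
    bytes_loaded: usize,
}

impl Timeline {
    /// Load only the entries newer than `last_loaded_lsn` and advance the
    /// marker. Returns the number of entries loaded; assumes `wal` is sorted
    /// by ascending LSN.
    fn load_missing_wal(&mut self, wal: &[WalEntry]) -> usize {
        let start = self.last_loaded_lsn;
        let mut count = 0;
        for entry in wal.iter().filter(|e| e.lsn > start) {
            self.bytes_loaded += entry.payload.len();
            self.last_loaded_lsn = entry.lsn;
            count += 1;
        }
        count
    }
}
```

Calling load_missing_wal twice on the same WAL loads nothing the second
time, which is the restart case the first item describes.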
This includes the following commits:
35a1c3d521 Specify right LSN in test_createdb.py
d95e1da742 Fix issue with propagation of CREATE DATABASE to the branch
8465738aa5 [refer #167] Fix handling of pg_filenode.map files in page server
86056abd0e Fix merge conflict: set initial WAL position to second segment because of pg_resetwal
2bf2dd1d88 Add nonrelfile_utils.rs file
20b6279beb Fix restoring non-relational data during compute node startup
06f96f9600 Do not transfer WAL to computation nodes: use pg_resetwal for node startup
As well as some older changes related to storing CLOG and MultiXact data as
"pseudorelation" in the page server.
With this revert, we go back to the situation that when you create a
new compute node, we ship *all* the WAL from the beginning of time to
the compute node. Obviously we need a better solution, like the code
that this reverts. But per discussion with Konstantin and Stas, this
stuff was still half-baked, and it's better for it to live in a branch
for now, until it's more complete and has gone through some review.
This isn't very exciting with the current RocksDB implementation, because
it doesn't care about the PostgreSQL 1 GB segment boundaries at all.
But I think we will care about this in the future, and more tests are
generally better anyway.