Previously, the LayerMap was only used as a cache to hold the snapshot
layers that were loaded into memory. As a result, we often had to scan
the filesystem to get the list of all the other snapshot files that exist on
disk, but hadn't been loaded into memory yet. That was very slow, consuming
huge amounts of CPU and causing timeouts in any non-trivial tests. Refactor
so that on startup, we scan the directory once and keep the list of
layers in memory.
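
A minimal sketch of the "scan once, then answer from memory" idea. This is not
the actual LayerMap code; the `LayerIndex` type and its methods are hypothetical
names used only for illustration.

```rust
use std::collections::BTreeSet;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical in-memory index of layer files, filled by one directory scan.
pub struct LayerIndex {
    filenames: BTreeSet<String>,
}

impl LayerIndex {
    /// Scan `dir` once and remember the name of every regular file in it.
    pub fn load(dir: &Path) -> io::Result<LayerIndex> {
        let mut filenames = BTreeSet::new();
        for entry in fs::read_dir(dir)? {
            let entry = entry?;
            if entry.file_type()?.is_file() {
                // Names that are not valid UTF-8 are skipped instead of failing.
                if let Ok(name) = entry.file_name().into_string() {
                    filenames.insert(name);
                }
            }
        }
        Ok(LayerIndex { filenames })
    }

    /// Record a layer file that was written after startup.
    pub fn insert(&mut self, name: &str) {
        self.filenames.insert(name.to_string());
    }

    /// Answer from memory, without touching the filesystem.
    pub fn contains(&self, name: &str) -> bool {
        self.filenames.contains(name)
    }

    /// Number of known layer files.
    pub fn len(&self) -> usize {
        self.filenames.len()
    }
}
```

Anything that creates a new file after startup has to call `insert()`, otherwise
the in-memory list and the directory drift apart.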
It was getting stuck at CREATE DATABASE, because the freeze() function
did the wrong thing if there were any page versions newer than the
'end_lsn' argument. The possibility for this was mentioned in a FIXME
comment earlier.
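
To illustrate the case that went wrong, here is a sketch of one plausible way to
split page versions at 'end_lsn' so that newer versions are kept rather than
lost. It is not the actual freeze() code: the types, the `split_at_lsn` name and
the choice to treat 'end_lsn' as exclusive are all assumptions.

```rust
use std::collections::BTreeMap;

/// Log sequence number; a plain u64 in this sketch.
pub type Lsn = u64;

/// Page versions keyed by (block number, LSN). The values stand in for page images.
pub type PageVersions = BTreeMap<(u32, Lsn), Vec<u8>>;

/// Split `versions` at `end_lsn` (treated as exclusive here, an assumption).
/// Returns (frozen, remaining): `frozen` holds versions with lsn < end_lsn,
/// `remaining` holds the newer ones, which have to stay in memory.
pub fn split_at_lsn(versions: PageVersions, end_lsn: Lsn) -> (PageVersions, PageVersions) {
    let mut frozen = PageVersions::new();
    let mut remaining = PageVersions::new();
    for ((blknum, lsn), img) in versions {
        if lsn < end_lsn {
            frozen.insert((blknum, lsn), img);
        } else {
            remaining.insert((blknum, lsn), img);
        }
    }
    (frozen, remaining)
}
```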
Plus misc comment and error message cleanup
This includes some changes that would've been more neat to have as separate
commits, but oh well, this 'layered-repo' branch will need to be squashed
before it's committed to 'main' anyway:
- Introduce a new SnapshotFilename struct in snapshot_layer.rs, to
make it more convenient to work with snapshot file filenames.
- Fix the code in get_layer_for_read() to get the right layer from disk
even if some older layer was in cache. There was a FIXME for this, but
it apparently didn't cause trouble before. It started to cause
regression failures after the rebase, I think because that scenario
arose with the Clog in the test_branch_behind test.
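
A sketch of the lookup rule behind the get_layer_for_read() fix: choose the
layer by LSN among all known layers, whether cached or only on disk, instead of
settling for whatever is already in the cache. The `Location` and `Layers` types
and the `layer_for_read` function are hypothetical, not the real code.

```rust
use std::collections::BTreeMap;

/// Where a layer currently lives (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Location {
    InMemory,
    OnDisk,
}

/// All known layers of one relation, keyed by start LSN.
pub type Layers = BTreeMap<u64, Location>;

/// Pick the layer with the greatest start LSN that is <= `lsn`,
/// regardless of whether it is cached or still only on disk.
pub fn layer_for_read(layers: &Layers, lsn: u64) -> Option<(u64, Location)> {
    layers
        .range(..=lsn)
        .next_back()
        .map(|(start, loc)| (*start, *loc))
}
```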
The "snapshot files" term was a leftover from before the code for the
inmemory layer was separated from the on-disk layer code. There are a
few places left that deal specifically with snapshot layers, and
"snapshot file" still makes sense in those, but replace most instances
with "layer".
This explains how and when snapshot files are created, how they're
used to find the correct page version, and how garbage collection
works. I tried to resist the temptation to write how it *should* work,
and purely document how it currently works in this branch.
If a relation is dropped, the last snapshot file for it is given the
_DROPPED suffix. The garbage collection knows that it can remove the
file when it's old enough, even if there is no newer file.
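
The removal rule can be sketched roughly as below. The struct, the function and
the definition of "old enough" (end LSN at or below a GC cutoff) are assumptions
for illustration, not the real garbage collection code.

```rust
/// Hypothetical summary of one snapshot file, as seen by garbage collection.
pub struct SnapshotFile {
    pub end_lsn: u64,
    /// True if the file name carries the _DROPPED suffix.
    pub dropped: bool,
}

/// Decide whether `file` can be removed, given the GC cutoff LSN.
/// `has_newer_file` says whether a newer file exists for the same relation.
pub fn can_remove(file: &SnapshotFile, has_newer_file: bool, cutoff_lsn: u64) -> bool {
    let old_enough = file.end_lsn <= cutoff_lsn;
    // Normally a newer file must exist before the old one can go,
    // but a dropped relation never gets one, so the suffix is enough.
    old_enough && (has_newer_file || file.dropped)
}
```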
The counters returned by garbage collection, in GcResult, don't make much
sense with the snapshot files implementation, so I added new counters. That
broke the test_gc test, so I made a copy of it as test_snapfiles_gc based
on the new counters.
Handling relation drops is still not implemented.
The 'aversion' crate version 0.2.0 needs Rust 1.52. That was relaxed
at https://github.com/ericseppanen/aversion/pull/3, but there's no
released crate version with that change yet. Maybe we should bump up
the rust version anyway, not sure, but this should fix the immediate
problem of compiling this branch on CI.
This replaces the RocksDB based implementation with an approach using
"snapshot files" on disk, and in-memory btreemaps to hold the recent
changes.
This makes the repository implementation a configuration option. You can
choose 'layered' or 'rocksdb' in the "pageserver init" call, but there is
no corresponding --repository-format option in 'zenith init', so in
practice you have to change the default in pageserver.rs if you want to test
different implementations. The unit tests have been refactored to exercise
both implementations, though. 'layered' is now the default.
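
For illustration, a configuration option like this could be modelled as an enum
parsed from the two names above. The `RepositoryFormat` type is a hypothetical
sketch, not necessarily how pageserver.rs represents the setting.

```rust
use std::str::FromStr;

/// Hypothetical representation of the repository implementation choice.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RepositoryFormat {
    Layered,
    RocksDb,
}

impl Default for RepositoryFormat {
    /// 'layered' is the default.
    fn default() -> Self {
        RepositoryFormat::Layered
    }
}

impl FromStr for RepositoryFormat {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "layered" => Ok(RepositoryFormat::Layered),
            "rocksdb" => Ok(RepositoryFormat::RocksDb),
            other => Err(format!("unknown repository format '{}'", other)),
        }
    }
}
```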
TODOs:
- Push/pull is not implemented, causing 'test_history_inmemory' test
in 'cargo test' to fail.
- Garbage collection has not been implemented yet. The 'test_gc' test is
failing because of that.
- Unlinking relations has not been implemented either. (That has no user
visible effect until garbage collection is implemented)
This clarifies - I hope - the abstractions between Repository and
ObjectRepository. The ObjectTag struct was a mix of objects that could
be accessed directly through the public Timeline interface, and also
objects that were created and used internally by the ObjectRepository
implementation and not supposed to be accessed directly by the
callers. With the RelishTag separate from ObjectTag, the distinction
is more clear: RelishTag is used in the public interface, and
ObjectTag is used internally between object_repository.rs and
object_store.rs, and it contains the internal metadata object types.
One awkward thing with the ObjectTag struct was that the Repository
implementation had to distinguish between ObjectTags for relations,
and track the size of the relation, while others were used to store
"blobs". With the RelishTags, some relishes are considered
"blocky", and the Repository implementation is expected to track
their sizes, while others are stored as blobs. I'm not 100% happy with
how RelishTag captures that either: it just knows that some relish
kinds are blocky and some non-blocky, and there's an is_block()
function to check that. But this does enable size-tracking for SLRUs,
allowing us to treat them more like relations.
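
A simplified sketch of the blocky / non-blocky split. The variants and the
`is_blocky` method name are illustrative assumptions; the real RelishTag has
more kinds and more fields.

```rust
/// Simplified stand-in for RelishTag; the variants are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RelishTag {
    Relation { relnode: u32 },
    Slru { kind: String, segno: u32 },
    ControlFile,
    Checkpoint,
}

impl RelishTag {
    /// Blocky relishes are accessed one page at a time and have a tracked
    /// size. The others are stored as a single blob.
    pub fn is_blocky(&self) -> bool {
        match self {
            RelishTag::Relation { .. } | RelishTag::Slru { .. } => true,
            RelishTag::ControlFile | RelishTag::Checkpoint => false,
        }
    }
}
```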
This changes the way SLRUs are stored in the repository. Each SLRU
segment, e.g. "pg_clog/0000", "pg_clog/0001", is now handled as a
separate relish. This removes the need for the SLRU-specific
put_slru_truncate() function in the Timeline trait. SLRU truncation is
now handled by calling put_unlink() on the segment. This is more in
line with how PostgreSQL stores SLRUs and handles their truncation.
The SLRUs are "blocky", so they are accessed one 8k page at a time,
and the repository tracks their size. I considered an alternative design
where we would treat each SLRU segment as non-blocky, and just store
the whole file as one blob. Each SLRU segment is up to 256 kB in size,
which isn't that large, so that might've worked fine, too. One reason
I didn't do that is that it seems better to have the WAL redo
routines be as close as possible to the PostgreSQL routines. It
doesn't matter much in the repository, though; we have to track the
size for relations anyway, so there's not much difference in whether
we also do it for SLRUs.
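
The arithmetic behind the segment layout, as a sketch: PostgreSQL uses 32 pages
per SLRU segment and names segment files with four uppercase hex digits, which
with 8k pages gives the 256 kB maximum. The function names are hypothetical
helpers, not part of the repository code.

```rust
/// PostgreSQL page size, in bytes.
pub const BLCKSZ: u32 = 8192;
/// Pages per SLRU segment file, as defined by PostgreSQL.
pub const SLRU_PAGES_PER_SEGMENT: u32 = 32;
/// Largest possible segment: 32 pages of 8k each.
pub const MAX_SEGMENT_BYTES: u32 = BLCKSZ * SLRU_PAGES_PER_SEGMENT;

/// Map an SLRU page number to (segment number, block number within the segment).
pub fn slru_segment_and_block(pageno: u32) -> (u32, u32) {
    (pageno / SLRU_PAGES_PER_SEGMENT, pageno % SLRU_PAGES_PER_SEGMENT)
}

/// Segment file name in PostgreSQL's format, e.g. "pg_clog/0001".
pub fn slru_segment_name(dir: &str, segno: u32) -> String {
    format!("{}/{:04X}", dir, segno)
}
```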
The codepath for the tenant_create command first launched the WAL redo
thread, and then called branches::create_repo() which checked if the
tenant's directory already exists. That's problematic, because
launching the WAL redo thread will run initdb if the directory doesn't
already exist. Race condition: If the tenant already exists, it will
have a WAL redo thread already running, and the old and new WAL redo
thread might try to run initdb at the same time, causing all kinds of
weird failures.
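
One way to avoid this kind of race is to claim the directory before starting
anything that might run initdb. The sketch below is not the actual fix; the
function names are hypothetical. It relies on `std::fs::create_dir` failing with
`AlreadyExists` when the directory is already there, so the existence check and
the creation happen in one step.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Claim the tenant directory. Fails with `AlreadyExists` if the tenant
/// directory is already present, so there is no separate "does it exist?" check.
pub fn claim_tenant_dir(tenant_dir: &Path) -> io::Result<()> {
    fs::create_dir(tenant_dir)
}

/// Hypothetical tenant creation: `launch_wal_redo` runs only if the
/// directory was newly created by this call.
pub fn create_tenant<F: FnOnce()>(tenant_dir: &Path, launch_wal_redo: F) -> io::Result<()> {
    claim_tenant_dir(tenant_dir)?;
    launch_wal_redo();
    Ok(())
}
```

With this ordering, a second tenant_create for the same id fails before a second
WAL redo thread is ever started.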
The test_pageserver_api test was failing 100% repeatably on my laptop
because of this. I'm not sure why this doesn't occur on the CI:
Jul 31 18:05:48.877 INFO running initdb in "./tenants/5227e4eb90894775ac6b8a8c76f24b2e/wal-redo-datadir", location: pageserver::walredo, pageserver/src/walredo.rs:483
thread 'WAL redo thread' panicked at 'initdb failed: The files belonging to this database system will be owned by user "heikki".
This user must also own the server process.
The database cluster will be initialized with locale "C".
The default database encoding has accordingly been set to "SQL_ASCII".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory ./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Europe/Helsinki
creating configuration files ... ok
running bootstrap script ...
stderr:
2021-07-31 15:05:48.875 GMT [282569] LOG: could not open configuration file "/home/heikki/git-sandbox/zenith/test_output/test_tenant_list/repo/./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir/postgresql.conf": No such file or directory
2021-07-31 15:05:48.875 GMT [282569] FATAL: configuration file "/home/heikki/git-sandbox/zenith/test_output/test_tenant_list/repo/./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir/postgresql.conf" contains errors
child process exited with exit code 1
initdb: removing data directory "./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir"
- Add new subdir postgres_ffi/samples/ for config file samples.
- Don't copy wal to the new branch on zenith init or zenith branch.
- Import_timeline_wal on zenith init.
It was pretty cool, but no one used it, and it had gotten badly out of
date. The main interesting thing with it was to see some basic metrics
on the fly, while the page server is running, but the metrics collection
had been broken for a long time, too. Best to just remove it.
This patch adds support for tenants. It mostly touches the pageserver.
The directory layout on disk is changed to contain a new layer of indirection.
The path to a particular repository now has the following structure:
<pageserver workdir>/tenants/<tenant id>. The tenant id has the same format
as the timeline id. The tenant id is included in pageserver commands when
needed. New commands are also available in the pageserver: tenant_list and
tenant_create. This is also reflected in the CLI.
During init, a default tenant is created and its id is saved in the CLI config,
so subsequent commands can use it without extra options. The tenant id is also
included in the compute postgres configuration, so it can be passed via
ServerInfo to the safekeeper and in the connection string to the pageserver.
For more info see docs/multitenancy.md.
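
The path layout can be sketched as below. Both helpers are hypothetical. The id
check assumes the format of ids such as 5227e4eb90894775ac6b8a8c76f24b2e
(32 hex digits) and says nothing about how the real parser works.

```rust
use std::path::{Path, PathBuf};

/// Path of a tenant's repository: <pageserver workdir>/tenants/<tenant id>.
pub fn tenant_path(workdir: &Path, tenant_id_hex: &str) -> PathBuf {
    workdir.join("tenants").join(tenant_id_hex)
}

/// Rough format check: 32 hex digits (an assumption based on example ids).
pub fn looks_like_tenant_id(s: &str) -> bool {
    s.len() == 32 && s.chars().all(|c| c.is_ascii_hexdigit())
}
```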
It used to be the case that walkeeper's background thread
failed to recognize the end of stream (EOF) signaled by the
`Ok(None)` result of `FeMessage::read`.
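
A sketch of a receive loop that treats `Ok(None)` as end of stream. The
`read_message` function here is a hypothetical stand-in that reads single bytes;
it only mimics the `Ok(None)`-on-EOF convention and is not the `FeMessage::read`
API.

```rust
use std::io::{self, Read};

/// Hypothetical reader: Ok(Some(byte)) for data, Ok(None) at end of stream.
pub fn read_message<R: Read>(stream: &mut R) -> io::Result<Option<u8>> {
    let mut buf = [0u8; 1];
    match stream.read(&mut buf)? {
        0 => Ok(None), // a zero-length read means the peer closed the stream
        _ => Ok(Some(buf[0])),
    }
}

/// Receive loop that stops on Ok(None) and returns what was read so far.
pub fn receive_all<R: Read>(stream: &mut R) -> io::Result<Vec<u8>> {
    let mut received = Vec::new();
    loop {
        match read_message(stream)? {
            Some(msg) => received.push(msg),
            None => return Ok(received), // EOF: leave the loop
        }
    }
}
```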
* Introduce common enum ObjectVal for all values
* Rewrite push mechanism to use raw object copy
* Fix history unit test
* Add skip_nonrel_objects functions for history unit tests