wal_keep_size is already set to 0 in our cloud setup, but we don't use this value in tests. This commit fixes wal_keep_size in control_plane and adds tests for WAL recycling and lagging safekeepers.
Add tenant config API and 'zenith tenant config' CLI command.
Add 'show' query to pageserver protocol for tenantspecific config parameters
Refactoring: move tenant_config code to a separate module.
Save tenant conf file to tenant's directory, when tenant is created to recover it on pageserver restart.
Ignore error during tenant config loading, while it is not supported by console
Define PiTR interval for GC.
refer #1320
Safekeers now publish to and pull from etcd per-timeline data. Immediate goal is
WAL truncation, for which every safekeeper must know remote_consistent_lsn; the
next would be callmemaybe replacement.
Adds corresponding '--broker' argument to safekeeper and ability to run etcd in
tests.
Adds test checking remote_consistent_lsn is indeed communicated.
workspace_hack is needed to avoid recompilation when different crates
inside the workspace depend on the same packages but with different
features being enabled. Problem occurs when you build crates separately
one by one. So this is irrelevant to our CI setup because there we build
all binaries at once, but it may be relevant for local development.
this also changes cargo's resolver version to 2
Have separate routine and http endpoint to create timeline on safekeepers. It is
not used yet, i.e. timeline is still created implicitly, but we'll change that
once infrastructure for learning which tlis are assigned to which safekeepers
will be ready, preventing accidental creation by compute.
Changes format of safekeeper control file, allowing to store set of
peers. Knowing peers provides a part of foundation for peer
recovery (calculating min horizons like truncate_lsn for WAL truncation and
commit_lsn for sync-safekeepers replacement) and proper membership change;
similarly, we don't yet use it for now.
Employing cf file version bump, extracts tenant_id and timeline_id to top level
where it is more suitable. Also adds a bunch of LSNs there and rename
truncate_lsn to more specific peer_horizon_lsn.
* Add --id argument to safekeeper setting its unique u64 id.
In preparation for storage node messaging. IDs are supposed to be monotonically
assigned by the console. In tests it is issued by ZenithEnv; at the zenith cli
level and fixtures, string name is completely replaced by integer id. Example
TOML configs are adjusted accordingly.
Sequential ids are chosen over Zid mainly because they are compact and easy to
type/remember.
* add node id to pageserver
This adds node id parameter to pageserver configuration. Also I use a
simple builder to construct pageserver config struct to avoid setting
node id to some temporary invalid value. Some of the changes in test
fixtures are needed to split init and start operations for envrionment.
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
to pass current_timeline_size to compute node
Put standby_status_update fields into ZenithFeedback and send them as one message.
Pass values sizes together with keys in ZenithFeedback message.
Change meaning of lsns in HOT_STANDBY_FEEDBACK:
flush_lsn = disk_consistent_lsn,
apply_lsn = remote_consistent_lsn
Update compute node backpressure configuration respectively.
Update compute node configuration:
set 'synchronous_commit=remote_write' in setup without safekeepers.
This way compute node doesn't have to wait for data checkpoint on pageserver.
This doesn't guarantee data durability, but we only use this setup for tests, so it's fine.
We depends on rustls in postgres_backend anyway, so might as well use it
for all TLS stuff. Seems better to depend on only one library both from a
security point of view, and because fewer dependencies means less code to
compile. With this commit, we no longer depend on OpenSSL.
This is needed for implementation of tenant rebalancing. With this
change safekeeper becomes aware of which pageserver is supposed to be
used for replication from this particular compute.
This patch introduces fixes for several problems affecting
LLVM-based code coverage:
* Daemonizing parent processes should call _exit() to prevent
coverage data file corruption (*.profraw) due to concurrent writes.
* Implement proper shutdown handlers in safekeeper.
The 'zenith' CLI utility can now be used to launch safekeepers. By
default, one safekeeper is configured. There are new 'safekeeper
start/stop' subcommands to manage the safekeepers. Each safekeeper is
given a name that can be used to identify the safekeeper to start/stop
with the 'zenith start/stop' commands. The safekeeper data is stored
in '.zenith/safekeepers/<name>'.
The 'zenith start' command now starts the pageserver and also all
safekeepers. 'zenith stop' stops pageserver, all safekeepers, and all
postgres nodes.
Introduce new 'zenith pageserver start/stop' subcommands for
starting/stopping just the page server.
The biggest change here is to the 'zenith init' command. This adds a
new 'zenith init --config=<path to toml file>' option. It takes a toml
config file that describes the environment. In the config file, you
can specify options for the pageserver, like the pg and http ports,
and authentication. For each safekeeper, you can define a name and the
pg and http ports. If you don't use the --config option, you get a
default configuration with a pageserver and one safekeeper. Note that
that's different from the previous default of no safekeepers. Any
fields that are omitted in the configuration file are filled with
defaults. You can also specify the initial tenant ID in the config
file. A couple of sample config files are added in the control_plane/
directory.
The --pageserver-pg-port, --pageserver-http-port, and
--pageserver-auth options to 'zenith init' are removed. Use a config
file instead.
Finally, change the python test fixtures to use the new 'zenith'
commands and the config file to describe the environment.