to pass current_timeline_size to compute node
Put standby_status_update fields into ZenithFeedback and send them as one message.
Pass values sizes together with keys in ZenithFeedback message.
We depends on rustls in postgres_backend anyway, so might as well use it
for all TLS stuff. Seems better to depend on only one library both from a
security point of view, and because fewer dependencies means less code to
compile. With this commit, we no longer depend on OpenSSL.
Move the code for decoding a WAL stream into WAL records into
'postgres_ffi', and keep the code to parse the WAL records deeper in
'pageserver' crate, renamed to walrecord.rs.
This tidies up the dependencies a bit. 'walkeeper' reuses the same
waldecoder routines, and it used to depend on 'pageserver' because of
that. Now it only depends on 'postgres_ffi'.
(The comment in walkeeper/Cargo.toml that claimed that the dependency was
needed for ZTimelineId was obsolete. ZTimelineId is defined in
'zenith_utils', the dependency was actually needed for the waldecoder.)
0.28.0 includes two changes I submitted to upstream:
- Add support for older ListObjects API, needed to use rust-s3 with Google
Cloud Storage: https://github.com/durch/rust-s3/pull/229
- If file is smaller than one chunk, don't initiate multi-part upload.
https://github.com/durch/rust-s3/pull/228
These are not critical for Zenith right now, but let's stay up-to-date.
- Don't spawn a separate thread for each connection.
Instead use one thread per safekeeper, that iterates over all connections and sends callback requests for them.
-Use tokio postgres to connect to the pageserver, to avoid spawning a new thread for each connection.
callmemaybe review fixes:
- Spawn all request_callback tasks separately.
- Remember 'last_call_time' and only send request_callback if 'recall_period' has passed.
- If task hasn't finished till next recall, abort it and try again.
- Add pause/resume CallmeEvents to avoid spamming pageserver when connection already established.
This patch introduces fixes for several problems affecting
LLVM-based code coverage:
* Daemonizing parent processes should call _exit() to prevent
coverage data file corruption (*.profraw) due to concurrent writes.
* Implement proper shutdown handlers in safekeeper.
Now safekeeper control file updated in a following way:
1. Write data to temp file
2. Fsync the temporary file (if sync option is specified)
3. Rename temporary file to actual control file
4. Fsync containing directory (if sync option is specified)
5. Fsync file after rename (if sync option is specified).
Note that action 5 is not mentioned anywhere as required but it is done
in postgres this way (see durable_rename).
Also because of the rename machinery switch to use dedicated lock file
to prevent running several safekeepers concurrently on the same data
cleanup
fsync control file after rename to match postgres behaviour
Which is mainly generational state (terms) and useful LSNs.
Also add /status basic healthcheck request which is now used in tests to
determine the safekeeper is up; this fixes#726.
ref #115
* `wal_acceptor`: add HTTP handler, /metrics endpoint only, no authentication
* Two gauges are currently reported: `flush_lsn` and `commit_lsn`
* Add `DEFAULT_PG_LISTEN_PORT` and `DEFAULT_PG_LISTEN_PORT` consts for uniformity
New command has been added to append specially crafted records in safekeeper WAL. This command takes json for append, encodes LogicalMessage based on json fields, and processes new AppendRequest to append and commit WAL in safekeeper.
Python test starts up walkeepers and creates config for walproposer, then appends WAL and checks --sync-safekeepers works without errors. This test is simplest one, more useful test cases (like in #545) for different setups will be added soon.
This contains a lowest common denominator of pageserver and safekeeper log
initialisation routines. It uses daemonize flag to decide where to
stream log messages. In case daemonize is true log messages are
forwarded to file. Otherwise streaming to stdout is used. Usage of
stdout for log output is the default in docker side of things, so make
it easier to browse our logs via builtin docker commands.
Most of the work here was done on the postgres side. There's more
information in the commit message there.
(see: 04cfa326a5)
On the WAL acceptor side, we're now expecting 'START_WAL_PUSH' to
initialize the WAL keeper protocol. Everything else is mostly the same,
with the only real difference being that protocol messages are now
discrete CopyData messages sent over the postgres protocol.
For the sake of documentation, the full set of these messages is:
<- recv: START_WAL_PUSH query
<- recv: server info from postgres (type `ServerInfo`)
-> send: walkeeper info (type `SafeKeeperInfo`)
<- recv: vote info (type `RequestVote`)
if node id mismatch:
-> send: self node id (type `NodeId`); exit
-> send: confirm vote (with node id) (type `NodeId`)
loop:
<- recv: info and maybe WAL block (type `SafeKeeperRequest` + bytes)
(break loop if done)
-> send: confirm receipt (type `SafeKeeperResponse`)
This version validates on every call that our result is exactly the same
as the previous result.
NodeId is a strange corner case: one field is serialized little-endian
and one field is serialized big-endian. Hopefully we can fix that in the
future.
Switch over to a newer version of rust-postgres PR752. A few
minor changes are required:
- PgLsn::UNDEFINED -> PgLsn::from(0)
- PgTimestamp -> SystemTime
Our builds can be a little inconsistent, because Cargo doesn't deal well
with workspaces where there are multiple crates which have different
dependencies that select different features. As a workaround, copy what
other big rust projects do: add a workspace_hack crate.
This crate just pins down a set of dependencies and features that
satisfies all of the workspace crates.
The benefits are:
- running `cargo build` from one of the workspace subdirectories now
works without rebuilding anything.
- running `cargo install` works (without rebuilding anything).
- making small dependency changes is much less likely to trigger large
dependency rebuilds.
The local fork of rust-s3 has some code to support Google Cloud, but
that PR no longer applies upstream, and will need significant changes
before it can be re-submitted.
In the meantime, we might as well just use the most similar upstream
release. The benefit of switching is that it fixes a feature-resolution
bug that was causing us to build 24 more crates than needed (mostly
async-std and its dependencies).
If there isn't any version specified for a dependency crate, Cargo may
choose a newer version. This could happen when Cargo.lock is updated
("cargo update") but can also happen unexpectedly when adding or
changing other dependencies. This can allow API-breaking changes to be
picked up, breaking the build.
To prevent this, specify versions for all dependencies. Cargo is still
allowed to pick newer versions that are (hopefully) non-breaking, by
analyzing the semver version number.
There are two special cases here:
1. serde_derive::{Serialize, Deserialize} isn't really used any more. It
was only a separate crate in the past because of compiler limitations.
Nowadays, people turn on the "derive" feature of the serde crate and
use serde::{Serialize, Deserialize}.
2. parse_duration is unmaintained and has an open security issue. (gh
iss. 87) That issue probably isn't critical for us because of where we
use that crate, but it's probably still better to pin the version so we
can't get hit with an API-breaking change at an awkward time.
This replaces the page server's "datadir" concept. The Page Server now
always works with a "Zenith Repository". When you initialize a new
repository with "zenith init", it runs initdb and loads an initial
basebackup of the freshly-created cluster into the repository, on "main"
branch. Repository can hold multiple "timelines", which can be given
human-friendly names, making them "branches". One page server simultaneously
serves all timelines stored in the repository, and you can have multiple
Postgres compute nodes connected to the page server, as long they all
operate on a different timeline.
There is a new command "zenith branch", which can be used to fork off
new branches from existing branches.
The repository uses the directory layout desribed as Repository format
v1 in https://github.com/zenithdb/rfcs/pull/5. It it *highly* inefficient:
- we never create new snapshots. So in practice, it's really just a base
backup of the initial empty cluster, and everything else is reconstructed
by redoing all WAL
- when you create a new timeline, the base snapshot and *all* WAL is copied
from the new timeline to the new one. There is no smarts about
referencing the old snapshots/wal from the ancestor timeline.
To support all this, this commit includes a bunch of other changes:
- Implement "basebackup" funtionality in page server. When you initialize
a new compute node with "zenith pg create", it connects to the page
server, and requests a base backup of the Postgres data directory on
that timeline. (the base backup excludes user tables, so it's not
as bad as it sounds).
- Have page server's WAL receiver write the WAL into timeline dir. This
allows running a Page Server and Compute Nodes without a WAL safekeeper,
until we get around to integrate that properly into the system. (Even
after we integrate WAL safekeeper, this is perhaps how this will operate
when you want to run the system on your laptop.)
- restore_datadir.rs was renamed to restore_local_repo.rs, and heavily
modified to use the new format. It now also restores all WAL.
- Page server no longer scans and restores everything into memory at startup.
Instead, when the first request is made for a timeline, the timeline is
slurped into memory at that point.
- The responsibility for telling page server to "callmemaybe" was moved
into Postgres libpqpagestore code. Also, WAL producer connstring cannot
be specified in the pageserver's command line anymore.
- Having multiple "system identifiers" in the same page server is no
longer supported. I repurposed much of that code to support multiple
timelines, instead.
- Implemented very basic, incomplete, support for PostgreSQL's Extended
Query Protocol in page_service.rs. Turns out that rust-postgres'
copy_out() function always uses the extended query protocol to send
out the command, and I'm using that to stream the base backup from the
page server.
TODO: I haven't fixed the WAL safekeeper for this scheme, so all the
integration tests involving safekeepers are failing. My plan is to modify
the safekeeper to know about Zenith timelines, too, and modify it to work
with the same Zenith repository format. It only needs to care about the
'.zenith/timelines/<timeline>/wal' directories.
Using the hash should allow us to change the remote repo and propagate
that change to user builds without that change becoming visible at a
random time.
It's unfortunate that we can't declare this dependency once in the
top-level Cargo.toml; that feature request is rust-lang rfc 2906.