This is needed for implementation of tenant rebalancing. With this
change safekeeper becomes aware of which pageserver is supposed to be
used for replication from this particular compute.
Should have been a part of cba4da3f4d to provide upgrade for previously
existing clusters. Separates version independent header (magic + version) out of
SafeKeeperState to choose what to deserialize.
This patch introduces fixes for several problems affecting
LLVM-based code coverage:
* Daemonizing parent processes should call _exit() to prevent
coverage data file corruption (*.profraw) due to concurrent writes.
* Implement proper shutdown handlers in safekeeper.
write_lsn - The last LSN received and processed by pageserver's walreceiver.
flush_lsn - same as write_lsn. At pageserver it doesn't guarantees data persistence, but it's fine. We rely on safekeepers.
apply_lsn - The LSN at which pageserver guaranteed persistence of all received data (disk_consistent_lsn).
Persist full history of term switches on safekeepers instead of storing only the
single term of the highest entry (called epoch). This allows easily and
correctly find the divergence point of two logs and truncate the obsolete part
before overwriting it with entries of the newer proposer(s).
Full history of the proposer is transferred in separate message before proposer
starts streaming; it is immediately persisted by safekeeper, though he might not
yet have entries for some older terms there. That's because we can't atomically
append to WAL and update the control file anyway, so locally available WAL must
be taken into account when looking at the history.
We should sometimes purge term history entries beyond truncate_lsn; this is not
done here.
Per https://github.com/zenithdb/rfcs/pull/12Closes#296.
Bumps vendor/postgres.
Now safekeeper control file updated in a following way:
1. Write data to temp file
2. Fsync the temporary file (if sync option is specified)
3. Rename temporary file to actual control file
4. Fsync containing directory (if sync option is specified)
5. Fsync file after rename (if sync option is specified).
Note that action 5 is not mentioned anywhere as required but it is done
in postgres this way (see durable_rename).
Also because of the rename machinery switch to use dedicated lock file
to prevent running several safekeepers concurrently on the same data
cleanup
fsync control file after rename to match postgres behaviour
Git commit sha is displayed when --version flag is used and is written
to logs during service startup. Uses git_version crate when git is
available, and GIT_VERSION environment variable otherwise which is the case for docker
builds.
The -D option to specify working directory was broken:
$ mkdir foobar
$ ./target/debug/safekeeper -D foobar
Error: failed to open "foobar/safekeeper.log"
Caused by:
No such file or directory (os error 2)
This was because we both chdir'd into to specified directory, and also
prepended the directory to all the paths. So in the above example, it
actually tried to create the log file in "foobar/foobar/safekepeer.log"
Change it to work the same way as in the pageserver: chdir to the
specified directory, and leave 'workdir' always set to ".".
We wouldn't necessarily need the 'workdir' variable in the config at all,
and could assume that the current working directory is always the
safekeeper data directory, but I'd like to keep this consistent with the
the pageserver. The page server doesn't assume that for the sake of unit
tests. We don't currently have unit tests in the safekeeper that write
to disk but we might want to in the future.
Which is mainly generational state (terms) and useful LSNs.
Also add /status basic healthcheck request which is now used in tests to
determine the safekeeper is up; this fixes#726.
ref #115
* Rename wal_acceptor binary to safekeeper
* Rename wal_acceptor.pid and wal_acceptor.log to safekeeper.pid and safekeeper.log
* Change some mentions of WAL acceptor to safekeeper
* Dockerfile: alias wal_acceptor to safekeeper temporarily until internal scripts are updated
- perform checkpoint for each tenant repository.
- wait for the completion of all threads.
Add new option 'immediate' to 'pageserver stop' command to terminate the pageserver immediately.
Whenever we start processing a request, we now enter a tracing "span"
that includes context information like the tenant and timeline ID, and
the operation we're performing. That context information gets attached
to every log message we create within the span. That way, we don't need
to include basic context information like that in every log message, and
it also becomes easier to filter the logs programmatically.
This removes the eplicit timeline and tenant IDs from most log messages,
as you get that information from the enclosing span now.
Also improve log messages in general, dialing down the level of some
messages that are not very useful, and adding information to others.
We now obey the RUST_LOG env variable, if it's set.
The 'tracing' crate allows for different log formatters, like JSON or
bunyan output. The one we use now is human-readable multi-line format,
which is nice when reading the log directly, but hard for
post-processing. For production, we'll probably want JSON output and
some tools for working with it, but that's left as a TODO. The log
format is easy to change.
* `wal_acceptor`: add HTTP handler, /metrics endpoint only, no authentication
* Two gauges are currently reported: `flush_lsn` and `commit_lsn`
* Add `DEFAULT_PG_LISTEN_PORT` and `DEFAULT_PG_LISTEN_PORT` consts for uniformity
It previously took &SafeKeeperState similar to persist(), but only for its
`server` member.
Now it takes &ServerInfo only, so there it's clear the state is not persisted.
Also added a comment about sync.
* Send ProposerGreeting manually in tests
* Move test_sync_safekeepers to test_wal_acceptor.py
* Capture test_sync_safekeepers output
* Add comment for handle_json_ctrl
* Save captured output in CI
Commit message copied below:
* Allow LeSer/BeSer impls missing Serialize/Deserialize
Currently, using `LeSer` or `BeSer` requires that the type implements
both `Serialize` and `DeserializeOwned`, even if we're only using the
trait for one of those functionalities.
Moving the bounds to the methods gives the convenience of the traits
without requiring unnecessary derives.
* Remove unused #[derive(Serialize/Deserialize)]
This should hopefully reduce compile times - if only by a little bit.
Some of these were already unused (we weren't using LeSer/BeSer for the
types), but most are have *become* unused with the change to
LeSer/BeSer.
* Allow LeSer/BeSer impls missing Serialize/Deserialize
Currently, using `LeSer` or `BeSer` requires that the type implements
both `Serialize` and `DeserializeOwned`, even if we're only using the
trait for one of those functionalities.
Moving the bounds to the methods gives the convenience of the traits
without requiring unnecessary derives.
* Remove unused #[derive(Serialize/Deserialize)]
This should hopefully reduce compile times - if only by a little bit.
Some of these were already unused (we weren't using LeSer/BeSer for the
types), but most are have *become* unused with the change to
LeSer/BeSer.
New command has been added to append specially crafted records in safekeeper WAL. This command takes json for append, encodes LogicalMessage based on json fields, and processes new AppendRequest to append and commit WAL in safekeeper.
Python test starts up walkeepers and creates config for walproposer, then appends WAL and checks --sync-safekeepers works without errors. This test is simplest one, more useful test cases (like in #545) for different setups will be added soon.
This contains a lowest common denominator of pageserver and safekeeper log
initialisation routines. It uses daemonize flag to decide where to
stream log messages. In case daemonize is true log messages are
forwarded to file. Otherwise streaming to stdout is used. Usage of
stdout for log output is the default in docker side of things, so make
it easier to browse our logs via builtin docker commands.
Otherwise restart of safekeeper before the first segment is filled makes it
report 0 as flushed LSN. To this end, tweak find_end_of_wal_segment to allow
starting from given LSN, not only from the start of the segment. While here,
make it less panicky.
Otherwise we produce corrupted record holes in WAL during compute node restart
in case there was an unfinished record from the old compute, as these reports
advance commit_lsn -- reliably persisted part of WAL.
ref #549.
Mostly by @knizhnik. I adjusted to make sure proposer always starts streaming
since record beginning so we don't need special quirks for decoding in
safekeeper.
1) Do epoch switch without record from new epoch, immediately after recovery --
--sync-safekeepers mode doesn't generate new records.
2) Fix commit_lsn advancement by taking into account wal we have locally --
setting it further is incorrect.
3) Report it back to walproposer so he knows when sync is done.
4) Remove system id check as it is unknown in sync mode.
And make logging slightly better.
ref #439
epochStartLsn is the LSN since which new proposer writes its WAL in its epoch,
let's be more explicit here.
truncate_lsn is LSN still needed by the most lagging safekeeper. restart_lsn is
terminology from pg_replicaton_slots, but here we don't really have 'restart';
hopefully truncate word makes it clearer.
1) Extract consensus logic to safekeeper.rs.
2) Change the voting flow so that acceptor tells his epoch along with giving
the vote, not before it; otherwise it might get immediately stale. #294
3) Process messages from compute atomically and sync state properly. #270
4) Use separate structs for disk and network.
ref #315