This replaces the RocksDB based implementation with an approach using
"snapshot files" on disk, and in-memory btreemaps to hold the recent
changes.
This make the repository implementation a configuration option. You can
choose 'layered' or 'rocksdb' with "zenith init --repository-format=<format>"
The unit tests have been refactored to exercise both implementations.
'layered' is now the default.
Push/pull is not implemented. The 'test_history_inmemory' test has been
commented out accordingly. It's not clear how we will implement that
functionality; probably by copying the snapshot files directly.
Most of the work here was done on the postgres side. There's more
information in the commit message there.
(see: 04cfa326a5)
On the WAL acceptor side, we're now expecting 'START_WAL_PUSH' to
initialize the WAL keeper protocol. Everything else is mostly the same,
with the only real difference being that protocol messages are now
discrete CopyData messages sent over the postgres protocol.
For the sake of documentation, the full set of these messages is:
<- recv: START_WAL_PUSH query
<- recv: server info from postgres (type `ServerInfo`)
-> send: walkeeper info (type `SafeKeeperInfo`)
<- recv: vote info (type `RequestVote`)
if node id mismatch:
-> send: self node id (type `NodeId`); exit
-> send: confirm vote (with node id) (type `NodeId`)
loop:
<- recv: info and maybe WAL block (type `SafeKeeperRequest` + bytes)
(break loop if done)
-> send: confirm receipt (type `SafeKeeperResponse`)
Current state with authentication.
Page server validates JWT token passed as a password during connection
phase and later when performing an action such as create branch tenant
parameter of an operation is validated to match one submitted in token.
To allow access from console there is dedicated scope: PageServerApi,
this scope allows access to all tenants. See code for access validation in:
PageServerHandler::check_permission.
Because we are in progress of refactoring of communication layer
involving wal proposer protocol, and safekeeper<->pageserver. Safekeeper
now doesn’t check token passed from compute, and uses “hardcoded” token
passed via environment variable to communicate with pageserver.
Compute postgres now takes token from environment variable and passes it
as a password field in pageserver connection. It is not passed through
settings because then user will be able to retrieve it using pg_settings
or SHOW ..
I’ve added basic test in test_auth.py. Probably after we add
authentication to remaining network paths we should enable it by default
and switch all existing tests to use it.
Server functionality requires not only the "server" feature flag, but
also either "http1" or "http2" (or both). To make things simpler
(and prevent analogous problems), enable all features.
This patch has been extracted from #348, where it became unnecessary
after we had decided that we didn't want to measure anything inside
PostgresBackend.
IMO the change is good enough to make its way into the codebase,
even though it brings nothing "new" to the code.
The metrics are served by an http endpoint, which
is meant to be spawned in a new thread.
In the future the endpoint will provide more APIs,
but for the time being, we won't bother with proper routing.
It used to be the case that walkeeper's background thread
failed to recognize the end of stream (EOF) signaled by the
`Ok(None)` result of `FeMessage::read`.
Now postgres_backend communicates with the client, passing queries to the
provided handler; we have two currently, for wal_acceptor and pageserver.
Now BytesMut is again used for writing data to avoid manual message length
calculation.
ref #118
We're starting to deserialize directly from the TcpStream now, which
means that a socket error gets logged as "deserialize error". That's not
very helpful; preserve the io::Error so it can be logged.
The pieces are:
base Connection
SendWal
ReplicationHandler
There are lots of other changes here:
- Put the replication reader in a background thread; this gets rid
of some hacks with nonblocking mode.
- Stop manually buffering input data; use BufReader instead.
- Use BytesMut a lot less; use Read/Write traits where possible.
I didn't think this mattered, but it does: if you add a dependency to
zenith_utils, but forget to request a feature you need, the crate will
build from the workspace root, but not by itself.
It's probably better to pull in the whole dependency tree.
This leaves one problem unsolved: the missing feature above will now be
a latent bug. If that feature gets removed later by other crates, and
then the workspace_hack Cargo.toml is updated, this missing feature will
become a build failure.
This module adds two traits that implement bincode-based serialization.
BeSer implements methods for big-endian encoding/decoding.
LeSer implements methods for little-endian encoding/decoding.
Right now, the BeSer and LeSer methods have the same names, meaning you
can't `use` them both at the same time. This is intended to be a safety
mechanism: mixing big-endian and little-endian encoding in the same file
is error-prone. There are ways around this, but the easiest fix is to
put the big-endian code and little-endian code in different files or
submodules.
Having multiple copies of the same values is a source of confusion.
Commit da9bf5dc63 fixed one race condition caused by that, for example.
See also discussion at
https://github.com/zenithdb/zenith/issues/57#issuecomment-824393470
This changes SeqWait.advance() to return the old number, and not panic if
you try to move the value backwards. The caller should check for that and
act accordingly.
Remove 'async' usage a much as feasible. Async code is harder to debug,
and mixing async and non-async code is a recipe for confusion and bugs.
There are a couple of exceptions:
- The code in walredo.rs, which needs to read and write to the child
process simultaneously, still uses async. It's more convenient there.
The 'async' usage is carefully limited to just the functions that
communicate with the child process.
- Code in walreceiver.rs that uses tokio-postgres to do streaming
replication. We have to use async there, because tokio-postgres is
async. Most rust-postgres functionality has non-async wrappers, but
not the new replication client code. The async usage is very limited
here, too: we use just block_on to call the tokio-postgres functions.
The code in 'page_service.rs' now launches a dedicated thread for each
connection.
This replaces tokio::sync:⌚:channel with std::sync:mpsc in
'seqwait.rs', to make that non-async. It's not a drop-in replacement,
though: std::sync::mpsc doesn't support multiple consumers, so we cannot
share a channel between multiple waiters. So this removes the code to
check if an existing channel can be reused, and creates a new one for
each waiter. That created another problem: BTreeMap cannot hold
duplicates, so I replaced that with BinaryHeap.
Similarly, the tokio::{mpsc, oneshot} channels used between WAL redo
manager and PageCache are replaced with std::sync::mpsc. (There is no
separate 'oneshot' channel in the standard library.)
Fixes github issue #58, and coincidentally also issue #66.
AtomicLsn is a wrapper around AtomicU64 that has load() and store()
members that are cheap (on x86, anyway) and can be safely used in any
context.
This commit uses AtomicLsn in the page cache, and fixes up some
downstream code that manually implemented LSN formatting.
There's also a bugfix to the logging in wait_lsn, which prints the
wrong lsn value.