This removes the remaining issues reported by cargo audit: one error and
one warning:
Crate: tokio
Version: 1.5.0
Title: Task dropped in wrong thread when aborting `LocalSet` task
Date: 2021-07-07
ID: RUSTSEC-2021-0072
URL: https://rustsec.org/advisories/RUSTSEC-2021-0072
Solution: Upgrade to >=1.5.1, <1.6.0 OR >=1.6.3, <1.7.0 OR >=1.7.2, <1.8.0 OR >=1.8.1
Crate: cpuid-bool
Version: 0.1.2
Warning: unmaintained
Title: `cpuid-bool` has been renamed to `cpufeatures`
Date: 2021-05-06
ID: RUSTSEC-2021-0064
URL: https://rustsec.org/advisories/RUSTSEC-2021-0064
To simplify cloud ops, allow configuration via file.
toml is used as the config format, and the file is stored in the working
directory.
Arguments used at initialization are saved in the config file.
Config file params may be overridden by CLI arguments.
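For illustration, loading and overriding might look roughly like this (the
struct, field, and file names here are hypothetical, not the actual ones):

    use serde::{Deserialize, Serialize};

    // Hypothetical config struct; the real fields differ.
    #[derive(Serialize, Deserialize)]
    struct PageServerConf {
        listen_addr: String,
        gc_horizon: u64,
    }

    // Read the TOML file from the working directory, then let any CLI
    // argument that was actually given override the value from the file.
    fn load_conf(cli_listen_addr: Option<String>) -> anyhow::Result<PageServerConf> {
        let text = std::fs::read_to_string("pageserver.toml")?; // file name assumed
        let mut conf: PageServerConf = toml::from_str(&text)?;
        if let Some(addr) = cli_listen_addr {
            conf.listen_addr = addr; // CLI wins over the file
        }
        Ok(conf)
    }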
Now postgres_backend communicates with the client, passing queries to the
provided handler; we currently have two handlers, for wal_acceptor and
pageserver.
BytesMut is again used for writing data, to avoid manual message-length
calculation.
ref #118
- All timelines are now stored in the same rocksdb repository. The GET
functions have been taught to follow the ancestors.
- Change the way relation size is stored. Instead of inserting "tombstone"
entries for blocks that are truncated away, store the relation size as a
separate key-value entry for each relation.
- Add an abstraction for the key-value store: ObjectStore. It allows
swapping RocksDB with some other key-value store easily (a rough sketch
follows this list). Perhaps we will write our own storage implementation
using that interface, or perhaps we'll need a different abstraction, but
this is a small improvement over the status quo in any case.
- Garbage Collection is broken and commented out. It's not clear where and
how it should be implemented.
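As a rough illustration only (these method signatures and the plain
byte-slice key are assumptions, not the actual trait):

    // A minimal sketch of a swappable key-value interface. The real
    // ObjectStore trait's methods, key type and error handling will differ.
    pub trait ObjectStore: Send + Sync {
        fn get(&self, key: &[u8], lsn: u64) -> anyhow::Result<Option<Vec<u8>>>;
        fn put(&self, key: &[u8], lsn: u64, value: &[u8]) -> anyhow::Result<()>;
    }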
Compiling a Regex is very expensive, so let's not do it on every
invocation. This was consuming a big fraction of the time in creating
a new base backup at "zenith pg create". This commit brings down the
time to run "zenith pg create" on a freshly created repository from
about 2 seconds to 1 second.
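A minimal sketch of the caching pattern, assuming once_cell and the regex
crate (the pattern and names below are made up for illustration):

    use once_cell::sync::Lazy;
    use regex::Regex;

    // Compile the pattern once, on first use, instead of on every call.
    static SEGMENT_NAME_RE: Lazy<Regex> =
        Lazy::new(|| Regex::new(r"^[0-9A-F]{24}$").expect("invalid regex"));

    fn looks_like_wal_segment(name: &str) -> bool {
        SEGMENT_NAME_RE.is_match(name)
    }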
It's not worth spending much effort on optimizing things at this stage
in general, but we might as well pick low-hanging fruit like this.
This patch started as an effort to support CLI working against remote
pageserver, but turned into a pretty big refactoring.
* CLI now does not look into repository files directly. New commands
'branch_create' and 'identify_system' were introduced into page_service to
support that.
* Branch management that was scattered between local_env and
zenith/main.rs is moved into pageserver/branches.rs. That code might fit
better in the Repository/Timeline impl, but I'll leave that for a different patch.
* All test-related code from local_env went into integration_tests/src/lib.rs as an
extension to the PostgresNode trait.
* Path-generating functions were consolidated around the corresponding config
types (LocalEnv and PageserverConf).
This isn't just cosmetic; it also fixes one bug: the code in the
parse_point_in_time() function used str::parse::<u64>() to parse the
parts of the LSN string (e.g. 0/1A2B3C4D). That's wrong, because the
LSN consists of hex digits, not base-10.
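The fix is roughly to parse each half with radix 16; a sketch (the real
function's signature and error handling differ):

    // Parse an LSN like "0/1A2B3C4D": both halves are hexadecimal, so
    // u64::from_str_radix(.., 16) is needed, not str::parse::<u64>().
    fn parse_lsn(s: &str) -> Option<u64> {
        let (hi, lo) = s.split_once('/')?;
        let hi = u64::from_str_radix(hi, 16).ok()?;
        let lo = u64::from_str_radix(lo, 16).ok()?;
        Some((hi << 32) | lo)
    }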
I didn't think this mattered, but it does: if you add a dependency to
zenith_utils, but forget to request a feature you need, the crate will
build from the workspace root, but not by itself.
It's probably better to pull in the whole dependency tree.
This leaves one problem unsolved: the missing feature above will now be
a latent bug. If that feature gets removed later by other crates, and
then the workspace_hack Cargo.toml is updated, this missing feature will
become a build failure.
This version validates on every call that our result is exactly the same
as the previous result.
NodeId is a strange corner case: one field is serialized little-endian
and one field is serialized big-endian. Hopefully we can fix that in the
future.
This module adds two traits that implement bincode-based serialization.
BeSer implements methods for big-endian encoding/decoding.
LeSer implements methods for little-endian encoding/decoding.
Right now, the BeSer and LeSer methods have the same names, meaning you
can't `use` them both at the same time. This is intended to be a safety
mechanism: mixing big-endian and little-endian encoding in the same file
is error-prone. There are ways around this, but the easiest fix is to
put the big-endian code and little-endian code in different files or
submodules.
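Conceptually, the big-endian half could be expressed like this (a sketch
built on bincode's Options API; the method names below are placeholders,
not necessarily the ones in this module):

    use bincode::Options;
    use serde::{de::DeserializeOwned, Serialize};

    // Fix the byte order in one place so call sites can't mix encodings.
    // LeSer would mirror this with with_little_endian().
    pub trait BeSer: Serialize + DeserializeOwned {
        fn ser(&self) -> Result<Vec<u8>, bincode::Error> {
            bincode::options().with_big_endian().serialize(self)
        }
        fn des(buf: &[u8]) -> Result<Self, bincode::Error> {
            bincode::options().with_big_endian().deserialize(buf)
        }
    }
    impl<T: Serialize + DeserializeOwned> BeSer for T {}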
Switch over to a newer version of rust-postgres PR752. A few
minor changes are required:
- PgLsn::UNDEFINED -> PgLsn::from(0)
- PgTimestamp -> SystemTime
Our builds can be a little inconsistent, because Cargo doesn't deal well
with workspaces where multiple crates select different features of the
same dependencies. As a workaround, copy what
other big rust projects do: add a workspace_hack crate.
This crate just pins down a set of dependencies and features that
satisfies all of the workspace crates.
The benefits are:
- running `cargo build` from one of the workspace subdirectories now
works without rebuilding anything.
- running `cargo install` works (without rebuilding anything).
- making small dependency changes is much less likely to trigger large
dependency rebuilds.
A few things that Eric commented on in PR #96:
- Use thiserror to simplify the implementation of FilePathError (see the
sketch after this list)
- Add unit tests
- Fix a few complaints from clippy
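For example, the thiserror version might look roughly like this (the
variants shown are illustrative, not the real ones):

    use thiserror::Error;

    #[derive(Error, Debug)]
    pub enum FilePathError {
        #[error("invalid relation data file name: {0}")]
        InvalidFileName(String),
        #[error(transparent)]
        Parse(#[from] std::num::ParseIntError),
    }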
The local fork of rust-s3 has some code to support Google Cloud, but
that PR no longer applies upstream, and will need significant changes
before it can be re-submitted.
In the meantime, we might as well just use the most similar upstream
release. The benefit of switching is that it fixes a feature-resolution
bug that was causing us to build 24 more crates than needed (mostly
async-std and its dependencies).
Since we now call the syscall directly, read_pidfile can parse the pid as
an integer.
We also verify the pid is >= 1, because calling kill on 0 or a negative
value signals whole process groups and goes straight to crazytown.
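A sketch of the shape of that check, assuming anyhow for error handling
(the real function and its messages may differ):

    use anyhow::{bail, Context, Result};
    use std::path::Path;

    fn read_pidfile(path: &Path) -> Result<i32> {
        let contents = std::fs::read_to_string(path)
            .with_context(|| format!("failed to read pidfile {:?}", path))?;
        let pid: i32 = contents
            .trim()
            .parse()
            .context("pidfile does not contain an integer")?;
        // kill(0, ..) and kill(-pid, ..) signal whole process groups, so
        // refuse anything that isn't a plain positive pid.
        if pid < 1 {
            bail!("pidfile contains an invalid pid: {}", pid);
        }
        Ok(pid)
    }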
If there isn't any version specified for a dependency crate, Cargo may
choose a newer version. This could happen when Cargo.lock is updated
("cargo update") but can also happen unexpectedly when adding or
changing other dependencies. This can allow API-breaking changes to be
picked up, breaking the build.
To prevent this, specify versions for all dependencies. Cargo is still
allowed to pick newer versions that are (hopefully) non-breaking, by
analyzing the semver version number.
There are two special cases here:
1. serde_derive::{Serialize, Deserialize} isn't really used any more. It
was only a separate crate in the past because of compiler limitations.
Nowadays, people turn on the "derive" feature of the serde crate and
use serde::{Serialize, Deserialize}.
2. parse_duration is unmaintained and has an open security issue (GitHub
issue 87). That issue probably isn't critical for us because of where we
use that crate, but it's probably still better to pin the version so we
can't get hit with an API-breaking change at an awkward time.
Remove 'async' usage as much as feasible. Async code is harder to debug,
and mixing async and non-async code is a recipe for confusion and bugs.
There are a couple of exceptions:
- The code in walredo.rs, which needs to read and write to the child
process simultaneously, still uses async. It's more convenient there.
The 'async' usage is carefully limited to just the functions that
communicate with the child process.
- Code in walreceiver.rs that uses tokio-postgres to do streaming
replication. We have to use async there, because tokio-postgres is
async. Most rust-postgres functionality has non-async wrappers, but
not the new replication client code. The async usage is very limited
here, too: we just use block_on to call the tokio-postgres functions,
roughly as in the sketch below.
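The pattern is roughly the following sketch; walreceiver_main and the
connection handling shown here are simplified stand-ins, not the real code:

    use tokio::runtime::Runtime;

    fn walreceiver_main(connstr: String) -> anyhow::Result<()> {
        // Build a runtime for this thread and block on the few async
        // tokio-postgres calls; everything around it stays synchronous.
        let runtime = Runtime::new()?;
        runtime.block_on(async move {
            let (client, connection) =
                tokio_postgres::connect(&connstr, tokio_postgres::NoTls).await?;
            // The Connection object drives the socket I/O; run it in the background.
            tokio::spawn(async move {
                if let Err(e) = connection.await {
                    eprintln!("connection error: {}", e);
                }
            });
            // ... issue START_REPLICATION and consume the stream here ...
            let _ = client;
            Ok(())
        })
    }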
The code in 'page_service.rs' now launches a dedicated thread for each
connection.
This replaces tokio::sync::watch::channel with std::sync::mpsc in
'seqwait.rs', to make that non-async. It's not a drop-in replacement,
though: std::sync::mpsc doesn't support multiple consumers, so we cannot
share a channel between multiple waiters. So this removes the code to
check if an existing channel can be reused, and creates a new one for
each waiter. That created another problem: a BTreeMap cannot hold
duplicate keys, so I replaced it with a BinaryHeap.
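The waiter bookkeeping then looks something like this (a sketch with
made-up names; the real code differs in detail):

    use std::cmp::Ordering;
    use std::collections::BinaryHeap;
    use std::sync::mpsc::SyncSender;

    // One pending waiter: its own channel plus the number it waits for.
    struct Waiter {
        wake_num: u64,
        sender: SyncSender<()>,
    }

    // Order the heap by wake_num only, reversed so the smallest number pops
    // first; duplicates are fine, unlike with a BTreeMap keyed by wake_num.
    impl PartialEq for Waiter {
        fn eq(&self, other: &Self) -> bool { self.wake_num == other.wake_num }
    }
    impl Eq for Waiter {}
    impl Ord for Waiter {
        fn cmp(&self, other: &Self) -> Ordering { other.wake_num.cmp(&self.wake_num) }
    }
    impl PartialOrd for Waiter {
        fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
    }

    type Waiters = BinaryHeap<Waiter>;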
Similarly, the tokio::{mpsc, oneshot} channels used between WAL redo
manager and PageCache are replaced with std::sync::mpsc. (There is no
separate 'oneshot' channel in the standard library.)
Fixes github issue #58, and coincidentally also issue #66.
After the rocksdb patch (commit 6aa38d3f7d), the CacheEntry struct was
used only momentarily in the communication between the page_cache and
the walredo modules. It was in fact not stored in any cache anymore.
For clarity, refactor the communication.
There is now a WalRedoManager struct with a `request_redo` function that
can be used to request WAL replay of a particular page. It sends
a request to a queue like before, but the queue has been replaced with
tokio::sync::mpsc. Previously, the resulting page image was stored
directly in the CacheEntry, and the requestor was notified using a
condition variable. Now, the requestor includes a 'oneshot' channel in
the request, and the WAL redo manager sends the response there.
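In sketch form (the types are simplified and everything except request_redo
is a made-up name):

    use tokio::sync::{mpsc, oneshot};

    // The caller includes a oneshot sender in the request; the WAL redo
    // manager sends the reconstructed page image back on it.
    struct WalRedoRequest {
        // ... plus fields identifying which page to reconstruct ...
        lsn: u64,
        response: oneshot::Sender<Vec<u8>>,
    }

    struct WalRedoManager {
        queue: mpsc::Sender<WalRedoRequest>,
    }

    impl WalRedoManager {
        async fn request_redo(&self, lsn: u64) -> Vec<u8> {
            let (tx, rx) = oneshot::channel();
            if self.queue.send(WalRedoRequest { lsn, response: tx }).await.is_err() {
                panic!("WAL redo thread has died");
            }
            rx.await.expect("WAL redo thread dropped the request")
        }
    }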
When calling into the page cache, it was possible to wait on a blocking
mutex, which can stall the async executor.
Replace that sleep with a SeqWait::wait_for(lsn).await so that the
executor can go on with other work while we wait.
Change walreceiver_works to an AtomicBool to avoid the awkwardness of
taking the lock, then dropping it while we call wait_for and then
acquiring it again to do real work.
SeqWait adds a way to .await the arrival of some sequence number.
It provides wait_for(num) which is an async fn, and advance(num) which
is synchronous.
This should be useful in solving the page cache deadlocks, and may be
useful in other areas too.
This implementation still uses a Mutex internally, but only for a brief
critical section. If we find this code broadly useful and start to care
more about executor stalls due to unfair thread scheduling, there might
be ways to make it lock-free.
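Usage is intended to look roughly like this (SeqWait::new and the exact
signatures here are assumptions):

    // Async side: suspend this task until the sequence number reaches 1000.
    async fn wait_example(seq: &SeqWait) {
        seq.wait_for(1000).await;
    }

    // Sync side (no .await needed): advance the counter, waking every
    // waiter whose target number is <= 1000.
    fn advance_example(seq: &SeqWait) {
        seq.advance(1000);
    }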
This replaces the page server's "datadir" concept. The Page Server now
always works with a "Zenith Repository". When you initialize a new
repository with "zenith init", it runs initdb and loads an initial
basebackup of the freshly-created cluster into the repository, on the
"main" branch. A repository can hold multiple "timelines", which can be given
human-friendly names, making them "branches". One page server simultaneously
serves all timelines stored in the repository, and you can have multiple
Postgres compute nodes connected to the page server, as long as they each
operate on a different timeline.
There is a new command "zenith branch", which can be used to fork off
new branches from existing branches.
The repository uses the directory layout described as Repository format
v1 in https://github.com/zenithdb/rfcs/pull/5. It is *highly* inefficient:
- we never create new snapshots. So in practice, it's really just a base
backup of the initial empty cluster, and everything else is reconstructed
by redoing all WAL
- when you create a new timeline, the base snapshot and *all* WAL are copied
from the old timeline to the new one. There are no smarts about
referencing the old snapshots/wal from the ancestor timeline.
To support all this, this commit includes a bunch of other changes:
- Implement "basebackup" funtionality in page server. When you initialize
a new compute node with "zenith pg create", it connects to the page
server, and requests a base backup of the Postgres data directory on
that timeline. (the base backup excludes user tables, so it's not
as bad as it sounds).
- Have page server's WAL receiver write the WAL into timeline dir. This
allows running a Page Server and Compute Nodes without a WAL safekeeper,
until we get around to integrating it properly into the system. (Even
after we integrate WAL safekeeper, this is perhaps how this will operate
when you want to run the system on your laptop.)
- restore_datadir.rs was renamed to restore_local_repo.rs, and heavily
modified to use the new format. It now also restores all WAL.
- Page server no longer scans and restores everything into memory at startup.
Instead, when the first request is made for a timeline, the timeline is
slurped into memory at that point.
- The responsibility for telling page server to "callmemaybe" was moved
into Postgres libpqpagestore code. Also, WAL producer connstring cannot
be specified in the pageserver's command line anymore.
- Having multiple "system identifiers" in the same page server is no
longer supported. I repurposed much of that code to support multiple
timelines, instead.
- Implemented very basic, incomplete support for PostgreSQL's Extended
Query Protocol in page_service.rs. Turns out that rust-postgres'
copy_out() function always uses the extended query protocol to send
out the command, and I'm using that to stream the base backup from the
page server.
TODO: I haven't fixed the WAL safekeeper for this scheme, so all the
integration tests involving safekeepers are failing. My plan is to modify
the safekeeper to know about Zenith timelines, too, and modify it to work
with the same Zenith repository format. It only needs to care about the
'.zenith/timelines/<timeline>/wal' directories.
This is a first attempt at a new error-handling strategy:
- Use anyhow::Error as the first choice for easy error handling
- Use thiserror to generate local error types for anything that needs
one (where no suitable error type is available to us) or that will be
inspected or matched on by higher layers. A sketch of both tiers follows.
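A sketch of how the two tiers combine (the error type and messages below
are illustrative):

    use anyhow::{Context, Result};
    use thiserror::Error;

    // Local, matchable error type for callers that need to distinguish cases.
    #[derive(Error, Debug)]
    pub enum WalFileError {
        #[error("short read: expected {expected} bytes, got {got}")]
        ShortRead { expected: usize, got: usize },
        #[error(transparent)]
        Io(#[from] std::io::Error),
    }

    // Everywhere else, anyhow::Result with some context is enough.
    fn load_segment(path: &std::path::Path) -> Result<Vec<u8>> {
        std::fs::read(path)
            .with_context(|| format!("could not read WAL segment {:?}", path))
    }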
Using the hash should allow us to change the remote repo and propagate
that change to user builds without that change becoming visible at a
random time.
It's unfortunate that we can't declare this dependency once in the
top-level Cargo.toml; that feature request is rust-lang rfc 2906.