Create `safekeeper_pg_io_bytes_total` metric to track total amount of
bytes written/read in a postgres connections to safekeepers. This metric
has the following labels:
- `client_az` – availability zone of the connection initiator, or
`"unknown"`
- `sk_az` – availability zone of the safekeeper, or `"unknown"`
- `app_name` – `application_name` of the postgres client
- `dir` – data direction, either `"read"` or `"write"`
- `same_az` – `"true"`, `"false"` or `"unknown"`. Can be derived from
`client_az` and `sk_az`, exists purely for convenience.
This is implemented by passing availability zone in the connection
string, like this: `-c tenant_id=AAA timeline_id=BBB
availability-zone=AZ-1`.
Update ansible deployment scripts to add availability_zone argument
to safekeeper and pageserver in systemd service files.
This makes it possible to enable authentication only for the mgmt HTTP
API or the compute API. The HTTP API doesn't need to be directly
accessible from compute nodes, and it can be secured through network
policies. This also allows rolling out authentication in a piecemeal
fashion.
- Add support for splitting async postgres_backend into read and write halfes.
Safekeeper needs this for bidirectional streams. To this end, encapsulate
reading-writing postgres messages to framed.rs with split support without any
additional changes (relying on BufRead for reading and BytesMut out buffer for
writing).
- Use async postgres_backend throughout safekeeper (and in proxy auth link
part).
- In both safekeeper COPY streams, do read-write from the same thread/task with
select! for easier error handling.
- Tidy up finishing CopyBoth streams in safekeeper sending and receiving WAL
-- join split parts back catching errors from them before returning.
Initially I hoped to do that read-write without split at all, through polling
IO:
https://github.com/neondatabase/neon/pull/3522
However that turned out to be more complicated than I initially expected
due to 1) borrow checking and 2) anon Future types. 1) required Rc<Refcell<...>>
which is Send construct just to satisfy the checker; 2) can be workaround with
transmute. But this is so messy that I decided to leave split.
This patch adds a per-timeline periodic task that executes an eviction
policy. The eviction policy is configurable per tenant.
Two policies exist:
- NoEviction (the default one)
- LayerAccessThreshold
The LayerAccessThreshold policy examines the last access timestamp per
layer in the layer map and evicts the layer if that last access is
further in the past than a configurable threshold value.
This policy kind is evaluated periodically at a configurable period.
It logs a summary statistic at `info!()` or `warn!()` level, depending
on whether any evictions failed.
This feature has no explicit killswitch since it's off by default.
This reverts commit 826e89b9ce.
The problem with that commit was that it deletes the TempDir while
there are still EphemeralFile instances open.
At first I thought this could be fixed by simply adding
Handle::current().block_on(task_mgr::shutdown(None, Some(tenant_id), None))
to TenantHarness::drop, but it turned out to be insufficient.
So, reverting the commit until we find a proper solution.
refs https://github.com/neondatabase/neon/issues/3385
Follow-up of https://github.com/neondatabase/neon/pull/3270 which made
an example from main README.md not working.
Fixes that, by adding a way to specify a default tenant now and modifies
the basic neon_local test to start postgres and check branching.
Not all neon_local commands are implemented, so not all README.md
contents is tested yet.
For every Python test, we start the storage first, and expect that
later, in the test, when we start a compute, it will work without
specific timeline and tenant creation or their IDs specified.
For that, we have a concept of "default" branch that was created on the
control plane level first, but that's not needed at all, given that it's
only Python tests that need it: let them create the initial timeline
during set-up.
Before, control plane started and stopped pageserver for timeline
creation, now Python harness runs an extra tenant creation request on
test env init.
I had to adjust the metrics test, turns out it registered the metrics
from the default tenant after an extra pageserver restart.
New model does not sent the metrics before the collection time happens,
and that was 30s before.
Changes are:
* Pageserver: start reading from NEON_AUTH_TOKEN by default.
Warn if ZENITH_AUTH_TOKEN is used instead.
* Compute, Docs: fix the default token name.
* Control plane: change name of the token in configs and start
sequences.
Compatibility:
* Control plane in tests: works, no compatibility expected.
* Control plane for local installations: never officially supported
auth anyways. If someone did enable it, `pageserver.toml` should be updated
with the new `neon.pageserver_connstring` and `neon.safekeeper_token_env`.
* Pageserver is backward compatible: you can run new Pageserver with old
commands and environment configurations, but not vice-versa.
The culprit is the hard-coded `NEON_AUTH_TOKEN`.
* Compute has no code changes. As long as you update its configuration
file with `pageserver_connstring` in sync with the start up scripts,
you are good to go.
* Safekeeper has no code changes and has never used `ZENITH_AUTH_TOKEN` in
the first place.
1.66 release speeds up compile times for over 10% according to tests.
Also its Clippy finds plenty of old nits in our code:
* useless conversion, `foo as u8` where `foo: u8` and similar, removed
`as u8` and similar
* useless references and dereferenced (that were automatically adjusted
by the compiler), removed various `&` and `*`
* bool -> u8 conversion via `if/else`, changed to `u8::from`
* Map `.iter()` calls where only values were used, changed to
`.values()` instead
Standing out lints:
* `Eq` is missing in our protoc generated structs. Silenced, does not
seem crucial for us.
* `fn default` looks like the one from `Default` trait, so I've
implemented that instead and replaced the `dummy_*` method in tests with
`::default()` invocation
* Clippy detected that
```
if retry_attempt < u32::MAX {
retry_attempt += 1;
}
```
is a saturating add and proposed to replace it.
This fixes all kinds of problems related to missing params,
like broken timestamps (due to `integer_datetimes`).
This solution is not ideal, but it will help. Meanwhile,
I'm going to dedicate some time to improving connection machinery.
Note that this **does not** fix problems with passing certain parameters
in a reverse direction, i.e. **from client to compute**. This is a
separate matter and will be dealt with in an upcoming PR.
Removes the race during pageserver initial timeline creation that lead to partial layer uploads.
This race is only reproducible in test code, we do not create initial timelines in cloud (yet, at least), but still nice to remove the non-deterministic behavior.
This patch centralize the logic of creating & reading pid files into the
new pid_file module and improves upon / makes explicit a few race conditions
that existed with the previous code.
Starting Processes / Creating Pidfiles
======================================
Before this patch, we had three places that had very similar-looking
match lock_file::create_lock_file { ... }
blocks.
After this change, they can use a straight-forward call provided
by the pid_file:
pid_file::claim_pid_file_for_pid()
Stopping Processes / Reading Pidfiles
=====================================
The new pid_file module provides a function to read a pidfile,
called read_pidfile(), that returns a
pub enum PidFileRead {
NotExist,
NotHeldByAnyProcess(PidFileGuard),
LockedByOtherProcess(Pid),
}
If we get back NotExist, there is nothing to kill.
If we get back NotHeldByAnyProcess, the pid file is stale and we must
ignore its contents.
If it's LockedByOtherProcess, it's either another pidfile reader
or, more likely, the daemon that is still running.
In this case, we can read the pid in the pidfile and kill it.
There's still a small window where this is racy, but it's not a
regression compared to what we have before.
The NotHeldByAnyProcess is an improvement over what we had before
this patch. Before, we would blindly read the pidfile contents
and kill, even if no other process held the flock.
If the pidfile was stale (NotHeldByAnyProcess), then that kill
would either result in ESRCH or hit some other unrelated process
on the system. This patch avoids the latter cacse by grabbing
an exclusive flock before reading the pidfile, and returning the
flock to the caller in the form of a guard object, to avoid
concurrent reads / kills.
It's hopefully irrelevant in practice, but it's a little robustness
that we get for free here.
Maintain flock on Pidfile of ETCD / any InitialPidFile::Create()
================================================================
Pageserver and safekeeper create their pidfiles themselves.
But for etcd, neon_local creates the pidfile (InitialPidFile::Create()).
Before this change, we would unlock the etcd pidfile as soon as
`neon_local start` exits, simply because no-one else kept the FD open.
During `neon_local stop`, that results in a stale pid file,
aka, NotHeldByAnyProcess, and it would henceforth not trust that
the PID stored in the file is still valid.
With this patch, we make the etcd process inherit the pidfile FD,
thereby keeping the flock held until it exits.
Our shutdown procedure for "pageserver init" was buggy. Firstly, it
merely sent the process a SIGKILL, but did not wait for it to actually
exit. Normally, it should exit quickly as SIGKILL cannot be caught or
ignored by the target process, but it's still asynchronous and the
process can still be alive when the kill(2) call returns. Secondly,
"neon_local" removed the PID file after sending SIGKILL, even though the
process was still running. That hid the first problem: if we didn't
remove the PID file, and you start a new pageserver process while the
old one is still running, you would get an error when the new process
tries to lock the PID file.
We've been seeing a lot of "Cannot assign requested address" failures in
the CI lately. Our theory is that when we run "pageserver init"
immediately followed by "pageserver start", the first process is still
running and listening on the port when the second invocation starts up.
This commit hopefully fixes the problem.
It is generally a bad idea for the "neon_local" to remove the PID file
on the child process's behalf. The correct way would be for the server
process to remove the PID file, after it has fully shutdown everything
else. We don't currently have a robust way to ensure that everything has
truly shut down and closed, however.
A simpler way is to simply never remove the PID file. It's not necessary
to remove the PID file for correctness: we cannot rely on the cleanup to
happen anyway, if the server process crashes for example. Because of
that, we already have all the logic in place to deal with a stale PID
file that belonged to a process that already exited. Let's rely on that
on normal shutdown too.
Nothing interesting in these changes. Passing through the
RUST_BACKTRACE=full will hopefully save someone else panick reproduction
time.
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
* Fix https://github.com/neondatabase/neon/issues/1854
* Never log Safekeeper::conninfo in walproposer as it now contains a secret token
* control_panel, test_runner: generate and pass JWT tokens for Safekeeper to compute and pageserver
* Compute: load JWT token for Safekepeer from the environment variable. Do not reuse the token from
pageserver_connstring because it's embedded in there weirdly.
* Pageserver: load JWT token for Safekeeper from the environment variable.
* Rewrite docs/authentication.md
Downsides are:
* We store all components of the config separately. `Url` stores them inside a single
`String` and a bunch of ints which point to different parts of the URL, which is
probably more efficient.
* It is now impossible to pass arbitrary connection strings to the configuration file,
one has to support all components explicitly. However, we never supported anything
except for `host:port` anyway.
Upsides are:
* This significantly restricts the space of possible connection strings, some of which
may be either invalid or unsupported. E.g. Postgres' connection strings may include
a bunch of parameters as query (e.g. `connect_timeout=`, `options=`). These are nether
validated by the current implementation, nor passed to the postgres client library,
Hence, storing separate fields expresses the intention better.
* The same connection configuration may be represented as a URL in multiple ways
(e.g. either `password=` in the query part or a standard URL password).
Now we have a single canonical way.
* Escaping is provided for `options=`.
Other possibilities considered:
* `newtype` with a `String` inside and some validation on creation.
This is more efficient, but harder to log for two reasons:
* Passwords should never end up in logs, so we have to somehow
* Escaped `options=` are harder to read, especially if URL-encoded,
and we use `options=` a lot.
- Pass through FAILPOINTS environment variable to the pageserver in
"neon_local pageserver start" command
- On startup, list any failpoints that were set with FAILPOINTS to the log
- Add optional "extra_env_vars" argument to the NeonPageserver.start()
function in the python fixture, so that you can pass FAILPOINTS
None of the tests use this functionality yet; that comes in a separate
commit.
closes https://github.com/neondatabase/neon/pull/2865
* Poll more frequently when waiting for process start/stop. This
speeds up startup and shutdown in tests. We did this already in
commit 52ce1c9d53, which reduced the interval to 100 ms, but it was
inadvertently increased back to 500 ms in commit d42700280f. Reduce
it to 100 ms again, for both start and stop operations.
* Harmonize the start and stop loops, printing the dots and notices
the same way in both. I considered extracting the logic to a
separate retry-function that takes a closure as argument that does
the polling, but as long as we only have two copies, the code
duplication isn't that bad.
* Remove newline after "Starting pageserver" and "Starting etcd"
messages, so that the progress-indicator dots that are printed once
a second are printed on the same line. Before:
Starting pageserver at '127.0.0.1:64000' in '.neon'
...
pageserver started, pid: 2538937
After:
Starting pageserver at '127.0.0.1:64000' in '.neon'...
pageserver started, pid: 2538937
The "Starting safekeeper" message already got this right.
* Update example output in README.md to match
This API is rather pointless, as sane choice anyway requires knowledge of peers
status and leaders lifetime in any case can intersect, which is fine for us --
so manual elections are straightforward. Here, we deterministically choose among
the reasonably caught up safekeepers, shifting by timeline id to spread the
load.
A step towards custom broker https://github.com/neondatabase/neon/issues/2394
* etcd-client is not updated, since we plan to replace it with another client and the new version errors with some missing prost library error
* clap had released another major update that requires changing every CLI declaration again, deserves a separate PR
The 'local' part was always filled in, so that was easy to merge into
into the TimelineInfo itself. 'remote' only contained two fields,
'remote_consistent_lsn' and 'awaits_download'. I made
'remote_consistent_lsn' an optional field, and 'awaits_download' is now
false if the timeline is not present remotely.
However, I kept stub versions of the 'local' and 'remote' structs for
backwards-compatibility, with a few fields that are actively used by
the control plane. They just duplicate the fields from TimelineInfo
now. They can be removed later, once the control plane has been
updated to use the new fields.
It was only None when you queried the status of a timeline with
'timeline_detail' mgmt API call, and it was still being downloaded. You
can check for that status with the 'tenant_status' API call instead,
checking for has_in_progress_downloads field.
Anothere case was if an error happened while trying to get the current
logical size, in a 'timeline_detail' request. It might make sense to
tolerate such errors, and leave the fields we cannot fill in as empty,
None, 0 or similar, but it doesn't make sense to me to leave the whole
'local' struct empty in tht case.
With the ability to pass commit_lsn. This allows to perform project WAL recovery
through different (from the original) set of safekeepers (or under different
ttid) by
1) moving WAL files to s3 under proper ttid;
2) explicitly creating timeline on safekeepers, setting commit_lsn to the
latest point;
3) putting the lastest .parital file to the timeline directory on safekeepers, if
desired.
Extend test_s3_wal_replay to exersise this behaviour.
Also extends timeline_status endpoint to return postgres information.
Creates new `pageserver_api` and `safekeeper_api` crates to serve as the
shared dependencies. Should reduce both recompile times and cold compile
times.
Decreases the size of the optimized `neon_local` binary: 380M -> 179M.
No significant changes for anything else (mostly as expected).
- Split postgres_ffi into two version specific files.
- Preserve pg_version in timeline metadata.
- Use pg_version in safekeeper code. Check for postgres major version mismatch.
- Clean up the code to use DEFAULT_PG_VERSION constant everywhere, instead of hardcoding.
- Parameterize python tests: use DEFAULT_PG_VERSION env and pg_version fixture.
To run tests using a specific PostgreSQL version, pass the DEFAULT_PG_VERSION environment variable:
'DEFAULT_PG_VERSION='15' ./scripts/pytest test_runner/regress'
Currently don't all tests pass, because rust code relies on the default version of PostgreSQL in a few places.
* Add submodule postgres-15
* Support pg_15 in pgxn/neon
* Renamed zenith -> neon in Makefile
* fix name of codestyle check
* Refactor build system to prepare for building multiple Postgres versions.
Rename "vendor/postgres" to "vendor/postgres-v14"
Change Postgres build and install directory paths to be version-specific:
- tmp_install/build -> pg_install/build/14
- tmp_install/* -> pg_install/14/*
And Makefile targets:
- "make postgres" -> "make postgres-v14"
- "make postgres-headers" -> "make postgres-v14-headers"
- etc.
Add Makefile aliases:
- "make postgres" to build "postgres-v14" and in future, "postgres-v15"
- similarly for "make postgres-headers"
Fix POSTGRES_DISTRIB_DIR path in pytest scripts
* Make postgres version a variable in codestyle workflow
* Support vendor/postgres-v15 in codestyle check workflow
* Support postgres-v15 building in Makefile
* fix pg version in Dockerfile.compute-node
* fix kaniko path
* Build neon extensions in version-specific directories
* fix obsolete mentions of vendor/postgres
* use vendor/postgres-v14 in Dockerfile.compute-node.legacy
* Use PG_VERSION_NUM to gate dependencies in inmem_smgr.c
* Use versioned ECR repositories and image names for compute-node.
The image name format is compute-node-vXX, where XX is postgres major version number.
For now only v14 is supported.
Old format unversioned name (compute-node) is left, because cloud repo depends on it.
* update vendor/postgres submodule url (zenith->neondatabase rename)
* Fix postgres path in python tests after rebase
* fix path in regress test
* Use separate dockerfiles to build compute-node:
Dockerfile.compute-node-v15 should be identical to Dockerfile.compute-node-v14 except for the version number.
This is a hack, because Kaniko doesn't support build ARGs properly
* bump vendor/postgres-v14 and vendor/postgres-v15
* Don't use Kaniko cache for v14 and v15 compute-node images
* Build compute-node images for different versions in different jobs
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Seems a bit silly to have a separate crate just for the executable. It
relies on the control plane for everything it does, and it's the only
user of the control plane.