## Problem
To implement split brain protection, we need tenants and timelines to be
aware of their current generation, and use it when composing S3 keys.
## Summary of changes
- A `Generation` type is introduced in the `utils` crate -- it is in
this broadly-visible location because it will later be used from
`control_plane/` as well as `pageserver/`. Generations can be a number,
None, or Broken, to support legacy content (None), and Tenants in the
broken state (Broken).
- Tenant, Timeline, and RemoteTimelineClient all get a generation
attribute
- IndexPart's IndexLayerMetadata has a new `generation` attribute.
Legacy layers' metadata will deserialize to Generation::none().
- Remote paths are composed with a trailing generation suffix. If a
generation is equal to Generation::none() (as it currently always is),
then this suffix is an empty string.
- Functions for composing remote storage paths added in
remote_timeline_client: these avoid the way that we currently always
compose a local path and then strip the prefix, and avoid requiring a
PageserverConf reference on functions that want to create remote paths
(the conf is only needed for local paths). These are less DRY than the
old functions, but remote storage paths are a very rarely changing
thing, so it's better to write out our paths clearly in the functions
than to compose timeline paths from tenant paths, etc.
- Code paths that construct a Tenant take a `generation` argument in
anticipation that we will soon load generations on startup before
constructing Tenant.
Until the whole feature is done, we don't want any generation-ful keys
though: so initially we will carry this everywhere with the special
Generation::none() value.
Closes: https://github.com/neondatabase/neon/issues/5135
Co-authored-by: Christian Schwarz <christian@neon.tech>
I've moved it to the API handler in the 589cf1ed2 to do not delay
compute start. Yet, we now skip full configuration and catalog updates
in the most hot path -- waking up suspended compute, and only do it at:
- first start
- start with applying new configuration
- start for availability check
so it doesn't really matter anymore.
The problem with creating the table and test record in the API handler
is that someone can fill up timeline till the logical limit. Then it's
suspended and availability check is scheduled, so it fails.
If table + test row are created at the very beginning, we reserve a 8 KB
page for future checks, which theoretically will last almost forever.
For example, my ~1y old branch still has 8 KB sized test table:
```sql
cloud_admin@postgres=# select pg_relation_size('health_check');
pg_relation_size
------------------
8192
(1 row)
```
---------
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
## Problem
`timeline_layers` was write-only since
b95addddd5
We deployed the version that no longer requires it for deserializing, so
now we can stop including it when serializing.
## Summary of changes
Fully remove `timeline_layers`.
In logs it is confusing to see seqwait timeouts which seemingly arise
from the branched lsn but actually are about the ancestor, leading to
questions like "has the last_record_lsn went back".
Noticed by @problame.
## Summary
A scheme of logical "generation numbers" for pageservers and their
attachments is proposed, along with
changes to the remote storage format to include these generation numbers
in S3 keys.
Using the control plane as the issuer of these generation numbers
enables strong anti-split-brain
properties in the pageserver cluster without implementing a consensus
mechanism directly
in the pageservers.
## Motivation
Currently, the pageserver's remote storage format does not provide a
mechanism for addressing
split brain conditions that may happen when replacing a node during
failover or when migrating
a tenant from one pageserver to another. From a remote storage
perspective, a split brain condition
occurs whenever two nodes both think they have the same tenant attached,
and both can write to S3. This
can happen in the case of a network partition, pathologically long
delays (e.g. suspended VM), or software
bugs.
This blocks robust implementation of failover from unresponsive
pageservers, due to the risk that
the unresponsive pageserver is still writing to S3.
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
`read_blk` does I/O and thus we would like to make it async. We can't
make the function async as long as the `PageReadGuard` returned by
`read_blk` isn't `Send`. The page cache is called by `read_blk`, and
thus it can't be async without `read_blk` being async. Thus, we have a
circular dependency.
## Summary of changes
Due to the circular dependency, we convert both the page cache and
`read_blk` to async at the same time:
We make the page cache use `tokio::sync` synchronization primitives as
those are `Send`. This makes all the places that acquire a lock require
async though, which we then also do. This includes also asyncification
of the `read_blk` function.
Builds upon #4994, #5015, #5056, and #5129.
Part of #4743.
Instead of fixed during the start of replication. To this end, create
term_flush_lsn watch channel similar to commit_lsn one. This allows to continue
recovery streaming if new data appears.
Slightly refactors init: now load_tenant_timelines is also async to properly
init the timeline, but to keep global map lock sync we just acquire it anew for
each timeline.
Recovery task itself is just a stub here.
part of
https://github.com/neondatabase/neon/pull/4875
Part 1 of 2, for moving the file cache onto disk.
Because VMs are created by the control plane (and that's where the
filesystem for the file cache is defined), we can't rely on any kind of
synchronization between releases, so the change needs to be
feature-gated (kind of), with the default remaining the same for now.
See also: neondatabase/cloud#6593
## Problem
We still need to rerun some builds manually because flaky tests weren't
detected automatically.
I found two reasons for it:
- If a test is flaky on a particular build type, on a particular
Postgres version, there's a high chance that this test is flaky on all
configurations, but we don't automatically detect such cases.
- We detect flaky tests only on the main branch, which requires manual
retrigger runs for freshly made flaky tests.
Both of them are fixed in the PR.
## Summary of changes
- Spread flakiness of a single test to all configurations
- Detect flaky tests in all branches (not only in the main)
- Look back only at 7 days of test history (instead of 10)
## Problem
I saw these things while working on #5111.
## Summary of changes
* Add a comment explaining why we use `Vec::leak` instead of
`Vec::into_boxed_slice` plus `Box::leak`.
* Add another comment explaining what `valid` is doing, it wasn't very
clear before.
* Add a function `set_usage_count` to not set it directly.
Before this patch, when dropping an EphemeralFile, we'd scan the entire
`slots` to proactively evict its pages (`drop_buffers_for_immutable`).
This was _necessary_ before #4994 because the page cache was a
write-back cache: we'd be deleting the EphemeralFile from disk after,
so, if we hadn't evicted its pages before that, write-back in
`find_victim` wouldhave failed.
But, since #4994, the page cache is a read-only cache, so, it's safe
to keep read-only data cached. It's never going to get accessed again
and eventually, `find_victim` will evict it.
The only remaining advantage of `drop_buffers_for_immutable` over
relying on `find_victim` is that `find_victim` has to do the clock
page replacement iterations until the count reaches 0,
whereas `drop_buffers_for_immutable` can kick the page out right away.
However, weigh that against the cost of `drop_buffers_for_immutable`,
which currently scans the entire `slots` array to find the
EphemeralFile's pages.
Alternatives have been proposed in #5122 and #5128, but, they come
with their own overheads & trade-offs.
Also, the real reason why we're looking into this piece of code is
that we want to make the slots rwlock async in #5023.
Since `drop_buffers_for_immutable` is called from drop, and there
is no async drop, it would be nice to not have to deal with this.
So, let's just stop doing `drop_buffers_for_immutable` and observe
the performance impact in benchmarks.
Unrelated fixes noticed while integrating #4938.
- Stop leaking future layers in remote storage
- We schedule extra index_part uploads if layer name to be removed was
not actually present
Starts `postgres` in cgroup directly from `compute_ctl` instead of from
`vm-builder`. This is required because the `vm-monitor` cannot be in the
cgroup it is managing. Otherwise, it itself would be frozen when
freezing the cgroup.
Requires https://github.com/neondatabase/cloud/pull/6331, which adds the
`AUTOSCALING` environment variable letting `compute_ctl` know to start
`postgres` in the cgroup.
Requires https://github.com/neondatabase/autoscaling/pull/468, which
prevents `vm-builder` from starting the monitor and putting postgres in
a cgroup. This will require a `VM_BUILDER_VERSION` bump.
## Problem
The `metadata_bytes` field of IndexPart required explicit
deserialization & error checking everywhere it was used -- there isn't
anything special about this structure that should prevent it from being
serialized & deserialized along with the rest of the structure.
## Summary of changes
- Implement Serialize and Deserialize for TimelineMetadata
- Replace IndexPart::metadata_bytes with a simpler `metadata`, that can
be used directly.
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
We were returning Pending when a connection had a notice/notification
(introduced recently in #5020). When returning pending, the runtime
assumes you will call `cx.waker().wake()` in order to continue
processing.
We weren't doing that, so the connection task would get stuck
## Summary of changes
Don't return pending. Loop instead
## Problem
We want to make `read_blk` an async function, but outside of
`async_trait`, which allocates, and nightly features, we can't use async
fn's in traits.
## Summary of changes
* Remove all uses of `BlockReader::read_blk` in favour of using block
cursors, at least where the type of the `BlockReader` is behind a
generic
* Introduce a `BlockReaderRef` enum that lists all implementors of
`BlockReader::read_blk`.
* Remove `BlockReader::read_blk` and move its implementations into
inherent functions on the types instead.
We don't turn `read_blk` into an async fn yet, for that we also need to
modify the page cache. So this is a preparatory PR, albeit an important
one.
Part of #4743.
## Problem
The previous arguments have the monitor listen on `localhost`, which the
informant can connect to since it's also in the VM, but which the agent
cannot. Also, the port is wrong.
## Summary of changes
Listen on `0.0.0.0:10301`
## Problem
The `EphemeralFile::write_blob` function accesses the page cache
internally. We want to require `async` for these accesses in #5023.
## Summary of changes
This removes the implementaiton of the `BlobWriter` trait for
`EphemeralFile` and turns the `write_blob` function into an inherent
function. We can then make it async as well as the `push_bytes`
function. We move the `SER_BUFFER` thread-local into the
`InMemoryLayerInner` so that the same buffer can be accessed by
different threads as the async is (potentially) moved between threads.
Part of #4743, preparation for #5023.
Current implementation first calls `load_layer_map`, which loads all
local layers, cleans up files, leave cleaning up stuff to "second
function". Then the "second function" is finally called, it does not do
the cleanup and some of the first functions setup can torn down. "Second
function" is actually both `reconcile_with_remote` and
`create_remote_layers`.
This change makes it a bit more verbose but in one phase with the
following sub-steps:
1. scan the timeline directory
2. delete extra files
- now including on-demand download files
- fixes#3660
3. recoincile the two sources of layers (directory, index_part)
4. rename_to_backup future layers, short layers
5. create the remaining as layers
Needed by #4938.
It was also noticed that this is blocking code in an `async fn` so just
do it in a `spawn_blocking`, which should be healthy for our startup
times. Other effects includes hopefully halving of `stat` calls; extra
calls which were not done previously are now done for the future layers.
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
## Problem
Currently, anything that uses backoff::retry will delay the join of its
task by however long its backoff sleep is, multiplied by its max
retries.
Whenever we call a function that sleeps, we should be passing in a
CancellationToken.
## Summary of changes
- Add a `Cancel` type to backoff::retry that wraps a CancellationToken
and an error `Fn` to generate an error if the cancellation token fires.
- In call sites that already run in a `task_mgr` task, use
`shutdown_token()` to provide the token. In other locations, use a dead
`CancellationToken` to satisfy the interface, and leave a TODO to fix it
up when we broaden the use of explicit cancellation tokens.
This is cherry-picked-then-improved version of release branch commit
4204960942 PR #4861)
The commit
commit 5f8fd640bf
Author: Alek Westover <alek.westover@gmail.com>
Date: Wed Jul 26 08:24:03 2023 -0400
Upload Test Remote Extensions (#4792)
switched to using the release tag instead of `latest`, but,
the `promote-images` job only uploads `latest` to the prod ECR.
The switch to using release tag was good in principle, but,
it broke the release pipeline. So, switch release pipeline
back to using `latest`.
Note that a proper fix should abandon use of `:latest` tag
at all: currently, if a `main` pipeline runs concurrently
with a `release` pipeline, the `release` pipeline may end
up using the `main` pipeline's images.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Accidentially giving is_incremental=true for ImageLayers costs a lot of
debugging time. Removes all API which would allow to do that. They can
easily be restored later *when needed*.
Split off from #4938.
When doing global queries in VictoriaMetrics, the per-timeline
histograms make us run into cardinality limits.
We don't want to give them up just yet because we don't
have an alternative for drilling down on timeline-specific
performance issues.
So, add a pre-aggregated histogram and add observations to it
whenever we add observations to the per-timeline histogram.
While we're at it, switch to using a strummed enum for the operation
type names.