Commit Graph

3737 Commits

Author SHA1 Message Date
John Spray
c63a952b78 Implement validation of generations before delete 2023-08-30 17:44:10 +01:00
John Spray
35e4b43531 Hook deletion queue into generations 2023-08-30 15:35:51 +01:00
John Spray
584c0d3c7b Make remote_layer_path take Generation instead of layer metadata 2023-08-30 15:13:00 +01:00
John Spray
84023207ce Merge branch 'jcsp/deletion-queue' into jcsp/generation-numbers 2023-08-30 15:07:35 +01:00
John Spray
35fa75699b switch deletion queue to local storage 2023-08-30 12:21:29 +01:00
John Spray
f77aa463c6 clippy 2023-08-30 10:37:06 +01:00
John Spray
4492d40c37 Merge remote-tracking branch 'upstream/main' into jcsp/deletion-queue 2023-08-30 10:34:16 +01:00
John Spray
2f58f39648 Revert "libs: make backoff::retry() take a cancellation token"
This reverts commit 8c2ff87f1a.
2023-08-30 10:26:15 +01:00
Joonas Koivunen
05773708d3 fix: add context for ancestor lsn wait (#5143)
In logs it is confusing to see seqwait timeouts which seemingly arise
from the branched lsn but actually are about the ancestor, leading to
questions like "has the last_record_lsn gone back".

Noticed by @problame.
2023-08-30 12:21:41 +03:00
John Spray
382473d9a5 docs: add RFC for remote storage generation numbers (#4919)
## Summary

A scheme of logical "generation numbers" for pageservers and their
attachments is proposed, along with changes to the remote storage format
to include these generation numbers in S3 keys.

Using the control plane as the issuer of these generation numbers
enables strong anti-split-brain properties in the pageserver cluster
without implementing a consensus mechanism directly in the pageservers.

## Motivation

Currently, the pageserver's remote storage format does not provide a
mechanism for addressing split brain conditions that may happen when
replacing a node during failover or when migrating a tenant from one
pageserver to another. From a remote storage perspective, a split brain
condition occurs whenever two nodes both think they have the same tenant
attached, and both can write to S3. This can happen in the case of a
network partition, pathologically long delays (e.g. suspended VM), or
software bugs.

This blocks robust implementation of failover from unresponsive
pageservers, due to the risk that the unresponsive pageserver is still
writing to S3.
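
As a rough illustration of putting a generation number into an object
key, here is a minimal sketch. The `Generation` wrapper, the
`remote_key` helper, the `u32` width and the zero-padded hex suffix are
all assumptions made for this example; the message above does not say
what key format the RFC specifies.

```rust
/// Hypothetical generation number, assumed to be issued by the control plane.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Generation(u32);

/// Hypothetical helper: append the generation to an object key so that
/// writers holding different generations never write to the same key.
/// The suffix format is an assumption made for this sketch.
fn remote_key(base: &str, generation: Generation) -> String {
    format!("{}-{:08x}", base, generation.0)
}

fn main() {
    let base = "tenants/t1/timelines/tl1/index_part.json";
    let stale = remote_key(base, Generation(1));
    let current = remote_key(base, Generation(2));
    // A stale node and its replacement end up writing distinct objects.
    assert_ne!(stale, current);
    println!("{}\n{}", stale, current);
}
```

With keys built this way, a node that still holds an older generation
cannot overwrite objects written under a newer one, which is the
property the motivation asks for.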

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-08-30 09:49:55 +01:00
Arpad Müller
eb0a698adc Make page cache and read_blk async (#5023)
## Problem

`read_blk` does I/O and thus we would like to make it async. We can't
make the function async as long as the `PageReadGuard` returned by
`read_blk` isn't `Send`. The page cache is called by `read_blk`, and
thus it can't be async without `read_blk` being async. Thus, we have a
circular dependency.

## Summary of changes

Due to the circular dependency, we convert both the page cache and
`read_blk` to async at the same time:

We make the page cache use `tokio::sync` synchronization primitives as
those are `Send`. This forces every place that acquires a lock to become
async, so we convert those as well, including the `read_blk` function.

Builds upon #4994, #5015, #5056, and #5129.

Part of #4743.
2023-08-30 09:04:31 +02:00
Arseny Sher
81b6578c44 Allow walsender in recovery mode to give WAL until a dynamic flush_lsn.
Previously the limit was fixed at the start of replication. To this end, create
a term_flush_lsn watch channel similar to the commit_lsn one. This allows
recovery streaming to continue if new data appears.
2023-08-29 23:19:40 +03:00
Arseny Sher
bc49c73fee Move wal_stream_connection_config to utils.
It will be used by safekeeper as well.
2023-08-29 23:19:40 +03:00
Arseny Sher
e98580b092 Add term and http endpoint to broker messaged SkTimelineInfo.
We need them for safekeeper peer recovery
https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
804ef23043 Rename TermSwitchEntry to TermLsn.
Add derive Ord for easy comparison of <term, lsn> pairs.
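
A minimal sketch of what deriving `Ord` gives for such pairs. The field
names and `u64` types are assumptions for illustration, not the actual
definition:

```rust
/// Sketch of a <term, lsn> pair; field names and types are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct TermLsn {
    term: u64,
    lsn: u64,
}

fn main() {
    let a = TermLsn { term: 1, lsn: 500 };
    let b = TermLsn { term: 2, lsn: 100 };
    let c = TermLsn { term: 2, lsn: 300 };

    // Derived Ord compares fields in declaration order: term first, then lsn.
    assert!(a < b); // the higher term wins even though its lsn is lower
    assert!(b < c); // same term, so lsn decides

    let newest = [a, c, b].iter().copied().max().unwrap();
    println!("{:?}", newest);
}
```

Because the derived ordering is lexicographic, the field order in the
struct declaration determines which component takes precedence.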

part of https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
87f7d6bce3 Start and stop per timeline recovery task.
Slightly refactors init: now load_tenant_timelines is also async to properly
init the timeline, but to keep global map lock sync we just acquire it anew for
each timeline.

Recovery task itself is just a stub here.

part of
https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
39e3fbbeb0 Add safekeeper peers to TimelineInfo.
Now available under GET /tenant/xxx/timeline/yyy for inspection.
2023-08-29 23:19:40 +03:00
Em Sharnoff
8d2a4aa5f8 vm-monitor: Add flag for when file cache is on disk (#5130)
Part 1 of 2, for moving the file cache onto disk.

Because VMs are created by the control plane (and that's where the
filesystem for the file cache is defined), we can't rely on any kind of
synchronization between releases, so the change needs to be
feature-gated (kind of), with the default remaining the same for now.

See also: neondatabase/cloud#6593
2023-08-29 12:44:48 -07:00
John Spray
10b85c0d9a fixup index_part loading 2023-08-29 17:26:08 +01:00
John Spray
cd6367b5ae fixup control_plane attach hook 2023-08-29 17:18:28 +01:00
John Spray
79f9f7c5f8 fixup control_api types 2023-08-29 17:18:28 +01:00
John Spray
ef5ce1635c fixup attach API 2023-08-29 17:18:28 +01:00
John Spray
5aecd8c4fd tests: enable generations in neon_fixture 2023-08-29 17:18:28 +01:00
John Spray
5266bf4552 remote_storage: fix LocalFs list_files 2023-08-29 17:18:28 +01:00
John Spray
a1bcad2382 DNM unit test for index part download 2023-08-29 17:18:28 +01:00
John Spray
4dd60bf7cd pageserver: generation-aware index_part.json loading 2023-08-29 17:18:28 +01:00
John Spray
3eff65618d control_plane: implement attach hook 2023-08-29 17:18:28 +01:00
John Spray
265d3b4352 pageserver: if control plane API is disabled, ignore generations 2023-08-29 17:18:28 +01:00
John Spray
000330054b pageserver: require attachment generation if control plane API is set 2023-08-29 17:18:28 +01:00
John Spray
ddb6453f56 neon_local: manage attachment_service 2023-08-29 17:18:28 +01:00
John Spray
bc95b8f1f5 pageserver: call into control plane on startup 2023-08-29 17:18:28 +01:00
John Spray
5b7d3e39d6 Move pageserver control plane API types into libs/ 2023-08-29 17:18:28 +01:00
John Spray
034bebcfcd pageserver: add control_plane_api conf 2023-08-29 17:18:28 +01:00
John Spray
9e0e2a2a9a Stub of generations API 2023-08-29 17:18:28 +01:00
John Spray
34160a15ca Support generations in RemoteTimelineClient delete 2023-08-29 17:18:28 +01:00
John Spray
f3a9c2d788 Add optional generation input during create & attach 2023-08-29 17:18:28 +01:00
John Spray
50da1b7983 Simplify Generation 2023-08-29 17:08:55 +01:00
John Spray
4a0e2d1290 Simplify None handling for Generation in LayerfileMetadata 2023-08-29 16:57:28 +01:00
John Spray
980d3ba8b0 clippy 2023-08-29 15:47:00 +01:00
John Spray
fd836d8c45 Support generations in RemoteTimelineClient delete 2023-08-29 15:36:15 +01:00
John Spray
67b17034ab pageserver: use generation in keys when writing 2023-08-29 15:36:15 +01:00
John Spray
930de712ee pageserver: add Generation type to Tenant, Timeline & Index 2023-08-29 15:36:15 +01:00
John Spray
dd033d9138 utils: introduce Generation type 2023-08-29 15:36:12 +01:00
Joonas Koivunen
d1fcdf75b3 test: enhanced logging for curious mock_s3 (#5134)
Possible flakiness with mock_s3. Add logging in hopes this will happen
again.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-08-29 14:48:50 +03:00
Alexander Bayandin
7e39a96441 scripts/flaky_tests.py: Improve flaky tests detection (#5094)
## Problem

We still need to rerun some builds manually because flaky tests weren't
detected automatically.
I found two reasons for it:
- If a test is flaky on a particular build type, on a particular
Postgres version, there's a high chance that this test is flaky on all
configurations, but we don't automatically detect such cases.
- We detect flaky tests only on the main branch, which requires manually
retriggering runs for newly introduced flaky tests.
Both of them are fixed in the PR.

## Summary of changes
- Spread flakiness of a single test to all configurations
- Detect flaky tests in all branches (not only in the main)
- Look back only at 7 days of test history (instead of 10)
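
The first and third points can be sketched as follows. The `TestRun`
record, its fields and the `flaky_test_names` function are hypothetical
names invented for this example; the script's real data model and
queries are not shown in this message.

```rust
use std::collections::BTreeSet;

/// Hypothetical record of one test run; all field names are assumptions.
struct TestRun {
    name: &'static str,
    config: &'static str, // e.g. build type plus Postgres version
    flaky: bool,
    age_days: u32,
}

/// Collect the names of tests that were flaky in any configuration within
/// the look-back window. Keying the set by name only is what spreads the
/// flakiness of one configuration to all configurations of that test.
fn flaky_test_names(runs: &[TestRun], window_days: u32) -> BTreeSet<&'static str> {
    runs.iter()
        .filter(|r| r.flaky && r.age_days <= window_days)
        .map(|r| r.name)
        .collect()
}

fn main() {
    let runs = [
        TestRun { name: "test_a", config: "debug-pg14", flaky: true, age_days: 2 },
        TestRun { name: "test_a", config: "release-pg15", flaky: false, age_days: 1 },
        TestRun { name: "test_b", config: "debug-pg15", flaky: true, age_days: 9 },
    ];
    let flaky = flaky_test_names(&runs, 7);
    for run in &runs {
        // test_a is reported as flaky for both configurations; test_b's
        // only flaky run is older than the window, so it is not.
        println!("{} [{}]: flaky = {}", run.name, run.config, flaky.contains(run.name));
    }
}
```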
2023-08-29 11:53:24 +01:00
Vadim Kharitonov
babefdd3f9 Upgrade pgvector to 0.5.0 (#5132) 2023-08-29 12:53:50 +03:00
Arpad Müller
805fee1483 page cache: small code cleanups (#5125)
## Problem

I saw these things while working on #5111.

## Summary of changes

* Add a comment explaining why we use `Vec::leak` instead of
`Vec::into_boxed_slice` plus `Box::leak`.
* Add another comment explaining what `valid` is doing; it wasn't very
clear before.
* Add a function `set_usage_count` to not set it directly.
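
For reference, the two leaking approaches named in the first point look
like this in isolation. The sketch only shows the standard library
calls; the helper names are made up, and it does not reproduce the
comment added to the page cache or claim to state its reasoning.

```rust
/// Leak the Vec directly. `Vec::leak` does not reallocate or shrink, so any
/// spare capacity is leaked along with the elements.
fn leak_direct(v: Vec<u8>) -> &'static mut [u8] {
    v.leak()
}

/// Convert to a boxed slice first, which drops excess capacity (and may
/// therefore reallocate), then leak the box.
fn leak_via_box(v: Vec<u8>) -> &'static mut [u8] {
    Box::leak(v.into_boxed_slice())
}

fn main() {
    let mut v = Vec::with_capacity(16);
    v.extend_from_slice(&[1u8, 2, 3]);
    let a = leak_direct(v);

    let mut w = Vec::with_capacity(16);
    w.extend_from_slice(&[1u8, 2, 3]);
    let b = leak_via_box(w);

    // Both are ordinary mutable slices that live for the rest of the program.
    a[0] = 10;
    b[0] = 20;
    assert_eq!(a.len(), b.len());
    println!("{:?} {:?}", a, b);
}
```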
2023-08-29 11:49:04 +03:00
Felix Prasanna
85d6d9dc85 monitor/compute_ctl: remove references to the informant (#5115)
Also added some docs to the monitor :)

Co-authored-by: Em Sharnoff <sharnoff@neon.tech>
2023-08-29 02:59:27 +03:00
Em Sharnoff
e40ee7c3d1 remove unused file 'vm-cgconfig.conf' (#5127)
Honestly no clue why it's still here, should have been removed ages ago.
This is handled by vm-builder now.
2023-08-28 13:04:57 -07:00
Christian Schwarz
0fe3b3646a page cache: don't proactively evict EphemeralFile pages (#5129)
Before this patch, when dropping an EphemeralFile, we'd scan the entire
`slots` to proactively evict its pages (`drop_buffers_for_immutable`).

This was _necessary_ before #4994 because the page cache was a
write-back cache: we'd be deleting the EphemeralFile from disk after,
so, if we hadn't evicted its pages before that, write-back in
`find_victim` would have failed.

But, since #4994, the page cache is a read-only cache, so, it's safe
to keep read-only data cached. It's never going to get accessed again
and eventually, `find_victim` will evict it.

The only remaining advantage of `drop_buffers_for_immutable` over
relying on `find_victim` is that `find_victim` has to do the clock
page replacement iterations until the count reaches 0,
whereas `drop_buffers_for_immutable` can kick the page out right away.
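
To make the cost of the clock iterations concrete, here is a simplified
sketch of a clock sweep over usage counters. The `Slot` layout, the
`hand` parameter and this `find_victim` signature are assumptions for
illustration and do not mirror the actual page cache code.

```rust
/// Hypothetical slot: a cached page id plus a usage counter.
struct Slot {
    page: Option<u32>,
    usage_count: u8,
}

/// Clock sweep: advance the hand, decrementing each counter it passes, until
/// it lands on a slot whose counter is already zero. Returns the victim
/// index and the number of slots visited.
fn find_victim(slots: &mut [Slot], hand: &mut usize) -> (usize, usize) {
    assert!(!slots.is_empty(), "need at least one slot");
    let mut steps = 0;
    loop {
        let idx = *hand % slots.len();
        *hand = (idx + 1) % slots.len();
        steps += 1;
        if slots[idx].usage_count == 0 {
            slots[idx].page = None; // evict
            return (idx, steps);
        }
        slots[idx].usage_count -= 1;
    }
}

fn main() {
    let mut slots = vec![
        Slot { page: Some(10), usage_count: 2 },
        Slot { page: Some(11), usage_count: 1 },
        Slot { page: Some(12), usage_count: 1 },
    ];
    let mut hand = 0;
    let (victim, steps) = find_victim(&mut slots, &mut hand);
    // More than one full lap is needed before any counter is found at zero.
    assert_eq!((victim, steps), (1, 5));
    println!("evicted slot {} after {} steps, page = {:?}", victim, steps, slots[victim].page);
}
```

A page that is never accessed again still occupies its slot until the
hand has decremented its counter to zero and come back around, whereas a
proactive eviction frees it immediately.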

However, weigh that against the cost of `drop_buffers_for_immutable`,
which currently scans the entire `slots` array to find the
EphemeralFile's pages.

Alternatives have been proposed in #5122 and #5128, but, they come
with their own overheads & trade-offs.

Also, the real reason why we're looking into this piece of code is
that we want to make the slots rwlock async in #5023.
Since `drop_buffers_for_immutable` is called from drop, and there
is no async drop, it would be nice to not have to deal with this.

So, let's just stop doing `drop_buffers_for_immutable` and observe
the performance impact in benchmarks.
2023-08-28 20:42:18 +02:00