Commit Graph

1503 Commits

Author SHA1 Message Date
John Spray
18159b7695 deletion queue: expose errors from push/flush 2023-08-22 10:01:10 +01:00
John Spray
c1bc9c0f70 Various test fixes + tweaks to flushing 2023-08-18 12:44:35 +01:00
John Spray
d330eac4bc clippy 2023-08-18 12:44:35 +01:00
John Spray
3ebceeda71 pageserver: refactor timeline args into TimelineResources
This sidesteps clippy complaining about function arg counts,
and will enable introducing more shared structures in future
without the noise of adding extra args to all the functions
involved in timeline setup.
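
The argument-bundling pattern described here could look roughly like the sketch below. Only the name `TimelineResources` comes from the commit; the field names, the stand-in service types, and `create_timeline` are hypothetical and chosen for illustration.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for shared services; the real types are not shown in the commit.
#[derive(Debug, Default)]
pub struct RemoteClient;
#[derive(Debug, Default)]
pub struct DeletionQueueClient;

/// Bundles the shared handles a timeline needs (field names are assumptions).
#[derive(Clone)]
pub struct TimelineResources {
    pub remote_client: Option<Arc<RemoteClient>>,
    pub deletion_queue: Arc<DeletionQueueClient>,
}

/// Minimal stand-in for a timeline that keeps its resources.
pub struct Timeline {
    pub name: String,
    pub resources: TimelineResources,
}

/// Takes one `resources` argument instead of one argument per shared structure,
/// so adding a new shared handle does not change this signature.
pub fn create_timeline(name: &str, resources: TimelineResources) -> Timeline {
    Timeline {
        name: name.to_string(),
        resources,
    }
}
```

With this shape, a new shared structure becomes one more field on the struct rather than an extra parameter on every function in the setup path.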
2023-08-18 12:44:35 +01:00
John Spray
31729d6f4d pageserver: refactor tenant args into a structure
This way, when we add some new shared structure that the
tenants need a reference to, we do not have to add it
individually as an extra argument to the various functions.
2023-08-18 12:44:35 +01:00
John Spray
7e0e3517c1 clippy 2023-08-18 12:44:35 +01:00
John Spray
c36cba28d6 pageserver: generalize flush API 2023-08-18 12:44:35 +01:00
John Spray
8eaa4015de deletion queue: versions in keys 2023-08-18 12:44:35 +01:00
John Spray
10e927ee3e Add encoding versions to deletion queue structs 2023-08-18 12:44:35 +01:00
John Spray
bb3a59f275 clippy 2023-08-18 12:44:35 +01:00
John Spray
a0ed43cc12 deletion queue: add DeletionHeader for sequence numbers 2023-08-18 12:44:35 +01:00
John Spray
99dc5a5c27 Deletion queue: implement recovery on startup 2023-08-18 12:44:35 +01:00
John Spray
54db1f5d8a remote_storage: add a helper for downloading full objects
This is only for use with small objects that we will
deserialize in a non-streaming way.

Also add a strip_prefix method to RemotePath.
2023-08-18 12:44:35 +01:00
John Spray
404b25e45f Remove vestigial remote_timeline_client deletion paths 2023-08-18 12:44:35 +01:00
John Spray
3edd7ece40 deletion queue: improve frontend retry 2023-08-18 12:44:35 +01:00
John Spray
504fe9c2b0 pageserver: send timeline deletions through the deletion queue 2023-08-18 12:44:35 +01:00
John Spray
10df237a81 deletion queue: add push for generic objects (layers and garbage) 2023-08-18 12:44:35 +01:00
John Spray
d40f8475a5 Error metric and retries 2023-08-18 12:44:35 +01:00
John Spray
164f916a40 Spawn deletion workers with info spans 2023-08-18 12:44:35 +01:00
John Spray
4ebc29768c Add failpoint for deletion execution 2023-08-18 12:44:35 +01:00
John Spray
bae62916dc pageserver/http: add /v1/deletion_queue/flush_execute
This is principally for testing, but might be useful in
the field if we want to e.g. flush a deletion queue
before running an external scrub tool
2023-08-18 12:44:35 +01:00
John Spray
54ec7919b8 pageserver: add deletion queue submitted/executed metrics 2023-08-18 12:44:35 +01:00
John Spray
e0bed0732c Tweak deletion queue constants 2023-08-18 12:44:35 +01:00
John Spray
9e92121cc3 pageserver: flush deletion queue on clean shutdown 2023-08-18 12:44:35 +01:00
John Spray
50a9508f4f clippy 2023-08-18 12:44:35 +01:00
John Spray
f61402be24 pageserver: testing for deletion queue 2023-08-18 12:44:35 +01:00
John Spray
975e4f2235 Refactor deletion worker construction 2023-08-18 12:44:35 +01:00
John Spray
537eca489e Implement flush_execute() in deletion queue 2023-08-18 12:44:35 +01:00
John Spray
de4882886e pageserver: implement batching in deletion queue 2023-08-18 12:44:35 +01:00
John Spray
6982288426 pageserver: implement frontend of deletion queue 2023-08-18 12:44:35 +01:00
John Spray
e2c793c897 Use deletion queue in schedule_layer_file_deletion 2023-08-18 12:44:33 +01:00
John Spray
0fdc492aa4 Add MockDeletionQueue for unit tests 2023-08-18 11:25:40 +01:00
John Spray
787b099541 wire deletion queue into timeline 2023-08-18 11:25:40 +01:00
John Spray
3af693749d pageserver: wire deletion queue through to Tenant 2023-08-18 11:25:40 +01:00
John Spray
6f9ae6bb5f pageserver: instantiate deletion queue at process scope 2023-08-18 11:25:40 +01:00
John Spray
16d77dcb73 Initial stub implementation of deletion queue 2023-08-18 11:25:40 +01:00
Joonas Koivunen
67af24191e test: cleanup remote_timeline_client tests (#5013)
I will have to change these as I change remote_timeline_client api in
#4938. So a bit of cleanup, handle my comments which were just resolved
during initial review.

Cleanup:
- use unwrap in tests instead of mixed `?` and `unwrap`
- use `Handle` instead of `&'static Reactor` to make the
RemoteTimelineClient more natural
- use arrays in tests
- use plain `#[tokio::test]`
2023-08-17 19:27:30 +03:00
Joonas Koivunen
6af5f9bfe0 fix: format context (#5022)
We return an error with unformatted `{timeline_id}`.
2023-08-17 14:30:25 +00:00
Dmitry Rodionov
d8b0a298b7 Do not attach deleted tenants (#5008)
Rather temporary solution before proper:
https://github.com/neondatabase/neon/issues/5006

It requires more plumbing so let's not attach deleted tenants first and
then implement resume.

Additionally fix `assert_prefix_empty`. It had a buggy prefix calculation,
and since we always asserted for absence of stuff it worked. Here I
started to assert for presence of stuff too and it failed. Added more
"presence" asserts to other places to be confident that it works.

Resolves [#5016](https://github.com/neondatabase/neon/issues/5016)
2023-08-17 13:46:49 +03:00
Christian Schwarz
957af049c2 ephemeral file: refactor write_blob impl to concentrate mutable state (#5004)
Before this patch, we had the `off` and `blknum` as function-wide
mutable state. Now it's contained in the `Writer` struct.

The use of `push_bytes` instead of index-based filling of the buffer
also makes it easier to reason about what's going on.

This is prep for https://github.com/neondatabase/neon/pull/4994
2023-08-17 13:07:25 +03:00
Joonas Koivunen
d3612ce266 delta_layer: Restore generic from last week (#5014)
Restores #4937 work relating to the ability to use `ResidentDeltaLayer`
(which is an Arc wrapper) in #4938 for the ValueRef's by removing the
borrow from `ValueRef` and providing it from an upper layer.

This should not have any functional changes, most importantly, the
`main` will continue to use the borrowed `DeltaLayerInner`. It might be
that I can change #4938 to be like this. If that is so, I'll gladly rip
out the `Ref` and move the borrow back. But I'll first want to look at
the current test failures.
2023-08-17 11:47:31 +03:00
Christian Schwarz
994411f5c2 page cache: newtype the blob_io and ephemeral_file file ids (#5005)
This makes it more explicit that these are different u64-sized
namespaces.
Re-using one in place of the other would be catastrophic.

Prep for https://github.com/neondatabase/neon/pull/4994
which will eliminate the ephemeral_file::FileId and move the
blob_io::FileId into page_cache.
It makes sense to have this preliminary commit though,
to minimize the amount of new concepts in #4994 and other
preliminaries that depend on that work.
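
A minimal sketch of the newtype idea, assuming two hypothetical wrapper names (`BlobFileId`, `EphemeralFileId`) and a hypothetical `CacheKey` enum; the real definitions live in `blob_io`, `ephemeral_file`, and the page cache and may differ.

```rust
/// Hypothetical newtype for ids from the blob_io namespace.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct BlobFileId(pub u64);

/// Hypothetical newtype for ids from the ephemeral_file namespace.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct EphemeralFileId(pub u64);

/// Hypothetical cache key: each variant accepts only its own id type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum CacheKey {
    Blob { file_id: BlobFileId, blkno: u32 },
    Ephemeral { file_id: EphemeralFileId, blkno: u32 },
}

pub fn blob_key(file_id: BlobFileId, blkno: u32) -> CacheKey {
    CacheKey::Blob { file_id, blkno }
}

pub fn ephemeral_key(file_id: EphemeralFileId, blkno: u32) -> CacheKey {
    CacheKey::Ephemeral { file_id, blkno }
}
```

Passing an `EphemeralFileId` to `blob_key` is now a compile error, whereas two bare `u64` values would have been interchangeable.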
2023-08-16 18:33:47 +02:00
Arpad Müller
0bdbc39cb1 Compaction: unify key and value reference vecs (#4888)
## Problem

PR #4839 has already reduced the number of b-tree traversals and vec
creations from 3 to 2, but as pointed out in
https://github.com/neondatabase/neon/pull/4839#discussion_r1279167815 ,
we would ideally just traverse the b-tree once during compaction.

After #4836, the two vecs created are one for the list of keys, lsns and
sizes, and one for the list of `(key, lsn, value reference)`. However,
they are not equal, as pointed out in
https://github.com/neondatabase/neon/pull/4839#issuecomment-1660418012
and the following comment: the key vec creation combines multiple
entries for which the lsn is changing but the key stays the same into
one, with the size being the sum of the sub-sizes. In SQL, this would
correspond to something like `SELECT key, lsn, SUM(size) FROM b_tree
GROUP BY key;` and `SELECT key, lsn, val_ref FROM b_tree;`. Therefore,
the join operation is non-trivial.

## Summary of changes

This PR merges the two lists of keys and value references into one. It's
not a trivial change and affects the size pattern of the resulting
files, which is why this is in a separate PR from #4839 .

The key vec is used in compaction for determining when to start a new
layer file. The loop uses various thresholds to come to this conclusion,
but the grouping via the key has led to the behaviour that regardless of
the threshold, it only starts a new file when either a new key is
encountered, or a new delta file.

The new code now does the combination after the merging and sorting of
the various keys from the delta files. This *mostly* does the same as
the old code, except for a detail: with the grouping done on a
per-delta-layer basis, the sorted and merged vec would still have
multiple entries for multiple delta files, but now, we don't have an
easy way to tell when a new input delta layer file is encountered, so we
cannot create multiple entries on that basis easily.

To prevent possibly infinite growth, our new grouping code compares the
combined size with the threshold, and if it is exceeded, it cuts a new
entry so that the downstream code can cut a new output file. Here, we
perform a tradeoff however, as if the threshold is too small, we risk
putting entries for the same key into multiple layer files, but if the
threshold is too big, we can in some instances exceed the target size.

Currently, we set the threshold to the target size, so in theory we
would stay below or roughly at double the `target_file_size`.
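
The size-capped grouping described above can be sketched as follows. This is an illustration under assumptions, not the code from the PR: the `Entry` struct, its integer key type, and `group_by_key` are hypothetical names, and the input is assumed to be already merged and sorted by key and lsn.

```rust
/// Hypothetical flattened entry: one (key, lsn) pair with its on-disk size.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub key: u64,
    pub lsn: u64,
    pub size: u64,
}

/// Combines consecutive entries that share a key, summing their sizes,
/// but cuts a new entry whenever the combined size would exceed `threshold`.
/// Each output entry keeps the lsn of the first input entry it absorbed.
pub fn group_by_key(sorted: &[Entry], threshold: u64) -> Vec<Entry> {
    let mut out: Vec<Entry> = Vec::new();
    for e in sorted {
        if let Some(last) = out.last_mut() {
            if last.key == e.key && last.size + e.size <= threshold {
                last.size += e.size;
                continue;
            }
        }
        // New key, or the size cap was hit: start a fresh entry so that
        // downstream code gets a chance to cut a new output file here.
        out.push(e.clone());
    }
    out
}
```

A small threshold splits one key across more entries (and possibly more layer files), while a large one lets a single combined entry grow past the target size, which is the tradeoff the message describes.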

We also fix the way the size was calculated for the last key. The calculation
was wrong and accounted for the old layer's btree, even though we
already account for the overhead of the in-construction btree.

Builds on top of #4839 .
2023-08-16 18:27:18 +03:00
Dmitry Rodionov
96b84ace89 Correctly remove orphaned objects in RemoteTimelineClient::delete_all (#5000)
Previously list_prefixes was incorrectly used for that purpose. Change
to use list_files. Add a test.

Some drive-by refactorings on the Python side to move helpers out of
a specific test file so they are widely accessible.

resolves https://github.com/neondatabase/neon/issues/4499
2023-08-16 17:31:16 +03:00
Christian Schwarz
368b783ada ephemeral_file: remove FileExt impl (was only used by tests) (#5003)
Extracted from https://github.com/neondatabase/neon/pull/4994
2023-08-16 15:41:25 +02:00
Dmitry Rodionov
52c2c69351 fsync directory before mark file removal (#4986)
## Problem

Deletions can be possibly reordered. Use fsync to avoid the case when
mark file doesn't exist but other tenant/timeline files do.

See added comments.
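
The ordering this relies on can be sketched as below, assuming a Unix-like system where a directory can be opened and synced through `std::fs::File`. The function names and the layout (mark file stored outside the directory being emptied) are hypothetical, not taken from the patch.

```rust
use std::fs::{self, File};
use std::io;
use std::path::Path;

/// Flushes directory metadata (entry creations and removals) to disk.
/// Assumes a platform where opening a directory as a file is allowed (e.g. Linux).
pub fn fsync_dir(dir: &Path) -> io::Result<()> {
    File::open(dir)?.sync_all()
}

/// Removes everything inside `dir`, persists those removals, and only then
/// removes the mark file. After a crash, the mark is therefore gone only if
/// the removals before it were already durable.
pub fn remove_contents_then_mark(dir: &Path, mark: &Path) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            fs::remove_dir_all(&path)?;
        } else {
            fs::remove_file(&path)?;
        }
    }
    // Persist the removals before touching the mark file.
    fsync_dir(dir)?;
    fs::remove_file(mark)?;
    // Persist the mark removal as well (skipped if the mark has no parent path).
    if let Some(parent) = mark.parent() {
        fsync_dir(parent)?;
    }
    Ok(())
}
```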

resolves #4987
2023-08-15 19:24:23 +03:00
Arpad Müller
baf395983f Turn BlockLease associated type into an enum (#4982)
## Problem

The `BlockReader` trait is not ready to be asyncified, as associated
types are not supported by asyncification strategies like via the
`async_trait` macro, or via adopting enums.

## Summary of changes

Remove the `BlockLease` associated type from the `BlockReader` trait and
turn it into an enum instead, bearing the same name. The enum has two
variants, one of which is gated by `#[cfg(test)]`. Therefore, outside of
test settings, the enum has zero overhead over just having the
`PageReadGuard`. Using the enum allows us to impl `BlockReader` without
needing the page cache.
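
A rough sketch of the enum shape, not the actual definition: `PageReadGuard` here is a hypothetical stand-in that simply wraps a borrowed page, and the name and payload of the test-only variant are assumptions.

```rust
/// Assumed page size for the sketch.
pub const PAGE_SZ: usize = 8192;

/// Hypothetical stand-in for the page cache's read guard.
pub struct PageReadGuard<'a>(pub &'a [u8; PAGE_SZ]);

/// Concrete enum replacing the former associated type on `BlockReader`.
pub enum BlockLease<'a> {
    PageReadGuard(PageReadGuard<'a>),
    /// Only compiled in test builds, so production builds see a single variant.
    #[cfg(test)]
    Rc(std::rc::Rc<[u8; PAGE_SZ]>),
}

impl<'a> std::ops::Deref for BlockLease<'a> {
    type Target = [u8; PAGE_SZ];

    fn deref(&self) -> &Self::Target {
        match self {
            BlockLease::PageReadGuard(guard) => guard.0,
            #[cfg(test)]
            BlockLease::Rc(page) => &**page,
        }
    }
}
```

Because the return type is a concrete enum, a `BlockReader` method that hands out a `BlockLease` no longer depends on an associated type, and a test implementation can return the `Rc` variant without going through the page cache.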

Part of https://github.com/neondatabase/neon/issues/4743
2023-08-14 18:48:09 +02:00
Arpad Müller
ce7efbe48a Turn BlockCursor::{read_blob,read_blob_into_buf} async fn (#4905)
## Problem

The `BlockCursor::read_blob` and `BlockCursor::read_blob_into_buf`
functions are calling `read_blk` internally, so if we want to make that
function async fn, they need to be async themselves.

## Summary of changes

* We first turn `ValueRef::load` into an async fn.
* Then, we switch the `RwLock` implementation in `InMemoryLayer` to use
the one from `tokio`.
* Last, we convert the `read_blob` and `read_blob_into_buf` functions
into async fn.

In three instances we use `Handle::block_on`:

* one use is in compaction code, which currently isn't async. We put the
entire loop into an `async` block to prevent the potentially hot loop
from doing cross-thread operations.
* one use is in dumping code for `DeltaLayer`. The "proper" way to
address this would be to enable the visit function to take async
closures, but then we'd need to be generic over async vs non async,
which [isn't supported by rust right
now](https://blog.rust-lang.org/inside-rust/2022/07/27/keyword-generics.html).
The other alternative would be to do a first pass where we cache the
data into memory, and only then to dump it.
* the third use is in writing code, inside a loop that copies from one
file to another. It is synchronous and we'd like to keep it that way
(for now?).

Part of #4743
2023-08-14 17:20:37 +02:00
Dmitry Rodionov
4626d89eda Harden retries on tenant/timeline deletion path. (#4973)
Originated from test failure where we got SlowDown error from s3.
The patch generalizes `download_retry` to not be download specific.
Resulting `retry` function is moved to utils crate. `download_retries`
is now a thin wrapper around this `retry` function.

To ensure that all needed retries are in place test code now uses
`test_remote_failures=1` setting.
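
A generic retry helper in the spirit of this change might look like the sketch below. It is deliberately synchronous and std-only so that it runs standalone; the signature, the `is_permanent` predicate, and the backoff policy are assumptions and not the `retry` function from the utils crate.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Runs `op` until it succeeds, fails with a permanent error, or has been
/// tried `max_attempts` times. Sleeps with exponential backoff between tries.
pub fn retry<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    is_permanent: impl Fn(&E) -> bool,
    max_attempts: u32,
    base_delay: Duration,
) -> Result<T, E> {
    let mut attempt: u32 = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Errors such as "not found" will not go away by retrying.
            Err(e) if is_permanent(&e) => return Err(e),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(e);
                }
                // Double the delay each time, capping the multiplier at 2^10.
                let factor = 1u32 << (attempt - 1).min(10);
                sleep(base_delay * factor);
            }
        }
    }
}
```

A download-specific helper can then be a thin wrapper that fixes the predicate and the attempt count and forwards to `retry`.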

Ref https://neondb.slack.com/archives/C059ZC138NR/p1691743624353009
2023-08-14 17:16:49 +03:00
John Spray
d3a97fdf88 pageserver: avoid incrementing access time when reading layers for compaction (#4971)
## Problem

Currently, image generation reads delta layers before writing out
subsequent image layers, which updates the access time of the delta
layers and effectively puts them at the back of the queue for eviction.
This is the opposite of what we want, because after a delta layer is
covered by a later image layer, it's likely that subsequent reads of
latest data will hit the image rather than the delta layer, so the delta
layer should be quite a good candidate for eviction.

## Summary of changes

`RequestContext` gets a new `ATimeBehavior` field, and a
`RequestContextBuilder` helper so that we can optionally add the new
field without growing `RequestContext::new` every time we add something
like this.

Request context is passed into the `record_access` function, and the
access time is not updated if `ATimeBehavior::Skip` is set.

The compaction background task constructs its request context with this
skip policy.
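
The mechanism can be sketched as follows. The names `RequestContext`, `RequestContextBuilder`, `ATimeBehavior`, `ATimeBehavior::Skip`, and `record_access` come from the description above; the `Update` variant, the `LayerAccessStats` struct, and all method signatures are assumptions made for the sketch.

```rust
use std::time::SystemTime;

/// Whether a read should refresh a layer's access time.
/// `Skip` is named in the commit; `Update` is an assumed name for the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ATimeBehavior {
    Update,
    Skip,
}

#[derive(Debug, Clone)]
pub struct RequestContext {
    atime_behavior: ATimeBehavior,
}

/// Builder so that optional fields do not grow `RequestContext::new`.
pub struct RequestContextBuilder {
    inner: RequestContext,
}

impl RequestContextBuilder {
    pub fn new() -> Self {
        Self {
            inner: RequestContext {
                atime_behavior: ATimeBehavior::Update,
            },
        }
    }

    pub fn atime_behavior(mut self, behavior: ATimeBehavior) -> Self {
        self.inner.atime_behavior = behavior;
        self
    }

    pub fn build(self) -> RequestContext {
        self.inner
    }
}

/// Hypothetical per-layer access bookkeeping.
#[derive(Default)]
pub struct LayerAccessStats {
    pub last_access: Option<SystemTime>,
}

impl LayerAccessStats {
    /// Leaves the access time untouched when the context asks to skip it.
    pub fn record_access(&mut self, ctx: &RequestContext) {
        if ctx.atime_behavior == ATimeBehavior::Skip {
            return;
        }
        self.last_access = Some(SystemTime::now());
    }
}
```

A background compaction task would build its context once with `ATimeBehavior::Skip`, so its reads of delta layers do not push those layers to the back of the eviction queue.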

Closes: https://github.com/neondatabase/neon/issues/4969
2023-08-14 10:18:22 +01:00