Commit Graph

2262 Commits

Author SHA1 Message Date
Joonas Koivunen
dfdf40916f rename complete_detaching_from_ancestor
it hasn't meant completing in a while now :)
2024-07-26 14:39:31 +00:00
Joonas Koivunen
c6d8015fe9 chore: clippy needless into_iter 2024-07-26 14:39:31 +00:00
Joonas Koivunen
ce2552ba67 minor comment update for FIXME about 503 2024-07-26 14:39:31 +00:00
Joonas Koivunen
f4d773bb89 refactor: unify t::s::Semaphore 2024-07-26 14:39:31 +00:00
Joonas Koivunen
6f28263428 refactor: failpoint all but one 2024-07-26 14:39:31 +00:00
Joonas Koivunen
1e380ea5af refactor: Ancestor::Delete is not needed 2024-07-26 14:39:31 +00:00
Joonas Koivunen
8258385301 remove indentation level with exhaustive match 2024-07-26 14:39:31 +00:00
Joonas Koivunen
6a8f00dea0 fix: return reparented_direct_children in case we reparent nothing new 2024-07-26 14:39:31 +00:00
Joonas Koivunen
44cdb9fb58 refactor: reparented_direct_children query 2024-07-26 14:39:31 +00:00
Joonas Koivunen
cdfaf0700f fix: bifurcate the detach+reparent step 2024-07-26 14:39:31 +00:00
Joonas Koivunen
881e1ad056 refactor: no need to collect reparentable here 2024-07-26 14:39:31 +00:00
Joonas Koivunen
bb3d70e24d fix: properly cancel if any reparenting failed 2024-07-26 14:39:31 +00:00
Joonas Koivunen
c6c560e4c8 rewrite to include testing assertion 2024-07-26 14:39:31 +00:00
Joonas Koivunen
8dd332aed5 doc: remove unnecessary comment 2024-07-26 14:39:31 +00:00
Joonas Koivunen
5c03a17eb8 wip: some progress
now we hit the todo! in "already detached" path.
2024-07-26 14:39:31 +00:00
Joonas Koivunen
402d66778e make reparenting operations idempotent 2024-07-26 14:39:31 +00:00
Joonas Koivunen
39e2bc932f prepare to reparent while gc blocked 2024-07-26 14:39:31 +00:00
Joonas Koivunen
5fc034fa7f feat: block gc persistently until detach ancestor completes 2024-07-26 14:39:31 +00:00
Joonas Koivunen
f9b12def0b add support for WaitToActivate errors 2024-07-26 14:39:31 +00:00
Joonas Koivunen
5d0071447c partial: index_part.json support for ongoing_detach_ancestor 2024-07-26 14:39:31 +00:00
Joonas Koivunen
409e2eff9e fix: run upload_rewritten_layer in a span
there was a weird failure observed with CI tests: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8430/10108870590/index.html#suites/a1c2be32556270764423c495fad75d47/94a4686382b96297
2024-07-26 14:39:31 +00:00
Joonas Koivunen
e6e3b9a716 doc: remove on_gc_task_start fixme 2024-07-26 08:52:55 +00:00
Joonas Koivunen
7f31a3f671 forgotten rename, maybe 2024-07-26 08:52:55 +00:00
Joonas Koivunen
9971ae3d24 rename is_detached_from_{original_,}ancestor (just the rename) 2024-07-26 08:52:55 +00:00
Joonas Koivunen
48a2a20de3 chore: derive default 2024-07-26 08:52:55 +00:00
Joonas Koivunen
29ef8f15ce chore: unused variable 2024-07-26 08:52:55 +00:00
Joonas Koivunen
5e45dd3f86 rename SharedState::notify
to continue_existing_attempt
2024-07-26 08:52:55 +00:00
Joonas Koivunen
5fced442d7 warning caused by removed body 2024-07-26 08:52:55 +00:00
Joonas Koivunen
4222610233 cleanup index part dependent 2024-07-26 08:52:55 +00:00
Joonas Koivunen
92deb0dfd7 plumbing: collect timelines index parts 2024-07-26 08:52:55 +00:00
Joonas Koivunen
46ca6f17c5 plumbing: notify shared state of existing attempt 2024-07-26 08:52:55 +00:00
Joonas Koivunen
14869abb77 complete the plumbing with non-notifying attempt_blocks_gc impl 2024-07-26 08:52:55 +00:00
Joonas Koivunen
5330fd9366 doc(fixme): shared state 2024-07-26 08:52:55 +00:00
Joonas Koivunen
6c5b3b7812 doc: more sketched api comments 2024-07-26 08:52:55 +00:00
Joonas Koivunen
849fe0f191 plumb the shared state through
the api for the gc pausing is quite awkward.
2024-07-26 08:52:55 +00:00
Joonas Koivunen
f564b66f21 shared state sketch 2024-07-26 08:52:55 +00:00
Joonas Koivunen
2e58ccee78 temp: planning 2024-07-26 08:52:55 +00:00
Joonas Koivunen
0ad31bb7fb doc: remove obsolete FIXME
this was cleared with partial metadata updates.
2024-07-26 08:52:55 +00:00
Joonas Koivunen
86f26d0918 chore: minor rename FIXME in IndexPart 2024-07-26 08:52:55 +00:00
Joonas Koivunen
4a562dff2e doc: more 2024-07-26 08:52:55 +00:00
Joonas Koivunen
f9185b42a9 doc: minor enhancements 2024-07-26 08:52:55 +00:00
Joonas Koivunen
d4f30daa81 chore: minor indentation problem 2024-07-26 08:52:55 +00:00
John Spray
6711087ddf remote_storage: expose last_modified in listings (#8497)
## Problem

The scrubber would like to check the highest mtime in a tenant's objects
as a safety check during purges. It recently switched to use
GenericRemoteStorage, so we need to expose that in the listing methods.

## Summary of changes

- In Listing.keys, return a ListingObject{} including a last_modified
field, instead of a RemotePath

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-07-26 10:57:52 +03:00
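A minimal sketch of the listing shape described in 6711087ddf, assuming std-only stand-ins: `RemotePath` is reduced to a string wrapper here, and `highest_mtime` and `safe_to_purge` are hypothetical helpers illustrating the scrubber's safety check, not functions from the repository.

```rust
use std::time::{Duration, SystemTime};

/// Stand-in for the real `RemotePath` type (assumption: a plain string key).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RemotePath(pub String);

/// One listed object: its key plus the modification time reported by the store.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ListingObject {
    pub key: RemotePath,
    pub last_modified: SystemTime,
}

/// Listing result whose `keys` carry `ListingObject`s instead of bare paths.
#[derive(Debug, Default)]
pub struct Listing {
    pub keys: Vec<ListingObject>,
}

/// Hypothetical helper: newest mtime across the listing, `None` if it is empty.
pub fn highest_mtime(listing: &Listing) -> Option<SystemTime> {
    listing.keys.iter().map(|o| o.last_modified).max()
}

/// Hypothetical safety check: allow a purge only if nothing was modified
/// within the last `min_age`.
pub fn safe_to_purge(listing: &Listing, now: SystemTime, min_age: Duration) -> bool {
    match highest_mtime(listing) {
        None => true,
        Some(newest) => match now.duration_since(newest) {
            Ok(age) => age >= min_age,
            // An mtime in the future is suspicious, so refuse to purge.
            Err(_) => false,
        },
    }
}
```

Carrying `last_modified` on each listed object lets a caller compute the newest mtime from the listing alone, without a separate metadata request per key.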
Arpad Müller
8e02db1ab9 Handle NotInitialized::ShuttingDown error in shard split (#8506)
There is a race condition between timeline shutdown and the split task.
Timeline shutdown first shuts down the upload queue, and only then fires
the cancellation token. A parallel running timeline split operation
might thus encounter a cancelled upload queue before the cancellation
token is fired, and print a noisy error.

Fix this by mapping `anyhow::Error { NotInitialized::ShuttingDown }` to
`FlushLayerError::Cancelled` instead of `FlushLayerError::Other(_)`.


Fixes #8496
2024-07-26 02:16:10 +02:00
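The mapping described in 8e02db1ab9 can be sketched roughly as follows. This is a std-only illustration: a boxed `dyn Error` stands in for `anyhow::Error`, the `Uninitialized` variant and the `map_flush_error` function are assumptions made for the example, and the real enums likely carry more variants.

```rust
use std::error::Error;
use std::fmt;

/// Simplified stand-in for the upload queue's "not initialized" error.
#[derive(Debug)]
pub enum NotInitialized {
    Uninitialized, // assumed variant, for contrast
    ShuttingDown,
}

impl fmt::Display for NotInitialized {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            NotInitialized::Uninitialized => write!(f, "queue is not initialized"),
            NotInitialized::ShuttingDown => write!(f, "queue is shutting down"),
        }
    }
}

impl Error for NotInitialized {}

/// Simplified stand-in for the flush error type.
#[derive(Debug)]
pub enum FlushLayerError {
    Cancelled,
    Other(Box<dyn Error + Send + Sync>),
}

/// Hypothetical mapping function: an opaque error that wraps
/// `NotInitialized::ShuttingDown` becomes `Cancelled`; anything else
/// stays `Other`.
pub fn map_flush_error(err: Box<dyn Error + Send + Sync>) -> FlushLayerError {
    let shutting_down = matches!(
        err.downcast_ref::<NotInitialized>(),
        Some(NotInitialized::ShuttingDown)
    );
    if shutting_down {
        FlushLayerError::Cancelled
    } else {
        FlushLayerError::Other(err)
    }
}
```

With this mapping, a split task that races with timeline shutdown reports a cancellation rather than a generic error, so callers can treat it as expected.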
Alex Chi Z.
bea0468f1f fix(pageserver): allow incomplete history in btm-gc-compaction (#8500)
This pull request (should) fix the failure of test_gc_feedback. See the
explanation in the newly-added test case.

Part of https://github.com/neondatabase/neon/issues/8002

Allow incomplete history for the compaction algorithm.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-07-25 12:56:37 -04:00
Vlad Lazar
9c5ad21341 storcon: make heartbeats restart aware (#8222)
## Problem
Re-attach blocks the pageserver http server from starting up. Hence, it
can't reply to heartbeats until that's done. This makes the storage
controller mark the node off-line (not good). We worked around this by
setting the interval after which nodes are marked offline to 5 minutes.
This isn't a long term solution.

## Summary of changes
* Introduce a new `NodeAvailability` state: `WarmingUp`. This state
  models the following time interval:
  * From receiving the re-attach request until the pageserver replies
    to the first heartbeat post re-attach
* The heartbeat delta generator becomes aware of this state and uses a
  separate longer interval
* Flag `max-warming-up-interval` now models the longer timeout and
  `max-offline-interval` the shorter one to match the names of the
  states

Closes https://github.com/neondatabase/neon/issues/7552
2024-07-25 14:09:12 +01:00
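A rough sketch of the state handling described in 9c5ad21341, under stated assumptions: the `Active` and `Offline` variants, the `HeartbeatConfig` struct, and the `evaluate` function are hypothetical names chosen for illustration, and only `NodeAvailability`, `WarmingUp`, and the two interval flags come from the commit message.

```rust
use std::time::{Duration, Instant};

/// Simplified availability states; `WarmingUp` records when the re-attach
/// request was received (assumed payload).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeAvailability {
    Active,
    WarmingUp(Instant),
    Offline,
}

/// Hypothetical config mirroring the two flags named in the commit message.
pub struct HeartbeatConfig {
    pub max_offline_interval: Duration,
    pub max_warming_up_interval: Duration,
}

/// Hypothetical evaluation step: decide the next state from the outcome of
/// one heartbeat round.
pub fn evaluate(
    current: NodeAvailability,
    replied: bool,
    last_reply_at: Instant,
    now: Instant,
    cfg: &HeartbeatConfig,
) -> NodeAvailability {
    if replied {
        // The first successful heartbeat after re-attach ends the warm-up.
        return NodeAvailability::Active;
    }
    match current {
        NodeAvailability::WarmingUp(started) => {
            // Warming-up nodes get the longer grace period.
            if now.saturating_duration_since(started) > cfg.max_warming_up_interval {
                NodeAvailability::Offline
            } else {
                current
            }
        }
        NodeAvailability::Active => {
            if now.saturating_duration_since(last_reply_at) > cfg.max_offline_interval {
                NodeAvailability::Offline
            } else {
                NodeAvailability::Active
            }
        }
        NodeAvailability::Offline => NodeAvailability::Offline,
    }
}
```

Keeping two separate intervals means a restarting pageserver is tolerated for longer, while an already active node that goes silent is still marked offline quickly.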
Christian Schwarz
a1256b2a67 fix: remote timeline client shutdown trips circuit breaker (#8495)
Before this PR

1. The circuit breaker would trip on CompactionError::Shutdown. That's
wrong, we want to ignore those cases.
2. remote timeline client shutdown would not be mapped to
CompactionError::Shutdown in all circumstances.

We observed this in staging, see
https://neondb.slack.com/archives/C033RQ5SPDH/p1721829745384449

This PR fixes (1) with a simple `match` statement, and (2) by switching
a bunch of `anyhow` usage over to distinguished errors that ultimately
get mapped to `CompactionError::Shutdown`.

I removed the implicit `#[from]` conversion from `anyhow::Error` to
`CompactionError::Other` to discover all the places that were mapping
remote timeline client shutdown to `anyhow::Error`.

In my opinion `#[from]` is an antipattern and we should avoid it,
especially for `anyhow::Error`. If some callee is going to return
anyhow, the very least the caller should do is acknowledge, through a
`map_err(MyError::Other)`, that they're conflating different failure
reasons.
2024-07-25 09:44:31 +01:00
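The two fixes in a1256b2a67 can be illustrated with a std-only sketch. Everything except `CompactionError::Shutdown` and `CompactionError::Other` is an assumption made for the example: `OpaqueError` stands in for `anyhow::Error`, and `CircuitBreaker`, `ClientShutdown`, `read_metadata`, `schedule_upload`, and `compact_once` are hypothetical names.

```rust
/// Stand-in for `anyhow::Error` (assumption, to keep the sketch std-only).
#[derive(Debug)]
pub struct OpaqueError(pub String);

/// Distinguished error for remote timeline client shutdown (hypothetical).
#[derive(Debug)]
pub struct ClientShutdown;

#[derive(Debug)]
pub enum CompactionError {
    Shutdown,
    Other(OpaqueError),
}
// Deliberately no `impl From<OpaqueError> for CompactionError`: without it,
// `?` on an opaque error does not compile until the caller maps it explicitly.

/// Hypothetical breaker that counts failures and trips at a threshold.
pub struct CircuitBreaker {
    failures: u32,
    threshold: u32,
}

impl CircuitBreaker {
    pub fn new(threshold: u32) -> Self {
        Self { failures: 0, threshold }
    }

    /// Fix (1): shutdown is not a failure, so it must not count.
    pub fn record(&mut self, result: &Result<(), CompactionError>) {
        match result {
            Ok(()) => {}
            Err(CompactionError::Shutdown) => {}
            Err(CompactionError::Other(_)) => self.failures += 1,
        }
    }

    pub fn is_tripped(&self) -> bool {
        self.failures >= self.threshold
    }
}

/// Hypothetical callee that still returns an opaque error.
fn read_metadata(ok: bool) -> Result<u64, OpaqueError> {
    if ok { Ok(42) } else { Err(OpaqueError("metadata unreadable".to_string())) }
}

/// Hypothetical callee that reports shutdown through a distinguished type.
fn schedule_upload(shutting_down: bool) -> Result<(), ClientShutdown> {
    if shutting_down { Err(ClientShutdown) } else { Ok(()) }
}

/// Fix (2): each conversion is spelled out at the call site.
pub fn compact_once(shutting_down: bool, metadata_ok: bool) -> Result<(), CompactionError> {
    let _size = read_metadata(metadata_ok).map_err(CompactionError::Other)?;
    schedule_upload(shutting_down).map_err(|_: ClientShutdown| CompactionError::Shutdown)?;
    Ok(())
}
```

Because no blanket conversion exists, a shutdown error cannot silently become `CompactionError::Other`, and every place that conflates failure reasons is visible as an explicit `map_err`.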
Christian Schwarz
d57412aaab followup(#8359): pre-initialize circuitbreaker metrics (#8491) 2024-07-25 10:24:28 +02:00
John Spray
5f4e14d27d pageserver: fix a compilation error (#8487)
## Problem
PR that modified compaction raced with PR that modified the GcInfo
structure

## Summary of changes
Fix it

Co-authored-by: Vlad Lazar <vlalazar.vlad@gmail.com>
2024-07-24 16:37:15 +01:00
Vlad Lazar
2723a8156a pageserver: faster and simpler inmem layer vec read (#8469)
## Problem
The in-memory layer vectored read was very slow in some conditions
(the walingest::test_large_rel test). Upon profiling, I realised that 80% of
the time was spent building up the binary heap of reads. This stage
isn't actually needed.

## Summary of changes
Remove the planning stage as we never took advantage of it in order to
merge reads. There should be no functional change from this patch.
2024-07-24 14:23:03 +01:00