neon/pageserver at 4c7956fa56e8b39e56a83342c72c774481dba295 - neon

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-03 11:32:56 +00:00

Arpad Müller 4c7956fa56 Fix hang deleting offloaded timelines (#12366)

We don't have cancellation support for timeline deletions. In other
words, timeline deletion might still go on in an older generation while
we are attaching it in a newer generation already, because the
cancellation simply hasn't reached the deletion code.

This has caused us to hit a situation with offloaded timelines in which
the timeline was in an unrecoverable state: the deletion endpoint always
returned an accepted response, but never the 404 it should have returned.

A detailed description can be found
[here](https://github.com/neondatabase/cloud/issues/30406#issuecomment-3008667859)
(private repo link).

TLDR:

1. we ask the old pageserver/generation to delete the timeline; it starts
   the deletion process in the background.
2. the storcon migrates the tenant to a different pageserver.
3. during attach, the new pageserver still finds an index part, so it adds
   the timeline to `offloaded_timelines`.
4. the timeline deletion finishes, removing the index part in S3.
5. the timeline deletion endpoint is retried, this time against the new
   pageserver location. the retry is bound to fail, however:
   - as the index part is gone, we print `Timeline already deleted in
     remote storage`.
   - the problem is that we then return an accepted response code, and not
     a 404.
   - this confuses the code calling us: it thinks the timeline is not yet
     deleted, so it keeps retrying.
   - this state is never recovered from until a reset/detach, because the
     `offloaded_timelines` entry stays there.

This is where this PR fixes things: if no index part can be found, we
can safely assume that the timeline is gone in S3 (it's the last thing
to be deleted), so we can remove it from `offloaded_timelines` and
trigger a reupload of the manifest. Subsequent retries will pick that
up.
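
As a rough illustration of the fix described above, the following is a minimal, self-contained sketch and not the pageserver's actual code. `TenantModel`, `DeleteOutcome`, `remote_index_parts`, `manifest_reupload_pending` and `delete_offloaded_timeline` are hypothetical names invented for this example; only `offloaded_timelines` comes from the description. The sketch also assumes that the call which notices the missing index part still answers with an accepted response, and that only the following retry sees the 404.

```rust
use std::collections::HashSet;

/// Hypothetical response codes of the timeline deletion endpoint.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeleteOutcome {
    /// Accepted: the caller assumes deletion is in progress and retries.
    Accepted,
    /// 404: the timeline is gone, the caller stops retrying.
    NotFound,
}

/// Hypothetical, heavily simplified stand-in for per-tenant state.
#[derive(Debug, Default)]
pub struct TenantModel {
    /// Timelines the current generation believes are offloaded.
    pub offloaded_timelines: HashSet<String>,
    /// Timelines whose index part still exists in remote storage.
    pub remote_index_parts: HashSet<String>,
    /// Set when the tenant manifest has to be uploaded again.
    pub manifest_reupload_pending: bool,
}

impl TenantModel {
    /// Sketch of the delete path for an offloaded timeline after the fix.
    pub fn delete_offloaded_timeline(&mut self, timeline_id: &str) -> DeleteOutcome {
        if !self.offloaded_timelines.contains(timeline_id) {
            // Nothing known under this id: report 404 so retries stop.
            return DeleteOutcome::NotFound;
        }
        if !self.remote_index_parts.contains(timeline_id) {
            // The index part is the last thing a deletion removes, so its
            // absence means the timeline is gone in remote storage. Drop the
            // stale entry and ask for a manifest reupload.
            self.offloaded_timelines.remove(timeline_id);
            self.manifest_reupload_pending = true;
            // Assumption: this call still answers "accepted"; the next
            // retry takes the NotFound branch above.
            return DeleteOutcome::Accepted;
        }
        // Index part still present: the real deletion would start in the
        // background here (not modelled in this sketch).
        DeleteOutcome::Accepted
    }
}
```

In this model, a tenant that was attached with a stale `offloaded_timelines` entry answers the first retry with an accepted response while cleaning up the entry, and every later retry with a 404. Without the cleanup branch, each retry would return an accepted response indefinitely.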

Why not improve the cancellation support? It is a more disruptive code
change that might carry its own risks, so we don't do it for now.

Fixes https://github.com/neondatabase/cloud/issues/30406

2025-06-27 15:14:55 +00:00

benches

Use enum-typed PG versions (#12317)

2025-06-24 17:25:31 +00:00

client

feat(storcon): retrieve feature flag and pass to pageservers (#12324)

2025-06-25 14:58:18 +00:00

compaction

apply clippy fixes for 1.88.0 beta (#12331)

2025-06-24 10:12:42 +00:00

ctl

apply clippy fixes for 1.88.0 beta (#12331)