# Problem

Before this PR, timeline shutdown would:

- cancel the walreceiver cancellation token subtree (a child token of `Timeline::cancel`)
- call `freeze_and_flush`
- `Timeline::cancel.cancel()`
- ... a bunch of waiting for things ...
- `Timeline::gate.close()`

As noted by the comment that is deleted by this PR, this left a window where, after `freeze_and_flush`, walreceiver could still be running and ingest data into a new `InMemoryLayer`. This is a potential source of log noise during timeline shutdown: the `InMemoryLayer` created after `freeze_and_flush` observes that `Timeline::cancel` is cancelled, failing the ingest with an `anyhow::Error` that (deeply) wraps a `FlushTaskError::Cancelled` instance (the `flush task cancelled` error message).

# Solution

It turns out to be quite easy to shut down, not just cancel, walreceiver completely, because the only subtask spawned by the walreceiver connection manager is the `handle_walreceiver_connection` task, which is already properly shut down and waited upon when the manager task observes cancellation and exits its retry loop. (A minimal sketch of the cancel-then-await pattern is included at the end of this description.)

The alternative would be to replace all usage of `anyhow` on the ingest path with differentiated error types. That is a lot of busywork for little gain to fix a potential logging noise nuisance, so we are not doing that for now.

# Correctness / Risk

We do not risk leaking walreceiver child tasks because the existing discipline is to hold a gate guard.

We do prolong `Timeline::shutdown` to the degree that we no longer make progress with the rest of shutdown until the walreceiver task has observed cancellation. In practice, this should be negligible.

`Timeline::shutdown` could fail to complete if there is a hidden dependency of walreceiver shutdown on some other subsystem. The code certainly suggests there isn't, and I'm not aware of any such dependency. In any case, the impact would be low because we only shut down `Timeline` instances that are obsolete, either because there is a newer attachment at a different location or because the timeline got deleted by the user. We would learn about such a problem through stuck cplane operations or stuck storcon reconciliations, and we would be able to mitigate by cancelling the stuck operations/reconciliations and/or by rolling back pageserver.

# Refs

- identified this while investigating https://github.com/neondatabase/neon/issues/11762
- PR that _does_ fix a bunch of _real_ `flush task cancelled` noise on the compaction path: https://github.com/neondatabase/neon/pull/11853
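
# Appendix: illustrative sketch

For illustration only, here is a minimal, standalone sketch of the cancel-then-await ("shut down, not just cancel") pattern, using `tokio` and `tokio_util::sync::CancellationToken`. This is not the actual pageserver code; the task body and names (`walreceiver_task`, the sleep-driven loop) are hypothetical stand-ins.

```rust
// deps (assumed): tokio = { version = "1", features = ["full"] }, tokio-util = "0.7"
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let timeline_cancel = CancellationToken::new();
    // Walreceiver gets a child token of the timeline's token, as in the PR description.
    let walreceiver_cancel = timeline_cancel.child_token();

    // Hypothetical stand-in for the walreceiver connection manager loop.
    let walreceiver_task = tokio::spawn({
        let cancel = walreceiver_cancel.clone();
        async move {
            loop {
                tokio::select! {
                    _ = cancel.cancelled() => {
                        // On cancellation, a real manager would shut down and wait for
                        // its connection-handling subtask here, then exit the retry loop.
                        break;
                    }
                    _ = tokio::time::sleep(std::time::Duration::from_millis(10)) => {
                        // ... ingest WAL, possibly creating a new in-memory layer ...
                    }
                }
            }
        }
    });

    // Before this PR (conceptually): only cancel the child token, then immediately
    // continue with freeze_and_flush, leaving a window where ingest still runs.
    //
    // After this PR (conceptually): cancel *and* await the task's exit, so nothing
    // can ingest into a new in-memory layer once shutdown moves on.
    walreceiver_cancel.cancel();
    walreceiver_task.await.expect("walreceiver task panicked");

    // ... freeze_and_flush, Timeline::cancel.cancel(), gate close, etc. ...
    timeline_cancel.cancel();
}
```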