pageserver: fix wal receiver hang on remote client shutdown (#12348)

## Problem

Druing shard splits we shut down the remote client early and allow the
parent shard to keep ingesting data. While ingesting data, the wal
receiver task may wait for the current flush to complete in order to
apply backpressure. Notifications are delivered via
`Timeline::layer_flush_done_tx`.

When the remote client was being shut down the flush loop exited
whithout delivering a notification. This left
`Timeline::wait_flush_completion` hanging indefinitely which blocked the
shutdown of the wal receiver task, and, hence, the shard split.

## Summary of Changes

Deliver a final notification when the flush loop is shutting down
without the timeline cancel cancellation token having fired. I tried
writing a test for this, but got stuck in failpoint hell and decided
it's not worth it.

`test_sharding_autosplit`, which reproduces this reliably in CI, passed
with the proposed fix in
https://github.com/neondatabase/neon/pull/12304.

Closes https://github.com/neondatabase/neon/issues/12060
This commit is contained in:
Vlad Lazar
2025-06-26 17:35:34 +01:00
committed by GitHub
parent 232f2447d4
commit 72b3c9cd11

View File

@@ -4680,6 +4680,16 @@ impl Timeline {
mut layer_flush_start_rx: tokio::sync::watch::Receiver<(u64, Lsn)>,
ctx: &RequestContext,
) {
// Always notify waiters about the flush loop exiting since the loop might stop
// when the timeline hasn't been cancelled.
let scopeguard_rx = layer_flush_start_rx.clone();
scopeguard::defer! {
let (flush_counter, _) = *scopeguard_rx.borrow();
let _ = self
.layer_flush_done_tx
.send_replace((flush_counter, Err(FlushLayerError::Cancelled)));
}
// Subscribe to L0 delta layer updates, for compaction backpressure.
let mut watch_l0 = match self
.layers
@@ -4709,9 +4719,6 @@ impl Timeline {
let result = loop {
if self.cancel.is_cancelled() {
info!("dropping out of flush loop for timeline shutdown");
// Note: we do not bother transmitting into [`layer_flush_done_tx`], because
// anyone waiting on that will respect self.cancel as well: they will stop
// waiting at the same time we as drop out of this loop.
return;
}