mirror of
https://github.com/neondatabase/neon.git
synced 2026-01-06 13:02:55 +00:00
pageserver: fix wal receiver hang on remote client shutdown (#12348)
## Problem

During shard splits we shut down the remote client early and allow the parent shard to keep ingesting data. While ingesting data, the wal receiver task may wait for the current flush to complete in order to apply backpressure. Notifications are delivered via `Timeline::layer_flush_done_tx`.

When the remote client was being shut down, the flush loop exited without delivering a notification. This left `Timeline::wait_flush_completion` hanging indefinitely, which blocked the shutdown of the wal receiver task and, hence, the shard split.

## Summary of Changes

Deliver a final notification when the flush loop is shutting down without the timeline's cancellation token having fired.

I tried writing a test for this, but got stuck in failpoint hell and decided it's not worth it.

`test_sharding_autosplit`, which reproduces this reliably in CI, passed with the proposed fix in https://github.com/neondatabase/neon/pull/12304.

Closes https://github.com/neondatabase/neon/issues/12060
@@ -4680,6 +4680,16 @@ impl Timeline {
         mut layer_flush_start_rx: tokio::sync::watch::Receiver<(u64, Lsn)>,
         ctx: &RequestContext,
     ) {
+        // Always notify waiters about the flush loop exiting since the loop might stop
+        // when the timeline hasn't been cancelled.
+        let scopeguard_rx = layer_flush_start_rx.clone();
+        scopeguard::defer! {
+            let (flush_counter, _) = *scopeguard_rx.borrow();
+            let _ = self
+                .layer_flush_done_tx
+                .send_replace((flush_counter, Err(FlushLayerError::Cancelled)));
+        }
+
         // Subscribe to L0 delta layer updates, for compaction backpressure.
         let mut watch_l0 = match self
             .layers
@@ -4709,9 +4719,6 @@ impl Timeline {
         let result = loop {
             if self.cancel.is_cancelled() {
                 info!("dropping out of flush loop for timeline shutdown");
-                // Note: we do not bother transmitting into [`layer_flush_done_tx`], because
-                // anyone waiting on that will respect self.cancel as well: they will stop
-                // waiting at the same time we as drop out of this loop.
                 return;
             }
 
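The fix relies on a scope guard that sends a final notification on every exit path of the flush loop. The following is a minimal, standard-library-only sketch of that idea, not the pageserver code: `FlushDone`, `NotifyOnExit`, `FlushResult`, `flush_loop`, and `run_demo` are hypothetical names, a `Mutex` plus `Condvar` stands in for the `tokio::sync::watch` channel, and a `Drop` implementation stands in for `scopeguard::defer!`. The waiting logic is an assumption about how a waiter might behave, not a copy of `Timeline::wait_flush_completion`.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Outcome seen by waiters. `Cancelled` stands in for `FlushLayerError::Cancelled`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum FlushResult {
    Ok,
    Cancelled,
}

/// Hypothetical stand-in for the `layer_flush_done_tx` watch channel:
/// the latest (flush_counter, result) pair plus a condvar to wake waiters.
struct FlushDone {
    state: Mutex<(u64, FlushResult)>,
    changed: Condvar,
}

impl FlushDone {
    fn new() -> Self {
        FlushDone {
            state: Mutex::new((0, FlushResult::Ok)),
            changed: Condvar::new(),
        }
    }

    /// Replace the current value and wake all waiters.
    fn send_replace(&self, value: (u64, FlushResult)) {
        *self.state.lock().unwrap() = value;
        self.changed.notify_all();
    }

    /// Block until flush number `request` has completed or an error is published.
    fn wait_flush_completion(&self, request: u64) -> FlushResult {
        let mut state = self.state.lock().unwrap();
        loop {
            let (done, result) = *state;
            if result != FlushResult::Ok {
                return result;
            }
            if done >= request {
                return FlushResult::Ok;
            }
            state = self.changed.wait(state).unwrap();
        }
    }
}

/// Guard whose `Drop` delivers the final notification on any exit path.
struct NotifyOnExit<'a> {
    done: &'a FlushDone,
    last_requested: u64,
}

impl Drop for NotifyOnExit<'_> {
    fn drop(&mut self) {
        self.done
            .send_replace((self.last_requested, FlushResult::Cancelled));
    }
}

/// Hypothetical flush loop: completes `flushes` requests, then exits early
/// (as when the remote client shuts down) without an explicit notification.
fn flush_loop(done: &FlushDone, flushes: u64, last_requested: u64) {
    let _guard = NotifyOnExit { done, last_requested };
    for counter in 1..=flushes {
        done.send_replace((counter, FlushResult::Ok));
    }
    // `_guard` is dropped here and publishes the final `Cancelled` value.
}

/// A waiter asks for flush 2, but the loop only ever completes flush 1.
fn run_demo() -> FlushResult {
    let done = Arc::new(FlushDone::new());
    let waiter = {
        let done = Arc::clone(&done);
        thread::spawn(move || done.wait_flush_completion(2))
    };
    flush_loop(&done, 1, 2);
    waiter.join().unwrap()
}

fn main() {
    println!("waiter observed: {:?}", run_demo());
}
```

Without the guard, the waiter in `run_demo` would block forever, because flush 2 never completes and nothing else wakes it. Tying the notification to scope exit rather than to individual `return` statements means early exits added later are covered as well.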