Don't await the walreceiver on timeline shutdown (#12402)

Mostly a revert of https://github.com/neondatabase/neon/pull/11851 and https://github.com/neondatabase/neon/pull/12330 . Christian suggested reverting his PR to fix the issue https://github.com/neondatabase/neon/issues/12369 . Alternatives considered: 1. I have originally wanted to introduce cancellation tokens to `RequestContext`, but in the end I gave up on them because I didn't find a select-free way of preventing `test_layer_download_cancelled_by_config_location` from hanging. Namely if I put a select around the `get_or_maybe_download` invocation in `get_values_reconstruct_data`, it wouldn't hang, but if I put it around the `download_init_and_wait` invocation in `get_or_maybe_download`, the test would still hang. Not sure why, even though I made the attached child function of the `RequestContext` create a child token. 2. Introduction of a `download_cancel` cancellation token as a child of a timeline token, putting it into `RemoteTimelineClient` together with the main token, and then putting it into the whole `RemoteTimelineClient` read path. 3. Greater refactorings, like to make cancellation tokens follow a DAG structure so you can have tokens cancelled either by say timeline shutting down or a request ending. It doesn't just represent an effort that we don't have the engineering budget for, it also causes interesting questions like what to do about batching (do you cancel the entire request if only some requests get cancelled?). We might see a reemergence of https://github.com/neondatabase/neon/issues/11762, but given that we have https://github.com/neondatabase/neon/pull/11853 and https://github.com/neondatabase/neon/pull/12376 now, it is possible that it will not come back. Looking at some code, it might actually fix the locations where the error pops up. Let's see. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>
2026-06-04 22:10:39 +00:00 · 2025-07-04 22:12:10 +02:00
parent 7ed4530618
commit b94a5ce119
3 changed files with 27 additions and 30 deletions
--- a/pageserver/src/tenant/timeline.rs
+++ b/pageserver/src/tenant/timeline.rs
@@ -2144,14 +2144,31 @@ impl Timeline {
        debug_assert_current_span_has_tenant_and_timeline_id();

        // Regardless of whether we're going to try_freeze_and_flush
-        // or not, stop ingesting any more data.
+        // cancel walreceiver to stop ingesting more data asap.
+        //
+        // Note that we're accepting a race condition here where we may
+        // do the final flush below, before walreceiver observes the
+        // cancellation and exits.
+        // This means we may open a new InMemoryLayer after the final flush below.
+        // Flush loop is also still running for a short while, so, in theory, it
+        // could also make its way into the upload queue.
+        //
+        // If we wait for the shutdown of the walreceiver before moving on to the
+        // flush, then that would be avoided. But we don't do it because the
+        // walreceiver entertains reads internally, which means that it possibly
+        // depends on the download of layers. Layer download is only sensitive to
+        // the cancellation of the entire timeline, so cancelling the walreceiver
+        // will have no effect on the individual get requests.
+        // This would cause problems when there is a lot of ongoing downloads or
+        // there is S3 unavailabilities, i.e. detach, deletion, etc would hang,
+        // and we can't deallocate resources of the timeline, etc.
        let walreceiver = self.walreceiver.lock().unwrap().take();
        tracing::debug!(
            is_some = walreceiver.is_some(),
            "Waiting for WalReceiverManager..."
        );
        if let Some(walreceiver) = walreceiver {
-            walreceiver.shutdown().await;
+            walreceiver.cancel().await;
        }
        // ... and inform any waiters for newer LSNs that there won't be any.
        self.last_record_lsn.shutdown();
--- a/pageserver/src/tenant/timeline/walreceiver.rs
+++ b/pageserver/src/tenant/timeline/walreceiver.rs
@@ -63,7 +63,6 @@ pub struct WalReceiver {
    /// All task spawned by [`WalReceiver::start`] and its children are sensitive to this token.
    /// It's a child token of [`Timeline`] so that timeline shutdown can cancel WalReceiver tasks early for `freeze_and_flush=true`.
    cancel: CancellationToken,
-    task: tokio::task::JoinHandle<()>,
 }

 impl WalReceiver {
@@ -80,7 +79,7 @@ impl WalReceiver {
        let loop_status = Arc::new(std::sync::RwLock::new(None));
        let manager_status = Arc::clone(&loop_status);
        let cancel = timeline.cancel.child_token();
-        let task = WALRECEIVER_RUNTIME.spawn({
+        let _task = WALRECEIVER_RUNTIME.spawn({
            let cancel = cancel.clone();
            async move {
                debug_assert_current_span_has_tenant_and_timeline_id();
@@ -121,25 +120,14 @@ impl WalReceiver {
        Self {
            manager_status,
            cancel,
-            task,
        }
    }

    #[instrument(skip_all, level = tracing::Level::DEBUG)]
-    pub async fn shutdown(self) {
+    pub async fn cancel(self) {
        debug_assert_current_span_has_tenant_and_timeline_id();
        debug!("cancelling walreceiver tasks");
        self.cancel.cancel();
-        match self.task.await {
-            Ok(()) => debug!("Shutdown success"),
-            Err(je) if je.is_cancelled() => unreachable!("not used"),
-            Err(je) if je.is_panic() => {
-                // already logged by panic hook
-            }
-            Err(je) => {
-                error!("shutdown walreceiver task join error: {je}")
-            }
-        }
    }

    pub(crate) fn status(&self) -> Option<ConnectionManagerStatus> {
--- a/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs
+++ b/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs
@@ -275,20 +275,12 @@ pub(super) async fn handle_walreceiver_connection(
    let copy_stream = replication_client.copy_both_simple(&query).await?;
    let mut physical_stream = pin!(ReplicationStream::new(copy_stream));

-    let walingest_future = WalIngest::new(timeline.as_ref(), startpoint, &ctx);
-    let walingest_res = select! {
-        walingest_res = walingest_future => walingest_res,
-        _ = cancellation.cancelled() => {
-            // We are doing reads in WalIngest::new, and those can hang as they come from the network.
-            // Timeline cancellation hits the walreceiver cancellation token before it hits the timeline global one.
-            debug!("Connection cancelled");
-            return Err(WalReceiverError::Cancelled);
-        },
-    };
-    let mut walingest = walingest_res.map_err(|e| match e.kind {
-        crate::walingest::WalIngestErrorKind::Cancelled => WalReceiverError::Cancelled,
-        _ => WalReceiverError::Other(e.into()),
-    })?;
+    let mut walingest = WalIngest::new(timeline.as_ref(), startpoint, &ctx)
+        .await
+        .map_err(|e| match e.kind {
+            crate::walingest::WalIngestErrorKind::Cancelled => WalReceiverError::Cancelled,
+            _ => WalReceiverError::Other(e.into()),
+        })?;

    let (format, compression) = match protocol {
        PostgresClientProtocol::Interpreted {