fix(page_service pipelining): tenant cannot shut down because gate kept open while flushing responses (#10386)

# Refs - fixes https://github.com/neondatabase/neon/issues/10309 - fixup of batching design, first introduced in https://github.com/neondatabase/neon/pull/9851 - refinement of https://github.com/neondatabase/neon/pull/8339 # Problem `Tenant::shutdown` was occasionally taking many minutes (sometimes up to 20) in staging and prod if the `page_service_pipelining.mode="concurrent-futures"` is enabled. # Symptoms The issue happens during shard migration between pageservers. There is page_service unavailability and hence effectively downtime for customers in the following case: 1. The source (state `AttachedStale`) gets stuck in `Tenant::shutdown`, waiting for the gate to close. 2. Cplane/Storcon decides to transition the target `AttachedMulti` to `AttachedSingle`. 3. That transition comes with a bump of the generation number, causing the `PUT .../location_config` endpoint to do a full `Tenant::shutdown` / `Tenant::attach` cycle for the target location. 4. That `Tenant::shutdown` on the target gets stuck, waiting for the gate to close. 5. Eventually the gate closes (`close completed`), correlating with a `page_service` connection handler logging that it's exiting because of a network error (`Connection reset by peer` or `Broken pipe`). While in (4): - `Tenant::shutdown` is stuck waiting for all `Timeline::shutdown` calls to complete. So, really, this is a `Timeline::shutdown` bug. - retries from Cplane/Storcon to complete above state transitions, fail with errors related to the tenant mgr slot being in state `TenantSlot::InProgress`, the tenant state being `TenantState::Stopping`, and the timelines being in `TimelineState::Stopping`, and the `Timeline::cancel` being cancelled. - Existing (and/or new?) page_service connections log errors `error reading relation or page version: Not found: Timed out waiting 30s for tenant active state. Latest state: None` # Root-Cause After a lengthy investigation ([internal write-up](https://www.notion.so/neondatabase/2025-01-09-batching-deadlock-Slow-Log-Analysis-in-Staging-176f189e00478050bc21c1a072157ca4?pvs=4)) I arrived at the following root cause. The `spsc_fold` channel (`batch_tx`/`batch_rx`) that connects the Batcher and Executor stages of the pipelined mode was storing a `Handle` and thus `GateGuard` of the Timeline that was not shutting down. The design assumption with pipelining was that this would always be a short transient state. However, that was incorrect: the Executor was stuck on writing/flushing an earlier response into the connection to the client, i.e., socket write being slow because of TCP backpressure. The probable scenario of how we end up in that case: 1. Compute backend process sends a continuous stream of getpage prefetch requests into the connection, but never reads the responses (why this happens: see Appendix section). 2. Batch N is processed by Batcher and Executor, up to the point where Executor starts flushing the response. 3. Batch N+1 is procssed by Batcher and queued in the `spsc_fold`. 4. Executor is still waiting for batch N flush to finish. 5. Batcher eventually hits the `TimeoutReader` error (10min). From here on it waits on the `spsc_fold.send(Err(QueryError(TimeoutReader_error)))` which will never finish because the batch already inside the `spsc_fold` is not being read by the Executor, because the Executor is still stuck in the flush. (This state is not observable at our default `info` log level) 6. Eventually, Compute backend process is killed (`close()` on the socket) or Compute as a whole gets killed (probably no clean TCP shutdown happening in that case). 7. Eventually, Pageserver TCP stack learns about (6) through RST packets and the Executor's flush() call fails with an error. 8. The Executor exits, dropping `cancel_batcher` and its end of the spsc_fold. This wakes Batcher, causing the `spsc_fold.send` to fail. Batcher exits. The pipeline shuts down as intended. We return from `process_query` and log the `Connection reset by peer` or `Broken pipe` error. The following diagram visualizes the wait-for graph at (5) ```mermaid flowchart TD Batcher --spsc_fold.send(TimeoutReader_error)--> Executor Executor --flush batch N responses--> socket.write_end socket.write_end --wait for TCP window to move forward--> Compute ``` # Analysis By holding the GateGuard inside the `spsc_fold` open, the pipelining implementation violated the principle established in (https://github.com/neondatabase/neon/pull/8339). That is, that `Handle`s must only be held across an await point if that await point is sensitive to the `<Handle as Deref<Target=Timeline>>::cancel` token. In this case, we were holding the Handle inside the `spsc_fold` while awaiting the `pgb_writer.flush()` future. One may jump to the conclusion that we should simply peek into the spsc_fold to get that Timeline cancel token and be sensitive to it during flush, then. But that violates another principle of the design from https://github.com/neondatabase/neon/pull/8339. That is, that the page_service connection lifecycle and the Timeline lifecycles must be completely decoupled. Tt must be possible to shut down one shard without shutting down the page_service connection, because on that single connection we might be serving other shards attached to this pageserver. (The current compute client opens separate connections per shard, but, there are plans to change that.) # Solution This PR adds a `handle::WeakHandle` struct that does _not_ hold the timeline gate open. It must be `upgrade()`d to get a `handle::Handle`. That `handle::Handle` _does_ hold the timeline gate open. The batch queued inside the `spsc_fold` only holds a `WeakHandle`. We only upgrade it while calling into the various `handle_` methods, i.e., while interacting with the `Timeline` via `<Handle as Deref<Target=Timeline>>`. All that code has always been required to be (and is!) sensitive to `Timeline::cancel`, and therefore we're guaranteed to bail from it quickly when `Timeline::shutdown` starts. We will drop the `Handle` immediately, before we start `pgb_writer.flush()`ing the responses. Thereby letting go of our hold on the `GateGuard`, allowing the timeline shutdown to complete while the page_service handler remains intact. # Code Changes * Reproducer & Regression Test * Developed and proven to reproduce the issue in https://github.com/neondatabase/neon/pull/10399 * Add a `Test` message to the pagestream protocol (`cfg(feature = "testing")`). * Drive-by minimal improvement to the parsing code, we now have a `PagestreamFeMessageTag`. * Refactor `pageserver/client` to allow sending and receiving `page_service` requests independently. * Add a Rust helper binary to produce situation (4) from above * Rationale: (4) and (5) are the same bug class, we're holding a gate open while `flush()`ing. * Add a Python regression test that uses the helper binary to demonstrate the problem. * Fix * Introduce and use `WeakHandle` as explained earlier. * Replace the `shut_down` atomic with two enum states for `HandleInner`, wrapped in a `Mutex`. * To make `WeakHandle::upgrade()` and `Handle::downgrade()` cache-efficient: * Wrap the `Types::Timeline` in an `Arc` * Wrap the `GateGuard` in an `Arc` * The separate `Arc`s enable uncontended cloning of the timeline reference in `upgrade()` and `downgrade()`. If instead we were `Arc<Timeline>::clone`, different connection handlers would be hitting the same cache line on every upgrade()/downgrade(), causing contention. * Please read the udpated module-level comment in `mod handle` module-level comment for details. # Testing & Performance The reproducer test that failed before the changes now passes, and obviously other tests are passing as well. We'll do more testing in staging, where the issue happens every ~4h if chaos migrations are enabled in storcon. Existing perf testing will be sufficient, no perf degradation is expected. It's a few more alloctations due to the added Arc's, but, they're low frequency. # Appendix: Why Compute Sometimes Doesn't Read Responses Remember, the whole problem surfaced because flush() was slow because Compute was not reading responses. Why is that? In short, the way the compute works, it only advances the page_service protocol processing when it has an interest in data, i.e., when the pagestore smgr is called to return pages. Thus, if compute issues a bunch of requests as part of prefetch but then it turns out it can service the query without reading those pages, it may very well happen that these messages stay in the TCP until the next smgr read happens, either in that session, or possibly in another session. If there’s too many unread responses in the TCP, the pageserver kernel is going to backpressure into userspace, resulting in our stuck flush(). All of this stems from the way vanilla Postgres does prefetching and "async IO": it issues `fadvise()` to make the kernel do the IO in the background, buffering results in the kernel page cache. It then consumes the results through synchronous `read()` system calls, which hopefully will be fast because of the `fadvise()`. If it turns out that some / all of the prefetch results are not needed, Postgres will not be issuing those `read()` system calls. The kernel will eventually react to that by reusing page cache pages that hold completed prefetched data. Uncompleted prefetch requests may or may not be processed -- it's up to the kernel. In Neon, the smgr + Pageserver together take on the role of the kernel in above paragraphs. In the current implementation, all prefetches are sent as GetPage requests to Pageserver. The responses are only processed in the places where vanilla Postgres would do the synchronous `read()` system call. If we never get to that, the responses are queued inside the TCP connection, which, once buffers run full, will backpressure into Pageserver's sending code, i.e., the `pgb_writer.flush()` that was the root cause of the problems we're fixing in this PR.
2026-01-06 04:52:55 +00:00 · 2025-01-16 21:34:02 +01:00
parent b0838a68e5
commit c47c5f4ace
12 changed files with 966 additions and 274 deletions
--- a/pageserver/client/Cargo.toml
+++ b/pageserver/client/Cargo.toml
@@ -4,6 +4,9 @@ version = "0.1.0"
 edition.workspace = true
 license.workspace = true

+[features]
+testing = [ "pageserver_api/testing" ]
+
 [dependencies]
 pageserver_api.workspace = true
 thiserror.workspace = true
--- a/pageserver/client/src/page_service.rs
+++ b/pageserver/client/src/page_service.rs
@@ -1,6 +1,9 @@
-use std::pin::Pin;
+use std::sync::{Arc, Mutex};

-use futures::SinkExt;
+use futures::{
+    stream::{SplitSink, SplitStream},
+    SinkExt, StreamExt,
+};
 use pageserver_api::{
    models::{
        PagestreamBeMessage, PagestreamFeMessage, PagestreamGetPageRequest,
@@ -10,7 +13,6 @@ use pageserver_api::{
 };
 use tokio::task::JoinHandle;
 use tokio_postgres::CopyOutStream;
-use tokio_stream::StreamExt;
 use tokio_util::sync::CancellationToken;
 use utils::{
    id::{TenantId, TimelineId},
@@ -62,15 +64,28 @@ impl Client {
            .client
            .copy_both_simple(&format!("pagestream_v3 {tenant_id} {timeline_id}"))
            .await?;
+        let (sink, stream) = copy_both.split(); // TODO: actually support splitting of the CopyBothDuplex so the lock inside this split adaptor goes away.
        let Client {
            cancel_on_client_drop,
            conn_task,
            client: _,
        } = self;
+        let shared = Arc::new(Mutex::new(PagestreamShared::ConnTaskRunning(
+            ConnTaskRunning {
+                cancel_on_client_drop,
+                conn_task,
+            },
+        )));
        Ok(PagestreamClient {
-            copy_both: Box::pin(copy_both),
-            conn_task,
-            cancel_on_client_drop,
+            sink: PagestreamSender {
+                shared: shared.clone(),
+                sink,
+            },
+            stream: PagestreamReceiver {
+                shared: shared.clone(),
+                stream,
+            },
+            shared,
        })
    }

@@ -97,7 +112,28 @@ impl Client {

 /// Create using [`Client::pagestream`].
 pub struct PagestreamClient {
-    copy_both: Pin<Box<tokio_postgres::CopyBothDuplex<bytes::Bytes>>>,
+    shared: Arc<Mutex<PagestreamShared>>,
+    sink: PagestreamSender,
+    stream: PagestreamReceiver,
+}
+
+pub struct PagestreamSender {
+    #[allow(dead_code)]
+    shared: Arc<Mutex<PagestreamShared>>,
+    sink: SplitSink<tokio_postgres::CopyBothDuplex<bytes::Bytes>, bytes::Bytes>,
+}
+
+pub struct PagestreamReceiver {
+    #[allow(dead_code)]
+    shared: Arc<Mutex<PagestreamShared>>,
+    stream: SplitStream<tokio_postgres::CopyBothDuplex<bytes::Bytes>>,
+}
+
+enum PagestreamShared {
+    ConnTaskRunning(ConnTaskRunning),
+    ConnTaskCancelledJoinHandleReturnedOrDropped,
+}
+struct ConnTaskRunning {
    cancel_on_client_drop: Option<tokio_util::sync::DropGuard>,
    conn_task: JoinHandle<()>,
 }
@@ -110,11 +146,11 @@ pub struct RelTagBlockNo {
 impl PagestreamClient {
    pub async fn shutdown(self) {
        let Self {
-            copy_both,
-            cancel_on_client_drop: cancel_conn_task,
-            conn_task,
-        } = self;
-        // The `copy_both` contains internal channel sender, the receiver of which is polled by `conn_task`.
+            shared,
+            sink,
+            stream,
+        } = { self };
+        // The `copy_both` split into `sink` and `stream` contains internal channel sender, the receiver of which is polled by `conn_task`.
        // When `conn_task` observes the sender has been dropped, it sends a `FeMessage::CopyFail` into the connection.
        // (see https://github.com/neondatabase/rust-postgres/blob/2005bf79573b8add5cf205b52a2b208e356cc8b0/tokio-postgres/src/copy_both.rs#L56).
        //
@@ -131,27 +167,77 @@ impl PagestreamClient {
        //
        // NB: page_service doesn't have a use case to exit the `pagestream` mode currently.
        // => https://github.com/neondatabase/neon/issues/6390
-        let _ = cancel_conn_task.unwrap();
+        let ConnTaskRunning {
+            cancel_on_client_drop,
+            conn_task,
+        } = {
+            let mut guard = shared.lock().unwrap();
+            match std::mem::replace(
+                &mut *guard,
+                PagestreamShared::ConnTaskCancelledJoinHandleReturnedOrDropped,
+            ) {
+                PagestreamShared::ConnTaskRunning(conn_task_running) => conn_task_running,
+                PagestreamShared::ConnTaskCancelledJoinHandleReturnedOrDropped => unreachable!(),
+            }
+        };
+        let _ = cancel_on_client_drop.unwrap();
        conn_task.await.unwrap();
-        drop(copy_both);
+
+        // Now drop the split copy_both.
+        drop(sink);
+        drop(stream);
+    }
+
+    pub fn split(self) -> (PagestreamSender, PagestreamReceiver) {
+        let Self {
+            shared: _,
+            sink,
+            stream,
+        } = self;
+        (sink, stream)
    }

    pub async fn getpage(
        &mut self,
        req: PagestreamGetPageRequest,
    ) -> anyhow::Result<PagestreamGetPageResponse> {
-        let req = PagestreamFeMessage::GetPage(req);
-        let req: bytes::Bytes = req.serialize();
-        // let mut req = tokio_util::io::ReaderStream::new(&req);
-        let mut req = tokio_stream::once(Ok(req));
+        self.getpage_send(req).await?;
+        self.getpage_recv().await
+    }

-        self.copy_both.send_all(&mut req).await?;
+    pub async fn getpage_send(&mut self, req: PagestreamGetPageRequest) -> anyhow::Result<()> {
+        self.sink.getpage_send(req).await
+    }

-        let next: Option<Result<bytes::Bytes, _>> = self.copy_both.next().await;
+    pub async fn getpage_recv(&mut self) -> anyhow::Result<PagestreamGetPageResponse> {
+        self.stream.getpage_recv().await
+    }
+}
+
+impl PagestreamSender {
+    // TODO: maybe make this impl Sink instead for better composability?
+    pub async fn send(&mut self, msg: PagestreamFeMessage) -> anyhow::Result<()> {
+        let msg = msg.serialize();
+        self.sink.send_all(&mut tokio_stream::once(Ok(msg))).await?;
+        Ok(())
+    }
+
+    pub async fn getpage_send(&mut self, req: PagestreamGetPageRequest) -> anyhow::Result<()> {
+        self.send(PagestreamFeMessage::GetPage(req)).await
+    }
+}
+
+impl PagestreamReceiver {
+    // TODO: maybe make this impl Stream instead for better composability?
+    pub async fn recv(&mut self) -> anyhow::Result<PagestreamBeMessage> {
+        let next: Option<Result<bytes::Bytes, _>> = self.stream.next().await;
        let next: bytes::Bytes = next.unwrap()?;
+        PagestreamBeMessage::deserialize(next)
+    }

-        let msg = PagestreamBeMessage::deserialize(next)?;
-        match msg {
+    pub async fn getpage_recv(&mut self) -> anyhow::Result<PagestreamGetPageResponse> {
+        let next: PagestreamBeMessage = self.recv().await?;
+        match next {
            PagestreamBeMessage::GetPage(p) => Ok(p),
            PagestreamBeMessage::Error(e) => anyhow::bail!("Error: {:?}", e),
            PagestreamBeMessage::Exists(_)
@@ -160,7 +246,14 @@ impl PagestreamClient {
            | PagestreamBeMessage::GetSlruSegment(_) => {
                anyhow::bail!(
                    "unexpected be message kind in response to getpage request: {}",
-                    msg.kind()
+                    next.kind()
+                )
+            }
+            #[cfg(feature = "testing")]
+            PagestreamBeMessage::Test(_) => {
+                anyhow::bail!(
+                    "unexpected be message kind in response to getpage request: {}",
+                    next.kind()
                )
            }
        }