pageserver: cancellation handling in writes to postgres client socket (#5503)

## Problem

Writes to the postgres client socket from the pageserver were not wrapped in cancellation handling, so a stuck client connection could prevent tenant shutdown.

## Summary of changes

Every place where we call flush() to write to the socket should respect the cancellation token for the task. In this PR, I explicitly pass around a CancellationToken rather than doing inline `task_mgr::shutdown_token` calls, to avoid coupling the code to the global task_mgr state and to make it easier to refactor later.

I have some follow-on commits that add a Shutdown variant to QueryError and use it more extensively, but that is a pure refactor, so I will keep it separate from this bug-fix PR.

Closes: https://github.com/neondatabase/neon/issues/5341
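The pageserver change itself is not shown on this page, so the following is only an analogy, written in Python with asyncio to match the test code in the diff. It sketches the general pattern described above: race a socket flush against an explicitly passed cancellation signal, so a stuck client cannot block shutdown. The names `flush_or_cancel`, `ShuttingDown`, and `demo` are hypothetical, and an `asyncio.Event` stands in for the CancellationToken.

```python
import asyncio


class ShuttingDown(Exception):
    """Hypothetical error raised when cancellation wins the race."""


async def flush_or_cancel(flush, cancel: asyncio.Event) -> None:
    """Await flush(), but give up as soon as `cancel` is set.

    `flush` is any zero-argument callable returning an awaitable; `cancel`
    is passed in explicitly instead of being read from global state.
    """
    flush_task = asyncio.ensure_future(flush())
    cancel_task = asyncio.ensure_future(cancel.wait())
    done, pending = await asyncio.wait(
        {flush_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
    )
    # Whichever side lost the race is no longer needed.
    for task in pending:
        task.cancel()
    if flush_task in done:
        flush_task.result()  # re-raise any I/O error from the flush
        return
    raise ShuttingDown("Shutting down")


async def demo() -> str:
    cancel = asyncio.Event()

    async def stuck_flush() -> None:
        # Simulates a client that never drains its socket.
        await asyncio.Event().wait()

    # Request shutdown shortly after the flush starts.
    asyncio.get_running_loop().call_later(0.01, cancel.set)
    try:
        await flush_or_cancel(stuck_flush, cancel)
    except ShuttingDown as err:
        return str(err)
    return "flushed"
```

Passing the cancellation signal as a parameter, as in this sketch, keeps the helper free of global state, which is the design choice the summary argues for.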
2026-01-08 22:12:56 +00:00 · 2023-10-09 15:54:17 +01:00
parent 4772cd6c93
commit 7eaa7a496b
4 changed files with 123 additions and 74 deletions
--- a/test_runner/regress/test_pageserver_restarts_under_workload.py
+++ b/test_runner/regress/test_pageserver_restarts_under_workload.py
@@ -17,6 +17,8 @@ def test_pageserver_restarts_under_worload(neon_simple_env: NeonEnv, pg_bin: PgB
    n_restarts = 10
    scale = 10

+    env.pageserver.allowed_errors.append(".*query handler.*failed.*Shutting down")
+
    def run_pgbench(connstr: str):
        log.info(f"Start a pgbench workload on pg {connstr}")
        pg_bin.run_capture(["pgbench", "-i", f"-s{scale}", connstr])
--- a/test_runner/regress/test_tenant_detach.py
+++ b/test_runner/regress/test_tenant_detach.py
@@ -752,6 +752,9 @@ def test_ignore_while_attaching(
    env.pageserver.allowed_errors.append(
        f".*Tenant {tenant_id} will not become active\\. Current state: Stopping.*"
    )
+    # An endpoint is starting up concurrently with our detach, it can
+    # experience RPC failure due to shutdown.
+    env.pageserver.allowed_errors.append(".*query handler.*failed.*Shutting down")

    data_id = 1
    data_secret = "very secret secret"
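Both hunks add the same allow-list entry. As a minimal sketch of what that entry accepts, assuming the test framework treats each `allowed_errors` entry as a regular expression applied to individual log lines, the check can be reproduced with `re`. The log lines and the `is_allowed` helper below are invented for illustration and are not taken from the pageserver or the test runner.

```python
import re

# Pattern copied from the diff.
ALLOWED = ".*query handler.*failed.*Shutting down"

# Hypothetical log lines, invented for illustration only.
shutdown_line = "ERROR query handler for 'pagestream' failed: Shutting down"
other_line = "ERROR query handler for 'pagestream' failed: Connection reset"


def is_allowed(line, patterns):
    """Return True if any pattern matches the log line (assumed semantics)."""
    return any(re.match(p, line) for p in patterns)


print(is_allowed(shutdown_line, [ALLOWED]))
print(is_allowed(other_line, [ALLOWED]))
```

Under these assumptions, only the shutdown-related failure is tolerated; a query handler failure with any other cause would still be reported.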