fix(neon_local): long init_tenant_mgr causes pageserver startup failure

Before this PR, if neon_local's `start_process()` ran out of retries before pageserver started listening for requests, it would give up. As of PR #6474, we at least kill the starting pageserver process in that case; before that, we would leak it.

Pageserver `bind()`s the mgmt API early, but only starts `accept()`ing HTTP requests after it has finished `init_tenant_mgr()` (plus some other setup). With many tenants, `init_tenant_mgr()` can take longer than neon_local's retries allow.

Changes
=======

This PR changes the status check that neon_local performs when starting pageserver so that it ignores connect and timeout errors, since those are expected while pageserver is still initializing (see the explanation above).

I verified that this allows for an arbitrarily long `init_tenant_mgr()` by adding a timeout at the top of that function.
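The readiness check described above can be sketched in isolation. This is a minimal, std-only model, not neon_local code: `ProbeError`, `classify`, and `poll_until_ready` are hypothetical names, and the error variants stand in for what the real check gets from `mgmt_api::Error::ReceiveBody` and the underlying HTTP error's `is_connect()` / `is_timeout()`. `classify` mirrors the match in the diff; the polling loop with a fixed `max_attempts` is an assumption for illustration only.

```rust
/// Hypothetical model of one status-probe failure. Names are illustrative,
/// not the real mgmt_api / neon_local types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProbeError {
    /// TCP connect failed: nothing is accepting connections yet.
    Connect,
    /// Request timed out, e.g. socket bound but requests not yet accepted.
    Timeout,
    /// Any other failure while receiving the response body.
    OtherReceive,
    /// A non-receive error, e.g. an API-level error response.
    Api,
}

/// Mirrors the match in the diff: connect/timeout => not ready yet,
/// other receive errors => treated as up, anything else => hard failure.
fn classify(result: Result<(), ProbeError>) -> Result<bool, String> {
    match result {
        Ok(()) => Ok(true),
        Err(ProbeError::Connect) | Err(ProbeError::Timeout) => Ok(false),
        Err(ProbeError::OtherReceive) => Ok(true),
        Err(e) => Err(format!("status check failed: {e:?}")),
    }
}

/// Hypothetical polling loop: consume probe results until the node is
/// ready (returns the 1-based attempt number), a hard error occurs, or
/// `max_attempts` is exhausted. A real loop would sleep between attempts.
fn poll_until_ready<I>(probes: I, max_attempts: usize) -> Result<usize, String>
where
    I: IntoIterator<Item = Result<(), ProbeError>>,
{
    for (attempt, probe) in probes.into_iter().take(max_attempts).enumerate() {
        if classify(probe)? {
            return Ok(attempt + 1);
        }
    }
    Err(format!("not ready after {max_attempts} attempts"))
}

fn main() {
    // Pageserver still inside init_tenant_mgr(): connect/timeout errors first, then OK.
    let probes = vec![
        Err(ProbeError::Connect),
        Err(ProbeError::Timeout),
        Err(ProbeError::Timeout),
        Ok(()),
    ];
    println!("{:?}", poll_until_ready(probes, 10)); // prints "Ok(4)"
}
```

The key design point is that "not listening yet" is reported as `Ok(false)` (keep polling) rather than as an error, so a slow startup is distinguished from a genuinely broken one.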
2026-01-16 18:02:56 +00:00 · 2024-01-25 15:00:19 +00:00
parent 1dcb05c3d9
commit 1648639874
1 changed files with 12 additions and 5 deletions
--- a/control_plane/src/pageserver.rs
+++ b/control_plane/src/pageserver.rs
@@ -236,11 +236,18 @@ impl PageServerNode {
            self.pageserver_env_variables()?,
            background_process::InitialPidFile::Expect(self.pid_file()),
            || async {
-                let st = self.check_status().await;
-                match st {
-                    Ok(()) => Ok(true),
-                    Err(mgmt_api::Error::ReceiveBody(_)) => Ok(false),
-                    Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
+                match self.http_client.list_tenants().await {
+                    Ok(_) => Ok(true),
+                    Err(e) => match e {
+                        mgmt_api::Error::ReceiveBody(e) => {
+                            if e.is_connect() || e.is_timeout() {
+                                Ok(false)
+                            } else {
+                                Ok(true)
+                            }
+                        }
+                        e => anyhow::bail!(e),
+                    },
                }
            },
        )