Compare commits

10 Commits

Author SHA1 Message Date
Christian Schwarz
d359701419 Merge remote-tracking branch 'origin/main' into problame/neon-local-pageserver-slow-starts 2024-09-06 15:24:35 +02:00
Christian Schwarz
8749fe5131 Merge branch 'main' into problame/neon-local-pageserver-slow-starts 2024-01-25 21:00:26 +01:00
Christian Schwarz
cf916b9be8 unnest the match, as per review request 2024-01-25 18:28:05 +00:00
Christian Schwarz
f79e4a3a0b if it's not a connectivity issue, bail 2024-01-25 16:40:11 +00:00
Christian Schwarz
9fae0d9562 using /v1/tenant endpoint is actually not necessary, we don't care about status 503 2024-01-25 15:48:54 +00:00
Christian Schwarz
50c4b83066 implement request timeouts using cancel-by-drop 2024-01-25 15:36:44 +00:00
Christian Schwarz
727412b094 Revert "send request with a timeout, extending mgmt_api crate to do that"
This reverts commit 73e86160f3.
2024-01-25 15:31:33 +00:00
Christian Schwarz
73e86160f3 send request with a timeout, extending mgmt_api crate to do that 2024-01-25 15:31:10 +00:00
Christian Schwarz
1648639874 fix(neon_local): long init_tenant_mgr causes pageserver startup failure
Before this PR, if neon_local's `start_process()` ran out of retries
before pageserver started listening for requests, it would give up.
As of PR #6474 we at least kill the starting pageserver process in that
case; before that, we would leak it.

Pageserver `bind()`s the mgmt API early, but only starts `accept()`ing
HTTP requests after it has finished `init_tenant_mgr()` (plus some other
stuff).

`init_tenant_mgr()` can take a long time with many tenants, i.e., longer
than the retries that neon_local permits allow for.

Changes
=======

This PR changes the status check that neon_local performs when starting
pageserver to ignore connect & timeout errors, as those are expected
(see explanation above).

I verified that this allows for arbitrarily long `init_tenant_mgr()`
by adding a timeout at the top of that function.
2024-01-25 15:12:07 +00:00
Christian Schwarz
1dcb05c3d9 fix(neon_local): leaks child process if it fails to start & pass checks
Before this PR, if process_started() didn't return Ok(true) until we
ran out of retries, we'd return an error but leave the process running.

refs https://github.com/neondatabase/neon/issues/6473
2024-01-25 14:29:27 +00:00
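
The commit message for 1648639874 above notes that pageserver `bind()`s early but only `accept()`s later. The following is a minimal std-only sketch of what a client sees in that window, not pageserver or neon_local code. The function name, the request line, and the 200 ms read timeout are made up for illustration. On typical Linux and macOS systems the connect succeeds out of the listen backlog and the read then times out.

```rust
use std::io::{ErrorKind, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::time::Duration;

/// Probe a listener that is bound but never calls accept().
/// Returns (connect succeeded, kind of the read error if the read failed).
/// Hypothetical helper, for illustration only.
fn probe_bound_but_not_accepting() -> std::io::Result<(bool, Option<ErrorKind>)> {
    // bind() + listen() happen here; we deliberately never call accept().
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // The kernel completes the TCP handshake from the listen backlog,
    // so connecting works even though nobody is accepting yet.
    let mut stream = TcpStream::connect_timeout(&addr, Duration::from_secs(1))?;
    stream.set_read_timeout(Some(Duration::from_millis(200)))?;
    stream.write_all(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")?;

    // No one will ever answer, so the read can only end in a timeout.
    let mut buf = [0u8; 64];
    let read_err = match stream.read(&mut buf) {
        Ok(_) => None,
        Err(e) => Some(e.kind()),
    };

    // Keep the listener alive until after the read, then close it.
    drop(listener);
    Ok((true, read_err))
}
```

Under these assumptions a plain connect check is not enough to tell "still initializing" from "ready", which is why a per-request timeout is needed in addition to handling connect errors.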

@@ -284,11 +284,13 @@ impl PageServerNode {
             background_process::InitialPidFile::Expect(self.pid_file()),
             retry_timeout,
             || async {
-                let st = self.check_status().await;
-                match st {
-                    Ok(()) => Ok(true),
-                    Err(mgmt_api::Error::ReceiveBody(_)) => Ok(false),
-                    Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
+                let res =
+                    tokio::time::timeout(Duration::from_secs(1), self.http_client.status()).await;
+                match res {
+                    Ok(Ok(_)) => Ok(true),
+                    Ok(Err(mgmt_api::Error::ReceiveBody(e))) if e.is_connect() => Ok(false),
+                    Ok(Err(e)) => anyhow::bail!(e),
+                    Err(_timeout) => Ok(false),
                 }
             },
         )
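
The status check in this hunk sorts each probe result into "ready", "not ready yet, retry", or "fatal, bail". Below is a synchronous, std-only sketch of that classification, assuming `std::io::Error` kinds stand in for the `mgmt_api` and timeout errors. `classify` and `wait_until_ready` are hypothetical names and are not part of neon_local; the real code is async and uses `tokio::time::timeout`.

```rust
use std::io::{Error, ErrorKind};
use std::thread::sleep;
use std::time::Duration;

/// Map one probe result to Ok(true) = ready, Ok(false) = still starting,
/// Err = unexpected failure that should abort startup.
fn classify(res: Result<(), Error>) -> Result<bool, Error> {
    match res {
        Ok(()) => Ok(true),
        // Connect and timeout errors are expected while the server initializes.
        Err(e)
            if matches!(
                e.kind(),
                ErrorKind::ConnectionRefused | ErrorKind::TimedOut | ErrorKind::WouldBlock
            ) =>
        {
            Ok(false)
        }
        // Anything else is not a connectivity issue: bail.
        Err(e) => Err(e),
    }
}

/// Call `probe` up to `max_attempts` times, sleeping `delay` after each
/// "not ready yet" result. Returns Ok(false) if the attempts run out.
fn wait_until_ready<F>(mut probe: F, max_attempts: u32, delay: Duration) -> Result<bool, Error>
where
    F: FnMut() -> Result<(), Error>,
{
    for _ in 0..max_attempts {
        if classify(probe())? {
            return Ok(true);
        }
        sleep(delay);
    }
    Ok(false)
}
```

In this sketch a caller would treat `Ok(false)` as "ran out of retries" and kill the child process, which corresponds to the leak fix in 1dcb05c3d9.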