storage controller: make proxying of GETs to pageservers more robust (#9065)

## Problem

These commits are split off from
https://github.com/neondatabase/neon/pull/8971/commits where I was
fixing this to make a better scale test pass -- Vlad also independently
recognized these issues with cloudbench in
https://github.com/neondatabase/neon/issues/9062.

1. The storage controller proxies GET requests to pageservers based on
their intent, not the ground truth of where they're really attached.
2. Proxied requests can race with scheduling to tenants, resulting in
404 responses if the request hits the wrong pageserver.

Closes: https://github.com/neondatabase/neon/issues/9062

## Summary of changes

1. If a shard has a running reconciler, then use the database
generation_pageserver to decide who to proxy the request to
2. If such a request gets a 404 response and its scheduled node has
changed since the request was dispatched.
This commit is contained in:
John Spray
2024-09-25 14:56:39 +01:00
committed by GitHub
parent 2cf47b1477
commit 4b711caf5e
4 changed files with 158 additions and 27 deletions

View File

@@ -541,6 +541,8 @@ impl Reconciler {
}
}
pausable_failpoint!("reconciler-live-migrate-pre-generation-inc");
// Increment generation before attaching to new pageserver
self.generation = Some(
self.persistence
@@ -617,6 +619,8 @@ impl Reconciler {
},
);
pausable_failpoint!("reconciler-live-migrate-post-detach");
tracing::info!("🔁 Switching to AttachedSingle mode on node {dest_ps}",);
let dest_final_conf = build_location_config(
&self.shard,