storage_controller: make leadership protocol more robust (#11703)

## Problem

We saw the following scenario in staging:

1. Pod A starts up. It becomes the leader and steps down the previous pod cleanly.
2. Pod B starts up (deployment).
3. The step down request from pod B to pod A times out. Pod A did not manage to stop its reconciliations within 10 seconds and exited with return code 1 ([code](7ba8519b43/storage_controller/src/service.rs (L8686-L8702))).
4. Pod B marks itself as the leader and finishes start-up.
5. k8s restarts pod A.
6. k8s marks pod B as ready.
7. Pod A sends a step down request to pod B. This succeeds, so pod A is now the leader.
8. k8s kills pod A because it thinks pod B is healthy and pod A is part of the old replica set.

We end up in a situation where the only pod we have (B) is stepped down and attempts to forward requests to a leader that doesn't exist. k8s can't detect that pod B is in a bad state, since the /status endpoint simply returns 200 if the pod is running.

## Summary of changes

This PR includes a number of robustness improvements to the leadership protocol:

* use a single step down task per controller
* add a new endpoint to be used as the k8s liveness probe and check leadership status there
* handle restarts explicitly (i.e. don't step yourself down)
* increase the step down retry count
* don't kill the process on a long step down, since k8s will just restart it
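To illustrate two of the changes above (explicit restart handling and a leadership-aware liveness probe), here is a minimal sketch. It is not the code from this PR: the `Leadership` enum, both function names, the address strings, and the choice of 503 for a stepped-down pod are all assumptions made for illustration.

```rust
/// Hypothetical leadership states; names are illustrative, not taken from the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Leadership {
    Leader,
    Candidate,
    SteppedDown,
}

/// Decide whether a starting controller should send a step down request.
/// `recorded_leader` is the (assumed) persisted address of the current leader.
/// If it matches our own address we are a restarted leader and must not
/// step ourselves down.
fn should_request_step_down(own_addr: &str, recorded_leader: Option<&str>) -> bool {
    match recorded_leader {
        Some(addr) => addr != own_addr,
        None => false, // no leader recorded, nothing to step down
    }
}

/// HTTP status code a liveness endpoint could return (one possible policy).
/// A stepped-down pod reports itself as unhealthy so that the orchestrator
/// can notice it, instead of always answering 200 while the process runs.
fn liveness_status(state: Leadership) -> u16 {
    match state {
        Leadership::Leader | Leadership::Candidate => 200,
        Leadership::SteppedDown => 503,
    }
}

fn main() {
    // A restarted pod that finds itself recorded as leader skips the request.
    println!("{}", should_request_step_down("pod-a", Some("pod-a")));
    // A new pod asks the recorded leader to step down.
    println!("{}", should_request_step_down("pod-b", Some("pod-a")));
    println!("{}", liveness_status(Leadership::SteppedDown));
}
```

With a probe along these lines, the stuck state from the scenario (pod B stepped down with no live leader) would surface as a failing liveness check rather than a healthy-looking pod.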
2026-05-20 22:50:38 +00:00 · 2025-04-24 17:59:56 +01:00
parent 8afb783708
commit 6f7e3c18e4
5 changed files with 142 additions and 89 deletions
--- a/storage_controller/src/peer_client.rs
+++ b/storage_controller/src/peer_client.rs
@@ -55,9 +55,12 @@ impl ResponseErrorMessageExt for reqwest::Response {
    }
 }

-#[derive(Serialize, Deserialize, Debug, Default)]
+#[derive(Serialize, Deserialize, Debug, Default, Clone)]
 pub(crate) struct GlobalObservedState(pub(crate) HashMap<TenantShardId, ObservedState>);

+const STEP_DOWN_RETRIES: u32 = 8;
+const STEP_DOWN_TIMEOUT: Duration = Duration::from_secs(1);
+
 impl PeerClient {
    pub(crate) fn new(http_client: reqwest::Client, uri: Uri, jwt: Option<String>) -> Self {
        Self {
@@ -76,7 +79,7 @@ impl PeerClient {
            req
        };

-        let req = req.timeout(Duration::from_secs(2));
+        let req = req.timeout(STEP_DOWN_TIMEOUT);

        let res = req
            .send()
@@ -94,8 +97,7 @@ impl PeerClient {
    }

    /// Request the peer to step down and return its current observed state
-    /// All errors are retried with exponential backoff for a maximum of 4 attempts.
-    /// Assuming all retries are performed, the function times out after roughly 4 seconds.
+    /// All errors are re-tried
    pub(crate) async fn step_down(
        &self,
        cancel: &CancellationToken,
@@ -104,7 +106,7 @@ impl PeerClient {
            || self.request_step_down(),
            |_e| false,
            2,
-            4,
+            STEP_DOWN_RETRIES,
            "Send step down request",
            cancel,
        )