storcon: add drain and fill background operations for graceful cluster restarts (#8014)

## Problem

Pageserver restarts cause read availability downtime for tenants. See the `Motivation` section in the [RFC](https://github.com/neondatabase/neon/pull/7704).

## Summary of changes

* Introduce a new `NodeSchedulingPolicy`: `PauseForRestart`
* Implement the first take of the drain and fill algorithms
* Add a node status endpoint which can be polled to figure out when an operation is done

The implementation follows the RFC, so it might be useful to peek at it as you're reviewing. Since the PR is rather chunky, I've made sure all commits build (with warnings), so you can review commit by commit if you prefer.

RFC: https://github.com/neondatabase/neon/pull/7704
Related: https://github.com/neondatabase/neon/issues/7387
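
For orientation while reviewing, here is a minimal sketch of how the new `PauseForRestart` policy affects scheduling. It is modelled on the `match` arms visible in the `node.rs` hunk. The enums are simplified stand-ins, not the real storage controller types. The `u32` score, the free function `may_schedule`, and the `Active => Yes(score)` arm are assumptions made for illustration, since the hunk does not show them. The real method may also take node availability into account, which this sketch ignores.

```rust
// Simplified stand-in for the storage controller's scheduling policy enum.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NodeSchedulingPolicy {
    Active,
    Draining,
    Filling,
    Pause,
    PauseForRestart,
}

// Simplified stand-in: the real type carries a utilization score;
// a plain u32 is used here as a placeholder.
#[derive(Debug, PartialEq, Eq)]
enum MaySchedule {
    Yes(u32),
    No,
}

// Hypothetical free function mirroring the match in `node.rs`.
// The `Active` arm is an assumption; the other arms follow the diff.
#[allow(dead_code)]
fn may_schedule(policy: NodeSchedulingPolicy, score: u32) -> MaySchedule {
    match policy {
        NodeSchedulingPolicy::Active => MaySchedule::Yes(score),
        NodeSchedulingPolicy::Draining => MaySchedule::No,
        NodeSchedulingPolicy::Filling => MaySchedule::Yes(score),
        NodeSchedulingPolicy::Pause => MaySchedule::No,
        // New in this PR: a node paused for restart accepts no new shards.
        NodeSchedulingPolicy::PauseForRestart => MaySchedule::No,
    }
}
```

Under these assumptions, a node that is `Draining` or `PauseForRestart` is never picked as a scheduling target, while a `Filling` node remains eligible so that shards can be moved back onto it after the restart.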
2026-05-29 19:10:38 +00:00 · 2024-06-19 11:55:30 +01:00
parent 4753b8f390
commit 5778d714f0
13 changed files with 905 additions and 21 deletions
--- a/storage_controller/src/node.rs
+++ b/storage_controller/src/node.rs
@@ -59,6 +59,10 @@ impl Node {
        self.id
    }

+    pub(crate) fn get_scheduling(&self) -> NodeSchedulingPolicy {
+        self.scheduling
+    }
+
    pub(crate) fn set_scheduling(&mut self, scheduling: NodeSchedulingPolicy) {
        self.scheduling = scheduling
    }
@@ -151,6 +155,7 @@ impl Node {
            NodeSchedulingPolicy::Draining => MaySchedule::No,
            NodeSchedulingPolicy::Filling => MaySchedule::Yes(score),
            NodeSchedulingPolicy::Pause => MaySchedule::No,
+            NodeSchedulingPolicy::PauseForRestart => MaySchedule::No,
        }
    }

@@ -167,7 +172,7 @@ impl Node {
            listen_http_port,
            listen_pg_addr,
            listen_pg_port,
-            scheduling: NodeSchedulingPolicy::Filling,
+            scheduling: NodeSchedulingPolicy::Active,
            availability: NodeAvailability::Offline,
            cancel: CancellationToken::new(),
        }