storcon: add drain and fill background operations for graceful cluster restarts (#8014)

## Problem
Pageserver restarts cause read availablity downtime for tenants. See
`Motivation` section in the
[RFC](https://github.com/neondatabase/neon/pull/7704).

## Summary of changes
* Introduce a new `NodeSchedulingPolicy`: `PauseForRestart`
* Implement the first take of drain and fill algorithms
* Add a node status endpoint which can be polled to figure out when an
operation is done

The implementation follows the RFC, so it might be useful to peek at it
as you're reviewing.
Since the PR is rather chunky, I've made sure all commits build (with
warnings), so you can
review by commit if you prefer that.

RFC: https://github.com/neondatabase/neon/pull/7704
Related https://github.com/neondatabase/neon/issues/7387
This commit is contained in:
Vlad Lazar
2024-06-19 11:55:30 +01:00
committed by GitHub
parent 4753b8f390
commit 5778d714f0
13 changed files with 905 additions and 21 deletions

View File

@@ -59,6 +59,10 @@ impl Node {
self.id
}
pub(crate) fn get_scheduling(&self) -> NodeSchedulingPolicy {
self.scheduling
}
pub(crate) fn set_scheduling(&mut self, scheduling: NodeSchedulingPolicy) {
self.scheduling = scheduling
}
@@ -151,6 +155,7 @@ impl Node {
NodeSchedulingPolicy::Draining => MaySchedule::No,
NodeSchedulingPolicy::Filling => MaySchedule::Yes(score),
NodeSchedulingPolicy::Pause => MaySchedule::No,
NodeSchedulingPolicy::PauseForRestart => MaySchedule::No,
}
}
@@ -167,7 +172,7 @@ impl Node {
listen_http_port,
listen_pg_addr,
listen_pg_port,
scheduling: NodeSchedulingPolicy::Filling,
scheduling: NodeSchedulingPolicy::Active,
availability: NodeAvailability::Offline,
cancel: CancellationToken::new(),
}