From fc39ef491576ce7af336939803d22a09ebaef4b5 Mon Sep 17 00:00:00 2001
From: Vlad Lazar
Date: Mon, 1 Jul 2024 18:44:28 +0100
Subject: [PATCH] docs: Graceful storage controller cluster restarts RFC (#7704)

RFC for "Graceful Restarts of Storage Controller Managed Clusters".

Related https://github.com/neondatabase/neon/issues/7387
---
 .../033-storage-controller-drain-and-fill.md | 345 ++++++++++++++++++
 1 file changed, 345 insertions(+)
 create mode 100644 docs/rfcs/033-storage-controller-drain-and-fill.md

diff --git a/docs/rfcs/033-storage-controller-drain-and-fill.md b/docs/rfcs/033-storage-controller-drain-and-fill.md
new file mode 100644
index 0000000000..77c84cd2a5
--- /dev/null
+++ b/docs/rfcs/033-storage-controller-drain-and-fill.md
@@ -0,0 +1,345 @@
# Graceful Restarts of Storage Controller Managed Clusters

## Summary

This RFC describes new storage controller APIs for draining and filling tenant shards from/on pageserver nodes.
It also covers how these new APIs should be used by an orchestrator (e.g. Ansible) in order to implement
graceful cluster restarts.

## Motivation

Pageserver restarts cause read availability downtime for tenants.

For example, pageserver-3 @ us-east-1 was unavailable for a randomly
picked tenant (which requested on-demand activation) for around 30 seconds
during the restart at 2024-04-03 16:37 UTC.

Note that many shutdowns on loaded pageservers do not finish within the
[10 second systemd enforced timeout](https://github.com/neondatabase/aws/blob/0a5280b383e43c063d43cbf87fa026543f6d6ad4/.github/ansible/systemd/pageserver.service#L16).
This means we shut down without flushing ephemeral layers and have to re-ingest data in order to
serve requests after restarting, potentially making first-request latencies worse.

This problem is not yet acutely felt on storage controller managed pageservers since
tenant density is much lower there. However, we plan to eventually migrate all
pageservers to storage controller management, so it makes sense to solve the issue proactively.

## Requirements

- Pageserver re-deployments cause minimal downtime for tenants.
- The storage controller exposes HTTP API hooks for draining and filling tenant shards
from a given pageserver. Said hooks can be used by an orchestrator process or a human operator.
- The storage controller exposes an HTTP API to cancel draining and filling background operations.
- Failures to drain or fill a node should not be fatal. In such cases, cluster restarts should proceed
as usual (with downtime).
- Progress of draining/filling is visible through metrics.

## Non Goals

- Integration with the control plane
- Graceful restarts for large non-HA tenants

## Impacted Components

- storage controller
- deployment orchestrator (i.e. Ansible)
- pageserver (indirectly)

## Terminology

**Draining** is the process through which all tenant shards that can be migrated off a given pageserver
are distributed across the rest of the cluster.

**Filling** is the symmetric opposite of draining. In this process, tenant shards are migrated onto a given
pageserver until the cluster reaches a reasonable, quiescent distribution of tenant shards across pageservers.

**Node scheduling policies** act as constraints on the scheduler. For instance, when a
node is set to the `Pause` policy, no further shards will be scheduled on it.

**Node** is a pageserver. The two terms are used interchangeably in this RFC.

**Deployment orchestrator** is a generic term for whatever drives our deployments.
Currently, it's an Ansible playbook.
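For orientation, the node scheduling policy states referenced throughout this RFC can be pictured as a small Rust enum. This is an illustrative sketch only; the storage controller's actual `NodeSchedulingPolicy` definition may differ in naming and detail.

```rust
/// Sketch of the node scheduling policy states used in this RFC; the real
/// definition in the storage controller may differ.
enum NodeSchedulingPolicy {
    /// Normal operation: the scheduler may freely place shards on the node.
    Active,
    /// No further shards are scheduled onto the node.
    Pause,
    /// A drain is in progress: shards are being migrated off the node.
    Draining,
    /// A drain has completed and the node is awaiting its restart.
    PauseForRestart,
    /// A fill is in progress: shards are being migrated onto the node.
    Filling,
}
```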
## Background

### Storage Controller Basics (skip if already familiar)

Fundamentally, the storage controller is a reconciler which aims to move from the observed mapping
between pageservers and tenant shards to an intended mapping. Pageserver node and tenant shard metadata
is durably persisted in a database, but note that the mapping between the two entities is not durably
persisted. Instead, this mapping (*observed state*) is constructed at startup by sending
`GET location_config` requests to registered pageservers.

An internal scheduler maps tenant shards to pageservers while respecting certain constraints.
The result of scheduling is the *intent state*. When the intent state changes, a *reconciliation*
will inform pageservers about the new assignment via `PUT location_config` requests and will notify
the compute via the configured hook.

### Background Optimizations

The storage controller performs scheduling optimizations in the background. It will
migrate attachments to warm secondaries and replace secondaries in order to balance
out the cluster.

### Reconciliation Concurrency Limiting

There is a hard limit on the number of reconciles that the storage controller
can have in flight at any given time. To give an idea of scale, the limit is
128 at the time of writing.

## Implementation

Note: this section focuses on the core functionality of the graceful restart process.
It doesn't necessarily describe the most efficient approach. Optimizations are described
separately in a later section.

### Overall Flow

This section describes how to implement graceful restarts from the perspective
of Ansible, the deployment orchestrator. Pageservers are already restarted sequentially.
The orchestrator shall implement the following prologue and epilogue steps for each
pageserver restart.

#### Prologue

The orchestrator shall first fetch the pageserver node id, either from the control plane or
directly from the pageserver it aims to restart. Next, it issues an HTTP request
to the storage controller in order to start the drain of said pageserver node.
All error responses are retried with a short back-off. When a 202 (Accepted)
HTTP code is returned, the drain has started. Now the orchestrator polls the
node status endpoint exposed by the storage controller in order to await the
end of the drain process. When the `policy` field of the node status response
becomes `PauseForRestart`, the drain has completed and the orchestrator can
proceed with restarting the pageserver.

The prologue is subject to an overall timeout, with a value in the ballpark
of minutes. As storage controller managed pageservers become more loaded, this timeout
will likely have to increase.

#### Epilogue

After restarting the pageserver, the orchestrator issues an HTTP request
to the storage controller to kick off the filling process. This API call
may be retried for all error codes with a short back-off. It also serves
as a synchronization primitive, since the fill will be refused if the pageserver
has not yet re-attached to the storage controller. When a 202 (Accepted) HTTP
code is returned, the fill has started. Now the orchestrator polls the node
status endpoint exposed by the storage controller in order to await the end of
the filling process. When the `policy` field of the node status response becomes
`Active`, the fill has completed and the orchestrator may proceed to the next pageserver.
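Both the prologue and the epilogue reduce to the same shape: trigger a background operation, then poll the node status until the node reaches the expected scheduling policy. The sketch below illustrates that loop. It assumes `reqwest` (with the `json` feature), `serde` (with `derive`), `tokio` and `anyhow`; the drain/fill paths follow this RFC, while the node status path and response shape are placeholders for illustration.

```rust
use std::time::{Duration, Instant};

use anyhow::{bail, Context};
use reqwest::StatusCode;
use serde::Deserialize;

/// Node status as reported by the storage controller. Only the field this
/// loop cares about is modelled here; the real response carries more.
#[derive(Deserialize)]
struct NodeStatus {
    policy: String, // e.g. "PauseForRestart" or "Active"
}

/// Trigger a drain or fill on `node_id` and wait until the node reaches
/// `expected_policy` ("PauseForRestart" after a drain, "Active" after a fill).
async fn trigger_and_await(
    client: &reqwest::Client,
    storcon_url: &str, // e.g. "http://storage-controller:1234" (placeholder)
    node_id: u64,
    operation: &str, // "drain" or "fill"
    expected_policy: &str,
    overall_timeout: Duration,
) -> anyhow::Result<()> {
    // Kick off the background operation; error responses are retried with a
    // short back-off, as described in the prologue/epilogue above.
    loop {
        let resp = client
            .put(format!("{storcon_url}/v1/control/node/{node_id}/{operation}"))
            .send()
            .await
            .context("sending drain/fill request")?;
        let status = resp.status();
        if status == StatusCode::ACCEPTED {
            break; // 202: the background operation has started
        } else if status.is_client_error() || status.is_server_error() {
            tokio::time::sleep(Duration::from_secs(2)).await;
        } else {
            bail!("unexpected status code: {status}");
        }
    }

    // Poll the node status endpoint (path assumed here) until the node's
    // scheduling policy reaches the expected value or the timeout expires.
    let started = Instant::now();
    loop {
        if started.elapsed() >= overall_timeout {
            bail!("{operation} of node {node_id} timed out");
        }
        let node: NodeStatus = client
            .get(format!("{storcon_url}/v1/control/node/{node_id}"))
            .send()
            .await?
            .json()
            .await?;
        if node.policy == expected_policy {
            return Ok(());
        }
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
}
```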
Again, the epilogue is subject to an overall timeout. We can start off with the
same timeout as for the prologue, but we can also consider relying on the
storage controller's background optimizations and using a shorter timeout.

If the deployment orchestrator times out, it attempts to cancel the fill.
This operation shall be retried with a short back-off. If it ultimately
fails, manual intervention is required to set the node's scheduling policy to
`NodeSchedulingPolicy::Active`. Not doing so is not immediately problematic,
but it constrains the scheduler as mentioned previously.

### Node Scheduling Policy State Machine

The state machine below encodes the behaviours discussed above and
the various failover situations described in a later section.

Assuming no failures or timeouts, the flow is:
`Active -> Draining -> PauseForRestart -> Active -> Filling -> Active`

```
                     Operator requested drain
            +-----------------------------------------+
            |                                         |
    +-------+-------+                         +-------v-------+
    |               |                         |               |
    |     Pause     |                +-------->   Draining    +----------+
    |               |                |        |               |          |
    +---------------+                |        +-------+-------+          |
                                     |                |                  |
                                     |                |                  |
                      Drain requested|                |                  |
                                     |                |Drain complete    | Drain failed
                                     |                |                  | Cancelled/PS reattach/Storcon restart
                                     |                |                  |
                             +-------+-------+        |                  |
                             |               |        |                  |
               +-------------+    Active     <--------+------------------+
               |             |               |        |
Fill requested |             +---^---^-------+        |
               |                 |   |                |
               |                 |   |                |
               |                 |   |                |
               |   Fill completed|   |                |
               |                 |   |PS reattach     |
               |                 |   |after restart   |
       +-------v-------+         |   |        +-------v-------+
       |               |         |   |        |               |
       |    Filling    +---------+   +--------+PauseForRestart|
       |               |                      |               |
       +---------------+                      +---------------+
```

### Draining/Filling APIs

The storage controller APIs to trigger the draining or filling of a given node are:
`PUT /v1/control/node/:node_id/{drain,fill}`.

The following non-success HTTP return codes are used.
All of them are safely retriable from the perspective of the storage controller.
- 404: Requested node was not found
- 503: Requested node is known to the storage controller, but unavailable
- 412: Drain precondition failed: there is no other node to drain to, or the node's scheduling policy forbids draining
- 409: A drain or fill is already in progress. Only one such background operation
is allowed per node.

When the drain or fill is accepted and commenced, a 202 HTTP code is returned.

Drains and fills shall be cancellable by the deployment orchestrator or a
human operator via `DELETE /v1/control/node/:node_id/{drain,fill}`. A 200
response is returned when the cancellation is successful. Errors are retriable.

### Drain Process

Before accepting a drain request, the following validations are applied:
* Ensure that the node is known to the storage controller
* Ensure that the node's scheduling policy is `NodeSchedulingPolicy::Active` or `NodeSchedulingPolicy::Pause`
* Ensure that another drain or fill is not already running on the node
* Ensure that a drain is possible (i.e. check that there is at least one
schedulable node to drain to)

After accepting the drain, the scheduling policy of the node is set to
`NodeSchedulingPolicy::Draining`, both in memory and in the database.
This prevents the optimizer from adding or removing shards from the node, which
is desirable to avoid the drain and the optimizer racing.

Next, a separate Tokio task is spawned to manage the draining. For each tenant
shard attached to the node being drained, the node is demoted to a secondary for
that shard and the shard is then scheduled away from the node. Scheduling might fail
due to unsatisfiable constraints, but that is fine: draining is a best-effort process,
since it might not always be possible to cut over all shards.

Importantly, this task limits the concurrency of the reconciles it issues in order
to avoid overwhelming the target pageservers and to allow other important reconciles
to proceed.
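The sketch below illustrates the shape of this loop, using a Tokio semaphore to bound the number of in-flight reconciles. The constant and the helper steps are illustrative placeholders, not actual storage controller code.

```rust
use std::sync::Arc;

use tokio::sync::Semaphore;
use tokio::task::JoinSet;

/// How many drain-triggered reconciles may be in flight at once. Kept well
/// below the global reconcile limit (128) so that other work can still proceed.
const DRAIN_RECONCILE_CONCURRENCY: usize = 16; // illustrative value

async fn drain_node(attached_shards: Vec<u64>) {
    let limiter = Arc::new(Semaphore::new(DRAIN_RECONCILE_CONCURRENCY));
    let mut reconciles = JoinSet::new();

    for shard in attached_shards {
        // Real code would skip shards without a secondary (non-HA tenants),
        // demote the drained node to a secondary for this shard, and ask the
        // scheduler for a new attached location. Scheduling may fail due to
        // unsatisfiable constraints; draining is best-effort, so such shards
        // are simply skipped.

        // Bound the number of in-flight reconciles so the drain neither
        // overwhelms the target pageservers nor starves other reconciliations.
        let permit = limiter
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore is never closed");
        reconciles.spawn(async move {
            let _permit = permit; // held for the duration of the reconcile
            // Real code: run the reconciliation that cuts the shard over.
            let _ = shard;
        });
    }

    // Wait for the triggered reconciles before signalling the end of the
    // drain by moving the node to `PauseForRestart`.
    while reconciles.join_next().await.is_some() {}
    // Real code: set the node's scheduling policy to PauseForRestart and persist it.
}
```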
Once the triggered reconciles have finished or timed out, the node's scheduling
policy is set to `NodeSchedulingPolicy::PauseForRestart` to signal the end of the drain.

A note on non-HA tenants: these tenants do not have secondaries, so, by the description
above, they would not be migrated. It makes sense to skip them (especially the large ones)
since, depending on tenant size, migrating them may be more disruptive than the restart itself:
the pageserver the tenant is moved to would need to on-demand download its entire working set.
We can consider migrating small non-HA tenants in the future.

### Fill Process

Before accepting a fill request, the following validations are applied:
* Ensure that the node is known to the storage controller
* Ensure that the node's scheduling policy is `NodeSchedulingPolicy::Active`.
This is the only acceptable starting state for a fill. When a node re-attaches,
it sets the scheduling policy to `NodeSchedulingPolicy::Active` if it was previously
`NodeSchedulingPolicy::PauseForRestart` or `NodeSchedulingPolicy::Draining` (the possible end states of a node drain).
* Ensure that another drain or fill is not already running on the node

After accepting the fill, the scheduling policy of the node is set to
`NodeSchedulingPolicy::Filling`, both in memory and in the database.
This prevents the optimizer from adding or removing shards from the node, which
is desirable to avoid the fill and the optimizer racing.

Next, a separate Tokio task is spawned to manage the filling. For each tenant
shard for which the filled node is a secondary, promote that secondary to attached.
This is done until we run out of such shards or the counts of attached shards become
balanced across the cluster.

As with draining, the concurrency of the spawned reconciles is limited.
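To picture the fill's stopping condition, here is a minimal sketch of a possible "fair share" computation. The heuristic and the helper below are illustrative assumptions, not the storage controller's actual balancing logic.

```rust
/// Sketch: how many shards a fill could aim to promote onto a node, assuming a
/// simple "fair share" target of attachments per schedulable node.
fn fill_target(
    total_attached_shards: usize, // attachments across the whole cluster
    schedulable_nodes: usize,     // nodes the scheduler may place attachments on
    already_attached: usize,      // attachments currently on the node being filled
) -> usize {
    if schedulable_nodes == 0 {
        return 0;
    }
    // Round up so that small clusters do not under-fill the node.
    let fair_share = total_attached_shards.div_ceil(schedulable_nodes);
    fair_share.saturating_sub(already_attached)
}
```

For example, with 300 attachments spread across 3 schedulable nodes and 20 attachments already on the node being filled, the fill would promote at most 80 secondaries.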
### Failure Modes & Handling

Failures are generally handled by transitioning back into the `Active`
(neutral) state. This simplifies the implementation greatly at the
cost of adding transitions to the state machine. For example, we
could detect the `Draining` state upon restart and proceed with a drain,
but how would the storage controller know whether that is still what
the orchestrator needs?

#### Storage Controller Crash

When the storage controller starts up, reset the node scheduling policy
of all nodes in states `Draining`, `Filling` or `PauseForRestart` to
`Active`. The rationale is that, when the storage controller restarts,
we have lost the context of what the deployment orchestrator wants. It also
has the benefit of making things easier to reason about.

#### Pageserver Crash During Drain

The pageserver will attempt to re-attach during its restart, at which
point the node scheduling policy will be set back to `Active`, thus
re-enabling the scheduler to use the node.

#### Non-drained Pageserver Crash During Drain

What should happen when a pageserver we are draining to crashes during the
process? Two reasonable options are: cancel the drain and focus on the failover,
*or* do both but prioritise the failover. Since the number of concurrent reconciles
produced by drains/fills is limited, we get the latter behaviour for free.
My suggestion is that we take this approach, but the cancellation option is
trivial to implement as well.

#### Pageserver Crash During Fill

The pageserver will attempt to re-attach during its restart, at which
point the node scheduling policy will be set back to `Active`, thus
re-enabling the scheduler to use the node.

#### Pageserver Goes Unavailable During Drain/Fill

The drain and fill jobs handle this by stopping early. When the pageserver
is detected as online again by the storage controller's heartbeats, its scheduling
policy is reset to `Active`. If a restart happens instead, see the pageserver
crash failure modes above.

#### Orchestrator Drain Times Out

The orchestrator will still proceed with the restart.
When the pageserver re-attaches, the scheduling policy is set back to
`Active`.

#### Orchestrator Fill Times Out

The orchestrator will attempt to cancel the fill operation. If that fails,
the fill will continue until it quiesces and the node will be left
in the `Filling` scheduling policy. This hinders the scheduler, but is
otherwise harmless. A human operator can handle this by setting the scheduling
policy to `Active`, or we can bake a fill timeout into the storage controller.

## Optimizations

### Location Warmth

When cutting over to a secondary, the storage controller waits for it to
become "warm" (i.e. to have downloaded enough of the tenant's data). This means that some
reconciliations can take significantly longer than others and hold up precious
reconciliation units. As an optimization, the drain stage can cut over only those
tenants that are already "warm". Similarly, the fill stage can prioritise the
"warmest" tenants.

Given that the number of tenants managed by the storage controller will be fairly low
for the foreseeable future, the first implementation could simply query the tenants
for secondary status. This doesn't scale well with increasing tenant counts, so
eventually we will need new pageserver API endpoints to report the sets of
"warm" and "cold" secondary locations.

## Alternatives Considered

### Draining and Filling Purely as Scheduling Constraints

At its core, the storage controller is a big background loop that detects changes
in the environment and reacts to them. One could express draining and filling
of nodes purely in terms of constraining the scheduler (as opposed to having
dedicated drain/fill background tasks).

While theoretically nice, I think that's harder to implement and, more importantly,
harder to operate and reason about. Consider cancellation of a drain/fill operation.
We would have to update the scheduler state, create an entirely new schedule (intent
state) and start work on applying that. It gets trickier if we wish to cancel the
reconciliation tasks spawned by a node drain/fill. How would we know which ones belong
to the conceptual drain/fill? One could add labels to reconciliations, but it gets
messy in my opinion.

It would also mean that reconciliations themselves have side effects that persist in
the database (e.g. persisting something when the drain is done), which I'm not
conceptually fond of.

## Proof of Concept

This RFC is accompanied by a POC which implements nearly everything mentioned here
apart from the optimizations and some of the failure handling:
https://github.com/neondatabase/neon/pull/7682