
Graceful Restarts of Storage Controller Managed Clusters

Summary

This RFC describes new storage controller APIs for draining and filling tenant shards from/on pageserver nodes. It also covers how these new APIs should be used by an orchestrator (e.g. Ansible) in order to implement graceful cluster restarts.

Motivation

Pageserver restarts cause read availability downtime for tenants.

For example, pageserver-3 @ us-east-1 was unavailable for a randomly picked tenant (which requested on-demand activation) for around 30 seconds during the restart at 2024-04-03 16:37 UTC.

Note that many shutdowns on loaded pageservers do not finish within the 10-second systemd-enforced timeout. This means we shut down without flushing ephemeral layers and have to re-ingest data in order to serve requests after restarting, potentially making first-request latencies worse.

This problem is not yet acutely felt on storage controller managed pageservers since tenant density is much lower there. However, we plan to eventually migrate all pageservers to storage controller management, so it makes sense to solve the issue proactively.

Requirements

  • Pageserver re-deployments cause minimal downtime for tenants
  • The storage controller exposes HTTP API hooks for draining and filling tenant shards from/onto a given pageserver. Said hooks can be used by an orchestrator process or a human operator.
  • The storage controller exposes some HTTP API to cancel draining and filling background operations.
  • Failures to drain or fill the node should not be fatal. In such cases, cluster restarts should proceed as usual (with downtime).
  • Progress of draining/filling is visible through metrics

Non Goals

  • Integration with the control plane
  • Graceful restarts for large non-HA tenants.

Impacted Components

  • storage controller
  • deployment orchestrator (i.e. Ansible)
  • pageserver (indirectly)

Terminology

**Draining** is the process through which all tenant shards that can be migrated away from a given pageserver are distributed across the rest of the cluster.

**Filling** is the symmetric opposite of draining. In this process, tenant shards are migrated onto a given pageserver until the cluster reaches a reasonable, quiescent distribution of tenant shards across pageservers.

**Node scheduling policies** act as constraints on the scheduler. For instance, when a node is set to the Pause policy, no further shards will be scheduled on it.

**Node** is a pageserver. The terms are used interchangeably in this RFC.

**Deployment orchestrator** is a generic term for whatever drives our deployments. Currently, it is an Ansible playbook.

Background

Storage Controller Basics (skip if already familiar)

Fundamentally, the storage controller is a reconciler which aims to move from the observed mapping between pageservers and tenant shards to an intended mapping. Pageserver node and tenant shard metadata is durably persisted in a database, but the mapping between the two entities is not. Instead, this mapping (the observed state) is constructed at startup by sending GET location_config requests to registered pageservers.

An internal scheduler maps tenant shards to pageservers while respecting certain constraints. The result of scheduling is the intent state. When the intent state changes, a reconciliation informs pageservers about the new assignment via PUT location_config requests and notifies the compute via the configured hook.
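To make the observed/intent split concrete, here is a minimal, hypothetical sketch of that loop; the object names and endpoint shapes are illustrative and do not mirror the actual Rust implementation:

```python
# Hypothetical sketch of the observed-state -> intent-state reconciliation.
# Object names and endpoints are illustrative, not the real controller code.

def build_observed_state(pageservers):
    # At startup, ask every registered pageserver which shard locations it holds.
    return {ps.id: ps.get("/v1/location_config") for ps in pageservers}

def reconcile_shard(shard, observed, intent, pageservers, compute_hook):
    if observed.get(shard.id) == intent[shard.id]:
        return  # observed state already matches the intent
    for node_id, config in intent[shard.id].items():
        # Inform each involved pageserver of its new role for this shard.
        pageservers[node_id].put(f"/v1/tenant/{shard.id}/location_config", config)
    # Tell the compute where the shard is now attached.
    compute_hook.notify(shard.id, intent[shard.id])
```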

Background Optimizations

The storage controller performs scheduling optimizations in the background. It will migrate attachments to warm secondaries and replace secondaries in order to balance the cluster out.

Reconciliations Concurrency Limiting

There's a hard limit on the number of reconciles that the storage controller can have in flight at any given time. To get an idea of scales, the limit is 128 at the time of writing.

Implementation

Note: this section focuses on the core functionality of the graceful restart process. It doesn't necessarily describe the most efficient approach. Optimizations are described separately in a later section.

Overall Flow

This section describes how to implement graceful restarts from the perspective of Ansible, the deployment orchestrator. Pageservers are already restarted sequentially. The orchestrator shall implement the following prologue and epilogue steps for each pageserver restart:

Prologue

The orchestrator shall first fetch the pageserver node id from the control plane or the pageserver it aims to restart directly. Next, it issues an HTTP request to the storage controller in order to start the drain of said pageserver node. All error responses are retried with a short back-off. When a 202 (Accepted) HTTP code is returned, the drain has started. Now the orchestrator polls the node status endpoint exposed by the storage controller in order to await the end of the drain process. When the policy field of the node status response becomes PauseForRestart, the drain has completed and the orchestrator can proceed with restarting the pageserver.

The prologue is subject to an overall timeout. It will have a value in the ballpark of minutes. As storage controller managed pageservers become more loaded this timeout will likely have to increase.
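A minimal sketch of the prologue from the orchestrator's side, assuming a Python `requests`-style client; the drain endpoint follows the API described later in this RFC, while the node status path and the retry/timeout values are illustrative:

```python
import time
import requests

STORCON = "http://storage-controller:1234"  # illustrative address
PROLOGUE_TIMEOUT = 600                      # seconds; actual value to be tuned

def drain_node(node_id: int) -> bool:
    """Start a drain and wait until the node reaches PauseForRestart."""
    deadline = time.monotonic() + PROLOGUE_TIMEOUT
    # Kick off the drain, retrying all error responses with a short back-off.
    while time.monotonic() < deadline:
        resp = requests.put(f"{STORCON}/v1/control/node/{node_id}/drain")
        if resp.status_code == 202:
            break
        time.sleep(5)
    else:
        return False  # never accepted: restart proceeds with downtime
    # Poll node status until the scheduling policy becomes PauseForRestart.
    while time.monotonic() < deadline:
        status = requests.get(f"{STORCON}/v1/control/node/{node_id}").json()
        if status["policy"] == "PauseForRestart":
            return True
        time.sleep(5)
    return False  # timed out: restart proceeds with downtime
```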

Epilogue

After restarting the pageserver, the orchestrator issues an HTTP request to the storage controller to kick off the filling process. This API call may be retried for all error codes with a short back-off. It also serves as a synchronization primitive, since the fill will be refused if the pageserver has not yet re-attached to the storage controller. When a 202 (Accepted) HTTP code is returned, the fill has started. Now the orchestrator polls the node status endpoint exposed by the storage controller in order to await the end of the filling process. When the policy field of the node status response becomes Active, the fill has completed and the orchestrator may proceed to the next pageserver.

Again, the epilogue is subject to an overall timeout. We can start off with using the same timeout as for the prologue, but can also consider relying on the storage controller's background optimizations with a shorter timeout.

In the case that the deployment orchestrator times out, it attempts to cancel the fill. This operation shall be retried with a short back-off. If it ultimately fails, manual intervention is required to set the node's scheduling policy to NodeSchedulingPolicy::Active. Not doing so is not immediately problematic, but it constrains the scheduler as mentioned previously.
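A matching sketch of the epilogue, reusing the setup (imports, `STORCON`, node status endpoint assumptions) from the prologue sketch above:

```python
EPILOGUE_TIMEOUT = 600  # seconds; may start out equal to the prologue timeout

def fill_node(node_id: int) -> bool:
    """Start a fill after restart and wait until the node returns to Active."""
    deadline = time.monotonic() + EPILOGUE_TIMEOUT
    # The fill is refused until the pageserver has re-attached, so retrying
    # here doubles as a synchronization point.
    while time.monotonic() < deadline:
        resp = requests.put(f"{STORCON}/v1/control/node/{node_id}/fill")
        if resp.status_code == 202:
            break
        time.sleep(5)
    else:
        return False
    while time.monotonic() < deadline:
        status = requests.get(f"{STORCON}/v1/control/node/{node_id}").json()
        if status["policy"] == "Active":
            return True
        time.sleep(5)
    # Timed out: try to cancel the fill so the node is not left in Filling.
    for _ in range(5):
        if requests.delete(f"{STORCON}/v1/control/node/{node_id}/fill").ok:
            break
        time.sleep(5)
    return False
```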

Node Scheduling Policy State Machine

The state machine below encodes the behaviours discussed above and the various failover situations described in a later section.

Assuming no failures and/or timeouts the flow should be: Active -> Draining -> PauseForRestart -> Active -> Filling -> Active

                          Operator requested drain
               +-----------------------------------------+
               |                                         |
       +-------+-------+                         +-------v-------+
       |               |                         |               |
       |     Pause     |             +----------->    Draining   +----------+
       |               |             |           |               |          |
       +---------------+             |           +-------+-------+          |
                                     |                   |                  |
                                     |                   |                  |
                      Drain requested|                   |                  |
                                     |                   |Drain complete    | Drain failed
                                     |                   |                  | Cancelled/PS reattach/Storcon restart
                                     |                   |                  |
                             +-------+-------+           |                  |
                             |               |           |                  |
               +-------------+    Active     <-----------+------------------+
               |             |               |           |
Fill requested |             +---^---^-------+           |
               |                 |   |                   |
               |                 |   |                   |
               |                 |   |                   |
               |   Fill completed|   |                   |
               |                 |   |PS reattach        |
               |                 |   |after restart      |
       +-------v-------+         |   |           +-------v-------+
       |               |         |   |           |               |
       |    Filling    +---------+   +-----------+PauseForRestart|
       |               |                         |               |
       +---------------+                         +---------------+
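The same transitions, written out as a lookup table; this is an illustrative sketch (the variant names mirror NodeSchedulingPolicy, the rest of the structure is hypothetical):

```python
from enum import Enum

class NodeSchedulingPolicy(Enum):
    ACTIVE = "Active"
    PAUSE = "Pause"
    DRAINING = "Draining"
    PAUSE_FOR_RESTART = "PauseForRestart"
    FILLING = "Filling"

P = NodeSchedulingPolicy

# (from_state, trigger) -> to_state, per the diagram above
TRANSITIONS = {
    (P.PAUSE, "operator requested drain"): P.DRAINING,
    (P.ACTIVE, "drain requested"): P.DRAINING,
    (P.DRAINING, "drain complete"): P.PAUSE_FOR_RESTART,
    (P.DRAINING, "drain failed / cancelled / PS reattach / storcon restart"): P.ACTIVE,
    (P.PAUSE_FOR_RESTART, "PS reattach after restart"): P.ACTIVE,
    (P.ACTIVE, "fill requested"): P.FILLING,
    (P.FILLING, "fill completed"): P.ACTIVE,
}
```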

Draining/Filling APIs

The storage controller API to trigger the draining of a given node is: PUT /v1/control/node/:node_id/{drain,fill}.

The following HTTP non-success return codes are used. All of them are safely retriable from the perspective of the storage controller.

  • 404: Requested node was not found
  • 503: Requested node is known to the storage controller, but unavailable
  • 412: Drain precondition failed: there is no other node to drain to, or the node's scheduling policy forbids draining
  • 409: A {drain, fill} is already in progress. Only one such background operation is allowed per node.

When the drain is accepted and commenced a 202 HTTP code is returned.

Drains and fills shall be cancellable by the deployment orchestrator or a human operator via: DELETE /v1/control/node/:node_id/{drain,fill}. A 200 response is returned when the cancellation is successful. Errors are retriable.
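A small sketch of how a client might interpret the trigger responses, reusing the `requests`/`STORCON` setup from the prologue sketch; the mapping of codes to actions is illustrative:

```python
def start_operation(node_id: int, op: str) -> str:
    """Trigger a drain or fill and classify the response (op is 'drain' or 'fill')."""
    resp = requests.put(f"{STORCON}/v1/control/node/{node_id}/{op}")
    if resp.status_code == 202:
        return "started"
    if resp.status_code == 409:
        return "already in progress"          # one background operation per node
    if resp.status_code == 412:
        return "precondition failed"          # nothing to drain to / policy forbids it
    if resp.status_code in (404, 503):
        return "node unknown or unavailable"  # safe to retry with back-off
    return "retry"                            # all errors are retriable
```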

Drain Process

Before accepting a drain request, the following validations are applied:

  • Ensure that the node is known to the storage controller
  • Ensure that the scheduling policy is NodeSchedulingPolicy::Active or NodeSchedulingPolicy::Pause
  • Ensure that another drain or fill is not already running on the node
  • Ensure that a drain is possible (i.e. check that there is at least one schedulable node to drain to)

After accepting the drain, the scheduling policy of the node is set to NodeSchedulingPolicy::Draining and persisted in both memory and the database. This prevents the optimizer from adding or removing shards on the node, which is desirable to avoid the drain and the optimizer racing each other.

Next, a separate Tokio task is spawned to manage the draining. For each tenant shard attached to the node being drained, demote that node to a secondary for the shard and attempt to schedule the shard away. Scheduling might fail due to unsatisfiable constraints, but that is fine. Draining is a best-effort process since it might not always be possible to cut over all shards.

Importantly, this task limits the concurrency of issued reconciles in order to avoid overwhelming the target pageservers and to allow other important reconciles to proceed.

Once the triggered reconciles have finished or timed out, set the node's scheduling policy to NodeSchedulingPolicy::PauseForRestart to signal the end of the drain.

A note on non-HA tenants: these tenants do not have secondaries, so, per the description above, they would not be migrated. It makes sense to skip them (especially the large ones) since, depending on tenant size, migration might be more disruptive than the restart itself: the pageserver they are moved to would need to on-demand download the tenant's entire working set. We can consider expanding to small non-HA tenants in the future.
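A condensed sketch of the drain task described above; the helper names (`demote_to_secondary`, `schedule_away`, `reconcile`, etc.) are hypothetical stand-ins for the storage controller's internals:

```python
import asyncio

RECONCILE_CONCURRENCY = 8  # illustrative; well below the global in-flight limit

async def drain_node_task(node, shards, scheduler, persist):
    """Background task driving a node drain (sketch)."""
    sem = asyncio.Semaphore(RECONCILE_CONCURRENCY)

    async def migrate(shard):
        async with sem:  # bound the number of in-flight reconciles
            shard.demote_to_secondary(node)       # draining node becomes a secondary
            if not scheduler.schedule_away(shard, node):
                return                            # unsatisfiable constraints: best effort
            await shard.reconcile()               # location_config updates + compute notification

    # Skip shards without a secondary (non-HA tenants), per the note above.
    candidates = [s for s in shards if s.attached_to(node) and s.has_secondary()]
    await asyncio.gather(*(migrate(s) for s in candidates))

    # Signal the orchestrator that the drain is done.
    await persist.set_scheduling_policy(node, "PauseForRestart")
```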

Fill Process

Before accepting a fill request, the following validations are applied:

  • Ensure that the node is known to the storage controller
  • Ensure that the scheduling policy is NodeSchedulingPolicy::Active. This is the only acceptable starting state for a fill. When a node re-attaches, its scheduling policy is set to NodeSchedulingPolicy::Active if it was equal to NodeSchedulingPolicy::PauseForRestart or NodeSchedulingPolicy::Draining (the possible end states of a node drain).
  • Ensure that another drain or fill is not already running on the node

After accepting the fill, the scheduling policy of the node is set to NodeSchedulingPolicy::Filling and persisted in both memory and the database. As with draining, this prevents the optimizer from adding or removing shards on the node, which is desirable to avoid the two racing.

Next, a separate Tokio task is spawned to manage the filling. For each tenant shard for which the node being filled is a secondary, promote that secondary. This is done until we run out of such shards or the counts of attached shards become balanced across the cluster.

Like for draining, the concurrency of spawned reconciles is limited.
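A matching sketch of the fill task, under the same caveats as the drain sketch above (hypothetical helpers, illustrative concurrency limit, reusing `asyncio` and `RECONCILE_CONCURRENCY`):

```python
async def fill_node_task(node, shards, cluster, persist):
    """Background task driving a node fill (sketch)."""
    sem = asyncio.Semaphore(RECONCILE_CONCURRENCY)

    async def promote(shard):
        async with sem:
            shard.promote_secondary(node)  # node becomes the attached location
            await shard.reconcile()

    # Promote secondaries on this node until attachment counts look balanced.
    candidates = [s for s in shards if s.has_secondary_on(node)]
    target = cluster.mean_attached_shards_per_node()
    headroom = max(0, int(target) - cluster.attached_count(node))
    await asyncio.gather(*(promote(s) for s in candidates[:headroom]))

    await persist.set_scheduling_policy(node, "Active")
```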

Failure Modes & Handling

Failures are generally handled by transitioning back into the Active (neutral) state. This simplifies the implementation greatly at the cost of adding transitions to the state machine. For example, we could detect the Draining state upon restart and resume the drain, but how would the storage controller know whether that is still what the orchestrator wants?

Storage Controller Crash

When the storage controller starts up, it resets the node scheduling policy of all nodes in the Draining, Filling or PauseForRestart states to Active. The rationale is that when the storage controller restarts, we have lost the context of what the deployment orchestrator wants. It also has the benefit of making things easier to reason about.
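As a sketch, the startup reset could look roughly like this (hypothetical node and persistence helpers):

```python
TRANSIENT_POLICIES = {"Draining", "Filling", "PauseForRestart"}

async def reset_transient_policies_on_startup(nodes, persist):
    # On storage controller startup the orchestrator's context is lost, so drop
    # any node still in a transient drain/fill state back to Active.
    for node in nodes:
        if node.scheduling_policy in TRANSIENT_POLICIES:
            await persist.set_scheduling_policy(node.id, "Active")
```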

Pageserver Crash During Drain

The pageserver will attempt to re-attach during restart at which point the node scheduling policy will be set back to Active, thus reenabling the scheduler to use the node.

Non-drained Pageserver Crash During Drain

What should happen when a pageserver we are draining to crashes during the process? Two reasonable options are: cancel the drain and focus on the failover, or do both but prioritise the failover. Since the number of concurrent reconciles produced by drains/fills is limited, we get the latter behaviour for free. My suggestion is to take this approach, but the cancellation option is trivial to implement as well.

Pageserver Crash During Fill

The pageserver will attempt to re-attach during restart at which point the node scheduling policy will be set back to Active, thus reenabling the scheduler to use the node.

Pageserver Goes Unavailable During Drain/Fill

The drain and fill jobs handle this by stopping early. When the pageserver is detected as online by storage controller heartbeats, reset its scheduling policy to Active. If a restart happens instead, see the pageserver crash failure mode.

Orchestrator Drain Times Out

The orchestrator will still proceed with the restart. When the pageserver re-attaches, the scheduling policy is set back to Active.

Orchestrator Fill Times Out

The orchestrator will attempt to cancel the fill operation. If that fails, the fill will continue until it quiesces and the node will be left in the Filling scheduling policy. This hinders the scheduler, but is otherwise harmless. A human operator can handle this by setting the scheduling policy to Active, or we can bake a fill timeout into the storage controller.

Optimizations

Location Warmth

When cutting over to a secondary, the storage controller will wait for it to become "warm" (i.e. download enough of the tenant's data). This means that some reconciliations can take significantly longer than others and hold up precious reconciliation units. As an optimization, the drain stage can cut over only tenants that are already "warm". Similarly, the fill stage can prioritise the "warmest" tenants.

Given that the number of tenants managed by the storage controller will be fairly low for the foreseeable future, the first implementation could simply query the tenants for secondary status. This doesn't scale well with increasing tenant counts, so eventually we will need new pageserver API endpoints to report the sets of "warm" and "cold" nodes.
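A sketch of how the drain loop could filter for warm secondaries, assuming a hypothetical pageserver endpoint and response fields for secondary download progress (the real API may differ):

```python
import requests

WARMTH_THRESHOLD = 0.9  # illustrative fraction of heatmap layers already downloaded

def secondary_is_warm(pageserver_url: str, tenant_shard_id: str) -> bool:
    # Hypothetical endpoint/fields: query the secondary's download progress and
    # treat it as warm once most of the heatmap has been downloaded locally.
    resp = requests.get(
        f"{pageserver_url}/v1/tenant/{tenant_shard_id}/secondary/status"
    )
    status = resp.json()
    total = status.get("heatmap_layers_total", 0)
    downloaded = status.get("heatmap_layers_downloaded", 0)
    return total > 0 and downloaded / total >= WARMTH_THRESHOLD
```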

Alternatives Considered

Draining and Filling Purely as Scheduling Constraints

At its core, the storage controller is a big background loop that detects changes in the environment and reacts to them. One could express draining and filling of nodes purely in terms of constraining the scheduler (as opposed to having dedicated background tasks).

While theoretically nice, I think that is harder to implement and, more importantly, to operate and reason about. Consider cancellation of a drain/fill operation. We would have to update the scheduler state, create an entirely new schedule (intent state) and start work on applying it. It gets trickier if we wish to cancel the reconciliation tasks spawned by the drain/fill. How would we know which ones belong to the conceptual drain/fill? One could add labels to reconciliations, but that gets messy in my opinion.

It would also mean that reconciliations themselves have side effects that persist in the database (persisting something to the database when the drain is done), which I'm not conceptually fond of.

Proof of Concept

This RFC is accompanied by a POC which implements nearly everything mentioned here apart from the optimizations and some of the failure handling: https://github.com/neondatabase/neon/pull/7682