Graceful Restarts of Storage Controller Managed Clusters
Summary
This RFC describes new storage controller APIs for draining and filling tenant shards from/on pageserver nodes. It also covers how these new APIs should be used by an orchestrator (e.g. Ansible) in order to implement graceful cluster restarts.
Motivation
Pageserver restarts cause read availability downtime for tenants.
For example, pageserver-3 @ us-east-1 was unavailable for a randomly picked tenant (which requested on-demand activation) for around 30 seconds during the restart at 2024-04-03 16:37 UTC.
Note that many shutdowns on loaded pageservers do not finish within the 10-second systemd-enforced timeout. This means we shut down without flushing ephemeral layers and have to re-ingest data in order to serve requests after restarting, potentially making first-request latencies worse.
This problem is not yet very acutely felt in storage controller managed pageservers since tenant density is much lower there. However, we are planning on eventually migrating all pageservers to storage controller management, so it makes sense to solve the issue proactively.
Requirements
- Pageserver re-deployments cause minimal downtime for tenants
- The storage controller exposes HTTP API hooks for draining and filling tenant shards from/onto a given pageserver. Said hooks can be used by an orchestrator process or a human operator.
- The storage controller exposes some HTTP API to cancel draining and filling background operations.
- Failures to drain or fill the node should not be fatal. In such cases, cluster restarts should proceed as usual (with downtime).
- Progress of draining/filling is visible through metrics
Non Goals
- Integration with the control plane
- Graceful restarts for large non-HA tenants.
Impacted Components
- storage controller
- deployment orchestrator (i.e. Ansible)
- pageserver (indirectly)
Terminology
**Draining** is the process through which all tenant shards that can be migrated away from a given pageserver are distributed across the rest of the cluster.
**Filling** is the symmetric opposite of draining. In this process tenant shards are migrated onto a given pageserver until the cluster reaches a reasonable, quiescent distribution of tenant shards across pageservers.
**Node scheduling policies** act as constraints on the scheduler. For instance, when a node is set to the Pause policy, no further shards will be scheduled on it.
**Node** is a pageserver. The two terms are used interchangeably in this RFC.
**Deployment orchestrator** is a generic term for whatever drives our deployments. Currently, it's an Ansible playbook.
Background
Storage Controller Basics (skip if already familiar)
Fundamentally, the storage controller is a reconciler which aims to move from the observed mapping between pageservers and tenant shards to an intended mapping. Pageserver node and tenant shard metadata is durably persisted in a database, but note that the mapping between the two entities is not. Instead, this mapping (the observed state) is constructed at startup by sending GET location_config requests to registered pageservers.
An internal scheduler maps tenant shards to pageservers while respecting certain constraints. The result of scheduling is the intent state. When the intent state changes, a reconciliation informs pageservers about the new assignment via PUT location_config requests and notifies the compute via the configured hook.
Background Optimizations
The storage controller performs scheduling optimizations in the background. It will migrate attachments to warm secondaries and replace secondaries in order to balance the cluster out.
Reconciliations Concurrency Limiting
There's a hard limit on the number of reconciles that the storage controller can have in flight at any given time. To give an idea of scale, the limit is 128 at the time of writing.
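To make the mechanism concrete, below is a minimal sketch of how such a cap can be enforced with a tokio semaphore. The constant and the `spawn_reconcile` helper are illustrative only and do not mirror the controller's actual code.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Illustrative cap mirroring the figure quoted above.
const MAX_IN_FLIGHT_RECONCILES: usize = 128;

/// Spawn a reconciliation only once a permit is available, bounding the
/// number of reconciles that can run concurrently.
async fn spawn_reconcile<F>(limiter: Arc<Semaphore>, reconcile: F)
where
    F: std::future::Future<Output = ()> + Send + 'static,
{
    let permit = limiter
        .acquire_owned()
        .await
        .expect("semaphore is never closed");
    tokio::spawn(async move {
        let _permit = permit; // dropped (released) when the reconcile finishes
        reconcile.await;
    });
}

fn main() {
    let limiter = Arc::new(Semaphore::new(MAX_IN_FLIGHT_RECONCILES));
    // `spawn_reconcile(limiter.clone(), some_reconcile_future)` would then be
    // called from the reconciliation loop.
    let _ = limiter;
}
```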
Implementation
Note: this section focuses on the core functionality of the graceful restart process. It doesn't necessarily describe the most efficient approach. Optimizations are described separately in a later section.
Overall Flow
This section describes how to implement graceful restarts from the perspective of Ansible, the deployment orchestrator. Pageservers are already restarted sequentially. The orchestrator shall implement the following prologue and epilogue steps for each pageserver restart:
Prologue
The orchestrator shall first fetch the pageserver node id from the control plane or
directly from the pageserver it aims to restart. Next, it issues an HTTP request
to the storage controller in order to start the drain of said pageserver node.
All error responses are retried with a short back-off. When a 202 (Accepted)
HTTP code is returned, the drain has started. Now the orchestrator polls the
node status endpoint exposed by the storage controller in order to await the
end of the drain process. When the policy field of the node status response
becomes PauseForRestart, the drain has completed and the orchestrator can
proceed with restarting the pageserver.
The prologue is subject to an overall timeout. It will have a value in the ballpark of minutes. As storage controller managed pageservers become more loaded this timeout will likely have to increase.
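For illustration, here is a hedged sketch of the prologue written as a small Rust routine (the real orchestrator is an Ansible playbook). It assumes the reqwest (blocking + json features), serde_json and anyhow crates, illustrative timeout constants, and a node status endpoint at GET /v1/control/node/:node_id whose response carries the scheduling policy in a "scheduling" field; the status path and field name are assumptions, not confirmed API.

```rust
use std::time::{Duration, Instant};

use reqwest::blocking::Client;
use reqwest::StatusCode;

// Illustrative values; the real timeout is "in the ballpark of minutes".
const DRAIN_TIMEOUT: Duration = Duration::from_secs(10 * 60);
const POLL_INTERVAL: Duration = Duration::from_secs(5);

fn drain_node(client: &Client, storcon: &str, node_id: u64) -> anyhow::Result<()> {
    let deadline = Instant::now() + DRAIN_TIMEOUT;

    // Start the drain; all error responses are retried with a short back-off.
    loop {
        let resp = client
            .put(format!("{storcon}/v1/control/node/{node_id}/drain"))
            .send()?;
        if resp.status() == StatusCode::ACCEPTED {
            break;
        }
        if Instant::now() > deadline {
            anyhow::bail!("drain request not accepted before timeout");
        }
        std::thread::sleep(POLL_INTERVAL);
    }

    // Poll the (assumed) node status endpoint until the drain completes.
    while Instant::now() < deadline {
        let status: serde_json::Value = client
            .get(format!("{storcon}/v1/control/node/{node_id}"))
            .send()?
            .json()?;
        if status["scheduling"] == "PauseForRestart" {
            return Ok(()); // safe to restart the pageserver
        }
        std::thread::sleep(POLL_INTERVAL);
    }
    // The caller proceeds with the restart regardless (see Failure Modes).
    anyhow::bail!("drain did not complete before timeout")
}
```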
Epilogue
After restarting the pageserver, the orchestrator issues an HTTP request
to the storage controller to kick off the filling process. This API call
may be retried for all error codes with a short backoff. This also serves
as a synchronization primitive as the fill will be refused if the pageserver
has not yet re-attached to the storage controller. When a 202 (Accepted) HTTP
code is returned, the fill has started. Now the orchestrator polls the node
status endpoint exposed by the storage controller in order to await the end of
the filling process. When the policy field of the node status response becomes
Active, the fill has completed and the orchestrator may proceed to the next pageserver.
Again, the epilogue is subject to an overall timeout. We can start off with using the same timeout as for the prologue, but can also consider relying on the storage controller's background optimizations with a shorter timeout.
In the case that the deployment orchestrator times out, it attempts to cancel
the fill. This operation shall be retried with a short back-off. If it ultimately
fails, manual intervention is required to set the node's scheduling policy to
NodeSchedulingPolicy::Active. Not doing so is not immediately problematic,
but it constrains the scheduler as mentioned previously.
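Continuing the sketch above (same Client, constants, and assumed node status endpoint), the epilogue could look roughly like this; the cancellation path at the end mirrors the previous paragraph.

```rust
fn fill_node(client: &Client, storcon: &str, node_id: u64) -> anyhow::Result<()> {
    let deadline = Instant::now() + DRAIN_TIMEOUT; // same ballpark as the prologue

    // Kick off the fill. A refusal also acts as a synchronization point:
    // the request is rejected until the pageserver has re-attached.
    loop {
        let resp = client
            .put(format!("{storcon}/v1/control/node/{node_id}/fill"))
            .send()?;
        if resp.status() == StatusCode::ACCEPTED {
            break;
        }
        if Instant::now() > deadline {
            anyhow::bail!("fill request not accepted before timeout");
        }
        std::thread::sleep(POLL_INTERVAL);
    }

    // Await completion: the node's policy returns to Active.
    while Instant::now() < deadline {
        let status: serde_json::Value = client
            .get(format!("{storcon}/v1/control/node/{node_id}"))
            .send()?
            .json()?;
        if status["scheduling"] == "Active" {
            return Ok(());
        }
        std::thread::sleep(POLL_INTERVAL);
    }

    // Timed out: try to cancel the fill so the node is not left in Filling.
    for _ in 0..5 {
        let resp = client
            .delete(format!("{storcon}/v1/control/node/{node_id}/fill"))
            .send()?;
        if resp.status().is_success() {
            return Ok(());
        }
        std::thread::sleep(POLL_INTERVAL);
    }
    anyhow::bail!("failed to cancel fill; the scheduling policy may need a manual reset")
}
```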
Node Scheduling Policy State Machine
The state machine below encodes the behaviours discussed above and the various failover situations described in a later section.
Assuming no failures and/or timeouts the flow should be:
Active -> Draining -> PauseForRestart -> Active -> Filling -> Active
Operator requested drain
+-----------------------------------------+
| |
+-------+-------+ +-------v-------+
| | | |
| Pause | +-----------> Draining +----------+
| | | | | |
+---------------+ | +-------+-------+ |
| | |
| | |
Drain requested| | |
| |Drain complete | Drain failed
| | | Cancelled/PS reattach/Storcon restart
| | |
+-------+-------+ | |
| | | |
+-------------+ Active <-----------+------------------+
| | | |
Fill requested | +---^---^-------+ |
| | | |
| | | |
| | | |
| Fill completed| | |
| | |PS reattach |
| | |after restart |
+-------v-------+ | | +-------v-------+
| | | | | |
| Filling +---------+ +-----------+PauseForRestart|
| | | |
+---------------+ +---------------+
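The diagram can also be read as a transition table. The sketch below encodes it in Rust; NodeSchedulingPolicy exists in the storage controller, but the Event enum and the next_policy helper here are purely illustrative.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum NodeSchedulingPolicy {
    Active,
    Pause,
    Draining,
    PauseForRestart,
    Filling,
}

/// Events that move a node between scheduling policies.
enum Event {
    DrainRequested,
    DrainComplete,
    FillRequested,
    FillComplete,
    /// Drain/fill failed or was cancelled, the pageserver re-attached,
    /// or the storage controller restarted.
    Interrupted,
}

fn next_policy(current: NodeSchedulingPolicy, event: Event) -> NodeSchedulingPolicy {
    use NodeSchedulingPolicy::*;
    match (current, event) {
        (Active | Pause, Event::DrainRequested) => Draining,
        (Draining, Event::DrainComplete) => PauseForRestart,
        (Active, Event::FillRequested) => Filling,
        (Filling, Event::FillComplete) => Active,
        // Every failure path collapses back to the neutral state.
        (_, Event::Interrupted) => Active,
        // Anything else leaves the policy unchanged.
        (p, _) => p,
    }
}
```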
Draining/Filling APIs
The storage controller API to trigger the draining of a given node is:
PUT /v1/control/node/:node_id/{drain,fill}.
The following HTTP non-success return codes are used. All of them are safely retriable from the perspective of the storage controller.
- 404: Requested node was not found
- 503: Requested node is known to the storage controller, but unavailable
- 412: Drain precondition failed: there is no other node to drain to, or the node's scheduling policy forbids draining
- 409: A {drain, fill} is already in progress. Only one such background operation is allowed per node.
When the {drain, fill} is accepted and commenced, a 202 HTTP code is returned.
Drains and fills shall be cancellable by the deployment orchestrator or a
human operator via: DELETE /v1/control/node/:node_id/{drain,fill}. A 200
response is returned when the cancellation is successful. Errors are retriable.
Drain Process
Before accepting a drain request, the following validations are applied:
- Ensure that the node is known to the storage controller
- Ensure that the node's scheduling policy is NodeSchedulingPolicy::Active or NodeSchedulingPolicy::Pause
- Ensure that another drain or fill is not already running on the node
- Ensure that a drain is possible (i.e. check that there is at least one schedulable node to drain to)
After accepting the drain, the scheduling policy of the node is set to
NodeSchedulingPolicy::Draining and persisted in both memory and the database.
This prevents the background optimizer from adding or removing shards on the node, which is desirable to avoid the drain and the optimizer racing.
Next, a separate Tokio task is spawned to manage the draining. For each tenant shard attached to the node being drained, demote the attached location to a secondary and attempt to schedule the shard onto another node. Scheduling might fail due to unsatisfiable constraints, but that is fine. Draining is a best-effort process since it might not always be possible to cut over all shards.
Importantly, this task manages the concurrency of issued reconciles in order to avoid drowning out the target pageservers and to allow other important reconciles to proceed.
Once the triggered reconciles have finished or timed out, set the node's scheduling
policy to NodeSchedulingPolicy::PauseForRestart to signal the end of the drain.
A note on non-HA tenants: these tenants do not have secondaries, so by the description above they would not be migrated. It makes sense to skip them (especially the large ones) because, depending on tenant size, migration might be more disruptive than the restart itself: the pageserver the tenant is moved to would need to on-demand download its entire working set. We can consider expanding to small non-HA tenants in the future.
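A toy sketch of the shard selection this implies, under heavily simplified assumptions: attachments and secondaries are plain maps, and the scheduler's actual scoring (affinity, utilisation, warmth) is ignored. The types and the plan_drain helper are illustrative, not the controller's real data model.

```rust
use std::collections::HashMap;

type NodeId = u64;
type TenantShardId = u32; // stand-in for the real shard id type

/// One planned cutover: demote the attachment on `from` to a secondary and
/// attach the shard on `to` instead.
struct Migration {
    shard: TenantShardId,
    from: NodeId,
    to: NodeId,
}

/// Best-effort drain plan for `node`: every shard attached to it that has a
/// secondary on some other node is cut over to that secondary. Shards without
/// such a secondary (e.g. non-HA tenants) are skipped, as discussed above.
fn plan_drain(
    node: NodeId,
    attached: &HashMap<TenantShardId, NodeId>,
    secondaries: &HashMap<TenantShardId, Vec<NodeId>>,
) -> Vec<Migration> {
    attached
        .iter()
        .filter(|(_, attached_on)| **attached_on == node)
        .filter_map(|(&shard, _)| {
            secondaries
                .get(&shard)?
                .iter()
                .find(|&&s| s != node)
                .map(|&to| Migration { shard, from: node, to })
        })
        .collect()
}
```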
Fill Process
Before accepting a fill request, the following validations are applied:
- Ensure that the node is known to the storage controller
- Ensure that the node's scheduling policy is NodeSchedulingPolicy::Active. This is the only acceptable starting state for a fill. When a node re-attaches, the storage controller sets its scheduling policy to NodeSchedulingPolicy::Active if it was previously NodeSchedulingPolicy::PauseForRestart or NodeSchedulingPolicy::Draining (the possible end states of a node drain).
- Ensure that another drain or fill is not already running on the node
After accepting the fill, the scheduling policy of the node is set to
NodeSchedulingPolicy::Filling and persisted in both memory and the database.
This prevents the background optimizer from adding or removing shards on the node, which is desirable to avoid the fill and the optimizer racing.
Next, a separate Tokio task is spawned to manage the filling. For each tenant shard for which the filled node is a secondary, promote the secondary. This is done until we run out of such shards or the counts of attached shards become balanced across the cluster.
Like for draining, the concurrency of spawned reconciles is limited.
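A simplified sketch of the stopping condition, i.e. how many attachments the filled node should receive before attached shard counts are considered balanced. The real controller also has to respect secondary placement and the warmth optimization described later; fill_requirement is an illustrative name, not the controller's API.

```rust
use std::collections::HashMap;

type NodeId = u64;

/// Number of shard attachments the node being filled should gain so that
/// attached counts end up roughly even across schedulable nodes.
fn fill_requirement(attached_counts: &HashMap<NodeId, usize>, fill_node: NodeId) -> usize {
    let total: usize = attached_counts.values().sum();
    let nodes = attached_counts.len().max(1);
    let even_share = total / nodes; // target per node, rounded down
    let already_attached = attached_counts.get(&fill_node).copied().unwrap_or(0);
    even_share.saturating_sub(already_attached)
}

fn main() {
    // Example: three nodes, the freshly restarted one holds nothing yet.
    let counts = HashMap::from([(1, 60), (2, 60), (3, 0)]);
    assert_eq!(fill_requirement(&counts, 3), 40); // promote 40 secondaries onto node 3
}
```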
Failure Modes & Handling
Failures are generally handled by transitioning back into the Active
(neutral) state. This simplifies the implementation greatly at the
cost of adding transitions to the state machine. For example, we
could detect the Draining state upon restart and resume the drain,
but how would the storage controller know that this is still what
the orchestrator wants?
Storage Controller Crash
When the storage controller starts up, it resets the scheduling policy
of all nodes in the Draining, Filling or PauseForRestart states to
Active. The rationale is that when the storage controller restarts,
we have lost the context of what the deployment orchestrator wants. It also
has the benefit of making things easier to reason about.
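Sketched below using the NodeSchedulingPolicy enum from the earlier state machine sketch; the map of node policies is an assumed stand-in for the controller's in-memory and persisted node state.

```rust
use std::collections::HashMap;

fn reset_policies_on_startup(nodes: &mut HashMap<u64, NodeSchedulingPolicy>) {
    use NodeSchedulingPolicy::*;
    for policy in nodes.values_mut() {
        // A drain or fill that was in flight when the controller stopped has
        // lost its context, so put the node back into the neutral state.
        if matches!(*policy, Draining | Filling | PauseForRestart) {
            *policy = Active;
        }
    }
}
```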
Pageserver Crash During Drain
The pageserver will attempt to re-attach during restart at which
point the node scheduling policy will be set back to Active, thus
reenabling the scheduler to use the node.
Non-drained Pageserver Crash During Drain
What should happen when a pageserver we are draining to crashes during the process? Two reasonable options are: cancel the drain and focus on the failover, or do both but prioritise the failover. Since the number of concurrent reconciles produced by drains/fills is limited, we get the latter behaviour for free. My suggestion is to take this approach, but the cancellation option is trivial to implement as well.
Pageserver Crash During Fill
The pageserver will attempt to re-attach during restart at which
point the node scheduling policy will be set back to Active, thus
reenabling the scheduler to use the node.
Pageserver Goes Unavailable During Drain/Fill
The drain and fill jobs handle this by stopping early. When the pageserver
is detected as online again by the storage controller's heartbeats, its scheduling
policy is reset to Active. If a restart happens instead, see the pageserver crash
failure mode.
Orchestrator Drain Times Out
Orchestrator will still proceed with the restart.
When the pageserver re-attaches, the scheduling policy is set back to
Active.
Orchestrator Fill Times Out
Orchestrator will attempt to cancel the fill operation. If that fails,
the fill will continue until it quiesces and the node will be left
in the Filling scheduling policy. This hinders the scheduler, but is
otherwise harmless. A human operator can handle this by setting the scheduling
policy to Active, or we can bake a fill timeout into the storage controller.
Optimizations
Location Warmth
When cutting over to a secondary, the storage controller will wait for it to become "warm" (i.e. to have downloaded enough of the tenant's data). This means that some reconciliations can take significantly longer than others and hold up precious reconciliation units. As an optimization, the drain stage can cut over only those tenant shards whose secondaries are already "warm". Similarly, the fill stage can prioritise the "warmest" tenants.
Given that the number of tenants managed by the storage controller will be fairly low for the foreseeable future, the first implementation can simply query each tenant's secondary status. This doesn't scale well with increasing tenant counts, so eventually we will need new pageserver API endpoints to report the sets of "warm" and "cold" secondary locations.
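For illustration, a possible warmth heuristic over whatever the secondary status reports; the struct fields and the 80% threshold are assumptions, not the pageserver's actual API.

```rust
/// Hypothetical shape of a secondary location's download progress.
struct SecondaryProgress {
    heatmap_layers_total: u64,
    heatmap_layers_downloaded: u64,
}

/// Treat a secondary as "warm" once most of its heatmap has been downloaded.
fn is_warm(progress: &SecondaryProgress, threshold: f64) -> bool {
    if progress.heatmap_layers_total == 0 {
        return true; // nothing to download yet; don't block the drain on it
    }
    let ratio = progress.heatmap_layers_downloaded as f64 / progress.heatmap_layers_total as f64;
    ratio >= threshold
}

fn main() {
    let progress = SecondaryProgress {
        heatmap_layers_total: 1000,
        heatmap_layers_downloaded: 850,
    };
    assert!(is_warm(&progress, 0.8)); // 85% downloaded: eligible for cutover
}
```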
Alternatives Considered
Draining and Filling Purely as Scheduling Constraints
At its core, the storage controller is a big background loop that detects changes in the environment and reacts to them. One could express draining and filling of nodes purely in terms of constraining the scheduler (as opposed to having dedicated background tasks).
While theoretically nice, I think that's harder to implement and, more importantly, to operate and reason about. Consider cancellation of a drain/fill operation. We would have to update the scheduler state, create an entirely new schedule (intent state) and start work on applying it. It gets trickier if we wish to cancel the reconciliation tasks spawned by a drain/fill. How would we know which ones belong to the conceptual drain/fill? One could add labels to reconciliations, but it gets messy in my opinion.
It would also mean that reconciliations themselves have side effects that persist in the database (e.g. persisting something when the drain is done), which I'm not conceptually fond of.
Proof of Concept
This RFC is accompanied by a POC which implements nearly everything mentioned here apart from the optimizations and some of the failure handling: https://github.com/neondatabase/neon/pull/7682