# Graceful Restarts of Storage Controller Managed Clusters

## Summary

This RFC describes new storage controller APIs for draining and filling tenant shards from/on pageserver nodes.
It also covers how these new APIs should be used by an orchestrator (e.g. Ansible) in order to implement
graceful cluster restarts.

## Motivation

Pageserver restarts cause read availability downtime for tenants.

For example, pageserver-3 @ us-east-1 was unavailable for a randomly
picked tenant (which requested on-demand activation) for around 30 seconds
during the restart at 2024-04-03 16:37 UTC.

Note that many shutdowns on loaded pageservers do not finish within the
[10 second systemd enforced timeout](https://github.com/neondatabase/infra/blob/0a5280b383e43c063d43cbf87fa026543f6d6ad4/.github/ansible/systemd/pageserver.service#L16). This means we are shutting down without flushing ephemeral layers
and have to reingest data in order to serve requests after restarting, potentially making first request latencies worse.

This problem is not yet very acutely felt on storage controller managed pageservers since
tenant density is much lower there. However, we are planning on eventually migrating all
pageservers to storage controller management, so it makes sense to solve the issue proactively.

## Requirements

- Pageserver re-deployments cause minimal downtime for tenants
- The storage controller exposes HTTP API hooks for draining and filling tenant shards
  from a given pageserver. Said hooks can be used by an orchestrator process or a human operator.
- The storage controller exposes an HTTP API to cancel draining and filling background operations.
- Failures to drain or fill a node should not be fatal. In such cases, cluster restarts should proceed
  as usual (with downtime).
- Progress of draining/filling is visible through metrics

## Non Goals

- Integration with the control plane
- Graceful restarts for large non-HA tenants

## Impacted Components

- storage controller
- deployment orchestrator (i.e. Ansible)
- pageserver (indirectly)

## Terminology

**Draining** is the process through which all tenant shards that can be migrated from a given pageserver
are distributed across the rest of the cluster.

**Filling** is the symmetric opposite of draining. In this process tenant shards are migrated onto a given
pageserver until the cluster reaches a reasonable, quiescent distribution of tenant shards across pageservers.

**Node scheduling policies** act as constraints on the scheduler. For instance, when a
node is set to the `Pause` policy, no further shards will be scheduled on it.

**Node** means a pageserver. The two terms are used interchangeably in this RFC.

**Deployment orchestrator** is a generic term for whatever drives our deployments.
Currently, it's an Ansible playbook.

## Background

### Storage Controller Basics (skip if already familiar)

Fundamentally, the storage controller is a reconciler which aims to move from the observed mapping between pageservers and tenant shards to an intended mapping. Pageserver node and tenant shard metadata is durably persisted in a database, but note that the mapping between the two entities is not durably persisted. Instead, this mapping (*observed state*) is constructed at startup by sending `GET location_config` requests to registered pageservers.

An internal scheduler maps tenant shards to pageservers while respecting certain constraints. The result of scheduling is the *intent state*. When the intent state changes, a *reconciliation* will inform pageservers about the new assignment via `PUT location_config` requests and will notify the compute via the configured hook.

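To make the observed/intent split concrete, here is a highly simplified sketch of the data involved. All type and field names are illustrative stand-ins rather than the storage controller's actual code.

```rust
/// Illustrative stand-ins for the storage controller's internal state.
#[derive(Clone, PartialEq, Eq)]
struct ShardPlacement {
    attached: Option<u64>, // pageserver node id holding the attachment
    secondaries: Vec<u64>, // pageserver node ids holding secondaries
}

struct TenantShardState {
    observed: ShardPlacement, // rebuilt at startup from `GET location_config`
    intent: ShardPlacement,   // produced by the internal scheduler
}

impl TenantShardState {
    /// A reconciliation is needed whenever observed and intent diverge; it
    /// would issue `PUT location_config` calls and notify the compute hook.
    fn needs_reconcile(&self) -> bool {
        self.observed != self.intent
    }
}
```
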
### Background Optimizations

The storage controller performs scheduling optimizations in the background. It will
migrate attachments to warm secondaries and replace secondaries in order to balance
the cluster out.

### Reconciliation Concurrency Limiting

There's a hard limit on the number of reconciles that the storage controller
can have in flight at any given time. To give an idea of scale, the limit is
128 at the time of writing.

## Implementation

Note: this section focuses on the core functionality of the graceful restart process.
It doesn't necessarily describe the most efficient approach. Optimizations are described
separately in a later section.

### Overall Flow

This section describes how to implement graceful restarts from the perspective
of Ansible, the deployment orchestrator. Pageservers are already restarted sequentially.
The orchestrator shall implement the following prologue and epilogue steps for each
pageserver restart:

#### Prologue

The orchestrator shall first fetch the pageserver node id from the control plane or
directly from the pageserver it aims to restart. Next, it issues an HTTP request
to the storage controller in order to start the drain of said pageserver node.
All error responses are retried with a short back-off. When a 202 (Accepted)
HTTP code is returned, the drain has started. The orchestrator then polls the
node status endpoint exposed by the storage controller in order to await the
end of the drain process. When the `policy` field of the node status response
becomes `PauseForRestart`, the drain has completed and the orchestrator can
proceed with restarting the pageserver.

The prologue is subject to an overall timeout. It will have a value in the ballpark
of minutes. As storage controller managed pageservers become more loaded, this timeout
will likely have to increase.

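For illustration only, a minimal sketch of the prologue written in Rust rather than Ansible. The drain endpoint matches the API described later in this RFC; the node status path, the response shape and the `reqwest`/`tokio` plumbing are assumptions made for the sketch.

```rust
use std::time::{Duration, Instant};

/// Hypothetical prologue: drain a node and wait until it is ready for restart.
async fn drain_before_restart(
    client: &reqwest::Client,
    storcon: &str, // e.g. "http://storage-controller:1234"
    node_id: u64,
) -> anyhow::Result<()> {
    let deadline = Instant::now() + Duration::from_secs(10 * 60); // overall prologue timeout

    // Kick off the drain; retry error responses with a short back-off until 202 is returned.
    loop {
        let resp = client
            .put(format!("{storcon}/v1/control/node/{node_id}/drain"))
            .send()
            .await;
        match resp {
            Ok(r) if r.status() == reqwest::StatusCode::ACCEPTED => break,
            _ if Instant::now() > deadline => anyhow::bail!("drain did not start in time"),
            _ => tokio::time::sleep(Duration::from_secs(2)).await,
        }
    }

    // Poll the node status endpoint until the scheduling policy becomes `PauseForRestart`.
    // The exact status path is assumed here for illustration.
    loop {
        if Instant::now() > deadline {
            anyhow::bail!("drain did not complete in time; restart proceeds with downtime");
        }
        let status: serde_json::Value = client
            .get(format!("{storcon}/v1/control/node/{node_id}"))
            .send()
            .await?
            .json()
            .await?;
        if status["policy"] == "PauseForRestart" {
            return Ok(());
        }
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
}
```
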
#### Epilogue

After restarting the pageserver, the orchestrator issues an HTTP request
to the storage controller to kick off the filling process. This API call
may be retried for all error codes with a short back-off. This also serves
as a synchronization primitive, since the fill will be refused if the pageserver
has not yet re-attached to the storage controller. When a 202 (Accepted) HTTP
code is returned, the fill has started. The orchestrator then polls the node
status endpoint exposed by the storage controller in order to await the end of
the filling process. When the `policy` field of the node status response becomes
`Active`, the fill has completed and the orchestrator may proceed to the next pageserver.

Again, the epilogue is subject to an overall timeout. We can start off with
the same timeout as for the prologue, but we could also consider relying on
the storage controller's background optimizations with a shorter timeout.

In the case that the deployment orchestrator times out, it attempts to cancel
the fill. This operation shall be retried with a short back-off. If it ultimately
fails, manual intervention is required to set the node's scheduling policy to
`NodeSchedulingPolicy::Active`. Not doing so is not immediately problematic,
but it constrains the scheduler as mentioned previously.

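The cancellation step in the last paragraph could look roughly like the sketch below. The retry loop and the `DELETE` endpoint mirror the API section that follows; the function itself and its bounds are illustrative.

```rust
use std::time::Duration;

/// Hypothetical epilogue fragment: cancel a fill that did not finish in time.
/// Errors are retried with a short back-off, as suggested above.
async fn cancel_fill(
    client: &reqwest::Client,
    storcon: &str,
    node_id: u64,
) -> anyhow::Result<()> {
    for _ in 0..10 {
        let resp = client
            .delete(format!("{storcon}/v1/control/node/{node_id}/fill"))
            .send()
            .await;
        if matches!(&resp, Ok(r) if r.status().is_success()) {
            return Ok(());
        }
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
    // Ultimately failed: a human operator has to reset the node's scheduling
    // policy to `Active` at some point.
    anyhow::bail!("could not cancel fill on node {node_id}")
}
```
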
### Node Scheduling Policy State Machine

The state machine below encodes the behaviours discussed above and
the various failover situations described in a later section.

Assuming no failures and/or timeouts, the flow should be:
`Active -> Draining -> PauseForRestart -> Active -> Filling -> Active`

```
               Operator requested drain
      +-----------------------------------------+
      |                                         |
  +---+-----------+                     +-------v-------+
  |               |                     |               |
  |     Pause     |         +----------->   Draining    +-----------+
  |               |         |           |               |           |
  +---------------+         |           +-------+-------+           |
                            |                   |                   |
            Drain requested |                   |                   |
                            |                   | Drain complete    | Drain failed
                            |                   |                   | Cancelled/PS reattach/Storcon restart
                            |                   |                   |
                    +-------+-------+           |                   |
                    |               |           |                   |
     +--------------+    Active     <-----------+-------------------+
     |              |               |           |
     |              +---^-------^---+           |
     |                  |       |               |
     | Fill requested   |       | PS reattach   |
     |                  |       | after restart |
     |   Fill completed |       |               |
  +--v------------+     |       |       +-------v-------+
  |               |     |       |       |               |
  |    Filling    +-----+       +-------+PauseForRestart|
  |               |                     |               |
  +---------------+                     +---------------+
```

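To make the states concrete, here is a small Rust sketch of the scheduling policy enum and the policies a drain or fill may start from, per the validations below. The helper functions and their names are illustrative only.

```rust
/// Node scheduling policies referenced throughout this RFC.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum NodeSchedulingPolicy {
    Active,
    Pause,
    Draining,
    PauseForRestart,
    Filling,
}

/// Illustrative check: which policies may a drain be started from?
/// Per the drain validations below, only `Active` and `Pause` qualify.
fn may_start_drain(policy: NodeSchedulingPolicy) -> bool {
    matches!(
        policy,
        NodeSchedulingPolicy::Active | NodeSchedulingPolicy::Pause
    )
}

/// Illustrative check: a fill may only start from `Active`.
fn may_start_fill(policy: NodeSchedulingPolicy) -> bool {
    policy == NodeSchedulingPolicy::Active
}
```
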
### Draining/Filling APIs

The storage controller API to trigger the draining or filling of a given node is:
`PUT /v1/control/node/:node_id/{drain,fill}`.

The following non-success HTTP return codes are used.
All of them are safely retriable from the perspective of the storage controller.
- 404: Requested node was not found
- 503: Requested node is known to the storage controller, but unavailable
- 412: Drain precondition failed: there is no other node to drain to or the node's scheduling policy forbids draining
- 409: A {drain, fill} is already in progress. Only one such background operation
  is allowed per node.

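A sketch of how these status codes could map onto an internal error type, using the `http` crate's `StatusCode`. The error enum and its variants are illustrative rather than the storage controller's actual types.

```rust
use http::StatusCode;

/// Illustrative error type for drain/fill request validation.
enum OperationError {
    NodeNotFound,
    NodeUnavailable,
    PreconditionFailed(&'static str),
    AlreadyInProgress,
}

/// Map validation failures to the HTTP codes listed above.
fn status_code(err: &OperationError) -> StatusCode {
    match err {
        OperationError::NodeNotFound => StatusCode::NOT_FOUND,                    // 404
        OperationError::NodeUnavailable => StatusCode::SERVICE_UNAVAILABLE,       // 503
        OperationError::PreconditionFailed(_) => StatusCode::PRECONDITION_FAILED, // 412
        OperationError::AlreadyInProgress => StatusCode::CONFLICT,                // 409
    }
}
```
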
When the drain or fill is accepted and commenced, a 202 HTTP code is returned.

Drains and fills shall be cancellable by the deployment orchestrator or a
human operator via: `DELETE /v1/control/node/:node_id/{drain,fill}`. A 200
response is returned when the cancellation is successful. Errors are retriable.

### Drain Process

Before accepting a drain request, the following validations are applied:
* Ensure that the node is known to the storage controller
* Ensure that the scheduling policy is `NodeSchedulingPolicy::Active` or `NodeSchedulingPolicy::Pause`
* Ensure that another drain or fill is not already running on the node
* Ensure that a drain is possible (i.e. check that there is at least one
  schedulable node to drain to)

After accepting the drain, the scheduling policy of the node is set to
`NodeSchedulingPolicy::Draining` and persisted in both memory and the database.
This prevents the optimizer from adding or removing shards from the node, which
is desirable to avoid the two racing.

Next, a separate Tokio task is spawned to manage the draining. For each tenant
shard attached to the node being drained, demote the node to a secondary and
attempt to schedule the attachment elsewhere. Scheduling might fail due to unsatisfiable
constraints, but that is fine. Draining is a best-effort process, since it might
not always be possible to cut over all shards.

Importantly, this task manages the concurrency of issued reconciles in order to
avoid overwhelming the target pageservers and to allow other important reconciles
to proceed.

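For illustration, a minimal sketch of what the drain task's main loop could look like, with a semaphore bounding in-flight reconciles. Names like `TenantShardId` and `demote_and_reschedule` stand in for storage controller internals and are assumptions, not the actual implementation.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Placeholder shard identifier standing in for storage controller internals.
#[derive(Clone, Copy)]
struct TenantShardId(u128);

const DRAIN_RECONCILE_CONCURRENCY: usize = 8; // stays well below the global 128 limit

/// Sketch of the background drain task: demote each attached shard and
/// reschedule it, with a cap on concurrently running reconciles.
async fn drain_node(attached_shards: Vec<TenantShardId>) {
    let permits = Arc::new(Semaphore::new(DRAIN_RECONCILE_CONCURRENCY));
    let mut reconciles = tokio::task::JoinSet::new();

    for shard in attached_shards {
        let permit = permits.clone().acquire_owned().await.expect("semaphore closed");
        reconciles.spawn(async move {
            // Demote the draining node to a secondary for this shard and let the
            // scheduler pick a new attached location. Failure is acceptable:
            // draining is best effort.
            if let Err(e) = demote_and_reschedule(shard).await {
                tracing::info!("skipping shard during drain: {}", e);
            }
            drop(permit); // release the concurrency slot
        });
    }

    // Wait for all spawned reconciles to finish (or time out internally).
    while reconciles.join_next().await.is_some() {}
}

/// Hypothetical helper: demotes the node and triggers a reconcile for `shard`.
async fn demote_and_reschedule(_shard: TenantShardId) -> anyhow::Result<()> {
    Ok(()) // placeholder
}
```
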
Once the triggered reconciles have finished or timed out, the node's scheduling
policy is set to `NodeSchedulingPolicy::PauseForRestart` to signal the end of the drain.

A note on non-HA tenants: these tenants do not have secondaries, so by the description
above, they would not be migrated. It makes sense to skip them (especially the large ones)
since, depending on tenant size, this might be more disruptive than the restart: the
pageserver the tenant is moved to would need to on-demand download its entire working set.
We can consider expanding to small non-HA tenants in the future.

### Fill Process

Before accepting a fill request, the following validations are applied:
* Ensure that the node is known to the storage controller
* Ensure that the scheduling policy is `NodeSchedulingPolicy::Active`.
  This is the only acceptable starting policy for a fill. When a node re-attaches,
  its scheduling policy is set to `NodeSchedulingPolicy::Active` if it was equal to
  `NodeSchedulingPolicy::PauseForRestart` or `NodeSchedulingPolicy::Draining` (the possible end states for a node drain).
* Ensure that another drain or fill is not already running on the node

After accepting the fill, the scheduling policy of the node is set to
`NodeSchedulingPolicy::Filling` and persisted in both memory and the database.
This prevents the optimizer from adding or removing shards from the node, which
is desirable to avoid the two racing.

Next, a separate Tokio task is spawned to manage the filling. For each tenant
shard where the filled node is a secondary, promote the secondary. This is done
until we run out of shards or the counts of attached shards become balanced across
the cluster.

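As an illustration of the "balanced" stopping condition, here is a rough sketch of computing how many attachments to promote onto the filled node. Bringing the node up to the mean attached-shard count across nodes is an assumption about what "balanced" could mean, not a prescribed algorithm.

```rust
/// Rough sketch: how many shards should a fill promote onto `node_id`?
/// Aim for the mean attached-shard count across all schedulable nodes.
fn fill_target(attached_counts: &[(u64, usize)], node_id: u64) -> usize {
    let total: usize = attached_counts.iter().map(|(_, c)| c).sum();
    let mean = total / attached_counts.len().max(1);

    let current = attached_counts
        .iter()
        .find(|(id, _)| *id == node_id)
        .map(|(_, c)| *c)
        .unwrap_or(0);

    // Promote at most enough secondaries to reach the mean.
    mean.saturating_sub(current)
}

fn main() {
    // Example: three nodes; the freshly restarted node 3 has no attachments yet.
    let counts = vec![(1, 30), (2, 30), (3, 0)];
    assert_eq!(fill_target(&counts, 3), 20);
    println!("promote up to {} shards", fill_target(&counts, 3));
}
```
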
Like for draining, the concurrency of spawned reconciles is limited.

### Failure Modes & Handling

Failures are generally handled by transitioning back into the `Active`
(neutral) state. This simplifies the implementation greatly at the
cost of adding transitions to the state machine. For example, we
could detect the `Draining` state upon restart and proceed with the drain,
but how would the storage controller know whether that is still what the
orchestrator needs?

#### Storage Controller Crash

When the storage controller starts up, it resets the node scheduling policy
of all nodes in states `Draining`, `Filling` or `PauseForRestart` to
`Active`. The rationale is that when the storage controller restarts,
we have lost the context of what the deployment orchestrator wants. It also
has the benefit of making things easier to reason about.

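A minimal sketch of this startup reset, repeating the policy enum from the earlier sketch so the snippet stands alone; the `nodes` map stands in for the controller's real in-memory state and persistence layer.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq)]
enum NodeSchedulingPolicy {
    Active,
    Pause,
    Draining,
    PauseForRestart,
    Filling,
}

/// On startup, any node left in a drain/fill related state goes back to `Active`.
fn reset_policies_on_startup(nodes: &mut HashMap<u64, NodeSchedulingPolicy>) {
    for policy in nodes.values_mut() {
        if matches!(
            policy,
            NodeSchedulingPolicy::Draining
                | NodeSchedulingPolicy::Filling
                | NodeSchedulingPolicy::PauseForRestart
        ) {
            *policy = NodeSchedulingPolicy::Active;
            // In the real system this change would also be persisted to the database.
        }
    }
}
```
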
#### Pageserver Crash During Drain

The pageserver will attempt to re-attach during restart, at which
point the node scheduling policy will be set back to `Active`, thus
re-enabling the scheduler to use the node.

#### Non-drained Pageserver Crash During Drain

What should happen when a pageserver we are draining to crashes during the
process? Two reasonable options are: cancel the drain and focus on the failover,
*or* do both but prioritise the failover. Since the number of concurrent reconciles
produced by drains/fills is limited, we get the latter behaviour for free.
My suggestion is that we take this approach, but the cancellation option is trivial
to implement as well.

#### Pageserver Crash During Fill

The pageserver will attempt to re-attach during restart, at which
point the node scheduling policy will be set back to `Active`, thus
re-enabling the scheduler to use the node.

#### Pageserver Becomes Unavailable During Drain/Fill

The drain and fill jobs handle this by stopping early. When the pageserver
is detected as online again by the storage controller's heartbeats, its scheduling
policy is reset to `Active`. If a restart happens instead, see the pageserver crash
failure modes above.

#### Orchestrator Drain Times Out

The orchestrator will still proceed with the restart.
When the pageserver re-attaches, the scheduling policy is set back to
`Active`.

#### Orchestrator Fill Times Out

The orchestrator will attempt to cancel the fill operation. If that fails,
the fill will continue until it quiesces and the node will be left
in the `Filling` scheduling policy. This hinders the scheduler, but is
otherwise harmless. A human operator can handle this by setting the scheduling
policy to `Active`, or we can bake a fill timeout into the storage controller.

## Optimizations

### Location Warmth

When cutting over to a secondary, the storage controller will wait for it to
become "warm" (i.e. to have downloaded enough of the tenant's data). This means that some
reconciliations can take significantly longer than others and hold up precious
reconciliation units. As an optimization, the drain stage can cut over only
tenants that are already "warm". Similarly, the fill stage can prioritise the
"warmest" tenants in the fill.

Given that the number of tenants managed by the storage controller will be fairly low
for the foreseeable future, the first implementation could simply query the tenants
for secondary status. This doesn't scale well with increasing tenant counts, so
eventually we will need new pageserver API endpoints to report the sets of
"warm" and "cold" secondary locations.

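A sketch of what a per-location warmth check could look like, assuming a hypothetical `SecondaryStatus` with downloaded/total byte counts reported by the pageserver; the 80% threshold and the field names are illustrative.

```rust
/// Hypothetical per-location download progress as reported by a pageserver.
struct SecondaryStatus {
    bytes_downloaded: u64,
    bytes_total: u64,
}

/// Treat a secondary as "warm" once it has most of the tenant's data on disk.
/// The 80% threshold is an arbitrary illustrative choice.
fn is_warm(status: &SecondaryStatus) -> bool {
    if status.bytes_total == 0 {
        return true; // nothing to download
    }
    status.bytes_downloaded * 100 >= status.bytes_total * 80
}

fn main() {
    let cold = SecondaryStatus { bytes_downloaded: 10, bytes_total: 100 };
    let warm = SecondaryStatus { bytes_downloaded: 95, bytes_total: 100 };
    assert!(!is_warm(&cold));
    assert!(is_warm(&warm));
}
```
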
## Alternatives Considered

### Draining and Filling Purely as Scheduling Constraints

At its core, the storage controller is a big background loop that detects changes
in the environment and reacts to them. One could express draining and filling
of nodes purely in terms of constraining the scheduler (as opposed to having
such background tasks).

While theoretically nice, I think that's harder to implement and, more importantly, to operate and reason about.
Consider cancellation of a drain/fill operation. We would have to update the scheduler state, create
an entirely new schedule (intent state) and start work on applying it. It gets trickier if we wish
to cancel the reconciliation tasks spawned by node drains/fills. How would we know which ones belong
to the conceptual drain/fill? One could add labels to reconciliations, but it gets messy in my opinion.

It would also mean that reconciliations themselves have side effects that persist in the database
(persist something to the database when the drain is done), which I'm not conceptually fond of.

## Proof of Concept

This RFC is accompanied by a POC which implements nearly everything mentioned here
apart from the optimizations and some of the failure handling:
https://github.com/neondatabase/neon/pull/7682