# Seamless tenant migration

- Author: john@neon.tech
- Created on 2023-08-11
- Implemented on ..

## Summary

The preceding [generation numbers RFC](025-generation-numbers.md) may be thought of as "making tenant
migration safe". Following that, this RFC is about how those migrations are to be done:

1. Seamlessly (without interruption to client availability)
2. Quickly (enabling faster operations)
3. Efficiently (minimizing I/O and $ cost)

These points are in priority order: if we have to sacrifice
efficiency to make a migration seamless for clients, we will do so, etc.

This is accomplished by introducing two high level changes:

- A dual-attached state for tenants, used in a control-plane-orchestrated
  migration procedure that preserves availability during a migration.
- Warm secondary locations for tenants, where on-disk content is primed
  for a fast migration of the tenant from its current attachment to this
  secondary location.

## Motivation

Migrating tenants between pageservers is essential to operating a service
at scale, in several contexts:

1. Responding to a pageserver node failure by migrating tenants to other pageservers
2. Balancing load and capacity across pageservers, for example when a user expands their
   database and it needs to be migrated to a pageserver with more capacity.
3. Restarting pageservers for upgrades and maintenance

The current steps for migration are:

- detach from the old node; skip if the old node is dead (the [skip part is still WIP](https://github.com/neondatabase/cloud/issues/5426))
- attach to the new node
- re-configure endpoints to use the new node

Once [generation numbers](025-generation-numbers.md) are implemented,
the detach step is no longer critical for correctness. So, we can:

- attach to a new node,
- re-configure endpoints to use the new node, and then
- detach from the old node.

However, this still does not meet our seamless/fast/efficient goals:

- Not fast: the new node will have to download potentially large amounts
  of data from S3, which may take many minutes.
- Not seamless: if we attach to a new pageserver before detaching the old one,
  the new one might delete objects that the old one still depends on, interrupting
  availability of reads on the old one.
- Not efficient: the old pageserver will continue uploading
  S3 content during the migration that will never be read.

The user expectations for availability are:

- For planned maintenance, there should be zero availability
  gap. This expectation is fulfilled by this RFC.
- For unplanned changes (e.g. node failures), there should be a
  minimal availability gap. This RFC provides the _mechanism_
  to fail over quickly, but does not provide the failure _detection_
  or the failover _policy_.

## Non Goals

- Defining service tiers with different storage strategies: the same
  level of HA & overhead will apply to all tenants. This doesn't rule out
  adding such tiers in future.
- Enabling pageserver failover in the absence of a control plane: the control
  plane will remain the source of truth for what should be attached where.
- Totally avoiding availability gaps on unplanned migrations during
  a failure (we expect a small, bounded window of
  read unavailability of very recent LSNs).
- Workload balancing: this RFC defines the mechanism for moving tenants
  around, not the higher level logic for deciding who goes where.
- Defining all possible configuration flows for tenants: the migration process
  defined in this RFC demonstrates the sufficiency of the pageserver API, but
  is not the only kind of configuration change the control plane will ever do.
  The APIs defined here should let the control plane move tenants around in
  whatever way is needed while preserving data safety and read availability.

## Impacted components

Pageserver, control plane

## Terminology

- **Attachment**: a tenant is _attached_ to a pageserver if it has
  been issued a generation number, is running an instance of
  the `Tenant` type, is ingesting the WAL, and is available to serve
  page reads.
- **Location**: locations are a superset of attachments. A location
  is a combination of a tenant and a pageserver. We may _attach_ at a _location_.
- **Secondary location**: a location which is not currently attached.
- **Warm secondary location**: a location which is not currently attached, but is endeavoring to maintain a warm local cache of layers. We avoid calling this a _warm standby_ to avoid confusion with similar postgres features.

## Implementation (high level)
### Warm secondary locations

To enable faster migrations, we will identify at least one _secondary location_
for each tenant. This secondary location will keep a warm cache of layers
for the tenant, so that if it is later attached, it can catch up with the
latest LSN quickly: rather than downloading everything, it only has to replay
the recent part of the WAL to advance from the `remote_consistent_lsn` to the
most recent LSN in the WAL.

The control plane is responsible for selecting secondary locations, and
calling into pageservers to configure tenants into a secondary mode at this
new location, as well as attaching the tenant in its existing primary location.

The attached pageserver for a tenant will publish a [layer heatmap](#layer-heatmap)
to advise secondaries of which layers should be downloaded.

### Location modes

Currently, we consider a tenant to be in one of two states on a pageserver:

- Attached: active `Tenant` object, and layers on local disk
- Detached: no layers on local disk, no runtime state.

We will extend this with finer-grained modes, whose purpose will become
clear in later sections:

- **AttachedSingle**: equivalent to the existing attached state.
- **AttachedMulti**: like AttachedSingle, holds an up-to-date generation, but
  does not do deletions.
- **AttachedStale**: like AttachedSingle, but holds a stale generation and
  does not do any remote storage operations.
- **Secondary**: keep local state on disk, periodically update from S3.
- **Detached**: equivalent to the existing detached state.

To control these finer-grained states, a new pageserver API endpoint will be added.
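
For concreteness, here is a minimal sketch of how these modes and their
remote-storage permissions might be modelled inside the pageserver. The names
`LocationMode` and `LocationConf` are illustrative, not the final types; the
permission rules follow the AttachedMulti/AttachedStale behavior described
later in this RFC.

```rust
// Illustrative sketch only: names are hypothetical, not the pageserver's
// actual types.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LocationMode {
    Detached,
    Secondary,
    AttachedSingle,
    AttachedMulti,
    AttachedStale,
}

#[derive(Clone, Copy, Debug)]
pub struct LocationConf {
    pub mode: LocationMode,
    /// Present only in attached modes; issued by the control plane.
    pub generation: Option<u32>,
}

impl LocationConf {
    /// Remote deletions are only allowed for the single, up-to-date attachment.
    pub fn may_delete_remote_layers(&self) -> bool {
        self.mode == LocationMode::AttachedSingle
    }

    /// Uploads continue in AttachedMulti but stop once the generation is stale.
    pub fn may_upload(&self) -> bool {
        matches!(
            self.mode,
            LocationMode::AttachedSingle | LocationMode::AttachedMulti
        )
    }
}
```
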
### Cutover procedure

Define the old location and new location as "Node A" and "Node B". Consider
the case where both nodes are available, and Node B was previously configured
as a secondary location for the tenant we are migrating.

The cutover procedure is orchestrated by the control plane, calling into
the pageservers' APIs:

1. Call to Node A requesting it to flush to S3 and enter the AttachedStale state
2. Increment the generation, and call to Node B requesting it to enter the AttachedMulti
   state with the new generation.
3. Call to Node B, requesting it to download the latest hot layers from remote storage,
   according to the latest heatmap flushed by Node A.
4. Wait for Node B's WAL ingestion to catch up with Node A's
5. Update endpoints to use Node B instead of Node A
6. Call to Node B requesting it to enter state AttachedSingle.
7. Call to Node A requesting it to enter state Secondary

The following table summarizes how the state of the system advances:

| Step          | Node A         | Node B         | Node used by endpoints |
| :-----------: | :------------: | :------------: | :--------------------: |
| 1 (_initial_) | AttachedSingle | Secondary      | A                      |
| 2             | AttachedStale  | AttachedMulti  | A                      |
| 3             | AttachedStale  | AttachedMulti  | A                      |
| 4             | AttachedStale  | AttachedMulti  | A                      |
| 5 (_cutover_) | AttachedStale  | AttachedMulti  | B                      |
| 6             | AttachedStale  | AttachedSingle | B                      |
| 7 (_final_)   | Secondary      | AttachedSingle | B                      |
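
Purely for illustration, the call sequence above can be written out against a
hypothetical pageserver client interface as below. In practice this runs as a
persistent control plane `Operation` (see [Executing migrations](#executing-migrations));
per-tenant parameters, error handling and retries are omitted.

```rust
// Illustrative sketch only: `Pageserver` and `Mode` are hypothetical
// stand-ins for the HTTP API described in this RFC.

enum Mode { Secondary, AttachedSingle, AttachedMulti, AttachedStale }

trait Pageserver {
    /// PUT to the per-tenant location configuration endpoint.
    fn location_config(&self, mode: Mode, generation: Option<u32>, flush: bool);
    /// Ask the node to download the hot layers named in the latest heatmap.
    fn download_heatmap_layers(&self);
    /// Highest LSN currently visible for reads on this node.
    fn visible_wal(&self) -> u64;
}

fn migrate(node_a: &dyn Pageserver, node_b: &dyn Pageserver, new_generation: u32) {
    // 1. Old node: flush to S3, then stop uploading (AttachedStale).
    node_a.location_config(Mode::AttachedStale, None, true);

    // 2. New node: attach with the incremented generation, deletions disabled.
    node_b.location_config(Mode::AttachedMulti, Some(new_generation), false);

    // 3. Warm up from the heatmap that node A just flushed.
    node_b.download_heatmap_layers();

    // 4. Wait for node B's WAL ingestion to catch up with node A's.
    while node_b.visible_wal() < node_a.visible_wal() {
        std::thread::sleep(std::time::Duration::from_millis(500));
    }

    // 5. (not shown) endpoints are reconfigured to read from node B.

    // 6./7. Finalize: B becomes the sole attachment, A drops to Secondary.
    node_b.location_config(Mode::AttachedSingle, Some(new_generation), false);
    node_a.location_config(Mode::Secondary, None, false);
}
```
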

The procedure described for a clean handover from a live node to a secondary
is also used for failure cases and for migrations to a location that is not
configured as a secondary, by simply skipping irrelevant steps, as described in
the following sections.

#### Migration from an unresponsive node

If Node A is unavailable, then all calls into
Node A are skipped, and we don't wait for B to catch up before
updating the endpoints to use B.

#### Migration to a location that is not a secondary

If Node B is initially in the Detached state, the procedure is identical. Since Node B
is coming from a Detached state rather than Secondary, the download of layers and
catch-up with the WAL will take much longer.

We might do this if:

- Attached and secondary locations are both critically low on disk, and we need
  to migrate to a third node with more resources available.
- We are migrating a tenant which, to save on cost, does not use secondary locations.

#### Permanent migration away from a node

In the final step of the migration, we generally request the original node to enter a Secondary
state. This is typical if we are doing a planned migration during maintenance, or to
balance CPU/network load away from a node.

One might also want to permanently migrate away: this can be done by simply removing the secondary
location after the migration is complete, or, as an optimization, by substituting the Detached state
for the Secondary state in the final step.

#### Cutover diagram

```mermaid
sequenceDiagram
    participant CP as Control plane
    participant A as Node A
    participant B as Node B
    participant E as Endpoint

    CP->>A: PUT Flush & go to AttachedStale
    note right of A: A continues to ingest WAL
    CP->>B: PUT AttachedMulti
    CP->>B: PUT Download layers from latest heatmap
    note right of B: B downloads from S3
    loop Poll until download complete
        CP->>B: GET download status
    end
    activate B
    note right of B: B ingests WAL
    loop Poll until catch up
        CP->>B: GET visible WAL
        CP->>A: GET visible WAL
    end
    deactivate B
    CP->>E: Configure to use Node B
    E->>B: Connect for reads
    CP->>B: PUT AttachedSingle
    CP->>A: PUT Secondary
```

#### Cutover from an unavailable pageserver

This case is far simpler: we may skip straight to our intended
end state.

```mermaid
sequenceDiagram
    participant A as Node A
    participant CP as Control plane
    participant B as Node B
    participant E as Endpoint

    note right of A: Node A offline
    activate A
    CP->>B: PUT AttachedSingle
    CP->>E: Configure to use Node B
    E->>B: Connect for reads
    deactivate A
```

## Implementation (detail)

### Purpose of AttachedMulti, AttachedStale

#### AttachedMulti

Ordinarily, an attached pageserver whose generation is the latest may delete
layers at will (e.g. during compaction). If a previous-generation pageserver
is also still attached, and in use by endpoints, then this layer deletion could
lead to a loss of availability for the endpoint when reading from the
previous-generation pageserver.

The _AttachedMulti_ state simply disables deletions. These will be enqueued
in `RemoteTimelineClient` until the control plane transitions the
node into AttachedSingle, which unblocks deletions. Other remote storage operations,
such as uploads, are not blocked.

AttachedMulti is not required for data safety, only to preserve availability
on pageservers running with stale generations.

A node enters AttachedMulti only when explicitly asked to by the control plane. It should
only remain in this state for the duration of a migration.

If a control plane bug leaves
the node in AttachedMulti for a long time, then we must avoid unbounded memory use from enqueued
deletions. This may be accomplished simply, by dropping enqueued deletions when some modest
threshold of delayed deletions (e.g. 10k layers per tenant) is reached. As with all deletions,
it is safe to skip them, and the leaked objects will eventually be cleaned up by a scrub or
by timeline deletion.
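
As a sketch of the kind of bound described above (the structure and names are
hypothetical, not the actual `RemoteTimelineClient` code):

```rust
// Illustrative sketch of bounding delayed deletions while in AttachedMulti.

const MAX_DELAYED_DELETIONS: usize = 10_000;

#[derive(Default)]
struct DelayedDeletions {
    layers: Vec<String>, // remote layer paths awaiting deletion
    dropped: usize,      // deletions skipped because the queue was full
}

impl DelayedDeletions {
    /// Called instead of deleting immediately while in AttachedMulti.
    fn enqueue(&mut self, layer_path: String) {
        if self.layers.len() >= MAX_DELAYED_DELETIONS {
            // Safe to skip: the leaked object is later removed by a scrub
            // or by timeline deletion.
            self.dropped += 1;
            return;
        }
        self.layers.push(layer_path);
    }

    /// Called when the control plane transitions us to AttachedSingle.
    fn drain(&mut self) -> Vec<String> {
        std::mem::take(&mut self.layers)
    }
}
```
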

During AttachedMulti, the Tenant is free to drop layers from local disk in response to
disk pressure: only the deletion of remote layers is blocked.

#### AttachedStale

Currently, a pageserver with a stale generation number will continue to
upload layers, but be prevented from completing deletions. This is safe, but
inefficient: layers uploaded by this stale generation
will not be read back by future generations of pageservers.

The _AttachedStale_ state disables S3 uploads. The stale pageserver
will continue to ingest the WAL and write layers to local disk, but will not
do any uploads to S3.

A node may enter AttachedStale in two ways:

- Explicitly, when the control plane calls into the node at the start of a migration.
- Implicitly, when the node tries to validate some deletions and discovers
  that its generation is stale.

The AttachedStale state also disables sending consumption metrics from
that location: it is interpreted as an indication that some other pageserver
is already attached or is about to be attached, and that new pageserver will
be responsible for sending consumption metrics.

#### Disk Pressure & AttachedStale

Over long periods of time, a tenant location in AttachedStale will accumulate data
on local disk, as it cannot evict any layers written since it entered the
AttachedStale state. We rely on the control plane to revert the location to
Secondary or Detached at the end of a migration.

This scenario is particularly noteworthy when evacuating all tenants on a pageserver:
since _all_ the attached tenants will go into AttachedStale, we will be doing no
uploads at all, so ingested data will cause disk usage to increase continuously.
Under nominal conditions, the available disk space on pageservers should be sufficient
to complete the evacuation before this becomes a problem, but we must also handle
the case where we hit a low-disk situation while in this state.

The concept of disk pressure already exists in the pageserver: the `disk_usage_eviction_task`
touches each Tenant when it determines that a low-disk condition requires
some layer eviction. Having selected layers for eviction, the eviction
task calls `Timeline::evict_layers`.

**Safety**: if `evict_layers` is called while in the AttachedStale state, and some of the to-be-evicted
layers are not yet uploaded to S3, then the block on uploads will be lifted. This
will result in leaking some objects once a migration is complete, but will enable
the node to manage its disk space properly: if a node is left with some tenants
in AttachedStale indefinitely due to a network partition or control plane bug,
these tenants will not cause a full-disk condition.
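
A sketch of that rule, with hypothetical names (this is not the actual
`Timeline::evict_layers` code): the only decision is whether an upload must
happen before the local file is removed.

```rust
// Illustrative sketch of the AttachedStale eviction rule.

struct Layer {
    /// Is this layer already present in remote storage?
    uploaded: bool,
}

enum EvictionAction {
    /// Already in S3: just remove the local file.
    DeleteLocalFile,
    /// Lift the upload block for this one layer: this may leak an object
    /// after the migration completes, but keeps local disk usage bounded.
    UploadThenDeleteLocalFile,
}

fn evict_while_attached_stale(layer: &Layer) -> EvictionAction {
    if layer.uploaded {
        EvictionAction::DeleteLocalFile
    } else {
        EvictionAction::UploadThenDeleteLocalFile
    }
}
```
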
### Warm secondary updates
#### Layer heatmap

The secondary location's job is to serve reads **with the same quality of service as the original location
was serving them around the time of a migration**. This does not mean the secondary
location needs the whole set of layers: inactive layers that might soon
be evicted on the attached pageserver need not be downloaded by the
secondary. A totally idle tenant only needs to maintain enough on-disk
state to enable a fast cold start (i.e. the most recent image layers are
typically sufficient).

To enable this, we introduce the concept of a _layer heatmap_, which
acts as an advisory input to secondary locations to decide which
layers to download from S3.
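
This RFC does not pin down the heatmap's schema. As a sketch, it could be as
simple as a per-timeline list of layer names with access times, serialized
alongside the tenant's other remote data; all field names below are assumptions.

```rust
// Illustrative sketch of a serialized heatmap; the schema is not final.

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct HeatmapLayer {
    /// Layer file name, as stored in remote storage.
    name: String,
    /// Last access time on the attached pageserver (unix seconds).
    last_access: u64,
    /// Whether the attached pageserver currently holds this layer on local disk.
    resident: bool,
}

#[derive(Serialize, Deserialize)]
struct TimelineHeatmap {
    timeline_id: String,
    layers: Vec<HeatmapLayer>,
}

#[derive(Serialize, Deserialize)]
struct TenantHeatmap {
    timelines: Vec<TimelineHeatmap>,
}
```
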
#### Attached pageserver

The attached pageserver, if in state AttachedSingle, periodically
uploads a serialized heatmap to S3. It may skip this if there
is no change since the last time it uploaded (e.g. if the tenant
is totally idle).

Additionally, when the tenant is flushed to remote storage prior to a migration
(the first step in the [cutover procedure](#cutover-procedure)),
the heatmap is written out. This enables a future attached pageserver
to get an up-to-date view when deciding which layers to download.

#### Secondary location behavior

Secondary warm locations run a simple loop, implemented separately from
the main `Tenant` type, which represents attached tenants (a sketch of this
loop follows the list below):

- Download the layer heatmap
- Select any "hot enough" layers to download, if there is sufficient
  free disk space.
- Download layers, if they were not previously evicted (see below)
- Download the latest index_part.json
- Check if any layers currently on disk are no longer referenced by
  the IndexPart & delete them
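
A sketch of one iteration of that loop, with downloads and deletions reduced to
list bookkeeping; the names are hypothetical, and error handling, backoff and
real I/O are omitted.

```rust
// Illustrative sketch of one tick of the warm secondary loop.

/// One entry from the downloaded heatmap: (layer name, last access time).
type HeatmapEntry = (String, u64);

fn secondary_tick(
    heatmap: &[HeatmapEntry],
    index_part_layers: &[String], // layer names referenced by index_part.json
    on_disk: &mut Vec<String>,    // layers currently on local disk
    have_free_space: bool,
    heat_threshold: u64,              // may be raised when disk space is scarce
    last_evicted_access: Option<u64>, // anti-flapping marker (see next section)
) {
    // Select "hot enough" layers to download, if there is free disk space.
    if have_free_space {
        for (name, last_access) in heatmap {
            let hot_enough = *last_access >= heat_threshold;
            // Skip layers we were recently asked to evict, until the attached
            // node reports a newer access time for them.
            let recently_evicted = last_evicted_access.map_or(false, |t| *last_access <= t);
            if hot_enough && !recently_evicted && !on_disk.contains(name) {
                // In the real loop this is a download from S3.
                on_disk.push(name.clone());
            }
        }
    }

    // After refreshing index_part.json, delete any local layers that it no
    // longer references.
    on_disk.retain(|name| index_part_layers.contains(name));
}
```
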

Note that the heatmap is only advisory: if a secondary location has plenty
of disk space, it may choose to retain layers that aren't referenced
by the heatmap, as long as they are still referenced by the IndexPart. Conversely,
if a node is very low on disk space, it might opt to raise the heat threshold required
to download a layer, until more disk space is available.

#### Secondary locations & disk pressure

Secondary locations are subject to eviction on disk pressure, just as
attached locations are. For eviction purposes, the access time of a
layer in a secondary location will be the access time given in the heatmap,
rather than the literal time at which the local layer file was accessed.

The heatmap will indicate which layers are in local storage on the attached
location. The secondary will always attempt to get back to having that
set of layers on disk, but to avoid flapping, it will remember the access
time of the layer it was most recently asked to evict, and layers whose
access time is below that will not be re-downloaded.
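
A sketch of this anti-flapping rule in isolation (names are hypothetical); the
access times involved are always those reported by the heatmap, not local file
times.

```rust
// Illustrative sketch of the secondary location's anti-flapping rule.

#[derive(Default)]
struct SecondaryEvictionState {
    /// Heatmap access time of the layer most recently evicted here.
    last_evicted_access: Option<u64>,
}

impl SecondaryEvictionState {
    /// Called when disk pressure evicts a layer from this secondary location.
    fn on_evict(&mut self, heatmap_access_time: u64) {
        let newest = self
            .last_evicted_access
            .map_or(heatmap_access_time, |t| t.max(heatmap_access_time));
        self.last_evicted_access = Some(newest);
    }

    /// A previously evicted layer is only re-downloaded once the attached
    /// pageserver publishes a newer access time for it.
    fn may_redownload(&self, heatmap_access_time: u64) -> bool {
        self.last_evicted_access
            .map_or(true, |t| heatmap_access_time > t)
    }
}
```
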

The resulting behavior is that after a layer is evicted from a secondary
location, it is only re-downloaded once the attached pageserver accesses
the layer and uploads a heatmap reflecting that access time. On a pageserver
restart, the secondary location will attempt to download all layers in
the heatmap again, if they are not on local disk.

This behavior will be slightly different when secondary locations are
used for "low energy tenants", but that is beyond the scope of this RFC.

### Location configuration API

Currently, the `/tenant/<tenant_id>/config` API defines various
tunables like compaction settings, which apply to the tenant irrespective
of which pageserver it is running on.

A new "location config" structure will be introduced, which defines
configuration which is per-tenant, but local to a particular pageserver,
such as the attachment mode and whether it is a secondary.

The pageserver will expose a new per-tenant API for setting
the state: `/tenant/<tenant_id>/location/config`.

Body content:

```
{
  state: 'enum{Detached, Secondary, AttachedSingle, AttachedMulti, AttachedStale}',
  generation: Option<u32>,
  configuration: Option<TenantConfig>,
  flush: bool
}
```
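
As a sketch, the pageserver might deserialize this body into types along the
following lines (assuming serde); the names mirror the schema above but are not
the final API, and `TenantConfig` is stood in for by raw JSON.

```rust
// Illustrative sketch of the request body as Rust types; not the final API.

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
enum LocationState {
    Detached,
    Secondary,
    AttachedSingle,
    AttachedMulti,
    AttachedStale,
}

#[derive(Serialize, Deserialize)]
struct LocationConfigRequest {
    state: LocationState,
    /// Mandatory when entering AttachedSingle or AttachedMulti.
    generation: Option<u32>,
    /// Same shape as the body of `/tenant/<tenant_id>/config`;
    /// mandatory for any state other than Detached.
    configuration: Option<serde_json::Value>,
    /// Flush to S3 before changing state; only meaningful when the location
    /// is currently AttachedSingle or AttachedMulti.
    flush: bool,
}
```
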

The existing `/attach` and `/detach` endpoints will have the same
behavior as calling `/location/config` with the `AttachedSingle` and `Detached`
states respectively. These endpoints will be deprecated and later
removed.

The generation attribute is mandatory for entering `AttachedSingle` or
`AttachedMulti`.

The configuration attribute is mandatory when entering any state other
than `Detached`. This configuration is the same as the body of
the existing `/tenant/<tenant_id>/config` endpoint.

The `flush` argument indicates whether the pageserver should flush
to S3 before proceeding: this only has any effect if the node is
currently in AttachedSingle or AttachedMulti. This is used
during the first phase of migration, when transitioning the
old pageserver to AttachedStale.

The `/re-attach` API response will be extended to include a `state` as
well as a `generation`, enabling the pageserver to enter the
correct state for each tenant on startup.

### Database schema for locations

A new table `ProjectLocation`:

- pageserver_id: int
- tenant_id: TenantId
- generation: Option<int>
- state: `enum(Secondary, AttachedSingle, AttachedMulti)`

Notes:

- It is legal for a Project to have zero `ProjectLocation`s
- The `pageserver` column in `Project` now means "to which pageserver should
  endpoints connect", rather than simply which pageserver is attached.
- The `generation` column in `Project` remains, and is incremented and used
  to set the generation of `ProjectLocation` rows when they are set into
  an attached state.
- The `Detached` state is implicitly represented as the absence of
  a `ProjectLocation`.

### Executing migrations

Migrations will be implemented as Go functions, within the
existing `Operation` framework in the control plane. These
operations are persistent, such that they will always keep
trying until completion: this property is important to avoid
leaving garbage behind on pageservers, such as AttachedStale
locations.

### Recovery from failures during migration

During a migration, the control plane may encounter failures of either
the original or the new pageserver, or both:

- If the original node fails, skip past waiting for the new pageserver
  to catch up, and put the new pageserver into AttachedSingle immediately.
- If the new node fails, put the old pageserver into Secondary
  and then back into AttachedSingle (this has the effect of
  retaining on-disk state and granting it a fresh generation number).
- If both nodes fail, keep trying until one of them is available
  again.

### Control plane -> Pageserver reconciliation

A migration may be done while the old node is unavailable,
in which case the old node may still be running in an AttachedStale
state.

In this case, it is undesirable to have the migration `Operation`
stay alive until the old node eventually comes back online
and can be cleaned up. To handle this, the control plane
should run a background reconciliation process to compare
a pageserver's attachments with the database, and clean up
any that shouldn't be there any more.

Note that there will be no work to do if the old node was really
offline, as during startup it will call into `/re-attach` and
be updated that way. The reconciliation will only be needed
if the node was unavailable but still running.

## Alternatives considered
### Only enabling secondary locations for tenants on a higher service tier

This will make sense in future, especially for tiny databases that may be
downloaded from S3 in milliseconds when needed.

However, it is not wise to do it immediately, because pageservers contain
a mixture of higher and lower tier workloads. If we had 1 tenant with
a secondary location and 9 without, then those other 9 tenants would do
a lot of I/O as they try to recover from S3, which may degrade the
service of the tenant which had a secondary location.

Until we segregate tenants on different service tiers onto different pageserver
nodes, or implement & test QoS to ensure that tenants with secondaries are
not harmed by tenants without, we should use the same failover approach
for all tenants.

### Hot secondary locations (continuous WAL replay)

Instead of secondary locations populating their caches from S3, we could
have them consume the WAL from safekeepers. The downsides of this would be:

- Double load on safekeepers, which are a less scalable service than S3
- Secondary locations' on-disk state would end up subtly different from
  the remote state, which would make synchronizing with S3 more complex/expensive
  when going into an attached state.

The downside of only updating secondary locations from S3 is that we will
have a delay during migration from replaying the LSN range between what's
in S3 and what's in the pageserver. This range will be very small on
planned migrations, as we have the old pageserver flush to S3 immediately
before attaching the new pageserver. On unplanned migrations (old pageserver
is unavailable), the range of LSNs to replay is bounded by the flush frequency
on the old pageserver. However, the migration doesn't have to wait for the
replay: it's just that not-yet-replayed LSNs will be unavailable for reads
until the new pageserver catches up.

We expect that pageserver reads of the most recent LSNs will be relatively
rare, as for an active endpoint those pages will usually still be in the postgres
page cache: this leads us to prefer synchronizing from S3 on secondary
locations, rather than consuming the WAL from safekeepers.

### Cold secondary locations

It is not functionally necessary to keep warm caches on secondary locations at all. However, if we do not, then
we would experience a de-facto availability loss in unplanned migrations, as reads from the new node would take an extremely long time (many seconds, perhaps minutes).

Warm caches on secondary locations are therefore necessary to meet
our availability goals.

### Pageserver-granularity failover

Instead of migrating tenants individually, we could have entire spare nodes,
and on a node death, move all its work to one of these spares.

This approach is avoided for several reasons:

- we would still need fine-grained tenant migration for other
  purposes such as balancing load
- by sharing the spare capacity over many peers rather than one spare node,
  these peers may use the capacity for other purposes until it is needed
  to handle migrated tenants, e.g. for keeping a deeper cache of their
  attached tenants.

### Readonly during migration

We could simplify migrations by making both the previous and new nodes go into a
readonly state, then flushing remote content from the previous node, then activating
the attachment on the secondary node.

The downside to this approach is a potentially large gap in readability of
recent LSNs while loading data onto the new node. To avoid this, it is worthwhile
to incur the extra cost of double-replaying the WAL onto the old and new nodes' local
storage during a migration.

### Peer-to-peer pageserver communication

Rather than uploading the heatmap to S3, attached pageservers could make it
available to peers.

Currently, pageservers have no peer-to-peer communication, so adding this
for heatmaps would incur significant overhead in deployment and configuration
of the service, and in ensuring that when a new pageserver is deployed, other
pageservers are updated to be aware of it.

As well as simplifying implementation, putting heatmaps in S3 will be useful
for future analytics purposes: gathering aggregated statistics on activity
patterns across many tenants may be done directly from data in S3.