mirror of
https://github.com/neondatabase/neon.git
synced 2026-06-19 13:20:37 +00:00
Compare commits
12 Commits
hack/compu
...
jcsp/rfc-p
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
2ce2574aa4 | ||
|
|
dc5f107170 | ||
|
|
1569446396 | ||
|
|
a8143a3bed | ||
|
|
689b6f14b7 | ||
|
|
9c1c06ad17 | ||
|
|
40d2a73a0c | ||
|
|
89ddefb428 | ||
|
|
cad0799521 | ||
|
|
1143e2e9ce | ||
|
|
ef3e75abc3 | ||
|
|
cfb285139c |
599
docs/rfcs/028-pageserver-migration.md
Normal file
599
docs/rfcs/028-pageserver-migration.md
Normal file
@@ -0,0 +1,599 @@
|
|||||||
|
# Seamless tenant migration
|
||||||
|
|
||||||
|
- Author: john@neon.tech
|
||||||
|
- Created on 2023-08-11
|
||||||
|
- Implemented on ..
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
The preceding [generation numbers RFC](025-generation-numbers.md) may be thought of as "making tenant
|
||||||
|
migration safe". Following that,
|
||||||
|
this RFC is about how those migrations are to be done:
|
||||||
|
|
||||||
|
1. Seamlessly (without interruption to client availability)
|
||||||
|
2. Quickly (enabling faster operations)
|
||||||
|
3. Efficiently (minimizing I/O and $ cost)
|
||||||
|
|
||||||
|
These points are in priority order: if we have to sacrifice
|
||||||
|
efficiency to make a migration seamless for clients, we will
|
||||||
|
do so, etc.
|
||||||
|
|
||||||
|
This is accomplished by introducing two high level changes:
|
||||||
|
|
||||||
|
- A dual-attached state for tenants, used in a control-plane-orchestrated
|
||||||
|
migration procedure that preserves availability during a migration.
|
||||||
|
- Warm secondary locations for tenants, where on-disk content is primed
|
||||||
|
for a fast migration of the tenant from its current attachment to this
|
||||||
|
secondary location.
|
||||||
|
|
||||||
|
## Motivation
|
||||||
|
|
||||||
|
Migrating tenants between pageservers is essential to operating a service
|
||||||
|
at scale, in several contexts:
|
||||||
|
|
||||||
|
1. Responding to a pageserver node failure by migrating tenants to other pageservers
|
||||||
|
2. Balancing load and capacity across pageservers, for example when a user expands their
|
||||||
|
database and they need to migrate to a pageserver with more capacity.
|
||||||
|
3. Restarting pageservers for upgrades and maintenance
|
||||||
|
|
||||||
|
The current situation steps for migration are:
|
||||||
|
|
||||||
|
- detach from old node; skip if old node is dead; (the [skip part is still WIP](https://github.com/neondatabase/cloud/issues/5426)).
|
||||||
|
- attach to new node
|
||||||
|
- re-configure endpoints to use the new node
|
||||||
|
|
||||||
|
Once [generation numbers](025-generation-numbers.md) are implemented,
|
||||||
|
the detach step is no longer critical for correctness. So, we can
|
||||||
|
|
||||||
|
- attach to a new node,
|
||||||
|
- re-configure endpoints to use the new node, and then
|
||||||
|
- detach from the old node.
|
||||||
|
|
||||||
|
However, this still does not meet our seamless/fast/efficient goals:
|
||||||
|
|
||||||
|
- Not fast: The new node will have to download potentially large amounts
|
||||||
|
of data from S3, which may take many minutes.
|
||||||
|
- Not seamless: If we attach to a new pageserver before detaching an old one,
|
||||||
|
the new one might delete some objects that interrupt availability of reads on the old one.
|
||||||
|
- Not efficient: the old pageserver will continue uploading
|
||||||
|
S3 content during the migration that will never be read.
|
||||||
|
|
||||||
|
The user expectations for availability are:
|
||||||
|
|
||||||
|
- For planned maintenance, there should be zero availability
|
||||||
|
gap. This expectation is fulfilled by this RFC.
|
||||||
|
- For unplanned changes (e.g. node failures), there should be
|
||||||
|
minimal availability gap. This RFC provides the _mechanism_
|
||||||
|
to fail over quickly, but does not provide the failure _detection_
|
||||||
|
nor failover _policy_.
|
||||||
|
|
||||||
|
## Non Goals
|
||||||
|
|
||||||
|
- Defining service tiers with different storage strategies: the same
|
||||||
|
level of HA & overhead will apply to all tenants. This doesn't rule out
|
||||||
|
adding such tiers in future.
|
||||||
|
- Enabling pageserver failover in the absence of a control plane: the control
|
||||||
|
plane will remain the source of truth for what should be attached where.
|
||||||
|
- Totally avoiding availability gaps on unplanned migrations during
|
||||||
|
a failure (we expect a small, bounded window of
|
||||||
|
read unavailability of very recent LSNs)
|
||||||
|
- Workload balancing: this RFC defines the mechanism for moving tenants
|
||||||
|
around, not the higher level logic for deciding who goes where.
|
||||||
|
- Defining all possible configuration flows for tenants: the migration process
|
||||||
|
defined in this RFC demonstrates the sufficiency of the pageserver API, but
|
||||||
|
is not the only kind of configuration change the control plane will ever do.
|
||||||
|
The APIs defined here should let the control plane move tenants around in
|
||||||
|
whatever way is needed while preserving data safety and read availability.
|
||||||
|
|
||||||
|
## Impacted components
|
||||||
|
|
||||||
|
Pageserver, control plane
|
||||||
|
|
||||||
|
## Terminology
|
||||||
|
|
||||||
|
- **Attachment**: a tenant is _attached_ to a pageserver if it has
|
||||||
|
been issued a generation number, and is running an instance of
|
||||||
|
the `Tenant` type, ingesting the WAL, and available to serve
|
||||||
|
page reads.
|
||||||
|
- **Location**: locations are a superset of attachments. A location
|
||||||
|
is a combination of a tenant and a pageserver. We may _attach_ at a _location_.
|
||||||
|
|
||||||
|
- **Secondary location**: a location which is not currently attached.
|
||||||
|
- **Warm secondary location**: a location which is not currently attached, but is endeavoring to maintain a warm local cache of layers. We avoid calling this a _warm standby_ to avoid confusion with similar postgres features.
|
||||||
|
|
||||||
|
## Implementation (high level)
|
||||||
|
|
||||||
|
### Warm secondary locations
|
||||||
|
|
||||||
|
To enable faster migrations, we will identify at least one _secondary location_
|
||||||
|
for each tenant. This secondary location will keep a warm cache of layers
|
||||||
|
for the tenant, so that if it is later attached, it can catch up with the
|
||||||
|
latest LSN quickly: rather than downloading everything, it only has to replay
|
||||||
|
the recent part of the WAL to advance from the remote_consistent_offset to the
|
||||||
|
most recent LSN in the WAL.
|
||||||
|
|
||||||
|
The control plane is responsible for selecting secondary locations, and
|
||||||
|
calling into pageservers to configure tenants into a secondary mode at this
|
||||||
|
new location, as well as attaching the tenant in its existing primary location.
|
||||||
|
|
||||||
|
The attached pageserver for a tenant will publish a [layer heatmap](#layer-heatmap)
|
||||||
|
to advise secondaries of which layers should be downloaded.
|
||||||
|
|
||||||
|
### Location modes
|
||||||
|
|
||||||
|
Currently, we consider a tenant to be in one of two states on a pageserver:
|
||||||
|
|
||||||
|
- Attached: active `Tenant` object, and layers on local disk
|
||||||
|
- Detached: no layers on local disk, no runtime state.
|
||||||
|
|
||||||
|
We will extend this with finer-grained modes, whose purpose will become
|
||||||
|
clear in later sections:
|
||||||
|
|
||||||
|
- **AttachedSingle**: equivalent the existing attached state.
|
||||||
|
- **AttachedMulti**: like AttachedSingle, holds an up to date generation, but
|
||||||
|
does not do deletions.
|
||||||
|
- **AttachedStale**: like AttachedSingle, holds a stale generation,
|
||||||
|
do not do any remote storage operations.
|
||||||
|
- **Secondary**: keep local state on disk, periodically update from S3.
|
||||||
|
- **Detached**: equivalent to existing detached state.
|
||||||
|
|
||||||
|
To control these finer grained states, a new pageserver API endpoint will be added.
|
||||||
|
|
||||||
|
### Cutover procedure
|
||||||
|
|
||||||
|
Define old location and new location as "Node A" and "Node B". Consider
|
||||||
|
the case where both nodes are available, and Node B was previously configured
|
||||||
|
as a secondary location for the tenant we are migrating.
|
||||||
|
|
||||||
|
The cutover procedure is orchestrated by the control plane, calling into
|
||||||
|
the pageservers' APIs:
|
||||||
|
|
||||||
|
1. Call to Node A requesting it to flush to S3 and enter AttachedStale state
|
||||||
|
2. Increment generation, and call to Node B requesting it to enter AttachedMulti
|
||||||
|
state with the new generation.
|
||||||
|
3. Call to Node B, requesting it to download the latest hot layers from remote storage,
|
||||||
|
according to the latest heatmap flushed by Node A.
|
||||||
|
4. Wait for Node B's WAL ingestion to catch up with node A's
|
||||||
|
5. Update endpoints to use node B instead of node A
|
||||||
|
6. Call to node B requesting it to enter state AttachedSingle.
|
||||||
|
7. Call to node A requesting it to enter state Secondary
|
||||||
|
|
||||||
|
The following table summarizes how the state of the system advances:
|
||||||
|
|
||||||
|
| Step | Node A | Node B | Node used by endpoints |
|
||||||
|
| :-----------: | :------------: | :------------: | :--------------------: |
|
||||||
|
| 1 (_initial_) | AttachedSingle | Secondary | A |
|
||||||
|
| 2 | AttachedStale | AttachedMulti | A |
|
||||||
|
| 3 | AttachedStale | AttachedMulti | A |
|
||||||
|
| 4 | AttachedStale | AttachedMulti | A |
|
||||||
|
| 5 (_cutover_) | AttachedStale | AttachedMulti | B |
|
||||||
|
| 6 | AttachedStale | AttachedSingle | B |
|
||||||
|
| 7 (_final_) | Secondary | AttachedSingle | B |
|
||||||
|
|
||||||
|
The procedure described for a clean handover from a live node to a secondary
|
||||||
|
is also used for failure cases and for migrations to a location that is not
|
||||||
|
configured as a secondary, by simply skipping irrelevant steps, as described in
|
||||||
|
the following sections.
|
||||||
|
|
||||||
|
#### Migration from an unresponsive node
|
||||||
|
|
||||||
|
If node A is unavailable, then all calls into
|
||||||
|
node A are skipped and we don't wait for B to catch up before
|
||||||
|
switching updating the endpoints to use B.
|
||||||
|
|
||||||
|
#### Migration to a location that is not a secondary
|
||||||
|
|
||||||
|
If node B is initially in Detached state, the procedure is identical. Since Node B
|
||||||
|
is coming from a Detached state rather than Secondary, the download of layers and
|
||||||
|
catch up with WAL will take much longer.
|
||||||
|
|
||||||
|
We might do this if:
|
||||||
|
|
||||||
|
- Attached and secondary locations are both critically low on disk, and we need
|
||||||
|
to migrate to a third node with more resources available.
|
||||||
|
- We are migrating a tenant which does not use secondary locations to save on cost.
|
||||||
|
|
||||||
|
#### Permanent migration away from a node
|
||||||
|
|
||||||
|
In the final step of the migration, we generally request the original node to enter a Secondary
|
||||||
|
state. This is typical if we are doing a planned migration during maintenance, or to
|
||||||
|
balance CPU/network load away from a node.
|
||||||
|
|
||||||
|
One might also want to permanently migrate away: this can be done by simply removing the secondary
|
||||||
|
location after the migration is complete, or as an optimization by substituting the Detached state
|
||||||
|
for the Secondary state in the final step.
|
||||||
|
|
||||||
|
#### Cutover diagram
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
sequenceDiagram
|
||||||
|
participant CP as Control plane
|
||||||
|
participant A as Node A
|
||||||
|
participant B as Node B
|
||||||
|
participant E as Endpoint
|
||||||
|
|
||||||
|
CP->>A: PUT Flush & go to AttachedStale
|
||||||
|
note right of A: A continues to ingest WAL
|
||||||
|
CP->>B: PUT AttachedMulti
|
||||||
|
CP->>B: PUT Download layers from latest heatmap
|
||||||
|
note right of B: B downloads from S3
|
||||||
|
loop Poll until download complete
|
||||||
|
CP->>B: GET download status
|
||||||
|
end
|
||||||
|
activate B
|
||||||
|
note right of B: B ingests WAL
|
||||||
|
loop Poll until catch up
|
||||||
|
CP->>B: GET visible WAL
|
||||||
|
CP->>A: GET visible WAL
|
||||||
|
end
|
||||||
|
deactivate B
|
||||||
|
CP->>E: Configure to use Node B
|
||||||
|
E->>B: Connect for reads
|
||||||
|
CP->>B: PUT AttachedSingle
|
||||||
|
CP->>A: PUT Secondary
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Cutover from an unavailable pageserver
|
||||||
|
|
||||||
|
This case is far simpler: we may skip straight to our intended
|
||||||
|
end state.
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
sequenceDiagram
|
||||||
|
participant A as Node A
|
||||||
|
participant CP as Control plane
|
||||||
|
participant B as Node B
|
||||||
|
participant E as Endpoint
|
||||||
|
|
||||||
|
note right of A: Node A offline
|
||||||
|
activate A
|
||||||
|
CP->>B: PUT AttachedSingle
|
||||||
|
CP->>E: Configure to use Node B
|
||||||
|
E->>B: Connect for reads
|
||||||
|
deactivate A
|
||||||
|
```
|
||||||
|
|
||||||
|
## Implementation (detail)
|
||||||
|
|
||||||
|
### Purpose of AttachedMulti, AttachedStale
|
||||||
|
|
||||||
|
#### AttachedMulti
|
||||||
|
|
||||||
|
Ordinarily, an attached pageserver whose generation is the latest may delete
|
||||||
|
layers at will (e.g. during compaction). If a previous generation pageserver
|
||||||
|
is also still attached, and in use by endpoints, then this layer deletion could
|
||||||
|
lead to a loss of availability for the endpoint when reading from the previous
|
||||||
|
generation pageserver.
|
||||||
|
|
||||||
|
The _AttachedMulti_ state simply disables deletions. These will be enqueued
|
||||||
|
in `RemoteTimelineClient` until the control plane transitions the
|
||||||
|
node into AttachedSingle, which unblocks deletions. Other remote storage operations
|
||||||
|
such as uploads are not blocked.
|
||||||
|
|
||||||
|
AttachedMulti is not required for data safety, only to preserve availability
|
||||||
|
on pageservers running with stale generations.
|
||||||
|
|
||||||
|
A node enters AttachedMulti only when explicitly asked to by the control plane. It should
|
||||||
|
only remain in this state for the duration of a migration.
|
||||||
|
|
||||||
|
If a control plane bug leaves
|
||||||
|
the node in AttachedMulti for a long time, then we must avoid unbounded memory use from enqueued
|
||||||
|
deletions. This may be accomplished simply, by dropping enqueued deletions when some modest
|
||||||
|
threshold of delayed deletions (e.g. 10k layers per tenant) is reached. As with all deletions,
|
||||||
|
it is safe to skip them, and the leaked objects will be eventually cleaned up by scrub or
|
||||||
|
by timeline deletion.
|
||||||
|
|
||||||
|
During AttachedMulti, the Tenant is free to drop layers from local disk in response to
|
||||||
|
disk pressure: only the deletion of remote layers is blocked.
|
||||||
|
|
||||||
|
#### AttachedStale
|
||||||
|
|
||||||
|
Currently, a pageserver with a stale generation number will continue to
|
||||||
|
upload layers, but be prevented from completing deletions. This is safe, but inefficient: layers uploaded by this stale generation
|
||||||
|
will not be read back by future generations of pageservers.
|
||||||
|
|
||||||
|
The _AttachedStale_ state disables S3 uploads. The stale pageserver
|
||||||
|
will continue to ingest the WAL and write layers to local disk, but not to
|
||||||
|
do any uploads to S3.
|
||||||
|
|
||||||
|
A node may enter AttachedStale in two ways:
|
||||||
|
|
||||||
|
- Explicitly, when control plane calls into the node at the start of a migration.
|
||||||
|
- Implicitly, when the node tries to validate some deletions and discovers
|
||||||
|
that its generation is stale.
|
||||||
|
|
||||||
|
The AttachedStale state also disables sending consumption metrics from
|
||||||
|
that location: it is interpreted as an indication that some other pageserver
|
||||||
|
is already attached or is about to be attached, and that new pageserver will
|
||||||
|
be responsible for sending consumption metrics.
|
||||||
|
|
||||||
|
#### Disk Pressure & AttachedStale
|
||||||
|
|
||||||
|
Over long periods of time, a tenant location in AttachedStale will accumulate data
|
||||||
|
on local disk, as it cannot evict any layers written since it entered the
|
||||||
|
AttachStale state. We rely on the control plane to revert the location to
|
||||||
|
Secondary or Detached at the end of a migration.
|
||||||
|
|
||||||
|
This scenario is particularly noteworthy when evacuating all tenants on a pageserver:
|
||||||
|
since _all_ the attached tenants will go into AttachedStale, we will be doing no
|
||||||
|
uploads at all, therefore ingested data will cause disk usage to increase continuously.
|
||||||
|
Under nominal conditions, the available disk space on pageservers should be sufficient
|
||||||
|
to complete the evacuation before this becomes a problem, but we must also handle
|
||||||
|
the case where we hit a low disk situation while in this state.
|
||||||
|
|
||||||
|
The concept of disk pressure already exists in the pageserver: the `disk_usage_eviction_task`
|
||||||
|
touches each Tenant when it determines that a low-disk condition requires
|
||||||
|
some layer eviction. Having selected layers for eviction, the eviction
|
||||||
|
task calls `Timeline::evict_layers`.
|
||||||
|
|
||||||
|
**Safety**: If evict_layers is called while in AttachedStale state, and some of the to-be-evicted
|
||||||
|
layers are not yet uploaded to S3, then the block on uploads will be lifted. This
|
||||||
|
will result in leaking some objects once a migration is complete, but will enable
|
||||||
|
the node to manage its disk space properly: if a node is left with some tenants
|
||||||
|
in AttachedStale indefinitely due to a network partition or control plane bug,
|
||||||
|
these tenants will not cause a full disk condition.
|
||||||
|
|
||||||
|
### Warm secondary updates
|
||||||
|
|
||||||
|
#### Layer heatmap
|
||||||
|
|
||||||
|
The secondary location's job is to serve reads **with the same quality of service as the original location
|
||||||
|
was serving them around the time of a migration**. This does not mean the secondary
|
||||||
|
location needs the whole set of layers: inactive layers that might soon
|
||||||
|
be evicted on the attached pageserver need not be downloaded by the
|
||||||
|
secondary. A totally idle tenant only needs to maintain enough on-disk
|
||||||
|
state to enable a fast cold start (i.e. the most recent image layers are
|
||||||
|
typically sufficient).
|
||||||
|
|
||||||
|
To enable this, we introduce the concept of a _layer heatmap_, which
|
||||||
|
acts as an advisory input to secondary locations to decide which
|
||||||
|
layers to download from S3.
|
||||||
|
|
||||||
|
#### Attached pageserver
|
||||||
|
|
||||||
|
The attached pageserver, if in state AttachedSingle, periodically
|
||||||
|
uploads a serialized heat map to S3. It may skip this if there
|
||||||
|
is no change since the last time it uploaded (e.g. if the tenant
|
||||||
|
is totally idle).
|
||||||
|
|
||||||
|
Additionally, when the tenant is flushed to remote storage prior to a migration
|
||||||
|
(the first step in [cutover procedure](#cutover-procedure)),
|
||||||
|
the heatmap is written out. This enables a future attached pageserver
|
||||||
|
to get an up to date view when deciding which layers to download.
|
||||||
|
|
||||||
|
#### Secondary location behavior
|
||||||
|
|
||||||
|
Secondary warm locations run a simple loop, implemented separately from
|
||||||
|
the main `Tenant` type, which represents attached tenants:
|
||||||
|
|
||||||
|
- Download the layer heatmap
|
||||||
|
- Select any "hot enough" layers to download, if there is sufficient
|
||||||
|
free disk space.
|
||||||
|
- Download layers, if they were not previously evicted (see below)
|
||||||
|
- Download the latest index_part.json
|
||||||
|
- Check if any layers currently on disk are no longer referenced by
|
||||||
|
IndexPart & delete them
|
||||||
|
|
||||||
|
Note that the heatmap is only advisory: if a secondary location has plenty
|
||||||
|
of disk space, it may choose to retain layers that aren't referenced
|
||||||
|
by the heatmap, as long as they are still referenced by the IndexPart. Conversely,
|
||||||
|
if a node is very low on disk space, it might opt to raise the heat threshold required
|
||||||
|
to both downloading a layer, until more disk space is available.
|
||||||
|
|
||||||
|
#### Secondary locations & disk pressure
|
||||||
|
|
||||||
|
Secondary locations are subject to eviction on disk pressure, just as
|
||||||
|
attached locations are. For eviction purposes, the access time of a
|
||||||
|
layer in a secondary location will be the access time given in the heatmap,
|
||||||
|
rather than the literal time at which the local layer file was accessed.
|
||||||
|
|
||||||
|
The heatmap will indicate which layers are in local storage on the attached
|
||||||
|
location. The secondary will always attempt to get back to having that
|
||||||
|
set of layers on disk, but to avoid flapping, it will remember the access
|
||||||
|
time of the layer it was most recently asked to evict, and layers whose
|
||||||
|
access time is below that will not be re-downloaded.
|
||||||
|
|
||||||
|
The resulting behavior is that after a layer is evicted from a secondary
|
||||||
|
location, it is only re-downloaded once the attached pageserver accesses
|
||||||
|
the layer and uploads a heatmap reflecting that access time. On a pageserver
|
||||||
|
restart, the secondary location will attempt to download all layers in
|
||||||
|
the heatmap again, if they are not on local disk.
|
||||||
|
|
||||||
|
This behavior will be slightly different when secondary locations are
|
||||||
|
used for "low energy tenants", but that is beyond the scope of this RFC.
|
||||||
|
|
||||||
|
### Location configuration API
|
||||||
|
|
||||||
|
Currently, the `/tenant/<tenant_id>/config` API defines various
|
||||||
|
tunables like compaction settings, which apply to the tenant irrespective
|
||||||
|
of which pageserver it is running on.
|
||||||
|
|
||||||
|
A new "location config" structure will be introduced, which defines
|
||||||
|
configuration which is per-tenant, but local to a particular pageserver,
|
||||||
|
such as the attachment mode and whether it is a secondary.
|
||||||
|
|
||||||
|
The pageserver will expose a new per-tenant API for setting
|
||||||
|
the state: `/tenant/<tenant_id>/location/config`.
|
||||||
|
|
||||||
|
Body content:
|
||||||
|
|
||||||
|
```
|
||||||
|
{
|
||||||
|
state: 'enum{Detached, Secondary, AttachedSingle, AttachedMulti, AttachedStale}',
|
||||||
|
generation: Option<u32>,
|
||||||
|
configuration: `Option<TenantConfig>`
|
||||||
|
flush: bool
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Existing `/attach` and `/detach` endpoint will have the same
|
||||||
|
behavior as calling `/location/config` with `AttachedSingle` and `Detached`
|
||||||
|
states respectively. These endpoints will be deprecated and later
|
||||||
|
removed.
|
||||||
|
|
||||||
|
The generation attribute is mandatory for entering `AttachedSingle` or
|
||||||
|
`AttachedMulti`.
|
||||||
|
|
||||||
|
The configuration attribute is mandatory when entering any state other
|
||||||
|
than `Detached`. This configuration is the same as the body for
|
||||||
|
the existing `/tenant/<tenant_id>/config` endpoint.
|
||||||
|
|
||||||
|
The `flush` argument indicates whether the pageservers should flush
|
||||||
|
to S3 before proceeding: this only has any effect if the node is
|
||||||
|
currently in AttachedSingle or AttachedMulti. This is used
|
||||||
|
during the first phase of migration, when transitioning the
|
||||||
|
old pageserver to AttachedSingle.
|
||||||
|
|
||||||
|
The `/re-attach` API response will be extended to include a `state` as
|
||||||
|
well as a `generation`, enabling the pageserver to enter the
|
||||||
|
correct state for each tenant on startup.
|
||||||
|
|
||||||
|
### Database schema for locations
|
||||||
|
|
||||||
|
A new table `ProjectLocation`:
|
||||||
|
|
||||||
|
- pageserver_id: int
|
||||||
|
- tenant_id: TenantId
|
||||||
|
- generation: Option<int>
|
||||||
|
- state: `enum(Secondary, AttachedSingle, AttachedMulti)`
|
||||||
|
|
||||||
|
Notes:
|
||||||
|
|
||||||
|
- It is legacy for a Project to have zero `ProjectLocation`s
|
||||||
|
- The `pageserver` column in `Project` now means "to which pageserver should
|
||||||
|
endpoints connect", rather than simply which pageserver is attached.
|
||||||
|
- The `generation` column in `Project` remains, and is incremented and used
|
||||||
|
to set the generation of `ProjectLocation` rows when they are set into
|
||||||
|
an attached state.
|
||||||
|
- The `Detached` state is implicitly represented as the absence of
|
||||||
|
a `ProjectLocation`.
|
||||||
|
|
||||||
|
### Executing migrations
|
||||||
|
|
||||||
|
Migrations will be implemented as Go functions, within the
|
||||||
|
existing `Operation` framework in the control plane. These
|
||||||
|
operations are persistent, such that they will always keep
|
||||||
|
trying until completion: this property is important to avoid
|
||||||
|
leaving garbage behind on pageservers, such as AttachedStale
|
||||||
|
locations.
|
||||||
|
|
||||||
|
### Recovery from failures during migration
|
||||||
|
|
||||||
|
During migration, the control plane may encounter failures of either
|
||||||
|
the original or new pageserver, or both:
|
||||||
|
|
||||||
|
- If the original fails, skip past waiting for the new pageserver
|
||||||
|
to catch up, and put it into AttachedSingle immediately.
|
||||||
|
- If the new node fails, put the old pageserver into Secondary
|
||||||
|
and then back into AttachedSingle (this has the effect of
|
||||||
|
retaining on-disk state and granting it a fresh generation number).
|
||||||
|
- If both nodes fail, keep trying until one of them is available
|
||||||
|
again.
|
||||||
|
|
||||||
|
### Control plane -> Pageserver reconciliation
|
||||||
|
|
||||||
|
A migration may be done while the old node is unavailable,
|
||||||
|
in which case the old node may still be running in an AttachedStale
|
||||||
|
state.
|
||||||
|
|
||||||
|
In this case, it is undesirable to have the migration `Operation`
|
||||||
|
stay alive until the old node eventually comes back online
|
||||||
|
and can be cleaned up. To handle this, the control plane
|
||||||
|
should run a background reconciliation process to compare
|
||||||
|
a pageserver's attachments with the database, and clean up
|
||||||
|
any that shouldn't be there any more.
|
||||||
|
|
||||||
|
Note that there will be no work to do if the old node was really
|
||||||
|
offline, as during startup it will call into `/re-attach` and
|
||||||
|
be updated that way. The reconciliation will only be needed
|
||||||
|
if the node was unavailable but still running.
|
||||||
|
|
||||||
|
## Alternatives considered
|
||||||
|
|
||||||
|
### Only enabling secondary locations for tenants on a higher service tier
|
||||||
|
|
||||||
|
This will make sense in future, especially for tiny databases that may be
|
||||||
|
downloaded from S3 in milliseconds when needed.
|
||||||
|
|
||||||
|
However, it is not wise to do it immediately, because pageservers contain
|
||||||
|
a mixture of higher and lower tier workloads. If we had 1 tenant with
|
||||||
|
a secondary location and 9 without, then those other 9 tenants will do
|
||||||
|
a lot of I/O as they try to recover from S3, which may degrade the
|
||||||
|
service of the tenant which had a secondary location.
|
||||||
|
|
||||||
|
Until we segregate tenant on different service tiers on different pageserver
|
||||||
|
nodes, or implement & test QoS to ensure that tenants with secondaries are
|
||||||
|
not harmed by tenants without, we should use the same failover approach
|
||||||
|
for all the tenants.
|
||||||
|
|
||||||
|
### Hot secondary locations (continuous WAL replay)
|
||||||
|
|
||||||
|
Instead of secondary locations populating their caches from S3, we could
|
||||||
|
have them consume the WAL from safekeepers. The downsides of this would be:
|
||||||
|
|
||||||
|
- Double load on safekeepers, which are a less scalable service than S3
|
||||||
|
- Secondary locations' on-disk state would end up subtly different to
|
||||||
|
the remote state, which would make synchronizing with S3 more complex/expensive
|
||||||
|
when going into attached state.
|
||||||
|
|
||||||
|
The downside of only updating secondary locations from S3 is that we will
|
||||||
|
have a delay during migration from replaying the LSN range between what's
|
||||||
|
in S3 and what's in the pageserver. This range will be very small on
|
||||||
|
planned migrations, as we have the old pageserver flush to S3 immediately
|
||||||
|
before attaching the new pageserver. On unplanned migrations (old pageserver
|
||||||
|
is unavailable), the range of LSNs to replay is bounded by the flush frequency
|
||||||
|
on the old pageserver. However, the migration doesn't have to wait for the
|
||||||
|
replay: it's just that not-yet-replayed LSNs will be unavailable for read
|
||||||
|
until the new pageserver catches up.
|
||||||
|
|
||||||
|
We expect that pageserver reads of the most recent LSNs will be relatively
|
||||||
|
rare, as for an active endpoint those pages will usually still be in the postgres
|
||||||
|
page cache: this leads us to prefer synchronizing from S3 on secondary
|
||||||
|
locations, rather than consuming the WAL from safekeepers.
|
||||||
|
|
||||||
|
### Cold secondary locations
|
||||||
|
|
||||||
|
It is not functionally necessary to keep warm caches on secondary locations at all. However, if we do not, then
|
||||||
|
we would experience a de-facto availability loss in unplanned migrations, as reads to the new node would take an extremely long time (many seconds, perhaps minutes).
|
||||||
|
|
||||||
|
Warm caches on secondary locations are necessary to meet
|
||||||
|
our availability goals.
|
||||||
|
|
||||||
|
### Pageserver-granularity failover
|
||||||
|
|
||||||
|
Instead of migrating tenants individually, we could have entire spare nodes,
|
||||||
|
and on a node death, move all its work to one of these spares.
|
||||||
|
|
||||||
|
This approach is avoided for several reasons:
|
||||||
|
|
||||||
|
- we would still need fine-grained tenant migration for other
|
||||||
|
purposes such as balancing load
|
||||||
|
- by sharing the spare capacity over many peers rather than one spare node,
|
||||||
|
these peers may use the capacity for other purposes, until it is needed
|
||||||
|
to handle migrated tenants. e.g. for keeping a deeper cache of their
|
||||||
|
attached tenants.
|
||||||
|
|
||||||
|
### Readonly during migration
|
||||||
|
|
||||||
|
We could simplify migrations by making both previous and new nodes go into a
|
||||||
|
readonly state, then flush remote content from the previous node, then activate
|
||||||
|
attachment on the secondary node.
|
||||||
|
|
||||||
|
The downside to this approach is a potentially large gap in readability of
|
||||||
|
recent LSNs while loading data onto the new node. To avoid this, it is worthwhile
|
||||||
|
to incur the extra cost of double-replaying the WAL onto old and new nodes' local
|
||||||
|
storage during a migration.
|
||||||
|
|
||||||
|
### Peer-to-peer pageserver communication
|
||||||
|
|
||||||
|
Rather than uploading the heatmap to S3, attached pageservers could make it
|
||||||
|
available to peers.
|
||||||
|
|
||||||
|
Currently, pageservers have no peer to peer communication, so adding this
|
||||||
|
for heatmaps would incur significant overhead in deployment and configuration
|
||||||
|
of the service, and ensuring that when a new pageserver is deployed, other
|
||||||
|
pageservers are updated to be aware of it.
|
||||||
|
|
||||||
|
As well as simplifying implementation, putting heatmaps in S3 will be useful
|
||||||
|
for future analytics purposes -- gathering aggregated statistics on activity
|
||||||
|
pattersn across many tenants may be done directly from data in S3.
|
||||||
244
docs/rfcs/029-sharding-phase1.md
Normal file
244
docs/rfcs/029-sharding-phase1.md
Normal file
@@ -0,0 +1,244 @@
|
|||||||
|
# Sharding Phase 1: Static Key-space Sharding
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
To enable databases with sizes approaching the capacity of a pageserver's disk,
|
||||||
|
it is necessary to break up the storage for the database, or _shard_ it.
|
||||||
|
|
||||||
|
Sharding in general is a complex area. This RFC aims to define a modest initial
|
||||||
|
capability that will permit creating large-capacity databases using a static configuration
|
||||||
|
defined at time of Tenant creation.
|
||||||
|
|
||||||
|
## Motivation
|
||||||
|
|
||||||
|
Currently, all data for a Tenant, including all its timelines, is stored on a single
|
||||||
|
pageserver. The local storage required may be several times larger than the actual
|
||||||
|
database size, due to LSM write inflation.
|
||||||
|
|
||||||
|
If a database is larger than what one pageserver can hold, then it becomes impossible
|
||||||
|
for the pageserver to hold it in local storage, as it must do to provide service to
|
||||||
|
clients.
|
||||||
|
|
||||||
|
### Prior art
|
||||||
|
|
||||||
|
Numerous: sharding is a long-discussed feature for the pageserver.
|
||||||
|
|
||||||
|
Prior art in other distributed systems is too broad to capture here: pretty much
|
||||||
|
any scale out storage system does something like this.
|
||||||
|
|
||||||
|
## Requirements
|
||||||
|
|
||||||
|
- Enable creating a large (for example, 16TiB) database without requiring dedicated
|
||||||
|
pageserver nodes.
|
||||||
|
- Share read/write bandwidth costs for large databases across pageservers, as well
|
||||||
|
as storage capacity, in order to avoid large capacity databases acting as I/O hotspots
|
||||||
|
that disrupt service to other tenants.
|
||||||
|
- Our data distribution scheme should handle sparse/nonuniform keys well, since postgres
|
||||||
|
does not write out a single contiguous ranges of page numbers.
|
||||||
|
|
||||||
|
*Note: the definition of 'large database' is arbitrary, but the lower bound is to ensure that a database
|
||||||
|
that a user might create on a current-gen enterprise SSD should also work well on
|
||||||
|
Neon. The upper bound is whatever postgres can handle: i.e. we must make sure that the
|
||||||
|
pageserver backend is not the limiting factor in the database size*.
|
||||||
|
|
||||||
|
## Non Goals
|
||||||
|
|
||||||
|
- Independently distributing timelines within the same tenant. If a tenant has many
|
||||||
|
timelines, then sharding may be a less efficient mechanism for distributing load than
|
||||||
|
sharing out timelines between pageservers.
|
||||||
|
- Distributing work in the LSN dimension: this RFC focuses on the Key dimension only,
|
||||||
|
based on the idea that separate mechanisms will make sense for each dimension.
|
||||||
|
|
||||||
|
## Impacted Components
|
||||||
|
|
||||||
|
pageserver, control plane, safekeeper (optional)
|
||||||
|
|
||||||
|
## Terminology
|
||||||
|
|
||||||
|
**Key**: a postgres page number. In the sense that the pageserver is a versioned key-value store,
|
||||||
|
the page number is the key in that store.
|
||||||
|
|
||||||
|
**LSN dimension**: this just means the range of LSNs (history), when talking about the range
|
||||||
|
of keys and LSNs as a two dimensional space.
|
||||||
|
|
||||||
|
## Implementation
|
||||||
|
|
||||||
|
### Key sharding vs. LSN sharding
|
||||||
|
|
||||||
|
When we think of sharding across the two dimensional key/lsn space, this is an
|
||||||
|
opportunity to think about how the two dimensions differ:
|
||||||
|
- Sharding the key space distributes the _write_ workload of ingesting data
|
||||||
|
and compacting. This work must be carefully managed so that exactly one
|
||||||
|
node owns a given key.
|
||||||
|
- Sharding the LSN space distributes the _historical read_ workload. This work
|
||||||
|
can be done by anyone without any special coordination, as long as they can
|
||||||
|
see the remote index and layers.
|
||||||
|
|
||||||
|
The key sharding is the harder part, and also the more urgent one, to support larger
|
||||||
|
capacity databases. Because distributing historical LSN read work is a relatively
|
||||||
|
simpler problem that most users don't have, we defer it to future work. It is anticipated
|
||||||
|
that some quite simple P2P offload model will enable distributing work for historical
|
||||||
|
reads: a node which is low on space can call out to peer to ask it to download and
|
||||||
|
serve reads from a historical layer.
|
||||||
|
|
||||||
|
### Key mapping scheme
|
||||||
|
|
||||||
|
Having decided to focus on key sharding, we must next decide how we will map
|
||||||
|
keys to shards.
|
||||||
|
|
||||||
|
It is proposed to use a "wide striping" approach, to obtain a good compromise
|
||||||
|
between data locality and avoiding entire large relations mapping to the same shard.
|
||||||
|
|
||||||
|
The mapping is quite simple:
|
||||||
|
- Define a stripe size, such as 256MiB. Map this to a key count, such that a contiguous
|
||||||
|
range of 256MiB keys would all fall into this stripe, i.e. divide by 8kiB to get 32k.
|
||||||
|
- Map a key to a stripe by integer division.
|
||||||
|
- Map a stripe to a shard by taking the shard index modulo the shard count.
|
||||||
|
|
||||||
|
This scheme will achieve a good balance as long as there is no aliasing of the keys
|
||||||
|
to the stripe width. In the example above, if someone had 4 shards and wrote
|
||||||
|
keys that were all 4*32k apart, they would all map to the same shard. However, we do
|
||||||
|
not have to worry about this, since end users do not control page numbers: as long as
|
||||||
|
we do not pick stripe sizes that map to any problematic postgres behaviors, we'll be fine.
|
||||||
|
|
||||||
|
### Important Types
|
||||||
|
|
||||||
|
#### `ShardMap`
|
||||||
|
|
||||||
|
Provides all the information needed to route a request for a particular
|
||||||
|
key to the correct pageserver:
|
||||||
|
- Stripe size
|
||||||
|
- Shard count
|
||||||
|
- Address of the pageserver hosting each shard
|
||||||
|
|
||||||
|
This structure's size is linear with the number of shards.
|
||||||
|
|
||||||
|
#### `ShardIdentity`
|
||||||
|
|
||||||
|
Provides the information needed to know whether a particular key belongs
|
||||||
|
to a particular shard:
|
||||||
|
- Stripe size
|
||||||
|
- Shard count
|
||||||
|
- Shard index
|
||||||
|
|
||||||
|
This structure's size is constant.
|
||||||
|
|
||||||
|
### Pageserver changes
|
||||||
|
|
||||||
|
Everywhere the Pageserver currently deals with Tenants, it will move to dealing with
|
||||||
|
TenantShards, which are just a `Tenant` plus a `ShardIdentity` telling it which part
|
||||||
|
of the keyspace it owns.
|
||||||
|
|
||||||
|
When the pageserver subscribes to a safekeeper for WAL updates, it must provide
|
||||||
|
its `ShardIdentity` to receive the relevant subset of the WAL.
|
||||||
|
|
||||||
|
When the pageserver writes layers and index_part.json to remote storage, it must
|
||||||
|
include the shard index & count in the name, to avoid collisions (the count is
|
||||||
|
necessary for future-proofing: the count will vary in time). These keys
|
||||||
|
will also include a generation number: the [generation numbers](025-generation-numbers.md) system will work
|
||||||
|
exactly the same for TenantShards as it does for Tenants today: each shard will have
|
||||||
|
its own generation number.
|
||||||
|
|
||||||
|
The pageserver doesn't have to do anything special during ingestion, compaction
|
||||||
|
or GC. It is implicitly operating on the subset of keys that map to its ShardIdentity.
|
||||||
|
This will result in sparse layer files, containing keys only in the stripes that this
|
||||||
|
shard owns. Where optimizations currently exist in compaction for spotting "gaps" in
|
||||||
|
the key range, these should be updated to ignore gaps that are due to sharding, to
|
||||||
|
avoid spuriously splitting up layers ito stripe-sized pieces.
|
||||||
|
|
||||||
|
### Pageserver Controller changes
|
||||||
|
|
||||||
|
The pageserver controller is a new component, which is responsible for abstracting
|
||||||
|
away the business of managing individual tenant placement on pagservers. It will
|
||||||
|
also act as the abstraction on top of sharding, so that the control plane continue
|
||||||
|
to see a Tenant as a single object, even though the reality is that it is many
|
||||||
|
TenantShards.
|
||||||
|
|
||||||
|
For the rest of this RFC, think of the Pageserver Controller as a component of
|
||||||
|
the control plane. The actual implementation is beyond the scope of this RFC
|
||||||
|
and will be described in more detail elsewhere.
|
||||||
|
|
||||||
|
### Safekeeper changes
|
||||||
|
|
||||||
|
The safekeeper's API for subscribing to a WAL will be extended to enable callers
|
||||||
|
to provide a `ShardIdentity`. In this mode it will only send WAL entries that
|
||||||
|
fall within the keyspace belonging to the shard, and WAL entries that are to
|
||||||
|
be mirrored to all shards.
|
||||||
|
|
||||||
|
Metadata updates describing databases+relations are mirrored to
|
||||||
|
all shards, and other WAL messages are only provided to the shard
|
||||||
|
that owns the key being updated. For any operation that updates multiple
|
||||||
|
keys, it will be provided to all the shards whose key ranges intersect with
|
||||||
|
one or more of the keys referenced in the WAL message.
|
||||||
|
|
||||||
|
### Pageserver Controller
|
||||||
|
|
||||||
|
### Endpoints
|
||||||
|
|
||||||
|
Compute endpoints will need to:
|
||||||
|
- Accept a ShardMap as part of their configuration from the control plane
|
||||||
|
- Route pageserver requests according to that ShardMap
|
||||||
|
|
||||||
|
### Control Plane
|
||||||
|
|
||||||
|
#### Publishing ShardMap updates
|
||||||
|
|
||||||
|
The control plane will provide an API for the pageserver controller to publish updates
|
||||||
|
to the ShardMap for a tenant. When such an update is provided, it will be used to
|
||||||
|
update the configuration of any endpoints currently active for the tenant.
|
||||||
|
|
||||||
|
The ShardMap will be opaque to the Control Plane: it doesn't need to do anything with it
|
||||||
|
other than storing and passing on to endpoints.
|
||||||
|
|
||||||
|
#### Attaching via the Pageserver Controller
|
||||||
|
|
||||||
|
The Control Plane will issue attach/create API calls to the pageserver controller
|
||||||
|
instead of directly to pageservers. This will relieve the control plane of the need
|
||||||
|
to know about sharding.
|
||||||
|
|
||||||
|
#### Enabling sharding for large tenants
|
||||||
|
|
||||||
|
When a Tenant is created, it is up to the control plane to provide a hint to
|
||||||
|
the pageserver about how large it will be. This may be implemented as a service tier,
|
||||||
|
where users creating very large databases would be onboarded to the tier, and then
|
||||||
|
the Tenants they create would be created with a larger number of shards. For the
|
||||||
|
general population of users we should continue to use 1 shard by default.
|
||||||
|
|
||||||
|
## Next Steps
|
||||||
|
|
||||||
|
Clearly, the mechanism described in this RFC has substantial limitations:
|
||||||
|
- A) the number of shards in a tenant is defined at creation time.
|
||||||
|
- B) data is not distributed across the LSN dimension
|
||||||
|
|
||||||
|
To address `A`, a _splitting_ feature will later be added. One shard can split its
|
||||||
|
data into a number of children by doing a special compaction operation to generate
|
||||||
|
image layers broken up child-shard-wise, and then writing out an index_part.json for
|
||||||
|
each child. This will then require coordination with the pageserver controller to
|
||||||
|
safely attach these new child shards and then move them around to distribute work.
|
||||||
|
The opposite _merging_ operation can also be imagined, but is unlikely to be implemented:
|
||||||
|
once a Tenant has been sharded, there is little value in merging it again.
|
||||||
|
|
||||||
|
To address `B`, it is envisaged to have some gossip mechanism for pageservers to communicate
|
||||||
|
about their workload, and then a getpageatlsn offload mechanism where one pageserver can
|
||||||
|
ask another to go read the necessary layers from remote storage to serve the read. This
|
||||||
|
requires relativly little coordination because it is read-only: any node can service any
|
||||||
|
read. All reads to a particular shard would still flow through one node, but the
|
||||||
|
disk capactity & I/O impact of servicing the read would be distributed.
|
||||||
|
|
||||||
|
## FAQ/Alternatives
|
||||||
|
|
||||||
|
### Why stripe the data, rather than using contiguous ranges of keyspace for each shard?
|
||||||
|
|
||||||
|
When a database is growing under a write workload, writes may predominantly hit the
|
||||||
|
end of the keyspace, creating a bandwidth hotspot on that shard. Similarly, if the user
|
||||||
|
is intensively re-writing a particular relation, if that relation lived in a particular
|
||||||
|
shard then it would not achieve our goal of distributing the write work across shards.
|
||||||
|
|
||||||
|
### Why not proxy read requests through one pageserver, so that endpoints don't have to change?
|
||||||
|
|
||||||
|
Two reasons:
|
||||||
|
1. This would not achieve scale-out of network bandwidth: a busy tenant with a large
|
||||||
|
database would still cause a load hotspot on the pageserver routing its read requests.
|
||||||
|
2. Implementing a proxy model as a stop-gap would not be a cheap option, because
|
||||||
|
it requires making pageservers aware of their peers, and adding synchronisation to
|
||||||
|
keep pageservers aware of their peers as they come and go.
|
||||||
119
docs/rfcs/030-pageserver-controller-phase1.md
Normal file
119
docs/rfcs/030-pageserver-controller-phase1.md
Normal file
@@ -0,0 +1,119 @@
|
|||||||
|
# Pageserver Controller Phase 1: Generations
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
In the [generation numbers RFC](025-generation-numbers.md), it was proposed that
|
||||||
|
the console/control plane would act as the central coordinator for issuing generation
|
||||||
|
numbers.
|
||||||
|
|
||||||
|
That approach has not proven practical, so this RFC proposes an alternative implementation
|
||||||
|
where generation numbers are managed in a different service.
|
||||||
|
|
||||||
|
Calls to generation-aware pageserver APIs like create/attach will call out to this
|
||||||
|
new _pageserver controller_ to acquire generation numbers. This service will also
|
||||||
|
form the basis for satisfying future pageserver management requirements, such as
|
||||||
|
coordinating sharding, doing automatic capacity balancing, and many more.
|
||||||
|
|
||||||
|
## Motivation
|
||||||
|
|
||||||
|
This is a dependency for delivering high availability.
|
||||||
|
|
||||||
|
### Prior art
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
## Requirements
|
||||||
|
|
||||||
|
- Provide a hook for the pageserver to use when it receives an attach/create/load API
|
||||||
|
call, which will yield a generation that is safe for the pageserver to use.
|
||||||
|
- Implement the /re-attach and /validate APIs required for the generation numbers feature
|
||||||
|
to work.
|
||||||
|
|
||||||
|
## Non Goals
|
||||||
|
|
||||||
|
- This is not intended to interact with any components other than the pageserver, or
|
||||||
|
to integrate with the broader control plane in any way.
|
||||||
|
|
||||||
|
## Impacted Components
|
||||||
|
|
||||||
|
pageserver, pageserver controller (new)
|
||||||
|
|
||||||
|
## Implementation
|
||||||
|
|
||||||
|
We may start from the minimal `attachment_service` used in automated tests.
|
||||||
|
|
||||||
|
### Data store
|
||||||
|
|
||||||
|
For generation numbers, we need a persistent, linearizable data store. Postgres is sufficient for
|
||||||
|
this: we already have postgres instances used for other control plane work.
|
||||||
|
|
||||||
|
The storage for the Pageserver Controller will be independent of other components:
|
||||||
|
it might use the same physical database server but would use an independent database.
|
||||||
|
|
||||||
|
### Deployment
|
||||||
|
|
||||||
|
There will be one instance per region. In future we would aim to define the concept
|
||||||
|
of a pageserver cluster and have one controller per cluster, but in the short term
|
||||||
|
one per region will be functionally okay for current scale.
|
||||||
|
|
||||||
|
The pageserver controller will be deployed within kubernetes, in the same way as
|
||||||
|
the storage broker (which is currently via a [helm chart](https://github.com/neondatabase/helm-charts/tree/main/charts/neon-storage-broker)).
|
||||||
|
|
||||||
|
### Security
|
||||||
|
|
||||||
|
The pageserver controller's API will do authentication with JWT, the same as
|
||||||
|
the pageserver's existing API.
|
||||||
|
|
||||||
|
### Correctness
|
||||||
|
|
||||||
|
It is essential that pageservers call into the controller at the _very start_ of
|
||||||
|
handling attach/create/load API requests. They should not do any work at all until
|
||||||
|
they have acquired that generation number.
|
||||||
|
|
||||||
|
If the call fails, they must retry: it is not safe to proceed without a generation number.
|
||||||
|
|
||||||
|
## Future
|
||||||
|
|
||||||
|
Having a call chain that goes `Control plane -> Pageserver -> Pageserver controller`
|
||||||
|
is clearly a little strange: we are only doing this to avoid needing to make changes
|
||||||
|
to the control plane.
|
||||||
|
|
||||||
|
In future, we will change the control plane to call directly into the pageserver
|
||||||
|
controller, which would then call onwards into the pageserver. This would be a fairly
|
||||||
|
small change to the controller, since all the logic around storing and updating
|
||||||
|
generation numbers would stay the same: just the behavior of the API frontend
|
||||||
|
would be different.
|
||||||
|
|
||||||
|
The work to enable pageservers to communicate with the controller is not wasted,
|
||||||
|
because they still communicate in that direction when invoking `/re-attach`
|
||||||
|
and `/validate`
|
||||||
|
|
||||||
|
## Alternatives considered
|
||||||
|
|
||||||
|
### Run in the console/control plane codebase
|
||||||
|
|
||||||
|
The control plane is a large Go codebase that uses extensive code generation, and
|
||||||
|
has to be quite generic to manage many different types of component.
|
||||||
|
|
||||||
|
### Direct DB access
|
||||||
|
|
||||||
|
We could have pageservers call directly into a shared database to acquire and update
|
||||||
|
generation numbers (with carefully crafted transactions to protect against concurrent
|
||||||
|
attaches getting the same generation, etc).
|
||||||
|
|
||||||
|
Pros:
|
||||||
|
- No extra service required, simpler deployment
|
||||||
|
|
||||||
|
Cons:
|
||||||
|
- No future path to a cleaner architecture: the pageserver controller can be implemented
|
||||||
|
as an extensible place for implement more functionality in future, whereas a mechanism
|
||||||
|
to do generation numbers via SQL queries from the pageserver would be specialized
|
||||||
|
and the code would probably be disposed of in the relatively near future.
|
||||||
|
- Puts onus entirely on SQL query correctness to mediate concurrent access.
|
||||||
|
The pageserver controller also has to be correct in this respect in case there
|
||||||
|
is more than one instance running, but it is much less likely to hit this path,
|
||||||
|
so the overall risk of issues is lower when using a central service.
|
||||||
|
|
||||||
|
|
||||||
|
The main downside to that approach is that it doesn't provide the future path that
|
||||||
|
the pageserver controller does
|
||||||
Reference in New Issue
Block a user