docs: pageserver controller rfc

docs: sharding phase 1 RFC
clarifications
2026-06-19 13:20:37 +00:00 · 2023-09-29 18:24:24 +01:00 · 2023-09-29 18:20:13 +01:00 · 2023-09-27 10:38:52 +01:00 · 2023-09-27 10:38:52 +01:00 · 2023-09-27 10:38:41 +01:00
3 changed files with 962 additions and 0 deletions
--- a/docs/rfcs/028-pageserver-migration.md
+++ b/docs/rfcs/028-pageserver-migration.md
@@ -0,0 +1,599 @@
 # Seamless tenant migration
 - Author: john@neon.tech
 - Created on 2023-08-11
 - Implemented on ..
 ## Summary
 The preceding [generation numbers RFC](025-generation-numbers.md) may be thought of as "making tenant
 migration safe". Following that,
 this RFC is about how those migrations are to be done:
 1. Seamlessly (without interruption to client availability)
 2. Quickly (enabling faster operations)
 3. Efficiently (minimizing I/O and $ cost)
 These points are in priority order: if we have to sacrifice
 efficiency to make a migration seamless for clients, we will
 do so, etc.
 This is accomplished by introducing two high level changes:
 - A dual-attached state for tenants, used in a control-plane-orchestrated
  migration procedure that preserves availability during a migration.
 - Warm secondary locations for tenants, where on-disk content is primed
  for a fast migration of the tenant from its current attachment to this
  secondary location.
 ## Motivation
 Migrating tenants between pageservers is essential to operating a service
 at scale, in several contexts:
 1. Responding to a pageserver node failure by migrating tenants to other pageservers
 2. Balancing load and capacity across pageservers, for example when a user expands their
   database and they need to migrate to a pageserver with more capacity.
 3. Restarting pageservers for upgrades and maintenance
 The current situation steps for migration are:
 - detach from old node; skip if old node is dead; (the [skip part is still WIP](https://github.com/neondatabase/cloud/issues/5426)).
 - attach to new node
 - re-configure endpoints to use the new node
 Once [generation numbers](025-generation-numbers.md) are implemented,
 the detach step is no longer critical for correctness. So, we can
 - attach to a new node,
 - re-configure endpoints to use the new node, and then
 - detach from the old node.
 However, this still does not meet our seamless/fast/efficient goals:
 - Not fast: The new node will have to download potentially large amounts
  of data from S3, which may take many minutes.
 - Not seamless: If we attach to a new pageserver before detaching an old one,
  the new one might delete some objects that interrupt availability of reads on the old one.
 - Not efficient: the old pageserver will continue uploading
  S3 content during the migration that will never be read.
 The user expectations for availability are:
 - For planned maintenance, there should be zero availability
  gap. This expectation is fulfilled by this RFC.
 - For unplanned changes (e.g. node failures), there should be
  minimal availability gap. This RFC provides the _mechanism_
  to fail over quickly, but does not provide the failure _detection_
  nor failover _policy_.
 ## Non Goals
 - Defining service tiers with different storage strategies: the same
  level of HA & overhead will apply to all tenants. This doesn't rule out
  adding such tiers in future.
 - Enabling pageserver failover in the absence of a control plane: the control
  plane will remain the source of truth for what should be attached where.
 - Totally avoiding availability gaps on unplanned migrations during
  a failure (we expect a small, bounded window of
  read unavailability of very recent LSNs)
 - Workload balancing: this RFC defines the mechanism for moving tenants
  around, not the higher level logic for deciding who goes where.
 - Defining all possible configuration flows for tenants: the migration process
  defined in this RFC demonstrates the sufficiency of the pageserver API, but
  is not the only kind of configuration change the control plane will ever do.
  The APIs defined here should let the control plane move tenants around in
  whatever way is needed while preserving data safety and read availability.
 ## Impacted components
 Pageserver, control plane
 ## Terminology
 - **Attachment**: a tenant is _attached_ to a pageserver if it has
  been issued a generation number, and is running an instance of
  the `Tenant` type, ingesting the WAL, and available to serve
  page reads.
 - **Location**: locations are a superset of attachments. A location
  is a combination of a tenant and a pageserver. We may _attach_ at a _location_.
 - **Secondary location**: a location which is not currently attached.
 - **Warm secondary location**: a location which is not currently attached, but is endeavoring to maintain a warm local cache of layers. We avoid calling this a _warm standby_ to avoid confusion with similar postgres features.
 ## Implementation (high level)
 ### Warm secondary locations
 To enable faster migrations, we will identify at least one _secondary location_
 for each tenant. This secondary location will keep a warm cache of layers
 for the tenant, so that if it is later attached, it can catch up with the
 latest LSN quickly: rather than downloading everything, it only has to replay
 the recent part of the WAL to advance from the remote_consistent_offset to the
 most recent LSN in the WAL.
 The control plane is responsible for selecting secondary locations, and
 calling into pageservers to configure tenants into a secondary mode at this
 new location, as well as attaching the tenant in its existing primary location.
 The attached pageserver for a tenant will publish a [layer heatmap](#layer-heatmap)
 to advise secondaries of which layers should be downloaded.
 ### Location modes
 Currently, we consider a tenant to be in one of two states on a pageserver:
 - Attached: active `Tenant` object, and layers on local disk
 - Detached: no layers on local disk, no runtime state.
 We will extend this with finer-grained modes, whose purpose will become
 clear in later sections:
 - **AttachedSingle**: equivalent the existing attached state.
 - **AttachedMulti**: like AttachedSingle, holds an up to date generation, but
  does not do deletions.
 - **AttachedStale**: like AttachedSingle, holds a stale generation,
  do not do any remote storage operations.
 - **Secondary**: keep local state on disk, periodically update from S3.
 - **Detached**: equivalent to existing detached state.
 To control these finer grained states, a new pageserver API endpoint will be added.
 ### Cutover procedure
 Define old location and new location as "Node A" and "Node B". Consider
 the case where both nodes are available, and Node B was previously configured
 as a secondary location for the tenant we are migrating.
 The cutover procedure is orchestrated by the control plane, calling into
 the pageservers' APIs:
 1. Call to Node A requesting it to flush to S3 and enter AttachedStale state
 2. Increment generation, and call to Node B requesting it to enter AttachedMulti
   state with the new generation.
 3. Call to Node B, requesting it to download the latest hot layers from remote storage,
   according to the latest heatmap flushed by Node A.
 4. Wait for Node B's WAL ingestion to catch up with node A's
 5. Update endpoints to use node B instead of node A
 6. Call to node B requesting it to enter state AttachedSingle.
 7. Call to node A requesting it to enter state Secondary
 The following table summarizes how the state of the system advances:
 |     Step      |     Node A     |     Node B     | Node used by endpoints |
 | :-----------: | :------------: | :------------: | :--------------------: |
 | 1 (_initial_) | AttachedSingle |   Secondary    |           A            |
 |       2       | AttachedStale  | AttachedMulti  |           A            |
 |       3       | AttachedStale  | AttachedMulti  |           A            |
 |       4       | AttachedStale  | AttachedMulti  |           A            |
 | 5 (_cutover_) | AttachedStale  | AttachedMulti  |           B            |
 |       6       | AttachedStale  | AttachedSingle |           B            |
 |  7 (_final_)  |   Secondary    | AttachedSingle |           B            |
 The procedure described for a clean handover from a live node to a secondary
 is also used for failure cases and for migrations to a location that is not
 configured as a secondary, by simply skipping irrelevant steps, as described in
 the following sections.
 #### Migration from an unresponsive node
 If node A is unavailable, then all calls into
 node A are skipped and we don't wait for B to catch up before
 switching updating the endpoints to use B.
 #### Migration to a location that is not a secondary
 If node B is initially in Detached state, the procedure is identical. Since Node B
 is coming from a Detached state rather than Secondary, the download of layers and
 catch up with WAL will take much longer.
 We might do this if:
 - Attached and secondary locations are both critically low on disk, and we need
  to migrate to a third node with more resources available.
 - We are migrating a tenant which does not use secondary locations to save on cost.
 #### Permanent migration away from a node
 In the final step of the migration, we generally request the original node to enter a Secondary
 state. This is typical if we are doing a planned migration during maintenance, or to
 balance CPU/network load away from a node.
 One might also want to permanently migrate away: this can be done by simply removing the secondary
 location after the migration is complete, or as an optimization by substituting the Detached state
 for the Secondary state in the final step.
 #### Cutover diagram
 ```mermaid
 sequenceDiagram
 participant CP as Control plane
 participant A as Node A
 participant B as Node B
 participant E as Endpoint
 CP->>A: PUT Flush & go to AttachedStale
 note right of A: A continues to ingest WAL
 CP->>B: PUT AttachedMulti
 CP->>B: PUT Download layers from latest heatmap
 note right of B: B downloads from S3
 loop Poll until download complete
 CP->>B: GET download status
 end
 activate B
 note right of B: B ingests WAL
 loop Poll until catch up
 CP->>B: GET visible WAL
 CP->>A: GET visible WAL
 end
 deactivate B
 CP->>E: Configure to use Node B
 E->>B: Connect for reads
 CP->>B: PUT AttachedSingle
 CP->>A: PUT Secondary
 ```
 #### Cutover from an unavailable pageserver
 This case is far simpler: we may skip straight to our intended
 end state.
 ```mermaid
 sequenceDiagram
 participant A as Node A
 participant CP as Control plane
 participant B as Node B
 participant E as Endpoint
 note right of A: Node A offline
 activate A
 CP->>B: PUT AttachedSingle
 CP->>E: Configure to use Node B
 E->>B: Connect for reads
 deactivate A
 ```
 ## Implementation (detail)
 ### Purpose of AttachedMulti, AttachedStale
 #### AttachedMulti
 Ordinarily, an attached pageserver whose generation is the latest may delete
 layers at will (e.g. during compaction). If a previous generation pageserver
 is also still attached, and in use by endpoints, then this layer deletion could
 lead to a loss of availability for the endpoint when reading from the previous
 generation pageserver.
 The _AttachedMulti_ state simply disables deletions. These will be enqueued
 in `RemoteTimelineClient` until the control plane transitions the
 node into AttachedSingle, which unblocks deletions.  Other remote storage operations
 such as uploads are not blocked.
 AttachedMulti is not required for data safety, only to preserve availability
 on pageservers running with stale generations.
 A node enters AttachedMulti only when explicitly asked to by the control plane. It should
 only remain in this state for the duration of a migration.
 If a control plane bug leaves
 the node in AttachedMulti for a long time, then we must avoid unbounded memory use from enqueued
 deletions. This may be accomplished simply, by dropping enqueued deletions when some modest
 threshold of delayed deletions (e.g. 10k layers per tenant) is reached. As with all deletions,
 it is safe to skip them, and the leaked objects will be eventually cleaned up by scrub or
 by timeline deletion.
 During AttachedMulti, the Tenant is free to drop layers from local disk in response to
 disk pressure: only the deletion of remote layers is blocked.
 #### AttachedStale
 Currently, a pageserver with a stale generation number will continue to
 upload layers, but be prevented from completing deletions. This is safe, but inefficient: layers uploaded by this stale generation
 will not be read back by future generations of pageservers.
 The _AttachedStale_ state disables S3 uploads. The stale pageserver
 will continue to ingest the WAL and write layers to local disk, but not to
 do any uploads to S3.
 A node may enter AttachedStale in two ways:
 - Explicitly, when control plane calls into the node at the start of a migration.
 - Implicitly, when the node tries to validate some deletions and discovers
  that its generation is stale.
 The AttachedStale state also disables sending consumption metrics from
 that location: it is interpreted as an indication that some other pageserver
 is already attached or is about to be attached, and that new pageserver will
 be responsible for sending consumption metrics.
 #### Disk Pressure & AttachedStale
 Over long periods of time, a tenant location in AttachedStale will accumulate data
 on local disk, as it cannot evict any layers written since it entered the
 AttachStale state. We rely on the control plane to revert the location to
 Secondary or Detached at the end of a migration.
 This scenario is particularly noteworthy when evacuating all tenants on a pageserver:
 since _all_ the attached tenants will go into AttachedStale, we will be doing no
 uploads at all, therefore ingested data will cause disk usage to increase continuously.
 Under nominal conditions, the available disk space on pageservers should be sufficient
 to complete the evacuation before this becomes a problem, but we must also handle
 the case where we hit a low disk situation while in this state.
 The concept of disk pressure already exists in the pageserver: the `disk_usage_eviction_task`
 touches each Tenant when it determines that a low-disk condition requires
 some layer eviction. Having selected layers for eviction, the eviction
 task calls `Timeline::evict_layers`.
 **Safety**: If evict_layers is called while in AttachedStale state, and some of the to-be-evicted
 layers are not yet uploaded to S3, then the block on uploads will be lifted. This
 will result in leaking some objects once a migration is complete, but will enable
 the node to manage its disk space properly: if a node is left with some tenants
 in AttachedStale indefinitely due to a network partition or control plane bug,
 these tenants will not cause a full disk condition.
 ### Warm secondary updates
 #### Layer heatmap
 The secondary location's job is to serve reads **with the same quality of service as the original location
 was serving them around the time of a migration**. This does not mean the secondary
 location needs the whole set of layers: inactive layers that might soon
 be evicted on the attached pageserver need not be downloaded by the
 secondary. A totally idle tenant only needs to maintain enough on-disk
 state to enable a fast cold start (i.e. the most recent image layers are
 typically sufficient).
 To enable this, we introduce the concept of a _layer heatmap_, which
 acts as an advisory input to secondary locations to decide which
 layers to download from S3.
 #### Attached pageserver
 The attached pageserver, if in state AttachedSingle, periodically
 uploads a serialized heat map to S3. It may skip this if there
 is no change since the last time it uploaded (e.g. if the tenant
 is totally idle).
 Additionally, when the tenant is flushed to remote storage prior to a migration
 (the first step in [cutover procedure](#cutover-procedure)), 
 the heatmap is written out. This enables a future attached pageserver
 to get an up to date view when deciding which layers to download.
 #### Secondary location behavior
 Secondary warm locations run a simple loop, implemented separately from
 the main `Tenant` type, which represents attached tenants:
 - Download the layer heatmap
 - Select any "hot enough" layers to download, if there is sufficient
  free disk space.
 - Download layers, if they were not previously evicted (see below)
 - Download the latest index_part.json
 - Check if any layers currently on disk are no longer referenced by
  IndexPart & delete them
 Note that the heatmap is only advisory: if a secondary location has plenty
 of disk space, it may choose to retain layers that aren't referenced
 by the heatmap, as long as they are still referenced by the IndexPart. Conversely,
 if a node is very low on disk space, it might opt to raise the heat threshold required
 to both downloading a layer, until more disk space is available.
 #### Secondary locations & disk pressure
 Secondary locations are subject to eviction on disk pressure, just as
 attached locations are.  For eviction purposes, the access time of a
 layer in a secondary location will be the access time given in the heatmap,
 rather than the literal time at which the local layer file was accessed.
 The heatmap will indicate which layers are in local storage on the attached
 location.  The secondary will always attempt to get back to having that
 set of layers on disk, but to avoid flapping, it will remember the access
 time of the layer it was most recently asked to evict, and layers whose
 access time is below that will not be re-downloaded.
 The resulting behavior is that after a layer is evicted from a secondary
 location, it is only re-downloaded once the attached pageserver accesses
 the layer and uploads a heatmap reflecting that access time.  On a pageserver
 restart, the secondary location will attempt to download all layers in
 the heatmap again, if they are not on local disk.
 This behavior will be slightly different when secondary locations are
 used for "low energy tenants", but that is beyond the scope of this RFC.
 ### Location configuration API
 Currently, the `/tenant/<tenant_id>/config` API defines various
 tunables like compaction settings, which apply to the tenant irrespective
 of which pageserver it is running on.
 A new "location config" structure will be introduced, which defines
 configuration which is per-tenant, but local to a particular pageserver,
 such as the attachment mode and whether it is a secondary.
 The pageserver will expose a new per-tenant API for setting
 the state: `/tenant/<tenant_id>/location/config`.
 Body content:
 ```
 {
  state: 'enum{Detached, Secondary, AttachedSingle, AttachedMulti, AttachedStale}',
  generation: Option<u32>,
  configuration: `Option<TenantConfig>`
  flush: bool
 }
 ```
 Existing `/attach` and `/detach` endpoint will have the same
 behavior as calling `/location/config` with `AttachedSingle` and `Detached`
 states respectively. These endpoints will be deprecated and later
 removed.
 The generation attribute is mandatory for entering `AttachedSingle` or
 `AttachedMulti`.
 The configuration attribute is mandatory when entering any state other
 than `Detached`. This configuration is the same as the body for
 the existing `/tenant/<tenant_id>/config` endpoint.
 The `flush` argument indicates whether the pageservers should flush
 to S3 before proceeding: this only has any effect if the node is
 currently in AttachedSingle or AttachedMulti. This is used
 during the first phase of migration, when transitioning the
 old pageserver to AttachedSingle.
 The `/re-attach` API response will be extended to include a `state` as
 well as a `generation`, enabling the pageserver to enter the
 correct state for each tenant on startup.
 ### Database schema for locations
 A new table `ProjectLocation`:
 - pageserver_id: int
 - tenant_id: TenantId
 - generation: Option<int>
 - state: `enum(Secondary, AttachedSingle, AttachedMulti)`
 Notes:
 - It is legacy for a Project to have zero `ProjectLocation`s
 - The `pageserver` column in `Project` now means "to which pageserver should
  endpoints connect", rather than simply which pageserver is attached.
 - The `generation` column in `Project` remains, and is incremented and used
  to set the generation of `ProjectLocation` rows when they are set into
  an attached state.
 - The `Detached` state is implicitly represented as the absence of
  a `ProjectLocation`.
 ### Executing migrations
 Migrations will be implemented as Go functions, within the
 existing `Operation` framework in the control plane. These
 operations are persistent, such that they will always keep
 trying until completion: this property is important to avoid
 leaving garbage behind on pageservers, such as AttachedStale
 locations.
 ### Recovery from failures during migration
 During migration, the control plane may encounter failures of either
 the original or new pageserver, or both:
 - If the original fails, skip past waiting for the new pageserver
  to catch up, and put it into AttachedSingle immediately.
 - If the new node fails, put the old pageserver into Secondary
  and then back into AttachedSingle (this has the effect of
  retaining on-disk state and granting it a fresh generation number).
 - If both nodes fail, keep trying until one of them is available
  again.
 ### Control plane -> Pageserver reconciliation
 A migration may be done while the old node is unavailable,
 in which case the old node may still be running in an AttachedStale
 state.
 In this case, it is undesirable to have the migration `Operation`
 stay alive until the old node eventually comes back online
 and can be cleaned up. To handle this, the control plane
 should run a background reconciliation process to compare
 a pageserver's attachments with the database, and clean up
 any that shouldn't be there any more.
 Note that there will be no work to do if the old node was really
 offline, as during startup it will call into `/re-attach` and
 be updated that way. The reconciliation will only be needed
 if the node was unavailable but still running.
 ## Alternatives considered
 ### Only enabling secondary locations for tenants on a higher service tier
 This will make sense in future, especially for tiny databases that may be
 downloaded from S3 in milliseconds when needed.
 However, it is not wise to do it immediately, because pageservers contain
 a mixture of higher and lower tier workloads. If we had 1 tenant with
 a secondary location and 9 without, then those other 9 tenants will do
 a lot of I/O as they try to recover from S3, which may degrade the
 service of the tenant which had a secondary location.
 Until we segregate tenant on different service tiers on different pageserver
 nodes, or implement & test QoS to ensure that tenants with secondaries are
 not harmed by tenants without, we should use the same failover approach
 for all the tenants.
 ### Hot secondary locations (continuous WAL replay)
 Instead of secondary locations populating their caches from S3, we could
 have them consume the WAL from safekeepers. The downsides of this would be:
 - Double load on safekeepers, which are a less scalable service than S3
 - Secondary locations' on-disk state would end up subtly different to
  the remote state, which would make synchronizing with S3 more complex/expensive
  when going into attached state.
 The downside of only updating secondary locations from S3 is that we will
 have a delay during migration from replaying the LSN range between what's
 in S3 and what's in the pageserver. This range will be very small on
 planned migrations, as we have the old pageserver flush to S3 immediately
 before attaching the new pageserver. On unplanned migrations (old pageserver
 is unavailable), the range of LSNs to replay is bounded by the flush frequency
 on the old pageserver. However, the migration doesn't have to wait for the
 replay: it's just that not-yet-replayed LSNs will be unavailable for read
 until the new pageserver catches up.
 We expect that pageserver reads of the most recent LSNs will be relatively
 rare, as for an active endpoint those pages will usually still be in the postgres
 page cache: this leads us to prefer synchronizing from S3 on secondary
 locations, rather than consuming the WAL from safekeepers.
 ### Cold secondary locations
 It is not functionally necessary to keep warm caches on secondary locations at all. However, if we do not, then
 we would experience a de-facto availability loss in unplanned migrations, as reads to the new node would take an extremely long time (many seconds, perhaps minutes).
 Warm caches on secondary locations are necessary to meet
 our availability goals.
 ### Pageserver-granularity failover
 Instead of migrating tenants individually, we could have entire spare nodes,
 and on a node death, move all its work to one of these spares.
 This approach is avoided for several reasons:
 - we would still need fine-grained tenant migration for other
  purposes such as balancing load
 - by sharing the spare capacity over many peers rather than one spare node,
  these peers may use the capacity for other purposes, until it is needed
  to handle migrated tenants. e.g. for keeping a deeper cache of their
  attached tenants.
 ### Readonly during migration
 We could simplify migrations by making both previous and new nodes go into a
 readonly state, then flush remote content from the previous node, then activate
 attachment on the secondary node.
 The downside to this approach is a potentially large gap in readability of
 recent LSNs while loading data onto the new node. To avoid this, it is worthwhile
 to incur the extra cost of double-replaying the WAL onto old and new nodes' local
 storage during a migration.
 ### Peer-to-peer pageserver communication
 Rather than uploading the heatmap to S3, attached pageservers could make it
 available to peers.
 Currently, pageservers have no peer to peer communication, so adding this
 for heatmaps would incur significant overhead in deployment and configuration
 of the service, and ensuring that when a new pageserver is deployed, other
 pageservers are updated to be aware of it.
 As well as simplifying implementation, putting heatmaps in S3 will be useful
 for future analytics purposes -- gathering aggregated statistics on activity
 pattersn across many tenants may be done directly from data in S3.
--- a/docs/rfcs/029-sharding-phase1.md
+++ b/docs/rfcs/029-sharding-phase1.md
@@ -0,0 +1,244 @@
 # Sharding Phase 1: Static Key-space Sharding
 ## Summary
 To enable databases with sizes approaching the capacity of a pageserver's disk,
 it is necessary to break up the storage for the database, or _shard_ it.
 Sharding in general is a complex area.  This RFC aims to define a modest initial
 capability that will permit creating large-capacity databases using a static configuration
 defined at time of Tenant creation.
 ## Motivation
 Currently, all data for a Tenant, including all its timelines, is stored on a single
 pageserver.  The local storage required may be several times larger than the actual
 database size, due to LSM write inflation.
 If a database is larger than what one pageserver can hold, then it becomes impossible
 for the pageserver to hold it in local storage, as it must do to provide service to
 clients.
 ### Prior art
 Numerous: sharding is a long-discussed feature for the pageserver.
 Prior art in other distributed systems is too broad to capture here: pretty much
 any scale out storage system does something like this.
 ## Requirements
 - Enable creating a large (for example, 16TiB) database without requiring dedicated
  pageserver nodes.
 - Share read/write bandwidth costs for large databases across pageservers, as well
  as storage capacity, in order to avoid large capacity databases acting as I/O hotspots
  that disrupt service to other tenants.
 - Our data distribution scheme should handle sparse/nonuniform keys well, since postgres
  does not write out a single contiguous ranges of page numbers.
 *Note: the definition of 'large database' is arbitrary, but the lower bound is to ensure that a database
 that a user might create on a current-gen enterprise SSD should also work well on
 Neon.  The upper bound is whatever postgres can handle: i.e. we must make sure that the
 pageserver backend is not the limiting factor in the database size*.
 ## Non Goals
 - Independently distributing timelines within the same tenant.  If a tenant has many
  timelines, then sharding may be a less efficient mechanism for distributing load than
  sharing out timelines between pageservers.
 - Distributing work in the LSN dimension: this RFC focuses on the Key dimension only,
  based on the idea that separate mechanisms will make sense for each dimension.
 ## Impacted Components
 pageserver, control plane, safekeeper (optional)
 ## Terminology
 **Key**: a postgres page number.  In the sense that the pageserver is a versioned key-value store,
 the page number is the key in that store.
 **LSN dimension**: this just means the range of LSNs (history), when talking about the range
 of keys and LSNs as a two dimensional space.
 ## Implementation
 ### Key sharding vs. LSN sharding
 When we think of sharding across the two dimensional key/lsn space, this is an
 opportunity to think about how the two dimensions differ:
 - Sharding the key space distributes the _write_ workload of ingesting data
  and compacting.  This work must be carefully managed so that exactly one
  node owns a given key.
 - Sharding the LSN space distributes the _historical read_ workload.  This work
  can be done by anyone without any special coordination, as long as they can
  see the remote index and layers.
 The key sharding is the harder part, and also the more urgent one, to support larger
 capacity databases.  Because distributing historical LSN read work is a relatively
 simpler problem that most users don't have, we defer it to future work.  It is anticipated
 that some quite simple P2P offload model will enable distributing work for historical
 reads: a node which is low on space can call out to peer to ask it to download and
 serve reads from a historical layer.
 ### Key mapping scheme
 Having decided to focus on key sharding, we must next decide how we will map
 keys to shards.
 It is proposed to use a "wide striping" approach, to obtain a good compromise
 between data locality and avoiding entire large relations mapping to the same shard.
 The mapping is quite simple:
 - Define a stripe size, such as 256MiB.  Map this to a key count, such that a contiguous
  range of 256MiB keys would all fall into this stripe, i.e. divide by 8kiB to get 32k.
 - Map a key to a stripe by integer division.
 - Map a stripe to a shard by taking the shard index modulo the shard count.
 This scheme will achieve a good balance as long as there is no aliasing of the keys
 to the stripe width.  In the example above, if someone had 4 shards and wrote
 keys that were all 4*32k apart, they would all map to the same shard.  However, we do
 not have to worry about this, since end users do not control page numbers: as long as
 we do not pick stripe sizes that map to any problematic postgres behaviors, we'll be fine.
 ### Important Types
 #### `ShardMap`
 Provides all the information needed to route a request for a particular
 key to the correct pageserver:
 - Stripe size
 - Shard count
 - Address of the pageserver hosting each shard
 This structure's size is linear with the number of shards.
 #### `ShardIdentity`
 Provides the information needed to know whether a particular key belongs
 to a particular shard:
 - Stripe size
 - Shard count
 - Shard index
 This structure's size is constant.
 ### Pageserver changes
 Everywhere the Pageserver currently deals with Tenants, it will move to dealing with
 TenantShards, which are just a `Tenant` plus a `ShardIdentity` telling it which part
 of the keyspace it owns.
 When the pageserver subscribes to a safekeeper for WAL updates, it must provide
 its `ShardIdentity` to receive the relevant subset of the WAL.
 When the pageserver writes layers and index_part.json to remote storage, it must
 include the shard index & count in the name, to avoid collisions (the count is
 necessary for future-proofing: the count will vary in time).  These keys
 will also include a generation number: the [generation numbers](025-generation-numbers.md) system will work
 exactly the same for TenantShards as it does for Tenants today: each shard will have
 its own generation number.
 The pageserver doesn't have to do anything special during ingestion, compaction
 or GC.  It is implicitly operating on the subset of keys that map to its ShardIdentity.
 This will result in sparse layer files, containing keys only in the stripes that this
 shard owns.  Where optimizations currently exist in compaction for spotting "gaps" in
 the key range, these should be updated to ignore gaps that are due to sharding, to
 avoid spuriously splitting up layers ito stripe-sized pieces.
 ### Pageserver Controller changes
 The pageserver controller is a new component, which is responsible for abstracting
 away the business of managing individual tenant placement on pagservers.  It will
 also act as the abstraction on top of sharding, so that the control plane continue
 to see a Tenant as a single object, even though the reality is that it is many
 TenantShards.
 For the rest of this RFC, think of the Pageserver Controller as a component of
 the control plane.  The actual implementation is beyond the scope of this RFC
 and will be described in more detail elsewhere.
 ### Safekeeper changes
 The safekeeper's API for subscribing to a WAL will be extended to enable callers
 to provide a `ShardIdentity`.  In this mode it will only send WAL entries that
 fall within the keyspace belonging to the shard, and WAL entries that are to
 be mirrored to all shards.
 Metadata updates describing databases+relations are mirrored to
 all shards, and other WAL messages are only provided to the shard
 that owns the key being updated.  For any operation that updates multiple
 keys, it will be provided to all the shards whose key ranges intersect with
 one or more of the keys referenced in the WAL message.
 ### Pageserver Controller
 ### Endpoints
 Compute endpoints will need to:
 - Accept a ShardMap as part of their configuration from the control plane
 - Route pageserver requests according to that ShardMap
 ### Control Plane
 #### Publishing ShardMap updates
 The control plane will provide an API for the pageserver controller to publish updates
 to the ShardMap for a tenant.  When such an update is provided, it will be used to
 update the configuration of any endpoints currently active for the tenant.
 The ShardMap will be opaque to the Control Plane: it doesn't need to do anything with it
 other than storing and passing on to endpoints.
 #### Attaching via the Pageserver Controller
 The Control Plane will issue attach/create API calls to the pageserver controller
 instead of directly to pageservers.  This will relieve the control plane of the need
 to know about sharding.
 #### Enabling sharding for large tenants
 When a Tenant is created, it is up to the control plane to provide a hint to
 the pageserver about how large it will be.  This may be implemented as a service tier,
 where users creating very large databases would be onboarded to the tier, and then
 the Tenants they create would be created with a larger number of shards.  For the
 general population of users we should continue to use 1 shard by default.
 ## Next Steps
 Clearly, the mechanism described in this RFC has substantial limitations:
 - A) the number of shards in a tenant is defined at creation time.
 - B) data is not distributed across the LSN dimension
 To address `A`, a _splitting_ feature will later be added.  One shard can split its
 data into a number of children by doing a special compaction operation to generate
 image layers broken up child-shard-wise, and then writing out an index_part.json for
 each child.  This will then require coordination with the pageserver controller to
 safely attach these new child shards and then move them around to distribute work.
 The opposite _merging_ operation can also be imagined, but is unlikely to be implemented:
 once a Tenant has been sharded, there is little value in merging it again.
 To address `B`, it is envisaged to have some gossip mechanism for pageservers to communicate
 about their workload, and then a getpageatlsn offload mechanism where one pageserver can
 ask another to go read the necessary layers from remote storage to serve the read.  This
 requires relativly little coordination because it is read-only: any node can service any
 read.  All reads to a particular shard would still flow through one node, but the
 disk capactity & I/O impact of servicing the read would be distributed.
 ## FAQ/Alternatives
 ### Why stripe the data, rather than using contiguous ranges of keyspace for each shard?
 When a database is growing under a write workload, writes may predominantly hit the
 end of the keyspace, creating a bandwidth hotspot on that shard.  Similarly, if the user
 is intensively re-writing a particular relation, if that relation lived in a particular
 shard then it would not achieve our goal of distributing the write work across shards.
 ### Why not proxy read requests through one pageserver, so that endpoints don't have to change?
 Two reasons:
 1. This would not achieve scale-out of network bandwidth: a busy tenant with a large
   database would still cause a load hotspot on the pageserver routing its read requests. 
 2. Implementing a proxy model as a stop-gap would not be a cheap option, because
   it requires making pageservers aware of their peers, and adding synchronisation to
   keep pageservers aware of their peers as they come and go.
--- a/docs/rfcs/030-pageserver-controller-phase1.md
+++ b/docs/rfcs/030-pageserver-controller-phase1.md
@@ -0,0 +1,119 @@
 # Pageserver Controller Phase 1: Generations
 ## Summary
 In the [generation numbers RFC](025-generation-numbers.md), it was proposed that
 the console/control plane would act as the central coordinator for issuing generation
 numbers.
 That approach has not proven practical, so this RFC proposes an alternative implementation
 where generation numbers are managed in a different service.
 Calls to generation-aware pageserver APIs like create/attach will call out to this
 new _pageserver controller_ to acquire generation numbers.  This service will also
 form the basis for satisfying future pageserver management requirements, such as
 coordinating sharding, doing automatic capacity balancing, and many more.
 ## Motivation
 This is a dependency for delivering high availability.
 ### Prior art
 None
 ## Requirements
 - Provide a hook for the pageserver to use when it receives an attach/create/load API
  call, which will yield a generation that is safe for the pageserver to use.
 - Implement the /re-attach and /validate APIs required for the generation numbers feature
  to work. 
 ## Non Goals
 - This is not intended to interact with any components other than the pageserver, or
  to integrate with the broader control plane in any way.
 ## Impacted Components
 pageserver, pageserver controller (new)
 ## Implementation
 We may start from the minimal `attachment_service` used in automated tests.
 ### Data store
 For generation numbers, we need a persistent, linearizable data store.  Postgres is sufficient for
 this: we already have postgres instances used for other control plane work.
 The storage for the Pageserver Controller will be independent of other components:
 it might use the same physical database server but would use an independent database.
 ### Deployment
 There will be one instance per region.  In future we would aim to define the concept
 of a pageserver cluster and have one controller per cluster, but in the short term
 one per region will be functionally okay for current scale.
 The pageserver controller will be deployed within kubernetes, in the same way as
 the storage broker (which is currently via a [helm chart](https://github.com/neondatabase/helm-charts/tree/main/charts/neon-storage-broker)).
 ### Security
 The pageserver controller's API will do authentication with JWT, the same as
 the pageserver's existing API.
 ### Correctness
 It is essential that pageservers call into the controller at the _very start_ of
 handling attach/create/load API requests.  They should not do any work at all until
 they have acquired that generation number.
 If the call fails, they must retry: it is not safe to proceed without a generation number.
 ## Future
 Having a call chain that goes `Control plane -> Pageserver -> Pageserver controller`
 is clearly a little strange: we are only doing this to avoid needing to make changes
 to the control plane.
 In future, we will change the control plane to call directly into the pageserver
 controller, which would then call onwards into the pageserver.  This would be a fairly
 small change to the controller, since all the logic around storing and updating
 generation numbers would stay the same: just the behavior of the API frontend
 would be different.
 The work to enable pageservers to communicate with the controller is not wasted,
 because they still communicate in that direction when invoking `/re-attach` 
 and `/validate`
 ## Alternatives considered
 ### Run in the console/control plane codebase
 The control plane is a large Go codebase that uses extensive code generation, and
 has to be quite generic to manage many different types of component.
 ### Direct DB access
 We could have pageservers call directly into a shared database to acquire and update
 generation numbers (with carefully crafted transactions to protect against concurrent
 attaches getting the same generation, etc).   
 Pros:
 - No extra service required, simpler deployment
 Cons:
 - No future path to a cleaner architecture: the pageserver controller can be implemented
  as an extensible place for implement more functionality in future, whereas a mechanism
  to do generation numbers via SQL queries from the pageserver would be specialized
  and the code would probably be disposed of in the relatively near future.
 - Puts onus entirely on SQL query correctness to mediate concurrent access.
  The pageserver controller also has to be correct in this respect in case there
  is more than one instance running, but it is much less likely to hit this path,
  so the overall risk of issues is lower when using a central service.
 The main downside to that approach is that it doesn't provide the future path that
 the pageserver controller does
Author	SHA1	Message	Date
John Spray	2ce2574aa4	docs: pageserver controller rfc	2023-09-29 18:24:24 +01:00
John Spray	dc5f107170	docs: sharding phase 1 RFC	2023-09-29 18:20:13 +01:00
John Spray	1569446396	clarifications	2023-09-27 10:38:52 +01:00
John Spray	a8143a3bed	Align cutover downloads with heatmap	2023-09-27 10:38:52 +01:00
John Spray	689b6f14b7	Apply suggestions from code review Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-09-27 10:38:41 +01:00
John Spray	9c1c06ad17	Bump number of migration RFC	2023-09-26 10:57:11 +01:00
John Spray	40d2a73a0c	Merge remote-tracking branch 'upstream/main' into jcsp/rfc-migration	2023-09-26 10:56:57 +01:00
John Spray	89ddefb428	Note safety requirement for AttachedMulti & out of scope item	2023-09-12 14:15:46 +01:00
John Spray	cad0799521	Mention disabling consumption metrics in AttachedStale	2023-09-12 14:04:01 +01:00
John Spray	1143e2e9ce	Clarifications	2023-09-12 12:13:29 +01:00
Christian Schwarz	ef3e75abc3	for #5029 (rfc tenant migrations): editorial fixes (#5185 )	2023-09-01 18:10:44 +01:00
John Spray	cfb285139c	docs/rfcs: add RFC for fast tenant migration/failover	2023-08-31 10:55:17 +01:00