diff --git a/docs/rfcs/2025-03-17-compute-prewarm.md b/docs/rfcs/2025-03-17-compute-prewarm.md
new file mode 100644
index 0000000000..6e95b9ac39
--- /dev/null
+++ b/docs/rfcs/2025-03-17-compute-prewarm.md
@@ -0,0 +1,399 @@
# Compute rolling restart with prewarm

Created on 2025-03-17
Implemented on _TBD_
Author: Alexey Kondratov (@ololobus)

## Summary

This RFC describes an approach to reduce performance degradation due to missing caches after a compute node restart, i.e.:

1. Rolling restart of the running instance via a 'warm' replica.
2. Auto-prewarm of compute caches after an unplanned restart or scale-to-zero.

## Motivation

Neon currently implements several features that guarantee high uptime of compute nodes:

1. Storage high availability (HA), i.e. each tenant shard has a secondary pageserver location, so we can quickly switch the compute over to it in case of a primary pageserver failure.
2. Fast compute provisioning, i.e. we have a fleet of pre-created empty computes that are ready to serve workload, so restarting an unresponsive compute is very fast.
3. Preemptive NeonVM compute provisioning in case of k8s node unavailability.

This keeps us well within the uptime SLO of 99.95% most of the time. Problems begin when we go up to multi-TB workloads and 32-64 CU computes.
During a restart, the compute loses all caches: LFC, shared buffers, and the file system cache. Depending on the workload, warming up the caches can take a lot of time,
so performance can be degraded and might even be unacceptable for certain workloads. This means that although the current approach works well for small to
medium workloads, we still have to do some additional work to avoid performance degradation after restarting large instances.

## Non Goals

- Details of the persistent storage for prewarm data are out of scope; there is a separate RFC for that: .
- Complete compute/Postgres HA setup and flow. Although it was originally in scope of this RFC, during preliminary research it turned out to be a rabbit hole, so it is worth a separate RFC.
- Low-level implementation details of Postgres replica-to-primary promotion. There are a lot of things to think and care about: how to start walproposer, [logical replication failover](https://www.postgresql.org/docs/current/logical-replication-failover.html), and so on, but that is worth at least a separate one-pager design document, if not an RFC.

## Impacted components

Postgres, compute_ctl, Control plane, Endpoint storage for unlogged storage of compute files.
For the latter, we will need to implement a uniform abstraction layer on top of S3, ABS, etc., but
S3 is used in the text interchangeably with 'endpoint storage' for simplicity.

## Proposed implementation

### compute_ctl spec changes and auto-prewarm

We are going to extend the current compute spec with the following attributes:

```rust
struct ComputeSpec {
    /// [All existing attributes]
    ...
    /// Whether to do auto-prewarm at start or not.
    /// Defaults to `false`.
    pub lfc_auto_prewarm: bool,
    /// Interval in seconds between automatic dumps of
    /// LFC state into S3. Defaults to `None`, which means 'off'.
    pub lfc_dump_interval_sec: Option<u64>,
}
```

When `lfc_dump_interval_sec` is set to `N`, `compute_ctl` will periodically dump the LFC state
and store it in S3, so that it can be used either for auto-prewarm after restart or by a replica
during the rolling restart. For periodic dumping, we should consider the value
`lfc_dump_interval_sec=300` (5 minutes), the same as the upstream `pg_prewarm.autoprewarm_interval` default.
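As an illustration of the intended behavior, below is a minimal sketch of what the periodic dump task inside `compute_ctl` could look like. This is not the actual implementation: `dump_lfc_state_from_postgres` and `upload_lfc_state` are hypothetical helpers standing in for the Postgres SQL call and the endpoint storage client, and `tokio`/`anyhow` are used purely for illustration.

```rust
use std::time::Duration;

// Hypothetical helpers: the real code would call the Postgres SQL interface
// and the endpoint storage client respectively.
async fn dump_lfc_state_from_postgres() -> anyhow::Result<String> {
    unimplemented!("query the current LFC state via SQL")
}

async fn upload_lfc_state(endpoint_id: &str, state: &str) -> anyhow::Result<()> {
    unimplemented!("PUT the state into endpoint storage (S3/ABS/...)")
}

/// Periodically dump the LFC state and push it to endpoint storage under a key
/// derived from the endpoint_id, so that a replica on the same branch can later
/// find it for prewarm.
async fn lfc_dump_loop(endpoint_id: String, interval_sec: u64) {
    let mut ticker = tokio::time::interval(Duration::from_secs(interval_sec));
    loop {
        ticker.tick().await;
        match dump_lfc_state_from_postgres().await {
            Ok(state) => {
                if let Err(err) = upload_lfc_state(&endpoint_id, &state).await {
                    eprintln!("failed to upload LFC state: {err:#}");
                }
            }
            Err(err) => eprintln!("failed to dump LFC state: {err:#}"),
        }
    }
}
```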
When `lfc_auto_prewarm` is set to `true`, `compute_ctl` will start prewarming the LFC upon restart
if any of the previously dumped states is present in S3.

### compute_ctl API

1. `POST /store_lfc_state` -- dump the LFC state using the Postgres SQL interface and store the result in S3.
   This has to be a blocking call, i.e. it returns only after the state is stored in S3.
   If there is any concurrent request in progress, we should return `429 Too Many Requests`
   and let the caller retry.

2. `GET /dump_lfc_state` -- dump the LFC state using the Postgres SQL interface and return it as is
   in a text format suitable for a future restore/prewarm. This API is not strictly needed in
   the end state, but could be useful for faster prototyping of the complete rolling restart flow
   with prewarm, as it doesn't require persistent storage for the LFC state.

3. `POST /restore_lfc_state` -- restore/prewarm the LFC state with request

   ```yaml
   RestoreLFCStateRequest:
     oneOf:
       - type: object
         required:
           - lfc_state
         properties:
           lfc_state:
             type: string
             description: Raw LFC content dumped with GET `/dump_lfc_state`
       - type: object
         required:
           - lfc_cache_key
         properties:
           lfc_cache_key:
             type: string
             description: |
               endpoint_id of the source endpoint on the same branch
               to use as a 'donor' for LFC content. Compute will look up
               the LFC content dump in S3 using this key and do the prewarm.
   ```

   where `lfc_state` and `lfc_cache_key` are mutually exclusive.

   The actual prewarming will happen asynchronously, so the caller needs to check the
   prewarm status using the compute's standard `GET /status` API (see the sketch after this list).

4. `GET /status` -- extend the existing API with the following attributes

   ```rust
   struct ComputeStatusResponse {
       // [All existing attributes]
       ...
       pub prewarm_state: PrewarmState,
   }

   /// Compute prewarm state. Will be stored in the shared Compute state
   /// in compute_ctl.
   struct PrewarmState {
       pub status: PrewarmStatus,
       /// Total number of pages to prewarm
       pub pages_total: i64,
       /// Number of pages prewarmed so far
       pub pages_processed: i64,
       /// Optional prewarm error
       pub error: Option<String>,
   }

   pub enum PrewarmStatus {
       /// Prewarming was never requested on this compute
       Off,
       /// Prewarming was requested, but not started yet
       Pending,
       /// Prewarming is in progress. The caller should follow
       /// `PrewarmState::pages_processed`.
       InProgress,
       /// Prewarming has been successfully completed
       Completed,
       /// Prewarming failed. The caller should look at
       /// `PrewarmState::error` for the reason.
       Failed,
       /// It is intended to be used by auto-prewarm if none of
       /// the previous LFC states is available in S3.
       /// This is a distinct state from `Failed` because
       /// technically it's not a failure and could happen if
       /// the compute was restarted before it dumped anything into S3,
       /// or just after the initial rollout of the feature.
       Skipped,
   }
   ```

5. `POST /promote` -- this is a **blocking** API call to promote a compute replica to primary.
   This API should be very similar to the existing `POST /configure` API, i.e. it accepts the
   spec (a primary spec, because the compute was originally started as a replica). It's a distinct
   API method because the semantics and response codes are different:

   - If promotion is done successfully, it will return `200 OK`.
   - If the compute is already a primary, the call will be a no-op and `compute_ctl`
     will return `412 Precondition Failed`.
   - If, for some reason, a second request reaches a compute that is in the process of promotion,
     it will respond with `429 Too Many Requests`.
   - If the compute hits any permanent failure during promotion, `500 Internal Server Error`
     will be returned.

   A sketch of the retry handling around this call is shown after the flow diagram below.
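As an illustration of how a caller is expected to combine items 3 and 4 above, here is a rough sketch of driving the prewarm of a replica and then polling until it reaches a terminal state. This is only a sketch under assumptions: the JSON field names follow the API described above, authentication is omitted, and `reqwest`/`tokio`/`serde_json` are used purely for illustration.

```rust
use std::time::Duration;

/// Kick off prewarm on the replica via `POST /restore_lfc_state`, then poll
/// `GET /status` until `prewarm_state` reports a terminal status.
async fn prewarm_replica(base_url: &str, donor_endpoint_id: &str) -> anyhow::Result<()> {
    let client = reqwest::Client::new();

    // Ask the replica to prewarm from the donor's LFC dump in endpoint storage.
    client
        .post(format!("{base_url}/restore_lfc_state"))
        .json(&serde_json::json!({ "lfc_cache_key": donor_endpoint_id }))
        .send()
        .await?
        .error_for_status()?;

    // The prewarm itself runs asynchronously, so poll the standard status API.
    loop {
        let status: serde_json::Value = client
            .get(format!("{base_url}/status"))
            .send()
            .await?
            .json()
            .await?;

        match status["prewarm_state"]["status"].as_str() {
            Some("Completed") => return Ok(()),
            Some("Failed") => {
                anyhow::bail!("prewarm failed: {}", status["prewarm_state"]["error"])
            }
            // `Off`, `Pending`, `InProgress` (or a missing field) -- keep polling.
            _ => tokio::time::sleep(Duration::from_secs(5)).await,
        }
    }
}
```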
### Control plane operations

The complete flow is presented as a sequence diagram in the next section, but here
we just want to list some important steps that have to be done by the control plane during
the rolling restart via a warm replica, without going into low-level implementation details.

1. Register the 'intent' of the instance restart, but do not yet interrupt any workload on the
   primary and keep accepting new connections. This may require some endpoint state machine
   changes, e.g. the introduction of a `pending_restart` state. Being in this state also
   **mustn't prevent any other operations except restart**: suspend, live-reconfiguration
   (e.g. due to a notify-attach call from the storage controller), deletion.

2. Start a new replica compute on the same timeline and start prewarming it. This process
   may take quite a while, so the same concurrency considerations as in step 1 apply
   here as well.

3. When the warm replica is ready, the control plane should:

   3.1. Terminate the primary compute. Starting from here, **this is a critical section**:
   if anything goes wrong, the only option is to start the primary normally and proceed
   with auto-prewarm.

   3.2. Send a cache invalidation message to all proxies, notifying them that all new connections
   should request and wait for the new connection details. At this stage, the proxy also has to
   drop any existing connections to the old primary, so that they don't do stale reads.

   3.3. Attach the warm replica compute to the primary endpoint inside the control plane metadata
   database.

   3.4. Promote the replica to primary.

   3.5. When everything is done, finalize the endpoint state to be just `active`.

### Complete rolling restart flow

```mermaid
sequenceDiagram

    autonumber

    participant proxy as Neon proxy

    participant cplane as Control plane

    participant primary as Compute (primary)
    box Compute (replica)
    participant ctl as compute_ctl
    participant pg as Postgres
    end

    box Endpoint unlogged storage
    participant s3proxy as Endpoint storage service
    participant s3 as S3/ABS/etc.
    end

    cplane ->> primary: POST /store_lfc_state
    primary -->> cplane: 200 OK

    cplane ->> ctl: POST /restore_lfc_state
    activate ctl
    ctl -->> cplane: 202 Accepted

    activate cplane
    cplane ->> ctl: GET /status: poll prewarm status
    ctl ->> s3proxy: GET /read_file
    s3proxy ->> s3: read file
    s3 -->> s3proxy: file content
    s3proxy -->> ctl: 200 OK: file content

    proxy ->> cplane: GET /proxy_wake_compute
    cplane -->> proxy: 200 OK: old primary conninfo

    ctl ->> pg: prewarm LFC
    activate pg
    pg -->> ctl: prewarm is completed
    deactivate pg

    ctl -->> cplane: 200 OK: prewarm is completed
    deactivate ctl
    deactivate cplane
    cplane -->> cplane: reassign replica compute to endpoint,<br/>start terminating the old primary compute
    activate cplane
    cplane ->> proxy: invalidate caches

    proxy ->> cplane: GET /proxy_wake_compute

    cplane -x primary: POST /terminate
    primary -->> cplane: 200 OK
    note over primary: old primary<br/>compute terminated

    cplane ->> ctl: POST /promote
    activate ctl
    ctl ->> pg: pg_ctl promote
    activate pg
    pg -->> ctl: done
    deactivate pg
    ctl -->> cplane: 200 OK
    deactivate ctl

    cplane -->> cplane: finalize operation
    cplane -->> proxy: 200 OK: new primary conninfo
    deactivate cplane
```
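The `POST /promote` call above happens inside the critical section of the flow, so it is worth illustrating the retry discipline implied by the response codes defined earlier. The sketch below is only illustrative: the endpoint URL, retry interval, treatment of `412`, and the use of `reqwest`/`tokio` are assumptions, not the final implementation.

```rust
use std::time::Duration;

use reqwest::StatusCode;

/// Retry `POST /promote` until it either succeeds or fails permanently.
async fn promote_with_retries(base_url: &str, primary_spec: &serde_json::Value) -> anyhow::Result<()> {
    let client = reqwest::Client::new();
    loop {
        let resp = client
            .post(format!("{base_url}/promote"))
            .json(primary_spec)
            .send()
            .await;

        match resp {
            // 200: promotion finished successfully.
            Ok(r) if r.status() == StatusCode::OK => return Ok(()),
            // 412: the compute is already a primary, e.g. a previous attempt
            // succeeded but its response was lost; one possible interpretation
            // is to treat this as success.
            Ok(r) if r.status() == StatusCode::PRECONDITION_FAILED => return Ok(()),
            // 429: an earlier request already triggered promotion; keep retrying
            // until a terminal answer arrives.
            Ok(r) if r.status() == StatusCode::TOO_MANY_REQUESTS => {
                tokio::time::sleep(Duration::from_secs(2)).await;
            }
            // 500: permanent failure -- the only remaining option is to start
            // the old primary normally and rely on auto-prewarm.
            Ok(r) if r.status() == StatusCode::INTERNAL_SERVER_ERROR => {
                anyhow::bail!("promotion failed permanently")
            }
            Ok(r) => anyhow::bail!("unexpected response to /promote: {}", r.status()),
            // Network error or timeout: the request may or may not have reached
            // the compute, so it is safe to simply send it again.
            Err(_) => tokio::time::sleep(Duration::from_secs(2)).await,
        }
    }
}
```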
### Network bandwidth and prewarm speed

It's currently known that a pageserver can sustain about 3000 RPS per shard for a few running computes.
Large tenants are usually split into 8 shards, so the final formula may look like this:

```text
8 shards * 3000 RPS * 8 KB =~ 190 MB/s
```

so, depending on the LFC size, prewarming will take at least:

- ~5s for 1 GB
- ~50s for 10 GB
- ~9m for 100 GB
- \>1h for 1 TB

In total, one pageserver is normally capped at 30k RPS, so it obviously can't sustain many computes
doing prewarm at the same time. Later, we may need an additional mechanism for computes to throttle
the prewarming requests gracefully.

### Reliability, failure modes and corner cases

We consider the following failures while implementing this RFC:

1. Compute got interrupted/crashed/restarted during prewarm. The caller -- the control plane -- should
   detect that and start the prewarm from the beginning.

2. The control plane promotion request timed out or hit network issues. If it never reached the
   compute, the control plane should just repeat it. If it did reach the compute, then during the
   retry the control plane can hit `429`, as the previous request already triggered the promotion.
   In this case, the control plane needs to retry until either `200` or
   the permanent error `500` is returned.

3. Compute got interrupted/crashed/restarted during promotion. At restart it will ask for
   a spec from the control plane, and its content should signal the compute to start as **primary**,
   so it's expected that the control plane will continue polling for a certain period of time and
   will discover that the compute is ready to accept connections if the restart is fast enough.

4. Any other unexpected failure or timeout during prewarming. This **failure mustn't be fatal**:
   the control plane has to report the failure, terminate the replica, and keep the primary running.

5. Any other unexpected failure or timeout during promotion. Unfortunately, at this point
   we already have the primary node stopped, so the only option is to start the primary again
   and proceed with auto-prewarm.

6. Any unexpected failure during auto-prewarm. This **failure mustn't be fatal**:
   `compute_ctl` has to report the failure, but not crash the compute.

7. The control plane failed to confirm that the old primary has terminated. This can happen, especially
   in the future HA setup. In this case, the control plane has to ensure that it sent the VM deletion
   and pod termination requests to k8s, so that in the long term we do not have two running primaries
   on the same timeline.

### Security implications

There are two security implications to consider:

1. Access to the `compute_ctl` API. It has to be accessible from outside the compute, so all
   new API methods have to be exposed on the **external** HTTP port and **must** be authenticated
   with JWT.

2. Read/write access only to your own LFC state data in S3. Although it's not really a security concern,
   since the LFC state is just a mapping of the blocks present in the LFC at a certain moment in time,
   it still has to be highly restricted, so that i) only computes on the same timeline can
   read the S3 state; ii) each compute can only write to the path that contains its `endpoint_id`.
   Both of these must be validated by the Endpoint storage service using the JWT token provided by `compute_ctl`.
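To make rule 2 above more concrete, here is a minimal sketch of the authorization check the Endpoint storage service could apply to LFC state objects. The claim names and key layout are assumptions for illustration only, not the final schema.

```rust
/// Claims the service would extract from the JWT presented by `compute_ctl`
/// (claim names are hypothetical).
struct TokenClaims {
    timeline_id: String,
    endpoint_id: String,
}

enum Access {
    Read,
    Write,
}

/// Decide whether a request may touch the LFC state object addressed by
/// `(key_timeline_id, key_endpoint_id)`.
fn is_allowed(
    claims: &TokenClaims,
    access: Access,
    key_timeline_id: &str,
    key_endpoint_id: &str,
) -> bool {
    match access {
        // i) Any compute on the same timeline may read LFC state dumps.
        Access::Read => claims.timeline_id == key_timeline_id,
        // ii) A compute may only write under the path containing its own endpoint_id.
        Access::Write => {
            claims.timeline_id == key_timeline_id && claims.endpoint_id == key_endpoint_id
        }
    }
}
```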
### Unresolved questions

#### Billing, metrics and monitoring

Currently, we only label computes with `endpoint_id` after attaching them to the endpoint.
In this proposal, this means that the temporary replica will remain unlabelled until it's promoted
to primary. We can also hide it from users in the control plane API, but what to do with
billing and monitoring is still unclear.

We can probably mark it as 'billable' and tag it with `project_id`, so it will be billed, but
will not interfere in any way with the current primary monitoring.

Another thing to consider is how logs and metrics export will switch to the new compute.
It's expected that the OpenTelemetry collector will auto-discover the new compute and start
scraping metrics from it.

#### Auto-prewarm

It's still an open question whether we need auto-prewarm at all. The author's gut feeling is
that yes, we need it, but maybe not for all workloads, so it could end up exposed as a
user-controllable knob on the endpoint. There are two arguments for that:

1. Auto-prewarm exists in the upstream's `pg_prewarm`, _probably for a reason_.

2. There could still be 2 flows where we cannot perform the rolling restart via the warm
   replica: i) any failure or interruption during promotion; ii) wake-up after scale-to-zero.
   The latter might be challenged as well, i.e. one can argue that auto-prewarm may and will
   compete with the user workload for storage resources. This is correct, but it might as well
   reduce the time needed to get a warm LFC and good performance.

#### Low-level details of the replica promotion

There are many things to consider here, but three items just off the top of my head:

1. How to properly start the `walproposer` inside Postgres.

2. What to do with logical replication. Currently, we do not include logical replication slots
   in the basebackup, because nobody advances them on the replica, so they just prevent WAL
   deletion. Yet, we do need to have them on the primary after promotion. Starting with Postgres 17,
   there is a new feature called
   [logical replication failover](https://www.postgresql.org/docs/current/logical-replication-failover.html)
   and the `synchronized_standby_slots` setting, but we need a plan for the older versions. Should we
   request a new basebackup during promotion?

3. How do we guarantee that the replica will receive all the latest WAL from safekeepers? Do some
   'shallow' version of sync safekeepers without data copying? Or just the standard version of
   sync safekeepers?

## Alternative implementation

The proposal already assumes one of the alternatives -- do not have any persistent storage for
the LFC state at all. This is faster to implement with the proposed API, but it means that
we do not implement auto-prewarm yet.

## Definition of Done

At the end of implementing this RFC we should have two high-level settings that enable:

1. Auto-prewarm of user computes upon restart.
2. Primary compute restart via warm replica promotion.

It also has to be decided what the criteria are for enabling one or both of these flows for
certain clients.