# Compute rolling restart with prewarm

Created on: 2025-03-17
Implemented on: TBD
Author: Alexey Kondratov (@ololobus)

## Summary

This RFC describes an approach to reducing performance degradation caused by missing caches after a compute node restart, i.e.:

- Rolling restart of a running instance via a 'warm' replica.
- Auto-prewarm of compute caches after an unplanned restart or scale-to-zero.

## Motivation

Neon currently implements several features that guarantee high uptime of compute nodes:

- Storage high-availability (HA), i.e. each tenant shard has a secondary pageserver location, so we can quickly switch the compute over to it in case of a primary pageserver failure.
- Fast compute provisioning, i.e. we have a fleet of pre-created empty computes that are ready to serve workload, so restarting an unresponsive compute is very fast.
- Preemptive NeonVM compute provisioning in case of k8s node unavailability.

This helps us stay well within the uptime SLO of 99.95% most of the time. Problems begin when we go up to multi-TB workloads and 32-64 CU computes. During a restart, compute loses all of its caches: LFC, shared buffers, file system cache. Depending on the workload, it can take a lot of time to warm up the caches, so performance can be degraded and may even be unacceptable for certain workloads. This means that although the current approach works well for small to medium workloads, we still have to do some additional work to avoid performance degradation after a restart of large instances.

## Non Goals

- Details of the persistent storage for prewarm data are out of scope; there is a separate RFC for that: https://github.com/neondatabase/neon/pull/9661.
- Complete compute/Postgres HA setup and flow. Although it was originally in scope of this RFC, during preliminary research it appeared to be a rabbit hole, so it's worth a separate RFC.
- Low-level implementation details of Postgres replica-to-primary promotion. There are a lot of things to think about: how to start the walproposer, logical replication failover, and so on, but it's worth at least a separate one-pager design document, if not an RFC.

## Impacted components

Postgres, compute_ctl, Control plane, Endpoint storage for unlogged storage of compute files. For the latter, we will need to implement a uniform abstraction layer on top of S3, ABS, etc., but 'S3' is used in the text interchangeably with 'endpoint storage' for simplicity.

## Proposed implementation

### compute_ctl spec changes and auto-prewarm

We are going to extend the current compute spec with the following attributes:

```rust
struct ComputeSpec {
    /// [All existing attributes]
    ...

    /// Whether to do auto-prewarm at start or not.
    /// Default to `false`.
    pub lfc_auto_prewarm: bool,

    /// Interval in seconds between automatic dumps of
    /// LFC state into S3. Default `None`, which means 'off'.
    pub lfc_dump_interval_sec: Option<i32>,
}
```

When `lfc_dump_interval_sec` is set to N, compute_ctl will periodically dump the LFC state
and store it in S3, so that it can be used either for auto-prewarm after a restart or by a replica
during the rolling restart. For enabling periodic dumping, we should consider the value
`lfc_dump_interval_sec=300` (5 minutes), the same as the upstream's `pg_prewarm.autoprewarm_interval`.

When `lfc_auto_prewarm` is set to `true`, compute_ctl will start prewarming the LFC upon restart
if and only if one of the previously dumped states is present in S3. A minimal sketch of the periodic
dump loop is shown below.
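
The following is a rough, non-authoritative sketch of how the periodic dump task in compute_ctl could look. The `dump_lfc_state()` and `store_lfc_state_in_endpoint_storage()` helpers are hypothetical placeholders for the SQL call against Postgres and the upload to endpoint storage; only `ComputeSpec` and its fields come from this RFC.

```rust
use std::time::Duration;

/// Sketch of the background task driven by `lfc_dump_interval_sec`.
/// `dump_lfc_state()` and `store_lfc_state_in_endpoint_storage()` are
/// hypothetical helpers, not existing compute_ctl functions.
async fn lfc_dump_loop(spec: &ComputeSpec) {
    // `None` means periodic dumping is off.
    let Some(interval_sec) = spec.lfc_dump_interval_sec else {
        return;
    };
    let mut ticker = tokio::time::interval(Duration::from_secs(interval_sec as u64));
    loop {
        ticker.tick().await;
        match dump_lfc_state().await {
            Ok(state) => {
                if let Err(err) = store_lfc_state_in_endpoint_storage(&state).await {
                    tracing::warn!("failed to store LFC state: {err}");
                }
            }
            Err(err) => tracing::warn!("failed to dump LFC state: {err}"),
        }
    }
}
```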

### compute_ctl API

- `POST /store_lfc_state` -- dump the LFC state using the Postgres SQL interface and store the result in S3. This has to be a blocking call, i.e. it returns only after the state is stored in S3. If there is any concurrent request in progress, we should return `429 Too Many Requests` and let the caller retry.
- `GET /dump_lfc_state` -- dump the LFC state using the Postgres SQL interface and return it as is, in a text format suitable for a future restore/prewarm. This API is not strictly needed in the end state, but could be useful for faster prototyping of a complete rolling restart flow with prewarm, as it doesn't require persistent storage for the LFC state.
- `POST /restore_lfc_state` -- restore/prewarm the LFC state with the request

  ```yaml
  RestoreLFCStateRequest:
    oneOf:
      - type: object
        required:
          - lfc_state
        properties:
          lfc_state:
            type: string
            description: Raw LFC content dumped with GET `/dump_lfc_state`
      - type: object
        required:
          - lfc_cache_key
        properties:
          lfc_cache_key:
            type: string
            description: |
              endpoint_id of the source endpoint on the same branch to use as a
              'donor' for LFC content. Compute will look up the LFC content dump
              in S3 using this key and do the prewarm.
  ```

  where `lfc_state` and `lfc_cache_key` are mutually exclusive. The actual prewarming will happen asynchronously, so the caller needs to check the prewarm status using the compute's standard `GET /status` API (see the polling sketch after this list).
- `GET /status` -- extend the existing API with the following attributes

  ```rust
  struct ComputeStatusResponse {
      // [All existing attributes]
      ...

      pub prewarm_state: PrewarmState,
  }

  /// Compute prewarm state. Will be stored in the shared Compute state
  /// in compute_ctl.
  struct PrewarmState {
      pub status: PrewarmStatus,
      /// Total number of pages to prewarm
      pub pages_total: i64,
      /// Number of pages prewarmed so far
      pub pages_processed: i64,
      /// Optional prewarm error
      pub error: Option<String>,
  }

  pub enum PrewarmStatus {
      /// Prewarming was never requested on this compute
      Off,
      /// Prewarming was requested, but not started yet
      Pending,
      /// Prewarming is in progress. The caller can track the progress
      /// via `PrewarmState::pages_total` and `PrewarmState::pages_processed`.
      InProgress,
      /// Prewarming has been successfully completed
      Completed,
      /// Prewarming failed. The caller should look at
      /// `PrewarmState::error` for the reason.
      Failed,
      /// Intended to be used by auto-prewarm if none of
      /// the previous LFC states is available in S3.
      /// This is a distinct state from `Failed` because
      /// technically it's not a failure and could happen if
      /// the compute was restarted before it dumped anything into S3,
      /// or just after the initial rollout of the feature.
      Skipped,
  }
  ```
- `POST /promote` -- a blocking API call to promote a compute replica into a primary. This API should be very similar to the existing `POST /configure` API, i.e. accept the spec (a primary spec, because the compute was originally started as a replica). It's a distinct API method because semantics and response codes are different:
  - If promotion is done successfully, it will return `200 OK`.
  - If the compute is already a primary, the call will be a no-op and `compute_ctl` will return `412 Precondition Failed`.
  - If, for some reason, a second request reaches a compute that is in the process of promotion, it will respond with `429 Too Many Requests`.
  - If the compute hits any permanent failure during promotion, `500 Internal Server Error` will be returned.
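
For illustration, here is a minimal, hedged sketch of how a caller (e.g. control plane) could drive the prewarm of a replica using the API above and then poll for completion. The base URL, the use of `reqwest`/`serde_json`, and the error handling are assumptions made for the example; only the endpoint paths and the `PrewarmStatus` values come from this RFC.

```rust
use std::time::Duration;
use serde_json::json;

type Error = Box<dyn std::error::Error>;

/// Ask the replica to prewarm from a donor endpoint's LFC dump in S3,
/// then poll `GET /status` until the prewarm reaches a terminal state.
async fn prewarm_replica(
    client: &reqwest::Client,
    base_url: &str,
    donor_endpoint_id: &str,
) -> Result<(), Error> {
    // Kick off the asynchronous prewarm.
    client
        .post(format!("{base_url}/restore_lfc_state"))
        .json(&json!({ "lfc_cache_key": donor_endpoint_id }))
        .send()
        .await?
        .error_for_status()?;

    // Poll the standard status API until prewarm completes, fails, or is skipped.
    loop {
        let status: serde_json::Value = client
            .get(format!("{base_url}/status"))
            .send()
            .await?
            .error_for_status()?
            .json()
            .await?;

        match status["prewarm_state"]["status"].as_str() {
            Some("Completed") => return Ok(()),
            Some("Failed") | Some("Skipped") => {
                return Err(format!("prewarm did not complete: {status}").into());
            }
            // `Pending` or `InProgress`: wait and poll again.
            _ => tokio::time::sleep(Duration::from_secs(5)).await,
        }
    }
}
```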

### Control plane operations

The complete flow is presented as a sequence diagram in the next section; here we just list some important steps that have to be done by control plane during the rolling restart via a warm replica, without going into low-level implementation details (a rough sketch of the orchestration follows the list).

1. Register the 'intent' of the instance restart, but do not interrupt any workload at the primary yet, and keep accepting new connections. This may require some endpoint state machine changes, e.g. introduction of a `pending_restart` state. Being in this state also mustn't prevent any other operations except restart: suspend, live-reconfiguration (e.g. due to a notify-attach call from the storage controller), deletion.
2. Start a new replica compute on the same timeline and start prewarming it. This process may take quite a while, so the same concurrency considerations as in 1. should be applied here as well.
3. When the warm replica is ready, control plane should:
   1. Terminate the primary compute. Starting from here, this is a critical section: if anything goes wrong, the only option is to start the primary normally and proceed with auto-prewarm.
   2. Send a cache invalidation message to all proxies, notifying them that all new connections should request and wait for the new connection details. At this stage, proxy also has to drop any existing connections to the old primary, so they don't do stale reads.
   3. Attach the warm replica compute to the primary endpoint inside the control plane metadata database.
   4. Promote the replica to primary.
   5. When everything is done, finalize the endpoint state to be just `active`.
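
The following pseudo-Rust sketch summarizes the ordering and the fallback behaviour described above. Every helper function and the string-typed endpoint states are hypothetical placeholders for control plane internals, not an existing API.

```rust
type Error = Box<dyn std::error::Error>;

/// Hypothetical outline of the rolling restart orchestration.
/// All helpers below are placeholders used only to show the control flow.
async fn rolling_restart_with_prewarm(endpoint_id: &str) -> Result<(), Error> {
    // 1. Record the intent; the old primary keeps serving traffic.
    set_endpoint_state(endpoint_id, "pending_restart").await?;

    // 2. Start a warm replica and prewarm it; on failure, keep the primary running.
    let replica = start_replica_on_same_timeline(endpoint_id).await?;
    if prewarm_replica_and_wait(&replica).await.is_err() {
        terminate_compute(&replica).await?;
        set_endpoint_state(endpoint_id, "active").await?;
        return Err("prewarm failed, primary left untouched".into());
    }

    // 3. Critical section: any failure from here on falls back to starting
    //    the primary normally and relying on auto-prewarm.
    terminate_old_primary(endpoint_id).await?;
    invalidate_proxy_caches(endpoint_id).await?;
    attach_compute_to_endpoint(&replica, endpoint_id).await?;
    promote_replica(&replica).await?;
    set_endpoint_state(endpoint_id, "active").await
}
```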

### Complete rolling restart flow

```mermaid
sequenceDiagram
autonumber
participant proxy as Neon proxy
participant cplane as Control plane
participant primary as Compute (primary)
box Compute (replica)
participant ctl as compute_ctl
participant pg as Postgres
end
box Endpoint unlogged storage
participant s3proxy as Endpoint storage service
participant s3 as S3/ABS/etc.
end
cplane ->> primary: POST /store_lfc_state
primary -->> cplane: 200 OK
cplane ->> ctl: POST /restore_lfc_state
activate ctl
ctl -->> cplane: 202 Accepted
activate cplane
cplane ->> ctl: GET /status: poll prewarm status
ctl ->> s3proxy: GET /read_file
s3proxy ->> s3: read file
s3 -->> s3proxy: file content
s3proxy -->> ctl: 200 OK: file content
proxy ->> cplane: GET /proxy_wake_compute
cplane -->> proxy: 200 OK: old primary conninfo
ctl ->> pg: prewarm LFC
activate pg
pg -->> ctl: prewarm is completed
deactivate pg
ctl -->> cplane: 200 OK: prewarm is completed
deactivate ctl
deactivate cplane
cplane -->> cplane: reassign replica compute to endpoint,<br>start terminating the old primary compute
activate cplane
cplane ->> proxy: invalidate caches
proxy ->> cplane: GET /proxy_wake_compute
cplane -x primary: POST /terminate
primary -->> cplane: 200 OK
note over primary: old primary<br>compute terminated
cplane ->> ctl: POST /promote
activate ctl
ctl ->> pg: pg_ctl promote
activate pg
pg -->> ctl: done
deactivate pg
ctl -->> cplane: 200 OK
deactivate ctl
cplane -->> cplane: finalize operation
cplane -->> proxy: 200 OK: new primary conninfo
deactivate cplane
```

### Network bandwidth and prewarm speed

It's currently known that a pageserver can sustain about 3000 RPS per shard for a few running computes. Large tenants are usually split into 8 shards, so the final formula may look like this:

`8 shards * 3000 RPS * 8 KB =~ 190 MB/s`

so, depending on the LFC size, prewarming will take at least:

- ~5s for 1 GB
- ~50s for 10 GB
- ~9m for 100 GB
- >1h for 1 TB

In total, one pageserver is normally capped at 30k RPS, so it obviously can't sustain many computes doing prewarm at the same time. Later, we may need an additional mechanism for computes to throttle the prewarming requests gracefully. A back-of-the-envelope helper for the estimates above is sketched below.
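
The snippet below just makes the arithmetic above reproducible; the 8 KB page size, shard count, and per-shard RPS are the assumptions from this section, not measured values.

```rust
/// Rough estimate of the prewarm duration in seconds, based on the numbers above:
/// 8 KB pages, a fixed shard count, and a sustained per-shard read rate.
fn estimated_prewarm_secs(lfc_bytes: u64, shards: u64, rps_per_shard: u64) -> u64 {
    const PAGE_SIZE_BYTES: u64 = 8 * 1024;
    let pages = lfc_bytes / PAGE_SIZE_BYTES;
    pages / (shards * rps_per_shard)
}

fn main() {
    // 100 GB LFC, 8 shards, 3000 RPS per shard => ~550 s, i.e. roughly 9 minutes.
    let secs = estimated_prewarm_secs(100 * 1024 * 1024 * 1024, 8, 3000);
    println!("estimated prewarm time: ~{secs}s");
}
```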

## Reliability, failure modes and corner cases

We consider the following failures while implementing this RFC:

- Compute got interrupted/crashed/restarted during prewarm. The caller -- control plane -- should detect that and start the prewarm from the beginning.
- The control plane promotion request timed out or hit network issues. If it never reached the compute, control plane should just repeat it. If it did reach the compute, then during the retry control plane can hit `429`, as the previous request has already triggered the promotion. In this case, control plane needs to retry until either `200` or the permanent error `500` is returned (see the retry sketch after this list).
- Compute got interrupted/crashed/restarted during promotion. At restart it will ask for a spec from control plane, and its content should signal the compute to start as a primary, so it's expected that control plane will continue polling for a certain period of time and, if the restart is fast enough, will discover that the compute is ready to accept connections.
- Any other unexpected failure or timeout during prewarming. This failure mustn't be fatal: control plane has to report the failure, terminate the replica, and keep the primary running.
- Any other unexpected failure or timeout during promotion. Unfortunately, at this moment we already have the primary node stopped, so the only option is to start the primary again and proceed with auto-prewarm.
- Any unexpected failure during auto-prewarm. This failure mustn't be fatal: `compute_ctl` has to report the failure, but not crash the compute.
- Control plane failed to confirm that the old primary has terminated. This can happen, especially in the future HA setup. In this case, control plane has to ensure that it sent the VM deletion and pod termination requests to k8s, so that long-term we do not have two running primaries on the same timeline.
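
A minimal sketch of the promotion retry loop described above, assuming a hypothetical `post_promote()` helper that issues `POST /promote` and returns the HTTP status code (or a transport error). Treating `412` as "already promoted" is an interpretation of the API semantics above, not something this RFC spells out.

```rust
use std::time::Duration;

/// Retry `POST /promote` until the compute reports a terminal result.
/// `post_promote()` is a hypothetical helper returning the HTTP status code.
async fn promote_with_retries(max_attempts: u32) -> Result<(), String> {
    for _ in 0..max_attempts {
        match post_promote().await {
            Ok(200) => return Ok(()), // promoted successfully
            Ok(412) => return Ok(()), // already a primary, e.g. a previous attempt succeeded
            Ok(429) => {}             // promotion still in progress, keep retrying
            Ok(500) => return Err("permanent promotion failure".to_string()),
            Ok(code) => return Err(format!("unexpected status code: {code}")),
            Err(_) => {}              // transport error, retry
        }
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
    Err("promotion did not finish in time".to_string())
}
```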

## Security implications

There are two security implications to consider:

- Access to the `compute_ctl` API. It has to be accessible from outside of the compute, so all new API methods have to be exposed on the external HTTP port and must be authenticated with JWT.
- Read/write only your own LFC state data in S3. Although it's not really a security concern, since the LFC state is just a mapping of the blocks present in the LFC at a certain moment in time, it still has to be highly restricted, so that i) only computes on the same timeline can read the S3 state; ii) each compute can only write to the path that contains its `endpoint_id`. Both of these must be validated by the Endpoint storage service using the JWT token provided by `compute_ctl` (a sketch of the checks follows).
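
A minimal sketch of the two checks, assuming hypothetical JWT claims that carry the compute's `endpoint_id` and timeline ID; the actual claim names and the object key layout are not defined by this RFC.

```rust
/// Hypothetical claims decoded from the compute's JWT.
struct Claims {
    endpoint_id: String,
    timeline_id: String,
}

/// Writes are allowed only to keys under the caller's own endpoint_id prefix
/// (one possible layout; the real key scheme is up to the endpoint storage service).
fn is_write_allowed(claims: &Claims, object_key: &str) -> bool {
    object_key.starts_with(&format!("{}/", claims.endpoint_id))
}

/// Reads of LFC state are allowed only for objects on the caller's own timeline.
fn is_read_allowed(claims: &Claims, object_timeline_id: &str) -> bool {
    claims.timeline_id == object_timeline_id
}
```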

## Unresolved questions

### Billing, metrics and monitoring

Currently, we only label computes with `endpoint_id` after attaching them to the endpoint.
In this proposal, this means that the temporary replica will remain unlabelled until it's promoted
to primary. We can also hide it from users in the control plane API, but what to do with
billing and monitoring is still unclear.

We can probably mark it as 'billable' and tag it with `project_id`, so it will be billed, but
will not interfere in any way with the current primary monitoring.

Another thing to consider is how the logs and metrics export will switch to the new compute. It's expected that the OpenTelemetry collector will auto-discover the new compute and start scraping metrics from it.

### Auto-prewarm

It's still an open question whether we need auto-prewarm at all. The author's gut feeling is that yes, we need it, but maybe not for all workloads, so it could end up exposed as a user-controllable knob on the endpoint. There are two arguments for that:

- Auto-prewarm exists in upstream's `pg_prewarm`, probably for a reason.
- There still could be two flows where we cannot perform the rolling restart via the warm replica: i) any failure or interruption during promotion; ii) wake-up after scale-to-zero. The latter might be challenged as well, i.e. one can argue that auto-prewarm may and will compete with the user workload for storage resources. This is correct, but it may as well reduce the time to get a warm LFC and good performance.

### Low-level details of the replica promotion

There are many things to consider here, but three items just off the top of my head:

- How to properly start the `walproposer` inside Postgres.
- What to do with logical replication. Currently, we do not include logical replication slots in the basebackup, because nobody advances them on the replica, so they just prevent WAL deletion. Yet, we do need to have them on the primary after promotion. Starting with Postgres 17, there is a new feature called logical replication failover and the `synchronized_standby_slots` setting, but we need a plan for the older versions. Should we request a new basebackup during promotion?
- How do we guarantee that the replica will receive all the latest WAL from safekeepers? Do some 'shallow' version of sync safekeepers without data copying? Or just the standard version of sync safekeepers?

## Alternative implementation

The proposal already assumes one of the alternatives -- not having any persistent storage for the LFC state at all. This is possible to implement faster with the proposed API, but it means that we do not implement auto-prewarm yet.

## Definition of Done

At the end of implementing this RFC we should have two high-level settings that enable:

- Auto-prewarm of user computes upon restart.
- Primary compute restart via warm replica promotion.

It also has to be decided what the criteria are for enabling one or both of these flows for certain clients.