| Author | SHA1 | Message | Date |
|---|---|---|---|
| John Spray | ce3d23e223 | Notes on cutover | 2025-03-14 19:19:28 +00:00 |
| John Spray | 7d60895310 | rename | 2025-03-14 15:33:23 +00:00 |
| John Spray | f441b1efb4 | docs: add hot secondaries rfc | 2025-03-13 17:22:56 +00:00 |

# Pageserver Hot Secondaries
## Summary
It is proposed to add a new mode for pageserver tenant shard locations,
called "hot secondary", which is able to serve page_service requests but
does not do all the same housekeeping as an attached location, and does
not store any additional data in S3.
There is a stark tradeoff between resource cost and complexity: a very simple solution would be to have multiple full attached locations doing independent I/O, but this RFC proposes some additional complexity to
reduce cost.
## Background
In the [pageserver migration RFC](028-pageserver-migration.md), we introduced the concept of "warm secondaries". These are pageserver locations that poll remote storage for a _heatmap_ describing which layers they should hold, and then download those layers from S3. This enables them to rapidly transition into a usable attached location with a warm cache.
Combined with the storage controller's detection of pageserver failures, warm
secondaries enable high availability of pageservers with a recovery time
objective (RTO) measured in seconds (depending on the configured heartbeat
frequency): occasional cloud instance failures are typically recovered
in well under a minute, without human intervention.
## Purpose
We aim to provide a sub-second RTO for pageserver failures for mission-critical
workloads. To do this, we should enable the postgres client
to make its own decision about cutting over to a secondary, rather than
waiting for the controller to detect a failure and instruct it to
use a different pageserver. These secondaries should be maintained
in a continuously readable state rather than requiring explicit activation.
Because low-RTO failover is intrinsically vulnerable to "flapping"/false
positives, reads from such a hot secondary will not "promote" the secondary: we don't want to flap back and forth at millisecond timescales. Rather, reads will be served by hot secondaries at any time,
but their transition to an attached (primary) location will still be
managed by the storage controller.
## Design of Hot Location Mode
At a high level, hot locations are basically the same `Tenant` and `Timeline` types as an attached location, but with some behavioral tweaks. This RFC won't get into code structure details: these changes
may be expressed as different types (more robust) or as different modes
for existing types (less code churn, more complexity).
### Load and ingest
Initially, we may start in the same way as a normal attached location:
by discovering the latest metadata in remote storage and constructing
a LayerMap.
We should also do ingest as normal: subscribing to safekeeper and streaming
writes into ephemeral layer files that are then frozen into L0s. However,
we do not want to wastefully upload these to S3 (they duplicate what the
attached location is already writing).
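As a sketch of this ingest behaviour (using simplified stand-in types, not the real `Tenant`/`Timeline` machinery), the only difference from an attached location is that a frozen L0 is kept local rather than enqueued for upload:

```rust
// Simplified stand-ins for illustration; the real pageserver types are
// far richer than this.
enum LocationMode {
    Attached,
    HotSecondary,
}

struct FrozenL0 {
    lsn_end: u64,
}

/// When an ephemeral layer is frozen into an L0, an attached location
/// schedules an upload; a hot secondary keeps the file local only,
/// because the attached location is already writing the same data to S3.
fn on_l0_frozen(mode: &LocationMode, l0: FrozenL0, upload_queue: &mut Vec<FrozenL0>) {
    match mode {
        LocationMode::Attached => upload_queue.push(l0),
        // No upload: avoid duplicating the attached location's S3 writes.
        LocationMode::HotSecondary => drop(l0),
    }
}
```

In the real code this branch would hang off whatever holds the location mode; the sketch only illustrates where the divergence sits.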
### "Virtual" compaction
Clearly ingesting but never uploading or compacting will generate an unbounded stack of L0 layers, unless we do something about it.
To solve this, we may add a special type of compaction that re-reads
from remote storage, updates the layer map to contain all L1
and image layers from the remote metadata, and triggers download of these.
We do not download remote L0s during virtual compaction: the hot secondary has itself been ingesting and generating these, so downloading them would be wasteful. Instead, we trim any local L0s that are now covered by the L1 high watermark of the remote metadata, and retain any that are still needed to serve reads.
Note that this process is expected to generate some overlaps in LSN space: we might have an L0 that we generated locally which overlaps with an L1 from remote storage. getpage@lsn logic must handle this and avoid assuming non-overlapping layers: having read some deltas from an L0, we must not read the same deltas again from an L1, so we must remember which LSNs we have already passed.
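A minimal sketch of overlap-tolerant delta collection, assuming simplified layers that hold `(lsn, delta)` pairs in ascending LSN order (the real read path works on key ranges and value reconstruction):

```rust
/// Simplified layer: deltas stored as (lsn, value), ascending by LSN.
struct Layer {
    deltas: Vec<(u64, &'static str)>,
}

/// Collect deltas newest-to-oldest across layers that may overlap in LSN
/// space. `low_water` remembers the oldest LSN already consumed, so a
/// delta present in both a local L0 and a remote L1 is taken only once.
fn collect_deltas(layers_newest_first: &[Layer]) -> Vec<(u64, &'static str)> {
    let mut out = Vec::new();
    let mut low_water = u64::MAX; // nothing consumed yet
    for layer in layers_newest_first {
        for &(lsn, val) in layer.deltas.iter().rev() {
            if lsn < low_water {
                out.push((lsn, val));
                low_water = lsn;
            }
            // else: this LSN was already covered by a newer, overlapping layer
        }
    }
    out
}
```

Here a locally generated L0 and a remote L1 sharing a delta at the same LSN yield that delta only once, which is the property the read path must preserve.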
The average total network download bandwidth of the hot secondary is equal to the rate at which the attached location generates L1 and image layers, plus the rate at which WAL is generated.
The average total disk write bandwidth is the sum of WAL generation rate plus L1/image generation rate: this is about the same as a normal attached location. The average disk _read_ bandwidth of a hot secondary is far lower than an attached location because it is not reading back layers to compact them -- layers are only read in periods where the attached location was unavailable, so computes started reading from a hot secondary.
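As a back-of-envelope check of the bandwidth claims above (the rates below are purely illustrative assumptions, not measurements):

```rust
/// Back-of-envelope bandwidth model for a hot secondary. Rates in MiB/s.
fn hot_secondary_bandwidth(wal_rate: f64, l1_image_rate: f64) -> (f64, f64) {
    // Network download: remote L1/image layers plus the WAL stream.
    let download = l1_image_rate + wal_rate;
    // Disk writes: ingested WAL (ephemeral -> L0) plus downloaded
    // L1/image layers -- about the same as an attached location.
    let disk_write = wal_rate + l1_image_rate;
    (download, disk_write)
}
```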
The trigger for virtual compaction can be similar to the existing trigger
for L1 compaction on attached locations: once we build up a deep stack of L0s, we do virtual compaction to trim it. This assumes that the attached location has kept up with compaction. The hot secondary can be
more tolerant of a deeper L0 stack because it is less often serving
reads: for example, it might make sense to trigger normal L1 compaction at 10 L0 layers and virtual compaction at 15 L0 layers, giving a good chance that by the time the hot secondary compacts, the attached location has already written out some layer files for it to read.
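The trigger logic can be sketched as follows; the 10/15 thresholds are the hypothetical values from the example above, not tuned constants:

```rust
/// Hypothetical thresholds following the example in the text: the hot
/// secondary tolerates a deeper L0 stack than the attached location.
const ATTACHED_L0_THRESHOLD: usize = 10;
const HOT_SECONDARY_L0_THRESHOLD: usize = 15;

#[derive(Debug, PartialEq)]
enum CompactionAction {
    None,
    L1Compaction,
    VirtualCompaction,
}

/// Decide what to do once a location observes `l0_depth` stacked L0s.
fn compaction_trigger(is_hot_secondary: bool, l0_depth: usize) -> CompactionAction {
    if is_hot_secondary {
        if l0_depth >= HOT_SECONDARY_L0_THRESHOLD {
            CompactionAction::VirtualCompaction
        } else {
            CompactionAction::None
        }
    } else if l0_depth >= ATTACHED_L0_THRESHOLD {
        CompactionAction::L1Compaction
    } else {
        CompactionAction::None
    }
}
```

The gap between the two thresholds gives the attached location a head start, so the hot secondary's virtual compaction usually finds fresh L1s in the remote index.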
To avoid an availability gap while downloading data from S3, it is important that the hot
secondary downloads new layer files before updating its layer map to de-reference replaced
layers.
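The ordering constraint can be sketched like this, with layer names as plain strings (a real layer map update would happen atomically under a lock):

```rust
/// Apply a virtual compaction result without an availability gap: the new
/// layers must be resident on disk *before* the layer map stops
/// referencing the layers they replace.
fn apply_virtual_compaction(
    layer_map: &mut Vec<String>,
    new_layers: Vec<String>,
    replaced: &[String],
    download: impl Fn(&str),
) {
    // 1. Make the replacement layers resident first.
    for layer in &new_layers {
        download(layer);
    }
    // 2. Only then drop references to the replaced layers and publish
    //    the new ones.
    layer_map.retain(|l| !replaced.contains(l));
    layer_map.extend(new_layers);
}
```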
### Handling missing layers/timelines
If an incoming request references a timeline that the hot secondary is
unaware of, it must go read from S3 to determine if the timeline exists, and if so then load it.
The hot secondary should also be tipped off by the storage controller when
timelines are created, so that in normal operation it is aware of timelines
immediately rather than having to load on demand (loading on demand could
have much higher latency for reads).
Hot secondaries may also experience 404s reading layers from remote storage, because the layer might have been deleted by the attached location
during compaction or GC. If the hot secondary finds such a 404, it should
trigger a re-download of the timeline index.
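A sketch of the retry-on-404 behaviour, with the layer read and the index refresh abstracted as closures:

```rust
#[derive(Debug, PartialEq)]
enum ReadError {
    NotFound,
}

/// On a 404, refresh the timeline index once and retry: the attached
/// location may have deleted the layer during compaction or GC, in which
/// case the refreshed index names its replacement.
fn read_layer_with_refresh<R, F>(
    mut read: R,
    mut refresh_index: F,
) -> Result<Vec<u8>, ReadError>
where
    R: FnMut() -> Result<Vec<u8>, ReadError>,
    F: FnMut(),
{
    match read() {
        Err(ReadError::NotFound) => {
            refresh_index(); // pick up the attached location's latest index
            read() // retry against the (possibly replaced) layer set
        }
        ok => ok,
    }
}
```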
### Transition from Hot Secondary to Attached
While a hot secondary can continue ingesting WAL and serving reads independently for a short period of time (until
too many L0s build up to serve reads efficiently), it needs to be promoted to be the attached
location if the last attached location becomes unavailable (or if the storage controller
determines that the tenant should be migrated).
This can be done trivially by shutting down and starting up again in attached mode (on startup
the layer map will be reset to the content of remote storage), but this can impose an availability gap, because:
- After unexpected failure of an attached location, the hot secondary's local L0s may be
further ahead in WAL ingest than the contents of remote storage, so resetting to what's
in remote storage will make recent data unavailable until it is re-ingested.
- Even if the remote data is up to date with latest WAL, it may take some time to download
layers.
To avoid an availability gap while re-ingesting WAL, it is necessary to stitch the local L0s with remote storage state. We may do this at startup, by making an exception to our
usual policy of only respecting remote storage state at startup. This exception can
be specific to L0 files, and perhaps also specific to when we can detect that these
were written by a hot secondary (perhaps by marking these files with a suffix or magic 0xffff generation?)
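A sketch of this startup stitching exception, assuming a hypothetical per-file marker for hot-secondary-written L0s and a WAL high watermark derived from the remote index:

```rust
struct LocalL0 {
    lsn_end: u64,
    /// Hypothetical marker (e.g. a filename suffix or magic generation)
    /// identifying an L0 written by a hot secondary.
    hot_secondary_written: bool,
}

/// At attached startup after a cutover, make an exception to the usual
/// "remote storage is the source of truth" rule: keep local L0s that
/// were written by the hot secondary and extend beyond what remote
/// storage already covers. Everything else is reset to remote state.
fn stitch_local_l0s(local: Vec<LocalL0>, remote_wal_high_watermark: u64) -> Vec<LocalL0> {
    local
        .into_iter()
        .filter(|l0| l0.hot_secondary_written && l0.lsn_end > remote_wal_high_watermark)
        .collect()
}
```

Retained L0s would then be added to the LayerMap and enqueued for upload under the new generation, as described in the failover summary below.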
We should also only do this cutover once we're reasonably sure the old attached location
isn't still uploading, so that on startup we do not see a whole new layer map with lots
of layers that need downloading.
We may still tolerate some availability gap in the <1s range while reloading the tenant
in a different mode. We should aim for this to be under 100ms under usual circumstances,
as it should only take long enough to:
- Flush the ephemeral layer to L0 on shutdown (writing 128MB takes on the order of 100ms)
- Load the remote index on startup (reading from S3 takes on the order of 10ms)
Doing many such cutovers concurrently may result in worse availability, so the controller
should be tuned to understand that when cutting over multiple hot secondaries to attached,
it is best not to rush it (as they are already in a readable state, it is less urgent
than when activating warm secondaries).
## Summary of a failover
To summarize the order of operations when a pageserver instance fails while holding a tenant
that has a hot secondary location:
- After some short timeout (100s of ms), the compute gives up on getpage requests to the primary and sends
  them to the hot secondary.
- After some much longer timeout (e.g. ~30s), the controller decides that the hot secondary should
  become attached, so that it can do its own compaction.
- The hot secondary is instructed to do a compaction before shutting down, so that during
  its restart into attached mode it will not have to deal with any remote storage changes.
- Hot secondary shuts down, flushing ephemeral layer to L0.
- Previously-secondary location starts up in attached mode with a new generation. Downloads
index from remote storage, and identifies which L0 files to retain. Adds these to LayerMap
and enqueues them for upload.
- Now fully available for reads and able to proceed with compaction etc as normal.
## Optimisations/details
- We should add a read-only mode to RemoteTimelineClient
## Alternatives considered
### Full mirror
We could make hot secondary locations do all compaction, gc, etc operations
independently, and maintain their own set of layer files in S3. These would essentially be separate tenants in pageserver terms, but consuming the same safekeeper timelines.
These locations would no longer be anything special in pageserver terms; they'd simply be attached locations that use some modified path like `<tenant_id>.secondary` to avoid colliding with the primary data.
The storage controller could have some `AttachedHotSecondary` placement
policy that configures the hot secondary location with some flag to indicate that the alternative storage path should be used.
Clearly the advantage of this approach is code simplicity. However, the
downsides are substantial:
- Double object storage costs
- Compaction costs are doubled (CPU & disk read I/O), whereas the proposed
implementation of hot secondaries only pays twice for the compaction _write_ I/O, as each location writes compacted layers to its local disk.