| Author | SHA1 | Message | Date |
|---|---|---|---|
| John Spray | ce3d23e223 | Notes on cutover | 2025-03-14 19:19:28 +00:00 |
| John Spray | 7d60895310 | rename | 2025-03-14 15:33:23 +00:00 |
| John Spray | f441b1efb4 | docs: add hot secondaries rfc | 2025-03-13 17:22:56 +00:00 |

# Pageserver Hot Secondaries
## Summary
It is proposed to add a new mode for pageserver tenant shard locations,
called "hot secondary", which is able to serve page_service requests but
does not do all the same housekeeping as an attached location, and does
not store any additional data in S3.
There is a stark tradeoff between resource cost and complexity: a very simple solution would be to have multiple full attached locations doing independent I/O, but this RFC proposes some additional complexity to
reduce cost.
## Background
In the [pageserver migration RFC](028-pageserver-migration.md), we introduced the concept of "warm secondaries". These are pageserver locations that poll remote storage for a _heatmap_ describing which layers they should hold, and then download those layers from S3. This enables them to rapidly transition into a usable attached location with a warm cache.
Combined with the storage controller's detection of pageserver failures, warm
secondaries enable high availability of pageservers with a recovery time
objective (RTO) measured in seconds (depending on the configured heartbeat
frequency): occasional cloud instance failures are typically recovered
in well under a minute, without human intervention.
## Purpose
We aim to provide a sub-second RTO for pageserver failures for mission-critical
workloads. To do this, we should enable the postgres client
to make its own decision about cutting over to a secondary, rather than
waiting for the controller to detect a failure and instruct it to
use a different pageserver. These secondaries should be maintained
in a continuously readable state rather than requiring explicit activation.
Because low-RTO failover is intrinsically vulnerable to "flapping"/false
positives, reads from such a hot secondary will not "promote" the secondary: we don't want to flap back and forth at millisecond timescales. Rather, reads will be served by hot secondaries at any time,
but their transition to an attached (primary) location will still be
managed by the storage controller.
## Design of Hot Location Mode
At a high level, hot locations are basically the same `Tenant` and `Timeline` types as an attached location, but with some behavioral tweaks. This RFC won't get into code structure details: these changes
may be expressed as different types (more robust) or as different modes
for existing types (less code churn, more complexity).
### Load and ingest
Initially, we may start in the same way as a normal attached location:
by discovering the latest metadata in remote storage and constructing
a LayerMap.
We should also do ingest as normal: subscribing to safekeeper and streaming
writes into ephemeral layer files that are then frozen into L0s. However,
we do not want to wastefully upload these to S3 (they duplicate what the
attached location is already writing).
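As a sketch of this ingest behaviour (using simplified stand-in types, not the real `Tenant`/`Timeline` machinery), the only difference from an attached location is that a frozen L0 is kept local rather than enqueued for upload:

```rust
// Simplified stand-ins for illustration; the real pageserver types are
// far richer than this.
enum LocationMode {
    Attached,
    HotSecondary,
}

struct FrozenL0 {
    lsn_end: u64,
}

/// When an ephemeral layer is frozen into an L0, an attached location
/// schedules an upload; a hot secondary keeps the file local only,
/// because the attached location is already writing the same data to S3.
fn on_l0_frozen(mode: &LocationMode, l0: FrozenL0, upload_queue: &mut Vec<FrozenL0>) {
    match mode {
        LocationMode::Attached => upload_queue.push(l0),
        // No upload: avoid duplicating the attached location's S3 writes.
        LocationMode::HotSecondary => drop(l0),
    }
}
```

In the real code this branch would hang off whatever holds the location mode; the sketch only illustrates where the divergence sits.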
### "Virtual" compaction
Clearly ingesting but never uploading or compacting will generate an unbounded stack of L0 layers, unless we do something about it.
To solve this, we may add a special type of compaction that re-reads
from remote storage, updates the layer map to contain all L1
and image layers from the remote metadata, and triggers download of these.
We do not download remote L0s during virtual compaction: the hot secondary has itself been ingesting and generating these, so downloading them would be wasteful. Instead, we trim any local L0s that are now covered by the L1 high watermark of the remote metadata, and retain any that are still needed to serve reads.
Note that this process is expected to generate some overlaps in LSN space: we might have an L0 that we generated locally which overlaps with an L1 from remote storage. getpage@lsn logic must handle this and avoid assuming non-overlapping layers: having read some deltas from an L0, we must not read the same deltas again from an L1, so we must remember which LSNs we have already passed.
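A minimal sketch of overlap-tolerant delta collection, assuming simplified layers that hold `(lsn, delta)` pairs in ascending LSN order (the real read path works on key ranges and value reconstruction):

```rust
/// Simplified layer: deltas stored as (lsn, value), ascending by LSN.
struct Layer {
    deltas: Vec<(u64, &'static str)>,
}

/// Collect deltas newest-to-oldest across layers that may overlap in LSN
/// space. `low_water` remembers the oldest LSN already consumed, so a
/// delta present in both a local L0 and a remote L1 is taken only once.
fn collect_deltas(layers_newest_first: &[Layer]) -> Vec<(u64, &'static str)> {
    let mut out = Vec::new();
    let mut low_water = u64::MAX; // nothing consumed yet
    for layer in layers_newest_first {
        for &(lsn, val) in layer.deltas.iter().rev() {
            if lsn < low_water {
                out.push((lsn, val));
                low_water = lsn;
            }
            // else: this LSN was already covered by a newer, overlapping layer
        }
    }
    out
}
```

Here a locally generated L0 and a remote L1 sharing a delta at the same LSN yield that delta only once, which is the property the read path must preserve.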
The average total network download bandwidth of the hot secondary is equal to the rate at which the attached location generates L1 and image layers, plus the rate at which WAL is generated.
The average total disk write bandwidth is the sum of WAL generation rate plus L1/image generation rate: this is about the same as a normal attached location. The average disk _read_ bandwidth of a hot secondary is far lower than an attached location because it is not reading back layers to compact them -- layers are only read in periods where the attached location was unavailable, so computes started reading from a hot secondary.
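As a back-of-envelope check of the bandwidth claims above (the rates below are purely illustrative assumptions, not measurements):

```rust
/// Back-of-envelope bandwidth model for a hot secondary. Rates in MiB/s.
fn hot_secondary_bandwidth(wal_rate: f64, l1_image_rate: f64) -> (f64, f64) {
    // Network download: remote L1/image layers plus the WAL stream.
    let download = l1_image_rate + wal_rate;
    // Disk writes: ingested WAL (ephemeral -> L0) plus downloaded
    // L1/image layers -- about the same as an attached location.
    let disk_write = wal_rate + l1_image_rate;
    (download, disk_write)
}
```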
The trigger for virtual compaction can be similar to the existing trigger
for L1 compaction on attached locations: once we build up a deep stack of L0s, we do virtual compaction to trim it. This assumes that the attached location has kept up with compaction. The hot secondary can be
more tolerant of a deeper L0 stack because it is less often serving
reads: for example, it might make sense to trigger normal L1 compaction at 10 L0 layers and virtual compaction at 15 L0 layers, giving a good chance that by the time the hot secondary compacts, the attached location has already written out some layer files for it to read.
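The trigger logic can be sketched as follows; the 10/15 thresholds are the hypothetical values from the example above, not tuned constants:

```rust
/// Hypothetical thresholds following the example in the text: the hot
/// secondary tolerates a deeper L0 stack than the attached location.
const ATTACHED_L0_THRESHOLD: usize = 10;
const HOT_SECONDARY_L0_THRESHOLD: usize = 15;

#[derive(Debug, PartialEq)]
enum CompactionAction {
    None,
    L1Compaction,
    VirtualCompaction,
}

/// Decide what to do once a location observes `l0_depth` stacked L0s.
fn compaction_trigger(is_hot_secondary: bool, l0_depth: usize) -> CompactionAction {
    if is_hot_secondary {
        if l0_depth >= HOT_SECONDARY_L0_THRESHOLD {
            CompactionAction::VirtualCompaction
        } else {
            CompactionAction::None
        }
    } else if l0_depth >= ATTACHED_L0_THRESHOLD {
        CompactionAction::L1Compaction
    } else {
        CompactionAction::None
    }
}
```

The gap between the two thresholds gives the attached location a head start, so the hot secondary's virtual compaction usually finds fresh L1s in the remote index.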
To avoid an availability gap while downloading data from S3, it is important that the hot
secondary downloads new layer files before updating its layer map to de-reference replaced
layers.
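The ordering constraint can be sketched like this, with layer names as plain strings (a real layer map update would happen atomically under a lock):

```rust
/// Apply a virtual compaction result without an availability gap: the new
/// layers must be resident on disk *before* the layer map stops
/// referencing the layers they replace.
fn apply_virtual_compaction(
    layer_map: &mut Vec<String>,
    new_layers: Vec<String>,
    replaced: &[String],
    download: impl Fn(&str),
) {
    // 1. Make the replacement layers resident first.
    for layer in &new_layers {
        download(layer);
    }
    // 2. Only then drop references to the replaced layers and publish
    //    the new ones.
    layer_map.retain(|l| !replaced.contains(l));
    layer_map.extend(new_layers);
}
```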
### Handling missing layers/timelines
If an incoming request references a timeline that the hot secondary is
unaware of, it must go read from S3 to determine if the timeline exists, and if so then load it.
The hot secondary should also be tipped off by the storage controller when
timelines are created, so that in normal operation it is aware of timelines
immediately rather than having to load on demand (loading on demand could
have much higher latency for reads).
Hot secondaries may also experience 404s reading layers from remote storage, because the layer might have been deleted by the attached location
during compaction or GC. If the hot secondary finds such a 404, it should
trigger a re-download of the timeline index.
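A sketch of the retry-on-404 behaviour, with the layer read and the index refresh abstracted as closures:

```rust
#[derive(Debug, PartialEq)]
enum ReadError {
    NotFound,
}

/// On a 404, refresh the timeline index once and retry: the attached
/// location may have deleted the layer during compaction or GC, in which
/// case the refreshed index names its replacement.
fn read_layer_with_refresh<R, F>(
    mut read: R,
    mut refresh_index: F,
) -> Result<Vec<u8>, ReadError>
where
    R: FnMut() -> Result<Vec<u8>, ReadError>,
    F: FnMut(),
{
    match read() {
        Err(ReadError::NotFound) => {
            refresh_index(); // pick up the attached location's latest index
            read() // retry against the (possibly replaced) layer set
        }
        ok => ok,
    }
}
```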
### Transition from Hot Secondary to Attached
While a hot secondary can continue ingesting WAL and serving reads independently for a short period of time (until
too many L0s build up to serve reads efficiently), it needs to be promoted to be the attached
location if the last attached location becomes unavailable (or if the storage controller
determines that the tenant should be migrated).
This can be done trivially by shutting down and starting up again in attached mode (on startup
the layer map will be reset to the content of remote storage), but this can impose an availability gap, because:
- After unexpected failure of an attached location, the hot secondary's local L0s may be
further ahead in WAL ingest than the contents of remote storage, so resetting to what's
in remote storage will make recent data unavailable until it is re-ingested.
- Even if the remote data is up to date with latest WAL, it may take some time to download
layers.
To avoid an availability gap while re-ingesting WAL, it is necessary to stitch the local L0s with remote storage state. We may do this at startup, by making an exception to our
usual policy of only respecting remote storage state at startup. This exception can
be specific to L0 files, and perhaps also specific to when we can detect that these
were written by a hot secondary (perhaps by marking these files with a suffix or magic 0xffff generation?)
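A sketch of this startup stitching exception, assuming a hypothetical per-file marker for hot-secondary-written L0s and a WAL high watermark derived from the remote index:

```rust
struct LocalL0 {
    lsn_end: u64,
    /// Hypothetical marker (e.g. a filename suffix or magic generation)
    /// identifying an L0 written by a hot secondary.
    hot_secondary_written: bool,
}

/// At attached startup after a cutover, make an exception to the usual
/// "remote storage is the source of truth" rule: keep local L0s that
/// were written by the hot secondary and extend beyond what remote
/// storage already covers. Everything else is reset to remote state.
fn stitch_local_l0s(local: Vec<LocalL0>, remote_wal_high_watermark: u64) -> Vec<LocalL0> {
    local
        .into_iter()
        .filter(|l0| l0.hot_secondary_written && l0.lsn_end > remote_wal_high_watermark)
        .collect()
}
```

Retained L0s would then be added to the LayerMap and enqueued for upload under the new generation, as described in the failover summary below.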
We should also only do this cutover once we're reasonably sure the old attached location
isn't still uploading, so that on startup we do not see a whole new layer map with lots
of layers that need downloading.
We may still tolerate some availability gap in the <1s range while reloading the tenant
in a different mode. We should aim for this to be under 100ms under usual circumstances,
as it should only take long enough to:
- Flush the ephemeral layer to L0 on shutdown (writing 128MB takes on the order of 100ms)
- Load the remote index on startup (reading from S3 takes on the order of 10ms)
Doing many such cutovers concurrently may result in worse availability, so the controller
should be tuned to understand that when cutting over multiple hot secondaries to attached,
it is best not to rush it (as they are already in a readable state, it is less urgent
than when activating warm secondaries).
## Summary of a failover
To summarize the order of operations when a pageserver instance fails while holding a tenant
that has a hot secondary location:
- After some short timeout (100s of ms), the compute gives up on getpage requests to the primary and sends
  them to the hot secondary.
- After some much longer timeout (e.g. ~30s), the controller decides that the hot secondary should
  become attached, so that it can do its own compaction.
- The hot secondary is instructed to do a compaction before shutting down, so that during
  its restart into attached mode it will not have to deal with any remote storage changes.
- Hot secondary shuts down, flushing ephemeral layer to L0.
- Previously-secondary location starts up in attached mode with a new generation. Downloads
index from remote storage, and identifies which L0 files to retain. Adds these to LayerMap
and enqueues them for upload.
- Now fully available for reads and able to proceed with compaction etc as normal.
## Optimisations/details
- We should add a read-only mode to RemoteTimelineClient
## Alternatives considered
### Full mirror
We could make hot secondary locations do all compaction, gc, etc operations
independently, and maintain their own set of layer files in S3. These would essentially be separate tenants in pageserver terms, but consuming the same safekeeper timelines.
These locations would no longer be anything special in pageserver terms; they'd simply be attached locations that use some modified path like `<tenant_id>.secondary` to avoid colliding with the primary data.
The storage controller could have some `AttachedHotSecondary` placement
policy that configures the hot secondary location with some flag to indicate that the alternative storage path should be used.
Clearly the advantage of this approach is code simplicity. However, the
downsides are substantial:
- Double object storage costs
- Compaction costs are doubled (CPU & disk read I/O), whereas the proposed
implementation of hot secondaries only pays twice for the compaction _write_ I/O, as each location writes compacted layers to its local disk.