# Timeline import from `PGDATA`

Created at: 2024-10-28

Author: Christian Schwarz (@problame)

## Summary

This document is a retroactive RFC for a feature for creating new root timelines
from a vanilla Postgres `PGDATA` directory dump.
## Motivation & Context
|
||||
|
||||
perf hedge
|
||||
|
||||
companion RFCs:
|
||||
* cloud.git RFC that fits everyting together
|
||||
* READ THIS FIRST
|
||||
* importer that produces the PGDATA
|
||||
|
||||
## Non-Goals

## Components

This document is constrained to the storage pieces; the companion RFC in cloud.git fits them into the full system.

## Prior Art

The implementation originated as hackathon code.

## Proposed Implementation

### High-Level

Refactor timeline creation (modes) & idempotency checking:

* add a new timeline creation mode for import from `PGDATA`
* idempotency checking is based on a caller-provided idempotency key

The creating task does the following:

1. Acquire the creation guard.
2. Create an anonymous empty timeline.
3. Upload the index part to S3, with a field indicating that an import is in progress.
4. Spawn a tokio task that performs the import (bound to tenant cancellation & the creation guard).
5. The import task produces layers, adds them to the layer map, and uploads them using existing infrastructure.

After the import task is done:

1. Transition the index part into the `done` state.
2. Shut down the timeline object.
3. Load the timeline from remote storage, using the code that we use during tenant attach, and add it to `Tenant::timelines`.

If the pageserver restarts or the tenant is migrated during import, tenant attach resumes the import job
based on the index part field, as sketched below.
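
As a rough sketch, the import-in-progress marker in the index part could look like the following. All field and type names here are hypothetical; the actual `IndexPart` format is defined elsewhere in the pageserver.

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical sketch of the import state recorded in the index part.
/// `InProgress` causes tenant attach to resume the import task;
/// once the import finishes, the field transitions to `Done`.
#[derive(Serialize, Deserialize, Clone, Debug)]
pub enum ImportPgdataStatus {
    InProgress {
        /// Caller-provided idempotency key from the creation request.
        idempotency_key: String,
        /// Location of the PGDATA dump in object storage.
        pgdata_location: String,
    },
    Done,
}
```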

TODO: sequence diagram.

### No Storage Controller Awareness

The storage controller is unaware of the import job, since it simply proxies the timeline creation request to each shard.
### PGDATA => Image Layer Conversion
|
||||
|
||||
The import task converts a PGDATA directory dump located in object storage into image layers.
|
||||
|
||||
shard awareness, including the new test for 0 disposable keys
|
||||
|
||||
range requests
|
||||
|
||||
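
A minimal sketch of the shard-awareness aspect, with hypothetical `Key`/`ShardIdentity` stand-ins for the pageserver's real types: each shard walks the same dump but only materializes the keys it owns.

```rust
/// Hypothetical stand-ins for the pageserver's key and shard types.
struct Key(u64);

struct ShardIdentity {
    shard_number: u32,
    shard_count: u32,
}

impl ShardIdentity {
    /// Illustrative ownership check: the real mapping hashes key ranges
    /// to shards according to the tenant's stripe size.
    fn is_key_local(&self, key: &Key) -> bool {
        self.shard_count <= 1 || (key.0 % self.shard_count as u64) == self.shard_number as u64
    }
}

/// Each shard scans the full set of page keys in the dump but emits
/// image layers only for the keys it owns; the rest are skipped.
fn keys_for_this_shard(all_keys: Vec<Key>, shard: &ShardIdentity) -> Vec<Key> {
    all_keys
        .into_iter()
        .filter(|key| shard.is_key_local(key))
        .collect()
}
```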

### The Mechanics Of Executing The Conversion Code

Tokio task lifecycles; this introduces the new concept of a long-running timeline creation.

This is the first time a timeline creation runs long, and we need to make it survive pageserver restarts and tenant migrations.
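
A sketch of the lifecycle binding, assuming `tokio` and `tokio_util`; the real code additionally holds the timeline creation guard, and all names below are illustrative.

```rust
use tokio_util::sync::CancellationToken;

/// Hypothetical sketch: the import runs as a tokio task bound to the
/// tenant's cancellation token, so tenant shutdown or migration stops it;
/// the import then resumes on the next tenant attach.
fn spawn_import_task(tenant_cancel: CancellationToken) {
    tokio::spawn(async move {
        tokio::select! {
            _ = tenant_cancel.cancelled() => {
                // Tenant is shutting down; leave the index part in the
                // in-progress state so the next attach resumes the import.
            }
            _ = run_import() => {
                // Import finished; transition the index part to `done`.
            }
        }
    });
}

async fn run_import() {
    // Produce image layers and upload them (omitted).
}
```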
### Resumability & Forward Progress
|
||||
|
||||
Index Part field, document here.
|
||||
Evoultion toward checkpointing in the index part is possible.
|
||||
|
||||
For v1, no checkpointing is implemented because the MTBF of pageserver nodes is much higher than
|
||||
the highest expected import duration; and tenant migrations are rare enough.
|
||||
|
||||

### Resource Usage

The import is parallelized over multiple tokio tasks.
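
A sketch of what the parallelization could look like with bounded concurrency, assuming `tokio::sync::Semaphore`; the limit, task granularity, and names are illustrative, and the RFC does not specify how (or whether) v1 bounds the fan-out.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Hypothetical sketch: cap the number of concurrently running
/// layer-conversion tasks so a single import cannot monopolize the node.
async fn convert_all(chunks: Vec<Vec<u8>>) {
    let semaphore = Arc::new(Semaphore::new(8)); // illustrative limit
    let mut handles = Vec::new();
    for chunk in chunks {
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            let _permit = permit; // released when the task finishes
            convert_chunk(chunk).await;
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }
}

async fn convert_chunk(_chunk: Vec<u8>) {
    // Build and upload one image layer (omitted).
}
```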

### Interaction with Migrations / Generations

Open question: what happens after migrating the tenant / reloading it with a newer generation?

### Security

The current implementation assumes that the PGDATA dump contents are trusted input,
and further that they do not change while the import is ongoing.

There is no hard technical need for trusting the PGDATA dump contents, since
the bulk of layer conversion is copying opaque 8k data blobs.

However, the conversion code uses serde-generated deserialization routines in some
places which haven't been audited for resource exhaustion attacks (e.g. OOM due to large allocations).

Further, the conversion code is littered with `.unwrap()` and `assert!()` statements,
a remnant of the fact that it was hackathon code. This risks panicking the pageserver
that executes the import, and our panic blast radius is currently not reliably constrained
to the tenant in which it originates.

### Progress Notification & Integration Into The Larger System

Timeline create API: no changes to the durability semantics of the create operation; the cplane create
operation is done as soon as the API returns 201 (an earlier design used a 202 code; remove that from the cplane draft).

Readiness for import: due to limitations in the control plane regarding long-running jobs, cplane
schedules the timeline creation before the PGDATA dump has actually finished uploading.
As a workaround, the import task polls for a special status key in the PGDATA to appear, which
declares the PGDATA dump as finished uploading and unchanging from then on (sketched below).
TODO: review the linearizability of S3, i.e., can we rely on ListObjectsV2 to return the
full listing immediately after we observe the special status key? Or does the status key
need to contain the list of prefixes? That would be better anyway, and it could be signed (cf. Security).
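
A sketch of the polling workaround, assuming the AWS Rust SDK; the status key name, polling interval, and error handling are illustrative, not the actual contract.

```rust
use std::time::Duration;
use aws_sdk_s3::Client;

/// Hypothetical sketch: poll until the special status key appears,
/// which declares the PGDATA dump finished and immutable.
async fn wait_for_pgdata_upload_done(client: &Client, bucket: &str, pgdata_prefix: &str) {
    let status_key = format!("{pgdata_prefix}/upload-done"); // illustrative key name
    loop {
        match client.head_object().bucket(bucket).key(&status_key).send().await {
            Ok(_) => return, // key exists: dump is complete and frozen
            // Not found (or transient error): keep polling.
            Err(_) => tokio::time::sleep(Duration::from_secs(10)).await,
        }
    }
}
```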

New concept: completion notification. We didn't need this before; now we do.
cplane must not use the timeline before receiving the completion notification.
Mechanism (retry loop sketched below):
* the source of truth for completion is a special status object that the import job uploads into the PGDATA location in S3
* to make cplane aware of the update, we invoke a notification upcall API
* the import job retries delivery of the notification upcall until it gets a success response
* in response to the upcall, cplane collects the statuses of all shards using a ListObjectsV2 prefix listing (it understands ShardId)
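
A sketch of the retry loop for the notification upcall, assuming `reqwest`; the endpoint and payload shape are illustrative, not the actual cplane API.

```rust
use std::time::Duration;

/// Hypothetical sketch: the import job retries the completion upcall
/// until cplane acknowledges it with a success response.
async fn notify_import_complete(client: &reqwest::Client, upcall_url: &str, shard_id: &str) {
    let payload = serde_json::json!({ "shard_id": shard_id, "status": "done" });
    loop {
        match client.post(upcall_url).json(&payload).send().await {
            Ok(resp) if resp.status().is_success() => return,
            // Retry on transport errors and non-2xx responses alike.
            _ => tokio::time::sleep(Duration::from_secs(10)).await,
        }
    }
}
```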

TODO: sequence diagram from the gist.

### Shard-Split during Import is UB

The design does not handle shard-splits during import, so for now it is declared undefined behavior.

### Storage-Controller, Sharding

As before this RFC, the storage controller forwards timeline create requests to each
shard and is unaware that the timeline is importing, because so far storcon has not needed
to be aware of timelines.

However, to make sharding transparent to cplane, it would make sense for storcon to
act as an intermediary for the import progress upcalls, i.e., batch up all
shards' completion upcalls and, when that's done, deliver a tenant-scoped (instead of shard-scoped)
upcall to cplane (sketched below).
That way, cplane need not know about shards, and storcon can internalize shard-splits.
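
A sketch of the aggregation storcon could do; the types and bookkeeping are illustrative.

```rust
use std::collections::HashSet;

/// Hypothetical sketch: storcon collects per-shard completion upcalls and
/// emits a single tenant-scoped upcall to cplane once every shard reported.
struct ImportCompletionTracker {
    shard_count: usize,
    completed_shards: HashSet<u32>,
}

impl ImportCompletionTracker {
    /// Record one shard's completion; returns true when the tenant-scoped
    /// upcall to cplane should be delivered.
    fn shard_completed(&mut self, shard_number: u32) -> bool {
        self.completed_shards.insert(shard_number);
        self.completed_shards.len() == self.shard_count
    }
}
```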

### Alternative Designs

#### External Task Performs Conversion

An external task performs the conversion and uploads the layers & index part into S3.

TODO: copy the arguments against this design from the gist.

## Future Work

### Work Required For Productization

Address the TODOs at the top of the import code, especially the resource exhaustion attack vectors.

Review the linearizability of ListObjectsV2 after observing the PGDATA upload-done key.

Resource usage: play nice with other tenants.

Regression test ensuring resumability (failpoints).

Storcon needs to be aware of the import and
- inhibit shard splits while an import is ongoing (can relax this later)
- redesign the progress notification system to
  - avoid the need to mutate the PGDATA, so it can be read-only

DESIGN WORK: should the storage controller scheduler be aware of an ongoing migration?
=> it should control resource usage, maybe apply a different policy for migrations, etc.

Safety: fail shard splits while an import is in progress. The pageserver should enforce this,
and storcon needs to know about and respect it (sketched below).
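
A sketch of the pageserver-side enforcement; all names are hypothetical stand-ins.

```rust
/// Hypothetical stand-in for the pageserver's tenant type.
struct Tenant {
    importing_timelines: usize,
}

#[derive(Debug)]
enum SplitError {
    ImportInProgress,
}

/// Hypothetical sketch: reject shard splits while any timeline of the
/// tenant is still importing; storcon must treat this as a hard error.
fn check_shard_split_allowed(tenant: &Tenant) -> Result<(), SplitError> {
    if tenant.importing_timelines > 0 {
        return Err(SplitError::ImportInProgress);
    }
    Ok(())
}
```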

Azure support.

### v2

Teach the pageserver to natively read PGDATA relation files. Then the import could be as simple
as reading a few control files and otherwise using object copy operations for
moving the relation files.
Problems:
- How to deal with sharding? If we don't slice up the files, then each shard
  downloads a full copy of the database. Tricky. Maybe we can handle this with a 1GiB stripe size,
  but that's pretty obscure.