# Timeline import from `PGDATA`

Created at: 2024-10-28

Author: Christian Schwarz (@problame)

## Summary

This document is a retroactive RFC for a feature for creating new root timelines
from a vanilla Postgres `PGDATA` directory dump.
## Motivation & Context
|
||||
|
||||
perf hedge
|
||||
|
||||
companion RFCs:
|
||||
* cloud.git RFC that fits everyting together
|
||||
* READ THIS FIRST
|
||||
* importer that produces the PGDATA
|
||||
|
||||
## Non-Goals

## Components

This document is constrained to the storage pieces; the companion RFC in cloud.git fits them into the full system.

## Prior Art

The implementation originated as hackathon code.

## Proposed Implementation

### High-Level

Refactor timeline creation (modes) & idempotency checking:

* add a new timeline creation mode for import from `PGDATA`
* idempotency checking is based on a caller-provided idempotency key

The creating task does the following:

1. Acquire the creation guard.
2. Create an anonymous empty timeline.
3. Upload the index part to S3, with a field indicating that an import is in progress.
4. Spawn a tokio task that performs the import (bound to tenant cancellation & the creation guard).
5. The import task produces layers, adds them to the layer map, and uploads them using existing infrastructure.

After the import task is done:

1. Transition the index part into the `done` state.
2. Shut down the timeline object.
3. Load the timeline from remote storage, using the code that we use during tenant attach, and add it to `Tenant::timelines`.

If the pageserver restarts or the tenant is migrated during import, tenant attach resumes the import job
based on the index part field, as sketched below.
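
As a rough sketch, the import-in-progress marker in the index part could look like the following. All field and type names here are hypothetical; the actual `IndexPart` format is defined elsewhere in the pageserver.

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical sketch of the import state recorded in the index part.
/// `InProgress` causes tenant attach to resume the import task;
/// once the import finishes, the field transitions to `Done`.
#[derive(Serialize, Deserialize, Clone, Debug)]
pub enum ImportPgdataStatus {
    InProgress {
        /// Caller-provided idempotency key from the creation request.
        idempotency_key: String,
        /// Location of the PGDATA dump in object storage.
        pgdata_location: String,
    },
    Done,
}
```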

TODO: sequence diagram.

### No Storage Controller Awareness

The storage controller is unaware of the import job, since it simply proxies the timeline creation request to each shard.
### PGDATA => Image Layer Conversion
|
||||
|
||||
The import task converts a PGDATA directory dump located in object storage into image layers.
|
||||
|
||||
shard awareness, including the new test for 0 disposable keys
|
||||
|
||||
range requests
|
||||
|
||||
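
A minimal sketch of the shard-awareness aspect, with hypothetical `Key`/`ShardIdentity` stand-ins for the pageserver's real types: each shard walks the same dump but only materializes the keys it owns.

```rust
/// Hypothetical stand-ins for the pageserver's key and shard types.
struct Key(u64);

struct ShardIdentity {
    shard_number: u32,
    shard_count: u32,
}

impl ShardIdentity {
    /// Illustrative ownership check: the real mapping hashes key ranges
    /// to shards according to the tenant's stripe size.
    fn is_key_local(&self, key: &Key) -> bool {
        self.shard_count <= 1 || (key.0 % self.shard_count as u64) == self.shard_number as u64
    }
}

/// Each shard scans the full set of page keys in the dump but emits
/// image layers only for the keys it owns; the rest are skipped.
fn keys_for_this_shard(all_keys: Vec<Key>, shard: &ShardIdentity) -> Vec<Key> {
    all_keys
        .into_iter()
        .filter(|key| shard.is_key_local(key))
        .collect()
}
```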

### The Mechanics Of Executing The Conversion Code

Tokio task lifecycles; this introduces the new concept of a long-running timeline creation.

This is the first time a timeline creation runs long, and we need to make it survive pageserver restarts and tenant migrations.
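
A sketch of the lifecycle binding, assuming `tokio` and `tokio_util`; the real code additionally holds the timeline creation guard, and all names below are illustrative.

```rust
use tokio_util::sync::CancellationToken;

/// Hypothetical sketch: the import runs as a tokio task bound to the
/// tenant's cancellation token, so tenant shutdown or migration stops it;
/// the import then resumes on the next tenant attach.
fn spawn_import_task(tenant_cancel: CancellationToken) {
    tokio::spawn(async move {
        tokio::select! {
            _ = tenant_cancel.cancelled() => {
                // Tenant is shutting down; leave the index part in the
                // in-progress state so the next attach resumes the import.
            }
            _ = run_import() => {
                // Import finished; transition the index part to `done`.
            }
        }
    });
}

async fn run_import() {
    // Produce image layers and upload them (omitted).
}
```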
### Resumability & Forward Progress
|
||||
|
||||
Index Part field, document here.
|
||||
Evoultion toward checkpointing in the index part is possible.
|
||||
|
||||
For v1, no checkpointing is implemented because the MTBF of pageserver nodes is much higher than
|
||||
the highest expected import duration; and tenant migrations are rare enough.
|
||||
|
||||

### Resource Usage

The import is parallelized over multiple tokio tasks.
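
A sketch of what the parallelization could look like with bounded concurrency, assuming `tokio::sync::Semaphore`; the limit, task granularity, and names are illustrative, and the RFC does not specify how (or whether) v1 bounds the fan-out.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Hypothetical sketch: cap the number of concurrently running
/// layer-conversion tasks so a single import cannot monopolize the node.
async fn convert_all(chunks: Vec<Vec<u8>>) {
    let semaphore = Arc::new(Semaphore::new(8)); // illustrative limit
    let mut handles = Vec::new();
    for chunk in chunks {
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            let _permit = permit; // released when the task finishes
            convert_chunk(chunk).await;
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }
}

async fn convert_chunk(_chunk: Vec<u8>) {
    // Build and upload one image layer (omitted).
}
```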

### Interaction with Migrations / Generations

Open question: what happens after migrating the tenant / reloading it with a newer generation?

### Security

The current implementation assumes that the PGDATA dump contents are trusted input,
and further that they do not change while the import is ongoing.

There is no hard technical need for trusting the PGDATA dump contents, since
the bulk of layer conversion is copying opaque 8k data blobs.

However, the conversion code uses serde-generated deserialization routines in some
places which haven't been audited for resource exhaustion attacks (e.g. OOM due to large allocations).

Further, the conversion code is littered with `.unwrap()` and `assert!()` statements,
a remnant of the fact that it was hackathon code. This risks panicking the pageserver
that executes the import, and our panic blast radius is currently not reliably constrained
to the tenant in which it originates.

### Progress Notification & Integration Into The Larger System

Timeline create API: no changes to the durability semantics of the create operation; the cplane create
operation is done as soon as the API returns 201 (an earlier design used a 202 code; remove that from the cplane draft).

Readiness for import: due to limitations in the control plane regarding long-running jobs, cplane
schedules the timeline creation before the PGDATA dump has actually finished uploading.
As a workaround, the import task polls for a special status key in the PGDATA to appear, which
declares the PGDATA dump as finished uploading and unchanging from then on (sketched below).
TODO: review the linearizability of S3, i.e., can we rely on ListObjectsV2 to return the
full listing immediately after we observe the special status key? Or does the status key
need to contain the list of prefixes? That would be better anyway, and it could be signed (cf. Security).
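
A sketch of the polling workaround, assuming the AWS Rust SDK; the status key name, polling interval, and error handling are illustrative, not the actual contract.

```rust
use std::time::Duration;
use aws_sdk_s3::Client;

/// Hypothetical sketch: poll until the special status key appears,
/// which declares the PGDATA dump finished and immutable.
async fn wait_for_pgdata_upload_done(client: &Client, bucket: &str, pgdata_prefix: &str) {
    let status_key = format!("{pgdata_prefix}/upload-done"); // illustrative key name
    loop {
        match client.head_object().bucket(bucket).key(&status_key).send().await {
            Ok(_) => return, // key exists: dump is complete and frozen
            // Not found (or transient error): keep polling.
            Err(_) => tokio::time::sleep(Duration::from_secs(10)).await,
        }
    }
}
```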

New concept: completion notification. We didn't need this before; now we do.
cplane must not use the timeline before receiving the completion notification.
Mechanism (retry loop sketched below):
* the source of truth for completion is a special status object that the import job uploads into the PGDATA location in S3
* to make cplane aware of the update, we invoke a notification upcall API
* the import job retries delivery of the notification upcall until it gets a success response
* in response to the upcall, cplane collects the statuses of all shards using a ListObjectsV2 prefix listing (it understands ShardId)
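
A sketch of the retry loop for the notification upcall, assuming `reqwest`; the endpoint and payload shape are illustrative, not the actual cplane API.

```rust
use std::time::Duration;

/// Hypothetical sketch: the import job retries the completion upcall
/// until cplane acknowledges it with a success response.
async fn notify_import_complete(client: &reqwest::Client, upcall_url: &str, shard_id: &str) {
    let payload = serde_json::json!({ "shard_id": shard_id, "status": "done" });
    loop {
        match client.post(upcall_url).json(&payload).send().await {
            Ok(resp) if resp.status().is_success() => return,
            // Retry on transport errors and non-2xx responses alike.
            _ => tokio::time::sleep(Duration::from_secs(10)).await,
        }
    }
}
```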

TODO: sequence diagram from the gist.

### Shard-Split during Import is UB

The design does not handle shard-splits during import, so for now it is declared undefined behavior.

### Storage-Controller, Sharding

As before this RFC, the storage controller forwards timeline create requests to each
shard and is unaware that the timeline is importing, because so far storcon has not needed
to be aware of timelines.

However, to make sharding transparent to cplane, it would make sense for storcon to
act as an intermediary for the import progress upcalls, i.e., batch up all
shards' completion upcalls and, when that's done, deliver a tenant-scoped (instead of shard-scoped)
upcall to cplane (sketched below).
That way, cplane need not know about shards, and storcon can internalize shard-splits.
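
A sketch of the aggregation storcon could do; the types and bookkeeping are illustrative.

```rust
use std::collections::HashSet;

/// Hypothetical sketch: storcon collects per-shard completion upcalls and
/// emits a single tenant-scoped upcall to cplane once every shard reported.
struct ImportCompletionTracker {
    shard_count: usize,
    completed_shards: HashSet<u32>,
}

impl ImportCompletionTracker {
    /// Record one shard's completion; returns true when the tenant-scoped
    /// upcall to cplane should be delivered.
    fn shard_completed(&mut self, shard_number: u32) -> bool {
        self.completed_shards.insert(shard_number);
        self.completed_shards.len() == self.shard_count
    }
}
```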

### Alternative Designs

#### External Task Performs Conversion

An external task performs the conversion and uploads the layers & index part into S3.

TODO: copy the arguments against this design from the gist.

## Future Work

### Work Required For Productization

Address the TODOs at the top of the import code, especially the resource exhaustion attack vectors.

Review the linearizability of ListObjectsV2 after observing the PGDATA upload-done key.

Resource usage: play nice with other tenants.

Regression test ensuring resumability (failpoints).

Storcon needs to be aware of the import and
- inhibit shard splits while an import is ongoing (can relax this later)
- redesign the progress notification system to
  - avoid the need to mutate the PGDATA, so it can be read-only

DESIGN WORK: should the storage controller scheduler be aware of an ongoing migration?
=> it should control resource usage, maybe apply a different policy for migrations, etc.

Safety: fail shard splits while an import is in progress. The pageserver should enforce this,
and storcon needs to know about and respect it (sketched below).
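
A sketch of the pageserver-side enforcement; all names are hypothetical stand-ins.

```rust
/// Hypothetical stand-in for the pageserver's tenant type.
struct Tenant {
    importing_timelines: usize,
}

#[derive(Debug)]
enum SplitError {
    ImportInProgress,
}

/// Hypothetical sketch: reject shard splits while any timeline of the
/// tenant is still importing; storcon must treat this as a hard error.
fn check_shard_split_allowed(tenant: &Tenant) -> Result<(), SplitError> {
    if tenant.importing_timelines > 0 {
        return Err(SplitError::ImportInProgress);
    }
    Ok(())
}
```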

Azure support.

### v2

Teach the pageserver to natively read PGDATA relation files. Then the import could be as simple
as reading a few control files and otherwise using object copy operations for
moving the relation files.
Problems:
- How to deal with sharding? If we don't slice up the files, then each shard
  downloads a full copy of the database. Tricky. Maybe we can handle this with a 1GiB stripe size,
  but that's pretty obscure.