mirror of
https://github.com/neondatabase/neon.git
synced 2026-05-18 13:40:37 +00:00
Compare commits
3 Commits
rc/2024-10
...
problame/t
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
c5fa33f348 | ||
|
|
33baca07b6 | ||
|
|
923974d4da |
@@ -111,6 +111,11 @@ enum Command {
|
||||
#[arg(long)]
|
||||
node: NodeId,
|
||||
},
|
||||
/// Cancel any ongoing reconciliation for this shard
|
||||
TenantShardCancelReconcile {
|
||||
#[arg(long)]
|
||||
tenant_shard_id: TenantShardId,
|
||||
},
|
||||
/// Modify the pageserver tenant configuration of a tenant: this is the configuration structure
|
||||
/// that is passed through to pageservers, and does not affect storage controller behavior.
|
||||
TenantConfig {
|
||||
@@ -535,6 +540,15 @@ async fn main() -> anyhow::Result<()> {
|
||||
)
|
||||
.await?;
|
||||
}
|
||||
Command::TenantShardCancelReconcile { tenant_shard_id } => {
|
||||
storcon_client
|
||||
.dispatch::<(), ()>(
|
||||
Method::PUT,
|
||||
format!("control/v1/tenant/{tenant_shard_id}/cancel_reconcile"),
|
||||
None,
|
||||
)
|
||||
.await?;
|
||||
}
|
||||
Command::TenantConfig { tenant_id, config } => {
|
||||
let tenant_conf = serde_json::from_str(&config)?;
|
||||
|
||||
|
||||
187
docs/rfcs/039-timeline-pgdata-import.md
Normal file
187
docs/rfcs/039-timeline-pgdata-import.md
Normal file
@@ -0,0 +1,187 @@
|
||||
# Timelime import from `PGDATA`
|
||||
|
||||
Created at: 2024-10-28
|
||||
Author: Christian Schwarz (@problame)
|
||||
|
||||
## Summary
|
||||
|
||||
This document is a retroactive RFC for a feature to creating new root timelines
|
||||
from a vanilla Postgres PGDATA directory dump.
|
||||
|
||||
## Motivation & Context
|
||||
|
||||
perf hedge
|
||||
|
||||
companion RFCs:
|
||||
* cloud.git RFC that fits everyting together
|
||||
* READ THIS FIRST
|
||||
* importer that produces the PGDATA
|
||||
|
||||
## Non-Goals
|
||||
|
||||
## Components
|
||||
|
||||
this doc constrained to storage pieces; companion RFC in cloud.git fits it into a full system
|
||||
|
||||
## Prior Art
|
||||
|
||||
hackathon code
|
||||
|
||||
## Proposed Implementation
|
||||
|
||||
### High-Level
|
||||
|
||||
refactor timeline creation (modes) & idempotency checking
|
||||
|
||||
add new timeline creation mode for import from pgdata
|
||||
idempotency checking based on caller-provided idempotency key
|
||||
creating task does the following
|
||||
acquire creation guard
|
||||
create anonymous empty timeline
|
||||
upload index part to S3, with field indicating import is in progress
|
||||
spawn tokio task that performs the import (bound to tenant cancellation & guard)
|
||||
the import task produces layers, adds them to the layer map, uploads them using existing infrastructure
|
||||
after import task is done
|
||||
transition index part into `done` state
|
||||
shut down the timeline object
|
||||
load the timeline from remote storage, using the code that we use during tenant attach, and add it to Tenant::timelines
|
||||
|
||||
if pageserver restarts or the tenant is migrated during import, tenant attach resumes the import job, based on the
|
||||
Index part field.
|
||||
|
||||
SEQUENCE DIAGRAM
|
||||
|
||||
|
||||
### No Storage Controller Awareness
|
||||
|
||||
The storage controller is unaware of the import job, since it simply proxies the timeline creation request to each shard.
|
||||
|
||||
### PGDATA => Image Layer Conversion
|
||||
|
||||
The import task converts a PGDATA directory dump located in object storage into image layers.
|
||||
|
||||
shard awareness, including the new test for 0 disposable keys
|
||||
|
||||
range requests
|
||||
|
||||
### The Mechanics Of Executing The Conversion Code
|
||||
|
||||
Tokio Task Lifecycles / New Concept Of Long-Running Timeline Creation
|
||||
|
||||
This is the first time we make a timeline creation run long, and need to make it survive PS restarts and tenant migrations.
|
||||
|
||||
### Resumability & Forward Progress
|
||||
|
||||
Index Part field, document here.
|
||||
Evoultion toward checkpointing in the index part is possible.
|
||||
|
||||
For v1, no checkpointing is implemented because the MTBF of pageserver nodes is much higher than
|
||||
the highest expected import duration; and tenant migrations are rare enough.
|
||||
|
||||
### Resource Usage
|
||||
|
||||
Parallelized over multiple tokio tasks
|
||||
|
||||
### Interaction with Migrations / Generations
|
||||
|
||||
After migrating the tenant / reloading it with a newer generation, what happens?
|
||||
|
||||
### Security
|
||||
|
||||
The current implementation assumes that the PGDATA dump contents are trusted input,
|
||||
and further that it does not change while the import is ongoing.
|
||||
|
||||
There is no hard technical need for trusting the PGDATA dump contents, since
|
||||
the bulk of layer conversion is copying opaque 8k data blobs.
|
||||
|
||||
However, the conversion code uses serde-generated deserialization routines in some
|
||||
places which haven't been audited for resource exhaustion attacks (e.g. OOM due to large allcations).
|
||||
|
||||
Further, the conversion code is littered with `.unwrap()` and `assert!()` statements, which
|
||||
is a remnant from the fact that it was hackathon code. This risks panicking the pageserver
|
||||
that executes the import, and our panic blast radius is currently not reliably constrained
|
||||
to the tenant in which it originates.
|
||||
|
||||
### Progress Notification & Integration Into The Larger System
|
||||
|
||||
Timeline create API: no changes about durability semantics of create operation; cplane create
|
||||
operation is done as soon as API returns 201 (earlier design 202 code, remove that from cplane draft)
|
||||
|
||||
Readiness for import: due to limitations in the control plane regarding long-running jobs, cplane
|
||||
schedules the timeline creation before the PGDATA dunp has actually finished uploading.
|
||||
As a workaround, the import task polls for a special status key in the PGDATA to appear, which
|
||||
declares the PGDATA dump as finished uploading and not changing from thereon.
|
||||
TODO: review linearizability of S3, i.e., can we rely on ListObjectsV2 to return the
|
||||
full listing immediately after we observe the special status key? Or does the status key
|
||||
need to contain the list of prefixes? That would be better anyway, could be signed, cf security).
|
||||
|
||||
New concept: completion notification; didn't need this before, now we do;
|
||||
cplane must not use timeline before receiving completion notification.
|
||||
mechanism:
|
||||
* source of truth of completion is a special status object that import job uploads into the PGDATA location in S3
|
||||
* how to make cplane aware of update: invocation of a notification upcall API
|
||||
* import job retries delivery of the notification upcall until success response
|
||||
* in response to upcall, cplane collects statuses of alls hards using ListObjectV2 prefix (it understands ShardId)
|
||||
|
||||
SEQUENCE DIAGRAM FROM GIST
|
||||
|
||||
### Shard-Split during Import is UB
|
||||
|
||||
The design does not handle shard-splits during import, so it's declared UB right now.
|
||||
|
||||
### Storage-Controller, Sharding
|
||||
|
||||
As before this RFC, the storage controller forwards timeline create requests to each
|
||||
shard and is unaware that the timeline is importing, because so far storcon did not need
|
||||
to be aware of timelines.
|
||||
|
||||
However, to make sharding transparent to cplane, it would make sense for storcon to
|
||||
act as an intermediate for the import progress progress upcalls, i.e., batch up all
|
||||
shards' completion upcalls and when that's done, deliver a tenant-scoped (instead of shard-scoped)
|
||||
upcall to cplane.
|
||||
That way, cplane need not know about tenants and storcon can internalize shard-split.
|
||||
|
||||
### Alternative Designs
|
||||
|
||||
#### External Task Performs Conversion
|
||||
|
||||
External task performs conversion and uploads layers & index part into S3.
|
||||
|
||||
=> copy gist arguments against that here.
|
||||
|
||||
|
||||
## Future Work
|
||||
|
||||
### Work Required For Productization
|
||||
|
||||
TODOs at the top of import code, esp resource exhaustion attack vectors
|
||||
|
||||
Review linearizability of listobjectsv2 after observing pgdata upload done key.
|
||||
|
||||
Resource usage: play nice with other tenants
|
||||
|
||||
Regression test ensuring resumability (failpoints)
|
||||
|
||||
Storcon needs to be aware of import and
|
||||
- inhibit shard split while ongoing (can relax this later)
|
||||
- and redesign progress notification system to
|
||||
- avoid need to mutate pgdata, so it can be readonly
|
||||
|
||||
DESIGN WORK: does storage controller scheduler should be aware of ongoing migration?
|
||||
=> it should control resource usage, maybe different policy for migrations, etc
|
||||
|
||||
Safety: fail shard split while import is in progress. Pageserver should enforce this.
|
||||
Storcon needs to know and respsect this.
|
||||
|
||||
Azure support
|
||||
|
||||
### v2
|
||||
|
||||
Teach pageserver to natively read PGDATA relation files. Then it could be as simple
|
||||
as reading a few control files, and otherwise using object copy operations for
|
||||
moving the relation files.
|
||||
Problems:
|
||||
- how to deal with sharding? If we don't slice up the files, then each shard
|
||||
download a full copy of the database. Tricky. Maybe can handle this with 1GiB stripe size,
|
||||
but, pretty obscure.
|
||||
|
||||
@@ -262,14 +262,6 @@ async fn timeline_snapshot_handler(request: Request<Body>) -> Result<Response<Bo
|
||||
check_permission(&request, Some(ttid.tenant_id))?;
|
||||
|
||||
let tli = GlobalTimelines::get(ttid).map_err(ApiError::from)?;
|
||||
// Note: with evicted timelines it should work better then de-evict them and
|
||||
// stream; probably start_snapshot would copy partial s3 file to dest path
|
||||
// and stream control file, or return WalResidentTimeline if timeline is not
|
||||
// evicted.
|
||||
let tli = tli
|
||||
.wal_residence_guard()
|
||||
.await
|
||||
.map_err(ApiError::InternalServerError)?;
|
||||
|
||||
// To stream the body use wrap_stream which wants Stream of Result<Bytes>,
|
||||
// so create the chan and write to it in another task.
|
||||
|
||||
@@ -8,6 +8,7 @@ use serde::{Deserialize, Serialize};
|
||||
use std::{
|
||||
cmp::min,
|
||||
io::{self, ErrorKind},
|
||||
sync::Arc,
|
||||
};
|
||||
use tokio::{fs::OpenOptions, io::AsyncWrite, sync::mpsc, task};
|
||||
use tokio_tar::{Archive, Builder, Header};
|
||||
@@ -25,8 +26,8 @@ use crate::{
|
||||
routes::TimelineStatus,
|
||||
},
|
||||
safekeeper::Term,
|
||||
state::TimelinePersistentState,
|
||||
timeline::WalResidentTimeline,
|
||||
state::{EvictionState, TimelinePersistentState},
|
||||
timeline::{Timeline, WalResidentTimeline},
|
||||
timelines_global_map::{create_temp_timeline_dir, validate_temp_timeline},
|
||||
wal_backup,
|
||||
wal_storage::open_wal_file,
|
||||
@@ -43,18 +44,33 @@ use utils::{
|
||||
/// Stream tar archive of timeline to tx.
|
||||
#[instrument(name = "snapshot", skip_all, fields(ttid = %tli.ttid))]
|
||||
pub async fn stream_snapshot(
|
||||
tli: WalResidentTimeline,
|
||||
tli: Arc<Timeline>,
|
||||
source: NodeId,
|
||||
destination: NodeId,
|
||||
tx: mpsc::Sender<Result<Bytes>>,
|
||||
) {
|
||||
if let Err(e) = stream_snapshot_guts(tli, source, destination, tx.clone()).await {
|
||||
// Error type/contents don't matter as they won't can't reach the client
|
||||
// (hyper likely doesn't do anything with it), but http stream will be
|
||||
// prematurely terminated. It would be nice to try to send the error in
|
||||
// trailers though.
|
||||
tx.send(Err(anyhow!("snapshot failed"))).await.ok();
|
||||
error!("snapshot failed: {:#}", e);
|
||||
match tli.try_wal_residence_guard().await {
|
||||
Err(e) => {
|
||||
tx.send(Err(anyhow!("Error checking residence: {:#}", e)))
|
||||
.await
|
||||
.ok();
|
||||
}
|
||||
Ok(maybe_resident_tli) => {
|
||||
if let Err(e) = match maybe_resident_tli {
|
||||
Some(resident_tli) => {
|
||||
stream_snapshot_resident_guts(resident_tli, source, destination, tx.clone())
|
||||
.await
|
||||
}
|
||||
None => stream_snapshot_offloaded_guts(tli, source, destination, tx.clone()).await,
|
||||
} {
|
||||
// Error type/contents don't matter as they won't can't reach the client
|
||||
// (hyper likely doesn't do anything with it), but http stream will be
|
||||
// prematurely terminated. It would be nice to try to send the error in
|
||||
// trailers though.
|
||||
tx.send(Err(anyhow!("snapshot failed"))).await.ok();
|
||||
error!("snapshot failed: {:#}", e);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -80,12 +96,10 @@ impl Drop for SnapshotContext {
|
||||
}
|
||||
}
|
||||
|
||||
pub async fn stream_snapshot_guts(
|
||||
tli: WalResidentTimeline,
|
||||
source: NodeId,
|
||||
destination: NodeId,
|
||||
/// Build a tokio_tar stream that sends encoded bytes into a Bytes channel.
|
||||
fn prepare_tar_stream(
|
||||
tx: mpsc::Sender<Result<Bytes>>,
|
||||
) -> Result<()> {
|
||||
) -> tokio_tar::Builder<impl AsyncWrite + Unpin + Send> {
|
||||
// tokio-tar wants Write implementor, but we have mpsc tx <Result<Bytes>>;
|
||||
// use SinkWriter as a Write impl. That is,
|
||||
// - create Sink from the tx. It returns PollSendError if chan is closed.
|
||||
@@ -100,12 +114,38 @@ pub async fn stream_snapshot_guts(
|
||||
// - SinkWriter (not surprisingly) wants sink of &[u8], not bytes, so wrap
|
||||
// into CopyToBytes. This is a data copy.
|
||||
let copy_to_bytes = CopyToBytes::new(oksink);
|
||||
let mut writer = SinkWriter::new(copy_to_bytes);
|
||||
let pinned_writer = std::pin::pin!(writer);
|
||||
let writer = SinkWriter::new(copy_to_bytes);
|
||||
let pinned_writer = Box::pin(writer);
|
||||
|
||||
// Note that tokio_tar append_* funcs use tokio::io::copy with 8KB buffer
|
||||
// which is also likely suboptimal.
|
||||
let mut ar = Builder::new_non_terminated(pinned_writer);
|
||||
Builder::new_non_terminated(pinned_writer)
|
||||
}
|
||||
|
||||
/// Implementation of snapshot for an offloaded timeline, only reads control file
|
||||
pub(crate) async fn stream_snapshot_offloaded_guts(
|
||||
tli: Arc<Timeline>,
|
||||
source: NodeId,
|
||||
destination: NodeId,
|
||||
tx: mpsc::Sender<Result<Bytes>>,
|
||||
) -> Result<()> {
|
||||
let mut ar = prepare_tar_stream(tx);
|
||||
|
||||
tli.snapshot_offloaded(&mut ar, source, destination).await?;
|
||||
|
||||
ar.finish().await?;
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Implementation of snapshot for a timeline which is resident (includes some segment data)
|
||||
pub async fn stream_snapshot_resident_guts(
|
||||
tli: WalResidentTimeline,
|
||||
source: NodeId,
|
||||
destination: NodeId,
|
||||
tx: mpsc::Sender<Result<Bytes>>,
|
||||
) -> Result<()> {
|
||||
let mut ar = prepare_tar_stream(tx);
|
||||
|
||||
let bctx = tli.start_snapshot(&mut ar, source, destination).await?;
|
||||
pausable_failpoint!("sk-snapshot-after-list-pausable");
|
||||
@@ -138,6 +178,70 @@ pub async fn stream_snapshot_guts(
|
||||
Ok(())
|
||||
}
|
||||
|
||||
impl Timeline {
|
||||
/// Simple snapshot for an offloaded timeline: we will only upload a renamed partial segment and
|
||||
/// pass a modified control file into the provided tar stream (nothing with data segments on disk, since
|
||||
/// we are offloaded and there aren't any)
|
||||
async fn snapshot_offloaded<W: AsyncWrite + Unpin + Send>(
|
||||
self: &Arc<Timeline>,
|
||||
ar: &mut tokio_tar::Builder<W>,
|
||||
source: NodeId,
|
||||
destination: NodeId,
|
||||
) -> Result<()> {
|
||||
// Take initial copy of control file, then release state lock
|
||||
let mut control_file = {
|
||||
let shared_state = self.write_shared_state().await;
|
||||
|
||||
let control_file = TimelinePersistentState::clone(shared_state.sk.state());
|
||||
|
||||
// Rare race: we got unevicted between entering function and reading control file.
|
||||
// We error out and let API caller retry.
|
||||
if !matches!(control_file.eviction_state, EvictionState::Offloaded(_)) {
|
||||
bail!("Timeline was un-evicted during snapshot, please retry");
|
||||
}
|
||||
|
||||
control_file
|
||||
};
|
||||
|
||||
// Modify the partial segment of the in-memory copy for the control file to
|
||||
// point to the destination safekeeper.
|
||||
let replace = control_file
|
||||
.partial_backup
|
||||
.replace_uploaded_segment(source, destination)?;
|
||||
|
||||
let Some(replace) = replace else {
|
||||
// In Manager:: ready_for_eviction, we do not permit eviction unless the timeline
|
||||
// has a partial segment. It is unexpected that
|
||||
anyhow::bail!("Timeline has no partial segment, cannot generate snapshot");
|
||||
};
|
||||
|
||||
tracing::info!("Replacing uploaded partial segment in in-mem control file: {replace:?}");
|
||||
|
||||
// Optimistically try to copy the partial segment to the destination's path: this
|
||||
// can fail if the timeline was un-evicted and modified in the background.
|
||||
let remote_timeline_path = &self.remote_path;
|
||||
wal_backup::copy_partial_segment(
|
||||
&replace.previous.remote_path(remote_timeline_path),
|
||||
&replace.current.remote_path(remote_timeline_path),
|
||||
)
|
||||
.await?;
|
||||
|
||||
// Since the S3 copy succeeded with the path given in our control file snapshot, and
|
||||
// we are sending that snapshot in our response, we are giving the caller a consistent
|
||||
// snapshot even if our local Timeline was unevicted or otherwise modified in the meantime.
|
||||
let buf = control_file
|
||||
.write_to_buf()
|
||||
.with_context(|| "failed to serialize control store")?;
|
||||
let mut header = Header::new_gnu();
|
||||
header.set_size(buf.len().try_into().expect("never breaches u64"));
|
||||
ar.append_data(&mut header, CONTROL_FILE_NAME, buf.as_slice())
|
||||
.await
|
||||
.with_context(|| "failed to append to archive")?;
|
||||
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
impl WalResidentTimeline {
|
||||
/// Start streaming tar archive with timeline:
|
||||
/// 1) stream control file under lock;
|
||||
|
||||
@@ -797,14 +797,17 @@ impl Timeline {
|
||||
state.sk.term_bump(to).await
|
||||
}
|
||||
|
||||
/// Get the timeline guard for reading/writing WAL files.
|
||||
/// If WAL files are not present on disk (evicted), they will be automatically
|
||||
/// downloaded from remote storage. This is done in the manager task, which is
|
||||
/// responsible for issuing all guards.
|
||||
///
|
||||
/// NB: don't use this function from timeline_manager, it will deadlock.
|
||||
/// NB: don't use this function while holding shared_state lock.
|
||||
pub async fn wal_residence_guard(self: &Arc<Self>) -> Result<WalResidentTimeline> {
|
||||
/// Guts of [`Self::wal_residence_guard`] and [`Self::try_wal_residence_guard`]
|
||||
async fn do_wal_residence_guard(
|
||||
self: &Arc<Self>,
|
||||
block: bool,
|
||||
) -> Result<Option<WalResidentTimeline>> {
|
||||
let op_label = if block {
|
||||
"wal_residence_guard"
|
||||
} else {
|
||||
"try_wal_residence_guard"
|
||||
};
|
||||
|
||||
if self.is_cancelled() {
|
||||
bail!(TimelineError::Cancelled(self.ttid));
|
||||
}
|
||||
@@ -816,10 +819,13 @@ impl Timeline {
|
||||
// Wait 30 seconds for the guard to be acquired. It can time out if someone is
|
||||
// holding the lock (e.g. during `SafeKeeper::process_msg()`) or manager task
|
||||
// is stuck.
|
||||
let res = tokio::time::timeout_at(
|
||||
started_at + Duration::from_secs(30),
|
||||
self.manager_ctl.wal_residence_guard(),
|
||||
)
|
||||
let res = tokio::time::timeout_at(started_at + Duration::from_secs(30), async {
|
||||
if block {
|
||||
self.manager_ctl.wal_residence_guard().await.map(Some)
|
||||
} else {
|
||||
self.manager_ctl.try_wal_residence_guard().await
|
||||
}
|
||||
})
|
||||
.await;
|
||||
|
||||
let guard = match res {
|
||||
@@ -827,14 +833,14 @@ impl Timeline {
|
||||
let finished_at = Instant::now();
|
||||
let elapsed = finished_at - started_at;
|
||||
MISC_OPERATION_SECONDS
|
||||
.with_label_values(&["wal_residence_guard"])
|
||||
.with_label_values(&[op_label])
|
||||
.observe(elapsed.as_secs_f64());
|
||||
|
||||
guard
|
||||
}
|
||||
Ok(Err(e)) => {
|
||||
warn!(
|
||||
"error while acquiring WalResidentTimeline guard, statuses {:?} => {:?}",
|
||||
"error acquiring in {op_label}, statuses {:?} => {:?}",
|
||||
status_before,
|
||||
self.mgr_status.get()
|
||||
);
|
||||
@@ -842,7 +848,7 @@ impl Timeline {
|
||||
}
|
||||
Err(_) => {
|
||||
warn!(
|
||||
"timeout while acquiring WalResidentTimeline guard, statuses {:?} => {:?}",
|
||||
"timeout acquiring in {op_label} guard, statuses {:?} => {:?}",
|
||||
status_before,
|
||||
self.mgr_status.get()
|
||||
);
|
||||
@@ -850,7 +856,28 @@ impl Timeline {
|
||||
}
|
||||
};
|
||||
|
||||
Ok(WalResidentTimeline::new(self.clone(), guard))
|
||||
Ok(guard.map(|g| WalResidentTimeline::new(self.clone(), g)))
|
||||
}
|
||||
|
||||
/// Get the timeline guard for reading/writing WAL files.
|
||||
/// If WAL files are not present on disk (evicted), they will be automatically
|
||||
/// downloaded from remote storage. This is done in the manager task, which is
|
||||
/// responsible for issuing all guards.
|
||||
///
|
||||
/// NB: don't use this function from timeline_manager, it will deadlock.
|
||||
/// NB: don't use this function while holding shared_state lock.
|
||||
pub async fn wal_residence_guard(self: &Arc<Self>) -> Result<WalResidentTimeline> {
|
||||
self.do_wal_residence_guard(true)
|
||||
.await
|
||||
.map(|m| m.expect("Always get Some in block=true mode"))
|
||||
}
|
||||
|
||||
/// Get the timeline guard for reading/writing WAL files if the timeline is resident,
|
||||
/// else return None
|
||||
pub(crate) async fn try_wal_residence_guard(
|
||||
self: &Arc<Self>,
|
||||
) -> Result<Option<WalResidentTimeline>> {
|
||||
self.do_wal_residence_guard(false).await
|
||||
}
|
||||
|
||||
pub async fn backup_partial_reset(self: &Arc<Self>) -> Result<Vec<String>> {
|
||||
|
||||
@@ -56,6 +56,9 @@ impl Manager {
|
||||
// This also works for the first segment despite last_removed_segno
|
||||
// being 0 on init because this 0 triggers run of wal_removal_task
|
||||
// on success of which manager updates the horizon.
|
||||
//
|
||||
// **Note** pull_timeline functionality assumes that evicted timelines always have
|
||||
// a partial segment: if we ever change this condition, must also update that code.
|
||||
&& self
|
||||
.partial_backup_uploaded
|
||||
.as_ref()
|
||||
|
||||
@@ -100,6 +100,8 @@ const REFRESH_INTERVAL: Duration = Duration::from_millis(300);
|
||||
pub enum ManagerCtlMessage {
|
||||
/// Request to get a guard for WalResidentTimeline, with WAL files available locally.
|
||||
GuardRequest(tokio::sync::oneshot::Sender<anyhow::Result<ResidenceGuard>>),
|
||||
/// Get a guard for WalResidentTimeline if the timeline is not currently offloaded, else None
|
||||
TryGuardRequest(tokio::sync::oneshot::Sender<Option<ResidenceGuard>>),
|
||||
/// Request to drop the guard.
|
||||
GuardDrop(GuardId),
|
||||
/// Request to reset uploaded partial backup state.
|
||||
@@ -110,6 +112,7 @@ impl std::fmt::Debug for ManagerCtlMessage {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
match self {
|
||||
ManagerCtlMessage::GuardRequest(_) => write!(f, "GuardRequest"),
|
||||
ManagerCtlMessage::TryGuardRequest(_) => write!(f, "TryGuardRequest"),
|
||||
ManagerCtlMessage::GuardDrop(id) => write!(f, "GuardDrop({:?})", id),
|
||||
ManagerCtlMessage::BackupPartialReset(_) => write!(f, "BackupPartialReset"),
|
||||
}
|
||||
@@ -152,6 +155,19 @@ impl ManagerCtl {
|
||||
.and_then(std::convert::identity)
|
||||
}
|
||||
|
||||
/// Issue a new guard if the timeline is currently not offloaded, else return None
|
||||
/// Sends a message to the manager and waits for the response.
|
||||
/// Can be blocked indefinitely if the manager is stuck.
|
||||
pub async fn try_wal_residence_guard(&self) -> anyhow::Result<Option<ResidenceGuard>> {
|
||||
let (tx, rx) = tokio::sync::oneshot::channel();
|
||||
self.manager_tx
|
||||
.send(ManagerCtlMessage::TryGuardRequest(tx))?;
|
||||
|
||||
// wait for the manager to respond with the guard
|
||||
rx.await
|
||||
.map_err(|e| anyhow::anyhow!("response read fail: {:?}", e))
|
||||
}
|
||||
|
||||
/// Request timeline manager to reset uploaded partial segment state and
|
||||
/// wait for the result.
|
||||
pub async fn backup_partial_reset(&self) -> anyhow::Result<Vec<String>> {
|
||||
@@ -674,6 +690,17 @@ impl Manager {
|
||||
warn!("failed to reply with a guard, receiver dropped");
|
||||
}
|
||||
}
|
||||
Some(ManagerCtlMessage::TryGuardRequest(tx)) => {
|
||||
let result = if self.is_offloaded {
|
||||
None
|
||||
} else {
|
||||
Some(self.access_service.create_guard())
|
||||
};
|
||||
|
||||
if tx.send(result).is_err() {
|
||||
warn!("failed to reply with a guard, receiver dropped");
|
||||
}
|
||||
}
|
||||
Some(ManagerCtlMessage::GuardDrop(guard_id)) => {
|
||||
self.access_service.drop_guard(guard_id);
|
||||
}
|
||||
|
||||
@@ -968,6 +968,28 @@ async fn handle_tenant_shard_migrate(
|
||||
)
|
||||
}
|
||||
|
||||
async fn handle_tenant_shard_cancel_reconcile(
|
||||
service: Arc<Service>,
|
||||
req: Request<Body>,
|
||||
) -> Result<Response<Body>, ApiError> {
|
||||
check_permissions(&req, Scope::Admin)?;
|
||||
|
||||
let req = match maybe_forward(req).await {
|
||||
ForwardOutcome::Forwarded(res) => {
|
||||
return res;
|
||||
}
|
||||
ForwardOutcome::NotForwarded(req) => req,
|
||||
};
|
||||
|
||||
let tenant_shard_id: TenantShardId = parse_request_param(&req, "tenant_shard_id")?;
|
||||
json_response(
|
||||
StatusCode::OK,
|
||||
service
|
||||
.tenant_shard_cancel_reconcile(tenant_shard_id)
|
||||
.await?,
|
||||
)
|
||||
}
|
||||
|
||||
async fn handle_tenant_update_policy(req: Request<Body>) -> Result<Response<Body>, ApiError> {
|
||||
check_permissions(&req, Scope::Admin)?;
|
||||
|
||||
@@ -1776,6 +1798,16 @@ pub fn make_router(
|
||||
RequestName("control_v1_tenant_migrate"),
|
||||
)
|
||||
})
|
||||
.put(
|
||||
"/control/v1/tenant/:tenant_shard_id/cancel_reconcile",
|
||||
|r| {
|
||||
tenant_service_handler(
|
||||
r,
|
||||
handle_tenant_shard_cancel_reconcile,
|
||||
RequestName("control_v1_tenant_cancel_reconcile"),
|
||||
)
|
||||
},
|
||||
)
|
||||
.put("/control/v1/tenant/:tenant_id/shard_split", |r| {
|
||||
tenant_service_handler(
|
||||
r,
|
||||
|
||||
@@ -4834,6 +4834,43 @@ impl Service {
|
||||
Ok(TenantShardMigrateResponse {})
|
||||
}
|
||||
|
||||
/// 'cancel' in this context means cancel any ongoing reconcile
|
||||
pub(crate) async fn tenant_shard_cancel_reconcile(
|
||||
&self,
|
||||
tenant_shard_id: TenantShardId,
|
||||
) -> Result<(), ApiError> {
|
||||
// Take state lock and fire the cancellation token, after which we drop lock and wait for any ongoing reconcile to complete
|
||||
let waiter = {
|
||||
let locked = self.inner.write().unwrap();
|
||||
let Some(shard) = locked.tenants.get(&tenant_shard_id) else {
|
||||
return Err(ApiError::NotFound(
|
||||
anyhow::anyhow!("Tenant shard not found").into(),
|
||||
));
|
||||
};
|
||||
|
||||
let waiter = shard.get_waiter();
|
||||
match waiter {
|
||||
None => {
|
||||
tracing::info!("Shard does not have an ongoing Reconciler");
|
||||
return Ok(());
|
||||
}
|
||||
Some(waiter) => {
|
||||
tracing::info!("Cancelling Reconciler");
|
||||
shard.cancel_reconciler();
|
||||
waiter
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
// Cancellation should be prompt. If this fails we have still done our job of firing the
|
||||
// cancellation token, but by returning an ApiError we will indicate to the caller that
|
||||
// the Reconciler is misbehaving and not respecting the cancellation token
|
||||
self.await_waiters(vec![waiter], SHORT_RECONCILE_TIMEOUT)
|
||||
.await?;
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// This is for debug/support only: we simply drop all state for a tenant, without
|
||||
/// detaching or deleting it on pageservers.
|
||||
pub(crate) async fn tenant_drop(&self, tenant_id: TenantId) -> Result<(), ApiError> {
|
||||
|
||||
@@ -1317,6 +1317,12 @@ impl TenantShard {
|
||||
})
|
||||
}
|
||||
|
||||
pub(crate) fn cancel_reconciler(&self) {
|
||||
if let Some(handle) = self.reconciler.as_ref() {
|
||||
handle.cancel.cancel()
|
||||
}
|
||||
}
|
||||
|
||||
/// Get a waiter for any reconciliation in flight, but do not start reconciliation
|
||||
/// if it is not already running
|
||||
pub(crate) fn get_waiter(&self) -> Option<ReconcilerWaiter> {
|
||||
|
||||
@@ -872,6 +872,14 @@ def test_storage_controller_debug_apis(neon_env_builder: NeonEnvBuilder):
|
||||
assert sum(v["shard_count"] for v in response.json()["nodes"].values()) == 3
|
||||
assert all(v["may_schedule"] for v in response.json()["nodes"].values())
|
||||
|
||||
# Reconciler cancel API should be a no-op when nothing is in flight
|
||||
env.storage_controller.request(
|
||||
"PUT",
|
||||
f"{env.storage_controller_api}/control/v1/tenant/{tenant_id}-0102/cancel_reconcile",
|
||||
headers=env.storage_controller.headers(TokenScope.ADMIN),
|
||||
)
|
||||
|
||||
# Node unclean drop API
|
||||
response = env.storage_controller.request(
|
||||
"POST",
|
||||
f"{env.storage_controller_api}/debug/v1/node/{env.pageservers[1].id}/drop",
|
||||
@@ -879,6 +887,7 @@ def test_storage_controller_debug_apis(neon_env_builder: NeonEnvBuilder):
|
||||
)
|
||||
assert len(env.storage_controller.node_list()) == 1
|
||||
|
||||
# Tenant unclean drop API
|
||||
response = env.storage_controller.request(
|
||||
"POST",
|
||||
f"{env.storage_controller_api}/debug/v1/tenant/{tenant_id}/drop",
|
||||
@@ -892,7 +901,6 @@ def test_storage_controller_debug_apis(neon_env_builder: NeonEnvBuilder):
|
||||
headers=env.storage_controller.headers(TokenScope.ADMIN),
|
||||
)
|
||||
assert len(response.json()) == 1
|
||||
|
||||
# Check that the 'drop' APIs didn't leave things in a state that would fail a consistency check: they're
|
||||
# meant to be unclean wrt the pageserver state, but not leave a broken storage controller behind.
|
||||
env.storage_controller.consistency_check()
|
||||
@@ -1660,6 +1668,11 @@ def test_storcon_cli(neon_env_builder: NeonEnvBuilder):
|
||||
storcon_cli(["tenant-policy", "--tenant-id", str(env.initial_tenant), "--scheduling", "stop"])
|
||||
assert "Stop" in storcon_cli(["tenants"])[3]
|
||||
|
||||
# Cancel ongoing reconcile on a tenant
|
||||
storcon_cli(
|
||||
["tenant-shard-cancel-reconcile", "--tenant-shard-id", f"{env.initial_tenant}-0104"]
|
||||
)
|
||||
|
||||
# Change a tenant's placement
|
||||
storcon_cli(
|
||||
["tenant-policy", "--tenant-id", str(env.initial_tenant), "--placement", "secondary"]
|
||||
|
||||
@@ -1998,6 +1998,109 @@ def test_pull_timeline_term_change(neon_env_builder: NeonEnvBuilder):
|
||||
pt_handle.join()
|
||||
|
||||
|
||||
def test_pull_timeline_while_evicted(neon_env_builder: NeonEnvBuilder):
|
||||
"""
|
||||
Verify that when pull_timeline is used on an evicted timeline, it does not result in
|
||||
promoting any segments to local disk on the source, and the timeline is correctly instantiated
|
||||
in evicted state on the destination. This behavior is important to avoid ballooning disk
|
||||
usage when doing mass migration of timelines.
|
||||
"""
|
||||
neon_env_builder.num_safekeepers = 4
|
||||
neon_env_builder.enable_safekeeper_remote_storage(default_remote_storage())
|
||||
|
||||
# Configure safekeepers with ultra-fast eviction policy
|
||||
neon_env_builder.safekeeper_extra_opts = [
|
||||
"--enable-offload",
|
||||
"--partial-backup-timeout",
|
||||
"50ms",
|
||||
"--control-file-save-interval",
|
||||
"1s",
|
||||
# Safekeepers usually wait a while before evicting something: for this test we want them to
|
||||
# evict things as soon as they are inactive.
|
||||
"--eviction-min-resident=100ms",
|
||||
"--delete-offloaded-wal",
|
||||
]
|
||||
|
||||
initial_tenant_conf = {"lagging_wal_timeout": "1s", "checkpoint_timeout": "100ms"}
|
||||
env = neon_env_builder.init_start(initial_tenant_conf=initial_tenant_conf)
|
||||
tenant_id = env.initial_tenant
|
||||
timeline_id = env.initial_timeline
|
||||
|
||||
(src_sk, dst_sk) = (env.safekeepers[0], env.safekeepers[-1])
|
||||
log.info(f"Will pull_timeline on destination {dst_sk.id} from source {src_sk.id}")
|
||||
|
||||
ep = env.endpoints.create("main")
|
||||
ep.active_safekeepers = [s.id for s in env.safekeepers if s.id != dst_sk.id]
|
||||
log.info(f"Compute writing initially to safekeepers: {ep.active_safekeepers}")
|
||||
ep.active_safekeepers = [1, 2, 3] # Exclude dst_sk from set written by compute initially
|
||||
ep.start()
|
||||
ep.safe_psql("CREATE TABLE t(i int)")
|
||||
ep.safe_psql("INSERT INTO t VALUES (0)")
|
||||
ep.stop()
|
||||
|
||||
wait_lsn_force_checkpoint_at_sk(src_sk, tenant_id, timeline_id, env.pageserver)
|
||||
|
||||
src_http = src_sk.http_client()
|
||||
dst_http = dst_sk.http_client()
|
||||
|
||||
def evicted_on_source():
|
||||
# Wait for timeline to go into evicted state
|
||||
assert src_http.get_eviction_state(timeline_id) != "Present"
|
||||
assert (
|
||||
src_http.get_metric_value(
|
||||
"safekeeper_eviction_events_completed_total", {"kind": "evict"}
|
||||
)
|
||||
or 0 > 0
|
||||
)
|
||||
assert src_http.get_metric_value("safekeeper_evicted_timelines") or 0 > 0
|
||||
# Check that on source no segment files are present
|
||||
assert src_sk.list_segments(tenant_id, timeline_id) == []
|
||||
|
||||
wait_until(60, 1, evicted_on_source)
|
||||
|
||||
# Invoke pull_timeline: source should serve snapshot request without promoting anything to local disk,
|
||||
# destination should import the control file only & go into evicted mode immediately
|
||||
dst_sk.pull_timeline([src_sk], tenant_id, timeline_id)
|
||||
|
||||
# Check that on source and destination no segment files are present
|
||||
assert src_sk.list_segments(tenant_id, timeline_id) == []
|
||||
assert dst_sk.list_segments(tenant_id, timeline_id) == []
|
||||
|
||||
# Check that the timeline on the destination is in the expected evicted state.
|
||||
evicted_on_source() # It should still be evicted on the source
|
||||
|
||||
def evicted_on_destination():
|
||||
assert dst_http.get_eviction_state(timeline_id) != "Present"
|
||||
assert dst_http.get_metric_value("safekeeper_evicted_timelines") or 0 > 0
|
||||
|
||||
# This should be fast, it is a wait_until because eviction state is updated
|
||||
# in the background wrt pull_timeline.
|
||||
wait_until(10, 0.1, evicted_on_destination)
|
||||
|
||||
# Delete the timeline on the source, to prove that deletion works on an
|
||||
# evicted timeline _and_ that the final compute test is really not using
|
||||
# the original location
|
||||
src_sk.http_client().timeline_delete(tenant_id, timeline_id, only_local=True)
|
||||
|
||||
# Check that using the timeline correctly un-evicts it on the new location
|
||||
ep.active_safekeepers = [2, 3, 4]
|
||||
ep.start()
|
||||
ep.safe_psql("INSERT INTO t VALUES (0)")
|
||||
ep.stop()
|
||||
|
||||
def unevicted_on_dest():
|
||||
assert (
|
||||
dst_http.get_metric_value(
|
||||
"safekeeper_eviction_events_completed_total", {"kind": "restore"}
|
||||
)
|
||||
or 0 > 0
|
||||
)
|
||||
n_evicted = dst_sk.http_client().get_metric_value("safekeeper_evicted_timelines")
|
||||
assert n_evicted == 0
|
||||
|
||||
wait_until(10, 1, unevicted_on_dest)
|
||||
|
||||
|
||||
# In this test we check for excessive START_REPLICATION and START_WAL_PUSH queries
|
||||
# when compute is active, but there are no writes to the timeline. In that case
|
||||
# pageserver should maintain a single connection to safekeeper and don't attempt
|
||||
|
||||
Reference in New Issue
Block a user