Compare commits

..

3 Commits

Author SHA1 Message Date
Christian Schwarz
c5fa33f348 WIP: RFC for timeline import from pgdata 2024-10-28 13:09:36 +01:00
John Spray
33baca07b6 storcon: add an API to cancel ongoing reconciler (#9520)
## Problem

If something goes wrong with a live migration, we currently only have
awkward ways to interrupt that:
- Restart the storage controller
- Ask it to do some other modification/migration on the shard, which we
don't really want.

## Summary of changes

- Add a new `/cancel` control API, and storcon_cli wrapper for it, which
fires the Reconciler's cancellation token. This is just for on-call use
and we do not expect it to be used by any other services.
2024-10-28 09:26:01 +00:00
John Spray
923974d4da safekeeper: don't un-evict timelines during snapshot API handler (#9428)
## Problem

When we use pull_timeline API on an evicted timeline, it gets downloaded
to serve the snapshot API request. That means that to evacuate all the
timelines from a node, the node needs enough disk space to download
partial segments from all timelines, which may not be physically the
case.

Closes: #8833 

## Summary of changes

- Add a "try" variant of acquiring a residence guard, that returns None
if the timeline is offloaded
- During snapshot API handler, take a different code path if the
timeline isn't resident, where we just read the checkpoint and don't try
to read any segments.
2024-10-28 08:47:12 +00:00
12 changed files with 588 additions and 43 deletions

View File

@@ -111,6 +111,11 @@ enum Command {
#[arg(long)]
node: NodeId,
},
/// Cancel any ongoing reconciliation for this shard
TenantShardCancelReconcile {
#[arg(long)]
tenant_shard_id: TenantShardId,
},
/// Modify the pageserver tenant configuration of a tenant: this is the configuration structure
/// that is passed through to pageservers, and does not affect storage controller behavior.
TenantConfig {
@@ -535,6 +540,15 @@ async fn main() -> anyhow::Result<()> {
)
.await?;
}
Command::TenantShardCancelReconcile { tenant_shard_id } => {
storcon_client
.dispatch::<(), ()>(
Method::PUT,
format!("control/v1/tenant/{tenant_shard_id}/cancel_reconcile"),
None,
)
.await?;
}
Command::TenantConfig { tenant_id, config } => {
let tenant_conf = serde_json::from_str(&config)?;

View File

@@ -0,0 +1,187 @@
# Timelime import from `PGDATA`
Created at: 2024-10-28
Author: Christian Schwarz (@problame)
## Summary
This document is a retroactive RFC for a feature to creating new root timelines
from a vanilla Postgres PGDATA directory dump.
## Motivation & Context
perf hedge
companion RFCs:
* cloud.git RFC that fits everyting together
* READ THIS FIRST
* importer that produces the PGDATA
## Non-Goals
## Components
this doc constrained to storage pieces; companion RFC in cloud.git fits it into a full system
## Prior Art
hackathon code
## Proposed Implementation
### High-Level
refactor timeline creation (modes) & idempotency checking
add new timeline creation mode for import from pgdata
idempotency checking based on caller-provided idempotency key
creating task does the following
acquire creation guard
create anonymous empty timeline
upload index part to S3, with field indicating import is in progress
spawn tokio task that performs the import (bound to tenant cancellation & guard)
the import task produces layers, adds them to the layer map, uploads them using existing infrastructure
after import task is done
transition index part into `done` state
shut down the timeline object
load the timeline from remote storage, using the code that we use during tenant attach, and add it to Tenant::timelines
if pageserver restarts or the tenant is migrated during import, tenant attach resumes the import job, based on the
Index part field.
SEQUENCE DIAGRAM
### No Storage Controller Awareness
The storage controller is unaware of the import job, since it simply proxies the timeline creation request to each shard.
### PGDATA => Image Layer Conversion
The import task converts a PGDATA directory dump located in object storage into image layers.
shard awareness, including the new test for 0 disposable keys
range requests
### The Mechanics Of Executing The Conversion Code
Tokio Task Lifecycles / New Concept Of Long-Running Timeline Creation
This is the first time we make a timeline creation run long, and need to make it survive PS restarts and tenant migrations.
### Resumability & Forward Progress
Index Part field, document here.
Evoultion toward checkpointing in the index part is possible.
For v1, no checkpointing is implemented because the MTBF of pageserver nodes is much higher than
the highest expected import duration; and tenant migrations are rare enough.
### Resource Usage
Parallelized over multiple tokio tasks
### Interaction with Migrations / Generations
After migrating the tenant / reloading it with a newer generation, what happens?
### Security
The current implementation assumes that the PGDATA dump contents are trusted input,
and further that it does not change while the import is ongoing.
There is no hard technical need for trusting the PGDATA dump contents, since
the bulk of layer conversion is copying opaque 8k data blobs.
However, the conversion code uses serde-generated deserialization routines in some
places which haven't been audited for resource exhaustion attacks (e.g. OOM due to large allcations).
Further, the conversion code is littered with `.unwrap()` and `assert!()` statements, which
is a remnant from the fact that it was hackathon code. This risks panicking the pageserver
that executes the import, and our panic blast radius is currently not reliably constrained
to the tenant in which it originates.
### Progress Notification & Integration Into The Larger System
Timeline create API: no changes about durability semantics of create operation; cplane create
operation is done as soon as API returns 201 (earlier design 202 code, remove that from cplane draft)
Readiness for import: due to limitations in the control plane regarding long-running jobs, cplane
schedules the timeline creation before the PGDATA dunp has actually finished uploading.
As a workaround, the import task polls for a special status key in the PGDATA to appear, which
declares the PGDATA dump as finished uploading and not changing from thereon.
TODO: review linearizability of S3, i.e., can we rely on ListObjectsV2 to return the
full listing immediately after we observe the special status key? Or does the status key
need to contain the list of prefixes? That would be better anyway, could be signed, cf security).
New concept: completion notification; didn't need this before, now we do;
cplane must not use timeline before receiving completion notification.
mechanism:
* source of truth of completion is a special status object that import job uploads into the PGDATA location in S3
* how to make cplane aware of update: invocation of a notification upcall API
* import job retries delivery of the notification upcall until success response
* in response to upcall, cplane collects statuses of alls hards using ListObjectV2 prefix (it understands ShardId)
SEQUENCE DIAGRAM FROM GIST
### Shard-Split during Import is UB
The design does not handle shard-splits during import, so it's declared UB right now.
### Storage-Controller, Sharding
As before this RFC, the storage controller forwards timeline create requests to each
shard and is unaware that the timeline is importing, because so far storcon did not need
to be aware of timelines.
However, to make sharding transparent to cplane, it would make sense for storcon to
act as an intermediate for the import progress progress upcalls, i.e., batch up all
shards' completion upcalls and when that's done, deliver a tenant-scoped (instead of shard-scoped)
upcall to cplane.
That way, cplane need not know about tenants and storcon can internalize shard-split.
### Alternative Designs
#### External Task Performs Conversion
External task performs conversion and uploads layers & index part into S3.
=> copy gist arguments against that here.
## Future Work
### Work Required For Productization
TODOs at the top of import code, esp resource exhaustion attack vectors
Review linearizability of listobjectsv2 after observing pgdata upload done key.
Resource usage: play nice with other tenants
Regression test ensuring resumability (failpoints)
Storcon needs to be aware of import and
- inhibit shard split while ongoing (can relax this later)
- and redesign progress notification system to
- avoid need to mutate pgdata, so it can be readonly
DESIGN WORK: does storage controller scheduler should be aware of ongoing migration?
=> it should control resource usage, maybe different policy for migrations, etc
Safety: fail shard split while import is in progress. Pageserver should enforce this.
Storcon needs to know and respsect this.
Azure support
### v2
Teach pageserver to natively read PGDATA relation files. Then it could be as simple
as reading a few control files, and otherwise using object copy operations for
moving the relation files.
Problems:
- how to deal with sharding? If we don't slice up the files, then each shard
download a full copy of the database. Tricky. Maybe can handle this with 1GiB stripe size,
but, pretty obscure.

View File

@@ -262,14 +262,6 @@ async fn timeline_snapshot_handler(request: Request<Body>) -> Result<Response<Bo
check_permission(&request, Some(ttid.tenant_id))?;
let tli = GlobalTimelines::get(ttid).map_err(ApiError::from)?;
// Note: with evicted timelines it should work better then de-evict them and
// stream; probably start_snapshot would copy partial s3 file to dest path
// and stream control file, or return WalResidentTimeline if timeline is not
// evicted.
let tli = tli
.wal_residence_guard()
.await
.map_err(ApiError::InternalServerError)?;
// To stream the body use wrap_stream which wants Stream of Result<Bytes>,
// so create the chan and write to it in another task.

View File

@@ -8,6 +8,7 @@ use serde::{Deserialize, Serialize};
use std::{
cmp::min,
io::{self, ErrorKind},
sync::Arc,
};
use tokio::{fs::OpenOptions, io::AsyncWrite, sync::mpsc, task};
use tokio_tar::{Archive, Builder, Header};
@@ -25,8 +26,8 @@ use crate::{
routes::TimelineStatus,
},
safekeeper::Term,
state::TimelinePersistentState,
timeline::WalResidentTimeline,
state::{EvictionState, TimelinePersistentState},
timeline::{Timeline, WalResidentTimeline},
timelines_global_map::{create_temp_timeline_dir, validate_temp_timeline},
wal_backup,
wal_storage::open_wal_file,
@@ -43,18 +44,33 @@ use utils::{
/// Stream tar archive of timeline to tx.
#[instrument(name = "snapshot", skip_all, fields(ttid = %tli.ttid))]
pub async fn stream_snapshot(
tli: WalResidentTimeline,
tli: Arc<Timeline>,
source: NodeId,
destination: NodeId,
tx: mpsc::Sender<Result<Bytes>>,
) {
if let Err(e) = stream_snapshot_guts(tli, source, destination, tx.clone()).await {
// Error type/contents don't matter as they won't can't reach the client
// (hyper likely doesn't do anything with it), but http stream will be
// prematurely terminated. It would be nice to try to send the error in
// trailers though.
tx.send(Err(anyhow!("snapshot failed"))).await.ok();
error!("snapshot failed: {:#}", e);
match tli.try_wal_residence_guard().await {
Err(e) => {
tx.send(Err(anyhow!("Error checking residence: {:#}", e)))
.await
.ok();
}
Ok(maybe_resident_tli) => {
if let Err(e) = match maybe_resident_tli {
Some(resident_tli) => {
stream_snapshot_resident_guts(resident_tli, source, destination, tx.clone())
.await
}
None => stream_snapshot_offloaded_guts(tli, source, destination, tx.clone()).await,
} {
// Error type/contents don't matter as they won't can't reach the client
// (hyper likely doesn't do anything with it), but http stream will be
// prematurely terminated. It would be nice to try to send the error in
// trailers though.
tx.send(Err(anyhow!("snapshot failed"))).await.ok();
error!("snapshot failed: {:#}", e);
}
}
}
}
@@ -80,12 +96,10 @@ impl Drop for SnapshotContext {
}
}
pub async fn stream_snapshot_guts(
tli: WalResidentTimeline,
source: NodeId,
destination: NodeId,
/// Build a tokio_tar stream that sends encoded bytes into a Bytes channel.
fn prepare_tar_stream(
tx: mpsc::Sender<Result<Bytes>>,
) -> Result<()> {
) -> tokio_tar::Builder<impl AsyncWrite + Unpin + Send> {
// tokio-tar wants Write implementor, but we have mpsc tx <Result<Bytes>>;
// use SinkWriter as a Write impl. That is,
// - create Sink from the tx. It returns PollSendError if chan is closed.
@@ -100,12 +114,38 @@ pub async fn stream_snapshot_guts(
// - SinkWriter (not surprisingly) wants sink of &[u8], not bytes, so wrap
// into CopyToBytes. This is a data copy.
let copy_to_bytes = CopyToBytes::new(oksink);
let mut writer = SinkWriter::new(copy_to_bytes);
let pinned_writer = std::pin::pin!(writer);
let writer = SinkWriter::new(copy_to_bytes);
let pinned_writer = Box::pin(writer);
// Note that tokio_tar append_* funcs use tokio::io::copy with 8KB buffer
// which is also likely suboptimal.
let mut ar = Builder::new_non_terminated(pinned_writer);
Builder::new_non_terminated(pinned_writer)
}
/// Implementation of snapshot for an offloaded timeline, only reads control file
pub(crate) async fn stream_snapshot_offloaded_guts(
tli: Arc<Timeline>,
source: NodeId,
destination: NodeId,
tx: mpsc::Sender<Result<Bytes>>,
) -> Result<()> {
let mut ar = prepare_tar_stream(tx);
tli.snapshot_offloaded(&mut ar, source, destination).await?;
ar.finish().await?;
Ok(())
}
/// Implementation of snapshot for a timeline which is resident (includes some segment data)
pub async fn stream_snapshot_resident_guts(
tli: WalResidentTimeline,
source: NodeId,
destination: NodeId,
tx: mpsc::Sender<Result<Bytes>>,
) -> Result<()> {
let mut ar = prepare_tar_stream(tx);
let bctx = tli.start_snapshot(&mut ar, source, destination).await?;
pausable_failpoint!("sk-snapshot-after-list-pausable");
@@ -138,6 +178,70 @@ pub async fn stream_snapshot_guts(
Ok(())
}
impl Timeline {
/// Simple snapshot for an offloaded timeline: we will only upload a renamed partial segment and
/// pass a modified control file into the provided tar stream (nothing with data segments on disk, since
/// we are offloaded and there aren't any)
async fn snapshot_offloaded<W: AsyncWrite + Unpin + Send>(
self: &Arc<Timeline>,
ar: &mut tokio_tar::Builder<W>,
source: NodeId,
destination: NodeId,
) -> Result<()> {
// Take initial copy of control file, then release state lock
let mut control_file = {
let shared_state = self.write_shared_state().await;
let control_file = TimelinePersistentState::clone(shared_state.sk.state());
// Rare race: we got unevicted between entering function and reading control file.
// We error out and let API caller retry.
if !matches!(control_file.eviction_state, EvictionState::Offloaded(_)) {
bail!("Timeline was un-evicted during snapshot, please retry");
}
control_file
};
// Modify the partial segment of the in-memory copy for the control file to
// point to the destination safekeeper.
let replace = control_file
.partial_backup
.replace_uploaded_segment(source, destination)?;
let Some(replace) = replace else {
// In Manager:: ready_for_eviction, we do not permit eviction unless the timeline
// has a partial segment. It is unexpected that
anyhow::bail!("Timeline has no partial segment, cannot generate snapshot");
};
tracing::info!("Replacing uploaded partial segment in in-mem control file: {replace:?}");
// Optimistically try to copy the partial segment to the destination's path: this
// can fail if the timeline was un-evicted and modified in the background.
let remote_timeline_path = &self.remote_path;
wal_backup::copy_partial_segment(
&replace.previous.remote_path(remote_timeline_path),
&replace.current.remote_path(remote_timeline_path),
)
.await?;
// Since the S3 copy succeeded with the path given in our control file snapshot, and
// we are sending that snapshot in our response, we are giving the caller a consistent
// snapshot even if our local Timeline was unevicted or otherwise modified in the meantime.
let buf = control_file
.write_to_buf()
.with_context(|| "failed to serialize control store")?;
let mut header = Header::new_gnu();
header.set_size(buf.len().try_into().expect("never breaches u64"));
ar.append_data(&mut header, CONTROL_FILE_NAME, buf.as_slice())
.await
.with_context(|| "failed to append to archive")?;
Ok(())
}
}
impl WalResidentTimeline {
/// Start streaming tar archive with timeline:
/// 1) stream control file under lock;

View File

@@ -797,14 +797,17 @@ impl Timeline {
state.sk.term_bump(to).await
}
/// Get the timeline guard for reading/writing WAL files.
/// If WAL files are not present on disk (evicted), they will be automatically
/// downloaded from remote storage. This is done in the manager task, which is
/// responsible for issuing all guards.
///
/// NB: don't use this function from timeline_manager, it will deadlock.
/// NB: don't use this function while holding shared_state lock.
pub async fn wal_residence_guard(self: &Arc<Self>) -> Result<WalResidentTimeline> {
/// Guts of [`Self::wal_residence_guard`] and [`Self::try_wal_residence_guard`]
async fn do_wal_residence_guard(
self: &Arc<Self>,
block: bool,
) -> Result<Option<WalResidentTimeline>> {
let op_label = if block {
"wal_residence_guard"
} else {
"try_wal_residence_guard"
};
if self.is_cancelled() {
bail!(TimelineError::Cancelled(self.ttid));
}
@@ -816,10 +819,13 @@ impl Timeline {
// Wait 30 seconds for the guard to be acquired. It can time out if someone is
// holding the lock (e.g. during `SafeKeeper::process_msg()`) or manager task
// is stuck.
let res = tokio::time::timeout_at(
started_at + Duration::from_secs(30),
self.manager_ctl.wal_residence_guard(),
)
let res = tokio::time::timeout_at(started_at + Duration::from_secs(30), async {
if block {
self.manager_ctl.wal_residence_guard().await.map(Some)
} else {
self.manager_ctl.try_wal_residence_guard().await
}
})
.await;
let guard = match res {
@@ -827,14 +833,14 @@ impl Timeline {
let finished_at = Instant::now();
let elapsed = finished_at - started_at;
MISC_OPERATION_SECONDS
.with_label_values(&["wal_residence_guard"])
.with_label_values(&[op_label])
.observe(elapsed.as_secs_f64());
guard
}
Ok(Err(e)) => {
warn!(
"error while acquiring WalResidentTimeline guard, statuses {:?} => {:?}",
"error acquiring in {op_label}, statuses {:?} => {:?}",
status_before,
self.mgr_status.get()
);
@@ -842,7 +848,7 @@ impl Timeline {
}
Err(_) => {
warn!(
"timeout while acquiring WalResidentTimeline guard, statuses {:?} => {:?}",
"timeout acquiring in {op_label} guard, statuses {:?} => {:?}",
status_before,
self.mgr_status.get()
);
@@ -850,7 +856,28 @@ impl Timeline {
}
};
Ok(WalResidentTimeline::new(self.clone(), guard))
Ok(guard.map(|g| WalResidentTimeline::new(self.clone(), g)))
}
/// Get the timeline guard for reading/writing WAL files.
/// If WAL files are not present on disk (evicted), they will be automatically
/// downloaded from remote storage. This is done in the manager task, which is
/// responsible for issuing all guards.
///
/// NB: don't use this function from timeline_manager, it will deadlock.
/// NB: don't use this function while holding shared_state lock.
pub async fn wal_residence_guard(self: &Arc<Self>) -> Result<WalResidentTimeline> {
self.do_wal_residence_guard(true)
.await
.map(|m| m.expect("Always get Some in block=true mode"))
}
/// Get the timeline guard for reading/writing WAL files if the timeline is resident,
/// else return None
pub(crate) async fn try_wal_residence_guard(
self: &Arc<Self>,
) -> Result<Option<WalResidentTimeline>> {
self.do_wal_residence_guard(false).await
}
pub async fn backup_partial_reset(self: &Arc<Self>) -> Result<Vec<String>> {

View File

@@ -56,6 +56,9 @@ impl Manager {
// This also works for the first segment despite last_removed_segno
// being 0 on init because this 0 triggers run of wal_removal_task
// on success of which manager updates the horizon.
//
// **Note** pull_timeline functionality assumes that evicted timelines always have
// a partial segment: if we ever change this condition, must also update that code.
&& self
.partial_backup_uploaded
.as_ref()

View File

@@ -100,6 +100,8 @@ const REFRESH_INTERVAL: Duration = Duration::from_millis(300);
pub enum ManagerCtlMessage {
/// Request to get a guard for WalResidentTimeline, with WAL files available locally.
GuardRequest(tokio::sync::oneshot::Sender<anyhow::Result<ResidenceGuard>>),
/// Get a guard for WalResidentTimeline if the timeline is not currently offloaded, else None
TryGuardRequest(tokio::sync::oneshot::Sender<Option<ResidenceGuard>>),
/// Request to drop the guard.
GuardDrop(GuardId),
/// Request to reset uploaded partial backup state.
@@ -110,6 +112,7 @@ impl std::fmt::Debug for ManagerCtlMessage {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
ManagerCtlMessage::GuardRequest(_) => write!(f, "GuardRequest"),
ManagerCtlMessage::TryGuardRequest(_) => write!(f, "TryGuardRequest"),
ManagerCtlMessage::GuardDrop(id) => write!(f, "GuardDrop({:?})", id),
ManagerCtlMessage::BackupPartialReset(_) => write!(f, "BackupPartialReset"),
}
@@ -152,6 +155,19 @@ impl ManagerCtl {
.and_then(std::convert::identity)
}
/// Issue a new guard if the timeline is currently not offloaded, else return None
/// Sends a message to the manager and waits for the response.
/// Can be blocked indefinitely if the manager is stuck.
pub async fn try_wal_residence_guard(&self) -> anyhow::Result<Option<ResidenceGuard>> {
let (tx, rx) = tokio::sync::oneshot::channel();
self.manager_tx
.send(ManagerCtlMessage::TryGuardRequest(tx))?;
// wait for the manager to respond with the guard
rx.await
.map_err(|e| anyhow::anyhow!("response read fail: {:?}", e))
}
/// Request timeline manager to reset uploaded partial segment state and
/// wait for the result.
pub async fn backup_partial_reset(&self) -> anyhow::Result<Vec<String>> {
@@ -674,6 +690,17 @@ impl Manager {
warn!("failed to reply with a guard, receiver dropped");
}
}
Some(ManagerCtlMessage::TryGuardRequest(tx)) => {
let result = if self.is_offloaded {
None
} else {
Some(self.access_service.create_guard())
};
if tx.send(result).is_err() {
warn!("failed to reply with a guard, receiver dropped");
}
}
Some(ManagerCtlMessage::GuardDrop(guard_id)) => {
self.access_service.drop_guard(guard_id);
}

View File

@@ -968,6 +968,28 @@ async fn handle_tenant_shard_migrate(
)
}
async fn handle_tenant_shard_cancel_reconcile(
service: Arc<Service>,
req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
check_permissions(&req, Scope::Admin)?;
let req = match maybe_forward(req).await {
ForwardOutcome::Forwarded(res) => {
return res;
}
ForwardOutcome::NotForwarded(req) => req,
};
let tenant_shard_id: TenantShardId = parse_request_param(&req, "tenant_shard_id")?;
json_response(
StatusCode::OK,
service
.tenant_shard_cancel_reconcile(tenant_shard_id)
.await?,
)
}
async fn handle_tenant_update_policy(req: Request<Body>) -> Result<Response<Body>, ApiError> {
check_permissions(&req, Scope::Admin)?;
@@ -1776,6 +1798,16 @@ pub fn make_router(
RequestName("control_v1_tenant_migrate"),
)
})
.put(
"/control/v1/tenant/:tenant_shard_id/cancel_reconcile",
|r| {
tenant_service_handler(
r,
handle_tenant_shard_cancel_reconcile,
RequestName("control_v1_tenant_cancel_reconcile"),
)
},
)
.put("/control/v1/tenant/:tenant_id/shard_split", |r| {
tenant_service_handler(
r,

View File

@@ -4834,6 +4834,43 @@ impl Service {
Ok(TenantShardMigrateResponse {})
}
/// 'cancel' in this context means cancel any ongoing reconcile
pub(crate) async fn tenant_shard_cancel_reconcile(
&self,
tenant_shard_id: TenantShardId,
) -> Result<(), ApiError> {
// Take state lock and fire the cancellation token, after which we drop lock and wait for any ongoing reconcile to complete
let waiter = {
let locked = self.inner.write().unwrap();
let Some(shard) = locked.tenants.get(&tenant_shard_id) else {
return Err(ApiError::NotFound(
anyhow::anyhow!("Tenant shard not found").into(),
));
};
let waiter = shard.get_waiter();
match waiter {
None => {
tracing::info!("Shard does not have an ongoing Reconciler");
return Ok(());
}
Some(waiter) => {
tracing::info!("Cancelling Reconciler");
shard.cancel_reconciler();
waiter
}
}
};
// Cancellation should be prompt. If this fails we have still done our job of firing the
// cancellation token, but by returning an ApiError we will indicate to the caller that
// the Reconciler is misbehaving and not respecting the cancellation token
self.await_waiters(vec![waiter], SHORT_RECONCILE_TIMEOUT)
.await?;
Ok(())
}
/// This is for debug/support only: we simply drop all state for a tenant, without
/// detaching or deleting it on pageservers.
pub(crate) async fn tenant_drop(&self, tenant_id: TenantId) -> Result<(), ApiError> {

View File

@@ -1317,6 +1317,12 @@ impl TenantShard {
})
}
pub(crate) fn cancel_reconciler(&self) {
if let Some(handle) = self.reconciler.as_ref() {
handle.cancel.cancel()
}
}
/// Get a waiter for any reconciliation in flight, but do not start reconciliation
/// if it is not already running
pub(crate) fn get_waiter(&self) -> Option<ReconcilerWaiter> {

View File

@@ -872,6 +872,14 @@ def test_storage_controller_debug_apis(neon_env_builder: NeonEnvBuilder):
assert sum(v["shard_count"] for v in response.json()["nodes"].values()) == 3
assert all(v["may_schedule"] for v in response.json()["nodes"].values())
# Reconciler cancel API should be a no-op when nothing is in flight
env.storage_controller.request(
"PUT",
f"{env.storage_controller_api}/control/v1/tenant/{tenant_id}-0102/cancel_reconcile",
headers=env.storage_controller.headers(TokenScope.ADMIN),
)
# Node unclean drop API
response = env.storage_controller.request(
"POST",
f"{env.storage_controller_api}/debug/v1/node/{env.pageservers[1].id}/drop",
@@ -879,6 +887,7 @@ def test_storage_controller_debug_apis(neon_env_builder: NeonEnvBuilder):
)
assert len(env.storage_controller.node_list()) == 1
# Tenant unclean drop API
response = env.storage_controller.request(
"POST",
f"{env.storage_controller_api}/debug/v1/tenant/{tenant_id}/drop",
@@ -892,7 +901,6 @@ def test_storage_controller_debug_apis(neon_env_builder: NeonEnvBuilder):
headers=env.storage_controller.headers(TokenScope.ADMIN),
)
assert len(response.json()) == 1
# Check that the 'drop' APIs didn't leave things in a state that would fail a consistency check: they're
# meant to be unclean wrt the pageserver state, but not leave a broken storage controller behind.
env.storage_controller.consistency_check()
@@ -1660,6 +1668,11 @@ def test_storcon_cli(neon_env_builder: NeonEnvBuilder):
storcon_cli(["tenant-policy", "--tenant-id", str(env.initial_tenant), "--scheduling", "stop"])
assert "Stop" in storcon_cli(["tenants"])[3]
# Cancel ongoing reconcile on a tenant
storcon_cli(
["tenant-shard-cancel-reconcile", "--tenant-shard-id", f"{env.initial_tenant}-0104"]
)
# Change a tenant's placement
storcon_cli(
["tenant-policy", "--tenant-id", str(env.initial_tenant), "--placement", "secondary"]

View File

@@ -1998,6 +1998,109 @@ def test_pull_timeline_term_change(neon_env_builder: NeonEnvBuilder):
pt_handle.join()
def test_pull_timeline_while_evicted(neon_env_builder: NeonEnvBuilder):
"""
Verify that when pull_timeline is used on an evicted timeline, it does not result in
promoting any segments to local disk on the source, and the timeline is correctly instantiated
in evicted state on the destination. This behavior is important to avoid ballooning disk
usage when doing mass migration of timelines.
"""
neon_env_builder.num_safekeepers = 4
neon_env_builder.enable_safekeeper_remote_storage(default_remote_storage())
# Configure safekeepers with ultra-fast eviction policy
neon_env_builder.safekeeper_extra_opts = [
"--enable-offload",
"--partial-backup-timeout",
"50ms",
"--control-file-save-interval",
"1s",
# Safekeepers usually wait a while before evicting something: for this test we want them to
# evict things as soon as they are inactive.
"--eviction-min-resident=100ms",
"--delete-offloaded-wal",
]
initial_tenant_conf = {"lagging_wal_timeout": "1s", "checkpoint_timeout": "100ms"}
env = neon_env_builder.init_start(initial_tenant_conf=initial_tenant_conf)
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
(src_sk, dst_sk) = (env.safekeepers[0], env.safekeepers[-1])
log.info(f"Will pull_timeline on destination {dst_sk.id} from source {src_sk.id}")
ep = env.endpoints.create("main")
ep.active_safekeepers = [s.id for s in env.safekeepers if s.id != dst_sk.id]
log.info(f"Compute writing initially to safekeepers: {ep.active_safekeepers}")
ep.active_safekeepers = [1, 2, 3] # Exclude dst_sk from set written by compute initially
ep.start()
ep.safe_psql("CREATE TABLE t(i int)")
ep.safe_psql("INSERT INTO t VALUES (0)")
ep.stop()
wait_lsn_force_checkpoint_at_sk(src_sk, tenant_id, timeline_id, env.pageserver)
src_http = src_sk.http_client()
dst_http = dst_sk.http_client()
def evicted_on_source():
# Wait for timeline to go into evicted state
assert src_http.get_eviction_state(timeline_id) != "Present"
assert (
src_http.get_metric_value(
"safekeeper_eviction_events_completed_total", {"kind": "evict"}
)
or 0 > 0
)
assert src_http.get_metric_value("safekeeper_evicted_timelines") or 0 > 0
# Check that on source no segment files are present
assert src_sk.list_segments(tenant_id, timeline_id) == []
wait_until(60, 1, evicted_on_source)
# Invoke pull_timeline: source should serve snapshot request without promoting anything to local disk,
# destination should import the control file only & go into evicted mode immediately
dst_sk.pull_timeline([src_sk], tenant_id, timeline_id)
# Check that on source and destination no segment files are present
assert src_sk.list_segments(tenant_id, timeline_id) == []
assert dst_sk.list_segments(tenant_id, timeline_id) == []
# Check that the timeline on the destination is in the expected evicted state.
evicted_on_source() # It should still be evicted on the source
def evicted_on_destination():
assert dst_http.get_eviction_state(timeline_id) != "Present"
assert dst_http.get_metric_value("safekeeper_evicted_timelines") or 0 > 0
# This should be fast, it is a wait_until because eviction state is updated
# in the background wrt pull_timeline.
wait_until(10, 0.1, evicted_on_destination)
# Delete the timeline on the source, to prove that deletion works on an
# evicted timeline _and_ that the final compute test is really not using
# the original location
src_sk.http_client().timeline_delete(tenant_id, timeline_id, only_local=True)
# Check that using the timeline correctly un-evicts it on the new location
ep.active_safekeepers = [2, 3, 4]
ep.start()
ep.safe_psql("INSERT INTO t VALUES (0)")
ep.stop()
def unevicted_on_dest():
assert (
dst_http.get_metric_value(
"safekeeper_eviction_events_completed_total", {"kind": "restore"}
)
or 0 > 0
)
n_evicted = dst_sk.http_client().get_metric_value("safekeeper_evicted_timelines")
assert n_evicted == 0
wait_until(10, 1, unevicted_on_dest)
# In this test we check for excessive START_REPLICATION and START_WAL_PUSH queries
# when compute is active, but there are no writes to the timeline. In that case
# pageserver should maintain a single connection to safekeeper and don't attempt