Implement validation of generations before delete

Hook deletion queue into generations
Make remote_layer_path take Generation instead of layer metadata
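
For reviewers, the validation step named in the title can be sketched as below. This is a minimal illustration modelled on `handle_validate` in `control_plane/src/bin/attachment_service.rs` from this change, not the code itself. The `String` tenant ID, the plain `HashMap` of latest generations, and the `deletable_tenants` helper are assumptions made to keep the sketch self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the tenant identifier (the diff uses `TenantId`).
type TenantId = String;

/// One request entry: "I believe I hold `generation` for tenant `id`".
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ValidateRequestTenant {
    pub id: TenantId,
    pub generation: u32,
}

/// One response entry: whether the claimed generation is still the latest.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ValidateResponseTenant {
    pub id: TenantId,
    pub valid: bool,
}

/// Compare each claimed generation with the latest one on record.
/// Tenants unknown to the service are omitted from the response,
/// mirroring the behaviour of `handle_validate` in this diff.
pub fn validate(
    latest: &HashMap<TenantId, u32>,
    request: &[ValidateRequestTenant],
) -> Vec<ValidateResponseTenant> {
    request
        .iter()
        .filter_map(|t| {
            latest.get(&t.id).map(|&current| ValidateResponseTenant {
                id: t.id.clone(),
                valid: current == t.generation,
            })
        })
        .collect()
}

/// Assumed caller-side rule: only tenants reported valid may go on to delete.
pub fn deletable_tenants(response: &[ValidateResponseTenant]) -> Vec<TenantId> {
    response
        .iter()
        .filter(|t| t.valid)
        .map(|t| t.id.clone())
        .collect()
}
```

A pageserver holding a stale generation gets `valid: false` for that tenant and, under the assumption above, would leave that tenant's enqueued deletions unexecuted.
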
2026-02-13 07:30:38 +00:00 · 2023-08-30 17:44:10 +01:00 · 2023-08-30 15:35:51 +01:00 · 2023-08-30 15:13:00 +01:00 · 2023-08-30 15:07:35 +01:00 · 2023-08-30 12:21:29 +01:00
93 changed files with 5618 additions and 898 deletions
--- a/.github/actions/run-python-test-set/action.yml
+++ b/.github/actions/run-python-test-set/action.yml
@@ -145,7 +145,11 @@ runs:

        if [ "${RERUN_FLAKY}" == "true" ]; then
          mkdir -p $TEST_OUTPUT
-          poetry run ./scripts/flaky_tests.py "${TEST_RESULT_CONNSTR}" --days 10 --output "$TEST_OUTPUT/flaky.json"
+          poetry run ./scripts/flaky_tests.py "${TEST_RESULT_CONNSTR}" \
+                                              --days 7 \
+                                              --output "$TEST_OUTPUT/flaky.json" \
+                                              --pg-version "${DEFAULT_PG_VERSION}" \
+                                              --build-type "${BUILD_TYPE}"

          EXTRA_PARAMS="--flaky-tests-json $TEST_OUTPUT/flaky.json $EXTRA_PARAMS"
        fi
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -752,7 +752,7 @@ jobs:
      run:
        shell: sh -eu {0}
    env:
-      VM_BUILDER_VERSION: v0.16.3
+      VM_BUILDER_VERSION: v0.17.5

    steps:
      - name: Checkout
@@ -775,6 +775,7 @@ jobs:
        run: |
          ./vm-builder \
            -enable-file-cache \
+            -cgroup-uid=postgres \
            -src=369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
            -dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}

@@ -903,7 +904,7 @@ jobs:
    container:
      image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/base:pinned
      options: --init
-    needs: [ promote-images, tag ]
+    needs: [ tag ]
    steps:
      - name: Set PR's status to pending and request a remote CI test
        run: |
--- a/13
+++ b/13
@@ -1,11 +1,12 @@
-/compute_tools/ @neondatabase/control-plane
+/compute_tools/ @neondatabase/control-plane @neondatabase/compute
 /control_plane/ @neondatabase/compute @neondatabase/storage
 /libs/pageserver_api/ @neondatabase/compute @neondatabase/storage
-/libs/postgres_ffi/ @neondatabase/compute 
-/libs/remote_storage/ @neondatabase/storage 
-/libs/safekeeper_api/ @neondatabase/safekeepers  
-/pageserver/ @neondatabase/compute @neondatabase/storage 
+/libs/postgres_ffi/ @neondatabase/compute
+/libs/remote_storage/ @neondatabase/storage
+/libs/safekeeper_api/ @neondatabase/safekeepers
+/libs/vm_monitor/ @neondatabase/autoscaling @neondatabase/compute
+/pageserver/ @neondatabase/compute @neondatabase/storage
 /pgxn/ @neondatabase/compute
-/proxy/ @neondatabase/control-plane 
+/proxy/ @neondatabase/proxy
 /safekeeper/ @neondatabase/safekeepers
 /vendor/ @neondatabase/compute
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1001,6 +1001,7 @@ dependencies = [
 "comfy-table",
 "compute_api",
 "git-version",
+ "hyper",
 "nix 0.26.2",
 "once_cell",
 "pageserver_api",
@@ -1016,6 +1017,7 @@ dependencies = [
 "storage_broker",
 "tar",
 "thiserror",
+ "tokio",
 "toml",
 "tracing",
 "url",
@@ -2684,6 +2686,7 @@ dependencies = [
 "bytes",
 "const_format",
 "enum-map",
+ "hex",
 "postgres_ffi",
 "serde",
 "serde_json",
@@ -5014,6 +5017,7 @@ dependencies = [
 "nix 0.26.2",
 "once_cell",
 "pin-project-lite",
+ "postgres_connection",
 "pq_proto",
 "rand",
 "regex",
--- a/Dockerfile.compute-node
+++ b/Dockerfile.compute-node
@@ -211,8 +211,8 @@ RUN wget https://github.com/df7cb/postgresql-unit/archive/refs/tags/7.7.tar.gz -
 FROM build-deps AS vector-pg-build
 COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/

-RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.4.4.tar.gz -O pgvector.tar.gz && \
-    echo "1cb70a63f8928e396474796c22a20be9f7285a8a013009deb8152445b61b72e6 pgvector.tar.gz" | sha256sum --check && \
+RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.5.0.tar.gz -O pgvector.tar.gz && \
+    echo "d8aa3504b215467ca528525a6de12c3f85f9891b091ce0e5864dd8a9b757f77b pgvector.tar.gz" | sha256sum --check && \
    mkdir pgvector-src && cd pgvector-src && tar xvzf ../pgvector.tar.gz --strip-components=1 -C . && \
    make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
    make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
--- a/compute_tools/README.md
+++ b/compute_tools/README.md
@@ -19,9 +19,10 @@ Also `compute_ctl` spawns two separate service threads:
 - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
  last activity requests.

-If the `vm-informant` binary is present at `/bin/vm-informant`, it will also be started. For VM
-compute nodes, `vm-informant` communicates with the VM autoscaling system. It coordinates
-downscaling and (eventually) will request immediate upscaling under resource pressure.
+If the `AUTOSCALING` environment variable is set, `compute_ctl` will start the
+`vm-monitor` located in [`neon/libs/vm_monitor`]. For VM compute nodes,
+`vm-monitor` communicates with the VM autoscaling system. It coordinates
+downscaling and requests immediate upscaling under resource pressure.

 Usage example:
 ```sh
--- a/compute_tools/src/bin/compute_ctl.rs
+++ b/compute_tools/src/bin/compute_ctl.rs
@@ -20,9 +20,10 @@
 //! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
 //!   last activity requests.
 //!
-//! If the `vm-informant` binary is present at `/bin/vm-informant`, it will also be started. For VM
-//! compute nodes, `vm-informant` communicates with the VM autoscaling system. It coordinates
-//! downscaling and (eventually) will request immediate upscaling under resource pressure.
+//! If the `AUTOSCALING` environment variable is set, `compute_ctl` will start the
+//! `vm-monitor` located in [`neon/libs/vm_monitor`]. For VM compute nodes,
+//! `vm-monitor` communicates with the VM autoscaling system. It coordinates
+//! downscaling and requests immediate upscaling under resource pressure.
 //!
 //! Usage example:
 //! ```sh
@@ -278,8 +279,9 @@ fn main() -> Result<()> {
            use tokio_util::sync::CancellationToken;
            use tracing::warn;
            let vm_monitor_addr = matches.get_one::<String>("vm-monitor-addr");
-            let cgroup = matches.get_one::<String>("filecache-connstr");
-            let file_cache_connstr = matches.get_one::<String>("cgroup");
+            let file_cache_connstr = matches.get_one::<String>("filecache-connstr");
+            let cgroup = matches.get_one::<String>("cgroup");
+            let file_cache_on_disk = matches.get_flag("file-cache-on-disk");

            // Only make a runtime if we need to.
            // Note: it seems like you can make a runtime in an inner scope and
@@ -312,6 +314,7 @@ fn main() -> Result<()> {
                        cgroup: cgroup.cloned(),
                        pgconnstr: file_cache_connstr.cloned(),
                        addr: vm_monitor_addr.cloned().unwrap(),
+                        file_cache_on_disk,
                    })),
                    token.clone(),
                ))
@@ -482,6 +485,11 @@ fn cli() -> clap::Command {
                )
                .value_name("FILECACHE_CONNSTR"),
        )
+        .arg(
+            Arg::new("file-cache-on-disk")
+                .long("file-cache-on-disk")
+                .action(clap::ArgAction::SetTrue),
+        )
 }

 #[test]
--- a/compute_tools/src/compute.rs
+++ b/compute_tools/src/compute.rs
@@ -1,4 +1,5 @@
 use std::collections::HashMap;
+use std::env;
 use std::fs;
 use std::io::BufRead;
 use std::os::unix::fs::PermissionsExt;
@@ -175,6 +176,27 @@ impl TryFrom<ComputeSpec> for ParsedSpec {
    }
 }

+/// If we are a VM, returns a [`Command`] that will run in the `neon-postgres`
+/// cgroup. Otherwise returns the default `Command::new(cmd)`.
+///
+/// This function should be used to start postgres, as it will start it in the
+/// neon-postgres cgroup if we are a VM. This allows autoscaling to control
+/// postgres' resource usage. The cgroup will exist in VMs because vm-builder
+/// creates it during the sysinit phase of its inittab.
+fn maybe_cgexec(cmd: &str) -> Command {
+    // The cplane sets this env var for autoscaling computes.
+    // Use `var_os` so we don't have to worry about the variable being valid
+    // Unicode. This should never be a concern, but just in case.
+    if env::var_os("AUTOSCALING").is_some() {
+        let mut command = Command::new("cgexec");
+        command.args(["-g", "memory:neon-postgres"]);
+        command.arg(cmd);
+        command
+    } else {
+        Command::new(cmd)
+    }
+}
+
 /// Create special neon_superuser role, that's a slightly nerfed version of a real superuser
 /// that we give to customers
 fn create_neon_superuser(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
@@ -451,7 +473,7 @@ impl ComputeNode {
    pub fn sync_safekeepers(&self, storage_auth_token: Option<String>) -> Result<Lsn> {
        let start_time = Utc::now();

-        let sync_handle = Command::new(&self.pgbin)
+        let sync_handle = maybe_cgexec(&self.pgbin)
            .args(["--sync-safekeepers"])
            .env("PGDATA", &self.pgdata) // we cannot use -D in this mode
            .envs(if let Some(storage_auth_token) = &storage_auth_token {
@@ -586,7 +608,7 @@ impl ComputeNode {

        // Start postgres
        info!("starting postgres");
-        let mut pg = Command::new(&self.pgbin)
+        let mut pg = maybe_cgexec(&self.pgbin)
            .args(["-D", pgdata])
            .spawn()
            .expect("cannot start postgres process");
@@ -614,7 +636,7 @@ impl ComputeNode {
        let pgdata_path = Path::new(&self.pgdata);

        // Run postgres as a child process.
-        let mut pg = Command::new(&self.pgbin)
+        let mut pg = maybe_cgexec(&self.pgbin)
            .args(["-D", &self.pgdata])
            .envs(if let Some(storage_auth_token) = &storage_auth_token {
                vec![("NEON_AUTH_TOKEN", storage_auth_token)]
--- a/control_plane/Cargo.toml
+++ b/control_plane/Cargo.toml
@@ -12,6 +12,7 @@ git-version.workspace = true
 nix.workspace = true
 once_cell.workspace = true
 postgres.workspace = true
+hyper.workspace = true
 regex.workspace = true
 reqwest = { workspace = true, features = ["blocking", "json"] }
 serde.workspace = true
@@ -20,6 +21,7 @@ serde_with.workspace = true
 tar.workspace = true
 thiserror.workspace = true
 toml.workspace = true
+tokio.workspace = true
 url.workspace = true
 # Note: Do not directly depend on pageserver or safekeeper; use pageserver_api or safekeeper_api
 # instead, so that recompile times are better.
--- a/control_plane/src/attachment_service.rs
+++ b/control_plane/src/attachment_service.rs
@@ -0,0 +1,104 @@
+use crate::{background_process, local_env::LocalEnv};
+use anyhow::anyhow;
+use pageserver_api::control_api::HexTenantId;
+use serde::{Deserialize, Serialize};
+use std::{path::PathBuf, process::Child};
+use utils::id::{NodeId, TenantId};
+
+pub struct AttachmentService {
+    env: LocalEnv,
+    listen: String,
+    path: PathBuf,
+}
+
+const COMMAND: &str = "attachment_service";
+
+#[derive(Serialize, Deserialize)]
+pub struct AttachHookRequest {
+    pub tenant_id: HexTenantId,
+    pub pageserver_id: Option<NodeId>,
+}
+
+#[derive(Serialize, Deserialize)]
+pub struct AttachHookResponse {
+    pub gen: Option<u32>,
+}
+
+impl AttachmentService {
+    pub fn from_env(env: &LocalEnv) -> Self {
+        let path = env.base_data_dir.join("attachments.json");
+
+        // Makes no sense to construct this if pageservers aren't going to use it: assume
+        // pageservers have control plane API set
+        let listen_url = env.pageserver.control_plane_api.clone().unwrap();
+
+        let listen = format!(
+            "{}:{}",
+            listen_url.host_str().unwrap(),
+            listen_url.port().unwrap()
+        );
+
+        Self {
+            env: env.clone(),
+            path,
+            listen,
+        }
+    }
+
+    fn pid_file(&self) -> PathBuf {
+        self.env.base_data_dir.join("attachment_service.pid")
+    }
+
+    pub fn start(&self) -> anyhow::Result<Child> {
+        let path_str = self.path.to_string_lossy();
+
+        background_process::start_process(
+            COMMAND,
+            &self.env.base_data_dir,
+            &self.env.attachment_service_bin(),
+            ["-l", &self.listen, "-p", &path_str],
+            [],
+            background_process::InitialPidFile::Create(&self.pid_file()),
+            // TODO: a real status check
+            || Ok(true),
+        )
+    }
+
+    pub fn stop(&self, immediate: bool) -> anyhow::Result<()> {
+        background_process::stop_process(immediate, COMMAND, &self.pid_file())
+    }
+
+    /// Call into the attach_hook API, for use before handing out attachments to pageservers
+    pub fn attach_hook(
+        &self,
+        tenant_id: TenantId,
+        pageserver_id: NodeId,
+    ) -> anyhow::Result<Option<u32>> {
+        use hyper::StatusCode;
+
+        let url = self
+            .env
+            .pageserver
+            .control_plane_api
+            .clone()
+            .unwrap()
+            .join("attach_hook")
+            .unwrap();
+        let client = reqwest::blocking::ClientBuilder::new()
+            .build()
+            .expect("Failed to construct http client");
+
+        let request = AttachHookRequest {
+            tenant_id: HexTenantId::new(tenant_id),
+            pageserver_id: Some(pageserver_id),
+        };
+
+        let response = client.post(url).json(&request).send()?;
+        if response.status() != StatusCode::OK {
+            return Err(anyhow!("Unexpected status {0}", response.status()));
+        }
+
+        let response = response.json::<AttachHookResponse>()?;
+        Ok(response.gen)
+    }
+}
--- a/control_plane/src/bin/attachment_service.rs
+++ b/control_plane/src/bin/attachment_service.rs
@@ -0,0 +1,264 @@
+//! The attachment service mimics the aspects of the control plane API
+//! that are required for a pageserver to operate.
+//!
+//! This enables running & testing pageservers without a full-blown
+//! deployment of the Neon cloud platform.
+//!
+use anyhow::anyhow;
+use clap::Parser;
+use hyper::StatusCode;
+use hyper::{Body, Request, Response};
+use pageserver_api::control_api::*;
+use serde::{Deserialize, Serialize};
+use std::path::{Path, PathBuf};
+use std::{collections::HashMap, sync::Arc};
+use utils::logging::{self, LogFormat};
+
+use utils::{
+    http::{
+        endpoint::{self},
+        error::ApiError,
+        json::{json_request, json_response},
+        RequestExt, RouterBuilder,
+    },
+    id::{NodeId, TenantId},
+    tcp_listener,
+};
+
+use control_plane::attachment_service::{AttachHookRequest, AttachHookResponse};
+
+#[derive(Parser)]
+#[command(author, version, about, long_about = None)]
+#[command(arg_required_else_help(true))]
+struct Cli {
+    #[arg(short, long)]
+    listen: String,
+
+    #[arg(short, long)]
+    path: PathBuf,
+}
+
+// The persistent state of each Tenant
+#[derive(Serialize, Deserialize, Clone)]
+struct TenantState {
+    // Currently attached pageserver
+    pageserver: Option<NodeId>,
+
+    // Latest generation number: next time we attach, increment this
+    // and use the incremented number when attaching
+    generation: u32,
+}
+
+fn to_hex_map<S, V>(input: &HashMap<TenantId, V>, serializer: S) -> Result<S::Ok, S::Error>
+where
+    S: serde::Serializer,
+    V: Clone + Serialize,
+{
+    eprintln!("to_hex_map");
+    let transformed = input
+        .iter()
+        .map(|(k, v)| (HexTenantId::new(k.clone()), v.clone()));
+
+    transformed
+        .collect::<HashMap<HexTenantId, V>>()
+        .serialize(serializer)
+}
+
+fn from_hex_map<'de, D, V>(deserializer: D) -> Result<HashMap<TenantId, V>, D::Error>
+where
+    D: serde::de::Deserializer<'de>,
+    V: Deserialize<'de>,
+{
+    eprintln!("from_hex_map");
+    let hex_map = HashMap::<HexTenantId, V>::deserialize(deserializer)?;
+
+    Ok(hex_map.into_iter().map(|(k, v)| (k.take(), v)).collect())
+}
+
+// Top level state available to all HTTP handlers
+#[derive(Serialize, Deserialize)]
+struct PersistentState {
+    #[serde(serialize_with = "to_hex_map", deserialize_with = "from_hex_map")]
+    tenants: HashMap<TenantId, TenantState>,
+
+    #[serde(skip)]
+    path: PathBuf,
+}
+
+impl PersistentState {
+    async fn save(&self) -> anyhow::Result<()> {
+        let bytes = serde_json::to_vec(self)?;
+        tokio::fs::write(&self.path, &bytes).await?;
+
+        Ok(())
+    }
+
+    async fn load(path: &Path) -> anyhow::Result<Self> {
+        let bytes = tokio::fs::read(path).await?;
+        let mut decoded = serde_json::from_slice::<Self>(&bytes)?;
+        decoded.path = path.to_owned();
+        Ok(decoded)
+    }
+
+    async fn load_or_new(path: &Path) -> Self {
+        match Self::load(path).await {
+            Ok(s) => s,
+            Err(e) => {
+                tracing::info!(
+                    "Creating new state file at {0} (load returned {e})",
+                    path.to_string_lossy()
+                );
+                Self {
+                    tenants: HashMap::new(),
+                    path: path.to_owned(),
+                }
+            }
+        }
+    }
+}
+
+/// State available to HTTP request handlers
+#[derive(Clone)]
+struct State {
+    inner: Arc<tokio::sync::RwLock<PersistentState>>,
+}
+
+impl State {
+    fn new(persistent_state: PersistentState) -> State {
+        Self {
+            inner: Arc::new(tokio::sync::RwLock::new(persistent_state)),
+        }
+    }
+}
+
+#[inline(always)]
+fn get_state(request: &Request<Body>) -> &State {
+    request
+        .data::<Arc<State>>()
+        .expect("unknown state type")
+        .as_ref()
+}
+
+/// Pageserver calls into this on startup, to learn which tenants it should attach
+async fn handle_re_attach(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
+    let reattach_req = json_request::<ReAttachRequest>(&mut req).await?;
+
+    let state = get_state(&req).inner.clone();
+    let mut locked = state.write().await;
+
+    let mut response = ReAttachResponse {
+        tenants: Vec::new(),
+    };
+    for (t, state) in &mut locked.tenants {
+        if state.pageserver == Some(reattach_req.node_id) {
+            state.generation += 1;
+            response.tenants.push(ReAttachResponseTenant {
+                id: HexTenantId::new(t.clone()),
+                generation: state.generation,
+            });
+        }
+    }
+
+    locked
+        .save()
+        .await
+        .map_err(ApiError::InternalServerError)?;
+
+    json_response(StatusCode::OK, response)
+}
+
+/// Pageserver calls into this before doing deletions, to confirm that it still
+/// holds the latest generation for the tenants with deletions enqueued
+async fn handle_validate(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
+    let validate_req = json_request::<ValidateRequest>(&mut req).await?;
+
+    let state = get_state(&req).inner.clone();
+    let locked = state.read().await;
+
+    let mut response = ValidateResponse {
+        tenants: Vec::new(),
+    };
+
+    for req_tenant in validate_req.tenants {
+        if let Some(tenant_state) = locked.tenants.get(req_tenant.id.as_ref()) {
+            let valid = tenant_state.generation == req_tenant.gen;
+            response.tenants.push(ValidateResponseTenant {
+                id: req_tenant.id,
+                valid,
+            });
+        }
+    }
+
+    json_response(StatusCode::OK, response)
+}
+
+/// Call into this before attaching a tenant to a pageserver, to acquire a generation number
+/// (in the real control plane this is unnecessary, because the same program is managing
+///  generation numbers and doing attachments).
+async fn handle_attach_hook(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
+    let attach_req = json_request::<AttachHookRequest>(&mut req).await?;
+
+    let state = get_state(&req).inner.clone();
+    let mut locked = state.write().await;
+
+    let tenant_state = locked
+        .tenants
+        .entry(attach_req.tenant_id.take())
+        .or_insert_with(|| TenantState {
+            pageserver: attach_req.pageserver_id,
+            generation: 0,
+        });
+
+    if attach_req.pageserver_id.is_some() {
+        tenant_state.generation += 1;
+    }
+    let generation = tenant_state.generation;
+
+    locked
+        .save()
+        .await
+        .map_err(ApiError::InternalServerError)?;
+
+    json_response(
+        StatusCode::OK,
+        AttachHookResponse {
+            gen: attach_req.pageserver_id.map(|_| generation),
+        },
+    )
+}
+
+fn make_router(persistent_state: PersistentState) -> RouterBuilder<hyper::Body, ApiError> {
+    endpoint::make_router()
+        .data(Arc::new(State::new(persistent_state)))
+        .post("/re-attach", |r| handle_re_attach(r))
+        .post("/validate", |r| handle_validate(r))
+        .post("/attach_hook", |r| handle_attach_hook(r))
+}
+
+#[tokio::main]
+async fn main() -> anyhow::Result<()> {
+    logging::init(
+        LogFormat::Plain,
+        logging::TracingErrorLayerEnablement::Disabled,
+    )?;
+
+    let args = Cli::parse();
+    tracing::info!(
+        "Starting, state at {}, listening on {}",
+        args.path.to_string_lossy(),
+        args.listen
+    );
+
+    let persistent_state = PersistentState::load_or_new(&args.path).await;
+
+    let http_listener = tcp_listener::bind(&args.listen)?;
+    let router = make_router(persistent_state)
+        .build()
+        .map_err(|err| anyhow!(err))?;
+    let service = utils::http::RouterService::new(router).unwrap();
+    let server = hyper::Server::from_tcp(http_listener)?.serve(service);
+
+    tracing::info!("Serving on {0}", args.listen.as_str());
+    server.await?;
+
+    Ok(())
+}
--- a/control_plane/src/bin/neon_local.rs
+++ b/control_plane/src/bin/neon_local.rs
@@ -8,6 +8,7 @@
 use anyhow::{anyhow, bail, Context, Result};
 use clap::{value_parser, Arg, ArgAction, ArgMatches, Command};
 use compute_api::spec::ComputeMode;
+use control_plane::attachment_service::AttachmentService;
 use control_plane::endpoint::ComputeControlPlane;
 use control_plane::local_env::LocalEnv;
 use control_plane::pageserver::PageServerNode;
@@ -43,6 +44,8 @@ project_git_version!(GIT_VERSION);

 const DEFAULT_PG_VERSION: &str = "15";

+const DEFAULT_PAGESERVER_CONTROL_PLANE_API: &str = "http://127.0.0.1:1234/";
+
 fn default_conf() -> String {
    format!(
        r#"
@@ -56,11 +59,13 @@ listen_pg_addr = '{DEFAULT_PAGESERVER_PG_ADDR}'
 listen_http_addr = '{DEFAULT_PAGESERVER_HTTP_ADDR}'
 pg_auth_type = '{trust_auth}'
 http_auth_type = '{trust_auth}'
+control_plane_api = '{DEFAULT_PAGESERVER_CONTROL_PLANE_API}'

 [[safekeepers]]
 id = {DEFAULT_SAFEKEEPER_ID}
 pg_port = {DEFAULT_SAFEKEEPER_PG_PORT}
 http_port = {DEFAULT_SAFEKEEPER_HTTP_PORT}
+
 "#,
        trust_auth = AuthType::Trust,
    )
@@ -107,6 +112,7 @@ fn main() -> Result<()> {
            "start" => handle_start_all(sub_args, &env),
            "stop" => handle_stop_all(sub_args, &env),
            "pageserver" => handle_pageserver(sub_args, &env),
+            "attachment_service" => handle_attachment_service(sub_args, &env),
            "safekeeper" => handle_safekeeper(sub_args, &env),
            "endpoint" => handle_endpoint(sub_args, &env),
            "pg" => bail!("'pg' subcommand has been renamed to 'endpoint'"),
@@ -342,13 +348,25 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
            }
        }
        Some(("create", create_match)) => {
-            let initial_tenant_id = parse_tenant_id(create_match)?;
            let tenant_conf: HashMap<_, _> = create_match
                .get_many::<String>("config")
                .map(|vals| vals.flat_map(|c| c.split_once(':')).collect())
                .unwrap_or_default();
-            let new_tenant_id = pageserver.tenant_create(initial_tenant_id, tenant_conf)?;
-            println!("tenant {new_tenant_id} successfully created on the pageserver");
+
+            // If tenant ID was not specified, generate one
+            let tenant_id = parse_tenant_id(create_match)?.unwrap_or(TenantId::generate());
+
+            let generation = if env.pageserver.control_plane_api.is_some() {
+                // We must register the tenant with the attachment service, so
+                // that when the pageserver restarts, it will be re-attached.
+                let attachment_service = AttachmentService::from_env(env);
+                attachment_service.attach_hook(tenant_id, env.pageserver.id)?
+            } else {
+                None
+            };
+
+            pageserver.tenant_create(tenant_id, generation, tenant_conf)?;
+            println!("tenant {tenant_id} successfully created on the pageserver");

            // Create an initial timeline for the new tenant
            let new_timeline_id = parse_timeline_id(create_match)?;
@@ -358,7 +376,7 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
                .context("Failed to parse postgres version from the argument string")?;

            let timeline_info = pageserver.timeline_create(
-                new_tenant_id,
+                tenant_id,
                new_timeline_id,
                None,
                None,
@@ -369,17 +387,17 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an

            env.register_branch_mapping(
                DEFAULT_BRANCH_NAME.to_string(),
-                new_tenant_id,
+                tenant_id,
                new_timeline_id,
            )?;

            println!(
-                "Created an initial timeline '{new_timeline_id}' at Lsn {last_record_lsn} for tenant: {new_tenant_id}",
+                "Created an initial timeline '{new_timeline_id}' at Lsn {last_record_lsn} for tenant: {tenant_id}",
            );

            if create_match.get_flag("set-default") {
-                println!("Setting tenant {new_tenant_id} as a default one");
-                env.default_tenant_id = Some(new_tenant_id);
+                println!("Setting tenant {tenant_id} as a default one");
+                env.default_tenant_id = Some(tenant_id);
            }
        }
        Some(("set-default", set_default_match)) => {
@@ -817,6 +835,33 @@ fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
    Ok(())
 }

+fn handle_attachment_service(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
+    let svc = AttachmentService::from_env(env);
+    match sub_match.subcommand() {
+        Some(("start", _start_match)) => {
+            if let Err(e) = svc.start() {
+                eprintln!("start failed: {e}");
+                exit(1);
+            }
+        }
+
+        Some(("stop", stop_match)) => {
+            let immediate = stop_match
+                .get_one::<String>("stop-mode")
+                .map(|s| s.as_str())
+                == Some("immediate");
+
+            if let Err(e) = svc.stop(immediate) {
+                eprintln!("stop failed: {}", e);
+                exit(1);
+            }
+        }
+        Some((sub_name, _)) => bail!("Unexpected attachment_service subcommand '{}'", sub_name),
+        None => bail!("no attachment_service subcommand provided"),
+    }
+    Ok(())
+}
+
 fn get_safekeeper(env: &local_env::LocalEnv, id: NodeId) -> Result<SafekeeperNode> {
    if let Some(node) = env.safekeepers.iter().find(|node| node.id == id) {
        Ok(SafekeeperNode::from_env(env, node))
@@ -897,6 +942,16 @@ fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow

    broker::start_broker_process(env)?;

+    // Only start the attachment service if the pageserver is configured to need it
+    if env.pageserver.control_plane_api.is_some() {
+        let attachment_service = AttachmentService::from_env(env);
+        if let Err(e) = attachment_service.start() {
+            eprintln!("attachment_service start failed: {:#}", e);
+            try_stop_all(env, true);
+            exit(1);
+        }
+    }
+
    let pageserver = PageServerNode::from_env(env);
    if let Err(e) = pageserver.start(&pageserver_config_overrides(sub_match)) {
        eprintln!("pageserver {} start failed: {:#}", env.pageserver.id, e);
@@ -955,6 +1010,13 @@ fn try_stop_all(env: &local_env::LocalEnv, immediate: bool) {
    if let Err(e) = broker::stop_broker_process(env) {
        eprintln!("neon broker stop failed: {e:#}");
    }
+
+    if env.pageserver.control_plane_api.is_some() {
+        let attachment_service = AttachmentService::from_env(env);
+        if let Err(e) = attachment_service.stop(immediate) {
+            eprintln!("attachment service stop failed: {e:#}");
+        }
+    }
 }

 fn cli() -> Command {
@@ -1138,6 +1200,14 @@ fn cli() -> Command {
                            .arg(stop_mode_arg.clone()))
                .subcommand(Command::new("restart").about("Restart local pageserver").arg(pageserver_config_args.clone()))
        )
+        .subcommand(
+            Command::new("attachment_service")
+                .arg_required_else_help(true)
+                .about("Manage attachment_service")
+                .subcommand(Command::new("start").about("Start local attachment service").arg(pageserver_config_args.clone()))
+                .subcommand(Command::new("stop").about("Stop local attachment service")
+                            .arg(stop_mode_arg.clone()))
+        )
        .subcommand(
            Command::new("safekeeper")
                .arg_required_else_help(true)
--- a/control_plane/src/lib.rs
+++ b/control_plane/src/lib.rs
@@ -7,6 +7,7 @@
 // local installations.
 //

+pub mod attachment_service;
 mod background_process;
 pub mod broker;
 pub mod endpoint;
--- a/control_plane/src/local_env.rs
+++ b/control_plane/src/local_env.rs
@@ -118,6 +118,9 @@ pub struct PageServerConf {
    // auth type used for the PG and HTTP ports
    pub pg_auth_type: AuthType,
    pub http_auth_type: AuthType,
+
+    // Control plane location
+    pub control_plane_api: Option<Url>,
 }

 impl Default for PageServerConf {
@@ -128,6 +131,7 @@ impl Default for PageServerConf {
            listen_http_addr: String::new(),
            pg_auth_type: AuthType::Trust,
            http_auth_type: AuthType::Trust,
+            control_plane_api: None,
        }
    }
 }
@@ -202,6 +206,10 @@ impl LocalEnv {
        self.neon_distrib_dir.join("pageserver")
    }

+    pub fn attachment_service_bin(&self) -> PathBuf {
+        self.neon_distrib_dir.join("attachment_service")
+    }
+
    pub fn safekeeper_bin(&self) -> PathBuf {
        self.neon_distrib_dir.join("safekeeper")
    }
--- a/control_plane/src/pageserver.rs
+++ b/control_plane/src/pageserver.rs
@@ -126,6 +126,13 @@ impl PageServerNode {
            broker_endpoint_param,
        ];

+        if let Some(control_plane_api) = &self.env.pageserver.control_plane_api {
+            overrides.push(format!(
+                "control_plane_api='{}'",
+                control_plane_api.as_str()
+            ));
+        }
+
        if self.env.pageserver.http_auth_type != AuthType::Trust
            || self.env.pageserver.pg_auth_type != AuthType::Trust
        {
@@ -316,7 +323,8 @@ impl PageServerNode {

    pub fn tenant_create(
        &self,
-        new_tenant_id: Option<TenantId>,
+        new_tenant_id: TenantId,
+        generation: Option<u32>,
        settings: HashMap<&str, &str>,
    ) -> anyhow::Result<TenantId> {
        let mut settings = settings.clone();
@@ -382,11 +390,9 @@ impl PageServerNode {
                .context("Failed to parse 'gc_feedback' as bool")?,
        };

-        // If tenant ID was not specified, generate one
-        let new_tenant_id = new_tenant_id.unwrap_or(TenantId::generate());
-
        let request = models::TenantCreateRequest {
            new_tenant_id,
+            generation,
            config,
        };
        if !settings.is_empty() {
--- a/docs/rfcs/025-generation-numbers.md
+++ b/docs/rfcs/025-generation-numbers.md
@@ -0,0 +1,957 @@
+# Pageserver: split-brain safety for remote storage through generation numbers
+
+## Summary
+
+A scheme of logical "generation numbers" for tenant attachment to pageservers is proposed, along with
+changes to the remote storage format to include these generation numbers in S3 keys.
+
+Using the control plane as the issuer of these generation numbers enables strong anti-split-brain
+properties in the pageserver cluster without implementing a consensus mechanism directly
+in the pageservers.
+
+## Motivation
+
+Currently, the pageserver's remote storage format does not provide a mechanism for addressing
+split brain conditions that may happen when replacing a node or when migrating
+a tenant from one pageserver to another.
+
+From a remote storage perspective, a split brain condition occurs whenever two nodes both think
+they have the same tenant attached, and both can write to S3. This can happen in the case of a
+network partition, pathologically long delays (e.g. suspended VM), or software bugs.
+
+In the current deployment model, control plane guarantees that a tenant is attached to one
+pageserver at a time, thereby ruling out split-brain conditions resulting from dual
+attachment (however, there is always the risk of a control plane bug). This control
+plane guarantee prevents robust response to failures, as if a pageserver is unresponsive
+we may not detach from it. The mechanism in this RFC fixes this, by making it safe to
+attach to a new, different pageserver even if an unresponsive pageserver may be running.
+
+Further, lack of safety during split-brain conditions blocks two important features where occasional
+split-brain conditions are part of the design assumptions:
+
+- seamless tenant migration ([RFC PR](https://github.com/neondatabase/neon/pull/5029))
+- automatic pageserver instance failure handling (aka "failover") (RFC TBD)
+
+### Prior art
+
+- 020-pageserver-s3-coordination.md
+- 023-the-state-of-pageserver-tenant-relocation.md
+- 026-pageserver-s3-mvcc.md
+
+This RFC has broad similarities to the proposal to implement a MVCC scheme in
+S3 object names, but this RFC avoids a general purpose transaction scheme in
+favour of more specialized "generations" that work like a transaction ID that
+always has the same lifetime as a pageserver process or tenant attachment, whichever
+is shorter.
+
+## Requirements
+
+- Accommodate storage backends with no atomic or fencing capability (i.e. work within
+  S3's limitation that there are no atomics and clients can't be fenced)
+- Don't depend on any STONITH or node fencing in the compute layer (i.e. we will not
+  assume that we can reliably kill an EC2 instance and have it die)
+- Scoped per-tenant, not per-pageserver; for _seamless tenant migration_, we need
+  per-tenant granularity, and for _failover_, we likely want to spread the workload
+  of the failed pageserver instance to a number of peers, rather than monolithically
+  moving the entire workload to another machine.
+  We do not rule out the latter case, but should not constrain ourselves to it.
+
+## Design Tenets
+
+These are not requirements, but are ideas that guide the following design:
+
+- Avoid implementing another consensus system: we already have a strongly consistent
+  database in the control plane that can do atomic operations where needed, and we also
+  have a Paxos implementation in the safekeeper.
+- Avoid locking in to specific models of how failover will work (e.g. do not assume that
+  all the tenants on a pageserver will fail over as a unit).
+- Be strictly correct when it comes to data integrity. Occasional failures of availability
+  are tolerable, occasional data loss is not.
+
+## Non Goals
+
+The changes in this RFC intentionally isolate the design decision of how to define
+logical generation numbers and object storage format in a way that is somewhat flexible with
+respect to how actual orchestration of failover works.
+
+This RFC intentionally does not cover:
+
+- Failure detection
+- Orchestration of failover
+- Standby modes to keep data ready for fast migration
+- Intentional multi-writer operation on tenants (multi-writer scenarios are assumed to be transient split-brain situations).
+- Sharding.
+
+The interaction between this RFC and those features is discussed in [Appendix B](#appendix-b-interoperability-with-other-features).
+
+## Impacted Components
+
+pageserver, control plane, safekeeper (optional)
+
+## Implementation Part 1: Correctness
+
+### Summary
+
+- A per-tenant **generation number** is introduced to uniquely identify tenant attachments to pageserver processes.
+
+  - This generation number increments each time the control plane modifies a tenant (`Project`)'s assigned pageserver, or when the assigned pageserver restarts.
+  - The control plane is the authority for generation numbers: only it may
+    increment a generation number.
+
+- **Object keys are suffixed** with the generation number
+- **Safety for multiply-attached tenants** is provided by the
+  generation number in the object key: the competing pageservers will not
+  try to write to the same keys.
+- **Safety in split brain for multiple nodes running with
+  the same node ID** is provided by the pageserver calling out to the control plane
+  on startup, to re-attach and thereby increment the generations of any attached tenants
+- **Safety for deletions** is achieved by deferring the DELETE from S3 to a point in time where the deleting node has validated with control plane that no attachment with a higher generation has a reference to the to-be-DELETEd key.
+- **The control plane is used to issue generation numbers** to avoid the need for
+  a built-in consensus system in the pageserver, although this could in principle
+  be changed without changing the storage format.
+
+### Generation numbers
+
+A generation number is associated with each tenant in the control plane,
+and each time the attachment status of the tenant changes, this is incremented.
+Changes in attachment status include:
+
+- Attaching the tenant to a different pageserver
+- A pageserver restarting, and "re-attaching" its tenants on startup
+
+These increments of attachment generation provide invariants we need to avoid
+split-brain issues in storage:
+
+- If two pageservers have the same tenant attached, the attachments are guaranteed to have different generation numbers, because the generation would increment
+  while attaching the second one.
+- If there are multiple pageservers running with the same node ID, all the attachments on all pageservers are guaranteed to have different generation numbers, because the generation would increment
+  when the second node started and re-attached its tenants.
+
+As long as the infrastructure does not transparently replace an underlying
+physical machine, we are totally safe. See the later [unsafe case](#unsafe-case-on-badly-behaved-infrastructure) section for details.
+
+### Object Key Changes
+
+#### Generation suffix
+
+All object keys (layer objects and index objects) will contain the attachment
+generation as a [suffix](#why-a-generation-suffix-rather-than-prefix).
+This suffix is the primary mechanism for protecting against split-brain situations, and
+enabling safe multi-attachment of tenants:
+
+- Two pageservers running with the same node ID (e.g. after a failure, where there is
+  some rogue pageserver still running) will not try to write to the same objects, because at startup they will have re-attached tenants and thereby incremented
+  generation numbers.
+- Multiple attachments (to different pageservers) of the same tenant will not try to write to the same objects, as each attachment would have a distinct generation.
+
+The generation is appended in hex format (8 byte string representing
+u32), to all our existing key names. A u32's range limit would permit
+27 restarts _per second_ over a 5 year system lifetime: orders of magnitude more than
+is realistic.
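+
+As a minimal sketch of the suffix format: the `-` separator is an assumption taken from the
+`index_part.json-0001` examples later in this RFC, and both helper names are hypothetical rather
+than existing pageserver functions.
+
+```rust
+/// Append a generation to an object key as 8 zero-padded hex characters.
+/// The `-` separator is an assumption made for illustration.
+fn with_generation_suffix(key: &str, generation: u32) -> String {
+    format!("{}-{:08x}", key, generation)
+}
+
+/// Whole restarts per second needed to exhaust the u32 range within `years` years
+/// (using 365-day years).
+fn restarts_per_second_to_exhaust(years: u64) -> u64 {
+    let seconds = years * 365 * 24 * 60 * 60;
+    (u32::MAX as u64 + 1) / seconds
+}
+```
+
+Because the suffix is fixed-width, sorting keys lexicographically also sorts them by generation.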
+
+The exact meaning of the generation suffix can evolve over time if necessary, for
+example if we choose to implement a failover mechanism internally to the pageservers
+rather than going via the control plane. The storage format just sees it as a number,
+with the only semantic property being that the highest numbered index is the latest.
+
+#### Index changes
+
+Since object keys now include a generation suffix, the index of these keys must also be updated. IndexPart currently stores keys and LSNs sufficient to reconstruct key names: this would be extended to store the generation as well.
+
+This will increase the size of the file, but only modestly: layers are already encoded as
+their string-ized form, so the overhead is about 10 bytes per layer. This will be less if/when
+the index storage format is migrated to a binary format from JSON.
+
+#### Visibility
+
+_This section doesn't describe code changes, but expands on the consequences of the
+object key changes given above_
+
+##### Visibility of objects to pageservers
+
+Pageservers can of course list objects in S3 at any time, but in practice their
+visible set is based on the contents of their LayerMap, which is initialized
+from the `index_part.json.???` that they load.
+
+Starting with the `index_part` from the most recent previous generation
+(see [loading index_part](#finding-the-remote-indices-for-timelines)), a pageserver
+initially has visibility of all the objects that were referenced in the loaded index.
+These objects are guaranteed to remain visible until the current generation is
+superseded, via pageservers in older generations avoiding deletions (see [deletion](#deletion)).
+
+The "most recent previous generation" is _not_ necessarily the most recent
+in terms of walltime, it is the one that is readable at the time a new generation
+starts. Consider the following sequence of a tenant being re-attached to different
+pageserver nodes:
+
+- Create + attach on PS1 in generation 1
+- PS1 does some work, writes out index_part.json-0001
+- Attach to PS2 in generation 2
+- PS2 reads index_part.json-0001
+- PS2 starts doing some work...
+- Attach to PS3 in generation 3
+- PS3 reads index_part.json-0001
+- **...PS2 finishes its work: now it writes index_part.json-0002**
+- PS3 writes out index_part.json-0003
+
+In the above sequence, the ancestry of indices is:
+
+```
+0001 -> 0002
+     |
+     -> 0003
+```
+
+This is not an issue for safety: if the 0002 references some object that is
+not in 0001, then 0003 simply does not see it, and will re-do whatever
+work was required (e.g. ingesting WAL or doing compaction). Objects referenced
+by only the 0002 index will never be read by future attachment generations, and
+will eventually be cleaned up by a scrub (see [scrubbing](#cleaning-up-orphan-objects-scrubbing)).
+
+##### Visibility of LSNs to clients
+
+Because index_part.json is now written with a generation suffix, which data
+is visible depends on which generation the reader is operating in:
+
+- If one was passively reading from S3 from outside of a pageserver, the
+  visibility of data would depend on which index_part.json-<generation> file
+  one had chosen to read from.
+- If two pageservers have the same tenant attached, they may have different
+  data visible as they're independently replaying the WAL, and maintaining
+  independent LayerMaps that are written to independent index_part.json files.
+  Data does not have to be remotely committed to be visible.
+- For a pageserver writing with a stale generation, historic LSNs
+  remain readable until another pageserver (with a higher generation suffix)
+  decides to execute GC deletions. At this point, we may think of the stale
+  attachment's generation as having logically ended: during its existence
+  the generation had a consistent view of the world.
+- For a newly attached pageserver, its highest visible LSN may appear to
+  go backwards with respect to an earlier attachment, if that earlier
+  attachment had not uploaded all data to S3 before the new attachment.
+
+### Deletion
+
+#### Generation number validation
+
+While writes are de-conflicted by writers always using their own generation number in the key,
+deletions are slightly more challenging: if a pageserver A is isolated, and the true active node is
+pageserver B, then it is dangerous for A to do any object deletions, even of objects that it wrote
+itself, because pageserver B's metadata might reference those objects.
+
+We solve this by inserting a "generation validation" step between the write of a remote index
+that un-links a particular object from the index, and the actual deletion of the object, such
+that deletions strictly obey the following ordering:
+
+1. Write out index_part.json: this guarantees that any subsequent reader of the metadata will
+   not try and read the object we unlinked.
+2. Call out to control plane to validate that the generation which we use for our attachment is still the latest.
+3. If step 2 passes, it is safe to delete the object. Why? The check-in with control plane
+   together with our visibility rules guarantees that any later generation
+   will use either the exact `index_part.json` that we uploaded in step 1, or a successor
+   of it; not an earlier one. In both cases, the `index_part.json` doesn't reference the
+   key we are deleting anymore, so, the key is invisible to any later attachment generation.
+   Hence it's safe to delete it.
+
+Note that at step 2 we are only confirming that deletions of objects _no longer referenced
+by the specific `index_part.json` written in step 1_ are safe. If we were attempting other deletions concurrently,
+these would need their own generation validation step.
+
+If step 2 fails, we may leak the object. This is safe, but has a cost: see [scrubbing](#cleaning-up-orphan-objects-scrubbing). We may avoid this entirely outside of node
+failures, if we do proper flushing of deletions on clean shutdown and clean migration.
+
+To avoid doing a huge number of control plane requests to perform generation validation,
+validation of many tenants will be done in a single request, and deletions will be queued up
+prior to validation: see [Persistent deletion queue](#persistent-deletion-queue) for more.
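+
+The validation step for a batch of queued deletions can be sketched as follows. This is an
+illustration under stated assumptions, not the pageserver's implementation: `PendingDeletion`
+and `partition_validated` are hypothetical names, tenant IDs are plain strings, and the `latest`
+map stands in for the control plane's answer about which generation is current for each tenant.
+
+```rust
+use std::collections::HashMap;
+
+/// One queued deletion: `key` may only be deleted once `generation`
+/// has been confirmed as still the latest for `tenant_id`.
+#[derive(Debug, Clone, PartialEq)]
+struct PendingDeletion {
+    tenant_id: String,
+    generation: u32,
+    key: String,
+}
+
+/// Split queued deletions into (safe to execute, must not be executed).
+/// `latest` maps tenant -> latest generation, as reported by the control plane.
+/// Tenants missing from `latest` are treated as not validated.
+fn partition_validated(
+    queue: Vec<PendingDeletion>,
+    latest: &HashMap<String, u32>,
+) -> (Vec<PendingDeletion>, Vec<PendingDeletion>) {
+    queue
+        .into_iter()
+        .partition(|d| latest.get(&d.tenant_id) == Some(&d.generation))
+}
+```
+
+Entries in the second list are dropped without deleting anything, so their objects leak until
+a scrub removes them. That is the safe direction to fail in.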
+
+#### `remote_consistent_lsn` updates
+
+Remote objects are not the only kind of deletion the pageserver does: it also indirectly deletes
+WAL data, by feeding back remote_consistent_lsn to safekeepers, as a signal to the safekeepers that
+they may drop data below this LSN.
+
+For the same reasons that deletion of objects must be guarded by an attachment generation number
+validation step, updates to `remote_consistent_lsn` are subject to the same rules, using
+an ordering as follows:
+
+1. upload the index_part that covers data up to LSN `L0` to S3
+2. Call out to control plane to validate that the generation which we use for our attachment is still the latest.
+3. advance the `remote_consistent_lsn` that we advertise to the safekeepers to `L0`
+
+If step 2 fails, then the `remote_consistent_lsn` advertised
+to safekeepers will not advance again until a pageserver
+with the latest generation is ready to do so.
+
+**Note:** at step 3 we are not advertising the _latest_ remote_consistent_lsn, we are
+advertising the value in the index_part that we uploaded in step 1. This provides
+a strong ordering guarantee.
+
+Internally to the pageserver, each timeline will have two remote_consistent_lsn values: the one that
+reflects its latest write to remote storage, and the one that reflects the most
+recent validation of generation number. It is only the latter value that may
+be advertised to the outside world (i.e. to the safekeeper).
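+
+A minimal sketch of this two-value bookkeeping, assuming LSNs are represented as plain `u64`
+values. The type and method names are hypothetical.
+
+```rust
+/// Per-timeline tracking of the two `remote_consistent_lsn` values.
+#[derive(Debug, Default)]
+struct RemoteConsistentLsn {
+    /// LSN covered by the newest index_part written to remote storage.
+    uploaded: u64,
+    /// LSN covered by the newest upload whose generation has been validated.
+    validated: u64,
+}
+
+impl RemoteConsistentLsn {
+    /// Step 1: an index_part covering data up to `lsn` was uploaded.
+    fn on_index_uploaded(&mut self, lsn: u64) {
+        self.uploaded = self.uploaded.max(lsn);
+    }
+
+    /// Step 2 succeeded for the upload covering `lsn`.
+    /// Never move past what has actually been uploaded.
+    fn on_generation_validated(&mut self, lsn: u64) {
+        self.validated = self.validated.max(lsn.min(self.uploaded));
+    }
+
+    /// Step 3: the only value that may be advertised to safekeepers.
+    fn advertised(&self) -> u64 {
+        self.validated
+    }
+}
+```
+
+An upload on its own never changes `advertised()`. Only a successful validation does.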
+
+The control plane remains unaware of `remote_consistent_lsn`: it only has to validate
+the freshness of generation numbers, thereby granting the pageserver permission to
+share the information with the safekeeper.
+
+For convenience, in subsequent sections and RFCs we will use "deletion" to mean both deletion
+of objects in S3, and updates to the `remote_consistent_lsn`, as updates to the remote consistent
+LSN are de-facto deletions done via the safekeeper, and both kinds of deletion are subject to
+the same generation validation requirement.
+
+### Pageserver attach/startup changes
+
+#### Attachment
+
+Calls to `/v1/tenant/{tenant_id}/attach` are augmented with an additional
+`generation` field in the body.
+
+The pageserver does not persist this: a generation is only good for the lifetime
+of a process.
+
+#### Finding the remote indices for timelines
+
+Because index files are now suffixed with generation numbers, the pageserver
+cannot always GET the remote index in one request, because it can't always
+know a-priori what the latest remote index is.
+
+Typically, the most recent generation to write an index would be our own
+generation minus 1. However, this might not be the case: the previous
+node might have started and acquired a generation number, and then crashed
+before writing out a remote index.
+
+In the general case and as a fallback, the pageserver may list all the `index_part.json`
+files for a timeline, sort them by generation, and pick the highest that is `<=`
+its current generation for this attachment. The tenant should never load an index
+with an attachment generation _newer_ than its own.
+These two rules combined ensure that objects written by later generations are never visible to earlier generations.
+
+Note that if a given attachment picks an index part from an earlier generation (say n-2), but crashes & restarts before it writes its own generation's index part, next time it tries to pick an index part there may be an index part from generation n-1.
+It would pick the n-1 index part in that case, because it's sorted higher than the previous one from generation n-2.
+So, the above rules do not guarantee determinism in selecting the index part.
+
+Tenants are allowed to be attached with stale attachment generations during a multiply-attached
+phase in a migration, and in this instance if the old location's pageserver restarts,
+it should not try to load the newer generation's index.
+
+To summarize, on starting a timeline, the pageserver will:
+
+1. Issue a GET for index_part.json-<my generation - 1>
+2. If 1 failed, issue a ListObjectsV2 request for index_part.json\* and
+   pick the one with the highest generation that is not newer than our own.
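+
+The fallback selection rule can be sketched as a small function. `select_index_generation` is a
+hypothetical name, and the input is assumed to be the list of generations already parsed from
+the listed `index_part.json` keys.
+
+```rust
+/// Choose which index generation to load: the highest listed generation
+/// that is `<=` our own. Returns `None` if no such index exists.
+fn select_index_generation(listed: &[u32], my_generation: u32) -> Option<u32> {
+    listed
+        .iter()
+        .copied()
+        .filter(|generation| *generation <= my_generation)
+        .max()
+}
+```
+
+Filtering before taking the maximum enforces the rule that an attachment never loads an index
+written by a generation newer than its own.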
+
+One could optimize this further by using the control plane to record specifically
+which generation most recently wrote an index_part.json, if necessary, to increase
+the probability of finding the index_part.json in one GET. One could also improve
+the chances by having pageservers proactively write out index_part.json after they
+get a new generation ID.
+
+#### Re-attachment on startup
+
+On startup, the pageserver will call out to a new control plane `/re-attach`
+API (see [Generation API](#generation-api)). This returns a list of
+tenants that should be attached to the pageserver, and their generation numbers, which
+the control plane will increment before returning.
+
+The pageserver should still scan its local disk on startup, but should _delete_
+any local content for tenants not indicated in the `/re-attach` response: their
+absence is an implicit detach operation.
+
+**Note** if a tenant is omitted from the re-attach response, its local disk content
+will be deleted. This will change in subsequent work, when the control plane gains
+the concept of a secondary/standby location: a node with local content may revert
+to this status and retain some local content.
+
+#### Cleaning up previous generations' remote indices
+
+Deletion of old indices is not necessary for correctness, although it is necessary
+to avoid the ListObjects fallback in the previous section becoming ever more expensive.
+
+Once the new attachment has written out its index_part.json, it may asynchronously clean up historic index_part.json
+objects that were found.
+
+We may choose to implement this deletion either as an explicit step after we
+write out index_part for the first time in a pageserver's lifetime, or for
+simplicity just do it periodically as part of the background scrub (see [scrubbing](#cleaning-up-orphan-objects-scrubbing)).
+
+### Control Plane Changes
+
+#### Store generations for attaching tenants
+
+- The `Project` table must store the generation number for use when
+  attaching the tenant to a new pageserver.
+- The `/v1/tenant/:tenant_id/attach` pageserver API will require the generation number,
+  which the control plane can supply by simply incrementing the `Project`'s
+  generation number each time the tenant is attached to a different server: the same database
+  transaction that changes the assigned pageserver should also change the generation number.
+
+#### Generation API
+
+This section describes an API that could be provided directly by the control plane,
+or built as a separate microservice. In earlier parts of the RFC, when we
+discuss the control plane providing generation numbers, we are referring to this API.
+
+The API endpoints used by the pageserver to acquire and validate generation
+numbers are quite simple, and only require access to some persistent and
+linearizable storage (such as a database).
+
+Building this into the control plane is proposed as a least-effort option to exploit existing infrastructure and implement generation number issuance in the same transaction that mandates it (i.e., the transaction that updates the `Project` assignment to another pageserver).
+However, this is not mandatory: this "Generation Number Issuer" could
+be built as a microservice. In practice, we will write such a miniature service
+anyway, to enable E2E pageserver/compute testing without control plane.
+
+The endpoints required by pageservers are:
+
+##### `/re-attach`
+
+- Request: `{node_id: <u32>}`
+- Response:
+  - 200 `{tenants: [{id: <TenantId>, gen: <u32>}]}`
+  - 404: unknown node_id
+  - (Future: 429: flapping detected, perhaps nodes are fighting for the same node ID,
+    or perhaps this node was in a retry loop)
+  - (On unknown tenants, omit tenant from `tenants` array)
+- Server behavior: query database for which tenants should be attached to this pageserver.
+  - for each tenant that should be attached, increment the attachment generation and
+    include the new generation in the response
+- Client behavior:
+  - for all tenants in the response, activate with the new generation number
+  - for any local disk content _not_ referenced in the response, act as if we
+    had been asked to detach it (i.e. delete local files)
+
+**Note** the `node_id` in this request will change in future if we move to ephemeral
+node IDs, to be replaced with some correlation ID that helps the control plane realize
+if a process is running with the same storage as a previous pageserver process (e.g.
+we might use EC2 instance ID, or we might just write some UUID to the disk the first
+time we use it).
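+
+The server behavior for `/re-attach` can be sketched as below. This is an illustration only:
+`TenantRecord` and `re_attach` are hypothetical names, tenant IDs are plain strings, and the
+in-memory map stands in for the control plane database. A real implementation would persist each
+increment before responding, and would return 404 for an unknown `node_id`, which this sketch
+does not model.
+
+```rust
+use std::collections::HashMap;
+
+/// Control-plane-side state for one tenant (hypothetical shape).
+struct TenantRecord {
+    /// Pageserver this tenant is assigned to.
+    node_id: u32,
+    /// Current attachment generation.
+    generation: u32,
+}
+
+/// Increment and return the generation of every tenant assigned to `node_id`.
+/// Tenants assigned elsewhere are omitted from the response and left unchanged.
+fn re_attach(tenants: &mut HashMap<String, TenantRecord>, node_id: u32) -> Vec<(String, u32)> {
+    let mut response = Vec::new();
+    for (tenant_id, record) in tenants.iter_mut() {
+        if record.node_id == node_id {
+            record.generation += 1;
+            response.push((tenant_id.clone(), record.generation));
+        }
+    }
+    // HashMap iteration order is unspecified, so sort for a stable response.
+    response.sort();
+    response
+}
+```
+
+Calling this twice for the same node yields two different generations, which is what separates
+two processes running with the same node ID.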
+
+##### `/validate`
+
+- Request: `{'tenants': [{tenant: <tenant id>, attach_gen: <gen>}, ...]}`
+- Response:
+  - 200 `{'tenants': [{tenant: <tenant id>, status: <bool>}...]}`
+  - (On unknown tenants, omit tenant from `tenants` array)
+- Purpose: enable the pageserver to discover for the given attachments whether they are still the latest.
+- Server behavior: this is a read-only operation: simply compare the generations in the request with
+  the generations known to the server, and set status to `true` if they match.
+- Client behavior: clients must not do deletions within a tenant's remote data until they have
+  received a response indicating the generation they hold for the attachment is current.
+
+#### Use of `/load` and `/ignore` APIs
+
+Because the pageserver will be changed to only attach tenants on startup
+based on the control plane's response to a `/re-attach` request, the load/ignore
+APIs no longer make sense in their current form.
+
+The `/load` API becomes functionally equivalent to attach, and will be removed:
+any location that used `/load` before should just attach instead.
+
+The `/ignore` API is equivalent to detaching, but without deleting local files.
+
+### Timeline/Branch creation & deletion
+
+All of the previous arguments for safety have described operations within
+a timeline, where we may describe a sequence that includes updates to
+index_part.json, and where reads and writes are coming from a postgres
+endpoint (writes via the safekeeper).
+
+Creating or destroying a timeline is a bit different, because writes
+are coming from the control plane.
+
+We must be safe against scenarios such as:
+
+- A tenant is attached to pageserver B while pageserver A is
+  in the middle of servicing an RPC from the control plane to
+  create or delete a tenant.
+- A pageserver A has been sent a timeline creation request
+  but becomes unresponsive. The tenant is attached to a
+  different pageserver B, and the timeline creation request
+  is sent there too.
+
+#### Timeline Creation
+
+If some very slow node tries to do a timeline creation _after_
+a more recent generation node has already created the timeline
+and written some data into it, that must not cause harm. This
+is provided in timeline creations by the way all the objects
+within the timeline's remote path include a generation suffix:
+a slow node in an old generation that attempts to "create" a timeline
+that already exists will just emit an index_part.json with
+an old generation suffix.
+
+Timeline IDs are never reused, so we don't have
+to worry about the case of create/delete/create cycles. If they
+were re-used during a disaster recovery "un-delete" of a timeline,
+that special case can be handled by calling out to all available pageservers
+to check that they return 404 for the timeline, and to flush their
+deletion queues in case they had any deletions pending from the
+timeline.
+
+The above makes it safe for control plane to change the assignment of
+tenant to pageserver in control plane while a timeline creation is ongoing.
+The reason is that the creation request against the new assigned pageserver
+uses a new generation number. However, care must be taken by control plane
+to ensure that a "timeline creation successful" response from some pageserver
+is checked for the pageserver's generation for that timeline's tenant still being the latest.
+If it is not the latest, the response does not constitute a successful timeline creation.
+It is acceptable to discard such responses: the scrubber will clean up the S3 state.
+It is better, however, to issue a timeline deletion request to the stale attachment.
+
+#### Timeline Deletion
+
+Tenant/timeline deletion operations are exempt from generation validation
+on deletes, and therefore don't have to go through the same deletion
+queue as GC/compaction layer deletions. This is because once a
+delete is issued by the control plane, it is a promise that the
+control plane will keep trying until the deletion is done, so even stale
+pageservers are permitted to go ahead and delete the objects.
+
+The implications of this for control plane are:
+
+- During timeline/tenant deletion, the control plane must wait for the deletion to
+  be truly complete (status 404) and also handle the case where the pageserver
+  becomes unavailable, either by waiting for a replacement with the same node_id,
+  or by re-attaching the tenant elsewhere.
+
+- The control plane must persist its intent to delete
+  a timeline/tenant before issuing any RPCs, and then once it starts, it must
+  keep retrying until the tenant/timeline is gone. This is already handled
+  by using a persistent `Operation` record that is retried indefinitely.
+
+Timeline deletion may result in a special kind of object leak, where
+the latest generation attachment completes a deletion (including erasing
+all objects in the timeline path), but some slow/partitioned node is
+writing into the timeline path with a stale generation number. This would
+not be caught by any per-timeline scrubbing (see [scrubbing](#cleaning-up-orphan-objects-scrubbing)), since scrubbing happens on the
+attached pageserver, and once the timeline is deleted it isn't attached anywhere.
+This scenario should be pretty rare, and the control plane can make it even
+rarer by ensuring that if a tenant is in a multi-attached state (e.g. during
+migration), we wait for that to complete before processing the deletion. Beyond
+that, we may implement some other top-level scrub of timelines in
+an external tool, to identify any tenant/timeline paths that are not found
+in the control plane database.
+
+#### Examples
+
+- Deletion, node restarts partway through:
+  - By the time we returned 202, we have written a remote delete marker
+  - Any subsequent incarnation of the same node_id will see the remote
+    delete marker and continue to process the deletion
+  - If the original pageserver is lost permanently and no replacement
+    with the same node_id is available, then the control plane must recover
+    by re-attaching the tenant to a different node.
+- Creation, node becomes unresponsive partway through.
+  - Control plane will see HTTP request timeout, keep re-issuing
+    request to whoever is the latest attachment point for the tenant
+    until it succeeds.
+  - Stale nodes may be trying to execute timeline creation: they will
+    write out index_part.json files with
+    stale attachment generation: these will be eventually cleaned up
+    by the same mechanism as other old indices.
+
+### Unsafe case on badly behaved infrastructure
+
+This section is only relevant if running on a different environment
+than EC2 machines with ephemeral disks.
+
+If we ever run pageservers on infrastructure that might transparently restart
+a pageserver while leaving an old process running (e.g. a VM gets rescheduled
+without the old one being fenced), then there is a risk of corruption, when
+the control plane attaches the tenant, as follows:
+
+- If the control plane sends an `/attach` request to node A, then node A dies
+  and is replaced, and the control plane retries the request without
+  incrementing that attachment ID, then it could end up with two physical nodes
+  both using the same generation number.
+- This is not an issue when using EC2 instances with ephemeral storage, as long
+  as the control plane never re-uses a node ID, but it would need re-examining
+  if running on different infrastructure.
+- To robustly protect against this class of issue, we would either:
+  - add a "node generation" to distinguish between different processes holding the
+    same node_id.
+  - or, dispense with static node_id entirely and issue an ephemeral ID to each
+    pageserver process when it starts.
+
+## Implementation Part 2: Optimizations
+
+### Persistent deletion queue
+
+Between writing out a new index_part.json that doesn't reference an object,
+and executing the deletion, an object passes through a window where it is
+only referenced in memory, and could be leaked if the pageserver is stopped
+uncleanly. That introduces conflicting incentives: on the one hand, we would
+like to delay and batch deletions to
+
+1. minimize the cost of the mandatory validation calls to control plane, and
+2. minimize cost for DeleteObjects requests.
+
+On the other hand we would also like to minimize leakage by executing
+deletions promptly.
+
+To resolve this, we may make the deletion queue persistent
+and then execute the queued deletions in the background at a later time.
+
+_Note: The deletion queue's reason for existence is optimization rather than correctness,
+so there is a lot of flexibility in exactly how it should work,
+as long as it obeys the rule to validate generations before executing deletions,
+so the following details are not essential to the overall RFC._
+
+#### Scope
+
+The deletion queue will be global per pageserver, not per-tenant. There
+are several reasons for this choice:
+
+- Use the queue as a central point to coalesce validation requests to the
+  control plane: this avoids individual `Timeline` objects ever touching
+  the control plane API, and avoids them having to know the rules about
+  validating deletions. This separation of concerns will avoid burdening
+  the already many-LoC `Timeline` type with even more responsibility.
+- Decouple the deletion queue from Tenant attachment lifetime: we may
+  "hibernate" an inactive tenant by tearing down its `Tenant`/`Timeline`
+  objects in the pageserver, without having to wait for deletions to be done.
+- Amortize the cost of I/O for the persistent queue, instead of having many
+  tiny queues.
+- Coalesce deletions into a smaller number of larger DeleteObjects calls
+
+Because of the cost of doing I/O for persistence, and the desire to coalesce
+generation validation requests across tenants, and coalesce deletions into
+larger DeleteObjects requests, there will be one deletion queue per pageserver
+rather than one per tenant. This has the added benefit that when deactivating
+a tenant, we do not have to drain their deletion queue: deletions can proceed
+for a tenant whose main `Tenant` object has been torn down.
+
+#### Flow of deletion
+
+The flow of a deletion becomes:
+
+1. Need for deletion of an object (=> layer file) is identified.
+2. Unlink the object from all the places that reference it (=> `index_part.json`).
+3. Enqueue the deletion to a persistent queue.
+   Each entry is `tenant_id, attachment_generation, S3 key`.
+4. Validate & execute in batches:
+  4.1 For a batch of entries, call into control plane.
+  4.2 For the subset of entries that passed validation, execute a `DeleteObjects` S3 DELETE request for their S3 keys.
+
+As outlined in Part 1 on correctness, it is critical that deletions are only
+executed once the key is not referenced anywhere in S3.
+This property is obviously upheld by the scheme above.
+
+#### We Accept Object Leakage In Acceptable Circumstances
+
+If we crash in the flow above between (2) and (3), we lose track of the unreferenced object.
+Further, enqueuing a single entry to the persistent queue may not be durable immediately, in order to amortize the cost of flushing to disk.
+This is acceptable for now: it can be caught by [the scrubber](#cleaning-up-orphan-objects-scrubbing).
+
+There are various measures we can take to improve this in the future.
+1. Cap the amount of time until an enqueued entry becomes durable (timeout for flush-to-disk)
+2. Proactively flush:
+    - On graceful shutdown, as we anticipate that some or
+      all of our attachments may be re-assigned while we are offline.
+    - On tenant detach.
+3. For each entry, keep track of whether it has passed (2).
+   Only admit entries to (4) once they have passed (2).
+   This requires re-writing / two queue entries (intent, commit) per deletion.
+
+The important take-away with any of the above is that it's not
+disastrous to leak objects in exceptional circumstances.
+
+#### Operations that may skip the queue
+
+Deletions of an entire timeline are [exempt](#Timeline-Deletion) from generation number validation. Once the
+control plane sends the deletion request, there is no requirement to retain the readability
+of any data within the timeline, and all objects within the timeline path may be deleted
+at any time from the control plane's deletion request onwards.
+
+Since deletions of smaller timelines won't have enough objects to compose a full sized
+DeleteObjects request, it is still useful to send these through the last part of the
+deletion pipeline to coalesce with other executing deletions: to enable this, the
+deletion queue should expose two input channels: one for deletions that must be
+processed in a generation-aware way, and a fast path for timeline deletions, where
+that fast path may skip validation and the persistent queue.
+
+### Cleaning up orphan objects (scrubbing)
+
+An orphan object is any object which is no longer referenced by a running node or by metadata.
+
+Examples of how orphan objects arise:
+
+- A node PUTs a layer object, then crashes before it writes the
+  index_part.json that references that layer.
+- A stale node carries on running for some time, and writes out an unbounded number of
+  objects while it believes itself to be the rightful writer for a tenant.
+- A pageserver crashes between un-linking an object from the index, and persisting
+  the object to its deletion queue.
+
+Orphan objects are functionally harmless, but have a small cost due to S3 capacity consumed. We
+may clean them up at some time in the future, by doing a ListObjectsV2 operation and
+cross-referencing with the latest metadata to identify objects which are not referenced.
+
+Scrubbing will be done only by an attached pageserver (not some third party process), and deletions requested during scrub will go through the same
+validation as all other deletions: the attachment generation must be
+fresh. This avoids the possibility of a stale pageserver incorrectly
+thinking that an object written by a newer generation is stale, and deleting
+it.
+
+It is not strictly necessary that scrubbing be done by an attached
+pageserver: it could also be done externally. However, an external
+scrubber would still require the same validation procedure that
+a pageserver's deletion queue performs, before actually erasing
+objects.
+
+## Operational impact
+
+### Availability
+
+Coordination of generation numbers via the control plane introduces a dependency for certain
+operations:
+
+1. Starting new pageservers (or activating pageservers after a restart)
+2. Executing enqueued deletions
+3. Advertising updated `remote_consistent_lsn` to enable WAL trimming
+
+Item 1. would mean that some in-place restarts that previously would have resumed service even if the control plane were
+unavailable, will now not resume service to users until the control plane is available. We could
+avoid this by having a timeout on communication with the control plane, and after some timeout,
+resume service with the previous generation numbers (assuming this was persisted to disk). However,
+this is unlikely to be needed as the control plane is already an essential & highly available component. Also, having a node re-use an old generation number would complicate
+reasoning about the system, as it would break the invariant that a generation number uniquely identifies
+a tenant's attachment to a given pageserver _process_: it would merely identify the tenant's attachment
+to the pageserver _machine_ or its _on-disk-state_.
+
+Item 2. is a non-issue operationally: it's harmless to delay deletions, the only impact of objects pending deletion is
+the S3 capacity cost.
+
+Item 3. could be an issue if safekeepers are low on disk space and the control plane is unavailable for a long time. If this became an issue,
+we could adjust the safekeeper to delete segments from local disk sooner, as soon as they're uploaded to S3, rather than waiting for
+remote_consistent_lsn to advance.
+
+For a managed service, the general approach should be to make sure we are monitoring & responding fast enough
+that control plane outages are bounded in time.
+
+There is also the fact that control plane runs in a single region.
+The latency for distant regions is not a big concern for us because all request types added by this RFC are either infrequent or not in the way of the data path.
+However, we lose region isolation for the operations listed above.
+The ongoing work to split console and control will give us per-region control plane, and all operations in this RFC can be handled by these per-region control planes.
+With that in mind, we accept the trade-offs outlined in this paragraph.
+
+We will also implement an "escape hatch" config for generation numbers, where in a major disaster outage,
+we may manually run pageservers with a hand-selected generation number, so that we can bring them online
+independently of a control plane.
+
+### Rollout
+
+Although there is coupling between components, we may deploy most of the new data plane components
+independently of the control plane: initially they can just use a static generation number.
+
+#### Phase 1
+
+The pageserver is deployed with some special config to:
+
+- Always act like everything is generation 1 and do not wait for a control plane issued generation on attach
+- Skip the places in deletion and remote_consistent_lsn updates where we would call into control plane
+
+#### Phase 2
+
+The control plane changes are deployed: control plane will now track and increment generation numbers.
+
+#### Phase 3
+
+The pageserver is deployed with its control-plane-dependent changes enabled: it will now require
+the control plane to service re-attach requests on startup, and handle generation
+validation requests.
+
+### On-disk backward compatibility
+
+Backward compatibility with existing data is straightforward:
+
+- When reading the index, we may assume that any layer whose metadata doesn't include
+  generations will have a path without generation suffix.
+- When locating the index file on attachment, we may use the "fallback" listing path
+  and if there is only an index without generation suffix, that is the one we load.
+
+It is not necessary to re-write existing layers: even new index files will be able
+to represent generation-less layers.
+
+### On-disk forward compatibility
+
+We will do a two phase rollout, probably over multiple releases because we will naturally
+have some of the read-side code ready before the overall functionality is ready:
+
+1. Deploy pageservers which understand the new index format and generation suffixes
+   in keys, but do not write objects with generation numbers in the keys.
+2. Deploy pageservers that write objects with generation numbers in the keys.
+
+Old pageservers will be oblivious to generation numbers. That means that they can't
+read objects with generation numbers in the name. This is why the
+first step must deploy the ability to read, before the second step
+starts writing them.
+
+# Frequently Asked Questions
+
+## Why a generation _suffix_ rather than _prefix_?
+
+The choice is motivated by object listing, since one can list by prefix but not
+suffix.
+
+In [finding remote indices](#finding-the-remote-indices-for-timelines), we rely
+on being able to do a prefix listing for `<tenant>/<timeline>/index_part.json*`.
+That relies on the prefix listing.
+
+The converse case of using a generation prefix and listing by generation is
+not needed: one could imagine listing by generation while scrubbing (so that
+a particular generation's layers could be scrubbed), but this is not part
+of normal operations, and the [scrubber](#cleaning-up-orphan-objects-scrubbing) probably won't work that way anyway.
+
+## Wouldn't it be simpler to have a separate deletion queue per timeline?
+
+Functionally speaking, we could. That's how RemoteTimelineClient currently works,
+but this approach does not map well to a long-lived persistent queue with
+generation validation.
+
+Anything we do per-timeline generates tiny random I/O, on a pageserver with
+tens of thousands of timelines operating: to be ready for high scale, we should:
+
+- A) Amortize costs where we can (e.g. a shared deletion queue)
+- B) Expect to put tenants into a quiescent state while they're not
+  busy: i.e. we shouldn't keep a tenant alive to service its deletion queue.
+
+This was discussed in the [scope](#scope) part of the deletion queue section.
+
+# Appendix A: Examples of use in high availability/failover
+
+The generation numbers proposed in this RFC are adaptable to a variety of different
+failover scenarios and models. The sections below sketch how they would work in practice.
+
+### In-place restart of a pageserver
+
+"In-place" here means that the restart is done before any other element in the system
+has taken action in response to the node being down.
+
+- After restart, the node issues a re-attach request to the control plane, and
+  receives new generation numbers for all its attached tenants.
+- Tenants may be activated with the generation number in the re-attach response.
+- If any of its attachments were in fact stale (i.e. had been reassigned to another
+  node while this node was offline), then
+  - the re-attach response will inform the pageserver
+    of this by _not_ incrementing the generation for that attachment.
+  - This will implicitly block deletions in the tenant, but as an optimization
+    the pageserver should also proactively stop doing S3 uploads when it notices this stale-generation state.
+  - The control plane is expected to eventually detach this tenant from the
+    pageserver.
+
+If the control plane does not include a tenant in the re-attach response,
+but there is still local state for the tenant in the filesystem, the pageserver
+deletes the local state in response and does not load/activate the tenant.
+See the [earlier section on pageserver startup](#pageserver-attachstartup-changes) for details.
+Control plane can use this mechanism to clean up a pageserver that has been
+down for so long that all its tenants were migrated away before it came back
+up again and asked for re-attach.
+
+### Failure of a pageserver
+
+In this context, read "failure" as the most ambiguous possible case, where
+a pageserver is unavailable to clients and control plane, but may still be executing and talking
+to S3.
+
+#### Case A: re-attachment to other nodes
+
+1. Let's say node 0 becomes unresponsive in a cluster of three nodes 0, 1, 2.
+2. Some external mechanism notices that the node is unavailable and initiates
+   movement of all tenants attached to that node to a different node according
+   to some distribution rule.
+   In this example, it would mean incrementing the generation
+   of all tenants that were attached to node 0, as each tenant's assigned pageserver changes.
+3. A tenant which is now attached to node 1 will _also_ still be attached to node
+   0, from the perspective of node 0. Node 0 will still be using its old generation,
+   node 1 will be using a newer generation.
+4. S3 writes will continue from nodes 0 and 1: there will be an index_part.json-00000001
+   _and_ an index_part.json-00000002. Objects written under the old suffix
+   after the new attachment was created do not matter from the rest of the system's
+   perspective: the endpoints are reading from the new attachment location. Objects
+   written by node 0 are just garbage that can be cleaned up at leisure. Node 0 will
+   not do any deletions because it can't synchronize with control plane, or if it could,
+   its deletion queue processing would get errors for the validation requests.
+
+#### Case B: direct node replacement with same node_id and drive
+
+This is the scenario we would experience if running pageservers in some dynamic
+VM/container environment that would auto-replace a given node_id when it became
+unresponsive, with the node's storage supplied by some network block device
+that is attached to the replacement VM/container.
+
+1. Let's say node 0 fails, and there may be some other peers but they aren't relevant.
+2. Some external mechanism notices that the node is unavailable, and creates
+   a "new node 0" (Node 0b) which is a physically separate server. The original node 0
+   (Node 0a) may still be running, because we do not assume the environment fences nodes.
+3. On startup, node 0b re-attaches and gets higher generation numbers for
+   all tenants.
+4. S3 writes continue from nodes 0a and 0b, but the writes do not collide due to different
+   generation in the suffix, and the writes from node 0a are not visible to the rest
+   of the system because endpoints are reading only from node 0b.
+
+# Appendix B: interoperability with other features
+
+## Sharded Keyspace
+
+The design in this RFC maps neatly to a sharded keyspace design where subsets of the key space
+for a tenant are assigned to different pageservers:
+
+- the "unit of work" for attachments becomes something like a TenantShard rather than a Tenant
+- TenantShards get generation numbers just as Tenants do.
+- Write workload (ingest, compaction) for a tenant is spread out across pageservers via
+  TenantShards, but each TenantShard still has exactly one valid writer at a time.
+
+## Read replicas
+
+_This section is about a passive reader of S3 pageserver state, not a postgres
+read replica_
+
+For historical reads to LSNs below the remote persistent LSN, any node may act as a reader at any
+time: remote data is logically immutable data, and the use of deferred deletion in this RFC helps
+mitigate the fact that remote data is not _physically_ immutable (i.e. the actual data for a given
+page moves around as compaction happens).
+
+A read replica needs to be aware of generations in remote data in order to read the latest
+metadata (find the index_part.json with the latest suffix). It may either query this
+from the control plane, or find it with a ListObjectsV2 request.
+
+## Seamless migration
+
+To make tenant migration totally seamless, we will probably want to intentionally double-attach
+a tenant briefly, serving reads from the old node while waiting for the new node to be ready.
+
+This RFC enables that double-attachment: two nodes may be attached at the same time, with the migration destination
+having a higher generation number. The old node will be able to ingest and serve reads, but not
+do any deletes. The new node's attachment must also avoid deleting layers that the old node may
+still use. A new piece of state
+will be needed for this in the control plane's definition of an attachment.
+
+## Warm secondary locations
+
+To enable faster tenant movement after a pageserver is lost, we will probably want to spend some
+disk capacity on keeping standby locations populated with local disk data.
+
+There's no conflict between this RFC and that: implementing warm secondary locations on a per-tenant basis
+would be a separate change to the control plane to store standby location(s) for a tenant. Because
+the standbys do not write to S3, they do not need to be assigned generation numbers. When a tenant is
+re-attached to a standby location, that would increment the tenant attachment generation and this
+would work the same as any other attachment change, but with a warm cache.
+
+## Ephemeral node IDs
+
+This RFC intentionally avoids changing anything fundamental about how pageservers are identified
+and registered with the control plane, to avoid coupling the implementation of pageserver split
+brain protection with more fundamental changes in the management of the pageservers.
+
+Moving to ephemeral node IDs would provide an extra layer of
+resilience in the system, as it would prevent the control plane
+accidentally attaching to two physical nodes with the same
+generation, if somehow there were two physical nodes with
+the same node IDs (currently we rely on EC2 guarantees to
+eliminate this scenario). With ephemeral node IDs, there would be
+no possibility of that happening, no matter the behavior of
+underlying infrastructure.
+
+Nothing fundamental in the pageserver's handling of generations needs to change to handle ephemeral node IDs, since we hardly use the
+`node_id` anywhere. The `/re-attach` API would be extended
+to enable the pageserver to obtain its ephemeral ID, and provide
+some correlation identifier (e.g. EC2 instance ID), to help the
+control plane re-attach tenants to the same physical server that
+previously had them attached.
--- a/libs/pageserver_api/Cargo.toml
+++ b/libs/pageserver_api/Cargo.toml
@@ -12,6 +12,7 @@ const_format.workspace = true
 anyhow.workspace = true
 bytes.workspace = true
 byteorder.workspace = true
+hex.workspace = true
 utils.workspace = true
 postgres_ffi.workspace = true
 enum-map.workspace = true
--- a/libs/pageserver_api/src/control_api.rs
+++ b/libs/pageserver_api/src/control_api.rs
@@ -0,0 +1,89 @@
+//! Types in this file are for pageserver's upward-facing API calls to the control plane
+use hex::FromHex;
+use serde::{Deserialize, Serialize};
+use utils::id::{NodeId, TenantId};
+
+/// TenantId's serialization is an array of u8, which is rather unfriendly
+/// for outside callers who aren't working with the native Rust TenantId.
+/// This type wraps it in a serialization that is just the hex string
+/// representation.
+#[derive(Eq, PartialEq, Clone, Hash)]
+pub struct HexTenantId(TenantId);
+
+impl HexTenantId {
+    pub fn new(t: TenantId) -> Self {
+        Self(t)
+    }
+
+    pub fn take(self) -> TenantId {
+        self.0
+    }
+}
+
+impl AsRef<TenantId> for HexTenantId {
+    fn as_ref(&self) -> &TenantId {
+        &self.0
+    }
+}
+
+impl Serialize for HexTenantId {
+    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
+    where
+        S: serde::Serializer,
+    {
+        let hex = self.0.hex_encode();
+        serializer.collect_str(&hex)
+    }
+}
+
+impl<'de> Deserialize<'de> for HexTenantId {
+    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
+    where
+        D: serde::Deserializer<'de>,
+    {
+        let string = String::deserialize(deserializer)?;
+        TenantId::from_hex(string)
+            .map(HexTenantId::new)
+            .map_err(|e| serde::de::Error::custom(format!("{e}")))
+    }
+}
+
+// Top level structs
+
+#[derive(Serialize, Deserialize)]
+pub struct ReAttachRequest {
+    pub node_id: NodeId,
+}
+
+#[derive(Serialize, Deserialize)]
+pub struct ReAttachResponseTenant {
+    pub id: HexTenantId,
+    pub generation: u32,
+}
+
+#[derive(Serialize, Deserialize)]
+pub struct ReAttachResponse {
+    pub tenants: Vec<ReAttachResponseTenant>,
+}
+
+#[derive(Serialize, Deserialize)]
+pub struct ValidateRequestTenant {
+    pub id: HexTenantId,
+    pub gen: u32,
+}
+
+#[derive(Serialize, Deserialize)]
+pub struct ValidateRequest {
+    pub tenants: Vec<ValidateRequestTenant>,
+}
+
+#[derive(Serialize, Deserialize)]
+pub struct ValidateResponse {
+    pub tenants: Vec<ValidateResponseTenant>,
+}
+
+#[derive(Serialize, Deserialize)]
+pub struct ValidateResponseTenant {
+    pub id: HexTenantId,
+    pub valid: bool,
+}
--- a/libs/pageserver_api/src/lib.rs
+++ b/libs/pageserver_api/src/lib.rs
@@ -1,6 +1,7 @@
 use const_format::formatcp;

 /// Public API types
+pub mod control_api;
 pub mod models;
 pub mod reltag;

--- a/libs/pageserver_api/src/models.rs
+++ b/libs/pageserver_api/src/models.rs
@@ -194,6 +194,9 @@ pub struct TimelineCreateRequest {
 pub struct TenantCreateRequest {
    #[serde_as(as = "DisplayFromStr")]
    pub new_tenant_id: TenantId,
+    #[serde(default)]
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub generation: Option<u32>,
    #[serde(flatten)]
    pub config: TenantConfig, // as we have a flattened field, we should reject all unknown fields in it
 }
@@ -241,15 +244,6 @@ pub struct StatusResponse {
    pub id: NodeId,
 }

-impl TenantCreateRequest {
-    pub fn new(new_tenant_id: TenantId) -> TenantCreateRequest {
-        TenantCreateRequest {
-            new_tenant_id,
-            config: TenantConfig::default(),
-        }
-    }
-}
-
 #[serde_as]
 #[derive(Serialize, Deserialize, Debug)]
 #[serde(deny_unknown_fields)]
@@ -293,9 +287,11 @@ impl TenantConfigRequest {
    }
 }

-#[derive(Debug, Serialize, Deserialize)]
+#[derive(Debug, Deserialize)]
 pub struct TenantAttachRequest {
    pub config: TenantAttachConfig,
+    #[serde(default)]
+    pub generation: Option<u32>,
 }

 /// Newtype to enforce deny_unknown_fields on TenantConfig for
--- a/libs/remote_storage/src/lib.rs
+++ b/libs/remote_storage/src/lib.rs
@@ -13,13 +13,14 @@ use std::{
    collections::HashMap,
    fmt::Debug,
    num::{NonZeroU32, NonZeroUsize},
-    path::{Path, PathBuf},
+    path::{Path, PathBuf, StripPrefixError},
    pin::Pin,
    sync::Arc,
 };

 use anyhow::{bail, Context};

+use serde::{Deserialize, Serialize};
 use tokio::io;
 use toml_edit::Item;
 use tracing::info;
@@ -44,12 +45,34 @@ pub const DEFAULT_MAX_KEYS_PER_LIST_RESPONSE: Option<i32> = None;

 const REMOTE_STORAGE_PREFIX_SEPARATOR: char = '/';

+// From the S3 spec
+pub const MAX_KEYS_PER_DELETE: usize = 1000;
+
 /// Path on the remote storage, relative to some inner prefix.
 /// The prefix is an implementation detail, that allows representing local paths
 /// as the remote ones, stripping the local storage prefix away.
 #[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
 pub struct RemotePath(PathBuf);

+impl Serialize for RemotePath {
+    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
+    where
+        S: serde::Serializer,
+    {
+        serializer.collect_str(self)
+    }
+}
+
+impl<'de> Deserialize<'de> for RemotePath {
+    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
+    where
+        D: serde::Deserializer<'de>,
+    {
+        let str = String::deserialize(deserializer)?;
+        Ok(Self(PathBuf::from(&str)))
+    }
+}
+
 impl std::fmt::Display for RemotePath {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}", self.0.display())
@@ -88,6 +111,15 @@ impl RemotePath {
    pub fn extension(&self) -> Option<&str> {
        self.0.extension()?.to_str()
    }
+
+    /// Unwrap the PathBuf that RemotePath wraps
+    pub fn take(self) -> PathBuf {
+        self.0
+    }
+
+    pub fn strip_prefix(&self, p: &RemotePath) -> Result<&Path, StripPrefixError> {
+        self.0.strip_prefix(&p.0)
+    }
 }

 /// Storage (potentially remote) API to manage its state.
@@ -166,6 +198,8 @@ pub enum DownloadError {
    BadInput(anyhow::Error),
    /// The file was not found in the remote storage.
    NotFound,
+    /// The client was shut down
+    Shutdown,
    /// The file was found in the remote storage, but the download failed.
    Other(anyhow::Error),
 }
@@ -177,6 +211,7 @@ impl std::fmt::Display for DownloadError {
                write!(f, "Failed to download a remote file due to user input: {e}")
            }
            DownloadError::NotFound => write!(f, "No file found for the remote object id given"),
+            DownloadError::Shutdown => write!(f, "Client shutting down"),
            DownloadError::Other(e) => write!(f, "Failed to download a remote file: {e:?}"),
        }
    }
@@ -241,6 +276,18 @@ impl GenericRemoteStorage {
        }
    }

+    /// For small, simple downloads where caller doesn't want to handle the streaming: return the full body
+    pub async fn download_all(&self, from: &RemotePath) -> Result<Vec<u8>, DownloadError> {
+        let mut download = self.download(from).await?;
+
+        let mut bytes = Vec::new();
+        tokio::io::copy(&mut download.download_stream, &mut bytes)
+            .await
+            .with_context(|| format!("Failed to download body from {from}"))
+            .map_err(DownloadError::Other)?;
+        Ok(bytes)
+    }
+
    pub async fn download_byte_range(
        &self,
        from: &RemotePath,
--- a/libs/remote_storage/src/local_fs.rs
+++ b/libs/remote_storage/src/local_fs.rs
@@ -148,21 +148,53 @@ impl RemoteStorage for LocalFs {
            Some(folder) => folder.with_base(&self.storage_root),
            None => self.storage_root.clone(),
        };
-        let mut files = vec![];
-        let mut directory_queue = vec![full_path.clone()];

+        // If we were given a directory, we may use it as our starting point.
+        // Otherwise, we must go up to the parent directory.  This is because
+        // S3 object list prefixes can be arbitrary strings, but when reading
+        // the local filesystem we need a directory to start calling read_dir on.
+        let mut initial_dir = full_path.clone();
+        match fs::metadata(full_path.clone()).await {
+            Err(e) => {
+                // It's not a file that exists: strip the prefix back to the parent directory
+                if matches!(e.kind(), ErrorKind::NotFound) {
+                    initial_dir.pop();
+                }
+            }
+            Ok(meta) => {
+                if !meta.is_dir() {
+                    // It's not a directory: strip back to the parent
+                    initial_dir.pop();
+                }
+            }
+        }
+
+        // Note that PathBuf starts_with only considers full path segments, but
+        // object prefixes are arbitrary strings, so we need the strings for doing
+        // starts_with later.
+        let prefix = full_path.to_string_lossy();
+
+        let mut files = vec![];
+        let mut directory_queue = vec![initial_dir.clone()];
        while let Some(cur_folder) = directory_queue.pop() {
            let mut entries = fs::read_dir(cur_folder.clone()).await?;
            while let Some(entry) = entries.next_entry().await? {
                let file_name: PathBuf = entry.file_name().into();
                let full_file_name = cur_folder.clone().join(&file_name);
-                let file_remote_path = self.local_file_to_relative_path(full_file_name.clone());
-                files.push(file_remote_path.clone());
-                if full_file_name.is_dir() {
-                    directory_queue.push(full_file_name);
+                if full_file_name
+                    .to_str()
+                    .map(|s| s.starts_with(prefix.as_ref()))
+                    .unwrap_or(false)
+                {
+                    let file_remote_path = self.local_file_to_relative_path(full_file_name.clone());
+                    files.push(file_remote_path.clone());
+                    if full_file_name.is_dir() {
+                        directory_queue.push(full_file_name);
+                    }
                }
            }
        }
+
        Ok(files)
    }

--- a/libs/remote_storage/src/s3_bucket.rs
+++ b/libs/remote_storage/src/s3_bucket.rs
@@ -22,7 +22,7 @@ use aws_sdk_s3::{
    Client,
 };
 use aws_smithy_http::body::SdkBody;
-use hyper::Body;
+use hyper::{Body, StatusCode};
 use scopeguard::ScopeGuard;
 use tokio::{
    io::{self, AsyncRead},
@@ -529,7 +529,16 @@ impl RemoteStorage for S3Bucket {
                    }
                }
                Err(e) => {
-                    return Err(e.into());
+                    if let Some(r) = e.raw_response() {
+                        if r.http().status() == StatusCode::NOT_FOUND {
+                            // 404 is acceptable for deletions.  AWS S3 does not return this, but
+                            // some other implementations might (e.g. GCS XML API returns 404 on DeleteObject
+                            // to a missing key)
+                            continue;
+                        }
+                    }
+                    // Any other error, including one with no HTTP response, fails the deletion.
+                    return Err(anyhow::format_err!("DeleteObjects response error: {e}"));
                }
            }
        }
--- a/libs/safekeeper_api/src/models.rs
+++ b/libs/safekeeper_api/src/models.rs
@@ -31,6 +31,8 @@ fn lsn_invalid() -> Lsn {
 #[serde_as]
 #[derive(Debug, Clone, Deserialize, Serialize)]
 pub struct SkTimelineInfo {
+    /// Term.
+    pub term: Option<u64>,
    /// Term of the last entry.
    pub last_log_term: Option<u64>,
    /// LSN of the last record.
@@ -58,4 +60,6 @@ pub struct SkTimelineInfo {
    /// A connection string to use for WAL receiving.
    #[serde(default)]
    pub safekeeper_connstr: Option<String>,
+    #[serde(default)]
+    pub http_connstr: Option<String>,
 }
--- a/libs/utils/Cargo.toml
+++ b/libs/utils/Cargo.toml
@@ -38,6 +38,7 @@ url.workspace = true
 uuid.workspace = true

 pq_proto.workspace = true
+postgres_connection.workspace = true
 metrics.workspace = true
 workspace_hack.workspace = true

--- a/libs/utils/src/generation.rs
+++ b/libs/utils/src/generation.rs
@@ -0,0 +1,121 @@
+use std::fmt::Display;
+
+use serde::{Deserialize, Serialize};
+
+#[derive(Copy, Clone, Debug, Eq, PartialEq, PartialOrd, Ord)]
+pub enum Generation {
+    // Generations with this magic value will not add a suffix to S3 keys, and will not
+    // be included in persisted index_part.json.  This value is only to be used
+    // during migration from pre-generation metadata to generation-aware metadata,
+    // and should eventually go away.
+    //
+    // A special Generation is used rather than always wrapping Generation in an Option,
+    // so that code handling generations doesn't have to be aware of the legacy
+    // case everywhere it touches a generation.
+    None,
+    // Generations with this magic value may never be used to construct S3 keys:
+    // we will panic if someone tries to.  This is for Tenants in the "Broken" state,
+    // so that we can satisfy their constructor with a Generation without risking
+    // a code bug using it in an S3 write (broken tenants should never write)
+    Broken,
+    Valid(u32),
+}
+
+/// The Generation type represents a number associated with a Tenant, which
+/// increments every time the tenant is attached to a new pageserver, or
+/// an attached pageserver restarts.
+///
+/// It is included as a suffix in S3 keys, as a protection against split-brain
+/// scenarios where pageservers might otherwise issue conflicting writes to
+/// remote storage
+impl Generation {
+    /// Create a new Generation that represents a legacy key format with
+    /// no generation suffix
+    pub fn none() -> Self {
+        Self::None
+    }
+
+    /// Create a new generation that will panic if you try to use get_suffix
+    pub fn broken() -> Self {
+        Self::Broken
+    }
+
+    pub fn new(v: u32) -> Self {
+        Self::Valid(v)
+    }
+
+    pub fn is_none(&self) -> bool {
+        matches!(self, Self::None)
+    }
+
+    pub fn get_suffix(&self) -> String {
+        match self {
+            Self::Valid(v) => {
+                format!("-{:08x}", v)
+            }
+            Self::None => "".into(),
+            Self::Broken => {
+                panic!("Tried to use a broken generation");
+            }
+        }
+    }
+
+    pub fn previous(&self) -> Self {
+        if let Self::Valid(v) = self {
+            Self::new(v - 1)
+        } else {
+            Self::none()
+        }
+    }
+
+    pub fn into(self) -> Option<u32> {
+        if let Self::Valid(v) = self {
+            Some(v)
+        } else {
+            None
+        }
+    }
+}
+
+impl Serialize for Generation {
+    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
+    where
+        S: serde::Serializer,
+    {
+        if let Self::Valid(v) = self {
+            v.serialize(serializer)
+        } else {
+            // We should never be asked to serialize a None or Broken.  Structures
+            // that include an optional generation should convert None to an
+            // Option<Generation>::None
+            Err(serde::ser::Error::custom(
+                "Tried to serialize invalid generation",
+            ))
+        }
+    }
+}
+
+impl<'de> Deserialize<'de> for Generation {
+    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
+    where
+        D: serde::Deserializer<'de>,
+    {
+        Ok(Self::Valid(u32::deserialize(deserializer)?))
+    }
+}
+
+impl Display for Generation {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        match self {
+            Self::Valid(v) => {
+                write!(f, "{:08x}", v)
+            }
+            Self::None => {
+                write!(f, "<none>")
+            }
+            Self::Broken => {
+                write!(f, "<broken>")
+            }
+        }
+    }
+}
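The suffix scheme in `generation.rs` can be sketched without serde as follows. This is a simplified, std-only restatement of the enum above, not the code from the diff. `key_with_generation` and `previous_checked` are hypothetical names added for illustration. `previous_checked` shows one way to avoid the `v - 1` underflow for `Valid(0)`; treating generation 0's predecessor as `None` is an assumption, not something this diff specifies.

```rust
/// Simplified, std-only restatement of the `Generation` enum (no serde).
#[derive(Copy, Clone, Debug, Eq, PartialEq, PartialOrd, Ord)]
pub enum Generation {
    None,
    Broken,
    Valid(u32),
}

impl Generation {
    /// Same formatting as `get_suffix` in the diff: a dash plus 8 lowercase hex digits.
    pub fn get_suffix(&self) -> String {
        match self {
            Self::Valid(v) => format!("-{:08x}", v),
            Self::None => String::new(),
            Self::Broken => panic!("Tried to use a broken generation"),
        }
    }

    /// Hypothetical variant of `previous`: returns `None` instead of
    /// underflowing when called on `Valid(0)` (assumed behavior).
    pub fn previous_checked(&self) -> Self {
        match self {
            Self::Valid(v) => v.checked_sub(1).map_or(Self::None, Self::Valid),
            _ => Self::None,
        }
    }
}

/// Hypothetical helper: append the generation suffix to a remote object key.
pub fn key_with_generation(base: &str, generation: Generation) -> String {
    format!("{base}{}", generation.get_suffix())
}
```

Because the suffix is fixed-width hex, keys for the same object sort lexicographically in generation order, and a legacy key (generation `None`) is simply the bare name.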
--- a/libs/utils/src/http/error.rs
+++ b/libs/utils/src/http/error.rs
@@ -24,6 +24,9 @@ pub enum ApiError {
    #[error("Precondition failed: {0}")]
    PreconditionFailed(Box<str>),

+    #[error("Shutting down")]
+    ShuttingDown,
+
    #[error(transparent)]
    InternalServerError(anyhow::Error),
 }
@@ -52,6 +55,10 @@ impl ApiError {
                self.to_string(),
                StatusCode::PRECONDITION_FAILED,
            ),
+            ApiError::ShuttingDown => HttpErrorBody::response_from_msg_and_status(
+                "Shutting down".to_string(),
+                StatusCode::SERVICE_UNAVAILABLE,
+            ),
            ApiError::InternalServerError(err) => HttpErrorBody::response_from_msg_and_status(
                err.to_string(),
                StatusCode::INTERNAL_SERVER_ERROR,
--- a/libs/utils/src/id.rs
+++ b/libs/utils/src/id.rs
@@ -50,7 +50,7 @@ impl Id {
        Id::from(tli_buf)
    }

-    fn hex_encode(&self) -> String {
+    pub fn hex_encode(&self) -> String {
        static HEX: &[u8] = b"0123456789abcdef";

        let mut buf = vec![0u8; self.0.len() * 2];
@@ -133,6 +133,10 @@ macro_rules! id_newtype {
            pub const fn from_array(b: [u8; 16]) -> Self {
                $t(Id(b))
            }
+
+            pub fn hex_encode(&self) -> String {
+                self.0.hex_encode()
+            }
        }

        impl FromStr for $t {
@@ -244,13 +248,13 @@ id_newtype!(TenantId);
 /// NOTE: It (de)serializes as an array of hex bytes, so the string representation would look
 /// like `[173,80,132,115,129,226,72,254,170,201,135,108,199,26,228,24]`.
 /// See [`Id`] for alternative ways to serialize it.
-#[derive(Clone, Copy, PartialEq, Eq, Hash, Serialize, Deserialize, PartialOrd, Ord)]
+#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
 pub struct ConnectionId(Id);

 id_newtype!(ConnectionId);

 // A pair uniquely identifying Neon instance.
-#[derive(Debug, Clone, Copy, PartialOrd, Ord, PartialEq, Eq, Hash, Serialize, Deserialize)]
+#[derive(Debug, Clone, Copy, PartialOrd, Ord, PartialEq, Eq, Hash)]
 pub struct TenantTimelineId {
    pub tenant_id: TenantId,
    pub timeline_id: TimelineId,
@@ -273,6 +277,36 @@ impl TenantTimelineId {
    }
 }

+impl Serialize for TenantTimelineId {
+    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
+    where
+        S: serde::Serializer,
+    {
+        serializer.collect_str(self)
+    }
+}
+
+impl<'de> Deserialize<'de> for TenantTimelineId {
+    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
+    where
+        D: serde::Deserializer<'de>,
+    {
+        let str = String::deserialize(deserializer)?;
+        if let Some((tenant_part, timeline_part)) = str.split_once('/') {
+            Ok(Self {
+                tenant_id: TenantId(Id::from_hex(tenant_part).map_err(|e| {
+                    serde::de::Error::custom(format!("Malformed tenant in TenantTimelineId: {e}"))
+                })?),
+                timeline_id: TimelineId(Id::from_hex(timeline_part).map_err(|e| {
+                    serde::de::Error::custom(format!("Malformed timeline in TenantTimelineId: {e}"))
+                })?),
+            })
+        } else {
+            Err(serde::de::Error::custom("Malformed TenantTimelineId"))
+        }
+    }
+}
+
 impl fmt::Display for TenantTimelineId {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{}/{}", self.tenant_id, self.timeline_id)
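The new `TenantTimelineId` serialization is a single `tenant/timeline` string rather than a derived struct. A minimal std-only sketch of that round trip is below. It assumes both ids render as 32 lowercase hex characters; `id_from_hex` here is a stand-in for `Id::from_hex`, and the other function names are hypothetical.

```rust
/// Stand-in for `Id::from_hex`: decode exactly 32 hex characters into 16 bytes.
fn id_from_hex(s: &str) -> Result<[u8; 16], String> {
    if s.len() != 32 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("expected 32 hex characters, got {s:?}"));
    }
    let mut out = [0u8; 16];
    for (i, byte) in out.iter_mut().enumerate() {
        // Safe to slice by byte offsets: the input was checked to be ASCII hex.
        *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16).map_err(|e| e.to_string())?;
    }
    Ok(out)
}

/// Lowercase hex encoding of a 16-byte id.
fn hex_encode(id: &[u8; 16]) -> String {
    id.iter().map(|b| format!("{b:02x}")).collect()
}

/// Mirrors the `Display` impl: "<tenant>/<timeline>".
fn format_tenant_timeline(tenant: &[u8; 16], timeline: &[u8; 16]) -> String {
    format!("{}/{}", hex_encode(tenant), hex_encode(timeline))
}

/// Mirrors the `Deserialize` impl: split once on '/', then decode each half.
fn parse_tenant_timeline(s: &str) -> Result<([u8; 16], [u8; 16]), String> {
    let (tenant_part, timeline_part) = s
        .split_once('/')
        .ok_or_else(|| "Malformed TenantTimelineId".to_string())?;
    let tenant = id_from_hex(tenant_part)
        .map_err(|e| format!("Malformed tenant in TenantTimelineId: {e}"))?;
    let timeline = id_from_hex(timeline_part)
        .map_err(|e| format!("Malformed timeline in TenantTimelineId: {e}"))?;
    Ok((tenant, timeline))
}
```

Serializing through `Display` and parsing with `split_once('/')` keeps the two directions symmetric as long as neither id can contain a `/`, which holds for hex strings.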
--- a/libs/utils/src/lib.rs
+++ b/libs/utils/src/lib.rs
@@ -27,6 +27,9 @@ pub mod id;
 // http endpoint utils
 pub mod http;

+// definition of the Generation type for pageserver attachment APIs
+pub mod generation;
+
 // common log initialisation routine
 pub mod logging;

@@ -58,6 +61,8 @@ pub mod serde_regex;

 pub mod pageserver_feedback;

+pub mod postgres_client;
+
 pub mod tracing_span_assert;

 pub mod rate_limit;
--- a/libs/utils/src/postgres_client.rs
+++ b/libs/utils/src/postgres_client.rs
@@ -0,0 +1,37 @@
+//! Postgres client connection code shared by other crates (safekeeper and
+//! pageserver). It depends on tenant/timeline ids and therefore does not fit
+//! into the postgres_connection crate.
+
+use anyhow::Context;
+use postgres_connection::{parse_host_port, PgConnectionConfig};
+
+use crate::id::TenantTimelineId;
+
+/// Create client config for fetching WAL from safekeeper on particular timeline.
+/// listen_pg_addr_str is in form host:\[port\].
+pub fn wal_stream_connection_config(
+    TenantTimelineId {
+        tenant_id,
+        timeline_id,
+    }: TenantTimelineId,
+    listen_pg_addr_str: &str,
+    auth_token: Option<&str>,
+    availability_zone: Option<&str>,
+) -> anyhow::Result<PgConnectionConfig> {
+    let (host, port) =
+        parse_host_port(listen_pg_addr_str).context("Unable to parse listen_pg_addr_str")?;
+    let port = port.unwrap_or(5432);
+    let mut connstr = PgConnectionConfig::new_host_port(host, port)
+        .extend_options([
+            "-c".to_owned(),
+            format!("timeline_id={}", timeline_id),
+            format!("tenant_id={}", tenant_id),
+        ])
+        .set_password(auth_token.map(|s| s.to_owned()));
+
+    if let Some(availability_zone) = availability_zone {
+        connstr = connstr.extend_options([format!("availability_zone={}", availability_zone)]);
+    }
+
+    Ok(connstr)
+}
--- a/libs/vm_monitor/README.md
+++ b/libs/vm_monitor/README.md
@@ -16,3 +16,19 @@ in the `neon-postgres` cgroup and set its `memory.{max,high}`.
 * See also: [`neondatabase/vm-monitor`](https://github.com/neondatabase/vm-monitor/),
 where initial development of the monitor happened. The repository is no longer
 maintained but the commit history may be useful for debugging.
+
+## Structure
+
+The `vm-monitor` is loosely comprised of a few systems. These are:
+* the server: this is just a simple `axum` server that accepts requests and
+upgrades them to websocket connections. The server only allows one connection at
+a time. This means that upon receiving a new connection, the server will terminate
+the old one if it exists.
+* the filecache: a struct that allows communication with the Postgres file cache.
+On startup, we connect to the filecache and hold on to the connection for the
+entire monitor lifetime.
+* the cgroup watcher: the `CgroupWatcher` manages the `neon-postgres` cgroup by
+listening for `memory.high` events and setting its `memory.{high,max}` values.
+* the runner: the runner marries the filecache and cgroup watcher together,
+communicating with the agent through the `Dispatcher`, and then calling filecache
+and cgroup watcher functions as needed to upscale and downscale.
--- a/libs/vm_monitor/src/cgroup.rs
+++ b/libs/vm_monitor/src/cgroup.rs
@@ -634,7 +634,7 @@ impl CgroupWatcher {
            .context("failed to get memory subsystem")?
            .set_mem(cgroups_rs::memory::SetMemory {
                low: None,
-                high: Some(MaxValue::Value(bytes.min(i64::MAX as u64) as i64)),
+                high: Some(MaxValue::Value(u64::min(bytes, i64::MAX as u64) as i64)),
                min: None,
                max: None,
            })
@@ -654,8 +654,10 @@ impl CgroupWatcher {
            .set_mem(cgroups_rs::memory::SetMemory {
                min: None,
                low: None,
-                high: Some(MaxValue::Value(limits.high.min(i64::MAX as u64) as i64)),
-                max: Some(MaxValue::Value(limits.max.min(i64::MAX as u64) as i64)),
+                high: Some(MaxValue::Value(
+                    u64::min(limits.high, i64::MAX as u64) as i64
+                )),
+                max: Some(MaxValue::Value(u64::min(limits.max, i64::MAX as u64) as i64)),
            })
            .context("failed to set memory limits")
    }
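The `u64::min(x, i64::MAX as u64) as i64` pattern in `cgroup.rs` clamps before casting so that very large byte counts cannot wrap to a negative limit. A small sketch of why the clamp matters follows; `clamp_to_i64` is a hypothetical helper name, not a function in the crate.

```rust
/// Hypothetical helper: convert a byte count to i64, saturating at i64::MAX
/// instead of wrapping to a negative value.
fn clamp_to_i64(bytes: u64) -> i64 {
    u64::min(bytes, i64::MAX as u64) as i64
}

/// What a bare `as` cast does: values above i64::MAX wrap around.
fn unclamped_cast(bytes: u64) -> i64 {
    bytes as i64
}
```

With these definitions, `clamp_to_i64(u64::MAX)` is `i64::MAX`, while `unclamped_cast(u64::MAX)` is `-1`.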
--- a/libs/vm_monitor/src/dispatcher.rs
+++ b/libs/vm_monitor/src/dispatcher.rs
@@ -1,7 +1,7 @@
 //! Managing the websocket connection and other signals in the monitor.
 //!
 //! Contains types that manage the interaction (not data interchange, see `protocol`)
-//! between informant and monitor, allowing us to to process and send messages in a
+//! between agent and monitor, allowing us to process and send messages in a
 //! straightforward way. The dispatcher also manages that signals that come from
 //! the cgroup (requesting upscale), and the signals that go to the cgroup
 //! (notifying it of upscale).
@@ -24,16 +24,16 @@ use crate::protocol::{
 /// The central handler for all communications in the monitor.
 ///
 /// The dispatcher has two purposes:
-/// 1. Manage the connection to the informant, sending and receiving messages.
+/// 1. Manage the connection to the agent, sending and receiving messages.
 /// 2. Communicate with the cgroup manager, notifying it when upscale is received,
-///    and sending a message to the informant when the cgroup manager requests
+///    and sending a message to the agent when the cgroup manager requests
 ///    upscale.
 #[derive(Debug)]
 pub struct Dispatcher {
-    /// We read informant messages of of `source`
+    /// We read agent messages off of `source`
    pub(crate) source: SplitStream<WebSocket>,

-    /// We send messages to the informant through `sink`
+    /// We send messages to the agent through `sink`
    sink: SplitSink<WebSocket, Message>,

    /// Used to notify the cgroup when we are upscaled.
@@ -43,7 +43,7 @@ pub struct Dispatcher {
    /// we send an `UpscaleRequst` to the agent.
    pub(crate) request_upscale_events: mpsc::Receiver<()>,

-    /// The protocol version we have agreed to use with the informant. This is negotiated
+    /// The protocol version we have agreed to use with the agent. This is negotiated
    /// during the creation of the dispatcher, and should be the highest shared protocol
    /// version.
    ///
@@ -56,9 +56,9 @@ pub struct Dispatcher {
 impl Dispatcher {
    /// Creates a new dispatcher using the passed-in connection.
    ///
-    /// Performs a negotiation with the informant to determine the highest protocol
+    /// Performs a negotiation with the agent to determine the highest protocol
    /// version that both support. This consists of two steps:
-    /// 1. Wait for the informant to sent the range of protocols it supports.
+    /// 1. Wait for the agent to send the range of protocols it supports.
    /// 2. Send a protocol version that works for us as well, or an error if there
    ///    is no compatible version.
    pub async fn new(
@@ -69,7 +69,7 @@ impl Dispatcher {
        let (mut sink, mut source) = stream.split();

        // Figure out the highest protocol version we both support
-        info!("waiting for informant to send protocol version range");
+        info!("waiting for agent to send protocol version range");
        let Some(message) = source.next().await else {
            bail!("websocket connection closed while performing protocol handshake")
        };
@@ -79,7 +79,7 @@ impl Dispatcher {
        let Message::Text(message_text) = message else {
            // All messages should be in text form, since we don't do any
            // pinging/ponging. See nhooyr/websocket's implementation and the
-            // informant/agent for more info
+            // agent for more info
            bail!("received non-text message during proocol handshake: {message:?}")
        };

@@ -88,32 +88,30 @@ impl Dispatcher {
            max: PROTOCOL_MAX_VERSION,
        };

-        let informant_range: ProtocolRange = serde_json::from_str(&message_text)
+        let agent_range: ProtocolRange = serde_json::from_str(&message_text)
            .context("failed to deserialize protocol version range")?;

-        info!(range = ?informant_range, "received protocol version range");
+        info!(range = ?agent_range, "received protocol version range");

-        let highest_shared_version = match monitor_range.highest_shared_version(&informant_range) {
+        let highest_shared_version = match monitor_range.highest_shared_version(&agent_range) {
            Ok(version) => {
                sink.send(Message::Text(
                    serde_json::to_string(&ProtocolResponse::Version(version)).unwrap(),
                ))
                .await
-                .context("failed to notify informant of negotiated protocol version")?;
+                .context("failed to notify agent of negotiated protocol version")?;
                version
            }
            Err(e) => {
                sink.send(Message::Text(
                    serde_json::to_string(&ProtocolResponse::Error(format!(
                        "Received protocol version range {} which does not overlap with {}",
-                        informant_range, monitor_range
+                        agent_range, monitor_range
                    )))
                    .unwrap(),
                ))
                .await
-                .context(
-                    "failed to notify informant of no overlap between protocol version ranges",
-                )?;
+                .context("failed to notify agent of no overlap between protocol version ranges")?;
                Err(e).context("error determining suitable protocol version range")?
            }
        };
@@ -137,7 +135,7 @@ impl Dispatcher {
            .context("failed to send resources and oneshot sender across channel")
    }

-    /// Send a message to the informant.
+    /// Send a message to the agent.
    ///
    /// Although this function is small, it has one major benefit: it is the only
    /// way to send data accross the connection, and you can only pass in a proper
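The handshake in `Dispatcher::new` picks the highest protocol version that both ranges contain. The body of `highest_shared_version` is not part of this diff, so the following is only a sketch of one plausible implementation. It assumes inclusive `min`/`max` bounds and uses plain `u8` versions instead of `ProtocolVersion`.

```rust
/// Simplified stand-in for `ProtocolRange` with inclusive bounds (assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ProtocolRange {
    min: u8,
    max: u8,
}

impl ProtocolRange {
    /// Highest version contained in both ranges, or an error if they are disjoint.
    fn highest_shared_version(&self, other: &ProtocolRange) -> Result<u8, String> {
        let lowest_common = self.min.max(other.min);
        let highest_common = self.max.min(other.max);
        if lowest_common <= highest_common {
            Ok(highest_common)
        } else {
            Err(format!(
                "range {}..={} does not overlap with {}..={}",
                other.min, other.max, self.min, self.max
            ))
        }
    }
}
```

The intersection of two inclusive ranges is `[max(mins), min(maxes)]`, so the shared maximum exists exactly when that interval is non-empty.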
--- a/libs/vm_monitor/src/filecache.rs
+++ b/libs/vm_monitor/src/filecache.rs
@@ -59,8 +59,8 @@ pub struct FileCacheConfig {
    spread_factor: f64,
 }

-impl Default for FileCacheConfig {
-    fn default() -> Self {
+impl FileCacheConfig {
+    pub fn default_in_memory() -> Self {
        Self {
            in_memory: true,
            // 75 %
@@ -71,9 +71,19 @@ impl Default for FileCacheConfig {
            spread_factor: 0.1,
        }
    }
-}

-impl FileCacheConfig {
+    pub fn default_on_disk() -> Self {
+        Self {
+            in_memory: false,
+            resource_multiplier: 0.75,
+            // 256 MiB - lower than when in memory because overcommitting is safe; if we don't have
+            // memory, the kernel will just evict from its page cache, rather than e.g. killing
+            // everything.
+            min_remaining_after_cache: NonZeroU64::new(256 * MiB).unwrap(),
+            spread_factor: 0.1,
+        }
+    }
+
    /// Make sure fields of the config are consistent.
    pub fn validate(&self) -> anyhow::Result<()> {
        // Single field validity
@@ -132,11 +142,11 @@ impl FileCacheConfig {

        // Conversions to ensure we don't overflow from floating-point ops
        let size_from_spread =
-            0_i64.max((available as f64 / (1.0 + self.spread_factor)) as i64) as u64;
+            i64::max(0, (available as f64 / (1.0 + self.spread_factor)) as i64) as u64;

        let size_from_normal = (total as f64 * self.resource_multiplier) as u64;

-        let byte_size = size_from_spread.min(size_from_normal);
+        let byte_size = u64::min(size_from_spread, size_from_normal);

        // The file cache operates in units of mebibytes, so the sizes we produce should
        // be rounded to a mebibyte. We round down to be conservative.
@@ -268,7 +278,7 @@ impl FileCacheState {
            .context("failed to extract max file cache size from query result")?;

        let max_mb = max_bytes / MiB;
-        let num_mb = (num_bytes / MiB).max(max_mb);
+        let num_mb = u64::min(num_bytes, max_bytes) / MiB;

        let capped = if num_bytes > max_bytes {
            " (capped by maximum size)"
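The last hunk in `filecache.rs` replaces a `max` with a `min`: the old expression returned at least `max_mb` even when the requested size was smaller, so it never capped anything. The sketch below compares the two expressions side by side. `MIB` is assumed to be 2^20 bytes, and both function names are hypothetical.

```rust
/// Assumed value of the crate's `MiB` constant.
const MIB: u64 = 1 << 20;

/// Old expression: takes the larger of the request and the maximum.
fn num_mb_old(num_bytes: u64, max_bytes: u64) -> u64 {
    let max_mb = max_bytes / MIB;
    (num_bytes / MIB).max(max_mb)
}

/// New expression: caps the request at the maximum, then converts to MiB.
fn num_mb_new(num_bytes: u64, max_bytes: u64) -> u64 {
    u64::min(num_bytes, max_bytes) / MIB
}
```

For a 100 MiB request with a 400 MiB maximum, the old expression yields 400 and the new one yields 100. For a 500 MiB request, the old expression yields 500 and the new one yields 400.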
--- a/libs/vm_monitor/src/lib.rs
+++ b/libs/vm_monitor/src/lib.rs
@@ -39,6 +39,16 @@ pub struct Args {
    #[arg(short, long)]
    pub pgconnstr: Option<String>,

+    /// Flag to signal that the Postgres file cache is on disk (i.e. not in memory aside from the
+    /// kernel's page cache), and therefore should not count against available memory.
+    //
+    // NB: Ideally this flag would directly refer to whether the file cache is in memory (rather
+    // than a roundabout way, via whether it's on disk), but in order to be backwards compatible
+    // during the switch away from an in-memory file cache, we had to default to the previous
+    // behavior.
+    #[arg(long)]
+    pub file_cache_on_disk: bool,
+
    /// The address we should listen on for connection requests. For the
    /// agent, this is 0.0.0.0:10301. For the informant, this is 127.0.0.1:10369.
    #[arg(short, long)]
@@ -146,7 +156,7 @@ pub async fn start(args: &'static Args, token: CancellationToken) -> anyhow::Res

 /// Handles incoming websocket connections.
 ///
-/// If we are already to connected to an informant, we kill that old connection
+/// If we are already connected to an agent, we kill that old connection
 /// and accept the new one.
 #[tracing::instrument(name = "/monitor", skip_all, fields(?args))]
 pub async fn ws_handler(
@@ -196,10 +206,10 @@ async fn start_monitor(
            return;
        }
    };
-    info!("connected to informant");
+    info!("connected to agent");

    match monitor.run().await {
        Ok(()) => info!("monitor was killed due to new connection"),
-        Err(e) => error!(error = ?e, "monitor terminated by itself"),
+        Err(e) => error!(error = ?e, "monitor terminated unexpectedly"),
    }
 }
--- a/libs/vm_monitor/src/protocol.rs
+++ b/libs/vm_monitor/src/protocol.rs
@@ -1,13 +1,13 @@
-//! Types representing protocols and actual informant-monitor messages.
+//! Types representing protocols and actual agent-monitor messages.
 //!
 //! The pervasive use of serde modifiers throughout this module is to ease
 //! serialization on the go side. Because go does not have enums (which model
 //! messages well), it is harder to model messages, and we accomodate that with
 //! serde.
 //!
-//! *Note*: the informant sends and receives messages in different ways.
+//! *Note*: the agent sends and receives messages in different ways.
 //!
-//! The informant serializes messages in the form and then sends them. The use
+//! The agent serializes messages in the form and then sends them. The use
 //! of `#[serde(tag = "type", content = "content")]` allows us to use `Type`
 //! to determine how to deserialize `Content`.
 //! ```ignore
@@ -25,9 +25,9 @@
 //!     Id   uint64
 //! }
 //! ```
-//! After reading the type field, the informant will decode the entire message
+//! After reading the type field, the agent will decode the entire message
 //! again, this time into the correct type using the embedded fields.
-//! Because the informant cannot just extract the json contained in a certain field
+//! Because the agent cannot just extract the json contained in a certain field
 //! (it initially deserializes to `map[string]interface{}`), we keep the fields
 //! at the top level, so the entire piece of json can be deserialized into a struct,
 //! such as a `DownscaleResult`, with the `Type` and `Id` fields ignored.
@@ -37,7 +37,7 @@ use std::cmp;

 use serde::{de::Error, Deserialize, Serialize};

-/// A Message we send to the informant.
+/// A Message we send to the agent.
 #[derive(Serialize, Deserialize, Debug, Clone)]
 pub struct OutboundMsg {
    #[serde(flatten)]
@@ -51,31 +51,31 @@ impl OutboundMsg {
    }
 }

-/// The different underlying message types we can send to the informant.
+/// The different underlying message types we can send to the agent.
 #[derive(Serialize, Deserialize, Debug, Clone)]
 #[serde(tag = "type")]
 pub enum OutboundMsgKind {
-    /// Indicates that the informant sent an invalid message, i.e, we couldn't
+    /// Indicates that the agent sent an invalid message, i.e, we couldn't
    /// properly deserialize it.
    InvalidMessage { error: String },
    /// Indicates that we experienced an internal error while processing a message.
    /// For example, if a cgroup operation fails while trying to handle an upscale,
    /// we return `InternalError`.
    InternalError { error: String },
-    /// Returned to the informant once we have finished handling an upscale. If the
+    /// Returned to the agent once we have finished handling an upscale. If the
    /// handling was unsuccessful, an `InternalError` will get returned instead.
    /// *Note*: this is a struct variant because of the way go serializes struct{}
    UpscaleConfirmation {},
    /// Indicates to the monitor that we are urgently requesting resources.
    /// *Note*: this is a struct variant because of the way go serializes struct{}
    UpscaleRequest {},
-    /// Returned to the informant once we have finished attempting to downscale. If
+    /// Returned to the agent once we have finished attempting to downscale. If
    /// an error occured trying to do so, an `InternalError` will get returned instead.
    /// However, if we are simply unsuccessful (for example, do to needing the resources),
    /// that gets included in the `DownscaleResult`.
    DownscaleResult {
        // FIXME for the future (once the informant is deprecated)
-        // As of the time of writing, the informant/agent version of this struct is
+        // As of the time of writing, the agent/informant version of this struct is
        // called api.DownscaleResult. This struct has uppercase fields which are
        // serialized as such. Thus, we serialize using uppercase names so we don't
        // have to make a breaking change to the agent<->informant protocol. Once
@@ -88,12 +88,12 @@ pub enum OutboundMsgKind {
        status: String,
    },
    /// Part of the bidirectional heartbeat. The heartbeat is initiated by the
-    /// informant.
+    /// agent.
    /// *Note*: this is a struct variant because of the way go serializes struct{}
    HealthCheck {},
 }

-/// A message received form the informant.
+/// A message received from the agent.
 #[derive(Serialize, Deserialize, Debug, Clone)]
 pub struct InboundMsg {
    #[serde(flatten)]
@@ -101,7 +101,7 @@ pub struct InboundMsg {
    pub(crate) id: usize,
 }

-/// The different underlying message types we can receive from the informant.
+/// The different underlying message types we can receive from the agent.
 #[derive(Serialize, Deserialize, Debug, Clone)]
 #[serde(tag = "type", content = "content")]
 pub enum InboundMsgKind {
@@ -120,14 +120,14 @@ pub enum InboundMsgKind {
    /// when done.
    DownscaleRequest { target: Resources },
    /// Part of the bidirectional heartbeat. The heartbeat is initiated by the
-    /// informant.
+    /// agent.
    /// *Note*: this is a struct variant because of the way go serializes struct{}
    HealthCheck {},
 }

 /// Represents the resources granted to a VM.
 #[derive(Serialize, Deserialize, Debug, Clone, Copy)]
-// Renamed because the agent/informant has multiple resources types:
+// Renamed because the agent has multiple resources types:
 // `Resources` (milliCPU/memory slots)
 // `Allocation` (vCPU/bytes) <- what we correspond to
 #[serde(rename(serialize = "Allocation", deserialize = "Allocation"))]
@@ -151,7 +151,7 @@ pub const PROTOCOL_MAX_VERSION: ProtocolVersion = ProtocolVersion::V1_0;
 pub struct ProtocolVersion(u8);

 impl ProtocolVersion {
-    /// Represents v1.0 of the informant<-> monitor protocol - the initial version
+    /// Represents v1.0 of the agent<->monitor protocol - the initial version
    ///
    /// Currently the latest version.
    const V1_0: ProtocolVersion = ProtocolVersion(1);
--- a/libs/vm_monitor/src/runner.rs
+++ b/libs/vm_monitor/src/runner.rs
@@ -1,4 +1,4 @@
-//! Exposes the `Runner`, which handles messages received from informant and
+//! Exposes the `Runner`, which handles messages received from the agent and
 //! sends upscale requests.
 //!
 //! This is the "Monitor" part of the monitor binary and is the main entrypoint for
@@ -21,8 +21,8 @@ use crate::filecache::{FileCacheConfig, FileCacheState};
 use crate::protocol::{InboundMsg, InboundMsgKind, OutboundMsg, OutboundMsgKind, Resources};
 use crate::{bytes_to_mebibytes, get_total_system_memory, spawn_with_cancel, Args, MiB};

-/// Central struct that interacts with informant, dispatcher, and cgroup to handle
-/// signals from the informant.
+/// Central struct that interacts with agent, dispatcher, and cgroup to handle
+/// signals from the agent.
 #[derive(Debug)]
 pub struct Runner {
    config: Config,
@@ -110,10 +110,10 @@ impl Runner {
        // memory limits.
        if let Some(connstr) = &args.pgconnstr {
            info!("initializing file cache");
-            let config: FileCacheConfig = Default::default();
-            if !config.in_memory {
-                panic!("file cache not in-memory implemented")
-            }
+            let config = match args.file_cache_on_disk {
+                true => FileCacheConfig::default_on_disk(),
+                false => FileCacheConfig::default_in_memory(),
+            };

            let mut file_cache = FileCacheState::new(connstr, config, token.clone())
                .await
@@ -140,7 +140,10 @@ impl Runner {
            if actual_size != new_size {
                info!("file cache size actually got set to {actual_size}")
            }
-            file_cache_reserved_bytes = actual_size;
+            // Mark the resources given to the file cache as reserved, but only if it's in memory.
+            if !args.file_cache_on_disk {
+                file_cache_reserved_bytes = actual_size;
+            }

            state.filecache = Some(file_cache);
        }
@@ -227,18 +230,17 @@ impl Runner {
        let mut status = vec![];
        let mut file_cache_mem_usage = 0;
        if let Some(file_cache) = &mut self.filecache {
-            if !file_cache.config.in_memory {
-                panic!("file cache not in-memory unimplemented")
-            }
-
            let actual_usage = file_cache
                .set_file_cache_size(expected_file_cache_mem_usage)
                .await
                .context("failed to set file cache size")?;
-            file_cache_mem_usage = actual_usage;
+            if file_cache.config.in_memory {
+                file_cache_mem_usage = actual_usage;
+            }
            let message = format!(
-                "set file cache size to {} MiB",
-                bytes_to_mebibytes(actual_usage)
+                "set file cache size to {} MiB (in memory = {})",
+                bytes_to_mebibytes(actual_usage),
+                file_cache.config.in_memory,
            );
            info!("downscale: {message}");
            status.push(message);
@@ -289,10 +291,6 @@ impl Runner {
        // Get the file cache's expected contribution to the memory usage
        let mut file_cache_mem_usage = 0;
        if let Some(file_cache) = &mut self.filecache {
-            if !file_cache.config.in_memory {
-                panic!("file cache not in-memory unimplemented");
-            }
-
            let expected_usage = file_cache.config.calculate_cache_size(usable_system_memory);
            info!(
                target = bytes_to_mebibytes(expected_usage),
@@ -304,6 +302,9 @@ impl Runner {
                .set_file_cache_size(expected_usage)
                .await
                .context("failed to set file cache size")?;
+            if file_cache.config.in_memory {
+                file_cache_mem_usage = actual_usage;
+            }

            if actual_usage != expected_usage {
                warn!(
@@ -312,7 +313,6 @@ impl Runner {
                    bytes_to_mebibytes(actual_usage)
                )
            }
-            file_cache_mem_usage = actual_usage;
        }

        if let Some(cgroup) = &self.cgroup {
@@ -371,7 +371,7 @@ impl Runner {
                Ok(None)
            }
            InboundMsgKind::InternalError { error } => {
-                warn!(error, id, "informant experienced an internal error");
+                warn!(error, id, "agent experienced an internal error");
                Ok(None)
            }
            InboundMsgKind::HealthCheck {} => {
@@ -405,10 +405,12 @@ impl Runner {
                        .await
                        .context("failed to send message")?;
                }
-                // there is a message from the informant
+                // there is a message from the agent
                msg = self.dispatcher.source.next() => {
                    if let Some(msg) = msg {
-                        info!(message = ?msg, "received message");
+                        // Don't use 'message' as a key as the string also uses
+                        // that for its key
+                        info!(?msg, "received message");
                        match msg {
                            Ok(msg) => {
                                let message: InboundMsg = match msg {
@@ -417,8 +419,10 @@ impl Runner {
                                    }
                                    other => {
                                        warn!(
-                                            message = ?other,
-                                            "informant should only send text messages but received different type"
+                                            // Don't use 'message' as a key as the
+                                            // string also uses that for its key
+                                            msg = ?other,
+                                            "agent should only send text messages but received different type"
                                        );
                                        continue
                                    },
@@ -429,7 +433,7 @@ impl Runner {
                                    Ok(None) => continue,
                                    Err(e) => {
                                        let error = e.to_string();
-                                        warn!(%error, "error handling message");
+                                        warn!(?error, "error handling message");
                                        OutboundMsg::new(
                                            OutboundMsgKind::InternalError {
                                                error
--- a/pageserver/ctl/src/layer_map_analyzer.rs
+++ b/pageserver/ctl/src/layer_map_analyzer.rs
@@ -97,7 +97,7 @@ pub(crate) fn parse_filename(name: &str) -> Option<LayerFile> {
 // Finds the max_holes largest holes, ignoring any that are smaller than MIN_HOLE_LENGTH"
 async fn get_holes(path: &Path, max_holes: usize) -> Result<Vec<Hole>> {
    let file = FileBlockReader::new(VirtualFile::open(path)?);
-    let summary_blk = file.read_blk(0)?;
+    let summary_blk = file.read_blk(0).await?;
    let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;
    let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
        actual_summary.index_start_blk,
--- a/pageserver/ctl/src/layers.rs
+++ b/pageserver/ctl/src/layers.rs
@@ -48,7 +48,7 @@ async fn read_delta_file(path: impl AsRef<Path>) -> Result<()> {
    virtual_file::init(10);
    page_cache::init(100);
    let file = FileBlockReader::new(VirtualFile::open(path)?);
-    let summary_blk = file.read_blk(0)?;
+    let summary_blk = file.read_blk(0).await?;
    let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;
    let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
        actual_summary.index_start_blk,
--- a/pageserver/src/bin/pageserver.rs
+++ b/pageserver/src/bin/pageserver.rs
@@ -2,12 +2,14 @@

 use std::env::{var, VarError};
 use std::sync::Arc;
+use std::time::Duration;
 use std::{env, ops::ControlFlow, path::Path, str::FromStr};

 use anyhow::{anyhow, Context};
 use clap::{Arg, ArgAction, Command};

 use metrics::launch_timestamp::{set_launch_timestamp_metric, LaunchTimestamp};
+use pageserver::deletion_queue::{DeletionQueue, DeletionQueueError};
 use pageserver::disk_usage_eviction_task::{self, launch_disk_usage_global_eviction_task};
 use pageserver::metrics::{STARTUP_DURATION, STARTUP_IS_LOADING};
 use pageserver::task_mgr::WALRECEIVER_RUNTIME;
@@ -349,6 +351,35 @@ fn start_pageserver(
    // Set up remote storage client
    let remote_storage = create_remote_storage_client(conf)?;

+    // Set up deletion queue
+    let deletion_queue_cancel = tokio_util::sync::CancellationToken::new();
+    let (deletion_queue, deletion_frontend, deletion_backend, deletion_executor) =
+        DeletionQueue::new(remote_storage.clone(), conf, deletion_queue_cancel.clone());
+    if let Some(mut deletion_frontend) = deletion_frontend {
+        BACKGROUND_RUNTIME.spawn(async move {
+            deletion_frontend
+                .background()
+                .instrument(info_span!(parent: None, "deletion frontend"))
+                .await
+        });
+    }
+    if let Some(mut deletion_backend) = deletion_backend {
+        BACKGROUND_RUNTIME.spawn(async move {
+            deletion_backend
+                .background()
+                .instrument(info_span!(parent: None, "deletion backend"))
+                .await
+        });
+    }
+    if let Some(mut deletion_executor) = deletion_executor {
+        BACKGROUND_RUNTIME.spawn(async move {
+            deletion_executor
+                .background()
+                .instrument(info_span!(parent: None, "deletion executor"))
+                .await
+        });
+    }
+
    // Up to this point no significant I/O has been done: this should have been fast.  Record
    // duration prior to starting I/O intensive phase of startup.
    startup_checkpoint("initial", "Starting loading tenants");
@@ -386,6 +417,7 @@ fn start_pageserver(
        TenantSharedResources {
            broker_client: broker_client.clone(),
            remote_storage: remote_storage.clone(),
+            deletion_queue_client: deletion_queue.new_client(),
        },
        order,
    ))?;
@@ -482,6 +514,7 @@ fn start_pageserver(
            http_auth,
            broker_client.clone(),
            remote_storage,
+            deletion_queue.clone(),
            disk_usage_eviction_state,
        )?
        .build()
@@ -604,6 +637,36 @@ fn start_pageserver(
            // The plan is to change that over time.
            shutdown_pageserver.take();
            BACKGROUND_RUNTIME.block_on(pageserver::shutdown_pageserver(0));
+
+            // Best effort to persist any outstanding deletions, to avoid leaking objects
+            let dq = deletion_queue.clone();
+            BACKGROUND_RUNTIME.block_on(async move {
+                match tokio::time::timeout(Duration::from_secs(5), dq.new_client().flush()).await {
+                    Ok(flush_r) => {
+                        match flush_r {
+                            Ok(()) => {
+                                info!("Deletion queue flushed successfully on shutdown")
+                            }
+                            Err(e) => {
+                                match e {
+                                    DeletionQueueError::ShuttingDown => {
+                                        // This is not harmful for correctness, but is unexpected: the deletion
+                                        // queue's workers should stay alive as long as there are any client handles instantiated.
+                                        warn!("Deletion queue stopped prematurely");
+                                    }
+                                }
+                            }
+                        }
+                    }
+                    Err(e) => {
+                        warn!("Timed out flushing deletion queue on shutdown ({e})")
+                    }
+                }
+            });
+
+            // Clean shutdown of deletion queue workers
+            deletion_queue_cancel.cancel();
+
            unreachable!()
        }
    })
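
Note on the shutdown hunk above: it waits at most five seconds for the deletion queue to flush, then cancels the workers. Below is a minimal sketch of that bounded-wait pattern, assuming a standard-library channel and thread in place of tokio. `FlushOutcome`, `flush_with_timeout` and `spawn_worker` are hypothetical names used only for illustration and are not part of the pageserver.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Possible results of a bounded flush (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum FlushOutcome {
    Flushed,
    WorkerGone,
    TimedOut,
}

/// Ask a worker to flush and wait at most `limit` for its acknowledgement.
fn flush_with_timeout(
    requests: &mpsc::Sender<mpsc::Sender<()>>,
    limit: Duration,
) -> FlushOutcome {
    let (ack_tx, ack_rx) = mpsc::channel();
    // If the request channel is closed, the worker has already exited.
    if requests.send(ack_tx).is_err() {
        return FlushOutcome::WorkerGone;
    }
    match ack_rx.recv_timeout(limit) {
        Ok(()) => FlushOutcome::Flushed,
        Err(mpsc::RecvTimeoutError::Timeout) => FlushOutcome::TimedOut,
        Err(mpsc::RecvTimeoutError::Disconnected) => FlushOutcome::WorkerGone,
    }
}

/// Spawn a worker that acknowledges each flush request after `delay`.
fn spawn_worker(delay: Duration) -> mpsc::Sender<mpsc::Sender<()>> {
    let (tx, rx) = mpsc::channel::<mpsc::Sender<()>>();
    thread::spawn(move || {
        for ack in rx {
            thread::sleep(delay);
            // The requester may have given up already; ignore a failed send.
            let _ = ack.send(());
        }
    });
    tx
}

fn main() {
    let fast = spawn_worker(Duration::from_millis(1));
    println!("{:?}", flush_with_timeout(&fast, Duration::from_secs(5)));
}
```

As in the hunk, a timeout here is only reported to the caller; it does not stop the worker, which is why the real code cancels the workers separately afterwards.
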
--- a/pageserver/src/config.rs
+++ b/pageserver/src/config.rs
@@ -204,6 +204,8 @@ pub struct PageServerConf {
    /// has it's initial logical size calculated. Not running background tasks for some seconds is
    /// not terrible.
    pub background_task_maximum_delay: Duration,
+
+    pub control_plane_api: Option<Url>,
 }

 /// We do not want to store this in a PageServerConf because the latter may be logged
@@ -278,6 +280,8 @@ struct PageServerConfigBuilder {
    ondemand_download_behavior_treat_error_as_warn: BuilderValue<bool>,

    background_task_maximum_delay: BuilderValue<Duration>,
+
+    control_plane_api: BuilderValue<Option<Url>>,
 }

 impl Default for PageServerConfigBuilder {
@@ -340,6 +344,8 @@ impl Default for PageServerConfigBuilder {
                DEFAULT_BACKGROUND_TASK_MAXIMUM_DELAY,
            )
            .unwrap()),
+
+            control_plane_api: Set(None),
        }
    }
 }
@@ -468,6 +474,10 @@ impl PageServerConfigBuilder {
        self.background_task_maximum_delay = BuilderValue::Set(delay);
    }

+    pub fn control_plane_api(&mut self, api: Url) {
+        self.control_plane_api = BuilderValue::Set(Some(api))
+    }
+
    pub fn build(self) -> anyhow::Result<PageServerConf> {
        let concurrent_tenant_size_logical_size_queries = self
            .concurrent_tenant_size_logical_size_queries
@@ -553,6 +563,9 @@ impl PageServerConfigBuilder {
            background_task_maximum_delay: self
                .background_task_maximum_delay
                .ok_or(anyhow!("missing background_task_maximum_delay"))?,
+            control_plane_api: self
+                .control_plane_api
+                .ok_or(anyhow!("missing control_plane_api"))?,
        })
    }
 }
@@ -566,6 +579,27 @@ impl PageServerConf {
        self.workdir.join("tenants")
    }

+    pub fn deletion_prefix(&self) -> PathBuf {
+        self.workdir.join("deletion")
+    }
+
+    pub fn deletion_list_path(&self, sequence: u64) -> PathBuf {
+        // Encode a version in the filename, so that if we ever switch away from JSON we can
+        // increment this.
+        const VERSION: u8 = 1;
+
+        self.deletion_prefix()
+            .join(format!("{sequence:016x}-{VERSION:02x}.list"))
+    }
+
+    pub fn deletion_header_path(&self) -> PathBuf {
+        // Encode a version in the filename, so that if we ever switch away from JSON we can
+        // increment this.
+        const VERSION: u8 = 1;
+
+        self.deletion_prefix().join(format!("header-{VERSION:02x}"))
+    }
+
    pub fn tenant_path(&self, tenant_id: &TenantId) -> PathBuf {
        self.tenants_path().join(tenant_id.to_string())
    }
@@ -643,23 +677,6 @@ impl PageServerConf {
            .join(METADATA_FILE_NAME)
    }

-    /// Files on the remote storage are stored with paths, relative to the workdir.
-    /// That path includes in itself both tenant and timeline ids, allowing to have a unique remote storage path.
-    ///
-    /// Errors if the path provided does not start from pageserver's workdir.
-    pub fn remote_path(&self, local_path: &Path) -> anyhow::Result<RemotePath> {
-        local_path
-            .strip_prefix(&self.workdir)
-            .context("Failed to strip workdir prefix")
-            .and_then(RemotePath::new)
-            .with_context(|| {
-                format!(
-                    "Failed to resolve remote part of path {:?} for base {:?}",
-                    local_path, self.workdir
-                )
-            })
-    }
-
    /// Turns storage remote path of a file into its local path.
    pub fn local_path(&self, remote_path: &RemotePath) -> PathBuf {
        remote_path.with_base(&self.workdir)
@@ -758,6 +775,7 @@ impl PageServerConf {
                },
                "ondemand_download_behavior_treat_error_as_warn" => builder.ondemand_download_behavior_treat_error_as_warn(parse_toml_bool(key, item)?),
                "background_task_maximum_delay" => builder.background_task_maximum_delay(parse_toml_duration(key, item)?),
+                "control_plane_api" => builder.control_plane_api(parse_toml_string(key, item)?.parse().context("failed to parse control plane URL")?),
                _ => bail!("unrecognized pageserver option '{key}'"),
            }
        }
@@ -926,6 +944,7 @@ impl PageServerConf {
            test_remote_failures: 0,
            ondemand_download_behavior_treat_error_as_warn: false,
            background_task_maximum_delay: Duration::ZERO,
+            control_plane_api: None,
        }
    }
 }
@@ -1149,6 +1168,7 @@ background_task_maximum_delay = '334 s'
                background_task_maximum_delay: humantime::parse_duration(
                    defaults::DEFAULT_BACKGROUND_TASK_MAXIMUM_DELAY
                )?,
+                control_plane_api: None
            },
            "Correct defaults should be used when no config values are provided"
        );
@@ -1204,6 +1224,7 @@ background_task_maximum_delay = '334 s'
                test_remote_failures: 0,
                ondemand_download_behavior_treat_error_as_warn: false,
                background_task_maximum_delay: Duration::from_secs(334),
+                control_plane_api: None
            },
            "Should be able to parse all basic config values correctly"
        );
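
Note on the naming scheme used by `deletion_list_path` in the config.rs hunks above: a zero-padded hexadecimal sequence number followed by a hexadecimal format version. The following is a minimal sketch of that scheme. `list_name` mirrors the `format!` call in the diff; `parse_list_name` is a hypothetical helper, not part of this change, included only to show that the fixed-width encoding can be decoded again and sorts lexicographically in sequence order.

```rust
/// Format version embedded in the filename, mirroring `VERSION` in the diff.
const VERSION: u8 = 1;

/// Build a deletion list filename with the same format string as `deletion_list_path`.
fn list_name(sequence: u64) -> String {
    format!("{sequence:016x}-{VERSION:02x}.list")
}

/// Hypothetical helper: recover (sequence, version) from a filename.
fn parse_list_name(name: &str) -> Option<(u64, u8)> {
    let stem = name.strip_suffix(".list")?;
    let (seq, ver) = stem.split_once('-')?;
    // Reject names that do not use the fixed field widths.
    if seq.len() != 16 || ver.len() != 2 {
        return None;
    }
    Some((
        u64::from_str_radix(seq, 16).ok()?,
        u8::from_str_radix(ver, 16).ok()?,
    ))
}

fn main() {
    let a = list_name(10);
    let b = list_name(255);
    println!("{a} {b}");
    // Fixed-width hex means string order matches numeric order.
    assert!(a < b);
    assert_eq!(parse_list_name(&a), Some((10, 1)));
}
```
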
--- a/pageserver/src/deletion_queue.rs
+++ b/pageserver/src/deletion_queue.rs
@@ -0,0 +1,850 @@
+mod backend;
+mod executor;
+mod frontend;
+
+use std::collections::HashMap;
+use std::path::PathBuf;
+
+use crate::metrics::DELETION_QUEUE_SUBMITTED;
+use crate::tenant::remote_timeline_client::remote_timeline_path;
+use remote_storage::{GenericRemoteStorage, RemotePath};
+use serde::Deserialize;
+use serde::Serialize;
+use serde_with::serde_as;
+use thiserror::Error;
+use tokio;
+use tokio_util::sync::CancellationToken;
+use tracing::{self, debug, error};
+use utils::generation::Generation;
+use utils::id::{TenantId, TimelineId};
+
+pub(crate) use self::backend::BackendQueueWorker;
+use self::executor::ExecutorWorker;
+use self::frontend::DeletionOp;
+pub(crate) use self::frontend::FrontendQueueWorker;
+use backend::BackendQueueMessage;
+use executor::ExecutorMessage;
+use frontend::FrontendQueueMessage;
+
+use crate::{config::PageServerConf, tenant::storage_layer::LayerFileName};
+
+// TODO: administrative "panic button" config property to disable all deletions
+// TODO: configurable for how long to wait before executing deletions
+
+/// We aggregate object deletions from many tenants in one place, for several reasons:
+/// - Coalesce deletions into fewer DeleteObjects calls
+/// - Enable Tenant/Timeline lifetimes to be shorter than the time it takes
+///   to flush any outstanding deletions.
+/// - Globally control throughput of deletions, as these are a low priority task: do
+///   not compete with the same S3 clients/connections used for higher priority uploads.
+/// - Future: enable validating that we may do deletions in a multi-attached scenario,
+///   via generation numbers (see https://github.com/neondatabase/neon/pull/4919)
+///
+/// There are two kinds of deletion: deferred and immediate.  A deferred deletion
+/// may be intentionally delayed to protect passive readers of S3 data, and may
+/// be subject to a generation number validation step.  An immediate deletion is
+/// ready to execute immediately, and is only queued up so that it can be coalesced
+/// with other deletions in flight.
+///
+/// Deferred deletions pass through three steps:
+/// - Frontend: accumulate deletion requests from Timelines, and batch them up into
+///   DeletionLists, which are persisted to S3.
+/// - Backend: accumulate deletion lists, and validate them en-masse prior to passing
+///   the keys in the list onward for actual deletion
+/// - Executor: accumulate object keys that the backend has validated for immediate
+///   deletion, and execute them in batches of 1000 keys via DeleteObjects.
+///
+/// Non-deferred deletions, such as during timeline deletion, bypass the first
+/// two stages and are passed straight into the Executor.
+///
+/// Internally, each stage is joined by a channel to the next.  In S3, there is only
+/// one queue (of DeletionLists), which is written by the frontend and consumed
+/// by the backend.
+#[derive(Clone)]
+pub struct DeletionQueue {
+    client: DeletionQueueClient,
+}
+
+#[derive(Debug)]
+struct FlushOp {
+    tx: tokio::sync::oneshot::Sender<()>,
+}
+
+impl FlushOp {
+    fn fire(self) {
+        if self.tx.send(()).is_err() {
+            // oneshot channel closed. This is legal: a client could be destroyed while waiting for a flush.
+            debug!("deletion queue flush from dropped client");
+        };
+    }
+}
+
+#[derive(Clone)]
+pub struct DeletionQueueClient {
+    tx: tokio::sync::mpsc::Sender<FrontendQueueMessage>,
+    executor_tx: tokio::sync::mpsc::Sender<ExecutorMessage>,
+}
+
+#[derive(Debug, Serialize, Deserialize)]
+struct TenantDeletionList {
+    /// For each Timeline, a list of key fragments to append to the timeline remote path
+    /// when reconstructing a full key
+    timelines: HashMap<TimelineId, Vec<String>>,
+
+    /// The generation in which this deletion was emitted: note that this may not be the
+    /// same as the generation of any layers being deleted.  The generation of the layer
+    /// has already been absorbed into the keys in `timelines`.
+    generation: Generation,
+}
+
+#[serde_as]
+#[derive(Debug, Serialize, Deserialize)]
+struct DeletionList {
+    /// Serialization version, for future use
+    version: u8,
+
+    /// Used for constructing a unique key for each deletion list we write out.
+    sequence: u64,
+
+    /// To avoid repeating tenant/timeline IDs in every key, we store keys in
+    /// nested HashMaps by TenantTimelineID.  Each Tenant only appears once
+    /// with one unique generation ID: if someone tries to push a second generation
+    /// ID for the same tenant, we will start a new DeletionList.
+    tenants: HashMap<TenantId, TenantDeletionList>,
+
+    /// Avoid having to walk `tenants` to calculate size
+    size: usize,
+}
+
+#[serde_as]
+#[derive(Debug, Serialize, Deserialize)]
+struct DeletionHeader {
+    /// Serialization version, for future use
+    version: u8,
+
+    /// Enable determining the next sequence number even if there are no deletion lists present.
+    /// If there _are_ deletion lists present, then their sequence numbers take precedence over
+    /// this.
+    last_deleted_list_seq: u64,
+    // TODO: this is where we will track a 'clean' sequence number that indicates all deletion
+    // lists <= that sequence have had their generations validated with the control plane
+    // and are OK to execute.
+}
+
+impl DeletionHeader {
+    const VERSION_LATEST: u8 = 1;
+
+    fn new(last_deleted_list_seq: u64) -> Self {
+        Self {
+            version: Self::VERSION_LATEST,
+            last_deleted_list_seq,
+        }
+    }
+}
+
+impl DeletionList {
+    const VERSION_LATEST: u8 = 1;
+    fn new(sequence: u64) -> Self {
+        Self {
+            version: Self::VERSION_LATEST,
+            sequence,
+            tenants: HashMap::new(),
+            size: 0,
+        }
+    }
+
+    fn drain(&mut self) -> Self {
+        let mut tenants = HashMap::new();
+        std::mem::swap(&mut self.tenants, &mut tenants);
+        let other = Self {
+            version: Self::VERSION_LATEST,
+            sequence: self.sequence,
+            tenants,
+            size: self.size,
+        };
+        self.size = 0;
+        other
+    }
+
+    fn is_empty(&self) -> bool {
+        self.tenants.is_empty()
+    }
+
+    fn len(&self) -> usize {
+        self.size
+    }
+
+    /// Returns true if the push was accepted, false if the caller must start a new
+    /// deletion list.
+    fn push(
+        &mut self,
+        tenant: &TenantId,
+        timeline: &TimelineId,
+        generation: Generation,
+        objects: &mut Vec<RemotePath>,
+    ) -> bool {
+        if objects.is_empty() {
+            // Avoid inserting an empty timeline entry: this preserves the property
+            // that if we have no keys, then self.tenants is empty (used in Self::is_empty)
+            return true;
+        }
+
+        let tenant_entry = self
+            .tenants
+            .entry(*tenant)
+            .or_insert_with(|| TenantDeletionList {
+                timelines: HashMap::new(),
+                generation,
+            });
+
+        if tenant_entry.generation != generation {
+            // Only one generation per tenant per list: signal to
+            // caller to start a new list.
+            return false;
+        }
+
+        let timeline_entry = tenant_entry
+            .timelines
+            .entry(*timeline)
+            .or_default();
+
+        let timeline_remote_path = remote_timeline_path(tenant, timeline);
+
+        self.size += objects.len();
+        timeline_entry.extend(objects.drain(..).map(|p| {
+            p.strip_prefix(&timeline_remote_path)
+                .expect("Timeline paths always start with the timeline prefix")
+                .to_string_lossy()
+                .to_string()
+        }));
+        true
+    }
+
+    fn take_paths(self) -> Vec<RemotePath> {
+        let mut result = Vec::new();
+        for (tenant, tenant_deletions) in self.tenants.into_iter() {
+            for (timeline, timeline_layers) in tenant_deletions.timelines.into_iter() {
+                let timeline_remote_path = remote_timeline_path(&tenant, &timeline);
+                result.extend(
+                    timeline_layers
+                        .into_iter()
+                        .map(|l| timeline_remote_path.join(&PathBuf::from(l))),
+                );
+            }
+        }
+
+        result
+    }
+}
+
+#[derive(Error, Debug)]
+pub enum DeletionQueueError {
+    #[error("Deletion queue unavailable during shutdown")]
+    ShuttingDown,
+}
+
+impl DeletionQueueClient {
+    async fn do_push(&self, msg: FrontendQueueMessage) -> Result<(), DeletionQueueError> {
+        match self.tx.send(msg).await {
+            Ok(_) => Ok(()),
+            Err(e) => {
+                // This shouldn't happen, we should shut down all tenants before
+                // we shut down the global delete queue.  If we encounter a bug like this,
+                // we may leak objects as deletions won't be processed.
+                error!("Deletion queue closed while pushing, shutting down? ({e})");
+                Err(DeletionQueueError::ShuttingDown)
+            }
+        }
+    }
+
+    /// Submit a list of layers for deletion: this function will return before the deletion is
+    /// persistent, but it may be executed at any time after this function enters: do not push
+    /// layers until you're sure they can be deleted safely (i.e. remote metadata no longer
+    /// references them).
+    pub(crate) async fn push_layers(
+        &self,
+        tenant_id: TenantId,
+        timeline_id: TimelineId,
+        generation: Generation,
+        layers: Vec<(LayerFileName, Generation)>,
+    ) -> Result<(), DeletionQueueError> {
+        DELETION_QUEUE_SUBMITTED.inc_by(layers.len() as u64);
+        self.do_push(FrontendQueueMessage::Delete(DeletionOp {
+            tenant_id,
+            timeline_id,
+            layers,
+            generation,
+            objects: Vec::new(),
+        }))
+        .await
+    }
+
+    async fn do_flush(
+        &self,
+        msg: FrontendQueueMessage,
+        rx: tokio::sync::oneshot::Receiver<()>,
+    ) -> Result<(), DeletionQueueError> {
+        self.do_push(msg).await?;
+        if rx.await.is_err() {
+            // This shouldn't happen if tenants are shut down before deletion queue.  If we
+            // encounter a bug like this, then a flusher will incorrectly believe it has flushed
+            // when it hasn't, possibly leading to leaking objects.
+            error!("Deletion queue dropped flush op while client was still waiting");
+            Err(DeletionQueueError::ShuttingDown)
+        } else {
+            Ok(())
+        }
+    }
+
+    /// Wait until all previous deletions are persistent (either executed, or written to a DeletionList)
+    pub async fn flush(&self) -> Result<(), DeletionQueueError> {
+        let (tx, rx) = tokio::sync::oneshot::channel::<()>();
+        self.do_flush(FrontendQueueMessage::Flush(FlushOp { tx }), rx)
+            .await
+    }
+
+    /// Wait until all previous deletions are executed
+    pub(crate) async fn flush_execute(&self) -> Result<(), DeletionQueueError> {
+        debug!("flush_execute: flushing to deletion lists...");
+        // Flush any buffered work to deletion lists
+        self.flush().await?;
+
+        // Flush execution of deletion lists
+        let (tx, rx) = tokio::sync::oneshot::channel::<()>();
+        debug!("flush_execute: flushing execution...");
+        self.do_flush(FrontendQueueMessage::FlushExecute(FlushOp { tx }), rx)
+            .await?;
+        debug!("flush_execute: finished flushing execution...");
+        Ok(())
+    }
+
+    /// This interface bypasses the persistent deletion queue, and any validation
+    /// that this pageserver is still eligible to execute the deletions.  It is for
+    /// use in timeline deletions, where the control plane is telling us we may
+    /// delete everything in the timeline.
+    ///
+    /// DO NOT USE THIS FROM GC OR COMPACTION CODE.  Use the regular `push_layers`.
+    pub(crate) async fn push_immediate(
+        &self,
+        objects: Vec<RemotePath>,
+    ) -> Result<(), DeletionQueueError> {
+        self.executor_tx
+            .send(ExecutorMessage::Delete(objects))
+            .await
+            .map_err(|_| DeletionQueueError::ShuttingDown)
+    }
+
+    /// Companion to push_immediate.  When this returns Ok, all prior objects sent
+    /// into push_immediate have been deleted from remote storage.
+    pub(crate) async fn flush_immediate(&self) -> Result<(), DeletionQueueError> {
+        let (tx, rx) = tokio::sync::oneshot::channel::<()>();
+        self.executor_tx
+            .send(ExecutorMessage::Flush(FlushOp { tx }))
+            .await
+            .map_err(|_| DeletionQueueError::ShuttingDown)?;
+
+        rx.await.map_err(|_| DeletionQueueError::ShuttingDown)
+    }
+}
+
+impl DeletionQueue {
+    pub fn new_client(&self) -> DeletionQueueClient {
+        self.client.clone()
+    }
+
+    /// Caller may use the returned object to construct clients with new_client.
+    /// Caller should tokio::spawn the background() members of the three worker objects returned:
+    /// we don't spawn those inside new() so that the caller can use their runtime/spans of choice.
+    ///
+    /// If remote_storage is None, then the returned workers will also be None.
+    pub fn new(
+        remote_storage: Option<GenericRemoteStorage>,
+        conf: &'static PageServerConf,
+        cancel: CancellationToken,
+    ) -> (
+        Self,
+        Option<FrontendQueueWorker>,
+        Option<BackendQueueWorker>,
+        Option<ExecutorWorker>,
+    ) {
+        // Deep channel: it consumes deletions from all timelines and we do not want to block them
+        let (tx, rx) = tokio::sync::mpsc::channel(16384);
+
+        // Shallow channel: it carries DeletionLists which each contain up to thousands of deletions
+        let (backend_tx, backend_rx) = tokio::sync::mpsc::channel(16);
+
+        // Shallow channel: it carries lists of paths, and we expect the main queueing to
+        // happen in the backend (persistent), not in this queue.
+        let (executor_tx, executor_rx) = tokio::sync::mpsc::channel(16);
+
+        let remote_storage = match remote_storage {
+            None => {
+                return (
+                    Self {
+                        client: DeletionQueueClient { tx, executor_tx },
+                    },
+                    None,
+                    None,
+                    None,
+                )
+            }
+            Some(r) => r,
+        };
+
+        (
+            Self {
+                client: DeletionQueueClient {
+                    tx,
+                    executor_tx: executor_tx.clone(),
+                },
+            },
+            Some(FrontendQueueWorker::new(
+                conf,
+                rx,
+                backend_tx,
+                cancel.clone(),
+            )),
+            Some(BackendQueueWorker::new(
+                conf,
+                backend_rx,
+                executor_tx,
+                cancel.clone(),
+            )),
+            Some(ExecutorWorker::new(
+                remote_storage,
+                executor_rx,
+                cancel.clone(),
+            )),
+        )
+    }
+}
+
+#[cfg(test)]
+mod test {
+    use hex_literal::hex;
+    use std::{
+        io::ErrorKind,
+        path::{Path, PathBuf},
+    };
+    use tracing::info;
+
+    use remote_storage::{RemoteStorageConfig, RemoteStorageKind};
+    use tokio::{runtime::EnterGuard, task::JoinHandle};
+
+    use crate::tenant::{harness::TenantHarness, remote_timeline_client::remote_timeline_path};
+
+    use super::*;
+    pub const TIMELINE_ID: TimelineId =
+        TimelineId::from_array(hex!("11223344556677881122334455667788"));
+
+    struct TestSetup {
+        runtime: &'static tokio::runtime::Runtime,
+        _entered_runtime: EnterGuard<'static>,
+        harness: TenantHarness,
+        remote_fs_dir: PathBuf,
+        storage: GenericRemoteStorage,
+        deletion_queue: DeletionQueue,
+        fe_worker: JoinHandle<()>,
+        be_worker: JoinHandle<()>,
+        ex_worker: JoinHandle<()>,
+    }
+
+    impl TestSetup {
+        /// Simulate a pageserver restart by destroying and recreating the deletion queue
+        fn restart(&mut self) {
+            let (deletion_queue, fe_worker, be_worker, ex_worker) = DeletionQueue::new(
+                Some(self.storage.clone()),
+                self.harness.conf,
+                CancellationToken::new(),
+            );
+
+            self.deletion_queue = deletion_queue;
+
+            let mut fe_worker = fe_worker.unwrap();
+            let mut be_worker = be_worker.unwrap();
+            let mut ex_worker = ex_worker.unwrap();
+            let mut fe_worker = self
+                .runtime
+                .spawn(async move { fe_worker.background().await });
+            let mut be_worker = self
+                .runtime
+                .spawn(async move { be_worker.background().await });
+            let mut ex_worker = self.runtime.spawn(async move {
+                drop(ex_worker.background().await);
+            });
+            std::mem::swap(&mut self.fe_worker, &mut fe_worker);
+            std::mem::swap(&mut self.be_worker, &mut be_worker);
+            std::mem::swap(&mut self.ex_worker, &mut ex_worker);
+
+            // Join the old workers
+            self.runtime.block_on(fe_worker).unwrap();
+            self.runtime.block_on(be_worker).unwrap();
+            self.runtime.block_on(ex_worker).unwrap();
+        }
+    }
+
+    fn setup(test_name: &str) -> anyhow::Result<TestSetup> {
+        let test_name = Box::leak(Box::new(format!("deletion_queue__{test_name}")));
+        let harness = TenantHarness::create(test_name)?;
+
+        // We do not load() the harness: we only need its config and remote_storage
+
+        // Set up a GenericRemoteStorage targeting a directory
+        let remote_fs_dir = harness.conf.workdir.join("remote_fs");
+        std::fs::create_dir_all(remote_fs_dir)?;
+        let remote_fs_dir = std::fs::canonicalize(harness.conf.workdir.join("remote_fs"))?;
+        let storage_config = RemoteStorageConfig {
+            max_concurrent_syncs: std::num::NonZeroUsize::new(
+                remote_storage::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS,
+            )
+            .unwrap(),
+            max_sync_errors: std::num::NonZeroU32::new(
+                remote_storage::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS,
+            )
+            .unwrap(),
+            storage: RemoteStorageKind::LocalFs(remote_fs_dir.clone()),
+        };
+        let storage = GenericRemoteStorage::from_config(&storage_config).unwrap();
+
+        let runtime = Box::leak(Box::new(
+            tokio::runtime::Builder::new_current_thread()
+                .enable_all()
+                .build()?,
+        ));
+        let entered_runtime = runtime.enter();
+
+        let (deletion_queue, fe_worker, be_worker, ex_worker) = DeletionQueue::new(
+            Some(storage.clone()),
+            harness.conf,
+            CancellationToken::new(),
+        );
+
+        let mut fe_worker = fe_worker.unwrap();
+        let mut be_worker = be_worker.unwrap();
+        let mut ex_worker = ex_worker.unwrap();
+        let fe_worker_join = runtime.spawn(async move { fe_worker.background().await });
+        let be_worker_join = runtime.spawn(async move { be_worker.background().await });
+        let ex_worker_join = runtime.spawn(async move {
+            drop(ex_worker.background().await);
+        });
+
+        Ok(TestSetup {
+            runtime,
+            _entered_runtime: entered_runtime,
+            harness,
+            remote_fs_dir,
+            storage,
+            deletion_queue,
+            fe_worker: fe_worker_join,
+            be_worker: be_worker_join,
+            ex_worker: ex_worker_join,
+        })
+    }
+
+    // TODO: put this in a common location so that we can share with remote_timeline_client's tests
+    fn assert_remote_files(expected: &[&str], remote_path: &Path) {
+        let mut expected: Vec<String> = expected.iter().map(|x| String::from(*x)).collect();
+        expected.sort();
+
+        let mut found: Vec<String> = Vec::new();
+        let dir = match std::fs::read_dir(remote_path) {
+            Ok(d) => d,
+            Err(e) => {
+                if e.kind() == ErrorKind::NotFound {
+                    if expected.is_empty() {
+                        // We are asserting prefix is empty: it is expected that the dir is missing
+                        return;
+                    } else {
+                        assert_eq!(expected, Vec::<String>::new());
+                        unreachable!();
+                    }
+                } else {
+                    panic!(
+                        "Unexpected error listing {0}: {e}",
+                        remote_path.to_string_lossy()
+                    );
+                }
+            }
+        };
+
+        for entry in dir.flatten() {
+            let entry_name = entry.file_name();
+            let fname = entry_name.to_str().unwrap();
+            found.push(String::from(fname));
+        }
+        found.sort();
+
+        assert_eq!(expected, found);
+    }
+
+    fn assert_local_files(expected: &[&str], directory: &Path) {
+        let mut dir = match std::fs::read_dir(directory) {
+            Ok(d) => d,
+            Err(_) => {
+                assert_eq!(expected, &Vec::<String>::new());
+                return;
+            }
+        };
+        let mut found = Vec::new();
+        while let Some(dentry) = dir.next() {
+            let dentry = dentry.unwrap();
+            let file_name = dentry.file_name();
+            let file_name_str = file_name.to_string_lossy();
+            found.push(file_name_str.to_string());
+        }
+        found.sort();
+        assert_eq!(expected, found);
+    }
+
+    #[test]
+    fn deletion_queue_smoke() -> anyhow::Result<()> {
+        // Basic test that the deletion queue processes the deletions we pass into it
+        let ctx = setup("deletion_queue_smoke").expect("Failed test setup");
+        let client = ctx.deletion_queue.new_client();
+
+        let layer_file_name_1: LayerFileName = "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap();
+        let tenant_id = ctx.harness.tenant_id;
+
+        let content: Vec<u8> = "victim1 contents".into();
+        let relative_remote_path = remote_timeline_path(&tenant_id, &TIMELINE_ID);
+        let remote_timeline_path = ctx.remote_fs_dir.join(relative_remote_path.get_path());
+        let deletion_prefix = ctx.harness.conf.deletion_prefix();
+
+        // Exercise the distinction between the generation of the layers
+        // we delete, and the generation of the running Tenant.
+        let layer_generation = Generation::new(0xdeadbeef);
+        let now_generation = Generation::new(0xfeedbeef);
+
+        let remote_layer_file_name_1 =
+            format!("{}{}", layer_file_name_1, layer_generation.get_suffix());
+
+        // Inject a victim file to remote storage
+        info!("Writing");
+        std::fs::create_dir_all(&remote_timeline_path)?;
+        std::fs::write(
+            remote_timeline_path.join(remote_layer_file_name_1.clone()),
+            content,
+        )?;
+        assert_remote_files(&[&remote_layer_file_name_1], &remote_timeline_path);
+
+        // File should still be there after we push it to the queue (we haven't pushed enough to flush anything)
+        info!("Pushing");
+        ctx.runtime.block_on(client.push_layers(
+            tenant_id,
+            TIMELINE_ID,
+            now_generation,
+            [(layer_file_name_1.clone(), layer_generation)].to_vec(),
+        ))?;
+        assert_remote_files(&[&remote_layer_file_name_1], &remote_timeline_path);
+
+        assert_local_files(&[], &deletion_prefix);
+
+        // File should still be there after we write a deletion list (we haven't pushed enough to execute anything)
+        info!("Flushing");
+        ctx.runtime.block_on(client.flush())?;
+        assert_remote_files(&[&remote_layer_file_name_1], &remote_timeline_path);
+        assert_local_files(&["0000000000000001-01.list"], &deletion_prefix);
+
+        // File should go away when we execute
+        info!("Flush-executing");
+        ctx.runtime.block_on(client.flush_execute())?;
+        assert_remote_files(&[], &remote_timeline_path);
+        assert_local_files(&["header-01"], &deletion_prefix);
+
+        // Flushing on an empty queue should succeed immediately, and not write any lists
+        info!("Flush-executing on empty");
+        ctx.runtime.block_on(client.flush_execute())?;
+        assert_local_files(&["header-01"], &deletion_prefix);
+
+        Ok(())
+    }
+
+    #[test]
+    fn deletion_queue_recovery() -> anyhow::Result<()> {
+        // Test that a deletion list persisted before a restart is recovered and executed afterwards
+        let mut ctx = setup("deletion_queue_recovery").expect("Failed test setup");
+        let client = ctx.deletion_queue.new_client();
+
+        let layer_file_name_1: LayerFileName = "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap();
+        let tenant_id = ctx.harness.tenant_id;
+
+        let content: Vec<u8> = "victim1 contents".into();
+        let relative_remote_path = remote_timeline_path(&tenant_id, &TIMELINE_ID);
+        let remote_timeline_path = ctx.remote_fs_dir.join(relative_remote_path.get_path());
+        let deletion_prefix = ctx.harness.conf.deletion_prefix();
+        let layer_generation = Generation::new(0xdeadbeef);
+        let now_generation = Generation::new(0xfeedbeef);
+        let remote_layer_file_name_1 =
+            format!("{}{}", layer_file_name_1, layer_generation.get_suffix());
+
+        // Inject a file, delete it, and flush to a deletion list
+        std::fs::create_dir_all(&remote_timeline_path)?;
+        std::fs::write(
+            remote_timeline_path.join(remote_layer_file_name_1.clone()),
+            content,
+        )?;
+        ctx.runtime.block_on(client.push_layers(
+            tenant_id,
+            TIMELINE_ID,
+            now_generation,
+            [(layer_file_name_1.clone(), layer_generation)].to_vec(),
+        ))?;
+        ctx.runtime.block_on(client.flush())?;
+        assert_local_files(&["0000000000000001-01.list"], &deletion_prefix);
+
+        // Restart the deletion queue
+        drop(client);
+        ctx.restart();
+        let client = ctx.deletion_queue.new_client();
+
+        // If we have recovered the deletion list properly, then executing after restart should purge it
+        info!("Flush-executing");
+        ctx.runtime.block_on(client.flush_execute())?;
+        assert_remote_files(&[], &remote_timeline_path);
+        assert_local_files(&["header-01"], &deletion_prefix);
+        Ok(())
+    }
+}
+
+/// A lightweight queue which can issue ordinary DeletionQueueClient objects, but doesn't do any persistence
+/// or coalescing, and doesn't actually execute any deletions unless you call pump() to kick it.
+#[cfg(test)]
+pub mod mock {
+    use tracing::info;
+
+    use crate::tenant::remote_timeline_client::remote_layer_path;
+
+    use super::*;
+    use std::sync::{
+        atomic::{AtomicUsize, Ordering},
+        Arc,
+    };
+
+    pub struct MockDeletionQueue {
+        tx: tokio::sync::mpsc::Sender<FrontendQueueMessage>,
+        executor_tx: tokio::sync::mpsc::Sender<ExecutorMessage>,
+        tx_pump: tokio::sync::mpsc::Sender<FlushOp>,
+        executed: Arc<AtomicUsize>,
+    }
+
+    impl MockDeletionQueue {
+        pub fn new(remote_storage: Option<GenericRemoteStorage>) -> Self {
+            let (tx, mut rx) = tokio::sync::mpsc::channel(16384);
+            let (tx_pump, mut rx_pump) = tokio::sync::mpsc::channel::<FlushOp>(1);
+            let (executor_tx, mut executor_rx) = tokio::sync::mpsc::channel(16384);
+
+            let executed = Arc::new(AtomicUsize::new(0));
+            let executed_bg = executed.clone();
+
+            tokio::spawn(async move {
+                let remote_storage = match &remote_storage {
+                    Some(rs) => rs,
+                    None => {
+                        info!("No remote storage configured, deletion queue will not run");
+                        return;
+                    }
+                };
+                info!("Running mock deletion queue");
+                // Each time we are asked to pump, drain the queue of deletions
+                while let Some(flush_op) = rx_pump.recv().await {
+                    info!("Executing all pending deletions");
+
+                    // Execute any deletions that were sent directly to the executor channel
+                    while let Ok(msg) = executor_rx.try_recv() {
+                        match msg {
+                            ExecutorMessage::Delete(objects) => {
+                                for path in objects {
+                                    match remote_storage.delete(&path).await {
+                                        Ok(_) => {
+                                            debug!("Deleted {path}");
+                                        }
+                                        Err(e) => {
+                                            error!(
+                                                "Failed to delete {path}, leaking object! ({e})"
+                                            );
+                                        }
+                                    }
+                                    executed_bg.fetch_add(1, Ordering::Relaxed);
+                                }
+                            }
+                            ExecutorMessage::Flush(flush_op) => {
+                                flush_op.fire();
+                            }
+                        }
+                    }
+
+                    while let Ok(msg) = rx.try_recv() {
+                        match msg {
+                            FrontendQueueMessage::Delete(op) => {
+                                let mut objects = op.objects;
+                                for (layer, generation) in op.layers {
+                                    objects.push(remote_layer_path(
+                                        &op.tenant_id,
+                                        &op.timeline_id,
+                                        &layer,
+                                        generation,
+                                    ));
+                                }
+
+                                for path in objects {
+                                    info!("Executing deletion {path}");
+                                    match remote_storage.delete(&path).await {
+                                        Ok(_) => {
+                                            debug!("Deleted {path}");
+                                        }
+                                        Err(e) => {
+                                            error!(
+                                                "Failed to delete {path}, leaking object! ({e})"
+                                            );
+                                        }
+                                    }
+                                    executed_bg.fetch_add(1, Ordering::Relaxed);
+                                }
+                            }
+                            FrontendQueueMessage::Flush(op) => {
+                                op.fire();
+                            }
+                            FrontendQueueMessage::FlushExecute(op) => {
+                                // We have already executed all prior deletions because mock does them inline
+                                op.fire();
+                            }
+                        }
+                        info!("All pending deletions have been executed");
+                    }
+                    flush_op
+                        .tx
+                        .send(())
+                        .expect("Test called flush but dropped before finishing");
+                }
+            });
+
+            Self {
+                tx,
+                tx_pump,
+                executor_tx,
+                executed,
+            }
+        }
+
+        pub fn get_executed(&self) -> usize {
+            self.executed.load(Ordering::Relaxed)
+        }
+
+        pub async fn pump(&self) {
+            let (tx, rx) = tokio::sync::oneshot::channel();
+            self.tx_pump
+                .send(FlushOp { tx })
+                .await
+                .expect("pump called after deletion queue loop stopped");
+            rx.await
+                .expect("Mock delete queue shutdown while waiting to pump");
+        }
+
+        pub(crate) fn new_client(&self) -> DeletionQueueClient {
+            DeletionQueueClient {
+                tx: self.tx.clone(),
+                executor_tx: self.executor_tx.clone(),
+            }
+        }
+    }
+}
--- a/pageserver/src/deletion_queue/backend.rs
+++ b/pageserver/src/deletion_queue/backend.rs
@@ -0,0 +1,300 @@
+use std::collections::HashMap;
+use std::time::Duration;
+
+use futures::future::TryFutureExt;
+use pageserver_api::control_api::HexTenantId;
+use pageserver_api::control_api::{ValidateRequest, ValidateRequestTenant, ValidateResponse};
+use serde::de::DeserializeOwned;
+use tokio_util::sync::CancellationToken;
+use tracing::debug;
+use tracing::info;
+use tracing::warn;
+use utils::backoff;
+
+use crate::config::PageServerConf;
+use crate::metrics::DELETION_QUEUE_ERRORS;
+
+use super::executor::ExecutorMessage;
+use super::DeletionHeader;
+use super::DeletionList;
+use super::DeletionQueueError;
+use super::FlushOp;
+
+// After this length of time, execute deletions which are eligible to run,
+// even if we haven't accumulated enough for a full-sized DeleteObjects
+const EXECUTE_IDLE_DEADLINE: Duration = Duration::from_secs(60);
+
+// If we have received this number of keys, proceed with attempting to execute
+const AUTOFLUSH_KEY_COUNT: usize = 16384;
+
+#[derive(Debug)]
+pub(super) enum BackendQueueMessage {
+    Delete(DeletionList),
+    Flush(FlushOp),
+}
+pub struct BackendQueueWorker {
+    conf: &'static PageServerConf,
+    rx: tokio::sync::mpsc::Receiver<BackendQueueMessage>,
+    tx: tokio::sync::mpsc::Sender<ExecutorMessage>,
+
+    // Accumulate some lists to execute in a batch.
+    // The purpose of this accumulation is to implement batched validation of
+    // attachment generations, when split-brain protection is implemented.
+    // (see https://github.com/neondatabase/neon/pull/4919)
+    pending_lists: Vec<DeletionList>,
+
+    // Sum of all the lengths of lists in pending_lists
+    pending_key_count: usize,
+
+    // DeletionLists we have fully executed, whose persisted list files
+    // may now be removed from local disk.
+    executed_lists: Vec<DeletionList>,
+
+    cancel: CancellationToken,
+}
+
+#[derive(thiserror::Error, Debug)]
+enum ValidateCallError {
+    #[error("shutdown")]
+    Shutdown,
+    #[error("remote: {0}")]
+    Remote(reqwest::Error),
+}
+
+async fn retry_http_forever<T>(
+    url: &url::Url,
+    request: ValidateRequest,
+    cancel: CancellationToken,
+) -> Result<T, DeletionQueueError>
+where
+    T: DeserializeOwned,
+{
+    let client = reqwest::ClientBuilder::new()
+        .build()
+        .expect("Failed to construct http client");
+
+    let response = match backoff::retry(
+        || {
+            client
+                .post(url.clone())
+                .json(&request)
+                .send()
+                .map_err(ValidateCallError::Remote)
+        },
+        |_| false,
+        3,
+        u32::MAX,
+        "calling control plane generation validation API",
+        backoff::Cancel::new(cancel.clone(), || ValidateCallError::Shutdown),
+    )
+    .await
+    {
+        Err(ValidateCallError::Shutdown) => {
+            return Err(DeletionQueueError::ShuttingDown);
+        }
+        Err(ValidateCallError::Remote(_)) => {
+            panic!("We retry forever");
+        }
+        Ok(r) => r,
+    };
+
+    // TODO: handle non-200 response
+    // TODO: handle decode error
+    Ok(response.json::<T>().await.unwrap())
+}
+
+impl BackendQueueWorker {
+    pub(super) fn new(
+        conf: &'static PageServerConf,
+        rx: tokio::sync::mpsc::Receiver<BackendQueueMessage>,
+        tx: tokio::sync::mpsc::Sender<ExecutorMessage>,
+        cancel: CancellationToken,
+    ) -> Self {
+        Self {
+            conf,
+            rx,
+            tx,
+            pending_lists: Vec::new(),
+            pending_key_count: 0,
+            executed_lists: Vec::new(),
+            cancel,
+        }
+    }
+
+    async fn cleanup_lists(&mut self) {
+        debug!(
+            "cleanup_lists: {0} executed lists, {1} pending lists",
+            self.executed_lists.len(),
+            self.pending_lists.len()
+        );
+
+        // Lists are always pushed into the queues + executed list in sequence order, so
+        // no sort is required: can find the highest sequence number by peeking at last element
+        let max_executed_seq = match self.executed_lists.last() {
+            Some(v) => v.sequence,
+            None => {
+                // No executed lists, nothing to clean up.
+                return;
+            }
+        };
+
+        // In case this is the last list, write a header out first so that
+        // we don't risk losing our knowledge of the sequence number (on replay, our
+        // next sequence number is the highest list seen + 1, or read from the header
+        // if there are no lists)
+        let header = DeletionHeader::new(max_executed_seq);
+        debug!("Writing header {:?}", header);
+        let header_bytes =
+            serde_json::to_vec(&header).expect("Failed to serialize deletion header");
+        let header_path = self.conf.deletion_header_path();
+
+        if let Err(e) = tokio::fs::write(&header_path, header_bytes).await {
+            warn!("Failed to write deletion queue header: {e:#}");
+            DELETION_QUEUE_ERRORS
+                .with_label_values(&["put_header"])
+                .inc();
+            return;
+        }
+
+        while let Some(list) = self.executed_lists.pop() {
+            let list_path = self.conf.deletion_list_path(list.sequence);
+            if let Err(e) = tokio::fs::remove_file(&list_path).await {
+                // Unexpected: we should have permissions and nothing else should
+                // be touching these files
+                tracing::error!("Failed to delete {0}: {e:#}", list_path.display());
+                self.executed_lists.push(list);
+                break;
+            }
+        }
+    }
+
+    pub async fn validate_lists(&mut self) -> Result<(), DeletionQueueError> {
+        let control_plane_api = match &self.conf.control_plane_api {
+            None => {
+                // Generations are not switched on yet.
+                return Ok(());
+            }
+            Some(api) => api,
+        };
+
+        let validate_path = control_plane_api
+            .join("validate")
+            .expect("Failed to build validate path");
+
+        for list in &mut self.pending_lists {
+            let request = ValidateRequest {
+                tenants: list
+                    .tenants
+                    .iter()
+                    .map(|(tid, tdl)| ValidateRequestTenant {
+                        id: HexTenantId::new(*tid),
+                        gen: tdl.generation.into().expect(
+                            "Generation should always be valid for a Tenant doing deletions",
+                        ),
+                    })
+                    .collect(),
+            };
+
+            // Retry forever, we cannot make progress until we get a response
+            let response: ValidateResponse =
+                retry_http_forever(&validate_path, request, self.cancel.clone()).await?;
+
+            let tenants_valid: HashMap<_, _> = response
+                .tenants
+                .into_iter()
+                .map(|t| (t.id.take(), t.valid))
+                .collect();
+
+            // Filter the list based on whether the server responded valid: true.
+            // If a tenant is omitted in the response, it has been deleted, and we should
+            // proceed with deletion.
+            list.tenants.retain(|tenant_id, _tenant| {
+                let r = tenants_valid.get(tenant_id).copied().unwrap_or(true);
+                if !r {
+                    warn!("Dropping stale deletions for tenant {tenant_id}, objects may be leaked");
+                }
+                r
+            });
+        }
+
+        Ok(())
+    }
+
+    pub async fn flush(&mut self) {
+        // Issue any required generation validation calls to the control plane
+        if let Err(DeletionQueueError::ShuttingDown) = self.validate_lists().await {
+            warn!("Shutting down");
+            return;
+        }
+
+        // Submit all keys from pending DeletionLists into the executor
+        for list in self.pending_lists.drain(..) {
+            let objects = list.take_paths();
+            if let Err(_e) = self.tx.send(ExecutorMessage::Delete(objects)).await {
+                warn!("Shutting down");
+                return;
+            };
+        }
+
+        // Flush the executor to ensure all the operations we just submitted have been executed
+        let (tx, rx) = tokio::sync::oneshot::channel::<()>();
+        let flush_op = FlushOp { tx };
+        if let Err(_e) = self.tx.send(ExecutorMessage::Flush(flush_op)).await {
+            warn!("Shutting down");
+            return;
+        };
+        if rx.await.is_err() {
+            warn!("Shutting down");
+            return;
+        }
+
+        // After flush, we are assured that all contents of the pending lists
+        // are executed
+        self.pending_key_count = 0;
+        self.executed_lists.append(&mut self.pending_lists);
+
+        // Erase the lists we executed
+        self.cleanup_lists().await;
+    }
+
+    pub async fn background(&mut self) {
+        // TODO: if we would like to be able to defer deletions while a Layer still has
+        // refs (but it will be eligible for deletion after process ends), then we may
+        // add an ephemeral part to BackendQueueMessage::Delete that tracks which keys
+        // in the deletion list may not be deleted yet, with guards to block on while
+        // we wait to proceed.
+
+        loop {
+            let msg = match tokio::time::timeout(EXECUTE_IDLE_DEADLINE, self.rx.recv()).await {
+                Ok(Some(m)) => m,
+                Ok(None) => {
+                    // All queue senders closed
+                    info!("Shutting down");
+                    break;
+                }
+                Err(_) => {
+                    // Timeout, we hit deadline to execute whatever we have in hand.  These functions will
+                    // return immediately if no work is pending
+                    self.flush().await;
+
+                    continue;
+                }
+            };
+
+            match msg {
+                BackendQueueMessage::Delete(list) => {
+                    self.pending_key_count += list.len();
+                    self.pending_lists.push(list);
+
+                    if self.pending_key_count > AUTOFLUSH_KEY_COUNT {
+                        self.flush().await;
+                    }
+                }
+                BackendQueueMessage::Flush(op) => {
+                    self.flush().await;
+                    op.fire();
+                }
+            }
+        }
+    }
+}
--- a/pageserver/src/deletion_queue/executor.rs
+++ b/pageserver/src/deletion_queue/executor.rs
@@ -0,0 +1,143 @@
+use remote_storage::GenericRemoteStorage;
+use remote_storage::RemotePath;
+use remote_storage::MAX_KEYS_PER_DELETE;
+use std::time::Duration;
+use tokio_util::sync::CancellationToken;
+use tracing::info;
+use tracing::warn;
+
+use crate::metrics::DELETION_QUEUE_ERRORS;
+use crate::metrics::DELETION_QUEUE_EXECUTED;
+
+use super::DeletionQueueError;
+use super::FlushOp;
+
+const AUTOFLUSH_INTERVAL: Duration = Duration::from_secs(10);
+
+pub(super) enum ExecutorMessage {
+    Delete(Vec<RemotePath>),
+    Flush(FlushOp),
+}
+
+/// Non-persistent deletion queue, for coalescing multiple object deletes into
+/// larger DeleteObjects requests.
+pub struct ExecutorWorker {
+    // Accumulate up to 1000 keys for the next deletion operation
+    accumulator: Vec<RemotePath>,
+
+    rx: tokio::sync::mpsc::Receiver<ExecutorMessage>,
+
+    cancel: CancellationToken,
+    remote_storage: GenericRemoteStorage,
+}
+
+impl ExecutorWorker {
+    pub(super) fn new(
+        remote_storage: GenericRemoteStorage,
+        rx: tokio::sync::mpsc::Receiver<ExecutorMessage>,
+        cancel: CancellationToken,
+    ) -> Self {
+        Self {
+            remote_storage,
+            rx,
+            cancel,
+            accumulator: Vec::new(),
+        }
+    }
+
+    /// Wrap the remote `delete_objects` with a failpoint
+    pub async fn remote_delete(&self) -> Result<(), anyhow::Error> {
+        fail::fail_point!("deletion-queue-before-execute", |_| {
+            info!("Skipping execution, failpoint set");
+            DELETION_QUEUE_ERRORS
+                .with_label_values(&["failpoint"])
+                .inc();
+            Err(anyhow::anyhow!("failpoint hit"))
+        });
+
+        self.remote_storage.delete_objects(&self.accumulator).await
+    }
+
+    /// Block until everything in accumulator has been executed
+    pub async fn flush(&mut self) -> Result<(), DeletionQueueError> {
+        while !self.accumulator.is_empty() && !self.cancel.is_cancelled() {
+            match self.remote_delete().await {
+                Ok(()) => {
+                    // Note: we assume that the remote storage layer returns Ok(()) if some
+                    // or all of the deleted objects were already gone.
+                    DELETION_QUEUE_EXECUTED.inc_by(self.accumulator.len() as u64);
+                    info!(
+                        "Executed deletion batch {}..{}",
+                        self.accumulator
+                            .first()
+                            .expect("accumulator should be non-empty"),
+                        self.accumulator
+                            .last()
+                            .expect("accumulator should be non-empty"),
+                    );
+                    self.accumulator.clear();
+                }
+                Err(e) => {
+                    warn!("DeleteObjects request failed: {e:#}, will retry");
+                    DELETION_QUEUE_ERRORS.with_label_values(&["execute"]).inc();
+                }
+            };
+        }
+        if self.cancel.is_cancelled() {
+            // Expose an error because we may not have actually flushed everything
+            Err(DeletionQueueError::ShuttingDown)
+        } else {
+            Ok(())
+        }
+    }
+
+    pub async fn background(&mut self) -> Result<(), DeletionQueueError> {
+        self.accumulator.reserve(MAX_KEYS_PER_DELETE);
+
+        loop {
+            if self.cancel.is_cancelled() {
+                return Err(DeletionQueueError::ShuttingDown);
+            }
+
+            let msg = match tokio::time::timeout(AUTOFLUSH_INTERVAL, self.rx.recv()).await {
+                Ok(Some(m)) => m,
+                Ok(None) => {
+                    // All queue senders closed
+                    info!("Shutting down");
+                    return Err(DeletionQueueError::ShuttingDown);
+                }
+                Err(_) => {
+                    // Timeout, we hit deadline to execute whatever we have in hand.  flush() will
+                    // return immediately if no work is pending
+                    self.flush().await?;
+
+                    continue;
+                }
+            };
+
+            match msg {
+                ExecutorMessage::Delete(mut list) => {
+                    while !list.is_empty() || self.accumulator.len() == MAX_KEYS_PER_DELETE {
+                        if self.accumulator.len() == MAX_KEYS_PER_DELETE {
+                            self.flush().await?;
+                            // flush() only returns Ok once the accumulator has been fully drained
+                            assert_eq!(self.accumulator.len(), 0);
+                        }
+
+                        let available_slots = MAX_KEYS_PER_DELETE - self.accumulator.len();
+                        let take_count = std::cmp::min(available_slots, list.len());
+                        for path in list.drain(list.len() - take_count..) {
+                            self.accumulator.push(path);
+                        }
+                    }
+                }
+                ExecutorMessage::Flush(flush_op) => {
+                    // If flush() errors, we drop the flush_op and the caller will get
+                    // an error recv()'ing their oneshot channel.
+                    self.flush().await?;
+                    flush_op.fire();
+                }
+            }
+        }
+    }
+}
--- a/pageserver/src/deletion_queue/frontend.rs
+++ b/pageserver/src/deletion_queue/frontend.rs
@@ -0,0 +1,376 @@
+use super::BackendQueueMessage;
+use super::DeletionHeader;
+use super::DeletionList;
+use super::FlushOp;
+
+use std::fs::create_dir_all;
+use std::time::Duration;
+
+use regex::Regex;
+use remote_storage::RemotePath;
+use tokio_util::sync::CancellationToken;
+use tracing::debug;
+use tracing::info;
+use tracing::warn;
+use utils::generation::Generation;
+use utils::id::TenantId;
+use utils::id::TimelineId;
+
+use crate::config::PageServerConf;
+use crate::metrics::DELETION_QUEUE_ERRORS;
+use crate::metrics::DELETION_QUEUE_SUBMITTED;
+use crate::tenant::remote_timeline_client::remote_layer_path;
+use crate::tenant::storage_layer::LayerFileName;
+
+// The number of keys in a DeletionList before we will proactively persist it
+// (without reaching a flush deadline).  This aims to deliver objects of the order
+// of magnitude 1MB when we are under heavy delete load.
+const DELETION_LIST_TARGET_SIZE: usize = 16384;
+
+// Ordinarily, we only flush to DeletionList periodically, to bound the window during
+// which we might leak objects from not flushing a DeletionList after
+// the objects are already unlinked from timeline metadata.
+const FRONTEND_DEFAULT_TIMEOUT: Duration = Duration::from_millis(10000);
+
+// If someone is waiting for a flush to DeletionList, only delay a little to accumulate
+// more objects before doing the flush.
+const FRONTEND_FLUSHING_TIMEOUT: Duration = Duration::from_millis(100);
+
+#[derive(Debug)]
+pub(super) struct DeletionOp {
+    pub(super) tenant_id: TenantId,
+    pub(super) timeline_id: TimelineId,
+    // `layers` and `objects` are both just lists of objects.  `layers` is used if you do not
+    // have a config object handy to project it to a remote key, and need the consuming worker
+    // to do it for you.
+    pub(super) layers: Vec<(LayerFileName, Generation)>,
+    pub(super) objects: Vec<RemotePath>,
+
+    /// The _current_ generation of the Tenant attachment in which we are enqueuing
+    /// this deletion.
+    pub(super) generation: Generation,
+}
+
+#[derive(Debug)]
+pub(super) enum FrontendQueueMessage {
+    Delete(DeletionOp),
+    // Wait until all prior deletions make it into a persistent DeletionList
+    Flush(FlushOp),
+    // Wait until all prior deletions have been executed (i.e. objects are actually deleted)
+    FlushExecute(FlushOp),
+}
+
+pub struct FrontendQueueWorker {
+    conf: &'static PageServerConf,
+
+    // Incoming frontend requests to delete some keys
+    rx: tokio::sync::mpsc::Receiver<FrontendQueueMessage>,
+
+    // Outbound requests to the backend to execute deletion lists we have composed.
+    tx: tokio::sync::mpsc::Sender<BackendQueueMessage>,
+
+    // The list we are currently building, contains a buffer of keys to delete
+    // and our next sequence number
+    pending: DeletionList,
+
+    // These FlushOps should fire the next time we flush
+    pending_flushes: Vec<FlushOp>,
+
+    // Worker loop is torn down when this fires.
+    cancel: CancellationToken,
+}
+
+impl FrontendQueueWorker {
+    pub(super) fn new(
+        conf: &'static PageServerConf,
+        rx: tokio::sync::mpsc::Receiver<FrontendQueueMessage>,
+        tx: tokio::sync::mpsc::Sender<BackendQueueMessage>,
+        cancel: CancellationToken,
+    ) -> Self {
+        Self {
+            pending: DeletionList::new(1),
+            conf,
+            rx,
+            tx,
+            pending_flushes: Vec::new(),
+            cancel,
+        }
+    }
+    async fn upload_pending_list(&mut self) -> anyhow::Result<()> {
+        let path = self.conf.deletion_list_path(self.pending.sequence);
+
+        let bytes = serde_json::to_vec(&self.pending).expect("Failed to serialize deletion list");
+        tokio::fs::write(&path, &bytes).await?;
+        tokio::fs::File::open(&path).await?.sync_all().await?;
+        Ok(())
+    }
+
+    /// Try to flush the pending deletion list to persistent storage.
+    ///
+    /// This does not return errors, because on failure to flush we do not lose
+    /// any state: flushing will be retried implicitly on the next deadline.
+    async fn flush(&mut self) {
+        if self.pending.is_empty() {
+            for f in self.pending_flushes.drain(..) {
+                f.fire();
+            }
+            return;
+        }
+
+        match self.upload_pending_list().await {
+            Ok(_) => {
+                info!(sequence = self.pending.sequence, "Stored deletion list");
+
+                for f in self.pending_flushes.drain(..) {
+                    f.fire();
+                }
+
+                let onward_list = self.pending.drain();
+
+                // We have consumed out of pending: reset it for the next incoming deletions to accumulate there
+                self.pending = DeletionList::new(self.pending.sequence + 1);
+
+                if let Err(e) = self.tx.send(BackendQueueMessage::Delete(onward_list)).await {
+                    // This is allowed to fail: it will only happen if the backend worker is shut down,
+                    // so we can just drop this on the floor.
+                    info!("Deletion list dropped, this is normal during shutdown ({e:#})");
+                }
+            }
+            Err(e) => {
+                DELETION_QUEUE_ERRORS.with_label_values(&["put_list"]).inc();
+                warn!(
+                    sequence = self.pending.sequence,
+                    "Failed to write deletion list, will retry later ({e:#})"
+                );
+            }
+        }
+    }
+
+    async fn recover(&mut self) -> Result<(), anyhow::Error> {
+        // Load header: this is not required to be present, e.g. when a pageserver first runs
+        let header_path = self.conf.deletion_header_path();
+
+        // Synchronous, but we only do it once per process lifetime so it's tolerable
+        create_dir_all(&self.conf.deletion_prefix())?;
+
+        let header_bytes = match tokio::fs::read(&header_path).await {
+            Ok(h) => Ok(Some(h)),
+            Err(e) => {
+                if e.kind() == std::io::ErrorKind::NotFound {
+                    debug!(
+                        "Deletion header {0} not found, first start?",
+                        header_path.display()
+                    );
+                    Ok(None)
+                } else {
+                    Err(e)
+                }
+            }
+        }?;
+
+        if let Some(header_bytes) = header_bytes {
+            if let Some(header) = match serde_json::from_slice::<DeletionHeader>(&header_bytes) {
+                Ok(h) => Some(h),
+                Err(e) => {
+                    warn!(
+                        "Failed to deserialize deletion header, ignoring {0}: {e:#}",
+                        header_path.display()
+                    );
+                    // This should never happen unless we make a mistake with our serialization.
+                    // Ignoring a deletion header is not consequential for correctness because all deletions
+                    // are ultimately allowed to fail: worst case we leak some objects for the scrubber to clean up.
+                    None
+                }
+            } {
+                self.pending.sequence =
+                    std::cmp::max(self.pending.sequence, header.last_deleted_list_seq + 1);
+            };
+        };
+
+        let mut dir = match tokio::fs::read_dir(&self.conf.deletion_prefix()).await {
+            Ok(d) => d,
+            Err(e) => {
+                warn!(
+                    "Failed to open deletion list directory {0}: {e:#}",
+                    self.conf.deletion_prefix().display()
+                );
+
+                // Give up: if we can't read the deletion list directory, we probably can't
+                // write lists into it later, so the queue won't work.
+                return Err(e.into());
+            }
+        };
+
+        let list_name_pattern = Regex::new(r"([a-zA-Z0-9]{16})-([a-zA-Z0-9]{2})\.list").unwrap();
+
+        let mut seqs: Vec<u64> = Vec::new();
+        while let Some(dentry) = dir.next_entry().await? {
+            let file_name = dentry.file_name().to_owned();
+            let basename = file_name.to_string_lossy();
+            let seq_part = if let Some(m) = list_name_pattern.captures(&basename) {
+                m.get(1)
+                    .expect("Non optional group should be present")
+                    .as_str()
+            } else {
+                warn!("Unexpected key in deletion queue: {basename}");
+                continue;
+            };
+
+            let seq: u64 = match u64::from_str_radix(seq_part, 16) {
+                Ok(s) => s,
+                Err(e) => {
+                    warn!("Malformed key '{basename}': {e}");
+                    continue;
+                }
+            };
+            seqs.push(seq);
+        }
+        seqs.sort();
+
+        // Initialize the next sequence number in the frontend based on the maximum of the highest list we see,
+        // and the last list that was deleted according to the header.  Combined with writing out the header
+        // prior to deletions, this guarantees no re-use of sequence numbers.
+        if let Some(max_list_seq) = seqs.last() {
+            self.pending.sequence = std::cmp::max(self.pending.sequence, max_list_seq + 1);
+        }
+
+        for s in seqs {
+            let list_path = self.conf.deletion_list_path(s);
+            let list_bytes = tokio::fs::read(&list_path).await?;
+
+            let deletion_list = match serde_json::from_slice::<DeletionList>(&list_bytes) {
+                Ok(l) => l,
+                Err(e) => {
+                    // Drop the list on the floor: any objects it referenced will be left behind
+                    // for scrubbing to clean up.  This should never happen unless we have a serialization bug.
+                    warn!(sequence = s, "Failed to deserialize deletion list: {e}");
+                    continue;
+                }
+            };
+
+            DELETION_QUEUE_SUBMITTED.inc_by(deletion_list.len() as u64);
+            // We will drop out of recovery if this send fails: it indicates that we are
+            // shutting down or the backend has panicked.
+            self.tx
+                .send(BackendQueueMessage::Delete(deletion_list))
+                .await?;
+        }
+
+        info!(next_sequence = self.pending.sequence, "Replay complete");
+
+        Ok(())
+    }
+
+    /// This is the front-end ingest, where we bundle up deletion requests into DeletionLists
+    /// and write them out, for later execution by the backend worker.
+    pub async fn background(&mut self) {
+        info!("Started deletion frontend worker");
+
+        let mut recovered: bool = false;
+
+        while !self.cancel.is_cancelled() {
+            let timeout = if self.pending_flushes.is_empty() {
+                FRONTEND_DEFAULT_TIMEOUT
+            } else {
+                FRONTEND_FLUSHING_TIMEOUT
+            };
+
+            let msg = match tokio::time::timeout(timeout, self.rx.recv()).await {
+                Ok(Some(msg)) => msg,
+                Ok(None) => {
+                    // Queue sender destroyed, shutting down
+                    break;
+                }
+                Err(_) => {
+                    // Hit deadline, flush.
+                    self.flush().await;
+                    continue;
+                }
+            };
+
+            // On first message, do recovery.  This avoids unnecessary recovery very
+            // early in startup, and simplifies testing by avoiding a 404 reading the
+            // header on every first pageserver startup.
+            if !recovered {
+                // Before accepting any input from this pageserver lifetime, recover all persisted deletion lists
+                if let Err(e) = self.recover().await {
+                    // This should only happen in truly unrecoverable cases, like the recovery finding that the backend
+                    // queue receiver has been dropped.
+                    info!("Deletion queue recover aborted, deletion queue will not proceed ({e})");
+                    return;
+                } else {
+                    recovered = true;
+                }
+            }
+
+            match msg {
+                FrontendQueueMessage::Delete(op) => {
+                    debug!(
+                        "Delete: ingesting {0} layers, {1} other objects",
+                        op.layers.len(),
+                        op.objects.len()
+                    );
+
+                    let mut layer_paths = Vec::new();
+                    for (layer, generation) in op.layers {
+                        layer_paths.push(remote_layer_path(
+                            &op.tenant_id,
+                            &op.timeline_id,
+                            &layer,
+                            generation,
+                        ));
+                    }
+                    layer_paths.extend(op.objects);
+
+                    if self.pending.push(
+                        &op.tenant_id,
+                        &op.timeline_id,
+                        op.generation,
+                        &mut layer_paths,
+                    ) == false
+                    {
+                        self.flush().await;
+                        let retry = self.pending.push(
+                            &op.tenant_id,
+                            &op.timeline_id,
+                            op.generation,
+                            &mut layer_paths,
+                        );
+                        if !retry {
+                            // Unexpected: after we flush, we should have
+                            // drained self.pending, so a conflict on
+                            // generation numbers should be impossible.
+                            tracing::error!(
+                                "Failed to enqueue deletions, leaking objects.  This is a bug."
+                            );
+                        }
+                    }
+                }
+                FrontendQueueMessage::Flush(op) => {
+                    if self.pending.is_empty() {
+                        // Execute immediately
+                        debug!("Flush: No pending objects, flushing immediately");
+                        op.fire()
+                    } else {
+                        // Execute next time we flush
+                        debug!("Flush: adding to pending flush list for next deadline flush");
+                        self.pending_flushes.push(op);
+                    }
+                }
+                FrontendQueueMessage::FlushExecute(op) => {
+                    debug!("FlushExecute: passing through to backend");
+                    // We do not flush to a deletion list here: the client sends a Flush before the FlushExecute
+                    if let Err(e) = self.tx.send(BackendQueueMessage::Flush(op)).await {
+                        info!("Can't flush, shutting down ({e})");
+                        // Caller will get an error when the oneshot sender is dropped.
+                    }
+                }
+            }
+
+            if self.pending.len() > DELETION_LIST_TARGET_SIZE || !self.pending_flushes.is_empty() {
+                self.flush().await;
+            }
+        }
+        info!("Deletion queue shut down.");
+    }
+}
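
Reviewer sketch (not part of the patch): `recover()` above picks the next list sequence number as the maximum of three values: the initial value, the header's `last_deleted_list_seq + 1`, and the highest on-disk list sequence plus one. The standalone approximation below uses only the standard library. The names `parse_list_sequence` and `next_sequence` are hypothetical. The file-name layout (16 hex digits, a dash, two alphanumerics, `.list`) is inferred from the regex in the patch, and the meaning of the two-character field is an assumption left uninterpreted. The sketch is stricter than the patch's unanchored regex because it requires the whole name to match.

```rust
/// Extract the sequence number from a deletion list file name of the assumed
/// form `<16 hex digits>-<2 alphanumerics>.list`. Returns `None` otherwise.
fn parse_list_sequence(basename: &str) -> Option<u64> {
    let stem = basename.strip_suffix(".list")?;
    let (seq_part, suffix_part) = stem.split_once('-')?;
    if seq_part.len() != 16 || suffix_part.len() != 2 {
        return None;
    }
    // Reject signs and other characters that `from_str_radix` might tolerate.
    if !seq_part.chars().all(|c| c.is_ascii_hexdigit())
        || !suffix_part.chars().all(|c| c.is_ascii_alphanumeric())
    {
        return None;
    }
    u64::from_str_radix(seq_part, 16).ok()
}

/// Choose the next sequence number so that no previously used number is reused:
/// it must exceed both the last list recorded as deleted in the header and the
/// highest list still present in the directory.
fn next_sequence(initial: u64, last_deleted_list_seq: Option<u64>, basenames: &[&str]) -> u64 {
    let mut next = initial;
    if let Some(last_deleted) = last_deleted_list_seq {
        next = next.max(last_deleted + 1);
    }
    if let Some(max_seen) = basenames.iter().filter_map(|n| parse_list_sequence(n)).max() {
        next = next.max(max_seen + 1);
    }
    next
}
```

Unparseable names are skipped rather than treated as errors, mirroring the `warn!` and `continue` handling in the patch.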
--- a/pageserver/src/http/openapi_spec.yml
+++ b/pageserver/src/http/openapi_spec.yml
@@ -52,6 +52,29 @@ paths:
              schema:
                type: object

+  /v1/deletion_queue/flush:
+    parameters:
+      - name: execute
+        in: query
+        required: false
+        schema:
+          type: boolean
+        description:
+          If true, attempt to execute deletions.  If false, just flush deletions to persistent deletion lists.
+    put:
+      description: Flush the deletion queue, optionally executing any deletions currently enqueued
+      security: []
+      responses:
+        "200":
+          description: |
+            Flush completed: if execute was true, then enqueued deletions have been completed.  If execute was false,
+            then enqueued deletions have been persisted to deletion lists, and may have been completed.
+          content:
+            application/json:
+              schema:
+                type: object
+
+
  /v1/tenant/{tenant_id}:
    parameters:
      - name: tenant_id
@@ -383,7 +406,6 @@ paths:
        schema:
          type: string
          format: hex
-
    post:
      description: |
        Schedules attach operation to happen in the background for the given tenant.
@@ -1020,6 +1042,9 @@ components:
      properties:
        config:
          $ref: '#/components/schemas/TenantConfig'
+        generation:
+          type: integer
+          description: Attachment generation number.
    TenantConfigRequest:
      allOf:
        - $ref: '#/components/schemas/TenantConfig'
--- a/pageserver/src/http/routes.rs
+++ b/pageserver/src/http/routes.rs
@@ -23,6 +23,7 @@ use super::models::{
    TimelineCreateRequest, TimelineGcRequest, TimelineInfo,
 };
 use crate::context::{DownloadBehavior, RequestContext};
+use crate::deletion_queue::{DeletionQueue, DeletionQueueError};
 use crate::metrics::{StorageTimeOperation, STORAGE_TIME_GLOBAL};
 use crate::pgdatadir_mapping::LsnForTimestamp;
 use crate::task_mgr::TaskKind;
@@ -32,11 +33,13 @@ use crate::tenant::mgr::{
 };
 use crate::tenant::size::ModelInputs;
 use crate::tenant::storage_layer::LayerAccessStatsReset;
-use crate::tenant::{LogicalSizeCalculationCause, PageReconstructError, Timeline};
+use crate::tenant::timeline::Timeline;
+use crate::tenant::{LogicalSizeCalculationCause, PageReconstructError};
 use crate::{config::PageServerConf, tenant::mgr};
 use crate::{disk_usage_eviction_task, tenant};
 use utils::{
    auth::JwtAuth,
+    generation::Generation,
    http::{
        endpoint::{self, attach_openapi_ui, auth_middleware, check_permission_with},
        error::{ApiError, HttpErrorBody},
@@ -56,6 +59,7 @@ struct State {
    auth: Option<Arc<JwtAuth>>,
    allowlist_routes: Vec<Uri>,
    remote_storage: Option<GenericRemoteStorage>,
+    deletion_queue: DeletionQueue,
    broker_client: storage_broker::BrokerClientChannel,
    disk_usage_eviction_state: Arc<disk_usage_eviction_task::State>,
 }
@@ -65,6 +69,7 @@ impl State {
        conf: &'static PageServerConf,
        auth: Option<Arc<JwtAuth>>,
        remote_storage: Option<GenericRemoteStorage>,
+        deletion_queue: DeletionQueue,
        broker_client: storage_broker::BrokerClientChannel,
        disk_usage_eviction_state: Arc<disk_usage_eviction_task::State>,
    ) -> anyhow::Result<Self> {
@@ -78,6 +83,7 @@ impl State {
            allowlist_routes,
            remote_storage,
            broker_client,
+            deletion_queue,
            disk_usage_eviction_state,
        })
    }
@@ -472,7 +478,7 @@ async fn tenant_attach_handler(
    check_permission(&request, Some(tenant_id))?;

    let maybe_body: Option<TenantAttachRequest> = json_request_or_empty_body(&mut request).await?;
-    let tenant_conf = match maybe_body {
+    let tenant_conf = match &maybe_body {
        Some(request) => TenantConfOpt::try_from(&*request.config).map_err(ApiError::BadRequest)?,
        None => TenantConfOpt::default(),
    };
@@ -483,13 +489,30 @@ async fn tenant_attach_handler(

    let state = get_state(&request);

+    let generation = if state.conf.control_plane_api.is_some() {
+        // If we have been configured with a control plane URI, then generations are
+        // mandatory, as we will attempt to re-attach on startup.
+        maybe_body
+            .as_ref()
+            .map(|tar| tar.generation)
+            .flatten()
+            .map(Generation::new)
+            .ok_or(ApiError::BadRequest(anyhow!(
+                "generation attribute missing"
+            )))?
+    } else {
+        Generation::none()
+    };
+
    if let Some(remote_storage) = &state.remote_storage {
        mgr::attach_tenant(
            state.conf,
            tenant_id,
+            generation,
            tenant_conf,
            state.broker_client.clone(),
            remote_storage.clone(),
+            &state.deletion_queue,
            &ctx,
        )
        .instrument(info_span!("tenant_attach", %tenant_id))
@@ -552,6 +575,7 @@ async fn tenant_load_handler(
        tenant_id,
        state.broker_client.clone(),
        state.remote_storage.clone(),
+        &state.deletion_queue,
        &ctx,
    )
    .instrument(info_span!("load", %tenant_id))
@@ -867,6 +891,12 @@ async fn tenant_create_handler(
    let tenant_conf =
        TenantConfOpt::try_from(&request_data.config).map_err(ApiError::BadRequest)?;

+    // TODO: make generation mandatory here once control plane supports it.
+    let generation = request_data
+        .generation
+        .map(Generation::new)
+        .unwrap_or(Generation::none());
+
    let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Warn);

    let state = get_state(&request);
@@ -875,8 +905,10 @@ async fn tenant_create_handler(
        state.conf,
        tenant_conf,
        target_tenant_id,
+        generation,
        state.broker_client.clone(),
        state.remote_storage.clone(),
+        &state.deletion_queue,
        &ctx,
    )
    .instrument(info_span!("tenant_create", tenant_id = %target_tenant_id))
@@ -1117,6 +1149,48 @@ async fn always_panic_handler(
    json_response(StatusCode::NO_CONTENT, ())
 }

+async fn deletion_queue_flush(
+    r: Request<Body>,
+    cancel: CancellationToken,
+) -> Result<Response<Body>, ApiError> {
+    let state = get_state(&r);
+
+    if state.remote_storage.is_none() {
+        // Nothing to do if remote storage is disabled.
+        return json_response(StatusCode::OK, ());
+    }
+
+    let execute = parse_query_param(&r, "execute")?.unwrap_or(false);
+
+    let queue_client = state.deletion_queue.new_client();
+
+    tokio::select! {
+        flush_result = async {
+            if execute {
+                queue_client.flush_execute().await
+            } else {
+                queue_client.flush().await
+            }
+        } => {
+            match flush_result {
+                Ok(()) => {
+                    json_response(StatusCode::OK, ())
+                },
+                Err(e) => {
+                    match e {
+                        DeletionQueueError::ShuttingDown => {
+                            Err(ApiError::ShuttingDown)
+                        }
+                    }
+                }
+            }
+        },
+        _ = cancel.cancelled() => {
+            Err(ApiError::ShuttingDown)
+        }
+    }
+}
+
 async fn disk_usage_eviction_run(
    mut r: Request<Body>,
    _cancel: CancellationToken,
@@ -1326,6 +1400,7 @@ pub fn make_router(
    auth: Option<Arc<JwtAuth>>,
    broker_client: BrokerClientChannel,
    remote_storage: Option<GenericRemoteStorage>,
+    deletion_queue: DeletionQueue,
    disk_usage_eviction_state: Arc<disk_usage_eviction_task::State>,
 ) -> anyhow::Result<RouterBuilder<hyper::Body, ApiError>> {
    let spec = include_bytes!("openapi_spec.yml");
@@ -1355,6 +1430,7 @@ pub fn make_router(
                conf,
                auth,
                remote_storage,
+                deletion_queue,
                broker_client,
                disk_usage_eviction_state,
            )
@@ -1439,6 +1515,9 @@ pub fn make_router(
        .put("/v1/disk_usage_eviction/run", |r| {
            api_handler(r, disk_usage_eviction_run)
        })
+        .put("/v1/deletion_queue/flush", |r| {
+            api_handler(r, deletion_queue_flush)
+        })
        .put("/v1/tenant/:tenant_id/break", |r| {
            testing_api_handler("set tenant state to broken", r, handle_tenant_break)
        })
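
Reviewer sketch (not part of the patch): the attach handler above requires a generation only when a control plane API is configured, and otherwise falls back to `Generation::none()`. The rule is small enough to isolate. In the sketch below, the `Generation` enum, the `u32` payload, and the function `resolve_attach_generation` are hypothetical stand-ins. The real `utils::generation::Generation` type is not shown in this part of the diff.

```rust
/// Hypothetical stand-in for `utils::generation::Generation`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Generation {
    None,
    Valid(u32),
}

/// Decide which generation an attach request runs with.
///
/// With a control plane configured, the request must carry a generation,
/// because the pageserver will attempt to re-attach on startup. Without one,
/// generations are not used and any value in the request is ignored.
fn resolve_attach_generation(
    control_plane_configured: bool,
    requested: Option<u32>,
) -> Result<Generation, String> {
    if control_plane_configured {
        requested
            .map(Generation::Valid)
            .ok_or_else(|| "generation attribute missing".to_string())
    } else {
        Ok(Generation::None)
    }
}
```

The create handler in the same file is more lenient: per its TODO, it accepts a missing generation until the control plane supports it.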
--- a/pageserver/src/lib.rs
+++ b/pageserver/src/lib.rs
@@ -3,6 +3,7 @@ pub mod basebackup;
 pub mod config;
 pub mod consumption_metrics;
 pub mod context;
+pub mod deletion_queue;
 pub mod disk_usage_eviction_task;
 pub mod http;
 pub mod import_datadir;
--- a/pageserver/src/metrics.rs
+++ b/pageserver/src/metrics.rs
@@ -795,6 +795,31 @@ static REMOTE_TIMELINE_CLIENT_CALLS_STARTED_HIST: Lazy<HistogramVec> = Lazy::new
    .expect("failed to define a metric")
 });

+pub(crate) static DELETION_QUEUE_SUBMITTED: Lazy<IntCounter> = Lazy::new(|| {
+    register_int_counter!(
+        "pageserver_deletion_queue_submitted_total",
+        "Number of objects submitted for deletion"
+    )
+    .expect("failed to define a metric")
+});
+
+pub(crate) static DELETION_QUEUE_EXECUTED: Lazy<IntCounter> = Lazy::new(|| {
+    register_int_counter!(
+        "pageserver_deletion_queue_executed_total",
+        "Number of objects deleted"
+    )
+    .expect("failed to define a metric")
+});
+
+pub(crate) static DELETION_QUEUE_ERRORS: Lazy<IntCounterVec> = Lazy::new(|| {
+    register_int_counter_vec!(
+        "pageserver_deletion_queue_errors_total",
+        "Incremented on retryable remote I/O errors writing deletion lists or executing deletions.",
+        &["op_kind"],
+    )
+    .expect("failed to define a metric")
+});
+
 static REMOTE_TIMELINE_CLIENT_BYTES_STARTED_COUNTER: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_remote_timeline_client_bytes_started",
--- a/pageserver/src/page_cache.rs
+++ b/pageserver/src/page_cache.rs
@@ -75,10 +75,7 @@
 use std::{
    collections::{hash_map::Entry, HashMap},
    convert::TryInto,
-    sync::{
-        atomic::{AtomicU64, AtomicU8, AtomicUsize, Ordering},
-        RwLock, RwLockReadGuard, RwLockWriteGuard, TryLockError,
-    },
+    sync::atomic::{AtomicU64, AtomicU8, AtomicUsize, Ordering},
 };

 use anyhow::Context;
@@ -162,7 +159,7 @@ struct Version {
 }

 struct Slot {
-    inner: RwLock<SlotInner>,
+    inner: tokio::sync::RwLock<SlotInner>,
    usage_count: AtomicU8,
 }

@@ -203,6 +200,11 @@ impl Slot {
            Err(usage_count) => usage_count,
        }
    }
+
+    /// Sets the usage count to a specific value.
+    fn set_usage_count(&self, count: u8) {
+        self.usage_count.store(count, Ordering::Relaxed);
+    }
 }

 pub struct PageCache {
@@ -215,9 +217,9 @@ pub struct PageCache {
    ///
    /// If you add support for caching different kinds of objects, each object kind
    /// can have a separate mapping map, next to this field.
-    materialized_page_map: RwLock<HashMap<MaterializedPageHashKey, Vec<Version>>>,
+    materialized_page_map: std::sync::RwLock<HashMap<MaterializedPageHashKey, Vec<Version>>>,

-    immutable_page_map: RwLock<HashMap<(FileId, u32), usize>>,
+    immutable_page_map: std::sync::RwLock<HashMap<(FileId, u32), usize>>,

    /// The actual buffers with their metadata.
    slots: Box<[Slot]>,
@@ -233,7 +235,7 @@ pub struct PageCache {
 /// PageReadGuard is a "lease" on a buffer, for reading. The page is kept locked
 /// until the guard is dropped.
 ///
-pub struct PageReadGuard<'i>(RwLockReadGuard<'i, SlotInner>);
+pub struct PageReadGuard<'i>(tokio::sync::RwLockReadGuard<'i, SlotInner>);

 impl std::ops::Deref for PageReadGuard<'_> {
    type Target = [u8; PAGE_SZ];
@@ -260,9 +262,10 @@ impl AsRef<[u8; PAGE_SZ]> for PageReadGuard<'_> {
 /// to initialize.
 ///
 pub struct PageWriteGuard<'i> {
-    inner: RwLockWriteGuard<'i, SlotInner>,
+    inner: tokio::sync::RwLockWriteGuard<'i, SlotInner>,

    // Are the page contents currently valid?
+    // Used to mark pages as invalid that are assigned but not yet filled with data.
    valid: bool,
 }

@@ -337,7 +340,7 @@ impl PageCache {
    /// The 'lsn' is an upper bound, this will return the latest version of
    /// the given block, but not newer than 'lsn'. Returns the actual LSN of the
    /// returned page.
-    pub fn lookup_materialized_page(
+    pub async fn lookup_materialized_page(
        &self,
        tenant_id: TenantId,
        timeline_id: TimelineId,
@@ -357,7 +360,7 @@ impl PageCache {
            lsn,
        };

-        if let Some(guard) = self.try_lock_for_read(&mut cache_key) {
+        if let Some(guard) = self.try_lock_for_read(&mut cache_key).await {
            if let CacheKey::MaterializedPage {
                hash_key: _,
                lsn: available_lsn,
@@ -384,7 +387,7 @@ impl PageCache {
    ///
    /// Store an image of the given page in the cache.
    ///
-    pub fn memorize_materialized_page(
+    pub async fn memorize_materialized_page(
        &self,
        tenant_id: TenantId,
        timeline_id: TimelineId,
@@ -401,7 +404,7 @@ impl PageCache {
            lsn,
        };

-        match self.lock_for_write(&cache_key)? {
+        match self.lock_for_write(&cache_key).await? {
            WriteBufResult::Found(write_guard) => {
                // We already had it in cache. Another thread must've put it there
                // concurrently. Check that it had the same contents that we
@@ -419,31 +422,14 @@ impl PageCache {

    // Section 1.2: Public interface functions for working with immutable file pages.

-    pub fn read_immutable_buf(&self, file_id: FileId, blkno: u32) -> anyhow::Result<ReadBufResult> {
+    pub async fn read_immutable_buf(
+        &self,
+        file_id: FileId,
+        blkno: u32,
+    ) -> anyhow::Result<ReadBufResult> {
        let mut cache_key = CacheKey::ImmutableFilePage { file_id, blkno };

-        self.lock_for_read(&mut cache_key)
-    }
-
-    /// Immediately drop all buffers belonging to given file
-    pub fn drop_buffers_for_immutable(&self, drop_file_id: FileId) {
-        for slot_idx in 0..self.slots.len() {
-            let slot = &self.slots[slot_idx];
-
-            let mut inner = slot.inner.write().unwrap();
-            if let Some(key) = &inner.key {
-                match key {
-                    CacheKey::ImmutableFilePage { file_id, blkno: _ }
-                        if *file_id == drop_file_id =>
-                    {
-                        // remove mapping for old buffer
-                        self.remove_mapping(key);
-                        inner.key = None;
-                    }
-                    _ => {}
-                }
-            }
-        }
+        self.lock_for_read(&mut cache_key).await
    }

    //
@@ -463,14 +449,14 @@ impl PageCache {
    ///
    /// If no page is found, returns None and *cache_key is left unmodified.
    ///
-    fn try_lock_for_read(&self, cache_key: &mut CacheKey) -> Option<PageReadGuard> {
+    async fn try_lock_for_read(&self, cache_key: &mut CacheKey) -> Option<PageReadGuard> {
        let cache_key_orig = cache_key.clone();
        if let Some(slot_idx) = self.search_mapping(cache_key) {
            // The page was found in the mapping. Lock the slot, and re-check
            // that it's still what we expected (because we released the mapping
            // lock already, another thread could have evicted the page)
            let slot = &self.slots[slot_idx];
-            let inner = slot.inner.read().unwrap();
+            let inner = slot.inner.read().await;
            if inner.key.as_ref() == Some(cache_key) {
                slot.inc_usage_count();
                return Some(PageReadGuard(inner));
@@ -511,7 +497,7 @@ impl PageCache {
    /// }
    /// ```
    ///
-    fn lock_for_read(&self, cache_key: &mut CacheKey) -> anyhow::Result<ReadBufResult> {
+    async fn lock_for_read(&self, cache_key: &mut CacheKey) -> anyhow::Result<ReadBufResult> {
        let (read_access, hit) = match cache_key {
            CacheKey::MaterializedPage { .. } => {
                unreachable!("Materialized pages use lookup_materialized_page")
@@ -526,7 +512,7 @@ impl PageCache {
        let mut is_first_iteration = true;
        loop {
            // First check if the key already exists in the cache.
-            if let Some(read_guard) = self.try_lock_for_read(cache_key) {
+            if let Some(read_guard) = self.try_lock_for_read(cache_key).await {
                if is_first_iteration {
                    hit.inc();
                }
@@ -556,7 +542,7 @@ impl PageCache {
            // Make the slot ready
            let slot = &self.slots[slot_idx];
            inner.key = Some(cache_key.clone());
-            slot.usage_count.store(1, Ordering::Relaxed);
+            slot.set_usage_count(1);

            return Ok(ReadBufResult::NotFound(PageWriteGuard {
                inner,
@@ -569,13 +555,13 @@ impl PageCache {
    /// found, returns None.
    ///
    /// When locking a page for writing, the search criteria is always "exact".
-    fn try_lock_for_write(&self, cache_key: &CacheKey) -> Option<PageWriteGuard> {
+    async fn try_lock_for_write(&self, cache_key: &CacheKey) -> Option<PageWriteGuard> {
        if let Some(slot_idx) = self.search_mapping_for_write(cache_key) {
            // The page was found in the mapping. Lock the slot, and re-check
            // that it's still what we expected (because we don't released the mapping
            // lock already, another thread could have evicted the page)
            let slot = &self.slots[slot_idx];
-            let inner = slot.inner.write().unwrap();
+            let inner = slot.inner.write().await;
            if inner.key.as_ref() == Some(cache_key) {
                slot.inc_usage_count();
                return Some(PageWriteGuard { inner, valid: true });
@@ -588,10 +574,10 @@ impl PageCache {
    ///
    /// Similar to lock_for_read(), but the returned buffer is write-locked and
    /// may be modified by the caller even if it's already found in the cache.
-    fn lock_for_write(&self, cache_key: &CacheKey) -> anyhow::Result<WriteBufResult> {
+    async fn lock_for_write(&self, cache_key: &CacheKey) -> anyhow::Result<WriteBufResult> {
        loop {
            // First check if the key already exists in the cache.
-            if let Some(write_guard) = self.try_lock_for_write(cache_key) {
+            if let Some(write_guard) = self.try_lock_for_write(cache_key).await {
                return Ok(WriteBufResult::Found(write_guard));
            }

@@ -617,7 +603,7 @@ impl PageCache {
            // Make the slot ready
            let slot = &self.slots[slot_idx];
            inner.key = Some(cache_key.clone());
-            slot.usage_count.store(1, Ordering::Relaxed);
+            slot.set_usage_count(1);

            return Ok(WriteBufResult::NotFound(PageWriteGuard {
                inner,
@@ -772,7 +758,7 @@ impl PageCache {
    /// Find a slot to evict.
    ///
    /// On return, the slot is empty and write-locked.
-    fn find_victim(&self) -> anyhow::Result<(usize, RwLockWriteGuard<SlotInner>)> {
+    fn find_victim(&self) -> anyhow::Result<(usize, tokio::sync::RwLockWriteGuard<SlotInner>)> {
        let iter_limit = self.slots.len() * 10;
        let mut iters = 0;
        loop {
@@ -784,10 +770,7 @@ impl PageCache {
            if slot.dec_usage_count() == 0 {
                let mut inner = match slot.inner.try_write() {
                    Ok(inner) => inner,
-                    Err(TryLockError::Poisoned(err)) => {
-                        anyhow::bail!("buffer lock was poisoned: {err:?}")
-                    }
-                    Err(TryLockError::WouldBlock) => {
+                    Err(_err) => {
                        // If we have looped through the whole buffer pool 10 times
                        // and still haven't found a victim buffer, something's wrong.
                        // Maybe all the buffers were in locked. That could happen in
@@ -816,6 +799,8 @@ impl PageCache {
    fn new(num_pages: usize) -> Self {
        assert!(num_pages > 0, "page cache size must be > 0");

+        // We use Box::leak here and into_boxed_slice to avoid leaking uninitialized
+        // memory that a Vec might contain.
        let page_buffer = Box::leak(vec![0u8; num_pages * PAGE_SZ].into_boxed_slice());

        let size_metrics = &crate::metrics::PAGE_CACHE_SIZE;
@@ -829,7 +814,7 @@ impl PageCache {
                let buf: &mut [u8; PAGE_SZ] = chunk.try_into().unwrap();

                Slot {
-                    inner: RwLock::new(SlotInner { key: None, buf }),
+                    inner: tokio::sync::RwLock::new(SlotInner { key: None, buf }),
                    usage_count: AtomicU8::new(0),
                }
            })
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -59,6 +59,7 @@ use self::timeline::EvictionTaskTenantState;
 use self::timeline::TimelineResources;
 use crate::config::PageServerConf;
 use crate::context::{DownloadBehavior, RequestContext};
+use crate::deletion_queue::DeletionQueueClient;
 use crate::import_datadir;
 use crate::is_uninit_mark;
 use crate::metrics::TENANT_ACTIVATION;
@@ -85,6 +86,7 @@ pub use pageserver_api::models::TenantState;
 use toml_edit;
 use utils::{
    crashsafe,
+    generation::Generation,
    id::{TenantId, TimelineId},
    lsn::{Lsn, RecordLsn},
 };
@@ -119,7 +121,7 @@ mod span;

 pub mod metadata;
 mod par_fsync;
-mod remote_timeline_client;
+pub mod remote_timeline_client;
 pub mod storage_layer;

 pub mod config;
@@ -156,6 +158,7 @@ pub const TENANT_DELETED_MARKER_FILE_NAME: &str = "deleted";
 pub struct TenantSharedResources {
    pub broker_client: storage_broker::BrokerClientChannel,
    pub remote_storage: Option<GenericRemoteStorage>,
+    pub deletion_queue_client: DeletionQueueClient,
 }

 ///
@@ -178,6 +181,10 @@ pub struct Tenant {
    tenant_conf: Arc<RwLock<TenantConfOpt>>,

    tenant_id: TenantId,
+
+    // The remote storage generation, used to protect S3 objects from split-brain
+    generation: Generation,
+
    timelines: Mutex<HashMap<TimelineId, Arc<Timeline>>>,
    // This mutex prevents creation of new timelines during GC.
    // Adding yet another mutex (in addition to `timelines`) is needed because holding
@@ -191,6 +198,9 @@ pub struct Tenant {
    // provides access to timeline data sitting in the remote storage
    remote_storage: Option<GenericRemoteStorage>,

+    // Access to global deletion queue for when this tenant wants to schedule a deletion
+    deletion_queue_client: Option<DeletionQueueClient>,
+
    /// Cached logical sizes updated updated on each [`Tenant::gather_size_inputs`].
    cached_logical_sizes: tokio::sync::Mutex<HashMap<(TimelineId, Lsn), u64>>,
    cached_synthetic_tenant_size: Arc<AtomicU64>,
@@ -522,9 +532,11 @@ impl Tenant {
    pub(crate) fn spawn_attach(
        conf: &'static PageServerConf,
        tenant_id: TenantId,
+        generation: Generation,
        broker_client: storage_broker::BrokerClientChannel,
        tenants: &'static tokio::sync::RwLock<TenantsMap>,
        remote_storage: GenericRemoteStorage,
+        deletion_queue_client: DeletionQueueClient,
        ctx: &RequestContext,
    ) -> anyhow::Result<Arc<Tenant>> {
        // TODO dedup with spawn_load
@@ -538,7 +550,9 @@ impl Tenant {
            tenant_conf,
            wal_redo_manager,
            tenant_id,
+            generation,
            Some(remote_storage.clone()),
+            Some(deletion_queue_client),
        ));

        // Do all the hard work in the background
@@ -648,12 +662,8 @@ impl Tenant {
            .as_ref()
            .ok_or_else(|| anyhow::anyhow!("cannot attach without remote storage"))?;

-        let remote_timeline_ids = remote_timeline_client::list_remote_timelines(
-            remote_storage,
-            self.conf,
-            self.tenant_id,
-        )
-        .await?;
+        let remote_timeline_ids =
+            remote_timeline_client::list_remote_timelines(remote_storage, self.tenant_id).await?;

        info!("found {} timelines", remote_timeline_ids.len());

@@ -665,6 +675,7 @@ impl Tenant {
                self.conf,
                self.tenant_id,
                timeline_id,
+                self.generation,
            );
            part_downloads.spawn(
                async move {
@@ -698,10 +709,7 @@ impl Tenant {
            debug!("successfully downloaded index part for timeline {timeline_id}");
            match index_part {
                MaybeDeletedIndexPart::IndexPart(index_part) => {
-                    timeline_ancestors.insert(
-                        timeline_id,
-                        index_part.parse_metadata().context("parse_metadata")?,
-                    );
+                    timeline_ancestors.insert(timeline_id, index_part.metadata.clone());
                    remote_index_and_client.insert(timeline_id, (index_part, client));
                }
                MaybeDeletedIndexPart::Deleted(index_part) => {
@@ -730,6 +738,7 @@ impl Tenant {
                remote_metadata,
                TimelineResources {
                    remote_client: Some(remote_client),
+                    deletion_queue_client: self.deletion_queue_client.clone(),
                },
                ctx,
            )
@@ -752,8 +761,9 @@ impl Tenant {
            DeleteTimelineFlow::resume_deletion(
                Arc::clone(self),
                timeline_id,
-                &index_part.parse_metadata().context("parse_metadata")?,
+                &index_part.metadata,
                Some(remote_timeline_client),
+                self.deletion_queue_client.clone(),
                None,
            )
            .await
@@ -854,6 +864,8 @@ impl Tenant {
            TenantConfOpt::default(),
            wal_redo_manager,
            tenant_id,
+            Generation::broken(),
+            None,
            None,
        ))
    }
@@ -871,6 +883,7 @@ impl Tenant {
    pub(crate) fn spawn_load(
        conf: &'static PageServerConf,
        tenant_id: TenantId,
+        generation: Generation,
        resources: TenantSharedResources,
        init_order: Option<InitializationOrder>,
        tenants: &'static tokio::sync::RwLock<TenantsMap>,
@@ -888,6 +901,7 @@ impl Tenant {

        let broker_client = resources.broker_client;
        let remote_storage = resources.remote_storage;
+        let deletion_queue_client = resources.deletion_queue_client;

        let wal_redo_manager = Arc::new(PostgresRedoManager::new(conf, tenant_id));
        let tenant = Tenant::new(
@@ -896,7 +910,9 @@ impl Tenant {
            tenant_conf,
            wal_redo_manager,
            tenant_id,
+            generation,
            remote_storage.clone(),
+            Some(deletion_queue_client),
        );
        let tenant = Arc::new(tenant);

@@ -1304,6 +1320,7 @@ impl Tenant {
                                timeline_id,
                                &local_metadata,
                                Some(remote_client),
+                                self.deletion_queue_client.clone(),
                                init_order,
                            )
                            .await
@@ -1314,10 +1331,7 @@ impl Tenant {
                        }
                    };

-                    let remote_metadata = index_part
-                        .parse_metadata()
-                        .context("parse_metadata")
-                        .map_err(LoadLocalTimelineError::Load)?;
+                    let remote_metadata = index_part.metadata.clone();
                    (
                        Some(RemoteStartupData {
                            index_part,
@@ -1356,6 +1370,7 @@ impl Tenant {
                        timeline_id,
                        &local_metadata,
                        None,
+                        None,
                        init_order,
                    )
                    .await
@@ -2280,6 +2295,7 @@ impl Tenant {
            ancestor,
            new_timeline_id,
            self.tenant_id,
+            self.generation,
            Arc::clone(&self.walredo_mgr),
            resources,
            pg_version,
@@ -2297,8 +2313,18 @@ impl Tenant {
        tenant_conf: TenantConfOpt,
        walredo_mgr: Arc<dyn WalRedoManager + Send + Sync>,
        tenant_id: TenantId,
+        generation: Generation,
        remote_storage: Option<GenericRemoteStorage>,
+        deletion_queue_client: Option<DeletionQueueClient>,
    ) -> Tenant {
+        #[cfg(not(test))]
+        match state {
+            TenantState::Broken { .. } => {}
+            _ => {
+                // Non-broken tenants must be constructed with a deletion queue
+                assert!(deletion_queue_client.is_some());
+            }
+        }
        let (state, mut rx) = watch::channel(state);

        tokio::spawn(async move {
@@ -2355,6 +2381,7 @@ impl Tenant {

        Tenant {
            tenant_id,
+            generation,
            conf,
            // using now here is good enough approximation to catch tenants with really long
            // activation times.
@@ -2364,6 +2391,7 @@ impl Tenant {
            gc_cs: tokio::sync::Mutex::new(()),
            walredo_mgr,
            remote_storage,
+            deletion_queue_client,
            state,
            cached_logical_sizes: tokio::sync::Mutex::new(HashMap::new()),
            cached_synthetic_tenant_size: Arc::new(AtomicU64::new(0)),
@@ -2937,13 +2965,17 @@ impl Tenant {
                self.conf,
                self.tenant_id,
                timeline_id,
+                self.generation,
            );
            Some(remote_client)
        } else {
            None
        };

-        TimelineResources { remote_client }
+        TimelineResources {
+            remote_client,
+            deletion_queue_client: self.deletion_queue_client.clone(),
+        }
    }

    /// Creates intermediate timeline structure and its files.
@@ -3460,6 +3492,7 @@ pub mod harness {
        pub conf: &'static PageServerConf,
        pub tenant_conf: TenantConf,
        pub tenant_id: TenantId,
+        pub generation: Generation,
    }

    static LOG_HANDLE: OnceCell<()> = OnceCell::new();
@@ -3501,13 +3534,14 @@ pub mod harness {
                conf,
                tenant_conf,
                tenant_id,
+                generation: Generation::new(0xdeadbeef),
            })
        }

        pub async fn load(&self) -> (Arc<Tenant>, RequestContext) {
            let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
            (
-                self.try_load(&ctx, None)
+                self.try_load(&ctx, None, None)
                    .await
                    .expect("failed to load test tenant"),
                ctx,
@@ -3518,6 +3552,7 @@ pub mod harness {
            &self,
            ctx: &RequestContext,
            remote_storage: Option<remote_storage::GenericRemoteStorage>,
+            deletion_queue_client: Option<DeletionQueueClient>,
        ) -> anyhow::Result<Arc<Tenant>> {
            let walredo_mgr = Arc::new(TestRedoManager);

@@ -3527,7 +3562,9 @@ pub mod harness {
                TenantConfOpt::from(self.tenant_conf),
                walredo_mgr,
                self.tenant_id,
+                self.generation,
                remote_storage,
+                deletion_queue_client,
            ));
            tenant
                .load(None, ctx)
@@ -4092,7 +4129,7 @@ mod tests {
        std::fs::write(metadata_path, metadata_bytes)?;

        let err = harness
-            .try_load(&ctx, None)
+            .try_load(&ctx, None, None)
            .await
            .err()
            .expect("should fail");
@@ -4107,7 +4144,7 @@ mod tests {
        let mut found_error_message = false;
        let mut err_source = err.source();
        while let Some(source) = err_source {
-            if source.to_string() == "metadata checksum mismatch" {
+            if source.to_string().contains("metadata checksum mismatch") {
                found_error_message = true;
                break;
            }
--- a/pageserver/src/tenant/blob_io.rs
+++ b/pageserver/src/tenant/blob_io.rs
@@ -33,7 +33,7 @@ impl<'a> BlockCursor<'a> {
        let mut blknum = (offset / PAGE_SZ as u64) as u32;
        let mut off = (offset % PAGE_SZ as u64) as usize;

-        let mut buf = self.read_blk(blknum)?;
+        let mut buf = self.read_blk(blknum).await?;

        // peek at the first byte, to determine if it's a 1- or 4-byte length
        let first_len_byte = buf[off];
@@ -49,7 +49,7 @@ impl<'a> BlockCursor<'a> {
                // it is split across two pages
                len_buf[..thislen].copy_from_slice(&buf[off..PAGE_SZ]);
                blknum += 1;
-                buf = self.read_blk(blknum)?;
+                buf = self.read_blk(blknum).await?;
                len_buf[thislen..].copy_from_slice(&buf[0..4 - thislen]);
                off = 4 - thislen;
            } else {
@@ -70,7 +70,7 @@ impl<'a> BlockCursor<'a> {
            if page_remain == 0 {
                // continue on next page
                blknum += 1;
-                buf = self.read_blk(blknum)?;
+                buf = self.read_blk(blknum).await?;
                off = 0;
                page_remain = PAGE_SZ;
            }
--- a/pageserver/src/tenant/block_io.rs
+++ b/pageserver/src/tenant/block_io.rs
@@ -39,7 +39,7 @@ pub enum BlockLease<'a> {
    PageReadGuard(PageReadGuard<'static>),
    EphemeralFileMutableTail(&'a [u8; PAGE_SZ]),
    #[cfg(test)]
-    Rc(std::rc::Rc<[u8; PAGE_SZ]>),
+    Arc(std::sync::Arc<[u8; PAGE_SZ]>),
 }

 impl From<PageReadGuard<'static>> for BlockLease<'static> {
@@ -49,9 +49,9 @@ impl From<PageReadGuard<'static>> for BlockLease<'static> {
 }

 #[cfg(test)]
-impl<'a> From<std::rc::Rc<[u8; PAGE_SZ]>> for BlockLease<'a> {
-    fn from(value: std::rc::Rc<[u8; PAGE_SZ]>) -> Self {
-        BlockLease::Rc(value)
+impl<'a> From<std::sync::Arc<[u8; PAGE_SZ]>> for BlockLease<'a> {
+    fn from(value: std::sync::Arc<[u8; PAGE_SZ]>) -> Self {
+        BlockLease::Arc(value)
    }
 }

@@ -63,7 +63,7 @@ impl<'a> Deref for BlockLease<'a> {
            BlockLease::PageReadGuard(v) => v.deref(),
            BlockLease::EphemeralFileMutableTail(v) => v,
            #[cfg(test)]
-            BlockLease::Rc(v) => v.deref(),
+            BlockLease::Arc(v) => v.deref(),
        }
    }
 }
@@ -83,13 +83,13 @@ pub(crate) enum BlockReaderRef<'a> {

 impl<'a> BlockReaderRef<'a> {
    #[inline(always)]
-    fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
+    async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
        use BlockReaderRef::*;
        match self {
-            FileBlockReaderVirtual(r) => r.read_blk(blknum),
-            FileBlockReaderFile(r) => r.read_blk(blknum),
-            EphemeralFile(r) => r.read_blk(blknum),
-            Adapter(r) => r.read_blk(blknum),
+            FileBlockReaderVirtual(r) => r.read_blk(blknum).await,
+            FileBlockReaderFile(r) => r.read_blk(blknum).await,
+            EphemeralFile(r) => r.read_blk(blknum).await,
+            Adapter(r) => r.read_blk(blknum).await,
            #[cfg(test)]
            TestDisk(r) => r.read_blk(blknum),
        }
@@ -134,8 +134,8 @@ impl<'a> BlockCursor<'a> {
    /// access to the contents of the page. (For the page cache, the
    /// lease object represents a lock on the buffer.)
    #[inline(always)]
-    pub fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
-        self.reader.read_blk(blknum)
+    pub async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
+        self.reader.read_blk(blknum).await
    }
 }

@@ -170,11 +170,12 @@ where
    /// Returns a "lease" object that can be used to
    /// access to the contents of the page. (For the page cache, the
    /// lease object represents a lock on the buffer.)
-    pub fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
+    pub async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
        let cache = page_cache::get();
        loop {
            match cache
                .read_immutable_buf(self.file_id, blknum)
+                .await
                .map_err(|e| {
                    std::io::Error::new(
                        std::io::ErrorKind::Other,
--- a/pageserver/src/tenant/disk_btree.rs
+++ b/pageserver/src/tenant/disk_btree.rs
@@ -262,7 +262,7 @@ where
        let block_cursor = self.reader.block_cursor();
        while let Some((node_blknum, opt_iter)) = stack.pop() {
            // Locate the node.
-            let node_buf = block_cursor.read_blk(self.start_blk + node_blknum)?;
+            let node_buf = block_cursor.read_blk(self.start_blk + node_blknum).await?;

            let node = OnDiskNode::deparse(node_buf.as_ref())?;
            let prefix_len = node.prefix_len as usize;
@@ -357,7 +357,7 @@ where
        let block_cursor = self.reader.block_cursor();

        while let Some((blknum, path, depth, child_idx, key_off)) = stack.pop() {
-            let blk = block_cursor.read_blk(self.start_blk + blknum)?;
+            let blk = block_cursor.read_blk(self.start_blk + blknum).await?;
            let buf: &[u8] = blk.as_ref();
            let node = OnDiskNode::<L>::deparse(buf)?;

@@ -704,7 +704,7 @@ pub(crate) mod tests {
        pub(crate) fn read_blk(&self, blknum: u32) -> io::Result<BlockLease> {
            let mut buf = [0u8; PAGE_SZ];
            buf.copy_from_slice(&self.blocks[blknum as usize]);
-            Ok(std::rc::Rc::new(buf).into())
+            Ok(std::sync::Arc::new(buf).into())
        }
    }
    impl BlockReader for TestDisk {
--- a/pageserver/src/tenant/ephemeral_file.rs
+++ b/pageserver/src/tenant/ephemeral_file.rs
@@ -61,13 +61,14 @@ impl EphemeralFile {
        self.len
    }

-    pub(crate) fn read_blk(&self, blknum: u32) -> Result<BlockLease, io::Error> {
+    pub(crate) async fn read_blk(&self, blknum: u32) -> Result<BlockLease, io::Error> {
        let flushed_blknums = 0..self.len / PAGE_SZ as u64;
        if flushed_blknums.contains(&(blknum as u64)) {
            let cache = page_cache::get();
            loop {
                match cache
                    .read_immutable_buf(self.page_cache_file_id, blknum)
+                    .await
                    .map_err(|e| {
                        std::io::Error::new(
                            std::io::ErrorKind::Other,
@@ -135,10 +136,13 @@ impl EphemeralFile {
                                // Pre-warm the page cache with what we just wrote.
                                // This isn't necessary for coherency/correctness, but it's how we've always done it.
                                let cache = page_cache::get();
-                                match cache.read_immutable_buf(
-                                    self.ephemeral_file.page_cache_file_id,
-                                    self.blknum,
-                                ) {
+                                match cache
+                                    .read_immutable_buf(
+                                        self.ephemeral_file.page_cache_file_id,
+                                        self.blknum,
+                                    )
+                                    .await
+                                {
                                    Ok(page_cache::ReadBufResult::Found(_guard)) => {
                                        // This function takes &mut self, so, it shouldn't be possible to reach this point.
                                        unreachable!("we just wrote blknum {} and this function takes &mut self, so, no concurrent read_blk is possible", self.blknum);
@@ -221,9 +225,8 @@ pub fn is_ephemeral_file(filename: &str) -> bool {

 impl Drop for EphemeralFile {
    fn drop(&mut self) {
-        // drop all pages from page cache
-        let cache = page_cache::get();
-        cache.drop_buffers_for_immutable(self.page_cache_file_id);
+        // There might still be pages in the [`crate::page_cache`] for this file.
+        // We leave them there; [`crate::page_cache::PageCache::find_victim`] will evict them when needed.

        // unlink the file
        let res = std::fs::remove_file(&self.file.path);
--- a/pageserver/src/tenant/metadata.rs
+++ b/pageserver/src/tenant/metadata.rs
@@ -12,7 +12,7 @@ use std::fs::{File, OpenOptions};
 use std::io::{self, Write};

 use anyhow::{bail, ensure, Context};
-use serde::{Deserialize, Serialize};
+use serde::{de::Error, Deserialize, Serialize, Serializer};
 use thiserror::Error;
 use tracing::info_span;
 use utils::bin_ser::SerializeError;
@@ -232,6 +232,28 @@ impl TimelineMetadata {
    }
 }

+impl<'de> Deserialize<'de> for TimelineMetadata {
+    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
+    where
+        D: serde::Deserializer<'de>,
+    {
+        let bytes = Vec::<u8>::deserialize(deserializer)?;
+        Self::from_bytes(bytes.as_slice()).map_err(|e| D::Error::custom(format!("{e}")))
+    }
+}
+
+impl Serialize for TimelineMetadata {
+    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
+    where
+        S: Serializer,
+    {
+        let bytes = self
+            .to_bytes()
+            .map_err(|e| serde::ser::Error::custom(format!("{e}")))?;
+        bytes.serialize(serializer)
+    }
+}
+
 /// Save timeline metadata to file
 pub fn save_metadata(
    conf: &'static PageServerConf,
--- a/pageserver/src/tenant/mgr.rs
+++ b/pageserver/src/tenant/mgr.rs
@@ -1,10 +1,13 @@
 //! This module acts as a switchboard to access different repositories managed by this
 //! page server.

+use hyper::StatusCode;
+use pageserver_api::control_api::{HexTenantId, ReAttachRequest, ReAttachResponse};
 use std::collections::{hash_map, HashMap};
 use std::ffi::OsStr;
 use std::path::Path;
 use std::sync::Arc;
+use std::time::Duration;
 use tokio::fs;

 use anyhow::Context;
@@ -18,6 +21,7 @@ use utils::crashsafe;

 use crate::config::PageServerConf;
 use crate::context::{DownloadBehavior, RequestContext};
+use crate::deletion_queue::DeletionQueue;
 use crate::task_mgr::{self, TaskKind};
 use crate::tenant::config::TenantConfOpt;
 use crate::tenant::delete::DeleteTenantFlow;
@@ -25,6 +29,7 @@ use crate::tenant::{create_tenant_files, CreateTenantFilesMode, Tenant, TenantSt
 use crate::{InitializationOrder, IGNORED_TENANT_FILE_NAME};

 use utils::fs_ext::PathExt;
+use utils::generation::Generation;
 use utils::id::{TenantId, TimelineId};

 use super::delete::DeleteTenantError;
@@ -75,6 +80,78 @@ pub async fn init_tenant_mgr(

    let mut tenants = HashMap::new();

+    // If we are configured to use the control plane API, then it is the source of truth for what to attach
+    let tenant_generations = conf
+        .control_plane_api
+        .as_ref()
+        .map(|control_plane_api| async {
+            let client = reqwest::ClientBuilder::new()
+                .build()
+                .expect("Failed to construct http client");
+
+            // FIXME: it's awkward that join() requires the base to have a trailing slash, which
+            // makes it easy to get a config wrong
+            assert!(
+                control_plane_api.as_str().ends_with('/'),
+                "control plane API needs trailing slash"
+            );
+
+            let re_attach_path = control_plane_api
+                .join("re-attach")
+                .expect("Failed to build re-attach path");
+            let request = ReAttachRequest { node_id: conf.id };
+
+            // TODO: we should have been passed a cancellation token, and use it to end
+            // this loop gracefully
+            loop {
+                let response = match client
+                    .post(re_attach_path.clone())
+                    .json(&request)
+                    .send()
+                    .await
+                {
+                    Err(e) => Err(anyhow::Error::from(e)),
+                    Ok(r) => {
+                        if r.status() == StatusCode::OK {
+                            r.json::<ReAttachResponse>()
+                                .await
+                                .map_err(anyhow::Error::from)
+                        } else {
+                            Err(anyhow::anyhow!("Unexpected status {}", r.status()))
+                        }
+                    }
+                };
+
+                match response {
+                    Ok(res) => {
+                        tracing::info!(
+                            "Received re-attach response with {0} tenants",
+                            res.tenants.len()
+                        );
+
+                        // TODO: do something with it
+                        break res
+                            .tenants
+                            .into_iter()
+                            .map(|t| (t.id, t.generation))
+                            .collect::<HashMap<_, _>>();
+                    }
+                    Err(e) => {
+                        tracing::error!("Error re-attaching tenants, retrying: {e:#}");
+                        tokio::time::sleep(Duration::from_secs(1)).await;
+                    }
+                }
+            }
+        });
+
+    let tenant_generations = match tenant_generations {
+        Some(g) => Some(g.await),
+        None => {
+            info!("Control plane API not configured, tenant generations are disabled");
+            None
+        }
+    };
+
    let mut dir_entries = fs::read_dir(&tenants_dir)
        .await
        .with_context(|| format!("Failed to list tenants dir {tenants_dir:?}"))?;
@@ -122,9 +199,53 @@ pub async fn init_tenant_mgr(
                        continue;
                    }

+                    let tenant_id = match tenant_dir_path
+                        .file_name()
+                        .and_then(OsStr::to_str)
+                        .unwrap_or_default()
+                        .parse::<TenantId>()
+                    {
+                        Ok(id) => id,
+                        Err(_) => {
+                            warn!(
+                                "Invalid tenant path (garbage in our repo directory?): {0}",
+                                tenant_dir_path.display()
+                            );
+                            continue;
+                        }
+                    };
+
+                    let generation = if let Some(generations) = &tenant_generations {
+                        // We have a generation map: treat it as the authority for whether
+                        // this tenant is really attached.
+                        if let Some(gen) = generations.get(&HexTenantId::new(tenant_id)) {
+                            Generation::new(*gen)
+                        } else {
+                            info!("Detaching tenant {0}, control plane omitted it in re-attach response", tenant_id);
+                            if let Err(e) = fs::remove_dir_all(&tenant_dir_path).await {
+                                error!(
+                                    "Failed to remove detached tenant directory '{}': {:?}",
+                                    tenant_dir_path.display(),
+                                    e
+                                );
+                            }
+                            continue;
+                        }
+                    } else {
+                        // Legacy mode: no generation information, any tenant present
+                        // on local disk may activate
+                        info!(
+                            "Starting tenant {0} in legacy mode, no generation",
+                            tenant_dir_path.display()
+                        );
+                        Generation::none()
+                    };
+
                    match schedule_local_tenant_processing(
                        conf,
+                        tenant_id,
                        &tenant_dir_path,
+                        generation,
                        resources.clone(),
                        Some(init_order.clone()),
                        &TENANTS,
@@ -160,7 +281,9 @@ pub async fn init_tenant_mgr(

 pub(crate) fn schedule_local_tenant_processing(
    conf: &'static PageServerConf,
+    tenant_id: TenantId,
    tenant_path: &Path,
+    generation: Generation,
    resources: TenantSharedResources,
    init_order: Option<InitializationOrder>,
    tenants: &'static tokio::sync::RwLock<TenantsMap>,
@@ -181,15 +304,6 @@ pub(crate) fn schedule_local_tenant_processing(
        "Cannot load tenant from empty directory {tenant_path:?}"
    );

-    let tenant_id = tenant_path
-        .file_name()
-        .and_then(OsStr::to_str)
-        .unwrap_or_default()
-        .parse::<TenantId>()
-        .with_context(|| {
-            format!("Could not parse tenant id out of the tenant dir name in path {tenant_path:?}")
-        })?;
-
    let tenant_ignore_mark = conf.tenant_ignore_mark_file_path(&tenant_id);
    anyhow::ensure!(
        !conf.tenant_ignore_mark_file_path(&tenant_id).exists(),
@@ -202,9 +316,11 @@ pub(crate) fn schedule_local_tenant_processing(
            match Tenant::spawn_attach(
                conf,
                tenant_id,
+                generation,
                resources.broker_client,
                tenants,
                remote_storage,
+                resources.deletion_queue_client,
                ctx,
            ) {
                Ok(tenant) => tenant,
@@ -224,7 +340,9 @@ pub(crate) fn schedule_local_tenant_processing(
    } else {
        info!("tenant {tenant_id} is assumed to be loadable, starting load operation");
        // Start loading the tenant into memory. It will initially be in Loading state.
-        Tenant::spawn_load(conf, tenant_id, resources, init_order, tenants, ctx)
+        Tenant::spawn_load(
+            conf, tenant_id, generation, resources, init_order, tenants, ctx,
+        )
    };
    Ok(tenant)
 }
@@ -347,8 +465,10 @@ pub async fn create_tenant(
    conf: &'static PageServerConf,
    tenant_conf: TenantConfOpt,
    tenant_id: TenantId,
+    generation: Generation,
    broker_client: storage_broker::BrokerClientChannel,
    remote_storage: Option<GenericRemoteStorage>,
+    deletion_queue: &DeletionQueue,
    ctx: &RequestContext,
 ) -> Result<Arc<Tenant>, TenantMapInsertError> {
    tenant_map_insert(tenant_id, || {
@@ -362,9 +482,11 @@ pub async fn create_tenant(
        let tenant_resources = TenantSharedResources {
            broker_client,
            remote_storage,
+            deletion_queue_client: deletion_queue.new_client(),
        };
        let created_tenant =
-            schedule_local_tenant_processing(conf, &tenant_directory, tenant_resources, None, &TENANTS, ctx)?;
+            schedule_local_tenant_processing(conf, tenant_id, &tenant_directory,
+                generation, tenant_resources, None, &TENANTS, ctx)?;
        // TODO: tenant object & its background loops remain, untracked in tenant map, if we fail here.
        //      See https://github.com/neondatabase/neon/issues/4233

@@ -513,6 +635,7 @@ pub async fn load_tenant(
    tenant_id: TenantId,
    broker_client: storage_broker::BrokerClientChannel,
    remote_storage: Option<GenericRemoteStorage>,
+    deletion_queue: &DeletionQueue,
    ctx: &RequestContext,
 ) -> Result<(), TenantMapInsertError> {
    tenant_map_insert(tenant_id, || {
@@ -526,8 +649,11 @@ pub async fn load_tenant(
        let resources = TenantSharedResources {
            broker_client,
            remote_storage,
+            deletion_queue_client: deletion_queue.new_client(),
        };
-        let new_tenant = schedule_local_tenant_processing(conf, &tenant_path,  resources, None,  &TENANTS, ctx)
+        // TODO: remove the `/load` API once generation support is complete:
+        // it becomes equivalent to attaching.
+        let new_tenant = schedule_local_tenant_processing(conf, tenant_id, &tenant_path, Generation::none(), resources, None,  &TENANTS, ctx)
            .with_context(|| {
                format!("Failed to schedule tenant processing in path {tenant_path:?}")
            })?;
@@ -591,9 +717,11 @@ pub async fn list_tenants() -> Result<Vec<(TenantId, TenantState)>, TenantMapLis
 pub async fn attach_tenant(
    conf: &'static PageServerConf,
    tenant_id: TenantId,
+    generation: Generation,
    tenant_conf: TenantConfOpt,
    broker_client: storage_broker::BrokerClientChannel,
    remote_storage: GenericRemoteStorage,
+    deletion_queue: &DeletionQueue,
    ctx: &RequestContext,
 ) -> Result<(), TenantMapInsertError> {
    tenant_map_insert(tenant_id, || {
@@ -611,8 +739,9 @@ pub async fn attach_tenant(
        let resources = TenantSharedResources {
            broker_client,
            remote_storage: Some(remote_storage),
+            deletion_queue_client: deletion_queue.new_client(),
        };
-        let attached_tenant = schedule_local_tenant_processing(conf, &tenant_dir, resources, None, &TENANTS, ctx)?;
+        let attached_tenant = schedule_local_tenant_processing(conf, tenant_id, &tenant_dir, generation, resources, None, &TENANTS, ctx)?;
        // TODO: tenant object & its background loops remain, untracked in tenant map, if we fail here.
        //      See https://github.com/neondatabase/neon/issues/4233

--- a/pageserver/src/tenant/remote_timeline_client.rs
+++ b/pageserver/src/tenant/remote_timeline_client.rs
@@ -56,9 +56,11 @@
 //! # Consistency
 //!
 //! To have a consistent remote structure, it's important that uploads and
-//! deletions are performed in the right order. For example, the index file
-//! contains a list of layer files, so it must not be uploaded until all the
-//! layer files that are in its list have been successfully uploaded.
+//! deletions are performed in the right order. For example:
+//! - the index file contains a list of layer files, so it must not be uploaded
+//!   until all the layer files that are in its list have been successfully uploaded.
+//! - objects must be removed from the index before being deleted, and that updated
+//!   index must be written to remote storage before deleting the objects from remote storage.
 //!
 //! The contract between client and its user is that the user is responsible of
 //! scheduling operations in an order that keeps the remote consistent as
@@ -70,10 +72,12 @@
 //! correct order, and the client will parallelize the operations in a way that
 //! is safe.
 //!
-//! The caller should be careful with deletion, though. They should not delete
-//! local files that have been scheduled for upload but not yet finished uploading.
-//! Otherwise the upload will fail. To wait for an upload to finish, use
-//! the 'wait_completion' function (more on that later.)
+//! The caller should be careful with deletion, though:
+//! - they should not delete local files that have been scheduled for upload but
+//!   not yet finished uploading.  Otherwise the upload will fail. To wait for an
+//!   upload to finish, use the 'wait_completion' function (more on that later.)
+//! - they should not do remote deletions via DeletionQueue without waiting for
+//!   the latest metadata to upload via RemoteTimelineClient.
 //!
 //! All of this relies on the following invariants:
 //!
@@ -200,12 +204,11 @@
 //! [`Tenant::timeline_init_and_sync`]: super::Tenant::timeline_init_and_sync
 //! [`Timeline::load_layer_map`]: super::Timeline::load_layer_map

-mod delete;
 mod download;
 pub mod index;
 mod upload;

-use anyhow::Context;
+use anyhow::{bail, Context};
 use chrono::{NaiveDateTime, Utc};
 // re-export these
 pub use download::{is_temp_download_file, list_remote_timelines};
@@ -216,7 +219,7 @@ use utils::backoff::{
 };

 use std::collections::{HashMap, VecDeque};
-use std::path::Path;
+use std::path::{Path, PathBuf};
 use std::sync::atomic::{AtomicU32, Ordering};
 use std::sync::{Arc, Mutex};

@@ -226,6 +229,7 @@ use tracing::{debug, error, info, instrument, warn};
 use tracing::{info_span, Instrument};
 use utils::lsn::Lsn;

+use crate::deletion_queue::DeletionQueueClient;
 use crate::metrics::{
    MeasureRemoteOp, RemoteOpFileKind, RemoteOpKind, RemoteTimelineClientMetrics,
    RemoteTimelineClientMetricsCallTrackSize, REMOTE_ONDEMAND_DOWNLOADED_BYTES,
@@ -234,7 +238,6 @@ use crate::metrics::{
 use crate::task_mgr::shutdown_token;
 use crate::tenant::debug_assert_current_span_has_tenant_and_timeline_id;
 use crate::tenant::remote_timeline_client::index::LayerFileMetadata;
-use crate::tenant::upload_queue::Delete;
 use crate::{
    config::PageServerConf,
    task_mgr,
@@ -244,6 +247,7 @@ use crate::{
    tenant::upload_queue::{
        UploadOp, UploadQueue, UploadQueueInitialized, UploadQueueStopped, UploadTask,
    },
+    tenant::TIMELINES_SEGMENT_NAME,
 };

 use utils::id::{TenantId, TimelineId};
@@ -252,6 +256,7 @@ use self::index::IndexPart;

 use super::storage_layer::LayerFileName;
 use super::upload_queue::SetDeletedFlagProgress;
+use super::Generation;

 // Occasional network issues and such can cause remote operations to fail, and
 // that's expected. If a download fails, we log it at info-level, and retry.
@@ -315,6 +320,7 @@ pub struct RemoteTimelineClient {

    tenant_id: TenantId,
    timeline_id: TimelineId,
+    generation: Generation,

    upload_queue: Mutex<UploadQueue>,

@@ -335,12 +341,14 @@ impl RemoteTimelineClient {
        conf: &'static PageServerConf,
        tenant_id: TenantId,
        timeline_id: TimelineId,
+        generation: Generation,
    ) -> RemoteTimelineClient {
        RemoteTimelineClient {
            conf,
            runtime: BACKGROUND_RUNTIME.handle().to_owned(),
            tenant_id,
            timeline_id,
+            generation,
            storage_impl: remote_storage,
            upload_queue: Mutex::new(UploadQueue::Uninitialized),
            metrics: Arc::new(RemoteTimelineClientMetrics::new(&tenant_id, &timeline_id)),
@@ -453,6 +461,7 @@ impl RemoteTimelineClient {
            &self.storage_impl,
            &self.tenant_id,
            &self.timeline_id,
+            self.generation,
        )
        .measure_remote_op(
            self.tenant_id,
@@ -541,8 +550,7 @@ impl RemoteTimelineClient {
        // ahead of what's _actually_ on the remote during index upload.
        upload_queue.latest_metadata = metadata.clone();

-        let metadata_bytes = upload_queue.latest_metadata.to_bytes()?;
-        self.schedule_index_upload(upload_queue, metadata_bytes);
+        self.schedule_index_upload(upload_queue, upload_queue.latest_metadata.clone());

        Ok(())
    }
@@ -562,8 +570,7 @@ impl RemoteTimelineClient {
        let upload_queue = guard.initialized_mut()?;

        if upload_queue.latest_files_changes_since_metadata_upload_scheduled > 0 {
-            let metadata_bytes = upload_queue.latest_metadata.to_bytes()?;
-            self.schedule_index_upload(upload_queue, metadata_bytes);
+            self.schedule_index_upload(upload_queue, upload_queue.latest_metadata.clone());
        }

        Ok(())
@@ -573,7 +580,7 @@ impl RemoteTimelineClient {
    fn schedule_index_upload(
        self: &Arc<Self>,
        upload_queue: &mut UploadQueueInitialized,
-        metadata_bytes: Vec<u8>,
+        metadata: TimelineMetadata,
    ) {
        info!(
            "scheduling metadata upload with {} files ({} changed)",
@@ -586,7 +593,7 @@ impl RemoteTimelineClient {
        let index_part = IndexPart::new(
            upload_queue.latest_files.clone(),
            disk_consistent_lsn,
-            metadata_bytes,
+            metadata,
        );
        let op = UploadOp::UploadMetadata(index_part, disk_consistent_lsn);
        self.calls_unfinished_metric_begin(&op);
@@ -633,50 +640,66 @@ impl RemoteTimelineClient {
    /// deletion won't actually be performed, until any previously scheduled
    /// upload operations, and the index file upload, have completed
    /// successfully.
-    pub fn schedule_layer_file_deletion(
+    pub async fn schedule_layer_file_deletion(
        self: &Arc<Self>,
        names: &[LayerFileName],
+        deletion_queue_client: &DeletionQueueClient,
    ) -> anyhow::Result<()> {
-        let mut guard = self.upload_queue.lock().unwrap();
-        let upload_queue = guard.initialized_mut()?;
+        // Synchronous update of upload queues under mutex
+        let with_generations = {
+            let mut guard = self.upload_queue.lock().unwrap();
+            let upload_queue = guard.initialized_mut()?;

-        // Deleting layers doesn't affect the values stored in TimelineMetadata,
-        // so we don't need update it. Just serialize it.
-        let metadata_bytes = upload_queue.latest_metadata.to_bytes()?;
+            // Deleting layers doesn't affect the values stored in TimelineMetadata,
+            // so we don't need to update it. Just clone it.
+            let metadata = upload_queue.latest_metadata.clone();

-        // Update the remote index file, removing the to-be-deleted files from the index,
-        // before deleting the actual files.
-        //
-        // Once we start removing files from upload_queue.latest_files, there's
-        // no going back! Otherwise, some of the files would already be removed
-        // from latest_files, but not yet scheduled for deletion. Use a closure
-        // to syntactically forbid ? or bail! calls here.
-        let no_bail_here = || {
-            for name in names {
-                upload_queue.latest_files.remove(name);
-                upload_queue.latest_files_changes_since_metadata_upload_scheduled += 1;
-            }
+            // Decorate our list of names with each name's generation, dropping
+            // names that are unexpectedly missing from our metadata.
+            let with_generations: Vec<_> = names
+                .iter()
+                .filter_map(|name| {
+                    // Remove from latest_files, learning the file's remote generation in the process
+                    let meta = upload_queue.latest_files.remove(name);
+
+                    if let Some(meta) = meta {
+                        upload_queue.latest_files_changes_since_metadata_upload_scheduled += 1;
+                        Some((name.clone(), meta.generation))
+                    } else {
+                        // This is unexpected: latest_files is meant to be kept up to
+                        // date.  We can't delete the layer if we have forgotten what
+                        // generation it was in.
+                        warn!("Deleting layer {name} not found in latest_files list");
+                        None
+                    }
+                })
+                .collect();

            if upload_queue.latest_files_changes_since_metadata_upload_scheduled > 0 {
-                self.schedule_index_upload(upload_queue, metadata_bytes);
+                self.schedule_index_upload(upload_queue, metadata);
            }

-            // schedule the actual deletions
-            for name in names {
-                let op = UploadOp::Delete(Delete {
-                    file_kind: RemoteOpFileKind::Layer,
-                    layer_file_name: name.clone(),
-                    scheduled_from_timeline_delete: false,
-                });
-                self.calls_unfinished_metric_begin(&op);
-                upload_queue.queued_operations.push_back(op);
-                info!("scheduled layer file deletion {name}");
-            }
-
-            // Launch the tasks immediately, if possible
-            self.launch_queued_tasks(upload_queue);
+            with_generations
        };
-        no_bail_here();
+
+        // Barrier: we must ensure all prior uploads and index writes have landed in S3
+        // before emitting deletions.
+        if let Err(e) = self.wait_completion().await {
+            // This can only fail if upload queue is shut down: if this happens, we do
+            // not emit any deletions.  In this condition (remote client is shut down
+            // during compaction or GC) we may leak some objects.
+            bail!("Cannot complete layer file deletions during shutdown ({e})");
+        }
+
+        // Enqueue deletions
+        deletion_queue_client
+            .push_layers(
+                self.tenant_id,
+                self.timeline_id,
+                self.generation,
+                with_generations,
+            )
+            .await?;
        Ok(())
    }

@@ -762,10 +785,10 @@ impl RemoteTimelineClient {
        backoff::retry(
            || {
                upload::upload_index_part(
-                    self.conf,
                    &self.storage_impl,
                    &self.tenant_id,
                    &self.timeline_id,
+                    self.generation,
                    &index_part_with_deleted_at,
                )
            },
@@ -802,12 +825,13 @@ impl RemoteTimelineClient {
    /// Prerequisites: UploadQueue should be in stopped state and deleted_at should be successfuly set.
    /// The function deletes layer files one by one, then lists the prefix to see if we leaked something
    /// deletes leaked files if any and proceeds with deletion of index file at the end.
-    pub(crate) async fn delete_all(self: &Arc<Self>) -> anyhow::Result<()> {
+    pub(crate) async fn delete_all(
+        self: &Arc<Self>,
+        deletion_queue: &DeletionQueueClient,
+    ) -> anyhow::Result<()> {
        debug_assert_current_span_has_tenant_and_timeline_id();

-        let (mut receiver, deletions_queued) = {
-            let mut deletions_queued = 0;
-
+        let layers: Vec<_> = {
            let mut locked = self.upload_queue.lock().unwrap();
            let stopped = locked.stopped_mut()?;

@@ -819,40 +843,29 @@ impl RemoteTimelineClient {

            stopped
                .upload_queue_for_deletion
-                .queued_operations
-                .reserve(stopped.upload_queue_for_deletion.latest_files.len());
-
-            // schedule the actual deletions
-            for name in stopped.upload_queue_for_deletion.latest_files.keys() {
-                let op = UploadOp::Delete(Delete {
-                    file_kind: RemoteOpFileKind::Layer,
-                    layer_file_name: name.clone(),
-                    scheduled_from_timeline_delete: true,
-                });
-                self.calls_unfinished_metric_begin(&op);
-                stopped
-                    .upload_queue_for_deletion
-                    .queued_operations
-                    .push_back(op);
-
-                info!("scheduled layer file deletion {name}");
-                deletions_queued += 1;
-            }
-
-            self.launch_queued_tasks(&mut stopped.upload_queue_for_deletion);
-
-            (
-                self.schedule_barrier(&mut stopped.upload_queue_for_deletion),
-                deletions_queued,
-            )
+                .latest_files
+                .drain()
+                .map(|kv| (kv.0, kv.1.generation))
+                .collect()
        };

-        receiver.changed().await.context("upload queue shut down")?;
+        let layer_deletion_count = layers.len();
+
+        let layer_paths = layers
+            .into_iter()
+            .map(|(layer, generation)| {
+                remote_layer_path(&self.tenant_id, &self.timeline_id, &layer, generation)
+            })
+            .collect();
+        deletion_queue.push_immediate(layer_paths).await?;

        // Do not delete index part yet, it is needed for possible retry. If we remove it first
        // and retry will arrive to different pageserver there wont be any traces of it on remote storage
-        let timeline_path = self.conf.timeline_path(&self.tenant_id, &self.timeline_id);
-        let timeline_storage_path = self.conf.remote_path(&timeline_path)?;
+        let timeline_storage_path = remote_timeline_path(&self.tenant_id, &self.timeline_id);
+
+        // Execute all pending deletions, so that when we proceed to do a list_prefixes below, we aren't
+        // taking the burden of listing all the layers that we already know we should delete.
+        deletion_queue.flush_immediate().await?;

        let remaining = backoff::retry(
            || async {
@@ -881,17 +894,9 @@ impl RemoteTimelineClient {
            })
            .collect();

+        let not_referenced_count = remaining.len();
        if !remaining.is_empty() {
-            backoff::retry(
-                || async { self.storage_impl.delete_objects(&remaining).await },
-                |_e| false,
-                FAILED_UPLOAD_WARN_THRESHOLD,
-                FAILED_REMOTE_OP_RETRIES,
-                "delete_objects",
-                backoff::Cancel::new(shutdown_token(), || anyhow::anyhow!("Cancelled!")),
-            )
-            .await
-            .context("delete_objects")?;
+            deletion_queue.push_immediate(remaining).await?;
        }

        fail::fail_point!("timeline-delete-before-index-delete", |_| {
@@ -902,18 +907,14 @@ impl RemoteTimelineClient {

        let index_file_path = timeline_storage_path.join(Path::new(IndexPart::FILE_NAME));

-        debug!("deleting index part");
+        debug!("enqueuing index part deletion");
+        deletion_queue
+            .push_immediate(vec![index_file_path])
+            .await?;

-        backoff::retry(
-            || async { self.storage_impl.delete(&index_file_path).await },
-            |_e| false,
-            FAILED_UPLOAD_WARN_THRESHOLD,
-            FAILED_REMOTE_OP_RETRIES,
-            "delete_index",
-            backoff::Cancel::new(shutdown_token(), || anyhow::anyhow!("Cancelled")),
-        )
-        .await
-        .context("delete_index")?;
+        // Timeline deletion is rare and we have probably emitted a reasonable number of objects: wait
+        // for a flush to a persistent deletion list so that we may be sure deletion will occur.
+        deletion_queue.flush_immediate().await?;

        fail::fail_point!("timeline-delete-after-index-delete", |_| {
            Err(anyhow::anyhow!(
@@ -921,7 +922,7 @@ impl RemoteTimelineClient {
            ))?
        });

-        info!(prefix=%timeline_storage_path, referenced=deletions_queued, not_referenced=%remaining.len(), "done deleting in timeline prefix, including index_part.json");
+        info!(prefix=%timeline_storage_path, referenced=layer_deletion_count, not_referenced=%not_referenced_count, "done deleting in timeline prefix, including index_part.json");

        Ok(())
    }
@@ -944,10 +945,6 @@ impl RemoteTimelineClient {
                    // have finished.
                    upload_queue.inprogress_tasks.is_empty()
                }
-                UploadOp::Delete(_) => {
-                    // Wait for preceding uploads to finish. Concurrent deletions are OK, though.
-                    upload_queue.num_inprogress_deletions == upload_queue.inprogress_tasks.len()
-                }

                UploadOp::Barrier(_) => upload_queue.inprogress_tasks.is_empty(),
            };
@@ -975,9 +972,6 @@ impl RemoteTimelineClient {
                UploadOp::UploadMetadata(_, _) => {
                    upload_queue.num_inprogress_metadata_uploads += 1;
                }
-                UploadOp::Delete(_) => {
-                    upload_queue.num_inprogress_deletions += 1;
-                }
                UploadOp::Barrier(sender) => {
                    sender.send_replace(());
                    continue;
@@ -1056,15 +1050,17 @@ impl RemoteTimelineClient {

            let upload_result: anyhow::Result<()> = match &task.op {
                UploadOp::UploadLayer(ref layer_file_name, ref layer_metadata) => {
-                    let path = &self
+                    let path = self
                        .conf
                        .timeline_path(&self.tenant_id, &self.timeline_id)
                        .join(layer_file_name.file_name());
+
                    upload::upload_timeline_layer(
                        self.conf,
                        &self.storage_impl,
-                        path,
+                        &path,
                        layer_metadata,
+                        self.generation,
                    )
                    .measure_remote_op(
                        self.tenant_id,
@@ -1086,10 +1082,10 @@ impl RemoteTimelineClient {
                    };

                    let res = upload::upload_index_part(
-                        self.conf,
                        &self.storage_impl,
                        &self.tenant_id,
                        &self.timeline_id,
+                        self.generation,
                        index_part,
                    )
                    .measure_remote_op(
@@ -1109,21 +1105,6 @@ impl RemoteTimelineClient {
                    }
                    res
                }
-                UploadOp::Delete(delete) => {
-                    let path = &self
-                        .conf
-                        .timeline_path(&self.tenant_id, &self.timeline_id)
-                        .join(delete.layer_file_name.file_name());
-                    delete::delete_layer(self.conf, &self.storage_impl, path)
-                        .measure_remote_op(
-                            self.tenant_id,
-                            self.timeline_id,
-                            delete.file_kind,
-                            RemoteOpKind::Delete,
-                            Arc::clone(&self.metrics),
-                        )
-                        .await
-                }
                UploadOp::Barrier(_) => {
                    // unreachable. Barrier operations are handled synchronously in
                    // launch_queued_tasks
@@ -1183,15 +1164,7 @@ impl RemoteTimelineClient {
            let mut upload_queue_guard = self.upload_queue.lock().unwrap();
            let upload_queue = match upload_queue_guard.deref_mut() {
                UploadQueue::Uninitialized => panic!("callers are responsible for ensuring this is only called on an initialized queue"),
-                UploadQueue::Stopped(stopped) => {
-                    // Special care is needed for deletions, if it was an earlier deletion (not scheduled from deletion)
-                    // then stop() took care of it so we just return.
-                    // For deletions that come from delete_all we still want to maintain metrics, launch following tasks, etc.
-                    match &task.op {
-                        UploadOp::Delete(delete) if delete.scheduled_from_timeline_delete => Some(&mut stopped.upload_queue_for_deletion),
-                        _ => None
-                    }
-                },
+                UploadQueue::Stopped(_) => { None }
                UploadQueue::Initialized(qi) => { Some(qi) }
            };

@@ -1213,9 +1186,6 @@ impl RemoteTimelineClient {
                    upload_queue.num_inprogress_metadata_uploads -= 1;
                    upload_queue.last_uploaded_consistent_lsn = lsn; // XXX monotonicity check?
                }
-                UploadOp::Delete(_) => {
-                    upload_queue.num_inprogress_deletions -= 1;
-                }
                UploadOp::Barrier(_) => unreachable!(),
            };

@@ -1247,13 +1217,6 @@ impl RemoteTimelineClient {
                    reason: "metadata uploads are tiny",
                },
            ),
-            UploadOp::Delete(delete) => (
-                delete.file_kind,
-                RemoteOpKind::Delete,
-                DontTrackSize {
-                    reason: "should we track deletes? positive or negative sign?",
-                },
-            ),
            UploadOp::Barrier(_) => {
                // we do not account these
                return None;
@@ -1313,7 +1276,6 @@ impl RemoteTimelineClient {
                        last_uploaded_consistent_lsn: initialized.last_uploaded_consistent_lsn,
                        num_inprogress_layer_uploads: 0,
                        num_inprogress_metadata_uploads: 0,
-                        num_inprogress_deletions: 0,
                        inprogress_tasks: HashMap::default(),
                        queued_operations: VecDeque::default(),
                    };
@@ -1334,9 +1296,7 @@ impl RemoteTimelineClient {

                // consistency check
                assert_eq!(
-                    qi.num_inprogress_layer_uploads
-                        + qi.num_inprogress_metadata_uploads
-                        + qi.num_inprogress_deletions,
+                    qi.num_inprogress_layer_uploads + qi.num_inprogress_metadata_uploads,
                    qi.inprogress_tasks.len()
                );

@@ -1361,14 +1321,84 @@ impl RemoteTimelineClient {
    }
 }

+pub fn remote_timelines_path(tenant_id: &TenantId) -> RemotePath {
+    let path = format!("tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}");
+    RemotePath::from_string(&path).expect("Failed to construct path")
+}
+
+pub fn remote_timeline_path(tenant_id: &TenantId, timeline_id: &TimelineId) -> RemotePath {
+    remote_timelines_path(tenant_id).join(&PathBuf::from(timeline_id.to_string()))
+}
+
+pub fn remote_layer_path(
+    tenant_id: &TenantId,
+    timeline_id: &TimelineId,
+    layer_file_name: &LayerFileName,
+    generation: Generation,
+) -> RemotePath {
+    // Generation-aware key format
+    let path = format!(
+        "tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}/{timeline_id}/{0}{1}",
+        layer_file_name.file_name(),
+        generation.get_suffix()
+    );
+
+    RemotePath::from_string(&path).expect("Failed to construct path")
+}
+
+pub fn remote_index_path(
+    tenant_id: &TenantId,
+    timeline_id: &TimelineId,
+    generation: Generation,
+) -> RemotePath {
+    RemotePath::from_string(&format!(
+        "tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}/{timeline_id}/{0}{1}",
+        IndexPart::FILE_NAME,
+        generation.get_suffix()
+    ))
+    .expect("Failed to construct path")
+}
+
+/// Files on the remote storage are stored with paths, relative to the workdir.
+/// That path includes in itself both tenant and timeline ids, allowing to have a unique remote storage path.
+///
+/// Errors if the path provided does not start from pageserver's workdir.
+pub fn remote_path(
+    conf: &PageServerConf,
+    local_path: &Path,
+    generation: Option<Generation>,
+) -> anyhow::Result<RemotePath> {
+    let stripped = local_path
+        .strip_prefix(&conf.workdir)
+        .context("Failed to strip workdir prefix")?;
+
+    let suffixed = if let Some(generation) = generation {
+        format!(
+            "{0}{1}",
+            stripped.to_string_lossy(),
+            generation.get_suffix()
+        )
+    } else {
+        stripped.to_string_lossy().to_string()
+    };
+
+    RemotePath::new(&PathBuf::from(suffixed)).with_context(|| {
+        format!(
+            "Failed to resolve remote part of path {:?} for base {:?}",
+            local_path, conf.workdir
+        )
+    })
+}
+
 #[cfg(test)]
 mod tests {
    use super::*;
    use crate::{
        context::RequestContext,
+        deletion_queue::mock::MockDeletionQueue,
        tenant::{
            harness::{TenantHarness, TIMELINE_ID},
-            Tenant, Timeline,
+            Generation, Tenant, Timeline,
        },
        DEFAULT_PG_VERSION,
    };
@@ -1410,8 +1440,11 @@ mod tests {
        assert_eq!(avec, bvec);
    }

-    fn assert_remote_files(expected: &[&str], remote_path: &Path) {
-        let mut expected: Vec<String> = expected.iter().map(|x| String::from(*x)).collect();
+    fn assert_remote_files(expected: &[&str], remote_path: &Path, generation: Generation) {
+        let mut expected: Vec<String> = expected
+            .iter()
+            .map(|x| format!("{}{}", x, generation.get_suffix()))
+            .collect();
        expected.sort();

        let mut found: Vec<String> = Vec::new();
@@ -1432,6 +1465,7 @@ mod tests {
        tenant_ctx: RequestContext,
        remote_fs_dir: PathBuf,
        client: Arc<RemoteTimelineClient>,
+        deletion_queue: MockDeletionQueue,
    }

    impl TestSetup {
@@ -1462,6 +1496,8 @@ mod tests {
                storage: RemoteStorageKind::LocalFs(remote_fs_dir.clone()),
            };

+            let generation = Generation::new(0xdeadbeef);
+
            let storage = GenericRemoteStorage::from_config(&storage_config).unwrap();

            let client = Arc::new(RemoteTimelineClient {
@@ -1469,7 +1505,8 @@ mod tests {
                runtime: tokio::runtime::Handle::current(),
                tenant_id: harness.tenant_id,
                timeline_id: TIMELINE_ID,
-                storage_impl: storage,
+                generation,
+                storage_impl: storage.clone(),
                upload_queue: Mutex::new(UploadQueue::Uninitialized),
                metrics: Arc::new(RemoteTimelineClientMetrics::new(
                    &harness.tenant_id,
@@ -1477,6 +1514,8 @@ mod tests {
                )),
            });

+            let deletion_queue = MockDeletionQueue::new(Some(storage));
+
            Ok(Self {
                harness,
                tenant,
@@ -1484,6 +1523,7 @@ mod tests {
                tenant_ctx: ctx,
                remote_fs_dir,
                client,
+                deletion_queue,
            })
        }
    }
@@ -1512,6 +1552,7 @@ mod tests {
            tenant_ctx: _tenant_ctx,
            remote_fs_dir,
            client,
+            deletion_queue,
        } = TestSetup::new("upload_scheduling").await.unwrap();

        let timeline_path = harness.timeline_path(&TIMELINE_ID);
@@ -1527,6 +1568,8 @@ mod tests {
            .init_upload_queue_for_empty_remote(&metadata)
            .unwrap();

+        let generation = Generation::new(0xdeadbeef);
+
        // Create a couple of dummy files,  schedule upload for them
        let layer_file_name_1: LayerFileName = "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap();
        let layer_file_name_2: LayerFileName = "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D9-00000000016B5A52".parse().unwrap();
@@ -1546,13 +1589,13 @@ mod tests {
        client
            .schedule_layer_file_upload(
                &layer_file_name_1,
-                &LayerFileMetadata::new(content_1.len() as u64),
+                &LayerFileMetadata::new(content_1.len() as u64, generation),
            )
            .unwrap();
        client
            .schedule_layer_file_upload(
                &layer_file_name_2,
-                &LayerFileMetadata::new(content_2.len() as u64),
+                &LayerFileMetadata::new(content_2.len() as u64, generation),
            )
            .unwrap();

@@ -1610,30 +1653,23 @@ mod tests {
                &layer_file_name_2.file_name(),
            ],
        );
-        let downloaded_metadata = index_part.parse_metadata().unwrap();
-        assert_eq!(downloaded_metadata, metadata);
+        assert_eq!(index_part.metadata, metadata);

        // Schedule upload and then a deletion. Check that the deletion is queued
        client
            .schedule_layer_file_upload(
                &layer_file_name_3,
-                &LayerFileMetadata::new(content_3.len() as u64),
+                &LayerFileMetadata::new(content_3.len() as u64, generation),
            )
            .unwrap();
-        client
-            .schedule_layer_file_deletion(&[layer_file_name_1.clone()])
-            .unwrap();
+
        {
            let mut guard = client.upload_queue.lock().unwrap();
            let upload_queue = guard.initialized_mut().unwrap();
-
-            // Deletion schedules upload of the index file, and the file deletion itself
-            assert!(upload_queue.queued_operations.len() == 2);
-            assert!(upload_queue.inprogress_tasks.len() == 1);
-            assert!(upload_queue.num_inprogress_layer_uploads == 1);
-            assert!(upload_queue.num_inprogress_deletions == 0);
-            assert!(upload_queue.latest_files_changes_since_metadata_upload_scheduled == 0);
+            assert_eq!(upload_queue.queued_operations.len(), 0);
+            assert_eq!(upload_queue.num_inprogress_layer_uploads, 1);
        }
+
        assert_remote_files(
            &[
                &layer_file_name_1.file_name(),
@@ -1641,10 +1677,50 @@ mod tests {
                "index_part.json",
            ],
            &remote_timeline_dir,
+            generation,
        );

-        // Finish them
+        client
+            .schedule_layer_file_deletion(
+                &[layer_file_name_1.clone()],
+                &deletion_queue.new_client(),
+            )
+            .await
+            .unwrap();
+
+        {
+            let mut guard = client.upload_queue.lock().unwrap();
+            let upload_queue = guard.initialized_mut().unwrap();
+
+            // Deletion schedules upload of the index file via RemoteTimelineClient, and
+            // deletion of layer files via DeletionQueue. The uploads have all been flushed
+            // because schedule_layer_file_deletion does a wait_completion before pushing
+            // to the deletion_queue.
+            assert_eq!(upload_queue.queued_operations.len(), 0);
+            assert_eq!(upload_queue.inprogress_tasks.len(), 0);
+            assert_eq!(upload_queue.num_inprogress_layer_uploads, 0);
+            assert_eq!(
+                upload_queue.latest_files_changes_since_metadata_upload_scheduled,
+                0
+            );
+        }
+        assert_remote_files(
+            &[
+                &layer_file_name_1.file_name(),
+                &layer_file_name_2.file_name(),
+                &layer_file_name_3.file_name(),
+                "index_part.json",
+            ],
+            &remote_timeline_dir,
+            generation,
+        );
+
+        // Finish uploads and deletions
        client.wait_completion().await.unwrap();
+        deletion_queue.pump().await;
+
+        // 1 layer was deleted
+        assert_eq!(deletion_queue.get_executed(), 1);

        assert_remote_files(
            &[
@@ -1653,6 +1729,7 @@ mod tests {
                "index_part.json",
            ],
            &remote_timeline_dir,
+            generation,
        );
    }

@@ -1705,12 +1782,14 @@ mod tests {

        // Test

+        let generation = Generation::new(0xdeadbeef);
+
        let init = get_bytes_started_stopped();

        client
            .schedule_layer_file_upload(
                &layer_file_name_1,
-                &LayerFileMetadata::new(content_1.len() as u64),
+                &LayerFileMetadata::new(content_1.len() as u64, generation),
            )
            .unwrap();

@@ -1745,4 +1824,23 @@ mod tests {
            }
        );
    }
+
+    // #[tokio::test]
+    // async fn index_part_download() {
+    //     let TestSetup {
+    //         harness,
+    //         tenant: _tenant,
+    //         timeline: _timeline,
+    //         client,
+    //         ..
+    //     } = TestSetup::new("index_part_download").await.unwrap();
+
+    //     let example_index_part = IndexPart {
+    //         version: 3,
+    //         timeline_layers: HashSet::new(),
+    //         layer_metadata:
+
+    //     }
+
+    // }
 }
--- a/pageserver/src/tenant/remote_timeline_client/delete.rs
+++ b/pageserver/src/tenant/remote_timeline_client/delete.rs
@@ -1,29 +0,0 @@
-//! Helper functions to delete files from remote storage with a RemoteStorage
-use anyhow::Context;
-use std::path::Path;
-use tracing::debug;
-
-use remote_storage::GenericRemoteStorage;
-
-use crate::config::PageServerConf;
-
-pub(super) async fn delete_layer<'a>(
-    conf: &'static PageServerConf,
-    storage: &'a GenericRemoteStorage,
-    local_layer_path: &'a Path,
-) -> anyhow::Result<()> {
-    fail::fail_point!("before-delete-layer", |_| {
-        anyhow::bail!("failpoint before-delete-layer")
-    });
-    debug!("Deleting layer from remote storage: {local_layer_path:?}",);
-
-    let path_to_delete = conf.remote_path(local_layer_path)?;
-
-    // We don't want to print an error if the delete failed if the file has
-    // already been deleted. Thankfully, in this situation S3 already
-    // does not yield an error. While OS-provided local file system APIs do yield
-    // errors, we avoid them in the `LocalFs` wrapper.
-    storage.delete(&path_to_delete).await.with_context(|| {
-        format!("Failed to delete remote layer from storage at {path_to_delete:?}")
-    })
-}
--- a/pageserver/src/tenant/remote_timeline_client/download.rs
+++ b/pageserver/src/tenant/remote_timeline_client/download.rs
@@ -15,14 +15,16 @@ use tokio_util::sync::CancellationToken;
 use utils::{backoff, crashsafe};

 use crate::config::PageServerConf;
+use crate::tenant::remote_timeline_client::{remote_layer_path, remote_timelines_path};
 use crate::tenant::storage_layer::LayerFileName;
 use crate::tenant::timeline::span::debug_assert_current_span_has_tenant_and_timeline_id;
-use remote_storage::{DownloadError, GenericRemoteStorage};
+use crate::tenant::Generation;
+use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
 use utils::crashsafe::path_with_suffix_extension;
 use utils::id::{TenantId, TimelineId};

 use super::index::{IndexPart, LayerFileMetadata};
-use super::{FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_REMOTE_OP_RETRIES};
+use super::{remote_index_path, FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_REMOTE_OP_RETRIES};

 static MAX_DOWNLOAD_DURATION: Duration = Duration::from_secs(120);

@@ -41,13 +43,16 @@ pub async fn download_layer_file<'a>(
 ) -> Result<u64, DownloadError> {
    debug_assert_current_span_has_tenant_and_timeline_id();

-    let timeline_path = conf.timeline_path(&tenant_id, &timeline_id);
+    let local_path = conf
+        .timeline_path(&tenant_id, &timeline_id)
+        .join(layer_file_name.file_name());

-    let local_path = timeline_path.join(layer_file_name.file_name());
-
-    let remote_path = conf
-        .remote_path(&local_path)
-        .map_err(DownloadError::Other)?;
+    let remote_path = remote_layer_path(
+        &tenant_id,
+        &timeline_id,
+        layer_file_name,
+        layer_metadata.generation,
+    );

    // Perform a rename inspired by durable_rename from file_utils.c.
    // The sequence:
@@ -173,21 +178,19 @@ pub fn is_temp_download_file(path: &Path) -> bool {
 }

 /// List timelines of given tenant in remote storage
-pub async fn list_remote_timelines<'a>(
-    storage: &'a GenericRemoteStorage,
-    conf: &'static PageServerConf,
+pub async fn list_remote_timelines(
+    storage: &GenericRemoteStorage,
    tenant_id: TenantId,
 ) -> anyhow::Result<HashSet<TimelineId>> {
-    let tenant_path = conf.timelines_path(&tenant_id);
-    let tenant_storage_path = conf.remote_path(&tenant_path)?;
+    let remote_path = remote_timelines_path(&tenant_id);

    fail::fail_point!("storage-sync-list-remote-timelines", |_| {
        anyhow::bail!("storage-sync-list-remote-timelines");
    });

    let timelines = download_retry(
-        || storage.list_prefixes(Some(&tenant_storage_path)),
-        &format!("list prefixes for {tenant_path:?}"),
+        || storage.list_prefixes(Some(&remote_path)),
+        &format!("list prefixes for {tenant_id}"),
    )
    .await?;

@@ -221,46 +224,140 @@ pub async fn list_remote_timelines<'a>(
    Ok(timeline_ids)
 }

+async fn do_download_index_part(
+    local_path: &Path,
+    storage: &GenericRemoteStorage,
+    tenant_id: &TenantId,
+    timeline_id: &TimelineId,
+    index_generation: Generation,
+) -> Result<IndexPart, DownloadError> {
+    let remote_path = remote_index_path(tenant_id, timeline_id, index_generation);
+
+    let index_part_bytes = download_retry(
+        || storage.download_all(&remote_path),
+        &format!("download {remote_path:?}"),
+    )
+    .await?;
+
+    let index_part: IndexPart = serde_json::from_slice(&index_part_bytes)
+        .with_context(|| format!("Failed to deserialize index part file for {local_path:?}"))
+        .map_err(DownloadError::Other)?;
+
+    Ok(index_part)
+}
+
 pub(super) async fn download_index_part(
    conf: &'static PageServerConf,
    storage: &GenericRemoteStorage,
    tenant_id: &TenantId,
    timeline_id: &TimelineId,
+    my_generation: Generation,
 ) -> Result<IndexPart, DownloadError> {
-    let index_part_path = conf
+    let local_path = conf
        .metadata_path(tenant_id, timeline_id)
        .with_file_name(IndexPart::FILE_NAME);
-    let part_storage_path = conf
-        .remote_path(&index_part_path)
-        .map_err(DownloadError::BadInput)?;

-    let index_part_bytes = download_retry(
-        || async {
-            let mut index_part_download = storage.download(&part_storage_path).await?;
+    if my_generation.is_none() {
+        // Operating without generations: just fetch the generation-less path
+        return do_download_index_part(&local_path, storage, tenant_id, timeline_id, my_generation)
+            .await;
+    }

-            let mut index_part_bytes = Vec::new();
-            tokio::io::copy(
-                &mut index_part_download.download_stream,
-                &mut index_part_bytes,
-            )
-            .await
-            .with_context(|| {
-                format!("Failed to download an index part into file {index_part_path:?}")
-            })
-            .map_err(DownloadError::Other)?;
-            Ok(index_part_bytes)
-        },
-        &format!("download {part_storage_path:?}"),
+    let previous_gen = my_generation.previous();
+    let r_previous =
+        do_download_index_part(&local_path, storage, tenant_id, timeline_id, previous_gen).await;
+
+    match r_previous {
+        Ok(index_part) => {
+            tracing::debug!("Found index_part from previous generation {previous_gen}");
+            return Ok(index_part);
+        }
+        Err(e) => {
+            if matches!(e, DownloadError::NotFound) {
+                tracing::debug!("No index_part found from previous generation {previous_gen}, falling back to listing");
+            } else {
+                return Err(e);
+            }
+        }
+    };
+
+    /// Given the key of an index, parse out the generation part of the name
+    fn parse_generation(path: RemotePath) -> Option<Generation> {
+        let path = path.take();
+        let file_name = match path.file_name() {
+            Some(f) => f,
+            None => {
+                // Unexpected: we should be seeing index_part.json paths only
+                tracing::warn!("Malformed index key {0}", path.display());
+                return None;
+            }
+        };
+
+        let file_name_str = match file_name.to_str() {
+            Some(s) => s,
+            None => {
+                tracing::warn!("Malformed index key {0}", path.display());
+                return None;
+            }
+        };
+
+        match file_name_str.split_once('-') {
+            Some((_, gen_suffix)) => u32::from_str_radix(gen_suffix, 16)
+                .map(Generation::new)
+                .ok(),
+            None => None,
+        }
+    }
+
+    // Fallback: we did not find an index_part.json from the previous generation, so
+    // we will list all the index_part objects and pick the most recent.
+    let index_prefix = remote_index_path(tenant_id, timeline_id, Generation::none());
+    let indices = backoff::retry(
+        || async { storage.list_files(Some(&index_prefix)).await },
+        |_| false,
+        FAILED_DOWNLOAD_WARN_THRESHOLD,
+        FAILED_REMOTE_OP_RETRIES,
+        "listing index_part files",
+        // TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
+        backoff::Cancel::new(CancellationToken::new(), || -> anyhow::Error {
+            unreachable!()
+        }),
    )
-    .await?;
+    .await
+    .map_err(DownloadError::Other)?;

-    let index_part: IndexPart = serde_json::from_slice(&index_part_bytes)
-        .with_context(|| {
-            format!("Failed to deserialize index part file into file {index_part_path:?}")
-        })
-        .map_err(DownloadError::Other)?;
+    let mut generations: Vec<_> = indices
+        .into_iter()
+        .filter_map(parse_generation)
+        .filter(|g| g <= &my_generation)
+        .collect();

-    Ok(index_part)
+    generations.sort();
+    match generations.last() {
+        Some(g) => {
+            tracing::debug!("Found index_part in generation {g} (my generation {my_generation})");
+            do_download_index_part(&local_path, storage, tenant_id, timeline_id, *g).await
+        }
+        None => {
+            // This is not an error: the timeline may be newly created, or we may be
+            // upgrading and have no historical index_part with a generation suffix.
+            // Fall back to trying to load the un-suffixed index_part.json.
+            tracing::info!(
+                "No index_part.json-* found when loading {}/{} in generation {}",
+                tenant_id,
+                timeline_id,
+                my_generation
+            );
+            do_download_index_part(
+                &local_path,
+                storage,
+                tenant_id,
+                timeline_id,
+                Generation::none(),
+            )
+            .await
+        }
+    }
 }

 /// Helper function to handle retries for a download operation.
--- a/pageserver/src/tenant/remote_timeline_client/index.rs
+++ b/pageserver/src/tenant/remote_timeline_client/index.rs
@@ -12,6 +12,7 @@ use utils::bin_ser::SerializeError;
 use crate::tenant::metadata::TimelineMetadata;
 use crate::tenant::storage_layer::LayerFileName;
 use crate::tenant::upload_queue::UploadQueueInitialized;
+use crate::tenant::Generation;

 use utils::lsn::Lsn;

@@ -20,22 +21,28 @@ use utils::lsn::Lsn;
 /// Fields have to be `Option`s because remote [`IndexPart`]'s can be from different version, which
 /// might have less or more metadata depending if upgrading or rolling back an upgrade.
 #[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
-#[cfg_attr(test, derive(Default))]
+//#[cfg_attr(test, derive(Default))]
 pub struct LayerFileMetadata {
    file_size: u64,
+
+    pub(crate) generation: Generation,
 }

 impl From<&'_ IndexLayerMetadata> for LayerFileMetadata {
    fn from(other: &IndexLayerMetadata) -> Self {
        LayerFileMetadata {
            file_size: other.file_size,
+            generation: other.generation,
        }
    }
 }

 impl LayerFileMetadata {
-    pub fn new(file_size: u64) -> Self {
-        LayerFileMetadata { file_size }
+    pub fn new(file_size: u64, generation: Generation) -> Self {
+        LayerFileMetadata {
+            file_size,
+            generation,
+        }
    }

    pub fn file_size(&self) -> u64 {
@@ -77,7 +84,9 @@ pub struct IndexPart {
    // private because internally we would read from metadata instead.
    #[serde_as(as = "DisplayFromStr")]
    disk_consistent_lsn: Lsn,
-    metadata_bytes: Vec<u8>,
+
+    #[serde(rename = "metadata_bytes")]
+    pub metadata: TimelineMetadata,
 }

 impl IndexPart {
@@ -95,7 +104,7 @@ impl IndexPart {
    pub fn new(
        layers_and_metadata: HashMap<LayerFileName, LayerFileMetadata>,
        disk_consistent_lsn: Lsn,
-        metadata_bytes: Vec<u8>,
+        metadata: TimelineMetadata,
    ) -> Self {
        let mut timeline_layers = HashSet::with_capacity(layers_and_metadata.len());
        let mut layer_metadata = HashMap::with_capacity(layers_and_metadata.len());
@@ -111,14 +120,10 @@ impl IndexPart {
            timeline_layers,
            layer_metadata,
            disk_consistent_lsn,
-            metadata_bytes,
+            metadata,
            deleted_at: None,
        }
    }
-
-    pub fn parse_metadata(&self) -> anyhow::Result<TimelineMetadata> {
-        TimelineMetadata::from_bytes(&self.metadata_bytes)
-    }
 }

 impl TryFrom<&UploadQueueInitialized> for IndexPart {
@@ -126,26 +131,31 @@ impl TryFrom<&UploadQueueInitialized> for IndexPart {

    fn try_from(upload_queue: &UploadQueueInitialized) -> Result<Self, Self::Error> {
        let disk_consistent_lsn = upload_queue.latest_metadata.disk_consistent_lsn();
-        let metadata_bytes = upload_queue.latest_metadata.to_bytes()?;
+        let metadata = upload_queue.latest_metadata.clone();

        Ok(Self::new(
            upload_queue.latest_files.clone(),
            disk_consistent_lsn,
-            metadata_bytes,
+            metadata,
        ))
    }
 }

 /// Serialized form of [`LayerFileMetadata`].
-#[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize, Default)]
+#[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize)]
 pub struct IndexLayerMetadata {
    pub(super) file_size: u64,
+
+    #[serde(default = "Generation::none")]
+    #[serde(skip_serializing_if = "Generation::is_none")]
+    pub(super) generation: Generation,
 }

 impl From<&'_ LayerFileMetadata> for IndexLayerMetadata {
    fn from(other: &'_ LayerFileMetadata) -> Self {
        IndexLayerMetadata {
            file_size: other.file_size,
+            generation: other.generation,
        }
    }
 }
@@ -174,15 +184,17 @@ mod tests {
            layer_metadata: HashMap::from([
                ("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
                    file_size: 25600000,
+                    generation: Generation::none()
                }),
                ("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata {
                    // serde_json should always parse this but this might be a double with jq for
                    // example.
                    file_size: 9007199254741001,
+                    generation: Generation::none()
                })
            ]),
            disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
-            metadata_bytes: [113,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0].to_vec(),
+            metadata: TimelineMetadata::from_bytes(&[113,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]).unwrap(),
            deleted_at: None,
        };

@@ -201,7 +213,7 @@ mod tests {
                "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51": { "file_size": 9007199254741001 }
            },
            "disk_consistent_lsn":"0/16960E8",
-            "metadata_bytes":[112,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
+            "metadata_bytes":[113,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
        }"#;

        let expected = IndexPart {
@@ -211,15 +223,17 @@ mod tests {
            layer_metadata: HashMap::from([
                ("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
                    file_size: 25600000,
+                    generation: Generation::none()
                }),
                ("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata {
                    // serde_json should always parse this but this might be a double with jq for
                    // example.
                    file_size: 9007199254741001,
+                    generation: Generation::none()
                })
            ]),
            disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
-            metadata_bytes: [112,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0].to_vec(),
+            metadata: TimelineMetadata::from_bytes(&[113,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]).unwrap(),
            deleted_at: None,
        };

@@ -238,7 +252,7 @@ mod tests {
                "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51": { "file_size": 9007199254741001 }
            },
            "disk_consistent_lsn":"0/16960E8",
-            "metadata_bytes":[112,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
+            "metadata_bytes":[113,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
            "deleted_at": "2023-07-31T09:00:00.123"
        }"#;

@@ -249,15 +263,17 @@ mod tests {
            layer_metadata: HashMap::from([
                ("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
                    file_size: 25600000,
+                    generation: Generation::none()
                }),
                ("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata {
                    // serde_json should always parse this but this might be a double with jq for
                    // example.
                    file_size: 9007199254741001,
+                    generation: Generation::none()
                })
            ]),
            disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
-            metadata_bytes: [112,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0].to_vec(),
+            metadata: TimelineMetadata::from_bytes(&[113,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]).unwrap(),
            deleted_at: Some(chrono::NaiveDateTime::parse_from_str(
                "2023-07-31T09:00:00.123000000", "%Y-%m-%dT%H:%M:%S.%f").unwrap())
        };
@@ -281,7 +297,7 @@ mod tests {
            timeline_layers: HashSet::new(),
            layer_metadata: HashMap::new(),
            disk_consistent_lsn: "0/2532648".parse::<Lsn>().unwrap(),
-            metadata_bytes: [
+            metadata: TimelineMetadata::from_bytes(&[
                136, 151, 49, 208, 0, 70, 0, 4, 0, 0, 0, 0, 2, 83, 38, 72, 1, 0, 0, 0, 0, 2, 83,
                38, 32, 1, 87, 198, 240, 135, 97, 119, 45, 125, 38, 29, 155, 161, 140, 141, 255,
                210, 0, 0, 0, 0, 2, 83, 38, 72, 0, 0, 0, 0, 1, 73, 240, 192, 0, 0, 0, 0, 1, 73,
@@ -302,8 +318,8 @@ mod tests {
                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                0, 0,
-            ]
-            .to_vec(),
+            ])
+            .unwrap(),
            deleted_at: None,
        };

--- a/pageserver/src/tenant/remote_timeline_client/upload.rs
+++ b/pageserver/src/tenant/remote_timeline_client/upload.rs
@@ -5,7 +5,11 @@ use fail::fail_point;
 use std::{io::ErrorKind, path::Path};
 use tokio::fs;

-use crate::{config::PageServerConf, tenant::remote_timeline_client::index::IndexPart};
+use super::Generation;
+use crate::{
+    config::PageServerConf,
+    tenant::remote_timeline_client::{index::IndexPart, remote_index_path, remote_path},
+};
 use remote_storage::GenericRemoteStorage;
 use utils::id::{TenantId, TimelineId};

@@ -15,10 +19,10 @@ use tracing::info;

 /// Serializes and uploads the given index part data to the remote storage.
 pub(super) async fn upload_index_part<'a>(
-    conf: &'static PageServerConf,
    storage: &'a GenericRemoteStorage,
    tenant_id: &TenantId,
    timeline_id: &TimelineId,
+    generation: Generation,
    index_part: &'a IndexPart,
 ) -> anyhow::Result<()> {
    tracing::trace!("uploading new index part");
@@ -32,13 +36,9 @@ pub(super) async fn upload_index_part<'a>(
    let index_part_size = index_part_bytes.len();
    let index_part_bytes = tokio::io::BufReader::new(std::io::Cursor::new(index_part_bytes));

-    let index_part_path = conf
-        .metadata_path(tenant_id, timeline_id)
-        .with_file_name(IndexPart::FILE_NAME);
-    let storage_path = conf.remote_path(&index_part_path)?;
-
+    let remote_path = remote_index_path(tenant_id, timeline_id, generation);
    storage
-        .upload_storage_object(Box::new(index_part_bytes), index_part_size, &storage_path)
+        .upload_storage_object(Box::new(index_part_bytes), index_part_size, &remote_path)
        .await
        .with_context(|| format!("Failed to upload index part for '{tenant_id} / {timeline_id}'"))
 }
@@ -52,12 +52,13 @@ pub(super) async fn upload_timeline_layer<'a>(
    storage: &'a GenericRemoteStorage,
    source_path: &'a Path,
    known_metadata: &'a LayerFileMetadata,
+    generation: Generation,
 ) -> anyhow::Result<()> {
    fail_point!("before-upload-layer", |_| {
        bail!("failpoint before-upload-layer")
    });
-    let storage_path = conf.remote_path(source_path)?;

+    let storage_path = remote_path(conf, source_path, Some(generation))?;
    let source_file_res = fs::File::open(&source_path).await;
    let source_file = match source_file_res {
        Ok(source_file) => source_file,
--- a/pageserver/src/tenant/storage_layer/delta_layer.rs
+++ b/pageserver/src/tenant/storage_layer/delta_layer.rs
@@ -467,7 +467,7 @@ impl DeltaLayer {
            PathOrConf::Path(_) => None,
        };

-        let loaded = DeltaLayerInner::load(&path, summary)?;
+        let loaded = DeltaLayerInner::load(&path, summary).await?;

        if let PathOrConf::Path(ref path) = self.path_or_conf {
            // not production code
@@ -841,12 +841,15 @@ impl Drop for DeltaLayerWriter {
 }

 impl DeltaLayerInner {
-    pub(super) fn load(path: &std::path::Path, summary: Option<Summary>) -> anyhow::Result<Self> {
+    pub(super) async fn load(
+        path: &std::path::Path,
+        summary: Option<Summary>,
+    ) -> anyhow::Result<Self> {
        let file = VirtualFile::open(path)
            .with_context(|| format!("Failed to open file '{}'", path.display()))?;
        let file = FileBlockReader::new(file);

-        let summary_blk = file.read_blk(0)?;
+        let summary_blk = file.read_blk(0).await?;
        let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;

        if let Some(mut expected_summary) = summary {
@@ -1028,7 +1031,7 @@ impl<'a> ValueRef<'a> {
 pub(crate) struct Adapter<T>(T);

 impl<T: AsRef<DeltaLayerInner>> Adapter<T> {
-    pub(crate) fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
-        self.0.as_ref().file.read_blk(blknum)
+    pub(crate) async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
+        self.0.as_ref().file.read_blk(blknum).await
    }
 }
--- a/pageserver/src/tenant/storage_layer/image_layer.rs
+++ b/pageserver/src/tenant/storage_layer/image_layer.rs
@@ -349,7 +349,8 @@ impl ImageLayer {
            PathOrConf::Path(_) => None,
        };

-        let loaded = ImageLayerInner::load(&path, self.desc.image_layer_lsn(), expected_summary)?;
+        let loaded =
+            ImageLayerInner::load(&path, self.desc.image_layer_lsn(), expected_summary).await?;

        if let PathOrConf::Path(ref path) = self.path_or_conf {
            // not production code
@@ -432,7 +433,7 @@ impl ImageLayer {
 }

 impl ImageLayerInner {
-    pub(super) fn load(
+    pub(super) async fn load(
        path: &std::path::Path,
        lsn: Lsn,
        summary: Option<Summary>,
@@ -440,7 +441,7 @@ impl ImageLayerInner {
        let file = VirtualFile::open(path)
            .with_context(|| format!("Failed to open file '{}'", path.display()))?;
        let file = FileBlockReader::new(file);
-        let summary_blk = file.read_blk(0)?;
+        let summary_blk = file.read_blk(0).await?;
        let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;

        if let Some(mut expected_summary) = summary {
--- a/pageserver/src/tenant/timeline.rs
+++ b/pageserver/src/tenant/timeline.rs
@@ -38,6 +38,7 @@ use std::time::{Duration, Instant, SystemTime};
 use crate::context::{
    AccessStatsBehavior, DownloadBehavior, RequestContext, RequestContextBuilder,
 };
+use crate::deletion_queue::DeletionQueueClient;
 use crate::tenant::remote_timeline_client::index::LayerFileMetadata;
 use crate::tenant::storage_layer::delta_layer::DeltaEntry;
 use crate::tenant::storage_layer::{
@@ -67,6 +68,7 @@ use postgres_connection::PgConnectionConfig;
 use postgres_ffi::to_pg_timestamp;
 use utils::{
    completion,
+    generation::Generation,
    id::{TenantId, TimelineId},
    lsn::{AtomicLsn, Lsn, RecordLsn},
    seqwait::SeqWait,
@@ -141,6 +143,7 @@ fn drop_wlock<T>(rlock: tokio::sync::RwLockWriteGuard<'_, T>) {
 /// The outward-facing resources required to build a Timeline
 pub struct TimelineResources {
    pub remote_client: Option<RemoteTimelineClient>,
+    pub deletion_queue_client: Option<DeletionQueueClient>,
 }

 pub struct Timeline {
@@ -152,6 +155,9 @@ pub struct Timeline {
    pub tenant_id: TenantId,
    pub timeline_id: TimelineId,

+    // The generation of the tenant that instantiated us: this is used for safety when writing remote objects
+    generation: Generation,
+
    pub pg_version: u32,

    /// The tuple has two elements.
@@ -195,6 +201,9 @@ pub struct Timeline {
    /// See [`remote_timeline_client`](super::remote_timeline_client) module comment for details.
    pub remote_client: Option<Arc<RemoteTimelineClient>>,

+    /// Deletion queue: a global queue, separate from the remote storage queue.
+    deletion_queue_client: Option<Arc<DeletionQueueClient>>,
+
    // What page versions do we hold in the repository? If we get a
    // request > last_record_lsn, we need to wait until we receive all
    // the WAL up to the request. The SeqWait provides functions for
@@ -465,7 +474,7 @@ impl Timeline {
        // The cached image can be returned directly if there is no WAL between the cached image
        // and requested LSN. The cached image can also be used to reduce the amount of WAL needed
        // for redo.
-        let cached_page_img = match self.lookup_cached_page(&key, lsn) {
+        let cached_page_img = match self.lookup_cached_page(&key, lsn).await {
            Some((cached_lsn, cached_img)) => {
                match cached_lsn.cmp(&lsn) {
                    Ordering::Less => {} // there might be WAL between cached_lsn and lsn, we need to check
@@ -494,6 +503,7 @@ impl Timeline {

        RECONSTRUCT_TIME
            .observe_closure_duration(|| self.reconstruct_value(key, lsn, reconstruct_state))
+            .await
    }

    /// Get last or prev record separately. Same as get_last_record_rlsn().last/prev.
@@ -1198,7 +1208,7 @@ impl Timeline {
                Ok(delta) => Some(delta),
            };

-        let layer_metadata = LayerFileMetadata::new(layer_file_size);
+        let layer_metadata = LayerFileMetadata::new(layer_file_size, self.generation);

        let new_remote_layer = Arc::new(match local_layer.filename() {
            LayerFileName::Image(image_name) => RemoteLayer::new_img(
@@ -1261,6 +1271,18 @@ impl Timeline {

        Ok(())
    }
+
+    async fn delete_all_remote(&self) -> anyhow::Result<()> {
+        if let Some(remote_client) = &self.remote_client {
+            if let Some(deletion_queue_client) = &self.deletion_queue_client {
+                remote_client.delete_all(deletion_queue_client).await
+            } else {
+                Ok(())
+            }
+        } else {
+            Ok(())
+        }
+    }
 }

 #[derive(Debug, thiserror::Error)]
@@ -1376,6 +1398,7 @@ impl Timeline {
        ancestor: Option<Arc<Timeline>>,
        timeline_id: TimelineId,
        tenant_id: TenantId,
+        generation: Generation,
        walredo_mgr: Arc<dyn WalRedoManager + Send + Sync>,
        resources: TimelineResources,
        pg_version: u32,
@@ -1405,6 +1428,7 @@ impl Timeline {
                myself: myself.clone(),
                timeline_id,
                tenant_id,
+                generation,
                pg_version,
                layers: Arc::new(tokio::sync::RwLock::new(LayerManager::create())),
                wanted_image_layers: Mutex::new(None),
@@ -1413,6 +1437,7 @@ impl Timeline {
                walreceiver: Mutex::new(None),

                remote_client: resources.remote_client.map(Arc::new),
+                deletion_queue_client: resources.deletion_queue_client.map(Arc::new),

                // initialize in-memory 'last_record_lsn' from 'disk_consistent_lsn'.
                last_record_lsn: SeqWait::new(RecordLsn {
@@ -1614,7 +1639,10 @@ impl Timeline {
        let (conf, tenant_id, timeline_id) = (self.conf, self.tenant_id, self.timeline_id);
        let span = tracing::Span::current();

-        let (loaded_layers, needs_upload, total_physical_size) = tokio::task::spawn_blocking({
+        // Copy to move into the task we're about to spawn
+        let generation = self.generation;
+
+        let (loaded_layers, to_sync, total_physical_size) = tokio::task::spawn_blocking({
            move || {
                let _g = span.entered();
                let discovered = init::scan_timeline_dir(&timeline_path)?;
@@ -1655,11 +1683,16 @@ impl Timeline {
                    );
                }

-                let decided =
-                    init::reconcile(discovered_layers, index_part.as_ref(), disk_consistent_lsn);
+                let decided = init::reconcile(
+                    discovered_layers,
+                    index_part.as_ref(),
+                    disk_consistent_lsn,
+                    generation,
+                );

                let mut loaded_layers = Vec::new();
                let mut needs_upload = Vec::new();
+                let mut needs_cleanup = Vec::new();
                let mut total_physical_size = 0;

                for (name, decision) in decided {
@@ -1675,14 +1708,10 @@ impl Timeline {
                        Err(FutureLayer { local }) => {
                            if local.is_some() {
                                path.push(name.file_name());
-                                init::cleanup_future_layer(&path, name, disk_consistent_lsn)?;
+                                init::cleanup_future_layer(&path, &name, disk_consistent_lsn)?;
                                path.pop();
-                            } else {
-                                // we cannot do anything for remote layers, but not continuing to
-                                // process it will leave it out index_part.json as well.
                            }
-                            //
-                            // we do not currently schedule deletions for these.
+                            needs_cleanup.push(name);
                            continue;
                        }
                    };
@@ -1736,7 +1765,11 @@ impl Timeline {

                    loaded_layers.push(layer);
                }
-                Ok((loaded_layers, needs_upload, total_physical_size))
+                Ok((
+                    loaded_layers,
+                    (needs_upload, needs_cleanup),
+                    total_physical_size,
+                ))
            }
        })
        .await
@@ -1748,9 +1781,15 @@ impl Timeline {
        guard.initialize_local_layers(loaded_layers, disk_consistent_lsn + 1);

        if let Some(rtc) = self.remote_client.as_ref() {
+            // Deletion queue client is always Some if remote_client is Some
+            let deletion_queue_client = self.deletion_queue_client.as_ref().unwrap();
+
+            let (needs_upload, needs_cleanup) = to_sync;
            for (layer, m) in needs_upload {
                rtc.schedule_layer_file_upload(&layer.layer_desc().filename(), &m)?;
            }
+            rtc.schedule_layer_file_deletion(&needs_cleanup, deletion_queue_client)
+                .await?;
            rtc.schedule_index_upload_for_file_changes()?;
            // Tenant::create_timeline will wait for these uploads to happen before returning, or
            // on retry.
@@ -2261,7 +2300,15 @@ impl Timeline {
                        )));
                    }
                }
-                ancestor.wait_lsn(timeline.ancestor_lsn, ctx).await?;
+                ancestor
+                    .wait_lsn(timeline.ancestor_lsn, ctx)
+                    .await
+                    .with_context(|| {
+                        format!(
+                            "wait for lsn {} on ancestor timeline_id={}",
+                            timeline.ancestor_lsn, ancestor.timeline_id
+                        )
+                    })?;

                timeline_owned = ancestor;
                timeline = &*timeline_owned;
@@ -2440,13 +2487,14 @@ impl Timeline {
        }
    }

-    fn lookup_cached_page(&self, key: &Key, lsn: Lsn) -> Option<(Lsn, Bytes)> {
+    async fn lookup_cached_page(&self, key: &Key, lsn: Lsn) -> Option<(Lsn, Bytes)> {
        let cache = page_cache::get();

        // FIXME: It's pointless to check the cache for things that are not 8kB pages.
        // We should look at the key to determine if it's a cacheable object
-        let (lsn, read_guard) =
-            cache.lookup_materialized_page(self.tenant_id, self.timeline_id, key, lsn)?;
+        let (lsn, read_guard) = cache
+            .lookup_materialized_page(self.tenant_id, self.timeline_id, key, lsn)
+            .await?;
        let img = Bytes::from(read_guard.to_vec());
        Some((lsn, img))
    }
@@ -2656,7 +2704,7 @@ impl Timeline {
                (
                    HashMap::from([(
                        layer.filename(),
-                        LayerFileMetadata::new(layer.layer_desc().file_size),
+                        LayerFileMetadata::new(layer.layer_desc().file_size, self.generation),
                    )]),
                    Some(layer),
                )
@@ -3052,7 +3100,10 @@ impl Timeline {
                .metadata()
                .with_context(|| format!("reading metadata of layer file {}", path.file_name()))?;

-            layer_paths_to_upload.insert(path, LayerFileMetadata::new(metadata.len()));
+            layer_paths_to_upload.insert(
+                path,
+                LayerFileMetadata::new(metadata.len(), self.generation),
+            );

            self.metrics
                .resident_physical_size_gauge
@@ -3727,7 +3778,7 @@ impl Timeline {
            if let Some(remote_client) = &self.remote_client {
                remote_client.schedule_layer_file_upload(
                    &l.filename(),
-                    &LayerFileMetadata::new(metadata.len()),
+                    &LayerFileMetadata::new(metadata.len(), self.generation),
                )?;
            }

@@ -3736,7 +3787,10 @@ impl Timeline {
                .resident_physical_size_gauge
                .add(metadata.len());

-            new_layer_paths.insert(new_delta_path, LayerFileMetadata::new(metadata.len()));
+            new_layer_paths.insert(
+                new_delta_path,
+                LayerFileMetadata::new(metadata.len(), self.generation),
+            );
            l.access_stats().record_residence_event(
                LayerResidenceStatus::Resident,
                LayerResidenceEventReason::LayerCreate,
@@ -3776,7 +3830,13 @@ impl Timeline {

        // Also schedule the deletions in remote storage
        if let Some(remote_client) = &self.remote_client {
-            remote_client.schedule_layer_file_deletion(&layer_names_to_delete)?;
+            let deletion_queue = self
+                .deletion_queue_client
+                .as_ref()
+                .ok_or_else(|| anyhow::anyhow!("Remote storage enabled without deletion queue"))?;
+            remote_client
+                .schedule_layer_file_deletion(&layer_names_to_delete, deletion_queue)
+                .await?;
        }

        Ok(())
@@ -4110,7 +4170,15 @@ impl Timeline {
            }

            if let Some(remote_client) = &self.remote_client {
-                remote_client.schedule_layer_file_deletion(&layer_names_to_delete)?;
+                // Remote metadata upload was scheduled in `update_metadata_file`: wait
+                // for completion before scheduling any deletions.
+                remote_client.wait_completion().await?;
+                let deletion_queue = self.deletion_queue_client.as_ref().ok_or_else(|| {
+                    anyhow::anyhow!("Remote storage enabled without deletion queue")
+                })?;
+                remote_client
+                    .schedule_layer_file_deletion(&layer_names_to_delete, deletion_queue)
+                    .await?;
            }

            apply.flush();
@@ -4128,7 +4196,7 @@ impl Timeline {
    ///
    /// Reconstruct a value, using the given base image and WAL records in 'data'.
    ///
-    fn reconstruct_value(
+    async fn reconstruct_value(
        &self,
        key: Key,
        request_lsn: Lsn,
@@ -4197,6 +4265,7 @@ impl Timeline {
                            last_rec_lsn,
                            &img,
                        )
+                        .await
                        .context("Materialized page memoization failed")
                    {
                        return Err(PageReconstructError::from(e));
@@ -4699,6 +4768,7 @@ mod tests {

    use utils::{id::TimelineId, lsn::Lsn};

+    use crate::deletion_queue::mock::MockDeletionQueue;
    use crate::tenant::{harness::TenantHarness, storage_layer::PersistentLayer};

    use super::{EvictionError, Timeline};
@@ -4721,9 +4791,17 @@ mod tests {
            };
            GenericRemoteStorage::from_config(&config).unwrap()
        };
+        let deletion_queue = MockDeletionQueue::new(Some(remote_storage.clone()), harness.conf);

        let ctx = any_context();
-        let tenant = harness.try_load(&ctx, Some(remote_storage)).await.unwrap();
+        let tenant = harness
+            .try_load(
+                &ctx,
+                Some(remote_storage),
+                Some(deletion_queue.new_client()),
+            )
+            .await
+            .unwrap();
        let timeline = tenant
            .create_test_timeline(TimelineId::generate(), Lsn(0x10), 14, &ctx)
            .await
@@ -4786,9 +4864,17 @@ mod tests {
            };
            GenericRemoteStorage::from_config(&config).unwrap()
        };
+        let deletion_queue = MockDeletionQueue::new(Some(remote_storage.clone()), harness.conf);

        let ctx = any_context();
-        let tenant = harness.try_load(&ctx, Some(remote_storage)).await.unwrap();
+        let tenant = harness
+            .try_load(
+                &ctx,
+                Some(remote_storage),
+                Some(deletion_queue.new_client()),
+            )
+            .await
+            .unwrap();
        let timeline = tenant
            .create_test_timeline(TimelineId::generate(), Lsn(0x10), 14, &ctx)
            .await
--- a/pageserver/src/tenant/timeline/delete.rs
+++ b/pageserver/src/tenant/timeline/delete.rs
@@ -14,6 +14,7 @@ use utils::{

 use crate::{
    config::PageServerConf,
+    deletion_queue::DeletionQueueClient,
    task_mgr::{self, TaskKind},
    tenant::{
        metadata::TimelineMetadata,
@@ -238,15 +239,6 @@ async fn delete_local_layer_files(
    Ok(())
 }

-/// Removes remote layers and an index file after them.
-async fn delete_remote_layers_and_index(timeline: &Timeline) -> anyhow::Result<()> {
-    if let Some(remote_client) = &timeline.remote_client {
-        remote_client.delete_all().await.context("delete_all")?
-    };
-
-    Ok(())
-}
-
 // This function removs remaining traces of a timeline on disk.
 // Namely: metadata file, timeline directory, delete mark.
 // Note: io::ErrorKind::NotFound are ignored for metadata and timeline dir.
@@ -407,6 +399,7 @@ impl DeleteTimelineFlow {
        timeline_id: TimelineId,
        local_metadata: &TimelineMetadata,
        remote_client: Option<RemoteTimelineClient>,
+        deletion_queue_client: Option<DeletionQueueClient>,
        init_order: Option<&InitializationOrder>,
    ) -> anyhow::Result<()> {
        // Note: here we even skip populating layer map. Timeline is essentially uninitialized.
@@ -416,7 +409,10 @@ impl DeleteTimelineFlow {
                timeline_id,
                local_metadata,
                None, // Ancestor is not needed for deletion.
-                TimelineResources { remote_client },
+                TimelineResources {
+                    remote_client,
+                    deletion_queue_client,
+                },
                init_order,
                // Important. We dont pass ancestor above because it can be missing.
                // Thus we need to skip the validation here.
@@ -559,7 +555,7 @@ impl DeleteTimelineFlow {
    ) -> Result<(), DeleteTimelineError> {
        delete_local_layer_files(conf, tenant.tenant_id, timeline).await?;

-        delete_remote_layers_and_index(timeline).await?;
+        timeline.delete_all_remote().await?;

        pausable_failpoint!("in_progress_delete");

--- a/pageserver/src/tenant/timeline/init.rs
+++ b/pageserver/src/tenant/timeline/init.rs
@@ -7,6 +7,7 @@ use crate::{
            index::{IndexPart, LayerFileMetadata},
        },
        storage_layer::LayerFileName,
+        Generation,
    },
    METADATA_FILE_NAME,
 };
@@ -104,6 +105,7 @@ pub(super) fn reconcile(
    discovered: Vec<(LayerFileName, u64)>,
    index_part: Option<&IndexPart>,
    disk_consistent_lsn: Lsn,
+    generation: Generation,
 ) -> Vec<(LayerFileName, Result<Decision, FutureLayer>)> {
    use Decision::*;

@@ -112,7 +114,15 @@ pub(super) fn reconcile(

    let mut discovered = discovered
        .into_iter()
-        .map(|(name, file_size)| (name, (Some(LayerFileMetadata::new(file_size)), None)))
+        .map(|(name, file_size)| {
+            (
+                name,
+                // The generation here will be corrected to match IndexPart in the merge below, unless
+                // it is not in IndexPart, in which case using our current generation makes sense
+                // because it will be uploaded in this generation.
+                (Some(LayerFileMetadata::new(file_size, generation)), None),
+            )
+        })
        .collect::<Collected>();

    // merge any index_part information, when available
@@ -137,7 +147,11 @@ pub(super) fn reconcile(
                Err(FutureLayer { local })
            } else {
                Ok(match (local, remote) {
-                    (Some(local), Some(remote)) if local != remote => UseRemote { local, remote },
+                    (Some(local), Some(remote)) if local != remote => {
+                        assert_eq!(local.generation, remote.generation);
+
+                        UseRemote { local, remote }
+                    }
                    (Some(x), Some(_)) => UseLocal(x),
                    (None, Some(x)) => Evicted(x),
                    (Some(x), None) => NeedsUpload(x),
@@ -183,7 +197,7 @@ pub(super) fn cleanup_local_file_for_remote(

 pub(super) fn cleanup_future_layer(
    path: &Path,
-    name: LayerFileName,
+    name: &LayerFileName,
    disk_consistent_lsn: Lsn,
 ) -> anyhow::Result<()> {
    use LayerFileName::*;
--- a/pageserver/src/tenant/timeline/walreceiver/connection_manager.rs
+++ b/pageserver/src/tenant/timeline/walreceiver/connection_manager.rs
@@ -31,10 +31,11 @@ use storage_broker::Streaming;
 use tokio::select;
 use tracing::*;

-use postgres_connection::{parse_host_port, PgConnectionConfig};
+use postgres_connection::PgConnectionConfig;
 use utils::backoff::{
    exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS,
 };
+use utils::postgres_client::wal_stream_connection_config;
 use utils::{
    id::{NodeId, TenantTimelineId},
    lsn::Lsn,
@@ -879,33 +880,6 @@ impl ReconnectReason {
    }
 }

-fn wal_stream_connection_config(
-    TenantTimelineId {
-        tenant_id,
-        timeline_id,
-    }: TenantTimelineId,
-    listen_pg_addr_str: &str,
-    auth_token: Option<&str>,
-    availability_zone: Option<&str>,
-) -> anyhow::Result<PgConnectionConfig> {
-    let (host, port) =
-        parse_host_port(listen_pg_addr_str).context("Unable to parse listen_pg_addr_str")?;
-    let port = port.unwrap_or(5432);
-    let mut connstr = PgConnectionConfig::new_host_port(host, port)
-        .extend_options([
-            "-c".to_owned(),
-            format!("timeline_id={}", timeline_id),
-            format!("tenant_id={}", tenant_id),
-        ])
-        .set_password(auth_token.map(|s| s.to_owned()));
-
-    if let Some(availability_zone) = availability_zone {
-        connstr = connstr.extend_options([format!("availability_zone={}", availability_zone)]);
-    }
-
-    Ok(connstr)
-}
-
 #[cfg(test)]
 mod tests {
    use super::*;
@@ -921,6 +895,7 @@ mod tests {
            timeline: SafekeeperTimelineInfo {
                safekeeper_id: 0,
                tenant_timeline_id: None,
+                term: 0,
                last_log_term: 0,
                flush_lsn: 0,
                commit_lsn,
@@ -929,6 +904,7 @@ mod tests {
                peer_horizon_lsn: 0,
                local_start_lsn: 0,
                safekeeper_connstr: safekeeper_connstr.to_owned(),
+                http_connstr: safekeeper_connstr.to_owned(),
                availability_zone: None,
            },
            latest_update,
--- a/pageserver/src/tenant/upload_queue.rs
+++ b/pageserver/src/tenant/upload_queue.rs
@@ -1,5 +1,3 @@
-use crate::metrics::RemoteOpFileKind;
-
 use super::storage_layer::LayerFileName;
 use crate::tenant::metadata::TimelineMetadata;
 use crate::tenant::remote_timeline_client::index::IndexPart;
@@ -62,7 +60,6 @@ pub(crate) struct UploadQueueInitialized {
    // Breakdown of different kinds of tasks currently in-progress
    pub(crate) num_inprogress_layer_uploads: usize,
    pub(crate) num_inprogress_metadata_uploads: usize,
-    pub(crate) num_inprogress_deletions: usize,

    /// Tasks that are currently in-progress. In-progress means that a tokio Task
    /// has been launched for it. An in-progress task can be busy uploading, but it can
@@ -120,7 +117,6 @@ impl UploadQueue {
            task_counter: 0,
            num_inprogress_layer_uploads: 0,
            num_inprogress_metadata_uploads: 0,
-            num_inprogress_deletions: 0,
            inprogress_tasks: HashMap::new(),
            queued_operations: VecDeque::new(),
        };
@@ -148,22 +144,20 @@ impl UploadQueue {
            );
        }

-        let index_part_metadata = index_part.parse_metadata()?;
        info!(
            "initializing upload queue with remote index_part.disk_consistent_lsn: {}",
-            index_part_metadata.disk_consistent_lsn()
+            index_part.metadata.disk_consistent_lsn()
        );

        let state = UploadQueueInitialized {
            latest_files: files,
            latest_files_changes_since_metadata_upload_scheduled: 0,
-            latest_metadata: index_part_metadata.clone(),
-            last_uploaded_consistent_lsn: index_part_metadata.disk_consistent_lsn(),
+            latest_metadata: index_part.metadata.clone(),
+            last_uploaded_consistent_lsn: index_part.metadata.disk_consistent_lsn(),
            // what follows are boring default initializations
            task_counter: 0,
            num_inprogress_layer_uploads: 0,
            num_inprogress_metadata_uploads: 0,
-            num_inprogress_deletions: 0,
            inprogress_tasks: HashMap::new(),
            queued_operations: VecDeque::new(),
        };
@@ -201,13 +195,6 @@ pub(crate) struct UploadTask {
    pub(crate) op: UploadOp,
 }

-#[derive(Debug)]
-pub(crate) struct Delete {
-    pub(crate) file_kind: RemoteOpFileKind,
-    pub(crate) layer_file_name: LayerFileName,
-    pub(crate) scheduled_from_timeline_delete: bool,
-}
-
 #[derive(Debug)]
 pub(crate) enum UploadOp {
    /// Upload a layer file
@@ -216,9 +203,6 @@ pub(crate) enum UploadOp {
    /// Upload the metadata file
    UploadMetadata(IndexPart, Lsn),

-    /// Delete a layer file
-    Delete(Delete),
-
    /// Barrier. When the barrier operation is reached,
    Barrier(tokio::sync::watch::Sender<()>),
 }
@@ -234,13 +218,9 @@ impl std::fmt::Display for UploadOp {
                    metadata.file_size()
                )
            }
-            UploadOp::UploadMetadata(_, lsn) => write!(f, "UploadMetadata(lsn: {})", lsn),
-            UploadOp::Delete(delete) => write!(
-                f,
-                "Delete(path: {}, scheduled_from_timeline_delete: {})",
-                delete.layer_file_name.file_name(),
-                delete.scheduled_from_timeline_delete
-            ),
+            UploadOp::UploadMetadata(_, lsn) => {
+                write!(f, "UploadMetadata(lsn: {})", lsn)
+            }
            UploadOp::Barrier(_) => write!(f, "Barrier"),
        }
    }
--- a/pageserver/src/test.log
+++ b/pageserver/src/test.log
@@ -0,0 +1 @@
+-bash: scripts/pytest: No such file or directory
--- a/safekeeper/src/bin/safekeeper.rs
+++ b/safekeeper/src/bin/safekeeper.rs
@@ -341,21 +341,35 @@ async fn start_safekeeper(conf: SafeKeeperConf) -> Result<()> {

    let (wal_backup_launcher_tx, wal_backup_launcher_rx) = mpsc::channel(100);

-    // Load all timelines from disk to memory.
-    GlobalTimelines::init(conf.clone(), wal_backup_launcher_tx)?;
-
    // Keep handles to main tasks to die if any of them disappears.
    let mut tasks_handles: FuturesUnordered<BoxFuture<(String, JoinTaskRes)>> =
        FuturesUnordered::new();

+    // Start wal backup launcher before loading timelines as we'll notify it
+    // through the channel about timelines which need offloading; not draining
+    // the channel would cause a deadlock.
+    let current_thread_rt = conf
+        .current_thread_runtime
+        .then(|| Handle::try_current().expect("no runtime in main"));
+    let conf_ = conf.clone();
+    let wal_backup_handle = current_thread_rt
+        .as_ref()
+        .unwrap_or_else(|| WAL_BACKUP_RUNTIME.handle())
+        .spawn(wal_backup::wal_backup_launcher_task_main(
+            conf_,
+            wal_backup_launcher_rx,
+        ))
+        .map(|res| ("WAL backup launcher".to_owned(), res));
+    tasks_handles.push(Box::pin(wal_backup_handle));
+
+    // Load all timelines from disk to memory.
+    GlobalTimelines::init(conf.clone(), wal_backup_launcher_tx).await?;
+
    let conf_ = conf.clone();
    // Run everything in current thread rt, if asked.
    if conf.current_thread_runtime {
        info!("running in current thread runtime");
    }
-    let current_thread_rt = conf
-        .current_thread_runtime
-        .then(|| Handle::try_current().expect("no runtime in main"));

    let wal_service_handle = current_thread_rt
        .as_ref()
@@ -408,17 +422,6 @@ async fn start_safekeeper(conf: SafeKeeperConf) -> Result<()> {
        .map(|res| ("WAL remover".to_owned(), res));
    tasks_handles.push(Box::pin(wal_remover_handle));

-    let conf_ = conf.clone();
-    let wal_backup_handle = current_thread_rt
-        .as_ref()
-        .unwrap_or_else(|| WAL_BACKUP_RUNTIME.handle())
-        .spawn(wal_backup::wal_backup_launcher_task_main(
-            conf_,
-            wal_backup_launcher_rx,
-        ))
-        .map(|res| ("WAL backup launcher".to_owned(), res));
-    tasks_handles.push(Box::pin(wal_backup_handle));
-
    set_build_info_metric(GIT_VERSION);

    // TODO: update tokio-stream, convert to real async Stream with
--- a/safekeeper/src/control_file_upgrade.rs
+++ b/safekeeper/src/control_file_upgrade.rs
@@ -1,7 +1,6 @@
 //! Code to deal with safekeeper control file upgrades
 use crate::safekeeper::{
-    AcceptorState, PersistedPeers, PgUuid, SafeKeeperState, ServerInfo, Term, TermHistory,
-    TermSwitchEntry,
+    AcceptorState, PersistedPeers, PgUuid, SafeKeeperState, ServerInfo, Term, TermHistory, TermLsn,
 };
 use anyhow::{bail, Result};
 use pq_proto::SystemId;
@@ -145,7 +144,7 @@ pub fn upgrade_control_file(buf: &[u8], version: u32) -> Result<SafeKeeperState>
        let oldstate = SafeKeeperStateV1::des(&buf[..buf.len()])?;
        let ac = AcceptorState {
            term: oldstate.acceptor_state.term,
-            term_history: TermHistory(vec![TermSwitchEntry {
+            term_history: TermHistory(vec![TermLsn {
                term: oldstate.acceptor_state.epoch,
                lsn: Lsn(0),
            }]),
--- a/safekeeper/src/http/routes.rs
+++ b/safekeeper/src/http/routes.rs
@@ -19,6 +19,7 @@ use crate::receive_wal::WalReceiverState;
 use crate::safekeeper::ServerInfo;
 use crate::safekeeper::Term;
 use crate::send_wal::WalSenderState;
+use crate::timeline::PeerInfo;
 use crate::{debug_dump, pull_timeline};

 use crate::timelines_global_map::TimelineDeleteForceResult;
@@ -101,6 +102,7 @@ pub struct TimelineStatus {
    pub peer_horizon_lsn: Lsn,
    #[serde_as(as = "DisplayFromStr")]
    pub remote_consistent_lsn: Lsn,
+    pub peers: Vec<PeerInfo>,
    pub walsenders: Vec<WalSenderState>,
    pub walreceivers: Vec<WalReceiverState>,
 }
@@ -140,6 +142,7 @@ async fn timeline_status_handler(request: Request<Body>) -> Result<Response<Body
        term_history,
    };

+    let conf = get_conf(&request);
    // Note: we report in memory values which can be lost.
    let status = TimelineStatus {
        tenant_id: ttid.tenant_id,
@@ -153,6 +156,7 @@ async fn timeline_status_handler(request: Request<Body>) -> Result<Response<Body
        backup_lsn: inmem.backup_lsn,
        peer_horizon_lsn: inmem.peer_horizon_lsn,
        remote_consistent_lsn: tli.get_walsenders().get_remote_consistent_lsn(),
+        peers: tli.get_peers(conf).await,
        walsenders: tli.get_walsenders().get_all(),
        walreceivers: tli.get_walreceivers().get_all(),
    };
@@ -282,12 +286,14 @@ async fn record_safekeeper_info(mut request: Request<Body>) -> Result<Response<B
            tenant_id: ttid.tenant_id.as_ref().to_owned(),
            timeline_id: ttid.timeline_id.as_ref().to_owned(),
        }),
+        term: sk_info.term.unwrap_or(0),
        last_log_term: sk_info.last_log_term.unwrap_or(0),
        flush_lsn: sk_info.flush_lsn.0,
        commit_lsn: sk_info.commit_lsn.0,
        remote_consistent_lsn: sk_info.remote_consistent_lsn.0,
        peer_horizon_lsn: sk_info.peer_horizon_lsn.0,
        safekeeper_connstr: sk_info.safekeeper_connstr.unwrap_or_else(|| "".to_owned()),
+        http_connstr: sk_info.http_connstr.unwrap_or_else(|| "".to_owned()),
        backup_lsn: sk_info.backup_lsn.0,
        local_start_lsn: sk_info.local_start_lsn.0,
        availability_zone: None,
--- a/safekeeper/src/json_ctrl.rs
+++ b/safekeeper/src/json_ctrl.rs
@@ -21,7 +21,7 @@ use crate::safekeeper::{AcceptorProposerMessage, AppendResponse, ServerInfo};
 use crate::safekeeper::{
    AppendRequest, AppendRequestHeader, ProposerAcceptorMessage, ProposerElected,
 };
-use crate::safekeeper::{SafeKeeperState, Term, TermHistory, TermSwitchEntry};
+use crate::safekeeper::{SafeKeeperState, Term, TermHistory, TermLsn};
 use crate::timeline::Timeline;
 use crate::GlobalTimelines;
 use postgres_backend::PostgresBackend;
@@ -119,7 +119,7 @@ async fn send_proposer_elected(tli: &Arc<Timeline>, term: Term, lsn: Lsn) -> any
    let history = tli.get_state().await.1.acceptor_state.term_history;
    let history = history.up_to(lsn.checked_sub(1u64).unwrap());
    let mut history_entries = history.0;
-    history_entries.push(TermSwitchEntry { term, lsn });
+    history_entries.push(TermLsn { term, lsn });
    let history = TermHistory(history_entries);

    let proposer_elected_request = ProposerAcceptorMessage::Elected(ProposerElected {
--- a/safekeeper/src/lib.rs
+++ b/safekeeper/src/lib.rs
@@ -19,6 +19,7 @@ pub mod json_ctrl;
 pub mod metrics;
 pub mod pull_timeline;
 pub mod receive_wal;
+pub mod recovery;
 pub mod remove_wal;
 pub mod safekeeper;
 pub mod send_wal;
--- a/safekeeper/src/pull_timeline.rs
+++ b/safekeeper/src/pull_timeline.rs
@@ -227,7 +227,9 @@ async fn pull_timeline(status: TimelineStatus, host: String) -> Result<Response>
    tokio::fs::create_dir_all(conf.tenant_dir(&ttid.tenant_id)).await?;
    tokio::fs::rename(tli_dir_path, &timeline_path).await?;

-    let tli = GlobalTimelines::load_timeline(ttid).context("Failed to load timeline after copy")?;
+    let tli = GlobalTimelines::load_timeline(ttid)
+        .await
+        .context("Failed to load timeline after copy")?;

    info!(
        "Loaded timeline {}, flush_lsn={}",
--- a/safekeeper/src/recovery.rs
+++ b/safekeeper/src/recovery.rs
@@ -0,0 +1,40 @@
+//! This module implements pulling WAL from peer safekeepers if compute can't
+//! provide it, i.e. safekeeper lags too much.
+
+use std::sync::Arc;
+
+use tokio::{select, time::sleep, time::Duration};
+use tracing::{info, instrument};
+
+use crate::{timeline::Timeline, SafeKeeperConf};
+
+/// Entrypoint for per timeline task which always runs, checking whether
+/// recovery for this safekeeper is needed and starting it if so.
+#[instrument(name = "recovery task", skip_all, fields(ttid = %tli.ttid))]
+pub async fn recovery_main(tli: Arc<Timeline>, _conf: SafeKeeperConf) {
+    info!("started");
+    let mut cancellation_rx = match tli.get_cancellation_rx() {
+        Ok(rx) => rx,
+        Err(_) => {
+            info!("timeline canceled during task start");
+            return;
+        }
+    };
+
+    select! {
+        _ = recovery_main_loop(tli) => { unreachable!() }
+        _ = cancellation_rx.changed() => {
+            info!("stopped");
+        }
+    }
+}
+
+const CHECK_INTERVAL_MS: u64 = 2000;
+
+/// Check regularly whether we need to start recovery.
+async fn recovery_main_loop(_tli: Arc<Timeline>) {
+    let check_duration = Duration::from_millis(CHECK_INTERVAL_MS);
+    loop {
+        sleep(check_duration).await;
+    }
+}
--- a/safekeeper/src/safekeeper.rs
+++ b/safekeeper/src/safekeeper.rs
@@ -34,22 +34,33 @@ pub const UNKNOWN_SERVER_VERSION: u32 = 0;

 /// Consensus logical timestamp.
 pub type Term = u64;
-const INVALID_TERM: Term = 0;
+pub const INVALID_TERM: Term = 0;

-#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
-pub struct TermSwitchEntry {
+#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq, Eq, PartialOrd, Ord)]
+pub struct TermLsn {
    pub term: Term,
    pub lsn: Lsn,
 }
+
+// Creation from tuple provides less typing (e.g. for unit tests).
+impl From<(Term, Lsn)> for TermLsn {
+    fn from(pair: (Term, Lsn)) -> TermLsn {
+        TermLsn {
+            term: pair.0,
+            lsn: pair.1,
+        }
+    }
+}
+
 #[derive(Clone, Serialize, Deserialize)]
-pub struct TermHistory(pub Vec<TermSwitchEntry>);
+pub struct TermHistory(pub Vec<TermLsn>);

 impl TermHistory {
    pub fn empty() -> TermHistory {
        TermHistory(Vec::new())
    }

-    // Parse TermHistory as n_entries followed by TermSwitchEntry pairs
+    // Parse TermHistory as n_entries followed by TermLsn pairs
    pub fn from_bytes(bytes: &mut Bytes) -> Result<TermHistory> {
        if bytes.remaining() < 4 {
            bail!("TermHistory misses len");
@@ -60,7 +71,7 @@ impl TermHistory {
            if bytes.remaining() < 16 {
                bail!("TermHistory is incomplete");
            }
-            res.push(TermSwitchEntry {
+            res.push(TermLsn {
                term: bytes.get_u64_le(),
                lsn: bytes.get_u64_le().into(),
            })
@@ -557,12 +568,17 @@ where
            .up_to(self.flush_lsn())
    }

+    /// Get current term.
+    pub fn get_term(&self) -> Term {
+        self.state.acceptor_state.term
+    }
+
    pub fn get_epoch(&self) -> Term {
        self.state.acceptor_state.get_epoch(self.flush_lsn())
    }

    /// wal_store wrapper avoiding commit_lsn <= flush_lsn violation when we don't have WAL yet.
-    fn flush_lsn(&self) -> Lsn {
+    pub fn flush_lsn(&self) -> Lsn {
        max(self.wal_store.flush_lsn(), self.state.timeline_start_lsn)
    }

@@ -1138,7 +1154,7 @@ mod tests {
        let pem = ProposerElected {
            term: 1,
            start_streaming_at: Lsn(1),
-            term_history: TermHistory(vec![TermSwitchEntry {
+            term_history: TermHistory(vec![TermLsn {
                term: 1,
                lsn: Lsn(3),
            }]),
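
Note on the TermSwitchEntry -> TermLsn rename above: the new derive list adds PartialOrd/Ord, and Rust's derived ordering on a struct compares fields in declaration order, so a TermLsn compares by term first and by lsn second. A minimal std-only sketch of what that gives, assuming simplified stand-ins for `Term` and `Lsn` (the real `Lsn` lives in the utils crate) and a hypothetical helper `most_advanced` that is not part of this change:

```rust
// Sketch only: `Term` and `Lsn` are simplified stand-ins for the safekeeper types.
type Term = u64;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

// Field order matters: derived Ord compares `term` first, then `lsn`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct TermLsn {
    term: Term,
    lsn: Lsn,
}

// Same tuple conversion as in the diff, handy in tests.
impl From<(Term, Lsn)> for TermLsn {
    fn from(pair: (Term, Lsn)) -> TermLsn {
        TermLsn {
            term: pair.0,
            lsn: pair.1,
        }
    }
}

/// Hypothetical helper (not in the diff): pick the most advanced position,
/// i.e. the highest term, with ties broken by the highest LSN.
fn most_advanced(positions: &[TermLsn]) -> Option<TermLsn> {
    positions.iter().copied().max()
}
```

A position in a higher term wins even when its LSN is lower, which is the usual consensus notion of "more advanced".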
--- a/safekeeper/src/send_wal.rs
+++ b/safekeeper/src/send_wal.rs
@@ -2,12 +2,12 @@
 //! with the "START_REPLICATION" message, and registry of walsenders.

 use crate::handler::SafekeeperPostgresHandler;
-use crate::safekeeper::Term;
+use crate::safekeeper::{Term, TermLsn};
 use crate::timeline::Timeline;
 use crate::wal_service::ConnectionId;
 use crate::wal_storage::WalReader;
 use crate::GlobalTimelines;
-use anyhow::Context as AnyhowContext;
+use anyhow::{bail, Context as AnyhowContext};
 use bytes::Bytes;
 use parking_lot::Mutex;
 use postgres_backend::PostgresBackend;
@@ -390,26 +390,25 @@ impl SafekeeperPostgresHandler {
            self.appname.clone(),
        ));

-        let commit_lsn_watch_rx = tli.get_commit_lsn_watch_rx();
-
-        // Walproposer gets special handling: safekeeper must give proposer all
-        // local WAL till the end, whether committed or not (walproposer will
-        // hang otherwise). That's because walproposer runs the consensus and
-        // synchronizes safekeepers on the most advanced one.
+        // Walsender can operate in one of two modes which we select by
+        // application_name: give only committed WAL (used by pageserver) or all
+        // existing WAL (up to flush_lsn, used by walproposer or peer recovery).
+        // The second case is always driven by a consensus leader, whose term
+        // generally must also be supplied. However, we are sloppy about this in
+        // walproposer recovery, which will be removed soon. So the TODO is to
+        // make it non-optional then.
        //
-        // There is a small risk of this WAL getting concurrently garbaged if
-        // another compute rises which collects majority and starts fixing log
-        // on this safekeeper itself. That's ok as (old) proposer will never be
-        // able to commit such WAL.
-        let stop_pos: Option<Lsn> = if self.is_walproposer_recovery() {
-            let wal_end = tli.get_flush_lsn().await;
-            Some(wal_end)
+        // Fetching WAL without a term in recovery creates a small risk of this
+        // WAL getting concurrently garbaged if another compute rises, collects
+        // a majority and starts fixing the log on this safekeeper itself.
+        // That's ok as the (old) proposer will never be able to commit such WAL.
+        let end_watch = if self.is_walproposer_recovery() {
+            EndWatch::Flush(tli.get_term_flush_lsn_watch_rx())
        } else {
-            None
+            EndWatch::Commit(tli.get_commit_lsn_watch_rx())
        };
-
-        // take the latest commit_lsn if don't have stop_pos
-        let end_pos = stop_pos.unwrap_or(*commit_lsn_watch_rx.borrow());
+        // we don't check term here; it will be checked on first waiting/WAL reading anyway.
+        let end_pos = end_watch.get();

        if end_pos < start_pos {
            warn!(
@@ -419,8 +418,10 @@ impl SafekeeperPostgresHandler {
        }

        info!(
-            "starting streaming from {:?} till {:?}, available WAL ends at {}",
-            start_pos, stop_pos, end_pos
+            "starting streaming from {:?}, available WAL ends at {}, recovery={}",
+            start_pos,
+            end_pos,
+            matches!(end_watch, EndWatch::Flush(_))
        );

        // switch to copy
@@ -445,9 +446,8 @@ impl SafekeeperPostgresHandler {
            appname,
            start_pos,
            end_pos,
-            stop_pos,
            term,
-            commit_lsn_watch_rx,
+            end_watch,
            ws_guard: ws_guard.clone(),
            wal_reader,
            send_buf: [0; MAX_SEND_SIZE],
@@ -466,6 +466,32 @@ impl SafekeeperPostgresHandler {
    }
 }

+/// Walsender streams either up to commit_lsn (normally) or flush_lsn in the
+/// given term (recovery by walproposer or peer safekeeper).
+enum EndWatch {
+    Commit(Receiver<Lsn>),
+    Flush(Receiver<TermLsn>),
+}
+
+impl EndWatch {
+    /// Get current end of WAL.
+    fn get(&self) -> Lsn {
+        match self {
+            EndWatch::Commit(r) => *r.borrow(),
+            EndWatch::Flush(r) => r.borrow().lsn,
+        }
+    }
+
+    /// Wait for the update.
+    async fn changed(&mut self) -> anyhow::Result<()> {
+        match self {
+            EndWatch::Commit(r) => r.changed().await?,
+            EndWatch::Flush(r) => r.changed().await?,
+        }
+        Ok(())
+    }
+}
+
 /// A half driving sending WAL.
 struct WalSender<'a, IO> {
    pgb: &'a mut PostgresBackend<IO>,
@@ -480,14 +506,12 @@ struct WalSender<'a, IO> {
    // We send this LSN to the receiver as wal_end, so that it knows how much
    // WAL this safekeeper has. This LSN should be as fresh as possible.
    end_pos: Lsn,
-    // If present, terminate after reaching this position; used by walproposer
-    // in recovery.
-    stop_pos: Option<Lsn>,
    /// When streaming uncommitted part, the term the client acts as the leader
    /// in. Streaming is stopped if local term changes to a different (higher)
    /// value.
    term: Option<Term>,
-    commit_lsn_watch_rx: Receiver<Lsn>,
+    /// Watch channel receiver to learn end of available WAL (and wait for its advancement).
+    end_watch: EndWatch,
    ws_guard: Arc<WalSenderGuard>,
    wal_reader: WalReader,
    // buffer for readling WAL into to send it
@@ -497,29 +521,20 @@ struct WalSender<'a, IO> {
 impl<IO: AsyncRead + AsyncWrite + Unpin> WalSender<'_, IO> {
    /// Send WAL until
    /// - an error occurs
-    /// - if we are streaming to walproposer, we've streamed until stop_pos
-    ///   (recovery finished)
-    /// - receiver is caughtup and there is no computes
+    /// - receiver is caughtup and there is no computes (if streaming up to commit_lsn)
    ///
    /// Err(CopyStreamHandlerEnd) is always returned; Result is used only for ?
    /// convenience.
    async fn run(&mut self) -> Result<(), CopyStreamHandlerEnd> {
        loop {
-            // If we are streaming to walproposer, check it is time to stop.
-            if let Some(stop_pos) = self.stop_pos {
-                if self.start_pos >= stop_pos {
-                    // recovery finished
-                    return Err(CopyStreamHandlerEnd::ServerInitiated(format!(
-                        "ending streaming to walproposer at {}, recovery finished",
-                        self.start_pos
-                    )));
-                }
-            } else {
-                // Wait for the next portion if it is not there yet, or just
-                // update our end of WAL available for sending value, we
-                // communicate it to the receiver.
-                self.wait_wal().await?;
-            }
+            // Wait for the next portion if it is not there yet, or just
+            // update our end of WAL available for sending value, we
+            // communicate it to the receiver.
+            self.wait_wal().await?;
+            assert!(
+                self.end_pos > self.start_pos,
+                "nothing to send after waiting for WAL"
+            );

            // try to send as much as available, capped by MAX_SEND_SIZE
            let mut send_size = self
@@ -567,7 +582,7 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> WalSender<'_, IO> {
    /// exit in the meanwhile
    async fn wait_wal(&mut self) -> Result<(), CopyStreamHandlerEnd> {
        loop {
-            self.end_pos = *self.commit_lsn_watch_rx.borrow();
+            self.end_pos = self.end_watch.get();
            if self.end_pos > self.start_pos {
                // We have something to send.
                trace!("got end_pos {:?}, streaming", self.end_pos);
@@ -575,27 +590,31 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> WalSender<'_, IO> {
            }

            // Wait for WAL to appear, now self.end_pos == self.start_pos.
-            if let Some(lsn) = wait_for_lsn(&mut self.commit_lsn_watch_rx, self.start_pos).await? {
+            if let Some(lsn) = wait_for_lsn(&mut self.end_watch, self.term, self.start_pos).await? {
                self.end_pos = lsn;
                trace!("got end_pos {:?}, streaming", self.end_pos);
                return Ok(());
            }

-            // Timed out waiting for WAL, check for termination and send KA
-            if let Some(remote_consistent_lsn) = self
-                .ws_guard
-                .walsenders
-                .get_ws_remote_consistent_lsn(self.ws_guard.id)
-            {
-                if self.tli.should_walsender_stop(remote_consistent_lsn).await {
-                    // Terminate if there is nothing more to send.
-                    // Note that "ending streaming" part of the string is used by
-                    // pageserver to identify WalReceiverError::SuccessfulCompletion,
-                    // do not change this string without updating pageserver.
-                    return Err(CopyStreamHandlerEnd::ServerInitiated(format!(
+            // Timed out waiting for WAL, check for termination and send KA.
+            // Check for termination only if we are streaming up to commit_lsn
+            // (to pageserver).
+            if let EndWatch::Commit(_) = self.end_watch {
+                if let Some(remote_consistent_lsn) = self
+                    .ws_guard
+                    .walsenders
+                    .get_ws_remote_consistent_lsn(self.ws_guard.id)
+                {
+                    if self.tli.should_walsender_stop(remote_consistent_lsn).await {
+                        // Terminate if there is nothing more to send.
+                        // Note that "ending streaming" part of the string is used by
+                        // pageserver to identify WalReceiverError::SuccessfulCompletion,
+                        // do not change this string without updating pageserver.
+                        return Err(CopyStreamHandlerEnd::ServerInitiated(format!(
                        "ending streaming to {:?} at {}, receiver is caughtup and there is no computes",
                        self.appname, self.start_pos,
                    )));
+                    }
                }
            }

@@ -663,22 +682,32 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> ReplyReader<IO> {

 const POLL_STATE_TIMEOUT: Duration = Duration::from_secs(1);

-/// Wait until we have commit_lsn > lsn or timeout expires. Returns
-/// - Ok(Some(commit_lsn)) if needed lsn is successfully observed;
+/// Wait until we have available WAL > start_pos or timeout expires. Returns
+/// - Ok(Some(end_pos)) if needed lsn is successfully observed;
 /// - Ok(None) if timeout expired;
-/// - Err in case of error (if watch channel is in trouble, shouldn't happen).
-async fn wait_for_lsn(rx: &mut Receiver<Lsn>, lsn: Lsn) -> anyhow::Result<Option<Lsn>> {
+/// - Err in case of error -- only if 1) term changed while fetching in recovery
+///   mode or 2) watch channel closed, which must never happen.
+async fn wait_for_lsn(
+    rx: &mut EndWatch,
+    client_term: Option<Term>,
+    start_pos: Lsn,
+) -> anyhow::Result<Option<Lsn>> {
    let res = timeout(POLL_STATE_TIMEOUT, async move {
-        let mut commit_lsn;
        loop {
-            rx.changed().await?;
-            commit_lsn = *rx.borrow();
-            if commit_lsn > lsn {
-                break;
+            let end_pos = rx.get();
+            if end_pos > start_pos {
+                return Ok(end_pos);
            }
+            if let EndWatch::Flush(rx) = rx {
+                let curr_term = rx.borrow().term;
+                if let Some(client_term) = client_term {
+                    if curr_term != client_term {
+                        bail!("term changed: requested {}, now {}", client_term, curr_term);
+                    }
+                }
+            }
+            rx.changed().await?;
        }
-
-        Ok(commit_lsn)
    })
    .await;

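
Note on the reworked wait_for_lsn above: each loop iteration first checks whether WAL past start_pos is available and only then, in Flush mode with a client term supplied, compares terms. A minimal synchronous sketch of one such iteration, assuming plain `u64` stand-ins for `Term` and `Lsn`; `EndSnapshot` and `poll_once` are hypothetical names that model the current value of the watch channel and are not part of this change:

```rust
// Sketch only: plain integers stand in for the safekeeper `Term` and `Lsn` types.
type Term = u64;
type Lsn = u64;

/// Hypothetical snapshot of the value currently held by the watch channel
/// (models `EndWatch::Commit` / `EndWatch::Flush` without tokio).
enum EndSnapshot {
    Commit(Lsn),
    Flush { term: Term, lsn: Lsn },
}

/// One iteration of the wait loop:
/// - Ok(Some(end_pos)): WAL beyond `start_pos` is available;
/// - Ok(None): nothing new yet, the caller would wait for the next update;
/// - Err: the term changed while streaming in recovery (Flush) mode.
fn poll_once(
    end: &EndSnapshot,
    client_term: Option<Term>,
    start_pos: Lsn,
) -> Result<Option<Lsn>, String> {
    let end_pos = match end {
        EndSnapshot::Commit(lsn) => *lsn,
        EndSnapshot::Flush { lsn, .. } => *lsn,
    };
    // Availability is checked first, as in the diff.
    if end_pos > start_pos {
        return Ok(Some(end_pos));
    }
    // The term is compared only in Flush mode and only if the client sent one.
    if let EndSnapshot::Flush { term, .. } = end {
        if let Some(client_term) = client_term {
            if *term != client_term {
                return Err(format!(
                    "term changed: requested {}, now {}",
                    client_term, term
                ));
            }
        }
    }
    Ok(None)
}
```

With this ordering, a caller without a term (the walproposer recovery case mentioned in the comments) never gets a term error here, and available WAL is reported before any term comparison happens.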
--- a/safekeeper/src/timeline.rs
+++ b/safekeeper/src/timeline.rs
@@ -3,8 +3,11 @@

 use anyhow::{anyhow, bail, Result};
 use postgres_ffi::XLogSegNo;
+use serde::{Deserialize, Serialize};
+use serde_with::serde_as;
 use tokio::fs;

+use serde_with::DisplayFromStr;
 use std::cmp::max;
 use std::path::PathBuf;
 use std::sync::Arc;
@@ -24,9 +27,10 @@ use storage_broker::proto::SafekeeperTimelineInfo;
 use storage_broker::proto::TenantTimelineId as ProtoTenantTimelineId;

 use crate::receive_wal::WalReceivers;
+use crate::recovery::recovery_main;
 use crate::safekeeper::{
    AcceptorProposerMessage, ProposerAcceptorMessage, SafeKeeper, SafeKeeperState,
-    SafekeeperMemState, ServerInfo, Term,
+    SafekeeperMemState, ServerInfo, Term, TermLsn, INVALID_TERM,
 };
 use crate::send_wal::WalSenders;
 use crate::{control_file, safekeeper::UNKNOWN_SERVER_VERSION};
@@ -37,18 +41,25 @@ use crate::SafeKeeperConf;
 use crate::{debug_dump, wal_storage};

 /// Things safekeeper should know about timeline state on peers.
-#[derive(Debug, Clone)]
+#[serde_as]
+#[derive(Debug, Clone, Serialize, Deserialize)]
 pub struct PeerInfo {
    pub sk_id: NodeId,
    /// Term of the last entry.
    _last_log_term: Term,
    /// LSN of the last record.
+    #[serde_as(as = "DisplayFromStr")]
    _flush_lsn: Lsn,
+    #[serde_as(as = "DisplayFromStr")]
    pub commit_lsn: Lsn,
    /// Since which LSN safekeeper has WAL. TODO: remove this once we fill new
    /// sk since backup_lsn.
+    #[serde_as(as = "DisplayFromStr")]
    pub local_start_lsn: Lsn,
-    /// When info was received.
+    /// When info was received. Serde annotations are not very useful but make
+    /// the code compile -- we don't rely on this field externally.
+    #[serde(skip)]
+    #[serde(default = "Instant::now")]
    ts: Instant,
 }

@@ -237,8 +248,9 @@ impl SharedState {
                tenant_id: ttid.tenant_id.as_ref().to_owned(),
                timeline_id: ttid.timeline_id.as_ref().to_owned(),
            }),
+            term: self.sk.state.acceptor_state.term,
            last_log_term: self.sk.get_epoch(),
-            flush_lsn: self.sk.wal_store.flush_lsn().0,
+            flush_lsn: self.sk.flush_lsn().0,
            // note: this value is not flushed to control file yet and can be lost
            commit_lsn: self.sk.inmem.commit_lsn.0,
            remote_consistent_lsn: remote_consistent_lsn.0,
@@ -247,6 +259,7 @@ impl SharedState {
                .advertise_pg_addr
                .to_owned()
                .unwrap_or(conf.listen_pg_addr.clone()),
+            http_connstr: conf.listen_http_addr.to_owned(),
            backup_lsn: self.sk.inmem.backup_lsn.0,
            local_start_lsn: self.sk.state.local_start_lsn.0,
            availability_zone: conf.availability_zone.clone(),
@@ -296,6 +309,13 @@ pub struct Timeline {
    commit_lsn_watch_tx: watch::Sender<Lsn>,
    commit_lsn_watch_rx: watch::Receiver<Lsn>,

+    /// Broadcasts (current term, flush_lsn) updates; walsender is interested in
+    /// them when sending in recovery mode (to walproposer or peers). Note: this
+    /// is just a notification; WAL reading should always be done with the lock
+    /// held, as the term can change otherwise.
+    term_flush_lsn_watch_tx: watch::Sender<TermLsn>,
+    term_flush_lsn_watch_rx: watch::Receiver<TermLsn>,
+
    /// Safekeeper and other state, that should remain consistent and
    /// synchronized with the disk. This is tokio mutex as we write WAL to disk
    /// while holding it, ensuring that consensus checks are in order.
@@ -317,16 +337,20 @@ pub struct Timeline {
 impl Timeline {
    /// Load existing timeline from disk.
    pub fn load_timeline(
-        conf: SafeKeeperConf,
+        conf: &SafeKeeperConf,
        ttid: TenantTimelineId,
        wal_backup_launcher_tx: Sender<TenantTimelineId>,
    ) -> Result<Timeline> {
        let _enter = info_span!("load_timeline", timeline = %ttid.timeline_id).entered();

-        let shared_state = SharedState::restore(&conf, &ttid)?;
+        let shared_state = SharedState::restore(conf, &ttid)?;
        let rcl = shared_state.sk.state.remote_consistent_lsn;
        let (commit_lsn_watch_tx, commit_lsn_watch_rx) =
            watch::channel(shared_state.sk.state.commit_lsn);
+        let (term_flush_lsn_watch_tx, term_flush_lsn_watch_rx) = watch::channel(TermLsn::from((
+            shared_state.sk.get_term(),
+            shared_state.sk.flush_lsn(),
+        )));
        let (cancellation_tx, cancellation_rx) = watch::channel(false);

        Ok(Timeline {
@@ -334,6 +358,8 @@ impl Timeline {
            wal_backup_launcher_tx,
            commit_lsn_watch_tx,
            commit_lsn_watch_rx,
+            term_flush_lsn_watch_tx,
+            term_flush_lsn_watch_rx,
            mutex: Mutex::new(shared_state),
            walsenders: WalSenders::new(rcl),
            walreceivers: WalReceivers::new(),
@@ -345,7 +371,7 @@ impl Timeline {

    /// Create a new timeline, which is not yet persisted to disk.
    pub fn create_empty(
-        conf: SafeKeeperConf,
+        conf: &SafeKeeperConf,
        ttid: TenantTimelineId,
        wal_backup_launcher_tx: Sender<TenantTimelineId>,
        server_info: ServerInfo,
@@ -353,6 +379,8 @@ impl Timeline {
        local_start_lsn: Lsn,
    ) -> Result<Timeline> {
        let (commit_lsn_watch_tx, commit_lsn_watch_rx) = watch::channel(Lsn::INVALID);
+        let (term_flush_lsn_watch_tx, term_flush_lsn_watch_rx) =
+            watch::channel(TermLsn::from((INVALID_TERM, Lsn::INVALID)));
        let (cancellation_tx, cancellation_rx) = watch::channel(false);
        let state = SafeKeeperState::new(&ttid, server_info, vec![], commit_lsn, local_start_lsn);

@@ -361,7 +389,9 @@ impl Timeline {
            wal_backup_launcher_tx,
            commit_lsn_watch_tx,
            commit_lsn_watch_rx,
-            mutex: Mutex::new(SharedState::create_new(&conf, &ttid, state)?),
+            term_flush_lsn_watch_tx,
+            term_flush_lsn_watch_rx,
+            mutex: Mutex::new(SharedState::create_new(conf, &ttid, state)?),
            walsenders: WalSenders::new(Lsn(0)),
            walreceivers: WalReceivers::new(),
            cancellation_rx,
@@ -370,12 +400,16 @@ impl Timeline {
        })
    }

-    /// Initialize fresh timeline on disk and start background tasks. If bootstrap
+    /// Initialize fresh timeline on disk and start background tasks. If init
    /// fails, timeline is cancelled and cannot be used anymore.
    ///
-    /// Bootstrap is transactional, so if it fails, created files will be deleted,
+    /// Init is transactional, so if it fails, created files will be deleted,
    /// and state on disk should remain unchanged.
-    pub async fn bootstrap(&self, shared_state: &mut MutexGuard<'_, SharedState>) -> Result<()> {
+    pub async fn init_new(
+        self: &Arc<Timeline>,
+        shared_state: &mut MutexGuard<'_, SharedState>,
+        conf: &SafeKeeperConf,
+    ) -> Result<()> {
        match fs::metadata(&self.timeline_dir).await {
            Ok(_) => {
                // Timeline directory exists on disk, we should leave state unchanged
@@ -391,7 +425,7 @@ impl Timeline {
        // Create timeline directory.
        fs::create_dir_all(&self.timeline_dir).await?;

-        // Write timeline to disk and TODO: start background tasks.
+        // Write timeline to disk and start background tasks.
        if let Err(e) = shared_state.sk.persist().await {
            // Bootstrap failed, cancel timeline and remove timeline directory.
            self.cancel(shared_state);
@@ -405,12 +439,16 @@ impl Timeline {

            return Err(e);
        }
-
-        // TODO: add more initialization steps here
-        self.update_status(shared_state);
+        self.bootstrap(conf);
        Ok(())
    }

+    /// Bootstrap a new or existing timeline by starting background tasks.
+    pub fn bootstrap(self: &Arc<Timeline>, conf: &SafeKeeperConf) {
+        // Start recovery task which always runs on the timeline.
+        tokio::spawn(recovery_main(self.clone(), conf.clone()));
+    }
+
    /// Delete timeline from disk completely, by removing timeline directory. Background
    /// timeline activities will stop eventually.
    pub async fn delete_from_disk(
@@ -444,6 +482,16 @@ impl Timeline {
        *self.cancellation_rx.borrow()
    }

+    /// Returns watch channel which gets a value when the timeline is cancelled.
+    /// It is guaranteed that the not-cancelled value is observed (errors otherwise).
+    pub fn get_cancellation_rx(&self) -> Result<watch::Receiver<bool>> {
+        let rx = self.cancellation_rx.clone();
+        if *rx.borrow() {
+            bail!(TimelineError::Cancelled(self.ttid));
+        }
+        Ok(rx)
+    }
+
    /// Take a writing mutual exclusive lock on timeline shared_state.
    pub async fn write_shared_state(&self) -> MutexGuard<SharedState> {
        self.mutex.lock().await
@@ -520,6 +568,11 @@ impl Timeline {
        self.commit_lsn_watch_rx.clone()
    }

+    /// Returns term_flush_lsn watch channel.
+    pub fn get_term_flush_lsn_watch_rx(&self) -> watch::Receiver<TermLsn> {
+        self.term_flush_lsn_watch_rx.clone()
+    }
+
    /// Pass arrived message to the safekeeper.
    pub async fn process_msg(
        &self,
@@ -531,6 +584,7 @@ impl Timeline {

        let mut rmsg: Option<AcceptorProposerMessage>;
        let commit_lsn: Lsn;
+        let term_flush_lsn: TermLsn;
        {
            let mut shared_state = self.write_shared_state().await;
            rmsg = shared_state.sk.process_msg(msg).await?;
@@ -544,8 +598,11 @@ impl Timeline {
            }

            commit_lsn = shared_state.sk.inmem.commit_lsn;
+            term_flush_lsn =
+                TermLsn::from((shared_state.sk.get_term(), shared_state.sk.flush_lsn()));
        }
        self.commit_lsn_watch_tx.send(commit_lsn)?;
+        self.term_flush_lsn_watch_tx.send(term_flush_lsn)?;
        Ok(rmsg)
    }

--- a/safekeeper/src/timelines_global_map.rs
+++ b/safekeeper/src/timelines_global_map.rs
@@ -11,7 +11,7 @@ use serde::Serialize;
 use std::collections::HashMap;
 use std::path::PathBuf;
 use std::str::FromStr;
-use std::sync::{Arc, Mutex, MutexGuard};
+use std::sync::{Arc, Mutex};
 use tokio::sync::mpsc::Sender;
 use tracing::*;
 use utils::id::{TenantId, TenantTimelineId, TimelineId};
@@ -71,19 +71,23 @@ pub struct GlobalTimelines;

 impl GlobalTimelines {
    /// Inject dependencies needed for the timeline constructors and load all timelines to memory.
-    pub fn init(
+    pub async fn init(
        conf: SafeKeeperConf,
        wal_backup_launcher_tx: Sender<TenantTimelineId>,
    ) -> Result<()> {
-        let mut state = TIMELINES_STATE.lock().unwrap();
-        assert!(state.wal_backup_launcher_tx.is_none());
-        state.wal_backup_launcher_tx = Some(wal_backup_launcher_tx);
-        state.conf = Some(conf);
+        // clippy isn't smart enough to understand that drop(state) releases the
+        // lock, so use explicit block
+        let tenants_dir = {
+            let mut state = TIMELINES_STATE.lock().unwrap();
+            assert!(state.wal_backup_launcher_tx.is_none());
+            state.wal_backup_launcher_tx = Some(wal_backup_launcher_tx);
+            state.conf = Some(conf);

-        // Iterate through all directories and load tenants for all directories
-        // named as a valid tenant_id.
+            // Iterate through all directories and load tenants for all directories
+            // named as a valid tenant_id.
+            state.get_conf().workdir.clone()
+        };
        let mut tenant_count = 0;
-        let tenants_dir = state.get_conf().workdir.clone();
        for tenants_dir_entry in std::fs::read_dir(&tenants_dir)
            .with_context(|| format!("failed to list tenants dir {}", tenants_dir.display()))?
        {
@@ -93,7 +97,7 @@ impl GlobalTimelines {
                        TenantId::from_str(tenants_dir_entry.file_name().to_str().unwrap_or(""))
                    {
                        tenant_count += 1;
-                        GlobalTimelines::load_tenant_timelines(&mut state, tenant_id)?;
+                        GlobalTimelines::load_tenant_timelines(tenant_id).await?;
                    }
                }
                Err(e) => error!(
@@ -108,7 +112,7 @@ impl GlobalTimelines {
        info!(
            "found {} tenants directories, successfully loaded {} timelines",
            tenant_count,
-            state.timelines.len()
+            TIMELINES_STATE.lock().unwrap().timelines.len()
        );
        Ok(())
    }
@@ -116,17 +120,21 @@ impl GlobalTimelines {
    /// Loads all timelines for the given tenant to memory. Returns fs::read_dir
    /// errors if any.
    ///
-    /// Note: This function (and all reading/loading below) is sync because
-    /// timelines are loaded while holding GlobalTimelinesState lock. Which is
-    /// fine as this is called only from single threaded main runtime on boot,
-    /// but clippy complains anyway, and suppressing that isn't trivial as async
-    /// is the keyword, ha. That only other user is pull_timeline.rs for which
-    /// being blocked is not that bad, and we can do spawn_blocking.
-    fn load_tenant_timelines(
-        state: &mut MutexGuard<'_, GlobalTimelinesState>,
-        tenant_id: TenantId,
-    ) -> Result<()> {
-        let timelines_dir = state.get_conf().tenant_dir(&tenant_id);
+    /// It is async for the sake of update_status_notify. Since the TIMELINES_STATE
+    /// lock is sync and there is no important reason to make it async (it is
+    /// always held only briefly), we just lock and unlock it for each timeline.
+    /// This function is called during init when nothing else is running, so
+    /// this is fine.
+    async fn load_tenant_timelines(tenant_id: TenantId) -> Result<()> {
+        let (conf, wal_backup_launcher_tx) = {
+            let state = TIMELINES_STATE.lock().unwrap();
+            (
+                state.get_conf().clone(),
+                state.wal_backup_launcher_tx.as_ref().unwrap().clone(),
+            )
+        };
+
+        let timelines_dir = conf.tenant_dir(&tenant_id);
        for timelines_dir_entry in std::fs::read_dir(&timelines_dir)
            .with_context(|| format!("failed to list timelines dir {}", timelines_dir.display()))?
        {
@@ -136,13 +144,16 @@ impl GlobalTimelines {
                        TimelineId::from_str(timeline_dir_entry.file_name().to_str().unwrap_or(""))
                    {
                        let ttid = TenantTimelineId::new(tenant_id, timeline_id);
-                        match Timeline::load_timeline(
-                            state.get_conf().clone(),
-                            ttid,
-                            state.wal_backup_launcher_tx.as_ref().unwrap().clone(),
-                        ) {
+                        match Timeline::load_timeline(&conf, ttid, wal_backup_launcher_tx.clone()) {
                            Ok(timeline) => {
-                                state.timelines.insert(ttid, Arc::new(timeline));
+                                let tli = Arc::new(timeline);
+                                TIMELINES_STATE
+                                    .lock()
+                                    .unwrap()
+                                    .timelines
+                                    .insert(ttid, tli.clone());
+                                tli.bootstrap(&conf);
+                                tli.update_status_notify().await.unwrap();
                            }
                            // If we can't load a timeline, it's most likely because of a corrupted
                            // directory. We will log an error and won't allow to delete/recreate
@@ -168,18 +179,22 @@ impl GlobalTimelines {
    }

    /// Load timeline from disk to the memory.
-    pub fn load_timeline(ttid: TenantTimelineId) -> Result<Arc<Timeline>> {
+    pub async fn load_timeline(ttid: TenantTimelineId) -> Result<Arc<Timeline>> {
        let (conf, wal_backup_launcher_tx) = TIMELINES_STATE.lock().unwrap().get_dependencies();

-        match Timeline::load_timeline(conf, ttid, wal_backup_launcher_tx) {
+        match Timeline::load_timeline(&conf, ttid, wal_backup_launcher_tx) {
            Ok(timeline) => {
                let tli = Arc::new(timeline);
+
                // TODO: prevent concurrent timeline creation/loading
                TIMELINES_STATE
                    .lock()
                    .unwrap()
                    .timelines
                    .insert(ttid, tli.clone());
+
+                tli.bootstrap(&conf);
+
                Ok(tli)
            }
            // If we can't load a timeline, it's bad. Caller will figure it out.
@@ -217,7 +232,7 @@ impl GlobalTimelines {
        info!("creating new timeline {}", ttid);

        let timeline = Arc::new(Timeline::create_empty(
-            conf,
+            &conf,
            ttid,
            wal_backup_launcher_tx,
            server_info,
@@ -240,23 +255,24 @@ impl GlobalTimelines {
            // Write the new timeline to the disk and start background workers.
            // Bootstrap is transactional, so if it fails, the timeline will be deleted,
            // and the state on disk should remain unchanged.
-            if let Err(e) = timeline.bootstrap(&mut shared_state).await {
-                // Note: the most likely reason for bootstrap failure is that the timeline
+            if let Err(e) = timeline.init_new(&mut shared_state, &conf).await {
+                // Note: the most likely reason for init failure is that the timeline
                // directory already exists on disk. This happens when timeline is corrupted
                // and wasn't loaded from disk on startup because of that. We want to preserve
                // the timeline directory in this case, for further inspection.

                // TODO: this is an unusual error, perhaps we should send it to sentry
                // TODO: compute will try to create timeline every second, we should add backoff
-                error!("failed to bootstrap timeline {}: {}", ttid, e);
+                error!("failed to init new timeline {}: {}", ttid, e);

-                // Timeline failed to bootstrap, it cannot be used. Remove it from the map.
+                // Timeline failed to init, it cannot be used. Remove it from the map.
                TIMELINES_STATE.lock().unwrap().timelines.remove(&ttid);
                return Err(e);
            }
            // We are done with bootstrap, release the lock, return the timeline.
            // {} block forces release before .await
        }
+        timeline.update_status_notify().await?;
        timeline.wal_backup_launcher_tx.send(timeline.ttid).await?;
        Ok(timeline)
    }
--- a/scripts/flaky_tests.py
+++ b/scripts/flaky_tests.py
@@ -12,25 +12,26 @@ import psycopg2.extras
 # We call the test "flaky" if it failed at least once on the main branch in the last N=10 days.
 FLAKY_TESTS_QUERY = """
    SELECT
-        DISTINCT parent_suite, suite, test
+        DISTINCT parent_suite, suite, REGEXP_REPLACE(test, '(release|debug)-pg(\\d+)-?', '') as deparametrized_test
    FROM
        (
            SELECT
-                revision,
-                jsonb_array_elements(data -> 'children') -> 'name' as parent_suite,
-                jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'name' as suite,
-                jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') -> 'name' as test,
-                jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') -> 'status' as status,
-                jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') -> 'retriesStatusChange' as retries_status_change,
-                to_timestamp((jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') -> 'time' -> 'start')::bigint / 1000)::date as timestamp
+                reference,
+                jsonb_array_elements(data -> 'children') ->> 'name' as parent_suite,
+                jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') ->> 'name' as suite,
+                jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') ->> 'name' as test,
+                jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') ->> 'status' as status,
+                jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') ->> 'retriesStatusChange' as retries_status_change,
+                to_timestamp((jsonb_array_elements(jsonb_array_elements(jsonb_array_elements(data -> 'children') -> 'children') -> 'children') -> 'time' ->> 'start')::bigint / 1000)::date as timestamp
            FROM
                regress_test_results
-            WHERE
-                reference = 'refs/heads/main'
        ) data
    WHERE
        timestamp > CURRENT_DATE - INTERVAL '%s' day
-        AND (status::text IN ('"failed"', '"broken"') OR retries_status_change::boolean)
+        AND (
+            (status IN ('failed', 'broken') AND reference = 'refs/heads/main')
+            OR retries_status_change::boolean
+        )
    ;
 """

@@ -40,6 +41,9 @@ def main(args: argparse.Namespace):
    interval_days = args.days
    output = args.output

+    build_type = args.build_type
+    pg_version = args.pg_version
+
    res: DefaultDict[str, DefaultDict[str, Dict[str, bool]]]
    res = defaultdict(lambda: defaultdict(dict))

@@ -55,8 +59,21 @@ def main(args: argparse.Namespace):
        rows = []

    for row in rows:
-        logging.info(f"\t{row['parent_suite'].replace('.', '/')}/{row['suite']}.py::{row['test']}")
-        res[row["parent_suite"]][row["suite"]][row["test"]] = True
+        # We don't want to automatically rerun tests in a performance suite
+        if row["parent_suite"] != "test_runner.regress":
+            continue
+
+        deparametrized_test = row["deparametrized_test"]
+        dash_if_needed = "" if deparametrized_test.endswith("[]") else "-"
+        parametrized_test = deparametrized_test.replace(
+            "[",
+            f"[{build_type}-pg{pg_version}{dash_if_needed}",
+        )
+        res[row["parent_suite"]][row["suite"]][parametrized_test] = True
+
+        logging.info(
+            f"\t{row['parent_suite'].replace('.', '/')}/{row['suite']}.py::{parametrized_test}"
+        )

    logging.info(f"saving results to {output.name}")
    json.dump(res, output, indent=2)
@@ -77,6 +94,18 @@ if __name__ == "__main__":
        type=int,
        help="how many days to look back for flaky tests (default: 10)",
    )
+    parser.add_argument(
+        "--build-type",
+        required=True,
+        type=str,
+        help="for which build type to create list of flaky tests (debug or release)",
+    )
+    parser.add_argument(
+        "--pg-version",
+        required=True,
+        type=int,
+        help="for which Postgres version to create list of flaky tests (14, 15, etc.)",
+    )
    parser.add_argument(
        "connstr",
        help="connection string to the test results database",
--- a/storage_broker/benches/rps.rs
+++ b/storage_broker/benches/rps.rs
@@ -125,6 +125,7 @@ async fn publish(client: Option<BrokerClientChannel>, n_keys: u64) {
                    tenant_id: vec![0xFF; 16],
                    timeline_id: tli_from_u64(counter % n_keys),
                }),
+                term: 0,
                last_log_term: 0,
                flush_lsn: counter,
                commit_lsn: 2,
@@ -132,6 +133,7 @@ async fn publish(client: Option<BrokerClientChannel>, n_keys: u64) {
                remote_consistent_lsn: 4,
                peer_horizon_lsn: 5,
                safekeeper_connstr: "zenith-1-sk-1.local:7676".to_owned(),
+                http_connstr: "zenith-1-sk-1.local:7677".to_owned(),
                local_start_lsn: 0,
                availability_zone: None,
            };
--- a/storage_broker/proto/broker.proto
+++ b/storage_broker/proto/broker.proto
@@ -22,6 +22,8 @@ message SubscribeSafekeeperInfoRequest {
 message SafekeeperTimelineInfo {
    uint64 safekeeper_id = 1;
    TenantTimelineId tenant_timeline_id = 2;
+    // Safekeeper term
+    uint64 term = 12;
    // Term of the last entry.
    uint64 last_log_term = 3;
    // LSN of the last record.
@@ -36,6 +38,8 @@ message SafekeeperTimelineInfo {
    uint64 local_start_lsn = 9;
    // A connection string to use for WAL receiving.
    string safekeeper_connstr = 10;
+    // HTTP endpoint connection string
+    string http_connstr = 13;
    // Availability zone of a safekeeper.
    optional string availability_zone = 11;
 }
--- a/storage_broker/src/bin/storage_broker.rs
+++ b/storage_broker/src/bin/storage_broker.rs
@@ -519,6 +519,7 @@ mod tests {
                tenant_id: vec![0x00; 16],
                timeline_id,
            }),
+            term: 0,
            last_log_term: 0,
            flush_lsn: 1,
            commit_lsn: 2,
@@ -526,6 +527,7 @@ mod tests {
            remote_consistent_lsn: 4,
            peer_horizon_lsn: 5,
            safekeeper_connstr: "neon-1-sk-1.local:7676".to_owned(),
+            http_connstr: "neon-1-sk-1.local:7677".to_owned(),
            local_start_lsn: 0,
            availability_zone: None,
        }
--- a/test_runner/fixtures/neon_fixtures.py
+++ b/test_runner/fixtures/neon_fixtures.py
@@ -428,6 +428,7 @@ class NeonEnvBuilder:
        preserve_database_files: bool = False,
        initial_tenant: Optional[TenantId] = None,
        initial_timeline: Optional[TimelineId] = None,
+        enable_generations: bool = False,
    ):
        self.repo_dir = repo_dir
        self.rust_log_override = rust_log_override
@@ -454,6 +455,7 @@ class NeonEnvBuilder:
        self.preserve_database_files = preserve_database_files
        self.initial_tenant = initial_tenant or TenantId.generate()
        self.initial_timeline = initial_timeline or TimelineId.generate()
+        self.enable_generations = enable_generations

    def init_configs(self) -> NeonEnv:
        # Cannot create more than one environment from one builder
@@ -713,6 +715,9 @@ class NeonEnvBuilder:
                sk.stop(immediate=True)
            self.env.pageserver.stop(immediate=True)

+            if self.env.attachment_service is not None:
+                self.env.attachment_service.stop(immediate=True)
+
            cleanup_error = None
            try:
                self.cleanup_remote_storage()
@@ -766,6 +771,8 @@ class NeonEnv:
        the tenant id
    """

+    PAGESERVER_ID = 1
+
    def __init__(self, config: NeonEnvBuilder):
        self.repo_dir = config.repo_dir
        self.rust_log_override = config.rust_log_override
@@ -789,6 +796,14 @@ class NeonEnv:
        self.initial_tenant = config.initial_tenant
        self.initial_timeline = config.initial_timeline

+        if config.enable_generations:
+            attachment_service_port = self.port_distributor.get_port()
+            self.control_plane_api: Optional[str] = f"http://127.0.0.1:{attachment_service_port}"
+            self.attachment_service: Optional[NeonAttachmentService] = NeonAttachmentService(self)
+        else:
+            self.control_plane_api = None
+            self.attachment_service = None
+
        # Create a config file corresponding to the options
        toml = textwrap.dedent(
            f"""
@@ -814,7 +829,7 @@ class NeonEnv:
        toml += textwrap.dedent(
            f"""
            [pageserver]
-            id=1
+            id={self.PAGESERVER_ID}
            listen_pg_addr = 'localhost:{pageserver_port.pg}'
            listen_http_addr = 'localhost:{pageserver_port.http}'
            pg_auth_type = '{pg_auth_type}'
@@ -822,6 +837,13 @@ class NeonEnv:
        """
        )

+        if self.control_plane_api is not None:
+            toml += textwrap.dedent(
+                f"""
+                control_plane_api = '{self.control_plane_api}'
+            """
+            )
+
        # Create a corresponding NeonPageserver object
        self.pageserver = NeonPageserver(
            self, port=pageserver_port, config_override=config.pageserver_config_override
@@ -868,6 +890,9 @@ class NeonEnv:
    def start(self):
        # Start up broker, pageserver and all safekeepers
        self.broker.try_start()
+
+        if self.attachment_service is not None:
+            self.attachment_service.start()
        self.pageserver.start()

        for safekeeper in self.safekeepers:
@@ -1289,6 +1314,16 @@ class NeonCli(AbstractNeonCli):
            res.check_returncode()
            return res

+    def attachment_service_start(self):
+        cmd = ["attachment_service", "start"]
+        return self.raw_cli(cmd)
+
+    def attachment_service_stop(self, immediate: bool):
+        cmd = ["attachment_service", "stop"]
+        if immediate:
+            cmd.extend(["-m", "immediate"])
+        return self.raw_cli(cmd)
+
    def pageserver_start(
        self,
        overrides: Tuple[str, ...] = (),
@@ -1470,6 +1505,33 @@ class ComputeCtl(AbstractNeonCli):
    COMMAND = "compute_ctl"


+class NeonAttachmentService:
+    def __init__(self, env: NeonEnv):
+        self.env = env
+
+    def start(self):
+        self.env.neon_cli.attachment_service_start()
+        self.running = True
+        return self
+
+    def stop(self, immediate: bool = False) -> "NeonAttachmentService":
+        if getattr(self, "running", False):
+            self.env.neon_cli.attachment_service_stop(immediate)
+            self.running = False
+        return self
+
+    def __enter__(self) -> "NeonAttachmentService":
+        return self
+
+    def __exit__(
+        self,
+        exc_type: Optional[Type[BaseException]],
+        exc: Optional[BaseException],
+        tb: Optional[TracebackType],
+    ):
+        self.stop(immediate=True)
+
+
 class NeonPageserver(PgProtocol):
    """
    An object representing a running pageserver.
@@ -1633,6 +1695,26 @@ class NeonPageserver(PgProtocol):

        return None

+    def tenant_attach(
+        self, tenant_id: TenantId, config: None | Dict[str, Any] = None, config_null: bool = False
+    ):
+        """
+        Tenant attachment passes through here to acquire a generation number before proceeding
+        to call into the pageserver HTTP client.
+        """
+        if self.env.attachment_service is not None:
+            response = requests.post(
+                f"{self.env.control_plane_api}/attach_hook",
+                json={"tenant_id": str(tenant_id), "pageserver_id": self.env.PAGESERVER_ID},
+            )
+            response.raise_for_status()
+            generation = response.json()["gen"]
+        else:
+            generation = None
+
+        client = self.env.pageserver.http_client()
+        return client.tenant_attach(tenant_id, config, config_null, generation=generation)
+

 def append_pageserver_param_overrides(
    params_to_update: List[str],
--- a/test_runner/fixtures/pageserver/http.py
+++ b/test_runner/fixtures/pageserver/http.py
@@ -186,18 +186,25 @@ class PageserverHttpClient(requests.Session):
        return TenantId(new_tenant_id)

    def tenant_attach(
-        self, tenant_id: TenantId, config: None | Dict[str, Any] = None, config_null: bool = False
+        self,
+        tenant_id: TenantId,
+        config: None | Dict[str, Any] = None,
+        config_null: bool = False,
+        generation: Optional[int] = None,
    ):
        if config_null:
            assert config is None
-            body = "null"
+            body: Any = None
        else:
            # null-config is prohibited by the API
            config = config or {}
-            body = json.dumps({"config": config})
+            body = {"config": config}
+            if generation is not None:
+                body.update({"generation": generation})
+
        res = self.post(
            f"http://localhost:{self.port}/v1/tenant/{tenant_id}/attach",
-            data=body,
+            data=json.dumps(body),
            headers={"Content-Type": "application/json"},
        )
        self.verbose_error(res)
@@ -613,3 +620,8 @@ class PageserverHttpClient(requests.Session):
            },
        )
        self.verbose_error(res)
+
+    def deletion_queue_flush(self, execute: bool = False):
+        self.put(
+            f"http://localhost:{self.port}/v1/deletion_queue/flush?execute={'true' if execute else 'false'}"
+        ).raise_for_status()
--- a/test_runner/fixtures/pageserver/utils.py
+++ b/test_runner/fixtures/pageserver/utils.py
@@ -233,10 +233,19 @@ if TYPE_CHECKING:

 def assert_prefix_empty(neon_env_builder: "NeonEnvBuilder", prefix: Optional[str] = None):
    response = list_prefix(neon_env_builder, prefix)
-    objects = response.get("Contents")
-    assert (
-        response["KeyCount"] == 0
-    ), f"remote dir with prefix {prefix} is not empty after deletion: {objects}"
+    keys = response["KeyCount"]
+    objects = response.get("Contents", [])
+
+    if keys != 0 and len(objects) == 0:
+        # this has been seen in one case with mock_s3:
+        # https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4938/6000769714/index.html#suites/3556ed71f2d69272a7014df6dcb02317/ca01e4f4d8d9a11f
+        # looking at the moto implementation, there may be a race where a common prefix (subdirectory) lingers after its objects are deleted
+        common_prefixes = response.get("CommonPrefixes", [])
+        log.warning(
+            f"contradicting ListObjectsV2 response with KeyCount={keys} and Contents={objects}, CommonPrefixes={common_prefixes}"
+        )
+
+    assert keys == 0, f"remote dir with prefix {prefix} is not empty after deletion: {objects}"


 def assert_prefix_not_empty(neon_env_builder: "NeonEnvBuilder", prefix: Optional[str] = None):
--- a/test_runner/regress/test_pageserver_restart.py
+++ b/test_runner/regress/test_pageserver_restart.py
@@ -7,7 +7,10 @@ from fixtures.neon_fixtures import NeonEnvBuilder

 # Test restarting page server, while safekeeper and compute node keep
 # running.
-def test_pageserver_restart(neon_env_builder: NeonEnvBuilder):
+@pytest.mark.parametrize("generations", [True, False])
+def test_pageserver_restart(neon_env_builder: NeonEnvBuilder, generations: bool):
+    neon_env_builder.enable_generations = generations
+
    env = neon_env_builder.init_start()

    env.neon_cli.create_branch("test_pageserver_restart")
--- a/test_runner/regress/test_remote_storage.py
+++ b/test_runner/regress/test_remote_storage.py
@@ -12,7 +12,10 @@ from typing import Dict, List, Optional, Tuple
 import pytest
 from fixtures.log_helper import log
 from fixtures.neon_fixtures import (
+    NeonEnv,
    NeonEnvBuilder,
+    PgBin,
+    last_flush_lsn_upload,
    wait_for_last_flush_lsn,
 )
 from fixtures.pageserver.http import PageserverApiException, PageserverHttpClient
@@ -52,9 +55,9 @@ from requests import ReadTimeout
 #
 # The tests are done for all types of remote storage pageserver supports.
 @pytest.mark.parametrize("remote_storage_kind", available_remote_storages())
+@pytest.mark.parametrize("generations", [True, False])
 def test_remote_storage_backup_and_restore(
-    neon_env_builder: NeonEnvBuilder,
-    remote_storage_kind: RemoteStorageKind,
+    neon_env_builder: NeonEnvBuilder, remote_storage_kind: RemoteStorageKind, generations: bool
 ):
    # Use this test to check more realistic SK ids: some etcd key parsing bugs were related,
    # and this test needs SK to write data to pageserver, so it will be visible
@@ -65,6 +68,8 @@ def test_remote_storage_backup_and_restore(
        test_name="test_remote_storage_backup_and_restore",
    )

+    neon_env_builder.enable_generations = generations
+
    # Exercise retry code path by making all uploads and downloads fail for the
    # first time. The retries print INFO-messages to the log; we will check
    # that they are present after the test.
@@ -155,7 +160,8 @@ def test_remote_storage_backup_and_restore(
    # background task to load the tenant. In that background task,
    # listing the remote timelines will fail because of the failpoint,
    # and the tenant will be marked as Broken.
-    client.tenant_attach(tenant_id)
+    # client.tenant_attach(tenant_id)
+    env.pageserver.tenant_attach(tenant_id)

    tenant_info = wait_until_tenant_state(pageserver_http, tenant_id, "Broken", 15)
    assert tenant_info["attachment_status"] == {
@@ -165,7 +171,7 @@ def test_remote_storage_backup_and_restore(

    # Ensure that even though the tenant is broken, we can't attach it again.
    with pytest.raises(Exception, match=f"tenant {tenant_id} already exists, state: Broken"):
-        client.tenant_attach(tenant_id)
+        env.pageserver.tenant_attach(tenant_id)

    # Restart again, this implicitly clears the failpoint.
    # test_remote_failures=1 remains active, though, as it's in the pageserver config.
@@ -183,7 +189,7 @@ def test_remote_storage_backup_and_restore(
    # Ensure that the pageserver remembers that the tenant was attaching, by
    # trying to attach it again. It should fail.
    with pytest.raises(Exception, match=f"tenant {tenant_id} already exists, state:"):
-        client.tenant_attach(tenant_id)
+        env.pageserver.tenant_attach(tenant_id)
    log.info("waiting for tenant to become active. this should be quick with on-demand download")

    wait_until_tenant_active(
@@ -250,35 +256,20 @@ def test_remote_storage_upload_queue_retries(

    client = env.pageserver.http_client()

-    endpoint = env.endpoints.create_start("main", tenant_id=tenant_id)
-
-    endpoint.safe_psql("CREATE TABLE foo (id INTEGER PRIMARY KEY, val text)")
-
-    def configure_storage_sync_failpoints(action):
+    def configure_storage_write_failpoints(action):
        client.configure_failpoints(
            [
                ("before-upload-layer", action),
                ("before-upload-index", action),
-                ("before-delete-layer", action),
            ]
        )

-    def overwrite_data_and_wait_for_it_to_arrive_at_pageserver(data):
-        # create initial set of layers & upload them with failpoints configured
-        endpoint.safe_psql_many(
+    def configure_storage_delete_failpoints(action):
+        client.configure_failpoints(
            [
-                f"""
-               INSERT INTO foo (id, val)
-               SELECT g, '{data}'
-               FROM generate_series(1, 20000) g
-               ON CONFLICT (id) DO UPDATE
-               SET val = EXCLUDED.val
-               """,
-                # to ensure that GC can actually remove some layers
-                "VACUUM foo",
+                ("deletion-queue-before-execute", action),
            ]
        )
-        wait_for_last_flush_lsn(env, endpoint, tenant_id, timeline_id)

    def get_queued_count(file_kind, op_kind):
        val = client.get_remote_timeline_client_metric(
@@ -291,23 +282,52 @@ def test_remote_storage_upload_queue_retries(
        assert val is not None, "expecting metric to be present"
        return int(val)

-    # create some layers & wait for uploads to finish
-    overwrite_data_and_wait_for_it_to_arrive_at_pageserver("a")
-    client.timeline_checkpoint(tenant_id, timeline_id)
-    client.timeline_compact(tenant_id, timeline_id)
-    overwrite_data_and_wait_for_it_to_arrive_at_pageserver("b")
-    client.timeline_checkpoint(tenant_id, timeline_id)
-    client.timeline_compact(tenant_id, timeline_id)
-    gc_result = client.timeline_gc(tenant_id, timeline_id, 0)
-    print_gc_result(gc_result)
-    assert gc_result["layers_removed"] > 0
+    def get_deletions_executed() -> int:
+        executed = client.get_metric_value("pageserver_deletion_queue_executed_total")
+        if executed is None:
+            return 0
+        else:
+            return int(executed)

-    wait_until(2, 1, lambda: get_queued_count(file_kind="layer", op_kind="upload") == 0)
-    wait_until(2, 1, lambda: get_queued_count(file_kind="index", op_kind="upload") == 0)
-    wait_until(2, 1, lambda: get_queued_count(file_kind="layer", op_kind="delete") == 0)
+    def get_deletion_errors(op_type) -> int:
+        errors = client.get_metric_value(
+            "pageserver_deletion_queue_errors_total", {"op_kind": op_type}
+        )
+        if errors is None:
+            return 0
+        else:
+            return int(errors)
+
+    def assert_queued_count(file_kind: str, op_kind: str, fn):
+        v = get_queued_count(file_kind=file_kind, op_kind=op_kind)
+        log.info(f"queue count: {file_kind} {op_kind} {v}")
+        assert fn(v)
+
+    # Push some uploads into the remote_timeline_client queues, before failpoints
+    # are enabled: these should execute and the queue should revert to zero depth
+    generate_uploads_and_deletions(env, tenant_id=tenant_id, timeline_id=timeline_id)
+
+    wait_until(2, 1, lambda: assert_queued_count("layer", "upload", lambda v: v == 0))
+    wait_until(2, 1, lambda: assert_queued_count("index", "upload", lambda v: v == 0))
+
+    # Wait for some deletions to happen in the above compactions, assert that
+    # our metrics of interest exist
+    wait_until(2, 1, lambda: assert_deletion_queue(client, lambda v: v is not None))
+
+    # Before enabling failpoints, flushing deletions through should work
+    client.deletion_queue_flush(execute=True)
+    executed = client.get_metric_value("pageserver_deletion_queue_executed_total")
+    assert executed is not None
+    assert executed > 0

    # let all future operations queue up
-    configure_storage_sync_failpoints("return")
+    configure_storage_write_failpoints("return")
+    configure_storage_delete_failpoints("return")
+
+    # Snapshot of executed deletions: should not increment while failpoint is enabled
+    deletions_executed_pre_failpoint = client.get_metric_value(
+        "pageserver_deletion_queue_executed_total"
+    )

    # Create more churn to generate all upload ops.
    # The checkpoint / compact / gc ops will block because they call remote_client.wait_completion().
@@ -315,38 +335,77 @@ def test_remote_storage_upload_queue_retries(
    churn_thread_result = [False]

    def churn_while_failpoints_active(result):
-        overwrite_data_and_wait_for_it_to_arrive_at_pageserver("c")
-        client.timeline_checkpoint(tenant_id, timeline_id)
-        client.timeline_compact(tenant_id, timeline_id)
-        overwrite_data_and_wait_for_it_to_arrive_at_pageserver("d")
-        client.timeline_checkpoint(tenant_id, timeline_id)
-        client.timeline_compact(tenant_id, timeline_id)
-        gc_result = client.timeline_gc(tenant_id, timeline_id, 0)
-        print_gc_result(gc_result)
-        assert gc_result["layers_removed"] > 0
+        generate_uploads_and_deletions(
+            env, init=False, tenant_id=tenant_id, timeline_id=timeline_id, data="d"
+        )
        result[0] = True

    churn_while_failpoints_active_thread = threading.Thread(
        target=churn_while_failpoints_active, args=[churn_thread_result]
    )
+    log.info("Entered churn phase")
    churn_while_failpoints_active_thread.start()

-    # wait for churn thread's data to get stuck in the upload queue
-    wait_until(10, 0.1, lambda: get_queued_count(file_kind="layer", op_kind="upload") > 0)
-    wait_until(10, 0.1, lambda: get_queued_count(file_kind="index", op_kind="upload") >= 2)
-    wait_until(10, 0.1, lambda: get_queued_count(file_kind="layer", op_kind="delete") > 0)
+    try:
+        # wait for churn thread's data to get stuck in the upload queue
+        wait_until(10, 0.1, lambda: assert_queued_count("layer", "upload", lambda v: v > 0))
+        wait_until(10, 0.1, lambda: assert_queued_count("index", "upload", lambda v: v >= 2))

-    # unblock churn operations
-    configure_storage_sync_failpoints("off")
+        # Deletion queue should not grow, because deletions wait for upload of
+        # metadata, and we blocked that upload.
+        wait_until(10, 0.5, lambda: assert_deletion_queue(client, lambda v: v == 0))

-    # ... and wait for them to finish. Exponential back-off in upload queue, so, gracious timeouts.
-    wait_until(30, 1, lambda: get_queued_count(file_kind="layer", op_kind="upload") == 0)
-    wait_until(30, 1, lambda: get_queued_count(file_kind="index", op_kind="upload") == 0)
-    wait_until(30, 1, lambda: get_queued_count(file_kind="layer", op_kind="delete") == 0)
+        # No more deletions should have executed
+        assert get_deletions_executed() == deletions_executed_pre_failpoint
+
+        # unblock write operations
+        log.info("Unblocking remote writes")
+        configure_storage_write_failpoints("off")
+
+        # ... and wait for them to finish. Exponential back-off in upload queue, so use generous timeouts.
+        wait_until(30, 1, lambda: assert_queued_count("layer", "upload", lambda v: v == 0))
+        wait_until(30, 1, lambda: assert_queued_count("index", "upload", lambda v: v == 0))
+
+        # Deletions should have been enqueued now that index uploads proceeded
+        log.info("Waiting to see deletions enqueued")
+        wait_until(10, 1, lambda: assert_deletion_queue(client, lambda v: v > 0))
+
+        # Run flush in the background because it will block on the failpoint
+        class background_flush(threading.Thread):
+            def run(self):
+                client.deletion_queue_flush(execute=True)
+
+        flusher = background_flush()
+        flusher.start()
+
+        def assert_failpoint_hit():
+            assert get_deletion_errors("failpoint") > 0
+
+        # Our background flush thread should induce us to hit the failpoint
+        wait_until(20, 0.25, assert_failpoint_hit)
+
+        # Deletions should not have been executed while failpoint is still active.
+        assert get_deletion_queue_depth(client) is not None
+        assert get_deletion_queue_depth(client) > 0
+        assert get_deletions_executed() == deletions_executed_pre_failpoint
+
+        log.info("Unblocking remote deletes")
+        configure_storage_delete_failpoints("off")
+
+        # An API flush should now complete
+        flusher.join()
+
+        # Queue should drain, which should involve executing some deletions
+        wait_until(2, 1, lambda: assert_deletion_queue(client, lambda v: v == 0))
+        assert get_deletions_executed() > deletions_executed_pre_failpoint
+
+    finally:
+        # The churn thread doesn't make progress once it blocks on the first wait_completion() call,
+        # so, give it some time to wrap up.
+        log.info("Joining churn workload")
+        churn_while_failpoints_active_thread.join(30)
+        log.info("Joined churn workload")

-    # The churn thread doesn't make progress once it blocks on the first wait_completion() call,
-    # so, give it some time to wrap up.
-    churn_while_failpoints_active_thread.join(30)
    assert not churn_while_failpoints_active_thread.is_alive()
    assert churn_thread_result[0]

@@ -364,7 +423,7 @@ def test_remote_storage_upload_queue_retries(
    env.pageserver.start()
    client = env.pageserver.http_client()

-    client.tenant_attach(tenant_id)
+    env.pageserver.tenant_attach(tenant_id)

    wait_until_tenant_active(client, tenant_id)

@@ -432,7 +491,6 @@ def test_remote_timeline_client_calls_started_metric(
    calls_started: Dict[Tuple[str, str], List[int]] = {
        ("layer", "upload"): [0],
        ("index", "upload"): [0],
-        ("layer", "delete"): [0],
    }

    def fetch_calls_started():
@@ -502,7 +560,7 @@ def test_remote_timeline_client_calls_started_metric(
    env.pageserver.start()
    client = env.pageserver.http_client()

-    client.tenant_attach(tenant_id)
+    env.pageserver.tenant_attach(tenant_id)

    wait_until_tenant_active(client, tenant_id)

@@ -930,4 +988,154 @@ def assert_nothing_to_upload(
    assert Lsn(detail["last_record_lsn"]) == Lsn(detail["remote_consistent_lsn"])


+def get_deletion_queue_depth(ps_http) -> int:
+    """
+    Queue depth (submitted minus executed deletions), or 0 if no deletion has been submitted yet
+    """
+    submitted = ps_http.get_metric_value("pageserver_deletion_queue_submitted_total")
+
+    if submitted is None:
+        return 0
+
+    executed = ps_http.get_metric_value("pageserver_deletion_queue_executed_total")
+    executed = 0 if executed is None else executed
+
+    depth = submitted - executed
+    assert depth >= 0
+
+    log.info(f"get_deletion_queue_depth: {depth} ({submitted} - {executed})")
+    return int(depth)
+
+
+def assert_deletion_queue(ps_http, size_fn) -> None:
+    v = get_deletion_queue_depth(ps_http)
+    assert v is not None
+    assert size_fn(v) is True
+
+
 # TODO Test that we correctly handle GC of files that are stuck in upload queue.
+
+
+def generate_uploads_and_deletions(
+    env: NeonEnv,
+    *,
+    init: bool = True,
+    tenant_id: Optional[TenantId] = None,
+    timeline_id: Optional[TimelineId] = None,
+    data: Optional[str] = None,
+):
+    """
+    Using the given tenant + timeline (defaulting to the environment's initial
+    tenant + timeline), generate a load pattern that results in some uploads
+    and some deletions to remote storage.
+    """
+
+    if tenant_id is None:
+        tenant_id = env.initial_tenant
+    assert tenant_id is not None
+
+    if timeline_id is None:
+        timeline_id = env.initial_timeline
+    assert timeline_id is not None
+
+    ps_http = env.pageserver.http_client()
+
+    with env.endpoints.create_start("main", tenant_id=tenant_id) as endpoint:
+        if init:
+            endpoint.safe_psql("CREATE TABLE foo (id INTEGER PRIMARY KEY, val text)")
+            last_flush_lsn_upload(env, endpoint, tenant_id, timeline_id)
+
+        def churn(data):
+            endpoint.safe_psql_many(
+                [
+                    f"""
+                INSERT INTO foo (id, val)
+                SELECT g, '{data}'
+                FROM generate_series(1, 20000) g
+                ON CONFLICT (id) DO UPDATE
+                SET val = EXCLUDED.val
+                """,
+                    # to ensure that GC can actually remove some layers
+                    "VACUUM foo",
+                ]
+            )
+            assert tenant_id is not None
+            assert timeline_id is not None
+            wait_for_last_flush_lsn(env, endpoint, tenant_id, timeline_id)
+            ps_http.timeline_checkpoint(tenant_id, timeline_id)
+
+        # Compaction should generate some GC-eligible layers
+        for i in range(0, 2):
+            churn(f"{i if data is None else data}")
+
+        gc_result = ps_http.timeline_gc(tenant_id, timeline_id, 0)
+        print_gc_result(gc_result)
+        assert gc_result["layers_removed"] > 0
+
+
+@pytest.mark.parametrize("remote_storage_kind", [RemoteStorageKind.LOCAL_FS])
+def test_deletion_queue_recovery(
+    neon_env_builder: NeonEnvBuilder,
+    remote_storage_kind: RemoteStorageKind,
+    pg_bin: PgBin,
+):
+    neon_env_builder.enable_remote_storage(
+        remote_storage_kind=remote_storage_kind,
+        test_name="test_deletion_queue_recovery",
+    )
+
+    env = neon_env_builder.init_start(
+        initial_tenant_conf={
+            # small checkpointing and compaction targets to ensure we generate many upload operations
+            "checkpoint_distance": f"{128 * 1024}",
+            "compaction_threshold": "1",
+            "compaction_target_size": f"{128 * 1024}",
+            # no PITR horizon, we specify the horizon when we request on-demand GC
+            "pitr_interval": "0s",
+            # disable background compaction and GC. We invoke it manually when we want it to happen.
+            "gc_period": "0s",
+            "compaction_period": "0s",
+            # create image layers eagerly, so that GC can remove some layers
+            "image_creation_threshold": "1",
+        }
+    )
+
+    ps_http = env.pageserver.http_client()
+
+    # Prevent deletion lists from being executed, to build up some backlog of deletions
+    ps_http.configure_failpoints(
+        [
+            ("deletion-queue-before-execute", "return"),
+        ]
+    )
+
+    generate_uploads_and_deletions(env)
+
+    # There should be entries in the deletion queue
+    assert_deletion_queue(ps_http, lambda n: n > 0)
+    ps_http.deletion_queue_flush()
+    before_restart_depth = get_deletion_queue_depth(ps_http)
+
+    log.info(f"Restarting pageserver with {before_restart_depth} deletions enqueued")
+    env.pageserver.stop(immediate=True)
+    env.pageserver.start()
+
+    def assert_deletions_submitted(n: int):
+        assert ps_http.get_metric_value("pageserver_deletion_queue_submitted_total") == n
+
+    # After restart, issue a flush to kick the deletion frontend to do recovery.
+    # It should recover all the operations we submitted before the restart.
+    ps_http.deletion_queue_flush(execute=False)
+    wait_until(20, 0.25, lambda: assert_deletions_submitted(before_restart_depth))
+
+    # The queue should drain through completely if we flush it
+    ps_http.deletion_queue_flush(execute=True)
+    wait_until(10, 1, lambda: assert_deletion_queue(ps_http, lambda n: n == 0))
+
+    # Restart again
+    env.pageserver.stop(immediate=True)
+    env.pageserver.start()
+
+    # No deletion lists should be recovered: this demonstrates that deletion lists
+    # were cleaned up after being executed.
+    time.sleep(1)
+    assert_deletion_queue(ps_http, lambda n: n == 0)
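The test changes above replace boolean lambdas passed to `wait_until` (for example `lambda: get_queued_count(...) == 0`) with helpers that assert, such as `assert_queued_count` and `assert_deletion_queue`. A minimal sketch of why that matters follows, assuming `wait_until` retries its callable until it returns without raising. `wait_until_sketch` and `make_depth_source` are hypothetical stand-ins written for illustration, not the fixture's actual implementation.

```python
import time
from typing import Any, Callable, Iterable


def wait_until_sketch(iterations: int, interval: float, func: Callable[[], Any]) -> Any:
    """Hypothetical stand-in: call func until it returns without raising."""
    last_exc = None
    for _ in range(iterations):
        try:
            return func()
        except Exception as e:  # AssertionError is an Exception subclass
            last_exc = e
            time.sleep(interval)
    raise RuntimeError("timed out waiting for condition") from last_exc


def make_depth_source(values: Iterable[int]) -> Callable[[], int]:
    """Fake metric: yields one queue depth per poll."""
    it = iter(values)
    return lambda: next(it)


# A boolean predicate never raises, so the wait ends on the first poll
# even though the queue depth is still 3.
boolean_source = make_depth_source([3, 1, 0])
boolean_result = wait_until_sketch(3, 0.0, lambda: boolean_source() == 0)

# An asserting predicate raises until the condition holds, so the wait
# keeps polling until the depth actually reaches 0.
asserting_source = make_depth_source([3, 1, 0])


def assert_depth_zero() -> int:
    v = asserting_source()
    assert v == 0, f"queue depth still {v}"
    return v


asserting_result = wait_until_sketch(3, 0.0, assert_depth_zero)
```

Under this assumption, only the asserting form actually waits for the queue to drain, which is consistent with the commit titled "Fix broken wait_untils in test_remote_storage_upload_queue_retries".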
--- a/test_runner/regress/test_tenant_delete.py
+++ b/test_runner/regress/test_tenant_delete.py
@@ -47,6 +47,15 @@ def test_tenant_delete_smoke(
    )

    env = neon_env_builder.init_start()
+    env.pageserver.allowed_errors.extend(
+        [
+            # The deletion queue will complain when it encounters simulated S3 errors
+            ".*deletion frontend: Failed to write deletion list.*",
+            ".*deletion backend: Failed to delete deletion list.*",
+            ".*deletion executor: DeleteObjects request failed.*",
+            ".*deletion backend: Failed to upload deletion queue header.*",
+        ]
+    )

    # lucky race with stopping from flushing a layer we fail to schedule any uploads
    env.pageserver.allowed_errors.append(
@@ -91,7 +100,9 @@ def test_tenant_delete_smoke(

    iterations = poll_for_remote_storage_iterations(remote_storage_kind)

-    tenant_delete_wait_completed(ps_http, tenant_id, iterations)
+    # We are running with failures enabled, so this may take some time to make
+    # it through all the remote storage operations required to complete
+    tenant_delete_wait_completed(ps_http, tenant_id, iterations * 10)

    tenant_path = env.tenant_dir(tenant_id=tenant_id)
    assert not tenant_path.exists()
@@ -201,6 +212,17 @@ def test_delete_tenant_exercise_crash_safety_failpoints(
        ]
    )

+    if simulate_failures:
+        env.pageserver.allowed_errors.extend(
+            [
+                # The deletion queue will complain when it encounters simulated S3 errors
+                ".*deletion frontend: Failed to write deletion list.*",
+                ".*deletion backend: Failed to delete deletion list.*",
+                ".*deletion executor: DeleteObjects request failed.*",
+                ".*deletion backend: Failed to upload deletion queue header.*",
+            ]
+        )
+
    ps_http = env.pageserver.http_client()

    timeline_id = env.neon_cli.create_timeline("delete", tenant_id=tenant_id)
--- a/test_runner/regress/test_timeline_delete.py
+++ b/test_runner/regress/test_timeline_delete.py
@@ -488,7 +488,14 @@ def test_timeline_delete_fail_before_local_delete(neon_env_builder: NeonEnvBuild
    # Wait for tenant to finish loading.
    wait_until_tenant_active(ps_http, tenant_id=env.initial_tenant, iterations=10, period=1)

-    wait_timeline_detail_404(ps_http, env.initial_tenant, leaf_timeline_id, iterations=4)
+    # Timeline deletion takes some finite time after startup
+    wait_timeline_detail_404(
+        ps_http,
+        tenant_id=env.initial_tenant,
+        timeline_id=leaf_timeline_id,
+        iterations=20,
+        interval=0.5,
+    )

    assert (
        not leaf_timeline_path.exists()
@@ -534,7 +541,7 @@ def test_timeline_delete_fail_before_local_delete(neon_env_builder: NeonEnvBuild
    wait_until(
        2,
        0.5,
-        lambda: assert_prefix_empty(neon_env_builder),
+        lambda: assert_prefix_empty(neon_env_builder, prefix="/tenants"),
    )


@@ -688,7 +695,7 @@ def test_delete_timeline_client_hangup(neon_env_builder: NeonEnvBuilder):
    wait_until(50, 0.1, first_request_finished)

    # check that the timeline is gone
-    wait_timeline_detail_404(ps_http, env.initial_tenant, child_timeline_id, iterations=2)
+    wait_timeline_detail_404(ps_http, env.initial_tenant, child_timeline_id, iterations=4)


@pytest.mark.parametrize(
@@ -772,7 +779,11 @@ def test_timeline_delete_works_for_remote_smoke(

    # for some reason the check above doesnt immediately take effect for the below.
    # Assume it is mock server inconsistency and check twice.
-    wait_until(2, 0.5, lambda: assert_prefix_empty(neon_env_builder))
+    wait_until(
+        2,
+        0.5,
+        lambda: assert_prefix_empty(neon_env_builder, "/tenants"),
+    )


 def test_delete_orphaned_objects(
@@ -827,6 +838,8 @@ def test_delete_orphaned_objects(
    reason = timeline_info["state"]["Broken"]["reason"]
    assert reason.endswith(f"failpoint: {failpoint}"), reason

+    ps_http.deletion_queue_flush(execute=True)
+
    for orphan in orphans:
        assert not orphan.exists()
        assert env.pageserver.log_contains(
--- a/vm-cgconfig.conf
+++ b/vm-cgconfig.conf
@@ -1,12 +0,0 @@
-# Configuration for cgroups in VM compute nodes
-group neon-postgres {
-    perm {
-        admin {
-            uid = vm-informant;
-        }
-        task {
-            gid = users;
-        }
-    }
-    memory {}
-}
Author	SHA1	Message	Date
John Spray	c63a952b78	Implement validation of generations before delete	2023-08-30 17:44:10 +01:00
John Spray	35e4b43531	Hook deletion queue into generations	2023-08-30 15:35:51 +01:00
John Spray	584c0d3c7b	Make remote_layer_path take Generation instead of layer metadata	2023-08-30 15:13:00 +01:00
John Spray	84023207ce	Merge branch 'jcsp/deletion-queue' into jcsp/generation-numbers	2023-08-30 15:07:35 +01:00
John Spray	35fa75699b	switch deletion queue to local storage	2023-08-30 12:21:29 +01:00
John Spray	f77aa463c6	clippy	2023-08-30 10:37:06 +01:00
John Spray	4492d40c37	Merge remote-tracking branch 'upstream/main' into jcsp/deletion-queue	2023-08-30 10:34:16 +01:00
John Spray	2f58f39648	Revert "libs: make backoff::retry() take a cancellation token" This reverts commit `8c2ff87f1a`.	2023-08-30 10:26:15 +01:00
Joonas Koivunen	05773708d3	fix: add context for ancestor lsn wait (#5143) In logs it is confusing to see seqwait timeouts which seemingly arise from the branched lsn but actually are about the ancestor, leading to questions like "has the last_record_lsn went back". Noticed by @problame.	2023-08-30 12:21:41 +03:00
John Spray	382473d9a5	docs: add RFC for remote storage generation numbers (#4919) ## Summary A scheme of logical "generation numbers" for pageservers and their attachments is proposed, along with changes to the remote storage format to include these generation numbers in S3 keys. Using the control plane as the issuer of these generation numbers enables strong anti-split-brain properties in the pageserver cluster without implementing a consensus mechanism directly in the pageservers. ## Motivation Currently, the pageserver's remote storage format does not provide a mechanism for addressing split brain conditions that may happen when replacing a node during failover or when migrating a tenant from one pageserver to another. From a remote storage perspective, a split brain condition occurs whenever two nodes both think they have the same tenant attached, and both can write to S3. This can happen in the case of a network partition, pathologically long delays (e.g. suspended VM), or software bugs. This blocks robust implementation of failover from unresponsive pageservers, due to the risk that the unresponsive pageserver is still writing to S3. --------- Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2023-08-30 09:49:55 +01:00
Arpad Müller	eb0a698adc	Make page cache and read_blk async (#5023) ## Problem `read_blk` does I/O and thus we would like to make it async. We can't make the function async as long as the `PageReadGuard` returned by `read_blk` isn't `Send`. The page cache is called by `read_blk`, and thus it can't be async without `read_blk` being async. Thus, we have a circular dependency. ## Summary of changes Due to the circular dependency, we convert both the page cache and `read_blk` to async at the same time: We make the page cache use `tokio::sync` synchronization primitives as those are `Send`. This makes all the places that acquire a lock require async though, which we then also do. This includes also asyncification of the `read_blk` function. Builds upon #4994, #5015, #5056, and #5129. Part of #4743.	2023-08-30 09:04:31 +02:00
Arseny Sher	81b6578c44	Allow walsender in recovery mode give WAL till dynamic flush_lsn. Instead of fixed during the start of replication. To this end, create term_flush_lsn watch channel similar to commit_lsn one. This allows to continue recovery streaming if new data appears.	2023-08-29 23:19:40 +03:00
Arseny Sher	bc49c73fee	Move wal_stream_connection_config to utils. It will be used by safekeeper as well.	2023-08-29 23:19:40 +03:00
Arseny Sher	e98580b092	Add term and http endpoint to broker messaged SkTimelineInfo. We need them for safekeeper peer recovery https://github.com/neondatabase/neon/pull/4875	2023-08-29 23:19:40 +03:00
Arseny Sher	804ef23043	Rename TermSwitchEntry to TermLsn. Add derive Ord for easy comparison of <term, lsn> pairs. part of https://github.com/neondatabase/neon/pull/4875	2023-08-29 23:19:40 +03:00
Arseny Sher	87f7d6bce3	Start and stop per timeline recovery task. Slightly refactors init: now load_tenant_timelines is also async to properly init the timeline, but to keep global map lock sync we just acquire it anew for each timeline. Recovery task itself is just a stub here. part of https://github.com/neondatabase/neon/pull/4875	2023-08-29 23:19:40 +03:00
Arseny Sher	39e3fbbeb0	Add safekeeper peers to TimelineInfo. Now available under GET /tenant/xxx/timeline/yyy for inspection.	2023-08-29 23:19:40 +03:00
Em Sharnoff	8d2a4aa5f8	vm-monitor: Add flag for when file cache on disk (#5130) Part 1 of 2, for moving the file cache onto disk. Because VMs are created by the control plane (and that's where the filesystem for the file cache is defined), we can't rely on any kind of synchronization between releases, so the change needs to be feature-gated (kind of), with the default remaining the same for now. See also: neondatabase/cloud#6593	2023-08-29 12:44:48 -07:00
John Spray	10b85c0d9a	fixup index_part loading	2023-08-29 17:26:08 +01:00
John Spray	cd6367b5ae	fixup control_plane attach hook	2023-08-29 17:18:28 +01:00
John Spray	79f9f7c5f8	fixup control_api types	2023-08-29 17:18:28 +01:00
John Spray	ef5ce1635c	fixup attach API	2023-08-29 17:18:28 +01:00
John Spray	5aecd8c4fd	tests: enable generations in neon_fixture	2023-08-29 17:18:28 +01:00
John Spray	5266bf4552	remote_storage: fix LocalFs list_files	2023-08-29 17:18:28 +01:00
John Spray	a1bcad2382	DNM unit test for index part download	2023-08-29 17:18:28 +01:00
John Spray	4dd60bf7cd	pageserver: generation-aware index_part.json loading	2023-08-29 17:18:28 +01:00
John Spray	3eff65618d	control_plane: implement attach hook	2023-08-29 17:18:28 +01:00
John Spray	265d3b4352	pageserver: if control plane API is disabled, ignore generations	2023-08-29 17:18:28 +01:00
John Spray	000330054b	pageserver: require attachment generation if control plane API is set	2023-08-29 17:18:28 +01:00
John Spray	ddb6453f56	neon_local: manage attachment_service	2023-08-29 17:18:28 +01:00
John Spray	bc95b8f1f5	pageserver: call into control plane on startup	2023-08-29 17:18:28 +01:00
John Spray	5b7d3e39d6	Move pageserver control plane API types into libs/	2023-08-29 17:18:28 +01:00
John Spray	034bebcfcd	pageserver: add control_plane_api conf	2023-08-29 17:18:28 +01:00
John Spray	9e0e2a2a9a	Stub of generations API	2023-08-29 17:18:28 +01:00
John Spray	34160a15ca	Support generations in RemoteTimelineClient delete	2023-08-29 17:18:28 +01:00
John Spray	f3a9c2d788	Add optional generation input during create & attach	2023-08-29 17:18:28 +01:00
John Spray	50da1b7983	Simplify Generation	2023-08-29 17:08:55 +01:00
John Spray	4a0e2d1290	Simplify None handling for Generation in LayerfileMetadata	2023-08-29 16:57:28 +01:00
John Spray	980d3ba8b0	clippy	2023-08-29 15:47:00 +01:00
John Spray	fd836d8c45	Support generations in RemoteTimelineClient delete	2023-08-29 15:36:15 +01:00
John Spray	67b17034ab	pageserver: use generation in keys when writing	2023-08-29 15:36:15 +01:00
John Spray	930de712ee	pageserver: add Generation type to Tenant, Timeline & Index	2023-08-29 15:36:15 +01:00
John Spray	dd033d9138	utils: introduce Generation type	2023-08-29 15:36:12 +01:00
Joonas Koivunen	d1fcdf75b3	test: enhanced logging for curious mock_s3 (#5134) Possible flakiness with mock_s3. Add logging in hopes this will happen again. Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2023-08-29 14:48:50 +03:00
Alexander Bayandin	7e39a96441	scripts/flaky_tests.py: Improve flaky tests detection (#5094) ## Problem We still need to rerun some builds manually because flaky tests weren't detected automatically. I found two reasons for it: - If a test is flaky on a particular build type, on a particular Postgres version, there's a high chance that this test is flaky on all configurations, but we don't automatically detect such cases. - We detect flaky tests only on the main branch, which requires manual retrigger runs for freshly made flaky tests. Both of them are fixed in the PR. ## Summary of changes - Spread flakiness of a single test to all configurations - Detect flaky tests in all branches (not only in the main) - Look back only at 7 days of test history (instead of 10)	2023-08-29 11:53:24 +01:00
Vadim Kharitonov	babefdd3f9	Upgrade pgvector to 0.5.0 (#5132)	2023-08-29 12:53:50 +03:00
Arpad Müller	805fee1483	page cache: small code cleanups (#5125) ## Problem I saw these things while working on #5111. ## Summary of changes * Add a comment explaining why we use `Vec::leak` instead of `Vec::into_boxed_slice` plus `Box::leak`. * Add another comment explaining what `valid` is doing, it wasn't very clear before. * Add a function `set_usage_count` to not set it directly.	2023-08-29 11:49:04 +03:00
Felix Prasanna	85d6d9dc85	monitor/compute_ctl: remove references to the informant (#5115) Also added some docs to the monitor :) Co-authored-by: Em Sharnoff <sharnoff@neon.tech>	2023-08-29 02:59:27 +03:00
Em Sharnoff	e40ee7c3d1	remove unused file 'vm-cgconfig.conf' (#5127) Honestly no clue why it's still here, should have been removed ages ago. This is handled by vm-builder now.	2023-08-28 13:04:57 -07:00
Christian Schwarz	0fe3b3646a	page cache: don't proactively evict EphemeralFile pages (#5129) Before this patch, when dropping an EphemeralFile, we'd scan the entire `slots` to proactively evict its pages (`drop_buffers_for_immutable`). This was _necessary_ before #4994 because the page cache was a write-back cache: we'd be deleting the EphemeralFile from disk after, so, if we hadn't evicted its pages before that, write-back in `find_victim` would have failed. But, since #4994, the page cache is a read-only cache, so, it's safe to keep read-only data cached. It's never going to get accessed again and eventually, `find_victim` will evict it. The only remaining advantage of `drop_buffers_for_immutable` over relying on `find_victim` is that `find_victim` has to do the clock page replacement iterations until the count reaches 0, whereas `drop_buffers_for_immutable` can kick the page out right away. However, weigh that against the cost of `drop_buffers_for_immutable`, which currently scans the entire `slots` array to find the EphemeralFile's pages. Alternatives have been proposed in #5122 and #5128, but, they come with their own overheads & trade-offs. Also, the real reason why we're looking into this piece of code is that we want to make the slots rwlock async in #5023. Since `drop_buffers_for_immutable` is called from drop, and there is no async drop, it would be nice to not have to deal with this. So, let's just stop doing `drop_buffers_for_immutable` and observe the performance impact in benchmarks.	2023-08-28 20:42:18 +02:00
Em Sharnoff	529f8b5016	compute_ctl: Fix switched vm-monitor args (#5117) Small switcheroo from #4946.	2023-08-28 14:55:41 +02:00
Joonas Koivunen	fbcd174489	load_layer_map: schedule deletions for any future layers (#5103) Unrelated fixes noticed while integrating #4938. - Stop leaking future layers in remote storage - We schedule extra index_part uploads if layer name to be removed was not actually present	2023-08-28 10:51:49 +03:00
Felix Prasanna	7b5489a0bb	compute_ctl: start pg in cgroup for vms (#4920) Starts `postgres` in cgroup directly from `compute_ctl` instead of from `vm-builder`. This is required because the `vm-monitor` cannot be in the cgroup it is managing. Otherwise, it itself would be frozen when freezing the cgroup. Requires https://github.com/neondatabase/cloud/pull/6331, which adds the `AUTOSCALING` environment variable letting `compute_ctl` know to start `postgres` in the cgroup. Requires https://github.com/neondatabase/autoscaling/pull/468, which prevents `vm-builder` from starting the monitor and putting postgres in a cgroup. This will require a `VM_BUILDER_VERSION` bump.	2023-08-25 15:59:12 -04:00
Felix Prasanna	40268dcd8d	monitor: fix filecache calculations (#5112) ## Problem An underflow bug in the filecache calculations. ## Summary of changes Fixed the bug, cleaned up calculations in general.	2023-08-25 13:29:10 -04:00
Vadim Kharitonov	4436c84751	Change codeowners (#5109)	2023-08-25 19:48:16 +03:00
John Spray	b758bf47ca	pageserver: refactor TimelineMetadata serialization in IndexPart (#5091) ## Problem The `metadata_bytes` field of IndexPart required explicit deserialization & error checking everywhere it was used -- there isn't anything special about this structure that should prevent it from being serialized & deserialized along with the rest of the structure. ## Summary of changes - Implement Serialize and Deserialize for TimelineMetadata - Replace IndexPart::metadata_bytes with a simpler `metadata`, that can be used directly. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-08-25 16:16:20 +01:00
Felix Prasanna	024e306f73	monitor: improve logging (#5099)	2023-08-25 10:09:53 -04:00
Alek Westover	f71c82e5de	remove obsolete `need` dependency (#5087)	2023-08-25 09:10:26 -04:00
John Spray	5a217791fd	libs: give TenantTimelineId a compact string serialization The existing derive'd Serialize/Deserialize were not used anywhere. To enable using TenantTimelineId as a key in JSON maps, serialize as a comma separated string. This is also a more compact representation.	2023-08-23 10:33:44 +01:00
John Spray	c9a007d05b	deletion queue: future-proof DeletionList format It needs places to put generation numbers	2023-08-23 10:33:44 +01:00
John Spray	696b49eeba	Update deletion list doc comment for Executor	2023-08-23 09:35:10 +01:00
John Spray	206420d96a	deletion queue: refactor coalescing into Executor	2023-08-23 09:16:55 +01:00
John Spray	416026381f	deletion queue: refactor into frontend/backend modules	2023-08-22 16:38:13 +01:00
John Spray	d9755becab	Update RemoteTimelineClient doc comment	2023-08-22 14:36:57 +01:00
John Spray	9cb255be97	Update pageserver/src/deletion_queue.rs Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-08-22 14:10:11 +01:00
John Spray	57a44dcc01	Update pageserver/src/deletion_queue.rs Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-08-22 14:10:06 +01:00
John Spray	1afc6337fb	Remove unused `num_inprogress_deletions`	2023-08-22 14:06:15 +01:00
John Spray	74058e196a	remote_storage: defensively handle 404 on deletions S3 implementations are _meant_ to return 200 on deleting a nonexistent object, but S3 is not a standard and some implementations have their own ideas.	2023-08-22 13:52:58 +01:00
John Spray	a116f6656f	deletion queue: more consistent use of backoff::retry	2023-08-22 13:38:31 +01:00
John Spray	2c7b97245a	tweak test_remote_storage_upload_queue_retries	2023-08-22 13:34:12 +01:00
John Spray	6efddbf526	flush tweaks	2023-08-22 13:17:57 +01:00
John Spray	7c4d79f4db	deletion queue: cancellable retries	2023-08-22 13:05:04 +01:00
John Spray	8c2ff87f1a	libs: make backoff::retry() take a cancellation token	2023-08-22 12:39:19 +01:00
John Spray	23fc247a03	remove redundant spans	2023-08-22 11:22:51 +01:00
John Spray	d8dc4425f8	Merge remote-tracking branch 'upstream/main' into jcsp/deletion-queue	2023-08-22 10:09:23 +01:00
John Spray	18159b7695	deletion queue: expose errors from push/flush	2023-08-22 10:01:10 +01:00
John Spray	c1bc9c0f70	Various test fixes + tweaks to flushing	2023-08-18 12:44:35 +01:00
John Spray	2de5efa208	Fix broken wait_untils in test_remote_storage_upload_queue_retries	2023-08-18 12:44:35 +01:00
John Spray	d330eac4bc	clippy	2023-08-18 12:44:35 +01:00
John Spray	3ebceeda71	pageserver: refactor timeline args into TimelineResources This sidesteps clippy complaining about function arg counts, and will enable introducing more shared structures in future without the noise of adding extra args to all the functions involved in timeline setup.	2023-08-18 12:44:35 +01:00
John Spray	31729d6f4d	pageserver: refactor tenant args into a structure This way, when we add some new shared structure that the tenants need a reference to, we do not have to add it individually as an extra argument to the various functions.	2023-08-18 12:44:35 +01:00
John Spray	7e0e3517c1	clippy	2023-08-18 12:44:35 +01:00
John Spray	c4fc6e433d	tests: add e2e deletion queue recovery test	2023-08-18 12:44:35 +01:00
John Spray	c36cba28d6	pageserver: generalize flush API	2023-08-18 12:44:35 +01:00
John Spray	8eaa4015de	deletion queue: versions in keys	2023-08-18 12:44:35 +01:00
John Spray	10e927ee3e	Add encoding versions to deletion queue structs	2023-08-18 12:44:35 +01:00
John Spray	bb3a59f275	clippy	2023-08-18 12:44:35 +01:00
John Spray	a0ed43cc12	deletion queue: add DeletionHeader for sequence numbers	2023-08-18 12:44:35 +01:00
John Spray	99dc5a5c27	Deletion queue: implement recovery on startup	2023-08-18 12:44:35 +01:00
John Spray	54db1f5d8a	remote_storage: add a helper for downloading full objects This is only for use with small objects that we will deserialize in a non-streaming way. Also add a strip_prefix method to RemotePath.	2023-08-18 12:44:35 +01:00
John Spray	404b25e45f	Remove vestigial remote_timeline_client deletion paths	2023-08-18 12:44:35 +01:00
John Spray	f4dba9f907	tests: update tenant deletion tests for deletion queue	2023-08-18 12:44:35 +01:00
John Spray	4ec45bc7dc	tests: update tenant deletion tests for deletion queue	2023-08-18 12:44:35 +01:00
John Spray	a00d4a8d8c	tests: update test_remote_timeline_client_calls_started_metric for deletion queue	2023-08-18 12:44:35 +01:00
John Spray	43c9a09d8f	tests: update remote storage test for deletion queue	2023-08-18 12:44:35 +01:00
John Spray	3edd7ece40	deletion queue: improve frontend retry	2023-08-18 12:44:35 +01:00
John Spray	504fe9c2b0	pageserver: send timeline deletions through the deletion queue	2023-08-18 12:44:35 +01:00
John Spray	10df237a81	deletion queue: add push for generic objects (layers and garbage)	2023-08-18 12:44:35 +01:00
John Spray	d40f8475a5	Error metric and retries	2023-08-18 12:44:35 +01:00
John Spray	164f916a40	Spawn deletion workers with info spans	2023-08-18 12:44:35 +01:00
John Spray	4ebc29768c	Add failpoint for deletion execution	2023-08-18 12:44:35 +01:00
John Spray	bae62916dc	pageserver/http: add /v1/deletion_queue/flush_execute This is principally for testing, but might be useful in the field if we want to e.g. flush a deletion queue before running an external scrub tool	2023-08-18 12:44:35 +01:00
John Spray	5e2b8b376c	utils: add ApiError::ShuttingDown So that handlers that check their CancellationToken explicitly can map it to a set http status.	2023-08-18 12:44:35 +01:00
John Spray	54ec7919b8	pageserver: add deletion queue submitted/executed metrics	2023-08-18 12:44:35 +01:00
John Spray	e0bed0732c	Tweak deletion queue constants	2023-08-18 12:44:35 +01:00
John Spray	9e92121cc3	pageserver: flush deletion queue on clean shutdown	2023-08-18 12:44:35 +01:00
John Spray	50a9508f4f	clippy	2023-08-18 12:44:35 +01:00
John Spray	f61402be24	pageserver: testing for deletion queue	2023-08-18 12:44:35 +01:00
John Spray	975e4f2235	Refactor deletion worker construction	2023-08-18 12:44:35 +01:00
John Spray	537eca489e	Implement flush_execute() in deletion queue	2023-08-18 12:44:35 +01:00
John Spray	de4882886e	pageserver: implement batching in deletion queue	2023-08-18 12:44:35 +01:00
John Spray	6982288426	pageserver: implement frontend of deletion queue	2023-08-18 12:44:35 +01:00
John Spray	ccfcfa1098	remote_storage: implement Serialize/Deserialize for RemotePath	2023-08-18 12:44:35 +01:00
John Spray	e2c793c897	Use deletion queue in schedule_layer_file_deletion	2023-08-18 12:44:33 +01:00
John Spray	0fdc492aa4	Add MockDeletionQueue for unit tests	2023-08-18 11:25:40 +01:00
John Spray	787b099541	wire deletion queue into timeline	2023-08-18 11:25:40 +01:00
John Spray	3af693749d	pageserver: wire deletion queue through to Tenant	2023-08-18 11:25:40 +01:00
John Spray	6f9ae6bb5f	pageserver: instantiate deletion queue at process scope	2023-08-18 11:25:40 +01:00
John Spray	16d77dcb73	Initial stub implementation of deletion queue	2023-08-18 11:25:40 +01:00
				`@@ -0,0 +1 @@`
				`-bash: scripts/pytest: No such file or directory`