Compare commits

..

21 Commits

Author SHA1 Message Date
Vlad Lazar
b87ac5b375 pageserver: move things around to prepare for decoding logic
We wish to have high level WAL decoding logic in `wal_decoder::decoder`
module. For this we need the `Value` and `NeonWalRecord` types
accessible there, so:
1. Move `Value` and `NeonWalRecord` to `pageserver_api::value` and
   `pageserver_api::record` respectively. I had to add a testing feature
   to `pageserver_api` to get this working due to `NeonWalRecord` test
   directives.
2. Get rid of `pageserver::repository` (follow up from (1))
3. Move PG specific WAL record types to `postgres_ffi::record`. In
   theory they could live in `wal_decoder`, but it would create a
   circular dependency between `wal_decoder` and `postgres_ffi`.
   Long term it makes sense for those types to be PG version specific,
   so that will work out nicely.
4. Move higher level WAL record types (to be ingested by pageserver)
   into `wal_decoder::models`
2024-10-24 17:46:21 +02:00
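
For orientation, a minimal sketch of where the moved types are imported from after this commit. The import paths are taken from the diffs below; the small helper function is purely illustrative.

    // Where the relocated types now live (see the diffs further down):
    use pageserver_api::record::NeonWalRecord;    // formerly under pageserver::repository
    use pageserver_api::value::Value;             // formerly under pageserver::repository
    use postgres_ffi::record::XlXactParsedRecord; // formerly postgres_ffi::walrecord

    // Illustrative only: a Value is either a full page image or a
    // NeonWalRecord delta applied on top of an earlier image.
    fn is_delta(value: &Value) -> bool {
        matches!(value, Value::WalRecord(_))
    }
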
Vlad Lazar
56101531c0 review: add a comment for the types 2024-10-24 15:51:09 +02:00
Vlad Lazar
b08901c4aa pageserver/walingest: fix failpoint name typo 2024-10-24 15:51:09 +02:00
Vlad Lazar
dd981af5b5 ci: format python code 2024-10-24 15:51:09 +02:00
Vlad Lazar
b6d2c23605 review: rename failpoint 2024-10-24 15:51:09 +02:00
Vlad Lazar
de5ecdc6b8 review: parse logical message prefix on decode 2024-10-24 15:51:09 +02:00
Vlad Lazar
31434f1cb1 review: allow for multiple types of smgr truncation 2024-10-24 15:51:09 +02:00
Vlad Lazar
36a3999c80 pageserver: rename ingest functions
The goal of this commit is to make it clearer when we are ingesting
the whole record versus when we are ingesting an action for the record.

I also merged the VM bit clearing into one function, since the two code
paths were exactly the same.
2024-10-24 15:51:09 +02:00
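
A rough sketch of the kind of merge described above; the helper name, signature, and constant are assumptions for illustration, not taken from the diff.

    use pageserver_api::reltag::{BlockNumber, RelTag};

    /// Hypothetical: the old-page and new-page paths cleared VM bits with
    /// identical logic, so a single helper can serve both call sites.
    fn vm_bits_to_clear(
        vm_rel: RelTag,
        old_heap_blkno: Option<BlockNumber>,
        new_heap_blkno: Option<BlockNumber>,
    ) -> Vec<(RelTag, BlockNumber)> {
        // Assumed mapping factor from heap block to visibility-map block.
        const HEAPBLOCKS_PER_VM_PAGE: BlockNumber = 32672;
        old_heap_blkno
            .into_iter()
            .chain(new_heap_blkno)
            .map(|blkno| (vm_rel.clone(), blkno / HEAPBLOCKS_PER_VM_PAGE))
            .collect()
    }
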
Vlad Lazar
69f03617b0 pageserver: wrap all special records in an enum
This will give us a nice evolution path when we want to add
new actions for a record.
2024-10-24 15:51:09 +02:00
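
The shape of the change, as a hypothetical sketch; the type and variant names are illustrative placeholders, not the definitions from this branch.

    // Placeholders standing in for the decoded record payloads.
    struct XactRecord;
    struct MultiXactRecord;
    struct RelmapRecord;

    // One wrapper enum instead of loose per-record code paths: adding a new
    // action for a record kind means adding a variant and a match arm.
    enum SpecialRecord {
        Xact(XactRecord),
        MultiXact(MultiXactRecord),
        Relmap(RelmapRecord),
    }

    fn ingest(record: SpecialRecord) {
        match record {
            SpecialRecord::Xact(_) => { /* commit / abort handling */ }
            SpecialRecord::MultiXact(_) => { /* create / truncate handling */ }
            SpecialRecord::Relmap(_) => { /* relmap update */ }
        }
    }
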
Vlad Lazar
828ecbd1ac pageserver: refactor replorigin record 2024-10-24 15:51:09 +02:00
Vlad Lazar
59d8016c5a pageserver: refactor standby record 2024-10-24 15:51:09 +02:00
Vlad Lazar
1836cb23ff pageserver: refactor logical message record 2024-10-24 15:51:09 +02:00
Vlad Lazar
d33bb53a29 pageserver: refactor xlog record
This is an odd one. It requires the current checkpoint value to decide
what to do, and that can't trivially be moved to the safekeeper. It would
be possible with a protocol change, but we're deferring that decision for
now. Hence, send the raw record and let the pageserver figure it out.
2024-10-24 15:51:09 +02:00
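
A hedged sketch of what "send the raw record" could look like; the struct and field names are assumptions, not taken from the diff.

    /// Assumed shape: the record is forwarded undecoded, and the pageserver,
    /// which holds the current checkpoint state, interprets it on ingest.
    struct RawXlogRecord {
        info: u8,      // xl_info, e.g. distinguishing shutdown vs online checkpoints
        lsn: u64,      // an Lsn newtype in the real code
        data: Vec<u8>, // undecoded record body
    }
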
Vlad Lazar
d3ad5dc0db pageserver: refactor relmap record 2024-10-24 15:51:09 +02:00
Vlad Lazar
80f6d3909c pageserver: refactor multixact records 2024-10-24 15:51:09 +02:00
Vlad Lazar
0ca177cbd6 pageserver: refactor xact records
This one is a bit less obvious than the previous ones.

I merged some of the logic that was previously in
`WalIngest::ingest_record` into `WalIngest::ingest_xact_record`.
2024-10-24 15:51:09 +02:00
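
Roughly the merged flow, as an illustrative sketch; only `XactRecord::Commit` is echoed from the module docs further down, everything else is hypothetical.

    struct XlXactParsedRecord; // placeholder for the decoded commit/abort payload

    enum XactRecord {
        Commit(XlXactParsedRecord),
        Abort(XlXactParsedRecord),
    }

    // The commit/abort branching that used to live in ingest_record now
    // happens inside ingest_xact_record itself.
    fn ingest_xact_record(record: XactRecord) {
        match record {
            XactRecord::Commit(_) => { /* set CLOG bits to committed, handle subxacts */ }
            XactRecord::Abort(_) => { /* set CLOG bits to aborted */ }
        }
    }
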
Vlad Lazar
45e8424dba pageserver: refactor clog records 2024-10-24 15:51:09 +02:00
Vlad Lazar
7136d0a192 pageserver: refactor dbase records 2024-10-24 15:51:09 +02:00
Vlad Lazar
48947e042b pageserver: refactor smgr records 2024-10-24 15:51:09 +02:00
Vlad Lazar
d4fd4e2d9a pageserver: refactor neonrmgr records 2024-10-24 15:51:09 +02:00
Vlad Lazar
da39155590 pageserver: refactor heapam records 2024-10-24 15:51:09 +02:00
123 changed files with 1752 additions and 4948 deletions


@@ -53,6 +53,20 @@ jobs:
BUILD_TAG: ${{ inputs.build-tag }}
steps:
- name: Fix git ownership
run: |
# Workaround for `fatal: detected dubious ownership in repository at ...`
#
# Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
# Ref https://github.com/actions/checkout/issues/785
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
for r in 14 15 16 17; do
git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- uses: actions/checkout@v4
with:
submodules: true


@@ -671,10 +671,6 @@ jobs:
password: ${{ secrets.NEON_DOCKERHUB_PASSWORD }}
options: --init
# Increase timeout to 12h, default timeout is 6h
# we have a regression in clickbench causing it to run 2-3x longer
timeout-minutes: 720
steps:
- uses: actions/checkout@v4
@@ -720,7 +716,7 @@ jobs:
test_selection: performance/test_perf_olap.py
run_in_parallel: false
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 43200 -k test_clickbench
extra_params: -m remote_cluster --timeout 21600 -k test_clickbench
pg_version: ${{ env.DEFAULT_PG_VERSION }}
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"


@@ -839,7 +839,6 @@ jobs:
- name: Build vm image
run: |
./vm-builder \
-size=2G \
-spec=compute/vm-image-spec-${{ matrix.version.debian }}.yaml \
-src=neondatabase/compute-node-${{ matrix.version.pg }}:${{ needs.tag.outputs.build-tag }} \
-dst=neondatabase/vm-compute-node-${{ matrix.version.pg }}:${{ needs.tag.outputs.build-tag }}
@@ -1079,6 +1078,20 @@ jobs:
runs-on: [ self-hosted, small ]
container: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/ansible:latest
steps:
- name: Fix git ownership
run: |
# Workaround for `fatal: detected dubious ownership in repository at ...`
#
# Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
# Ref https://github.com/actions/checkout/issues/785
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
for r in 14 15 16 17; do
git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- uses: actions/checkout@v4
- name: Trigger deploy workflow
@@ -1117,10 +1130,7 @@ jobs:
gh workflow --repo neondatabase/infra run deploy-proxy-prod.yml --ref main \
-f deployPgSniRouter=true \
-f deployProxyLink=true \
-f deployPrivatelinkProxy=true \
-f deployProxyScram=true \
-f deployProxyAuthBroker=true \
-f deployProxy=true \
-f branch=main \
-f dockerTag=${{needs.tag.outputs.build-tag}}
else

.gitignore vendored

@@ -6,8 +6,6 @@ __pycache__/
test_output/
.vscode
.idea
*.swp
tags
neon.iml
/.neon
/integration_tests/.neon

Cargo.lock generated

@@ -6274,7 +6274,7 @@ dependencies = [
[[package]]
name = "tokio-epoll-uring"
version = "0.1.0"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#33e00106a268644d02ba0461bbd64476073b0ee1"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#08ccfa94ff5507727bf4d8d006666b5b192e04c6"
dependencies = [
"futures",
"nix 0.26.4",
@@ -6790,7 +6790,7 @@ dependencies = [
[[package]]
name = "uring-common"
version = "0.1.0"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#33e00106a268644d02ba0461bbd64476073b0ee1"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#08ccfa94ff5507727bf4d8d006666b5b192e04c6"
dependencies = [
"bytes",
"io-uring",
@@ -6967,7 +6967,6 @@ dependencies = [
"serde",
"tracing",
"utils",
"workspace_hack",
]
[[package]]


@@ -666,7 +666,7 @@ RUN apt-get update && \
#
# Use new version only for v17
# because Release_2024_09_1 has some backward incompatible changes
# https://github.com/rdkit/rdkit/releases/tag/Release_2024_09_1
# https://github.com/rdkit/rdkit/releases/tag/Release_2024_09_1
ENV PATH="/usr/local/pgsql/bin/:/usr/local/pgsql/:$PATH"
RUN case "${PG_VERSION}" in \
"v17") \
@@ -860,98 +860,18 @@ ENV PATH="/home/nonroot/.cargo/bin:/usr/local/pgsql/bin/:$PATH"
USER nonroot
WORKDIR /home/nonroot
RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && \
RUN case "${PG_VERSION}" in "v17") \
echo "v17 is not supported yet by pgrx. Quit" && exit 0;; \
esac && \
curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && \
chmod +x rustup-init && \
./rustup-init -y --no-modify-path --profile minimal --default-toolchain stable && \
rm rustup-init && \
case "${PG_VERSION}" in \
'v17') \
echo 'v17 is not supported yet by pgrx. Quit' && exit 0;; \
esac && \
cargo install --locked --version 0.11.3 cargo-pgrx && \
/bin/bash -c 'cargo pgrx init --pg${PG_VERSION:1}=/usr/local/pgsql/bin/pg_config'
USER root
#########################################################################################
#
# Layer "rust extensions pgrx12"
#
# pgrx started to support Postgres 17 since version 12,
# but some older extension aren't compatible with it.
# This layer should be used as a base for new pgrx extensions,
# and eventually get merged with `rust-extensions-build`
#
#########################################################################################
FROM build-deps AS rust-extensions-build-pgrx12
ARG PG_VERSION
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN apt-get update && \
apt-get install --no-install-recommends -y curl libclang-dev && \
useradd -ms /bin/bash nonroot -b /home
ENV HOME=/home/nonroot
ENV PATH="/home/nonroot/.cargo/bin:/usr/local/pgsql/bin/:$PATH"
USER nonroot
WORKDIR /home/nonroot
RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && \
chmod +x rustup-init && \
./rustup-init -y --no-modify-path --profile minimal --default-toolchain stable && \
rm rustup-init && \
cargo install --locked --version 0.12.6 cargo-pgrx && \
/bin/bash -c 'cargo pgrx init --pg${PG_VERSION:1}=/usr/local/pgsql/bin/pg_config'
USER root
#########################################################################################
#
# Layers "pg-onnx-build" and "pgrag-pg-build"
# Compile "pgrag" extensions
#
#########################################################################################
FROM rust-extensions-build-pgrx12 AS pg-onnx-build
# cmake 3.26 or higher is required, so installing it using pip (bullseye-backports has cmake 3.25).
# Install it using virtual environment, because Python 3.11 (the default version on Debian 12 (Bookworm)) complains otherwise
RUN apt-get update && apt-get install -y python3 python3-pip python3-venv && \
python3 -m venv venv && \
. venv/bin/activate && \
python3 -m pip install cmake==3.30.5 && \
wget https://github.com/microsoft/onnxruntime/archive/refs/tags/v1.18.1.tar.gz -O onnxruntime.tar.gz && \
mkdir onnxruntime-src && cd onnxruntime-src && tar xzf ../onnxruntime.tar.gz --strip-components=1 -C . && \
./build.sh --config Release --parallel --skip_submodule_sync --skip_tests --allow_running_as_root
FROM pg-onnx-build AS pgrag-pg-build
RUN apt-get install -y protobuf-compiler && \
wget https://github.com/neondatabase-labs/pgrag/archive/refs/tags/v0.0.0.tar.gz -O pgrag.tar.gz && \
echo "2cbe394c1e74fc8bcad9b52d5fbbfb783aef834ca3ce44626cfd770573700bb4 pgrag.tar.gz" | sha256sum --check && \
mkdir pgrag-src && cd pgrag-src && tar xzf ../pgrag.tar.gz --strip-components=1 -C . && \
\
cd exts/rag && \
sed -i 's/pgrx = "0.12.6"/pgrx = { version = "0.12.6", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
cargo pgrx install --release && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/rag.control && \
\
cd ../rag_bge_small_en_v15 && \
sed -i 's/pgrx = "0.12.6"/pgrx = { version = "0.12.6", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
ORT_LIB_LOCATION=/home/nonroot/onnxruntime-src/build/Linux \
REMOTE_ONNX_URL=http://pg-ext-s3-gateway/pgrag-data/bge_small_en_v15.onnx \
cargo pgrx install --release --features remote_onnx && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/rag_bge_small_en_v15.control && \
\
cd ../rag_jina_reranker_v1_tiny_en && \
sed -i 's/pgrx = "0.12.6"/pgrx = { version = "0.12.6", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
ORT_LIB_LOCATION=/home/nonroot/onnxruntime-src/build/Linux \
REMOTE_ONNX_URL=http://pg-ext-s3-gateway/pgrag-data/jina_reranker_v1_tiny_en.onnx \
cargo pgrx install --release --features remote_onnx && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/rag_jina_reranker_v1_tiny_en.control
#########################################################################################
#
# Layer "pg-jsonschema-pg-build"
@@ -1121,31 +1041,6 @@ RUN wget https://github.com/pgpartman/pg_partman/archive/refs/tags/v5.1.0.tar.gz
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pg_partman.control
#########################################################################################
#
# Layer "pg_mooncake"
# compile pg_mooncake extension
#
#########################################################################################
FROM rust-extensions-build AS pg-mooncake-build
ARG PG_VERSION
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PG_MOONCAKE_VERSION=882175dbba07ba2e6e59b1088d61bf325b910b9e
ENV PATH="/usr/local/pgsql/bin/:$PATH"
RUN case "${PG_VERSION}" in \
'v14') \
echo "pg_mooncake is not supported on Postgres ${PG_VERSION}" && exit 0;; \
esac && \
git clone --depth 1 --branch neon https://github.com/kelvich/pg_mooncake.git pg_mooncake-src && \
cd pg_mooncake-src && \
git checkout "${PG_MOONCAKE_VERSION}" && \
git submodule update --init --depth 1 --recursive && \
make BUILD_TYPE=release -j $(getconf _NPROCESSORS_ONLN) && \
make BUILD_TYPE=release -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pg_mooncake.control
#########################################################################################
#
# Layer "neon-pg-ext-build"
@@ -1164,7 +1059,6 @@ COPY --from=h3-pg-build /h3/usr /
COPY --from=unit-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=vector-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pgjwt-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pgrag-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-jsonschema-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-graphql-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-tiktoken-pg-build /usr/local/pgsql/ /usr/local/pgsql/
@@ -1190,7 +1084,6 @@ COPY --from=wal2json-pg-build /usr/local/pgsql /usr/local/pgsql
COPY --from=pg-anon-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-ivm-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-partman-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-mooncake-build /usr/local/pgsql/ /usr/local/pgsql/
COPY pgxn/ pgxn/
RUN make -j $(getconf _NPROCESSORS_ONLN) \
@@ -1354,7 +1247,6 @@ COPY --from=unit-pg-build /postgresql-unit.tar.gz /ext-src/
COPY --from=vector-pg-build /pgvector.tar.gz /ext-src/
COPY --from=vector-pg-build /pgvector.patch /ext-src/
COPY --from=pgjwt-pg-build /pgjwt.tar.gz /ext-src
#COPY --from=pgrag-pg-build /usr/local/pgsql/ /usr/local/pgsql/
#COPY --from=pg-jsonschema-pg-build /home/nonroot/pg_jsonschema.tar.gz /ext-src
#COPY --from=pg-graphql-pg-build /home/nonroot/pg_graphql.tar.gz /ext-src
#COPY --from=pg-tiktoken-pg-build /home/nonroot/pg_tiktoken.tar.gz /ext-src


@@ -18,7 +18,7 @@ commands:
- name: pgbouncer
user: postgres
sysvInitAction: respawn
shell: '/usr/local/bin/pgbouncer /etc/pgbouncer.ini 2>&1 > /dev/virtio-ports/tech.neon.log.0'
shell: '/usr/local/bin/pgbouncer /etc/pgbouncer.ini'
- name: local_proxy
user: postgres
sysvInitAction: respawn


@@ -18,7 +18,7 @@ commands:
- name: pgbouncer
user: postgres
sysvInitAction: respawn
shell: '/usr/local/bin/pgbouncer /etc/pgbouncer.ini 2>&1 > /dev/virtio-ports/tech.neon.log.0'
shell: '/usr/local/bin/pgbouncer /etc/pgbouncer.ini'
- name: local_proxy
user: postgres
sysvInitAction: respawn


@@ -1073,10 +1073,10 @@ async fn handle_tenant(subcmd: &TenantCmd, env: &mut local_env::LocalEnv) -> any
tenant_id,
TimelineCreateRequest {
new_timeline_id,
mode: pageserver_api::models::TimelineCreateRequestMode::Bootstrap {
existing_initdb_timeline_id: None,
pg_version: Some(args.pg_version),
},
ancestor_timeline_id: None,
ancestor_start_lsn: None,
existing_initdb_timeline_id: None,
pg_version: Some(args.pg_version),
},
)
.await?;
@@ -1133,10 +1133,10 @@ async fn handle_timeline(cmd: &TimelineCmd, env: &mut local_env::LocalEnv) -> Re
let storage_controller = StorageController::from_env(env);
let create_req = TimelineCreateRequest {
new_timeline_id,
mode: pageserver_api::models::TimelineCreateRequestMode::Bootstrap {
existing_initdb_timeline_id: None,
pg_version: Some(args.pg_version),
},
ancestor_timeline_id: None,
existing_initdb_timeline_id: None,
ancestor_start_lsn: None,
pg_version: Some(args.pg_version),
};
let timeline_info = storage_controller
.tenant_timeline_create(tenant_id, create_req)
@@ -1189,11 +1189,10 @@ async fn handle_timeline(cmd: &TimelineCmd, env: &mut local_env::LocalEnv) -> Re
let storage_controller = StorageController::from_env(env);
let create_req = TimelineCreateRequest {
new_timeline_id,
mode: pageserver_api::models::TimelineCreateRequestMode::Branch {
ancestor_timeline_id,
ancestor_start_lsn: start_lsn,
pg_version: None,
},
ancestor_timeline_id: Some(ancestor_timeline_id),
existing_initdb_timeline_id: None,
ancestor_start_lsn: start_lsn,
pg_version: None,
};
let timeline_info = storage_controller
.tenant_timeline_create(tenant_id, create_req)


@@ -17,7 +17,7 @@ use std::time::Duration;
use anyhow::{bail, Context};
use camino::Utf8PathBuf;
use pageserver_api::models::{self, TenantInfo, TimelineInfo};
use pageserver_api::models::{self, AuxFilePolicy, TenantInfo, TimelineInfo};
use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api;
use postgres_backend::AuthType;
@@ -399,6 +399,11 @@ impl PageServerNode {
.map(serde_json::from_str)
.transpose()
.context("parse `timeline_get_throttle` from json")?,
switch_aux_file_policy: settings
.remove("switch_aux_file_policy")
.map(|x| x.parse::<AuxFilePolicy>())
.transpose()
.context("Failed to parse 'switch_aux_file_policy'")?,
lsn_lease_length: settings.remove("lsn_lease_length").map(|x| x.to_string()),
lsn_lease_length_for_ts: settings
.remove("lsn_lease_length_for_ts")
@@ -494,6 +499,11 @@ impl PageServerNode {
.map(serde_json::from_str)
.transpose()
.context("parse `timeline_get_throttle` from json")?,
switch_aux_file_policy: settings
.remove("switch_aux_file_policy")
.map(|x| x.parse::<AuxFilePolicy>())
.transpose()
.context("Failed to parse 'switch_aux_file_policy'")?,
lsn_lease_length: settings.remove("lsn_lease_length").map(|x| x.to_string()),
lsn_lease_length_for_ts: settings
.remove("lsn_lease_length_for_ts")
@@ -519,6 +529,28 @@ impl PageServerNode {
Ok(self.http_client.list_timelines(*tenant_shard_id).await?)
}
pub async fn timeline_create(
&self,
tenant_shard_id: TenantShardId,
new_timeline_id: TimelineId,
ancestor_start_lsn: Option<Lsn>,
ancestor_timeline_id: Option<TimelineId>,
pg_version: Option<u32>,
existing_initdb_timeline_id: Option<TimelineId>,
) -> anyhow::Result<TimelineInfo> {
let req = models::TimelineCreateRequest {
new_timeline_id,
ancestor_start_lsn,
ancestor_timeline_id,
pg_version,
existing_initdb_timeline_id,
};
Ok(self
.http_client
.timeline_create(tenant_shard_id, &req)
.await?)
}
/// Import a basebackup prepared using either:
/// a) `pg_basebackup -F tar`, or
/// b) The `fullbackup` pageserver endpoint


@@ -111,11 +111,6 @@ enum Command {
#[arg(long)]
node: NodeId,
},
/// Cancel any ongoing reconciliation for this shard
TenantShardCancelReconcile {
#[arg(long)]
tenant_shard_id: TenantShardId,
},
/// Modify the pageserver tenant configuration of a tenant: this is the configuration structure
/// that is passed through to pageservers, and does not affect storage controller behavior.
TenantConfig {
@@ -540,15 +535,6 @@ async fn main() -> anyhow::Result<()> {
)
.await?;
}
Command::TenantShardCancelReconcile { tenant_shard_id } => {
storcon_client
.dispatch::<(), ()>(
Method::PUT,
format!("control/v1/tenant/{tenant_shard_id}/cancel_reconcile"),
None,
)
.await?;
}
Command::TenantConfig { tenant_id, config } => {
let tenant_conf = serde_json::from_str(&config)?;


@@ -19,7 +19,6 @@ use once_cell::sync::Lazy;
use prometheus::core::{
Atomic, AtomicU64, Collector, GenericCounter, GenericCounterVec, GenericGauge, GenericGaugeVec,
};
pub use prometheus::local::LocalHistogram;
pub use prometheus::opts;
pub use prometheus::register;
pub use prometheus::Error;


@@ -250,6 +250,12 @@ pub struct TenantConfigToml {
// Expressed in multiples of checkpoint distance.
pub image_layer_creation_check_threshold: u8,
/// Switch to a new aux file policy. Switching this flag requires that the user has not written any aux files into
/// the storage before, and the flag cannot be switched back. Otherwise there will be data corruption.
/// There is a `last_aux_file_policy` flag which gets persisted in `index_part.json` once the first aux
/// file is written.
pub switch_aux_file_policy: crate::models::AuxFilePolicy,
/// The length for an explicit LSN lease request.
/// Layers needed to reconstruct pages at LSN will not be GC-ed during this interval.
#[serde(with = "humantime_serde")]
@@ -469,6 +475,7 @@ impl Default for TenantConfigToml {
lazy_slru_download: false,
timeline_get_throttle: crate::models::ThrottleConfig::disabled(),
image_layer_creation_check_threshold: DEFAULT_IMAGE_LAYER_CREATION_CHECK_THRESHOLD,
switch_aux_file_policy: crate::models::AuxFilePolicy::default_tenant_config(),
lsn_lease_length: LsnLease::DEFAULT_LENGTH,
lsn_lease_length_for_ts: LsnLease::DEFAULT_LENGTH_FOR_TS,
}


@@ -10,6 +10,7 @@ use std::{
io::{BufRead, Read},
num::{NonZeroU32, NonZeroU64, NonZeroUsize},
str::FromStr,
sync::atomic::AtomicUsize,
time::{Duration, SystemTime},
};
@@ -210,30 +211,13 @@ pub enum TimelineState {
#[derive(Serialize, Deserialize, Clone)]
pub struct TimelineCreateRequest {
pub new_timeline_id: TimelineId,
#[serde(flatten)]
pub mode: TimelineCreateRequestMode,
}
#[derive(Serialize, Deserialize, Clone)]
#[serde(untagged)]
pub enum TimelineCreateRequestMode {
Branch {
ancestor_timeline_id: TimelineId,
#[serde(default)]
ancestor_start_lsn: Option<Lsn>,
// TODO: cplane sets this, but, the branching code always
// inherits the ancestor's pg_version. Earlier code wasn't
// using a flattened enum, so, it was an accepted field, and
// we continue to accept it by having it here.
pg_version: Option<u32>,
},
// NB: Bootstrap is all-optional, and thus the serde(untagged) will cause serde to stop at Bootstrap.
// (serde picks the first matching enum variant, in declaration order).
Bootstrap {
#[serde(default)]
existing_initdb_timeline_id: Option<TimelineId>,
pg_version: Option<u32>,
},
#[serde(default)]
pub ancestor_timeline_id: Option<TimelineId>,
#[serde(default)]
pub existing_initdb_timeline_id: Option<TimelineId>,
#[serde(default)]
pub ancestor_start_lsn: Option<Lsn>,
pub pg_version: Option<u32>,
}
#[derive(Serialize, Deserialize, Clone)]
@@ -308,6 +292,7 @@ pub struct TenantConfig {
pub lazy_slru_download: Option<bool>,
pub timeline_get_throttle: Option<ThrottleConfig>,
pub image_layer_creation_check_threshold: Option<u8>,
pub switch_aux_file_policy: Option<AuxFilePolicy>,
pub lsn_lease_length: Option<String>,
pub lsn_lease_length_for_ts: Option<String>,
}
@@ -348,6 +333,68 @@ pub enum AuxFilePolicy {
CrossValidation,
}
impl AuxFilePolicy {
pub fn is_valid_migration_path(from: Option<Self>, to: Self) -> bool {
matches!(
(from, to),
(None, _) | (Some(AuxFilePolicy::CrossValidation), AuxFilePolicy::V2)
)
}
/// If a tenant writes aux files without setting `switch_aux_file_policy`, this value will be used.
pub fn default_tenant_config() -> Self {
Self::V2
}
}
/// The aux file policy memory flag. Users can store `Option<AuxFilePolicy>` into this atomic flag. 0 == unspecified.
pub struct AtomicAuxFilePolicy(AtomicUsize);
impl AtomicAuxFilePolicy {
pub fn new(policy: Option<AuxFilePolicy>) -> Self {
Self(AtomicUsize::new(
policy.map(AuxFilePolicy::to_usize).unwrap_or_default(),
))
}
pub fn load(&self) -> Option<AuxFilePolicy> {
match self.0.load(std::sync::atomic::Ordering::Acquire) {
0 => None,
other => Some(AuxFilePolicy::from_usize(other)),
}
}
pub fn store(&self, policy: Option<AuxFilePolicy>) {
self.0.store(
policy.map(AuxFilePolicy::to_usize).unwrap_or_default(),
std::sync::atomic::Ordering::Release,
);
}
}
impl AuxFilePolicy {
pub fn to_usize(self) -> usize {
match self {
Self::V1 => 1,
Self::CrossValidation => 2,
Self::V2 => 3,
}
}
pub fn try_from_usize(this: usize) -> Option<Self> {
match this {
1 => Some(Self::V1),
2 => Some(Self::CrossValidation),
3 => Some(Self::V2),
_ => None,
}
}
pub fn from_usize(this: usize) -> Self {
Self::try_from_usize(this).unwrap()
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(tag = "kind")]
pub enum EvictionPolicy {
@@ -1004,12 +1051,6 @@ pub mod virtual_file {
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ScanDisposableKeysResponse {
pub disposable_count: usize,
pub not_disposable_count: usize,
}
// Wrapped in libpq CopyData
#[derive(PartialEq, Eq, Debug)]
pub enum PagestreamFeMessage {
@@ -1569,6 +1610,71 @@ mod tests {
}
}
#[test]
fn test_aux_file_migration_path() {
assert!(AuxFilePolicy::is_valid_migration_path(
None,
AuxFilePolicy::V1
));
assert!(AuxFilePolicy::is_valid_migration_path(
None,
AuxFilePolicy::V2
));
assert!(AuxFilePolicy::is_valid_migration_path(
None,
AuxFilePolicy::CrossValidation
));
// Self-migration is not a valid migration path, and the caller should handle it by itself.
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::V1),
AuxFilePolicy::V1
));
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::V2),
AuxFilePolicy::V2
));
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::CrossValidation),
AuxFilePolicy::CrossValidation
));
// Migrations not allowed
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::CrossValidation),
AuxFilePolicy::V1
));
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::V1),
AuxFilePolicy::V2
));
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::V2),
AuxFilePolicy::V1
));
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::V2),
AuxFilePolicy::CrossValidation
));
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::V1),
AuxFilePolicy::CrossValidation
));
// Migrations allowed
assert!(AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::CrossValidation),
AuxFilePolicy::V2
));
}
#[test]
fn test_aux_parse() {
assert_eq!(AuxFilePolicy::from_str("V2").unwrap(), AuxFilePolicy::V2);
assert_eq!(AuxFilePolicy::from_str("v2").unwrap(), AuxFilePolicy::V2);
assert_eq!(
AuxFilePolicy::from_str("cross-validation").unwrap(),
AuxFilePolicy::CrossValidation
);
}
#[test]
fn test_image_compression_algorithm_parsing() {
use ImageCompressionAlgorithm::*;


@@ -1,7 +1,7 @@
//! This module defines the WAL record format used within the pageserver.
use bytes::Bytes;
use postgres_ffi::walrecord::{describe_postgres_wal_record, MultiXactMember};
use postgres_ffi::record::{describe_postgres_wal_record, MultiXactMember};
use postgres_ffi::{MultiXactId, MultiXactOffset, TimestampTz, TransactionId};
use serde::{Deserialize, Serialize};
use utils::bin_ser::DeserializeError;


@@ -1,11 +1,4 @@
//! This module defines the value type used by the storage engine.
//!
//! A [`Value`] represents either a completely new value for one Key ([`Value::Image`]),
//! or a "delta" of how to get from previous version of the value to the new one
//! ([`Value::WalRecord`]])
//!
//! Note that the [`Value`] type is used for the permanent storage format, so any
//! changes to it must be backwards compatible.
use crate::record::NeonWalRecord;
use bytes::Bytes;
@@ -23,12 +16,10 @@ pub enum Value {
}
impl Value {
#[inline(always)]
pub fn is_image(&self) -> bool {
matches!(self, Value::Image(_))
}
#[inline(always)]
pub fn will_init(&self) -> bool {
match self {
Value::Image(_) => true,
@@ -48,7 +39,6 @@ pub enum InvalidInput {
pub struct ValueBytes;
impl ValueBytes {
#[inline(always)]
pub fn will_init(raw: &[u8]) -> Result<bool, InvalidInput> {
if raw.len() < 12 {
return Err(InvalidInput::TooShortValue);


@@ -216,8 +216,8 @@ macro_rules! enum_pgversion {
}
pub mod pg_constants;
pub mod record;
pub mod relfile_utils;
pub mod walrecord;
// Export some widely used datatypes that are unlikely to change across Postgres versions
pub use v14::bindings::RepOriginId;


@@ -15,6 +15,125 @@ use serde::{Deserialize, Serialize};
use utils::bin_ser::DeserializeError;
use utils::lsn::Lsn;
///
/// Note: Parsing some fields is missing, because they're not needed.
///
/// This is similar to the xl_xact_parsed_commit and
/// xl_xact_parsed_abort structs in PostgreSQL, but we use the same
/// struct for commits and aborts.
///
#[derive(Debug)]
pub struct XlXactParsedRecord {
pub xid: TransactionId,
pub info: u8,
pub xact_time: TimestampTz,
pub xinfo: u32,
pub db_id: Oid,
/* MyDatabaseId */
pub ts_id: Oid,
/* MyDatabaseTableSpace */
pub subxacts: Vec<TransactionId>,
pub xnodes: Vec<RelFileNode>,
pub origin_lsn: Lsn,
}
impl XlXactParsedRecord {
/// Decode a XLOG_XACT_COMMIT/ABORT/COMMIT_PREPARED/ABORT_PREPARED
/// record. This should agree with the ParseCommitRecord and ParseAbortRecord
/// functions in PostgreSQL (in src/backend/access/rmgr/xactdesc.c)
pub fn decode(buf: &mut Bytes, mut xid: TransactionId, xl_info: u8) -> XlXactParsedRecord {
let info = xl_info & pg_constants::XLOG_XACT_OPMASK;
// The record starts with time of commit/abort
let xact_time = buf.get_i64_le();
let xinfo = if xl_info & pg_constants::XLOG_XACT_HAS_INFO != 0 {
buf.get_u32_le()
} else {
0
};
let db_id;
let ts_id;
if xinfo & pg_constants::XACT_XINFO_HAS_DBINFO != 0 {
db_id = buf.get_u32_le();
ts_id = buf.get_u32_le();
} else {
db_id = 0;
ts_id = 0;
}
let mut subxacts = Vec::<TransactionId>::new();
if xinfo & pg_constants::XACT_XINFO_HAS_SUBXACTS != 0 {
let nsubxacts = buf.get_i32_le();
for _i in 0..nsubxacts {
let subxact = buf.get_u32_le();
subxacts.push(subxact);
}
}
let mut xnodes = Vec::<RelFileNode>::new();
if xinfo & pg_constants::XACT_XINFO_HAS_RELFILENODES != 0 {
let nrels = buf.get_i32_le();
for _i in 0..nrels {
let spcnode = buf.get_u32_le();
let dbnode = buf.get_u32_le();
let relnode = buf.get_u32_le();
tracing::trace!(
"XLOG_XACT_COMMIT relfilenode {}/{}/{}",
spcnode,
dbnode,
relnode
);
xnodes.push(RelFileNode {
spcnode,
dbnode,
relnode,
});
}
}
if xinfo & crate::v15::bindings::XACT_XINFO_HAS_DROPPED_STATS != 0 {
let nitems = buf.get_i32_le();
tracing::debug!(
"XLOG_XACT_COMMIT-XACT_XINFO_HAS_DROPPED_STAT nitems {}",
nitems
);
let sizeof_xl_xact_stats_item = 12;
buf.advance((nitems * sizeof_xl_xact_stats_item).try_into().unwrap());
}
if xinfo & pg_constants::XACT_XINFO_HAS_INVALS != 0 {
let nmsgs = buf.get_i32_le();
let sizeof_shared_invalidation_message = 16;
buf.advance(
(nmsgs * sizeof_shared_invalidation_message)
.try_into()
.unwrap(),
);
}
if xinfo & pg_constants::XACT_XINFO_HAS_TWOPHASE != 0 {
xid = buf.get_u32_le();
tracing::debug!("XLOG_XACT_COMMIT-XACT_XINFO_HAS_TWOPHASE xid {}", xid);
}
let origin_lsn = if xinfo & pg_constants::XACT_XINFO_HAS_ORIGIN != 0 {
Lsn(buf.get_u64_le())
} else {
Lsn::INVALID
};
XlXactParsedRecord {
xid,
info,
xact_time,
xinfo,
db_id,
ts_id,
subxacts,
xnodes,
origin_lsn,
}
}
}
#[repr(C)]
#[derive(Debug)]
pub struct XlMultiXactCreate {
@@ -977,125 +1096,6 @@ impl XlDropDatabase {
}
}
///
/// Note: Parsing some fields is missing, because they're not needed.
///
/// This is similar to the xl_xact_parsed_commit and
/// xl_xact_parsed_abort structs in PostgreSQL, but we use the same
/// struct for commits and aborts.
///
#[derive(Debug)]
pub struct XlXactParsedRecord {
pub xid: TransactionId,
pub info: u8,
pub xact_time: TimestampTz,
pub xinfo: u32,
pub db_id: Oid,
/* MyDatabaseId */
pub ts_id: Oid,
/* MyDatabaseTableSpace */
pub subxacts: Vec<TransactionId>,
pub xnodes: Vec<RelFileNode>,
pub origin_lsn: Lsn,
}
impl XlXactParsedRecord {
/// Decode a XLOG_XACT_COMMIT/ABORT/COMMIT_PREPARED/ABORT_PREPARED
/// record. This should agree with the ParseCommitRecord and ParseAbortRecord
/// functions in PostgreSQL (in src/backend/access/rmgr/xactdesc.c)
pub fn decode(buf: &mut Bytes, mut xid: TransactionId, xl_info: u8) -> XlXactParsedRecord {
let info = xl_info & pg_constants::XLOG_XACT_OPMASK;
// The record starts with time of commit/abort
let xact_time = buf.get_i64_le();
let xinfo = if xl_info & pg_constants::XLOG_XACT_HAS_INFO != 0 {
buf.get_u32_le()
} else {
0
};
let db_id;
let ts_id;
if xinfo & pg_constants::XACT_XINFO_HAS_DBINFO != 0 {
db_id = buf.get_u32_le();
ts_id = buf.get_u32_le();
} else {
db_id = 0;
ts_id = 0;
}
let mut subxacts = Vec::<TransactionId>::new();
if xinfo & pg_constants::XACT_XINFO_HAS_SUBXACTS != 0 {
let nsubxacts = buf.get_i32_le();
for _i in 0..nsubxacts {
let subxact = buf.get_u32_le();
subxacts.push(subxact);
}
}
let mut xnodes = Vec::<RelFileNode>::new();
if xinfo & pg_constants::XACT_XINFO_HAS_RELFILENODES != 0 {
let nrels = buf.get_i32_le();
for _i in 0..nrels {
let spcnode = buf.get_u32_le();
let dbnode = buf.get_u32_le();
let relnode = buf.get_u32_le();
tracing::trace!(
"XLOG_XACT_COMMIT relfilenode {}/{}/{}",
spcnode,
dbnode,
relnode
);
xnodes.push(RelFileNode {
spcnode,
dbnode,
relnode,
});
}
}
if xinfo & crate::v15::bindings::XACT_XINFO_HAS_DROPPED_STATS != 0 {
let nitems = buf.get_i32_le();
tracing::debug!(
"XLOG_XACT_COMMIT-XACT_XINFO_HAS_DROPPED_STAT nitems {}",
nitems
);
let sizeof_xl_xact_stats_item = 12;
buf.advance((nitems * sizeof_xl_xact_stats_item).try_into().unwrap());
}
if xinfo & pg_constants::XACT_XINFO_HAS_INVALS != 0 {
let nmsgs = buf.get_i32_le();
let sizeof_shared_invalidation_message = 16;
buf.advance(
(nmsgs * sizeof_shared_invalidation_message)
.try_into()
.unwrap(),
);
}
if xinfo & pg_constants::XACT_XINFO_HAS_TWOPHASE != 0 {
xid = buf.get_u32_le();
tracing::debug!("XLOG_XACT_COMMIT-XACT_XINFO_HAS_TWOPHASE xid {}", xid);
}
let origin_lsn = if xinfo & pg_constants::XACT_XINFO_HAS_ORIGIN != 0 {
Lsn(buf.get_u64_le())
} else {
Lsn::INVALID
};
XlXactParsedRecord {
xid,
info,
xact_time,
xinfo,
db_id,
ts_id,
subxacts,
xnodes,
origin_lsn,
}
}
}
#[repr(C)]
#[derive(Debug)]
pub struct XlClogTruncate {


@@ -357,20 +357,22 @@ impl RemoteStorage for LocalFs {
.list_recursive(prefix)
.await
.map_err(DownloadError::Other)?;
let mut objects = Vec::with_capacity(keys.len());
for key in keys {
let path = key.with_base(&self.storage_root);
let metadata = file_metadata(&path).await?;
if metadata.is_dir() {
continue;
}
objects.push(ListingObject {
key: key.clone(),
last_modified: metadata.modified()?,
size: metadata.len(),
});
}
let objects = objects;
let objects = keys
.into_iter()
.filter_map(|k| {
let path = k.with_base(&self.storage_root);
if path.is_dir() {
None
} else {
Some(ListingObject {
key: k.clone(),
// LocalFs is just for testing, so just specify a dummy time
last_modified: SystemTime::now(),
size: 0,
})
}
})
.collect();
if let ListingMode::NoDelimiter = mode {
result.keys = objects;
@@ -408,8 +410,9 @@ impl RemoteStorage for LocalFs {
} else {
result.keys.push(ListingObject {
key: RemotePath::from_string(&relative_key).unwrap(),
last_modified: object.last_modified,
size: object.size,
// LocalFs is just for testing
last_modified: SystemTime::now(),
size: 0,
});
}
}


@@ -15,4 +15,3 @@ postgres_ffi.workspace = true
serde.workspace = true
tracing.workspace = true
utils.workspace = true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }


@@ -1 +0,0 @@


@@ -1,34 +1,11 @@
//! This module houses types which represent decoded PG WAL records
//! ready for the pageserver to interpret. They are derived from the original
//! WAL records, so that each struct corresponds closely to one WAL record of
//! a specific kind. They contain the same information as the original WAL records,
//! just decoded into structs and fields for easier access.
//!
//! The ingestion code uses these structs to help with parsing the WAL records,
//! and it splits them into a stream of modifications to the key-value pairs that
//! are ultimately stored in delta layers. See also the split-out counterparts in
//! [`postgres_ffi::walrecord`].
//!
//! The pipeline which processes WAL records is not super obvious, so let's follow
//! the flow of an example XACT_COMMIT Postgres record:
//!
//! (Postgres XACT_COMMIT record)
//! |
//! |--> pageserver::walingest::WalIngest::decode_xact_record
//! |
//! |--> ([`XactRecord::Commit`])
//! |
//! |--> pageserver::walingest::WalIngest::ingest_xact_record
//! |
//! |--> (NeonWalRecord::ClogSetCommitted)
//! |
//! |--> write to KV store within the pageserver
//! ready for the pageserver to interpret. They are higher level
//! than their counterparts in [`postgres_ffi::record`].
use bytes::Bytes;
use pageserver_api::reltag::{RelTag, SlruKind};
use postgres_ffi::walrecord::{
XlMultiXactCreate, XlMultiXactTruncate, XlRelmapUpdate, XlReploriginDrop, XlReploriginSet,
XlSmgrTruncate, XlXactParsedRecord,
use postgres_ffi::record::{
XlMultiXactCreate, XlMultiXactTruncate, XlRelmapUpdate, XlReploriginDrop, XlReploriginSet, XlSmgrTruncate, XlXactParsedRecord
};
use postgres_ffi::{Oid, TransactionId};
use utils::lsn::Lsn;


@@ -93,6 +93,7 @@ criterion.workspace = true
hex-literal.workspace = true
tokio = { workspace = true, features = ["process", "sync", "fs", "rt", "io-util", "time", "test-util"] }
indoc.workspace = true
# pageserver_api = { workspace = true, features = ["testing"] }
[[bench]]
name = "bench_layer_map"


@@ -6,6 +6,7 @@ use criterion::{criterion_group, criterion_main, Criterion};
use pageserver::{
config::PageServerConf,
context::{DownloadBehavior, RequestContext},
gc_result::Value,
l0_flush::{L0FlushConfig, L0FlushGlobalState},
page_cache,
task_mgr::TaskKind,


@@ -12,7 +12,6 @@ use pageserver::tenant::storage_layer::{delta_layer, image_layer};
use pageserver::tenant::storage_layer::{DeltaLayer, ImageLayer};
use pageserver::tenant::{TENANTS_SEGMENT_NAME, TIMELINES_SEGMENT_NAME};
use pageserver::virtual_file::api::IoMode;
use pageserver::{page_cache, virtual_file};
use pageserver::{
tenant::{
block_io::FileBlockReader, disk_btree::VisitDirection,
@@ -21,6 +20,7 @@ use pageserver::{
virtual_file::VirtualFile,
};
use pageserver_api::key::{Key, KEY_SIZE};
use pageserver::{page_cache, virtual_file};
use std::fs;
use utils::bin_ser::BeSer;
use utils::id::{TenantId, TimelineId};


@@ -1,4 +1,4 @@
use pageserver_api::models::{TenantConfig, TenantConfigRequest};
use pageserver_api::models::{AuxFilePolicy, TenantConfig, TenantConfigRequest};
use pageserver_api::shard::TenantShardId;
use utils::id::TenantTimelineId;
use utils::lsn::Lsn;
@@ -66,7 +66,10 @@ async fn main_impl(args: Args) -> anyhow::Result<()> {
mgmt_api_client
.tenant_config(&TenantConfigRequest {
tenant_id: timeline.tenant_id,
config: TenantConfig::default(),
config: TenantConfig {
switch_aux_file_policy: Some(AuxFilePolicy::V2),
..Default::default()
},
})
.await?;


@@ -597,10 +597,6 @@ paths:
Create a timeline. Returns new timeline id on success.
Recreating the same timeline will succeed if the parameters match the existing timeline.
If no pg_version is specified, assume DEFAULT_PG_VERSION hardcoded in the pageserver.
To ensure durability, the caller must retry the creation until success.
Just because the timeline is visible via other endpoints does not mean it is durable.
Future versions may stop showing timelines that are not yet durable.
requestBody:
content:
application/json:


@@ -38,7 +38,6 @@ use pageserver_api::models::TenantShardSplitRequest;
use pageserver_api::models::TenantShardSplitResponse;
use pageserver_api::models::TenantSorting;
use pageserver_api::models::TimelineArchivalConfigRequest;
use pageserver_api::models::TimelineCreateRequestMode;
use pageserver_api::models::TimelinesInfoAndOffloaded;
use pageserver_api::models::TopTenantShardItem;
use pageserver_api::models::TopTenantShardsRequest;
@@ -86,7 +85,6 @@ use crate::tenant::timeline::Timeline;
use crate::tenant::GetTimelineError;
use crate::tenant::OffloadedTimeline;
use crate::tenant::{LogicalSizeCalculationCause, PageReconstructError};
use crate::DEFAULT_PG_VERSION;
use crate::{disk_usage_eviction_task, tenant};
use pageserver_api::models::{
StatusResponse, TenantConfigRequest, TenantInfo, TimelineCreateRequest, TimelineGcRequest,
@@ -549,26 +547,6 @@ async fn timeline_create_handler(
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
let new_timeline_id = request_data.new_timeline_id;
// fill in the default pg_version if not provided & convert request into domain model
let params: tenant::CreateTimelineParams = match request_data.mode {
TimelineCreateRequestMode::Bootstrap {
existing_initdb_timeline_id,
pg_version,
} => tenant::CreateTimelineParams::Bootstrap(tenant::CreateTimelineParamsBootstrap {
new_timeline_id,
existing_initdb_timeline_id,
pg_version: pg_version.unwrap_or(DEFAULT_PG_VERSION),
}),
TimelineCreateRequestMode::Branch {
ancestor_timeline_id,
ancestor_start_lsn,
pg_version: _,
} => tenant::CreateTimelineParams::Branch(tenant::CreateTimelineParamsBranch {
new_timeline_id,
ancestor_timeline_id,
ancestor_start_lsn,
}),
};
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Error);
@@ -581,12 +559,22 @@ async fn timeline_create_handler(
tenant.wait_to_become_active(ACTIVE_TENANT_TIMEOUT).await?;
// earlier versions of the code had pg_version and ancestor_lsn in the span
// => continue to provide that information, but, through a log message that doesn't require us to destructure
tracing::info!(?params, "creating timeline");
if let Some(ancestor_id) = request_data.ancestor_timeline_id.as_ref() {
tracing::info!(%ancestor_id, "starting to branch");
} else {
tracing::info!("bootstrapping");
}
match tenant
.create_timeline(params, state.broker_client.clone(), &ctx)
.create_timeline(
new_timeline_id,
request_data.ancestor_timeline_id,
request_data.ancestor_start_lsn,
request_data.pg_version.unwrap_or(crate::DEFAULT_PG_VERSION),
request_data.existing_initdb_timeline_id,
state.broker_client.clone(),
&ctx,
)
.await
{
Ok(new_timeline) => {
@@ -637,6 +625,8 @@ async fn timeline_create_handler(
tenant_id = %tenant_shard_id.tenant_id,
shard_id = %tenant_shard_id.shard_slug(),
timeline_id = %new_timeline_id,
lsn=?request_data.ancestor_start_lsn,
pg_version=?request_data.pg_version
))
.await
}
@@ -1293,99 +1283,6 @@ async fn layer_map_info_handler(
json_response(StatusCode::OK, layer_map_info)
}
#[instrument(skip_all, fields(tenant_id, shard_id, timeline_id, layer_name))]
async fn timeline_layer_scan_disposable_keys(
request: Request<Body>,
cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
let layer_name: LayerName = parse_request_param(&request, "layer_name")?;
tracing::Span::current().record(
"tenant_id",
tracing::field::display(&tenant_shard_id.tenant_id),
);
tracing::Span::current().record(
"shard_id",
tracing::field::display(tenant_shard_id.shard_slug()),
);
tracing::Span::current().record("timeline_id", tracing::field::display(&timeline_id));
tracing::Span::current().record("layer_name", tracing::field::display(&layer_name));
let state = get_state(&request);
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
// technically the timeline need not be active for this scan to complete
let timeline =
active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id)
.await?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let guard = timeline.layers.read().await;
let Some(layer) = guard.try_get_from_key(&layer_name.clone().into()) else {
return Err(ApiError::NotFound(
anyhow::anyhow!("Layer {tenant_shard_id}/{timeline_id}/{layer_name} not found").into(),
));
};
let resident_layer = layer
.download_and_keep_resident()
.await
.map_err(|err| match err {
tenant::storage_layer::layer::DownloadError::TimelineShutdown
| tenant::storage_layer::layer::DownloadError::DownloadCancelled => {
ApiError::ShuttingDown
}
tenant::storage_layer::layer::DownloadError::ContextAndConfigReallyDeniesDownloads
| tenant::storage_layer::layer::DownloadError::DownloadRequired
| tenant::storage_layer::layer::DownloadError::NotFile(_)
| tenant::storage_layer::layer::DownloadError::DownloadFailed
| tenant::storage_layer::layer::DownloadError::PreStatFailed(_) => {
ApiError::InternalServerError(err.into())
}
#[cfg(test)]
tenant::storage_layer::layer::DownloadError::Failpoint(_) => {
ApiError::InternalServerError(err.into())
}
})?;
let keys = resident_layer
.load_keys(&ctx)
.await
.map_err(ApiError::InternalServerError)?;
let shard_identity = timeline.get_shard_identity();
let mut disposable_count = 0;
let mut not_disposable_count = 0;
let cancel = cancel.clone();
for (i, key) in keys.into_iter().enumerate() {
if shard_identity.is_key_disposable(&key) {
disposable_count += 1;
tracing::debug!(key = %key, key.dbg=?key, "disposable key");
} else {
not_disposable_count += 1;
}
#[allow(clippy::collapsible_if)]
if i % 10000 == 0 {
if cancel.is_cancelled() || timeline.cancel.is_cancelled() || timeline.is_stopping() {
return Err(ApiError::ShuttingDown);
}
}
}
json_response(
StatusCode::OK,
pageserver_api::models::ScanDisposableKeysResponse {
disposable_count,
not_disposable_count,
},
)
}
async fn layer_download_handler(
request: Request<Body>,
_cancel: CancellationToken,
@@ -3248,10 +3145,6 @@ pub fn make_router(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/layer/:layer_file_name",
|r| api_handler(r, evict_timeline_layer_handler),
)
.post(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/layer/:layer_name/scan_disposable_keys",
|r| testing_api_handler("timeline_layer_scan_disposable_keys", r, timeline_layer_scan_disposable_keys),
)
.post(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/block_gc",
|r| api_handler(r, timeline_gc_blocking_handler),


@@ -21,9 +21,9 @@ use crate::tenant::Timeline;
use crate::walingest::WalIngest;
use pageserver_api::reltag::{RelTag, SlruKind};
use postgres_ffi::pg_constants;
use postgres_ffi::record::{decode_wal_record, DecodedWALRecord};
use postgres_ffi::relfile_utils::*;
use postgres_ffi::waldecoder::WalStreamDecoder;
use postgres_ffi::walrecord::{decode_wal_record, DecodedWALRecord};
use postgres_ffi::ControlFileData;
use postgres_ffi::DBState_DB_SHUTDOWNED;
use postgres_ffi::Oid;
@@ -455,6 +455,7 @@ pub async fn import_wal_from_tar(
if let Some((lsn, recdata)) = waldecoder.poll_decode()? {
let mut decoded = DecodedWALRecord::default();
decode_wal_record(recdata, &mut decoded, tline.pg_version)?;
// let (ephemeral_file_ready_buf, special_records) = decode_wal_record(recdata, tline.pg_version);
walingest
.ingest_record(decoded, lsn, &mut modification, ctx)
.await?;


@@ -20,6 +20,7 @@ pub use pageserver_api::keyspace;
use tokio_util::sync::CancellationToken;
mod assert_u64_eq_usize;
pub mod aux_file;
pub mod gc_result;
pub mod metrics;
pub mod page_cache;
pub mod page_service;


@@ -3040,111 +3040,13 @@ impl<F: Future<Output = Result<O, E>>, O, E> Future for MeasuredRemoteOp<F> {
}
pub mod tokio_epoll_uring {
use std::{
collections::HashMap,
sync::{Arc, Mutex},
};
use metrics::{register_histogram, register_int_counter, Histogram, LocalHistogram, UIntGauge};
use metrics::{register_int_counter, UIntGauge};
use once_cell::sync::Lazy;
/// Shared storage for tokio-epoll-uring thread local metrics.
pub(crate) static THREAD_LOCAL_METRICS_STORAGE: Lazy<ThreadLocalMetricsStorage> =
Lazy::new(|| {
let slots_submission_queue_depth = register_histogram!(
"pageserver_tokio_epoll_uring_slots_submission_queue_depth",
"The slots waiters queue depth of each tokio_epoll_uring system",
vec![1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, 1024.0],
)
.expect("failed to define a metric");
ThreadLocalMetricsStorage {
observers: Mutex::new(HashMap::new()),
slots_submission_queue_depth,
}
});
pub struct ThreadLocalMetricsStorage {
/// List of thread local metrics observers.
observers: Mutex<HashMap<u64, Arc<ThreadLocalMetrics>>>,
/// A histogram shared between all thread local systems
/// for collecting slots submission queue depth.
slots_submission_queue_depth: Histogram,
}
/// Each thread-local [`tokio_epoll_uring::System`] gets one of these as its
/// [`tokio_epoll_uring::metrics::PerSystemMetrics`] generic.
///
/// The System makes observations into [`Self`] and periodically, the collector
/// comes along and flushes [`Self`] into the shared storage [`THREAD_LOCAL_METRICS_STORAGE`].
///
/// [`LocalHistogram`] is `!Send`, so, we need to put it behind a [`Mutex`].
/// But except for the periodic flush, the lock is uncontended so there's no waiting
/// for cache coherence protocol to get an exclusive cache line.
pub struct ThreadLocalMetrics {
/// Local observer of thread local tokio-epoll-uring system's slots waiters queue depth.
slots_submission_queue_depth: Mutex<LocalHistogram>,
}
impl ThreadLocalMetricsStorage {
/// Registers a new thread local system. Returns a thread local metrics observer.
pub fn register_system(&self, id: u64) -> Arc<ThreadLocalMetrics> {
let per_system_metrics = Arc::new(ThreadLocalMetrics::new(
self.slots_submission_queue_depth.local(),
));
let mut g = self.observers.lock().unwrap();
g.insert(id, Arc::clone(&per_system_metrics));
per_system_metrics
}
/// Removes metrics observer for a thread local system.
/// This should be called before dropping a thread local system.
pub fn remove_system(&self, id: u64) {
let mut g = self.observers.lock().unwrap();
g.remove(&id);
}
/// Flush all thread local metrics to the shared storage.
pub fn flush_thread_local_metrics(&self) {
let g = self.observers.lock().unwrap();
g.values().for_each(|local| {
local.flush();
});
}
}
impl ThreadLocalMetrics {
pub fn new(slots_submission_queue_depth: LocalHistogram) -> Self {
ThreadLocalMetrics {
slots_submission_queue_depth: Mutex::new(slots_submission_queue_depth),
}
}
/// Flushes the thread local metrics to shared aggregator.
pub fn flush(&self) {
let Self {
slots_submission_queue_depth,
} = self;
slots_submission_queue_depth.lock().unwrap().flush();
}
}
impl tokio_epoll_uring::metrics::PerSystemMetrics for ThreadLocalMetrics {
fn observe_slots_submission_queue_depth(&self, queue_depth: u64) {
let Self {
slots_submission_queue_depth,
} = self;
slots_submission_queue_depth
.lock()
.unwrap()
.observe(queue_depth as f64);
}
}
pub struct Collector {
descs: Vec<metrics::core::Desc>,
systems_created: UIntGauge,
systems_destroyed: UIntGauge,
thread_local_metrics_storage: &'static ThreadLocalMetricsStorage,
}
impl metrics::core::Collector for Collector {
@@ -3154,7 +3056,7 @@ pub mod tokio_epoll_uring {
fn collect(&self) -> Vec<metrics::proto::MetricFamily> {
let mut mfs = Vec::with_capacity(Self::NMETRICS);
let tokio_epoll_uring::metrics::GlobalMetrics {
let tokio_epoll_uring::metrics::Metrics {
systems_created,
systems_destroyed,
} = tokio_epoll_uring::metrics::global();
@@ -3162,21 +3064,12 @@ pub mod tokio_epoll_uring {
mfs.extend(self.systems_created.collect());
self.systems_destroyed.set(systems_destroyed);
mfs.extend(self.systems_destroyed.collect());
self.thread_local_metrics_storage
.flush_thread_local_metrics();
mfs.extend(
self.thread_local_metrics_storage
.slots_submission_queue_depth
.collect(),
);
mfs
}
}
impl Collector {
const NMETRICS: usize = 3;
const NMETRICS: usize = 2;
#[allow(clippy::new_without_default)]
pub fn new() -> Self {
@@ -3208,7 +3101,6 @@ pub mod tokio_epoll_uring {
descs,
systems_created,
systems_destroyed,
thread_local_metrics_storage: &THREAD_LOCAL_METRICS_STORAGE,
}
}
}
@@ -3568,7 +3460,6 @@ pub fn preinitialize_metrics() {
Lazy::force(&RECONSTRUCT_TIME);
Lazy::force(&BASEBACKUP_QUERY_TIME);
Lazy::force(&COMPUTE_COMMANDS_COUNTERS);
Lazy::force(&tokio_epoll_uring::THREAD_LOCAL_METRICS_STORAGE);
tenant_throttling::preinitialize_global_metrics();
}


@@ -14,7 +14,6 @@ use crate::span::debug_assert_current_span_has_tenant_and_timeline_id_no_shard_i
use anyhow::{ensure, Context};
use bytes::{Buf, Bytes, BytesMut};
use enum_map::Enum;
use pageserver_api::key::Key;
use pageserver_api::key::{
dbdir_key_range, rel_block_to_key, rel_dir_to_key, rel_key_range, rel_size_to_key,
relmap_file_key, repl_origin_key, repl_origin_key_range, slru_block_to_key, slru_dir_to_key,
@@ -25,6 +24,7 @@ use pageserver_api::keyspace::SparseKeySpace;
use pageserver_api::record::NeonWalRecord;
use pageserver_api::reltag::{BlockNumber, RelTag, SlruKind};
use pageserver_api::value::Value;
use pageserver_api::key::Key;
use postgres_ffi::relfile_utils::{FSM_FORKNUM, VISIBILITYMAP_FORKNUM};
use postgres_ffi::BLCKSZ;
use postgres_ffi::{Oid, RepOriginId, TimestampTz, TransactionId};
@@ -1508,42 +1508,35 @@ impl<'a> DatadirModification<'a> {
Ok(())
}
/// Drop some relations
pub(crate) async fn put_rel_drops(
&mut self,
drop_relations: HashMap<(u32, u32), Vec<RelTag>>,
ctx: &RequestContext,
) -> anyhow::Result<()> {
for ((spc_node, db_node), rel_tags) in drop_relations {
let dir_key = rel_dir_to_key(spc_node, db_node);
let buf = self.get(dir_key, ctx).await?;
let mut dir = RelDirectory::des(&buf)?;
/// Drop a relation.
pub async fn put_rel_drop(&mut self, rel: RelTag, ctx: &RequestContext) -> anyhow::Result<()> {
anyhow::ensure!(rel.relnode != 0, RelationError::InvalidRelnode);
let mut dirty = false;
for rel_tag in rel_tags {
if dir.rels.remove(&(rel_tag.relnode, rel_tag.forknum)) {
dirty = true;
// Remove it from the directory entry
let dir_key = rel_dir_to_key(rel.spcnode, rel.dbnode);
let buf = self.get(dir_key, ctx).await?;
let mut dir = RelDirectory::des(&buf)?;
// update logical size
let size_key = rel_size_to_key(rel_tag);
let old_size = self.get(size_key, ctx).await?.get_u32_le();
self.pending_nblocks -= old_size as i64;
self.pending_directory_entries
.push((DirectoryKind::Rel, dir.rels.len()));
// Remove entry from relation size cache
self.tline.remove_cached_rel_size(&rel_tag);
// Delete size entry, as well as all blocks
self.delete(rel_key_range(rel_tag));
}
}
if dirty {
self.put(dir_key, Value::Image(Bytes::from(RelDirectory::ser(&dir)?)));
self.pending_directory_entries
.push((DirectoryKind::Rel, dir.rels.len()));
}
if dir.rels.remove(&(rel.relnode, rel.forknum)) {
self.put(dir_key, Value::Image(Bytes::from(RelDirectory::ser(&dir)?)));
} else {
warn!("dropped rel {} did not exist in rel directory", rel);
}
// update logical size
let size_key = rel_size_to_key(rel);
let old_size = self.get(size_key, ctx).await?.get_u32_le();
self.pending_nblocks -= old_size as i64;
// Remove entry from relation size cache
self.tline.remove_cached_rel_size(&rel);
// Delete size entry, as well as all blocks
self.delete(rel_key_range(rel));
Ok(())
}

File diff suppressed because it is too large.


@@ -9,6 +9,7 @@
//! may lead to a data loss.
//!
pub(crate) use pageserver_api::config::TenantConfigToml as TenantConf;
use pageserver_api::models::AuxFilePolicy;
use pageserver_api::models::CompactionAlgorithmSettings;
use pageserver_api::models::EvictionPolicy;
use pageserver_api::models::{self, ThrottleConfig};
@@ -340,6 +341,10 @@ pub struct TenantConfOpt {
#[serde(skip_serializing_if = "Option::is_none")]
pub image_layer_creation_check_threshold: Option<u8>,
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(default)]
pub switch_aux_file_policy: Option<AuxFilePolicy>,
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(with = "humantime_serde")]
#[serde(default)]
@@ -405,6 +410,9 @@ impl TenantConfOpt {
image_layer_creation_check_threshold: self
.image_layer_creation_check_threshold
.unwrap_or(global_conf.image_layer_creation_check_threshold),
switch_aux_file_policy: self
.switch_aux_file_policy
.unwrap_or(global_conf.switch_aux_file_policy),
lsn_lease_length: self
.lsn_lease_length
.unwrap_or(global_conf.lsn_lease_length),
@@ -462,6 +470,7 @@ impl From<TenantConfOpt> for models::TenantConfig {
lazy_slru_download: value.lazy_slru_download,
timeline_get_throttle: value.timeline_get_throttle.map(ThrottleConfig::from),
image_layer_creation_check_threshold: value.image_layer_creation_check_threshold,
switch_aux_file_policy: value.switch_aux_file_policy,
lsn_lease_length: value.lsn_lease_length.map(humantime),
lsn_lease_length_for_ts: value.lsn_lease_length_for_ts.map(humantime),
}


@@ -2811,7 +2811,7 @@ where
}
use {
crate::tenant::gc_result::GcResult, pageserver_api::models::TimelineGcRequest,
crate::gc_result::GcResult, pageserver_api::models::TimelineGcRequest,
utils::http::error::ApiError,
};

View File

@@ -249,7 +249,7 @@ pub(crate) use download::{
list_remote_tenant_shards, list_remote_timelines,
};
pub(crate) use index::LayerFileMetadata;
pub(crate) use upload::upload_initdb_dir;
pub(crate) use upload::{upload_initdb_dir, upload_tenant_manifest};
// Occasional network issues and such can cause remote operations to fail, and
// that's expected. If a download fails, we log it at info-level, and retry.
@@ -1278,14 +1278,10 @@ impl RemoteTimelineClient {
let fut = {
let mut guard = self.upload_queue.lock().unwrap();
let upload_queue = match &mut *guard {
UploadQueue::Stopped(_) => {
scopeguard::ScopeGuard::into_inner(sg);
return;
}
UploadQueue::Stopped(_) => return,
UploadQueue::Uninitialized => {
// transition into Stopped state
self.stop_impl(&mut guard);
scopeguard::ScopeGuard::into_inner(sg);
return;
}
UploadQueue::Initialized(ref mut init) => init,

View File

@@ -403,79 +403,59 @@ async fn do_download_index_part(
Ok((index_part, index_generation, index_part_mtime))
}
/// Metadata objects are "generationed", meaning that they include a generation suffix. This
/// function downloads the object with the highest generation <= `my_generation`.
/// index_part.json objects are suffixed with a generation number, so we cannot
/// directly GET the latest index part without doing some probing.
///
/// Data objects (layer files) also include a generation in their path, but there is no equivalent
/// search process, because their reference from an index includes the generation.
///
/// An expensive object listing operation is only done if necessary: the typical fast path is to issue two
/// GET operations, one to our own generation (stale attachment case), and one to the immediately preceding
/// generation (normal case when migrating/restarting). Only if both of these return 404 do we fall back
/// to listing objects.
///
/// * `my_generation`: the value of `[crate::tenant::Tenant::generation]`
/// * `what`: for logging, what object are we downloading
/// * `prefix`: when listing objects, use this prefix (i.e. the part of the object path before the generation)
/// * `do_download`: a GET of the object in a particular generation, which should **retry indefinitely** unless
/// `cancel` has fired. This function does not do its own retries of GET operations, and relies
/// on the function passed in to do so.
/// * `parse_path`: parse a fully qualified remote storage path to get the generation of the object.
#[allow(clippy::too_many_arguments)]
/// In this function we probe for the most recent index in a generation <= our current generation.
/// See "Finding the remote indices for timelines" in docs/rfcs/025-generation-numbers.md
#[tracing::instrument(skip_all, fields(generation=?my_generation))]
pub(crate) async fn download_generation_object<'a, T, DF, DFF, PF>(
storage: &'a GenericRemoteStorage,
tenant_shard_id: &'a TenantShardId,
timeline_id: &'a TimelineId,
pub(crate) async fn download_index_part(
storage: &GenericRemoteStorage,
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
my_generation: Generation,
what: &str,
prefix: RemotePath,
do_download: DF,
parse_path: PF,
cancel: &'a CancellationToken,
) -> Result<(T, Generation, SystemTime), DownloadError>
where
DF: Fn(
&'a GenericRemoteStorage,
&'a TenantShardId,
&'a TimelineId,
Generation,
&'a CancellationToken,
) -> DFF,
DFF: Future<Output = Result<(T, Generation, SystemTime), DownloadError>>,
PF: Fn(RemotePath) -> Option<Generation>,
T: 'static,
{
cancel: &CancellationToken,
) -> Result<(IndexPart, Generation, SystemTime), DownloadError> {
debug_assert_current_span_has_tenant_and_timeline_id();
if my_generation.is_none() {
// Operating without generations: just fetch the generation-less path
return do_download(storage, tenant_shard_id, timeline_id, my_generation, cancel).await;
return do_download_index_part(
storage,
tenant_shard_id,
timeline_id,
my_generation,
cancel,
)
.await;
}
// Stale case: If we were intentionally attached in a stale generation, the remote object may already
// exist in our generation.
// Stale case: If we were intentionally attached in a stale generation, there may already be a remote
// index in our generation.
//
// This is an optimization to avoid doing the listing for the general case below.
let res = do_download(storage, tenant_shard_id, timeline_id, my_generation, cancel).await;
let res =
do_download_index_part(storage, tenant_shard_id, timeline_id, my_generation, cancel).await;
match res {
Ok(decoded) => {
tracing::debug!("Found {what} from current generation (this is a stale attachment)");
return Ok(decoded);
Ok(index_part) => {
tracing::debug!(
"Found index_part from current generation (this is a stale attachment)"
);
return Ok(index_part);
}
Err(DownloadError::NotFound) => {}
Err(e) => return Err(e),
};
// Typical case: the previous generation of this tenant was running healthily, and had uploaded the object
// we are seeking in that generation. We may safely start from this index without doing a listing, because:
// Typical case: the previous generation of this tenant was running healthily, and had uploaded
// an index part. We may safely start from this index without doing a listing, because:
// - We checked for current generation case above
// - generations > my_generation are to be ignored
// - any other objects that exist would have an older generation than `previous_gen`, and
// we want to find the most recent object from a previous generation.
// - any other indices that exist would have an older generation than `previous_gen`, and
// we want to find the most recent index from a previous generation.
//
// This is an optimization to avoid doing the listing for the general case below.
let res = do_download(
let res = do_download_index_part(
storage,
tenant_shard_id,
timeline_id,
@@ -484,12 +464,14 @@ where
)
.await;
match res {
Ok(decoded) => {
tracing::debug!("Found {what} from previous generation");
return Ok(decoded);
Ok(index_part) => {
tracing::debug!("Found index_part from previous generation");
return Ok(index_part);
}
Err(DownloadError::NotFound) => {
tracing::debug!("No {what} found from previous generation, falling back to listing");
tracing::debug!(
"No index_part found from previous generation, falling back to listing"
);
}
Err(e) => {
return Err(e);
@@ -499,10 +481,12 @@ where
// General case/fallback: if there is no index at my_generation or prev_generation, then list all index_part.json
// objects, and select the highest one with a generation <= my_generation. Constructing the prefix is equivalent
// to constructing a full index path with no generation, because the generation is a suffix.
let paths = download_retry(
let index_prefix = remote_index_path(tenant_shard_id, timeline_id, Generation::none());
let indices = download_retry(
|| async {
storage
.list(Some(&prefix), ListingMode::NoDelimiter, None, cancel)
.list(Some(&index_prefix), ListingMode::NoDelimiter, None, cancel)
.await
},
"list index_part files",
@@ -513,22 +497,22 @@ where
// General case logic for which index to use: the latest index whose generation
// is <= our own. See "Finding the remote indices for timelines" in docs/rfcs/025-generation-numbers.md
let max_previous_generation = paths
let max_previous_generation = indices
.into_iter()
.filter_map(|o| parse_path(o.key))
.filter_map(|o| parse_remote_index_path(o.key))
.filter(|g| g <= &my_generation)
.max();
match max_previous_generation {
Some(g) => {
tracing::debug!("Found {what} in generation {g:?}");
do_download(storage, tenant_shard_id, timeline_id, g, cancel).await
tracing::debug!("Found index_part in generation {g:?}");
do_download_index_part(storage, tenant_shard_id, timeline_id, g, cancel).await
}
None => {
// Migration from legacy pre-generation state: we have a generation but no prior
// attached pageservers did. Try to load from a no-generation path.
tracing::debug!("No {what}* found");
do_download(
tracing::debug!("No index_part.json* found");
do_download_index_part(
storage,
tenant_shard_id,
timeline_id,
@@ -540,33 +524,6 @@ where
}
}
/// index_part.json objects are suffixed with a generation number, so we cannot
/// directly GET the latest index part without doing some probing.
///
/// In this function we probe for the most recent index in a generation <= our current generation.
/// See "Finding the remote indices for timelines" in docs/rfcs/025-generation-numbers.md
pub(crate) async fn download_index_part(
storage: &GenericRemoteStorage,
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
my_generation: Generation,
cancel: &CancellationToken,
) -> Result<(IndexPart, Generation, SystemTime), DownloadError> {
let index_prefix = remote_index_path(tenant_shard_id, timeline_id, Generation::none());
download_generation_object(
storage,
tenant_shard_id,
timeline_id,
my_generation,
"index_part",
index_prefix,
do_download_index_part,
parse_remote_index_path,
cancel,
)
.await
}
pub(crate) async fn download_initdb_tar_zst(
conf: &'static PageServerConf,
storage: &GenericRemoteStorage,
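
As a rough illustration of the probing described in the comments above (own generation first, then the immediately preceding one, then a listing), the general-case selection rule reduces to picking the newest generation that is not newer than our own. A minimal sketch, with generations modelled as plain u32 values instead of the pageserver's Generation type:

// Pick the latest generation <= my_generation among the index_part.json-<gen>
// objects returned by a listing; None means "fall back to the legacy,
// generation-less path".
fn pick_index_generation(listed: &[u32], my_generation: u32) -> Option<u32> {
    listed.iter().copied().filter(|g| *g <= my_generation).max()
}

fn main() {
    let listed = [3, 5, 7, 9];
    // Attached in generation 8: the most recent usable index is generation 7.
    assert_eq!(pick_index_generation(&listed, 8), Some(7));
    // Nothing at or below our generation: try the no-generation object.
    assert_eq!(pick_index_generation(&listed, 2), None);
}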

View File

@@ -3,7 +3,7 @@ use serde::{Deserialize, Serialize};
use utils::{id::TimelineId, lsn::Lsn};
/// Tenant-shard scoped manifest
#[derive(Clone, Serialize, Deserialize, PartialEq, Eq)]
#[derive(Clone, Serialize, Deserialize)]
pub struct TenantManifest {
/// Debugging aid describing the version of this manifest.
/// Can also be used for distinguishing breaking changes later on.
@@ -23,7 +23,7 @@ pub struct TenantManifest {
/// Very similar to [`pageserver_api::models::OffloadedTimelineInfo`],
/// but the two datastructures serve different needs, this is for a persistent disk format
/// that must be backwards compatible, while the other is only for informative purposes.
#[derive(Clone, Serialize, Deserialize, Copy, PartialEq, Eq)]
#[derive(Clone, Serialize, Deserialize, Copy)]
pub struct OffloadedTimelineManifest {
pub timeline_id: TimelineId,
/// Whether the timeline has a parent it has been branched off from or not

View File

@@ -187,8 +187,6 @@ pub(super) async fn gather_inputs(
// but it is unlikely to cause any issues. In the worst case,
// the calculation will error out.
timelines.retain(|t| t.is_active());
// Also filter out archived timelines.
timelines.retain(|t| t.is_archived() != Some(true));
// Build a map of branch points.
let mut branchpoints: HashMap<TimelineId, HashSet<Lsn>> = HashMap::new();

View File

@@ -1,6 +1,5 @@
//! Common traits and structs for layers
pub mod batch_split_writer;
pub mod delta_layer;
pub mod filter_iterator;
pub mod image_layer;
@@ -9,6 +8,7 @@ pub(crate) mod layer;
mod layer_desc;
mod layer_name;
pub mod merge_iterator;
pub mod split_writer;
use crate::context::{AccessStatsBehavior, RequestContext};
use bytes::Bytes;

View File

@@ -1085,7 +1085,7 @@ impl DeltaLayerInner {
}
}
pub(crate) async fn index_entries<'a>(
pub(super) async fn load_keys<'a>(
&'a self,
ctx: &RequestContext,
) -> Result<Vec<DeltaEntry<'a>>> {
@@ -1347,7 +1347,7 @@ impl DeltaLayerInner {
tree_reader.dump().await?;
let keys = self.index_entries(ctx).await?;
let keys = self.load_keys(ctx).await?;
async fn dump_blob(val: &ValueRef<'_>, ctx: &RequestContext) -> anyhow::Result<String> {
let buf = val.load_raw(ctx).await?;
@@ -1454,16 +1454,6 @@ impl DeltaLayerInner {
),
}
}
/// NB: not super efficient, but not terrible either. Should prob be an iterator.
//
// We're reusing the index traversal logic in plan_reads; would be nice to
// factor that out.
pub(crate) async fn load_keys(&self, ctx: &RequestContext) -> anyhow::Result<Vec<Key>> {
self.index_entries(ctx)
.await
.map(|entries| entries.into_iter().map(|entry| entry.key).collect())
}
}
/// A set of data associated with a delta layer key and its value
@@ -2199,7 +2189,6 @@ pub(crate) mod test {
(k1, l1).cmp(&(k2, l2))
}
#[cfg(feature = "testing")]
pub(crate) fn sort_delta_value(
(k1, l1, v1): &(Key, Lsn, Value),
(k2, l2, v2): &(Key, Lsn, Value),

View File

@@ -674,21 +674,6 @@ impl ImageLayerInner {
),
}
}
/// NB: not super efficient, but not terrible either. Should prob be an iterator.
//
// We're reusing the index traversal logic in plan_reads; would be nice to
// factor that out.
pub(crate) async fn load_keys(&self, ctx: &RequestContext) -> anyhow::Result<Vec<Key>> {
let plan = self
.plan_reads(KeySpace::single(self.key_range.clone()), None, ctx)
.await?;
Ok(plan
.into_iter()
.flat_map(|read| read.blobs_at)
.map(|(_, blob_meta)| blob_meta.key)
.collect())
}
}
/// A builder object for constructing a new image layer.
@@ -1025,7 +1010,7 @@ impl ImageLayerWriter {
self.inner.take().unwrap().finish(ctx, None).await
}
/// Finish writing the image layer with an end key, used in [`super::batch_split_writer::SplitImageLayerWriter`]. The end key determines the end of the image layer's covered range and is exclusive.
/// Finish writing the image layer with an end key, used in [`super::split_writer::SplitImageLayerWriter`]. The end key determines the end of the image layer's covered range and is exclusive.
pub(super) async fn finish_with_end_key(
mut self,
end_key: Key,
@@ -1125,8 +1110,8 @@ mod test {
use itertools::Itertools;
use pageserver_api::{
key::Key,
shard::{ShardCount, ShardIdentity, ShardNumber, ShardStripeSize},
value::Value,
shard::{ShardCount, ShardIdentity, ShardNumber, ShardStripeSize},
};
use utils::{
generation::Generation,

View File

@@ -19,7 +19,7 @@ use crate::task_mgr::TaskKind;
use crate::tenant::timeline::{CompactionError, GetVectoredError};
use crate::tenant::{remote_timeline_client::LayerFileMetadata, Timeline};
use super::delta_layer::{self};
use super::delta_layer::{self, DeltaEntry};
use super::image_layer::{self};
use super::{
AsLayerDesc, ImageLayerWriter, LayerAccessStats, LayerAccessStatsReset, LayerName,
@@ -1841,22 +1841,23 @@ impl ResidentLayer {
pub(crate) async fn load_keys<'a>(
&'a self,
ctx: &RequestContext,
) -> anyhow::Result<Vec<pageserver_api::key::Key>> {
) -> anyhow::Result<Vec<DeltaEntry<'a>>> {
use LayerKind::*;
let owner = &self.owner.0;
let inner = self.downloaded.get(owner, ctx).await?;
match self.downloaded.get(owner, ctx).await? {
Delta(ref d) => {
// this is valid because the DownloadedLayer::kind is a OnceCell, not a
// Mutex<OnceCell>, so we cannot go and deinitialize the value with OnceCell::take
// while it's being held.
self.owner.record_access(ctx);
// this is valid because the DownloadedLayer::kind is a OnceCell, not a
// Mutex<OnceCell>, so we cannot go and deinitialize the value with OnceCell::take
// while it's being held.
self.owner.record_access(ctx);
let res = match inner {
Delta(ref d) => delta_layer::DeltaLayerInner::load_keys(d, ctx).await,
Image(ref i) => image_layer::ImageLayerInner::load_keys(i, ctx).await,
};
res.with_context(|| format!("Layer index is corrupted for {self}"))
delta_layer::DeltaLayerInner::load_keys(d, ctx)
.await
.with_context(|| format!("Layer index is corrupted for {self}"))
}
Image(_) => anyhow::bail!(format!("cannot load_keys on a image layer {self}")),
}
}
/// Read all they keys in this layer which match the ShardIdentity, and write them all to

View File

@@ -57,34 +57,6 @@ impl std::fmt::Display for PersistentLayerKey {
}
}
impl From<ImageLayerName> for PersistentLayerKey {
fn from(image_layer_name: ImageLayerName) -> Self {
Self {
key_range: image_layer_name.key_range,
lsn_range: PersistentLayerDesc::image_layer_lsn_range(image_layer_name.lsn),
is_delta: false,
}
}
}
impl From<DeltaLayerName> for PersistentLayerKey {
fn from(delta_layer_name: DeltaLayerName) -> Self {
Self {
key_range: delta_layer_name.key_range,
lsn_range: delta_layer_name.lsn_range,
is_delta: true,
}
}
}
impl From<LayerName> for PersistentLayerKey {
fn from(layer_name: LayerName) -> Self {
match layer_name {
LayerName::Image(i) => i.into(),
LayerName::Delta(d) => d.into(),
}
}
}
impl PersistentLayerDesc {
pub fn key(&self) -> PersistentLayerKey {
PersistentLayerKey {

View File

@@ -292,14 +292,10 @@ mod tests {
use crate::{
tenant::{
harness::{TenantHarness, TIMELINE_ID},
storage_layer::delta_layer::test::{produce_delta_layer, sort_delta},
storage_layer::delta_layer::test::{produce_delta_layer, sort_delta, sort_delta_value},
},
DEFAULT_PG_VERSION,
};
#[cfg(feature = "testing")]
use crate::tenant::storage_layer::delta_layer::test::sort_delta_value;
#[cfg(feature = "testing")]
use pageserver_api::record::NeonWalRecord;
async fn assert_merge_iter_equal(
@@ -463,7 +459,6 @@ mod tests {
// TODO: test layers are loaded only when needed, reducing num of active iterators in k-merge
}
#[cfg(feature = "testing")]
#[tokio::test]
async fn delta_image_mixed_merge() {
use bytes::Bytes;
@@ -592,6 +587,5 @@ mod tests {
is_send(merge_iter);
}
#[cfg(feature = "testing")]
fn is_send(_: impl Send) {}
}

View File

@@ -13,154 +13,41 @@ use super::{
DeltaLayerWriter, ImageLayerWriter, PersistentLayerDesc, PersistentLayerKey, ResidentLayer,
};
pub(crate) enum BatchWriterResult {
pub(crate) enum SplitWriterResult {
Produced(ResidentLayer),
Discarded(PersistentLayerKey),
}
#[cfg(test)]
impl BatchWriterResult {
impl SplitWriterResult {
fn into_resident_layer(self) -> ResidentLayer {
match self {
BatchWriterResult::Produced(layer) => layer,
BatchWriterResult::Discarded(_) => panic!("unexpected discarded layer"),
SplitWriterResult::Produced(layer) => layer,
SplitWriterResult::Discarded(_) => panic!("unexpected discarded layer"),
}
}
fn into_discarded_layer(self) -> PersistentLayerKey {
match self {
BatchWriterResult::Produced(_) => panic!("unexpected produced layer"),
BatchWriterResult::Discarded(layer) => layer,
SplitWriterResult::Produced(_) => panic!("unexpected produced layer"),
SplitWriterResult::Discarded(layer) => layer,
}
}
}
enum LayerWriterWrapper {
Image(ImageLayerWriter),
Delta(DeltaLayerWriter),
}
/// A layer writer that takes unfinished layers and finishes them atomically.
#[must_use]
pub struct BatchLayerWriter {
generated_layer_writers: Vec<(LayerWriterWrapper, PersistentLayerKey)>,
conf: &'static PageServerConf,
}
impl BatchLayerWriter {
pub async fn new(conf: &'static PageServerConf) -> anyhow::Result<Self> {
Ok(Self {
generated_layer_writers: Vec::new(),
conf,
})
}
pub fn add_unfinished_image_writer(
&mut self,
writer: ImageLayerWriter,
key_range: Range<Key>,
lsn: Lsn,
) {
self.generated_layer_writers.push((
LayerWriterWrapper::Image(writer),
PersistentLayerKey {
key_range,
lsn_range: PersistentLayerDesc::image_layer_lsn_range(lsn),
is_delta: false,
},
));
}
pub fn add_unfinished_delta_writer(
&mut self,
writer: DeltaLayerWriter,
key_range: Range<Key>,
lsn_range: Range<Lsn>,
) {
self.generated_layer_writers.push((
LayerWriterWrapper::Delta(writer),
PersistentLayerKey {
key_range,
lsn_range,
is_delta: true,
},
));
}
pub(crate) async fn finish_with_discard_fn<D, F>(
self,
tline: &Arc<Timeline>,
ctx: &RequestContext,
discard_fn: D,
) -> anyhow::Result<Vec<BatchWriterResult>>
where
D: Fn(&PersistentLayerKey) -> F,
F: Future<Output = bool>,
{
let Self {
generated_layer_writers,
..
} = self;
let clean_up_layers = |generated_layers: Vec<BatchWriterResult>| {
for produced_layer in generated_layers {
if let BatchWriterResult::Produced(resident_layer) = produced_layer {
let layer: Layer = resident_layer.into();
layer.delete_on_drop();
}
}
};
// BEGIN: catch every error and do the recovery in the below section
let mut generated_layers: Vec<BatchWriterResult> = Vec::new();
for (inner, layer_key) in generated_layer_writers {
if discard_fn(&layer_key).await {
generated_layers.push(BatchWriterResult::Discarded(layer_key));
} else {
let res = match inner {
LayerWriterWrapper::Delta(writer) => {
writer.finish(layer_key.key_range.end, ctx).await
}
LayerWriterWrapper::Image(writer) => {
writer
.finish_with_end_key(layer_key.key_range.end, ctx)
.await
}
};
let layer = match res {
Ok((desc, path)) => {
match Layer::finish_creating(self.conf, tline, desc, &path) {
Ok(layer) => layer,
Err(e) => {
tokio::fs::remove_file(&path).await.ok();
clean_up_layers(generated_layers);
return Err(e);
}
}
}
Err(e) => {
// Image/DeltaLayerWriter::finish will clean up the temporary layer if anything goes wrong,
// so we don't need to remove the layer we just failed to create by ourselves.
clean_up_layers(generated_layers);
return Err(e);
}
};
generated_layers.push(BatchWriterResult::Produced(layer));
}
}
// END: catch every error and do the recovery in the above section
Ok(generated_layers)
}
}
/// An image writer that takes images and produces multiple image layers.
///
/// The interface does not guarantee atomicity (i.e., if the image layer generation
/// fails, there might be leftover files to be cleaned up)
#[must_use]
pub struct SplitImageLayerWriter {
inner: ImageLayerWriter,
target_layer_size: u64,
lsn: Lsn,
generated_layer_writers: Vec<(ImageLayerWriter, PersistentLayerKey)>,
conf: &'static PageServerConf,
timeline_id: TimelineId,
tenant_shard_id: TenantShardId,
batches: BatchLayerWriter,
lsn: Lsn,
start_key: Key,
}
@@ -185,10 +72,10 @@ impl SplitImageLayerWriter {
ctx,
)
.await?,
generated_layer_writers: Vec::new(),
conf,
timeline_id,
tenant_shard_id,
batches: BatchLayerWriter::new(conf).await?,
lsn,
start_key,
})
@@ -216,13 +103,16 @@ impl SplitImageLayerWriter {
ctx,
)
.await?;
let layer_key = PersistentLayerKey {
key_range: self.start_key..key,
lsn_range: PersistentLayerDesc::image_layer_lsn_range(self.lsn),
is_delta: false,
};
let prev_image_writer = std::mem::replace(&mut self.inner, next_image_writer);
self.batches.add_unfinished_image_writer(
prev_image_writer,
self.start_key..key,
self.lsn,
);
self.start_key = key;
self.generated_layer_writers
.push((prev_image_writer, layer_key));
}
self.inner.put_image(key, img, ctx).await
}
@@ -233,18 +123,64 @@ impl SplitImageLayerWriter {
ctx: &RequestContext,
end_key: Key,
discard_fn: D,
) -> anyhow::Result<Vec<BatchWriterResult>>
) -> anyhow::Result<Vec<SplitWriterResult>>
where
D: Fn(&PersistentLayerKey) -> F,
F: Future<Output = bool>,
{
let Self {
mut batches, inner, ..
mut generated_layer_writers,
inner,
..
} = self;
if inner.num_keys() != 0 {
batches.add_unfinished_image_writer(inner, self.start_key..end_key, self.lsn);
let layer_key = PersistentLayerKey {
key_range: self.start_key..end_key,
lsn_range: PersistentLayerDesc::image_layer_lsn_range(self.lsn),
is_delta: false,
};
generated_layer_writers.push((inner, layer_key));
}
batches.finish_with_discard_fn(tline, ctx, discard_fn).await
let clean_up_layers = |generated_layers: Vec<SplitWriterResult>| {
for produced_layer in generated_layers {
if let SplitWriterResult::Produced(image_layer) = produced_layer {
let layer: Layer = image_layer.into();
layer.delete_on_drop();
}
}
};
// BEGIN: catch every error and do the recovery in the below section
let mut generated_layers = Vec::new();
for (inner, layer_key) in generated_layer_writers {
if discard_fn(&layer_key).await {
generated_layers.push(SplitWriterResult::Discarded(layer_key));
} else {
let layer = match inner
.finish_with_end_key(layer_key.key_range.end, ctx)
.await
{
Ok((desc, path)) => {
match Layer::finish_creating(self.conf, tline, desc, &path) {
Ok(layer) => layer,
Err(e) => {
tokio::fs::remove_file(&path).await.ok();
clean_up_layers(generated_layers);
return Err(e);
}
}
}
Err(e) => {
// ImageLayerWriter::finish will clean up the temporary layer if anything goes wrong,
// so we don't need to remove the layer we just failed to create by ourselves.
clean_up_layers(generated_layers);
return Err(e);
}
};
generated_layers.push(SplitWriterResult::Produced(layer));
}
}
// END: catch every error and do the recovery in the above section
Ok(generated_layers)
}
#[cfg(test)]
@@ -253,7 +189,7 @@ impl SplitImageLayerWriter {
tline: &Arc<Timeline>,
ctx: &RequestContext,
end_key: Key,
) -> anyhow::Result<Vec<BatchWriterResult>> {
) -> anyhow::Result<Vec<SplitWriterResult>> {
self.finish_with_discard_fn(tline, ctx, end_key, |_| async { false })
.await
}
@@ -261,6 +197,9 @@ impl SplitImageLayerWriter {
/// A delta writer that takes key-lsn-values and produces multiple delta layers.
///
/// The interface does not guarantee atomicity (i.e., if the delta layer generation fails,
/// there might be leftover files to be cleaned up).
///
/// Note that if updates of a single key exceed the target size limit, all of the updates will be batched
/// into a single file. This behavior might change in the future. For reference, the legacy compaction algorithm
/// will split them into multiple files based on size.
@@ -268,12 +207,12 @@ impl SplitImageLayerWriter {
pub struct SplitDeltaLayerWriter {
inner: Option<(Key, DeltaLayerWriter)>,
target_layer_size: u64,
generated_layer_writers: Vec<(DeltaLayerWriter, PersistentLayerKey)>,
conf: &'static PageServerConf,
timeline_id: TimelineId,
tenant_shard_id: TenantShardId,
lsn_range: Range<Lsn>,
last_key_written: Key,
batches: BatchLayerWriter,
}
impl SplitDeltaLayerWriter {
@@ -287,12 +226,12 @@ impl SplitDeltaLayerWriter {
Ok(Self {
target_layer_size,
inner: None,
generated_layer_writers: Vec::new(),
conf,
timeline_id,
tenant_shard_id,
lsn_range,
last_key_written: Key::MIN,
batches: BatchLayerWriter::new(conf).await?,
})
}
@@ -341,11 +280,13 @@ impl SplitDeltaLayerWriter {
.await?;
let (start_key, prev_delta_writer) =
std::mem::replace(&mut self.inner, Some((key, next_delta_writer))).unwrap();
self.batches.add_unfinished_delta_writer(
prev_delta_writer,
start_key..key,
self.lsn_range.clone(),
);
let layer_key = PersistentLayerKey {
key_range: start_key..key,
lsn_range: self.lsn_range.clone(),
is_delta: true,
};
self.generated_layer_writers
.push((prev_delta_writer, layer_key));
} else if inner.estimated_size() >= S3_UPLOAD_LIMIT {
// We have to produce a very large file b/c a key is updated too often.
anyhow::bail!(
@@ -365,25 +306,64 @@ impl SplitDeltaLayerWriter {
tline: &Arc<Timeline>,
ctx: &RequestContext,
discard_fn: D,
) -> anyhow::Result<Vec<BatchWriterResult>>
) -> anyhow::Result<Vec<SplitWriterResult>>
where
D: Fn(&PersistentLayerKey) -> F,
F: Future<Output = bool>,
{
let Self {
mut batches, inner, ..
mut generated_layer_writers,
inner,
..
} = self;
if let Some((start_key, writer)) = inner {
if writer.num_keys() != 0 {
let end_key = self.last_key_written.next();
batches.add_unfinished_delta_writer(
writer,
start_key..end_key,
self.lsn_range.clone(),
);
let layer_key = PersistentLayerKey {
key_range: start_key..end_key,
lsn_range: self.lsn_range.clone(),
is_delta: true,
};
generated_layer_writers.push((writer, layer_key));
}
}
batches.finish_with_discard_fn(tline, ctx, discard_fn).await
let clean_up_layers = |generated_layers: Vec<SplitWriterResult>| {
for produced_layer in generated_layers {
if let SplitWriterResult::Produced(delta_layer) = produced_layer {
let layer: Layer = delta_layer.into();
layer.delete_on_drop();
}
}
};
// BEGIN: catch every error and do the recovery in the below section
let mut generated_layers = Vec::new();
for (inner, layer_key) in generated_layer_writers {
if discard_fn(&layer_key).await {
generated_layers.push(SplitWriterResult::Discarded(layer_key));
} else {
let layer = match inner.finish(layer_key.key_range.end, ctx).await {
Ok((desc, path)) => {
match Layer::finish_creating(self.conf, tline, desc, &path) {
Ok(layer) => layer,
Err(e) => {
tokio::fs::remove_file(&path).await.ok();
clean_up_layers(generated_layers);
return Err(e);
}
}
}
Err(e) => {
// DeltaLayerWriter::finish will clean up the temporary layer if anything goes wrong,
// so we don't need to remove the layer we just failed to create by ourselves.
clean_up_layers(generated_layers);
return Err(e);
}
};
generated_layers.push(SplitWriterResult::Produced(layer));
}
}
// END: catch every error and do the recovery in the above section
Ok(generated_layers)
}
#[cfg(test)]
@@ -391,7 +371,7 @@ impl SplitDeltaLayerWriter {
self,
tline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<Vec<BatchWriterResult>> {
) -> anyhow::Result<Vec<SplitWriterResult>> {
self.finish_with_discard_fn(tline, ctx, |_| async { false })
.await
}
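
The size-based rollover that SplitImageLayerWriter and SplitDeltaLayerWriter perform can be summarised in a small standalone sketch. SplitSketch below is purely illustrative (it is not the pageserver API), uses the value's byte length as the size estimate, and ignores the single-key batching caveat mentioned in the doc comment above.

struct SplitSketch {
    target_size: usize,
    current: Vec<(u64, String)>,
    current_size: usize,
    finished: Vec<Vec<(u64, String)>>,
}

impl SplitSketch {
    fn new(target_size: usize) -> Self {
        Self { target_size, current: Vec::new(), current_size: 0, finished: Vec::new() }
    }

    fn put(&mut self, key: u64, value: String) {
        // Roll over *before* writing the new key, so every finished layer ends
        // at a clean key boundary (mirroring how the split image writer cuts
        // its key range at the incoming key).
        if self.current_size >= self.target_size && !self.current.is_empty() {
            self.finished.push(std::mem::take(&mut self.current));
            self.current_size = 0;
        }
        self.current_size += value.len();
        self.current.push((key, value));
    }

    fn finish(mut self) -> Vec<Vec<(u64, String)>> {
        if !self.current.is_empty() {
            self.finished.push(self.current);
        }
        self.finished
    }
}

fn main() {
    let mut w = SplitSketch::new(8);
    for key in 0..6u64 {
        w.put(key, "xxxx".to_string()); // 4 bytes per value
    }
    // A target of 8 bytes gives roughly two keys per layer: three layers total.
    assert_eq!(w.finish().len(), 3);
}

In the real writers the key at which rollover happens also becomes the exclusive end of the finished layer's key range, which is why the cut is made before the new key is written.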

View File

@@ -125,9 +125,9 @@ use utils::{
simple_rcu::{Rcu, RcuReadGuard},
};
use crate::gc_result::GcResult;
use crate::task_mgr;
use crate::task_mgr::TaskKind;
use crate::tenant::gc_result::GcResult;
use crate::ZERO_PAGE;
use pageserver_api::key::Key;
use pageserver_api::value::Value;
@@ -425,9 +425,6 @@ pub struct Timeline {
pub(crate) handles: handle::PerTimelineState<crate::page_service::TenantManagerTypes>,
pub(crate) attach_wal_lag_cooldown: Arc<OnceLock<WalLagCooldown>>,
/// Cf. [`crate::tenant::CreateTimelineIdempotency`].
pub(crate) create_idempotency: crate::tenant::CreateTimelineIdempotency,
}
pub type TimelineDeleteProgress = Arc<tokio::sync::Mutex<DeleteTimelineFlow>>;
@@ -2140,7 +2137,6 @@ impl Timeline {
pg_version: u32,
state: TimelineState,
attach_wal_lag_cooldown: Arc<OnceLock<WalLagCooldown>>,
create_idempotency: crate::tenant::CreateTimelineIdempotency,
cancel: CancellationToken,
) -> Arc<Self> {
let disk_consistent_lsn = metadata.disk_consistent_lsn();
@@ -2279,8 +2275,6 @@ impl Timeline {
handles: Default::default(),
attach_wal_lag_cooldown,
create_idempotency,
};
result.repartition_threshold =

View File

@@ -32,11 +32,11 @@ use crate::page_cache;
use crate::statvfs::Statvfs;
use crate::tenant::checks::check_valid_layermap;
use crate::tenant::remote_timeline_client::WaitCompletionError;
use crate::tenant::storage_layer::batch_split_writer::{
BatchWriterResult, SplitDeltaLayerWriter, SplitImageLayerWriter,
};
use crate::tenant::storage_layer::filter_iterator::FilterIterator;
use crate::tenant::storage_layer::merge_iterator::MergeIterator;
use crate::tenant::storage_layer::split_writer::{
SplitDeltaLayerWriter, SplitImageLayerWriter, SplitWriterResult,
};
use crate::tenant::storage_layer::{
AsLayerDesc, PersistentLayerDesc, PersistentLayerKey, ValueReconstructState,
};
@@ -835,12 +835,7 @@ impl Timeline {
if self.cancel.is_cancelled() {
return Err(CompactionError::ShuttingDown);
}
let delta = l.get_as_delta(ctx).await.map_err(CompactionError::Other)?;
let keys = delta
.index_entries(ctx)
.await
.map_err(CompactionError::Other)?;
all_keys.extend(keys);
all_keys.extend(l.load_keys(ctx).await.map_err(CompactionError::Other)?);
}
// The current stdlib sorting implementation is designed in a way where it is
// particularly fast where the slice is made up of sorted sub-ranges.
@@ -2044,11 +2039,11 @@ impl Timeline {
let produced_image_layers_len = produced_image_layers.len();
for action in produced_delta_layers {
match action {
BatchWriterResult::Produced(layer) => {
SplitWriterResult::Produced(layer) => {
stat.produce_delta_layer(layer.layer_desc().file_size());
compact_to.push(layer);
}
BatchWriterResult::Discarded(l) => {
SplitWriterResult::Discarded(l) => {
keep_layers.insert(l);
stat.discard_delta_layer();
}
@@ -2056,11 +2051,11 @@ impl Timeline {
}
for action in produced_image_layers {
match action {
BatchWriterResult::Produced(layer) => {
SplitWriterResult::Produced(layer) => {
stat.produce_image_layer(layer.layer_desc().file_size());
compact_to.push(layer);
}
BatchWriterResult::Discarded(l) => {
SplitWriterResult::Discarded(l) => {
keep_layers.insert(l);
stat.discard_image_layer();
}
@@ -2444,7 +2439,7 @@ impl CompactionDeltaLayer<TimelineAdaptor> for ResidentDeltaLayer {
type DeltaEntry<'a> = DeltaEntry<'a>;
async fn load_keys<'a>(&self, ctx: &RequestContext) -> anyhow::Result<Vec<DeltaEntry<'_>>> {
self.0.get_as_delta(ctx).await?.index_entries(ctx).await
self.0.load_keys(ctx).await
}
}

View File

@@ -6,7 +6,7 @@ use std::{
use anyhow::Context;
use pageserver_api::{models::TimelineState, shard::TenantShardId};
use tokio::sync::OwnedMutexGuard;
use tracing::{error, info, info_span, instrument, Instrument};
use tracing::{error, info, instrument, Instrument};
use utils::{crashsafe, fs_ext, id::TimelineId, pausable_failpoint};
use crate::{
@@ -14,9 +14,10 @@ use crate::{
task_mgr::{self, TaskKind},
tenant::{
metadata::TimelineMetadata,
remote_timeline_client::{PersistIndexPartWithDeletedFlagError, RemoteTimelineClient},
CreateTimelineCause, DeleteTimelineError, MaybeDeletedIndexPart, Tenant,
TimelineOrOffloaded,
remote_timeline_client::{
self, PersistIndexPartWithDeletedFlagError, RemoteTimelineClient,
},
CreateTimelineCause, DeleteTimelineError, Tenant, TimelineOrOffloaded,
},
};
@@ -175,6 +176,32 @@ async fn remove_maybe_offloaded_timeline_from_tenant(
Ok(())
}
/// It is important that this gets called while the DeletionGuard is being held.
/// For more context see comments in [`DeleteTimelineFlow::prepare`]
async fn upload_new_tenant_manifest(
tenant: &Tenant,
_: &DeletionGuard, // using it as a witness
) -> anyhow::Result<()> {
// This is susceptible to race conditions, i.e. we won't continue deletions if there is a crash
// between the deletion of the index-part.json and reaching this code.
// So indeed, the tenant manifest might refer to an offloaded timeline which has already been deleted.
// However, we handle this case in tenant loading code so the next time we attach, the issue is
// resolved.
let manifest = tenant.tenant_manifest();
// TODO: generation support
let generation = remote_timeline_client::TENANT_MANIFEST_GENERATION;
remote_timeline_client::upload_tenant_manifest(
&tenant.remote_storage,
&tenant.tenant_shard_id,
generation,
&manifest,
&tenant.cancel,
)
.await?;
Ok(())
}
/// Orchestrates timeline shut down of all timeline tasks, removes its in-memory structures,
/// and deletes its data from both disk and s3.
/// The sequence of steps:
@@ -231,32 +258,7 @@ impl DeleteTimelineFlow {
))?
});
let remote_client = match timeline.maybe_remote_client() {
Some(remote_client) => remote_client,
None => {
let remote_client = tenant
.build_timeline_client(timeline.timeline_id(), tenant.remote_storage.clone());
let result = remote_client
.download_index_file(&tenant.cancel)
.instrument(info_span!("download_index_file"))
.await
.map_err(|e| DeleteTimelineError::Other(anyhow::anyhow!("error: {:?}", e)))?;
let index_part = match result {
MaybeDeletedIndexPart::Deleted(p) => {
tracing::info!("Timeline already set as deleted in remote index");
p
}
MaybeDeletedIndexPart::IndexPart(p) => p,
};
let remote_client = Arc::new(remote_client);
remote_client
.init_upload_queue(&index_part)
.map_err(DeleteTimelineError::Other)?;
remote_client.shutdown().await;
remote_client
}
};
let remote_client = timeline.remote_client_maybe_construct(tenant);
set_deleted_in_remote_index(&remote_client).await?;
fail::fail_point!("timeline-delete-before-schedule", |_| {
@@ -311,7 +313,6 @@ impl DeleteTimelineFlow {
// Important. We dont pass ancestor above because it can be missing.
// Thus we need to skip the validation here.
CreateTimelineCause::Delete,
crate::tenant::CreateTimelineIdempotency::FailWithConflict, // doesn't matter what we put here
)
.context("create_timeline_struct")?;
@@ -453,15 +454,7 @@ impl DeleteTimelineFlow {
remove_maybe_offloaded_timeline_from_tenant(tenant, timeline, &guard).await?;
// This is susceptible to race conditions, i.e. we won't continue deletions if there is a crash
// between the deletion of the index-part.json and reaching this code.
// So indeed, the tenant manifest might refer to an offloaded timeline which has already been deleted.
// However, we handle this case in tenant loading code so the next time we attach, the issue is
// resolved.
tenant
.store_tenant_manifest()
.await
.map_err(|e| DeleteTimelineError::Other(anyhow::anyhow!(e)))?;
upload_new_tenant_manifest(tenant, &guard).await?;
*guard = Self::Finished;

View File

@@ -45,16 +45,13 @@ impl LayerManager {
pub(crate) fn get_from_key(&self, key: &PersistentLayerKey) -> Layer {
// The assumption for the `expect()` is that all code maintains the following invariant:
// A layer's descriptor is present in the LayerMap => the LayerFileManager contains a layer for the descriptor.
self.try_get_from_key(key)
self.layers()
.get(key)
.with_context(|| format!("get layer from key: {key}"))
.expect("not found")
.clone()
}
pub(crate) fn try_get_from_key(&self, key: &PersistentLayerKey) -> Option<&Layer> {
self.layers().get(key)
}
pub(crate) fn get_from_desc(&self, desc: &PersistentLayerDesc) -> Layer {
self.get_from_key(&desc.key())
}

View File

@@ -3,7 +3,7 @@ use std::sync::Arc;
use super::delete::{delete_local_timeline_directory, DeleteTimelineFlow, DeletionGuard};
use super::Timeline;
use crate::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::tenant::{OffloadedTimeline, Tenant, TimelineOrOffloaded};
use crate::tenant::{remote_timeline_client, OffloadedTimeline, Tenant, TimelineOrOffloaded};
pub(crate) async fn offload_timeline(
tenant: &Tenant,
@@ -63,10 +63,17 @@ pub(crate) async fn offload_timeline(
// at the next restart attach it again.
// For that to happen, we'd need to make the manifest reflect our *intended* state,
// not our actual state of offloaded timelines.
tenant
.store_tenant_manifest()
.await
.map_err(|e| anyhow::anyhow!(e))?;
let manifest = tenant.tenant_manifest();
// TODO: generation support
let generation = remote_timeline_client::TENANT_MANIFEST_GENERATION;
remote_timeline_client::upload_tenant_manifest(
&tenant.remote_storage,
&tenant.tenant_shard_id,
generation,
&manifest,
&tenant.cancel,
)
.await?;
Ok(())
}

View File

@@ -5,11 +5,7 @@ use camino::Utf8PathBuf;
use tracing::{error, info, info_span};
use utils::{fs_ext, id::TimelineId, lsn::Lsn};
use crate::{
context::RequestContext,
import_datadir,
tenant::{CreateTimelineIdempotency, Tenant, TimelineOrOffloaded},
};
use crate::{context::RequestContext, import_datadir, tenant::Tenant};
use super::Timeline;
@@ -169,17 +165,13 @@ pub(crate) struct TimelineCreateGuard<'t> {
owning_tenant: &'t Tenant,
timeline_id: TimelineId,
pub(crate) timeline_path: Utf8PathBuf,
pub(crate) idempotency: CreateTimelineIdempotency,
}
/// Errors when acquiring exclusive access to a timeline ID for creation
#[derive(thiserror::Error, Debug)]
pub(crate) enum TimelineExclusionError {
#[error("Already exists")]
AlreadyExists {
existing: TimelineOrOffloaded,
arg: CreateTimelineIdempotency,
},
AlreadyExists(Arc<Timeline>),
#[error("Already creating")]
AlreadyCreating,
@@ -193,42 +185,27 @@ impl<'t> TimelineCreateGuard<'t> {
owning_tenant: &'t Tenant,
timeline_id: TimelineId,
timeline_path: Utf8PathBuf,
idempotency: CreateTimelineIdempotency,
allow_offloaded: bool,
) -> Result<Self, TimelineExclusionError> {
// Lock order: this is the only place we take both locks. During drop() we only
// lock creating_timelines
let timelines = owning_tenant.timelines.lock().unwrap();
let timelines_offloaded = owning_tenant.timelines_offloaded.lock().unwrap();
let mut creating_timelines: std::sync::MutexGuard<
'_,
std::collections::HashSet<TimelineId>,
> = owning_tenant.timelines_creating.lock().unwrap();
if let Some(existing) = timelines.get(&timeline_id) {
return Err(TimelineExclusionError::AlreadyExists {
existing: TimelineOrOffloaded::Timeline(existing.clone()),
arg: idempotency,
});
Err(TimelineExclusionError::AlreadyExists(existing.clone()))
} else if creating_timelines.contains(&timeline_id) {
Err(TimelineExclusionError::AlreadyCreating)
} else {
creating_timelines.insert(timeline_id);
Ok(Self {
owning_tenant,
timeline_id,
timeline_path,
})
}
if !allow_offloaded {
if let Some(existing) = timelines_offloaded.get(&timeline_id) {
return Err(TimelineExclusionError::AlreadyExists {
existing: TimelineOrOffloaded::Offloaded(existing.clone()),
arg: idempotency,
});
}
}
if creating_timelines.contains(&timeline_id) {
return Err(TimelineExclusionError::AlreadyCreating);
}
creating_timelines.insert(timeline_id);
Ok(Self {
owning_tenant,
timeline_id,
timeline_path,
idempotency,
})
}
}

View File

@@ -34,8 +34,8 @@ use crate::{
};
use postgres_backend::is_expected_io_error;
use postgres_connection::PgConnectionConfig;
use postgres_ffi::record::{decode_wal_record, DecodedWALRecord};
use postgres_ffi::waldecoder::WalStreamDecoder;
use postgres_ffi::walrecord::{decode_wal_record, DecodedWALRecord};
use utils::{id::NodeId, lsn::Lsn};
use utils::{pageserver_feedback::PageserverFeedback, sync::gate::GateError};
@@ -343,6 +343,7 @@ pub(super) async fn handle_walreceiver_connection(
let mut decoded = DecodedWALRecord::default();
decode_wal_record(recdata, &mut decoded, modification.tline.pg_version)?;
// TODO: Handle this. Probably flush buf + data modifications early.
if decoded.is_dbase_create_copy(timeline.pg_version)
&& uncommitted_records > 0
{

View File

@@ -16,24 +16,18 @@ use tokio_epoll_uring::{System, SystemHandle};
use crate::virtual_file::on_fatal_io_error;
use crate::metrics::tokio_epoll_uring::{self as metrics, THREAD_LOCAL_METRICS_STORAGE};
use crate::metrics::tokio_epoll_uring as metrics;
#[derive(Clone)]
struct ThreadLocalState(Arc<ThreadLocalStateInner>);
struct ThreadLocalStateInner {
cell: tokio::sync::OnceCell<SystemHandle<metrics::ThreadLocalMetrics>>,
cell: tokio::sync::OnceCell<SystemHandle>,
launch_attempts: AtomicU32,
/// populated through fetch_add from [`THREAD_LOCAL_STATE_ID`]
thread_local_state_id: u64,
}
impl Drop for ThreadLocalStateInner {
fn drop(&mut self) {
THREAD_LOCAL_METRICS_STORAGE.remove_system(self.thread_local_state_id);
}
}
impl ThreadLocalState {
pub fn new() -> Self {
Self(Arc::new(ThreadLocalStateInner {
@@ -77,8 +71,7 @@ pub async fn thread_local_system() -> Handle {
&fake_cancel,
)
.await;
let per_system_metrics = metrics::THREAD_LOCAL_METRICS_STORAGE.register_system(inner.thread_local_state_id);
let res = System::launch_with_metrics(per_system_metrics)
let res = System::launch()
// this might move us to another executor thread => loop outside the get_or_try_init, not inside it
.await;
match res {
@@ -93,7 +86,6 @@ pub async fn thread_local_system() -> Handle {
emit_launch_failure_process_stats();
});
metrics::THREAD_LOCAL_LAUNCH_FAILURES.inc();
metrics::THREAD_LOCAL_METRICS_STORAGE.remove_system(inner.thread_local_state_id);
Err(())
}
// abort the process instead of panicking because pageserver usually becomes half-broken if we panic somewhere.
@@ -123,7 +115,7 @@ fn emit_launch_failure_process_stats() {
// number of threads
// rss / system memory usage generally
let tokio_epoll_uring::metrics::GlobalMetrics {
let tokio_epoll_uring::metrics::Metrics {
systems_created,
systems_destroyed,
} = tokio_epoll_uring::metrics::global();
@@ -190,7 +182,7 @@ fn emit_launch_failure_process_stats() {
pub struct Handle(ThreadLocalState);
impl std::ops::Deref for Handle {
type Target = SystemHandle<metrics::ThreadLocalMetrics>;
type Target = SystemHandle;
fn deref(&self) -> &Self::Target {
self.0

View File

@@ -21,7 +21,6 @@
//! redo Postgres process, but some records it can handle directly with
//! bespoke Rust code.
use std::collections::HashMap;
use std::sync::Arc;
use std::sync::OnceLock;
use std::time::Duration;
@@ -29,7 +28,7 @@ use std::time::Instant;
use std::time::SystemTime;
use pageserver_api::shard::ShardIdentity;
use postgres_ffi::walrecord::*;
use postgres_ffi::record::*;
use postgres_ffi::{dispatch_pgversion, enum_pgversion, enum_pgversion_dispatch, TimestampTz};
use postgres_ffi::{fsm_logical_to_physical, page_is_new, page_set_lsn};
use wal_decoder::models::*;
@@ -1485,12 +1484,6 @@ impl WalIngest {
},
)?;
// Group relations to drop by dbNode. This map will contain all relations that _might_
// exist, we will reduce it to which ones really exist later. This map can be huge if
// the transaction touches a huge number of relations (there is no bound on this in
// postgres).
let mut drop_relations: HashMap<(u32, u32), Vec<RelTag>> = HashMap::new();
for xnode in &parsed.xnodes {
for forknum in MAIN_FORKNUM..=INIT_FORKNUM {
let rel = RelTag {
@@ -1499,16 +1492,15 @@ impl WalIngest {
dbnode: xnode.dbnode,
relnode: xnode.relnode,
};
drop_relations
.entry((xnode.spcnode, xnode.dbnode))
.or_default()
.push(rel);
if modification
.tline
.get_rel_exists(rel, Version::Modified(modification), ctx)
.await?
{
self.put_rel_drop(modification, rel, ctx).await?;
}
}
}
// Execute relation drops in a batch: the number may be huge, so deleting individually is prohibitively expensive
modification.put_rel_drops(drop_relations, ctx).await?;
if origin_id != 0 {
modification
.set_replorigin(origin_id, parsed.origin_lsn)
@@ -2082,7 +2074,7 @@ impl WalIngest {
) -> anyhow::Result<Option<LogicalMessageRecord>> {
let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
if info == pg_constants::XLOG_LOGICAL_MESSAGE {
let xlrec = XlLogicalMessage::decode(buf);
let xlrec = postgres_ffi::record::XlLogicalMessage::decode(buf);
let prefix = std::str::from_utf8(&buf[0..xlrec.prefix_size - 1])?;
#[cfg(feature = "testing")]
@@ -2218,6 +2210,16 @@ impl WalIngest {
Ok(())
}
async fn put_rel_drop(
&mut self,
modification: &mut DatadirModification<'_>,
rel: RelTag,
ctx: &RequestContext,
) -> Result<()> {
modification.put_rel_drop(rel, ctx).await?;
Ok(())
}
async fn handle_rel_extend(
&mut self,
modification: &mut DatadirModification<'_>,
@@ -2281,59 +2283,6 @@ impl WalIngest {
WAL_INGEST
.gap_blocks_zeroed_on_rel_extend
.inc_by(gap_blocks_filled);
// Log something when relation extends cause us to fill gaps
// with zero pages. Logging is rate limited per pg version to
// avoid skewing.
if gap_blocks_filled > 0 {
use once_cell::sync::Lazy;
use std::sync::Mutex;
use utils::rate_limit::RateLimit;
struct RateLimitPerPgVersion {
rate_limiters: [Lazy<Mutex<RateLimit>>; 4],
}
impl RateLimitPerPgVersion {
const fn new() -> Self {
Self {
rate_limiters: [const {
Lazy::new(|| Mutex::new(RateLimit::new(Duration::from_secs(30))))
}; 4],
}
}
const fn rate_limiter(
&self,
pg_version: u32,
) -> Option<&Lazy<Mutex<RateLimit>>> {
const MIN_PG_VERSION: u32 = 14;
const MAX_PG_VERSION: u32 = 17;
if pg_version < MIN_PG_VERSION || pg_version > MAX_PG_VERSION {
return None;
}
Some(&self.rate_limiters[(pg_version - MIN_PG_VERSION) as usize])
}
}
static LOGGED: RateLimitPerPgVersion = RateLimitPerPgVersion::new();
if let Some(rate_limiter) = LOGGED.rate_limiter(modification.tline.pg_version) {
if let Ok(mut locked) = rate_limiter.try_lock() {
locked.call(|| {
info!(
lsn=%modification.get_lsn(),
pg_version=%modification.tline.pg_version,
rel=%rel,
"Filled {} gap blocks on rel extend to {} from {}",
gap_blocks_filled,
new_nblocks,
old_nblocks);
});
}
}
}
}
Ok(())
}
@@ -2731,9 +2680,7 @@ mod tests {
// Drop rel
let mut m = tline.begin_modification(Lsn(0x30));
let mut rel_drops = HashMap::new();
rel_drops.insert((TESTREL_A.spcnode, TESTREL_A.dbnode), vec![TESTREL_A]);
m.put_rel_drops(rel_drops, &ctx).await?;
walingest.put_rel_drop(&mut m, TESTREL_A, &ctx).await?;
m.commit(&ctx).await?;
// Check that rel is not visible anymore
@@ -3009,8 +2956,8 @@ mod tests {
#[tokio::test]
async fn test_ingest_real_wal() {
use crate::tenant::harness::*;
use postgres_ffi::record::decode_wal_record;
use postgres_ffi::waldecoder::WalStreamDecoder;
use postgres_ffi::walrecord::decode_wal_record;
use postgres_ffi::WAL_SEGMENT_SIZE;
// Define test data path and constants.

View File

@@ -8,7 +8,6 @@ OBJS = \
file_cache.o \
hll.o \
libpagestore.o \
logical_replication_monitor.o \
neon.o \
neon_pgversioncompat.o \
neon_perf_counters.o \
@@ -16,7 +15,6 @@ OBJS = \
neon_walreader.o \
pagestore_smgr.o \
relsize_cache.o \
unstable_extensions.o \
walproposer.o \
walproposer_pg.o \
control_plane_connector.o \

View File

@@ -18,7 +18,6 @@
*
*-------------------------------------------------------------------------
*/
#include "postgres.h"
#include <curl/curl.h>
@@ -509,8 +508,6 @@ NeonXactCallback(XactEvent event, void *arg)
static bool
RoleIsNeonSuperuser(const char *role_name)
{
Assert(role_name);
return strcmp(role_name, "neon_superuser") == 0;
}
@@ -673,7 +670,7 @@ HandleCreateRole(CreateRoleStmt *stmt)
static void
HandleAlterRole(AlterRoleStmt *stmt)
{
char *role_name;
const char *role_name = stmt->role->rolename;
DefElem *dpass;
ListCell *option;
bool found = false;
@@ -681,7 +678,6 @@ HandleAlterRole(AlterRoleStmt *stmt)
InitRoleTableIfNeeded();
role_name = get_rolespec_name(stmt->role);
if (RoleIsNeonSuperuser(role_name) && !superuser())
elog(ERROR, "can't ALTER neon_superuser");
@@ -693,13 +689,9 @@ HandleAlterRole(AlterRoleStmt *stmt)
if (strcmp(defel->defname, "password") == 0)
dpass = defel;
}
/* We only care about updates to the password */
if (!dpass)
{
pfree(role_name);
return;
}
entry = hash_search(CurrentDdlTable->role_table,
role_name,
@@ -712,8 +704,6 @@ HandleAlterRole(AlterRoleStmt *stmt)
else
entry->password = NULL;
entry->type = Op_Set;
pfree(role_name);
}
static void

View File

@@ -1,253 +0,0 @@
#include <limits.h>
#include <string.h>
#include <dirent.h>
#include <signal.h>
#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "postmaster/interrupt.h"
#include "replication/slot.h"
#include "storage/fd.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/guc.h"
#include "utils/wait_event.h"
#include "logical_replication_monitor.h"
#define LS_MONITOR_CHECK_INTERVAL 10000 /* ms */
static int logical_replication_max_snap_files = 300;
PGDLLEXPORT void LogicalSlotsMonitorMain(Datum main_arg);
static int
LsnDescComparator(const void *a, const void *b)
{
XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
if (lsn1 < lsn2)
return 1;
else if (lsn1 == lsn2)
return 0;
else
return -1;
}
/*
* Look at .snap files and calculate the minimum allowed restart_lsn of a slot so that
* the next gc would leave no more than logical_replication_max_snap_files; all
* slots with a lower restart_lsn should be dropped.
*/
static XLogRecPtr
get_num_snap_files_lsn_threshold(void)
{
DIR *dirdesc;
struct dirent *de;
char *snap_path = "pg_logical/snapshots/";
int lsns_allocated = 1024;
int lsns_num = 0;
XLogRecPtr *lsns;
XLogRecPtr cutoff;
if (logical_replication_max_snap_files < 0)
return 0;
lsns = palloc(sizeof(XLogRecPtr) * lsns_allocated);
/* find all .snap files and get their lsns */
dirdesc = AllocateDir(snap_path);
while ((de = ReadDir(dirdesc, snap_path)) != NULL)
{
XLogRecPtr lsn;
uint32 hi;
uint32 lo;
if (strcmp(de->d_name, ".") == 0 ||
strcmp(de->d_name, "..") == 0)
continue;
if (sscanf(de->d_name, "%X-%X.snap", &hi, &lo) != 2)
{
ereport(LOG,
(errmsg("could not parse file name as .snap file \"%s\"", de->d_name)));
continue;
}
lsn = ((uint64) hi) << 32 | lo;
elog(DEBUG5, "found snap file %X/%X", LSN_FORMAT_ARGS(lsn));
if (lsns_allocated == lsns_num)
{
lsns_allocated *= 2;
lsns = repalloc(lsns, sizeof(XLogRecPtr) * lsns_allocated);
}
lsns[lsns_num++] = lsn;
}
/* sort by lsn desc */
qsort(lsns, lsns_num, sizeof(XLogRecPtr), LsnDescComparator);
/* and take cutoff at logical_replication_max_snap_files */
if (logical_replication_max_snap_files > lsns_num)
cutoff = 0;
/* we have fewer .snap files than the limit */
else
{
cutoff = lsns[logical_replication_max_snap_files - 1];
elog(LOG, "ls_monitor: dropping logical slots with restart_lsn lower %X/%X, found %d .snap files, limit is %d",
LSN_FORMAT_ARGS(cutoff), lsns_num, logical_replication_max_snap_files);
}
pfree(lsns);
FreeDir(dirdesc);
return cutoff;
}
void
InitLogicalReplicationMonitor(void)
{
BackgroundWorker bgw;
DefineCustomIntVariable(
"neon.logical_replication_max_snap_files",
"Maximum allowed logical replication .snap files. When exceeded, slots are dropped until the limit is met. -1 disables the limit.",
NULL,
&logical_replication_max_snap_files,
300, -1, INT_MAX,
PGC_SIGHUP,
0,
NULL, NULL, NULL);
memset(&bgw, 0, sizeof(bgw));
bgw.bgw_flags = BGWORKER_SHMEM_ACCESS;
bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
snprintf(bgw.bgw_library_name, BGW_MAXLEN, "neon");
snprintf(bgw.bgw_function_name, BGW_MAXLEN, "LogicalSlotsMonitorMain");
snprintf(bgw.bgw_name, BGW_MAXLEN, "Logical replication monitor");
snprintf(bgw.bgw_type, BGW_MAXLEN, "Logical replication monitor");
bgw.bgw_restart_time = 5;
bgw.bgw_notify_pid = 0;
bgw.bgw_main_arg = (Datum) 0;
RegisterBackgroundWorker(&bgw);
}
/*
* Unused logical replication slots pin WAL and prevent deletion of snapshots.
* WAL bloat is guarded by max_slot_wal_keep_size; this bgw removes slots which
* need too many .snap files.
*/
void
LogicalSlotsMonitorMain(Datum main_arg)
{
/* Establish signal handlers. */
pqsignal(SIGUSR1, procsignal_sigusr1_handler);
pqsignal(SIGHUP, SignalHandlerForConfigReload);
pqsignal(SIGTERM, die);
BackgroundWorkerUnblockSignals();
for (;;)
{
XLogRecPtr cutoff_lsn;
/* In case of a SIGHUP, just reload the configuration. */
if (ConfigReloadPending)
{
ConfigReloadPending = false;
ProcessConfigFile(PGC_SIGHUP);
}
/*
* If there are too many .snap files, just drop all logical slots to
* prevent aux files bloat.
*/
cutoff_lsn = get_num_snap_files_lsn_threshold();
if (cutoff_lsn > 0)
{
for (int i = 0; i < max_replication_slots; i++)
{
char slot_name[NAMEDATALEN];
ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
XLogRecPtr restart_lsn;
/* find the name */
LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
/* Consider only logical replication slots */
if (!s->in_use || !SlotIsLogical(s))
{
LWLockRelease(ReplicationSlotControlLock);
continue;
}
/* do we need to drop it? */
SpinLockAcquire(&s->mutex);
restart_lsn = s->data.restart_lsn;
SpinLockRelease(&s->mutex);
if (restart_lsn >= cutoff_lsn)
{
LWLockRelease(ReplicationSlotControlLock);
continue;
}
strlcpy(slot_name, s->data.name.data, NAMEDATALEN);
elog(LOG, "ls_monitor: dropping slot %s with restart_lsn %X/%X below horizon %X/%X",
slot_name, LSN_FORMAT_ARGS(restart_lsn), LSN_FORMAT_ARGS(cutoff_lsn));
LWLockRelease(ReplicationSlotControlLock);
/* now try to drop it, killing the owner first if any */
for (;;)
{
pid_t active_pid;
SpinLockAcquire(&s->mutex);
active_pid = s->active_pid;
SpinLockRelease(&s->mutex);
if (active_pid == 0)
{
/*
* Slot is released, try to drop it. Though of course
* it could have been reacquired, so drop can ERROR
* out. Similarly it could have been dropped in the
* meanwhile.
*
* In principle we could remove pg_try/pg_catch, that
* would restart the whole bgworker.
*/
ConditionVariableCancelSleep();
PG_TRY();
{
ReplicationSlotDrop(slot_name, true);
elog(LOG, "ls_monitor: slot %s dropped", slot_name);
}
PG_CATCH();
{
/* log ERROR and reset elog stack */
EmitErrorReport();
FlushErrorState();
elog(LOG, "ls_monitor: failed to drop slot %s", slot_name);
}
PG_END_TRY();
break;
}
else
{
/* kill the owner and wait for release */
elog(LOG, "ls_monitor: killing slot %s owner %d", slot_name, active_pid);
(void) kill(active_pid, SIGTERM);
/* We shouldn't get stuck, but to be safe add timeout. */
ConditionVariableTimedSleep(&s->active_cv, 1000, WAIT_EVENT_REPLICATION_SLOT_DROP);
}
}
}
}
(void) WaitLatch(MyLatch,
WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
LS_MONITOR_CHECK_INTERVAL,
PG_WAIT_EXTENSION);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}
}
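
The cutoff computed by get_num_snap_files_lsn_threshold boils down to sorting the .snap LSNs in descending order and taking the entry at position max_snap_files - 1 as the threshold. A minimal sketch of that rule, written in Rust for brevity with LSNs as plain u64 values (the real code above works on XLogRecPtr arrays and a GUC-configured limit):

// Returns the restart_lsn threshold below which logical slots should be
// dropped, or 0 when there are no more .snap files than the limit allows.
fn snap_cutoff(mut lsns: Vec<u64>, max_snap_files: usize) -> u64 {
    if max_snap_files > lsns.len() {
        return 0; // fewer .snap files than the limit: nothing needs dropping
    }
    lsns.sort_unstable_by(|a, b| b.cmp(a)); // descending, like LsnDescComparator
    lsns[max_snap_files - 1]
}

fn main() {
    // Five snapshot files with a limit of three: the threshold is the third
    // newest LSN, so slots pinning the two oldest snapshots get dropped.
    assert_eq!(snap_cutoff(vec![10, 40, 20, 50, 30], 3), 30);
    // Under the limit: no cutoff.
    assert_eq!(snap_cutoff(vec![10, 20], 3), 0);
}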

View File

@@ -1,6 +0,0 @@
#ifndef __NEON_LOGICAL_REPLICATION_MONITOR_H__
#define __NEON_LOGICAL_REPLICATION_MONITOR_H__
void InitLogicalReplicationMonitor(void);
#endif

View File

@@ -14,23 +14,32 @@
#include "miscadmin.h"
#include "access/subtrans.h"
#include "access/twophase.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "catalog/pg_type.h"
#include "postmaster/bgworker.h"
#include "postmaster/interrupt.h"
#include "replication/logical.h"
#include "replication/slot.h"
#include "replication/walsender.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "funcapi.h"
#include "access/htup_details.h"
#include "utils/builtins.h"
#include "utils/pg_lsn.h"
#include "utils/guc.h"
#include "utils/guc_tables.h"
#include "utils/wait_event.h"
#include "extension_server.h"
#include "neon.h"
#include "walproposer.h"
#include "pagestore_client.h"
#include "control_plane_connector.h"
#include "logical_replication_monitor.h"
#include "unstable_extensions.h"
#include "walsender_hooks.h"
#if PG_MAJORVERSION_NUM >= 16
#include "storage/ipc.h"
@@ -39,6 +48,7 @@
PG_MODULE_MAGIC;
void _PG_init(void);
static int logical_replication_max_snap_files = 300;
static int running_xacts_overflow_policy;
@@ -72,6 +82,237 @@ static const struct config_enum_entry running_xacts_overflow_policies[] = {
{NULL, 0, false}
};
static void
InitLogicalReplicationMonitor(void)
{
BackgroundWorker bgw;
DefineCustomIntVariable(
"neon.logical_replication_max_snap_files",
"Maximum allowed logical replication .snap files. When exceeded, slots are dropped until the limit is met. -1 disables the limit.",
NULL,
&logical_replication_max_snap_files,
300, -1, INT_MAX,
PGC_SIGHUP,
0,
NULL, NULL, NULL);
memset(&bgw, 0, sizeof(bgw));
bgw.bgw_flags = BGWORKER_SHMEM_ACCESS;
bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
snprintf(bgw.bgw_library_name, BGW_MAXLEN, "neon");
snprintf(bgw.bgw_function_name, BGW_MAXLEN, "LogicalSlotsMonitorMain");
snprintf(bgw.bgw_name, BGW_MAXLEN, "Logical replication monitor");
snprintf(bgw.bgw_type, BGW_MAXLEN, "Logical replication monitor");
bgw.bgw_restart_time = 5;
bgw.bgw_notify_pid = 0;
bgw.bgw_main_arg = (Datum) 0;
RegisterBackgroundWorker(&bgw);
}
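/* qsort comparator: order LSNs in descending order (newest first). */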
static int
LsnDescComparator(const void *a, const void *b)
{
XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
if (lsn1 < lsn2)
return 1;
else if (lsn1 == lsn2)
return 0;
else
return -1;
}
/*
* Look at .snap files and calculate the minimum allowed restart_lsn of a slot such that
* the next GC would leave no more than logical_replication_max_snap_files; all
* slots with a lower restart_lsn should be dropped.
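*
* For example, with the default limit of 300 and 450 .snap files present,
* the cutoff is the LSN of the 300th-newest file; any logical slot whose
* restart_lsn is below that cutoff gets dropped by the monitor loop below.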
*/
static XLogRecPtr
get_num_snap_files_lsn_threshold(void)
{
DIR *dirdesc;
struct dirent *de;
char *snap_path = "pg_logical/snapshots/";
int lsns_allocated = 1024;
int lsns_num = 0;
XLogRecPtr *lsns;
XLogRecPtr cutoff;
if (logical_replication_max_snap_files < 0)
return 0;
lsns = palloc(sizeof(XLogRecPtr) * lsns_allocated);
/* find all .snap files and get their lsns */
dirdesc = AllocateDir(snap_path);
while ((de = ReadDir(dirdesc, snap_path)) != NULL)
{
XLogRecPtr lsn;
uint32 hi;
uint32 lo;
if (strcmp(de->d_name, ".") == 0 ||
strcmp(de->d_name, "..") == 0)
continue;
if (sscanf(de->d_name, "%X-%X.snap", &hi, &lo) != 2)
{
ereport(LOG,
(errmsg("could not parse file name as .snap file \"%s\"", de->d_name)));
continue;
}
lsn = ((uint64) hi) << 32 | lo;
elog(DEBUG5, "found snap file %X/%X", LSN_FORMAT_ARGS(lsn));
if (lsns_allocated == lsns_num)
{
lsns_allocated *= 2;
lsns = repalloc(lsns, sizeof(XLogRecPtr) * lsns_allocated);
}
lsns[lsns_num++] = lsn;
}
/* sort by lsn desc */
qsort(lsns, lsns_num, sizeof(XLogRecPtr), LsnDescComparator);
/* and take cutoff at logical_replication_max_snap_files */
if (logical_replication_max_snap_files > lsns_num)
cutoff = 0;
/* fewer files than the limit */
else
{
cutoff = lsns[logical_replication_max_snap_files - 1];
elog(LOG, "ls_monitor: dropping logical slots with restart_lsn lower %X/%X, found %d .snap files, limit is %d",
LSN_FORMAT_ARGS(cutoff), lsns_num, logical_replication_max_snap_files);
}
pfree(lsns);
FreeDir(dirdesc);
return cutoff;
}
#define LS_MONITOR_CHECK_INTERVAL 10000 /* ms */
/*
* Unused logical replication slots pin WAL and prevent deletion of snapshots.
* WAL bloat is guarded by max_slot_wal_keep_size; this bgw removes slots which
* need too many .snap files.
*/
PGDLLEXPORT void
LogicalSlotsMonitorMain(Datum main_arg)
{
/* Establish signal handlers. */
pqsignal(SIGUSR1, procsignal_sigusr1_handler);
pqsignal(SIGHUP, SignalHandlerForConfigReload);
pqsignal(SIGTERM, die);
BackgroundWorkerUnblockSignals();
for (;;)
{
XLogRecPtr cutoff_lsn;
/* In case of a SIGHUP, just reload the configuration. */
if (ConfigReloadPending)
{
ConfigReloadPending = false;
ProcessConfigFile(PGC_SIGHUP);
}
/*
* If there are too many .snap files, drop the logical slots holding back
* their cleanup to prevent aux file bloat.
*/
cutoff_lsn = get_num_snap_files_lsn_threshold();
if (cutoff_lsn > 0)
{
for (int i = 0; i < max_replication_slots; i++)
{
char slot_name[NAMEDATALEN];
ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
XLogRecPtr restart_lsn;
/* find the name */
LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
/* Consider only logical replication slots */
if (!s->in_use || !SlotIsLogical(s))
{
LWLockRelease(ReplicationSlotControlLock);
continue;
}
/* do we need to drop it? */
SpinLockAcquire(&s->mutex);
restart_lsn = s->data.restart_lsn;
SpinLockRelease(&s->mutex);
if (restart_lsn >= cutoff_lsn)
{
LWLockRelease(ReplicationSlotControlLock);
continue;
}
strlcpy(slot_name, s->data.name.data, NAMEDATALEN);
elog(LOG, "ls_monitor: dropping slot %s with restart_lsn %X/%X below horizon %X/%X",
slot_name, LSN_FORMAT_ARGS(restart_lsn), LSN_FORMAT_ARGS(cutoff_lsn));
LWLockRelease(ReplicationSlotControlLock);
/* now try to drop it, killing owner before if any */
for (;;)
{
pid_t active_pid;
SpinLockAcquire(&s->mutex);
active_pid = s->active_pid;
SpinLockRelease(&s->mutex);
if (active_pid == 0)
{
/*
* Slot is released, try to drop it. Though of course
* it could have been reacquired, so drop can ERROR
* out. Similarly it could have been dropped in the
* meanwhile.
*
* In principle we could remove pg_try/pg_catch, that
* would restart the whole bgworker.
*/
ConditionVariableCancelSleep();
PG_TRY();
{
ReplicationSlotDrop(slot_name, true);
elog(LOG, "ls_monitor: slot %s dropped", slot_name);
}
PG_CATCH();
{
/* log ERROR and reset elog stack */
EmitErrorReport();
FlushErrorState();
elog(LOG, "ls_monitor: failed to drop slot %s", slot_name);
}
PG_END_TRY();
break;
}
else
{
/* kill the owner and wait for release */
elog(LOG, "ls_monitor: killing slot %s owner %d", slot_name, active_pid);
(void) kill(active_pid, SIGTERM);
/* We shouldn't get stuck, but to be safe add timeout. */
ConditionVariableTimedSleep(&s->active_cv, 1000, WAIT_EVENT_REPLICATION_SLOT_DROP);
}
}
}
}
(void) WaitLatch(MyLatch,
WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
LS_MONITOR_CHECK_INTERVAL,
PG_WAIT_EXTENSION);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}
}
/*
* XXX: These private to procarray.c, but we need them here.
*/
@@ -425,8 +666,8 @@ _PG_init(void)
LogicalFuncs_Custom_XLogReaderRoutines = NeonOnDemandXLogReaderRoutines;
SlotFuncs_Custom_XLogReaderRoutines = NeonOnDemandXLogReaderRoutines;
InitUnstableExtensionsSupport();
InitLogicalReplicationMonitor();
InitControlPlaneConnector();
pg_init_extension_server();


@@ -42,4 +42,3 @@ InitMaterializedSRF(FunctionCallInfo fcinfo, bits32 flags)
MemoryContextSwitchTo(old_context);
}
#endif


@@ -1,129 +0,0 @@
#include <stdlib.h>
#include <string.h>
#include "postgres.h"
#include "nodes/plannodes.h"
#include "nodes/parsenodes.h"
#include "tcop/utility.h"
#include "utils/errcodes.h"
#include "utils/guc.h"
#include "neon_pgversioncompat.h"
#include "unstable_extensions.h"
static bool allow_unstable_extensions = false;
static char *unstable_extensions = NULL;
static ProcessUtility_hook_type PreviousProcessUtilityHook = NULL;
static bool
list_contains(char const* comma_separated_list, char const* val)
{
char const* occ = comma_separated_list;
size_t val_len = strlen(val);
if (val_len == 0)
return false;
while ((occ = strstr(occ, val)) != NULL)
{
if ((occ == comma_separated_list || occ[-1] == ',')
&& (occ[val_len] == '\0' || occ[val_len] == ','))
{
return true;
}
occ += val_len;
}
return false;
}
static void
CheckUnstableExtension(
PlannedStmt *pstmt,
const char *queryString,
bool readOnlyTree,
ProcessUtilityContext context,
ParamListInfo params,
QueryEnvironment *queryEnv,
DestReceiver *dest,
QueryCompletion *qc)
{
Node *parseTree = pstmt->utilityStmt;
if (allow_unstable_extensions || unstable_extensions == NULL)
goto process;
switch (nodeTag(parseTree))
{
case T_CreateExtensionStmt:
{
CreateExtensionStmt *stmt = castNode(CreateExtensionStmt, parseTree);
if (list_contains(unstable_extensions, stmt->extname))
{
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("%s extension is in beta and may be unstable or introduce backward-incompatible changes.\nWe recommend testing it in a separate, dedicated Neon project.", stmt->extname),
errhint("to proceed with installation, run SET neon.allow_unstable_extensions='true'")));
}
break;
}
default:
goto process;
}
process:
if (PreviousProcessUtilityHook)
{
PreviousProcessUtilityHook(
pstmt,
queryString,
readOnlyTree,
context,
params,
queryEnv,
dest,
qc);
}
else
{
standard_ProcessUtility(
pstmt,
queryString,
readOnlyTree,
context,
params,
queryEnv,
dest,
qc);
}
}
void
InitUnstableExtensionsSupport(void)
{
DefineCustomBoolVariable(
"neon.allow_unstable_extensions",
"Allow unstable extensions to be installed and used",
NULL,
&allow_unstable_extensions,
false,
PGC_USERSET,
0,
NULL, NULL, NULL);
DefineCustomStringVariable(
"neon.unstable_extensions",
"List of unstable extensions",
NULL,
&unstable_extensions,
NULL,
PGC_SUSET,
0,
NULL, NULL, NULL);
PreviousProcessUtilityHook = ProcessUtility_hook;
ProcessUtility_hook = CheckUnstableExtension;
}


@@ -1,6 +0,0 @@
#ifndef __NEON_UNSTABLE_EXTENSIONS_H__
#define __NEON_UNSTABLE_EXTENSIONS_H__
void InitUnstableExtensionsSupport(void);
#endif

53
poetry.lock generated

@@ -1,4 +1,4 @@
# This file is automatically @generated by Poetry 1.8.3 and should not be changed by hand.
# This file is automatically @generated by Poetry 1.8.4 and should not be changed by hand.
[[package]]
name = "aiohappyeyeballs"
@@ -1521,21 +1521,6 @@ files = [
[package.dependencies]
six = "*"
[[package]]
name = "jwcrypto"
version = "1.5.6"
description = "Implementation of JOSE Web standards"
optional = false
python-versions = ">= 3.8"
files = [
{file = "jwcrypto-1.5.6-py3-none-any.whl", hash = "sha256:150d2b0ebbdb8f40b77f543fb44ffd2baeff48788be71f67f03566692fd55789"},
{file = "jwcrypto-1.5.6.tar.gz", hash = "sha256:771a87762a0c081ae6166958a954f80848820b2ab066937dc8b8379d65b1b039"},
]
[package.dependencies]
cryptography = ">=3.4"
typing-extensions = ">=4.5.0"
[[package]]
name = "kafka-python"
version = "2.0.2"
@@ -2126,6 +2111,7 @@ files = [
{file = "psycopg2_binary-2.9.9-cp311-cp311-win32.whl", hash = "sha256:dc4926288b2a3e9fd7b50dc6a1909a13bbdadfc67d93f3374d984e56f885579d"},
{file = "psycopg2_binary-2.9.9-cp311-cp311-win_amd64.whl", hash = "sha256:b76bedd166805480ab069612119ea636f5ab8f8771e640ae103e05a4aae3e417"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:8532fd6e6e2dc57bcb3bc90b079c60de896d2128c5d9d6f24a63875a95a088cf"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:b0605eaed3eb239e87df0d5e3c6489daae3f7388d455d0c0b4df899519c6a38d"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8f8544b092a29a6ddd72f3556a9fcf249ec412e10ad28be6a0c0d948924f2212"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:2d423c8d8a3c82d08fe8af900ad5b613ce3632a1249fd6a223941d0735fce493"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:2e5afae772c00980525f6d6ecf7cbca55676296b580c0e6abb407f15f3706996"},
@@ -2134,6 +2120,8 @@ files = [
{file = "psycopg2_binary-2.9.9-cp312-cp312-musllinux_1_1_i686.whl", hash = "sha256:cb16c65dcb648d0a43a2521f2f0a2300f40639f6f8c1ecbc662141e4e3e1ee07"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-musllinux_1_1_ppc64le.whl", hash = "sha256:911dda9c487075abd54e644ccdf5e5c16773470a6a5d3826fda76699410066fb"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:57fede879f08d23c85140a360c6a77709113efd1c993923c59fde17aa27599fe"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-win32.whl", hash = "sha256:64cf30263844fa208851ebb13b0732ce674d8ec6a0c86a4e160495d299ba3c93"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-win_amd64.whl", hash = "sha256:81ff62668af011f9a48787564ab7eded4e9fb17a4a6a74af5ffa6a457400d2ab"},
{file = "psycopg2_binary-2.9.9-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:2293b001e319ab0d869d660a704942c9e2cce19745262a8aba2115ef41a0a42a"},
{file = "psycopg2_binary-2.9.9-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:03ef7df18daf2c4c07e2695e8cfd5ee7f748a1d54d802330985a78d2a5a6dca9"},
{file = "psycopg2_binary-2.9.9-cp37-cp37m-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0a602ea5aff39bb9fac6308e9c9d82b9a35c2bf288e184a816002c9fae930b77"},
@@ -2615,6 +2603,7 @@ files = [
{file = "PyYAML-6.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:bf07ee2fef7014951eeb99f56f39c9bb4af143d8aa3c21b1677805985307da34"},
{file = "PyYAML-6.0.1-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:855fb52b0dc35af121542a76b9a84f8d1cd886ea97c84703eaa6d88e37a2ad28"},
{file = "PyYAML-6.0.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:40df9b996c2b73138957fe23a16a4f0ba614f4c0efce1e9406a184b6d07fa3a9"},
{file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a08c6f0fe150303c1c6b71ebcd7213c2858041a7e01975da3a99aed1e7a378ef"},
{file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6c22bec3fbe2524cde73d7ada88f6566758a8f7227bfbf93a408a9d86bcc12a0"},
{file = "PyYAML-6.0.1-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:8d4e9c88387b0f5c7d5f281e55304de64cf7f9c0021a3525bd3b1c542da3b0e4"},
{file = "PyYAML-6.0.1-cp312-cp312-win32.whl", hash = "sha256:d483d2cdf104e7c9fa60c544d92981f12ad66a457afae824d146093b8c294c54"},
@@ -2923,20 +2912,6 @@ files = [
{file = "tomli-2.0.1.tar.gz", hash = "sha256:de526c12914f0c550d15924c62d72abc48d6fe7364aa87328337a31007fe8a4f"},
]
[[package]]
name = "types-jwcrypto"
version = "1.5.0.20240925"
description = "Typing stubs for jwcrypto"
optional = false
python-versions = ">=3.8"
files = [
{file = "types-jwcrypto-1.5.0.20240925.tar.gz", hash = "sha256:50e17b790378c96239344476c7bd13b52d0c7eeb6d16c2d53723e48cc6bbf4fe"},
{file = "types_jwcrypto-1.5.0.20240925-py3-none-any.whl", hash = "sha256:2d12a2d528240d326075e896aafec7056b9136bf3207fa6ccf3fcb8fbf9e11a1"},
]
[package.dependencies]
cryptography = "*"
[[package]]
name = "types-psutil"
version = "5.9.5.12"
@@ -3143,13 +3118,13 @@ files = [
[[package]]
name = "werkzeug"
version = "3.0.6"
version = "3.0.3"
description = "The comprehensive WSGI web application library."
optional = false
python-versions = ">=3.8"
files = [
{file = "werkzeug-3.0.6-py3-none-any.whl", hash = "sha256:1bc0c2310d2fbb07b1dd1105eba2f7af72f322e1e455f2f93c993bee8c8a5f17"},
{file = "werkzeug-3.0.6.tar.gz", hash = "sha256:a8dd59d4de28ca70471a34cba79bed5f7ef2e036a76b3ab0835474246eb41f8d"},
{file = "werkzeug-3.0.3-py3-none-any.whl", hash = "sha256:fc9645dc43e03e4d630d23143a04a7f947a9a3b5727cd535fdfe155a17cc48c8"},
{file = "werkzeug-3.0.3.tar.gz", hash = "sha256:097e5bfda9f0aba8da6b8545146def481d06aa7d3266e7448e2cccf67dd8bd18"},
]
[package.dependencies]
@@ -3184,6 +3159,16 @@ files = [
{file = "wrapt-1.14.1-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:8ad85f7f4e20964db4daadcab70b47ab05c7c1cf2a7c1e51087bfaa83831854c"},
{file = "wrapt-1.14.1-cp310-cp310-win32.whl", hash = "sha256:a9a52172be0b5aae932bef82a79ec0a0ce87288c7d132946d645eba03f0ad8a8"},
{file = "wrapt-1.14.1-cp310-cp310-win_amd64.whl", hash = "sha256:6d323e1554b3d22cfc03cd3243b5bb815a51f5249fdcbb86fda4bf62bab9e164"},
{file = "wrapt-1.14.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:ecee4132c6cd2ce5308e21672015ddfed1ff975ad0ac8d27168ea82e71413f55"},
{file = "wrapt-1.14.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:2020f391008ef874c6d9e208b24f28e31bcb85ccff4f335f15a3251d222b92d9"},
{file = "wrapt-1.14.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2feecf86e1f7a86517cab34ae6c2f081fd2d0dac860cb0c0ded96d799d20b335"},
{file = "wrapt-1.14.1-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:240b1686f38ae665d1b15475966fe0472f78e71b1b4903c143a842659c8e4cb9"},
{file = "wrapt-1.14.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a9008dad07d71f68487c91e96579c8567c98ca4c3881b9b113bc7b33e9fd78b8"},
{file = "wrapt-1.14.1-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:6447e9f3ba72f8e2b985a1da758767698efa72723d5b59accefd716e9e8272bf"},
{file = "wrapt-1.14.1-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:acae32e13a4153809db37405f5eba5bac5fbe2e2ba61ab227926a22901051c0a"},
{file = "wrapt-1.14.1-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:49ef582b7a1152ae2766557f0550a9fcbf7bbd76f43fbdc94dd3bf07cc7168be"},
{file = "wrapt-1.14.1-cp311-cp311-win32.whl", hash = "sha256:358fe87cc899c6bb0ddc185bf3dbfa4ba646f05b1b0b9b5a27c2cb92c2cea204"},
{file = "wrapt-1.14.1-cp311-cp311-win_amd64.whl", hash = "sha256:26046cd03936ae745a502abf44dac702a5e6880b2b01c29aea8ddf3353b68224"},
{file = "wrapt-1.14.1-cp35-cp35m-manylinux1_i686.whl", hash = "sha256:43ca3bbbe97af00f49efb06e352eae40434ca9d915906f77def219b88e85d907"},
{file = "wrapt-1.14.1-cp35-cp35m-manylinux1_x86_64.whl", hash = "sha256:6b1a564e6cb69922c7fe3a678b9f9a3c54e72b469875aa8018f18b4d1dd1adf3"},
{file = "wrapt-1.14.1-cp35-cp35m-manylinux2010_i686.whl", hash = "sha256:00b6d4ea20a906c0ca56d84f93065b398ab74b927a7a3dbd470f6fc503f95dc3"},
@@ -3421,4 +3406,4 @@ cffi = ["cffi (>=1.11)"]
[metadata]
lock-version = "2.0"
python-versions = "^3.9"
content-hash = "ad5c9ee7723359af22bbd7fa41538dcf78913c02e947a13a8f9a87eb3a59039e"
content-hash = "f52632571e34b0e51b059c280c35d6ff6f69f6a8c9586caca78282baf635be91"


@@ -1,5 +1,5 @@
use tokio::io::{AsyncRead, AsyncWrite};
use tracing::{debug, info};
use tracing::{info, warn};
use super::{ComputeCredentials, ComputeUserInfo, ComputeUserInfoNoEndpoint};
use crate::auth::{self, AuthFlow};
@@ -21,7 +21,7 @@ pub(crate) async fn authenticate_cleartext(
secret: AuthSecret,
config: &'static AuthenticationConfig,
) -> auth::Result<ComputeCredentials> {
debug!("cleartext auth flow override is enabled, proceeding");
warn!("cleartext auth flow override is enabled, proceeding");
ctx.set_auth_method(crate::context::AuthMethod::Cleartext);
// pause the timer while we communicate with the client
@@ -61,7 +61,7 @@ pub(crate) async fn password_hack_no_authentication(
info: ComputeUserInfoNoEndpoint,
client: &mut stream::PqStream<Stream<impl AsyncRead + AsyncWrite + Unpin>>,
) -> auth::Result<(ComputeUserInfo, Vec<u8>)> {
debug!("project not specified, resorting to the password hack auth flow");
warn!("project not specified, resorting to the password hack auth flow");
ctx.set_auth_method(crate::context::AuthMethod::Cleartext);
// pause the timer while we communicate with the client


@@ -1,4 +1,3 @@
use std::borrow::Cow;
use std::future::Future;
use std::sync::Arc;
use std::time::{Duration, SystemTime};
@@ -6,7 +5,6 @@ use std::time::{Duration, SystemTime};
use arc_swap::ArcSwapOption;
use dashmap::DashMap;
use jose_jwk::crypto::KeyInfo;
use reqwest::{redirect, Client};
use serde::de::Visitor;
use serde::{Deserialize, Deserializer};
use signature::Verifier;
@@ -26,7 +24,6 @@ const MIN_RENEW: Duration = Duration::from_secs(30);
const AUTO_RENEW: Duration = Duration::from_secs(300);
const MAX_RENEW: Duration = Duration::from_secs(3600);
const MAX_JWK_BODY_SIZE: usize = 64 * 1024;
const JWKS_USER_AGENT: &str = "neon-proxy";
/// How to get the JWT auth rules
pub(crate) trait FetchAuthRules: Clone + Send + Sync + 'static {
@@ -46,7 +43,6 @@ pub(crate) enum FetchAuthRulesError {
RoleJwksNotConfigured,
}
#[derive(Clone)]
pub(crate) struct AuthRule {
pub(crate) id: String,
pub(crate) jwks_url: url::Url,
@@ -54,6 +50,7 @@ pub(crate) struct AuthRule {
pub(crate) role_names: Vec<RoleNameInt>,
}
#[derive(Default)]
pub struct JwkCache {
client: reqwest::Client,
@@ -279,7 +276,7 @@ impl JwkCacheEntryLock {
// get the key from the JWKs if possible. If not, wait for the keys to update.
let (jwk, expected_audience) = loop {
match guard.find_jwk_and_audience(&kid, role_name) {
match guard.find_jwk_and_audience(kid, role_name) {
Some(jwk) => break jwk,
None if guard.last_retrieved.elapsed() > MIN_RENEW => {
let _paused = ctx.latency_timer_pause(crate::metrics::Waiting::Compute);
@@ -314,9 +311,7 @@ impl JwkCacheEntryLock {
if let Some(aud) = expected_audience {
if payload.audience.0.iter().all(|s| s != aud) {
return Err(JwtError::InvalidClaims(
JwtClaimsError::InvalidJwtTokenAudience,
));
return Err(JwtError::InvalidJwtTokenAudience);
}
}
@@ -324,15 +319,13 @@ impl JwkCacheEntryLock {
if let Some(exp) = payload.expiration {
if now >= exp + CLOCK_SKEW_LEEWAY {
return Err(JwtError::InvalidClaims(JwtClaimsError::JwtTokenHasExpired));
return Err(JwtError::JwtTokenHasExpired);
}
}
if let Some(nbf) = payload.not_before {
if nbf >= now + CLOCK_SKEW_LEEWAY {
return Err(JwtError::InvalidClaims(
JwtClaimsError::JwtTokenNotYetReadyToUse,
));
return Err(JwtError::JwtTokenNotYetReadyToUse);
}
}
@@ -364,20 +357,6 @@ impl JwkCache {
}
}
impl Default for JwkCache {
fn default() -> Self {
let client = Client::builder()
.user_agent(JWKS_USER_AGENT)
.redirect(redirect::Policy::none())
.build()
.expect("using &str and standard redirect::Policy");
JwkCache {
client,
map: DashMap::default(),
}
}
}
fn verify_ec_signature(data: &[u8], sig: &[u8], key: &jose_jwk::Ec) -> Result<(), JwtError> {
use ecdsa::Signature;
use signature::Verifier;
@@ -426,8 +405,8 @@ struct JwtHeader<'a> {
#[serde(rename = "alg")]
algorithm: jose_jwa::Algorithm,
/// key id, must be provided for our usecase
#[serde(rename = "kid", borrow)]
key_id: Option<Cow<'a, str>>,
#[serde(rename = "kid")]
key_id: Option<&'a str>,
}
/// <https://datatracker.ietf.org/doc/html/rfc7519#section-4.1>
@@ -446,17 +425,17 @@ struct JwtPayload<'a> {
// the following entries are only extracted for the sake of debug logging.
/// Issuer of the JWT
#[serde(rename = "iss", borrow)]
issuer: Option<Cow<'a, str>>,
#[serde(rename = "iss")]
issuer: Option<&'a str>,
/// Subject of the JWT (the user)
#[serde(rename = "sub", borrow)]
subject: Option<Cow<'a, str>>,
#[serde(rename = "sub")]
subject: Option<&'a str>,
/// Unique token identifier
#[serde(rename = "jti", borrow)]
jwt_id: Option<Cow<'a, str>>,
#[serde(rename = "jti")]
jwt_id: Option<&'a str>,
/// Unique session identifier
#[serde(rename = "sid", borrow)]
session_id: Option<Cow<'a, str>>,
#[serde(rename = "sid")]
session_id: Option<&'a str>,
}
/// `OneOrMany` supports parsing either a single item or an array of items.
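/// For example, the JWT `aud` claim may arrive either as `"neon"` or as `["neon", "other"]`;
/// both forms deserialize into the same list of audiences.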
@@ -591,8 +570,14 @@ pub(crate) enum JwtError {
#[error("Provided authentication token is not a valid JWT encoding")]
JwtEncoding(#[from] JwtEncodingError),
#[error(transparent)]
InvalidClaims(#[from] JwtClaimsError),
#[error("invalid JWT token audience")]
InvalidJwtTokenAudience,
#[error("JWT token has expired")]
JwtTokenHasExpired,
#[error("JWT token is not yet ready to use")]
JwtTokenNotYetReadyToUse,
#[error("invalid P256 key")]
InvalidP256Key(jose_jwk::crypto::Error),
@@ -644,19 +629,6 @@ pub enum JwtEncodingError {
InvalidCompactForm,
}
#[derive(Error, Debug, PartialEq)]
#[non_exhaustive]
pub enum JwtClaimsError {
#[error("invalid JWT token audience")]
InvalidJwtTokenAudience,
#[error("JWT token has expired")]
JwtTokenHasExpired,
#[error("JWT token is not yet ready to use")]
JwtTokenNotYetReadyToUse,
}
#[allow(dead_code, reason = "Debug use only")]
#[derive(Debug)]
pub(crate) enum KeyType {
@@ -693,8 +665,6 @@ mod tests {
use hyper_util::rt::TokioIo;
use rand::rngs::OsRng;
use rsa::pkcs8::DecodePrivateKey;
use serde::Serialize;
use serde_json::json;
use signature::Signer;
use tokio::net::TcpListener;
@@ -708,7 +678,6 @@ mod tests {
key: jose_jwk::Key::Ec(pk),
prm: jose_jwk::Parameters {
kid: Some(kid),
alg: Some(jose_jwa::Algorithm::Signing(jose_jwa::Signing::Es256)),
..Default::default()
},
};
@@ -722,47 +691,24 @@ mod tests {
key: jose_jwk::Key::Rsa(pk),
prm: jose_jwk::Parameters {
kid: Some(kid),
alg: Some(jose_jwa::Algorithm::Signing(jose_jwa::Signing::Rs256)),
..Default::default()
},
};
(sk, jwk)
}
fn now() -> u64 {
SystemTime::now()
.duration_since(SystemTime::UNIX_EPOCH)
.unwrap()
.as_secs()
}
fn build_jwt_payload(kid: String, sig: jose_jwa::Signing) -> String {
let now = now();
let body = typed_json::json! {{
"exp": now + 3600,
"nbf": now,
"aud": ["audience1", "neon", "audience2"],
"sub": "user1",
"sid": "session1",
"jti": "token1",
"iss": "neon-testing",
}};
build_custom_jwt_payload(kid, body, sig)
}
fn build_custom_jwt_payload(
kid: String,
body: impl Serialize,
sig: jose_jwa::Signing,
) -> String {
let header = JwtHeader {
algorithm: jose_jwa::Algorithm::Signing(sig),
key_id: Some(Cow::Owned(kid)),
key_id: Some(&kid),
};
let body = typed_json::json! {{
"exp": SystemTime::now().duration_since(SystemTime::UNIX_EPOCH).unwrap().as_secs() + 3600,
}};
let header =
base64::encode_config(serde_json::to_string(&header).unwrap(), URL_SAFE_NO_PAD);
let body = base64::encode_config(serde_json::to_string(&body).unwrap(), URL_SAFE_NO_PAD);
let body = base64::encode_config(body.to_string(), URL_SAFE_NO_PAD);
format!("{header}.{body}")
}
@@ -777,16 +723,6 @@ mod tests {
format!("{payload}.{sig}")
}
fn new_custom_ec_jwt(kid: String, key: &p256::SecretKey, body: impl Serialize) -> String {
use p256::ecdsa::{Signature, SigningKey};
let payload = build_custom_jwt_payload(kid, body, jose_jwa::Signing::Es256);
let sig: Signature = SigningKey::from(key).sign(payload.as_bytes());
let sig = base64::encode_config(sig.to_bytes(), URL_SAFE_NO_PAD);
format!("{payload}.{sig}")
}
fn new_rsa_jwt(kid: String, key: rsa::RsaPrivateKey) -> String {
use rsa::pkcs1v15::SigningKey;
use rsa::signature::SignatureEncoding;
@@ -858,34 +794,37 @@ X0n5X2/pBLJzxZc62ccvZYVnctBiFs6HbSnxpuMQCfkt/BcR/ttIepBQQIW86wHL
-----END PRIVATE KEY-----
";
#[derive(Clone)]
struct Fetch(Vec<AuthRule>);
#[tokio::test]
async fn renew() {
let (rs1, jwk1) = new_rsa_jwk(RS1, "1".into());
let (rs2, jwk2) = new_rsa_jwk(RS2, "2".into());
let (ec1, jwk3) = new_ec_jwk("3".into());
let (ec2, jwk4) = new_ec_jwk("4".into());
impl FetchAuthRules for Fetch {
async fn fetch_auth_rules(
&self,
_ctx: &RequestMonitoring,
_endpoint: EndpointId,
) -> Result<Vec<AuthRule>, FetchAuthRulesError> {
Ok(self.0.clone())
}
}
let foo_jwks = jose_jwk::JwkSet {
keys: vec![jwk1, jwk3],
};
let bar_jwks = jose_jwk::JwkSet {
keys: vec![jwk2, jwk4],
};
async fn jwks_server(
router: impl for<'a> Fn(&'a str) -> Option<Vec<u8>> + Send + Sync + 'static,
) -> SocketAddr {
let router = Arc::new(router);
let service = service_fn(move |req| {
let router = Arc::clone(&router);
let foo_jwks = foo_jwks.clone();
let bar_jwks = bar_jwks.clone();
async move {
match router(req.uri().path()) {
Some(body) => Response::builder()
.status(200)
.body(Full::new(Bytes::from(body))),
None => Response::builder()
.status(404)
.body(Full::new(Bytes::new())),
}
let jwks = match req.uri().path() {
"/foo" => &foo_jwks,
"/bar" => &bar_jwks,
_ => {
return Response::builder()
.status(404)
.body(Full::new(Bytes::new()));
}
};
let body = serde_json::to_vec(jwks).unwrap();
Response::builder()
.status(200)
.body(Full::new(Bytes::from(body)))
}
});
@@ -900,61 +839,84 @@ X0n5X2/pBLJzxZc62ccvZYVnctBiFs6HbSnxpuMQCfkt/BcR/ttIepBQQIW86wHL
}
});
addr
}
let client = reqwest::Client::new();
#[tokio::test]
async fn check_jwt_happy_path() {
let (rs1, jwk1) = new_rsa_jwk(RS1, "rs1".into());
let (rs2, jwk2) = new_rsa_jwk(RS2, "rs2".into());
let (ec1, jwk3) = new_ec_jwk("ec1".into());
let (ec2, jwk4) = new_ec_jwk("ec2".into());
#[derive(Clone)]
struct Fetch(SocketAddr, Vec<RoleNameInt>);
let foo_jwks = jose_jwk::JwkSet {
keys: vec![jwk1, jwk3],
};
let bar_jwks = jose_jwk::JwkSet {
keys: vec![jwk2, jwk4],
};
let jwks_addr = jwks_server(move |path| match path {
"/foo" => Some(serde_json::to_vec(&foo_jwks).unwrap()),
"/bar" => Some(serde_json::to_vec(&bar_jwks).unwrap()),
_ => None,
})
.await;
impl FetchAuthRules for Fetch {
async fn fetch_auth_rules(
&self,
_ctx: &RequestMonitoring,
_endpoint: EndpointId,
) -> Result<Vec<AuthRule>, FetchAuthRulesError> {
Ok(vec![
AuthRule {
id: "foo".to_owned(),
jwks_url: format!("http://{}/foo", self.0).parse().unwrap(),
audience: None,
role_names: self.1.clone(),
},
AuthRule {
id: "bar".to_owned(),
jwks_url: format!("http://{}/bar", self.0).parse().unwrap(),
audience: None,
role_names: self.1.clone(),
},
])
}
}
let role_name1 = RoleName::from("anonymous");
let role_name2 = RoleName::from("authenticated");
let roles = vec![
RoleNameInt::from(&role_name1),
RoleNameInt::from(&role_name2),
];
let rules = vec![
AuthRule {
id: "foo".to_owned(),
jwks_url: format!("http://{jwks_addr}/foo").parse().unwrap(),
audience: None,
role_names: roles.clone(),
},
AuthRule {
id: "bar".to_owned(),
jwks_url: format!("http://{jwks_addr}/bar").parse().unwrap(),
audience: None,
role_names: roles.clone(),
},
];
let fetch = Fetch(rules);
let jwk_cache = JwkCache::default();
let fetch = Fetch(
addr,
vec![
RoleNameInt::from(&role_name1),
RoleNameInt::from(&role_name2),
],
);
let endpoint = EndpointId::from("ep");
let jwt1 = new_rsa_jwt("rs1".into(), rs1);
let jwt2 = new_rsa_jwt("rs2".into(), rs2);
let jwt3 = new_ec_jwt("ec1".into(), &ec1);
let jwt4 = new_ec_jwt("ec2".into(), &ec2);
let jwk_cache = Arc::new(JwkCacheEntryLock::default());
let jwt1 = new_rsa_jwt("1".into(), rs1);
let jwt2 = new_rsa_jwt("2".into(), rs2);
let jwt3 = new_ec_jwt("3".into(), &ec1);
let jwt4 = new_ec_jwt("4".into(), &ec2);
// had the wrong kid, therefore will have the wrong ecdsa signature
let bad_jwt = new_ec_jwt("3".into(), &ec2);
// this role_name is not accepted
let bad_role_name = RoleName::from("cloud_admin");
let err = jwk_cache
.check_jwt(
&RequestMonitoring::test(),
&bad_jwt,
&client,
endpoint.clone(),
&role_name1,
&fetch,
)
.await
.unwrap_err();
assert!(err.to_string().contains("signature error"));
let err = jwk_cache
.check_jwt(
&RequestMonitoring::test(),
&jwt1,
&client,
endpoint.clone(),
&bad_role_name,
&fetch,
)
.await
.unwrap_err();
assert!(err.to_string().contains("jwk not found"));
let tokens = [jwt1, jwt2, jwt3, jwt4];
let role_names = [role_name1, role_name2];
@@ -963,250 +925,15 @@ X0n5X2/pBLJzxZc62ccvZYVnctBiFs6HbSnxpuMQCfkt/BcR/ttIepBQQIW86wHL
jwk_cache
.check_jwt(
&RequestMonitoring::test(),
token,
&client,
endpoint.clone(),
role,
&fetch,
token,
)
.await
.unwrap();
}
}
}
/// AWS Cognito escapes the `/` in the URL.
#[tokio::test]
async fn check_jwt_regression_cognito_issuer() {
let (key, jwk) = new_ec_jwk("key".into());
let now = now();
let token = new_custom_ec_jwt(
"key".into(),
&key,
typed_json::json! {{
"sub": "dd9a73fd-e785-4a13-aae1-e691ce43e89d",
// cognito uses `\/`. I cannot replicate that easily here as serde_json will refuse
// to write that escape character. Instead I will make a bogus URL using `\` instead.
"iss": "https:\\\\cognito-idp.us-west-2.amazonaws.com\\us-west-2_abcdefgh",
"client_id": "abcdefghijklmnopqrstuvwxyz",
"origin_jti": "6759d132-3fe7-446e-9e90-2fe7e8017893",
"event_id": "ec9c36ab-b01d-46a0-94e4-87fde6767065",
"token_use": "access",
"scope": "aws.cognito.signin.user.admin",
"auth_time":now,
"exp":now + 60,
"iat":now,
"jti": "b241614b-0b93-4bdc-96db-0a3c7061d9c0",
"username": "dd9a73fd-e785-4a13-aae1-e691ce43e89d",
}},
);
let jwks = jose_jwk::JwkSet { keys: vec![jwk] };
let jwks_addr = jwks_server(move |_path| Some(serde_json::to_vec(&jwks).unwrap())).await;
let role_name = RoleName::from("anonymous");
let rules = vec![AuthRule {
id: "aws-cognito".to_owned(),
jwks_url: format!("http://{jwks_addr}/").parse().unwrap(),
audience: None,
role_names: vec![RoleNameInt::from(&role_name)],
}];
let fetch = Fetch(rules);
let jwk_cache = JwkCache::default();
let endpoint = EndpointId::from("ep");
jwk_cache
.check_jwt(
&RequestMonitoring::test(),
endpoint.clone(),
&role_name,
&fetch,
&token,
)
.await
.unwrap();
}
#[tokio::test]
async fn check_jwt_invalid_signature() {
let (_, jwk) = new_ec_jwk("1".into());
let (key, _) = new_ec_jwk("1".into());
// has a matching kid, but signed by the wrong key
let bad_jwt = new_ec_jwt("1".into(), &key);
let jwks = jose_jwk::JwkSet { keys: vec![jwk] };
let jwks_addr = jwks_server(move |path| match path {
"/" => Some(serde_json::to_vec(&jwks).unwrap()),
_ => None,
})
.await;
let role = RoleName::from("authenticated");
let rules = vec![AuthRule {
id: String::new(),
jwks_url: format!("http://{jwks_addr}/").parse().unwrap(),
audience: None,
role_names: vec![RoleNameInt::from(&role)],
}];
let fetch = Fetch(rules);
let jwk_cache = JwkCache::default();
let ep = EndpointId::from("ep");
let ctx = RequestMonitoring::test();
let err = jwk_cache
.check_jwt(&ctx, ep, &role, &fetch, &bad_jwt)
.await
.unwrap_err();
assert!(
matches!(err, JwtError::Signature(_)),
"expected \"signature error\", got {err:?}"
);
}
#[tokio::test]
async fn check_jwt_unknown_role() {
let (key, jwk) = new_rsa_jwk(RS1, "1".into());
let jwt = new_rsa_jwt("1".into(), key);
let jwks = jose_jwk::JwkSet { keys: vec![jwk] };
let jwks_addr = jwks_server(move |path| match path {
"/" => Some(serde_json::to_vec(&jwks).unwrap()),
_ => None,
})
.await;
let role = RoleName::from("authenticated");
let rules = vec![AuthRule {
id: String::new(),
jwks_url: format!("http://{jwks_addr}/").parse().unwrap(),
audience: None,
role_names: vec![RoleNameInt::from(&role)],
}];
let fetch = Fetch(rules);
let jwk_cache = JwkCache::default();
let ep = EndpointId::from("ep");
// this role_name is not accepted
let bad_role_name = RoleName::from("cloud_admin");
let ctx = RequestMonitoring::test();
let err = jwk_cache
.check_jwt(&ctx, ep, &bad_role_name, &fetch, &jwt)
.await
.unwrap_err();
assert!(
matches!(err, JwtError::JwkNotFound),
"expected \"jwk not found\", got {err:?}"
);
}
#[tokio::test]
async fn check_jwt_invalid_claims() {
let (key, jwk) = new_ec_jwk("1".into());
let jwks = jose_jwk::JwkSet { keys: vec![jwk] };
let jwks_addr = jwks_server(move |path| match path {
"/" => Some(serde_json::to_vec(&jwks).unwrap()),
_ => None,
})
.await;
let now = SystemTime::now()
.duration_since(SystemTime::UNIX_EPOCH)
.unwrap()
.as_secs();
struct Test {
body: serde_json::Value,
error: JwtClaimsError,
}
let table = vec![
Test {
body: json! {{
"nbf": now + 60,
"aud": "neon",
}},
error: JwtClaimsError::JwtTokenNotYetReadyToUse,
},
Test {
body: json! {{
"exp": now - 60,
"aud": ["neon"],
}},
error: JwtClaimsError::JwtTokenHasExpired,
},
Test {
body: json! {{
}},
error: JwtClaimsError::InvalidJwtTokenAudience,
},
Test {
body: json! {{
"aud": [],
}},
error: JwtClaimsError::InvalidJwtTokenAudience,
},
Test {
body: json! {{
"aud": "foo",
}},
error: JwtClaimsError::InvalidJwtTokenAudience,
},
Test {
body: json! {{
"aud": ["foo"],
}},
error: JwtClaimsError::InvalidJwtTokenAudience,
},
Test {
body: json! {{
"aud": ["foo", "bar"],
}},
error: JwtClaimsError::InvalidJwtTokenAudience,
},
];
let role = RoleName::from("authenticated");
let rules = vec![AuthRule {
id: String::new(),
jwks_url: format!("http://{jwks_addr}/").parse().unwrap(),
audience: Some("neon".to_string()),
role_names: vec![RoleNameInt::from(&role)],
}];
let fetch = Fetch(rules);
let jwk_cache = JwkCache::default();
let ep = EndpointId::from("ep");
let ctx = RequestMonitoring::test();
for test in table {
let jwt = new_custom_ec_jwt("1".into(), &key, test.body);
match jwk_cache
.check_jwt(&ctx, ep.clone(), &role, &fetch, &jwt)
.await
{
Err(JwtError::InvalidClaims(error)) if error == test.error => {}
Err(err) => {
panic!("expected {:?}, got {err:?}", test.error)
}
Ok(_payload) => {
panic!("expected {:?}, got ok", test.error)
}
}
}
}
}


@@ -137,6 +137,9 @@ struct ProxyCliArgs {
/// size of the threadpool for password hashing
#[clap(long, default_value_t = 4)]
scram_thread_pool_size: u8,
/// Disable dynamic rate limiter and store the metrics to ensure its production behaviour.
#[clap(long, default_value_t = true, value_parser = clap::builder::BoolishValueParser::new(), action = clap::ArgAction::Set)]
disable_dynamic_rate_limiter: bool,
/// Endpoint rate limiter max number of requests per second.
///
/// Provided in the form `<Requests Per Second>@<Bucket Duration Size>`.
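/// For example, `300@1s` would mean at most 300 requests per one-second bucket.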
@@ -612,6 +615,9 @@ fn build_config(args: &ProxyCliArgs) -> anyhow::Result<&'static ProxyConfig> {
and metric-collection-interval must be specified"
),
};
if !args.disable_dynamic_rate_limiter {
bail!("dynamic rate limiter should be disabled");
}
let config::ConcurrencyLockOptions {
shards,


@@ -1,5 +0,0 @@
use http::StatusCode;
pub trait HttpCodeError {
fn get_http_status_code(&self) -> StatusCode;
}


@@ -6,7 +6,6 @@ mod backend;
pub mod cancel_set;
mod conn_pool;
mod conn_pool_lib;
mod error;
mod http_conn_pool;
mod http_util;
mod json;
@@ -33,7 +32,6 @@ use hyper_util::rt::TokioExecutor;
use hyper_util::server::conn::auto::Builder;
use rand::rngs::StdRng;
use rand::SeedableRng;
use sql_over_http::{uuid_to_header_value, NEON_REQUEST_ID};
use tokio::io::{AsyncRead, AsyncWrite};
use tokio::net::{TcpListener, TcpStream};
use tokio::time::timeout;
@@ -311,18 +309,7 @@ async fn connection_handler(
hyper_util::rt::TokioIo::new(conn),
hyper::service::service_fn(move |req: hyper::Request<Incoming>| {
// First HTTP request shares the same session ID
let mut session_id = session_id.take().unwrap_or_else(uuid::Uuid::new_v4);
if matches!(backend.auth_backend, crate::auth::Backend::Local(_)) {
// take session_id from request, if given.
if let Some(id) = req
.headers()
.get(&NEON_REQUEST_ID)
.and_then(|id| uuid::Uuid::try_parse_ascii(id.as_bytes()).ok())
{
session_id = id;
}
}
let session_id = session_id.take().unwrap_or_else(uuid::Uuid::new_v4);
// Cancel the current inflight HTTP request if the request stream is closed.
// This is slightly different to `_cancel_connection` in that
@@ -348,15 +335,8 @@ async fn connection_handler(
.map_ok_or_else(api_error_into_response, |r| r),
);
async move {
let mut res = handler.await;
let res = handler.await;
cancel_request.disarm();
// add the session ID to the response
if let Ok(resp) = &mut res {
resp.headers_mut()
.append(&NEON_REQUEST_ID, uuid_to_header_value(session_id));
}
res
}
}),


@@ -23,12 +23,10 @@ use typed_json::json;
use url::Url;
use urlencoding;
use utils::http::error::ApiError;
use uuid::Uuid;
use super::backend::{LocalProxyConnError, PoolingBackend};
use super::conn_pool::{AuthData, ConnInfoWithAuth};
use super::conn_pool_lib::{self, ConnInfo};
use super::error::HttpCodeError;
use super::http_util::json_response;
use super::json::{json_to_pg_text, pg_text_row_to_json, JsonConversionError};
use super::local_conn_pool;
@@ -65,8 +63,6 @@ enum Payload {
Batch(BatchQueryData),
}
pub(super) static NEON_REQUEST_ID: HeaderName = HeaderName::from_static("neon-request-id");
static CONN_STRING: HeaderName = HeaderName::from_static("neon-connection-string");
static RAW_TEXT_OUTPUT: HeaderName = HeaderName::from_static("neon-raw-text-output");
static ARRAY_MODE: HeaderName = HeaderName::from_static("neon-array-mode");
@@ -239,6 +235,7 @@ fn get_conn_info(
Ok(ConnInfoWithAuth { conn_info, auth })
}
// TODO: return different http error codes
pub(crate) async fn handle(
config: &'static ProxyConfig,
ctx: RequestMonitoring,
@@ -319,8 +316,9 @@ pub(crate) async fn handle(
"forwarding error to user"
);
// TODO: this shouldn't always be bad request.
json_response(
e.get_http_status_code(),
StatusCode::BAD_REQUEST,
json!({
"message": message,
"code": code,
@@ -404,25 +402,6 @@ impl UserFacingError for SqlOverHttpError {
}
}
impl HttpCodeError for SqlOverHttpError {
fn get_http_status_code(&self) -> StatusCode {
match self {
SqlOverHttpError::ReadPayload(_) => StatusCode::BAD_REQUEST,
SqlOverHttpError::ConnectCompute(h) => match h.get_error_kind() {
ErrorKind::User => StatusCode::BAD_REQUEST,
_ => StatusCode::INTERNAL_SERVER_ERROR,
},
SqlOverHttpError::ConnInfo(_) => StatusCode::BAD_REQUEST,
SqlOverHttpError::RequestTooLarge(_) => StatusCode::PAYLOAD_TOO_LARGE,
SqlOverHttpError::ResponseTooLarge(_) => StatusCode::INSUFFICIENT_STORAGE,
SqlOverHttpError::InvalidIsolationLevel => StatusCode::BAD_REQUEST,
SqlOverHttpError::Postgres(_) => StatusCode::BAD_REQUEST,
SqlOverHttpError::JsonConversion(_) => StatusCode::INTERNAL_SERVER_ERROR,
SqlOverHttpError::Cancelled(_) => StatusCode::INTERNAL_SERVER_ERROR,
}
}
}
#[derive(Debug, thiserror::Error)]
pub(crate) enum ReadPayloadError {
#[error("could not read the HTTP request body: {0}")]
@@ -727,12 +706,6 @@ static HEADERS_TO_FORWARD: &[&HeaderName] = &[
&TXN_DEFERRABLE,
];
pub(crate) fn uuid_to_header_value(id: Uuid) -> HeaderValue {
let mut uuid = [0; uuid::fmt::Hyphenated::LENGTH];
HeaderValue::from_str(id.as_hyphenated().encode_lower(&mut uuid[..]))
.expect("uuid hyphenated format should be all valid header characters")
}
async fn handle_auth_broker_inner(
ctx: &RequestMonitoring,
request: Request<Incoming>,
@@ -759,7 +732,6 @@ async fn handle_auth_broker_inner(
req = req.header(h, hv);
}
}
req = req.header(&NEON_REQUEST_ID, uuid_to_header_value(ctx.session_id()));
let req = req
.body(body)


@@ -23,7 +23,7 @@ backoff = "^2.2.1"
pytest-lazy-fixture = "^0.6.3"
prometheus-client = "^0.14.1"
pytest-timeout = "^2.1.0"
Werkzeug = "^3.0.6"
Werkzeug = "^3.0.3"
pytest-order = "^1.1.0"
allure-pytest = "^2.13.2"
pytest-asyncio = "^0.21.0"
@@ -42,9 +42,6 @@ pytest-repeat = "^0.9.3"
websockets = "^12.0"
clickhouse-connect = "^0.7.16"
kafka-python = "^2.0.2"
jwcrypto = "^1.5.6"
h2 = "^4.1.0"
types-jwcrypto = "^1.5.0.20240925"
[tool.poetry.group.dev.dependencies]
mypy = "==1.3.0"


@@ -193,8 +193,6 @@ struct Args {
/// Usually, a timeline has to wait for `partial_backup_timeout` before becoming eligible for eviction,
/// but if a timeline is un-evicted and then _not_ written to, it would immediately flap to evicting again,
/// if it weren't for `eviction_min_resident` preventing that.
///
/// Also defines interval for eviction retries.
#[arg(long, value_parser = humantime::parse_duration, default_value = DEFAULT_EVICTION_MIN_RESIDENT)]
eviction_min_resident: Duration,
}


@@ -14,10 +14,12 @@ use std::path::Path;
use std::time::Instant;
use crate::control_file_upgrade::downgrade_v9_to_v8;
use crate::control_file_upgrade::upgrade_control_file;
use crate::metrics::PERSIST_CONTROL_FILE_SECONDS;
use crate::state::{EvictionState, TimelinePersistentState};
use utils::bin_ser::LeSer;
use crate::{control_file_upgrade::upgrade_control_file, timeline::get_timeline_dir};
use utils::{bin_ser::LeSer, id::TenantTimelineId};
use crate::SafeKeeperConf;
pub const SK_MAGIC: u32 = 0xcafeceefu32;
pub const SK_FORMAT_VERSION: u32 = 9;
@@ -52,12 +54,13 @@ pub struct FileStorage {
impl FileStorage {
/// Initialize storage by loading state from disk.
pub fn restore_new(timeline_dir: &Utf8Path, no_sync: bool) -> Result<FileStorage> {
let state = Self::load_control_file_from_dir(timeline_dir)?;
pub fn restore_new(ttid: &TenantTimelineId, conf: &SafeKeeperConf) -> Result<FileStorage> {
let timeline_dir = get_timeline_dir(conf, ttid);
let state = Self::load_control_file_from_dir(&timeline_dir)?;
Ok(FileStorage {
timeline_dir: timeline_dir.to_path_buf(),
no_sync,
timeline_dir,
no_sync: conf.no_sync,
state,
last_persist_at: Instant::now(),
})
@@ -68,16 +71,16 @@ impl FileStorage {
/// Note: we normally call this in temp directory for atomic init, so
/// interested in FileStorage as a result only in tests.
pub async fn create_new(
timeline_dir: &Utf8Path,
dir: Utf8PathBuf,
conf: &SafeKeeperConf,
state: TimelinePersistentState,
no_sync: bool,
) -> Result<FileStorage> {
// we don't support creating new timelines in offloaded state
assert!(matches!(state.eviction_state, EvictionState::Present));
let mut store = FileStorage {
timeline_dir: timeline_dir.to_path_buf(),
no_sync,
timeline_dir: dir,
no_sync: conf.no_sync,
state: state.clone(),
last_persist_at: Instant::now(),
};
@@ -236,46 +239,89 @@ mod test {
use tokio::fs;
use utils::lsn::Lsn;
const NO_SYNC: bool = true;
fn stub_conf() -> SafeKeeperConf {
let workdir = camino_tempfile::tempdir().unwrap().into_path();
SafeKeeperConf {
workdir,
..SafeKeeperConf::dummy()
}
}
#[tokio::test]
async fn test_read_write_safekeeper_state() -> anyhow::Result<()> {
let tempdir = camino_tempfile::tempdir()?;
let mut state = TimelinePersistentState::empty();
let mut storage = FileStorage::create_new(tempdir.path(), state.clone(), NO_SYNC).await?;
async fn load_from_control_file(
conf: &SafeKeeperConf,
ttid: &TenantTimelineId,
) -> Result<(FileStorage, TimelinePersistentState)> {
let timeline_dir = get_timeline_dir(conf, ttid);
fs::create_dir_all(&timeline_dir)
.await
.expect("failed to create timeline dir");
Ok((
FileStorage::restore_new(ttid, conf)?,
FileStorage::load_control_file_from_dir(&timeline_dir)?,
))
}
// Make a change.
state.commit_lsn = Lsn(42);
storage.persist(&state).await?;
// Reload the state. It should match the previously persisted state.
let loaded_state = FileStorage::load_control_file_from_dir(tempdir.path())?;
assert_eq!(loaded_state, state);
Ok(())
async fn create(
conf: &SafeKeeperConf,
ttid: &TenantTimelineId,
) -> Result<(FileStorage, TimelinePersistentState)> {
let timeline_dir = get_timeline_dir(conf, ttid);
fs::create_dir_all(&timeline_dir)
.await
.expect("failed to create timeline dir");
let state = TimelinePersistentState::empty();
let storage = FileStorage::create_new(timeline_dir, conf, state.clone()).await?;
Ok((storage, state))
}
#[tokio::test]
async fn test_safekeeper_state_checksum_mismatch() -> anyhow::Result<()> {
let tempdir = camino_tempfile::tempdir()?;
let mut state = TimelinePersistentState::empty();
let mut storage = FileStorage::create_new(tempdir.path(), state.clone(), NO_SYNC).await?;
// Make a change.
state.commit_lsn = Lsn(42);
storage.persist(&state).await?;
// Change the first byte to fail checksum validation.
let ctrl_path = tempdir.path().join(CONTROL_FILE_NAME);
let mut data = fs::read(&ctrl_path).await?;
data[0] += 1;
fs::write(&ctrl_path, &data).await?;
// Loading the file should fail checksum validation.
if let Err(err) = FileStorage::load_control_file_from_dir(tempdir.path()) {
assert!(err.to_string().contains("control file checksum mismatch"))
} else {
panic!("expected checksum error")
async fn test_read_write_safekeeper_state() {
let conf = stub_conf();
let ttid = TenantTimelineId::generate();
{
let (mut storage, mut state) =
create(&conf, &ttid).await.expect("failed to create state");
// change something
state.commit_lsn = Lsn(42);
storage
.persist(&state)
.await
.expect("failed to persist state");
}
let (_, state) = load_from_control_file(&conf, &ttid)
.await
.expect("failed to read state");
assert_eq!(state.commit_lsn, Lsn(42));
}
#[tokio::test]
async fn test_safekeeper_state_checksum_mismatch() {
let conf = stub_conf();
let ttid = TenantTimelineId::generate();
{
let (mut storage, mut state) =
create(&conf, &ttid).await.expect("failed to read state");
// change something
state.commit_lsn = Lsn(42);
storage
.persist(&state)
.await
.expect("failed to persist state");
}
let control_path = get_timeline_dir(&conf, &ttid).join(CONTROL_FILE_NAME);
let mut data = fs::read(&control_path).await.unwrap();
data[0] += 1; // change the first byte of the file to fail checksum validation
fs::write(&control_path, &data)
.await
.expect("failed to write control file");
match load_from_control_file(&conf, &ttid).await {
Err(err) => assert!(err
.to_string()
.contains("safekeeper control file checksum mismatch")),
Ok(_) => panic!("expected error"),
}
Ok(())
}
}


@@ -154,7 +154,7 @@ pub async fn handle_request(request: Request) -> Result<()> {
new_state.peer_horizon_lsn = request.until_lsn;
new_state.backup_lsn = new_backup_lsn;
FileStorage::create_new(&tli_dir_path, new_state.clone(), conf.no_sync).await?;
FileStorage::create_new(tli_dir_path.clone(), conf, new_state.clone()).await?;
// now we have a ready timeline in a temp directory
validate_temp_timeline(conf, request.destination_ttid, &tli_dir_path).await?;


@@ -262,6 +262,14 @@ async fn timeline_snapshot_handler(request: Request<Body>) -> Result<Response<Bo
check_permission(&request, Some(ttid.tenant_id))?;
let tli = GlobalTimelines::get(ttid).map_err(ApiError::from)?;
// Note: with evicted timelines this should work better than de-evicting them and
// streaming; probably start_snapshot would copy the partial s3 file to the dest path
// and stream control file, or return WalResidentTimeline if timeline is not
// evicted.
let tli = tli
.wal_residence_guard()
.await
.map_err(ApiError::InternalServerError)?;
// To stream the body use wrap_stream which wants Stream of Result<Bytes>,
// so create the chan and write to it in another task.


@@ -113,7 +113,6 @@ impl SafeKeeperConf {
impl SafeKeeperConf {
#[cfg(test)]
#[allow(unused)]
fn dummy() -> Self {
SafeKeeperConf {
workdir: Utf8PathBuf::from("./"),


@@ -8,7 +8,6 @@ use serde::{Deserialize, Serialize};
use std::{
cmp::min,
io::{self, ErrorKind},
sync::Arc,
};
use tokio::{fs::OpenOptions, io::AsyncWrite, sync::mpsc, task};
use tokio_tar::{Archive, Builder, Header};
@@ -26,8 +25,8 @@ use crate::{
routes::TimelineStatus,
},
safekeeper::Term,
state::{EvictionState, TimelinePersistentState},
timeline::{Timeline, WalResidentTimeline},
state::TimelinePersistentState,
timeline::WalResidentTimeline,
timelines_global_map::{create_temp_timeline_dir, validate_temp_timeline},
wal_backup,
wal_storage::open_wal_file,
@@ -44,33 +43,18 @@ use utils::{
/// Stream tar archive of timeline to tx.
#[instrument(name = "snapshot", skip_all, fields(ttid = %tli.ttid))]
pub async fn stream_snapshot(
tli: Arc<Timeline>,
tli: WalResidentTimeline,
source: NodeId,
destination: NodeId,
tx: mpsc::Sender<Result<Bytes>>,
) {
match tli.try_wal_residence_guard().await {
Err(e) => {
tx.send(Err(anyhow!("Error checking residence: {:#}", e)))
.await
.ok();
}
Ok(maybe_resident_tli) => {
if let Err(e) = match maybe_resident_tli {
Some(resident_tli) => {
stream_snapshot_resident_guts(resident_tli, source, destination, tx.clone())
.await
}
None => stream_snapshot_offloaded_guts(tli, source, destination, tx.clone()).await,
} {
// Error type/contents don't matter as they can't reach the client
// (hyper likely doesn't do anything with it), but http stream will be
// prematurely terminated. It would be nice to try to send the error in
// trailers though.
tx.send(Err(anyhow!("snapshot failed"))).await.ok();
error!("snapshot failed: {:#}", e);
}
}
if let Err(e) = stream_snapshot_guts(tli, source, destination, tx.clone()).await {
// Error type/contents don't matter as they can't reach the client
// (hyper likely doesn't do anything with it), but http stream will be
// prematurely terminated. It would be nice to try to send the error in
// trailers though.
tx.send(Err(anyhow!("snapshot failed"))).await.ok();
error!("snapshot failed: {:#}", e);
}
}
@@ -96,10 +80,12 @@ impl Drop for SnapshotContext {
}
}
/// Build a tokio_tar stream that sends encoded bytes into a Bytes channel.
fn prepare_tar_stream(
pub async fn stream_snapshot_guts(
tli: WalResidentTimeline,
source: NodeId,
destination: NodeId,
tx: mpsc::Sender<Result<Bytes>>,
) -> tokio_tar::Builder<impl AsyncWrite + Unpin + Send> {
) -> Result<()> {
// tokio-tar wants Write implementor, but we have mpsc tx <Result<Bytes>>;
// use SinkWriter as a Write impl. That is,
// - create Sink from the tx. It returns PollSendError if chan is closed.
@@ -114,38 +100,12 @@ fn prepare_tar_stream(
// - SinkWriter (not surprisingly) wants sink of &[u8], not bytes, so wrap
// into CopyToBytes. This is a data copy.
let copy_to_bytes = CopyToBytes::new(oksink);
let writer = SinkWriter::new(copy_to_bytes);
let pinned_writer = Box::pin(writer);
let mut writer = SinkWriter::new(copy_to_bytes);
let pinned_writer = std::pin::pin!(writer);
// Note that tokio_tar append_* funcs use tokio::io::copy with 8KB buffer
// which is also likely suboptimal.
Builder::new_non_terminated(pinned_writer)
}
/// Implementation of snapshot for an offloaded timeline, only reads control file
pub(crate) async fn stream_snapshot_offloaded_guts(
tli: Arc<Timeline>,
source: NodeId,
destination: NodeId,
tx: mpsc::Sender<Result<Bytes>>,
) -> Result<()> {
let mut ar = prepare_tar_stream(tx);
tli.snapshot_offloaded(&mut ar, source, destination).await?;
ar.finish().await?;
Ok(())
}
/// Implementation of snapshot for a timeline which is resident (includes some segment data)
pub async fn stream_snapshot_resident_guts(
tli: WalResidentTimeline,
source: NodeId,
destination: NodeId,
tx: mpsc::Sender<Result<Bytes>>,
) -> Result<()> {
let mut ar = prepare_tar_stream(tx);
let mut ar = Builder::new_non_terminated(pinned_writer);
let bctx = tli.start_snapshot(&mut ar, source, destination).await?;
pausable_failpoint!("sk-snapshot-after-list-pausable");
@@ -178,70 +138,6 @@ pub async fn stream_snapshot_resident_guts(
Ok(())
}
impl Timeline {
/// Simple snapshot for an offloaded timeline: we will only upload a renamed partial segment and
/// pass a modified control file into the provided tar stream (nothing with data segments on disk, since
/// we are offloaded and there aren't any)
async fn snapshot_offloaded<W: AsyncWrite + Unpin + Send>(
self: &Arc<Timeline>,
ar: &mut tokio_tar::Builder<W>,
source: NodeId,
destination: NodeId,
) -> Result<()> {
// Take initial copy of control file, then release state lock
let mut control_file = {
let shared_state = self.write_shared_state().await;
let control_file = TimelinePersistentState::clone(shared_state.sk.state());
// Rare race: we got unevicted between entering function and reading control file.
// We error out and let API caller retry.
if !matches!(control_file.eviction_state, EvictionState::Offloaded(_)) {
bail!("Timeline was un-evicted during snapshot, please retry");
}
control_file
};
// Modify the partial segment of the in-memory copy for the control file to
// point to the destination safekeeper.
let replace = control_file
.partial_backup
.replace_uploaded_segment(source, destination)?;
let Some(replace) = replace else {
// In Manager::ready_for_eviction, we do not permit eviction unless the timeline
// has a partial segment, so it is unexpected for it to be missing here.
anyhow::bail!("Timeline has no partial segment, cannot generate snapshot");
};
tracing::info!("Replacing uploaded partial segment in in-mem control file: {replace:?}");
// Optimistically try to copy the partial segment to the destination's path: this
// can fail if the timeline was un-evicted and modified in the background.
let remote_timeline_path = &self.remote_path;
wal_backup::copy_partial_segment(
&replace.previous.remote_path(remote_timeline_path),
&replace.current.remote_path(remote_timeline_path),
)
.await?;
// Since the S3 copy succeeded with the path given in our control file snapshot, and
// we are sending that snapshot in our response, we are giving the caller a consistent
// snapshot even if our local Timeline was unevicted or otherwise modified in the meantime.
let buf = control_file
.write_to_buf()
.with_context(|| "failed to serialize control store")?;
let mut header = Header::new_gnu();
header.set_size(buf.len().try_into().expect("never breaches u64"));
ar.append_data(&mut header, CONTROL_FILE_NAME, buf.as_slice())
.await
.with_context(|| "failed to append to archive")?;
Ok(())
}
}
impl WalResidentTimeline {
/// Start streaming tar archive with timeline:
/// 1) stream control file under lock;


@@ -21,15 +21,18 @@ use postgres_backend::QueryError;
use pq_proto::BeMessage;
use serde::Deserialize;
use serde::Serialize;
use std::future;
use std::net::SocketAddr;
use std::sync::Arc;
use tokio::io::AsyncRead;
use tokio::io::AsyncWrite;
use tokio::sync::mpsc::{channel, Receiver, Sender};
use tokio::sync::mpsc::channel;
use tokio::sync::mpsc::error::TryRecvError;
use tokio::sync::mpsc::Receiver;
use tokio::sync::mpsc::Sender;
use tokio::task;
use tokio::task::JoinHandle;
use tokio::time::{Duration, MissedTickBehavior};
use tokio::time::Duration;
use tokio::time::Instant;
use tracing::*;
use utils::id::TenantTimelineId;
use utils::lsn::Lsn;
@@ -441,9 +444,9 @@ async fn network_write<IO: AsyncRead + AsyncWrite + Unpin>(
}
}
/// The WAL flush interval. This ensures we periodically flush the WAL and send AppendResponses to
/// walproposer, even when it's writing a steady stream of messages.
const FLUSH_INTERVAL: Duration = Duration::from_secs(1);
// Send keepalive messages to walproposer, to make sure it receives updates
// even when it writes a steady stream of messages.
const KEEPALIVE_INTERVAL: Duration = Duration::from_secs(1);
/// Encapsulates a task which takes messages from msg_rx, processes and pushes
/// replies to reply_tx.
@@ -491,76 +494,67 @@ impl WalAcceptor {
async fn run(&mut self) -> anyhow::Result<()> {
let walreceiver_guard = self.tli.get_walreceivers().register(self.conn_id);
// Periodically flush the WAL.
let mut flush_ticker = tokio::time::interval(FLUSH_INTERVAL);
flush_ticker.set_missed_tick_behavior(MissedTickBehavior::Delay);
flush_ticker.tick().await; // skip the initial, immediate tick
// After this timestamp we will stop processing AppendRequests and send a response
// to the walproposer. walproposer sends at least one AppendRequest per second,
// we will send keepalives by replying to these requests once per second.
let mut next_keepalive = Instant::now();
// Tracks unflushed appends.
let mut dirty = false;
while let Some(mut next_msg) = self.msg_rx.recv().await {
// Update walreceiver state in shmem for reporting.
if let ProposerAcceptorMessage::Elected(_) = &next_msg {
walreceiver_guard.get().status = WalReceiverStatus::Streaming;
}
loop {
let reply = tokio::select! {
// Process inbound message.
msg = self.msg_rx.recv() => {
// If disconnected, break to flush WAL and return.
let Some(mut msg) = msg else {
break;
};
// Update walreceiver state in shmem for reporting.
if let ProposerAcceptorMessage::Elected(_) = &msg {
walreceiver_guard.get().status = WalReceiverStatus::Streaming;
}
// Don't flush the WAL on every append, only periodically via flush_ticker.
// This batches multiple appends per fsync. If the channel is empty after
// sending the reply, we'll schedule an immediate flush.
if let ProposerAcceptorMessage::AppendRequest(append_request) = msg {
msg = ProposerAcceptorMessage::NoFlushAppendRequest(append_request);
dirty = true;
}
self.tli.process_msg(&msg).await?
}
// While receiving AppendRequests, flush the WAL periodically and respond with an
// AppendResponse to let walproposer know we're still alive.
_ = flush_ticker.tick(), if dirty => {
dirty = false;
self.tli
.process_msg(&ProposerAcceptorMessage::FlushWAL)
.await?
}
// If there are no pending messages, flush the WAL immediately.
let reply_msg = if matches!(next_msg, ProposerAcceptorMessage::AppendRequest(_)) {
// Loop through AppendRequests while available to write as many WAL records as
// possible without fsyncing.
//
// TODO: this should be done via flush_ticker.reset_immediately(), but that's always
// delayed by 1ms due to this bug: https://github.com/tokio-rs/tokio/issues/6866.
_ = future::ready(()), if dirty && self.msg_rx.is_empty() => {
dirty = false;
flush_ticker.reset();
self.tli
.process_msg(&ProposerAcceptorMessage::FlushWAL)
.await?
// Make sure the WAL is flushed before returning, see:
// https://github.com/neondatabase/neon/issues/9259
//
// Note: this will need to be rewritten if we want to read non-AppendRequest messages here.
// Otherwise, we might end up in a situation where we read a message, but don't
// process it.
while let ProposerAcceptorMessage::AppendRequest(append_request) = next_msg {
let noflush_msg = ProposerAcceptorMessage::NoFlushAppendRequest(append_request);
if let Some(reply) = self.tli.process_msg(&noflush_msg).await? {
if self.reply_tx.send(reply).await.is_err() {
break; // disconnected, flush WAL and return on next send/recv
}
}
// get out of this loop if keepalive time is reached
if Instant::now() >= next_keepalive {
break;
}
// continue pulling AppendRequests if available
match self.msg_rx.try_recv() {
Ok(msg) => next_msg = msg,
Err(TryRecvError::Empty) => break,
// on disconnect, flush WAL and return on next send/recv
Err(TryRecvError::Disconnected) => break,
};
}
// flush all written WAL to the disk
self.tli
.process_msg(&ProposerAcceptorMessage::FlushWAL)
.await?
} else {
// process message other than AppendRequest
self.tli.process_msg(&next_msg).await?
};
// Send reply, if any.
if let Some(reply) = reply {
if let Some(reply) = reply_msg {
if self.reply_tx.send(reply).await.is_err() {
break; // disconnected, break to flush WAL and return
return Ok(()); // chan closed, streaming terminated
}
// reset keepalive time
next_keepalive = Instant::now() + KEEPALIVE_INTERVAL;
}
}
// Flush WAL on disconnect, see https://github.com/neondatabase/neon/issues/9259.
if dirty {
self.tli
.process_msg(&ProposerAcceptorMessage::FlushWAL)
.await?;
}
Ok(())
}
}
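In short, the new loop batches consecutive AppendRequests without fsync and performs a single flush once the channel drains or the one-second keepalive deadline passes; the AppendResponse sent after processing doubles as the keepalive. A minimal, self-contained sketch of the pattern (message and processing types are simplified stand-ins, not the safekeeper's real API):

use tokio::sync::mpsc::{self, error::TryRecvError};
use tokio::time::{Duration, Instant};

const KEEPALIVE_INTERVAL: Duration = Duration::from_secs(1);

enum Msg {
    Append(Vec<u8>),
    Other,
}

async fn process_without_flush(_m: &Msg) { /* write WAL, no fsync */ }
async fn process(_m: &Msg) { /* handle a non-append message */ }
async fn flush() { /* single fsync covering the whole batch */ }

async fn accept_loop(mut rx: mpsc::Receiver<Msg>) {
    while let Some(mut next) = rx.recv().await {
        if let Msg::Append(_) = next {
            let next_keepalive = Instant::now() + KEEPALIVE_INTERVAL;
            // Keep pulling appends while they are immediately available, so one
            // fsync can cover many records; stop at the keepalive deadline so the
            // sender still gets a timely response.
            loop {
                process_without_flush(&next).await;
                if Instant::now() >= next_keepalive {
                    break;
                }
                match rx.try_recv() {
                    Ok(msg @ Msg::Append(_)) => next = msg,
                    // NB: as the hunk above notes, a non-append message read here
                    // would be dropped; empty or closed channels also end the batch.
                    Ok(_) | Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
                }
            }
            flush().await;
        } else {
            process(&next).await;
        }
    }
}

Replies and error handling are omitted; in the real code each processed message may produce a reply that is sent back on reply_tx.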

View File

@@ -143,8 +143,8 @@ impl TimelinePersistentState {
TimelinePersistentState::new(
&TenantTimelineId::empty(),
ServerInfo {
pg_version: 170000, /* Postgres server version (major * 10000) */
system_id: 0, /* Postgres system identifier */
pg_version: 17, /* Postgres server version */
system_id: 0, /* Postgres system identifier */
wal_seg_size: 16 * 1024 * 1024,
},
vec![],

View File

@@ -328,19 +328,15 @@ impl SharedState {
/// Restore SharedState from control file. If file doesn't exist, bails out.
fn restore(conf: &SafeKeeperConf, ttid: &TenantTimelineId) -> Result<Self> {
let timeline_dir = get_timeline_dir(conf, ttid);
let control_store = control_file::FileStorage::restore_new(&timeline_dir, conf.no_sync)?;
let control_store = control_file::FileStorage::restore_new(ttid, conf)?;
if control_store.server.wal_seg_size == 0 {
bail!(TimelineError::UninitializedWalSegSize(*ttid));
}
let sk = match control_store.eviction_state {
EvictionState::Present => {
let wal_store = wal_storage::PhysicalStorage::new(
ttid,
&timeline_dir,
&control_store,
conf.no_sync,
)?;
let wal_store =
wal_storage::PhysicalStorage::new(ttid, timeline_dir, conf, &control_store)?;
StateSK::Loaded(SafeKeeper::new(
TimelineState::new(control_store),
wal_store,
@@ -797,17 +793,14 @@ impl Timeline {
state.sk.term_bump(to).await
}
/// Guts of [`Self::wal_residence_guard`] and [`Self::try_wal_residence_guard`]
async fn do_wal_residence_guard(
self: &Arc<Self>,
block: bool,
) -> Result<Option<WalResidentTimeline>> {
let op_label = if block {
"wal_residence_guard"
} else {
"try_wal_residence_guard"
};
/// Get the timeline guard for reading/writing WAL files.
/// If WAL files are not present on disk (evicted), they will be automatically
/// downloaded from remote storage. This is done in the manager task, which is
/// responsible for issuing all guards.
///
/// NB: don't use this function from timeline_manager, it will deadlock.
/// NB: don't use this function while holding shared_state lock.
pub async fn wal_residence_guard(self: &Arc<Self>) -> Result<WalResidentTimeline> {
if self.is_cancelled() {
bail!(TimelineError::Cancelled(self.ttid));
}
@@ -819,13 +812,10 @@ impl Timeline {
// Wait 30 seconds for the guard to be acquired. It can time out if someone is
// holding the lock (e.g. during `SafeKeeper::process_msg()`) or manager task
// is stuck.
let res = tokio::time::timeout_at(started_at + Duration::from_secs(30), async {
if block {
self.manager_ctl.wal_residence_guard().await.map(Some)
} else {
self.manager_ctl.try_wal_residence_guard().await
}
})
let res = tokio::time::timeout_at(
started_at + Duration::from_secs(30),
self.manager_ctl.wal_residence_guard(),
)
.await;
let guard = match res {
@@ -833,14 +823,14 @@ impl Timeline {
let finished_at = Instant::now();
let elapsed = finished_at - started_at;
MISC_OPERATION_SECONDS
.with_label_values(&[op_label])
.with_label_values(&["wal_residence_guard"])
.observe(elapsed.as_secs_f64());
guard
}
Ok(Err(e)) => {
warn!(
"error acquiring in {op_label}, statuses {:?} => {:?}",
"error while acquiring WalResidentTimeline guard, statuses {:?} => {:?}",
status_before,
self.mgr_status.get()
);
@@ -848,7 +838,7 @@ impl Timeline {
}
Err(_) => {
warn!(
"timeout acquiring in {op_label} guard, statuses {:?} => {:?}",
"timeout while acquiring WalResidentTimeline guard, statuses {:?} => {:?}",
status_before,
self.mgr_status.get()
);
@@ -856,28 +846,7 @@ impl Timeline {
}
};
Ok(guard.map(|g| WalResidentTimeline::new(self.clone(), g)))
}
/// Get the timeline guard for reading/writing WAL files.
/// If WAL files are not present on disk (evicted), they will be automatically
/// downloaded from remote storage. This is done in the manager task, which is
/// responsible for issuing all guards.
///
/// NB: don't use this function from timeline_manager, it will deadlock.
/// NB: don't use this function while holding shared_state lock.
pub async fn wal_residence_guard(self: &Arc<Self>) -> Result<WalResidentTimeline> {
self.do_wal_residence_guard(true)
.await
.map(|m| m.expect("Always get Some in block=true mode"))
}
/// Get the timeline guard for reading/writing WAL files if the timeline is resident,
/// else return None
pub(crate) async fn try_wal_residence_guard(
self: &Arc<Self>,
) -> Result<Option<WalResidentTimeline>> {
self.do_wal_residence_guard(false).await
Ok(WalResidentTimeline::new(self.clone(), guard))
}
pub async fn backup_partial_reset(self: &Arc<Self>) -> Result<Vec<String>> {
@@ -1077,9 +1046,9 @@ impl ManagerTimeline {
// trying to restore WAL storage
let wal_store = wal_storage::PhysicalStorage::new(
&self.ttid,
&self.timeline_dir,
self.timeline_dir.clone(),
&conf,
shared.sk.state(),
conf.no_sync,
)?;
// updating control file
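For the wal_residence_guard hunks above, the shape of the call is a bounded wait on the manager: ask for a guard, give it 30 seconds, and report how it went. A minimal sketch of that bounded acquisition, with a hypothetical request_guard standing in for the manager call:

use tokio::time::{timeout_at, Duration, Instant};

struct Guard;

async fn request_guard() -> anyhow::Result<Guard> {
    // Stand-in for the manager issuing a guard once WAL files are resident.
    Ok(Guard)
}

async fn acquire_with_deadline() -> anyhow::Result<Guard> {
    let started_at = Instant::now();
    match timeout_at(started_at + Duration::from_secs(30), request_guard()).await {
        Ok(Ok(guard)) => Ok(guard),
        Ok(Err(e)) => anyhow::bail!("error while acquiring guard: {e:#}"),
        Err(_) => anyhow::bail!("timed out waiting for guard"),
    }
}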

View File

@@ -56,9 +56,6 @@ impl Manager {
// This also works for the first segment despite last_removed_segno
// being 0 on init because this 0 triggers run of wal_removal_task
// on success of which manager updates the horizon.
//
// **Note** pull_timeline functionality assumes that evicted timelines always have
// a partial segment: if we ever change this condition, must also update that code.
&& self
.partial_backup_uploaded
.as_ref()
@@ -69,15 +66,15 @@ impl Manager {
ready
}
/// Evict the timeline to remote storage. Returns whether the eviction was successful.
/// Evict the timeline to remote storage.
#[instrument(name = "evict_timeline", skip_all)]
pub(crate) async fn evict_timeline(&mut self) -> bool {
pub(crate) async fn evict_timeline(&mut self) {
assert!(!self.is_offloaded);
let partial_backup_uploaded = match &self.partial_backup_uploaded {
Some(p) => p.clone(),
None => {
warn!("no partial backup uploaded, skipping eviction");
return false;
return;
}
};
@@ -94,12 +91,11 @@ impl Manager {
if let Err(e) = do_eviction(self, &partial_backup_uploaded).await {
warn!("failed to evict timeline: {:?}", e);
return false;
return;
}
info!("successfully evicted timeline");
NUM_EVICTED_TIMELINES.inc();
true
}
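Returning a bool here is what lets the manager loop (see the timeline_manager hunk below) push the next eviction attempt out by a randomized delay instead of retrying immediately. A minimal sketch of that jittered rescheduling, using hypothetical helpers rather than the manager's real fields:

use rand::Rng;
use std::time::{Duration, Instant};

fn rand_duration(base: Duration) -> Duration {
    // Jitter in [0, base) so many timelines do not retry eviction at the same instant.
    let micros = rand::thread_rng().gen_range(0..base.as_micros().max(1) as u64);
    Duration::from_micros(micros)
}

fn schedule_retry(next_event: &mut Option<Instant>, min_resident: Duration) {
    let not_before = Instant::now() + rand_duration(min_resident);
    // Keep the earliest pending wakeup.
    *next_event = Some(match *next_event {
        Some(e) => e.min(not_before),
        None => not_before,
    });
}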
/// Attempt to restore evicted timeline from remote storage; it must be

View File

@@ -100,8 +100,6 @@ const REFRESH_INTERVAL: Duration = Duration::from_millis(300);
pub enum ManagerCtlMessage {
/// Request to get a guard for WalResidentTimeline, with WAL files available locally.
GuardRequest(tokio::sync::oneshot::Sender<anyhow::Result<ResidenceGuard>>),
/// Get a guard for WalResidentTimeline if the timeline is not currently offloaded, else None
TryGuardRequest(tokio::sync::oneshot::Sender<Option<ResidenceGuard>>),
/// Request to drop the guard.
GuardDrop(GuardId),
/// Request to reset uploaded partial backup state.
@@ -112,7 +110,6 @@ impl std::fmt::Debug for ManagerCtlMessage {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
ManagerCtlMessage::GuardRequest(_) => write!(f, "GuardRequest"),
ManagerCtlMessage::TryGuardRequest(_) => write!(f, "TryGuardRequest"),
ManagerCtlMessage::GuardDrop(id) => write!(f, "GuardDrop({:?})", id),
ManagerCtlMessage::BackupPartialReset(_) => write!(f, "BackupPartialReset"),
}
@@ -155,19 +152,6 @@ impl ManagerCtl {
.and_then(std::convert::identity)
}
/// Issue a new guard if the timeline is currently not offloaded, else return None
/// Sends a message to the manager and waits for the response.
/// Can be blocked indefinitely if the manager is stuck.
pub async fn try_wal_residence_guard(&self) -> anyhow::Result<Option<ResidenceGuard>> {
let (tx, rx) = tokio::sync::oneshot::channel();
self.manager_tx
.send(ManagerCtlMessage::TryGuardRequest(tx))?;
// wait for the manager to respond with the guard
rx.await
.map_err(|e| anyhow::anyhow!("response read fail: {:?}", e))
}
/// Request timeline manager to reset uploaded partial segment state and
/// wait for the result.
pub async fn backup_partial_reset(&self) -> anyhow::Result<Vec<String>> {
@@ -313,12 +297,7 @@ pub async fn main_task(
match mgr.global_rate_limiter.try_acquire_eviction() {
Some(_permit) => {
mgr.set_status(Status::EvictTimeline);
if !mgr.evict_timeline().await {
// eviction failed, try again later
mgr.evict_not_before =
Instant::now() + rand_duration(&mgr.conf.eviction_min_resident);
update_next_event(&mut next_event, mgr.evict_not_before);
}
mgr.evict_timeline().await;
}
None => {
// we can't evict timeline now, will try again later
@@ -690,17 +669,6 @@ impl Manager {
warn!("failed to reply with a guard, receiver dropped");
}
}
Some(ManagerCtlMessage::TryGuardRequest(tx)) => {
let result = if self.is_offloaded {
None
} else {
Some(self.access_service.create_guard())
};
if tx.send(result).is_err() {
warn!("failed to reply with a guard, receiver dropped");
}
}
Some(ManagerCtlMessage::GuardDrop(guard_id)) => {
self.access_service.drop_guard(guard_id);
}
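The removed TryGuardRequest variant is the usual oneshot request/response pattern: the caller ships a reply channel inside the message and the manager answers on it. A minimal sketch of that pattern with simplified stand-in types:

use tokio::sync::{mpsc, oneshot};

struct ResidenceGuard;

enum CtlMessage {
    // The manager answers with Some(guard) only if the timeline is not offloaded.
    TryGuardRequest(oneshot::Sender<Option<ResidenceGuard>>),
}

async fn try_guard(tx: &mpsc::UnboundedSender<CtlMessage>) -> anyhow::Result<Option<ResidenceGuard>> {
    let (reply_tx, reply_rx) = oneshot::channel();
    tx.send(CtlMessage::TryGuardRequest(reply_tx))
        .map_err(|_| anyhow::anyhow!("manager task is gone"))?;
    // Waits until the manager task picks up the message and replies.
    reply_rx
        .await
        .map_err(|e| anyhow::anyhow!("response read fail: {:?}", e))
}

fn handle(msg: CtlMessage, is_offloaded: bool) {
    match msg {
        CtlMessage::TryGuardRequest(reply) => {
            let guard = if is_offloaded { None } else { Some(ResidenceGuard) };
            // An Err here just means the requester gave up (e.g. timed out).
            let _ = reply.send(guard);
        }
    }
}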

View File

@@ -244,7 +244,7 @@ impl GlobalTimelines {
// immediately initialize first WAL segment as well.
let state =
TimelinePersistentState::new(&ttid, server_info, vec![], commit_lsn, local_start_lsn)?;
control_file::FileStorage::create_new(&tmp_dir_path, state, conf.no_sync).await?;
control_file::FileStorage::create_new(tmp_dir_path.clone(), &conf, state).await?;
let timeline = GlobalTimelines::load_temp_timeline(ttid, &tmp_dir_path, true).await?;
Ok(timeline)
}
@@ -596,7 +596,7 @@ pub async fn validate_temp_timeline(
bail!("wal_seg_size is not set");
}
let wal_store = wal_storage::PhysicalStorage::new(&ttid, path, &control_store, conf.no_sync)?;
let wal_store = wal_storage::PhysicalStorage::new(&ttid, path.clone(), conf, &control_store)?;
let commit_lsn = control_store.commit_lsn;
let flush_lsn = wal_store.flush_lsn();

View File

@@ -29,6 +29,7 @@ use crate::metrics::{
};
use crate::state::TimelinePersistentState;
use crate::wal_backup::{read_object, remote_timeline_path};
use crate::SafeKeeperConf;
use postgres_ffi::waldecoder::WalStreamDecoder;
use postgres_ffi::XLogFileName;
use postgres_ffi::XLOG_BLCKSZ;
@@ -86,9 +87,7 @@ pub trait Storage {
pub struct PhysicalStorage {
metrics: WalStorageMetrics,
timeline_dir: Utf8PathBuf,
/// Disables fsync if true.
no_sync: bool,
conf: SafeKeeperConf,
/// Size of WAL segment in bytes.
wal_seg_size: usize,
@@ -152,9 +151,9 @@ impl PhysicalStorage {
/// the disk. Otherwise, all LSNs are set to zero.
pub fn new(
ttid: &TenantTimelineId,
timeline_dir: &Utf8Path,
timeline_dir: Utf8PathBuf,
conf: &SafeKeeperConf,
state: &TimelinePersistentState,
no_sync: bool,
) -> Result<PhysicalStorage> {
let wal_seg_size = state.server.wal_seg_size as usize;
@@ -199,8 +198,8 @@ impl PhysicalStorage {
Ok(PhysicalStorage {
metrics: WalStorageMetrics::default(),
timeline_dir: timeline_dir.to_path_buf(),
no_sync,
timeline_dir,
conf: conf.clone(),
wal_seg_size,
pg_version: state.server.pg_version,
system_id: state.server.system_id,
@@ -225,7 +224,7 @@ impl PhysicalStorage {
/// Call fdatasync if config requires so.
async fn fdatasync_file(&mut self, file: &File) -> Result<()> {
if !self.no_sync {
if !self.conf.no_sync {
self.metrics
.observe_flush_seconds(time_io_closure(file.sync_data()).await?);
}
@@ -264,7 +263,9 @@ impl PhysicalStorage {
// Note: this doesn't get into observe_flush_seconds metric. But
// segment init should be separate metric, if any.
if let Err(e) = durable_rename(&tmp_path, &wal_file_partial_path, !self.no_sync).await {
if let Err(e) =
durable_rename(&tmp_path, &wal_file_partial_path, !self.conf.no_sync).await
{
// Probably rename succeeded, but fsync of it failed. Remove
// the file then to avoid using it.
remove_file(wal_file_partial_path)
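For context, durable_rename follows the standard crash-safety recipe: rename the temp file into place, then fsync the parent directory so the new directory entry itself survives a crash, skipping the sync when no_sync is set. A minimal, synchronous sketch of the idea (std-only, Unix semantics; not the safekeeper's actual helper):

use std::fs::{rename, File};
use std::io;
use std::path::Path;

fn durable_rename_sketch(tmp: &Path, dst: &Path, do_sync: bool) -> io::Result<()> {
    rename(tmp, dst)?;
    if do_sync {
        // The rename is only durable once the parent directory entry is flushed.
        if let Some(dir) = dst.parent() {
            File::open(dir)?.sync_all()?;
        }
    }
    Ok(())
}

If the directory fsync fails after the rename itself succeeded, the file may or may not be durable, which is why the caller above removes the partial segment rather than trusting it.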

View File

@@ -968,28 +968,6 @@ async fn handle_tenant_shard_migrate(
)
}
async fn handle_tenant_shard_cancel_reconcile(
service: Arc<Service>,
req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
check_permissions(&req, Scope::Admin)?;
let req = match maybe_forward(req).await {
ForwardOutcome::Forwarded(res) => {
return res;
}
ForwardOutcome::NotForwarded(req) => req,
};
let tenant_shard_id: TenantShardId = parse_request_param(&req, "tenant_shard_id")?;
json_response(
StatusCode::OK,
service
.tenant_shard_cancel_reconcile(tenant_shard_id)
.await?,
)
}
async fn handle_tenant_update_policy(req: Request<Body>) -> Result<Response<Body>, ApiError> {
check_permissions(&req, Scope::Admin)?;
@@ -1798,16 +1776,6 @@ pub fn make_router(
RequestName("control_v1_tenant_migrate"),
)
})
.put(
"/control/v1/tenant/:tenant_shard_id/cancel_reconcile",
|r| {
tenant_service_handler(
r,
handle_tenant_shard_cancel_reconcile,
RequestName("control_v1_tenant_cancel_reconcile"),
)
},
)
.put("/control/v1/tenant/:tenant_id/shard_split", |r| {
tenant_service_handler(
r,

View File

@@ -450,9 +450,6 @@ impl Reconciler {
}
}
/// This function does _not_ mutate any state, so it is cancellation safe.
///
/// This function does not respect [`Self::cancel`], callers should handle that.
async fn await_lsn(
&self,
tenant_shard_id: TenantShardId,
@@ -573,10 +570,8 @@ impl Reconciler {
if let Some(baseline) = baseline_lsns {
tracing::info!("🕑 Waiting for LSN to catch up...");
tokio::select! {
r = self.await_lsn(self.tenant_shard_id, &dest_ps, baseline) => {r?;}
_ = self.cancel.cancelled() => {return Err(ReconcileError::Cancel)}
};
self.await_lsn(self.tenant_shard_id, &dest_ps, baseline)
.await?;
}
tracing::info!("🔁 Notifying compute to use pageserver {dest_ps}");
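The removed select! is the standard way to make a long wait abortable: race the operation against the reconciler's cancellation token. A minimal sketch of that shape (the waiting future is a stub here, not the real await_lsn):

use tokio_util::sync::CancellationToken;

async fn wait_for_lsn_stub() -> anyhow::Result<()> {
    // Stand-in for polling the destination pageserver until it catches up.
    Ok(())
}

async fn await_lsn_cancellable(cancel: &CancellationToken) -> anyhow::Result<()> {
    tokio::select! {
        r = wait_for_lsn_stub() => r,
        _ = cancel.cancelled() => anyhow::bail!("reconcile cancelled"),
    }
}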

View File

@@ -3130,11 +3130,9 @@ impl Service {
.await?;
// Propagate the LSN that shard zero picked, if caller didn't provide one
match &mut create_req.mode {
models::TimelineCreateRequestMode::Branch { ancestor_start_lsn, .. } if ancestor_start_lsn.is_none() => {
*ancestor_start_lsn = timeline_info.ancestor_lsn;
},
_ => {}
if create_req.ancestor_timeline_id.is_some() && create_req.ancestor_start_lsn.is_none()
{
create_req.ancestor_start_lsn = timeline_info.ancestor_lsn;
}
// Create timeline on remaining shards with number >0
@@ -4834,43 +4832,6 @@ impl Service {
Ok(TenantShardMigrateResponse {})
}
/// 'cancel' in this context means cancel any ongoing reconcile
pub(crate) async fn tenant_shard_cancel_reconcile(
&self,
tenant_shard_id: TenantShardId,
) -> Result<(), ApiError> {
// Take state lock and fire the cancellation token, after which we drop lock and wait for any ongoing reconcile to complete
let waiter = {
let locked = self.inner.write().unwrap();
let Some(shard) = locked.tenants.get(&tenant_shard_id) else {
return Err(ApiError::NotFound(
anyhow::anyhow!("Tenant shard not found").into(),
));
};
let waiter = shard.get_waiter();
match waiter {
None => {
tracing::info!("Shard does not have an ongoing Reconciler");
return Ok(());
}
Some(waiter) => {
tracing::info!("Cancelling Reconciler");
shard.cancel_reconciler();
waiter
}
}
};
// Cancellation should be prompt. If this fails we have still done our job of firing the
// cancellation token, but by returning an ApiError we will indicate to the caller that
// the Reconciler is misbehaving and not respecting the cancellation token
self.await_waiters(vec![waiter], SHORT_RECONCILE_TIMEOUT)
.await?;
Ok(())
}
/// This is for debug/support only: we simply drop all state for a tenant, without
/// detaching or deleting it on pageservers.
pub(crate) async fn tenant_drop(&self, tenant_id: TenantId) -> Result<(), ApiError> {
@@ -4958,7 +4919,16 @@ impl Service {
stripe_size,
},
placement_policy: Some(PlacementPolicy::Attached(0)), // No secondaries, for convenient debug/hacking
config: TenantConfig::default(),
// There is no way to know what the tenant's config was: revert to defaults
//
// TODO: remove `switch_aux_file_policy` once we finish auxv2 migration
//
// we write to both v1+v2 storage, so that the test case can use either storage format for testing
config: TenantConfig {
switch_aux_file_policy: Some(models::AuxFilePolicy::CrossValidation),
..TenantConfig::default()
},
})
.await?;

View File

@@ -1317,12 +1317,6 @@ impl TenantShard {
})
}
pub(crate) fn cancel_reconciler(&self) {
if let Some(handle) = self.reconciler.as_ref() {
handle.cancel.cancel()
}
}
/// Get a waiter for any reconciliation in flight, but do not start reconciliation
/// if it is not already running
pub(crate) fn get_waiter(&self) -> Option<ReconcilerWaiter> {

View File

@@ -3,7 +3,6 @@ from __future__ import annotations
pytest_plugins = (
"fixtures.pg_version",
"fixtures.parametrize",
"fixtures.h2server",
"fixtures.httpserver",
"fixtures.compute_reconfigure",
"fixtures.storage_controller_proxy",

View File

@@ -1,198 +0,0 @@
"""
https://python-hyper.org/projects/hyper-h2/en/stable/asyncio-example.html
auth-broker -> local-proxy needs an h2 connection, so we need an h2 server :)
"""
import asyncio
import collections
import io
import json
from collections.abc import AsyncIterable
import pytest_asyncio
from h2.config import H2Configuration
from h2.connection import H2Connection
from h2.errors import ErrorCodes
from h2.events import (
ConnectionTerminated,
DataReceived,
RemoteSettingsChanged,
RequestReceived,
StreamEnded,
StreamReset,
WindowUpdated,
)
from h2.exceptions import ProtocolError, StreamClosedError
from h2.settings import SettingCodes
RequestData = collections.namedtuple("RequestData", ["headers", "data"])
class H2Server:
def __init__(self, host, port) -> None:
self.host = host
self.port = port
class H2Protocol(asyncio.Protocol):
def __init__(self):
config = H2Configuration(client_side=False, header_encoding="utf-8")
self.conn = H2Connection(config=config)
self.transport = None
self.stream_data = {}
self.flow_control_futures = {}
def connection_made(self, transport: asyncio.Transport): # type: ignore[override]
self.transport = transport
self.conn.initiate_connection()
self.transport.write(self.conn.data_to_send())
def connection_lost(self, _exc):
for future in self.flow_control_futures.values():
future.cancel()
self.flow_control_futures = {}
def data_received(self, data: bytes):
assert self.transport is not None
try:
events = self.conn.receive_data(data)
except ProtocolError:
self.transport.write(self.conn.data_to_send())
self.transport.close()
else:
self.transport.write(self.conn.data_to_send())
for event in events:
if isinstance(event, RequestReceived):
self.request_received(event.headers, event.stream_id)
elif isinstance(event, DataReceived):
self.receive_data(event.data, event.stream_id)
elif isinstance(event, StreamEnded):
self.stream_complete(event.stream_id)
elif isinstance(event, ConnectionTerminated):
self.transport.close()
elif isinstance(event, StreamReset):
self.stream_reset(event.stream_id)
elif isinstance(event, WindowUpdated):
self.window_updated(event.stream_id, event.delta)
elif isinstance(event, RemoteSettingsChanged):
if SettingCodes.INITIAL_WINDOW_SIZE in event.changed_settings:
self.window_updated(None, 0)
self.transport.write(self.conn.data_to_send())
def request_received(self, headers: list[tuple[str, str]], stream_id: int):
headers_map = collections.OrderedDict(headers)
# Store off the request data.
request_data = RequestData(headers_map, io.BytesIO())
self.stream_data[stream_id] = request_data
def stream_complete(self, stream_id: int):
"""
When a stream is complete, we can send our response.
"""
try:
request_data = self.stream_data[stream_id]
except KeyError:
# Just return, we probably 405'd this already
return
headers = request_data.headers
body = request_data.data.getvalue().decode("utf-8")
data = json.dumps({"headers": headers, "body": body}, indent=4).encode("utf8")
response_headers = (
(":status", "200"),
("content-type", "application/json"),
("content-length", str(len(data))),
)
self.conn.send_headers(stream_id, response_headers)
asyncio.ensure_future(self.send_data(data, stream_id))
def receive_data(self, data: bytes, stream_id: int):
"""
We've received some data on a stream. If that stream is one we're
expecting data on, save it off. Otherwise, reset the stream.
"""
try:
stream_data = self.stream_data[stream_id]
except KeyError:
self.conn.reset_stream(stream_id, error_code=ErrorCodes.PROTOCOL_ERROR)
else:
stream_data.data.write(data)
def stream_reset(self, stream_id):
"""
A stream reset was sent. Stop sending data.
"""
if stream_id in self.flow_control_futures:
future = self.flow_control_futures.pop(stream_id)
future.cancel()
async def send_data(self, data, stream_id):
"""
Send data according to the flow control rules.
"""
while data:
while self.conn.local_flow_control_window(stream_id) < 1:
try:
await self.wait_for_flow_control(stream_id)
except asyncio.CancelledError:
return
chunk_size = min(
self.conn.local_flow_control_window(stream_id),
len(data),
self.conn.max_outbound_frame_size,
)
try:
self.conn.send_data(
stream_id, data[:chunk_size], end_stream=(chunk_size == len(data))
)
except (StreamClosedError, ProtocolError):
# The stream got closed and we didn't get told. We're done
# here.
break
assert self.transport is not None
self.transport.write(self.conn.data_to_send())
data = data[chunk_size:]
async def wait_for_flow_control(self, stream_id):
"""
Waits for a Future that fires when the flow control window is opened.
"""
f: asyncio.Future[None] = asyncio.Future()
self.flow_control_futures[stream_id] = f
await f
def window_updated(self, stream_id, delta):
"""
A window update frame was received. Unblock some number of flow control
Futures.
"""
if stream_id and stream_id in self.flow_control_futures:
f = self.flow_control_futures.pop(stream_id)
f.set_result(delta)
elif not stream_id:
for f in self.flow_control_futures.values():
f.set_result(delta)
self.flow_control_futures = {}
@pytest_asyncio.fixture(scope="function")
async def http2_echoserver() -> AsyncIterable[H2Server]:
loop = asyncio.get_event_loop()
serve = await loop.create_server(H2Protocol, "127.0.0.1", 0)
(host, port) = serve.sockets[0].getsockname()
asyncio.create_task(serve.wait_closed())
server = H2Server(host, port)
yield server
serve.close()

View File

@@ -150,7 +150,6 @@ PAGESERVER_GLOBAL_METRICS: tuple[str, ...] = (
counter("pageserver_tenant_throttling_count_accounted_finish_global"),
counter("pageserver_tenant_throttling_wait_usecs_sum_global"),
counter("pageserver_tenant_throttling_count_global"),
*histogram("pageserver_tokio_epoll_uring_slots_submission_queue_depth"),
)
PAGESERVER_PER_TENANT_METRICS: tuple[str, ...] = (

View File

@@ -16,6 +16,7 @@ from fixtures.common_types import Lsn, TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.pageserver.common_types import IndexPartDump
from fixtures.pg_version import PgVersion
from fixtures.utils import AuxFileStore
if TYPE_CHECKING:
from typing import (
@@ -200,6 +201,7 @@ class NeonLocalCli(AbstractNeonCli):
shard_stripe_size: Optional[int] = None,
placement_policy: Optional[str] = None,
set_default: bool = False,
aux_file_policy: Optional[AuxFileStore] = None,
):
"""
Creates a new tenant, returns its id and its initial timeline's id.
@@ -221,6 +223,13 @@ class NeonLocalCli(AbstractNeonCli):
)
)
if aux_file_policy is AuxFileStore.V2:
args.extend(["-c", "switch_aux_file_policy:v2"])
elif aux_file_policy is AuxFileStore.V1:
args.extend(["-c", "switch_aux_file_policy:v1"])
elif aux_file_policy is AuxFileStore.CrossValidation:
args.extend(["-c", "switch_aux_file_policy:cross-validation"])
if set_default:
args.append("--set-default")

View File

@@ -35,27 +35,17 @@ import toml
from _pytest.config import Config
from _pytest.config.argparsing import Parser
from _pytest.fixtures import FixtureRequest
from jwcrypto import jwk
# Type-related stuff
from psycopg2.extensions import connection as PgConnection
from psycopg2.extensions import cursor as PgCursor
from psycopg2.extensions import make_dsn, parse_dsn
from pytest_httpserver import HTTPServer
from urllib3.util.retry import Retry
from fixtures import overlayfs
from fixtures.auth_tokens import AuthKeys, TokenScope
from fixtures.common_types import (
Lsn,
NodeId,
TenantId,
TenantShardId,
TimelineArchivalState,
TimelineId,
)
from fixtures.common_types import Lsn, NodeId, TenantId, TenantShardId, TimelineId
from fixtures.endpoint.http import EndpointHttpClient
from fixtures.h2server import H2Server
from fixtures.log_helper import log
from fixtures.metrics import Metrics, MetricsGetter, parse_metrics
from fixtures.neon_cli import NeonLocalCli, Pagectl
@@ -64,11 +54,7 @@ from fixtures.pageserver.allowed_errors import (
DEFAULT_STORAGE_CONTROLLER_ALLOWED_ERRORS,
)
from fixtures.pageserver.common_types import LayerName, parse_layer_file_name
from fixtures.pageserver.http import (
HistoricLayerInfo,
PageserverHttpClient,
ScanDisposableKeysResponse,
)
from fixtures.pageserver.http import PageserverHttpClient
from fixtures.pageserver.utils import (
wait_for_last_record_lsn,
)
@@ -97,6 +83,7 @@ from fixtures.utils import (
subprocess_capture,
wait_until,
)
from fixtures.utils import AuxFileStore as AuxFileStore # reexport
from .neon_api import NeonAPI, NeonApiEndpoint
@@ -355,6 +342,7 @@ class NeonEnvBuilder:
initial_tenant: Optional[TenantId] = None,
initial_timeline: Optional[TimelineId] = None,
pageserver_virtual_file_io_engine: Optional[str] = None,
pageserver_aux_file_policy: Optional[AuxFileStore] = None,
pageserver_default_tenant_config_compaction_algorithm: Optional[dict[str, Any]] = None,
safekeeper_extra_opts: Optional[list[str]] = None,
storage_controller_port_override: Optional[int] = None,
@@ -406,6 +394,8 @@ class NeonEnvBuilder:
f"Overriding pageserver default compaction algorithm to {self.pageserver_default_tenant_config_compaction_algorithm}"
)
self.pageserver_aux_file_policy = pageserver_aux_file_policy
self.safekeeper_extra_opts = safekeeper_extra_opts
self.storage_controller_port_override = storage_controller_port_override
@@ -466,6 +456,7 @@ class NeonEnvBuilder:
timeline_id=env.initial_timeline,
shard_count=initial_tenant_shard_count,
shard_stripe_size=initial_tenant_shard_stripe_size,
aux_file_policy=self.pageserver_aux_file_policy,
)
assert env.initial_tenant == initial_tenant
assert env.initial_timeline == initial_timeline
@@ -1025,6 +1016,7 @@ class NeonEnv:
self.control_plane_compute_hook_api = config.control_plane_compute_hook_api
self.pageserver_virtual_file_io_engine = config.pageserver_virtual_file_io_engine
self.pageserver_aux_file_policy = config.pageserver_aux_file_policy
self.pageserver_virtual_file_io_mode = config.pageserver_virtual_file_io_mode
# Create the neon_local's `NeonLocalInitConf`
@@ -1320,6 +1312,7 @@ class NeonEnv:
shard_stripe_size: Optional[int] = None,
placement_policy: Optional[str] = None,
set_default: bool = False,
aux_file_policy: Optional[AuxFileStore] = None,
) -> tuple[TenantId, TimelineId]:
"""
Creates a new tenant, returns its id and its initial timeline's id.
@@ -1336,6 +1329,7 @@ class NeonEnv:
shard_stripe_size=shard_stripe_size,
placement_policy=placement_policy,
set_default=set_default,
aux_file_policy=aux_file_policy,
)
return tenant_id, timeline_id
@@ -1393,6 +1387,7 @@ def neon_simple_env(
compatibility_pg_distrib_dir: Path,
pg_version: PgVersion,
pageserver_virtual_file_io_engine: str,
pageserver_aux_file_policy: Optional[AuxFileStore],
pageserver_default_tenant_config_compaction_algorithm: Optional[dict[str, Any]],
pageserver_virtual_file_io_mode: Optional[str],
) -> Iterator[NeonEnv]:
@@ -1425,6 +1420,7 @@ def neon_simple_env(
test_name=request.node.name,
test_output_dir=test_output_dir,
pageserver_virtual_file_io_engine=pageserver_virtual_file_io_engine,
pageserver_aux_file_policy=pageserver_aux_file_policy,
pageserver_default_tenant_config_compaction_algorithm=pageserver_default_tenant_config_compaction_algorithm,
pageserver_virtual_file_io_mode=pageserver_virtual_file_io_mode,
combination=combination,
@@ -1451,6 +1447,7 @@ def neon_env_builder(
top_output_dir: Path,
pageserver_virtual_file_io_engine: str,
pageserver_default_tenant_config_compaction_algorithm: Optional[dict[str, Any]],
pageserver_aux_file_policy: Optional[AuxFileStore],
record_property: Callable[[str, object], None],
pageserver_virtual_file_io_mode: Optional[str],
) -> Iterator[NeonEnvBuilder]:
@@ -1493,6 +1490,7 @@ def neon_env_builder(
test_name=request.node.name,
test_output_dir=test_output_dir,
test_overlay_dir=test_overlay_dir,
pageserver_aux_file_policy=pageserver_aux_file_policy,
pageserver_default_tenant_config_compaction_algorithm=pageserver_default_tenant_config_compaction_algorithm,
pageserver_virtual_file_io_mode=pageserver_virtual_file_io_mode,
) as builder:
@@ -2134,24 +2132,6 @@ class NeonStorageController(MetricsGetter, LogUtils):
response.raise_for_status()
return response.json()
def timeline_archival_config(
self,
tenant_id: TenantId,
timeline_id: TimelineId,
state: TimelineArchivalState,
):
config = {"state": state.value}
log.info(
f"requesting timeline archival config {config} for tenant {tenant_id} and timeline {timeline_id}"
)
res = self.request(
"PUT",
f"{self.api}/v1/tenant/{tenant_id}/timeline/{timeline_id}/archival_config",
json=config,
headers=self.headers(TokenScope.ADMIN),
)
return res.json()
def configure_failpoints(self, config_strings: tuple[str, str] | list[tuple[str, str]]):
if isinstance(config_strings, tuple):
pairs = [config_strings]
@@ -2665,51 +2645,6 @@ class NeonPageserver(PgProtocol, LogUtils):
layers = self.list_layers(tenant_id, timeline_id)
return layer_name in [parse_layer_file_name(p.name) for p in layers]
def timeline_scan_no_disposable_keys(
self, tenant_shard_id: TenantShardId, timeline_id: TimelineId
) -> TimelineAssertNoDisposableKeysResult:
"""
Scan all keys in all layers of the tenant/timeline for disposable keys.
Disposable keys are keys that are present in a layer referenced by the shard
but are not going to be accessed by the shard.
For example, after shard split, the child shards will reference the parent's layer
files until new data is ingested and/or compaction rewrites the layers.
"""
ps_http = self.http_client()
tally = ScanDisposableKeysResponse(0, 0)
per_layer = []
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
futs = []
shard_layer_map = ps_http.layer_map_info(tenant_shard_id, timeline_id)
for layer in shard_layer_map.historic_layers:
def do_layer(
shard_ps_http: PageserverHttpClient,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
layer: HistoricLayerInfo,
) -> tuple[HistoricLayerInfo, ScanDisposableKeysResponse]:
return (
layer,
shard_ps_http.timeline_layer_scan_disposable_keys(
tenant_shard_id, timeline_id, layer.layer_file_name
),
)
futs.append(executor.submit(do_layer, ps_http, tenant_shard_id, timeline_id, layer))
for fut in futs:
layer, result = fut.result()
tally += result
per_layer.append((layer, result))
return TimelineAssertNoDisposableKeysResult(tally, per_layer)
@dataclass
class TimelineAssertNoDisposableKeysResult:
tally: ScanDisposableKeysResponse
per_layer: list[tuple[HistoricLayerInfo, ScanDisposableKeysResponse]]
class PgBin:
"""A helper class for executing postgres binaries"""
@@ -3083,31 +3018,6 @@ class PSQL:
)
def generate_proxy_tls_certs(common_name: str, key_path: Path, crt_path: Path):
if not key_path.exists():
r = subprocess.run(
[
"openssl",
"req",
"-new",
"-x509",
"-days",
"365",
"-nodes",
"-text",
"-out",
str(crt_path),
"-keyout",
str(key_path),
"-subj",
f"/CN={common_name}",
"-addext",
f"subjectAltName = DNS:{common_name}",
]
)
assert r.returncode == 0
class NeonProxy(PgProtocol):
link_auth_uri: str = "http://dummy-uri"
@@ -3206,7 +3116,29 @@ class NeonProxy(PgProtocol):
# generate key if it doesn't exist
crt_path = self.test_output_dir / "proxy.crt"
key_path = self.test_output_dir / "proxy.key"
generate_proxy_tls_certs("*.localtest.me", key_path, crt_path)
if not key_path.exists():
r = subprocess.run(
[
"openssl",
"req",
"-new",
"-x509",
"-days",
"365",
"-nodes",
"-text",
"-out",
str(crt_path),
"-keyout",
str(key_path),
"-subj",
"/CN=*.localtest.me",
"-addext",
"subjectAltName = DNS:*.localtest.me",
]
)
assert r.returncode == 0
args = [
str(self.neon_binpath / "proxy"),
@@ -3386,125 +3318,6 @@ class NeonProxy(PgProtocol):
assert out == "ok"
class NeonAuthBroker:
class ControlPlane:
def __init__(self, endpoint: str):
self.endpoint = endpoint
def extra_args(self) -> list[str]:
args = [
*["--auth-backend", "console"],
*["--auth-endpoint", self.endpoint],
]
return args
def __init__(
self,
neon_binpath: Path,
test_output_dir: Path,
http_port: int,
mgmt_port: int,
external_http_port: int,
auth_backend: NeonAuthBroker.ControlPlane,
):
self.domain = "apiauth.localtest.me" # resolves to 127.0.0.1
self.host = "127.0.0.1"
self.http_port = http_port
self.external_http_port = external_http_port
self.neon_binpath = neon_binpath
self.test_output_dir = test_output_dir
self.mgmt_port = mgmt_port
self.auth_backend = auth_backend
self.http_timeout_seconds = 15
self._popen: Optional[subprocess.Popen[bytes]] = None
def start(self) -> NeonAuthBroker:
assert self._popen is None
# generate key if it doesn't exist
crt_path = self.test_output_dir / "proxy.crt"
key_path = self.test_output_dir / "proxy.key"
generate_proxy_tls_certs("apiauth.localtest.me", key_path, crt_path)
args = [
str(self.neon_binpath / "proxy"),
*["--http", f"{self.host}:{self.http_port}"],
*["--mgmt", f"{self.host}:{self.mgmt_port}"],
*["--wss", f"{self.host}:{self.external_http_port}"],
*["-c", str(crt_path)],
*["-k", str(key_path)],
*["--sql-over-http-pool-opt-in", "false"],
*["--is-auth-broker", "true"],
*self.auth_backend.extra_args(),
]
logfile = open(self.test_output_dir / "proxy.log", "w")
self._popen = subprocess.Popen(args, stdout=logfile, stderr=logfile)
self._wait_until_ready()
return self
# Sends SIGTERM to the proxy if it has been started
def terminate(self):
if self._popen:
self._popen.terminate()
# Waits for proxy to exit if it has been opened with a default timeout of
# two seconds. Raises subprocess.TimeoutExpired if the proxy does not exit in time.
def wait_for_exit(self, timeout=2):
if self._popen:
self._popen.wait(timeout=timeout)
@backoff.on_exception(backoff.expo, requests.exceptions.RequestException, max_time=10)
def _wait_until_ready(self):
assert (
self._popen and self._popen.poll() is None
), "Proxy exited unexpectedly. Check test log."
requests.get(f"http://{self.host}:{self.http_port}/v1/status")
async def query(self, query, args, **kwargs):
user = kwargs["user"]
token = kwargs["token"]
expected_code = kwargs.get("expected_code")
log.info(f"Executing http query: {query}")
connstr = f"postgresql://{user}@{self.domain}/postgres"
async with httpx.AsyncClient(verify=str(self.test_output_dir / "proxy.crt")) as client:
response = await client.post(
f"https://{self.domain}:{self.external_http_port}/sql",
json={"query": query, "params": args},
headers={
"Neon-Connection-String": connstr,
"Authorization": f"Bearer {token}",
},
)
if expected_code is not None:
assert response.status_code == expected_code, f"response: {response.json()}"
return response.json()
def get_metrics(self) -> str:
request_result = requests.get(f"http://{self.host}:{self.http_port}/metrics")
return request_result.text
def __enter__(self) -> NeonAuthBroker:
return self
def __exit__(
self,
_exc_type: Optional[type[BaseException]],
_exc_value: Optional[BaseException],
_traceback: Optional[TracebackType],
):
if self._popen is not None:
self._popen.terminate()
try:
self._popen.wait(timeout=5)
except subprocess.TimeoutExpired:
log.warning("failed to gracefully terminate proxy; killing")
self._popen.kill()
@pytest.fixture(scope="function")
def link_proxy(
port_distributor: PortDistributor, neon_binpath: Path, test_output_dir: Path
@@ -3569,74 +3382,6 @@ def static_proxy(
yield proxy
@pytest.fixture(scope="function")
def neon_authorize_jwk() -> jwk.JWK:
kid = str(uuid.uuid4())
key = jwk.JWK.generate(kty="RSA", size=2048, alg="RS256", use="sig", kid=kid)
assert isinstance(key, jwk.JWK)
return key
@pytest.fixture(scope="function")
def static_auth_broker(
port_distributor: PortDistributor,
neon_binpath: Path,
test_output_dir: Path,
httpserver: HTTPServer,
neon_authorize_jwk: jwk.JWK,
http2_echoserver: H2Server,
) -> Iterable[NeonAuthBroker]:
"""Neon Auth Broker that routes to a mocked local_proxy and a mocked cplane HTTP API."""
local_proxy_addr = f"{http2_echoserver.host}:{http2_echoserver.port}"
# return local_proxy addr on ProxyWakeCompute.
httpserver.expect_request("/cplane/proxy_wake_compute").respond_with_json(
{
"address": local_proxy_addr,
"aux": {
"endpoint_id": "ep-foo-bar-1234",
"branch_id": "br-foo-bar",
"project_id": "foo-bar",
},
}
)
# return jwks mock addr on GetEndpointJwks
httpserver.expect_request(re.compile("^/cplane/endpoints/.+/jwks$")).respond_with_json(
{
"jwks": [
{
"id": "foo",
"jwks_url": httpserver.url_for("/authorize/jwks.json"),
"provider_name": "test",
"jwt_audience": None,
"role_names": ["anonymous", "authenticated"],
}
]
}
)
# return static fixture jwks.
jwk = neon_authorize_jwk.export_public(as_dict=True)
httpserver.expect_request("/authorize/jwks.json").respond_with_json({"keys": [jwk]})
mgmt_port = port_distributor.get_port()
http_port = port_distributor.get_port()
external_http_port = port_distributor.get_port()
with NeonAuthBroker(
neon_binpath=neon_binpath,
test_output_dir=test_output_dir,
http_port=http_port,
mgmt_port=mgmt_port,
external_http_port=external_http_port,
auth_backend=NeonAuthBroker.ControlPlane(httpserver.url_for("/cplane")),
) as proxy:
proxy.start()
yield proxy
class Endpoint(PgProtocol, LogUtils):
"""An object representing a Postgres compute endpoint managed by the control plane."""

View File

@@ -129,26 +129,6 @@ class LayerMapInfo:
return set(x.layer_file_name for x in self.historic_layers)
@dataclass
class ScanDisposableKeysResponse:
disposable_count: int
not_disposable_count: int
def __add__(self, b):
a = self
assert isinstance(a, ScanDisposableKeysResponse)
assert isinstance(b, ScanDisposableKeysResponse)
return ScanDisposableKeysResponse(
a.disposable_count + b.disposable_count, a.not_disposable_count + b.not_disposable_count
)
@classmethod
def from_json(cls, d: dict[str, Any]) -> ScanDisposableKeysResponse:
disposable_count = d["disposable_count"]
not_disposable_count = d["not_disposable_count"]
return ScanDisposableKeysResponse(disposable_count, not_disposable_count)
@dataclass
class TenantConfig:
tenant_specific_overrides: dict[str, Any]
@@ -162,19 +142,6 @@ class TenantConfig:
)
@dataclass
class TimelinesInfoAndOffloaded:
timelines: list[dict[str, Any]]
offloaded: list[dict[str, Any]]
@classmethod
def from_json(cls, d: dict[str, Any]) -> TimelinesInfoAndOffloaded:
return TimelinesInfoAndOffloaded(
timelines=d["timelines"],
offloaded=d["offloaded"],
)
class PageserverHttpClient(requests.Session, MetricsGetter):
def __init__(
self,
@@ -497,18 +464,6 @@ class PageserverHttpClient(requests.Session, MetricsGetter):
assert isinstance(res_json, list)
return res_json
def timeline_and_offloaded_list(
self,
tenant_id: Union[TenantId, TenantShardId],
) -> TimelinesInfoAndOffloaded:
res = self.get(
f"http://localhost:{self.port}/v1/tenant/{tenant_id}/timeline_and_offloaded",
)
self.verbose_error(res)
res_json = res.json()
assert isinstance(res_json, dict)
return TimelinesInfoAndOffloaded.from_json(res_json)
def timeline_create(
self,
pg_version: PgVersion,
@@ -521,13 +476,12 @@ class PageserverHttpClient(requests.Session, MetricsGetter):
) -> dict[Any, Any]:
body: dict[str, Any] = {
"new_timeline_id": str(new_timeline_id),
"ancestor_start_lsn": str(ancestor_start_lsn) if ancestor_start_lsn else None,
"ancestor_timeline_id": str(ancestor_timeline_id) if ancestor_timeline_id else None,
"existing_initdb_timeline_id": str(existing_initdb_timeline_id)
if existing_initdb_timeline_id
else None,
}
if ancestor_timeline_id:
body["ancestor_timeline_id"] = str(ancestor_timeline_id)
if ancestor_start_lsn:
body["ancestor_start_lsn"] = str(ancestor_start_lsn)
if existing_initdb_timeline_id:
body["existing_initdb_timeline_id"] = str(existing_initdb_timeline_id)
if pg_version != PgVersion.NOT_SET:
body["pg_version"] = int(pg_version)
@@ -925,16 +879,6 @@ class PageserverHttpClient(requests.Session, MetricsGetter):
self.verbose_error(res)
return LayerMapInfo.from_json(res.json())
def timeline_layer_scan_disposable_keys(
self, tenant_id: Union[TenantId, TenantShardId], timeline_id: TimelineId, layer_name: str
) -> ScanDisposableKeysResponse:
res = self.post(
f"http://localhost:{self.port}/v1/tenant/{tenant_id}/timeline/{timeline_id}/layer/{layer_name}/scan_disposable_keys",
)
self.verbose_error(res)
assert res.status_code == 200
return ScanDisposableKeysResponse.from_json(res.json())
def download_layer(
self, tenant_id: Union[TenantId, TenantShardId], timeline_id: TimelineId, layer_name: str
):

View File

@@ -10,6 +10,12 @@ from _pytest.python import Metafunc
from fixtures.pg_version import PgVersion
if TYPE_CHECKING:
from typing import Any, Optional
from fixtures.utils import AuxFileStore
if TYPE_CHECKING:
from typing import Any, Optional
@@ -44,6 +50,11 @@ def pageserver_virtual_file_io_mode() -> Optional[str]:
return os.getenv("PAGESERVER_VIRTUAL_FILE_IO_MODE")
@pytest.fixture(scope="function", autouse=True)
def pageserver_aux_file_policy() -> Optional[AuxFileStore]:
return None
def get_pageserver_default_tenant_config_compaction_algorithm() -> Optional[dict[str, Any]]:
toml_table = os.getenv("PAGESERVER_DEFAULT_TENANT_CONFIG_COMPACTION_ALGORITHM")
if toml_table is None:

View File

@@ -1,6 +1,7 @@
from __future__ import annotations
import contextlib
import enum
import json
import os
import re
@@ -514,6 +515,21 @@ def assert_no_errors(log_file: Path, service: str, allowed_errors: list[str]):
assert not errors, f"First log error on {service}: {errors[0]}\nHint: use scripts/check_allowed_errors.sh to test any new allowed_error you add"
@enum.unique
class AuxFileStore(str, enum.Enum):
V1 = "v1"
V2 = "v2"
CrossValidation = "cross-validation"
@override
def __repr__(self) -> str:
return f"'aux-{self.value}'"
@override
def __str__(self) -> str:
return f"'aux-{self.value}'"
def assert_pageserver_backups_equal(left: Path, right: Path, skip_files: set[str]):
"""
This is essentially:

View File

@@ -9,7 +9,7 @@ import pytest
from fixtures.benchmark_fixture import MetricReport
from fixtures.common_types import Lsn
from fixtures.log_helper import log
from fixtures.neon_fixtures import logical_replication_sync
from fixtures.neon_fixtures import AuxFileStore, logical_replication_sync
if TYPE_CHECKING:
from fixtures.benchmark_fixture import NeonBenchmarker
@@ -17,6 +17,7 @@ if TYPE_CHECKING:
from fixtures.neon_fixtures import NeonEnv, PgBin
@pytest.mark.parametrize("pageserver_aux_file_policy", [AuxFileStore.V2])
@pytest.mark.timeout(1000)
def test_logical_replication(neon_simple_env: NeonEnv, pg_bin: PgBin, vanilla_pg):
env = neon_simple_env

View File

@@ -172,6 +172,7 @@ def test_fully_custom_config(positive_env: NeonEnv):
},
"walreceiver_connect_timeout": "13m",
"image_layer_creation_check_threshold": 1,
"switch_aux_file_policy": "cross-validation",
"lsn_lease_length": "1m",
"lsn_lease_length_for_ts": "5s",
}

View File

@@ -1,37 +0,0 @@
import json
import pytest
from fixtures.neon_fixtures import NeonAuthBroker
from jwcrypto import jwk, jwt
@pytest.mark.asyncio
async def test_auth_broker_happy(
static_auth_broker: NeonAuthBroker,
neon_authorize_jwk: jwk.JWK,
):
"""
Signs a JWT and uses it to authorize a query to local_proxy.
"""
token = jwt.JWT(
header={"kid": neon_authorize_jwk.key_id, "alg": "RS256"}, claims={"sub": "user1"}
)
token.make_signed_token(neon_authorize_jwk)
res = await static_auth_broker.query("foo", ["arg1"], user="anonymous", token=token.serialize())
# local proxy mock just echoes back the request
# check that we forward the correct data
assert (
res["headers"]["authorization"] == f"Bearer {token.serialize()}"
), "JWT should be forwarded"
assert (
"anonymous" in res["headers"]["neon-connection-string"]
), "conn string should be forwarded"
assert json.loads(res["body"]) == {
"query": "foo",
"params": ["arg1"],
}, "Query body should be forwarded"

Some files were not shown because too many files have changed in this diff.