Mirror of https://github.com/neondatabase/neon.git (synced 2026-02-03 02:30:37 +00:00)

Compare commits: jcsp/relat ... sk-shardin (23 commits)

| SHA1 |
|---|
| e70d486281 |
| 56630b0eda |
| 27e9cb91ed |
| 733877a8ff |
| 93c52b5763 |
| f54cb63eb4 |
| 8e2fee7d06 |
| fbbad434a3 |
| 8f27c57748 |
| 6662c8f1ed |
| 61244afb59 |
| e20732fdcb |
| feae5f716f |
| ae19f28f59 |
| 22a848cf2b |
| 360ca01952 |
| bf059935a0 |
| 71bf90548d |
| e368705692 |
| e94b9e9ce8 |
| 57cbc20dce |
| 0aeba9fc4c |
| 6e0055b9f6 |
@@ -1,3 +1,17 @@
# The binaries are really slow, if you compile them in 'dev' mode with the defaults.
# Enable some optimizations even in 'dev' mode, to make tests faster. The basic
# optimizations enabled by "opt-level=1" don't affect debuggability too much.
#
# See https://www.reddit.com/r/rust/comments/gvrgca/this_is_a_neat_trick_for_getting_good_runtime/
#
[profile.dev.package."*"]
# Set the default for dependencies in Development mode.
opt-level = 3

[profile.dev]
# Turn on a small amount of optimization in Development mode.
opt-level = 1

[build]
# This is only present for local builds, as it will be overridden
# by the RUSTDOCFLAGS env var in CI.

.github/PULL_REQUEST_TEMPLATE/release-pr.md (vendored, 2 changed lines)
@@ -3,7 +3,7 @@
**NB: this PR must be merged only by 'Create a merge commit'!**

### Checklist when preparing for release
- [ ] Read or refresh [the release flow guide](https://www.notion.so/neondatabase/Release-general-flow-61f2e39fd45d4d14a70c7749604bd70b)
- [ ] Read or refresh [the release flow guide](https://github.com/neondatabase/cloud/wiki/Release:-general-flow)
- [ ] Ask in the [cloud Slack channel](https://neondb.slack.com/archives/C033A2WE6BZ) that you are going to rollout the release. Any blockers?
- [ ] Does this release contain any db migrations? Destructive ones? What is the rollback plan?

.github/actionlint.yml (vendored, 2 changed lines)
@@ -1,7 +1,5 @@
self-hosted-runner:
  labels:
    - arm64
    - dev
    - gen3
    - large
    - small

.github/workflows/build_and_test.yml (vendored, 10 changed lines)
@@ -172,10 +172,10 @@ jobs:
# https://github.com/EmbarkStudios/cargo-deny
- name: Check rust licenses/bans/advisories/sources
if: ${{ !cancelled() }}
run: cargo deny check --hide-inclusion-graph
run: cargo deny check

build-neon:
needs: [ check-permissions, tag ]
needs: [ check-permissions ]
runs-on: [ self-hosted, gen3, large ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
@@ -187,7 +187,6 @@ jobs:
env:
BUILD_TYPE: ${{ matrix.build_type }}
GIT_VERSION: ${{ github.event.pull_request.head.sha || github.sha }}
BUILD_TAG: ${{ needs.tag.outputs.build-tag }}

steps:
- name: Fix git ownership
@@ -586,13 +585,10 @@ jobs:
id: upload-coverage-report-new
env:
BUCKET: neon-github-public-dev
# A differential coverage report is available only for PRs.
# (i.e. for pushes into main/release branches we have a regular coverage report)
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
BASE_SHA: ${{ github.event.pull_request.base.sha || github.sha }}
run: |
BASELINE="$(git merge-base HEAD origin/main)"
CURRENT="${COMMIT_SHA}"
BASELINE="$(git merge-base $BASE_SHA $CURRENT)"

cp /tmp/coverage/report/lcov.info ./${CURRENT}.info

.github/workflows/neon_extra_builds.yml (vendored, 181 changed lines)
@@ -21,10 +21,7 @@ env:
|
||||
|
||||
jobs:
|
||||
check-macos-build:
|
||||
if: |
|
||||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
|
||||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
|
||||
github.ref_name == 'main'
|
||||
if: github.ref_name == 'main' || contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos')
|
||||
timeout-minutes: 90
|
||||
runs-on: macos-latest
|
||||
|
||||
@@ -115,182 +112,8 @@ jobs:
|
||||
- name: Check that no warnings are produced
|
||||
run: ./run_clippy.sh
|
||||
|
||||
check-linux-arm-build:
|
||||
timeout-minutes: 90
|
||||
runs-on: [ self-hosted, dev, arm64 ]
|
||||
|
||||
env:
|
||||
# Use release build only, to have less debug info around
|
||||
# Hence keeping target/ (and general cache size) smaller
|
||||
BUILD_TYPE: release
|
||||
CARGO_FEATURES: --features testing
|
||||
CARGO_FLAGS: --locked --release
|
||||
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_DEV }}
|
||||
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY_DEV }}
|
||||
|
||||
container:
|
||||
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
|
||||
options: --init
|
||||
|
||||
steps:
|
||||
- name: Fix git ownership
|
||||
run: |
|
||||
# Workaround for `fatal: detected dubious ownership in repository at ...`
|
||||
#
|
||||
# Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
|
||||
# Ref https://github.com/actions/checkout/issues/785
|
||||
#
|
||||
git config --global --add safe.directory ${{ github.workspace }}
|
||||
git config --global --add safe.directory ${GITHUB_WORKSPACE}
|
||||
|
||||
- name: Checkout
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
submodules: true
|
||||
fetch-depth: 1
|
||||
|
||||
- name: Set pg 14 revision for caching
|
||||
id: pg_v14_rev
|
||||
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v14) >> $GITHUB_OUTPUT
|
||||
|
||||
- name: Set pg 15 revision for caching
|
||||
id: pg_v15_rev
|
||||
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v15) >> $GITHUB_OUTPUT
|
||||
|
||||
- name: Set pg 16 revision for caching
|
||||
id: pg_v16_rev
|
||||
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v16) >> $GITHUB_OUTPUT
|
||||
|
||||
- name: Set env variables
|
||||
run: |
|
||||
echo "CARGO_HOME=${GITHUB_WORKSPACE}/.cargo" >> $GITHUB_ENV
|
||||
|
||||
- name: Cache postgres v14 build
|
||||
id: cache_pg_14
|
||||
uses: actions/cache@v3
|
||||
with:
|
||||
path: pg_install/v14
|
||||
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v14_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
|
||||
|
||||
- name: Cache postgres v15 build
|
||||
id: cache_pg_15
|
||||
uses: actions/cache@v3
|
||||
with:
|
||||
path: pg_install/v15
|
||||
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v15_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
|
||||
|
||||
- name: Cache postgres v16 build
|
||||
id: cache_pg_16
|
||||
uses: actions/cache@v3
|
||||
with:
|
||||
path: pg_install/v16
|
||||
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v16_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
|
||||
|
||||
- name: Build postgres v14
|
||||
if: steps.cache_pg_14.outputs.cache-hit != 'true'
|
||||
run: mold -run make postgres-v14 -j$(nproc)
|
||||
|
||||
- name: Build postgres v15
|
||||
if: steps.cache_pg_15.outputs.cache-hit != 'true'
|
||||
run: mold -run make postgres-v15 -j$(nproc)
|
||||
|
||||
- name: Build postgres v16
|
||||
if: steps.cache_pg_16.outputs.cache-hit != 'true'
|
||||
run: mold -run make postgres-v16 -j$(nproc)
|
||||
|
||||
- name: Build neon extensions
|
||||
run: mold -run make neon-pg-ext -j$(nproc)
|
||||
|
||||
- name: Build walproposer-lib
|
||||
run: mold -run make walproposer-lib -j$(nproc)
|
||||
|
||||
- name: Run cargo build
|
||||
run: |
|
||||
mold -run cargo build $CARGO_FLAGS $CARGO_FEATURES --bins --tests
|
||||
|
||||
- name: Run cargo test
|
||||
run: |
|
||||
cargo test $CARGO_FLAGS $CARGO_FEATURES
|
||||
|
||||
# Run separate tests for real S3
|
||||
export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty
|
||||
export REMOTE_STORAGE_S3_BUCKET=neon-github-public-dev
|
||||
export REMOTE_STORAGE_S3_REGION=eu-central-1
|
||||
# Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
|
||||
cargo test $CARGO_FLAGS --package remote_storage --test test_real_s3
|
||||
|
||||
# Run separate tests for real Azure Blob Storage
|
||||
# XXX: replace region with `eu-central-1`-like region
|
||||
export ENABLE_REAL_AZURE_REMOTE_STORAGE=y
|
||||
export AZURE_STORAGE_ACCOUNT="${{ secrets.AZURE_STORAGE_ACCOUNT_DEV }}"
|
||||
export AZURE_STORAGE_ACCESS_KEY="${{ secrets.AZURE_STORAGE_ACCESS_KEY_DEV }}"
|
||||
export REMOTE_STORAGE_AZURE_CONTAINER="${{ vars.REMOTE_STORAGE_AZURE_CONTAINER }}"
|
||||
export REMOTE_STORAGE_AZURE_REGION="${{ vars.REMOTE_STORAGE_AZURE_REGION }}"
|
||||
# Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
|
||||
cargo test $CARGO_FLAGS --package remote_storage --test test_real_azure
|
||||
|
||||
check-codestyle-rust-arm:
|
||||
timeout-minutes: 90
|
||||
runs-on: [ self-hosted, dev, arm64 ]
|
||||
|
||||
container:
|
||||
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
|
||||
options: --init
|
||||
|
||||
steps:
|
||||
- name: Checkout
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
submodules: true
|
||||
fetch-depth: 1
|
||||
|
||||
# Some of our rust modules use FFI and need those to be checked
|
||||
- name: Get postgres headers
|
||||
run: make postgres-headers -j$(nproc)
|
||||
|
||||
# cargo hack runs the given cargo subcommand (clippy in this case) for all feature combinations.
|
||||
# This will catch compiler & clippy warnings in all feature combinations.
|
||||
# TODO: use cargo hack for build and test as well, but, that's quite expensive.
|
||||
# NB: keep clippy args in sync with ./run_clippy.sh
|
||||
- run: |
|
||||
CLIPPY_COMMON_ARGS="$( source .neon_clippy_args; echo "$CLIPPY_COMMON_ARGS")"
|
||||
if [ "$CLIPPY_COMMON_ARGS" = "" ]; then
|
||||
echo "No clippy args found in .neon_clippy_args"
|
||||
exit 1
|
||||
fi
|
||||
echo "CLIPPY_COMMON_ARGS=${CLIPPY_COMMON_ARGS}" >> $GITHUB_ENV
|
||||
- name: Run cargo clippy (debug)
|
||||
run: cargo hack --feature-powerset clippy $CLIPPY_COMMON_ARGS
|
||||
- name: Run cargo clippy (release)
|
||||
run: cargo hack --feature-powerset clippy --release $CLIPPY_COMMON_ARGS
|
||||
|
||||
- name: Check documentation generation
|
||||
run: cargo doc --workspace --no-deps --document-private-items
|
||||
env:
|
||||
RUSTDOCFLAGS: "-Dwarnings -Arustdoc::private_intra_doc_links"
|
||||
|
||||
# Use `${{ !cancelled() }}` to run quick tests after the longer clippy run
|
||||
- name: Check formatting
|
||||
if: ${{ !cancelled() }}
|
||||
run: cargo fmt --all -- --check
|
||||
|
||||
# https://github.com/facebookincubator/cargo-guppy/tree/bec4e0eb29dcd1faac70b1b5360267fc02bf830e/tools/cargo-hakari#2-keep-the-workspace-hack-up-to-date-in-ci
|
||||
- name: Check rust dependencies
|
||||
if: ${{ !cancelled() }}
|
||||
run: |
|
||||
cargo hakari generate --diff # workspace-hack Cargo.toml is up-to-date
|
||||
cargo hakari manage-deps --dry-run # all workspace crates depend on workspace-hack
|
||||
|
||||
# https://github.com/EmbarkStudios/cargo-deny
|
||||
- name: Check rust licenses/bans/advisories/sources
|
||||
if: ${{ !cancelled() }}
|
||||
run: cargo deny check
|
||||
|
||||
gather-rust-build-stats:
|
||||
if: |
|
||||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-stats') ||
|
||||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
|
||||
github.ref_name == 'main'
|
||||
if: github.ref_name == 'main' || contains(github.event.pull_request.labels.*.name, 'run-extra-build-stats')
|
||||
runs-on: [ self-hosted, gen3, large ]
|
||||
container:
|
||||
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
|
||||
|
||||
Cargo.lock (generated, 629 changed lines): diff suppressed because it is too large.
@@ -83,7 +83,7 @@ hex = "0.4"
hex-literal = "0.4"
hmac = "0.12.1"
hostname = "0.3.1"
http-types = { version = "2", default-features = false }
http-types = "2"
humantime = "2.1"
humantime-serde = "1.1.1"
hyper = "0.14"
@@ -136,7 +136,6 @@ strum_macros = "0.24"
svg_fmt = "0.4.1"
sync_wrapper = "0.1.2"
tar = "0.4"
task-local-extensions = "0.1.4"
test-context = "0.1"
thiserror = "1.0"
tls-listener = { version = "0.7", features = ["rustls", "hyper-h1"] }

@@ -710,12 +710,8 @@ impl ComputeNode {
|
||||
// `pg_ctl` for start / stop, so this just seems much easier to do as we already
|
||||
// have opened connection to Postgres and superuser access.
|
||||
#[instrument(skip_all)]
|
||||
fn pg_reload_conf(&self) -> Result<()> {
|
||||
let pgctl_bin = Path::new(&self.pgbin).parent().unwrap().join("pg_ctl");
|
||||
Command::new(pgctl_bin)
|
||||
.args(["reload", "-D", &self.pgdata])
|
||||
.output()
|
||||
.expect("cannot run pg_ctl process");
|
||||
fn pg_reload_conf(&self, client: &mut Client) -> Result<()> {
|
||||
client.simple_query("SELECT pg_reload_conf()")?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
@@ -728,9 +724,9 @@ impl ComputeNode {
|
||||
// Write new config
|
||||
let pgdata_path = Path::new(&self.pgdata);
|
||||
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), &spec, None)?;
|
||||
self.pg_reload_conf()?;
|
||||
|
||||
let mut client = Client::connect(self.connstr.as_str(), NoTls)?;
|
||||
self.pg_reload_conf(&mut client)?;
|
||||
|
||||
// Proceed with post-startup configuration. Note, that order of operations is important.
|
||||
// Disable DDL forwarding because control plane already knows about these roles/databases.
|
||||
|
||||
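The hunk above swaps spawning `pg_ctl reload` for a `SELECT pg_reload_conf()` issued over the already-open superuser connection. A minimal standalone sketch of that pattern using the `postgres` crate (the connection string is a hypothetical stand-in for `self.connstr`, not the actual compute_ctl wiring):

```rust
use postgres::{Client, NoTls};

// Reload the server configuration over an existing connection instead of
// shelling out to `pg_ctl reload`.
fn pg_reload_conf(client: &mut Client) -> Result<(), postgres::Error> {
    // pg_reload_conf() signals the postmaster to re-read postgresql.conf.
    client.simple_query("SELECT pg_reload_conf()")?;
    Ok(())
}

fn main() -> Result<(), postgres::Error> {
    // Hypothetical connection string for illustration only.
    let mut client = Client::connect("host=127.0.0.1 user=cloud_admin dbname=postgres", NoTls)?;
    pg_reload_conf(&mut client)
}
```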
@@ -133,6 +133,45 @@ fn parse_pg_version(human_version: &str) -> &str {
|
||||
panic!("Unsuported postgres version {human_version}");
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::parse_pg_version;
|
||||
|
||||
#[test]
|
||||
fn test_parse_pg_version() {
|
||||
assert_eq!(parse_pg_version("PostgreSQL 15.4"), "v15");
|
||||
assert_eq!(parse_pg_version("PostgreSQL 15.14"), "v15");
|
||||
assert_eq!(
|
||||
parse_pg_version("PostgreSQL 15.4 (Ubuntu 15.4-0ubuntu0.23.04.1)"),
|
||||
"v15"
|
||||
);
|
||||
|
||||
assert_eq!(parse_pg_version("PostgreSQL 14.15"), "v14");
|
||||
assert_eq!(parse_pg_version("PostgreSQL 14.0"), "v14");
|
||||
assert_eq!(
|
||||
parse_pg_version("PostgreSQL 14.9 (Debian 14.9-1.pgdg120+1"),
|
||||
"v14"
|
||||
);
|
||||
|
||||
assert_eq!(parse_pg_version("PostgreSQL 16devel"), "v16");
|
||||
assert_eq!(parse_pg_version("PostgreSQL 16beta1"), "v16");
|
||||
assert_eq!(parse_pg_version("PostgreSQL 16rc2"), "v16");
|
||||
assert_eq!(parse_pg_version("PostgreSQL 16extra"), "v16");
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[should_panic]
|
||||
fn test_parse_pg_unsupported_version() {
|
||||
parse_pg_version("PostgreSQL 13.14");
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[should_panic]
|
||||
fn test_parse_pg_incorrect_version_format() {
|
||||
parse_pg_version("PostgreSQL 14");
|
||||
}
|
||||
}
|
||||
|
||||
// download the archive for a given extension,
|
||||
// unzip it, and place files in the appropriate locations (share/lib)
|
||||
pub async fn download_extension(
|
||||
@@ -246,42 +285,3 @@ pub fn init_remote_storage(remote_ext_config: &str) -> anyhow::Result<GenericRem
|
||||
};
|
||||
GenericRemoteStorage::from_config(&config)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::parse_pg_version;
|
||||
|
||||
#[test]
|
||||
fn test_parse_pg_version() {
|
||||
assert_eq!(parse_pg_version("PostgreSQL 15.4"), "v15");
|
||||
assert_eq!(parse_pg_version("PostgreSQL 15.14"), "v15");
|
||||
assert_eq!(
|
||||
parse_pg_version("PostgreSQL 15.4 (Ubuntu 15.4-0ubuntu0.23.04.1)"),
|
||||
"v15"
|
||||
);
|
||||
|
||||
assert_eq!(parse_pg_version("PostgreSQL 14.15"), "v14");
|
||||
assert_eq!(parse_pg_version("PostgreSQL 14.0"), "v14");
|
||||
assert_eq!(
|
||||
parse_pg_version("PostgreSQL 14.9 (Debian 14.9-1.pgdg120+1"),
|
||||
"v14"
|
||||
);
|
||||
|
||||
assert_eq!(parse_pg_version("PostgreSQL 16devel"), "v16");
|
||||
assert_eq!(parse_pg_version("PostgreSQL 16beta1"), "v16");
|
||||
assert_eq!(parse_pg_version("PostgreSQL 16rc2"), "v16");
|
||||
assert_eq!(parse_pg_version("PostgreSQL 16extra"), "v16");
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[should_panic]
|
||||
fn test_parse_pg_unsupported_version() {
|
||||
parse_pg_version("PostgreSQL 13.14");
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[should_panic]
|
||||
fn test_parse_pg_incorrect_version_format() {
|
||||
parse_pg_version("PostgreSQL 14");
|
||||
}
|
||||
}
|
||||
|
||||
@@ -9,7 +9,6 @@ pub struct AttachmentService {
|
||||
env: LocalEnv,
|
||||
listen: String,
|
||||
path: PathBuf,
|
||||
client: reqwest::blocking::Client,
|
||||
}
|
||||
|
||||
const COMMAND: &str = "attachment_service";
|
||||
@@ -53,9 +52,6 @@ impl AttachmentService {
|
||||
env: env.clone(),
|
||||
path,
|
||||
listen,
|
||||
client: reqwest::blocking::ClientBuilder::new()
|
||||
.build()
|
||||
.expect("Failed to construct http client"),
|
||||
}
|
||||
}
|
||||
|
||||
@@ -98,13 +94,16 @@ impl AttachmentService {
|
||||
.unwrap()
|
||||
.join("attach-hook")
|
||||
.unwrap();
|
||||
let client = reqwest::blocking::ClientBuilder::new()
|
||||
.build()
|
||||
.expect("Failed to construct http client");
|
||||
|
||||
let request = AttachHookRequest {
|
||||
tenant_id,
|
||||
node_id: Some(pageserver_id),
|
||||
};
|
||||
|
||||
let response = self.client.post(url).json(&request).send()?;
|
||||
let response = client.post(url).json(&request).send()?;
|
||||
if response.status() != StatusCode::OK {
|
||||
return Err(anyhow!("Unexpected status {}", response.status()));
|
||||
}
|
||||
@@ -123,10 +122,13 @@ impl AttachmentService {
|
||||
.unwrap()
|
||||
.join("inspect")
|
||||
.unwrap();
|
||||
let client = reqwest::blocking::ClientBuilder::new()
|
||||
.build()
|
||||
.expect("Failed to construct http client");
|
||||
|
||||
let request = InspectRequest { tenant_id };
|
||||
|
||||
let response = self.client.post(url).json(&request).send()?;
|
||||
let response = client.post(url).json(&request).send()?;
|
||||
if response.status() != StatusCode::OK {
|
||||
return Err(anyhow!("Unexpected status {}", response.status()));
|
||||
}
|
||||
|
||||
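The attachment-service hunks above move between keeping a single `reqwest::blocking::Client` on the struct and building one per request; either way the call itself is a plain JSON POST. A hedged sketch of that call shape (the field types and base URL are stand-ins, not the real `AttachHookRequest`):

```rust
use reqwest::blocking::Client;
use serde::Serialize;

// Stand-in for the real request type; the actual struct uses TenantId/NodeId.
#[derive(Serialize)]
struct AttachHookRequest {
    tenant_id: String,
    node_id: Option<u64>,
}

// Requires reqwest's `blocking` and `json` features.
fn attach_hook(client: &Client, base_url: &str, req: &AttachHookRequest) -> anyhow::Result<()> {
    let response = client.post(format!("{base_url}/attach-hook")).json(req).send()?;
    anyhow::ensure!(
        response.status().is_success(),
        "Unexpected status {}",
        response.status()
    );
    Ok(())
}
```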
@@ -15,7 +15,7 @@ use control_plane::pageserver::{PageServerNode, PAGESERVER_REMOTE_STORAGE_DIR};
|
||||
use control_plane::safekeeper::SafekeeperNode;
|
||||
use control_plane::tenant_migration::migrate_tenant;
|
||||
use control_plane::{broker, local_env};
|
||||
use pageserver_api::models::TimelineInfo;
|
||||
use pageserver_api::models::{LocationConfig, LocationConfigMode, TimelineInfo};
|
||||
use pageserver_api::{
|
||||
DEFAULT_HTTP_LISTEN_PORT as DEFAULT_PAGESERVER_HTTP_PORT,
|
||||
DEFAULT_PG_LISTEN_PORT as DEFAULT_PAGESERVER_PG_PORT,
|
||||
@@ -30,6 +30,7 @@ use std::path::PathBuf;
|
||||
use std::process::exit;
|
||||
use std::str::FromStr;
|
||||
use storage_broker::DEFAULT_LISTEN_ADDR as DEFAULT_BROKER_ADDR;
|
||||
use utils::generation::Generation;
|
||||
use utils::{
|
||||
auth::{Claims, Scope},
|
||||
id::{NodeId, TenantId, TenantTimelineId, TimelineId},
|
||||
@@ -374,9 +375,10 @@ fn pageserver_config_overrides(init_match: &ArgMatches) -> Vec<&str> {
|
||||
}
|
||||
|
||||
fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> anyhow::Result<()> {
|
||||
let pageserver = get_default_pageserver(env);
|
||||
match tenant_match.subcommand() {
|
||||
Some(("list", _)) => {
|
||||
// TODO: make command aware of multiple pageservers
|
||||
let pageserver = get_default_pageserver(env);
|
||||
for t in pageserver.tenant_list()? {
|
||||
println!("{} {:?}", t.id, t.state);
|
||||
}
|
||||
@@ -387,37 +389,73 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
|
||||
.map(|vals| vals.flat_map(|c| c.split_once(':')).collect())
|
||||
.unwrap_or_default();
|
||||
|
||||
let shard_count: u8 = create_match
|
||||
.get_one::<u8>("shard-count")
|
||||
.cloned()
|
||||
.unwrap_or(1);
|
||||
|
||||
// If tenant ID was not specified, generate one
|
||||
let tenant_id = parse_tenant_id(create_match)?.unwrap_or_else(TenantId::generate);
|
||||
|
||||
let generation = if env.control_plane_api.is_some() {
|
||||
// We must register the tenant with the attachment service, so
|
||||
// that when the pageserver restarts, it will be re-attached.
|
||||
let attachment_service = AttachmentService::from_env(env);
|
||||
attachment_service.attach_hook(tenant_id, pageserver.conf.id)?
|
||||
} else {
|
||||
None
|
||||
};
|
||||
|
||||
pageserver.tenant_create(tenant_id, generation, tenant_conf)?;
|
||||
println!("tenant {tenant_id} successfully created on the pageserver");
|
||||
|
||||
// Create an initial timeline for the new tenant
|
||||
let new_timeline_id = parse_timeline_id(create_match)?;
|
||||
// We will create an initial timeline for the new tenant
|
||||
let new_timeline_id =
|
||||
parse_timeline_id(create_match)?.unwrap_or(TimelineId::generate());
|
||||
let pg_version = create_match
|
||||
.get_one::<u32>("pg-version")
|
||||
.copied()
|
||||
.context("Failed to parse postgres version from the argument string")?;
|
||||
|
||||
let timeline_info = pageserver.timeline_create(
|
||||
tenant_id,
|
||||
new_timeline_id,
|
||||
None,
|
||||
None,
|
||||
Some(pg_version),
|
||||
)?;
|
||||
let new_timeline_id = timeline_info.timeline_id;
|
||||
let last_record_lsn = timeline_info.last_record_lsn;
|
||||
// TODO: implement ability for one pageserver to hold multiple
|
||||
// shards for the same tenant. Until then, we must place each
|
||||
// shard on a different pageserver.
|
||||
assert!(env.pageservers.len() >= shard_count as usize);
|
||||
|
||||
for shard_number in 0..shard_count {
|
||||
let ps_conf = env.pageservers.get(shard_number as usize).unwrap();
|
||||
let pageserver = PageServerNode::from_env(env, ps_conf);
|
||||
|
||||
// TODO: per-shard generations
|
||||
let generation = if env.control_plane_api.is_some() {
|
||||
// We must register the tenant with the attachment service, so
|
||||
// that when the pageserver restarts, it will be re-attached.
|
||||
let attachment_service = AttachmentService::from_env(env);
|
||||
attachment_service.attach_hook(tenant_id, pageserver.conf.id)?
|
||||
} else {
|
||||
None
|
||||
};
|
||||
|
||||
// TODO: shard-aware POST /v1/tenant. Currently tenant creation on the
|
||||
// pageserver is a no-op, but we shouldn't skip the command entirely.
|
||||
|
||||
let tenant_conf = PageServerNode::build_config(tenant_conf.clone())?;
|
||||
|
||||
let location_conf = LocationConfig {
|
||||
shard_count,
|
||||
shard_number,
|
||||
shard_stripe_size: 32000,
|
||||
mode: LocationConfigMode::AttachedSingle,
|
||||
generation: generation.map(Generation::new),
|
||||
secondary_conf: None,
|
||||
tenant_conf,
|
||||
};
|
||||
pageserver.location_config(tenant_id, location_conf)?;
|
||||
println!(
|
||||
"tenant {tenant_id} successfully created on pageserver {}",
|
||||
pageserver.conf.id
|
||||
);
|
||||
}
|
||||
|
||||
for shard_number in 0..shard_count {
|
||||
let ps_conf = env.pageservers.get(shard_number as usize).unwrap();
|
||||
let pageserver = PageServerNode::from_env(env, ps_conf);
|
||||
pageserver.timeline_create(
|
||||
tenant_id,
|
||||
Some(new_timeline_id),
|
||||
None,
|
||||
None,
|
||||
Some(pg_version),
|
||||
)?;
|
||||
}
|
||||
|
||||
env.register_branch_mapping(
|
||||
DEFAULT_BRANCH_NAME.to_string(),
|
||||
@@ -425,9 +463,7 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
|
||||
new_timeline_id,
|
||||
)?;
|
||||
|
||||
println!(
|
||||
"Created an initial timeline '{new_timeline_id}' at Lsn {last_record_lsn} for tenant: {tenant_id}",
|
||||
);
|
||||
println!("Created an initial timeline '{new_timeline_id}' for tenant: {tenant_id}",);
|
||||
|
||||
if create_match.get_flag("set-default") {
|
||||
println!("Setting tenant {tenant_id} as a default one");
|
||||
@@ -447,6 +483,8 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
|
||||
.map(|vals| vals.flat_map(|c| c.split_once(':')).collect())
|
||||
.unwrap_or_default();
|
||||
|
||||
// TODO: make command aware of multiple pageservers
|
||||
let pageserver = get_default_pageserver(env);
|
||||
pageserver
|
||||
.tenant_config(tenant_id, tenant_conf)
|
||||
.with_context(|| format!("Tenant config failed for tenant with id {tenant_id}"))?;
|
||||
@@ -1345,6 +1383,7 @@ fn cli() -> Command {
|
||||
.arg(pg_version_arg.clone())
|
||||
.arg(Arg::new("set-default").long("set-default").action(ArgAction::SetTrue).required(false)
|
||||
.help("Use this tenant in future CLI commands where tenant_id is needed, but not specified"))
|
||||
.arg(Arg::new("shard-count").value_parser(value_parser!(u8)).long("shard-count").action(ArgAction::Set).help("Number of shards in the new tenant (default 1)"))
|
||||
)
|
||||
.subcommand(Command::new("set-default").arg(tenant_id_arg.clone().required(true))
|
||||
.about("Set a particular tenant as default in future CLI commands where tenant_id is needed, but not specified"))
|
||||
|
||||
@@ -18,7 +18,6 @@ use camino::Utf8PathBuf;
|
||||
use pageserver_api::models::{
|
||||
self, LocationConfig, TenantInfo, TenantLocationConfigRequest, TimelineInfo,
|
||||
};
|
||||
use pageserver_api::shard::TenantShardId;
|
||||
use postgres_backend::AuthType;
|
||||
use postgres_connection::{parse_host_port, PgConnectionConfig};
|
||||
use reqwest::blocking::{Client, RequestBuilder, Response};
|
||||
@@ -339,15 +338,8 @@ impl PageServerNode {
|
||||
.json()?)
|
||||
}
|
||||
|
||||
pub fn tenant_create(
|
||||
&self,
|
||||
new_tenant_id: TenantId,
|
||||
generation: Option<u32>,
|
||||
settings: HashMap<&str, &str>,
|
||||
) -> anyhow::Result<TenantId> {
|
||||
let mut settings = settings.clone();
|
||||
|
||||
let config = models::TenantConfig {
|
||||
pub fn build_config(mut settings: HashMap<&str, &str>) -> anyhow::Result<models::TenantConfig> {
|
||||
Ok(models::TenantConfig {
|
||||
checkpoint_distance: settings
|
||||
.remove("checkpoint_distance")
|
||||
.map(|x| x.parse::<u64>())
|
||||
@@ -406,10 +398,18 @@ impl PageServerNode {
|
||||
.map(|x| x.parse::<bool>())
|
||||
.transpose()
|
||||
.context("Failed to parse 'gc_feedback' as bool")?,
|
||||
};
|
||||
})
|
||||
}
|
||||
|
||||
pub fn tenant_create(
|
||||
&self,
|
||||
new_tenant_id: TenantId,
|
||||
generation: Option<u32>,
|
||||
settings: HashMap<&str, &str>,
|
||||
) -> anyhow::Result<TenantId> {
|
||||
let config = Self::build_config(settings.clone())?;
|
||||
let request = models::TenantCreateRequest {
|
||||
new_tenant_id: TenantShardId::unsharded(new_tenant_id),
|
||||
new_tenant_id,
|
||||
generation,
|
||||
config,
|
||||
};
|
||||
|
||||
@@ -102,6 +102,9 @@ pub fn migrate_tenant(
|
||||
println!("🔁 Already attached to {origin_ps_id}, freshening...");
|
||||
let gen = attachment_service.attach_hook(tenant_id, dest_ps.conf.id)?;
|
||||
let dest_conf = LocationConfig {
|
||||
shard_count: 0,
|
||||
shard_number: 0,
|
||||
shard_stripe_size: 0,
|
||||
mode: LocationConfigMode::AttachedSingle,
|
||||
generation: gen.map(Generation::new),
|
||||
secondary_conf: None,
|
||||
@@ -115,6 +118,9 @@ pub fn migrate_tenant(
|
||||
println!("🔁 Switching origin pageserver {origin_ps_id} to stale mode");
|
||||
|
||||
let stale_conf = LocationConfig {
|
||||
shard_count: 0,
|
||||
shard_number: 0,
|
||||
shard_stripe_size: 0,
|
||||
mode: LocationConfigMode::AttachedStale,
|
||||
generation: Some(Generation::new(*generation)),
|
||||
secondary_conf: None,
|
||||
@@ -127,6 +133,9 @@ pub fn migrate_tenant(
|
||||
|
||||
let gen = attachment_service.attach_hook(tenant_id, dest_ps.conf.id)?;
|
||||
let dest_conf = LocationConfig {
|
||||
shard_count: 0,
|
||||
shard_number: 0,
|
||||
shard_stripe_size: 0,
|
||||
mode: LocationConfigMode::AttachedMulti,
|
||||
generation: gen.map(Generation::new),
|
||||
secondary_conf: None,
|
||||
@@ -171,6 +180,9 @@ pub fn migrate_tenant(
|
||||
|
||||
// Downgrade to a secondary location
|
||||
let secondary_conf = LocationConfig {
|
||||
shard_count: 0,
|
||||
shard_number: 0,
|
||||
shard_stripe_size: 0,
|
||||
mode: LocationConfigMode::Secondary,
|
||||
generation: None,
|
||||
secondary_conf: Some(LocationConfigSecondary { warm: true }),
|
||||
@@ -189,6 +201,9 @@ pub fn migrate_tenant(
|
||||
dest_ps.conf.id
|
||||
);
|
||||
let dest_conf = LocationConfig {
|
||||
shard_count: 0,
|
||||
shard_number: 0,
|
||||
shard_stripe_size: 0,
|
||||
mode: LocationConfigMode::AttachedSingle,
|
||||
generation: gen.map(Generation::new),
|
||||
secondary_conf: None,
|
||||
|
||||
deny.toml (22 changed lines)
@@ -74,30 +74,10 @@ highlight = "all"
|
||||
workspace-default-features = "allow"
|
||||
external-default-features = "allow"
|
||||
allow = []
|
||||
|
||||
deny = []
|
||||
skip = []
|
||||
skip-tree = []
|
||||
|
||||
[[bans.deny]]
|
||||
# we use tokio, the same rationale applies for async-{io,waker,global-executor,executor,channel,lock}, smol
|
||||
# if you find yourself here while adding a dependency, try "default-features = false", ask around on #rust
|
||||
name = "async-std"
|
||||
|
||||
[[bans.deny]]
|
||||
name = "async-io"
|
||||
|
||||
[[bans.deny]]
|
||||
name = "async-waker"
|
||||
|
||||
[[bans.deny]]
|
||||
name = "async-global-executor"
|
||||
|
||||
[[bans.deny]]
|
||||
name = "async-executor"
|
||||
|
||||
[[bans.deny]]
|
||||
name = "smol"
|
||||
|
||||
# This section is considered when running `cargo deny check sources`.
|
||||
# More documentation about the 'sources' section can be found here:
|
||||
# https://embarkstudios.github.io/cargo-deny/checks/sources/cfg.html
|
||||
|
||||
@@ -177,7 +177,7 @@ I e during migration create_branch can be called on old pageserver and newly cre

The difference of simplistic approach from one described above is that it calls ignore on source tenant first and then calls attach on target pageserver. Approach above does it in opposite order thus opening a possibility for race conditions we strive to avoid.

The approach largely follows this guide: <https://www.notion.so/neondatabase/Cloud-Ad-hoc-tenant-relocation-f687474f7bfc42269e6214e3acba25c7>
The approach largely follows this guide: <https://github.com/neondatabase/cloud/wiki/Cloud:-Ad-hoc-tenant-relocation>

The happy path sequence:

@@ -17,9 +17,6 @@ postgres_ffi.workspace = true
|
||||
enum-map.workspace = true
|
||||
strum.workspace = true
|
||||
strum_macros.workspace = true
|
||||
hex.workspace = true
|
||||
url.workspace = true
|
||||
|
||||
workspace_hack.workspace = true
|
||||
|
||||
[dev-dependencies]
|
||||
bincode.workspace = true
|
||||
workspace_hack.workspace = true
|
||||
@@ -16,7 +16,7 @@ use utils::{
|
||||
lsn::Lsn,
|
||||
};
|
||||
|
||||
use crate::{reltag::RelTag, shard::TenantShardId};
|
||||
use crate::reltag::RelTag;
|
||||
use anyhow::bail;
|
||||
use bytes::{BufMut, Bytes, BytesMut};
|
||||
|
||||
@@ -187,7 +187,7 @@ pub struct TimelineCreateRequest {
|
||||
#[derive(Serialize, Deserialize, Debug)]
|
||||
#[serde(deny_unknown_fields)]
|
||||
pub struct TenantCreateRequest {
|
||||
pub new_tenant_id: TenantShardId,
|
||||
pub new_tenant_id: TenantId,
|
||||
#[serde(default)]
|
||||
#[serde(skip_serializing_if = "Option::is_none")]
|
||||
pub generation: Option<u32>,
|
||||
@@ -259,6 +259,9 @@ pub struct LocationConfigSecondary {
|
||||
/// for use in external-facing APIs.
|
||||
#[derive(Serialize, Deserialize, Debug)]
|
||||
pub struct LocationConfig {
|
||||
pub shard_number: u8,
|
||||
pub shard_count: u8,
|
||||
pub shard_stripe_size: u32,
|
||||
pub mode: LocationConfigMode,
|
||||
/// If attaching, in what generation?
|
||||
#[serde(default)]
|
||||
|
||||
@@ -1,8 +1,6 @@
|
||||
use std::{ops::RangeInclusive, str::FromStr};
|
||||
|
||||
use hex::FromHex;
|
||||
use crate::key::Key;
|
||||
use serde::{Deserialize, Serialize};
|
||||
use utils::id::TenantId;
|
||||
use utils::id::NodeId;
|
||||
|
||||
#[derive(Ord, PartialOrd, Eq, PartialEq, Clone, Copy, Serialize, Deserialize, Debug)]
|
||||
pub struct ShardNumber(pub u8);
|
||||
@@ -10,312 +8,203 @@ pub struct ShardNumber(pub u8);
|
||||
#[derive(Ord, PartialOrd, Eq, PartialEq, Clone, Copy, Serialize, Deserialize, Debug)]
|
||||
pub struct ShardCount(pub u8);
|
||||
|
||||
impl ShardCount {
|
||||
pub const MAX: Self = Self(u8::MAX);
|
||||
}
|
||||
|
||||
impl ShardNumber {
|
||||
pub const MAX: Self = Self(u8::MAX);
|
||||
fn within_count(&self, rhs: ShardCount) -> bool {
|
||||
self.0 < rhs.0
|
||||
}
|
||||
}
|
||||
|
||||
/// TenantShardId identify the units of work for the Pageserver.
|
||||
///
|
||||
/// These are written as `<tenant_id>-<shard number><shard-count>`, for example:
|
||||
///
|
||||
/// # The second shard in a two-shard tenant
|
||||
/// 072f1291a5310026820b2fe4b2968934-0102
|
||||
///
|
||||
/// Historically, tenants could not have multiple shards, and were identified
|
||||
/// by TenantId. To support this, TenantShardId has a special legacy
|
||||
/// mode where `shard_count` is equal to zero: this represents a single-sharded
|
||||
/// tenant which should be written as a TenantId with no suffix.
|
||||
///
|
||||
/// The human-readable encoding of TenantShardId, such as used in API URLs,
|
||||
/// is both forward and backward compatible: a legacy TenantId can be
|
||||
/// decoded as a TenantShardId, and when re-encoded it will be parseable
|
||||
/// as a TenantId.
|
||||
///
|
||||
/// Note that the binary encoding is _not_ backward compatible, because
|
||||
/// at the time sharding is introduced, there are no existing binary structures
|
||||
/// containing TenantId that we need to handle.
|
||||
#[derive(Eq, PartialEq, PartialOrd, Ord, Clone, Copy)]
|
||||
pub struct TenantShardId {
|
||||
pub tenant_id: TenantId,
|
||||
pub shard_number: ShardNumber,
|
||||
pub shard_count: ShardCount,
|
||||
/// Stripe size in number of pages
|
||||
#[derive(Clone, Copy, Serialize, Deserialize, Eq, PartialEq, Debug)]
|
||||
pub struct ShardStripeSize(pub u32);
|
||||
|
||||
/// Layout version: for future upgrades where we might change how the key->shard mapping works
|
||||
#[derive(Clone, Copy, Serialize, Deserialize, Eq, PartialEq, Debug)]
|
||||
pub struct ShardLayout(u8);
|
||||
|
||||
const LAYOUT_V1: ShardLayout = ShardLayout(1);
|
||||
|
||||
/// Default stripe size in pages: 256MiB divided by 8kiB page size.
|
||||
const DEFAULT_STRIPE_SIZE: ShardStripeSize = ShardStripeSize(256 * 1024 / 8);
|
||||
|
||||
/// The ShardIdentity contains the information needed for one member of map
|
||||
/// to resolve a key to a shard, and then check whether that shard is ==self.
|
||||
#[derive(Clone, Copy, Serialize, Deserialize, Eq, PartialEq, Debug)]
|
||||
pub struct ShardIdentity {
|
||||
pub layout: ShardLayout,
|
||||
pub number: ShardNumber,
|
||||
pub count: ShardCount,
|
||||
pub stripe_size: ShardStripeSize,
|
||||
}
|
||||
|
||||
impl TenantShardId {
|
||||
pub fn unsharded(tenant_id: TenantId) -> Self {
|
||||
/// The location of a shard contains both the logical identity of the pageserver
|
||||
/// holding it (control plane's perspective), and the physical page service port
|
||||
/// that postgres should use (endpoint's perspective).
|
||||
#[derive(Clone)]
|
||||
pub struct ShardLocation {
|
||||
pub id: NodeId,
|
||||
pub page_service: (url::Host, u16),
|
||||
}
|
||||
|
||||
/// The ShardMap is sufficient information to map any Key to the page service
|
||||
/// which should store it.
|
||||
#[derive(Clone)]
|
||||
struct ShardMap {
|
||||
layout: ShardLayout,
|
||||
count: ShardCount,
|
||||
stripe_size: ShardStripeSize,
|
||||
pageservers: Vec<Option<ShardLocation>>,
|
||||
}
|
||||
|
||||
impl ShardMap {
|
||||
pub fn get_location(&self, shard_number: ShardNumber) -> &Option<ShardLocation> {
|
||||
assert!(shard_number.within_count(self.count));
|
||||
self.pageservers.get(shard_number.0 as usize).unwrap()
|
||||
}
|
||||
|
||||
pub fn get_identity(&self, shard_number: ShardNumber) -> ShardIdentity {
|
||||
assert!(shard_number.within_count(self.count));
|
||||
ShardIdentity {
|
||||
layout: self.layout,
|
||||
number: shard_number,
|
||||
count: self.count,
|
||||
stripe_size: self.stripe_size,
|
||||
}
|
||||
}
|
||||
|
||||
/// Return Some if the key is assigned to a particular shard. Else the key
|
||||
/// should be ingested by all shards (e.g. dbdir metadata).
|
||||
pub fn get_shard_number(&self, key: &Key) -> Option<ShardNumber> {
|
||||
if self.count < ShardCount(2) || key_is_broadcast(key) {
|
||||
None
|
||||
} else {
|
||||
Some(key_to_shard_number(self.count, self.stripe_size, key))
|
||||
}
|
||||
}
|
||||
|
||||
pub fn default_with_shards(shard_count: ShardCount) -> Self {
|
||||
ShardMap {
|
||||
layout: LAYOUT_V1,
|
||||
count: shard_count,
|
||||
stripe_size: DEFAULT_STRIPE_SIZE,
|
||||
pageservers: (0..shard_count.0 as usize).map(|_| None).collect(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
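As a usage sketch for the `ShardMap` above (in-module, since the type is private; treating `None` from `get_shard_number` as "read from shard 0" is an assumption about how a caller would handle broadcast keys):

```rust
// Resolve a Key to the page service address that should serve it.
fn page_service_for(map: &ShardMap, key: &Key) -> Option<(url::Host, u16)> {
    // Broadcast keys (get_shard_number() == None) are present on every shard;
    // this sketch arbitrarily reads them from shard 0.
    let shard = map.get_shard_number(key).unwrap_or(ShardNumber(0));
    map.get_location(shard)
        .as_ref()
        .map(|location| location.page_service.clone())
}
```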
impl ShardIdentity {
|
||||
/// An identity with number=0 count=0 is a "none" identity, which represents legacy
|
||||
/// tenants. Modern single-shard tenants should not use this: they should
|
||||
/// have number=0 count=1.
|
||||
pub fn none() -> Self {
|
||||
Self {
|
||||
tenant_id,
|
||||
shard_number: ShardNumber(0),
|
||||
shard_count: ShardCount(0),
|
||||
number: ShardNumber(0),
|
||||
count: ShardCount(0),
|
||||
layout: LAYOUT_V1,
|
||||
stripe_size: DEFAULT_STRIPE_SIZE,
|
||||
}
|
||||
}
|
||||
|
||||
/// The range of all TenantShardId that belong to a particular TenantId. This is useful when
|
||||
/// you have a BTreeMap of TenantShardId, and are querying by TenantId.
|
||||
pub fn tenant_range(tenant_id: TenantId) -> RangeInclusive<Self> {
|
||||
RangeInclusive::new(
|
||||
Self {
|
||||
tenant_id,
|
||||
shard_number: ShardNumber(0),
|
||||
shard_count: ShardCount(0),
|
||||
},
|
||||
Self {
|
||||
tenant_id,
|
||||
shard_number: ShardNumber::MAX,
|
||||
shard_count: ShardCount::MAX,
|
||||
},
|
||||
)
|
||||
}
|
||||
|
||||
pub fn shard_slug(&self) -> String {
|
||||
format!("{:02x}{:02x}", self.shard_number.0, self.shard_count.0)
|
||||
}
|
||||
}
|
||||
|
||||
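The `tenant_range` helper above exists so that a `BTreeMap` keyed by `TenantShardId` can be queried for every shard of one tenant; a minimal sketch of that intended usage (the map itself is hypothetical):

```rust
use std::collections::BTreeMap;

// Iterate over every shard of `tenant_id` held in a map keyed by TenantShardId.
// This relies on TenantShardId ordering by (tenant_id, shard_number, shard_count).
fn shards_of<'a, V>(
    map: &'a BTreeMap<TenantShardId, V>,
    tenant_id: TenantId,
) -> impl Iterator<Item = (&'a TenantShardId, &'a V)> {
    map.range(TenantShardId::tenant_range(tenant_id))
}
```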
impl std::fmt::Display for TenantShardId {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
if self.shard_count != ShardCount(0) {
|
||||
write!(
|
||||
f,
|
||||
"{}-{:02x}{:02x}",
|
||||
self.tenant_id, self.shard_number.0, self.shard_count.0
|
||||
)
|
||||
} else {
|
||||
// Legacy case (shard_count == 0) -- format as just the tenant id. Note that this
|
||||
// is distinct from the normal single shard case (shard count == 1).
|
||||
self.tenant_id.fmt(f)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl std::fmt::Debug for TenantShardId {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
// Debug is the same as Display: the compact hex representation
|
||||
write!(f, "{}", self)
|
||||
}
|
||||
}
|
||||
|
||||
impl std::str::FromStr for TenantShardId {
|
||||
type Err = hex::FromHexError;
|
||||
|
||||
fn from_str(s: &str) -> Result<Self, Self::Err> {
|
||||
// Expect format: 16 byte TenantId, '-', 1 byte shard number, 1 byte shard count
|
||||
if s.len() == 32 {
|
||||
// Legacy case: no shard specified
|
||||
Ok(Self {
|
||||
tenant_id: TenantId::from_str(s)?,
|
||||
shard_number: ShardNumber(0),
|
||||
shard_count: ShardCount(0),
|
||||
})
|
||||
} else if s.len() == 37 {
|
||||
let bytes = s.as_bytes();
|
||||
let tenant_id = TenantId::from_hex(&bytes[0..32])?;
|
||||
let mut shard_parts: [u8; 2] = [0u8; 2];
|
||||
hex::decode_to_slice(&bytes[33..37], &mut shard_parts)?;
|
||||
Ok(Self {
|
||||
tenant_id,
|
||||
shard_number: ShardNumber(shard_parts[0]),
|
||||
shard_count: ShardCount(shard_parts[1]),
|
||||
})
|
||||
} else {
|
||||
Err(hex::FromHexError::InvalidStringLength)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl From<[u8; 18]> for TenantShardId {
|
||||
fn from(b: [u8; 18]) -> Self {
|
||||
let tenant_id_bytes: [u8; 16] = b[0..16].try_into().unwrap();
|
||||
|
||||
pub fn new(number: ShardNumber, count: ShardCount, stripe_size: ShardStripeSize) -> Self {
|
||||
Self {
|
||||
tenant_id: TenantId::from(tenant_id_bytes),
|
||||
shard_number: ShardNumber(b[16]),
|
||||
shard_count: ShardCount(b[17]),
|
||||
number,
|
||||
count,
|
||||
layout: LAYOUT_V1,
|
||||
stripe_size,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl Serialize for TenantShardId {
|
||||
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
|
||||
where
|
||||
S: serde::Serializer,
|
||||
{
|
||||
if serializer.is_human_readable() {
|
||||
serializer.collect_str(self)
|
||||
pub fn get_shard_number(&self, key: &Key) -> ShardNumber {
|
||||
key_to_shard_number(self.count, self.stripe_size, key)
|
||||
}
|
||||
|
||||
/// Return true if the key should be ingested by this shard
|
||||
pub fn is_key_local(&self, key: &Key) -> bool {
|
||||
if self.count < ShardCount(2) || key_is_broadcast(key) {
|
||||
true
|
||||
} else {
|
||||
let mut packed: [u8; 18] = [0; 18];
|
||||
packed[0..16].clone_from_slice(&self.tenant_id.as_arr());
|
||||
packed[16] = self.shard_number.0;
|
||||
packed[17] = self.shard_count.0;
|
||||
|
||||
packed.serialize(serializer)
|
||||
key_to_shard_number(self.count, self.stripe_size, key) == self.number
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl<'de> Deserialize<'de> for TenantShardId {
|
||||
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
|
||||
where
|
||||
D: serde::Deserializer<'de>,
|
||||
{
|
||||
struct IdVisitor {
|
||||
is_human_readable_deserializer: bool,
|
||||
}
|
||||
|
||||
impl<'de> serde::de::Visitor<'de> for IdVisitor {
|
||||
type Value = TenantShardId;
|
||||
|
||||
fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
|
||||
if self.is_human_readable_deserializer {
|
||||
formatter.write_str("value in form of hex string")
|
||||
} else {
|
||||
formatter.write_str("value in form of integer array([u8; 18])")
|
||||
}
|
||||
}
|
||||
|
||||
fn visit_seq<A>(self, seq: A) -> Result<Self::Value, A::Error>
|
||||
where
|
||||
A: serde::de::SeqAccess<'de>,
|
||||
{
|
||||
let s = serde::de::value::SeqAccessDeserializer::new(seq);
|
||||
let id: [u8; 18] = Deserialize::deserialize(s)?;
|
||||
Ok(TenantShardId::from(id))
|
||||
}
|
||||
|
||||
fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
|
||||
where
|
||||
E: serde::de::Error,
|
||||
{
|
||||
TenantShardId::from_str(v).map_err(E::custom)
|
||||
}
|
||||
}
|
||||
|
||||
if deserializer.is_human_readable() {
|
||||
deserializer.deserialize_str(IdVisitor {
|
||||
is_human_readable_deserializer: true,
|
||||
})
|
||||
pub fn slug(&self) -> String {
|
||||
if self.count > ShardCount(0) {
|
||||
format!("-{:02x}{:02x}", self.number.0, self.count.0)
|
||||
} else {
|
||||
deserializer.deserialize_tuple(
|
||||
18,
|
||||
IdVisitor {
|
||||
is_human_readable_deserializer: false,
|
||||
},
|
||||
)
|
||||
String::new()
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use std::str::FromStr;
|
||||
|
||||
use bincode;
|
||||
use utils::{id::TenantId, Hex};
|
||||
|
||||
use super::*;
|
||||
|
||||
const EXAMPLE_TENANT_ID: &str = "1f359dd625e519a1a4e8d7509690f6fc";
|
||||
|
||||
#[test]
|
||||
fn tenant_shard_id_string() -> Result<(), hex::FromHexError> {
|
||||
let example = TenantShardId {
|
||||
tenant_id: TenantId::from_str(EXAMPLE_TENANT_ID).unwrap(),
|
||||
shard_count: ShardCount(10),
|
||||
shard_number: ShardNumber(7),
|
||||
};
|
||||
|
||||
let encoded = format!("{example}");
|
||||
|
||||
let expected = format!("{EXAMPLE_TENANT_ID}-070a");
|
||||
assert_eq!(&encoded, &expected);
|
||||
|
||||
let decoded = TenantShardId::from_str(&encoded)?;
|
||||
|
||||
assert_eq!(example, decoded);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn tenant_shard_id_binary() -> Result<(), hex::FromHexError> {
|
||||
let example = TenantShardId {
|
||||
tenant_id: TenantId::from_str(EXAMPLE_TENANT_ID).unwrap(),
|
||||
shard_count: ShardCount(10),
|
||||
shard_number: ShardNumber(7),
|
||||
};
|
||||
|
||||
let encoded = bincode::serialize(&example).unwrap();
|
||||
let expected: [u8; 18] = [
|
||||
0x1f, 0x35, 0x9d, 0xd6, 0x25, 0xe5, 0x19, 0xa1, 0xa4, 0xe8, 0xd7, 0x50, 0x96, 0x90,
|
||||
0xf6, 0xfc, 0x07, 0x0a,
|
||||
];
|
||||
assert_eq!(Hex(&encoded), Hex(&expected));
|
||||
|
||||
let decoded = bincode::deserialize(&encoded).unwrap();
|
||||
|
||||
assert_eq!(example, decoded);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn tenant_shard_id_backward_compat() -> Result<(), hex::FromHexError> {
|
||||
// Test that TenantShardId can decode a TenantId in human
|
||||
// readable form
|
||||
let example = TenantId::from_str(EXAMPLE_TENANT_ID).unwrap();
|
||||
let encoded = format!("{example}");
|
||||
|
||||
assert_eq!(&encoded, EXAMPLE_TENANT_ID);
|
||||
|
||||
let decoded = TenantShardId::from_str(&encoded)?;
|
||||
|
||||
assert_eq!(example, decoded.tenant_id);
|
||||
assert_eq!(decoded.shard_count, ShardCount(0));
|
||||
assert_eq!(decoded.shard_number, ShardNumber(0));
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn tenant_shard_id_forward_compat() -> Result<(), hex::FromHexError> {
|
||||
// Test that a legacy TenantShardId encodes into a form that
|
||||
// can be decoded as TenantId
|
||||
let example_tenant_id = TenantId::from_str(EXAMPLE_TENANT_ID).unwrap();
|
||||
let example = TenantShardId::unsharded(example_tenant_id);
|
||||
let encoded = format!("{example}");
|
||||
|
||||
assert_eq!(&encoded, EXAMPLE_TENANT_ID);
|
||||
|
||||
let decoded = TenantId::from_str(&encoded)?;
|
||||
|
||||
assert_eq!(example_tenant_id, decoded);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn tenant_shard_id_legacy_binary() -> Result<(), hex::FromHexError> {
|
||||
// Unlike in human readable encoding, binary encoding does not
|
||||
// do any special handling of legacy unsharded TenantIds: this test
|
||||
// is equivalent to the main test for binary encoding, just verifying
|
||||
// that the same behavior applies when we have used `unsharded()` to
|
||||
// construct a TenantShardId.
|
||||
let example = TenantShardId::unsharded(TenantId::from_str(EXAMPLE_TENANT_ID).unwrap());
|
||||
let encoded = bincode::serialize(&example).unwrap();
|
||||
|
||||
let expected: [u8; 18] = [
|
||||
0x1f, 0x35, 0x9d, 0xd6, 0x25, 0xe5, 0x19, 0xa1, 0xa4, 0xe8, 0xd7, 0x50, 0x96, 0x90,
|
||||
0xf6, 0xfc, 0x00, 0x00,
|
||||
];
|
||||
assert_eq!(Hex(&encoded), Hex(&expected));
|
||||
|
||||
let decoded = bincode::deserialize::<TenantShardId>(&encoded).unwrap();
|
||||
assert_eq!(example, decoded);
|
||||
|
||||
Ok(())
|
||||
impl Default for ShardIdentity {
|
||||
/// The default identity is to be the only shard for a tenant, i.e. the legacy
|
||||
/// pre-sharding case.
|
||||
fn default() -> Self {
|
||||
ShardIdentity {
|
||||
layout: LAYOUT_V1,
|
||||
number: ShardNumber(0),
|
||||
count: ShardCount(1),
|
||||
stripe_size: DEFAULT_STRIPE_SIZE,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
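The `none()` and `default()` constructors above encode the convention that `ShardCount(0)` marks a legacy pre-sharding tenant while `ShardCount(1)` is a modern tenant that simply has one shard. A purely illustrative sketch of code leaning on that convention:

```rust
// Classify a ShardIdentity by its count, following the convention described above.
fn describe(identity: &ShardIdentity) -> &'static str {
    match identity.count {
        ShardCount(0) => "legacy unsharded tenant",
        ShardCount(1) => "modern single-shard tenant",
        _ => "sharded tenant",
    }
}
```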
/// Whether this key should be ingested by all shards
|
||||
fn key_is_broadcast(key: &Key) -> bool {
|
||||
// TODO: deduplicate wrt pgdatadir_mapping.rs
|
||||
fn is_rel_block_key(key: &Key) -> bool {
|
||||
key.field1 == 0x00 && key.field4 != 0
|
||||
}
|
||||
|
||||
// TODO: can we be less conservative? Starting point is to broadcast everything
|
||||
// except for rel block keys
|
||||
!is_rel_block_key(key)
|
||||
}
|
||||
|
||||
/// Provide the same result as the function in postgres `hashfn.h` with the same name
|
||||
fn murmurhash32(data: u32) -> u32 {
|
||||
let mut h = data;
|
||||
|
||||
h ^= h >> 16;
|
||||
h *= 0x85ebca6b;
|
||||
h ^= h >> 13;
|
||||
h *= 0xc2b2ae35;
|
||||
h ^= h >> 16;
|
||||
h
|
||||
}
|
||||
|
||||
/// Provide the same result as the function in postgres `hashfn.h` with the same name
|
||||
fn hash_combine(mut a: u32, b: u32) -> u32 {
|
||||
a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
|
||||
a
|
||||
}
|
||||
|
||||
/// Where a Key is to be distributed across shards, select the shard. This function
|
||||
/// does not account for keys that should be broadcast across shards.
|
||||
///
|
||||
/// The hashing in this function must exactly match what we do in postgres smgr
|
||||
/// code. The resulting distribution of pages is intended to preserve locality within
|
||||
/// `stripe_size` ranges of contiguous block numbers in the same relation, while otherwise
|
||||
/// distributing data pseudo-randomly.
|
||||
///
|
||||
/// The mapping of key to shard is not stable across changes to ShardCount: this is intentional
|
||||
/// and will be handled at higher levels when shards are split.
|
||||
fn key_to_shard_number(count: ShardCount, stripe_size: ShardStripeSize, key: &Key) -> ShardNumber {
|
||||
// Fast path for un-sharded tenants or broadcast keys
|
||||
if count < ShardCount(2) || key_is_broadcast(key) {
|
||||
return ShardNumber(0);
|
||||
}
|
||||
|
||||
// spcNode
|
||||
let mut hash = murmurhash32(key.field2);
|
||||
// dbNode
|
||||
hash = hash_combine(hash, murmurhash32(key.field3));
|
||||
// relNode
|
||||
hash = hash_combine(hash, murmurhash32(key.field4));
|
||||
// blockNum/stripe size
|
||||
hash = hash_combine(hash, murmurhash32(key.field6 / stripe_size.0));
|
||||
|
||||
let shard = (hash % count.0 as u32) as u8;
|
||||
|
||||
ShardNumber(shard)
|
||||
}
|
||||
|
||||
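To make the striping behaviour above concrete, here is a self-contained sketch that mirrors `murmurhash32`, `hash_combine` and `key_to_shard_number` (wrapping arithmetic written out explicitly, and all spcNode/dbNode/relNode/block values purely hypothetical):

```rust
fn murmurhash32_sketch(mut h: u32) -> u32 {
    h ^= h >> 16;
    h = h.wrapping_mul(0x85eb_ca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2_ae35);
    h ^= h >> 16;
    h
}

fn hash_combine_sketch(a: u32, b: u32) -> u32 {
    // a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2), with wrapping adds.
    a ^ b
        .wrapping_add(0x9e37_79b9)
        .wrapping_add(a << 6)
        .wrapping_add(a >> 2)
}

fn shard_for(spc: u32, db: u32, rel: u32, blkno: u32, shard_count: u8, stripe_size: u32) -> u8 {
    let mut hash = murmurhash32_sketch(spc);
    hash = hash_combine_sketch(hash, murmurhash32_sketch(db));
    hash = hash_combine_sketch(hash, murmurhash32_sketch(rel));
    // Only the stripe index of the block number contributes, so contiguous
    // blocks within one stripe stay on the same shard.
    hash = hash_combine_sketch(hash, murmurhash32_sketch(blkno / stripe_size));
    (hash % shard_count as u32) as u8
}

fn main() {
    // Blocks 0 and 100 of the same (hypothetical) relation share a stripe of
    // 32768 pages (256 MiB of 8 KiB pages), so they map to the same shard.
    let a = shard_for(1663, 16384, 16385, 0, 4, 32768);
    let b = shard_for(1663, 16384, 16385, 100, 4, 32768);
    assert_eq!(a, b);
}
```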
@@ -3,6 +3,7 @@ use anyhow::Context;
|
||||
use chrono::{DateTime, Utc};
|
||||
use consumption_metrics::EventType;
|
||||
use futures::stream::StreamExt;
|
||||
use pageserver_api::shard::ShardNumber;
|
||||
use std::{sync::Arc, time::SystemTime};
|
||||
use utils::{
|
||||
id::{TenantId, TimelineId},
|
||||
@@ -229,6 +230,11 @@ where
|
||||
while let Some((tenant_id, tenant)) = tenants.next().await {
|
||||
let mut tenant_resident_size = 0;
|
||||
|
||||
// Sharded tenants report all consumption metrics from shard zero
|
||||
if tenant.get_shard().number == ShardNumber(0) {
|
||||
continue;
|
||||
}
|
||||
|
||||
for timeline in tenant.list_timelines() {
|
||||
let timeline_id = timeline.timeline_id;
|
||||
|
||||
|
||||
@@ -15,6 +15,7 @@ use crate::virtual_file::VirtualFile;
|
||||
use anyhow::Context;
|
||||
use camino::Utf8PathBuf;
|
||||
use hex::FromHex;
|
||||
use pageserver_api::shard::ShardIdentity;
|
||||
use remote_storage::{GenericRemoteStorage, RemotePath};
|
||||
use serde::Deserialize;
|
||||
use serde::Serialize;
|
||||
@@ -300,6 +301,7 @@ impl DeletionList {
|
||||
fn push(
|
||||
&mut self,
|
||||
tenant: &TenantId,
|
||||
shard: &ShardIdentity,
|
||||
timeline: &TimelineId,
|
||||
generation: Generation,
|
||||
objects: &mut Vec<RemotePath>,
|
||||
@@ -326,7 +328,7 @@ impl DeletionList {
|
||||
|
||||
let timeline_entry = tenant_entry.timelines.entry(*timeline).or_default();
|
||||
|
||||
let timeline_remote_path = remote_timeline_path(tenant, timeline);
|
||||
let timeline_remote_path = remote_timeline_path(tenant, shard, timeline);
|
||||
|
||||
self.size += objects.len();
|
||||
timeline_entry.extend(objects.drain(..).map(|p| {
|
||||
@@ -341,7 +343,9 @@ impl DeletionList {
|
||||
let mut result = Vec::new();
|
||||
for (tenant, tenant_deletions) in self.tenants.into_iter() {
|
||||
for (timeline, timeline_layers) in tenant_deletions.timelines.into_iter() {
|
||||
let timeline_remote_path = remote_timeline_path(&tenant, &timeline);
|
||||
// FIXME: need to update DeletionList definition to store the ShardIdentity for each Tenant
|
||||
let timeline_remote_path =
|
||||
remote_timeline_path(&tenant, &ShardIdentity::none(), &timeline);
|
||||
result.extend(
|
||||
timeline_layers
|
||||
.into_iter()
|
||||
@@ -507,6 +511,7 @@ impl DeletionQueueClient {
|
||||
pub(crate) async fn push_layers(
|
||||
&self,
|
||||
tenant_id: TenantId,
|
||||
shard: &ShardIdentity,
|
||||
timeline_id: TimelineId,
|
||||
current_generation: Generation,
|
||||
layers: Vec<(LayerFileName, Generation)>,
|
||||
@@ -517,6 +522,7 @@ impl DeletionQueueClient {
|
||||
for (layer, generation) in layers {
|
||||
layer_paths.push(remote_layer_path(
|
||||
&tenant_id,
|
||||
shard,
|
||||
&timeline_id,
|
||||
&layer,
|
||||
generation,
|
||||
@@ -829,7 +835,8 @@ mod test {
|
||||
gen: Generation,
|
||||
) -> anyhow::Result<String> {
|
||||
let tenant_id = self.harness.tenant_id;
|
||||
let relative_remote_path = remote_timeline_path(&tenant_id, &TIMELINE_ID);
|
||||
let relative_remote_path =
|
||||
remote_timeline_path(&tenant_id, &ShardIdentity::none(), &TIMELINE_ID);
|
||||
let remote_timeline_path = self.remote_fs_dir.join(relative_remote_path.get_path());
|
||||
std::fs::create_dir_all(&remote_timeline_path)?;
|
||||
let remote_layer_file_name = format!("{}{}", file_name, gen.get_suffix());
|
||||
@@ -981,7 +988,8 @@ mod test {
|
||||
let tenant_id = ctx.harness.tenant_id;
|
||||
|
||||
let content: Vec<u8> = "victim1 contents".into();
|
||||
let relative_remote_path = remote_timeline_path(&tenant_id, &TIMELINE_ID);
|
||||
let relative_remote_path =
|
||||
remote_timeline_path(&tenant_id, &ShardIdentity::none(), &TIMELINE_ID);
|
||||
let remote_timeline_path = ctx.remote_fs_dir.join(relative_remote_path.get_path());
|
||||
let deletion_prefix = ctx.harness.conf.deletion_prefix();
|
||||
|
||||
@@ -1010,6 +1018,7 @@ mod test {
|
||||
client
|
||||
.push_layers(
|
||||
tenant_id,
|
||||
&ShardIdentity::none(),
|
||||
TIMELINE_ID,
|
||||
now_generation,
|
||||
[(layer_file_name_1.clone(), layer_generation)].to_vec(),
|
||||
@@ -1055,7 +1064,8 @@ mod test {
|
||||
ctx.set_latest_generation(latest_generation);
|
||||
|
||||
let tenant_id = ctx.harness.tenant_id;
|
||||
let relative_remote_path = remote_timeline_path(&tenant_id, &TIMELINE_ID);
|
||||
let relative_remote_path =
|
||||
remote_timeline_path(&tenant_id, &ShardIdentity::none(), &TIMELINE_ID);
|
||||
let remote_timeline_path = ctx.remote_fs_dir.join(relative_remote_path.get_path());
|
||||
|
||||
// Initial state: a remote layer exists
|
||||
@@ -1066,6 +1076,7 @@ mod test {
|
||||
client
|
||||
.push_layers(
|
||||
tenant_id,
|
||||
&ShardIdentity::none(),
|
||||
TIMELINE_ID,
|
||||
stale_generation,
|
||||
[(EXAMPLE_LAYER_NAME.clone(), layer_generation)].to_vec(),
|
||||
@@ -1081,6 +1092,7 @@ mod test {
|
||||
client
|
||||
.push_layers(
|
||||
tenant_id,
|
||||
&ShardIdentity::none(),
|
||||
TIMELINE_ID,
|
||||
latest_generation,
|
||||
[(EXAMPLE_LAYER_NAME.clone(), layer_generation)].to_vec(),
|
||||
@@ -1104,7 +1116,8 @@ mod test {
|
||||
|
||||
let tenant_id = ctx.harness.tenant_id;
|
||||
|
||||
let relative_remote_path = remote_timeline_path(&tenant_id, &TIMELINE_ID);
|
||||
let relative_remote_path =
|
||||
remote_timeline_path(&tenant_id, &ShardIdentity::none(), &TIMELINE_ID);
|
||||
let remote_timeline_path = ctx.remote_fs_dir.join(relative_remote_path.get_path());
|
||||
let deletion_prefix = ctx.harness.conf.deletion_prefix();
|
||||
|
||||
@@ -1119,6 +1132,7 @@ mod test {
|
||||
client
|
||||
.push_layers(
|
||||
tenant_id,
|
||||
&ShardIdentity::none(),
|
||||
TIMELINE_ID,
|
||||
now_generation.previous(),
|
||||
[(EXAMPLE_LAYER_NAME.clone(), layer_generation)].to_vec(),
|
||||
@@ -1133,6 +1147,7 @@ mod test {
|
||||
client
|
||||
.push_layers(
|
||||
tenant_id,
|
||||
&ShardIdentity::none(),
|
||||
TIMELINE_ID,
|
||||
now_generation,
|
||||
[(EXAMPLE_LAYER_NAME_ALT.clone(), layer_generation)].to_vec(),
|
||||
@@ -1228,6 +1243,7 @@ pub(crate) mod mock {
|
||||
for (layer, generation) in op.layers {
|
||||
objects.push(remote_layer_path(
|
||||
&op.tenant_id,
|
||||
&ShardIdentity::none(),
|
||||
&op.timeline_id,
|
||||
&layer,
|
||||
generation,
|
||||
|
||||
@@ -19,6 +19,7 @@ use std::collections::HashMap;
use std::fs::create_dir_all;
use std::time::Duration;

use pageserver_api::shard::ShardIdentity;
use regex::Regex;
use remote_storage::RemotePath;
use tokio_util::sync::CancellationToken;
@@ -390,6 +391,8 @@ impl ListWriter {
for (layer, generation) in op.layers {
layer_paths.push(remote_layer_path(
&op.tenant_id,
// TODO: store shard in deletion list
&ShardIdentity::none(),
&op.timeline_id,
&layer,
generation,
@@ -399,6 +402,8 @@ impl ListWriter {

if !self.pending.push(
&op.tenant_id,
// TODO: store shard in deletion list
&ShardIdentity::none(),
&op.timeline_id,
op.generation,
&mut layer_paths,
@@ -406,6 +411,8 @@ impl ListWriter {
self.flush().await;
let retry_succeeded = self.pending.push(
&op.tenant_id,
// TODO: store shard in deletion list
&ShardIdentity::none(),
&op.timeline_id,
op.generation,
&mut layer_paths,

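The repeated "TODO: store shard in deletion list" comments note that the persisted deletion lists currently substitute ShardIdentity::none(). A sketch of what a shard-carrying entry might look like, with hypothetical field names and the same shard-suffix assumption as the sketch above:

struct DeletionListEntrySketch {
    tenant_id: String,
    shard_number: u8,
    shard_count: u8, // 0 for unsharded lists written before sharding existed
    timeline_id: String,
    layer_names: Vec<String>,
}

impl DeletionListEntrySketch {
    // Resolve each layer name to the remote object path it should be deleted at,
    // reusing the shard-qualified prefix idea from the earlier sketch.
    fn layer_paths(&self) -> Vec<String> {
        let tenant_dir = if self.shard_count == 0 {
            self.tenant_id.clone()
        } else {
            format!("{}-{:02x}{:02x}", self.tenant_id, self.shard_number, self.shard_count)
        };
        self.layer_names
            .iter()
            .map(|l| format!("tenants/{tenant_dir}/timelines/{}/{l}", self.timeline_id))
            .collect()
    }
}

fn main() {
    let entry = DeletionListEntrySketch {
        tenant_id: "sometenant".into(),
        shard_number: 1,
        shard_count: 4,
        timeline_id: "sometimeline".into(),
        layer_names: vec!["layer-A".into()],
    };
    println!("{:?}", entry.layer_paths());
}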
@@ -16,7 +16,6 @@ use pageserver_api::models::{
|
||||
DownloadRemoteLayersTaskSpawnRequest, LocationConfigMode, TenantAttachRequest,
|
||||
TenantLoadRequest, TenantLocationConfigRequest,
|
||||
};
|
||||
use pageserver_api::shard::TenantShardId;
|
||||
use remote_storage::GenericRemoteStorage;
|
||||
use tenant_size_model::{SizeResult, StorageModel};
|
||||
use tokio_util::sync::CancellationToken;
|
||||
@@ -420,9 +419,9 @@ async fn timeline_create_handler(
|
||||
mut request: Request<Body>,
|
||||
_cancel: CancellationToken,
|
||||
) -> Result<Response<Body>, ApiError> {
|
||||
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
|
||||
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
|
||||
let request_data: TimelineCreateRequest = json_request(&mut request).await?;
|
||||
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
|
||||
check_permission(&request, Some(tenant_id))?;
|
||||
|
||||
let new_timeline_id = request_data.new_timeline_id;
|
||||
|
||||
@@ -431,7 +430,7 @@ async fn timeline_create_handler(
|
||||
let state = get_state(&request);
|
||||
|
||||
async {
|
||||
let tenant = state.tenant_manager.get_attached_tenant_shard(tenant_shard_id, true)?;
|
||||
let tenant = mgr::get_tenant(tenant_id, true)?;
|
||||
match tenant.create_timeline(
|
||||
new_timeline_id,
|
||||
request_data.ancestor_timeline_id.map(TimelineId::from),
|
||||
@@ -465,10 +464,7 @@ async fn timeline_create_handler(
|
||||
Err(tenant::CreateTimelineError::Other(err)) => Err(ApiError::InternalServerError(err)),
|
||||
}
|
||||
}
|
||||
.instrument(info_span!("timeline_create",
|
||||
tenant_id = %tenant_shard_id.tenant_id,
|
||||
shard = %tenant_shard_id.shard_slug(),
|
||||
timeline_id = %new_timeline_id, lsn=?request_data.ancestor_start_lsn, pg_version=?request_data.pg_version))
|
||||
.instrument(info_span!("timeline_create", %tenant_id, timeline_id = %new_timeline_id, lsn=?request_data.ancestor_start_lsn, pg_version=?request_data.pg_version))
|
||||
.await
|
||||
}
|
||||
|
||||
@@ -664,15 +660,14 @@ async fn timeline_delete_handler(
|
||||
request: Request<Body>,
|
||||
_cancel: CancellationToken,
|
||||
) -> Result<Response<Body>, ApiError> {
|
||||
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
|
||||
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
|
||||
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
|
||||
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
|
||||
check_permission(&request, Some(tenant_id))?;
|
||||
|
||||
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Warn);
|
||||
let state = get_state(&request);
|
||||
|
||||
state.tenant_manager.delete_timeline(tenant_shard_id, timeline_id, &ctx)
|
||||
.instrument(info_span!("timeline_delete", tenant_id=%tenant_shard_id.tenant_id, shard=%tenant_shard_id.shard_slug(), %timeline_id))
|
||||
mgr::delete_timeline(tenant_id, timeline_id, &ctx)
|
||||
.instrument(info_span!("timeline_delete", %tenant_id, %timeline_id))
|
||||
.await?;
|
||||
|
||||
json_response(StatusCode::ACCEPTED, ())
|
||||
@@ -686,14 +681,11 @@ async fn tenant_detach_handler(
|
||||
check_permission(&request, Some(tenant_id))?;
|
||||
let detach_ignored: Option<bool> = parse_query_param(&request, "detach_ignored")?;
|
||||
|
||||
// This is a legacy API (`/location_conf` is the replacement). It only supports unsharded tenants
|
||||
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
|
||||
|
||||
let state = get_state(&request);
|
||||
let conf = state.conf;
|
||||
mgr::detach_tenant(
|
||||
conf,
|
||||
tenant_shard_id,
|
||||
tenant_id,
|
||||
detach_ignored.unwrap_or(false),
|
||||
&state.deletion_queue_client,
|
||||
)
|
||||
@@ -810,16 +802,13 @@ async fn tenant_delete_handler(
|
||||
_cancel: CancellationToken,
|
||||
) -> Result<Response<Body>, ApiError> {
|
||||
// TODO openapi spec
|
||||
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
|
||||
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
|
||||
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
|
||||
check_permission(&request, Some(tenant_id))?;
|
||||
|
||||
let state = get_state(&request);
|
||||
|
||||
mgr::delete_tenant(state.conf, state.remote_storage.clone(), tenant_shard_id)
|
||||
.instrument(info_span!("tenant_delete_handler",
|
||||
tenant_id = %tenant_shard_id.tenant_id,
|
||||
shard = tenant_shard_id.shard_slug()
|
||||
))
|
||||
mgr::delete_tenant(state.conf, state.remote_storage.clone(), tenant_id)
|
||||
.instrument(info_span!("tenant_delete_handler", %tenant_id))
|
||||
.await?;
|
||||
|
||||
json_response(StatusCode::ACCEPTED, ())
|
||||
@@ -1149,10 +1138,9 @@ async fn put_tenant_location_config_handler(
|
||||
mut request: Request<Body>,
|
||||
_cancel: CancellationToken,
|
||||
) -> Result<Response<Body>, ApiError> {
|
||||
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
|
||||
|
||||
let request_data: TenantLocationConfigRequest = json_request(&mut request).await?;
|
||||
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
|
||||
let tenant_id = request_data.tenant_id;
|
||||
check_permission(&request, Some(tenant_id))?;
|
||||
|
||||
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Warn);
|
||||
let state = get_state(&request);
|
||||
@@ -1161,13 +1149,9 @@ async fn put_tenant_location_config_handler(
|
||||
// The `Detached` state is special, it doesn't upsert a tenant, it removes
|
||||
// its local disk content and drops it from memory.
|
||||
if let LocationConfigMode::Detached = request_data.config.mode {
|
||||
if let Err(e) =
|
||||
mgr::detach_tenant(conf, tenant_shard_id, true, &state.deletion_queue_client)
|
||||
.instrument(info_span!("tenant_detach",
|
||||
tenant_id = %tenant_shard_id.tenant_id,
|
||||
shard = tenant_shard_id.shard_slug()
|
||||
))
|
||||
.await
|
||||
if let Err(e) = mgr::detach_tenant(conf, tenant_id, true, &state.deletion_queue_client)
|
||||
.instrument(info_span!("tenant_detach", %tenant_id))
|
||||
.await
|
||||
{
|
||||
match e {
|
||||
TenantStateError::SlotError(TenantSlotError::NotFound(_)) => {
|
||||
@@ -1184,7 +1168,7 @@ async fn put_tenant_location_config_handler(
|
||||
|
||||
state
|
||||
.tenant_manager
|
||||
.upsert_location(tenant_shard_id, location_conf, &ctx)
|
||||
.upsert_location(tenant_id, location_conf, &ctx)
|
||||
.await
|
||||
// TODO: badrequest assumes the caller was asking for something unreasonable, but in
|
||||
// principle we might have hit something like concurrent API calls to the same tenant,
|
||||
@@ -1494,7 +1478,7 @@ async fn timeline_collect_keyspace(
|
||||
let keys = timeline
|
||||
.collect_keyspace(at_lsn, &ctx)
|
||||
.await
|
||||
.map_err(|e| ApiError::InternalServerError(e.into()))?;
|
||||
.map_err(ApiError::InternalServerError)?;
|
||||
|
||||
json_response(StatusCode::OK, Partitioning { keys, at_lsn })
|
||||
}
|
||||
@@ -1768,7 +1752,7 @@ pub fn make_router(
|
||||
.get("/v1/tenant", |r| api_handler(r, tenant_list_handler))
|
||||
.post("/v1/tenant", |r| api_handler(r, tenant_create_handler))
|
||||
.get("/v1/tenant/:tenant_id", |r| api_handler(r, tenant_status))
|
||||
.delete("/v1/tenant/:tenant_shard_id", |r| {
|
||||
.delete("/v1/tenant/:tenant_id", |r| {
|
||||
api_handler(r, tenant_delete_handler)
|
||||
})
|
||||
.get("/v1/tenant/:tenant_id/synthetic_size", |r| {
|
||||
@@ -1780,13 +1764,13 @@ pub fn make_router(
|
||||
.get("/v1/tenant/:tenant_id/config", |r| {
|
||||
api_handler(r, get_tenant_config_handler)
|
||||
})
|
||||
.put("/v1/tenant/:tenant_shard_id/location_config", |r| {
|
||||
.put("/v1/tenant/:tenant_id/location_config", |r| {
|
||||
api_handler(r, put_tenant_location_config_handler)
|
||||
})
|
||||
.get("/v1/tenant/:tenant_id/timeline", |r| {
|
||||
api_handler(r, timeline_list_handler)
|
||||
})
|
||||
.post("/v1/tenant/:tenant_shard_id/timeline", |r| {
|
||||
.post("/v1/tenant/:tenant_id/timeline", |r| {
|
||||
api_handler(r, timeline_create_handler)
|
||||
})
|
||||
.post("/v1/tenant/:tenant_id/attach", |r| {
|
||||
@@ -1830,7 +1814,7 @@ pub fn make_router(
|
||||
"/v1/tenant/:tenant_id/timeline/:timeline_id/download_remote_layers",
|
||||
|r| api_handler(r, timeline_download_remote_layers_handler_get),
|
||||
)
|
||||
.delete("/v1/tenant/:tenant_shard_id/timeline/:timeline_id", |r| {
|
||||
.delete("/v1/tenant/:tenant_id/timeline/:timeline_id", |r| {
|
||||
api_handler(r, timeline_delete_handler)
|
||||
})
|
||||
.get("/v1/tenant/:tenant_id/timeline/:timeline_id/layer", |r| {
|
||||
|
||||
@@ -1,7 +1,4 @@
use crate::{
pgdatadir_mapping::{BASEBACKUP_CUT, METADATA_CUT},
repository::{key_range_size, singleton_range, Key},
};
use crate::repository::{key_range_size, singleton_range, Key};
use postgres_ffi::BLCKSZ;
use std::ops::Range;

@@ -25,22 +22,13 @@ impl KeySpace {
let target_nblocks = (target_size / BLCKSZ as u64) as usize;

let mut parts = Vec::new();
let mut current_part: Vec<Range<Key>> = Vec::new();
let mut current_part = Vec::new();
let mut current_part_size: usize = 0;
for range in &self.ranges {
let last = current_part
.last()
.map(|r| r.end)
.unwrap_or(Key::from_i128(0));
let cut_here = (range.start >= METADATA_CUT && last < METADATA_CUT)
|| (range.start >= BASEBACKUP_CUT && last < BASEBACKUP_CUT);

// If appending the next contiguous range in the keyspace to the current
// partition would cause it to be too large, start a new partition.
let this_size = key_range_size(range) as usize;
if cut_here
|| current_part_size + this_size > target_nblocks && !current_part.is_empty()
{
if current_part_size + this_size > target_nblocks && !current_part.is_empty() {
parts.push(KeySpace {
ranges: current_part,
});

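A compact, self-contained version of the greedy partitioning loop above, over plain u64 keys: ranges are packed into parts no larger than a target size, and a part is also closed whenever the next range crosses a cut boundary, which is what the BASEBACKUP_CUT / METADATA_CUT checks do for the real Key type.

use std::ops::Range;

fn partition(ranges: &[Range<u64>], target: u64, cuts: &[u64]) -> Vec<Vec<Range<u64>>> {
    let mut parts = Vec::new();
    let mut current: Vec<Range<u64>> = Vec::new();
    let mut current_size = 0u64;
    for range in ranges {
        let last_end = current.last().map(|r| r.end).unwrap_or(0);
        // Close the current part if this range starts at or above a cut boundary
        // that the previous range ended below.
        let cut_here = cuts.iter().any(|&c| range.start >= c && last_end < c);
        let this_size = range.end - range.start;
        if !current.is_empty() && (cut_here || current_size + this_size > target) {
            parts.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current.push(range.clone());
        current_size += this_size;
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}

fn main() {
    // Two small ranges, then one that crosses the cut at 100, then a large one.
    let ranges = [0..10, 10..20, 120..130, 200..400];
    let parts = partition(&ranges, 150, &[100]);
    assert_eq!(parts, vec![vec![0..10, 10..20], vec![120..130], vec![200..400]]);
    println!("{} parts", parts.len());
}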
@@ -40,6 +40,9 @@ pub enum StorageTimeOperation {
|
||||
#[strum(serialize = "logical size")]
|
||||
LogicalSize,
|
||||
|
||||
#[strum(serialize = "imitate logical size")]
|
||||
ImitateLogicalSize,
|
||||
|
||||
#[strum(serialize = "load layer map")]
|
||||
LoadLayerMap,
|
||||
|
||||
@@ -1249,46 +1252,6 @@ pub(crate) static WAL_REDO_RECORD_COUNTER: Lazy<IntCounter> = Lazy::new(|| {
|
||||
.unwrap()
|
||||
});
|
||||
|
||||
pub(crate) struct WalRedoProcessCounters {
|
||||
pub(crate) started: IntCounter,
|
||||
pub(crate) killed_by_cause: enum_map::EnumMap<WalRedoKillCause, IntCounter>,
|
||||
}
|
||||
|
||||
#[derive(Debug, enum_map::Enum, strum_macros::IntoStaticStr)]
|
||||
pub(crate) enum WalRedoKillCause {
|
||||
WalRedoProcessDrop,
|
||||
NoLeakChildDrop,
|
||||
Startup,
|
||||
}
|
||||
|
||||
impl Default for WalRedoProcessCounters {
|
||||
fn default() -> Self {
|
||||
let started = register_int_counter!(
|
||||
"pageserver_wal_redo_process_started_total",
|
||||
"Number of WAL redo processes started",
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let killed = register_int_counter_vec!(
|
||||
"pageserver_wal_redo_process_stopped_total",
|
||||
"Number of WAL redo processes stopped",
|
||||
&["cause"],
|
||||
)
|
||||
.unwrap();
|
||||
Self {
|
||||
started,
|
||||
killed_by_cause: EnumMap::from_array(std::array::from_fn(|i| {
|
||||
let cause = <WalRedoKillCause as enum_map::Enum>::from_usize(i);
|
||||
let cause_str: &'static str = cause.into();
|
||||
killed.with_label_values(&[cause_str])
|
||||
})),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
pub(crate) static WAL_REDO_PROCESS_COUNTERS: Lazy<WalRedoProcessCounters> =
|
||||
Lazy::new(WalRedoProcessCounters::default);
|
||||
|
||||
/// Similar to `prometheus::HistogramTimer` but does not record on drop.
|
||||
pub struct StorageTimeMetricsTimer {
|
||||
metrics: StorageTimeMetrics,
|
||||
@@ -1361,6 +1324,7 @@ pub struct TimelineMetrics {
|
||||
pub compact_time_histo: StorageTimeMetrics,
|
||||
pub create_images_time_histo: StorageTimeMetrics,
|
||||
pub logical_size_histo: StorageTimeMetrics,
|
||||
pub imitate_logical_size_histo: StorageTimeMetrics,
|
||||
pub load_layer_map_histo: StorageTimeMetrics,
|
||||
pub garbage_collect_histo: StorageTimeMetrics,
|
||||
pub last_record_gauge: IntGauge,
|
||||
@@ -1389,6 +1353,11 @@ impl TimelineMetrics {
|
||||
StorageTimeMetrics::new(StorageTimeOperation::CreateImages, &tenant_id, &timeline_id);
|
||||
let logical_size_histo =
|
||||
StorageTimeMetrics::new(StorageTimeOperation::LogicalSize, &tenant_id, &timeline_id);
|
||||
let imitate_logical_size_histo = StorageTimeMetrics::new(
|
||||
StorageTimeOperation::ImitateLogicalSize,
|
||||
&tenant_id,
|
||||
&timeline_id,
|
||||
);
|
||||
let load_layer_map_histo =
|
||||
StorageTimeMetrics::new(StorageTimeOperation::LoadLayerMap, &tenant_id, &timeline_id);
|
||||
let garbage_collect_histo =
|
||||
@@ -1421,6 +1390,7 @@ impl TimelineMetrics {
|
||||
compact_time_histo,
|
||||
create_images_time_histo,
|
||||
logical_size_histo,
|
||||
imitate_logical_size_histo,
|
||||
garbage_collect_histo,
|
||||
load_layer_map_histo,
|
||||
last_record_gauge,
|
||||
|
||||
@@ -22,7 +22,6 @@ use std::collections::{hash_map, HashMap, HashSet};
|
||||
use std::ops::ControlFlow;
|
||||
use std::ops::Range;
|
||||
use tracing::{debug, trace, warn};
|
||||
use utils::bin_ser::DeserializeError;
|
||||
use utils::{bin_ser::BeSer, lsn::Lsn};
|
||||
|
||||
/// Block number within a relation or SLRU. This matches PostgreSQL's BlockNumber type.
|
||||
@@ -30,33 +29,9 @@ pub type BlockNumber = u32;
|
||||
|
||||
#[derive(Debug)]
|
||||
pub enum LsnForTimestamp {
|
||||
/// Found commits both before and after the given timestamp
|
||||
Present(Lsn),
|
||||
|
||||
/// Found no commits after the given timestamp, this means
|
||||
/// that the newest data in the branch is older than the given
|
||||
/// timestamp.
|
||||
///
|
||||
/// All commits <= LSN happened before the given timestamp
|
||||
Future(Lsn),
|
||||
|
||||
/// The queried timestamp is past the horizon that we look back at (PITR)
|
||||
///
|
||||
/// All commits > LSN happened after the given timestamp,
|
||||
/// but any commits < LSN might have happened before or after
|
||||
/// the given timestamp. We don't know because no data before
|
||||
/// the given lsn is available.
|
||||
Past(Lsn),
|
||||
|
||||
/// We have found no commit with a timestamp,
|
||||
/// so we can't return anything meaningful.
|
||||
///
|
||||
/// The associated LSN is the lower bound value we can safely
|
||||
/// create branches on, but no statement is made if it is
|
||||
/// older or newer than the timestamp.
|
||||
///
|
||||
/// This variant can e.g. be returned right after a
|
||||
/// cluster import.
|
||||
NoData(Lsn),
|
||||
}
|
||||
|
||||
@@ -68,25 +43,6 @@ pub enum CalculateLogicalSizeError {
|
||||
Other(#[from] anyhow::Error),
|
||||
}
|
||||
|
||||
#[derive(Debug, thiserror::Error)]
|
||||
pub(crate) enum CollectKeySpaceError {
|
||||
#[error(transparent)]
|
||||
Decode(#[from] DeserializeError),
|
||||
#[error(transparent)]
|
||||
PageRead(PageReconstructError),
|
||||
#[error("cancelled")]
|
||||
Cancelled,
|
||||
}
|
||||
|
||||
impl From<PageReconstructError> for CollectKeySpaceError {
|
||||
fn from(err: PageReconstructError) -> Self {
|
||||
match err {
|
||||
PageReconstructError::Cancelled => Self::Cancelled,
|
||||
err => Self::PageRead(err),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl From<PageReconstructError> for CalculateLogicalSizeError {
|
||||
fn from(pre: PageReconstructError) -> Self {
|
||||
match pre {
|
||||
@@ -368,11 +324,7 @@ impl Timeline {
|
||||
ctx: &RequestContext,
|
||||
) -> Result<LsnForTimestamp, PageReconstructError> {
|
||||
let gc_cutoff_lsn_guard = self.get_latest_gc_cutoff_lsn();
|
||||
// We use this method to figure out the branching LSN for the new branch, but the
|
||||
// GC cutoff could be before the branching point and we cannot create a new branch
|
||||
// with LSN < `ancestor_lsn`. Thus, pick the maximum of these two to be
|
||||
// on the safe side.
|
||||
let min_lsn = std::cmp::max(*gc_cutoff_lsn_guard, self.get_ancestor_lsn());
|
||||
let min_lsn = *gc_cutoff_lsn_guard;
|
||||
let max_lsn = self.get_last_record_lsn();
|
||||
|
||||
// LSNs are always 8-byte aligned. low/mid/high represent the
|
||||
@@ -402,33 +354,30 @@ impl Timeline {
|
||||
low = mid + 1;
|
||||
}
|
||||
}
|
||||
// If `found_smaller == true`, `low = t + 1` where `t` is the target LSN,
|
||||
// so the LSN of the last commit record before or at `search_timestamp`.
|
||||
// Remove one from `low` to get `t`.
|
||||
//
|
||||
// FIXME: it would be better to get the LSN of the previous commit.
|
||||
// Otherwise, if you restore to the returned LSN, the database will
|
||||
// include physical changes from later commits that will be marked
|
||||
// as aborted, and will need to be vacuumed away.
|
||||
let commit_lsn = Lsn((low - 1) * 8);
|
||||
match (found_smaller, found_larger) {
|
||||
(false, false) => {
|
||||
// This can happen if no commit records have been processed yet, e.g.
|
||||
// just after importing a cluster.
|
||||
Ok(LsnForTimestamp::NoData(min_lsn))
|
||||
Ok(LsnForTimestamp::NoData(max_lsn))
|
||||
}
|
||||
(true, false) => {
|
||||
// Didn't find any commit timestamps larger than the request
|
||||
Ok(LsnForTimestamp::Future(max_lsn))
|
||||
}
|
||||
(false, true) => {
|
||||
// Didn't find any commit timestamps smaller than the request
|
||||
Ok(LsnForTimestamp::Past(min_lsn))
|
||||
Ok(LsnForTimestamp::Past(max_lsn))
|
||||
}
|
||||
(true, false) => {
|
||||
// Only found commits with timestamps smaller than the request.
|
||||
// It's still a valid case for branch creation, return it.
|
||||
// And `update_gc_info()` ignores LSN for a `LsnForTimestamp::Future`
|
||||
// case, anyway.
|
||||
Ok(LsnForTimestamp::Future(commit_lsn))
|
||||
(true, true) => {
|
||||
// low is the LSN of the first commit record *after* the search_timestamp,
|
||||
// Back off by one to get to the point just before the commit.
|
||||
//
|
||||
// FIXME: it would be better to get the LSN of the previous commit.
|
||||
// Otherwise, if you restore to the returned LSN, the database will
|
||||
// include physical changes from later commits that will be marked
|
||||
// as aborted, and will need to be vacuumed away.
|
||||
Ok(LsnForTimestamp::Present(Lsn((low - 1) * 8)))
|
||||
}
|
||||
(true, true) => Ok(LsnForTimestamp::Present(commit_lsn)),
|
||||
}
|
||||
}
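A sketch of how the (found_smaller, found_larger) classification above maps onto the LsnForTimestamp variants. The probe over commit records is a stand-in for the timeline's actual binary search, and the two sides of the hunk disagree on whether min_lsn or max_lsn is reported in the Past/Future/NoData cases, so the sketch just picks one.

#[derive(Debug, PartialEq)]
enum LsnForTimestampSketch {
    Present(u64),
    Future(u64),
    Past(u64),
    NoData(u64),
}

// `commits` holds (commit_lsn, commit_timestamp) pairs, sorted by LSN; LSNs are
// 8-byte aligned as in the hunk above.
fn lsn_for_timestamp(commits: &[(u64, i64)], min_lsn: u64, max_lsn: u64, ts: i64) -> LsnForTimestampSketch {
    let found_smaller = commits.iter().any(|&(_, t)| t <= ts);
    let found_larger = commits.iter().any(|&(_, t)| t > ts);
    // LSN of the last commit at or before the requested timestamp, mirroring the
    // `Lsn((low - 1) * 8)` computation above.
    let commit_lsn = commits
        .iter()
        .filter(|&&(_, t)| t <= ts)
        .map(|&(lsn, _)| lsn)
        .max()
        .unwrap_or(min_lsn);
    match (found_smaller, found_larger) {
        (false, false) => LsnForTimestampSketch::NoData(max_lsn), // e.g. right after an import
        (true, false) => LsnForTimestampSketch::Future(max_lsn),  // everything is older than ts
        (false, true) => LsnForTimestampSketch::Past(max_lsn),    // ts predates the PITR horizon
        (true, true) => LsnForTimestampSketch::Present(commit_lsn),
    }
}

fn main() {
    let commits = [(16u64, 100i64), (24, 200), (40, 300)];
    assert_eq!(lsn_for_timestamp(&commits, 8, 48, 250), LsnForTimestampSketch::Present(24));
    assert_eq!(lsn_for_timestamp(&commits, 8, 48, 400), LsnForTimestampSketch::Future(48));
    println!("ok");
}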
|
||||
|
||||
@@ -656,27 +605,26 @@ impl Timeline {
|
||||
/// Get a KeySpace that covers all the Keys that are in use at the given LSN.
|
||||
/// Anything that's not listed may be removed from the underlying storage (from
|
||||
/// that LSN forwards).
|
||||
pub(crate) async fn collect_keyspace(
|
||||
pub async fn collect_keyspace(
|
||||
&self,
|
||||
lsn: Lsn,
|
||||
ctx: &RequestContext,
|
||||
) -> Result<KeySpace, CollectKeySpaceError> {
|
||||
) -> anyhow::Result<KeySpace> {
|
||||
// Iterate through key ranges, greedily packing them into partitions
|
||||
// This function is responsible for appending keys in order, using implicit
|
||||
// knowledge of how keys are defined.
|
||||
let mut result = KeySpaceAccum::new();
|
||||
|
||||
// The dbdir metadata always exists
|
||||
result.add_key(DBDIR_KEY);
|
||||
|
||||
// Fetch list of database dirs and iterate them
|
||||
let buf = self.get(DBDIR_KEY, lsn, ctx).await?;
|
||||
let dbdir = DbDirectory::des(&buf)?;
|
||||
|
||||
let mut metadata_keys = Vec::new();
|
||||
let dbdir = DbDirectory::des(&buf).context("deserialization failure")?;
|
||||
|
||||
let mut dbs: Vec<(Oid, Oid)> = dbdir.dbdirs.keys().cloned().collect();
|
||||
dbs.sort_unstable();
|
||||
for (spcnode, dbnode) in dbs {
|
||||
metadata_keys.push(relmap_file_key(spcnode, dbnode));
|
||||
metadata_keys.push(rel_dir_to_key(spcnode, dbnode));
|
||||
result.add_key(relmap_file_key(spcnode, dbnode));
|
||||
result.add_key(rel_dir_to_key(spcnode, dbnode));
|
||||
|
||||
let mut rels: Vec<RelTag> = self
|
||||
.list_rels(spcnode, dbnode, lsn, ctx)
|
||||
@@ -690,7 +638,7 @@ impl Timeline {
|
||||
let relsize = buf.get_u32_le();
|
||||
|
||||
result.add_range(rel_block_to_key(rel, 0)..rel_block_to_key(rel, relsize));
|
||||
metadata_keys.push(relsize_key);
|
||||
result.add_key(relsize_key);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -703,7 +651,7 @@ impl Timeline {
|
||||
let slrudir_key = slru_dir_to_key(kind);
|
||||
result.add_key(slrudir_key);
|
||||
let buf = self.get(slrudir_key, lsn, ctx).await?;
|
||||
let dir = SlruSegmentDirectory::des(&buf)?;
|
||||
let dir = SlruSegmentDirectory::des(&buf).context("deserialization failure")?;
|
||||
let mut segments: Vec<u32> = dir.segments.iter().cloned().collect();
|
||||
segments.sort_unstable();
|
||||
for segno in segments {
|
||||
@@ -721,7 +669,7 @@ impl Timeline {
|
||||
// Then pg_twophase
|
||||
result.add_key(TWOPHASEDIR_KEY);
|
||||
let buf = self.get(TWOPHASEDIR_KEY, lsn, ctx).await?;
|
||||
let twophase_dir = TwoPhaseDirectory::des(&buf)?;
|
||||
let twophase_dir = TwoPhaseDirectory::des(&buf).context("deserialization failure")?;
|
||||
let mut xids: Vec<TransactionId> = twophase_dir.xids.iter().cloned().collect();
|
||||
xids.sort_unstable();
|
||||
for xid in xids {
|
||||
@@ -733,13 +681,6 @@ impl Timeline {
|
||||
if self.get(AUX_FILES_KEY, lsn, ctx).await.is_ok() {
|
||||
result.add_key(AUX_FILES_KEY);
|
||||
}
|
||||
|
||||
// The dbdir metadata always exists
|
||||
result.add_key(DBDIR_KEY);
|
||||
for key in metadata_keys {
|
||||
result.add_key(key);
|
||||
}
|
||||
|
||||
Ok(result.to_keyspace())
|
||||
}
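collect_keyspace relies on an accumulator that is fed keys in ascending order and merges adjacent ranges. A sketch of that pattern over u64 keys (KeySpaceAccum in the pageserver does the same over its own Key type):

use std::ops::Range;

#[derive(Default)]
struct KeySpaceAccumSketch {
    ranges: Vec<Range<u64>>,
}

impl KeySpaceAccumSketch {
    fn add_key(&mut self, key: u64) {
        self.add_range(key..key + 1);
    }

    fn add_range(&mut self, range: Range<u64>) {
        match self.ranges.last_mut() {
            // Contiguous with the previous range: extend it instead of pushing.
            Some(last) if last.end == range.start => last.end = range.end,
            Some(last) => {
                assert!(last.end <= range.start, "keys must be added in order");
                self.ranges.push(range);
            }
            None => self.ranges.push(range),
        }
    }
}

fn main() {
    let mut accum = KeySpaceAccumSketch::default();
    accum.add_key(0);            // e.g. the dbdir key
    accum.add_range(1..5);       // a relation's blocks
    accum.add_key(5);            // its relsize key, contiguous with the blocks
    accum.add_range(100..108);   // an SLRU segment elsewhere in the key space
    assert_eq!(accum.ranges, vec![0..6, 100..108]);
    println!("{:?}", accum.ranges);
}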
|
||||
|
||||
@@ -1347,9 +1288,11 @@ impl<'a> DatadirModification<'a> {
|
||||
self.pending_nblocks = 0;
|
||||
|
||||
for (key, value) in self.pending_updates.drain() {
|
||||
tracing::debug!("commit: put {} @ {}", key, lsn);
|
||||
writer.put(key, lsn, &value, ctx).await?;
|
||||
}
|
||||
for key_range in self.pending_deletions.drain(..) {
|
||||
tracing::debug!("commit: delete {:?} @ {}", key_range, lsn);
|
||||
writer.delete(key_range, lsn).await?;
|
||||
}
|
||||
|
||||
@@ -1362,6 +1305,10 @@ impl<'a> DatadirModification<'a> {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
pub fn is_no_op(&self) -> bool {
|
||||
self.pending_updates.is_empty() && self.pending_deletions.is_empty()
|
||||
}
|
||||
|
||||
// Internal helper functions to batch the modifications
|
||||
|
||||
async fn get(&self, key: Key, ctx: &RequestContext) -> Result<Bytes, PageReconstructError> {
|
||||
@@ -1482,11 +1429,21 @@ static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; BLCKSZ as usize]);
|
||||
//
|
||||
// Below is a full list of the keyspace allocation:
|
||||
//
|
||||
|
||||
// DbDir:
|
||||
// 00 00000000 00000000 00000000 00 00000000
|
||||
//
|
||||
// Filenodemap:
|
||||
// 00 SPCNODE DBNODE 00000000 00 00000000
|
||||
//
|
||||
// RelDir:
|
||||
// 00 SPCNODE DBNODE 00000000 00 00000001 (Postgres never uses relfilenode 0)
|
||||
//
|
||||
// RelBlock:
|
||||
// 00 SPCNODE DBNODE RELNODE FORK BLKNUM
|
||||
//
|
||||
// RelSize:
|
||||
// 00 SPCNODE DBNODE RELNODE FORK FFFFFFFF
|
||||
//
|
||||
// SlruDir:
|
||||
// 01 kind 00000000 00000000 00 00000000
|
||||
//
|
||||
@@ -1511,31 +1468,11 @@ static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; BLCKSZ as usize]);
|
||||
// AuxFiles:
|
||||
// 03 00000000 00000000 00000000 00 00000002
|
||||
//
|
||||
// DbDir:
|
||||
// 04 00000000 00000000 00000000 00 00000000
|
||||
//
|
||||
// Filenodemap:
|
||||
// 04 SPCNODE DBNODE 00000000 00 00000000
|
||||
//
|
||||
// RelDir:
|
||||
// 04 SPCNODE DBNODE 00000000 00 00000001 (Postgres never uses relfilenode 0)
|
||||
//
|
||||
// RelSize:
|
||||
// 04 SPCNODE DBNODE RELNODE FORK FFFFFFFF
|
||||
|
||||
//-- Section 01: relation data and metadata
|
||||
|
||||
/// Keys above this Key are required to serve a basebackup request
|
||||
pub(crate) const BASEBACKUP_CUT: Key = slru_dir_to_key(SlruKind::Clog);
|
||||
|
||||
/// Keys above this Key are needed to make a logical size calculation
|
||||
///
|
||||
/// Ensuring that such keys are stored above the main range of user relation
|
||||
/// blocks enables much more efficient space management.
|
||||
pub(crate) const METADATA_CUT: Key = CONTROLFILE_KEY;
|
||||
|
||||
const DBDIR_KEY: Key = Key {
|
||||
field1: 0x04,
|
||||
field1: 0x00,
|
||||
field2: 0,
|
||||
field3: 0,
|
||||
field4: 0,
|
||||
@@ -1545,14 +1482,14 @@ const DBDIR_KEY: Key = Key {
|
||||
|
||||
fn dbdir_key_range(spcnode: Oid, dbnode: Oid) -> Range<Key> {
|
||||
Key {
|
||||
field1: 0x04,
|
||||
field1: 0x00,
|
||||
field2: spcnode,
|
||||
field3: dbnode,
|
||||
field4: 0,
|
||||
field5: 0,
|
||||
field6: 0,
|
||||
}..Key {
|
||||
field1: 0x04,
|
||||
field1: 0x00,
|
||||
field2: spcnode,
|
||||
field3: dbnode,
|
||||
field4: 0xffffffff,
|
||||
@@ -1563,7 +1500,7 @@ fn dbdir_key_range(spcnode: Oid, dbnode: Oid) -> Range<Key> {
|
||||
|
||||
fn relmap_file_key(spcnode: Oid, dbnode: Oid) -> Key {
|
||||
Key {
|
||||
field1: 0x04,
|
||||
field1: 0x00,
|
||||
field2: spcnode,
|
||||
field3: dbnode,
|
||||
field4: 0,
|
||||
@@ -1574,7 +1511,7 @@ fn relmap_file_key(spcnode: Oid, dbnode: Oid) -> Key {
|
||||
|
||||
fn rel_dir_to_key(spcnode: Oid, dbnode: Oid) -> Key {
|
||||
Key {
|
||||
field1: 0x04,
|
||||
field1: 0x00,
|
||||
field2: spcnode,
|
||||
field3: dbnode,
|
||||
field4: 0,
|
||||
@@ -1583,7 +1520,7 @@ fn rel_dir_to_key(spcnode: Oid, dbnode: Oid) -> Key {
|
||||
}
|
||||
}
|
||||
|
||||
fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
|
||||
pub fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
|
||||
Key {
|
||||
field1: 0x00,
|
||||
field2: rel.spcnode,
|
||||
@@ -1596,7 +1533,7 @@ fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
|
||||
|
||||
fn rel_size_to_key(rel: RelTag) -> Key {
|
||||
Key {
|
||||
field1: 0x04,
|
||||
field1: 0x00,
|
||||
field2: rel.spcnode,
|
||||
field3: rel.dbnode,
|
||||
field4: rel.relnode,
|
||||
@@ -1625,7 +1562,7 @@ fn rel_key_range(rel: RelTag) -> Range<Key> {
|
||||
|
||||
//-- Section 02: SLRUs
|
||||
|
||||
const fn slru_dir_to_key(kind: SlruKind) -> Key {
|
||||
fn slru_dir_to_key(kind: SlruKind) -> Key {
|
||||
Key {
|
||||
field1: 0x01,
|
||||
field2: match kind {
|
||||
|
||||
@@ -36,7 +36,6 @@ pub fn singleton_range(key: Key) -> Range<Key> {
|
||||
|
||||
/// A 'value' stored for a one Key.
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
#[cfg_attr(test, derive(PartialEq))]
|
||||
pub enum Value {
|
||||
/// An Image value contains a full copy of the value
|
||||
Image(Bytes),
|
||||
@@ -60,70 +59,6 @@ impl Value {
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod test {
|
||||
use super::*;
|
||||
|
||||
use bytes::Bytes;
|
||||
use utils::bin_ser::BeSer;
|
||||
|
||||
macro_rules! roundtrip {
|
||||
($orig:expr, $expected:expr) => {{
|
||||
let orig: Value = $orig;
|
||||
|
||||
let actual = Value::ser(&orig).unwrap();
|
||||
let expected: &[u8] = &$expected;
|
||||
|
||||
assert_eq!(utils::Hex(&actual), utils::Hex(expected));
|
||||
|
||||
let deser = Value::des(&actual).unwrap();
|
||||
|
||||
assert_eq!(orig, deser);
|
||||
}};
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn image_roundtrip() {
|
||||
let image = Bytes::from_static(b"foobar");
|
||||
let image = Value::Image(image);
|
||||
|
||||
#[rustfmt::skip]
|
||||
let expected = [
|
||||
// top level discriminator of 4 bytes
|
||||
0x00, 0x00, 0x00, 0x00,
|
||||
// 8 byte length
|
||||
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x06,
|
||||
// foobar
|
||||
0x66, 0x6f, 0x6f, 0x62, 0x61, 0x72
|
||||
];
|
||||
|
||||
roundtrip!(image, expected);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn walrecord_postgres_roundtrip() {
|
||||
let rec = NeonWalRecord::Postgres {
|
||||
will_init: true,
|
||||
rec: Bytes::from_static(b"foobar"),
|
||||
};
|
||||
let rec = Value::WalRecord(rec);
|
||||
|
||||
#[rustfmt::skip]
|
||||
let expected = [
|
||||
// flattened discriminator of total 8 bytes
|
||||
0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00,
|
||||
// will_init
|
||||
0x01,
|
||||
// 8 byte length
|
||||
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x06,
|
||||
// foobar
|
||||
0x66, 0x6f, 0x6f, 0x62, 0x61, 0x72
|
||||
];
|
||||
|
||||
roundtrip!(rec, expected);
|
||||
}
|
||||
}
|
||||
|
||||
///
|
||||
/// Result of performing GC
|
||||
///
|
||||
|
||||
@@ -15,6 +15,7 @@ use anyhow::{bail, Context};
|
||||
use camino::{Utf8Path, Utf8PathBuf};
|
||||
use futures::FutureExt;
|
||||
use pageserver_api::models::TimelineState;
|
||||
use pageserver_api::shard::ShardIdentity;
|
||||
use remote_storage::DownloadError;
|
||||
use remote_storage::GenericRemoteStorage;
|
||||
use storage_broker::BrokerClientChannel;
|
||||
@@ -61,6 +62,7 @@ use self::mgr::TenantsMap;
|
||||
use self::remote_timeline_client::RemoteTimelineClient;
|
||||
use self::timeline::uninit::TimelineUninitMark;
|
||||
use self::timeline::uninit::UninitializedTimeline;
|
||||
use self::timeline::EvictionTaskTenantState;
|
||||
use self::timeline::TimelineResources;
|
||||
use crate::config::PageServerConf;
|
||||
use crate::context::{DownloadBehavior, RequestContext};
|
||||
@@ -168,6 +170,7 @@ pub struct TenantSharedResources {
|
||||
/// for an attached tenant is a subset of the [`LocationConf`], represented
|
||||
/// in this struct.
|
||||
pub(super) struct AttachedTenantConf {
|
||||
shard: ShardIdentity,
|
||||
tenant_conf: TenantConfOpt,
|
||||
location: AttachedLocationConfig,
|
||||
}
|
||||
@@ -176,6 +179,7 @@ impl AttachedTenantConf {
|
||||
fn try_from(location_conf: LocationConf) -> anyhow::Result<Self> {
|
||||
match &location_conf.mode {
|
||||
LocationMode::Attached(attach_conf) => Ok(Self {
|
||||
shard: location_conf.shard,
|
||||
tenant_conf: location_conf.tenant_conf,
|
||||
location: attach_conf.clone(),
|
||||
}),
|
||||
@@ -251,6 +255,8 @@ pub struct Tenant {
|
||||
cached_logical_sizes: tokio::sync::Mutex<HashMap<(TimelineId, Lsn), u64>>,
|
||||
cached_synthetic_tenant_size: Arc<AtomicU64>,
|
||||
|
||||
eviction_task_tenant_state: tokio::sync::Mutex<EvictionTaskTenantState>,
|
||||
|
||||
pub(crate) delete_progress: Arc<tokio::sync::Mutex<DeleteTenantFlow>>,
|
||||
|
||||
// Cancellation token fires when we have entered shutdown(). This is a parent of
|
||||
@@ -682,9 +688,11 @@ impl Tenant {
|
||||
// Get list of remote timelines
|
||||
// download index files for every tenant timeline
|
||||
info!("listing remote timelines");
|
||||
let shard = self.tenant_conf.read().unwrap().shard.clone();
|
||||
let (remote_timeline_ids, other_keys) = remote_timeline_client::list_remote_timelines(
|
||||
remote_storage,
|
||||
self.tenant_id,
|
||||
&shard,
|
||||
cancel.clone(),
|
||||
)
|
||||
.await?;
|
||||
@@ -1145,6 +1153,7 @@ impl Tenant {
|
||||
self.deletion_queue_client.clone(),
|
||||
self.conf,
|
||||
self.tenant_id,
|
||||
self.tenant_conf.read().unwrap().shard.clone(),
|
||||
timeline_id,
|
||||
self.generation,
|
||||
);
|
||||
@@ -2119,6 +2128,11 @@ impl Tenant {
|
||||
pub fn get_tenant_id(&self) -> TenantId {
|
||||
self.tenant_id
|
||||
}
|
||||
|
||||
pub(crate) fn get_shard(&self) -> ShardIdentity {
|
||||
self.tenant_conf.read().unwrap().shard.clone()
|
||||
}
|
||||
|
||||
pub fn tenant_specific_overrides(&self) -> TenantConfOpt {
|
||||
self.tenant_conf.read().unwrap().tenant_conf
|
||||
}
|
||||
@@ -2364,6 +2378,7 @@ impl Tenant {
|
||||
state,
|
||||
cached_logical_sizes: tokio::sync::Mutex::new(HashMap::new()),
|
||||
cached_synthetic_tenant_size: Arc::new(AtomicU64::new(0)),
|
||||
eviction_task_tenant_state: tokio::sync::Mutex::new(EvictionTaskTenantState::default()),
|
||||
delete_progress: Arc::new(tokio::sync::Mutex::new(DeleteTenantFlow::default())),
|
||||
cancel: CancellationToken::default(),
|
||||
gate: Gate::new(format!("Tenant<{tenant_id}>")),
|
||||
@@ -2984,6 +2999,7 @@ impl Tenant {
|
||||
self.deletion_queue_client.clone(),
|
||||
self.conf,
|
||||
self.tenant_id,
|
||||
self.tenant_conf.read().unwrap().shard.clone(),
|
||||
timeline_id,
|
||||
self.generation,
|
||||
);
|
||||
|
||||
@@ -10,6 +10,7 @@
|
||||
//!
|
||||
use anyhow::Context;
|
||||
use pageserver_api::models;
|
||||
use pageserver_api::shard::{ShardCount, ShardIdentity, ShardNumber, ShardStripeSize};
|
||||
use serde::{Deserialize, Serialize};
|
||||
use std::num::NonZeroU64;
|
||||
use std::time::Duration;
|
||||
@@ -85,6 +86,11 @@ pub(crate) enum LocationMode {
|
||||
/// but have distinct LocationConf.
|
||||
#[derive(Clone, PartialEq, Eq, Serialize, Deserialize)]
|
||||
pub(crate) struct LocationConf {
|
||||
/// Detailed identity of this TenantShard. The shard number and count usually
|
||||
/// appear in the keys of maps containing tenants, but it is convenient to also
|
||||
/// store it here.
|
||||
pub(crate) shard: ShardIdentity,
|
||||
|
||||
/// The location-specific part of the configuration, describes the operating
|
||||
/// mode of this pageserver for this tenant.
|
||||
pub(crate) mode: LocationMode,
|
||||
@@ -156,6 +162,7 @@ impl LocationConf {
|
||||
/// possible state. This function should eventually be removed.
|
||||
pub(crate) fn attached_single(tenant_conf: TenantConfOpt, generation: Generation) -> Self {
|
||||
Self {
|
||||
shard: ShardIdentity::none(),
|
||||
mode: LocationMode::Attached(AttachedLocationConfig {
|
||||
generation,
|
||||
attach_mode: AttachmentMode::Single,
|
||||
@@ -226,7 +233,21 @@ impl LocationConf {
|
||||
}
|
||||
};
|
||||
|
||||
Ok(Self { mode, tenant_conf })
|
||||
let shard = if conf.shard_count == 0 {
|
||||
ShardIdentity::none()
|
||||
} else {
|
||||
ShardIdentity::new(
|
||||
ShardNumber(conf.shard_number),
|
||||
ShardCount(conf.shard_count),
|
||||
ShardStripeSize(conf.shard_stripe_size),
|
||||
)
|
||||
};
|
||||
|
||||
Ok(Self {
|
||||
shard,
|
||||
mode,
|
||||
tenant_conf,
|
||||
})
|
||||
}
|
||||
}
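The hunk above builds a ShardIdentity from (shard_number, shard_count, shard_stripe_size), with shard_count == 0 meaning unsharded. As an illustration only, here is one way such a triple could be used to decide which shard owns a block by striping; this is an assumption about the intent of ShardStripeSize, not a statement of the crate's actual mapping.

#[derive(Debug, Clone, Copy)]
struct ShardIdentitySketch {
    number: u32,
    count: u32,       // 0 => unsharded
    stripe_size: u32, // blocks per stripe (hypothetical interpretation)
}

impl ShardIdentitySketch {
    fn unsharded() -> Self {
        Self { number: 0, count: 0, stripe_size: 1 }
    }

    fn owns_block(&self, blkno: u32) -> bool {
        if self.count == 0 {
            return true; // an unsharded tenant owns every block
        }
        (blkno / self.stripe_size) % self.count == self.number
    }
}

fn main() {
    let shard = ShardIdentitySketch { number: 1, count: 4, stripe_size: 256 };
    assert!(shard.owns_block(300));  // stripe 1 -> shard 1
    assert!(!shard.owns_block(10));  // stripe 0 -> shard 0
    assert!(ShardIdentitySketch::unsharded().owns_block(12345));
    println!("ok");
}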
|
||||
|
||||
@@ -236,6 +257,7 @@ impl Default for LocationConf {
|
||||
// => tech debt since https://github.com/neondatabase/neon/issues/1555
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
shard: ShardIdentity::none(),
|
||||
mode: LocationMode::Attached(AttachedLocationConfig {
|
||||
generation: Generation::none(),
|
||||
attach_mode: AttachmentMode::Single,
|
||||
|
||||
@@ -2,10 +2,9 @@
|
||||
//! page server.
|
||||
|
||||
use camino::{Utf8DirEntry, Utf8Path, Utf8PathBuf};
|
||||
use pageserver_api::shard::TenantShardId;
|
||||
use rand::{distributions::Alphanumeric, Rng};
|
||||
use std::borrow::Cow;
|
||||
use std::collections::{BTreeMap, HashMap};
|
||||
use std::collections::HashMap;
|
||||
use std::ops::Deref;
|
||||
use std::sync::Arc;
|
||||
use std::time::{Duration, Instant};
|
||||
@@ -31,7 +30,6 @@ use crate::metrics::TENANT_MANAGER as METRICS;
|
||||
use crate::task_mgr::{self, TaskKind};
|
||||
use crate::tenant::config::{AttachmentMode, LocationConf, LocationMode, TenantConfOpt};
|
||||
use crate::tenant::delete::DeleteTenantFlow;
|
||||
use crate::tenant::span::debug_assert_current_span_has_tenant_id;
|
||||
use crate::tenant::{create_tenant_files, AttachedTenantConf, SpawnMode, Tenant, TenantState};
|
||||
use crate::{InitializationOrder, IGNORED_TENANT_FILE_NAME, TEMP_FILE_SUFFIX};
|
||||
|
||||
@@ -89,37 +87,10 @@ pub(crate) enum TenantsMap {
|
||||
Initializing,
|
||||
/// [`init_tenant_mgr`] is done, all on-disk tenants have been loaded.
|
||||
/// New tenants can be added using [`tenant_map_acquire_slot`].
|
||||
Open(BTreeMap<TenantShardId, TenantSlot>),
|
||||
Open(HashMap<TenantId, TenantSlot>),
|
||||
/// The pageserver has entered shutdown mode via [`shutdown_all_tenants`].
|
||||
/// Existing tenants are still accessible, but no new tenants can be created.
|
||||
ShuttingDown(BTreeMap<TenantShardId, TenantSlot>),
|
||||
}
|
||||
|
||||
/// Helper for mapping shard-unaware functions to a sharding-aware map
|
||||
/// TODO(sharding): all users of this must be made shard-aware.
|
||||
fn exactly_one_or_none<'a>(
|
||||
map: &'a BTreeMap<TenantShardId, TenantSlot>,
|
||||
tenant_id: &TenantId,
|
||||
) -> Option<(&'a TenantShardId, &'a TenantSlot)> {
|
||||
let mut slots = map.range(TenantShardId::tenant_range(*tenant_id));
|
||||
|
||||
// Retrieve the first two slots in the range: if both are populated, we must panic because the caller
|
||||
// needs a shard-naive view of the world in which only one slot can exist for a TenantId at a time.
|
||||
let slot_a = slots.next();
|
||||
let slot_b = slots.next();
|
||||
match (slot_a, slot_b) {
|
||||
(None, None) => None,
|
||||
(Some(slot), None) => {
|
||||
// Exactly one matching slot
|
||||
Some(slot)
|
||||
}
|
||||
(Some(_slot_a), Some(_slot_b)) => {
|
||||
// Multiple shards for this tenant: cannot handle this yet.
|
||||
// TODO(sharding): callers of get() should be shard-aware.
|
||||
todo!("Attaching multiple shards in teh same tenant to the same pageserver")
|
||||
}
|
||||
(None, Some(_)) => unreachable!(),
|
||||
}
|
||||
ShuttingDown(HashMap<TenantId, TenantSlot>),
|
||||
}
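exactly_one_or_none above exploits the BTreeMap ordering: keying the map by TenantShardId keeps all shards of a tenant contiguous, so a shard-unaware caller can scan the tenant's key range and insist on at most one hit. The same trick with stand-in key types:

use std::collections::BTreeMap;

type TenantId = u64; // stand-ins; the real map is keyed by TenantShardId
type ShardNumber = u8;

fn exactly_one_or_none<'a, V>(
    map: &'a BTreeMap<(TenantId, ShardNumber), V>,
    tenant: TenantId,
) -> Option<(&'a (TenantId, ShardNumber), &'a V)> {
    let mut iter = map.range((tenant, ShardNumber::MIN)..=(tenant, ShardNumber::MAX));
    let first = iter.next();
    match (first, iter.next()) {
        (None, None) => None,
        (Some(hit), None) => Some(hit),
        // More than one shard attached here: the shard-unaware view is ambiguous.
        (Some(_), Some(_)) => panic!("caller must be shard-aware for multi-shard tenants"),
        (None, Some(_)) => unreachable!(),
    }
}

fn main() {
    let mut map = BTreeMap::new();
    map.insert((7, 0), "slot for tenant 7, shard 0");
    assert!(exactly_one_or_none(&map, 7).is_some());
    assert!(exactly_one_or_none(&map, 8).is_none());
}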
|
||||
|
||||
impl TenantsMap {
|
||||
@@ -130,8 +101,7 @@ impl TenantsMap {
|
||||
match self {
|
||||
TenantsMap::Initializing => None,
|
||||
TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => {
|
||||
// TODO(sharding): callers of get() should be shard-aware.
|
||||
exactly_one_or_none(m, tenant_id).and_then(|(_, slot)| slot.get_attached())
|
||||
m.get(tenant_id).and_then(TenantSlot::get_attached)
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -139,10 +109,7 @@ impl TenantsMap {
|
||||
pub(crate) fn remove(&mut self, tenant_id: &TenantId) -> Option<TenantSlot> {
|
||||
match self {
|
||||
TenantsMap::Initializing => None,
|
||||
TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => {
|
||||
let key = exactly_one_or_none(m, tenant_id).map(|(k, _)| *k);
|
||||
key.and_then(|key| m.remove(&key))
|
||||
}
|
||||
TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => m.remove(tenant_id),
|
||||
}
|
||||
}
|
||||
|
||||
@@ -416,7 +383,7 @@ pub async fn init_tenant_mgr(
|
||||
init_order: InitializationOrder,
|
||||
cancel: CancellationToken,
|
||||
) -> anyhow::Result<TenantManager> {
|
||||
let mut tenants = BTreeMap::new();
|
||||
let mut tenants = HashMap::new();
|
||||
|
||||
let ctx = RequestContext::todo_child(TaskKind::Startup, DownloadBehavior::Warn);
|
||||
|
||||
@@ -437,7 +404,7 @@ pub async fn init_tenant_mgr(
|
||||
warn!(%tenant_id, "Marking tenant broken, failed to {e:#}");
|
||||
|
||||
tenants.insert(
|
||||
TenantShardId::unsharded(tenant_id),
|
||||
tenant_id,
|
||||
TenantSlot::Attached(Tenant::create_broken_tenant(
|
||||
conf,
|
||||
tenant_id,
|
||||
@@ -460,7 +427,7 @@ pub async fn init_tenant_mgr(
|
||||
// tenants, because they do no remote writes and hence require no
|
||||
// generation number
|
||||
info!(%tenant_id, "Loaded tenant in secondary mode");
|
||||
tenants.insert(TenantShardId::unsharded(tenant_id), TenantSlot::Secondary);
|
||||
tenants.insert(tenant_id, TenantSlot::Secondary);
|
||||
}
|
||||
LocationMode::Attached(_) => {
|
||||
// TODO: augment re-attach API to enable the control plane to
|
||||
@@ -503,10 +470,7 @@ pub async fn init_tenant_mgr(
|
||||
&ctx,
|
||||
) {
|
||||
Ok(tenant) => {
|
||||
tenants.insert(
|
||||
TenantShardId::unsharded(tenant.tenant_id()),
|
||||
TenantSlot::Attached(tenant),
|
||||
);
|
||||
tenants.insert(tenant.tenant_id(), TenantSlot::Attached(tenant));
|
||||
}
|
||||
Err(e) => {
|
||||
error!(%tenant_id, "Failed to start tenant: {e:#}");
|
||||
@@ -602,80 +566,89 @@ pub(crate) async fn shutdown_all_tenants() {
|
||||
async fn shutdown_all_tenants0(tenants: &std::sync::RwLock<TenantsMap>) {
|
||||
use utils::completion;
|
||||
|
||||
let mut join_set = JoinSet::new();
|
||||
|
||||
// Atomically, 1. create the shutdown tasks and 2. prevent creation of new tenants.
|
||||
let (total_in_progress, total_attached) = {
|
||||
// Atomically, 1. extract the list of tenants to shut down and 2. prevent creation of new tenants.
|
||||
let (in_progress_ops, tenants_to_shut_down) = {
|
||||
let mut m = tenants.write().unwrap();
|
||||
match &mut *m {
|
||||
TenantsMap::Initializing => {
|
||||
*m = TenantsMap::ShuttingDown(BTreeMap::default());
|
||||
*m = TenantsMap::ShuttingDown(HashMap::default());
|
||||
info!("tenants map is empty");
|
||||
return;
|
||||
}
|
||||
TenantsMap::Open(tenants) => {
|
||||
let mut shutdown_state = BTreeMap::new();
|
||||
let mut total_in_progress = 0;
|
||||
let mut total_attached = 0;
|
||||
let mut shutdown_state = HashMap::new();
|
||||
let mut in_progress_ops = Vec::new();
|
||||
let mut tenants_to_shut_down = Vec::new();
|
||||
|
||||
for (tenant_shard_id, v) in std::mem::take(tenants).into_iter() {
|
||||
for (k, v) in tenants.drain() {
|
||||
match v {
|
||||
TenantSlot::Attached(t) => {
|
||||
shutdown_state.insert(tenant_shard_id, TenantSlot::Attached(t.clone()));
|
||||
join_set.spawn(
|
||||
async move {
|
||||
let freeze_and_flush = true;
|
||||
|
||||
let res = {
|
||||
let (_guard, shutdown_progress) = completion::channel();
|
||||
t.shutdown(shutdown_progress, freeze_and_flush).await
|
||||
};
|
||||
|
||||
if let Err(other_progress) = res {
|
||||
// join the other shutdown in progress
|
||||
other_progress.wait().await;
|
||||
}
|
||||
|
||||
// we cannot afford per tenant logging here, because if s3 is degraded, we are
|
||||
// going to log too many lines
|
||||
debug!("tenant successfully stopped");
|
||||
}
|
||||
.instrument(info_span!("shutdown", tenant_id=%tenant_shard_id.tenant_id, shard=%tenant_shard_id.shard_slug())),
|
||||
);
|
||||
|
||||
total_attached += 1;
|
||||
tenants_to_shut_down.push(t.clone());
|
||||
shutdown_state.insert(k, TenantSlot::Attached(t));
|
||||
}
|
||||
TenantSlot::Secondary => {
|
||||
shutdown_state.insert(tenant_shard_id, TenantSlot::Secondary);
|
||||
shutdown_state.insert(k, TenantSlot::Secondary);
|
||||
}
|
||||
TenantSlot::InProgress(notify) => {
|
||||
// InProgress tenants are not visible in TenantsMap::ShuttingDown: we will
|
||||
// wait for their notifications to fire in this function.
|
||||
join_set.spawn(async move {
|
||||
notify.wait().await;
|
||||
});
|
||||
|
||||
total_in_progress += 1;
|
||||
in_progress_ops.push(notify);
|
||||
}
|
||||
}
|
||||
}
|
||||
*m = TenantsMap::ShuttingDown(shutdown_state);
|
||||
(total_in_progress, total_attached)
|
||||
(in_progress_ops, tenants_to_shut_down)
|
||||
}
|
||||
TenantsMap::ShuttingDown(_) => {
|
||||
// TODO: it is possible that detach and shutdown happen at the same time. As a
|
||||
// result, during shutdown we do not wait for detach.
|
||||
error!("already shutting down, this function isn't supposed to be called more than once");
|
||||
return;
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
let started_at = std::time::Instant::now();
|
||||
|
||||
info!(
|
||||
"Waiting for {} InProgress tenants and {} Attached tenants to shut down",
|
||||
total_in_progress, total_attached
|
||||
in_progress_ops.len(),
|
||||
tenants_to_shut_down.len()
|
||||
);
|
||||
|
||||
for barrier in in_progress_ops {
|
||||
barrier.wait().await;
|
||||
}
|
||||
|
||||
info!(
|
||||
"InProgress tenants shut down, waiting for {} Attached tenants to shut down",
|
||||
tenants_to_shut_down.len()
|
||||
);
|
||||
let started_at = std::time::Instant::now();
|
||||
let mut join_set = JoinSet::new();
|
||||
for tenant in tenants_to_shut_down {
|
||||
let tenant_id = tenant.get_tenant_id();
|
||||
join_set.spawn(
|
||||
async move {
|
||||
let freeze_and_flush = true;
|
||||
|
||||
let res = {
|
||||
let (_guard, shutdown_progress) = completion::channel();
|
||||
tenant.shutdown(shutdown_progress, freeze_and_flush).await
|
||||
};
|
||||
|
||||
if let Err(other_progress) = res {
|
||||
// join the other shutdown in progress
|
||||
other_progress.wait().await;
|
||||
}
|
||||
|
||||
// we cannot afford per tenant logging here, because if s3 is degraded, we are
|
||||
// going to log too many lines
|
||||
|
||||
debug!("tenant successfully stopped");
|
||||
}
|
||||
.instrument(info_span!("shutdown", %tenant_id)),
|
||||
);
|
||||
}
|
||||
|
||||
let total = join_set.len();
|
||||
let mut panicked = 0;
|
||||
let mut buffering = true;
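The reworked shutdown first waits for the InProgress barriers, then fans the attached tenants out into a JoinSet and drains it, tolerating panics. A runnable sketch of that fan-out, assuming the tokio crate with the macros, rt-multi-thread and time features:

use tokio::task::JoinSet;

async fn shut_down_tenant(id: u32) {
    // Stand-in for Tenant::shutdown(progress, freeze_and_flush).
    tokio::time::sleep(std::time::Duration::from_millis(10 * id as u64)).await;
}

#[tokio::main]
async fn main() {
    let tenants_to_shut_down = vec![1u32, 2, 3];

    let mut join_set = JoinSet::new();
    for id in tenants_to_shut_down {
        join_set.spawn(async move { shut_down_tenant(id).await });
    }

    let mut panicked = 0;
    while let Some(joined) = join_set.join_next().await {
        match joined {
            Ok(()) => {}
            Err(e) if e.is_panic() => panicked += 1, // log and keep draining
            Err(_) => unreachable!("nothing here is cancelled"),
        }
    }
    if panicked > 0 {
        eprintln!("{panicked} tenant shutdown tasks panicked");
    }
}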
|
||||
@@ -688,7 +661,7 @@ async fn shutdown_all_tenants0(tenants: &std::sync::RwLock<TenantsMap>) {
|
||||
match joined {
|
||||
Ok(()) => {}
|
||||
Err(join_error) if join_error.is_cancelled() => {
|
||||
unreachable!("we are not cancelling any of the tasks");
|
||||
unreachable!("we are not cancelling any of the futures");
|
||||
}
|
||||
Err(join_error) if join_error.is_panic() => {
|
||||
// cannot really do anything, as this panic is likely a bug
|
||||
@@ -726,22 +699,19 @@ async fn shutdown_all_tenants0(tenants: &std::sync::RwLock<TenantsMap>) {
|
||||
pub(crate) async fn create_tenant(
|
||||
conf: &'static PageServerConf,
|
||||
tenant_conf: TenantConfOpt,
|
||||
tenant_shard_id: TenantShardId,
|
||||
tenant_id: TenantId,
|
||||
generation: Generation,
|
||||
resources: TenantSharedResources,
|
||||
ctx: &RequestContext,
|
||||
) -> Result<Arc<Tenant>, TenantMapInsertError> {
|
||||
let location_conf = LocationConf::attached_single(tenant_conf, generation);
|
||||
|
||||
let slot_guard =
|
||||
tenant_map_acquire_slot(&tenant_shard_id, TenantSlotAcquireMode::MustNotExist)?;
|
||||
// TODO(sharding): make local paths shard-aware
|
||||
let tenant_path =
|
||||
super::create_tenant_files(conf, &location_conf, &tenant_shard_id.tenant_id).await?;
|
||||
let slot_guard = tenant_map_acquire_slot(&tenant_id, TenantSlotAcquireMode::MustNotExist)?;
|
||||
let tenant_path = super::create_tenant_files(conf, &location_conf, &tenant_id).await?;
|
||||
|
||||
let created_tenant = tenant_spawn(
|
||||
conf,
|
||||
tenant_shard_id.tenant_id,
|
||||
tenant_id,
|
||||
&tenant_path,
|
||||
resources,
|
||||
AttachedTenantConf::try_from(location_conf)?,
|
||||
@@ -754,7 +724,11 @@ pub(crate) async fn create_tenant(
|
||||
// See https://github.com/neondatabase/neon/issues/4233
|
||||
|
||||
let created_tenant_id = created_tenant.tenant_id();
|
||||
debug_assert_eq!(created_tenant_id, tenant_shard_id.tenant_id);
|
||||
if tenant_id != created_tenant_id {
|
||||
return Err(TenantMapInsertError::Other(anyhow::anyhow!(
|
||||
"loaded created tenant has unexpected tenant id (expect {tenant_id} != actual {created_tenant_id})",
|
||||
)));
|
||||
}
|
||||
|
||||
slot_guard.upsert(TenantSlot::Attached(created_tenant.clone()))?;
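create_tenant reserves the slot with MustNotExist before doing the slow work and only upserts the finished Tenant at the end. A sketch of that slot-guard shape with illustrative names (the real SlotGuard also reverts the reservation on drop, which this sketch omits):

use std::collections::BTreeMap;

enum Slot {
    InProgress,
    Attached(String),
}

struct SlotGuard<'a> {
    map: &'a mut BTreeMap<u64, Slot>,
    key: u64,
}

impl<'a> SlotGuard<'a> {
    fn acquire_must_not_exist(map: &'a mut BTreeMap<u64, Slot>, key: u64) -> Result<Self, String> {
        if map.contains_key(&key) {
            return Err(format!("tenant {key} already exists"));
        }
        // Reserve the slot so concurrent creations fail fast.
        map.insert(key, Slot::InProgress);
        Ok(Self { map, key })
    }

    fn upsert(self, value: Slot) {
        self.map.insert(self.key, value);
    }
}

fn main() {
    let mut tenants = BTreeMap::new();
    let guard = SlotGuard::acquire_must_not_exist(&mut tenants, 42).unwrap();
    // ... create tenant files, spawn the tenant ...
    guard.upsert(Slot::Attached("tenant 42".to_string()));
    assert!(matches!(tenants.get(&42), Some(Slot::Attached(_))));
}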
|
||||
|
||||
@@ -790,70 +764,24 @@ pub(crate) async fn set_new_tenant_config(
|
||||
}
|
||||
|
||||
impl TenantManager {
|
||||
/// Gets the attached tenant from the in-memory data, erroring if it's absent, in secondary mode, or is not fitting to the query.
|
||||
/// `active_only = true` allows to query only tenants that are ready for operations, erroring on other kinds of tenants.
|
||||
///
|
||||
/// This method is cancel-safe.
|
||||
pub(crate) fn get_attached_tenant_shard(
|
||||
&self,
|
||||
tenant_shard_id: TenantShardId,
|
||||
active_only: bool,
|
||||
) -> Result<Arc<Tenant>, GetTenantError> {
|
||||
let locked = self.tenants.read().unwrap();
|
||||
|
||||
let peek_slot = tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Read)?;
|
||||
|
||||
match peek_slot {
|
||||
Some(TenantSlot::Attached(tenant)) => match tenant.current_state() {
|
||||
TenantState::Broken {
|
||||
reason,
|
||||
backtrace: _,
|
||||
} if active_only => Err(GetTenantError::Broken(reason)),
|
||||
TenantState::Active => Ok(Arc::clone(tenant)),
|
||||
_ => {
|
||||
if active_only {
|
||||
Err(GetTenantError::NotActive(tenant_shard_id.tenant_id))
|
||||
} else {
|
||||
Ok(Arc::clone(tenant))
|
||||
}
|
||||
}
|
||||
},
|
||||
Some(TenantSlot::InProgress(_)) => {
|
||||
Err(GetTenantError::NotActive(tenant_shard_id.tenant_id))
|
||||
}
|
||||
None | Some(TenantSlot::Secondary) => {
|
||||
Err(GetTenantError::NotFound(tenant_shard_id.tenant_id))
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
pub(crate) async fn delete_timeline(
|
||||
&self,
|
||||
tenant_shard_id: TenantShardId,
|
||||
timeline_id: TimelineId,
|
||||
_ctx: &RequestContext,
|
||||
) -> Result<(), DeleteTimelineError> {
|
||||
let tenant = self.get_attached_tenant_shard(tenant_shard_id, true)?;
|
||||
DeleteTimelineFlow::run(&tenant, timeline_id, false).await?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[instrument(skip_all, fields(%tenant_id))]
|
||||
pub(crate) async fn upsert_location(
|
||||
&self,
|
||||
tenant_shard_id: TenantShardId,
|
||||
tenant_id: TenantId,
|
||||
new_location_config: LocationConf,
|
||||
ctx: &RequestContext,
|
||||
) -> Result<(), anyhow::Error> {
|
||||
debug_assert_current_span_has_tenant_id();
|
||||
info!("configuring tenant location to state {new_location_config:?}");
|
||||
info!(
|
||||
"configuring tenant location {tenant_id} {} to state {new_location_config:?}",
|
||||
new_location_config.shard.slug()
|
||||
);
|
||||
|
||||
// Special case fast-path for updates to Tenant: if our upsert is only updating configuration,
|
||||
// then we do not need to set the slot to InProgress, we can just call into the
|
||||
// existing tenant.
|
||||
{
|
||||
let locked = self.tenants.read().unwrap();
|
||||
let peek_slot =
|
||||
tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Write)?;
|
||||
let peek_slot = tenant_map_peek_slot(&locked, &tenant_id, TenantSlotPeekMode::Write)?;
|
||||
match (&new_location_config.mode, peek_slot) {
|
||||
(LocationMode::Attached(attach_conf), Some(TenantSlot::Attached(tenant))) => {
|
||||
if attach_conf.generation == tenant.generation {
|
||||
@@ -884,7 +812,7 @@ impl TenantManager {
|
||||
// the tenant is inaccessible to the outside world while we are doing this, but that is sensible:
|
||||
// the state is ill-defined while we're in transition. Transitions are async, but fast: we do
|
||||
// not do significant I/O, and shutdowns should be prompt via cancellation tokens.
|
||||
let mut slot_guard = tenant_map_acquire_slot(&tenant_shard_id, TenantSlotAcquireMode::Any)?;
|
||||
let mut slot_guard = tenant_map_acquire_slot(&tenant_id, TenantSlotAcquireMode::Any)?;
|
||||
|
||||
if let Some(TenantSlot::Attached(tenant)) = slot_guard.get_old_value() {
|
||||
// The case where we keep a Tenant alive was covered above in the special case
|
||||
@@ -915,31 +843,25 @@ impl TenantManager {
|
||||
slot_guard.drop_old_value().expect("We just shut it down");
|
||||
}
|
||||
|
||||
// TODO(sharding): make local paths sharding-aware
|
||||
let tenant_path = self.conf.tenant_path(&tenant_shard_id.tenant_id);
|
||||
let tenant_path = self.conf.tenant_path(&tenant_id);
|
||||
|
||||
let new_slot = match &new_location_config.mode {
|
||||
LocationMode::Secondary(_) => {
|
||||
let tenant_path = self.conf.tenant_path(&tenant_id);
|
||||
// Directory doesn't need to be fsync'd because if we crash it can
|
||||
// safely be recreated next time this tenant location is configured.
|
||||
unsafe_create_dir_all(&tenant_path)
|
||||
.await
|
||||
.with_context(|| format!("Creating {tenant_path}"))?;
|
||||
|
||||
// TODO(sharding): make local paths sharding-aware
|
||||
Tenant::persist_tenant_config(
|
||||
self.conf,
|
||||
&tenant_shard_id.tenant_id,
|
||||
&new_location_config,
|
||||
)
|
||||
.await
|
||||
.map_err(SetNewTenantConfigError::Persist)?;
|
||||
Tenant::persist_tenant_config(self.conf, &tenant_id, &new_location_config)
|
||||
.await
|
||||
.map_err(SetNewTenantConfigError::Persist)?;
|
||||
|
||||
TenantSlot::Secondary
|
||||
}
|
||||
LocationMode::Attached(_attach_config) => {
|
||||
// TODO(sharding): make local paths sharding-aware
|
||||
let timelines_path = self.conf.timelines_path(&tenant_shard_id.tenant_id);
|
||||
let timelines_path = self.conf.timelines_path(&tenant_id);
|
||||
|
||||
// Directory doesn't need to be fsync'd because we do not depend on
|
||||
// it to exist after crashes: it may be recreated when tenant is
|
||||
@@ -948,19 +870,13 @@ impl TenantManager {
|
||||
.await
|
||||
.with_context(|| format!("Creating {timelines_path}"))?;
|
||||
|
||||
// TODO(sharding): make local paths sharding-aware
|
||||
Tenant::persist_tenant_config(
|
||||
self.conf,
|
||||
&tenant_shard_id.tenant_id,
|
||||
&new_location_config,
|
||||
)
|
||||
.await
|
||||
.map_err(SetNewTenantConfigError::Persist)?;
|
||||
Tenant::persist_tenant_config(self.conf, &tenant_id, &new_location_config)
|
||||
.await
|
||||
.map_err(SetNewTenantConfigError::Persist)?;
|
||||
|
||||
// TODO(sharding): make spawn sharding-aware
|
||||
let tenant = tenant_spawn(
|
||||
self.conf,
|
||||
tenant_shard_id.tenant_id,
|
||||
tenant_id,
|
||||
&tenant_path,
|
||||
self.resources.clone(),
|
||||
AttachedTenantConf::try_from(new_location_config)?,
|
||||
@@ -1006,11 +922,7 @@ pub(crate) fn get_tenant(
|
||||
active_only: bool,
|
||||
) -> Result<Arc<Tenant>, GetTenantError> {
|
||||
let locked = TENANTS.read().unwrap();
|
||||
|
||||
// TODO(sharding): make all callers of get_tenant shard-aware
|
||||
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
|
||||
|
||||
let peek_slot = tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Read)?;
|
||||
let peek_slot = tenant_map_peek_slot(&locked, &tenant_id, TenantSlotPeekMode::Read)?;
|
||||
|
||||
match peek_slot {
|
||||
Some(TenantSlot::Attached(tenant)) => match tenant.current_state() {
|
||||
@@ -1070,16 +982,12 @@ pub(crate) async fn get_active_tenant_with_timeout(
|
||||
Tenant(Arc<Tenant>),
|
||||
}
|
||||
|
||||
// TODO(sharding): make page service interface sharding-aware (page service should apply ShardIdentity to the key
|
||||
// to decide which shard services the request)
|
||||
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
|
||||
|
||||
let wait_start = Instant::now();
|
||||
let deadline = wait_start + timeout;
|
||||
|
||||
let wait_for = {
|
||||
let locked = TENANTS.read().unwrap();
|
||||
let peek_slot = tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Read)
|
||||
let peek_slot = tenant_map_peek_slot(&locked, &tenant_id, TenantSlotPeekMode::Read)
|
||||
.map_err(GetTenantError::MapState)?;
|
||||
match peek_slot {
|
||||
Some(TenantSlot::Attached(tenant)) => {
|
||||
@@ -1123,9 +1031,8 @@ pub(crate) async fn get_active_tenant_with_timeout(
|
||||
})?;
|
||||
{
|
||||
let locked = TENANTS.read().unwrap();
|
||||
let peek_slot =
|
||||
tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Read)
|
||||
.map_err(GetTenantError::MapState)?;
|
||||
let peek_slot = tenant_map_peek_slot(&locked, &tenant_id, TenantSlotPeekMode::Read)
|
||||
.map_err(GetTenantError::MapState)?;
|
||||
match peek_slot {
|
||||
Some(TenantSlot::Attached(tenant)) => tenant.clone(),
|
||||
_ => {
|
||||
@@ -1167,7 +1074,7 @@ pub(crate) async fn get_active_tenant_with_timeout(
|
||||
pub(crate) async fn delete_tenant(
|
||||
conf: &'static PageServerConf,
|
||||
remote_storage: Option<GenericRemoteStorage>,
|
||||
tenant_shard_id: TenantShardId,
|
||||
tenant_id: TenantId,
|
||||
) -> Result<(), DeleteTenantError> {
|
||||
// We acquire a SlotGuard during this function to protect against concurrent
|
||||
// changes while the ::prepare phase of DeleteTenantFlow executes, but then
|
||||
@@ -1180,9 +1087,7 @@ pub(crate) async fn delete_tenant(
|
||||
//
|
||||
// See https://github.com/neondatabase/neon/issues/5080
|
||||
|
||||
// TODO(sharding): make delete API sharding-aware
|
||||
let mut slot_guard =
|
||||
tenant_map_acquire_slot(&tenant_shard_id, TenantSlotAcquireMode::MustExist)?;
|
||||
let mut slot_guard = tenant_map_acquire_slot(&tenant_id, TenantSlotAcquireMode::MustExist)?;
|
||||
|
||||
// unwrap is safe because we used MustExist mode when acquiring
|
||||
let tenant = match slot_guard.get_old_value().as_ref().unwrap() {
|
||||
@@ -1209,6 +1114,16 @@ pub(crate) enum DeleteTimelineError {
|
||||
Timeline(#[from] crate::tenant::DeleteTimelineError),
|
||||
}
|
||||
|
||||
pub(crate) async fn delete_timeline(
|
||||
tenant_id: TenantId,
|
||||
timeline_id: TimelineId,
|
||||
_ctx: &RequestContext,
|
||||
) -> Result<(), DeleteTimelineError> {
|
||||
let tenant = get_tenant(tenant_id, true)?;
|
||||
DeleteTimelineFlow::run(&tenant, timeline_id, false).await?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[derive(Debug, thiserror::Error)]
|
||||
pub(crate) enum TenantStateError {
|
||||
#[error("Tenant {0} is stopping")]
|
||||
@@ -1223,14 +1138,14 @@ pub(crate) enum TenantStateError {
|
||||
|
||||
pub(crate) async fn detach_tenant(
|
||||
conf: &'static PageServerConf,
|
||||
tenant_shard_id: TenantShardId,
|
||||
tenant_id: TenantId,
|
||||
detach_ignored: bool,
|
||||
deletion_queue_client: &DeletionQueueClient,
|
||||
) -> Result<(), TenantStateError> {
|
||||
let tmp_path = detach_tenant0(
|
||||
conf,
|
||||
&TENANTS,
|
||||
tenant_shard_id,
|
||||
tenant_id,
|
||||
detach_ignored,
|
||||
deletion_queue_client,
|
||||
)
|
||||
@@ -1257,24 +1172,19 @@ pub(crate) async fn detach_tenant(
|
||||
async fn detach_tenant0(
|
||||
conf: &'static PageServerConf,
|
||||
tenants: &std::sync::RwLock<TenantsMap>,
|
||||
tenant_shard_id: TenantShardId,
|
||||
tenant_id: TenantId,
|
||||
detach_ignored: bool,
|
||||
deletion_queue_client: &DeletionQueueClient,
|
||||
) -> Result<Utf8PathBuf, TenantStateError> {
|
||||
let tenant_dir_rename_operation = |tenant_id_to_clean: TenantShardId| async move {
|
||||
// TODO(sharding): make local path helpers shard-aware
|
||||
let local_tenant_directory = conf.tenant_path(&tenant_id_to_clean.tenant_id);
|
||||
let tenant_dir_rename_operation = |tenant_id_to_clean| async move {
|
||||
let local_tenant_directory = conf.tenant_path(&tenant_id_to_clean);
|
||||
safe_rename_tenant_dir(&local_tenant_directory)
|
||||
.await
|
||||
.with_context(|| format!("local tenant directory {local_tenant_directory:?} rename"))
|
||||
};
|
||||
|
||||
let removal_result = remove_tenant_from_memory(
|
||||
tenants,
|
||||
tenant_shard_id,
|
||||
tenant_dir_rename_operation(tenant_shard_id),
|
||||
)
|
||||
.await;
|
||||
let removal_result =
|
||||
remove_tenant_from_memory(tenants, tenant_id, tenant_dir_rename_operation(tenant_id)).await;
|
||||
|
||||
// Flush pending deletions, so that they have a good chance of passing validation
|
||||
// before this tenant is potentially re-attached elsewhere.
|
||||
@@ -1288,15 +1198,12 @@ async fn detach_tenant0(
|
||||
Err(TenantStateError::SlotError(TenantSlotError::NotFound(_)))
|
||||
)
|
||||
{
|
||||
// TODO(sharding): make local paths sharding-aware
|
||||
let tenant_ignore_mark = conf.tenant_ignore_mark_file_path(&tenant_shard_id.tenant_id);
|
||||
let tenant_ignore_mark = conf.tenant_ignore_mark_file_path(&tenant_id);
|
||||
if tenant_ignore_mark.exists() {
|
||||
info!("Detaching an ignored tenant");
|
||||
let tmp_path = tenant_dir_rename_operation(tenant_shard_id)
|
||||
let tmp_path = tenant_dir_rename_operation(tenant_id)
|
||||
.await
|
||||
.with_context(|| {
|
||||
format!("Ignored tenant {tenant_shard_id} local directory rename")
|
||||
})?;
|
||||
.with_context(|| format!("Ignored tenant {tenant_id} local directory rename"))?;
|
||||
return Ok(tmp_path);
|
||||
}
|
||||
}
|
||||
@@ -1313,11 +1220,7 @@ pub(crate) async fn load_tenant(
|
||||
deletion_queue_client: DeletionQueueClient,
|
||||
ctx: &RequestContext,
|
||||
) -> Result<(), TenantMapInsertError> {
|
||||
// This is a legacy API (replaced by `/location_conf`). It does not support sharding
|
||||
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
|
||||
|
||||
let slot_guard =
|
||||
tenant_map_acquire_slot(&tenant_shard_id, TenantSlotAcquireMode::MustNotExist)?;
|
||||
let slot_guard = tenant_map_acquire_slot(&tenant_id, TenantSlotAcquireMode::MustNotExist)?;
|
||||
let tenant_path = conf.tenant_path(&tenant_id);
|
||||
|
||||
let tenant_ignore_mark = conf.tenant_ignore_mark_file_path(&tenant_id);
|
||||
@@ -1370,10 +1273,7 @@ async fn ignore_tenant0(
|
||||
tenants: &std::sync::RwLock<TenantsMap>,
|
||||
tenant_id: TenantId,
|
||||
) -> Result<(), TenantStateError> {
|
||||
// This is a legacy API (replaced by `/location_conf`). It does not support sharding
|
||||
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
|
||||
|
||||
remove_tenant_from_memory(tenants, tenant_shard_id, async {
|
||||
remove_tenant_from_memory(tenants, tenant_id, async {
|
||||
let ignore_mark_file = conf.tenant_ignore_mark_file_path(&tenant_id);
|
||||
fs::File::create(&ignore_mark_file)
|
||||
.await
|
||||
@@ -1382,7 +1282,7 @@ async fn ignore_tenant0(
|
||||
crashsafe::fsync_file_and_parent(&ignore_mark_file)
|
||||
.context("Failed to fsync ignore mark file")
|
||||
})
|
||||
.with_context(|| format!("Failed to crate ignore mark for tenant {tenant_shard_id}"))?;
|
||||
.with_context(|| format!("Failed to crate ignore mark for tenant {tenant_id}"))?;
|
||||
Ok(())
|
||||
})
|
||||
.await
|
||||
@@ -1405,12 +1305,10 @@ pub(crate) async fn list_tenants() -> Result<Vec<(TenantId, TenantState)>, Tenan
|
||||
};
|
||||
Ok(m.iter()
|
||||
.filter_map(|(id, tenant)| match tenant {
|
||||
TenantSlot::Attached(tenant) => Some((id, tenant.current_state())),
|
||||
TenantSlot::Attached(tenant) => Some((*id, tenant.current_state())),
|
||||
TenantSlot::Secondary => None,
|
||||
TenantSlot::InProgress(_) => None,
|
||||
})
|
||||
// TODO(sharding): make callers of this function shard-aware
|
||||
.map(|(k, v)| (k.tenant_id, v))
|
||||
.collect())
|
||||
}
|
||||
|
||||
@@ -1426,11 +1324,7 @@ pub(crate) async fn attach_tenant(
|
||||
resources: TenantSharedResources,
|
||||
ctx: &RequestContext,
|
||||
) -> Result<(), TenantMapInsertError> {
|
||||
// This is a legacy API (replaced by `/location_conf`). It does not support sharding
|
||||
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
|
||||
|
||||
let slot_guard =
|
||||
tenant_map_acquire_slot(&tenant_shard_id, TenantSlotAcquireMode::MustNotExist)?;
|
||||
let slot_guard = tenant_map_acquire_slot(&tenant_id, TenantSlotAcquireMode::MustNotExist)?;
|
||||
let location_conf = LocationConf::attached_single(tenant_conf, generation);
|
||||
let tenant_dir = create_tenant_files(conf, &location_conf, &tenant_id).await?;
|
||||
// TODO: tenant directory remains on disk if we bail out from here on.
|
||||
@@ -1477,14 +1371,14 @@ pub(crate) enum TenantMapInsertError {
|
||||
pub enum TenantSlotError {
|
||||
/// When acquiring a slot with the expectation that the tenant already exists.
|
||||
#[error("Tenant {0} not found")]
|
||||
NotFound(TenantShardId),
|
||||
NotFound(TenantId),
|
||||
|
||||
/// When acquiring a slot with the expectation that the tenant does not already exist.
|
||||
#[error("tenant {0} already exists, state: {1:?}")]
|
||||
AlreadyExists(TenantShardId, TenantState),
|
||||
AlreadyExists(TenantId, TenantState),
|
||||
|
||||
#[error("tenant {0} already exists in but is not attached")]
|
||||
Conflict(TenantShardId),
|
||||
Conflict(TenantId),
|
||||
|
||||
// Tried to read a slot that is currently being mutated by another administrative
|
||||
// operation.
|
||||
@@ -1546,7 +1440,7 @@ pub enum TenantMapError {
|
||||
/// `drop_old_value`. It is an error to call this without shutting down
|
||||
/// the contents of `old_value`.
|
||||
pub struct SlotGuard {
|
||||
tenant_shard_id: TenantShardId,
|
||||
tenant_id: TenantId,
|
||||
old_value: Option<TenantSlot>,
|
||||
upserted: bool,
|
||||
|
||||
@@ -1557,12 +1451,12 @@ pub struct SlotGuard {
|
||||
|
||||
impl SlotGuard {
|
||||
fn new(
|
||||
tenant_shard_id: TenantShardId,
|
||||
tenant_id: TenantId,
|
||||
old_value: Option<TenantSlot>,
|
||||
completion: utils::completion::Completion,
|
||||
) -> Self {
|
||||
Self {
|
||||
tenant_shard_id,
|
||||
tenant_id,
|
||||
old_value,
|
||||
upserted: false,
|
||||
_completion: completion,
|
||||
@@ -1605,7 +1499,7 @@ impl SlotGuard {
|
||||
TenantsMap::Open(m) => m,
|
||||
};
|
||||
|
||||
let replaced = m.insert(self.tenant_shard_id, new_value);
|
||||
let replaced = m.insert(self.tenant_id, new_value);
|
||||
self.upserted = true;
|
||||
|
||||
METRICS.tenant_slots.set(m.len() as u64);
|
||||
@@ -1624,7 +1518,7 @@ impl SlotGuard {
|
||||
None => {
|
||||
METRICS.unexpected_errors.inc();
|
||||
error!(
|
||||
tenant_shard_id = %self.tenant_shard_id,
|
||||
tenant_id = %self.tenant_id,
|
||||
"Missing InProgress marker during tenant upsert, this is a bug."
|
||||
);
|
||||
Err(TenantSlotUpsertError::InternalError(
|
||||
@@ -1633,7 +1527,7 @@ impl SlotGuard {
|
||||
}
|
||||
Some(slot) => {
|
||||
METRICS.unexpected_errors.inc();
|
||||
error!(tenant_shard_id=%self.tenant_shard_id, "Unexpected contents of TenantSlot during upsert, this is a bug. Contents: {:?}", slot);
|
||||
error!(tenant_id=%self.tenant_id, "Unexpected contents of TenantSlot during upsert, this is a bug. Contents: {:?}", slot);
|
||||
Err(TenantSlotUpsertError::InternalError(
|
||||
"Unexpected contents of TenantSlot".into(),
|
||||
))
|
||||
@@ -1711,12 +1605,12 @@ impl Drop for SlotGuard {
|
||||
TenantsMap::Open(m) => m,
|
||||
};
|
||||
|
||||
use std::collections::btree_map::Entry;
|
||||
match m.entry(self.tenant_shard_id) {
|
||||
use std::collections::hash_map::Entry;
|
||||
match m.entry(self.tenant_id) {
|
||||
Entry::Occupied(mut entry) => {
|
||||
if !matches!(entry.get(), TenantSlot::InProgress(_)) {
|
||||
METRICS.unexpected_errors.inc();
|
||||
error!(tenant_shard_id=%self.tenant_shard_id, "Unexpected contents of TenantSlot during drop, this is a bug. Contents: {:?}", entry.get());
|
||||
error!(tenant_id=%self.tenant_id, "Unexpected contents of TenantSlot during drop, this is a bug. Contents: {:?}", entry.get());
|
||||
}
|
||||
|
||||
if self.old_value_is_shutdown() {
|
||||
@@ -1728,7 +1622,7 @@ impl Drop for SlotGuard {
|
||||
Entry::Vacant(_) => {
|
||||
METRICS.unexpected_errors.inc();
|
||||
error!(
|
||||
tenant_shard_id = %self.tenant_shard_id,
|
||||
tenant_id = %self.tenant_id,
|
||||
"Missing InProgress marker during SlotGuard drop, this is a bug."
|
||||
);
|
||||
}
|
||||
@@ -1747,7 +1641,7 @@ enum TenantSlotPeekMode {
|
||||
|
||||
fn tenant_map_peek_slot<'a>(
|
||||
tenants: &'a std::sync::RwLockReadGuard<'a, TenantsMap>,
|
||||
tenant_shard_id: &TenantShardId,
|
||||
tenant_id: &TenantId,
|
||||
mode: TenantSlotPeekMode,
|
||||
) -> Result<Option<&'a TenantSlot>, TenantMapError> {
|
||||
let m = match tenants.deref() {
|
||||
@@ -1761,7 +1655,7 @@ fn tenant_map_peek_slot<'a>(
|
||||
TenantsMap::Open(m) => m,
|
||||
};
|
||||
|
||||
Ok(m.get(tenant_shard_id))
|
||||
Ok(m.get(tenant_id))
|
||||
}
|
||||
|
||||
enum TenantSlotAcquireMode {
|
||||
@@ -1774,14 +1668,14 @@ enum TenantSlotAcquireMode {
|
||||
}
|
||||
|
||||
fn tenant_map_acquire_slot(
|
||||
tenant_shard_id: &TenantShardId,
|
||||
tenant_id: &TenantId,
|
||||
mode: TenantSlotAcquireMode,
|
||||
) -> Result<SlotGuard, TenantSlotError> {
|
||||
tenant_map_acquire_slot_impl(tenant_shard_id, &TENANTS, mode)
|
||||
tenant_map_acquire_slot_impl(tenant_id, &TENANTS, mode)
|
||||
}
|
||||
|
||||
fn tenant_map_acquire_slot_impl(
|
||||
tenant_shard_id: &TenantShardId,
|
||||
tenant_id: &TenantId,
|
||||
tenants: &std::sync::RwLock<TenantsMap>,
|
||||
mode: TenantSlotAcquireMode,
|
||||
) -> Result<SlotGuard, TenantSlotError> {
|
||||
@@ -1789,7 +1683,7 @@ fn tenant_map_acquire_slot_impl(
|
||||
METRICS.tenant_slot_writes.inc();
|
||||
|
||||
let mut locked = tenants.write().unwrap();
|
||||
let span = tracing::info_span!("acquire_slot", tenant_id=%tenant_shard_id.tenant_id, shard=tenant_shard_id.shard_slug());
|
||||
let span = tracing::info_span!("acquire_slot", %tenant_id);
|
||||
let _guard = span.enter();
|
||||
|
||||
let m = match &mut *locked {
|
||||
@@ -1798,21 +1692,19 @@ fn tenant_map_acquire_slot_impl(
|
||||
TenantsMap::Open(m) => m,
|
||||
};
|
||||
|
||||
use std::collections::btree_map::Entry;
|
||||
|
||||
let entry = m.entry(*tenant_shard_id);
|
||||
|
||||
use std::collections::hash_map::Entry;
|
||||
let entry = m.entry(*tenant_id);
|
||||
match entry {
|
||||
Entry::Vacant(v) => match mode {
|
||||
MustExist => {
|
||||
tracing::debug!("Vacant && MustExist: return NotFound");
|
||||
Err(TenantSlotError::NotFound(*tenant_shard_id))
|
||||
Err(TenantSlotError::NotFound(*tenant_id))
|
||||
}
|
||||
_ => {
|
||||
let (completion, barrier) = utils::completion::channel();
|
||||
v.insert(TenantSlot::InProgress(barrier));
|
||||
tracing::debug!("Vacant, inserted InProgress");
|
||||
Ok(SlotGuard::new(*tenant_shard_id, None, completion))
|
||||
Ok(SlotGuard::new(*tenant_id, None, completion))
|
||||
}
|
||||
},
|
||||
Entry::Occupied(mut o) => {
|
||||
@@ -1826,7 +1718,7 @@ fn tenant_map_acquire_slot_impl(
|
||||
TenantSlot::Attached(tenant) => {
|
||||
tracing::debug!("Attached && MustNotExist, return AlreadyExists");
|
||||
Err(TenantSlotError::AlreadyExists(
|
||||
*tenant_shard_id,
|
||||
*tenant_id,
|
||||
tenant.current_state(),
|
||||
))
|
||||
}
|
||||
@@ -1835,7 +1727,7 @@ fn tenant_map_acquire_slot_impl(
|
||||
// to get the state from
|
||||
tracing::debug!("Occupied & MustNotExist, return AlreadyExists");
|
||||
Err(TenantSlotError::AlreadyExists(
|
||||
*tenant_shard_id,
|
||||
*tenant_id,
|
||||
TenantState::Broken {
|
||||
reason: "Present but not attached".to_string(),
|
||||
backtrace: "".to_string(),
|
||||
@@ -1848,11 +1740,7 @@ fn tenant_map_acquire_slot_impl(
|
||||
let (completion, barrier) = utils::completion::channel();
|
||||
let old_value = o.insert(TenantSlot::InProgress(barrier));
|
||||
tracing::debug!("Occupied, replaced with InProgress");
|
||||
Ok(SlotGuard::new(
|
||||
*tenant_shard_id,
|
||||
Some(old_value),
|
||||
completion,
|
||||
))
|
||||
Ok(SlotGuard::new(*tenant_id, Some(old_value), completion))
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -1865,7 +1753,7 @@ fn tenant_map_acquire_slot_impl(
|
||||
/// operation would be needed to remove it.
|
||||
async fn remove_tenant_from_memory<V, F>(
|
||||
tenants: &std::sync::RwLock<TenantsMap>,
|
||||
tenant_shard_id: TenantShardId,
|
||||
tenant_id: TenantId,
|
||||
tenant_cleanup: F,
|
||||
) -> Result<V, TenantStateError>
|
||||
where
|
||||
@@ -1874,7 +1762,7 @@ where
|
||||
use utils::completion;
|
||||
|
||||
let mut slot_guard =
|
||||
tenant_map_acquire_slot_impl(&tenant_shard_id, tenants, TenantSlotAcquireMode::MustExist)?;
|
||||
tenant_map_acquire_slot_impl(&tenant_id, tenants, TenantSlotAcquireMode::MustExist)?;
|
||||
|
||||
// The SlotGuard allows us to manipulate the Tenant object without fear of some
|
||||
// concurrent API request doing something else for the same tenant ID.
|
||||
@@ -1901,7 +1789,7 @@ where
|
||||
// if pageserver shutdown or other detach/ignore is already ongoing, we don't want to
|
||||
// wait for it but return an error right away because these are distinct requests.
|
||||
slot_guard.revert();
|
||||
return Err(TenantStateError::IsStopping(tenant_shard_id.tenant_id));
|
||||
return Err(TenantStateError::IsStopping(tenant_id));
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -1912,7 +1800,7 @@ where
|
||||
|
||||
match tenant_cleanup
|
||||
.await
|
||||
.with_context(|| format!("Failed to run cleanup for tenant {tenant_shard_id}"))
|
||||
.with_context(|| format!("Failed to run cleanup for tenant {tenant_id}"))
|
||||
{
|
||||
Ok(hook_value) => {
|
||||
// Success: drop the old TenantSlot::Attached.
|
||||
@@ -1991,8 +1879,7 @@ pub(crate) async fn immediate_gc(
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use pageserver_api::shard::TenantShardId;
|
||||
use std::collections::BTreeMap;
|
||||
use std::collections::HashMap;
|
||||
use std::sync::Arc;
|
||||
use tracing::{info_span, Instrument};
|
||||
|
||||
@@ -2000,7 +1887,7 @@ mod tests {
|
||||
|
||||
use super::{super::harness::TenantHarness, TenantsMap};
|
||||
|
||||
#[tokio::test(start_paused = true)]
|
||||
#[tokio::test]
|
||||
async fn shutdown_awaits_in_progress_tenant() {
|
||||
// Test that if an InProgress tenant is in the map during shutdown, the shutdown will gracefully
|
||||
// wait for it to complete before proceeding.
|
||||
@@ -2012,12 +1899,12 @@ mod tests {
|
||||
|
||||
// harness loads it to active, which is forced and nothing is running on the tenant
|
||||
|
||||
let id = TenantShardId::unsharded(t.tenant_id());
|
||||
let id = t.tenant_id();
|
||||
|
||||
// tenant harness configures the logging and we cannot escape it
|
||||
let _e = info_span!("testing", tenant_id = %id).entered();
|
||||
|
||||
let tenants = BTreeMap::from([(id, TenantSlot::Attached(t.clone()))]);
|
||||
let tenants = HashMap::from([(id, TenantSlot::Attached(t.clone()))]);
|
||||
let tenants = Arc::new(std::sync::RwLock::new(TenantsMap::Open(tenants)));
|
||||
|
||||
// Invoke remove_tenant_from_memory with a cleanup hook that blocks until we manually
|
||||
|
||||
@@ -188,6 +188,7 @@ use anyhow::Context;
|
||||
use camino::Utf8Path;
|
||||
use chrono::{NaiveDateTime, Utc};
|
||||
|
||||
use pageserver_api::shard::ShardIdentity;
|
||||
use scopeguard::ScopeGuard;
|
||||
use tokio_util::sync::CancellationToken;
|
||||
use utils::backoff::{
|
||||
@@ -298,6 +299,7 @@ pub struct RemoteTimelineClient {
|
||||
runtime: tokio::runtime::Handle,
|
||||
|
||||
tenant_id: TenantId,
|
||||
shard: ShardIdentity,
|
||||
timeline_id: TimelineId,
|
||||
generation: Generation,
|
||||
|
||||
@@ -322,9 +324,12 @@ impl RemoteTimelineClient {
|
||||
deletion_queue_client: DeletionQueueClient,
|
||||
conf: &'static PageServerConf,
|
||||
tenant_id: TenantId,
|
||||
shard: ShardIdentity,
|
||||
timeline_id: TimelineId,
|
||||
generation: Generation,
|
||||
) -> RemoteTimelineClient {
|
||||
tracing::info!("RemoteTimelineClient::new shard={}", shard.slug());
|
||||
|
||||
RemoteTimelineClient {
|
||||
conf,
|
||||
runtime: if cfg!(test) {
|
||||
@@ -334,6 +339,7 @@ impl RemoteTimelineClient {
|
||||
BACKGROUND_RUNTIME.handle().clone()
|
||||
},
|
||||
tenant_id,
|
||||
shard,
|
||||
timeline_id,
|
||||
generation,
|
||||
storage_impl: remote_storage,
|
||||
@@ -461,6 +467,7 @@ impl RemoteTimelineClient {
|
||||
let index_part = download::download_index_part(
|
||||
&self.storage_impl,
|
||||
&self.tenant_id,
|
||||
&self.shard,
|
||||
&self.timeline_id,
|
||||
self.generation,
|
||||
cancel,
|
||||
@@ -503,6 +510,7 @@ impl RemoteTimelineClient {
|
||||
self.conf,
|
||||
&self.storage_impl,
|
||||
self.tenant_id,
|
||||
&self.shard,
|
||||
self.timeline_id,
|
||||
layer_file_name,
|
||||
layer_metadata,
|
||||
@@ -893,6 +901,7 @@ impl RemoteTimelineClient {
|
||||
upload::upload_index_part(
|
||||
&self.storage_impl,
|
||||
&self.tenant_id,
|
||||
&self.shard,
|
||||
&self.timeline_id,
|
||||
self.generation,
|
||||
&index_part_with_deleted_at,
|
||||
@@ -951,6 +960,7 @@ impl RemoteTimelineClient {
|
||||
.map(|(file_name, meta)| {
|
||||
remote_layer_path(
|
||||
&self.tenant_id,
|
||||
&self.shard,
|
||||
&self.timeline_id,
|
||||
&file_name,
|
||||
meta.generation,
|
||||
@@ -964,7 +974,8 @@ impl RemoteTimelineClient {
|
||||
|
||||
// Do not delete the index part yet, it is needed for a possible retry. If we remove it first
// and the retry arrives at a different pageserver, there won't be any trace of it in remote storage.
|
||||
let timeline_storage_path = remote_timeline_path(&self.tenant_id, &self.timeline_id);
|
||||
let timeline_storage_path =
|
||||
remote_timeline_path(&self.tenant_id, &self.shard, &self.timeline_id);
|
||||
|
||||
// Execute all pending deletions, so that when we proceed to do a list_prefixes below, we aren't
|
||||
// taking the burden of listing all the layers that we already know we should delete.
|
||||
@@ -1000,7 +1011,12 @@ impl RemoteTimelineClient {
|
||||
.unwrap_or(
|
||||
// No generation-suffixed indices, assume we are dealing with
|
||||
// a legacy index.
|
||||
remote_index_path(&self.tenant_id, &self.timeline_id, Generation::none()),
|
||||
remote_index_path(
|
||||
&self.tenant_id,
|
||||
&self.shard,
|
||||
&self.timeline_id,
|
||||
Generation::none(),
|
||||
),
|
||||
);
|
||||
|
||||
let remaining_layers: Vec<RemotePath> = remaining
|
||||
@@ -1178,13 +1194,20 @@ impl RemoteTimelineClient {
|
||||
|
||||
let upload_result: anyhow::Result<()> = match &task.op {
|
||||
UploadOp::UploadLayer(ref layer, ref layer_metadata) => {
|
||||
let path = layer.local_path();
|
||||
upload::upload_timeline_layer(
|
||||
self.conf,
|
||||
&self.storage_impl,
|
||||
path,
|
||||
layer_metadata,
|
||||
let remote_path = remote_layer_path(
|
||||
&self.tenant_id,
|
||||
&self.shard,
|
||||
&self.timeline_id,
|
||||
&layer.layer_desc().filename(),
|
||||
self.generation,
|
||||
);
|
||||
|
||||
let local_path = layer.local_path();
|
||||
upload::upload_timeline_layer(
|
||||
&self.storage_impl,
|
||||
local_path,
|
||||
remote_path,
|
||||
layer_metadata,
|
||||
)
|
||||
.measure_remote_op(
|
||||
self.tenant_id,
|
||||
@@ -1208,6 +1231,7 @@ impl RemoteTimelineClient {
|
||||
let res = upload::upload_index_part(
|
||||
&self.storage_impl,
|
||||
&self.tenant_id,
|
||||
&self.shard,
|
||||
&self.timeline_id,
|
||||
self.generation,
|
||||
index_part,
|
||||
@@ -1233,6 +1257,7 @@ impl RemoteTimelineClient {
|
||||
.deletion_queue_client
|
||||
.push_layers(
|
||||
self.tenant_id,
|
||||
&self.shard,
|
||||
self.timeline_id,
|
||||
self.generation,
|
||||
delete.layers.clone(),
|
||||
@@ -1503,24 +1528,33 @@ impl RemoteTimelineClient {
|
||||
}
|
||||
}
|
||||
|
||||
pub fn remote_timelines_path(tenant_id: &TenantId) -> RemotePath {
|
||||
let path = format!("tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}");
|
||||
pub fn remote_timelines_path(tenant_id: &TenantId, shard: &ShardIdentity) -> RemotePath {
|
||||
let path = format!(
|
||||
"tenants/{tenant_id}{}/{TIMELINES_SEGMENT_NAME}",
|
||||
shard.slug()
|
||||
);
|
||||
RemotePath::from_string(&path).expect("Failed to construct path")
|
||||
}
|
||||
|
||||
pub fn remote_timeline_path(tenant_id: &TenantId, timeline_id: &TimelineId) -> RemotePath {
|
||||
remote_timelines_path(tenant_id).join(Utf8Path::new(&timeline_id.to_string()))
|
||||
pub fn remote_timeline_path(
|
||||
tenant_id: &TenantId,
|
||||
shard: &ShardIdentity,
|
||||
timeline_id: &TimelineId,
|
||||
) -> RemotePath {
|
||||
remote_timelines_path(tenant_id, shard).join(Utf8Path::new(&timeline_id.to_string()))
|
||||
}
|
||||
|
||||
pub fn remote_layer_path(
|
||||
tenant_id: &TenantId,
|
||||
shard: &ShardIdentity,
|
||||
timeline_id: &TimelineId,
|
||||
layer_file_name: &LayerFileName,
|
||||
generation: Generation,
|
||||
) -> RemotePath {
|
||||
// Generation-aware key format
|
||||
let path = format!(
|
||||
"tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}/{timeline_id}/{0}{1}",
|
||||
"tenants/{tenant_id}{0}/{TIMELINES_SEGMENT_NAME}/{timeline_id}/{1}{2}",
|
||||
shard.slug(),
|
||||
layer_file_name.file_name(),
|
||||
generation.get_suffix()
|
||||
);
|
||||
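For orientation, the hunks above splice a shard suffix into the tenant segment of the remote key and a generation suffix onto the object name. The sketch below is a minimal, self-contained illustration of that layout only; the slug and suffix strings are hypothetical placeholders for whatever `ShardIdentity::slug()` and `Generation::get_suffix()` actually produce, and "timelines" stands in for `TIMELINES_SEGMENT_NAME`.

/// Minimal sketch of the sharded remote key layout, assuming plain strings
/// in place of the real typed tenant/timeline/shard/generation values.
fn sketch_remote_layer_path(
    tenant_id: &str,
    shard_slug: &str,        // hypothetical, e.g. "" for an unsharded tenant
    timeline_id: &str,
    layer_file_name: &str,
    generation_suffix: &str, // hypothetical, e.g. "" for Generation::none()
) -> String {
    // The shard slug follows the tenant id directly, and the generation
    // suffix follows the layer file name, mirroring the format strings above.
    format!("tenants/{tenant_id}{shard_slug}/timelines/{timeline_id}/{layer_file_name}{generation_suffix}")
}

fn main() {
    // All values here are made up for illustration.
    let path = sketch_remote_layer_path(
        "exampletenant",
        "-0102",
        "exampletimeline",
        "example-layer-file",
        "-00000001",
    );
    println!("{path}");
}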
@@ -1530,11 +1564,13 @@ pub fn remote_layer_path(
|
||||
|
||||
pub fn remote_index_path(
|
||||
tenant_id: &TenantId,
|
||||
shard: &ShardIdentity,
|
||||
timeline_id: &TimelineId,
|
||||
generation: Generation,
|
||||
) -> RemotePath {
|
||||
RemotePath::from_string(&format!(
|
||||
"tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}/{timeline_id}/{0}{1}",
|
||||
"tenants/{tenant_id}{0}/{TIMELINES_SEGMENT_NAME}/{timeline_id}/{1}{2}",
|
||||
shard.slug(),
|
||||
IndexPart::FILE_NAME,
|
||||
generation.get_suffix()
|
||||
))
|
||||
@@ -1558,29 +1594,6 @@ pub fn parse_remote_index_path(path: RemotePath) -> Option<Generation> {
|
||||
}
|
||||
}
|
||||
|
||||
/// Files on the remote storage are stored with paths, relative to the workdir.
|
||||
/// That path includes in itself both tenant and timeline ids, allowing to have a unique remote storage path.
|
||||
///
|
||||
/// Errors if the path provided does not start from pageserver's workdir.
|
||||
pub fn remote_path(
|
||||
conf: &PageServerConf,
|
||||
local_path: &Utf8Path,
|
||||
generation: Generation,
|
||||
) -> anyhow::Result<RemotePath> {
|
||||
let stripped = local_path
|
||||
.strip_prefix(&conf.workdir)
|
||||
.context("Failed to strip workdir prefix")?;
|
||||
|
||||
let suffixed = format!("{0}{1}", stripped, generation.get_suffix());
|
||||
|
||||
RemotePath::new(Utf8Path::new(&suffixed)).with_context(|| {
|
||||
format!(
|
||||
"to resolve remote part of path {:?} for base {:?}",
|
||||
local_path, conf.workdir
|
||||
)
|
||||
})
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
@@ -1677,6 +1690,7 @@ mod tests {
|
||||
conf: self.harness.conf,
|
||||
runtime: tokio::runtime::Handle::current(),
|
||||
tenant_id: self.harness.tenant_id,
|
||||
shard: ShardIdentity::none(),
|
||||
timeline_id: TIMELINE_ID,
|
||||
generation,
|
||||
storage_impl: self.harness.remote_storage.clone(),
|
||||
@@ -2010,7 +2024,13 @@ mod tests {
|
||||
std::fs::create_dir_all(remote_timeline_dir).expect("creating test dir should work");
|
||||
|
||||
let index_path = test_state.harness.remote_fs_dir.join(
|
||||
remote_index_path(&test_state.harness.tenant_id, &TIMELINE_ID, generation).get_path(),
|
||||
remote_index_path(
|
||||
&test_state.harness.tenant_id,
|
||||
&ShardIdentity::none(),
|
||||
&TIMELINE_ID,
|
||||
generation,
|
||||
)
|
||||
.get_path(),
|
||||
);
|
||||
eprintln!("Writing {index_path}");
|
||||
std::fs::write(&index_path, index_part_bytes).unwrap();
|
||||
|
||||
@@ -9,6 +9,7 @@ use std::time::Duration;
|
||||
|
||||
use anyhow::{anyhow, Context};
|
||||
use camino::Utf8Path;
|
||||
use pageserver_api::shard::ShardIdentity;
|
||||
use tokio::fs;
|
||||
use tokio::io::AsyncWriteExt;
|
||||
use tokio_util::sync::CancellationToken;
|
||||
@@ -40,6 +41,7 @@ pub async fn download_layer_file<'a>(
|
||||
conf: &'static PageServerConf,
|
||||
storage: &'a GenericRemoteStorage,
|
||||
tenant_id: TenantId,
|
||||
shard: &ShardIdentity,
|
||||
timeline_id: TimelineId,
|
||||
layer_file_name: &'a LayerFileName,
|
||||
layer_metadata: &'a LayerFileMetadata,
|
||||
@@ -52,6 +54,7 @@ pub async fn download_layer_file<'a>(
|
||||
|
||||
let remote_path = remote_layer_path(
|
||||
&tenant_id,
|
||||
shard,
|
||||
&timeline_id,
|
||||
layer_file_name,
|
||||
layer_metadata.generation,
|
||||
@@ -170,9 +173,10 @@ pub fn is_temp_download_file(path: &Utf8Path) -> bool {
|
||||
pub async fn list_remote_timelines(
|
||||
storage: &GenericRemoteStorage,
|
||||
tenant_id: TenantId,
|
||||
shard: &ShardIdentity,
|
||||
cancel: CancellationToken,
|
||||
) -> anyhow::Result<(HashSet<TimelineId>, HashSet<String>)> {
|
||||
let remote_path = remote_timelines_path(&tenant_id);
|
||||
let remote_path = remote_timelines_path(&tenant_id, shard);
|
||||
|
||||
fail::fail_point!("storage-sync-list-remote-timelines", |_| {
|
||||
anyhow::bail!("storage-sync-list-remote-timelines");
|
||||
@@ -212,11 +216,12 @@ pub async fn list_remote_timelines(
|
||||
async fn do_download_index_part(
|
||||
storage: &GenericRemoteStorage,
|
||||
tenant_id: &TenantId,
|
||||
shard: &ShardIdentity,
|
||||
timeline_id: &TimelineId,
|
||||
index_generation: Generation,
|
||||
cancel: CancellationToken,
|
||||
) -> Result<IndexPart, DownloadError> {
|
||||
let remote_path = remote_index_path(tenant_id, timeline_id, index_generation);
|
||||
let remote_path = remote_index_path(tenant_id, shard, timeline_id, index_generation);
|
||||
|
||||
let index_part_bytes = download_retry_forever(
|
||||
|| async {
|
||||
@@ -253,6 +258,7 @@ async fn do_download_index_part(
|
||||
pub(super) async fn download_index_part(
|
||||
storage: &GenericRemoteStorage,
|
||||
tenant_id: &TenantId,
|
||||
shard: &ShardIdentity,
|
||||
timeline_id: &TimelineId,
|
||||
my_generation: Generation,
|
||||
cancel: CancellationToken,
|
||||
@@ -261,8 +267,15 @@ pub(super) async fn download_index_part(
|
||||
|
||||
if my_generation.is_none() {
|
||||
// Operating without generations: just fetch the generation-less path
|
||||
return do_download_index_part(storage, tenant_id, timeline_id, my_generation, cancel)
|
||||
.await;
|
||||
return do_download_index_part(
|
||||
storage,
|
||||
tenant_id,
|
||||
shard,
|
||||
timeline_id,
|
||||
my_generation,
|
||||
cancel,
|
||||
)
|
||||
.await;
|
||||
}
|
||||
|
||||
// Stale case: If we were intentionally attached in a stale generation, there may already be a remote
|
||||
@@ -272,6 +285,7 @@ pub(super) async fn download_index_part(
|
||||
let res = do_download_index_part(
|
||||
storage,
|
||||
tenant_id,
|
||||
shard,
|
||||
timeline_id,
|
||||
my_generation,
|
||||
cancel.clone(),
|
||||
@@ -299,6 +313,7 @@ pub(super) async fn download_index_part(
|
||||
let res = do_download_index_part(
|
||||
storage,
|
||||
tenant_id,
|
||||
shard,
|
||||
timeline_id,
|
||||
my_generation.previous(),
|
||||
cancel.clone(),
|
||||
@@ -321,7 +336,7 @@ pub(super) async fn download_index_part(
|
||||
|
||||
// General case/fallback: if there is no index at my_generation or prev_generation, then list all index_part.json
// objects, and select the highest one with a generation <= my_generation.
let index_prefix = remote_index_path(tenant_id, timeline_id, Generation::none());
let index_prefix = remote_index_path(tenant_id, shard, timeline_id, Generation::none());
let indices = backoff::retry(
|| async { storage.list_files(Some(&index_prefix)).await },
|_| false,
@@ -347,14 +362,21 @@ pub(super) async fn download_index_part(
match max_previous_generation {
Some(g) => {
tracing::debug!("Found index_part in generation {g:?}");
do_download_index_part(storage, tenant_id, timeline_id, g, cancel).await
do_download_index_part(storage, tenant_id, shard, timeline_id, g, cancel).await
}
None => {
// Migration from legacy pre-generation state: we have a generation but no prior
// attached pageservers did. Try to load from a no-generation path.
tracing::info!("No index_part.json* found");
do_download_index_part(storage, tenant_id, timeline_id, Generation::none(), cancel)
.await
do_download_index_part(
storage,
tenant_id,
shard,
timeline_id,
Generation::none(),
cancel,
)
.await
}
}
}

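The download path above tries progressively older index locations: the exact attached generation, then the previous generation, then a listing of all generation-suffixed indices picking the highest one not newer than ours, and finally the legacy generation-less path. A rough sketch of that selection order, using `u32` and `Option<u32>` as stand-ins for the real `Generation` type (this illustrates only the ordering, not the actual retry or listing code), might look like this:

/// Sketch of the index_part selection order described in the comments above.
/// `None` stands in for a legacy, generation-less index; `Some(n)` for a
/// generation-suffixed one.
fn pick_index_generation(my_generation: u32, available: &[Option<u32>]) -> Option<Option<u32>> {
    // 1. Fast path: an index written by our own generation.
    if available.contains(&Some(my_generation)) {
        return Some(Some(my_generation));
    }
    // 2. Common case after an attach handoff: the previous generation.
    if my_generation > 0 && available.contains(&Some(my_generation - 1)) {
        return Some(Some(my_generation - 1));
    }
    // 3. Fallback: the highest listed generation that is not newer than ours.
    if let Some(g) = available
        .iter()
        .filter_map(|g| *g)
        .filter(|g| *g <= my_generation)
        .max()
    {
        return Some(Some(g));
    }
    // 4. Legacy migration: a pre-generation index with no suffix at all.
    if available.contains(&None) {
        return Some(None);
    }
    None
}

fn main() {
    // With generation 5 attached and only older indices present,
    // the fallback picks generation 3.
    let available = [Some(1), Some(3), None];
    assert_eq!(pick_index_generation(5, &available), Some(Some(3)));
    println!("fallback selection works as described");
}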
@@ -3,15 +3,13 @@
|
||||
use anyhow::{bail, Context};
|
||||
use camino::Utf8Path;
|
||||
use fail::fail_point;
|
||||
use pageserver_api::shard::ShardIdentity;
|
||||
use std::io::ErrorKind;
|
||||
use tokio::fs;
|
||||
|
||||
use super::Generation;
|
||||
use crate::{
|
||||
config::PageServerConf,
|
||||
tenant::remote_timeline_client::{index::IndexPart, remote_index_path, remote_path},
|
||||
};
|
||||
use remote_storage::GenericRemoteStorage;
|
||||
use crate::tenant::remote_timeline_client::{index::IndexPart, remote_index_path};
|
||||
use remote_storage::{GenericRemoteStorage, RemotePath};
|
||||
use utils::id::{TenantId, TimelineId};
|
||||
|
||||
use super::index::LayerFileMetadata;
|
||||
@@ -22,6 +20,7 @@ use tracing::info;
|
||||
pub(super) async fn upload_index_part<'a>(
|
||||
storage: &'a GenericRemoteStorage,
|
||||
tenant_id: &TenantId,
|
||||
shard: &ShardIdentity,
|
||||
timeline_id: &TimelineId,
|
||||
generation: Generation,
|
||||
index_part: &'a IndexPart,
|
||||
@@ -38,7 +37,7 @@ pub(super) async fn upload_index_part<'a>(
|
||||
let index_part_size = index_part_bytes.len();
|
||||
let index_part_bytes = tokio::io::BufReader::new(std::io::Cursor::new(index_part_bytes));
|
||||
|
||||
let remote_path = remote_index_path(tenant_id, timeline_id, generation);
|
||||
let remote_path = remote_index_path(tenant_id, shard, timeline_id, generation);
|
||||
storage
|
||||
.upload_storage_object(Box::new(index_part_bytes), index_part_size, &remote_path)
|
||||
.await
|
||||
@@ -50,11 +49,10 @@ pub(super) async fn upload_index_part<'a>(
|
||||
///
|
||||
/// On an error, bumps the retries count and reschedules the entire task.
|
||||
pub(super) async fn upload_timeline_layer<'a>(
|
||||
conf: &'static PageServerConf,
|
||||
storage: &'a GenericRemoteStorage,
|
||||
source_path: &'a Utf8Path,
|
||||
source_path: &Utf8Path,
|
||||
remote_path: RemotePath,
|
||||
known_metadata: &'a LayerFileMetadata,
|
||||
generation: Generation,
|
||||
) -> anyhow::Result<()> {
|
||||
fail_point!("before-upload-layer", |_| {
|
||||
bail!("failpoint before-upload-layer")
|
||||
@@ -62,7 +60,6 @@ pub(super) async fn upload_timeline_layer<'a>(
|
||||
|
||||
pausable_failpoint!("before-upload-layer-pausable");
|
||||
|
||||
let storage_path = remote_path(conf, source_path, generation)?;
|
||||
let source_file_res = fs::File::open(&source_path).await;
|
||||
let source_file = match source_file_res {
|
||||
Ok(source_file) => source_file,
|
||||
@@ -97,7 +94,7 @@ pub(super) async fn upload_timeline_layer<'a>(
|
||||
.with_context(|| format!("convert {source_path:?} size {fs_size} usize"))?;
|
||||
|
||||
storage
|
||||
.upload(source_file, fs_size, &storage_path, None)
|
||||
.upload(source_file, fs_size, &remote_path, None)
|
||||
.await
|
||||
.with_context(|| format!("upload layer from local path '{source_path}'"))?;
|
||||
|
||||
|
||||
@@ -251,7 +251,6 @@ impl Layer {
|
||||
|
||||
layer
|
||||
.get_value_reconstruct_data(key, lsn_range, reconstruct_data, &self.0, ctx)
|
||||
.instrument(tracing::info_span!("get_value_reconstruct_data", layer=%self))
|
||||
.await
|
||||
}
|
||||
|
||||
@@ -1212,10 +1211,8 @@ impl DownloadedLayer {
|
||||
// this will be a permanent failure
|
||||
.context("load layer");
|
||||
|
||||
if let Err(e) = res.as_ref() {
|
||||
if res.is_err() {
|
||||
LAYER_IMPL_METRICS.inc_permanent_loading_failures();
|
||||
// TODO(#5815): we are not logging all errors, so temporarily log them here as well
|
||||
tracing::error!("layer loading failed permanently: {e:#}");
|
||||
}
|
||||
res
|
||||
};
|
||||
@@ -1294,7 +1291,6 @@ impl ResidentLayer {
|
||||
}
|
||||
|
||||
/// Loads all keys stored in the layer. Returns key, lsn and value size.
|
||||
#[tracing::instrument(skip_all, fields(layer=%self))]
|
||||
pub(crate) async fn load_keys<'a>(
|
||||
&'a self,
|
||||
ctx: &RequestContext,
|
||||
|
||||
@@ -5,7 +5,7 @@ use utils::{
|
||||
lsn::Lsn,
|
||||
};
|
||||
|
||||
use crate::{pgdatadir_mapping::METADATA_CUT, repository::Key};
|
||||
use crate::repository::Key;
|
||||
|
||||
use super::{DeltaFileName, ImageFileName, LayerFileName};
|
||||
|
||||
@@ -49,20 +49,6 @@ impl PersistentLayerDesc {
|
||||
}
|
||||
}
|
||||
|
||||
/// Does this layer consist exclusively of metadata
|
||||
/// content such as dbdir & relation sizes? This is a
|
||||
/// hint that the layer is likely to be small and should
|
||||
/// not be a candidate for eviction under normal circumstances.
|
||||
pub fn is_metadata_pages(&self) -> bool {
|
||||
self.key_range.start >= METADATA_CUT
|
||||
}
|
||||
|
||||
/// Does this layer consist exclusively of content
|
||||
/// required to serve a basebackup request?
|
||||
pub fn is_basebackup_pages(&self) -> bool {
|
||||
self.key_range.start >= METADATA_CUT
|
||||
}
|
||||
|
||||
pub fn short_id(&self) -> impl Display {
|
||||
self.filename()
|
||||
}
|
||||
|
||||
@@ -12,8 +12,12 @@ use bytes::Bytes;
|
||||
use camino::{Utf8Path, Utf8PathBuf};
|
||||
use fail::fail_point;
|
||||
use itertools::Itertools;
|
||||
use pageserver_api::models::{
|
||||
DownloadRemoteLayersTaskInfo, DownloadRemoteLayersTaskSpawnRequest, LayerMapInfo, TimelineState,
|
||||
use pageserver_api::{
|
||||
models::{
|
||||
DownloadRemoteLayersTaskInfo, DownloadRemoteLayersTaskSpawnRequest, LayerMapInfo,
|
||||
TimelineState,
|
||||
},
|
||||
shard::ShardIdentity,
|
||||
};
|
||||
use serde_with::serde_as;
|
||||
use storage_broker::BrokerClientChannel;
|
||||
@@ -81,6 +85,8 @@ use crate::task_mgr::TaskKind;
|
||||
use crate::ZERO_PAGE;
|
||||
|
||||
use self::delete::DeleteTimelineFlow;
|
||||
pub(super) use self::eviction_task::EvictionTaskTenantState;
|
||||
use self::eviction_task::EvictionTaskTimelineState;
|
||||
use self::layer_manager::LayerManager;
|
||||
use self::logical_size::LogicalSize;
|
||||
use self::walreceiver::{WalReceiver, WalReceiverConf};
|
||||
@@ -296,6 +302,8 @@ pub struct Timeline {
|
||||
/// timeline is being deleted. If 'true', the timeline has already been deleted.
|
||||
pub delete_progress: Arc<tokio::sync::Mutex<DeleteTimelineFlow>>,
|
||||
|
||||
eviction_task_timeline_state: tokio::sync::Mutex<EvictionTaskTimelineState>,
|
||||
|
||||
/// Barrier to wait before doing initial logical size calculation. Used only during startup.
|
||||
initial_logical_size_can_start: Option<completion::Barrier>,
|
||||
|
||||
@@ -429,6 +437,7 @@ impl std::fmt::Display for PageReconstructError {
|
||||
pub enum LogicalSizeCalculationCause {
|
||||
Initial,
|
||||
ConsumptionMetricsSyntheticSize,
|
||||
EvictionTaskImitation,
|
||||
TenantSizeHandler,
|
||||
}
|
||||
|
||||
@@ -1305,6 +1314,11 @@ impl Timeline {
|
||||
.unwrap_or(self.conf.default_tenant_conf.gc_feedback)
|
||||
}
|
||||
|
||||
pub(crate) fn get_shard(&self) -> ShardIdentity {
|
||||
let tenant_conf = &self.tenant_conf.read().unwrap();
|
||||
tenant_conf.shard.clone()
|
||||
}
|
||||
|
||||
pub(super) fn tenant_conf_updated(&self) {
|
||||
// NB: Most tenant conf options are read by background loops, so,
|
||||
// changes will automatically be picked up.
|
||||
@@ -1437,6 +1451,9 @@ impl Timeline {
|
||||
|
||||
state,
|
||||
|
||||
eviction_task_timeline_state: tokio::sync::Mutex::new(
|
||||
EvictionTaskTimelineState::default(),
|
||||
),
|
||||
delete_progress: Arc::new(tokio::sync::Mutex::new(DeleteTimelineFlow::default())),
|
||||
|
||||
initial_logical_size_can_start,
|
||||
@@ -1959,6 +1976,9 @@ impl Timeline {
|
||||
LogicalSizeCalculationCause::Initial
|
||||
| LogicalSizeCalculationCause::ConsumptionMetricsSyntheticSize
|
||||
| LogicalSizeCalculationCause::TenantSizeHandler => &self.metrics.logical_size_histo,
|
||||
LogicalSizeCalculationCause::EvictionTaskImitation => {
|
||||
&self.metrics.imitate_logical_size_histo
|
||||
}
|
||||
};
|
||||
let timer = storage_time_metrics.start_timer();
|
||||
let logical_size = self
|
||||
@@ -2735,18 +2755,18 @@ impl Timeline {
|
||||
partition_size: u64,
|
||||
ctx: &RequestContext,
|
||||
) -> anyhow::Result<(KeyPartitioning, Lsn)> {
|
||||
// {
|
||||
// let partitioning_guard = self.partitioning.lock().unwrap();
|
||||
// let distance = lsn.0 - partitioning_guard.1 .0;
|
||||
// if partitioning_guard.1 != Lsn(0) && distance <= self.repartition_threshold {
|
||||
// debug!(
|
||||
// distance,
|
||||
// threshold = self.repartition_threshold,
|
||||
// "no repartitioning needed"
|
||||
// );
|
||||
// return Ok((partitioning_guard.0.clone(), partitioning_guard.1));
|
||||
// }
|
||||
// }
|
||||
{
|
||||
let partitioning_guard = self.partitioning.lock().unwrap();
|
||||
let distance = lsn.0 - partitioning_guard.1 .0;
|
||||
if partitioning_guard.1 != Lsn(0) && distance <= self.repartition_threshold {
|
||||
debug!(
|
||||
distance,
|
||||
threshold = self.repartition_threshold,
|
||||
"no repartitioning needed"
|
||||
);
|
||||
return Ok((partitioning_guard.0.clone(), partitioning_guard.1));
|
||||
}
|
||||
}
|
||||
let keyspace = self.collect_keyspace(lsn, ctx).await?;
|
||||
let partitioning = keyspace.partition(partition_size);
|
||||
|
||||
@@ -4274,11 +4294,6 @@ impl Timeline {
|
||||
let file_size = l.file_size();
|
||||
max_layer_size = max_layer_size.map_or(Some(file_size), |m| Some(m.max(file_size)));
|
||||
|
||||
// Don't evict small layers required to serve a basebackup
|
||||
if l.is_basebackup_pages() {
|
||||
continue;
|
||||
}
|
||||
|
||||
let l = guard.get_from_desc(&l);
|
||||
|
||||
let l = match l.keep_resident().await {
|
||||
|
||||
@@ -14,6 +14,7 @@
|
||||
//!
|
||||
//! See write-up on restart on-demand download spike: <https://gist.github.com/problame/2265bf7b8dc398be834abfead36c76b5>
|
||||
use std::{
|
||||
collections::HashMap,
|
||||
ops::ControlFlow,
|
||||
sync::Arc,
|
||||
time::{Duration, SystemTime},
|
||||
@@ -21,7 +22,7 @@ use std::{
|
||||
|
||||
use tokio::time::Instant;
|
||||
use tokio_util::sync::CancellationToken;
|
||||
use tracing::{debug, error, info, instrument, warn};
|
||||
use tracing::{debug, error, info, info_span, instrument, warn, Instrument};
|
||||
|
||||
use crate::{
|
||||
context::{DownloadBehavior, RequestContext},
|
||||
@@ -30,6 +31,7 @@ use crate::{
|
||||
config::{EvictionPolicy, EvictionPolicyLayerAccessThreshold},
|
||||
tasks::{BackgroundLoopKind, RateLimitError},
|
||||
timeline::EvictionError,
|
||||
LogicalSizeCalculationCause, Tenant,
|
||||
},
|
||||
};
|
||||
|
||||
@@ -37,6 +39,16 @@ use utils::completion;
|
||||
|
||||
use super::Timeline;
|
||||
|
||||
#[derive(Default)]
|
||||
pub struct EvictionTaskTimelineState {
|
||||
last_layer_access_imitation: Option<tokio::time::Instant>,
|
||||
}
|
||||
|
||||
#[derive(Default)]
|
||||
pub struct EvictionTaskTenantState {
|
||||
last_layer_access_imitation: Option<Instant>,
|
||||
}
|
||||
|
||||
impl Timeline {
|
||||
pub(super) fn launch_eviction_task(
|
||||
self: &Arc<Self>,
|
||||
@@ -165,6 +177,7 @@ impl Timeline {
|
||||
// that were accessed to compute the value in the first place.
|
||||
// 3. Invalidate the caches at a period of < p.threshold/2, so that the values
|
||||
// get re-computed from layers, thereby counting towards layer access stats.
|
||||
// 4. Make the eviction task imitate the layer accesses that typically hit caches.
|
||||
//
|
||||
// We follow approach (4) here because in Neon prod deployment:
|
||||
// - page cache is quite small => high churn => low hit rate
|
||||
@@ -176,6 +189,10 @@ impl Timeline {
|
||||
//
|
||||
// We should probably move to persistent caches in the future, or avoid
|
||||
// having inactive tenants attached to pageserver in the first place.
|
||||
match self.imitate_layer_accesses(p, cancel, ctx).await {
|
||||
ControlFlow::Break(()) => return ControlFlow::Break(()),
|
||||
ControlFlow::Continue(()) => (),
|
||||
}
|
||||
|
||||
#[allow(dead_code)]
|
||||
#[derive(Debug, Default)]
|
||||
@@ -197,11 +214,6 @@ impl Timeline {
|
||||
let layers = guard.layer_map();
|
||||
let mut candidates = Vec::new();
|
||||
for hist_layer in layers.iter_historic_layers() {
|
||||
// Don't evict the small layers needed to serve a basebackup request.
|
||||
if hist_layer.is_basebackup_pages() {
|
||||
continue;
|
||||
}
|
||||
|
||||
let hist_layer = guard.get_from_desc(&hist_layer);
|
||||
|
||||
// guard against eviction while we inspect it; it might be that eviction_task and
|
||||
@@ -297,4 +309,163 @@ impl Timeline {
|
||||
}
|
||||
ControlFlow::Continue(())
|
||||
}
|
||||
|
||||
#[instrument(skip_all)]
|
||||
async fn imitate_layer_accesses(
|
||||
&self,
|
||||
p: &EvictionPolicyLayerAccessThreshold,
|
||||
cancel: &CancellationToken,
|
||||
ctx: &RequestContext,
|
||||
) -> ControlFlow<()> {
|
||||
let mut state = self.eviction_task_timeline_state.lock().await;
|
||||
|
||||
// Only do the imitate_layer accesses approximately as often as the threshold. A little
|
||||
// more frequently, to avoid this period racing with the threshold/period-th eviction iteration.
|
||||
let inter_imitate_period = p.threshold.checked_sub(p.period).unwrap_or(p.threshold);
|
||||
|
||||
match state.last_layer_access_imitation {
|
||||
Some(ts) if ts.elapsed() < inter_imitate_period => { /* no need to run */ }
|
||||
_ => {
|
||||
self.imitate_timeline_cached_layer_accesses(ctx).await;
|
||||
state.last_layer_access_imitation = Some(tokio::time::Instant::now())
|
||||
}
|
||||
}
|
||||
drop(state);
|
||||
|
||||
if cancel.is_cancelled() {
|
||||
return ControlFlow::Break(());
|
||||
}
|
||||
|
||||
// This task is timeline-scoped, but the synthetic size calculation is tenant-scoped.
|
||||
// Make one of the tenant's timelines draw the short straw and run the calculation.
|
||||
// The others wait until the calculation is done so that they take into account the
|
||||
// imitated accesses that the winner made.
|
||||
let tenant = match crate::tenant::mgr::get_tenant(self.tenant_id, true) {
|
||||
Ok(t) => t,
|
||||
Err(_) => {
|
||||
return ControlFlow::Break(());
|
||||
}
|
||||
};
|
||||
let mut state = tenant.eviction_task_tenant_state.lock().await;
|
||||
match state.last_layer_access_imitation {
|
||||
Some(ts) if ts.elapsed() < inter_imitate_period => { /* no need to run */ }
|
||||
_ => {
|
||||
self.imitate_synthetic_size_calculation_worker(&tenant, ctx, cancel)
|
||||
.await;
|
||||
state.last_layer_access_imitation = Some(tokio::time::Instant::now());
|
||||
}
|
||||
}
|
||||
drop(state);
|
||||
|
||||
if cancel.is_cancelled() {
|
||||
return ControlFlow::Break(());
|
||||
}
|
||||
|
||||
ControlFlow::Continue(())
|
||||
}
|
||||
|
||||
/// Recompute the values which would cause on-demand downloads during restart.
|
||||
#[instrument(skip_all)]
|
||||
async fn imitate_timeline_cached_layer_accesses(&self, ctx: &RequestContext) {
|
||||
let lsn = self.get_last_record_lsn();
|
||||
|
||||
// imitate on-restart initial logical size
|
||||
let size = self
|
||||
.calculate_logical_size(lsn, LogicalSizeCalculationCause::EvictionTaskImitation, ctx)
|
||||
.instrument(info_span!("calculate_logical_size"))
|
||||
.await;
|
||||
|
||||
match &size {
|
||||
Ok(_size) => {
|
||||
// good, don't log it to avoid confusion
|
||||
}
|
||||
Err(_) => {
|
||||
// we have known issues for which we already log this on consumption metrics,
|
||||
// gc, and compaction. leave logging out for now.
|
||||
//
|
||||
// https://github.com/neondatabase/neon/issues/2539
|
||||
}
|
||||
}
|
||||
|
||||
// imitate repartitioning on first compaction
|
||||
if let Err(e) = self
|
||||
.collect_keyspace(lsn, ctx)
|
||||
.instrument(info_span!("collect_keyspace"))
|
||||
.await
|
||||
{
|
||||
// if this failed, we probably failed logical size because these use the same keys
|
||||
if size.is_err() {
|
||||
// ignore, see above comment
|
||||
} else {
|
||||
warn!(
|
||||
"failed to collect keyspace but succeeded in calculating logical size: {e:#}"
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Imitate the synthetic size calculation done by the consumption_metrics module.
|
||||
#[instrument(skip_all)]
|
||||
async fn imitate_synthetic_size_calculation_worker(
|
||||
&self,
|
||||
tenant: &Arc<Tenant>,
|
||||
ctx: &RequestContext,
|
||||
cancel: &CancellationToken,
|
||||
) {
|
||||
if self.conf.metric_collection_endpoint.is_none() {
|
||||
// We don't start the consumption metrics task if this is not set in the config.
|
||||
// So, no need to imitate the accesses in that case.
|
||||
return;
|
||||
}
|
||||
|
||||
// The consumption metrics are collected on a per-tenant basis, by a single
// global background loop.
// It limits the number of synthetic size calculations using the global
// `concurrent_tenant_size_logical_size_queries` semaphore to not overload
// the pageserver. (Size calculation is somewhat expensive in terms of CPU and IOs.)
//
// If we used that same semaphore here, then we'd compete for the
// same permits, which may impact the timeliness of consumption metrics.
// That is a no-go, as consumption metrics are much more important
// than what we do here.
//
// So, we have a separate semaphore, initialized to the same
// number of permits as `concurrent_tenant_size_logical_size_queries`.
// In the worst case, we would have twice the number of concurrent size calculations.
// But in practice, `p.threshold` >> `consumption metric interval`, and
// we spread out the eviction task using `random_init_delay`.
// So, the chance of hitting the worst case is quite low in practice.
// It runs as a per-tenant task, but eviction_task.rs is per-timeline,
// so we must coordinate with the other eviction tasks of this tenant.
|
||||
let limit = self
|
||||
.conf
|
||||
.eviction_task_immitated_concurrent_logical_size_queries
|
||||
.inner();
|
||||
|
||||
let mut throwaway_cache = HashMap::new();
|
||||
let gather = crate::tenant::size::gather_inputs(
|
||||
tenant,
|
||||
limit,
|
||||
None,
|
||||
&mut throwaway_cache,
|
||||
LogicalSizeCalculationCause::EvictionTaskImitation,
|
||||
ctx,
|
||||
)
|
||||
.instrument(info_span!("gather_inputs"));
|
||||
|
||||
tokio::select! {
|
||||
_ = cancel.cancelled() => {}
|
||||
gather_result = gather => {
|
||||
match gather_result {
|
||||
Ok(_) => {},
|
||||
Err(e) => {
|
||||
// We don't care about the result, but, if it failed, we should log it,
|
||||
// since consumption metric might be hitting the cached value and
|
||||
// thus not encountering this error.
|
||||
warn!("failed to imitate synthetic size calculation accesses: {e:#}")
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -36,7 +36,7 @@ use crate::{
|
||||
};
|
||||
use postgres_backend::is_expected_io_error;
|
||||
use postgres_connection::PgConnectionConfig;
|
||||
use postgres_ffi::waldecoder::WalStreamDecoder;
|
||||
|
||||
use utils::pageserver_feedback::PageserverFeedback;
|
||||
use utils::{id::NodeId, lsn::Lsn};
|
||||
|
||||
@@ -244,13 +244,22 @@ pub(super) async fn handle_walreceiver_connection(
|
||||
|
||||
info!("last_record_lsn {last_rec_lsn} starting replication from {startpoint}, safekeeper is at {end_of_wal}...");
|
||||
|
||||
let query = format!("START_REPLICATION PHYSICAL {startpoint}");
|
||||
let shard = timeline.get_shard();
|
||||
let shard_str = serde_json::to_string(&shard).map_err(|e| {
|
||||
WalReceiverError::Other(anyhow!(
|
||||
"Failed to serialize shard info for walreceiver: {e}"
|
||||
))
|
||||
})?;
|
||||
info!("starting replication for shard {shard_str}");
|
||||
|
||||
let query = format!(
|
||||
"START_REPLICATION PHYSICAL {startpoint} (shard={})",
|
||||
shard_str
|
||||
);
|
||||
|
||||
let copy_stream = replication_client.copy_both_simple(&query).await?;
|
||||
let mut physical_stream = pin!(ReplicationStream::new(copy_stream));
|
||||
|
||||
let mut waldecoder = WalStreamDecoder::new(startpoint, timeline.pg_version);
|
||||
|
||||
let mut walingest = WalIngest::new(timeline.as_ref(), startpoint, &ctx).await?;
|
||||
|
||||
while let Some(replication_message) = {
|
||||
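The hunk above serializes the shard identity to JSON and passes it as an option on the START_REPLICATION command. A minimal sketch of that query construction, using a made-up `ShardInfo` struct in place of the real `ShardIdentity` (whose fields are not shown in this diff) and assuming `serde`, `serde_json`, and `anyhow` as dependencies, could look like this:

use serde::Serialize;

/// Hypothetical shard descriptor; the real ShardIdentity has different fields.
#[derive(Serialize)]
struct ShardInfo {
    number: u8,
    count: u8,
}

fn build_start_replication_query(startpoint: &str, shard: &ShardInfo) -> anyhow::Result<String> {
    // Serialize the shard descriptor and embed it as an option of the
    // START_REPLICATION command, mirroring the change in the hunk above.
    let shard_str = serde_json::to_string(shard)?;
    Ok(format!(
        "START_REPLICATION PHYSICAL {startpoint} (shard={shard_str})"
    ))
}

fn main() -> anyhow::Result<()> {
    let shard = ShardInfo { number: 1, count: 4 };
    // The LSN is an arbitrary example value.
    let query = build_start_replication_query("0/16B59D8", &shard)?;
    println!("{query}");
    Ok(())
}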
@@ -273,9 +282,7 @@ pub(super) async fn handle_walreceiver_connection(
|
||||
ReplicationMessage::XLogData(xlog_data) => {
|
||||
connection_status.latest_connection_update = now;
|
||||
connection_status.commit_lsn = Some(Lsn::from(xlog_data.wal_end()));
|
||||
connection_status.streaming_lsn = Some(Lsn::from(
|
||||
xlog_data.wal_start() + xlog_data.data().len() as u64,
|
||||
));
|
||||
connection_status.streaming_lsn = Some(Lsn::from(xlog_data.wal_start()));
|
||||
if !xlog_data.data().is_empty() {
|
||||
connection_status.latest_wal_update = now;
|
||||
}
|
||||
@@ -293,44 +300,29 @@ pub(super) async fn handle_walreceiver_connection(
|
||||
|
||||
let status_update = match replication_message {
|
||||
ReplicationMessage::XLogData(xlog_data) => {
|
||||
// Pass the WAL data to the decoder, and see if we can decode
|
||||
// more records as a result.
|
||||
let data = xlog_data.data();
|
||||
let startlsn = Lsn::from(xlog_data.wal_start());
|
||||
let endlsn = startlsn + data.len() as u64;
|
||||
// Process decoded WAL record.
|
||||
let next_lsn = Lsn::from(xlog_data.wal_start());
|
||||
let data = xlog_data.into_data();
|
||||
|
||||
trace!("received XLogData between {startlsn} and {endlsn}");
|
||||
trace!("received XLogData up to {next_lsn}");
|
||||
|
||||
waldecoder.feed_bytes(data);
|
||||
let mut decoded = DecodedWALRecord::default();
|
||||
let mut modification = timeline.begin_modification(next_lsn);
|
||||
walingest
|
||||
.ingest_record(data, next_lsn, &mut modification, &mut decoded, &ctx)
|
||||
.await
|
||||
.with_context(|| format!("could not ingest record at {next_lsn}"))?;
|
||||
|
||||
{
|
||||
let mut decoded = DecodedWALRecord::default();
|
||||
let mut modification = timeline.begin_modification(endlsn);
|
||||
while let Some((lsn, recdata)) = waldecoder.poll_decode()? {
|
||||
// It is important to deal with the aligned records as lsn in getPage@LSN is
|
||||
// aligned and can be several bytes bigger. Without this alignment we are
|
||||
// at risk of hitting a deadlock.
|
||||
if !lsn.is_aligned() {
|
||||
return Err(WalReceiverError::Other(anyhow!("LSN not aligned")));
|
||||
}
|
||||
fail_point!("walreceiver-after-ingest");
|
||||
|
||||
walingest
|
||||
.ingest_record(recdata, lsn, &mut modification, &mut decoded, &ctx)
|
||||
.await
|
||||
.with_context(|| format!("could not ingest record at {lsn}"))?;
|
||||
last_rec_lsn = next_lsn;
|
||||
|
||||
fail_point!("walreceiver-after-ingest");
|
||||
|
||||
last_rec_lsn = lsn;
|
||||
}
|
||||
}
|
||||
|
||||
if !caught_up && endlsn >= end_of_wal {
|
||||
info!("caught up at LSN {endlsn}");
|
||||
if !caught_up && next_lsn >= end_of_wal {
|
||||
info!("caught up at LSN {next_lsn}");
|
||||
caught_up = true;
|
||||
}
|
||||
|
||||
Some(endlsn)
|
||||
Some(last_rec_lsn)
|
||||
}
|
||||
|
||||
ReplicationMessage::PrimaryKeepAlive(keepalive) => {
|
||||
|
||||
@@ -21,6 +21,7 @@
|
||||
//! redo Postgres process, but some records it can handle directly with
|
||||
//! bespoke Rust code.
|
||||
|
||||
use pageserver_api::shard::ShardIdentity;
|
||||
use postgres_ffi::v14::nonrelfile_utils::clogpage_precedes;
|
||||
use postgres_ffi::v14::nonrelfile_utils::slru_may_delete_clogsegment;
|
||||
use postgres_ffi::{fsm_logical_to_physical, page_is_new, page_set_lsn};
|
||||
@@ -46,6 +47,7 @@ use postgres_ffi::BLCKSZ;
|
||||
use utils::lsn::Lsn;
|
||||
|
||||
pub struct WalIngest<'a> {
|
||||
shard: ShardIdentity,
|
||||
timeline: &'a Timeline,
|
||||
|
||||
checkpoint: CheckPoint,
|
||||
@@ -65,6 +67,7 @@ impl<'a> WalIngest<'a> {
|
||||
trace!("CheckPoint.nextXid = {}", checkpoint.nextXid.value);
|
||||
|
||||
Ok(WalIngest {
|
||||
shard: timeline.get_shard(),
|
||||
timeline,
|
||||
checkpoint,
|
||||
checkpoint_modified: false,
|
||||
@@ -90,6 +93,36 @@ impl<'a> WalIngest<'a> {
|
||||
modification.lsn = lsn;
|
||||
decode_wal_record(recdata, decoded, self.timeline.pg_version)?;
|
||||
|
||||
tracing::trace!(
|
||||
"decoded rmid={} xid={} xl_info={}",
|
||||
decoded.xl_rmid,
|
||||
decoded.xl_xid,
|
||||
decoded.xl_info
|
||||
);
|
||||
|
||||
// Fast path: we may skip the entire record if it only references blocks on another shard.
// Otherwise we proceed, and filter blocks later.
let any_local_blocks = decoded.blocks.iter().any(|blk| {
let rel = RelTag {
spcnode: blk.rnode_spcnode,
dbnode: blk.rnode_dbnode,
relnode: blk.rnode_relnode,
forknum: blk.forknum,
};

let key = rel_block_to_key(rel, blk.blkno);
self.shard.is_key_local(&key)
});
// - We need at least one block to skip: otherwise we assume the record's
// payload is all in its other fields, which are metadata-ish things that
// we broadcast to all shards
// - ...and obviously, we can only skip a WAL record if it doesn't need to
// write to any pages in this shard.
let skip_record = decoded.blocks.len() > 0 && !any_local_blocks;
// TODO: actually skip (and update LSN at the time). Currently we just
// check later in the function that if we set skip_record==true, then we
// really would not have done any local IO.

let mut buf = decoded.record.clone();
buf.advance(decoded.main_data_offset);

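As a standalone illustration of the filtering rule described in the comments above (a record may be skipped only if it references at least one block and none of its blocks map to this shard), here is a small sketch; the block and shard types and the toy striping rule are invented for the example and do not reflect the real `ShardIdentity` key mapping.

/// Hypothetical stand-ins for the decoded WAL block references and the shard
/// routing used in the hunk above.
#[derive(Clone, Copy)]
struct BlockRef {
    relnode: u32,
    blkno: u32,
}

struct Shard {
    number: u32,
    count: u32,
}

impl Shard {
    // Toy routing: stripe blocks across shards by relation and block number.
    fn is_block_local(&self, blk: &BlockRef) -> bool {
        blk.relnode.wrapping_add(blk.blkno) % self.count == self.number
    }
}

/// Records with no block references are treated as metadata and processed on
/// every shard; block-bearing records are skipped when no block is local.
fn should_skip_record(shard: &Shard, blocks: &[BlockRef]) -> bool {
    let any_local = blocks.iter().any(|blk| shard.is_block_local(blk));
    !blocks.is_empty() && !any_local
}

fn main() {
    let shard = Shard { number: 1, count: 4 };
    let foreign_only = [BlockRef { relnode: 16384, blkno: 0 }];
    let empty: [BlockRef; 0] = [];
    assert!(should_skip_record(&shard, &foreign_only));
    assert!(!should_skip_record(&shard, &empty)); // metadata-only records are kept
}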
@@ -358,6 +391,26 @@ impl<'a> WalIngest<'a> {
|
||||
// Iterate through all the blocks that the record modifies, and
// "put" a separate copy of the record for each block.
for blk in decoded.blocks.iter() {
let rel = RelTag {
spcnode: blk.rnode_spcnode,
dbnode: blk.rnode_dbnode,
relnode: blk.rnode_relnode,
forknum: blk.forknum,
};

let key = rel_block_to_key(rel, blk.blkno);
let key_is_local = self.shard.is_key_local(&key);

tracing::info!(
"ingest: shard decision {} (checkpoint={}) for key {}",
if !key_is_local { "drop" } else { "keep" },
self.checkpoint_modified,
key
);

if !key_is_local {
continue;
}
self.ingest_decoded_block(modification, lsn, decoded, blk, ctx)
.await?;
}
|
||||
@@ -370,6 +423,12 @@ impl<'a> WalIngest<'a> {
|
||||
self.checkpoint_modified = false;
|
||||
}
|
||||
|
||||
if skip_record && !modification.is_no_op() {
|
||||
tracing::error!(
|
||||
"WAL record @ {lsn} would have been dropped, but we actually did modifications!"
|
||||
);
|
||||
}
|
||||
|
||||
// Now that this record has been fully handled, including updating the
|
||||
// checkpoint data, let the repository know that it is up-to-date to this LSN
|
||||
modification.commit(ctx).await?;
|
||||
@@ -1459,8 +1518,15 @@ impl<'a> WalIngest<'a> {
|
||||
//info!("extending {} {} to {}", rel, old_nblocks, new_nblocks);
|
||||
modification.put_rel_extend(rel, new_nblocks, ctx).await?;
|
||||
|
||||
let mut key = rel_block_to_key(rel, blknum);
|
||||
// fill the gap with zeros
|
||||
for gap_blknum in old_nblocks..blknum {
|
||||
key.field6 = gap_blknum;
|
||||
|
||||
if self.shard.get_shard_number(&key) != self.shard.number {
|
||||
continue;
|
||||
}
|
||||
|
||||
modification.put_rel_page_image(rel, gap_blknum, ZERO_PAGE.clone())?;
|
||||
}
|
||||
}
|
||||
|
||||
@@ -43,8 +43,7 @@ use std::sync::atomic::{AtomicUsize, Ordering};
|
||||
|
||||
use crate::config::PageServerConf;
|
||||
use crate::metrics::{
|
||||
WalRedoKillCause, WAL_REDO_BYTES_HISTOGRAM, WAL_REDO_PROCESS_COUNTERS,
|
||||
WAL_REDO_RECORDS_HISTOGRAM, WAL_REDO_RECORD_COUNTER, WAL_REDO_TIME,
|
||||
WAL_REDO_BYTES_HISTOGRAM, WAL_REDO_RECORDS_HISTOGRAM, WAL_REDO_RECORD_COUNTER, WAL_REDO_TIME,
|
||||
};
|
||||
use crate::pgdatadir_mapping::{key_to_rel_block, key_to_slru_block};
|
||||
use crate::repository::Key;
|
||||
@@ -663,10 +662,10 @@ impl WalRedoProcess {
|
||||
.close_fds()
|
||||
.spawn_no_leak_child(tenant_id)
|
||||
.context("spawn process")?;
|
||||
WAL_REDO_PROCESS_COUNTERS.started.inc();
|
||||
|
||||
let mut child = scopeguard::guard(child, |child| {
|
||||
error!("killing wal-redo-postgres process due to a problem during launch");
|
||||
child.kill_and_wait(WalRedoKillCause::Startup);
|
||||
child.kill_and_wait();
|
||||
});
|
||||
|
||||
let stdin = child.stdin.take().unwrap();
|
||||
@@ -997,7 +996,7 @@ impl Drop for WalRedoProcess {
|
||||
self.child
|
||||
.take()
|
||||
.expect("we only do this once")
|
||||
.kill_and_wait(WalRedoKillCause::WalRedoProcessDrop);
|
||||
.kill_and_wait();
|
||||
self.stderr_logger_cancel.cancel();
|
||||
// no way to wait for stderr_logger_task from Drop because that is async only
|
||||
}
|
||||
@@ -1033,19 +1032,16 @@ impl NoLeakChild {
|
||||
})
|
||||
}
|
||||
|
||||
fn kill_and_wait(mut self, cause: WalRedoKillCause) {
|
||||
fn kill_and_wait(mut self) {
|
||||
let child = match self.child.take() {
|
||||
Some(child) => child,
|
||||
None => return,
|
||||
};
|
||||
Self::kill_and_wait_impl(child, cause);
|
||||
Self::kill_and_wait_impl(child);
|
||||
}
|
||||
|
||||
#[instrument(skip_all, fields(pid=child.id(), ?cause))]
|
||||
fn kill_and_wait_impl(mut child: Child, cause: WalRedoKillCause) {
|
||||
scopeguard::defer! {
|
||||
WAL_REDO_PROCESS_COUNTERS.killed_by_cause[cause].inc();
|
||||
}
|
||||
#[instrument(skip_all, fields(pid=child.id()))]
|
||||
fn kill_and_wait_impl(mut child: Child) {
|
||||
let res = child.kill();
|
||||
if let Err(e) = res {
|
||||
// This branch is very unlikely because:
|
||||
@@ -1090,7 +1086,7 @@ impl Drop for NoLeakChild {
|
||||
// This thread here is going to outlive our dropper.
|
||||
let span = tracing::info_span!("walredo", %tenant_id);
|
||||
let _entered = span.enter();
|
||||
Self::kill_and_wait_impl(child, WalRedoKillCause::NoLeakChildDrop);
|
||||
Self::kill_and_wait_impl(child);
|
||||
})
|
||||
.await
|
||||
});
|
||||
|
||||
@@ -27,12 +27,14 @@
|
||||
#include "commands/defrem.h"
|
||||
#include "miscadmin.h"
|
||||
#include "utils/acl.h"
|
||||
#include "utils/fmgrprotos.h"
|
||||
#include "fmgr.h"
|
||||
#include "utils/guc.h"
|
||||
#include "port.h"
|
||||
#include <curl/curl.h>
|
||||
#include "utils/jsonb.h"
|
||||
#include "libpq/crypt.h"
|
||||
#include "pagestore_client.h"
|
||||
|
||||
static ProcessUtility_hook_type PreviousProcessUtilityHook = NULL;
|
||||
|
||||
@@ -222,6 +224,104 @@ ErrorWriteCallback(char *ptr, size_t size, size_t nmemb, void *userdata)
|
||||
return nmemb;
|
||||
}
|
||||
|
||||
|
||||
static size_t
|
||||
ResponseWriteCallback(char *ptr, size_t size, size_t nmemb, void *userdata)
|
||||
{
|
||||
appendBinaryStringInfo((StringInfo)userdata, ptr, size*nmemb);
|
||||
return nmemb;
|
||||
}
|
||||
|
||||
void
|
||||
RequestShardMapFromControlPlane(ShardMap* shard_map)
|
||||
{
|
||||
shard_map->n_shards = 0;
|
||||
if (!ConsoleURL)
|
||||
{
|
||||
elog(LOG, "ConsoleURL not set, skipping forwarding");
|
||||
return;
|
||||
}
|
||||
StringInfoData resp;
|
||||
initStringInfo(&resp);
|
||||
|
||||
curl_easy_setopt(CurlHandle, CURLOPT_CUSTOMREQUEST, "GET");
|
||||
curl_easy_setopt(CurlHandle, CURLOPT_URL, ConsoleURL);
|
||||
curl_easy_setopt(CurlHandle, CURLOPT_ERRORBUFFER, CurlErrorBuf);
|
||||
curl_easy_setopt(CurlHandle, CURLOPT_TIMEOUT, 3L /* seconds */ );
|
||||
curl_easy_setopt(CurlHandle, CURLOPT_WRITEDATA, &resp);
|
||||
curl_easy_setopt(CurlHandle, CURLOPT_WRITEFUNCTION, ResponseWriteCallback);
|
||||
|
||||
const int num_retries = 5;
|
||||
int curl_status;
|
||||
|
||||
for (int i = 0; i < num_retries; i++)
|
||||
{
|
||||
if ((curl_status = curl_easy_perform(CurlHandle)) == CURLE_OK)
|
||||
break;
|
||||
elog(LOG, "Curl request failed on attempt %d: %s", i, CurlErrorBuf);
|
||||
pg_usleep(1000 * 1000);
|
||||
}
|
||||
if (curl_status != CURLE_OK)
|
||||
{
|
||||
curl_easy_cleanup(CurlHandle);
|
||||
elog(ERROR, "Failed to perform curl request: %s", CurlErrorBuf);
|
||||
}
|
||||
else
|
||||
{
|
||||
long response_code;
|
||||
if (curl_easy_getinfo(CurlHandle, CURLINFO_RESPONSE_CODE, &response_code) != CURLE_UNKNOWN_OPTION)
|
||||
{
|
||||
if (response_code != 200)
|
||||
{
|
||||
bool error_exists = resp.len != 0;
|
||||
if(error_exists)
|
||||
{
|
||||
elog(ERROR,
|
||||
"[PG_LLM] Received HTTP code %ld from OpenAI: %s",
|
||||
response_code,
|
||||
resp.data);
|
||||
}
|
||||
else
|
||||
{
|
||||
elog(ERROR,
|
||||
"[PG_LLM] Received HTTP code %ld from OpenAI",
|
||||
response_code);
|
||||
}
|
||||
}
|
||||
}
|
||||
curl_easy_cleanup(CurlHandle);
|
||||
|
||||
JsonbContainer *jsonb = (JsonbContainer *)DatumGetPointer(DirectFunctionCall1(jsonb_in, CStringGetDatum(resp.data)));
|
||||
JsonbValue v;
|
||||
JsonbIterator *it;
|
||||
JsonbIteratorToken r;
|
||||
|
||||
it = JsonbIteratorInit(jsonb);
|
||||
r = JsonbIteratorNext(&it, &v, true);
|
||||
if (r != WJB_BEGIN_ARRAY)
|
||||
elog(ERROR, "Array of connection strings expected");
|
||||
|
||||
while ((r = JsonbIteratorNext(&it, &v, true)) != WJB_DONE)
|
||||
{
|
||||
if (r != WJB_ELEM)
|
||||
continue;
|
||||
|
||||
if (shard_map->n_shards >= MAX_SHARDS)
|
||||
elog(ERROR, "Too many shards");
|
||||
|
||||
if (v.type != jbvString)
|
||||
elog(ERROR, "Connection string expected");
|
||||
|
||||
strncpy(shard_map->shard_connstr[shard_map->n_shards++],
|
||||
v.val.string.val,
|
||||
MAX_PS_CONNSTR_LEN);
|
||||
}
|
||||
shard_map->update_counter += 1;
|
||||
pfree(resp.data);
|
||||
}
|
||||
}
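
RequestShardMapFromControlPlane expects the body returned by ConsoleURL to be a flat JSON array of per-shard connection strings, rejects non-string elements, and caps the count at MAX_SHARDS. A hedged Rust sketch of the same validation, assuming serde_json is available (the payload shape is inferred from the parser above, not from a documented API):

use serde_json::Value;

const MAX_SHARDS: usize = 128;

// Parse the control-plane response body into per-shard connection strings,
// with the same checks as the C parser above: a top-level JSON array,
// string elements only, and at most MAX_SHARDS entries.
fn parse_shard_map(body: &str) -> Result<Vec<String>, String> {
    let v: Value = serde_json::from_str(body).map_err(|e| e.to_string())?;
    let arr = v.as_array().ok_or("Array of connection strings expected")?;
    if arr.len() > MAX_SHARDS {
        return Err("Too many shards".into());
    }
    arr.iter()
        .map(|e| {
            e.as_str()
                .map(str::to_owned)
                .ok_or_else(|| "Connection string expected".to_string())
        })
        .collect()
}

fn main() {
    let body = r#"["host=ps-0 port=6400", "host=ps-1 port=6400"]"#;
    println!("{:?}", parse_shard_map(body));
}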
|
||||
|
||||
|
||||
static void
|
||||
SendDeltasToControlPlane()
|
||||
{
|
||||
|
||||
@@ -2,5 +2,6 @@
|
||||
#define CONTROL_PLANE_CONNECTOR_H
|
||||
|
||||
void InitControlPlaneConnector();
|
||||
void RequestShardMapFromControlPlane(ShardMap* shard_map);
|
||||
|
||||
#endif
|
||||
|
||||
@@ -18,11 +18,10 @@
|
||||
#include "fmgr.h"
|
||||
#include "access/xlog.h"
|
||||
#include "access/xlogutils.h"
|
||||
#include "common/hashfn.h"
|
||||
#include "storage/buf_internals.h"
|
||||
#include "storage/lwlock.h"
|
||||
#include "storage/ipc.h"
|
||||
#include "c.h"
|
||||
#include "postmaster/interrupt.h"
|
||||
|
||||
#include "libpq-fe.h"
|
||||
#include "libpq/pqformat.h"
|
||||
@@ -35,22 +34,12 @@
|
||||
#include "neon.h"
|
||||
#include "walproposer.h"
|
||||
#include "neon_utils.h"
|
||||
#include "control_plane_connector.h"
|
||||
|
||||
#define PageStoreTrace DEBUG5
|
||||
|
||||
#define RECONNECT_INTERVAL_USEC 1000000
|
||||
|
||||
bool connected = false;
|
||||
PGconn *pageserver_conn = NULL;
|
||||
|
||||
/*
|
||||
* WaitEventSet containing:
|
||||
* - WL_SOCKET_READABLE on pageserver_conn,
|
||||
* - WL_LATCH_SET on MyLatch, and
|
||||
* - WL_EXIT_ON_PM_DEATH.
|
||||
*/
|
||||
WaitEventSet *pageserver_conn_wes = NULL;
|
||||
|
||||
/* GUCs */
|
||||
char *neon_timeline;
|
||||
char *neon_tenant;
|
||||
@@ -64,80 +53,165 @@ int flush_every_n_requests = 8;
|
||||
int n_reconnect_attempts = 0;
|
||||
int max_reconnect_attempts = 60;
|
||||
|
||||
#define MAX_PAGESERVER_CONNSTRING_SIZE 256
|
||||
bool (*old_redo_read_buffer_filter) (XLogReaderState *record, uint8 block_id) = NULL;
|
||||
|
||||
static bool pageserver_flush(shardno_t shard_no);
|
||||
static void pageserver_disconnect(shardno_t shard_no);
|
||||
|
||||
|
||||
static pqsigfunc prev_signal_handler;
|
||||
|
||||
static shmem_startup_hook_type prev_shmem_startup_hook;
|
||||
#if PG_VERSION_NUM>=150000
|
||||
static shmem_request_hook_type prev_shmem_request_hook;
|
||||
#endif
|
||||
|
||||
static ShardMap* shard_map;
|
||||
static LWLockId shard_map_lock;
|
||||
static size_t shard_map_update_counter;
|
||||
|
||||
typedef struct
|
||||
{
|
||||
LWLockId lock;
|
||||
pg_atomic_uint64 update_counter;
|
||||
char pageserver_connstring[MAX_PAGESERVER_CONNSTRING_SIZE];
|
||||
} PagestoreShmemState;
|
||||
/*
|
||||
* connection for each shard
|
||||
*/
|
||||
PGconn *conn;
|
||||
/*
|
||||
* WaitEventSet containing:
|
||||
* - WL_SOCKET_READABLE on pageserver_conn,
|
||||
* - WL_LATCH_SET on MyLatch, and
|
||||
* - WL_EXIT_ON_PM_DEATH.
|
||||
*/
|
||||
WaitEventSet *wes;
|
||||
} PageServer;
|
||||
|
||||
#if PG_VERSION_NUM >= 150000
|
||||
static shmem_request_hook_type prev_shmem_request_hook = NULL;
|
||||
static void walproposer_shmem_request(void);
|
||||
static PageServer page_servers[MAX_SHARDS];
|
||||
|
||||
static void
|
||||
psm_shmem_startup(void)
|
||||
{
|
||||
bool found;
|
||||
if (prev_shmem_startup_hook)
|
||||
{
|
||||
prev_shmem_startup_hook();
|
||||
}
|
||||
|
||||
LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
|
||||
|
||||
shard_map = (ShardMap*)ShmemInitStruct("shard_map", sizeof(ShardMap), &found);
|
||||
if (!found)
|
||||
{
|
||||
shard_map_lock = (LWLockId)GetNamedLWLockTranche("shard_map_lock");
|
||||
shard_map->n_shards = 0;
|
||||
shard_map->update_counter = 0;
|
||||
}
|
||||
LWLockRelease(AddinShmemInitLock);
|
||||
}
|
||||
|
||||
static void
|
||||
psm_shmem_request(void)
|
||||
{
|
||||
#if PG_VERSION_NUM>=150000
|
||||
if (prev_shmem_request_hook)
|
||||
prev_shmem_request_hook();
|
||||
#endif
|
||||
static shmem_startup_hook_type prev_shmem_startup_hook;
|
||||
static PagestoreShmemState *pagestore_shared;
|
||||
static uint64 pagestore_local_counter = 0;
|
||||
static char local_pageserver_connstring[MAX_PAGESERVER_CONNSTRING_SIZE];
|
||||
|
||||
bool (*old_redo_read_buffer_filter) (XLogReaderState *record, uint8 block_id) = NULL;
|
||||
|
||||
static bool pageserver_flush(void);
|
||||
static void pageserver_disconnect(void);
|
||||
|
||||
static bool
|
||||
CheckPageserverConnstring(char **newval, void **extra, GucSource source)
|
||||
{
|
||||
return strlen(*newval) < MAX_PAGESERVER_CONNSTRING_SIZE;
|
||||
RequestAddinShmemSpace(sizeof(ShardMap));
|
||||
RequestNamedLWLockTranche("shard_map_lock", 1);
|
||||
}
|
||||
|
||||
static void
|
||||
AssignPageserverConnstring(const char *newval, void *extra)
|
||||
psm_init(void)
|
||||
{
|
||||
if(!pagestore_shared)
|
||||
return;
|
||||
LWLockAcquire(pagestore_shared->lock, LW_EXCLUSIVE);
|
||||
strlcpy(pagestore_shared->pageserver_connstring, newval, MAX_PAGESERVER_CONNSTRING_SIZE);
|
||||
pg_atomic_fetch_add_u64(&pagestore_shared->update_counter, 1);
|
||||
LWLockRelease(pagestore_shared->lock);
|
||||
prev_shmem_startup_hook = shmem_startup_hook;
|
||||
shmem_startup_hook = psm_shmem_startup;
|
||||
#if PG_VERSION_NUM>=150000
|
||||
prev_shmem_request_hook = shmem_request_hook;
|
||||
shmem_request_hook = psm_shmem_request;
|
||||
#else
|
||||
psm_shmem_request();
|
||||
#endif
|
||||
}
|
||||
|
||||
static bool
|
||||
CheckConnstringUpdated()
|
||||
shardno_t
|
||||
get_shard_number(BufferTag* tag)
|
||||
{
|
||||
if(!pagestore_shared)
|
||||
return false;
|
||||
return pagestore_local_counter < pg_atomic_read_u64(&pagestore_shared->update_counter);
|
||||
shardno_t shard_no;
|
||||
uint32 hash;
|
||||
|
||||
#if PG_MAJORVERSION_NUM < 16
|
||||
hash = murmurhash32(tag->rnode.spcNode);
|
||||
hash_combine(hash, murmurhash32(tag->rnode.dbNode));
|
||||
hash_combine(hash, murmurhash32(tag->rnode.relNode));
|
||||
hash_combine(hash, murmurhash32(tag->blockNum/STRIPE_SIZE));
|
||||
#else
|
||||
hash = murmurhash32(tag->spcOid);
|
||||
hash_combine(hash, murmurhash32(tag->dbOid));
|
||||
hash_combine(hash, murmurhash32(tag->relNumber));
|
||||
hash_combine(hash, murmurhash32(tag->blockNum/STRIPE_SIZE));
|
||||
#endif
|
||||
|
||||
LWLockAcquire(shard_map_lock, LW_SHARED);
|
||||
while (shard_map->n_shards == 0 || shard_map_update_counter != shard_map->update_counter)
|
||||
{
|
||||
/* Close all existing connections */
|
||||
for (shard_no = 0; shard_no < shard_map->n_shards; shard_no++)
|
||||
{
|
||||
if (page_servers[shard_no].conn)
|
||||
pageserver_disconnect(shard_no);
|
||||
}
|
||||
|
||||
/* Request new shard map from control plane under exclusive lock */
|
||||
LWLockRelease(shard_map_lock);
|
||||
LWLockAcquire(shard_map_lock, LW_EXCLUSIVE);
|
||||
if (shard_map->n_shards == 0)
|
||||
{
|
||||
if (*page_server_connstring)
|
||||
{
|
||||
shard_map->n_shards = 1;
|
||||
strncpy(shard_map->shard_connstr[0], page_server_connstring, sizeof shard_map->shard_connstr[0]);
|
||||
}
|
||||
else
|
||||
{
|
||||
RequestShardMapFromControlPlane(shard_map);
|
||||
}
|
||||
shard_map_update_counter = shard_map->update_counter;
|
||||
}
|
||||
}
|
||||
shard_no = hash % shard_map->n_shards;
|
||||
|
||||
LWLockRelease(shard_map_lock);
|
||||
|
||||
return shard_no;
|
||||
}
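
get_shard_number hashes the relfile identifiers together with blockNum / STRIPE_SIZE, so each stripe of STRIPE_SIZE consecutive blocks maps to one shard and successive stripes spread across shards. A small Rust model of that mapping; std's default hasher stands in for murmurhash32/hash_combine, so the concrete shard assignments will differ from the extension's:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Blocks per stripe, mirroring STRIPE_SIZE in pagestore_client.h.
const STRIPE_SIZE: u32 = 256 * 1024 / 8;

// Map (tablespace, database, relation, block) to a shard number.
fn shard_for_block(spc: u32, db: u32, rel: u32, blkno: u32, n_shards: u64) -> u64 {
    let mut h = DefaultHasher::new();
    (spc, db, rel, blkno / STRIPE_SIZE).hash(&mut h);
    h.finish() % n_shards
}

fn main() {
    let n_shards = 4;
    // Blocks inside one stripe land on the same shard ...
    assert_eq!(
        shard_for_block(1663, 16384, 16385, 0, n_shards),
        shard_for_block(1663, 16384, 16385, STRIPE_SIZE - 1, n_shards)
    );
    // ... while successive stripes spread across shards.
    for stripe in 0..4 {
        let shard = shard_for_block(1663, 16384, 16385, stripe * STRIPE_SIZE, n_shards);
        println!("stripe {stripe} -> shard {shard}");
    }
}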
|
||||
|
||||
static void
|
||||
ReloadConnstring()
|
||||
pageserver_sighup_handler(SIGNAL_ARGS)
|
||||
{
|
||||
if(!pagestore_shared)
|
||||
return;
|
||||
LWLockAcquire(pagestore_shared->lock, LW_SHARED);
|
||||
strlcpy(local_pageserver_connstring, pagestore_shared->pageserver_connstring, sizeof(local_pageserver_connstring));
|
||||
pagestore_local_counter = pg_atomic_read_u64(&pagestore_shared->update_counter);
|
||||
LWLockRelease(pagestore_shared->lock);
|
||||
if (prev_signal_handler)
|
||||
{
|
||||
prev_signal_handler(postgres_signal_arg);
|
||||
}
|
||||
neon_log(LOG, "Received SIGHUP, disconnecting pageserver. New pageserver connstring is %s", page_server_connstring);
|
||||
|
||||
/* force refetching shard map from control plane */
|
||||
LWLockAcquire(shard_map_lock, LW_EXCLUSIVE);
|
||||
shard_map->n_shards = 0;
|
||||
LWLockRelease(shard_map_lock);
|
||||
}
|
||||
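
The SIGHUP handler only invalidates: it zeroes n_shards under the exclusive lock, and the next lookup notices the stale map via the update counter, drops the per-shard connections, and refetches from the control plane. A compact Rust sketch of that invalidate-then-lazily-refresh pattern, using a process-local RwLock and generation counter as an illustration rather than the extension's shared-memory layout:

use std::sync::RwLock;

struct ShardMap {
    connstrs: Vec<String>,
    update_counter: u64,
}

static SHARD_MAP: RwLock<ShardMap> = RwLock::new(ShardMap {
    connstrs: Vec::new(),
    update_counter: 0,
});

// What SIGHUP does: wipe the map so the next lookup refetches it.
fn invalidate_shard_map() {
    let mut map = SHARD_MAP.write().unwrap();
    map.connstrs.clear();
    map.update_counter += 1;
}

// What get_shard_number does before hashing: make sure the map is populated,
// refetching (stubbed here) when it was invalidated, and remember the generation.
fn ensure_shard_map(local_counter: &mut u64) -> Vec<String> {
    {
        let map = SHARD_MAP.read().unwrap();
        if !map.connstrs.is_empty() && *local_counter == map.update_counter {
            return map.connstrs.clone();
        }
    }
    let mut map = SHARD_MAP.write().unwrap();
    if map.connstrs.is_empty() {
        map.connstrs = fetch_from_control_plane();
        map.update_counter += 1;
    }
    *local_counter = map.update_counter;
    map.connstrs.clone()
}

fn fetch_from_control_plane() -> Vec<String> {
    vec!["host=ps-0 port=6400".into(), "host=ps-1 port=6400".into()]
}

fn main() {
    let mut local = 0;
    println!("{:?}", ensure_shard_map(&mut local));
    invalidate_shard_map();
    println!("{:?}", ensure_shard_map(&mut local));
}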
|
||||
static bool
|
||||
pageserver_connect(int elevel)
|
||||
pageserver_connect(shardno_t shard_no, int elevel)
|
||||
{
|
||||
char *query;
|
||||
int ret;
|
||||
const char *keywords[3];
|
||||
const char *values[3];
|
||||
int n;
|
||||
PGconn* conn;
|
||||
WaitEventSet *wes;
|
||||
|
||||
Assert(!connected);
|
||||
|
||||
if(CheckConnstringUpdated())
|
||||
{
|
||||
ReloadConnstring();
|
||||
}
|
||||
Assert(page_servers[shard_no].conn == NULL);
|
||||
|
||||
/*
|
||||
* Connect using the connection string we got from the
|
||||
@@ -158,19 +232,18 @@ pageserver_connect(int elevel)
|
||||
n++;
|
||||
}
|
||||
keywords[n] = "dbname";
|
||||
values[n] = local_pageserver_connstring;
|
||||
values[n] = shard_map->shard_connstr[shard_no];
|
||||
n++;
|
||||
keywords[n] = NULL;
|
||||
values[n] = NULL;
|
||||
n++;
|
||||
pageserver_conn = PQconnectdbParams(keywords, values, 1);
|
||||
conn = PQconnectdbParams(keywords, values, 1);
|
||||
|
||||
if (PQstatus(pageserver_conn) == CONNECTION_BAD)
|
||||
if (PQstatus(conn) == CONNECTION_BAD)
|
||||
{
|
||||
char *msg = pchomp(PQerrorMessage(pageserver_conn));
|
||||
char *msg = pchomp(PQerrorMessage(conn));
|
||||
|
||||
PQfinish(pageserver_conn);
|
||||
pageserver_conn = NULL;
|
||||
PQfinish(conn);
|
||||
|
||||
ereport(elevel,
|
||||
(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
|
||||
@@ -178,30 +251,28 @@ pageserver_connect(int elevel)
|
||||
errdetail_internal("%s", msg)));
|
||||
return false;
|
||||
}
|
||||
|
||||
query = psprintf("pagestream %s %s", neon_tenant, neon_timeline);
|
||||
ret = PQsendQuery(pageserver_conn, query);
|
||||
ret = PQsendQuery(conn, query);
|
||||
if (ret != 1)
|
||||
{
|
||||
PQfinish(pageserver_conn);
|
||||
pageserver_conn = NULL;
|
||||
PQfinish(conn);
|
||||
neon_log(elevel, "could not send pagestream command to pageserver");
|
||||
return false;
|
||||
}
|
||||
|
||||
pageserver_conn_wes = CreateWaitEventSet(TopMemoryContext, 3);
|
||||
AddWaitEventToSet(pageserver_conn_wes, WL_LATCH_SET, PGINVALID_SOCKET,
|
||||
wes = CreateWaitEventSet(TopMemoryContext, 3);
|
||||
AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET,
|
||||
MyLatch, NULL);
|
||||
AddWaitEventToSet(pageserver_conn_wes, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
|
||||
AddWaitEventToSet(wes, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
|
||||
NULL, NULL);
|
||||
AddWaitEventToSet(pageserver_conn_wes, WL_SOCKET_READABLE, PQsocket(pageserver_conn), NULL, NULL);
|
||||
AddWaitEventToSet(wes, WL_SOCKET_READABLE, PQsocket(conn), NULL, NULL);
|
||||
|
||||
while (PQisBusy(pageserver_conn))
|
||||
while (PQisBusy(conn))
|
||||
{
|
||||
WaitEvent event;
|
||||
|
||||
/* Sleep until there's something to do */
|
||||
(void) WaitEventSetWait(pageserver_conn_wes, -1L, &event, 1, PG_WAIT_EXTENSION);
|
||||
(void) WaitEventSetWait(wes, -1L, &event, 1, PG_WAIT_EXTENSION);
|
||||
ResetLatch(MyLatch);
|
||||
|
||||
CHECK_FOR_INTERRUPTS();
|
||||
@@ -209,14 +280,12 @@ pageserver_connect(int elevel)
|
||||
/* Data available in socket? */
|
||||
if (event.events & WL_SOCKET_READABLE)
|
||||
{
|
||||
if (!PQconsumeInput(pageserver_conn))
|
||||
if (!PQconsumeInput(conn))
|
||||
{
|
||||
char *msg = pchomp(PQerrorMessage(pageserver_conn));
|
||||
char *msg = pchomp(PQerrorMessage(conn));
|
||||
|
||||
PQfinish(pageserver_conn);
|
||||
pageserver_conn = NULL;
|
||||
FreeWaitEventSet(pageserver_conn_wes);
|
||||
pageserver_conn_wes = NULL;
|
||||
PQfinish(conn);
|
||||
FreeWaitEventSet(wes);
|
||||
|
||||
neon_log(elevel, "could not complete handshake with pageserver: %s",
|
||||
msg);
|
||||
@@ -225,9 +294,10 @@ pageserver_connect(int elevel)
|
||||
}
|
||||
}
|
||||
|
||||
neon_log(LOG, "libpagestore: connected to '%s'", page_server_connstring);
|
||||
neon_log(LOG, "libpagestore: connected to '%s'", shard_map->shard_connstr[shard_no]);
|
||||
page_servers[shard_no].conn = conn;
|
||||
page_servers[shard_no].wes = wes;
|
||||
|
||||
connected = true;
|
||||
return true;
|
||||
}
|
||||
|
||||
@@ -235,10 +305,10 @@ pageserver_connect(int elevel)
|
||||
* A wrapper around PQgetCopyData that checks for interrupts while sleeping.
|
||||
*/
|
||||
static int
|
||||
call_PQgetCopyData(char **buffer)
|
||||
call_PQgetCopyData(shardno_t shard_no, char **buffer)
|
||||
{
|
||||
int ret;
|
||||
|
||||
PGconn* pageserver_conn = page_servers[shard_no].conn;
|
||||
retry:
|
||||
ret = PQgetCopyData(pageserver_conn, buffer, 1 /* async */ );
|
||||
|
||||
@@ -247,7 +317,7 @@ retry:
|
||||
WaitEvent event;
|
||||
|
||||
/* Sleep until there's something to do */
|
||||
(void) WaitEventSetWait(pageserver_conn_wes, -1L, &event, 1, PG_WAIT_EXTENSION);
|
||||
(void) WaitEventSetWait(page_servers[shard_no].wes, -1L, &event, 1, PG_WAIT_EXTENSION);
|
||||
ResetLatch(MyLatch);
|
||||
|
||||
CHECK_FOR_INTERRUPTS();
|
||||
@@ -272,7 +342,7 @@ retry:
|
||||
|
||||
|
||||
static void
|
||||
pageserver_disconnect(void)
|
||||
pageserver_disconnect(shardno_t shard_no)
|
||||
{
|
||||
/*
|
||||
* If anything goes wrong while we were sending a request, it's not clear
|
||||
@@ -281,38 +351,32 @@ pageserver_disconnect(void)
|
||||
* time later after we have already sent a new unrelated request. Close
|
||||
* the connection to avoid getting confused.
|
||||
*/
|
||||
if (connected)
|
||||
if (page_servers[shard_no].conn)
|
||||
{
|
||||
neon_log(LOG, "dropping connection to page server due to error");
|
||||
PQfinish(pageserver_conn);
|
||||
pageserver_conn = NULL;
|
||||
connected = false;
|
||||
PQfinish(page_servers[shard_no].conn);
|
||||
page_servers[shard_no].conn = NULL;
|
||||
|
||||
prefetch_on_ps_disconnect();
|
||||
}
|
||||
if (pageserver_conn_wes != NULL)
|
||||
if (page_servers[shard_no].wes != NULL)
|
||||
{
|
||||
FreeWaitEventSet(pageserver_conn_wes);
|
||||
pageserver_conn_wes = NULL;
|
||||
FreeWaitEventSet(page_servers[shard_no].wes);
|
||||
page_servers[shard_no].wes = NULL;
|
||||
}
|
||||
}
|
||||
|
||||
static bool
|
||||
pageserver_send(NeonRequest * request)
|
||||
pageserver_send(shardno_t shard_no, NeonRequest * request)
|
||||
{
|
||||
StringInfoData req_buff;
|
||||
|
||||
if(CheckConnstringUpdated())
|
||||
{
|
||||
pageserver_disconnect();
|
||||
ReloadConnstring();
|
||||
}
|
||||
PGconn* pageserver_conn = page_servers[shard_no].conn;
|
||||
|
||||
/* If the connection was lost for some reason, reconnect */
|
||||
if (connected && PQstatus(pageserver_conn) == CONNECTION_BAD)
|
||||
if (pageserver_conn && PQstatus(pageserver_conn) == CONNECTION_BAD)
|
||||
{
|
||||
neon_log(LOG, "pageserver_send disconnect bad connection");
|
||||
pageserver_disconnect();
|
||||
pageserver_disconnect(shard_no);
|
||||
}
|
||||
|
||||
req_buff = nm_pack_request(request);
|
||||
@@ -324,18 +388,19 @@ pageserver_send(NeonRequest * request)
|
||||
* See https://github.com/neondatabase/neon/issues/1138
|
||||
* So try to reestablish connection in case of failure.
|
||||
*/
|
||||
if (!connected)
|
||||
if (!page_servers[shard_no].conn)
|
||||
{
|
||||
while (!pageserver_connect(n_reconnect_attempts < max_reconnect_attempts ? LOG : ERROR))
|
||||
while (!pageserver_connect(shard_no, n_reconnect_attempts < max_reconnect_attempts ? LOG : ERROR))
|
||||
{
|
||||
HandleMainLoopInterrupts();
|
||||
n_reconnect_attempts += 1;
|
||||
pg_usleep(RECONNECT_INTERVAL_USEC);
|
||||
}
|
||||
n_reconnect_attempts = 0;
|
||||
}
|
||||
|
||||
/*
|
||||
pageserver_conn = page_servers[shard_no].conn;
|
||||
|
||||
/*
|
||||
* Send request.
|
||||
*
|
||||
* In principle, this could block if the output buffer is full, and we
|
||||
@@ -346,7 +411,7 @@ pageserver_send(NeonRequest * request)
|
||||
if (PQputCopyData(pageserver_conn, req_buff.data, req_buff.len) <= 0)
|
||||
{
|
||||
char *msg = pchomp(PQerrorMessage(pageserver_conn));
|
||||
pageserver_disconnect();
|
||||
pageserver_disconnect(shard_no);
|
||||
neon_log(LOG, "pageserver_send disconnect because failed to send page request (try to reconnect): %s", msg);
|
||||
pfree(msg);
|
||||
pfree(req_buff.data);
|
||||
@@ -366,12 +431,12 @@ pageserver_send(NeonRequest * request)
|
||||
}
|
||||
|
||||
static NeonResponse *
|
||||
pageserver_receive(void)
|
||||
pageserver_receive(shardno_t shard_no)
|
||||
{
|
||||
StringInfoData resp_buff;
|
||||
NeonResponse *resp;
|
||||
|
||||
if (!connected)
|
||||
PGconn* pageserver_conn = page_servers[shard_no].conn;
|
||||
if (!pageserver_conn)
|
||||
return NULL;
|
||||
|
||||
PG_TRY();
|
||||
@@ -379,7 +444,7 @@ pageserver_receive(void)
|
||||
/* read response */
|
||||
int rc;
|
||||
|
||||
rc = call_PQgetCopyData(&resp_buff.data);
|
||||
rc = call_PQgetCopyData(shard_no, &resp_buff.data);
|
||||
if (rc >= 0)
|
||||
{
|
||||
resp_buff.len = rc;
|
||||
@@ -398,25 +463,25 @@ pageserver_receive(void)
|
||||
else if (rc == -1)
|
||||
{
|
||||
neon_log(LOG, "pageserver_receive disconnect because call_PQgetCopyData returns -1: %s", pchomp(PQerrorMessage(pageserver_conn)));
|
||||
pageserver_disconnect();
|
||||
pageserver_disconnect(shard_no);
|
||||
resp = NULL;
|
||||
}
|
||||
else if (rc == -2)
|
||||
{
|
||||
char* msg = pchomp(PQerrorMessage(pageserver_conn));
|
||||
pageserver_disconnect();
|
||||
pageserver_disconnect(shard_no);
|
||||
neon_log(ERROR, "pageserver_receive disconnect because could not read COPY data: %s", msg);
|
||||
}
|
||||
else
|
||||
{
|
||||
pageserver_disconnect();
|
||||
pageserver_disconnect(shard_no);
|
||||
neon_log(ERROR, "pageserver_receive disconnect because unexpected PQgetCopyData return value: %d", rc);
|
||||
}
|
||||
}
|
||||
PG_CATCH();
|
||||
{
|
||||
neon_log(LOG, "pageserver_receive disconnect due to caught exception");
|
||||
pageserver_disconnect();
|
||||
pageserver_disconnect(shard_no);
|
||||
PG_RE_THROW();
|
||||
}
|
||||
PG_END_TRY();
|
||||
@@ -426,9 +491,10 @@ pageserver_receive(void)
|
||||
|
||||
|
||||
static bool
|
||||
pageserver_flush(void)
|
||||
pageserver_flush(shardno_t shard_no)
|
||||
{
|
||||
if (!connected)
|
||||
PGconn* pageserver_conn = page_servers[shard_no].conn;
|
||||
if (!pageserver_conn)
|
||||
{
|
||||
neon_log(WARNING, "Tried to flush while disconnected");
|
||||
}
|
||||
@@ -437,7 +503,7 @@ pageserver_flush(void)
|
||||
if (PQflush(pageserver_conn))
|
||||
{
|
||||
char *msg = pchomp(PQerrorMessage(pageserver_conn));
|
||||
pageserver_disconnect();
|
||||
pageserver_disconnect(shard_no);
|
||||
neon_log(LOG, "pageserver_flush disconnect because failed to flush page requests: %s", msg);
|
||||
pfree(msg);
|
||||
return false;
|
||||
@@ -446,8 +512,7 @@ pageserver_flush(void)
|
||||
return true;
|
||||
}
|
||||
|
||||
page_server_api api =
|
||||
{
|
||||
page_server_api api = {
|
||||
.send = pageserver_send,
|
||||
.flush = pageserver_flush,
|
||||
.receive = pageserver_receive
|
||||
@@ -461,72 +526,12 @@ check_neon_id(char **newval, void **extra, GucSource source)
|
||||
return **newval == '\0' || HexDecodeString(id, *newval, 16);
|
||||
}
|
||||
|
||||
static Size
|
||||
PagestoreShmemSize(void)
|
||||
{
|
||||
return sizeof(PagestoreShmemState);
|
||||
}
|
||||
|
||||
static bool
|
||||
PagestoreShmemInit(void)
|
||||
{
|
||||
bool found;
|
||||
LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
|
||||
pagestore_shared = ShmemInitStruct("libpagestore shared state",
|
||||
PagestoreShmemSize(),
|
||||
&found);
|
||||
if(!found)
|
||||
{
|
||||
pagestore_shared->lock = &(GetNamedLWLockTranche("neon_libpagestore")->lock);
|
||||
pg_atomic_init_u64(&pagestore_shared->update_counter, 0);
|
||||
AssignPageserverConnstring(page_server_connstring, NULL);
|
||||
}
|
||||
LWLockRelease(AddinShmemInitLock);
|
||||
return found;
|
||||
}
|
||||
|
||||
static void
|
||||
pagestore_shmem_startup_hook(void)
|
||||
{
|
||||
if(prev_shmem_startup_hook)
|
||||
prev_shmem_startup_hook();
|
||||
|
||||
PagestoreShmemInit();
|
||||
}
|
||||
|
||||
static void
|
||||
pagestore_shmem_request(void)
|
||||
{
|
||||
#if PG_VERSION_NUM >= 150000
|
||||
if(prev_shmem_request_hook)
|
||||
prev_shmem_request_hook();
|
||||
#endif
|
||||
|
||||
RequestAddinShmemSpace(PagestoreShmemSize());
|
||||
RequestNamedLWLockTranche("neon_libpagestore", 1);
|
||||
}
|
||||
|
||||
static void
|
||||
pagestore_prepare_shmem(void)
|
||||
{
|
||||
#if PG_VERSION_NUM >= 150000
|
||||
prev_shmem_request_hook = shmem_request_hook;
|
||||
shmem_request_hook = pagestore_shmem_request;
|
||||
#else
|
||||
pagestore_shmem_request();
|
||||
#endif
|
||||
prev_shmem_startup_hook = shmem_startup_hook;
|
||||
shmem_startup_hook = pagestore_shmem_startup_hook;
|
||||
}
|
||||
|
||||
/*
|
||||
* Module initialization function
|
||||
*/
|
||||
void
|
||||
pg_init_libpagestore(void)
|
||||
{
|
||||
pagestore_prepare_shmem();
|
||||
|
||||
DefineCustomStringVariable("neon.pageserver_connstring",
|
||||
"connection string to the page server",
|
||||
NULL,
|
||||
@@ -534,7 +539,7 @@ pg_init_libpagestore(void)
|
||||
"",
|
||||
PGC_SIGHUP,
|
||||
0, /* no flags required */
|
||||
CheckPageserverConnstring, AssignPageserverConnstring, NULL);
|
||||
NULL, NULL, NULL);
|
||||
|
||||
DefineCustomStringVariable("neon.timeline_id",
|
||||
"Neon timeline_id the server is running on",
|
||||
@@ -615,5 +620,8 @@ pg_init_libpagestore(void)
|
||||
redo_read_buffer_filter = neon_redo_read_buffer_filter;
|
||||
}
|
||||
|
||||
prev_signal_handler = pqsignal(SIGHUP, pageserver_sighup_handler);
|
||||
|
||||
lfc_init();
|
||||
psm_init();
|
||||
}
|
||||
|
||||
@@ -20,12 +20,25 @@
|
||||
#include RELFILEINFO_HDR
|
||||
#include "storage/block.h"
|
||||
#include "storage/smgr.h"
|
||||
#include "storage/buf_internals.h"
|
||||
#include "lib/stringinfo.h"
|
||||
#include "libpq/pqformat.h"
|
||||
#include "utils/memutils.h"
|
||||
|
||||
#include "pg_config.h"
|
||||
|
||||
#define MAX_SHARDS 128
|
||||
#define STRIPE_SIZE (256 * 1024 / 8) /* TODO: should in betaken from control plane? */
|
||||
#define MAX_PS_CONNSTR_LEN 128
|
||||
|
||||
typedef struct
|
||||
{
|
||||
size_t n_shards;
|
||||
size_t update_counter;
|
||||
char shard_connstr[MAX_SHARDS][MAX_PS_CONNSTR_LEN];
|
||||
} ShardMap;
|
||||
|
||||
|
||||
typedef enum
|
||||
{
|
||||
/* pagestore_client -> pagestore */
|
||||
@@ -144,11 +157,13 @@ extern char *nm_to_string(NeonMessage * msg);
|
||||
* API
|
||||
*/
|
||||
|
||||
typedef unsigned shardno_t;
|
||||
|
||||
typedef struct
|
||||
{
|
||||
bool (*send) (NeonRequest * request);
|
||||
NeonResponse *(*receive) (void);
|
||||
bool (*flush) (void);
|
||||
bool (*send) (shardno_t shard_no, NeonRequest * request);
|
||||
NeonResponse *(*receive) (shardno_t shard_no);
|
||||
bool (*flush) (shardno_t shard_no);
|
||||
} page_server_api;
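
The page_server_api change threads a shardno_t through send, receive, and flush, so each call names the shard connection it operates on instead of assuming a single global pageserver. Roughly, in Rust trait form (an illustrative signature, not an actual interface in this repository):

type ShardNo = u16;

// Shard-aware version of the page_server_api callbacks: every operation
// names the shard whose connection it should use.
trait PageServerApi {
    fn send(&mut self, shard_no: ShardNo, request: &[u8]) -> bool;
    fn receive(&mut self, shard_no: ShardNo) -> Option<Vec<u8>>;
    fn flush(&mut self, shard_no: ShardNo) -> bool;
}

// Trivial in-memory stand-in, just to show the call shape.
struct LoopbackApi;

impl PageServerApi for LoopbackApi {
    fn send(&mut self, shard_no: ShardNo, request: &[u8]) -> bool {
        println!("send {} bytes to shard {shard_no}", request.len());
        true
    }
    fn receive(&mut self, _shard_no: ShardNo) -> Option<Vec<u8>> {
        Some(Vec::new())
    }
    fn flush(&mut self, _shard_no: ShardNo) -> bool {
        true
    }
}

fn main() {
    let mut api = LoopbackApi;
    if api.send(1, b"get_page") && api.flush(1) {
        let _resp = api.receive(1);
    }
}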
|
||||
|
||||
extern void prefetch_on_ps_disconnect(void);
|
||||
@@ -165,6 +180,8 @@ extern char *neon_tenant;
|
||||
extern bool wal_redo;
|
||||
extern int32 max_cluster_size;
|
||||
|
||||
extern shardno_t get_shard_number(BufferTag* tag);
|
||||
|
||||
extern const f_smgr *smgr_neon(BackendId backend, NRelFileInfo rinfo);
|
||||
extern void smgr_init_neon(void);
|
||||
extern void readahead_buffer_resize(int newsize, void *extra);
|
||||
|
||||
@@ -164,6 +164,7 @@ typedef struct PrefetchRequest {
|
||||
XLogRecPtr actual_request_lsn;
|
||||
NeonResponse *response; /* may be null */
|
||||
PrefetchStatus status;
|
||||
shardno_t shard_no;
|
||||
uint64 my_ring_index;
|
||||
} PrefetchRequest;
|
||||
|
||||
@@ -225,6 +226,8 @@ typedef struct PrefetchState {
|
||||
|
||||
/* the buffers */
|
||||
prfh_hash *prf_hash;
|
||||
int max_shard_no;
|
||||
uint8 shard_bitmap[(MAX_SHARDS + 7)/8];
|
||||
PrefetchRequest prf_buffer[]; /* prefetch buffers */
|
||||
} PrefetchState;
|
||||
|
||||
@@ -313,6 +316,7 @@ compact_prefetch_buffers(void)
|
||||
Assert(target_slot->status == PRFS_UNUSED);
|
||||
|
||||
target_slot->buftag = source_slot->buftag;
|
||||
target_slot->shard_no = source_slot->shard_no;
|
||||
target_slot->status = source_slot->status;
|
||||
target_slot->response = source_slot->response;
|
||||
target_slot->effective_request_lsn = source_slot->effective_request_lsn;
|
||||
@@ -477,6 +481,23 @@ prefetch_cleanup_trailing_unused(void)
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
static bool
|
||||
prefetch_flush_requests(void)
|
||||
{
|
||||
for (shardno_t shard_no = 0; shard_no < MyPState->max_shard_no; shard_no++)
|
||||
{
|
||||
if (MyPState->shard_bitmap[shard_no >> 3] & (1 << (shard_no & 7)))
|
||||
{
|
||||
if (!page_server->flush(shard_no))
|
||||
return false;
|
||||
MyPState->shard_bitmap[shard_no >> 3] &= ~(1 << (shard_no & 7));
|
||||
}
|
||||
}
|
||||
MyPState->max_shard_no = 0;
|
||||
return true;
|
||||
}
|
||||
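
prefetch_flush_requests flushes only the shards that have queued requests: prefetch_do_request sets a bit for the target shard and raises max_shard_no, and the flush loop walks the bitmap, flushes each marked shard, and clears its bit. A self-contained Rust sketch of that bookkeeping (names are illustrative):

const MAX_SHARDS: usize = 128;

struct PrefetchState {
    shard_bitmap: [u8; (MAX_SHARDS + 7) / 8],
    max_shard_no: usize,
}

impl PrefetchState {
    fn new() -> Self {
        Self { shard_bitmap: [0; (MAX_SHARDS + 7) / 8], max_shard_no: 0 }
    }

    // Called when a request is sent to `shard_no`: remember it needs a flush.
    fn mark_dirty(&mut self, shard_no: usize) {
        self.shard_bitmap[shard_no >> 3] |= 1 << (shard_no & 7);
        self.max_shard_no = self.max_shard_no.max(shard_no + 1);
    }

    // Flush every shard that has unsent requests; stop early on failure.
    fn flush_requests(&mut self, mut flush: impl FnMut(usize) -> bool) -> bool {
        for shard_no in 0..self.max_shard_no {
            if (self.shard_bitmap[shard_no >> 3] & (1 << (shard_no & 7))) != 0 {
                if !flush(shard_no) {
                    return false;
                }
                self.shard_bitmap[shard_no >> 3] &= !(1 << (shard_no & 7));
            }
        }
        self.max_shard_no = 0;
        true
    }
}

fn main() {
    let mut st = PrefetchState::new();
    st.mark_dirty(0);
    st.mark_dirty(5);
    let ok = st.flush_requests(|shard| {
        println!("flushing shard {shard}");
        true
    });
    assert!(ok);
}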
|
||||
/*
|
||||
* Wait for slot of ring_index to have received its response.
|
||||
* The caller is responsible for making sure the request buffer is flushed.
|
||||
@@ -492,7 +513,7 @@ prefetch_wait_for(uint64 ring_index)
|
||||
if (MyPState->ring_flush <= ring_index &&
|
||||
MyPState->ring_unused > MyPState->ring_flush)
|
||||
{
|
||||
if (!page_server->flush())
|
||||
if (!prefetch_flush_requests())
|
||||
return false;
|
||||
MyPState->ring_flush = MyPState->ring_unused;
|
||||
}
|
||||
@@ -530,7 +551,7 @@ prefetch_read(PrefetchRequest *slot)
|
||||
Assert(slot->my_ring_index == MyPState->ring_receive);
|
||||
|
||||
old = MemoryContextSwitchTo(MyPState->errctx);
|
||||
response = (NeonResponse *) page_server->receive();
|
||||
response = (NeonResponse *) page_server->receive(slot->shard_no);
|
||||
MemoryContextSwitchTo(old);
|
||||
if (response)
|
||||
{
|
||||
@@ -682,12 +703,14 @@ prefetch_do_request(PrefetchRequest *slot, bool *force_latest, XLogRecPtr *force
|
||||
Assert(slot->response == NULL);
|
||||
Assert(slot->my_ring_index == MyPState->ring_unused);
|
||||
|
||||
while (!page_server->send((NeonRequest *) &request));
|
||||
while (!page_server->send(slot->shard_no, (NeonRequest *) &request));
|
||||
|
||||
/* update prefetch state */
|
||||
MyPState->n_requests_inflight += 1;
|
||||
MyPState->n_unused -= 1;
|
||||
MyPState->ring_unused += 1;
|
||||
MyPState->shard_bitmap[slot->shard_no >> 3] |= 1 << (slot->shard_no & 7);
|
||||
MyPState->max_shard_no = Max(slot->shard_no+1, MyPState->max_shard_no);
|
||||
|
||||
/* update slot state */
|
||||
slot->status = PRFS_REQUESTED;
|
||||
@@ -847,6 +870,7 @@ prefetch_register_buffer(BufferTag tag, bool *force_latest, XLogRecPtr *force_ls
|
||||
* function reads the buffer tag from the slot.
|
||||
*/
|
||||
slot->buftag = tag;
|
||||
slot->shard_no = get_shard_number(&tag);
|
||||
slot->my_ring_index = ring_index;
|
||||
|
||||
prefetch_do_request(slot, force_latest, force_lsn);
|
||||
@@ -857,7 +881,7 @@ prefetch_register_buffer(BufferTag tag, bool *force_latest, XLogRecPtr *force_ls
|
||||
if (flush_every_n_requests > 0 &&
|
||||
MyPState->ring_unused - MyPState->ring_flush >= flush_every_n_requests)
|
||||
{
|
||||
if (!page_server->flush())
|
||||
if (!prefetch_flush_requests())
|
||||
{
|
||||
/* Prefetch set is reset in case of error, so we should try to register our request once again */
|
||||
goto Retry;
|
||||
@@ -872,11 +896,34 @@ static NeonResponse *
|
||||
page_server_request(void const *req)
|
||||
{
|
||||
NeonResponse* resp;
|
||||
BufferTag tag = {0};
|
||||
shardno_t shard_no;
|
||||
|
||||
switch (((NeonRequest *) req)->tag)
|
||||
{
|
||||
case T_NeonExistsRequest:
|
||||
CopyNRelFileInfoToBufTag(tag, ((NeonExistsRequest *) req)->rinfo);
|
||||
break;
|
||||
case T_NeonNblocksRequest:
|
||||
CopyNRelFileInfoToBufTag(tag, ((NeonNblocksRequest *) req)->rinfo);
|
||||
break;
|
||||
case T_NeonDbSizeRequest:
|
||||
NInfoGetDbOid(BufTagGetNRelFileInfo(tag)) = ((NeonDbSizeRequest *) req)->dbNode;
|
||||
break;
|
||||
case T_NeonGetPageRequest:
|
||||
CopyNRelFileInfoToBufTag(tag, ((NeonGetPageRequest *) req)->rinfo);
|
||||
tag.blockNum = ((NeonGetPageRequest *) req)->blkno;
|
||||
break;
|
||||
default:
|
||||
elog(ERROR, "Unexpected request tag: %d", ((NeonRequest *) req)->tag);
|
||||
}
|
||||
shard_no = get_shard_number(&tag);
|
||||
|
||||
do {
|
||||
while (!page_server->send((NeonRequest *) req) || !page_server->flush());
|
||||
while (!page_server->send(shard_no, (NeonRequest *) req) || !page_server->flush(shard_no));
|
||||
MyPState->ring_flush = MyPState->ring_unused;
|
||||
consume_prefetch_responses();
|
||||
resp = page_server->receive();
|
||||
resp = page_server->receive(shard_no);
|
||||
} while (resp == NULL);
|
||||
return resp;
|
||||
|
||||
|
||||
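
page_server_request derives a routing key per request type: relation-level requests hash a zeroed block number, the DbSize request only fills the database oid, and GetPage uses the full (relation, block) pair; the resulting tag is then passed to get_shard_number. A hedged Rust sketch of that dispatch (field names are illustrative):

// Which (relation, block) key a request is routed by; DbSize requests only
// carry a database oid, and block-less requests hash block 0 of the relation.
enum NeonRequest {
    Exists { spc: u32, db: u32, rel: u32 },
    Nblocks { spc: u32, db: u32, rel: u32 },
    DbSize { db: u32 },
    GetPage { spc: u32, db: u32, rel: u32, blkno: u32 },
}

fn routing_key(req: &NeonRequest) -> (u32, u32, u32, u32) {
    match req {
        NeonRequest::Exists { spc, db, rel } | NeonRequest::Nblocks { spc, db, rel } => {
            (*spc, *db, *rel, 0)
        }
        NeonRequest::DbSize { db } => (0, *db, 0, 0),
        NeonRequest::GetPage { spc, db, rel, blkno } => (*spc, *db, *rel, *blkno),
    }
}

fn main() {
    let req = NeonRequest::GetPage { spc: 1663, db: 16384, rel: 16385, blkno: 42 };
    let key = routing_key(&req);
    println!("route by {:?}", key); // feed this into the stripe hash to pick a shard
}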
178
poetry.lock
generated
@@ -2,98 +2,98 @@

[[package]]
name = "aiohttp"
version = "3.8.6"
version = "3.8.5"
description = "Async http client/server framework (asyncio)"
optional = false
python-versions = ">=3.6"
files = [
    (generated wheel hash list for aiohttp 3.8.6, replaced by the corresponding 3.8.5 list)
]
|
||||
{file = "aiohttp-3.8.5-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:e5d47ae48db0b2dcf70bc8a3bc72b3de86e2a590fc299fdbbb15af320d2659de"},
|
||||
{file = "aiohttp-3.8.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d827176898a2b0b09694fbd1088c7a31836d1a505c243811c87ae53a3f6273c1"},
|
||||
{file = "aiohttp-3.8.5-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:3562b06567c06439d8b447037bb655ef69786c590b1de86c7ab81efe1c9c15d8"},
|
||||
{file = "aiohttp-3.8.5-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:4e874cbf8caf8959d2adf572a78bba17cb0e9d7e51bb83d86a3697b686a0ab4d"},
|
||||
{file = "aiohttp-3.8.5-cp39-cp39-musllinux_1_1_i686.whl", hash = "sha256:6809a00deaf3810e38c628e9a33271892f815b853605a936e2e9e5129762356c"},
|
||||
{file = "aiohttp-3.8.5-cp39-cp39-musllinux_1_1_ppc64le.whl", hash = "sha256:33776e945d89b29251b33a7e7d006ce86447b2cfd66db5e5ded4e5cd0340585c"},
|
||||
{file = "aiohttp-3.8.5-cp39-cp39-musllinux_1_1_s390x.whl", hash = "sha256:eaeed7abfb5d64c539e2db173f63631455f1196c37d9d8d873fc316470dfbacd"},
|
||||
{file = "aiohttp-3.8.5-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:e91d635961bec2d8f19dfeb41a539eb94bd073f075ca6dae6c8dc0ee89ad6f91"},
|
||||
{file = "aiohttp-3.8.5-cp39-cp39-win32.whl", hash = "sha256:00ad4b6f185ec67f3e6562e8a1d2b69660be43070bd0ef6fcec5211154c7df67"},
|
||||
{file = "aiohttp-3.8.5-cp39-cp39-win_amd64.whl", hash = "sha256:c0a9034379a37ae42dea7ac1e048352d96286626251862e448933c0f59cbd79c"},
|
||||
{file = "aiohttp-3.8.5.tar.gz", hash = "sha256:b9552ec52cc147dbf1944ac7ac98af7602e51ea2dcd076ed194ca3c0d1c7d0bc"},
|
||||
]

[package.dependencies]
@@ -2719,4 +2719,4 @@ cffi = ["cffi (>=1.11)"]
[metadata]
lock-version = "2.0"
python-versions = "^3.9"
content-hash = "0834e5cb69e5457741d4f476c3e49a4dc83598b5730685c8755da651b96ad3ec"
content-hash = "74649cf47c52f21b01b096a42044750b1c9677576b405be0489c2909127a9bf1"

@@ -51,7 +51,6 @@ serde_json.workspace = true
sha2.workspace = true
socket2.workspace = true
sync_wrapper.workspace = true
task-local-extensions.workspace = true
thiserror.workspace = true
tls-listener.workspace = true
tokio-postgres.workspace = true

@@ -4,7 +4,6 @@ use proxy::config::AuthenticationConfig;
use proxy::config::HttpConfig;
use proxy::console;
use proxy::http;
use proxy::rate_limiter::RateLimiterConfig;
use proxy::usage_metrics;

use anyhow::bail;
@@ -96,20 +95,6 @@ struct ProxyCliArgs {
/// Require that all incoming requests have a Proxy Protocol V2 packet **and** have an IP address associated.
#[clap(long, default_value_t = false, value_parser = clap::builder::BoolishValueParser::new(), action = clap::ArgAction::Set)]
require_client_ip: bool,
/// Disable dynamic rate limiter and store the metrics to ensure its production behaviour.
#[clap(long, default_value_t = true, value_parser = clap::builder::BoolishValueParser::new(), action = clap::ArgAction::Set)]
disable_dynamic_rate_limiter: bool,
/// Rate limit algorithm. Makes sense only if `disable_rate_limiter` is `false`.
#[clap(value_enum, long, default_value_t = proxy::rate_limiter::RateLimitAlgorithm::Aimd)]
rate_limit_algorithm: proxy::rate_limiter::RateLimitAlgorithm,
/// Timeout for the rate limiter. If it does not manage to acquire a permit within this time, it returns an error.
#[clap(long, default_value = "15s", value_parser = humantime::parse_duration)]
rate_limiter_timeout: tokio::time::Duration,
/// Initial limit for the dynamic rate limiter. Makes sense only if `rate_limit_algorithm` is *not* `None`.
#[clap(long, default_value_t = 100)]
initial_limit: usize,
#[clap(flatten)]
aimd_config: proxy::rate_limiter::AimdConfig,
}

#[tokio::main]
@@ -228,13 +213,6 @@ fn build_config(args: &ProxyCliArgs) -> anyhow::Result<&'static ProxyConfig> {
and metric-collection-interval must be specified"
),
};
let rate_limiter_config = RateLimiterConfig {
disable: args.disable_dynamic_rate_limiter,
algorithm: args.rate_limit_algorithm,
timeout: args.rate_limiter_timeout,
initial_limit: args.initial_limit,
aimd_config: Some(args.aimd_config),
};

let auth_backend = match &args.auth_backend {
AuthBackend::Console => {
@@ -259,7 +237,7 @@ fn build_config(args: &ProxyCliArgs) -> anyhow::Result<&'static ProxyConfig> {
tokio::spawn(locks.garbage_collect_worker(epoch));

let url = args.auth_endpoint.parse()?;
let endpoint = http::Endpoint::new(url, http::new_client(rate_limiter_config));
let endpoint = http::Endpoint::new(url, http::new_client());

let api = console::provider::neon::Api::new(endpoint, caches, locks);
auth::BackendType::Console(Cow::Owned(api), ())

@@ -13,13 +13,13 @@ pub use reqwest_retry::{policies::ExponentialBackoff, RetryTransientMiddleware};
use tokio::time::Instant;
use tracing::trace;

use crate::{rate_limiter, url::ApiUrl};
use crate::url::ApiUrl;
use reqwest_middleware::RequestBuilder;

/// This is the preferred way to create new http clients,
/// because it takes care of observability (OpenTelemetry).
/// We deliberately don't want to replace this with a public static.
pub fn new_client(rate_limiter_config: rate_limiter::RateLimiterConfig) -> ClientWithMiddleware {
pub fn new_client() -> ClientWithMiddleware {
let client = reqwest::ClientBuilder::new()
.dns_resolver(Arc::new(GaiResolver::default()))
.connection_verbose(true)
@@ -28,7 +28,6 @@ pub fn new_client(rate_limiter_config: rate_limiter::RateLimiterConfig) -> Clien

reqwest_middleware::ClientBuilder::new(client)
.with(reqwest_tracing::TracingMiddleware::default())
.with(rate_limiter::Limiter::new(rate_limiter_config))
.build()
}

@@ -19,7 +19,6 @@ pub mod logging;
pub mod parse;
pub mod protocol2;
pub mod proxy;
pub mod rate_limiter;
pub mod sasl;
pub mod scram;
pub mod serverless;

@@ -19,10 +19,7 @@ use itertools::Itertools;
|
||||
use metrics::{exponential_buckets, register_int_counter_vec, IntCounterVec};
|
||||
use once_cell::sync::{Lazy, OnceCell};
|
||||
use pq_proto::{BeMessage as Be, FeStartupPacket, StartupMessageParams};
|
||||
use prometheus::{
|
||||
register_histogram, register_histogram_vec, register_int_gauge_vec, Histogram, HistogramVec,
|
||||
IntGaugeVec,
|
||||
};
|
||||
use prometheus::{register_histogram_vec, HistogramVec};
|
||||
use regex::Regex;
|
||||
use std::{error::Error, io, ops::ControlFlow, sync::Arc, time::Instant};
|
||||
use tokio::{
|
||||
@@ -110,25 +107,6 @@ static COMPUTE_CONNECTION_LATENCY: Lazy<HistogramVec> = Lazy::new(|| {
|
||||
.unwrap()
|
||||
});
|
||||
|
||||
pub static RATE_LIMITER_ACQUIRE_LATENCY: Lazy<Histogram> = Lazy::new(|| {
|
||||
register_histogram!(
|
||||
"semaphore_control_plane_token_acquire_seconds",
|
||||
"Time it took for proxy to establish a connection to the compute endpoint",
|
||||
// largest bucket = 2^16 * 0.5ms = 32s
|
||||
exponential_buckets(0.0005, 2.0, 16).unwrap(),
|
||||
)
|
||||
.unwrap()
|
||||
});
|
||||
|
||||
pub static RATE_LIMITER_LIMIT: Lazy<IntGaugeVec> = Lazy::new(|| {
|
||||
register_int_gauge_vec!(
|
||||
"semaphore_control_plane_limit",
|
||||
"Current limit of the semaphore control plane",
|
||||
&["limit"], // 2 counters
|
||||
)
|
||||
.unwrap()
|
||||
});
|
||||
|
||||
pub struct LatencyTimer {
|
||||
// time since the stopwatch was started
|
||||
start: Option<Instant>,
|
||||
|
||||
@@ -1,6 +0,0 @@
mod aimd;
mod limit_algorithm;
mod limiter;
pub use aimd::Aimd;
pub use limit_algorithm::{AimdConfig, Fixed, RateLimitAlgorithm, RateLimiterConfig};
pub use limiter::Limiter;
@@ -1,199 +0,0 @@
use std::usize;

use async_trait::async_trait;

use super::limit_algorithm::{AimdConfig, LimitAlgorithm, Sample};

use super::limiter::Outcome;

/// Loss-based congestion avoidance.
///
/// Additive-increase, multiplicative decrease.
///
/// Adds available concurrency when:
/// 1. no load-based errors are observed, and
/// 2. the utilisation of the current limit is high.
///
/// Reduces available concurrency by a factor when load-based errors are detected.
pub struct Aimd {
min_limit: usize,
max_limit: usize,
decrease_factor: f32,
increase_by: usize,
min_utilisation_threshold: f32,
}

impl Aimd {
|
||||
pub fn new(config: AimdConfig) -> Self {
|
||||
Self {
|
||||
min_limit: config.aimd_min_limit,
|
||||
max_limit: config.aimd_max_limit,
|
||||
decrease_factor: config.aimd_decrease_factor,
|
||||
increase_by: config.aimd_increase_by,
|
||||
min_utilisation_threshold: config.aimd_min_utilisation_threshold,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn decrease_factor(self, factor: f32) -> Self {
|
||||
assert!((0.5..1.0).contains(&factor));
|
||||
Self {
|
||||
decrease_factor: factor,
|
||||
..self
|
||||
}
|
||||
}
|
||||
|
||||
pub fn increase_by(self, increase: usize) -> Self {
|
||||
assert!(increase > 0);
|
||||
Self {
|
||||
increase_by: increase,
|
||||
..self
|
||||
}
|
||||
}
|
||||
|
||||
pub fn with_max_limit(self, max: usize) -> Self {
|
||||
assert!(max > 0);
|
||||
Self {
|
||||
max_limit: max,
|
||||
..self
|
||||
}
|
||||
}
|
||||
|
||||
/// A threshold below which the limit won't be increased. 0.5 = 50%.
|
||||
pub fn with_min_utilisation_threshold(self, min_util: f32) -> Self {
|
||||
assert!(min_util > 0. && min_util < 1.);
|
||||
Self {
|
||||
min_utilisation_threshold: min_util,
|
||||
..self
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl LimitAlgorithm for Aimd {
|
||||
async fn update(&mut self, old_limit: usize, sample: Sample) -> usize {
|
||||
use Outcome::*;
|
||||
match sample.outcome {
|
||||
Success => {
|
||||
let utilisation = sample.in_flight as f32 / old_limit as f32;
|
||||
|
||||
if utilisation > self.min_utilisation_threshold {
|
||||
let limit = old_limit + self.increase_by;
|
||||
limit.clamp(self.min_limit, self.max_limit)
|
||||
} else {
|
||||
old_limit
|
||||
}
|
||||
}
|
||||
Overload => {
|
||||
let limit = old_limit as f32 * self.decrease_factor;
|
||||
|
||||
// Floor instead of round, so the limit reduces even with small numbers.
|
||||
// E.g. round(2 * 0.9) = 2, but floor(2 * 0.9) = 1
|
||||
let limit = limit.floor() as usize;
|
||||
|
||||
limit.clamp(self.min_limit, self.max_limit)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
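
Restating the update rule implemented above as a standalone function may help when reading the diff; this is an illustrative sketch only (parameter names mirror the `Aimd` fields), not code taken from the change:

```rust
// Standalone sketch of the AIMD step described above (illustration only).
fn aimd_step(
    old_limit: usize,
    in_flight: usize,
    overloaded: bool,
    increase_by: usize,
    decrease_factor: f32,
    min_limit: usize,
    max_limit: usize,
    min_utilisation_threshold: f32,
) -> usize {
    if overloaded {
        // Multiplicative decrease; floor so small limits still shrink.
        ((old_limit as f32 * decrease_factor).floor() as usize).clamp(min_limit, max_limit)
    } else if (in_flight as f32 / old_limit as f32) > min_utilisation_threshold {
        // Additive increase, only when the current limit is well utilised.
        (old_limit + increase_by).clamp(min_limit, max_limit)
    } else {
        old_limit
    }
}
```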
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use std::sync::Arc;
|
||||
|
||||
use tokio::sync::Notify;
|
||||
|
||||
use super::*;
|
||||
|
||||
use crate::rate_limiter::{Limiter, RateLimiterConfig};
|
||||
|
||||
#[tokio::test]
|
||||
async fn should_decrease_limit_on_overload() {
|
||||
let config = RateLimiterConfig {
|
||||
initial_limit: 10,
|
||||
aimd_config: Some(AimdConfig {
|
||||
aimd_decrease_factor: 0.5,
|
||||
..Default::default()
|
||||
}),
|
||||
disable: false,
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
let release_notifier = Arc::new(Notify::new());
|
||||
|
||||
let limiter = Limiter::new(config).with_release_notifier(release_notifier.clone());
|
||||
|
||||
let token = limiter.try_acquire().unwrap();
|
||||
limiter.release(token, Some(Outcome::Overload)).await;
|
||||
release_notifier.notified().await;
|
||||
assert_eq!(limiter.state().limit(), 5, "overload: decrease");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn should_increase_limit_on_success_when_using_gt_util_threshold() {
|
||||
let config = RateLimiterConfig {
|
||||
initial_limit: 4,
|
||||
aimd_config: Some(AimdConfig {
|
||||
aimd_decrease_factor: 0.5,
|
||||
aimd_min_utilisation_threshold: 0.5,
|
||||
aimd_increase_by: 1,
|
||||
..Default::default()
|
||||
}),
|
||||
disable: false,
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
let limiter = Limiter::new(config);
|
||||
|
||||
let token = limiter.try_acquire().unwrap();
|
||||
let _token = limiter.try_acquire().unwrap();
|
||||
let _token = limiter.try_acquire().unwrap();
|
||||
|
||||
limiter.release(token, Some(Outcome::Success)).await;
|
||||
assert_eq!(limiter.state().limit(), 5, "success: increase");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn should_not_change_limit_on_success_when_using_lt_util_threshold() {
|
||||
let config = RateLimiterConfig {
|
||||
initial_limit: 4,
|
||||
aimd_config: Some(AimdConfig {
|
||||
aimd_decrease_factor: 0.5,
|
||||
aimd_min_utilisation_threshold: 0.5,
|
||||
..Default::default()
|
||||
}),
|
||||
disable: false,
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
let limiter = Limiter::new(config);
|
||||
|
||||
let token = limiter.try_acquire().unwrap();
|
||||
|
||||
limiter.release(token, Some(Outcome::Success)).await;
|
||||
assert_eq!(
|
||||
limiter.state().limit(),
|
||||
4,
|
||||
"success: ignore when < half limit"
|
||||
);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn should_not_change_limit_when_no_outcome() {
|
||||
let config = RateLimiterConfig {
|
||||
initial_limit: 10,
|
||||
aimd_config: Some(AimdConfig {
|
||||
aimd_decrease_factor: 0.5,
|
||||
aimd_min_utilisation_threshold: 0.5,
|
||||
..Default::default()
|
||||
}),
|
||||
disable: false,
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
let limiter = Limiter::new(config);
|
||||
|
||||
let token = limiter.try_acquire().unwrap();
|
||||
limiter.release(token, None).await;
|
||||
assert_eq!(limiter.state().limit(), 10, "ignore");
|
||||
}
|
||||
}
|
||||
@@ -1,98 +0,0 @@
|
||||
//! Algorithms for controlling concurrency limits.
|
||||
use async_trait::async_trait;
|
||||
use std::time::Duration;
|
||||
|
||||
use super::{limiter::Outcome, Aimd};
|
||||
|
||||
/// An algorithm for controlling a concurrency limit.
|
||||
#[async_trait]
|
||||
pub trait LimitAlgorithm: Send + Sync + 'static {
|
||||
/// Update the concurrency limit in response to a new job completion.
|
||||
async fn update(&mut self, old_limit: usize, sample: Sample) -> usize;
|
||||
}
|
||||
|
||||
/// The result of a job (or jobs), including the [Outcome] (loss) and latency (delay).
|
||||
#[derive(Debug, Clone, PartialEq, Eq)]
|
||||
pub struct Sample {
|
||||
pub(crate) latency: Duration,
|
||||
/// Jobs in flight when the sample was taken.
|
||||
pub(crate) in_flight: usize,
|
||||
pub(crate) outcome: Outcome,
|
||||
}
|
||||
|
||||
#[derive(Clone, Copy, Debug, Default, clap::ValueEnum)]
|
||||
pub enum RateLimitAlgorithm {
|
||||
Fixed,
|
||||
#[default]
|
||||
Aimd,
|
||||
}
|
||||
|
||||
pub struct Fixed;
|
||||
|
||||
#[async_trait]
|
||||
impl LimitAlgorithm for Fixed {
|
||||
async fn update(&mut self, old_limit: usize, _sample: Sample) -> usize {
|
||||
old_limit
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Clone, Copy, Debug)]
|
||||
pub struct RateLimiterConfig {
|
||||
pub disable: bool,
|
||||
pub algorithm: RateLimitAlgorithm,
|
||||
pub timeout: Duration,
|
||||
pub initial_limit: usize,
|
||||
pub aimd_config: Option<AimdConfig>,
|
||||
}
|
||||
|
||||
impl RateLimiterConfig {
|
||||
pub fn create_rate_limit_algorithm(self) -> Box<dyn LimitAlgorithm> {
|
||||
match self.algorithm {
|
||||
RateLimitAlgorithm::Fixed => Box::new(Fixed),
|
||||
RateLimitAlgorithm::Aimd => Box::new(Aimd::new(self.aimd_config.unwrap())), // For aimd algorithm config is mandatory.
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for RateLimiterConfig {
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
disable: true,
|
||||
algorithm: RateLimitAlgorithm::Aimd,
|
||||
timeout: Duration::from_secs(1),
|
||||
initial_limit: 100,
|
||||
aimd_config: Some(AimdConfig::default()),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(clap::Parser, Clone, Copy, Debug)]
pub struct AimdConfig {
/// Minimum limit for AIMD algorithm. Makes sense only if `rate_limit_algorithm` is `Aimd`.
#[clap(long, default_value_t = 1)]
pub aimd_min_limit: usize,
/// Maximum limit for AIMD algorithm. Makes sense only if `rate_limit_algorithm` is `Aimd`.
#[clap(long, default_value_t = 1500)]
pub aimd_max_limit: usize,
/// Value by which the AIMD limit is increased on success. Makes sense only if `rate_limit_algorithm` is `Aimd`.
#[clap(long, default_value_t = 10)]
pub aimd_increase_by: usize,
/// Factor by which the AIMD limit is decreased on timeout/429. Makes sense only if `rate_limit_algorithm` is `Aimd`.
#[clap(long, default_value_t = 0.9)]
pub aimd_decrease_factor: f32,
/// A threshold below which the limit won't be increased. Makes sense only if `rate_limit_algorithm` is `Aimd`.
#[clap(long, default_value_t = 0.8)]
pub aimd_min_utilisation_threshold: f32,
}

impl Default for AimdConfig {
fn default() -> Self {
Self {
aimd_min_limit: 1,
aimd_max_limit: 1500,
aimd_increase_by: 10,
aimd_decrease_factor: 0.9,
aimd_min_utilisation_threshold: 0.8,
}
}
}
@@ -1,441 +0,0 @@
|
||||
use std::{
|
||||
sync::{
|
||||
atomic::{AtomicUsize, Ordering},
|
||||
Arc,
|
||||
},
|
||||
time::Duration,
|
||||
};
|
||||
|
||||
use tokio::sync::{Mutex as AsyncMutex, Semaphore, SemaphorePermit};
|
||||
use tokio::time::{timeout, Instant};
|
||||
use tracing::info;
|
||||
|
||||
use super::{
|
||||
limit_algorithm::{LimitAlgorithm, Sample},
|
||||
RateLimiterConfig,
|
||||
};
|
||||
|
||||
/// Limits the number of concurrent jobs.
|
||||
///
|
||||
/// Concurrency is limited through the use of [Token]s. Acquire a token to run a job, and release the
|
||||
/// token once the job is finished.
|
||||
///
|
||||
/// The limit will be automatically adjusted based on observed latency (delay) and/or failures
|
||||
/// caused by overload (loss).
|
||||
pub struct Limiter {
|
||||
limit_algo: AsyncMutex<Box<dyn LimitAlgorithm>>,
|
||||
semaphore: std::sync::Arc<Semaphore>,
|
||||
config: RateLimiterConfig,
|
||||
|
||||
// ONLY WRITE WHEN LIMIT_ALGO IS LOCKED
|
||||
limits: AtomicUsize,
|
||||
|
||||
// ONLY USE ATOMIC ADD/SUB
|
||||
in_flight: Arc<AtomicUsize>,
|
||||
|
||||
#[cfg(test)]
|
||||
notifier: Option<std::sync::Arc<tokio::sync::Notify>>,
|
||||
}
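
The `Limiter` removed by this change exposed a token-based API (`try_acquire`, `acquire_timeout`, `release`, shown further down in this file). A hedged usage sketch, where `do_request` is a hypothetical job and not part of the diff:

```rust
// Hedged usage sketch of the Limiter shown above; `do_request` is hypothetical.
async fn limited_call(limiter: &Limiter) -> anyhow::Result<()> {
    let token = limiter
        .acquire_timeout(std::time::Duration::from_secs(1))
        .await
        .ok_or_else(|| anyhow::anyhow!("rate limited: no permit within the timeout"))?;
    let outcome = do_request().await;
    // Feed the outcome back so the limit algorithm can adjust the concurrency limit.
    limiter.release(token, Some(outcome)).await;
    Ok(())
}

async fn do_request() -> Outcome {
    // Hypothetical job; a real caller would classify its own result.
    Outcome::Success
}
```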
|
||||
|
||||
/// A concurrency token, required to run a job.
|
||||
///
|
||||
/// Release the token back to the [Limiter] after the job is complete.
|
||||
#[derive(Debug)]
|
||||
pub struct Token<'t> {
|
||||
permit: Option<tokio::sync::SemaphorePermit<'t>>,
|
||||
start: Instant,
|
||||
in_flight: Arc<AtomicUsize>,
|
||||
}
|
||||
|
||||
/// A snapshot of the state of the [Limiter].
|
||||
///
|
||||
/// Not guaranteed to be consistent under high concurrency.
|
||||
#[derive(Debug, Clone, Copy)]
|
||||
pub struct LimiterState {
|
||||
limit: usize,
|
||||
available: usize,
|
||||
in_flight: usize,
|
||||
}
|
||||
|
||||
/// Whether a job succeeded or failed as a result of congestion/overload.
|
||||
///
|
||||
/// Errors not considered to be caused by overload should be ignored.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||
pub enum Outcome {
|
||||
/// The job succeeded, or failed in a way unrelated to overload.
|
||||
Success,
|
||||
/// The job failed because of overload, e.g. it timed out or an explicit backpressure signal
|
||||
/// was observed.
|
||||
Overload,
|
||||
}
|
||||
|
||||
impl Outcome {
|
||||
fn from_reqwest_error(error: &reqwest_middleware::Error) -> Self {
|
||||
match error {
|
||||
reqwest_middleware::Error::Middleware(_) => Outcome::Success,
|
||||
reqwest_middleware::Error::Reqwest(e) => {
|
||||
if let Some(status) = e.status() {
|
||||
if status.is_server_error()
|
||||
|| reqwest::StatusCode::TOO_MANY_REQUESTS.as_u16() == status
|
||||
{
|
||||
Outcome::Overload
|
||||
} else {
|
||||
Outcome::Success
|
||||
}
|
||||
} else {
|
||||
Outcome::Success
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
fn from_reqwest_response(response: &reqwest::Response) -> Self {
|
||||
if response.status().is_server_error()
|
||||
|| response.status() == reqwest::StatusCode::TOO_MANY_REQUESTS
|
||||
{
|
||||
Outcome::Overload
|
||||
} else {
|
||||
Outcome::Success
|
||||
}
|
||||
}
|
||||
}
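
Both helpers above reduce to the same classification rule; as a sketch (not code from the diff):

```rust
// Sketch: an HTTP result counts as overload only on a 5xx status or 429.
fn is_overload(status: reqwest::StatusCode) -> bool {
    status.is_server_error() || status == reqwest::StatusCode::TOO_MANY_REQUESTS
}
```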
|
||||
|
||||
impl Limiter {
|
||||
/// Create a limiter with a given limit control algorithm.
|
||||
pub fn new(config: RateLimiterConfig) -> Self {
|
||||
assert!(config.initial_limit > 0);
|
||||
Self {
|
||||
limit_algo: AsyncMutex::new(config.create_rate_limit_algorithm()),
|
||||
semaphore: Arc::new(Semaphore::new(config.initial_limit)),
|
||||
config,
|
||||
limits: AtomicUsize::new(config.initial_limit),
|
||||
in_flight: Arc::new(AtomicUsize::new(0)),
|
||||
#[cfg(test)]
|
||||
notifier: None,
|
||||
}
|
||||
}
|
||||
// pub fn new(limit_algorithm: T, timeout: Duration, initial_limit: usize) -> Self {
|
||||
// assert!(initial_limit > 0);
|
||||
|
||||
// Self {
|
||||
// limit_algo: AsyncMutex::new(limit_algorithm),
|
||||
// semaphore: Arc::new(Semaphore::new(initial_limit)),
|
||||
// timeout,
|
||||
// limits: AtomicUsize::new(initial_limit),
|
||||
// in_flight: Arc::new(AtomicUsize::new(0)),
|
||||
// #[cfg(test)]
|
||||
// notifier: None,
|
||||
// }
|
||||
// }
|
||||
|
||||
/// In some cases [Token]s are acquired asynchronously when updating the limit.
|
||||
#[cfg(test)]
|
||||
pub fn with_release_notifier(mut self, n: std::sync::Arc<tokio::sync::Notify>) -> Self {
|
||||
self.notifier = Some(n);
|
||||
self
|
||||
}
|
||||
|
||||
/// Try to immediately acquire a concurrency [Token].
|
||||
///
|
||||
/// Returns `None` if there are none available.
|
||||
pub fn try_acquire(&self) -> Option<Token> {
|
||||
let result = if self.config.disable {
|
||||
// If the rate limiter is disabled, we can always acquire a token.
|
||||
Some(Token::new(None, self.in_flight.clone()))
|
||||
} else {
|
||||
self.semaphore
|
||||
.try_acquire()
|
||||
.map(|permit| Token::new(Some(permit), self.in_flight.clone()))
|
||||
.ok()
|
||||
};
|
||||
if result.is_some() {
|
||||
self.in_flight.fetch_add(1, Ordering::AcqRel);
|
||||
}
|
||||
result
|
||||
}
|
||||
|
||||
/// Try to acquire a concurrency [Token], waiting for `duration` if there are none available.
|
||||
///
|
||||
/// Returns `None` if there are none available after `duration`.
|
||||
pub async fn acquire_timeout(&self, duration: Duration) -> Option<Token<'_>> {
|
||||
info!("acquiring token: {:?}", self.semaphore.available_permits());
|
||||
let result = if self.config.disable {
|
||||
// If the rate limiter is disabled, we can always acquire a token.
|
||||
Some(Token::new(None, self.in_flight.clone()))
|
||||
} else {
|
||||
match timeout(duration, self.semaphore.acquire()).await {
|
||||
Ok(maybe_permit) => maybe_permit
|
||||
.map(|permit| Token::new(Some(permit), self.in_flight.clone()))
|
||||
.ok(),
|
||||
Err(_) => None,
|
||||
}
|
||||
};
|
||||
if result.is_some() {
|
||||
self.in_flight.fetch_add(1, Ordering::AcqRel);
|
||||
}
|
||||
result
|
||||
}
|
||||
|
||||
/// Return the concurrency [Token], along with the outcome of the job.
|
||||
///
|
||||
/// The [Outcome] of the job, and the time taken to perform it, may be used
|
||||
/// to update the concurrency limit.
|
||||
///
|
||||
/// Set the outcome to `None` to ignore the job.
|
||||
pub async fn release(&self, mut token: Token<'_>, outcome: Option<Outcome>) {
|
||||
tracing::info!("outcome is {:?}", outcome);
|
||||
let in_flight = self.in_flight.load(Ordering::Acquire);
|
||||
let old_limit = self.limits.load(Ordering::Acquire);
|
||||
let available = if self.config.disable {
|
||||
0 // Not used by the algorithm and can be anything; with the limiter disabled, 0 is a sensible value.
|
||||
} else {
|
||||
self.semaphore.available_permits()
|
||||
};
|
||||
let total = in_flight + available;
|
||||
|
||||
let mut algo = self.limit_algo.lock().await;
|
||||
|
||||
let new_limit = if let Some(outcome) = outcome {
|
||||
let sample = Sample {
|
||||
latency: token.start.elapsed(),
|
||||
in_flight,
|
||||
outcome,
|
||||
};
|
||||
algo.update(old_limit, sample).await
|
||||
} else {
|
||||
old_limit
|
||||
};
|
||||
tracing::info!("new limit is {}", new_limit);
|
||||
let actual_limit = if new_limit < total {
|
||||
token.forget();
|
||||
total.saturating_sub(1)
|
||||
} else {
|
||||
if !self.config.disable {
|
||||
self.semaphore.add_permits(new_limit.saturating_sub(total));
|
||||
}
|
||||
new_limit
|
||||
};
|
||||
crate::proxy::RATE_LIMITER_LIMIT
|
||||
.with_label_values(&["expected"])
|
||||
.set(new_limit as i64);
|
||||
crate::proxy::RATE_LIMITER_LIMIT
|
||||
.with_label_values(&["actual"])
|
||||
.set(actual_limit as i64);
|
||||
self.limits.store(new_limit, Ordering::Release);
|
||||
#[cfg(test)]
|
||||
if let Some(n) = &self.notifier {
|
||||
n.notify_one();
|
||||
}
|
||||
}
|
||||
|
||||
/// The current state of the limiter.
|
||||
pub fn state(&self) -> LimiterState {
|
||||
let limit = self.limits.load(Ordering::Relaxed);
|
||||
let in_flight = self.in_flight.load(Ordering::Relaxed);
|
||||
LimiterState {
|
||||
limit,
|
||||
available: limit.saturating_sub(in_flight),
|
||||
in_flight,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl<'t> Token<'t> {
|
||||
fn new(permit: Option<SemaphorePermit<'t>>, in_flight: Arc<AtomicUsize>) -> Self {
|
||||
Self {
|
||||
permit,
|
||||
start: Instant::now(),
|
||||
in_flight,
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
pub fn set_latency(&mut self, latency: Duration) {
|
||||
use std::ops::Sub;
|
||||
|
||||
self.start = Instant::now().sub(latency);
|
||||
}
|
||||
|
||||
pub fn forget(&mut self) {
|
||||
if let Some(permit) = self.permit.take() {
|
||||
permit.forget();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl Drop for Token<'_> {
|
||||
fn drop(&mut self) {
|
||||
self.in_flight.fetch_sub(1, Ordering::AcqRel);
|
||||
}
|
||||
}
|
||||
|
||||
impl LimiterState {
|
||||
/// The current concurrency limit.
|
||||
pub fn limit(&self) -> usize {
|
||||
self.limit
|
||||
}
|
||||
/// The amount of concurrency available to use.
|
||||
pub fn available(&self) -> usize {
|
||||
self.available
|
||||
}
|
||||
/// The number of jobs in flight.
|
||||
pub fn in_flight(&self) -> usize {
|
||||
self.in_flight
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait::async_trait]
|
||||
impl reqwest_middleware::Middleware for Limiter {
|
||||
async fn handle(
|
||||
&self,
|
||||
req: reqwest::Request,
|
||||
extensions: &mut task_local_extensions::Extensions,
|
||||
next: reqwest_middleware::Next<'_>,
|
||||
) -> reqwest_middleware::Result<reqwest::Response> {
|
||||
let start = Instant::now();
|
||||
let token = self
|
||||
.acquire_timeout(self.config.timeout)
|
||||
.await
|
||||
.ok_or_else(|| {
|
||||
reqwest_middleware::Error::Middleware(
|
||||
// TODO: Should we map it into user facing errors?
|
||||
crate::console::errors::ApiError::Console {
|
||||
status: crate::http::StatusCode::TOO_MANY_REQUESTS,
|
||||
text: "Too many requests".into(),
|
||||
}
|
||||
.into(),
|
||||
)
|
||||
})?;
|
||||
info!(duration = ?start.elapsed(), "waiting for token to connect to the control plane");
|
||||
crate::proxy::RATE_LIMITER_ACQUIRE_LATENCY.observe(start.elapsed().as_secs_f64());
|
||||
match next.run(req, extensions).await {
|
||||
Ok(response) => {
|
||||
self.release(token, Some(Outcome::from_reqwest_response(&response)))
|
||||
.await;
|
||||
Ok(response)
|
||||
}
|
||||
Err(e) => {
|
||||
self.release(token, Some(Outcome::from_reqwest_error(&e)))
|
||||
.await;
|
||||
Err(e)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use std::{pin::pin, task::Context, time::Duration};
|
||||
|
||||
use futures::{task::noop_waker_ref, Future};
|
||||
|
||||
use super::{Limiter, Outcome};
|
||||
use crate::rate_limiter::RateLimitAlgorithm;
|
||||
|
||||
#[tokio::test]
|
||||
async fn it_works() {
|
||||
let config = super::RateLimiterConfig {
|
||||
algorithm: RateLimitAlgorithm::Fixed,
|
||||
timeout: Duration::from_secs(1),
|
||||
initial_limit: 10,
|
||||
disable: false,
|
||||
..Default::default()
|
||||
};
|
||||
let limiter = Limiter::new(config);
|
||||
|
||||
let token = limiter.try_acquire().unwrap();
|
||||
|
||||
limiter.release(token, Some(Outcome::Success)).await;
|
||||
|
||||
assert_eq!(limiter.state().limit(), 10);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn is_fair() {
|
||||
let config = super::RateLimiterConfig {
|
||||
algorithm: RateLimitAlgorithm::Fixed,
|
||||
timeout: Duration::from_secs(1),
|
||||
initial_limit: 1,
|
||||
disable: false,
|
||||
..Default::default()
|
||||
};
|
||||
let limiter = Limiter::new(config);
|
||||
|
||||
// === TOKEN 1 ===
|
||||
let token1 = limiter.try_acquire().unwrap();
|
||||
|
||||
let mut token2_fut = pin!(limiter.acquire_timeout(Duration::from_secs(1)));
|
||||
assert!(
|
||||
token2_fut
|
||||
.as_mut()
|
||||
.poll(&mut Context::from_waker(noop_waker_ref()))
|
||||
.is_pending(),
|
||||
"token is acquired by token1"
|
||||
);
|
||||
|
||||
let mut token3_fut = pin!(limiter.acquire_timeout(Duration::from_secs(1)));
|
||||
assert!(
|
||||
token3_fut
|
||||
.as_mut()
|
||||
.poll(&mut Context::from_waker(noop_waker_ref()))
|
||||
.is_pending(),
|
||||
"token is acquired by token1"
|
||||
);
|
||||
|
||||
limiter.release(token1, Some(Outcome::Success)).await;
|
||||
// === END TOKEN 1 ===
|
||||
|
||||
// === TOKEN 2 ===
|
||||
assert!(
|
||||
limiter.try_acquire().is_none(),
|
||||
"token is acquired by token2"
|
||||
);
|
||||
|
||||
assert!(
|
||||
token3_fut
|
||||
.as_mut()
|
||||
.poll(&mut Context::from_waker(noop_waker_ref()))
|
||||
.is_pending(),
|
||||
"token is acquired by token2"
|
||||
);
|
||||
|
||||
let token2 = token2_fut.await.unwrap();
|
||||
|
||||
limiter.release(token2, Some(Outcome::Success)).await;
|
||||
// === END TOKEN 2 ===
|
||||
|
||||
// === TOKEN 3 ===
|
||||
assert!(
|
||||
limiter.try_acquire().is_none(),
|
||||
"token is acquired by token3"
|
||||
);
|
||||
|
||||
let token3 = token3_fut.await.unwrap();
|
||||
limiter.release(token3, Some(Outcome::Success)).await;
|
||||
// === END TOKEN 3 ===
|
||||
|
||||
// === TOKEN 4 ===
|
||||
let token4 = limiter.try_acquire().unwrap();
|
||||
limiter.release(token4, Some(Outcome::Success)).await;
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn disable() {
|
||||
let config = super::RateLimiterConfig {
|
||||
algorithm: RateLimitAlgorithm::Fixed,
|
||||
timeout: Duration::from_secs(1),
|
||||
initial_limit: 1,
|
||||
disable: true,
|
||||
..Default::default()
|
||||
};
|
||||
let limiter = Limiter::new(config);
|
||||
|
||||
// === TOKEN 1 ===
|
||||
let token1 = limiter.try_acquire().unwrap();
|
||||
let token2 = limiter.try_acquire().unwrap();
|
||||
let state = limiter.state();
|
||||
assert_eq!(state.limit(), 1);
|
||||
assert_eq!(state.in_flight(), 2); // For disabled limiter, it's expected.
|
||||
limiter.release(token1, None).await;
|
||||
limiter.release(token2, None).await;
|
||||
}
|
||||
}
|
||||
@@ -249,7 +249,7 @@ mod tests {
|
||||
use url::Url;
|
||||
|
||||
use super::{collect_metrics_iteration, Ids, Metrics};
|
||||
use crate::{http, rate_limiter::RateLimiterConfig};
|
||||
use crate::http;
|
||||
|
||||
#[tokio::test]
|
||||
async fn metrics() {
|
||||
@@ -279,7 +279,7 @@ mod tests {
|
||||
tokio::spawn(server);
|
||||
|
||||
let metrics = Metrics::default();
|
||||
let client = http::new_client(RateLimiterConfig::default());
|
||||
let client = http::new_client();
|
||||
let endpoint = Url::parse(&format!("http://{addr}")).unwrap();
|
||||
let now = Utc::now();
|
||||
|
||||
|
||||
@@ -33,7 +33,7 @@ psutil = "^5.9.4"
types-psutil = "^5.9.5.12"
types-toml = "^0.10.8.6"
pytest-httpserver = "^1.0.8"
aiohttp = "3.8.6"
aiohttp = "3.8.5"
pytest-rerunfailures = "^11.1.2"
types-pytest-lazy-fixture = "^0.6.3.3"
pytest-split = "^0.8.1"

@@ -41,6 +41,7 @@ toml_edit.workspace = true
tracing.workspace = true
url.workspace = true
metrics.workspace = true
pageserver_api.workspace = true
postgres_backend.workspace = true
postgres_ffi.workspace = true
pq_proto.workspace = true

@@ -2,6 +2,7 @@
//! protocol commands.

use anyhow::Context;

use std::str::FromStr;
use std::str::{self};
use std::sync::Arc;
@@ -12,7 +13,7 @@ use crate::auth::check_permission;
use crate::json_ctrl::{handle_json_ctrl, AppendLogicalMessage};

use crate::metrics::{TrafficMetrics, PG_QUERIES_FINISHED, PG_QUERIES_RECEIVED};
use crate::safekeeper::Term;
use crate::send_wal::ReplicationOptions;
use crate::timeline::TimelineError;
use crate::wal_service::ConnectionId;
use crate::{GlobalTimelines, SafeKeeperConf};
@@ -46,7 +47,7 @@ pub struct SafekeeperPostgresHandler {
/// Parsed Postgres command.
enum SafekeeperPostgresCommand {
StartWalPush,
StartReplication { start_lsn: Lsn, term: Option<Term> },
StartReplication(ReplicationOptions),
IdentifySystem,
TimelineStatus,
JSONCtrl { cmd: AppendLogicalMessage },
@@ -58,7 +59,7 @@ fn parse_cmd(cmd: &str) -> anyhow::Result<SafekeeperPostgresCommand> {
} else if cmd.starts_with("START_REPLICATION") {
let re = Regex::new(
// We follow postgres START_REPLICATION LOGICAL options to pass term.
r"START_REPLICATION(?: SLOT [^ ]+)?(?: PHYSICAL)? ([[:xdigit:]]+/[[:xdigit:]]+)(?: \(term='(\d+)'\))?",
r"START_REPLICATION(?: SLOT [^ ]+)?(?: PHYSICAL)? ([[:xdigit:]]+/[[:xdigit:]]+)(?: \(term='(\d+)'\))?(?: \(shard=(.+)\))?",
)
.unwrap();
let caps = re
@@ -71,7 +72,18 @@ fn parse_cmd(cmd: &str) -> anyhow::Result<SafekeeperPostgresCommand> {
} else {
None
};
Ok(SafekeeperPostgresCommand::StartReplication { start_lsn, term })
let shard = if let Some(m) = caps.get(3) {
Some(serde_json::from_str(m.as_str())?)
} else {
None
};
Ok(SafekeeperPostgresCommand::StartReplication(
ReplicationOptions {
start_lsn,
term,
shard,
},
))
} else if cmd.starts_with("IDENTIFY_SYSTEM") {
Ok(SafekeeperPostgresCommand::IdentifySystem)
} else if cmd.starts_with("TIMELINE_STATUS") {
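
As an illustration of what the extended pattern accepts, here is a hedged sketch; the shard payload is a placeholder, since the real value is a serde_json-encoded `ShardIdentity` whose exact shape is defined in `pageserver_api` and not shown in this diff:

```rust
// Hypothetical check of the extended START_REPLICATION pattern; the shard
// payload below is a placeholder, not a real ShardIdentity encoding.
#[test]
fn parses_sharded_start_replication() {
    let re = regex::Regex::new(
        r"START_REPLICATION(?: SLOT [^ ]+)?(?: PHYSICAL)? ([[:xdigit:]]+/[[:xdigit:]]+)(?: \(term='(\d+)'\))?(?: \(shard=(.+)\))?",
    )
    .unwrap();
    let caps = re
        .captures("START_REPLICATION PHYSICAL 0/16B9188 (term='2') (shard={...})")
        .unwrap();
    assert_eq!(caps.get(1).unwrap().as_str(), "0/16B9188"); // start LSN
    assert_eq!(caps.get(2).unwrap().as_str(), "2"); // optional term
    assert_eq!(caps.get(3).unwrap().as_str(), "{...}"); // optional shard JSON, fed to serde_json::from_str
}
```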
@@ -86,7 +98,7 @@ fn parse_cmd(cmd: &str) -> anyhow::Result<SafekeeperPostgresCommand> {
|
||||
}
|
||||
}
|
||||
|
||||
fn cmd_to_string(cmd: &SafekeeperPostgresCommand) -> &str {
|
||||
fn cmd_to_string(cmd: &SafekeeperPostgresCommand) -> &'static str {
|
||||
match cmd {
|
||||
SafekeeperPostgresCommand::StartWalPush => "START_WAL_PUSH",
|
||||
SafekeeperPostgresCommand::StartReplication { .. } => "START_REPLICATION",
|
||||
@@ -228,8 +240,8 @@ impl<IO: AsyncRead + AsyncWrite + Unpin + Send> postgres_backend::Handler<IO>
|
||||
.instrument(info_span!("WAL receiver"))
|
||||
.await
|
||||
}
|
||||
SafekeeperPostgresCommand::StartReplication { start_lsn, term } => {
|
||||
self.handle_start_replication(pgb, start_lsn, term)
|
||||
SafekeeperPostgresCommand::StartReplication(opts) => {
|
||||
self.handle_start_replication(pgb, opts)
|
||||
.instrument(info_span!("WAL sender"))
|
||||
.await
|
||||
}
|
||||
|
||||
@@ -27,6 +27,7 @@ pub mod recovery;
|
||||
pub mod remove_wal;
|
||||
pub mod safekeeper;
|
||||
pub mod send_wal;
|
||||
pub mod send_wal_sharded;
|
||||
pub mod timeline;
|
||||
pub mod wal_backup;
|
||||
pub mod wal_service;
|
||||
|
||||
@@ -6,13 +6,15 @@ use crate::safekeeper::{Term, TermLsn};
|
||||
use crate::timeline::Timeline;
|
||||
use crate::wal_service::ConnectionId;
|
||||
use crate::wal_storage::WalReader;
|
||||
use crate::GlobalTimelines;
|
||||
use crate::{send_wal_sharded, GlobalTimelines};
|
||||
use anyhow::{bail, Context as AnyhowContext};
|
||||
use bytes::Bytes;
|
||||
use pageserver_api::shard::ShardIdentity;
|
||||
use parking_lot::Mutex;
|
||||
use postgres_backend::PostgresBackend;
|
||||
use postgres_backend::{CopyStreamHandlerEnd, PostgresBackendReader, QueryError};
|
||||
use postgres_ffi::get_current_timestamp;
|
||||
use postgres_ffi::waldecoder::WalStreamDecoder;
|
||||
use postgres_ffi::{TimestampTz, MAX_SEND_SIZE};
|
||||
use pq_proto::{BeMessage, WalSndKeepAlive, XLogDataBody};
|
||||
use serde::{Deserialize, Serialize};
|
||||
@@ -31,6 +33,12 @@ use tokio::time::timeout;
|
||||
use tracing::*;
|
||||
use utils::{bin_ser::BeSer, lsn::Lsn};
|
||||
|
||||
pub struct ReplicationOptions {
|
||||
pub start_lsn: Lsn,
|
||||
pub term: Option<Term>,
|
||||
pub shard: Option<ShardIdentity>,
|
||||
}
|
||||
|
||||
// See: https://www.postgresql.org/docs/13/protocol-replication.html
|
||||
const HOT_STANDBY_FEEDBACK_TAG_BYTE: u8 = b'h';
|
||||
const STANDBY_STATUS_UPDATE_TAG_BYTE: u8 = b'r';
|
||||
@@ -349,6 +357,22 @@ impl Drop for WalSenderGuard {
|
||||
}
|
||||
}
|
||||
|
||||
impl WalSenderGuard {
|
||||
pub async fn should_stop(&self, tli: &Arc<Timeline>) -> bool {
|
||||
if let Some(remote_consistent_lsn) = self.walsenders.get_ws_remote_consistent_lsn(self.id) {
|
||||
if tli.should_walsender_stop(remote_consistent_lsn).await {
|
||||
// Terminate if there is nothing more to send.
|
||||
// Note that "ending streaming" part of the string is used by
|
||||
// pageserver to identify WalReceiverError::SuccessfulCompletion,
|
||||
// do not change this string without updating pageserver.
|
||||
return true;
|
||||
}
|
||||
}
|
||||
|
||||
false
|
||||
}
|
||||
}
|
||||
|
||||
impl SafekeeperPostgresHandler {
|
||||
/// Wrapper around handle_start_replication_guts handling result. Error is
|
||||
/// handled here while we're still in walsender ttid span; with API
|
||||
@@ -356,13 +380,9 @@ impl SafekeeperPostgresHandler {
|
||||
pub async fn handle_start_replication<IO: AsyncRead + AsyncWrite + Unpin>(
|
||||
&mut self,
|
||||
pgb: &mut PostgresBackend<IO>,
|
||||
start_pos: Lsn,
|
||||
term: Option<Term>,
|
||||
opts: ReplicationOptions,
|
||||
) -> Result<(), QueryError> {
|
||||
if let Err(end) = self
|
||||
.handle_start_replication_guts(pgb, start_pos, term)
|
||||
.await
|
||||
{
|
||||
if let Err(end) = self.handle_start_replication_guts(pgb, opts).await {
|
||||
// Log the result and probably send it to the client, closing the stream.
|
||||
pgb.handle_copy_stream_end(end).await;
|
||||
}
|
||||
@@ -372,12 +392,12 @@ impl SafekeeperPostgresHandler {
|
||||
pub async fn handle_start_replication_guts<IO: AsyncRead + AsyncWrite + Unpin>(
|
||||
&mut self,
|
||||
pgb: &mut PostgresBackend<IO>,
|
||||
start_pos: Lsn,
|
||||
term: Option<Term>,
|
||||
opts: ReplicationOptions,
|
||||
) -> Result<(), CopyStreamHandlerEnd> {
|
||||
let appname = self.appname.clone();
|
||||
let tli =
|
||||
GlobalTimelines::get(self.ttid).map_err(|e| CopyStreamHandlerEnd::Other(e.into()))?;
|
||||
let start_pos = opts.start_lsn;
|
||||
|
||||
// Use a guard object to remove our entry from the timeline when we are done.
|
||||
let ws_guard = Arc::new(tli.get_walsenders().register(
|
||||
@@ -415,11 +435,13 @@ impl SafekeeperPostgresHandler {
|
||||
}
|
||||
|
||||
info!(
|
||||
"starting streaming from {:?}, available WAL ends at {}, recovery={}, appname={:?}",
|
||||
"starting streaming from {:?}, available WAL ends at {}, recovery={}, appname={:?}, addr={}, shard={:?}",
|
||||
start_pos,
|
||||
end_pos,
|
||||
matches!(end_watch, EndWatch::Flush(_)),
|
||||
appname
|
||||
appname,
|
||||
pgb.get_peer_addr(),
|
||||
opts.shard,
|
||||
);
|
||||
|
||||
// switch to copy
|
||||
@@ -438,23 +460,49 @@ impl SafekeeperPostgresHandler {
|
||||
// not synchronized with sends, so this avoids deadlocks.
|
||||
let reader = pgb.split().context("START_REPLICATION split")?;
|
||||
|
||||
let mut sender = WalSender {
|
||||
pgb,
|
||||
tli: tli.clone(),
|
||||
appname,
|
||||
start_pos,
|
||||
end_pos,
|
||||
term,
|
||||
end_watch,
|
||||
ws_guard: ws_guard.clone(),
|
||||
wal_reader,
|
||||
send_buf: [0; MAX_SEND_SIZE],
|
||||
let ws_guard_clone = ws_guard.clone();
|
||||
let sender_future = async {
|
||||
if let Some(_shard) = opts.shard {
|
||||
send_wal_sharded::WalSender {
|
||||
pgb,
|
||||
tli: tli.clone(),
|
||||
appname,
|
||||
start_pos,
|
||||
end_pos,
|
||||
term: opts.term,
|
||||
end_watch,
|
||||
ws_guard: ws_guard_clone,
|
||||
wal_reader,
|
||||
send_buf: [0; MAX_SEND_SIZE],
|
||||
waldecoder: WalStreamDecoder::new(
|
||||
start_pos,
|
||||
tli.get_state().await.1.server.pg_version / 10000,
|
||||
),
|
||||
}
|
||||
.run()
|
||||
.await
|
||||
} else {
|
||||
WalSender {
|
||||
pgb,
|
||||
tli: tli.clone(),
|
||||
appname,
|
||||
start_pos,
|
||||
end_pos,
|
||||
term: opts.term,
|
||||
end_watch,
|
||||
ws_guard: ws_guard_clone,
|
||||
wal_reader,
|
||||
send_buf: [0; MAX_SEND_SIZE],
|
||||
}
|
||||
.run()
|
||||
.await
|
||||
}
|
||||
};
|
||||
let mut reply_reader = ReplyReader { reader, ws_guard };
|
||||
|
||||
let res = tokio::select! {
|
||||
// todo: add read|write .context to these errors
|
||||
r = sender.run() => r,
|
||||
r = sender_future => r,
|
||||
r = reply_reader.run() => r,
|
||||
};
|
||||
// Join pg backend back.
|
||||
@@ -466,14 +514,14 @@ impl SafekeeperPostgresHandler {
|
||||
|
||||
/// Walsender streams either up to commit_lsn (normally) or flush_lsn in the
|
||||
/// given term (recovery by walproposer or peer safekeeper).
|
||||
enum EndWatch {
|
||||
pub enum EndWatch {
|
||||
Commit(Receiver<Lsn>),
|
||||
Flush(Receiver<TermLsn>),
|
||||
}
|
||||
|
||||
impl EndWatch {
|
||||
/// Get current end of WAL.
|
||||
fn get(&self) -> Lsn {
|
||||
pub fn get(&self) -> Lsn {
|
||||
match self {
|
||||
EndWatch::Commit(r) => *r.borrow(),
|
||||
EndWatch::Flush(r) => r.borrow().lsn,
|
||||
@@ -481,7 +529,7 @@ impl EndWatch {
|
||||
}
|
||||
|
||||
/// Wait for the update.
|
||||
async fn changed(&mut self) -> anyhow::Result<()> {
|
||||
pub async fn changed(&mut self) -> anyhow::Result<()> {
|
||||
match self {
|
||||
EndWatch::Commit(r) => r.changed().await?,
|
||||
EndWatch::Flush(r) => r.changed().await?,
|
||||
@@ -598,21 +646,11 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> WalSender<'_, IO> {
|
||||
// Check for termination only if we are streaming up to commit_lsn
|
||||
// (to pageserver).
|
||||
if let EndWatch::Commit(_) = self.end_watch {
|
||||
if let Some(remote_consistent_lsn) = self
|
||||
.ws_guard
|
||||
.walsenders
|
||||
.get_ws_remote_consistent_lsn(self.ws_guard.id)
|
||||
{
|
||||
if self.tli.should_walsender_stop(remote_consistent_lsn).await {
|
||||
// Terminate if there is nothing more to send.
|
||||
// Note that "ending streaming" part of the string is used by
|
||||
// pageserver to identify WalReceiverError::SuccessfulCompletion,
|
||||
// do not change this string without updating pageserver.
|
||||
return Err(CopyStreamHandlerEnd::ServerInitiated(format!(
|
||||
if self.ws_guard.should_stop(&self.tli).await {
|
||||
return Err(CopyStreamHandlerEnd::ServerInitiated(format!(
|
||||
"ending streaming to {:?} at {}, receiver is caughtup and there is no computes",
|
||||
self.appname, self.start_pos,
|
||||
)));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -685,7 +723,7 @@ const POLL_STATE_TIMEOUT: Duration = Duration::from_secs(1);
|
||||
/// - Ok(None) if timeout expired;
|
||||
/// - Err in case of error -- only if 1) term changed while fetching in recovery
|
||||
/// mode 2) watch channel closed, which must never happen.
|
||||
async fn wait_for_lsn(
|
||||
pub async fn wait_for_lsn(
|
||||
rx: &mut EndWatch,
|
||||
client_term: Option<Term>,
|
||||
start_pos: Lsn,
|
||||
|
||||
safekeeper/src/send_wal_sharded.rs (new file, 160 lines)
@@ -0,0 +1,160 @@
use std::{cmp::min, sync::Arc};
|
||||
|
||||
use anyhow::Context;
|
||||
use postgres_backend::{CopyStreamHandlerEnd, PostgresBackend};
|
||||
use postgres_ffi::{get_current_timestamp, waldecoder::WalStreamDecoder, MAX_SEND_SIZE};
|
||||
use pq_proto::{BeMessage, WalSndKeepAlive, XLogDataBody};
|
||||
use tokio::io::{AsyncRead, AsyncWrite};
|
||||
use tracing::trace;
|
||||
use utils::lsn::Lsn;
|
||||
|
||||
use crate::{
|
||||
safekeeper::Term,
|
||||
send_wal::{wait_for_lsn, EndWatch, WalSenderGuard},
|
||||
timeline::Timeline,
|
||||
wal_storage::WalReader,
|
||||
};
|
||||
|
||||
/// A half driving sending WAL.
|
||||
pub struct WalSender<'a, IO> {
|
||||
pub pgb: &'a mut PostgresBackend<IO>,
|
||||
pub tli: Arc<Timeline>,
|
||||
pub appname: Option<String>,
|
||||
// Position since which we are sending next chunk.
|
||||
pub start_pos: Lsn,
|
||||
// WAL up to this position is known to be locally available.
|
||||
// Usually this is the same as the latest commit_lsn, but in case of
|
||||
// walproposer recovery, this is flush_lsn.
|
||||
//
|
||||
// We send this LSN to the receiver as wal_end, so that it knows how much
|
||||
// WAL this safekeeper has. This LSN should be as fresh as possible.
|
||||
pub end_pos: Lsn,
|
||||
/// When streaming uncommitted part, the term the client acts as the leader
|
||||
/// in. Streaming is stopped if local term changes to a different (higher)
|
||||
/// value.
|
||||
pub term: Option<Term>,
|
||||
/// Watch channel receiver to learn end of available WAL (and wait for its advancement).
|
||||
pub end_watch: EndWatch,
|
||||
pub ws_guard: Arc<WalSenderGuard>,
|
||||
pub wal_reader: WalReader,
|
||||
// buffer for reading WAL into, to send it
|
||||
pub send_buf: [u8; MAX_SEND_SIZE],
|
||||
pub waldecoder: WalStreamDecoder,
|
||||
}
|
||||
|
||||
impl<IO: AsyncRead + AsyncWrite + Unpin> WalSender<'_, IO> {
|
||||
/// Send WAL until
|
||||
/// - an error occurs
|
||||
/// - receiver is caught up and there are no computes (if streaming up to commit_lsn)
|
||||
///
|
||||
/// Err(CopyStreamHandlerEnd) is always returned; Result is used only for ?
|
||||
/// convenience.
|
||||
pub async fn run(&mut self) -> Result<(), CopyStreamHandlerEnd> {
|
||||
loop {
|
||||
// Wait for the next portion if it is not there yet, or just
|
||||
// update our end of WAL available for sending value, we
|
||||
// communicate it to the receiver.
|
||||
self.wait_wal().await?;
|
||||
assert!(
|
||||
self.end_pos > self.start_pos,
|
||||
"nothing to send after waiting for WAL"
|
||||
);
|
||||
|
||||
// try to send as much as available, capped by MAX_SEND_SIZE
|
||||
let mut send_size = self
|
||||
.end_pos
|
||||
.checked_sub(self.start_pos)
|
||||
.context("reading wal without waiting for it first")?
|
||||
.0 as usize;
|
||||
send_size = min(send_size, self.send_buf.len());
|
||||
let send_buf = &mut self.send_buf[..send_size];
|
||||
let send_size: usize;
|
||||
{
|
||||
// If uncommitted part is being pulled, check that the term is
|
||||
// still the expected one.
|
||||
let _term_guard = if let Some(t) = self.term {
|
||||
Some(self.tli.acquire_term(t).await?)
|
||||
} else {
|
||||
None
|
||||
};
|
||||
// read wal into buffer
|
||||
send_size = self.wal_reader.read(send_buf).await?
|
||||
};
|
||||
let send_buf = &send_buf[..send_size];
|
||||
|
||||
// feed waldecoder with the data
|
||||
self.waldecoder.feed_bytes(send_buf);
|
||||
self.start_pos += send_size as u64;
|
||||
|
||||
while let Some((lsn, recdata)) =
|
||||
self.waldecoder.poll_decode().context("wal decoding")?
|
||||
{
|
||||
// It is important to deal with the aligned records as lsn in getPage@LSN is
|
||||
// aligned and can be several bytes bigger. Without this alignment we are
|
||||
// at risk of hitting a deadlock.
|
||||
if !lsn.is_aligned() {
|
||||
return Err(CopyStreamHandlerEnd::ServerInitiated(format!(
|
||||
"unaligned record at {}",
|
||||
lsn
|
||||
)));
|
||||
}
|
||||
|
||||
trace!(
|
||||
"read record of {} bytes of WAL ending at {}",
|
||||
recdata.len(),
|
||||
lsn
|
||||
);
|
||||
|
||||
// and send it
|
||||
self.pgb
|
||||
.write_message(&BeMessage::XLogData(XLogDataBody {
|
||||
wal_start: lsn.0,
|
||||
wal_end: self.end_pos.0,
|
||||
timestamp: get_current_timestamp(),
|
||||
data: &recdata,
|
||||
}))
|
||||
.await?;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// wait until we have WAL to stream, sending keepalives and checking for
|
||||
/// exit in the meanwhile
|
||||
async fn wait_wal(&mut self) -> Result<(), CopyStreamHandlerEnd> {
|
||||
loop {
|
||||
self.end_pos = self.end_watch.get();
|
||||
if self.end_pos > self.start_pos {
|
||||
// We have something to send.
|
||||
trace!("got end_pos {:?}, streaming", self.end_pos);
|
||||
return Ok(());
|
||||
}
|
||||
|
||||
// Wait for WAL to appear, now self.end_pos == self.start_pos.
|
||||
if let Some(lsn) = wait_for_lsn(&mut self.end_watch, self.term, self.start_pos).await? {
|
||||
self.end_pos = lsn;
|
||||
trace!("got end_pos {:?}, streaming", self.end_pos);
|
||||
return Ok(());
|
||||
}
|
||||
|
||||
// Timed out waiting for WAL, check for termination and send KA.
|
||||
// Check for termination only if we are streaming up to commit_lsn
|
||||
// (to pageserver).
|
||||
if let EndWatch::Commit(_) = self.end_watch {
|
||||
if self.ws_guard.should_stop(&self.tli).await {
|
||||
return Err(CopyStreamHandlerEnd::ServerInitiated(format!(
|
||||
"ending streaming to {:?} at {}, receiver is caughtup and there is no computes",
|
||||
self.appname, self.start_pos,
|
||||
)));
|
||||
}
|
||||
}
|
||||
|
||||
self.pgb
|
||||
.write_message(&BeMessage::KeepAlive(WalSndKeepAlive {
|
||||
wal_end: self.end_pos.0,
|
||||
timestamp: get_current_timestamp(),
|
||||
request_reply: true,
|
||||
}))
|
||||
.await?;
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -18,7 +18,7 @@
# 6. We wait for the new pageserver's remote_consistent_lsn to catch up
#
# For more context on how to use this, see:
# https://www.notion.so/neondatabase/Storage-format-migration-9a8eba33ccf8417ea8cf50e6a0c542cf
# https://github.com/neondatabase/cloud/wiki/Storage-format-migration

import argparse
import os

@@ -2179,29 +2179,6 @@ class NeonProxy(PgProtocol):
|
||||
*["--allow-self-signed-compute", "true"],
|
||||
]
|
||||
|
||||
class Console(AuthBackend):
|
||||
def __init__(self, endpoint: str, fixed_rate_limit: Optional[int] = None):
|
||||
self.endpoint = endpoint
|
||||
self.fixed_rate_limit = fixed_rate_limit
|
||||
|
||||
def extra_args(self) -> list[str]:
|
||||
args = [
|
||||
# Console auth backend params
|
||||
*["--auth-backend", "console"],
|
||||
*["--auth-endpoint", self.endpoint],
|
||||
]
|
||||
if self.fixed_rate_limit is not None:
|
||||
args += [
|
||||
*["--disable-dynamic-rate-limiter", "false"],
|
||||
*["--rate-limit-algorithm", "aimd"],
|
||||
*["--initial-limit", str(1)],
|
||||
*["--rate-limiter-timeout", "1s"],
|
||||
*["--aimd-min-limit", "0"],
|
||||
*["--aimd-increase-by", "1"],
|
||||
*["--wake-compute-cache", "size=0"], # Disable cache to test rate limiter.
|
||||
]
|
||||
return args
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class Postgres(AuthBackend):
|
||||
pg_conn_url: str
|
||||
|
||||
@@ -26,7 +26,6 @@ def test_local_corruption(neon_env_builder: NeonEnvBuilder):
|
||||
".*will not become active. Current state: Broken.*",
|
||||
".*failed to load metadata.*",
|
||||
".*load failed.*load local timeline.*",
|
||||
".*layer loading failed permanently: load layer: .*",
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
@@ -1,13 +1,9 @@
import asyncio

from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder
from fixtures.remote_storage import RemoteStorageKind


def test_change_pageserver(neon_env_builder: NeonEnvBuilder):
num_connections = 3

neon_env_builder.num_pageservers = 2
neon_env_builder.enable_pageserver_remote_storage(
remote_storage_kind=RemoteStorageKind.MOCK_S3,
@@ -20,24 +16,15 @@ def test_change_pageserver(neon_env_builder: NeonEnvBuilder):
alt_pageserver_id = env.pageservers[1].id
env.pageservers[1].tenant_attach(env.initial_tenant)

pg_conns = [endpoint.connect() for i in range(num_connections)]
curs = [pg_conn.cursor() for pg_conn in pg_conns]

def execute(statement: str):
for cur in curs:
cur.execute(statement)

def fetchone():
results = [cur.fetchone() for cur in curs]
assert all(result == results[0] for result in results)
return results[0]
pg_conn = endpoint.connect()
cur = pg_conn.cursor()

# Create table, and insert some rows. Make it big enough that it doesn't fit in
# shared_buffers, otherwise the SELECT after restart will just return answer
# from shared_buffers without hitting the page server, which defeats the point
# of this test.
curs[0].execute("CREATE TABLE foo (t text)")
curs[0].execute(
cur.execute("CREATE TABLE foo (t text)")
cur.execute(
"""
INSERT INTO foo
SELECT 'long string to consume some space' || g
@@ -46,25 +33,25 @@ def test_change_pageserver(neon_env_builder: NeonEnvBuilder):
)

# Verify that the table is larger than shared_buffers
curs[0].execute(
cur.execute(
"""
select setting::int * pg_size_bytes(unit) as shared_buffers, pg_relation_size('foo') as tbl_size
from pg_settings where name = 'shared_buffers'
"""
)
row = curs[0].fetchone()
row = cur.fetchone()
assert row is not None
log.info(f"shared_buffers is {row[0]}, table size {row[1]}")
assert int(row[0]) < int(row[1])

execute("SELECT count(*) FROM foo")
assert fetchone() == (100000,)
cur.execute("SELECT count(*) FROM foo")
assert cur.fetchone() == (100000,)

endpoint.reconfigure(pageserver_id=alt_pageserver_id)

# Verify that the neon.pageserver_connstring GUC is set to the correct thing
execute("SELECT setting FROM pg_settings WHERE name='neon.pageserver_connstring'")
connstring = fetchone()
cur.execute("SELECT setting FROM pg_settings WHERE name='neon.pageserver_connstring'")
connstring = cur.fetchone()
assert connstring is not None
expected_connstring = f"postgresql://no_user:@localhost:{env.pageservers[1].service_port.pg}"
assert connstring[0] == expected_connstring
@@ -73,45 +60,5 @@ def test_change_pageserver(neon_env_builder: NeonEnvBuilder):
0
].stop() # Stop the old pageserver just to make sure we're reading from the new one

execute("SELECT count(*) FROM foo")
assert fetchone() == (100000,)

# Try failing back, and this time we will stop the current pageserver before reconfiguring
# the endpoint. Whereas the previous reconfiguration was like a healthy migration, this
# is more like what happens in an unexpected pageserver failure.
env.pageservers[0].start()
env.pageservers[1].stop()

endpoint.reconfigure(pageserver_id=env.pageservers[0].id)

execute("SELECT count(*) FROM foo")
assert fetchone() == (100000,)

env.pageservers[0].stop()
env.pageservers[1].start()

# Test a (former) bug where a child process spins without updating its connection string
# by executing a query separately. This query will hang until we issue the reconfigure.
async def reconfigure_async():
await asyncio.sleep(
1
) # Sleep for 1 second just to make sure we actually started our count(*) query
endpoint.reconfigure(pageserver_id=env.pageservers[1].id)

def execute_count():
execute("SELECT count(*) FROM FOO")

async def execute_and_reconfigure():
task_exec = asyncio.to_thread(execute_count)
task_reconfig = asyncio.create_task(reconfigure_async())
await asyncio.gather(
task_exec,
task_reconfig,
)

asyncio.run(execute_and_reconfigure())
assert fetchone() == (100000,)

# One final check that nothing hangs
execute("SELECT count(*) FROM foo")
assert fetchone() == (100000,)
cur.execute("SELECT count(*) FROM foo")
assert cur.fetchone() == (100000,)
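
The `execute_and_reconfigure` pattern above, a blocking query running in a worker thread while a coroutine issues the reconfigure after a short delay, is generic asyncio. A stripped-down sketch of the same pattern; the `blocking_query` and `trigger_reconfigure` stand-ins below are placeholders, not the test's real calls:

```python
import asyncio
import time


def blocking_query() -> int:
    # Placeholder for a blocking cursor.execute(); sleeps instead of querying.
    time.sleep(2)
    return 100_000


async def trigger_reconfigure() -> None:
    # Give the blocking call a head start, then perform the side effect
    # (in the test this would be endpoint.reconfigure(...)).
    await asyncio.sleep(1)


async def main() -> None:
    result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_query),  # run the blocking call off the event loop
        trigger_reconfigure(),
    )
    assert result == 100_000


asyncio.run(main())
```
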
@@ -449,7 +449,7 @@ def check_neon_works(
)

# Check that project can be recovered from WAL
# loosely based on https://www.notion.so/neondatabase/Storage-Recovery-from-WAL-d92c0aac0ebf40df892b938045d7d720
# loosely based on https://github.com/neondatabase/cloud/wiki/Recovery-from-WAL
tenant_id = snapshot_config["default_tenant_id"]
timeline_id = dict(snapshot_config["branch_name_mappings"]["main"])[tenant_id]
pageserver_port = snapshot_config["pageservers"][0]["listen_http_addr"].split(":")[-1]

@@ -79,32 +79,13 @@ def test_lsn_mapping_old(neon_env_builder: NeonEnvBuilder):
def test_lsn_mapping(neon_env_builder: NeonEnvBuilder):
env = neon_env_builder.init_start()

tenant_id, _ = env.neon_cli.create_tenant(
conf={
# disable default GC and compaction
"gc_period": "1000 m",
"compaction_period": "0 s",
"gc_horizon": f"{1024 ** 2}",
"checkpoint_distance": f"{1024 ** 2}",
"compaction_target_size": f"{1024 ** 2}",
}
)

timeline_id = env.neon_cli.create_branch("test_lsn_mapping", tenant_id=tenant_id)
endpoint_main = env.endpoints.create_start("test_lsn_mapping", tenant_id=tenant_id)
timeline_id = endpoint_main.safe_psql("show neon.timeline_id")[0][0]
log.info("postgres is running on 'main' branch")
new_timeline_id = env.neon_cli.create_branch("test_lsn_mapping")
endpoint_main = env.endpoints.create_start("test_lsn_mapping")
log.info("postgres is running on 'test_lsn_mapping' branch")

cur = endpoint_main.connect().cursor()

# Obtain an lsn before all write operations on this branch
start_lsn = Lsn(query_scalar(cur, "SELECT pg_current_wal_lsn()"))

# Create table, and insert rows, each in a separate transaction
# Disable synchronous_commit to make this initialization go faster.
# Disable `synchronous_commit` to make this initialization go faster.
# XXX: on my laptop this test takes 7s, and setting `synchronous_commit=off`
# doesn't change anything.
#
# Each row contains current insert LSN and the current timestamp, when
# the row was inserted.
@@ -123,63 +104,40 @@ def test_lsn_mapping(neon_env_builder: NeonEnvBuilder):
cur.execute("INSERT INTO foo VALUES (-1)")

# Wait until WAL is received by pageserver
last_flush_lsn = wait_for_last_flush_lsn(env, endpoint_main, tenant_id, timeline_id)
wait_for_last_flush_lsn(env, endpoint_main, env.initial_tenant, new_timeline_id)

with env.pageserver.http_client() as client:
# Check edge cases
# Timestamp is in the future
# Check edge cases: timestamp in the future
probe_timestamp = tbl[-1][1] + timedelta(hours=1)
result = client.timeline_get_lsn_by_timestamp(
tenant_id, timeline_id, f"{probe_timestamp.isoformat()}Z", 2
env.initial_tenant, new_timeline_id, f"{probe_timestamp.isoformat()}Z", 2
)
assert result["kind"] == "future"
# make sure that we return a well advanced lsn here
assert Lsn(result["lsn"]) > start_lsn

# Timestamp is in the unreachable past
# timestamp too the far history
probe_timestamp = tbl[0][1] - timedelta(hours=10)
result = client.timeline_get_lsn_by_timestamp(
tenant_id, timeline_id, f"{probe_timestamp.isoformat()}Z", 2
env.initial_tenant, new_timeline_id, f"{probe_timestamp.isoformat()}Z", 2
)
assert result["kind"] == "past"
# make sure that we return the minimum lsn here at the start of the range
assert Lsn(result["lsn"]) < start_lsn

# Probe a bunch of timestamps in the valid range
for i in range(1, len(tbl), 100):
probe_timestamp = tbl[i][1]
result = client.timeline_get_lsn_by_timestamp(
tenant_id, timeline_id, f"{probe_timestamp.isoformat()}Z", 2
env.initial_tenant, new_timeline_id, f"{probe_timestamp.isoformat()}Z", 2
)
assert result["kind"] not in ["past", "nodata"]
lsn = result["lsn"]
# Call get_lsn_by_timestamp to get the LSN
# Launch a new read-only node at that LSN, and check that only the rows
# that were supposed to be committed at that point in time are visible.
endpoint_here = env.endpoints.create_start(
branch_name="test_lsn_mapping",
endpoint_id="ep-lsn_mapping_read",
lsn=lsn,
tenant_id=tenant_id,
branch_name="test_lsn_mapping", endpoint_id="ep-lsn_mapping_read", lsn=lsn
)
assert endpoint_here.safe_psql("SELECT max(x) FROM foo")[0][0] == i

endpoint_here.stop_and_destroy()

# Do the "past" check again at a new branch to ensure that we don't return something before the branch cutoff
timeline_id_child = env.neon_cli.create_branch(
"test_lsn_mapping_child", tenant_id=tenant_id, ancestor_branch_name="test_lsn_mapping"
)

# Timestamp is in the unreachable past
probe_timestamp = tbl[0][1] - timedelta(hours=10)
result = client.timeline_get_lsn_by_timestamp(
tenant_id, timeline_id_child, f"{probe_timestamp.isoformat()}Z", 2
)
assert result["kind"] == "past"
# make sure that we return the minimum lsn here at the start of the range
assert Lsn(result["lsn"]) >= last_flush_lsn


# Test pageserver get_timestamp_of_lsn API
def test_ts_of_lsn_api(neon_env_builder: NeonEnvBuilder):
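
The probing above asks the pageserver for the last LSN whose commit timestamp is at or before a probe timestamp, with "future" and "past" as the out-of-range answers. A minimal, self-contained sketch of that lookup over a recorded list of `(lsn, timestamp)` pairs; the `lsn_for_timestamp` helper and the sample data are assumptions for illustration, not the pageserver's implementation:

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Sample (lsn, timestamp) pairs in insert order, like the `tbl` rows the test builds.
base = datetime(2024, 1, 1)
tbl = [(100 + i * 8, base + timedelta(seconds=i)) for i in range(10)]


def lsn_for_timestamp(rows, probe):
    """Return ("past" | "present" | "future", lsn) for the probe timestamp."""
    timestamps = [ts for _, ts in rows]
    idx = bisect_right(timestamps, probe)  # rows[:idx] are committed at or before probe
    if idx == 0:
        return "past", rows[0][0]          # probe precedes all recorded commits
    if idx == len(rows):
        return "future", rows[-1][0]       # probe is after the last recorded commit
    return "present", rows[idx - 1][0]


assert lsn_for_timestamp(tbl, base - timedelta(hours=1)) == ("past", 100)
assert lsn_for_timestamp(tbl, base + timedelta(seconds=3)) == ("present", 124)
assert lsn_for_timestamp(tbl, base + timedelta(hours=1)) == ("future", 172)
```
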
@@ -1,4 +1,3 @@
import asyncio
import json
import subprocess
import time
@@ -12,29 +11,6 @@ from fixtures.neon_fixtures import PSQL, NeonProxy, VanillaPostgres
GET_CONNECTION_PID_QUERY = "SELECT pid FROM pg_stat_activity WHERE state = 'active'"


@pytest.mark.asyncio
async def test_http_pool_begin_1(static_proxy: NeonProxy):
static_proxy.safe_psql("create user http_auth with password 'http' superuser")

def query(*args) -> Any:
static_proxy.http_query(
"SELECT pg_sleep(10);",
args,
user="http_auth",
password="http",
expected_code=200,
)

query()
loop = asyncio.get_running_loop()
tasks = [loop.run_in_executor(None, query) for _ in range(10)]
# Wait for all the tasks to complete
completed, pending = await asyncio.wait(tasks)
# Get the results
results = [task.result() for task in completed]
print(results)


def test_proxy_select_1(static_proxy: NeonProxy):
"""
A simplest smoke test: check proxy against a local postgres instance.

@@ -1,84 +0,0 @@
import asyncio
import time
from pathlib import Path
from typing import Iterator

import pytest
from fixtures.neon_fixtures import (
PSQL,
NeonProxy,
)
from fixtures.port_distributor import PortDistributor
from pytest_httpserver import HTTPServer
from werkzeug.wrappers.response import Response


def waiting_handler(status_code: int) -> Response:
# wait more than timeout to make sure that both (two) connections are open.
# It would be better to use a barrier here, but I don't know how to do that together with pytest-httpserver.
time.sleep(2)
return Response(status=status_code)


@pytest.fixture(scope="function")
def proxy_with_rate_limit(
port_distributor: PortDistributor,
neon_binpath: Path,
httpserver_listen_address,
test_output_dir: Path,
) -> Iterator[NeonProxy]:
"""Neon proxy that routes directly to vanilla postgres."""

proxy_port = port_distributor.get_port()
mgmt_port = port_distributor.get_port()
http_port = port_distributor.get_port()
external_http_port = port_distributor.get_port()
(host, port) = httpserver_listen_address
endpoint = f"http://{host}:{port}/billing/api/v1/usage_events"

with NeonProxy(
neon_binpath=neon_binpath,
test_output_dir=test_output_dir,
proxy_port=proxy_port,
http_port=http_port,
mgmt_port=mgmt_port,
external_http_port=external_http_port,
auth_backend=NeonProxy.Console(endpoint, fixed_rate_limit=5),
) as proxy:
proxy.start()
yield proxy


@pytest.mark.asyncio
async def test_proxy_rate_limit(
httpserver: HTTPServer,
proxy_with_rate_limit: NeonProxy,
):
uri = "/billing/api/v1/usage_events/proxy_get_role_secret"
# mock control plane service
httpserver.expect_ordered_request(uri, method="GET").respond_with_handler(
lambda _: Response(status=200)
)
httpserver.expect_ordered_request(uri, method="GET").respond_with_handler(
lambda _: waiting_handler(429)
)
httpserver.expect_ordered_request(uri, method="GET").respond_with_handler(
lambda _: waiting_handler(500)
)

psql = PSQL(host=proxy_with_rate_limit.host, port=proxy_with_rate_limit.proxy_port)
f = await psql.run("select 42;")
await proxy_with_rate_limit.find_auth_link(uri, f)
# Limit should be 2.

# Run two queries in parallel.
f1, f2 = await asyncio.gather(psql.run("select 42;"), psql.run("select 42;"))
await proxy_with_rate_limit.find_auth_link(uri, f1)
await proxy_with_rate_limit.find_auth_link(uri, f2)

# Now limit should be 0.
f = await psql.run("select 42;")
await proxy_with_rate_limit.find_auth_link(uri, f)

# There last query shouldn't reach the http-server.
assert httpserver.assertions == []
@@ -39,7 +39,7 @@ hex = { version = "0.4", features = ["serde"] }
hyper = { version = "0.14", features = ["full"] }
itertools = { version = "0.10" }
libc = { version = "0.2", features = ["extra_traits"] }
log = { version = "0.4", default-features = false, features = ["std"] }
log = { version = "0.4", default-features = false, features = ["kv_unstable", "std"] }
memchr = { version = "2" }
nom = { version = "7" }
num-bigint = { version = "0.4" }
@@ -56,6 +56,7 @@ scopeguard = { version = "1" }
serde = { version = "1", features = ["alloc", "derive"] }
serde_json = { version = "1", features = ["raw_value"] }
smallvec = { version = "1", default-features = false, features = ["write"] }
standback = { version = "0.2", default-features = false, features = ["std"] }
time = { version = "0.3", features = ["local-offset", "macros", "serde-well-known"] }
tokio = { version = "1", features = ["fs", "io-std", "io-util", "macros", "net", "process", "rt-multi-thread", "signal", "test-util"] }
tokio-rustls = { version = "0.24" }
@@ -76,13 +77,14 @@ cc = { version = "1", default-features = false, features = ["parallel"] }
either = { version = "1" }
itertools = { version = "0.10" }
libc = { version = "0.2", features = ["extra_traits"] }
log = { version = "0.4", default-features = false, features = ["std"] }
log = { version = "0.4", default-features = false, features = ["kv_unstable", "std"] }
memchr = { version = "2" }
nom = { version = "7" }
prost = { version = "0.11" }
regex = { version = "1" }
regex-syntax = { version = "0.7" }
serde = { version = "1", features = ["alloc", "derive"] }
standback = { version = "0.2", default-features = false, features = ["std"] }
syn-dff4ba8e3ae991db = { package = "syn", version = "1", features = ["extra-traits", "full", "visit"] }
syn-f595c2ba2a3f28df = { package = "syn", version = "2", features = ["extra-traits", "full", "visit", "visit-mut"] }
time-macros = { version = "0.2", default-features = false, features = ["formatting", "parsing", "serde"] }