Bump postgres version

Switch to safekeeper in the same AZ (#3883)
Add a condition that switches the walreceiver connection to a safekeeper located in the same availability zone. The switch happens when the commit_lsn of a candidate is not less than the commit_lsn of the active connection. This condition is not expected to trigger instantly, because the commit_lsn of the current connection is usually greater than the commit_lsn in updates from the broker. So if WAL is written continuously, the switch can take a long time, but it should happen eventually.

Building neon now requires protoc 3.15+.

Fixes https://github.com/neondatabase/neon/issues/3200
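
The switch condition described in the commit message can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: the `Candidate` struct, the `sk` helper, and `should_switch_to_same_az` are hypothetical names, and two details are assumptions: that "the same availability zone" means the zone of the node running the walreceiver, and that no switch is needed when the active connection is already in that zone.

```rust
/// Hypothetical, simplified view of a safekeeper the walreceiver could connect to.
#[derive(Debug, Clone)]
struct Candidate {
    /// Availability zone of the safekeeper, if known.
    availability_zone: Option<String>,
    /// Latest commit_lsn known for this safekeeper.
    commit_lsn: u64,
}

/// Small constructor to keep the examples short (hypothetical helper).
fn sk(az: Option<&str>, commit_lsn: u64) -> Candidate {
    Candidate {
        availability_zone: az.map(str::to_owned),
        commit_lsn,
    }
}

/// Returns true when the connection should move from `active` to `candidate`
/// because the candidate is in the local availability zone.
fn should_switch_to_same_az(local_az: Option<&str>, active: &Candidate, candidate: &Candidate) -> bool {
    // Without a known local AZ there is nothing to prefer (assumption).
    let az = match local_az {
        Some(az) => az,
        None => return false,
    };
    let candidate_is_local = candidate.availability_zone.as_deref() == Some(az);
    // Assumption: an active connection that is already local is left alone.
    let active_is_local = active.availability_zone.as_deref() == Some(az);
    // From the commit message: the candidate's commit_lsn must be
    // not less than the commit_lsn of the active connection.
    candidate_is_local && !active_is_local && candidate.commit_lsn >= active.commit_lsn
}

fn main() {
    let active = sk(Some("az-b"), 100);
    // The candidate lags behind the active connection: no switch yet.
    println!("{}", should_switch_to_same_az(Some("az-a"), &active, &sk(Some("az-a"), 90)));
    // The candidate has caught up: the switch is allowed.
    println!("{}", should_switch_to_same_az(Some("az-a"), &active, &sk(Some("az-a"), 100)));
}
```

Because the broker usually reports a commit_lsn that trails the active connection, the second case is reached only once the candidate catches up, which matches the "not instantly, but eventually" behaviour described above.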
2026-07-04 04:30:38 +00:00 · 2023-04-03 15:35:43 +03:00 · 2023-04-02 11:32:27 +03:00 · 2023-03-31 21:45:59 +03:00 · 2023-03-31 19:25:53 +03:00 · 2023-03-31 16:11:34 +03:00
54 changed files with 987 additions and 192 deletions
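
Among the changes in this diff, the S3 listing loops in `libs/remote_storage/src/s3_bucket.rs` switch from reading `continuation_token` to reading `next_continuation_token` from each response. In the ListObjectsV2 API, the former echoes the token sent with the request, while the latter is the token for the next page. The sketch below illustrates that loop against an in-memory mock rather than the AWS SDK: `Page`, `list_page`, and `list_all` are hypothetical names, and the mock encodes an offset in the token purely for illustration.

```rust
/// Hypothetical stand-in for one page of a list response.
struct Page {
    keys: Vec<String>,
    /// Token that was sent with this request, echoed back.
    continuation_token: Option<String>,
    /// Token to send with the next request; `None` on the last page.
    next_continuation_token: Option<String>,
}

/// Mock listing: returns at most `max_keys` keys, starting at the offset stored in `token`.
fn list_page(all: &[String], token: Option<&str>, max_keys: usize) -> Page {
    assert!(max_keys > 0, "max_keys must be positive");
    let start = token
        .map(|t| t.parse::<usize>().expect("mock token is an offset"))
        .unwrap_or(0);
    let end = (start + max_keys).min(all.len());
    Page {
        keys: all[start..end].to_vec(),
        continuation_token: token.map(str::to_owned),
        next_continuation_token: if end < all.len() {
            Some(end.to_string())
        } else {
            None
        },
    }
}

/// Collects every key by following `next_continuation_token` until it is absent.
fn list_all(all: &[String], max_keys: usize) -> Vec<String> {
    let mut collected = Vec::new();
    let mut token: Option<String> = None;
    loop {
        let page = list_page(all, token.as_deref(), max_keys);
        // The echoed token only repeats what we sent; it cannot drive the loop.
        debug_assert_eq!(page.continuation_token, token);
        collected.extend(page.keys);
        match page.next_continuation_token {
            Some(next) => token = Some(next),
            None => break,
        }
    }
    collected
}

fn main() {
    // 1 + 2 * max_keys objects, the same sizing the new pagination test uses.
    let keys: Vec<String> = (1..=21).map(|i| format!("sub_prefix_{i}/")).collect();
    let listed = list_all(&keys, 10);
    println!("listed {} of {} keys", listed.len(), keys.len());
}
```

With a small page size such as 10, any listing of more than 10 keys needs several requests, which is why the new test lowers `max_keys_per_list_response` instead of uploading more than 1000 blobs.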
--- a/.github/helm-values/dev-eu-west-1-zeta.neon-proxy-scram.yaml
+++ b/.github/helm-values/dev-eu-west-1-zeta.neon-proxy-scram.yaml
@@ -30,10 +30,9 @@ settings:

 # -- Additional labels for neon-proxy pods
 podLabels:
-  zenith_service: proxy-scram
-  zenith_env: dev
-  zenith_region: eu-west-1
-  zenith_region_slug: eu-west-1
+  neon_service: proxy-scram
+  neon_env: dev
+  neon_region: eu-west-1

 exposedService:
  annotations:
--- a/.github/helm-values/dev-us-east-2-beta.neon-proxy-link.yaml
+++ b/.github/helm-values/dev-us-east-2-beta.neon-proxy-link.yaml
@@ -15,10 +15,9 @@ settings:

 # -- Additional labels for neon-proxy-link pods
 podLabels:
-  zenith_service: proxy
-  zenith_env: dev
-  zenith_region: us-east-2
-  zenith_region_slug: us-east-2
+  neon_service: proxy
+  neon_env: dev
+  neon_region: us-east-2

 service:
  type: LoadBalancer
--- a/.github/helm-values/dev-us-east-2-beta.neon-proxy-scram-legacy.yaml
+++ b/.github/helm-values/dev-us-east-2-beta.neon-proxy-scram-legacy.yaml
@@ -15,10 +15,9 @@ settings:

 # -- Additional labels for neon-proxy pods
 podLabels:
-  zenith_service: proxy-scram-legacy
-  zenith_env: dev
-  zenith_region: us-east-2
-  zenith_region_slug: us-east-2
+  neon_service: proxy-scram-legacy
+  neon_env: dev
+  neon_region: us-east-2

 exposedService:
  annotations:
--- a/.github/helm-values/dev-us-east-2-beta.neon-proxy-scram.yaml
+++ b/.github/helm-values/dev-us-east-2-beta.neon-proxy-scram.yaml
@@ -30,10 +30,9 @@ settings:

 # -- Additional labels for neon-proxy pods
 podLabels:
-  zenith_service: proxy-scram
-  zenith_env: dev
-  zenith_region: us-east-2
-  zenith_region_slug: us-east-2
+  neon_service: proxy-scram
+  neon_env: dev
+  neon_region: us-east-2

 exposedService:
  annotations:
--- a/.github/helm-values/prod-ap-southeast-1-epsilon.neon-proxy-scram.yaml
+++ b/.github/helm-values/prod-ap-southeast-1-epsilon.neon-proxy-scram.yaml
@@ -31,10 +31,9 @@ settings:

 # -- Additional labels for neon-proxy pods
 podLabels:
-  zenith_service: proxy-scram
-  zenith_env: prod
-  zenith_region: ap-southeast-1
-  zenith_region_slug: ap-southeast-1
+  neon_service: proxy-scram
+  neon_env: prod
+  neon_region: ap-southeast-1

 exposedService:
  annotations:
--- a/.github/helm-values/prod-eu-central-1-gamma.neon-proxy-scram.yaml
+++ b/.github/helm-values/prod-eu-central-1-gamma.neon-proxy-scram.yaml
@@ -31,10 +31,9 @@ settings:

 # -- Additional labels for neon-proxy pods
 podLabels:
-  zenith_service: proxy-scram
-  zenith_env: prod
-  zenith_region: eu-central-1
-  zenith_region_slug: eu-central-1
+  neon_service: proxy-scram
+  neon_env: prod
+  neon_region: eu-central-1

 exposedService:
  annotations:
--- a/.github/helm-values/prod-us-east-2-delta.neon-proxy-link.yaml
+++ b/.github/helm-values/prod-us-east-2-delta.neon-proxy-link.yaml
@@ -13,10 +13,9 @@ settings:

 # -- Additional labels for zenith-proxy pods
 podLabels:
-  zenith_service: proxy
-  zenith_env: production
-  zenith_region: us-east-2
-  zenith_region_slug: us-east-2
+  neon_service: proxy
+  neon_env: production
+  neon_region: us-east-2

 service:
  type: LoadBalancer
--- a/.github/helm-values/prod-us-east-2-delta.neon-proxy-scram.yaml
+++ b/.github/helm-values/prod-us-east-2-delta.neon-proxy-scram.yaml
@@ -31,10 +31,9 @@ settings:

 # -- Additional labels for neon-proxy pods
 podLabels:
-  zenith_service: proxy-scram
-  zenith_env: prod
-  zenith_region: us-east-2
-  zenith_region_slug: us-east-2
+  neon_service: proxy-scram
+  neon_env: prod
+  neon_region: us-east-2

 exposedService:
  annotations:
--- a/.github/helm-values/prod-us-west-2-eta.neon-proxy-scram-legacy.yaml
+++ b/.github/helm-values/prod-us-west-2-eta.neon-proxy-scram-legacy.yaml
@@ -31,10 +31,9 @@ settings:

 # -- Additional labels for neon-proxy pods
 podLabels:
-  zenith_service: proxy-scram
-  zenith_env: prod
-  zenith_region: us-west-2
-  zenith_region_slug: us-west-2
+  neon_service: proxy-scram
+  neon_env: prod
+  neon_region: us-west-2

 exposedService:
  annotations:
--- a/.github/helm-values/prod-us-west-2-eta.neon-proxy-scram.yaml
+++ b/.github/helm-values/prod-us-west-2-eta.neon-proxy-scram.yaml
@@ -31,10 +31,9 @@ settings:

 # -- Additional labels for neon-proxy pods
 podLabels:
-  zenith_service: proxy-scram
-  zenith_env: prod
-  zenith_region: us-west-2
-  zenith_region_slug: us-west-2
+  neon_service: proxy-scram
+  neon_env: prod
+  neon_region: us-west-2

 exposedService:
  annotations:
--- a/.github/pull_request_template.md
+++ b/.github/pull_request_template.md
@@ -3,8 +3,12 @@
 ## Issue ticket number and link

 ## Checklist before requesting a review
+
 - [ ] I have performed a self-review of my code.
 - [ ] If it is a core feature, I have added thorough tests.
 - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
 - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

+## Checklist before merging
+
+- [ ] Do not forget to reformat commit message to not include the above checklist
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -184,10 +184,10 @@ jobs:
          CARGO_FEATURES="--features testing"
          if [[ $BUILD_TYPE == "debug" ]]; then
            cov_prefix="scripts/coverage --profraw-prefix=$GITHUB_JOB --dir=/tmp/coverage run"
-            CARGO_FLAGS="--locked $CARGO_FEATURES"
+            CARGO_FLAGS="--locked"
          elif [[ $BUILD_TYPE == "release" ]]; then
            cov_prefix=""
-            CARGO_FLAGS="--locked --release $CARGO_FEATURES"
+            CARGO_FLAGS="--locked --release"
          fi
          echo "cov_prefix=${cov_prefix}" >> $GITHUB_ENV
          echo "CARGO_FEATURES=${CARGO_FEATURES}" >> $GITHUB_ENV
@@ -240,11 +240,18 @@ jobs:

      - name: Run cargo build
        run: |
-          ${cov_prefix} mold -run cargo build $CARGO_FLAGS --bins --tests
+          ${cov_prefix} mold -run cargo build $CARGO_FLAGS $CARGO_FEATURES --bins --tests

      - name: Run cargo test
        run: |
-          ${cov_prefix} cargo test $CARGO_FLAGS
+          ${cov_prefix} cargo test $CARGO_FLAGS $CARGO_FEATURES
+
+          # Run separate tests for real S3
+          export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty
+          export REMOTE_STORAGE_S3_BUCKET=neon-github-public-dev
+          export REMOTE_STORAGE_S3_REGION=eu-central-1
+          # Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
+          ${cov_prefix} cargo test $CARGO_FLAGS --package remote_storage --test pagination_tests -- s3_pagination_should_work --exact

      - name: Install rust binaries
        run: |
@@ -268,7 +275,7 @@ jobs:
            mkdir -p /tmp/neon/test_bin/

            test_exe_paths=$(
-              ${cov_prefix} cargo test $CARGO_FLAGS --message-format=json --no-run |
+              ${cov_prefix} cargo test $CARGO_FLAGS $CARGO_FEATURES --message-format=json --no-run |
              jq -r '.executable | select(. != null)'
            )
            for bin in $test_exe_paths; do
@@ -891,6 +898,16 @@ jobs:
    needs: [ push-docker-hub, tag, regress-tests ]
    if: ( github.ref_name == 'main' || github.ref_name == 'release' ) && github.event_name != 'workflow_dispatch'
    steps:
+      - name: Fix git ownership
+        run: |
+          # Workaround for `fatal: detected dubious ownership in repository at ...`
+          #
+          # Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
+          #   Ref https://github.com/actions/checkout/issues/785
+          #
+          git config --global --add safe.directory ${{ github.workspace }}
+          git config --global --add safe.directory ${GITHUB_WORKSPACE}
+
      - name: Checkout
        uses: actions/checkout@v3
        with:
--- a/.github/workflows/neon_extra_builds.yml
+++ b/.github/workflows/neon_extra_builds.yml
@@ -53,14 +53,14 @@ jobs:
        uses: actions/cache@v3
        with:
          path: pg_install/v14
-          key: v1-${{ runner.os }}-${{ matrix.build_type }}-pg-${{ steps.pg_v14_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
+          key: v1-${{ runner.os }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v14_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}

      - name: Cache postgres v15 build
        id: cache_pg_15
        uses: actions/cache@v3
        with:
          path: pg_install/v15
-          key: v1-${{ runner.os }}-${{ matrix.build_type }}-pg-${{ steps.pg_v15_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
+          key: v1-${{ runner.os }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v15_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}

      - name: Set extra env for macOS
        run: |
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -3086,6 +3086,7 @@ dependencies = [
 "serde",
 "serde_json",
 "tempfile",
+ "test-context",
 "tokio",
 "tokio-util",
 "toml_edit",
@@ -3889,6 +3890,27 @@ dependencies = [
 "winapi-util",
 ]

+[[package]]
+name = "test-context"
+version = "0.1.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "055831a02a4f5aa28fede67f2902014273eb8c21b958ac5ebbd59b71ef30dbc3"
+dependencies = [
+ "async-trait",
+ "futures",
+ "test-context-macros",
+]
+
+[[package]]
+name = "test-context-macros"
+version = "0.1.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8901a55b0a7a06ebc4a674dcca925170da8e613fa3b163a1df804ed10afb154d"
+dependencies = [
+ "quote",
+ "syn",
+]
+
 [[package]]
 name = "textwrap"
 version = "0.16.0"
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -97,6 +97,7 @@ strum_macros = "0.24"
 svg_fmt = "0.4.1"
 sync_wrapper = "0.1.2"
 tar = "0.4"
+test-context = "0.1"
 thiserror = "1.0"
 tls-listener = { version = "0.6", features = ["rustls", "hyper-h1"] }
 tokio = { version = "1.17", features = ["macros"] }
--- a/README.md
+++ b/README.md
@@ -40,6 +40,8 @@ pacman -S base-devel readline zlib libseccomp openssl clang \
 postgresql-libs cmake postgresql protobuf
 ```

+Building Neon requires `protoc` (protobuf-compiler) version 3.15 or newer. If your distribution provides an older version, you can install a newer one from [here](https://github.com/protocolbuffers/protobuf/releases).
+
 2. [Install Rust](https://www.rust-lang.org/tools/install)
 ```
 # recommended approach from https://www.rust-lang.org/tools/install
--- a/compute_tools/src/bin/compute_ctl.rs
+++ b/compute_tools/src/bin/compute_ctl.rs
@@ -203,13 +203,14 @@ fn main() -> Result<()> {
    if delay_exit {
        info!("giving control plane 30s to collect the error before shutdown");
        thread::sleep(Duration::from_secs(30));
-        info!("shutting down");
    }

+    info!("shutting down tracing");
    // Shutdown trace pipeline gracefully, so that it has a chance to send any
    // pending traces before we exit.
    tracing_utils::shutdown_tracing();

+    info!("shutting down");
    exit(exit_code.unwrap_or(1))
 }

--- a/compute_tools/src/pg_helpers.rs
+++ b/compute_tools/src/pg_helpers.rs
@@ -74,18 +74,9 @@ impl GenericOption {
    /// Represent `GenericOption` as configuration option.
    pub fn to_pg_setting(&self) -> String {
        if let Some(val) = &self.value {
-            // TODO: check in the console DB that we don't have these settings
-            // set for any non-deleted project and drop this override.
-            let name = match self.name.as_str() {
-                "safekeepers" => "neon.safekeepers",
-                "wal_acceptor_reconnect" => "neon.safekeeper_reconnect_timeout",
-                "wal_acceptor_connection_timeout" => "neon.safekeeper_connection_timeout",
-                it => it,
-            };
-
            match self.vartype.as_ref() {
-                "string" => format!("{} = '{}'", name, escape_conf_value(val)),
-                _ => format!("{} = {}", name, val),
+                "string" => format!("{} = '{}'", self.name, escape_conf_value(val)),
+                _ => format!("{} = {}", self.name, val),
            }
        } else {
            self.name.to_owned()
--- a/control_plane/src/compute.rs
+++ b/control_plane/src/compute.rs
@@ -90,7 +90,6 @@ impl ComputeControlPlane {
            timeline_id,
            lsn,
            tenant_id,
-            uses_wal_proposer: false,
            pg_version,
        });

@@ -115,7 +114,6 @@ pub struct PostgresNode {
    pub timeline_id: TimelineId,
    pub lsn: Option<Lsn>, // if it's a read-only node. None for primary
    pub tenant_id: TenantId,
-    uses_wal_proposer: bool,
    pg_version: u32,
 }

@@ -149,7 +147,6 @@ impl PostgresNode {
        let port: u16 = conf.parse_field("port", &context)?;
        let timeline_id: TimelineId = conf.parse_field("neon.timeline_id", &context)?;
        let tenant_id: TenantId = conf.parse_field("neon.tenant_id", &context)?;
-        let uses_wal_proposer = conf.get("neon.safekeepers").is_some();

        // Read postgres version from PG_VERSION file to determine which postgres version binary to use.
        // If it doesn't exist, assume broken data directory and use default pg version.
@@ -172,7 +169,6 @@ impl PostgresNode {
            timeline_id,
            lsn: recovery_target_lsn,
            tenant_id,
-            uses_wal_proposer,
            pg_version,
        })
    }
@@ -364,7 +360,7 @@ impl PostgresNode {
    fn load_basebackup(&self, auth_token: &Option<String>) -> Result<()> {
        let backup_lsn = if let Some(lsn) = self.lsn {
            Some(lsn)
-        } else if self.uses_wal_proposer {
+        } else if !self.env.safekeepers.is_empty() {
            // LSN 0 means that it is bootstrap and we need to download just
            // latest data from the pageserver. That is a bit clumsy but whole bootstrap
            // procedure evolves quite actively right now, so let's think about it again
@@ -403,7 +399,7 @@ impl PostgresNode {

    fn pg_ctl(&self, args: &[&str], auth_token: &Option<String>) -> Result<()> {
        let pg_ctl_path = self.env.pg_bin_dir(self.pg_version)?.join("pg_ctl");
-        let mut cmd = Command::new(pg_ctl_path);
+        let mut cmd = Command::new(&pg_ctl_path);
        cmd.args(
            [
                &[
@@ -432,7 +428,9 @@ impl PostgresNode {
            cmd.env("NEON_AUTH_TOKEN", token);
        }

-        let pg_ctl = cmd.output().context("pg_ctl failed")?;
+        let pg_ctl = cmd
+            .output()
+            .context(format!("{} failed", pg_ctl_path.display()))?;
        if !pg_ctl.status.success() {
            anyhow::bail!(
                "pg_ctl failed, exit code: {}, stdout: {}, stderr: {}",
--- a/control_plane/src/safekeeper.rs
+++ b/control_plane/src/safekeeper.rs
@@ -156,7 +156,7 @@ impl SafekeeperNode {
        }

        background_process::start_process(
-            &format!("safekeeper {id}"),
+            &format!("safekeeper-{id}"),
            &datadir,
            &self.env.safekeeper_bin(),
            &args,
--- a/libs/remote_storage/Cargo.toml
+++ b/libs/remote_storage/Cargo.toml
@@ -26,3 +26,4 @@ workspace_hack.workspace = true

 [dev-dependencies]
 tempfile.workspace = true
+test-context.workspace = true
--- a/libs/remote_storage/src/lib.rs
+++ b/libs/remote_storage/src/lib.rs
@@ -39,6 +39,9 @@ pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10;
 /// ~3500 PUT/COPY/POST/DELETE or 5500 GET/HEAD S3 requests
 /// https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
 pub const DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT: usize = 100;
+/// No limits on the client side, which currently means 1000 for AWS S3.
+/// https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_RequestSyntax
+pub const DEFAULT_MAX_KEYS_PER_LIST_RESPONSE: Option<i32> = None;

 const REMOTE_STORAGE_PREFIX_SEPARATOR: char = '/';

@@ -64,6 +67,10 @@ impl RemotePath {
    pub fn object_name(&self) -> Option<&str> {
        self.0.file_name().and_then(|os_str| os_str.to_str())
    }
+
+    pub fn join(&self, segment: &Path) -> Self {
+        Self(self.0.join(segment))
+    }
 }

 /// Storage (potentially remote) API to manage its state.
@@ -266,6 +273,7 @@ pub struct S3Config {
    /// AWS S3 has various limits on its API calls, we need not to exceed those.
    /// See [`DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT`] for more details.
    pub concurrency_limit: NonZeroUsize,
+    pub max_keys_per_list_response: Option<i32>,
 }

 impl Debug for S3Config {
@@ -275,6 +283,10 @@ impl Debug for S3Config {
            .field("bucket_region", &self.bucket_region)
            .field("prefix_in_bucket", &self.prefix_in_bucket)
            .field("concurrency_limit", &self.concurrency_limit)
+            .field(
+                "max_keys_per_list_response",
+                &self.max_keys_per_list_response,
+            )
            .finish()
    }
 }
@@ -303,6 +315,11 @@ impl RemoteStorageConfig {
        )
        .context("Failed to parse 'concurrency_limit' as a positive integer")?;

+        let max_keys_per_list_response =
+            parse_optional_integer::<i32, _>("max_keys_per_list_response", toml)
+                .context("Failed to parse 'max_keys_per_list_response' as a positive integer")?
+                .or(DEFAULT_MAX_KEYS_PER_LIST_RESPONSE);
+
        let storage = match (local_path, bucket_name, bucket_region) {
            // no 'local_path' nor 'bucket_name' options are provided, consider this remote storage disabled
            (None, None, None) => return Ok(None),
@@ -324,6 +341,7 @@ impl RemoteStorageConfig {
                    .map(|endpoint| parse_toml_string("endpoint", endpoint))
                    .transpose()?,
                concurrency_limit,
+                max_keys_per_list_response,
            }),
            (Some(local_path), None, None) => RemoteStorageKind::LocalFs(PathBuf::from(
                parse_toml_string("local_path", local_path)?,
--- a/libs/remote_storage/src/s3_bucket.rs
+++ b/libs/remote_storage/src/s3_bucket.rs
@@ -102,6 +102,7 @@ pub struct S3Bucket {
    client: Client,
    bucket_name: String,
    prefix_in_bucket: Option<String>,
+    max_keys_per_list_response: Option<i32>,
    // Every request to S3 can be throttled or cancelled, if a certain number of requests per second is exceeded.
    // Same goes to IAM, which is queried before every S3 request, if enabled. IAM has even lower RPS threshold.
    // The helps to ensure we don't exceed the thresholds.
@@ -164,6 +165,7 @@ impl S3Bucket {
        Ok(Self {
            client,
            bucket_name: aws_config.bucket_name.clone(),
+            max_keys_per_list_response: aws_config.max_keys_per_list_response,
            prefix_in_bucket,
            concurrency_limiter: Arc::new(Semaphore::new(aws_config.concurrency_limit.get())),
        })
@@ -291,7 +293,9 @@ impl RemoteStorage for S3Bucket {
                .list_objects_v2()
                .bucket(self.bucket_name.clone())
                .set_prefix(self.prefix_in_bucket.clone())
+                .delimiter(REMOTE_STORAGE_PREFIX_SEPARATOR.to_string())
                .set_continuation_token(continuation_token)
+                .set_max_keys(self.max_keys_per_list_response)
                .send()
                .await
                .map_err(|e| {
@@ -306,7 +310,7 @@ impl RemoteStorage for S3Bucket {
                    .filter_map(|o| Some(self.s3_object_to_relative_path(o.key()?))),
            );

-            match fetch_response.continuation_token {
+            match fetch_response.next_continuation_token {
                Some(new_token) => continuation_token = Some(new_token),
                None => break,
            }
@@ -354,6 +358,7 @@ impl RemoteStorage for S3Bucket {
                .set_prefix(list_prefix.clone())
                .set_continuation_token(continuation_token)
                .delimiter(REMOTE_STORAGE_PREFIX_SEPARATOR.to_string())
+                .set_max_keys(self.max_keys_per_list_response)
                .send()
                .await
                .map_err(|e| {
@@ -371,7 +376,7 @@ impl RemoteStorage for S3Bucket {
                    .filter_map(|o| Some(self.s3_object_to_relative_path(o.prefix()?))),
            );

-            match fetch_response.continuation_token {
+            match fetch_response.next_continuation_token {
                Some(new_token) => continuation_token = Some(new_token),
                None => break,
            }
--- a/libs/remote_storage/tests/pagination_tests.rs
+++ b/libs/remote_storage/tests/pagination_tests.rs
@@ -0,0 +1,275 @@
+use std::collections::HashSet;
+use std::env;
+use std::num::{NonZeroU32, NonZeroUsize};
+use std::ops::ControlFlow;
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use std::time::UNIX_EPOCH;
+
+use anyhow::Context;
+use remote_storage::{
+    GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind, S3Config,
+};
+use test_context::{test_context, AsyncTestContext};
+use tokio::task::JoinSet;
+use tracing::{debug, error, info};
+
+const ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME: &str = "ENABLE_REAL_S3_REMOTE_STORAGE";
+
+/// Tests that the S3 client can list all prefixes, even if the response comes paginated and requires multiple S3 queries.
+/// Uses real S3 and requires [`ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME`] and related S3 cred env vars specified.
+/// See the client creation in [`create_s3_client`] for details on the required env vars.
+/// If real S3 tests are disabled, the test passes, skipping any real test run: currently, there's no way to mark the test ignored in runtime with the
+/// default test framework, see https://github.com/rust-lang/rust/issues/68007 for details.
+///
+/// First, the test creates a set of S3 objects with keys `/${random_prefix_part}/${base_prefix_str}/sub_prefix_${i}/blob_${i}` in [`upload_s3_data`]
+/// where
+/// * `random_prefix_part` is set for the entire S3 client during the S3 client creation in [`create_s3_client`], to avoid multiple test runs interference
+/// * `base_prefix_str` is a common prefix to use in the client requests: we would want to ensure that the client is able to list nested prefixes inside the bucket
+///
+/// Then, verifies that the client does return correct prefixes when queried:
+/// * with no prefix, it lists everything after its `${random_prefix_part}/` — that should be `${base_prefix_str}` value only
+/// * with `${base_prefix_str}/` prefix, it lists every `sub_prefix_${i}`
+///
+/// With the real S3 enabled and `#[cfg(test)]` Rust configuration used, the S3 client test adds a `max-keys` param to limit the response keys.
+/// This way, we are able to test the pagination implicitly, by ensuring all results are returned from the remote storage and avoid uploading too many blobs to S3,
+/// since current default AWS S3 pagination limit is 1000.
+/// (see https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_RequestSyntax)
+///
+/// Lastly, the test attempts to clean up and remove all uploaded S3 files.
+/// If any errors appear during the clean up, they get logged, but the test is not failed or stopped until clean up is finished.
+#[test_context(MaybeEnabledS3)]
+#[tokio::test]
+async fn s3_pagination_should_work(ctx: &mut MaybeEnabledS3) -> anyhow::Result<()> {
+    let ctx = match ctx {
+        MaybeEnabledS3::Enabled(ctx) => ctx,
+        MaybeEnabledS3::Disabled => return Ok(()),
+        MaybeEnabledS3::UploadsFailed(e, _) => anyhow::bail!("S3 init failed: {e:?}"),
+    };
+
+    let test_client = Arc::clone(&ctx.client_with_excessive_pagination);
+    let expected_remote_prefixes = ctx.remote_prefixes.clone();
+
+    let base_prefix =
+        RemotePath::new(Path::new(ctx.base_prefix_str)).context("common_prefix construction")?;
+    let root_remote_prefixes = test_client
+        .list_prefixes(None)
+        .await
+        .context("client list root prefixes failure")?
+        .into_iter()
+        .collect::<HashSet<_>>();
+    assert_eq!(
+        root_remote_prefixes, HashSet::from([base_prefix.clone()]),
+        "remote storage root prefixes list mismatches with the uploads. Returned prefixes: {root_remote_prefixes:?}"
+    );
+
+    let nested_remote_prefixes = test_client
+        .list_prefixes(Some(&base_prefix))
+        .await
+        .context("client list nested prefixes failure")?
+        .into_iter()
+        .collect::<HashSet<_>>();
+    let remote_only_prefixes = nested_remote_prefixes
+        .difference(&expected_remote_prefixes)
+        .collect::<HashSet<_>>();
+    let missing_uploaded_prefixes = expected_remote_prefixes
+        .difference(&nested_remote_prefixes)
+        .collect::<HashSet<_>>();
+    assert_eq!(
+        remote_only_prefixes.len() + missing_uploaded_prefixes.len(), 0,
+        "remote storage nested prefixes list mismatches with the uploads. Remote only prefixes: {remote_only_prefixes:?}, missing uploaded prefixes: {missing_uploaded_prefixes:?}",
+    );
+
+    Ok(())
+}
+
+enum MaybeEnabledS3 {
+    Enabled(S3WithTestBlobs),
+    Disabled,
+    UploadsFailed(anyhow::Error, S3WithTestBlobs),
+}
+
+struct S3WithTestBlobs {
+    client_with_excessive_pagination: Arc<GenericRemoteStorage>,
+    base_prefix_str: &'static str,
+    remote_prefixes: HashSet<RemotePath>,
+    remote_blobs: HashSet<RemotePath>,
+}
+
+#[async_trait::async_trait]
+impl AsyncTestContext for MaybeEnabledS3 {
+    async fn setup() -> Self {
+        utils::logging::init(utils::logging::LogFormat::Test).expect("logging init failed");
+        if env::var(ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME).is_err() {
+            info!(
+                "`{}` env variable is not set, skipping the test",
+                ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME
+            );
+            return Self::Disabled;
+        }
+
+        let max_keys_in_list_response = 10;
+        let upload_tasks_count = 1 + (2 * usize::try_from(max_keys_in_list_response).unwrap());
+
+        let client_with_excessive_pagination = create_s3_client(max_keys_in_list_response)
+            .context("S3 client creation")
+            .expect("S3 client creation failed");
+
+        let base_prefix_str = "test/";
+        match upload_s3_data(
+            &client_with_excessive_pagination,
+            base_prefix_str,
+            upload_tasks_count,
+        )
+        .await
+        {
+            ControlFlow::Continue(uploads) => {
+                info!("Remote objects created successfully");
+                Self::Enabled(S3WithTestBlobs {
+                    client_with_excessive_pagination,
+                    base_prefix_str,
+                    remote_prefixes: uploads.prefixes,
+                    remote_blobs: uploads.blobs,
+                })
+            }
+            ControlFlow::Break(uploads) => Self::UploadsFailed(
+                anyhow::anyhow!("One or multiple blobs failed to upload to S3"),
+                S3WithTestBlobs {
+                    client_with_excessive_pagination,
+                    base_prefix_str,
+                    remote_prefixes: uploads.prefixes,
+                    remote_blobs: uploads.blobs,
+                },
+            ),
+        }
+    }
+
+    async fn teardown(self) {
+        match self {
+            Self::Disabled => {}
+            Self::Enabled(ctx) | Self::UploadsFailed(_, ctx) => {
+                cleanup(&ctx.client_with_excessive_pagination, ctx.remote_blobs).await;
+            }
+        }
+    }
+}
+
+fn create_s3_client(max_keys_per_list_response: i32) -> anyhow::Result<Arc<GenericRemoteStorage>> {
+    let remote_storage_s3_bucket = env::var("REMOTE_STORAGE_S3_BUCKET")
+        .context("`REMOTE_STORAGE_S3_BUCKET` env var is not set, but real S3 tests are enabled")?;
+    let remote_storage_s3_region = env::var("REMOTE_STORAGE_S3_REGION")
+        .context("`REMOTE_STORAGE_S3_REGION` env var is not set, but real S3 tests are enabled")?;
+    let random_prefix_part = std::time::SystemTime::now()
+        .duration_since(UNIX_EPOCH)
+        .context("random s3 test prefix part calculation")?
+        .as_millis();
+    let remote_storage_config = RemoteStorageConfig {
+        max_concurrent_syncs: NonZeroUsize::new(100).unwrap(),
+        max_sync_errors: NonZeroU32::new(5).unwrap(),
+        storage: RemoteStorageKind::AwsS3(S3Config {
+            bucket_name: remote_storage_s3_bucket,
+            bucket_region: remote_storage_s3_region,
+            prefix_in_bucket: Some(format!("pagination_should_work_test_{random_prefix_part}/")),
+            endpoint: None,
+            concurrency_limit: NonZeroUsize::new(100).unwrap(),
+            max_keys_per_list_response: Some(max_keys_per_list_response),
+        }),
+    };
+    Ok(Arc::new(
+        GenericRemoteStorage::from_config(&remote_storage_config).context("remote storage init")?,
+    ))
+}
+
+struct Uploads {
+    prefixes: HashSet<RemotePath>,
+    blobs: HashSet<RemotePath>,
+}
+
+async fn upload_s3_data(
+    client: &Arc<GenericRemoteStorage>,
+    base_prefix_str: &'static str,
+    upload_tasks_count: usize,
+) -> ControlFlow<Uploads, Uploads> {
+    info!("Creating {upload_tasks_count} S3 files");
+    let mut upload_tasks = JoinSet::new();
+    for i in 1..upload_tasks_count + 1 {
+        let task_client = Arc::clone(client);
+        upload_tasks.spawn(async move {
+            let prefix = PathBuf::from(format!("{base_prefix_str}/sub_prefix_{i}/"));
+            let blob_prefix = RemotePath::new(&prefix)
+                .with_context(|| format!("{prefix:?} to RemotePath conversion"))?;
+            let blob_path = blob_prefix.join(Path::new(&format!("blob_{i}")));
+            debug!("Creating remote item {i} at path {blob_path:?}");
+
+            let data = format!("remote blob data {i}").into_bytes();
+            let data_len = data.len();
+            task_client
+                .upload(
+                    Box::new(std::io::Cursor::new(data)),
+                    data_len,
+                    &blob_path,
+                    None,
+                )
+                .await?;
+
+            Ok::<_, anyhow::Error>((blob_prefix, blob_path))
+        });
+    }
+
+    let mut upload_tasks_failed = false;
+    let mut uploaded_prefixes = HashSet::with_capacity(upload_tasks_count);
+    let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
+    while let Some(task_run_result) = upload_tasks.join_next().await {
+        match task_run_result
+            .context("task join failed")
+            .and_then(|task_result| task_result.context("upload task failed"))
+        {
+            Ok((upload_prefix, upload_path)) => {
+                uploaded_prefixes.insert(upload_prefix);
+                uploaded_blobs.insert(upload_path);
+            }
+            Err(e) => {
+                error!("Upload task failed: {e:?}");
+                upload_tasks_failed = true;
+            }
+        }
+    }
+
+    let uploads = Uploads {
+        prefixes: uploaded_prefixes,
+        blobs: uploaded_blobs,
+    };
+    if upload_tasks_failed {
+        ControlFlow::Break(uploads)
+    } else {
+        ControlFlow::Continue(uploads)
+    }
+}
+
+async fn cleanup(client: &Arc<GenericRemoteStorage>, objects_to_delete: HashSet<RemotePath>) {
+    info!(
+        "Removing {} objects from the remote storage during cleanup",
+        objects_to_delete.len()
+    );
+    let mut delete_tasks = JoinSet::new();
+    for object_to_delete in objects_to_delete {
+        let task_client = Arc::clone(client);
+        delete_tasks.spawn(async move {
+            debug!("Deleting remote item at path {object_to_delete:?}");
+            task_client
+                .delete(&object_to_delete)
+                .await
+                .with_context(|| format!("{object_to_delete:?} removal"))
+        });
+    }
+
+    while let Some(task_run_result) = delete_tasks.join_next().await {
+        match task_run_result {
+            Ok(task_result) => match task_result {
+                Ok(()) => {}
+                Err(e) => error!("Delete task failed: {e:?}"),
+            },
+            Err(join_err) => error!("Delete task did not finish correctly: {join_err}"),
+        }
+    }
+}
--- a/libs/utils/src/http/error.rs
+++ b/libs/utils/src/http/error.rs
@@ -20,6 +20,9 @@ pub enum ApiError {
    #[error("Conflict: {0}")]
    Conflict(String),

+    #[error("Precondition failed: {0}")]
+    PreconditionFailed(&'static str),
+
    #[error(transparent)]
    InternalServerError(anyhow::Error),
 }
@@ -44,6 +47,10 @@ impl ApiError {
            ApiError::Conflict(_) => {
                HttpErrorBody::response_from_msg_and_status(self.to_string(), StatusCode::CONFLICT)
            }
+            ApiError::PreconditionFailed(_) => HttpErrorBody::response_from_msg_and_status(
+                self.to_string(),
+                StatusCode::PRECONDITION_FAILED,
+            ),
            ApiError::InternalServerError(err) => HttpErrorBody::response_from_msg_and_status(
                err.to_string(),
                StatusCode::INTERNAL_SERVER_ERROR,
--- a/libs/utils/src/signals.rs
+++ b/libs/utils/src/signals.rs
@@ -1,25 +1,7 @@
-use signal_hook::flag;
 use signal_hook::iterator::Signals;
-use std::sync::atomic::AtomicBool;
-use std::sync::Arc;

 pub use signal_hook::consts::{signal::*, TERM_SIGNALS};

-pub fn install_shutdown_handlers() -> anyhow::Result<ShutdownSignals> {
-    let term_now = Arc::new(AtomicBool::new(false));
-    for sig in TERM_SIGNALS {
-        // When terminated by a second term signal, exit with exit code 1.
-        // This will do nothing the first time (because term_now is false).
-        flag::register_conditional_shutdown(*sig, 1, Arc::clone(&term_now))?;
-        // But this will "arm" the above for the second time, by setting it to true.
-        // The order of registering these is important, if you put this one first, it will
-        // first arm and then terminate ‒ all in the first round.
-        flag::register(*sig, Arc::clone(&term_now))?;
-    }
-
-    Ok(ShutdownSignals)
-}
-
 pub enum Signal {
    Quit,
    Interrupt,
@@ -39,10 +21,7 @@ impl Signal {
 pub struct ShutdownSignals;

 impl ShutdownSignals {
-    pub fn handle(
-        self,
-        mut handler: impl FnMut(Signal) -> anyhow::Result<()>,
-    ) -> anyhow::Result<()> {
+    pub fn handle(mut handler: impl FnMut(Signal) -> anyhow::Result<()>) -> anyhow::Result<()> {
        for raw_signal in Signals::new(TERM_SIGNALS)?.into_iter() {
            let signal = match raw_signal {
                SIGINT => Signal::Interrupt,
--- a/pageserver/src/bin/pageserver.rs
+++ b/pageserver/src/bin/pageserver.rs
@@ -25,11 +25,9 @@ use pageserver::{
    virtual_file,
 };
 use postgres_backend::AuthType;
+use utils::signals::ShutdownSignals;
 use utils::{
-    auth::JwtAuth,
-    logging, project_git_version,
-    sentry_init::init_sentry,
-    signals::{self, Signal},
+    auth::JwtAuth, logging, project_git_version, sentry_init::init_sentry, signals::Signal,
    tcp_listener,
 };

@@ -264,9 +262,6 @@ fn start_pageserver(
    info!("Starting pageserver pg protocol handler on {pg_addr}");
    let pageserver_listener = tcp_listener::bind(pg_addr)?;

-    // Install signal handlers
-    let signals = signals::install_shutdown_handlers()?;
-
    // Launch broker client
    WALRECEIVER_RUNTIME.block_on(pageserver::broker_client::init_broker_client(conf))?;

@@ -430,7 +425,7 @@ fn start_pageserver(
    }

    // All started up! Now just sit and wait for shutdown signal.
-    signals.handle(|signal| match signal {
+    ShutdownSignals::handle(|signal| match signal {
        Signal::Quit => {
            info!(
                "Got {}. Terminating in immediate shutdown mode",
--- a/pageserver/src/config.rs
+++ b/pageserver/src/config.rs
@@ -170,6 +170,10 @@ pub struct PageServerConf {

    /// Number of concurrent [`Tenant::gather_size_inputs`] allowed.
    pub concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore,
+    /// Limit of concurrent [`Tenant::gather_size_inputs`] issued by module `eviction_task`.
+    /// The number of permits is the same as `concurrent_tenant_size_logical_size_queries`.
+    /// See the comment in `eviction_task` for details.
+    pub eviction_task_immitated_concurrent_logical_size_queries: ConfigurableSemaphore,

    // How often to collect metrics and send them to the metrics endpoint.
    pub metric_collection_interval: Duration,
@@ -246,7 +250,7 @@ struct PageServerConfigBuilder {

    log_format: BuilderValue<LogFormat>,

-    concurrent_tenant_size_logical_size_queries: BuilderValue<ConfigurableSemaphore>,
+    concurrent_tenant_size_logical_size_queries: BuilderValue<NonZeroUsize>,

    metric_collection_interval: BuilderValue<Duration>,
    cached_metric_collection_interval: BuilderValue<Duration>,
@@ -295,7 +299,9 @@ impl Default for PageServerConfigBuilder {
            .expect("cannot parse default keepalive interval")),
            log_format: Set(LogFormat::from_str(DEFAULT_LOG_FORMAT).unwrap()),

-            concurrent_tenant_size_logical_size_queries: Set(ConfigurableSemaphore::default()),
+            concurrent_tenant_size_logical_size_queries: Set(
+                ConfigurableSemaphore::DEFAULT_INITIAL,
+            ),
            metric_collection_interval: Set(humantime::parse_duration(
                DEFAULT_METRIC_COLLECTION_INTERVAL,
            )
@@ -400,7 +406,7 @@ impl PageServerConfigBuilder {
        self.log_format = BuilderValue::Set(log_format)
    }

-    pub fn concurrent_tenant_size_logical_size_queries(&mut self, u: ConfigurableSemaphore) {
+    pub fn concurrent_tenant_size_logical_size_queries(&mut self, u: NonZeroUsize) {
        self.concurrent_tenant_size_logical_size_queries = BuilderValue::Set(u);
    }

@@ -449,6 +455,11 @@ impl PageServerConfigBuilder {
    }

    pub fn build(self) -> anyhow::Result<PageServerConf> {
+        let concurrent_tenant_size_logical_size_queries = self
+            .concurrent_tenant_size_logical_size_queries
+            .ok_or(anyhow!(
+                "missing concurrent_tenant_size_logical_size_queries"
+            ))?;
        Ok(PageServerConf {
            listen_pg_addr: self
                .listen_pg_addr
@@ -496,11 +507,12 @@ impl PageServerConfigBuilder {
                .broker_keepalive_interval
                .ok_or(anyhow!("No broker keepalive interval provided"))?,
            log_format: self.log_format.ok_or(anyhow!("missing log_format"))?,
-            concurrent_tenant_size_logical_size_queries: self
-                .concurrent_tenant_size_logical_size_queries
-                .ok_or(anyhow!(
-                    "missing concurrent_tenant_size_logical_size_queries"
-                ))?,
+            concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::new(
+                concurrent_tenant_size_logical_size_queries,
+            ),
+            eviction_task_immitated_concurrent_logical_size_queries: ConfigurableSemaphore::new(
+                concurrent_tenant_size_logical_size_queries,
+            ),
            metric_collection_interval: self
                .metric_collection_interval
                .ok_or(anyhow!("missing metric_collection_interval"))?,
@@ -698,8 +710,7 @@ impl PageServerConf {
                "concurrent_tenant_size_logical_size_queries" => builder.concurrent_tenant_size_logical_size_queries({
                    let input = parse_toml_string(key, item)?;
                    let permits = input.parse::<usize>().context("expected a number of initial permits, not {s:?}")?;
-                    let permits = NonZeroUsize::new(permits).context("initial semaphore permits out of range: 0, use other configuration to disable a feature")?;
-                    ConfigurableSemaphore::new(permits)
+                    NonZeroUsize::new(permits).context("initial semaphore permits out of range: 0, use other configuration to disable a feature")?
                }),
                "metric_collection_interval" => builder.metric_collection_interval(parse_toml_duration(key, item)?),
                "cached_metric_collection_interval" => builder.cached_metric_collection_interval(parse_toml_duration(key, item)?),
@@ -860,6 +871,8 @@ impl PageServerConf {
            broker_keepalive_interval: Duration::from_secs(5000),
            log_format: LogFormat::from_str(defaults::DEFAULT_LOG_FORMAT).unwrap(),
            concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
+            eviction_task_immitated_concurrent_logical_size_queries: ConfigurableSemaphore::default(
+            ),
            metric_collection_interval: Duration::from_secs(60),
            cached_metric_collection_interval: Duration::from_secs(60 * 60),
            metric_collection_endpoint: defaults::DEFAULT_METRIC_COLLECTION_ENDPOINT,
@@ -953,6 +966,11 @@ impl ConfigurableSemaphore {
            inner: std::sync::Arc::new(tokio::sync::Semaphore::new(initial_permits.get())),
        }
    }
+
+    /// Returns the configured amount of permits.
+    pub fn initial_permits(&self) -> NonZeroUsize {
+        self.initial_permits
+    }
 }

 impl Default for ConfigurableSemaphore {
@@ -1057,6 +1075,8 @@ log_format = 'json'
                )?,
                log_format: LogFormat::from_str(defaults::DEFAULT_LOG_FORMAT).unwrap(),
                concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
+                eviction_task_immitated_concurrent_logical_size_queries:
+                    ConfigurableSemaphore::default(),
                metric_collection_interval: humantime::parse_duration(
                    defaults::DEFAULT_METRIC_COLLECTION_INTERVAL
                )?,
@@ -1118,6 +1138,8 @@ log_format = 'json'
                broker_keepalive_interval: Duration::from_secs(5),
                log_format: LogFormat::Json,
                concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
+                eviction_task_immitated_concurrent_logical_size_queries:
+                    ConfigurableSemaphore::default(),
                metric_collection_interval: Duration::from_secs(222),
                cached_metric_collection_interval: Duration::from_secs(22200),
                metric_collection_endpoint: Some(Url::parse("http://localhost:80/metrics")?),
@@ -1250,6 +1272,7 @@ broker_endpoint = '{broker_endpoint}'
                        prefix_in_bucket: Some(prefix_in_bucket.clone()),
                        endpoint: Some(endpoint.clone()),
                        concurrency_limit: s3_concurrency_limit,
+                        max_keys_per_list_response: None,
                    }),
                },
                "Remote storage config should correctly parse the S3 config"
--- a/pageserver/src/disk_usage_eviction_task.rs
+++ b/pageserver/src/disk_usage_eviction_task.rs
@@ -22,7 +22,8 @@
 //! If the actual usage is higher, the threshold is exceeded.
 //! `min_avail_bytes` is the absolute available space in bytes.
 //! If the actual usage is lower, the threshold is exceeded.
-//!
+//! If either of these thresholds is exceeded, the system is considered to have "disk pressure", and eviction
+//! is performed on the next iteration, to release disk space and bring the usage below the thresholds again.
 //! The iteration evicts layers in LRU fashion, but, with a weak reservation per tenant.
 //! The reservation is to keep the most recently accessed X bytes per tenant resident.
 //! If we cannot relieve pressure by evicting layers outside of the reservation, we
@@ -34,7 +35,11 @@
 //! The idea is to allow at least one layer to be resident per tenant, to ensure it can make forward progress
 //! during page reconstruction.
 //! An alternative default for all tenants can be specified in the `tenant_config` section of the config.
-//! Lastly, each tenant can have an override in their respectice tenant config (`min_resident_size_override`).
+//! Lastly, each tenant can have an override in their respective tenant config (`min_resident_size_override`).
+
+// Implementation notes:
+// - The `#[allow(dead_code)]` attributes above various structs suppress warnings about fields that
+//   are only read by the Debug impl. We do use the Debug impl for semi-structured logging, though.

 use std::{
    collections::HashMap,
@@ -224,6 +229,7 @@ pub enum IterationOutcome<U> {
    Finished(IterationOutcomeFinished<U>),
 }

+#[allow(dead_code)]
 #[derive(Debug, Serialize)]
 pub struct IterationOutcomeFinished<U> {
    /// The actual usage observed before we started the iteration.
@@ -238,6 +244,7 @@ pub struct IterationOutcomeFinished<U> {
 }

 #[derive(Debug, Serialize)]
+#[allow(dead_code)]
 struct AssumedUsage<U> {
    /// The expected value for `after`, after phase 2.
    projected_after: U,
@@ -245,12 +252,14 @@ struct AssumedUsage<U> {
    failed: LayerCount,
 }

+#[allow(dead_code)]
 #[derive(Debug, Serialize)]
 struct PlannedUsage<U> {
    respecting_tenant_min_resident_size: U,
    fallback_to_global_lru: Option<U>,
 }

+#[allow(dead_code)]
 #[derive(Debug, Default, Serialize)]
 struct LayerCount {
    file_sizes: u64,
@@ -608,6 +617,7 @@ mod filesystem_level_usage {
    use super::DiskUsageEvictionTaskConfig;

    #[derive(Debug, Clone, Copy)]
+    #[allow(dead_code)]
    pub struct Usage<'a> {
        config: &'a DiskUsageEvictionTaskConfig,

--- a/pageserver/src/http/openapi_spec.yml
+++ b/pageserver/src/http/openapi_spec.yml
@@ -214,6 +214,13 @@ paths:
            application/json:
              schema:
                $ref: "#/components/schemas/NotFoundError"
+        "412":
+          description: Tenant is missing
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/PreconditionFailedError"
+
        "500":
          description: Generic operation error
          content:
@@ -891,13 +898,9 @@ components:
      type: object
      properties:
        tenant_specific_overrides:
-          type: object
-          schema:
-            $ref: "#/components/schemas/TenantConfigInfo"
+          $ref: "#/components/schemas/TenantConfigInfo"
        effective_config:
-          type: object
-          schema:
-            $ref: "#/components/schemas/TenantConfigInfo"
+          $ref: "#/components/schemas/TenantConfigInfo"
    TimelineInfo:
      type: object
      required:
@@ -983,6 +986,13 @@ components:
      properties:
        msg:
          type: string
+    PreconditionFailedError:
+      type: object
+      required:
+        - msg
+      properties:
+        msg:
+          type: string

 security:
  - JWT: []
--- a/pageserver/src/http/routes.rs
+++ b/pageserver/src/http/routes.rs
@@ -152,6 +152,11 @@ impl From<crate::tenant::mgr::DeleteTimelineError> for ApiError {
    fn from(value: crate::tenant::mgr::DeleteTimelineError) -> Self {
        use crate::tenant::mgr::DeleteTimelineError::*;
        match value {
+            // Report Precondition Failed so the client can distinguish the
+            // "tenant is missing" case from the "timeline is missing" case.
+            Tenant(TenantStateError::NotFound(..)) => {
+                ApiError::PreconditionFailed("Requested tenant is missing")
+            }
            Tenant(t) => ApiError::from(t),
            Timeline(t) => ApiError::from(t),
        }
--- a/pageserver/src/metrics.rs
+++ b/pageserver/src/metrics.rs
@@ -257,7 +257,7 @@ impl EvictionsWithLowResidenceDuration {
    }

    pub fn observe(&self, observed_value: Duration) {
-        if self.threshold < observed_value {
+        if observed_value < self.threshold {
            self.counter
                .as_ref()
                .expect("nobody calls this function after `remove_from_vec`")
--- a/pageserver/src/page_service.rs
+++ b/pageserver/src/page_service.rs
@@ -27,6 +27,7 @@ use pq_proto::FeStartupPacket;
 use pq_proto::{BeMessage, FeMessage, RowDescriptor};
 use std::io;
 use std::net::TcpListener;
+use std::pin::pin;
 use std::str;
 use std::str::FromStr;
 use std::sync::Arc;
@@ -466,8 +467,7 @@ impl PageServerHandler {
        pgb.write_message_noflush(&BeMessage::CopyInResponse)?;
        pgb.flush().await?;

-        let copyin_reader = StreamReader::new(copyin_stream(pgb));
-        tokio::pin!(copyin_reader);
+        let mut copyin_reader = pin!(StreamReader::new(copyin_stream(pgb)));
        timeline
            .import_basebackup_from_tar(&mut copyin_reader, base_lsn, &ctx)
            .await?;
@@ -512,8 +512,7 @@ impl PageServerHandler {
        info!("importing wal");
        pgb.write_message_noflush(&BeMessage::CopyInResponse)?;
        pgb.flush().await?;
-        let copyin_reader = StreamReader::new(copyin_stream(pgb));
-        tokio::pin!(copyin_reader);
+        let mut copyin_reader = pin!(StreamReader::new(copyin_stream(pgb)));
        import_wal_from_tar(&timeline, &mut copyin_reader, start_lsn, end_lsn, &ctx).await?;
        info!("wal import complete");

--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -46,6 +46,7 @@ use std::time::{Duration, Instant};
 use self::config::TenantConf;
 use self::metadata::TimelineMetadata;
 use self::remote_timeline_client::RemoteTimelineClient;
+use self::timeline::EvictionTaskTenantState;
 use crate::config::PageServerConf;
 use crate::context::{DownloadBehavior, RequestContext};
 use crate::import_datadir;
@@ -142,6 +143,8 @@ pub struct Tenant {
    /// Cached logical sizes updated updated on each [`Tenant::gather_size_inputs`].
    cached_logical_sizes: tokio::sync::Mutex<HashMap<(TimelineId, Lsn), u64>>,
    cached_synthetic_tenant_size: Arc<AtomicU64>,
+
+    eviction_task_tenant_state: tokio::sync::Mutex<EvictionTaskTenantState>,
 }

 /// A timeline with some of its files on disk, being initialized.
@@ -1788,6 +1791,7 @@ impl Tenant {
            state,
            cached_logical_sizes: tokio::sync::Mutex::new(HashMap::new()),
            cached_synthetic_tenant_size: Arc::new(AtomicU64::new(0)),
+            eviction_task_tenant_state: tokio::sync::Mutex::new(EvictionTaskTenantState::default()),
        }
    }

--- a/pageserver/src/tenant/size.rs
+++ b/pageserver/src/tenant/size.rs
@@ -6,6 +6,7 @@ use std::sync::Arc;
 use anyhow::{bail, Context};
 use tokio::sync::oneshot::error::RecvError;
 use tokio::sync::Semaphore;
+use tokio_util::sync::CancellationToken;

 use crate::context::RequestContext;
 use crate::pgdatadir_mapping::CalculateLogicalSizeError;
@@ -352,6 +353,10 @@ async fn fill_logical_sizes(
    // our advantage with `?` error handling.
    let mut joinset = tokio::task::JoinSet::new();

+    let cancel = tokio_util::sync::CancellationToken::new();
+    // be sure to cancel all spawned tasks if we are dropped
+    let _dg = cancel.clone().drop_guard();
+
    // For each point that would benefit from having a logical size available,
    // spawn a Task to fetch it, unless we have it cached already.
    for seg in segments.iter() {
@@ -373,6 +378,7 @@ async fn fill_logical_sizes(
                    timeline,
                    lsn,
                    ctx,
+                    cancel.child_token(),
                ));
            }
            e.insert(cached_size);
@@ -477,13 +483,14 @@ async fn calculate_logical_size(
    timeline: Arc<crate::tenant::Timeline>,
    lsn: utils::lsn::Lsn,
    ctx: RequestContext,
+    cancel: CancellationToken,
 ) -> Result<TimelineAtLsnSizeResult, RecvError> {
    let _permit = tokio::sync::Semaphore::acquire_owned(limit)
        .await
        .expect("global semaphore should not had been closed");

    let size_res = timeline
-        .spawn_ondemand_logical_size_calculation(lsn, ctx)
+        .spawn_ondemand_logical_size_calculation(lsn, ctx, cancel)
        .instrument(info_span!("spawn_ondemand_logical_size_calculation"))
        .await?;
    Ok(TimelineAtLsnSizeResult(timeline, lsn, size_res))
--- a/pageserver/src/tenant/timeline.rs
+++ b/pageserver/src/tenant/timeline.rs
@@ -25,6 +25,7 @@ use std::collections::HashMap;
 use std::fs;
 use std::ops::{Deref, Range};
 use std::path::{Path, PathBuf};
+use std::pin::pin;
 use std::sync::atomic::{AtomicI64, Ordering as AtomicOrdering};
 use std::sync::{Arc, Mutex, MutexGuard, RwLock, Weak};
 use std::time::{Duration, Instant, SystemTime};
@@ -72,6 +73,7 @@ use crate::ZERO_PAGE;
 use crate::{is_temporary, task_mgr};
 use walreceiver::spawn_connection_manager_task;

+pub(super) use self::eviction_task::EvictionTaskTenantState;
 use self::eviction_task::EvictionTaskTimelineState;

 use super::layer_map::BatchedUpdates;
@@ -677,8 +679,7 @@ impl Timeline {

            let mut failed = 0;

-            let cancelled = task_mgr::shutdown_watcher();
-            tokio::pin!(cancelled);
+            let mut cancelled = pin!(task_mgr::shutdown_watcher());

            loop {
                tokio::select! {
@@ -1768,8 +1769,11 @@ impl Timeline {
            false,
            // NB: don't log errors here, task_mgr will do that.
            async move {
+                // no cancellation here, because nothing really waits for this to complete, unlike
+                // spawn_ondemand_logical_size_calculation.
+                let cancel = CancellationToken::new();
                let calculated_size = match self_clone
-                    .logical_size_calculation_task(lsn, &background_ctx)
+                    .logical_size_calculation_task(lsn, &background_ctx, cancel)
                    .await
                {
                    Ok(s) => s,
@@ -1824,6 +1828,7 @@ impl Timeline {
        self: &Arc<Self>,
        lsn: Lsn,
        ctx: RequestContext,
+        cancel: CancellationToken,
    ) -> oneshot::Receiver<Result<u64, CalculateLogicalSizeError>> {
        let (sender, receiver) = oneshot::channel();
        let self_clone = Arc::clone(self);
@@ -1843,7 +1848,9 @@ impl Timeline {
            "ondemand logical size calculation",
            false,
            async move {
-                let res = self_clone.logical_size_calculation_task(lsn, &ctx).await;
+                let res = self_clone
+                    .logical_size_calculation_task(lsn, &ctx, cancel)
+                    .await;
                let _ = sender.send(res).ok();
                Ok(()) // Receiver is responsible for handling errors
            },
@@ -1856,18 +1863,18 @@ impl Timeline {
        self: &Arc<Self>,
        lsn: Lsn,
        ctx: &RequestContext,
+        cancel: CancellationToken,
    ) -> Result<u64, CalculateLogicalSizeError> {
        let mut timeline_state_updates = self.subscribe_for_state_updates();
        let self_calculation = Arc::clone(self);
-        let cancel = CancellationToken::new();

-        let calculation = async {
+        let mut calculation = pin!(async {
            let cancel = cancel.child_token();
            let ctx = ctx.attached_child();
            self_calculation
                .calculate_logical_size(lsn, cancel, &ctx)
                .await
-        };
+        });
        let timeline_state_cancellation = async {
            loop {
                match timeline_state_updates.changed().await {
@@ -1896,7 +1903,6 @@ impl Timeline {
            "aborted because task_mgr shutdown requested".to_string()
        };

-        tokio::pin!(calculation);
        loop {
            tokio::select! {
                res = &mut calculation => { return res }
--- a/pageserver/src/tenant/timeline/eviction_task.rs
+++ b/pageserver/src/tenant/timeline/eviction_task.rs
@@ -14,6 +14,7 @@
 //!
 //! See write-up on restart on-demand download spike: <https://gist.github.com/problame/2265bf7b8dc398be834abfead36c76b5>
 use std::{
+    collections::HashMap,
    ops::ControlFlow,
    sync::Arc,
    time::{Duration, SystemTime},
@@ -29,6 +30,7 @@ use crate::{
    tenant::{
        config::{EvictionPolicy, EvictionPolicyLayerAccessThreshold},
        storage_layer::PersistentLayer,
+        Tenant,
    },
 };

@@ -36,7 +38,12 @@ use super::Timeline;

 #[derive(Default)]
 pub struct EvictionTaskTimelineState {
-    last_refresh_required_in_restart: Option<tokio::time::Instant>,
+    last_layer_access_imitation: Option<tokio::time::Instant>,
+}
+
+#[derive(Default)]
+pub struct EvictionTaskTenantState {
+    last_layer_access_imitation: Option<Instant>,
 }

 impl Timeline {
@@ -126,6 +133,35 @@ impl Timeline {
    ) -> ControlFlow<()> {
        let now = SystemTime::now();

+        // If we evict layers but keep cached values derived from those layers, then
+        // we face a storm of on-demand downloads after pageserver restart.
+        // The reason is that the restart empties the caches, and so, the values
+        // need to be re-computed by accessing layers, which we evicted while the
+        // caches were filled.
+        //
+        // Solutions here would be one of the following:
+        // 1. Have a persistent cache.
+        // 2. Count every access to a cached value to the access stats of all layers
+        //    that were accessed to compute the value in the first place.
+        // 3. Invalidate the caches at a period of < p.threshold/2, so that the values
+        //    get re-computed from layers, thereby counting towards layer access stats.
+        // 4. Make the eviction task imitate the layer accesses that typically hit caches.
+        //
+        // We follow approach (4) here because in Neon prod deployment:
+        // - page cache is quite small => high churn => low hit rate
+        //   => eviction gets correct access stats
+        // - value-level caches such as logical size & repartition have a high hit rate,
+        //   especially for inactive tenants
+        //   => eviction sees zero accesses for these
+        //   => they cause the on-demand download storm on pageserver restart
+        //
+        // We should probably move to persistent caches in the future, or avoid
+        // having inactive tenants attached to pageserver in the first place.
+        match self.imitate_layer_accesses(p, cancel, ctx).await {
+            ControlFlow::Break(()) => return ControlFlow::Break(()),
+            ControlFlow::Continue(()) => (),
+        }
+
        #[allow(dead_code)]
        #[derive(Debug, Default)]
        struct EvictionStats {
@@ -136,27 +172,6 @@ impl Timeline {
            skipped_for_shutdown: usize,
        }

-        // what we want is to invalidate any caches which haven't been accessed for `p.threshold`,
-        // but we cannot actually do it for current limitations except by restarting pageserver. we
-        // just recompute the values which would be recomputed on startup.
-        //
-        // for active tenants this will likely materialized page cache or in-memory layers. for
-        // inactive tenants it will refresh the last_access timestamps so that we will not evict
-        // and re-download on restart these layers.
-        let mut state = self.eviction_task_timeline_state.lock().await;
-        match state.last_refresh_required_in_restart {
-            Some(ts) if ts.elapsed() < p.threshold => { /* no need to run */ }
-            _ => {
-                self.refresh_layers_required_in_restart(cancel, ctx).await;
-                state.last_refresh_required_in_restart = Some(tokio::time::Instant::now())
-            }
-        }
-        drop(state);
-
-        if cancel.is_cancelled() {
-            return ControlFlow::Break(());
-        }
-
        let mut stats = EvictionStats::default();
        // Gather layers for eviction.
        // NB: all the checks can be invalidated as soon as we release the layer map lock.
@@ -254,8 +269,55 @@ impl Timeline {
        ControlFlow::Continue(())
    }

+    async fn imitate_layer_accesses(
+        &self,
+        p: &EvictionPolicyLayerAccessThreshold,
+        cancel: &CancellationToken,
+        ctx: &RequestContext,
+    ) -> ControlFlow<()> {
+        let mut state = self.eviction_task_timeline_state.lock().await;
+        match state.last_layer_access_imitation {
+            Some(ts) if ts.elapsed() < p.threshold => { /* no need to run */ }
+            _ => {
+                self.imitate_timeline_cached_layer_accesses(cancel, ctx)
+                    .await;
+                state.last_layer_access_imitation = Some(tokio::time::Instant::now())
+            }
+        }
+        drop(state);
+
+        if cancel.is_cancelled() {
+            return ControlFlow::Break(());
+        }
+
+        // This task is timeline-scoped, but the synthetic size calculation is tenant-scoped.
+        // Make one of the tenant's timelines draw the short straw and run the calculation.
+        // The others wait until the calculation is done so that they take into account the
+        // imitated accesses that the winner made.
+        let Ok(tenant) = crate::tenant::mgr::get_tenant(self.tenant_id, true).await else {
+            // likely, we're shutting down
+            return ControlFlow::Break(());
+        };
+        let mut state = tenant.eviction_task_tenant_state.lock().await;
+        match state.last_layer_access_imitation {
+            Some(ts) if ts.elapsed() < p.threshold => { /* no need to run */ }
+            _ => {
+                self.imitate_synthetic_size_calculation_worker(&tenant, ctx, cancel)
+                    .await;
+                state.last_layer_access_imitation = Some(tokio::time::Instant::now());
+            }
+        }
+        drop(state);
+
+        if cancel.is_cancelled() {
+            return ControlFlow::Break(());
+        }
+
+        ControlFlow::Continue(())
+    }
+
    /// Recompute the values which would cause on-demand downloads during restart.
-    async fn refresh_layers_required_in_restart(
+    async fn imitate_timeline_cached_layer_accesses(
        &self,
        cancel: &CancellationToken,
        ctx: &RequestContext,
@@ -289,4 +351,61 @@ impl Timeline {
            }
        }
    }
+
+    // Imitate the synthetic size calculation done by the consumption_metrics module.
+    async fn imitate_synthetic_size_calculation_worker(
+        &self,
+        tenant: &Arc<Tenant>,
+        ctx: &RequestContext,
+        cancel: &CancellationToken,
+    ) {
+        if self.conf.metric_collection_endpoint.is_none() {
+            // We don't start the consumption metrics task if this is not set in the config.
+            // So, no need to imitate the accesses in that case.
+            return;
+        }
+
+        // The consumption metrics are collected on a per-tenant basis, by a single
+        // global background loop.
+        // It limits the number of synthetic size calculations using the global
+        // `concurrent_tenant_size_logical_size_queries` semaphore to not overload
+        // the pageserver. (size calculation is somewhat expensive in terms of CPU and IOs).
+        //
+        // If we used that same semaphore here, then we'd compete for the
+        // same permits, which may impact timeliness of consumption metrics.
+        // That is a no-go, as consumption metrics are much more important
+        // than what we do here.
+        //
+        // So, we have a separate semaphore, initialized to the same
+        // number of permits as the `concurrent_tenant_size_logical_size_queries`.
+        // In the worst case, we would have twice the number of concurrent size calculations.
+        // But in practice, the `p.threshold` >> `consumption metric interval`, and
+        // we spread out the eviction task using `random_init_delay`.
+        // So, the chance of the worst case is quite low in practice.
+        // It runs as a per-tenant task, but the eviction_task.rs is per-timeline.
+        // So, we must coordinate with the other eviction tasks of this tenant.
+        let limit = self
+            .conf
+            .eviction_task_immitated_concurrent_logical_size_queries
+            .inner();
+
+        let mut throwaway_cache = HashMap::new();
+        let gather =
+            crate::tenant::size::gather_inputs(tenant, limit, None, &mut throwaway_cache, ctx);
+
+        tokio::select! {
+            _ = cancel.cancelled() => {}
+            gather_result = gather => {
+                match gather_result {
+                    Ok(_) => {},
+                    Err(e) => {
+                        // We don't care about the result, but, if it failed, we should log it,
+                        // since consumption metric might be hitting the cached value and
+                        // thus not encountering this error.
+                        warn!("failed to imitate synthetic size calculation accesses: {e:#}")
+                    }
+                }
+           }
+        }
+    }
 }
--- a/pageserver/src/tenant/timeline/walreceiver/connection_manager.rs
+++ b/pageserver/src/tenant/timeline/walreceiver/connection_manager.rs
@@ -237,11 +237,7 @@ async fn connection_manager_loop_step(
        if let Some(new_candidate) = walreceiver_state.next_connection_candidate() {
            info!("Switching to new connection candidate: {new_candidate:?}");
            walreceiver_state
-                .change_connection(
-                    new_candidate.safekeeper_id,
-                    new_candidate.wal_source_connconf,
-                    ctx,
-                )
+                .change_connection(new_candidate, ctx)
                .await
        }
    }
@@ -346,6 +342,8 @@ struct WalConnection {
    started_at: NaiveDateTime,
    /// Current safekeeper pageserver is connected to for WAL streaming.
    sk_id: NodeId,
+    /// Availability zone of the safekeeper.
+    availability_zone: Option<String>,
    /// Status of the connection.
    status: WalConnectionStatus,
    /// WAL streaming task handle.
@@ -405,12 +403,7 @@ impl WalreceiverState {
    }

    /// Shuts down the current connection (if any) and immediately starts another one with the given connection string.
-    async fn change_connection(
-        &mut self,
-        new_sk_id: NodeId,
-        new_wal_source_connconf: PgConnectionConfig,
-        ctx: &RequestContext,
-    ) {
+    async fn change_connection(&mut self, new_sk: NewWalConnectionCandidate, ctx: &RequestContext) {
        self.drop_old_connection(true).await;

        let id = self.id;
@@ -424,7 +417,7 @@ impl WalreceiverState {
            async move {
                super::walreceiver_connection::handle_walreceiver_connection(
                    timeline,
-                    new_wal_source_connconf,
+                    new_sk.wal_source_connconf,
                    events_sender,
                    cancellation,
                    connect_timeout,
@@ -433,13 +426,16 @@ impl WalreceiverState {
                .await
                .context("walreceiver connection handling failure")
            }
-            .instrument(info_span!("walreceiver_connection", id = %id, node_id = %new_sk_id))
+            .instrument(
+                info_span!("walreceiver_connection", id = %id, node_id = %new_sk.safekeeper_id),
+            )
        });

        let now = Utc::now().naive_utc();
        self.wal_connection = Some(WalConnection {
            started_at: now,
-            sk_id: new_sk_id,
+            sk_id: new_sk.safekeeper_id,
+            availability_zone: new_sk.availability_zone,
            status: WalConnectionStatus {
                is_connected: false,
                has_processed_wal: false,
@@ -546,6 +542,7 @@ impl WalreceiverState {
    /// * if connected safekeeper is not present, pick the candidate
    /// * if we haven't received any updates for some time, pick the candidate
    /// * if the candidate commit_lsn is much higher than the current one, pick the candidate
+    /// * if the candidate commit_lsn is the same, but the candidate is located in the same AZ as the pageserver, pick the candidate
    /// * if connected safekeeper stopped sending us new WAL which is available on other safekeeper, pick the candidate
    ///
    /// This way we ensure to keep up with the most up-to-date safekeeper and don't try to jump from one safekeeper to another too frequently.
@@ -559,6 +556,7 @@ impl WalreceiverState {

                let (new_sk_id, new_safekeeper_broker_data, new_wal_source_connconf) =
                    self.select_connection_candidate(Some(connected_sk_node))?;
+                let new_availability_zone = new_safekeeper_broker_data.availability_zone.clone();

                let now = Utc::now().naive_utc();
                if let Ok(latest_interaciton) =
@@ -569,6 +567,7 @@ impl WalreceiverState {
                        return Some(NewWalConnectionCandidate {
                            safekeeper_id: new_sk_id,
                            wal_source_connconf: new_wal_source_connconf,
+                            availability_zone: new_availability_zone,
                            reason: ReconnectReason::NoKeepAlives {
                                last_keep_alive: Some(
                                    existing_wal_connection.status.latest_connection_update,
@@ -594,6 +593,7 @@ impl WalreceiverState {
                                return Some(NewWalConnectionCandidate {
                                    safekeeper_id: new_sk_id,
                                    wal_source_connconf: new_wal_source_connconf,
+                                    availability_zone: new_availability_zone,
                                    reason: ReconnectReason::LaggingWal {
                                        current_commit_lsn,
                                        new_commit_lsn,
@@ -601,6 +601,20 @@ impl WalreceiverState {
                                    },
                                });
                            }
+                            // If we have a candidate with the same commit_lsn as the current one, which is in the same AZ as pageserver,
+                            // and the current one is not, switch to the new one.
+                            if self.availability_zone.is_some()
+                                && existing_wal_connection.availability_zone
+                                    != self.availability_zone
+                                && self.availability_zone == new_availability_zone
+                            {
+                                return Some(NewWalConnectionCandidate {
+                                    safekeeper_id: new_sk_id,
+                                    availability_zone: new_availability_zone,
+                                    wal_source_connconf: new_wal_source_connconf,
+                                    reason: ReconnectReason::SwitchAvailabilityZone,
+                                });
+                            }
                        }
                        None => debug!(
                            "Best SK candidate has its commit_lsn behind connected SK's commit_lsn"
@@ -668,6 +682,7 @@ impl WalreceiverState {
                            return Some(NewWalConnectionCandidate {
                                safekeeper_id: new_sk_id,
                                wal_source_connconf: new_wal_source_connconf,
+                                availability_zone: new_availability_zone,
                                reason: ReconnectReason::NoWalTimeout {
                                    current_lsn,
                                    current_commit_lsn,
@@ -686,10 +701,11 @@ impl WalreceiverState {
                self.wal_connection.as_mut().unwrap().discovered_new_wal = discovered_new_wal;
            }
            None => {
-                let (new_sk_id, _, new_wal_source_connconf) =
+                let (new_sk_id, new_safekeeper_broker_data, new_wal_source_connconf) =
                    self.select_connection_candidate(None)?;
                return Some(NewWalConnectionCandidate {
                    safekeeper_id: new_sk_id,
+                    availability_zone: new_safekeeper_broker_data.availability_zone.clone(),
                    wal_source_connconf: new_wal_source_connconf,
                    reason: ReconnectReason::NoExistingConnection,
                });
@@ -794,6 +810,7 @@ impl WalreceiverState {
 struct NewWalConnectionCandidate {
    safekeeper_id: NodeId,
    wal_source_connconf: PgConnectionConfig,
+    availability_zone: Option<String>,
    // This field is used in `derive(Debug)` only.
    #[allow(dead_code)]
    reason: ReconnectReason,
@@ -808,6 +825,7 @@ enum ReconnectReason {
        new_commit_lsn: Lsn,
        threshold: NonZeroU64,
    },
+    SwitchAvailabilityZone,
    NoWalTimeout {
        current_lsn: Lsn,
        current_commit_lsn: Lsn,
@@ -873,6 +891,7 @@ mod tests {
                peer_horizon_lsn: 0,
                local_start_lsn: 0,
                safekeeper_connstr: safekeeper_connstr.to_owned(),
+                availability_zone: None,
            },
            latest_update,
        }
@@ -933,6 +952,7 @@ mod tests {
        state.wal_connection = Some(WalConnection {
            started_at: now,
            sk_id: connected_sk_id,
+            availability_zone: None,
            status: connection_status,
            connection_task: TaskHandle::spawn(move |sender, _| async move {
                sender
@@ -1095,6 +1115,7 @@ mod tests {
        state.wal_connection = Some(WalConnection {
            started_at: now,
            sk_id: connected_sk_id,
+            availability_zone: None,
            status: connection_status,
            connection_task: TaskHandle::spawn(move |sender, _| async move {
                sender
@@ -1160,6 +1181,7 @@ mod tests {
        state.wal_connection = Some(WalConnection {
            started_at: now,
            sk_id: NodeId(1),
+            availability_zone: None,
            status: connection_status,
            connection_task: TaskHandle::spawn(move |sender, _| async move {
                sender
@@ -1222,6 +1244,7 @@ mod tests {
        state.wal_connection = Some(WalConnection {
            started_at: now,
            sk_id: NodeId(1),
+            availability_zone: None,
            status: connection_status,
            connection_task: TaskHandle::spawn(move |_, _| async move { Ok(()) }),
            discovered_new_wal: Some(NewCommittedWAL {
@@ -1289,4 +1312,74 @@ mod tests {
            availability_zone: None,
        }
    }
+
+    #[tokio::test]
+    async fn switch_to_same_availability_zone() -> anyhow::Result<()> {
+        // Pageserver and one of safekeepers will be in the same availability zone
+        // and pageserver should prefer to connect to it.
+        let test_az = Some("test_az".to_owned());
+
+        let harness = TenantHarness::create("switch_to_same_availability_zone")?;
+        let mut state = dummy_state(&harness).await;
+        state.availability_zone = test_az.clone();
+        let current_lsn = Lsn(100_000).align();
+        let now = Utc::now().naive_utc();
+
+        let connected_sk_id = NodeId(0);
+
+        let connection_status = WalConnectionStatus {
+            is_connected: true,
+            has_processed_wal: true,
+            latest_connection_update: now,
+            latest_wal_update: now,
+            commit_lsn: Some(current_lsn),
+            streaming_lsn: Some(current_lsn),
+        };
+
+        state.wal_connection = Some(WalConnection {
+            started_at: now,
+            sk_id: connected_sk_id,
+            availability_zone: None,
+            status: connection_status,
+            connection_task: TaskHandle::spawn(move |sender, _| async move {
+                sender
+                    .send(TaskStateUpdate::Progress(connection_status))
+                    .ok();
+                Ok(())
+            }),
+            discovered_new_wal: None,
+        });
+
+        // We have another safekeeper with the same commit_lsn, and it has the same availability zone as
+        // the current pageserver.
+        let mut same_az_sk = dummy_broker_sk_timeline(current_lsn.0, "same_az", now);
+        same_az_sk.timeline.availability_zone = test_az.clone();
+
+        state.wal_stream_candidates = HashMap::from([
+            (
+                connected_sk_id,
+                dummy_broker_sk_timeline(current_lsn.0, DUMMY_SAFEKEEPER_HOST, now),
+            ),
+            (NodeId(1), same_az_sk),
+        ]);
+
+        // We expect that pageserver will switch to the safekeeper in the same availability zone,
+        // even if it has the same commit_lsn.
+        let next_candidate = state.next_connection_candidate().expect(
+            "Expected one candidate selected out of multiple valid data options, but got none",
+        );
+
+        assert_eq!(next_candidate.safekeeper_id, NodeId(1));
+        assert_eq!(
+            next_candidate.reason,
+            ReconnectReason::SwitchAvailabilityZone,
+            "Should switch to the safekeeper in the same availability zone, if it has the same commit_lsn"
+        );
+        assert_eq!(
+            next_candidate.wal_source_connconf.host(),
+            &Host::Domain("same_az".to_owned())
+        );
+
+        Ok(())
+    }
 }
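
The availability-zone condition added in `connection_manager.rs` can be sketched in isolation. This is a minimal sketch, not the pageserver's actual code: the `Candidate` struct and `should_switch_for_az` function are hypothetical names introduced here, and LSNs are modelled as plain `u64` values. It assumes, following the commit message, that the switch is considered only when the candidate's `commit_lsn` is not less than the current connection's.

```rust
/// Hypothetical, simplified view of a safekeeper as seen by the pageserver.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    commit_lsn: u64,
    availability_zone: Option<String>,
}

/// Returns true when the pageserver should switch to `candidate` only
/// because it is located in the pageserver's own availability zone.
fn should_switch_for_az(
    pageserver_az: Option<&str>,
    current: &Candidate,
    candidate: &Candidate,
) -> bool {
    // Without a configured pageserver AZ there is nothing to prefer.
    let Some(az) = pageserver_az else { return false };
    // The candidate must not be behind the current connection,
    // the current safekeeper must be in a different (or unknown) AZ,
    // and the candidate must be in the pageserver's AZ.
    candidate.commit_lsn >= current.commit_lsn
        && current.availability_zone.as_deref() != Some(az)
        && candidate.availability_zone.as_deref() == Some(az)
}
```

Because a live connection usually reports a newer `commit_lsn` than broker updates do, the first clause tends to hold only when WAL writes pause, which is why the commit message expects the switch to happen eventually rather than instantly.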
--- a/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs
+++ b/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs
@@ -2,6 +2,7 @@

 use std::{
    error::Error,
+    pin::pin,
    str::FromStr,
    sync::Arc,
    time::{Duration, SystemTime},
@@ -17,7 +18,7 @@ use postgres_ffi::v14::xlog_utils::normalize_lsn;
 use postgres_ffi::WAL_SEGMENT_SIZE;
 use postgres_protocol::message::backend::ReplicationMessage;
 use postgres_types::PgLsn;
-use tokio::{pin, select, sync::watch, time};
+use tokio::{select, sync::watch, time};
 use tokio_postgres::{replication::ReplicationStream, Client};
 use tokio_util::sync::CancellationToken;
 use tracing::{debug, error, info, trace, warn};
@@ -187,8 +188,7 @@ pub async fn handle_walreceiver_connection(
    let query = format!("START_REPLICATION PHYSICAL {startpoint}");

    let copy_stream = replication_client.copy_both_simple(&query).await?;
-    let physical_stream = ReplicationStream::new(copy_stream);
-    pin!(physical_stream);
+    let mut physical_stream = pin!(ReplicationStream::new(copy_stream));

    let mut waldecoder = WalStreamDecoder::new(startpoint, timeline.pg_version);
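
The hunk above replaces `tokio::pin!` with the standard library's `std::pin::pin!`, which became stable in Rust 1.68. The following is a small illustration of what that macro does, independent of the walreceiver code: the `NoopWake` type and `poll_once` helper are hypothetical and exist only to make the example self-contained without an async runtime.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that does nothing; enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `fut` exactly once and returns the result of that poll.
fn poll_once<F: Future>(fut: F) -> Poll<F::Output> {
    // `pin!` moves the value into a local slot and yields a `Pin<&mut F>`,
    // so the binding can be used directly, as in
    // `let mut physical_stream = pin!(...)`.
    let mut pinned = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    pinned.as_mut().poll(&mut cx)
}
```

Unlike `tokio::pin!`, which shadows an existing binding, `std::pin::pin!` is an expression, so construction and pinning fit in a single `let`.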

--- a/pgxn/neon/file_cache.c
+++ b/pgxn/neon/file_cache.c
@@ -14,6 +14,7 @@
 */

 #include <sys/file.h>
+#include <sys/statvfs.h>
 #include <unistd.h>
 #include <fcntl.h>

@@ -34,6 +35,9 @@
 #include "storage/fd.h"
 #include "storage/pg_shmem.h"
 #include "storage/buf_internals.h"
+#include "storage/procsignal.h"
+#include "postmaster/bgworker.h"
+#include "postmaster/interrupt.h"

 /*
 * Local file cache is used to temporarily store relation pages in the local file system.
@@ -59,6 +63,9 @@

 #define SIZE_MB_TO_CHUNKS(size) ((uint32)((size) * MB / BLCKSZ / BLOCKS_PER_CHUNK))

+#define MAX_MONITOR_INTERVAL_USEC 1000000 /* 1 second */
+#define MAX_DISK_WRITE_RATE       1000 /* MB/sec */
+
 typedef struct FileCacheEntry
 {
 	BufferTag	key;
@@ -71,6 +78,7 @@ typedef struct FileCacheEntry
 typedef struct FileCacheControl
 {
 	uint32 size; /* size of cache file in chunks */
+	uint32 used; /* number of used chunks */
 	dlist_head lru; /* double linked list for LRU replacement algorithm */
 } FileCacheControl;

@@ -79,12 +87,14 @@ static int   lfc_desc;
 static LWLockId lfc_lock;
 static int   lfc_max_size;
 static int   lfc_size_limit;
+static int   lfc_free_space_watermark;
 static char* lfc_path;
 static  FileCacheControl* lfc_ctl;
 static shmem_startup_hook_type prev_shmem_startup_hook;
 #if PG_VERSION_NUM>=150000
 static shmem_request_hook_type prev_shmem_request_hook;
 #endif
+static int   lfc_shrinking_factor; /* power of two by which local cache size will be shrunk when lfc_free_space_watermark is reached */

 static void
 lfc_shmem_startup(void)
@@ -112,6 +122,7 @@ lfc_shmem_startup(void)
 								 &info,
 								 HASH_ELEM | HASH_BLOBS);
 		lfc_ctl->size = 0;
+		lfc_ctl->used = 0;
 		dlist_init(&lfc_ctl->lru);

 		/* Remove file cache on restart */
@@ -165,7 +176,7 @@ lfc_change_limit_hook(int newval, void *extra)
 		}
 	}
 	LWLockAcquire(lfc_lock, LW_EXCLUSIVE);
-	while (new_size < lfc_ctl->size && !dlist_is_empty(&lfc_ctl->lru))
+	while (new_size < lfc_ctl->used && !dlist_is_empty(&lfc_ctl->lru))
 	{
 		/* Shrink cache by throwing away least recently accessed chunks and returning their space to file system */
 		FileCacheEntry* victim = dlist_container(FileCacheEntry, lru_node, dlist_pop_head_node(&lfc_ctl->lru));
@@ -175,12 +186,86 @@ lfc_change_limit_hook(int newval, void *extra)
 			elog(LOG, "Failed to punch hole in file: %m");
 #endif
 		hash_search(lfc_hash, &victim->key, HASH_REMOVE, NULL);
-		lfc_ctl->size -= 1;
+		lfc_ctl->used -= 1;
 	}
 	elog(LOG, "set local file cache limit to %d", new_size);
 	LWLockRelease(lfc_lock);
 }

+/*
+ * The local file system state monitor checks the available free space.
+ * If it is lower than lfc_free_space_watermark, we shrink the local cache
+ * by throwing away the least recently accessed chunks.
+ * The first time the low-space watermark is reached, the cache size is divided by two,
+ * the second time by four, and so on. Eventually all chunks are removed from the local cache.
+ *
+ * Note that we do not change lfc_cache_size: it is meant to be adjusted by the autoscaler.
+ * We only throw away cached chunks but do not prevent the cache from being filled with new chunks.
+ *
+ * The polling interval is calculated as the minimal time needed to consume lfc_free_space_watermark
+ * of disk space at the maximal possible disk write speed (1 GB/sec), but no longer than 1 second.
+ * Calling statvfs each second should not add any noticeable overhead.
+ */
+void
+FileCacheMonitorMain(Datum main_arg)
+{
+	/*
+	 * Choose the file system state monitor interval so that space cannot be exhausted
+	 * during this period, but no longer than MAX_MONITOR_INTERVAL_USEC (1 second)
+	 */
+	uint64 monitor_interval = Min(MAX_MONITOR_INTERVAL_USEC, lfc_free_space_watermark*MB/MAX_DISK_WRITE_RATE);
+
+	/* Establish signal handlers. */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+	pqsignal(SIGHUP, SignalHandlerForConfigReload);
+	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+	BackgroundWorkerUnblockSignals();
+
+	/* Periodically check free space until terminated. */
+	while (!ShutdownRequestPending)
+	{
+		if (lfc_size_limit != 0)
+		{
+			struct statvfs sfs;
+			if (statvfs(lfc_path, &sfs) < 0)
+			{
+				elog(WARNING, "Failed to obtain status of %s: %m", lfc_path);
+			}
+			else
+			{
+				if (sfs.f_bavail*sfs.f_bsize < lfc_free_space_watermark*MB)
+				{
+					if (lfc_shrinking_factor < 31) {
+						lfc_shrinking_factor += 1;
+					}
+					lfc_change_limit_hook(lfc_size_limit >> lfc_shrinking_factor, NULL);
+				}
+				else
+					lfc_shrinking_factor = 0; /* reset to initial value */
+			}
+		}
+		pg_usleep(monitor_interval);
+	}
+}
+
+static void
+lfc_register_free_space_monitor(void)
+{
+	BackgroundWorker bgw;
+	memset(&bgw, 0, sizeof(bgw));
+	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS;
+	bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "neon");
+	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "FileCacheMonitorMain");
+	snprintf(bgw.bgw_name, BGW_MAXLEN, "Local free space monitor");
+	snprintf(bgw.bgw_type, BGW_MAXLEN, "Local free space monitor");
+	bgw.bgw_restart_time = 5;
+	bgw.bgw_notify_pid = 0;
+	bgw.bgw_main_arg = (Datum) 0;
+
+	RegisterBackgroundWorker(&bgw);
+}
+
 void
 lfc_init(void)
 {
@@ -217,6 +302,19 @@ lfc_init(void)
 							lfc_change_limit_hook,
 							NULL);

+	DefineCustomIntVariable("neon.free_space_watermark",
+							"Minimal free space in local file system after reaching which local file cache will be truncated",
+							NULL,
+							&lfc_free_space_watermark,
+							1024, /* 1GB */
+							0,
+							INT_MAX,
+							PGC_SIGHUP,
+							GUC_UNIT_MB,
+							NULL,
+							NULL,
+							NULL);
+
 	DefineCustomStringVariable("neon.file_cache_path",
 							   "Path to local file cache (can be raw device)",
 							   NULL,
@@ -231,6 +329,9 @@ lfc_init(void)
 	if (lfc_max_size == 0)
 		return;

+	if (lfc_free_space_watermark != 0)
+		lfc_register_free_space_monitor();
+
 	prev_shmem_startup_hook = shmem_startup_hook;
 	shmem_startup_hook = lfc_shmem_startup;
 #if PG_VERSION_NUM>=150000
@@ -380,7 +481,7 @@ lfc_write(RelFileNode rnode, ForkNumber forkNum, BlockNumber blkno,
 		 * there are should be very large number of concurrent IO operations and them are limited by max_connections,
 		 * we prefer not to complicate code and use second approach.
 		 */
-		if (lfc_ctl->size >= SIZE_MB_TO_CHUNKS(lfc_size_limit) && !dlist_is_empty(&lfc_ctl->lru))
+		if (lfc_ctl->used >= SIZE_MB_TO_CHUNKS(lfc_size_limit) && !dlist_is_empty(&lfc_ctl->lru))
 		{
 			/* Cache overflow: evict least recently used chunk */
 			FileCacheEntry* victim = dlist_container(FileCacheEntry, lru_node, dlist_pop_head_node(&lfc_ctl->lru));
@@ -390,7 +491,10 @@ lfc_write(RelFileNode rnode, ForkNumber forkNum, BlockNumber blkno,
 			elog(LOG, "Swap file cache page");
 		}
 		else
+		{
+			lfc_ctl->used += 1;
 			entry->offset = lfc_ctl->size++; /* allocate new chunk at end of file */
+		}
 		entry->access_count = 1;
 		memset(entry->bitmap, 0, sizeof entry->bitmap);
 	}
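
The progressive shrinking performed by `FileCacheMonitorMain` can be modelled outside PostgreSQL. The sketch below is written in Rust for consistency with the rest of the repository, not C, and is not the extension's code: `ShrinkState` and `on_check` are hypothetical names, and 64-bit arithmetic is assumed for the byte comparison so that large watermark values cannot overflow.

```rust
/// Hypothetical model of the state kept by the free-space monitor.
struct ShrinkState {
    /// Power of two by which the cache limit is currently divided.
    factor: u32,
}

impl ShrinkState {
    /// One monitor iteration. Returns `Some(new_limit_mb)` when the cache
    /// should be shrunk, or `None` when free space is above the watermark.
    fn on_check(&mut self, free_bytes: u64, watermark_mb: u64, size_limit_mb: u32) -> Option<u32> {
        const MB: u64 = 1024 * 1024;
        if free_bytes < watermark_mb * MB {
            // Each consecutive low-space observation halves the limit again;
            // the factor is capped at 31 so the shift stays well defined.
            if self.factor < 31 {
                self.factor += 1;
            }
            Some(size_limit_mb >> self.factor)
        } else {
            // Enough free space: reset to the initial value.
            self.factor = 0;
            None
        }
    }
}
```

With a 1024 MB limit, consecutive low-space checks yield 512 MB, then 256 MB, and a single check above the watermark resets the sequence.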
--- a/rust-toolchain.toml
+++ b/rust-toolchain.toml
@@ -1,5 +1,5 @@
 [toolchain]
-channel = "1.66.1"
+channel = "1.68.2"
 profile = "default"
 # The default profile includes rustc, rust-std, cargo, rust-docs, rustfmt and clippy.
 # https://rust-lang.github.io/rustup/concepts/profiles.html
--- a/safekeeper/src/bin/safekeeper.rs
+++ b/safekeeper/src/bin/safekeeper.rs
@@ -5,6 +5,7 @@ use anyhow::{bail, Context, Result};
 use clap::Parser;
 use remote_storage::RemoteStorageConfig;
 use toml_edit::Document;
+use utils::signals::ShutdownSignals;

 use std::fs::{self, File};
 use std::io::{ErrorKind, Write};
@@ -39,7 +40,7 @@ use utils::{
    logging::{self, LogFormat},
    project_git_version,
    sentry_init::init_sentry,
-    signals, tcp_listener,
+    tcp_listener,
 };

 const PID_FILE_NAME: &str = "safekeeper.pid";
@@ -216,7 +217,6 @@ fn start_safekeeper(conf: SafeKeeperConf) -> Result<()> {
    let timeline_collector = safekeeper::metrics::TimelineCollector::new();
    metrics::register_internal(Box::new(timeline_collector))?;

-    let signals = signals::install_shutdown_handlers()?;
    let mut threads = vec![];
    let (wal_backup_launcher_tx, wal_backup_launcher_rx) = mpsc::channel(100);

@@ -274,15 +274,12 @@ fn start_safekeeper(conf: SafeKeeperConf) -> Result<()> {

    set_build_info_metric(GIT_VERSION);
    // TODO: put more thoughts into handling of failed threads
-    // We probably should restart them.
+    // We should catch & die if they are in trouble.

-    // NOTE: we still have to handle signals like SIGQUIT to prevent coredumps
-    signals.handle(|signal| {
-        // TODO: implement graceful shutdown with joining threads etc
-        info!(
-            "received {}, terminating in immediate shutdown mode",
-            signal.name()
-        );
+    // On any shutdown signal, log its receipt and exit. Additionally, handling
+    // SIGQUIT prevents a coredump.
+    ShutdownSignals::handle(|signal| {
+        info!("received {}, terminating", signal.name());
        std::process::exit(0);
    })
 }
--- a/safekeeper/src/http/routes.rs
+++ b/safekeeper/src/http/routes.rs
@@ -242,6 +242,7 @@ async fn record_safekeeper_info(mut request: Request<Body>) -> Result<Response<B
        safekeeper_connstr: sk_info.safekeeper_connstr.unwrap_or_else(|| "".to_owned()),
        backup_lsn: sk_info.backup_lsn.0,
        local_start_lsn: sk_info.local_start_lsn.0,
+        availability_zone: None,
    };

    let tli = GlobalTimelines::get(ttid).map_err(ApiError::from)?;
--- a/safekeeper/src/timeline.rs
+++ b/safekeeper/src/timeline.rs
@@ -337,6 +337,7 @@ impl SharedState {
            safekeeper_connstr: conf.listen_pg_addr.clone(),
            backup_lsn: self.sk.inmem.backup_lsn.0,
            local_start_lsn: self.sk.state.local_start_lsn.0,
+            availability_zone: conf.availability_zone.clone(),
        }
    }
 }
--- a/scripts/sk_collect_dumps/.gitignore
+++ b/scripts/sk_collect_dumps/.gitignore
@@ -0,0 +1,2 @@
+result
+*.json
--- a/scripts/sk_collect_dumps/readme.md
+++ b/scripts/sk_collect_dumps/readme.md
@@ -0,0 +1,25 @@
+# Collect /v1/debug_dump from all safekeeper nodes
+
+1. Run ansible playbooks to collect .json dumps from all safekeepers and store them in `./result` directory.
+2. Run `DB_CONNSTR=... ./upload.sh prod_feb30` to upload dumps to `prod_feb30` table in specified postgres database.
+
+## How to use ansible (staging)
+
+```
+AWS_DEFAULT_PROFILE=dev ansible-playbook -i ../../.github/ansible/staging.us-east-2.hosts.yaml -e @../../.github/ansible/ssm_config remote.yaml
+
+AWS_DEFAULT_PROFILE=dev ansible-playbook -i ../../.github/ansible/staging.eu-west-1.hosts.yaml -e @../../.github/ansible/ssm_config remote.yaml
+```
+
+## How to use ansible (prod)
+
+```
+AWS_DEFAULT_PROFILE=prod ansible-playbook -i ../../.github/ansible/prod.us-west-2.hosts.yaml -e @../../.github/ansible/ssm_config remote.yaml
+
+AWS_DEFAULT_PROFILE=prod ansible-playbook -i ../../.github/ansible/prod.us-east-2.hosts.yaml -e @../../.github/ansible/ssm_config remote.yaml
+
+AWS_DEFAULT_PROFILE=prod ansible-playbook -i ../../.github/ansible/prod.eu-central-1.hosts.yaml -e @../../.github/ansible/ssm_config remote.yaml
+
+AWS_DEFAULT_PROFILE=prod ansible-playbook -i ../../.github/ansible/prod.ap-southeast-1.hosts.yaml -e @../../.github/ansible/ssm_config remote.yaml
+```
+
--- a/scripts/sk_collect_dumps/remote.yaml
+++ b/scripts/sk_collect_dumps/remote.yaml
@@ -0,0 +1,18 @@
+- name: Fetch state dumps from safekeepers
+  hosts: safekeepers
+  gather_facts: False
+  remote_user: "{{ remote_user }}"
+    
+  tasks:
+    - name: Download file
+      get_url:
+        url: "http://{{ inventory_hostname }}:7676/v1/debug_dump?dump_all=true&dump_disk_content=false"
+        dest: "/tmp/{{ inventory_hostname }}.json"
+
+    - name: Fetch file from remote hosts
+      fetch:
+        src: "/tmp/{{ inventory_hostname }}.json"
+        dest: "./result/{{ inventory_hostname }}.json"
+        flat: yes
+        fail_on_missing: no
+
--- a/scripts/sk_collect_dumps/upload.sh
+++ b/scripts/sk_collect_dumps/upload.sh
@@ -0,0 +1,52 @@
+#!/bin/bash
+
+if [ -z "$DB_CONNSTR" ]; then
+    echo "DB_CONNSTR is not set"
+    exit 1
+fi
+
+# Create a temporary table for JSON data
+psql $DB_CONNSTR -c 'DROP TABLE IF EXISTS tmp_json'
+psql $DB_CONNSTR -c 'CREATE TABLE tmp_json (data jsonb)'
+
+for file in ./result/*.json; do
+    echo "$file"
+    SK_ID=$(jq '.config.id' $file)
+    echo "SK_ID: $SK_ID"
+    jq -c ".timelines[] |  . + {\"sk_id\": $SK_ID}" $file | psql $DB_CONNSTR -c "\\COPY tmp_json (data) FROM STDIN"
+done
+
+TABLE_NAME=$1
+
+if [ -z "$TABLE_NAME" ]; then
+    echo "TABLE_NAME is not set, skipping conversion to table with typed columns"
+    echo "Usage: ./upload.sh TABLE_NAME"
+    exit 0
+fi
+
+psql $DB_CONNSTR <<EOF
+CREATE TABLE $TABLE_NAME AS
+SELECT
+  (data->>'sk_id')::bigint AS sk_id,
+  (data->>'tenant_id') AS tenant_id,
+  (data->>'timeline_id') AS timeline_id,
+  (data->'memory'->>'active')::bool AS active,
+  (data->'memory'->>'flush_lsn')::bigint AS flush_lsn,
+  (data->'memory'->'mem_state'->>'backup_lsn')::bigint AS backup_lsn,
+  (data->'memory'->'mem_state'->>'commit_lsn')::bigint AS commit_lsn,
+  (data->'memory'->'mem_state'->>'peer_horizon_lsn')::bigint AS peer_horizon_lsn,
+  (data->'memory'->'mem_state'->>'remote_consistent_lsn')::bigint AS remote_consistent_lsn,
+  (data->'memory'->>'write_lsn')::bigint AS write_lsn,
+  (data->'memory'->>'num_computes')::bigint AS num_computes,
+  (data->'memory'->>'epoch_start_lsn')::bigint AS epoch_start_lsn,
+  (data->'memory'->>'last_removed_segno')::bigint AS last_removed_segno,
+  (data->'memory'->>'is_cancelled')::bool AS is_cancelled,
+  (data->'control_file'->>'backup_lsn')::bigint AS disk_backup_lsn,
+  (data->'control_file'->>'commit_lsn')::bigint AS disk_commit_lsn,
+  (data->'control_file'->'acceptor_state'->>'term')::bigint AS disk_term,
+  (data->'control_file'->>'local_start_lsn')::bigint AS local_start_lsn,
+  (data->'control_file'->>'peer_horizon_lsn')::bigint AS disk_peer_horizon_lsn,
+  (data->'control_file'->>'timeline_start_lsn')::bigint AS timeline_start_lsn,
+  (data->'control_file'->>'remote_consistent_lsn')::bigint AS disk_remote_consistent_lsn
+FROM tmp_json
+EOF
--- a/storage_broker/benches/rps.rs
+++ b/storage_broker/benches/rps.rs
@@ -133,6 +133,7 @@ async fn publish(client: Option<BrokerClientChannel>, n_keys: u64) {
                peer_horizon_lsn: 5,
                safekeeper_connstr: "zenith-1-sk-1.local:7676".to_owned(),
                local_start_lsn: 0,
+                availability_zone: None,
            };
            counter += 1;
            yield info;
--- a/storage_broker/proto/broker.proto
+++ b/storage_broker/proto/broker.proto
@@ -36,9 +36,11 @@ message SafekeeperTimelineInfo {
    uint64 local_start_lsn = 9;
    // A connection string to use for WAL receiving.
    string safekeeper_connstr = 10;
+    // Availability zone of a safekeeper.
+    optional string availability_zone = 11;
 }

 message TenantTimelineId {
    bytes tenant_id = 1;
    bytes timeline_id = 2;
-}
+}
--- a/storage_broker/src/bin/storage_broker.rs
+++ b/storage_broker/src/bin/storage_broker.rs
@@ -33,6 +33,7 @@ use tonic::transport::server::Connected;
 use tonic::Code;
 use tonic::{Request, Response, Status};
 use tracing::*;
+use utils::signals::ShutdownSignals;

 use metrics::{Encoder, TextEncoder};
 use storage_broker::metrics::{NUM_PUBS, NUM_SUBS_ALL, NUM_SUBS_TIMELINE};
@@ -437,6 +438,14 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
    info!("version: {GIT_VERSION}");
    ::metrics::set_build_info_metric(GIT_VERSION);

+    // On any shutdown signal, log its receipt and exit.
+    std::thread::spawn(move || {
+        ShutdownSignals::handle(|signal| {
+            info!("received {}, terminating", signal.name());
+            std::process::exit(0);
+        })
+    });
+
    let registry = Registry {
        shared_state: Arc::new(RwLock::new(SharedState::new(args.all_keys_chan_size))),
        timeline_chan_size: args.timeline_chan_size,
@@ -516,6 +525,7 @@ mod tests {
            peer_horizon_lsn: 5,
            safekeeper_connstr: "neon-1-sk-1.local:7676".to_owned(),
            local_start_lsn: 0,
+            availability_zone: None,
        }
    }

--- a/test_runner/regress/test_timeline_delete.py
+++ b/test_runner/regress/test_timeline_delete.py
@@ -10,7 +10,7 @@ def test_timeline_delete(neon_simple_env: NeonEnv):
    env.pageserver.allowed_errors.append(".*Timeline .* was not found.*")
    env.pageserver.allowed_errors.append(".*timeline not found.*")
    env.pageserver.allowed_errors.append(".*Cannot delete timeline which has child timelines.*")
-    env.pageserver.allowed_errors.append(".*NotFound: tenant .*")
+    env.pageserver.allowed_errors.append(".*Precondition failed: Requested tenant is missing.*")

    ps_http = env.pageserver.http_client()

@@ -24,11 +24,11 @@ def test_timeline_delete(neon_simple_env: NeonEnv):
    invalid_tenant_id = TenantId.generate()
    with pytest.raises(
        PageserverApiException,
-        match=f"NotFound: tenant {invalid_tenant_id}",
+        match="Precondition failed: Requested tenant is missing",
    ) as exc:
        ps_http.timeline_delete(tenant_id=invalid_tenant_id, timeline_id=invalid_timeline_id)

-    assert exc.value.status_code == 404
+    assert exc.value.status_code == 412

    # construct pair of branches to validate that pageserver prohibits
    # deletion of ancestor timelines when they have child branches
--- a/vendor/postgres-v14
+++ b/vendor/postgres-v14
--- a/vendor/postgres-v15
+++ b/vendor/postgres-v15
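
The switch condition described in the commit message for #3883 can be sketched roughly as below. This is a minimal illustration, not the pageserver's actual code: `SafekeeperInfo`, `should_switch_to_same_az`, and the plain `u64` LSN are hypothetical names and simplifications. The check that the active connection is not already in the local zone is also an assumption; the commit message only states the `commit_lsn` comparison.

```rust
/// Log sequence number; a plain u64 stands in for the real LSN type (assumption).
type Lsn = u64;

/// Hypothetical, simplified view of a safekeeper as seen by the walreceiver.
#[derive(Debug, Clone)]
struct SafekeeperInfo {
    commit_lsn: Lsn,
    /// Mirrors the new `optional string availability_zone` proto field.
    availability_zone: Option<String>,
}

/// Returns true if the walreceiver should move from `active` to `candidate`
/// because the candidate is located in the local availability zone.
fn should_switch_to_same_az(
    local_az: Option<&str>,
    active: &SafekeeperInfo,
    candidate: &SafekeeperInfo,
) -> bool {
    // Without a known local AZ there is nothing to prefer.
    let Some(local_az) = local_az else {
        return false;
    };

    let candidate_is_local = candidate.availability_zone.as_deref() == Some(local_az);
    // Assumption: do not switch if the active connection is already local.
    let active_is_local = active.availability_zone.as_deref() == Some(local_az);

    // Switch only once the candidate has caught up: its commit_lsn must be
    // not less than the commit_lsn of the active connection.
    candidate_is_local && !active_is_local && candidate.commit_lsn >= active.commit_lsn
}
```

Because the active connection's `commit_lsn` usually runs ahead of what the broker reports for other safekeepers, the last comparison is rarely true while WAL is being written continuously, which matches the "should happen eventually" behaviour the commit message describes.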
Author	SHA1	Message	Date
Konstantin Knizhnik	81517aeda6	Bump postgres version	2023-04-03 15:35:43 +03:00
Arthur Petukhovsky	814abd9f84	Switch to safekeeper in the same AZ (#3883 ) Add a condition to switch walreceiver connection to safekeeper that is located in the same availability zone. Switch happens when commit_lsn of a candidate is not less than commit_lsn from the active connection. This condition is expected not to trigger instantly, because commit_lsn of a current connection is usually greater than commit_lsn of updates from the broker. That means that if WAL is written continuously, switch can take a lot of time, but it should happen eventually. Now protoc 3.15+ is required for building neon. Fixes https://github.com/neondatabase/neon/issues/3200	2023-04-02 11:32:27 +03:00
Alexander Bayandin	75ffe34b17	check-macos-build: fix cache key (#3926 ) We don't have `${{ matrix.build_type }}` there, so it gets resolved to an empty substring and looks like this [`v1-macOS--pg-f8a650e49b06d39ad131b860117504044b01f312-dcccd010ff851b9f72bb451f28243fa3a341f07028034bbb46ea802413b36d80`](https://github.com/neondatabase/neon/actions/runs/4575422427/jobs/8078231907#step:26:2)	2023-03-31 21:45:59 +03:00
Christian Schwarz	d2aa31f0ce	fix pageserver_evictions_with_low_residence_duration metric (#3925 ) It was doing the comparison in the wrong way.	2023-03-31 19:25:53 +03:00
Dmitry Rodionov	22f9ea5fe2	Remind people to clean up merge commit message in PR template (#3920 )	2023-03-31 16:11:34 +03:00
Joonas Koivunen	d0711d0896	build: fix git perms for deploy job (#3921 ) copy pasted from `build-neon` job. it is interesting that this is only needed by `build-neon` and `deploy`. Fixes: https://github.com/neondatabase/neon/actions/runs/4568077915/jobs/8070960178 which seems to have been going for a while.	2023-03-31 16:05:15 +03:00
Arseny Sher	271f6a6e99	Always sync-safekeepers in neon_local on compute start. Instead of checking neon.safekeepers GUC value in existing pg node data dir, just always run sync-safekeepers when safekeepers are configured. Without this change, creation of new compute didn't run it. That's ok for new timeline/branch (it doesn't return anything useful anyway, and LSN is known by pageserver), but restart of compute for existing timeline bore the risk of getting basebackup not on the latest LSN, i.e. basically broken -- it might not have prev_lsn, and even if it had, walproposer would complain anyway. fixes https://github.com/neondatabase/neon/issues/2963	2023-03-31 16:15:06 +04:00
Christian Schwarz	a64dd3ecb5	disk-usage-based layer eviction (#3809 ) This patch adds a pageserver-global background loop that evicts layers in response to a shortage of available bytes in the $repo/tenants directory's filesystem. The loop runs periodically at a configurable `period`. Each loop iteration uses `statvfs` to determine filesystem-level space usage. It compares the returned usage data against two different types of thresholds. The iteration tries to evict layers until app-internal accounting says we should be below the thresholds. We cross-check this internal accounting with the real world by making another `statvfs` at the end of the iteration. We're good if that second statvfs shows that we're _actually_ below the configured thresholds. If we're still above one or more thresholds, we emit a warning log message, leaving it to the operator to investigate further. There are two thresholds: - `max_usage_pct` is the relative available space, expressed in percent of the total filesystem space. If the actual usage is higher, the threshold is exceeded. - `min_avail_bytes` is the absolute available space in bytes. If the actual usage is lower, the threshold is exceeded. The iteration evicts layers in LRU fashion with a reservation of up to `tenant_min_resident_size` bytes of the most recent layers per tenant. The layers not part of the per-tenant reservation are evicted least-recently-used first until we're below all thresholds. The `tenant_min_resident_size` can be overridden per tenant as `min_resident_size_override` (bytes). In addition to the loop, there is also an HTTP endpoint to perform one loop iteration synchronous to the request. The endpoint takes an absolute number of bytes that the iteration needs to evict before pressure is relieved. The tests use this endpoint, which is a great simplification over setting up loopback-mounts in the tests, which would be required to test the statvfs part of the implementation. We will rely on manual testing in staging to test the statvfs parts. The HTTP endpoint is also handy in emergencies where an operator wants the pageserver to evict a given amount of space _now_. Hence, its arguments are documented in openapi_spec.yml. The response type isn't documented though because we don't consider it stable. The endpoint should _not_ be used by Console but it could be used by on-call. Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Dmitry Rodionov <dmitry@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2023-03-31 14:47:57 +03:00
Konstantin Knizhnik	bf46237fc2	Fix prefetch for parallel bitmap scan (#3875 ) ## Describe your changes Fix prefetch for parallel bitmap scan ## Issue ticket number and link ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.	2023-03-30 22:07:19 +03:00
Lassi Pölönen	41d364a8f1	Add more detailed logging to compute_ctl's shutdown (#3915 ) Currently we can't tell from the logs whether shutting down tracing takes a long time. We do see that shutting down computes gets delayed for some reason and hits the grace period limit. Move the shutdown message slightly later, to the point where nothing is left to do but exit. ## Issue ticket number and link ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.	2023-03-30 22:02:39 +03:00
Christian Schwarz	fa54a57ca2	random_init_delay: remove the minimum of 10 seconds (#3914 ) Before this patch, the range from which the random delay is picked is at minimum 10 seconds. With this patch, the delay is bounded to whatever the given `period` is, and zero, if period is Duration::ZERO. Motivation for this: the disk usage eviction tests that we'll add in https://github.com/neondatabase/neon/pull/3905 need to wait for the disk usage eviction background loop to do its job. They set a period of 1s. It seems wasteful to wait 10 seconds in the tests. Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-03-30 18:38:45 +02:00
Lassi Pölönen	1c1bb904ed	Rename zenith_* labels to neon_* (#3911 ) ## Describe your changes Get rid of the legacy labeling. Also, `neon_region_slug` with the same value as `neon_region` doesn't make much sense, so just drop it. This allows us to drop the relabeling from zenith to neon in the log collector.	2023-03-30 16:24:47 +03:00
Gleb Novikov	b26c837ed6	Fixed pageserver openapi spec properties reference (#3904 ) ## Describe your changes In [this linter run](https://github.com/neondatabase/cloud/actions/runs/4553032319/jobs/8029101300?pr=4391) accidentally found out that spec is invalid. Reference other schemas in properties should be done the way I changed. Could not find documentation specifically for schemas embedding in `components.schemas`, but it seems like the approach is inherited from json schema: https://json-schema.org/understanding-json-schema/structuring.html#ref ## Issue ticket number and link - ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] ~If it is a core feature, I have added thorough tests.~ - [ ] ~Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?~ - [ ] ~If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.~	2023-03-29 19:18:44 +04:00
Kirill Bulatov	ac9c7e8c4a	Replace pin! from tokio to the std one (#3903 ) With fresh rustc brought by https://github.com/neondatabase/neon/pull/3902, we can use `std::pin::pin!` macro instead of the tokio one. One place did not need the macro at all, other places were adjusted.	2023-03-29 14:14:56 +03:00
Vadim Kharitonov	f1b174dc6a	Update rust version to 1.68.2	2023-03-29 12:50:04 +04:00
Kirill Bulatov	9d714a8413	Split $CARGO_FLAGS and $CARGO_FEATURES to make e2e tests work	2023-03-29 00:08:30 +03:00
Kirill Bulatov	6c84cbbb58	Run new Rust IT test in CI	2023-03-29 00:08:30 +03:00
Kirill Bulatov	1300dc9239	Replace Python IT test with the Rust one	2023-03-29 00:08:30 +03:00
Kirill Bulatov	018c8b0e2b	Use proper tokens and delimiters when listing S3	2023-03-29 00:08:30 +03:00
Arseny Sher	b52389f228	Cleanly exit on any shutdown signal in storage_broker. neon_local sends SIGQUIT, which otherwise dumps core by default. Also, remove obsolete install_shutdown_handlers; in all binaries it was overridden by ShutdownSignals::handle later. ref https://github.com/neondatabase/neon/issues/3847	2023-03-28 22:29:42 +04:00
Heikki Linnakangas	5a123b56e5	Remove obsolete hack to rename neon-specific GUCs. I checked the console database, we don't have any of these left in production.	2023-03-28 17:57:22 +03:00
Arthur Petukhovsky	7456e5b71c	Add script to collect state from safekeepers (#3835 ) Add an ansible script to collect https://github.com/neondatabase/neon/pull/3710 state JSON from all safekeeper nodes and upload them to a postgres table.	2023-03-28 17:04:02 +03:00
Konstantin Knizhnik	9798737ec6	Update pgxn/neon/file_cache.c Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2023-03-28 14:43:34 +04:00
Konstantin Knizhnik	35ecb139dc	Use statvfs instead of statfs to fix macOS build	2023-03-28 14:43:34 +04:00
Arseny Sher	278d0f117d	Rename neon_local sk logs s/safekeeper 1.log/safekeeper-1.log. I don't like spaces in file names.	2023-03-28 14:28:56 +04:00
Arseny Sher	c30b9e6eb1	Show full path to pg_ctl invocation when it fails.	2023-03-28 12:06:06 +04:00
Konstantin Knizhnik	82a4777046	Add local free space monitor (#3832 ) ## Describe your changes Monitor free space in the local file system and shrink the local file cache if free space is under the watermark. Neon uses local storage for temp files (temp tables + intermediate results), unlogged relations and the local file cache. Ideally all space not used for temporary files should be used for the local file cache. Temporary files and even unlogged relations are intended to have a short lifetime (because they can be lost at any moment in case of compute restart). So the policy is to overcommit the local cache size and shrink it if there is not enough free space. Since temporary files are expected to be needed only for a short time, there is no need to permanently shrink the local file cache size. Instead, we just throw away the least recently accessed elements from the local file cache, releasing some space on the local disk. ## Issue ticket number and link ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. --------- Co-authored-by: sharnoff <sharnoff@neon.tech>	2023-03-28 08:27:50 +03:00
Dmitry Rodionov	6efea43449	Use precondition failed code in delete_timeline when tenant is missing (#3884 ) This allows client to differentiate between missing tenant and missing timeline cases	2023-03-27 21:01:46 +03:00
Joonas Koivunen	f14895b48e	eviction: avoid post-restart download by synthetic_size (#3871 ) As of #3867, we do artificial layer accesses to layers that will be needed after the next restart, but not until then because of caches. With this patch, we also do that for the accesses that the synthetic size calculation worker does if consumption metrics are enabled. The actual size calculation is not of importance, but we need to calculate all of the sizes, so we only call tenant::size::gather_inputs. Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-03-27 19:20:23 +02:00