Merge pull request #2275 from neondatabase/main

* github/workflows: Fix git dubious ownership (#2223)
* Move relation size cache from WalIngest to DatadirTimeline (#2094)
* Move relation size cache to layered timeline
* Fix obtaining current LSN for relation size cache
* Resolve merge conflicts
* Resolve merge conflicts
* Restore 'lsn' field in DatadirModification
* adjust DatadirModification lsn in ingest_record
* Fix formatting
* Pass lsn to get_relsize
* Fix merge conflict
* Update pageserver/src/pgdatadir_mapping.rs
  Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update pageserver/src/pgdatadir_mapping.rs
  Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
  Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* refactor: replace lazy-static with once-cell (#2195)
  - Replacing all the occurrences of lazy-static with `once-cell::sync::Lazy`
  - fixes #1147
  Signed-off-by: Ankur Srivastava <best.ankur@gmail.com>
* Add more buckets to pageserver latency metrics (#2225)
* ignore record property warning to fix benchmarks
* increase statement timeout
* use event so it fires only if workload thread successfully finished
* remove debug log
* increase timeout to pass test with real s3
* avoid duplicate parameter, increase timeout
* Major migration script (#2073)
  This script can be used to migrate a tenant across breaking storage versions, or (in the future) upgrading postgres versions. See the comment at the top for an overview.
  Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
* Fix etcd typos
* Fix links to safekeeper protocol docs. (#2188)
  safekeeper/README_PROTO.md was moved to docs/safekeeper-protocol.md in commit 0b14fdb078, as part of reorganizing the docs into 'mdbook' format. Fixes issue #1475. Thanks to @banks for spotting the outdated references.
  In addition to fixing the above issue, this patch also fixes other broken links as a result of 0b14fdb078. See https://github.com/neondatabase/neon/pull/2188#pullrequestreview-1055918480.
  Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
  Co-authored-by: Thang Pham <thang@neon.tech>
* Update CONTRIBUTING.md
* Update CONTRIBUTING.md
* support node id and remote storage params in docker_entrypoint.sh
* Safe truncate (#2218)
* Move relation size cache to layered timeline
* Fix obtaining current LSN for relation size cache
* Resolve merge conflicts
* Resolve merge conflicts
* Restore 'lsn' field in DatadirModification
* adjust DatadirModification lsn in ingest_record
* Fix formatting
* Pass lsn to get_relsize
* Fix merge conflict
* Update pageserver/src/pgdatadir_mapping.rs
  Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update pageserver/src/pgdatadir_mapping.rs
  Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Check if relation exists before trying to truncate it
  refer #1932
* Add test reproducing FSM truncate problem
  Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Fix exponential backoff values
* Update back `vendor/postgres` back; it was changed accidentally. (#2251)
  Commit 4227cfc96e accidentally reverted vendor/postgres to an older version. Update it back.
* Add pageserver checkpoint_timeout option.
  To flush inmemory layer eventually when no new data arrives, which helps safekeepers to suspend activity (stop pushing to the broker). Default 10m should be ok.
* Share exponential backoff code and fix logic for delete task failure (#2252)
* Fix bug when import large (>1GB) relations (#2172)
  Resolves #2097
  - use timeline modification's `lsn` and timeline's `last_record_lsn` to determine the corresponding LSN to query data in `DatadirModification::get`
  - update `test_import_from_pageserver`. Split the test into 2 variants: `small` and `multisegment`.
    + `small` is the old test
    + `multisegment` is to simulate #2097 by using a larger number of inserted rows to create multiple segment files of a relation. `multisegment` is configured to only run with a `release` build
* Fix timeline physical size flaky tests (#2244)
  Resolves #2212.
  - use `wait_for_last_flush_lsn` in `test_timeline_physical_size_*` tests
  ## Context
  Need to wait for the pageserver to catch up with the compute's last flush LSN because during the timeline physical size API call, it's possible that there are running `LayerFlushThread` threads. These threads flush new layers into disk and hence update the physical size. This results in a mismatch between the physical size reported by the API and the actual physical size on disk.
  ### Note
  The `LayerFlushThread` threads are processed **concurrently**, so it's possible that the above error still persists even with this patch. However, making the tests wait to finish processing all the WALs (not flushing) before calculating the physical size should help reduce the "flakiness" significantly
* postgres_ffi/waldecoder: validate more header fields
* postgres_ffi/waldecoder: remove unused startlsn
* postgres_ffi/waldecoder: introduce explicit `enum State`
  Previously it was emulated with a combination of nullable fields. This change should make the logic more readable.
* disable `test_import_from_pageserver_multisegment` (#2258)
  This test failed consistently on `main` now. It's better to temporarily disable it to avoid blocking others' PRs while investigating the root cause for the test failure.
  See: #2255, #2256
* get_binaries uses DOCKER_TAG taken from docker image build step (#2260)
* [proxy] Rework wire format of the password hack and some errors (#2236)
  The new format has a few benefits: it's shorter, simpler and human-readable as well. We don't use base64 anymore, since url encoding got us covered.
  We also show a better error in case we couldn't parse the payload; the users should know it's all about passing the correct project name.
* test_runner/pg_clients: collect docker logs (#2259)
* get_binaries script fix (#2263)
* get_binaries uses DOCKER_TAG taken from docker image build step
* remove docker tag discovery at all and fix get_binaries for version variable
* Better storage sync logs (#2268)
* Find end of WAL on safekeepers using WalStreamDecoder.
  We could make it inside wal_storage.rs, but taking into account that
  - wal_storage.rs reading is async
  - we don't need s3 here
  - error handling is different; error during decoding is normal
  I decided to put it separately.
  Test cargo test test_find_end_of_wal_last_crossing_segment prepared earlier by @yeputons passes now.
  Fixes https://github.com/neondatabase/neon/issues/544 https://github.com/neondatabase/cloud/issues/2004
  Supersedes https://github.com/neondatabase/neon/pull/2066
* Improve walreceiver logic (#2253)
  This patch makes walreceiver logic more complicated, but it should work better in most cases. Added `test_wal_lagging` to test scenarios where alive safekeepers can lag behind other alive safekeepers.
  - There was a bug which looks like `etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())` filtered all safekeepers in some strange cases. I removed this filter, it should probably help with #2237
  - Now walreceiver_connection reports status, including commit_lsn. This allows keeping safekeeper connection even when etcd is down.
  - Safekeeper connection now fails if pageserver doesn't receive safekeeper messages for some time. Usually safekeeper sends messages at least once per second.
  - `LaggingWal` check now uses `commit_lsn` directly from safekeeper. This fixes the issue with often reconnects, when compute generates WAL really fast.
  - `NoWalTimeout` is rewritten to trigger only when we know about the new WAL and the connected safekeeper doesn't stream any WAL. This allows setting a small `lagging_wal_timeout` because it will trigger only when we observe that the connected safekeeper has stuck.
* increase timeout in wait_for_upload to avoid spurious failures when testing with real s3
* Bump vendor/postgres to include XLP_FIRST_IS_CONTRECORD fix. (#2274)
* Set up a workflow to run pgbench against captest (#2077)

Signed-off-by: Ankur Srivastava <best.ankur@gmail.com>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Co-authored-by: Ankur Srivastava <ansrivas@users.noreply.github.com>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Thang Pham <thang@neon.tech>
Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
Co-authored-by: Egor Suvorov <egor@neon.tech>
Co-authored-by: Andrey Taranik <andrey@cicd.team>
Co-authored-by: Dmitry Ivanov <ivadmi5@gmail.com>
2026-06-03 21:40:39 +00:00 · 2022-08-15 21:30:45 +03:00
parent e814ac16f9 4cddb0f1a4
commit 873347f977
96 changed files with 2892 additions and 1519 deletions
--- a/.github/actions/run-python-test-set/action.yml
+++ b/.github/actions/run-python-test-set/action.yml
@@ -83,9 +83,10 @@ runs:
        # this variable will be embedded in perf test report
        # and is needed to distinguish different environments
        PLATFORM: github-actions-selfhosted
+        BUILD_TYPE: ${{ inputs.build_type }}
        AWS_ACCESS_KEY_ID: ${{ inputs.real_s3_access_key_id }}
        AWS_SECRET_ACCESS_KEY: ${{ inputs.real_s3_secret_access_key }}
-      shell: bash -euxo pipefail {0} {0}
+      shell: bash -euxo pipefail {0}
      run: |
        PERF_REPORT_DIR="$(realpath test_runner/perf-report-local)"
        rm -rf $PERF_REPORT_DIR
--- a/.github/actions/upload/action.yml
+++ b/.github/actions/upload/action.yml
@@ -29,8 +29,12 @@ runs:
          time tar -C ${SOURCE} -cf ${ARCHIVE} --zstd .
        elif [ -f ${SOURCE} ]; then
          time tar -cf ${ARCHIVE} --zstd ${SOURCE}
+        elif ! ls ${SOURCE} > /dev/null 2>&1; then
+          echo >&2 "${SOURCE} does not exist"
+          exit 2
        else
-          echo 2>&1 "${SOURCE} neither directory nor file, don't know how to handle it"
+          echo >&2 "${SOURCE} is neither a directory nor a file, do not know how to handle it"
+          exit 3
        fi

    - name: Upload artifact
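The branch order in the upload step above can be tried out locally. The following is a minimal sketch, assuming a hypothetical helper named `classify_source` (not part of the action); it only mirrors the directory / file / missing / other checks and their exit codes 2 and 3, and writes its error messages to stderr.

```shell
#!/bin/sh
# Hypothetical helper (not in the repository): mirrors the branch order
# of the upload action for a single path.
classify_source() {
  src="$1"
  if [ -d "$src" ]; then
    echo "directory"
  elif [ -f "$src" ]; then
    echo "file"
  elif ! ls "$src" > /dev/null 2>&1; then
    # The path does not exist at all.
    echo "$src does not exist" >&2
    return 2
  else
    # The path exists but is e.g. a device node or a FIFO.
    echo "$src is neither a directory nor a file" >&2
    return 3
  fi
}

tmpdir="$(mktemp -d)"
touch "$tmpdir/archive-me"
classify_source "$tmpdir"
classify_source "$tmpdir/archive-me"
classify_source "$tmpdir/missing" || echo "exit code: $?"
rm -rf "$tmpdir"
```

Distinct exit codes make it possible to tell a missing path apart from an unsupported file type in the job log.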
--- a/.github/ansible/get_binaries.sh
+++ b/.github/ansible/get_binaries.sh
@@ -2,30 +2,14 @@

 set -e

-RELEASE=${RELEASE:-false}
-
-# look at docker hub for latest tag for neon docker image
-if [ "${RELEASE}" = "true" ]; then
-    echo "search latest release tag"
-    VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep release | sed 's/release-//g' | grep -E '^[0-9]+$' | sort -n | tail -1)
-    if [ -z "${VERSION}" ]; then
-        echo "no any docker tags found, exiting..."
-        exit 1
-    else
-        TAG="release-${VERSION}"
-    fi
+if [ -n "${DOCKER_TAG}" ]; then
+  # Version is DOCKER_TAG but without prefix
+  VERSION=$(echo $DOCKER_TAG | sed 's/^.*-//g')
 else
-    echo "search latest dev tag"
-    VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep -E '^[0-9]+$' | sort -n | tail -1)
-    if [ -z "${VERSION}" ]; then
-        echo "no any docker tags found, exiting..."
-        exit 1
-    else
-        TAG="${VERSION}"
-    fi
+  echo "Please set DOCKER_TAG environment variable"
+  exit 1
 fi

-echo "found ${VERSION}"

 # do initial cleanup
 rm -rf neon_install postgres_install.tar.gz neon_install.tar.gz .neon_current_version
@@ -33,8 +17,8 @@ mkdir neon_install

 # retrieve binaries from docker image
 echo "getting binaries from docker image"
-docker pull --quiet neondatabase/neon:${TAG}
-ID=$(docker create neondatabase/neon:${TAG})
+docker pull --quiet neondatabase/neon:${DOCKER_TAG}
+ID=$(docker create neondatabase/neon:${DOCKER_TAG})
 docker cp ${ID}:/data/postgres_install.tar.gz .
 tar -xzf postgres_install.tar.gz -C neon_install
 docker cp ${ID}:/usr/local/bin/pageserver neon_install/bin/
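To see what the new `sed` expression in get_binaries.sh does to a tag, here is a small sketch. The function name `version_from_tag` and the sample tag values are assumptions made for illustration; only the `sed 's/^.*-//g'` expression comes from the script.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed expression used in get_binaries.sh.
version_from_tag() {
  # The match is greedy: everything up to and including the LAST "-" is removed.
  printf '%s\n' "$1" | sed 's/^.*-//g'
}

version_from_tag "release-1234"   # prints 1234
version_from_tag "2345"           # prints 2345 (no dash, so nothing is removed)
```

Because the match is greedy, a tag containing several dashes keeps only the part after the last one.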
--- a/.github/workflows/benchmarking.yml
+++ b/.github/workflows/benchmarking.yml
@@ -1,4 +1,4 @@
-name: benchmarking
+name: Benchmarking

 on:
  # uncomment to run on push for debugging your PR
@@ -15,6 +15,15 @@ on:

  workflow_dispatch: # adds ability to run this manually

+defaults:
+  run:
+    shell: bash -euxo pipefail {0}
+
+concurrency:
+  # Allow only one workflow per any non-`main` branch.
+  group: ${{ github.workflow }}-${{ github.ref }}-${{ github.ref == 'refs/heads/main' && github.sha || 'anysha' }}
+  cancel-in-progress: true
+
 jobs:
  bench:
    # this workflow runs on a self-hosted runner
@@ -60,7 +69,6 @@ jobs:
    - name: Setup cluster
      env:
        BENCHMARK_CONNSTR: "${{ secrets.BENCHMARK_STAGING_CONNSTR }}"
-      shell: bash -euxo pipefail {0}
      run: |
        set -e

@@ -96,7 +104,9 @@ jobs:
        # since it might generate duplicates when calling ingest_perf_test_result.py
        rm -rf perf-report-staging
        mkdir -p perf-report-staging
-        ./scripts/pytest test_runner/performance/ -v -m "remote_cluster" --skip-interfering-proc-check --out-dir perf-report-staging --timeout 3600
+        # Set the --sparse-ordering option of the pytest-order plugin so that tests run in the order they appear in the file;
+        # this matters for the test_perf_pgbench.py::test_pgbench_remote_* tests
+        ./scripts/pytest test_runner/performance/ -v -m "remote_cluster" --sparse-ordering --skip-interfering-proc-check --out-dir perf-report-staging --timeout 3600

    - name: Submit result
      env:
@@ -113,3 +123,106 @@ jobs:
        slack-message: "Periodic perf testing: ${{ job.status }}\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
      env:
        SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
+
+  pgbench-compare:
+    env:
+      TEST_PG_BENCH_DURATIONS_MATRIX: "60m"
+      TEST_PG_BENCH_SCALES_MATRIX: "10gb"
+      REMOTE_ENV: "1"
+      POSTGRES_DISTRIB_DIR: /usr
+      TEST_OUTPUT: /tmp/test_output
+
+    strategy:
+      fail-fast: false
+      matrix:
+        connstr: [ BENCHMARK_CAPTEST_CONNSTR, BENCHMARK_RDS_CONNSTR ]
+
+    runs-on: dev
+    container: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rustlegacy:2817580636
+
+    timeout-minutes: 360 # 6h
+
+    steps:
+    - uses: actions/checkout@v3
+
+    - name: Cache poetry deps
+      id: cache_poetry
+      uses: actions/cache@v3
+      with:
+        path: ~/.cache/pypoetry/virtualenvs
+        key: v2-${{ runner.os }}-python-deps-${{ hashFiles('poetry.lock') }}
+
+    - name: Install Python deps
+      run: ./scripts/pysync
+
+    - name: Calculate platform
+      id: calculate-platform
+      env:
+        CONNSTR: ${{ matrix.connstr }}
+      run: |
+        if [ "${CONNSTR}" = "BENCHMARK_CAPTEST_CONNSTR" ]; then
+          PLATFORM=neon-captest
+        elif [ "${CONNSTR}" = "BENCHMARK_RDS_CONNSTR" ]; then
+          PLATFORM=rds-aurora
+        else
+          echo >&2 "Unknown CONNSTR=${CONNSTR}. Allowed are BENCHMARK_CAPTEST_CONNSTR, and BENCHMARK_RDS_CONNSTR only"
+          exit 1
+        fi
+
+        echo "::set-output name=PLATFORM::${PLATFORM}"
+
+    - name: Install Deps
+      run: |
+        echo "deb http://apt.postgresql.org/pub/repos/apt focal-pgdg main" | sudo tee /etc/apt/sources.list.d/pgdg.list
+        wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
+        sudo apt -y update
+        sudo apt install -y postgresql-14 postgresql-client-14
+
+    - name: Benchmark init
+      env:
+        PLATFORM: ${{ steps.calculate-platform.outputs.PLATFORM }}
+        BENCHMARK_CONNSTR: ${{ secrets[matrix.connstr] }}
+      run: |
+        mkdir -p perf-report-captest
+
+        psql $BENCHMARK_CONNSTR -c "SELECT 1;"
+        ./scripts/pytest test_runner/performance/test_perf_pgbench.py::test_pgbench_remote_init -v -m "remote_cluster" --skip-interfering-proc-check --out-dir perf-report-captest --timeout 21600
+
+    - name: Benchmark simple-update
+      env:
+        PLATFORM: ${{ steps.calculate-platform.outputs.PLATFORM }}
+        BENCHMARK_CONNSTR: ${{ secrets[matrix.connstr] }}
+      run: |
+        psql $BENCHMARK_CONNSTR -c "SELECT 1;"
+        ./scripts/pytest test_runner/performance/test_perf_pgbench.py::test_pgbench_remote_simple_update -v -m "remote_cluster" --skip-interfering-proc-check --out-dir perf-report-captest --timeout 21600
+
+    - name: Benchmark select-only
+      env:
+        PLATFORM: ${{ steps.calculate-platform.outputs.PLATFORM }}
+        BENCHMARK_CONNSTR: ${{ secrets[matrix.connstr] }}
+      run: |
+        psql $BENCHMARK_CONNSTR -c "SELECT 1;"
+        ./scripts/pytest test_runner/performance/test_perf_pgbench.py::test_pgbench_remote_select_only -v -m "remote_cluster" --skip-interfering-proc-check --out-dir perf-report-captest --timeout 21600
+
+    - name: Submit result
+      env:
+        VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
+        PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
+      run: |
+        REPORT_FROM=$(realpath perf-report-captest) REPORT_TO=staging scripts/generate_and_push_perf_report.sh
+
+    - name: Upload logs
+      if: always()
+      uses: ./.github/actions/upload
+      with:
+        name: bench-captest-${{ steps.calculate-platform.outputs.PLATFORM }}
+        path: /tmp/test_output/
+
+    - name: Post to a Slack channel
+      if: ${{ github.event.schedule && failure() }}
+      uses: slackapi/slack-github-action@v1
+      with:
+        channel-id: "C033QLM5P7D" # dev-staging-stream
+        slack-message: "Periodic perf testing: ${{ job.status }}\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+      env:
+        SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
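The `concurrency.group` expression added near the top of this workflow can be modeled in a few lines of shell. This is a sketch only: `concurrency_group` is a hypothetical function, and the workflow name, ref, and SHA are passed in by hand rather than read from the GitHub context.

```shell
#!/bin/sh
# Hypothetical model of:
#   ${{ github.workflow }}-${{ github.ref }}-${{ github.ref == 'refs/heads/main' && github.sha || 'anysha' }}
concurrency_group() {
  workflow="$1"
  ref="$2"
  sha="$3"
  if [ "$ref" = "refs/heads/main" ]; then
    suffix="$sha"      # on main, the commit SHA becomes part of the group name
  else
    suffix="anysha"    # elsewhere, all runs for the same ref share one group
  fi
  echo "${workflow}-${ref}-${suffix}"
}

concurrency_group "Benchmarking" "refs/heads/main" "873347f977"
# prints Benchmarking-refs/heads/main-873347f977
concurrency_group "Benchmarking" "refs/heads/my-feature" "873347f977"
# prints Benchmarking-refs/heads/my-feature-anysha
```

With `cancel-in-progress: true`, runs that land in the same group cancel each other, so a new push to a feature branch replaces the run in progress there, while runs for different commits on `main` fall into different groups.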
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -35,6 +35,16 @@ jobs:
      GIT_VERSION: ${{ github.sha }}

    steps:
+      - name: Fix git ownership
+        run: |
+          # Workaround for `fatal: detected dubious ownership in repository at ...`
+          #
+          # Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
+          #   Ref https://github.com/actions/checkout/issues/785
+          #
+          git config --global --add safe.directory ${{ github.workspace }}
+          git config --global --add safe.directory ${GITHUB_WORKSPACE}
+
      - name: Checkout
        uses: actions/checkout@v3
        with:
@@ -552,6 +562,7 @@ jobs:

      - name: Redeploy
        run: |
+          export DOCKER_TAG=${{needs.docker-image.outputs.build-tag}}
          cd "$(pwd)/.github/ansible"

          if [[ "$GITHUB_REF_NAME" == "main" ]]; then
--- a/.github/workflows/pg_clients.yml
+++ b/.github/workflows/pg_clients.yml
@@ -19,8 +19,12 @@ concurrency:

 jobs:
  test-postgres-client-libs:
+    # TODO: switch to gen2 runner, requires docker
    runs-on: [ ubuntu-latest ]

+    env:
+      TEST_OUTPUT: /tmp/test_output
+
    steps:
    - name: Checkout
      uses: actions/checkout@v3
@@ -47,7 +51,7 @@ jobs:
      env:
        REMOTE_ENV: 1
        BENCHMARK_CONNSTR: "${{ secrets.BENCHMARK_STAGING_CONNSTR }}"
-        TEST_OUTPUT: /tmp/test_output
+
        POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
      shell: bash -euxo pipefail {0}
      run: |
@@ -61,9 +65,18 @@ jobs:
          -m "remote_cluster" \
          -rA "test_runner/pg_clients"

+    # We use GitHub's upload-artifact action because `ubuntu-latest` doesn't have the AWS CLI configured.
+    # This will be fixed after switching to a gen2 runner
+    - name: Upload python test logs
+      if: always()
+      uses: actions/upload-artifact@v3
+      with:
+        retention-days: 7
+        name: python-test-pg_clients-${{ runner.os }}-stage-logs
+        path: ${{ env.TEST_OUTPUT }}
+
    - name: Post to a Slack channel
-      if: failure()
-      id: slack
+      if: ${{ github.event.schedule && failure() }}
      uses: slackapi/slack-github-action@v1
      with:
        channel-id: "C033QLM5P7D" # dev-staging-stream
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -11,17 +11,15 @@ than it was before.

 ## Submitting changes

-1. Make a PR for every change.
-
-   Even seemingly trivial patches can break things in surprising ways.
-Use of common sense is OK. If you're only fixing a typo in a comment,
-it's probably fine to just push it. But if in doubt, open a PR.
-
-2. Get at least one +1 on your PR before you push.
+1. Get at least one +1 on your PR before you push.

   For simple patches, it will only take a minute for someone to review
 it.

+2. Don't force push small changes after making the PR ready for review.
+Doing so will force readers to re-read your entire PR, which will delay
+the review process.
+
 3. Always keep the CI green.

   Do not push, if the CI failed on your PR. Even if you think it's not
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -495,8 +495,8 @@ name = "control_plane"
 version = "0.1.0"
 dependencies = [
 "anyhow",
- "lazy_static",
 "nix",
+ "once_cell",
 "pageserver",
 "postgres",
 "regex",
@@ -1591,8 +1591,8 @@ dependencies = [
 name = "metrics"
 version = "0.1.0"
 dependencies = [
- "lazy_static",
 "libc",
+ "once_cell",
 "prometheus",
 "workspace_hack",
 ]
@@ -1870,7 +1870,6 @@ dependencies = [
 "humantime-serde",
 "hyper",
 "itertools",
- "lazy_static",
 "metrics",
 "nix",
 "once_cell",
@@ -2116,9 +2115,9 @@ dependencies = [
 "crc32c",
 "env_logger",
 "hex",
- "lazy_static",
 "log",
 "memoffset",
+ "once_cell",
 "postgres",
 "rand",
 "regex",
@@ -2270,6 +2269,7 @@ dependencies = [
 "anyhow",
 "async-trait",
 "base64",
+ "bstr",
 "bytes",
 "clap 3.2.12",
 "futures",
@@ -2278,9 +2278,9 @@ dependencies = [
 "hex",
 "hmac 0.12.1",
 "hyper",
- "lazy_static",
 "md5",
 "metrics",
+ "once_cell",
 "parking_lot 0.12.1",
 "pin-project-lite",
 "rand",
@@ -2754,7 +2754,6 @@ dependencies = [
 "hex",
 "humantime",
 "hyper",
- "lazy_static",
 "metrics",
 "once_cell",
 "postgres",
@@ -3671,9 +3670,9 @@ dependencies = [
 "hex-literal",
 "hyper",
 "jsonwebtoken",
- "lazy_static",
 "metrics",
 "nix",
+ "once_cell",
 "pin-project-lite",
 "postgres",
 "postgres-protocol",
--- a/control_plane/Cargo.toml
+++ b/control_plane/Cargo.toml
@@ -9,7 +9,7 @@ postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8
 serde = { version = "1.0", features = ["derive"] }
 serde_with = "1.12.0"
 toml = "0.5"
-lazy_static = "1.4"
+once_cell = "1.13.0"
 regex = "1"
 anyhow = "1.0"
 thiserror = "1"
--- a/control_plane/src/etcd.rs
+++ b/control_plane/src/etcd.rs
@@ -30,14 +30,14 @@ pub fn start_etcd_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
    let etcd_stdout_file =
        fs::File::create(etcd_data_dir.join("etcd.stdout.log")).with_context(|| {
            format!(
-                "Failed to create ectd stout file in directory {}",
+                "Failed to create etcd stdout file in directory {}",
                etcd_data_dir.display()
            )
        })?;
    let etcd_stderr_file =
        fs::File::create(etcd_data_dir.join("etcd.stderr.log")).with_context(|| {
            format!(
-                "Failed to create ectd stderr file in directory {}",
+                "Failed to create etcd stderr file in directory {}",
                etcd_data_dir.display()
            )
        })?;
--- a/control_plane/src/postgresql_conf.rs
+++ b/control_plane/src/postgresql_conf.rs
@@ -5,7 +5,7 @@
 /// enough to extract a few settings we need in Zenith, assuming you don't do
 /// funny stuff like include-directives or funny escaping.
 use anyhow::{bail, Context, Result};
-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;
 use regex::Regex;
 use std::collections::HashMap;
 use std::fmt;
@@ -19,9 +19,7 @@ pub struct PostgresConf {
    hash: HashMap<String, String>,
 }

-lazy_static! {
-    static ref CONF_LINE_RE: Regex = Regex::new(r"^((?:\w|\.)+)\s*=\s*(\S+)$").unwrap();
-}
+static CONF_LINE_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"^((?:\w|\.)+)\s*=\s*(\S+)$").unwrap());

 impl PostgresConf {
    pub fn new() -> PostgresConf {
@@ -139,10 +137,10 @@ fn escape_str(s: &str) -> String {
    //
    // This regex is a bit more conservative than the rules in guc-file.l, so we quote some
    // strings that PostgreSQL would accept without quoting, but that's OK.
-    lazy_static! {
-        static ref UNQUOTED_RE: Regex =
-            Regex::new(r"(^[-+]?[0-9]+[a-zA-Z]*$)|(^[a-zA-Z][a-zA-Z0-9]*$)").unwrap();
-    }
+
+    static UNQUOTED_RE: Lazy<Regex> =
+        Lazy::new(|| Regex::new(r"(^[-+]?[0-9]+[a-zA-Z]*$)|(^[a-zA-Z][a-zA-Z0-9]*$)").unwrap());
+
    if UNQUOTED_RE.is_match(s) {
        s.to_string()
    } else {
--- a/control_plane/src/storage.rs
+++ b/control_plane/src/storage.rs
@@ -401,6 +401,7 @@ impl PageServerNode {
                    .get("checkpoint_distance")
                    .map(|x| x.parse::<u64>())
                    .transpose()?,
+                checkpoint_timeout: settings.get("checkpoint_timeout").map(|x| x.to_string()),
                compaction_target_size: settings
                    .get("compaction_target_size")
                    .map(|x| x.parse::<u64>())
@@ -455,6 +456,7 @@ impl PageServerNode {
                    .map(|x| x.parse::<u64>())
                    .transpose()
                    .context("Failed to parse 'checkpoint_distance' as an integer")?,
+                checkpoint_timeout: settings.get("checkpoint_timeout").map(|x| x.to_string()),
                compaction_target_size: settings
                    .get("compaction_target_size")
                    .map(|x| x.parse::<u64>())
--- a/docker-entrypoint.sh
+++ b/docker-entrypoint.sh
@@ -1,6 +1,8 @@
 #!/bin/sh
 set -eux

+pageserver_id_param="${NODE_ID:-10}"
+
 broker_endpoints_param="${BROKER_ENDPOINT:-absent}"
 if [ "$broker_endpoints_param" != "absent" ]; then
    broker_endpoints_param="-c broker_endpoints=['$broker_endpoints_param']"
@@ -8,10 +10,12 @@ else
    broker_endpoints_param=''
 fi

+remote_storage_param="${REMOTE_STORAGE:-}"
+
 if [ "$1" = 'pageserver' ]; then
    if [ ! -d "/data/tenants" ]; then
        echo "Initializing pageserver data directory"
-        pageserver --init -D /data -c "pg_distrib_dir='/usr/local'" -c "id=10" $broker_endpoints_param
+        pageserver --init -D /data -c "pg_distrib_dir='/usr/local'" -c "id=${pageserver_id_param}" $broker_endpoints_param $remote_storage_param
    fi
    echo "Starting pageserver at 0.0.0.0:6400"
    pageserver -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" $broker_endpoints_param -D /data
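The two new parameters in docker-entrypoint.sh rely on `${VAR:-default}` expansion. The sketch below shows how the resulting argument list changes with the environment. `init_args` is a hypothetical function, and the `REMOTE_STORAGE` value is an assumed example modeled on the `-c "remote_storage={...}"` form in docs/settings.md.

```shell
#!/bin/sh
# Hypothetical helper: builds the variable part of the `pageserver --init`
# arguments the same way the entrypoint does and prints the argument count.
init_args() {
  pageserver_id_param="${NODE_ID:-10}"        # falls back to 10 when NODE_ID is unset or empty
  remote_storage_param="${REMOTE_STORAGE:-}"  # falls back to an empty string
  # Left unquoted on purpose, as in the script: an empty value adds no argument.
  set -- -c "id=${pageserver_id_param}" $remote_storage_param
  echo "$#: $*"
}

unset NODE_ID REMOTE_STORAGE
init_args
# prints 2: -c id=10
( NODE_ID=42; REMOTE_STORAGE="-c remote_storage={local_path='/tmp/x'}"; init_args )
# prints 4: -c id=42 -c remote_storage={local_path='/tmp/x'}
```

The `:-` form also keeps the script safe under `set -u`, since an unset variable expands to the default instead of raising an error.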
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -52,10 +52,8 @@
 - [multitenancy.md](./multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI.
 - [settings.md](./settings.md)
 #FIXME: move these under sourcetree.md
-#- [pageserver/README.md](/pageserver/README.md)
 #- [postgres_ffi/README.md](/libs/postgres_ffi/README.md)
 #- [test_runner/README.md](/test_runner/README.md)
-#- [safekeeper/README.md](/safekeeper/README.md)


 # RFCs
--- a/docs/glossary.md
+++ b/docs/glossary.md
@@ -75,7 +75,7 @@ layer's Segment and range of LSNs.
 There are two kinds of layers, in-memory and on-disk layers. In-memory
 layers are used to ingest incoming WAL, and provide fast access
 to the recent page versions. On-disk layers are stored as files on disk, and
-are immutable. See pageserver/src/layered_repository/README.md for more.
+are immutable. See [pageserver-storage.md](./pageserver-storage.md) for more.

 ### Layer file (on-disk layer)

@@ -111,7 +111,7 @@ PostgreSQL LSNs and functions to monitor them:
 * `pg_last_wal_replay_lsn ()` - Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress this will increase monotonically.
 [source PostgreSQL documentation](https://www.postgresql.org/docs/devel/functions-admin.html):

-Neon safekeeper LSNs. For more check [safekeeper/README_PROTO.md](/safekeeper/README_PROTO.md)
+Neon safekeeper LSNs. See [safekeeper protocol section](safekeeper-protocol.md) for more information.
 * `CommitLSN`: position in WAL confirmed by quorum safekeepers.
 * `RestartLSN`: position in WAL confirmed by all safekeepers.
 * `FlushLSN`: part of WAL persisted to the disk by safekeeper.
--- a/docs/pageserver-services.md
+++ b/docs/pageserver-services.md
@@ -68,8 +68,6 @@ There are the following implementations present:
 * local filesystem — to use in tests mainly
 * AWS S3           - to use in production

-Implementation details are covered in the [backup readme](./src/remote_storage/README.md) and corresponding Rust file docs, parameters documentation can be found at [settings docs](../docs/settings.md).
-
 The backup service is disabled by default and can be enabled to interact with a single remote storage.

 CLI examples:
@@ -118,7 +116,7 @@ implemented by the LayeredRepository object in
 `layered_repository.rs`. There is only that one implementation of the
 Repository trait, but it's still a useful abstraction that keeps the
 interface for the low-level storage functionality clean. The layered
-storage format is described in layered_repository/README.md.
+storage format is described in [pageserver-storage.md](./pageserver-storage.md).

 Each repository consists of multiple Timelines. Timeline is a
 workhorse that accepts page changes from the WAL, and serves
--- a/docs/settings.md
+++ b/docs/settings.md
@@ -15,7 +15,7 @@ listen_pg_addr = '127.0.0.1:64000'
 listen_http_addr = '127.0.0.1:9898'

 checkpoint_distance = '268435456' # in bytes
-checkpoint_period = '1 s'
+checkpoint_timeout = '10m'

 gc_period = '100 s'
 gc_horizon = '67108864'
@@ -46,7 +46,7 @@ Note the `[remote_storage]` section: it's a [table](https://toml.io/en/v1.0.0#ta

 All values can be passed as an argument to the pageserver binary, using the `-c` parameter and specified as a valid TOML string. All tables should be passed in the inline form.

-Example: `${PAGESERVER_BIN} -c "checkpoint_period = '100 s'" -c "remote_storage={local_path='/some/local/path/'}"`
+Example: `${PAGESERVER_BIN} -c "checkpoint_timeout = '10 m'" -c "remote_storage={local_path='/some/local/path/'}"`

 Note that TOML distinguishes between strings and integers, the former require single or double quotes around them.

@@ -82,6 +82,14 @@ S3.

 The unit is # of bytes.

+#### checkpoint_timeout
+
+Apart from `checkpoint_distance`, flushing of the open layer is also triggered
+once `checkpoint_timeout` has passed since the last flush. This ensures that WAL
+is eventually uploaded to S3 even after activity stops.
+
+The default is 10m.
+
 #### compaction_period

 Every `compaction_period` seconds, the page server checks if
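The interplay between `checkpoint_distance` and the new `checkpoint_timeout` setting described in docs/settings.md can be modeled as a simple either/or check. This is a simplified sketch, not the pageserver's actual logic: `should_flush` and its two inputs (bytes and seconds since the last flush) are assumptions, and the thresholds are taken from the sample configuration (268435456 bytes, 10m).

```shell
#!/bin/sh
# Simplified, hypothetical model of the two flush triggers.
CHECKPOINT_DISTANCE=268435456   # bytes, from the sample config
CHECKPOINT_TIMEOUT_SECS=600     # the documented default of 10m, in seconds

should_flush() {
  bytes_since_flush="$1"
  secs_since_flush="$2"
  if [ "$bytes_since_flush" -ge "$CHECKPOINT_DISTANCE" ]; then
    echo "flush: checkpoint_distance reached"
  elif [ "$secs_since_flush" -ge "$CHECKPOINT_TIMEOUT_SECS" ]; then
    echo "flush: checkpoint_timeout elapsed"
  else
    echo "wait"
  fi
}

should_flush 300000000 5   # prints flush: checkpoint_distance reached
should_flush 1024 900      # prints flush: checkpoint_timeout elapsed
should_flush 1024 5        # prints wait
```

The second case is the one the new option targets: little data has arrived, but enough time has passed that the open layer is flushed anyway.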
--- a/docs/sourcetree.md
+++ b/docs/sourcetree.md
@@ -28,7 +28,7 @@ The pageserver has a few different duties:
 - Receive WAL from the WAL service and decode it.
 - Replay WAL that's applicable to the chunks that the Page Server maintains

-For more detailed info, see [/pageserver/README](/pageserver/README.md)
+For more detailed info, see [pageserver-services.md](./pageserver-services.md)

 `/proxy`:

@@ -57,7 +57,7 @@ PostgreSQL extension that contains functions needed for testing and debugging.
 The zenith WAL service that receives WAL from a primary compute nodes and streams it to the pageserver.
 It acts as a holding area and redistribution center for recently generated WAL.

-For more detailed info, see [/safekeeper/README](/safekeeper/README.md)
+For more detailed info, see [walservice.md](./walservice.md)

 `/workspace_hack`:
 The workspace_hack crate exists only to pin down some dependencies.
--- a/docs/walservice.md
+++ b/docs/walservice.md
@@ -75,8 +75,8 @@ safekeepers. The Paxos and crash recovery algorithm ensures that only
 one primary node can be actively streaming WAL to the quorum of
 safekeepers.

-See README_PROTO.md for a more detailed description of the consensus
-protocol. spec/ contains TLA+ specification of it.
+See [this section](safekeeper-protocol.md) for a more detailed description of
+the consensus protocol. spec/ contains a TLA+ specification of it.

 # Q&A

--- a/libs/etcd_broker/Cargo.toml
+++ b/libs/etcd_broker/Cargo.toml
@@ -9,7 +9,7 @@
 serde = { version = "1.0", features = ["derive"] }
 serde_json = "1"
 serde_with = "1.12.0"
- once_cell = "1.8.0"
+ once_cell = "1.13.0"

 utils = { path = "../utils" }
 workspace_hack = { version = "0.1", path = "../../workspace_hack" }
--- a/libs/metrics/Cargo.toml
+++ b/libs/metrics/Cargo.toml
@@ -6,5 +6,5 @@ edition = "2021"
 [dependencies]
 prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
 libc = "0.2"
-lazy_static = "1.4"
+once_cell = "1.13.0"
 workspace_hack = { version = "0.1", path = "../../workspace_hack" }
--- a/libs/metrics/src/lib.rs
+++ b/libs/metrics/src/lib.rs
@@ -2,7 +2,7 @@
 //! make sure that we use the same dep version everywhere.
 //! Otherwise, we might not see all metrics registered via
 //! a default registry.
-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;
 use prometheus::core::{AtomicU64, GenericGauge, GenericGaugeVec};
 pub use prometheus::opts;
 pub use prometheus::register;
@@ -41,19 +41,22 @@ pub fn gather() -> Vec<prometheus::proto::MetricFamily> {
    prometheus::gather()
 }

-lazy_static! {
-    static ref DISK_IO_BYTES: IntGaugeVec = register_int_gauge_vec!(
+static DISK_IO_BYTES: Lazy<IntGaugeVec> = Lazy::new(|| {
+    register_int_gauge_vec!(
        "libmetrics_disk_io_bytes_total",
        "Bytes written and read from disk, grouped by the operation (read|write)",
        &["io_operation"]
    )
-    .expect("Failed to register disk i/o bytes int gauge vec");
-    static ref MAXRSS_KB: IntGauge = register_int_gauge!(
+    .expect("Failed to register disk i/o bytes int gauge vec")
+});
+
+static MAXRSS_KB: Lazy<IntGauge> = Lazy::new(|| {
+    register_int_gauge!(
        "libmetrics_maxrss_kb",
        "Memory usage (Maximum Resident Set Size)"
    )
-    .expect("Failed to register maxrss_kb int gauge");
-}
+    .expect("Failed to register maxrss_kb int gauge")
+});

 pub const DISK_WRITE_SECONDS_BUCKETS: &[f64] = &[
    0.000_050, 0.000_100, 0.000_500, 0.001, 0.003, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5,
--- a/libs/metrics/src/wrappers.rs
+++ b/libs/metrics/src/wrappers.rs
@@ -10,13 +10,13 @@ use std::io::{Read, Result, Write};
 /// # use std::io::{Result, Read};
 /// # use metrics::{register_int_counter, IntCounter};
 /// # use metrics::CountedReader;
+/// # use once_cell::sync::Lazy;
 /// #
-/// # lazy_static::lazy_static! {
-/// #     static ref INT_COUNTER: IntCounter = register_int_counter!(
+/// # static INT_COUNTER: Lazy<IntCounter> = Lazy::new( || { register_int_counter!(
 /// #         "int_counter",
 /// #         "let's count something!"
-/// #     ).unwrap();
-/// # }
+/// #     ).unwrap()
+/// # });
 /// #
 /// fn do_some_reads(stream: impl Read, count: usize) -> Result<Vec<u8>> {
 ///     let mut reader = CountedReader::new(stream, |cnt| {
@@ -85,13 +85,13 @@ impl<T: Read> Read for CountedReader<'_, T> {
 /// # use std::io::{Result, Write};
 /// # use metrics::{register_int_counter, IntCounter};
 /// # use metrics::CountedWriter;
+/// # use once_cell::sync::Lazy;
 /// #
-/// # lazy_static::lazy_static! {
-/// #     static ref INT_COUNTER: IntCounter = register_int_counter!(
+/// # static INT_COUNTER: Lazy<IntCounter> = Lazy::new( || { register_int_counter!(
 /// #         "int_counter",
 /// #         "let's count something!"
-/// #     ).unwrap();
-/// # }
+/// #     ).unwrap()
+/// # });
 /// #
 /// fn do_some_writes(stream: impl Write, payload: &[u8]) -> Result<()> {
 ///     let mut writer = CountedWriter::new(stream, |cnt| {
--- a/libs/postgres_ffi/Cargo.toml
+++ b/libs/postgres_ffi/Cargo.toml
@@ -12,7 +12,7 @@ byteorder = "1.4.3"
 anyhow = "1.0"
 crc32c = "0.6.0"
 hex = "0.4.3"
-lazy_static = "1.4"
+once_cell = "1.13.0"
 log = "0.4.14"
 memoffset = "0.6.2"
 thiserror = "1.0"
--- a/libs/postgres_ffi/src/relfile_utils.rs
+++ b/libs/postgres_ffi/src/relfile_utils.rs
@@ -2,7 +2,7 @@
 //! Common utilities for dealing with PostgreSQL relation files.
 //!
 use crate::pg_constants;
-use lazy_static::lazy_static;
+use once_cell::sync::OnceCell;
 use regex::Regex;

 #[derive(Debug, Clone, thiserror::Error, PartialEq)]
@@ -54,11 +54,14 @@ pub fn forknumber_to_name(forknum: u8) -> Option<&'static str> {
 /// See functions relpath() and _mdfd_segpath() in PostgreSQL sources.
 ///
 pub fn parse_relfilename(fname: &str) -> Result<(u32, u8, u32), FilePathError> {
-    lazy_static! {
-        static ref RELFILE_RE: Regex =
-            Regex::new(r"^(?P<relnode>\d+)(_(?P<forkname>[a-z]+))?(\.(?P<segno>\d+))?$").unwrap();
-    }
+    static RELFILE_RE: OnceCell<Regex> = OnceCell::new();
+    RELFILE_RE.get_or_init(|| {
+        Regex::new(r"^(?P<relnode>\d+)(_(?P<forkname>[a-z]+))?(\.(?P<segno>\d+))?$").unwrap()
+    });
+
    let caps = RELFILE_RE
+        .get()
+        .unwrap()
        .captures(fname)
        .ok_or(FilePathError::InvalidFileName)?;

--- a/libs/postgres_ffi/src/waldecoder.rs
+++ b/libs/postgres_ffi/src/waldecoder.rs
@@ -13,24 +13,30 @@ use super::xlog_utils::*;
 use super::XLogLongPageHeaderData;
 use super::XLogPageHeaderData;
 use super::XLogRecord;
+use super::XLOG_PAGE_MAGIC;
 use bytes::{Buf, BufMut, Bytes, BytesMut};
 use crc32c::*;
 use log::*;
 use std::cmp::min;
+use std::num::NonZeroU32;
 use thiserror::Error;
 use utils::lsn::Lsn;

+enum State {
+    WaitingForRecord,
+    ReassemblingRecord {
+        recordbuf: BytesMut,
+        contlen: NonZeroU32,
+    },
+    SkippingEverything {
+        skip_until_lsn: Lsn,
+    },
+}
+
 pub struct WalStreamDecoder {
    lsn: Lsn,
-
-    startlsn: Lsn, // LSN where this record starts
-    contlen: u32,
-    padlen: u32,
-
    inputbuf: BytesMut,
-
-    /// buffer used to reassemble records that cross page boundaries.
-    recordbuf: BytesMut,
+    state: State,
 }

 #[derive(Error, Debug, Clone)]
@@ -48,13 +54,8 @@ impl WalStreamDecoder {
    pub fn new(lsn: Lsn) -> WalStreamDecoder {
        WalStreamDecoder {
            lsn,
-
-            startlsn: Lsn(0),
-            contlen: 0,
-            padlen: 0,
-
            inputbuf: BytesMut::new(),
-            recordbuf: BytesMut::new(),
+            state: State::WaitingForRecord,
        }
    }

@@ -67,6 +68,58 @@ impl WalStreamDecoder {
        self.inputbuf.extend_from_slice(buf);
    }

+    fn validate_page_header(&self, hdr: &XLogPageHeaderData) -> Result<(), WalDecodeError> {
+        let validate_impl = || {
+            if hdr.xlp_magic != XLOG_PAGE_MAGIC as u16 {
+                return Err(format!(
+                    "invalid xlog page header: xlp_magic={}, expected {}",
+                    hdr.xlp_magic, XLOG_PAGE_MAGIC
+                ));
+            }
+            if hdr.xlp_pageaddr != self.lsn.0 {
+                return Err(format!(
+                    "invalid xlog page header: xlp_pageaddr={}, expected {}",
+                    hdr.xlp_pageaddr, self.lsn
+                ));
+            }
+            match self.state {
+                State::WaitingForRecord => {
+                    if hdr.xlp_info & XLP_FIRST_IS_CONTRECORD != 0 {
+                        return Err(
+                            "invalid xlog page header: unexpected XLP_FIRST_IS_CONTRECORD".into(),
+                        );
+                    }
+                    if hdr.xlp_rem_len != 0 {
+                        return Err(format!(
+                            "invalid xlog page header: xlp_rem_len={}, but it's not a contrecord",
+                            hdr.xlp_rem_len
+                        ));
+                    }
+                }
+                State::ReassemblingRecord { contlen, .. } => {
+                    if hdr.xlp_info & XLP_FIRST_IS_CONTRECORD == 0 {
+                        return Err(
+                            "invalid xlog page header: XLP_FIRST_IS_CONTRECORD expected, not found"
+                                .into(),
+                        );
+                    }
+                    if hdr.xlp_rem_len != contlen.get() {
+                        return Err(format!(
+                            "invalid xlog page header: xlp_rem_len={}, expected {}",
+                            hdr.xlp_rem_len,
+                            contlen.get()
+                        ));
+                    }
+                }
+                State::SkippingEverything { .. } => {
+                    panic!("Should not be validating page header in the SkippingEverything state");
+                }
+            };
+            Ok(())
+        };
+        validate_impl().map_err(|msg| WalDecodeError { msg, lsn: self.lsn })
+    }
+
    /// Attempt to decode another WAL record from the input that has been fed to the
    /// decoder so far.
    ///
@@ -76,128 +129,121 @@ impl WalStreamDecoder {
    ///     Err(WalDecodeError): an error occurred while decoding, meaning the input was invalid.
    ///
    pub fn poll_decode(&mut self) -> Result<Option<(Lsn, Bytes)>, WalDecodeError> {
-        let recordbuf;
-
        // Run state machine that validates page headers, and reassembles records
        // that cross page boundaries.
        loop {
            // parse and verify page boundaries as we go
-            if self.padlen > 0 {
-                // We should first skip padding, as we may have to skip some page headers if we're processing the XLOG_SWITCH record.
-                if self.inputbuf.remaining() < self.padlen as usize {
-                    return Ok(None);
-                }
+            // However, we may have to skip some page headers if we're processing the XLOG_SWITCH record or skipping padding for whatever reason.
+            match self.state {
+                State::WaitingForRecord | State::ReassemblingRecord { .. } => {
+                    if self.lsn.segment_offset(pg_constants::WAL_SEGMENT_SIZE) == 0 {
+                        // parse long header

-                // skip padding
-                self.inputbuf.advance(self.padlen as usize);
-                self.lsn += self.padlen as u64;
-                self.padlen = 0;
-            } else if self.lsn.segment_offset(pg_constants::WAL_SEGMENT_SIZE) == 0 {
-                // parse long header
+                        if self.inputbuf.remaining() < XLOG_SIZE_OF_XLOG_LONG_PHD {
+                            return Ok(None);
+                        }

-                if self.inputbuf.remaining() < XLOG_SIZE_OF_XLOG_LONG_PHD {
-                    return Ok(None);
-                }
+                        let hdr = XLogLongPageHeaderData::from_bytes(&mut self.inputbuf).map_err(
+                            |e| WalDecodeError {
+                                msg: format!("long header deserialization failed {}", e),
+                                lsn: self.lsn,
+                            },
+                        )?;

-                let hdr = XLogLongPageHeaderData::from_bytes(&mut self.inputbuf).map_err(|e| {
-                    WalDecodeError {
-                        msg: format!("long header deserialization failed {}", e),
-                        lsn: self.lsn,
+                        self.validate_page_header(&hdr.std)?;
+
+                        self.lsn += XLOG_SIZE_OF_XLOG_LONG_PHD as u64;
+                    } else if self.lsn.block_offset() == 0 {
+                        if self.inputbuf.remaining() < XLOG_SIZE_OF_XLOG_SHORT_PHD {
+                            return Ok(None);
+                        }
+
+                        let hdr =
+                            XLogPageHeaderData::from_bytes(&mut self.inputbuf).map_err(|e| {
+                                WalDecodeError {
+                                    msg: format!("header deserialization failed {}", e),
+                                    lsn: self.lsn,
+                                }
+                            })?;
+
+                        self.validate_page_header(&hdr)?;
+
+                        self.lsn += XLOG_SIZE_OF_XLOG_SHORT_PHD as u64;
                    }
-                })?;
-
-                if hdr.std.xlp_pageaddr != self.lsn.0 {
-                    return Err(WalDecodeError {
-                        msg: "invalid xlog segment header".into(),
-                        lsn: self.lsn,
-                    });
                }
-                // TODO: verify the remaining fields in the header
-
-                self.lsn += XLOG_SIZE_OF_XLOG_LONG_PHD as u64;
-                continue;
-            } else if self.lsn.block_offset() == 0 {
-                if self.inputbuf.remaining() < XLOG_SIZE_OF_XLOG_SHORT_PHD {
-                    return Ok(None);
-                }
-
-                let hdr = XLogPageHeaderData::from_bytes(&mut self.inputbuf).map_err(|e| {
-                    WalDecodeError {
-                        msg: format!("header deserialization failed {}", e),
-                        lsn: self.lsn,
+                State::SkippingEverything { .. } => {}
+            }
+            match &mut self.state {
+                State::WaitingForRecord => {
+                    // need to have at least the xl_tot_len field
+                    if self.inputbuf.remaining() < 4 {
+                        return Ok(None);
                    }
-                })?;

-                if hdr.xlp_pageaddr != self.lsn.0 {
-                    return Err(WalDecodeError {
-                        msg: "invalid xlog page header".into(),
-                        lsn: self.lsn,
-                    });
+                    // peek xl_tot_len at the beginning of the record.
+                    // FIXME: assumes little-endian
+                    let xl_tot_len = (&self.inputbuf[0..4]).get_u32_le();
+                    if (xl_tot_len as usize) < XLOG_SIZE_OF_XLOG_RECORD {
+                        return Err(WalDecodeError {
+                            msg: format!("invalid xl_tot_len {}", xl_tot_len),
+                            lsn: self.lsn,
+                        });
+                    }
+                    // Fast path for the common case that the whole record fits on the page.
+                    let pageleft = self.lsn.remaining_in_block() as u32;
+                    if self.inputbuf.remaining() >= xl_tot_len as usize && xl_tot_len <= pageleft {
+                        self.lsn += xl_tot_len as u64;
+                        let recordbuf = self.inputbuf.copy_to_bytes(xl_tot_len as usize);
+                        return Ok(Some(self.complete_record(recordbuf)?));
+                    } else {
+                        // Need to assemble the record from pieces. Remember the size of the
+                        // record, and loop back. On the next iteration, we will reach the
+                        // 'ReassemblingRecord' arm below, and copy the part of the record that was
+                        // on this page to 'recordbuf'. Subsequent iterations will skip page headers,
+                        // and append the continuations from the next pages to 'recordbuf'.
+                        self.state = State::ReassemblingRecord {
+                            recordbuf: BytesMut::with_capacity(xl_tot_len as usize),
+                            contlen: NonZeroU32::new(xl_tot_len).unwrap(),
+                        }
+                    }
                }
-                // TODO: verify the remaining fields in the header
+                State::ReassemblingRecord { recordbuf, contlen } => {
+                    // we're continuing a record, possibly from previous page.
+                    let pageleft = self.lsn.remaining_in_block() as u32;

-                self.lsn += XLOG_SIZE_OF_XLOG_SHORT_PHD as u64;
-                continue;
-            } else if self.contlen == 0 {
-                assert!(self.recordbuf.is_empty());
+                    // read the rest of the record, or as much as fits on this page.
+                    let n = min(contlen.get(), pageleft) as usize;

-                // need to have at least the xl_tot_len field
-                if self.inputbuf.remaining() < 4 {
-                    return Ok(None);
+                    if self.inputbuf.remaining() < n {
+                        return Ok(None);
+                    }
+
+                    recordbuf.put(self.inputbuf.split_to(n));
+                    self.lsn += n as u64;
+                    *contlen = match NonZeroU32::new(contlen.get() - n as u32) {
+                        Some(x) => x,
+                        None => {
+                            // The record is now complete.
+                            let recordbuf = std::mem::replace(recordbuf, BytesMut::new()).freeze();
+                            return Ok(Some(self.complete_record(recordbuf)?));
+                        }
+                    }
                }
-
-                // peek xl_tot_len at the beginning of the record.
-                // FIXME: assumes little-endian
-                self.startlsn = self.lsn;
-                let xl_tot_len = (&self.inputbuf[0..4]).get_u32_le();
-                if (xl_tot_len as usize) < XLOG_SIZE_OF_XLOG_RECORD {
-                    return Err(WalDecodeError {
-                        msg: format!("invalid xl_tot_len {}", xl_tot_len),
-                        lsn: self.lsn,
-                    });
+                State::SkippingEverything { skip_until_lsn } => {
+                    assert!(*skip_until_lsn >= self.lsn);
+                    let n = skip_until_lsn.0 - self.lsn.0;
+                    if self.inputbuf.remaining() < n as usize {
+                        return Ok(None);
+                    }
+                    self.inputbuf.advance(n as usize);
+                    self.lsn += n;
+                    self.state = State::WaitingForRecord;
                }
-
-                // Fast path for the common case that the whole record fits on the page.
-                let pageleft = self.lsn.remaining_in_block() as u32;
-                if self.inputbuf.remaining() >= xl_tot_len as usize && xl_tot_len <= pageleft {
-                    // Take the record from the 'inputbuf', and validate it.
-                    recordbuf = self.inputbuf.copy_to_bytes(xl_tot_len as usize);
-                    self.lsn += xl_tot_len as u64;
-                    break;
-                } else {
-                    // Need to assemble the record from pieces. Remember the size of the
-                    // record, and loop back. On next iteration, we will reach the 'else'
-                    // branch below, and copy the part of the record that was on this page
-                    // to 'recordbuf'.  Subsequent iterations will skip page headers, and
-                    // append the continuations from the next pages to 'recordbuf'.
-                    self.recordbuf.reserve(xl_tot_len as usize);
-                    self.contlen = xl_tot_len;
-                    continue;
-                }
-            } else {
-                // we're continuing a record, possibly from previous page.
-                let pageleft = self.lsn.remaining_in_block() as u32;
-
-                // read the rest of the record, or as much as fits on this page.
-                let n = min(self.contlen, pageleft) as usize;
-
-                if self.inputbuf.remaining() < n {
-                    return Ok(None);
-                }
-
-                self.recordbuf.put(self.inputbuf.split_to(n));
-                self.lsn += n as u64;
-                self.contlen -= n as u32;
-
-                if self.contlen == 0 {
-                    // The record is now complete.
-                    recordbuf = std::mem::replace(&mut self.recordbuf, BytesMut::new()).freeze();
-                    break;
-                }
-                continue;
            }
        }
+    }

+    fn complete_record(&mut self, recordbuf: Bytes) -> Result<(Lsn, Bytes), WalDecodeError> {
        // We now have a record in the 'recordbuf' local variable.
        let xlogrec =
            XLogRecord::from_slice(&recordbuf[0..XLOG_SIZE_OF_XLOG_RECORD]).map_err(|e| {
@@ -219,18 +265,20 @@ impl WalStreamDecoder {

        // XLOG_SWITCH records are special. If we see one, we need to skip
        // to the next WAL segment.
-        if xlogrec.is_xlog_switch_record() {
+        let next_lsn = if xlogrec.is_xlog_switch_record() {
            trace!("saw xlog switch record at {}", self.lsn);
-            self.padlen = self.lsn.calc_padding(pg_constants::WAL_SEGMENT_SIZE as u64) as u32;
+            self.lsn + self.lsn.calc_padding(pg_constants::WAL_SEGMENT_SIZE as u64)
        } else {
            // Pad to an 8-byte boundary
-            self.padlen = self.lsn.calc_padding(8u32) as u32;
-        }
+            self.lsn.align()
+        };
+        self.state = State::SkippingEverything {
+            skip_until_lsn: next_lsn,
+        };

        // We should return LSN of the next record, not the last byte of this record or
        // the byte immediately after. Note that this handles both XLOG_SWITCH and usual
        // records, the former "spans" until the next WAL segment (see test_xlog_switch).
-        let result = (self.lsn + self.padlen as u64, recordbuf);
-        Ok(Some(result))
+        Ok((next_lsn, recordbuf))
    }
 }
--- a/libs/postgres_ffi/src/xlog_utils.rs
+++ b/libs/postgres_ffi/src/xlog_utils.rs
@@ -16,22 +16,22 @@ use crate::XLogRecord;
 use crate::XLOG_PAGE_MAGIC;

 use crate::pg_constants::WAL_SEGMENT_SIZE;
-use anyhow::{anyhow, bail, ensure};
-use byteorder::{ByteOrder, LittleEndian};
+use crate::waldecoder::WalStreamDecoder;
+
 use bytes::BytesMut;
 use bytes::{Buf, Bytes};
-use crc32c::*;
+
 use log::*;
-use std::cmp::max;
-use std::cmp::min;
-use std::fs::{self, File};
+
+use std::fs::File;
 use std::io::prelude::*;
+use std::io::ErrorKind;
 use std::io::SeekFrom;
 use std::path::{Path, PathBuf};
 use std::time::SystemTime;
 use utils::bin_ser::DeserializeError;
 use utils::bin_ser::SerializeError;
-use utils::const_assert;
+
 use utils::lsn::Lsn;

 pub const XLOG_FNAME_LEN: usize = 24;
@@ -140,338 +140,93 @@ pub fn to_pg_timestamp(time: SystemTime) -> TimestampTz {
    }
 }

-/// Return offset of the last valid record in the segment segno, starting
-/// looking at start_offset. Returns start_offset if no records found.
-fn find_end_of_wal_segment(
-    data_dir: &Path,
-    segno: XLogSegNo,
-    tli: TimeLineID,
-    wal_seg_size: usize,
-    start_offset: usize, // start reading at this point
-) -> anyhow::Result<u32> {
-    // step back to the beginning of the page to read it in...
-    let mut offs: usize = start_offset - start_offset % XLOG_BLCKSZ;
-    let mut skipping_first_contrecord: bool = false;
-    let mut contlen: usize = 0;
-    let mut xl_crc: u32 = 0;
-    let mut crc: u32 = 0;
-    let mut rec_offs: usize = 0;
-    let mut buf = [0u8; XLOG_BLCKSZ];
-    let file_name = XLogFileName(tli, segno, wal_seg_size);
-    let mut last_valid_rec_pos: usize = start_offset; // assume at given start_offset begins new record
-    let mut file = File::open(data_dir.join(file_name.clone() + ".partial"))?;
-    file.seek(SeekFrom::Start(offs as u64))?;
-    // xl_crc is the last field in XLogRecord, will not be read into rec_hdr
-    const_assert!(XLOG_RECORD_CRC_OFFS + 4 == XLOG_SIZE_OF_XLOG_RECORD);
-    let mut rec_hdr = [0u8; XLOG_RECORD_CRC_OFFS];
-
-    trace!("find_end_of_wal_segment(data_dir={}, segno={}, tli={}, wal_seg_size={}, start_offset=0x{:x})", data_dir.display(), segno, tli, wal_seg_size, start_offset);
-    while offs < wal_seg_size {
-        // we are at the beginning of the page; read it in
-        if offs % XLOG_BLCKSZ == 0 {
-            trace!("offs=0x{:x}: new page", offs);
-            let bytes_read = file.read(&mut buf)?;
-            if bytes_read != buf.len() {
-                bail!(
-                    "failed to read {} bytes from {} at {}",
-                    XLOG_BLCKSZ,
-                    file_name,
-                    offs
-                );
-            }
-
-            let xlp_magic = LittleEndian::read_u16(&buf[0..2]);
-            let xlp_info = LittleEndian::read_u16(&buf[2..4]);
-            let xlp_rem_len = LittleEndian::read_u32(&buf[XLP_REM_LEN_OFFS..XLP_REM_LEN_OFFS + 4]);
-            trace!(
-                "  xlp_magic=0x{:x}, xlp_info=0x{:x}, xlp_rem_len={}",
-                xlp_magic,
-                xlp_info,
-                xlp_rem_len
-            );
-            // this is expected in current usage when valid WAL starts after page header
-            if xlp_magic != XLOG_PAGE_MAGIC as u16 {
-                trace!(
-                    "  invalid WAL file {}.partial magic {} at {:?}",
-                    file_name,
-                    xlp_magic,
-                    Lsn(XLogSegNoOffsetToRecPtr(segno, offs as u32, wal_seg_size)),
-                );
-            }
-            if offs == 0 {
-                offs += XLOG_SIZE_OF_XLOG_LONG_PHD;
-                if (xlp_info & XLP_FIRST_IS_CONTRECORD) != 0 {
-                    trace!("  first record is contrecord");
-                    skipping_first_contrecord = true;
-                    contlen = xlp_rem_len as usize;
-                    if offs < start_offset {
-                        // Pre-condition failed: the beginning of the segment is unexpectedly corrupted.
-                        ensure!(start_offset - offs >= contlen,
-                            "start_offset is in the middle of the first record (which happens to be a contrecord), \
-                             expected to be on a record boundary. Is beginning of the segment corrupted?");
-                        contlen = 0;
-                        // keep skipping_first_contrecord to avoid counting the contrecord as valid, we did not check it.
-                    }
-                } else {
-                    trace!("  first record is not contrecord");
-                }
-            } else {
-                offs += XLOG_SIZE_OF_XLOG_SHORT_PHD;
-            }
-            // ... and step forward again if asked
-            trace!("  skipped header to 0x{:x}", offs);
-            offs = max(offs, start_offset);
-        // beginning of the next record
-        } else if contlen == 0 {
-            let page_offs = offs % XLOG_BLCKSZ;
-            let xl_tot_len = LittleEndian::read_u32(&buf[page_offs..page_offs + 4]) as usize;
-            trace!("offs=0x{:x}: new record, xl_tot_len={}", offs, xl_tot_len);
-            if xl_tot_len == 0 {
-                info!(
-                    "find_end_of_wal_segment reached zeros at {:?}, last records ends at {:?}",
-                    Lsn(XLogSegNoOffsetToRecPtr(segno, offs as u32, wal_seg_size)),
-                    Lsn(XLogSegNoOffsetToRecPtr(
-                        segno,
-                        last_valid_rec_pos as u32,
-                        wal_seg_size
-                    ))
-                );
-                break; // zeros, reached the end
-            }
-            if skipping_first_contrecord {
-                skipping_first_contrecord = false;
-                trace!("  first contrecord has been just completed");
-            } else {
-                trace!(
-                    "  updating last_valid_rec_pos: 0x{:x} --> 0x{:x}",
-                    last_valid_rec_pos,
-                    offs
-                );
-                last_valid_rec_pos = offs;
-            }
-            offs += 4;
-            rec_offs = 4;
-            contlen = xl_tot_len - 4;
-            trace!(
-                "  reading rec_hdr[0..4] <-- [0x{:x}; 0x{:x})",
-                page_offs,
-                page_offs + 4
-            );
-            rec_hdr[0..4].copy_from_slice(&buf[page_offs..page_offs + 4]);
-        } else {
-            // we're continuing a record, possibly from previous page.
-            let page_offs = offs % XLOG_BLCKSZ;
-            let pageleft = XLOG_BLCKSZ - page_offs;
-
-            // read the rest of the record, or as much as fits on this page.
-            let n = min(contlen, pageleft);
-            trace!(
-                "offs=0x{:x}, record continuation, pageleft={}, contlen={}",
-                offs,
-                pageleft,
-                contlen
-            );
-            // fill rec_hdr header up to (but not including) xl_crc field
-            trace!(
-                "  rec_offs={}, XLOG_RECORD_CRC_OFFS={}, XLOG_SIZE_OF_XLOG_RECORD={}",
-                rec_offs,
-                XLOG_RECORD_CRC_OFFS,
-                XLOG_SIZE_OF_XLOG_RECORD
-            );
-            if rec_offs < XLOG_RECORD_CRC_OFFS {
-                let len = min(XLOG_RECORD_CRC_OFFS - rec_offs, n);
-                trace!(
-                    "  reading rec_hdr[{}..{}] <-- [0x{:x}; 0x{:x})",
-                    rec_offs,
-                    rec_offs + len,
-                    page_offs,
-                    page_offs + len
-                );
-                rec_hdr[rec_offs..rec_offs + len].copy_from_slice(&buf[page_offs..page_offs + len]);
-            }
-            if rec_offs <= XLOG_RECORD_CRC_OFFS && rec_offs + n >= XLOG_SIZE_OF_XLOG_RECORD {
-                let crc_offs = page_offs - rec_offs + XLOG_RECORD_CRC_OFFS;
-                // All records are aligned on 8-byte boundary, so their 8-byte frames
-                // cannot be split between pages. As xl_crc is the last field,
-                // its content is always on the same page.
-                const_assert!(XLOG_RECORD_CRC_OFFS % 8 == 4);
-                // We should always start reading aligned records even in incorrect WALs so if
-                // the condition is false it is likely a bug. However, it is localized somewhere
-                // in this function, hence we do not crash and just report failure instead.
-                ensure!(crc_offs % 8 == 4, "Record is not aligned properly (bug?)");
-                xl_crc = LittleEndian::read_u32(&buf[crc_offs..crc_offs + 4]);
-                trace!(
-                    "  reading xl_crc: [0x{:x}; 0x{:x}) = 0x{:x}",
-                    crc_offs,
-                    crc_offs + 4,
-                    xl_crc
-                );
-                crc = crc32c_append(0, &buf[crc_offs + 4..page_offs + n]);
-                trace!(
-                    "  initializing crc: [0x{:x}; 0x{:x}); crc = 0x{:x}",
-                    crc_offs + 4,
-                    page_offs + n,
-                    crc
-                );
-            } else if rec_offs > XLOG_RECORD_CRC_OFFS {
-                // As all records are 8-byte aligned, the header is already fully read and `crc` is initialized in the branch above.
-                ensure!(rec_offs >= XLOG_SIZE_OF_XLOG_RECORD);
-                let old_crc = crc;
-                crc = crc32c_append(crc, &buf[page_offs..page_offs + n]);
-                trace!(
-                    "  appending to crc: [0x{:x}; 0x{:x}); 0x{:x} --> 0x{:x}",
-                    page_offs,
-                    page_offs + n,
-                    old_crc,
-                    crc
-                );
-            } else {
-                // Correct because of the way conditions are written above.
-                assert!(rec_offs + n < XLOG_SIZE_OF_XLOG_RECORD);
-                // If `skipping_first_contrecord == true`, we may be reading from a middle of a record
-                // which started in the previous segment. Hence there is no point in validating the header.
-                if !skipping_first_contrecord && rec_offs + n > XLOG_RECORD_CRC_OFFS {
-                    info!(
-                        "Curiously corrupted WAL: a record stops inside the header; \
-                             offs=0x{:x}, record continuation, pageleft={}, contlen={}",
-                        offs, pageleft, contlen
-                    );
-                    break;
-                }
-                // Do nothing: we are still reading the header. It's accounted in CRC in the end of the record.
-            }
-            rec_offs += n;
-            offs += n;
-            contlen -= n;
-
-            if contlen == 0 {
-                trace!("  record completed at 0x{:x}", offs);
-                crc = crc32c_append(crc, &rec_hdr);
-                offs = (offs + 7) & !7; // pad on 8 bytes boundary */
-                trace!(
-                    "  padded offs to 0x{:x}, crc is {:x}, expected crc is {:x}",
-                    offs,
-                    crc,
-                    xl_crc
-                );
-                if skipping_first_contrecord {
-                    // do nothing, the flag will go down on next iteration when we're reading new record
-                    trace!("  first conrecord has been just completed");
-                } else if crc == xl_crc {
-                    // record is valid, advance the result to its end (with
-                    // alignment to the next record taken into account)
-                    trace!(
-                        "  updating last_valid_rec_pos: 0x{:x} --> 0x{:x}",
-                        last_valid_rec_pos,
-                        offs
-                    );
-                    last_valid_rec_pos = offs;
-                } else {
-                    info!(
-                        "CRC mismatch {} vs {} at {}",
-                        crc, xl_crc, last_valid_rec_pos
-                    );
-                    break;
-                }
-            }
-        }
-    }
-    trace!("last_valid_rec_pos=0x{:x}", last_valid_rec_pos);
-    Ok(last_valid_rec_pos as u32)
-}
-
-///
-/// Scan a directory that contains PostgreSQL WAL files, for the end of WAL.
-/// If precise, returns end LSN (next insertion point, basically);
-/// otherwise, start of the last segment.
-/// Returns (0, 0) if there is no WAL.
-///
+// Returns the (aligned) end LSN of the last valid record found in the WAL
+// segments in data_dir. start_lsn must point to a previously known record
+// boundary (the beginning of the next record). If no valid record is found
+// after it, start_lsn is returned unchanged.
 pub fn find_end_of_wal(
    data_dir: &Path,
    wal_seg_size: usize,
-    precise: bool,
-    start_lsn: Lsn, // start reading WAL at this point or later
-) -> anyhow::Result<(XLogRecPtr, TimeLineID)> {
-    let mut high_segno: XLogSegNo = 0;
-    let mut high_tli: TimeLineID = 0;
-    let mut high_ispartial = false;
+    start_lsn: Lsn, // start reading WAL at this point; must point at the start of a record.
+) -> anyhow::Result<Lsn> {
+    let mut result = start_lsn;
+    let mut curr_lsn = start_lsn;
+    let mut buf = [0u8; XLOG_BLCKSZ];
+    let mut decoder = WalStreamDecoder::new(start_lsn);

-    for entry in fs::read_dir(data_dir)?.flatten() {
-        let ispartial: bool;
-        let entry_name = entry.file_name();
-        let fname = entry_name
-            .to_str()
-            .ok_or_else(|| anyhow!("Invalid file name"))?;
-
-        /*
-         * Check if the filename looks like an xlog file, or a .partial file.
-         */
-        if IsXLogFileName(fname) {
-            ispartial = false;
-        } else if IsPartialXLogFileName(fname) {
-            ispartial = true;
-        } else {
-            continue;
-        }
-        let (segno, tli) = XLogFromFileName(fname, wal_seg_size);
-        if !ispartial && entry.metadata()?.len() != wal_seg_size as u64 {
-            continue;
-        }
-        if segno > high_segno
-            || (segno == high_segno && tli > high_tli)
-            || (segno == high_segno && tli == high_tli && high_ispartial && !ispartial)
-        {
-            high_segno = segno;
-            high_tli = tli;
-            high_ispartial = ispartial;
-        }
-    }
-    if high_segno > 0 {
-        let mut high_offs = 0;
-        /*
-         * Move the starting pointer to the start of the next segment, if the
-         * highest one we saw was completed.
-         */
-        if !high_ispartial {
-            high_segno += 1;
-        } else if precise {
-            /* otherwise locate last record in last partial segment */
-            if start_lsn.segment_number(wal_seg_size) > high_segno {
-                bail!(
-                    "provided start_lsn {:?} is beyond highest segno {:?} available",
-                    start_lsn,
-                    high_segno,
+    // loop over segments
+    loop {
+        let segno = curr_lsn.segment_number(wal_seg_size);
+        let seg_file_name = XLogFileName(PG_TLI, segno, wal_seg_size);
+        let seg_file_path = data_dir.join(seg_file_name);
+        match open_wal_segment(&seg_file_path)? {
+            None => {
+                // no more segments
+                info!(
+                    "find_end_of_wal reached end at {:?}, segment {:?} doesn't exist",
+                    result, seg_file_path
                );
+                return Ok(result);
+            }
+            Some(mut segment) => {
+                let seg_offs = curr_lsn.segment_offset(wal_seg_size);
+                segment.seek(SeekFrom::Start(seg_offs as u64))?;
+                // loop inside segment
+                loop {
+                    let bytes_read = segment.read(&mut buf)?;
+                    if bytes_read == 0 {
+                        break; // EOF
+                    }
+                    curr_lsn += bytes_read as u64;
+                    decoder.feed_bytes(&buf[0..bytes_read]);
+
+                    // advance result past all completely read records
+                    loop {
+                        match decoder.poll_decode() {
+                            Ok(Some(record)) => result = record.0,
+                            Err(e) => {
+                                info!(
+                                    "find_end_of_wal reached end at {:?}, decode error: {:?}",
+                                    result, e
+                                );
+                                return Ok(result);
+                            }
+                            Ok(None) => break, // need more data
+                        }
+                    }
+                }
            }
-            let start_offset = if start_lsn.segment_number(wal_seg_size) == high_segno {
-                start_lsn.segment_offset(wal_seg_size)
-            } else {
-                0
-            };
-            high_offs = find_end_of_wal_segment(
-                data_dir,
-                high_segno,
-                high_tli,
-                wal_seg_size,
-                start_offset,
-            )?;
        }
-        let high_ptr = XLogSegNoOffsetToRecPtr(high_segno, high_offs, wal_seg_size);
-        return Ok((high_ptr, high_tli));
    }
-    Ok((0, 0))
+}
+
+// Open .partial or full WAL segment file, if present.
+fn open_wal_segment(seg_file_path: &Path) -> anyhow::Result<Option<File>> {
+    let mut partial_path = seg_file_path.to_owned();
+    partial_path.set_extension("partial");
+    match File::open(partial_path) {
+        Ok(file) => Ok(Some(file)),
+        Err(e) => match e.kind() {
+            ErrorKind::NotFound => {
+                // .partial not found, try full
+                match File::open(seg_file_path) {
+                    Ok(file) => Ok(Some(file)),
+                    Err(e) => match e.kind() {
+                        ErrorKind::NotFound => Ok(None),
+                        _ => Err(e.into()),
+                    },
+                }
+            }
+            _ => Err(e.into()),
+        },
+    }
 }

 pub fn main() {
    let mut data_dir = PathBuf::new();
    data_dir.push(".");
-    let (wal_end, tli) = find_end_of_wal(&data_dir, WAL_SEGMENT_SIZE, true, Lsn(0)).unwrap();
-    println!(
-        "wal_end={:>08X}{:>08X}, tli={}",
-        (wal_end >> 32) as u32,
-        wal_end as u32,
-        tli
-    );
+    let wal_end = find_end_of_wal(&data_dir, WAL_SEGMENT_SIZE, Lsn(0)).unwrap();
+    println!("wal_end={:?}", wal_end);
 }

 impl XLogRecord {
@@ -595,7 +350,10 @@ pub fn generate_wal_segment(segno: u64, system_id: u64) -> Result<Bytes, Seriali
 mod tests {
    use super::*;
    use regex::Regex;
+    use std::cmp::min;
+    use std::fs;
    use std::{env, str::FromStr};
+    use utils::const_assert;

    fn init_logging() {
        let _ = env_logger::Builder::from_env(
@@ -606,10 +364,7 @@ mod tests {
        .try_init();
    }

-    fn test_end_of_wal<C: wal_craft::Crafter>(
-        test_name: &str,
-        expected_end_of_wal_non_partial: Lsn,
-    ) {
+    fn test_end_of_wal<C: wal_craft::Crafter>(test_name: &str) {
        use wal_craft::*;
        // Craft some WAL
        let top_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
@@ -630,7 +385,7 @@ mod tests {
            .iter()
            .map(|&lsn| u64::from(lsn).into())
            .collect();
-        let expected_end_of_wal_partial: Lsn = u64::from(expected_end_of_wal_partial).into();
+        let expected_end_of_wal: Lsn = u64::from(expected_end_of_wal_partial).into();
        srv.kill();

        // Check find_end_of_wal on the initial WAL
@@ -642,10 +397,10 @@ mod tests {
            .filter(|fname| IsXLogFileName(fname))
            .max()
            .unwrap();
-        check_pg_waldump_end_of_wal(&cfg, &last_segment, expected_end_of_wal_partial);
-        for start_lsn in std::iter::once(Lsn(0))
-            .chain(intermediate_lsns)
-            .chain(std::iter::once(expected_end_of_wal_partial))
+        check_pg_waldump_end_of_wal(&cfg, &last_segment, expected_end_of_wal);
+        for start_lsn in intermediate_lsns
+            .iter()
+            .chain(std::iter::once(&expected_end_of_wal))
        {
            // Erase all WAL before `start_lsn` to ensure it's not used by `find_end_of_wal`.
            // We assume that `start_lsn` is non-decreasing.
@@ -660,7 +415,7 @@ mod tests {
                }
                let (segno, _) = XLogFromFileName(&fname, WAL_SEGMENT_SIZE);
                let seg_start_lsn = XLogSegNoOffsetToRecPtr(segno, 0, WAL_SEGMENT_SIZE);
-                if seg_start_lsn > u64::from(start_lsn) {
+                if seg_start_lsn > u64::from(*start_lsn) {
                    continue;
                }
                let mut f = File::options().write(true).open(file.path()).unwrap();
@@ -668,18 +423,12 @@ mod tests {
                f.write_all(
                    &ZEROS[0..min(
                        WAL_SEGMENT_SIZE,
-                        (u64::from(start_lsn) - seg_start_lsn) as usize,
+                        (u64::from(*start_lsn) - seg_start_lsn) as usize,
                    )],
                )
                .unwrap();
            }
-            check_end_of_wal(
-                &cfg,
-                &last_segment,
-                start_lsn,
-                expected_end_of_wal_non_partial,
-                expected_end_of_wal_partial,
-            );
+            check_end_of_wal(&cfg, &last_segment, *start_lsn, expected_end_of_wal);
        }
    }

@@ -716,18 +465,15 @@ mod tests {
        cfg: &wal_craft::Conf,
        last_segment: &str,
        start_lsn: Lsn,
-        expected_end_of_wal_non_partial: Lsn,
-        expected_end_of_wal_partial: Lsn,
+        expected_end_of_wal: Lsn,
    ) {
        // Check end_of_wal on non-partial WAL segment (we treat it as fully populated)
-        let (wal_end, tli) =
-            find_end_of_wal(&cfg.wal_dir(), WAL_SEGMENT_SIZE, true, start_lsn).unwrap();
-        let wal_end = Lsn(wal_end);
-        info!(
-            "find_end_of_wal returned (wal_end={}, tli={}) with non-partial WAL segment",
-            wal_end, tli
-        );
-        assert_eq!(wal_end, expected_end_of_wal_non_partial);
+        // let wal_end = find_end_of_wal(&cfg.wal_dir(), WAL_SEGMENT_SIZE, start_lsn).unwrap();
+        // info!(
+        //     "find_end_of_wal returned wal_end={} with non-partial WAL segment",
+        //     wal_end
+        // );
+        // assert_eq!(wal_end, expected_end_of_wal_non_partial);

        // Rename file to partial to actually find last valid lsn, then rename it back.
        fs::rename(
@@ -735,14 +481,12 @@ mod tests {
            cfg.wal_dir().join(format!("{}.partial", last_segment)),
        )
        .unwrap();
-        let (wal_end, tli) =
-            find_end_of_wal(&cfg.wal_dir(), WAL_SEGMENT_SIZE, true, start_lsn).unwrap();
-        let wal_end = Lsn(wal_end);
+        let wal_end = find_end_of_wal(&cfg.wal_dir(), WAL_SEGMENT_SIZE, start_lsn).unwrap();
        info!(
-            "find_end_of_wal returned (wal_end={}, tli={}) with partial WAL segment",
-            wal_end, tli
+            "find_end_of_wal returned wal_end={} with partial WAL segment",
+            wal_end
        );
-        assert_eq!(wal_end, expected_end_of_wal_partial);
+        assert_eq!(wal_end, expected_end_of_wal);
        fs::rename(
            cfg.wal_dir().join(format!("{}.partial", last_segment)),
            cfg.wal_dir().join(last_segment),
@@ -755,10 +499,7 @@ mod tests {
    #[test]
    pub fn test_find_end_of_wal_simple() {
        init_logging();
-        test_end_of_wal::<wal_craft::Simple>(
-            "test_find_end_of_wal_simple",
-            "0/2000000".parse::<Lsn>().unwrap(),
-        );
+        test_end_of_wal::<wal_craft::Simple>("test_find_end_of_wal_simple");
    }

    #[test]
@@ -766,17 +507,14 @@ mod tests {
        init_logging();
        test_end_of_wal::<wal_craft::WalRecordCrossingSegmentFollowedBySmallOne>(
            "test_find_end_of_wal_crossing_segment_followed_by_small_one",
-            "0/3000000".parse::<Lsn>().unwrap(),
        );
    }

    #[test]
-    #[ignore = "not yet fixed, needs correct parsing of pre-last segments"] // TODO
    pub fn test_find_end_of_wal_last_crossing_segment() {
        init_logging();
        test_end_of_wal::<wal_craft::LastWalRecordCrossingSegment>(
            "test_find_end_of_wal_last_crossing_segment",
-            "0/3000000".parse::<Lsn>().unwrap(),
        );
    }

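Note on `open_wal_segment` in the diff above: the new helper tries `<segment>.partial` first and falls back to the completed segment file, treating "not found" as "no more WAL" and propagating any other I/O error. The same lookup can be sketched with only the standard library, as below. This is an illustrative sketch, not the code in the patch: the name `open_partial_or_full` is hypothetical, and it returns `std::io::Result` where the patch uses `anyhow::Result`.

```rust
use std::fs::File;
use std::io::ErrorKind;
use std::path::Path;

/// Hypothetical helper: try `<segment>.partial` first, then the full segment.
/// Returns Ok(None) only when neither file exists; other I/O errors propagate.
fn open_partial_or_full(seg_file_path: &Path) -> std::io::Result<Option<File>> {
    let mut partial_path = seg_file_path.to_owned();
    // WAL segment names have no extension, so this appends ".partial".
    partial_path.set_extension("partial");

    for candidate in [partial_path.as_path(), seg_file_path] {
        match File::open(candidate) {
            Ok(file) => return Ok(Some(file)),
            // A missing file is expected; move on to the next candidate.
            Err(e) if e.kind() == ErrorKind::NotFound => continue,
            // Anything else (permissions, I/O failure) is a real error.
            Err(e) => return Err(e),
        }
    }
    Ok(None)
}
```

Flattening the two lookups into a loop over candidates avoids the nested `match` while keeping the same order of preference.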
--- a/libs/postgres_ffi/wal_craft/Cargo.toml
+++ b/libs/postgres_ffi/wal_craft/Cargo.toml
@@ -10,7 +10,7 @@ anyhow = "1.0"
 clap = "3.0"
 env_logger = "0.9"
 log = "0.4"
-once_cell = "1.8.0"
+once_cell = "1.13.0"
 postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
 postgres_ffi = { path = "../" }
 tempfile = "3.2"
--- a/libs/remote_storage/Cargo.toml
+++ b/libs/remote_storage/Cargo.toml
@@ -7,7 +7,7 @@ edition = "2021"
 anyhow = { version = "1.0", features = ["backtrace"] }
 async-trait = "0.1"
 metrics = { version = "0.1", path = "../metrics" }
-once_cell = "1.8.0"
+once_cell = "1.13.0"
 rusoto_core = "0.48"
 rusoto_s3 = "0.48"
 serde = { version = "1.0", features = ["derive"] }
--- a/libs/utils/Cargo.toml
+++ b/libs/utils/Cargo.toml
@@ -8,7 +8,6 @@ anyhow = "1.0"
 bincode = "1.3"
 bytes = "1.0.1"
 hyper = { version = "0.14.7", features = ["full"] }
-lazy_static = "1.4.0"
 pin-project-lite = "0.2.7"
 postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
 postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
@@ -28,6 +27,8 @@ rustls = "0.20.2"
 rustls-split = "0.3.0"
 git-version = "0.3.5"
 serde_with = "1.12.0"
+once_cell = "1.13.0"
+

 metrics = { path = "../metrics" }
 workspace_hack = { version = "0.1", path = "../../workspace_hack" }
--- a/libs/utils/src/http/endpoint.rs
+++ b/libs/utils/src/http/endpoint.rs
@@ -4,8 +4,8 @@ use crate::zid::ZTenantId;
 use anyhow::anyhow;
 use hyper::header::AUTHORIZATION;
 use hyper::{header::CONTENT_TYPE, Body, Request, Response, Server};
-use lazy_static::lazy_static;
 use metrics::{register_int_counter, Encoder, IntCounter, TextEncoder};
+use once_cell::sync::Lazy;
 use routerify::ext::RequestExt;
 use routerify::RequestInfo;
 use routerify::{Middleware, Router, RouterBuilder, RouterService};
@@ -16,13 +16,13 @@ use std::net::TcpListener;

 use super::error::ApiError;

-lazy_static! {
-    static ref SERVE_METRICS_COUNT: IntCounter = register_int_counter!(
+static SERVE_METRICS_COUNT: Lazy<IntCounter> = Lazy::new(|| {
+    register_int_counter!(
        "libmetrics_metric_handler_requests_total",
        "Number of metric requests made"
    )
-    .expect("failed to define a metric");
-}
+    .expect("failed to define a metric")
+});

 async fn logger(res: Response<Body>, info: RequestInfo) -> Result<Response<Body>, ApiError> {
    info!("{} {} {}", info.method(), info.uri().path(), res.status(),);
--- a/libs/utils/tests/ssl_test.rs
+++ b/libs/utils/tests/ssl_test.rs
@@ -7,7 +7,7 @@ use std::{

 use byteorder::{BigEndian, ReadBytesExt, WriteBytesExt};
 use bytes::{Buf, BufMut, Bytes, BytesMut};
-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;

 use utils::postgres_backend::{AuthType, Handler, PostgresBackend};

@@ -19,16 +19,15 @@ fn make_tcp_pair() -> (TcpStream, TcpStream) {
    (server_stream, client_stream)
 }

-lazy_static! {
-    static ref KEY: rustls::PrivateKey = {
-        let mut cursor = Cursor::new(include_bytes!("key.pem"));
-        rustls::PrivateKey(rustls_pemfile::rsa_private_keys(&mut cursor).unwrap()[0].clone())
-    };
-    static ref CERT: rustls::Certificate = {
-        let mut cursor = Cursor::new(include_bytes!("cert.pem"));
-        rustls::Certificate(rustls_pemfile::certs(&mut cursor).unwrap()[0].clone())
-    };
-}
+static KEY: Lazy<rustls::PrivateKey> = Lazy::new(|| {
+    let mut cursor = Cursor::new(include_bytes!("key.pem"));
+    rustls::PrivateKey(rustls_pemfile::rsa_private_keys(&mut cursor).unwrap()[0].clone())
+});
+
+static CERT: Lazy<rustls::Certificate> = Lazy::new(|| {
+    let mut cursor = Cursor::new(include_bytes!("cert.pem"));
+    rustls::Certificate(rustls_pemfile::certs(&mut cursor).unwrap()[0].clone())
+});

 #[test]
 fn ssl() {
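Note on the `lazy_static!` to `once_cell::sync::Lazy` conversions above: both give a global that is built on first access and shared afterwards. The sketch below shows that one-time initialization behaviour using `std::sync::OnceLock` as a stand-in, so that it compiles without external crates. That substitution is an assumption of this example and is not what the patch uses; `global_label` and `INIT_CALLS` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

/// Counts how many times the initializer actually ran (for illustration only).
static INIT_CALLS: AtomicU64 = AtomicU64::new(0);

/// Hypothetical stand-in for an expensive-to-build global value.
fn global_label() -> &'static String {
    static LABEL: OnceLock<String> = OnceLock::new();
    // The closure runs at most once, even if several threads race here.
    LABEL.get_or_init(|| {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        String::from("initialized on first use")
    })
}
```

Calling `global_label()` repeatedly returns the same reference, and the initializer closure runs only on the first call.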
--- a/pageserver/Cargo.toml
+++ b/pageserver/Cargo.toml
@@ -21,7 +21,6 @@ futures = "0.3.13"
 hex = "0.4.3"
 hyper = "0.14"
 itertools = "0.10.3"
-lazy_static = "1.4.0"
 clap = "3.0"
 daemonize = "0.4.1"
 tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
@@ -48,7 +47,7 @@ tracing = "0.1.27"
 signal-hook = "0.3.10"
 url = "2"
 nix = "0.23"
-once_cell = "1.8.0"
+once_cell = "1.13.0"
 crossbeam-utils = "0.8.5"
 fail = "0.5.0"
 git-version = "0.3.5"
--- a/pageserver/src/config.rs
+++ b/pageserver/src/config.rs
@@ -59,6 +59,7 @@ pub mod defaults {

 # [tenant_config]
 #checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
+#checkpoint_timeout = {DEFAULT_CHECKPOINT_TIMEOUT}
 #compaction_target_size = {DEFAULT_COMPACTION_TARGET_SIZE} # in bytes
 #compaction_period = '{DEFAULT_COMPACTION_PERIOD}'
 #compaction_threshold = '{DEFAULT_COMPACTION_THRESHOLD}'
@@ -452,6 +453,13 @@ impl PageServerConf {
                Some(parse_toml_u64("checkpoint_distance", checkpoint_distance)?);
        }

+        if let Some(checkpoint_timeout) = item.get("checkpoint_timeout") {
+            t_conf.checkpoint_timeout = Some(parse_toml_duration(
+                "checkpoint_timeout",
+                checkpoint_timeout,
+            )?);
+        }
+
        if let Some(compaction_target_size) = item.get("compaction_target_size") {
            t_conf.compaction_target_size = Some(parse_toml_u64(
                "compaction_target_size",
--- a/pageserver/src/http/models.rs
+++ b/pageserver/src/http/models.rs
@@ -32,6 +32,7 @@ pub struct TenantCreateRequest {
    #[serde_as(as = "Option<DisplayFromStr>")]
    pub new_tenant_id: Option<ZTenantId>,
    pub checkpoint_distance: Option<u64>,
+    pub checkpoint_timeout: Option<String>,
    pub compaction_target_size: Option<u64>,
    pub compaction_period: Option<String>,
    pub compaction_threshold: Option<usize>,
@@ -70,6 +71,7 @@ pub struct TenantConfigRequest {
    #[serde(default)]
    #[serde_as(as = "Option<DisplayFromStr>")]
    pub checkpoint_distance: Option<u64>,
+    pub checkpoint_timeout: Option<String>,
    pub compaction_target_size: Option<u64>,
    pub compaction_period: Option<String>,
    pub compaction_threshold: Option<usize>,
@@ -87,6 +89,7 @@ impl TenantConfigRequest {
        TenantConfigRequest {
            tenant_id,
            checkpoint_distance: None,
+            checkpoint_timeout: None,
            compaction_target_size: None,
            compaction_period: None,
            compaction_threshold: None,
--- a/pageserver/src/http/openapi_spec.yml
+++ b/pageserver/src/http/openapi_spec.yml
@@ -560,6 +560,8 @@ components:
          type: string
        checkpoint_distance:
          type: integer
+        checkpoint_timeout:
+          type: string
        compaction_period:
          type: string
        compaction_threshold:
@@ -578,6 +580,8 @@ components:
          type: string
        checkpoint_distance:
          type: integer
+        checkpoint_timeout:
+          type: string
        compaction_period:
          type: string
        compaction_threshold:
--- a/pageserver/src/http/routes.rs
+++ b/pageserver/src/http/routes.rs
@@ -623,6 +623,11 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
    }

    tenant_conf.checkpoint_distance = request_data.checkpoint_distance;
+    if let Some(checkpoint_timeout) = request_data.checkpoint_timeout {
+        tenant_conf.checkpoint_timeout =
+            Some(humantime::parse_duration(&checkpoint_timeout).map_err(ApiError::from_err)?);
+    }
+
    tenant_conf.compaction_target_size = request_data.compaction_target_size;
    tenant_conf.compaction_threshold = request_data.compaction_threshold;

@@ -683,6 +688,10 @@ async fn tenant_config_handler(mut request: Request<Body>) -> Result<Response<Bo
    }

    tenant_conf.checkpoint_distance = request_data.checkpoint_distance;
+    if let Some(checkpoint_timeout) = request_data.checkpoint_timeout {
+        tenant_conf.checkpoint_timeout =
+            Some(humantime::parse_duration(&checkpoint_timeout).map_err(ApiError::from_err)?);
+    }
    tenant_conf.compaction_target_size = request_data.compaction_target_size;
    tenant_conf.compaction_threshold = request_data.compaction_threshold;

--- a/pageserver/src/import_datadir.rs
+++ b/pageserver/src/import_datadir.rs
@@ -37,7 +37,7 @@ pub fn import_timeline_from_postgres_datadir<T: DatadirTimeline>(

    // TODO this shoud be start_lsn, which is not necessarily equal to end_lsn (aka lsn)
    // Then fishing out pg_control would be unnecessary
-    let mut modification = tline.begin_modification();
+    let mut modification = tline.begin_modification(lsn);
    modification.init_empty()?;

    // Import all but pg_wal
@@ -56,12 +56,12 @@ pub fn import_timeline_from_postgres_datadir<T: DatadirTimeline>(
            if let Some(control_file) = import_file(&mut modification, relative_path, file, len)? {
                pg_control = Some(control_file);
            }
-            modification.flush(lsn)?;
+            modification.flush()?;
        }
    }

    // We're done importing all the data files.
-    modification.commit(lsn)?;
+    modification.commit()?;

    // We expect the Postgres server to be shut down cleanly.
    let pg_control = pg_control.context("pg_control file not found")?;
@@ -267,7 +267,7 @@ fn import_wal<T: DatadirTimeline>(
        waldecoder.feed_bytes(&buf);

        let mut nrecords = 0;
-        let mut modification = tline.begin_modification();
+        let mut modification = tline.begin_modification(endpoint);
        let mut decoded = DecodedWALRecord::default();
        while last_lsn <= endpoint {
            if let Some((lsn, recdata)) = waldecoder.poll_decode()? {
@@ -301,7 +301,7 @@ pub fn import_basebackup_from_tar<T: DatadirTimeline, Reader: Read>(
    base_lsn: Lsn,
 ) -> Result<()> {
    info!("importing base at {}", base_lsn);
-    let mut modification = tline.begin_modification();
+    let mut modification = tline.begin_modification(base_lsn);
    modification.init_empty()?;

    let mut pg_control: Option<ControlFileData> = None;
@@ -319,7 +319,7 @@ pub fn import_basebackup_from_tar<T: DatadirTimeline, Reader: Read>(
                    // We found the pg_control file.
                    pg_control = Some(res);
                }
-                modification.flush(base_lsn)?;
+                modification.flush()?;
            }
            tar::EntryType::Directory => {
                debug!("directory {:?}", file_path);
@@ -333,7 +333,7 @@ pub fn import_basebackup_from_tar<T: DatadirTimeline, Reader: Read>(
    // sanity check: ensure that pg_control is loaded
    let _pg_control = pg_control.context("pg_control file not found")?;

-    modification.commit(base_lsn)?;
+    modification.commit()?;
    Ok(())
 }

@@ -385,7 +385,7 @@ pub fn import_wal_from_tar<T: DatadirTimeline, Reader: Read>(

        waldecoder.feed_bytes(&bytes[offset..]);

-        let mut modification = tline.begin_modification();
+        let mut modification = tline.begin_modification(end_lsn);
        let mut decoded = DecodedWALRecord::default();
        while last_lsn <= end_lsn {
            if let Some((lsn, recdata)) = waldecoder.poll_decode()? {
--- a/pageserver/src/layered_repository.rs
+++ b/pageserver/src/layered_repository.rs
@@ -5,7 +5,7 @@
 //! get/put call, walking back the timeline branching history as needed.
 //!
 //! The files are stored in the .neon/tenants/<tenantid>/timelines/<timelineid>
-//! directory. See layered_repository/README for how the files are managed.
+//! directory. See docs/pageserver-storage.md for how the files are managed.
 //! In addition to the layer files, there is a metadata file in the same
 //! directory that contains information about the timeline, in particular its
 //! parent timeline, and the last LSN that has been written to disk.
@@ -433,6 +433,13 @@ impl LayeredRepository {
            .unwrap_or(self.conf.default_tenant_conf.checkpoint_distance)
    }

+    pub fn get_checkpoint_timeout(&self) -> Duration {
+        let tenant_conf = self.tenant_conf.read().unwrap();
+        tenant_conf
+            .checkpoint_timeout
+            .unwrap_or(self.conf.default_tenant_conf.checkpoint_timeout)
+    }
+
    pub fn get_compaction_target_size(&self) -> u64 {
        let tenant_conf = self.tenant_conf.read().unwrap();
        tenant_conf
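Note on `get_checkpoint_timeout` above: like the other tenant settings, the per-tenant value is optional and falls back to the pageserver-wide default when unset. A minimal sketch of that fallback follows. `TenantOverrides` and `effective_checkpoint_timeout` are hypothetical names, and the 10-minute default is taken from the commit message rather than from the actual constant.

```rust
use std::time::Duration;

/// Hypothetical, trimmed-down per-tenant overrides: None means "use the default".
struct TenantOverrides {
    checkpoint_timeout: Option<Duration>,
}

/// Assumed default of 10 minutes, per the commit message.
const DEFAULT_CHECKPOINT_TIMEOUT: Duration = Duration::from_secs(10 * 60);

/// Resolve the timeout actually in effect for a tenant.
fn effective_checkpoint_timeout(overrides: &TenantOverrides) -> Duration {
    overrides
        .checkpoint_timeout
        .unwrap_or(DEFAULT_CHECKPOINT_TIMEOUT)
}
```

Keeping the override as an `Option` lets a later change to the default apply to every tenant that never set its own value.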
--- a/pageserver/src/layered_repository/block_io.rs
+++ b/pageserver/src/layered_repository/block_io.rs
@@ -5,7 +5,7 @@
 use crate::page_cache;
 use crate::page_cache::{ReadBufResult, PAGE_SZ};
 use bytes::Bytes;
-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;
 use std::ops::{Deref, DerefMut};
 use std::os::unix::fs::FileExt;
 use std::sync::atomic::AtomicU64;
@@ -117,9 +117,7 @@ where
    }
 }

-lazy_static! {
-    static ref NEXT_ID: AtomicU64 = AtomicU64::new(1);
-}
+static NEXT_ID: Lazy<AtomicU64> = Lazy::new(|| AtomicU64::new(1));

 /// An adapter for reading a (virtual) file using the page cache.
 ///
--- a/pageserver/src/layered_repository/ephemeral_file.rs
+++ b/pageserver/src/layered_repository/ephemeral_file.rs
@@ -8,7 +8,7 @@ use crate::page_cache;
 use crate::page_cache::PAGE_SZ;
 use crate::page_cache::{ReadBufResult, WriteBufResult};
 use crate::virtual_file::VirtualFile;
-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;
 use std::cmp::min;
 use std::collections::HashMap;
 use std::fs::OpenOptions;
@@ -21,15 +21,15 @@ use utils::zid::{ZTenantId, ZTimelineId};

 use std::os::unix::fs::FileExt;

-lazy_static! {
-    ///
-    /// This is the global cache of file descriptors (File objects).
-    ///
-    static ref EPHEMERAL_FILES: RwLock<EphemeralFiles> = RwLock::new(EphemeralFiles {
+///
+/// This is the global cache of file descriptors (File objects).
+///
+static EPHEMERAL_FILES: Lazy<RwLock<EphemeralFiles>> = Lazy::new(|| {
+    RwLock::new(EphemeralFiles {
        next_file_id: 1,
        files: HashMap::new(),
-    });
-}
+    })
+});

 pub struct EphemeralFiles {
    next_file_id: u64,
--- a/pageserver/src/layered_repository/layer_map.rs
+++ b/pageserver/src/layered_repository/layer_map.rs
@@ -15,19 +15,18 @@ use crate::layered_repository::storage_layer::Layer;
 use crate::layered_repository::storage_layer::{range_eq, range_overlaps};
 use crate::repository::Key;
 use anyhow::Result;
-use lazy_static::lazy_static;
 use metrics::{register_int_gauge, IntGauge};
+use once_cell::sync::Lazy;
 use std::collections::VecDeque;
 use std::ops::Range;
 use std::sync::Arc;
 use tracing::*;
 use utils::lsn::Lsn;

-lazy_static! {
-    static ref NUM_ONDISK_LAYERS: IntGauge =
-        register_int_gauge!("pageserver_ondisk_layers", "Number of layers on-disk")
-            .expect("failed to define a metric");
-}
+static NUM_ONDISK_LAYERS: Lazy<IntGauge> = Lazy::new(|| {
+    register_int_gauge!("pageserver_ondisk_layers", "Number of layers on-disk")
+        .expect("failed to define a metric")
+});

 ///
 /// LayerMap tracks what layers exist on a timeline.
--- a/pageserver/src/layered_repository/timeline.rs
+++ b/pageserver/src/layered_repository/timeline.rs
@@ -4,11 +4,11 @@ use anyhow::{anyhow, bail, ensure, Context, Result};
 use bytes::Bytes;
 use fail::fail_point;
 use itertools::Itertools;
-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;
 use tracing::*;

 use std::cmp::{max, min, Ordering};
-use std::collections::HashSet;
+use std::collections::{hash_map::Entry, HashMap, HashSet};
 use std::fs;
 use std::fs::{File, OpenOptions};
 use std::io::Write;
@@ -16,7 +16,7 @@ use std::ops::{Deref, Range};
 use std::path::PathBuf;
 use std::sync::atomic::{self, AtomicBool, AtomicIsize, Ordering as AtomicOrdering};
 use std::sync::{Arc, Mutex, MutexGuard, RwLock, RwLockReadGuard, TryLockError};
-use std::time::{Duration, SystemTime};
+use std::time::{Duration, Instant, SystemTime};

 use metrics::{
    register_histogram_vec, register_int_counter, register_int_counter_vec, register_int_gauge_vec,
@@ -38,7 +38,9 @@ use crate::layered_repository::{

 use crate::config::PageServerConf;
 use crate::keyspace::{KeyPartitioning, KeySpace};
+use crate::pgdatadir_mapping::BlockNumber;
 use crate::pgdatadir_mapping::LsnForTimestamp;
+use crate::reltag::RelTag;
 use crate::tenant_config::TenantConfOpt;
 use crate::DatadirTimeline;

@@ -58,76 +60,102 @@ use crate::walredo::WalRedoManager;
 use crate::CheckpointConfig;
 use crate::{page_cache, storage_sync};

+/// Prometheus histogram buckets (in seconds) that capture the majority of
+/// latencies in the microsecond range but also extend far enough up to distinguish
+/// "bad" from "really bad".
+fn get_buckets_for_critical_operations() -> Vec<f64> {
+    let buckets_per_digit = 5;
+    let min_exponent = -6;
+    let max_exponent = 2;
+
+    let mut buckets = vec![];
+    // Compute 10^(exp / buckets_per_digit) instead of (10^(1 / buckets_per_digit))^exp
+    // because it's more numerically stable and doesn't result in numbers like 9.999999
+    for exp in (min_exponent * buckets_per_digit)..=(max_exponent * buckets_per_digit) {
+        buckets.push(10_f64.powf(exp as f64 / buckets_per_digit as f64))
+    }
+    buckets
+}
+
 // Metrics collected on operations on the storage repository.
-lazy_static! {
-    pub static ref STORAGE_TIME: HistogramVec = register_histogram_vec!(
+pub static STORAGE_TIME: Lazy<HistogramVec> = Lazy::new(|| {
+    register_histogram_vec!(
        "pageserver_storage_operations_seconds",
        "Time spent on storage operations",
-        &["operation", "tenant_id", "timeline_id"]
+        &["operation", "tenant_id", "timeline_id"],
+        get_buckets_for_critical_operations(),
    )
-    .expect("failed to define a metric");
-}
+    .expect("failed to define a metric")
+});

 // Metrics collected on operations on the storage repository.
-lazy_static! {
-    static ref RECONSTRUCT_TIME: HistogramVec = register_histogram_vec!(
+static RECONSTRUCT_TIME: Lazy<HistogramVec> = Lazy::new(|| {
+    register_histogram_vec!(
        "pageserver_getpage_reconstruct_seconds",
        "Time spent in reconstruct_value",
-        &["tenant_id", "timeline_id"]
+        &["tenant_id", "timeline_id"],
+        get_buckets_for_critical_operations(),
    )
-    .expect("failed to define a metric");
-}
+    .expect("failed to define a metric")
+});

-lazy_static! {
-    static ref MATERIALIZED_PAGE_CACHE_HIT: IntCounterVec = register_int_counter_vec!(
+static MATERIALIZED_PAGE_CACHE_HIT: Lazy<IntCounterVec> = Lazy::new(|| {
+    register_int_counter_vec!(
        "pageserver_materialized_cache_hits_total",
        "Number of cache hits from materialized page cache",
        &["tenant_id", "timeline_id"]
    )
-    .expect("failed to define a metric");
-    static ref WAIT_LSN_TIME: HistogramVec = register_histogram_vec!(
+    .expect("failed to define a metric")
+});
+
+static WAIT_LSN_TIME: Lazy<HistogramVec> = Lazy::new(|| {
+    register_histogram_vec!(
        "pageserver_wait_lsn_seconds",
        "Time spent waiting for WAL to arrive",
-        &["tenant_id", "timeline_id"]
+        &["tenant_id", "timeline_id"],
+        get_buckets_for_critical_operations(),
    )
-    .expect("failed to define a metric");
-}
+    .expect("failed to define a metric")
+});

-lazy_static! {
-    static ref LAST_RECORD_LSN: IntGaugeVec = register_int_gauge_vec!(
+static LAST_RECORD_LSN: Lazy<IntGaugeVec> = Lazy::new(|| {
+    register_int_gauge_vec!(
        "pageserver_last_record_lsn",
        "Last record LSN grouped by timeline",
        &["tenant_id", "timeline_id"]
    )
-    .expect("failed to define a metric");
-}
+    .expect("failed to define a metric")
+});

 // Metrics for determining timeline's physical size.
 // A layered timeline's physical is defined as the total size of
 // (delta/image) layer files on disk.
-lazy_static! {
-    static ref CURRENT_PHYSICAL_SIZE: UIntGaugeVec = register_uint_gauge_vec!(
+static CURRENT_PHYSICAL_SIZE: Lazy<UIntGaugeVec> = Lazy::new(|| {
+    register_uint_gauge_vec!(
        "pageserver_current_physical_size",
        "Current physical size grouped by timeline",
        &["tenant_id", "timeline_id"]
    )
-    .expect("failed to define a metric");
-}
+    .expect("failed to define a metric")
+});

 // Metrics for cloud upload. These metrics reflect data uploaded to cloud storage,
 // or in testing they estimate how much we would upload if we did.
-lazy_static! {
-    static ref NUM_PERSISTENT_FILES_CREATED: IntCounter = register_int_counter!(
+static NUM_PERSISTENT_FILES_CREATED: Lazy<IntCounter> = Lazy::new(|| {
+    register_int_counter!(
        "pageserver_created_persistent_files_total",
        "Number of files created that are meant to be uploaded to cloud storage",
    )
-    .expect("failed to define a metric");
-    static ref PERSISTENT_BYTES_WRITTEN: IntCounter = register_int_counter!(
+    .expect("failed to define a metric")
+});
+
+static PERSISTENT_BYTES_WRITTEN: Lazy<IntCounter> = Lazy::new(|| {
+    register_int_counter!(
        "pageserver_written_persistent_bytes_total",
        "Total bytes written that are meant to be uploaded to cloud storage",
    )
-    .expect("failed to define a metric");
-}
+    .expect("failed to define a metric")
+});

 #[derive(Clone)]
 pub enum LayeredTimelineEntry {
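Note on the bucket layout above: with 5 buckets per decade over exponents -6 to 2, `get_buckets_for_critical_operations` yields 41 bounds, from 1 microsecond to 100 seconds. The following is a minimal standalone sketch of the same computation; the function name `log_spaced_buckets` and its parameters are illustrative assumptions, not part of this change.

```rust
/// Builds log-spaced histogram bucket bounds (in seconds): `buckets_per_digit`
/// bounds per decade, from 10^min_exponent up to 10^max_exponent inclusive.
/// Hypothetical helper, parameterized for illustration only.
fn log_spaced_buckets(buckets_per_digit: i32, min_exponent: i32, max_exponent: i32) -> Vec<f64> {
    ((min_exponent * buckets_per_digit)..=(max_exponent * buckets_per_digit))
        // One powf call per bound instead of repeated multiplication,
        // so rounding error does not accumulate from bound to bound.
        .map(|exp| 10_f64.powf(f64::from(exp) / f64::from(buckets_per_digit)))
        .collect()
}
```

Calling `log_spaced_buckets(5, -6, 2)` mirrors the constants used in the diff.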
@@ -205,6 +233,8 @@ pub struct LayeredTimeline {
    pub layers: RwLock<LayerMap>,

    last_freeze_at: AtomicLsn,
+    // Atomic would be more appropriate here.
+    last_freeze_ts: RwLock<Instant>,

    // WAL redo manager
    walredo_mgr: Arc<dyn WalRedoManager + Sync + Send>,
@@ -295,6 +325,9 @@ pub struct LayeredTimeline {
    /// or None if WAL receiver has not received anything for this timeline
    /// yet.
    pub last_received_wal: Mutex<Option<WalReceiverInfo>>,
+
+    /// Relation size cache
+    rel_size_cache: RwLock<HashMap<RelTag, (Lsn, BlockNumber)>>,
 }

 pub struct WalReceiverInfo {
@@ -306,7 +339,42 @@ pub struct WalReceiverInfo {
 /// Inherit all the functions from DatadirTimeline, to provide the
 /// functionality to store PostgreSQL relations, SLRUs, etc. in a
 /// LayeredTimeline.
-impl DatadirTimeline for LayeredTimeline {}
+impl DatadirTimeline for LayeredTimeline {
+    fn get_cached_rel_size(&self, tag: &RelTag, lsn: Lsn) -> Option<BlockNumber> {
+        let rel_size_cache = self.rel_size_cache.read().unwrap();
+        if let Some((cached_lsn, nblocks)) = rel_size_cache.get(tag) {
+            if lsn >= *cached_lsn {
+                return Some(*nblocks);
+            }
+        }
+        None
+    }
+
+    fn update_cached_rel_size(&self, tag: RelTag, lsn: Lsn, nblocks: BlockNumber) {
+        let mut rel_size_cache = self.rel_size_cache.write().unwrap();
+        match rel_size_cache.entry(tag) {
+            Entry::Occupied(mut entry) => {
+                let cached_lsn = entry.get_mut();
+                if lsn >= cached_lsn.0 {
+                    *cached_lsn = (lsn, nblocks);
+                }
+            }
+            Entry::Vacant(entry) => {
+                entry.insert((lsn, nblocks));
+            }
+        }
+    }
+
+    fn set_cached_rel_size(&self, tag: RelTag, lsn: Lsn, nblocks: BlockNumber) {
+        let mut rel_size_cache = self.rel_size_cache.write().unwrap();
+        rel_size_cache.insert(tag, (lsn, nblocks));
+    }
+
+    fn remove_cached_rel_size(&self, tag: &RelTag) {
+        let mut rel_size_cache = self.rel_size_cache.write().unwrap();
+        rel_size_cache.remove(tag);
+    }
+}

 ///
 /// Information about how much history needs to be retained, needed by
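Note on the relation size cache above: a lookup only hits when the requested LSN is at or past the LSN at which the size was cached, `update_cached_rel_size` (read path) never replaces a newer entry with an older one, and `set_cached_rel_size` (write path) overwrites unconditionally. A minimal sketch of those rules follows, assuming plain integers as stand-ins for `RelTag`, `Lsn` and `BlockNumber`; the `RelSizeCache` type is hypothetical.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical stand-ins for the pageserver's RelTag, Lsn and BlockNumber.
type RelId = u32;
type Lsn = u64;
type BlockNumber = u32;

#[derive(Default)]
struct RelSizeCache {
    entries: RwLock<HashMap<RelId, (Lsn, BlockNumber)>>,
}

impl RelSizeCache {
    /// Returns the cached size only if the request LSN is not older than the cached LSN.
    fn get(&self, rel: RelId, lsn: Lsn) -> Option<BlockNumber> {
        let entries = self.entries.read().unwrap();
        match entries.get(&rel) {
            Some(&(cached_lsn, nblocks)) if lsn >= cached_lsn => Some(nblocks),
            _ => None,
        }
    }

    /// Read path: stores the size unless a more recent entry is already cached.
    fn update(&self, rel: RelId, lsn: Lsn, nblocks: BlockNumber) {
        let mut entries = self.entries.write().unwrap();
        let slot = entries.entry(rel).or_insert((lsn, nblocks));
        if lsn >= slot.0 {
            *slot = (lsn, nblocks);
        }
    }

    /// Write path: overwrites whatever is cached.
    fn set(&self, rel: RelId, lsn: Lsn, nblocks: BlockNumber) {
        self.entries.write().unwrap().insert(rel, (lsn, nblocks));
    }

    /// Drops the entry, for example when the relation is dropped.
    fn remove(&self, rel: RelId) {
        self.entries.write().unwrap().remove(&rel);
    }
}
```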
@@ -377,8 +445,6 @@ impl Timeline for LayeredTimeline {

    /// Look up the value with the given a key
    fn get(&self, key: Key, lsn: Lsn) -> Result<Bytes> {
-        debug_assert!(lsn <= self.get_last_record_lsn());
-
        // Check the page cache. We will get back the most recent page with lsn <= `lsn`.
        // The cached image can be returned directly if there is no WAL between the cached image
        // and requested LSN. The cached image can also be used to reduce the amount of WAL needed
@@ -496,6 +562,13 @@ impl LayeredTimeline {
            .unwrap_or(self.conf.default_tenant_conf.checkpoint_distance)
    }

+    fn get_checkpoint_timeout(&self) -> Duration {
+        let tenant_conf = self.tenant_conf.read().unwrap();
+        tenant_conf
+            .checkpoint_timeout
+            .unwrap_or(self.conf.default_tenant_conf.checkpoint_timeout)
+    }
+
    fn get_compaction_target_size(&self) -> u64 {
        let tenant_conf = self.tenant_conf.read().unwrap();
        tenant_conf
@@ -585,6 +658,7 @@ impl LayeredTimeline {
            disk_consistent_lsn: AtomicLsn::new(metadata.disk_consistent_lsn().0),

            last_freeze_at: AtomicLsn::new(metadata.disk_consistent_lsn().0),
+            last_freeze_ts: RwLock::new(Instant::now()),

            ancestor_timeline: ancestor,
            ancestor_lsn: metadata.ancestor_lsn(),
@@ -618,6 +692,7 @@ impl LayeredTimeline {
            repartition_threshold: 0,

            last_received_wal: Mutex::new(None),
+            rel_size_cache: RwLock::new(HashMap::new()),
        };
        result.repartition_threshold = result.get_checkpoint_distance() / 10;
        result
@@ -1029,8 +1104,11 @@ impl LayeredTimeline {
    }

    ///
-    /// Check if more than 'checkpoint_distance' of WAL has been accumulated
-    /// in the in-memory layer, and initiate flushing it if so.
+    /// Check if more than 'checkpoint_distance' of WAL has been accumulated in
+    /// the in-memory layer, and initiate flushing it if so.
+    ///
+    /// Also flush after a period of time without new data. This helps
+    /// safekeepers regard the pageserver as caught up and suspend activity.
    ///
    pub fn check_checkpoint_distance(self: &Arc<LayeredTimeline>) -> Result<()> {
        let last_lsn = self.get_last_record_lsn();
@@ -1038,21 +1116,27 @@ impl LayeredTimeline {
        if let Some(open_layer) = &layers.open_layer {
            let open_layer_size = open_layer.size()?;
            drop(layers);
-            let distance = last_lsn.widening_sub(self.last_freeze_at.load());
+            let last_freeze_at = self.last_freeze_at.load();
+            let last_freeze_ts = *(self.last_freeze_ts.read().unwrap());
+            let distance = last_lsn.widening_sub(last_freeze_at);
            // Checkpointing the open layer can be triggered by layer size or LSN range.
            // S3 has a 5 GB limit on the size of one upload (without multi-part upload), and
            // we want to stay below that with a big margin.  The LSN distance determines how
            // much WAL the safekeepers need to store.
            if distance >= self.get_checkpoint_distance().into()
                || open_layer_size > self.get_checkpoint_distance()
+                || (distance > 0 && last_freeze_ts.elapsed() >= self.get_checkpoint_timeout())
            {
                info!(
-                    "check_checkpoint_distance {}, layer size {}",
-                    distance, open_layer_size
+                    "check_checkpoint_distance {}, layer size {}, elapsed since last flush {:?}",
+                    distance,
+                    open_layer_size,
+                    last_freeze_ts.elapsed()
                );

                self.freeze_inmem_layer(true);
                self.last_freeze_at.store(last_lsn);
+                *(self.last_freeze_ts.write().unwrap()) = Instant::now();

                // Launch a thread to flush the frozen layer to disk, unless
                // a thread was already running. (If the thread was running
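Note on the flush condition in `check_checkpoint_distance` above: the new third clause only fires when some WAL has accumulated since the last freeze (`distance > 0`), so an idle timeline is not flushed repeatedly. The condition can be restated as a pure function; this is a sketch under assumptions (the name `should_freeze` and the flattened parameter list are hypothetical, and `i128` is assumed for the widened LSN distance).

```rust
use std::time::Duration;

/// Hypothetical pure-function restatement of the three flush triggers:
/// LSN distance, open layer size, and time since the last freeze.
fn should_freeze(
    lsn_distance: i128,
    open_layer_size: u64,
    checkpoint_distance: u64,
    since_last_freeze: Duration,
    checkpoint_timeout: Duration,
) -> bool {
    lsn_distance >= i128::from(checkpoint_distance)
        || open_layer_size > checkpoint_distance
        // The timeout applies only if there is something new to flush.
        || (lsn_distance > 0 && since_last_freeze >= checkpoint_timeout)
}
```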
--- a/pageserver/src/lib.rs
+++ b/pageserver/src/lib.rs
@@ -22,7 +22,7 @@ pub mod walreceiver;
 pub mod walrecord;
 pub mod walredo;

-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;
 use tracing::info;

 use crate::thread_mgr::ThreadKind;
@@ -42,14 +42,14 @@ pub const STORAGE_FORMAT_VERSION: u16 = 3;
 pub const IMAGE_FILE_MAGIC: u16 = 0x5A60;
 pub const DELTA_FILE_MAGIC: u16 = 0x5A61;

-lazy_static! {
-    static ref LIVE_CONNECTIONS_COUNT: IntGaugeVec = register_int_gauge_vec!(
+static LIVE_CONNECTIONS_COUNT: Lazy<IntGaugeVec> = Lazy::new(|| {
+    register_int_gauge_vec!(
        "pageserver_live_connections",
        "Number of live network connections",
        &["pageserver_connection_kind"]
    )
-    .expect("failed to define a metric");
-}
+    .expect("failed to define a metric")
+});

 pub const LOG_FILE_NAME: &str = "pageserver.log";

@@ -93,3 +93,56 @@ pub fn shutdown_pageserver(exit_code: i32) {
    info!("Shut down successfully completed");
    std::process::exit(exit_code);
 }
+
+const DEFAULT_BASE_BACKOFF_SECONDS: f64 = 0.1;
+const DEFAULT_MAX_BACKOFF_SECONDS: f64 = 3.0;
+
+async fn exponential_backoff(n: u32, base_increment: f64, max_seconds: f64) {
+    let backoff_duration_seconds =
+        exponential_backoff_duration_seconds(n, base_increment, max_seconds);
+    if backoff_duration_seconds > 0.0 {
+        info!(
+            "Backoff: waiting {backoff_duration_seconds} seconds before proceeding with the task",
+        );
+        tokio::time::sleep(std::time::Duration::from_secs_f64(backoff_duration_seconds)).await;
+    }
+}
+
+fn exponential_backoff_duration_seconds(n: u32, base_increment: f64, max_seconds: f64) -> f64 {
+    if n == 0 {
+        0.0
+    } else {
+        (1.0 + base_increment).powf(f64::from(n)).min(max_seconds)
+    }
+}
+
+#[cfg(test)]
+mod backoff_defaults_tests {
+    use super::*;
+
+    #[test]
+    fn backoff_defaults_produce_growing_backoff_sequence() {
+        let mut current_backoff_value = None;
+
+        for i in 0..10_000 {
+            let new_backoff_value = exponential_backoff_duration_seconds(
+                i,
+                DEFAULT_BASE_BACKOFF_SECONDS,
+                DEFAULT_MAX_BACKOFF_SECONDS,
+            );
+
+            if let Some(old_backoff_value) = current_backoff_value.replace(new_backoff_value) {
+                assert!(
+                    old_backoff_value <= new_backoff_value,
+                    "{i}th backoff value {new_backoff_value} is smaller than the previous one {old_backoff_value}"
+                )
+            }
+        }
+
+        assert_eq!(
+            current_backoff_value.expect("Should have produced backoff values to compare"),
+            DEFAULT_MAX_BACKOFF_SECONDS,
+            "Given a big enough number of retries, backoff should reach its allowed max value"
+        );
+    }
+}
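Note on the backoff formula above: the wait is `(1 + base_increment)^n` seconds capped at `max_seconds`, so with the defaults (0.1 and 3.0) the first retry waits about 1.1 s and the cap is reached at the 12th retry. The sketch below repeats the formula from the diff and adds `first_capped_retry`, a hypothetical helper that is not part of this change, to make the cap point explicit.

```rust
/// Same formula as `exponential_backoff_duration_seconds` in the diff above.
fn backoff_seconds(n: u32, base_increment: f64, max_seconds: f64) -> f64 {
    if n == 0 {
        0.0
    } else {
        (1.0 + base_increment).powf(f64::from(n)).min(max_seconds)
    }
}

/// Hypothetical helper: the first retry number (searched up to 10_000)
/// whose backoff equals the cap, or None if the cap is never reached.
fn first_capped_retry(base_increment: f64, max_seconds: f64) -> Option<u32> {
    (1..=10_000u32).find(|&n| backoff_seconds(n, base_increment, max_seconds) >= max_seconds)
}
```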
--- a/pageserver/src/page_service.rs
+++ b/pageserver/src/page_service.rs
@@ -11,7 +11,7 @@

 use anyhow::{bail, ensure, Context, Result};
 use bytes::{Buf, BufMut, Bytes, BytesMut};
-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;
 use regex::Regex;
 use std::io::{self, Read};
 use std::net::TcpListener;
@@ -434,15 +434,15 @@ const TIME_BUCKETS: &[f64] = &[
    0.1,  // 1/10 s
 ];

-lazy_static! {
-    static ref SMGR_QUERY_TIME: HistogramVec = register_histogram_vec!(
+static SMGR_QUERY_TIME: Lazy<HistogramVec> = Lazy::new(|| {
+    register_histogram_vec!(
        "pageserver_smgr_query_seconds",
        "Time spent on smgr query handling",
        &["smgr_query_type", "tenant_id", "timeline_id"],
        TIME_BUCKETS.into()
    )
-    .expect("failed to define a metric");
-}
+    .expect("failed to define a metric")
+});

 impl PageServerHandler {
    pub fn new(conf: &'static PageServerConf, auth: Option<Arc<JwtAuth>>) -> Self {
@@ -1044,6 +1044,7 @@ impl postgres_backend::Handler for PageServerHandler {
            let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
            pgb.write_message_noflush(&BeMessage::RowDescription(&[
                RowDescriptor::int8_col(b"checkpoint_distance"),
+                RowDescriptor::int8_col(b"checkpoint_timeout"),
                RowDescriptor::int8_col(b"compaction_target_size"),
                RowDescriptor::int8_col(b"compaction_period"),
                RowDescriptor::int8_col(b"compaction_threshold"),
@@ -1054,6 +1055,12 @@ impl postgres_backend::Handler for PageServerHandler {
            ]))?
            .write_message_noflush(&BeMessage::DataRow(&[
                Some(repo.get_checkpoint_distance().to_string().as_bytes()),
+                Some(
+                    repo.get_checkpoint_timeout()
+                        .as_secs()
+                        .to_string()
+                        .as_bytes(),
+                ),
                Some(repo.get_compaction_target_size().to_string().as_bytes()),
                Some(
                    repo.get_compaction_period()
--- a/pageserver/src/pgdatadir_mapping.rs
+++ b/pageserver/src/pgdatadir_mapping.rs
@@ -56,13 +56,16 @@ pub trait DatadirTimeline: Timeline {
    /// This provides a transaction-like interface to perform a bunch
    /// of modifications atomically.
    ///
-    /// To ingest a WAL record, call begin_modification() to get a
+    /// To ingest a WAL record, call begin_modification(lsn) to get a
    /// DatadirModification object. Use the functions in the object to
    /// modify the repository state, updating all the pages and metadata
-    /// that the WAL record affects. When you're done, call commit(lsn) to
-    /// commit the changes. All the changes will be stamped with the specified LSN.
+    /// that the WAL record affects. When you're done, call commit() to
+    /// commit the changes.
    ///
-    /// Calling commit(lsn) will flush all the changes and reset the state,
+    /// The LSN stored in the modification is advanced by `ingest_record` and
+    /// is used by `commit()` to update `last_record_lsn`.
+    ///
+    /// Calling commit() will flush all the changes and reset the state,
    /// so the `DatadirModification` struct can be reused to perform the next modification.
    ///
    /// Note that any pending modifications you make through the
@@ -70,7 +73,7 @@ pub trait DatadirTimeline: Timeline {
    /// functions of the timeline until you finish! And if you update the
    /// same page twice, the last update wins.
    ///
-    fn begin_modification(&self) -> DatadirModification<Self>
+    fn begin_modification(&self, lsn: Lsn) -> DatadirModification<Self>
    where
        Self: Sized,
    {
@@ -79,6 +82,7 @@ pub trait DatadirTimeline: Timeline {
            pending_updates: HashMap::new(),
            pending_deletions: Vec::new(),
            pending_nblocks: 0,
+            lsn,
        }
    }

@@ -120,6 +124,10 @@ pub trait DatadirTimeline: Timeline {
    fn get_rel_size(&self, tag: RelTag, lsn: Lsn) -> Result<BlockNumber> {
        ensure!(tag.relnode != 0, "invalid relnode");

+        if let Some(nblocks) = self.get_cached_rel_size(&tag, lsn) {
+            return Ok(nblocks);
+        }
+
        if (tag.forknum == pg_constants::FSM_FORKNUM
            || tag.forknum == pg_constants::VISIBILITYMAP_FORKNUM)
            && !self.get_rel_exists(tag, lsn)?
@@ -133,13 +141,21 @@ pub trait DatadirTimeline: Timeline {

        let key = rel_size_to_key(tag);
        let mut buf = self.get(key, lsn)?;
-        Ok(buf.get_u32_le())
+        let nblocks = buf.get_u32_le();
+
+        // Update relation size cache
+        self.update_cached_rel_size(tag, lsn, nblocks);
+        Ok(nblocks)
    }

    /// Does relation exist?
    fn get_rel_exists(&self, tag: RelTag, lsn: Lsn) -> Result<bool> {
        ensure!(tag.relnode != 0, "invalid relnode");

+        // First try to look up the relation in the cache
+        if let Some(_nblocks) = self.get_cached_rel_size(&tag, lsn) {
+            return Ok(true);
+        }
        // fetch directory listing
        let key = rel_dir_to_key(tag.spcnode, tag.dbnode);
        let buf = self.get(key, lsn)?;
@@ -445,6 +461,18 @@ pub trait DatadirTimeline: Timeline {

        Ok(result.to_keyspace())
    }
+
+    /// Get the cached size of a relation if it was not updated after the specified LSN
+    fn get_cached_rel_size(&self, tag: &RelTag, lsn: Lsn) -> Option<BlockNumber>;
+
+    /// Update cached relation size if there is no more recent update
+    fn update_cached_rel_size(&self, tag: RelTag, lsn: Lsn, nblocks: BlockNumber);
+
+    /// Store cached relation size
+    fn set_cached_rel_size(&self, tag: RelTag, lsn: Lsn, nblocks: BlockNumber);
+
+    /// Remove cached relation size
+    fn remove_cached_rel_size(&self, tag: &RelTag);
 }

 /// DatadirModification represents an operation to ingest an atomic set of
@@ -457,6 +485,9 @@ pub struct DatadirModification<'a, T: DatadirTimeline> {
    /// in the state in 'tline' yet.
    pub tline: &'a T,

+    /// Lsn assigned by begin_modification
+    pub lsn: Lsn,
+
    // The modifications are not applied directly to the underlying key-value store.
    // The put-functions add the modifications here, and they are flushed to the
    // underlying key-value store by the 'finish' function.
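Note on the `lsn` field added to `DatadirModification` above: the modification now carries its own LSN from `begin_modification(lsn)`, and `commit()` takes no argument. A toy model of that calling pattern follows; `ToyTimeline` and `ToyModification` are hypothetical types made up for illustration, and the real structs hold far more state.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the pageserver's Lsn type.
type Lsn = u64;

#[derive(Default)]
struct ToyTimeline {
    last_record_lsn: Lsn,
    store: HashMap<(String, Lsn), u32>,
}

struct ToyModification<'a> {
    tline: &'a mut ToyTimeline,
    /// LSN assigned at begin_modification; a caller may advance it between commits.
    lsn: Lsn,
    pending: HashMap<String, u32>,
}

impl ToyTimeline {
    fn begin_modification(&mut self, lsn: Lsn) -> ToyModification<'_> {
        ToyModification { tline: self, lsn, pending: HashMap::new() }
    }
}

impl ToyModification<'_> {
    fn put(&mut self, key: &str, value: u32) {
        self.pending.insert(key.to_string(), value);
    }

    /// Stamps every pending update with the LSN carried by the modification,
    /// then advances the timeline's last record LSN to it.
    fn commit(&mut self) {
        for (key, value) in self.pending.drain() {
            self.tline.store.insert((key, self.lsn), value);
        }
        self.tline.last_record_lsn = self.lsn;
    }
}
```

Keeping the LSN inside the modification means helpers that need it, such as the relation size cache updates, do not require an extra parameter.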
@@ -666,26 +697,36 @@ impl<'a, T: DatadirTimeline> DatadirModification<'a, T> {

        self.pending_nblocks += nblocks as isize;

+        // Update relation size cache
+        self.tline.set_cached_rel_size(rel, self.lsn, nblocks);
+
        // Even if nblocks > 0, we don't insert any actual blocks here. That's up to the
        // caller.
-
        Ok(())
    }

    /// Truncate relation
    pub fn put_rel_truncation(&mut self, rel: RelTag, nblocks: BlockNumber) -> Result<()> {
        ensure!(rel.relnode != 0, "invalid relnode");
-        let size_key = rel_size_to_key(rel);
+        let last_lsn = self.tline.get_last_record_lsn();
+        if self.tline.get_rel_exists(rel, last_lsn)? {
+            let size_key = rel_size_to_key(rel);
+            // Fetch the old size first
+            let old_size = self.get(size_key)?.get_u32_le();

-        // Fetch the old size first
-        let old_size = self.get(size_key)?.get_u32_le();
+            // Update the entry with the new size.
+            let buf = nblocks.to_le_bytes();
+            self.put(size_key, Value::Image(Bytes::from(buf.to_vec())));

-        // Update the entry with the new size.
-        let buf = nblocks.to_le_bytes();
-        self.put(size_key, Value::Image(Bytes::from(buf.to_vec())));
+            // Update relation size cache
+            self.tline.set_cached_rel_size(rel, self.lsn, nblocks);

-        // Update logical database size.
-        self.pending_nblocks -= old_size as isize - nblocks as isize;
+            // Update relation size cache
+            self.tline.set_cached_rel_size(rel, self.lsn, nblocks);
+
+            // Update logical database size.
+            self.pending_nblocks -= old_size as isize - nblocks as isize;
+        }
        Ok(())
    }

@@ -703,6 +744,9 @@ impl<'a, T: DatadirTimeline> DatadirModification<'a, T> {
            let buf = nblocks.to_le_bytes();
            self.put(size_key, Value::Image(Bytes::from(buf.to_vec())));

+            // Update relation size cache
+            self.tline.set_cached_rel_size(rel, self.lsn, nblocks);
+
            self.pending_nblocks += nblocks as isize - old_size as isize;
        }
        Ok(())
@@ -728,6 +772,9 @@ impl<'a, T: DatadirTimeline> DatadirModification<'a, T> {
        let old_size = self.get(size_key)?.get_u32_le();
        self.pending_nblocks -= old_size as isize;

+        // Remove entry from relation size cache
+        self.tline.remove_cached_rel_size(&rel);
+
        // Delete size entry, as well as all blocks
        self.delete(rel_key_range(rel));

@@ -842,7 +889,7 @@ impl<'a, T: DatadirTimeline> DatadirModification<'a, T> {
    /// retains all the metadata, but data pages are flushed. That's again OK
    /// for bulk import, where you are just loading data pages and won't try to
    /// modify the same pages twice.
-    pub fn flush(&mut self, lsn: Lsn) -> Result<()> {
+    pub fn flush(&mut self) -> Result<()> {
        // Unless we have accumulated a decent amount of changes, it's not worth it
        // to scan through the pending_updates list.
        let pending_nblocks = self.pending_nblocks;
@@ -856,7 +903,7 @@ impl<'a, T: DatadirTimeline> DatadirModification<'a, T> {
        let mut result: Result<()> = Ok(());
        self.pending_updates.retain(|&key, value| {
            if result.is_ok() && (is_rel_block_key(key) || is_slru_block_key(key)) {
-                result = writer.put(key, lsn, value);
+                result = writer.put(key, self.lsn, value);
                false
            } else {
                true
@@ -877,9 +924,9 @@ impl<'a, T: DatadirTimeline> DatadirModification<'a, T> {
    /// underlying timeline.
    /// All the modifications in this atomic update are stamped by the specified LSN.
    ///
-    pub fn commit(&mut self, lsn: Lsn) -> Result<()> {
+    pub fn commit(&mut self) -> Result<()> {
        let writer = self.tline.writer();
-
+        let lsn = self.lsn;
        let pending_nblocks = self.pending_nblocks;
        self.pending_nblocks = 0;

@@ -919,8 +966,8 @@ impl<'a, T: DatadirTimeline> DatadirModification<'a, T> {
                bail!("unexpected pending WAL record");
            }
        } else {
-            let last_lsn = self.tline.get_last_record_lsn();
-            self.tline.get(key, last_lsn)
+            let lsn = Lsn::max(self.tline.get_last_record_lsn(), self.lsn);
+            self.tline.get(key, lsn)
        }
    }

@@ -1324,9 +1371,9 @@ pub fn create_test_timeline<R: Repository>(
    timeline_id: utils::zid::ZTimelineId,
 ) -> Result<std::sync::Arc<R::Timeline>> {
    let tline = repo.create_empty_timeline(timeline_id, Lsn(8))?;
-    let mut m = tline.begin_modification();
+    let mut m = tline.begin_modification(Lsn(8));
    m.init_empty()?;
-    m.commit(Lsn(8))?;
+    m.commit()?;
    Ok(tline)
 }

--- a/pageserver/src/repository.rs
+++ b/pageserver/src/repository.rs
@@ -408,7 +408,7 @@ pub trait TimelineWriter<'a> {
 #[cfg(test)]
 pub mod repo_harness {
    use bytes::BytesMut;
-    use lazy_static::lazy_static;
+    use once_cell::sync::Lazy;
    use std::sync::{Arc, RwLock, RwLockReadGuard, RwLockWriteGuard};
    use std::{fs, path::PathBuf};

@@ -439,14 +439,13 @@ pub mod repo_harness {
        buf.freeze()
    }

-    lazy_static! {
-        static ref LOCK: RwLock<()> = RwLock::new(());
-    }
+    static LOCK: Lazy<RwLock<()>> = Lazy::new(|| RwLock::new(()));

    impl From<TenantConf> for TenantConfOpt {
        fn from(tenant_conf: TenantConf) -> Self {
            Self {
                checkpoint_distance: Some(tenant_conf.checkpoint_distance),
+                checkpoint_timeout: Some(tenant_conf.checkpoint_timeout),
                compaction_target_size: Some(tenant_conf.compaction_target_size),
                compaction_period: Some(tenant_conf.compaction_period),
                compaction_threshold: Some(tenant_conf.compaction_threshold),
@@ -589,11 +588,10 @@ mod tests {
    //use std::sync::Arc;
    use bytes::BytesMut;
    use hex_literal::hex;
-    use lazy_static::lazy_static;
+    use once_cell::sync::Lazy;

-    lazy_static! {
-        static ref TEST_KEY: Key = Key::from_slice(&hex!("112222222233333333444444445500000001"));
-    }
+    static TEST_KEY: Lazy<Key> =
+        Lazy::new(|| Key::from_slice(&hex!("112222222233333333444444445500000001")));

    #[test]
    fn test_basic() -> Result<()> {
--- a/pageserver/src/storage_sync.rs
+++ b/pageserver/src/storage_sync.rs
@@ -155,8 +155,7 @@ use std::{

 use anyhow::{anyhow, bail, Context};
 use futures::stream::{FuturesUnordered, StreamExt};
-use lazy_static::lazy_static;
-use once_cell::sync::OnceCell;
+use once_cell::sync::{Lazy, OnceCell};
 use remote_storage::{GenericRemoteStorage, RemoteStorage};
 use tokio::{
    fs,
@@ -173,6 +172,7 @@ use self::{
 };
 use crate::{
    config::PageServerConf,
+    exponential_backoff,
    layered_repository::{
        ephemeral_file::is_ephemeral_file,
        metadata::{metadata_path, TimelineMetadata, METADATA_FILE_NAME},
@@ -184,8 +184,8 @@ use crate::{
 };

 use metrics::{
-    register_histogram_vec, register_int_counter, register_int_counter_vec, register_int_gauge,
-    HistogramVec, IntCounter, IntCounterVec, IntGauge,
+    register_histogram_vec, register_int_counter_vec, register_int_gauge, HistogramVec,
+    IntCounterVec, IntGauge,
 };
 use utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId};

@@ -193,32 +193,33 @@ use self::download::download_index_parts;
 pub use self::download::gather_tenant_timelines_index_parts;
 pub use self::download::TEMP_DOWNLOAD_EXTENSION;

-lazy_static! {
-    static ref REMAINING_SYNC_ITEMS: IntGauge = register_int_gauge!(
+static REMAINING_SYNC_ITEMS: Lazy<IntGauge> = Lazy::new(|| {
+    register_int_gauge!(
        "pageserver_remote_storage_remaining_sync_items",
        "Number of storage sync items left in the queue"
    )
-    .expect("failed to register pageserver remote storage remaining sync items int gauge");
-    static ref FATAL_TASK_FAILURES: IntCounter = register_int_counter!(
-        "pageserver_remote_storage_fatal_task_failures_total",
-        "Number of critically failed tasks"
-    )
-    .expect("failed to register pageserver remote storage remaining sync items int gauge");
-    static ref IMAGE_SYNC_TIME: HistogramVec = register_histogram_vec!(
+    .expect("failed to register pageserver remote storage remaining sync items int gauge")
+});
+
+static IMAGE_SYNC_TIME: Lazy<HistogramVec> = Lazy::new(|| {
+    register_histogram_vec!(
        "pageserver_remote_storage_image_sync_seconds",
        "Time took to synchronize (download or upload) a whole pageserver image. \
        Grouped by tenant and timeline ids, `operation_kind` (upload|download) and `status` (success|failure)",
        &["tenant_id", "timeline_id", "operation_kind", "status"],
        vec![0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 3.0, 10.0, 20.0]
    )
-    .expect("failed to register pageserver image sync time histogram vec");
-    static ref REMOTE_INDEX_UPLOAD: IntCounterVec = register_int_counter_vec!(
+    .expect("failed to register pageserver image sync time histogram vec")
+});
+
+static REMOTE_INDEX_UPLOAD: Lazy<IntCounterVec> = Lazy::new(|| {
+    register_int_counter_vec!(
        "pageserver_remote_storage_remote_index_uploads_total",
        "Number of remote index uploads",
        &["tenant_id", "timeline_id"],
    )
-    .expect("failed to register pageserver remote index upload vec");
-}
+    .expect("failed to register pageserver remote index upload vec")
+});

 static SYNC_QUEUE: OnceCell<SyncQueue> = OnceCell::new();

@@ -969,14 +970,19 @@ fn storage_sync_loop<P, S>(
    }
 }

-// needed to check whether the download happened
-// more informative than just a bool
 #[derive(Debug)]
-enum DownloadMarker {
+enum DownloadStatus {
    Downloaded,
    Nothing,
 }

+#[derive(Debug)]
+enum UploadStatus {
+    Uploaded,
+    Failed(anyhow::Error),
+    Nothing,
+}
+
 async fn process_batches<P, S>(
    conf: &'static PageServerConf,
    max_sync_errors: NonZeroU32,
@@ -1016,7 +1022,7 @@ where
            "Finished storage sync task for sync id {sync_id} download marker {:?}",
            download_marker
        );
-        if matches!(download_marker, DownloadMarker::Downloaded) {
+        if matches!(download_marker, DownloadStatus::Downloaded) {
            downloaded_timelines.insert(sync_id.tenant_id);
        }
    }
@@ -1030,7 +1036,7 @@ async fn process_sync_task_batch<P, S>(
    max_sync_errors: NonZeroU32,
    sync_id: ZTenantTimelineId,
    batch: SyncTaskBatch,
-) -> DownloadMarker
+) -> DownloadStatus
 where
    P: Debug + Send + Sync + 'static,
    S: RemoteStorage<RemoteObjectId = P> + Send + Sync + 'static,
@@ -1047,66 +1053,71 @@ where
    // When operating in a system without tasks failing over the error threshold,
    // current batching and task processing systems aim to update the layer set and metadata files (remote and local),
    // without "losing" such layer files.
-    let (upload_result, status_update) = tokio::join!(
+    let (upload_status, download_status) = tokio::join!(
        async {
            if let Some(upload_data) = upload_data {
-                match validate_task_retries(upload_data, max_sync_errors)
+                let upload_retries = upload_data.retries;
+                match validate_task_retries(upload_retries, max_sync_errors)
                    .instrument(info_span!("retries_validation"))
                    .await
                {
-                    ControlFlow::Continue(new_upload_data) => {
+                    ControlFlow::Continue(()) => {
                        upload_timeline_data(
                            conf,
                            (storage.as_ref(), &index, sync_queue),
                            current_remote_timeline.as_ref(),
                            sync_id,
-                            new_upload_data,
+                            upload_data,
                            sync_start,
                            "upload",
                        )
-                        .await;
-                        return Some(());
-                    }
-                    ControlFlow::Break(failed_upload_data) => {
-                        if let Err(e) = update_remote_data(
-                            conf,
-                            storage.as_ref(),
-                            &index,
-                            sync_id,
-                            RemoteDataUpdate::Upload {
-                                uploaded_data: failed_upload_data.data,
-                                upload_failed: true,
-                            },
-                        )
                        .await
-                        {
-                            error!("Failed to update remote timeline {sync_id}: {e:?}");
-                        }
                    }
+                    ControlFlow::Break(()) => match update_remote_data(
+                        conf,
+                        storage.as_ref(),
+                        &index,
+                        sync_id,
+                        RemoteDataUpdate::Upload {
+                            uploaded_data: upload_data.data,
+                            upload_failed: true,
+                        },
+                    )
+                    .await
+                    {
+                        Ok(()) => UploadStatus::Failed(anyhow::anyhow!(
+                            "Aborted after retries validation, current retries: {upload_retries}, max retries allowed: {max_sync_errors}"
+                        )),
+                        Err(e) => {
+                            error!("Failed to update remote timeline {sync_id}: {e:?}");
+                            UploadStatus::Failed(e)
+                        }
+                    },
                }
+            } else {
+                UploadStatus::Nothing
            }
-            None
        }
        .instrument(info_span!("upload_timeline_data")),
        async {
            if let Some(download_data) = download_data {
-                match validate_task_retries(download_data, max_sync_errors)
+                match validate_task_retries(download_data.retries, max_sync_errors)
                    .instrument(info_span!("retries_validation"))
                    .await
                {
-                    ControlFlow::Continue(new_download_data) => {
+                    ControlFlow::Continue(()) => {
                        return download_timeline_data(
                            conf,
                            (storage.as_ref(), &index, sync_queue),
                            current_remote_timeline.as_ref(),
                            sync_id,
-                            new_download_data,
+                            download_data,
                            sync_start,
                            "download",
                        )
                        .await;
                    }
-                    ControlFlow::Break(_) => {
+                    ControlFlow::Break(()) => {
                        index
                            .write()
                            .await
@@ -1115,51 +1126,53 @@ where
                    }
                }
            }
-            DownloadMarker::Nothing
+            DownloadStatus::Nothing
        }
        .instrument(info_span!("download_timeline_data")),
    );

-    if let Some(mut delete_data) = batch.delete {
-        if upload_result.is_some() {
-            match validate_task_retries(delete_data, max_sync_errors)
-                .instrument(info_span!("retries_validation"))
-                .await
-            {
-                ControlFlow::Continue(new_delete_data) => {
-                    delete_timeline_data(
-                        conf,
-                        (storage.as_ref(), &index, sync_queue),
-                        sync_id,
-                        new_delete_data,
-                        sync_start,
-                        "delete",
-                    )
-                    .instrument(info_span!("delete_timeline_data"))
-                    .await;
-                }
-                ControlFlow::Break(failed_delete_data) => {
-                    if let Err(e) = update_remote_data(
-                        conf,
-                        storage.as_ref(),
-                        &index,
-                        sync_id,
-                        RemoteDataUpdate::Delete(&failed_delete_data.data.deleted_layers),
-                    )
+    if let Some(delete_data) = batch.delete {
+        match upload_status {
+            UploadStatus::Uploaded | UploadStatus::Nothing => {
+                match validate_task_retries(delete_data.retries, max_sync_errors)
+                    .instrument(info_span!("retries_validation"))
                    .await
-                    {
-                        error!("Failed to update remote timeline {sync_id}: {e:?}");
+                {
+                    ControlFlow::Continue(()) => {
+                        delete_timeline_data(
+                            conf,
+                            (storage.as_ref(), &index, sync_queue),
+                            sync_id,
+                            delete_data,
+                            sync_start,
+                            "delete",
+                        )
+                        .instrument(info_span!("delete_timeline_data"))
+                        .await;
+                    }
+                    ControlFlow::Break(()) => {
+                        if let Err(e) = update_remote_data(
+                            conf,
+                            storage.as_ref(),
+                            &index,
+                            sync_id,
+                            RemoteDataUpdate::Delete(&delete_data.data.deleted_layers),
+                        )
+                        .await
+                        {
+                            error!("Failed to update remote timeline {sync_id}: {e:?}");
+                        }
                    }
                }
            }
-        } else {
-            delete_data.retries += 1;
-            sync_queue.push(sync_id, SyncTask::Delete(delete_data));
-            warn!("Skipping delete task due to failed upload tasks, reenqueuing");
+            UploadStatus::Failed(e) => {
+                warn!("Skipping delete task due to failed upload tasks, reenqueuing. Upload data: {:?}, delete data: {delete_data:?}. Upload failure: {e:#}", batch.upload);
+                sync_queue.push(sync_id, SyncTask::Delete(delete_data));
+            }
        }
    }

-    status_update
+    download_status
 }

 async fn download_timeline_data<P, S>(
@@ -1170,7 +1183,7 @@ async fn download_timeline_data<P, S>(
    new_download_data: SyncData<LayersDownload>,
    sync_start: Instant,
    task_name: &str,
-) -> DownloadMarker
+) -> DownloadStatus
 where
    P: Debug + Send + Sync + 'static,
    S: RemoteStorage<RemoteObjectId = P> + Send + Sync + 'static,
@@ -1199,7 +1212,7 @@ where
                Ok(()) => match index.write().await.set_awaits_download(&sync_id, false) {
                    Ok(()) => {
                        register_sync_status(sync_id, sync_start, task_name, Some(true));
-                        return DownloadMarker::Downloaded;
+                        return DownloadStatus::Downloaded;
                    }
                    Err(e) => {
                        error!("Timeline {sync_id} was expected to be in the remote index after a successful download, but it's absent: {e:?}");
@@ -1215,7 +1228,7 @@ where
        }
    }

-    DownloadMarker::Nothing
+    DownloadStatus::Nothing
 }

 async fn update_local_metadata(
@@ -1338,7 +1351,8 @@ async fn upload_timeline_data<P, S>(
    new_upload_data: SyncData<LayersUpload>,
    sync_start: Instant,
    task_name: &str,
-) where
+) -> UploadStatus
+where
    P: Debug + Send + Sync + 'static,
    S: RemoteStorage<RemoteObjectId = P> + Send + Sync + 'static,
 {
@@ -1351,9 +1365,9 @@ async fn upload_timeline_data<P, S>(
    )
    .await
    {
-        UploadedTimeline::FailedAndRescheduled => {
+        UploadedTimeline::FailedAndRescheduled(e) => {
            register_sync_status(sync_id, sync_start, task_name, Some(false));
-            return;
+            return UploadStatus::Failed(e);
        }
        UploadedTimeline::Successful(upload_data) => upload_data,
    };
@@ -1372,12 +1386,14 @@ async fn upload_timeline_data<P, S>(
    {
        Ok(()) => {
            register_sync_status(sync_id, sync_start, task_name, Some(true));
+            UploadStatus::Uploaded
        }
        Err(e) => {
            error!("Failed to update remote timeline {sync_id}: {e:?}");
            uploaded_data.retries += 1;
            sync_queue.push(sync_id, SyncTask::Upload(uploaded_data));
            register_sync_status(sync_id, sync_start, task_name, Some(false));
+            UploadStatus::Failed(e)
        }
    }
 }
@@ -1480,25 +1496,17 @@ where
        .context("Failed to upload new index part")
 }

-async fn validate_task_retries<T>(
-    sync_data: SyncData<T>,
+async fn validate_task_retries(
+    current_attempt: u32,
    max_sync_errors: NonZeroU32,
-) -> ControlFlow<SyncData<T>, SyncData<T>> {
-    let current_attempt = sync_data.retries;
+) -> ControlFlow<(), ()> {
    let max_sync_errors = max_sync_errors.get();
    if current_attempt >= max_sync_errors {
-        error!(
-            "Aborting task that failed {current_attempt} times, exceeding retries threshold of {max_sync_errors}",
-        );
-        return ControlFlow::Break(sync_data);
+        return ControlFlow::Break(());
    }

-    if current_attempt > 0 {
-        let seconds_to_wait = 2.0_f64.powf(current_attempt as f64 - 1.0).min(30.0);
-        info!("Waiting {seconds_to_wait} seconds before starting the task");
-        tokio::time::sleep(Duration::from_secs_f64(seconds_to_wait)).await;
-    }
-    ControlFlow::Continue(sync_data)
+    exponential_backoff(current_attempt, 1.0, 30.0).await;
+    ControlFlow::Continue(())
 }

 fn schedule_first_sync_tasks(
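
Review note on the `validate_task_retries` change above: the function now takes only the retry count and returns `ControlFlow<(), ()>`, so callers keep ownership of their `SyncData`. The sketch below shows that shape in a synchronous, std-only form. It is an illustration under assumptions: `backoff_delay` and `validate_retries` are hypothetical names, the delay formula mirrors the inline code removed in this hunk (the shared `exponential_backoff` helper is not shown here and may differ), and the wait is returned as a `Duration` instead of being slept on.

```rust
use std::ops::ControlFlow;
use std::time::Duration;

/// Delay before the given attempt, mirroring the inline formula removed above:
/// no wait on the first attempt, then 1s, 2s, 4s, ... capped at `max_seconds`.
/// Assumption: the shared `exponential_backoff` helper may compute this differently.
fn backoff_delay(current_attempt: u32, max_seconds: f64) -> Duration {
    if current_attempt == 0 {
        return Duration::ZERO;
    }
    let seconds = 2.0_f64
        .powf(f64::from(current_attempt) - 1.0)
        .min(max_seconds);
    Duration::from_secs_f64(seconds)
}

/// Synchronous stand-in for `validate_task_retries` (hypothetical return shape):
/// breaks once the retry budget is exhausted, otherwise reports how long to wait.
fn validate_retries(current_attempt: u32, max_sync_errors: u32) -> ControlFlow<(), Duration> {
    if current_attempt >= max_sync_errors {
        return ControlFlow::Break(());
    }
    ControlFlow::Continue(backoff_delay(current_attempt, 30.0))
}
```

A caller would match on the result, sleep for the returned duration on `Continue`, and record the failure on `Break`, as the upload, download and delete branches do in the hunks above.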
--- a/pageserver/src/storage_sync/delete.rs
+++ b/pageserver/src/storage_sync/delete.rs
@@ -95,6 +95,8 @@ where
        debug!("Reenqueuing failed delete task for timeline {sync_id}");
        delete_data.retries += 1;
        sync_queue.push(sync_id, SyncTask::Delete(delete_data));
+    } else {
+        info!("Successfully deleted all layers");
    }
    errored
 }
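
Review note on delete scheduling: in the `storage_sync.rs` hunks, a delete task runs only when the upload in the same batch either succeeded or was absent, and is pushed back onto the queue when the upload failed. The requeue no longer bumps `retries`, since the `delete_data.retries += 1` line was removed. A minimal sketch of that decision follows. `DeleteAction` and `decide_delete` are hypothetical names, and the failure payload is a plain `String` here, whereas the diff carries an `anyhow::Error`.

```rust
/// Stand-in for the `UploadStatus` enum used in the diff; the failure payload
/// is a `String` here instead of `anyhow::Error` (assumption for illustration).
#[derive(Debug, PartialEq)]
enum UploadStatus {
    Uploaded,
    Failed(String),
    Nothing,
}

/// Hypothetical result type describing what to do with a pending delete task.
#[derive(Debug, PartialEq)]
enum DeleteAction {
    /// Validate retries and run the delete.
    Proceed,
    /// Put the delete task back on the queue without touching its retry count.
    Requeue { reason: String },
}

/// Deletes are only safe once the upload of the same batch did not fail.
fn decide_delete(upload_status: UploadStatus) -> DeleteAction {
    match upload_status {
        UploadStatus::Uploaded | UploadStatus::Nothing => DeleteAction::Proceed,
        UploadStatus::Failed(e) => DeleteAction::Requeue {
            reason: format!("upload failed: {e}"),
        },
    }
}
```

Matching on an enum instead of the previous `Option<()>` makes the "no upload scheduled" case explicit, so it is no longer treated the same as a failed upload.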
--- a/pageserver/src/storage_sync/download.rs
+++ b/pageserver/src/storage_sync/download.rs
@@ -202,8 +202,6 @@ where
        })
        .map_err(DownloadError::BadInput)?;

-    warn!("part_storage_path {:?}", part_storage_path);
-
    let mut index_part_download = storage.download(&part_storage_path).await?;

    let mut index_part_bytes = Vec::new();
--- a/pageserver/src/storage_sync/upload.rs
+++ b/pageserver/src/storage_sync/upload.rs
@@ -4,7 +4,7 @@ use std::{fmt::Debug, path::PathBuf};

 use anyhow::Context;
 use futures::stream::{FuturesUnordered, StreamExt};
-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;
 use remote_storage::RemoteStorage;
 use tokio::fs;
 use tracing::{debug, error, info, warn};
@@ -20,14 +20,14 @@ use crate::{
 };
 use metrics::{register_int_counter_vec, IntCounterVec};

-lazy_static! {
-    static ref NO_LAYERS_UPLOAD: IntCounterVec = register_int_counter_vec!(
+static NO_LAYERS_UPLOAD: Lazy<IntCounterVec> = Lazy::new(|| {
+    register_int_counter_vec!(
        "pageserver_remote_storage_no_layers_uploads_total",
        "Number of skipped uploads due to no layers",
        &["tenant_id", "timeline_id"],
    )
-    .expect("failed to register pageserver no layers upload vec");
-}
+    .expect("failed to register pageserver no layers upload vec")
+});

 /// Serializes and uploads the given index part data to the remote storage.
 pub(super) async fn upload_index_part<P, S>(
@@ -75,7 +75,7 @@ where
 #[derive(Debug)]
 pub(super) enum UploadedTimeline {
    /// Upload failed due to some error, the upload task is rescheduled for another retry.
-    FailedAndRescheduled,
+    FailedAndRescheduled(anyhow::Error),
    /// No issues happened during the upload, all task files were put into the remote storage.
    Successful(SyncData<LayersUpload>),
 }
@@ -179,7 +179,7 @@ where
        })
        .collect::<FuturesUnordered<_>>();

-    let mut errors_happened = false;
+    let mut errors = Vec::new();
    while let Some(upload_result) = upload_tasks.next().await {
        match upload_result {
            Ok(uploaded_path) => {
@@ -188,13 +188,13 @@ where
            }
            Err(e) => match e {
                UploadError::Other(e) => {
-                    errors_happened = true;
                    error!("Failed to upload a layer for timeline {sync_id}: {e:?}");
+                    errors.push(format!("{e:#}"));
                }
                UploadError::MissingLocalFile(source_path, e) => {
                    if source_path.exists() {
-                        errors_happened = true;
                        error!("Failed to upload a layer for timeline {sync_id}: {e:?}");
+                        errors.push(format!("{e:#}"));
                    } else {
                        // We have run the upload sync task, but the file we wanted to upload is gone.
                        // This is "fine" due the asynchronous nature of the sync loop: it only reacts to events and might need to
@@ -217,14 +217,17 @@ where
        }
    }

-    if errors_happened {
+    if errors.is_empty() {
+        info!("Successfully uploaded all layers");
+        UploadedTimeline::Successful(upload_data)
+    } else {
        debug!("Reenqueuing failed upload task for timeline {sync_id}");
        upload_data.retries += 1;
        sync_queue.push(sync_id, SyncTask::Upload(upload_data));
-        UploadedTimeline::FailedAndRescheduled
-    } else {
-        info!("Successfully uploaded all layers");
-        UploadedTimeline::Successful(upload_data)
+        UploadedTimeline::FailedAndRescheduled(anyhow::anyhow!(
+            "Errors appeared during layer uploads: {:?}",
+            errors
+        ))
    }
 }

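
Review note on the `errors` vector: the upload loop now collects a formatted message per failed layer instead of flipping a boolean, and folds them into one error for `FailedAndRescheduled`. The same pattern can be sketched with std only. `Outcome` and `summarize` are hypothetical names, and the error is a `String` here where the diff builds an `anyhow::Error`.

```rust
/// Hypothetical stand-in for `UploadedTimeline`; the error is a plain `String`
/// here, whereas the diff carries an `anyhow::Error`.
#[derive(Debug, PartialEq)]
enum Outcome {
    Successful,
    FailedAndRescheduled(String),
}

/// Collects every per-layer failure and reports them together, so the caller
/// sees all causes instead of only the fact that something went wrong.
fn summarize(results: Vec<Result<(), String>>) -> Outcome {
    let errors: Vec<String> = results.into_iter().filter_map(Result::err).collect();
    if errors.is_empty() {
        Outcome::Successful
    } else {
        Outcome::FailedAndRescheduled(format!(
            "Errors appeared during layer uploads: {errors:?}"
        ))
    }
}
```

Keeping the individual messages is what lets the later `warn!` in `storage_sync.rs` print the upload failure alongside the skipped delete task.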
--- a/pageserver/src/tenant_config.rs
+++ b/pageserver/src/tenant_config.rs
@@ -23,6 +23,7 @@ pub mod defaults {
    // which is good for now to trigger bugs.
    // This parameter actually determines L0 layer file size.
    pub const DEFAULT_CHECKPOINT_DISTANCE: u64 = 256 * 1024 * 1024;
+    pub const DEFAULT_CHECKPOINT_TIMEOUT: &str = "10 m";

    // Target file size, when creating image and delta layers.
    // This parameter determines L1 layer file size.
@@ -36,7 +37,7 @@ pub mod defaults {
    pub const DEFAULT_IMAGE_CREATION_THRESHOLD: usize = 3;
    pub const DEFAULT_PITR_INTERVAL: &str = "30 days";
    pub const DEFAULT_WALRECEIVER_CONNECT_TIMEOUT: &str = "2 seconds";
-    pub const DEFAULT_WALRECEIVER_LAGGING_WAL_TIMEOUT: &str = "10 seconds";
+    pub const DEFAULT_WALRECEIVER_LAGGING_WAL_TIMEOUT: &str = "3 seconds";
    pub const DEFAULT_MAX_WALRECEIVER_LSN_WAL_LAG: u64 = 10 * 1024 * 1024;
 }

@@ -48,6 +49,9 @@ pub struct TenantConf {
    // page server crashes.
    // This parameter actually determines L0 layer file size.
    pub checkpoint_distance: u64,
+    // The in-memory layer is also flushed at least once every checkpoint_timeout,
+    // so that WAL is eventually uploaded after activity stops.
+    pub checkpoint_timeout: Duration,
    // Target file size, when creating image and delta layers.
    // This parameter determines L1 layer file size.
    pub compaction_target_size: u64,
@@ -90,6 +94,7 @@ pub struct TenantConf {
 #[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, Default)]
 pub struct TenantConfOpt {
    pub checkpoint_distance: Option<u64>,
+    pub checkpoint_timeout: Option<Duration>,
    pub compaction_target_size: Option<u64>,
    #[serde(with = "humantime_serde")]
    pub compaction_period: Option<Duration>,
@@ -113,6 +118,9 @@ impl TenantConfOpt {
            checkpoint_distance: self
                .checkpoint_distance
                .unwrap_or(global_conf.checkpoint_distance),
+            checkpoint_timeout: self
+                .checkpoint_timeout
+                .unwrap_or(global_conf.checkpoint_timeout),
            compaction_target_size: self
                .compaction_target_size
                .unwrap_or(global_conf.compaction_target_size),
@@ -142,6 +150,9 @@ impl TenantConfOpt {
        if let Some(checkpoint_distance) = other.checkpoint_distance {
            self.checkpoint_distance = Some(checkpoint_distance);
        }
+        if let Some(checkpoint_timeout) = other.checkpoint_timeout {
+            self.checkpoint_timeout = Some(checkpoint_timeout);
+        }
        if let Some(compaction_target_size) = other.compaction_target_size {
            self.compaction_target_size = Some(compaction_target_size);
        }
@@ -181,6 +192,8 @@ impl TenantConf {

        TenantConf {
            checkpoint_distance: DEFAULT_CHECKPOINT_DISTANCE,
+            checkpoint_timeout: humantime::parse_duration(DEFAULT_CHECKPOINT_TIMEOUT)
+                .expect("cannot parse default checkpoint timeout"),
            compaction_target_size: DEFAULT_COMPACTION_TARGET_SIZE,
            compaction_period: humantime::parse_duration(DEFAULT_COMPACTION_PERIOD)
                .expect("cannot parse default compaction period"),
@@ -212,6 +225,7 @@ impl TenantConf {
    pub fn dummy_conf() -> Self {
        TenantConf {
            checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
+            checkpoint_timeout: Duration::from_secs(600),
            compaction_target_size: 4 * 1024 * 1024,
            compaction_period: Duration::from_secs(10),
            compaction_threshold: defaults::DEFAULT_COMPACTION_THRESHOLD,
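
Review note on `checkpoint_timeout` plumbing: the new field follows the same override pattern as the other tenant settings, where an optional per-tenant value falls back to the global one and `update` only overwrites fields that are set. A reduced, std-only sketch of that pattern is below. `Conf` and `ConfOpt` are hypothetical two-field stand-ins for `TenantConf` and `TenantConfOpt`, and the method names `merge` and `update` are assumptions, since the hunks above do not show the signatures.

```rust
use std::time::Duration;

/// Reduced stand-in for `TenantConf` (only two of its fields).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Conf {
    checkpoint_distance: u64,
    checkpoint_timeout: Duration,
}

/// Reduced stand-in for `TenantConfOpt`: every field is optional.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct ConfOpt {
    checkpoint_distance: Option<u64>,
    checkpoint_timeout: Option<Duration>,
}

impl ConfOpt {
    /// Resolves the effective config: per-tenant value if set, else the global one.
    fn merge(self, global: Conf) -> Conf {
        Conf {
            checkpoint_distance: self
                .checkpoint_distance
                .unwrap_or(global.checkpoint_distance),
            checkpoint_timeout: self
                .checkpoint_timeout
                .unwrap_or(global.checkpoint_timeout),
        }
    }

    /// Applies only the fields that `other` actually sets.
    fn update(&mut self, other: &ConfOpt) {
        if let Some(checkpoint_distance) = other.checkpoint_distance {
            self.checkpoint_distance = Some(checkpoint_distance);
        }
        if let Some(checkpoint_timeout) = other.checkpoint_timeout {
            self.checkpoint_timeout = Some(checkpoint_timeout);
        }
    }
}
```

With a global timeout of 600 seconds, a tenant that sets only `checkpoint_timeout` to 60 seconds keeps the global `checkpoint_distance` and gets the shorter timeout.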
--- a/pageserver/src/tenant_mgr.rs
+++ b/pageserver/src/tenant_mgr.rs
@@ -27,23 +27,25 @@ use utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId};

 mod tenants_state {
    use anyhow::ensure;
+    use once_cell::sync::Lazy;
    use std::{
        collections::HashMap,
        sync::{RwLock, RwLockReadGuard, RwLockWriteGuard},
    };
    use tokio::sync::mpsc;
    use tracing::{debug, error};
-
    use utils::zid::ZTenantId;

    use crate::tenant_mgr::{LocalTimelineUpdate, Tenant};

-    lazy_static::lazy_static! {
-        static ref TENANTS: RwLock<HashMap<ZTenantId, Tenant>> = RwLock::new(HashMap::new());
-        /// Sends updates to the local timelines (creation and deletion) to the WAL receiver,
-        /// so that it can enable/disable corresponding processes.
-        static ref TIMELINE_UPDATE_SENDER: RwLock<Option<mpsc::UnboundedSender<LocalTimelineUpdate>>> = RwLock::new(None);
-    }
+    static TENANTS: Lazy<RwLock<HashMap<ZTenantId, Tenant>>> =
+        Lazy::new(|| RwLock::new(HashMap::new()));
+
+    /// Sends updates to the local timelines (creation and deletion) to the WAL receiver,
+    /// so that it can enable/disable corresponding processes.
+    static TIMELINE_UPDATE_SENDER: Lazy<
+        RwLock<Option<mpsc::UnboundedSender<LocalTimelineUpdate>>>,
+    > = Lazy::new(|| RwLock::new(None));

    pub(super) fn read_tenants() -> RwLockReadGuard<'static, HashMap<ZTenantId, Tenant>> {
        TENANTS
--- a/pageserver/src/thread_mgr.rs
+++ b/pageserver/src/thread_mgr.rs
@@ -45,21 +45,20 @@ use tokio::sync::watch;

 use tracing::{debug, error, info, warn};

-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;

 use utils::zid::{ZTenantId, ZTimelineId};

 use crate::shutdown_pageserver;

-lazy_static! {
-    /// Each thread that we track is associated with a "thread ID". It's just
-    /// an increasing number that we assign, not related to any system thread
-    /// id.
-    static ref NEXT_THREAD_ID: AtomicU64 = AtomicU64::new(1);
+/// Each thread that we track is associated with a "thread ID". It's just
+/// an increasing number that we assign, not related to any system thread
+/// id.
+static NEXT_THREAD_ID: Lazy<AtomicU64> = Lazy::new(|| AtomicU64::new(1));

-    /// Global registry of threads
-    static ref THREADS: Mutex<HashMap<u64, Arc<PageServerThread>>> = Mutex::new(HashMap::new());
-}
+/// Global registry of threads
+static THREADS: Lazy<Mutex<HashMap<u64, Arc<PageServerThread>>>> =
+    Lazy::new(|| Mutex::new(HashMap::new()));

 // There is a Tokio watch channel for each thread, which can be used to signal the
 // thread that it needs to shut down. This thread local variable holds the receiving
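
Review note on the `lazy_static!` to `Lazy` migration: each global becomes an ordinary `static` whose value is built on first access. For readers without the `once_cell` crate at hand, the same idea can be sketched with the standard library's `OnceLock` (a swapped-in technique, not what this diff uses). `threads` and `register_thread` are hypothetical helpers, and the registry stores a `String` name instead of `Arc<PageServerThread>`. In this sketch the counter is a plain `static` because `AtomicU64::new` is a `const fn`; the diff itself wraps it in `Lazy`.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Mutex, OnceLock};

/// Monotonically increasing id, unrelated to any OS thread id.
static NEXT_THREAD_ID: AtomicU64 = AtomicU64::new(1);

/// Global registry, created on first use. `HashMap::new` is not `const`,
/// which is why some form of lazy initialization is needed here.
fn threads() -> &'static Mutex<HashMap<u64, String>> {
    static THREADS: OnceLock<Mutex<HashMap<u64, String>>> = OnceLock::new();
    THREADS.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Hypothetical helper: assigns the next id and records the thread's name.
fn register_thread(name: &str) -> u64 {
    let id = NEXT_THREAD_ID.fetch_add(1, Ordering::Relaxed);
    threads().lock().unwrap().insert(id, name.to_string());
    id
}
```

Either way, call sites keep reading the global by name, which is why the migration in this PR touches declarations only.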
--- a/pageserver/src/timelines.rs
+++ b/pageserver/src/timelines.rs
@@ -232,7 +232,7 @@ pub(crate) fn create_timeline(
        return Ok(None);
    }

-    let _new_timeline = match ancestor_timeline_id {
+    match ancestor_timeline_id {
        Some(ancestor_timeline_id) => {
            let ancestor_timeline = repo
                .get_timeline_load(ancestor_timeline_id)
--- a/pageserver/src/virtual_file.rs
+++ b/pageserver/src/virtual_file.rs
@@ -10,7 +10,7 @@
 //! This is similar to PostgreSQL's virtual file descriptor facility in
 //! src/backend/storage/file/fd.c
 //!
-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;
 use once_cell::sync::OnceCell;
 use std::fs::{File, OpenOptions};
 use std::io::{Error, ErrorKind, Read, Seek, SeekFrom, Write};
@@ -32,23 +32,24 @@ const STORAGE_IO_TIME_BUCKETS: &[f64] = &[
    1.0,      // 1 sec
 ];

-lazy_static! {
-    static ref STORAGE_IO_TIME: HistogramVec = register_histogram_vec!(
+static STORAGE_IO_TIME: Lazy<HistogramVec> = Lazy::new(|| {
+    register_histogram_vec!(
        "pageserver_io_operations_seconds",
        "Time spent in IO operations",
        &["operation", "tenant_id", "timeline_id"],
        STORAGE_IO_TIME_BUCKETS.into()
    )
-    .expect("failed to define a metric");
-}
-lazy_static! {
-    static ref STORAGE_IO_SIZE: IntGaugeVec = register_int_gauge_vec!(
+    .expect("failed to define a metric")
+});
+
+static STORAGE_IO_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
+    register_int_gauge_vec!(
        "pageserver_io_operations_bytes_total",
        "Total amount of bytes read/written in IO operations",
        &["operation", "tenant_id", "timeline_id"]
    )
-    .expect("failed to define a metric");
-}
+    .expect("failed to define a metric")
+});

 ///
 /// A virtual file descriptor. You can use this just like std::fs::File, but internally
--- a/pageserver/src/walingest.rs
+++ b/pageserver/src/walingest.rs
@@ -30,8 +30,6 @@ use anyhow::Result;
 use bytes::{Buf, Bytes, BytesMut};
 use tracing::*;

-use std::collections::HashMap;
-
 use crate::pgdatadir_mapping::*;
 use crate::reltag::{RelTag, SlruKind};
 use crate::walrecord::*;
@@ -48,8 +46,6 @@ pub struct WalIngest<'a, T: DatadirTimeline> {

    checkpoint: CheckPoint,
    checkpoint_modified: bool,
-
-    relsize_cache: HashMap<RelTag, BlockNumber>,
 }

 impl<'a, T: DatadirTimeline> WalIngest<'a, T> {
@@ -64,13 +60,13 @@ impl<'a, T: DatadirTimeline> WalIngest<'a, T> {
            timeline,
            checkpoint,
            checkpoint_modified: false,
-            relsize_cache: HashMap::new(),
        })
    }

    ///
    /// Decode a PostgreSQL WAL record and store it in the repository, in the given timeline.
    ///
+    /// This function updates the `lsn` field of `DatadirModification`.
    ///
    /// Helper function to parse a WAL record and call the Timeline's PUT functions for all the
    /// relations/pages that the record affects.
@@ -82,6 +78,7 @@ impl<'a, T: DatadirTimeline> WalIngest<'a, T> {
        modification: &mut DatadirModification<T>,
        decoded: &mut DecodedWALRecord,
    ) -> Result<()> {
+        modification.lsn = lsn;
        decode_wal_record(recdata, decoded).context("failed decoding wal record")?;

        let mut buf = decoded.record.clone();
@@ -260,7 +257,7 @@ impl<'a, T: DatadirTimeline> WalIngest<'a, T> {

        // Now that this record has been fully handled, including updating the
        // checkpoint data, let the repository know that it is up-to-date to this LSN
-        modification.commit(lsn)?;
+        modification.commit()?;

        Ok(())
    }
@@ -408,7 +405,7 @@ impl<'a, T: DatadirTimeline> WalIngest<'a, T> {
            // replaying it would fail to find the previous image of the page, because
            // it doesn't exist. So check if the VM page(s) exist, and skip the WAL
            // record if it doesn't.
-            let vm_size = self.get_relsize(vm_rel)?;
+            let vm_size = self.get_relsize(vm_rel, modification.lsn)?;
            if let Some(blknum) = new_vm_blk {
                if blknum >= vm_size {
                    new_vm_blk = None;
@@ -880,7 +877,6 @@ impl<'a, T: DatadirTimeline> WalIngest<'a, T> {
        modification: &mut DatadirModification<T>,
        rel: RelTag,
    ) -> Result<()> {
-        self.relsize_cache.insert(rel, 0);
        modification.put_rel_creation(rel, 0)?;
        Ok(())
    }
@@ -916,7 +912,6 @@ impl<'a, T: DatadirTimeline> WalIngest<'a, T> {
        nblocks: BlockNumber,
    ) -> Result<()> {
        modification.put_rel_truncation(rel, nblocks)?;
-        self.relsize_cache.insert(rel, nblocks);
        Ok(())
    }

@@ -926,23 +921,16 @@ impl<'a, T: DatadirTimeline> WalIngest<'a, T> {
        rel: RelTag,
    ) -> Result<()> {
        modification.put_rel_drop(rel)?;
-        self.relsize_cache.remove(&rel);
        Ok(())
    }

-    fn get_relsize(&mut self, rel: RelTag) -> Result<BlockNumber> {
-        if let Some(nblocks) = self.relsize_cache.get(&rel) {
-            Ok(*nblocks)
+    fn get_relsize(&mut self, rel: RelTag, lsn: Lsn) -> Result<BlockNumber> {
+        let nblocks = if !self.timeline.get_rel_exists(rel, lsn)? {
+            0
        } else {
-            let last_lsn = self.timeline.get_last_record_lsn();
-            let nblocks = if !self.timeline.get_rel_exists(rel, last_lsn)? {
-                0
-            } else {
-                self.timeline.get_rel_size(rel, last_lsn)?
-            };
-            self.relsize_cache.insert(rel, nblocks);
-            Ok(nblocks)
-        }
+            self.timeline.get_rel_size(rel, lsn)?
+        };
+        Ok(nblocks)
    }

    fn handle_rel_extend(
@@ -952,22 +940,16 @@ impl<'a, T: DatadirTimeline> WalIngest<'a, T> {
        blknum: BlockNumber,
    ) -> Result<()> {
        let new_nblocks = blknum + 1;
-        let old_nblocks = if let Some(nblocks) = self.relsize_cache.get(&rel) {
-            *nblocks
+        // Check if the relation exists. We implicitly create relations on first
+        // record.
+        // TODO: it would be nice to be more explicit about it
+        let last_lsn = modification.lsn;
+        let old_nblocks = if !self.timeline.get_rel_exists(rel, last_lsn)? {
+            // create it with 0 size initially, the logic below will extend it
+            modification.put_rel_creation(rel, 0)?;
+            0
        } else {
-            // Check if the relation exists. We implicitly create relations on first
-            // record.
-            // TODO: would be nice if to be more explicit about it
-            let last_lsn = self.timeline.get_last_record_lsn();
-            let nblocks = if !self.timeline.get_rel_exists(rel, last_lsn)? {
-                // create it with 0 size initially, the logic below will extend it
-                modification.put_rel_creation(rel, 0)?;
-                0
-            } else {
-                self.timeline.get_rel_size(rel, last_lsn)?
-            };
-            self.relsize_cache.insert(rel, nblocks);
-            nblocks
+            self.timeline.get_rel_size(rel, last_lsn)?
        };

        if new_nblocks > old_nblocks {
@@ -978,7 +960,6 @@ impl<'a, T: DatadirTimeline> WalIngest<'a, T> {
            for gap_blknum in old_nblocks..blknum {
                modification.put_rel_page_image(rel, gap_blknum, ZERO_PAGE.clone())?;
            }
-            self.relsize_cache.insert(rel, new_nblocks);
        }
        Ok(())
    }
@@ -1069,10 +1050,10 @@ mod tests {
    static ZERO_CHECKPOINT: Bytes = Bytes::from_static(&[0u8; SIZEOF_CHECKPOINT]);

    fn init_walingest_test<T: DatadirTimeline>(tline: &T) -> Result<WalIngest<T>> {
-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(0x10));
        m.put_checkpoint(ZERO_CHECKPOINT.clone())?;
        m.put_relmap_file(0, 111, Bytes::from(""))?; // dummy relmapper file
-        m.commit(Lsn(0x10))?;
+        m.commit()?;
        let walingest = WalIngest::new(tline, Lsn(0x10))?;

        Ok(walingest)
@@ -1084,19 +1065,19 @@ mod tests {
        let tline = create_test_timeline(repo, TIMELINE_ID)?;
        let mut walingest = init_walingest_test(&*tline)?;

-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(0x20));
        walingest.put_rel_creation(&mut m, TESTREL_A)?;
        walingest.put_rel_page_image(&mut m, TESTREL_A, 0, TEST_IMG("foo blk 0 at 2"))?;
-        m.commit(Lsn(0x20))?;
-        let mut m = tline.begin_modification();
+        m.commit()?;
+        let mut m = tline.begin_modification(Lsn(0x30));
        walingest.put_rel_page_image(&mut m, TESTREL_A, 0, TEST_IMG("foo blk 0 at 3"))?;
-        m.commit(Lsn(0x30))?;
-        let mut m = tline.begin_modification();
+        m.commit()?;
+        let mut m = tline.begin_modification(Lsn(0x40));
        walingest.put_rel_page_image(&mut m, TESTREL_A, 1, TEST_IMG("foo blk 1 at 4"))?;
-        m.commit(Lsn(0x40))?;
-        let mut m = tline.begin_modification();
+        m.commit()?;
+        let mut m = tline.begin_modification(Lsn(0x50));
        walingest.put_rel_page_image(&mut m, TESTREL_A, 2, TEST_IMG("foo blk 2 at 5"))?;
-        m.commit(Lsn(0x50))?;
+        m.commit()?;

        assert_current_logical_size(&*tline, Lsn(0x50));

@@ -1142,9 +1123,9 @@ mod tests {
        );

        // Truncate last block
-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(0x60));
        walingest.put_rel_truncation(&mut m, TESTREL_A, 2)?;
-        m.commit(Lsn(0x60))?;
+        m.commit()?;
        assert_current_logical_size(&*tline, Lsn(0x60));

        // Check reported size and contents after truncation
@@ -1166,15 +1147,15 @@ mod tests {
        );

        // Truncate to zero length
-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(0x68));
        walingest.put_rel_truncation(&mut m, TESTREL_A, 0)?;
-        m.commit(Lsn(0x68))?;
+        m.commit()?;
        assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x68))?, 0);

        // Extend from 0 to 2 blocks, leaving a gap
-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(0x70));
        walingest.put_rel_page_image(&mut m, TESTREL_A, 1, TEST_IMG("foo blk 1"))?;
-        m.commit(Lsn(0x70))?;
+        m.commit()?;
        assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x70))?, 2);
        assert_eq!(
            tline.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x70))?,
@@ -1186,9 +1167,9 @@ mod tests {
        );

        // Extend a lot more, leaving a big gap that spans across segments
-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(0x80));
        walingest.put_rel_page_image(&mut m, TESTREL_A, 1500, TEST_IMG("foo blk 1500"))?;
-        m.commit(Lsn(0x80))?;
+        m.commit()?;
        assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x80))?, 1501);
        for blk in 2..1500 {
            assert_eq!(
@@ -1212,18 +1193,18 @@ mod tests {
        let tline = create_test_timeline(repo, TIMELINE_ID)?;
        let mut walingest = init_walingest_test(&*tline)?;

-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(0x20));
        walingest.put_rel_page_image(&mut m, TESTREL_A, 0, TEST_IMG("foo blk 0 at 2"))?;
-        m.commit(Lsn(0x20))?;
+        m.commit()?;

        // Check that rel exists and size is correct
        assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x20))?, true);
        assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x20))?, 1);

        // Drop rel
-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(0x30));
        walingest.put_rel_drop(&mut m, TESTREL_A)?;
-        m.commit(Lsn(0x30))?;
+        m.commit()?;

        // Check that rel is not visible anymore
        assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x30))?, false);
@@ -1232,9 +1213,9 @@ mod tests {
        //assert!(tline.get_rel_size(TESTREL_A, Lsn(0x30))?.is_none());

        // Re-create it
-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(0x40));
        walingest.put_rel_page_image(&mut m, TESTREL_A, 0, TEST_IMG("foo blk 0 at 4"))?;
-        m.commit(Lsn(0x40))?;
+        m.commit()?;

        // Check that rel exists and size is correct
        assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x40))?, true);
@@ -1254,12 +1235,12 @@ mod tests {

        // Create a 20 MB relation (the size is arbitrary)
        let relsize = 20 * 1024 * 1024 / 8192;
-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(0x20));
        for blkno in 0..relsize {
            let data = format!("foo blk {} at {}", blkno, Lsn(0x20));
            walingest.put_rel_page_image(&mut m, TESTREL_A, blkno, TEST_IMG(&data))?;
        }
-        m.commit(Lsn(0x20))?;
+        m.commit()?;

        // The relation was created at LSN 20, not visible at LSN 1 yet.
        assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x10))?, false);
@@ -1280,9 +1261,9 @@ mod tests {

        // Truncate relation so that second segment was dropped
        // - only leave one page
-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(0x60));
        walingest.put_rel_truncation(&mut m, TESTREL_A, 1)?;
-        m.commit(Lsn(0x60))?;
+        m.commit()?;

        // Check reported size and contents after truncation
        assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x60))?, 1);
@@ -1310,12 +1291,12 @@ mod tests {
        // Extend relation again.
        // Add enough blocks to create second segment
        let lsn = Lsn(0x80);
-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(lsn);
        for blkno in 0..relsize {
            let data = format!("foo blk {} at {}", blkno, lsn);
            walingest.put_rel_page_image(&mut m, TESTREL_A, blkno, TEST_IMG(&data))?;
        }
-        m.commit(lsn)?;
+        m.commit()?;

        assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x80))?, true);
        assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x80))?, relsize);
@@ -1343,10 +1324,10 @@ mod tests {
        let mut lsn = 0x10;
        for blknum in 0..pg_constants::RELSEG_SIZE + 1 {
            lsn += 0x10;
-            let mut m = tline.begin_modification();
+            let mut m = tline.begin_modification(Lsn(lsn));
            let img = TEST_IMG(&format!("foo blk {} at {}", blknum, Lsn(lsn)));
            walingest.put_rel_page_image(&mut m, TESTREL_A, blknum as BlockNumber, img)?;
-            m.commit(Lsn(lsn))?;
+            m.commit()?;
        }

        assert_current_logical_size(&*tline, Lsn(lsn));
@@ -1358,9 +1339,9 @@ mod tests {

        // Truncate one block
        lsn += 0x10;
-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(lsn));
        walingest.put_rel_truncation(&mut m, TESTREL_A, pg_constants::RELSEG_SIZE)?;
-        m.commit(Lsn(lsn))?;
+        m.commit()?;
        assert_eq!(
            tline.get_rel_size(TESTREL_A, Lsn(lsn))?,
            pg_constants::RELSEG_SIZE
@@ -1369,9 +1350,9 @@ mod tests {

        // Truncate another block
        lsn += 0x10;
-        let mut m = tline.begin_modification();
+        let mut m = tline.begin_modification(Lsn(lsn));
        walingest.put_rel_truncation(&mut m, TESTREL_A, pg_constants::RELSEG_SIZE - 1)?;
-        m.commit(Lsn(lsn))?;
+        m.commit()?;
        assert_eq!(
            tline.get_rel_size(TESTREL_A, Lsn(lsn))?,
            pg_constants::RELSEG_SIZE - 1
@@ -1383,9 +1364,9 @@ mod tests {
        let mut size: i32 = 3000;
        while size >= 0 {
            lsn += 0x10;
-            let mut m = tline.begin_modification();
+            let mut m = tline.begin_modification(Lsn(lsn));
            walingest.put_rel_truncation(&mut m, TESTREL_A, size as BlockNumber)?;
-            m.commit(Lsn(lsn))?;
+            m.commit()?;
            assert_eq!(
                tline.get_rel_size(TESTREL_A, Lsn(lsn))?,
                size as BlockNumber
--- a/pageserver/src/walreceiver.rs
+++ b/pageserver/src/walreceiver.rs
@@ -66,7 +66,7 @@ pub fn init_wal_receiver_main_thread(
    );
    let broker_prefix = &conf.broker_etcd_prefix;
    info!(
-        "Starting wal receiver main thread, etdc endpoints: {}",
+        "Starting wal receiver main thread, etcd endpoints: {}",
        etcd_endpoints.iter().map(Url::to_string).join(", ")
    );

--- a/pageserver/src/walreceiver/connection_manager.rs
+++ b/pageserver/src/walreceiver/connection_manager.rs
@@ -17,7 +17,7 @@ use std::{
 };

 use anyhow::Context;
-use chrono::{DateTime, Local, NaiveDateTime, Utc};
+use chrono::{NaiveDateTime, Utc};
 use etcd_broker::{
    subscription_key::SubscriptionKey, subscription_value::SkTimelineInfo, BrokerSubscription,
    BrokerUpdate, Client,
@@ -25,15 +25,18 @@ use etcd_broker::{
 use tokio::select;
 use tracing::*;

-use crate::repository::{Repository, Timeline};
+use crate::{
+    exponential_backoff,
+    repository::{Repository, Timeline},
+    DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS,
+};
 use crate::{RepositoryImpl, TimelineImpl};
 use utils::{
    lsn::Lsn,
-    pq_proto::ReplicationFeedback,
    zid::{NodeId, ZTenantTimelineId},
 };

-use super::{TaskEvent, TaskHandle};
+use super::{walreceiver_connection::WalConnectionStatus, TaskEvent, TaskHandle};

 /// Spawns the loop to take care of the timeline's WAL streaming connection.
 pub(super) fn spawn_connection_manager_task(
@@ -110,21 +113,26 @@ async fn connection_manager_loop_step(
                }
            } => {
                let wal_connection = walreceiver_state.wal_connection.as_mut().expect("Should have a connection, as checked by the corresponding select! guard");
-                match &wal_connection_update {
+                match wal_connection_update {
                    TaskEvent::Started => {
-                        wal_connection.latest_connection_update = Utc::now().naive_utc();
                        *walreceiver_state.wal_connection_attempts.entry(wal_connection.sk_id).or_insert(0) += 1;
                    },
-                    TaskEvent::NewEvent(replication_feedback) => {
-                        wal_connection.latest_connection_update = DateTime::<Local>::from(replication_feedback.ps_replytime).naive_utc();
-                        // reset connection attempts here only, the only place where both nodes
-                        // explicitly confirmn with replication feedback that they are connected to each other
-                        walreceiver_state.wal_connection_attempts.remove(&wal_connection.sk_id);
+                    TaskEvent::NewEvent(status) => {
+                        if status.has_received_wal {
+                            // Reset connection attempts only here: we know that the safekeeper is healthy
+                            // because it can send us a WAL update.
+                            walreceiver_state.wal_connection_attempts.remove(&wal_connection.sk_id);
+                        }
+                        wal_connection.status = status;
                    },
                    TaskEvent::End(end_result) => {
                        match end_result {
                            Ok(()) => debug!("WAL receiving task finished"),
-                            Err(e) => warn!("WAL receiving task failed: {e}"),
+                            Err(e) => {
+                                warn!("WAL receiving task failed: {e}");
+                                // If the task failed, set the connection attempts to at least 1, to try other safekeepers.
+                                let _ = *walreceiver_state.wal_connection_attempts.entry(wal_connection.sk_id).or_insert(1);
+                            }
                        };
                        walreceiver_state.wal_connection = None;
                    },
@@ -230,18 +238,6 @@ async fn subscribe_for_timeline_updates(
    }
 }

-const DEFAULT_BASE_BACKOFF_SECONDS: f64 = 0.1;
-const DEFAULT_MAX_BACKOFF_SECONDS: f64 = 3.0;
-
-async fn exponential_backoff(n: u32, base: f64, max_seconds: f64) {
-    if n == 0 {
-        return;
-    }
-    let seconds_to_wait = base.powf(f64::from(n) - 1.0).min(max_seconds);
-    info!("Backoff: waiting {seconds_to_wait} seconds before proceeding with the task");
-    tokio::time::sleep(Duration::from_secs_f64(seconds_to_wait)).await;
-}
-
 /// All data that's needed to run endless broker loop and keep the WAL streaming connection alive, if possible.
 struct WalreceiverState {
    id: ZTenantTimelineId,
@@ -265,10 +261,21 @@ struct WalreceiverState {
 struct WalConnection {
    /// Current safekeeper pageserver is connected to for WAL streaming.
    sk_id: NodeId,
-    /// Connection task start time or the timestamp of a latest connection message received.
-    latest_connection_update: NaiveDateTime,
+    /// Status of the connection.
+    status: WalConnectionStatus,
    /// WAL streaming task handle.
-    connection_task: TaskHandle<ReplicationFeedback>,
+    connection_task: TaskHandle<WalConnectionStatus>,
+    /// Have we discovered that another safekeeper has more recent WAL than we do?
+    discovered_new_wal: Option<NewCommittedWAL>,
+}
+
+/// Notion of a new committed WAL, which exists on another safekeeper.
+#[derive(Debug, Clone, Copy)]
+struct NewCommittedWAL {
+    /// LSN of the new committed WAL.
+    lsn: Lsn,
+    /// When we discovered that the new committed WAL exists on another safekeeper.
+    discovered_at: NaiveDateTime,
 }

 /// Data about the timeline to connect to, received from etcd.
@@ -335,10 +342,19 @@ impl WalreceiverState {
            .instrument(info_span!("walreceiver_connection", id = %id))
        });

+        let now = Utc::now().naive_utc();
        self.wal_connection = Some(WalConnection {
            sk_id: new_sk_id,
-            latest_connection_update: Utc::now().naive_utc(),
+            status: WalConnectionStatus {
+                is_connected: false,
+                has_received_wal: false,
+                latest_connection_update: now,
+                latest_wal_update: now,
+                streaming_lsn: None,
+                commit_lsn: None,
+            },
            connection_task: connection_handle,
+            discovered_new_wal: None,
        });
    }

@@ -369,14 +385,16 @@ impl WalreceiverState {
    /// Cleans up stale etcd records and checks the rest for the new connection candidate.
    /// Returns a new candidate, if the current state is absent or somewhat lagging, `None` otherwise.
    /// The current rules for approving new candidates:
-    /// * pick from the input data from etcd for currently connected safekeeper (if any)
-    /// * out of the rest input entries, pick one with biggest `commit_lsn` that's after than pageserver's latest Lsn for the timeline
+    /// * pick a candidate different from the connected safekeeper with the biggest `commit_lsn` and the fewest failed connection attempts
    /// * if there's no such entry, no new candidate found, abort
-    /// * check the current connection time data for staleness, reconnect if stale
-    /// * otherwise, check if etcd updates contain currently connected safekeeper
-    ///     * if not, that means no WAL updates happened after certain time (either none since the connection time or none since the last event after the connection)
-    ///       Reconnect if the time exceeds the threshold.
-    ///     * if there's one, compare its Lsn with the other candidate's, reconnect if candidate's over threshold
+    /// * otherwise check if the candidate is much better than the current one
+    ///
+    /// For the exact rules that determine whether the candidate is better than the current one, refer to this function's implementation.
+    /// The general rules are as follows:
+    /// * if the connected safekeeper is not present, pick the candidate
+    /// * if we haven't received any updates for some time, pick the candidate
+    /// * if the candidate's commit_lsn is much higher than the current one, pick the candidate
+    /// * if the connected safekeeper stopped sending us new WAL which is available on another safekeeper, pick the candidate
    ///
    /// This way we ensure to keep up with the most up-to-date safekeeper and don't try to jump from one safekeeper to another too frequently.
    /// Both thresholds are configured per tenant.
@@ -392,53 +410,128 @@ impl WalreceiverState {

                let now = Utc::now().naive_utc();
                if let Ok(latest_interaciton) =
-                    (now - existing_wal_connection.latest_connection_update).to_std()
+                    (now - existing_wal_connection.status.latest_connection_update).to_std()
                {
-                    if latest_interaciton > self.lagging_wal_timeout {
+                    // Drop the connection if we haven't received a keepalive message for a while.
+                    if latest_interaciton > self.wal_connect_timeout {
                        return Some(NewWalConnectionCandidate {
                            safekeeper_id: new_sk_id,
                            wal_source_connstr: new_wal_source_connstr,
-                            reason: ReconnectReason::NoWalTimeout {
-                                last_wal_interaction: Some(
-                                    existing_wal_connection.latest_connection_update,
+                            reason: ReconnectReason::NoKeepAlives {
+                                last_keep_alive: Some(
+                                    existing_wal_connection.status.latest_connection_update,
                                ),
                                check_time: now,
-                                threshold: self.lagging_wal_timeout,
+                                threshold: self.wal_connect_timeout,
                            },
                        });
                    }
                }

-                match self.wal_stream_candidates.get(&connected_sk_node) {
-                    Some(current_connection_etcd_data) => {
-                        let new_lsn = new_safekeeper_etcd_data.commit_lsn.unwrap_or(Lsn(0));
-                        let current_lsn = current_connection_etcd_data
-                            .timeline
-                            .commit_lsn
-                            .unwrap_or(Lsn(0));
-                        match new_lsn.0.checked_sub(current_lsn.0)
-                            {
-                                Some(new_sk_lsn_advantage) => {
-                                    if new_sk_lsn_advantage >= self.max_lsn_wal_lag.get() {
-                                        return Some(
-                                            NewWalConnectionCandidate {
-                                                safekeeper_id: new_sk_id,
-                                                wal_source_connstr: new_wal_source_connstr,
-                                                reason: ReconnectReason::LaggingWal { current_lsn, new_lsn, threshold: self.max_lsn_wal_lag },
-                                            });
-                                    }
-                                }
-                                None => debug!("Best SK candidate has its commit Lsn behind the current timeline's latest consistent Lsn"),
+                if !existing_wal_connection.status.is_connected {
+                    // We haven't connected yet and we shouldn't switch until connection timeout (condition above).
+                    return None;
+                }
+
+                if let Some(current_commit_lsn) = existing_wal_connection.status.commit_lsn {
+                    let new_commit_lsn = new_safekeeper_etcd_data.commit_lsn.unwrap_or(Lsn(0));
+                    // Check if the new candidate has much more WAL than the current one.
+                    match new_commit_lsn.0.checked_sub(current_commit_lsn.0) {
+                        Some(new_sk_lsn_advantage) => {
+                            if new_sk_lsn_advantage >= self.max_lsn_wal_lag.get() {
+                                return Some(NewWalConnectionCandidate {
+                                    safekeeper_id: new_sk_id,
+                                    wal_source_connstr: new_wal_source_connstr,
+                                    reason: ReconnectReason::LaggingWal {
+                                        current_commit_lsn,
+                                        new_commit_lsn,
+                                        threshold: self.max_lsn_wal_lag,
+                                    },
+                                });
                            }
-                    }
-                    None => {
-                        return Some(NewWalConnectionCandidate {
-                            safekeeper_id: new_sk_id,
-                            wal_source_connstr: new_wal_source_connstr,
-                            reason: ReconnectReason::NoEtcdDataForExistingConnection,
-                        })
+                        }
+                        None => debug!(
+                            "Best SK candidate has its commit_lsn behind connected SK's commit_lsn"
+                        ),
                    }
                }
+
+                let current_lsn = match existing_wal_connection.status.streaming_lsn {
+                    Some(lsn) => lsn,
+                    None => self.local_timeline.get_last_record_lsn(),
+                };
+                let current_commit_lsn = existing_wal_connection
+                    .status
+                    .commit_lsn
+                    .unwrap_or(current_lsn);
+                let candidate_commit_lsn = new_safekeeper_etcd_data.commit_lsn.unwrap_or(Lsn(0));
+
+                // Keep discovered_new_wal only if connected safekeeper has not caught up yet.
+                let mut discovered_new_wal = existing_wal_connection
+                    .discovered_new_wal
+                    .filter(|new_wal| new_wal.lsn > current_commit_lsn);
+
+                if discovered_new_wal.is_none() {
+                    // Check if the new candidate has more WAL than the current one.
+                    // If the new candidate has more WAL than the current one, we consider switching to the new candidate.
+                    discovered_new_wal = if candidate_commit_lsn > current_commit_lsn {
+                        trace!(
+                            "New candidate has commit_lsn {}, higher than current_commit_lsn {}",
+                            candidate_commit_lsn,
+                            current_commit_lsn
+                        );
+                        Some(NewCommittedWAL {
+                            lsn: candidate_commit_lsn,
+                            discovered_at: Utc::now().naive_utc(),
+                        })
+                    } else {
+                        None
+                    };
+                }
+
+                let waiting_for_new_lsn_since = if current_lsn < current_commit_lsn {
+                    // Connected safekeeper has more WAL, but we haven't received updates for some time.
+                    trace!(
+                        "Connected safekeeper has more WAL, but we haven't received updates for {:?}. current_lsn: {}, current_commit_lsn: {}",
+                        (now - existing_wal_connection.status.latest_wal_update).to_std(),
+                        current_lsn,
+                        current_commit_lsn
+                    );
+                    Some(existing_wal_connection.status.latest_wal_update)
+                } else {
+                    discovered_new_wal.as_ref().map(|new_wal| {
+                        // We know that new WAL is available on another safekeeper, but the connected safekeeper doesn't have it.
+                        new_wal
+                            .discovered_at
+                            .max(existing_wal_connection.status.latest_wal_update)
+                    })
+                };
+
+                // If we haven't received any WAL updates for a while and candidate has more WAL, switch to it.
+                if let Some(waiting_for_new_lsn_since) = waiting_for_new_lsn_since {
+                    if let Ok(waiting_for_new_wal) = (now - waiting_for_new_lsn_since).to_std() {
+                        if candidate_commit_lsn > current_commit_lsn
+                            && waiting_for_new_wal > self.lagging_wal_timeout
+                        {
+                            return Some(NewWalConnectionCandidate {
+                                safekeeper_id: new_sk_id,
+                                wal_source_connstr: new_wal_source_connstr,
+                                reason: ReconnectReason::NoWalTimeout {
+                                    current_lsn,
+                                    current_commit_lsn,
+                                    candidate_commit_lsn,
+                                    last_wal_interaction: Some(
+                                        existing_wal_connection.status.latest_wal_update,
+                                    ),
+                                    check_time: now,
+                                    threshold: self.lagging_wal_timeout,
+                                },
+                            });
+                        }
+                    }
+                }
+
+                self.wal_connection.as_mut().unwrap().discovered_new_wal = discovered_new_wal;
            }
            None => {
                let (new_sk_id, _, new_wal_source_connstr) =
@@ -458,7 +551,7 @@ impl WalreceiverState {
    /// Optionally, omits the given node, to support gracefully switching from a healthy safekeeper to another.
    ///
    /// The candidate that is chosen:
-    /// * has fewest connection attempts from pageserver to safekeeper node (reset every time the WAL replication feedback is sent)
+    /// * has fewest connection attempts from pageserver to safekeeper node (reset every time we receive a WAL message from the node)
    /// * has greatest data Lsn among the ones that are left
    ///
    /// NOTE:
@@ -497,14 +590,13 @@ impl WalreceiverState {
            .max_by_key(|(_, info, _)| info.commit_lsn)
    }

+    /// Returns a list of safekeepers that have valid info and are ready for connection.
    fn applicable_connection_candidates(
        &self,
    ) -> impl Iterator<Item = (NodeId, &SkTimelineInfo, String)> {
        self.wal_stream_candidates
            .iter()
-            .filter(|(_, etcd_info)| {
-                etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())
-            })
+            .filter(|(_, info)| info.timeline.commit_lsn.is_some())
            .filter_map(|(sk_id, etcd_info)| {
                let info = &etcd_info.timeline;
                match wal_stream_connection_string(
@@ -520,6 +612,7 @@ impl WalreceiverState {
            })
    }

+    /// Remove candidates which haven't sent etcd updates for a while.
    fn cleanup_old_candidates(&mut self) {
        let mut node_ids_to_remove = Vec::with_capacity(self.wal_stream_candidates.len());

@@ -554,17 +647,24 @@ struct NewWalConnectionCandidate {
 #[derive(Debug, PartialEq, Eq)]
 enum ReconnectReason {
    NoExistingConnection,
-    NoEtcdDataForExistingConnection,
    LaggingWal {
-        current_lsn: Lsn,
-        new_lsn: Lsn,
+        current_commit_lsn: Lsn,
+        new_commit_lsn: Lsn,
        threshold: NonZeroU64,
    },
    NoWalTimeout {
+        current_lsn: Lsn,
+        current_commit_lsn: Lsn,
+        candidate_commit_lsn: Lsn,
        last_wal_interaction: Option<NaiveDateTime>,
        check_time: NaiveDateTime,
        threshold: Duration,
    },
+    NoKeepAlives {
+        last_keep_alive: Option<NaiveDateTime>,
+        check_time: NaiveDateTime,
+        threshold: Duration,
+    },
 }

 fn wal_stream_connection_string(
@@ -588,7 +688,6 @@ fn wal_stream_connection_string(

 #[cfg(test)]
 mod tests {
-    use std::time::SystemTime;

    use crate::repository::{
        repo_harness::{RepoHarness, TIMELINE_ID},
@@ -666,7 +765,7 @@ mod tests {
                        backup_lsn: None,
                        remote_consistent_lsn: None,
                        peer_horizon_lsn: None,
-                        safekeeper_connstr: Some(DUMMY_SAFEKEEPER_CONNSTR.to_string()),
+                        safekeeper_connstr: None,
                    },
                    etcd_version: 0,
                    latest_update: delay_over_threshold,
@@ -692,22 +791,26 @@ mod tests {
        let connected_sk_id = NodeId(0);
        let current_lsn = 100_000;

+        let connection_status = WalConnectionStatus {
+            is_connected: true,
+            has_received_wal: true,
+            latest_connection_update: now,
+            latest_wal_update: now,
+            commit_lsn: Some(Lsn(current_lsn)),
+            streaming_lsn: Some(Lsn(current_lsn)),
+        };
+
        state.max_lsn_wal_lag = NonZeroU64::new(100).unwrap();
        state.wal_connection = Some(WalConnection {
            sk_id: connected_sk_id,
-            latest_connection_update: now,
+            status: connection_status.clone(),
            connection_task: TaskHandle::spawn(move |sender, _| async move {
                sender
-                    .send(TaskEvent::NewEvent(ReplicationFeedback {
-                        current_timeline_size: 1,
-                        ps_writelsn: 1,
-                        ps_applylsn: current_lsn,
-                        ps_flushlsn: 1,
-                        ps_replytime: SystemTime::now(),
-                    }))
+                    .send(TaskEvent::NewEvent(connection_status.clone()))
                    .ok();
                Ok(())
            }),
+            discovered_new_wal: None,
        });
        state.wal_stream_candidates = HashMap::from([
            (
@@ -932,65 +1035,6 @@ mod tests {
        Ok(())
    }

-    #[tokio::test]
-    async fn connection_no_etcd_data_candidate() -> anyhow::Result<()> {
-        let harness = RepoHarness::create("connection_no_etcd_data_candidate")?;
-        let mut state = dummy_state(&harness);
-
-        let now = Utc::now().naive_utc();
-        let current_lsn = Lsn(100_000).align();
-        let connected_sk_id = NodeId(0);
-        let other_sk_id = NodeId(connected_sk_id.0 + 1);
-
-        state.wal_connection = Some(WalConnection {
-            sk_id: connected_sk_id,
-            latest_connection_update: now,
-            connection_task: TaskHandle::spawn(move |sender, _| async move {
-                sender
-                    .send(TaskEvent::NewEvent(ReplicationFeedback {
-                        current_timeline_size: 1,
-                        ps_writelsn: current_lsn.0,
-                        ps_applylsn: 1,
-                        ps_flushlsn: 1,
-                        ps_replytime: SystemTime::now(),
-                    }))
-                    .ok();
-                Ok(())
-            }),
-        });
-        state.wal_stream_candidates = HashMap::from([(
-            other_sk_id,
-            EtcdSkTimeline {
-                timeline: SkTimelineInfo {
-                    last_log_term: None,
-                    flush_lsn: None,
-                    commit_lsn: Some(Lsn(1 + state.max_lsn_wal_lag.get())),
-                    backup_lsn: None,
-                    remote_consistent_lsn: None,
-                    peer_horizon_lsn: None,
-                    safekeeper_connstr: Some(DUMMY_SAFEKEEPER_CONNSTR.to_string()),
-                },
-                etcd_version: 0,
-                latest_update: now,
-            },
-        )]);
-
-        let only_candidate = state
-            .next_connection_candidate()
-            .expect("Expected one candidate selected out of the only data option, but got none");
-        assert_eq!(only_candidate.safekeeper_id, other_sk_id);
-        assert_eq!(
-            only_candidate.reason,
-            ReconnectReason::NoEtcdDataForExistingConnection,
-            "Should select new safekeeper due to missing etcd data, even if there's an existing connection with this safekeeper"
-        );
-        assert!(only_candidate
-            .wal_source_connstr
-            .contains(DUMMY_SAFEKEEPER_CONNSTR));
-
-        Ok(())
-    }
-
    #[tokio::test]
    async fn lsn_wal_over_threshhold_current_candidate() -> anyhow::Result<()> {
        let harness = RepoHarness::create("lsn_wal_over_threshcurrent_candidate")?;
@@ -1001,21 +1045,25 @@ mod tests {
        let connected_sk_id = NodeId(0);
        let new_lsn = Lsn(current_lsn.0 + state.max_lsn_wal_lag.get() + 1);

+        let connection_status = WalConnectionStatus {
+            is_connected: true,
+            has_received_wal: true,
+            latest_connection_update: now,
+            latest_wal_update: now,
+            commit_lsn: Some(current_lsn),
+            streaming_lsn: Some(current_lsn),
+        };
+
        state.wal_connection = Some(WalConnection {
            sk_id: connected_sk_id,
-            latest_connection_update: now,
+            status: connection_status.clone(),
            connection_task: TaskHandle::spawn(move |sender, _| async move {
                sender
-                    .send(TaskEvent::NewEvent(ReplicationFeedback {
-                        current_timeline_size: 1,
-                        ps_writelsn: current_lsn.0,
-                        ps_applylsn: 1,
-                        ps_flushlsn: 1,
-                        ps_replytime: SystemTime::now(),
-                    }))
+                    .send(TaskEvent::NewEvent(connection_status.clone()))
                    .ok();
                Ok(())
            }),
+            discovered_new_wal: None,
        });
        state.wal_stream_candidates = HashMap::from([
            (
@@ -1060,8 +1108,8 @@ mod tests {
        assert_eq!(
            over_threshcurrent_candidate.reason,
            ReconnectReason::LaggingWal {
-                current_lsn,
-                new_lsn,
+                current_commit_lsn: current_lsn,
+                new_commit_lsn: new_lsn,
                threshold: state.max_lsn_wal_lag
            },
            "Should select bigger WAL safekeeper if it starts to lag enough"
@@ -1074,31 +1122,35 @@ mod tests {
    }

    #[tokio::test]
-    async fn timeout_wal_over_threshhold_current_candidate() -> anyhow::Result<()> {
-        let harness = RepoHarness::create("timeout_wal_over_threshhold_current_candidate")?;
+    async fn timeout_connection_threshhold_current_candidate() -> anyhow::Result<()> {
+        let harness = RepoHarness::create("timeout_connection_threshhold_current_candidate")?;
        let mut state = dummy_state(&harness);
        let current_lsn = Lsn(100_000).align();
        let now = Utc::now().naive_utc();

-        let lagging_wal_timeout = chrono::Duration::from_std(state.lagging_wal_timeout)?;
+        let wal_connect_timeout = chrono::Duration::from_std(state.wal_connect_timeout)?;
        let time_over_threshold =
-            Utc::now().naive_utc() - lagging_wal_timeout - lagging_wal_timeout;
+            Utc::now().naive_utc() - wal_connect_timeout - wal_connect_timeout;
+
+        let connection_status = WalConnectionStatus {
+            is_connected: true,
+            has_received_wal: true,
+            latest_connection_update: time_over_threshold,
+            latest_wal_update: time_over_threshold,
+            commit_lsn: Some(current_lsn),
+            streaming_lsn: Some(current_lsn),
+        };

        state.wal_connection = Some(WalConnection {
            sk_id: NodeId(1),
-            latest_connection_update: time_over_threshold,
+            status: connection_status.clone(),
            connection_task: TaskHandle::spawn(move |sender, _| async move {
                sender
-                    .send(TaskEvent::NewEvent(ReplicationFeedback {
-                        current_timeline_size: 1,
-                        ps_writelsn: current_lsn.0,
-                        ps_applylsn: 1,
-                        ps_flushlsn: 1,
-                        ps_replytime: SystemTime::now(),
-                    }))
+                    .send(TaskEvent::NewEvent(connection_status.clone()))
                    .ok();
                Ok(())
            }),
+            discovered_new_wal: None,
        });
        state.wal_stream_candidates = HashMap::from([(
            NodeId(0),
@@ -1123,12 +1175,12 @@ mod tests {

        assert_eq!(over_threshcurrent_candidate.safekeeper_id, NodeId(0));
        match over_threshcurrent_candidate.reason {
-            ReconnectReason::NoWalTimeout {
-                last_wal_interaction,
+            ReconnectReason::NoKeepAlives {
+                last_keep_alive,
                threshold,
                ..
            } => {
-                assert_eq!(last_wal_interaction, Some(time_over_threshold));
+                assert_eq!(last_keep_alive, Some(time_over_threshold));
                assert_eq!(threshold, state.lagging_wal_timeout);
            }
            unexpected => panic!("Unexpected reason: {unexpected:?}"),
@@ -1141,20 +1193,34 @@ mod tests {
    }

    #[tokio::test]
-    async fn timeout_connection_over_threshhold_current_candidate() -> anyhow::Result<()> {
-        let harness = RepoHarness::create("timeout_connection_over_threshhold_current_candidate")?;
+    async fn timeout_wal_over_threshhold_current_candidate() -> anyhow::Result<()> {
+        let harness = RepoHarness::create("timeout_wal_over_threshhold_current_candidate")?;
        let mut state = dummy_state(&harness);
        let current_lsn = Lsn(100_000).align();
+        let new_lsn = Lsn(100_100).align();
        let now = Utc::now().naive_utc();

        let lagging_wal_timeout = chrono::Duration::from_std(state.lagging_wal_timeout)?;
        let time_over_threshold =
            Utc::now().naive_utc() - lagging_wal_timeout - lagging_wal_timeout;

+        let connection_status = WalConnectionStatus {
+            is_connected: true,
+            has_received_wal: true,
+            latest_connection_update: now,
+            latest_wal_update: time_over_threshold,
+            commit_lsn: Some(current_lsn),
+            streaming_lsn: Some(current_lsn),
+        };
+
        state.wal_connection = Some(WalConnection {
            sk_id: NodeId(1),
-            latest_connection_update: time_over_threshold,
+            status: connection_status,
            connection_task: TaskHandle::spawn(move |_, _| async move { Ok(()) }),
+            discovered_new_wal: Some(NewCommittedWAL {
+                discovered_at: time_over_threshold,
+                lsn: new_lsn,
+            }),
        });
        state.wal_stream_candidates = HashMap::from([(
            NodeId(0),
@@ -1162,7 +1228,7 @@ mod tests {
                timeline: SkTimelineInfo {
                    last_log_term: None,
                    flush_lsn: None,
-                    commit_lsn: Some(current_lsn),
+                    commit_lsn: Some(new_lsn),
                    backup_lsn: None,
                    remote_consistent_lsn: None,
                    peer_horizon_lsn: None,
@@ -1180,10 +1246,16 @@ mod tests {
        assert_eq!(over_threshcurrent_candidate.safekeeper_id, NodeId(0));
        match over_threshcurrent_candidate.reason {
            ReconnectReason::NoWalTimeout {
+                current_lsn,
+                current_commit_lsn,
+                candidate_commit_lsn,
                last_wal_interaction,
                threshold,
                ..
            } => {
+                assert_eq!(current_lsn, current_lsn);
+                assert_eq!(current_commit_lsn, current_lsn);
+                assert_eq!(candidate_commit_lsn, new_lsn);
                assert_eq!(last_wal_interaction, Some(time_over_threshold));
                assert_eq!(threshold, state.lagging_wal_timeout);
            }
@@ -1210,7 +1282,7 @@ mod tests {
                .expect("Failed to create an empty timeline for dummy wal connection manager"),
            wal_connect_timeout: Duration::from_secs(1),
            lagging_wal_timeout: Duration::from_secs(1),
-            max_lsn_wal_lag: NonZeroU64::new(1).unwrap(),
+            max_lsn_wal_lag: NonZeroU64::new(1024 * 1024).unwrap(),
            wal_connection: None,
            wal_stream_candidates: HashMap::new(),
            wal_connection_attempts: HashMap::new(),
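
The two tests above distinguish a connection that has gone completely silent (`NoKeepAlives`) from one that still sends keepalives but has stopped delivering WAL while a candidate advertises a newer commit LSN (`NoWalTimeout`). The following is a minimal sketch of that distinction, not the pageserver's selection logic: `Status`, `Reason`, and `reconnect_reason` are hypothetical names, LSNs are plain `u64`, and `std::time::SystemTime` stands in for the `chrono` timestamps used in the diff.

```rust
use std::time::{Duration, SystemTime};

/// Simplified stand-in for the per-connection status (hypothetical).
#[derive(Debug, Clone, Copy)]
pub struct Status {
    pub latest_connection_update: SystemTime,
    pub latest_wal_update: SystemTime,
    pub commit_lsn: u64,
}

/// Simplified stand-in for the reconnect reasons matched in the tests (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Reason {
    NoKeepAlives,
    NoWalTimeout,
}

/// How long ago `then` was, saturating to zero if `then` lies in the future.
fn elapsed(now: SystemTime, then: SystemTime) -> Duration {
    now.duration_since(then).unwrap_or(Duration::ZERO)
}

/// Decide whether the current connection should be replaced (assumed rules).
pub fn reconnect_reason(
    now: SystemTime,
    status: &Status,
    new_wal_discovered_at: Option<SystemTime>,
    candidate_commit_lsn: u64,
    wal_connect_timeout: Duration,
    lagging_wal_timeout: Duration,
) -> Option<Reason> {
    // No message of any kind for too long: the connection looks dead.
    if elapsed(now, status.latest_connection_update) > wal_connect_timeout {
        return Some(Reason::NoKeepAlives);
    }
    // Keepalives still arrive, but a candidate has had newer committed WAL
    // for too long and the current connection has not delivered any.
    if let Some(discovered_at) = new_wal_discovered_at {
        let waited = elapsed(now, status.latest_wal_update).min(elapsed(now, discovered_at));
        if candidate_commit_lsn > status.commit_lsn && waited > lagging_wal_timeout {
            return Some(Reason::NoWalTimeout);
        }
    }
    None
}
```

Keeping the two timestamps separate is what lets a keepalive-only connection be detected: `latest_connection_update` stays fresh while `latest_wal_update` ages.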
--- a/pageserver/src/walreceiver/walreceiver_connection.rs
+++ b/pageserver/src/walreceiver/walreceiver_connection.rs
@@ -8,6 +8,7 @@ use std::{

 use anyhow::{bail, ensure, Context};
 use bytes::BytesMut;
+use chrono::{NaiveDateTime, Utc};
 use fail::fail_point;
 use futures::StreamExt;
 use postgres::{SimpleQueryMessage, SimpleQueryRow};
@@ -29,12 +30,29 @@ use crate::{
 use postgres_ffi::waldecoder::WalStreamDecoder;
 use utils::{lsn::Lsn, pq_proto::ReplicationFeedback, zid::ZTenantTimelineId};

+/// Status of the connection.
+#[derive(Debug, Clone)]
+pub struct WalConnectionStatus {
+    /// If we were able to initiate a postgres connection, this means that the safekeeper process is at least running.
+    pub is_connected: bool,
+    /// Defines a healthy connection as one on which we have received at least some WAL bytes.
+    pub has_received_wal: bool,
+    /// Connection establishment time or the timestamp of the latest connection message received.
+    pub latest_connection_update: NaiveDateTime,
+    /// Time of the latest WAL message received.
+    pub latest_wal_update: NaiveDateTime,
+    /// Latest WAL update contained WAL up to this LSN. The next WAL message will start from that LSN.
+    pub streaming_lsn: Option<Lsn>,
+    /// Latest commit_lsn received from the safekeeper. Can be zero if no message has been received yet.
+    pub commit_lsn: Option<Lsn>,
+}
+
 /// Open a connection to the given safekeeper and receive WAL, sending back progress
 /// messages as we go.
 pub async fn handle_walreceiver_connection(
    id: ZTenantTimelineId,
    wal_source_connstr: &str,
-    events_sender: &watch::Sender<TaskEvent<ReplicationFeedback>>,
+    events_sender: &watch::Sender<TaskEvent<WalConnectionStatus>>,
    mut cancellation: watch::Receiver<()>,
    connect_timeout: Duration,
 ) -> anyhow::Result<()> {
@@ -49,12 +67,26 @@ pub async fn handle_walreceiver_connection(
    .await
    .context("Timed out while waiting for walreceiver connection to open")?
    .context("Failed to open walreceiver conection")?;
+
+    info!("connected!");
+    let mut connection_status = WalConnectionStatus {
+        is_connected: true,
+        has_received_wal: false,
+        latest_connection_update: Utc::now().naive_utc(),
+        latest_wal_update: Utc::now().naive_utc(),
+        streaming_lsn: None,
+        commit_lsn: None,
+    };
+    if let Err(e) = events_sender.send(TaskEvent::NewEvent(connection_status.clone())) {
+        warn!("Wal connection event listener dropped right after connection init, aborting the connection: {e}");
+        return Ok(());
+    }
+
    // The connection object performs the actual communication with the database,
    // so spawn it off to run on its own.
    let mut connection_cancellation = cancellation.clone();
    tokio::spawn(
        async move {
-            info!("connected!");
            select! {
                    connection_result = connection => match connection_result{
                            Ok(()) => info!("Walreceiver db connection closed"),
@@ -84,6 +116,14 @@ pub async fn handle_walreceiver_connection(

    let identify = identify_system(&mut replication_client).await?;
    info!("{identify:?}");
+
+    connection_status.latest_connection_update = Utc::now().naive_utc();
+    if let Err(e) = events_sender.send(TaskEvent::NewEvent(connection_status.clone())) {
+        warn!("Wal connection event listener dropped after IDENTIFY_SYSTEM, aborting the connection: {e}");
+        return Ok(());
+    }
+
+    // NB: this is a flush_lsn, not a commit_lsn.
    let end_of_wal = Lsn::from(u64::from(identify.xlogpos));
    let mut caught_up = false;
    let ZTenantTimelineId {
@@ -118,7 +158,7 @@ pub async fn handle_walreceiver_connection(
    // There might be some padding after the last full record, skip it.
    startpoint += startpoint.calc_padding(8u32);

-    info!("last_record_lsn {last_rec_lsn} starting replication from {startpoint}, server is at {end_of_wal}...");
+    info!("last_record_lsn {last_rec_lsn} starting replication from {startpoint}, safekeeper is at {end_of_wal}...");

    let query = format!("START_REPLICATION PHYSICAL {startpoint}");

@@ -140,6 +180,33 @@ pub async fn handle_walreceiver_connection(
        }
    } {
        let replication_message = replication_message?;
+        let now = Utc::now().naive_utc();
+
+        // Update the connection status before processing the message. If the message processing
+        // fails (e.g. in walingest), we still want to know the latest LSNs from the safekeeper.
+        match &replication_message {
+            ReplicationMessage::XLogData(xlog_data) => {
+                connection_status.latest_connection_update = now;
+                connection_status.commit_lsn = Some(Lsn::from(xlog_data.wal_end()));
+                connection_status.streaming_lsn = Some(Lsn::from(
+                    xlog_data.wal_start() + xlog_data.data().len() as u64,
+                ));
+                if !xlog_data.data().is_empty() {
+                    connection_status.latest_wal_update = now;
+                    connection_status.has_received_wal = true;
+                }
+            }
+            ReplicationMessage::PrimaryKeepAlive(keepalive) => {
+                connection_status.latest_connection_update = now;
+                connection_status.commit_lsn = Some(Lsn::from(keepalive.wal_end()));
+            }
+            &_ => {}
+        };
+        if let Err(e) = events_sender.send(TaskEvent::NewEvent(connection_status.clone())) {
+            warn!("Wal connection event listener dropped, aborting the connection: {e}");
+            return Ok(());
+        }
+
        let status_update = match replication_message {
            ReplicationMessage::XLogData(xlog_data) => {
                // Pass the WAL data to the decoder, and see if we can decode
@@ -154,7 +221,7 @@ pub async fn handle_walreceiver_connection(

                {
                    let mut decoded = DecodedWALRecord::default();
-                    let mut modification = timeline.begin_modification();
+                    let mut modification = timeline.begin_modification(endlsn);
                    while let Some((lsn, recdata)) = waldecoder.poll_decode()? {
                        // let _enter = info_span!("processing record", lsn = %lsn).entered();

@@ -178,16 +245,6 @@ pub async fn handle_walreceiver_connection(
                    caught_up = true;
                }

-                let timeline_to_check = Arc::clone(&timeline);
-                tokio::task::spawn_blocking(move || timeline_to_check.check_checkpoint_distance())
-                    .await
-                    .with_context(|| {
-                        format!("Spawned checkpoint check task panicked for timeline {id}")
-                    })?
-                    .with_context(|| {
-                        format!("Failed to check checkpoint distance for timeline {id}")
-                    })?;
-
                Some(endlsn)
            }

@@ -208,6 +265,12 @@ pub async fn handle_walreceiver_connection(
            _ => None,
        };

+        let timeline_to_check = Arc::clone(&timeline);
+        tokio::task::spawn_blocking(move || timeline_to_check.check_checkpoint_distance())
+            .await
+            .with_context(|| format!("Spawned checkpoint check task panicked for timeline {id}"))?
+            .with_context(|| format!("Failed to check checkpoint distance for timeline {id}"))?;
+
        if let Some(last_lsn) = status_update {
            let remote_index = repo.get_remote_index();
            let timeline_remote_consistent_lsn = remote_index
@@ -261,10 +324,6 @@ pub async fn handle_walreceiver_connection(
                .as_mut()
                .zenith_status_update(data.len() as u64, &data)
                .await?;
-            if let Err(e) = events_sender.send(TaskEvent::NewEvent(zenith_status_update)) {
-                warn!("Wal connection event listener dropped, aborting the connection: {e}");
-                return Ok(());
-            }
        }
    }

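
The status bookkeeping added to the replication loop above can be read in isolation. The sketch below mirrors it under simplifying assumptions: `Message`, `Status`, and `record_message` are hypothetical stand-ins, LSNs and timestamps are plain `u64`, and only the two message kinds that the diff handles are modelled.

```rust
/// Simplified stand-in for the replication messages handled in the loop (hypothetical).
pub enum Message<'a> {
    XLogData { wal_start: u64, wal_end: u64, data: &'a [u8] },
    KeepAlive { wal_end: u64 },
}

/// Simplified stand-in for `WalConnectionStatus` (hypothetical field types).
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct Status {
    pub has_received_wal: bool,
    pub latest_connection_update: u64,
    pub latest_wal_update: u64,
    pub streaming_lsn: Option<u64>,
    pub commit_lsn: Option<u64>,
}

/// Update the status before the message itself is processed, so the latest
/// LSNs are known even if processing fails afterwards.
pub fn record_message(status: &mut Status, msg: &Message<'_>, now: u64) {
    match msg {
        Message::XLogData { wal_start, wal_end, data } => {
            status.latest_connection_update = now;
            status.commit_lsn = Some(*wal_end);
            // The next WAL message is expected to start where this one ends.
            status.streaming_lsn = Some(*wal_start + data.len() as u64);
            // Only a non-empty payload counts as WAL progress.
            if !data.is_empty() {
                status.latest_wal_update = now;
                status.has_received_wal = true;
            }
        }
        Message::KeepAlive { wal_end } => {
            // A keepalive proves liveness and carries a commit LSN, but no WAL.
            status.latest_connection_update = now;
            status.commit_lsn = Some(*wal_end);
        }
    }
}
```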
--- a/pageserver/src/walredo.rs
+++ b/pageserver/src/walredo.rs
@@ -20,8 +20,8 @@
 //!
 use byteorder::{ByteOrder, LittleEndian};
 use bytes::{BufMut, Bytes, BytesMut};
-use lazy_static::lazy_static;
 use nix::poll::*;
+use once_cell::sync::Lazy;
 use serde::Serialize;
 use std::fs;
 use std::fs::OpenOptions;
@@ -105,21 +105,27 @@ impl crate::walredo::WalRedoManager for DummyRedoManager {
 // We collect the time spent in actual WAL redo ('redo'), and time waiting
 // for access to the postgres process ('wait') since there is only one for
 // each tenant.
-lazy_static! {
-    static ref WAL_REDO_TIME: Histogram =
-        register_histogram!("pageserver_wal_redo_seconds", "Time spent on WAL redo")
-            .expect("failed to define a metric");
-    static ref WAL_REDO_WAIT_TIME: Histogram = register_histogram!(
+
+static WAL_REDO_TIME: Lazy<Histogram> = Lazy::new(|| {
+    register_histogram!("pageserver_wal_redo_seconds", "Time spent on WAL redo")
+        .expect("failed to define a metric")
+});
+
+static WAL_REDO_WAIT_TIME: Lazy<Histogram> = Lazy::new(|| {
+    register_histogram!(
        "pageserver_wal_redo_wait_seconds",
        "Time spent waiting for access to the WAL redo process"
    )
-    .expect("failed to define a metric");
-    static ref WAL_REDO_RECORD_COUNTER: IntCounter = register_int_counter!(
+    .expect("failed to define a metric")
+});
+
+static WAL_REDO_RECORD_COUNTER: Lazy<IntCounter> = Lazy::new(|| {
+    register_int_counter!(
        "pageserver_replayed_wal_records_total",
        "Number of WAL records replayed in WAL redo process"
    )
-    .unwrap();
-}
+    .unwrap()
+});

 ///
 /// This is the real implementation that uses a Postgres process to
--- a/poetry.lock
+++ b/poetry.lock
--- a/proxy/Cargo.toml
+++ b/proxy/Cargo.toml
@@ -7,6 +7,7 @@ edition = "2021"
 anyhow = "1.0"
 async-trait = "0.1"
 base64 = "0.13.0"
+bstr = "0.2.17"
 bytes = { version = "1.0.1", features = ['serde'] }
 clap = "3.0"
 futures = "0.3.13"
@@ -14,7 +15,7 @@ hashbrown = "0.11.2"
 hex = "0.4.3"
 hmac = "0.12.1"
 hyper = "0.14"
-lazy_static = "1.4.0"
+once_cell = "1.13.0"
 md5 = "0.7.0"
 parking_lot = "0.12"
 pin-project-lite = "0.2.7"
--- a/proxy/src/auth.rs
+++ b/proxy/src/auth.rs
@@ -12,7 +12,7 @@ use password_hack::PasswordHackPayload;
 mod flow;
 pub use flow::*;

-use crate::{error::UserFacingError, waiters};
+use crate::error::UserFacingError;
 use std::io;
 use thiserror::Error;

@@ -22,51 +22,54 @@ pub type Result<T> = std::result::Result<T, AuthError>;
 /// Common authentication error.
 #[derive(Debug, Error)]
 pub enum AuthErrorImpl {
-    /// Authentication error reported by the console.
+    // This will be dropped in the future.
    #[error(transparent)]
-    Console(#[from] backend::AuthError),
+    Legacy(#[from] backend::LegacyAuthError),

    #[error(transparent)]
-    GetAuthInfo(#[from] backend::console::ConsoleAuthError),
+    Link(#[from] backend::LinkAuthError),

+    #[error(transparent)]
+    GetAuthInfo(#[from] backend::GetAuthInfoError),
+
+    #[error(transparent)]
+    WakeCompute(#[from] backend::WakeComputeError),
+
+    /// SASL protocol errors (includes [SCRAM](crate::scram)).
    #[error(transparent)]
    Sasl(#[from] crate::sasl::Error),

+    #[error("Unsupported authentication method: {0}")]
+    BadAuthMethod(Box<str>),
+
    #[error("Malformed password message: {0}")]
    MalformedPassword(&'static str),

-    /// Errors produced by [`crate::stream::PqStream`].
+    #[error(
+        "Project name is not specified. \
+        Either please upgrade the postgres client library (libpq) for SNI support \
+        or pass the project name as a parameter: '&options=project%3D<project-name>'. \
+        See more at https://neon.tech/sni"
+    )]
+    MissingProjectName,
+
+    /// Errors produced by e.g. [`crate::stream::PqStream`].
    #[error(transparent)]
    Io(#[from] io::Error),
 }

-impl AuthErrorImpl {
-    pub fn auth_failed(msg: impl Into<String>) -> Self {
-        Self::Console(backend::AuthError::auth_failed(msg))
-    }
-}
-
-impl From<waiters::RegisterError> for AuthErrorImpl {
-    fn from(e: waiters::RegisterError) -> Self {
-        Self::Console(backend::AuthError::from(e))
-    }
-}
-
-impl From<waiters::WaitError> for AuthErrorImpl {
-    fn from(e: waiters::WaitError) -> Self {
-        Self::Console(backend::AuthError::from(e))
-    }
-}
-
 #[derive(Debug, Error)]
 #[error(transparent)]
 pub struct AuthError(Box<AuthErrorImpl>);

-impl<T> From<T> for AuthError
-where
-    AuthErrorImpl: From<T>,
-{
-    fn from(e: T) -> Self {
+impl AuthError {
+    pub fn bad_auth_method(name: impl Into<Box<str>>) -> Self {
+        AuthErrorImpl::BadAuthMethod(name.into()).into()
+    }
+}
+
+impl<E: Into<AuthErrorImpl>> From<E> for AuthError {
+    fn from(e: E) -> Self {
        Self(Box::new(e.into()))
    }
 }
@@ -75,10 +78,14 @@ impl UserFacingError for AuthError {
    fn to_string_client(&self) -> String {
        use AuthErrorImpl::*;
        match self.0.as_ref() {
-            Console(e) => e.to_string_client(),
+            Legacy(e) => e.to_string_client(),
+            Link(e) => e.to_string_client(),
            GetAuthInfo(e) => e.to_string_client(),
+            WakeCompute(e) => e.to_string_client(),
            Sasl(e) => e.to_string_client(),
+            BadAuthMethod(_) => self.to_string(),
            MalformedPassword(_) => self.to_string(),
+            MissingProjectName => self.to_string(),
            _ => "Internal error".to_string(),
        }
    }
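
The `AuthError` wrapper above boxes `AuthErrorImpl` and gains conversions through a single blanket `From` impl. A trimmed-down sketch of that pattern follows, using only the standard library instead of `thiserror`; the variant set is reduced for illustration, and `to_string_client` is written as an inherent method here rather than through the `UserFacingError` trait.

```rust
use std::fmt;

/// Reduced variant set for illustration; the real enum has more cases.
#[derive(Debug)]
pub enum AuthErrorImpl {
    BadAuthMethod(Box<str>),
    MalformedPassword(&'static str),
    Io(std::io::Error),
}

impl fmt::Display for AuthErrorImpl {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::BadAuthMethod(name) => write!(f, "Unsupported authentication method: {name}"),
            Self::MalformedPassword(why) => write!(f, "Malformed password message: {why}"),
            Self::Io(e) => write!(f, "{e}"),
        }
    }
}

impl From<std::io::Error> for AuthErrorImpl {
    fn from(e: std::io::Error) -> Self {
        Self::Io(e)
    }
}

/// Boxing keeps the error one pointer wide, however large the enum grows.
#[derive(Debug)]
pub struct AuthError(Box<AuthErrorImpl>);

impl AuthError {
    pub fn bad_auth_method(name: impl Into<Box<str>>) -> Self {
        AuthErrorImpl::BadAuthMethod(name.into()).into()
    }

    /// Only variants known to be safe are shown to the client verbatim.
    pub fn to_string_client(&self) -> String {
        match self.0.as_ref() {
            AuthErrorImpl::BadAuthMethod(_) | AuthErrorImpl::MalformedPassword(_) => {
                self.0.to_string()
            }
            _ => "Internal error".to_string(),
        }
    }
}

// Anything convertible into `AuthErrorImpl` is convertible into `AuthError`.
impl<E: Into<AuthErrorImpl>> From<E> for AuthError {
    fn from(e: E) -> Self {
        Self(Box::new(e.into()))
    }
}
```

With this shape, adding a new `#[from]`-style conversion to the inner enum automatically makes `?` work for the outer type as well.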
--- a/proxy/src/auth/backend.rs
+++ b/proxy/src/auth/backend.rs
@@ -1,10 +1,13 @@
-mod link;
 mod postgres;

-pub mod console;
+mod link;
+pub use link::LinkAuthError;
+
+mod console;
+pub use console::{GetAuthInfoError, WakeComputeError};

 mod legacy_console;
-pub use legacy_console::{AuthError, AuthErrorImpl};
+pub use legacy_console::LegacyAuthError;

 use crate::{
    auth::{self, AuthFlow, ClientCredentials},
@@ -12,13 +15,12 @@ use crate::{
    stream::PqStream,
    waiters::{self, Waiter, Waiters},
 };
-use lazy_static::lazy_static;
+
+use once_cell::sync::Lazy;
 use serde::{Deserialize, Serialize};
 use tokio::io::{AsyncRead, AsyncWrite};

-lazy_static! {
-    static ref CPLANE_WAITERS: Waiters<mgmt::ComputeReady> = Default::default();
-}
+static CPLANE_WAITERS: Lazy<Waiters<mgmt::ComputeReady>> = Lazy::new(Default::default);

 /// Give caller an opportunity to wait for the cloud's reply.
 pub async fn with_waiter<R, T, E>(
--- a/proxy/src/auth/backend/console.rs
+++ b/proxy/src/auth/backend/console.rs
@@ -13,21 +13,11 @@ use std::future::Future;
 use thiserror::Error;
 use tokio::io::{AsyncRead, AsyncWrite};

-pub type Result<T> = std::result::Result<T, ConsoleAuthError>;
+const REQUEST_FAILED: &str = "Console request failed";

 #[derive(Debug, Error)]
-pub enum ConsoleAuthError {
-    #[error(transparent)]
-    BadProjectName(#[from] auth::credentials::ClientCredsParseError),
-
-    // We shouldn't include the actual secret here.
-    #[error("Bad authentication secret")]
-    BadSecret,
-
-    #[error("Console responded with a malformed compute address: '{0}'")]
-    BadComputeAddress(String),
-
-    #[error("Console responded with a malformed JSON: '{0}'")]
+pub enum TransportError {
+    #[error("Console responded with a malformed JSON: {0}")]
    BadResponse(#[from] serde_json::Error),

    /// HTTP status (other than 200) returned by the console.
@@ -38,19 +28,72 @@ pub enum ConsoleAuthError {
    Io(#[from] std::io::Error),
 }

-impl UserFacingError for ConsoleAuthError {
+impl UserFacingError for TransportError {
    fn to_string_client(&self) -> String {
-        use ConsoleAuthError::*;
+        use TransportError::*;
        match self {
-            BadProjectName(e) => e.to_string_client(),
-            _ => "Internal error".to_string(),
+            HttpStatus(_) => self.to_string(),
+            _ => REQUEST_FAILED.to_owned(),
        }
    }
 }

-impl From<&auth::credentials::ClientCredsParseError> for ConsoleAuthError {
-    fn from(e: &auth::credentials::ClientCredsParseError) -> Self {
-        ConsoleAuthError::BadProjectName(e.clone())
+// Helps eliminate graceless `.map_err` calls without introducing another ctor.
+impl From<reqwest::Error> for TransportError {
+    fn from(e: reqwest::Error) -> Self {
+        io_error(e).into()
+    }
+}
+
+#[derive(Debug, Error)]
+pub enum GetAuthInfoError {
+    // We shouldn't include the actual secret here.
+    #[error("Console responded with a malformed auth secret")]
+    BadSecret,
+
+    #[error(transparent)]
+    Transport(TransportError),
+}
+
+impl UserFacingError for GetAuthInfoError {
+    fn to_string_client(&self) -> String {
+        use GetAuthInfoError::*;
+        match self {
+            BadSecret => REQUEST_FAILED.to_owned(),
+            Transport(e) => e.to_string_client(),
+        }
+    }
+}
+
+impl<E: Into<TransportError>> From<E> for GetAuthInfoError {
+    fn from(e: E) -> Self {
+        Self::Transport(e.into())
+    }
+}
+
+#[derive(Debug, Error)]
+pub enum WakeComputeError {
+    // We shouldn't show users the address even if it's broken.
+    #[error("Console responded with a malformed compute address: {0}")]
+    BadComputeAddress(String),
+
+    #[error(transparent)]
+    Transport(TransportError),
+}
+
+impl UserFacingError for WakeComputeError {
+    fn to_string_client(&self) -> String {
+        use WakeComputeError::*;
+        match self {
+            BadComputeAddress(_) => REQUEST_FAILED.to_owned(),
+            Transport(e) => e.to_string_client(),
+        }
+    }
+}
+
+impl<E: Into<TransportError>> From<E> for WakeComputeError {
+    fn from(e: E) -> Self {
+        Self::Transport(e.into())
    }
 }

@@ -95,7 +138,7 @@ impl<'a> Api<'a> {
        handle_user(client, &self, Self::get_auth_info, Self::wake_compute).await
    }

-    async fn get_auth_info(&self) -> Result<AuthInfo> {
+    async fn get_auth_info(&self) -> Result<AuthInfo, GetAuthInfoError> {
        let mut url = self.endpoint.clone();
        url.path_segments_mut().push("proxy_get_role_secret");
        url.query_pairs_mut()
@@ -105,21 +148,20 @@ impl<'a> Api<'a> {
        // TODO: use a proper logger
        println!("cplane request: {url}");

-        let resp = reqwest::get(url.into_inner()).await.map_err(io_error)?;
+        let resp = reqwest::get(url.into_inner()).await?;
        if !resp.status().is_success() {
-            return Err(ConsoleAuthError::HttpStatus(resp.status()));
+            return Err(TransportError::HttpStatus(resp.status()).into());
        }

-        let response: GetRoleSecretResponse =
-            serde_json::from_str(&resp.text().await.map_err(io_error)?)?;
+        let response: GetRoleSecretResponse = serde_json::from_str(&resp.text().await?)?;

-        scram::ServerSecret::parse(response.role_secret.as_str())
+        scram::ServerSecret::parse(&response.role_secret)
            .map(AuthInfo::Scram)
-            .ok_or(ConsoleAuthError::BadSecret)
+            .ok_or(GetAuthInfoError::BadSecret)
    }

    /// Wake up the compute node and return the corresponding connection info.
-    pub(super) async fn wake_compute(&self) -> Result<ComputeConnCfg> {
+    pub(super) async fn wake_compute(&self) -> Result<ComputeConnCfg, WakeComputeError> {
        let mut url = self.endpoint.clone();
        url.path_segments_mut().push("proxy_wake_compute");
        url.query_pairs_mut()
@@ -128,17 +170,16 @@ impl<'a> Api<'a> {
        // TODO: use a proper logger
        println!("cplane request: {url}");

-        let resp = reqwest::get(url.into_inner()).await.map_err(io_error)?;
+        let resp = reqwest::get(url.into_inner()).await?;
        if !resp.status().is_success() {
-            return Err(ConsoleAuthError::HttpStatus(resp.status()));
+            return Err(TransportError::HttpStatus(resp.status()).into());
        }

-        let response: GetWakeComputeResponse =
-            serde_json::from_str(&resp.text().await.map_err(io_error)?)?;
+        let response: GetWakeComputeResponse = serde_json::from_str(&resp.text().await?)?;

        // Unfortunately, ownership won't let us use `Option::ok_or` here.
        let (host, port) = match parse_host_port(&response.address) {
-            None => return Err(ConsoleAuthError::BadComputeAddress(response.address)),
+            None => return Err(WakeComputeError::BadComputeAddress(response.address)),
            Some(x) => x,
        };

@@ -162,8 +203,8 @@ pub(super) async fn handle_user<'a, Endpoint, GetAuthInfo, WakeCompute>(
    wake_compute: impl FnOnce(&'a Endpoint) -> WakeCompute,
 ) -> auth::Result<compute::NodeInfo>
 where
-    GetAuthInfo: Future<Output = Result<AuthInfo>>,
-    WakeCompute: Future<Output = Result<ComputeConnCfg>>,
+    GetAuthInfo: Future<Output = Result<AuthInfo, GetAuthInfoError>>,
+    WakeCompute: Future<Output = Result<ComputeConnCfg, WakeComputeError>>,
 {
    let auth_info = get_auth_info(endpoint).await?;

@@ -171,7 +212,7 @@ where
    let scram_keys = match auth_info {
        AuthInfo::Md5(_) => {
            // TODO: decide if we should support MD5 in api v2
-            return Err(auth::AuthErrorImpl::auth_failed("MD5 is not supported").into());
+            return Err(auth::AuthError::bad_auth_method("MD5"));
        }
        AuthInfo::Scram(secret) => {
            let scram = auth::Scram(&secret);
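
The console changes above split one catch-all error into a shared `TransportError` plus per-request errors that wrap it, so `?` can lift transport failures without `.map_err`. The sketch below reproduces that layering with the standard library only. The HTTP status message text, the `u16` status type, and the simplified `wake_compute` signature are assumptions made for the example, not the proxy's actual definitions.

```rust
use std::fmt;

const REQUEST_FAILED: &str = "Console request failed";

#[derive(Debug)]
pub enum TransportError {
    HttpStatus(u16),
    Io(std::io::Error),
}

impl fmt::Display for TransportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Message text is an assumption; the original is not shown in the diff.
            Self::HttpStatus(code) => write!(f, "Console responded with an HTTP status: {code}"),
            Self::Io(e) => write!(f, "{e}"),
        }
    }
}

impl From<std::io::Error> for TransportError {
    fn from(e: std::io::Error) -> Self {
        Self::Io(e)
    }
}

impl TransportError {
    pub fn to_string_client(&self) -> String {
        match self {
            Self::HttpStatus(_) => self.to_string(),
            _ => REQUEST_FAILED.to_owned(),
        }
    }
}

#[derive(Debug)]
pub enum WakeComputeError {
    BadComputeAddress(String),
    Transport(TransportError),
}

// Any error that converts into `TransportError` also converts into `WakeComputeError`.
impl<E: Into<TransportError>> From<E> for WakeComputeError {
    fn from(e: E) -> Self {
        Self::Transport(e.into())
    }
}

impl WakeComputeError {
    pub fn to_string_client(&self) -> String {
        match self {
            // The malformed address is kept for logs but never shown to the user.
            Self::BadComputeAddress(_) => REQUEST_FAILED.to_owned(),
            Self::Transport(e) => e.to_string_client(),
        }
    }
}

/// Hypothetical, network-free stand-in for the real request.
pub fn wake_compute(status: u16, address: &str) -> Result<(String, u16), WakeComputeError> {
    if status != 200 {
        return Err(TransportError::HttpStatus(status).into());
    }
    let bad_address = || WakeComputeError::BadComputeAddress(address.to_owned());
    let (host, port) = address.split_once(':').ok_or_else(bad_address)?;
    let port = port.parse::<u16>().map_err(|_| bad_address())?;
    Ok((host.to_owned(), port))
}
```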
--- a/proxy/src/auth/backend/legacy_console.rs
+++ b/proxy/src/auth/backend/legacy_console.rs
@@ -14,7 +14,7 @@ use tokio::io::{AsyncRead, AsyncWrite};
 use utils::pq_proto::BeMessage as Be;

 #[derive(Debug, Error)]
-pub enum AuthErrorImpl {
+pub enum LegacyAuthError {
    /// Authentication error reported by the console.
    #[error("Authentication failed: {0}")]
    AuthFailed(String),
@@ -24,7 +24,7 @@ pub enum AuthErrorImpl {
    HttpStatus(reqwest::StatusCode),

    #[error("Console responded with a malformed JSON: {0}")]
-    MalformedResponse(#[from] serde_json::Error),
+    BadResponse(#[from] serde_json::Error),

    #[error(transparent)]
    Transport(#[from] reqwest::Error),
@@ -36,30 +36,10 @@ pub enum AuthErrorImpl {
    WaiterWait(#[from] waiters::WaitError),
 }

-#[derive(Debug, Error)]
-#[error(transparent)]
-pub struct AuthError(Box<AuthErrorImpl>);
-
-impl AuthError {
-    /// Smart constructor for authentication error reported by `mgmt`.
-    pub fn auth_failed(msg: impl Into<String>) -> Self {
-        Self(Box::new(AuthErrorImpl::AuthFailed(msg.into())))
-    }
-}
-
-impl<T> From<T> for AuthError
-where
-    AuthErrorImpl: From<T>,
-{
-    fn from(e: T) -> Self {
-        Self(Box::new(e.into()))
-    }
-}
-
-impl UserFacingError for AuthError {
+impl UserFacingError for LegacyAuthError {
    fn to_string_client(&self) -> String {
-        use AuthErrorImpl::*;
-        match self.0.as_ref() {
+        use LegacyAuthError::*;
+        match self {
            AuthFailed(_) | HttpStatus(_) => self.to_string(),
            _ => "Internal error".to_string(),
        }
@@ -88,7 +68,7 @@ async fn authenticate_proxy_client(
    md5_response: &str,
    salt: &[u8; 4],
    psql_session_id: &str,
-) -> Result<DatabaseInfo, AuthError> {
+) -> Result<DatabaseInfo, LegacyAuthError> {
    let mut url = auth_endpoint.clone();
    url.query_pairs_mut()
        .append_pair("login", &creds.user)
@@ -102,17 +82,17 @@ async fn authenticate_proxy_client(
        // TODO: leverage `reqwest::Client` to reuse connections
        let resp = reqwest::get(url).await?;
        if !resp.status().is_success() {
-            return Err(AuthErrorImpl::HttpStatus(resp.status()).into());
+            return Err(LegacyAuthError::HttpStatus(resp.status()));
        }

-        let auth_info: ProxyAuthResponse = serde_json::from_str(resp.text().await?.as_str())?;
+        let auth_info = serde_json::from_str(resp.text().await?.as_str())?;
        println!("got auth info: {:?}", auth_info);

        use ProxyAuthResponse::*;
        let db_info = match auth_info {
            Ready { conn_info } => conn_info,
-            Error { error } => return Err(AuthErrorImpl::AuthFailed(error).into()),
-            NotReady { .. } => waiter.await?.map_err(AuthErrorImpl::AuthFailed)?,
+            Error { error } => return Err(LegacyAuthError::AuthFailed(error)),
+            NotReady { .. } => waiter.await?.map_err(LegacyAuthError::AuthFailed)?,
        };

        Ok(db_info)
@@ -124,7 +104,7 @@ async fn handle_existing_user(
    auth_endpoint: &reqwest::Url,
    client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin + Send>,
    creds: &ClientCredentials,
-) -> Result<compute::NodeInfo, auth::AuthError> {
+) -> auth::Result<compute::NodeInfo> {
    let psql_session_id = super::link::new_psql_session_id();
    let md5_salt = rand::random();

--- a/proxy/src/auth/backend/link.rs
+++ b/proxy/src/auth/backend/link.rs
@@ -1,7 +1,34 @@
-use crate::{auth, compute, stream::PqStream};
+use crate::{auth, compute, error::UserFacingError, stream::PqStream, waiters};
+use thiserror::Error;
 use tokio::io::{AsyncRead, AsyncWrite};
 use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage};

+#[derive(Debug, Error)]
+pub enum LinkAuthError {
+    /// Authentication error reported by the console.
+    #[error("Authentication failed: {0}")]
+    AuthFailed(String),
+
+    #[error(transparent)]
+    WaiterRegister(#[from] waiters::RegisterError),
+
+    #[error(transparent)]
+    WaiterWait(#[from] waiters::WaitError),
+
+    #[error(transparent)]
+    Io(#[from] std::io::Error),
+}
+
+impl UserFacingError for LinkAuthError {
+    fn to_string_client(&self) -> String {
+        use LinkAuthError::*;
+        match self {
+            AuthFailed(_) => self.to_string(),
+            _ => "Internal error".to_string(),
+        }
+    }
+}
+
 fn hello_message(redirect_uri: &str, session_id: &str) -> String {
    format!(
        concat![
@@ -34,7 +61,7 @@ pub async fn handle_user(
            .await?;

        // Wait for web console response (see `mgmt`)
-        waiter.await?.map_err(auth::AuthErrorImpl::auth_failed)
+        waiter.await?.map_err(LinkAuthError::AuthFailed)
    })
    .await?;

--- a/proxy/src/auth/backend/postgres.rs
+++ b/proxy/src/auth/backend/postgres.rs
@@ -3,7 +3,7 @@
 use crate::{
    auth::{
        self,
-        backend::console::{self, AuthInfo, Result},
+        backend::console::{self, AuthInfo, GetAuthInfoError, TransportError, WakeComputeError},
        ClientCredentials,
    },
    compute::{self, ComputeConnCfg},
@@ -20,6 +20,13 @@ pub(super) struct Api<'a> {
    creds: &'a ClientCredentials,
 }

+// Helps eliminate graceless `.map_err` calls without introducing another ctor.
+impl From<tokio_postgres::Error> for TransportError {
+    fn from(e: tokio_postgres::Error) -> Self {
+        io_error(e).into()
+    }
+}
+
 impl<'a> Api<'a> {
    /// Construct an API object containing the auth parameters.
    pub(super) fn new(endpoint: &'a ApiUrl, creds: &'a ClientCredentials) -> Self {
@@ -36,21 +43,16 @@ impl<'a> Api<'a> {
    }

    /// This implementation fetches the auth info from a local postgres instance.
-    async fn get_auth_info(&self) -> Result<AuthInfo> {
+    async fn get_auth_info(&self) -> Result<AuthInfo, GetAuthInfoError> {
        // Perhaps we could persist this connection, but then we'd have to
        // write more code for reopening it if it got closed, which doesn't
        // seem worth it.
        let (client, connection) =
-            tokio_postgres::connect(self.endpoint.as_str(), tokio_postgres::NoTls)
-                .await
-                .map_err(io_error)?;
+            tokio_postgres::connect(self.endpoint.as_str(), tokio_postgres::NoTls).await?;

        tokio::spawn(connection);
        let query = "select rolpassword from pg_catalog.pg_authid where rolname = $1";
-        let rows = client
-            .query(query, &[&self.creds.user])
-            .await
-            .map_err(io_error)?;
+        let rows = client.query(query, &[&self.creds.user]).await?;

        match &rows[..] {
            // We can't get a secret if there's no such user.
@@ -74,13 +76,13 @@ impl<'a> Api<'a> {
                        }))
                    })
                    // Putting the secret into this message is a security hazard!
-                    .ok_or(console::ConsoleAuthError::BadSecret)
+                    .ok_or(GetAuthInfoError::BadSecret)
            }
        }
    }

    /// We don't need to wake anything locally, so we just return the connection info.
-    pub(super) async fn wake_compute(&self) -> Result<ComputeConnCfg> {
+    pub(super) async fn wake_compute(&self) -> Result<ComputeConnCfg, WakeComputeError> {
        let mut config = ComputeConnCfg::new();
        config
            .host(self.endpoint.host_str().unwrap_or("localhost"))
--- a/proxy/src/auth/flow.rs
+++ b/proxy/src/auth/flow.rs
@@ -75,13 +75,12 @@ impl<S: AsyncRead + AsyncWrite + Unpin> AuthFlow<'_, S, PasswordHack> {
            .strip_suffix(&[0])
            .ok_or(AuthErrorImpl::MalformedPassword("missing terminator"))?;

-        // The so-called "password" should contain a base64-encoded json.
-        // We will use it later to route the client to their project.
-        let bytes = base64::decode(password)
-            .map_err(|_| AuthErrorImpl::MalformedPassword("bad encoding"))?;
-
-        let payload = serde_json::from_slice(&bytes)
-            .map_err(|_| AuthErrorImpl::MalformedPassword("invalid payload"))?;
+        let payload = PasswordHackPayload::parse(password)
+            // If we ended up here and the payload is malformed, it means that
+            // the user neither enabled SNI nor resorted to any other method
+            // for passing the project name we rely on. We should show them
+            // the most helpful error message and point to the documentation.
+            .ok_or(AuthErrorImpl::MissingProjectName)?;

        Ok(payload)
    }
@@ -98,7 +97,7 @@ impl<S: AsyncRead + AsyncWrite + Unpin> AuthFlow<'_, S, Scram<'_>> {

        // Currently, the only supported SASL method is SCRAM.
        if !scram::METHODS.contains(&sasl.method) {
-            return Err(AuthErrorImpl::auth_failed("method not supported").into());
+            return Err(super::AuthError::bad_auth_method(sasl.method));
        }

        let secret = self.state.0;
--- a/proxy/src/auth/password_hack.rs
+++ b/proxy/src/auth/password_hack.rs
@@ -1,102 +1,46 @@
 //! Payload for ad hoc authentication method for clients that don't support SNI.
 //! See the `impl` for [`super::backend::BackendType<ClientCredentials>`].
 //! Read more: <https://github.com/neondatabase/cloud/issues/1620#issuecomment-1165332290>.
+//! UPDATE (Mon Aug  8 13:20:34 UTC 2022): the payload format has been simplified.

-use serde::{de, Deserialize, Deserializer};
-use std::fmt;
+use bstr::ByteSlice;

-#[derive(Deserialize)]
-#[serde(untagged)]
-pub enum Password {
-    /// A regular string for utf-8 encoded passwords.
-    Simple { password: String },
-
-    /// Password is base64-encoded because it may contain arbitrary byte sequences.
-    Encoded {
-        #[serde(rename = "password_", deserialize_with = "deserialize_base64")]
-        password: Vec<u8>,
-    },
-}
-
-impl AsRef<[u8]> for Password {
-    fn as_ref(&self) -> &[u8] {
-        match self {
-            Password::Simple { password } => password.as_ref(),
-            Password::Encoded { password } => password.as_ref(),
-        }
-    }
-}
-
-#[derive(Deserialize)]
 pub struct PasswordHackPayload {
    pub project: String,
-
-    #[serde(flatten)]
-    pub password: Password,
+    pub password: Vec<u8>,
 }

-fn deserialize_base64<'a, D: Deserializer<'a>>(des: D) -> Result<Vec<u8>, D::Error> {
-    // It's very tempting to replace this with
-    //
-    // ```
-    // let base64: &str = Deserialize::deserialize(des)?;
-    // base64::decode(base64).map_err(serde::de::Error::custom)
-    // ```
-    //
-    // Unfortunately, we can't always deserialize into `&str`, so we'd
-    // have to use an allocating `String` instead. Thus, visitor is better.
-    struct Visitor;
+impl PasswordHackPayload {
+    pub fn parse(bytes: &[u8]) -> Option<Self> {
+        // The format is `project=<utf-8>;<password-bytes>`.
+        let mut iter = bytes.strip_prefix(b"project=")?.splitn_str(2, ";");
+        let project = iter.next()?.to_str().ok()?.to_owned();
+        let password = iter.next()?.to_owned();

-    impl<'de> de::Visitor<'de> for Visitor {
-        type Value = Vec<u8>;
-
-        fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
-            formatter.write_str("a string")
-        }
-
-        fn visit_str<E: de::Error>(self, v: &str) -> Result<Self::Value, E> {
-            base64::decode(v).map_err(de::Error::custom)
-        }
+        Some(Self { project, password })
    }
-
-    des.deserialize_str(Visitor)
 }

 #[cfg(test)]
 mod tests {
    use super::*;
-    use rstest::rstest;
-    use serde_json::json;

    #[test]
-    fn parse_password() -> anyhow::Result<()> {
-        let password: Password = serde_json::from_value(json!({
-            "password": "foo",
-        }))?;
-        assert_eq!(password.as_ref(), "foo".as_bytes());
+    fn parse_password_hack_payload() {
+        let bytes = b"";
+        assert!(PasswordHackPayload::parse(bytes).is_none());

-        let password: Password = serde_json::from_value(json!({
-            "password_": base64::encode("foo"),
-        }))?;
-        assert_eq!(password.as_ref(), "foo".as_bytes());
+        let bytes = b"project=";
+        assert!(PasswordHackPayload::parse(bytes).is_none());

-        Ok(())
-    }
+        let bytes = b"project=;";
+        let payload = PasswordHackPayload::parse(bytes).expect("parsing failed");
+        assert_eq!(payload.project, "");
+        assert_eq!(payload.password, b"");

-    #[rstest]
-    #[case("password", str::to_owned)]
-    #[case("password_", base64::encode)]
-    fn parse(#[case] key: &str, #[case] encode: fn(&'static str) -> String) -> anyhow::Result<()> {
-        let (password, project) = ("password", "pie-in-the-sky");
-        let payload = json!({
-            "project": project,
-            key: encode(password),
-        });
-
-        let payload: PasswordHackPayload = serde_json::from_value(payload)?;
-        assert_eq!(payload.password.as_ref(), password.as_bytes());
-        assert_eq!(payload.project, project);
-
-        Ok(())
+        let bytes = b"project=foobar;pass;word";
+        let payload = PasswordHackPayload::parse(bytes).expect("parsing failed");
+        assert_eq!(payload.project, "foobar");
+        assert_eq!(payload.password, b"pass;word");
    }
 }
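
The simplified payload format introduced in `password_hack.rs` above is `project=<utf-8>;<password-bytes>`, split on the first `;` only. The following is a minimal sketch of the same rule in Python, not part of the proxy: the function name `parse_password_hack_payload` is hypothetical, and the behaviour is assumed to match the Rust test cases shown in the diff.

```python
from typing import Optional, Tuple


def parse_password_hack_payload(data: bytes) -> Optional[Tuple[str, bytes]]:
    """Hypothetical Python mirror of the Rust `PasswordHackPayload::parse`."""
    prefix = b"project="
    if not data.startswith(prefix):
        return None
    # Split on the first ';' only, so the password may itself contain ';'.
    project_raw, sep, password = data[len(prefix):].partition(b";")
    if not sep:
        return None  # no separator, hence no password part
    try:
        project = project_raw.decode("utf-8")
    except UnicodeDecodeError:
        return None  # the project name must be valid UTF-8
    return project, password
```

The password part is kept as raw bytes because it may contain arbitrary byte sequences; only the project name is required to be UTF-8.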
--- a/proxy/src/proxy.rs
+++ b/proxy/src/proxy.rs
@@ -4,8 +4,8 @@ use crate::config::{ProxyConfig, TlsConfig};
 use crate::stream::{MetricsStream, PqStream, Stream};
 use anyhow::{bail, Context};
 use futures::TryFutureExt;
-use lazy_static::lazy_static;
 use metrics::{register_int_counter, IntCounter};
+use once_cell::sync::Lazy;
 use std::sync::Arc;
 use tokio::io::{AsyncRead, AsyncWrite};
 use utils::pq_proto::{BeMessage as Be, *};
@@ -13,23 +13,29 @@ use utils::pq_proto::{BeMessage as Be, *};
 const ERR_INSECURE_CONNECTION: &str = "connection is insecure (try using `sslmode=require`)";
 const ERR_PROTO_VIOLATION: &str = "protocol violation";

-lazy_static! {
-    static ref NUM_CONNECTIONS_ACCEPTED_COUNTER: IntCounter = register_int_counter!(
+static NUM_CONNECTIONS_ACCEPTED_COUNTER: Lazy<IntCounter> = Lazy::new(|| {
+    register_int_counter!(
        "proxy_accepted_connections_total",
        "Number of TCP client connections accepted."
    )
-    .unwrap();
-    static ref NUM_CONNECTIONS_CLOSED_COUNTER: IntCounter = register_int_counter!(
+    .unwrap()
+});
+
+static NUM_CONNECTIONS_CLOSED_COUNTER: Lazy<IntCounter> = Lazy::new(|| {
+    register_int_counter!(
        "proxy_closed_connections_total",
        "Number of TCP client connections closed."
    )
-    .unwrap();
-    static ref NUM_BYTES_PROXIED_COUNTER: IntCounter = register_int_counter!(
+    .unwrap()
+});
+
+static NUM_BYTES_PROXIED_COUNTER: Lazy<IntCounter> = Lazy::new(|| {
+    register_int_counter!(
        "proxy_io_bytes_total",
        "Number of bytes sent/received between any client and backend."
    )
-    .unwrap();
-}
+    .unwrap()
+});

 /// A small combinator for pluggable error logging.
 async fn log_error<R, F>(future: F) -> F::Output
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -26,6 +26,7 @@ pytest-lazy-fixture = "^0.6.3"
 prometheus-client = "^0.14.1"
 pytest-timeout = "^2.1.0"
 Werkzeug = "2.1.2"
+pytest-order = "^1.0.1"

 [tool.poetry.dev-dependencies]
 yapf = "==0.31.0"
--- a/pytest.ini
+++ b/pytest.ini
@@ -2,6 +2,7 @@
 filterwarnings =
    error::pytest.PytestUnhandledThreadExceptionWarning
    error::UserWarning
+    ignore:record_property is incompatible with junit_family:pytest.PytestWarning
 addopts =
    -m 'not remote_cluster'
 markers =
--- a/safekeeper/Cargo.toml
+++ b/safekeeper/Cargo.toml
@@ -9,7 +9,6 @@ bytes = "1.0.1"
 byteorder = "1.4.3"
 hyper = "0.14"
 fs2 = "0.4.3"
-lazy_static = "1.4.0"
 serde_json = "1"
 tracing = "0.1.27"
 clap = "3.0"
@@ -29,7 +28,7 @@ const_format = "0.2.21"
 tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
 git-version = "0.3.5"
 async-trait = "0.1"
-once_cell = "1.10.0"
+once_cell = "1.13.0"
 toml_edit = { version = "0.13", features = ["easy"] }

 postgres_ffi = { path = "../libs/postgres_ffi" }
--- a/safekeeper/src/control_file.rs
+++ b/safekeeper/src/control_file.rs
@@ -2,7 +2,7 @@

 use anyhow::{bail, ensure, Context, Result};
 use byteorder::{LittleEndian, ReadBytesExt, WriteBytesExt};
-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;

 use std::fs::{self, File, OpenOptions};
 use std::io::{Read, Write};
@@ -26,15 +26,15 @@ const CONTROL_FILE_NAME: &str = "safekeeper.control";
 const CONTROL_FILE_NAME_PARTIAL: &str = "safekeeper.control.partial";
 pub const CHECKSUM_SIZE: usize = std::mem::size_of::<u32>();

-lazy_static! {
-    static ref PERSIST_CONTROL_FILE_SECONDS: HistogramVec = register_histogram_vec!(
+static PERSIST_CONTROL_FILE_SECONDS: Lazy<HistogramVec> = Lazy::new(|| {
+    register_histogram_vec!(
        "safekeeper_persist_control_file_seconds",
        "Seconds to persist and sync control file, grouped by timeline",
        &["tenant_id", "timeline_id"],
        DISK_WRITE_SECONDS_BUCKETS.to_vec()
    )
-    .expect("Failed to register safekeeper_persist_control_file_seconds histogram vec");
-}
+    .expect("Failed to register safekeeper_persist_control_file_seconds histogram vec")
+});

 /// Storage should keep actual state inside of it. It should implement Deref
 /// trait to access state fields and have persist method for updating that state.
--- a/safekeeper/src/safekeeper.rs
+++ b/safekeeper/src/safekeeper.rs
@@ -727,7 +727,7 @@ where
                info!("setting local_start_lsn to {:?}", state.local_start_lsn);
            }
            // Initializing commit_lsn before acking first flushed record is
-            // important to let find_end_of_wal skip the whole in the beginning
+            // important to let find_end_of_wal skip the hole in the beginning
            // of the first segment.
            //
            // NB: on new clusters, this happens at the same time as
@@ -738,6 +738,10 @@ where

            // Initializing backup_lsn is useful to avoid making backup think it should upload 0 segment.
            self.inmem.backup_lsn = max(self.inmem.backup_lsn, state.timeline_start_lsn);
+            // Initializing remote_consistent_lsn signals that we have nothing to
+            // stream to pageserver(s) immediately after creation.
+            self.inmem.remote_consistent_lsn =
+                max(self.inmem.remote_consistent_lsn, state.timeline_start_lsn);

            state.acceptor_state.term_history = msg.term_history.clone();
            self.persist_control_file(state)?;
--- a/safekeeper/src/timeline.rs
+++ b/safekeeper/src/timeline.rs
@@ -4,7 +4,7 @@
 use anyhow::{bail, Context, Result};

 use etcd_broker::subscription_value::SkTimelineInfo;
-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;
 use postgres_ffi::xlog_utils::XLogSegNo;

 use serde::Serialize;
@@ -137,7 +137,7 @@ impl SharedState {
        self.is_wal_backup_required()
            // FIXME: add tracking of relevant pageservers and check them here individually,
            // otherwise migration won't work (we suspend too early).
-            || self.sk.inmem.remote_consistent_lsn <= self.sk.inmem.commit_lsn
+            || self.sk.inmem.remote_consistent_lsn < self.sk.inmem.commit_lsn
    }

    /// Mark timeline active/inactive and return whether s3 offloading requires
@@ -559,12 +559,12 @@ struct GlobalTimelinesState {
    wal_backup_launcher_tx: Option<Sender<ZTenantTimelineId>>,
 }

-lazy_static! {
-    static ref TIMELINES_STATE: Mutex<GlobalTimelinesState> = Mutex::new(GlobalTimelinesState {
+static TIMELINES_STATE: Lazy<Mutex<GlobalTimelinesState>> = Lazy::new(|| {
+    Mutex::new(GlobalTimelinesState {
        timelines: HashMap::new(),
        wal_backup_launcher_tx: None,
-    });
-}
+    })
+});

 #[derive(Clone, Copy, Serialize)]
 pub struct TimelineDeleteForceResult {
--- a/safekeeper/src/wal_storage.rs
+++ b/safekeeper/src/wal_storage.rs
@@ -12,7 +12,7 @@ use std::io::{self, Seek, SeekFrom};
 use std::pin::Pin;
 use tokio::io::AsyncRead;

-use lazy_static::lazy_static;
+use once_cell::sync::Lazy;
 use postgres_ffi::xlog_utils::{
    find_end_of_wal, IsPartialXLogFileName, IsXLogFileName, XLogFromFileName, XLogSegNo, PG_TLI,
 };
@@ -38,31 +38,44 @@ use metrics::{register_histogram_vec, Histogram, HistogramVec, DISK_WRITE_SECOND

 use tokio::io::{AsyncReadExt, AsyncSeekExt};

-lazy_static! {
-    // The prometheus crate does not support u64 yet, i64 only (see `IntGauge`).
-    // i64 is faster than f64, so update to u64 when available.
-    static ref WRITE_WAL_BYTES: HistogramVec = register_histogram_vec!(
+// The prometheus crate does not support u64 yet, i64 only (see `IntGauge`).
+// i64 is faster than f64, so update to u64 when available.
+static WRITE_WAL_BYTES: Lazy<HistogramVec> = Lazy::new(|| {
+    register_histogram_vec!(
        "safekeeper_write_wal_bytes",
        "Bytes written to WAL in a single request, grouped by timeline",
        &["tenant_id", "timeline_id"],
-        vec![1.0, 10.0, 100.0, 1024.0, 8192.0, 128.0 * 1024.0, 1024.0 * 1024.0, 10.0 * 1024.0 * 1024.0]
+        vec![
+            1.0,
+            10.0,
+            100.0,
+            1024.0,
+            8192.0,
+            128.0 * 1024.0,
+            1024.0 * 1024.0,
+            10.0 * 1024.0 * 1024.0
+        ]
    )
-    .expect("Failed to register safekeeper_write_wal_bytes histogram vec");
-    static ref WRITE_WAL_SECONDS: HistogramVec = register_histogram_vec!(
+    .expect("Failed to register safekeeper_write_wal_bytes histogram vec")
+});
+static WRITE_WAL_SECONDS: Lazy<HistogramVec> = Lazy::new(|| {
+    register_histogram_vec!(
        "safekeeper_write_wal_seconds",
        "Seconds spent writing and syncing WAL to a disk in a single request, grouped by timeline",
        &["tenant_id", "timeline_id"],
        DISK_WRITE_SECONDS_BUCKETS.to_vec()
    )
-    .expect("Failed to register safekeeper_write_wal_seconds histogram vec");
-    static ref FLUSH_WAL_SECONDS: HistogramVec = register_histogram_vec!(
+    .expect("Failed to register safekeeper_write_wal_seconds histogram vec")
+});
+static FLUSH_WAL_SECONDS: Lazy<HistogramVec> = Lazy::new(|| {
+    register_histogram_vec!(
        "safekeeper_flush_wal_seconds",
        "Seconds spent syncing WAL to a disk, grouped by timeline",
        &["tenant_id", "timeline_id"],
        DISK_WRITE_SECONDS_BUCKETS.to_vec()
    )
-    .expect("Failed to register safekeeper_flush_wal_seconds histogram vec");
-}
+    .expect("Failed to register safekeeper_flush_wal_seconds histogram vec")
+});

 struct WalStorageMetrics {
    write_wal_bytes: Histogram,
@@ -319,7 +332,7 @@ impl Storage for PhysicalStorage {
        self.write_lsn = if state.commit_lsn == Lsn(0) {
            Lsn(0)
        } else {
-            Lsn(find_end_of_wal(&self.timeline_dir, wal_seg_size, true, state.commit_lsn)?.0)
+            find_end_of_wal(&self.timeline_dir, wal_seg_size, state.commit_lsn)?
        };

        self.write_record_lsn = self.write_lsn;
--- a/scripts/export_import_between_pageservers.py
+++ b/scripts/export_import_between_pageservers.py
@@ -0,0 +1,708 @@
+#
+# Script to export tenants from one pageserver and import them into another pageserver.
+#
+# Outline of steps:
+# 1. Get `(last_lsn, prev_lsn)` from old pageserver
+# 2. Get `fullbackup` from old pageserver, which creates a basebackup tar file
+# 3. This tar file might be missing relation files for empty relations, if the pageserver
+#    is old enough (we didn't always store those). So to recreate them, we start a local
+#    vanilla postgres on this basebackup and ask it what relations should exist, then touch
+#    any missing files and re-pack the tar.
+#    TODO This functionality is no longer needed, so we can delete it later if we don't
+#         end up using the same utils for the pg 15 upgrade. Not sure.
+# 4. We import the patched basebackup into a new pageserver
+# 5. We export again via fullbackup, now from the new pageserver and compare the returned
+#    tar file with the one we imported. This confirms that we imported everything that was
+#    exported, but doesn't guarantee correctness (what if we didn't **export** everything
+#    initially?)
+# 6. We wait for the new pageserver's remote_consistent_lsn to catch up
+#
+# For more context on how to use this, see:
+# https://github.com/neondatabase/cloud/wiki/Storage-format-migration
+
+import os
+from os import path
+import shutil
+from pathlib import Path
+import tempfile
+from contextlib import closing
+import psycopg2
+import subprocess
+import argparse
+import time
+import requests
+import uuid
+from psycopg2.extensions import connection as PgConnection
+from typing import Any, Callable, Dict, Iterator, List, Optional, TypeVar, cast, Union, Tuple
+
+###############################################
+### client-side utils copied from test fixtures
+###############################################
+
+Env = Dict[str, str]
+
+_global_counter = 0
+
+
+def global_counter() -> int:
+    """ A really dumb global counter.
+    This is useful for giving output files a unique number, so if we run the
+    same command multiple times we can keep their output separate.
+    """
+    global _global_counter
+    _global_counter += 1
+    return _global_counter
+
+
+def subprocess_capture(capture_dir: str, cmd: List[str], **kwargs: Any) -> str:
+    """ Run a process and capture its output
+    Output will go to files named "cmd_NNN.stdout" and "cmd_NNN.stderr"
+    where "cmd" is the name of the program and NNN is an incrementing
+    counter.
+    If those files already exist, we will overwrite them.
+    Returns basepath for files with captured output.
+    """
+    assert type(cmd) is list
+    base = os.path.basename(cmd[0]) + '_{}'.format(global_counter())
+    basepath = os.path.join(capture_dir, base)
+    stdout_filename = basepath + '.stdout'
+    stderr_filename = basepath + '.stderr'
+
+    with open(stdout_filename, 'w') as stdout_f:
+        with open(stderr_filename, 'w') as stderr_f:
+            print('(capturing output to "{}.stdout")'.format(base))
+            subprocess.run(cmd, **kwargs, stdout=stdout_f, stderr=stderr_f)
+
+    return basepath
+
+
+class PgBin:
+    """ A helper class for executing postgres binaries """
+    def __init__(self, log_dir: Path, pg_distrib_dir):
+        self.log_dir = log_dir
+        self.pg_bin_path = os.path.join(str(pg_distrib_dir), 'bin')
+        self.env = os.environ.copy()
+        self.env['LD_LIBRARY_PATH'] = os.path.join(str(pg_distrib_dir), 'lib')
+
+    def _fixpath(self, command: List[str]):
+        if '/' not in command[0]:
+            command[0] = os.path.join(self.pg_bin_path, command[0])
+
+    def _build_env(self, env_add: Optional[Env]) -> Env:
+        if env_add is None:
+            return self.env
+        env = self.env.copy()
+        env.update(env_add)
+        return env
+
+    def run(self, command: List[str], env: Optional[Env] = None, cwd: Optional[str] = None):
+        """
+        Run one of the postgres binaries.
+        The command should be in list form, e.g. ['pgbench', '-p', '55432']
+        All the necessary environment variables will be set.
+        If the first argument (the command name) doesn't include a path (no '/'
+        characters present), then it will be edited to include the correct path.
+        If you want stdout/stderr captured to files, use `run_capture` instead.
+        """
+
+        self._fixpath(command)
+        print('Running command "{}"'.format(' '.join(command)))
+        env = self._build_env(env)
+        subprocess.run(command, env=env, cwd=cwd, check=True)
+
+    def run_capture(self,
+                    command: List[str],
+                    env: Optional[Env] = None,
+                    cwd: Optional[str] = None,
+                    **kwargs: Any) -> str:
+        """
+        Run one of the postgres binaries, with stderr and stdout redirected to a file.
+        This is just like `run`, but for chatty programs. Returns basepath for files
+        with captured output.
+        """
+
+        self._fixpath(command)
+        print('Running command "{}"'.format(' '.join(command)))
+        env = self._build_env(env)
+        return subprocess_capture(str(self.log_dir),
+                                  command,
+                                  env=env,
+                                  cwd=cwd,
+                                  check=True,
+                                  **kwargs)
+
+
+class PgProtocol:
+    """ Reusable connection logic """
+    def __init__(self, **kwargs):
+        self.default_options = kwargs
+
+    def conn_options(self, **kwargs):
+        conn_options = self.default_options.copy()
+        if 'dsn' in kwargs:
+            conn_options.update(parse_dsn(kwargs['dsn']))
+        conn_options.update(kwargs)
+
+        # Individual statement timeout in seconds. 2 minutes should be
+        # enough for our tests, but if you need longer, you can
+        # change it by calling "SET statement_timeout" after
+        # connecting.
+        if 'options' in conn_options:
+            conn_options['options'] = f"-cstatement_timeout=120s " + conn_options['options']
+        else:
+            conn_options['options'] = "-cstatement_timeout=120s"
+        return conn_options
+
+    # autocommit=True here by default because that's what we need most of the time
+    def connect(self, autocommit=True, **kwargs) -> PgConnection:
+        """
+        Connect to the node.
+        Returns psycopg2's connection object.
+        This method passes all extra params to connstr.
+        """
+        conn = psycopg2.connect(**self.conn_options(**kwargs))
+
+        # WARNING: this setting affects *all* tests!
+        conn.autocommit = autocommit
+        return conn
+
+    def safe_psql(self, query: str, **kwargs: Any) -> List[Tuple[Any, ...]]:
+        """
+        Execute query against the node and return all rows.
+        This method passes all extra params to connstr.
+        """
+        return self.safe_psql_many([query], **kwargs)[0]
+
+    def safe_psql_many(self, queries: List[str], **kwargs: Any) -> List[List[Tuple[Any, ...]]]:
+        """
+        Execute queries against the node and return all rows.
+        This method passes all extra params to connstr.
+        """
+        result: List[List[Any]] = []
+        with closing(self.connect(**kwargs)) as conn:
+            with conn.cursor() as cur:
+                for query in queries:
+                    print(f"Executing query: {query}")
+                    cur.execute(query)
+
+                    if cur.description is None:
+                        result.append([])  # query didn't return data
+                    else:
+                        result.append(cast(List[Any], cur.fetchall()))
+        return result
+
+
+class VanillaPostgres(PgProtocol):
+    def __init__(self, pgdatadir: Path, pg_bin: PgBin, port: int, init=True):
+        super().__init__(host='localhost', port=port, dbname='postgres')
+        self.pgdatadir = pgdatadir
+        self.pg_bin = pg_bin
+        self.running = False
+        if init:
+            self.pg_bin.run_capture(['initdb', '-D', str(pgdatadir)])
+        self.configure([f"port = {port}\n"])
+
+    def configure(self, options: List[str]):
+        """Append lines into postgresql.conf file."""
+        assert not self.running
+        with open(os.path.join(self.pgdatadir, 'postgresql.conf'), 'a') as conf_file:
+            conf_file.write("\n".join(options))
+
+    def start(self, log_path: Optional[str] = None):
+        assert not self.running
+        self.running = True
+
+        if log_path is None:
+            log_path = os.path.join(self.pgdatadir, "pg.log")
+
+        self.pg_bin.run_capture(
+            ['pg_ctl', '-w', '-D', str(self.pgdatadir), '-l', log_path, 'start'])
+
+    def stop(self):
+        assert self.running
+        self.running = False
+        self.pg_bin.run_capture(['pg_ctl', '-w', '-D', str(self.pgdatadir), 'stop'])
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, exc_type, exc, tb):
+        if self.running:
+            self.stop()
+
+
+class NeonPageserverApiException(Exception):
+    pass
+
+
+class NeonPageserverHttpClient(requests.Session):
+    def __init__(self, host, port):
+        super().__init__()
+        self.host = host
+        self.port = port
+
+    def verbose_error(self, res: requests.Response):
+        try:
+            res.raise_for_status()
+        except requests.RequestException as e:
+            try:
+                msg = res.json()['msg']
+            except:
+                msg = ''
+            raise NeonPageserverApiException(msg) from e
+
+    def check_status(self):
+        self.get(f"http://{self.host}:{self.port}/v1/status").raise_for_status()
+
+    def tenant_list(self):
+        res = self.get(f"http://{self.host}:{self.port}/v1/tenant")
+        self.verbose_error(res)
+        res_json = res.json()
+        assert isinstance(res_json, list)
+        return res_json
+
+    def tenant_create(self, new_tenant_id: uuid.UUID, ok_if_exists):
+        res = self.post(
+            f"http://{self.host}:{self.port}/v1/tenant",
+            json={
+                'new_tenant_id': new_tenant_id.hex,
+            },
+        )
+
+        if res.status_code == 409:
+            if ok_if_exists:
+                print(f'could not create tenant: already exists for id {new_tenant_id}')
+            else:
+                res.raise_for_status()
+        elif res.status_code == 201:
+            print(f'created tenant {new_tenant_id}')
+        else:
+            self.verbose_error(res)
+
+        return new_tenant_id
+
+    def timeline_list(self, tenant_id: uuid.UUID):
+        res = self.get(f"http://{self.host}:{self.port}/v1/tenant/{tenant_id.hex}/timeline")
+        self.verbose_error(res)
+        res_json = res.json()
+        assert isinstance(res_json, list)
+        return res_json
+
+    def timeline_detail(self, tenant_id: uuid.UUID, timeline_id: uuid.UUID) -> Dict[Any, Any]:
+        res = self.get(
+            f"http://{self.host}:{self.port}/v1/tenant/{tenant_id.hex}/timeline/{timeline_id.hex}?include-non-incremental-logical-size=1"
+        )
+        self.verbose_error(res)
+        res_json = res.json()
+        assert isinstance(res_json, dict)
+        return res_json
+
+
+def lsn_to_hex(num: int) -> str:
+    """ Convert lsn from int to standard hex notation. """
+    return "{:X}/{:X}".format(num >> 32, num & 0xffffffff)
+
+
+def lsn_from_hex(lsn_hex: str) -> int:
+    """ Convert lsn from hex notation to int. """
+    l, r = lsn_hex.split('/')
+    return (int(l, 16) << 32) + int(r, 16)
+
+
+def remote_consistent_lsn(pageserver_http_client: NeonPageserverHttpClient,
+                          tenant: uuid.UUID,
+                          timeline: uuid.UUID) -> int:
+    detail = pageserver_http_client.timeline_detail(tenant, timeline)
+
+    if detail['remote'] is None:
+        # No remote information at all. This happens right after creating
+        # a timeline, before any part of it has been uploaded to remote
+        # storage yet.
+        return 0
+    else:
+        lsn_str = detail['remote']['remote_consistent_lsn']
+        assert isinstance(lsn_str, str)
+        return lsn_from_hex(lsn_str)
+
+
+def wait_for_upload(pageserver_http_client: NeonPageserverHttpClient,
+                    tenant: uuid.UUID,
+                    timeline: uuid.UUID,
+                    lsn: int):
+    """waits for local timeline upload up to specified lsn"""
+    for i in range(10):
+        current_lsn = remote_consistent_lsn(pageserver_http_client, tenant, timeline)
+        if current_lsn >= lsn:
+            return
+        print("waiting for remote_consistent_lsn to reach {}, now {}, iteration {}".format(
+            lsn_to_hex(lsn), lsn_to_hex(current_lsn), i + 1))
+        time.sleep(1)
+
+    raise Exception("timed out while waiting for remote_consistent_lsn to reach {}, was {}".format(
+        lsn_to_hex(lsn), lsn_to_hex(current_lsn)))
+
+
+##############
+# End of utils
+##############
+
+
+def pack_base(log_dir, restored_dir, output_tar):
+    """Create tar file from basebackup, being careful to produce relative filenames."""
+    tmp_tar_name = "tmp.tar"
+    tmp_tar_path = os.path.join(restored_dir, tmp_tar_name)
+    cmd = ["tar", "-cf", tmp_tar_name] + os.listdir(restored_dir)
+    # We actually cd into the dir and call tar from there. If we call tar from
+    # outside we won't encode filenames as relative, and they won't parse well
+    # on import.
+    subprocess_capture(log_dir, cmd, cwd=restored_dir)
+    shutil.move(tmp_tar_path, output_tar)
+
+
+def reconstruct_paths(log_dir, pg_bin, base_tar):
+    """Reconstruct what relation files should exist in the datadir by querying postgres."""
+    with tempfile.TemporaryDirectory() as restored_dir:
+        # Unpack the base tar
+        subprocess_capture(log_dir, ["tar", "-xf", base_tar, "-C", restored_dir])
+
+        # Start a vanilla postgres from the given datadir and query it to find
+        # what relfiles should exist, but possibly don't.
+        port = "55439"  # Probably free
+        with VanillaPostgres(restored_dir, pg_bin, port, init=False) as vanilla_pg:
+            vanilla_pg.configure([f"port={port}"])
+            vanilla_pg.start(log_path=os.path.join(log_dir, "tmp_pg.log"))
+
+            # Create database based on template0 because we can't connect to template0
+            query = "create database template0copy template template0"
+            vanilla_pg.safe_psql(query, user="cloud_admin")
+            vanilla_pg.safe_psql("CHECKPOINT", user="cloud_admin")
+
+            # Get all databases
+            query = "select oid, datname from pg_database"
+            oid_dbname_pairs = vanilla_pg.safe_psql(query, user="cloud_admin")
+            template0_oid = [
+                oid for (oid, database) in oid_dbname_pairs if database == "template0"
+            ][0]
+
+            # Get rel paths for each database
+            for oid, database in oid_dbname_pairs:
+                if database == "template0":
+                    # We can't connect to template0
+                    continue
+
+                query = "select relname, pg_relation_filepath(oid) from pg_class"
+                result = vanilla_pg.safe_psql(query, user="cloud_admin", dbname=database)
+                for relname, filepath in result:
+                    if filepath is not None:
+
+                        if database == "template0copy":
+                            # Add all template0copy paths to template0
+                            prefix = f"base/{oid}/"
+                            if filepath.startswith(prefix):
+                                suffix = filepath[len(prefix):]
+                                yield f"base/{template0_oid}/{suffix}"
+                            elif filepath.startswith("global"):
+                                print(f"skipping {database} global file {filepath}")
+                            else:
+                                raise AssertionError
+                        else:
+                            yield filepath
+
+
+def touch_missing_rels(log_dir, corrupt_tar, output_tar, paths):
+    """Add the appropriate empty files to a basebackup tar."""
+    with tempfile.TemporaryDirectory() as restored_dir:
+        # Unpack the base tar
+        subprocess_capture(log_dir, ["tar", "-xf", corrupt_tar, "-C", restored_dir])
+
+        # Touch files that don't exist
+        for path in paths:
+            absolute_path = os.path.join(restored_dir, path)
+            exists = os.path.exists(absolute_path)
+            if not exists:
+                print(f"File {absolute_path} didn't exist. Creating..")
+                Path(absolute_path).touch()
+
+        # Repackage
+        pack_base(log_dir, restored_dir, output_tar)
+
+
+# HACK This is a workaround for exporting from old pageservers that
+#      can't export empty relations. In this case we need to start
+#      a vanilla postgres from the exported datadir, and query it
+#      to see what empty relations are missing, and then create
+#      those empty files before importing.
+def add_missing_rels(base_tar, output_tar, log_dir, pg_bin):
+    reconstructed_paths = set(reconstruct_paths(log_dir, pg_bin, base_tar))
+    touch_missing_rels(log_dir, base_tar, output_tar, reconstructed_paths)
+
+
+def get_rlsn(pageserver_connstr, tenant_id, timeline_id):
+    conn = psycopg2.connect(pageserver_connstr)
+    conn.autocommit = True
+    with conn.cursor() as cur:
+        cmd = f"get_last_record_rlsn {tenant_id} {timeline_id}"
+        cur.execute(cmd)
+        res = cur.fetchone()
+        prev_lsn = res[0]
+        last_lsn = res[1]
+    conn.close()
+
+    return last_lsn, prev_lsn
+
+
+def import_timeline(args,
+                    psql_path,
+                    pageserver_connstr,
+                    pageserver_http,
+                    tenant_id,
+                    timeline_id,
+                    last_lsn,
+                    prev_lsn,
+                    tar_filename):
+    # Import timelines to new pageserver
+    import_cmd = f"import basebackup {tenant_id} {timeline_id} {last_lsn} {last_lsn}"
+    full_cmd = rf"""cat {tar_filename} | {psql_path} {pageserver_connstr} -c '{import_cmd}' """
+
+    stderr_filename2 = path.join(args.work_dir, f"import_{tenant_id}_{timeline_id}.stderr")
+    stdout_filename = path.join(args.work_dir, f"import_{tenant_id}_{timeline_id}.stdout")
+
+    print(f"Running: {full_cmd}")
+
+    with open(stdout_filename, 'w') as stdout_f:
+        with open(stderr_filename2, 'w') as stderr_f:
+            print(f"(capturing output to {stdout_filename})")
+            pg_bin = PgBin(args.work_dir, args.pg_distrib_dir)
+            subprocess.run(full_cmd,
+                           stdout=stdout_f,
+                           stderr=stderr_f,
+                           env=pg_bin._build_env(None),
+                           shell=True,
+                           check=True)
+
+            print("Done import")
+
+    # Wait until pageserver persists the files
+    wait_for_upload(pageserver_http,
+                    uuid.UUID(tenant_id),
+                    uuid.UUID(timeline_id),
+                    lsn_from_hex(last_lsn))
+
+
+def export_timeline(args,
+                    psql_path,
+                    pageserver_connstr,
+                    tenant_id,
+                    timeline_id,
+                    last_lsn,
+                    prev_lsn,
+                    tar_filename):
+    # Choose filenames
+    incomplete_filename = tar_filename + ".incomplete"
+    stderr_filename = path.join(args.work_dir, f"{tenant_id}_{timeline_id}.stderr")
+
+    # Construct export command
+    query = f"fullbackup {tenant_id} {timeline_id} {last_lsn} {prev_lsn}"
+    cmd = [psql_path, "--no-psqlrc", pageserver_connstr, "-c", query]
+
+    # Run export command
+    print(f"Running: {cmd}")
+    with open(incomplete_filename, 'w') as stdout_f:
+        with open(stderr_filename, 'w') as stderr_f:
+            print(f"(capturing output to {incomplete_filename})")
+            pg_bin = PgBin(args.work_dir, args.pg_distrib_dir)
+            subprocess.run(cmd,
+                           stdout=stdout_f,
+                           stderr=stderr_f,
+                           env=pg_bin._build_env(None),
+                           check=True)
+
+    # Add missing rels
+    pg_bin = PgBin(args.work_dir, args.pg_distrib_dir)
+    add_missing_rels(incomplete_filename, tar_filename, args.work_dir, pg_bin)
+
+    # Log more info
+    file_size = os.path.getsize(tar_filename)
+    print(f"Done export: {tar_filename}, size {file_size}")
+
+
+def main(args: argparse.Namespace):
+    psql_path = str(Path(args.pg_distrib_dir) / "bin" / "psql")
+
+    old_pageserver_host = args.old_pageserver_host
+    new_pageserver_host = args.new_pageserver_host
+
+    old_http_client = NeonPageserverHttpClient(old_pageserver_host, args.old_pageserver_http_port)
+    old_http_client.check_status()
+    old_pageserver_connstr = f"postgresql://{old_pageserver_host}:{args.old_pageserver_pg_port}"
+
+    new_http_client = NeonPageserverHttpClient(new_pageserver_host, args.new_pageserver_http_port)
+    new_http_client.check_status()
+    new_pageserver_connstr = f"postgresql://{new_pageserver_host}:{args.new_pageserver_pg_port}"
+
+    for tenant_id in args.tenants:
+        print(f"Tenant: {tenant_id}")
+        timelines = old_http_client.timeline_list(uuid.UUID(tenant_id))
+        print(f"Timelines: {timelines}")
+
+        # Create tenant in new pageserver
+        if args.only_import is False and not args.timelines:
+            new_http_client.tenant_create(uuid.UUID(tenant_id), args.ok_if_exists)
+
+        for timeline in timelines:
+            # Skip timelines we don't need to export
+            if args.timelines and timeline['timeline_id'] not in args.timelines:
+                print(f"Skipping timeline {timeline['timeline_id']}")
+                continue
+
+            # Choose filenames
+            tar_filename = path.join(args.work_dir,
+                                     f"{timeline['tenant_id']}_{timeline['timeline_id']}.tar")
+
+            # Export timeline from old pageserver
+            if args.only_import is False:
+                last_lsn, prev_lsn = get_rlsn(
+                    old_pageserver_connstr,
+                    timeline['tenant_id'],
+                    timeline['timeline_id'],
+                )
+                export_timeline(
+                    args,
+                    psql_path,
+                    old_pageserver_connstr,
+                    timeline['tenant_id'],
+                    timeline['timeline_id'],
+                    last_lsn,
+                    prev_lsn,
+                    tar_filename,
+                )
+
+            # Import into new pageserver
+            import_timeline(
+                args,
+                psql_path,
+                new_pageserver_connstr,
+                new_http_client,
+                timeline['tenant_id'],
+                timeline['timeline_id'],
+                last_lsn,
+                prev_lsn,
+                tar_filename,
+            )
+
+            # Re-export and compare
+            re_export_filename = tar_filename + ".reexport"
+            export_timeline(args,
+                            psql_path,
+                            new_pageserver_connstr,
+                            timeline['tenant_id'],
+                            timeline['timeline_id'],
+                            last_lsn,
+                            prev_lsn,
+                            re_export_filename)
+
+            # Check the size is the same
+            old_size = os.path.getsize(tar_filename)
+            new_size = os.path.getsize(re_export_filename)
+            if old_size != new_size:
+                raise AssertionError(f"Sizes don't match old: {old_size} new: {new_size}")
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        '--tenant-id',
+        dest='tenants',
+        required=True,
+        nargs='+',
+        help='Id of the tenant to migrate. You can pass multiple arguments',
+    )
+    parser.add_argument(
+        '--timeline-id',
+        dest='timelines',
+        required=False,
+        nargs='+',
+        help='Id of the timeline to migrate. You can pass multiple arguments',
+    )
+    parser.add_argument(
+        '--from-host',
+        dest='old_pageserver_host',
+        required=True,
+        help='Host of the pageserver to migrate data from',
+    )
+    parser.add_argument(
+        '--from-http-port',
+        dest='old_pageserver_http_port',
+        required=False,
+        type=int,
+        default=9898,
+        help='HTTP port of the pageserver to migrate data from. Default: 9898',
+    )
+    parser.add_argument(
+        '--from-pg-port',
+        dest='old_pageserver_pg_port',
+        required=False,
+        type=int,
+        default=6400,
+        help='pg port of the pageserver to migrate data from. Default: 6400',
+    )
+    parser.add_argument(
+        '--to-host',
+        dest='new_pageserver_host',
+        required=True,
+        help='Host of the pageserver to migrate data to',
+    )
+    parser.add_argument(
+        '--to-http-port',
+        dest='new_pageserver_http_port',
+        required=False,
+        default=9898,
+        type=int,
+        help='HTTP port of the pageserver to migrate data to. Default: 9898',
+    )
+    parser.add_argument(
+        '--to-pg-port',
+        dest='new_pageserver_pg_port',
+        required=False,
+        default=6400,
+        type=int,
+        help='pg port of the pageserver to migrate data to. Default: 6400',
+    )
+    parser.add_argument(
+        '--ignore-tenant-exists',
+        dest='ok_if_exists',
+        action='store_true',
+        help=
+        'Ignore error if we are trying to create the tenant that already exists. It can be dangerous if existing tenant already contains some data.',
+    )
+    parser.add_argument(
+        '--pg-distrib-dir',
+        dest='pg_distrib_dir',
+        required=False,
+        default='/usr/local/',
+        help='Path where postgres binaries are installed. Default: /usr/local/',
+    )
+    parser.add_argument(
+        '--psql-path',
+        dest='psql_path',
+        required=False,
+        default='/usr/local/bin/psql',
+        help='Path to the psql binary. Default: /usr/local/bin/psql',
+    )
+    parser.add_argument(
+        '--only-import',
+        dest='only_import',
+        required=False,
+        default=False,
+        action='store_true',
+        help='Skip export and tenant creation part',
+    )
+    parser.add_argument(
+        '--work-dir',
+        dest='work_dir',
+        required=True,
+        default=False,
+        help='directory where temporary tar files are stored',
+    )
+    args = parser.parse_args()
+    main(args)
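
The `touch_missing_rels` step above boils down to creating an empty file for every expected path that is absent from the unpacked basebackup. A minimal standalone sketch of that idea follows. The helper name `touch_missing` and the example paths are hypothetical, and unlike the script it also creates missing parent directories, which is an assumption about what a caller might want.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, List


def touch_missing(root: str, rel_paths: Iterable[str]) -> List[str]:
    """Create an empty file for each relative path missing under `root`.

    Returns the relative paths that had to be created, in sorted order.
    """
    created = []
    for rel in sorted(set(rel_paths)):
        target = Path(root) / rel
        if not target.exists():
            # Assumption: create parent directories too; the script above does not.
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
            created.append(rel)
    return created


with tempfile.TemporaryDirectory() as root:
    # Pretend the unpacked tar already contains one non-empty relation file.
    existing = Path(root) / "base" / "1" / "1259"
    existing.parent.mkdir(parents=True)
    existing.write_bytes(b"data")

    wanted = ["base/1/1259", "base/1/16384"]
    created = touch_missing(root, wanted)
    sizes = {rel: os.path.getsize(os.path.join(root, rel)) for rel in wanted}
    print(created, sizes)
```

Existing files are left untouched, so only relations that were genuinely missing from the export end up as zero-length files.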
--- a/test_runner/batch_others/test_fsm_truncate.py
+++ b/test_runner/batch_others/test_fsm_truncate.py
@@ -0,0 +1,11 @@
+from fixtures.log_helper import log
+from fixtures.neon_fixtures import NeonEnv, NeonEnvBuilder, NeonPageserverHttpClient
+import pytest
+
+
+def test_fsm_truncate(neon_env_builder: NeonEnvBuilder):
+    env = neon_env_builder.init_start()
+    env.neon_cli.create_branch("test_fsm_truncate")
+    pg = env.postgres.create_start('test_fsm_truncate')
+    pg.safe_psql(
+        'CREATE TABLE t1(key int); CREATE TABLE t2(key int); TRUNCATE TABLE t1; TRUNCATE TABLE t2;')
--- a/test_runner/batch_others/test_import.py
+++ b/test_runner/batch_others/test_import.py
@@ -1,9 +1,10 @@
+import re
 import pytest
-from fixtures.neon_fixtures import NeonEnvBuilder, wait_for_upload, wait_for_last_record_lsn
-from fixtures.utils import lsn_from_hex, lsn_to_hex
+from fixtures.neon_fixtures import NeonEnv, NeonEnvBuilder, PgBin, Postgres, wait_for_upload, wait_for_last_record_lsn
+from fixtures.utils import lsn_from_hex
 from uuid import UUID, uuid4
-import tarfile
 import os
+import tarfile
 import shutil
 from pathlib import Path
 import json
@@ -105,20 +106,63 @@ def test_import_from_vanilla(test_output_dir, pg_bin, vanilla_pg, neon_env_build


@pytest.mark.timeout(600)
-def test_import_from_pageserver(test_output_dir, pg_bin, vanilla_pg, neon_env_builder):
-
-    num_rows = 3000
+def test_import_from_pageserver_small(pg_bin: PgBin, neon_env_builder: NeonEnvBuilder):
    neon_env_builder.num_safekeepers = 1
    neon_env_builder.enable_local_fs_remote_storage()
    env = neon_env_builder.init_start()

-    env.neon_cli.create_branch('test_import_from_pageserver')
-    pgmain = env.postgres.create_start('test_import_from_pageserver')
-    log.info("postgres is running on 'test_import_from_pageserver' branch")
+    timeline = env.neon_cli.create_branch('test_import_from_pageserver_small')
+    pg = env.postgres.create_start('test_import_from_pageserver_small')

-    timeline = pgmain.safe_psql("SHOW neon.timeline_id")[0][0]
+    num_rows = 3000
+    lsn = _generate_data(num_rows, pg)
+    _import(num_rows, lsn, env, pg_bin, timeline)

-    with closing(pgmain.connect()) as conn:
+
+@pytest.mark.timeout(1800)
+# TODO: `test_import_from_pageserver_multisegment` is temporarily disabled; re-enable
+# it once the cause of the failure is found.
+# @pytest.mark.skipif(os.environ.get('BUILD_TYPE') == "debug", reason="only run with release build")
+@pytest.mark.skip("See https://github.com/neondatabase/neon/issues/2255")
+def test_import_from_pageserver_multisegment(pg_bin: PgBin, neon_env_builder: NeonEnvBuilder):
+    neon_env_builder.num_safekeepers = 1
+    neon_env_builder.enable_local_fs_remote_storage()
+    env = neon_env_builder.init_start()
+
+    timeline = env.neon_cli.create_branch('test_import_from_pageserver_multisegment')
+    pg = env.postgres.create_start('test_import_from_pageserver_multisegment')
+
+    # For `test_import_from_pageserver_multisegment`, we want to make sure that the data
+    # is large enough to create multi-segment files. Typically, a segment file's size is
+    # at most 1GB. A large number of inserted rows (`30000000`) is used to increase the
+    # DB size to above 1GB. Related: https://github.com/neondatabase/neon/issues/2097.
+    num_rows = 30000000
+    lsn = _generate_data(num_rows, pg)
+
+    logical_size = env.pageserver.http_client().timeline_detail(
+        env.initial_tenant, timeline)['local']['current_logical_size']
+    log.info(f"timeline logical size = {logical_size / (1024 ** 2)}MB")
+    assert logical_size > 1024**3  # = 1GB
+
+    tar_output_file = _import(num_rows, lsn, env, pg_bin, timeline)
+
+    # Check if the backup data contains multiple segment files
+    cnt_seg_files = 0
+    segfile_re = re.compile('[0-9]+\\.[0-9]+')
+    with tarfile.open(tar_output_file, "r") as tar_f:
+        for f in tar_f.getnames():
+            if segfile_re.search(f) is not None:
+                cnt_seg_files += 1
+                log.info(f"Found a segment file: {f} in the backup archive file")
+    assert cnt_seg_files > 0
+
+
+def _generate_data(num_rows: int, pg: Postgres) -> str:
+    """Generate a table with `num_rows` rows.
+
+    Returns:
+    the latest insert WAL's LSN"""
+    with closing(pg.connect()) as conn:
        with conn.cursor() as cur:
            # data loading may take a while, so increase statement timeout
            cur.execute("SET statement_timeout='300s'")
@@ -127,15 +171,28 @@ def test_import_from_pageserver(test_output_dir, pg_bin, vanilla_pg, neon_env_bu
            cur.execute("CHECKPOINT")

            cur.execute('SELECT pg_current_wal_insert_lsn()')
-            lsn = cur.fetchone()[0]
-            log.info(f"start_backup_lsn = {lsn}")
+            res = cur.fetchone()
+            assert res is not None and isinstance(res[0], str)
+            return res[0]
+
+
+def _import(expected_num_rows: int, lsn: str, env: NeonEnv, pg_bin: PgBin, timeline: UUID) -> str:
+    """Test importing backup data to the pageserver.
+
+    Args:
+    expected_num_rows: the expected number of rows of the test table in the backup data
+    lsn: the backup's base LSN
+
+    Returns:
+    path to the backup archive file"""
+    log.info(f"start_backup_lsn = {lsn}")

    # Set LD_LIBRARY_PATH in the env properly, otherwise we may use the wrong libpq.
    # PgBin sets it automatically, but here we need to pipe psql output to the tar command.
    psql_env = {'LD_LIBRARY_PATH': os.path.join(str(pg_distrib_dir), 'lib')}

    # Get a fullbackup from pageserver
-    query = f"fullbackup { env.initial_tenant.hex} {timeline} {lsn}"
+    query = f"fullbackup { env.initial_tenant.hex} {timeline.hex} {lsn}"
    cmd = ["psql", "--no-psqlrc", env.pageserver.connstr(), "-c", query]
    result_basepath = pg_bin.run_capture(cmd, env=psql_env)
    tar_output_file = result_basepath + ".stdout"
@@ -152,7 +209,7 @@ def test_import_from_pageserver(test_output_dir, pg_bin, vanilla_pg, neon_env_bu
    env.pageserver.start()

    # Import using another tenantid, because we use the same pageserver.
-    # TODO Create another pageserver to maeke test more realistic.
+    # TODO Create another pageserver to make test more realistic.
    tenant = uuid4()

    # Import to pageserver
@@ -165,7 +222,7 @@ def test_import_from_pageserver(test_output_dir, pg_bin, vanilla_pg, neon_env_bu
        "--tenant-id",
        tenant.hex,
        "--timeline-id",
-        timeline,
+        timeline.hex,
        "--node-name",
        node_name,
        "--base-lsn",
@@ -175,15 +232,15 @@ def test_import_from_pageserver(test_output_dir, pg_bin, vanilla_pg, neon_env_bu
    ])

    # Wait for data to land in s3
-    wait_for_last_record_lsn(client, tenant, UUID(timeline), lsn_from_hex(lsn))
-    wait_for_upload(client, tenant, UUID(timeline), lsn_from_hex(lsn))
+    wait_for_last_record_lsn(client, tenant, timeline, lsn_from_hex(lsn))
+    wait_for_upload(client, tenant, timeline, lsn_from_hex(lsn))

    # Check it worked
    pg = env.postgres.create_start(node_name, tenant_id=tenant)
-    assert pg.safe_psql('select count(*) from tbl') == [(num_rows, )]
+    assert pg.safe_psql('select count(*) from tbl') == [(expected_num_rows, )]

    # Take another fullbackup
-    query = f"fullbackup { tenant.hex} {timeline} {lsn}"
+    query = f"fullbackup { tenant.hex} {timeline.hex} {lsn}"
    cmd = ["psql", "--no-psqlrc", env.pageserver.connstr(), "-c", query]
    result_basepath = pg_bin.run_capture(cmd, env=psql_env)
    new_tar_output_file = result_basepath + ".stdout"
@@ -195,4 +252,6 @@ def test_import_from_pageserver(test_output_dir, pg_bin, vanilla_pg, neon_env_bu
    # Check that gc works
    psconn = env.pageserver.connect()
    pscur = psconn.cursor()
-    pscur.execute(f"do_gc {tenant.hex} {timeline} 0")
+    pscur.execute(f"do_gc {tenant.hex} {timeline.hex} 0")
+
+    return tar_output_file
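
The multisegment check above counts archive members whose names look like segment files. The idea can be tried in isolation on an in-memory tar, as in the sketch below. The helpers `count_segment_files` and `make_tar` and the member names are hypothetical, and the pattern is deliberately stricter than the one in the test: it assumes a non-first segment is named `<relfilenode>.<segno>` and anchors on the final path component, so fork segments such as `16384_fsm.1` are not counted.

```python
import io
import re
import tarfile
from typing import Iterable

# Assumption: a non-first segment file's final path component is "<digits>.<digits>".
SEGFILE_RE = re.compile(r"(?:^|/)[0-9]+\.[0-9]+$")


def count_segment_files(tar_bytes: bytes) -> int:
    """Count archive members whose final path component looks like '<digits>.<digits>'."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar_f:
        return sum(1 for name in tar_f.getnames() if SEGFILE_RE.search(name))


def make_tar(names: Iterable[str]) -> bytes:
    """Build an in-memory tar holding one empty regular file per name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar_f:
        for name in names:
            tar_f.addfile(tarfile.TarInfo(name))
    return buf.getvalue()


archive = make_tar([
    "base/13010/16384",      # first segment: no numeric suffix
    "base/13010/16384.1",    # second segment
    "base/13010/16384_fsm",  # free space map fork
    "PG_VERSION",
])
n_segments = count_segment_files(archive)
print(n_segments)
```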
--- a/test_runner/batch_others/test_proxy.py
+++ b/test_runner/batch_others/test_proxy.py
@@ -1,6 +1,5 @@
 import pytest
-import json
-import base64
+import psycopg2


 def test_proxy_select_1(static_proxy):
@@ -13,22 +12,14 @@ def test_password_hack(static_proxy):
    static_proxy.safe_psql(f"create role {user} with login password '{password}'",
                           options='project=irrelevant')

-    def encode(s: str) -> str:
-        return base64.b64encode(s.encode('utf-8')).decode('utf-8')
-
-    magic = encode(json.dumps({
-        'project': 'irrelevant',
-        'password': password,
-    }))
-
+    # Note the format of `magic`!
+    magic = f"project=irrelevant;{password}"
    static_proxy.safe_psql('select 1', sslsni=0, user=user, password=magic)

-    magic = encode(json.dumps({
-        'project': 'irrelevant',
-        'password_': encode(password),
-    }))
-
-    static_proxy.safe_psql('select 1', sslsni=0, user=user, password=magic)
+    # Must also check that invalid magic won't be accepted.
+    with pytest.raises(psycopg2.errors.OperationalError):
+        magic = "broken"
+        static_proxy.safe_psql('select 1', sslsni=0, user=user, password=magic)


 # Pass extra options to the server.
--- a/test_runner/batch_others/test_remote_storage.py
+++ b/test_runner/batch_others/test_remote_storage.py
@@ -110,7 +110,7 @@ def test_remote_storage_backup_and_restore(
    client.tenant_attach(UUID(tenant_id))

    log.info("waiting for timeline redownload")
-    wait_until(number_of_iterations=10,
+    wait_until(number_of_iterations=20,
               interval=1,
               func=lambda: assert_timeline_local(client, UUID(tenant_id), UUID(timeline_id)))

--- a/test_runner/batch_others/test_tenant_relocation.py
+++ b/test_runner/batch_others/test_tenant_relocation.py
@@ -229,7 +229,7 @@ def post_migration_check(pg: Postgres, sum_before_migration: int, old_local_path
        # basebackup and importing it into the new pageserver.
        # This kind of migration can tolerate breaking changes
        # to storage format
-        pytest.param('major', marks=pytest.mark.xfail(reason="Not implemented")),
+        'major',
    ])
@pytest.mark.parametrize('with_load', ['with_load', 'without_load'])
 def test_tenant_relocation(neon_env_builder: NeonEnvBuilder,
@@ -345,6 +345,8 @@ def test_tenant_relocation(neon_env_builder: NeonEnvBuilder,
        # Migrate either by attaching from s3 or import/export basebackup
        if method == "major":
            cmd = [
+                "poetry",
+                "run",
                "python",
                os.path.join(base_dir, "scripts/export_import_between_pageservers.py"),
                "--tenant-id",
@@ -361,12 +363,12 @@ def test_tenant_relocation(neon_env_builder: NeonEnvBuilder,
                str(new_pageserver_http_port),
                "--to-pg-port",
                str(new_pageserver_pg_port),
-                "--psql-path",
-                os.path.join(pg_distrib_dir, "bin", "psql"),
+                "--pg-distrib-dir",
+                pg_distrib_dir,
                "--work-dir",
                os.path.join(test_output_dir),
            ]
-            subprocess_capture(str(env.repo_dir), cmd, check=True)
+            subprocess_capture(test_output_dir, cmd, check=True)
        elif method == "minor":
            # call to attach timeline to new pageserver
            new_pageserver_http.tenant_attach(tenant_id)
@@ -427,6 +429,22 @@ def test_tenant_relocation(neon_env_builder: NeonEnvBuilder,
        post_migration_check(pg_main, 500500, old_local_path_main)
        post_migration_check(pg_second, 1001000, old_local_path_second)

+        # ensure that we can successfully read all relations on the new pageserver
+        with pg_cur(pg_second) as cur:
+            cur.execute('''
+                DO $$
+                DECLARE
+                r RECORD;
+                BEGIN
+                FOR r IN
+                SELECT relname FROM pg_class WHERE relkind='r'
+                LOOP
+                    RAISE NOTICE '%', r.relname;
+                    EXECUTE 'SELECT count(*) FROM quote_ident($1)' USING r.relname;
+                END LOOP;
+                END$$;
+                ''')
+
        if with_load == 'with_load':
            assert load_ok_event.wait(3)
            log.info('stopping load thread')
--- a/test_runner/batch_others/test_timeline_size.py
+++ b/test_runner/batch_others/test_timeline_size.py
@@ -4,7 +4,7 @@ from uuid import UUID
 import re
 import psycopg2.extras
 import psycopg2.errors
-from fixtures.neon_fixtures import NeonEnv, NeonEnvBuilder, Postgres, assert_timeline_local
+from fixtures.neon_fixtures import NeonEnv, NeonEnvBuilder, Postgres, assert_timeline_local, wait_for_last_flush_lsn
 from fixtures.log_helper import log
 import time

@@ -192,6 +192,8 @@ def test_timeline_physical_size_init(neon_simple_env: NeonEnv):
           FROM generate_series(1, 1000) g""",
    ])

+    wait_for_last_flush_lsn(env, pg, env.initial_tenant, new_timeline_id)
+
    # restart the pageserer to force calculating timeline's initial physical size
    env.pageserver.stop()
    env.pageserver.start()
@@ -211,7 +213,9 @@ def test_timeline_physical_size_post_checkpoint(neon_simple_env: NeonEnv):
           FROM generate_series(1, 1000) g""",
    ])

+    wait_for_last_flush_lsn(env, pg, env.initial_tenant, new_timeline_id)
    env.pageserver.safe_psql(f"checkpoint {env.initial_tenant.hex} {new_timeline_id.hex}")
+
    assert_physical_size(env, env.initial_tenant, new_timeline_id)


@@ -232,8 +236,10 @@ def test_timeline_physical_size_post_compaction(neon_env_builder: NeonEnvBuilder
           FROM generate_series(1, 100000) g""",
    ])

+    wait_for_last_flush_lsn(env, pg, env.initial_tenant, new_timeline_id)
    env.pageserver.safe_psql(f"checkpoint {env.initial_tenant.hex} {new_timeline_id.hex}")
    env.pageserver.safe_psql(f"compact {env.initial_tenant.hex} {new_timeline_id.hex}")
+
    assert_physical_size(env, env.initial_tenant, new_timeline_id)


@@ -254,15 +260,21 @@ def test_timeline_physical_size_post_gc(neon_env_builder: NeonEnvBuilder):
           SELECT 'long string to consume some space' || g
           FROM generate_series(1, 100000) g""",
    ])
+
+    wait_for_last_flush_lsn(env, pg, env.initial_tenant, new_timeline_id)
    env.pageserver.safe_psql(f"checkpoint {env.initial_tenant.hex} {new_timeline_id.hex}")
+
    pg.safe_psql("""
        INSERT INTO foo
            SELECT 'long string to consume some space' || g
            FROM generate_series(1, 100000) g
    """)
+
+    wait_for_last_flush_lsn(env, pg, env.initial_tenant, new_timeline_id)
    env.pageserver.safe_psql(f"checkpoint {env.initial_tenant.hex} {new_timeline_id.hex}")

    env.pageserver.safe_psql(f"do_gc {env.initial_tenant.hex} {new_timeline_id.hex} 0")
+
    assert_physical_size(env, env.initial_tenant, new_timeline_id)


@@ -279,6 +291,7 @@ def test_timeline_physical_size_metric(neon_simple_env: NeonEnv):
           FROM generate_series(1, 100000) g""",
    ])

+    wait_for_last_flush_lsn(env, pg, env.initial_tenant, new_timeline_id)
    env.pageserver.safe_psql(f"checkpoint {env.initial_tenant.hex} {new_timeline_id.hex}")

    # get the metrics and parse the metric for the current timeline's physical size
@@ -319,6 +332,7 @@ def test_tenant_physical_size(neon_simple_env: NeonEnv):
            f"INSERT INTO foo SELECT 'long string to consume some space' || g FROM generate_series(1, {n_rows}) g",
        ])

+        wait_for_last_flush_lsn(env, pg, tenant, timeline)
        env.pageserver.safe_psql(f"checkpoint {tenant.hex} {timeline.hex}")

        timeline_total_size += get_timeline_physical_size(timeline)
--- a/test_runner/batch_others/test_wal_acceptor.py
+++ b/test_runner/batch_others/test_wal_acceptor.py
@@ -284,9 +284,12 @@ def test_wal_removal(neon_env_builder: NeonEnvBuilder, auth_enabled: bool):
    env.neon_cli.create_branch('test_safekeepers_wal_removal')
    pg = env.postgres.create_start('test_safekeepers_wal_removal')

+    # Note: it is important to insert at least two segments, because currently the
+    # control file is synced roughly once per segment, and WAL is not
+    # removed until all horizons are persisted.
    pg.safe_psql_many([
        'CREATE TABLE t(key int primary key, value text)',
-        "INSERT INTO t SELECT generate_series(1,100000), 'payload'",
+        "INSERT INTO t SELECT generate_series(1,200000), 'payload'",
    ])

    tenant_id = pg.safe_psql("show neon.tenant_id")[0][0]
@@ -350,7 +353,7 @@ def wait_segment_offload(tenant_id, timeline_id, live_sk, seg_end):
        if lsn_from_hex(tli_status.backup_lsn) >= lsn_from_hex(seg_end):
            break
        elapsed = time.time() - started_at
-        if elapsed > 20:
+        if elapsed > 30:
            raise RuntimeError(
                f"timed out waiting {elapsed:.0f}s for segment ending at {seg_end} get offloaded")
        time.sleep(0.5)
@@ -1087,11 +1090,9 @@ def test_delete_force(neon_env_builder: NeonEnvBuilder, auth_enabled: bool):

    # Remove initial tenant fully (two branches are active)
    response = sk_http.tenant_delete_force(tenant_id)
-    assert response == {
-        timeline_id_3: {
-            "dir_existed": True,
-            "was_active": True,
-        }
+    assert response[timeline_id_3] == {
+        "dir_existed": True,
+        "was_active": True,
    }
    assert not (sk_data_dir / tenant_id).exists()
    assert (sk_data_dir / tenant_id_other / timeline_id_other).is_dir()
--- a/test_runner/batch_others/test_wal_acceptor_async.py
+++ b/test_runner/batch_others/test_wal_acceptor_async.py
@@ -520,3 +520,68 @@ def test_race_conditions(neon_env_builder: NeonEnvBuilder):
    pg = env.postgres.create_start('test_safekeepers_race_conditions')

    asyncio.run(run_race_conditions(env, pg))
+
+
+# Check that pageserver can select safekeeper with largest commit_lsn
+# and switch if LSN is not updated for some time (NoWalTimeout).
+async def run_wal_lagging(env: NeonEnv, pg: Postgres):
+    def safekeepers_guc(env: NeonEnv, active_sk: List[bool]) -> str:
+        # use ports 10, 11 and 12 to simulate unavailable safekeepers
+        return ','.join([
+            f'localhost:{sk.port.pg if active else 10 + i}'
+            for i, (sk, active) in enumerate(zip(env.safekeepers, active_sk))
+        ])
+
+    conn = await pg.connect_async()
+    await conn.execute('CREATE TABLE t(key int primary key, value text)')
+    await conn.close()
+    pg.stop()
+
+    n_iterations = 20
+    n_txes = 10000
+    expected_sum = 0
+    i = 1
+    quorum = len(env.safekeepers) // 2 + 1
+
+    for it in range(n_iterations):
+        active_sk = list(map(lambda _: random.random() >= 0.5, env.safekeepers))
+        active_count = sum(active_sk)
+
+        if active_count < quorum:
+            it -= 1
+            continue
+
+        pg.adjust_for_safekeepers(safekeepers_guc(env, active_sk))
+        log.info(f'Iteration {it}: {active_sk}')
+
+        pg.start()
+        conn = await pg.connect_async()
+
+        for _ in range(n_txes):
+            await conn.execute(f"INSERT INTO t values ({i}, 'payload')")
+            expected_sum += i
+            i += 1
+
+        await conn.close()
+        pg.stop()
+
+    pg.adjust_for_safekeepers(safekeepers_guc(env, [True] * len(env.safekeepers)))
+    pg.start()
+    conn = await pg.connect_async()
+
+    log.info(f'Executed {i-1} queries')
+
+    res = await conn.fetchval('SELECT sum(key) FROM t')
+    assert res == expected_sum
+
+
+# do inserts while restarting postgres and messing with safekeeper addresses
+def test_wal_lagging(neon_env_builder: NeonEnvBuilder):
+
+    neon_env_builder.num_safekeepers = 3
+    env = neon_env_builder.init_start()
+
+    env.neon_cli.create_branch('test_wal_lagging')
+    pg = env.postgres.create_start('test_wal_lagging')
+
+    asyncio.run(run_wal_lagging(env, pg))
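
In `run_wal_lagging` above, decrementing `it` inside `for it in range(n_iterations)` does not repeat the iteration in Python, because the loop rebinds `it` from the range on each pass. Draws without a quorum are therefore skipped rather than retried. If retrying is the intent, one possible shape is sketched below. The helper `draw_quorum_masks` is hypothetical and is not part of the test.

```python
import random
from typing import List


def draw_quorum_masks(n_safekeepers: int, n_iterations: int,
                      rng: random.Random) -> List[List[bool]]:
    """Draw `n_iterations` random on/off masks, each with at least a quorum switched on."""
    quorum = n_safekeepers // 2 + 1
    masks: List[List[bool]] = []
    while len(masks) < n_iterations:
        mask = [rng.random() >= 0.5 for _ in range(n_safekeepers)]
        if sum(mask) < quorum:
            # No quorum: draw again instead of consuming an iteration.
            continue
        masks.append(mask)
    return masks


masks = draw_quorum_masks(n_safekeepers=3, n_iterations=20, rng=random.Random(0))
print(len(masks), all(sum(m) >= 2 for m in masks))
```

A `while` loop keyed on the number of accepted masks keeps the iteration count exact, at the cost of an unbounded (but in practice short) number of redraws.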
--- a/test_runner/fixtures/benchmark_fixture.py
+++ b/test_runner/fixtures/benchmark_fixture.py
@@ -1,23 +1,21 @@
+import calendar
 import dataclasses
+import enum
 import json
 import os
-from pathlib import Path
 import re
-import subprocess
 import timeit
-import calendar
-import enum
-from datetime import datetime
 import uuid
+import warnings
+from contextlib import contextmanager
+from datetime import datetime
+from pathlib import Path
+# Type-related stuff
+from typing import Iterator, Optional
+
 import pytest
 from _pytest.config import Config
 from _pytest.terminal import TerminalReporter
-import warnings
-
-from contextlib import contextmanager
-
-# Type-related stuff
-from typing import Iterator, Optional
 """
 This file contains fixtures for micro-benchmarks.

@@ -77,7 +75,7 @@ class PgBenchRunResult:

        # we know significant parts of these values from test input
        # but to be precise take them from output
-        for line in stdout.splitlines():
+        for line in stdout_lines:
            # scaling factor: 5
            if line.startswith("scaling factor:"):
                scale = int(line.split()[-1])
@@ -131,6 +129,58 @@ class PgBenchRunResult:
        )


+@dataclasses.dataclass
+class PgBenchInitResult:
+    total: float
+    drop_tables: Optional[float]
+    create_tables: Optional[float]
+    client_side_generate: Optional[float]
+    vacuum: Optional[float]
+    primary_keys: Optional[float]
+    duration: float
+    start_timestamp: int
+    end_timestamp: int
+
+    @classmethod
+    def parse_from_stderr(
+        cls,
+        stderr: str,
+        duration: float,
+        start_timestamp: int,
+        end_timestamp: int,
+    ):
+        # Parses pgbench initialize output for default initialization steps (dtgvp)
+        # Example: done in 5.66 s (drop tables 0.05 s, create tables 0.31 s, client-side generate 2.01 s, vacuum 0.53 s, primary keys 0.38 s).
+
+        last_line = stderr.splitlines()[-1]
+
+        regex = re.compile(r"done in (\d+\.\d+) s "
+                           r"\("
+                           r"(?:drop tables (\d+\.\d+) s)?(?:, )?"
+                           r"(?:create tables (\d+\.\d+) s)?(?:, )?"
+                           r"(?:client-side generate (\d+\.\d+) s)?(?:, )?"
+                           r"(?:vacuum (\d+\.\d+) s)?(?:, )?"
+                           r"(?:primary keys (\d+\.\d+) s)?(?:, )?"
+                           r"\)\.")
+
+        if (m := regex.match(last_line)) is not None:
+            total, drop_tables, create_tables, client_side_generate, vacuum, primary_keys = [float(v) if v is not None else None for v in m.groups()]
+        else:
+            raise RuntimeError(f"can't parse pgbench initialize results from `{last_line}`")
+
+        return cls(
+            total=total,
+            drop_tables=drop_tables,
+            create_tables=create_tables,
+            client_side_generate=client_side_generate,
+            vacuum=vacuum,
+            primary_keys=primary_keys,
+            duration=duration,
+            start_timestamp=start_timestamp,
+            end_timestamp=end_timestamp,
+        )
+
+
 @enum.unique
 class MetricReport(str, enum.Enum):  # str is a hack to make it json serializable
    # this means that this is a constant test parameter
@@ -232,6 +282,32 @@ class NeonBenchmarker:
                    '',
                    MetricReport.TEST_PARAM)

+    def record_pg_bench_init_result(self, prefix: str, result: PgBenchInitResult):
+        test_params = [
+            "start_timestamp",
+            "end_timestamp",
+        ]
+        for test_param in test_params:
+            self.record(f"{prefix}.{test_param}",
+                        getattr(result, test_param),
+                        '',
+                        MetricReport.TEST_PARAM)
+
+        metrics = [
+            "duration",
+            "drop_tables",
+            "create_tables",
+            "client_side_generate",
+            "vacuum",
+            "primary_keys",
+        ]
+        for metric in metrics:
+            if (value := getattr(result, metric)) is not None:
+                self.record(f"{prefix}.{metric}",
+                            value,
+                            unit="s",
+                            report=MetricReport.LOWER_IS_BETTER)
+
    def get_io_writes(self, pageserver) -> int:
        """
        Fetch the "cumulative # of bytes written" metric from the pageserver
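Review note on PgBenchInitResult.parse_from_stderr: the step groups in the regex are optional, so a run with a non-default step list (for example -Idtg, as used in test_wal_backpressure.py) yields None for the steps that were skipped. The sketch below is a standalone illustration of that behaviour, not the fixture itself; the names INIT_RE, FIELDS and parse_init_line and the sample input lines are assumptions made for the example.

```python
import re
from typing import Dict, Optional

# Same pattern as in the hunk above, kept as a module-level constant.
INIT_RE = re.compile(r"done in (\d+\.\d+) s "
                     r"\("
                     r"(?:drop tables (\d+\.\d+) s)?(?:, )?"
                     r"(?:create tables (\d+\.\d+) s)?(?:, )?"
                     r"(?:client-side generate (\d+\.\d+) s)?(?:, )?"
                     r"(?:vacuum (\d+\.\d+) s)?(?:, )?"
                     r"(?:primary keys (\d+\.\d+) s)?(?:, )?"
                     r"\)\.")

# Field names in the same order as the capture groups.
FIELDS = ("total", "drop_tables", "create_tables",
          "client_side_generate", "vacuum", "primary_keys")


def parse_init_line(line: str) -> Dict[str, Optional[float]]:
    """Map each step to its duration in seconds, or None if the step is absent."""
    m = INIT_RE.match(line)
    if m is None:
        raise ValueError(f"can't parse pgbench initialize results from `{line}`")
    # Keep positions stable: a missing group stays None instead of being dropped.
    return {name: (float(v) if v is not None else None)
            for name, v in zip(FIELDS, m.groups())}


full = parse_init_line("done in 5.66 s (drop tables 0.05 s, create tables 0.31 s, "
                       "client-side generate 2.01 s, vacuum 0.53 s, primary keys 0.38 s).")
partial = parse_init_line("done in 1.23 s (drop tables 0.01 s, create tables 0.02 s, "
                          "client-side generate 1.20 s).")
print(full["vacuum"], partial["vacuum"])
```

Filtering out the None groups before unpacking would shift later values into earlier fields or fail outright, which is why the sketch keeps one slot per group.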
--- a/test_runner/fixtures/neon_fixtures.py
+++ b/test_runner/fixtures/neon_fixtures.py
@@ -299,7 +299,9 @@ class PgProtocol:
        # change it by calling "SET statement_timeout" after
        # connecting.
        options = result.get('options', '')
-        result['options'] = f'-cstatement_timeout=120s {options}'
+        if "statement_timeout" not in options:
+            options = f'-cstatement_timeout=120s {options}'
+        result['options'] = options
        return result

    # autocommit=True here by default because that's what we need most of the time
@@ -2438,7 +2440,7 @@ def wait_for_upload(pageserver_http_client: NeonPageserverHttpClient,
                    timeline: uuid.UUID,
                    lsn: int):
    """waits for local timeline upload up to specified lsn"""
-    for i in range(10):
+    for i in range(20):
        current_lsn = remote_consistent_lsn(pageserver_http_client, tenant, timeline)
        if current_lsn >= lsn:
            return
@@ -2473,3 +2475,9 @@ def wait_for_last_record_lsn(pageserver_http_client: NeonPageserverHttpClient,
        time.sleep(1)
    raise Exception("timed out while waiting for last_record_lsn to reach {}, was {}".format(
        lsn_to_hex(lsn), lsn_to_hex(current_lsn)))
+
+
+def wait_for_last_flush_lsn(env: NeonEnv, pg: Postgres, tenant: uuid.UUID, timeline: uuid.UUID):
+    """Wait for the pageserver to catch up to the latest flush LSN"""
+    last_flush_lsn = lsn_from_hex(pg.safe_psql("SELECT pg_current_wal_flush_lsn()")[0][0])
+    wait_for_last_record_lsn(env.pageserver.http_client(), tenant, timeline, last_flush_lsn)
--- a/test_runner/fixtures/utils.py
+++ b/test_runner/fixtures/utils.py
@@ -32,10 +32,16 @@ def subprocess_capture(capture_dir: str, cmd: List[str], **kwargs: Any) -> str:
    stdout_filename = basepath + '.stdout'
    stderr_filename = basepath + '.stderr'

-    with open(stdout_filename, 'w') as stdout_f:
-        with open(stderr_filename, 'w') as stderr_f:
-            log.info('(capturing output to "{}.stdout")'.format(base))
-            subprocess.run(cmd, **kwargs, stdout=stdout_f, stderr=stderr_f)
+    try:
+        with open(stdout_filename, 'w') as stdout_f:
+            with open(stderr_filename, 'w') as stderr_f:
+                log.info(f'Capturing stdout to "{base}.stdout" and stderr to "{base}.stderr"')
+                subprocess.run(cmd, **kwargs, stdout=stdout_f, stderr=stderr_f)
+    finally:
+        # Remove empty files if there is no output
+        for filename in (stdout_filename, stderr_filename):
+            if os.path.exists(filename) and os.stat(filename).st_size == 0:
+                os.remove(filename)

    return basepath

@@ -140,3 +146,12 @@ def parse_delta_layer(f_name: str) -> Tuple[int, int, int, int]:
    key_parts = parts[0].split("-")
    lsn_parts = parts[1].split("-")
    return int(key_parts[0], 16), int(key_parts[1], 16), int(lsn_parts[0], 16), int(lsn_parts[1], 16)
+
+
+def get_scale_for_db(size_mb: int) -> int:
+    """Returns pgbench scale factor for given target db size in MB.
+
+    Ref https://www.cybertec-postgresql.com/en/a-formula-to-calculate-pgbench-scaling-factor-for-target-db-size/
+    """
+
+    return round(0.06689 * size_mb - 0.5)
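Review note on get_scale_for_db: a short worked example of the linear formula may help when choosing values for TEST_PG_BENCH_SCALES_MATRIX. The function body is copied from the hunk above; the sample sizes are arbitrary and chosen only for illustration.

```python
def get_scale_for_db(size_mb: int) -> int:
    """Returns pgbench scale factor for given target db size in MB."""
    return round(0.06689 * size_mb - 0.5)


# Sample target sizes (assumed for illustration): 100 MB, 1 GB, 10 GB.
for size_mb in (100, 1024, 10 * 1024):
    print(f"{size_mb} MB -> scale {get_scale_for_db(size_mb)}")
```

With these inputs the scale factors come out as 6, 68 and 684, so a `1gb` entry in the scales matrix corresponds to `pgbench -s68`.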
--- a/test_runner/performance/README.md
+++ b/test_runner/performance/README.md
@@ -10,7 +10,7 @@ In the CI, the performance tests are run in the same environment as the other in

 ## Remote tests

-There are a few tests that marked with `pytest.mark.remote_cluster`. These tests do not set up a local environment, and instead require a libpq connection string to connect to. So they can be run on any Postgres compatible database. Currently, the CI runs these tests our staging environment daily. Staging is not an isolated environment, so there can be noise in the results due to activity of other clusters.
+There are a few tests that are marked with `pytest.mark.remote_cluster`. These tests do not set up a local environment, and instead require a libpq connection string to connect to, so they can be run on any Postgres-compatible database. Currently, the CI runs these tests on our staging and captest environments daily. Those are not isolated environments, so there can be noise in the results due to the activity of other clusters.

 ## Noise

--- a/test_runner/performance/test_perf_pgbench.py
+++ b/test_runner/performance/test_perf_pgbench.py
@@ -1,17 +1,23 @@
-from contextlib import closing
-from fixtures.neon_fixtures import PgBin, VanillaPostgres, NeonEnv, profiling_supported
-from fixtures.compare_fixtures import PgCompare, VanillaCompare, NeonCompare
-
-from fixtures.benchmark_fixture import PgBenchRunResult, MetricReport, NeonBenchmarker
-from fixtures.log_helper import log
-
-from pathlib import Path
-
-import pytest
-from datetime import datetime
 import calendar
+import enum
 import os
 import timeit
+from datetime import datetime
+from pathlib import Path
+from typing import List
+
+import pytest
+from fixtures.benchmark_fixture import MetricReport, PgBenchInitResult, PgBenchRunResult
+from fixtures.compare_fixtures import NeonCompare, PgCompare
+from fixtures.neon_fixtures import profiling_supported
+from fixtures.utils import get_scale_for_db
+
+
+@enum.unique
+class PgBenchLoadType(enum.Enum):
+    INIT = "init"
+    SIMPLE_UPDATE = "simple_update"
+    SELECT_ONLY = "select-only"


 def utc_now_timestamp() -> int:
@@ -22,23 +28,24 @@ def init_pgbench(env: PgCompare, cmdline):
    # calculate timestamps and durations separately
    # timestamp is intended to be used for linking to grafana and logs
    # duration is actually a metric and uses float instead of int for timestamp
-    init_start_timestamp = utc_now_timestamp()
+    start_timestamp = utc_now_timestamp()
    t0 = timeit.default_timer()
    with env.record_pageserver_writes('init.pageserver_writes'):
-        env.pg_bin.run_capture(cmdline)
+        out = env.pg_bin.run_capture(cmdline)
        env.flush()
-    init_duration = timeit.default_timer() - t0
-    init_end_timestamp = utc_now_timestamp()

-    env.zenbenchmark.record("init.duration",
-                            init_duration,
-                            unit="s",
-                            report=MetricReport.LOWER_IS_BETTER)
-    env.zenbenchmark.record("init.start_timestamp",
-                            init_start_timestamp,
-                            '',
-                            MetricReport.TEST_PARAM)
-    env.zenbenchmark.record("init.end_timestamp", init_end_timestamp, '', MetricReport.TEST_PARAM)
+    duration = timeit.default_timer() - t0
+    end_timestamp = utc_now_timestamp()
+
+    stderr = Path(f"{out}.stderr").read_text()
+
+    res = PgBenchInitResult.parse_from_stderr(
+        stderr=stderr,
+        duration=duration,
+        start_timestamp=start_timestamp,
+        end_timestamp=end_timestamp,
+    )
+    env.zenbenchmark.record_pg_bench_init_result("init", res)


 def run_pgbench(env: PgCompare, prefix: str, cmdline):
@@ -70,38 +77,84 @@ def run_pgbench(env: PgCompare, prefix: str, cmdline):
 # the test database.
 #
 # Currently, the # of connections is hardcoded at 4
-def run_test_pgbench(env: PgCompare, scale: int, duration: int):
-
-    # Record the scale and initialize
+def run_test_pgbench(env: PgCompare, scale: int, duration: int, workload_type: PgBenchLoadType):
    env.zenbenchmark.record("scale", scale, '', MetricReport.TEST_PARAM)
-    init_pgbench(env, ['pgbench', f'-s{scale}', '-i', env.pg.connstr()])

-    # Run simple-update workload
-    run_pgbench(env,
-                "simple-update", ['pgbench', '-N', '-c4', f'-T{duration}', '-P2', env.pg.connstr()])
+    if workload_type == PgBenchLoadType.INIT:
+        # Run initialize
+        init_pgbench(
+            env, ['pgbench', f'-s{scale}', '-i', env.pg.connstr(options='-cstatement_timeout=1h')])

-    # Run SELECT workload
-    run_pgbench(env,
-                "select-only", ['pgbench', '-S', '-c4', f'-T{duration}', '-P2', env.pg.connstr()])
+    if workload_type == PgBenchLoadType.SIMPLE_UPDATE:
+        # Run simple-update workload
+        run_pgbench(env,
+                    "simple-update",
+                    [
+                        'pgbench',
+                        '-N',
+                        '-c4',
+                        f'-T{duration}',
+                        '-P2',
+                        '--progress-timestamp',
+                        env.pg.connstr(),
+                    ])
+
+    if workload_type == PgBenchLoadType.SELECT_ONLY:
+        # Run SELECT workload
+        run_pgbench(env,
+                    "select-only",
+                    [
+                        'pgbench',
+                        '-S',
+                        '-c4',
+                        f'-T{duration}',
+                        '-P2',
+                        '--progress-timestamp',
+                        env.pg.connstr(),
+                    ])

    env.report_size()


-def get_durations_matrix(default: int = 45):
+def get_durations_matrix(default: int = 45) -> List[int]:
    durations = os.getenv("TEST_PG_BENCH_DURATIONS_MATRIX", default=str(default))
-    return list(map(int, durations.split(",")))
+    rv = []
+    for d in durations.split(","):
+        d = d.strip().lower()
+        if d.endswith('h'):
+            duration = int(d.removesuffix('h')) * 60 * 60
+        elif d.endswith('m'):
+            duration = int(d.removesuffix('m')) * 60
+        else:
+            duration = int(d.removesuffix('s'))
+        rv.append(duration)
+
+    return rv


-def get_scales_matrix(default: int = 10):
+def get_scales_matrix(default: int = 10) -> List[int]:
    scales = os.getenv("TEST_PG_BENCH_SCALES_MATRIX", default=str(default))
-    return list(map(int, scales.split(",")))
+    rv = []
+    for s in scales.split(","):
+        s = s.strip().lower()
+        if s.endswith('mb'):
+            scale = get_scale_for_db(int(s.removesuffix('mb')))
+        elif s.endswith('gb'):
+            scale = get_scale_for_db(int(s.removesuffix('gb')) * 1024)
+        else:
+            scale = int(s)
+        rv.append(scale)
+
+    return rv


 # Run the pgbench tests against vanilla Postgres and neon
 @pytest.mark.parametrize("scale", get_scales_matrix())
 @pytest.mark.parametrize("duration", get_durations_matrix())
 def test_pgbench(neon_with_baseline: PgCompare, scale: int, duration: int):
-    run_test_pgbench(neon_with_baseline, scale, duration)
+    run_test_pgbench(neon_with_baseline, scale, duration, PgBenchLoadType.INIT)
+    run_test_pgbench(neon_with_baseline, scale, duration, PgBenchLoadType.SIMPLE_UPDATE)
+    run_test_pgbench(neon_with_baseline, scale, duration, PgBenchLoadType.SELECT_ONLY)


 # Run the pgbench tests, and generate a flamegraph from it
@@ -123,12 +176,34 @@ profiling="page_requests"
    env = neon_env_builder.init_start()
    env.neon_cli.create_branch("empty", "main")

-    run_test_pgbench(NeonCompare(zenbenchmark, env, pg_bin, "pgbench"), scale, duration)
+    neon_compare = NeonCompare(zenbenchmark, env, pg_bin, "pgbench")
+    run_test_pgbench(neon_compare, scale, duration, PgBenchLoadType.INIT)
+    run_test_pgbench(neon_compare, scale, duration, PgBenchLoadType.SIMPLE_UPDATE)
+    run_test_pgbench(neon_compare, scale, duration, PgBenchLoadType.SELECT_ONLY)


+# The following 3 tests run on an existing database as it was set up by the previous tests,
+# and leave the database in a state that is used by the next tests.
+# Modifying the definition order of these functions or adding other remote tests in between will alter the results.
+# See the usage of the --sparse-ordering flag in the pytest invocation in the CI workflow.
+#
 # Run the pgbench tests against an existing Postgres cluster
 @pytest.mark.parametrize("scale", get_scales_matrix())
 @pytest.mark.parametrize("duration", get_durations_matrix())
 @pytest.mark.remote_cluster
-def test_pgbench_remote(remote_compare: PgCompare, scale: int, duration: int):
-    run_test_pgbench(remote_compare, scale, duration)
+def test_pgbench_remote_init(remote_compare: PgCompare, scale: int, duration: int):
+    run_test_pgbench(remote_compare, scale, duration, PgBenchLoadType.INIT)
+
+
+@pytest.mark.parametrize("scale", get_scales_matrix())
+@pytest.mark.parametrize("duration", get_durations_matrix())
+@pytest.mark.remote_cluster
+def test_pgbench_remote_simple_update(remote_compare: PgCompare, scale: int, duration: int):
+    run_test_pgbench(remote_compare, scale, duration, PgBenchLoadType.SIMPLE_UPDATE)
+
+
+@pytest.mark.parametrize("scale", get_scales_matrix())
+@pytest.mark.parametrize("duration", get_durations_matrix())
+@pytest.mark.remote_cluster
+def test_pgbench_remote_select_only(remote_compare: PgCompare, scale: int, duration: int):
+    run_test_pgbench(remote_compare, scale, duration, PgBenchLoadType.SELECT_ONLY)
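Review note on get_durations_matrix: the new parsing accepts an optional `h`, `m` or `s` suffix per entry in TEST_PG_BENCH_DURATIONS_MATRIX. The sketch below shows the same conversion as a standalone function; `parse_durations` is a hypothetical name, and it uses slicing instead of `str.removesuffix` so that it also runs on Python versions older than 3.9.

```python
from typing import List


def parse_durations(spec: str) -> List[int]:
    """Convert a comma-separated list such as "45,2m,1h" into seconds."""
    multipliers = {"h": 60 * 60, "m": 60, "s": 1}
    result = []
    for item in spec.split(","):
        item = item.strip().lower()
        if item and item[-1] in multipliers:
            # Strip the one-letter unit and scale to seconds.
            result.append(int(item[:-1]) * multipliers[item[-1]])
        else:
            # A bare number is already in seconds.
            result.append(int(item))
    return result


print(parse_durations("45, 2m, 1h, 30s"))  # [45, 120, 3600, 30]
```

Because the parsed values feed `pytest.mark.parametrize`, each entry becomes one test case per scale.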
--- a/test_runner/performance/test_wal_backpressure.py
+++ b/test_runner/performance/test_wal_backpressure.py
@@ -146,7 +146,7 @@ def test_pgbench_simple_update_workload(pg_compare: PgCompare, scale: int, durat
    record_thread.join()


-def start_pgbench_intensive_initialization(env: PgCompare, scale: int):
+def start_pgbench_intensive_initialization(env: PgCompare, scale: int, done_event: threading.Event):
    with env.record_duration("run_duration"):
        # Needs to increase the statement timeout (default: 120s) because the
        # initialization step can be slow with a large scale.
@@ -155,9 +155,11 @@ def start_pgbench_intensive_initialization(env: PgCompare, scale: int):
            f'-s{scale}',
            '-i',
            '-Idtg',
-            env.pg.connstr(options='-cstatement_timeout=300s')
+            env.pg.connstr(options='-cstatement_timeout=600s')
        ])

+    done_event.set()
+

 @pytest.mark.timeout(1000)
 @pytest.mark.parametrize("scale", get_scales_matrix(1000))
@@ -166,15 +168,17 @@ def test_pgbench_intensive_init_workload(pg_compare: PgCompare, scale: int):
    with env.pg.connect().cursor() as cur:
        cur.execute("CREATE TABLE foo as select generate_series(1,100000)")

+    workload_done_event = threading.Event()
+
    workload_thread = threading.Thread(target=start_pgbench_intensive_initialization,
-                                       args=(env, scale))
+                                       args=(env, scale, workload_done_event))
    workload_thread.start()

    record_thread = threading.Thread(target=record_lsn_write_lag,
-                                     args=(env, lambda: workload_thread.is_alive()))
+                                     args=(env, lambda: not workload_done_event.is_set()))
    record_thread.start()

-    record_read_latency(env, lambda: workload_thread.is_alive(), "SELECT count(*) from foo")
+    record_read_latency(env, lambda: not workload_done_event.is_set(), "SELECT count(*) from foo")
    workload_thread.join()
    record_thread.join()

--- a/test_runner/pg_clients/test_pg_clients.py
+++ b/test_runner/pg_clients/test_pg_clients.py
@@ -3,10 +3,10 @@ import shutil
 import subprocess
 from pathlib import Path
 from tempfile import NamedTemporaryFile
-from urllib.parse import urlparse

 import pytest
 from fixtures.neon_fixtures import RemotePostgres
+from fixtures.utils import subprocess_capture


 @pytest.mark.remote_cluster
@@ -25,7 +25,7 @@ from fixtures.neon_fixtures import RemotePostgres
        "typescript/postgresql-client",
    ],
 )
-def test_pg_clients(remote_pg: RemotePostgres, client: str):
+def test_pg_clients(test_output_dir: Path, remote_pg: RemotePostgres, client: str):
    conn_options = remote_pg.conn_options()

    env_file = None
@@ -43,12 +43,10 @@ def test_pg_clients(remote_pg: RemotePostgres, client: str):
    if docker_bin is None:
        raise RuntimeError("docker is required for running this test")

-    build_cmd = [
-        docker_bin, "build", "--quiet", "--tag", image_tag, f"{Path(__file__).parent / client}"
-    ]
+    build_cmd = [docker_bin, "build", "--tag", image_tag, f"{Path(__file__).parent / client}"]
+    subprocess_capture(str(test_output_dir), build_cmd, check=True)
+
    run_cmd = [docker_bin, "run", "--rm", "--env-file", env_file, image_tag]
+    basepath = subprocess_capture(str(test_output_dir), run_cmd, check=True)

-    subprocess.run(build_cmd, check=True)
-    result = subprocess.run(run_cmd, check=True, capture_output=True, text=True)
-
-    assert result.stdout.strip() == "1"
+    assert Path(f"{basepath}.stdout").read_text().strip() == "1"
--- a/vendor/postgres
+++ b/vendor/postgres