Have a pool of WAL redo processes per tenant

To allow more concurrency, have a pool of WAL redo processes that can grow up to 4 processes per tenant. There's no way to shrink the pool, that's why I'm capping it at 4 processes, to keep the total number of processes reasonable.
test_runner: replace global variables with fixtures (#2754 )
2026-03-17 15:20:37 +00:00 · 2022-11-08 14:59:49 +02:00 · 2022-11-07 18:39:51 +00:00 · 2022-11-07 18:13:24 +02:00 · 2022-11-07 12:03:57 +02:00 · 2022-11-06 13:42:18 +02:00
133 changed files with 6321 additions and 2113 deletions
--- a/.github/actions/run-python-test-set/action.yml
+++ b/.github/actions/run-python-test-set/action.yml
@@ -74,6 +74,7 @@ runs:
      run: ./scripts/pysync

    - name: Download compatibility snapshot for Postgres 14
+      if: inputs.build_type != 'remote'
      uses: ./.github/actions/download
      with:
        name: compatibility-snapshot-${{ inputs.build_type }}-pg14
--- a/.github/ansible/prod.ap-southeast-1.hosts.yaml
+++ b/.github/ansible/prod.ap-southeast-1.hosts.yaml
@@ -0,0 +1,35 @@
+storage:
+  vars:
+    bucket_name: neon-prod-storage-ap-southeast-1
+    bucket_region: ap-southeast-1
+    console_mgmt_base_url: http://console-release.local
+    etcd_endpoints: etcd-0.ap-southeast-1.aws.neon.tech:2379
+    pageserver_config_stub:
+      pg_distrib_dir: /usr/local
+      remote_storage:
+        bucket_name: "{{ bucket_name }}"
+        bucket_region: "{{ bucket_region }}"
+        prefix_in_bucket: "pageserver/v1"
+    safekeeper_s3_prefix: safekeeper/v1/wal
+    hostname_suffix: ""
+    remote_user: ssm-user
+    ansible_aws_ssm_region: ap-southeast-1
+    ansible_aws_ssm_bucket_name: neon-prod-storage-ap-southeast-1
+    console_region_id: aws-ap-southeast-1
+
+  children:
+    pageservers:
+      hosts:
+        pageserver-0.ap-southeast-1.aws.neon.tech:
+          ansible_host:  i-064de8ea28bdb495b
+        pageserver-1.ap-southeast-1.aws.neon.tech:
+          ansible_host:  i-0b180defcaeeb6b93
+
+    safekeepers:
+      hosts:
+        safekeeper-0.ap-southeast-1.aws.neon.tech:
+          ansible_host:  i-0d6f1dc5161eef894
+        safekeeper-1.ap-southeast-1.aws.neon.tech:
+          ansible_host:  i-0e338adda8eb2d19f
+        safekeeper-2.ap-southeast-1.aws.neon.tech:
+          ansible_host:  i-04fb63634e4679eb9
--- a/.github/ansible/prod.eu-central-1.hosts.yaml
+++ b/.github/ansible/prod.eu-central-1.hosts.yaml
@@ -0,0 +1,35 @@
+storage:
+  vars:
+    bucket_name: neon-prod-storage-eu-central-1
+    bucket_region: eu-central-1
+    console_mgmt_base_url: http://console-release.local
+    etcd_endpoints: etcd-0.eu-central-1.aws.neon.tech:2379
+    pageserver_config_stub:
+      pg_distrib_dir: /usr/local
+      remote_storage:
+        bucket_name: "{{ bucket_name }}"
+        bucket_region: "{{ bucket_region }}"
+        prefix_in_bucket: "pageserver/v1"
+    safekeeper_s3_prefix: safekeeper/v1/wal
+    hostname_suffix: ""
+    remote_user: ssm-user
+    ansible_aws_ssm_region: eu-central-1
+    ansible_aws_ssm_bucket_name: neon-prod-storage-eu-central-1
+    console_region_id: aws-eu-central-1
+
+  children:
+    pageservers:
+      hosts:
+        pageserver-0.eu-central-1.aws.neon.tech:
+          ansible_host:  i-0cd8d316ecbb715be
+        pageserver-1.eu-central-1.aws.neon.tech:
+          ansible_host:  i-090044ed3d383fef0
+
+    safekeepers:
+      hosts:
+        safekeeper-0.eu-central-1.aws.neon.tech:
+          ansible_host:  i-0b238612d2318a050
+        safekeeper-1.eu-central-1.aws.neon.tech:
+          ansible_host:  i-07b9c45e5c2637cd4
+        safekeeper-2.eu-central-1.aws.neon.tech:
+          ansible_host:  i-020257302c3c93d88
--- a/.github/ansible/prod.us-east-2.hosts.yaml
+++ b/.github/ansible/prod.us-east-2.hosts.yaml
@@ -0,0 +1,36 @@
+storage:
+  vars:
+    bucket_name: neon-prod-storage-us-east-2
+    bucket_region: us-east-2
+    console_mgmt_base_url: http://console-release.local
+    etcd_endpoints: etcd-0.us-east-2.aws.neon.tech:2379
+    pageserver_config_stub:
+      pg_distrib_dir: /usr/local
+      remote_storage:
+        bucket_name: "{{ bucket_name }}"
+        bucket_region: "{{ bucket_region }}"
+        prefix_in_bucket: "pageserver/v1"
+    safekeeper_s3_prefix: safekeeper/v1/wal
+    hostname_suffix: ""
+    remote_user: ssm-user
+    ansible_aws_ssm_region: us-east-2
+    ansible_aws_ssm_bucket_name: neon-prod-storage-us-east-2
+    console_region_id: aws-us-east-2
+
+  children:
+    pageservers:
+      hosts:
+        pageserver-0.us-east-2.aws.neon.tech:
+          ansible_host:  i-062227ba7f119eb8c
+        pageserver-1.us-east-2.aws.neon.tech:
+          ansible_host:  i-0b3ec0afab5968938
+
+    safekeepers:
+      hosts:
+        safekeeper-0.us-east-2.aws.neon.tech:
+          ansible_host:  i-0e94224750c57d346
+        safekeeper-1.us-east-2.aws.neon.tech:
+          ansible_host:  i-06d113fb73bfddeb0
+        safekeeper-2.us-east-2.aws.neon.tech:
+          ansible_host:  i-09f66c8e04afff2e8
+          
--- a/.github/ansible/ssm_config
+++ b/.github/ansible/ssm_config
@@ -1,3 +1,2 @@
 ansible_connection: aws_ssm
-ansible_aws_ssm_bucket_name: neon-dev-bucket
 ansible_python_interpreter: /usr/bin/python3
--- a/.github/ansible/staging.us-east-2.hosts.yaml
+++ b/.github/ansible/staging.us-east-2.hosts.yaml
@@ -14,6 +14,7 @@ storage:
    hostname_suffix: ""
    remote_user: ssm-user
    ansible_aws_ssm_region: us-east-2
+    ansible_aws_ssm_bucket_name: neon-staging-storage-us-east-2
    console_region_id: aws-us-east-2

  children:
--- a/.github/helm-values/prod-ap-southeast-1-epsilon.neon-proxy-scram.yaml
+++ b/.github/helm-values/prod-ap-southeast-1-epsilon.neon-proxy-scram.yaml
@@ -0,0 +1,31 @@
+# Helm chart values for neon-proxy-scram.
+# This is a YAML-formatted file.
+
+image:
+  repository: neondatabase/neon
+
+settings:
+  authBackend: "console"
+  authEndpoint: "http://console-release.local/management/api/v2"
+  domain: "*.ap-southeast-1.aws.neon.tech"
+
+# -- Additional labels for neon-proxy pods
+podLabels:
+  zenith_service: proxy-scram
+  zenith_env: prod
+  zenith_region: ap-southeast-1
+  zenith_region_slug: ap-southeast-1
+
+exposedService:
+  annotations:
+    service.beta.kubernetes.io/aws-load-balancer-type: external
+    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
+    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
+    external-dns.alpha.kubernetes.io/hostname: ap-southeast-1.aws.neon.tech
+
+#metrics:
+#  enabled: true
+#  serviceMonitor:
+#    enabled: true
+#    selector:
+#      release: kube-prometheus-stack
--- a/.github/helm-values/prod-eu-central-1-gamma.neon-proxy-scram.yaml
+++ b/.github/helm-values/prod-eu-central-1-gamma.neon-proxy-scram.yaml
@@ -0,0 +1,31 @@
+# Helm chart values for neon-proxy-scram.
+# This is a YAML-formatted file.
+
+image:
+  repository: neondatabase/neon
+
+settings:
+  authBackend: "console"
+  authEndpoint: "http://console-release.local/management/api/v2"
+  domain: "*.eu-central-1.aws.neon.tech"
+
+# -- Additional labels for neon-proxy pods
+podLabels:
+  zenith_service: proxy-scram
+  zenith_env: prod
+  zenith_region: eu-central-1
+  zenith_region_slug: eu-central-1
+
+exposedService:
+  annotations:
+    service.beta.kubernetes.io/aws-load-balancer-type: external
+    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
+    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
+    external-dns.alpha.kubernetes.io/hostname: eu-central-1.aws.neon.tech
+
+#metrics:
+#  enabled: true
+#  serviceMonitor:
+#    enabled: true
+#    selector:
+#      release: kube-prometheus-stack
--- a/.github/helm-values/prod-us-east-2-delta.neon-proxy-scram.yaml
+++ b/.github/helm-values/prod-us-east-2-delta.neon-proxy-scram.yaml
@@ -0,0 +1,31 @@
+# Helm chart values for neon-proxy-scram.
+# This is a YAML-formatted file.
+
+image:
+  repository: neondatabase/neon
+
+settings:
+  authBackend: "console"
+  authEndpoint: "http://console-release.local/management/api/v2"
+  domain: "*.us-east-2.aws.neon.tech"
+
+# -- Additional labels for neon-proxy pods
+podLabels:
+  zenith_service: proxy-scram
+  zenith_env: prod
+  zenith_region: us-east-2
+  zenith_region_slug: us-east-2
+
+exposedService:
+  annotations:
+    service.beta.kubernetes.io/aws-load-balancer-type: external
+    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
+    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
+    external-dns.alpha.kubernetes.io/hostname: us-east-2.aws.neon.tech
+
+#metrics:
+#  enabled: true
+#  serviceMonitor:
+#    enabled: true
+#    selector:
+#      release: kube-prometheus-stack
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -127,8 +127,8 @@ jobs:
            target/
          # Fall back to older versions of the key, if no cache for current Cargo.lock was found
          key: |
-            v9-${{ runner.os }}-${{ matrix.build_type }}-cargo-${{ hashFiles('Cargo.lock') }}
-            v9-${{ runner.os }}-${{ matrix.build_type }}-cargo-
+            v10-${{ runner.os }}-${{ matrix.build_type }}-cargo-${{ hashFiles('Cargo.lock') }}
+            v10-${{ runner.os }}-${{ matrix.build_type }}-cargo-

      - name: Cache postgres v14 build
        id: cache_pg_14
@@ -389,7 +389,7 @@ jobs:
            !~/.cargo/registry/src
            ~/.cargo/git/
            target/
-          key: v9-${{ runner.os }}-${{ matrix.build_type }}-cargo-${{ hashFiles('Cargo.lock') }}
+          key: v10-${{ runner.os }}-${{ matrix.build_type }}-cargo-${{ hashFiles('Cargo.lock') }}

      - name: Get Neon artifact
        uses: ./.github/actions/download
@@ -625,11 +625,11 @@ jobs:
          (github.ref_name == 'main' || github.ref_name == 'release') &&
          github.event_name != 'workflow_dispatch'
        run: |
-          crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/neon:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.us-east-2.amazonaws.com/neon:latest
-          crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-tools:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.us-east-2.amazonaws.com/compute-tools:latest
-          crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.us-east-2.amazonaws.com/compute-node:latest
-          crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v14:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.us-east-2.amazonaws.com/compute-node-v14:latest
-          crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v15:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.us-east-2.amazonaws.com/compute-node-v15:latest
+          crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/neon:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/neon:latest
+          crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-tools:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/compute-tools:latest
+          crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/compute-node:latest
+          crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v14:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v14:latest
+          crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v15:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v15:latest

      - name: Configure Docker Hub login
        run: |
@@ -756,9 +756,9 @@ jobs:
    defaults:
      run:
        shell: bash
-    env:
-      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_DEV }}
-      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY_DEV }}
+    strategy:
+      matrix:
+        target_region: [ us-east-2 ]
    steps:
      - name: Checkout
        uses: actions/checkout@v3
@@ -781,7 +781,47 @@ jobs:
          fi

          ansible-galaxy collection install sivel.toiletwater
-          ansible-playbook deploy.yaml -i staging.us-east-2.hosts.yaml -e @ssm_config -e CONSOLE_API_TOKEN=${{secrets.NEON_STAGING_API_KEY}}
+          ansible-playbook deploy.yaml -i staging.${{ matrix.target_region }}.hosts.yaml -e @ssm_config -e CONSOLE_API_TOKEN=${{secrets.NEON_STAGING_API_KEY}}
+          rm -f neon_install.tar.gz .neon_current_version
+
+  deploy-prod-new:
+    runs-on: prod
+    container: 093970136003.dkr.ecr.eu-central-1.amazonaws.com/ansible:latest
+    # We need both storage **and** compute images for deploy, because control plane picks the compute version based on the storage version.
+    # If it notices a fresh storage it may bump the compute version. And if compute image failed to build it may break things badly
+    needs: [ push-docker-hub, calculate-deploy-targets, tag, regress-tests ]
+    if: |
+      (github.ref_name == 'release') &&
+      github.event_name != 'workflow_dispatch'
+    defaults:
+      run:
+        shell: bash
+    strategy:
+      matrix:
+        target_region: [ us-east-2, eu-central-1, ap-southeast-1 ]
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v3
+        with:
+          submodules: true
+          fetch-depth: 0
+
+      - name: Redeploy
+        run: |
+          export DOCKER_TAG=${{needs.tag.outputs.build-tag}}
+          cd "$(pwd)/.github/ansible"
+
+          if [[ "$GITHUB_REF_NAME" == "main" ]]; then
+            ./get_binaries.sh
+          elif [[ "$GITHUB_REF_NAME" == "release" ]]; then
+            RELEASE=true ./get_binaries.sh
+          else
+            echo "GITHUB_REF_NAME (value '$GITHUB_REF_NAME') is not set to either 'main' or 'release'"
+            exit 1
+          fi
+
+          ansible-galaxy collection install sivel.toiletwater
+          ansible-playbook deploy.yaml -i prod.${{ matrix.target_region }}.hosts.yaml -e @ssm_config -e CONSOLE_API_TOKEN=${{secrets.NEON_PRODUCTION_API_KEY}}
          rm -f neon_install.tar.gz .neon_current_version

  deploy-proxy:
@@ -837,6 +877,11 @@ jobs:
    defaults:
      run:
        shell: bash
+    strategy:
+      matrix:
+        include:
+          - target_region:  us-east-2
+            target_cluster: dev-us-east-2-beta
    steps:
      - name: Checkout
        uses: actions/checkout@v3
@@ -847,12 +892,49 @@ jobs:
      - name: Configure environment
        run: |
          helm repo add neondatabase https://neondatabase.github.io/helm-charts
-          aws --region us-east-2 eks update-kubeconfig --name dev-us-east-2-beta --role-arn arn:aws:iam::369495373322:role/github-runner
+          aws --region ${{ matrix.target_region }} eks update-kubeconfig --name  ${{ matrix.target_cluster }}

      - name: Re-deploy proxy
        run: |
          DOCKER_TAG=${{needs.tag.outputs.build-tag}}
-          helm upgrade neon-proxy-scram neondatabase/neon-proxy --namespace neon-proxy --create-namespace --install -f .github/helm-values/dev-us-east-2-beta.neon-proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait --timeout 15m0s
+          helm upgrade neon-proxy-scram neondatabase/neon-proxy --namespace neon-proxy --create-namespace --install -f .github/helm-values/${{ matrix.target_cluster }}.neon-proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait --timeout 15m0s
+
+  deploy-proxy-prod-new:
+    runs-on: prod
+    container: 093970136003.dkr.ecr.eu-central-1.amazonaws.com/ansible:latest
+    # Compute image isn't strictly required for proxy deploy, but let's still wait for it to run all deploy jobs consistently.
+    needs: [ push-docker-hub, calculate-deploy-targets, tag, regress-tests ]
+    if: |
+      (github.ref_name == 'release') &&
+      github.event_name != 'workflow_dispatch'
+    defaults:
+      run:
+        shell: bash
+    strategy:
+      matrix:
+        include:
+          - target_region:  us-east-2
+            target_cluster: prod-us-east-2-delta
+          - target_region: eu-central-1
+            target_cluster: prod-eu-central-1-gamma
+          - target_region: ap-southeast-1
+            target_cluster: prod-ap-southeast-1-epsilon
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v3
+        with:
+          submodules: true
+          fetch-depth: 0
+
+      - name: Configure environment
+        run: |
+          helm repo add neondatabase https://neondatabase.github.io/helm-charts
+          aws --region ${{ matrix.target_region }} eks update-kubeconfig --name  ${{ matrix.target_cluster }}
+
+      - name: Re-deploy proxy
+        run: |
+          DOCKER_TAG=${{needs.tag.outputs.build-tag}}
+          helm upgrade neon-proxy-scram neondatabase/neon-proxy --namespace neon-proxy --create-namespace --install -f .github/helm-values/${{ matrix.target_cluster }}.neon-proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait --timeout 15m0s

  promote-compatibility-test-snapshot:
    runs-on: dev
--- a/.github/workflows/codestyle.yml
+++ b/.github/workflows/codestyle.yml
@@ -106,7 +106,7 @@ jobs:
            !~/.cargo/registry/src
            ~/.cargo/git
            target
-          key: v5-${{ runner.os }}-cargo-${{ hashFiles('./Cargo.lock') }}-rust
+          key: v6-${{ runner.os }}-cargo-${{ hashFiles('./Cargo.lock') }}-rust

      - name: Run cargo clippy
        run: ./run_clippy.sh
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -317,12 +317,6 @@ dependencies = [
 "generic-array",
 ]

-[[package]]
-name = "boxfnonce"
-version = "0.1.1"
-source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "5988cb1d626264ac94100be357308f29ff7cbdd3b36bda27f450a4ee3f713426"
-
 [[package]]
 name = "bstr"
 version = "1.0.1"
@@ -600,6 +594,7 @@ dependencies = [
 "tar",
 "thiserror",
 "toml",
+ "url",
 "utils",
 "workspace_hack",
 ]
@@ -849,16 +844,6 @@ dependencies = [
 "syn",
 ]

-[[package]]
-name = "daemonize"
-version = "0.4.1"
-source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "70c24513e34f53b640819f0ac9f705b673fcf4006d7aab8778bee72ebfc89815"
-dependencies = [
- "boxfnonce",
- "libc",
-]
-
 [[package]]
 name = "darling"
 version = "0.14.1"
@@ -2140,7 +2125,6 @@ dependencies = [
 "crc32c",
 "criterion",
 "crossbeam-utils",
- "daemonize",
 "etcd_broker",
 "fail",
 "futures",
@@ -2161,6 +2145,7 @@ dependencies = [
 "postgres-types",
 "postgres_ffi",
 "pprof",
+ "pq_proto",
 "rand",
 "regex",
 "remote_storage",
@@ -2173,6 +2158,7 @@ dependencies = [
 "svg_fmt",
 "tar",
 "tempfile",
+ "tenant_size_model",
 "thiserror",
 "tokio",
 "tokio-postgres",
@@ -2190,6 +2176,7 @@ name = "pageserver_api"
 version = "0.1.0"
 dependencies = [
 "anyhow",
+ "byteorder",
 "bytes",
 "const_format",
 "postgres_ffi",
@@ -2452,6 +2439,21 @@ version = "0.2.16"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "eb9f9e6e233e5c4a35559a617bf40a4ec447db2e84c20b55a6f83167b7e57872"

+[[package]]
+name = "pq_proto"
+version = "0.1.0"
+dependencies = [
+ "anyhow",
+ "bytes",
+ "pin-project-lite",
+ "postgres-protocol",
+ "rand",
+ "serde",
+ "tokio",
+ "tracing",
+ "workspace_hack",
+]
+
 [[package]]
 name = "prettyplease"
 version = "0.1.21"
@@ -2584,6 +2586,7 @@ dependencies = [
 "once_cell",
 "parking_lot 0.12.1",
 "pin-project-lite",
+ "pq_proto",
 "rand",
 "rcgen",
 "reqwest",
@@ -3087,7 +3090,6 @@ dependencies = [
 "clap 4.0.15",
 "const_format",
 "crc32c",
- "daemonize",
 "etcd_broker",
 "fs2",
 "git-version",
@@ -3095,11 +3097,13 @@ dependencies = [
 "humantime",
 "hyper",
 "metrics",
+ "nix 0.25.0",
 "once_cell",
 "parking_lot 0.12.1",
 "postgres",
 "postgres-protocol",
 "postgres_ffi",
+ "pq_proto",
 "regex",
 "remote_storage",
 "safekeeper_api",
@@ -3548,6 +3552,13 @@ dependencies = [
 "winapi",
 ]

+[[package]]
+name = "tenant_size_model"
+version = "0.1.0"
+dependencies = [
+ "workspace_hack",
+]
+
 [[package]]
 name = "termcolor"
 version = "1.1.3"
@@ -4053,9 +4064,7 @@ dependencies = [
 "metrics",
 "nix 0.25.0",
 "once_cell",
- "pin-project-lite",
- "postgres",
- "postgres-protocol",
+ "pq_proto",
 "rand",
 "routerify",
 "rustls",
@@ -4380,6 +4389,9 @@ dependencies = [
 "crossbeam-utils",
 "either",
 "fail",
+ "futures-channel",
+ "futures-task",
+ "futures-util",
 "hashbrown",
 "indexmap",
 "libc",
@@ -4393,6 +4405,7 @@ dependencies = [
 "rand",
 "regex",
 "regex-syntax",
+ "reqwest",
 "scopeguard",
 "serde",
 "stable_deref_trait",
--- a/Dockerfile.compute-node-v14
+++ b/Dockerfile.compute-node-v14
@@ -13,7 +13,7 @@ ARG TAG=pinned
 FROM debian:bullseye-slim AS build-deps
 RUN apt update &&  \
    apt install -y git autoconf automake libtool build-essential bison flex libreadline-dev \
-    zlib1g-dev libxml2-dev libcurl4-openssl-dev libossp-uuid-dev wget pkg-config
+    zlib1g-dev libxml2-dev libcurl4-openssl-dev libossp-uuid-dev wget pkg-config libssl-dev

 #########################################################################################
 #
@@ -24,7 +24,7 @@ RUN apt update &&  \
 FROM build-deps AS pg-build
 COPY vendor/postgres-v14 postgres
 RUN cd postgres && \
-    ./configure CFLAGS='-O2 -g3' --enable-debug --with-uuid=ossp && \
+    ./configure CFLAGS='-O2 -g3' --enable-debug --with-openssl --with-uuid=ossp && \
    make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s install && \
    make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C contrib/ install && \
    # Install headers
--- a/Dockerfile.compute-node-v15
+++ b/Dockerfile.compute-node-v15
@@ -13,7 +13,7 @@ ARG TAG=pinned
 FROM debian:bullseye-slim AS build-deps
 RUN apt update &&  \
    apt install -y git autoconf automake libtool build-essential bison flex libreadline-dev \
-    zlib1g-dev libxml2-dev libcurl4-openssl-dev libossp-uuid-dev wget pkg-config
+    zlib1g-dev libxml2-dev libcurl4-openssl-dev libossp-uuid-dev wget pkg-config libssl-dev

 #########################################################################################
 #
@@ -24,7 +24,7 @@ RUN apt update &&  \
 FROM build-deps AS pg-build
 COPY vendor/postgres-v15 postgres
 RUN cd postgres && \
-    ./configure CFLAGS='-O2 -g3' --enable-debug --with-uuid=ossp && \
+    ./configure CFLAGS='-O2 -g3' --enable-debug --with-openssl --with-uuid=ossp && \
    make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s install && \
    make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C contrib/ install && \
    # Install headers
--- a/10
+++ b/10
@@ -151,6 +151,11 @@ neon-pg-ext-v14: postgres-v14
 	(cd $(POSTGRES_INSTALL_DIR)/build/neon-v14 && \
 	$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/v14/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
 		-f $(ROOT_PROJECT_DIR)/pgxn/neon/Makefile install)
+	+@echo "Compiling neon_walredo v14"
+	mkdir -p $(POSTGRES_INSTALL_DIR)/build/neon-walredo-v14
+	(cd $(POSTGRES_INSTALL_DIR)/build/neon-walredo-v14 && \
+	$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/v14/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
+		-f $(ROOT_PROJECT_DIR)/pgxn/neon_walredo/Makefile install)
 	+@echo "Compiling neon_test_utils" v14
 	mkdir -p $(POSTGRES_INSTALL_DIR)/build/neon-test-utils-v14
 	(cd $(POSTGRES_INSTALL_DIR)/build/neon-test-utils-v14 && \
@@ -163,6 +168,11 @@ neon-pg-ext-v15: postgres-v15
 	(cd $(POSTGRES_INSTALL_DIR)/build/neon-v15 && \
 	$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/v15/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
 		-f $(ROOT_PROJECT_DIR)/pgxn/neon/Makefile install)
+	+@echo "Compiling neon_walredo v15"
+	mkdir -p $(POSTGRES_INSTALL_DIR)/build/neon-walredo-v15
+	(cd $(POSTGRES_INSTALL_DIR)/build/neon-walredo-v15 && \
+	$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/v15/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
+		-f $(ROOT_PROJECT_DIR)/pgxn/neon_walredo/Makefile install)
 	+@echo "Compiling neon_test_utils" v15
 	mkdir -p $(POSTGRES_INSTALL_DIR)/build/neon-test-utils-v15
 	(cd $(POSTGRES_INSTALL_DIR)/build/neon-test-utils-v15 && \
--- a/README.md
+++ b/README.md
@@ -223,10 +223,7 @@ Ensure your dependencies are installed as described [here](https://github.com/ne
 ```sh
 git clone --recursive https://github.com/neondatabase/neon.git

-# either:
 CARGO_BUILD_FLAGS="--features=testing" make
-# or:
-make debug

 ./scripts/pytest
 ```
--- a/control_plane/Cargo.toml
+++ b/control_plane/Cargo.toml
@@ -4,20 +4,21 @@ version = "0.1.0"
 edition = "2021"

 [dependencies]
+anyhow = "1.0"
 clap = "4.0"
 comfy-table = "6.1"
 git-version = "0.3.5"
-tar = "0.4.38"
+nix = "0.25"
+once_cell = "1.13.0"
 postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev = "d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
+regex = "1"
+reqwest = { version = "0.11", default-features = false, features = ["blocking", "json", "rustls-tls"] }
 serde = { version = "1.0", features = ["derive"] }
 serde_with = "2.0"
-toml = "0.5"
-once_cell = "1.13.0"
-regex = "1"
-anyhow = "1.0"
+tar = "0.4.38"
 thiserror = "1"
-nix = "0.25"
-reqwest = { version = "0.11", default-features = false, features = ["blocking", "json", "rustls-tls"] }
+toml = "0.5"
+url = "2.2.2"

 # Note: Do not directly depend on pageserver or safekeeper; use pageserver_api or safekeeper_api
 # instead, so that recompile times are better.
--- a/control_plane/src/background_process.rs
+++ b/control_plane/src/background_process.rs
@@ -0,0 +1,264 @@
+//! Spawns and kills background processes that are needed by Neon CLI.
+//! Applies common set-up such as log and pid files (if needed) to every process.
+//!
+//! Neon CLI does not run in background, so it needs to store the information about
+//! spawned processes, which it does in this module.
+//! We do that by storing the pid of the process in the "${process_name}.pid" file.
+//! The pid file can be created by the process itself
+//! (Neon storage binaries do that and also ensure that a lock is taken onto that file)
+//! or we create such file after starting the process
+//! (non-Neon binaries don't necessarily follow our pidfile conventions).
+//! The pid stored in the file is later used to stop the service.
+//!
+//! See [`lock_file`] module for more info.
+
+use std::ffi::OsStr;
+use std::io::Write;
+use std::path::Path;
+use std::process::{Child, Command};
+use std::time::Duration;
+use std::{fs, io, thread};
+
+use anyhow::{anyhow, bail, Context, Result};
+use nix::errno::Errno;
+use nix::sys::signal::{kill, Signal};
+use nix::unistd::Pid;
+
+use utils::lock_file;
+
+const RETRIES: u32 = 15;
+const RETRY_TIMEOUT_MILLIS: u64 = 500;
+
+/// Argument to `start_process`, to indicate whether it should create pidfile or if the process creates
+/// it itself.
+pub enum InitialPidFile<'t> {
+    /// Create a pidfile, to allow future CLI invocations to manipulate the process.
+    Create(&'t Path),
+    /// The process will create the pidfile itself, need to wait for that event.
+    Expect(&'t Path),
+}
+
+/// Start a background child process using the parameters given.
+pub fn start_process<F, S: AsRef<OsStr>>(
+    process_name: &str,
+    datadir: &Path,
+    command: &Path,
+    args: &[S],
+    initial_pid_file: InitialPidFile,
+    process_status_check: F,
+) -> anyhow::Result<Child>
+where
+    F: Fn() -> anyhow::Result<bool>,
+{
+    let log_path = datadir.join(format!("{process_name}.log"));
+    let process_log_file = fs::OpenOptions::new()
+        .create(true)
+        .write(true)
+        .append(true)
+        .open(&log_path)
+        .with_context(|| {
+            format!("Could not open {process_name} log file {log_path:?} for writing")
+        })?;
+    let same_file_for_stderr = process_log_file.try_clone().with_context(|| {
+        format!("Could not reuse {process_name} log file {log_path:?} for writing stderr")
+    })?;
+
+    let mut command = Command::new(command);
+    let background_command = command
+        .stdout(process_log_file)
+        .stderr(same_file_for_stderr)
+        .args(args);
+    let filled_cmd = fill_aws_secrets_vars(fill_rust_env_vars(background_command));
+
+    let mut spawned_process = filled_cmd.spawn().with_context(|| {
+        format!("Could not spawn {process_name}, see console output and log files for details.")
+    })?;
+    let pid = spawned_process.id();
+    let pid = Pid::from_raw(
+        i32::try_from(pid)
+            .with_context(|| format!("Subprocess {process_name} has invalid pid {pid}"))?,
+    );
+
+    let pid_file_to_check = match initial_pid_file {
+        InitialPidFile::Create(target_pid_file_path) => {
+            match lock_file::create_lock_file(target_pid_file_path, pid.to_string()) {
+                lock_file::LockCreationResult::Created { .. } => {
+                    // We use "lock" file here only to create the pid file. The lock on the pidfile will be dropped as soon
+                    // as this CLI invocation exits, so it's a bit useless, but doesn't any harm either.
+                }
+                lock_file::LockCreationResult::AlreadyLocked { .. } => {
+                    anyhow::bail!("Cannot write pid file for {process_name} at path {target_pid_file_path:?}: file is already locked by another process")
+                }
+                lock_file::LockCreationResult::CreationFailed(e) => {
+                    return Err(e.context(format!(
+                    "Failed to create pid file for {process_name} at path {target_pid_file_path:?}"
+                )))
+                }
+            }
+            None
+        }
+        InitialPidFile::Expect(pid_file_path) => Some(pid_file_path),
+    };
+
+    for retries in 0..RETRIES {
+        match process_started(pid, pid_file_to_check, &process_status_check) {
+            Ok(true) => {
+                println!("\n{process_name} started, pid: {pid}");
+                return Ok(spawned_process);
+            }
+            Ok(false) => {
+                if retries < 5 {
+                    print!(".");
+                    io::stdout().flush().unwrap();
+                } else {
+                    if retries == 5 {
+                        println!() // put a line break after dots for second message
+                    }
+                    println!("{process_name} has not started yet, retrying ({retries})...");
+                }
+                thread::sleep(Duration::from_millis(RETRY_TIMEOUT_MILLIS));
+            }
+            Err(e) => {
+                println!("{process_name} failed to start: {e:#}");
+                if let Err(e) = spawned_process.kill() {
+                    println!("Could not stop {process_name} subprocess: {e:#}")
+                };
+                return Err(e);
+            }
+        }
+    }
+    anyhow::bail!("{process_name} could not start in {RETRIES} attempts");
+}
+
+/// Stops the process, using the pid file given. Returns Ok also if the process is already not running.
+pub fn stop_process(immediate: bool, process_name: &str, pid_file: &Path) -> anyhow::Result<()> {
+    if !pid_file.exists() {
+        println!("{process_name} is already stopped: no pid file {pid_file:?} is present");
+        return Ok(());
+    }
+    let pid = read_pidfile(pid_file)?;
+
+    let sig = if immediate {
+        print!("Stopping {process_name} with pid {pid} immediately..");
+        Signal::SIGQUIT
+    } else {
+        print!("Stopping {process_name} with pid {pid} gracefully..");
+        Signal::SIGTERM
+    };
+    io::stdout().flush().unwrap();
+    match kill(pid, sig) {
+        Ok(()) => (),
+        Err(Errno::ESRCH) => {
+            println!(
+                "{process_name} with pid {pid} does not exist, but a pid file {pid_file:?} was found"
+            );
+            return Ok(());
+        }
+        Err(e) => anyhow::bail!("Failed to send signal to {process_name} with pid {pid}: {e}"),
+    }
+
+    // Wait until process is gone
+    for _ in 0..RETRIES {
+        match process_has_stopped(pid) {
+            Ok(true) => {
+                println!("\n{process_name} stopped");
+                if let Err(e) = fs::remove_file(pid_file) {
+                    if e.kind() != io::ErrorKind::NotFound {
+                        eprintln!("Failed to remove pid file {pid_file:?} after stopping the process: {e:#}");
+                    }
+                }
+                return Ok(());
+            }
+            Ok(false) => {
+                print!(".");
+                io::stdout().flush().unwrap();
+                thread::sleep(Duration::from_secs(1))
+            }
+            Err(e) => {
+                println!("{process_name} with pid {pid} failed to stop: {e:#}");
+                return Err(e);
+            }
+        }
+    }
+
+    anyhow::bail!("{process_name} with pid {pid} failed to stop in {RETRIES} attempts");
+}
+
+fn fill_rust_env_vars(cmd: &mut Command) -> &mut Command {
+    let mut filled_cmd = cmd.env_clear().env("RUST_BACKTRACE", "1");
+
+    let var = "LLVM_PROFILE_FILE";
+    if let Some(val) = std::env::var_os(var) {
+        filled_cmd = filled_cmd.env(var, val);
+    }
+
+    const RUST_LOG_KEY: &str = "RUST_LOG";
+    if let Ok(rust_log_value) = std::env::var(RUST_LOG_KEY) {
+        filled_cmd.env(RUST_LOG_KEY, rust_log_value)
+    } else {
+        filled_cmd
+    }
+}
+
+fn fill_aws_secrets_vars(mut cmd: &mut Command) -> &mut Command {
+    for env_key in [
+        "AWS_ACCESS_KEY_ID",
+        "AWS_SECRET_ACCESS_KEY",
+        "AWS_SESSION_TOKEN",
+    ] {
+        if let Ok(value) = std::env::var(env_key) {
+            cmd = cmd.env(env_key, value);
+        }
+    }
+    cmd
+}
+
+fn process_started<F>(
+    pid: Pid,
+    pid_file_to_check: Option<&Path>,
+    status_check: &F,
+) -> anyhow::Result<bool>
+where
+    F: Fn() -> anyhow::Result<bool>,
+{
+    match status_check() {
+        Ok(true) => match pid_file_to_check {
+            Some(pid_file_path) => {
+                if pid_file_path.exists() {
+                    let pid_in_file = read_pidfile(pid_file_path)?;
+                    Ok(pid_in_file == pid)
+                } else {
+                    Ok(false)
+                }
+            }
+            None => Ok(true),
+        },
+        Ok(false) => Ok(false),
+        Err(e) => anyhow::bail!("process failed to start: {e}"),
+    }
+}
+
+/// Read a PID file
+///
+/// We expect a file that contains a single integer.
+fn read_pidfile(pidfile: &Path) -> Result<Pid> {
+    let pid_str = fs::read_to_string(pidfile)
+        .with_context(|| format!("failed to read pidfile {pidfile:?}"))?;
+    let pid: i32 = pid_str
+        .parse()
+        .map_err(|_| anyhow!("failed to parse pidfile {pidfile:?}"))?;
+    if pid < 1 {
+        bail!("pidfile {pidfile:?} contained bad value '{pid}'");
+    }
+    Ok(Pid::from_raw(pid))
+}
+
+fn process_has_stopped(pid: Pid) -> anyhow::Result<bool> {
+    match kill(pid, None) {
+        // Process exists, keep waiting
+        Ok(_) => Ok(false),
+        // Process not found, we're done
+        Err(Errno::ESRCH) => Ok(true),
+        Err(err) => anyhow::bail!("Failed to send signal to process with pid {pid}: {err}"),
+    }
+}
--- a/control_plane/src/bin/neon_local.rs
+++ b/control_plane/src/bin/neon_local.rs
@@ -9,8 +9,8 @@ use anyhow::{anyhow, bail, Context, Result};
 use clap::{value_parser, Arg, ArgAction, ArgMatches, Command};
 use control_plane::compute::ComputeControlPlane;
 use control_plane::local_env::{EtcdBroker, LocalEnv};
+use control_plane::pageserver::PageServerNode;
 use control_plane::safekeeper::SafekeeperNode;
-use control_plane::storage::PageServerNode;
 use control_plane::{etcd, local_env};
 use pageserver_api::models::TimelineInfo;
 use pageserver_api::{
--- a/control_plane/src/compute.rs
+++ b/control_plane/src/compute.rs
@@ -12,15 +12,14 @@ use std::time::Duration;

 use anyhow::{Context, Result};
 use utils::{
-    connstring::connection_host_port,
    id::{TenantId, TimelineId},
    lsn::Lsn,
    postgres_backend::AuthType,
 };

 use crate::local_env::{LocalEnv, DEFAULT_PG_VERSION};
+use crate::pageserver::PageServerNode;
 use crate::postgresql_conf::PostgresConf;
-use crate::storage::PageServerNode;

 //
 // ComputeControlPlane
@@ -300,7 +299,8 @@ impl PostgresNode {

        // Configure the node to fetch pages from pageserver
        let pageserver_connstr = {
-            let (host, port) = connection_host_port(&self.pageserver.pg_connection_config);
+            let config = &self.pageserver.pg_connection_config;
+            let (host, port) = (config.host(), config.port());

            // Set up authentication
            //
--- a/control_plane/src/connection.rs
+++ b/control_plane/src/connection.rs
@@ -0,0 +1,57 @@
+use url::Url;
+
+#[derive(Debug)]
+pub struct PgConnectionConfig {
+    url: Url,
+}
+
+impl PgConnectionConfig {
+    pub fn host(&self) -> &str {
+        self.url.host_str().expect("BUG: no host")
+    }
+
+    pub fn port(&self) -> u16 {
+        self.url.port().expect("BUG: no port")
+    }
+
+    /// Return a `<host>:<port>` string.
+    pub fn raw_address(&self) -> String {
+        format!("{}:{}", self.host(), self.port())
+    }
+
+    /// Connect using postgres protocol with TLS disabled.
+    pub fn connect_no_tls(&self) -> Result<postgres::Client, postgres::Error> {
+        postgres::Client::connect(self.url.as_str(), postgres::NoTls)
+    }
+}
+
+impl std::str::FromStr for PgConnectionConfig {
+    type Err = anyhow::Error;
+
+    fn from_str(s: &str) -> Result<Self, Self::Err> {
+        let mut url: Url = s.parse()?;
+
+        match url.scheme() {
+            "postgres" | "postgresql" => {}
+            other => anyhow::bail!("invalid scheme: {other}"),
+        }
+
+        // It's not a valid connection url if host is unavailable.
+        if url.host().is_none() {
+            anyhow::bail!(url::ParseError::EmptyHost);
+        }
+
+        // E.g. `postgres:bar`.
+        if url.cannot_be_a_base() {
+            anyhow::bail!("URL cannot be a base");
+        }
+
+        // Set the default PG port if it's missing.
+        if url.port().is_none() {
+            url.set_port(Some(5432))
+                .expect("BUG: couldn't set the default port");
+        }
+
+        Ok(Self { url })
+    }
+}
--- a/control_plane/src/etcd.rs
+++ b/control_plane/src/etcd.rs
@@ -1,99 +1,75 @@
-use std::{
-    fs,
-    path::PathBuf,
-    process::{Command, Stdio},
-};
+use std::{fs, path::PathBuf};

 use anyhow::Context;
-use nix::{
-    sys::signal::{kill, Signal},
-    unistd::Pid,
-};

-use crate::{local_env, read_pidfile};
+use crate::{background_process, local_env};

 pub fn start_etcd_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
    let etcd_broker = &env.etcd_broker;
    println!(
-        "Starting etcd broker using {}",
-        etcd_broker.etcd_binary_path.display()
+        "Starting etcd broker using {:?}",
+        etcd_broker.etcd_binary_path
    );

    let etcd_data_dir = env.base_data_dir.join("etcd");
-    fs::create_dir_all(&etcd_data_dir).with_context(|| {
-        format!(
-            "Failed to create etcd data dir: {}",
-            etcd_data_dir.display()
-        )
-    })?;
+    fs::create_dir_all(&etcd_data_dir)
+        .with_context(|| format!("Failed to create etcd data dir {etcd_data_dir:?}"))?;

-    let etcd_stdout_file =
-        fs::File::create(etcd_data_dir.join("etcd.stdout.log")).with_context(|| {
-            format!(
-                "Failed to create etcd stout file in directory {}",
-                etcd_data_dir.display()
-            )
-        })?;
-    let etcd_stderr_file =
-        fs::File::create(etcd_data_dir.join("etcd.stderr.log")).with_context(|| {
-            format!(
-                "Failed to create etcd stderr file in directory {}",
-                etcd_data_dir.display()
-            )
-        })?;
    let client_urls = etcd_broker.comma_separated_endpoints();
+    let args = [
+        format!("--data-dir={}", etcd_data_dir.display()),
+        format!("--listen-client-urls={client_urls}"),
+        format!("--advertise-client-urls={client_urls}"),
+        // Set --quota-backend-bytes to keep the etcd virtual memory
+        // size smaller. Our test etcd clusters are very small.
+        // See https://github.com/etcd-io/etcd/issues/7910
+        "--quota-backend-bytes=100000000".to_string(),
+        // etcd doesn't compact (vacuum) with default settings,
+        // enable it to prevent space exhaustion.
+        "--auto-compaction-mode=revision".to_string(),
+        "--auto-compaction-retention=1".to_string(),
+    ];

-    let etcd_process = Command::new(&etcd_broker.etcd_binary_path)
-        .args(&[
-            format!("--data-dir={}", etcd_data_dir.display()),
-            format!("--listen-client-urls={client_urls}"),
-            format!("--advertise-client-urls={client_urls}"),
-            // Set --quota-backend-bytes to keep the etcd virtual memory
-            // size smaller. Our test etcd clusters are very small.
-            // See https://github.com/etcd-io/etcd/issues/7910
-            "--quota-backend-bytes=100000000".to_string(),
-            // etcd doesn't compact (vacuum) with default settings,
-            // enable it to prevent space exhaustion.
-            "--auto-compaction-mode=revision".to_string(),
-            "--auto-compaction-retention=1".to_string(),
-        ])
-        .stdout(Stdio::from(etcd_stdout_file))
-        .stderr(Stdio::from(etcd_stderr_file))
-        .spawn()
-        .context("Failed to spawn etcd subprocess")?;
-    let pid = etcd_process.id();
+    let pid_file_path = etcd_pid_file_path(env);

-    let etcd_pid_file_path = etcd_pid_file_path(env);
-    fs::write(&etcd_pid_file_path, pid.to_string()).with_context(|| {
-        format!(
-            "Failed to create etcd pid file at {}",
-            etcd_pid_file_path.display()
-        )
-    })?;
+    let client = reqwest::blocking::Client::new();
+
+    background_process::start_process(
+        "etcd",
+        &etcd_data_dir,
+        &etcd_broker.etcd_binary_path,
+        &args,
+        background_process::InitialPidFile::Create(&pid_file_path),
+        || {
+            for broker_endpoint in &etcd_broker.broker_endpoints {
+                let request = broker_endpoint
+                    .join("health")
+                    .with_context(|| {
+                        format!(
+                            "Failed to append /health path to broker endopint {}",
+                            broker_endpoint
+                        )
+                    })
+                    .and_then(|url| {
+                        client.get(&url.to_string()).build().with_context(|| {
+                            format!("Failed to construct request to etcd endpoint {url}")
+                        })
+                    })?;
+                if client.execute(request).is_ok() {
+                    return Ok(true);
+                }
+            }
+
+            Ok(false)
+        },
+    )
+    .context("Failed to spawn etcd subprocess")?;

    Ok(())
 }

 pub fn stop_etcd_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
-    let etcd_path = &env.etcd_broker.etcd_binary_path;
-    println!("Stopping etcd broker at {}", etcd_path.display());
-
-    let etcd_pid_file_path = etcd_pid_file_path(env);
-    let pid = Pid::from_raw(read_pidfile(&etcd_pid_file_path).with_context(|| {
-        format!(
-            "Failed to read etcd pid file at {}",
-            etcd_pid_file_path.display()
-        )
-    })?);
-
-    kill(pid, Signal::SIGTERM).with_context(|| {
-        format!(
-            "Failed to stop etcd with pid {pid} at {}",
-            etcd_pid_file_path.display()
-        )
-    })?;
-
-    Ok(())
+    background_process::stop_process(true, "etcd", &etcd_pid_file_path(env))
 }

 fn etcd_pid_file_path(env: &local_env::LocalEnv) -> PathBuf {
--- a/control_plane/src/lib.rs
+++ b/control_plane/src/lib.rs
@@ -6,59 +6,12 @@
 // Intended to be used in integration tests and in CLI tools for
 // local installations.
 //
-use anyhow::{anyhow, bail, Context, Result};
-use std::fs;
-use std::path::Path;
-use std::process::Command;

+mod background_process;
 pub mod compute;
+pub mod connection;
 pub mod etcd;
 pub mod local_env;
+pub mod pageserver;
 pub mod postgresql_conf;
 pub mod safekeeper;
-pub mod storage;
-
-/// Read a PID file
-///
-/// We expect a file that contains a single integer.
-/// We return an i32 for compatibility with libc and nix.
-pub fn read_pidfile(pidfile: &Path) -> Result<i32> {
-    let pid_str = fs::read_to_string(pidfile)
-        .with_context(|| format!("failed to read pidfile {:?}", pidfile))?;
-    let pid: i32 = pid_str
-        .parse()
-        .map_err(|_| anyhow!("failed to parse pidfile {:?}", pidfile))?;
-    if pid < 1 {
-        bail!("pidfile {:?} contained bad value '{}'", pidfile, pid);
-    }
-    Ok(pid)
-}
-
-fn fill_rust_env_vars(cmd: &mut Command) -> &mut Command {
-    let cmd = cmd.env_clear().env("RUST_BACKTRACE", "1");
-
-    let var = "LLVM_PROFILE_FILE";
-    if let Some(val) = std::env::var_os(var) {
-        cmd.env(var, val);
-    }
-
-    const RUST_LOG_KEY: &str = "RUST_LOG";
-    if let Ok(rust_log_value) = std::env::var(RUST_LOG_KEY) {
-        cmd.env(RUST_LOG_KEY, rust_log_value)
-    } else {
-        cmd
-    }
-}
-
-fn fill_aws_secrets_vars(mut cmd: &mut Command) -> &mut Command {
-    for env_key in [
-        "AWS_ACCESS_KEY_ID",
-        "AWS_SECRET_ACCESS_KEY",
-        "AWS_SESSION_TOKEN",
-    ] {
-        if let Ok(value) = std::env::var(env_key) {
-            cmd = cmd.env(env_key, value);
-        }
-    }
-    cmd
-}
--- a/control_plane/src/local_env.rs
+++ b/control_plane/src/local_env.rs
@@ -226,12 +226,12 @@ impl LocalEnv {
        }
    }

-    pub fn pageserver_bin(&self) -> anyhow::Result<PathBuf> {
-        Ok(self.neon_distrib_dir.join("pageserver"))
+    pub fn pageserver_bin(&self) -> PathBuf {
+        self.neon_distrib_dir.join("pageserver")
    }

-    pub fn safekeeper_bin(&self) -> anyhow::Result<PathBuf> {
-        Ok(self.neon_distrib_dir.join("safekeeper"))
+    pub fn safekeeper_bin(&self) -> PathBuf {
+        self.neon_distrib_dir.join("safekeeper")
    }

    pub fn pg_data_dirs_path(&self) -> PathBuf {
--- a/control_plane/src/pageserver.rs
+++ b/control_plane/src/pageserver.rs
@@ -1,33 +1,27 @@
 use std::collections::HashMap;
-use std::fs::File;
+use std::fs::{self, File};
 use std::io::{BufReader, Write};
 use std::num::NonZeroU64;
 use std::path::{Path, PathBuf};
-use std::process::Command;
-use std::time::Duration;
-use std::{io, result, thread};
+use std::process::Child;
+use std::{io, result};

+use crate::connection::PgConnectionConfig;
 use anyhow::{bail, Context};
-use nix::errno::Errno;
-use nix::sys::signal::{kill, Signal};
-use nix::unistd::Pid;
 use pageserver_api::models::{
    TenantConfigRequest, TenantCreateRequest, TenantInfo, TimelineCreateRequest, TimelineInfo,
 };
-use postgres::{Config, NoTls};
 use reqwest::blocking::{Client, RequestBuilder, Response};
 use reqwest::{IntoUrl, Method};
 use thiserror::Error;
 use utils::{
-    connstring::connection_address,
    http::error::HttpErrorBody,
    id::{TenantId, TimelineId},
    lsn::Lsn,
    postgres_backend::AuthType,
 };

-use crate::local_env::LocalEnv;
-use crate::{fill_aws_secrets_vars, fill_rust_env_vars, read_pidfile};
+use crate::{background_process, local_env::LocalEnv};

 #[derive(Error, Debug)]
 pub enum PageserverHttpError {
@@ -75,7 +69,7 @@ impl ResponseErrorMessageExt for Response {
 //
 #[derive(Debug)]
 pub struct PageServerNode {
-    pub pg_connection_config: Config,
+    pub pg_connection_config: PgConnectionConfig,
    pub env: LocalEnv,
    pub http_client: Client,
    pub http_base_url: String,
@@ -101,7 +95,7 @@ impl PageServerNode {
    }

    /// Construct libpq connection string for connecting to the pageserver.
-    fn pageserver_connection_config(password: &str, listen_addr: &str) -> Config {
+    fn pageserver_connection_config(password: &str, listen_addr: &str) -> PgConnectionConfig {
        format!("postgresql://no_user:{password}@{listen_addr}/no_db")
            .parse()
            .unwrap()
@@ -161,7 +155,15 @@ impl PageServerNode {
            init_config_overrides.push("auth_validation_public_key_path='auth_public_key.pem'");
        }

-        self.start_node(&init_config_overrides, &self.env.base_data_dir, true)?;
+        let mut pageserver_process = self
+            .start_node(&init_config_overrides, &self.env.base_data_dir, true)
+            .with_context(|| {
+                format!(
+                    "Failed to start a process for pageserver {}",
+                    self.env.pageserver.id,
+                )
+            })?;
+
        let init_result = self
            .try_init_timeline(create_tenant, initial_timeline_id, pg_version)
            .context("Failed to create initial tenant and timeline for pageserver");
@@ -171,7 +173,29 @@ impl PageServerNode {
            }
            Err(e) => eprintln!("{e:#}"),
        }
-        self.stop(false)?;
+        match pageserver_process.kill() {
+            Err(e) => {
+                eprintln!(
+                    "Failed to stop pageserver {} process with pid {}: {e:#}",
+                    self.env.pageserver.id,
+                    pageserver_process.id(),
+                )
+            }
+            Ok(()) => {
+                println!(
+                    "Stopped pageserver {} process with pid {}",
+                    self.env.pageserver.id,
+                    pageserver_process.id(),
+                );
+                // cleanup after pageserver startup, since we do not call regular `stop_process` during init
+                let pid_file = self.pid_file();
+                if let Err(e) = fs::remove_file(&pid_file) {
+                    if e.kind() != io::ErrorKind::NotFound {
+                        eprintln!("Failed to remove pid file {pid_file:?} after stopping the process: {e:#}");
+                    }
+                }
+            }
+        }
        init_result
    }

@@ -196,11 +220,14 @@ impl PageServerNode {
        self.env.pageserver_data_dir()
    }

-    pub fn pid_file(&self) -> PathBuf {
+    /// The pid file is created by the pageserver process, with its pid stored inside.
+    /// Other pageservers cannot lock the same file and overwrite it for as long as the current
+    /// pageserver runs. (Unless someone removes the file manually; never do that!)
+    fn pid_file(&self) -> PathBuf {
        self.repo_path().join("pageserver.pid")
    }

-    pub fn start(&self, config_overrides: &[&str]) -> anyhow::Result<()> {
+    pub fn start(&self, config_overrides: &[&str]) -> anyhow::Result<Child> {
        self.start_node(config_overrides, &self.repo_path(), false)
    }

@@ -209,10 +236,10 @@ impl PageServerNode {
        config_overrides: &[&str],
        datadir: &Path,
        update_config: bool,
-    ) -> anyhow::Result<()> {
+    ) -> anyhow::Result<Child> {
        println!(
            "Starting pageserver at '{}' in '{}'",
-            connection_address(&self.pg_connection_config),
+            self.pg_connection_config.raw_address(),
            datadir.display()
        );
        io::stdout().flush()?;
@@ -220,10 +247,7 @@ impl PageServerNode {
        let mut args = vec![
            "-D",
            datadir.to_str().with_context(|| {
-                format!(
-                    "Datadir path '{}' cannot be represented as a unicode string",
-                    datadir.display()
-                )
+                format!("Datadir path {datadir:?} cannot be represented as a unicode string")
            })?,
        ];

@@ -235,48 +259,18 @@ impl PageServerNode {
            args.extend(["-c", config_override]);
        }

-        let mut cmd = Command::new(self.env.pageserver_bin()?);
-        let mut filled_cmd = fill_rust_env_vars(cmd.args(&args).arg("--daemonize"));
-        filled_cmd = fill_aws_secrets_vars(filled_cmd);
-
-        if !filled_cmd.status()?.success() {
-            bail!(
-                "Pageserver failed to start. See console output and '{}' for details.",
-                datadir.join("pageserver.log").display()
-            );
-        }
-
-        // It takes a while for the page server to start up. Wait until it is
-        // open for business.
-        const RETRIES: i8 = 15;
-        for retries in 1..RETRIES {
-            match self.check_status() {
-                Ok(()) => {
-                    println!("\nPageserver started");
-                    return Ok(());
-                }
-                Err(err) => {
-                    match err {
-                        PageserverHttpError::Transport(err) => {
-                            if err.is_connect() && retries < 5 {
-                                print!(".");
-                                io::stdout().flush().unwrap();
-                            } else {
-                                if retries == 5 {
-                                    println!() // put a line break after dots for second message
-                                }
-                                println!("Pageserver not responding yet, err {err} retrying ({retries})...");
-                            }
-                        }
-                        PageserverHttpError::Response(msg) => {
-                            bail!("pageserver failed to start: {msg} ")
-                        }
-                    }
-                    thread::sleep(Duration::from_secs(1));
-                }
-            }
-        }
-        bail!("pageserver failed to start in {RETRIES} seconds");
+        background_process::start_process(
+            "pageserver",
+            datadir,
+            &self.env.pageserver_bin(),
+            &args,
+            background_process::InitialPidFile::Expect(&self.pid_file()),
+            || match self.check_status() {
+                Ok(()) => Ok(true),
+                Err(PageserverHttpError::Transport(_)) => Ok(false),
+                Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
+            },
+        )
    }

    ///
@@ -288,69 +282,18 @@ impl PageServerNode {
    /// If the server is not running, returns success
    ///
    pub fn stop(&self, immediate: bool) -> anyhow::Result<()> {
-        let pid_file = self.pid_file();
-        if !pid_file.exists() {
-            println!("Pageserver is already stopped");
-            return Ok(());
-        }
-        let pid = Pid::from_raw(read_pidfile(&pid_file)?);
-
-        let sig = if immediate {
-            print!("Stopping pageserver immediately..");
-            Signal::SIGQUIT
-        } else {
-            print!("Stopping pageserver gracefully..");
-            Signal::SIGTERM
-        };
-        io::stdout().flush().unwrap();
-        match kill(pid, sig) {
-            Ok(_) => (),
-            Err(Errno::ESRCH) => {
-                println!("Pageserver with pid {pid} does not exist, but a PID file was found");
-                return Ok(());
-            }
-            Err(err) => bail!(
-                "Failed to send signal to pageserver with pid {pid}: {}",
-                err.desc()
-            ),
-        }
-
-        // Wait until process is gone
-        for i in 0..600 {
-            let signal = None; // Send no signal, just get the error code
-            match kill(pid, signal) {
-                Ok(_) => (), // Process exists, keep waiting
-                Err(Errno::ESRCH) => {
-                    // Process not found, we're done
-                    println!("done!");
-                    return Ok(());
-                }
-                Err(err) => bail!(
-                    "Failed to send signal to pageserver with pid {}: {}",
-                    pid,
-                    err.desc()
-                ),
-            };
-
-            if i % 10 == 0 {
-                print!(".");
-                io::stdout().flush().unwrap();
-            }
-            thread::sleep(Duration::from_millis(100));
-        }
-
-        bail!("Failed to stop pageserver with pid {pid}");
+        background_process::stop_process(immediate, "pageserver", &self.pid_file())
    }

    pub fn page_server_psql(&self, sql: &str) -> Vec<postgres::SimpleQueryMessage> {
-        let mut client = self.pg_connection_config.connect(NoTls).unwrap();
+        let mut client = self.pg_connection_config.connect_no_tls().unwrap();

        println!("Pageserver query: '{sql}'");
        client.simple_query(sql).unwrap()
    }

    pub fn page_server_psql_client(&self) -> result::Result<postgres::Client, postgres::Error> {
-        self.pg_connection_config.connect(NoTls)
+        self.pg_connection_config.connect_no_tls()
    }

    fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> RequestBuilder {
@@ -549,7 +492,7 @@ impl PageServerNode {
        pg_wal: Option<(Lsn, PathBuf)>,
        pg_version: u32,
    ) -> anyhow::Result<()> {
-        let mut client = self.pg_connection_config.connect(NoTls).unwrap();
+        let mut client = self.pg_connection_config.connect_no_tls().unwrap();

        // Init base reader
        let (start_lsn, base_tarfile_path) = base;
--- a/control_plane/src/safekeeper.rs
+++ b/control_plane/src/safekeeper.rs
@@ -1,23 +1,21 @@
 use std::io::Write;
 use std::path::PathBuf;
-use std::process::Command;
+use std::process::Child;
 use std::sync::Arc;
-use std::time::Duration;
-use std::{io, result, thread};
+use std::{io, result};

-use anyhow::bail;
-use nix::errno::Errno;
-use nix::sys::signal::{kill, Signal};
-use nix::unistd::Pid;
-use postgres::Config;
+use anyhow::Context;
 use reqwest::blocking::{Client, RequestBuilder, Response};
 use reqwest::{IntoUrl, Method};
 use thiserror::Error;
-use utils::{connstring::connection_address, http::error::HttpErrorBody, id::NodeId};
+use utils::{http::error::HttpErrorBody, id::NodeId};

-use crate::local_env::{LocalEnv, SafekeeperConf};
-use crate::storage::PageServerNode;
-use crate::{fill_aws_secrets_vars, fill_rust_env_vars, read_pidfile};
+use crate::connection::PgConnectionConfig;
+use crate::pageserver::PageServerNode;
+use crate::{
+    background_process,
+    local_env::{LocalEnv, SafekeeperConf},
+};

 #[derive(Error, Debug)]
 pub enum SafekeeperHttpError {
@@ -63,7 +61,7 @@ pub struct SafekeeperNode {

    pub conf: SafekeeperConf,

-    pub pg_connection_config: Config,
+    pub pg_connection_config: PgConnectionConfig,
    pub env: LocalEnv,
    pub http_client: Client,
    pub http_base_url: String,
@@ -87,15 +85,15 @@ impl SafekeeperNode {
    }

    /// Construct libpq connection string for connecting to this safekeeper.
-    fn safekeeper_connection_config(port: u16) -> Config {
+    fn safekeeper_connection_config(port: u16) -> PgConnectionConfig {
        // TODO safekeeper authentication not implemented yet
-        format!("postgresql://no_user@127.0.0.1:{}/no_db", port)
+        format!("postgresql://no_user@127.0.0.1:{port}/no_db")
            .parse()
            .unwrap()
    }

    pub fn datadir_path_by_id(env: &LocalEnv, sk_id: NodeId) -> PathBuf {
-        env.safekeeper_data_dir(format!("sk{}", sk_id).as_ref())
+        env.safekeeper_data_dir(&format!("sk{sk_id}"))
    }

    pub fn datadir_path(&self) -> PathBuf {
@@ -106,91 +104,78 @@ impl SafekeeperNode {
        self.datadir_path().join("safekeeper.pid")
    }

-    pub fn start(&self) -> anyhow::Result<()> {
+    pub fn start(&self) -> anyhow::Result<Child> {
        print!(
            "Starting safekeeper at '{}' in '{}'",
-            connection_address(&self.pg_connection_config),
+            self.pg_connection_config.raw_address(),
            self.datadir_path().display()
        );
        io::stdout().flush().unwrap();

        let listen_pg = format!("127.0.0.1:{}", self.conf.pg_port);
        let listen_http = format!("127.0.0.1:{}", self.conf.http_port);
+        let id = self.id;
+        let datadir = self.datadir_path();

-        let mut cmd = Command::new(self.env.safekeeper_bin()?);
-        fill_rust_env_vars(
-            cmd.args(&["-D", self.datadir_path().to_str().unwrap()])
-                .args(&["--id", self.id.to_string().as_ref()])
-                .args(&["--listen-pg", &listen_pg])
-                .args(&["--listen-http", &listen_http])
-                .arg("--daemonize"),
-        );
+        let id_string = id.to_string();
+        let mut args = vec![
+            "-D",
+            datadir.to_str().with_context(|| {
+                format!("Datadir path {datadir:?} cannot be represented as a unicode string")
+            })?,
+            "--id",
+            &id_string,
+            "--listen-pg",
+            &listen_pg,
+            "--listen-http",
+            &listen_http,
+        ];
        if !self.conf.sync {
-            cmd.arg("--no-sync");
+            args.push("--no-sync");
        }

        let comma_separated_endpoints = self.env.etcd_broker.comma_separated_endpoints();
        if !comma_separated_endpoints.is_empty() {
-            cmd.args(&["--broker-endpoints", &comma_separated_endpoints]);
+            args.extend(["--broker-endpoints", &comma_separated_endpoints]);
        }
        if let Some(prefix) = self.env.etcd_broker.broker_etcd_prefix.as_deref() {
-            cmd.args(&["--broker-etcd-prefix", prefix]);
+            args.extend(["--broker-etcd-prefix", prefix]);
        }
+
+        let mut backup_threads = String::new();
        if let Some(threads) = self.conf.backup_threads {
-            cmd.args(&["--backup-threads", threads.to_string().as_ref()]);
+            backup_threads = threads.to_string();
+            args.extend(["--backup-threads", &backup_threads]);
+        } else {
+            drop(backup_threads);
        }
+
        if let Some(ref remote_storage) = self.conf.remote_storage {
-            cmd.args(&["--remote-storage", remote_storage]);
+            args.extend(["--remote-storage", remote_storage]);
        }
+
+        let key_path = self.env.base_data_dir.join("auth_public_key.pem");
        if self.conf.auth_enabled {
-            cmd.arg("--auth-validation-public-key-path");
-            // PathBuf is better be passed as is, not via `String`.
-            cmd.arg(self.env.base_data_dir.join("auth_public_key.pem"));
+            args.extend([
+                "--auth-validation-public-key-path",
+                key_path.to_str().with_context(|| {
+                    format!("Key path {key_path:?} cannot be represented as a unicode string")
+                })?,
+            ]);
        }

-        fill_aws_secrets_vars(&mut cmd);
-
-        if !cmd.status()?.success() {
-            bail!(
-                "Safekeeper failed to start. See '{}' for details.",
-                self.datadir_path().join("safekeeper.log").display()
-            );
-        }
-
-        // It takes a while for the safekeeper to start up. Wait until it is
-        // open for business.
-        const RETRIES: i8 = 15;
-        for retries in 1..RETRIES {
-            match self.check_status() {
-                Ok(_) => {
-                    println!("\nSafekeeper started");
-                    return Ok(());
-                }
-                Err(err) => {
-                    match err {
-                        SafekeeperHttpError::Transport(err) => {
-                            if err.is_connect() && retries < 5 {
-                                print!(".");
-                                io::stdout().flush().unwrap();
-                            } else {
-                                if retries == 5 {
-                                    println!() // put a line break after dots for second message
-                                }
-                                println!(
-                                    "Safekeeper not responding yet, err {} retrying ({})...",
-                                    err, retries
-                                );
-                            }
-                        }
-                        SafekeeperHttpError::Response(msg) => {
-                            bail!("safekeeper failed to start: {} ", msg)
-                        }
-                    }
-                    thread::sleep(Duration::from_secs(1));
-                }
-            }
-        }
-        bail!("safekeeper failed to start in {} seconds", RETRIES);
+        background_process::start_process(
+            &format!("safekeeper {id}"),
+            &datadir,
+            &self.env.safekeeper_bin(),
+            &args,
+            background_process::InitialPidFile::Expect(&self.pid_file()),
+            || match self.check_status() {
+                Ok(()) => Ok(true),
+                Err(SafekeeperHttpError::Transport(_)) => Ok(false),
+                Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
+            },
+        )
    }

    ///
@@ -202,63 +187,11 @@ impl SafekeeperNode {
    /// If the server is not running, returns success
    ///
    pub fn stop(&self, immediate: bool) -> anyhow::Result<()> {
-        let pid_file = self.pid_file();
-        if !pid_file.exists() {
-            println!("Safekeeper {} is already stopped", self.id);
-            return Ok(());
-        }
-        let pid = read_pidfile(&pid_file)?;
-        let pid = Pid::from_raw(pid);
-
-        let sig = if immediate {
-            print!("Stopping safekeeper {} immediately..", self.id);
-            Signal::SIGQUIT
-        } else {
-            print!("Stopping safekeeper {} gracefully..", self.id);
-            Signal::SIGTERM
-        };
-        io::stdout().flush().unwrap();
-        match kill(pid, sig) {
-            Ok(_) => (),
-            Err(Errno::ESRCH) => {
-                println!(
-                    "Safekeeper with pid {} does not exist, but a PID file was found",
-                    pid
-                );
-                return Ok(());
-            }
-            Err(err) => bail!(
-                "Failed to send signal to safekeeper with pid {}: {}",
-                pid,
-                err.desc()
-            ),
-        }
-
-        // Wait until process is gone
-        for i in 0..600 {
-            let signal = None; // Send no signal, just get the error code
-            match kill(pid, signal) {
-                Ok(_) => (), // Process exists, keep waiting
-                Err(Errno::ESRCH) => {
-                    // Process not found, we're done
-                    println!("done!");
-                    return Ok(());
-                }
-                Err(err) => bail!(
-                    "Failed to send signal to pageserver with pid {}: {}",
-                    pid,
-                    err.desc()
-                ),
-            };
-
-            if i % 10 == 0 {
-                print!(".");
-                io::stdout().flush().unwrap();
-            }
-            thread::sleep(Duration::from_millis(100));
-        }
-
-        bail!("Failed to stop safekeeper with pid {}", pid);
+        background_process::stop_process(
+            immediate,
+            &format!("safekeeper {}", self.id),
+            &self.pid_file(),
+        )
    }

    fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> RequestBuilder {
--- a/docs/rfcs/020-pageserver-s3-coordination.md
+++ b/docs/rfcs/020-pageserver-s3-coordination.md
@@ -0,0 +1,246 @@
+# Coordinating access of multiple pageservers to the same s3 data
+
+## Motivation
+
+There are some blind spots around coordinating access of multiple pageservers
+to the same s3 data. Currently this is applicable only to tenant relocation
+case, but in the future we'll need to solve similar problems for
+replica/standby pageservers.
+
+## Impacted components (e.g. pageserver, safekeeper, console, etc)
+
+Pageserver
+
+## The problem
+
+### Relocation
+
+During relocation both pageservers can write to s3. This should be ok for all
+data except the `index_part.json`. For index part it causes problems during
+compaction/gc because they remove files from index/s3.
+
+Imagine this case:
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant PS1
+    participant S3
+    participant PS2
+
+    PS1->>S3: Uploads L1, L2 <br/> Index contains L1 L2
+    PS2->>S3: Attach called, sees L1, L2
+    PS1->>S3: Compaction comes <br/> Removes L1, adds L3
+    note over S3: Index now L2, L3
+    PS2->>S3: Uploads new layer L4 <br/> (added to previous view of the index)
+    note over S3: Index now L1, L2, L4
+```
+
+At this point it is not possible to restore from index, it contains L2 which
+is no longer available in s3 and doesnt contain L3 added by compaction by the
+first pageserver. So if any of the pageservers restart initial sync will fail
+(or in on-demand world it will fail a bit later during page request from
+missing layer)
+
+### Standby pageserver
+
+Another related case is standby pageserver. In this case second pageserver can
+be used as a replica to scale reads and serve as a failover target in case
+first one fails.
+
+In this mode second pageserver needs to have the same picture of s3 files to
+be able to load layers on-demand. To accomplish that second pageserver
+cannot run gc/compaction jobs. Instead it needs to receive updates for index
+contents. (There is no need to run walreceiver on the second pageserver then).
+
+## Observations
+
+- If both pageservers ingest wal then their layer set diverges, because layer
+  file generation is not deterministic
+- If one of the pageservers does not ingest wal (and just picks up layer
+  updates) then it lags behind and cannot really answer queries in the same
+  pace as the primary one
+- Can compaction help make layers deterministic? E g we do not upload level
+  zero layers and construction of higher levels should be deterministic.
+  This way we can guarantee that layer creation by timeout wont mess things up.
+  This way one pageserver uploads data and second one can just ingest it.
+  But we still need some form of election
+
+## Solutions
+
+### Manual orchestration
+
+One possible solution for relocation case is to orchestrate background jobs
+from outside. The oracle who runs migration can turn off background jobs on
+PS1 before migration and then run migration -> enable them on PS2. The problem
+comes if migration fails. In this case in order to resume background jobs
+oracle needs to guarantee that PS2 doesnt run background jobs and if it doesnt
+respond then PS1 is stuck unable to run compaction/gc. This cannot be solved
+without human ensuring that no upload from PS2 can happen. In order to be able
+to resolve this automatically CAS is required on S3 side so pageserver can
+avoid overwriting index part if it is no longer the leading one
+
+Note that flag that disables background jobs needs to be persistent, because
+otherwise pageserver restart will clean it
+
+### Avoid index_part.json
+
+Index part consists of two parts, list of layers and metadata. List of layers
+can be easily obtained by `ListObjects` S3 API method. But what to do with
+metadata? Create metadata instance for each checkpoint and add some counter
+to the file name?
+
+Back to potentially long s3 ls.
+
+### Coordination based approach
+
+Do it like safekeepers chose leader for WAL upload. Ping each other and decide
+based on some heuristics e g smallest node id. During relocation PS1 sends
+"resign" ping message so others can start election without waiting for a timeout.
+
+This still leaves metadata question open and non deterministic layers are a
+problem as well
+
+### Avoid metadata file
+
+One way to eliminate metadata file is to store it in layer files under some
+special key. This may resonate with intention to keep all relation sizes in
+some special segment to avoid initial download during size calculation.
+Maybe with that we can even store pre calculated value.
+
+As a downside each checkpoint gets 512 bytes larger.
+
+If we entirely avoid metadata file this opens up many approaches
+
+* * *
+
+During discussion it seems that we converged on the approach consisting of:
+
+- index files stored per pageserver in the same timeline directory. With that
+  index file name starts to look like: `<pageserver_node_id>_index_part.json`.
+  In such set up there are no concurrent overwrites of index file by different
+  pageservers.
+- For replica pageservers the solution would be for primary to broadcast index
+  changes to any followers with an ability to check index files in s3 and
+  restore the full state. To properly merge changes with index files we can use
+  a counter that is persisted in an index file, is incremented on every change
+  to it and passed along with broadcasted change. This way we can determine
+  whether we need to apply change to the index state or not.
+- Responsibility for running background jobs is assigned externally. Pageserver
+  keeps locally persistent flag for each tenant that indicates whether this
+  pageserver is considered as primary one or not. TODO what happends if we
+  crash and cannot start for some extended period of time? Control plane can
+  assign ownership to some other pageserver. Pageserver needs some way to check
+  if its still the blessed one. Maybe by explicit request to control plane on
+  start.
+
+Requirement for deterministic layer generation was considered overly strict
+because of two reasons:
+
+- It can limit possible optimizations e g when pageserver wants to reshuffle
+  some data locally and doesnt want to coordinate this
+- The deterministic algorithm itself can change so during deployments for some
+  time there will be two different version running at the same time which can
+  cause non determinism
+
+### External elections
+
+The above case with lost state in this schema with externally managed
+leadership is represented like this:
+
+Note that here we keep objects list in the index file.
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant PS1
+    participant CP as Control Plane
+    participant S3
+    participant PS2
+
+    note over PS1,PS2: PS1 starts up and still a leader
+    PS1->>CP: Am I still the leader for Tenant X?
+    activate CP
+    CP->>PS1: Yes
+    deactivate CP
+    PS1->>S3: Fetch PS1 index.
+    note over PS1: Continue operations, start backround jobs
+    note over PS1,PS2: PS1 starts up and still and is not a leader anymore
+    PS1->>CP: Am I still the leader for Tenant X?
+    CP->>PS1: No
+    PS1->>PS2: Subscribe to index changes
+    PS1->>S3: Fetch PS1 and PS2 indexes
+    note over PS1: Combine index file to include layers <br> from both indexes to be able <br> to see newer files from leader (PS2)
+    note over PS1: Continue operations, do not start background jobs
+```
+
+### Internal elections
+
+To manage leadership internally we can use broker to exchange pings so nodes
+can decide on the leader roles. In case multiple pageservers are active leader
+is the one with lowest node id.
+
+Operations with internally managed elections:
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant PS1
+    participant S3
+
+    note over PS1: Starts up
+    note over PS1: Subscribes to changes, waits for two ping <br> timeouts to see if there is a leader
+    PS1->>S3: Fetch indexes from s3
+    alt there is a leader
+        note over PS1: do not start background jobs, <br> continue applying index updates
+    else there is no leader
+        note over PS1: start background jobs, <br> broadcast index changes
+    end
+
+    note over PS1,S3: Then the picture is similar to external elections <br> the difference is that follower can become a leader <br> if there are no pings after some timeout new leader gets elected
+```
+
+### Eviction
+
+When two pageservers operate on a tenant for extended period of time follower
+doesnt perform write operations in s3. When layer is evicted follower relies
+on updates from primary to get info about layers it needs to cover range for
+evicted layer.
+
+Note that it wont match evicted layer exactly, so layers will overlap and
+lookup code needs to correctly handle that.
+
+### Relocation flow
+
+Actions become:
+
+- Attach tenant to new pageserver
+- New pageserver becomes follower since previous one is still leading
+- New pageserver starts replicating from safekeepers but does not upload layers
+- Detach is called on the old one
+- New pageserver becomes leader after it realizes that old one disappeared
+
+### Index File
+
+Using `s3 ls` on startup simplifies things, but we still need metadata, so we
+need to fetch index files anyway. If they contain list of files we can combine
+them and avoid costly `s3 ls`
+
+### Remaining issues
+
+- More than one remote consistent lsn for safekeepers to know
+
+Anything else?
+
+### Proposed solution
+
+To recap. On meeting we converged on approach with external elections but I
+think it will be overall harder to manage and will introduce a dependency on
+control plane for pageserver. Using separate index files for each pageserver
+consisting of log of operations and a metadata snapshot should be enough.
+
+### What we need to get there?
+
+- Change index file structure to contain log of changes instead of just the
+  file list
+- Implement pinging/elections for pageservers
--- a/docs/sourcetree.md
+++ b/docs/sourcetree.md
@@ -52,6 +52,10 @@ PostgreSQL extension that implements storage manager API and network communicati

 PostgreSQL extension that contains functions needed for testing and debugging.

+`/pgxn/neon_walredo`:
+
+Library to run Postgres as a "WAL redo process" in the pageserver.
+
 `/safekeeper`:

 The neon WAL service that receives WAL from a primary compute nodes and streams it to the pageserver.
--- a/libs/pageserver_api/Cargo.toml
+++ b/libs/pageserver_api/Cargo.toml
@@ -9,6 +9,7 @@ serde_with = "2.0"
 const_format = "0.2.21"
 anyhow = { version = "1.0", features = ["backtrace"] }
 bytes = "1.0.1"
+byteorder = "1.4.3"

 utils = { path = "../utils" }
 postgres_ffi = { path = "../postgres_ffi" }
--- a/libs/pageserver_api/src/models.rs
+++ b/libs/pageserver_api/src/models.rs
@@ -1,5 +1,6 @@
 use std::num::NonZeroU64;

+use byteorder::{BigEndian, ReadBytesExt};
 use serde::{Deserialize, Serialize};
 use serde_with::{serde_as, DisplayFromStr};
 use utils::{
@@ -9,7 +10,7 @@ use utils::{

 use crate::reltag::RelTag;
 use anyhow::bail;
-use bytes::{Buf, BufMut, Bytes, BytesMut};
+use bytes::{BufMut, Bytes, BytesMut};

 /// A state of a tenant in pageserver's memory.
 #[derive(Debug, Clone, Copy, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
@@ -225,6 +226,7 @@ pub struct TimelineGcRequest {
 }

 // Wrapped in libpq CopyData
+#[derive(PartialEq, Eq)]
 pub enum PagestreamFeMessage {
    Exists(PagestreamExistsRequest),
    Nblocks(PagestreamNblocksRequest),
@@ -241,21 +243,21 @@ pub enum PagestreamBeMessage {
    DbSize(PagestreamDbSizeResponse),
 }

-#[derive(Debug)]
+#[derive(Debug, PartialEq, Eq)]
 pub struct PagestreamExistsRequest {
    pub latest: bool,
    pub lsn: Lsn,
    pub rel: RelTag,
 }

-#[derive(Debug)]
+#[derive(Debug, PartialEq, Eq)]
 pub struct PagestreamNblocksRequest {
    pub latest: bool,
    pub lsn: Lsn,
    pub rel: RelTag,
 }

-#[derive(Debug)]
+#[derive(Debug, PartialEq, Eq)]
 pub struct PagestreamGetPageRequest {
    pub latest: bool,
    pub lsn: Lsn,
@@ -263,7 +265,7 @@ pub struct PagestreamGetPageRequest {
    pub blkno: u32,
 }

-#[derive(Debug)]
+#[derive(Debug, PartialEq, Eq)]
 pub struct PagestreamDbSizeRequest {
    pub latest: bool,
    pub lsn: Lsn,
@@ -296,52 +298,98 @@ pub struct PagestreamDbSizeResponse {
 }

 impl PagestreamFeMessage {
-    pub fn parse(mut body: Bytes) -> anyhow::Result<PagestreamFeMessage> {
+    pub fn serialize(&self) -> Bytes {
+        let mut bytes = BytesMut::new();
+
+        match self {
+            Self::Exists(req) => {
+                bytes.put_u8(0);
+                bytes.put_u8(if req.latest { 1 } else { 0 });
+                bytes.put_u64(req.lsn.0);
+                bytes.put_u32(req.rel.spcnode);
+                bytes.put_u32(req.rel.dbnode);
+                bytes.put_u32(req.rel.relnode);
+                bytes.put_u8(req.rel.forknum);
+            }
+
+            Self::Nblocks(req) => {
+                bytes.put_u8(1);
+                bytes.put_u8(if req.latest { 1 } else { 0 });
+                bytes.put_u64(req.lsn.0);
+                bytes.put_u32(req.rel.spcnode);
+                bytes.put_u32(req.rel.dbnode);
+                bytes.put_u32(req.rel.relnode);
+                bytes.put_u8(req.rel.forknum);
+            }
+
+            Self::GetPage(req) => {
+                bytes.put_u8(2);
+                bytes.put_u8(if req.latest { 1 } else { 0 });
+                bytes.put_u64(req.lsn.0);
+                bytes.put_u32(req.rel.spcnode);
+                bytes.put_u32(req.rel.dbnode);
+                bytes.put_u32(req.rel.relnode);
+                bytes.put_u8(req.rel.forknum);
+                bytes.put_u32(req.blkno);
+            }
+
+            Self::DbSize(req) => {
+                bytes.put_u8(3);
+                bytes.put_u8(if req.latest { 1 } else { 0 });
+                bytes.put_u64(req.lsn.0);
+                bytes.put_u32(req.dbnode);
+            }
+        }
+
+        bytes.into()
+    }
+
+    pub fn parse<R: std::io::Read>(body: &mut R) -> anyhow::Result<PagestreamFeMessage> {
        // TODO these gets can fail

        // these correspond to the NeonMessageTag enum in pagestore_client.h
        //
        // TODO: consider using protobuf or serde bincode for less error prone
        // serialization.
-        let msg_tag = body.get_u8();
+        let msg_tag = body.read_u8()?;
        match msg_tag {
            0 => Ok(PagestreamFeMessage::Exists(PagestreamExistsRequest {
-                latest: body.get_u8() != 0,
-                lsn: Lsn::from(body.get_u64()),
+                latest: body.read_u8()? != 0,
+                lsn: Lsn::from(body.read_u64::<BigEndian>()?),
                rel: RelTag {
-                    spcnode: body.get_u32(),
-                    dbnode: body.get_u32(),
-                    relnode: body.get_u32(),
-                    forknum: body.get_u8(),
+                    spcnode: body.read_u32::<BigEndian>()?,
+                    dbnode: body.read_u32::<BigEndian>()?,
+                    relnode: body.read_u32::<BigEndian>()?,
+                    forknum: body.read_u8()?,
                },
            })),
            1 => Ok(PagestreamFeMessage::Nblocks(PagestreamNblocksRequest {
-                latest: body.get_u8() != 0,
-                lsn: Lsn::from(body.get_u64()),
+                latest: body.read_u8()? != 0,
+                lsn: Lsn::from(body.read_u64::<BigEndian>()?),
                rel: RelTag {
-                    spcnode: body.get_u32(),
-                    dbnode: body.get_u32(),
-                    relnode: body.get_u32(),
-                    forknum: body.get_u8(),
+                    spcnode: body.read_u32::<BigEndian>()?,
+                    dbnode: body.read_u32::<BigEndian>()?,
+                    relnode: body.read_u32::<BigEndian>()?,
+                    forknum: body.read_u8()?,
                },
            })),
            2 => Ok(PagestreamFeMessage::GetPage(PagestreamGetPageRequest {
-                latest: body.get_u8() != 0,
-                lsn: Lsn::from(body.get_u64()),
+                latest: body.read_u8()? != 0,
+                lsn: Lsn::from(body.read_u64::<BigEndian>()?),
                rel: RelTag {
-                    spcnode: body.get_u32(),
-                    dbnode: body.get_u32(),
-                    relnode: body.get_u32(),
-                    forknum: body.get_u8(),
+                    spcnode: body.read_u32::<BigEndian>()?,
+                    dbnode: body.read_u32::<BigEndian>()?,
+                    relnode: body.read_u32::<BigEndian>()?,
+                    forknum: body.read_u8()?,
                },
-                blkno: body.get_u32(),
+                blkno: body.read_u32::<BigEndian>()?,
            })),
            3 => Ok(PagestreamFeMessage::DbSize(PagestreamDbSizeRequest {
-                latest: body.get_u8() != 0,
-                lsn: Lsn::from(body.get_u64()),
-                dbnode: body.get_u32(),
+                latest: body.read_u8()? != 0,
+                lsn: Lsn::from(body.read_u64::<BigEndian>()?),
+                dbnode: body.read_u32::<BigEndian>()?,
            })),
-            _ => bail!("unknown smgr message tag: {},'{:?}'", msg_tag, body),
+            _ => bail!("unknown smgr message tag: {:?}", msg_tag),
        }
    }
 }
@@ -380,3 +428,58 @@ impl PagestreamBeMessage {
        bytes.into()
    }
 }
+
+#[cfg(test)]
+mod tests {
+    use bytes::Buf;
+
+    use super::*;
+
+    #[test]
+    fn test_pagestream() {
+        // Test serialization/deserialization of PagestreamFeMessage
+        let messages = vec![
+            PagestreamFeMessage::Exists(PagestreamExistsRequest {
+                latest: true,
+                lsn: Lsn(4),
+                rel: RelTag {
+                    forknum: 1,
+                    spcnode: 2,
+                    dbnode: 3,
+                    relnode: 4,
+                },
+            }),
+            PagestreamFeMessage::Nblocks(PagestreamNblocksRequest {
+                latest: false,
+                lsn: Lsn(4),
+                rel: RelTag {
+                    forknum: 1,
+                    spcnode: 2,
+                    dbnode: 3,
+                    relnode: 4,
+                },
+            }),
+            PagestreamFeMessage::GetPage(PagestreamGetPageRequest {
+                latest: true,
+                lsn: Lsn(4),
+                rel: RelTag {
+                    forknum: 1,
+                    spcnode: 2,
+                    dbnode: 3,
+                    relnode: 4,
+                },
+                blkno: 7,
+            }),
+            PagestreamFeMessage::DbSize(PagestreamDbSizeRequest {
+                latest: true,
+                lsn: Lsn(4),
+                dbnode: 7,
+            }),
+        ];
+        for msg in messages {
+            let bytes = msg.serialize();
+            let reconstructed = PagestreamFeMessage::parse(&mut bytes.reader()).unwrap();
+            assert!(msg == reconstructed);
+        }
+    }
+}
--- a/libs/pq_proto/Cargo.toml
+++ b/libs/pq_proto/Cargo.toml
@@ -0,0 +1,16 @@
+[package]
+name = "pq_proto"
+version = "0.1.0"
+edition = "2021"
+
+[dependencies]
+anyhow = "1.0"
+bytes = "1.0.1"
+pin-project-lite = "0.2.7"
+postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
+rand = "0.8.3"
+serde = { version = "1.0", features = ["derive"] }
+tokio = { version = "1.17", features = ["macros"] }
+tracing = "0.1"
+
+workspace_hack = { version = "0.1", path = "../../workspace_hack" }
--- a/libs/utils/src/pq_proto.rs
+++ b/libs/utils/src/pq_proto.rs
@@ -2,7 +2,9 @@
 //! <https://www.postgresql.org/docs/devel/protocol-message-formats.html>
 //! on message formats.

-use crate::sync::{AsyncishRead, SyncFuture};
+// Tools for calling certain async methods in sync contexts.
+pub mod sync;
+
 use anyhow::{bail, ensure, Context, Result};
 use bytes::{Buf, BufMut, Bytes, BytesMut};
 use postgres_protocol::PG_EPOCH;
@@ -16,6 +18,7 @@ use std::{
    str,
    time::{Duration, SystemTime},
 };
+use sync::{AsyncishRead, SyncFuture};
 use tokio::io::AsyncReadExt;
 use tracing::{trace, warn};

@@ -198,7 +201,7 @@ impl FeMessage {
    ///
    /// ```
    /// # use std::io;
-    /// # use utils::pq_proto::FeMessage;
+    /// # use pq_proto::FeMessage;
    /// #
    /// # fn process_message(msg: FeMessage) -> anyhow::Result<()> {
    /// #     Ok(())
@@ -302,6 +305,7 @@ impl FeStartupPacket {
                Err(e) => return Err(e.into()),
            };

+            #[allow(clippy::manual_range_contains)]
            if len < 4 || len > MAX_STARTUP_PACKET_LENGTH {
                bail!("invalid message length");
            }
--- a/libs/pq_proto/src/sync.rs
+++ b/libs/pq_proto/src/sync.rs
@@ -29,7 +29,7 @@ impl<S, T: Future> SyncFuture<S, T> {
    /// Example:
    ///
    /// ```
-    /// # use utils::sync::SyncFuture;
+    /// # use pq_proto::sync::SyncFuture;
    /// # use std::future::Future;
    /// # use tokio::io::AsyncReadExt;
    /// #
--- a/libs/tenant_size_model/.gitignore
+++ b/libs/tenant_size_model/.gitignore
@@ -0,0 +1,3 @@
+*.dot
+*.png
+*.svg
--- a/libs/tenant_size_model/Cargo.toml
+++ b/libs/tenant_size_model/Cargo.toml
@@ -0,0 +1,8 @@
+[package]
+name = "tenant_size_model"
+version = "0.1.0"
+edition = "2021"
+publish = false
+
+[dependencies]
+workspace_hack = { version = "0.1", path = "../../workspace_hack" }
--- a/libs/tenant_size_model/Makefile
+++ b/libs/tenant_size_model/Makefile
@@ -0,0 +1,13 @@
+all: 1.svg 2.svg 3.svg 4.svg 1.png 2.png 3.png 4.png
+
+../../target/debug/tenant_size_model: Cargo.toml src/main.rs src/lib.rs
+	cargo build --bin tenant_size_model
+
+%.svg: %.dot
+	dot -Tsvg $< > $@
+
+%.png: %.dot
+	dot -Tpng $< > $@
+
+%.dot: ../../target/debug/tenant_size_model
+	../../target/debug/tenant_size_model $* > $@
--- a/libs/tenant_size_model/README.md
+++ b/libs/tenant_size_model/README.md
@@ -0,0 +1,7 @@
+# Logical size + WAL pricing
+
+This is a simulator to calculate the tenant size in different scenarios,
+using the "Logical size + WAL" method. Makefile produces diagrams used in a
+private presentation:
+
+https://docs.google.com/presentation/d/1OapE4k11xmcwMh7I7YvNWGC63yCRLh6udO9bXZ-fZmo/edit?usp=sharing
--- a/libs/tenant_size_model/src/lib.rs
+++ b/libs/tenant_size_model/src/lib.rs
@@ -0,0 +1,349 @@
+use std::borrow::Cow;
+use std::collections::HashMap;
+
+/// Pricing model or history size builder.
+///
+/// Maintains knowledge of the branches and their modifications. Generic over the branch name key
+/// type.
+pub struct Storage<K: 'static> {
+    segments: Vec<Segment>,
+
+    /// Mapping from the branch name to the index of a segment describing it's latest state.
+    branches: HashMap<K, usize>,
+}
+
+/// Snapshot of a branch.
+#[derive(Clone, Debug, Eq, PartialEq)]
+pub struct Segment {
+    /// Previous segment index into ['Storage::segments`], if any.
+    parent: Option<usize>,
+
+    /// Description of how did we get to this state.
+    ///
+    /// Mainly used in the original scenarios 1..=4 with insert, delete and update. Not used when
+    /// modifying a branch directly.
+    pub op: Cow<'static, str>,
+
+    /// LSN before this state
+    start_lsn: u64,
+
+    /// LSN at this state
+    pub end_lsn: u64,
+
+    /// Logical size before this state
+    start_size: u64,
+
+    /// Logical size at this state
+    pub end_size: u64,
+
+    /// Indices to [`Storage::segments`]
+    ///
+    /// FIXME: this could be an Option<usize>
+    children_after: Vec<usize>,
+
+    /// Determined by `retention_period` given to [`Storage::calculate`]
+    pub needed: bool,
+}
+
+//
+//
+//
+//
+//                 *-g--*---D--->
+//                /
+//               /
+//              /                 *---b----*-B--->
+//             /                 /
+//            /                 /
+//      -----*--e---*-----f----* C
+//           E                  \
+//                               \
+//                                *--a---*---A-->
+//
+// If A and B need to be retained, is it cheaper to store
+// snapshot at C+a+b, or snapshots at A and B ?
+//
+// If D also needs to be retained, which is cheaper:
+//
+// 1. E+g+e+f+a+b
+// 2. D+C+a+b
+// 3. D+A+B
+
+/// [`Segment`] which has had it's size calculated.
+pub struct SegmentSize {
+    pub seg_id: usize,
+
+    pub method: SegmentMethod,
+
+    this_size: u64,
+
+    pub children: Vec<SegmentSize>,
+}
+
+impl SegmentSize {
+    fn total(&self) -> u64 {
+        self.this_size + self.children.iter().fold(0, |acc, x| acc + x.total())
+    }
+
+    pub fn total_children(&self) -> u64 {
+        if self.method == SnapshotAfter {
+            self.this_size + self.children.iter().fold(0, |acc, x| acc + x.total())
+        } else {
+            self.children.iter().fold(0, |acc, x| acc + x.total())
+        }
+    }
+}
+
+/// Different methods to retain history from a particular state
+#[derive(Clone, Copy, Debug, Eq, PartialEq)]
+pub enum SegmentMethod {
+    SnapshotAfter,
+    Wal,
+    WalNeeded,
+    Skipped,
+}
+
+use SegmentMethod::*;
+
+impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
+    /// Creates a new storage with the given default branch name.
+    pub fn new(initial_branch: K) -> Storage<K> {
+        let init_segment = Segment {
+            op: "".into(),
+            needed: false,
+            parent: None,
+            start_lsn: 0,
+            end_lsn: 0,
+            start_size: 0,
+            end_size: 0,
+            children_after: Vec::new(),
+        };
+
+        Storage {
+            segments: vec![init_segment],
+            branches: HashMap::from([(initial_branch, 0)]),
+        }
+    }
+
+    /// Advances the branch with the named operation, by the relative LSN and logical size bytes.
+    pub fn modify_branch<Q: ?Sized>(
+        &mut self,
+        branch: &Q,
+        op: Cow<'static, str>,
+        lsn_bytes: u64,
+        size_bytes: i64,
+    ) where
+        K: std::borrow::Borrow<Q>,
+        Q: std::hash::Hash + Eq,
+    {
+        let lastseg_id = *self.branches.get(branch).unwrap();
+        let newseg_id = self.segments.len();
+        let lastseg = &mut self.segments[lastseg_id];
+
+        let newseg = Segment {
+            op,
+            parent: Some(lastseg_id),
+            start_lsn: lastseg.end_lsn,
+            end_lsn: lastseg.end_lsn + lsn_bytes,
+            start_size: lastseg.end_size,
+            end_size: (lastseg.end_size as i64 + size_bytes) as u64,
+            children_after: Vec::new(),
+            needed: false,
+        };
+        lastseg.children_after.push(newseg_id);
+
+        self.segments.push(newseg);
+        *self.branches.get_mut(branch).expect("read already") = newseg_id;
+    }
+
+    pub fn insert<Q: ?Sized>(&mut self, branch: &Q, bytes: u64)
+    where
+        K: std::borrow::Borrow<Q>,
+        Q: std::hash::Hash + Eq,
+    {
+        self.modify_branch(branch, "insert".into(), bytes, bytes as i64);
+    }
+
+    pub fn update<Q: ?Sized>(&mut self, branch: &Q, bytes: u64)
+    where
+        K: std::borrow::Borrow<Q>,
+        Q: std::hash::Hash + Eq,
+    {
+        self.modify_branch(branch, "update".into(), bytes, 0i64);
+    }
+
+    pub fn delete<Q: ?Sized>(&mut self, branch: &Q, bytes: u64)
+    where
+        K: std::borrow::Borrow<Q>,
+        Q: std::hash::Hash + Eq,
+    {
+        self.modify_branch(branch, "delete".into(), bytes, -(bytes as i64));
+    }
+
+    /// Panics if the parent branch cannot be found.
+    pub fn branch<Q: ?Sized>(&mut self, parent: &Q, name: K)
+    where
+        K: std::borrow::Borrow<Q>,
+        Q: std::hash::Hash + Eq,
+    {
+        // Find the right segment
+        let branchseg_id = *self
+            .branches
+            .get(parent)
+            .expect("should had found the parent by key");
+        let _branchseg = &mut self.segments[branchseg_id];
+
+        // Create branch name for it
+        self.branches.insert(name, branchseg_id);
+    }
+
+    pub fn calculate(&mut self, retention_period: u64) -> SegmentSize {
+        // Phase 1: Mark all the segments that need to be retained
+        for (_branch, &last_seg_id) in self.branches.iter() {
+            let last_seg = &self.segments[last_seg_id];
+            let cutoff_lsn = last_seg.start_lsn.saturating_sub(retention_period);
+            let mut seg_id = last_seg_id;
+            loop {
+                let seg = &mut self.segments[seg_id];
+                if seg.end_lsn < cutoff_lsn {
+                    break;
+                }
+                seg.needed = true;
+                if let Some(prev_seg_id) = seg.parent {
+                    seg_id = prev_seg_id;
+                } else {
+                    break;
+                }
+            }
+        }
+
+        // Phase 2: For each oldest segment in a chain that needs to be retained,
+        // calculate if we should store snapshot or WAL
+        self.size_from_snapshot_later(0)
+    }
+
+    fn size_from_wal(&self, seg_id: usize) -> SegmentSize {
+        let seg = &self.segments[seg_id];
+
+        let this_size = seg.end_lsn - seg.start_lsn;
+
+        let mut children = Vec::new();
+
+        // try both ways
+        for &child_id in seg.children_after.iter() {
+            // try each child both ways
+            let child = &self.segments[child_id];
+            let p1 = self.size_from_wal(child_id);
+
+            let p = if !child.needed {
+                let p2 = self.size_from_snapshot_later(child_id);
+                if p1.total() < p2.total() {
+                    p1
+                } else {
+                    p2
+                }
+            } else {
+                p1
+            };
+            children.push(p);
+        }
+        SegmentSize {
+            seg_id,
+            method: if seg.needed { WalNeeded } else { Wal },
+            this_size,
+            children,
+        }
+    }
+
+    fn size_from_snapshot_later(&self, seg_id: usize) -> SegmentSize {
+        // If this is needed, then it's time to do the snapshot and continue
+        // with wal method.
+        let seg = &self.segments[seg_id];
+        //eprintln!("snap: seg{}: {} needed: {}", seg_id, seg.children_after.len(), seg.needed);
+        if seg.needed {
+            let mut children = Vec::new();
+
+            for &child_id in seg.children_after.iter() {
+                // try each child both ways
+                let child = &self.segments[child_id];
+                let p1 = self.size_from_wal(child_id);
+
+                let p = if !child.needed {
+                    let p2 = self.size_from_snapshot_later(child_id);
+                    if p1.total() < p2.total() {
+                        p1
+                    } else {
+                        p2
+                    }
+                } else {
+                    p1
+                };
+                children.push(p);
+            }
+            SegmentSize {
+                seg_id,
+                method: WalNeeded,
+                this_size: seg.start_size,
+                children,
+            }
+        } else {
+            // If any of the direct children are "needed", need to be able to reconstruct here
+            let mut children_needed = false;
+            for &child in seg.children_after.iter() {
+                let seg = &self.segments[child];
+                if seg.needed {
+                    children_needed = true;
+                    break;
+                }
+            }
+
+            let method1 = if !children_needed {
+                let mut children = Vec::new();
+                for child in seg.children_after.iter() {
+                    children.push(self.size_from_snapshot_later(*child));
+                }
+                Some(SegmentSize {
+                    seg_id,
+                    method: Skipped,
+                    this_size: 0,
+                    children,
+                })
+            } else {
+                None
+            };
+
+            // If this a junction, consider snapshotting here
+            let method2 = if children_needed || seg.children_after.len() >= 2 {
+                let mut children = Vec::new();
+                for child in seg.children_after.iter() {
+                    children.push(self.size_from_wal(*child));
+                }
+                Some(SegmentSize {
+                    seg_id,
+                    method: SnapshotAfter,
+                    this_size: seg.end_size,
+                    children,
+                })
+            } else {
+                None
+            };
+
+            match (method1, method2) {
+                (None, None) => panic!(),
+                (Some(method), None) => method,
+                (None, Some(method)) => method,
+                (Some(method1), Some(method2)) => {
+                    if method1.total() < method2.total() {
+                        method1
+                    } else {
+                        method2
+                    }
+                }
+            }
+        }
+    }
+
+    pub fn into_segments(self) -> Vec<Segment> {
+        self.segments
+    }
+}
--- a/libs/tenant_size_model/src/main.rs
+++ b/libs/tenant_size_model/src/main.rs
@@ -0,0 +1,268 @@
+//! Tenant size model testing ground.
+//!
+//! Has a number of scenarios and a `main` for invoking these by number, calculating the history
+//! size, outputs graphviz graph. Makefile in directory shows how to use graphviz to turn scenarios
+//! into pngs.
+
+use tenant_size_model::{Segment, SegmentSize, Storage};
+
+// Main branch only. Some updates on it.
+fn scenario_1() -> (Vec<Segment>, SegmentSize) {
+    // Create main branch
+    let mut storage = Storage::new("main");
+
+    // Bulk load 5 GB of data to it
+    storage.insert("main", 5_000);
+
+    // Stream of updates
+    for _ in 0..5 {
+        storage.update("main", 1_000);
+    }
+
+    let size = storage.calculate(1000);
+
+    (storage.into_segments(), size)
+}
+
+// Main branch only. Some updates on it.
+fn scenario_2() -> (Vec<Segment>, SegmentSize) {
+    // Create main branch
+    let mut storage = Storage::new("main");
+
+    // Bulk load 5 GB of data to it
+    storage.insert("main", 5_000);
+
+    // Stream of updates
+    for _ in 0..5 {
+        storage.update("main", 1_000);
+    }
+
+    // Branch
+    storage.branch("main", "child");
+    storage.update("child", 1_000);
+
+    // More updates on parent
+    storage.update("main", 1_000);
+
+    let size = storage.calculate(1000);
+
+    (storage.into_segments(), size)
+}
+
+// Like 2, but more updates on main
+fn scenario_3() -> (Vec<Segment>, SegmentSize) {
+    // Create main branch
+    let mut storage = Storage::new("main");
+
+    // Bulk load 5 GB of data to it
+    storage.insert("main", 5_000);
+
+    // Stream of updates
+    for _ in 0..5 {
+        storage.update("main", 1_000);
+    }
+
+    // Branch
+    storage.branch("main", "child");
+    storage.update("child", 1_000);
+
+    // More updates on parent
+    for _ in 0..5 {
+        storage.update("main", 1_000);
+    }
+
+    let size = storage.calculate(1000);
+
+    (storage.into_segments(), size)
+}
+
+// Diverged branches
+fn scenario_4() -> (Vec<Segment>, SegmentSize) {
+    // Create main branch
+    let mut storage = Storage::new("main");
+
+    // Bulk load 5 GB of data to it
+    storage.insert("main", 5_000);
+
+    // Stream of updates
+    for _ in 0..5 {
+        storage.update("main", 1_000);
+    }
+
+    // Branch
+    storage.branch("main", "child");
+    storage.update("child", 1_000);
+
+    // More updates on parent
+    for _ in 0..8 {
+        storage.update("main", 1_000);
+    }
+
+    let size = storage.calculate(1000);
+
+    (storage.into_segments(), size)
+}
+
+fn scenario_5() -> (Vec<Segment>, SegmentSize) {
+    let mut storage = Storage::new("a");
+    storage.insert("a", 5000);
+    storage.branch("a", "b");
+    storage.update("b", 4000);
+    storage.update("a", 2000);
+    storage.branch("a", "c");
+    storage.insert("c", 4000);
+    storage.insert("a", 2000);
+
+    let size = storage.calculate(5000);
+
+    (storage.into_segments(), size)
+}
+
+fn scenario_6() -> (Vec<Segment>, SegmentSize) {
+    use std::borrow::Cow;
+
+    const NO_OP: Cow<'static, str> = Cow::Borrowed("");
+
+    let branches = [
+        Some(0x7ff1edab8182025f15ae33482edb590a_u128),
+        Some(0xb1719e044db05401a05a2ed588a3ad3f),
+        Some(0xb68d6691c895ad0a70809470020929ef),
+    ];
+
+    // compared to other scenarios, this one uses bytes instead of kB
+
+    let mut storage = Storage::new(None);
+
+    storage.branch(&None, branches[0]); // at 0
+    storage.modify_branch(&branches[0], NO_OP, 108951064, 43696128); // at 108951064
+    storage.branch(&branches[0], branches[1]); // at 108951064
+    storage.modify_branch(&branches[1], NO_OP, 15560408, -1851392); // at 124511472
+    storage.modify_branch(&branches[0], NO_OP, 174464360, -1531904); // at 283415424
+    storage.branch(&branches[0], branches[2]); // at 283415424
+    storage.modify_branch(&branches[2], NO_OP, 15906192, 8192); // at 299321616
+    storage.modify_branch(&branches[0], NO_OP, 18909976, 32768); // at 302325400
+
+    let size = storage.calculate(100_000);
+
+    (storage.into_segments(), size)
+}
+
+fn main() {
+    let args: Vec<String> = std::env::args().collect();
+
+    let scenario = if args.len() < 2 { "1" } else { &args[1] };
+
+    let (segments, size) = match scenario {
+        "1" => scenario_1(),
+        "2" => scenario_2(),
+        "3" => scenario_3(),
+        "4" => scenario_4(),
+        "5" => scenario_5(),
+        "6" => scenario_6(),
+        other => {
+            eprintln!("invalid scenario {}", other);
+            std::process::exit(1);
+        }
+    };
+
+    graphviz_tree(&segments, &size);
+}
+
+fn graphviz_recurse(segments: &[Segment], node: &SegmentSize) {
+    use tenant_size_model::SegmentMethod::*;
+
+    let seg_id = node.seg_id;
+    let seg = segments.get(seg_id).unwrap();
+    let lsn = seg.end_lsn;
+    let size = seg.end_size;
+    let method = node.method;
+
+    println!("  {{");
+    println!("    node [width=0.1 height=0.1 shape=oval]");
+
+    let tenant_size = node.total_children();
+
+    let penwidth = if seg.needed { 6 } else { 3 };
+    let x = match method {
+        SnapshotAfter =>
+            format!("label=\"lsn: {lsn}\\nsize: {size}\\ntenant_size: {tenant_size}\" style=filled penwidth={penwidth}"),
+        Wal =>
+            format!("label=\"lsn: {lsn}\\nsize: {size}\\ntenant_size: {tenant_size}\" color=\"black\" penwidth={penwidth}"),
+        WalNeeded =>
+            format!("label=\"lsn: {lsn}\\nsize: {size}\\ntenant_size: {tenant_size}\" color=\"black\" penwidth={penwidth}"),
+        Skipped =>
+            format!("label=\"lsn: {lsn}\\nsize: {size}\\ntenant_size: {tenant_size}\" color=\"gray\" penwidth={penwidth}"),
+    };
+
+    println!("    \"seg{seg_id}\" [{x}]");
+    println!("  }}");
+
+    // Recurse. Much of the data is actually on the edge
+    for child in node.children.iter() {
+        let child_id = child.seg_id;
+        graphviz_recurse(segments, child);
+
+        let edge_color = match child.method {
+            SnapshotAfter => "gray",
+            Wal => "black",
+            WalNeeded => "black",
+            Skipped => "gray",
+        };
+
+        println!("  {{");
+        println!("    edge [] ");
+        print!("    \"seg{seg_id}\" -> \"seg{child_id}\" [");
+        print!("color={edge_color}");
+        if child.method == WalNeeded {
+            print!(" penwidth=6");
+        }
+        if child.method == Wal {
+            print!(" penwidth=3");
+        }
+
+        let next = segments.get(child_id).unwrap();
+
+        if next.op.is_empty() {
+            print!(
+                " label=\"{} / {}\"",
+                next.end_lsn - seg.end_lsn,
+                (next.end_size as i128 - seg.end_size as i128)
+            );
+        } else {
+            print!(" label=\"{}: {}\"", next.op, next.end_lsn - seg.end_lsn);
+        }
+        println!("]");
+        println!("  }}");
+    }
+}
+
+fn graphviz_tree(segments: &[Segment], tree: &SegmentSize) {
+    println!("digraph G {{");
+    println!("  fontname=\"Helvetica,Arial,sans-serif\"");
+    println!("  node [fontname=\"Helvetica,Arial,sans-serif\"]");
+    println!("  edge [fontname=\"Helvetica,Arial,sans-serif\"]");
+    println!("  graph [center=1 rankdir=LR]");
+    println!("  edge [dir=none]");
+
+    graphviz_recurse(segments, tree);
+
+    println!("}}");
+}
+
+#[test]
+fn scenarios_return_same_size() {
+    type ScenarioFn = fn() -> (Vec<Segment>, SegmentSize);
+    let truths: &[(u32, ScenarioFn, _)] = &[
+        (line!(), scenario_1, 8000),
+        (line!(), scenario_2, 9000),
+        (line!(), scenario_3, 13000),
+        (line!(), scenario_4, 16000),
+        (line!(), scenario_5, 17000),
+        (line!(), scenario_6, 333_792_000),
+    ];
+
+    for (line, scenario, expected) in truths {
+        let (_, size) = scenario();
+        assert_eq!(*expected, size.total_children(), "scenario on line {line}");
+    }
+}
--- a/libs/utils/Cargo.toml
+++ b/libs/utils/Cargo.toml
@@ -9,9 +9,6 @@ anyhow = "1.0"
 bincode = "1.3"
 bytes = "1.0.1"
 hyper = { version = "0.14.7", features = ["full"] }
-pin-project-lite = "0.2.7"
-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
-postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
 routerify = "3"
 serde = { version = "1.0", features = ["derive"] }
 serde_json = "1"
@@ -33,8 +30,8 @@ once_cell = "1.13.0"
 strum = "0.24"
 strum_macros = "0.24"

-
 metrics = { path = "../metrics" }
+pq_proto = { path = "../pq_proto" }
 workspace_hack = { version = "0.1", path = "../../workspace_hack" }

 [dev-dependencies]
--- a/libs/utils/src/connstring.rs
+++ b/libs/utils/src/connstring.rs
@@ -1,52 +0,0 @@
-use postgres::Config;
-
-pub fn connection_host_port(config: &Config) -> (String, u16) {
-    assert_eq!(
-        config.get_hosts().len(),
-        1,
-        "only one pair of host and port is supported in connection string"
-    );
-    assert_eq!(
-        config.get_ports().len(),
-        1,
-        "only one pair of host and port is supported in connection string"
-    );
-    let host = match &config.get_hosts()[0] {
-        postgres::config::Host::Tcp(host) => host.as_ref(),
-        postgres::config::Host::Unix(host) => host.to_str().unwrap(),
-    };
-    (host.to_owned(), config.get_ports()[0])
-}
-
-pub fn connection_address(config: &Config) -> String {
-    let (host, port) = connection_host_port(config);
-    format!("{}:{}", host, port)
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn test_connection_host_port() {
-        let config: Config = "postgresql://no_user@localhost:64000/no_db"
-            .parse()
-            .unwrap();
-        assert_eq!(
-            connection_host_port(&config),
-            ("localhost".to_owned(), 64000)
-        );
-    }
-
-    #[test]
-    #[should_panic(expected = "only one pair of host and port is supported in connection string")]
-    fn test_connection_host_port_multiple_ports() {
-        let config: Config = "postgresql://no_user@localhost:64000,localhost:64001/no_db"
-            .parse()
-            .unwrap();
-        assert_eq!(
-            connection_host_port(&config),
-            ("localhost".to_owned(), 64000)
-        );
-    }
-}
--- a/libs/utils/src/lib.rs
+++ b/libs/utils/src/lib.rs
@@ -1,8 +1,6 @@
 //! `utils` is intended to be a place to put code that is shared
 //! between other crates in this repository.

-#![allow(clippy::manual_range_contains)]
-
 /// `Lsn` type implements common tasks on Log Sequence Numbers
 pub mod lsn;
 /// SeqWait allows waiting for a future sequence number to arrive
@@ -17,10 +15,6 @@ pub mod vec_map;
 pub mod bin_ser;
 pub mod postgres_backend;
 pub mod postgres_backend_async;
-pub mod pq_proto;
-
-// dealing with connstring parsing and handy access to it's parts
-pub mod connstring;

 // helper functions for creating and fsyncing
 pub mod crashsafe;
@@ -39,13 +33,12 @@ pub mod sock_split;
 // common log initialisation routine
 pub mod logging;

+pub mod lock_file;
+
 // Misc
 pub mod accum;
 pub mod shutdown;

-// Tools for calling certain async methods in sync contexts
-pub mod sync;
-
 // Utility for binding TcpListeners with proper socket options.
 pub mod tcp_listener;

--- a/libs/utils/src/lock_file.rs
+++ b/libs/utils/src/lock_file.rs
@@ -0,0 +1,81 @@
+//! A module to create and read lock files. A lock file ensures that only one
+//! process is running at a time, in a particular directory.
+//!
+//! File locking is done using [`fcntl::flock`], which means that holding the
+//! lock on file only prevents acquiring another lock on it; all other
+//! operations are still possible on files. Other process can still open, read,
+//! write, or remove the file, for example.
+//! If the file is removed while a process is holding a lock on it,
+//! the process that holds the lock does not get any error or notification.
+//! Furthermore, you can create a new file with the same name and lock the new file,
+//! while the old process is still running.
+//! Deleting the lock file while the locking process is still running is a bad idea!
+
+use std::{fs, os::unix::prelude::AsRawFd, path::Path};
+
+use anyhow::Context;
+use nix::fcntl;
+
+use crate::crashsafe;
+
+pub enum LockCreationResult {
+    Created {
+        new_lock_contents: String,
+        file: fs::File,
+    },
+    AlreadyLocked {
+        existing_lock_contents: String,
+    },
+    CreationFailed(anyhow::Error),
+}
+
+/// Creates a lock file in the path given and writes the given contents into the file.
+/// Note: The lock is automatically released when the file closed. You might want to use Box::leak to make sure it lives until the end of the program.
+pub fn create_lock_file(lock_file_path: &Path, contents: String) -> LockCreationResult {
+    let lock_file = match fs::OpenOptions::new()
+        .create(true) // O_CREAT
+        .write(true)
+        .open(lock_file_path)
+        .context("Failed to open lock file")
+    {
+        Ok(file) => file,
+        Err(e) => return LockCreationResult::CreationFailed(e),
+    };
+
+    match fcntl::flock(
+        lock_file.as_raw_fd(),
+        fcntl::FlockArg::LockExclusiveNonblock,
+    ) {
+        Ok(()) => {
+            match lock_file
+                .set_len(0)
+                .context("Failed to truncate lockfile")
+                .and_then(|()| {
+                    fs::write(lock_file_path, &contents).with_context(|| {
+                        format!("Failed to write '{contents}' contents into lockfile")
+                    })
+                })
+                .and_then(|()| {
+                    crashsafe::fsync_file_and_parent(lock_file_path)
+                        .context("Failed to fsync lockfile")
+                }) {
+                Ok(()) => LockCreationResult::Created {
+                    new_lock_contents: contents,
+                    file: lock_file,
+                },
+                Err(e) => LockCreationResult::CreationFailed(e),
+            }
+        }
+        Err(nix::errno::Errno::EAGAIN) => {
+            match fs::read_to_string(lock_file_path).context("Failed to read lockfile contents") {
+                Ok(existing_lock_contents) => LockCreationResult::AlreadyLocked {
+                    existing_lock_contents,
+                },
+                Err(e) => LockCreationResult::CreationFailed(e),
+            }
+        }
+        Err(e) => {
+            LockCreationResult::CreationFailed(anyhow::anyhow!("Failed to lock lockfile: {e}"))
+        }
+    }
+}
--- a/libs/utils/src/logging.rs
+++ b/libs/utils/src/logging.rs
@@ -1,10 +1,6 @@
-use std::{
-    fs::{File, OpenOptions},
-    path::Path,
-    str::FromStr,
-};
+use std::str::FromStr;

-use anyhow::{Context, Result};
+use anyhow::Context;
 use strum_macros::{EnumString, EnumVariantNames};

 #[derive(EnumString, EnumVariantNames, Eq, PartialEq, Debug, Clone, Copy)]
@@ -25,19 +21,8 @@ impl LogFormat {
        })
    }
 }
-pub fn init(
-    log_filename: impl AsRef<Path>,
-    daemonize: bool,
-    log_format: LogFormat,
-) -> Result<File> {
-    // Don't open the same file for output multiple times;
-    // the different fds could overwrite each other's output.
-    let log_file = OpenOptions::new()
-        .create(true)
-        .append(true)
-        .open(&log_filename)
-        .with_context(|| format!("failed to open {:?}", log_filename.as_ref()))?;

+pub fn init(log_format: LogFormat) -> anyhow::Result<()> {
    let default_filter_str = "info";

    // We fall back to printing all spans at info-level or above if
@@ -45,50 +30,16 @@ pub fn init(
    let env_filter = tracing_subscriber::EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| tracing_subscriber::EnvFilter::new(default_filter_str));

-    let x: File = log_file.try_clone().unwrap();
    let base_logger = tracing_subscriber::fmt()
        .with_env_filter(env_filter)
        .with_target(false)
        .with_ansi(false)
-        .with_writer(move || -> Box<dyn std::io::Write> {
-            // we are cloning and returning log file in order to allow redirecting daemonized stdout and stderr to it
-            // if we do not use daemonization (e.g. in docker) it is better to log to stdout directly
-            // for example to be in line with docker log command which expects logs comimg from stdout
-            if daemonize {
-                Box::new(x.try_clone().unwrap())
-            } else {
-                Box::new(std::io::stdout())
-            }
-        });
+        .with_writer(std::io::stdout);

    match log_format {
        LogFormat::Json => base_logger.json().init(),
        LogFormat::Plain => base_logger.init(),
    }

-    Ok(log_file)
-}
-
-// #[cfg(test)]
-// Due to global logger, can't run tests in same process.
-// So until there's a non-global one, the tests are in ../tests/ as separate files.
-#[macro_export(local_inner_macros)]
-macro_rules! test_init_file_logger {
-    ($log_level:expr, $log_format:expr) => {{
-        use std::str::FromStr;
-        std::env::set_var("RUST_LOG", $log_level);
-
-        let tmp_dir = tempfile::TempDir::new().unwrap();
-        let log_file_path = tmp_dir.path().join("logfile");
-
-        let log_format = $crate::logging::LogFormat::from_str($log_format).unwrap();
-        let _log_file = $crate::logging::init(&log_file_path, true, log_format).unwrap();
-
-        let log_file = std::fs::OpenOptions::new()
-            .read(true)
-            .open(&log_file_path)
-            .unwrap();
-
-        log_file
-    }};
+    Ok(())
 }
--- a/libs/utils/src/lsn.rs
+++ b/libs/utils/src/lsn.rs
@@ -13,7 +13,7 @@ use crate::seqwait::MonotonicCounter;
 pub const XLOG_BLCKSZ: u32 = 8192;

 /// A Postgres LSN (Log Sequence Number), also known as an XLogRecPtr
-#[derive(Clone, Copy, Eq, Ord, PartialEq, PartialOrd, Serialize, Deserialize)]
+#[derive(Clone, Copy, Eq, Ord, PartialEq, PartialOrd, Hash, Serialize, Deserialize)]
 #[serde(transparent)]
 pub struct Lsn(pub u64);

--- a/libs/utils/src/postgres_backend.rs
+++ b/libs/utils/src/postgres_backend.rs
@@ -3,10 +3,10 @@
 //! implementation determining how to process the queries. Currently its API
 //! is rather narrow, but we can extend it once required.

-use crate::pq_proto::{BeMessage, BeParameterStatusMessage, FeMessage, FeStartupPacket};
 use crate::sock_split::{BidiStream, ReadStream, WriteStream};
 use anyhow::{bail, ensure, Context, Result};
 use bytes::{Bytes, BytesMut};
+use pq_proto::{BeMessage, BeParameterStatusMessage, FeMessage, FeStartupPacket};
 use rand::Rng;
 use serde::{Deserialize, Serialize};
 use std::fmt;
--- a/libs/utils/src/postgres_backend_async.rs
+++ b/libs/utils/src/postgres_backend_async.rs
@@ -4,9 +4,9 @@
 //! is rather narrow, but we can extend it once required.

 use crate::postgres_backend::AuthType;
-use crate::pq_proto::{BeMessage, BeParameterStatusMessage, FeMessage, FeStartupPacket};
 use anyhow::{bail, Context, Result};
 use bytes::{Bytes, BytesMut};
+use pq_proto::{BeMessage, BeParameterStatusMessage, FeMessage, FeStartupPacket};
 use rand::Rng;
 use std::future::Future;
 use std::net::SocketAddr;
--- a/libs/utils/tests/logger_json_test.rs
+++ b/libs/utils/tests/logger_json_test.rs
@@ -1,36 +0,0 @@
-// This could be in ../src/logging.rs but since the logger is global, these
-// can't be run in threads of the same process
-use std::fs::File;
-use std::io::{BufRead, BufReader, Lines};
-use tracing::*;
-use utils::test_init_file_logger;
-
-fn read_lines(file: File) -> Lines<BufReader<File>> {
-    BufReader::new(file).lines()
-}
-
-#[test]
-fn test_json_format_has_message_and_custom_field() {
-    std::env::set_var("RUST_LOG", "info");
-
-    let log_file = test_init_file_logger!("info", "json");
-
-    let custom_field: &str = "hi";
-    trace!(custom = %custom_field, "test log message");
-    debug!(custom = %custom_field, "test log message");
-    info!(custom = %custom_field, "test log message");
-    warn!(custom = %custom_field, "test log message");
-    error!(custom = %custom_field, "test log message");
-
-    let lines = read_lines(log_file);
-    for line in lines {
-        let content = line.unwrap();
-        let json_object = serde_json::from_str::<serde_json::Value>(&content).unwrap();
-
-        assert_eq!(json_object["fields"]["custom"], "hi");
-        assert_eq!(json_object["fields"]["message"], "test log message");
-
-        assert_ne!(json_object["level"], "TRACE");
-        assert_ne!(json_object["level"], "DEBUG");
-    }
-}
--- a/libs/utils/tests/logger_plain_test.rs
+++ b/libs/utils/tests/logger_plain_test.rs
@@ -1,36 +0,0 @@
-// This could be in ../src/logging.rs but since the logger is global, these
-// can't be run in threads of the same process
-use std::fs::File;
-use std::io::{BufRead, BufReader, Lines};
-use tracing::*;
-use utils::test_init_file_logger;
-
-fn read_lines(file: File) -> Lines<BufReader<File>> {
-    BufReader::new(file).lines()
-}
-
-#[test]
-fn test_plain_format_has_message_and_custom_field() {
-    std::env::set_var("RUST_LOG", "warn");
-
-    let log_file = test_init_file_logger!("warn", "plain");
-
-    let custom_field: &str = "hi";
-    trace!(custom = %custom_field, "test log message");
-    debug!(custom = %custom_field, "test log message");
-    info!(custom = %custom_field, "test log message");
-    warn!(custom = %custom_field, "test log message");
-    error!(custom = %custom_field, "test log message");
-
-    let lines = read_lines(log_file);
-    for line in lines {
-        let content = line.unwrap();
-        serde_json::from_str::<serde_json::Value>(&content).unwrap_err();
-        assert!(content.contains("custom=hi"));
-        assert!(content.contains("test log message"));
-
-        assert!(!content.contains("TRACE"));
-        assert!(!content.contains("DEBUG"));
-        assert!(!content.contains("INFO"));
-    }
-}
--- a/pageserver/Cargo.toml
+++ b/pageserver/Cargo.toml
@@ -12,62 +12,61 @@ testing = ["fail/failpoints"]
 profiling = ["pprof"]

 [dependencies]
+amplify_num = { git = "https://github.com/hlinnaka/rust-amplify.git", branch = "unsigned-int-perf" }
+anyhow = { version = "1.0", features = ["backtrace"] }
 async-stream = "0.3"
 async-trait = "0.1"
-chrono = "0.4.19"
-rand = "0.8.3"
-regex = "1.4.5"
-bytes = "1.0.1"
 byteorder = "1.4.3"
+bytes = "1.0.1"
+chrono = "0.4.19"
+clap = { version = "4.0", features = ["string"] }
+close_fds = "0.3.2"
+const_format = "0.2.21"
+crc32c = "0.6.0"
+crossbeam-utils = "0.8.5"
+fail = "0.5.0"
 futures = "0.3.13"
+git-version = "0.3.5"
 hex = "0.4.3"
+humantime = "2.1.0"
+humantime-serde = "1.1.1"
 hyper = "0.14"
 itertools = "0.10.3"
-clap = { version = "4.0", features = ["string"] }
-daemonize = "0.4.1"
-tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
-tokio-util = { version = "0.7.3", features = ["io", "io-util"] }
-postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
-postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
+nix = "0.25"
+num-traits = "0.2.15"
+once_cell = "1.13.0"
 postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
-tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
-anyhow = { version = "1.0", features = ["backtrace"] }
-crc32c = "0.6.0"
-thiserror = "1.0"
-tar = "0.4.33"
-humantime = "2.1.0"
+postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
+postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
+pprof = { git = "https://github.com/neondatabase/pprof-rs.git", branch = "wallclock-profiling", features = ["flamegraph"], optional = true }
+rand = "0.8.3"
+regex = "1.4.5"
+rstar = "0.9.3"
+scopeguard = "1.1.0"
 serde = { version = "1.0", features = ["derive"] }
 serde_json = "1"
 serde_with = "2.0"
-humantime-serde = "1.1.1"
-
-pprof = { git = "https://github.com/neondatabase/pprof-rs.git", branch = "wallclock-profiling", features = ["flamegraph"], optional = true }
-
-toml_edit = { version = "0.14", features = ["easy"] }
-scopeguard = "1.1.0"
-const_format = "0.2.21"
-tracing = "0.1.36"
 signal-hook = "0.3.10"
+svg_fmt = "0.4.1"
+tar = "0.4.33"
+thiserror = "1.0"
+tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
+tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
+tokio-util = { version = "0.7.3", features = ["io", "io-util"] }
+toml_edit = { version = "0.14", features = ["easy"] }
+tracing = "0.1.36"
 url = "2"
-nix = "0.25"
-once_cell = "1.13.0"
-crossbeam-utils = "0.8.5"
-fail = "0.5.0"
-git-version = "0.3.5"
-rstar = "0.9.3"
-num-traits = "0.2.15"
-amplify_num = { git = "https://github.com/hlinnaka/rust-amplify.git", branch = "unsigned-int-perf" }
+walkdir = "2.3.2"

-pageserver_api = { path = "../libs/pageserver_api" }
-postgres_ffi = { path = "../libs/postgres_ffi" }
 etcd_broker = { path = "../libs/etcd_broker" }
 metrics = { path = "../libs/metrics" }
-utils = { path = "../libs/utils" }
+pageserver_api = { path = "../libs/pageserver_api" }
+postgres_ffi = { path = "../libs/postgres_ffi" }
+pq_proto = { path = "../libs/pq_proto" }
 remote_storage = { path = "../libs/remote_storage" }
+tenant_size_model = { path = "../libs/tenant_size_model" }
+utils = { path = "../libs/utils" }
 workspace_hack = { version = "0.1", path = "../workspace_hack" }
-close_fds = "0.3.2"
-walkdir = "2.3.2"
-svg_fmt = "0.4.1"

 [dev-dependencies]
 criterion = "0.4"
--- a/pageserver/src/bin/pageserver.rs
+++ b/pageserver/src/bin/pageserver.rs
@@ -1,38 +1,37 @@
 //! Main entry point for the Page Server executable.

-use remote_storage::GenericRemoteStorage;
 use std::{env, ops::ControlFlow, path::Path, str::FromStr};
+
+use anyhow::{anyhow, Context};
+use clap::{Arg, ArgAction, Command};
+use fail::FailScenario;
+use nix::unistd::Pid;
 use tracing::*;

-use anyhow::{anyhow, bail, Context, Result};
-
-use clap::{Arg, ArgAction, Command};
-use daemonize::Daemonize;
-
-use fail::FailScenario;
 use metrics::set_build_info_metric;
-
 use pageserver::{
    config::{defaults::*, PageServerConf},
-    http, page_cache, page_image_cache, page_service, profiling, task_mgr,
+    http, page_cache, page_service, profiling, task_mgr,
    task_mgr::TaskKind,
    task_mgr::{
        BACKGROUND_RUNTIME, COMPUTE_REQUEST_RUNTIME, MGMT_REQUEST_RUNTIME, WALRECEIVER_RUNTIME,
    },
-    tenant_mgr, virtual_file, LOG_FILE_NAME,
+    tenant_mgr, virtual_file,
 };
+use remote_storage::GenericRemoteStorage;
 use utils::{
    auth::JwtAuth,
-    logging,
+    lock_file, logging,
    postgres_backend::AuthType,
    project_git_version,
-    shutdown::exit_now,
    signals::{self, Signal},
    tcp_listener,
 };

 project_git_version!(GIT_VERSION);

+const PID_FILE_NAME: &str = "pageserver.pid";
+
 const FEATURES: &[&str] = &[
    #[cfg(feature = "testing")]
    "testing",
@@ -65,6 +64,7 @@ fn main() -> anyhow::Result<()> {
    let workdir = workdir
        .canonicalize()
        .with_context(|| format!("Error opening workdir '{}'", workdir.display()))?;
+
    let cfg_file_path = workdir.join("pageserver.toml");

    // Set CWD to workdir for non-daemon modes
@@ -75,8 +75,6 @@ fn main() -> anyhow::Result<()> {
        )
    })?;

-    let daemonize = arg_matches.get_flag("daemonize");
-
    let conf = match initialize_config(&cfg_file_path, arg_matches, &workdir)? {
        ControlFlow::Continue(conf) => conf,
        ControlFlow::Break(()) => {
@@ -101,9 +99,8 @@ fn main() -> anyhow::Result<()> {
    // Basic initialization of things that don't change after startup
    virtual_file::init(conf.max_file_descriptors);
    page_cache::init(conf.page_cache_size);
-    page_image_cache::init(64 * conf.page_cache_size); // temporary hack for benchmarking

-    start_pageserver(conf, daemonize).context("Failed to start pageserver")?;
+    start_pageserver(conf).context("Failed to start pageserver")?;

    scenario.teardown();
    Ok(())
@@ -198,12 +195,34 @@ fn initialize_config(
    })
 }

-fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()> {
-    // Initialize logger
-    let log_file = logging::init(LOG_FILE_NAME, daemonize, conf.log_format)?;
-
+fn start_pageserver(conf: &'static PageServerConf) -> anyhow::Result<()> {
+    logging::init(conf.log_format)?;
    info!("version: {}", version());

+    let lock_file_path = conf.workdir.join(PID_FILE_NAME);
+    let lock_file = match lock_file::create_lock_file(&lock_file_path, Pid::this().to_string()) {
+        lock_file::LockCreationResult::Created {
+            new_lock_contents,
+            file,
+        } => {
+            info!("Created lock file at {lock_file_path:?} with contenst {new_lock_contents}");
+            file
+        }
+        lock_file::LockCreationResult::AlreadyLocked {
+            existing_lock_contents,
+        } => anyhow::bail!(
+            "Could not lock pid file; pageserver is already running in {:?} with PID {}",
+            conf.workdir,
+            existing_lock_contents
+        ),
+        lock_file::LockCreationResult::CreationFailed(e) => {
+            return Err(e.context(format!("Failed to create lock file at {lock_file_path:?}")))
+        }
+    };
+    // ensure that the lock file is held even if the main thread of the process is panics
+    // we need to release the lock file only when the current process is gone
+    let _ = Box::leak(Box::new(lock_file));
+
    // TODO: Check that it looks like a valid repository before going further

    // bind sockets before daemonizing so we report errors early and do not return until we are listening
@@ -219,33 +238,6 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
    );
    let pageserver_listener = tcp_listener::bind(conf.listen_pg_addr.clone())?;

-    // NB: Don't spawn any threads before daemonizing!
-    if daemonize {
-        info!("daemonizing...");
-
-        // There shouldn't be any logging to stdin/stdout. Redirect it to the main log so
-        // that we will see any accidental manual fprintf's or backtraces.
-        let stdout = log_file
-            .try_clone()
-            .with_context(|| format!("Failed to clone log file '{:?}'", log_file))?;
-        let stderr = log_file;
-
-        let daemonize = Daemonize::new()
-            .pid_file("pageserver.pid")
-            .working_directory(".")
-            .stdout(stdout)
-            .stderr(stderr);
-
-        // XXX: The parent process should exit abruptly right after
-        // it has spawned a child to prevent coverage machinery from
-        // dumping stats into a `profraw` file now owned by the child.
-        // Otherwise, the coverage data will be damaged.
-        match daemonize.exit_action(|| exit_now(0)).start() {
-            Ok(_) => info!("Success, daemonized"),
-            Err(err) => bail!("{err}. could not daemonize. bailing."),
-        }
-    }
-
    let signals = signals::install_shutdown_handlers()?;

    // start profiler (if enabled)
@@ -348,14 +340,6 @@ fn cli() -> Command {
    Command::new("Neon page server")
        .about("Materializes WAL stream to pages and serves them to the postgres")
        .version(version())
-        .arg(
-
-            Arg::new("daemonize")
-                .short('d')
-                .long("daemonize")
-                .action(ArgAction::SetTrue)
-                .help("Run in the background"),
-        )
        .arg(
            Arg::new("init")
                .long("init")
--- a/pageserver/src/config.rs
+++ b/pageserver/src/config.rs
@@ -9,6 +9,7 @@ use remote_storage::RemoteStorageConfig;
 use std::env;
 use utils::crashsafe::path_with_suffix_extension;

+use std::num::NonZeroUsize;
 use std::path::{Path, PathBuf};
 use std::str::FromStr;
 use std::time::Duration;
@@ -48,6 +49,9 @@ pub mod defaults {

    pub const DEFAULT_LOG_FORMAT: &str = "plain";

+    pub const DEFAULT_CONCURRENT_TENANT_SIZE_LOGICAL_SIZE_QUERIES: usize =
+        super::ConfigurableSemaphore::DEFAULT_INITIAL.get();
+
    ///
    /// Default built-in configuration file.
    ///
@@ -67,6 +71,9 @@ pub mod defaults {
 #initial_superuser_name = '{DEFAULT_SUPERUSER}'

 #log_format = '{DEFAULT_LOG_FORMAT}'
+
+#concurrent_tenant_size_logical_size_queries = '{DEFAULT_CONCURRENT_TENANT_SIZE_LOGICAL_SIZE_QUERIES}'
+
 # [tenant_config]
 #checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
 #checkpoint_timeout = {DEFAULT_CHECKPOINT_TIMEOUT}
@@ -132,6 +139,9 @@ pub struct PageServerConf {
    pub broker_endpoints: Vec<Url>,

    pub log_format: LogFormat,
+
+    /// Number of concurrent [`Tenant::gather_size_inputs`] allowed.
+    pub concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore,
 }

 #[derive(Debug, Clone, PartialEq, Eq)]
@@ -200,6 +210,8 @@ struct PageServerConfigBuilder {
    broker_endpoints: BuilderValue<Vec<Url>>,

    log_format: BuilderValue<LogFormat>,
+
+    concurrent_tenant_size_logical_size_queries: BuilderValue<ConfigurableSemaphore>,
 }

 impl Default for PageServerConfigBuilder {
@@ -228,6 +240,8 @@ impl Default for PageServerConfigBuilder {
            broker_etcd_prefix: Set(etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string()),
            broker_endpoints: Set(Vec::new()),
            log_format: Set(LogFormat::from_str(DEFAULT_LOG_FORMAT).unwrap()),
+
+            concurrent_tenant_size_logical_size_queries: Set(ConfigurableSemaphore::default()),
        }
    }
 }
@@ -304,6 +318,10 @@ impl PageServerConfigBuilder {
        self.log_format = BuilderValue::Set(log_format)
    }

+    pub fn concurrent_tenant_size_logical_size_queries(&mut self, u: ConfigurableSemaphore) {
+        self.concurrent_tenant_size_logical_size_queries = BuilderValue::Set(u);
+    }
+
    pub fn build(self) -> anyhow::Result<PageServerConf> {
        let broker_endpoints = self
            .broker_endpoints
@@ -349,6 +367,11 @@ impl PageServerConfigBuilder {
                .broker_etcd_prefix
                .ok_or(anyhow!("missing broker_etcd_prefix"))?,
            log_format: self.log_format.ok_or(anyhow!("missing log_format"))?,
+            concurrent_tenant_size_logical_size_queries: self
+                .concurrent_tenant_size_logical_size_queries
+                .ok_or(anyhow!(
+                    "missing concurrent_tenant_size_logical_size_queries"
+                ))?,
        })
    }
 }
@@ -476,6 +499,12 @@ impl PageServerConf {
                "log_format" => builder.log_format(
                    LogFormat::from_config(&parse_toml_string(key, item)?)?
                ),
+                "concurrent_tenant_size_logical_size_queries" => builder.concurrent_tenant_size_logical_size_queries({
+                    let input = parse_toml_string(key, item)?;
+                    let permits = input.parse::<usize>().context("expected a number of initial permits, not {s:?}")?;
+                    let permits = NonZeroUsize::new(permits).context("initial semaphore permits out of range: 0, use other configuration to disable a feature")?;
+                    ConfigurableSemaphore::new(permits)
+                }),
                _ => bail!("unrecognized pageserver option '{key}'"),
            }
        }
@@ -589,6 +618,7 @@ impl PageServerConf {
            broker_endpoints: Vec::new(),
            broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(),
            log_format: LogFormat::from_str(defaults::DEFAULT_LOG_FORMAT).unwrap(),
+            concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
        }
    }
 }
@@ -654,6 +684,58 @@ fn parse_toml_array(name: &str, item: &Item) -> anyhow::Result<Vec<String>> {
        .collect()
 }

+/// Configurable semaphore permits setting.
+///
+/// Does not allow semaphore permits to be zero, because at runtime initially zero permits and empty
+/// semaphore cannot be distinguished, leading any feature using these to await forever (or until
+/// new permits are added).
+#[derive(Debug, Clone)]
+pub struct ConfigurableSemaphore {
+    initial_permits: NonZeroUsize,
+    inner: std::sync::Arc<tokio::sync::Semaphore>,
+}
+
+impl ConfigurableSemaphore {
+    pub const DEFAULT_INITIAL: NonZeroUsize = match NonZeroUsize::new(1) {
+        Some(x) => x,
+        None => panic!("const unwrap is not yet stable"),
+    };
+
+    /// Initializse using a non-zero amount of permits.
+    ///
+    /// Require a non-zero initial permits, because using permits == 0 is a crude way to disable a
+    /// feature such as [`Tenant::gather_size_inputs`]. Otherwise any semaphore using future will
+    /// behave like [`futures::future::pending`], just waiting until new permits are added.
+    pub fn new(initial_permits: NonZeroUsize) -> Self {
+        ConfigurableSemaphore {
+            initial_permits,
+            inner: std::sync::Arc::new(tokio::sync::Semaphore::new(initial_permits.get())),
+        }
+    }
+}
+
+impl Default for ConfigurableSemaphore {
+    fn default() -> Self {
+        Self::new(Self::DEFAULT_INITIAL)
+    }
+}
+
+impl PartialEq for ConfigurableSemaphore {
+    fn eq(&self, other: &Self) -> bool {
+        // the number of permits can be increased at runtime, so we cannot really fulfill the
+        // PartialEq value equality otherwise
+        self.initial_permits == other.initial_permits
+    }
+}
+
+impl Eq for ConfigurableSemaphore {}
+
+impl ConfigurableSemaphore {
+    pub fn inner(&self) -> &std::sync::Arc<tokio::sync::Semaphore> {
+        &self.inner
+    }
+}
+
 #[cfg(test)]
 mod tests {
    use std::{
@@ -725,6 +807,7 @@ log_format = 'json'
                    .expect("Failed to parse a valid broker endpoint URL")],
                broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(),
                log_format: LogFormat::from_str(defaults::DEFAULT_LOG_FORMAT).unwrap(),
+                concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
            },
            "Correct defaults should be used when no config values are provided"
        );
@@ -770,6 +853,7 @@ log_format = 'json'
                    .expect("Failed to parse a valid broker endpoint URL")],
                broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(),
                log_format: LogFormat::Json,
+                concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
            },
            "Should be able to parse all basic config values correctly"
        );
--- a/pageserver/src/http/openapi_spec.yml
+++ b/pageserver/src/http/openapi_spec.yml
@@ -354,6 +354,54 @@ paths:
              schema:
                $ref: "#/components/schemas/Error"

+  /v1/tenant/{tenant_id}/size:
+    parameters:
+      - name: tenant_id
+        in: path
+        required: true
+        schema:
+          type: string
+          format: hex
+    get:
+      description: |
+        Calculate tenant's size, which is a mixture of WAL (bytes) and logical_size (bytes).
+      responses:
+        "200":
+          description: OK,
+          content:
+            application/json:
+              schema:
+                type: object
+                required:
+                  - id
+                  - size
+                properties:
+                  id:
+                    type: string
+                    format: hex
+                  size:
+                    type: integer
+                    description: |
+                      Size metric in bytes.
+        "401":
+          description: Unauthorized Error
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/UnauthorizedError"
+        "403":
+          description: Forbidden Error
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/ForbiddenError"
+        "500":
+          description: Generic operation error
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/Error"
+
  /v1/tenant/{tenant_id}/timeline/:
    parameters:
      - name: tenant_id
--- a/pageserver/src/http/routes.rs
+++ b/pageserver/src/http/routes.rs
@@ -227,13 +227,10 @@ async fn timeline_list_handler(request: Request<Body>) -> Result<Response<Body>,

    let state = get_state(&request);

-    let timelines = tokio::task::spawn_blocking(move || {
-        let _enter = info_span!("timeline_list", tenant = %tenant_id).entered();
+    let timelines = info_span!("timeline_list", tenant = %tenant_id).in_scope(|| {
        let tenant = tenant_mgr::get_tenant(tenant_id, true).map_err(ApiError::NotFound)?;
        Ok(tenant.list_timelines())
-    })
-    .await
-    .map_err(|e: JoinError| ApiError::InternalServerError(e.into()))??;
+    })?;

    let mut response_data = Vec::with_capacity(timelines.len());
    for timeline in timelines {
@@ -523,9 +520,7 @@ async fn tenant_status(request: Request<Body>) -> Result<Response<Body>, ApiErro
    check_permission(&request, Some(tenant_id))?;

    // if tenant is in progress of downloading it can be absent in global tenant map
-    let tenant = tokio::task::spawn_blocking(move || tenant_mgr::get_tenant(tenant_id, false))
-        .await
-        .map_err(|e: JoinError| ApiError::InternalServerError(e.into()))?;
+    let tenant = tenant_mgr::get_tenant(tenant_id, false);

    let state = get_state(&request);
    let remote_index = &state.remote_index;
@@ -571,6 +566,44 @@ async fn tenant_status(request: Request<Body>) -> Result<Response<Body>, ApiErro
    )
 }

+async fn tenant_size_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
+    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
+    check_permission(&request, Some(tenant_id))?;
+
+    let tenant = tenant_mgr::get_tenant(tenant_id, false).map_err(ApiError::InternalServerError)?;
+
+    // this can be long operation, it currently is not backed by any request coalescing or similar
+    let inputs = tenant
+        .gather_size_inputs()
+        .await
+        .map_err(ApiError::InternalServerError)?;
+
+    let size = inputs.calculate().map_err(ApiError::InternalServerError)?;
+
+    /// Private response type with the additional "unstable" `inputs` field.
+    ///
+    /// The type is described with `id` and `size` in the openapi_spec file, but the `inputs` is
+    /// intentionally left out. The type resides in the pageserver not to expose `ModelInputs`.
+    #[serde_with::serde_as]
+    #[derive(serde::Serialize)]
+    struct TenantHistorySize {
+        #[serde_as(as = "serde_with::DisplayFromStr")]
+        id: TenantId,
+        /// Size is a mixture of WAL and logical size, so the unit is bytes.
+        size: u64,
+        inputs: crate::tenant::size::ModelInputs,
+    }
+
+    json_response(
+        StatusCode::OK,
+        TenantHistorySize {
+            id: tenant_id,
+            size,
+            inputs,
+        },
+    )
+}
+
 // Helper function to standardize the error messages we produce on bad durations
 //
 // Intended to be used with anyhow's `with_context`, e.g.:
@@ -792,14 +825,14 @@ async fn timeline_gc_handler(mut request: Request<Body>) -> Result<Response<Body
    let tenant = tenant_mgr::get_tenant(tenant_id, false).map_err(ApiError::NotFound)?;
    let gc_req: TimelineGcRequest = json_request(&mut request).await?;

-    let _span_guard =
-        info_span!("manual_gc", tenant = %tenant_id, timeline = %timeline_id).entered();
    let gc_horizon = gc_req.gc_horizon.unwrap_or_else(|| tenant.get_gc_horizon());

    // Use tenant's pitr setting
    let pitr = tenant.get_pitr_interval();
    let result = tenant
        .gc_iteration(Some(timeline_id), gc_horizon, pitr, true)
+        .instrument(info_span!("manual_gc", tenant = %tenant_id, timeline = %timeline_id))
+        .await
        // FIXME: `gc_iteration` can return an error for multiple reasons; we should handle it
        // better once the types support it.
        .map_err(ApiError::InternalServerError)?;
@@ -835,6 +868,7 @@ async fn timeline_checkpoint_handler(request: Request<Body>) -> Result<Response<
        .map_err(ApiError::NotFound)?;
    timeline
        .checkpoint(CheckpointConfig::Forced)
+        .await
        .map_err(ApiError::InternalServerError)?;

    json_response(StatusCode::OK, ())
@@ -898,6 +932,7 @@ pub fn make_router(
        .get("/v1/tenant", tenant_list_handler)
        .post("/v1/tenant", tenant_create_handler)
        .get("/v1/tenant/:tenant_id", tenant_status)
+        .get("/v1/tenant/:tenant_id/size", tenant_size_handler)
        .put("/v1/tenant/config", tenant_config_handler)
        .get("/v1/tenant/:tenant_id/timeline", timeline_list_handler)
        .post("/v1/tenant/:tenant_id/timeline", timeline_create_handler)
--- a/pageserver/src/lib.rs
+++ b/pageserver/src/lib.rs
@@ -5,7 +5,6 @@ pub mod import_datadir;
 pub mod keyspace;
 pub mod metrics;
 pub mod page_cache;
-pub mod page_image_cache;
 pub mod page_service;
 pub mod pgdatadir_mapping;
 pub mod profiling;
@@ -44,8 +43,6 @@ pub const DEFAULT_PG_VERSION: u32 = 14;
 pub const IMAGE_FILE_MAGIC: u16 = 0x5A60;
 pub const DELTA_FILE_MAGIC: u16 = 0x5A61;

-pub const LOG_FILE_NAME: &str = "pageserver.log";
-
 static ZERO_PAGE: bytes::Bytes = bytes::Bytes::from_static(&[0u8; 8192]);

 /// Config for the Repository checkpointer
@@ -82,7 +79,6 @@ pub async fn shutdown_pageserver(exit_code: i32) {

    // There should be nothing left, but let's be sure
    task_mgr::shutdown_tasks(None, None, None).await;
-
    info!("Shut down successfully completed");
    std::process::exit(exit_code);
 }
--- a/pageserver/src/metrics.rs
+++ b/pageserver/src/metrics.rs
@@ -31,6 +31,7 @@ const STORAGE_TIME_OPERATIONS: &[&str] = &[
    "compact",
    "create images",
    "init logical size",
+    "logical size",
    "load layer map",
    "gc",
 ];
@@ -365,6 +366,7 @@ pub struct TimelineMetrics {
    pub compact_time_histo: Histogram,
    pub create_images_time_histo: Histogram,
    pub init_logical_size_histo: Histogram,
+    pub logical_size_histo: Histogram,
    pub load_layer_map_histo: Histogram,
    pub last_record_gauge: IntGauge,
    pub wait_lsn_time_histo: Histogram,
@@ -397,6 +399,9 @@ impl TimelineMetrics {
        let init_logical_size_histo = STORAGE_TIME
            .get_metric_with_label_values(&["init logical size", &tenant_id, &timeline_id])
            .unwrap();
+        let logical_size_histo = STORAGE_TIME
+            .get_metric_with_label_values(&["logical size", &tenant_id, &timeline_id])
+            .unwrap();
        let load_layer_map_histo = STORAGE_TIME
            .get_metric_with_label_values(&["load layer map", &tenant_id, &timeline_id])
            .unwrap();
@@ -428,6 +433,7 @@ impl TimelineMetrics {
            compact_time_histo,
            create_images_time_histo,
            init_logical_size_histo,
+            logical_size_histo,
            load_layer_map_histo,
            last_record_gauge,
            wait_lsn_time_histo,
--- a/pageserver/src/page_cache.rs
+++ b/pageserver/src/page_cache.rs
@@ -108,10 +108,10 @@ enum CacheKey {
 }

 #[derive(Debug, PartialEq, Eq, Hash, Clone)]
-pub struct MaterializedPageHashKey {
-    pub tenant_id: TenantId,
-    pub timeline_id: TimelineId,
-    pub key: Key,
+struct MaterializedPageHashKey {
+    tenant_id: TenantId,
+    timeline_id: TimelineId,
+    key: Key,
 }

 #[derive(Clone)]
--- a/pageserver/src/page_image_cache.rs
+++ b/pageserver/src/page_image_cache.rs
@@ -1,308 +0,0 @@
-//!
-//! Global page image cache
-//!
-//! Unlike page_cache it holds only most recent version of reconstructed page images.
-//! And it uses invalidation mechanism to avoid layer ap lookups.
-
-use crate::page_cache::MaterializedPageHashKey;
-use crate::pgdatadir_mapping::{rel_block_to_key, BlockNumber};
-use crate::repository::Key;
-use crate::tenant::Timeline;
-use anyhow::{bail, Result};
-use bytes::Bytes;
-use once_cell::sync::OnceCell;
-use pageserver_api::reltag::RelTag;
-use std::collections::hash_map::DefaultHasher;
-use std::hash::{Hash, Hasher};
-use std::sync::{Arc, Condvar, Mutex};
-use utils::{
-    id::{TenantId, TimelineId},
-    lsn::Lsn,
-};
-
-static PAGE_CACHE: OnceCell<Mutex<PageImageCache>> = OnceCell::new();
-const TEST_PAGE_CACHE_SIZE: usize = 50;
-
-enum PageImageState {
-    Vacant,                        // entry is not used
-    Loaded(Option<Bytes>),         // page is loaded or has failed
-    Loading(Option<Arc<Condvar>>), // page in process of loading, Condvar is created on demand when some thread need to wait load completion
-}
-
-struct CacheEntry {
-    key: MaterializedPageHashKey,
-
-    // next+prev are used for LRU L2-list and next is also used for L1 free pages list
-    next: usize,
-    prev: usize,
-
-    collision: usize, // L1 hash collision chain
-
-    state: PageImageState,
-}
-
-pub struct PageImageCache {
-    free_list: usize, // L1 list of free entries
-    pages: Vec<CacheEntry>,
-    hash_table: Vec<usize>, // indexes in pages array
-}
-
-///
-/// Initialize the page cache. This must be called once at page server startup.
-///
-pub fn init(size: usize) {
-    if PAGE_CACHE
-        .set(Mutex::new(PageImageCache::new(size)))
-        .is_err()
-    {
-        panic!("page cache already initialized");
-    }
-}
-
-///
-/// Get a handle to the page cache.
-///
-pub fn get() -> &'static Mutex<PageImageCache> {
-    //
-    // In unit tests, page server startup doesn't happen and no one calls
-    // page_image_cache::init(). Initialize it here with a tiny cache, so that the
-    // page cache is usable in unit tests.
-    //
-    if cfg!(test) {
-        PAGE_CACHE.get_or_init(|| Mutex::new(PageImageCache::new(TEST_PAGE_CACHE_SIZE)))
-    } else {
-        PAGE_CACHE.get().expect("page cache not initialized")
-    }
-}
-
-fn hash<T: Hash>(t: &T) -> usize {
-    let mut s = DefaultHasher::new();
-    t.hash(&mut s);
-    s.finish() as usize
-}
-
-impl PageImageCache {
-    fn new(size: usize) -> Self {
-        let mut pages: Vec<CacheEntry> = Vec::with_capacity(size + 1);
-        let hash_table = vec![0usize; size];
-
-        // Dummy key
-        let dummy_key = MaterializedPageHashKey {
-            key: Key::MIN,
-            tenant_id: TenantId::from([0u8; 16]),
-            timeline_id: TimelineId::from([0u8; 16]),
-        };
-
-        // LRU list head
-        pages.push(CacheEntry {
-            key: dummy_key.clone(),
-            next: 0,
-            prev: 0,
-            collision: 0,
-            state: PageImageState::Vacant,
-        });
-
-        // Construct L1 free page list
-        for i in 0..size {
-            pages.push(CacheEntry {
-                key: dummy_key.clone(),
-                next: i + 2, // build L1-list of free pages
-                prev: 0,
-                collision: 0,
-                state: PageImageState::Vacant,
-            });
-        }
-        pages[size - 1].next = 0; // en of free page list
-
-        PageImageCache {
-            free_list: 1,
-            pages,
-            hash_table,
-        }
-    }
-
-    // Unlink from L2-list
-    fn unlink(&mut self, index: usize) {
-        let next = self.pages[index].next;
-        let prev = self.pages[index].prev;
-        self.pages[next].prev = prev;
-        self.pages[prev].next = next;
-    }
-
-    // Link in L2-list after specified element
-    fn link_after(&mut self, after: usize, index: usize) {
-        let next = self.pages[after].next;
-        self.pages[index].prev = after;
-        self.pages[index].next = next;
-        self.pages[next].prev = index;
-        self.pages[after].next = index;
-    }
-
-    fn prune(&mut self, index: usize) {
-        self.pages[index].prev = index;
-        self.pages[index].next = index;
-    }
-
-    fn is_empty(&self, index: usize) -> bool {
-        self.pages[index].next == index
-    }
-}
-
-// Remove entry from cache: o page invalidation or drop relation
-pub fn remove(key: Key, tenant_id: TenantId, timeline_id: TimelineId) {
-    let key = MaterializedPageHashKey {
-        key,
-        tenant_id,
-        timeline_id,
-    };
-    let this = get();
-    let mut cache = this.lock().unwrap();
-    let h = hash(&key) % cache.hash_table.len();
-    let mut index = cache.hash_table[h];
-    let mut prev = 0usize;
-    while index != 0 {
-        if cache.pages[index].key == key {
-            if !cache.is_empty(index) {
-                cache.pages[index].state = PageImageState::Vacant;
-                // Remove from LRU list
-                cache.unlink(index);
-                // Insert entry in free list
-                cache.pages[index].next = cache.free_list;
-                cache.free_list = index;
-            } else {
-                // Page is process of loading: we can not remove it righ now,
-                // so just mark for deletion
-                cache.pages[index].next = 0; // make is_empty == false
-            }
-            // Remove from hash table
-            if prev == 0 {
-                cache.hash_table[h] = cache.pages[index].collision;
-            } else {
-                cache.pages[prev].collision = cache.pages[index].collision;
-            }
-            break;
-        }
-        prev = index;
-        index = cache.pages[index].collision;
-    }
-    // It's Ok if image not found
-}
-
-// Find or load page image in the cache
-pub fn lookup(timeline: &Timeline, rel: RelTag, blkno: BlockNumber, lsn: Lsn) -> Result<Bytes> {
-    let key = MaterializedPageHashKey {
-        key: rel_block_to_key(rel, blkno),
-        tenant_id: timeline.tenant_id,
-        timeline_id: timeline.timeline_id,
-    };
-    let this = get();
-    let mut cache = this.lock().unwrap();
-    let h = hash(&key) % cache.hash_table.len();
-
-    'lookup: loop {
-        let mut index = cache.hash_table[h];
-        while index != 0 {
-            if cache.pages[index].key == key {
-                // cache hit
-                match &cache.pages[index].state {
-                    PageImageState::Loaded(cached_page) => {
-                        // Move to the head of LRU list
-                        let page = cached_page.clone();
-                        cache.unlink(index);
-                        cache.link_after(0, index);
-                        return page.ok_or_else(|| anyhow::anyhow!("page loading failed earlier"));
-                    }
-                    PageImageState::Loading(event) => {
-                        // Create event on which to sleep if not yet assigned
-                        let cv = match event {
-                            None => {
-                                let cv = Arc::new(Condvar::new());
-                                cache.pages[index].state =
-                                    PageImageState::Loading(Some(cv.clone()));
-                                cv
-                            }
-                            Some(cv) => cv.clone(),
-                        };
-                        cache = cv.wait(cache).unwrap();
-                        // Retry lookup
-                        continue 'lookup;
-                    }
-                    PageImageState::Vacant => bail!("Vacant entry is not expected here"),
-                };
-            }
-            index = cache.pages[index].collision;
-        }
-        // Cache miss
-        index = cache.free_list;
-        if index == 0 {
-            // no free items
-            let victim = cache.pages[0].prev; // take least recently used element from the tail of LRU list
-            assert!(victim != 0);
-            // Remove victim from hash table
-            let h = hash(&cache.pages[victim].key) % cache.hash_table.len();
-            index = cache.hash_table[h];
-            let mut prev = 0usize;
-            while index != victim {
-                assert!(index != 0);
-                prev = index;
-                index = cache.pages[index].collision;
-            }
-            if prev == 0 {
-                cache.hash_table[h] = cache.pages[victim].collision;
-            } else {
-                cache.pages[prev].collision = cache.pages[victim].collision;
-            }
-            // and from LRU list
-            cache.unlink(victim);
-
-            index = victim;
-        } else {
-            // Use next free item
-            cache.free_list = cache.pages[index].next;
-        }
-        // Make is_empty(index) == true. If entry is removed in process of loaded,
-        // it will be updated so that !is_empty(index)
-        cache.prune(index);
-
-        // Insert in hash table
-        cache.pages[index].collision = cache.hash_table[h];
-        cache.hash_table[h] = index;
-
-        cache.pages[index].key = key;
-        cache.pages[index].state = PageImageState::Loading(None);
-        drop(cache); //release lock
-
-        // Load page
-        let res = timeline.get_rel_page_at_lsn(rel, blkno, lsn, true);
-
-        cache = this.lock().unwrap();
-        if let PageImageState::Loading(event) = &cache.pages[index].state {
-            // Are there soMe waiting threads?
-            if let Some(cv) = event {
-                // If so, then wakeup them
-                cv.notify_all();
-            }
-        } else {
-            bail!("Loading state is expected");
-        }
-        if cache.is_empty(index) {
-            // entry was not marked as deleted {
-            // Page is loaded
-
-            // match &res { ... } is same as `res.as_ref().ok().cloned()`
-            cache.pages[index].state = PageImageState::Loaded(match &res {
-                Ok(page) => Some(page.clone()),
-                Err(_) => None,
-            });
-            // Link the page to the head of LRU list
-            cache.link_after(0, index);
-        } else {
-            cache.pages[index].state = PageImageState::Vacant;
-            // Return page to free list
-            cache.pages[index].next = cache.free_list;
-            cache.free_list = index;
-        }
-        // only the first one gets the full error from `get_rel_page_at_lsn`
-        return res;
-    }
-}
--- a/pageserver/src/page_service.rs
+++ b/pageserver/src/page_service.rs
@@ -10,6 +10,7 @@
 //

 use anyhow::{bail, ensure, Context, Result};
+use bytes::Buf;
 use bytes::Bytes;
 use futures::{Stream, StreamExt};
 use pageserver_api::models::{
@@ -18,12 +19,13 @@ use pageserver_api::models::{
    PagestreamFeMessage, PagestreamGetPageRequest, PagestreamGetPageResponse,
    PagestreamNblocksRequest, PagestreamNblocksResponse,
 };
-
+use pq_proto::{BeMessage, FeMessage, RowDescriptor};
 use std::io;
 use std::net::TcpListener;
 use std::str;
 use std::str::FromStr;
 use std::sync::Arc;
+use tokio::pin;
 use tokio_util::io::StreamReader;
 use tokio_util::io::SyncIoBridge;
 use tracing::*;
@@ -33,7 +35,6 @@ use utils::{
    lsn::Lsn,
    postgres_backend::AuthType,
    postgres_backend_async::{self, PostgresBackend},
-    pq_proto::{BeMessage, FeMessage, RowDescriptor},
    simple_rcu::RcuReadGuard,
 };

@@ -41,7 +42,6 @@ use crate::basebackup;
 use crate::config::{PageServerConf, ProfilingConfig};
 use crate::import_datadir::import_wal_from_tar;
 use crate::metrics::{LIVE_CONNECTIONS_COUNT, SMGR_QUERY_TIME};
-use crate::page_image_cache;
 use crate::profiling::profpoint_start;
 use crate::task_mgr;
 use crate::task_mgr::TaskKind;
@@ -301,7 +301,7 @@ impl PageServerHandler {

            trace!("query: {copy_data_bytes:?}");

-            let neon_fe_msg = PagestreamFeMessage::parse(copy_data_bytes)?;
+            let neon_fe_msg = PagestreamFeMessage::parse(&mut copy_data_bytes.reader())?;

            let response = match neon_fe_msg {
                PagestreamFeMessage::Exists(req) => {
@@ -368,14 +368,12 @@ impl PageServerHandler {
        pgb.write_message(&BeMessage::CopyInResponse)?;
        pgb.flush().await?;

-        // import_basebackup_from_tar() is not async, mainly because the Tar crate
-        // it uses is not async. So we need to jump through some hoops:
-        // - convert the input from client connection to a synchronous Read
-        // - use block_in_place()
-        let mut copyin_stream = Box::pin(copyin_stream(pgb));
-        let reader = SyncIoBridge::new(StreamReader::new(&mut copyin_stream));
-        tokio::task::block_in_place(|| timeline.import_basebackup_from_tar(reader, base_lsn))?;
-        timeline.initialize()?;
+        let copyin_stream = copyin_stream(pgb);
+        pin!(copyin_stream);
+
+        timeline
+            .import_basebackup_from_tar(&mut copyin_stream, base_lsn)
+            .await?;

        // Drain the rest of the Copy data
        let mut bytes_after_tar = 0;
@@ -440,7 +438,7 @@ impl PageServerHandler {
        // We only want to persist the data, and it doesn't matter if it's in the
        // shape of deltas or images.
        info!("flushing layers");
-        timeline.checkpoint(CheckpointConfig::Flush)?;
+        timeline.checkpoint(CheckpointConfig::Flush).await?;

        info!("done");
        Ok(())
@@ -582,12 +580,8 @@ impl PageServerHandler {
        // current profiling is based on a thread-local variable, so it doesn't work
        // across awaits
        let _profiling_guard = profpoint_start(self.conf, ProfilingConfig::PageRequests);
+        let page = timeline.get_rel_page_at_lsn(req.rel, req.blkno, lsn, req.latest)?;

-        let page = if req.latest {
-            page_image_cache::lookup(timeline, req.rel, req.blkno, lsn)
-        } else {
-            timeline.get_rel_page_at_lsn(req.rel, req.blkno, lsn, false)
-        }?;
        Ok(PagestreamBeMessage::GetPage(PagestreamGetPageResponse {
            page,
        }))
--- a/pageserver/src/pgdatadir_mapping.rs
+++ b/pageserver/src/pgdatadir_mapping.rs
@@ -1179,7 +1179,7 @@ fn rel_dir_to_key(spcnode: Oid, dbnode: Oid) -> Key {
    }
 }

-pub fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
+fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
    Key {
        field1: 0x00,
        field2: rel.spcnode,
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -12,8 +12,12 @@
 //!

 use anyhow::{bail, Context};
+use bytes::Bytes;
+use futures::Stream;
 use pageserver_api::models::TimelineState;
 use tokio::sync::watch;
+use tokio_util::io::StreamReader;
+use tokio_util::io::SyncIoBridge;
 use tracing::*;
 use utils::crashsafe::path_with_suffix_extension;

@@ -29,6 +33,7 @@ use std::io::Write;
 use std::ops::Bound::Included;
 use std::path::Path;
 use std::path::PathBuf;
+use std::pin::Pin;
 use std::process::Command;
 use std::process::Stdio;
 use std::sync::Arc;
@@ -72,6 +77,8 @@ pub mod storage_layer;

 mod timeline;

+pub mod size;
+
 use storage_layer::Layer;

 pub use timeline::Timeline;
@@ -120,6 +127,9 @@ pub struct Tenant {

    /// Makes every timeline to backup their files to remote storage.
    upload_layers: bool,
+
+    /// Cached logical sizes updated updated on each [`Tenant::gather_size_inputs`].
+    cached_logical_sizes: tokio::sync::Mutex<HashMap<(TimelineId, Lsn), u64>>,
 }

 /// A timeline with some of its files on disk, being initialized.
@@ -132,7 +142,7 @@ pub struct Tenant {
 pub struct UninitializedTimeline<'t> {
    owning_tenant: &'t Tenant,
    timeline_id: TimelineId,
-    raw_timeline: Option<(Timeline, TimelineUninitMark)>,
+    raw_timeline: Option<(Arc<Timeline>, TimelineUninitMark)>,
 }

 /// An uninit mark file, created along the timeline dir to ensure the timeline either gets fully initialized and loaded into pageserver's memory,
@@ -164,7 +174,6 @@ impl UninitializedTimeline<'_> {
        let (new_timeline, uninit_mark) = self.raw_timeline.take().with_context(|| {
            format!("No timeline for initalization found for {tenant_id}/{timeline_id}")
        })?;
-        let new_timeline = Arc::new(new_timeline);

        let new_disk_consistent_lsn = new_timeline.get_disk_consistent_lsn();
        // TODO it would be good to ensure that, but apparently a lot of our testing is dependend on that at least
@@ -192,6 +201,9 @@ impl UninitializedTimeline<'_> {
                })?;
                new_timeline.set_state(TimelineState::Active);
                v.insert(Arc::clone(&new_timeline));
+
+                new_timeline.maybe_spawn_flush_loop();
+
                new_timeline.launch_wal_receiver();
            }
        }
@@ -200,20 +212,28 @@ impl UninitializedTimeline<'_> {
    }

    /// Prepares timeline data by loading it from the basebackup archive.
-    pub fn import_basebackup_from_tar(
-        &self,
-        reader: impl std::io::Read,
+    pub async fn import_basebackup_from_tar(
+        self,
+        mut copyin_stream: &mut Pin<&mut impl Stream<Item = io::Result<Bytes>>>,
        base_lsn: Lsn,
-    ) -> anyhow::Result<()> {
+    ) -> anyhow::Result<Arc<Timeline>> {
        let raw_timeline = self.raw_timeline()?;
-        import_datadir::import_basebackup_from_tar(raw_timeline, reader, base_lsn).with_context(
-            || {
-                format!(
-                    "Failed to import basebackup for timeline {}/{}",
-                    self.owning_tenant.tenant_id, self.timeline_id
-                )
-            },
-        )?;
+
+        // import_basebackup_from_tar() is not async, mainly because the Tar crate
+        // it uses is not async. So we need to jump through some hoops:
+        // - convert the input from client connection to a synchronous Read
+        // - use block_in_place()
+        let reader = SyncIoBridge::new(StreamReader::new(&mut copyin_stream));
+
+        tokio::task::block_in_place(|| {
+            import_datadir::import_basebackup_from_tar(raw_timeline, reader, base_lsn)
+                .context("Failed to import basebackup")
+        })?;
+
+        // Flush loop needs to be spawned in order for checkpoint to be able to flush.
+        // We want to run proper checkpoint before we mark timeline as available to outside world
+        // Thus spawning flush loop manually and skipping flush_loop setup in initialize_with_lock
+        raw_timeline.maybe_spawn_flush_loop();

        fail::fail_point!("before-checkpoint-new-timeline", |_| {
            bail!("failpoint before-checkpoint-new-timeline");
@@ -221,16 +241,15 @@ impl UninitializedTimeline<'_> {

        raw_timeline
            .checkpoint(CheckpointConfig::Flush)
-            .with_context(|| {
-                format!(
-                    "Failed to checkpoint after basebackup import for timeline {}/{}",
-                    self.owning_tenant.tenant_id, self.timeline_id
-                )
-            })?;
-        Ok(())
+            .await
+            .context("Failed to checkpoint after basebackup import")?;
+
+        let timeline = self.initialize()?;
+
+        Ok(timeline)
    }

-    fn raw_timeline(&self) -> anyhow::Result<&Timeline> {
+    fn raw_timeline(&self) -> anyhow::Result<&Arc<Timeline>> {
        Ok(&self
            .raw_timeline
            .as_ref()
@@ -465,7 +484,7 @@ impl Tenant {

                self.branch_timeline(ancestor_timeline_id, new_timeline_id, ancestor_start_lsn)?
            }
-            None => self.bootstrap_timeline(new_timeline_id, pg_version)?,
+            None => self.bootstrap_timeline(new_timeline_id, pg_version).await?,
        };

        // Have added new timeline into the tenant, now its background tasks are needed.
@@ -483,7 +502,7 @@ impl Tenant {
    /// `checkpoint_before_gc` parameter is used to force compaction of storage before GC
    /// to make tests more deterministic.
    /// TODO Do we still need it or we can call checkpoint explicitly in tests where needed?
-    pub fn gc_iteration(
+    pub async fn gc_iteration(
        &self,
        target_timeline_id: Option<TimelineId>,
        horizon: u64,
@@ -499,11 +518,13 @@ impl Tenant {
            .map(|x| x.to_string())
            .unwrap_or_else(|| "-".to_string());

-        STORAGE_TIME
-            .with_label_values(&["gc", &self.tenant_id.to_string(), &timeline_str])
-            .observe_closure_duration(|| {
-                self.gc_iteration_internal(target_timeline_id, horizon, pitr, checkpoint_before_gc)
-            })
+        {
+            let _timer = STORAGE_TIME
+                .with_label_values(&["gc", &self.tenant_id.to_string(), &timeline_str])
+                .start_timer();
+            self.gc_iteration_internal(target_timeline_id, horizon, pitr, checkpoint_before_gc)
+                .await
+        }
    }

    /// Perform one compaction iteration.
@@ -523,7 +544,6 @@ impl Tenant {
        let timelines = self.timelines.lock().unwrap();
        let timelines_to_compact = timelines
            .iter()
-            .filter(|(_, timeline)| timeline.is_active())
            .map(|(timeline_id, timeline)| (*timeline_id, timeline.clone()))
            .collect::<Vec<_>>();
        drop(timelines);
@@ -540,23 +560,24 @@ impl Tenant {
    ///
    /// Used at graceful shutdown.
    ///
-    pub fn checkpoint(&self) -> anyhow::Result<()> {
+    pub async fn checkpoint(&self) -> anyhow::Result<()> {
        // Scan through the hashmap and collect a list of all the timelines,
        // while holding the lock. Then drop the lock and actually perform the
        // checkpoints. We don't want to block everything else while the
        // checkpoint runs.
-        let timelines = self.timelines.lock().unwrap();
-        let timelines_to_checkpoint = timelines
-            .iter()
-            .map(|(timeline_id, timeline)| (*timeline_id, Arc::clone(timeline)))
-            .collect::<Vec<_>>();
-        drop(timelines);
+        let timelines_to_checkpoint = {
+            let timelines = self.timelines.lock().unwrap();
+            timelines
+                .iter()
+                .map(|(id, timeline)| (*id, Arc::clone(timeline)))
+                .collect::<Vec<_>>()
+        };

-        for (timeline_id, timeline) in &timelines_to_checkpoint {
-            let _entered =
-                info_span!("checkpoint", timeline = %timeline_id, tenant = %self.tenant_id)
-                    .entered();
-            timeline.checkpoint(CheckpointConfig::Flush)?;
+        for (id, timeline) in &timelines_to_checkpoint {
+            timeline
+                .checkpoint(CheckpointConfig::Flush)
+                .instrument(info_span!("checkpoint", timeline = %id, tenant = %self.tenant_id))
+                .await?;
        }

        Ok(())
@@ -835,6 +856,7 @@ impl Tenant {
            remote_index,
            upload_layers,
            state,
+            cached_logical_sizes: tokio::sync::Mutex::new(HashMap::new()),
        }
    }

@@ -956,8 +978,9 @@ impl Tenant {
    //                 +-----baz-------->
    //
    //
-    // 1. Grab 'gc_cs' mutex to prevent new timelines from being created
-    // 2. Scan all timelines, and on each timeline, make note of the
+    // 1. Grab 'gc_cs' mutex to prevent new timelines from being created while Timeline's
+    //    `gc_infos` are being refreshed
+    // 2. Scan collected timelines, and on each timeline, make note of the
    //    all the points where other timelines have been branched off.
    //    We will refrain from removing page versions at those LSNs.
    // 3. For each timeline, scan all layer files on the timeline.
@@ -968,7 +991,7 @@ impl Tenant {
    // - if a relation has a non-incremental persistent layer on a child branch, then we
    //   don't need to keep that in the parent anymore. But currently
    //   we do.
-    fn gc_iteration_internal(
+    async fn gc_iteration_internal(
        &self,
        target_timeline_id: Option<TimelineId>,
        horizon: u64,
@@ -978,6 +1001,68 @@ impl Tenant {
        let mut totals: GcResult = Default::default();
        let now = Instant::now();

+        let gc_timelines = self.refresh_gc_info_internal(target_timeline_id, horizon, pitr)?;
+
+        // Perform GC for each timeline.
+        //
+        // Note that we don't hold the GC lock here because we don't want
+        // to delay the branch creation task, which requires the GC lock.
+        // A timeline GC iteration can be slow because it may need to wait for
+        // compaction (both require `layer_removal_cs` lock),
+        // but the GC iteration can run concurrently with branch creation.
+        //
+        // See comments in [`Tenant::branch_timeline`] for more information
+        // about why branch creation task can run concurrently with timeline's GC iteration.
+        for timeline in gc_timelines {
+            if task_mgr::is_shutdown_requested() {
+                // We were requested to shut down. Stop and return with the progress we
+                // made.
+                break;
+            }
+
+            // If requested, force flush all in-memory layers to disk first,
+            // so that they too can be garbage collected. That's
+            // used in tests, so we want as deterministic results as possible.
+            if checkpoint_before_gc {
+                timeline.checkpoint(CheckpointConfig::Forced).await?;
+                info!(
+                    "timeline {} checkpoint_before_gc done",
+                    timeline.timeline_id
+                );
+            }
+
+            let result = timeline.gc()?;
+            totals += result;
+        }
+
+        totals.elapsed = now.elapsed();
+        Ok(totals)
+    }
+
+    /// Refreshes the Timeline::gc_info for all timelines, returning the
+    /// vector of timelines which have [`Timeline::get_last_record_lsn`] past
+    /// [`Tenant::get_gc_horizon`].
+    ///
+    /// This is usually executed as part of periodic gc, but can now be triggered more often.
+    pub fn refresh_gc_info(&self) -> anyhow::Result<Vec<Arc<Timeline>>> {
+        // since this method can now be called at different rates than the configured gc loop, it
+        // might be that these configuration values get applied faster than what it was previously,
+        // since these were only read from the gc task.
+        let horizon = self.get_gc_horizon();
+        let pitr = self.get_pitr_interval();
+
+        // refresh all timelines
+        let target_timeline_id = None;
+
+        self.refresh_gc_info_internal(target_timeline_id, horizon, pitr)
+    }
+
+    fn refresh_gc_info_internal(
+        &self,
+        target_timeline_id: Option<TimelineId>,
+        horizon: u64,
+        pitr: Duration,
+    ) -> anyhow::Result<Vec<Arc<Timeline>>> {
        // grab mutex to prevent new timelines from being created here.
        let gc_cs = self.gc_cs.lock().unwrap();

@@ -995,11 +1080,7 @@ impl Tenant {

            timelines
                .iter()
-                .filter(|(_, timeline)| timeline.is_active())
                .map(|(timeline_id, timeline_entry)| {
-                    // This is unresolved question for now, how to do gc in presence of remote timelines
-                    // especially when this is combined with branching.
-                    // Somewhat related: https://github.com/neondatabase/neon/issues/999
                    if let Some(ancestor_timeline_id) = &timeline_entry.get_ancestor_timeline_id() {
                        // If target_timeline is specified, we only need to know branchpoints of its children
                        if let Some(timeline_id) = target_timeline_id {
@@ -1053,41 +1134,7 @@ impl Tenant {
            }
        }
        drop(gc_cs);
-
-        // Perform GC for each timeline.
-        //
-        // Note that we don't hold the GC lock here because we don't want
-        // to delay the branch creation task, which requires the GC lock.
-        // A timeline GC iteration can be slow because it may need to wait for
-        // compaction (both require `layer_removal_cs` lock),
-        // but the GC iteration can run concurrently with branch creation.
-        //
-        // See comments in [`Tenant::branch_timeline`] for more information
-        // about why branch creation task can run concurrently with timeline's GC iteration.
-        for timeline in gc_timelines {
-            if task_mgr::is_shutdown_requested() {
-                // We were requested to shut down. Stop and return with the progress we
-                // made.
-                break;
-            }
-
-            // If requested, force flush all in-memory layers to disk first,
-            // so that they too can be garbage collected. That's
-            // used in tests, so we want as deterministic results as possible.
-            if checkpoint_before_gc {
-                timeline.checkpoint(CheckpointConfig::Forced)?;
-                info!(
-                    "timeline {} checkpoint_before_gc done",
-                    timeline.timeline_id
-                );
-            }
-
-            let result = timeline.gc()?;
-            totals += result;
-        }
-
-        totals.elapsed = now.elapsed();
-        Ok(totals)
+        Ok(gc_timelines)
    }

    /// Branch an existing timeline
@@ -1191,14 +1238,15 @@ impl Tenant {

    /// - run initdb to init temporary instance and get bootstrap data
    /// - after initialization complete, remove the temp dir.
-    fn bootstrap_timeline(
+    async fn bootstrap_timeline(
        &self,
        timeline_id: TimelineId,
        pg_version: u32,
    ) -> anyhow::Result<Arc<Timeline>> {
-        let timelines = self.timelines.lock().unwrap();
-        let timeline_uninit_mark = self.create_timeline_uninit_mark(timeline_id, &timelines)?;
-        drop(timelines);
+        let timeline_uninit_mark = {
+            let timelines = self.timelines.lock().unwrap();
+            self.create_timeline_uninit_mark(timeline_id, &timelines)?
+        };
        // create a `tenant/{tenant_id}/timelines/basebackup-{timeline_id}.{TEMP_FILE_SUFFIX}/`
        // temporary directory for basebackup files for the given timeline.
        let initdb_path = path_with_suffix_extension(
@@ -1248,25 +1296,35 @@ impl Tenant {

        let tenant_id = raw_timeline.owning_tenant.tenant_id;
        let unfinished_timeline = raw_timeline.raw_timeline()?;
-        import_datadir::import_timeline_from_postgres_datadir(
-            unfinished_timeline,
-            pgdata_path,
-            pgdata_lsn,
-        )
+
+        tokio::task::block_in_place(|| {
+            import_datadir::import_timeline_from_postgres_datadir(
+                unfinished_timeline,
+                pgdata_path,
+                pgdata_lsn,
+            )
+        })
        .with_context(|| {
            format!("Failed to import pgdatadir for timeline {tenant_id}/{timeline_id}")
        })?;

+        // Flush loop needs to be spawned in order for checkpoint to be able to flush.
+        // We want to run proper checkpoint before we mark timeline as available to outside world
+        // Thus spawning flush loop manually and skipping flush_loop setup in initialize_with_lock
+        unfinished_timeline.maybe_spawn_flush_loop();
+
        fail::fail_point!("before-checkpoint-new-timeline", |_| {
            anyhow::bail!("failpoint before-checkpoint-new-timeline");
        });
+
        unfinished_timeline
-            .checkpoint(CheckpointConfig::Forced)
+            .checkpoint(CheckpointConfig::Forced).await
            .with_context(|| format!("Failed to checkpoint after pgdatadir import for timeline {tenant_id}/{timeline_id}"))?;

-        let mut timelines = self.timelines.lock().unwrap();
-        let timeline = raw_timeline.initialize_with_lock(&mut timelines, false)?;
-        drop(timelines);
+        let timeline = {
+            let mut timelines = self.timelines.lock().unwrap();
+            raw_timeline.initialize_with_lock(&mut timelines, false)?
+        };

        info!(
            "created root timeline {} timeline.lsn {}",
@@ -1306,7 +1364,7 @@ impl Tenant {
                Ok(UninitializedTimeline {
                    owning_tenant: self,
                    timeline_id: new_timeline_id,
-                    raw_timeline: Some((new_timeline, uninit_mark)),
+                    raw_timeline: Some((Arc::new(new_timeline), uninit_mark)),
                })
            }
            Err(e) => {
@@ -1425,7 +1483,7 @@ impl Tenant {
            let timeline = UninitializedTimeline {
                owning_tenant: self,
                timeline_id,
-                raw_timeline: Some((dummy_timeline, TimelineUninitMark::dummy())),
+                raw_timeline: Some((Arc::new(dummy_timeline), TimelineUninitMark::dummy())),
            };
            match timeline.initialize_with_lock(&mut timelines_accessor, true) {
                Ok(initialized_timeline) => {
@@ -1446,6 +1504,25 @@ impl Tenant {

        Ok(())
    }
+
+    /// Gathers inputs from all of the timelines to produce a sizing model input.
+    ///
+    /// Future is cancellation safe. Only one calculation can be running at once per tenant.
+    #[instrument(skip_all, fields(tenant_id=%self.tenant_id))]
+    pub async fn gather_size_inputs(&self) -> anyhow::Result<size::ModelInputs> {
+        let logical_sizes_at_once = self
+            .conf
+            .concurrent_tenant_size_logical_size_queries
+            .inner();
+
+        // TODO: Having a single mutex block concurrent reads is unfortunate, but since the queries
+        // are for testing/experimenting, we tolerate this.
+        //
+        // See more for on the issue #2748 condenced out of the initial PR review.
+        let mut shared_cache = self.cached_logical_sizes.lock().await;
+
+        size::gather_inputs(self, logical_sizes_at_once, &mut *shared_cache).await
+    }
 }

 /// Create the cluster temporarily in 'initdbpath' directory inside the repository
@@ -1860,7 +1937,7 @@ mod tests {
        Ok(())
    }

-    fn make_some_layers(tline: &Timeline, start_lsn: Lsn) -> anyhow::Result<()> {
+    async fn make_some_layers(tline: &Timeline, start_lsn: Lsn) -> anyhow::Result<()> {
        let mut lsn = start_lsn;
        #[allow(non_snake_case)]
        {
@@ -1881,7 +1958,7 @@ mod tests {
            writer.finish_write(lsn);
            lsn += 0x10;
        }
-        tline.checkpoint(CheckpointConfig::Forced)?;
+        tline.checkpoint(CheckpointConfig::Forced).await?;
        {
            let writer = tline.writer();
            writer.put(
@@ -1898,24 +1975,26 @@ mod tests {
            )?;
            writer.finish_write(lsn);
        }
-        tline.checkpoint(CheckpointConfig::Forced)
+        tline.checkpoint(CheckpointConfig::Forced).await
    }

-    #[test]
-    fn test_prohibit_branch_creation_on_garbage_collected_data() -> anyhow::Result<()> {
+    #[tokio::test]
+    async fn test_prohibit_branch_creation_on_garbage_collected_data() -> anyhow::Result<()> {
        let tenant =
            TenantHarness::create("test_prohibit_branch_creation_on_garbage_collected_data")?
                .load();
        let tline = tenant
            .create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
            .initialize()?;
-        make_some_layers(tline.as_ref(), Lsn(0x20))?;
+        make_some_layers(tline.as_ref(), Lsn(0x20)).await?;

        // this removes layers before lsn 40 (50 minus 10), so there are two remaining layers, image and delta for 31-50
        // FIXME: this doesn't actually remove any layer currently, given how the checkpointing
        // and compaction works. But it does set the 'cutoff' point so that the cross check
        // below should fail.
-        tenant.gc_iteration(Some(TIMELINE_ID), 0x10, Duration::ZERO, false)?;
+        tenant
+            .gc_iteration(Some(TIMELINE_ID), 0x10, Duration::ZERO, false)
+            .await?;

        // try to branch at lsn 25, should fail because we already garbage collected the data
        match tenant.branch_timeline(TIMELINE_ID, NEW_TIMELINE_ID, Some(Lsn(0x25))) {
@@ -1960,14 +2039,14 @@ mod tests {
    /*
    // FIXME: This currently fails to error out. Calling GC doesn't currently
    // remove the old value, we'd need to work a little harder
-    #[test]
-    fn test_prohibit_get_for_garbage_collected_data() -> anyhow::Result<()> {
+    #[tokio::test]
+    async fn test_prohibit_get_for_garbage_collected_data() -> anyhow::Result<()> {
        let repo =
            RepoHarness::create("test_prohibit_get_for_garbage_collected_data")?
            .load();

        let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?;
-        make_some_layers(tline.as_ref(), Lsn(0x20))?;
+        make_some_layers(tline.as_ref(), Lsn(0x20)).await?;

        repo.gc_iteration(Some(TIMELINE_ID), 0x10, Duration::ZERO, false)?;
        let latest_gc_cutoff_lsn = tline.get_latest_gc_cutoff_lsn();
@@ -1980,43 +2059,47 @@ mod tests {
    }
     */

-    #[test]
-    fn test_retain_data_in_parent_which_is_needed_for_child() -> anyhow::Result<()> {
+    #[tokio::test]
+    async fn test_retain_data_in_parent_which_is_needed_for_child() -> anyhow::Result<()> {
        let tenant =
            TenantHarness::create("test_retain_data_in_parent_which_is_needed_for_child")?.load();
        let tline = tenant
            .create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
            .initialize()?;
-        make_some_layers(tline.as_ref(), Lsn(0x20))?;
+        make_some_layers(tline.as_ref(), Lsn(0x20)).await?;

        tenant.branch_timeline(TIMELINE_ID, NEW_TIMELINE_ID, Some(Lsn(0x40)))?;
        let newtline = tenant
            .get_timeline(NEW_TIMELINE_ID, true)
            .expect("Should have a local timeline");
        // this removes layers before lsn 40 (50 minus 10), so there are two remaining layers, image and delta for 31-50
-        tenant.gc_iteration(Some(TIMELINE_ID), 0x10, Duration::ZERO, false)?;
+        tenant
+            .gc_iteration(Some(TIMELINE_ID), 0x10, Duration::ZERO, false)
+            .await?;
        assert!(newtline.get(*TEST_KEY, Lsn(0x25)).is_ok());

        Ok(())
    }
-    #[test]
-    fn test_parent_keeps_data_forever_after_branching() -> anyhow::Result<()> {
+    #[tokio::test]
+    async fn test_parent_keeps_data_forever_after_branching() -> anyhow::Result<()> {
        let tenant =
            TenantHarness::create("test_parent_keeps_data_forever_after_branching")?.load();
        let tline = tenant
            .create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
            .initialize()?;
-        make_some_layers(tline.as_ref(), Lsn(0x20))?;
+        make_some_layers(tline.as_ref(), Lsn(0x20)).await?;

        tenant.branch_timeline(TIMELINE_ID, NEW_TIMELINE_ID, Some(Lsn(0x40)))?;
        let newtline = tenant
            .get_timeline(NEW_TIMELINE_ID, true)
            .expect("Should have a local timeline");

-        make_some_layers(newtline.as_ref(), Lsn(0x60))?;
+        make_some_layers(newtline.as_ref(), Lsn(0x60)).await?;

        // run gc on parent
-        tenant.gc_iteration(Some(TIMELINE_ID), 0x10, Duration::ZERO, false)?;
+        tenant
+            .gc_iteration(Some(TIMELINE_ID), 0x10, Duration::ZERO, false)
+            .await?;

        // Check that the data is still accessible on the branch.
        assert_eq!(
@@ -2027,8 +2110,8 @@ mod tests {
        Ok(())
    }

-    #[test]
-    fn timeline_load() -> anyhow::Result<()> {
+    #[tokio::test]
+    async fn timeline_load() -> anyhow::Result<()> {
        const TEST_NAME: &str = "timeline_load";
        let harness = TenantHarness::create(TEST_NAME)?;
        {
@@ -2036,8 +2119,8 @@ mod tests {
            let tline = tenant
                .create_empty_timeline(TIMELINE_ID, Lsn(0x8000), DEFAULT_PG_VERSION)?
                .initialize()?;
-            make_some_layers(tline.as_ref(), Lsn(0x8000))?;
-            tline.checkpoint(CheckpointConfig::Forced)?;
+            make_some_layers(tline.as_ref(), Lsn(0x8000)).await?;
+            tline.checkpoint(CheckpointConfig::Forced).await?;
        }

        let tenant = harness.load();
@@ -2048,8 +2131,8 @@ mod tests {
        Ok(())
    }

-    #[test]
-    fn timeline_load_with_ancestor() -> anyhow::Result<()> {
+    #[tokio::test]
+    async fn timeline_load_with_ancestor() -> anyhow::Result<()> {
        const TEST_NAME: &str = "timeline_load_with_ancestor";
        let harness = TenantHarness::create(TEST_NAME)?;
        // create two timelines
@@ -2059,8 +2142,8 @@ mod tests {
                .create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
                .initialize()?;

-            make_some_layers(tline.as_ref(), Lsn(0x20))?;
-            tline.checkpoint(CheckpointConfig::Forced)?;
+            make_some_layers(tline.as_ref(), Lsn(0x20)).await?;
+            tline.checkpoint(CheckpointConfig::Forced).await?;

            tenant.branch_timeline(TIMELINE_ID, NEW_TIMELINE_ID, Some(Lsn(0x40)))?;

@@ -2068,8 +2151,8 @@ mod tests {
                .get_timeline(NEW_TIMELINE_ID, true)
                .expect("Should have a local timeline");

-            make_some_layers(newtline.as_ref(), Lsn(0x60))?;
-            tline.checkpoint(CheckpointConfig::Forced)?;
+            make_some_layers(newtline.as_ref(), Lsn(0x60)).await?;
+            tline.checkpoint(CheckpointConfig::Forced).await?;
        }

        // check that both of them are initially unloaded
@@ -2129,8 +2212,8 @@ mod tests {
        Ok(())
    }

-    #[test]
-    fn test_images() -> anyhow::Result<()> {
+    #[tokio::test]
+    async fn test_images() -> anyhow::Result<()> {
        let tenant = TenantHarness::create("test_images")?.load();
        let tline = tenant
            .create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
@@ -2141,7 +2224,7 @@ mod tests {
        writer.finish_write(Lsn(0x10));
        drop(writer);

-        tline.checkpoint(CheckpointConfig::Forced)?;
+        tline.checkpoint(CheckpointConfig::Forced).await?;
        tline.compact()?;

        let writer = tline.writer();
@@ -2149,7 +2232,7 @@ mod tests {
        writer.finish_write(Lsn(0x20));
        drop(writer);

-        tline.checkpoint(CheckpointConfig::Forced)?;
+        tline.checkpoint(CheckpointConfig::Forced).await?;
        tline.compact()?;

        let writer = tline.writer();
@@ -2157,7 +2240,7 @@ mod tests {
        writer.finish_write(Lsn(0x30));
        drop(writer);

-        tline.checkpoint(CheckpointConfig::Forced)?;
+        tline.checkpoint(CheckpointConfig::Forced).await?;
        tline.compact()?;

        let writer = tline.writer();
@@ -2165,7 +2248,7 @@ mod tests {
        writer.finish_write(Lsn(0x40));
        drop(writer);

-        tline.checkpoint(CheckpointConfig::Forced)?;
+        tline.checkpoint(CheckpointConfig::Forced).await?;
        tline.compact()?;

        assert_eq!(tline.get(*TEST_KEY, Lsn(0x10))?, TEST_IMG("foo at 0x10"));
@@ -2181,8 +2264,8 @@ mod tests {
    // Insert 1000 key-value pairs with increasing keys, checkpoint,
    // repeat 50 times.
    //
-    #[test]
-    fn test_bulk_insert() -> anyhow::Result<()> {
+    #[tokio::test]
+    async fn test_bulk_insert() -> anyhow::Result<()> {
        let tenant = TenantHarness::create("test_bulk_insert")?.load();
        let tline = tenant
            .create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
@@ -2215,7 +2298,7 @@ mod tests {
            let cutoff = tline.get_last_record_lsn();

            tline.update_gc_info(Vec::new(), cutoff, Duration::ZERO)?;
-            tline.checkpoint(CheckpointConfig::Forced)?;
+            tline.checkpoint(CheckpointConfig::Forced).await?;
            tline.compact()?;
            tline.gc()?;
        }
@@ -2223,8 +2306,8 @@ mod tests {
        Ok(())
    }

-    #[test]
-    fn test_random_updates() -> anyhow::Result<()> {
+    #[tokio::test]
+    async fn test_random_updates() -> anyhow::Result<()> {
        let tenant = TenantHarness::create("test_random_updates")?.load();
        let tline = tenant
            .create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
@@ -2287,7 +2370,7 @@ mod tests {
            println!("checkpointing {}", lsn);
            let cutoff = tline.get_last_record_lsn();
            tline.update_gc_info(Vec::new(), cutoff, Duration::ZERO)?;
-            tline.checkpoint(CheckpointConfig::Forced)?;
+            tline.checkpoint(CheckpointConfig::Forced).await?;
            tline.compact()?;
            tline.gc()?;
        }
@@ -2295,8 +2378,8 @@ mod tests {
        Ok(())
    }

-    #[test]
-    fn test_traverse_branches() -> anyhow::Result<()> {
+    #[tokio::test]
+    async fn test_traverse_branches() -> anyhow::Result<()> {
        let tenant = TenantHarness::create("test_traverse_branches")?.load();
        let mut tline = tenant
            .create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
@@ -2368,7 +2451,7 @@ mod tests {
            println!("checkpointing {}", lsn);
            let cutoff = tline.get_last_record_lsn();
            tline.update_gc_info(Vec::new(), cutoff, Duration::ZERO)?;
-            tline.checkpoint(CheckpointConfig::Forced)?;
+            tline.checkpoint(CheckpointConfig::Forced).await?;
            tline.compact()?;
            tline.gc()?;
        }
--- a/pageserver/src/tenant/delta_layer.rs
+++ b/pageserver/src/tenant/delta_layer.rs
@@ -610,9 +610,9 @@ impl DeltaLayer {
 ///
 /// 3. Call `finish`.
 ///
-pub struct DeltaLayerWriter {
+struct DeltaLayerWriterInner {
    conf: &'static PageServerConf,
-    path: PathBuf,
+    pub path: PathBuf,
    timeline_id: TimelineId,
    tenant_id: TenantId,

@@ -624,17 +624,17 @@ pub struct DeltaLayerWriter {
    blob_writer: WriteBlobWriter<BufWriter<VirtualFile>>,
 }

-impl DeltaLayerWriter {
+impl DeltaLayerWriterInner {
    ///
    /// Start building a new delta layer.
    ///
-    pub fn new(
+    fn new(
        conf: &'static PageServerConf,
        timeline_id: TimelineId,
        tenant_id: TenantId,
        key_start: Key,
        lsn_range: Range<Lsn>,
-    ) -> Result<DeltaLayerWriter> {
+    ) -> anyhow::Result<Self> {
        // Create the file initially with a temporary filename. We don't know
        // the end key yet, so we cannot form the final filename yet. We will
        // rename it when we're done.
@@ -653,7 +653,7 @@ impl DeltaLayerWriter {
        let block_buf = BlockBuf::new();
        let tree_builder = DiskBtreeBuilder::new(block_buf);

-        Ok(DeltaLayerWriter {
+        Ok(Self {
            conf,
            path,
            timeline_id,
@@ -670,17 +670,17 @@ impl DeltaLayerWriter {
    ///
    /// The values must be appended in key, lsn order.
    ///
-    pub fn put_value(&mut self, key: Key, lsn: Lsn, val: Value) -> Result<()> {
+    fn put_value(&mut self, key: Key, lsn: Lsn, val: Value) -> anyhow::Result<()> {
        self.put_value_bytes(key, lsn, &Value::ser(&val)?, val.will_init())
    }

-    pub fn put_value_bytes(
+    fn put_value_bytes(
        &mut self,
        key: Key,
        lsn: Lsn,
        val: &[u8],
        will_init: bool,
-    ) -> Result<()> {
+    ) -> anyhow::Result<()> {
        assert!(self.lsn_range.start <= lsn);

        let off = self.blob_writer.write_blob(val)?;
@@ -693,14 +693,14 @@ impl DeltaLayerWriter {
        Ok(())
    }

-    pub fn size(&self) -> u64 {
+    fn size(&self) -> u64 {
        self.blob_writer.size() + self.tree.borrow_writer().size()
    }

    ///
    /// Finish writing the delta layer.
    ///
-    pub fn finish(self, key_end: Key) -> anyhow::Result<DeltaLayer> {
+    fn finish(self, key_end: Key) -> anyhow::Result<DeltaLayer> {
        let index_start_blk =
            ((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32;

@@ -768,6 +768,102 @@ impl DeltaLayerWriter {
    }
 }

+/// A builder object for constructing a new delta layer.
+///
+/// Usage:
+///
+/// 1. Create the DeltaLayerWriter by calling DeltaLayerWriter::new(...)
+///
+/// 2. Write the contents by calling `put_value` for every page
+///    version to store in the layer.
+///
+/// 3. Call `finish`.
+///
+/// # Note
+///
+/// As described in https://github.com/neondatabase/neon/issues/2650, it's
+/// possible for the writer to drop before `finish` is actually called. So this
+/// could lead to odd temporary files in the directory, exhausting file system.
+/// This structure wraps `DeltaLayerWriterInner` and also contains `Drop`
+/// implementation that cleans up the temporary file in failure. It's not
+/// possible to do this directly in `DeltaLayerWriterInner` since `finish` moves
+/// out some fields, making it impossible to implement `Drop`.
+///
+#[must_use]
+pub struct DeltaLayerWriter {
+    inner: Option<DeltaLayerWriterInner>,
+}
+
+impl DeltaLayerWriter {
+    ///
+    /// Start building a new delta layer.
+    ///
+    pub fn new(
+        conf: &'static PageServerConf,
+        timeline_id: TimelineId,
+        tenant_id: TenantId,
+        key_start: Key,
+        lsn_range: Range<Lsn>,
+    ) -> anyhow::Result<Self> {
+        Ok(Self {
+            inner: Some(DeltaLayerWriterInner::new(
+                conf,
+                timeline_id,
+                tenant_id,
+                key_start,
+                lsn_range,
+            )?),
+        })
+    }
+
+    ///
+    /// Append a key-value pair to the file.
+    ///
+    /// The values must be appended in key, lsn order.
+    ///
+    pub fn put_value(&mut self, key: Key, lsn: Lsn, val: Value) -> anyhow::Result<()> {
+        self.inner.as_mut().unwrap().put_value(key, lsn, val)
+    }
+
+    pub fn put_value_bytes(
+        &mut self,
+        key: Key,
+        lsn: Lsn,
+        val: &[u8],
+        will_init: bool,
+    ) -> anyhow::Result<()> {
+        self.inner
+            .as_mut()
+            .unwrap()
+            .put_value_bytes(key, lsn, val, will_init)
+    }
+
+    pub fn size(&self) -> u64 {
+        self.inner.as_ref().unwrap().size()
+    }
+
+    ///
+    /// Finish writing the delta layer.
+    ///
+    pub fn finish(mut self, key_end: Key) -> anyhow::Result<DeltaLayer> {
+        self.inner.take().unwrap().finish(key_end)
+    }
+}
+
+impl Drop for DeltaLayerWriter {
+    fn drop(&mut self) {
+        if let Some(inner) = self.inner.take() {
+            match inner.blob_writer.into_inner().into_inner() {
+                Ok(vfile) => vfile.remove(),
+                Err(err) => warn!(
+                    "error while flushing buffer of image layer temporary file: {}",
+                    err
+                ),
+            }
+        }
+    }
+}
+
 ///
 /// Iterator over all key-value pairse stored in a delta layer
 ///
--- a/pageserver/src/tenant/image_layer.rs
+++ b/pageserver/src/tenant/image_layer.rs
@@ -411,7 +411,7 @@ impl ImageLayer {
 ///
 /// 3. Call `finish`.
 ///
-pub struct ImageLayerWriter {
+struct ImageLayerWriterInner {
    conf: &'static PageServerConf,
    path: PathBuf,
    timeline_id: TimelineId,
@@ -423,14 +423,17 @@ pub struct ImageLayerWriter {
    tree: DiskBtreeBuilder<BlockBuf, KEY_SIZE>,
 }

-impl ImageLayerWriter {
-    pub fn new(
+impl ImageLayerWriterInner {
+    ///
+    /// Start building a new image layer.
+    ///
+    fn new(
        conf: &'static PageServerConf,
        timeline_id: TimelineId,
        tenant_id: TenantId,
        key_range: &Range<Key>,
        lsn: Lsn,
-    ) -> anyhow::Result<ImageLayerWriter> {
+    ) -> anyhow::Result<Self> {
        // Create the file initially with a temporary filename.
        // We'll atomically rename it to the final name when we're done.
        let path = ImageLayer::temp_path_for(
@@ -455,7 +458,7 @@ impl ImageLayerWriter {
        let block_buf = BlockBuf::new();
        let tree_builder = DiskBtreeBuilder::new(block_buf);

-        let writer = ImageLayerWriter {
+        let writer = Self {
            conf,
            path,
            timeline_id,
@@ -474,7 +477,7 @@ impl ImageLayerWriter {
    ///
    /// The page versions must be appended in blknum order.
    ///
-    pub fn put_image(&mut self, key: Key, img: &[u8]) -> Result<()> {
+    fn put_image(&mut self, key: Key, img: &[u8]) -> anyhow::Result<()> {
        ensure!(self.key_range.contains(&key));
        let off = self.blob_writer.write_blob(img)?;

@@ -485,7 +488,10 @@ impl ImageLayerWriter {
        Ok(())
    }

-    pub fn finish(self) -> anyhow::Result<ImageLayer> {
+    ///
+    /// Finish writing the image layer.
+    ///
+    fn finish(self) -> anyhow::Result<ImageLayer> {
        let index_start_blk =
            ((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32;

@@ -552,3 +558,76 @@ impl ImageLayerWriter {
        Ok(layer)
    }
 }
+
+/// A builder object for constructing a new image layer.
+///
+/// Usage:
+///
+/// 1. Create the ImageLayerWriter by calling ImageLayerWriter::new(...)
+///
+/// 2. Write the contents by calling `put_page_image` for every key-value
+///    pair in the key range.
+///
+/// 3. Call `finish`.
+///
+/// # Note
+///
+/// As described in https://github.com/neondatabase/neon/issues/2650, it's
+/// possible for the writer to drop before `finish` is actually called. So this
+/// could lead to odd temporary files in the directory, exhausting file system.
+/// This structure wraps `ImageLayerWriterInner` and also contains `Drop`
+/// implementation that cleans up the temporary file in failure. It's not
+/// possible to do this directly in `ImageLayerWriterInner` since `finish` moves
+/// out some fields, making it impossible to implement `Drop`.
+///
+#[must_use]
+pub struct ImageLayerWriter {
+    inner: Option<ImageLayerWriterInner>,
+}
+
+impl ImageLayerWriter {
+    ///
+    /// Start building a new image layer.
+    ///
+    pub fn new(
+        conf: &'static PageServerConf,
+        timeline_id: TimelineId,
+        tenant_id: TenantId,
+        key_range: &Range<Key>,
+        lsn: Lsn,
+    ) -> anyhow::Result<ImageLayerWriter> {
+        Ok(Self {
+            inner: Some(ImageLayerWriterInner::new(
+                conf,
+                timeline_id,
+                tenant_id,
+                key_range,
+                lsn,
+            )?),
+        })
+    }
+
+    ///
+    /// Write next value to the file.
+    ///
+    /// The page versions must be appended in blknum order.
+    ///
+    pub fn put_image(&mut self, key: Key, img: &[u8]) -> anyhow::Result<()> {
+        self.inner.as_mut().unwrap().put_image(key, img)
+    }
+
+    ///
+    /// Finish writing the image layer.
+    ///
+    pub fn finish(mut self) -> anyhow::Result<ImageLayer> {
+        self.inner.take().unwrap().finish()
+    }
+}
+
+impl Drop for ImageLayerWriter {
+    fn drop(&mut self) {
+        if let Some(inner) = self.inner.take() {
+            inner.blob_writer.into_inner().remove();
+        }
+    }
+}
--- a/pageserver/src/tenant/size.rs
+++ b/pageserver/src/tenant/size.rs
@@ -0,0 +1,475 @@
+use std::cmp;
+use std::collections::{HashMap, HashSet};
+use std::sync::Arc;
+
+use anyhow::Context;
+use tokio::sync::Semaphore;
+
+use super::Tenant;
+use utils::id::TimelineId;
+use utils::lsn::Lsn;
+
+use tracing::*;
+
+/// Inputs to the actual tenant sizing model
+///
+/// Implements [`serde::Serialize`] but is not meant to be part of the public API, instead meant to
+/// be a transferrable format between execution environments and developer.
+#[serde_with::serde_as]
+#[derive(Debug, serde::Serialize, serde::Deserialize)]
+pub struct ModelInputs {
+    updates: Vec<Update>,
+    retention_period: u64,
+    #[serde_as(as = "HashMap<serde_with::DisplayFromStr, _>")]
+    timeline_inputs: HashMap<TimelineId, TimelineInputs>,
+}
+
+/// Collect all relevant LSNs to the inputs. These will only be helpful in the serialized form as
+/// part of [`ModelInputs`] from the HTTP api, explaining the inputs.
+#[serde_with::serde_as]
+#[derive(Debug, serde::Serialize, serde::Deserialize)]
+struct TimelineInputs {
+    #[serde_as(as = "serde_with::DisplayFromStr")]
+    last_record: Lsn,
+    #[serde_as(as = "serde_with::DisplayFromStr")]
+    latest_gc_cutoff: Lsn,
+    #[serde_as(as = "serde_with::DisplayFromStr")]
+    horizon_cutoff: Lsn,
+    #[serde_as(as = "serde_with::DisplayFromStr")]
+    pitr_cutoff: Lsn,
+    #[serde_as(as = "serde_with::DisplayFromStr")]
+    next_gc_cutoff: Lsn,
+}
+
+/// Gathers the inputs for the tenant sizing model.
+///
+/// Tenant size does not consider the latest state, but only the state until next_gc_cutoff, which
+/// is updated on-demand, during the start of this calculation and separate from the
+/// [`Timeline::latest_gc_cutoff`].
+///
+/// For timelines in general:
+///
+/// ```ignore
+/// 0-----|---------|----|------------| · · · · · |·> lsn
+///   initdb_lsn  branchpoints*  next_gc_cutoff  latest
+/// ```
+///
+/// Until gc_horizon_cutoff > `Timeline::last_record_lsn` for any of the tenant's timelines, the
+/// tenant size will be zero.
+pub(super) async fn gather_inputs(
+    tenant: &Tenant,
+    limit: &Arc<Semaphore>,
+    logical_size_cache: &mut HashMap<(TimelineId, Lsn), u64>,
+) -> anyhow::Result<ModelInputs> {
+    // with joinset, on drop, all of the tasks will just be de-scheduled, which we can use to
+    // our advantage with `?` error handling.
+    let mut joinset = tokio::task::JoinSet::new();
+
+    let timelines = tenant
+        .refresh_gc_info()
+        .context("Failed to refresh gc_info before gathering inputs")?;
+
+    if timelines.is_empty() {
+        // All timelines are below tenant's gc_horizon; alternative would be to use
+        // Tenant::list_timelines but then those gc_info's would not be updated yet, possibly
+        // missing GcInfo::retain_lsns or having obsolete values for cutoff's.
+        return Ok(ModelInputs {
+            updates: vec![],
+            retention_period: 0,
+            timeline_inputs: HashMap::new(),
+        });
+    }
+
+    // record the used/inserted cache keys here, to remove extras not to start leaking
+    // after initial run the cache should be quite stable, but live timelines will eventually
+    // require new lsns to be inspected.
+    let mut needed_cache = HashSet::<(TimelineId, Lsn)>::new();
+
+    let mut updates = Vec::new();
+
+    // record the per timline values used to determine `retention_period`
+    let mut timeline_inputs = HashMap::with_capacity(timelines.len());
+
+    // used to determine the `retention_period` for the size model
+    let mut max_cutoff_distance = None;
+
+    // this will probably conflict with on-demand downloaded layers, or at least force them all
+    // to be downloaded
+    for timeline in timelines {
+        let last_record_lsn = timeline.get_last_record_lsn();
+
+        let (interesting_lsns, horizon_cutoff, pitr_cutoff, next_gc_cutoff) = {
+            // there's a race between the update (holding tenant.gc_lock) and this read but it
+            // might not be an issue, because it's not for Timeline::gc
+            let gc_info = timeline.gc_info.read().unwrap();
+
+            // similar to gc, but Timeline::get_latest_gc_cutoff_lsn() will not be updated before a
+            // new gc run, which we have no control over. however differently from `Timeline::gc`
+            // we don't consider the `Timeline::disk_consistent_lsn` at all, because we are not
+            // actually removing files.
+            let next_gc_cutoff = cmp::min(gc_info.horizon_cutoff, gc_info.pitr_cutoff);
+
+            // the minimum where we should find the next_gc_cutoff for our calculations.
+            //
+            // next_gc_cutoff in parent branch are not of interest (right now at least), nor do we
+            // want to query any logical size before initdb_lsn.
+            let cutoff_minimum = cmp::max(timeline.get_ancestor_lsn(), timeline.initdb_lsn);
+
+            let maybe_cutoff = if next_gc_cutoff > cutoff_minimum {
+                Some((next_gc_cutoff, LsnKind::GcCutOff))
+            } else {
+                None
+            };
+
+            // this assumes there are no other lsns than the branchpoints
+            let lsns = gc_info
+                .retain_lsns
+                .iter()
+                .inspect(|&&lsn| {
+                    trace!(
+                        timeline_id=%timeline.timeline_id,
+                        "retained lsn: {lsn:?}, is_before_ancestor_lsn={}",
+                        lsn < timeline.get_ancestor_lsn()
+                    )
+                })
+                .filter(|&&lsn| lsn > timeline.get_ancestor_lsn())
+                .copied()
+                .map(|lsn| (lsn, LsnKind::BranchPoint))
+                .chain(maybe_cutoff)
+                .collect::<Vec<_>>();
+
+            (
+                lsns,
+                gc_info.horizon_cutoff,
+                gc_info.pitr_cutoff,
+                next_gc_cutoff,
+            )
+        };
+
+        // update this to have a retention_period later for the tenant_size_model
+        // tenant_size_model compares this to the last segments start_lsn
+        if let Some(cutoff_distance) = last_record_lsn.checked_sub(next_gc_cutoff) {
+            match max_cutoff_distance.as_mut() {
+                Some(max) => {
+                    *max = std::cmp::max(*max, cutoff_distance);
+                }
+                _ => {
+                    max_cutoff_distance = Some(cutoff_distance);
+                }
+            }
+        }
+
+        // all timelines branch from something, because it might be impossible to pinpoint
+        // which is the tenant_size_model's "default" branch.
+        updates.push(Update {
+            lsn: timeline.get_ancestor_lsn(),
+            command: Command::BranchFrom(timeline.get_ancestor_timeline_id()),
+            timeline_id: timeline.timeline_id,
+        });
+
+        for (lsn, _kind) in &interesting_lsns {
+            if let Some(size) = logical_size_cache.get(&(timeline.timeline_id, *lsn)) {
+                updates.push(Update {
+                    lsn: *lsn,
+                    timeline_id: timeline.timeline_id,
+                    command: Command::Update(*size),
+                });
+
+                needed_cache.insert((timeline.timeline_id, *lsn));
+            } else {
+                let timeline = Arc::clone(&timeline);
+                let parallel_size_calcs = Arc::clone(limit);
+                joinset.spawn(calculate_logical_size(parallel_size_calcs, timeline, *lsn));
+            }
+        }
+
+        timeline_inputs.insert(
+            timeline.timeline_id,
+            TimelineInputs {
+                last_record: last_record_lsn,
+                // this is not used above, because it might not have updated recently enough
+                latest_gc_cutoff: *timeline.get_latest_gc_cutoff_lsn(),
+                horizon_cutoff,
+                pitr_cutoff,
+                next_gc_cutoff,
+            },
+        );
+    }
+
+    let mut have_any_error = false;
+
+    while let Some(res) = joinset.join_next().await {
+        // each of these come with Result<Result<_, JoinError>, JoinError>
+        // because of spawn + spawn_blocking
+        let res = res.and_then(|inner| inner);
+        match res {
+            Ok(TimelineAtLsnSizeResult(timeline, lsn, Ok(size))) => {
+                debug!(timeline_id=%timeline.timeline_id, %lsn, size, "size calculated");
+
+                logical_size_cache.insert((timeline.timeline_id, lsn), size);
+                needed_cache.insert((timeline.timeline_id, lsn));
+
+                updates.push(Update {
+                    lsn,
+                    timeline_id: timeline.timeline_id,
+                    command: Command::Update(size),
+                });
+            }
+            Ok(TimelineAtLsnSizeResult(timeline, lsn, Err(error))) => {
+                warn!(
+                    timeline_id=%timeline.timeline_id,
+                    "failed to calculate logical size at {lsn}: {error:#}"
+                );
+                have_any_error = true;
+            }
+            Err(join_error) if join_error.is_cancelled() => {
+                unreachable!("we are not cancelling any of the futures, nor should be");
+            }
+            Err(join_error) => {
+                // cannot really do anything, as this panic is likely a bug
+                error!("logical size query panicked: {join_error:#}");
+                have_any_error = true;
+            }
+        }
+    }
+
+    // prune any keys not needed anymore; we record every used key and added key.
+    logical_size_cache.retain(|key, _| needed_cache.contains(key));
+
+    if have_any_error {
+        // we cannot complete this round, because we are missing data.
+        // we have however cached all we were able to request calculation on.
+        anyhow::bail!("failed to calculate some logical_sizes");
+    }
+
+    // the data gathered to updates is per lsn, regardless of the branch, so we can use it to
+    // our advantage, not requiring a sorted container or graph walk.
+    //
+    // for branch points, which come as multiple updates at the same LSN, the Command::Update
+    // is needed before a branch is made out of that branch Command::BranchFrom. this is
+    // handled by the variant order in `Command`.
+    updates.sort_unstable();
+
+    let retention_period = match max_cutoff_distance {
+        Some(max) => max.0,
+        None => {
+            anyhow::bail!("the first branch should have a gc_cutoff after it's branch point at 0")
+        }
+    };
+
+    Ok(ModelInputs {
+        updates,
+        retention_period,
+        timeline_inputs,
+    })
+}
+
+impl ModelInputs {
+    pub fn calculate(&self) -> anyhow::Result<u64> {
+        // Option<TimelineId> is used for "naming" the branches because it is assumed to be
+        // impossible to always determine the a one main branch.
+        let mut storage = tenant_size_model::Storage::<Option<TimelineId>>::new(None);
+
+        // tracking these not to require modifying the current implementation of the size model,
+        // which works in relative LSNs and sizes.
+        let mut last_state: HashMap<TimelineId, (Lsn, u64)> = HashMap::new();
+
+        for update in &self.updates {
+            let Update {
+                lsn,
+                command: op,
+                timeline_id,
+            } = update;
+            match op {
+                Command::Update(sz) => {
+                    let latest = last_state.get_mut(timeline_id).ok_or_else(|| {
+                        anyhow::anyhow!(
+                        "ordering-mismatch: there must had been a previous state for {timeline_id}"
+                    )
+                    })?;
+
+                    let lsn_bytes = {
+                        let Lsn(now) = lsn;
+                        let Lsn(prev) = latest.0;
+                        debug_assert!(prev <= *now, "self.updates should had been sorted");
+                        now - prev
+                    };
+
+                    let size_diff =
+                        i64::try_from(*sz as i128 - latest.1 as i128).with_context(|| {
+                            format!("size difference i64 overflow for {timeline_id}")
+                        })?;
+
+                    storage.modify_branch(&Some(*timeline_id), "".into(), lsn_bytes, size_diff);
+                    *latest = (*lsn, *sz);
+                }
+                Command::BranchFrom(parent) => {
+                    storage.branch(parent, Some(*timeline_id));
+
+                    let size = parent
+                        .as_ref()
+                        .and_then(|id| last_state.get(id))
+                        .map(|x| x.1)
+                        .unwrap_or(0);
+                    last_state.insert(*timeline_id, (*lsn, size));
+                }
+            }
+        }
+
+        Ok(storage.calculate(self.retention_period).total_children())
+    }
+}
+
+/// Single size model update.
+///
+/// Sizing model works with relative increments over latest branch state.
+/// Updates are absolute, so additional state needs to be tracked when applying.
+#[serde_with::serde_as]
+#[derive(
+    Debug, PartialEq, PartialOrd, Eq, Ord, Clone, Copy, serde::Serialize, serde::Deserialize,
+)]
+struct Update {
+    #[serde_as(as = "serde_with::DisplayFromStr")]
+    lsn: utils::lsn::Lsn,
+    command: Command,
+    #[serde_as(as = "serde_with::DisplayFromStr")]
+    timeline_id: TimelineId,
+}
+
+#[serde_with::serde_as]
+#[derive(PartialOrd, PartialEq, Eq, Ord, Clone, Copy, serde::Serialize, serde::Deserialize)]
+#[serde(rename_all = "snake_case")]
+enum Command {
+    Update(u64),
+    BranchFrom(#[serde_as(as = "Option<serde_with::DisplayFromStr>")] Option<TimelineId>),
+}
+
+impl std::fmt::Debug for Command {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        // custom one-line implementation makes it more enjoyable to read {:#?} avoiding 3
+        // linebreaks
+        match self {
+            Self::Update(arg0) => write!(f, "Update({arg0})"),
+            Self::BranchFrom(arg0) => write!(f, "BranchFrom({arg0:?})"),
+        }
+    }
+}
+
+#[derive(Debug, Clone, Copy)]
+enum LsnKind {
+    BranchPoint,
+    GcCutOff,
+}
+
+/// Newtype around the tuple that carries the timeline at lsn logical size calculation.
+struct TimelineAtLsnSizeResult(
+    Arc<crate::tenant::Timeline>,
+    utils::lsn::Lsn,
+    anyhow::Result<u64>,
+);
+
+#[instrument(skip_all, fields(timeline_id=%timeline.timeline_id, lsn=%lsn))]
+async fn calculate_logical_size(
+    limit: Arc<tokio::sync::Semaphore>,
+    timeline: Arc<crate::tenant::Timeline>,
+    lsn: utils::lsn::Lsn,
+) -> Result<TimelineAtLsnSizeResult, tokio::task::JoinError> {
+    let permit = tokio::sync::Semaphore::acquire_owned(limit)
+        .await
+        .expect("global semaphore should not had been closed");
+
+    tokio::task::spawn_blocking(move || {
+        let _permit = permit;
+        let size_res = timeline.calculate_logical_size(lsn);
+        TimelineAtLsnSizeResult(timeline, lsn, size_res)
+    })
+    .await
+}
+
+#[test]
+fn updates_sort() {
+    use std::str::FromStr;
+    use utils::id::TimelineId;
+    use utils::lsn::Lsn;
+
+    let ids = [
+        TimelineId::from_str("7ff1edab8182025f15ae33482edb590a").unwrap(),
+        TimelineId::from_str("b1719e044db05401a05a2ed588a3ad3f").unwrap(),
+        TimelineId::from_str("b68d6691c895ad0a70809470020929ef").unwrap(),
+    ];
+
+    // try through all permutations
+    let ids = [
+        [&ids[0], &ids[1], &ids[2]],
+        [&ids[0], &ids[2], &ids[1]],
+        [&ids[1], &ids[0], &ids[2]],
+        [&ids[1], &ids[2], &ids[0]],
+        [&ids[2], &ids[0], &ids[1]],
+        [&ids[2], &ids[1], &ids[0]],
+    ];
+
+    for ids in ids {
+        // apply a fixture which uses a permutation of ids
+        let commands = [
+            Update {
+                lsn: Lsn(0),
+                command: Command::BranchFrom(None),
+                timeline_id: *ids[0],
+            },
+            Update {
+                lsn: Lsn::from_str("0/67E7618").unwrap(),
+                command: Command::Update(43696128),
+                timeline_id: *ids[0],
+            },
+            Update {
+                lsn: Lsn::from_str("0/67E7618").unwrap(),
+                command: Command::BranchFrom(Some(*ids[0])),
+                timeline_id: *ids[1],
+            },
+            Update {
+                lsn: Lsn::from_str("0/76BE4F0").unwrap(),
+                command: Command::Update(41844736),
+                timeline_id: *ids[1],
+            },
+            Update {
+                lsn: Lsn::from_str("0/10E49380").unwrap(),
+                command: Command::Update(42164224),
+                timeline_id: *ids[0],
+            },
+            Update {
+                lsn: Lsn::from_str("0/10E49380").unwrap(),
+                command: Command::BranchFrom(Some(*ids[0])),
+                timeline_id: *ids[2],
+            },
+            Update {
+                lsn: Lsn::from_str("0/11D74910").unwrap(),
+                command: Command::Update(42172416),
+                timeline_id: *ids[2],
+            },
+            Update {
+                lsn: Lsn::from_str("0/12051E98").unwrap(),
+                command: Command::Update(42196992),
+                timeline_id: *ids[0],
+            },
+        ];
+
+        let mut sorted = commands;
+
+        // these must sort in the same order, regardless of how the ids sort
+        // which is why the timeline_id is the last field
+        sorted.sort_unstable();
+
+        assert_eq!(commands, sorted, "{:#?} vs. {:#?}", commands, sorted);
+    }
+}
+
+#[test]
+fn verify_size_for_multiple_branches() {
+    // this is generated from integration test test_tenant_size_with_multiple_branches, but this way
+    // it has the stable lsn's
+    let doc = r#"{"updates":[{"lsn":"0/0","command":{"branch_from":null},"timeline_id":"cd9d9409c216e64bf580904facedb01b"},{"lsn":"0/176FA40","command":{"update":25763840},"timeline_id":"cd9d9409c216e64bf580904facedb01b"},{"lsn":"0/176FA40","command":{"branch_from":"cd9d9409c216e64bf580904facedb01b"},"timeline_id":"10b532a550540bc15385eac4edde416a"},{"lsn":"0/1819818","command":{"update":26075136},"timeline_id":"10b532a550540bc15385eac4edde416a"},{"lsn":"0/18B5E40","command":{"update":26427392},"timeline_id":"cd9d9409c216e64bf580904facedb01b"},{"lsn":"0/18D3DF0","command":{"update":26492928},"timeline_id":"cd9d9409c216e64bf580904facedb01b"},{"lsn":"0/18D3DF0","command":{"branch_from":"cd9d9409c216e64bf580904facedb01b"},"timeline_id":"230fc9d756f7363574c0d66533564dcc"},{"lsn":"0/220F438","command":{"update":25239552},"timeline_id":"230fc9d756f7363574c0d66533564dcc"}],"retention_period":131072,"timeline_inputs":{"cd9d9409c216e64bf580904facedb01b":{"last_record":"0/18D5E40","latest_gc_cutoff":"0/169ACF0","horizon_cutoff":"0/18B5E40","pitr_cutoff":"0/18B5E40","next_gc_cutoff":"0/18B5E40"},"10b532a550540bc15385eac4edde416a":{"last_record":"0/1839818","latest_gc_cutoff":"0/169ACF0","horizon_cutoff":"0/1819818","pitr_cutoff":"0/1819818","next_gc_cutoff":"0/1819818"},"230fc9d756f7363574c0d66533564dcc":{"last_record":"0/222F438","latest_gc_cutoff":"0/169ACF0","horizon_cutoff":"0/220F438","pitr_cutoff":"0/220F438","next_gc_cutoff":"0/220F438"}}}"#;
+
+    let inputs: ModelInputs = serde_json::from_str(doc).unwrap();
+
+    assert_eq!(inputs.calculate().unwrap(), 36_409_872);
+}
--- a/pageserver/src/tenant/timeline.rs
+++ b/pageserver/src/tenant/timeline.rs
@@ -16,7 +16,7 @@ use std::fs;
 use std::ops::{Deref, Range};
 use std::path::PathBuf;
 use std::sync::atomic::{self, AtomicBool, AtomicI64, Ordering as AtomicOrdering};
-use std::sync::{Arc, Mutex, MutexGuard, RwLock, TryLockError};
+use std::sync::{Arc, Mutex, MutexGuard, RwLock};
 use std::time::{Duration, Instant, SystemTime};

 use crate::tenant::{
@@ -34,7 +34,6 @@ use crate::tenant::{
 use crate::config::{PageServerConf, METADATA_FILE_NAME};
 use crate::keyspace::{KeyPartitioning, KeySpace};
 use crate::metrics::TimelineMetrics;
-use crate::page_image_cache;
 use crate::pgdatadir_mapping::BlockNumber;
 use crate::pgdatadir_mapping::LsnForTimestamp;
 use crate::pgdatadir_mapping::{is_rel_fsm_block_key, is_rel_vm_block_key};
@@ -122,8 +121,16 @@ pub struct Timeline {
    /// to avoid deadlock.
    write_lock: Mutex<()>,

-    /// Used to ensure that there is only task performing flushing at a time
-    layer_flush_lock: Mutex<()>,
+    /// Used to avoid multiple `flush_loop` tasks running
+    flush_loop_started: Mutex<bool>,
+
+    /// layer_flush_start_tx can be used to wake up the layer-flushing task.
+    /// The value is a counter, incremented every time a new flush cycle is requested.
+    /// The flush cycle counter is sent back on the layer_flush_done channel when
+    /// the flush finishes. You can use that to wait for the flush to finish.
+    layer_flush_start_tx: tokio::sync::watch::Sender<u64>,
+    /// to be notified when layer flushing has finished, subscribe to the layer_flush_done channel
+    layer_flush_done_tx: tokio::sync::watch::Sender<(u64, anyhow::Result<()>)>,

    /// Layer removal lock.
    /// A lock to ensure that no layer of the timeline is removed concurrently by other tasks.
@@ -273,6 +280,11 @@ impl LogicalSize {
        self.size_added_after_initial
            .fetch_add(delta, AtomicOrdering::SeqCst);
    }
+
+    /// Returns the initialized (already calculated) value, if any.
+    fn initialized_size(&self) -> Option<u64> {
+        self.initial_logical_size.get().copied()
+    }
 }

 pub struct WalReceiverInfo {
@@ -462,15 +474,16 @@ impl Timeline {
    ///
    /// NOTE: This has nothing to do with checkpoint in PostgreSQL. We don't
    /// know anything about them here in the repository.
-    pub fn checkpoint(&self, cconf: CheckpointConfig) -> anyhow::Result<()> {
+    #[instrument(skip(self), fields(tenant_id=%self.tenant_id, timeline_id=%self.timeline_id))]
+    pub async fn checkpoint(&self, cconf: CheckpointConfig) -> anyhow::Result<()> {
        match cconf {
            CheckpointConfig::Flush => {
                self.freeze_inmem_layer(false);
-                self.flush_frozen_layers(true)
+                self.flush_frozen_layers_and_wait().await
            }
            CheckpointConfig::Forced => {
                self.freeze_inmem_layer(false);
-                self.flush_frozen_layers(true)?;
+                self.flush_frozen_layers_and_wait().await?;
                self.compact()
            }
        }
@@ -620,24 +633,8 @@ impl Timeline {
                self.last_freeze_at.store(last_lsn);
                *(self.last_freeze_ts.write().unwrap()) = Instant::now();

-                // Launch a task to flush the frozen layer to disk, unless
-                // a task was already running. (If the task was running
-                // at the time that we froze the layer, it must've seen the
-                // the layer we just froze before it exited; see comments
-                // in flush_frozen_layers())
-                if let Ok(guard) = self.layer_flush_lock.try_lock() {
-                    drop(guard);
-                    let self_clone = Arc::clone(self);
-                    task_mgr::spawn(
-                        task_mgr::BACKGROUND_RUNTIME.handle(),
-                        task_mgr::TaskKind::LayerFlushTask,
-                        Some(self.tenant_id),
-                        Some(self.timeline_id),
-                        "layer flush task",
-                        false,
-                        async move { self_clone.flush_frozen_layers(false) },
-                    );
-                }
+                // Wake up the layer flusher
+                self.flush_frozen_layers();
            }
        }
        Ok(())
@@ -728,6 +725,9 @@ impl Timeline {
        let disk_consistent_lsn = metadata.disk_consistent_lsn();
        let (state, _) = watch::channel(TimelineState::Suspended);

+        let (layer_flush_start_tx, _) = tokio::sync::watch::channel(0);
+        let (layer_flush_done_tx, _) = tokio::sync::watch::channel((0, Ok(())));
+
        let mut result = Timeline {
            conf,
            tenant_conf,
@@ -755,8 +755,12 @@ impl Timeline {

            upload_layers: AtomicBool::new(upload_layers),

+            flush_loop_started: Mutex::new(false),
+
+            layer_flush_start_tx,
+            layer_flush_done_tx,
+
            write_lock: Mutex::new(()),
-            layer_flush_lock: Mutex::new(()),
            layer_removal_cs: Mutex::new(()),

            gc_info: RwLock::new(GcInfo {
@@ -789,6 +793,33 @@ impl Timeline {
        result
    }

+    pub(super) fn maybe_spawn_flush_loop(self: &Arc<Self>) {
+        let mut flush_loop_started = self.flush_loop_started.lock().unwrap();
+        if *flush_loop_started {
+            info!(
+                "skipping attempt to start flush_loop twice {}/{}",
+                self.tenant_id, self.timeline_id
+            );
+            return;
+        }
+
+        let layer_flush_start_rx = self.layer_flush_start_tx.subscribe();
+        let self_clone = Arc::clone(self);
+        info!("spawning flush loop");
+        task_mgr::spawn(
+                    task_mgr::BACKGROUND_RUNTIME.handle(),
+                    task_mgr::TaskKind::LayerFlushTask,
+                    Some(self.tenant_id),
+                    Some(self.timeline_id),
+                    "layer flush task",
+                    false,
+                    async move { self_clone.flush_loop(layer_flush_start_rx).await; Ok(()) }
+                    .instrument(info_span!(parent: None, "layer flush task", tenant = %self.tenant_id, timeline = %self.timeline_id))
+                );
+
+        *flush_loop_started = true;
+    }
+
    pub(super) fn launch_wal_receiver(self: &Arc<Self>) {
        if !is_etcd_client_initialized() {
            if cfg!(test) {
@@ -980,9 +1011,26 @@ impl Timeline {
    /// Calculate the logical size of the database at the latest LSN.
    ///
    /// NOTE: counted incrementally, includes ancestors, this can be a slow operation.
-    fn calculate_logical_size(&self, up_to_lsn: Lsn) -> anyhow::Result<u64> {
-        info!("Calculating logical size for timeline {}", self.timeline_id);
-        let timer = self.metrics.init_logical_size_histo.start_timer();
+    pub fn calculate_logical_size(&self, up_to_lsn: Lsn) -> anyhow::Result<u64> {
+        info!(
+            "Calculating logical size for timeline {} at {}",
+            self.timeline_id, up_to_lsn
+        );
+        let timer = if up_to_lsn == self.initdb_lsn {
+            if let Some(size) = self.current_logical_size.initialized_size() {
+                if size != 0 {
+                    // non-zero size means that the size has already been calculated by this method
+                    // after startup. if the logical size is for a new timeline without layers the
+                    // size will be zero, and we cannot use that, or this caching strategy until
+                    // pageserver restart.
+                    return Ok(size);
+                }
+            }
+
+            self.metrics.init_logical_size_histo.start_timer()
+        } else {
+            self.metrics.logical_size_histo.start_timer()
+        };
        let logical_size = self.get_current_logical_size_non_incremental(up_to_lsn)?;
        debug!("calculated logical size: {logical_size}");
        timer.stop_and_record();
@@ -1268,53 +1316,94 @@ impl Timeline {
        drop(layers);
    }

-    /// Flush all frozen layers to disk.
-    ///
-    /// Only one task at a time can be doing layer-flushing for a
-    /// given timeline. If 'wait' is true, and another task is
-    /// currently doing the flushing, this function will wait for it
-    /// to finish. If 'wait' is false, this function will return
-    /// immediately instead.
-    fn flush_frozen_layers(&self, wait: bool) -> anyhow::Result<()> {
-        let flush_lock_guard = if wait {
-            self.layer_flush_lock.lock().unwrap()
-        } else {
-            match self.layer_flush_lock.try_lock() {
-                Ok(guard) => guard,
-                Err(TryLockError::WouldBlock) => return Ok(()),
-                Err(TryLockError::Poisoned(err)) => panic!("{:?}", err),
-            }
-        };
-
-        let timer = self.metrics.flush_time_histo.start_timer();
-
+    /// Layer flusher task's main loop.
+    async fn flush_loop(&self, mut layer_flush_start_rx: tokio::sync::watch::Receiver<u64>) {
+        info!("started flush loop");
        loop {
-            let layers = self.layers.read().unwrap();
-            if let Some(frozen_layer) = layers.frozen_layers.front() {
-                let frozen_layer = Arc::clone(frozen_layer);
-                drop(layers); // to allow concurrent reads and writes
-                self.flush_frozen_layer(frozen_layer)?;
-            } else {
-                // Drop the 'layer_flush_lock' *before* 'layers'. That
-                // way, if you freeze a layer, and then call
-                // flush_frozen_layers(false), it is guaranteed that
-                // if another thread was busy flushing layers and the
-                // call therefore returns immediately, the other
-                // thread will have seen the newly-frozen layer and
-                // will flush that too (assuming no errors).
-                drop(flush_lock_guard);
-                drop(layers);
-                break;
+            tokio::select! {
+                _ = task_mgr::shutdown_watcher() => {
+                    info!("shutting down layer flush task");
+                    break;
+                },
+                _ = layer_flush_start_rx.changed() => {}
            }
+
+            trace!("waking up");
+            let timer = self.metrics.flush_time_histo.start_timer();
+            let flush_counter = *layer_flush_start_rx.borrow();
+            let result = loop {
+                let layer_to_flush = {
+                    let layers = self.layers.read().unwrap();
+                    layers.frozen_layers.front().cloned()
+                    // drop 'layers' lock to allow concurrent reads and writes
+                };
+                if let Some(layer_to_flush) = layer_to_flush {
+                    if let Err(err) = self.flush_frozen_layer(layer_to_flush).await {
+                        error!("could not flush frozen layer: {err:?}");
+                        break Err(err);
+                    }
+                    continue;
+                } else {
+                    break Ok(());
+                }
+            };
+            // Notify any listeners that we're done
+            let _ = self
+                .layer_flush_done_tx
+                .send_replace((flush_counter, result));
+
+            timer.stop_and_record();
+        }
+    }
+
+    async fn flush_frozen_layers_and_wait(&self) -> anyhow::Result<()> {
+        let mut rx = self.layer_flush_done_tx.subscribe();
+
+        // Increment the flush cycle counter and wake up the flush task.
+        // Remember the new value, so that when we listen for the flush
+        // to finish, we know when the flush that we initiated has
+        // finished, instead of some other flush that was started earlier.
+        let mut my_flush_request = 0;
+
+        if !&*self.flush_loop_started.lock().unwrap() {
+            anyhow::bail!("cannot flush frozen layers when flush_loop is not running")
        }

-        timer.stop_and_record();
+        self.layer_flush_start_tx.send_modify(|counter| {
+            my_flush_request = *counter + 1;
+            *counter = my_flush_request;
+        });

-        Ok(())
+        loop {
+            {
+                let (last_result_counter, last_result) = &*rx.borrow();
+                if *last_result_counter >= my_flush_request {
+                    if let Err(_err) = last_result {
+                        // We already logged the original error in
+                        // flush_loop. We cannot propagate it to the caller
+                        // here, because it might not be Cloneable
+                        anyhow::bail!(
+                            "Could not flush frozen layer. Request id: {}",
+                            my_flush_request
+                        );
+                    } else {
+                        return Ok(());
+                    }
+                }
+            }
+            trace!("waiting for flush to complete");
+            rx.changed().await?;
+            trace!("done")
+        }
+    }
+
+    fn flush_frozen_layers(&self) {
+        self.layer_flush_start_tx.send_modify(|val| *val += 1);
    }

    /// Flush one frozen in-memory layer to disk, as a new delta layer.
-    fn flush_frozen_layer(&self, frozen_layer: Arc<InMemoryLayer>) -> anyhow::Result<()> {
+    #[instrument(skip(self, frozen_layer), fields(tenant_id=%self.tenant_id, timeline_id=%self.timeline_id, layer=%frozen_layer.filename().display()))]
+    async fn flush_frozen_layer(&self, frozen_layer: Arc<InMemoryLayer>) -> anyhow::Result<()> {
        // As a special case, when we have just imported an image into the repository,
        // instead of writing out a L0 delta layer, we directly write out image layer
        // files instead. This is possible as long as *all* the data imported into the
@@ -1542,6 +1631,10 @@ impl Timeline {
                    lsn,
                )?;

+                fail_point!("image-layer-writer-fail-before-finish", |_| {
+                    anyhow::bail!("failpoint image-layer-writer-fail-before-finish");
+                });
+
                for range in &partition.ranges {
                    let mut key = range.start;
                    while key < range.end {
@@ -1836,6 +1929,11 @@ impl Timeline {
                    },
                )?);
            }
+
+            fail_point!("delta-layer-writer-fail-before-finish", |_| {
+                anyhow::bail!("failpoint delta-layer-writer-fail-before-finish");
+            });
+
            writer.as_mut().unwrap().put_value(key, lsn, value)?;
            prev_key = Some(key);
        }
@@ -2235,13 +2333,10 @@ impl Timeline {

                let last_rec_lsn = data.records.last().unwrap().0;

-                let img = self.walredo_mgr.request_redo(
-                    key,
-                    request_lsn,
-                    base_img,
-                    data.records,
-                    self.pg_version,
-                )?;
+                let img = self
+                    .walredo_mgr
+                    .request_redo(key, request_lsn, base_img, data.records, self.pg_version)
+                    .context("Failed to reconstruct a page image:")?;

                if img.len() == page_cache::PAGE_SZ {
                    let cache = page_cache::get();
@@ -2316,7 +2411,6 @@ impl<'a> TimelineWriter<'a> {
    /// This will implicitly extend the relation, if the page is beyond the
    /// current end-of-file.
    pub fn put(&self, key: Key, lsn: Lsn, value: &Value) -> anyhow::Result<()> {
-        page_image_cache::remove(key, self.tenant_id, self.timeline_id);
        self.tl.put_value(key, lsn, value)
    }

--- a/pageserver/src/tenant_mgr.rs
+++ b/pageserver/src/tenant_mgr.rs
@@ -241,7 +241,7 @@ pub async fn shutdown_all_tenants() {
        let tenant_id = tenant.tenant_id();
        debug!("shutdown tenant {tenant_id}");

-        if let Err(err) = tenant.checkpoint() {
+        if let Err(err) = tenant.checkpoint().await {
            error!("Could not checkpoint tenant {tenant_id} during shutdown: {err:?}");
        }
    }
--- a/pageserver/src/tenant_tasks.rs
+++ b/pageserver/src/tenant_tasks.rs
@@ -119,7 +119,7 @@ async fn gc_loop(tenant_id: TenantId) {
            let gc_horizon = tenant.get_gc_horizon();
            let mut sleep_duration = gc_period;
            if gc_horizon > 0 {
-                if let Err(e) = tenant.gc_iteration(None, gc_horizon, tenant.get_pitr_interval(), false)
+                if let Err(e) = tenant.gc_iteration(None, gc_horizon, tenant.get_pitr_interval(), false).await
                {
                    sleep_duration = wait_duration;
                    error!("Gc failed, retrying in {:?}: {e:#}", sleep_duration);
--- a/pageserver/src/virtual_file.rs
+++ b/pageserver/src/virtual_file.rs
@@ -319,6 +319,12 @@ impl VirtualFile {

        Ok(result)
    }
+
+    pub fn remove(self) {
+        let path = self.path.clone();
+        drop(self);
+        std::fs::remove_file(path).expect("failed to remove the virtual file");
+    }
 }

 impl Drop for VirtualFile {
--- a/pageserver/src/walreceiver.rs
+++ b/pageserver/src/walreceiver.rs
@@ -155,22 +155,19 @@ impl<E: Clone> TaskHandle<E> {

    /// Aborts current task, waiting for it to finish.
    pub async fn shutdown(self) {
-        match self.join_handle {
-            Some(jh) => {
-                self.cancellation.send(()).ok();
-                match jh.await {
-                    Ok(Ok(())) => debug!("Shutdown success"),
-                    Ok(Err(e)) => error!("Shutdown task error: {e:?}"),
-                    Err(join_error) => {
-                        if join_error.is_cancelled() {
-                            error!("Shutdown task was cancelled");
-                        } else {
-                            error!("Shutdown task join error: {join_error}")
-                        }
+        if let Some(jh) = self.join_handle {
+            self.cancellation.send(()).ok();
+            match jh.await {
+                Ok(Ok(())) => debug!("Shutdown success"),
+                Ok(Err(e)) => error!("Shutdown task error: {e:?}"),
+                Err(join_error) => {
+                    if join_error.is_cancelled() {
+                        error!("Shutdown task was cancelled");
+                    } else {
+                        error!("Shutdown task join error: {join_error}")
                    }
                }
            }
-            None => {}
        }
    }
 }
--- a/pageserver/src/walreceiver/connection_manager.rs
+++ b/pageserver/src/walreceiver/connection_manager.rs
@@ -93,7 +93,7 @@ pub fn spawn_connection_manager_task(
            }
        }
        .instrument(
-            info_span!("wal_connection_manager", tenant = %tenant_id, timeline = %timeline_id),
+            info_span!(parent: None, "wal_connection_manager", tenant = %tenant_id, timeline = %timeline_id),
        ),
    );
 }
@@ -836,15 +836,20 @@ fn wal_stream_connection_string(
    listen_pg_addr_str: &str,
 ) -> anyhow::Result<String> {
    let sk_connstr = format!("postgresql://no_user@{listen_pg_addr_str}/no_db");
-    let me_conf = sk_connstr
-        .parse::<postgres::config::Config>()
-        .with_context(|| {
-            format!("Failed to parse pageserver connection string '{sk_connstr}' as a postgres one")
-        })?;
-    let (host, port) = utils::connstring::connection_host_port(&me_conf);
-    Ok(format!(
-        "host={host} port={port} options='-c timeline_id={timeline_id} tenant_id={tenant_id}'"
-    ))
+    sk_connstr
+        .parse()
+        .context("bad url")
+        .and_then(|url: url::Url| {
+            let host = url.host_str().context("host is missing")?;
+            let port = url.port().unwrap_or(5432); // default PG port
+
+            Ok(format!(
+                "host={host} \
+                 port={port} \
+                 options='-c timeline_id={timeline_id} tenant_id={tenant_id}'"
+            ))
+        })
+        .with_context(|| format!("Failed to parse pageserver connection URL '{sk_connstr}'"))
 }

 #[cfg(test)]
@@ -892,7 +897,7 @@ mod tests {
                        peer_horizon_lsn: None,
                        local_start_lsn: None,

-                        safekeeper_connstr: Some("no commit_lsn".to_string()),
+                        safekeeper_connstr: Some("no_commit_lsn".to_string()),
                    },
                    etcd_version: 0,
                    latest_update: now,
@@ -909,7 +914,7 @@ mod tests {
                        remote_consistent_lsn: None,
                        peer_horizon_lsn: None,
                        local_start_lsn: None,
-                        safekeeper_connstr: Some("no commit_lsn".to_string()),
+                        safekeeper_connstr: Some("no_commit_lsn".to_string()),
                    },
                    etcd_version: 0,
                    latest_update: now,
@@ -1005,7 +1010,7 @@ mod tests {
                        peer_horizon_lsn: None,
                        local_start_lsn: None,

-                        safekeeper_connstr: Some("not advanced Lsn".to_string()),
+                        safekeeper_connstr: Some("not_advanced_lsn".to_string()),
                    },
                    etcd_version: 0,
                    latest_update: now,
@@ -1023,7 +1028,7 @@ mod tests {
                        peer_horizon_lsn: None,
                        local_start_lsn: None,

-                        safekeeper_connstr: Some("not enough advanced Lsn".to_string()),
+                        safekeeper_connstr: Some("not_enough_advanced_lsn".to_string()),
                    },
                    etcd_version: 0,
                    latest_update: now,
@@ -1093,7 +1098,7 @@ mod tests {
                        peer_horizon_lsn: None,
                        local_start_lsn: None,

-                        safekeeper_connstr: Some("smaller commit_lsn".to_string()),
+                        safekeeper_connstr: Some("smaller_commit_lsn".to_string()),
                    },
                    etcd_version: 0,
                    latest_update: now,
@@ -1283,7 +1288,7 @@ mod tests {
                        peer_horizon_lsn: None,
                        local_start_lsn: None,

-                        safekeeper_connstr: Some("advanced by Lsn safekeeper".to_string()),
+                        safekeeper_connstr: Some("advanced_by_lsn_safekeeper".to_string()),
                    },
                    etcd_version: 0,
                    latest_update: now,
@@ -1307,7 +1312,7 @@ mod tests {
        );
        assert!(over_threshcurrent_candidate
            .wal_source_connstr
-            .contains("advanced by Lsn safekeeper"));
+            .contains("advanced_by_lsn_safekeeper"));

        Ok(())
    }
--- a/pageserver/src/walreceiver/walreceiver_connection.rs
+++ b/pageserver/src/walreceiver/walreceiver_connection.rs
@@ -31,8 +31,8 @@ use crate::{
    walrecord::DecodedWALRecord,
 };
 use postgres_ffi::waldecoder::WalStreamDecoder;
-use utils::id::TenantTimelineId;
-use utils::{lsn::Lsn, pq_proto::ReplicationFeedback};
+use pq_proto::ReplicationFeedback;
+use utils::{id::TenantTimelineId, lsn::Lsn};

 /// Status of the connection.
 #[derive(Debug, Clone)]
--- a/pageserver/src/walredo.rs
+++ b/pageserver/src/walredo.rs
@@ -10,7 +10,7 @@
 //! process. Then we get the page image back. Communication with the
 //! postgres process happens via stdin/stdout
 //!
-//! See src/backend/tcop/zenith_wal_redo.c for the other side of
+//! See pgxn/neon_walredo/walredoproc.c for the other side of
 //! this communication.
 //!
 //! The Postgres process is assumed to be secure against malicious WAL
@@ -21,6 +21,7 @@
 use byteorder::{ByteOrder, LittleEndian};
 use bytes::{BufMut, Bytes, BytesMut};
 use nix::poll::*;
+use once_cell::sync::Lazy;
 use serde::Serialize;
 use std::fs;
 use std::fs::OpenOptions;
@@ -31,7 +32,8 @@ use std::os::unix::prelude::CommandExt;
 use std::path::PathBuf;
 use std::process::Stdio;
 use std::process::{Child, ChildStderr, ChildStdin, ChildStdout, Command};
-use std::sync::Mutex;
+use std::sync::atomic::{AtomicU64, Ordering};
+use std::sync::{Condvar, Mutex};
 use std::time::Duration;
 use std::time::Instant;
 use tracing::*;
@@ -55,6 +57,9 @@ use postgres_ffi::v14::nonrelfile_utils::{
 };
 use postgres_ffi::BLCKSZ;

+/// Maximum number of WAL redo processes to launch for a single tenant.
+const MAX_PROCESSES: usize = 4;
+
 ///
 /// `RelTag` + block number (`blknum`) gives us a unique id of the page in the cluster.
 ///
@@ -88,18 +93,32 @@ pub trait WalRedoManager: Send + Sync {
    ) -> Result<Bytes, WalRedoError>;
 }

+static WAL_REDO_PROCESS_COUNTER: Lazy<AtomicU64> = Lazy::new(|| { AtomicU64::new(0) });
+
 ///
-/// This is the real implementation that uses a Postgres process to
-/// perform WAL replay. Only one thread can use the process at a time,
-/// that is controlled by the Mutex. In the future, we might want to
-/// launch a pool of processes to allow concurrent replay of multiple
-/// records.
+/// This is the real implementation that uses a special Postgres
+/// process to perform WAL replay. There is a pool of these processes.
 ///
 pub struct PostgresRedoManager {
    tenant_id: TenantId,
    conf: &'static PageServerConf,

-    process: Mutex<Option<PostgresRedoProcess>>,
+    /// Pool of processes.
+    process_list: Mutex<ProcessList>,
+    /// Condition variable that can be used to sleep until a process
+    /// becomes available in the pool.
+    condvar: Condvar,
+}
+
+// A pool of WAL redo processes
+#[derive(Default)]
+struct ProcessList {
+    /// processes that are available for reuse
+    free_processes: Vec<PostgresRedoProcess>,
+
+    /// Total number of processes, including all the processes in
+    /// 'free_processes' list, and any processes that are in use.
+    num_processes: usize,
 }

 /// Can this request be served by neon redo functions
@@ -204,7 +223,32 @@ impl PostgresRedoManager {
        PostgresRedoManager {
            tenant_id,
            conf,
-            process: Mutex::new(None),
+            process_list: Mutex::new(ProcessList::default()),
+            condvar: Condvar::new(),
+        }
+    }
+
+    // Get a handle to a redo process from the pool.
+    fn get_process(&self, pg_version: u32) -> Result<PostgresRedoProcess, WalRedoError> {
+        let mut process_list = self.process_list.lock().unwrap();
+
+        loop {
+            // If there's a free process immediately available, take it.
+            if let Some(process) = process_list.free_processes.pop() {
+                return Ok(process);
+            }
+
+            // All processes are in use. If the pool is at its maximum size
+            // already, wait for a process to become free. Otherwise launch
+            // a new process.
+            if process_list.num_processes >= MAX_PROCESSES {
+                process_list = self.condvar.wait(process_list).unwrap();
+                continue;
+            } else {
+                let process = PostgresRedoProcess::launch(self.conf, &self.tenant_id, pg_version)?;
+                process_list.num_processes += 1;
+                return Ok(process);
+            }
        }
    }

@@ -224,15 +268,9 @@ impl PostgresRedoManager {

        let start_time = Instant::now();

-        let mut process_guard = self.process.lock().unwrap();
-        let lock_time = Instant::now();
+        let mut process = self.get_process(pg_version)?;

-        // launch the WAL redo process on first use
-        if process_guard.is_none() {
-            let p = PostgresRedoProcess::launch(self.conf, &self.tenant_id, pg_version)?;
-            *process_guard = Some(p);
-        }
-        let process = process_guard.as_mut().unwrap();
+        let lock_time = Instant::now();

        WAL_REDO_WAIT_TIME.observe(lock_time.duration_since(start_time).as_secs_f64());

@@ -266,8 +304,9 @@ impl PostgresRedoManager {
            lsn
        );

-        // If something went wrong, don't try to reuse the process. Kill it, and
-        // next request will launch a new one.
+        // If something went wrong, don't try to reuse the
+        // process. Kill it, and next request will launch a new one.
+        // Otherwise return the process to the pool.
        if result.is_err() {
            error!(
                "error applying {} WAL records ({} bytes) to reconstruct page image at LSN {}",
@@ -275,8 +314,14 @@ impl PostgresRedoManager {
                nbytes,
                lsn
            );
-            let process = process_guard.take().unwrap();
            process.kill();
+            let mut process_list = self.process_list.lock().unwrap();
+            process_list.num_processes -= 1;
+            self.condvar.notify_one();
+        } else {
+            let mut process_list = self.process_list.lock().unwrap();
+            process_list.free_processes.push(process);
+            self.condvar.notify_one();
        }
        result
    }
@@ -594,11 +639,10 @@ impl PostgresRedoProcess {
        tenant_id: &TenantId,
        pg_version: u32,
    ) -> Result<PostgresRedoProcess, Error> {
-        // FIXME: We need a dummy Postgres cluster to run the process in. Currently, we
-        // just create one with constant name. That fails if you try to launch more than
-        // one WAL redo manager concurrently.
+        // We need a dummy Postgres cluster to run the process in.
+        let processno = WAL_REDO_PROCESS_COUNTER.fetch_add(1, Ordering::Relaxed);
        let datadir = path_with_suffix_extension(
-            conf.tenant_path(tenant_id).join("wal-redo-datadir"),
+            conf.tenant_path(tenant_id).join(format!("wal-redo-datadir-{}", processno)),
            TEMP_FILE_SUFFIX,
        );

@@ -644,14 +688,12 @@ impl PostgresRedoProcess {
                ),
            ));
        } else {
-            // Limit shared cache for wal-redo-postres
+            // Limit shared cache for wal-redo-postgres
            let mut config = OpenOptions::new()
                .append(true)
                .open(PathBuf::from(&datadir).join("postgresql.conf"))?;
            config.write_all(b"shared_buffers=128kB\n")?;
            config.write_all(b"fsync=off\n")?;
-            config.write_all(b"shared_preload_libraries=neon\n")?;
-            config.write_all(b"neon.wal_redo=on\n")?;
        }

        // Start postgres itself
@@ -664,18 +706,15 @@ impl PostgresRedoProcess {
            .env("LD_LIBRARY_PATH", &pg_lib_dir_path)
            .env("DYLD_LIBRARY_PATH", &pg_lib_dir_path)
            .env("PGDATA", &datadir)
-            // The redo process is not trusted, so it runs in seccomp mode
-            // (see seccomp in zenith_wal_redo.c). We have to make sure it doesn't
-            // inherit any file descriptors from the pageserver that would allow
-            // an attacker to do bad things.
+            // The redo process is not trusted, and runs in seccomp mode that
+            // doesn't allow it to open any files. We have to also make sure it
+            // doesn't inherit any file descriptors from the pageserver, that
+            // would allow an attacker to read any files that happen to be open
+            // in the pageserver.
            //
            // The Rust standard library makes sure to mark any file descriptors with
            // as close-on-exec by default, but that's not enough, since we use
            // libraries that directly call libc open without setting that flag.
-            //
-            // One example is the pidfile of the daemonize library, which doesn't
-            // currently mark file descriptors as close-on-exec. Either way, we
-            // want to be on the safe side and prevent accidental regression.
            .close_fds()
            .spawn()
            .map_err(|e| {
@@ -844,7 +883,7 @@ impl PostgresRedoProcess {
 }

 // Functions for constructing messages to send to the postgres WAL redo
-// process. See vendor/postgres/src/backend/tcop/zenith_wal_redo.c for
+// process. See pgxn/neon_walredo/walredoproc.c for
 // explanation of the protocol.

 fn build_begin_redo_for_block_msg(tag: BufferTag, buf: &mut Vec<u8>) {
--- a/pgxn/neon/Makefile
+++ b/pgxn/neon/Makefile
@@ -4,7 +4,6 @@
 MODULE_big = neon
 OBJS = \
 	$(WIN32RES) \
-	inmem_smgr.o \
 	libpagestore.o \
 	libpqwalproposer.o \
 	pagestore_smgr.o \
--- a/pgxn/neon/libpagestore.c
+++ b/pgxn/neon/libpagestore.c
@@ -42,6 +42,11 @@ PGconn	   *pageserver_conn = NULL;

 char	   *page_server_connstring_raw;

+int			n_unflushed_requests = 0;
+int			flush_every_n_requests = 8;
+
+static void pageserver_flush(void);
+
 static void
 pageserver_connect()
 {
@@ -164,6 +169,8 @@ pageserver_disconnect(void)
 		PQfinish(pageserver_conn);
 		pageserver_conn = NULL;
 		connected = false;
+
+		prefetch_on_ps_disconnect();
 	}
 }

@@ -174,11 +181,7 @@ pageserver_send(NeonRequest * request)

 	/* If the connection was lost for some reason, reconnect */
 	if (connected && PQstatus(pageserver_conn) == CONNECTION_BAD)
-	{
-		PQfinish(pageserver_conn);
-		pageserver_conn = NULL;
-		connected = false;
-	}
+		pageserver_disconnect();

 	if (!connected)
 		pageserver_connect();
@@ -202,6 +205,11 @@ pageserver_send(NeonRequest * request)
 	}
 	pfree(req_buff.data);

+	n_unflushed_requests++;
+
+	if (flush_every_n_requests > 0 && n_unflushed_requests >= flush_every_n_requests)
+		pageserver_flush();
+
 	if (message_level_is_interesting(PageStoreTrace))
 	{
 		char	   *msg = nm_to_string((NeonMessage *) request);
@@ -255,25 +263,21 @@ pageserver_receive(void)
 static void
 pageserver_flush(void)
 {
-	if (PQflush(pageserver_conn))
+	if (!connected)
+	{
+		neon_log(WARNING, "Tried to flush while disconnected");
+	}
+	else if (PQflush(pageserver_conn))
 	{
 		char	   *msg = PQerrorMessage(pageserver_conn);

 		pageserver_disconnect();
 		neon_log(ERROR, "failed to flush page requests: %s", msg);
 	}
-}
-
-static NeonResponse *
-pageserver_call(NeonRequest * request)
-{
-	pageserver_send(request);
-	pageserver_flush();
-	return pageserver_receive();
+	n_unflushed_requests = 0;
 }

 page_server_api api = {
-	.request = pageserver_call,
 	.send = pageserver_send,
 	.flush = pageserver_flush,
 	.receive = pageserver_receive
@@ -419,15 +423,6 @@ pg_init_libpagestore(void)
 							   0,	/* no flags required */
 							   check_neon_id, NULL, NULL);

-	DefineCustomBoolVariable("neon.wal_redo",
-							 "start in wal-redo mode",
-							 NULL,
-							 &wal_redo,
-							 false,
-							 PGC_POSTMASTER,
-							 0,
-							 NULL, NULL, NULL);
-
 	DefineCustomIntVariable("neon.max_cluster_size",
 							"cluster size limit",
 							NULL,
@@ -436,6 +431,14 @@ pg_init_libpagestore(void)
 							PGC_SIGHUP,
 							GUC_UNIT_MB,
 							NULL, NULL, NULL);
+	DefineCustomIntVariable("neon.flush_output_after",
+							"Flush the output buffer after every N unflushed requests",
+							NULL,
+							&flush_every_n_requests,
+							8, -1, INT_MAX,
+							PGC_SIGHUP,
+							0,	/* no flags required */
+							NULL, NULL, NULL);

 	relsize_hash_init();

@@ -452,13 +455,7 @@ pg_init_libpagestore(void)
 	neon_timeline_walproposer = neon_timeline;
 	neon_tenant_walproposer = neon_tenant;

-	if (wal_redo)
-	{
-		neon_log(PageStoreTrace, "set inmem_smgr hook");
-		smgr_hook = smgr_inmem;
-		smgr_init_hook = smgr_init_inmem;
-	}
-	else if (page_server_connstring && page_server_connstring[0])
+	if (page_server_connstring && page_server_connstring[0])
 	{
 		neon_log(PageStoreTrace, "set neon_smgr hook");
 		smgr_hook = smgr_neon;
--- a/pgxn/neon/pagestore_client.h
+++ b/pgxn/neon/pagestore_client.h
@@ -115,6 +115,8 @@ typedef struct
 	char		page[FLEXIBLE_ARRAY_MEMBER];
 }			NeonGetPageResponse;

+#define PS_GETPAGERESPONSE_SIZE (MAXALIGN(offsetof(NeonGetPageResponse, page) + BLCKSZ))
+
 typedef struct
 {
 	NeonMessageTag tag;
@@ -138,15 +140,18 @@ extern char *nm_to_string(NeonMessage * msg);

 typedef struct
 {
-	NeonResponse *(*request) (NeonRequest * request);
 	void		(*send) (NeonRequest * request);
 	NeonResponse *(*receive) (void);
 	void		(*flush) (void);
 }			page_server_api;

+extern void prefetch_on_ps_disconnect(void);
+
 extern page_server_api * page_server;

 extern char *page_server_connstring;
+extern bool seqscan_prefetch_enabled;
+extern int seqscan_prefetch_distance;
 extern char *neon_timeline;
 extern char *neon_tenant;
 extern bool wal_redo;
@@ -155,10 +160,6 @@ extern int32 max_cluster_size;
 extern const f_smgr *smgr_neon(BackendId backend, RelFileNode rnode);
 extern void smgr_init_neon(void);

-extern const f_smgr *smgr_inmem(BackendId backend, RelFileNode rnode);
-extern void smgr_init_inmem(void);
-extern void smgr_shutdown_inmem(void);
-
 /* Neon storage manager functionality */

 extern void neon_init(void);
@@ -171,7 +172,6 @@ extern void neon_extend(SMgrRelation reln, ForkNumber forknum,
 						BlockNumber blocknum, char *buffer, bool skipFsync);
 extern bool neon_prefetch(SMgrRelation reln, ForkNumber forknum,
 						  BlockNumber blocknum);
-extern void neon_reset_prefetch(SMgrRelation reln);
 extern void neon_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 					  char *buffer);

@@ -188,29 +188,6 @@ extern void neon_truncate(SMgrRelation reln, ForkNumber forknum,
 						  BlockNumber nblocks);
 extern void neon_immedsync(SMgrRelation reln, ForkNumber forknum);

-/* neon wal-redo storage manager functionality */
-
-extern void inmem_init(void);
-extern void inmem_open(SMgrRelation reln);
-extern void inmem_close(SMgrRelation reln, ForkNumber forknum);
-extern void inmem_create(SMgrRelation reln, ForkNumber forknum, bool isRedo);
-extern bool inmem_exists(SMgrRelation reln, ForkNumber forknum);
-extern void inmem_unlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
-extern void inmem_extend(SMgrRelation reln, ForkNumber forknum,
-						 BlockNumber blocknum, char *buffer, bool skipFsync);
-extern bool inmem_prefetch(SMgrRelation reln, ForkNumber forknum,
-						   BlockNumber blocknum);
-extern void inmem_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-					   char *buffer);
-extern void inmem_write(SMgrRelation reln, ForkNumber forknum,
-						BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void inmem_writeback(SMgrRelation reln, ForkNumber forknum,
-							BlockNumber blocknum, BlockNumber nblocks);
-extern BlockNumber inmem_nblocks(SMgrRelation reln, ForkNumber forknum);
-extern void inmem_truncate(SMgrRelation reln, ForkNumber forknum,
-						   BlockNumber nblocks);
-extern void inmem_immedsync(SMgrRelation reln, ForkNumber forknum);
-
 /* utils for neon relsize cache */
 extern void relsize_hash_init(void);
 extern bool get_cached_relsize(RelFileNode rnode, ForkNumber forknum, BlockNumber *size);
--- a/pgxn/neon/pagestore_smgr.c
+++ b/pgxn/neon/pagestore_smgr.c
@@ -49,22 +49,20 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlog_internal.h"
-#include "catalog/pg_class.h"
-#include "pagestore_client.h"
-#include "pagestore_client.h"
-#include "storage/smgr.h"
 #include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+#include "common/hashfn.h"
+#include "pagestore_client.h"
 #include "postmaster/interrupt.h"
+#include "postmaster/autovacuum.h"
 #include "replication/walsender.h"
 #include "storage/bufmgr.h"
 #include "storage/relfilenode.h"
 #include "storage/buf_internals.h"
+#include "storage/smgr.h"
 #include "storage/md.h"
-#include "fmgr.h"
-#include "miscadmin.h"
 #include "pgstat.h"
-#include "catalog/pg_tablespace_d.h"
-#include "postmaster/autovacuum.h"
+

 #if PG_VERSION_NUM >= 150000
 #include "access/xlogutils.h"
@@ -99,7 +97,6 @@ char	   *page_server_connstring;
 /*with substituted password*/
 char	   *neon_timeline;
 char	   *neon_tenant;
-bool		wal_redo = false;
 int32		max_cluster_size;

 /* unlogged relation build states */
@@ -114,48 +111,482 @@ typedef enum
 static SMgrRelation unlogged_build_rel = NULL;
 static UnloggedBuildPhase unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;

-
 /*
 * Prefetch implementation:
+ * 
 * Prefetch is performed locally by each backend.
- * There can be up to MAX_PREFETCH_REQUESTS registered using smgr_prefetch
- * before smgr_read. All this requests are appended to primary smgr_read request.
- * It is assumed that pages will be requested in prefetch order.
- * Reading of prefetch responses is delayed until them are actually needed (smgr_read).
- * It make it possible to parallelize processing and receiving of prefetched pages.
- * In case of prefetch miss or any other SMGR request other than smgr_read,
- * all prefetch responses has to be consumed.
+ *
+ * There can be up to READ_BUFFER_SIZE active IO requests registered at any
+ * time. Requests using smgr_prefetch are sent to the pageserver, but we don't
+ * wait on the response. Requests using smgr_read are either read from the
+ * buffer, or (if that's not possible) we wait on the response to arrive -
+ * this also will allow us to receive other prefetched pages. 
+ * Each request is immediately written to the output buffer of the pageserver
+ * connection, but may not be flushed if smgr_prefetch is used: pageserver
+ * flushes sent requests on manual flush, or every neon.flush_output_after
+ * unflushed requests; which is not necessarily always and all the time.
+ *
+ * Once we have received a response, this value will be stored in the response
+ * buffer, indexed in a hash table. This allows us to retain our buffered
+ * prefetch responses even when we have cache misses.
+ *
+ * Reading of prefetch responses is delayed until them are actually needed
+ * (smgr_read). In case of prefetch miss or any other SMGR request other than
+ * smgr_read, all prefetch responses in the pipeline will need to be read from
+ * the connection; the responses are stored for later use.
+ *
+ * NOTE: The current implementation of the prefetch system implements a ring
+ * buffer of up to READ_BUFFER_SIZE requests. If there are more _read and
+ * _prefetch requests between the initial _prefetch and the _read of a buffer,
+ * the prefetch request will have been dropped from this prefetch buffer, and
+ * your prefetch was wasted.
 */

-#define MAX_PREFETCH_REQUESTS 128
+/* Max amount of tracked buffer reads */
+#define READ_BUFFER_SIZE 128

-BufferTag	prefetch_requests[MAX_PREFETCH_REQUESTS];
-BufferTag	prefetch_responses[MAX_PREFETCH_REQUESTS];
-int			n_prefetch_requests;
-int			n_prefetch_responses;
-int			n_prefetched_buffers;
-int			n_prefetch_hits;
-int			n_prefetch_misses;
-XLogRecPtr	prefetch_lsn;
+typedef enum PrefetchStatus {
+	PRFS_UNUSED = 0,	/* unused slot */
+	PRFS_REQUESTED,		/* request was written to the sendbuffer to PS, but not
+						 * necessarily flushed.
+						 * all fields except response valid */
+	PRFS_RECEIVED,		/* all fields valid */
+	PRFS_TAG_REMAINS,	/* only buftag and my_ring_index are still valid */
+} PrefetchStatus;

+typedef struct PrefetchRequest {
+	BufferTag	buftag; /* must be first entry in the struct */
+	XLogRecPtr	effective_request_lsn;
+	NeonResponse *response; /* may be null */
+	PrefetchStatus status;
+	uint64		my_ring_index;
+} PrefetchRequest;
+
+/* prefetch buffer lookup hash table */
+
+typedef struct PrfHashEntry {
+	PrefetchRequest *slot;
+	uint32 status;
+	uint32 hash;
+} PrfHashEntry;
+
+#define SH_PREFIX			prfh
+#define SH_ELEMENT_TYPE		PrfHashEntry
+#define SH_KEY_TYPE			PrefetchRequest *
+#define SH_KEY				slot
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a)	((a)->hash)
+#define SH_HASH_KEY(tb, key) hash_bytes( \
+	((const unsigned char *) &(key)->buftag), \
+	sizeof(BufferTag) \
+)
+
+#define SH_EQUAL(tb, a, b)	(BUFFERTAGS_EQUAL((a)->buftag, (b)->buftag))
+#define SH_SCOPE			static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * PrefetchState maintains the state of (prefetch) getPage@LSN requests.
+ * It maintains a (ring) buffer of in-flight requests and responses.
+ * 
+ * We maintain several indexes into the ring buffer:
+ * ring_unused >= ring_receive >= ring_last >= 0
+ * 
+ * ring_unused points to the first unused slot of the buffer
+ * ring_receive is the next request that is to be received
+ * ring_last is the oldest received entry in the buffer
+ * 
+ * Apart from being an entry in the ring buffer of prefetch requests, each
+ * PrefetchRequest that is not UNUSED is indexed in prf_hash by buftag.
+ */
+typedef struct PrefetchState {
+	MemoryContext bufctx; /* context for prf_buffer[].response allocations */
+	MemoryContext errctx; /* context for prf_buffer[].response allocations */
+	MemoryContext hashctx; /* context for prf_buffer */
+
+	/* buffer indexes */
+	uint64	ring_unused;		/* first unused slot */
+	uint64	ring_receive;		/* next slot that is to receive a response */
+	uint64	ring_last;			/* min slot with a response value */
+
+	/* metrics / statistics  */
+	int		n_responses_buffered;	/* count of PS responses not yet in buffers */
+	int		n_requests_inflight;	/* count of PS requests considered in flight */
+	int		n_unused;				/* count of buffers < unused, > last, that are also unused */
+
+	/* the buffers */
+	prfh_hash *prf_hash;
+	PrefetchRequest prf_buffer[READ_BUFFER_SIZE]; /* prefetch buffers */
+} PrefetchState;
+
+PrefetchState *MyPState;
+
+int			n_prefetch_hits = 0;
+int			n_prefetch_misses = 0;
+int			n_prefetch_missed_caches = 0;
+int			n_prefetch_dupes = 0;
+
+XLogRecPtr	prefetch_lsn = 0;
+
+static void consume_prefetch_responses(void);
+static uint64 prefetch_register_buffer(BufferTag tag, bool *force_latest, XLogRecPtr *force_lsn);
+static void prefetch_read(PrefetchRequest *slot);
+static void prefetch_do_request(PrefetchRequest *slot, bool *force_latest, XLogRecPtr *force_lsn);
+static void prefetch_wait_for(uint64 ring_index);
+static void prefetch_cleanup(void);
+static inline void prefetch_set_unused(uint64 ring_index, bool hash_cleanup);
+
+static XLogRecPtr neon_get_request_lsn(bool *latest, RelFileNode rnode,
+									   ForkNumber forknum, BlockNumber blkno);
+
+
+/*
+ * Make sure that there are no responses still in the buffer.
+ */
 static void
 consume_prefetch_responses(void)
 {
-	for (int i = n_prefetched_buffers; i < n_prefetch_responses; i++)
-	{
-		NeonResponse *resp = page_server->receive();
+	if (MyPState->ring_receive < MyPState->ring_unused)
+		prefetch_wait_for(MyPState->ring_unused - 1);
+}

-		pfree(resp);
+static void
+prefetch_cleanup(void)
+{
+	int		index;
+	uint64	ring_index;
+	PrefetchRequest *slot;
+
+	while (MyPState->ring_last < MyPState->ring_receive) {
+		ring_index = MyPState->ring_last;
+		index = (ring_index % READ_BUFFER_SIZE);
+		slot = &MyPState->prf_buffer[index];
+
+		if (slot->status == PRFS_UNUSED)
+			MyPState->ring_last += 1;
+		else
+			break;
 	}
-	n_prefetched_buffers = 0;
-	n_prefetch_responses = 0;
+}
+
+/*
+ * Wait for slot of ring_index to have received its response.
+ * The caller is responsible for making sure the request buffer is flushed.
+ */
+static void
+prefetch_wait_for(uint64 ring_index)
+{
+	int index;
+	PrefetchRequest *entry;
+
+	Assert(MyPState->ring_unused > ring_index);
+
+	while (MyPState->ring_receive <= ring_index)
+	{
+		index = (MyPState->ring_receive % READ_BUFFER_SIZE);
+		entry = &MyPState->prf_buffer[index];
+
+		Assert(entry->status == PRFS_REQUESTED);
+		prefetch_read(entry);
+	}
+}
+
+/*
+ * Read the response of a prefetch request into its slot.
+ * 
+ * The caller is responsible for making sure that the request for this buffer
+ * was flushed to the PageServer.
+ */
+static void
+prefetch_read(PrefetchRequest *slot)
+{
+	NeonResponse *response;
+	MemoryContext old;
+
+	Assert(slot->status == PRFS_REQUESTED);
+	Assert(slot->response == NULL);
+	Assert(slot->my_ring_index == MyPState->ring_receive);
+
+	old = MemoryContextSwitchTo(MyPState->errctx);
+	response = (NeonResponse *) page_server->receive();
+	MemoryContextSwitchTo(old);
+
+	/* update prefetch state */
+	MyPState->n_responses_buffered += 1;
+	MyPState->n_requests_inflight -= 1;
+	MyPState->ring_receive += 1;
+
+	/* update slot state */
+	slot->status = PRFS_RECEIVED;
+	slot->response = response;
+}
+
+/*
+ * Disconnect hook - drop prefetches when the connection drops
+ * 
+ * If we don't remove the failed prefetches, we'd be serving incorrect
+ * data to the smgr.
+ */
+void
+prefetch_on_ps_disconnect(void)
+{
+	for (; MyPState->ring_receive < MyPState->ring_unused; MyPState->ring_receive++)
+	{
+		PrefetchRequest *slot;
+		int		index = MyPState->ring_receive % READ_BUFFER_SIZE;
+
+		slot = &MyPState->prf_buffer[index];
+		Assert(slot->status == PRFS_REQUESTED);
+		Assert(slot->my_ring_index == MyPState->ring_receive);
+
+		/* clean up the request */
+		slot->status = PRFS_TAG_REMAINS;
+		MyPState->n_requests_inflight--;
+		prefetch_set_unused(MyPState->ring_receive, true);
+	}
+}
+
+/*
+ * prefetch_set_unused() - clear a received prefetch slot
+ *
+ * The slot at ring_index must be a current member of the ring buffer,
+ * and may not be in the PRFS_REQUESTED state.
+ */
+static inline void
+prefetch_set_unused(uint64 ring_index, bool hash_cleanup)
+{
+	PrefetchRequest *slot = &MyPState->prf_buffer[ring_index % READ_BUFFER_SIZE];
+
+	Assert(MyPState->ring_last <= ring_index &&
+		   MyPState->ring_unused > ring_index);
+
+	if (slot->status == PRFS_UNUSED)
+		return;
+
+	Assert(slot->status == PRFS_RECEIVED || slot->status == PRFS_TAG_REMAINS);
+	Assert(ring_index >= MyPState->ring_last &&
+		   ring_index < MyPState->ring_unused);
+
+	if (slot->status == PRFS_RECEIVED)
+	{
+		pfree(slot->response);
+		slot->response = NULL;
+
+		MyPState->n_responses_buffered -= 1;
+		MyPState->n_unused += 1;
+	}
+	else
+	{
+		Assert(slot->response == NULL);
+	}
+
+	if (hash_cleanup)
+		prfh_delete(MyPState->prf_hash, slot);
+
+	/* clear all fields */
+	MemSet(slot, 0, sizeof(PrefetchRequest));
+	slot->status = PRFS_UNUSED;
+
+	/* run cleanup if we're holding back ring_last */
+	if (MyPState->ring_last == ring_index)
+		prefetch_cleanup();
+}
+
+static void
+prefetch_do_request(PrefetchRequest *slot, bool *force_latest, XLogRecPtr *force_lsn)
+{
+	NeonGetPageRequest request = {
+		.req.tag = T_NeonGetPageRequest,
+		.req.latest = false,
+		.req.lsn = 0,
+		.rnode = slot->buftag.rnode,
+		.forknum = slot->buftag.forkNum,
+		.blkno = slot->buftag.blockNum,
+	};
+
+	if (force_lsn && force_latest)
+	{
+		request.req.lsn = *force_lsn;
+		request.req.latest = *force_latest;
+		slot->effective_request_lsn = *force_lsn;
+	}
+	else
+	{
+		XLogRecPtr lsn = neon_get_request_lsn(
+			&request.req.latest,
+			slot->buftag.rnode,
+			slot->buftag.forkNum,
+			slot->buftag.blockNum
+		);
+		/*
+		 * Note: effective_request_lsn is potentially higher than the requested
+		 * LSN, but still correct:
+		 * 
+		 * We know there are no changes between the actual requested LSN and
+		 * the value of effective_request_lsn: If there were, the page would
+		 * have been in cache and evicted between those LSN values, which
+		 * then would have had to result in a larger request LSN for this page.
+		 * 
+		 * It is possible that a concurrent backend loads the page, modifies
+		 * it and then evicts it again, but the LSN of that eviction cannot be
+		 * smaller than the current WAL insert/redo pointer, which is already
+		 * larger than this prefetch_lsn. So in any case, that would
+		 * invalidate this cache.
+		 * 
+		 * The best LSN to use for effective_request_lsn would be
+		 * XLogCtl->Insert.RedoRecPtr, but that's expensive to access.
+		 */
+		request.req.lsn = lsn;
+		prefetch_lsn = Max(prefetch_lsn, lsn);
+		slot->effective_request_lsn = prefetch_lsn;
+	}
+
+	Assert(slot->response == NULL);
+	Assert(slot->my_ring_index == MyPState->ring_unused);
+	page_server->send((NeonRequest *) &request);
+
+	/* update prefetch state */
+	MyPState->n_requests_inflight += 1;
+	MyPState->n_unused -= 1;
+	MyPState->ring_unused += 1;
+
+	/* update slot state */
+	slot->status = PRFS_REQUESTED;
+}
+
+/*
+ * prefetch_register_buffer() - register and prefetch buffer
+ *
+ * Register that we may want the contents of BufferTag in the near future.
+ * 
+ * If force_latest and force_lsn are not NULL, those values are sent to the
+ * pageserver. If they are NULL, we utilize the lastWrittenLsn -infrastructure
+ * to fill in these values manually.
+ */
+
+static uint64
+prefetch_register_buffer(BufferTag tag, bool *force_latest, XLogRecPtr *force_lsn)
+{
+	int		index;
+	bool	found;
+	uint64	ring_index;
+	PrefetchRequest req;
+	PrefetchRequest *slot;
+	PrfHashEntry *entry;
+
+	/* use an intermediate PrefetchRequest struct to ensure correct alignment */
+	req.buftag = tag;
+	
+	entry = prfh_lookup(MyPState->prf_hash, (PrefetchRequest *) &req);
+
+	if (entry != NULL)
+	{
+		slot = entry->slot;
+		ring_index = slot->my_ring_index;
+		index = (ring_index % READ_BUFFER_SIZE);
+		Assert(slot == &MyPState->prf_buffer[index]);
+
+		Assert(slot->status != PRFS_UNUSED);
+		Assert(BUFFERTAGS_EQUAL(slot->buftag, tag));
+		
+		/*
+		 * If we want a specific lsn, we do not accept requests that were made
+		 * with a potentially different LSN.
+		 */
+		if (force_lsn && slot->effective_request_lsn != *force_lsn)
+		{
+			prefetch_wait_for(ring_index);
+			prefetch_set_unused(ring_index, true);
+		}
+		/*
+		 * We received a prefetch for a page that was recently read and
+		 * removed from the buffers. Remove that request from the buffers.
+		 */
+		else if (slot->status == PRFS_TAG_REMAINS)
+		{
+			prefetch_set_unused(ring_index, true);
+		}
+		else
+		{
+			/* The buffered request is good enough, return that index */
+			n_prefetch_dupes++;
+			return ring_index;
+		}
+	}
+
+	/*
+	 * If the prefetch queue is full, we need to make room by clearing the
+	 * oldest slot. If the oldest slot holds a buffer that was already
+	 * received, we can just throw it away; we fetched the page unnecessarily
+	 * in that case. If the oldest slot holds a request that we haven't
+	 * received a response for yet, we have to wait for the response to that
+	 * before we can continue. We might not have even flushed the request to
+	 * the pageserver yet, it might be just sitting in the output buffer. In
+	 * that case, we flush it and wait for the response. (We could decide not
+	 * to send it, but it's hard to abort when the request is already in the
+	 * output buffer, and 'not sending' a prefetch request kind of goes
+	 * against the principles of prefetching)
+	 */
+	if (MyPState->ring_last + READ_BUFFER_SIZE - 1 == MyPState->ring_unused)
+	{
+		slot = &MyPState->prf_buffer[(MyPState->ring_last % READ_BUFFER_SIZE)];
+
+		Assert(slot->status != PRFS_UNUSED);
+
+		/* We have the slot for ring_last, so that must still be in progress */
+		switch (slot->status)
+		{
+			case PRFS_REQUESTED:
+				Assert(MyPState->ring_receive == MyPState->ring_last);
+				prefetch_wait_for(MyPState->ring_last);
+				prefetch_set_unused(MyPState->ring_last, true);
+				break;
+			case PRFS_RECEIVED:
+			case PRFS_TAG_REMAINS:
+				prefetch_set_unused(MyPState->ring_last, true);
+				break;
+			default:
+				pg_unreachable();
+		}
+	}
+
+	/*
+	 * The next buffer pointed to by `ring_unused` is now unused, so we can insert
+	 * the new request to it.
+	 */
+	ring_index = MyPState->ring_unused;
+	index = (ring_index % READ_BUFFER_SIZE);
+	slot = &MyPState->prf_buffer[index];
+
+	Assert(MyPState->ring_last <= ring_index);
+
+	Assert(slot->status == PRFS_UNUSED);
+
+	/*
+	 * We must update the slot data before insertion, because the hash
+	 * function reads the buffer tag from the slot.
+	 */
+	slot->buftag = tag;
+	slot->my_ring_index = ring_index;
+
+	prfh_insert(MyPState->prf_hash, slot, &found);
+	Assert(!found);
+
+	prefetch_do_request(slot, force_latest, force_lsn);
+	Assert(slot->status == PRFS_REQUESTED);
+	Assert(ring_index < MyPState->ring_unused);
+	return ring_index;
 }

 static NeonResponse *
 page_server_request(void const *req)
 {
+	page_server->send((NeonRequest *) req);
+	page_server->flush();
 	consume_prefetch_responses();
-	return page_server->request((NeonRequest *) req);
+	return page_server->receive();
 }


@@ -269,12 +700,15 @@ nm_unpack_response(StringInfo s)

 		case T_NeonGetPageResponse:
 			{
-				NeonGetPageResponse *msg_resp = palloc0(offsetof(NeonGetPageResponse, page) + BLCKSZ);
+				NeonGetPageResponse *msg_resp;

+				msg_resp = MemoryContextAllocZero(MyPState->bufctx, PS_GETPAGERESPONSE_SIZE);
 				msg_resp->tag = tag;
 				/* XXX:	should be varlena */
 				memcpy(msg_resp->page, pq_getmsgbytes(s, BLCKSZ), BLCKSZ);
 				pq_getmsgend(s);
+				
+				Assert(msg_resp->tag == T_NeonGetPageResponse);

 				resp = (NeonResponse *) msg_resp;
 				break;
@@ -618,7 +1052,32 @@ neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, ch
 void
 neon_init(void)
 {
-	/* noop */
+	HASHCTL info;
+
+	if (MyPState != NULL)
+		return;
+
+	MyPState = MemoryContextAllocZero(TopMemoryContext, sizeof(PrefetchState));
+	
+	MyPState->n_unused = READ_BUFFER_SIZE;
+
+	MyPState->bufctx = SlabContextCreate(TopMemoryContext,
+										 "NeonSMGR/prefetch",
+										 SLAB_DEFAULT_BLOCK_SIZE * 17,
+										 PS_GETPAGERESPONSE_SIZE);
+	MyPState->errctx = AllocSetContextCreate(TopMemoryContext, 
+											 "NeonSMGR/errors",
+											 ALLOCSET_DEFAULT_SIZES);
+	MyPState->hashctx = AllocSetContextCreate(TopMemoryContext,
+											  "NeonSMGR/prefetch",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	info.keysize = sizeof(BufferTag);
+	info.entrysize = sizeof(uint64);
+
+	MyPState->prf_hash = prfh_create(MyPState->hashctx,
+									 READ_BUFFER_SIZE, NULL);
+
 #ifdef DEBUG_COMPARE_LOCAL
 	mdinit();
 #endif
@@ -1005,27 +1464,17 @@ neon_close(SMgrRelation reln, ForkNumber forknum)
 }


-/*
- *	neon_reset_prefetch() -- reoe all previously rgistered prefeth requests
- */
-void
-neon_reset_prefetch(SMgrRelation reln)
-{
-	n_prefetch_requests = 0;
-}
-
 /*
 *	neon_prefetch() -- Initiate asynchronous read of the specified block of a relation
 */
 bool
 neon_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
+	uint64 ring_index;
+
 	switch (reln->smgr_relpersistence)
 	{
-		case 0:
-			/* probably shouldn't happen, but ignore it */
-			break;
-
+		case 0: /* probably shouldn't happen, but ignore it */
 		case RELPERSISTENCE_PERMANENT:
 			break;

@@ -1037,14 +1486,17 @@ neon_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 			elog(ERROR, "unknown relpersistence '%c'", reln->smgr_relpersistence);
 	}

-	if (n_prefetch_requests < MAX_PREFETCH_REQUESTS)
-	{
-		prefetch_requests[n_prefetch_requests].rnode = reln->smgr_rnode.node;
-		prefetch_requests[n_prefetch_requests].forkNum = forknum;
-		prefetch_requests[n_prefetch_requests].blockNum = blocknum;
-		n_prefetch_requests += 1;
-		return true;
-	}
+	BufferTag tag = (BufferTag) {
+		.rnode = reln->smgr_rnode.node,
+		.forkNum = forknum,
+		.blockNum = blocknum
+	};
+
+	ring_index = prefetch_register_buffer(tag, NULL, NULL);
+
+	Assert(ring_index < MyPState->ring_unused &&
+		   MyPState->ring_last <= ring_index);
+
 	return false;
 }

@@ -1095,81 +1547,72 @@ neon_read_at_lsn(RelFileNode rnode, ForkNumber forkNum, BlockNumber blkno,
 				 XLogRecPtr request_lsn, bool request_latest, char *buffer)
 {
 	NeonResponse *resp;
-	int			i;
+	BufferTag	buftag;
+	uint64		ring_index;
+	PrfHashEntry *entry;
+	PrefetchRequest *slot;
+
+	buftag = (BufferTag) {
+		.rnode = rnode,
+		.forkNum = forkNum,
+		.blockNum = blkno,
+	};

 	/*
-	 * Try to find prefetched page. It is assumed that pages will be requested
-	 * in the same order as them are prefetched, but some other backend may
-	 * load page in shared buffers, so some prefetch responses should be
-	 * skipped.
+	 * Try to find prefetched page in the list of received pages.
 	 */
-	for (i = n_prefetched_buffers; i < n_prefetch_responses; i++)
-	{
-		resp = page_server->receive();
-		if (resp->tag == T_NeonGetPageResponse &&
-			RelFileNodeEquals(prefetch_responses[i].rnode, rnode) &&
-			prefetch_responses[i].forkNum == forkNum &&
-			prefetch_responses[i].blockNum == blkno)
-		{
-			char	   *page = ((NeonGetPageResponse *) resp)->page;
+	entry = prfh_lookup(MyPState->prf_hash, (PrefetchRequest *) &buftag);

+	if (entry != NULL)
+	{
+		if (entry->slot->effective_request_lsn >= prefetch_lsn)
+		{
+			slot = entry->slot;
+			ring_index = slot->my_ring_index;
+			n_prefetch_hits += 1;
+		}
+		else /* the current prefetch LSN is not large enough, so drop the prefetch */
+		{
 			/*
-			 * Check if prefetched page is still relevant. If it is updated by
-			 * some other backend, then it should not be requested from smgr
-			 * unless it is evicted from shared buffers. In the last case
-			 * last_evicted_lsn should be updated and request_lsn should be
-			 * greater than prefetch_lsn. Maximum with page LSN is used
-			 * because page returned by page server may have LSN either
-			 * greater either smaller than requested.
+			 * We can't drop cache for not-yet-received requested items. It is
+			 * unlikely this happens, but it can happen if prefetch distance is
+			 * large enough and a backend didn't consume all prefetch requests.
 			 */
-			if (Max(prefetch_lsn, PageGetLSN(page)) >= request_lsn)
+			if (entry->slot->status == PRFS_REQUESTED)
 			{
-				n_prefetched_buffers = i + 1;
-				n_prefetch_hits += 1;
-				n_prefetch_requests = 0;
-				memcpy(buffer, page, BLCKSZ);
-				pfree(resp);
-				return;
+				page_server->flush();
+				prefetch_wait_for(entry->slot->my_ring_index);
 			}
+			/* drop caches */
+			prefetch_set_unused(entry->slot->my_ring_index, true);
+			n_prefetch_missed_caches += 1;
+			/* make it look like a prefetch cache miss */
+			entry = NULL;
 		}
-		pfree(resp);
 	}
-	n_prefetched_buffers = 0;
-	n_prefetch_responses = 0;
-	n_prefetch_misses += 1;
-	{
-		NeonGetPageRequest request = {
-			.req.tag = T_NeonGetPageRequest,
-			.req.latest = request_latest,
-			.req.lsn = request_lsn,
-			.rnode = rnode,
-			.forknum = forkNum,
-			.blkno = blkno
-		};

-		if (n_prefetch_requests > 0)
-		{
-			/* Combine all prefetch requests with primary request */
-			page_server->send((NeonRequest *) & request);
-			for (i = 0; i < n_prefetch_requests; i++)
-			{
-				request.rnode = prefetch_requests[i].rnode;
-				request.forknum = prefetch_requests[i].forkNum;
-				request.blkno = prefetch_requests[i].blockNum;
-				prefetch_responses[i] = prefetch_requests[i];
-				page_server->send((NeonRequest *) & request);
-			}
-			page_server->flush();
-			n_prefetch_responses = n_prefetch_requests;
-			n_prefetch_requests = 0;
-			prefetch_lsn = request_lsn;
-			resp = page_server->receive();
-		}
-		else
-		{
-			resp = page_server->request((NeonRequest *) & request);
-		}
+	if (entry == NULL)
+	{
+		n_prefetch_misses += 1;
+
+		ring_index = prefetch_register_buffer(buftag, &request_latest,
+											  &request_lsn);
+		slot = &MyPState->prf_buffer[(ring_index % READ_BUFFER_SIZE)];
 	}
+
+	Assert(MyPState->ring_last <= ring_index &&
+		   MyPState->ring_unused > ring_index);
+	Assert(slot->my_ring_index == ring_index);
+	Assert(slot->status != PRFS_UNUSED);
+	Assert(&MyPState->prf_buffer[(ring_index % READ_BUFFER_SIZE)] == slot);
+
+	page_server->flush();
+	prefetch_wait_for(ring_index);
+
+	Assert(slot->status == PRFS_RECEIVED);
+
+	resp = slot->response;
+
 	switch (resp->tag)
 	{
 		case T_NeonGetPageResponse:
@@ -1189,12 +1632,13 @@ neon_read_at_lsn(RelFileNode rnode, ForkNumber forkNum, BlockNumber blkno,
 					 errdetail("page server returned error: %s",
 							   ((NeonErrorResponse *) resp)->message)));
 			break;
-
 		default:
 			elog(ERROR, "unexpected response from page server with tag 0x%02x", resp->tag);
 	}

-	pfree(resp);
+	/* buffer was used, clean up for later reuse */
+	prefetch_set_unused(ring_index, true);
+	prefetch_cleanup();
 }

 /*
@@ -1816,7 +2260,6 @@ static const struct f_smgr neon_smgr =
 	.smgr_unlink = neon_unlink,
 	.smgr_extend = neon_extend,
 	.smgr_prefetch = neon_prefetch,
-	.smgr_reset_prefetch = neon_reset_prefetch,
 	.smgr_read = neon_read,
 	.smgr_write = neon_write,
 	.smgr_writeback = neon_writeback,
--- a/pgxn/neon/walproposer.c
+++ b/pgxn/neon/walproposer.c
@@ -43,6 +43,7 @@
 #if PG_VERSION_NUM >= 150000
 #include "access/xlogrecovery.h"
 #endif
+#include "storage/fd.h"
 #include "storage/latch.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -69,7 +70,8 @@
 #include "neon.h"
 #include "walproposer.h"
 #include "walproposer_utils.h"
-#include "replication/walpropshim.h"
+
+static bool syncSafekeepers = false;

 char	   *wal_acceptors_list;
 int			wal_acceptor_reconnect_timeout;
@@ -117,8 +119,8 @@ static TimestampTz last_reconnect_attempt;
 static WalproposerShmemState * walprop_shared;

 /* Prototypes for private functions */
-static void WalProposerInitImpl(XLogRecPtr flushRecPtr, uint64 systemId);
-static void WalProposerStartImpl(void);
+static void WalProposerInit(XLogRecPtr flushRecPtr, uint64 systemId);
+static void WalProposerStart(void);
 static void WalProposerLoop(void);
 static void InitEventSet(void);
 static void UpdateEventSet(Safekeeper *sk, uint32 events);
@@ -186,9 +188,56 @@ pg_init_walproposer(void)
 	ProcessInterruptsCallback = backpressure_throttling_impl;

 	WalProposerRegister();
+}

-	WalProposerInit = &WalProposerInitImpl;
-	WalProposerStart = &WalProposerStartImpl;
+/*
+ * Entry point for `postgres --sync-safekeepers`.
+ */
+void
+WalProposerSync(int argc, char *argv[])
+{
+	struct stat stat_buf;
+
+	syncSafekeepers = true;
+#if PG_VERSION_NUM < 150000
+	ThisTimeLineID = 1;
+#endif
+
+	/*
+	 * Initialize postmaster_alive_fds as WaitEventSet checks them.
+	 *
+	 * Copied from InitPostmasterDeathWatchHandle()
+	 */
+	if (pipe(postmaster_alive_fds) < 0)
+		ereport(FATAL,
+				(errcode_for_file_access(),
+					errmsg_internal("could not create pipe to monitor postmaster death: %m")));
+	if (fcntl(postmaster_alive_fds[POSTMASTER_FD_WATCH], F_SETFL, O_NONBLOCK) == -1)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+					errmsg_internal("could not set postmaster death monitoring pipe to nonblocking mode: %m")));
+
+	ChangeToDataDir();
+
+	/* Create pg_wal directory, if it doesn't exist */
+	if (stat(XLOGDIR, &stat_buf) != 0)
+	{
+		ereport(LOG, (errmsg("creating missing WAL directory \"%s\"", XLOGDIR)));
+		if (MakePGDirectory(XLOGDIR) < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+						errmsg("could not create directory \"%s\": %m",
+							   XLOGDIR)));
+			exit(1);
+		}
+	}
+
+	WalProposerInit(0, 0);
+
+	BackgroundWorkerUnblockSignals();
+
+	WalProposerStart();
 }

 static void
@@ -429,7 +478,7 @@ WalProposerRegister(void)
 }

 static void
-WalProposerInitImpl(XLogRecPtr flushRecPtr, uint64 systemId)
+WalProposerInit(XLogRecPtr flushRecPtr, uint64 systemId)
 {
 	char	   *host;
 	char	   *sep;
@@ -508,7 +557,7 @@ WalProposerInitImpl(XLogRecPtr flushRecPtr, uint64 systemId)
 }

 static void
-WalProposerStartImpl(void)
+WalProposerStart(void)
 {

 	/* Initiate connections to all safekeeper nodes */
--- a/pgxn/neon_walredo/Makefile
+++ b/pgxn/neon_walredo/Makefile
@@ -0,0 +1,22 @@
+# pgxs/neon_walredo/Makefile
+
+MODULE_big = neon_walredo
+OBJS = \
+	$(WIN32RES) \
+	inmem_smgr.o \
+	walredoproc.o \
+
+# This really should be guarded by $(with_libseccomp), but I couldn't
+# make that work with pgxs. So we always compile it, but its contents
+# are wrapped in #ifdef HAVE_LIBSECCOMP instead.
+OBJS += seccomp.o
+
+PGFILEDESC = "neon_walredo - helper process that runs in Neon pageserver"
+
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+
+ifeq ($(with_libseccomp),yes)
+SHLIB_LINK += -lseccomp
+endif
--- a/pgxn/neon_walredo/inmem_smgr.c
+++ b/pgxn/neon_walredo/inmem_smgr.c
@@ -3,9 +3,8 @@
 * inmem_smgr.c
 *
 * This is an implementation of the SMGR interface, used in the WAL redo
- * process (see src/backend/tcop/zenith_wal_redo.c). It has no persistent
- * storage, the pages that are written out are kept in a small number of
- * in-memory buffers.
+ * process. It has no persistent storage, the pages that are written out
+ * are kept in a small number of in-memory buffers.
 *
 * Normally, replaying a WAL record only needs to access a handful of
 * buffers, which fit in the normal buffer cache, so this is just for
@@ -15,15 +14,11 @@
 * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * IDENTIFICATION
- *	  contrib/neon/inmem_smgr.c
- *
 *-------------------------------------------------------------------------
 */
 #include "postgres.h"

 #include "access/xlog.h"
-#include "pagestore_client.h"
 #include "storage/block.h"
 #include "storage/buf_internals.h"
 #include "storage/relfilenode.h"
@@ -33,6 +28,8 @@
 #include "access/xlogutils.h"
 #endif

+#include "inmem_smgr.h"
+
 /* Size of the in-memory smgr */
 #define MAX_PAGES 64

@@ -59,10 +56,34 @@ locate_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno)
 	return -1;
 }

+
+/* neon wal-redo storage manager functionality */
+static void inmem_init(void);
+static void inmem_open(SMgrRelation reln);
+static void inmem_close(SMgrRelation reln, ForkNumber forknum);
+static void inmem_create(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+static bool inmem_exists(SMgrRelation reln, ForkNumber forknum);
+static void inmem_unlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
+static void inmem_extend(SMgrRelation reln, ForkNumber forknum,
+						 BlockNumber blocknum, char *buffer, bool skipFsync);
+static bool inmem_prefetch(SMgrRelation reln, ForkNumber forknum,
+						   BlockNumber blocknum);
+static void inmem_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					   char *buffer);
+static void inmem_write(SMgrRelation reln, ForkNumber forknum,
+						BlockNumber blocknum, char *buffer, bool skipFsync);
+static void inmem_writeback(SMgrRelation reln, ForkNumber forknum,
+							BlockNumber blocknum, BlockNumber nblocks);
+static BlockNumber inmem_nblocks(SMgrRelation reln, ForkNumber forknum);
+static void inmem_truncate(SMgrRelation reln, ForkNumber forknum,
+						   BlockNumber nblocks);
+static void inmem_immedsync(SMgrRelation reln, ForkNumber forknum);
+
+
 /*
 *	inmem_init() -- Initialize private state
 */
-void
+static void
 inmem_init(void)
 {
 	used_pages = 0;
@@ -71,7 +92,7 @@ inmem_init(void)
 /*
 *	inmem_exists() -- Does the physical file exist?
 */
-bool
+static bool
 inmem_exists(SMgrRelation reln, ForkNumber forknum)
 {
 	for (int i = 0; i < used_pages; i++)
@@ -90,7 +111,7 @@ inmem_exists(SMgrRelation reln, ForkNumber forknum)
 *
 * If isRedo is true, it's okay for the relation to exist already.
 */
-void
+static void
 inmem_create(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
 }
@@ -98,7 +119,7 @@ inmem_create(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 /*
 *	inmem_unlink() -- Unlink a relation.
 */
-void
+static void
 inmem_unlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo)
 {
 }
@@ -112,7 +133,7 @@ inmem_unlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo)
 *		EOF).  Note that we assume writing a block beyond current EOF
 *		causes intervening file space to become filled with zeroes.
 */
-void
+static void
 inmem_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 			 char *buffer, bool skipFsync)
 {
@@ -123,7 +144,7 @@ inmem_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 /*
 *  inmem_open() -- Initialize newly-opened relation.
 */
-void
+static void
 inmem_open(SMgrRelation reln)
 {
 }
@@ -131,7 +152,7 @@ inmem_open(SMgrRelation reln)
 /*
 *	inmem_close() -- Close the specified relation, if it isn't closed already.
 */
-void
+static void
 inmem_close(SMgrRelation reln, ForkNumber forknum)
 {
 }
@@ -139,7 +160,7 @@ inmem_close(SMgrRelation reln, ForkNumber forknum)
 /*
 *	inmem_prefetch() -- Initiate asynchronous read of the specified block of a relation
 */
-bool
+static bool
 inmem_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
 	return true;
@@ -148,7 +169,7 @@ inmem_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 /*
 * inmem_writeback() -- Tell the kernel to write pages back to storage.
 */
-void
+static void
 inmem_writeback(SMgrRelation reln, ForkNumber forknum,
 				BlockNumber blocknum, BlockNumber nblocks)
 {
@@ -157,7 +178,7 @@ inmem_writeback(SMgrRelation reln, ForkNumber forknum,
 /*
 *	inmem_read() -- Read the specified block from a relation.
 */
-void
+static void
 inmem_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 		   char *buffer)
 {
@@ -177,7 +198,7 @@ inmem_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 *		relation (ie, those before the current EOF).  To extend a relation,
 *		use mdextend().
 */
-void
+static void
 inmem_write(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 			char *buffer, bool skipFsync)
 {
@@ -224,7 +245,7 @@ inmem_write(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 /*
 *	inmem_nblocks() -- Get the number of blocks stored in a relation.
 */
-BlockNumber
+static BlockNumber
 inmem_nblocks(SMgrRelation reln, ForkNumber forknum)
 {
 	/*
@@ -243,7 +264,7 @@ inmem_nblocks(SMgrRelation reln, ForkNumber forknum)
 /*
 *	inmem_truncate() -- Truncate relation to specified number of blocks.
 */
-void
+static void
 inmem_truncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 {
 }
@@ -251,7 +272,7 @@ inmem_truncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 /*
 *	inmem_immedsync() -- Immediately sync a relation to stable storage.
 */
-void
+static void
 inmem_immedsync(SMgrRelation reln, ForkNumber forknum)
 {
 }
--- a/pgxn/neon_walredo/inmem_smgr.h
+++ b/pgxn/neon_walredo/inmem_smgr.h
@@ -0,0 +1,17 @@
+/*-------------------------------------------------------------------------
+ *
+ * inmem_smgr.h
+ *
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef INMEM_SMGR_H
+#define INMEM_SMGR_H
+
+extern const f_smgr *smgr_inmem(BackendId backend, RelFileNode rnode);
+extern void smgr_init_inmem(void);
+
+#endif /* INMEM_SMGR_H */
--- a/pgxn/neon_walredo/neon_seccomp.h
+++ b/pgxn/neon_walredo/neon_seccomp.h
@@ -0,0 +1,22 @@
+#ifndef NEON_SECCOMP_H
+#define NEON_SECCOMP_H
+
+#include <seccomp.h>
+
+typedef struct {
+    int    psr_syscall; /* syscall number */
+    uint32 psr_action;  /* libseccomp action, e.g. SCMP_ACT_ALLOW */
+} PgSeccompRule;
+
+#define PG_SCMP(syscall, action)                \
+    (PgSeccompRule) {                           \
+        .psr_syscall = SCMP_SYS(syscall),       \
+        .psr_action = (action),                 \
+    }
+
+#define PG_SCMP_ALLOW(syscall) \
+    PG_SCMP(syscall, SCMP_ACT_ALLOW)
+
+extern void seccomp_load_rules(PgSeccompRule *syscalls, int count);
+
+#endif /* NEON_SECCOMP_H */
--- a/pgxn/neon_walredo/seccomp.c
+++ b/pgxn/neon_walredo/seccomp.c
@@ -0,0 +1,257 @@
+/*-------------------------------------------------------------------------
+ *
+ * seccomp.c
+ *	  Secure Computing BPF API wrapper.
+ *
+ * Pageserver delegates complex WAL decoding duties to postgres,
+ * which means that the latter might fall victim to carefully designed
+ * malicious WAL records and start doing harmful things to the system.
+ * To prevent this, it has been decided to limit possible interactions
+ * with the outside world using the Secure Computing BPF mode.
+ *
+ * We use this mode to disable all syscalls not in the allowlist. This
+ * approach has its pros & cons:
+ *
+ *  - We have to carefully handpick and maintain the set of syscalls
+ *    required for the WAL redo process. Core dumps help with that.
+ *    The method of trial and error seems to work reasonably well,
+ *    but it would be nice to find a proper way to "prove" that
+ *    the set in question is both necessary and sufficient.
+ *
+ *  - Once we enter the seccomp bpf mode, it's impossible to lift those
+ *    restrictions (otherwise, what kind of "protection" would that be?).
+ *    Thus, we have to either enable extra syscalls for the clean shutdown,
+ *    or exit the process immediately via _exit() instead of proc_exit().
+ *
+ *  - Should we simply use SCMP_ACT_KILL_PROCESS, or implement a custom
+ *    facility to deal with the forbidden syscalls? If we'd like to embed
+ *    a startup security test, we should go with the latter; In that
+ *    case, which one of the following options is preferable?
+ *
+ *      * Catch the denied syscalls with a signal handler using SCMP_ACT_TRAP.
+ *        Provide a common signal handler with a static switch to override
+ *        its behavior for the test case. This would undermine the whole
+ *        purpose of such protection, so we'd have to go further and remap
+ *        the memory backing the switch as readonly, then ban mprotect().
+ *        Ugly and fragile, to say the least.
+ *
+ *      * Yet again, catch the denied syscalls using SCMP_ACT_TRAP.
+ *        Provide 2 different signal handlers: one for a test case,
+ *        another for the main processing loop. Install the first one,
+ *        enable seccomp, perform the test, switch to the second one,
+ *        finally ban sigaction(), presto!
+ *
+ *      * Spoof the result of a syscall using SECCOMP_RET_ERRNO for the
+ *        test, then ban it altogether with another filter. The downside
+ *        of this solution is that we don't actually check that
+ *        SCMP_ACT_KILL_PROCESS/SCMP_ACT_TRAP works.
+ *
+ *    Either approach seems to require two eBPF filter programs,
+ *    which is unfortunate: the man page tells this is uncommon.
+ *    Maybe I (@funbringer) am missing something, though; I encourage
+ *    any reader to get familiar with it and scrutinize my conclusions.
+ *
+ * TODOs and ideas in no particular order:
+ *
+ *  - Do something about mmap() in musl's malloc().
+ *    Definitely not a priority if we don't care about musl.
+ *
+ *  - See if we can untangle PG's shutdown sequence (involving unlink()):
+ *
+ *      * Simplify (or rather get rid of) shmem setup in PG's WAL redo mode.
+ *      * Investigate chroot() or mount namespaces for better FS isolation.
+ *      * (Per Heikki) Simply call _exit(), no big deal.
+ *      * Come up with a better idea?
+ *
+ *  - Make use of seccomp's argument inspection (for what?).
+ *    Unfortunately, it views all syscall arguments as scalars,
+ *    so it won't work for e.g. string comparison in unlink().
+ *
+ *  - Benchmark with bpf jit on/off, try seccomp_syscall_priority().
+ *
+ *  - Test against various linux distros & glibc versions.
+ *    I suspect that certain libc functions might involve slightly
+ *    different syscalls, e.g. select/pselect6/pselect6_time64/whatever.
+ *
+ *  - Test on any arch other than amd64 to see if it works there.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+/*
+ * I couldn't find a good way to do a conditional OBJS += seccomp.o in
+ * the Makefile, so this file is compiled even when seccomp is disabled,
+ * it's just empty in that case.
+ */
+#ifdef HAVE_LIBSECCOMP
+
+#include <fcntl.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+
+#include "neon_seccomp.h"
+
+static void die(int code, const char *str);
+
+static bool seccomp_test_sighandler_done = false;
+static void seccomp_test_sighandler(int signum, siginfo_t *info, void *cxt);
+static void seccomp_deny_sighandler(int signum, siginfo_t *info, void *cxt);
+
+static int do_seccomp_load_rules(PgSeccompRule *rules, int count, uint32 def_action);
+
+void
+seccomp_load_rules(PgSeccompRule *rules, int count)
+{
+	struct sigaction action = { .sa_flags = SA_SIGINFO };
+	PgSeccompRule rule;
+	long fd;
+
+	/*
+	 * Install a test signal handler.
+	 * XXX: pqsignal() is too restrictive for our purposes,
+	 * since we'd like to examine the contents of siginfo_t.
+	 */
+	action.sa_sigaction = seccomp_test_sighandler;
+	if (sigaction(SIGSYS, &action, NULL) != 0)
+		ereport(FATAL,
+				(errcode(ERRCODE_SYSTEM_ERROR),
+				 errmsg("seccomp: could not install test SIGSYS handler")));
+
+	/*
+	 * First, check that open of a well-known file works.
+	 * XXX: We use raw syscall() to call the very open().
+	 */
+	fd = syscall(SCMP_SYS(open), "/dev/null", O_RDONLY, 0);
+	if (seccomp_test_sighandler_done)
+		ereport(FATAL,
+				(errcode(ERRCODE_SYSTEM_ERROR),
+				 errmsg("seccomp: signal handler test flag was set unexpectedly")));
+	if (fd < 0)
+		ereport(FATAL,
+				(errcode(ERRCODE_SYSTEM_ERROR),
+				 errmsg("seccomp: could not open /dev/null for seccomp testing: %m")));
+	close((int) fd);
+
+	/* Set a trap on open() to test seccomp bpf */
+	rule = PG_SCMP(open, SCMP_ACT_TRAP);
+	if (do_seccomp_load_rules(&rule, 1, SCMP_ACT_ALLOW) != 0)
+		ereport(FATAL,
+				(errcode(ERRCODE_SYSTEM_ERROR),
+				 errmsg("seccomp: could not load test trap")));
+
+	/* Finally, check that open() now raises SIGSYS */
+	(void) syscall(SCMP_SYS(open), "/dev/null", O_RDONLY, 0);
+	if (!seccomp_test_sighandler_done)
+		ereport(FATAL,
+				(errcode(ERRCODE_SYSTEM_ERROR),
+				 errmsg("seccomp: SIGSYS handler doesn't seem to work")));
+
+	/* Now that everything seems to work, install a proper handler */
+	action.sa_sigaction = seccomp_deny_sighandler;
+	if (sigaction(SIGSYS, &action, NULL) != 0)
+		ereport(FATAL,
+				(errcode(ERRCODE_SYSTEM_ERROR),
+				 errmsg("seccomp: could not install SIGSYS handler")));
+
+	/* If this succeeds, any syscall not in the list will crash the process */
+	if (do_seccomp_load_rules(rules, count, SCMP_ACT_TRAP) != 0)
+		ereport(FATAL,
+				(errcode(ERRCODE_SYSTEM_ERROR),
+				 errmsg("seccomp: could not enter seccomp mode")));
+}
+
+/*
+ * Enter seccomp mode with a BPF filter that will only allow
+ * certain syscalls to proceed.
+ */
+static int
+do_seccomp_load_rules(PgSeccompRule *rules, int count, uint32 def_action)
+{
+	scmp_filter_ctx ctx;
+	int rc = -1;
+
+	/* Create a context with a default action for syscalls not in the list */
+	if ((ctx = seccomp_init(def_action)) == NULL)
+		goto cleanup;
+
+	for (int i = 0; i < count; i++)
+	{
+		PgSeccompRule *rule = &rules[i];
+		if ((rc = seccomp_rule_add(ctx, rule->psr_action, rule->psr_syscall, 0)) != 0)
+			goto cleanup;
+	}
+
+	/* Try building & loading the program into the kernel */
+	if ((rc = seccomp_load(ctx)) != 0)
+		goto cleanup;
+
+cleanup:
+	/*
+	 * We don't need the context anymore regardless of the result,
+	 * since either we failed or the eBPF program has already been
+	 * loaded into the linux kernel.
+	 */
+	seccomp_release(ctx);
+	return rc;
+}
+
+static void
+die(int code, const char *str)
+{
+	/* work around gcc ignoring that it shouldn't warn on (void) result being unused */
+	ssize_t _unused pg_attribute_unused();
+	/* Best effort write to stderr */
+	_unused = write(fileno(stderr), str, strlen(str));
+
+	/* XXX: we don't want to run any atexit callbacks */
+	_exit(code);
+}
+
+static void
+seccomp_test_sighandler(int signum, siginfo_t *info, void *cxt pg_attribute_unused())
+{
+#define DIE_PREFIX "seccomp test signal handler: "
+
+	/* Check that this signal handler is used only for a single test case */
+	if (seccomp_test_sighandler_done)
+		die(1, DIE_PREFIX "test handler should only be used for 1 test\n");
+	seccomp_test_sighandler_done = true;
+
+	if (signum != SIGSYS)
+		die(1, DIE_PREFIX "bad signal number\n");
+
+	/* TODO: maybe somehow extract the hardcoded syscall number */
+	if (info->si_syscall != SCMP_SYS(open))
+		die(1, DIE_PREFIX "bad syscall number\n");
+
+#undef DIE_PREFIX
+}
+
+static void
+seccomp_deny_sighandler(int signum, siginfo_t *info, void *cxt pg_attribute_unused())
+{
+	/*
+	 * Unfortunately, we can't use seccomp_syscall_resolve_num_arch()
+	 * to resolve the syscall's name, since it calls strdup()
+	 * under the hood (wtf!).
+	 */
+	char buffer[128];
+	(void)snprintf(buffer, lengthof(buffer),
+			"---------------------------------------\n"
+			"seccomp: bad syscall %d\n"
+			"---------------------------------------\n",
+			info->si_syscall);
+
+	/*
+	 * Instead of silently crashing the process with
+	 * a fake SIGSYS caused by SCMP_ACT_KILL_PROCESS,
+	 * we'd like to receive a real SIGSYS to print the
+	 * message and *then* immediately exit.
+	 */
+	die(1, buffer);
+}
+
+#endif		/* HAVE_LIBSECCOMP */
--- a/pgxn/neon_walredo/walredoproc.c
+++ b/pgxn/neon_walredo/walredoproc.c
@@ -0,0 +1,847 @@
+/*-------------------------------------------------------------------------
+ *
+ * walredoproc.c
+ *	  Entry point for WAL redo helper
+ *
+ *
+ * This file contains an alternative main() function for the 'postgres'
+ * binary. In the special mode, we go into a special mode that's similar
+ * to the single user mode. We don't launch postmaster or any auxiliary
+ * processes. Instead, we wait for command from 'stdin', and respond to
+ * 'stdout'.
+ *
+ * The protocol through stdin/stdout is loosely based on the libpq protocol.
+ * The process accepts messages through stdin, and each message has the format:
+ *
+ * char   msgtype;
+ * int32  length; // length of message including 'length' but excluding
+ *                // 'msgtype', in network byte order
+ * <payload>
+ *
+ * There are three message types:
+ *
+ * BeginRedoForBlock ('B'): Prepare for WAL replay for given block
+ * PushPage ('P'): Copy a page image (in the payload) to buffer cache
+ * ApplyRecord ('A'): Apply a WAL record (in the payload)
+ * GetPage ('G'): Return a page image from buffer cache.
+ *
+ * Currently, you only get a response to GetPage requests; the response is
+ * simply a 8k page, without any headers. Errors are logged to stderr.
+ *
+ * FIXME:
+ * - this currently requires a valid PGDATA, and creates a lock file there
+ *   like a normal postmaster. There's no fundamental reason for that, though.
+ * - should have EndRedoForBlock, and flush page cache, to allow using this
+ *   mechanism for more than one block without restarting the process.
+ *
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <fcntl.h>
+#include <limits.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#ifdef HAVE_SYS_SELECT_H
+#include <sys/select.h>
+#endif
+#ifdef HAVE_SYS_RESOURCE_H
+#include <sys/time.h>
+#include <sys/resource.h>
+#endif
+
+#if defined(HAVE_LIBSECCOMP) && defined(__GLIBC__)
+#define MALLOC_NO_MMAP
+#include <malloc.h>
+#endif
+
+#ifndef HAVE_GETRUSAGE
+#include "rusagestub.h"
+#endif
+
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#if PG_VERSION_NUM >= 150000
+#include "access/xlogrecovery.h"
+#endif
+#include "access/xlogutils.h"
+#include "catalog/pg_class.h"
+#include "libpq/libpq.h"
+#include "libpq/pqformat.h"
+#include "miscadmin.h"
+#include "postmaster/postmaster.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "tcop/tcopprot.h"
+#include "utils/memutils.h"
+#include "utils/ps_status.h"
+
+#include "inmem_smgr.h"
+
+#ifdef HAVE_LIBSECCOMP
+#include "neon_seccomp.h"
+#endif
+
+PG_MODULE_MAGIC;
+
+static int	ReadRedoCommand(StringInfo inBuf);
+static void BeginRedoForBlock(StringInfo input_message);
+static void PushPage(StringInfo input_message);
+static void ApplyRecord(StringInfo input_message);
+static void apply_error_callback(void *arg);
+static bool redo_block_filter(XLogReaderState *record, uint8 block_id);
+static void GetPage(StringInfo input_message);
+static ssize_t buffered_read(void *buf, size_t count);
+
+static BufferTag target_redo_tag;
+
+static XLogReaderState *reader_state;
+
+#define TRACE DEBUG5
+
+#ifdef HAVE_LIBSECCOMP
+static void
+enter_seccomp_mode(void)
+{
+	PgSeccompRule syscalls[] =
+	{
+		/* Hard requirements */
+		PG_SCMP_ALLOW(exit_group),
+		PG_SCMP_ALLOW(pselect6),
+		PG_SCMP_ALLOW(read),
+		PG_SCMP_ALLOW(select),
+		PG_SCMP_ALLOW(write),
+
+		/* Memory allocation */
+		PG_SCMP_ALLOW(brk),
+#ifndef MALLOC_NO_MMAP
+		/* TODO: musl doesn't have mallopt */
+		PG_SCMP_ALLOW(mmap),
+		PG_SCMP_ALLOW(munmap),
+#endif
+		/*
+		 * getpid() is called on assertion failure, in ExceptionalCondition.
+		 * It's not really needed, but seems pointless to hide it either. The
+		 * system call unlikely to expose a kernel vulnerability, and the PID
+		 * is stored in MyProcPid anyway.
+		 */
+		PG_SCMP_ALLOW(getpid),
+
+		/* Enable those for a proper shutdown.
+		PG_SCMP_ALLOW(munmap),
+		PG_SCMP_ALLOW(shmctl),
+		PG_SCMP_ALLOW(shmdt),
+		PG_SCMP_ALLOW(unlink), // shm_unlink
+		*/
+	};
+
+#ifdef MALLOC_NO_MMAP
+	/* Ask glibc not to use mmap() */
+	mallopt(M_MMAP_MAX, 0);
+#endif
+
+	seccomp_load_rules(syscalls, lengthof(syscalls));
+}
+#endif /* HAVE_LIBSECCOMP */
+
+/*
+ * Entry point for the WAL redo process.
+ *
+ * Performs similar initialization as PostgresMain does for normal
+ * backend processes. Some initialization was done in CallExtMain
+ * already.
+ */
+void
+WalRedoMain(int argc, char *argv[])
+{
+	int			firstchar;
+	StringInfoData input_message;
+#ifdef HAVE_LIBSECCOMP
+	bool		enable_seccomp;
+#endif
+
+	am_wal_redo_postgres = true;
+
+	/*
+	 * WAL redo does not need a large number of buffers. And speed of
+	 * DropRelFileNodeAllLocalBuffers() is proportional to the number of
+	 * buffers. So let's keep it small (default value is 1024)
+	 */
+	num_temp_buffers = 4;
+
+	/*
+	 * install the simple in-memory smgr
+	 */
+	smgr_hook = smgr_inmem;
+	smgr_init_hook = smgr_init_inmem;
+
+	/*
+	 * Validate we have been given a reasonable-looking DataDir and change into it.
+	 */
+	checkDataDir();
+	ChangeToDataDir();
+
+	/*
+	 * Create lockfile for data directory.
+	 */
+	CreateDataDirLockFile(false);
+
+	/* read control file (error checking and contains config ) */
+	LocalProcessControlFile(false);
+
+	/*
+	 * process any libraries that should be preloaded at postmaster start
+	 */
+	process_shared_preload_libraries();
+
+	/* Initialize MaxBackends (if under postmaster, was done already) */
+	InitializeMaxBackends();
+
+#if PG_VERSION_NUM >= 150000
+	/*
+	 * Give preloaded libraries a chance to request additional shared memory.
+	 */
+	process_shmem_requests();
+
+	/*
+	 * Now that loadable modules have had their chance to request additional
+	 * shared memory, determine the value of any runtime-computed GUCs that
+	 * depend on the amount of shared memory required.
+	 */
+	InitializeShmemGUCs();
+
+	/*
+	 * Now that modules have been loaded, we can process any custom resource
+	 * managers specified in the wal_consistency_checking GUC.
+	 */
+	InitializeWalConsistencyChecking();
+#endif
+
+	CreateSharedMemoryAndSemaphores();
+
+	/*
+	 * Remember stand-alone backend startup time,roughly at the same point
+	 * during startup that postmaster does so.
+	 */
+	PgStartTime = GetCurrentTimestamp();
+
+	/*
+	 * Create a per-backend PGPROC struct in shared memory. We must do
+	 * this before we can use LWLocks.
+	 */
+	InitAuxiliaryProcess();
+
+	SetProcessingMode(NormalProcessing);
+
+	/* Redo routines won't work if we're not "in recovery" */
+	InRecovery = true;
+
+	/*
+	 * Create the memory context we will use in the main loop.
+	 *
+	 * MessageContext is reset once per iteration of the main loop, ie, upon
+	 * completion of processing of each command message from the client.
+	 */
+	MessageContext = AllocSetContextCreate(TopMemoryContext,
+										   "MessageContext",
+										   ALLOCSET_DEFAULT_SIZES);
+
+	/* we need a ResourceOwner to hold buffer pins */
+	Assert(CurrentResourceOwner == NULL);
+	CurrentResourceOwner = ResourceOwnerCreate(NULL, "wal redo");
+
+	/* Initialize resource managers */
+	for (int rmid = 0; rmid <= RM_MAX_ID; rmid++)
+	{
+		if (RmgrTable[rmid].rm_startup != NULL)
+			RmgrTable[rmid].rm_startup();
+	}
+	reader_state = XLogReaderAllocate(wal_segment_size, NULL, XL_ROUTINE(), NULL);
+
+#ifdef HAVE_LIBSECCOMP
+	/* We prefer opt-out to opt-in for greater security */
+	enable_seccomp = true;
+	for (int i = 1; i < argc; i++)
+		if (strcmp(argv[i], "--disable-seccomp") == 0)
+			enable_seccomp = false;
+
+	/*
+	 * We deliberately delay the transition to the seccomp mode
+	 * until it's time to enter the main processing loop;
+	 * else we'd have to add a lot more syscalls to the allowlist.
+	 */
+	if (enable_seccomp)
+		enter_seccomp_mode();
+#endif /* HAVE_LIBSECCOMP */
+
+	/*
+	 * Main processing loop
+	 */
+	MemoryContextSwitchTo(MessageContext);
+	initStringInfo(&input_message);
+
+	for (;;)
+	{
+		/* Release memory left over from prior query cycle. */
+		resetStringInfo(&input_message);
+
+		set_ps_display("idle");
+
+		/*
+		 * (3) read a command (loop blocks here)
+		 */
+		firstchar = ReadRedoCommand(&input_message);
+		switch (firstchar)
+		{
+			case 'B':			/* BeginRedoForBlock */
+				BeginRedoForBlock(&input_message);
+				break;
+
+			case 'P':			/* PushPage */
+				PushPage(&input_message);
+				break;
+
+			case 'A':			/* ApplyRecord */
+				ApplyRecord(&input_message);
+				break;
+
+			case 'G':			/* GetPage */
+				GetPage(&input_message);
+				break;
+
+				/*
+				 * EOF means we're done. Perform normal shutdown.
+				 */
+			case EOF:
+				ereport(LOG,
+						(errmsg("received EOF on stdin, shutting down")));
+
+#ifdef HAVE_LIBSECCOMP
+				/*
+				 * Skip the shutdown sequence, leaving some garbage behind.
+				 * Hopefully, postgres will clean it up in the next run.
+				 * This way we don't have to enable extra syscalls, which is nice.
+				 * See enter_seccomp_mode() above.
+				 */
+				if (enable_seccomp)
+					_exit(0);
+#endif /* HAVE_LIBSECCOMP */
+				/*
+				 * NOTE: if you are tempted to add more code here, DON'T!
+				 * Whatever you had in mind to do should be set up as an
+				 * on_proc_exit or on_shmem_exit callback, instead. Otherwise
+				 * it will fail to be called during other backend-shutdown
+				 * scenarios.
+				 */
+				proc_exit(0);
+
+			default:
+				ereport(FATAL,
+						(errcode(ERRCODE_PROTOCOL_VIOLATION),
+						 errmsg("invalid frontend message type %d",
+								firstchar)));
+		}
+	}							/* end of input-reading loop */
+}
+
+
+/* Version compatility wrapper for ReadBufferWithoutRelcache */
+static inline Buffer
+NeonRedoReadBuffer(RelFileNode rnode,
+		   ForkNumber forkNum, BlockNumber blockNum,
+		   ReadBufferMode mode)
+{
+#if PG_VERSION_NUM >= 150000
+	return ReadBufferWithoutRelcache(rnode, forkNum, blockNum, mode,
+									 NULL, /* no strategy */
+									 true); /* WAL redo is only performed on permanent rels */
+#else
+	return ReadBufferWithoutRelcache(rnode, forkNum, blockNum, mode,
+									 NULL); /* no strategy */
+#endif
+}
+
+
+/*
+ * Some debug function that may be handy for now.
+ */
+pg_attribute_unused()
+static char *
+pprint_buffer(char *data, int len)
+{
+	StringInfoData s;
+
+	initStringInfo(&s);
+	appendStringInfo(&s, "\n");
+	for (int i = 0; i < len; i++) {
+
+		appendStringInfo(&s, "%02x ", (*(((char *) data) + i) & 0xff) );
+		if (i % 32 == 31) {
+			appendStringInfo(&s, "\n");
+		}
+	}
+	appendStringInfo(&s, "\n");
+
+	return s.data;
+}
+
+/* ----------------------------------------------------------------
+ *		routines to obtain user input
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Read next command from the client.
+ *
+ *	the string entered by the user is placed in its parameter inBuf,
+ *	and we act like a Q message was received.
+ *
+ *	EOF is returned if end-of-file input is seen; time to shut down.
+ * ----------------
+ */
+static int
+ReadRedoCommand(StringInfo inBuf)
+{
+	ssize_t		ret;
+	char		hdr[1 + sizeof(int32)];
+	int			qtype;
+	int32		len;
+
+	/* Read message type and message length */
+	ret = buffered_read(hdr, sizeof(hdr));
+	if (ret != sizeof(hdr))
+	{
+		if (ret == 0)
+			return EOF;
+		else if (ret < 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_CONNECTION_FAILURE),
+					 errmsg("could not read message header: %m")));
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected EOF")));
+	}
+
+	qtype = hdr[0];
+	memcpy(&len, &hdr[1], sizeof(int32));
+	len = pg_ntoh32(len);
+
+	if (len < 4)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg("invalid message length")));
+
+	len -= 4;					/* discount length itself */
+
+	/* Read the message payload */
+	enlargeStringInfo(inBuf, len);
+	ret = buffered_read(inBuf->data, len);
+	if (ret != len)
+	{
+		if (ret < 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_CONNECTION_FAILURE),
+					 errmsg("could not read message: %m")));
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected EOF")));
+	}
+	inBuf->len = len;
+	inBuf->data[len] = '\0';
+
+	return qtype;
+}
+
+/*
+ * Prepare for WAL replay on given block
+ */
+static void
+BeginRedoForBlock(StringInfo input_message)
+{
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blknum;
+	SMgrRelation reln;
+
+	/*
+	 * message format:
+	 *
+	 * spcNode
+	 * dbNode
+	 * relNode
+	 * ForkNumber
+	 * BlockNumber
+	 */
+	forknum = pq_getmsgbyte(input_message);
+	rnode.spcNode = pq_getmsgint(input_message, 4);
+	rnode.dbNode = pq_getmsgint(input_message, 4);
+	rnode.relNode = pq_getmsgint(input_message, 4);
+	blknum = pq_getmsgint(input_message, 4);
+	wal_redo_buffer = InvalidBuffer;
+
+	INIT_BUFFERTAG(target_redo_tag, rnode, forknum, blknum);
+
+	elog(TRACE, "BeginRedoForBlock %u/%u/%u.%d blk %u",
+		 target_redo_tag.rnode.spcNode,
+		 target_redo_tag.rnode.dbNode,
+		 target_redo_tag.rnode.relNode,
+		 target_redo_tag.forkNum,
+		 target_redo_tag.blockNum);
+
+	reln = smgropen(rnode, InvalidBackendId, RELPERSISTENCE_PERMANENT);
+	if (reln->smgr_cached_nblocks[forknum] == InvalidBlockNumber ||
+		reln->smgr_cached_nblocks[forknum] < blknum + 1)
+	{
+		reln->smgr_cached_nblocks[forknum] = blknum + 1;
+	}
+}
+
+/*
+ * Receive a page given by the client, and put it into buffer cache.
+ */
+static void
+PushPage(StringInfo input_message)
+{
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blknum;
+	const char *content;
+	Buffer		buf;
+	Page		page;
+
+	/*
+	 * message format:
+	 *
+	 * spcNode
+	 * dbNode
+	 * relNode
+	 * ForkNumber
+	 * BlockNumber
+	 * 8k page content
+	 */
+	forknum = pq_getmsgbyte(input_message);
+	rnode.spcNode = pq_getmsgint(input_message, 4);
+	rnode.dbNode = pq_getmsgint(input_message, 4);
+	rnode.relNode = pq_getmsgint(input_message, 4);
+	blknum = pq_getmsgint(input_message, 4);
+	content = pq_getmsgbytes(input_message, BLCKSZ);
+
+	buf = NeonRedoReadBuffer(rnode, forknum, blknum, RBM_ZERO_AND_LOCK);
+	wal_redo_buffer = buf;
+	page = BufferGetPage(buf);
+	memcpy(page, content, BLCKSZ);
+	MarkBufferDirty(buf); /* pro forma */
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Receive a WAL record, and apply it.
+ *
+ * All the pages should be loaded into the buffer cache by PushPage calls already.
+ */
+static void
+ApplyRecord(StringInfo input_message)
+{
+	char	   *errormsg;
+	XLogRecPtr	lsn;
+	XLogRecord *record;
+	int			nleft;
+	ErrorContextCallback errcallback;
+#if PG_VERSION_NUM >= 150000
+	DecodedXLogRecord *decoded;
+#endif
+
+	/*
+	 * message format:
+	 *
+	 * LSN (the *end* of the record)
+	 * record
+	 */
+	lsn = pq_getmsgint64(input_message);
+
+	smgrinit();					/* reset inmem smgr state */
+
+	/* note: the input must be aligned here */
+	record = (XLogRecord *) pq_getmsgbytes(input_message, sizeof(XLogRecord));
+
+	nleft = input_message->len - input_message->cursor;
+	if (record->xl_tot_len != sizeof(XLogRecord) + nleft)
+		elog(ERROR, "mismatch between record (%d) and message size (%d)",
+			 record->xl_tot_len, (int) sizeof(XLogRecord) + nleft);
+
+	/* Setup error traceback support for ereport() */
+	errcallback.callback = apply_error_callback;
+	errcallback.arg = (void *) reader_state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	XLogBeginRead(reader_state, lsn);
+
+#if PG_VERSION_NUM >= 150000
+	decoded = (DecodedXLogRecord *) XLogReadRecordAlloc(reader_state, record->xl_tot_len, true);
+
+	if (!DecodeXLogRecord(reader_state, decoded, record, lsn, &errormsg))
+		elog(ERROR, "failed to decode WAL record: %s", errormsg);
+	else
+	{
+		/* Record the location of the next record. */
+		decoded->next_lsn = reader_state->NextRecPtr;
+
+		/*
+		 * If it's in the decode buffer, mark the decode buffer space as
+		 * occupied.
+		 */
+		if (!decoded->oversized)
+		{
+			/* The new decode buffer head must be MAXALIGNed. */
+			Assert(decoded->size == MAXALIGN(decoded->size));
+			if ((char *) decoded == reader_state->decode_buffer)
+				reader_state->decode_buffer_tail = reader_state->decode_buffer + decoded->size;
+			else
+				reader_state->decode_buffer_tail += decoded->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(reader_state->decode_queue_tail != decoded);
+		if (reader_state->decode_queue_tail)
+			reader_state->decode_queue_tail->next = decoded;
+		reader_state->decode_queue_tail = decoded;
+		if (!reader_state->decode_queue_head)
+			reader_state->decode_queue_head = decoded;
+
+		/*
+		 * Update the pointers to the beginning and one-past-the-end of this
+		 * record, again for the benefit of historical code that expected the
+		 * decoder to track this rather than accessing these fields of the record
+		 * itself.
+		 */
+		reader_state->record = reader_state->decode_queue_head;
+		reader_state->ReadRecPtr = reader_state->record->lsn;
+		reader_state->EndRecPtr = reader_state->record->next_lsn;
+	}
+#else
+	/*
+	 * In lieu of calling XLogReadRecord, store the record 'decoded_record'
+	 * buffer directly.
+	 */
+	reader_state->ReadRecPtr = lsn;
+	reader_state->decoded_record = record;
+	if (!DecodeXLogRecord(reader_state, record, &errormsg))
+		elog(ERROR, "failed to decode WAL record: %s", errormsg);
+#endif
+
+	/* Ignore any other blocks than the ones the caller is interested in */
+	redo_read_buffer_filter = redo_block_filter;
+
+	RmgrTable[record->xl_rmid].rm_redo(reader_state);
+
+	/*
+	 * If no base image of the page was provided by PushPage, initialize
+	 * wal_redo_buffer here. The first WAL record must initialize the page
+	 * in that case.
+	 */
+	if (BufferIsInvalid(wal_redo_buffer))
+	{
+		wal_redo_buffer = NeonRedoReadBuffer(target_redo_tag.rnode,
+											 target_redo_tag.forkNum,
+											 target_redo_tag.blockNum,
+											 RBM_NORMAL);
+		Assert(!BufferIsInvalid(wal_redo_buffer));
+		ReleaseBuffer(wal_redo_buffer);
+	}
+
+	redo_read_buffer_filter = NULL;
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	elog(TRACE, "applied WAL record with LSN %X/%X",
+		 (uint32) (lsn >> 32), (uint32) lsn);
+#if PG_VERSION_NUM >= 150000
+	if (decoded && decoded->oversized)
+		pfree(decoded);
+#endif
+}
+
+/*
+ * Error context callback for errors occurring during ApplyRecord
+ */
+static void
+apply_error_callback(void *arg)
+{
+	XLogReaderState *record = (XLogReaderState *) arg;
+	StringInfoData buf;
+
+	initStringInfo(&buf);
+	xlog_outdesc(&buf, record);
+
+	/* translator: %s is a WAL record description */
+	errcontext("WAL redo at %X/%X for %s",
+			   LSN_FORMAT_ARGS(record->ReadRecPtr),
+			   buf.data);
+
+	pfree(buf.data);
+}
+
+
+
+static bool
+redo_block_filter(XLogReaderState *record, uint8 block_id)
+{
+	BufferTag	target_tag;
+
+#if PG_VERSION_NUM >= 150000
+	XLogRecGetBlockTag(record, block_id,
+					   &target_tag.rnode, &target_tag.forkNum, &target_tag.blockNum);
+#else
+	if (!XLogRecGetBlockTag(record, block_id,
+							&target_tag.rnode, &target_tag.forkNum, &target_tag.blockNum))
+	{
+		/* Caller specified a bogus block_id */
+		elog(PANIC, "failed to locate backup block with ID %d", block_id);
+	}
+#endif
+
+	/*
+	 * Can a WAL redo function ever access a relation other than the one that
+	 * it modifies? I don't see why it would.
+	 */
+	if (!RelFileNodeEquals(target_tag.rnode, target_redo_tag.rnode))
+		elog(WARNING, "REDO accessing unexpected page: %u/%u/%u.%u blk %u",
+			 target_tag.rnode.spcNode, target_tag.rnode.dbNode, target_tag.rnode.relNode, target_tag.forkNum, target_tag.blockNum);
+
+	/*
+	 * If this block isn't one we are currently restoring, then return 'true'
+	 * so that this gets ignored
+	 */
+	return !BUFFERTAGS_EQUAL(target_tag, target_redo_tag);
+}
+
+/*
+ * Get a page image back from buffer cache.
+ *
+ * After applying some records.
+ */
+static void
+GetPage(StringInfo input_message)
+{
+	RelFileNode rnode;
+	ForkNumber forknum;
+	BlockNumber blknum;
+	Buffer		buf;
+	Page		page;
+	int			tot_written;
+
+	/*
+	 * message format:
+	 *
+	 * spcNode
+	 * dbNode
+	 * relNode
+	 * ForkNumber
+	 * BlockNumber
+	 */
+	forknum = pq_getmsgbyte(input_message);
+	rnode.spcNode = pq_getmsgint(input_message, 4);
+	rnode.dbNode = pq_getmsgint(input_message, 4);
+	rnode.relNode = pq_getmsgint(input_message, 4);
+	blknum = pq_getmsgint(input_message, 4);
+
+	/* FIXME: check that we got a BeginRedoForBlock message or this earlier */
+
+	buf = NeonRedoReadBuffer(rnode, forknum, blknum, RBM_NORMAL);
+	Assert(buf == wal_redo_buffer);
+	page = BufferGetPage(buf);
+	/* single thread, so don't bother locking the page */
+
+	/* Response: Page content */
+	tot_written = 0;
+	do {
+		ssize_t		rc;
+
+		rc = write(STDOUT_FILENO, &page[tot_written], BLCKSZ - tot_written);
+		if (rc < 0) {
+			/* If interrupted by signal, just retry */
+			if (errno == EINTR)
+				continue;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write to stdout: %m")));
+		}
+		tot_written += rc;
+	} while (tot_written < BLCKSZ);
+
+	ReleaseBuffer(buf);
+	DropRelFileNodeAllLocalBuffers(rnode);
+	wal_redo_buffer = InvalidBuffer;
+
+	elog(TRACE, "Page sent back for block %u", blknum);
+}
+
+
+/* Buffer used by buffered_read() */
+static char stdin_buf[16 * 1024];
+static size_t stdin_len = 0;	/* # of bytes in buffer */
+static size_t stdin_ptr = 0;	/* # of bytes already consumed */
+
+/*
+ * Like read() on stdin, but buffered.
+ *
+ * We cannot use libc's buffered fread(), because it uses syscalls that we
+ * have disabled with seccomp(). Depending on the platform, it can call
+ * 'fstat' or 'newfstatat'. 'fstat' is probably harmless, but 'newfstatat'
+ * seems problematic because it allows interrogating files by path name.
+ *
+ * The return value is the number of bytes read. On error, -1 is returned, and
+ * errno is set appropriately. Unlike read(), this fills the buffer completely
+ * unless an error happens or EOF is reached.
+ */
+static ssize_t
+buffered_read(void *buf, size_t count)
+{
+	char	   *dst = buf;
+
+	while (count > 0)
+	{
+		size_t		nthis;
+
+		if (stdin_ptr == stdin_len)
+		{
+			ssize_t		ret;
+
+			ret = read(STDIN_FILENO, stdin_buf, sizeof(stdin_buf));
+			if (ret < 0)
+			{
+				/* don't do anything here that could set 'errno' */
+				return ret;
+			}
+			if (ret == 0)
+			{
+				/* EOF */
+				break;
+			}
+			stdin_len = (size_t) ret;
+			stdin_ptr = 0;
+		}
+		nthis = Min(stdin_len - stdin_ptr, count);
+
+		memcpy(dst, &stdin_buf[stdin_ptr], nthis);
+
+		stdin_ptr += nthis;
+		count -= nthis;
+		dst += nthis;
+	}
+
+	return (dst - (char *) buf);
+}
--- a/poetry.lock
+++ b/poetry.lock
@@ -583,7 +583,7 @@ python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"

 [[package]]
 name = "cryptography"
-version = "37.0.4"
+version = "38.0.3"
 description = "cryptography is a package which provides cryptographic recipes and primitives to Python developers."
 category = "main"
 optional = false
@@ -593,10 +593,10 @@ python-versions = ">=3.6"
 cffi = ">=1.12"

 [package.extras]
-docs = ["sphinx (>=1.6.5,!=1.8.0,!=3.1.0,!=3.1.1)", "sphinx_rtd_theme"]
+docs = ["sphinx (>=1.6.5,!=1.8.0,!=3.1.0,!=3.1.1)", "sphinx-rtd-theme"]
 docstest = ["pyenchant (>=1.6.11)", "sphinxcontrib-spelling (>=4.0.1)", "twine (>=1.12.0)"]
 pep8test = ["black", "flake8", "flake8-import-order", "pep8-naming"]
-sdist = ["setuptools_rust (>=0.11.4)"]
+sdist = ["setuptools-rust (>=0.11.4)"]
 ssh = ["bcrypt (>=3.1.5)"]
 test = ["hypothesis (>=1.11.4,!=3.79.2)", "iso8601", "pretend", "pytest (>=6.2.0)", "pytest-benchmark", "pytest-cov", "pytest-subtests", "pytest-xdist", "pytz"]

@@ -1568,7 +1568,7 @@ testing = ["func-timeout", "jaraco.itertools", "pytest (>=6)", "pytest-black (>=
 [metadata]
 lock-version = "1.1"
 python-versions = "^3.9"
-content-hash = "17cdbfe90f1b06dffaf24c3e076384ec08dd4a2dce5a05e50565f7364932eb2d"
+content-hash = "9352a89d49d34807f6a58f6c3f898acbd8cf3570e0f45ede973673644bde4d0e"

 [metadata.files]
 aiopg = [
@@ -1750,28 +1750,32 @@ colorama = [
    {file = "colorama-0.4.5.tar.gz", hash = "sha256:e6c6b4334fc50988a639d9b98aa429a0b57da6e17b9a44f0451f930b6967b7a4"},
 ]
 cryptography = [
-    {file = "cryptography-37.0.4-cp36-abi3-macosx_10_10_universal2.whl", hash = "sha256:549153378611c0cca1042f20fd9c5030d37a72f634c9326e225c9f666d472884"},
-    {file = "cryptography-37.0.4-cp36-abi3-macosx_10_10_x86_64.whl", hash = "sha256:a958c52505c8adf0d3822703078580d2c0456dd1d27fabfb6f76fe63d2971cd6"},
-    {file = "cryptography-37.0.4-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:f721d1885ecae9078c3f6bbe8a88bc0786b6e749bf32ccec1ef2b18929a05046"},
-    {file = "cryptography-37.0.4-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_24_aarch64.whl", hash = "sha256:3d41b965b3380f10e4611dbae366f6dc3cefc7c9ac4e8842a806b9672ae9add5"},
-    {file = "cryptography-37.0.4-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:80f49023dd13ba35f7c34072fa17f604d2f19bf0989f292cedf7ab5770b87a0b"},
-    {file = "cryptography-37.0.4-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f2dcb0b3b63afb6df7fd94ec6fbddac81b5492513f7b0436210d390c14d46ee8"},
-    {file = "cryptography-37.0.4-cp36-abi3-manylinux_2_24_x86_64.whl", hash = "sha256:b7f8dd0d4c1f21759695c05a5ec8536c12f31611541f8904083f3dc582604280"},
-    {file = "cryptography-37.0.4-cp36-abi3-musllinux_1_1_aarch64.whl", hash = "sha256:30788e070800fec9bbcf9faa71ea6d8068f5136f60029759fd8c3efec3c9dcb3"},
-    {file = "cryptography-37.0.4-cp36-abi3-musllinux_1_1_x86_64.whl", hash = "sha256:190f82f3e87033821828f60787cfa42bff98404483577b591429ed99bed39d59"},
-    {file = "cryptography-37.0.4-cp36-abi3-win32.whl", hash = "sha256:b62439d7cd1222f3da897e9a9fe53bbf5c104fff4d60893ad1355d4c14a24157"},
-    {file = "cryptography-37.0.4-cp36-abi3-win_amd64.whl", hash = "sha256:f7a6de3e98771e183645181b3627e2563dcde3ce94a9e42a3f427d2255190327"},
-    {file = "cryptography-37.0.4-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6bc95ed67b6741b2607298f9ea4932ff157e570ef456ef7ff0ef4884a134cc4b"},
-    {file = "cryptography-37.0.4-pp37-pypy37_pp73-manylinux_2_24_x86_64.whl", hash = "sha256:f8c0a6e9e1dd3eb0414ba320f85da6b0dcbd543126e30fcc546e7372a7fbf3b9"},
-    {file = "cryptography-37.0.4-pp38-pypy38_pp73-macosx_10_10_x86_64.whl", hash = "sha256:e007f052ed10cc316df59bc90fbb7ff7950d7e2919c9757fd42a2b8ecf8a5f67"},
-    {file = "cryptography-37.0.4-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7bc997818309f56c0038a33b8da5c0bfbb3f1f067f315f9abd6fc07ad359398d"},
-    {file = "cryptography-37.0.4-pp38-pypy38_pp73-manylinux_2_24_x86_64.whl", hash = "sha256:d204833f3c8a33bbe11eda63a54b1aad7aa7456ed769a982f21ec599ba5fa282"},
-    {file = "cryptography-37.0.4-pp38-pypy38_pp73-win_amd64.whl", hash = "sha256:75976c217f10d48a8b5a8de3d70c454c249e4b91851f6838a4e48b8f41eb71aa"},
-    {file = "cryptography-37.0.4-pp39-pypy39_pp73-macosx_10_10_x86_64.whl", hash = "sha256:7099a8d55cd49b737ffc99c17de504f2257e3787e02abe6d1a6d136574873441"},
-    {file = "cryptography-37.0.4-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2be53f9f5505673eeda5f2736bea736c40f051a739bfae2f92d18aed1eb54596"},
-    {file = "cryptography-37.0.4-pp39-pypy39_pp73-manylinux_2_24_x86_64.whl", hash = "sha256:91ce48d35f4e3d3f1d83e29ef4a9267246e6a3be51864a5b7d2247d5086fa99a"},
-    {file = "cryptography-37.0.4-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:4c590ec31550a724ef893c50f9a97a0c14e9c851c85621c5650d699a7b88f7ab"},
-    {file = "cryptography-37.0.4.tar.gz", hash = "sha256:63f9c17c0e2474ccbebc9302ce2f07b55b3b3fcb211ded18a42d5764f5c10a82"},
+    {file = "cryptography-38.0.3-cp36-abi3-macosx_10_10_universal2.whl", hash = "sha256:984fe150f350a3c91e84de405fe49e688aa6092b3525f407a18b9646f6612320"},
+    {file = "cryptography-38.0.3-cp36-abi3-macosx_10_10_x86_64.whl", hash = "sha256:ed7b00096790213e09eb11c97cc6e2b757f15f3d2f85833cd2d3ec3fe37c1722"},
+    {file = "cryptography-38.0.3-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_24_aarch64.whl", hash = "sha256:bbf203f1a814007ce24bd4d51362991d5cb90ba0c177a9c08825f2cc304d871f"},
+    {file = "cryptography-38.0.3-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:554bec92ee7d1e9d10ded2f7e92a5d70c1f74ba9524947c0ba0c850c7b011828"},
+    {file = "cryptography-38.0.3-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b1b52c9e5f8aa2b802d48bd693190341fae201ea51c7a167d69fc48b60e8a959"},
+    {file = "cryptography-38.0.3-cp36-abi3-manylinux_2_24_x86_64.whl", hash = "sha256:728f2694fa743a996d7784a6194da430f197d5c58e2f4e278612b359f455e4a2"},
+    {file = "cryptography-38.0.3-cp36-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:dfb4f4dd568de1b6af9f4cda334adf7d72cf5bc052516e1b2608b683375dd95c"},
+    {file = "cryptography-38.0.3-cp36-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:5419a127426084933076132d317911e3c6eb77568a1ce23c3ac1e12d111e61e0"},
+    {file = "cryptography-38.0.3-cp36-abi3-musllinux_1_1_aarch64.whl", hash = "sha256:9b24bcff7853ed18a63cfb0c2b008936a9554af24af2fb146e16d8e1aed75748"},
+    {file = "cryptography-38.0.3-cp36-abi3-musllinux_1_1_x86_64.whl", hash = "sha256:25c1d1f19729fb09d42e06b4bf9895212292cb27bb50229f5aa64d039ab29146"},
+    {file = "cryptography-38.0.3-cp36-abi3-win32.whl", hash = "sha256:7f836217000342d448e1c9a342e9163149e45d5b5eca76a30e84503a5a96cab0"},
+    {file = "cryptography-38.0.3-cp36-abi3-win_amd64.whl", hash = "sha256:c46837ea467ed1efea562bbeb543994c2d1f6e800785bd5a2c98bc096f5cb220"},
+    {file = "cryptography-38.0.3-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:06fc3cc7b6f6cca87bd56ec80a580c88f1da5306f505876a71c8cfa7050257dd"},
+    {file = "cryptography-38.0.3-pp37-pypy37_pp73-manylinux_2_24_x86_64.whl", hash = "sha256:65535bc550b70bd6271984d9863a37741352b4aad6fb1b3344a54e6950249b55"},
+    {file = "cryptography-38.0.3-pp37-pypy37_pp73-manylinux_2_28_x86_64.whl", hash = "sha256:5e89468fbd2fcd733b5899333bc54d0d06c80e04cd23d8c6f3e0542358c6060b"},
+    {file = "cryptography-38.0.3-pp38-pypy38_pp73-macosx_10_10_x86_64.whl", hash = "sha256:6ab9516b85bebe7aa83f309bacc5f44a61eeb90d0b4ec125d2d003ce41932d36"},
+    {file = "cryptography-38.0.3-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:068147f32fa662c81aebab95c74679b401b12b57494872886eb5c1139250ec5d"},
+    {file = "cryptography-38.0.3-pp38-pypy38_pp73-manylinux_2_24_x86_64.whl", hash = "sha256:402852a0aea73833d982cabb6d0c3bb582c15483d29fb7085ef2c42bfa7e38d7"},
+    {file = "cryptography-38.0.3-pp38-pypy38_pp73-manylinux_2_28_x86_64.whl", hash = "sha256:b1b35d9d3a65542ed2e9d90115dfd16bbc027b3f07ee3304fc83580f26e43249"},
+    {file = "cryptography-38.0.3-pp38-pypy38_pp73-win_amd64.whl", hash = "sha256:6addc3b6d593cd980989261dc1cce38263c76954d758c3c94de51f1e010c9a50"},
+    {file = "cryptography-38.0.3-pp39-pypy39_pp73-macosx_10_10_x86_64.whl", hash = "sha256:be243c7e2bfcf6cc4cb350c0d5cdf15ca6383bbcb2a8ef51d3c9411a9d4386f0"},
+    {file = "cryptography-38.0.3-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:78cf5eefac2b52c10398a42765bfa981ce2372cbc0457e6bf9658f41ec3c41d8"},
+    {file = "cryptography-38.0.3-pp39-pypy39_pp73-manylinux_2_24_x86_64.whl", hash = "sha256:4e269dcd9b102c5a3d72be3c45d8ce20377b8076a43cbed6f660a1afe365e436"},
+    {file = "cryptography-38.0.3-pp39-pypy39_pp73-manylinux_2_28_x86_64.whl", hash = "sha256:8d41a46251bf0634e21fac50ffd643216ccecfaf3701a063257fe0b2be1b6548"},
+    {file = "cryptography-38.0.3-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:785e4056b5a8b28f05a533fab69febf5004458e20dad7e2e13a3120d8ecec75a"},
+    {file = "cryptography-38.0.3.tar.gz", hash = "sha256:bfbe6ee19615b07a98b1d2287d6a6073f734735b49ee45b11324d85efc4d5cbd"},
 ]
 docker = [
    {file = "docker-4.2.2-py2.py3-none-any.whl", hash = "sha256:03a46400c4080cb6f7aa997f881ddd84fef855499ece219d75fbdb53289c17ab"},
--- a/proxy/Cargo.toml
+++ b/proxy/Cargo.toml
@@ -22,11 +22,7 @@ once_cell = "1.13.0"
 parking_lot = "0.12"
 pin-project-lite = "0.2.7"
 rand = "0.8.3"
-reqwest = { version = "0.11", default-features = false, features = [
-    "blocking",
-    "json",
-    "rustls-tls",
-] }
+reqwest = { version = "0.11", default-features = false, features = [ "json", "rustls-tls" ] }
 routerify = "3"
 rustls = "0.20.0"
 rustls-pemfile = "1"
@@ -45,8 +41,9 @@ url = "2.2.2"
 uuid = { version = "1.2", features = ["v4", "serde"] }
 x509-parser = "0.14"

-utils = { path = "../libs/utils" }
 metrics = { path = "../libs/metrics" }
+pq_proto = { path = "../libs/pq_proto" }
+utils = { path = "../libs/utils" }
 workspace_hack = { version = "0.1", path = "../workspace_hack" }

 [dev-dependencies]
--- a/proxy/src/auth/backend/link.rs
+++ b/proxy/src/auth/backend/link.rs
@@ -1,8 +1,8 @@
 use crate::{auth, compute, error::UserFacingError, stream::PqStream, waiters};
+use pq_proto::{BeMessage as Be, BeParameterStatusMessage};
 use thiserror::Error;
 use tokio::io::{AsyncRead, AsyncWrite};
 use tracing::{info, info_span};
-use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage};

 #[derive(Debug, Error)]
 pub enum LinkAuthError {
--- a/proxy/src/auth/credentials.rs
+++ b/proxy/src/auth/credentials.rs
@@ -1,10 +1,10 @@
 //! User credentials used in authentication.

 use crate::error::UserFacingError;
+use pq_proto::StartupMessageParams;
 use std::borrow::Cow;
 use thiserror::Error;
 use tracing::info;
-use utils::pq_proto::StartupMessageParams;

 #[derive(Debug, Error, PartialEq, Eq, Clone)]
 pub enum ClientCredsParseError {
--- a/proxy/src/auth/flow.rs
+++ b/proxy/src/auth/flow.rs
@@ -2,9 +2,9 @@

 use super::{AuthErrorImpl, PasswordHackPayload};
 use crate::{sasl, scram, stream::PqStream};
+use pq_proto::{BeAuthenticationSaslMessage, BeMessage, BeMessage as Be};
 use std::io;
 use tokio::io::{AsyncRead, AsyncWrite};
-use utils::pq_proto::{BeAuthenticationSaslMessage, BeMessage, BeMessage as Be};

 /// Every authentication selector is supposed to implement this trait.
 pub trait AuthMethod {
--- a/proxy/src/cancellation.rs
+++ b/proxy/src/cancellation.rs
@@ -1,11 +1,11 @@
 use anyhow::{anyhow, Context};
 use hashbrown::HashMap;
 use parking_lot::Mutex;
+use pq_proto::CancelKeyData;
 use std::net::SocketAddr;
 use tokio::net::TcpStream;
 use tokio_postgres::{CancelToken, NoTls};
 use tracing::info;
-use utils::pq_proto::CancelKeyData;

 /// Enables serving `CancelRequest`s.
 #[derive(Default)]
--- a/proxy/src/compute.rs
+++ b/proxy/src/compute.rs
@@ -1,12 +1,12 @@
 use crate::{cancellation::CancelClosure, error::UserFacingError};
 use futures::TryFutureExt;
 use itertools::Itertools;
+use pq_proto::StartupMessageParams;
 use std::{io, net::SocketAddr};
 use thiserror::Error;
 use tokio::net::TcpStream;
 use tokio_postgres::NoTls;
 use tracing::{error, info};
-use utils::pq_proto::StartupMessageParams;

 #[derive(Debug, Error)]
 pub enum ConnectionError {
@@ -44,7 +44,7 @@ pub type ComputeConnCfg = tokio_postgres::Config;

 /// Various compute node info for establishing connection etc.
 pub struct NodeInfo {
-    /// Did we send [`utils::pq_proto::BeMessage::AuthenticationOk`]?
+    /// Did we send [`pq_proto::BeMessage::AuthenticationOk`]?
    pub reported_auth_ok: bool,
    /// Compute node connection params.
    pub config: tokio_postgres::Config,
--- a/proxy/src/mgmt.rs
+++ b/proxy/src/mgmt.rs
@@ -1,15 +1,13 @@
 use crate::auth;
 use anyhow::Context;
+use pq_proto::{BeMessage, SINGLE_COL_ROWDESC};
 use serde::Deserialize;
 use std::{
    net::{TcpListener, TcpStream},
    thread,
 };
 use tracing::{error, info};
-use utils::{
-    postgres_backend::{self, AuthType, PostgresBackend},
-    pq_proto::{BeMessage, SINGLE_COL_ROWDESC},
-};
+use utils::postgres_backend::{self, AuthType, PostgresBackend};

 /// TODO: move all of that to auth-backend/link.rs when we ditch legacy-console backend

--- a/proxy/src/proxy.rs
+++ b/proxy/src/proxy.rs
@@ -6,10 +6,10 @@ use anyhow::{bail, Context};
 use futures::TryFutureExt;
 use metrics::{register_int_counter, IntCounter};
 use once_cell::sync::Lazy;
+use pq_proto::{BeMessage as Be, *};
 use std::sync::Arc;
 use tokio::io::{AsyncRead, AsyncWrite};
 use tracing::{error, info, info_span, Instrument};
-use utils::pq_proto::{BeMessage as Be, *};

 const ERR_INSECURE_CONNECTION: &str = "connection is insecure (try using `sslmode=require`)";
 const ERR_PROTO_VIOLATION: &str = "protocol violation";
--- a/proxy/src/sasl/messages.rs
+++ b/proxy/src/sasl/messages.rs
@@ -1,9 +1,9 @@
 //! Definitions for SASL messages.

 use crate::parse::{split_at_const, split_cstr};
-use utils::pq_proto::{BeAuthenticationSaslMessage, BeMessage};
+use pq_proto::{BeAuthenticationSaslMessage, BeMessage};

-/// SASL-specific payload of [`PasswordMessage`](utils::pq_proto::FeMessage::PasswordMessage).
+/// SASL-specific payload of [`PasswordMessage`](pq_proto::FeMessage::PasswordMessage).
 #[derive(Debug)]
 pub struct FirstMessage<'a> {
    /// Authentication method, e.g. `"SCRAM-SHA-256"`.
@@ -31,7 +31,7 @@ impl<'a> FirstMessage<'a> {

 /// A single SASL message.
 /// This struct is deliberately decoupled from lower-level
-/// [`BeAuthenticationSaslMessage`](utils::pq_proto::BeAuthenticationSaslMessage).
+/// [`BeAuthenticationSaslMessage`](pq_proto::BeAuthenticationSaslMessage).
 #[derive(Debug)]
 pub(super) enum ServerMessage<T> {
    /// We expect to see more steps.
--- a/proxy/src/stream.rs
+++ b/proxy/src/stream.rs
@@ -2,6 +2,7 @@ use crate::error::UserFacingError;
 use anyhow::bail;
 use bytes::BytesMut;
 use pin_project_lite::pin_project;
+use pq_proto::{BeMessage, FeMessage, FeStartupPacket};
 use rustls::ServerConfig;
 use std::pin::Pin;
 use std::sync::Arc;
@@ -9,7 +10,6 @@ use std::{io, task};
 use thiserror::Error;
 use tokio::io::{AsyncRead, AsyncWrite, AsyncWriteExt, ReadBuf};
 use tokio_rustls::server::TlsStream;
-use utils::pq_proto::{BeMessage, FeMessage, FeStartupPacket};

 pin_project! {
    /// Stream wrapper which implements libpq's protocol.
--- a/rust-toolchain.toml
+++ b/rust-toolchain.toml
@@ -4,7 +4,7 @@
 # version, we can consider updating.
 # See https://tracker.debian.org/pkg/rustc for more details on Debian rustc package,
 # we use "unstable" version number as the highest version used in the project by default.
-channel = "1.61" # do update GitHub CI cache values for rust builds, when changing this value
+channel = "1.62.1" # do update GitHub CI cache values for rust builds, when changing this value
 profile = "default"
 # The default profile includes rustc, rust-std, cargo, rust-docs, rustfmt and clippy.
 # https://rust-lang.github.io/rustup/concepts/profiles.html
--- a/safekeeper/Cargo.toml
+++ b/safekeeper/Cargo.toml
@@ -4,41 +4,42 @@ version = "0.1.0"
 edition = "2021"

 [dependencies]
-regex = "1.4.5"
-bytes = "1.0.1"
-byteorder = "1.4.3"
-hyper = "0.14"
-fs2 = "0.4.3"
-serde_json = "1"
-tracing = "0.1.27"
-clap = "4.0"
-daemonize = "0.4.1"
-tokio = { version = "1.17", features = ["macros", "fs"] }
-postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
 anyhow = "1.0"
-crc32c = "0.6.0"
-humantime = "2.1.0"
-url = "2.2.2"
-signal-hook = "0.3.10"
-serde = { version = "1.0", features = ["derive"] }
-serde_with = "2.0"
-hex = "0.4.3"
-const_format = "0.2.21"
-tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
-git-version = "0.3.5"
 async-trait = "0.1"
+byteorder = "1.4.3"
+bytes = "1.0.1"
+clap = "4.0"
+const_format = "0.2.21"
+crc32c = "0.6.0"
+fs2 = "0.4.3"
+git-version = "0.3.5"
+hex = "0.4.3"
+humantime = "2.1.0"
+hyper = "0.14"
+nix = "0.25"
 once_cell = "1.13.0"
-toml_edit = { version = "0.14", features = ["easy"] }
-thiserror = "1"
 parking_lot = "0.12.1"
+postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
+postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
+regex = "1.4.5"
+serde = { version = "1.0", features = ["derive"] }
+serde_json = "1"
+serde_with = "2.0"
+signal-hook = "0.3.10"
+thiserror = "1"
+tokio = { version = "1.17", features = ["macros", "fs"] }
+tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
+toml_edit = { version = "0.14", features = ["easy"] }
+tracing = "0.1.27"
+url = "2.2.2"

-safekeeper_api = { path = "../libs/safekeeper_api" }
-postgres_ffi = { path = "../libs/postgres_ffi" }
-metrics = { path = "../libs/metrics" }
-utils = { path = "../libs/utils" }
 etcd_broker = { path = "../libs/etcd_broker" }
+metrics = { path = "../libs/metrics" }
+postgres_ffi = { path = "../libs/postgres_ffi" }
+pq_proto = { path = "../libs/pq_proto" }
 remote_storage = { path = "../libs/remote_storage" }
+safekeeper_api = { path = "../libs/safekeeper_api" }
+utils = { path = "../libs/utils" }
 workspace_hack = { version = "0.1", path = "../workspace_hack" }

 [dev-dependencies]
--- a/safekeeper/src/bin/safekeeper.rs
+++ b/safekeeper/src/bin/safekeeper.rs
@@ -4,8 +4,7 @@
 use anyhow::{bail, Context, Result};
 use clap::{value_parser, Arg, ArgAction, Command};
 use const_format::formatcp;
-use daemonize::Daemonize;
-use fs2::FileExt;
+use nix::unistd::Pid;
 use remote_storage::RemoteStorageConfig;
 use std::fs::{self, File};
 use std::io::{ErrorKind, Write};
@@ -16,6 +15,7 @@ use tokio::sync::mpsc;
 use toml_edit::Document;
 use tracing::*;
 use url::{ParseError, Url};
+use utils::lock_file;

 use metrics::set_build_info_metric;
 use safekeeper::broker;
@@ -35,12 +35,10 @@ use utils::{
    http::endpoint,
    id::NodeId,
    logging::{self, LogFormat},
-    project_git_version,
-    shutdown::exit_now,
-    signals, tcp_listener,
+    project_git_version, signals, tcp_listener,
 };

-const LOCK_FILE_NAME: &str = "safekeeper.lock";
+const PID_FILE_NAME: &str = "safekeeper.pid";
 const ID_FILE_NAME: &str = "safekeeper.id";
 project_git_version!(GIT_VERSION);

@@ -65,10 +63,6 @@ fn main() -> anyhow::Result<()> {
        conf.no_sync = true;
    }

-    if arg_matches.get_flag("daemonize") {
-        conf.daemonize = true;
-    }
-
    if let Some(addr) = arg_matches.get_one::<String>("listen-pg") {
        conf.listen_pg_addr = addr.to_string();
    }
@@ -143,19 +137,33 @@ fn main() -> anyhow::Result<()> {
 }

 fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option<NodeId>, init: bool) -> Result<()> {
-    let log_file = logging::init("safekeeper.log", conf.daemonize, conf.log_format)?;
-
+    logging::init(conf.log_format)?;
    info!("version: {GIT_VERSION}");

    // Prevent running multiple safekeepers on the same directory
-    let lock_file_path = conf.workdir.join(LOCK_FILE_NAME);
-    let lock_file = File::create(&lock_file_path).context("failed to open lockfile")?;
-    lock_file.try_lock_exclusive().with_context(|| {
-        format!(
-            "control file {} is locked by some other process",
-            lock_file_path.display()
-        )
-    })?;
+    let lock_file_path = conf.workdir.join(PID_FILE_NAME);
+    let lock_file = match lock_file::create_lock_file(&lock_file_path, Pid::this().to_string()) {
+        lock_file::LockCreationResult::Created {
+            new_lock_contents,
+            file,
+        } => {
+            info!("Created lock file at {lock_file_path:?} with contenst {new_lock_contents}");
+            file
+        }
+        lock_file::LockCreationResult::AlreadyLocked {
+            existing_lock_contents,
+        } => anyhow::bail!(
+            "Could not lock pid file; safekeeper is already running in {:?} with PID {}",
+            conf.workdir,
+            existing_lock_contents
+        ),
+        lock_file::LockCreationResult::CreationFailed(e) => {
+            return Err(e.context(format!("Failed to create lock file at {lock_file_path:?}")))
+        }
+    };
+    // ensure that the lock file is held even if the main thread of the process is panics
+    // we need to release the lock file only when the current process is gone
+    let _ = Box::leak(Box::new(lock_file));

    // Set or read our ID.
    set_id(&mut conf, given_id)?;
@@ -187,31 +195,6 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option<NodeId>, init: bo
        }
    };

-    // XXX: Don't spawn any threads before daemonizing!
-    if conf.daemonize {
-        info!("daemonizing...");
-
-        // There should'n be any logging to stdin/stdout. Redirect it to the main log so
-        // that we will see any accidental manual fprintf's or backtraces.
-        let stdout = log_file.try_clone().unwrap();
-        let stderr = log_file;
-
-        let daemonize = Daemonize::new()
-            .pid_file("safekeeper.pid")
-            .working_directory(Path::new("."))
-            .stdout(stdout)
-            .stderr(stderr);
-
-        // XXX: The parent process should exit abruptly right after
-        // it has spawned a child to prevent coverage machinery from
-        // dumping stats into a `profraw` file now owned by the child.
-        // Otherwise, the coverage data will be damaged.
-        match daemonize.exit_action(|| exit_now(0)).start() {
-            Ok(_) => info!("Success, daemonized"),
-            Err(err) => bail!("Error: {err}. could not daemonize. bailing."),
-        }
-    }
-
    // Register metrics collector for active timelines. It's important to do this
    // after daemonizing, otherwise process collector will be upset.
    let timeline_collector = safekeeper::metrics::TimelineCollector::new();
@@ -384,13 +367,6 @@ fn cli() -> Command {
                .short('p')
                .long("pageserver"),
        )
-        .arg(
-            Arg::new("daemonize")
-                .short('d')
-                .long("daemonize")
-                .action(ArgAction::SetTrue)
-                .help("Run in the background"),
-        )
        .arg(
            Arg::new("no-sync")
                .short('n')
--- a/safekeeper/src/control_file_upgrade.rs
+++ b/safekeeper/src/control_file_upgrade.rs
@@ -4,13 +4,13 @@ use crate::safekeeper::{
    TermSwitchEntry,
 };
 use anyhow::{bail, Result};
+use pq_proto::SystemId;
 use serde::{Deserialize, Serialize};
 use tracing::*;
 use utils::{
    bin_ser::LeSer,
    id::{TenantId, TimelineId},
    lsn::Lsn,
-    pq_proto::SystemId,
 };

 /// Persistent consensus state of the acceptor.
--- a/safekeeper/src/handler.rs
+++ b/safekeeper/src/handler.rs
@@ -12,12 +12,12 @@ use anyhow::{bail, Context, Result};
 use postgres_ffi::PG_TLI;
 use regex::Regex;

+use pq_proto::{BeMessage, FeStartupPacket, RowDescriptor, INT4_OID, TEXT_OID};
 use tracing::info;
 use utils::{
    id::{TenantId, TenantTimelineId, TimelineId},
    lsn::Lsn,
    postgres_backend::{self, PostgresBackend},
-    pq_proto::{BeMessage, FeStartupPacket, RowDescriptor, INT4_OID, TEXT_OID},
 };

 /// Safekeeper handler of postgres commands
--- a/safekeeper/src/json_ctrl.rs
+++ b/safekeeper/src/json_ctrl.rs
@@ -24,11 +24,8 @@ use crate::timeline::Timeline;
 use crate::GlobalTimelines;
 use postgres_ffi::encode_logical_message;
 use postgres_ffi::WAL_SEGMENT_SIZE;
-use utils::{
-    lsn::Lsn,
-    postgres_backend::PostgresBackend,
-    pq_proto::{BeMessage, RowDescriptor, TEXT_OID},
-};
+use pq_proto::{BeMessage, RowDescriptor, TEXT_OID};
+use utils::{lsn::Lsn, postgres_backend::PostgresBackend};

 #[derive(Serialize, Deserialize, Debug)]
 pub struct AppendLogicalMessage {
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
Heikki Linnakangas	c82ab848de	Have a pool of WAL redo processes per tenant To allow more concurrency, have a pool of WAL redo processes that can grow up to 4 processes per tenant. There's no way to shrink the pool, that's why I'm capping it at 4 processes, to keep the total number of processes reasonable.	2022-11-08 14:59:49 +02:00
Alexander Bayandin	c1a76eb0e5	test_runner: replace global variables with fixtures (#2754 ) This PR replaces the following global variables in the test framework with fixtures to make tests more configurable. I mainly need this for the forward compatibility tests (draft in https://github.com/neondatabase/neon/pull/2766). ``` base_dir neon_binpath pg_distrib_dir top_output_dir default_pg_version (this one got replaced with a fixture named pg_version) ``` Also, this PR adds more `Path` type where the code implies it.	2022-11-07 18:39:51 +00:00
MMeent	d5b6471fa9	Update prefetch mechanism: (#2687 ) Prefetch requests and responses are stored in a ringbuffer instead of a queue, which means we can utilize prefetches of many relations concurrently -- page reads of un-prefetched relations now don't imply dropping prefetches. In a future iteration, this may detect sequential scans based on the read behavior of sequential scans, and will dynamically prefetch buffers for such relations as needed. Right now, it still depends on explicit prefetch requests from PostgreSQL. The main improvement here is that we now have a buffer for prefetched pages of 128 entries with random access. Before, we had a similarly sized cache, but this cache did not allow for random access, which resulted in dropped entries when multiple systems used the prefetching subsystem concurrently. See also: #2544	2022-11-07 18:13:24 +02:00
Joonas Koivunen	548d472b12	fix: logical size query at before initdb_lsn (#2755 ) With more realistic selection of gc_horizon in tests there is an immediate failure with trying to query logical size with lsn < initdb_lsn. Fixes that, adds illustration gathered from clarity of explaining this tenant size calculation to more people. Cc: #2748, #2599.	2022-11-07 12:03:57 +02:00
Dmitry Rodionov	99e745a760	review adjustments	2022-11-06 13:42:18 +02:00
Dmitry Rodionov	15d970f731	decrease diff by moving check_checkpoint_distance back Co-authored-by: Christian Schwarz <me@cschwarz.com>	2022-11-06 13:42:18 +02:00
Heikki Linnakangas	7b7f84f1b4	Refactor layer flushing task Extracted from https://github.com/neondatabase/neon/pull/2595	2022-11-06 13:42:18 +02:00
Alexander Bayandin	bc40a5595f	test_runner: update cryptography (#2753 ) Bump `cryptography` package from 37.0.4 to 38.0.3 to fix https://github.com/neondatabase/neon/security/dependabot/6 (rather just in case). Ref https://www.openssl.org/news/secadv/20221101.txt	2022-11-04 21:29:51 +00:00
Heikki Linnakangas	07b3ba5ce3	Bump postgres submodules, for some cosmetic cleanups.	2022-11-04 20:30:13 +02:00
Dmitry Ivanov	c38f38dab7	Move pq_proto to its own crate	2022-11-03 22:56:04 +03:00
bojanserafimov	71d268c7c4	Write message serialization test (#2746 )	2022-11-03 14:24:15 +00:00
Joonas Koivunen	cf68963b18	Add initial tenant sizing model and a http route to query it (#2714 ) Tenant size information is gathered by using existing parts of `Tenant::gc_iteration` which are now separated as `Tenant::refresh_gc_info`. `Tenant::refresh_gc_info` collects branch points, and invokes `Timeline::update_gc_info`; nothing was supposed to be changed there. The gathered branch points (through Timeline's `GcInfo::retain_lsns`), `GcInfo::horizon_cutoff`, and `GcInfo::pitr_cutoff` are used to build up a Vec of updates fed into the `libs/tenant_size_model` to calculate the history size. The gathered information is now exposed using `GET /v1/tenant/{tenant_id}/size`, which which will respond with the actual calculated size. Initially the idea was to have this delivered as tenant background task and exported via metric, but it might be too computationally expensive to run it periodically as we don't yet know if the returned values are any good. Adds one new metric: - pageserver_storage_operations_seconds with label `logical_size` - separating from original `init_logical_size` Adds a pageserver wide configuration variable: - `concurrent_tenant_size_logical_size_queries` with default 1 This leaves a lot of TODO's, tracked on issue #2748.	2022-11-03 12:39:19 +00:00
Arseny Sher	63221e4b42	Fix sk->ps walsender shutdown on sk side on caughtup. This will fix many threads issue, but code around awfully still wants improvement. https://github.com/neondatabase/neon/issues/2722	2022-11-03 16:20:55 +04:00
bojanserafimov	d7eeb73f6f	Impl serialize for pagestream FeMessage (#2741 )	2022-11-02 23:44:07 -04:00
Joonas Koivunen	5112142997	fix: use different port for temporary postgres (#2743 ) `test_tenant_relocation` ends up starting a temporary postgres instance with a fixed port. the change makes the port configurable at scripts/export_import_between_pageservers.py and uses that in test_tenant_relocation.	2022-11-02 18:37:48 +00:00
bojanserafimov	a0a74868a4	Fix clippy (#2742 )	2022-11-02 12:30:09 -04:00
Christian Schwarz	b154992510	timeline_list_handler: avoid spawn_blocking As per https://github.com/neondatabase/neon/issues/2731#issuecomment-1299335813 refs https://github.com/neondatabase/neon/issues/2731	2022-11-02 16:22:58 +01:00
Christian Schwarz	a86a38c96e	README: fix instructions on how to run tests The `make debug` target doesn't exist, and I can't find it in the Git history.	2022-11-02 16:22:58 +01:00
Christian Schwarz	590f894db8	tenant_status: remove unnecessary spawn_blocking The spawn_blocking is pointless in this cases: get_tenant is not expected to block for any meaningful amount of time. There are get_tenant calls in most other functions in the file too, and they don't bother with spawn_blocking. Let's remove the spawn_blocking from tenant_status, too, to be consistent. fixes https://github.com/neondatabase/neon/issues/2731	2022-11-02 16:22:58 +01:00
Alexander Bayandin	0a0595b98d	test_backward_compatibility: assign random port to compute (#2738 )	2022-11-02 15:22:38 +00:00
Dmitry Rodionov	e56d11c8e1	fix style if possible (cannot really split long lines in mermaid)	2022-11-02 17:15:49 +02:00
Dmitry Rodionov	ccdc3188ed	update according to discussion and comments	2022-11-02 17:15:49 +02:00
Dmitry Rodionov	67401cbdb8	pageserver s3 coordination	2022-11-02 17:15:49 +02:00
Kirill Bulatov	d42700280f	Remove daemonize from storage components (#2677 ) Move daemonization logic into `control_plane`. Storage binaries now only crate a lockfile to avoid concurrent services running in the same directory.	2022-11-02 02:26:37 +02:00
Kirill Bulatov	6df4d5c911	Bump rustc to 1.62.1 (#2728 ) Changelog: https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1621-2022-07-19	2022-11-02 01:21:33 +02:00
Dmitry Rodionov	32d14403bd	remove wrong is_active filter for timelines in compaction/gc Gc needs to know about all branch points, not only ones for timelines that are active at the moment of gc. If timeline is inactive then we wont know about branch point. In this case gc can delete data that is needed by child timeline. For compaction it is less severe. Delaying compaction can cause an effect on performance. So it is still better to run it. There is a logic to exit it quickly if there is nothing to compact	2022-11-01 18:07:08 +02:00
Dmitry Ivanov	0df3467146	Refactoring: replace `utils::connstring` with `Url`-based APIs	2022-11-01 18:17:36 +03:00
Dmitry Rodionov	c64a121aa8	do not nest wal_connection_manager span inside parent one	2022-11-01 15:08:23 +02:00
Heikki Linnakangas	22cc8760b9	Move walredo process code under pgxn in the main 'neon' repository. - Refactor the way the WalProposerMain function is called when started with --sync-safekeepers. The postgres binary now explicitly loads the 'neon.so' library and calls the WalProposerMain in it. This is simpler than the global function callback "hook" we previously used. - Move the WAL redo process code to a new library, neon_walredo.so, and use the same mechanism as for --sync-safekeepers to call the WalRedoMain function, when launched with --walredo argument. - Also move the seccomp code to neon_walredo.so library. I kept the configure check in the postgres side for now, though.	2022-10-31 01:11:50 +01:00
Arseny Sher	596d622a82	Fix test_prepare_snapshot. It should checkpoint pageserver after waiting for all data arrival, not before.	2022-10-28 22:12:31 +04:00
Sergey Melnikov	7481fb082c	Fix bugs in #2713 (#2716 )	2022-10-28 14:12:49 +00:00
Arseny Sher	1eb9bd052a	Bump vendor/postgres-v15 to fix XLP_FIRST_IS_CONTRECORD issue. ref https://github.com/neondatabase/cloud/issues/2688	2022-10-28 16:45:11 +03:00
Sergey Melnikov	59a3ca4ec6	Deploy proxy to new prod regions (#2713 ) * Refactor proxy deploy * Test new prod deploy * Remove assume role * Add new values * Add all regions	2022-10-28 16:25:28 +03:00
Sergey Melnikov	e86a9105a4	Deploy storage to new prod regions (#2709 )	2022-10-28 10:17:27 +00:00
Stas Kelvich	d3c8749da5	Build compute postgres with openssl support The main reason for that change is that Postgres 15 requires OpenSSL for `pgcrypto` to work. Also not a bad idea to have SSL-enabled Postgres in general.	2022-10-28 10:39:22 +03:00
Alexander Bayandin	128dc8d405	Nightly Benchmarks: fix workflow (#2708 )	2022-10-27 19:26:10 +03:00
Alexander Bayandin	0cbae6e8f3	test_backward_compatibility: friendlier error message (#2707 )	2022-10-27 15:54:49 +00:00
Alexander Stanovoy	78e412b84b	The fix of #2650 . (#2686 ) * Wrappers and drop implementations for image and delta layer writers. * Two regression tests for the image and delta layer files.	2022-10-27 14:02:55 +00:00
Rory de Zoete	6dbf202e0d	Update crane copy target (#2704 ) Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>	2022-10-27 16:00:40 +02:00