feat(ci): lint gha with zizmor using the pedantic persona

storcon + safekeeper + scrubber: propagate root CA certs everywhere (#11418)
## Problem

There are some places in the code where we create `reqwest::Client` without providing SSL CA certs from `ssl_ca_file`. These will break after we enable TLS everywhere.

- Part of https://github.com/neondatabase/cloud/issues/22686

## Summary of changes

- Support `ssl_ca_file` in storage scrubber.
- Add `use_https_safekeeper_api` option to safekeeper to use https for peer requests.
- Propagate SSL CA certs to storage_controller/client, storcon's ComputeHook, PeerClient and maybe_forward.
2026-05-25 09:00:37 +00:00 · 2025-04-07 17:14:57 +02:00 · 2025-04-04 06:30:48 +00:00 · 2025-04-04 01:06:22 +00:00 · 2025-04-04 00:17:40 +00:00 · 2025-04-03 23:00:58 +00:00
90 changed files with 1636 additions and 410 deletions
--- a/.github/scripts/generate_image_maps.py
+++ b/.github/scripts/generate_image_maps.py
@@ -39,12 +39,18 @@ registries = {
    ],
 }

+release_branches = ["release", "release-proxy", "release-compute"]
+
 outputs: dict[str, dict[str, list[str]]] = {}

-target_tags = [target_tag, "latest"] if branch == "main" else [target_tag]
-target_stages = (
-    ["dev", "prod"] if branch in ["release", "release-proxy", "release-compute"] else ["dev"]
+target_tags = (
+    [target_tag, "latest"]
+    if branch == "main"
+    else [target_tag, "released"]
+    if branch in release_branches
+    else [target_tag]
 )
+target_stages = ["dev", "prod"] if branch in release_branches else ["dev"]

 for component_name, component_images in components.items():
    for stage in target_stages:
--- a/.github/workflows/actionlint.yml
+++ b/.github/workflows/actionlint.yml
@@ -54,3 +54,14 @@ jobs:
            done
            exit 1
          fi
+
+      - name: Lint with zizmor
+        run: zizmor --persona pedantic --format sarif . > zizmor.sarif
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Upload zizmor results
+        uses: github/codeql-action/upload-sarif@fc7e4a0fa01c3cca5fd6a1fddec5c0740c977aa2  # v3.28.14
+        with:
+          sarif_file: zizmor.sarif
+          category: zizmor
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4329,6 +4329,7 @@ dependencies = [
 "strum",
 "strum_macros",
 "thiserror 1.0.69",
+ "tracing-utils",
 "utils",
 ]

@@ -7603,6 +7604,7 @@ dependencies = [
 "opentelemetry-otlp",
 "opentelemetry-semantic-conventions",
 "opentelemetry_sdk",
+ "pin-project-lite",
 "tokio",
 "tracing",
 "tracing-opentelemetry",
--- a/build-tools.Dockerfile
+++ b/build-tools.Dockerfile
@@ -292,7 +292,7 @@ WORKDIR /home/nonroot

 # Rust
 # Please keep the version of llvm (installed above) in sync with rust llvm (`rustc --version --verbose | grep LLVM`)
-ENV RUSTC_VERSION=1.85.0
+ENV RUSTC_VERSION=1.86.0
 ENV RUSTUP_HOME="/home/nonroot/.rustup"
 ENV PATH="/home/nonroot/.cargo/bin:${PATH}"
 ARG RUSTFILT_VERSION=0.2.1
@@ -302,6 +302,7 @@ ARG CARGO_HACK_VERSION=0.6.33
 ARG CARGO_NEXTEST_VERSION=0.9.85
 ARG CARGO_CHEF_VERSION=0.1.71
 ARG CARGO_DIESEL_CLI_VERSION=2.2.6
+ARG ZIZMOR_VERSION=1.5.2
 RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && whoami && \
 	chmod +x rustup-init && \
 	./rustup-init -y --default-toolchain ${RUSTC_VERSION} && \
@@ -316,6 +317,7 @@ RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux
    cargo install cargo-hack          --version ${CARGO_HACK_VERSION} && \
    cargo install cargo-nextest       --version ${CARGO_NEXTEST_VERSION} && \
    cargo install cargo-chef --locked --version ${CARGO_CHEF_VERSION} && \
+    cargo install zizmor --locked     --version ${ZIZMOR_VERSION} && \
    cargo install diesel_cli          --version ${CARGO_DIESEL_CLI_VERSION} \
                                      --features postgres-bundled --no-default-features && \
    rm -rf /home/nonroot/.cargo/registry && \
--- a/compute/compute-node.Dockerfile
+++ b/compute/compute-node.Dockerfile
@@ -1055,34 +1055,6 @@ RUN  if [ -d pg_embedding-src ]; then \
        make -j $(getconf _NPROCESSORS_ONLN) install; \
    fi

-#########################################################################################
-#
-# Layer "pg_anon-build"
-# compile anon extension
-#
-#########################################################################################
-FROM build-deps AS pg_anon-src
-ARG PG_VERSION
-
-# This is an experimental extension, never got to real production.
-# !Do not remove! It can be present in shared_preload_libraries and compute will fail to start if library is not found.
-WORKDIR /ext-src
-RUN case "${PG_VERSION:?}" in "v17") \
-    echo "postgresql_anonymizer does not yet support PG17" && exit 0;; \
-    esac && \
-    wget  https://github.com/neondatabase/postgresql_anonymizer/archive/refs/tags/neon_1.1.1.tar.gz -O pg_anon.tar.gz && \
-    echo "321ea8d5c1648880aafde850a2c576e4a9e7b9933a34ce272efc839328999fa9  pg_anon.tar.gz" | sha256sum --check && \
-    mkdir pg_anon-src && cd pg_anon-src && tar xzf ../pg_anon.tar.gz --strip-components=1 -C .
-
-FROM pg-build AS pg_anon-build
-COPY --from=pg_anon-src /ext-src/ /ext-src/
-WORKDIR /ext-src
-RUN if [ -d pg_anon-src ]; then \
-        cd pg_anon-src && \
-        make -j $(getconf _NPROCESSORS_ONLN) install && \
-        echo 'trusted = true' >> /usr/local/pgsql/share/extension/anon.control; \
-    fi
-
 #########################################################################################
 #
 # Layer "pg build with nonroot user and cargo installed"
@@ -1366,8 +1338,8 @@ ARG PG_VERSION
 # Do not update without approve from proxy team
 # Make sure the version is reflected in proxy/src/serverless/local_conn_pool.rs
 WORKDIR /ext-src
-RUN wget https://github.com/neondatabase/pg_session_jwt/archive/refs/tags/v0.2.0.tar.gz -O pg_session_jwt.tar.gz && \
-    echo "5ace028e591f2e000ca10afa5b1ca62203ebff014c2907c0ec3b29c36f28a1bb pg_session_jwt.tar.gz" | sha256sum --check && \
+RUN wget https://github.com/neondatabase/pg_session_jwt/archive/refs/tags/v0.3.0.tar.gz -O pg_session_jwt.tar.gz && \
+    echo "19be2dc0b3834d643706ed430af998bb4c2cdf24b3c45e7b102bb3a550e8660c pg_session_jwt.tar.gz" | sha256sum --check && \
    mkdir pg_session_jwt-src && cd pg_session_jwt-src && tar xzf ../pg_session_jwt.tar.gz --strip-components=1 -C . && \
    sed -i 's/pgrx = "0.12.6"/pgrx = { version = "0.12.9", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
    sed -i 's/version = "0.12.6"/version = "0.12.9"/g' pgrx-tests/Cargo.toml && \
@@ -1677,7 +1649,6 @@ COPY --from=pg_roaringbitmap-build /usr/local/pgsql/ /usr/local/pgsql/
 COPY --from=pg_semver-build /usr/local/pgsql/ /usr/local/pgsql/
 COPY --from=pg_embedding-build /usr/local/pgsql/ /usr/local/pgsql/
 COPY --from=wal2json-build /usr/local/pgsql /usr/local/pgsql
-COPY --from=pg_anon-build /usr/local/pgsql/ /usr/local/pgsql/
 COPY --from=pg_ivm-build /usr/local/pgsql/ /usr/local/pgsql/
 COPY --from=pg_partman-build /usr/local/pgsql/ /usr/local/pgsql/
 COPY --from=pg_mooncake-build /usr/local/pgsql/ /usr/local/pgsql/
--- a/compute/etc/neon_collector.jsonnet
+++ b/compute/etc/neon_collector.jsonnet
@@ -33,6 +33,7 @@
    import 'sql_exporter/lfc_hits.libsonnet',
    import 'sql_exporter/lfc_misses.libsonnet',
    import 'sql_exporter/lfc_used.libsonnet',
+    import 'sql_exporter/lfc_used_pages.libsonnet',
    import 'sql_exporter/lfc_writes.libsonnet',
    import 'sql_exporter/logical_slot_restart_lsn.libsonnet',
    import 'sql_exporter/max_cluster_size.libsonnet',
--- a/compute/etc/sql_exporter/lfc_used_pages.libsonnet
+++ b/compute/etc/sql_exporter/lfc_used_pages.libsonnet
@@ -0,0 +1,10 @@
+{
+  metric_name: 'lfc_used_pages',
+  type: 'gauge',
+  help: 'LFC pages used',
+  key_labels: null,
+  values: [
+    'lfc_used_pages',
+  ],
+  query: importstr 'sql_exporter/lfc_used_pages.sql',
+}
--- a/compute/etc/sql_exporter/lfc_used_pages.sql
+++ b/compute/etc/sql_exporter/lfc_used_pages.sql
@@ -0,0 +1 @@
+SELECT lfc_value AS lfc_used_pages FROM neon.neon_lfc_stats WHERE lfc_key = 'file_cache_used_pages';
--- a/compute/patches/pg_hint_plan_v16.patch
+++ b/compute/patches/pg_hint_plan_v16.patch
@@ -2,23 +2,6 @@ diff --git a/expected/ut-A.out b/expected/ut-A.out
 index da723b8..5328114 100644
 --- a/expected/ut-A.out
 +++ b/expected/ut-A.out
-@@ -9,13 +9,16 @@ SET search_path TO public;
- ----
- -- No.A-1-1-3
- CREATE EXTENSION pg_hint_plan;
-+LOG:  Sending request to compute_ctl: http://localhost:3081/extension_server/pg_hint_plan
- -- No.A-1-2-3
- DROP EXTENSION pg_hint_plan;
- -- No.A-1-1-4
- CREATE SCHEMA other_schema;
- CREATE EXTENSION pg_hint_plan SCHEMA other_schema;
-+LOG:  Sending request to compute_ctl: http://localhost:3081/extension_server/pg_hint_plan
- ERROR:  extension "pg_hint_plan" must be installed in schema "hint_plan"
- CREATE EXTENSION pg_hint_plan;
-+LOG:  Sending request to compute_ctl: http://localhost:3081/extension_server/pg_hint_plan
- DROP SCHEMA other_schema;
- ----
- ---- No. A-5-1 comment pattern
@@ -3175,6 +3178,7 @@ SELECT s.query, s.calls
   FROM public.pg_stat_statements s
   JOIN pg_catalog.pg_database d
@@ -27,18 +10,6 @@ index da723b8..5328114 100644
  ORDER BY 1;
                 query                 | calls 
 --------------------------------------+-------
-diff --git a/expected/ut-fdw.out b/expected/ut-fdw.out
-index d372459..6282afe 100644
--- a/expected/ut-fdw.out
-+++ b/expected/ut-fdw.out
-@@ -7,6 +7,7 @@ SET pg_hint_plan.debug_print TO on;
- SET client_min_messages TO LOG;
- SET pg_hint_plan.enable_hint TO on;
- CREATE EXTENSION file_fdw;
-+LOG:  Sending request to compute_ctl: http://localhost:3081/extension_server/file_fdw
- CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;
- CREATE USER MAPPING FOR PUBLIC SERVER file_server;
- CREATE FOREIGN TABLE ft1 (id int, val int) SERVER file_server OPTIONS (format 'csv', filename :'filename');
 diff --git a/sql/ut-A.sql b/sql/ut-A.sql
 index 7c7d58a..4fd1a07 100644
 --- a/sql/ut-A.sql
--- a/compute/patches/pg_hint_plan_v17.patch
+++ b/compute/patches/pg_hint_plan_v17.patch
@@ -1,24 +1,3 @@
-diff --git a/expected/ut-A.out b/expected/ut-A.out
-index e7d68a1..65a056c 100644
--- a/expected/ut-A.out
-+++ b/expected/ut-A.out
-@@ -9,13 +9,16 @@ SET search_path TO public;
- ----
- -- No.A-1-1-3
- CREATE EXTENSION pg_hint_plan;
-+LOG:  Sending request to compute_ctl: http://localhost:3081/extension_server/pg_hint_plan
- -- No.A-1-2-3
- DROP EXTENSION pg_hint_plan;
- -- No.A-1-1-4
- CREATE SCHEMA other_schema;
- CREATE EXTENSION pg_hint_plan SCHEMA other_schema;
-+LOG:  Sending request to compute_ctl: http://localhost:3081/extension_server/pg_hint_plan
- ERROR:  extension "pg_hint_plan" must be installed in schema "hint_plan"
- CREATE EXTENSION pg_hint_plan;
-+LOG:  Sending request to compute_ctl: http://localhost:3081/extension_server/pg_hint_plan
- DROP SCHEMA other_schema;
- ----
- ---- No. A-5-1 comment pattern
 diff --git a/expected/ut-J.out b/expected/ut-J.out
 index 2fa3c70..314e929 100644
 --- a/expected/ut-J.out
@@ -160,15 +139,3 @@ index a09bd34..0ad227c 100644
 error hint:
 
                     explain_filter                    
-diff --git a/expected/ut-fdw.out b/expected/ut-fdw.out
-index 017fa4b..98d989b 100644
--- a/expected/ut-fdw.out
-+++ b/expected/ut-fdw.out
-@@ -7,6 +7,7 @@ SET pg_hint_plan.debug_print TO on;
- SET client_min_messages TO LOG;
- SET pg_hint_plan.enable_hint TO on;
- CREATE EXTENSION file_fdw;
-+LOG:  Sending request to compute_ctl: http://localhost:3081/extension_server/file_fdw
- CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;
- CREATE USER MAPPING FOR PUBLIC SERVER file_server;
- CREATE FOREIGN TABLE ft1 (id int, val int) SERVER file_server OPTIONS (format 'csv', filename :'filename');
--- a/compute_tools/src/spec_apply.rs
+++ b/compute_tools/src/spec_apply.rs
@@ -419,7 +419,7 @@ impl ComputeNode {
                .iter()
                .filter_map(|val| val.parse::<usize>().ok())
                .map(|val| if val > 1 { val - 1 } else { 1 })
-                .last()
+                .next_back()
                .unwrap_or(3)
        }
    }
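For this chain, swapping `.last()` for `.next_back()` should not change the result: slice iterators and the `filter_map`/`map` adapters are double-ended, so `next_back()` reaches the final element without walking from the front. The swap is presumably prompted by a Clippy lint in the newer toolchain, though the diff does not say so. A minimal std-only sketch; `last_value_minus_one` and its inputs are hypothetical names, not code from `spec_apply.rs`:

```rust
/// Hypothetical helper mirroring the shape of the changed expression:
/// parse what can be parsed, subtract one (clamping at 1), and take the
/// final element, falling back to 3 when nothing parses.
pub fn last_value_minus_one(values: &[&str]) -> usize {
    values
        .iter()
        .filter_map(|val| val.parse::<usize>().ok())
        .map(|val| if val > 1 { val - 1 } else { 1 })
        // Pulls from the back of the double-ended iterator instead of
        // consuming every element from the front as `.last()` would.
        .next_back()
        .unwrap_or(3)
}
```

The equivalence relies on the closures being free of side effects, which holds for the parse-and-subtract closures used here.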
--- a/control_plane/storcon_cli/src/main.rs
+++ b/control_plane/storcon_cli/src/main.rs
@@ -385,8 +385,6 @@ where
 async fn main() -> anyhow::Result<()> {
    let cli = Cli::parse();

-    let storcon_client = Client::new(cli.api.clone(), cli.jwt.clone());
-
    let ssl_ca_certs = match &cli.ssl_ca_file {
        Some(ssl_ca_file) => {
            let buf = tokio::fs::read(ssl_ca_file).await?;
@@ -401,9 +399,11 @@ async fn main() -> anyhow::Result<()> {
    }
    let http_client = http_client.build()?;

+    let storcon_client = Client::new(http_client.clone(), cli.api.clone(), cli.jwt.clone());
+
    let mut trimmed = cli.api.to_string();
    trimmed.pop();
-    let vps_client = mgmt_api::Client::new(http_client, trimmed, cli.jwt.as_deref());
+    let vps_client = mgmt_api::Client::new(http_client.clone(), trimmed, cli.jwt.as_deref());

    match cli.command {
        Command::NodeRegister {
@@ -1056,7 +1056,7 @@ async fn main() -> anyhow::Result<()> {
            const DEFAULT_MIGRATE_CONCURRENCY: usize = 8;
            let mut stream = futures::stream::iter(moves)
                .map(|mv| {
-                    let client = Client::new(cli.api.clone(), cli.jwt.clone());
+                    let client = Client::new(http_client.clone(), cli.api.clone(), cli.jwt.clone());
                    async move {
                        client
                            .dispatch::<TenantShardMigrateRequest, TenantShardMigrateResponse>(
--- a/libs/http-utils/src/server.rs
+++ b/libs/http-utils/src/server.rs
@@ -91,14 +91,14 @@ impl Server {
                                        Ok(tls_stream) => tls_stream,
                                        Err(err) => {
                                            if !suppress_io_error(&err) {
-                                                info!("Failed to accept TLS connection: {err:#}");
+                                                info!(%remote_addr, "Failed to accept TLS connection: {err:#}");
                                            }
                                            return;
                                        }
                                    };
                                    if let Err(err) = Self::serve_connection(tls_stream, service, cancel).await {
                                        if !suppress_hyper_error(&err) {
-                                            info!("Failed to serve HTTPS connection: {err:#}");
+                                            info!(%remote_addr, "Failed to serve HTTPS connection: {err:#}");
                                        }
                                    }
                                }
@@ -106,7 +106,7 @@ impl Server {
                                    // Handle HTTP connection.
                                    if let Err(err) = Self::serve_connection(tcp_stream, service, cancel).await {
                                        if !suppress_hyper_error(&err) {
-                                            info!("Failed to serve HTTP connection: {err:#}");
+                                            info!(%remote_addr, "Failed to serve HTTP connection: {err:#}");
                                        }
                                    }
                                }
--- a/libs/pageserver_api/Cargo.toml
+++ b/libs/pageserver_api/Cargo.toml
@@ -34,6 +34,7 @@ postgres_backend.workspace = true
 nix = {workspace = true, optional = true}
 reqwest.workspace = true
 rand.workspace = true
+tracing-utils.workspace = true

 [dev-dependencies]
 bincode.workspace = true
--- a/libs/pageserver_api/src/config.rs
+++ b/libs/pageserver_api/src/config.rs
@@ -134,6 +134,7 @@ pub struct ConfigToml {
    pub load_previous_heatmap: Option<bool>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub generate_unarchival_heatmap: Option<bool>,
+    pub tracing: Option<Tracing>,
 }

 #[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
@@ -191,6 +192,54 @@ pub enum GetVectoredConcurrentIo {
    SidecarTask,
 }

+#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
+pub struct Ratio {
+    pub numerator: usize,
+    pub denominator: usize,
+}
+
+#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
+pub struct OtelExporterConfig {
+    pub endpoint: String,
+    pub protocol: OtelExporterProtocol,
+    #[serde(with = "humantime_serde")]
+    pub timeout: Duration,
+}
+
+#[derive(Debug, Copy, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
+#[serde(rename_all = "kebab-case")]
+pub enum OtelExporterProtocol {
+    Grpc,
+    HttpBinary,
+    HttpJson,
+}
+
+#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
+pub struct Tracing {
+    pub sampling_ratio: Ratio,
+    pub export_config: OtelExporterConfig,
+}
+
+impl From<&OtelExporterConfig> for tracing_utils::ExportConfig {
+    fn from(val: &OtelExporterConfig) -> Self {
+        tracing_utils::ExportConfig {
+            endpoint: Some(val.endpoint.clone()),
+            protocol: val.protocol.into(),
+            timeout: val.timeout,
+        }
+    }
+}
+
+impl From<OtelExporterProtocol> for tracing_utils::Protocol {
+    fn from(val: OtelExporterProtocol) -> Self {
+        match val {
+            OtelExporterProtocol::Grpc => tracing_utils::Protocol::Grpc,
+            OtelExporterProtocol::HttpJson => tracing_utils::Protocol::HttpJson,
+            OtelExporterProtocol::HttpBinary => tracing_utils::Protocol::HttpBinary,
+        }
+    }
+}
+
 pub mod statvfs {
    pub mod mock {
        #[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
@@ -537,6 +586,7 @@ impl Default for ConfigToml {
            validate_wal_contiguity: None,
            load_previous_heatmap: None,
            generate_unarchival_heatmap: None,
+            tracing: None,
        }
    }
 }
--- a/libs/remote_storage/tests/test_real_s3.rs
+++ b/libs/remote_storage/tests/test_real_s3.rs
@@ -558,7 +558,7 @@ async fn upload_large_enough_file(
 ) -> usize {
    let header = bytes::Bytes::from_static("remote blob data content".as_bytes());
    let body = bytes::Bytes::from(vec![0u8; 1024]);
-    let contents = std::iter::once(header).chain(std::iter::repeat(body).take(128));
+    let contents = std::iter::once(header).chain(std::iter::repeat_n(body, 128));

    let len = contents.clone().fold(0, |acc, next| acc + next.len());
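The `repeat(body).take(128)` to `repeat_n(body, 128)` change yields the same sequence of items. A rough std-only sketch of the pattern, with `Vec<u8>` standing in for `bytes::Bytes` (an assumption made only to avoid the external crate) and hypothetical helper names:

```rust
use std::iter;

/// Yields `header` once, followed by `count` copies of `body`.
/// `Vec<u8>` is a stand-in for `bytes::Bytes` in this sketch.
pub fn build_contents(
    header: Vec<u8>,
    body: Vec<u8>,
    count: usize,
) -> impl Iterator<Item = Vec<u8>> + Clone {
    iter::once(header).chain(iter::repeat_n(body, count))
}

/// Sums the byte length of every chunk, as the test does with `fold`.
pub fn total_len(contents: impl Iterator<Item = Vec<u8>>) -> usize {
    contents.fold(0, |acc, next| acc + next.len())
}
```

One practical difference is that `repeat_n` hands out the original value on its final iteration rather than cloning it again.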

--- a/libs/safekeeper_api/src/models.rs
+++ b/libs/safekeeper_api/src/models.rs
@@ -71,6 +71,7 @@ pub struct PeerInfo {
    pub ts: Instant,
    pub pg_connstr: String,
    pub http_connstr: String,
+    pub https_connstr: Option<String>,
 }

 pub type FullTransactionId = u64;
@@ -227,6 +228,8 @@ pub struct TimelineDeleteResult {
    pub dir_existed: bool,
 }

+pub type TenantDeleteResult = std::collections::HashMap<String, TimelineDeleteResult>;
+
 fn lsn_invalid() -> Lsn {
    Lsn::INVALID
 }
@@ -259,6 +262,8 @@ pub struct SkTimelineInfo {
    pub safekeeper_connstr: Option<String>,
    #[serde(default)]
    pub http_connstr: Option<String>,
+    #[serde(default)]
+    pub https_connstr: Option<String>,
    // Minimum of all active RO replicas flush LSN
    #[serde(default = "lsn_invalid")]
    pub standby_horizon: Lsn,
--- a/libs/tracing-utils/Cargo.toml
+++ b/libs/tracing-utils/Cargo.toml
@@ -14,6 +14,7 @@ tokio = { workspace = true, features = ["rt", "rt-multi-thread"] }
 tracing.workspace = true
 tracing-opentelemetry.workspace = true
 tracing-subscriber.workspace = true
+pin-project-lite.workspace = true

 [dev-dependencies]
 tracing-subscriber.workspace = true    # For examples in docs
--- a/libs/tracing-utils/src/lib.rs
+++ b/libs/tracing-utils/src/lib.rs
@@ -31,10 +31,10 @@
 //!         .init();
 //! }
 //! ```
-#![deny(unsafe_code)]
 #![deny(clippy::undocumented_unsafe_blocks)]

 pub mod http;
+pub mod perf_span;

 use opentelemetry::KeyValue;
 use opentelemetry::trace::TracerProvider;
--- a/libs/tracing-utils/src/perf_span.rs
+++ b/libs/tracing-utils/src/perf_span.rs
@@ -0,0 +1,153 @@
+//! Crutch module to work around tracing infrastructure deficiencies
+//!
+//! We wish to collect granular request spans without impacting performance
+//! by much. Ideally, we should have zero overhead for a sampling rate of 0.
+//!
+//! The approach taken by the pageserver crate is to use a completely different
+//! span hierarchy for the performance spans. Spans are explicitly stored in
+//! the request context and use a different [`tracing::Subscriber`] in order
+//! to avoid expensive filtering.
+//!
+//! [`tracing::Span`] instances record their [`tracing::Dispatch`] and, implicitly,
+//! their [`tracing::Subscriber`] at creation time. However, upon exiting the span,
+//! the global default [`tracing::Dispatch`] is used. This is problematic if one
+//! wishes to juggle different subscribers.
+//!
+//! In order to work around this, this module provides a [`PerfSpan`] type which
+//! wraps a [`Span`] and sets the default subscriber when exiting the span. This
+//! achieves the correct routing.
+//!
+//! There's also a modified version of [`tracing::Instrument`] which works with
+//! [`PerfSpan`].
+
+use core::{
+    future::Future,
+    marker::Sized,
+    mem::ManuallyDrop,
+    pin::Pin,
+    task::{Context, Poll},
+};
+use pin_project_lite::pin_project;
+use tracing::{Dispatch, field, span::Span};
+
+#[derive(Debug, Clone)]
+pub struct PerfSpan {
+    inner: ManuallyDrop<Span>,
+    dispatch: Dispatch,
+}
+
+#[must_use = "once a span has been entered, it should be exited"]
+pub struct PerfSpanEntered<'a> {
+    span: &'a PerfSpan,
+}
+
+impl PerfSpan {
+    pub fn new(span: Span, dispatch: Dispatch) -> Self {
+        Self {
+            inner: ManuallyDrop::new(span),
+            dispatch,
+        }
+    }
+
+    pub fn record<Q: field::AsField + ?Sized, V: field::Value>(
+        &self,
+        field: &Q,
+        value: V,
+    ) -> &Self {
+        self.inner.record(field, value);
+        self
+    }
+
+    pub fn enter(&self) -> PerfSpanEntered {
+        if let Some(ref id) = self.inner.id() {
+            self.dispatch.enter(id);
+        }
+
+        PerfSpanEntered { span: self }
+    }
+
+    pub fn inner(&self) -> &Span {
+        &self.inner
+    }
+}
+
+impl Drop for PerfSpan {
+    fn drop(&mut self) {
+        // Bring the desired dispatch into scope before explicitly calling
+        // the span destructor. This routes the span exit to the correct
+        // [`tracing::Subscriber`].
+        let _dispatch_guard = tracing::dispatcher::set_default(&self.dispatch);
+        // SAFETY: ManuallyDrop in Drop implementation
+        unsafe { ManuallyDrop::drop(&mut self.inner) }
+    }
+}
+
+impl Drop for PerfSpanEntered<'_> {
+    fn drop(&mut self) {
+        assert!(self.span.inner.id().is_some());
+
+        let _dispatch_guard = tracing::dispatcher::set_default(&self.span.dispatch);
+        self.span.dispatch.exit(&self.span.inner.id().unwrap());
+    }
+}
+
+pub trait PerfInstrument: Sized {
+    fn instrument(self, span: PerfSpan) -> PerfInstrumented<Self> {
+        PerfInstrumented {
+            inner: ManuallyDrop::new(self),
+            span,
+        }
+    }
+}
+
+pin_project! {
+    #[project = PerfInstrumentedProj]
+    #[derive(Debug, Clone)]
+    #[must_use = "futures do nothing unless you `.await` or poll them"]
+    pub struct PerfInstrumented<T> {
+        // `ManuallyDrop` is used here so that `Drop` can be instrumented by entering
+        // the `Span` and executing `ManuallyDrop::drop`.
+        #[pin]
+        inner: ManuallyDrop<T>,
+        span: PerfSpan,
+    }
+
+    impl<T> PinnedDrop for PerfInstrumented<T> {
+        fn drop(this: Pin<&mut Self>) {
+            let this = this.project();
+            let _enter = this.span.enter();
+            // SAFETY: 1. `Pin::get_unchecked_mut()` is safe, because this isn't
+            //             different from wrapping `T` in `Option` and calling
+            //             `Pin::set(&mut this.inner, None)`, except avoiding
+            //             additional memory overhead.
+            //         2. `ManuallyDrop::drop()` is safe, because
+            //            `PinnedDrop::drop()` is guaranteed to be called only
+            //            once.
+            unsafe { ManuallyDrop::drop(this.inner.get_unchecked_mut()) }
+        }
+    }
+}
+
+impl<'a, T> PerfInstrumentedProj<'a, T> {
+    /// Get a mutable reference to the [`PerfSpan`] and a pinned mutable reference to
+    /// the wrapped type.
+    fn span_and_inner_pin_mut(self) -> (&'a mut PerfSpan, Pin<&'a mut T>) {
+        // SAFETY: As long as `ManuallyDrop<T>` does not move, `T` won't move
+        //         and `inner` is valid, because `ManuallyDrop::drop` is called
+        //         only inside `Drop` of the `PerfInstrumented`.
+        let inner = unsafe { self.inner.map_unchecked_mut(|v| &mut **v) };
+        (self.span, inner)
+    }
+}
+
+impl<T: Future> Future for PerfInstrumented<T> {
+    type Output = T::Output;
+
+    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
+        let (span, inner) = self.project().span_and_inner_pin_mut();
+        let _enter = span.enter();
+        inner.poll(cx)
+    }
+}
+
+impl<T: Sized> PerfInstrument for T {}
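The `Drop for PerfSpan` pattern above (install the desired default, then run the inner destructor by hand) can be illustrated without `tracing`. The sketch below is a simplified stand-in, not the real API: `CURRENT`, `set_default`, `FakeSpan` and `RoutedSpan` are hypothetical names, and a thread-local string plays the role of the default dispatcher.

```rust
use std::cell::RefCell;
use std::mem::ManuallyDrop;

thread_local! {
    // Hypothetical stand-in for the thread's default dispatcher.
    static CURRENT: RefCell<String> = RefCell::new(String::from("global"));
    // Records which "dispatcher" was current when each FakeSpan was dropped.
    static DROP_LOG: RefCell<Vec<String>> = RefCell::new(Vec::new());
}

/// Restores the previous default when dropped.
pub struct DefaultGuard {
    previous: String,
}

pub fn set_default(name: &str) -> DefaultGuard {
    let previous = CURRENT.with(|c| c.replace(name.to_string()));
    DefaultGuard { previous }
}

impl Drop for DefaultGuard {
    fn drop(&mut self) {
        let previous = std::mem::take(&mut self.previous);
        CURRENT.with(|c| *c.borrow_mut() = previous);
    }
}

/// Stand-in for a span whose destructor consults the current default.
pub struct FakeSpan;

impl Drop for FakeSpan {
    fn drop(&mut self) {
        let current = CURRENT.with(|c| c.borrow().clone());
        DROP_LOG.with(|log| log.borrow_mut().push(current));
    }
}

/// Wrapper that routes the inner destructor to its own "dispatcher".
pub struct RoutedSpan {
    inner: ManuallyDrop<FakeSpan>,
    dispatch: String,
}

impl RoutedSpan {
    pub fn new(dispatch: &str) -> Self {
        Self {
            inner: ManuallyDrop::new(FakeSpan),
            dispatch: dispatch.to_string(),
        }
    }
}

impl Drop for RoutedSpan {
    fn drop(&mut self) {
        // The guard is a local, so it outlives the explicit inner drop below
        // and restores the previous default only when this function returns.
        let _guard = set_default(&self.dispatch);
        // SAFETY: `inner` is dropped exactly once, here, and is never
        // accessed again afterwards.
        unsafe { ManuallyDrop::drop(&mut self.inner) }
    }
}

pub fn drop_log() -> Vec<String> {
    DROP_LOG.with(|log| log.borrow().clone())
}

pub fn current_default() -> String {
    CURRENT.with(|c| c.borrow().clone())
}
```

Without `ManuallyDrop`, the inner value would be dropped automatically after `Drop::drop` returns, by which point the guard has already restored the previous default.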
--- a/pageserver/src/bin/pageserver.rs
+++ b/pageserver/src/bin/pageserver.rs
@@ -35,6 +35,7 @@ use tokio::signal::unix::SignalKind;
 use tokio::time::Instant;
 use tokio_util::sync::CancellationToken;
 use tracing::*;
+use tracing_utils::OtelGuard;
 use utils::auth::{JwtAuth, SwappableJwtAuth};
 use utils::crashsafe::syncfs;
 use utils::logging::TracingErrorLayerEnablement;
@@ -118,6 +119,21 @@ fn main() -> anyhow::Result<()> {
        logging::Output::Stdout,
    )?;

+    let otel_enablement = match &conf.tracing {
+        Some(cfg) => tracing_utils::OtelEnablement::Enabled {
+            service_name: "pageserver".to_string(),
+            export_config: (&cfg.export_config).into(),
+            runtime: *COMPUTE_REQUEST_RUNTIME,
+        },
+        None => tracing_utils::OtelEnablement::Disabled,
+    };
+
+    let otel_guard = tracing_utils::init_performance_tracing(otel_enablement);
+
+    if otel_guard.is_some() {
+        info!(?conf.tracing, "starting with OTEL tracing enabled");
+    }
+
    // mind the order required here: 1. logging, 2. panic_hook, 3. sentry.
    // disarming this hook on pageserver, because we never tear down tracing.
    logging::replace_panic_hook_with_tracing_panic_hook().forget();
@@ -191,7 +207,7 @@ fn main() -> anyhow::Result<()> {
    tracing::info!("Initializing page_cache...");
    page_cache::init(conf.page_cache_size);

-    start_pageserver(launch_ts, conf).context("Failed to start pageserver")?;
+    start_pageserver(launch_ts, conf, otel_guard).context("Failed to start pageserver")?;

    scenario.teardown();
    Ok(())
@@ -290,6 +306,7 @@ fn startup_checkpoint(started_at: Instant, phase: &str, human_phase: &str) {
 fn start_pageserver(
    launch_ts: &'static LaunchTimestamp,
    conf: &'static PageServerConf,
+    otel_guard: Option<OtelGuard>,
 ) -> anyhow::Result<()> {
    // Monotonic time for later calculating startup duration
    let started_startup_at = Instant::now();
@@ -675,13 +692,21 @@ fn start_pageserver(

    // Spawn a task to listen for libpq connections. It will spawn further tasks
    // for each connection. We created the listener earlier already.
-    let page_service = page_service::spawn(conf, tenant_manager.clone(), pg_auth, {
-        let _entered = COMPUTE_REQUEST_RUNTIME.enter(); // TcpListener::from_std requires it
-        pageserver_listener
-            .set_nonblocking(true)
-            .context("set listener to nonblocking")?;
-        tokio::net::TcpListener::from_std(pageserver_listener).context("create tokio listener")?
-    });
+    let perf_trace_dispatch = otel_guard.as_ref().map(|g| g.dispatch.clone());
+    let page_service = page_service::spawn(
+        conf,
+        tenant_manager.clone(),
+        pg_auth,
+        perf_trace_dispatch,
+        {
+            let _entered = COMPUTE_REQUEST_RUNTIME.enter(); // TcpListener::from_std requires it
+            pageserver_listener
+                .set_nonblocking(true)
+                .context("set listener to nonblocking")?;
+            tokio::net::TcpListener::from_std(pageserver_listener)
+                .context("create tokio listener")?
+        },
+    );

    // All started up! Now just sit and wait for shutdown signal.
    BACKGROUND_RUNTIME.block_on(async move {
--- a/pageserver/src/config.rs
+++ b/pageserver/src/config.rs
@@ -215,6 +215,8 @@ pub struct PageServerConf {

    /// When set, include visible layers in the next uploaded heatmaps of an unarchived timeline.
    pub generate_unarchival_heatmap: bool,
+
+    pub tracing: Option<pageserver_api::config::Tracing>,
 }

 /// Token for authentication to safekeepers
@@ -386,6 +388,7 @@ impl PageServerConf {
            validate_wal_contiguity,
            load_previous_heatmap,
            generate_unarchival_heatmap,
+            tracing,
        } = config_toml;

        let mut conf = PageServerConf {
@@ -435,6 +438,7 @@ impl PageServerConf {
            wal_receiver_protocol,
            page_service_pipelining,
            get_vectored_concurrent_io,
+            tracing,

            // ------------------------------------------------------------
            // fields that require additional validation or custom handling
@@ -506,6 +510,17 @@ impl PageServerConf {
            );
        }

+        if let Some(tracing_config) = conf.tracing.as_ref() {
+            let ratio = &tracing_config.sampling_ratio;
+            ensure!(
+                ratio.denominator != 0 && ratio.denominator >= ratio.numerator,
+                format!(
+                    "Invalid sampling ratio: {}/{}",
+                    ratio.numerator, ratio.denominator
+                )
+            );
+        }
+
        IndexEntry::validate_checkpoint_distance(conf.default_tenant_conf.checkpoint_distance)
            .map_err(anyhow::Error::msg)
            .with_context(|| {
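The sampling-ratio check added above rejects a zero denominator and any ratio greater than one. A minimal std-only sketch of the same predicate, using a plain `Result<(), String>` in place of `anyhow::ensure!` (an assumption for self-containment) and a hypothetical function name:

```rust
/// Mirrors the fields of `pageserver_api::config::Ratio` from the diff.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Ratio {
    pub numerator: usize,
    pub denominator: usize,
}

/// Hypothetical helper: accepts ratios in the range 0/d ..= d/d with d != 0.
pub fn validate_sampling_ratio(ratio: &Ratio) -> Result<(), String> {
    if ratio.denominator != 0 && ratio.denominator >= ratio.numerator {
        Ok(())
    } else {
        Err(format!(
            "Invalid sampling ratio: {}/{}",
            ratio.numerator, ratio.denominator
        ))
    }
}
```

Under this rule `0/1` (never sample) and `1/1` (always sample) are both valid, while `0/0` and `3/2` are rejected.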
--- a/pageserver/src/context.rs
+++ b/pageserver/src/context.rs
@@ -100,6 +100,12 @@ use crate::{
    task_mgr::TaskKind,
    tenant::Timeline,
 };
+use futures::FutureExt;
+use futures::future::BoxFuture;
+use std::future::Future;
+use tracing_utils::perf_span::{PerfInstrument, PerfSpan};
+
+use tracing::{Dispatch, Span};

 // The main structure of this module, see module-level comment.
 pub struct RequestContext {
@@ -109,6 +115,8 @@ pub struct RequestContext {
    page_content_kind: PageContentKind,
    read_path_debug: bool,
    scope: Scope,
+    perf_span: Option<PerfSpan>,
+    perf_span_dispatch: Option<Dispatch>,
 }

 #[derive(Clone)]
@@ -263,22 +271,15 @@ impl RequestContextBuilder {
                page_content_kind: PageContentKind::Unknown,
                read_path_debug: false,
                scope: Scope::new_global(),
+                perf_span: None,
+                perf_span_dispatch: None,
            },
        }
    }

-    pub fn extend(original: &RequestContext) -> Self {
+    pub fn from(original: &RequestContext) -> Self {
        Self {
-            // This is like a Copy, but avoid implementing Copy because ordinary users of
-            // RequestContext should always move or ref it.
-            inner: RequestContext {
-                task_kind: original.task_kind,
-                download_behavior: original.download_behavior,
-                access_stats_behavior: original.access_stats_behavior,
-                page_content_kind: original.page_content_kind,
-                read_path_debug: original.read_path_debug,
-                scope: original.scope.clone(),
-            },
+            inner: original.clone(),
        }
    }

@@ -316,12 +317,74 @@ impl RequestContextBuilder {
        self
    }

-    pub fn build(self) -> RequestContext {
+    pub(crate) fn perf_span_dispatch(mut self, dispatch: Option<Dispatch>) -> Self {
+        self.inner.perf_span_dispatch = dispatch;
+        self
+    }
+
+    pub fn root_perf_span<Fn>(mut self, make_span: Fn) -> Self
+    where
+        Fn: FnOnce() -> Span,
+    {
+        assert!(self.inner.perf_span.is_none());
+        assert!(self.inner.perf_span_dispatch.is_some());
+
+        let dispatcher = self.inner.perf_span_dispatch.as_ref().unwrap();
+        let new_span = tracing::dispatcher::with_default(dispatcher, make_span);
+
+        self.inner.perf_span = Some(PerfSpan::new(new_span, dispatcher.clone()));
+
+        self
+    }
+
+    pub fn perf_span<Fn>(mut self, make_span: Fn) -> Self
+    where
+        Fn: FnOnce(&Span) -> Span,
+    {
+        if let Some(ref perf_span) = self.inner.perf_span {
+            assert!(self.inner.perf_span_dispatch.is_some());
+            let dispatcher = self.inner.perf_span_dispatch.as_ref().unwrap();
+
+            let new_span =
+                tracing::dispatcher::with_default(dispatcher, || make_span(perf_span.inner()));
+
+            self.inner.perf_span = Some(PerfSpan::new(new_span, dispatcher.clone()));
+        }
+
+        self
+    }
+
+    pub fn root(self) -> RequestContext {
+        self.inner
+    }
+
+    pub fn attached_child(self) -> RequestContext {
+        self.inner
+    }
+
+    pub fn detached_child(self) -> RequestContext {
        self.inner
    }
 }

 impl RequestContext {
+    /// Private clone implementation
+    ///
+    /// Callers should use the [`RequestContextBuilder`] or child spawning APIs of
+    /// [`RequestContext`].
+    fn clone(&self) -> Self {
+        Self {
+            task_kind: self.task_kind,
+            download_behavior: self.download_behavior,
+            access_stats_behavior: self.access_stats_behavior,
+            page_content_kind: self.page_content_kind,
+            read_path_debug: self.read_path_debug,
+            scope: self.scope.clone(),
+            perf_span: self.perf_span.clone(),
+            perf_span_dispatch: self.perf_span_dispatch.clone(),
+        }
+    }
+
    /// Create a new RequestContext that has no parent.
    ///
    /// The function is called `new` because, once we add children
@@ -337,7 +400,7 @@ impl RequestContext {
    pub fn new(task_kind: TaskKind, download_behavior: DownloadBehavior) -> Self {
        RequestContextBuilder::new(task_kind)
            .download_behavior(download_behavior)
-            .build()
+            .root()
    }

    /// Create a detached child context for a task that may outlive `self`.
@@ -358,7 +421,10 @@ impl RequestContext {
    ///
    /// We could make new calls to this function fail if `self` is already canceled.
    pub fn detached_child(&self, task_kind: TaskKind, download_behavior: DownloadBehavior) -> Self {
-        self.child_impl(task_kind, download_behavior)
+        RequestContextBuilder::from(self)
+            .task_kind(task_kind)
+            .download_behavior(download_behavior)
+            .detached_child()
    }

    /// Create a child of context `self` for a task that shall not outlive `self`.
@@ -382,7 +448,7 @@ impl RequestContext {
    /// The method to wait for child tasks would return an error, indicating
    /// that the child task was not started because the context was canceled.
    pub fn attached_child(&self) -> Self {
-        self.child_impl(self.task_kind(), self.download_behavior())
+        RequestContextBuilder::from(self).attached_child()
    }

    /// Use this function when you should be creating a child context using
@@ -397,17 +463,10 @@ impl RequestContext {
        Self::new(task_kind, download_behavior)
    }

-    fn child_impl(&self, task_kind: TaskKind, download_behavior: DownloadBehavior) -> Self {
-        RequestContextBuilder::extend(self)
-            .task_kind(task_kind)
-            .download_behavior(download_behavior)
-            .build()
-    }
-
    pub fn with_scope_timeline(&self, timeline: &Arc<Timeline>) -> Self {
-        RequestContextBuilder::extend(self)
+        RequestContextBuilder::from(self)
            .scope(Scope::new_timeline(timeline))
-            .build()
+            .attached_child()
    }

    pub(crate) fn with_scope_page_service_pagestream(
@@ -416,9 +475,9 @@ impl RequestContext {
            crate::page_service::TenantManagerTypes,
        >,
    ) -> Self {
-        RequestContextBuilder::extend(self)
+        RequestContextBuilder::from(self)
            .scope(Scope::new_page_service_pagestream(timeline_handle))
-            .build()
+            .attached_child()
    }

    pub fn with_scope_secondary_timeline(
@@ -426,28 +485,30 @@ impl RequestContext {
        tenant_shard_id: &TenantShardId,
        timeline_id: &TimelineId,
    ) -> Self {
-        RequestContextBuilder::extend(self)
+        RequestContextBuilder::from(self)
            .scope(Scope::new_secondary_timeline(tenant_shard_id, timeline_id))
-            .build()
+            .attached_child()
    }

    pub fn with_scope_secondary_tenant(&self, tenant_shard_id: &TenantShardId) -> Self {
-        RequestContextBuilder::extend(self)
+        RequestContextBuilder::from(self)
            .scope(Scope::new_secondary_tenant(tenant_shard_id))
-            .build()
+            .attached_child()
    }

    #[cfg(test)]
    pub fn with_scope_unit_test(&self) -> Self {
-        RequestContextBuilder::new(TaskKind::UnitTest)
+        RequestContextBuilder::from(self)
+            .task_kind(TaskKind::UnitTest)
            .scope(Scope::new_unit_test())
-            .build()
+            .attached_child()
    }

    pub fn with_scope_debug_tools(&self) -> Self {
-        RequestContextBuilder::new(TaskKind::DebugTool)
+        RequestContextBuilder::from(self)
+            .task_kind(TaskKind::DebugTool)
            .scope(Scope::new_debug_tools())
-            .build()
+            .attached_child()
    }

    pub fn task_kind(&self) -> TaskKind {
@@ -504,4 +565,61 @@ impl RequestContext {
            Scope::DebugTools { io_size_metrics } => io_size_metrics,
        }
    }
+
+    pub(crate) fn perf_follows_from(&self, from: &RequestContext) {
+        if let (Some(span), Some(from_span)) = (&self.perf_span, &from.perf_span) {
+            span.inner().follows_from(from_span.inner());
+        }
+    }
+
+    pub(crate) fn perf_span_record<
+        Q: tracing::field::AsField + ?Sized,
+        V: tracing::field::Value,
+    >(
+        &self,
+        field: &Q,
+        value: V,
+    ) {
+        if let Some(span) = &self.perf_span {
+            span.record(field, value);
+        }
+    }
+
+    pub(crate) fn has_perf_span(&self) -> bool {
+        self.perf_span.is_some()
+    }
 }
+
+/// [`Future`] extension trait that allows creating performance
+/// spans on sampled requests.
+pub(crate) trait PerfInstrumentFutureExt<'a>: Future + Send {
+    /// Instrument this future with a new performance span when the
+    /// provided request context indicates the originator request
+    /// was sampled. Otherwise, just box the future and return it as is.
+    fn maybe_perf_instrument<Fn>(
+        self,
+        ctx: &RequestContext,
+        make_span: Fn,
+    ) -> BoxFuture<'a, Self::Output>
+    where
+        Self: Sized + 'a,
+        Fn: FnOnce(&Span) -> Span,
+    {
+        match &ctx.perf_span {
+            Some(perf_span) => {
+                assert!(ctx.perf_span_dispatch.is_some());
+                let dispatcher = ctx.perf_span_dispatch.as_ref().unwrap();
+
+                let new_span =
+                    tracing::dispatcher::with_default(dispatcher, || make_span(perf_span.inner()));
+
+                let new_perf_span = PerfSpan::new(new_span, dispatcher.clone());
+                self.instrument(new_perf_span).boxed()
+            }
+            None => self.boxed(),
+        }
+    }
+}
+
+// Implement the trait for all types that satisfy the trait bounds
+impl<'a, T: Future + Send + 'a> PerfInstrumentFutureExt<'a> for T {}
--- a/pageserver/src/http/routes.rs
+++ b/pageserver/src/http/routes.rs
@@ -2697,11 +2697,12 @@ async fn getpage_at_lsn_handler_inner(
    let lsn: Option<Lsn> = parse_query_param(&request, "lsn")?;

    async {
-        let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
-        // Enable read path debugging
        let timeline = active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id).await?;
-        let ctx = RequestContextBuilder::extend(&ctx).read_path_debug(true)
-        .scope(context::Scope::new_timeline(&timeline)).build();
+        let ctx = RequestContextBuilder::new(TaskKind::MgmtRequest)
+            .download_behavior(DownloadBehavior::Download)
+            .scope(context::Scope::new_timeline(&timeline))
+            .read_path_debug(true)
+            .root();

        // Use last_record_lsn if no lsn is provided
        let lsn = lsn.unwrap_or_else(|| timeline.get_last_record_lsn());
@@ -3188,7 +3189,8 @@ async fn list_aux_files(
        timeline.gate.enter().map_err(|_| ApiError::Cancelled)?,
    );

-    let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
+    let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download)
+        .with_scope_timeline(&timeline);
    let files = timeline
        .list_aux_files(body.lsn, &ctx, io_concurrency)
        .await?;
@@ -3432,14 +3434,15 @@ async fn put_tenant_timeline_import_wal(

    check_permission(&request, Some(tenant_id))?;

-    let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Warn);
-
    let span = info_span!("import_wal", tenant_id=%tenant_id, timeline_id=%timeline_id, start_lsn=%start_lsn, end_lsn=%end_lsn);
    async move {
        let state = get_state(&request);

        let timeline = active_timeline_of_active_tenant(&state.tenant_manager, TenantShardId::unsharded(tenant_id), timeline_id).await?;
-        let ctx = RequestContextBuilder::extend(&ctx).scope(context::Scope::new_timeline(&timeline)).build();
+        let ctx = RequestContextBuilder::new(TaskKind::MgmtRequest)
+            .download_behavior(DownloadBehavior::Warn)
+            .scope(context::Scope::new_timeline(&timeline))
+            .root();

        let mut body = StreamReader::new(request.into_body().map(|res| {
            res.map_err(|error| {
--- a/pageserver/src/lib.rs
+++ b/pageserver/src/lib.rs
@@ -55,6 +55,9 @@ pub const DEFAULT_PG_VERSION: u32 = 16;
 pub const IMAGE_FILE_MAGIC: u16 = 0x5A60;
 pub const DELTA_FILE_MAGIC: u16 = 0x5A61;

+// Target used for performance traces.
+pub const PERF_TRACE_TARGET: &str = "P";
+
 static ZERO_PAGE: bytes::Bytes = bytes::Bytes::from_static(&[0u8; 8192]);

 pub use crate::metrics::preinitialize_metrics;
--- a/pageserver/src/page_service.rs
+++ b/pageserver/src/page_service.rs
@@ -9,6 +9,7 @@ use std::sync::Arc;
 use std::time::{Duration, Instant, SystemTime};
 use std::{io, str};

+use crate::PERF_TRACE_TARGET;
 use anyhow::{Context, bail};
 use async_compression::tokio::write::GzipEncoder;
 use bytes::Buf;
@@ -17,7 +18,7 @@ use itertools::Itertools;
 use once_cell::sync::OnceCell;
 use pageserver_api::config::{
    PageServicePipeliningConfig, PageServicePipeliningConfigPipelined,
-    PageServiceProtocolPipelinedExecutionStrategy,
+    PageServiceProtocolPipelinedExecutionStrategy, Tracing,
 };
 use pageserver_api::key::rel_block_to_key;
 use pageserver_api::models::{
@@ -36,6 +37,7 @@ use postgres_ffi::BLCKSZ;
 use postgres_ffi::pg_constants::DEFAULTTABLESPACE_OID;
 use pq_proto::framed::ConnectionError;
 use pq_proto::{BeMessage, FeMessage, FeStartupPacket, RowDescriptor};
+use rand::Rng;
 use strum_macros::IntoStaticStr;
 use tokio::io::{AsyncRead, AsyncWrite, AsyncWriteExt, BufWriter};
 use tokio::task::JoinHandle;
@@ -53,7 +55,9 @@ use utils::sync::spsc_fold;
 use crate::auth::check_permission;
 use crate::basebackup::BasebackupError;
 use crate::config::PageServerConf;
-use crate::context::{DownloadBehavior, RequestContext};
+use crate::context::{
+    DownloadBehavior, PerfInstrumentFutureExt, RequestContext, RequestContextBuilder,
+};
 use crate::metrics::{
    self, COMPUTE_COMMANDS_COUNTERS, ComputeCommandKind, LIVE_CONNECTIONS, SmgrOpTimer,
    TimelineMetrics,
@@ -100,6 +104,7 @@ pub fn spawn(
    conf: &'static PageServerConf,
    tenant_manager: Arc<TenantManager>,
    pg_auth: Option<Arc<SwappableJwtAuth>>,
+    perf_trace_dispatch: Option<Dispatch>,
    tcp_listener: tokio::net::TcpListener,
 ) -> Listener {
    let cancel = CancellationToken::new();
@@ -117,6 +122,7 @@ pub fn spawn(
            conf,
            tenant_manager,
            pg_auth,
+            perf_trace_dispatch,
            tcp_listener,
            conf.pg_auth_type,
            conf.page_service_pipelining.clone(),
@@ -173,6 +179,7 @@ pub async fn libpq_listener_main(
    conf: &'static PageServerConf,
    tenant_manager: Arc<TenantManager>,
    auth: Option<Arc<SwappableJwtAuth>>,
+    perf_trace_dispatch: Option<Dispatch>,
    listener: tokio::net::TcpListener,
    auth_type: AuthType,
    pipelining_config: PageServicePipeliningConfig,
@@ -205,8 +212,12 @@ pub async fn libpq_listener_main(
                // Connection established. Spawn a new task to handle it.
                debug!("accepted connection from {}", peer_addr);
                let local_auth = auth.clone();
-                let connection_ctx = listener_ctx
-                    .detached_child(TaskKind::PageRequestHandler, DownloadBehavior::Download);
+                let connection_ctx = RequestContextBuilder::from(&listener_ctx)
+                    .task_kind(TaskKind::PageRequestHandler)
+                    .download_behavior(DownloadBehavior::Download)
+                    .perf_span_dispatch(perf_trace_dispatch.clone())
+                    .detached_child();
+
                connection_handler_tasks.spawn(page_service_conn_main(
                    conf,
                    tenant_manager.clone(),
@@ -607,6 +618,7 @@ impl std::fmt::Display for BatchedPageStreamError {
 struct BatchedGetPageRequest {
    req: PagestreamGetPageRequest,
    timer: SmgrOpTimer,
+    ctx: RequestContext,
 }

 #[cfg(feature = "testing")]
@@ -743,6 +755,7 @@ impl PageServerHandler {
        tenant_id: TenantId,
        timeline_id: TimelineId,
        timeline_handles: &mut TimelineHandles,
+        tracing_config: Option<&Tracing>,
        cancel: &CancellationToken,
        ctx: &RequestContext,
        protocol_version: PagestreamProtocolVersion,
@@ -902,10 +915,51 @@ impl PageServerHandler {
                }

                let key = rel_block_to_key(req.rel, req.blkno);
-                let shard = match timeline_handles
+
+                let sampled = match tracing_config {
+                    Some(conf) => {
+                        let ratio = &conf.sampling_ratio;
+
+                        if ratio.numerator == 0 {
+                            false
+                        } else {
+                            rand::thread_rng().gen_range(0..ratio.denominator) < ratio.numerator
+                        }
+                    }
+                    None => false,
+                };
+
+                let ctx = if sampled {
+                    RequestContextBuilder::from(ctx)
+                        .root_perf_span(|| {
+                            info_span!(
+                            target: PERF_TRACE_TARGET,
+                            "GET_PAGE",
+                            tenant_id = %tenant_id,
+                            shard_id = field::Empty,
+                            timeline_id = %timeline_id,
+                            lsn = %req.hdr.request_lsn,
+                            request_id = %req.hdr.reqid,
+                            key = %key,
+                            )
+                        })
+                        .attached_child()
+                } else {
+                    ctx.attached_child()
+                };
+
+                let res = timeline_handles
                    .get(tenant_id, timeline_id, ShardSelector::Page(key))
-                    .await
-                {
+                    .maybe_perf_instrument(&ctx, |current_perf_span| {
+                        info_span!(
+                            target: PERF_TRACE_TARGET,
+                            parent: current_perf_span,
+                            "SHARD_SELECTION",
+                        )
+                    })
+                    .await;
+
+                let shard = match res {
                    Ok(tl) => tl,
                    Err(e) => {
                        let span = mkspan!(before shard routing);
@@ -932,26 +986,60 @@ impl PageServerHandler {
                        }
                    }
                };
+
+                // This ctx travels as part of the BatchedFeMessage through
+                // batching into the request handler.
+                // The request handler needs to do some per-request work
+                // (relsize check) before dispatching the batch as a single
+                // get_vectored call to the Timeline.
+                // This ctx will be used for the relsize check, whereas the
+                // get_vectored call will use a different ctx with a separate
+                // perf span.
+                let ctx = ctx.with_scope_page_service_pagestream(&shard);
+
+                // Similar game for this `span`: we funnel it through so that
+                // request handler log messages contain the request-specific fields.
                let span = mkspan!(shard.tenant_shard_id.shard_slug());

+                // Enrich the perf span with shard_id now that shard routing is done.
+                ctx.perf_span_record(
+                    "shard_id",
+                    tracing::field::display(shard.get_shard_identity().shard_slug()),
+                );
+
                let timer = record_op_start_and_throttle(
                    &shard,
                    metrics::SmgrQueryType::GetPageAtLsn,
                    received_at,
                )
+                .maybe_perf_instrument(&ctx, |current_perf_span| {
+                    info_span!(
+                        target: PERF_TRACE_TARGET,
+                        parent: current_perf_span,
+                        "THROTTLE",
+                    )
+                })
                .await?;

                // We're holding the Handle
-                let effective_request_lsn = match Self::wait_or_get_last_lsn(
+                // TODO: if we actually need to wait for lsn here, it delays the entire batch which doesn't need to wait
+                let res = Self::wait_or_get_last_lsn(
                    &shard,
                    req.hdr.request_lsn,
                    req.hdr.not_modified_since,
                    &shard.get_applied_gc_cutoff_lsn(),
-                    ctx,
+                    &ctx,
                )
-                // TODO: if we actually need to wait for lsn here, it delays the entire batch which doesn't need to wait
-                .await
-                {
+                .maybe_perf_instrument(&ctx, |current_perf_span| {
+                    info_span!(
+                        target: PERF_TRACE_TARGET,
+                        parent: current_perf_span,
+                        "WAIT_LSN",
+                    )
+                })
+                .await;
+
+                let effective_request_lsn = match res {
                    Ok(lsn) => lsn,
                    Err(e) => {
                        return respond_error!(span, e);
@@ -961,7 +1049,7 @@ impl PageServerHandler {
                    span,
                    shard: shard.downgrade(),
                    effective_request_lsn,
-                    pages: smallvec::smallvec![BatchedGetPageRequest { req, timer }],
+                    pages: smallvec::smallvec![BatchedGetPageRequest { req, timer, ctx }],
                }
            }
            #[cfg(feature = "testing")]
@@ -1514,12 +1602,15 @@ impl PageServerHandler {
        IO: AsyncRead + AsyncWrite + Send + Sync + Unpin + 'static,
    {
        let cancel = self.cancel.clone();
+        let tracing_config = self.conf.tracing.clone();
+
        let err = loop {
            let msg = Self::pagestream_read_message(
                &mut pgb_reader,
                tenant_id,
                timeline_id,
                &mut timeline_handles,
+                tracing_config.as_ref(),
                &cancel,
                ctx,
                protocol_version,
@@ -1653,6 +1744,8 @@ impl PageServerHandler {
        // Batcher
        //

+        let tracing_config = self.conf.tracing.clone();
+
        let cancel_batcher = self.cancel.child_token();
        let (mut batch_tx, mut batch_rx) = spsc_fold::channel();
        let batcher = pipeline_stage!("batcher", cancel_batcher.clone(), move |cancel_batcher| {
@@ -1666,6 +1759,7 @@ impl PageServerHandler {
                        tenant_id,
                        timeline_id,
                        &mut timeline_handles,
+                        tracing_config.as_ref(),
                        &cancel_batcher,
                        &ctx,
                        protocol_version,
@@ -2004,7 +2098,9 @@ impl PageServerHandler {

        let results = timeline
            .get_rel_page_at_lsn_batched(
-                requests.iter().map(|p| (&p.req.rel, &p.req.blkno)),
+                requests
+                    .iter()
+                    .map(|p| (&p.req.rel, &p.req.blkno, p.ctx.attached_child())),
                effective_lsn,
                io_concurrency,
                ctx,
--- a/pageserver/src/pgdatadir_mapping.rs
+++ b/pageserver/src/pgdatadir_mapping.rs
@@ -9,6 +9,7 @@
 use std::collections::{BTreeMap, HashMap, HashSet, hash_map};
 use std::ops::{ControlFlow, Range};

+use crate::PERF_TRACE_TARGET;
 use anyhow::{Context, ensure};
 use bytes::{Buf, Bytes, BytesMut};
 use enum_map::Enum;
@@ -31,7 +32,7 @@ use postgres_ffi::{BLCKSZ, Oid, RepOriginId, TimestampTz, TransactionId};
 use serde::{Deserialize, Serialize};
 use strum::IntoEnumIterator;
 use tokio_util::sync::CancellationToken;
-use tracing::{debug, info, trace, warn};
+use tracing::{debug, info, info_span, trace, warn};
 use utils::bin_ser::{BeSer, DeserializeError};
 use utils::lsn::Lsn;
 use utils::pausable_failpoint;
@@ -39,7 +40,7 @@ use wal_decoder::serialized_batch::{SerializedValueBatch, ValueMeta};

 use super::tenant::{PageReconstructError, Timeline};
 use crate::aux_file;
-use crate::context::RequestContext;
+use crate::context::{PerfInstrumentFutureExt, RequestContext, RequestContextBuilder};
 use crate::keyspace::{KeySpace, KeySpaceAccum};
 use crate::metrics::{
    RELSIZE_CACHE_ENTRIES, RELSIZE_CACHE_HITS, RELSIZE_CACHE_MISSES, RELSIZE_CACHE_MISSES_OLD,
@@ -209,7 +210,9 @@ impl Timeline {
                let pages: smallvec::SmallVec<[_; 1]> = smallvec::smallvec![(tag, blknum)];
                let res = self
                    .get_rel_page_at_lsn_batched(
-                        pages.iter().map(|(tag, blknum)| (tag, blknum)),
+                        pages
+                            .iter()
+                            .map(|(tag, blknum)| (tag, blknum, ctx.attached_child())),
                        effective_lsn,
                        io_concurrency.clone(),
                        ctx,
@@ -248,7 +251,7 @@ impl Timeline {
    /// The ordering of the returned vec corresponds to the ordering of `pages`.
    pub(crate) async fn get_rel_page_at_lsn_batched(
        &self,
-        pages: impl ExactSizeIterator<Item = (&RelTag, &BlockNumber)>,
+        pages: impl ExactSizeIterator<Item = (&RelTag, &BlockNumber, RequestContext)>,
        effective_lsn: Lsn,
        io_concurrency: IoConcurrency,
        ctx: &RequestContext,
@@ -262,8 +265,11 @@ impl Timeline {
        let mut result = Vec::with_capacity(pages.len());
        let result_slots = result.spare_capacity_mut();

-        let mut keys_slots: BTreeMap<Key, smallvec::SmallVec<[usize; 1]>> = BTreeMap::default();
-        for (response_slot_idx, (tag, blknum)) in pages.enumerate() {
+        let mut keys_slots: BTreeMap<Key, smallvec::SmallVec<[(usize, RequestContext); 1]>> =
+            BTreeMap::default();
+
+        let mut perf_instrument = false;
+        for (response_slot_idx, (tag, blknum, ctx)) in pages.enumerate() {
            if tag.relnode == 0 {
                result_slots[response_slot_idx].write(Err(PageReconstructError::Other(
                    RelationError::InvalidRelnode.into(),
@@ -274,7 +280,16 @@ impl Timeline {
            }

            let nblocks = match self
-                .get_rel_size(*tag, Version::Lsn(effective_lsn), ctx)
+                .get_rel_size(*tag, Version::Lsn(effective_lsn), &ctx)
+                .maybe_perf_instrument(&ctx, |crnt_perf_span| {
+                    info_span!(
+                        target: PERF_TRACE_TARGET,
+                        parent: crnt_perf_span,
+                        "GET_REL_SIZE",
+                        reltag=%tag,
+                        lsn=%effective_lsn,
+                    )
+                })
                .await
            {
                Ok(nblocks) => nblocks,
@@ -297,8 +312,12 @@ impl Timeline {

            let key = rel_block_to_key(*tag, *blknum);

+            if ctx.has_perf_span() {
+                perf_instrument = true;
+            }
+
            let key_slots = keys_slots.entry(key).or_default();
-            key_slots.push(response_slot_idx);
+            key_slots.push((response_slot_idx, ctx));
        }

        let keyspace = {
@@ -314,16 +333,34 @@ impl Timeline {
            acc.to_keyspace()
        };

-        match self
-            .get_vectored(keyspace, effective_lsn, io_concurrency, ctx)
-            .await
-        {
+        let ctx = match perf_instrument {
+            true => RequestContextBuilder::from(ctx)
+                .root_perf_span(|| {
+                    info_span!(
+                        target: PERF_TRACE_TARGET,
+                        "GET_VECTORED",
+                        tenant_id = %self.tenant_shard_id.tenant_id,
+                        timeline_id = %self.timeline_id,
+                        lsn = %effective_lsn,
+                        shard = %self.tenant_shard_id.shard_slug(),
+                    )
+                })
+                .attached_child(),
+            false => ctx.attached_child(),
+        };
+
+        let res = self
+            .get_vectored(keyspace, effective_lsn, io_concurrency, &ctx)
+            .maybe_perf_instrument(&ctx, |current_perf_span| current_perf_span.clone())
+            .await;
+
+        match res {
            Ok(results) => {
                for (key, res) in results {
                    let mut key_slots = keys_slots.remove(&key).unwrap().into_iter();
-                    let first_slot = key_slots.next().unwrap();
+                    let (first_slot, first_req_ctx) = key_slots.next().unwrap();

-                    for slot in key_slots {
+                    for (slot, req_ctx) in key_slots {
                        let clone = match &res {
                            Ok(buf) => Ok(buf.clone()),
                            Err(err) => Err(match err {
@@ -341,17 +378,22 @@ impl Timeline {
                        };

                        result_slots[slot].write(clone);
+                        // There is no standardized way to express that the batched span followed from N request spans.
+                        // So, abuse the system and mark the request contexts as follows_from the batch span, so we get
+                        // some linkage in our trace viewer. It allows us to answer: which GET_VECTORED did this GET_PAGE wait for.
+                        req_ctx.perf_follows_from(&ctx);
                        slots_filled += 1;
                    }

                    result_slots[first_slot].write(res);
+                    first_req_ctx.perf_follows_from(&ctx);
                    slots_filled += 1;
                }
            }
            Err(err) => {
                // this cannot really happen because get_vectored only errors globally on invalid LSN or too large batch size
                // (We enforce the max batch size outside of this function, in the code that constructs the batch request.)
-                for slot in keys_slots.values().flatten() {
+                for (slot, req_ctx) in keys_slots.values().flatten() {
                    // this whole `match` is a lot like `From<GetVectoredError> for PageReconstructError`
                    // but without taking ownership of the GetVectoredError
                    let err = match &err {
@@ -383,6 +425,7 @@ impl Timeline {
                        }
                    };

+                    req_ctx.perf_follows_from(&ctx);
                    result_slots[*slot].write(err);
                }

--- a/pageserver/src/task_mgr.rs
+++ b/pageserver/src/task_mgr.rs
@@ -219,8 +219,7 @@ pageserver_runtime!(MGMT_REQUEST_RUNTIME, "mgmt request worker");
 pageserver_runtime!(WALRECEIVER_RUNTIME, "walreceiver worker");
 pageserver_runtime!(BACKGROUND_RUNTIME, "background op worker");
 // Bump this number when adding a new pageserver_runtime!
-// SAFETY: it's obviously correct
-const NUM_MULTIPLE_RUNTIMES: NonZeroUsize = unsafe { NonZeroUsize::new_unchecked(4) };
+const NUM_MULTIPLE_RUNTIMES: NonZeroUsize = NonZeroUsize::new(4).unwrap();

 #[derive(Debug, Clone, Copy)]
 pub struct PageserverTaskId(u64);
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -3689,7 +3689,7 @@ impl Tenant {
                        }
                    }
                }
-                TenantState::Active { .. } => {
+                TenantState::Active => {
                    return Ok(());
                }
                TenantState::Broken { reason, .. } => {
@@ -4205,9 +4205,9 @@ impl Tenant {
            self.cancel.child_token(),
        );

-        let timeline_ctx = RequestContextBuilder::extend(ctx)
+        let timeline_ctx = RequestContextBuilder::from(ctx)
            .scope(context::Scope::new_timeline(&timeline))
-            .build();
+            .detached_child();

        Ok((timeline, timeline_ctx))
    }
--- a/pageserver/src/tenant/layer_map/layer_coverage.rs
+++ b/pageserver/src/tenant/layer_map/layer_coverage.rs
@@ -53,7 +53,7 @@ impl<Value: Clone> LayerCoverage<Value> {
    ///
    /// Complexity: O(log N)
    fn add_node(&mut self, key: i128) {
-        let value = match self.nodes.range(..=key).last() {
+        let value = match self.nodes.range(..=key).next_back() {
            Some((_, Some(v))) => Some(v.clone()),
            Some((_, None)) => None,
            None => None,
--- a/pageserver/src/tenant/mgr.rs
+++ b/pageserver/src/tenant/mgr.rs
@@ -58,7 +58,7 @@ use crate::{InitializationOrder, TEMP_FILE_SUFFIX};

 /// For a tenant that appears in TenantsMap, it may either be
 /// - `Attached`: has a full Tenant object, is elegible to service
-///    reads and ingest WAL.
+///   reads and ingest WAL.
 /// - `Secondary`: is only keeping a local cache warm.
 ///
 /// Secondary is a totally distinct state rather than being a mode of a `Tenant`, because
--- a/pageserver/src/tenant/remote_timeline_client/index.rs
+++ b/pageserver/src/tenant/remote_timeline_client/index.rs
@@ -130,7 +130,7 @@ impl IndexPart {
    /// Version history
    /// - 2: added `deleted_at`
    /// - 3: no longer deserialize `timeline_layers` (serialized format is the same, but timeline_layers
-    ///      is always generated from the keys of `layer_metadata`)
+    ///   is always generated from the keys of `layer_metadata`)
    /// - 4: timeline_layers is fully removed.
    /// - 5: lineage was added
    /// - 6: last_aux_file_policy is added.
--- a/pageserver/src/tenant/storage_layer.rs
+++ b/pageserver/src/tenant/storage_layer.rs
@@ -13,13 +13,13 @@ pub mod merge_iterator;
 use std::cmp::Ordering;
 use std::collections::hash_map::Entry;
 use std::collections::{BinaryHeap, HashMap};
-use std::future::Future;
 use std::ops::Range;
 use std::pin::Pin;
 use std::sync::Arc;
 use std::sync::atomic::AtomicUsize;
 use std::time::{Duration, SystemTime, UNIX_EPOCH};

+use crate::PERF_TRACE_TARGET;
 pub use batch_split_writer::{BatchLayerWriter, SplitDeltaLayerWriter, SplitImageLayerWriter};
 use bytes::Bytes;
 pub use delta_layer::{DeltaLayer, DeltaLayerWriter, ValueRef};
@@ -34,7 +34,7 @@ use pageserver_api::key::Key;
 use pageserver_api::keyspace::{KeySpace, KeySpaceRandomAccum};
 use pageserver_api::record::NeonWalRecord;
 use pageserver_api::value::Value;
-use tracing::{Instrument, trace};
+use tracing::{Instrument, info_span, trace};
 use utils::lsn::Lsn;
 use utils::sync::gate::GateGuard;

@@ -43,7 +43,9 @@ use super::PageReconstructError;
 use super::layer_map::InMemoryLayerDesc;
 use super::timeline::{GetVectoredError, ReadPath};
 use crate::config::PageServerConf;
-use crate::context::{AccessStatsBehavior, RequestContext};
+use crate::context::{
+    AccessStatsBehavior, PerfInstrumentFutureExt, RequestContext, RequestContextBuilder,
+};

 pub fn range_overlaps<T>(a: &Range<T>, b: &Range<T>) -> bool
 where
@@ -874,13 +876,37 @@ impl ReadableLayer {
    ) -> Result<(), GetVectoredError> {
        match self {
            ReadableLayer::PersistentLayer(layer) => {
+                let ctx = RequestContextBuilder::from(ctx)
+                    .perf_span(|crnt_perf_span| {
+                        info_span!(
+                            target: PERF_TRACE_TARGET,
+                            parent: crnt_perf_span,
+                            "PLAN_LAYER",
+                            layer = %layer
+                        )
+                    })
+                    .attached_child();
+
                layer
-                    .get_values_reconstruct_data(keyspace, lsn_range, reconstruct_state, ctx)
+                    .get_values_reconstruct_data(keyspace, lsn_range, reconstruct_state, &ctx)
+                    .maybe_perf_instrument(&ctx, |crnt_perf_span| crnt_perf_span.clone())
                    .await
            }
            ReadableLayer::InMemoryLayer(layer) => {
+                let ctx = RequestContextBuilder::from(ctx)
+                    .perf_span(|crnt_perf_span| {
+                        info_span!(
+                            target: PERF_TRACE_TARGET,
+                            parent: crnt_perf_span,
+                            "PLAN_LAYER",
+                            layer = %layer
+                        )
+                    })
+                    .attached_child();
+
                layer
-                    .get_values_reconstruct_data(keyspace, lsn_range, reconstruct_state, ctx)
+                    .get_values_reconstruct_data(keyspace, lsn_range, reconstruct_state, &ctx)
+                    .maybe_perf_instrument(&ctx, |crnt_perf_span| crnt_perf_span.clone())
                    .await
            }
        }
--- a/pageserver/src/tenant/storage_layer/delta_layer.rs
+++ b/pageserver/src/tenant/storage_layer/delta_layer.rs
@@ -896,9 +896,9 @@ impl DeltaLayerInner {
    where
        Reader: BlockReader + Clone,
    {
-        let ctx = RequestContextBuilder::extend(ctx)
+        let ctx = RequestContextBuilder::from(ctx)
            .page_content_kind(PageContentKind::DeltaLayerBtreeNode)
-            .build();
+            .attached_child();

        for range in keyspace.ranges.iter() {
            let mut range_end_handled = false;
@@ -1105,9 +1105,9 @@ impl DeltaLayerInner {
                    all_keys.push(entry);
                    true
                },
-                &RequestContextBuilder::extend(ctx)
+                &RequestContextBuilder::from(ctx)
                    .page_content_kind(PageContentKind::DeltaLayerBtreeNode)
-                    .build(),
+                    .attached_child(),
            )
            .await?;
        if let Some(last) = all_keys.last_mut() {
--- a/pageserver/src/tenant/storage_layer/image_layer.rs
+++ b/pageserver/src/tenant/storage_layer/image_layer.rs
@@ -481,9 +481,9 @@ impl ImageLayerInner {
        let tree_reader =
            DiskBtreeReader::new(self.index_start_blk, self.index_root_blk, block_reader);

-        let ctx = RequestContextBuilder::extend(ctx)
+        let ctx = RequestContextBuilder::from(ctx)
            .page_content_kind(PageContentKind::ImageLayerBtreeNode)
-            .build();
+            .attached_child();

        for range in keyspace.ranges.iter() {
            let mut range_end_handled = false;
--- a/pageserver/src/tenant/storage_layer/inmemory_layer.rs
+++ b/pageserver/src/tenant/storage_layer/inmemory_layer.rs
@@ -421,9 +421,9 @@ impl InMemoryLayer {
        reconstruct_state: &mut ValuesReconstructState,
        ctx: &RequestContext,
    ) -> Result<(), GetVectoredError> {
-        let ctx = RequestContextBuilder::extend(ctx)
+        let ctx = RequestContextBuilder::from(ctx)
            .page_content_kind(PageContentKind::InMemoryLayer)
-            .build();
+            .attached_child();

        let inner = self.inner.read().await;

--- a/pageserver/src/tenant/storage_layer/layer.rs
+++ b/pageserver/src/tenant/storage_layer/layer.rs
@@ -3,12 +3,13 @@ use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
 use std::sync::{Arc, Weak};
 use std::time::{Duration, SystemTime};

+use crate::PERF_TRACE_TARGET;
 use anyhow::Context;
 use camino::{Utf8Path, Utf8PathBuf};
 use pageserver_api::keyspace::KeySpace;
 use pageserver_api::models::HistoricLayerInfo;
 use pageserver_api::shard::{ShardIdentity, ShardIndex, TenantShardId};
-use tracing::Instrument;
+use tracing::{Instrument, info_span};
 use utils::generation::Generation;
 use utils::id::TimelineId;
 use utils::lsn::Lsn;
@@ -18,7 +19,7 @@ use super::delta_layer::{self};
 use super::image_layer::{self};
 use super::{
    AsLayerDesc, ImageLayerWriter, LayerAccessStats, LayerAccessStatsReset, LayerName,
-    LayerVisibilityHint, PersistentLayerDesc, ValuesReconstructState,
+    LayerVisibilityHint, PerfInstrumentFutureExt, PersistentLayerDesc, ValuesReconstructState,
 };
 use crate::config::PageServerConf;
 use crate::context::{DownloadBehavior, RequestContext, RequestContextBuilder};
@@ -324,16 +325,29 @@ impl Layer {
        reconstruct_data: &mut ValuesReconstructState,
        ctx: &RequestContext,
    ) -> Result<(), GetVectoredError> {
-        let downloaded =
+        let downloaded = {
+            let ctx = RequestContextBuilder::from(ctx)
+                .perf_span(|crnt_perf_span| {
+                    info_span!(
+                        target: PERF_TRACE_TARGET,
+                        parent: crnt_perf_span,
+                        "GET_LAYER",
+                    )
+                })
+                .attached_child();
+
            self.0
-                .get_or_maybe_download(true, ctx)
+                .get_or_maybe_download(true, &ctx)
+                .maybe_perf_instrument(&ctx, |crnt_perf_context| crnt_perf_context.clone())
                .await
                .map_err(|err| match err {
                    DownloadError::TimelineShutdown | DownloadError::DownloadCancelled => {
                        GetVectoredError::Cancelled
                    }
                    other => GetVectoredError::Other(anyhow::anyhow!(other)),
-                })?;
+                })?
+        };
+
        let this = ResidentLayer {
            downloaded: downloaded.clone(),
            owner: self.clone(),
@@ -341,9 +355,20 @@ impl Layer {

        self.record_access(ctx);

+        let ctx = RequestContextBuilder::from(ctx)
+            .perf_span(|crnt_perf_span| {
+                info_span!(
+                    target: PERF_TRACE_TARGET,
+                    parent: crnt_perf_span,
+                    "VISIT_LAYER",
+                )
+            })
+            .attached_child();
+
        downloaded
-            .get_values_reconstruct_data(this, keyspace, lsn_range, reconstruct_data, ctx)
+            .get_values_reconstruct_data(this, keyspace, lsn_range, reconstruct_data, &ctx)
            .instrument(tracing::debug_span!("get_values_reconstruct_data", layer=%self))
+            .maybe_perf_instrument(&ctx, |crnt_perf_span| crnt_perf_span.clone())
            .await
            .map_err(|err| match err {
                GetVectoredError::Other(err) => GetVectoredError::Other(
@@ -1045,15 +1070,34 @@ impl LayerInner {
            return Err(DownloadError::DownloadRequired);
        }

-        let download_ctx = ctx.detached_child(TaskKind::LayerDownload, DownloadBehavior::Download);
+        let ctx = if ctx.has_perf_span() {
+            let dl_ctx = RequestContextBuilder::from(ctx)
+                .task_kind(TaskKind::LayerDownload)
+                .download_behavior(DownloadBehavior::Download)
+                .root_perf_span(|| {
+                    info_span!(
+                        target: PERF_TRACE_TARGET,
+                        "DOWNLOAD_LAYER",
+                        layer = %self,
+                        reason = %reason
+                    )
+                })
+                .detached_child();
+            ctx.perf_follows_from(&dl_ctx);
+            dl_ctx
+        } else {
+            ctx.attached_child()
+        };

        async move {
            tracing::info!(%reason, "downloading on-demand");

            let init_cancelled = scopeguard::guard((), |_| LAYER_IMPL_METRICS.inc_init_cancelled());
            let res = self
-                .download_init_and_wait(timeline, permit, download_ctx)
+                .download_init_and_wait(timeline, permit, ctx.attached_child())
+                .maybe_perf_instrument(&ctx, |crnt_perf_span| crnt_perf_span.clone())
                .await?;
+
            scopeguard::ScopeGuard::into_inner(init_cancelled);
            Ok(res)
        }
@@ -1720,9 +1764,9 @@ impl DownloadedLayer {
            );

            let res = if owner.desc.is_delta {
-                let ctx = RequestContextBuilder::extend(ctx)
+                let ctx = RequestContextBuilder::from(ctx)
                    .page_content_kind(crate::context::PageContentKind::DeltaLayerSummary)
-                    .build();
+                    .attached_child();
                let summary = Some(delta_layer::Summary::expected(
                    owner.desc.tenant_shard_id.tenant_id,
                    owner.desc.timeline_id,
@@ -1738,9 +1782,9 @@ impl DownloadedLayer {
                .await
                .map(LayerKind::Delta)
            } else {
-                let ctx = RequestContextBuilder::extend(ctx)
+                let ctx = RequestContextBuilder::from(ctx)
                    .page_content_kind(crate::context::PageContentKind::ImageLayerSummary)
-                    .build();
+                    .attached_child();
                let lsn = owner.desc.image_layer_lsn();
                let summary = Some(image_layer::Summary::expected(
                    owner.desc.tenant_shard_id.tenant_id,
--- a/pageserver/src/tenant/storage_layer/layer/tests.rs
+++ b/pageserver/src/tenant/storage_layer/layer/tests.rs
@@ -119,6 +119,10 @@ async fn smoke_test() {
    let e = layer.evict_and_wait(FOREVER).await.unwrap_err();
    assert!(matches!(e, EvictionError::NotFound));

+    let dl_ctx = RequestContextBuilder::from(ctx)
+        .download_behavior(DownloadBehavior::Download)
+        .attached_child();
+
    // on accesses when the layer is evicted, it will automatically be downloaded.
    let img_after = {
        let mut data = ValuesReconstructState::new(io_concurrency.clone());
@@ -127,7 +131,7 @@ async fn smoke_test() {
                controlfile_keyspace.clone(),
                Lsn(0x10)..Lsn(0x11),
                &mut data,
-                ctx,
+                &dl_ctx,
            )
            .instrument(download_span.clone())
            .await
@@ -177,7 +181,7 @@ async fn smoke_test() {

    // plain downloading is rarely needed
    layer
-        .download_and_keep_resident(ctx)
+        .download_and_keep_resident(&dl_ctx)
        .instrument(download_span)
        .await
        .unwrap();
@@ -645,9 +649,10 @@ async fn cancelled_get_or_maybe_download_does_not_cancel_eviction() {
    let ctx = ctx.with_scope_timeline(&timeline);

    // This test does downloads
-    let ctx = RequestContextBuilder::extend(&ctx)
+    let ctx = RequestContextBuilder::from(&ctx)
        .download_behavior(DownloadBehavior::Download)
-        .build();
+        .attached_child();
+
    let layer = {
        let mut layers = {
            let layers = timeline.layers.read().await;
@@ -730,9 +735,9 @@ async fn evict_and_wait_does_not_wait_for_download() {
    let ctx = ctx.with_scope_timeline(&timeline);

    // This test does downloads
-    let ctx = RequestContextBuilder::extend(&ctx)
+    let ctx = RequestContextBuilder::from(&ctx)
        .download_behavior(DownloadBehavior::Download)
-        .build();
+        .attached_child();

    let layer = {
        let mut layers = {
--- a/pageserver/src/tenant/timeline.rs
+++ b/pageserver/src/tenant/timeline.rs
@@ -23,6 +23,7 @@ use std::sync::atomic::{AtomicBool, AtomicU64, Ordering as AtomicOrdering};
 use std::sync::{Arc, Mutex, OnceLock, RwLock, Weak};
 use std::time::{Duration, Instant, SystemTime};

+use crate::PERF_TRACE_TARGET;
 use anyhow::{Context, Result, anyhow, bail, ensure};
 use arc_swap::{ArcSwap, ArcSwapOption};
 use bytes::Bytes;
@@ -96,7 +97,9 @@ use super::{
 };
 use crate::aux_file::AuxFileSizeEstimator;
 use crate::config::PageServerConf;
-use crate::context::{DownloadBehavior, RequestContext};
+use crate::context::{
+    DownloadBehavior, PerfInstrumentFutureExt, RequestContext, RequestContextBuilder,
+};
 use crate::disk_usage_eviction_task::{DiskUsageEvictionInfo, EvictionCandidate, finite_f32};
 use crate::keyspace::{KeyPartitioning, KeySpace};
 use crate::l0_flush::{self, L0FlushGlobalState};
@@ -1289,9 +1292,22 @@ impl Timeline {
        };
        reconstruct_state.read_path = read_path;

-        let traversal_res: Result<(), _> = self
-            .get_vectored_reconstruct_data(keyspace.clone(), lsn, reconstruct_state, ctx)
-            .await;
+        let traversal_res: Result<(), _> = {
+            let ctx = RequestContextBuilder::from(ctx)
+                .perf_span(|crnt_perf_span| {
+                    info_span!(
+                        target: PERF_TRACE_TARGET,
+                        parent: crnt_perf_span,
+                        "PLAN_IO",
+                    )
+                })
+                .attached_child();
+
+            self.get_vectored_reconstruct_data(keyspace.clone(), lsn, reconstruct_state, &ctx)
+                .maybe_perf_instrument(&ctx, |crnt_perf_span| crnt_perf_span.clone())
+                .await
+        };
+
        if let Err(err) = traversal_res {
            // Wait for all the spawned IOs to complete.
            // See comments on `spawn_io` inside `storage_layer` for more details.
@@ -1305,14 +1321,46 @@ impl Timeline {

        let layers_visited = reconstruct_state.get_layers_visited();

+        let ctx = RequestContextBuilder::from(ctx)
+            .perf_span(|crnt_perf_span| {
+                info_span!(
+                    target: PERF_TRACE_TARGET,
+                    parent: crnt_perf_span,
+                    "RECONSTRUCT",
+                )
+            })
+            .attached_child();
+
        let futs = FuturesUnordered::new();
        for (key, state) in std::mem::take(&mut reconstruct_state.keys) {
            futs.push({
                let walredo_self = self.myself.upgrade().expect("&self method holds the arc");
+                let ctx = RequestContextBuilder::from(&ctx)
+                    .perf_span(|crnt_perf_span| {
+                        info_span!(
+                            target: PERF_TRACE_TARGET,
+                            parent: crnt_perf_span,
+                            "RECONSTRUCT_KEY",
+                            key = %key,
+                        )
+                    })
+                    .attached_child();
+
                async move {
                    assert_eq!(state.situation, ValueReconstructSituation::Complete);

-                    let converted = match state.collect_pending_ios().await {
+                    let res = state
+                        .collect_pending_ios()
+                        .maybe_perf_instrument(&ctx, |crnt_perf_span| {
+                            info_span!(
+                                target: PERF_TRACE_TARGET,
+                                parent: crnt_perf_span,
+                                "WAIT_FOR_IO_COMPLETIONS",
+                            )
+                        })
+                        .await;
+
+                    let converted = match res {
                        Ok(ok) => ok,
                        Err(err) => {
                            return (key, Err(err));
@@ -1329,16 +1377,27 @@ impl Timeline {
                        "{converted:?}"
                    );

-                    (
-                        key,
-                        walredo_self.reconstruct_value(key, lsn, converted).await,
-                    )
+                    let walredo_deltas = converted.num_deltas();
+                    let walredo_res = walredo_self
+                        .reconstruct_value(key, lsn, converted)
+                        .maybe_perf_instrument(&ctx, |crnt_perf_span| {
+                            info_span!(
+                                target: PERF_TRACE_TARGET,
+                                parent: crnt_perf_span,
+                                "WALREDO",
+                                deltas = %walredo_deltas,
+                            )
+                        })
+                        .await;
+
+                    (key, walredo_res)
                }
            });
        }

        let results = futs
            .collect::<BTreeMap<Key, Result<Bytes, PageReconstructError>>>()
+            .maybe_perf_instrument(&ctx, |crnt_perf_span| crnt_perf_span.clone())
            .await;

        // For aux file keys (v1 or v2) the vectored read path does not return an error
@@ -2247,7 +2306,7 @@ impl Timeline {
                        .await
                        .expect("holding a reference to self");
                }
-                TimelineState::Active { .. } => {
+                TimelineState::Active => {
                    return Ok(());
                }
                TimelineState::Broken { .. } | TimelineState::Stopping => {
@@ -3875,15 +3934,30 @@ impl Timeline {
            let TimelineVisitOutcome {
                completed_keyspace: completed,
                image_covered_keyspace,
-            } = Self::get_vectored_reconstruct_data_timeline(
-                timeline,
-                keyspace.clone(),
-                cont_lsn,
-                reconstruct_state,
-                &self.cancel,
-                ctx,
-            )
-            .await?;
+            } = {
+                let ctx = RequestContextBuilder::from(ctx)
+                    .perf_span(|crnt_perf_span| {
+                        info_span!(
+                            target: PERF_TRACE_TARGET,
+                            parent: crnt_perf_span,
+                            "PLAN_IO_TIMELINE",
+                            timeline = %timeline.timeline_id,
+                            lsn = %cont_lsn,
+                        )
+                    })
+                    .attached_child();
+
+                Self::get_vectored_reconstruct_data_timeline(
+                    timeline,
+                    keyspace.clone(),
+                    cont_lsn,
+                    reconstruct_state,
+                    &self.cancel,
+                    &ctx,
+                )
+                .maybe_perf_instrument(&ctx, |crnt_perf_span| crnt_perf_span.clone())
+                .await?
+            };

            keyspace.remove_overlapping_with(&completed);

@@ -3927,8 +4001,24 @@ impl Timeline {

            // Take the min to avoid reconstructing a page with data newer than request Lsn.
            cont_lsn = std::cmp::min(Lsn(request_lsn.0 + 1), Lsn(timeline.ancestor_lsn.0 + 1));
+
+            let ctx = RequestContextBuilder::from(ctx)
+                .perf_span(|crnt_perf_span| {
+                    info_span!(
+                        target: PERF_TRACE_TARGET,
+                        parent: crnt_perf_span,
+                        "GET_ANCESTOR",
+                        timeline = %timeline.timeline_id,
+                        lsn = %cont_lsn,
+                        ancestor = %ancestor_timeline.timeline_id,
+                        ancestor_lsn = %timeline.ancestor_lsn
+                    )
+                })
+                .attached_child();
+
            timeline_owned = timeline
-                .get_ready_ancestor_timeline(ancestor_timeline, ctx)
+                .get_ready_ancestor_timeline(ancestor_timeline, &ctx)
+                .maybe_perf_instrument(&ctx, |crnt_perf_span| crnt_perf_span.clone())
                .await?;
            timeline = &*timeline_owned;
        };
@@ -7259,9 +7349,9 @@ mod tests {

            eprintln!("Downloading {layer} and re-generating heatmap");

-            let ctx = &RequestContextBuilder::extend(ctx)
+            let ctx = &RequestContextBuilder::from(ctx)
                .download_behavior(crate::context::DownloadBehavior::Download)
-                .build();
+                .attached_child();

            let _resident = layer
                .download_and_keep_resident(ctx)
--- a/pageserver/src/tenant/timeline/compaction.rs
+++ b/pageserver/src/tenant/timeline/compaction.rs
@@ -26,7 +26,7 @@ use once_cell::sync::Lazy;
 use pageserver_api::config::tenant_conf_defaults::DEFAULT_CHECKPOINT_DISTANCE;
 use pageserver_api::key::{KEY_SIZE, Key};
 use pageserver_api::keyspace::{KeySpace, ShardedRange};
-use pageserver_api::models::CompactInfoResponse;
+use pageserver_api::models::{CompactInfoResponse, CompactKeyRange};
 use pageserver_api::record::NeonWalRecord;
 use pageserver_api::shard::{ShardCount, ShardIdentity, TenantShardId};
 use pageserver_api::value::Value;
@@ -61,7 +61,7 @@ use crate::tenant::timeline::{
    DeltaLayerWriter, ImageLayerCreationOutcome, ImageLayerWriter, IoConcurrency, Layer,
    ResidentLayer, drop_rlock,
 };
-use crate::tenant::{DeltaLayer, MaybeOffloaded, gc_block};
+use crate::tenant::{DeltaLayer, MaybeOffloaded};
 use crate::virtual_file::{MaybeFatalIo, VirtualFile};

 /// Maximum number of deltas before generating an image layer in bottom-most compaction.
@@ -123,7 +123,6 @@ impl GcCompactionQueueItem {
 #[derive(Default)]
 struct GcCompactionGuardItems {
    notify: Option<tokio::sync::oneshot::Sender<()>>,
-    gc_guard: Option<gc_block::Guard>,
    permit: Option<OwnedSemaphorePermit>,
 }

@@ -279,7 +278,7 @@ impl GcCompactionQueue {
            gc_compaction_ratio_percent: u64,
        ) -> bool {
            const AUTO_TRIGGER_LIMIT: u64 = 150 * 1024 * 1024 * 1024; // 150GB
-            if l1_size >= AUTO_TRIGGER_LIMIT || l2_size >= AUTO_TRIGGER_LIMIT {
+            if l1_size + l2_size >= AUTO_TRIGGER_LIMIT {
                // Do not auto-trigger when physical size >= 150GB
                return false;
            }
@@ -319,7 +318,12 @@ impl GcCompactionQueue {
                        flags
                    },
                    sub_compaction: true,
-                    compact_key_range: None,
+                    // Only auto-trigger gc-compaction over the data keyspace due to concerns in
+                    // https://github.com/neondatabase/neon/issues/11318.
+                    compact_key_range: Some(CompactKeyRange {
+                        start: Key::MIN,
+                        end: Key::metadata_key_range().start,
+                    }),
                    compact_lsn_range: None,
                    sub_compaction_max_job_size_mb: None,
                },
@@ -343,44 +347,45 @@ impl GcCompactionQueue {
        info!("compaction job id={} finished", id);
        let mut guard = self.inner.lock().unwrap();
        if let Some(items) = guard.guards.remove(&id) {
-            drop(items.gc_guard);
            if let Some(tx) = items.notify {
                let _ = tx.send(());
            }
        }
    }

+    fn clear_running_job(&self) {
+        let mut guard = self.inner.lock().unwrap();
+        guard.running = None;
+    }
+
    async fn handle_sub_compaction(
        &self,
        id: GcCompactionJobId,
        options: CompactOptions,
        timeline: &Arc<Timeline>,
-        gc_block: &GcBlock,
        auto: bool,
    ) -> Result<(), CompactionError> {
        info!(
            "running scheduled enhanced gc bottom-most compaction with sub-compaction, splitting compaction jobs"
        );
-        let jobs = timeline
+        let res = timeline
            .gc_compaction_split_jobs(
                GcCompactJob::from_compact_options(options.clone()),
                options.sub_compaction_max_job_size_mb,
            )
-            .await?;
+            .await;
+        let jobs = match res {
+            Ok(jobs) => jobs,
+            Err(err) => {
+                warn!("cannot split gc-compaction jobs: {}, unblocked gc", err);
+                self.notify_and_unblock(id);
+                return Err(err);
+            }
+        };
        if jobs.is_empty() {
            info!("no jobs to run, skipping scheduled compaction task");
            self.notify_and_unblock(id);
        } else {
-            let gc_guard = match gc_block.start().await {
-                Ok(guard) => guard,
-                Err(e) => {
-                    return Err(CompactionError::Other(anyhow!(
-                        "cannot run gc-compaction because gc is blocked: {}",
-                        e
-                    )));
-                }
-            };
-
            let jobs_len = jobs.len();
            let mut pending_tasks = Vec::new();
            // gc-compaction might pick more layers or fewer layers to compact. The L2 LSN does not need to be accurate.
@@ -415,7 +420,6 @@ impl GcCompactionQueue {

            {
                let mut guard = self.inner.lock().unwrap();
-                guard.guards.entry(id).or_default().gc_guard = Some(gc_guard);
                let mut tasks = Vec::new();
                for task in pending_tasks {
                    let id = guard.next_id();
@@ -446,7 +450,18 @@ impl GcCompactionQueue {
        if let Err(err) = &res {
            log_compaction_error(err, None, cancel.is_cancelled());
        }
-        res
+        match res {
+            Ok(res) => Ok(res),
+            Err(CompactionError::ShuttingDown) => Err(CompactionError::ShuttingDown),
+            Err(_) => {
+                // In some cases, traditional gc might collect layer files, leaving
+                // gc-compaction unable to read the full history of a key.
+                // This needs to be resolved in the long term by improving the compaction
+                // process. For now, simply avoid letting such errors trigger the
+                // circuit breaker.
+                Ok(CompactionOutcome::Skipped)
+            }
+        }
    }

    async fn iteration_inner(
@@ -494,27 +509,32 @@ impl GcCompactionQueue {
                    info!(
                        "running scheduled enhanced gc bottom-most compaction with sub-compaction, splitting compaction jobs"
                    );
-                    self.handle_sub_compaction(id, options, timeline, gc_block, auto)
+                    self.handle_sub_compaction(id, options, timeline, auto)
                        .await?;
                } else {
                    // Auto compaction always enables sub-compaction so we don't need to handle update_l2_lsn
                    // in this branch.
-                    let gc_guard = match gc_block.start().await {
+                    let _gc_guard = match gc_block.start().await {
                        Ok(guard) => guard,
                        Err(e) => {
+                            self.notify_and_unblock(id);
+                            self.clear_running_job();
                            return Err(CompactionError::Other(anyhow!(
                                "cannot run gc-compaction because gc is blocked: {}",
                                e
                            )));
                        }
                    };
-                    {
-                        let mut guard = self.inner.lock().unwrap();
-                        guard.guards.entry(id).or_default().gc_guard = Some(gc_guard);
-                    }
-                    let compaction_result =
-                        timeline.compact_with_options(cancel, options, ctx).await?;
-                    self.notify_and_unblock(id);
+                    let res = timeline.compact_with_options(cancel, options, ctx).await;
+                    let compaction_result = match res {
+                        Ok(res) => res,
+                        Err(err) => {
+                            warn!(%err, "failed to run gc-compaction");
+                            self.notify_and_unblock(id);
+                            self.clear_running_job();
+                            return Err(err);
+                        }
+                    };
                    if compaction_result == CompactionOutcome::YieldForL0 {
                        yield_for_l0 = true;
                    }
@@ -522,7 +542,25 @@ impl GcCompactionQueue {
            }
            GcCompactionQueueItem::SubCompactionJob(options) => {
                // TODO: error handling, clear the queue if any task fails?
-                let compaction_result = timeline.compact_with_options(cancel, options, ctx).await?;
+                let _gc_guard = match gc_block.start().await {
+                    Ok(guard) => guard,
+                    Err(e) => {
+                        self.clear_running_job();
+                        return Err(CompactionError::Other(anyhow!(
+                            "cannot run gc-compaction because gc is blocked: {}",
+                            e
+                        )));
+                    }
+                };
+                let res = timeline.compact_with_options(cancel, options, ctx).await;
+                let compaction_result = match res {
+                    Ok(res) => res,
+                    Err(err) => {
+                        warn!(%err, "failed to run gc-compaction subcompaction job");
+                        self.clear_running_job();
+                        return Err(err);
+                    }
+                };
                if compaction_result == CompactionOutcome::YieldForL0 {
                     // We will permanently give up a task if we yield for L0 compaction: the preempted subcompaction job won't be running
                    // again. This ensures that we don't keep doing duplicated work within gc-compaction. Not directly returning here because
@@ -553,10 +591,7 @@ impl GcCompactionQueue {
                }
            }
        }
-        {
-            let mut guard = self.inner.lock().unwrap();
-            guard.running = None;
-        }
+        self.clear_running_job();
        Ok(if yield_for_l0 {
            tracing::info!("give up gc-compaction: yield for L0 compaction");
            CompactionOutcome::YieldForL0
@@ -1001,9 +1036,9 @@ impl Timeline {
        {
            Ok(((dense_partitioning, sparse_partitioning), lsn)) => {
                // Disables access_stats updates, so that the files we read remain candidates for eviction after we're done with them
-                let image_ctx = RequestContextBuilder::extend(ctx)
+                let image_ctx = RequestContextBuilder::from(ctx)
                    .access_stats_behavior(AccessStatsBehavior::Skip)
-                    .build();
+                    .attached_child();

                let mut partitioning = dense_partitioning;
                partitioning
--- a/pageserver/src/tenant/timeline/detach_ancestor.rs
+++ b/pageserver/src/tenant/timeline/detach_ancestor.rs
@@ -2,10 +2,14 @@ use std::collections::HashSet;
 use std::sync::Arc;

 use anyhow::Context;
+use bytes::Bytes;
 use http_utils::error::ApiError;
+use pageserver_api::key::Key;
+use pageserver_api::keyspace::KeySpace;
 use pageserver_api::models::DetachBehavior;
 use pageserver_api::models::detach_ancestor::AncestorDetached;
 use pageserver_api::shard::ShardIdentity;
+use pageserver_compaction::helpers::overlaps_with;
 use tokio::sync::Semaphore;
 use tokio_util::sync::CancellationToken;
 use tracing::Instrument;
@@ -22,7 +26,10 @@ use crate::task_mgr::TaskKind;
 use crate::tenant::Tenant;
 use crate::tenant::remote_timeline_client::index::GcBlockingReason::DetachAncestor;
 use crate::tenant::storage_layer::layer::local_layer_path;
-use crate::tenant::storage_layer::{AsLayerDesc as _, DeltaLayerWriter, Layer, ResidentLayer};
+use crate::tenant::storage_layer::{
+    AsLayerDesc as _, DeltaLayerWriter, ImageLayerWriter, IoConcurrency, Layer, ResidentLayer,
+    ValuesReconstructState,
+};
 use crate::virtual_file::{MaybeFatalIo, VirtualFile};

 #[derive(Debug, thiserror::Error)]
@@ -170,6 +177,92 @@ impl Attempt {
    }
 }

+async fn generate_tombstone_image_layer(
+    detached: &Arc<Timeline>,
+    ancestor: &Arc<Timeline>,
+    ancestor_lsn: Lsn,
+    ctx: &RequestContext,
+) -> Result<Option<ResidentLayer>, Error> {
+    tracing::info!(
+        "removing non-inherited keys by writing an image layer with tombstones at the detach LSN"
+    );
+    let io_concurrency = IoConcurrency::spawn_from_conf(
+        detached.conf,
+        detached.gate.enter().map_err(|_| Error::ShuttingDown)?,
+    );
+    let mut reconstruct_state = ValuesReconstructState::new(io_concurrency);
+    // Directly use `get_vectored_impl` to skip the max_vectored_read_key limit check. Note that the keyspace should
+    // not contain too many keys, otherwise this takes a lot of memory. Currently we limit it to 10k keys in the compute.
+    let key_range = Key::sparse_non_inherited_keyspace();
+    // avoid generating a "future layer" which will then be removed
+    let image_lsn = ancestor_lsn;
+
+    {
+        let layers = detached.layers.read().await;
+        for layer in layers.all_persistent_layers() {
+            if !layer.is_delta
+                && layer.lsn_range.start == image_lsn
+                && overlaps_with(&key_range, &layer.key_range)
+            {
+                tracing::warn!(
+                    layer=%layer, "image layer at the detach LSN already exists, skipping removing aux files"
+                );
+                return Ok(None);
+            }
+        }
+    }
+
+    let data = ancestor
+        .get_vectored_impl(
+            KeySpace::single(key_range.clone()),
+            image_lsn,
+            &mut reconstruct_state,
+            ctx,
+        )
+        .await
+        .context("failed to retrieve aux keys")
+        .map_err(|e| Error::launder(e, Error::Prepare))?;
+    if !data.is_empty() {
+        // TODO: is it possible that we can have an image at `image_lsn`? Unlikely because image layers are only generated
+        // upon compaction but theoretically possible.
+        let mut image_layer_writer = ImageLayerWriter::new(
+            detached.conf,
+            detached.timeline_id,
+            detached.tenant_shard_id,
+            &key_range,
+            image_lsn,
+            ctx,
+        )
+        .await
+        .context("failed to create image layer writer")
+        .map_err(Error::Prepare)?;
+        for key in data.keys() {
+            image_layer_writer
+                .put_image(*key, Bytes::new(), ctx)
+                .await
+                .context("failed to write key")
+                .map_err(|e| Error::launder(e, Error::Prepare))?;
+        }
+        let (desc, path) = image_layer_writer
+            .finish(ctx)
+            .await
+            .context("failed to finish image layer writer for removing the metadata keys")
+            .map_err(|e| Error::launder(e, Error::Prepare))?;
+        let generated = Layer::finish_creating(detached.conf, detached, desc, &path)
+            .map_err(|e| Error::launder(e, Error::Prepare))?;
+        detached
+            .remote_client
+            .upload_layer_file(&generated, &detached.cancel)
+            .await
+            .map_err(|e| Error::launder(e, Error::Prepare))?;
+        tracing::info!(layer=%generated, "wrote image layer");
+        Ok(Some(generated))
+    } else {
+        tracing::info!("no aux keys found in ancestor");
+        Ok(None)
+    }
+}
+
 /// See [`Timeline::prepare_to_detach_from_ancestor`]
 pub(super) async fn prepare(
    detached: &Arc<Timeline>,
@@ -352,10 +445,16 @@ pub(super) async fn prepare(

    // TODO: copying and lsn prefix copying could be done at the same time with a single fsync after
    let mut new_layers: Vec<Layer> =
-        Vec::with_capacity(straddling_branchpoint.len() + rest_of_historic.len());
+        Vec::with_capacity(straddling_branchpoint.len() + rest_of_historic.len() + 1);
+
+    if let Some(tombstone_layer) =
+        generate_tombstone_image_layer(detached, &ancestor, ancestor_lsn, ctx).await?
+    {
+        new_layers.push(tombstone_layer.into());
+    }

    {
-        tracing::debug!(to_rewrite = %straddling_branchpoint.len(), "copying prefix of delta layers");
+        tracing::info!(to_rewrite = %straddling_branchpoint.len(), "copying prefix of delta layers");

        let mut tasks = tokio::task::JoinSet::new();

--- a/pageserver/src/tenant/timeline/import_pgdata/upcall_api.rs
+++ b/pageserver/src/tenant/timeline/import_pgdata/upcall_api.rs
@@ -32,9 +32,15 @@ impl Client {
        let Some(ref base_url) = conf.import_pgdata_upcall_api else {
            anyhow::bail!("import_pgdata_upcall_api is not configured")
        };
+        let mut http_client = reqwest::Client::builder();
+        for cert in &conf.ssl_ca_certs {
+            http_client = http_client.add_root_certificate(cert.clone());
+        }
+        let http_client = http_client.build()?;
+
        Ok(Self {
            base_url: base_url.to_string(),
-            client: reqwest::Client::new(),
+            client: http_client,
            cancel,
            authorization_header: conf
                .import_pgdata_upcall_api_token
--- a/pageserver/src/virtual_file/owned_buffers_io/aligned_buffer/buffer_mut.rs
+++ b/pageserver/src/virtual_file/owned_buffers_io/aligned_buffer/buffer_mut.rs
@@ -25,8 +25,8 @@ impl<const A: usize> AlignedBufferMut<ConstAlign<A>> {
    /// * `align` must be a power of two,
    ///
    /// * `capacity`, when rounded up to the nearest multiple of `align`,
-    ///    must not overflow isize (i.e., the rounded value must be
-    ///    less than or equal to `isize::MAX`).
+    ///   must not overflow isize (i.e., the rounded value must be
+    ///   less than or equal to `isize::MAX`).
    pub fn with_capacity(capacity: usize) -> Self {
        AlignedBufferMut {
            raw: RawAlignedBuffer::with_capacity(capacity),
--- a/pageserver/src/virtual_file/owned_buffers_io/aligned_buffer/raw.rs
+++ b/pageserver/src/virtual_file/owned_buffers_io/aligned_buffer/raw.rs
@@ -37,8 +37,8 @@ impl<const A: usize> RawAlignedBuffer<ConstAlign<A>> {
    /// * `align` must be a power of two,
    ///
    /// * `capacity`, when rounded up to the nearest multiple of `align`,
-    ///    must not overflow isize (i.e., the rounded value must be
-    ///    less than or equal to `isize::MAX`).
+    ///   must not overflow isize (i.e., the rounded value must be
+    ///   less than or equal to `isize::MAX`).
    pub fn with_capacity(capacity: usize) -> Self {
        let align = ConstAlign::<A>;
        let layout = Layout::from_size_align(capacity, align.align()).expect("Invalid layout");
--- a/poetry.lock
+++ b/poetry.lock
@@ -1,4 +1,4 @@
-# This file is automatically @generated by Poetry 2.1.1 and should not be changed by hand.
+# This file is automatically @generated by Poetry 2.1.2 and should not be changed by hand.

 [[package]]
 name = "aiohappyeyeballs"
@@ -1286,24 +1286,20 @@ files = [

 [[package]]
 name = "h2"
-version = "4.1.0"
+version = "4.2.0"
 description = "Pure-Python HTTP/2 protocol implementation"
 optional = false
 python-versions = ">=3.9"
 groups = ["main"]
-files = []
-develop = false
+files = [
+    {file = "h2-4.2.0-py3-none-any.whl", hash = "sha256:479a53ad425bb29af087f3458a61d30780bc818e4ebcf01f0b536ba916462ed0"},
+    {file = "h2-4.2.0.tar.gz", hash = "sha256:c8a52129695e88b1a0578d8d2cc6842bbd79128ac685463b887ee278126ad01f"},
+]

 [package.dependencies]
 hpack = ">=4.1,<5"
 hyperframe = ">=6.1,<7"

-[package.source]
-type = "git"
-url = "https://github.com/python-hyper/h2"
-reference = "HEAD"
-resolved_reference = "0b98b244b5fd1fe96100ac14905417a3b70a4286"
-
 [[package]]
 name = "hpack"
 version = "4.1.0"
@@ -3844,4 +3840,4 @@ cffi = ["cffi (>=1.11)"]
 [metadata]
 lock-version = "2.1"
 python-versions = "^3.11"
-content-hash = "fb50cb6b291169dce3188560cdb31a14af95647318f8f0f0d718131dbaf1817a"
+content-hash = "7ab1e7b975af34b3271b7c6018fa22a261d3f73c7c0a0403b6b2bb86b5fbd36e"
--- a/proxy/src/serverless/local_conn_pool.rs
+++ b/proxy/src/serverless/local_conn_pool.rs
@@ -41,7 +41,7 @@ use crate::control_plane::messages::{ColdStartInfo, MetricsAuxInfo};
 use crate::metrics::Metrics;

 pub(crate) const EXT_NAME: &str = "pg_session_jwt";
-pub(crate) const EXT_VERSION: &str = "0.2.0";
+pub(crate) const EXT_VERSION: &str = "0.3.0";
 pub(crate) const EXT_SCHEMA: &str = "auth";

 #[derive(Clone)]
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -43,7 +43,7 @@ websockets = "^12.0"
 clickhouse-connect = "^0.7.16"
 kafka-python = "^2.0.2"
 jwcrypto = "^1.5.6"
-h2 = {git = "https://github.com/python-hyper/h2"}
+h2 = "^4.2.0"
 types-jwcrypto = "^1.5.0.20240925"
 pyyaml = "^6.0.2"
 types-pyyaml = "^6.0.12.20240917"
--- a/rust-toolchain.toml
+++ b/rust-toolchain.toml
@@ -1,5 +1,5 @@
 [toolchain]
-channel = "1.85.0"
+channel = "1.86.0"
 profile = "default"
 # The default profile includes rustc, rust-std, cargo, rust-docs, rustfmt and clippy.
 # https://rust-lang.github.io/rustup/concepts/profiles.html
--- a/safekeeper/client/src/mgmt_api.rs
+++ b/safekeeper/client/src/mgmt_api.rs
@@ -115,13 +115,17 @@ impl Client {
            "{}/v1/tenant/{}/timeline/{}",
            self.mgmt_api_endpoint, tenant_id, timeline_id
        );
-        let resp = self.request(Method::DELETE, &uri, ()).await?;
+        let resp = self
+            .request_maybe_body(Method::DELETE, &uri, None::<()>)
+            .await?;
        resp.json().await.map_err(Error::ReceiveBody)
    }

-    pub async fn delete_tenant(&self, tenant_id: TenantId) -> Result<models::TimelineDeleteResult> {
+    pub async fn delete_tenant(&self, tenant_id: TenantId) -> Result<models::TenantDeleteResult> {
        let uri = format!("{}/v1/tenant/{}", self.mgmt_api_endpoint, tenant_id);
-        let resp = self.request(Method::DELETE, &uri, ()).await?;
+        let resp = self
+            .request_maybe_body(Method::DELETE, &uri, None::<()>)
+            .await?;
        resp.json().await.map_err(Error::ReceiveBody)
    }

@@ -197,6 +201,16 @@ impl Client {
        method: Method,
        uri: U,
        body: B,
+    ) -> Result<reqwest::Response> {
+        self.request_maybe_body(method, uri, Some(body)).await
+    }
+
+    /// Send the request and check that the status code is good, with an optional body.
+    async fn request_maybe_body<B: serde::Serialize, U: reqwest::IntoUrl>(
+        &self,
+        method: Method,
+        uri: U,
+        body: Option<B>,
    ) -> Result<reqwest::Response> {
        let res = self.request_noerror(method, uri, body).await?;
        let response = res.error_from_body().await?;
@@ -208,12 +222,15 @@ impl Client {
        &self,
        method: Method,
        uri: U,
-        body: B,
+        body: Option<B>,
    ) -> Result<reqwest::Response> {
        let mut req = self.client.request(method, uri);
        if let Some(value) = &self.authorization_header {
            req = req.header(reqwest::header::AUTHORIZATION, value.get_contents())
        }
-        req.json(&body).send().await.map_err(Error::ReceiveBody)
+        if let Some(body) = body {
+            req = req.json(&body);
+        }
+        req.send().await.map_err(Error::ReceiveBody)
    }
 }
--- a/safekeeper/src/bin/safekeeper.rs
+++ b/safekeeper/src/bin/safekeeper.rs
@@ -219,7 +219,10 @@ struct Args {
    pub ssl_cert_reload_period: Duration,
    /// Trusted root CA certificates to use in https APIs.
    #[arg(long)]
-    ssl_ca_file: Option<Utf8PathBuf>,
+    pub ssl_ca_file: Option<Utf8PathBuf>,
+    /// Flag to use https for requests to peer's safekeeper API.
+    #[arg(long)]
+    pub use_https_safekeeper_api: bool,
 }

 // Like PathBufValueParser, but allows empty string.
@@ -399,6 +402,7 @@ async fn main() -> anyhow::Result<()> {
        ssl_cert_file: args.ssl_cert_file,
        ssl_cert_reload_period: args.ssl_cert_reload_period,
        ssl_ca_certs,
+        use_https_safekeeper_api: args.use_https_safekeeper_api,
    });

    // initialize sentry if SENTRY_DSN is provided
--- a/safekeeper/src/http/routes.rs
+++ b/safekeeper/src/http/routes.rs
@@ -16,9 +16,9 @@ use http_utils::{RequestExt, RouterBuilder};
 use hyper::{Body, Request, Response, StatusCode};
 use postgres_ffi::WAL_SEGMENT_SIZE;
 use safekeeper_api::models::{
-    AcceptorStateStatus, PullTimelineRequest, SafekeeperStatus, SkTimelineInfo, TermSwitchApiEntry,
-    TimelineCopyRequest, TimelineCreateRequest, TimelineDeleteResult, TimelineStatus,
-    TimelineTermBumpRequest,
+    AcceptorStateStatus, PullTimelineRequest, SafekeeperStatus, SkTimelineInfo, TenantDeleteResult,
+    TermSwitchApiEntry, TimelineCopyRequest, TimelineCreateRequest, TimelineDeleteResult,
+    TimelineStatus, TimelineTermBumpRequest,
 };
 use safekeeper_api::{ServerInfo, membership, models};
 use storage_broker::proto::{SafekeeperTimelineInfo, TenantTimelineId as ProtoTenantTimelineId};
@@ -83,13 +83,11 @@ async fn tenant_delete_handler(mut request: Request<Body>) -> Result<Response<Bo
        .delete_all_for_tenant(&tenant_id, action)
        .await
        .map_err(ApiError::InternalServerError)?;
-    json_response(
-        StatusCode::OK,
-        delete_info
-            .iter()
-            .map(|(ttid, resp)| (format!("{}", ttid.timeline_id), *resp))
-            .collect::<HashMap<String, TimelineDeleteResult>>(),
-    )
+    let response_body: TenantDeleteResult = delete_info
+        .iter()
+        .map(|(ttid, resp)| (format!("{}", ttid.timeline_id), *resp))
+        .collect::<HashMap<String, TimelineDeleteResult>>();
+    json_response(StatusCode::OK, response_body)
 }

 async fn timeline_create_handler(mut request: Request<Body>) -> Result<Response<Body>, ApiError> {
@@ -538,6 +536,7 @@ async fn record_safekeeper_info(mut request: Request<Body>) -> Result<Response<B
        peer_horizon_lsn: sk_info.peer_horizon_lsn.0,
        safekeeper_connstr: sk_info.safekeeper_connstr.unwrap_or_else(|| "".to_owned()),
        http_connstr: sk_info.http_connstr.unwrap_or_else(|| "".to_owned()),
+        https_connstr: sk_info.https_connstr,
        backup_lsn: sk_info.backup_lsn.0,
        local_start_lsn: sk_info.local_start_lsn.0,
        availability_zone: None,
--- a/safekeeper/src/lib.rs
+++ b/safekeeper/src/lib.rs
@@ -121,6 +121,7 @@ pub struct SafeKeeperConf {
    pub ssl_cert_file: Utf8PathBuf,
    pub ssl_cert_reload_period: Duration,
    pub ssl_ca_certs: Vec<Certificate>,
+    pub use_https_safekeeper_api: bool,
 }

 impl SafeKeeperConf {
@@ -170,6 +171,7 @@ impl SafeKeeperConf {
            ssl_cert_file: Utf8PathBuf::from(defaults::DEFAULT_SSL_CERT_FILE),
            ssl_cert_reload_period: Duration::from_secs(60),
            ssl_ca_certs: Vec::new(),
+            use_https_safekeeper_api: false,
        }
    }
 }
--- a/safekeeper/src/receive_wal.rs
+++ b/safekeeper/src/receive_wal.rs
@@ -94,10 +94,10 @@ impl WalReceivers {

    /// Get reference to locked slot contents. Slot must exist (registered
    /// earlier).
-    fn get_slot<'a>(
-        self: &'a Arc<WalReceivers>,
+    fn get_slot(
+        self: &Arc<WalReceivers>,
        id: WalReceiverId,
-    ) -> MappedMutexGuard<'a, WalReceiverState> {
+    ) -> MappedMutexGuard<'_, WalReceiverState> {
        MutexGuard::map(self.mutex.lock(), |locked| {
            locked.slots[id]
                .as_mut()
--- a/safekeeper/src/recovery.rs
+++ b/safekeeper/src/recovery.rs
@@ -176,6 +176,7 @@ pub struct Donor {
    pub flush_lsn: Lsn,
    pub pg_connstr: String,
    pub http_connstr: String,
+    pub https_connstr: Option<String>,
 }

 impl From<&PeerInfo> for Donor {
@@ -186,6 +187,7 @@ impl From<&PeerInfo> for Donor {
            flush_lsn: p.flush_lsn,
            pg_connstr: p.pg_connstr.clone(),
            http_connstr: p.http_connstr.clone(),
+            https_connstr: p.https_connstr.clone(),
        }
    }
 }
@@ -236,11 +238,33 @@ async fn recover(
    conf: &SafeKeeperConf,
 ) -> anyhow::Result<String> {
    // Learn donor term switch history to figure out starting point.
-    let client = reqwest::Client::new();
+
+    let mut client = reqwest::Client::builder();
+    for cert in &conf.ssl_ca_certs {
+        client = client.add_root_certificate(cert.clone());
+    }
+    let client = client
+        .build()
+        .context("Failed to build http client for recover")?;
+
+    let url = if conf.use_https_safekeeper_api {
+        if let Some(https_connstr) = donor.https_connstr.as_ref() {
+            format!("https://{https_connstr}")
+        } else {
+            anyhow::bail!(
+                "cannot recover from donor {}: \
+                https is enabled, but https_connstr is not specified",
+                donor.sk_id
+            );
+        }
+    } else {
+        format!("http://{}", donor.http_connstr)
+    };
+
    let timeline_info: TimelineStatus = client
        .get(format!(
-            "http://{}/v1/tenant/{}/timeline/{}",
-            donor.http_connstr, tli.ttid.tenant_id, tli.ttid.timeline_id
+            "{}/v1/tenant/{}/timeline/{}",
+            url, tli.ttid.tenant_id, tli.ttid.timeline_id
        ))
        .send()
        .await?
--- a/safekeeper/src/timeline.rs
+++ b/safekeeper/src/timeline.rs
@@ -50,6 +50,7 @@ fn peer_info_from_sk_info(sk_info: &SafekeeperTimelineInfo, ts: Instant) -> Peer
        local_start_lsn: Lsn(sk_info.local_start_lsn),
        pg_connstr: sk_info.safekeeper_connstr.clone(),
        http_connstr: sk_info.http_connstr.clone(),
+        https_connstr: sk_info.https_connstr.clone(),
        ts,
    }
 }
@@ -363,6 +364,7 @@ impl SharedState {
                .to_owned()
                .unwrap_or(conf.listen_pg_addr.clone()),
            http_connstr: conf.listen_http_addr.to_owned(),
+            https_connstr: conf.listen_https_addr.to_owned(),
            backup_lsn: self.sk.state().inmem.backup_lsn.0,
            local_start_lsn: self.sk.state().local_start_lsn.0,
            availability_zone: conf.availability_zone.clone(),
@@ -699,7 +701,7 @@ impl Timeline {
    }

    /// Take a writing mutual exclusive lock on timeline shared_state.
-    pub async fn write_shared_state<'a>(self: &'a Arc<Self>) -> WriteGuardSharedState<'a> {
+    pub async fn write_shared_state(self: &Arc<Self>) -> WriteGuardSharedState<'_> {
        WriteGuardSharedState::new(self.clone(), self.mutex.write().await)
    }

--- a/safekeeper/tests/misc_test.rs
+++ b/safekeeper/tests/misc_test.rs
@@ -116,7 +116,7 @@ fn test_many_tx() -> anyhow::Result<()> {
            }
            None
        })
-        .last()
+        .next_back()
        .unwrap();

    let initdb_lsn = 21623024;
--- a/safekeeper/tests/walproposer_sim/safekeeper.rs
+++ b/safekeeper/tests/walproposer_sim/safekeeper.rs
@@ -184,6 +184,7 @@ pub fn run_server(os: NodeOs, disk: Arc<SafekeeperDisk>) -> Result<()> {
        ssl_cert_file: Utf8PathBuf::from(""),
        ssl_cert_reload_period: Duration::ZERO,
        ssl_ca_certs: Vec::new(),
+        use_https_safekeeper_api: false,
    };

    let mut global = GlobalMap::new(disk, conf.clone())?;
--- a/storage_broker/benches/rps.rs
+++ b/storage_broker/benches/rps.rs
@@ -141,6 +141,7 @@ async fn publish(client: Option<BrokerClientChannel>, n_keys: u64) {
                peer_horizon_lsn: 5,
                safekeeper_connstr: "zenith-1-sk-1.local:7676".to_owned(),
                http_connstr: "zenith-1-sk-1.local:7677".to_owned(),
+                https_connstr: Some("zenith-1-sk-1.local:7678".to_owned()),
                local_start_lsn: 0,
                availability_zone: None,
                standby_horizon: 0,
--- a/storage_broker/proto/broker.proto
+++ b/storage_broker/proto/broker.proto
@@ -45,8 +45,10 @@ message SafekeeperTimelineInfo {
    uint64 standby_horizon = 14;
    // A connection string to use for WAL receiving.
    string safekeeper_connstr = 10;
-    // HTTP endpoint connection string
+    // HTTP endpoint connection string.
    string http_connstr = 13;
+    // HTTPS endpoint connection string.
+    optional string https_connstr = 15;
    // Availability zone of a safekeeper.
    optional string availability_zone = 11;
 }
--- a/storage_broker/src/bin/storage_broker.rs
+++ b/storage_broker/src/bin/storage_broker.rs
@@ -764,6 +764,7 @@ mod tests {
            peer_horizon_lsn: 5,
            safekeeper_connstr: "neon-1-sk-1.local:7676".to_owned(),
            http_connstr: "neon-1-sk-1.local:7677".to_owned(),
+            https_connstr: Some("neon-1-sk-1.local:7678".to_owned()),
            local_start_lsn: 0,
            availability_zone: None,
            standby_horizon: 0,
--- a/storage_controller/client/src/control_api.rs
+++ b/storage_controller/client/src/control_api.rs
@@ -10,13 +10,11 @@ pub struct Client {
 }

 impl Client {
-    pub fn new(base_url: Url, jwt_token: Option<String>) -> Self {
+    pub fn new(http_client: reqwest::Client, base_url: Url, jwt_token: Option<String>) -> Self {
        Self {
            base_url,
            jwt_token,
-            client: reqwest::ClientBuilder::new()
-                .build()
-                .expect("Failed to construct http client"),
+            client: http_client,
        }
    }

--- a/storage_controller/src/compute_hook.rs
+++ b/storage_controller/src/compute_hook.rs
@@ -4,6 +4,7 @@ use std::error::Error as _;
 use std::sync::Arc;
 use std::time::Duration;

+use anyhow::Context;
 use control_plane::endpoint::{ComputeControlPlane, EndpointStatus};
 use control_plane::local_env::LocalEnv;
 use futures::StreamExt;
@@ -364,25 +365,28 @@ pub(crate) struct ShardUpdate<'a> {
 }

 impl ComputeHook {
-    pub(super) fn new(config: Config) -> Self {
+    pub(super) fn new(config: Config) -> anyhow::Result<Self> {
        let authorization_header = config
            .control_plane_jwt_token
            .clone()
            .map(|jwt| format!("Bearer {}", jwt));

-        let client = reqwest::ClientBuilder::new()
-            .timeout(NOTIFY_REQUEST_TIMEOUT)
+        let mut client = reqwest::ClientBuilder::new().timeout(NOTIFY_REQUEST_TIMEOUT);
+        for cert in &config.ssl_ca_certs {
+            client = client.add_root_certificate(cert.clone());
+        }
+        let client = client
            .build()
-            .expect("Failed to construct HTTP client");
+            .context("Failed to build http client for compute hook")?;

-        Self {
+        Ok(Self {
            state: Default::default(),
            config,
            authorization_header,
            neon_local_lock: Default::default(),
            api_concurrency: tokio::sync::Semaphore::new(API_CONCURRENCY),
            client,
-        }
+        })
    }

    /// For test environments: use neon_local's LocalEnv to update compute
--- a/storage_controller/src/heartbeater.rs
+++ b/storage_controller/src/heartbeater.rs
@@ -12,6 +12,7 @@ use safekeeper_api::models::SafekeeperUtilization;
 use safekeeper_client::mgmt_api;
 use thiserror::Error;
 use tokio_util::sync::CancellationToken;
+use tracing::Instrument;
 use utils::id::NodeId;
 use utils::logging::SecretString;

@@ -227,6 +228,7 @@ impl HeartBeat<Node, PageserverState> for HeartbeaterTask<Node, PageserverState>

                    Some((*node_id, status))
                }
+                .instrument(tracing::info_span!("heartbeat_ps", %node_id))
            });
        }

@@ -253,7 +255,7 @@ impl HeartBeat<Node, PageserverState> for HeartbeaterTask<Node, PageserverState>
                PageserverState::WarmingUp { .. } => {
                    warming_up += 1;
                }
-                PageserverState::Offline { .. } => offline += 1,
+                PageserverState::Offline => offline += 1,
                PageserverState::Available { .. } => {}
            }
        }
@@ -369,6 +371,7 @@ impl HeartBeat<Safekeeper, SafekeeperState> for HeartbeaterTask<Safekeeper, Safe

                    Some((*node_id, status))
                }
+                .instrument(tracing::info_span!("heartbeat_sk", %node_id))
            });
        }

@@ -391,7 +394,7 @@ impl HeartBeat<Safekeeper, SafekeeperState> for HeartbeaterTask<Safekeeper, Safe
        let mut offline = 0;
        for state in new_state.values() {
            match state {
-                SafekeeperState::Offline { .. } => offline += 1,
+                SafekeeperState::Offline => offline += 1,
                SafekeeperState::Available { .. } => {}
            }
        }
--- a/storage_controller/src/http.rs
+++ b/storage_controller/src/http.rs
@@ -1733,9 +1733,9 @@ async fn maybe_forward(req: Request<Body>) -> ForwardOutcome {
        };

        if *self_addr == leader_addr {
-            return ForwardOutcome::Forwarded(Err(ApiError::InternalServerError(anyhow::anyhow!(
-                "Leader is stepped down instance"
-            ))));
+            return ForwardOutcome::Forwarded(Err(ApiError::ResourceUnavailable(
+                "Leader is stepped down instance".into(),
+            )));
        }
    }

@@ -1744,19 +1744,17 @@ async fn maybe_forward(req: Request<Body>) -> ForwardOutcome {
    // Use [`RECONCILE_TIMEOUT`] as the max amount of time a request should block for and
    // include some leeway to get the timeout for proxied requests.
    const PROXIED_REQUEST_TIMEOUT: Duration = Duration::from_secs(RECONCILE_TIMEOUT.as_secs() + 10);
-    let client = reqwest::ClientBuilder::new()
-        .timeout(PROXIED_REQUEST_TIMEOUT)
-        .build();
-    let client = match client {
-        Ok(client) => client,
-        Err(err) => {
-            return ForwardOutcome::Forwarded(Err(ApiError::InternalServerError(anyhow::anyhow!(
-                "Failed to build leader client for forwarding while in stepped down state: {err}"
-            ))));
-        }
-    };

-    let request: reqwest::Request = match convert_request(req, &client, leader.address).await {
+    let client = state.service.get_http_client().clone();
+
+    let request: reqwest::Request = match convert_request(
+        req,
+        &client,
+        leader.address,
+        PROXIED_REQUEST_TIMEOUT,
+    )
+    .await
+    {
        Ok(r) => r,
        Err(err) => {
            return ForwardOutcome::Forwarded(Err(ApiError::InternalServerError(anyhow::anyhow!(
@@ -1814,6 +1812,7 @@ async fn convert_request(
    req: hyper::Request<Body>,
    client: &reqwest::Client,
    to_address: String,
+    timeout: Duration,
 ) -> Result<reqwest::Request, ApiError> {
    use std::str::FromStr;

@@ -1868,6 +1867,7 @@ async fn convert_request(
        .request(method, uri)
        .headers(headers)
        .body(body)
+        .timeout(timeout)
        .build()
        .map_err(|err| {
            ApiError::InternalServerError(anyhow::anyhow!("Request conversion failed: {err}"))
--- a/storage_controller/src/leadership.rs
+++ b/storage_controller/src/leadership.rs
@@ -110,7 +110,20 @@ impl Leadership {
    ) -> Option<GlobalObservedState> {
        tracing::info!("Sending step down request to {leader:?}");

+        let mut http_client = reqwest::Client::builder();
+        for cert in &self.config.ssl_ca_certs {
+            http_client = http_client.add_root_certificate(cert.clone());
+        }
+        let http_client = match http_client.build() {
+            Ok(http_client) => http_client,
+            Err(err) => {
+                tracing::error!("Failed to build client for leader step-down request: {err}");
+                return None;
+            }
+        };
+
        let client = PeerClient::new(
+            http_client,
            Uri::try_from(leader.address.as_str()).expect("Failed to build leader URI"),
            self.config.peer_jwt_token.clone(),
        );
--- a/storage_controller/src/main.rs
+++ b/storage_controller/src/main.rs
@@ -283,10 +283,8 @@ impl Secrets {
    fn load_secret(cli: &Option<String>, env_name: &str) -> Option<String> {
        if let Some(v) = cli {
            Some(v.clone())
-        } else if let Ok(v) = std::env::var(env_name) {
-            Some(v)
        } else {
-            None
+            std::env::var(env_name).ok()
        }
    }
 }
--- a/storage_controller/src/peer_client.rs
+++ b/storage_controller/src/peer_client.rs
@@ -59,11 +59,11 @@ impl ResponseErrorMessageExt for reqwest::Response {
 pub(crate) struct GlobalObservedState(pub(crate) HashMap<TenantShardId, ObservedState>);

 impl PeerClient {
-    pub(crate) fn new(uri: Uri, jwt: Option<String>) -> Self {
+    pub(crate) fn new(http_client: reqwest::Client, uri: Uri, jwt: Option<String>) -> Self {
        Self {
            uri,
            jwt,
-            client: reqwest::Client::new(),
+            client: http_client,
        }
    }

--- a/storage_controller/src/persistence.rs
+++ b/storage_controller/src/persistence.rs
@@ -1524,25 +1524,14 @@ impl Persistence {
    /// Load pending operations from db.
    pub(crate) async fn list_pending_ops(
        &self,
-        filter_for_sk: Option<NodeId>,
    ) -> DatabaseResult<Vec<TimelinePendingOpPersistence>> {
        use crate::schema::safekeeper_timeline_pending_ops::dsl;

-        const FILTER_VAL_1: i64 = 1;
-        const FILTER_VAL_2: i64 = 2;
-        let filter_opt = filter_for_sk.map(|id| id.0 as i64);
        let timeline_from_db = self
            .with_measured_conn(DatabaseOperation::ListTimelineReconcile, move |conn| {
                Box::pin(async move {
                    let from_db: Vec<TimelinePendingOpPersistence> =
-                        dsl::safekeeper_timeline_pending_ops
-                            .filter(
-                                dsl::sk_id
-                                    .eq(filter_opt.unwrap_or(FILTER_VAL_1))
-                                    .and(dsl::sk_id.eq(filter_opt.unwrap_or(FILTER_VAL_2))),
-                            )
-                            .load(conn)
-                            .await?;
+                        dsl::safekeeper_timeline_pending_ops.load(conn).await?;
                    Ok(from_db)
                })
            })
--- a/storage_controller/src/reconciler.rs
+++ b/storage_controller/src/reconciler.rs
@@ -686,6 +686,8 @@ impl Reconciler {
                .await?,
        );

+        pausable_failpoint!("reconciler-live-migrate-post-generation-inc");
+
        let dest_conf = build_location_config(
            &self.shard,
            &self.config,
@@ -760,7 +762,9 @@ impl Reconciler {
        Ok(())
    }

-    async fn maybe_refresh_observed(&mut self) -> Result<(), ReconcileError> {
+    /// Returns true if the observed state of the attached location was refreshed
+    /// and false otherwise.
+    async fn maybe_refresh_observed(&mut self) -> Result<bool, ReconcileError> {
        // If the attached node has uncertain state, read it from the pageserver before proceeding: this
        // is important to avoid spurious generation increments.
        //
@@ -770,7 +774,7 @@ impl Reconciler {

        let Some(attached_node) = self.intent.attached.as_ref() else {
            // Nothing to do
-            return Ok(());
+            return Ok(false);
        };

        if matches!(
@@ -815,7 +819,7 @@ impl Reconciler {
            }
        }

-        Ok(())
+        Ok(true)
    }

    /// Reconciling a tenant makes API calls to pageservers until the observed state
@@ -831,7 +835,7 @@ impl Reconciler {
    /// state where it still requires later reconciliation.
    pub(crate) async fn reconcile(&mut self) -> Result<(), ReconcileError> {
        // Prepare: if we have uncertain `observed` state for our would-be attachement location, then refresh it
-        self.maybe_refresh_observed().await?;
+        let refreshed = self.maybe_refresh_observed().await?;

        // Special case: live migration
        self.maybe_live_migrate().await?;
@@ -855,8 +859,14 @@ impl Reconciler {
            );
            match self.observed.locations.get(&node.get_id()) {
                Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {
-                    // Nothing to do
-                    tracing::info!(node_id=%node.get_id(), "Observed configuration already correct.")
+                    if refreshed {
+                        tracing::info!(
+                            node_id=%node.get_id(), "Observed configuration correct after refresh. Notifying compute.");
+                        self.compute_notify().await?;
+                    } else {
+                        // Nothing to do
+                        tracing::info!(node_id=%node.get_id(), "Observed configuration already correct.");
+                    }
                }
                observed => {
                    // In all cases other than a matching observed configuration, we will
--- a/storage_controller/src/safekeeper_client.rs
+++ b/storage_controller/src/safekeeper_client.rs
@@ -101,7 +101,7 @@ impl SafekeeperClient {
    pub(crate) async fn delete_tenant(
        &self,
        tenant_id: TenantId,
-    ) -> Result<models::TimelineDeleteResult> {
+    ) -> Result<models::TenantDeleteResult> {
        measured_request!(
            "delete_tenant",
            crate::metrics::Method::Delete,
--- a/storage_controller/src/service.rs
+++ b/storage_controller/src/service.rs
@@ -1711,7 +1711,7 @@ impl Service {
            ))),
            config: config.clone(),
            persistence,
-            compute_hook: Arc::new(ComputeHook::new(config.clone())),
+            compute_hook: Arc::new(ComputeHook::new(config.clone())?),
            result_tx,
            heartbeater_ps,
            heartbeater_sk,
--- a/storage_controller/src/service/safekeeper_reconciler.rs
+++ b/storage_controller/src/service/safekeeper_reconciler.rs
@@ -35,6 +35,10 @@ impl SafekeeperReconcilers {
        service: &Arc<Service>,
        reqs: Vec<ScheduleRequest>,
    ) {
+        tracing::info!(
+            "Scheduling {} pending safekeeper ops loaded from db",
+            reqs.len()
+        );
        for req in reqs {
            self.schedule_request(service, req);
        }
@@ -74,7 +78,7 @@ pub(crate) async fn load_schedule_requests(
    service: &Arc<Service>,
    safekeepers: &HashMap<NodeId, Safekeeper>,
 ) -> anyhow::Result<Vec<ScheduleRequest>> {
-    let pending_ops = service.persistence.list_pending_ops(None).await?;
+    let pending_ops = service.persistence.list_pending_ops().await?;
    let mut res = Vec::with_capacity(pending_ops.len());
    for op_persist in pending_ops {
        let node_id = NodeId(op_persist.sk_id as u64);
@@ -232,12 +236,14 @@ impl SafekeeperReconciler {
            let kind = req.kind;
            let tenant_id = req.tenant_id;
            let timeline_id = req.timeline_id;
+            let node_id = req.safekeeper.skp.id;
            self.reconcile_one(req, req_cancel)
                .instrument(tracing::info_span!(
                    "reconcile_one",
                    ?kind,
                    %tenant_id,
-                    ?timeline_id
+                    ?timeline_id,
+                    %node_id,
                ))
                .await;
        }
--- a/storage_controller/src/tenant_shard.rs
+++ b/storage_controller/src/tenant_shard.rs
@@ -622,7 +622,7 @@ impl TenantShard {
            .collect::<Vec<_>>();

        attached_locs.sort_by_key(|i| i.1);
-        if let Some((node_id, _gen)) = attached_locs.into_iter().last() {
+        if let Some((node_id, _gen)) = attached_locs.into_iter().next_back() {
            self.intent.set_attached(scheduler, Some(*node_id));
        }

--- a/storage_scrubber/src/find_large_objects.rs
+++ b/storage_scrubber/src/find_large_objects.rs
@@ -18,7 +18,7 @@ enum LargeObjectKind {

 impl LargeObjectKind {
    fn from_key(key: &str) -> Self {
-        let fname = key.split('/').last().unwrap();
+        let fname = key.split('/').next_back().unwrap();

        let Ok((layer_name, _generation)) = parse_layer_object_name(fname) else {
            return LargeObjectKind::Other;
--- a/storage_scrubber/src/lib.rs
+++ b/storage_scrubber/src/lib.rs
@@ -295,8 +295,8 @@ pub struct ControllerClientConfig {
 }

 impl ControllerClientConfig {
-    pub fn build_client(self) -> control_api::Client {
-        control_api::Client::new(self.controller_api, Some(self.controller_jwt))
+    pub fn build_client(self, http_client: reqwest::Client) -> control_api::Client {
+        control_api::Client::new(http_client, self.controller_api, Some(self.controller_jwt))
    }
 }

--- a/storage_scrubber/src/main.rs
+++ b/storage_scrubber/src/main.rs
@@ -3,7 +3,7 @@ use camino::Utf8PathBuf;
 use clap::{Parser, Subcommand};
 use pageserver_api::controller_api::{MetadataHealthUpdateRequest, MetadataHealthUpdateResponse};
 use pageserver_api::shard::TenantShardId;
-use reqwest::{Method, Url};
+use reqwest::{Certificate, Method, Url};
 use storage_controller_client::control_api;
 use storage_scrubber::garbage::{PurgeMode, find_garbage, purge_garbage};
 use storage_scrubber::pageserver_physical_gc::{GcMode, pageserver_physical_gc};
@@ -41,6 +41,10 @@ struct Cli {
    /// If set to true, the scrubber will exit with error code on fatal error.
    #[arg(long, default_value_t = false)]
    exit_code: bool,
+
+    /// Trusted root CA certificates to use in https APIs.
+    #[arg(long)]
+    ssl_ca_file: Option<Utf8PathBuf>,
 }

 #[derive(Subcommand, Debug)]
@@ -146,13 +150,28 @@ async fn main() -> anyhow::Result<()> {

    tracing::info!("version: {}, build_tag {}", GIT_VERSION, BUILD_TAG);

+    let ssl_ca_certs = match cli.ssl_ca_file.as_ref() {
+        Some(ssl_ca_file) => {
+            tracing::info!("Using ssl root CA file: {ssl_ca_file:?}");
+            let buf = tokio::fs::read(ssl_ca_file).await?;
+            Certificate::from_pem_bundle(&buf)?
+        }
+        None => Vec::new(),
+    };
+
+    let mut http_client = reqwest::Client::builder();
+    for cert in ssl_ca_certs {
+        http_client = http_client.add_root_certificate(cert);
+    }
+    let http_client = http_client.build()?;
+
    let controller_client = cli.controller_api.map(|controller_api| {
        ControllerClientConfig {
            controller_api,
            // Default to no key: this is a convenience when working in a development environment
            controller_jwt: cli.controller_jwt.unwrap_or("".to_owned()),
        }
-        .build_client()
+        .build_client(http_client)
    });

    match cli.command {
--- a/test_runner/fixtures/neon_fixtures.py
+++ b/test_runner/fixtures/neon_fixtures.py
@@ -376,6 +376,28 @@ class PageserverWalReceiverProtocol(StrEnum):
            raise ValueError(f"Unknown protocol type: {proto}")


+@dataclass
+class PageserverTracingConfig:
+    sampling_ratio: tuple[int, int]
+    endpoint: str
+    protocol: str
+    timeout: str
+
+    def to_config_key_value(self) -> tuple[str, dict[str, Any]]:
+        value = {
+            "sampling_ratio": {
+                "numerator": self.sampling_ratio[0],
+                "denominator": self.sampling_ratio[1],
+            },
+            "export_config": {
+                "endpoint": self.endpoint,
+                "protocol": self.protocol,
+                "timeout": self.timeout,
+            },
+        }
+        return ("tracing", value)
+
+
 class NeonEnvBuilder:
    """
    Builder object to create a Neon runtime environment
@@ -425,6 +447,7 @@ class NeonEnvBuilder:
        pageserver_virtual_file_io_mode: str | None = None,
        pageserver_wal_receiver_protocol: PageserverWalReceiverProtocol | None = None,
        pageserver_get_vectored_concurrent_io: str | None = None,
+        pageserver_tracing_config: PageserverTracingConfig | None = None,
    ):
        self.repo_dir = repo_dir
        self.rust_log_override = rust_log_override
@@ -478,6 +501,8 @@ class NeonEnvBuilder:
            pageserver_get_vectored_concurrent_io
        )

+        self.pageserver_tracing_config = pageserver_tracing_config
+
        self.pageserver_default_tenant_config_compaction_algorithm: dict[str, Any] | None = (
            pageserver_default_tenant_config_compaction_algorithm
        )
@@ -1138,6 +1163,7 @@ class NeonEnv:
        self.pageserver_virtual_file_io_mode = config.pageserver_virtual_file_io_mode
        self.pageserver_wal_receiver_protocol = config.pageserver_wal_receiver_protocol
        self.pageserver_get_vectored_concurrent_io = config.pageserver_get_vectored_concurrent_io
+        self.pageserver_tracing_config = config.pageserver_tracing_config

        # Create the neon_local's `NeonLocalInitConf`
        cfg: dict[str, Any] = {
@@ -1262,6 +1288,14 @@ class NeonEnv:
                if key not in ps_cfg:
                    ps_cfg[key] = value

+            if self.pageserver_tracing_config is not None:
+                key, value = self.pageserver_tracing_config.to_config_key_value()
+
+                if key not in ps_cfg:
+                    ps_cfg[key] = value
+
+                ps_cfg[key] = value
+
            # Create a corresponding NeonPageserver object
            self.pageservers.append(
                NeonPageserver(self, ps_id, port=pageserver_port, az_id=ps_cfg["availability_zone"])
@@ -1284,6 +1318,7 @@ class NeonEnv:
                "http_port": port.http,
                "https_port": port.https,
                "sync": config.safekeepers_enable_fsync,
+                "use_https_safekeeper_api": config.use_https_safekeeper_api,
            }
            if config.auth_enabled:
                sk_cfg["auth_enabled"] = True
--- a/test_runner/fixtures/pageserver/allowed_errors.py
+++ b/test_runner/fixtures/pageserver/allowed_errors.py
@@ -110,6 +110,7 @@ DEFAULT_PAGESERVER_ALLOWED_ERRORS = (
    ".*delaying layer flush by \\S+ for compaction backpressure.*",
    ".*stalling layer flushes for compaction backpressure.*",
    ".*layer roll waiting for flush due to compaction backpressure.*",
+    ".*BatchSpanProcessor.*",
 )


@@ -118,6 +119,7 @@ DEFAULT_STORAGE_CONTROLLER_ALLOWED_ERRORS = [
    # failing to connect to them.
    ".*Call to node.*management API.*failed.*receive body.*",
    ".*Call to node.*management API.*failed.*ReceiveBody.*",
+    ".*Call to node.*management API.*failed.*Timeout.*",
    ".*Failed to update node .+ after heartbeat round.*error sending request for url.*",
    ".*background_reconcile: failed to fetch top tenants:.*client error \\(Connect\\).*",
    # Many tests will start up with a node offline
--- a/test_runner/fixtures/pageserver/http.py
+++ b/test_runner/fixtures/pageserver/http.py
@@ -1192,3 +1192,28 @@ class PageserverHttpClient(requests.Session, MetricsGetter):
        log.info(f"Got perf info response code: {res.status_code}")
        self.verbose_error(res)
        return res.json()
+
+    def ingest_aux_files(
+        self,
+        tenant_id: TenantId | TenantShardId,
+        timeline_id: TimelineId,
+        aux_files: dict[str, bytes],
+    ):
+        res = self.post(
+            f"http://localhost:{self.port}/v1/tenant/{tenant_id}/timeline/{timeline_id}/ingest_aux_files",
+            json={
+                "aux_files": aux_files,
+            },
+        )
+        self.verbose_error(res)
+        return res.json()
+
+    def list_aux_files(
+        self, tenant_id: TenantId | TenantShardId, timeline_id: TimelineId, lsn: Lsn
+    ) -> Any:
+        res = self.post(
+            f"http://localhost:{self.port}/v1/tenant/{tenant_id}/timeline/{timeline_id}/list_aux_files",
+            json={"lsn": str(lsn)},
+        )
+        self.verbose_error(res)
+        return res.json()
--- a/test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
+++ b/test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
@@ -10,6 +10,7 @@ from fixtures.log_helper import log
 from fixtures.neon_fixtures import (
    NeonEnv,
    NeonEnvBuilder,
+    PageserverTracingConfig,
    PgBin,
    wait_for_last_flush_lsn,
 )
@@ -111,6 +112,15 @@ def setup_and_run_pagebench_benchmark(
    neon_env_builder.pageserver_config_override = (
        f"page_cache_size={page_cache_size}; max_file_descriptors={max_file_descriptors}"
    )
+
+    tracing_config = PageserverTracingConfig(
+        sampling_ratio=(0, 1000),
+        endpoint="http://localhost:4318/v1/traces",
+        protocol="http-binary",
+        timeout="10s",
+    )
+    neon_env_builder.pageserver_tracing_config = tracing_config
+    ratio = tracing_config.sampling_ratio[0] / tracing_config.sampling_ratio[1]
    params.update(
        {
            "pageserver_config_override.page_cache_size": (
@@ -118,6 +128,7 @@ def setup_and_run_pagebench_benchmark(
                {"unit": "byte"},
            ),
            "pageserver_config_override.max_file_descriptors": (max_file_descriptors, {"unit": ""}),
+            "pageserver_config_override.sampling_ratio": (ratio, {"unit": ""}),
        }
    )

--- a/test_runner/performance/test_physical_replication.py
+++ b/test_runner/performance/test_physical_replication.py
@@ -7,7 +7,6 @@ import traceback
 from typing import TYPE_CHECKING

 import psycopg2
-import psycopg2.extras
 import pytest
 from fixtures.benchmark_fixture import MetricReport
 from fixtures.common_types import Lsn
@@ -26,7 +25,11 @@ if TYPE_CHECKING:


 # Granularity of ~0.5 sec
-def measure_replication_lag(master, replica, timeout_sec=600):
+def measure_replication_lag(
+    master: psycopg2.extensions.cursor,
+    replica: psycopg2.extensions.cursor,
+    timeout_sec: int = 600,
+):
    start = time.time()
    master.execute("SELECT pg_current_wal_flush_lsn()")
    master_lsn = Lsn(master.fetchall()[0][0])
@@ -40,7 +43,7 @@ def measure_replication_lag(master, replica, timeout_sec=600):
    raise TimeoutError(f"Replication sync took more than {timeout_sec} sec")


-def check_pgbench_still_running(pgbench):
+def check_pgbench_still_running(pgbench: subprocess.Popen[str]):
    rc = pgbench.poll()
    if rc is not None:
        raise RuntimeError(f"Pgbench terminated early with return code {rc}")
@@ -61,6 +64,8 @@ def test_ro_replica_lag(

    project = neon_api.create_project(pg_version)
    project_id = project["project"]["id"]
+    log.info("Project ID: %s", project_id)
+    log.info("Primary endpoint ID: %s", project["project"]["endpoints"][0]["id"])
    neon_api.wait_for_operation_to_finish(project_id)
    error_occurred = False
    try:
@@ -76,6 +81,7 @@ def test_ro_replica_lag(
            endpoint_type="read_only",
            settings={"pg_settings": {"hot_standby_feedback": "on"}},
        )
+        log.info("Replica endpoint ID: %s", replica["endpoint"]["id"])
        replica_env = master_env.copy()
        replica_env["PGHOST"] = replica["endpoint"]["host"]
        neon_api.wait_for_operation_to_finish(project_id)
@@ -191,6 +197,8 @@ def test_replication_start_stop(

    project = neon_api.create_project(pg_version)
    project_id = project["project"]["id"]
+    log.info("Project ID: %s", project_id)
+    log.info("Primary endpoint ID: %s", project["project"]["endpoints"][0]["id"])
    neon_api.wait_for_operation_to_finish(project_id)
    try:
        branch_id = project["branch"]["id"]
@@ -200,15 +208,15 @@ def test_replication_start_stop(
        )

        replicas = []
-        for _ in range(num_replicas):
-            replicas.append(
-                neon_api.create_endpoint(
-                    project_id,
-                    branch_id,
-                    endpoint_type="read_only",
-                    settings={"pg_settings": {"hot_standby_feedback": "on"}},
-                )
+        for i in range(num_replicas):
+            replica = neon_api.create_endpoint(
+                project_id,
+                branch_id,
+                endpoint_type="read_only",
+                settings={"pg_settings": {"hot_standby_feedback": "on"}},
            )
+            log.info("Replica %s endpoint ID: %s", i + 1, replica["endpoint"]["id"])
+            replicas.append(replica)
            neon_api.wait_for_operation_to_finish(project_id)

        replica_connstr = [
--- a/test_runner/regress/test_compatibility.py
+++ b/test_runner/regress/test_compatibility.py
@@ -249,6 +249,7 @@ def test_forward_compatibility(
    top_output_dir: Path,
    pg_version: PgVersion,
    compatibility_snapshot_dir: Path,
+    compute_reconfigure_listener: ComputeReconfigure,
 ):
    """
    Test that the old binaries can read new data
@@ -257,6 +258,7 @@ def test_forward_compatibility(
        os.environ.get("ALLOW_FORWARD_COMPATIBILITY_BREAKAGE", "false").lower() == "true"
    )

+    neon_env_builder.control_plane_hooks_api = compute_reconfigure_listener.control_plane_hooks_api
    neon_env_builder.test_may_use_compatibility_snapshot_binaries = True

    try:
--- a/test_runner/regress/test_storage_controller.py
+++ b/test_runner/regress/test_storage_controller.py
@@ -4073,6 +4073,101 @@ def test_storage_controller_location_conf_equivalence(neon_env_builder: NeonEnvB
    assert reconciles_after_restart == 0


+@run_only_on_default_postgres("PG version is not interesting here")
+@pytest.mark.parametrize("restart_storcon", [True, False])
+def test_storcon_create_delete_sk_down(neon_env_builder: NeonEnvBuilder, restart_storcon: bool):
+    """
+    Test that the storcon can create and delete tenants and timelines with a safekeeper being down.
+      - restart_storcon: tests whether the pending ops are persisted.
+        If we don't restart, we test that the pending ops don't have to be loaded from the db.
+    """
+
+    neon_env_builder.num_safekeepers = 3
+    neon_env_builder.storage_controller_config = {
+        "timelines_onto_safekeepers": True,
+    }
+    env = neon_env_builder.init_start()
+
+    env.safekeepers[0].stop()
+
+    # Wait for heartbeater to pick up that the safekeeper is gone
+    # This isn't really necessary
+    def logged_offline():
+        env.storage_controller.assert_log_contains(
+            "Heartbeat round complete for 3 safekeepers, 1 offline"
+        )
+
+    wait_until(logged_offline)
+
+    tenant_id = TenantId.generate()
+    timeline_id = TimelineId.generate()
+    env.create_tenant(tenant_id, timeline_id)
+
+    env.safekeepers[1].assert_log_contains(f"creating new timeline {tenant_id}/{timeline_id}")
+    env.safekeepers[2].assert_log_contains(f"creating new timeline {tenant_id}/{timeline_id}")
+
+    env.storage_controller.allowed_errors.extend(
+        [
+            ".*Call to safekeeper.* management API still failed after.*",
+            f".*reconcile_one.*tenant_id={tenant_id}.*Call to safekeeper.* management API still failed after.*",
+        ]
+    )
+
+    if restart_storcon:
+        # Restart the storcon to check that we persist operations
+        env.storage_controller.stop()
+        env.storage_controller.start()
+
+    config_lines = [
+        "neon.safekeeper_proto_version = 3",
+    ]
+    with env.endpoints.create("main", tenant_id=tenant_id, config_lines=config_lines) as ep:
+        # endpoint should start.
+        ep.start(safekeeper_generation=1, safekeepers=[1, 2, 3])
+        ep.safe_psql("CREATE TABLE IF NOT EXISTS t(key int, value text)")
+
+    env.storage_controller.assert_log_contains("writing pending op for sk id 1")
+    env.safekeepers[0].start()
+
+    # ensure that we applied the operation also for the safekeeper we just brought down
+    def logged_contains_on_sk():
+        env.safekeepers[0].assert_log_contains(
+            f"pulling timeline {tenant_id}/{timeline_id} from safekeeper"
+        )
+
+    wait_until(logged_contains_on_sk)
+
+    env.safekeepers[1].stop()
+
+    env.storage_controller.pageserver_api().tenant_delete(tenant_id)
+
+    # ensure the safekeeper deleted the timeline
+    def timeline_deleted_on_active_sks():
+        env.safekeepers[0].assert_log_contains(
+            f"deleting timeline {tenant_id}/{timeline_id} from disk"
+        )
+        env.safekeepers[2].assert_log_contains(
+            f"deleting timeline {tenant_id}/{timeline_id} from disk"
+        )
+
+    wait_until(timeline_deleted_on_active_sks)
+
+    if restart_storcon:
+        # Restart the storcon to check that we persist operations
+        env.storage_controller.stop()
+        env.storage_controller.start()
+
+    env.safekeepers[1].start()
+
+    # ensure that there are log messages for the third safekeeper too
+    def timeline_deleted_on_sk():
+        env.safekeepers[1].assert_log_contains(
+            f"deleting timeline {tenant_id}/{timeline_id} from disk"
+        )
+
+    wait_until(timeline_deleted_on_sk)
+
+
 @pytest.mark.parametrize("wrong_az", [True, False])
 def test_storage_controller_graceful_migration(neon_env_builder: NeonEnvBuilder, wrong_az: bool):
    """
@@ -4176,3 +4271,121 @@ def test_storage_controller_graceful_migration(neon_env_builder: NeonEnvBuilder,
        )
    else:
        assert initial_ps.http_client().tenant_list_locations()["tenant_shards"] == []
+
+
+@run_only_on_default_postgres("this is like a 'unit test' against storcon db")
+def test_storage_controller_migrate_with_pageserver_restart(
+    neon_env_builder: NeonEnvBuilder, make_httpserver
+):
+    """
+    Test that live migrations which fail right after incrementing the generation
+    due to the destination going offline eventually send a compute notification
+    after the destination re-attaches.
+    """
+    neon_env_builder.num_pageservers = 2
+
+    neon_env_builder.storage_controller_config = {
+        # Disable transitions to offline
+        "max_offline": "600s",
+        "use_local_compute_notifications": False,
+    }
+
+    neon_env_builder.control_plane_hooks_api = (
+        f"http://{make_httpserver.host}:{make_httpserver.port}/"
+    )
+
+    notifications = []
+
+    def notify(request: Request):
+        log.info(f"Received notify-attach: {request}")
+        notifications.append(request.json)
+
+    make_httpserver.expect_request("/notify-attach", method="PUT").respond_with_handler(notify)
+
+    env = neon_env_builder.init_start()
+
+    env.storage_controller.allowed_errors.extend(
+        [
+            ".*Call to node.*management API failed.*",
+            ".*Call to node.*management API still failed.*",
+            ".*Reconcile error.*",
+            ".*request.*PUT.*migrate.*",
+        ]
+    )
+
+    env.storage_controller.tenant_policy_update(env.initial_tenant, {"placement": {"Attached": 1}})
+    env.storage_controller.reconcile_until_idle()
+
+    initial_desc = env.storage_controller.tenant_describe(env.initial_tenant)["shards"][0]
+    log.info(f"{initial_desc=}")
+    primary = env.get_pageserver(initial_desc["node_attached"])
+    secondary = env.get_pageserver(initial_desc["node_secondary"][0])
+
+    # Pause the migration after incrementing the generation in the database
+    env.storage_controller.configure_failpoints(
+        ("reconciler-live-migrate-post-generation-inc", "pause")
+    )
+
+    tenant_shard_id = TenantShardId(env.initial_tenant, 0, 0)
+
+    try:
+        with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
+            migrate_fut = executor.submit(
+                env.storage_controller.tenant_shard_migrate,
+                tenant_shard_id,
+                secondary.id,
+                config=StorageControllerMigrationConfig(prewarm=False, override_scheduler=True),
+            )
+
+            def has_hit_migration_failpoint():
+                expr = "at failpoint reconciler-live-migrate-post-generation-inc"
+                log.info(expr)
+                assert env.storage_controller.log_contains(expr)
+
+            wait_until(has_hit_migration_failpoint)
+
+            secondary.stop()
+
+            # Eventually migration completes
+            env.storage_controller.configure_failpoints(
+                ("reconciler-live-migrate-post-generation-inc", "off")
+            )
+            try:
+                migrate_fut.result()
+            except StorageControllerApiException as err:
+                log.info(f"Migration failed: {err}")
+    except:
+        env.storage_controller.configure_failpoints(
+            ("reconciler-live-migrate-post-generation-inc", "off")
+        )
+        raise
+
+    def process_migration_result():
+        dump = env.storage_controller.tenant_shard_dump()
+        observed = dump[0]["observed"]["locations"]
+
+        log.info(f"{observed=} primary={primary.id} secondary={secondary.id}")
+
+        assert observed[str(primary.id)]["conf"]["mode"] == "AttachedStale"
+        assert observed[str(secondary.id)]["conf"] is None
+
+    wait_until(process_migration_result)
+
+    # Start and wait for re-attach to be processed
+    secondary.start()
+    env.storage_controller.poll_node_status(
+        secondary.id,
+        desired_availability=PageserverAvailability.ACTIVE,
+        desired_scheduling_policy=None,
+        max_attempts=10,
+        backoff=1,
+    )
+
+    env.storage_controller.reconcile_until_idle()
+
+    assert notifications[-1] == {
+        "tenant_id": str(env.initial_tenant),
+        "stripe_size": None,
+        "shards": [{"node_id": int(secondary.id), "shard_number": 0}],
+        "preferred_az": DEFAULT_AZ_ID,
+    }
--- a/test_runner/regress/test_tenant_size.py
+++ b/test_runner/regress/test_tenant_size.py
@@ -776,6 +776,7 @@ def test_lsn_lease_storcon(neon_env_builder: NeonEnvBuilder):
        env.initial_tenant, env.initial_timeline, last_flush_lsn
    )
    env.storage_controller.tenant_shard_split(env.initial_tenant, 8)
+    env.storage_controller.reconcile_until_idle(timeout_secs=120)
    # TODO: do we preserve LSN leases across shard splits?
    env.storage_controller.pageserver_api().timeline_lsn_lease(
        env.initial_tenant, env.initial_timeline, last_flush_lsn
--- a/test_runner/regress/test_timeline_detach_ancestor.py
+++ b/test_runner/regress/test_timeline_detach_ancestor.py
@@ -1768,6 +1768,87 @@ def test_pageserver_compaction_detach_ancestor_smoke(neon_env_builder: NeonEnvBu
    workload_child.validate(env.pageserver.id)


+def test_timeline_detach_with_aux_files_with_detach_v1(
+    neon_env_builder: NeonEnvBuilder,
+):
+    """
+    Validate that "branches do not inherit their parent" is invariant over detach_ancestor.
+
+    Branches hide parent branch aux files etc by stopping lookup of non-inherited keyspace at the parent-child boundary.
+    We had a bug where detach_ancestor running on a child branch would copy aux files key range from child to parent,
+    thereby making parent aux files reappear.
+    """
+    env = neon_env_builder.init_start(
+        initial_tenant_conf={
+            "gc_period": "1s",
+            "lsn_lease_length": "0s",
+        }
+    )
+
+    env.pageserver.allowed_errors.extend(SHUTDOWN_ALLOWED_ERRORS)
+
+    http = env.pageserver.http_client()
+
+    endpoint = env.endpoints.create_start("main", tenant_id=env.initial_tenant)
+    lsn0 = wait_for_last_flush_lsn(env, endpoint, env.initial_tenant, env.initial_timeline)
+    endpoint.safe_psql(
+        "SELECT pg_create_logical_replication_slot('test_slot_parent_1', 'pgoutput')"
+    )
+    lsn1 = wait_for_last_flush_lsn(env, endpoint, env.initial_tenant, env.initial_timeline)
+    endpoint.safe_psql(
+        "SELECT pg_create_logical_replication_slot('test_slot_parent_2', 'pgoutput')"
+    )
+    lsn2 = wait_for_last_flush_lsn(env, endpoint, env.initial_tenant, env.initial_timeline)
+    assert set(http.list_aux_files(env.initial_tenant, env.initial_timeline, lsn0).keys()) == set(
+        []
+    )
+    assert set(http.list_aux_files(env.initial_tenant, env.initial_timeline, lsn1).keys()) == set(
+        ["pg_replslot/test_slot_parent_1/state"]
+    )
+    assert set(http.list_aux_files(env.initial_tenant, env.initial_timeline, lsn2).keys()) == set(
+        ["pg_replslot/test_slot_parent_1/state", "pg_replslot/test_slot_parent_2/state"]
+    )
+
+    # Restore at LSN1
+    branch_timeline_id = env.create_branch("restore", env.initial_tenant, "main", lsn1)
+    endpoint2 = env.endpoints.create_start("restore", tenant_id=env.initial_tenant)
+    assert set(http.list_aux_files(env.initial_tenant, branch_timeline_id, lsn1).keys()) == set([])
+
+    # Add a new slot file to the restore branch. This won't happen in reality because cplane immediately detaches the branch on restore,
+    # but we want to ensure that aux files on the detached branch are NOT inherited during ancestor detach. We could change the behavior
+    # in the future.
+    # TL;DR we should NEVER automatically detach a branch as a background optimization for those tenants that already used the restore
+    # feature before branch detach was introduced because it will clean up the aux files and stop logical replication.
+    endpoint2.safe_psql(
+        "SELECT pg_create_logical_replication_slot('test_slot_restore', 'pgoutput')"
+    )
+    lsn3 = wait_for_last_flush_lsn(env, endpoint, env.initial_tenant, branch_timeline_id)
+    assert set(http.list_aux_files(env.initial_tenant, branch_timeline_id, lsn1).keys()) == set([])
+    assert set(http.list_aux_files(env.initial_tenant, branch_timeline_id, lsn3).keys()) == set(
+        ["pg_replslot/test_slot_restore/state"]
+    )
+
+    print("lsn0=", lsn0)
+    print("lsn1=", lsn1)
+    print("lsn2=", lsn2)
+    print("lsn3=", lsn3)
+    # Detach the restore branch so that main doesn't have any child branches.
+    all_reparented = http.detach_ancestor(
+        env.initial_tenant, branch_timeline_id, detach_behavior="v1"
+    )
+    assert all_reparented == set([])
+
+    # We need to ensure all safekeeper data are ingested before checking aux files: the API does not wait for LSN.
+    wait_for_last_flush_lsn(env, endpoint, env.initial_tenant, branch_timeline_id)
+    assert set(http.list_aux_files(env.initial_tenant, env.initial_timeline, lsn2).keys()) == set(
+        ["pg_replslot/test_slot_parent_1/state", "pg_replslot/test_slot_parent_2/state"]
+    ), "main branch unaffected"
+    assert set(http.list_aux_files(env.initial_tenant, branch_timeline_id, lsn3).keys()) == set(
+        ["pg_replslot/test_slot_restore/state"]
+    )
+    assert set(http.list_aux_files(env.initial_tenant, branch_timeline_id, lsn1).keys()) == set([])
+
+
 # TODO:
 # - branch near existing L1 boundary, image layers?
 # - investigate: why are layers started at uneven lsn? not just after branching, but in general.
--- a/vendor/postgres-v14
+++ b/vendor/postgres-v14
--- a/vendor/postgres-v15
+++ b/vendor/postgres-v15
--- a/vendor/postgres-v16
+++ b/vendor/postgres-v16
--- a/vendor/postgres-v17
+++ b/vendor/postgres-v17
--- a/vendor/revisions.json
+++ b/vendor/revisions.json
@@ -1,18 +1,18 @@
 {
  "v17": [
    "17.4",
-    "22533c63fc42cdc1dbe138650ba1eca10a70c5d7"
+    "7ec41bf6cd92a4af751272145fdd590270c491da"
  ],
  "v16": [
    "16.8",
-    "473f68210d52ff8508f71c15b0c77c01296f4ace"
+    "26c7d3f6de6f361c8923bb80d7563853b4a04958"
  ],
  "v15": [
    "15.12",
-    "6cea02e23caa950d5f06932491a91b6af8f54360"
+    "4ac24a747cd897119ce9b20547b3b04eba2cacbd"
  ],
  "v14": [
    "14.17",
-    "35bc1b0cba55680e3b37abce4e67a46bb15f3315"
+    "bce3e48d8a72e70e72dfee1b7421fecd0f1b00ac"
  ]
 }
Author	SHA1	Message	Date
Jan Christian Grünhage	0a2f227ef6	feat(ci): lint gha with zizmor using the pedantic persona	2025-04-07 17:14:57 +02:00
Dmitrii Kovalkov	181af302b5	storcon + safekeeper + scrubber: propagate root CA certs everywhere (#11418 ) ## Problem There are some places in the code where we create `reqwest::Client` without providing SSL CA certs from `ssl_ca_file`. These will break after we enable TLS everywhere. - Part of https://github.com/neondatabase/cloud/issues/22686 ## Summary of changes - Support `ssl_ca_file` in storage scrubber. - Add `use_https_safekeeper_api` option to safekeeper to use https for peer requests. - Propagate SSL CA certs to storage_controller/client, storcon's ComputeHook, PeerClient and maybe_forward.	2025-04-04 06:30:48 +00:00
Tristan Partin	497116b76d	Download extension if it does not exist on the filesystem (#11315 ) Previously we attempted to download all extensions in CREATE EXTENSION statements. Extensions like pg_stat_statements and neon are not remote extensions, but still we were requesting them when skip_pg_catalog_updates was set to false. Fixes: https://github.com/neondatabase/neon/issues/11127 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-04 01:06:22 +00:00
Arpad Müller	a917952b30	Add test_storcon_create_delete_sk_down and make it work (#11400 ) Adds a test `test_storcon_create_delete_sk_down` which tests the reconciler and pending op persistence if faced with a temporary safekeeper downtime during timeline creation or deletion. This is in contrast to `test_explicit_timeline_creation_storcon`, which tests the happy path. We also do some fixes: * timeline and tenant deletion http requests didn't expect a body, but `()` sent one. * we got the tenant deletion http request's return type wrong: it's supposed to be a hash map * we add some logging to improve observability * We fix `list_pending_ops` which had broken code meant to make it possible to restrict oneself to a single pageserver. But diesel doesn't support that sadly, or at least I couldn't figure out a way to make it work. We don't need that functionality, so remove it. * We add an info span to the heartbeater futures with the node id, so that there is no context-free msgs like "Backoff: waiting 1.1 seconds before processing with the task" in the storcon logs. we could also add the full base url of the node but don't do it as most other log lines contain that information already, and if we do duplication it should at least not be verbose. One can always find out the base url from the node id. Successor of #11261 Part of #9011	2025-04-04 00:17:40 +00:00
Tristan Partin	e581b670f4	Improve nightly physical replication benchmark (#11389 ) Log the created project and endpoint IDs and improve typing in the source code to improve readability. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-03 23:00:58 +00:00
Alexander Bayandin	8ed79ed773	build(deps): bump h2 to 4.2.0 (#11437 ) ## Problem We switched `h2` from 4.1.0 to a git commit to fix stubgen (in https://github.com/neondatabase/neon/pull/10491). `h2` 4.2.0 was released soon after that, so we can switch back to a pinned version. No changes are expected, as 4.2.0 is the commit immediately after the one we currently use: `dacd614fed^` ## Summary of changes - Bump `h2` to 4.2.0	2025-04-03 21:42:34 +00:00
Alex Chi Z.	381f42519e	fix(pageserver): skip gc-compaction over sparse keyspaces (#11404 ) ## Problem Part of https://github.com/neondatabase/neon/issues/11318 It's not 100% safe for now to run gc-compaction over the sparse keyspace. It might cause a deleted file to reappear if a specific sequence of operations is performed as in the issue, which in reality doesn't happen due to how we split delta/image layers based on the key range. A long-term fix would be either having a separate gc-compaction code path for metadata keys (as we have a different code path for metadata image layer generation), or making the compaction process aware that "there's an image layer that doesn't contain a key" so that we can skip the keys. ## Summary of changes * gc-compaction auto trigger only triggers compaction over the normal data range. * do not hold gc_block_guard across the full compaction job, only hold it during each subcompaction. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-03 19:40:44 +00:00
Arpad Müller	375df517a0	storcon: return 503 instead of 500 if there is no new leader yet (#11417 ) The leadership transfer protocol between storage controller instances is as follows, listing the steps for the new pod: The new pod does these things: 1. new pod comes online. looks in database if there is a leader. if there is, it asks that leader to step down. 2. the new pod does some operations to come online. they should be fairly short timed, but it's not zero. 3. the new pod updates the leader entry in the database. The old pod, once it gets the step down request, changes its internal state to stepped down. It treats all incoming requests specially now: instead of processing, it wants to forward them to the new pod. The forwarding however only works if the new pod is online already, so before forwarding it reads from the db for a leader (also to get the address to forward to in the first place). If the new pod is not online yet, i.e. during step 2 above, the old pod might legitimately land in the branch which this patch is editing: the leader in the database is a stepped down instance. Before, we've returned a `ApiError::InternalServerError`, but that would print the full backtrace plus an error log. With this patch, we cut down on the noise, as it's an expected situation to have a short storcon downtime while we are cutting over to the new instance. A `ResourceUnavailable` error is not just more fitting, it also doesn't print a backtrace once encountered, and only prints on the INFO log level (see `api_error_handler` function). Fixes #11320 cc #8954	2025-04-03 18:43:16 +00:00
Vlad Lazar	9db63fea7a	pageserver: optionally export perf traces in OTEL format (#11140 ) Based on https://github.com/neondatabase/neon/pull/11139 ## Problem We want to export performance traces from the pageserver in OTEL format. End goal is to see them in Grafana. ## Summary of changes https://github.com/neondatabase/neon/pull/11139 introduces the infrastructure required to run the otel collector alongside the pageserver. ### Design Requirements: 1. We'd like to avoid implementing our own performance tracing stack and use the `tracing` crate if possible. 2. Ideally, we'd like zero overhead at a sampling rate of zero and to be able to change the tracing config for a tenant on the fly. 3. We should leave the current span hierarchy intact. This includes adding perf traces without modifying existing tracing. To satisfy (3) (and (2) in part) a separate span hierarchy is used. `RequestContext` gains an optional `perf_span` member that's only set when the request was chosen by sampling. All perf span related methods added to `RequestContext` are no-ops for requests that are not sampled. This on its own is not enough for (3), so performance spans use a separate tracing subscriber. The `tracing` crate doesn't have great support for this, so there's a fair amount of boilerplate to override the subscriber at all points of the perf span lifecycle. ### Perf Impact [Periodic pagebench](https://neonprod.grafana.net/d/ddqtbfykfqfi8d/e904990?orgId=1&from=2025-02-08T14:15:59.362Z&to=2025-03-10T14:15:59.362Z&timezone=utc) shows no statistically significant regression with a sample ratio of 0. There's an annotation on the dashboard on 2025-03-06. ### Overview of changes: 1. Clean up the `RequestContext` API a bit. Namely, get rid of the `RequestContext::extend` API and use the builder instead. 2. Add pageserver level configs for tracing: sampling ratio, otel endpoint, etc. 3. Introduce some perf span tracking utilities and expose them via `RequestContext`. We add a `tracing::Span` wrapper to be used for perf spans and a `tracing::Instrumented` equivalent for it. See doc comments for reason. 4. Set up OTEL tracing infra according to configuration. A separate runtime is used for the collector. 5. Add perf traces to the read path. ## Refs - epic https://github.com/neondatabase/neon/issues/9873 --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-04-03 17:56:51 +00:00
Alex Chi Z.	bfc767d60d	fix(test): wait for shard split complete for test_lsn_lease_storcon (#11436 ) ## Problem close https://github.com/neondatabase/neon/issues/11397 ref https://github.com/neondatabase/cloud/issues/23667 ## Summary of changes We need to wait until the shard split is complete, otherwise it will print warning like waiting for shard split exclusive lock for 30s. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-03 17:49:45 +00:00
Alex Chi Z.	109c54a300	fix(pageserver): avoid gc-compaction triggering circuit breaker (#11403 ) ## Problem There are some cases where traditional gc might collect some layer files, so that gc-compaction cannot read the full history of a key. This needs to be resolved in the long term by improving the compaction process. For now, let's simply avoid such errors triggering the circuit breaker. ## Summary of changes * Move the place where we trigger the circuit breaker. We only trigger it during compactions other than L0 compactions. We added the trigger a year ago due to file cleanup concerns in image layer compaction. * For gc-compaction, only return errors to the upper compaction_iteration if it's a shutdown error. Otherwise, just log it and skip the compaction for a key range. Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-04-03 17:18:37 +00:00
Vlad Lazar	74920d8cd8	storcon: notify compute if correct observed state was refreshed (#11342 ) ## Problem Previously, if the observed state was refreshed and matching the intent, we wouldn't send a compute notification. This is unsafe. There's no guarantee that the location landed on the pageserver _and_ a compute notification for it was delivered. See https://github.com/neondatabase/neon/issues/11291#issuecomment-2743205411 for one such example. ## Summary of changes Add a reproducer and notify the compute if the correct observed state required a refresh. Closes https://github.com/neondatabase/neon/issues/11291	2025-04-03 16:35:55 +00:00
Alex Chi Z.	131b32ef48	fix(pageserver): clean up aux files before detaching (#11299 ) ## Problem Related to https://github.com/neondatabase/cloud/issues/26091 and https://github.com/neondatabase/cloud/issues/25840 Close https://github.com/neondatabase/neon/issues/11297 Discussion on Slack: https://neondb.slack.com/archives/C033RQ5SPDH/p1742320666313969 ## Summary of changes * When detaching, scan all aux files within `sparse_non_inherited_keyspace` in the ancestor timeline and create an image layer exactly at the ancestor LSN. All scanned keys will map to an empty value, which is a delete tombstone. - Note that end_lsn for rewritten delta layers = ancestor_lsn + 1, so the image layer will have image_end_lsn=end_lsn. With the current `select_layer` logic, the read path will always first read the image layer. * Add a test case. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-04-03 15:55:22 +00:00
Suhas Thalanki	581bb5d7d5	removed pg_anon setup from compute dockerfile (#10960 ) ## Problem Removing the `anon` v1 extension in postgres as described in https://github.com/neondatabase/cloud/issues/22663. This extension is not built for postgres v17 and is out of date when compared to the upstream variant which is v2 (we have v1.4). ## Summary of changes Removed the `anon` v1 extension from the compute docker image Related to https://github.com/neondatabase/cloud/issues/22663	2025-04-03 15:26:35 +00:00
JC Grünhage	3c78133477	feat(ci): add 'released' tag to container images from release runs (#11425 ) ## Problem We had a problem with https://github.com/neondatabase/neon/pull/11413 having e2e tests failing, because an e2e test (`8d271bed47`) depended on an unreleased pageserver fix (`0ee5bfa2fc`). This came up because neon release CI runs against the most recent releases of the other components, but cloud e2e tests run against latest, which is tagged from main. ## Summary of changes Add an additional `released` tag for released versions. ## Alternative to consider We could (and maybe should) instead switch to `latest` being used for released versions and `main` being used where we use `latest` right now. That'd also mean we don't have to adjust the CI in the cloud repo.	2025-04-03 14:57:44 +00:00
Suhas Thalanki	46e046e779	Exporting `file_cache_used` to calculate LFC utilization (#11384 ) ## Problem Exporting `file_cache_used` which specifies the number of used chunks in the LFC. This helps calculate LFC utilization as: `file_cache_used_pages / (file_cache_used * file_cache_chunk_size_pages)` ## Summary of changes Exporting `file_cache_used`. Related Issue: https://github.com/neondatabase/cloud/issues/26688	2025-04-03 14:54:45 +00:00
Arpad Müller	d8cee52637	Update rust to 1.86.0 (#11431 ) We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Announcement blog post](https://blog.rust-lang.org/2025/04/03/Rust-1.86.0.html). Prior update was in #10914.	2025-04-03 14:53:28 +00:00
Dmitrii Kovalkov	2e11d129d0	tests: suppress mgm api timeout error in sotrcon (#11428 ) ## Problem Since `0f367cb665` the timeout in `with_client_retries` is implemented via `tokio::timeout` instead of `reqwest::ClientBuilder::timeout` (because we reuse the client). It changed the error representation if the timeout is exceeded. Such errors were suppressed in `allowed_errors.py`, but old regexps do not match the new error. Discussion: https://neondb.slack.com/archives/C033RQ5SPDH/p1743533184736319 ## Summary of changes - Add new `Timeout` error to `allowed_errors.py`	2025-04-03 14:18:50 +00:00
Luís Tavares	43a7423f72	feat: bump pg_session_jwt extension to 0.3.0 (#11399 ) ## Problem Bumps https://github.com/neondatabase/pg_session_jwt to the latest release [v0.3.0](https://github.com/neondatabase/pg_session_jwt/releases/tag/v0.3.0) that introduces PostgREST fallback mechanisms. ## Summary of changes Updates the extension download tar and the extension version in the proxy constant. ## Subscribers @mrl5	2025-04-03 13:01:18 +00:00
Arpad Müller	374736a4de	Print remote_addr span for Failed to serve HTTP connection error (#11423 ) I've encountered this error in #11422. Ideally we'd have the URL as well to associate it with a tenant, but at this level we only have the remote addr I guess. Better than nothing.	2025-04-03 11:58:12 +00:00
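
The handoff described in commit `375df517a0` (storcon: return 503 instead of 500 if there is no new leader yet) can be illustrated with a small sketch. This is not the storage controller's code, which is written in Rust; `route_while_stepped_down`, its parameters, and the address strings are hypothetical names chosen for illustration. Treating a missing leader entry the same as a stale one is also an assumption. The sketch only shows the decision the commit message describes: a stepped-down instance forwards when the database names a different leader, and answers 503 (rather than 500) when the database still points at itself.

```python
from typing import Optional, Tuple, Union

SERVICE_UNAVAILABLE = 503  # expected, short-lived condition during cutover


def route_while_stepped_down(
    own_address: str, db_leader_address: Optional[str]
) -> Tuple[str, Union[int, str]]:
    """Decide what a stepped-down instance does with an incoming request.

    Hypothetical helper: returns ("forward", address) when a new leader is
    registered, or ("reject", 503) while the new instance is still starting.
    """
    # Assumption: a missing leader entry is handled like a stale one.
    if db_leader_address is None or db_leader_address == own_address:
        # The database still names the stepped-down instance (step 2 of the
        # handoff), so there is nowhere to forward to yet.
        return ("reject", SERVICE_UNAVAILABLE)
    # The new instance has updated the leader entry (step 3): forward to it.
    return ("forward", db_leader_address)


if __name__ == "__main__":
    print(route_while_stepped_down("old-pod:1234", "old-pod:1234"))
    print(route_while_stepped_down("old-pod:1234", "new-pod:1234"))
```

A 503 tells callers to retry shortly, which matches the commit's description of the cutover window as an expected situation rather than an internal error.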
				`@@ -0,0 +1 @@`
				`SELECT lfc_value AS lfc_used_pages FROM neon.neon_lfc_stats WHERE lfc_key = 'file_cache_used_pages';`
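
Commit `46e046e779` gives the LFC utilization formula as `file_cache_used_pages / (file_cache_used * file_cache_chunk_size_pages)`. A minimal sketch of that arithmetic follows. The function name `lfc_utilization` and the sample numbers are illustrative assumptions, and returning 0.0 for an empty cache is a choice made here to avoid division by zero, not something the commit specifies.

```python
def lfc_utilization(used_pages: int, used_chunks: int, chunk_size_pages: int) -> float:
    """Fraction of the pages in used LFC chunks that actually hold data.

    used_pages       -> file_cache_used_pages
    used_chunks      -> file_cache_used
    chunk_size_pages -> file_cache_chunk_size_pages
    """
    capacity_pages = used_chunks * chunk_size_pages
    if capacity_pages == 0:
        # Assumption: report zero utilization when no chunks are in use.
        return 0.0
    return used_pages / capacity_pages


if __name__ == "__main__":
    # Made-up sample values: 96 used pages in 1 used chunk of 128 pages.
    print(lfc_utilization(used_pages=96, used_chunks=1, chunk_size_pages=128))
```

With the sample values the result is 96 / 128 = 0.75, meaning three quarters of the pages in the allocated chunks are in use.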