# TLDR
All changes are no-ops except:
1. publishing additional metrics, and
2. the fix for Problem VI.

## Problem I

It has come to my attention that the Neon Storage Controller doesn't
correctly update its "observed" state of tenants previously associated
with pageservers (PSs) that have come back up after a local data loss.
It still thinks the old tenants are attached to those pageservers and
asks no further questions. The pageserver has enough information from
the reattach request/response to tell that something is wrong, but it
doesn't do anything about it either. We need to detect this situation in
production while I work on a fix.

(I think there is just some misunderstanding on my part about how Neon
manages its pageserver deployments, which got me confused about the
invariants.)

## Summary of changes I

Added a `pageserver_local_data_loss_suspected` gauge metric that is set
to 1 when we detect a problematic situation from the reattach response,
namely when the PS has no local tenants but received a reattach response
containing tenants.

We can set up an alert on this metric; it should fire whenever the
metric reports a non-zero value.

Also added an HTTP POST API,
`http://pageserver/hadron-internal/reset_alert_gauges`, on the
pageserver that can be used to reset the gauge (and hence the alert)
once we manually rectify the situation (by restarting the HCC).
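
For reference, a minimal sketch of the detection and gauge update, using the `prometheus` and `once_cell` crates as stand-ins for the pageserver's metrics plumbing (the actual check runs in `init_tenant_mgr`, as shown in the diff below):

```rust
use once_cell::sync::Lazy;
use prometheus::{IntGauge, register_int_gauge};

// Simplified stand-in for the gauge added by this PR; the real definition
// lives in the pageserver's metrics module.
static LOCAL_DATA_LOSS_SUSPECTED: Lazy<IntGauge> = Lazy::new(|| {
    register_int_gauge!(
        "pageserver_local_data_loss_suspected",
        "Non-zero value indicates that pageserver local data loss is suspected"
    )
    .expect("failed to define a metric")
});

/// Set the gauge based on what the pageserver sees at startup: no tenants on
/// local disk, but the storage controller's re-attach response says there
/// should be some.
fn update_data_loss_gauge(local_tenant_count: usize, reattached_tenant_count: usize) -> bool {
    let suspected = local_tenant_count == 0 && reattached_tenant_count > 0;
    LOCAL_DATA_LOSS_SUSPECTED.set(if suspected { 1 } else { 0 });
    suspected
}
```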

## Problem II
Azure uploads are ~3x slower than AWS uploads, which translates into ~3x
slower ingestion.

The chain is: slower uploads from the pageserver to Azure => higher
flush latency => the disk-consistent LSN lags further behind => more
backpressure.

## Summary of changes II
Use the Azure Put Block API to upload a 1 GB layer file as 8 blocks in
parallel.

I set the put_block block size to 128 MB by default in the Azure config.

To minimize changes to Neon, the upload function passes the layer file
path to the Azure upload code through the storage metadata. This allows
the Azure put_block path to use `FileChunkStreamRead` to stream-read one
partition of the file instead of loading the entire file into memory and
splitting it into 8 x 128 MB chunks.
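
Roughly, each block is stream-read from the layer file like this (a simplified sketch; the real change additionally wraps the stream for the Azure SDK and uploads each block via `put_block` with a URL-safe base64 block id):

```rust
use tokio::fs::File;
use tokio::io::{AsyncReadExt, AsyncSeekExt};
use tokio_util::io::ReaderStream;

/// Sketch: stream-read one block of a layer file for an Azure Put Block call,
/// without loading the whole file into memory. Error handling and the actual
/// Azure client calls are omitted.
async fn block_stream(
    path: &str,
    start: u64,
    block_size: u64,
) -> std::io::Result<ReaderStream<tokio::io::Take<File>>> {
    let mut file = File::open(path).await?;
    // Position the reader at the beginning of this block.
    file.seek(std::io::SeekFrom::Start(start)).await?;
    // Limit the reader to exactly one block and stream it in 1 MiB chunks.
    Ok(ReaderStream::with_capacity(file.take(block_size), 1024 * 1024))
}
```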

## How is this tested? II
1. The Rust `test_real_azure` test covers the put_block change.
2. I deployed the change in the Azure dev environment and saw flush
latency drop from ~30 seconds to ~10 seconds.
3. I also ran a number of stress tests using sqlsmith and 100 GB TPC-DS
runs.

## Problem III
Currently Neon limits the number of concurrent compaction tasks to 3/4 *
CPU cores. This caps overall compaction throughput and can easily cause
head-of-line blocking when a few large tenants are compacting.

## Summary of changes III
This PR increases the limit on compaction tasks to `BG_TASKS_PER_THREAD`
(default 4) * CPU cores. Note that `CONCURRENT_BACKGROUND_TASKS` also
limits some other tasks, such as `logical_size_calculation` and layer
eviction, but compaction should be the most frequent and time-consuming
one.
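
A minimal sketch of the described permit computation, assuming `BG_TASKS_PER_THREAD` is read from the environment with a default of 4:

```rust
/// Compute the number of background-task permits as described above:
/// `BG_TASKS_PER_THREAD` (default 4) per Tokio worker thread, never less
/// than one. Illustrative only; see the diff for the final behavior.
fn background_task_permits(total_threads: usize) -> usize {
    let tasks_per_thread: usize = std::env::var("BG_TASKS_PER_THREAD")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(4);
    usize::max(1, total_threads * tasks_per_thread)
}
```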

## Summary of changes IV
This PR adds the following PageServer metrics:
1. `pageserver_disk_usage_based_eviction_evicted_bytes_total`: captures
the total number of bytes evicted. It's more straightforward to see the
bytes directly instead of layer counts.
2. `pageserver_active_storage_operations_count`: captures the active
storage operations, e.g., flush, L0 compaction, image creation, etc.
It's useful to visualize these active operations to get a better idea of
what PageServers are spending cycles on in the background.
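
A minimal sketch of the bookkeeping behind the active-operations gauge, assuming a `prometheus::IntGauge`; the actual change hooks the increment/decrement into the existing per-timeline storage-time timers:

```rust
use prometheus::IntGauge;

/// RAII guard: increment the per-operation gauge when a storage operation
/// (flush, compaction, image creation, ...) starts, decrement it when the
/// guard is dropped, so the gauge always reflects in-flight work. Simplified
/// from the timer changes in this PR, which also record the operation's
/// duration.
struct ActiveOpGuard {
    gauge: IntGauge,
}

impl ActiveOpGuard {
    fn start(gauge: IntGauge) -> Self {
        gauge.inc();
        Self { gauge }
    }
}

impl Drop for ActiveOpGuard {
    fn drop(&mut self) {
        self.gauge.dec();
    }
}
```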

## Summary of changes V
When investigating data corruption, it's useful to find the base image
and all WAL records of a page up to an LSN, i.e., a breakdown of a
GetPage@LSN request. This PR implements this functionality with two
tools:

1. Extended `pagectl` with a new command that searches the layer files
listed in an `index_part.json` file for a given key up to a given LSN.
The output can be used to download the layer files from S3 and then
search their contents using the second tool.
Example usage:
```
cargo run --bin pagectl index-part search --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --path ~/Downloads/corruption/index_part.json-0000000c-formatted --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8
```
Example output:
```
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008028000002FEFF__000007089F0B5381-0000070C7679EEB9-0000000c
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000000000000000000000000000000000-000000067F0000801400008028000002F3F1__000006DD95B6F609-000006E2BA14C369-0000000c
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F000080140000802100001B0973__000006D33429F539-000006DD95B6F609-0000000c
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000164D81__000006C6343B2D31-000006D33429F539-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008021000017687B__000006BA344FA7F1-000006C6343B2D31-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000165BAB__000006AD34613D19-000006BA344FA7F1-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000137A39__0000069F34773461-000006AD34613D19-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F000080140000802100000D4000-000000067F000080140000802100000F0000__0000069F34773460-0000000b
```

2. Added a unit test to search the layer file contents. It's not
implemented as part of `pagectl` because it depends on test harness
code that is only available to unit tests.

Example usage:
```
cargo test --package pageserver --lib -- tenant::debug::test_search_key --exact --nocapture -- --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --data-dir /Users/chen.luo/Downloads/corruption --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8
```
Example output:
```
# image omitted for brevity
delta: 69F/769D8180: will_init: false, "OgAAALGkuwXwYp12nwYAAECGAAASIqLHAAAAAH8GAAAUgAAAIYAAAL1hDQD/DLGkuwUDAAAAEAAWAA=="
delta: 69F/769CB6D8: will_init: false, "PQAAALGkuwXotZx2nwYAABAJAAAFk7tpACAGAH8GAAAUgAAAIYAAAL1hDQD/CQUAEAASALExuwUBAAAAAA=="
```

## Problem VI
Currently, when the page service resolves shards from page numbers, it
doesn't fully handle the case where the shard is split mid-request. This
leads to query failures during a tenant split, for both the commit and
abort cases (mostly abort).

## Summary of changes VI
This PR adds retry logic in `Cache::get()` to deal with shard resolution
errors more gracefully. Specifically, it clears the cache and retries
instead of failing the query immediately. It also reduces the internal
timeout to make retries faster.
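
A simplified sketch of the retry wrapper (assuming `tokio` and `tracing`; `do_get` stands in for the existing cache lookup):

```rust
use std::time::Duration;

/// Retry transient shard-resolution errors with a short backoff instead of
/// failing the query immediately, mirroring the wrapper added around
/// `Cache::get()`.
async fn get_with_retries<T, E, Fut, F>(mut do_get: F) -> Result<T, E>
where
    E: std::fmt::Debug,
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    const GET_MAX_RETRIES: usize = 10;
    const RETRY_BACKOFF: Duration = Duration::from_millis(100);
    let mut attempt = 0;
    loop {
        attempt += 1;
        match do_get().await {
            Ok(handle) => return Ok(handle),
            Err(e) if attempt < GET_MAX_RETRIES => {
                tracing::warn!("Failed to resolve tenant shard in attempt {attempt}: {e:?}. Retrying...");
                tokio::time::sleep(RETRY_BACKOFF).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```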

The PR also fixes an obvious bug in
`TenantManager::resolve_attached_shard` where the code caches the
computed shard number but forgets to recompute it when the shard count
changes.
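
A hypothetical illustration of the bug class, with a toy shard computation standing in for the real sharding logic:

```rust
/// Hypothetical illustration only: a cached shard number is valid solely for
/// the shard count it was computed with, so it must be recomputed whenever the
/// shard count changes (e.g. across a split). The real code in
/// `TenantManager::resolve_attached_shard` uses the tenant's actual sharding
/// parameters rather than this toy hash.
struct CachedShard {
    shard_count: u8,
    shard_number: u8,
}

fn resolve_shard(cache: &mut Option<CachedShard>, key_hash: u32, shard_count: u8) -> u8 {
    if let Some(c) = cache {
        if c.shard_count == shard_count {
            // Safe to reuse: the cached value was computed for the same shard count.
            return c.shard_number;
        }
    }
    // Shard count changed (or nothing cached yet): recompute and re-cache.
    let shard_number = (key_hash % shard_count as u32) as u8;
    *cache = Some(CachedShard { shard_count, shard_number });
    shard_number
}
```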

---------

Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Vlad Lazar <vlad.lazar@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
HaoyuHuang committed 2025-07-08 12:43:01 -07:00 (committed by GitHub)
parent 81e7218c27, commit 3dad4698ec
23 changed files with 1097 additions and 61 deletions

Cargo.lock (generated)

@@ -4339,6 +4339,7 @@ dependencies = [
"arc-swap",
"async-compression",
"async-stream",
"base64 0.22.1",
"bincode",
"bit_field",
"byteorder",
@@ -5684,6 +5685,8 @@ dependencies = [
"azure_identity",
"azure_storage",
"azure_storage_blobs",
"base64 0.22.1",
"byteorder",
"bytes",
"camino",
"camino-tempfile",


@@ -13,6 +13,7 @@ aws-smithy-async.workspace = true
aws-smithy-types.workspace = true
aws-config.workspace = true
aws-sdk-s3.workspace = true
base64.workspace = true
bytes.workspace = true
camino = { workspace = true, features = ["serde1"] }
humantime-serde.workspace = true
@@ -41,6 +42,8 @@ http-body-util.workspace = true
itertools.workspace = true
sync_wrapper = { workspace = true, features = ["futures"] }
byteorder = "1.4"
[dev-dependencies]
camino-tempfile.workspace = true
test-context.workspace = true


@@ -14,17 +14,25 @@ use anyhow::{Context, Result, anyhow};
use azure_core::request_options::{IfMatchCondition, MaxResults, Metadata, Range};
use azure_core::{Continuable, HttpClient, RetryOptions, TransportOptions};
use azure_storage::StorageCredentials;
use azure_storage_blobs::blob::operations::GetBlobBuilder;
use azure_storage_blobs::blob::BlobBlockType;
use azure_storage_blobs::blob::BlockList;
use azure_storage_blobs::blob::{Blob, CopyStatus};
use azure_storage_blobs::container::operations::ListBlobsBuilder;
use azure_storage_blobs::prelude::{ClientBuilder, ContainerClient};
use azure_storage_blobs::prelude::ClientBuilder;
use azure_storage_blobs::{blob::operations::GetBlobBuilder, prelude::ContainerClient};
use base64::{Engine as _, engine::general_purpose::URL_SAFE};
use byteorder::{BigEndian, ByteOrder};
use bytes::Bytes;
use camino::Utf8Path;
use futures::FutureExt;
use futures::future::Either;
use futures::stream::Stream;
use futures_util::{StreamExt, TryStreamExt};
use http_types::{StatusCode, Url};
use scopeguard::ScopeGuard;
use tokio::fs::File;
use tokio::io::AsyncReadExt;
use tokio::io::AsyncSeekExt;
use tokio_util::sync::CancellationToken;
use tracing::debug;
use utils::backoff;
@@ -51,6 +59,9 @@ pub struct AzureBlobStorage {
// Alternative timeout used for metadata objects which are expected to be small
pub small_timeout: Duration,
/* BEGIN_HADRON */
pub put_block_size_mb: Option<usize>,
/* END_HADRON */
}
impl AzureBlobStorage {
@@ -107,6 +118,9 @@ impl AzureBlobStorage {
concurrency_limiter: ConcurrencyLimiter::new(azure_config.concurrency_limit.get()),
timeout,
small_timeout,
/* BEGIN_HADRON */
put_block_size_mb: azure_config.put_block_size_mb,
/* END_HADRON */
})
}
@@ -583,31 +597,137 @@ impl RemoteStorage for AzureBlobStorage {
let started_at = start_measuring_requests(kind);
let op = async {
let mut metadata_map = metadata.unwrap_or([].into());
let timeline_file_path = metadata_map.0.remove("databricks_azure_put_block");
/* BEGIN_HADRON */
let op = async move {
let blob_client = self.client.blob_client(self.relative_path_to_name(to));
let put_block_size = self.put_block_size_mb.unwrap_or(0) * 1024 * 1024;
if timeline_file_path.is_none() || put_block_size == 0 {
// Use put_block_blob directly.
let from: Pin<
Box<dyn Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static>,
> = Box::pin(from);
let from = NonSeekableStream::new(from, data_size_bytes);
let body = azure_core::Body::SeekableStream(Box::new(from));
let from: Pin<Box<dyn Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static>> =
Box::pin(from);
let mut builder = blob_client.put_block_blob(body);
if !metadata_map.0.is_empty() {
builder = builder.metadata(to_azure_metadata(metadata_map));
}
let fut = builder.into_future();
let fut = tokio::time::timeout(self.timeout, fut);
let result = fut.await;
match result {
Ok(Ok(_response)) => return Ok(()),
Ok(Err(azure)) => return Err(azure.into()),
Err(_timeout) => return Err(TimeoutOrCancel::Timeout.into()),
};
}
// Upload chunks concurrently using Put Block.
// Each PutBlock uploads put_block_size bytes of the file.
let mut upload_futures: Vec<tokio::task::JoinHandle<Result<(), azure_core::Error>>> =
vec![];
let mut block_list = BlockList::default();
let mut start_bytes = 0u64;
let mut remaining_bytes = data_size_bytes;
let mut block_list_count = 0;
let from = NonSeekableStream::new(from, data_size_bytes);
while remaining_bytes > 0 {
let block_size = std::cmp::min(remaining_bytes, put_block_size);
let end_bytes = start_bytes + block_size as u64;
let block_id = block_list_count;
let timeout = self.timeout;
let blob_client = blob_client.clone();
let timeline_file = timeline_file_path.clone().unwrap().clone();
let body = azure_core::Body::SeekableStream(Box::new(from));
let mut encoded_block_id = [0u8; 8];
BigEndian::write_u64(&mut encoded_block_id, block_id);
URL_SAFE.encode(encoded_block_id);
let mut builder = blob_client.put_block_blob(body);
// Put one block.
let part_fut = async move {
let mut file = File::open(Utf8Path::new(&timeline_file.clone())).await?;
file.seek(io::SeekFrom::Start(start_bytes)).await?;
let limited_reader = file.take(block_size as u64);
let file_chunk_stream =
tokio_util::io::ReaderStream::with_capacity(limited_reader, 1024 * 1024);
let file_chunk_stream_pin: Pin<
Box<dyn Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static>,
> = Box::pin(file_chunk_stream);
let stream_wrapper = NonSeekableStream::new(file_chunk_stream_pin, block_size);
let body = azure_core::Body::SeekableStream(Box::new(stream_wrapper));
// Azure put block takes URL-encoded block ids and all blocks must have the same byte length.
// https://learn.microsoft.com/en-us/rest/api/storageservices/put-block?tabs=microsoft-entra-id#uri-parameters
let builder = blob_client.put_block(encoded_block_id.to_vec(), body);
let fut = builder.into_future();
let fut = tokio::time::timeout(timeout, fut);
let result = fut.await;
tracing::debug!(
"azure put block id-{} size {} start {} end {} file {} response {:#?}",
block_id,
block_size,
start_bytes,
end_bytes,
timeline_file,
result
);
match result {
Ok(Ok(_response)) => Ok(()),
Ok(Err(azure)) => Err(azure),
Err(_timeout) => Err(azure_core::Error::new(
azure_core::error::ErrorKind::Io,
std::io::Error::new(
std::io::ErrorKind::TimedOut,
"Operation timed out",
),
)),
}
};
upload_futures.push(tokio::spawn(part_fut));
if let Some(metadata) = metadata {
builder = builder.metadata(to_azure_metadata(metadata));
block_list_count += 1;
remaining_bytes -= block_size;
start_bytes += block_size as u64;
block_list
.blocks
.push(BlobBlockType::Uncommitted(encoded_block_id.to_vec().into()));
}
tracing::debug!(
"azure put blocks {} total MB: {} chunk size MB: {}",
block_list_count,
data_size_bytes / 1024 / 1024,
put_block_size / 1024 / 1024
);
// Wait for all blocks to be uploaded.
let upload_results = futures::future::try_join_all(upload_futures).await;
if upload_results.is_err() {
return Err(anyhow::anyhow!(format!(
"Failed to upload all blocks {:#?}",
upload_results.unwrap_err()
)));
}
// Commit the blocks.
let mut builder = blob_client.put_block_list(block_list);
if !metadata_map.0.is_empty() {
builder = builder.metadata(to_azure_metadata(metadata_map));
}
let fut = builder.into_future();
let fut = tokio::time::timeout(self.timeout, fut);
let result = fut.await;
tracing::debug!("azure put block list response {:#?}", result);
match fut.await {
match result {
Ok(Ok(_response)) => Ok(()),
Ok(Err(azure)) => Err(azure.into()),
Err(_timeout) => Err(TimeoutOrCancel::Timeout.into()),
}
};
/* END_HADRON */
let res = tokio::select! {
res = op => res,
@@ -622,7 +742,6 @@ impl RemoteStorage for AzureBlobStorage {
crate::metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, outcome, started_at);
res
}


@@ -195,8 +195,19 @@ pub struct AzureConfig {
pub max_keys_per_list_response: Option<i32>,
#[serde(default = "default_azure_conn_pool_size")]
pub conn_pool_size: usize,
/* BEGIN_HADRON */
#[serde(default = "default_azure_put_block_size_mb")]
pub put_block_size_mb: Option<usize>,
/* END_HADRON */
}
/* BEGIN_HADRON */
fn default_azure_put_block_size_mb() -> Option<usize> {
// Disable parallel upload by default.
Some(0)
}
/* END_HADRON */
fn default_remote_storage_azure_concurrency_limit() -> NonZeroUsize {
NonZeroUsize::new(DEFAULT_REMOTE_STORAGE_AZURE_CONCURRENCY_LIMIT).unwrap()
}
@@ -213,6 +224,9 @@ impl Debug for AzureConfig {
"max_keys_per_list_response",
&self.max_keys_per_list_response,
)
/* BEGIN_HADRON */
.field("put_block_size_mb", &self.put_block_size_mb)
/* END_HADRON */
.finish()
}
}
@@ -352,6 +366,7 @@ timeout = '5s'";
upload_storage_class = 'INTELLIGENT_TIERING'
timeout = '7s'
conn_pool_size = 8
put_block_size_mb = 1024
";
let config = parse(toml).unwrap();
@@ -367,6 +382,9 @@ timeout = '5s'";
concurrency_limit: default_remote_storage_azure_concurrency_limit(),
max_keys_per_list_response: DEFAULT_MAX_KEYS_PER_LIST_RESPONSE,
conn_pool_size: 8,
/* BEGIN_HADRON */
put_block_size_mb: Some(1024),
/* END_HADRON */
}),
timeout: Duration::from_secs(7),
small_timeout: RemoteStorageConfig::DEFAULT_SMALL_TIMEOUT


@@ -165,10 +165,42 @@ pub(crate) async fn upload_remote_data(
let (data, data_len) =
upload_stream(format!("remote blob data {i}").into_bytes().into());
/* BEGIN_HADRON */
let mut metadata = None;
if matches!(&*task_client, GenericRemoteStorage::AzureBlob(_)) {
let file_path = "/tmp/dbx_upload_tmp_file.txt";
{
// Open the file in append mode
let mut file = std::fs::OpenOptions::new()
.append(true)
.create(true) // Create the file if it doesn't exist
.open(file_path)?;
// Append some bytes to the file
std::io::Write::write_all(
&mut file,
&format!("remote blob data {i}").into_bytes(),
)?;
file.sync_all()?;
}
metadata = Some(remote_storage::StorageMetadata::from([(
"databricks_azure_put_block",
file_path,
)]));
}
/* END_HADRON */
task_client
.upload(data, data_len, &blob_path, None, &cancel)
.upload(data, data_len, &blob_path, metadata, &cancel)
.await?;
// TODO: Check upload is using the put_block upload.
// We cannot consume data here since data is moved inside the upload.
// let total_bytes = data.fold(0, |acc, chunk| async move {
// acc + chunk.map(|bytes| bytes.len()).unwrap_or(0)
// }).await;
// assert_eq!(total_bytes, data_len);
Ok::<_, anyhow::Error>((blob_prefix, blob_path))
});
}


@@ -219,6 +219,9 @@ async fn create_azure_client(
concurrency_limit: NonZeroUsize::new(100).unwrap(),
max_keys_per_list_response,
conn_pool_size: 8,
/* BEGIN_HADRON */
put_block_size_mb: Some(1),
/* END_HADRON */
}),
timeout: RemoteStorageConfig::DEFAULT_TIMEOUT,
small_timeout: RemoteStorageConfig::DEFAULT_SMALL_TIMEOUT,


@@ -112,6 +112,7 @@ twox-hash.workspace = true
procfs.workspace = true
[dev-dependencies]
base64.workspace = true
criterion.workspace = true
hex-literal.workspace = true
tokio = { workspace = true, features = ["process", "sync", "fs", "rt", "io-util", "time", "test-util"] }


@@ -1,10 +1,101 @@
use std::str::FromStr;
use anyhow::Context;
use camino::Utf8PathBuf;
use pageserver::tenant::IndexPart;
use pageserver::tenant::{
IndexPart,
layer_map::{LayerMap, SearchResult},
remote_timeline_client::remote_layer_path,
storage_layer::{PersistentLayerDesc, ReadableLayerWeak},
};
use pageserver_api::key::Key;
use utils::{
id::{TenantId, TimelineId},
lsn::Lsn,
shard::TenantShardId,
};
#[derive(clap::Subcommand)]
pub(crate) enum IndexPartCmd {
Dump { path: Utf8PathBuf },
Dump {
path: Utf8PathBuf,
},
/// Find all layers that need to be searched to construct the given page at the given LSN.
Search {
#[arg(long)]
tenant_id: String,
#[arg(long)]
timeline_id: String,
#[arg(long)]
path: Utf8PathBuf,
#[arg(long)]
key: String,
#[arg(long)]
lsn: String,
},
}
async fn search_layers(
tenant_id: &str,
timeline_id: &str,
path: &Utf8PathBuf,
key: &str,
lsn: &str,
) -> anyhow::Result<()> {
let tenant_id = TenantId::from_str(tenant_id).unwrap();
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
let timeline_id = TimelineId::from_str(timeline_id).unwrap();
let index_json = {
let bytes = tokio::fs::read(path).await?;
IndexPart::from_json_bytes(&bytes).unwrap()
};
let mut layer_map = LayerMap::default();
{
let mut updates = layer_map.batch_update();
for (key, value) in index_json.layer_metadata.iter() {
updates.insert_historic(PersistentLayerDesc::from_filename(
tenant_shard_id,
timeline_id,
key.clone(),
value.file_size,
));
}
}
let key = Key::from_hex(key)?;
let lsn = Lsn::from_str(lsn).unwrap();
let mut end_lsn = lsn;
loop {
let result = layer_map.search(key, end_lsn);
match result {
Some(SearchResult { layer, lsn_floor }) => {
let disk_layer = match layer {
ReadableLayerWeak::PersistentLayer(layer) => layer,
ReadableLayerWeak::InMemoryLayer(_) => {
anyhow::bail!("unexpected in-memory layer")
}
};
let metadata = index_json
.layer_metadata
.get(&disk_layer.layer_name())
.unwrap();
println!(
"{}",
remote_layer_path(
&tenant_id,
&timeline_id,
metadata.shard,
&disk_layer.layer_name(),
metadata.generation
)
);
end_lsn = lsn_floor;
}
None => break,
}
}
Ok(())
}
pub(crate) async fn main(cmd: &IndexPartCmd) -> anyhow::Result<()> {
@@ -16,5 +107,12 @@ pub(crate) async fn main(cmd: &IndexPartCmd) -> anyhow::Result<()> {
println!("{output}");
Ok(())
}
IndexPartCmd::Search {
tenant_id,
timeline_id,
path,
key,
lsn,
} => search_layers(tenant_id, timeline_id, path, key, lsn).await,
}
}


@@ -458,6 +458,9 @@ pub(crate) async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
match next {
Ok(Ok(file_size)) => {
METRICS.layers_evicted.inc();
/*BEGIN_HADRON */
METRICS.bytes_evicted.inc_by(file_size);
/*END_HADRON */
usage_assumed.add_available_bytes(file_size);
}
Ok(Err((


@@ -61,6 +61,7 @@ use crate::context;
use crate::context::{DownloadBehavior, RequestContext, RequestContextBuilder};
use crate::deletion_queue::DeletionQueueClient;
use crate::feature_resolver::FeatureResolver;
use crate::metrics::LOCAL_DATA_LOSS_SUSPECTED;
use crate::pgdatadir_mapping::LsnForTimestamp;
use crate::task_mgr::TaskKind;
use crate::tenant::config::LocationConf;
@@ -3628,6 +3629,17 @@ async fn activate_post_import_handler(
.await
}
// [Hadron] Reset gauge metrics that are used to raised alerts. We need this API as a stop-gap measure to reset alerts
// after we manually rectify situations such as local SSD data loss. We will eventually automate this.
async fn hadron_reset_alert_gauges(
request: Request<Body>,
_cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
check_permission(&request, None)?;
LOCAL_DATA_LOSS_SUSPECTED.set(0);
json_response(StatusCode::OK, ())
}
/// Read the end of a tar archive.
///
/// A tar archive normally ends with two consecutive blocks of zeros, 512 bytes each.
@@ -4154,5 +4166,8 @@ pub fn make_router(
.post("/v1/feature_flag_spec", |r| {
api_handler(r, update_feature_flag_spec)
})
.post("/hadron-internal/reset_alert_gauges", |r| {
api_handler(r, hadron_reset_alert_gauges)
})
.any(handler_404))
}


@@ -1,3 +1,4 @@
use std::cell::Cell;
use std::collections::HashMap;
use std::num::NonZeroUsize;
use std::os::fd::RawFd;
@@ -102,7 +103,18 @@ pub(crate) static STORAGE_TIME_COUNT_PER_TIMELINE: Lazy<IntCounterVec> = Lazy::n
.expect("failed to define a metric")
});
// Buckets for background operation duration in seconds, like compaction, GC, size calculation.
/* BEGIN_HADRON */
pub(crate) static STORAGE_ACTIVE_COUNT_PER_TIMELINE: Lazy<IntGaugeVec> = Lazy::new(|| {
register_int_gauge_vec!(
"pageserver_active_storage_operations_count",
"Count of active storage operations with operation, tenant and timeline dimensions",
&["operation", "tenant_id", "shard_id", "timeline_id"],
)
.expect("failed to define a metric")
});
/*END_HADRON */
// Buckets for background operations like compaction, GC, size calculation
const STORAGE_OP_BUCKETS: &[f64] = &[0.010, 0.100, 1.0, 10.0, 100.0, 1000.0];
pub(crate) static STORAGE_TIME_GLOBAL: Lazy<HistogramVec> = Lazy::new(|| {
@@ -2810,6 +2822,31 @@ pub(crate) static WALRECEIVER_CANDIDATES_ADDED: Lazy<IntCounter> =
pub(crate) static WALRECEIVER_CANDIDATES_REMOVED: Lazy<IntCounter> =
Lazy::new(|| WALRECEIVER_CANDIDATES_EVENTS.with_label_values(&["remove"]));
pub(crate) static LOCAL_DATA_LOSS_SUSPECTED: Lazy<IntGauge> = Lazy::new(|| {
register_int_gauge!(
"pageserver_local_data_loss_suspected",
"Non-zero value indicates that pageserver local data loss is suspected (and highly likely)."
)
.expect("failed to define a metric")
});
// Counter keeping track of misrouted PageStream requests. Spelling out PageStream requests here to distinguish
// it from other types of reqeusts (SK wal replication, http requests, etc.). PageStream requests are used by
// Postgres compute to fetch data from pageservers.
// A misrouted PageStream request is registered if the pageserver cannot find the tenant identified in the
// request, or if the pageserver is not the "primary" serving the tenant shard. These error almost always identify
// issues with compute configuration, caused by either the compute node itself being stuck in the wrong
// configuration or Storage Controller reconciliation bugs. Misrouted requests are expected during tenant migration
// and/or during recovery following a pageserver failure, but persistently high rates of misrouted requests
// are indicative of bugs (and unavailability).
pub(crate) static MISROUTED_PAGESTREAM_REQUESTS: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pageserver_misrouted_pagestream_requests_total",
"Number of pageserver pagestream requests that were routed to the wrong pageserver"
)
.expect("failed to define a metric")
});
// Metrics collected on WAL redo operations
//
// We collect the time spent in actual WAL redo ('redo'), and time waiting
@@ -3048,13 +3085,19 @@ pub(crate) static WAL_REDO_PROCESS_COUNTERS: Lazy<WalRedoProcessCounters> =
pub(crate) struct StorageTimeMetricsTimer {
metrics: StorageTimeMetrics,
start: Instant,
stopped: Cell<bool>,
}
impl StorageTimeMetricsTimer {
fn new(metrics: StorageTimeMetrics) -> Self {
/*BEGIN_HADRON */
// record the active operation as the timer starts
metrics.timeline_active_count.inc();
/*END_HADRON */
Self {
metrics,
start: Instant::now(),
stopped: Cell::new(false),
}
}
@@ -3070,6 +3113,10 @@ impl StorageTimeMetricsTimer {
self.metrics.timeline_sum.inc_by(seconds);
self.metrics.timeline_count.inc();
self.metrics.global_histogram.observe(seconds);
/* BEGIN_HADRON*/
self.stopped.set(true);
self.metrics.timeline_active_count.dec();
/*END_HADRON */
duration
}
@@ -3080,6 +3127,16 @@ impl StorageTimeMetricsTimer {
}
}
/*BEGIN_HADRON */
impl Drop for StorageTimeMetricsTimer {
fn drop(&mut self) {
if !self.stopped.get() {
self.metrics.timeline_active_count.dec();
}
}
}
/*END_HADRON */
pub(crate) struct AlwaysRecordingStorageTimeMetricsTimer(Option<StorageTimeMetricsTimer>);
impl Drop for AlwaysRecordingStorageTimeMetricsTimer {
@@ -3105,6 +3162,10 @@ pub(crate) struct StorageTimeMetrics {
timeline_sum: Counter,
/// Number of oeprations, per operation, tenant_id and timeline_id
timeline_count: IntCounter,
/*BEGIN_HADRON */
/// Number of active operations per operation, tenant_id, and timeline_id
timeline_active_count: IntGauge,
/*END_HADRON */
/// Global histogram having only the "operation" label.
global_histogram: Histogram,
}
@@ -3124,6 +3185,11 @@ impl StorageTimeMetrics {
let timeline_count = STORAGE_TIME_COUNT_PER_TIMELINE
.get_metric_with_label_values(&[operation, tenant_id, shard_id, timeline_id])
.unwrap();
/*BEGIN_HADRON */
let timeline_active_count = STORAGE_ACTIVE_COUNT_PER_TIMELINE
.get_metric_with_label_values(&[operation, tenant_id, shard_id, timeline_id])
.unwrap();
/*END_HADRON */
let global_histogram = STORAGE_TIME_GLOBAL
.get_metric_with_label_values(&[operation])
.unwrap();
@@ -3131,6 +3197,7 @@ impl StorageTimeMetrics {
StorageTimeMetrics {
timeline_sum,
timeline_count,
timeline_active_count,
global_histogram,
}
}
@@ -3544,6 +3611,14 @@ impl TimelineMetrics {
shard_id,
timeline_id,
]);
/* BEGIN_HADRON */
let _ = STORAGE_ACTIVE_COUNT_PER_TIMELINE.remove_label_values(&[
op,
tenant_id,
shard_id,
timeline_id,
]);
/*END_HADRON */
}
for op in StorageIoSizeOperation::VARIANTS {
@@ -4336,6 +4411,9 @@ pub(crate) mod disk_usage_based_eviction {
pub(crate) layers_collected: IntCounter,
pub(crate) layers_selected: IntCounter,
pub(crate) layers_evicted: IntCounter,
/*BEGIN_HADRON */
pub(crate) bytes_evicted: IntCounter,
/*END_HADRON */
}
impl Default for Metrics {
@@ -4372,12 +4450,21 @@ pub(crate) mod disk_usage_based_eviction {
)
.unwrap();
/*BEGIN_HADRON */
let bytes_evicted = register_int_counter!(
"pageserver_disk_usage_based_eviction_evicted_bytes_total",
"Amount of bytes successfully evicted"
)
.unwrap();
/*END_HADRON */
Self {
tenant_collection_time,
tenant_layer_count,
layers_collected,
layers_selected,
layers_evicted,
bytes_evicted,
}
}
}
@@ -4497,6 +4584,7 @@ pub fn preinitialize_metrics(
&CIRCUIT_BREAKERS_UNBROKEN,
&PAGE_SERVICE_SMGR_FLUSH_INPROGRESS_MICROS_GLOBAL,
&WAIT_LSN_IN_PROGRESS_GLOBAL_MICROS,
&MISROUTED_PAGESTREAM_REQUESTS,
]
.into_iter()
.for_each(|c| {
@@ -4534,6 +4622,7 @@ pub fn preinitialize_metrics(
// gauges
WALRECEIVER_ACTIVE_MANAGERS.get();
LOCAL_DATA_LOSS_SUSPECTED.get();
// histograms
[


@@ -70,7 +70,7 @@ use crate::context::{
};
use crate::metrics::{
self, COMPUTE_COMMANDS_COUNTERS, ComputeCommandKind, GetPageBatchBreakReason, LIVE_CONNECTIONS,
SmgrOpTimer, TimelineMetrics,
MISROUTED_PAGESTREAM_REQUESTS, SmgrOpTimer, TimelineMetrics,
};
use crate::pgdatadir_mapping::{LsnRange, Version};
use crate::span::{
@@ -91,7 +91,8 @@ use crate::{CancellableTask, PERF_TRACE_TARGET, timed_after_cancellation};
/// is not yet in state [`TenantState::Active`].
///
/// NB: this is a different value than [`crate::http::routes::ACTIVE_TENANT_TIMEOUT`].
const ACTIVE_TENANT_TIMEOUT: Duration = Duration::from_millis(30000);
/// HADRON: reduced timeout and we will retry in Cache::get().
const ACTIVE_TENANT_TIMEOUT: Duration = Duration::from_millis(5000);
/// Threshold at which to log slow GetPage requests.
const LOG_SLOW_GETPAGE_THRESHOLD: Duration = Duration::from_secs(30);
@@ -1128,6 +1129,7 @@ impl PageServerHandler {
// Closing the connection by returning ``::Reconnect` has the side effect of rate-limiting above message, via
// client's reconnect backoff, as well as hopefully prompting the client to load its updated configuration
// and talk to a different pageserver.
MISROUTED_PAGESTREAM_REQUESTS.inc();
return respond_error!(
span,
PageStreamError::Reconnect(


@@ -142,6 +142,9 @@ mod gc_block;
mod gc_result;
pub(crate) mod throttle;
#[cfg(test)]
pub mod debug;
pub(crate) use timeline::{LogicalSizeCalculationCause, PageReconstructError, Timeline};
pub(crate) use crate::span::debug_assert_current_span_has_tenant_and_timeline_id;
@@ -6015,12 +6018,11 @@ pub(crate) mod harness {
}
#[instrument(skip_all, fields(tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug()))]
pub(crate) async fn do_try_load(
pub(crate) async fn do_try_load_with_redo(
&self,
walredo_mgr: Arc<WalRedoManager>,
ctx: &RequestContext,
) -> anyhow::Result<Arc<TenantShard>> {
let walredo_mgr = Arc::new(WalRedoManager::from(TestRedoManager));
let (basebackup_cache, _) = BasebackupCache::new(Utf8PathBuf::new(), None);
let tenant = Arc::new(TenantShard::new(
@@ -6058,6 +6060,14 @@ pub(crate) mod harness {
Ok(tenant)
}
pub(crate) async fn do_try_load(
&self,
ctx: &RequestContext,
) -> anyhow::Result<Arc<TenantShard>> {
let walredo_mgr = Arc::new(WalRedoManager::from(TestRedoManager));
self.do_try_load_with_redo(walredo_mgr, ctx).await
}
pub fn timeline_path(&self, timeline_id: &TimelineId) -> Utf8PathBuf {
self.conf.timeline_path(&self.tenant_shard_id, timeline_id)
}


@@ -0,0 +1,366 @@
use std::{ops::Range, str::FromStr, sync::Arc};
use crate::walredo::RedoAttemptType;
use base64::{Engine as _, engine::general_purpose::STANDARD};
use bytes::{Bytes, BytesMut};
use camino::Utf8PathBuf;
use clap::Parser;
use itertools::Itertools;
use pageserver_api::{
key::Key,
keyspace::KeySpace,
shard::{ShardIdentity, ShardStripeSize},
};
use postgres_ffi::PgMajorVersion;
use postgres_ffi::{BLCKSZ, page_is_new, page_set_lsn};
use tracing::Instrument;
use utils::{
generation::Generation,
id::{TenantId, TimelineId},
lsn::Lsn,
shard::{ShardCount, ShardIndex, ShardNumber},
};
use wal_decoder::models::record::NeonWalRecord;
use crate::{
context::{DownloadBehavior, RequestContext},
task_mgr::TaskKind,
tenant::storage_layer::ValueReconstructState,
walredo::harness::RedoHarness,
};
use super::{
WalRedoManager, WalredoManagerId,
harness::TenantHarness,
remote_timeline_client::LayerFileMetadata,
storage_layer::{AsLayerDesc, IoConcurrency, Layer, LayerName, ValuesReconstructState},
};
fn process_page_image(next_record_lsn: Lsn, is_fpw: bool, img_bytes: Bytes) -> Bytes {
// To match the logic in libs/wal_decoder/src/serialized_batch.rs
let mut new_image: BytesMut = img_bytes.into();
if is_fpw && !page_is_new(&new_image) {
page_set_lsn(&mut new_image, next_record_lsn);
}
assert_eq!(new_image.len(), BLCKSZ as usize);
new_image.freeze()
}
async fn redo_wals(input: &str, key: Key) -> anyhow::Result<()> {
let tenant_id = TenantId::generate();
let timeline_id = TimelineId::generate();
let redo_harness = RedoHarness::new()?;
let span = redo_harness.span();
let tenant_conf = pageserver_api::models::TenantConfig {
..Default::default()
};
let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
let tenant = TenantHarness::create_custom(
"search_key",
tenant_conf,
tenant_id,
ShardIdentity::unsharded(),
Generation::new(1),
)
.await?
.do_try_load_with_redo(
Arc::new(WalRedoManager::Prod(
WalredoManagerId::next(),
redo_harness.manager,
)),
&ctx,
)
.await
.unwrap();
let timeline = tenant
.create_test_timeline(timeline_id, Lsn(0x10), PgMajorVersion::PG16, &ctx)
.await?;
let contents = tokio::fs::read_to_string(input)
.await
.map_err(|e| anyhow::Error::msg(format!("Failed to read input file {input}: {e}")))
.unwrap();
let lines = contents.lines();
let mut last_wal_lsn: Option<Lsn> = None;
let state = {
let mut state = ValueReconstructState::default();
let mut is_fpw = false;
let mut is_first_line = true;
for line in lines {
if is_first_line {
is_first_line = false;
if line.trim() == "FPW" {
is_fpw = true;
}
continue; // Skip the first line.
}
// Each input line is in the "<next_record_lsn>,<base64>" format.
let (lsn_str, payload_b64) = line
.split_once(',')
.expect("Invalid input format: expected '<lsn>,<base64>'");
// Parse the LSN and decode the payload.
let lsn = Lsn::from_str(lsn_str.trim()).expect("Invalid LSN format");
let bytes = Bytes::from(
STANDARD
.decode(payload_b64.trim())
.expect("Invalid base64 payload"),
);
// The first line is considered the base image, the rest are WAL records.
if state.img.is_none() {
state.img = Some((lsn, process_page_image(lsn, is_fpw, bytes)));
} else {
let wal_record = NeonWalRecord::Postgres {
will_init: false,
rec: bytes,
};
state.records.push((lsn, wal_record));
last_wal_lsn.replace(lsn);
}
}
state
};
assert!(state.img.is_some(), "No base image found");
assert!(!state.records.is_empty(), "No WAL records found");
let result = timeline
.reconstruct_value(key, last_wal_lsn.unwrap(), state, RedoAttemptType::ReadPage)
.instrument(span.clone())
.await?;
eprintln!("final image: {:?}", STANDARD.encode(result));
Ok(())
}
async fn search_key(
tenant_id: TenantId,
timeline_id: TimelineId,
dir: String,
key: Key,
lsn: Lsn,
) -> anyhow::Result<()> {
let shard_index = ShardIndex {
shard_number: ShardNumber(0),
shard_count: ShardCount(4),
};
let redo_harness = RedoHarness::new()?;
let span = redo_harness.span();
let tenant_conf = pageserver_api::models::TenantConfig {
..Default::default()
};
let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
let tenant = TenantHarness::create_custom(
"search_key",
tenant_conf,
tenant_id,
ShardIdentity::new(
shard_index.shard_number,
shard_index.shard_count,
ShardStripeSize(32768),
)
.unwrap(),
Generation::new(1),
)
.await?
.do_try_load_with_redo(
Arc::new(WalRedoManager::Prod(
WalredoManagerId::next(),
redo_harness.manager,
)),
&ctx,
)
.await
.unwrap();
let timeline = tenant
.create_test_timeline(timeline_id, Lsn(0x10), PgMajorVersion::PG16, &ctx)
.await?;
let mut delta_layers: Vec<Layer> = Vec::new();
let mut img_layer: Option<Layer> = Option::None;
let mut dir = tokio::fs::read_dir(dir).await?;
loop {
let entry = dir.next_entry().await?;
if entry.is_none() || !entry.as_ref().unwrap().file_type().await?.is_file() {
break;
}
let path = Utf8PathBuf::from_path_buf(entry.unwrap().path()).unwrap();
let layer_name = match LayerName::from_str(path.file_name().unwrap()) {
Ok(name) => name,
Err(_) => {
eprintln!("Skipped invalid layer: {path}");
continue;
}
};
let layer = Layer::for_resident(
tenant.conf,
&timeline,
path.clone(),
layer_name,
LayerFileMetadata::new(
tokio::fs::metadata(path.clone()).await?.len(),
Generation::new(1),
shard_index,
),
);
if layer.layer_desc().is_delta() {
delta_layers.push(layer.into());
} else if img_layer.is_none() {
img_layer = Some(layer.into());
} else {
anyhow::bail!("Found multiple image layers");
}
}
// sort delta layers based on the descending order of LSN
delta_layers.sort_by(|a, b| {
b.layer_desc()
.get_lsn_range()
.start
.cmp(&a.layer_desc().get_lsn_range().start)
});
let mut state = ValuesReconstructState::new(IoConcurrency::Sequential);
let key_space = KeySpace::single(Range {
start: key,
end: key.next(),
});
let lsn_range = Range {
start: img_layer
.as_ref()
.map_or(Lsn(0x00), |img| img.layer_desc().image_layer_lsn()),
end: lsn,
};
for delta_layer in delta_layers.iter() {
delta_layer
.get_values_reconstruct_data(key_space.clone(), lsn_range.clone(), &mut state, &ctx)
.await?;
}
img_layer
.as_ref()
.unwrap()
.get_values_reconstruct_data(key_space.clone(), lsn_range.clone(), &mut state, &ctx)
.await?;
for (_key, result) in std::mem::take(&mut state.keys) {
let state = result.collect_pending_ios().await?;
if state.img.is_some() {
eprintln!(
"image: {}: {:x?}",
state.img.as_ref().unwrap().0,
STANDARD.encode(state.img.as_ref().unwrap().1.clone())
);
}
for delta in state.records.iter() {
match &delta.1 {
NeonWalRecord::Postgres { will_init, rec } => {
eprintln!(
"delta: {}: will_init: {}, {:x?}",
delta.0,
will_init,
STANDARD.encode(rec)
);
}
_ => {
eprintln!("delta: {}: {:x?}", delta.0, delta.1);
}
}
}
let result = timeline
.reconstruct_value(key, lsn_range.end, state, RedoAttemptType::ReadPage)
.instrument(span.clone())
.await?;
eprintln!("final image: {lsn} : {result:?}");
}
Ok(())
}
/// Redo all WALs against the base image in the input file. Return the base64 encoded final image.
/// Each line in the input file must be in the form "<lsn>,<base64>" where:
/// * `<lsn>` is a PostgreSQL LSN in hexadecimal notation, e.g. `0/16ABCDE`.
/// * `<base64>` is the base64encoded page image (first line) or WAL record (subsequent lines).
///
/// The first line provides the base image of a page. The LSN is the LSN of "next record" following
/// the record containing the FPI. For example, if the FPI was extracted from a WAL record occuping
/// [0/1, 0/200) in the WAL stream, the LSN appearing along side the page image here should be 0/200.
///
/// The subsequent lines are WAL records, ordered from the oldest to the newest. The LSN is the
/// record LSN of the WAL record, not the "next record" LSN. For example, if the WAL record here
/// occupies [0/1, 0/200) in the WAL stream, the LSN appearing along side the WAL record here should
/// be 0/1.
#[derive(Parser)]
struct RedoWalsCmd {
#[clap(long)]
input: String,
#[clap(long)]
key: String,
}
#[tokio::test]
async fn test_redo_wals() -> anyhow::Result<()> {
let args = std::env::args().collect_vec();
let pos = args
.iter()
.position(|arg| arg == "--")
.unwrap_or(args.len());
let slice = &args[pos..args.len()];
let cmd = match RedoWalsCmd::try_parse_from(slice) {
Ok(cmd) => cmd,
Err(err) => {
eprintln!("{err}");
return Ok(());
}
};
let key = Key::from_hex(&cmd.key).unwrap();
redo_wals(&cmd.input, key).await?;
Ok(())
}
/// Search for a page at the given LSN in all layers of the data_dir.
/// Return the base64-encoded image and all WAL records, as well as the final reconstructed image.
#[derive(Parser)]
struct SearchKeyCmd {
#[clap(long)]
tenant_id: String,
#[clap(long)]
timeline_id: String,
#[clap(long)]
data_dir: String,
#[clap(long)]
key: String,
#[clap(long)]
lsn: String,
}
#[tokio::test]
async fn test_search_key() -> anyhow::Result<()> {
let args = std::env::args().collect_vec();
let pos = args
.iter()
.position(|arg| arg == "--")
.unwrap_or(args.len());
let slice = &args[pos..args.len()];
let cmd = match SearchKeyCmd::try_parse_from(slice) {
Ok(cmd) => cmd,
Err(err) => {
eprintln!("{err}");
return Ok(());
}
};
let tenant_id = TenantId::from_str(&cmd.tenant_id).unwrap();
let timeline_id = TimelineId::from_str(&cmd.timeline_id).unwrap();
let key = Key::from_hex(&cmd.key).unwrap();
let lsn = Lsn::from_str(&cmd.lsn).unwrap();
search_key(tenant_id, timeline_id, cmd.data_dir, key, lsn).await?;
Ok(())
}


@@ -43,7 +43,7 @@ use crate::controller_upcall_client::{
};
use crate::deletion_queue::DeletionQueueClient;
use crate::http::routes::ACTIVE_TENANT_TIMEOUT;
use crate::metrics::{TENANT, TENANT_MANAGER as METRICS};
use crate::metrics::{LOCAL_DATA_LOSS_SUSPECTED, TENANT, TENANT_MANAGER as METRICS};
use crate::task_mgr::{BACKGROUND_RUNTIME, TaskKind};
use crate::tenant::config::{
AttachedLocationConfig, AttachmentMode, LocationConf, LocationMode, SecondaryLocationConfig,
@@ -538,6 +538,21 @@ pub async fn init_tenant_mgr(
// Determine which tenants are to be secondary or attached, and in which generation
let tenant_modes = init_load_generations(conf, &tenant_configs, resources, cancel).await?;
// Hadron local SSD check: Raise an alert if our local filesystem does not contain any tenants but the re-attach request returned tenants.
// This can happen if the PS suffered a Kubernetes node failure resulting in loss of all local data, but recovered quickly on another node
// so the Storage Controller has not had the time to move tenants out.
let data_loss_suspected = if let Some(tenant_modes) = &tenant_modes {
tenant_configs.is_empty() && !tenant_modes.is_empty()
} else {
false
};
if data_loss_suspected {
tracing::error!(
"Local data loss suspected: no tenants found on local filesystem, but re-attach request returned tenants"
);
}
LOCAL_DATA_LOSS_SUSPECTED.set(if data_loss_suspected { 1 } else { 0 });
tracing::info!(
"Attaching {} tenants at startup, warming up {} at a time",
tenant_configs.len(),


@@ -141,11 +141,29 @@ pub(super) async fn upload_timeline_layer<'a>(
let fs_size = usize::try_from(fs_size)
.with_context(|| format!("convert {local_path:?} size {fs_size} usize"))?;
/* BEGIN_HADRON */
let mut metadata = None;
match storage {
// Pass the file path as a storage metadata to minimize changes to neon.
// Otherwise, we need to change the upload interface.
GenericRemoteStorage::AzureBlob(s) => {
let block_size_mb = s.put_block_size_mb.unwrap_or(0);
if block_size_mb > 0 && fs_size > block_size_mb * 1024 * 1024 {
metadata = Some(remote_storage::StorageMetadata::from([(
"databricks_azure_put_block",
local_path.as_str(),
)]));
}
}
GenericRemoteStorage::LocalFs(_) => {}
GenericRemoteStorage::AwsS3(_) => {}
GenericRemoteStorage::Unreliable(_) => {}
};
/* END_HADRON */
let reader = tokio_util::io::ReaderStream::with_capacity(source_file, super::BUFFER_SIZE);
storage
.upload(reader, fs_size, remote_path, None, cancel)
.upload(reader, fs_size, remote_path, metadata, cancel)
.await
.with_context(|| format!("upload layer from local path '{local_path}'"))
}


@@ -34,6 +34,21 @@ use crate::virtual_file::owned_buffers_io::write::FlushTaskError;
/// We use 3/4 Tokio threads, to avoid blocking all threads in case we do any CPU-heavy work.
static CONCURRENT_BACKGROUND_TASKS: Lazy<Semaphore> = Lazy::new(|| {
let total_threads = TOKIO_WORKER_THREADS.get();
/*BEGIN_HADRON*/
// ideally we should run at least one compaction task per tenant in order to (1) maximize
// compaction throughput (2) avoid head-of-line blocking of large compactions. However doing
// that may create too many compaction tasks with lots of memory overheads. So we limit the
// number of compaction tasks based on the available CPU core count.
// Need to revisit.
// let tasks_per_thread = std::env::var("BG_TASKS_PER_THREAD")
// .ok()
// .and_then(|s| s.parse().ok())
// .unwrap_or(4);
// let permits = usize::max(1, total_threads * tasks_per_thread);
// // assert!(permits < total_threads, "need threads for other work");
/*END_HADRON*/
let permits = max(1, (total_threads * 3).checked_div(4).unwrap_or(0));
assert_ne!(permits, 0, "we will not be adding in permits later");
assert!(permits < total_threads, "need threads for other work");


@@ -6742,7 +6742,7 @@ impl Timeline {
}
/// Reconstruct a value, using the given base image and WAL records in 'data'.
async fn reconstruct_value(
pub(crate) async fn reconstruct_value(
&self,
key: Key,
request_lsn: Lsn,


@@ -212,8 +212,12 @@
//! to the parent shard during a shard split. Eventually, the shard split task will
//! shut down the parent => case (1).
use std::collections::{HashMap, hash_map};
use std::sync::{Arc, Mutex, Weak};
use std::collections::HashMap;
use std::collections::hash_map;
use std::sync::Arc;
use std::sync::Mutex;
use std::sync::Weak;
use std::time::Duration;
use pageserver_api::shard::ShardIdentity;
use tracing::{instrument, trace};
@@ -333,6 +337,44 @@ enum RoutingResult<T: Types> {
}
impl<T: Types> Cache<T> {
/* BEGIN_HADRON */
/// A wrapper of do_get to resolve the tenant shard for a get page request.
#[instrument(level = "trace", skip_all)]
pub(crate) async fn get(
&mut self,
timeline_id: TimelineId,
shard_selector: ShardSelector,
tenant_manager: &T::TenantManager,
) -> Result<Handle<T>, GetError<T>> {
const GET_MAX_RETRIES: usize = 10;
const RETRY_BACKOFF: Duration = Duration::from_millis(100);
let mut attempt = 0;
loop {
attempt += 1;
match self
.do_get(timeline_id, shard_selector, tenant_manager)
.await
{
Ok(handle) => return Ok(handle),
Err(e) => {
// Retry on tenant manager error to handle tenant split more gracefully
if attempt < GET_MAX_RETRIES {
tracing::warn!(
"Fail to resolve tenant shard in attempt {}: {:?}. Retrying...",
attempt,
e
);
tokio::time::sleep(RETRY_BACKOFF).await;
continue;
} else {
return Err(e);
}
}
}
}
}
/* END_HADRON */
/// See module-level comment for details.
///
/// Does NOT check for the shutdown state of [`Types::Timeline`].
@@ -341,7 +383,7 @@ impl<T: Types> Cache<T> {
/// and if so, return an error that causes the page service to
/// close the connection.
#[instrument(level = "trace", skip_all)]
pub(crate) async fn get(
async fn do_get(
&mut self,
timeline_id: TimelineId,
shard_selector: ShardSelector,
@@ -879,6 +921,7 @@ mod tests {
.await
.err()
.expect("documented behavior: can't get new handle after shutdown");
assert_eq!(cache.map.len(), 1, "next access cleans up the cache");
cache


@@ -566,22 +566,55 @@ impl PostgresRedoManager {
}
}
#[cfg(test)]
pub(crate) mod harness {
use super::PostgresRedoManager;
use crate::config::PageServerConf;
use utils::{id::TenantId, shard::TenantShardId};
pub struct RedoHarness {
// underscored because unused, except for removal at drop
_repo_dir: camino_tempfile::Utf8TempDir,
pub manager: PostgresRedoManager,
tenant_shard_id: TenantShardId,
}
impl RedoHarness {
pub fn new() -> anyhow::Result<Self> {
crate::tenant::harness::setup_logging();
let repo_dir = camino_tempfile::tempdir()?;
let conf = PageServerConf::dummy_conf(repo_dir.path().to_path_buf());
let conf = Box::leak(Box::new(conf));
let tenant_shard_id = TenantShardId::unsharded(TenantId::generate());
let manager = PostgresRedoManager::new(conf, tenant_shard_id);
Ok(RedoHarness {
_repo_dir: repo_dir,
manager,
tenant_shard_id,
})
}
pub fn span(&self) -> tracing::Span {
tracing::info_span!("RedoHarness", tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug())
}
}
}
#[cfg(test)]
mod tests {
use std::str::FromStr;
use bytes::Bytes;
use pageserver_api::key::Key;
use pageserver_api::shard::TenantShardId;
use postgres_ffi::PgMajorVersion;
use tracing::Instrument;
use utils::id::TenantId;
use utils::lsn::Lsn;
use wal_decoder::models::record::NeonWalRecord;
use super::PostgresRedoManager;
use crate::config::PageServerConf;
use crate::walredo::RedoAttemptType;
use crate::walredo::harness::RedoHarness;
#[tokio::test]
async fn test_ping() {
@@ -692,33 +725,4 @@ mod tests {
)
]
}
struct RedoHarness {
// underscored because unused, except for removal at drop
_repo_dir: camino_tempfile::Utf8TempDir,
manager: PostgresRedoManager,
tenant_shard_id: TenantShardId,
}
impl RedoHarness {
fn new() -> anyhow::Result<Self> {
crate::tenant::harness::setup_logging();
let repo_dir = camino_tempfile::tempdir()?;
let conf = PageServerConf::dummy_conf(repo_dir.path().to_path_buf());
let conf = Box::leak(Box::new(conf));
let tenant_shard_id = TenantShardId::unsharded(TenantId::generate());
let manager = PostgresRedoManager::new(conf, tenant_shard_id);
Ok(RedoHarness {
_repo_dir: repo_dir,
manager,
tenant_shard_id,
})
}
fn span(&self) -> tracing::Span {
tracing::info_span!("RedoHarness", tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug())
}
}
}


@@ -159,6 +159,9 @@ PAGESERVER_GLOBAL_METRICS: tuple[str, ...] = (
)
PAGESERVER_PER_TENANT_METRICS: tuple[str, ...] = (
# BEGIN_HADRON
"pageserver_active_storage_operations_count",
# END_HADRON
"pageserver_current_logical_size",
"pageserver_resident_physical_size",
"pageserver_io_operations_bytes_total",


@@ -111,6 +111,14 @@ DEFAULT_PAGESERVER_ALLOWED_ERRORS = (
".*stalling layer flushes for compaction backpressure.*",
".*layer roll waiting for flush due to compaction backpressure.*",
".*BatchSpanProcessor.*",
# Can happen in tests that purposely wipe pageserver "local disk" data.
".*Local data loss suspected.*",
# Too many frozen layers error is normal during intensive benchmarks
".*too many frozen layers.*",
# Transient errors when resolving tenant shards by page service
".*Fail to resolve tenant shard in attempt.*",
# Expected warnings when pageserver has not refreshed GC info yet
".*pitr LSN/interval not found, skipping force image creation LSN calculation.*",
".*No broker updates received for a while.*",
*(
[


@@ -1,8 +1,11 @@
from __future__ import annotations
import os
import random
import threading
import time
from collections import defaultdict
from threading import Event
from typing import TYPE_CHECKING, Any
import pytest
@@ -1505,6 +1508,171 @@ def test_sharding_split_failures(
env.storage_controller.consistency_check()
@pytest.mark.skip(reason="The backpressure change has not been merged yet.")
def test_back_pressure_during_split(neon_env_builder: NeonEnvBuilder):
"""
Test backpressure can ignore new shards during tenant split so that if we abort the split,
PG can continue without being blocked.
"""
DBNAME = "regression"
init_shard_count = 4
neon_env_builder.num_pageservers = init_shard_count
stripe_size = 32
env = neon_env_builder.init_start(
initial_tenant_shard_count=init_shard_count, initial_tenant_shard_stripe_size=stripe_size
)
env.storage_controller.allowed_errors.extend(
[
# All split failures log a warning when then enqueue the abort operation
".*Enqueuing background abort.*",
# Tolerate any error lots that mention a failpoint
".*failpoint.*",
]
)
endpoint = env.endpoints.create(
"main",
config_lines=[
"max_replication_write_lag = 1MB",
"databricks.max_wal_mb_per_second = 1",
"neon.max_cluster_size = 10GB",
],
)
endpoint.respec(skip_pg_catalog_updates=False) # Needed for databricks_system to get created.
endpoint.start()
endpoint.safe_psql(f"CREATE DATABASE {DBNAME}")
endpoint.safe_psql("CREATE TABLE usertable ( YCSB_KEY INT, FIELD0 TEXT);")
write_done = Event()
def write_data(write_done):
while not write_done.is_set():
endpoint.safe_psql(
"INSERT INTO usertable SELECT random(), repeat('a', 1000);", log_query=False
)
log.info("write_data thread exiting")
writer_thread = threading.Thread(target=write_data, args=(write_done,))
writer_thread.start()
env.storage_controller.configure_failpoints(("shard-split-pre-complete", "return(1)"))
# split the tenant
with pytest.raises(StorageControllerApiException):
env.storage_controller.tenant_shard_split(env.initial_tenant, shard_count=16)
write_done.set()
writer_thread.join()
# writing more data to page servers after split is aborted
for _i in range(5000):
endpoint.safe_psql(
"INSERT INTO usertable SELECT random(), repeat('a', 1000);", log_query=False
)
# wait until write lag becomes 0
def check_write_lag_is_zero():
res = endpoint.safe_psql(
"""
SELECT
pg_wal_lsn_diff(pg_current_wal_flush_lsn(), received_lsn) as received_lsn_lag
FROM neon.backpressure_lsns();
""",
dbname="databricks_system",
log_query=False,
)
log.info(f"received_lsn_lag = {res[0][0]}")
assert res[0][0] == 0
wait_until(check_write_lag_is_zero)
endpoint.stop_and_destroy()
# BEGIN_HADRON
def test_shard_resolve_during_split_abort(neon_env_builder: NeonEnvBuilder):
"""
Tests that page service is able to resolve the correct shard during tenant split without causing query errors
"""
DBNAME = "regression"
WORKER_THREADS = 16
ROW_COUNT = 10000
init_shard_count = 4
neon_env_builder.num_pageservers = 1
stripe_size = 16
env = neon_env_builder.init_start(
initial_tenant_shard_count=init_shard_count, initial_tenant_shard_stripe_size=stripe_size
)
env.storage_controller.allowed_errors.extend(
[
# All split failures log a warning when then enqueue the abort operation
".*Enqueuing background abort.*",
# Tolerate any error lots that mention a failpoint
".*failpoint.*",
]
)
endpoint = env.endpoints.create("main")
endpoint.respec(skip_pg_catalog_updates=False) # Needed for databricks_system to get created.
endpoint.start()
endpoint.safe_psql(f"CREATE DATABASE {DBNAME}")
# generate 10MB of data
endpoint.safe_psql(
f"CREATE TABLE usertable AS SELECT s AS KEY, repeat('a', 1000) as VALUE from generate_series(1, {ROW_COUNT}) s;"
)
read_done = Event()
def read_data(read_done):
i = 0
while not read_done.is_set() or i < 10:
endpoint.safe_psql(
f"SELECT * FROM usertable where KEY = {random.randint(1, ROW_COUNT)}",
log_query=False,
)
i += 1
log.info(f"read_data thread exiting. Executed {i} queries.")
reader_threads = []
for _i in range(WORKER_THREADS):
reader_thread = threading.Thread(target=read_data, args=(read_done,))
reader_thread.start()
reader_threads.append(reader_thread)
env.storage_controller.configure_failpoints(("shard-split-pre-complete", "return(1)"))
# split the tenant
with pytest.raises(StorageControllerApiException):
env.storage_controller.tenant_shard_split(env.initial_tenant, shard_count=16)
# wait until abort is done
def check_tenant_status():
active_count = 0
for i in range(init_shard_count):
status = env.pageserver.http_client().tenant_status(
TenantShardId(env.initial_tenant, i, init_shard_count)
)
if status["state"]["slug"] == "Active":
active_count += 1
assert active_count == 4
wait_until(check_tenant_status)
read_done.set()
for thread in reader_threads:
thread.join()
endpoint.stop()
# END_HADRON
def test_sharding_backpressure(neon_env_builder: NeonEnvBuilder):
"""
Check a scenario when one of the shards is much slower than others.