Files
neon/pageserver/src/tenant/debug.rs
HaoyuHuang 3dad4698ec PS changes #1 (#12467)
# TLDR
All changes are no-op except 
1. publishing additional metrics. 
2. problem VI

## Problem I

It has come to my attention that the Neon Storage Controller doesn't
correctly update its "observed" state of tenants previously associated
with PSs that has come back up after a local data loss. It would still
think that the old tenants are still attached to page servers and won't
ask more questions. The pageserver has enough information from the
reattach request/response to tell that something is wrong, but it
doesn't do anything about it either. We need to detect this situation in
production while I work on a fix.

(I think there is just some misunderstanding about how Neon manages
their pageserver deployments which got me confused about all the
invariants.)

## Summary of changes I

Added a `pageserver_local_data_loss_suspected` gauge metric that will be
set to 1 if we detect a problematic situation from the reattch response.
The problematic situation is when the PS doesn't have any local tenants
but received a reattach response containing tenants.

We can set up an alert using this metric. The alert should be raised
whenever this metric reports non-zero number.

Also added a HTTP PUT
`http://pageserver/hadron-internal/reset_alert_gauges` API on the
pageserver that can be used to reset the gauge and the alert once we
manually rectify the situation (by restarting the HCC).

## Problem II
Azure upload is 3x slower than AWS. -> 3x slower ingestion. 

The reason for the slower upload is that Azure upload in page server is
much slower => higher flush latency => higher disk consistent LSN =>
higher back pressure.

## Summary of changes II
Use Azure put_block API to uploads a 1 GB layer file in 8 blocks in
parallel.

I set the put_block block size to be 128 MB by default in azure config. 

To minimize neon changes, upload function passes the layer file path to
the azure upload code through the storage metadata. This allows the
azure put block to use FileChunkStreamRead to stream read from one
partition in the file instead of loading all file data in memory and
split it into 8 128 MB chunks.

## How is this tested? II
1. rust test_real_azure tests the put_block change. 
3. I deployed the change in azure dev and saw flush latency reduces from
~30 seconds to 10 seconds.
4. I also did a bunch of stress test using sqlsmith and 100 GB TPCDS
runs.

## Problem III
Currently Neon limits the compaction tasks as 3/4 * CPU cores. This
limits the overall compaction throughput and it can easily cause
head-of-the-line blocking problems when a few large tenants are
compacting.

## Summary of changes III
This PR increases the limit of compaction tasks as `BG_TASKS_PER_THREAD`
(default 4) * CPU cores. Note that `CONCURRENT_BACKGROUND_TASKS` also
limits some other tasks `logical_size_calculation` and `layer eviction`
. But compaction should be the most frequent and time-consuming task.

## Summary of changes IV
This PR adds the following PageServer metrics:
1. `pageserver_disk_usage_based_eviction_evicted_bytes_total`: captures
the total amount of bytes evicted. It's more straightforward to see the
bytes directly instead of layers.
2. `pageserver_active_storage_operations_count`: captures the active
storage operation, e.g., flush, L0 compaction, image creation etc. It's
useful to visualize these active operations to get a better idea of what
PageServers are spending cycles on in the background.

## Summary of changes V
When investigating data corruptions, it's useful to search the base
image and all WAL records of a page up to an LSN, i.e., a breakdown of
GetPage@LSN request. This PR implements this functionality with two
tools:

1. Extended `pagectl` with a new command to search the layer files for a
given key up to a given LSN from the `index_part.json` file. The output
can be used to download the files from S3 and then search the file
contents using the second tool.
Example usage:
```
cargo run --bin pagectl index-part search --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --path ~/Downloads/corruption/index_part.json-0000000c-formatted --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8
```
Example output:
```
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008028000002FEFF__000007089F0B5381-0000070C7679EEB9-0000000c
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000000000000000000000000000000000-000000067F0000801400008028000002F3F1__000006DD95B6F609-000006E2BA14C369-0000000c
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F000080140000802100001B0973__000006D33429F539-000006DD95B6F609-0000000c
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000164D81__000006C6343B2D31-000006D33429F539-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008021000017687B__000006BA344FA7F1-000006C6343B2D31-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000165BAB__000006AD34613D19-000006BA344FA7F1-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000137A39__0000069F34773461-000006AD34613D19-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F000080140000802100000D4000-000000067F000080140000802100000F0000__0000069F34773460-0000000b
```

2. Added a unit test to search the layer file contents. It's not
implemented part of `pagectl` because it depends on some test harness
code, which can only be used by unit tests.

Example usage:
```
cargo test --package pageserver --lib -- tenant::debug::test_search_key --exact --nocapture -- --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --data-dir /Users/chen.luo/Downloads/corruption --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8
```
Example output:
```
# omitted image for brievity
delta: 69F/769D8180: will_init: false, "OgAAALGkuwXwYp12nwYAAECGAAASIqLHAAAAAH8GAAAUgAAAIYAAAL1hDQD/DLGkuwUDAAAAEAAWAA=="
delta: 69F/769CB6D8: will_init: false, "PQAAALGkuwXotZx2nwYAABAJAAAFk7tpACAGAH8GAAAUgAAAIYAAAL1hDQD/CQUAEAASALExuwUBAAAAAA=="
```

## Problem VI
Currently when page service resolves shards from page numbers, it
doesn't fully support the case that the shard could be split in the
middle. This will lead to query failures during the tenant split for
either commit or abort cases (it's mostly for abort).

## Summary of changes VI
This PR adds retry logic in `Cache::get()` to deal with shard resolution
errors more gracefully. Specifically, it'll clear the cache and retry,
instead of failing the query immediately. It also reduces the internal
timeout to make retries faster.

The PR also fixes a very obvious bug in
`TenantManager::resolve_attached_shard` where the code tries to cache
the computed the shard number, but forgot to recompute when the shard
count is different.

---------

Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Vlad Lazar <vlad.lazar@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
2025-07-08 19:43:01 +00:00

367 lines
11 KiB
Rust
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

use std::{ops::Range, str::FromStr, sync::Arc};
use crate::walredo::RedoAttemptType;
use base64::{Engine as _, engine::general_purpose::STANDARD};
use bytes::{Bytes, BytesMut};
use camino::Utf8PathBuf;
use clap::Parser;
use itertools::Itertools;
use pageserver_api::{
key::Key,
keyspace::KeySpace,
shard::{ShardIdentity, ShardStripeSize},
};
use postgres_ffi::PgMajorVersion;
use postgres_ffi::{BLCKSZ, page_is_new, page_set_lsn};
use tracing::Instrument;
use utils::{
generation::Generation,
id::{TenantId, TimelineId},
lsn::Lsn,
shard::{ShardCount, ShardIndex, ShardNumber},
};
use wal_decoder::models::record::NeonWalRecord;
use crate::{
context::{DownloadBehavior, RequestContext},
task_mgr::TaskKind,
tenant::storage_layer::ValueReconstructState,
walredo::harness::RedoHarness,
};
use super::{
WalRedoManager, WalredoManagerId,
harness::TenantHarness,
remote_timeline_client::LayerFileMetadata,
storage_layer::{AsLayerDesc, IoConcurrency, Layer, LayerName, ValuesReconstructState},
};
fn process_page_image(next_record_lsn: Lsn, is_fpw: bool, img_bytes: Bytes) -> Bytes {
// To match the logic in libs/wal_decoder/src/serialized_batch.rs
let mut new_image: BytesMut = img_bytes.into();
if is_fpw && !page_is_new(&new_image) {
page_set_lsn(&mut new_image, next_record_lsn);
}
assert_eq!(new_image.len(), BLCKSZ as usize);
new_image.freeze()
}
async fn redo_wals(input: &str, key: Key) -> anyhow::Result<()> {
let tenant_id = TenantId::generate();
let timeline_id = TimelineId::generate();
let redo_harness = RedoHarness::new()?;
let span = redo_harness.span();
let tenant_conf = pageserver_api::models::TenantConfig {
..Default::default()
};
let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
let tenant = TenantHarness::create_custom(
"search_key",
tenant_conf,
tenant_id,
ShardIdentity::unsharded(),
Generation::new(1),
)
.await?
.do_try_load_with_redo(
Arc::new(WalRedoManager::Prod(
WalredoManagerId::next(),
redo_harness.manager,
)),
&ctx,
)
.await
.unwrap();
let timeline = tenant
.create_test_timeline(timeline_id, Lsn(0x10), PgMajorVersion::PG16, &ctx)
.await?;
let contents = tokio::fs::read_to_string(input)
.await
.map_err(|e| anyhow::Error::msg(format!("Failed to read input file {input}: {e}")))
.unwrap();
let lines = contents.lines();
let mut last_wal_lsn: Option<Lsn> = None;
let state = {
let mut state = ValueReconstructState::default();
let mut is_fpw = false;
let mut is_first_line = true;
for line in lines {
if is_first_line {
is_first_line = false;
if line.trim() == "FPW" {
is_fpw = true;
}
continue; // Skip the first line.
}
// Each input line is in the "<next_record_lsn>,<base64>" format.
let (lsn_str, payload_b64) = line
.split_once(',')
.expect("Invalid input format: expected '<lsn>,<base64>'");
// Parse the LSN and decode the payload.
let lsn = Lsn::from_str(lsn_str.trim()).expect("Invalid LSN format");
let bytes = Bytes::from(
STANDARD
.decode(payload_b64.trim())
.expect("Invalid base64 payload"),
);
// The first line is considered the base image, the rest are WAL records.
if state.img.is_none() {
state.img = Some((lsn, process_page_image(lsn, is_fpw, bytes)));
} else {
let wal_record = NeonWalRecord::Postgres {
will_init: false,
rec: bytes,
};
state.records.push((lsn, wal_record));
last_wal_lsn.replace(lsn);
}
}
state
};
assert!(state.img.is_some(), "No base image found");
assert!(!state.records.is_empty(), "No WAL records found");
let result = timeline
.reconstruct_value(key, last_wal_lsn.unwrap(), state, RedoAttemptType::ReadPage)
.instrument(span.clone())
.await?;
eprintln!("final image: {:?}", STANDARD.encode(result));
Ok(())
}
async fn search_key(
tenant_id: TenantId,
timeline_id: TimelineId,
dir: String,
key: Key,
lsn: Lsn,
) -> anyhow::Result<()> {
let shard_index = ShardIndex {
shard_number: ShardNumber(0),
shard_count: ShardCount(4),
};
let redo_harness = RedoHarness::new()?;
let span = redo_harness.span();
let tenant_conf = pageserver_api::models::TenantConfig {
..Default::default()
};
let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
let tenant = TenantHarness::create_custom(
"search_key",
tenant_conf,
tenant_id,
ShardIdentity::new(
shard_index.shard_number,
shard_index.shard_count,
ShardStripeSize(32768),
)
.unwrap(),
Generation::new(1),
)
.await?
.do_try_load_with_redo(
Arc::new(WalRedoManager::Prod(
WalredoManagerId::next(),
redo_harness.manager,
)),
&ctx,
)
.await
.unwrap();
let timeline = tenant
.create_test_timeline(timeline_id, Lsn(0x10), PgMajorVersion::PG16, &ctx)
.await?;
let mut delta_layers: Vec<Layer> = Vec::new();
let mut img_layer: Option<Layer> = Option::None;
let mut dir = tokio::fs::read_dir(dir).await?;
loop {
let entry = dir.next_entry().await?;
if entry.is_none() || !entry.as_ref().unwrap().file_type().await?.is_file() {
break;
}
let path = Utf8PathBuf::from_path_buf(entry.unwrap().path()).unwrap();
let layer_name = match LayerName::from_str(path.file_name().unwrap()) {
Ok(name) => name,
Err(_) => {
eprintln!("Skipped invalid layer: {path}");
continue;
}
};
let layer = Layer::for_resident(
tenant.conf,
&timeline,
path.clone(),
layer_name,
LayerFileMetadata::new(
tokio::fs::metadata(path.clone()).await?.len(),
Generation::new(1),
shard_index,
),
);
if layer.layer_desc().is_delta() {
delta_layers.push(layer.into());
} else if img_layer.is_none() {
img_layer = Some(layer.into());
} else {
anyhow::bail!("Found multiple image layers");
}
}
// sort delta layers based on the descending order of LSN
delta_layers.sort_by(|a, b| {
b.layer_desc()
.get_lsn_range()
.start
.cmp(&a.layer_desc().get_lsn_range().start)
});
let mut state = ValuesReconstructState::new(IoConcurrency::Sequential);
let key_space = KeySpace::single(Range {
start: key,
end: key.next(),
});
let lsn_range = Range {
start: img_layer
.as_ref()
.map_or(Lsn(0x00), |img| img.layer_desc().image_layer_lsn()),
end: lsn,
};
for delta_layer in delta_layers.iter() {
delta_layer
.get_values_reconstruct_data(key_space.clone(), lsn_range.clone(), &mut state, &ctx)
.await?;
}
img_layer
.as_ref()
.unwrap()
.get_values_reconstruct_data(key_space.clone(), lsn_range.clone(), &mut state, &ctx)
.await?;
for (_key, result) in std::mem::take(&mut state.keys) {
let state = result.collect_pending_ios().await?;
if state.img.is_some() {
eprintln!(
"image: {}: {:x?}",
state.img.as_ref().unwrap().0,
STANDARD.encode(state.img.as_ref().unwrap().1.clone())
);
}
for delta in state.records.iter() {
match &delta.1 {
NeonWalRecord::Postgres { will_init, rec } => {
eprintln!(
"delta: {}: will_init: {}, {:x?}",
delta.0,
will_init,
STANDARD.encode(rec)
);
}
_ => {
eprintln!("delta: {}: {:x?}", delta.0, delta.1);
}
}
}
let result = timeline
.reconstruct_value(key, lsn_range.end, state, RedoAttemptType::ReadPage)
.instrument(span.clone())
.await?;
eprintln!("final image: {lsn} : {result:?}");
}
Ok(())
}
/// Redo all WALs against the base image in the input file. Return the base64 encoded final image.
/// Each line in the input file must be in the form "<lsn>,<base64>" where:
/// * `<lsn>` is a PostgreSQL LSN in hexadecimal notation, e.g. `0/16ABCDE`.
/// * `<base64>` is the base64encoded page image (first line) or WAL record (subsequent lines).
///
/// The first line provides the base image of a page. The LSN is the LSN of "next record" following
/// the record containing the FPI. For example, if the FPI was extracted from a WAL record occuping
/// [0/1, 0/200) in the WAL stream, the LSN appearing along side the page image here should be 0/200.
///
/// The subsequent lines are WAL records, ordered from the oldest to the newest. The LSN is the
/// record LSN of the WAL record, not the "next record" LSN. For example, if the WAL record here
/// occupies [0/1, 0/200) in the WAL stream, the LSN appearing along side the WAL record here should
/// be 0/1.
#[derive(Parser)]
struct RedoWalsCmd {
#[clap(long)]
input: String,
#[clap(long)]
key: String,
}
#[tokio::test]
async fn test_redo_wals() -> anyhow::Result<()> {
let args = std::env::args().collect_vec();
let pos = args
.iter()
.position(|arg| arg == "--")
.unwrap_or(args.len());
let slice = &args[pos..args.len()];
let cmd = match RedoWalsCmd::try_parse_from(slice) {
Ok(cmd) => cmd,
Err(err) => {
eprintln!("{err}");
return Ok(());
}
};
let key = Key::from_hex(&cmd.key).unwrap();
redo_wals(&cmd.input, key).await?;
Ok(())
}
/// Search for a page at the given LSN in all layers of the data_dir.
/// Return the base64-encoded image and all WAL records, as well as the final reconstructed image.
#[derive(Parser)]
struct SearchKeyCmd {
#[clap(long)]
tenant_id: String,
#[clap(long)]
timeline_id: String,
#[clap(long)]
data_dir: String,
#[clap(long)]
key: String,
#[clap(long)]
lsn: String,
}
#[tokio::test]
async fn test_search_key() -> anyhow::Result<()> {
let args = std::env::args().collect_vec();
let pos = args
.iter()
.position(|arg| arg == "--")
.unwrap_or(args.len());
let slice = &args[pos..args.len()];
let cmd = match SearchKeyCmd::try_parse_from(slice) {
Ok(cmd) => cmd,
Err(err) => {
eprintln!("{err}");
return Ok(());
}
};
let tenant_id = TenantId::from_str(&cmd.tenant_id).unwrap();
let timeline_id = TimelineId::from_str(&cmd.timeline_id).unwrap();
let key = Key::from_hex(&cmd.key).unwrap();
let lsn = Lsn::from_str(&cmd.lsn).unwrap();
search_key(tenant_id, timeline_id, cmd.data_dir, key, lsn).await?;
Ok(())
}