mirror of
https://github.com/neondatabase/neon.git
synced 2026-01-10 06:52:55 +00:00
# TLDR

All changes are no-op except:
1. publishing additional metrics.
2. problem VI

## Problem I

It has come to my attention that the Neon Storage Controller doesn't correctly update its "observed" state of tenants previously associated with PSs that have come back up after a local data loss. It would still think that the old tenants are attached to page servers and won't ask more questions. The pageserver has enough information from the reattach request/response to tell that something is wrong, but it doesn't do anything about it either. We need to detect this situation in production while I work on a fix.

(I think there is just some misunderstanding about how Neon manages their pageserver deployments which got me confused about all the invariants.)

## Summary of changes I

Added a `pageserver_local_data_loss_suspected` gauge metric that will be set to 1 if we detect a problematic situation from the reattach response. The problematic situation is when the PS doesn't have any local tenants but received a reattach response containing tenants.

We can set up an alert using this metric. The alert should be raised whenever this metric reports a non-zero number.

Also added an HTTP PUT `http://pageserver/hadron-internal/reset_alert_gauges` API on the pageserver that can be used to reset the gauge and the alert once we manually rectify the situation (by restarting the HCC).

## Problem II

Azure upload is 3x slower than AWS. -> 3x slower ingestion.

The reason for the slower ingestion is that Azure upload in page server is much slower => higher flush latency => higher disk consistent LSN => higher back pressure.

## Summary of changes II

Use the Azure put_block API to upload a 1 GB layer file in 8 blocks in parallel.

I set the put_block block size to be 128 MB by default in the azure config.

To minimize neon changes, the upload function passes the layer file path to the azure upload code through the storage metadata. This allows the azure put block to use FileChunkStreamRead to stream read from one partition in the file instead of loading all file data in memory and splitting it into eight 128 MB chunks.

## How is this tested? II

1. rust test_real_azure tests the put_block change.
3. I deployed the change in azure dev and saw flush latency reduced from ~30 seconds to 10 seconds.
4. I also did a bunch of stress tests using sqlsmith and 100 GB TPCDS runs.

## Problem III

Currently Neon limits the compaction tasks to 3/4 * CPU cores. This limits the overall compaction throughput and it can easily cause head-of-the-line blocking problems when a few large tenants are compacting.

## Summary of changes III

This PR increases the limit of compaction tasks to `BG_TASKS_PER_THREAD` (default 4) * CPU cores. Note that `CONCURRENT_BACKGROUND_TASKS` also limits some other tasks, `logical_size_calculation` and `layer eviction`. But compaction should be the most frequent and time-consuming task.

## Summary of changes IV

This PR adds the following PageServer metrics:
1. `pageserver_disk_usage_based_eviction_evicted_bytes_total`: captures the total amount of bytes evicted. It's more straightforward to see the bytes directly instead of layers.
2. `pageserver_active_storage_operations_count`: captures the active storage operations, e.g., flush, L0 compaction, image creation etc. It's useful to visualize these active operations to get a better idea of what PageServers are spending cycles on in the background.

## Summary of changes V

When investigating data corruptions, it's useful to search the base image and all WAL records of a page up to an LSN, i.e., a breakdown of a GetPage@LSN request. This PR implements this functionality with two tools:

1. Extended `pagectl` with a new command to search the layer files for a given key up to a given LSN from the `index_part.json` file. The output can be used to download the files from S3 and then search the file contents using the second tool.

   Example usage:
   ```
   cargo run --bin pagectl index-part search --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --path ~/Downloads/corruption/index_part.json-0000000c-formatted --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8
   ```

   Example output:
   ```
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008028000002FEFF__000007089F0B5381-0000070C7679EEB9-0000000c
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000000000000000000000000000000000-000000067F0000801400008028000002F3F1__000006DD95B6F609-000006E2BA14C369-0000000c
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F000080140000802100001B0973__000006D33429F539-000006DD95B6F609-0000000c
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000164D81__000006C6343B2D31-000006D33429F539-0000000b
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008021000017687B__000006BA344FA7F1-000006C6343B2D31-0000000b
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000165BAB__000006AD34613D19-000006BA344FA7F1-0000000b
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000137A39__0000069F34773461-000006AD34613D19-0000000b
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F000080140000802100000D4000-000000067F000080140000802100000F0000__0000069F34773460-0000000b
   ```

2. Added a unit test to search the layer file contents. It's not implemented as part of `pagectl` because it depends on some test harness code, which can only be used by unit tests.

   Example usage:
   ```
   cargo test --package pageserver --lib -- tenant::debug::test_search_key --exact --nocapture -- --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --data-dir /Users/chen.luo/Downloads/corruption --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8
   ```

   Example output:
   ```
   # omitted image for brevity
   delta: 69F/769D8180: will_init: false, "OgAAALGkuwXwYp12nwYAAECGAAASIqLHAAAAAH8GAAAUgAAAIYAAAL1hDQD/DLGkuwUDAAAAEAAWAA=="
   delta: 69F/769CB6D8: will_init: false, "PQAAALGkuwXotZx2nwYAABAJAAAFk7tpACAGAH8GAAAUgAAAIYAAAL1hDQD/CQUAEAASALExuwUBAAAAAA=="
   ```

## Problem VI

Currently when page service resolves shards from page numbers, it doesn't fully support the case that the shard could be split in the middle. This will lead to query failures during the tenant split for either commit or abort cases (it's mostly for abort).

## Summary of changes VI

This PR adds retry logic in `Cache::get()` to deal with shard resolution errors more gracefully. Specifically, it'll clear the cache and retry, instead of failing the query immediately. It also reduces the internal timeout to make retries faster.

The PR also fixes a very obvious bug in `TenantManager::resolve_attached_shard` where the code tries to cache the computed shard number, but forgets to recompute it when the shard count is different.

---------

Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Vlad Lazar <vlad.lazar@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
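
The block-wise upload described under "Summary of changes II" comes down to cutting a layer file into fixed-size byte ranges that can each be streamed and uploaded independently. The sketch below only illustrates that arithmetic. It is not the code from this change: `BlockRange` and `block_ranges` are hypothetical names, and it assumes "128 MB" and "1 GB" mean binary units (MiB and GiB).

```rust
/// Half-open byte range `[offset, offset + len)` of one block within a file.
/// Hypothetical helper type, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BlockRange {
    offset: u64,
    len: u64,
}

/// Split a file of `file_size` bytes into consecutive blocks of at most
/// `block_size` bytes. Only the last block may be shorter than `block_size`.
fn block_ranges(file_size: u64, block_size: u64) -> Vec<BlockRange> {
    assert!(block_size > 0, "block size must be non-zero");
    let mut ranges = Vec::new();
    let mut offset = 0;
    while offset < file_size {
        // Clamp the final block to the end of the file.
        let len = block_size.min(file_size - offset);
        ranges.push(BlockRange { offset, len });
        offset += len;
    }
    ranges
}
```

Under these assumptions, `block_ranges(1 << 30, 128 << 20)` yields eight ranges of 128 MiB each. Each range could then be read from the file on its own and handed to a separate upload task, which is what avoids holding the whole layer file in memory.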
367 lines
11 KiB
Rust
use std::{ops::Range, str::FromStr, sync::Arc};

use crate::walredo::RedoAttemptType;
use base64::{Engine as _, engine::general_purpose::STANDARD};
use bytes::{Bytes, BytesMut};
use camino::Utf8PathBuf;
use clap::Parser;
use itertools::Itertools;
use pageserver_api::{
    key::Key,
    keyspace::KeySpace,
    shard::{ShardIdentity, ShardStripeSize},
};
use postgres_ffi::PgMajorVersion;
use postgres_ffi::{BLCKSZ, page_is_new, page_set_lsn};
use tracing::Instrument;
use utils::{
    generation::Generation,
    id::{TenantId, TimelineId},
    lsn::Lsn,
    shard::{ShardCount, ShardIndex, ShardNumber},
};
use wal_decoder::models::record::NeonWalRecord;

use crate::{
    context::{DownloadBehavior, RequestContext},
    task_mgr::TaskKind,
    tenant::storage_layer::ValueReconstructState,
    walredo::harness::RedoHarness,
};

use super::{
    WalRedoManager, WalredoManagerId,
    harness::TenantHarness,
    remote_timeline_client::LayerFileMetadata,
    storage_layer::{AsLayerDesc, IoConcurrency, Layer, LayerName, ValuesReconstructState},
};

fn process_page_image(next_record_lsn: Lsn, is_fpw: bool, img_bytes: Bytes) -> Bytes {
    // To match the logic in libs/wal_decoder/src/serialized_batch.rs
    let mut new_image: BytesMut = img_bytes.into();
    if is_fpw && !page_is_new(&new_image) {
        page_set_lsn(&mut new_image, next_record_lsn);
    }
    assert_eq!(new_image.len(), BLCKSZ as usize);
    new_image.freeze()
}

async fn redo_wals(input: &str, key: Key) -> anyhow::Result<()> {
    let tenant_id = TenantId::generate();
    let timeline_id = TimelineId::generate();
    let redo_harness = RedoHarness::new()?;
    let span = redo_harness.span();
    let tenant_conf = pageserver_api::models::TenantConfig {
        ..Default::default()
    };

    let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
    let tenant = TenantHarness::create_custom(
        "search_key",
        tenant_conf,
        tenant_id,
        ShardIdentity::unsharded(),
        Generation::new(1),
    )
    .await?
    .do_try_load_with_redo(
        Arc::new(WalRedoManager::Prod(
            WalredoManagerId::next(),
            redo_harness.manager,
        )),
        &ctx,
    )
    .await
    .unwrap();
    let timeline = tenant
        .create_test_timeline(timeline_id, Lsn(0x10), PgMajorVersion::PG16, &ctx)
        .await?;
    // Propagate a read failure to the caller instead of panicking.
    let contents = tokio::fs::read_to_string(input)
        .await
        .map_err(|e| anyhow::anyhow!("Failed to read input file {input}: {e}"))?;
    let lines = contents.lines();
    let mut last_wal_lsn: Option<Lsn> = None;
    let state = {
        let mut state = ValueReconstructState::default();
        let mut is_fpw = false;
        let mut is_first_line = true;
        for line in lines {
            if is_first_line {
                // The first line is a header: "FPW" marks the base image as a full-page write.
                is_first_line = false;
                if line.trim() == "FPW" {
                    is_fpw = true;
                }
                continue; // Skip the header line itself.
            }
            // Each remaining input line is in the "<next_record_lsn>,<base64>" format.
            let (lsn_str, payload_b64) = line
                .split_once(',')
                .expect("Invalid input format: expected '<lsn>,<base64>'");

            // Parse the LSN and decode the payload.
            let lsn = Lsn::from_str(lsn_str.trim()).expect("Invalid LSN format");
            let bytes = Bytes::from(
                STANDARD
                    .decode(payload_b64.trim())
                    .expect("Invalid base64 payload"),
            );

            // The first line after the header is the base image, the rest are WAL records.
            if state.img.is_none() {
                state.img = Some((lsn, process_page_image(lsn, is_fpw, bytes)));
            } else {
                let wal_record = NeonWalRecord::Postgres {
                    will_init: false,
                    rec: bytes,
                };
                state.records.push((lsn, wal_record));
                last_wal_lsn = Some(lsn);
            }
        }
        state
    };

    assert!(state.img.is_some(), "No base image found");
    assert!(!state.records.is_empty(), "No WAL records found");
    let result = timeline
        .reconstruct_value(key, last_wal_lsn.unwrap(), state, RedoAttemptType::ReadPage)
        .instrument(span.clone())
        .await?;

    eprintln!("final image: {:?}", STANDARD.encode(result));

    Ok(())
}

async fn search_key(
    tenant_id: TenantId,
    timeline_id: TimelineId,
    dir: String,
    key: Key,
    lsn: Lsn,
) -> anyhow::Result<()> {
    // NB: the shard layout is hard-coded here: shard 0 of 4, with the stripe size set below.
    let shard_index = ShardIndex {
        shard_number: ShardNumber(0),
        shard_count: ShardCount(4),
    };

    let redo_harness = RedoHarness::new()?;
    let span = redo_harness.span();
    let tenant_conf = pageserver_api::models::TenantConfig {
        ..Default::default()
    };
    let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
    let tenant = TenantHarness::create_custom(
        "search_key",
        tenant_conf,
        tenant_id,
        ShardIdentity::new(
            shard_index.shard_number,
            shard_index.shard_count,
            ShardStripeSize(32768),
        )
        .unwrap(),
        Generation::new(1),
    )
    .await?
    .do_try_load_with_redo(
        Arc::new(WalRedoManager::Prod(
            WalredoManagerId::next(),
            redo_harness.manager,
        )),
        &ctx,
    )
    .await
    .unwrap();

    let timeline = tenant
        .create_test_timeline(timeline_id, Lsn(0x10), PgMajorVersion::PG16, &ctx)
        .await?;

    let mut delta_layers: Vec<Layer> = Vec::new();
    let mut img_layer: Option<Layer> = None;
    let mut dir = tokio::fs::read_dir(dir).await?;
    // Iterate until the directory is exhausted. Entries that are not regular files are
    // skipped rather than ending the scan, so they cannot hide layer files listed after them.
    while let Some(entry) = dir.next_entry().await? {
        if !entry.file_type().await?.is_file() {
            continue;
        }
        let path = Utf8PathBuf::from_path_buf(entry.path()).unwrap();
        let layer_name = match LayerName::from_str(path.file_name().unwrap()) {
            Ok(name) => name,
            Err(_) => {
                eprintln!("Skipped invalid layer: {path}");
                continue;
            }
        };
        let layer = Layer::for_resident(
            tenant.conf,
            &timeline,
            path.clone(),
            layer_name,
            LayerFileMetadata::new(
                tokio::fs::metadata(path.clone()).await?.len(),
                Generation::new(1),
                shard_index,
            ),
        );
        if layer.layer_desc().is_delta() {
            delta_layers.push(layer.into());
        } else if img_layer.is_none() {
            img_layer = Some(layer.into());
        } else {
            anyhow::bail!("Found multiple image layers");
        }
    }
    // Sort delta layers in descending order of their start LSN.
    delta_layers.sort_by(|a, b| {
        b.layer_desc()
            .get_lsn_range()
            .start
            .cmp(&a.layer_desc().get_lsn_range().start)
    });

    let mut state = ValuesReconstructState::new(IoConcurrency::Sequential);

    let key_space = KeySpace::single(Range {
        start: key,
        end: key.next(),
    });
    let lsn_range = Range {
        start: img_layer
            .as_ref()
            .map_or(Lsn(0x00), |img| img.layer_desc().image_layer_lsn()),
        end: lsn,
    };
    for delta_layer in delta_layers.iter() {
        delta_layer
            .get_values_reconstruct_data(key_space.clone(), lsn_range.clone(), &mut state, &ctx)
            .await?;
    }

    // The image layer is visited last, after all delta layers.
    img_layer
        .as_ref()
        .expect("No image layer found in the data directory")
        .get_values_reconstruct_data(key_space.clone(), lsn_range.clone(), &mut state, &ctx)
        .await?;

    for (_key, result) in std::mem::take(&mut state.keys) {
        let state = result.collect_pending_ios().await?;
        if let Some((img_lsn, img)) = state.img.as_ref() {
            eprintln!("image: {}: {:x?}", img_lsn, STANDARD.encode(img));
        }
        for (record_lsn, record) in state.records.iter() {
            match record {
                NeonWalRecord::Postgres { will_init, rec } => {
                    eprintln!(
                        "delta: {}: will_init: {}, {:x?}",
                        record_lsn,
                        will_init,
                        STANDARD.encode(rec)
                    );
                }
                _ => {
                    eprintln!("delta: {}: {:x?}", record_lsn, record);
                }
            }
        }

        let result = timeline
            .reconstruct_value(key, lsn_range.end, state, RedoAttemptType::ReadPage)
            .instrument(span.clone())
            .await?;
        eprintln!("final image: {lsn} : {result:?}");
    }

    Ok(())
}

/// Redo all WALs against the base image in the input file. Print the base64-encoded final image.
///
/// The first line of the input file is a header and is not parsed as data. If it is `FPW`, the
/// base image is treated as a full-page write and its page LSN is set to the LSN given with the
/// image (unless the page is new).
///
/// Each following line in the input file must be in the form "<lsn>,<base64>" where:
/// * `<lsn>` is a PostgreSQL LSN in hexadecimal notation, e.g. `0/16ABCDE`.
/// * `<base64>` is the base64-encoded page image (first such line) or WAL record (subsequent lines).
///
/// The first "<lsn>,<base64>" line provides the base image of a page. The LSN is the LSN of the
/// "next record" following the record containing the FPI. For example, if the FPI was extracted
/// from a WAL record occupying [0/1, 0/200) in the WAL stream, the LSN appearing alongside the page
/// image here should be 0/200.
///
/// The subsequent lines are WAL records, ordered from the oldest to the newest. The LSN is the
/// record LSN of the WAL record, not the "next record" LSN. For example, if the WAL record here
/// occupies [0/1, 0/200) in the WAL stream, the LSN appearing alongside the WAL record here should
/// be 0/1.
#[derive(Parser)]
struct RedoWalsCmd {
    #[clap(long)]
    input: String,
    #[clap(long)]
    key: String,
}

#[tokio::test]
async fn test_redo_wals() -> anyhow::Result<()> {
    // The tool's arguments follow a `--` separator on the test binary's command line.
    // The separator itself is passed to clap in the position of the binary name.
    let args = std::env::args().collect_vec();
    let pos = args
        .iter()
        .position(|arg| arg == "--")
        .unwrap_or(args.len());
    let slice = &args[pos..];
    let cmd = match RedoWalsCmd::try_parse_from(slice) {
        Ok(cmd) => cmd,
        Err(err) => {
            // Without valid arguments (e.g. in a regular test run), print the error and pass.
            eprintln!("{err}");
            return Ok(());
        }
    };

    let key = Key::from_hex(&cmd.key).unwrap();
    redo_wals(&cmd.input, key).await?;

    Ok(())
}

/// Search for a page at the given LSN in all layers of the data_dir.
/// Print the base64-encoded image and all WAL records, as well as the final reconstructed image.
#[derive(Parser)]
struct SearchKeyCmd {
    #[clap(long)]
    tenant_id: String,
    #[clap(long)]
    timeline_id: String,
    #[clap(long)]
    data_dir: String,
    #[clap(long)]
    key: String,
    #[clap(long)]
    lsn: String,
}

#[tokio::test]
async fn test_search_key() -> anyhow::Result<()> {
    // The tool's arguments follow a `--` separator on the test binary's command line.
    // The separator itself is passed to clap in the position of the binary name.
    let args = std::env::args().collect_vec();
    let pos = args
        .iter()
        .position(|arg| arg == "--")
        .unwrap_or(args.len());
    let slice = &args[pos..];
    let cmd = match SearchKeyCmd::try_parse_from(slice) {
        Ok(cmd) => cmd,
        Err(err) => {
            // Without valid arguments (e.g. in a regular test run), print the error and pass.
            eprintln!("{err}");
            return Ok(());
        }
    };

    let tenant_id = TenantId::from_str(&cmd.tenant_id).unwrap();
    let timeline_id = TimelineId::from_str(&cmd.timeline_id).unwrap();
    let key = Key::from_hex(&cmd.key).unwrap();
    let lsn = Lsn::from_str(&cmd.lsn).unwrap();
    search_key(tenant_id, timeline_id, cmd.data_dir, key, lsn).await?;

    Ok(())
}