## Problem

We recently added a "visibility" state to layers, but nothing initializes it.

Part of:
- #8398

## Summary of changes

- Add a dependency on `range-set-blaze`, which is used as a fast, incrementally updated alternative to KeySpace. We could also use this to replace the internals of KeySpaceRandomAccum if we wanted to. Writing a type that does this kind of "BTreeMap & merge overlapping entries" thing isn't super complicated (see the sketch at the end of this description), but there is no reason to write it ourselves when a third-party implementation is available.
- Add a function to the layer map to calculate visibilities for each layer (the sketch below also illustrates the general shape of this).
- Add a function to Timeline to call into the layer map and then apply these visibilities to the Layer objects.
- Invoke the calculation during startup, after image layer creation, and when removing branches. Branch removal and image layer creation are the two ways that a layer can go from Visible to Covered.
- Add a unit test & benchmark for the visibility calculation.
- Expose a `pageserver_visible_physical_size` metric, which should always be <= `pageserver_remote_physical_size`.
  - This metric will feed into the /v1/utilization endpoint later: the visible size indicates how much space we would like to use on this pageserver for this tenant.
  - When `pageserver_visible_physical_size` is greater than `pageserver_resident_physical_size`, this is a sign that the tenant has long-idle branches, which result in layers that are visible in principle but not used in practice.

This does not keep visibility hints up to date in all cases: in particular, when creating a child timeline, any previously covered layers will not get marked Visible until they are accessed. Updates after image layer creation could be implemented as more of a special case, but this would require more new code: the existing depth calculation code doesn't maintain+yield the list of deltas that would be covered by an image layer.

## Performance

This operation is done rarely (at startup and at timeline deletion), so it needs to be efficient but not ultra-fast. There is a new `visibility` bench that measures runtime for a synthetic 100k-layer case (`sequential`) and a real layer map (`real_map`) with ~26k layers.

The benchmark shows runtimes of single-digit milliseconds (on a Ryzen 7950). This confirms that the runtime shouldn't be a problem at startup (as we already incur S3-level latencies there), but it is slow enough that we definitely shouldn't call it more often than necessary, and it may be worthwhile to optimize further later (things like: when removing a branch, only bother scanning layers below the branch point).

```
visibility/sequential   time:   [4.5087 ms 4.5894 ms 4.6775 ms]
                        change: [+2.0826% +3.9097% +5.8995%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 24 outliers among 100 measurements (24.00%)
  2 (2.00%) high mild
  22 (22.00%) high severe
min: 0/1696070, max: 93/1C0887F0
visibility/real_map     time:   [7.0796 ms 7.0832 ms 7.0871 ms]
                        change: [+0.3900% +0.4505% +0.5164%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
min: 0/1696070, max: 93/1C0887F0
visibility/real_map_many_branches
                        time:   [4.5285 ms 4.5355 ms 4.5434 ms]
                        change: [-1.0012% -0.8004% -0.5969%] (p = 0.00 < 0.05)
                        Change within noise threshold.
```
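For illustration, here is a minimal, self-contained sketch of both ideas mentioned above: the "BTreeMap & merge overlapping entries" structure and the top-down visibility sweep. Everything in it is a hypothetical simplification (`RangeAccum`, `FakeLayer`, `compute_visibility`, integer keys, a single implicit read point at the top of history): it is neither the `range-set-blaze` API nor the actual layer map code, just the shape of the technique.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for what `range-set-blaze` gives us: a set of
/// disjoint half-open ranges that stays merged as new ranges are inserted.
#[derive(Default)]
struct RangeAccum {
    /// start -> end; invariant: ranges are disjoint and non-abutting.
    ranges: BTreeMap<u64, u64>,
}

impl RangeAccum {
    fn insert(&mut self, mut start: u64, mut end: u64) {
        // Collect every existing range that overlaps or abuts [start, end),
        // walking backwards from the last range starting at or before `end`.
        let absorbed: Vec<(u64, u64)> = self
            .ranges
            .range(..=end)
            .rev()
            .take_while(|&(_, &e)| e >= start)
            .map(|(&s, &e)| (s, e))
            .collect();
        for (s, e) in absorbed {
            start = start.min(s);
            end = end.max(e);
            self.ranges.remove(&s);
        }
        self.ranges.insert(start, end);
    }

    /// True if [start, end) is fully contained in one accumulated range.
    fn covers(&self, start: u64, end: u64) -> bool {
        self.ranges
            .range(..=start)
            .next_back()
            .is_some_and(|(_, &e)| e >= end)
    }
}

/// Drastically simplified layer: one key range at one LSN.
struct FakeLayer {
    key_start: u64,
    key_end: u64,
    lsn: u64,
    is_image: bool,
    visible: bool,
}

/// Sweep from the highest LSN downwards, accumulating the keyspace shadowed
/// by image layers above; a layer whose key range is fully shadowed is
/// covered. (The real calculation additionally has to handle one read point
/// per branch, not just the top of history.)
fn compute_visibility(layers: &mut [FakeLayer]) {
    layers.sort_by(|a, b| b.lsn.cmp(&a.lsn));
    let mut shadow = RangeAccum::default();
    for layer in layers.iter_mut() {
        layer.visible = !shadow.covers(layer.key_start, layer.key_end);
        if layer.is_image {
            shadow.insert(layer.key_start, layer.key_end);
        }
    }
}

fn main() {
    let mk = |key_start, key_end, lsn, is_image| FakeLayer {
        key_start,
        key_end,
        lsn,
        is_image,
        visible: false,
    };
    let mut layers = vec![
        mk(0, 100, 10, false),  // delta, fully under the image at LSN 20
        mk(0, 100, 20, true),   // image
        mk(50, 150, 30, false), // delta above the image
    ];
    compute_visibility(&mut layers);
    assert!(!layers.iter().find(|l| l.lsn == 10).unwrap().visible); // covered
    assert!(layers.iter().find(|l| l.lsn == 20).unwrap().visible);
    assert!(layers.iter().find(|l| l.lsn == 30).unwrap().visible);
}
```

With multiple branches, only image layers between a layer and the nearest read point above it can hide it, which is (roughly) why the real function takes the branches' read points as input; see `get_visibility(read_points)` in the benchmark below.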
use criterion::measurement::WallTime;
use criterion::{black_box, criterion_group, criterion_main, BenchmarkGroup, Criterion};
use pageserver::keyspace::{KeyPartitioning, KeySpace};
use pageserver::repository::Key;
use pageserver::tenant::layer_map::LayerMap;
use pageserver::tenant::storage_layer::{LayerName, PersistentLayerDesc};
use pageserver_api::shard::TenantShardId;
use rand::prelude::{SeedableRng, SliceRandom, StdRng};
use std::cmp::{max, min};
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::PathBuf;
use std::str::FromStr;
use std::time::Instant;
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;

fn fixture_path(relative: &str) -> PathBuf {
    PathBuf::from(env!("CARGO_MANIFEST_DIR")).join(relative)
}
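
// Each line of the fixture file is a layer file name in the pageserver's
// on-disk naming scheme, roughly of this shape (illustrative, not values
// taken from the fixture):
//   image layer: <key_start>-<key_end>__<lsn>
//   delta layer: <key_start>-<key_end>__<lsn_start>-<lsn_end>
// with keys as 36 hex digits and LSNs as 16 hex digits.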
fn build_layer_map(filename_dump: PathBuf) -> LayerMap {
    let mut layer_map = LayerMap::default();

    let mut min_lsn = Lsn(u64::MAX);
    let mut max_lsn = Lsn(0);

    let filenames = BufReader::new(File::open(filename_dump).unwrap()).lines();

    let mut updates = layer_map.batch_update();
    for fname in filenames {
        let fname = fname.unwrap();
        let fname = LayerName::from_str(&fname).unwrap();
        let layer = PersistentLayerDesc::from(fname);

        let lsn_range = layer.get_lsn_range();
        min_lsn = min(min_lsn, lsn_range.start);
        max_lsn = max(max_lsn, Lsn(lsn_range.end.0 - 1));

        updates.insert_historic(layer);
    }

    println!("min: {min_lsn}, max: {max_lsn}");

    updates.flush();
    layer_map
}

/// Construct a layer map query pattern for benchmarks
fn uniform_query_pattern(layer_map: &LayerMap) -> Vec<(Key, Lsn)> {
    // For each image layer we query one of the pages contained, at the LSN right
    // before the image layer was created. This gives us somewhat uniform
    // coverage of both the LSN and key space, because image layers have
    // approximately equal sizes and cover approximately equal WAL since the
    // last image.
    layer_map
        .iter_historic_layers()
        .filter_map(|l| {
            if l.is_incremental() {
                None
            } else {
                let kr = l.get_key_range();
                let lr = l.get_lsn_range();

                let key_inside = kr.start.next();
                let lsn_before = Lsn(lr.start.0 - 1);

                Some((key_inside, lsn_before))
            }
        })
        .collect()
}

// Construct a partitioning for testing get_difficulty_map when we
// don't have an exact result of `collect_keyspace` to work with.
fn uniform_key_partitioning(layer_map: &LayerMap, _lsn: Lsn) -> KeyPartitioning {
    let mut parts = Vec::new();

    // We add a partition boundary at the start of each image layer,
    // no matter what LSN range it covers. This is just the easiest
    // thing to do. A better thing to do would be to get a real
    // partitioning from some database. Even better, remove the need
    // for key partitions by deciding where to create image layers
    // directly based on a coverage-based difficulty map.
    let mut keys: Vec<_> = layer_map
        .iter_historic_layers()
        .filter_map(|l| {
            if l.is_incremental() {
                None
            } else {
                let kr = l.get_key_range();
                Some(kr.start.next())
            }
        })
        .collect();
    keys.sort();

    let mut current_key = Key::from_hex("000000000000000000000000000000000000").unwrap();
    for key in keys {
        parts.push(KeySpace {
            ranges: vec![current_key..key],
        });
        current_key = key;
    }

    KeyPartitioning { parts }
}

// Benchmark using metadata extracted from our performance test environment, from
// a project where we have run pgbench many times. The pgbench database was
// initialized between each test run.
fn bench_from_captest_env(c: &mut Criterion) {
    // TODO: consider compressing this file
    let layer_map = build_layer_map(fixture_path("benches/odd-brook-layernames.txt"));
    let queries: Vec<(Key, Lsn)> = uniform_query_pattern(&layer_map);

    // Test with the uniform query pattern
    c.bench_function("captest_uniform_queries", |b| {
        b.iter(|| {
            for q in queries.clone().into_iter() {
                black_box(layer_map.search(q.0, q.1));
            }
        });
    });

    // Test with a key that corresponds to the RelDir entry. See pgdatadir_mapping.rs.
    c.bench_function("captest_rel_dir_query", |b| {
        b.iter(|| {
            let result = black_box(layer_map.search(
                Key::from_hex("000000067F00008000000000000000000001").unwrap(),
                // This LSN is higher than any of the LSNs in the tree
                Lsn::from_str("D0/80208AE1").unwrap(),
            ));
            result.unwrap();
        });
    });
}

// Benchmark using metadata extracted from a real project that was taking
// too long processing layer map queries.
fn bench_from_real_project(c: &mut Criterion) {
    // Init layer map
    let now = Instant::now();
    let layer_map = build_layer_map(fixture_path("benches/odd-brook-layernames.txt"));
    println!("Finished layer map init in {:?}", now.elapsed());

    // Choose uniformly distributed queries
    let queries: Vec<(Key, Lsn)> = uniform_query_pattern(&layer_map);

    // Choose inputs for get_difficulty_map
    let latest_lsn = layer_map
        .iter_historic_layers()
        .map(|l| l.get_lsn_range().end)
        .max()
        .unwrap();
    let partitioning = uniform_key_partitioning(&layer_map, latest_lsn);

    // Check correctness of get_difficulty_map
    // TODO: put this in a dedicated test outside of this module
    {
        println!("running correctness check");

        let now = Instant::now();
        let result_bruteforce = layer_map.get_difficulty_map_bruteforce(latest_lsn, &partitioning);
        assert!(result_bruteforce.len() == partitioning.parts.len());
        println!("Finished bruteforce in {:?}", now.elapsed());

        let now = Instant::now();
        let result_fast = layer_map.get_difficulty_map(latest_lsn, &partitioning, None);
        assert!(result_fast.len() == partitioning.parts.len());
        println!("Finished fast in {:?}", now.elapsed());

        // Assert that the results are equal. Iterate manually for easier debugging.
        let zip = std::iter::zip(
            &partitioning.parts,
            std::iter::zip(result_bruteforce, result_fast),
        );
        for (_part, (bruteforce, fast)) in zip {
            assert_eq!(bruteforce, fast);
        }

        println!("No issues found");
    }

    // Define and name the benchmark functions
    let mut group = c.benchmark_group("real_map");
    group.bench_function("uniform_queries", |b| {
        b.iter(|| {
            for q in queries.clone().into_iter() {
                black_box(layer_map.search(q.0, q.1));
            }
        });
    });
    group.bench_function("get_difficulty_map", |b| {
        b.iter(|| {
            layer_map.get_difficulty_map(latest_lsn, &partitioning, Some(3));
        });
    });
    group.finish();
}

// Benchmark using synthetic data. Arrange image layers on stacked diagonal lines.
fn bench_sequential(c: &mut Criterion) {
    // Init layer map. Create 100_000 layers arranged in 1000 diagonal lines.
    //
    // TODO: This code is pretty slow and runs even if we're only running other
    // benchmarks. It needs to be somewhere else, but it's not clear where.
    // Putting it inside the `bench_function` closure is not a solution,
    // because then it runs multiple times during warmup.
    let now = Instant::now();
    let mut layer_map = LayerMap::default();
    let mut updates = layer_map.batch_update();
    for i in 0..100_000 {
        let i32 = (i as u32) % 100;
        let zero = Key::from_hex("000000000000000000000000000000000000").unwrap();
        let layer = PersistentLayerDesc::new_img(
            TenantShardId::unsharded(TenantId::generate()),
            TimelineId::generate(),
            zero.add(10 * i32)..zero.add(10 * i32 + 1),
            Lsn(i),
            0,
        );
        updates.insert_historic(layer);
    }
    updates.flush();
    println!("Finished layer map init in {:?}", now.elapsed());

    // Choose 100 uniformly random queries
    let rng = &mut StdRng::seed_from_u64(1);
    let queries: Vec<(Key, Lsn)> = uniform_query_pattern(&layer_map)
        .choose_multiple(rng, 100)
        .copied()
        .collect();

    // Define and name the benchmark function
    let mut group = c.benchmark_group("sequential");
    group.bench_function("uniform_queries", |b| {
        b.iter(|| {
            for q in queries.clone().into_iter() {
                black_box(layer_map.search(q.0, q.1));
            }
        });
    });
    group.finish();
}
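
// A note on inputs (sketch-level understanding, not asserted by the code):
// `read_points` stand in for the LSNs that must remain readable (branch
// points and branch heads). More read points keep more historic layers
// visible, which is what `real_map_many_branches` exercises below.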
fn bench_visibility_with_map(
    group: &mut BenchmarkGroup<WallTime>,
    layer_map: LayerMap,
    read_points: Vec<Lsn>,
    bench_name: &str,
) {
    group.bench_function(bench_name, |b| {
        b.iter(|| black_box(layer_map.get_visibility(read_points.clone())));
    });
}

// Benchmark the visibility calculation, on both synthetic and real layer maps.
fn bench_visibility(c: &mut Criterion) {
    let mut group = c.benchmark_group("visibility");
    {
        // Init layer map. Create 100_000 layers arranged in 1000 diagonal lines,
        // as in bench_sequential.
        let now = Instant::now();
        let mut layer_map = LayerMap::default();
        let mut updates = layer_map.batch_update();
        for i in 0..100_000 {
            let i32 = (i as u32) % 100;
            let zero = Key::from_hex("000000000000000000000000000000000000").unwrap();
            let layer = PersistentLayerDesc::new_img(
                TenantShardId::unsharded(TenantId::generate()),
                TimelineId::generate(),
                zero.add(10 * i32)..zero.add(10 * i32 + 1),
                Lsn(i),
                0,
            );
            updates.insert_historic(layer);
        }
        updates.flush();
        println!("Finished layer map init in {:?}", now.elapsed());

        let mut read_points = Vec::new();
        for i in (0..100_000).step_by(1000) {
            read_points.push(Lsn(i));
        }

        bench_visibility_with_map(&mut group, layer_map, read_points, "sequential");
    }

    {
        let layer_map = build_layer_map(fixture_path("benches/odd-brook-layernames.txt"));
        let read_points = vec![Lsn(0x1C760FA190)];
        bench_visibility_with_map(&mut group, layer_map, read_points, "real_map");

        // bench_visibility_with_map consumes the map, so build it again.
        let layer_map = build_layer_map(fixture_path("benches/odd-brook-layernames.txt"));
        let read_points = vec![
            Lsn(0x1C760FA190),
            Lsn(0x000000931BEAD539),
            Lsn(0x000000931BF63011),
            Lsn(0x000000931B33AE68),
            Lsn(0x00000038E67ABFA0),
            Lsn(0x000000931B33AE68),
            Lsn(0x000000914E3F38F0),
            Lsn(0x000000931B33AE68),
        ];
        bench_visibility_with_map(&mut group, layer_map, read_points, "real_map_many_branches");
    }

    group.finish();
}

criterion_group!(group_1, bench_from_captest_env);
criterion_group!(group_2, bench_from_real_project);
criterion_group!(group_3, bench_sequential);
criterion_group!(group_4, bench_visibility);
criterion_main!(group_1, group_2, group_3, group_4);