periodic pagebench on hetzner runners (#11963)

## Problem

- Benchmark periodic pagebench had inconsistent benchmarking results
even when run with the same commit hash.
Hypothesis is this was due to running on dedicated but virtualized EC
instance with varying CPU frequency.

- the dedicated instance type used for the benchmark is quite "old" and
we increasingly get `An error occurred (InsufficientInstanceCapacity)
when calling the StartInstances operation (reached max retries: 2):
Insufficient capacity.`

- periodic pagebench uses a snapshot of pageserver timelines to have the
same layer structure in each run and get consistent performance.
Re-creating the snapshot was a painful manual process (see
https://github.com/neondatabase/cloud/issues/27051 and
https://github.com/neondatabase/cloud/issues/27653)

## Summary of changes

- Run the periodic pagebench on a custom hetzner GitHub runner with
large nvme disk and governor set to defined perf profile
- provide a manual dispatch option for the workflow that allows to
create a new snapshot
- keep the manual dispatch option to specify a commit hash useful for
bi-secting regressions
- always use the newest created snapshot (S3 bucket uses date suffix in
S3 key, example
`s3://neon-github-public-dev/performance/pagebench/shared-snapshots-2025-05-17/`
- `--ignore`
`test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py`
in regular benchmarks run for each commit
- improve perf copying snapshot by using `cp` subprocess instead of
traversing tree in python


## Example runs with code in this PR:
- run which creates new snapshot
https://github.com/neondatabase/neon/actions/runs/15083408849/job/42402986376#step:19:55
- run which uses latest snapshot
-
https://github.com/neondatabase/neon/actions/runs/15084907676/job/42406240745#step:11:65
This commit is contained in:
Peter Bendel
2025-05-23 11:37:19 +02:00
committed by GitHub
parent 06ce704041
commit 87fc0a0374
5 changed files with 197 additions and 108 deletions

View File

@@ -682,7 +682,7 @@ class NeonEnvBuilder:
log.info(
f"Copying pageserver tenants directory {tenants_from_dir} to {tenants_to_dir}"
)
shutil.copytree(tenants_from_dir, tenants_to_dir)
subprocess.run(["cp", "-a", tenants_from_dir, tenants_to_dir], check=True)
else:
log.info(
f"Creating overlayfs mount of pageserver tenants directory {tenants_from_dir} to {tenants_to_dir}"
@@ -698,8 +698,9 @@ class NeonEnvBuilder:
shutil.rmtree(self.repo_dir / "local_fs_remote_storage", ignore_errors=True)
if self.test_overlay_dir is None:
log.info("Copying local_fs_remote_storage directory from snapshot")
shutil.copytree(
repo_dir / "local_fs_remote_storage", self.repo_dir / "local_fs_remote_storage"
subprocess.run(
["cp", "-a", f"{repo_dir / 'local_fs_remote_storage'}", f"{self.repo_dir}"],
check=True,
)
else:
log.info("Creating overlayfs mount of local_fs_remote_storage directory from snapshot")

View File

@@ -14,7 +14,7 @@ from fixtures.neon_fixtures import (
PgBin,
wait_for_last_flush_lsn,
)
from fixtures.utils import get_scale_for_db, humantime_to_ms, skip_on_ci
from fixtures.utils import get_scale_for_db, humantime_to_ms
from performance.pageserver.util import setup_pageserver_with_tenants
@@ -36,9 +36,6 @@ if TYPE_CHECKING:
@pytest.mark.parametrize("pgbench_scale", [get_scale_for_db(200)])
@pytest.mark.parametrize("n_tenants", [500])
@pytest.mark.timeout(10000)
@skip_on_ci(
"This test needs lot of resources and should run on dedicated HW, not in github action runners as part of CI"
)
def test_pageserver_characterize_throughput_with_n_tenants(
neon_env_builder: NeonEnvBuilder,
zenbenchmark: NeonBenchmarker,
@@ -63,9 +60,6 @@ def test_pageserver_characterize_throughput_with_n_tenants(
@pytest.mark.parametrize("n_clients", [1, 64])
@pytest.mark.parametrize("n_tenants", [1])
@pytest.mark.timeout(2400)
@skip_on_ci(
"This test needs lot of resources and should run on dedicated HW, not in github action runners as part of CI"
)
def test_pageserver_characterize_latencies_with_1_client_and_throughput_with_many_clients_one_tenant(
neon_env_builder: NeonEnvBuilder,
zenbenchmark: NeonBenchmarker,