Move some perf benchmarks from Hetzner to AWS ARM GitHub runners (#12393)

## Problem

We want to move some benchmarks from Hetzner runners to AWS Graviton runners.

## Summary of changes

- Adjust the runner labels for the affected workflows.
- Adjust the pagebench number of clients to match the latency knee at 8 cores of the new instance type.
- Add `--security-opt seccomp=unconfined` to the docker run command to bypass the io_uring EPERM error.
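For context on the seccomp change: Docker's default seccomp profile can block the io_uring syscalls inside the container, so `io_uring_setup` fails with EPERM even though the host kernel supports it; `--security-opt seccomp=unconfined` disables that filter. Below is a minimal probe (a sketch, not part of this PR; the raw-syscall approach and the 120-byte params buffer are assumptions based on the Linux ABI) to check whether io_uring is usable in the current environment:

```python
import ctypes
import errno
import os

# io_uring_setup syscall number (425 on both x86_64 and aarch64)
SYS_io_uring_setup = 425

libc = ctypes.CDLL(None, use_errno=True)

# struct io_uring_params is 120 bytes; an all-zero struct is a valid "no flags" request
params = (ctypes.c_uint8 * 120)()

fd = libc.syscall(SYS_io_uring_setup, 8, ctypes.byref(params))
if fd < 0:
    err = ctypes.get_errno()
    # EPERM here (rather than ENOSYS) typically points at a seccomp filter
    print("io_uring_setup failed:", errno.errorcode.get(err, err))
else:
    os.close(fd)
    print("io_uring available")
```

In an unconfined container the probe should print `io_uring available`; under a blocking seccomp profile it reports EPERM.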

## New runners


https://us-east-2.console.aws.amazon.com/ec2/home?region=us-east-2#Instances:instanceState=running;search=:github-unit-perf-runner-arm;v=3;$case=tags:true%5C,client:false;$regex=tags:false%5C,client:false;sort=tag:Name

## Important Notes

I added the `run-benchmarks` label to get this tested **before we merge
it**; see https://github.com/neondatabase/neon/actions/runs/15974141360

I also tested a run of pagebench with the new setup from this branch, see
https://github.com/neondatabase/neon/actions/runs/15972523054
- Update: the benchmarking workflow had failures, [see this job](https://github.com/neondatabase/neon/actions/runs/15974141360/job/45055897591)
- Changed the docker run command to avoid the io_uring EPERM error; [see the new run](https://github.com/neondatabase/neon/actions/runs/15997965633/job/45125689920?pr=12393)

Update: the pagebench test run on the new runner [completed
successfully](https://github.com/neondatabase/neon/actions/runs/15972523054/job/45046772556)

Update 2025-07-07: the latest runs with instance-store ext4 have been
successful and resolved the direct I/O issues we had been seeing in some
runs. Only one perf test case (shard split) failed, and it had been flaky
before, so I think we can merge this now.

## Follow up

If this is merged and works successfully, we must create a separate issue
to de-provision the Hetzner unit-perf runners defined
[here](91a41729af/ansible/inventory/hosts_metal (L111))
Commit ca9d8761ff by Peter Bendel, 2025-07-07 08:44:41 +02:00 (committed by GitHub); parent b568189f7b. 5 changed files with 11 additions and 9 deletions.


```diff
@@ -7,6 +7,7 @@ self-hosted-runner:
     - small-metal
     - small-arm64
     - unit-perf
+    - unit-perf-aws-arm
     - us-east-2
 config-variables:
   - AWS_ECR_REGION
```


```diff
@@ -306,14 +306,14 @@ jobs:
       statuses: write
       contents: write
       pull-requests: write
-    runs-on: [ self-hosted, unit-perf ]
+    runs-on: [ self-hosted, unit-perf-aws-arm ]
     container:
       image: ${{ needs.build-build-tools-image.outputs.image }}-bookworm
       credentials:
         username: ${{ github.actor }}
         password: ${{ secrets.GITHUB_TOKEN }}
       # for changed limits, see comments on `options:` earlier in this file
-      options: --init --shm-size=512mb --ulimit memlock=67108864:67108864
+      options: --init --shm-size=512mb --ulimit memlock=67108864:67108864 --ulimit nofile=65536:65536 --security-opt seccomp=unconfined
     strategy:
       fail-fast: false
       matrix:
```


```diff
@@ -1,4 +1,4 @@
-name: Periodic pagebench performance test on unit-perf hetzner runner
+name: Periodic pagebench performance test on unit-perf-aws-arm runners
 on:
   schedule:
@@ -40,7 +40,7 @@ jobs:
       statuses: write
       contents: write
       pull-requests: write
-    runs-on: [ self-hosted, unit-perf ]
+    runs-on: [ self-hosted, unit-perf-aws-arm ]
     container:
       image: ghcr.io/neondatabase/build-tools:pinned-bookworm
       credentials:
```


```diff
@@ -1,4 +1,4 @@
-name: Periodic proxy performance test on unit-perf hetzner runner
+name: Periodic proxy performance test on unit-perf-aws-arm runners
 on:
   push: # TODO: remove after testing
@@ -32,7 +32,7 @@ jobs:
       statuses: write
       contents: write
       pull-requests: write
-    runs-on: [self-hosted, unit-perf]
+    runs-on: [self-hosted, unit-perf-aws-arm]
     timeout-minutes: 60 # 1h timeout
     container:
       image: ghcr.io/neondatabase/build-tools:pinned-bookworm
```


```diff
@@ -55,9 +55,10 @@ def test_pageserver_characterize_throughput_with_n_tenants(
 @pytest.mark.parametrize("duration", [20 * 60])
 @pytest.mark.parametrize("pgbench_scale", [get_scale_for_db(2048)])
 # we use 1 client to characterize latencies, and 64 clients to characterize throughput/scalability
-# we use 64 clients because typically for a high number of connections we recommend the connection pooler
-# which by default uses 64 connections
-@pytest.mark.parametrize("n_clients", [1, 64])
+# we use 8 clients because we see a latency knee around 6-8 clients on im4gn.2xlarge instance type,
+# which we use for this periodic test - at a cpu utilization of around 70 % - which is considered
+# a good utilization for pageserver.
+@pytest.mark.parametrize("n_clients", [1, 8])
 @pytest.mark.parametrize("n_tenants", [1])
 @pytest.mark.timeout(2400)
 def test_pageserver_characterize_latencies_with_1_client_and_throughput_with_many_clients_one_tenant(
```
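The `n_clients` choice in the hunk above ("latency knee around 6-8 clients") can be sketched as a simple knee search over measured latencies. The helper and the numbers below are illustrative only, not taken from the actual benchmark:

```python
def knee_clients(samples, threshold=2.0):
    """Return the last client count before latency blows up.

    samples: list of (n_clients, p99_latency_ms) pairs, sorted by n_clients,
    with client counts doubling between steps. The "knee" is the last point
    before doubling the clients more than doubles the latency.
    """
    for (c0, l0), (c1, l1) in zip(samples, samples[1:]):
        if l1 / l0 > threshold:
            return c0
    return samples[-1][0]

# Made-up numbers resembling an 8-core machine: latency stays flat until
# the cores are saturated, then queueing makes it grow sharply.
data = [(1, 1.0), (2, 1.05), (4, 1.2), (8, 1.5), (16, 4.0), (32, 9.0)]
print(knee_clients(data))  # → 8
```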