rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-22 21:59:59 +00:00

Author	SHA1	Message	Date
Mikhail	6689d6fd89	LFC prewarm perftest fixes: use existing staging project (#12651 ) https://github.com/neondatabase/cloud/issues/19011 - Prewarm config changes are not publicly available. Correct the test by using a pre-filled 50 GB project on staging - Create extension neon with schema neon to fix read performance tests on staging, error example in https://neon-github-public-dev.s3.amazonaws.com/reports/main/16483462789/index.html#suites/3d632da6dda4a70f5b4bd24904ab444c/919841e331089fc4/ - Don't create extra endpoint in LFC prewarm performance tests	2025-07-25 16:56:41 +00:00
Mikhail	2288efae66	Performance test for LFC prewarm (#12524 ) https://github.com/neondatabase/cloud/issues/19011 Measure relative performance for prewarmed and non-prewarmed endpoints. Add test that runs on every commit, and one performance test with a remote cluster.	2025-07-14 13:41:31 +00:00
Peter Bendel	ca9d8761ff	Move some perf benchmarks from hetzner to aws arm github runners (#12393 ) ## Problem We want to move some benchmarks from hetzner runners to aws graviton runners ## Summary of changes Adjust the runner labels for some workflows. Adjust the pagebench number of clients to match the latency knee at 8 cores of the new instance type. Add `--security-opt seccomp=unconfined` to docker run command to bypass IO_URING EPERM error. ## New runners https://us-east-2.console.aws.amazon.com/ec2/home?region=us-east-2#Instances:instanceState=running;search=:github-unit-perf-runner-arm;v=3;$case=tags:true%5C,client:false;$regex=tags:false%5C,client:false;sort=tag:Name ## Important Notes I added the run-benchmarks label to get this tested before we merge it. [See](https://github.com/neondatabase/neon/actions/runs/15974141360) I also tested a run of pagebench with the new setup from this branch, see https://github.com/neondatabase/neon/actions/runs/15972523054 - Update: the benchmarking workflow had failures, [see](https://github.com/neondatabase/neon/actions/runs/15974141360/job/45055897591) - changed docker run command to avoid io_uring EPERM error, new run [see](https://github.com/neondatabase/neon/actions/runs/15997965633/job/45125689920?pr=12393) Update: the pagebench test run on the new runner [completed successfully](https://github.com/neondatabase/neon/actions/runs/15972523054/job/45046772556) Update 2025-07-07: the latest runs with instance store ext4 have been successful and resolved the direct I/O issues we have been seeing before in some runs. We only had one perf testcase failing (shard split) that had been flaky before. So I think we can merge this now. ## Follow up if this is merged and works successfully we must create a separate issue to de-provision the hetzner unit-perf runners defined [here](`91a41729af/ansible/inventory/hosts_metal (L111)`)	2025-07-07 06:44:41 +00:00
Alexander Bayandin	22290eb7ba	CI: notify relevant team about release deploy failures (#11797 ) ## Problem We notify only Storage team about failed deploys, but Compute and Proxy teams can also benefit from that ## Summary of changes - Adjust `notify-storage-release-deploy-failure` to notify the relevant team about failed deploy	2025-05-02 12:46:21 +00:00
Fedor Dikarev	9a6ace9bde	introduce new runners: unit-perf and use them for benchmark jobs (#11409 ) ## Problem Benchmark results are inconsistent on existing small-metal runners ## Summary of changes Introduce new `unit-perf` runners, and let's run benchmarks on them. The new hardware has a slower, but consistent, CPU frequency if run with the default governor schedutil. Thus we needed to adjust some testcases' timeouts and add some retry steps where hard-coded timeouts couldn't be increased without changing the system under test. - [wait_for_last_record_lsn](`6592d69a67/test_runner/fixtures/pageserver/utils.py (L193)`) 1000s -> 2000s - [test_branch_creation_many](https://github.com/neondatabase/neon/pull/11409/files#diff-2ebfe76f89004d563c7e53e3ca82462e1d85e92e6d5588e8e8f598bbe119e927) 1000s - [test_ingest_insert_bulk](https://github.com/neondatabase/neon/pull/11409/files#diff-e90e685be4a87053bc264a68740969e6a8872c8897b8b748d0e8c5f683a68d9f) - with back throttling disabled compute becomes unresponsive for more than 60 seconds (PG hard-coded client authentication connection timeout) - [test_sharded_ingest](https://github.com/neondatabase/neon/pull/11409/files#diff-e8d870165bd44acb9a6d8350f8640b301c1385a4108430b8d6d659b697e4a3f1) 600s -> 1200s Right now there are only 2 runners of that class, and if we decide to go with them, we have to check how many runners of that type we need, so that jobs don't get stuck waiting for one to become available. However we now decided to run those runners with governor performance instead of schedutil. This achieves almost the same performance as the previous runners but still achieves consistent results for the same commit. Related issue to activate performance governor on these runners https://github.com/neondatabase/runner/pull/138 ## Verification that it helps ### analyze runtimes on new runner for same commit Table of runtimes for the same commit on different runners in [run](https://github.com/neondatabase/neon/actions/runs/14417589789) \| Run \| Benchmarks (1) \| Benchmarks (2) \| Benchmarks (3) \| Benchmarks (4) \| Benchmarks (5) \| \|--------\|--------\|---------\|---------\|---------\|---------\| \| 1 \| 1950.37s \| 6374.55s \| 3646.15s \| 4149.48s \| 2330.22s \| \| 2 \| - \| 6369.27s \| 3666.65s \| 4162.42s \| 2329.23s \| \| Delta % \| - \| 0,07 % \| 0,5 % \| 0,3 % \| 0,04 % \| \| with governor performance \| 1519.57s \| 4131.62s \| - \| - \| - \| \| second run gov. perf. \| 1513.62s \| 4134.67s \| - \| - \| - \| \| Delta % \| 0,3 % \| 0,07 % \| - \| - \| - \| \| speedup gov. performance \| 22 % \| 35 % \| - \| - \| - \| \| current desktop class hetzner runners (main) \| 1487.10s \| 3699.67s \| - \| - \| - \| \| slower than desktop class \| 2 % \| 12 % \| - \| - \| - \| In summary, the runtimes for the same commit on this hardware vary by less than 1 %. --------- Co-authored-by: BodoBolero <peterbendel@neon.tech>	2025-04-15 08:21:44 +00:00
Fedor Dikarev	1d5d168626	impr(ci): use hetzner buckets for cache (#11364 ) ## Problem Occasionally getting data from GH cache could be slow, with less than 10MB/s and taking 5+ minutes to download cache: ``` Received 20971520 of 2987085791 (0.7%), 9.9 MBs/sec Received 50331648 of 2987085791 (1.7%), 15.9 MBs/sec ... Received 1065353216 of 2987085791 (35.7%), 4.8 MBs/sec Received 1065353216 of 2987085791 (35.7%), 4.7 MBs/sec ... ``` https://github.com/neondatabase/neon/actions/runs/13956437454/job/39068664599#step:7:17 As a result, getting the cache can take even longer than the build time. ## Summary of changes Switch to caches that are closer to the runners; they provide a stable throughput of about 70-80MB/s	2025-03-27 11:11:45 +00:00
JC Grünhage	f17931870f	fix(ci): use <!subteam^ID> syntax for pinging groups on slack (#11135 ) ## Problem Pinging groups on slack didn't work, because I didn't use the correct syntax. ## Summary of changes Use the correct syntax for pinging groups.	2025-03-10 13:27:23 +00:00
Peter Bendel	a07599949f	First version of a new benchmark to test larger OLTP workload (#11053 ) ## Problem We want to support larger tenants (regarding logical database size, number of transactions per second etc.) and should increase our test coverage of OLTP transactions at larger scale. ## Summary of changes Start a new benchmark that over time will add more OLTP tests at larger scale. This PR covers the first version and will be extended in further PRs. Also fix some infrastructure: - default for new connections and large tenants is to use connection pooler pgbouncer, however our fixture always added `statement_timeout=120` which is not compatible with pooler [see](https://neon.tech/docs/connect/connection-errors#unsupported-startup-parameter) - action to create branch timed out after 10 seconds and 10 retries but for large tenants it can take longer so use increasing back-off for retries ## Test run https://github.com/neondatabase/neon/actions/runs/13593446706	2025-03-03 15:25:48 +00:00
JC Grünhage	e52e93797f	refactor(ci): use variables for AWS account IDs (#10886 ) ## Problem Our AWS account IDs are copy-pasted all over the place. A wrong paste might only be caught late if we hardcode them, but will get flagged instantly by actionlint if we access them from github actions variables. Resolves https://github.com/neondatabase/neon/issues/10787, follow-up for https://github.com/neondatabase/neon/pull/10613. ## Summary of changes Access AWS account IDs using Github Actions variables.	2025-02-19 12:34:41 +00:00
JC Grünhage	9151d3a318	feat(ci): notify storage oncall if deploy job fails on release branch (#10865 ) ## Problem If the deploy job on the release branch doesn't succeed, the preprod deployment will not have happened. It was requested that this triggers a notification in https://github.com/neondatabase/neon/issues/10662. ## Summary of changes If we're on the release branch and the deploy job doesn't end up in "success", notify storage oncall on slack.	2025-02-18 17:20:03 +00:00
JC Grünhage	10cf5e7a38	Move cargo-deny into a separate workflow on a schedule (#10289 ) ## Problem There are two (related) problems with the previous handling of `cargo-deny`: - When a new advisory is added to rustsec that affects a dependency, unrelated pull requests will fail. - New advisories rely on pushes or PRs to be surfaced. Problems that already exist on main will only be found if we try to merge new things into main. ## Summary of changes We split out `cargo-deny` into a separate workflow that runs on all PRs that touch `Cargo.lock`, and on a schedule on `main`, `release`, `release-compute` and `release-proxy` to find new advisories.	2025-01-31 13:42:59 +00:00
Fedor Dikarev	68cf0ba439	run benchmark tests on small-metal runners (#10549 ) ## Problem Ref: https://github.com/neondatabase/cloud/issues/23314 We suspect some inconsistency in Benchmark test runs could be due to the different types of runners they land on. To align both failure rates and benchmark results, let's run them for now on `small-metal` servers and track the tests' stability. ## Summary of changes	2025-01-28 21:26:38 +00:00
Alexander Bayandin	b9464865b6	benchmarks: report successful runs to slack as well (#10393 ) ## Problem Successful `benchmarks` runs don't have enough visibility Ref https://neondb.slack.com/archives/C069Z2199DL/p1736868055094539 ## Summary of changes - Report both successful and failed `benchmarks` to Slack - Update `slackapi/slack-github-action` action	2025-01-15 13:05:05 +00:00
Peter Bendel	2f3f98a319	use OIDC role instead of AWS access keys for managing test runner (#10117 ) in periodic pagebench workflow ## Problem for background see https://github.com/neondatabase/cloud/issues/21545 ## Summary of changes use OIDC role to manage runners instead of AWS access key which needs to be periodically rotated ## logs seems to work in https://github.com/neondatabase/neon/actions/runs/12298575888/job/34322306127#step:6:1	2024-12-12 20:25:39 +00:00
a-masterov	2f3433876f	Change the channel for notification. (#10112 ) ## Problem Now notifications about failures in `pg_regress` tests run on the staging cloud instance, reach the channel `on-call-staging-stream`, while they should reach `on-call-qa-staging-stream` ## Summary of changes The channel changed.	2024-12-12 16:34:07 +00:00
a-masterov	92273b6d5e	Enable the pg_regress tests on staging for PG17 (#9978 ) ## Problem Currently, we run the `pg_regress` tests only for PG16. However, PG17 is a part of Neon and should be tested as well ## Summary of changes Modified the workflow and added a patch for PG17 enabling the `pg_regress` tests. The problem with leftovers was solved by using branches.	2024-12-09 19:30:39 +00:00
Peter Bendel	8db84d9964	new ingest benchmark (#9711 ) ## Problem We have no specific benchmark testing project migration of postgresql project with existing data into Neon. Typical steps of such a project migration are - schema creation in the neon project - initial COPY of relations - creation of indexes and constraints - vacuum analyze ## Summary of changes Add a periodic benchmark running 9 AM UTC every day. In each run: - copy a 200 GiB project that has realistic schema, data, tables, indexes and constraints from another project into - a new Neon project (7 CU fixed) - an existing tenant, (but new branch and new database) that already has 4 TiB of data - use pgcopydb tool to automate all steps and parallelize COPY and index creation - parse pgcopydb output and report performance metrics in Neon performance test database ## Logs This benchmark has been tested first manually and then as part of benchmarking.yml workflow, example run see https://github.com/neondatabase/neon/actions/runs/11757679870	2024-11-11 17:51:15 +00:00
Cihan Demirci	16c200d6d9	push images to prod ACR (#8940 ) Used `vars` for storing new non-sensitive information, changed dev secrets to vars as well, but didn't clean up any secrets. https://github.com/neondatabase/cloud/issues/16925 --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-09-07 00:20:36 +01:00
Alexander Bayandin	aa2e16f307	CI: misc cleanup & fixes (#8559 ) ## Problem A bunch of small fixes and improvements for CI, that are too small to have a separate PR for them ## Summary of changes - CI(build-and-test): fix parenthesis - CI(actionlint): fix path to workflow file - CI: remove default args from actions/checkout - CI: remove `gen3` label, using a couple `self-hosted` + `small{,-arm64}`/`large{,-arm64}` is enough - CI: prettify Slack messages, hide links behind text messages - CI(build-and-test): add more dependencies to `conclusion` job	2024-08-14 17:56:59 +01:00
Peter Bendel	2ca5ff26d7	Run a subset of benchmarking job steps on GitHub action runners in Azure - closer to the system under test (#8651 ) ## Problem Latency from one cloud provider to another one is higher than within the same cloud provider. Some of our benchmarks are latency sensitive - we run a pgbench or psql in the github action runner and the system under test is running in Neon (database project). For realistic perf tps and latency results we need to compare apples to apples and run the database client in the same "latency distance" for all tests. ## Summary of changes Move job steps that test Neon databases deployed on Azure into Azure action runners. - bench strategy variant using azure database - pgvector strategy variant using azure database - pgbench-compare strategy variants using azure database ## Test run https://github.com/neondatabase/neon/actions/runs/10314848502	2024-08-09 08:36:29 +01:00
Alexander Bayandin	e6e578821b	CI(benchmarking): set pub/sub projects for LR tests (#8483 ) ## Problem > Currently, long-running LR tests recreate endpoints every night. We'd like to have a long-running buildup of history to exercise the pageserver in this case (instead of "unit-testing" the same behavior every night). Closes #8317 ## Summary of changes - Update Postgres version for replication tests - Set `BENCHMARK_PROJECT_ID_PUB`/`BENCHMARK_PROJECT_ID_SUB` env vars to projects that were created for this purpose --------- Co-authored-by: Sasha Krassovsky <krassovskysasha@gmail.com>	2024-08-05 22:06:47 +00:00
Cihan Demirci	ff51b565d3	cicd: change Azure storage details [2/2] (#8562 ) Change Azure storage configuration to point to updated variables/secrets. Also update subscription id variable.	2024-07-31 17:42:10 +01:00
Cihan Demirci	a4df3c8488	cicd: change Azure storage details [1/2] (#8553 ) Change Azure storage configuration to point to new variables/secrets. They have the `_NEW` suffix in order not to disrupt any tests while we complete the switch.	2024-07-30 19:34:15 +00:00
Alexander Bayandin	438bacc32e	CI(neon-extra-builds): Use small-arm64 runners instead of large-arm64 (#7740 ) ## Problem There are not enough arm runners and jobs in `neon-extra-builds` workflow take about the same amount of time on a small-arm runner as on large-arm. ## Summary of changes - Switch `neon-extra-builds` workflow from `large-arm64` to `small-arm64` runners	2024-05-15 14:29:12 +03:00
Andrey Taranik	873b222080	use own arm64 gha runners (#7373 ) ## Problem Move from aws based arm64 runners to bare-metal based ## Summary of changes Changes in GitHub action workflows where `runs-on: arm64` used. More parallelism added, build time for `neon with extra platform builds` workflow reduced from 45m to 25m	2024-05-10 11:04:23 +00:00
Alexander Bayandin	30c9e145d7	check-macos-build: switch job to macos-14 (M1) (#6539 ) ## Problem - GitHub made available `macos-14` runners, and they run on M1 processors[0] - The price is the same as Intel-based runners — "macOS \| 3 or 4 (M1 or Intel) \| $0.08"[1], but runners on Apple Silicon should be significantly faster than their Intel counterparts. - Most developers who use macOS use Apple Silicon-based Macs nowadays. - [0] https://github.blog/changelog/2024-01-30-github-actions-introducing-the-new-m1-macos-runner-available-to-open-source/ - [1] https://docs.github.com/en/billing/managing-billing-for-github-actions/about-billing-for-github-actions#per-minute-rates ## Summary of changes - Run `check-macos-build` on `macos-14`	2024-02-02 10:51:20 +00:00
Alexander Bayandin	6e145a44fa	workflows/neon_extra_builds: run check-codestyle-rust & build-neon on arm64 (#5832 ) ## Problem Some developers use workstations with arm CPUs, and sometimes x86-64 code is not fully compatible with it (for example, https://github.com/neondatabase/neon/pull/5827). Although we don't have arm CPUs in the prod (yet?), it is worth having some basic checks for this architecture to have a better developer experience. Closes https://github.com/neondatabase/neon/issues/5829 ## Summary of changes - Run `check-codestyle-rust`-like & `build-neon`-like jobs on Arm runner - Add `run-extra-build-*` label to run all available extra builds	2023-11-10 12:45:41 +00:00
Alexander Bayandin	85f4514e7d	Get env var for real Azure tests from GitHub (#5662 ) ## Problem We'll need to switch `REMOTE_STORAGE_AZURE_REGION` from the current `eastus2` region to something `eu-central-1`-like. This may require changing `AZURE_STORAGE_ACCESS_KEY`. To make it possible to switch from one place (not to break a lot of builds on CI), move `REMOTE_STORAGE_AZURE_CONTAINER` and `REMOTE_STORAGE_AZURE_REGION` to GitHub Variables. See https://github.com/neondatabase/neon/settings/variables/actions ## Summary of changes - Get values for `REMOTE_STORAGE_AZURE_CONTAINER` & `REMOTE_STORAGE_AZURE_REGION` from GitHub Variables	2023-10-25 22:54:23 +01:00
Alexander Bayandin	34e39645c4	GitHub Workflows: add actionlint (#5265 ) ## Problem Add a CI pipeline that checks GitHub Workflows with https://github.com/rhysd/actionlint (it uses `shellcheck` for shell scripts in steps) To run it locally: `SHELLCHECK_OPTS=--exclude=SC2046,SC2086 actionlint` ## Summary of changes - Add `.github/workflows/actionlint.yml` - Fix actionlint warnings	2023-09-10 20:05:07 +01:00

29 Commits