rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 05:52:55 +00:00

Author	SHA1	Message	Date
JC Grünhage	e2411818ef	Add SBOMs and provenance attestations to container images (#12768 ) ## Problem Given a container image it is difficult to figure out dependencies and doesn't work automatically. ## Summary of changes - Build all rust binaries with `cargo auditable`, to allow sbom scanners to find their dependencies. - Adjust `attests` for `docker/build-push-action`, so that buildkit creates sbom and provenance attestations. - Drop `--locked` for `rustfilt`, because `rustfilt` can't build with locked dependencies[^5] ## Further details Building with `cargo auditable`[^1] embeds a dependency list into Linux, Windows, macOS and WebAssembly artifacts. A bunch of tools support discovering dependencies from this, among them `syft`[^2], which is used by the BuildKit Syft scanner[^3] plugin. This BuildKit plugin is the default[^4] used in docker for generating sbom attestations, but we're making that default explicit by referencing the container image. [^1]: https://github.com/rust-secure-code/cargo-auditable [^2]: https://github.com/anchore/syft [^3]: https://github.com/docker/buildkit-syft-scanner [^4]: https://docs.docker.com/build/metadata/attestations/sbom/#sbom-generator [^5]: https://github.com/luser/rustfilt/issues/23	2025-07-29 12:12:14 +00:00
Ruslan Talpa	0dbe551802	proxy: subzero integration in auth-broker (embedded data-api) (#12474 ) ## Problem We want to have the data-api served by the proxy directly instead of relying on a 3rd party to run a deployment for each project/endpoint. ## Summary of changes With the changes below, the proxy (auth-broker) becomes also a "rest-broker", that can be thought of as a "Multi-tenant" data-api which provides an automated REST api for all the databases in the region. The core of the implementation (that leverages the subzero library) is in proxy/src/serverless/rest.rs and this is the only place that has "new logic". --------- Co-authored-by: Ruslan Talpa <ruslan.talpa@databricks.com> Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2025-07-21 18:16:28 +00:00
Alexander Bayandin	caca08fe78	CI: rework and merge `lint-openapi-spec` and `validate-compute-manifest` jobs (#12575 ) ## Problem We have several linters that use Node.js, but they are currently set up differently, both locally and on CI. ## Summary of changes - Add Node.js to `build-tools` image - Move `compute/package.json` -> `build-tools/package.json` and add `redocly` to it `@redocly/cli` - Unify and merge into one job `lint-openapi-spec` and `validate-compute-manifest`	2025-07-16 11:08:27 +00:00
Mikhail	bc6a756f1c	ci: lint openapi specs using redocly (#12510 ) We need to lint specs for pageserver, endpoint storage, and safekeeper #0000	2025-07-09 14:29:45 +00:00
a-masterov	59e393aef3	Enable parallel execution of extension tests (#12118 ) ## Problem Extension tests were previously run sequentially, resulting in unnecessary wait time and underutilization of available CPU cores. ## Summary of changes Tests are now executed in a customizable number of parallel threads using separate database branches. This reduces overall test time by approximately 50% (e.g., on my laptop, parallel test lasts 173s, while sequential one lasts 340s) and increases the load on the pageserver, providing better test coverage. --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>	2025-07-08 11:28:39 +00:00
Peter Bendel	ca9d8761ff	Move some perf benchmarks from hetzner to aws arm github runners (#12393 ) ## Problem We want to move some benchmarks from hetzner runners to aws graviton runners ## Summary of changes Adjust the runner labels for some workflows. Adjust the pagebench number of clients to match the latency knee at 8 cores of the new instance type. Add `--security-opt seccomp=unconfined` to docker run command to bypass IO_URING EPERM error. ## New runners https://us-east-2.console.aws.amazon.com/ec2/home?region=us-east-2#Instances:instanceState=running;search=:github-unit-perf-runner-arm;v=3;$case=tags:true%5C,client:false;$regex=tags:false%5C,client:false;sort=tag:Name ## Important Notes I added the run-benchmarks label to get this tested before we merge it. [See](https://github.com/neondatabase/neon/actions/runs/15974141360) I also tested a run of pagebench with the new setup from this branch, see https://github.com/neondatabase/neon/actions/runs/15972523054 - Update: the benchmarking workflow had failures, [see](https://github.com/neondatabase/neon/actions/runs/15974141360/job/45055897591) - changed docker run command to avoid io_uring EPERM error, new run [see](https://github.com/neondatabase/neon/actions/runs/15997965633/job/45125689920?pr=12393) Update: the pagebench test run on the new runner [completed successfully](https://github.com/neondatabase/neon/actions/runs/15972523054/job/45046772556) Update 2025-07-07: the latest runs with instance store ext4 have been successful and resolved the direct I/O issues we have been seeing before in some runs. We only had one perf testcase failing (shard split) that had been flaky before. So I think we can merge this now. ## Follow up If this is merged and works successfully, we must create a separate issue to de-provision the hetzner unit-perf runners defined [here](`91a41729af/ansible/inventory/hosts_metal (L111)`)	2025-07-07 06:44:41 +00:00
Busra Kugler	2af9380962	Revert "Replace step-security maintained actions" (#12386 ) Reverts neondatabase/neon#11663 and https://github.com/neondatabase/neon/pull/11265/ Step Security is not yet approved by the Databricks team. To prevent issues during the GitHub org migration, I'll revert this PR to use the previous action instead of the Step Security maintained action.	2025-06-30 10:15:10 +00:00
Heikki Linnakangas	7916aa26e0	Stop using build-tools image in compute image build (#12306 ) The build-tools image contains various build tools and dependencies, mostly Rust-related. The compute image build used it to build compute_ctl and a few other little rust binaries that are included in the compute image. However, for extensions built in Rust (pgrx), the build used a different layer which installed the rust toolchain using rustup. Switch to using the same rust toolchain for both pgrx-based extensions and compute_ctl et al. Since we don't need anything else from the build-tools image, I switched to using the toolchain installed with rustup, and eliminated the dependency to build-tools altogether. The compute image build no longer depends on build-tools. Note: We no longer use 'mold' for linking compute_ctl et al, since mold is not included in the build-deps-with-cargo layer. We could add it there, but it doesn't seem worth it. I proposed stopping using mold altogether in https://github.com/neondatabase/neon/pull/10735, but that was rejected because 'mold' is faster for incremental builds. That doesn't matter much for docker builds however, since they're not incremental, and the compute binaries are not as large as the storage server binaries anyway.	2025-06-23 09:11:05 +00:00
Suhas Thalanki	632cde7f13	schema and github workflow for validation of compute manifest (#12069 ) Adds a schema to validate the manifest.yaml described in [this RFC](https://github.com/neondatabase/neon/blob/main/docs/rfcs/038-independent-compute-release.md) and a github workflow to test this.	2025-06-16 19:30:41 +00:00
Peter Bendel	87fc0a0374	periodic pagebench on hetzner runners (#11963 ) ## Problem - Benchmark periodic pagebench had inconsistent benchmarking results even when run with the same commit hash. Hypothesis is this was due to running on dedicated but virtualized EC instance with varying CPU frequency. - the dedicated instance type used for the benchmark is quite "old" and we increasingly get `An error occurred (InsufficientInstanceCapacity) when calling the StartInstances operation (reached max retries: 2): Insufficient capacity.` - periodic pagebench uses a snapshot of pageserver timelines to have the same layer structure in each run and get consistent performance. Re-creating the snapshot was a painful manual process (see https://github.com/neondatabase/cloud/issues/27051 and https://github.com/neondatabase/cloud/issues/27653) ## Summary of changes - Run the periodic pagebench on a custom hetzner GitHub runner with large nvme disk and governor set to defined perf profile - provide a manual dispatch option for the workflow that allows to create a new snapshot - keep the manual dispatch option to specify a commit hash useful for bi-secting regressions - always use the newest created snapshot (S3 bucket uses date suffix in S3 key, example `s3://neon-github-public-dev/performance/pagebench/shared-snapshots-2025-05-17/` - `--ignore` `test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py` in regular benchmarks run for each commit - improve perf copying snapshot by using `cp` subprocess instead of traversing tree in python ## Example runs with code in this PR: - run which creates new snapshot https://github.com/neondatabase/neon/actions/runs/15083408849/job/42402986376#step:19:55 - run which uses latest snapshot - https://github.com/neondatabase/neon/actions/runs/15084907676/job/42406240745#step:11:65	2025-05-23 09:37:19 +00:00
Alexander Bayandin	deed46015d	CI(test-images): increase timeout from 20m to 60m (#11955 ) ## Problem For some reason (unknown yet) 20m timeout is not enough for `test-images` job on arm runners. Ref: https://github.com/neondatabase/neon/actions/runs/15075321681/job/42387530399?pr=11953 ## Summary of changes - Increase the timeout from 20m to 1h	2025-05-17 06:34:54 +00:00
Christian Schwarz	32a12783fd	pageserver: batching & concurrent IO: update binary-built-in defaults; reduce CI matrix (#11923 ) Use the current production config for batching & concurrent IO. Remove the permutation testing for unit tests from CI. (The pageserver unit test matrix takes ~10min for debug builds). Drive-by-fix use of `if cfg!(test)` inside crate `pageserver_api`. It is ineffective for early-enabling new defaults for pageserver unit tests only. The reason is that the `test` cfg is only set for the crate under test but not its dependencies. So, `cargo test -p pageserver` will build `pageserver_api` with `cfg!(test) == false`. Resort to checking for feature flag `testing` instead, since all our unit tests are run with `--feature testing`. refs - `scattered-lsn` batching has been implemented and rolled out in all envs, cf https://github.com/neondatabase/neon/issues/10765 - preliminary for https://github.com/neondatabase/neon/pull/10466 - epic https://github.com/neondatabase/neon/issues/9377 - epic https://github.com/neondatabase/neon/issues/9378 - drive-by fix https://neondb.slack.com/archives/C0277TKAJCA/p1746821515504219	2025-05-14 16:30:21 +00:00
Alexander Bayandin	22290eb7ba	CI: notify relevant team about release deploy failures (#11797 ) ## Problem We notify only Storage team about failed deploys, but Compute and Proxy teams can also benefit from that ## Summary of changes - Adjust `notify-storage-release-deploy-failure` to notify the relevant team about failed deploy	2025-05-02 12:46:21 +00:00
Em Sharnoff	b48404952d	Bump vm-builder: v0.42.2 -> v0.46.0 (#11782 ) Bumped to pick up the changes from neondatabase/autoscaling#1366 — specifically including `uname` in the logs. Other changes included: * neondatabase/autoscaling#1301 * neondatabase/autoscaling#1296	2025-04-30 11:32:25 +00:00
Busra Kugler	7f8b1d79c0	Replace dorny/paths-filter with step-security maintained version (#11663 ) ## Problem Our CI/CD security tool StepSecurity maintains safer forks of popular GitHub Actions with low security scores. We're replacing dorny/paths-filter with the maintained step-security/paths-filter version to reduce risk of supply chain breaches and potential CVEs. ## Summary of changes replace ```uses: dorny/paths-filter@de90cc6fb3 ``` with ```uses: step-security/paths-filter@v3``` This PR will fix: neondatabase/cloud#26141	2025-04-29 09:02:01 +00:00
Alexander Bayandin	2b1d2a55d6	CI: fix typo oicd -> oidc (#11747 ) ## Problem It's OIDC (OpenID Connect), not OICD ## Summary of changes - Rename actions input `aws-oicd-role-arn` -> `aws-oidc-role-arn`	2025-04-28 12:44:28 +00:00
Christian Schwarz	8afb783708	feat: Direct IO for the pageserver write path (#11558 ) # Problem The Pageserver read path exclusively uses direct IO if `virtual_file_io_mode=direct`. The write path is half-finished. Here is what the various writing components use: \|what\|buffering\|flags on <br/>`v_f_io_mode`<br/>=`buffered`\|flags on <br/>`virtual_file_io_mode`<br/>=`direct`\| \|-\|-\|-\|-\| \|`DeltaLayerWriter`\| BlobWriter<BUFFERED=true> \| () \| () \| \|`ImageLayerWriter`\| BlobWriter<BUFFERED=false> \| () \| () \| \|`download_layer_file`\|BufferedWriter\|()\|()\| \|`InMemoryLayer`\|BufferedWriter\|()\|O_DIRECT\| The vehicle towards direct IO support is `BufferedWriter` which - largely takes care of O_DIRECT alignment & size-multiple requirements - double-buffering to mask latency `DeltaLayerWriter`, `ImageLayerWriter` use `blob_io::BlobWriter` , which has neither of these. # Changes ## High-Level At a high-level this PR makes the following primary changes: - switch the two layer writer types to use `BufferedWriter` & make sensitive to `virtual_file_io_mode` (via open_with_options_v2) - make `download_layer_file` sensitive to `virtual_file_io_mode` (also via open_with_options_v2) - add `virtual_file_io_mode=direct-rw` as a feature gate - we're hackish-ly piggybacking on OpenOptions's ask for write access here - this means with just `=direct` InMemoryLayer reads and writes no longer uses O_DIRECT - this is transitory and we'll remove the `direct-rw` variant once the rollout is complete (The `_v2` APIs for opening / creating VirtualFile are those that are sensitive to `virtual_file_io_mode`) The result is: \|what\|uses <br/>`BufferedWriter`\|flags on <br/>`v_f_io_mode`<br/>=`buffered`\|flags on <br/>`v_f_io_mode`<br/>=`direct`\|flags on <br/>`v_f_io_mode`<br/>=`direct-rw`\| \|-\|-\|-\|-\|-\| \|`DeltaLayerWriter`\| ~~Blob~~BufferedWriter \| () \| () \| O_DIRECT \| \|`ImageLayerWriter`\| ~~Blob~~BufferedWriter \| () \| () \| O_DIRECT \| 
\|`download_layer_file`\|BufferedWriter\|()\|()\|O_DIRECT\| \|`InMemoryLayer`\|BufferedWriter\|()\|~~O_DIRECT~~()\|O_DIRECT\| ## Code-Level The main change is: - Switch `blob_io::BlobWriter` away from its own buffering method to use `BufferedWriter`. Additional prep for upholding `O_DIRECT` requirements: - Layer writer `finish()` methods switched to use IoBufferMut for guaranteed buffer address alignment. The size of the buffers is PAGE_SZ and thereby implicitly assumed to fulfill O_DIRECT requirements. For the hacky feature-gating via `=direct-rw`: - Track `OpenOptions::write(true\|false)` in a field; bunch of mechanical churn. - Consolidate the APIs in which we "open" or "create" VirtualFile for better overview over which parts of the code use the `_v2` APIs. Necessary refactorings & infra work: - Add doc comments explaining how BufferedWriter ensures that writes are compliant with O_DIRECT alignment & size constraints. This isn't new, but should be spelled out. - Add the concept of shutdown modes to `BufferedWriter::shutdown` to make writer shutdown adhere to these constraints. - The `PadThenTruncate` mode might not be necessary in practice because I believe all layer files ever written are sized in multiples of `PAGE_SZ` and since `PAGE_SZ` is larger than the current alignment requirements (512/4k depending on platform), it won't be necessary to pad. - Some test (I believe `round_trip_test_compressed`?) required it though - [ ] TODO: decide if we want to accept that complexity; if we do then address TODO in the code to separate alignment requirement from buffer capacity - Add `set_len` (=`ftruncate`) VirtualFile operation to support the above. - Allow `BufferedWriter` to start at a non-zero offset (to make room for the summary block). Cleanups unlocked by this change: - Remove non-positional APIs from VirtualFile (e.g. 
seek, write_full, read_full) Drive-by fixes: - PR https://github.com/neondatabase/neon/pull/11585 aimed to run unit tests for all `virtual_file_io_mode` combinations but didn't because of a missing `_` in the env var. # Performance This section assesses this PR's impact on deployments with the current production setting (`=direct`) and the anticipated impact of switching to (`=direct-rw`). For `DeltaLayerWriter`, throughput under `=direct` should remain unchanged or improve slightly because the `BlobWriter`'s buffer had the same size as the `BufferedWriter`'s buffer, but it didn't have the double-buffering that `BufferedWriter` has. The `=direct-rw` mode enables direct IO; throughput should not suffer, thanks to double-buffering; benchmarks will show if this is true. The `ImageLayerWriter` was previously not doing any buffering (`BUFFERED=false`). It went straight to issuing the IO operation to the underlying VirtualFile and the buffering was done by the kernel. The switch to `BufferedWriter` under `=direct` adds an additional memcpy into the BufferedWriter's buffer. We will win back that memcpy when enabling direct IO via `=direct-rw`. A nice win from the switch to `BufferedWriter` is that ImageLayerWriter performs >=16x fewer write operations to VirtualFile (the BlobWriter performs one write per len field and one write per image value). This should save low tens of microseconds of CPU overhead from doing all these syscalls/io_uring operations, regardless of `=direct` or `=direct-rw`. Aside from problems with alignment, this write frequency without double-buffering is prohibitive if we actually have to wait for the disk, which is what will happen when we enable direct IO via (`=direct-rw`). Throughput should not suffer, thanks to BufferedWriter's double-buffering; benchmarks will show if this is true. `InMemoryLayer` at `=direct` will flip back to using buffered IO but remain on BufferedWriter. The buffered IO adds back one memcpy of CPU overhead. 
Throughput should not suffer and might improve on Pageservers that are not under memory pressure, but let's remember that we're doing the whole direct IO thing to eliminate global memory pressure as a source of perf variability. ## bench_ingest I reran `bench_ingest` on `im4gn.2xlarge` and `Hetzner AX102`. Use `git diff` with `--word-diff` or similar to see the change. General guidance on interpretation: - immediate production impact of this PR without production config change can be gauged by comparing the same `io_mode=Direct` - end state of production switched over to `io_mode=DirectRw` can be gauged by comparing old results' `io_mode=Direct` to new results' `io_mode=DirectRw` Given above guidance, on `im4gn.2xlarge` - immediate impact is a significant improvement in all cases - end state after switching has same significant improvements in all cases - ... except `ingest/io_mode=DirectRw volume_mib=128 key_size_bytes=8192 key_layout=Sequential write_delta=Yes` which only achieves `238 MiB/s` instead of `253.43 MiB/s` - this is a 6% degradation - this workload is typical for image layer creation # Refs - epic https://github.com/neondatabase/neon/issues/9868 - stacked atop - preliminary refactor https://github.com/neondatabase/neon/pull/11549 - bench_ingest overhaul https://github.com/neondatabase/neon/pull/11667 - derived from https://github.com/neondatabase/neon/pull/10063 Co-authored-by: Yuchen Liang <yuchen@neon.tech>	2025-04-24 14:57:36 +00:00
Christian Schwarz	f8100d66d5	ci: extend 'Wait for extension build to finish' timeout (#11689 ) Refs - https://neondb.slack.com/archives/C059ZC138NR/p1745427571307149	2025-04-24 08:15:08 +00:00
Christian Schwarz	2b041964b3	cover direct IO + concurrent IO in unit, regression & perf tests (#11585 ) This mirrors the production config. Thread that discusses the merits of this: - https://neondb.slack.com/archives/C033RQ5SPDH/p1744742010740569 # Refs - context https://neondb.slack.com/archives/C04BLQ4LW7K/p1744724844844589?thread_ts=1744705831.014169&cid=C04BLQ4LW7K - prep for https://github.com/neondatabase/neon/pull/11558 which adds new io mode `direct-rw` # Impact on CI turnaround time Spot-checking impact on CI timings - Baseline: [some recent main commit](https://github.com/neondatabase/neon/actions/runs/14471549758/job/40587837475) - Comparison: [this commit](https://github.com/neondatabase/neon/actions/runs/14471945087/job/40589613274) in this PR here Impact on CI turnaround time - Regression tests: - x64: very minor, sometimes better; likely in the noise - arm64: substantial 30min => 40min - Benchmarks (x86 only I think): very minor; noise seems higher than regress tests --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Alex Chi Z. <4198311+skyzh@users.noreply.github.com> Co-authored-by: Peter Bendel <peterbendel@neon.tech> Co-authored-by: Alex Chi Z <chi@neon.tech>	2025-04-17 15:53:10 +00:00
Alexander Bayandin	19bea5fd0c	CI: do not wait for tests to trigger deploy job (#11548 ) ## Problem There is too much delay between merging a PR into `main` and deploying the changes to staging ## Summary of changes - Trigger `deploy` job without waiting for `build-and-test-locally` job	2025-04-15 11:23:41 +00:00
Fedor Dikarev	9a6ace9bde	introduce new runners: unit-perf and use them for benchmark jobs (#11409 ) ## Problem Benchmark results are inconsistent on the existing small-metal runners ## Summary of changes Introduce new `unit-perf` runners and run benchmarks on them. The new hardware has a slower, but consistent, CPU frequency if run with the default governor schedutil. Thus we needed to adjust some testcases' timeouts and add some retry steps where hard-coded timeouts couldn't be increased without changing the system under test. - [wait_for_last_record_lsn](`6592d69a67/test_runner/fixtures/pageserver/utils.py (L193)`) 1000s -> 2000s - [test_branch_creation_many](https://github.com/neondatabase/neon/pull/11409/files#diff-2ebfe76f89004d563c7e53e3ca82462e1d85e92e6d5588e8e8f598bbe119e927) 1000s - [test_ingest_insert_bulk](https://github.com/neondatabase/neon/pull/11409/files#diff-e90e685be4a87053bc264a68740969e6a8872c8897b8b748d0e8c5f683a68d9f) - with back throttling disabled compute becomes unresponsive for more than 60 seconds (PG hard-coded client authentication connection timeout) - [test_sharded_ingest](https://github.com/neondatabase/neon/pull/11409/files#diff-e8d870165bd44acb9a6d8350f8640b301c1385a4108430b8d6d659b697e4a3f1) 600s -> 1200s Right now there are only 2 runners of that class, and if we decide to go with them, we have to check how many runners of that type we need, so that jobs do not get stuck waiting for one to become available. However, we have now decided to run those runners with the performance governor instead of schedutil. 
This achieves almost the same performance as the previous runners while still giving consistent results for the same commit. Related issue to activate the performance governor on these runners: https://github.com/neondatabase/runner/pull/138 ## Verification that it helps ### analyze runtimes on new runner for same commit Table of runtimes for the same commit on different runners in [run](https://github.com/neondatabase/neon/actions/runs/14417589789) \| Run \| Benchmarks (1) \| Benchmarks (2) \|Benchmarks (3) \|Benchmarks (4) \| Benchmarks (5) \| \|--------\|--------\|---------\|---------\|---------\|---------\| \| 1 \| 1950.37s \| 6374.55s \| 3646.15s \| 4149.48s \| 2330.22s \| \| 2 \| - \| 6369.27s \| 3666.65s \| 4162.42s \| 2329.23s \| \| Delta % \| - \| 0,07 % \| 0,5 % \| 0,3 % \| 0,04 % \| \| with governor performance \| 1519.57s \| 4131.62s \| - \| - \| - \| \| second run gov. perf. \| 1513.62s \| 4134.67s \| - \| - \| - \| \| Delta % \| 0,3 % \| 0,07 % \| - \| - \| - \| \| speedup gov. performance \| 22 % \| 35 % \| - \| - \| - \| \| current desktop class hetzner runners (main) \| 1487.10s \| 3699.67s \| - \| - \| - \| \| slower than desktop class \| 2 % \| 12 % \| - \| - \| - \| In summary, the runtimes for the same commit on this hardware vary by less than 1 %. --------- Co-authored-by: BodoBolero <peterbendel@neon.tech>	2025-04-15 08:21:44 +00:00
a-masterov	edc874e1b3	Use the same test image version as the compute one (#11448 ) ## Problem Changes in compute can cause errors in tests if another version of `neon-test-extensions` image is used. ## Summary of changes Use the same version of `neon-test-extensions` image as `compute` one for docker-compose based extension tests.	2025-04-04 10:13:00 +00:00
JC Grünhage	3c2bc5baba	fix(ci): run checks on release PRs (#11375 ) ## Problem Hotfix releases mean that sometimes changes in release PRs haven't been tested and linted yet. Disabling tests and lints is therefore not necessarily safe. In the future we will check whether tests have run on the same git tree already to speed things up, but for now we need to turn tests back on fully. This partially reverts: https://github.com/neondatabase/neon/pull/11272 ## Summary of changes Run checks on `.*-rc-pr` runs.	2025-04-02 14:32:53 +00:00
JC Grünhage	5cb6a4bc8b	fix(ci): use the right sha in release PRs (#11365 ) ## Problem `github.sha` contains a merge commit of `head` and `base` if we're in a PR. In release PRs, this makes no sense, because we fast-forward the `base` branch to contain the changes from `head`. Even though we correctly use `${{ github.event.pull_request.head.sha \|\| github.sha }}` to reference the git commit when building artifacts, we don't use that when checking out code, because we want to test the merge of head and base usually. In the case of release PRs, we definitely always want to test on the head sha though, because we're going to forward that, and it already has the base sha as a parent, so the merge would end up with the same tree anyway. As a side effect, not checking out `${{ github.event.pull_request.head.sha \|\| github.sha }}` also caused https://github.com/neondatabase/neon/actions/runs/13986389780/job/39173256184#step:6:49 to say `release-tag=release-compute-8187`, while https://github.com/neondatabase/neon/actions/runs/14084613121/job/39445314780#step:6:48 is talking about `build-tag=release-compute-8186` ## Summary of changes Run a few things on `github.event.pull_request.head.sha`, if we're in a release PR.	2025-03-28 11:56:24 +00:00
Fedor Dikarev	1d5d168626	impr(ci): use hetzner buckets for cache (#11364 ) ## Problem Occasionally, getting data from the GH cache can be slow, at less than 10MB/s, taking 5+ minutes to download the cache: ``` Received 20971520 of 2987085791 (0.7%), 9.9 MBs/sec Received 50331648 of 2987085791 (1.7%), 15.9 MBs/sec ... Received 1065353216 of 2987085791 (35.7%), 4.8 MBs/sec Received 1065353216 of 2987085791 (35.7%), 4.7 MBs/sec ... ``` https://github.com/neondatabase/neon/actions/runs/13956437454/job/39068664599#step:7:17 As a result, fetching the cache can take even longer than the build itself. ## Summary of changes Switch to caches that are closer to the runners and provide a stable throughput of about 70-80MB/s	2025-03-27 11:11:45 +00:00
StepSecurity Bot	88ea855cff	fix(ci): Fixing StepSecurity Flagged Issues (#11311 ) This pull request is created by [StepSecurity](https://app.stepsecurity.io/securerepo) at the request of @areyou1or0. ## Summary This pull request is created by [StepSecurity](https://app.stepsecurity.io/securerepo) at the request of @areyou1or0. Please merge the Pull Request to incorporate the requested changes. Please tag @areyou1or0 on your message if you have any questions related to the PR. ## Security Fixes ### Least Privileged GitHub Actions Token Permissions The GITHUB_TOKEN is an automatically generated secret to make authenticated calls to the GitHub API. GitHub recommends setting minimum token permissions for the GITHUB_TOKEN. - [GitHub Security Guide](https://docs.github.com/en/actions/security-guides/automatic-token-authentication#using-the-github_token-in-a-workflow) - [The Open Source Security Foundation (OpenSSF) Security Guide](https://github.com/ossf/scorecard/blob/main/docs/checks.md#token-permissions) ### Pinned Dependencies GitHub Action tags and Docker tags are mutable. This poses a security risk. GitHub's Security Hardening guide recommends pinning actions to full length commit. - [GitHub Security Guide](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#using-third-party-actions) - [The Open Source Security Foundation (OpenSSF) Security Guide](https://github.com/ossf/scorecard/blob/main/docs/checks.md#pinned-dependencies) ### Harden Runner [Harden-Runner](https://github.com/step-security/harden-runner) is an open-source security agent for the GitHub-hosted runner to prevent software supply chain attacks. 
It prevents exfiltration of credentials, detects tampering of source code during build, and enables running jobs without `sudo` access. See how popular open-source projects use Harden-Runner [here](https://docs.stepsecurity.io/whos-using-harden-runner). <details> <summary>Harden runner usage</summary> You can find link to view insights and policy recommendation in the build log <img src="https://github.com/step-security/harden-runner/blob/main/images/buildlog1.png?raw=true" width="60%" height="60%"> Please refer to [documentation](https://docs.stepsecurity.io/harden-runner) to find more details. </details> will fix https://github.com/neondatabase/cloud/issues/26141	2025-03-19 16:44:22 +00:00
JC Grünhage	aedeb37220	fix(ci): put the BUILD_TAG of the upcoming release into RC PR artifacts (#11304 ) ## Problem #11061 changed how artifacts for releases are built, by reusing/retagging the artifacts from release PRs. This resulted in the BUILD_TAG that's baked into the images to not be as expected. Context: https://neondb.slack.com/archives/C08JBTT3R1Q/p1742333300129069 ## Summary of changes Set BUILD_TAG to the release tag of the upcoming release when running inside release PRs.	2025-03-19 09:34:28 +00:00
JC Grünhage	99639c26b4	fix(ci): update build-tools image references (#11293 ) ## Problem https://github.com/neondatabase/neon/pull/11210 migrated pushing images to ghcr. Unfortunately, it was incomplete in using images from ghcr, which resulted in a few places referencing the ghcr build-tools image, while trying to use docker hub credentials. ## Summary of changes Use build-tools image from ghcr consistently.	2025-03-18 15:21:22 +00:00
JC Grünhage	eb6efda98b	impr(ci): move some kinds of tests to PR runs only (#11272 ) ## Problem The pipelines after release merges are slower than they need to be at the moment. This is because some kinds of tests/checks run on all kinds of pipelines, even though they only matter in some of those. ## Summary of changes Run `check-codestyle-{rust,python,jsonnet}`, `build-and-test-locally` and `trigger-e2e-tests` only on regular PRs, not release PR or pushes to main or release branches.	2025-03-18 13:49:34 +00:00
JC Grünhage	2dfff6a2a3	impr(ci): use ghcr.io as the default container registry (#11210 ) ## Problem Docker Hub has new rate limits coming up, and to avoid problems coming with those we're switching to GHCR. ## Summary of changes - Push images to GHCR initially and distribute them from there - Use images from GHCR in docker-compose	2025-03-18 11:30:49 +00:00
JC Grünhage	486ffeef6d	fix(ci): don't have neon-test-extensions release tag push depend on compute-node-image build (#11281 ) ## Problem Failures like https://github.com/neondatabase/neon/actions/runs/13901493608/job/38896940612?pr=11272 are caused by the dependency on `compute-node-image`, which was wrong on release jobs anyway. ## Summary of changes Remove dependency on `compute-node-image` from the job `add-release-tag-to-neon-test-extension-image`.	2025-03-17 16:31:49 +00:00
JC Grünhage	066b0a1be9	fix(ci): correctly push neon-test-extensions in releases and to ghcr (#11225 ) ## Problem `ef0d4a48a` adjusted how we build container images and how we push them, and the neon-test-extensions image was overlooked. Additionally, it was also missed in `1f0dea9a1`, which pushed our container images to GHCR. ## Summary of changes Push neon-test-extensions to GHCR and also push release tags for it.	2025-03-13 18:18:55 +00:00
JC Grünhage	ef0d4a48a8	Reuse artifacts from release PRs (#11061 ) ## Problem When we release our components, we perform builds in the release PR, then test the components, then merge the PR, and then build everything again, run tests again, and only then start deployments. To speed things up, we want to perform builds and run tests in the PR, and start deployments using the existing artifacts from the release PR. To make that possible, we need to have both CI pipelines running on the same commit hash, which requires fast-forwarding release. That only works if we have a commit in the PR that has the current release branch state as an ancestor. ## Summary of changes - Changes to release PR creation: - Remove templates and automatic bodies for release PRs. The previous template wasn't used anymore, and the automatic body we created in the pipeline didn't contain any useful content anymore after the changes here. - Make it possible to select the source branch. For releases that aren't cut from `main`, like https://github.com/neondatabase/neon/pull/11051, we need a way to trigger the new flow from a different branch. - Determine `release-branch` automatically from the component name instead of passing that as well. - Changes to the merge queue job: - Rename `get-changed-files` to `meta` in preparation of additional data being fetched as part of that job - Fail the merge queue if we're trying to merge into a branch other than main - this is to prevent non-fast-forward merges. - Label PRs to branches other than main as `fast-forward`, to trigger the fast-forward job - Add a fast-forward job that can be triggered with the `fast-forward` label that performs a fast-forward merge. This only happens if the PR has `mergeable_state == clean`, i.e. CI has passed. - Build and Test on releases now skips building images, skips testing images and skips triggering e2e tests. 
We add new tags to the images from the release PR to tag them as release images, and we push them to the prod registries.	2025-03-12 21:00:59 +00:00
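The fast-forward requirement in commit ef0d4a48a8 (the PR must contain a commit that has the current release branch state as an ancestor) can be illustrated with a small sketch. This is not the repository's CI code: the `parents` dictionary, the commit names, and both function names are hypothetical and only model the ancestry rule on a toy commit graph.

```python
from collections import deque


def is_ancestor(parents, ancestor, descendant):
    """Return True if `ancestor` is reachable from `descendant` via parent links.

    `parents` is a hypothetical mapping: commit id -> list of parent commit ids.
    A commit counts as its own ancestor here, matching fast-forward semantics.
    """
    seen = set()
    queue = deque([descendant])
    while queue:
        commit = queue.popleft()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(parents.get(commit, []))
    return False


def can_fast_forward(parents, branch_tip, pr_head):
    """A branch can be fast-forwarded to `pr_head` only if its tip is an ancestor."""
    return is_ancestor(parents, branch_tip, pr_head)


# Toy history: the release branch tip is "r2"; main has moved on to "m1".
history = {
    "r1": [],
    "r2": ["r1"],
    "m1": ["r1"],
    "pr_plain": ["m1"],        # built only on main: "r2" is not an ancestor
    "pr_merge": ["m1", "r2"],  # merge commit that includes the release state
}
```

With this toy history, `can_fast_forward(history, "r2", "pr_plain")` is false, while the merge commit `pr_merge` satisfies the condition because `r2` is one of its parents.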
Christian Schwarz	083a30b1e2	storage broker: disable deploy by default (#11172 ) context - https://github.com/neondatabase/cloud/issues/23486#issuecomment-2711587222 - companion infra.git PR: https://github.com/neondatabase/infra/pull/3249	2025-03-11 19:45:06 +00:00
JC Grünhage	f17931870f	fix(ci): use <!subteam^ID> syntax for pinging groups on slack (#11135 ) ## Problem Pinging groups on slack didn't work, because I didn't use the correct syntax. ## Summary of changes Use the correct syntax for pinging groups.	2025-03-10 13:27:23 +00:00
JC Grünhage	94e6897ead	fix(ci): make deploy job depend on pushing images to dev registries (#11089 ) ## Problem If an image fails to push to dev registries, we shouldn't trigger the deploy job, because that depends on images existing in dev registries. To ensure this is the case, the deploy job needs to depend on pushing to dev registries. ## Summary of changes Make `deploy` depend on `push-neon-image-dev` and `push-compute-image-dev`.	2025-03-05 14:28:43 +00:00
Misha Sakhnov	625c526bdd	ci: create multiarch vm images (#11017 ) ## Problem We build compute-nodes as multi-arch images, but not the vm-compute-nodes. The PR adds multiarch vm images the same way as in autoscaling repo. ## Summary of changes Add architecture to the matrix for vm compute build steps Add merge job --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-03-03 11:47:09 +00:00
a-masterov	7607686f25	Make test extensions upgrade work with absent images (#11036 ) ## Problem CI does not pass for the compute release due to the absence of some images ## Summary of changes Now we use the images from the old non-compute releases for non-compute images	2025-02-28 11:16:22 +00:00
JC Grünhage	7ed236e17e	fix(ci): push prod container images again (#11020 ) ## Problem https://github.com/neondatabase/neon/pull/10841 made building compute and neon images optional on releases that don't need them. The `push-<component>-image-prod` jobs had transitive dependencies that were skipped due to that, causing the images not to be pushed to production registries. ## Summary of changes Add `!failure() && !cancelled() &&` to the beginning of the conditions for these jobs to ensure they run even if some of their transitive dependencies are skipped.	2025-02-27 16:16:14 +00:00
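The effect of prefixing a job condition with `!failure() && !cancelled() &&`, as described in commit 7ed236e17e, can be modeled roughly as follows. This is a simplified sketch of GitHub Actions status-check semantics, not the workflow itself: the function names and the `needs` dictionary are hypothetical, and the remainder of each real job condition is omitted.

```python
def runs_by_default(needs):
    """Model of the implicit success() check: every dependency must have succeeded.

    `needs` is a hypothetical mapping: job name -> "success", "failure" or "skipped".
    """
    return all(result == "success" for result in needs.values())


def runs_with_relaxed_condition(needs, workflow_cancelled=False):
    """Model of `!failure() && !cancelled()`: skipped dependencies are tolerated."""
    any_failed = any(result == "failure" for result in needs.values())
    return not any_failed and not workflow_cancelled


# On a release that skips the compute image build, one dependency is skipped.
release_needs = {"build-neon-image": "success", "build-compute-image": "skipped"}
```

Under this model the push job is skipped by default for `release_needs`, but runs with the relaxed condition, while a genuine failure or a cancelled workflow still prevents it.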
JC Grünhage	c92a36740b	fix(ci): support PR-on-top-of-PR usecase again (#11013 ) ## Problem https://github.com/neondatabase/neon/pull/10841 broke CI on PRs that aren't based on main or a release branch but want to merge into another PR. ## Summary of changes Replace `run-kind=pr-main` with `run-kind=pr`, so that all PRs that aren't release PRs are treated equally.	2025-02-27 09:05:15 +00:00
JC Grünhage	8dfa8f0b94	feat(ci): don't build storage on compute-releases and vice versa (#10841 ) ## Problem Release CI is slow, because we're doing unnecessary work, for example building compute images on storage releases and vice versa. ## Summary of changes - Extract tag generation into reusable workflow and extend it with fetching of previous component releases - Don't build neon images on compute releases and don't build compute images on proxy and storage releases - Reuse images from previous releases for tests on branches where we don't build those images ## Open questions - We differentiate between `TAG` and `COMPUTE_TAG` in a few places, but we don't differentiate between storage and proxy releases. Since they use the same image, this will continue to work, but I'm not sure this is what we want.	2025-02-26 17:17:26 +00:00
Arthur Petukhovsky	3684162d9f	Bump vm-builder v0.37.1 -> v0.42.2 (#10981 ) Bump version to pick up changes introduced in https://github.com/neondatabase/autoscaling/pull/1286 It's better to have a compute release for this change first, because: - vm-runner changes kernel loglevel from 7 to 6 - vm-builder has a change to bring it back to 7 after startup Previous update: https://github.com/neondatabase/neon/pull/10015	2025-02-26 09:19:19 +00:00
JC Grünhage	1f0dea9a1a	feat(ci): push container images to ghcr.io as well (#10945 ) ## Problem There are new rate limits coming on Docker Hub. To reduce our reliance on Docker Hub and the problems the limits are going to cause for us, we want to prepare for this by also pushing our container images to ghcr.io. ## Summary of changes Push our images to ghcr.io as well and not just Docker Hub.	2025-02-24 17:45:23 +00:00
JC Grünhage	aad817d806	refactor(ci): use reusable push-to-container-registry workflow for pinning the build-tools image (#10890 ) ## Problem Pinning build tools still replicated the ACR/ECR/Docker Hub login and pushing, even though we have a reusable workflow for this. Was mentioned as a TODO in https://github.com/neondatabase/neon/pull/10613. ## Summary of changes Reuse `_push-to-container-registry.yml` for pinning the build-tools images.	2025-02-19 17:26:09 +00:00
JC Grünhage	e52e93797f	refactor(ci): use variables for AWS account IDs (#10886 ) ## Problem Our AWS account IDs are copy-pasted all over the place. A wrong paste might only be caught late if we hardcode them, but will get flagged instantly by actionlint if we access them from github actions variables. Resolves https://github.com/neondatabase/neon/issues/10787, follow-up for https://github.com/neondatabase/neon/pull/10613. ## Summary of changes Access AWS account IDs using Github Actions variables.	2025-02-19 12:34:41 +00:00
JC Grünhage	9151d3a318	feat(ci): notify storage oncall if deploy job fails on release branch (#10865 ) ## Problem If the deploy job on the release branch doesn't succeed, the preprod deployment will not have happened. It was requested that this triggers a notification in https://github.com/neondatabase/neon/issues/10662. ## Summary of changes If we're on the release branch and the deploy job doesn't end up in "success", notify storage oncall on slack.	2025-02-18 17:20:03 +00:00
Alexander Bayandin	3e8bf2159d	CI(build-and-test): run `benchmarks` after `deploy` job (#10791 ) ## Problem `benchmarks` is a long-running and non-blocking job. If, on Staging, a deploy-blocking job fails, restarting it requires cancelling any running `benchmarks` jobs, which is a waste of CI resources and requires a couple of extra clicks for a human to do. Ref: https://neondb.slack.com/archives/C059ZC138NR/p1739292995400899 ## Summary of changes - Run `benchmarks` after `deploy` job - Handle `benchmarks` run in PRs with `run-benchmarks` label but without `deploy` job.	2025-02-13 22:03:47 +00:00
JC Grünhage	e38694742c	fix(ci): don't try pushing to prod container registries from main (#10795 ) ## Problem https://github.com/neondatabase/neon/pull/10613 changed how images are pushed, and there was a small mismatch between the github workflow and the script generating what to push where. This resulted in the workflow trying to push images to prod registries from the main branch, even though we don't do that and therefore didn't generate a mapping for those registries in the script that decides what to push where. This misconception happened because promote-images-dev pushed to dev registries, and promote-images-prod pushed to prod registries, but promote-images-prod also updated the latest tag in the dev registries if and only if we are on the main branch. This last bit is why the push-<component>-image-prod jobs were trying to run on the main branch. ## Summary of changes Don't try pushing to prod registries from the main branch.	2025-02-12 20:26:05 +00:00
JC Grünhage	b77dd66bc4	refactor(ci): overhaul container image pushing (#10613 ) ## Problem Retagging container images and pushing container images taken from one registry to another is very tangled up with artifact building and not separated by component. This makes not building compute for storage releases and vice versa pretty tricky. To enable that, I want to clean up retagging and pushing of container images and then continue on making the pipelines for releases leaner by not building unnecessary things. ## Summary of changes - Add a reusable workflow that can push to ACR, ECR and Docker Hub, while being very flexible in terms of source and target images. This allows for retagging and pushing images between container registries. - Stop pushing images to registries other than Docker Hub in the jobs that build the images - Split image pushing into 4 different jobs (not mentioning special cases): - neon-dev - neon-prod - compute-dev - compute-prod ## TODO - Consider also using this for `pin-build-tools-image`, as it's basically another instance of the same thing. ## Known limitations - The ECR part of this workflow supports authenticating to multiple AWS accounts and therefore multiple ECR endpoints, but the ACR part only supports one Azure account. If someone with more knowledge of Azure can tell me whether an equivalent to https://github.com/aws-actions/amazon-ecr-login?tab=readme-ov-file#login-to-ecr-on-multiple-aws-accounts is easily possible, that'd be great. - The `image_map` input is a bit complex. 
It expects something along the lines of ``` { "docker.io/neondatabase/compute-node-v14:13196061314": [ "docker.io/neondatabase/compute-node-v14:13196061314", "369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v14:13196061314", "neoneastus2.azurecr.io/neondatabase/compute-node-v14:13196061314" ], "docker.io/neondatabase/compute-node-v15:13196061314": [ "docker.io/neondatabase/compute-node-v15:13196061314", "369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v15:13196061314", "neoneastus2.azurecr.io/neondatabase/compute-node-v15:13196061314" ] } ``` to map from source to target image. We have a small python step to generate this map for the 4 main image pushing jobs. The concrete example is taken from https://github.com/neondatabase/neon/actions/runs/13196061314/job/36838584098?pr=10613#step:3:6 and shortened to two images.	2025-02-12 17:54:51 +00:00
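A minimal sketch of how a source-to-target map of the shape shown in commit b77dd66bc4 could be generated. This is not the actual Python step from the workflow: the function name `build_image_map` and its parameters are hypothetical, and the registry prefixes are copied from the example in the commit message.

```python
import json


def build_image_map(components, tag, source_prefix, target_prefixes):
    """Map each source image reference to the list of target references.

    All parameters are illustrative assumptions:
    - components: image names such as "compute-node-v14"
    - tag: the image tag shared by source and targets
    - source_prefix / target_prefixes: registry host plus optional namespace
    """
    image_map = {}
    for component in components:
        source = f"{source_prefix}/{component}:{tag}"
        image_map[source] = [
            f"{prefix}/{component}:{tag}" for prefix in target_prefixes
        ]
    return image_map


example_map = build_image_map(
    components=["compute-node-v14", "compute-node-v15"],
    tag="13196061314",
    source_prefix="docker.io/neondatabase",
    target_prefixes=[
        "docker.io/neondatabase",
        "369495373322.dkr.ecr.eu-central-1.amazonaws.com",
        "neoneastus2.azurecr.io/neondatabase",
    ],
)

# Serialize to JSON, the form in which the commit message shows the map.
example_json = json.dumps(example_map, indent=2)
```

Treating each registry as a plain prefix keeps the sketch small and also covers the ECR targets, which in the example carry no `neondatabase/` namespace.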
Heikki Linnakangas	8107140f7f	Refactor compute dockerfile (#10371 ) Refactor how extensions are built in compute Dockerfile 1. Rename some of the extension layers, so that names correspond more precisely to the upstream repository name and the source directory name. For example, instead of "pg-jsonschema-pg-build", spell it "pg_jsonschema-build". Some of the layer names had the extra "pg-" part, and some didn't; harmonize on not having it. And use an underscore if the upstream project name uses an underscore. 2. Each extension now consists of two dockerfile targets: [extension]-src and [extension]-build. By convention, the -src target downloads the sources and applies any neon-specific patches if necessary. The source tarball is downloaded and extracted under /ext-src. For example, the 'pgvector' extension creates the following files and directory: /ext-src/pgvector.tar.gz # original tarball /ext-src/pgvector.patch # neon-specific patch, copied from patches/ dir /ext-src/pgvector-src/ # extracted tarball, with patch applied This separation avoids re-downloading the sources every time the extension is recompiled. The 'extension-tests' target also uses the [extension]-src layers, by copying the /ext-src/ dirs from all the extensions together into one image. This refactoring came about when I was experimenting with different ways of splitting up the Dockerfile so that each extension would be in a separate file. That's not part of this PR yet, but this is a good step in modularizing the extensions.	2025-02-04 10:35:43 +00:00


426 Commits