cargo test (or nextest) might rebuild the binaries with different
features/flags, so do install immediately after the build. Triggered by the
particular case of nextest invocations missing $CARGO_FEATURES, which recompiled
safekeeper without 'testing' feature which made python tests needing
it (failpoints) not run in the CI.
Also add CARGO_FEATURES to the nextest runs anyway because there doesn't seem to
be an important reason not to.
## Problem
halfvec data type was introduced in pgvector 0.7.0 and is popular
because
it allows smaller vectors, smaller indexes and potentially better
performance.
So far we have not tested halfvec in our periodic performance tests.
This PR adds halfvec indexing and halfvec queries to the test.
## Problem
I've bumped `docker/setup-buildx-action` in #8042 because I wasn't able
to reproduce the issue from #7445.
But now the issue appears again in
https://github.com/neondatabase/neon/actions/runs/9514373620/job/26226626923?pr=8059
The steps to reproduce aren't clear, it required
`docker/setup-buildx-action@v3` and rebuilding the image without cache,
probably
## Summary of changes
- Downgrade `docker/setup-buildx-action@v3`
to `docker/setup-buildx-action@v2`
## Problem
We have some amount of outdated action in the CI pipeline, GitHub
complains about some of them.
## Summary of changes
- Update `actions/checkout@1` (a really old one) in
`vm-compute-node-image`
- Update `actions/checkout@3` in `build-build-tools-image`
- Update `docker/setup-buildx-action` in all workflows / jobs, it was
downgraded in https://github.com/neondatabase/neon/pull/7445, but it
it seems it works fine now
## Problem
The merging of #7818 caused the problem with the docker-compose file.
Running docker compose is now impossible due to the unavailability of
the neon-test-extensions:latest image
## Summary of changes
Fix the problem:
Add the latest tag to the neon-test-extensions image and use the
profiles feature of the docker-compose file to avoid loading the
neon-test-extensions container if it is not needed.
## Problem
We need automated tests of extensions shipped with Neon to detect
possible problems.
## Summary of changes
A new image neon-test-extensions is added. Workflow changes to test the
shipped extensions are added as well.
Currently, the regression tests, shipped with extensions are in use.
Some extensions, i.e. rum, timescaledb, rdkit, postgis, pgx_ulid, pgtap,
pg_tiktoken, pg_jsonschema, pg_graphql, kq_imcx, wal2json_2_5 are
excluded due to problems or absence of internal tests.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Reverts neondatabase/neon#7956
Rationale: compute incompatibilties
Slack thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1718011276665839?thread_ts=1718008160.431869&cid=C033RQ5SPDH
Relevant quotes from @hlinnaka
> If we go through with the current release candidate, but the compute
is pinned, people who create new projects will get that warning, which
is silly. To them, it looks like the ICU version was downgraded, because
initdb was run with newer version.
> We should upgrade the ICU version eventually. And when we do that,
users with old projects that use ICU will start to see that warning. I
think that's acceptable, as long as we do homework, notify users, and
communicate that properly.
> When do that, we should to try to upgrade the storage and compute
versions at roughly the same time.
## Problem
Due to the upcoming End of Life (EOL) for Debian 11, we need to upgrade
the base OS for Pageservers from Debian 11 to Debian 12 for security
reasons.
When deploying a new Pageserver on Debian 12 with the same binary built
on
Debian 11, we encountered the following errors:
```
could not execute operation: pageserver error, status: 500,
msg: Command failed with status ExitStatus(unix_wait_status(32512)):
/usr/local/neon/v16/bin/initdb: error while loading shared libraries:
libicuuc.so.67: cannot open shared object file: No such file or directory
```
and
```
could not execute operation: pageserver error, status: 500,
msg: Command failed with status ExitStatus(unix_wait_status(32512)):
/usr/local/neon/v14/bin/initdb: error while loading shared libraries:
libssl.so.1.1: cannot open shared object file: No such file or directory
```
These issues occur when creating new projects.
## Summary of changes
- To address these issues, we configured PostgreSQL build to use
statically linked OpenSSL and ICU libraries.
- This resolves the missing shared library errors when running the
binaries on Debian 12.
Closes: https://github.com/neondatabase/cloud/issues/12648
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [x] Do not forget to reformat commit message to not include the above
checklist
## Problem
We don't carry run-* labels from external contributors' PRs to
ci-run/pr-* PRs. This is not really convenient.
Need to sync labels in approved-for-ci-run workflow.
## Summary of changes
Added the procedure of transition of labels from the original PR
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
We use ubuntu-latest as a default OS for running jobs. It can cause
problems due to instability, so we should use the LTS version of Ubuntu.
## Summary of changes
The image ubuntu-latest was changed with ubuntu-22.04 in workflows.
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
## Problem
We want to regularly verify the performance of pgvector HNSW parallel
index builds and parallel similarity search using HNSW indexes.
The first release that considerably improved the index-build parallelism
was pgvector 0.7.0 and we want to make sure that we do not regress by
our neon compute VM settings (swap, memory over commit, pg conf etc.)
## Summary of changes
Prepare a Neon project with 1 million openAI vector embeddings (vector
size 1536).
Run HNSW indexing operations in the regression test for the various
distance metrics.
Run similarity queries using pgbench with 100 concurrent clients.
I have also added the relevant metrics to the grafana dashboards pgbench
and olape
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
A title for automatic proxy release PRs is `Proxy release`, and for
storage & compute, it's just `Release`
## Summary of changes
- Amend PR title for Storage & Compute releases to "Storage & Compute
release"
## Problem
We don't build our docker images for ARM arch, and that makes it harder
to run images on ARM (on MacBooks with Apple Silicon, for example).
## Summary of changes
- Build `neondatabase/neon` for ARM and create a multi-arch image
- Build `neondatabase/compute-node-vXX` for ARM and create a multi-arch
image
- Run `test-images` job on ARM as well
## Problem
Currently, `latest` tag is added to the images in several cases:
```
github.ref_name == 'main' || github.ref_name == 'release' || github.ref_name == 'release-proxy'
```
This leads to a race; the `latest` tag jumps back and forth depending on
the branch that has built images.
## Summary of changes
- Do not push `latest` images to prod ECR (we don't use it)
- Use `docker buildx imagetools` instead of `crane` for tagging images
- Unify `vm-compute-node-image` job with others and use dockerhub as a
first source for images (sync images with ECR)
- Tag images with `latest` only for commits in `main`
## Problem
`report-benchmarks-failures` got skipped if a dependent job fails.
## Summary of changes
- Fix the if-condition by adding `&& failures()` to it; it'll make the
job run if the dependent job fails.
## Problem
There are not enough arm runners and jobs in `neon-extra-builds` workflow
take about the same amount of time on a small-arm runner as on
large-arm.
## Summary of changes
- Switch `neon-extra-builds` workflow from `large-arm64` to
`small-arm64` runners
## Problem
`report-benchmarks-failures` job is triggered for any failure in the CI
pipeline, but we need it to be triggered only for failed `benchmarks`
job
## Summary of changes
- replace `failure()` with `needs.benchmarks.result == 'failure'` in the
condition
## Problem
`benchmarks` job that we run on the main doesn't block anything, so it's
easy to miss its failure.
Ref https://github.com/neondatabase/cloud/issues/13087
## Summary of changes
- Add `report-benchmarks-failures` job that report failures of
`benchmarks` job to a Slack channel
While switching to use nextest with the repository in f28bdb6, we had
not noticed that it doesn't yet support running doctests. Run the doc
tests before other tests.
## Problem
Move from aws based arm64 runners to bare-metal based
## Summary of changes
Changes in GitHub action workflows where `runs-on: arm64` used. More
parallelism added, build time for `neon with extra platform builds`
workflow reduced from 45m to 25m
## Problem
Ref https://neondb.slack.com/archives/C036U0GRMRB/p1688122168477729
## Summary of changes
- Add sha from postgres repo into postgres version string (via
`--with-extra-version`)
- Add a test that Postgres version matches the expected one
- Remove build-time hard check and allow only related tests to fail
## Problem
Benchmarks don't use the vectored read path.
## Summary of changes
* Update the benchmarks to use the vectored read path for both singular
and vectored gets.
* Disable validation for the benchmarks
## Problem
We are currently supporting two read paths. No bueno.
## Summary of changes
High level: use vectored read path to serve get page requests - gated by
`get_impl` config
Low level:
1. Add ps config, `get_impl` to specify which read path to use when
serving get page requests
2. Fix base cached image handling for the vectored read path. This was
subtly broken: previously we
would not mark keys that went past their cached lsn as complete. This is
a self standing change which
could be its own PR, but I've included it here because writing separate
tests for it is tricky.
3. Fork get page to use either the legacy or vectored implementation
4. Validate the use of vectored read path when serving get page requests
against the legacy implementation.
Controlled by `validate_vectored_get` ps config.
5. Use the vectored read path to serve get page requests in tests (with
validation).
## Note
Since the vectored read path does not go through the page cache to read
buffers, this change also amounts to a removal of the buffer page cache. Materialized page cache
is still used.
- Cleanup part for `docker/setup-buildx-action` started to fail with the following error (for no obvious reason):
```
/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:175
throw new Error(`Path Validation Error: Path(s) specified in the action for caching do(es) not exist, hence no cache is being saved.`);
^
Error: Path Validation Error: Path(s) specified in the action for caching do(es) not exist, hence no cache is being saved.
at Object.rejected (/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:175:1)
at Generator.next (<anonymous>)
at fulfilled (/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:29:1)
```
- Downgrade `docker/setup-buildx-action` from v3 to v2
## Problem
For PRs, by default, we check out a phantom merge commit (merge a branch
into the main), but using a real branches head when finding `build-tools`
image tag.
## Summary of changes
- Change `COMMIT_SHA` to use `${{ github.sha }}` instead of `${{
github.event.pull_request.head.sha }}` for PRs
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
## Problem
`create-test-report` job takes more than 8 minutes, the longest step is
uploading Allure report to S3:
Before:
```
+ aws s3 cp --recursive --only-show-errors /tmp/pr-7362-1712847045/report s3://neon-github-public-dev/reports/pr-7362/8647730612
real 6m10.572s
user 6m37.717s
sys 1m9.429s
```
After:
```
+ s5cmd --log error cp '/tmp/pr-7362-1712858221/report/*' s3://neon-github-public-dev/reports/pr-7362/8650636861/
real 0m9.698s
user 1m9.438s
sys 0m6.419s
```
## Summary of changes
- Add `s5cmd`(https://github.com/peak/s5cmd) to build-tools image
- Use `s5cmd` instead of `aws s3` for uploading Allure reports
## Problem
`build-build-tools-image` workflow is designed to be run only in one
example per the whole repository. Currently, the job gets cancelled if a
newer one is scheduled, here's an example:
https://github.com/neondatabase/neon/actions/runs/8419610607
## Summary of changes
- Explicitly set `cancel-in-progress: false` for all jobs that aren't
supposed to be cancelled
## Problem
```
Could not resolve host: console.stage.neon.tech
```
## Summary of changes
- replace `console.stage.neon.tech` with `console-stage.neon.build`
## Problem
During Nightly Benchmarks, we want to collect pgbench results for
sharded tenants as well.
## Summary of changes
- Add pre-created sharded project for pgbench
## Problem
Proxy release to a preprod automatically triggers a deployment of storage
controller (`deployStorageController=true` by default)
## Summary of changes
- Set `deployStorageController=false` for proxy releases to preprod
- Set explicitly `deployStorageController=true` for storage releases to
preprod and prod
## Problem
We don't want to run an excessive e2e test suite on neonvm if there are
no relevant changes.
## Summary of changes
- Check PR diff and if there are no relevant compute changes (in
`vendor/`, `pgxn/`, `libs/vm_monitor` or `Dockerfile.compute-node`
- Switch job from `small` to `ubuntu-latest` runner to make it possible
to use GitHub CLI
## Problem
We want to deploy releases to a preprod region first to perform required
checks
## Summary of changes
- Deploy `release-XXX` / `release-proxy-YYY` docker tags to a preprod region
These test runs usually take 20-30 minutes. if something hangs, we see
actions proceeding for several hours: it's more convenient to have them
time out sooner so that we notice that something has hung faster.
All of production is using it now as of
https://github.com/neondatabase/aws/pull/1121
The change in `flaky_tests.py` resets the flakiness detection logic.
The alternative would have been to repeat the choice of io engine in
each test name, which would junk up the various test reports too much.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
The current implementation of `deploy-prod` workflow doesn't allow to
run parallel deploys on Storage and Proxy.
## Summary of changes
- Call `deploy-proxy-prod` workflow that deploys only Proxy components,
and that can be run in parallel with `deploy-prod` for Storage.
## Problem
We build compute-tools binary twice — in `compute-node` and in
`compute-tools` jobs, and we build them slightly differently:
- `cargo build --locked --profile release-line-debug-size-lto`
(previously in `compute-node`)
- `mold -run cargo build -p compute_tools --locked --release`
(previously in `compute-tools`)
Before:
- compute-node: **6m 34s**
- compute-tools (as a separate job): **7m 47s**
After:
- compute-node: **7m 34s**
- compute-tools (as a separate step, within compute-node job): **5s**
## Summary of changes
- Move compute-tools image creation to `Dockerfile.compute-node`
- Delete `Dockerfile.compute-tools`
## Problem
`pin-build-tools-image` job doesn't have access to secrets and thus
fails. Missed in the original PR[0]
- [0] https://github.com/neondatabase/neon/pull/6795
## Summary of changes
- pass secrets to `pin-build-tools-image` job
## Problem
Currently, after updating `Dockerfile.build-tools` in a PR, it requires
a manual action to make it `pinned`, i.e., the default for everyone. It
also makes all opened PRs use such images (even created in the PR and
without such changes).
This PR overhauls the way we build and use `build-tools` image (and uses
the image from Docker Hub).
## Summary of changes
- The `neondatabase/build-tools` image gets tagged with the latest
commit sha for the `Dockerfile.build-tools` file
- Each PR calculates the tag for `neondatabase/build-tools`, tries to
pull it, and rebuilds the image with such tag if it doesn't exist.
- Use `neondatabase/build-tools` as a default image
- When running on `main` branch — create a `pinned` tag and push it to
ECR
- Use `concurrency` to ensure we don't build `build-tools` image for the
same commit in parallel from different PRs
## Problem
In the original PR[0], I've missed a couple of `release` occurrences
that should also be handled for `release-proxy` branch
- [0] https://github.com/neondatabase/neon/pull/6797
## Summary of changes
- Add handling for `release-proxy` branch to allure report
- Add handling for `release-proxy` branch to e2e tests malts.com
## Problem
To "build" a compute image that doesn't have anything new, kaniko takes
13m[0], docker buildx does it in 5m[1].
Also, kaniko doesn't fully support bash expressions in the Dockerfile
`RUN`, so we have to use different workarounds for this (like `bash -c
...`).
- [0]
https://github.com/neondatabase/neon/actions/runs/8011512414/job/21884933687
- [1]
https://github.com/neondatabase/neon/actions/runs/8008245697/job/21874278162
## Summary of changes
- Use docker buildx to build `compute-node` images
- Use docker buildx to build `neon-image` image
- Use docker buildx to build `compute-tools` image
- Use docker hub for image cache (instead of ECR)