Prepare for a new release workflow
* Release PR is created on Fridays
* The discussion/approval happens on Friday
* Sunday morning the deployment will be done in central-il and perf
tests will be run
* On Monday early IST morning gradually start rolling (starting from US
regions as they are still in weekend time)
See slack for discussion:
https://neondb.slack.com/archives/C04P81J55LK/p1698565305607839?thread_ts=1698428241.031979&cid=C04P81J55LK
Only applicable change was neondatabase/autoscaling#584, setting
pgbouncer auth_dbname=postgres to stop superuser connections
from preventing databases from being dropped.
## Problem
We'll need to switch `REMOTE_STORAGE_AZURE_REGION` from the current
`eastus2` region to something `eu-central-1`-like. This may require
changing `AZURE_STORAGE_ACCESS_KEY`.
To make it possible to switch from one place (not to break a lot of
builds on CI), move `REMOTE_STORAGE_AZURE_CONTAINER` and
`REMOTE_STORAGE_AZURE_REGION` to GitHub Variables.
See https://github.com/neondatabase/neon/settings/variables/actions
## Summary of changes
- Get values for `REMOTE_STORAGE_AZURE_CONTAINER` &
`REMOTE_STORAGE_AZURE_REGION` from GitHub Variables
## Problem
We started to store test results in a new format in
https://github.com/neondatabase/neon/pull/4549.
This PR switches scripts to query this db.
(we can completely remove old DB/ingestions scripts in a couple of
weeks after the PR merged)
## Summary of changes
- `scripts/benchmark_durations.py` now queries the new database
- `scripts/flaky_tests.py` now queries the new database
Only applicable change was neondatabase/autoscaling#571, removing the
postgres_exporter flags `--auto-discover-databases` and
`--exclude-databases=...`
Create Rust bindings for C functions from walproposer. This makes it possible to
write better tests with real walproposer code without spawning multiple
processes and starting up the whole environment.
A `make walproposer-lib` stage was added to build static libraries
`libwalproposer.a`, `libpgport.a`, `libpgcommon.a`. These libraries can
be statically linked to any executable to call walproposer functions.
`libs/walproposer/src/walproposer.rs` contains
`test_simple_sync_safekeepers` to test that walproposer can be called
from Rust to emulate sync_safekeepers logic. It can also be used as a
usage example.
## Problem
- Because we compress artifacts file by file, we don't need to put them
into `tar` containers (ie instead of `tar.gz` we can use just `gz`).
- Python's gz compression is single-threaded and pretty slow.
A benchmark has shown a ~20x speedup (19.876176291 vs
0.8748335830000009) on my laptop (for a 1.3M pageserver.log)
## Summary of changes
- Replace tarfile with zstandard
- Update allure to 2.24.0
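The first point above (no `tar` container is needed when each file is compressed on its own) can be illustrated with a minimal sketch. It uses only the standard-library `gzip` module rather than the `zstandard` package this PR switches to, so it shows the container difference, not the speedup. The function names and the sample log line are hypothetical.

```python
import gzip
import io
import tarfile


def compress_single_file(data: bytes) -> bytes:
    """Compress one file's bytes directly, with no tar container."""
    return gzip.compress(data)


def compress_in_tar(name: str, data: bytes) -> bytes:
    """Wrap the same bytes in a tar.gz archive, for comparison."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Hypothetical log content standing in for a real pageserver.log.
log = b"2023-10-01 INFO pageserver started\n" * 1000
plain = compress_single_file(log)
tarred = compress_in_tar("pageserver.log", log)
```

Both results decompress to the same bytes; the `tar` variant only adds an archive header and padding around a single member.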
## Problem
Before releasing a new version to production, we'd like to run a set of
required checks on the incoming release.
The simplest approach, which doesn't require many changes, is to dedicate one
staging region to a `preprod` installation.
The proposed changes to the release flow are the following:
- When a release PR is merged into the release branch — trigger
deployment from the release branch to a dedicated staging-preprod region
(for now, it's going to be `eu-west-1` — Ireland)
Corresponding infrastructure PR:
https://github.com/neondatabase/aws/pull/585
## Summary of changes
- Trigger `deploy.dev` workflow with `-f deployPreprodRegion=true` for
release branch
Only notable change is including neondatabase/autoscaling#523, which we
hope will help with making sure that TCP connections are properly
terminated before shutdown (which hopefully fixes a leak in the
pageserver).
## Problem
- `gh pr list` fails with `unknown argument "main"; please quote all
values that have spaces` because a variable with the wrong name is used
- `permissions: write-all` is too broad for the job
## Summary of changes
- Fix variable name: `HEAD` -> `BRANCH`
- Grant only required permissions for each job
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
If @github-actions creates a release PR, the CI pipeline is not triggered
(but we have `release-notify.yml` workflow that we expect to run on this
event).
I suspect this happened because @github-actions is not a repository
member.
Ref
https://github.com/neondatabase/neon/pull/5283#issuecomment-1715209291
## Summary of changes
- Use `CI_ACCESS_TOKEN` to create a PR
- Use `gh` instead of `thomaseizinger/create-pull-request`
- Restrict permissions for GITHUB_TOKEN to `contents: write` only
(required for `git push`)
## Problem
`ci-run/pr-*` branches (and attached PRs) should be deleted
automatically when their parent PRs get closed.
But they are not.
## Summary of changes
- Fix if-condition
This adds PostgreSQL 16 as a vendored postgresql version, and adapts the
code to support this version.
The important changes to PostgreSQL 16 compared to the PostgreSQL 15
changeset include the addition of a neon_rmgr instead of altering Postgres's
original WAL format.
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
Another thing I overlooked regarding `approved-for-ci-run`:
- When we create a PR, the action is associated with @vipvap and this
triggers the pipeline — this is good.
- When we update the PR by force-pushing to the branch, the action is
associated with @github-actions, which doesn't trigger a pipeline — this
is bad.
Initially spotted in #5239 / #5211
([link](https://github.com/neondatabase/neon/actions/runs/6122249456/job/16633919558?pr=5239))
— `check-permissions` should not fail.
## Summary of changes
- Use `CI_ACCESS_TOKEN` to check out the repo (I expect this token will
be reused in the following `git push`)
## Problem
Add a CI pipeline that checks GitHub Workflows with
https://github.com/rhysd/actionlint (it uses `shellcheck` for shell
scripts in steps)
To run it locally: `SHELLCHECK_OPTS=--exclude=SC2046,SC2086 actionlint`
## Summary of changes
- Add `.github/workflows/actionlint.yml`
- Fix actionlint warnings
## Problem
This PR creates a GitHub release from a release tag with an
autogenerated changelog: https://github.com/neondatabase/neon/releases
## Summary of changes
- Call GitHub API to create a release
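As a rough illustration, the request for GitHub's "create a release" REST endpoint can be built like this. This is a sketch, not the workflow's actual code: the PR does not say how it calls the API, the helper name and the example tag are hypothetical, and using `generate_release_notes` for the autogenerated changelog is an assumption.

```python
import json
import urllib.request


def build_release_request(repo: str, tag: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a 'create a release' request for `repo`."""
    payload = {
        "tag_name": tag,
        # Assumption: let GitHub generate the changelog automatically.
        "generate_release_notes": True,
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{repo}/releases",
        data=json.dumps(payload).encode(),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# "release-0000" and "<token>" are placeholders, not real values.
request = build_release_request("neondatabase/neon", "release-0000", "<token>")
```

Passing `request` to `urllib.request.urlopen` would perform the call; the sketch stops short of that so it can be inspected without network access or credentials.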
## Problem
A bunch of fixes for different test-related things
## Summary of changes
- Fix test_runner/pg_clients (`subprocess_capture` return value has
changed)
- Do not run create-test-report if check-permissions failed (for jobs
that were not cancelled)
- Fix Code Coverage comment layout after flaky tests. Add another
healing "\n"
- test_compatibility: add an instruction for local run
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
We've got `approved-for-ci-run` to work 🎉
But it's still a bit rough, this PR should improve the UX for external
contributors.
## Summary of changes
- `build_and_test.yml`: add `check-permissions` job, which fails if PR is
created from a fork. Make all jobs in the workflow depend on
`check-permissions` to fail fast
- `approved-for-ci-run.yml`: add `cleanup` job to close `ci-run/pr-*`
PRs and delete linked branches when the parent PR is closed
- `approved-for-ci-run.yml`: fix the layout for the `ci-run/pr-*` PR
description
- GitHub Autocomment: add a comment with test results to the original PR
(instead of a PR from `ci-run/pr-*` )
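The fork check behind `check-permissions` can be sketched as a comparison of the head and base repositories in the `pull_request` event payload. This is an illustration under assumptions, not the job's actual implementation: the helper name is hypothetical, and treating a missing head repository as a fork is a choice made here.

```python
def is_fork_pr(event: dict) -> bool:
    """Return True if the pull request's head branch lives in another repo."""
    pr = event["pull_request"]
    head_repo = pr["head"]["repo"]
    # The head repo can be null if the fork was deleted; treat that as a fork.
    if head_repo is None:
        return True
    return head_repo["full_name"] != pr["base"]["repo"]["full_name"]


# Hypothetical, heavily trimmed event payloads.
internal_pr = {
    "pull_request": {
        "head": {"repo": {"full_name": "neondatabase/neon"}},
        "base": {"repo": {"full_name": "neondatabase/neon"}},
    }
}
forked_pr = {
    "pull_request": {
        "head": {"repo": {"full_name": "contributor/neon"}},
        "base": {"repo": {"full_name": "neondatabase/neon"}},
    }
}
```

A job built on such a check would exit non-zero for `forked_pr`, so that every dependent job is skipped early.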
## Problem
- `SCALE: unbound variable` from
https://github.com/neondatabase/neon/pull/5079
- The layout of the GitHub auto-comment is broken if the code coverage
section follows flaky test section from
https://github.com/neondatabase/neon/pull/4999
## Summary of changes
- `benchmarking.yml`: Rename `SCALE` to `TEST_OLAP_SCALE`
- `comment-test-report.js`: Add an extra new-line before Code coverage
section
## Problem
When the next release is coming, we want to let everyone know about it by
posting a message to the Slack channel with a list of commits.
## Summary of changes
- `.github/workflows/release-notify.yml` is added
- the workflow sends a message to
`vars.SLACK_UPCOMING_RELEASE_CHANNEL_ID` (or
[#test-release-notifications](https://neondb.slack.com/archives/C05QQ9J1BRC)
if not configured)
- On each PR update, the workflow updates the list of commits in the
message (it doesn't send additional messages)
## Problem
`CI_ACCESS_TOKEN` has quite limited access (which is good), but this
doesn't allow it to remove labels from PRs (which is bad)
## Summary of changes
- Use `GITHUB_TOKEN` to remove labels
- Use `CI_ACCESS_TOKEN` to create PRs
## Problem
Pull Requests created by the GitHub Actions bot don't have access to
secrets, so we need to use our bot for it to auto-trigger a test run
See previous PRs #4663, #5210, #5212
## Summary of changes
- Use our bot to create PRs
## Problem
It's hard to find out which DB size we use for OLAP benchmarks (TPC-H in
particular).
This PR adds handling of the `TEST_OLAP_SCALE` env var, which gets added
to a test name as a parameter.
This is required for performing larger periodic benchmarks.
## Summary of changes
- Handle `TEST_OLAP_SCALE` in
`test_runner/performance/test_perf_olap.py`
- Set `TEST_OLAP_SCALE` in `.github/workflows/benchmarking.yml` to a
TPC-H scale
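A minimal sketch of the idea: read the scale from the environment and make it part of the test name. The helper names, the default of `1`, and the `[scale-N]` naming are assumptions for illustration; the real handling lives in `test_runner/performance/test_perf_olap.py`.

```python
import os
from typing import Mapping


def olap_scale(environ: Mapping[str, str] = os.environ, default: int = 1) -> int:
    """Read the benchmark scale from TEST_OLAP_SCALE (hypothetical default: 1)."""
    return int(environ.get("TEST_OLAP_SCALE", default))


def parametrized_name(test_name: str, scale: int) -> str:
    """Attach the scale to the test name so it shows up in reports."""
    return f"{test_name}[scale-{scale}]"


# Example with an explicit mapping instead of the real environment.
name = parametrized_name("test_tpch", olap_scale({"TEST_OLAP_SCALE": "10"}))
```

With the scale in the name, results for different database sizes stay distinguishable in the benchmark history.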
## Problem
Continuation of #4663, #5210
We're still getting an error:
```
GraphQL: Resource not accessible by integration (removeLabelsFromLabelable)
```
## Summary of changes
- trigger `approved-for-ci-run.yml` workflow on `pull_request_target`
instead of `pull_request`
## Problem
When a remote custom extension build fails, it looks a bit confusing on
neon CI:
- `trigger-custom-extensions-build` is green
- `wait-for-extensions-build` is red
- `build-and-upload-extensions` is red
But to restart the build (to get everything green), you need to restart
the only passed `trigger-custom-extensions-build`.
## Summary of changes
- Merge `trigger-custom-extensions-build` and
`wait-for-extensions-build` jobs into
`trigger-custom-extensions-build-and-wait`
## Problem
Currently, the `deploy` job doesn't wait for the custom extension job
(in another repo) and can be started even if the extensions build failed.
This PR adds another job that polls the status of the extension build job
and fails if the extension build fails.
## Summary of changes
- Add `wait-for-extensions-build` job, which waits for a custom
extension build in another repo.
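The polling behaviour described above can be sketched as a loop that returns on success, raises on failure, and gives up after a timeout. The status strings, the timeout and interval values, and the `get_status` callback are hypothetical; the actual job queries the build status in the other repository.

```python
import time
from typing import Callable


def wait_for_build(
    get_status: Callable[[], str],
    timeout_s: float = 3600,
    interval_s: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll `get_status` until the build succeeds, fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "success":
            return
        if status == "failure":
            raise RuntimeError("extensions build failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("extensions build did not finish in time")
        sleep(interval_s)


# Example with canned statuses and a no-op sleep, so it returns immediately.
statuses = iter(["pending", "pending", "success"])
wait_for_build(lambda: next(statuses), sleep=lambda _: None)
```

Raising on failure makes the waiting job itself fail, which is what blocks `deploy` from starting.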
## Problem
We still need to rerun some builds manually because flaky tests weren't
detected automatically.
I found two reasons for it:
- If a test is flaky on a particular build type, on a particular
Postgres version, there's a high chance that this test is flaky on all
configurations, but we don't automatically detect such cases.
- We detect flaky tests only on the main branch, which requires manually
retriggering runs for newly flaky tests.
Both of them are fixed in this PR.
## Summary of changes
- Spread flakiness of a single test to all configurations
- Detect flaky tests in all branches (not only in the main)
- Look back only at 7 days of test history (instead of 10)
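The first and third changes can be sketched together: collect tests that were flaky within the lookback window, then mark them flaky for every configuration. The record layout, the build types, and the Postgres versions below are assumptions for illustration, not the schema or values used by `scripts/flaky_tests.py`.

```python
from datetime import datetime, timedelta
from itertools import product

# Hypothetical configuration matrix.
BUILD_TYPES = ("debug", "release")
PG_VERSIONS = ("14", "15", "16")


def flaky_configurations(runs, now, lookback_days=7):
    """Return (test, build_type, pg_version) triples to treat as flaky."""
    cutoff = now - timedelta(days=lookback_days)
    flaky_names = {
        run["name"]
        for run in runs
        if run["flaky"] and run["finished_at"] >= cutoff
    }
    # A test seen flaky on one configuration is spread to all of them.
    return {
        (name, build_type, pg_version)
        for name in flaky_names
        for build_type, pg_version in product(BUILD_TYPES, PG_VERSIONS)
    }


now = datetime(2023, 9, 1)
runs = [
    {"name": "test_a", "flaky": True, "finished_at": now - timedelta(days=2)},
    {"name": "test_b", "flaky": True, "finished_at": now - timedelta(days=9)},
    {"name": "test_c", "flaky": False, "finished_at": now - timedelta(days=1)},
]
result = flaky_configurations(runs, now)
```

Here only `test_a` qualifies: `test_b` falls outside the 7-day window and `test_c` was never flaky.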
Starts `postgres` in cgroup directly from `compute_ctl` instead of from
`vm-builder`. This is required because the `vm-monitor` cannot be in the
cgroup it is managing. Otherwise, it itself would be frozen when
freezing the cgroup.
Requires https://github.com/neondatabase/cloud/pull/6331, which adds the
`AUTOSCALING` environment variable letting `compute_ctl` know to start
`postgres` in the cgroup.
Requires https://github.com/neondatabase/autoscaling/pull/468, which
prevents `vm-builder` from starting the monitor and putting postgres in
a cgroup. This will require a `VM_BUILDER_VERSION` bump.
This is a cherry-picked-then-improved version of release branch commit
4204960942 (PR #4861)
The commit
    commit 5f8fd640bf
    Author: Alek Westover <alek.westover@gmail.com>
    Date: Wed Jul 26 08:24:03 2023 -0400
        Upload Test Remote Extensions (#4792)
switched to using the release tag instead of `latest`, but,
the `promote-images` job only uploads `latest` to the prod ECR.
The switch to using release tag was good in principle, but,
it broke the release pipeline. So, switch release pipeline
back to using `latest`.
Note that a proper fix should abandon use of the `:latest` tag
altogether: currently, if a `main` pipeline runs concurrently
with a `release` pipeline, the `release` pipeline may end
up using the `main` pipeline's images.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
A very slight change that allows us to configure the UID of the
neon-postgres cgroup owner. We start postgres in this cgroup so we can
scale it with the cgroups v2 api. Currently, the control plane
overwrites the entrypoint set by `vm-builder`, so `compute_ctl` (and
thus postgres), is not started in the neon-postgres cgroup. Having
`compute_ctl` start postgres in the cgroup should fix this. However, at
the moment it appears not to have the correct permissions.
Configuring the neon-postgres UID to `postgres` (which is the UID
`compute_ctl` runs under) should hopefully fix this.
See #4920 - the PR to modify `compute_ctl` to start postgres in the
cgroup.
See: neondatabase/autoscaling#480, neondatabase/autoscaling#477. Both
these PRs are part of an effort to increase `vm-builder`'s
configurability and allow us to adjust it as we integrate in the
monitor.
## Problem
While adding the new test results format, I also changed the way we
upload Allure reports to S3
(722c7956bb)
to avoid duplicated results from previous runs. But it broke links to
earlier results (the results are still available but at different URLs).
This PR fixes this (by reverting the logic changed in
722c7956bb)
and moves the logic for storing test results in the db to the allure
generate step. This lets us avoid duplicate test results in the db
and saves some time on extra S3 downloads that happened in a different
job before this PR.
Ref https://neondb.slack.com/archives/C059ZC138NR/p1691669522160229
## Summary of changes
- Move test results storing logic from a workflow to
`actions/allure-report-generate`
Patches a bug in vm-builder where it did not include enough parameters
in the connection string. These parameters are `host=localhost port=5432`.
These parameters were not necessary for the monitor because the `pq` Go
postgres driver included them by default.
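For illustration, a keyword/value connection string with the host and port spelled out, rather than left to driver defaults, can be assembled as below. The helper name and the example database and user are hypothetical, and the sketch does no quoting, so it only suits simple values without spaces.

```python
def build_conninfo(dbname: str, user: str, host: str = "localhost", port: int = 5432) -> str:
    """Build a keyword/value connection string with explicit host and port."""
    params = {"host": host, "port": port, "dbname": dbname, "user": user}
    return " ".join(f"{key}={value}" for key, value in params.items())


# Hypothetical database and user names.
conninfo = build_conninfo(dbname="postgres", user="example_user")
```

Stating `host` and `port` explicitly keeps the string valid for clients that do not fill them in by default.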