## Problem
We want to keep `#on-call-staging-stream` channel close to the prod one
and redirect notifications from failing benchmarks to another channel
for investigation.
## Summary of changes
- Send notifications regarding failures in `benchmarking` job to
`#on-call-staging-stream`
- Send notifications regarding failures in `periodic_pagebench` job to
`#on-call-staging-stream`
## Problem
We have a bunch of duplicated code for automated releases. There will be
even more, once we have `release-compute` branch
(https://github.com/neondatabase/neon/pull/9637).
Another issue with the current `release` workflow is that it creates a
PR from the main as is. If we create 2 different releases from the
same commit, GitHub could mix up results from different PRs.
## Summary of changes
- Create a reusable workflow for releases
- Create an empty commit to differentiate releases
## Problem
The first version of the ingest benchmark had some parsing and reporting
logic in shell script inside GitHub workflow.
it is better to move that logic into a python testcase so that we can
also run it locally.
## Summary of changes
- Create new python testcase
- invoke pgcopydb inside python test case
- move the following logic into python testcase
- determine backpressure
- invoke pgcopydb and report its progress
- parse pgcopydb log and extract metrics
- insert metrics into perf test database
- add additional column to perf test database that can receive endpoint
ID used for pgcopydb run to have it available in grafana dashboard when
retrieving other metrics for an endpoint
## Example run
https://github.com/neondatabase/neon/actions/runs/11860622170/job/33056264386
## Problem
We call `check-build-tools-image` twice for each workflow whenever we
use it, along with `build-build-tools-image`, once as a workflow itself,
and the second time from `build-build-tools-image`. This is not
necessary.
## Summary of changes
- Inline `check-build-tools-image` into `build-build-tools-image`
- Remove separate `check-build-tools-image` workflow
We found that exporting GH Workflow Runs in batch is more efficient due
to
- better utilisation of Github API
- and gh runners usage is rounded to minutes, so even when ad-hoc export
is done in 5-10 seconds, we billed for one minute usage
So now we introduce batch exporting, with version v0.2.x of github
workflow stats exporter.
How it's expected to work now:
- every 15 minutes we query for the workflow runs, created in last 2
hours
- to avoid missing workflows that ran for more than 2 hours, every night
(00:25) we will query workflows created in past 24 hours and export them
as well
- should we have query for even longer periods?
- lets see how it works with current schedule
- for longer periods like for days or weeks, it may require to adjust
logic and concurrency of querying data, so lets for now use simpler
version
## Problem
We have no specific benchmark testing project migration of postgresql
project with existing data into Neon.
Typical steps of such a project migration are
- schema creation in the neon project
- initial COPY of relations
- creation of indexes and constraints
- vacuum analyze
## Summary of changes
Add a periodic benchmark running 9 AM UTC every day.
In each run:
- copy a 200 GiB project that has realistic schema, data, tables,
indexes and constraints from another project into
- a new Neon project (7 CU fixed)
- an existing tenant, (but new branch and new database) that already has
4 TiB of data
- use pgcopydb tool to automate all steps and parallelize COPY and index
creation
- parse pgcopydb output and report performance metrics in Neon
performance test database
## Logs
This benchmark has been tested first manually and then as part of
benchmarking.yml workflow, example run see
https://github.com/neondatabase/neon/actions/runs/11757679870
## Problem
GitHub API can return error 500, and it fails jobs that use
`actions/github-script` action.
## Summary of changes
- Add `retry: 500` to all `actions/github-script` usage
## Problem
The Merge queue doesn't work because it expects certain jobs, which we
don't have in the `pre-merge-checks` workflow.
But it turns out we can just create jobs/checks with the same names in
any workflow that we run.
## Summary of changes
- Add `conclusion` jobs
- Create `neon-cloud-e2e` status check
- Add a bunch of `if`s to handle cases with no relevant changes found
and prepare the workflow to run rust checks in the future
- List the workflow in `report-workflow-stats` to collect stats about it
## Problem
To prevent breaking main after Python 3.11 PR get merged
we need to enable merge queue and run `check-codestyle-python`
job on it
## Summary of changes
- Move `check-codestyle-python` to a reusable workflow
- Run this workflow on `merge_group` event
set-docker-config-dir was replicated over multiple repositories.
The replica of this action was removed from this repository and it's
using the version from github.com/neondatabase/dev-actions instead
## Problem
The IAM role associated with our github action runner supports a max
token expiration which is lower than the value we tried.
## Summary of changes
Since we believe to have understood the performance regression we (by
ensuring availability zone affinity of compute and pageserver) the job
should again run in lower than 5 hours and we revert this change instead
of increasing the max session token expiration in the IAM role which
would reduce our security.
## Problem
clickbench regression causes clickbench to run >9 hours and the AWS
session token is expired before the run completes
## Summary of changes
extend lifetime of session token for this job to 12 hours
## Problem
click bench job in benchmarking workflow has a performance regression
causing it to run in timeout of max job run.
Suspected root cause:
Project has been migrated from single pageserver to storage controller
managed project on Oct 14th.
Since then the regression shows.
## Summary of changes
Increase timeout of pytest to 12 hours.
Increase job timeout to 12 hours
## Problem
Based on a recent proxy deployment issue, we deployed another proxy
version (proxy-scram), which was not needed when deploying a specific
proxy type. we have
[PR](https://github.com/neondatabase/infra/pull/2142) to update on the
infra branch and need to update CI in this repo which triggers proxy
deployment.
## Summary of changes
- Update proxy deployment flag
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
## Problem
We have `git config --global --add safe.directory ...` leftovers from the
past, but `actions/checkout` does it by default (since v3.0.2, we use v4)
## Summary of changes
- Remove `git config --global --add safe.directory ...` hack
## Problem
```
+ /tmp/neon/pg_install/v16/bin/psql '***' -c 'SELECT version()'
/tmp/neon/pg_install/v16/bin/psql: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /tmp/neon/pg_install/v16/bin/psql)
/tmp/neon/pg_install/v16/bin/psql: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /tmp/neon/pg_install/v16/bin/psql)
/tmp/neon/pg_install/v16/bin/psql: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /tmp/neon/pg_install/v16/lib/libpq.so.5)
/tmp/neon/pg_install/v16/bin/psql: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /tmp/neon/pg_install/v16/lib/libpq.so.5)
/tmp/neon/pg_install/v16/bin/psql: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /tmp/neon/pg_install/v16/lib/libpq.so.5)
```
## Summary of changes
- Use `build-tools:pinned-bookworm` whenever we download Neon artefact
## Problem
Our dockerfiles, for some historical reason, have unconventional names
`Dockerfile.<something>`, and some tools (like GitHub UI) fail to highlight
the syntax in them.
> Some projects may need distinct Dockerfiles for specific purposes. A
common convention is to name these `<something>.Dockerfile`
From: https://docs.docker.com/build/concepts/dockerfile/#filename
## Summary of changes
- Rename `Dockerfile.build-tools` -> `build-tools.Dockerfile`
- Rename `compute/Dockerfile.compute-node` ->
`compute/compute-node.Dockerfile`
This should make it a little bit easier for people wanting to check if
their files are formated correctly. Has the added bonus of making the CI
check simpler as well.
Signed-off-by: Tristan Partin <tristan@neon.tech>
There are quite a few benefits to this approach:
- Reduce config duplication
- The two sql_exporter configs were super similar with just a few
differences
- Pull SQL queries into standalone files
- That means we could run a SQL formatter on the file in the future
- It also means access to syntax highlighting
- In the future, run different queries for different PG versions
- This is relevant because right now, we have queries that are failing
on PG 17 due to catalog updates
Signed-off-by: Tristan Partin <tristan@neon.tech>