## Problem
Detaching a tenant can involve many thousands of local filesystem
metadata writes, but the control plane would benefit from us not
blocking detach/delete responses on these.
## Summary of changes
After renaming the local tenant directory, acknowledge the tenant detach and
delete the tenant directory in the background.
#5183
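A minimal sketch of the idea (helper names assumed, not the actual pageserver code): the cheap rename happens on the request path, the expensive recursive delete happens in a background task.
```rust
use std::path::{Path, PathBuf};

/// Sketch: detach acks as soon as the cheap rename is done; the expensive
/// recursive delete of thousands of files runs off the request path.
async fn detach_tenant(tenant_dir: &Path) -> std::io::Result<()> {
    // Cheap step on the request path: move the directory aside.
    let tmp: PathBuf = tenant_dir.with_extension("___deleted.tmp");
    tokio::fs::rename(tenant_dir, &tmp).await?;

    // Expensive step off the request path: remove the renamed directory.
    tokio::task::spawn(async move {
        if let Err(e) = tokio::fs::remove_dir_all(&tmp).await {
            eprintln!("background deletion of {} failed: {e}", tmp.display());
        }
    });

    // Returning here is what "acks" the detach to the control plane.
    Ok(())
}
```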
---------
Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
## Problem
Another thing I overlooked regarding `approved-for-ci-run`:
- When we create a PR, the action is associated with @vipvap and this
triggers the pipeline — this is good.
- When we update the PR by force-pushing to the branch, the action is
associated with @github-actions, which doesn't trigger a pipeline — this
is bad.
Initially spotted in #5239 / #5211
([link](https://github.com/neondatabase/neon/actions/runs/6122249456/job/16633919558?pr=5239))
— `check-permissions` should not fail.
## Summary of changes
- Use `CI_ACCESS_TOKEN` to check out the repo (I expect this token will
be reused in the following `git push`)
## Problem
When a `ci-run/pr-*` PR is created, the GitHub autocomment with test
results is supposed to be posted to the original PR; currently, this
doesn't work.
I created this PR from a personal fork to debug and fix the issue.
## Summary of changes
- `scripts/comment-test-report.js`: use `pull_request.head` instead of
`pull_request.base` 🤦
## Problem
Add a CI pipeline that checks GitHub Workflows with
https://github.com/rhysd/actionlint (it uses `shellcheck` for shell
scripts in steps)
To run it locally: `SHELLCHECK_OPTS=--exclude=SC2046,SC2086 actionlint`
## Summary of changes
- Add `.github/workflows/actionlint.yml`
- Fix actionlint warnings
Some VMs, when already scaled up as much as possible, end up spamming
the autoscaler-agent with upscale requests that will never be fulfilled.
If postgres is using memory greater than the cgroup's memory.high, it
can emit new memory.high events 1000 times per second, which... just
means unnecessary load on the rest of the system.
This changes the vm-monitor so that we skip sending an upscale request if
we already sent one within the last second, to avoid spamming the
autoscaler-agent. This matches the previous behavior of the vm-informant.
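A rough sketch of the rate limiting (names assumed, not the actual vm-monitor code): remember when the last upscale request was sent and skip new ones for a second.
```rust
use std::time::{Duration, Instant};

/// Sketch of a once-per-second limiter for upscale requests.
struct UpscaleRequester {
    last_sent: Option<Instant>,
    min_interval: Duration,
}

impl UpscaleRequester {
    fn new() -> Self {
        Self { last_sent: None, min_interval: Duration::from_secs(1) }
    }

    /// Called on every memory.high event; only actually sends a request
    /// if we haven't sent one within the last second.
    fn maybe_request_upscale(&mut self) {
        let now = Instant::now();
        if let Some(prev) = self.last_sent {
            if now.duration_since(prev) < self.min_interval {
                return; // asked recently; don't spam the autoscaler-agent
            }
        }
        self.last_sent = Some(now);
        self.send_upscale_request();
    }

    fn send_upscale_request(&self) {
        // placeholder for the actual message to the autoscaler-agent
    }
}
```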
## Problem
This PR creates a GitHub release from a release tag with an
autogenerated changelog: https://github.com/neondatabase/neon/releases
## Summary of changes
- Call GitHub API to create a release
See #5001
"No space" is the expected response if we're at the size limit.
Of course, if the safekeeper incorrectly returned "no space", the availability
check wouldn't fire.
But users would notice such a bug quite soon anyway.
So ignoring "no space" is the right trade-off.
## Problem
## Summary of changes
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Refer to #5208.
## Problem
See
https://neondb.slack.com/archives/C03H1K0PGKH/p1693938336062439?thread_ts=1693928260.704799&cid=C03H1K0PGKH
#5208 disables LFC forever in case of an error. This is not good, because the
problem causing the error (for example ENOSPC) can be resolved, and it
would be nice to re-enable LFC after the problem is fixed.
Also, #5208 disables LFC locally in one backend, but other backends may
still see corrupted data.
It should not cause problems right now with the "permission denied" error,
because there should be no backend able to open LFC normally.
But in case of an out-of-disk-space error, other backends can read corrupted
data.
## Summary of changes
1. Clean up the hash table after an error to prevent access to stale or
corrupted data
2. Perform the disk write under an exclusive lock (hoping it will not affect
performance, because a write usually just copies data from user space to
kernel space)
3. Use generations to prevent access to stale data in `lfc_read`
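The real implementation is C in the LFC code; the following is only a conceptual sketch (in Rust, with assumed names) of the generation idea from point 3: the generation is bumped whenever the cache is invalidated, and a reader holding an entry from an older generation treats it as a miss instead of returning stale data.
```rust
/// Conceptual sketch of generation-checked cache reads (not the real C code).
struct FileCache {
    generation: u64,
    // ... hash table of cached chunks would live here ...
}

struct CachedEntry {
    generation: u64,
    data: Vec<u8>,
}

impl FileCache {
    /// On error: drop all entries and bump the generation, so any reader
    /// still holding an old entry can detect that it is stale.
    fn invalidate_after_error(&mut self) {
        self.generation += 1;
        // clear the hash table here
    }

    /// A read only trusts an entry created under the current generation.
    fn read<'a>(&self, entry: &'a CachedEntry) -> Option<&'a [u8]> {
        if entry.generation != self.generation {
            return None; // stale: fall back to reading from the pageserver
        }
        Some(&entry.data)
    }
}
```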
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
I forgot a `str(...)` conversion in #5243. This led to log lines such
as:
```
Using fs root 'PosixPath('/tmp/test_output/test_backward_compatibility[debug-pg14]/compatibility_snapshot/repo/local_fs_remote_storage/pageserver')' as a remote storage
```
This surprisingly works, creating a hierarchy under the current working
directory (`repo_dir` for tests):
- `PosixPath('`
- `tmp` .. up until .. `local_fs_remote_storage`
- `pageserver')`
It should not work, but right now the test_compatibility.py tests find local
metadata and layers, which end up being used. After #5172, when remote storage
is the source of truth, this will no longer work.
## Problem
A bunch of fixes for different test-related things
## Summary of changes
- Fix test_runner/pg_clients (`subprocess_capture` return value has
changed)
- Do not run create-test-report if check-permissions failed for not
cancelled jobs
- Fix Code Coverage comment layout after flaky tests. Add another
healing "\n"
- test_compatibility: add an instruction for local run
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
Currently our testing environment only supports running a single
pageserver at a time. This is insufficient for testing failover and
migrations.
- Dependency of writing tests for #5207
## Summary of changes
- `neon_local` and `neon_fixture` now handle multiple pageservers
- This is a breaking change to the `.neon/config` format: any local
environments will need recreating
- Existing tests continue to work unchanged:
- The default number of pageservers is 1
- `NeonEnv.pageserver` is now a helper property that retrieves the first
pageserver if there is only one, else throws.
- Pageserver data directories are now at `.neon/pageserver_{n}` where n
is 1,2,3...
- Compatibility tests get some special casing to migrate neon_local
configs: these are not meant to be backward/forward compatible, but they
were treated that way by the test.
See
https://neondb.slack.com/archives/C03H1K0PGKH/p1692550646191429
## Problem
Concurrent index builds write WAL outside of a transaction.
`backpressure_throttling_impl` doesn't perform throttling for read-only
transactions (no XID assigned).
This causes a huge write lag, which can cause a large delay in accessing the
table.
## Summary of changes
Look at the `PROC_IN_SAFE_IC` flag in the process state, which is set during a
concurrent index build.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Prepare to upgrade the Rust version to the latest stable.
- `rustfmt` has learned to format `let irrefutable = $expr else { ...
};` blocks
- There's a new warning about the virtual (workspace) crate resolver; I picked
the latest resolver, as I suspect everyone would expect it to be the
latest. It should not matter anyway
- Some new clippy lints, which seem alright
Remote storage cleanup split from #5198:
- pageserver, extensions, and safekeepers now have their separate remote
storage
- RemoteStorageKind has the configuration code
- S3Storage has the cleanup code
- with MOCK_S3, pageserver, extensions, safekeepers use different
buckets
- with LOCAL_FS, `repo_dir / "local_fs_remote_storage" / $user` is used
as the path, where `$user` is `pageserver` or `safekeeper`
- no more `NeonEnvBuilder.enable_xxx_remote_storage` but one
`enable_{pageserver,extensions,safekeeper}_remote_storage`
There should not be any real behavior changes. This will allow us to default to
`LOCAL_FS` for the pageserver in the next PR, remove
`RemoteStorageKind.NOOP`, and work towards #5172.
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
`test_runner/performance/test_startup.py::test_startup` started to fail
more frequently because of the timeout.
Let's increase the timeout to see the failures on the perf dashboard.
## Summary of changes
- Increase the timeout for `test_startup` from 600 to 900 seconds
v1.4.0 includes changes to make it compile with PostgreSQL 16. The
commit log doesn't call it out explicitly, but I tested it manually.
v1.4.0 includes some new functions, but I tested manually that the
v1.3.1 functionality works with the v1.4.0 version of the library. That
means that this doesn't break existing installations. Users can do
"ALTER EXTENSION hypopg UPDATE" if they want to use the new v1.4.0
functionality, but they don't have to.
This includes PostgreSQL 16 support. No other changes, really.
The extension version in the upstream was changed from 2.17 to 2.18,
however, there is no difference between the catalog objects. So if you
had installed 2.17 previously, it will continue to work. You can run
"ALTER EXTENSION hll UPDATE", but all it will do is update the version
number in the pg_extension table.
## Problem
Once we use async file system APIs for `VirtualFile`, these functions
will also need to be async fn.
## Summary of changes
Makes the `VirtualFile` functions `open`, `open_with_options`, `create`,
`sync_all`, and `with_file` async fns, including all functions that call them.
Like in the prior PRs, the actual I/O operations are not using async APIs yet,
as per request in the #4743 epic.
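For illustration, the shape of the change (signatures assumed, not copied from the code): synchronous constructors become `async fn`, and every caller awaits them, even though the underlying I/O stays synchronous for now.
```rust
use std::path::Path;

// Hypothetical stand-in for the real VirtualFile type.
struct VirtualFile { /* fd bookkeeping elided */ }

impl VirtualFile {
    // Before: fn open(path: &Path) -> Result<VirtualFile, std::io::Error>
    // After: the same operation as an async fn, so call sites can later be
    // backed by a real async I/O implementation.
    async fn open(path: &Path) -> Result<VirtualFile, std::io::Error> {
        // Still a blocking std call under the hood for now.
        let _file = std::fs::File::open(path)?;
        Ok(VirtualFile {})
    }
}

async fn caller_example() -> Result<(), std::io::Error> {
    // Every function on the call path becomes async and awaits.
    let _vf = VirtualFile::open(Path::new("/tmp/some_layer_file")).await?;
    Ok(())
}
```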
We switch away from using `VirtualFile` in the par_fsync module;
hopefully this is only temporary, until we can actually do fully async
I/O in `VirtualFile`. This might cause us to exhaust fd limits in the
tests, but it should only be an issue for local developers, as we have
high ulimits in prod.
This PR is a follow-up of #5189, #5190, #5195, and #5203. Part of #4743.
It's a good idea to keep up-to-date in general. One noteworthy change is
that PostGIS 3.3.3 adds support for PostgreSQL v16. We'll need that.
PostGIS 3.4.0 has already been released, and we should consider
upgrading to that. However, it's a major upgrade and requires running
"SELECT postgis_extensions_upgrade();" in each database, to upgrade the
catalogs. I don't want to deal with that right now.
## Problem
We likely need this to support Postgres 16.
It has also been asked for by a user:
https://github.com/neondatabase/neon/discussions/5042
The latest version is 3.2.0, but it requires some changes in the build
script (which I haven't looked into; it didn't work right away)
## Summary of changes
```
3.1.8 2023-08-01
- force v8 to compile in release mode
3.1.7 2023-06-26
- fix byteoffset issue with arraybuffers
- support postgres 16 beta
3.1.6 2023-04-08
- fix crash issue on fetch apply
- fix interrupt issue
```
From https://github.com/plv8/plv8/blob/v3.1.8/Changes
## Problem
We've got `approved-for-ci-run` to work 🎉
But it's still a bit rough, this PR should improve the UX for external
contributors.
## Summary of changes
- `build_and_test.yml`: add a `check-permissions` job, which fails if the PR is
created from a fork. Make all jobs in the workflow depend on
`check-permissions` to fail fast
- `approved-for-ci-run.yml`: add `cleanup` job to close `ci-run/pr-*`
PRs and delete linked branches when the parent PR is closed
- `approved-for-ci-run.yml`: fix the layout for the `ci-run/pr-*` PR
description
- GitHub Autocomment: add a comment with test results to the original PR
(instead of to the `ci-run/pr-*` PR)
Add a `walreceiver_state` field to `TimelineInfo` (response of `GET /v1/tenant/:tenant_id/timeline/:timeline_id`) and while doing that, refactor out a common `Timeline::walreceiver_state(..)`. No OpenAPI changes, because this is an internal debugging addition.
Fixes #3115.
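Roughly, the shape of the change (field and helper names from the description above; surrounding types simplified, details assumed):
```rust
use serde::Serialize;

#[derive(Serialize)]
struct TimelineInfo {
    // ... existing fields elided ...
    /// Human-readable state of the WAL receiver, for debugging.
    walreceiver_state: String,
}

// A single shared accessor on the pageserver's Timeline is used by the
// timeline detail handler (sketch only).
struct Timeline { /* ... */ }

impl Timeline {
    fn walreceiver_state(&self) -> String {
        // e.g. "connected to safekeeper ..." / "not started"
        "not started".to_string()
    }
}
```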
Co-authored-by: Joonas Koivunen <joonas.koivunen@gmail.com>
It was easy to interpret the comment in the page cache initialization code
as justifying why we leak here at all, rather than just why this
specific type of leaking is done (which is what the comment was actually
meant to describe).
See
https://github.com/neondatabase/neon/pull/5125#discussion_r1308445993
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
- `SCALE: unbound variable` from
https://github.com/neondatabase/neon/pull/5079
- The layout of the GitHub auto-comment is broken if the code coverage
section follows the flaky tests section, from
https://github.com/neondatabase/neon/pull/4999
## Summary of changes
- `benchmarking.yml`: Rename `SCALE` to `TEST_OLAP_SCALE`
- `comment-test-report.js`: Add an extra new-line before Code coverage
section
## Problem
When the next release is coming, we want to let everyone know about it by
posting a message to the Slack channel with a list of commits.
## Summary of changes
- `.github/workflows/release-notify.yml` is added
- the workflow sends a message to
`vars.SLACK_UPCOMING_RELEASE_CHANNEL_ID` (or
[#test-release-notifications](https://neondb.slack.com/archives/C05QQ9J1BRC)
if not configured)
- On each PR update, the workflow updates the list of commits in the
message (it doesn't send additional messages)
## Problem
We want to convert the `VirtualFile` APIs to async fn so that we can
adopt one of the async I/O solutions.
## Summary of changes
This PR is a follow-up of #5189, #5190, and #5195, and does the
following:
* Move the used `Write` trait functions of `VirtualFile` into inherent
functions
* Add optional buffering to `WriteBlobWriter`. The buffer is discarded
on drop, similarly to how tokio's `BufWriter` does it: drop is neither
async nor does it support errors.
* Remove the generic `Write` parameter of `WriteBlobWriter`, always
using `VirtualFile`
* Rename `WriteBlobWriter` to `BlobWriter`
* Make various functions in the write path async, like
`VirtualFile::{write,write_all}`.
Part of #4743.
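A minimal sketch of the optional buffering (names and details assumed): writes accumulate in an in-memory buffer that is flushed to the `VirtualFile` when full or on an explicit flush; on drop the buffer is simply discarded, mirroring tokio's `BufWriter` behavior.
```rust
/// Assumed stand-in for the pageserver's VirtualFile write path.
struct VirtualFile { /* fd etc. elided */ }

impl VirtualFile {
    async fn write_all(&mut self, _data: &[u8]) -> std::io::Result<()> {
        // the real implementation performs the (still blocking) write here
        Ok(())
    }
}

/// Sketch of the buffered blob writer: data accumulates in `buf` and is
/// flushed to the VirtualFile when the buffer fills up or on explicit flush.
/// On Drop the buffer is simply discarded (Drop can be neither async nor
/// fallible), like tokio's BufWriter.
struct BlobWriter {
    file: VirtualFile,
    buf: Vec<u8>,
    capacity: usize,
}

impl BlobWriter {
    fn new(file: VirtualFile, capacity: usize) -> Self {
        Self { file, buf: Vec::with_capacity(capacity), capacity }
    }

    async fn write_blob(&mut self, data: &[u8]) -> std::io::Result<()> {
        self.buf.extend_from_slice(data);
        if self.buf.len() >= self.capacity {
            self.flush().await?;
        }
        Ok(())
    }

    /// Must be called before finishing the file, otherwise buffered bytes are lost.
    async fn flush(&mut self) -> std::io::Result<()> {
        self.file.write_all(&self.buf).await?;
        self.buf.clear();
        Ok(())
    }
}
```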
## Problem
`CI_ACCESS_TOKEN` has quite limited access (which is good), but this
doesn't allow it to remove labels from PRs (which is bad)
## Summary of changes
- Use `GITHUB_TOKEN` to remove labels
- Use `CI_ACCESS_TOKEN` to create PRs
## Problem
- #5050
Closes: https://github.com/neondatabase/neon/issues/5136
## Summary of changes
- A new configuration property `control_plane_api` controls other
functionality in this PR: if it is unset (default) then everything still
works as it does today.
- If `control_plane_api` is set, then on startup we call out to control
plane `/re-attach` endpoint to discover our attachments and their
generations. If an attachment is missing from the response we implicitly
detach the tenant.
- Calls to pageserver `/attach` API may include a `generation`
parameter. If `control_plane_api` is set, then this parameter is
mandatory.
- RemoteTimelineClient's loading of index_part.json is generation-aware,
and will try to load the index_part with the most recent generation <=
its own generation (see the sketch after this list).
- The `neon_local` testing environment now includes a new binary
`attachment_service` which implements the endpoints that the pageserver
requires to operate. This is on by default if running `cargo neon` by
hand. In `test_runner/` tests, it is off by default: existing tests
continue to run in the legacy generation-less mode.
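A sketch of the generation-aware selection (helper names assumed; listing and key naming elided): among the index_part generations found in remote storage, pick the highest one that does not exceed our own generation.
```rust
/// Sketch: given the generation numbers of index_part files found in remote
/// storage, pick the most recent one that is not newer than our own
/// generation (numbers assumed to be u32 here).
fn pick_index_part_generation(found: &[u32], my_generation: u32) -> Option<u32> {
    found
        .iter()
        .copied()
        .filter(|g| *g <= my_generation)
        .max()
}

#[test]
fn picks_latest_not_newer_than_ours() {
    // We are generation 5; generations 2, 4 and 7 exist remotely.
    // 7 was written by a newer attachment, so we must not load it.
    assert_eq!(pick_index_part_generation(&[2, 4, 7], 5), Some(4));
    // Nothing at or below our generation: nothing to load.
    assert_eq!(pick_index_part_generation(&[7], 5), None);
}
```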
Caveats:
- The re-attachment during startup assumes that we are only re-attaching
tenants that have previously been attached, and not totally new tenants
-- this relies on the control plane's attachment logic to keep retrying
so that we should eventually see the attach API call. That's important
because the `/re-attach` API doesn't tell us which timelines we should
attach -- we still use local disk state for that. Ref:
https://github.com/neondatabase/neon/issues/5173
- Testing: generations are only enabled for one integration test right
now (test_pageserver_restart), as a smoke test that all the machinery
basically works. Writing fuller tests that stress tenant migration will
come later, and involve extending our test fixtures to deal with
multiple pageservers.
- I'm not in love with "attachment_service" as a name for the neon_local
component, but it's not very important because we can easily rename
these test bits whenever we want.
- Limited observability when in re-attach on startup: when I add
generation validation for deletions in a later PR, I want to wrap up the
control plane API calls in some small client class that will expose
metrics for things like errors calling the control plane API, which will
act as a strong red signal that something is not right.
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
Pull Requests created by the GitHub Actions bot don't have access to
secrets, so we need to use our own bot to create them in order to
auto-trigger a test run.
## Summary of changes
- Use our bot to create PRs
## Problem
- Scrubber's `tidy` command requires presence of a control plane
- Scrubber has no tests at all
## Summary of changes
- Add re-usable async streams for reading metadata from a bucket
- Add a `scan-metadata` command that reads from those streams, calls the
existing `checks.rs` code to validate metadata, and returns a summary
struct for the bucket (see the sketch after this list). The command returns a
nonzero status if errors are found.
- Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that
tests using remote storage can request to have the scrubber run after
they finish
- Enable remote storage and scrub_on_exit in test_pageserver_restart
and test_pageserver_chaos
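A rough sketch of the scan (types and helper names assumed, not the actual scrubber code): consume the async stream of per-timeline metadata, run the existing checks on each item, and accumulate a summary whose error count drives the exit status.
```rust
use futures::stream::{Stream, StreamExt};

// Assumed stand-ins for the real metadata and check types.
struct TimelineMetadata { /* ... */ }
struct CheckOutcome { errors: Vec<String>, warnings: Vec<String> }

#[derive(Default)]
struct ScanSummary {
    timelines_scanned: usize,
    errors: usize,
    warnings: usize,
}

impl ScanSummary {
    fn is_ok(&self) -> bool { self.errors == 0 }
}

/// Consume the stream of metadata read from the bucket and summarize it.
async fn scan_metadata(
    metadata: impl Stream<Item = anyhow::Result<TimelineMetadata>>,
    check: impl Fn(&TimelineMetadata) -> CheckOutcome,
) -> anyhow::Result<ScanSummary> {
    let mut summary = ScanSummary::default();
    futures::pin_mut!(metadata);
    while let Some(item) = metadata.next().await {
        let md = item?; // listing/decoding errors abort the scan
        let outcome = check(&md);
        summary.timelines_scanned += 1;
        summary.errors += outcome.errors.len();
        summary.warnings += outcome.warnings.len();
    }
    Ok(summary)
}

// In main(): std::process::exit(if summary.is_ok() { 0 } else { 1 })
```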
This is a "toe in the water" of the overall space of validating the
scrubber. Later, we should:
- Enable scrubbing at end of tests using remote storage by default
- Make the success condition stricter than "no errors": tests should
declare what tenants+timelines they expect to see in the bucket (or
sniff these from the functions tests use to create them) and we should
require that the scrubber reports on these particular tenants/timelines.
The `tidy` command is untouched in this PR, but it should be refactored
later to use similar async streaming interface instead of the current
batch-reading approach (the streams are faster with large buckets), and
to also be covered by some tests.
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
## Problem
It's hard to find out which DB size we use for OLAP benchmarks (TPC-H in
particular).
This PR adds handling of the `TEST_OLAP_SCALE` env var, which gets added
to the test name as a parameter.
This is required for performing larger periodic benchmarks.
## Summary of changes
- Handle `TEST_OLAP_SCALE` in
`test_runner/performance/test_perf_olap.py`
- Set `TEST_OLAP_SCALE` in `.github/workflows/benchmarking.yml` to a
TPC-H scale
Fixes #3830 by adding the `#[cfg(not(feature = "testing"))]` attribute
to unnecessary log statements in `pageserver/src/tenant/tasks.rs`.
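For illustration, the pattern looks roughly like this (the exact log lines in `tasks.rs` are not reproduced here):
```rust
fn background_task_iteration() {
    // ... actual work ...

    // Compiled out when the `testing` feature is enabled, so test runs
    // aren't flooded with this line; other builds keep it.
    #[cfg(not(feature = "testing"))]
    tracing::info!("background task iteration complete");
}
```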
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
Continuation of #4663, #5210
We're still getting an error:
```
GraphQL: Resource not accessible by integration (removeLabelsFromLabelable)
```
## Summary of changes
- trigger `approved-for-ci-run.yml` workflow on `pull_request_target`
instead of `pull_request`