## Problem
We want to store Nightly Replication test results in the database and
notify the relevant Slack channel about failures
## Summary of changes
- Store test results in the database
- Notify `on-call-compute-staging-stream` about failures
## Problem
`secrets.GITHUB_TOKEN` (with any permissions) is not enough to get
a user's membership info if they decide to hide it.
## Summary of changes
- Use `secrets.CI_ACCESS_TOKEN` for `gh api` call
- Use `pull_request_target` instead of `pull_request` event to get
access to secrets
## Problem
Logical replication BGW checking replication lag is not reloading config
## Summary of changes
Add handling of reload config request
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Per #8674, disallow node configuration while drain/fill are ongoing.
Implement it by adding a only-http wrapper
`Service::external_node_configure` which checks for operation existing
before configuring.
Additionally:
- allow cancelling drain/fill after a pageserver has restarted and
transitioned to WarmingUp
Fixes: #8674
## Problem
On macOS, clippy fails with the following error:
```
error: unused import: `crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt`
--> pageserver/src/tenant/remote_timeline_client/download.rs:26:5
|
26 | use crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `-D unused-imports` implied by `-D warnings`
= help: to override `-D warnings` add `#[allow(unused_imports)]`
```
Introduced in https://github.com/neondatabase/neon/pull/8717
## Summary of changes
- allow `unused_imports` for
`crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt` on macOS
in download.rs
## Problem
The control file contains the id of the safekeeper that uploaded it.
Previously, when sending a snapshot of the control file to another sk,
it would eventually be gc-ed by the receiving sk. This is incorrect
because the original sk might still need it later.
## Summary of Changes
When sending a snapshot and the control file contains an uploaded
segment:
* Create a copy of the segment in s3 with the destination sk in the
object name
* Tweak the streamed control file to point to the object create in the
previous step
Note that the snapshot endpoint now has to know the id of the requestor,
so the api has been extended to include the node if of the destination
sk.
Closes https://github.com/neondatabase/neon/issues/8542
The `tokio_epoll_uring::Slice` / `tokio_uring::Slice` type is weird.
The new `FullSlice` newtype is better. See the doc comment for details.
The naming is not ideal, but we'll clean that up in a future refactoring
where we move the `FullSlice` into `tokio_epoll_uring`. Then, we'll do
the following:
* tokio_epoll_uring::Slice is removed
* `FullSlice` becomes `tokio_epoll_uring::IoBufView`
* new type `tokio_epoll_uring::IoBufMutView` for the current
`tokio_epoll_uring::Slice<IoBufMut>`
Context
-------
I did this work in preparation for
https://github.com/neondatabase/neon/pull/8537.
There, I'm changing the type that the `inmemory_layer.rs` passes to
`DeltaLayerWriter::put_value_bytes` and thus it seemed like a good
opportunity to make this cleanup first.
## Problem
A bunch of small fixes and improvements for CI, that are too small to
have a separate PR for them
## Summary of changes
- CI(build-and-test): fix parenthesis
- CI(actionlint): fix path to workflow file
- CI: remove default args from actions/checkout
- CI: remove `gen3` label, using a couple `self-hosted` +
`small{,-arm64}`/`large{,-arm64}` is enough
- CI: prettify Slack messages, hide links behind text messages
- C(build-and-test): add more dependencies to `conclusion` job
## Problem
It's recommended that a couple of additional RUSTFLAGS be set up to
improve the performance of Rust applications on AWS Graviton.
See
57dc813626/rust.md
Note: Apple Silicon is compatible with neoverse-n1:
```
$ clang --version
Apple clang version 15.0.0 (clang-1500.3.9.4)
Target: arm64-apple-darwin23.6.0
Thread model: posix
InstalledDir: /Applications/Xcode_15.4.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
$
$ clang --print-supported-cpus 2>&1 | grep neoverse-
neoverse-512tvb
neoverse-e1
neoverse-n1
neoverse-n2
neoverse-v1
neoverse-v2
```
## Summary of changes
- Add `-Ctarget-feature=+lse -Ctarget-cpu=neoverse-n1` to RUSTFLAGS for
ARM images
Some benchmarks and tests might still fail because of #8655 (tracked in
#8708) because we are not fast enough to shut down ([one evidence]).
Partially this is explained by the current validation mode of streaming
k-merge, but otherwise because that is where we use a lot of time in
compaction. Outside of L0 => L1 compaction, the image layer generation
is already guarded by vectored reads doing cancellation checks.
32768 is a wild guess based on looking how many keys we put in each
layer in a bench (1-2 million), but I assume it will be good enough
divisor. Doing checks more often will start showing up as contention
which we cannot currently measure. Doing checks less often might be
reasonable.
[one evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10384136483/index.html#suites/9681106e61a1222669b9d22ab136d07b/96e6d53af234924/
Earlier PR: #8706.
## Problem
During `Run rust tests` step (for debug builds), we accidentally rebuild
neon twice (by `cargo test --doc` and by `cargo nextest run`).
It happens because we don't set `cov_prefix` for the `cargo test --doc`
command, which triggers rebuilding with different build flags, and one
more rebuild by `cargo nextest run`.
## Summary of changes
- Set `cov_prefix` for `cargo test --doc` to prevent unneeded rebuilds
## Problem
This command is kind of a hack, used when we're migrating large tenants
and want to get their resident size down. It sets the tenant config to a
fixed value, which omitted heatmap_period, so caused secondaries to get
out of date.
## Summary of changes
- Set heatmap period to the same 300s default that we use elsewhere when
updating eviction settings
This is not as elegant as some general purpose partial modification of
the config, but it practically makes the command safer to use.
basic JWT implementation that caches JWKs and verifies signatures.
this code is currently not reachable from proxy, I just wanted to get
something merged in.
We can get CompactionError::Other(Cancelled) via the error handling with
a few ways.
[evidence](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8655/10301613380/index.html#suites/cae012a1e6acdd9fdd8b81541972b6ce/653a33de17802bb1/).
Hopefully fix it by:
1. replace the `map_err` which hid the
`GetReadyAncestorError::Cancelled` with `From<GetReadyAncestorError> for
GetVectoredError` conversion
2. simplifying the code in pgdatadir_mapping to eliminate the token
anyhow wrapping for deserialization errors
3. stop wrapping GetVectoredError as anyhow errors
4. stop wrapping PageReconstructError as anyhow errors
Additionally, produce warnings if we treat any other error (as was legal
before this PR) as missing key.
Cc: #8708.
## Problem
`author_association` doesn't properly work if a GitHub user decides not
to show affiliation with the org in their profile (i.e. if it's private)
## Summary of changes
- Call
`/orgs/ORG/members/USERNAME` API to check whether
a PR/issue author is a member of the org
## Problem
When pageservers do compaction, they frequently create image layers that
make earlier layers un-needed for reads, but then keep those earlier
layers around for 24 hours waiting for time-based eviction to expire
them.
Now that we track layer visibility, we can use it as an input to
eviction, and avoid the 24 hour "disk bump" that happens around
pageserver restarts.
## Summary of changes
- During time-based eviction, if a layer is marked Covered, use the
eviction period as the threshold: i.e. these layers get to remain
resident for at least one iteration of the eviction loop, but then get
evicted. With current settings this means they get evicted after 1h
instead of 24h.
- During disk usage eviction, prioritized evicting covered layers above
all other layers.
Caveats:
- Using the period as the threshold for time based eviction in this case
is a bit of a hack, but it avoids adding yet another configuration
property, and in any case the value of a new property would be somewhat
arbitrary: there's no "right" length of time to keep covered layers
around just in case.
- We had previously planned on removing time-based eviction: this change
would motivate us to keep it around, but we can still simplify the code
later to just do the eviction of covered layers, rather than applying a
TTL policy to all layers.
With additional phases from #8430 the `detach_ancestor::Error` became
untenable. Split it up into phases, and introduce laundering for
remaining `anyhow::Error` to propagate them as most often
`Error::ShuttingDown`.
Additionally, complete FIXMEs.
Cc: #6994
## Problem
The storage scrubber was reporting warnings for lots of timelines like:
```
WARN Missed some shards at count ShardCount(0) tenant_id=25eb7a83d9a2f90ac0b765b6ca84cf4c
```
These were spurious: these tenants are fine. There was a bug in
accumulating the ShardIndex for each tenant, whereby multiple timelines
would lead us to add the same ShardIndex more than one.
Closes: #8646
## Summary of changes
- Accumulate ShardIndex in a BTreeSet instead of a Vec
- Extend the test to reproduce the issue
## Problem
See https://github.com/neondatabase/neon/issues/8499
## Summary of changes
Save HEAP_COMBOCID flag in WAL and do not clear it in redo handlers.
Related Postgres PRs:
https://github.com/neondatabase/postgres/pull/457https://github.com/neondatabase/postgres/pull/458https://github.com/neondatabase/postgres/pull/459
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
With the persistent gc blocking, we can now retry reparenting timelines
which had failed for whatever reason on the previous attempt(s).
Restructure the detach_ancestor into three phases:
- prepare (insert persistent gc blocking, copy lsn prefix, layers)
- detach and reparent
- reparenting can fail, so we might need to retry this portion
- complete (remove persistent gc blocking)
Cc: #6994
After #8655 we've had a few issues (mostly tracked on #8708) with the
graceful shutdown. In order to shutdown more of the processes and catch
more errors, for example, from all pageservers, do an immediate shutdown
for those nodes which fail the initial (possibly graceful) shutdown.
Cc: #6485
## Problem
We use a set of **Neon** reuse databases in benchmarking.yml which are
still using pg14.
Because we want to compare apples to apples and have migrated the AWS
reuse clusters to pg16 we should also use pg16 for Neon.
## Summary of changes
- Automatically restore the test databases for Neon project