## Problem
The `pageserver_smgr_query_seconds` buckets are too coarse, using powers
of 10: 1 µs, 10 µs, 100 µs, 1 ms, 10 ms, 100 ms, 1 s, 10 s, 100 s. This
is one of our most crucial latency metrics, and needs better resolution.
Touches #11594.
## Summary of changes
This patch uses buckets with better resolution around 1 ms (the typical
latency):
* 0.6 ms
* 1 ms
* 3 ms
* 6 ms
* 10 ms
* 30 ms
* 100 ms
* 1 s
* 3 s
These will be the same as the compute's `compute_getpage_wait_seconds`,
to make them comparable across the compute and Pageserver:
https://github.com/neondatabase/flux-fleet/pull/579. We sacrifice
buckets above 3 s, since these can already be considered "too slow".
This does not change the previously used `CRITICAL_OP_BUCKETS`, which is
also used for other operations on different timescales (e.g. LSN waits).
We should consider replacing this with more appropriate buckets for
specific operations, since it covers a large span with low resolution.
## Problem
pg-sni-router isn't aware of compute TLS
## Summary of changes
If connections come in on port 4433, we require TLS to compute from
pg-sni-router
## Problem
- if-conditions for the `check-macos-build` workflow don't trigger it on
PRs with relevant changes (in Rust code or Postgres submodules).
- Jobs in the workflow depend on the presence of a cache, which is not
guaranteed.
## Summary of changes
- Fix if-conditions
- Use artifacts on top of cache whenever the workflow depends on it —
the cache might not be available
## Problem
We currently don't run end-to-end tests for PostgreSQL extensions on our
cloud infrastructure, which means we might miss problems that only occur
in a real cloud environment.
## Summary of changes
- Added a workflow to run extension tests against a cloud staging
instance
- Set up proper project configuration for extension testing
- Implemented test execution with appropriate environment settings
- Added error handling and reporting for test failures
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
## Problem
Provide an easy way to run particular test(s) N times on CI.
## Summary of changes
* Allow for passing the test selection and the number of test runs to
the existing "Build and Test Locally" workflow
* Allow for running multiple selected tests by the "Pytest regression
tests" step
* Introduce a new workflow to run specified test(s) several times
* Store results in a separate database to distinguish between testing
tests for stability and usual testing
## Problem
Proposed minor changes to the `consumption_metrics` document.
## Summary of changes
- Fixed minor typos in the document.
- Minor formatting in the description of metrics `timeline_logical_size`
and `synthetic_storage_size`. Makes this consistent as with description
of other metrics in the document.
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
Co-authored-by: Mikhail Kot <mikhail@neon.tech>
### Summary
I'm fixing one or more of the following CI/CD misconfigurations to
improve security. Please feel free to leave a comment if you think the
current permissions for the GITHUB_TOKEN should not be restricted so I
can take a note of it as accepted behaviour.
- Restrict permissions for GITHUB_TOKEN
- Add step-security/harden-runner
- Pin Actions to a full length commit SHA
### Security Fixes
will fix https://github.com/neondatabase/cloud/issues/26141
## Problem
Broker supports only HTTP, no HTTPS
- Closes: https://github.com/neondatabase/cloud/issues/27492
## Summary of changes
- Add `listen_https_addr`, `ssl_key_file`, `ssl_cert_file`,
`ssl_cert_reload_period` arguments to storage broker
- Make `listen_addr` argument optional
- Listen https in storage broker
- Support https for storage broker request in neon_local
- Add `use_https_storage_broker_api` option to NeonEnvBuilder
## Problem
Shard splits break timeline imports.
## Summary of Changes
Ensure mutual exclusion for imports and shard splits.
On the shard split code path:
1. Right before shard splitting, check the database to ensure that
no-import is on-going for the tenant. Exclusion is guaranteed because
this validation is done while holding the exclusive tenant lock.
Timeline creation (and import creation implicitly) requires a shared
tenant lock.
2. When selecting a shard to split, use the in-mem state to exclude
shards with an on-going import. This is opportunistic since an import
might start after the check, but allows shard splits to make progres
instead of continously retrying to split the same shard.
On the timeline creation code path:
1. Check the in-memory splitting flag on all shards of the tenant. If
any of them are splitting, error out asking the client to retry. On the
happy path this is not required, due to the tenant lock set-up described
above, but it covers the case where we restart with a pending
shard-split.
Closes https://github.com/neondatabase/neon/issues/11567
## Problem
- using Hetzner buckets for cache requires secrets, we either need
`secrets: inherit` to make it works
- we don't have self-hosted MacOs runners, so actually GH native cache
is more optimal solution there
## Summary of changes
- switch to GH native cache for macos builds
## Problem
The docker compose test script (`docker_compose_test.sh`) had
inconsistent codestyle, mixing legacy syntax with modern approaches and
not following best practices at all. This inconsistency could lead to
potential issues with variable expansion, path handling, and
maintainability.
## Summary of changes
This PR modernizes the test script with several codestyle improvements:
* Variable scoping and exports:
* Added proper export declarations for environment variables
* Added explicit COMPOSE_PROFILES export to avoid repetitive flags
* Modern Bash syntax:
* Replaced [ ] with [[ ]] for safer conditional testing
* Used arithmetic operations (( cnt += 3 )) instead of expr
* Added proper variable expansion with braces ${variable}
* Added proper quoting around variables and paths with "${variable}"
* Docker Compose commands:
* Replaced hardcoded container names with service names
* Used docker compose exec instead of docker exec $CONTAINER_NAME
* Removed repetitive flags by using environment variables
* Shell script best practices:
* Added function keyword before function definition
* Used safer path handling with "$(dirname "${0}")"
These changes make the script more maintainable, less error-prone, and
more consistent with modern shell scripting standards.
## Problem
It seems are production-ready cert-manager setup now includes a full
certificate chain. This was not accounted for and the decoder would
error.
## Summary of changes
Change the way we decode certificates to support cert-chains, ignoring
all but the first cert.
This also changes a log line to not use multi-line errors.
~~I have tested this code manually against real certificates/keys, I
didn't want to embed those in a test just yet, not until the cert
expires in 24 hours.~~
## Problem
close https://github.com/neondatabase/neon/issues/11694
We had the delta layer iterator and image layer iterator set to buffer
at most 8MB data. Note that 8MB is the compressed size, so it is
possible for those iterators contain more than 8MB data in memory.
For the recent OOM case, gc-compaction was running over 556 layers,
which means that we will have 556 active iterators. So in theory, it
could take up to 556*8=4448MB memory when the compaction is going on. If
images get compressed and the compression ratio is high (for that
tenant, we see 3x compression ratio across image layers), then that's
13344MB memory.
Also we have layer rewrites, which explains the memory taken by
gc-compaction itself (versus the iterators). We rewrite 424 out of 556
layers, and each of such rewrites need a pair of delta layer writer. So
we are buffering a lot of deltas in the memory.
The flamegraph shows that gc-compaction itself takes 6GB memory, delta
iterator 7GB, and image iterator 2GB, which can be explained by the
above theory.
## Summary of changes
- Reduce the buffer sizes.
- Estimate memory consumption and if it is too high.
- Also give up if the number of layers-to-rewrite is too high.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
The current test was just SQL files only, but we also want to test a
remote extension which includes a loadable library. With both extensions
we should cover a larger portion of compute_ctl's remote extension code
paths.
Fixes: https://github.com/neondatabase/neon/issues/11146
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
In https://github.com/neondatabase/neon/pull/11345 coordination of
imports moved to the storage controller.
It involves notifying cplane when the import has been completed by
calling an idempotent endpoint.
If the storage controller shuts down in the middle of finalizing an
import, it would never be retried.
## Summary of changes
Reconcile imports at start-up by fetching the complete imports from the
database and spawning a background
task which notifies cplane.
Closes: https://github.com/neondatabase/neon/issues/11570
## Problem
We saw the following scenario in staging:
1. Pod A starts up. Becomes leader and steps down the previous pod
cleanly.
2. Pod B starts up (deployment).
3. Step down request from pod B to pod A times out. Pod A did not manage
to stop its reconciliations within 10 seconds and exited with return
code 1
([code](7ba8519b43/storage_controller/src/service.rs (L8686-L8702))).
4. Pod B marks itself as the leader and finishes start-up
5. k8s restarts pod A
6. k8s marks pod B as ready
7. pod A sends step down request to pod A - this succeeds => pod A is
now the leader
8. k8s kills pod A because it thinks pod B is healthy and pod A is part
of the old replica set
We end up in a situation where the only pod we have (B) is stepped down
and attempts to forward requests to a leader that doesn't exist. k8s
can't detect that pod B is in a bad state since the /status endpoint
simply returns 200 hundred if the pod is running.
## Summary of changes
This PR includes a number of robustness improvements to the leadership
protocol:
* use a single step down task per controller
* add a new endpoint to be used as k8s liveness probe and check
leadership status there
* handle restarts explicitly (i.e. don't step yourself down)
* increase the step down retry count
* don't kill the process on long step down since k8s will just restart
it
# Problem
The Pageserver read path exclusively uses direct IO if
`virtual_file_io_mode=direct`.
The write path is half-finished. Here is what the various writing
components use:
|what|buffering|flags on <br/>`v_f_io_mode`<br/>=`buffered`|flags on
<br/>`virtual_file_io_mode`<br/>=`direct`|
|-|-|-|-|
|`DeltaLayerWriter`| BlobWriter<BUFFERED=true> | () | () |
|`ImageLayerWriter`| BlobWriter<BUFFERED=false> | () | () |
|`download_layer_file`|BufferedWriter|()|()|
|`InMemoryLayer`|BufferedWriter|()|O_DIRECT|
The vehicle towards direct IO support is `BufferedWriter` which
- largely takes care of O_DIRECT alignment & size-multiple requirements
- double-buffering to mask latency
`DeltaLayerWriter`, `ImageLayerWriter` use `blob_io::BlobWriter` , which
has neither of these.
# Changes
## High-Level
At a high-level this PR makes the following primary changes:
- switch the two layer writer types to use `BufferedWriter` & make
sensitive to `virtual_file_io_mode` (via open_with_options_**v2**)
- make `download_layer_file` sensitive to `virtual_file_io_mode` (also
via open_with_options_**v2**)
- add `virtual_file_io_mode=direct-rw` as a feature gate
- we're hackish-ly piggybacking on OpenOptions's ask for write access
here
- this means with just `=direct` InMemoryLayer reads and writes no
longer uses O_DIRECT
- this is transitory and we'll remove the `direct-rw` variant once the
rollout is complete
(The `_v2` APIs for opening / creating VirtualFile are those that are
sensitive to `virtual_file_io_mode`)
The result is:
|what|uses <br/>`BufferedWriter`|flags on
<br/>`v_f_io_mode`<br/>=`buffered`|flags on
<br/>`v_f_io_mode`<br/>=`direct`|flags on
<br/>`v_f_io_mode`<br/>=`direct-rw`|
|-|-|-|-|-|
|`DeltaLayerWriter`| ~~Blob~~BufferedWriter | () | () | O_DIRECT |
|`ImageLayerWriter`| ~~Blob~~BufferedWriter | () | () | O_DIRECT |
|`download_layer_file`|BufferedWriter|()|()|O_DIRECT|
|`InMemoryLayer`|BufferedWriter|()|~~O_DIRECT~~()|O_DIRECT|
## Code-Level
The main change is:
- Switch `blob_io::BlobWriter` away from its own buffering method to use
`BufferedWriter`.
Additional prep for upholding `O_DIRECT` requirements:
- Layer writer `finish()` methods switched to use IoBufferMut for
guaranteed buffer address alignment. The size of the buffers is PAGE_SZ
and thereby implicitly assumed to fulfill O_DIRECT requirements.
For the hacky feature-gating via `=direct-rw`:
- Track `OpenOptions::write(true|false)` in a field; bunch of mechanical
churn.
- Consolidate the APIs in which we "open" or "create" VirtualFile for
better overview over which parts of the code use the `_v2` APIs.
Necessary refactorings & infra work:
- Add doc comments explaining how BufferedWriter ensures that writes are
compliant with O_DIRECT alignment & size constraints. This isn't new,
but should be spelled out.
- Add the concept of shutdown modes to `BufferedWriter::shutdown` to
make writer shutdown adhere to these constraints.
- The `PadThenTruncate` mode might not be necessary in practice because
I believe all layer files ever written are sized in multiples `PAGE_SZ`
and since `PAGE_SZ` is larger than the current alignment requirements
(512/4k depending on platform), it won't be necesary to pad.
- Some test (I believe `round_trip_test_compressed`?) required it though
- [ ] TODO: decide if we want to accept that complexity; if we do then
address TODO in the code to separate alignment requirement from buffer
capacity
- Add `set_len` (=`ftruncate`) VirtualFile operation to support the
above.
- Allow `BufferedWriter` to start at a non-zero offset (to make room for
the summary block).
Cleanups unlocked by this change:
- Remove non-positional APIs from VirtualFile (e.g. seek, write_full,
read_full)
Drive-by fixes:
- PR https://github.com/neondatabase/neon/pull/11585 aimed to run unit
tests for all `virtual_file_io_mode` combinations but didn't because of
a missing `_` in the env var.
# Performance
This section assesses this PR's impact on deployments with current
production setting (`=direct`) and anticipated impact of switching to
(`=direct-rw`).
For `DeltaLayerWriter`, `=direct` should remain unchanged to slightly
improved on throughput because the `BlobWriter`'s buffer had the same
size as the `BufferedWriter`'s buffer, but it didn't have the
double-buffering that `BufferedWriter` has.
The `=direct-rw` enables direct IO; throughput should not be suffering
because of double-buffering; benchmarks will show if this is true.
The `ImageLayerWriter` was previously not doing any buffering
(`BUFFERED=false`).
It went straight to issuing the IO operation to the underlying
VirtualFile and the buffering was done by the kernel.
The switch to `BufferedWriter` under `=direct` adds an additional memcpy
into the BufferedWriter's buffer.
We will win back that memcpy when enabling direct IO via `=direct-rw`.
A nice win from the switch to `BufferedWriter` is that ImageLayerWriter
performs >=16x fewer write operations to VirtualFile (the BlobWriter
performs one write per len field and one write per image value).
This should save low tens of microseconds of CPU overhead from doing all
these syscalls/io_uring operations, regardless of `=direct` or
`=direct-rw`.
Aside from problems with alignment, this write frequency without
double-buffering is prohibitive if we actually have to wait for the
disk, which is what will happen when we enable direct IO via
(`=direct-rw`).
Throughput should not be suffering because of BufferedWrite's
double-buffering; benchmarks will show if this is true.
`InMemoryLayer` at `=direct` will flip back to using buffered IO but
remain on BufferedWriter.
The buffered IO adds back one memcpy of CPU overhead.
Throughput should not suffer and will might improve on
not-memory-pressured Pageservers but let's remember that we're doing the
whole direct IO thing to eliminate global memory pressure as a source of
perf variability.
## bench_ingest
I reran `bench_ingest` on `im4gn.2xlarge` and `Hetzner AX102`.
Use `git diff` with `--word-diff` or similar to see the change.
General guidance on interpretation:
- immediate production impact of this PR without production config
change can be gauged by comparing the same `io_mode=Direct`
- end state of production switched over to `io_mode=DirectRw` can be
gauged by comparing old results' `io_mode=Direct` to new results'
`io_mode=DirectRw`
Given above guidance, on `im4gn.2xlarge`
- immediate impact is a significant improvement in all cases
- end state after switching has same significant improvements in all
cases
- ... except `ingest/io_mode=DirectRw volume_mib=128 key_size_bytes=8192
key_layout=Sequential write_delta=Yes` which only achieves `238 MiB/s`
instead of `253.43 MiB/s`
- this is a 6% degradation
- this workload is typical for image layer creation
# Refs
- epic https://github.com/neondatabase/neon/issues/9868
- stacked atop
- preliminary refactor https://github.com/neondatabase/neon/pull/11549
- bench_ingest overhaul https://github.com/neondatabase/neon/pull/11667
- derived from https://github.com/neondatabase/neon/pull/10063
Co-authored-by: Yuchen Liang <yuchen@neon.tech>
## Problem
`cargo-deny` 0.16.2 spits a bunch of warnings like:
```
warning[index-failure]: unable to check for yanked crates
```
The issue is fixed for the latest version of `cargo-deny` (0.18.2). And
while we're here, let's bump all the packages we have in `build-tools`
image
## Summary of changes
- bump cargo-hakari to 0.9.36
- bump cargo-deny to 0.18.2
- bump cargo-hack to 0.6.36
- bump cargo-nextest to 0.9.94
- bump diesel_cli to 2.2.9
- bump s5cmd to 2.3.0
- bump mold to 2.37.1
- bump python to 3.11.12
## Problem
Currently, we only report the timestamp of the last moment we think
Postgres was active. The problem is that if Postgres gets completely
unresponsive, we still report some old timestamp, and it's impossible to
distinguish situations 'Postgres is effectively down' and 'Postgres is
running, but no client activity'.
## Summary of changes
Refactor the `compute_ctl`'s compute monitor so that it was easier to
track the connection errors and failed activity checks, and report
- `now() - last_successful_check` as current downtime on any failure
- cumulative Postgres downtime during the whole compute lifetime
After adding a test, I also noticed that the compute monitor may not
reconnect even though queries fail with `connection closed` or `error
communicating with the server: Connection reset by peer (os error 54)`,
but for some reason we do not catch it with `client.is_closed()`, so I
added an explicit reconnect in case of any failures.
Discussion:
https://neondb.slack.com/archives/C03TN5G758R/p1742489426966639
Main change:
- `BufferedWriter` owns the `W`; no more `Arc<W>`
- We introduce auto-delete-on-drop wrappers for `VirtualFile`.
- `TempVirtualFile` for write-only users
- `TempVirtualFileCoOwnedByEphemeralFileAndBufferedWriter` for
EphemeralFile which requires read access to the immutable prefix of the
file (see doc comments for details)
- Users of `BufferedWriter` hand it such a wrapped `VirtualFile`.
- The wrapped `VirtualFile` moves to the background flush task.
- On `BufferedWriter` shutdown, ownership moves back.
- Callers remove the wrapper (`disarm_into_inner()`) after doing final
touches, e.g., flushing index blocks and summary for delta/image layer
writers.
If the BufferedWriter isn't shut down properly via
`BufferedWriter::shutdown`, or if there is an error during final
touches, the wrapper type ensures that the file gets unlinked.
We store a GateGuard inside the wrapper to ensure that the Timeline is
still alive when unlinking on drop.
Rust doesn't have async drop yet, so, the unlinking happens using a
synchronous syscall.
NB we don't fsync the surrounding directory.
This is how it's been before this PR; I believe it is correct because
all of these files are temporary paths that get cleaned up on timeline
load.
Again, timeline load does not need to fsync because the next timeline
load will unlink again if the file reappears.
The auto-delete-on-drop can happen after a higher-level mechanism
retries.
Therefore, we switch all users to monotonically increasing, never-reused
temp file disambiguators.
The aspects pointed out in the last two paragraphs will receive further
cleanup in follow-up task
- https://github.com/neondatabase/neon/issues/11692
Drive-by changes:
- It turns out we can remove the two-pronged code in the layer file
download code.
No need to make this a separate PR because all of production already
uses `tokio-epoll-uring` with the buffered writer for many weeks.
Refs
- epic https://github.com/neondatabase/neon/issues/9868
- alternative to https://github.com/neondatabase/neon/pull/11544
* Replace yanked papaya version
* Remove unused allowed license: OpenSSL
* Remove Zlib license from general allow list since it's listed in the
exceptions section per crate
* Drop clarification for ring since they have separate LICENSE files now
* List the tower-otel repo as allowed source while we sort out the OTel
deps
Switches the tenant snapshot subcommand of the storage scrubber to
`remote_storage`. As this is the last piece of the storage scrubber
still using the S3 SDK, this finishes the project started in #7547.
This allows us to do tenant snapshots on Azure as well.
Builds on #11671Fixes#8830
Adds a versioning API to remote_storage. We want to use it in the
scrubber, both for tenant snapshot as well as for metadata checks.
for #8830
and for #11588
## Problem
Pageservers notify control plane directly when a shard import has
completed.
Control plane has to download the status of each shard from S3 and
figure out if everything is truly done,
before proceeding with branch activation.
Issues with this approach are:
* We can't control shard split behaviour on the storage controller side.
It's unsafe to split
during import.
* Control plane needs to know about shards and implement logic to check
all timelines are indeed ready.
## Summary of changes
In short, storage controller coordinates imports, and, only when
everything is done, notifies control plane.
Big rocks:
1. Store timeline imports in the storage controller database. Each
import stores the status of its shards in the database.
We hook into the timeline creation call as our entry point for this.
2. Pageservers get a new upcall endpoint to notify the storage
controller of shard import updates.
3. Storage controller handles these updates by updating persisted state.
If an update finalizes the import,
then poll pageservers until timeline activation, and, then, notify the
control plane that the import is complete.
Cplane side change with new endpoint is in
https://github.com/neondatabase/cloud/pull/26166
Closes https://github.com/neondatabase/neon/issues/11566
Update the sentry crate to 0.37. This deduplicates the `webpki-roots`
crate in our crate graph, and brings another dependency onto newer
rustls `0.23.18`.
# Add --dev CLI flag to pageserver and safekeeper binaries
This PR adds the `--dev` CLI flag to both the pageserver and safekeeper
binaries without implementing any functionality yet. This is a precursor
to PR #11517, which will implement the full functionality to require
authentication by default unless the `--dev` flag is specified.
## Changes
- Add `dev_mode` config field to pageserver binary
- Add `--dev` CLI flag to safekeeper binary
This PR is needed for forward compatibility tests to work properly, when
we try to merge #11517
Link to Devin run:
https://app.devin.ai/sessions/ad8231b4e2be430398072b6fc4e85d46
Requested by: John Spray (john@neon.tech)
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: John Spray <john@neon.tech>
# Fix KeyError in physical replication benchmark test
This PR fixes the failing physical replication benchmark test that was
encountering a KeyError: 'endpoints'.
The issue was in accessing `project["project"]["endpoints"][0]["id"]`
when it should be `project["endpoints"][0]["id"]`, consistent with how
endpoints are accessed elsewhere in the codebase.
Fixed the issue in both test functions:
- test_ro_replica_lag
- test_replication_start_stop
Link to Devin run:
https://app.devin.ai/sessions/be3fe9a9ee5942e4b12e74a7055f541b
Requested by: Peter Bendel
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: peterbendel@neon.tech <peterbendel@neon.tech>
ARM computes are incoming and we need to account for that in remote
extensions. Previously, we just blindly assumed that all computes were
x86_64.
Note that we use the Go architecture naming convention instead of the
Rust one directly to do our best and be consistent across the stack.
Part-of: https://github.com/neondatabase/cloud/issues/23148
Signed-off-by: Tristan Partin <tristan@neon.tech>
In tests and when one safekeeper is down in small regions, we need to
contend with one or two safekeepers. Before, we gave an error in
`safekeepers_for_new_timeline`. Now we just silently allow the timeline
to be created on one or two safekeepers.
Part of #9011
## Problem
test_storage_controller_heartbeats is flaky because of unallowed
reconciler errors (#11625)
## Summary of changes
Allow reconcile errors as in other tests in test_storage_controller.py.
## Problem
Init fork is used in DEBUG_COMPARE_LOCAL to determine unlogged relation
or unlogged build.
But it is created only after the relation is initialized and so can be
swapped out, producing `Page is evicted with zero LSN` error.
## Summary of changes
Create init fork together with main fork for unlogged relations in
DEBUG_COMPARE_LOCAL mode.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Safekeeper doesn't use TLS in wal service
- Closes: https://github.com/neondatabase/cloud/issues/27302
## Summary of changes
- Add `enable_tls_wal_service_api` option to safekeeper's cmd arguments
- Propagate `tls_server_config` to `wal_service` if the option is
enabled
- Create `BACKGROUND_RUNTIME` for small background tasks and offload SSL
certificate reloader to it.
No integration tests for now because support from compute side is
required: https://github.com/neondatabase/cloud/issues/25823
## Problem
The pg_repack test can be flaky due to unpredictable `NOTICE` messages
about waiting for some processes.
E.g.,
```
INFO: repacking table "public.issue3_2"
+NOTICE: Waiting for 1 transactions to finish. First PID: 427
```
## Summary of changes
The `client_min_messages` set to `warning` for the regression tests.
## Problem
We run benchmarks in batches (five parallel jobs on different runners).
If any test in a batch fails, we won’t upload any results for that
batch, even for the tests that passed.
## Summary of changes
- Move the results upload to a separate step in the run-python-test-set
action, and execute this step even if tests fail.
## Problem
If all batched requests are excluded from the query by
`Timeine::get_rel_page_at_lsn_batched` (e.g. because they are past the
end of the relation), the read path would panic since it doesn't expect
empty queries. This is a change in behaviour that was introduced with
the scattered query implementation.
## Summary of Changes
Handle empty queries explicitly.
This makes it easier to add a different client implementation alongside
the current one. I started working on a new gRPC-based protocol to
replace the libpq protocol, which will introduce a new function like
`client_libpq`, but for the new protocol.
It's a little more readable with less indentation anyway.
## Problem
https://github.com/neondatabase/neon/actions/runs/14538136318/job/40790985693?pr=11645
failed, even though the relevant parts of the CI had passed and
auto-merge determined the PR is ready to merge. After that, commenting
failed.
## Summary of changes
- set GH_TOKEN for commenting after fast-forward failure
- allow merging with mergeable_state unstable
## Problem
Pageservers and safakeepers do not pass CA certificates to broker
client, so the client do not trust locally issued certificates.
- Part of https://github.com/neondatabase/cloud/issues/27492
## Summary of changes
- Change `ssl_ca_certs` type in PS/SK's config to `Pem` which may be
converted to both `reqwest` and `tonic` certificates.
- Pass CA certificates to storage broker client in PS and SK
## Problem
follow-up on https://github.com/neondatabase/neon/pull/11601
## Summary of changes
- serialize the start/end time using rfc3339 time string
- compute the size ratio of the compaction
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
This delivers some additional fixes and improvements to storcon managed
safekeeper timelines:
* use `i32::MAX` for the generation number of timeline deletion
* start the generation for new timelines at 1 instead of 0: this ensures
that the other components actually are generation enabled
* fix database operations we use for metrics
* use join in list_pending_ops to prevent the classical ORM issue where
one does many db queries
* use enums in `test_storcon_create_delete_sk_down`. we are adding a
second parameter, and having two bool parameters is weird.
* extend `test_storcon_create_delete_sk_down` with a test of whole
tenant deletion. this hasn't been tested before.
* remove some redundant logging contexts
* Don't require mutable access to the service lock for scheduling
pending ops in memory. In order to pull this off, create reconcilers
eagerly. The advantage is that we don't need mutable access to the
service lock that way any more.
Part of #9011
---------
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
## Problem
We need to test the stability of Neon.
## Summary of changes
The test runs random operations on a Neon project. It performs via the
Public API calls the following operations: `create a branch`, `delete a
branch`, `add a read-only endpoint`, `delete a read-only endpoint`,
`restore a branch to a random position in the past`. All the branches
and endpoints are loaded with `pgbench`.
---------
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
We saw OOMs due to L0 compaction happening simultaneously for all shards
of the same tenant right after the shard split.
## Summary of changes
Lower the threshold so that we compact fewer files.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
https://github.com/neondatabase/neon/pull/11531 did not fully fix the
problem because the warning is part of the storcon instead of
pageserver.
## Summary of changes
Allow stale generation error in storcon.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`test_compute_startup_simple` and `test_compute_ondemand_slru_startup`
are failing.
This test implicitly asserts that the metrics.json endpoint succeeds and
returns all expected metrics, but doesn't make it easy to see what went
wrong if it doesn't (e.g. in this failure
https://neon-github-public-dev.s3.amazonaws.com/reports/main/14513210240/index.html#suites/13d8e764c394daadbad415a08454c04e/b0f92a86b2ed309f/)
In this case, it was failing because of a missing auth token, because it
was using `requests` directly instead of using the endpoint http client
type.
## Summary of changes
- Use endpoint http wrapper to get raise_for_status & auth token
## Problem
`Tenant` isn't really a whole tenant: it's just one shard of a tenant.
## Summary of changes
- Automated rename of Tenant to TenantShard
- Followup commit to change references in comments
## Problem
There are mentions of `ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` and
`ALLOW_FORWARD_COMPATIBILITY_BREAKAGE`, but in reality, this mechanism
doesn't work, so let's remove it to avoid confusion.
The idea behind it was to allow some breaking changes by adding a
special label to a PR that would `xfail` the test. However, in practice,
this means we would need to carry this label through all subsequent PRs
until the release (and artifact regeneration). This approach isn't
really viable, as it increases the risk of missing a compatibility break
in another PR.
## Summary of changes
- Remove mentions and handling of
`ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` /
`ALLOW_FORWARD_COMPATIBILITY_BREAKAGE`
## Problem
`pg-clients` can't start:
```
The workflow is not valid. .github/workflows/pg-clients.yml (Line: 44, Col: 3): Error calling workflow 'neondatabase/neon/.github/workflows/build-build-tools-image.yml@aa19f10e7e958fbe0e0641f2e8c5952ce3be44b3'. The nested job 'check-image' is requesting 'packages: read', but is only allowed 'packages: none'. .github/workflows/pg-clients.yml (Line: 44, Col: 3): Error calling workflow 'neondatabase/neon/.github/workflows/build-build-tools-image.yml@aa19f10e7e958fbe0e0641f2e8c5952ce3be44b3'. The nested job 'build-image' is requesting 'packages: write', but is only allowed 'packages: none'.
```
## Summary of changes
- Grant required `packages: write` permissions to the workflow
## Problem
Test lfc working set approximation becomes flaky after recent changes in
prefetch.
May be it is caused by updating HLL in `lfc_write`, may be by some other
reasons.
## Summary of changes
1. Disable autovacuum in this test (as possible source of extra page
accesses).
2. Increase upper boundary for WS approximation from 12 to 20.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
This is mostly a documentation update, but a few updates with regard to
neon_local, pageserver, and tests.
17 is our default for users in production, so dropping references to 16
makes sense.
Signed-off-by: Tristan Partin <tristan@neon.tech>
Signed-off-by: Tristan Partin <tristan@neon.tech>
These various hacks were needed for the forward compatibility tests.
Enough time has passed since the merge that these are no longer needed.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
The proxy denies using `unwrap()`s in regular code, but we want to use
it in test code
and so have to allow it for each test block.
## Summary of changes
Set `allow-unwrap-in-tests = true` in clippy.toml and remove all
exceptions.
Testodrome measures uptime based on the failed requests and errors. In
case of testodrome request we send back error based on the service. This
will help us distinguish error types in testodrome and rely on the
uptime SLI.
## Problem
We currently only have gc-compaction statistics for each single
sub-compaction job.
## Summary of changes
Add meta statistics across all sub-compaction jobs scheduled.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
During shard ancestor compaction, we currently recompress all page
images as we move them into a new layer file. This is expensive and
unnecessary.
Resolves#11562.
Requires #11607.
## Summary of changes
Pass through compressed page images in `ImageLayerInner::filter()`.
1. Compute may generate WAL on shutdown. The test assumes that after
shutdown,
no further ingest happens. Tweak the compute shutdown to make the
assumption true.
2. Assertion of local layer count post cold migration is not right since
we may have downloaded
layers due to ingest. Remove it.
Closes https://github.com/neondatabase/neon/issues/11587
## Problem
To avoid recompressing page images during layer filtering, we need
access to the raw header and data from vectored reads such that we can
pass them through to the target layer.
Touches #11562.
## Summary of changes
Adds `VectoredBlob::raw_with_header()` to return a raw view of the
header+data, and updates `read()` to track it.
Also adds `blob_io::Header` with header metadata and decode logic, to
reuse for tests and assertions. This isn't yet widely used.
## Problem
Splits of large tenants (several TB) can cause a huge amount of shard
ancestor compaction work, which can overload Pageservers.
Touches https://github.com/neondatabase/cloud/issues/22532.
## Summary of changes
Add a setting `compaction_shard_ancestor` (default `true`) to disable
shard ancestor compaction on a per-tenant basis.
Finally figured out the right incantation. I had had this in my original
go, but due to some refactoring and apparently missed testing, I
committed a mistake. The reason this doesn't currently break anything is
that we bypass the authorization middleware when the "testing" cargo
feature is enabled.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Metrics are saved in https://github.com/neondatabase/neon/pull/11559,
but the file is not matched by the attachment regex.
## Summary of changes
Make attachment regex match the metrics file.
## Problem
close https://github.com/neondatabase/neon/issues/11486, proceeding
https://github.com/neondatabase/neon/pull/11531
## Summary of changes
This patch fixes the rest 50% of instability of
`test_create_churn_during_restart`. During tenant warmup, we'll request
logical size; however, if the startup gets cancelled, we won't be able
to spawn the initial logical size calculation task that sets the
`cancel_wait_for_background_loop_concurrency_limit_semaphore`.
Therefore, we check `cancelled` before proceeding to get
`cancel_wait_for_background_loop_concurrency_limit_semaphore`. There
will still be a race if the timeline shutdown happens after L5710 and
before L5711, but it should be enough to reduce the flakiness of the
test.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We can't have more than one open release PR created on the same day (due
to non-unique enough branch names).
## Summary of changes
- Add time (hours and minutes) to RC PR branch names
- Also make sure we use UTC for releases
## Problem
When doing power-of-two shard splits (i.e. 4 → 8 → 16), we end up
rewriting all layers since half of the pages will be local due to
striping. This causes a lot of resource usage when splitting large
tenants.
## Summary of changes
Drop the threshold of local/total pages to 30%, to reduce the amount of
layer rewrites after splits.
## Problem
Shard ancestor compaction can be very expensive following shard splits
of large tenants. We currently rewrite garbage layers after shard splits
as well, which can be a significant amount of data.
Touches https://github.com/neondatabase/cloud/issues/22532.
## Summary of changes
Don't rewrite invisible layers after shard splits.
## Problem
https://github.com/neondatabase/neon/pull/11515 introduced a bug that
some key history cannot be verified.
If a key only exists above the horizon, the verification will fail for
its first occurrence because the history does not exist at that point.
As gc-compaction skips a key range whenever an error occurs, it might be
doing some wasted work in staging/prod now. But I'm not planning a
hotfix this week as the bug doesn't affect correctness/performance.
## Summary of changes
Allow keys with only above horizon history in the verification.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
There is too much delay between merging a PR into `main` and deploying
the changes to staging
## Summary of changes
- Trigger `deploy` job without waiting for `build-and-test-locally` job
## Problem
The information in the README.md contained errors, and some information
was missing.
## Summary of changes
Found errors are fixed, and new information is added.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
The `/__w/neon/neon` directory is mounted from host to container and
persists between runs.
Sometimes the next workflow run fails to delete it:
```
Deleting the contents of '/__w/neon/neon'
Error: File was unable to be removed Error: EACCES: permission denied, rmdir '/__w/neon/neon/allure-2.32.2/bin'
```
## Summary of changes
- Download and install allure to `/tmp` which exists in container only
Ref https://github.com/neondatabase/cloud/issues/27186
## Problem
Benchmarks results are inconsistent on existing small-metal runners
## Summary of changes
Introduce new `unit-perf` runners, and lets run benchmark on them.
The new hardware has slower, but consistent, CPU frequency - if run with
default governor schedutil.
Thus we needed to adjust some testcases' timeouts and add some retry
steps where hard-coded timeouts couldn't be increased without changing
the system under test.
-
[wait_for_last_record_lsn](6592d69a67/test_runner/fixtures/pageserver/utils.py (L193))
1000s -> 2000s
-
[test_branch_creation_many](https://github.com/neondatabase/neon/pull/11409/files#diff-2ebfe76f89004d563c7e53e3ca82462e1d85e92e6d5588e8e8f598bbe119e927)
1000s
-
[test_ingest_insert_bulk](https://github.com/neondatabase/neon/pull/11409/files#diff-e90e685be4a87053bc264a68740969e6a8872c8897b8b748d0e8c5f683a68d9f)
- with back throttling disabled compute becomes unresponsive for more
than 60 seconds (PG hard-coded client authentication connection timeout)
-
[test_sharded_ingest](https://github.com/neondatabase/neon/pull/11409/files#diff-e8d870165bd44acb9a6d8350f8640b301c1385a4108430b8d6d659b697e4a3f1)
600s -> 1200s
Right now there are only 2 runners of that class, and if we decide to go
with them, we have to check how much that type of runners we need, so
jobs not stuck with waiting for that type of runners available.
However we now decided to run those runners with governor performance
instead of schedutil.
This achieves almost same performance as previous runners but still
achieves consistent results for same commit
Related issue to activate performance governor on these runners
https://github.com/neondatabase/runner/pull/138
## Verification that it helps
### analyze runtimes on new runner for same commit
Table of runtimes for the same commit on different runners in
[run](https://github.com/neondatabase/neon/actions/runs/14417589789)
| Run | Benchmarks (1) | Benchmarks (2) |Benchmarks (3) |Benchmarks (4)
| Benchmarks (5) |
|--------|--------|---------|---------|---------|---------|
| 1 | 1950.37s | 6374.55s | 3646.15s | 4149.48s | 2330.22s |
| 2 | - | 6369.27s | 3666.65s | 4162.42s | 2329.23s |
| Delta % | - | 0,07 % | 0,5 % | 0,3 % | 0,04 % |
| with governor performance | 1519.57s | 4131.62s | - | - | - |
| second run gov. perf. | 1513.62s | 4134.67s | - | - | - |
| Delta % | 0,3 % | 0,07 % | - | - | - |
| speedup gov. performance | 22 % | 35 % | - | - | - |
| current desktop class hetzner runners (main) | 1487.10s | 3699.67s | -
| - | - |
| slower than desktop class | 2 % | 12 % | - | - | - |
In summary, the runtimes for the same commit on this hardware varies
less than 1 %.
---------
Co-authored-by: BodoBolero <peterbendel@neon.tech>
## Problem
Shard ancestor compaction doesn't currently log any global progress
information, only for the current batch.
## Summary of changes
Log the number of layers checked for eligibility this iteration, and the
total number of layers to check. This will indicate how far along the
total shard ancestor compaction has gotten for this iteration.
## Problem
Every time we make changes to the read path to fix a bug or add a
feature,
we end up adding another incomprehensible test.
## Summary of changes
Add some generic infrastructure for generating a layer map from a type
spec
and use that for a read path test. The test is randomized but uses a
fixed seed
by default. A fuzzing mode is available for confidence building.
See [Notion
page](https://www.notion.so/neondatabase/Read-Path-Unit-Testing-Fuzzing-1d1f189e0047806c8e5cd37781b0a350?pvs=4)
for a diagram of the layer map
used.
Just for fun I tried removing [this
commit](9990199cb4)
from https://github.com/neondatabase/neon/pull/11494
and it caught the bug in the normal mode (no fuzzing required).
## Problem
Sometimes it's useful to see the pageserver metrics after a test in
order to debug stuff.
For example, for https://github.com/neondatabase/neon/issues/11465 I'd
like to know
what the remote storage latencies are from the client.
## Summary of changes
When stopping the env, record the pageserver metrics into a file in the
pageserver's workdir.
## Problem
Part of https://github.com/neondatabase/neon/issues/11486
## Summary of changes
50% of the test instability of `test_create_churn_during_restart` are
due to error message gets changed. Allow the new error message.
Still need to fix other errors due to failure to acquire semaphore in
this or the next patch.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Pageservers now ignore unknown config fields, so this config tweaking is
no longer needed.
## Summary of changes
Get rid of the hack.
Closes https://github.com/neondatabase/neon/issues/11524
## Problem
https://github.com/neondatabase/neon/pull/11494 changes the batching
logic, but we don't have a way to evaluate it.
## Summary of changes
This PR introduces a global and per timeline metric which tracks the
reason for
which a batch was broken.
In #10063 we will switch BlobWriter to use the owned buffers IO buffered
writer, which implements double-buffering by virtue of a background task
that performs the flushing.
That task's lifecylce must be contained within the Timeline lifecycle,
so, it must hold the timeline gate open and respect Timeline::cancel.
This PR does the noisy plumbing to reduce the #10063 diff.
Refs
- extracted from https://github.com/neondatabase/neon/pull/10063
- epic https://github.com/neondatabase/neon/issues/9868
We have already migrated the storage controller to
`--control-plane-url`, added in #11173. The new param was added to
support also safekeeper specific endpoints. See the docs changes in
#11195 for further details.
Part of #11163
## Problem
clang produce warning about unused variable `n_synced` in
HandleSafekeeperResponse
## Summary of changes
Remove local variable.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Get page batching stops when we encounter requests at different LSNs.
We are leaving batching factor on the table.
## Summary of changes
The goal is to support keys with different LSNs in a single batch and
still serve them with a single vectored get.
Important restriction: the same key at different LSNs is not supported
in one batch. Returning different key
versions is a much more intrusive change.
Firstly, the read path is changed to support "scattered" queries. This
is a conceptually simple step from
https://github.com/neondatabase/neon/pull/11463. Instead of initializing
the fringe for one keyspace,
we do it for multiple at different LSNs and let the logic already
present into the fringe handle selection.
Secondly, page service code is updated to support batching at different
LSNs. Eeach request parsed from the wire determines its effective
request LSN and keeps it in mem for the batcher toinspect. The batcher
allows keys at
different LSNs in one batch as long one key is not requested at
different LSNs.
I'd suggest doing the first pass commit by commit to get a feel for the
changes.
## Results
I used the batching test from [Christian's
PR](https://github.com/neondatabase/neon/pull/11391) which increases the
change of batch breaks. Looking at the logs I think the new code is at
the max batching factor for the workload (we
only break batches due to them being oversized or because the executor
is idle).
```
Main:
Reasons for stopping batching: {'LSN changed': 22843, 'of batch size': 33417}
test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 14.6662
My branch:
Reasons for stopping batching: {'of batch size': 37024}
test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 19.8333
```
Related: https://github.com/neondatabase/neon/issues/10765
## Problem
See https://github.com/neondatabase/neon/issues/10652
Neon extension launches 2 BGW which reduce limit for parallel workers
and so affecting parallel_deadlock isolation test.
## Summary of changes
Increase `max_worker_processes` from default 8 to 16 for isolation test.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Part of #9114
There was a debug-mode verification mode that verifies at every
retain_lsn. However, the code was tangled within the actual history
generation itself and it's hard to reason about correctness. This patch
adds a separate post-verification of the gc-compaction result that redos
logs at every retain_lsn and every record above the GC horizon. This
ensures that all key history we produce with gc-compaction is readable,
and if there're read errors after gc-compaction, it can only be
read-path errors instead of gc-compaction bugs.
## Summary of changes
* Add gc_compaction_verification flag, default to true.
* Implement a post-verification process.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Shard ancestor compaction does not yield for L0 compaction, potentially
starving it.
close https://github.com/neondatabase/neon/issues/11125
## Summary of changes
* Yield for L0 during shard ancestor compaction.
* Return `CompactionOutcome::Pending` when limited by `rewrite_max`, for
eager rescheduling.
Previously, the structure of the spec file was just the compute spec.
However, the response from the control plane get spec request included
the compute spec and the compute_ctl config. This divergence was
hindering other work such as adding regression tests for compute_ctl
HTTP authorization.
Signed-off-by: Tristan Partin <tristan@neon.tech>
Introduces a `WalIngestError` struct together with a
`WalIngestErrorKind` enum, to be used for walingest related failures and
errors.
* the enum captures backtraces, so we don't regress in comparison to
`anyhow::Error`s (backtraces might be a bit shorter if we use one of the
`anyhow::Error` wrappers)
* it explicitly lists most/all of the potential cases that can occur.
I've originally been inspired to do this in #11496, but it's a
longer-term TODO.
## Problem
Shard ancestor compaction always logs "starting shard ancestor
compaction", even if there is no work to do. This is very spammy (every
20 seconds for every shard). It also has limited progress logging.
## Summary of changes
* Only log "starting shard ancestor compaction" when there's work to do.
* Include details about the amount of work.
* Log progress messages for each layer, and when waiting for uploads.
* Log when compaction is completed, with elapsed duration and whether
there is more work for a later iteration.
## Problem
The `pagebench` benchmarks set up an initial dataset by creating a
template tenant, copying the remote storage to a bunch of new tenants,
and attaching them to Pageservers.
In #11420, we found that
`test_pageserver_characterize_throughput_with_n_tenants` had degraded
performance because it set a custom tenant config in Pageservers that
was then replaced with the default tenant config by the storage
controller.
The initial fix was to register the tenants directly in the storage
controller, but this created the tenants with generation 1. This broke
`test_basebackup_with_high_slru_count`, where the template tenant was at
generation 2, leading to all layer files at generation 2 being ignored.
Resolves#11485.
Touches #11381.
## Summary of changes
This patch addresses both test issues by modifying `attach_hook` to also
take a custom tenant config. This allows attaching tenants to
Pageservers from pre-existing remote storage, specifying both the
generation and tenant config when registering them in the storage
controller.
The batching perf test workload is currently read-only sequential scans.
However, realistic workloads have concurrent writes (to other pages)
going on.
This PR simulates concurrent writes to other pages by emitting logical
replication messages.
These degrade the achieved batching factor, for the reason see
- https://github.com/neondatabase/neon/issues/10765
PR
- https://github.com/neondatabase/neon/pull/11494
will fix this problem and get batching factor back up.
---------
Co-authored-by: Vlad Lazar <vlad@neon.tech>
# Refs
- fixes https://github.com/neondatabase/neon/issues/11395
# Problem
Since 2025-03-10, we have observed increased flakiness of
`test_pageserver_getpage_throttle`.
The test is timing-dependent by nature, and was hitting the
```
assert duration_secs >= 10 * actual_smgr_query_seconds, (
"smgr metrics should not include throttle wait time"
)
```
quite frequently.
# Analysis
These failures are not reproducible.
In this PR's history is a commit that reran the test 100 times without
requiring a single retry.
In https://github.com/neondatabase/neon/issues/11395 there is a link to
a query to the test results database.
It shows that the flakiness was not constant, but rather episodic:
2025-03-{10,11,12,13} 2025-03-{19,20,21} 2025-03-31 and 2025-04-01.
To me, this suggests variability in available CPU.
# Solution
The point of the offending assertion is to ensure that most of the
request latency is spent on throttling, because testing of the
throttling mechanism is the point of the test.
The `10` magic number means at most 10% of mean latency may be spent on
request processing.
Ideally we would control the passage of time (virtual clock source) to
make this test deterministic.
But I don't see that happening in our regression test setup.
So, this PR de-flakes the test as follows:
- allot up to 66% of mean latency for request processing
- increase duration from 10s to 20s, hoping to get better protection
from momentary CPU spikes in noisy neighbor tests or VMs on the runner
host
As a drive-by, switch to `pytest.approx` and remove one self-test
assertion I can't make sense of anymore.
## Problem
We need to export some metrics about certs/connections to configure
alerts and make sure that all HTTP requests are gone before turning
https-only mode on.
- Closes: https://github.com/neondatabase/cloud/issues/25526
## Summary of changes
- Add started connection and connection error metrics to http/https
Server.
- Add certificate expiration time and reload metrics to
ReloadingCertificateResolver.
Because it wasn't recursive, there was a limit to the depth of updates.
This work is necessary because as we teach neon_local and compute_ctl
that the content in --spec-path should match a similar structure we get
from the control plane, the spec object itself will no longer be
toplevel. It will be under the "spec" key.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
This test is slow to execute, particularly if you're on a slow
environment like vscode in a browser. Might have got much slower when we
switched to direct IO?
## Summary of changes
- Reduce the scale of the test by 10x, since there was nothing special
about the original size.
## Problem
The graceful leadership transfer process involves calling step_down on
the old controller, but this was not waiting for shard splits to
complete, and the new controller could therefore end up trying to abort
a shard split while it was still going on.
We mitigated this already in #11256 by avoiding the case where shard
split completion would update the database incorrectly, but this was a
fragile fix because it assumes that is the only problematic part of the
split running concurrently.
Precursors:
- #11290
- #11256Closes: #11254
## Summary of changes
- Hold the reconciler gate from shard splits, so that step_down will
wait for them. Splits should always be fairly prompt, so it is okay to
wait here.
- Defense in depth: if step_down times out (hardcoded 10 second limit),
then fully terminate the controller process rather than letting it
continue running, potentially doing split-brainy things. This makes
sense because the new controller will always declare itself leader
unilaterally if step_down fails, so leaving an old controller running is
not beneficial.
- Tests: extend
`test_storage_controller_leadership_transfer_during_split` to separately
exercise the case of a split holding up step_down, and the case where
the overall timeout on step_down is hit and the controller terminates.
## Problem
Part of https://github.com/neondatabase/neon/issues/10395
## Summary of changes
Add a test case to ensure gc-compaction doesn't fire any critical errors
if the key history is invalid due to partial GC.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Pass more neon ids to compute_ctl.
Expose them to postgres as neon extension GUCs:
neon.project_id, neon.branch_id, neon.endpoint_id.
This is the compute side PR, not yet supported by cplane.
## Problem
`test_location_conf_churn` performs random location updates on
Pageservers. While doing this, it could instruct the compute to connect
to a stale generation and execute queries. This is invalid, and will
fail if a newer generation has removed layer files used by the stale
generation.
Resolves#11348.
## Summary of changes
Only connect to the latest generation when executing queries.
## Problem
Walproposer should get elected and commit WAL on safekeepers specified
by the membership configuration.
## Summary of changes
- Add to wp `members_safekeepers` and `new_members_safekeepers` arrays
mapping configuration members to connection slots. Establish this
mapping (by node id) when safekeeper sends greeting, giving its id and
when mconf becomes known / changes.
- Add to TermsCollected, VotesCollected,
GetAcknowledgedByQuorumWALPosition membership aware logic. Currently it
partially duplicates existing one, but we'll drop the latter eventually.
- In python, rename Configuration to MembershipConfiguration for
clarity.
- Add test_quorum_sanity testing new logic.
ref https://github.com/neondatabase/neon/issues/10851
## Problem
Page service doesn't use TLS for incoming requests.
- Closes: https://github.com/neondatabase/cloud/issues/27236
## Summary of changes
- Add option `enable_tls_page_service_api` to pageserver config
- Propagate `tls_server_config` to `page_service` if the option is
enabled
No integration tests for now because I didn't find out how to call page
service API from python and AFAIK computes don't support TLS yet
It isn't used by the production control plane or neon_local. The removal
simplifies compute spec logic just a little bit more since we can remove
any notion of whether we should allow live reconfigurations.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Part of #9114
## Summary of changes
Gc-compaction flag was not correctly set, causing it not getting
preempted by L0.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Part of #9114
## Summary of changes
Gc-compaction flag was not correctly set, causing it not getting
preempted by L0.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`test_location_conf_churn` often fails with `neither image nor delta
layer`, but doesn't say what the file actually is. However, past local
failures have indicated that it might be `.___temp` files.
Touches https://github.com/neondatabase/neon/issues/11348.
## Summary of changes
Ignore `.___temp` files when evicting local layers, and include the file
name in the error message.
## Problem
In some cases gc-compaction doesn't respond to the L0 compaction yield
notifier. I suspect it's stuck on getting the first item, and if so, we
probably need to let L0 yield notifier preempt `next_with_trace`.
## Summary of changes
- Add `time_to_first_kv_pair` to gc-compaction statistics.
- Inverse the ratio so that smaller ratio -> better compaction ratio.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Not a complete fix for https://github.com/neondatabase/neon/issues/11492
but should work for a short term.
Our current retry strategy for walredo is to retry every request exactly
once. This retry doesn't make sense because it retries all requests
exactly once and each error is expected to cause process restart and
cause future requests to fail. I'll explain it with a scenario of two
threads requesting redos: one with an invalid history (that will cause
walredo to panic) and another that has a correct redo sequence.
First let's look at how we handle retries right now in
do_with_walredo_process. At the beginning of the function it will spawn
a new process if there's no existing one. Then it will continue to redo.
If the process fails, the first process that encounters the error will
remove the walredo process object from the OnceCell, so that the next
time it gets accessed, a new process will be spawned; if it is the last
one that uses the old walredo process, it will kill and wait the process
in `drop(proc)`. I'm skeptical whether this works under races but I
think this is not the root cause of the problem. In this retry handler,
if there are N requests attached to a walredo process and the i-th
request fails (panics the walredo), all other N-i requests will fail and
they need to retry so that they can access a new walredo process.
```
time ---->
proc A None B
request 1 ^-----------------^ fail
uses A for redo replace with None
request 2 ^-------------------- fail
uses A for redo
request 3 ^----------------^ fail
uses A for redo last ref, wait for A to be killed
request 4 ^---------------
None, spawn new process B
```
The problem is with our retry strategy. Normally, for a system that we
want to retry on, the probability of errors for each of the requests are
uncorrelated. However, in walredo, a prior request that panics the
walredo process will cause all future walredo on that process to fail
(that's correlated).
So, back to the situation where we have 2 requests where one will
definitely fail and the other will succeed and we get the following
sequence, where retry attempts = 1,
* new walredo process A starts.
* request 1 (invalid) being processed on A and panics A, waiting for
retry, remove process A from the process object.
* request 2 (valid) being processed on A and receives pipe broken /
poisoned process error, waiting for retry, wait for A to be killed --
this very likely takes a while and cannot finish before request 1 gets
processed again
* new walredo process B starts.
* request 1 (invalid) being processed again on B and panics B, the whole
request fail.
* request 2 (valid) being processed again on B, and get a poisoned error
again.
```
time ---->
proc A None B None
request 1 ^-----------------^--------------^--------------------^
spawn A for redo fail spawn B for redo fail
request 2 ^--------------------^-------------------------^------------^
use A for redo fail, wait to kill A B for redo fail again
```
In such cases, no matter how we set n_attempts, as long as the retry
count applies to all requests, this sequence is bound to fail both
requests because of how they get sequenced; while we could potentially
make request 2 successful.
There are many solutions to this -- like having a separate walredo
manager for compactions, or define which errors are retryable (i.e.,
broken pipe can be retried, while real walredo error won't be retried),
or having a exclusive big lock over the whole redo process (the current
one is very fine-grained). In this patch, we go with a simple approach:
use different retry attempts for different types of requests.
For gc-compaction, the attempt count is set to 0, so that it never
retries and consequently stops the compaction process -- no more redo
will be issued from gc-compaction. Once the walredo process gets
restarted, the normal read requests will proceed normally.
## Summary of changes
Add redo_attempt for each reconstruct value request to set different
retry policies.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Erik Grinaker <erik@neon.tech>
## Problem
our large oltp benchmark runs very long - we want to remove the duration
of the reindex step.
we don't run concurrent workload anyhow but added "concurrently" only to
have a "prod-like" approach. But if it just doubles the time we report
because it requires two instead of one full table scan we can remove it
## Summary of changes
remove keyword concurrently from the reindex step
Instead of encoding a certain structure for claims, let's allow the
caller to specify what claims be encoded.
Signed-off-by: Tristan Partin <tristan@neon.tech>
We'd like to run benchmarks starting from a steady state. To this end,
do a reconciliation round before proceeding with the benchmark.
This is useful for benchmarks that use tenant dir snapshots since a
non-standard tenant configuration is used to generate the snapshot. The
storage controller is not aware of the non default tenant configuration
and will reconcile while the bench is running.
I like to run nightly clippy every so often to make our future rust
upgrades easier. Some notable changes:
* Prefer `next_back()` over `last()`. Generic iterators will implement
`last()` to run forward through the iterator until the end.
* Prefer `io::Error::other()`.
* Use implicit returns
One case where I haven't dealt with the issues is the now
[more-sensitive "large enum variant"
lint](https://github.com/rust-lang/rust-clippy/pull/13833). I chose not
to take any decisions around it here, and simply marked them as allow
for now.
## Problem
We wish to improve pageserver batching such that one batch can contain
requests for
pages at different LSNs. The current shape of the code doesn't lend
itself to the change.
## Summary of changes
Refactor the read path such that the fringe gets initialized upfront.
This is where the multi LSN
change will plug in. A couple other small changes fell out of this.
There should be NO behaviour change here. If you smell one, shout!
I recommend reviewing commits individually (intentionally made them as
small as possible).
Related: https://github.com/neondatabase/neon/issues/10765
## Problem
Now `get_timestamp_of_lsn` returns `404 NotFound` if there is no clog
pages for given LSN, and it's difficult to distinguish from other 404
errors. A separate status code for this error will allow the control
plane to handle this case.
- Closes: https://github.com/neondatabase/neon/issues/11439
- Corresponding PR in control plane:
https://github.com/neondatabase/cloud/pull/27125
## Summary of changes
- Return `412 PreconditionFailed` instead of `404 NotFound` if no
timestamp is fond for given LSN.
I looked briefly through the current error handling code in cloud.git
and the status code change should not affect anything for the existing
code. Change from the corresponding PR also looks fine and should work
with the current PS status code. Additionally, here is OK to merge it
from control plane team:
https://github.com/neondatabase/neon/issues/11439#issuecomment-2789327552
---------
Co-authored-by: John Spray <john@neon.tech>
pagestore_smgr.c had grown pretty large. Split into two parts, such
that the smgr routines that PostgreSQL code calls stays in
pagestore_smgr.c, and all the prefetching logic and other lower-level
routines related to communicating with the pageserver are moved to a
new source file, "communicator.c".
There are plans to replace communicator parts with a new
implementation. See https://github.com/neondatabase/neon/pull/10799.
This commit doesn't implement any of the new things yet, but it is
good preparation for it. I'm imagining that the new implementation
will approximately replace the current "communicator.c" code, exposing
roughly the same functions to pagestore_smgr.c.
This commit doesn't change any functionality or behavior, or make any
other changes to the existing code: It just moves existing code
around.
The timeline stopping state is set much earlier than the cancellation
token is fired, so by checking for the stopping state, we can prevent
races with timeline shutdown where we issue a cancellation error but the
cancellation token hasn't been fired yet.
Fix#11427.
## Problem
The current stripe size of 256 MB is a bit large, and can cause load
imbalances across shards. A stripe size of 16 MB appears more reasonable
to avoid hotspots, although we don't see evidence of this in benchmarks.
Resolves https://github.com/neondatabase/cloud/issues/25634.
Touches https://github.com/neondatabase/cloud/issues/21870.
## Summary of changes
* Change the default stripe size to 16 MB.
* Remove `ShardParameters::DEFAULT_STRIPE_SIZE`, and only use
`pageserver_api::shard::DEFAULT_STRIPE_SIZE`.
* Update a bunch of tests that assumed a certain stripe size.
## Problem
Storcon will not start up if `use_https` is on and there are some
pageservers or safekeepers without https port in the database. Metrics
"how many nodes with https we have in DB" will help us to make sure that
`use_https` may be turned on safely.
- Part of https://github.com/neondatabase/cloud/issues/25526
## Summary of changes
- Add `storage_controller_https_pageserver_nodes`,
`storage_controller_safekeeper_nodes` and
`storage_controller_https_safekeeper_nodes` Prometheus metrics.
## Problem
With the recent improvements to L0 compaction responsiveness,
`test_create_snapshot` now ends up generating 10,000 layer files
(compared to 1,000 in previous snapshots). This increases the snapshot
size by 4x, and significantly slows down tests.
## Summary of changes
Increase the target layer size from 128 KB to 256 KB, and the L0
compaction threshold from 1 to 5. This reduces the layer count from
about 10,000 to 1,000.
## Problem
Support of unlogged build in DEBUG_COMPARE_LOCAL.
Neon SMGR treats present of local file as indicator of unlogged
relations.
But it doesn't work in DEBUG_COMPARE_LOCAL mode.
## Summary of changes
Use INIT_FORKNUM as indicator of unlogged file and create this file
while unlogged index build.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
`tenant_import`, used to import an existing tenant from remote storage
into a storage controller for support and debugging, assumed
`DEFAULT_STRIPE_SIZE` since this can't be recovered from remote storage.
In #11168, we are changing the stripe size, which will break
`tenant_import`.
Resolves#11175.
## Summary of changes
* Add `stripe_size` to the tenant manifest.
* Add `TenantScanRemoteStorageShard::stripe_size` and return from
`tenant_scan_remote` if present.
* Recover the stripe size during`tenant_import`, or fall back to 32768
(the original default stripe size).
* Add tenant manifest compatibility snapshot:
`2025-04-08-pgv17-tenant-manifest-v1.tar.zst`
There are no cross-version concerns here, since unknown fields are
ignored during deserialization where relevant.
## Problem
This is generated e.g. by `test_historic_storage_formats`, and causes
VSCode to list all the contained files as new.
## Summary of changes
Add `/artifact_cache` to `.gitignore`.
## Problem
For future gc-compaction tests when we support
https://github.com/neondatabase/neon/issues/10395
## Summary of changes
Add a new type of neon test WAL record that is conditionally applied
(i.e., only when image == the specified value). We can use this to mock
the situation where we lose some records in the middle, firing an error,
and see how gc-compaction reacts to it.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Service targeted for storing and retrieving LFC prewarm data.
Can be used for proxying S3 access for Postgres extensions like
pg_mooncake as well.
Requests must include a Bearer JWT token.
Token is validated using a pemfile (should be passed in infra/).
Note: app is not tolerant to extra trailing slashes, see app.rs
`delete_prefix` test for comments.
Resolves: https://github.com/neondatabase/cloud/issues/26342
Unrelated changes: gate a `rename_noreplace` feature and disable it in
`remote_storage` so as `object_storage` can be built with musl
## Problem
`test_scrubber_tenant_snapshot` is flaky with `request was dropped`
errors. More details are in the issue.
- Closes: https://github.com/neondatabase/neon/issues/11278
## Summary of changes
- Disable shard scheduling during pageservers restart
- Add `reconcile_until_idle` in the end of the test
Also, move the call to the lfc_init() function. It was weird to have it
in libpagestore.c, when libpagestore.c otherwise had nothing to do with
the LFC. Move it directly into _PG_init()
## Problem
If the local file cache is shrunk, so that we punch some holes in the
underlying file, the local_cache view displays the holes incorrectly.
See https://github.com/neondatabase/neon/issues/10770
## Summary of changes
Skip hole tags in the local_cache view.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Fix various small issues discovered during gc-compaction rollout.
## Summary of changes
- Log level changes: if errors are from gc-compaction, fire a warning
instead of errors or critical errors.
- Yield to L0 compaction more aggressively. Instead of checking every 1k
keys, we check on every key. Sometimes a single key reconstruct takes a
long time.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Currently, the tenant manifest is only uploaded if there are offloaded
timelines. The checks are also a bit loose (e.g. only checks number of
offloaded timelines). We want to start using the manifest for other
things too (e.g. stripe size).
Resolves#11271.
## Summary of changes
This patch ensures that a tenant manifest always exists. The lifecycle
is:
* During preload, fetch the existing manifest, if any.
* During attach, upload a tenant manifest if it differs from the
preloaded one (or does not exist).
* Upload a new manifest as needed, if it differs from the last-known
manifest (ignoring version number).
* On splits, pre-populate the manifest from the parent.
* During Pageserver physical GC, remove old manifests but keep the
latest 2 generations.
This will cause nearly all existing tenants to upload a new tenant
manifest on their first attach after this change. Attaches are
concurrency-limited in the storage controller, so we expect this will be
fine.
Also updates `make_broken` to automatically log at `INFO` level when the
tenant has been cancelled, to avoid spurious error logs during shutdown.
## Problem
If a tenant is cancelled (e.g. due to Pageserver shutdown) during
attach, it is set to `Broken`. This results both in error log spam and
500 responses for clients -- shutdown is supposed to return 503
responses which can be retried.
This becomes more likely to happen with #11328, where we perform tenant
manifest downloads/uploads during attach.
## Summary of changes
Set tenant state to `Stopping` when attach fails and the tenant is
cancelled, downgrading the log messages to INFO. This introduces two
variants of `Stopping` -- with and without a caller barrier -- where the
latter is used to signal attach cancellation.
Before we specified the JWT via `SAFEKEEPER_AUTH_TOKEN`, but env vars
are quite public, both in procfs as well as the unit files. So add a way
to put the auth token into a file directly.
context: https://neondb.slack.com/archives/C033RQ5SPDH/p1743692566311099
The 'neon_read' function needs to have a different prototype on PG < 16,
because it's part of the smgr interface. But neon_read_at_lsn doesn't
have that restriction.
## Problem
The shared libraries preloaded by default interfered with the
`pg_regress` tests on staging, causing wrong results
## Summary of changes
The projects used for these tests are now free from unnecessary
extensions. Some changes were made in patches.
Remove useless and often wrong IDENTIFICATION comments. PostgreSQL
sources have them, mostly for historical reasons, but there's no need
for us to copy that style.
Remove unnecessary #includes in header files, putting the #includes
directly in the .c files that need them. The principle is that a header
file should #include other header files if they need definitions from
them, such that each header file can be compiled on its own, but not
other #includes. (There are tools to enforce that, but this was just a
manual clean up of violations that I happened to spot.)
## Problem
The `pg_embedding` extension has been deprecated and can cause issues
with recent changes such as with
https://github.com/neondatabase/neon/issues/10973
Issue: `PG:2025-04-03 15:39:25.498 GMT
ttid=a4de5bee50225424b053dc64bac96d87/d6f3891b8f968458b3f7edea58fb3c6f
sqlstate=58P01 [15526] ERROR: could not load library
"/usr/local/lib/embedding.so": /usr/local/lib/embedding.so: undefined
symbol: SetLastWrittenLSNForRelation`
## Summary of changes
Removed `pg_embedding` extension from the compute image.
## Problem
We don't have metrics to exactly quantify the end user impact of
on-demand downloads.
Perf tracing is underway (#11140) to supply us with high-resolution
*samples*.
But it will also be useful to have some aggregate per-timeline and
per-instance metrics that definitively contain all observations.
## Summary of changes
This PR consists of independent commits that should be reviewed
independently.
However, for convenience, we're going to merge them together.
- refactor(metrics): measure_remote_op can use async traits
- impr(pageserver metrics): task_kind dimension for
remote_timeline_client latency histo
- implements https://github.com/neondatabase/cloud/issues/26800
- refs
https://github.com/neondatabase/cloud/issues/26193#issuecomment-2769705793
- use the opportunity to rename the metric and add a _global suffix;
checked grafana export, it's only used in two personal dashboards, one
of them mine, the other by Heikki
- log on-demand download latency for expensive-to-query but precise
ground truth
- metric for wall clock time spent waiting for on-demand downloads
## Refs
- refs https://github.com/neondatabase/cloud/issues/26800
- a bunch of minor investigations / incidents into latency outliers
# Refs
- refs https://github.com/neondatabase/neon/issues/8915
- discussion thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1742406381132599
- stacked atop https://github.com/neondatabase/neon/pull/11298
- corresponding internal docs update that illustrates how this PR
removes friction: https://github.com/neondatabase/docs/pull/404
# Problem
Rejecting `pageserver.toml`s with unknown fields adds friction,
especially when using `pageserver.toml` fields as feature flags that
need to be decommissioned.
See the added paragraphs on `pageserver_api::models::ConfigToml` for
details on what kind of friction it causes.
Also read the corresponding internal docs update linked above to see a
more imperative guide for using `pageserver.toml` flags as feature
flags.
# Solution
## Ignoring unknown fields
Ignoring is the serde default behavior.
So, just remove `serde(deny_unknown_fields)` from all structs in
`pageserver_api::config::ConfigToml`
`pageserver_api::config::TenantConfigToml`.
I went through all the child fields and verified they don't use
`deny_unknown_fields` either, including those shared with
`pageserver_api::models`.
## Warning about unknown fields
We still want to warn about unknown fields to
- be informed about typos in the config template
- be reminded about feature-flag style configs that have been cleaned up
in code but not yet in config templates
We tried `serde_ignore` (cf draft #11319) but it doesn't work with
`serde(flatten)`.
The solution we arrived at is to compare the on-disk TOML with the TOML
that we produce if we serialize the `ConfigToml` again.
Any key specified in the on-disk TOML but not present in the serialized
TOML is flagged as an ignored key.
The mechanism to do it is a tiny recursive decent visitor on the
`toml_edit::DocumentMut`.
# Future Work
Invalid config _values_ in known fields will continue to fail pageserver
startup.
See
- https://github.com/neondatabase/cloud/issues/24349
for current worst case impact to deployments & ideas to improve.
# Problem
Current perf tracing fields do not allow answering the question what a
specific Postgres backend was waiting for.
# Background
For Pageserver logs, we set the backend PID as the libpq
`application_name` on the compute side, and funnel that into the a
tracing field for the spans that emit to the global tracing subscriber.
# Solution
Funnel `application_name`, and the other fields that we use in the
logging spans, into the root span for perf tracing.
# Refs
- fixes https://github.com/neondatabase/neon/issues/11393
- stacked atop https://github.com/neondatabase/neon/pull/11433
- epic: https://github.com/neondatabase/neon/issues/9873
## Problem
We've started sending slack notifications for failed container image
pushes that are being retried. There are more messages coming in than
expected, so clicking through the link to see what image failed is
happening more often than we hoped.
## Summary of changes
- Make slack notifications clearer, including whether the job succeeded
and what retries have happened.
- Log failures/retries in step more clearly, so that you can easily see
when something fails.
## Problem
Postgres build fails with the following error on macOS:
```
/Users/bayandin/work/neon//vendor/postgres-v14/src/port/snprintf.c:424:27: error: 'strchrnul' is only available on macOS 15.4 or newer [-Werror,-Wunguarded-availability-new]
424 | const char *next_pct = strchrnul(format + 1, '%');
| ^~~~~~~~~
/Users/bayandin/work/neon//vendor/postgres-v14/src/port/snprintf.c:376:14: note: 'strchrnul' has been marked as being introduced in macOS 15.4 here, but the deployment target is macOS 15.0.0
376 | extern char *strchrnul(const char *s, int c);
| ^
/Users/bayandin/work/neon//vendor/postgres-v14/src/port/snprintf.c:424:27: note: enclose 'strchrnul' in a __builtin_available check to silence this warning
424 | const char *next_pct = strchrnul(format + 1, '%');
| ^~~~~~~~~
425 |
426 | /* Dump literal data we just scanned over */
427 | dostr(format, next_pct - format, target);
428 | if (target->failed)
429 | break;
430 |
431 | if (*next_pct == '\0')
432 | break;
433 | format = next_pct;
|
1 error generated.
```
## Summary of changes
- Update Postgres fork to include changes from
6da2ba1d8a
Corresponding Postgres PRs:
- https://github.com/neondatabase/postgres/pull/608
- https://github.com/neondatabase/postgres/pull/609
- https://github.com/neondatabase/postgres/pull/610
- https://github.com/neondatabase/postgres/pull/611
## Problem
https://github.com/neondatabase/neon/pull/11140 introduces performance
tracing with OTEL
and a pageserver config which configures the sampling ratio of get page
requests.
Enabling a non-zero sampling ratio on a per region basis is too
aggressive and comes with perf
impact that isn't very well understood yet.
## Summary of changes
Add a `sampling_ratio` tenant level config which overrides the
pageserver level config.
Note that we do not cache the config and load it on every get page
request such that changes propagate
timely.
Note that I've had to remove the `SHARD_SELECTION` span to get this to
work. The tracing library doesn't
expose a neat way to drop a span if one realises it's not needed at
runtime.
Closes https://github.com/neondatabase/neon/issues/11392
## Problem
IO metrics for secondary locations do not get deregistered when the
timeline is removed.
## Summary of changes
Stash the request context to be used for downloads in
`SecondaryTimelineDetail`. These objects match the lifetime of the
secondary timeline location pretty well.
When the timeline is removed, deregister the metrics too.
Closes https://github.com/neondatabase/neon/issues/11156
## Problem
Changes in compute can cause errors in tests if another version of
`neon-test-extensions` image is used.
## Summary of changes
Use the same version of `neon-test-extensions` image as `compute` one
for docker-compose based extension tests.
## Problem
There are some places in the code where we create `reqwest::Client`
without providing SSL CA certs from `ssl_ca_file`. These will break
after we enable TLS everywhere.
- Part of https://github.com/neondatabase/cloud/issues/22686
## Summary of changes
- Support `ssl_ca_file` in storage scrubber.
- Add `use_https_safekeeper_api` option to safekeeper to use https for
peer requests.
- Propagate SSL CA certs to storage_controller/client, storcon's
ComputeHook, PeerClient and maybe_forward.
Previously we attempted to download all extensions in CREATE EXTENSION
statements. Extensions like pg_stat_statements and neon are not remote
extensions, but still we were requesting them when
skip_pg_catalog_updates was set to false.
Fixes: https://github.com/neondatabase/neon/issues/11127
Signed-off-by: Tristan Partin <tristan@neon.tech>
Adds a test `test_storcon_create_delete_sk_down` which tests the
reconciler and pending op persistence if faced with a temporary
safekeeper downtime during timeline creation or deletion. This is in
contrast to `test_explicit_timeline_creation_storcon`, which tests the
happy path.
We also do some fixes:
* timeline and tenant deletion http requests didn't expect a body, but
`()` sent one.
* we got the tenant deletion http request's return type wrong: it's
supposed to be a hash map
* we add some logging to improve observability
* We fix `list_pending_ops` which had broken code meant to make it
possible to restrict oneself to a single pageserver. But diesel doesn't
support that sadly, or at least I couldn't figure out a way to make it
work. We don't need that functionality, so remove it.
* We add an info span to the heartbeater futures with the node id, so
that there is no context-free msgs like "Backoff: waiting 1.1 seconds
before processing with the task" in the storcon logs. we could also add
the full base url of the node but don't do it as most other log lines
contain that information already, and if we do duplication it should at
least not be verbose. One can always find out the base url from the node
id.
Successor of #11261
Part of #9011
Log the created project and endpoint IDs and improve typing in the
source code to improve readability.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
We switched `h2` from 4.1.0 to a git commit to fix stubgen (in
https://github.com/neondatabase/neon/pull/10491). `h2` 4.2.0 was
released soon after that, so we can switch back to a pinned version.
Expected no changes, as 4.2.0 is the right next commit after the commit
we currently use:
dacd614fed%5E
## Summary of changes
- Bump `h2` to 4.2.0
## Problem
Part of https://github.com/neondatabase/neon/issues/11318
It's not 100% safe for now to run gc-compaction over the sparse
keyspace. It might cause deleted file to re-appear if a specific
sequence of operations are done as in the issue, which in reality
doesn't happen due to how we split delta/image layers based on the key
range.
A long-term fix would be either having a separate gc-compaction code
path for metadata keys (as how we have a different code path for
metadata image layer generation), or let the compaction process aware of
the information of "there's an image layer that doesn't contain a key"
so that we can skip the keys.
## Summary of changes
* gc-compaction auto trigger only triggers compaction over the normal
data range.
* do not hold gc_block_guard across the full compaction job, only hold
it during each subcompaction.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
The leadership transfer protocol between storage controller instances is
as follows, listing the steps for the new pod:
The new pod does these things:
1. new pod comes online. looks in database if there is a leader. if
there is, it asks that leader to step down.
2. the new pod does some operations to come online. they should be
fairly short timed, but it's not zero.
3. the new pod updates the leader entry in the database.
The old pod, once it gets the step down request, changes its internal
state to stepped down. It treats all incoming requests specially now:
instead of processing, it wants to forward them to the new pod. The
forwarding however only works if the new pod is online already, so
before forwarding it reads from the db for a leader (also to get the
address to forward to in the first place).
If the new pod is not online yet, i.e. during step 2 above, the old pod
might legitimately land in the branch which this patch is editing: the
leader in the database is a stepped down instance.
Before, we've returned a `ApiError::InternalServerError`, but that would
print the full backtrace plus an error log. With this patch, we cut down
on the noise, as it's an expected situation to have a short storcon
downtime while we are cutting over to the new instance. A
`ResourceUnavailable` error is not just more fitting, it also doesn't
print a backtrace once encountered, and only prints on the INFO log
level (see `api_error_handler` function).
Fixes#11320
cc #8954
Based on https://github.com/neondatabase/neon/pull/11139
## Problem
We want to export performance traces from the pageserver in OTEL format.
End goal is to see them in Grafana.
## Summary of changes
https://github.com/neondatabase/neon/pull/11139 introduces the
infrastructure required to run the otel collector alongside the
pageserver.
### Design
Requirements:
1. We'd like to avoid implementing our own performance tracing stack if
possible and use the `tracing` crate if possible.
2. Ideally, we'd like zero overhead of a sampling rate of zero and be a
be able to change the tracing config for a tenant on the fly.
3. We should leave the current span hierarchy intact. This includes
adding perf traces without modifying existing tracing.
To satisfy (3) (and (2) in part) a separate span hierarchy is used.
`RequestContext` gains an optional `perf_span` member
that's only set when the request was chosen by sampling. All perf span
related methods added to `RequestContext` are no-ops for requests that
are not sampled.
This on its own is not enough for (3), so performance spans use a
separate tracing subscriber. The `tracing` crate doesn't have great
support for this, so there's a fair amount of boilerplate to override
the subscriber at all points of the perf span lifecycle.
### Perf Impact
[Periodic
pagebench](https://neonprod.grafana.net/d/ddqtbfykfqfi8d/e904990?orgId=1&from=2025-02-08T14:15:59.362Z&to=2025-03-10T14:15:59.362Z&timezone=utc)
shows no statistically significant regression with a sample ratio of 0.
There's an annotation on the dashboard on 2025-03-06.
### Overview of changes:
1. Clean up the `RequestContext` API a bit. Namely, get rid of the
`RequestContext::extend` API and use the builder instead.
2. Add pageserver level configs for tracing: sampling ratio, otel
endpoint, etc.
3. Introduce some perf span tracking utilities and expose them via
`RequestContext`. We add a `tracing::Span` wrapper to be used for perf
spans and a `tracing::Instrumented` equivalent for it. See doc comments
for reason.
4. Set up OTEL tracing infra according to configuration. A separate
runtime is used for the collector.
5. Add perf traces to the read path.
## Refs
- epic https://github.com/neondatabase/neon/issues/9873
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
There are some cases where traditional gc might collect some layer files
causing gc-compaction cannot read the full history of the key. This
needs to be resolved in the long-term by improving the compaction
process. For now, let's simply avoid such errors triggering the circuit
breaker.
## Summary of changes
* Move the place where we trigger the circuit breaker. We only trigger
it during compactions other than L0 compactions. We added the trigger a
year ago due to file cleanup concerns in image layer compaction.
* For gc-compaction, only return errors to the upper
compaction_iteration if it's a shutdown error. Otherwise, just log it
and skip the compaction for a key range.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
Previously, if the observed state was refreshed and matching the intent,
we wouldn't send
a compute notification. This is unsafe. There's no guarantee that the
location landed on the
pageserver _and_ a compute notification for it was delivered.
See
https://github.com/neondatabase/neon/issues/11291#issuecomment-2743205411
for one such example.
## Summary of changes
Add a reproducer and notify the compute if the correct observed state
required a refresh.
Closes https://github.com/neondatabase/neon/issues/11291
## Problem
We had a problem with https://github.com/neondatabase/neon/pull/11413
having e2e tests failing, because an e2e test
(8d271bed47)
depended on an unreleased pageserver fix
(0ee5bfa2fc).
This came up because neon release CI runs against the most recent
releases of the other components, but cloud e2e tests run against
latest, which is tagged from main.
## Summary of changes
Add an additional `released` tag for released versions.
## Alternative to consider
We could (and maybe should) instead switch to `latest` being used for
released versions and `main` being used where we use `latest` right now.
That'd also mean we don't have to adjust the CI in the cloud repo.
## Problem
Exporting `file_cache_used` which specifies the number of used chunks in
the LFC. This helps calculate LFC utilization as: `file_cache_used_pages
/ (file_cache_used * file_cache_chunk_size_pages)`
## Summary of changes
Exporting `file_cache_used`.
Related Issue: https://github.com/neondatabase/cloud/issues/26688
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.
[Announcement blog
post](https://blog.rust-lang.org/2025/04/03/Rust-1.86.0.html).
Prior update was in #10914.
## Problem
Since
0f367cb665
the timeout in `with_client_retries` is implemented via `tokio::timeout`
instead of `reqwest::ClientBuilder::timeout` (because we reuse the
client). It changed the error representation if the timeout is exceeded.
Such errors were suppressed in `allowed_errors.py`, but old regexps do
not match the new error.
Discussion:
https://neondb.slack.com/archives/C033RQ5SPDH/p1743533184736319
## Summary of changes
- Add new `Timeout` error to `allowed_errors.py`
I've encountered this error in #11422. Ideally we'd have the URL as well
to associate it with a tenant, but at this level we only have the remote
addr I guess. Better than nothing.
## Problem
The test_pageserver_gc_compaction_smoke fails rather often due to a
timeout on slow machines.
See https://github.com/neondatabase/neon/issues/11355.
## Summary of changes
Increase the timeout for the test.
## Problem
We've seen quite a few CI failures related to pushes to docker hub
failing with weird error messages that indicate maybe docker hub is just
not reliable.
## Summary of changes
Retry container image pushing up to 10 times, and send a slack message
if we had to retry, regardless of the job succeeding or not.
## Problem
Pagebench creates a bunch of tenants by first creating a template tenant
and copying its remote storage, then attaching the copies to the
Pageserver.
These tenants had custom configurations to disable GC and compaction.
However, these configs were only picked up by the Pageserver on attach,
and not registered with the storage controller. This caused the storage
controller to replace the tenant configs with the default tenant config,
re-enabling GC and compaction which interferes with benchmark
performance.
Resolves#11381.
## Summary of changes
Register the copied tenants with the storage controller, instead of
directly attaching them to the Pageserver.
Right now we start safekeeper node ids at 0. However, other code treats
0 as invalid (see #11407). We decided on latter. Therefore, make the
register python tests register safekeepers starting at node id 1 instead
of 0, and forbid safekeepers with id 0 from registering.
Context:
https://github.com/neondatabase/neon/pull/11407#discussion_r2024852328
## Problem
close https://github.com/neondatabase/neon/issues/11279
## Summary of changes
* Allow passthrough of other methods in tenant timeline shard0
passthrough of storcon.
* Passthrough mark invisible API in storcon.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
In #11122, we changed the autosplit behavior to allow repeated and
initial splits. The defaults were set such that they retain the current
production settings (8 shards at 64 GB). However, these defaults don't
really make sense by themselves.
Once we deploy new settings to production, we should change the defaults
to something more reasonable.
## Summary of changes
Changes the following default settings:
* `max_split_shards`: 8 → 16
* `initial_split_threshold`: 64 GB → disabled
* `initial_split_shards`: 8 → 2
## Problem
Hotfix releases mean that sometimes changes in release PRs haven't been
tested and linted yet. Disabling tests and lints is therefore not
necessarily safe. In the future we will check whether tests have run on
the same git tree already to speed things up, but for now we need to
turn tests back on fully. This partially reverts:
https://github.com/neondatabase/neon/pull/11272
## Summary of changes
Run checks on `.*-rc-pr` runs.
## Problem
Tenants in attachment state `Stale` can't upload layers, and don't run
compaction, but still do periodic L0 layer flushes in the tenant
housekeeping loop. If the tenant remains stuck in stale mode, this
causes a large buildup of L0 layers, causing logging, metrics increases,
and possibly alerts.
Resolves#11245.
## Summary of changes
Don't perform periodic layer flushes in stale attachment state.
## Problem
Sometimes the forced extension upgrade test fails (on schedule) due to a
timeout.
## Summary of changes
The timeout is increased to 60 mins.
## Problem
In Neon DBaaS we adjust the shared_buffers to the size of the compute,
or better described we adjust the max number of connections to the
compute size and we adjust the shared_buffers size to the number of max
connections according to about the following sizes
`2 CU: 225mb; 4 CU: 450mb; 8 CU: 900mb`
[see](877e33b428/goapp/controlplane/internal/pkg/compute/computespec/pg_settings.go (L405))
## Summary of changes
We should run perf unit tests with settings that is realistic for a
paying customer and select 8 CU as the reference for those tests.
## Problem
Followup to https://github.com/neondatabase/neon/pull/10913
Existing chaos injection just does simple cutovers to secondary
locations. Let's also exercise code for doing graceful migrations. This
should implicitly test how such migrations cope with overlapping with
service restarts.
## Summary of changes
## Problem
ref https://github.com/neondatabase/neon/issues/11279
Imagine we have a branch with 3 snapshots A, B, and C:
```
base---+---+---+---main
\-A \-B \-C
base=100G, base-A=1G, A-B=1G, B-C=1G, C-main=1G
```
at this point, the synthetic size should be 100+1+1+1+1=104G.
after the deletion, the structure looks like:
```
base---+---+---+
\-A \-B \-C
```
If we simply assume main never exists, the size will be calculated as
size(A) + size(B) + size(C)=300GB, which obviously is not what the user
would expect.
The correct way to do this is to assume part of main still exists, that
is to say, set C-main=1G:
```
base---+---+---+main
\-A \-B \-C
```
And we will get the correct synthetic size of 100G+1+1+1=103G.
## Summary of changes
* Do not generate gc cutoff point for invisible branches.
* Use the same LSN as the last branchpoint for branch end.
* Remove test_api_handler for mark_invisible.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
See https://neondb.slack.com/archives/C04DGM6SMTM/p1741594233757489
Consider the following scenario:
1. Backend A wants to prefetch some block B
2. Backend A checks that block B is not present in shared buffer
3. Backend A registers new prefetch request and calls
prefetch_do_request
4. prefetch_do_request calls neon_get_request_lsns
5. neon_get_request_lsns obtains LwLSN for block B
6. Backend B downloads B, updates and wallogs it (let say to Lsn1)
7. Block B is once again thrown from shared buffers, its LwLSN is set to
Lsn1
8. Backend A obtains current flush LSN, let's say that it is Lsn1
9. Backend A stores Lsn1 as effective_lsn in prefetch slot.
10. Backend A reads page B with LwLSN=Lsn1
11. Backend A finds in prefetch ring response for prefetch request for
block B with effective_lsn=Lsn1, so that it satisfies
neon_prefetch_response_usable condition
12. Backend A uses deteriorated version of the page!
## Summary of changes
Use `not_modified_since` as `effective_lsn`.
It should not cause some degrade of performance because we store LwLSN
when it was not found in LwLSN hash, so if page is not changed till
prefetch response is arrived, then LwLSN should not be changed.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
This PR contains a bunch of smaller followups and fixes of the original
PR #11058. most of these implement suggestions from Arseny:
* remove `Queryable, Selectable` from `TimelinePersistence`: they are
not needed.
* no `Arc` around `CancellationToken`: it itself is an arc wrapper
* only schedule deletes instead of scheduling excludes and deletes
* persist and delete deletion ops
* delete rows in timelines table upon tenant and timeline deletion
* set `deleted_at` for timelines we are deleting before we start any
reconciles: this flag will help us later to recognize half-executed
deletions, or when we crashed before we could remove the timeline row
but after we removed the last pending op (handling these situations are
left for later).
Part of #9011
Add an optional `safekeepers` field to `TimelineInfo` which is returned
by the storcon upon timeline creation if the
`--timelines-onto-safekeepers` flag is enabled. It contains the list of
safekeepers chosen.
Other contexts where we return `TimelineInfo` do not contain the
`safekeepers` field, sadly I couldn't make this more type safe like done
in Rust via `TimelineCreateResponseStorcon`, as there is no way of
flattening or inheritance (and I don't that duplicating the entire type
for some minor type safety improvements is worth it).
The storcon side has been done in #11058.
Part of https://github.com/neondatabase/cloud/issues/16176
cc https://github.com/neondatabase/cloud/issues/16796
## Problem
`CompactFlags::NoYield` was a bit inconvenient, since every caller
except for the background compaction loop should generally set it (e.g.
HTTP API calls, tests, etc). It was also inconsistent with
`CompactionOutcome::YieldForL0`.
## Summary of changes
Invert `CompactFlags::NoYield` as `CompactFlags::YieldForL0`. There
should be no behavioral changes.
## Problem
For computes running inside NeonVM, the actual compute image tag is
buried inside the NeonVM spec, and we cannot get it as part of standard
k8s container metrics (it's always an image and a tag of the NeonVM
runner container). The workaround we currently use is to extract the
running computes info from the control plane database with SQL. It has
several drawbacks: i) it's complicated, separate DB per region; ii) it's
slow; iii) it's still an indirect source of info, i.e. k8s state could
be different from what the control plane expects.
## Summary of changes
Add a new `compute_ctl_up` gauge metric with `build_tag` and `status`
labels. It will help us to both overview what are the tags/versions of
all running computes; and to break them down by current status (`empty`,
`running`, `failed`, etc.)
Later, we could introduce low cardinality (no endpoint or compute ids)
streaming aggregates for such metrics, so they will be blazingly fast
and usable for monitoring the fleet-wide state.
## Problem
Commit
3da70abfa5
cause noticeable performance regression (40% in update-with-prefetch in
test_bulk_update):
https://neondb.slack.com/archives/C04BLQ4LW7K/p1742633167580879
## Summary of changes
Remove loop from pageserver_try_receive to make it fetch not more than
one response. There is still loop in `pump_prefetch_state` which can
fetch as many responses as available.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Previously we had different meanings for the bitmask of vector IOps.
That has now been unified to "bit set = final result, no more
scribbling".
Furthermore, the LFC read path scribbled on pages that were already
read; that's probably not a good thing so that's been fixed too. In
passing, the read path of LFC has been updated to read only the
requested pages into the provided buffers, thus reducing the IO size of
vectorized IOs.
## Problem
## Summary of changes
## Problem
Current version of GitHub Workflow Stats action pull docker images from
DockerHub, that could be an issue with the new pull limits on DockerHub
side.
## Summary of changes
Switch to version `v0.2.2`, with docker images hosted on `ghcr.io`
In sqlstate, we have a manual `phf` construction, which is not
explicitly guaranteed to be stable - you're intended to use a build.rs
or the macro to make sure it's constructed correctly each time. This was
inherited from tokio-postgres upstream, which has the same issue
(https://github.com/rust-phf/rust-phf/pull/321#issuecomment-2724521193).
We don't need this encoding of sqlstate, so I've switched it to simply
parse 5 bytes
(https://www.postgresql.org/docs/current/errcodes-appendix.html).
While here, I switched out log for tracing.
## Problem
Macro IS_LOCAL_REL used for DEBUG_COMPARE_LOCAL mode use greater-than
rather than greater-or-equal comparison while first table really is
assigned FirstNormalObjectId.
## Summary of changes
Replace strict greater with greater-or-equal comparison.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
`TYPE_CHECKING` is used inconsistently across Python tests.
## Summary of changes
- Update `ruff`: 0.7.0 -> 0.11.2
- Enable TC (flake8-type-checking):
https://docs.astral.sh/ruff/rules/#flake8-type-checking-tc
- (auto)fix all new issues
## Problem
Previously, L0 flushes would wait for uploads, as a simple form of
backpressure. However, this prevented flush pipelining and upload
parallelism. It has since been disabled by default and replaced by L0
compaction backpressure.
Touches https://github.com/neondatabase/cloud/issues/24664.
## Summary of changes
This patch removes L0 flush upload waits, along with the
`l0_flush_wait_upload`. This can't be merged until the setting has been
removed across the fleet.
## Problem
`github.sha` contains a merge commit of `head` and `base` if we're in a
PR. In release PRs, this makes no sense, because we fast-forward the
`base` branch to contain the changes from `head`.
Even though we correctly use `${{ github.event.pull_request.head.sha ||
github.sha }}` to reference the git commit when building artifacts, we
don't use that when checking out code, because we want to test the merge
of head and base usually. In the case of release PRs, we definitely
always want to test on the head sha though, because we're going to
forward that, and it already has the base sha as a parent, so the merge
would end up with the same tree anyway.
As a side effect, not checking out `${{
github.event.pull_request.head.sha || github.sha }}` also caused
https://github.com/neondatabase/neon/actions/runs/13986389780/job/39173256184#step:6:49
to say `release-tag=release-compute-8187`, while
https://github.com/neondatabase/neon/actions/runs/14084613121/job/39445314780#step:6:48
is talking about `build-tag=release-compute-8186`
## Summary of changes
Run a few things on `github.event.pull_request.head.sha`, if we're in a
release PR.
## Problem
Occasionally getting data from GH cache could be slow, with less than
10MB/s and taking 5+ minutes to download cache:
```
Received 20971520 of 2987085791 (0.7%), 9.9 MBs/sec
Received 50331648 of 2987085791 (1.7%), 15.9 MBs/sec
...
Received 1065353216 of 2987085791 (35.7%), 4.8 MBs/sec
Received 1065353216 of 2987085791 (35.7%), 4.7 MBs/sec
...
```
https://github.com/neondatabase/neon/actions/runs/13956437454/job/39068664599#step:7:17
Resulting in getting cache even longer that build time.
## Summary of changes
Switch to the caches, that are closer to the runners, and they provided
stable throughput about 70-80MB/s
## Problem
Some useful debugging tools are missing from the compute image and
sometimes it's impossible to install them because memory is tightly
packed.
## Summary of changes
Add the following tools: iproute2, lsof, screen, tcpdump.
The other changes come from sorting the packages alphabetically.
```bash
$ docker image inspect ghcr.io/neondatabase/vm-compute-node-v16:7555 | jaq '.[0].Size'
1389759645
$ docker image inspect ghcr.io/neondatabase/vm-compute-node-v16:14083125313 | jaq '.[0].Size'
1396051101
$ echo $((1396051101 - 1389759645))
6291456
```
To help with narrowing down
https://github.com/neondatabase/cloud/issues/26362, we make the case
more noisy where we are wait for the shutdown of a specific task (in the
case of that issue, the `gc_loop`).
…ng (#11308)"
This reverts commit e5aef3747c.
The logic of this commit was incorrect:
enabling audit requires a restart of the compute,
because audit extensions use shared_preload_libraries.
So it cannot be done in the configuration phase,
require endpoint restart instead.
## Problem
While working on bulk import, I want to use the `control-plane-url` flag
for a different request.
Currently, the local compute hook is used whenever no control plane is
specified in the config.
My test requires local compute notifications and a configured
`control-plane-url` which isn't supported.
## Summary of changes
Add a `use-local-compute-notifications` flag. When this is set, we use
the local flow regardless of other config values.
It's enabled by default in neon_local and disabled by default in all
other envs. I had to turn the flag off in tests
that wish to bypass the local flow, but that's expected.
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
- We need to support multiple SSL CA certificates for graceful root CA
certificate rotation.
- Closes: https://github.com/neondatabase/cloud/issues/25971
## Summary of changes
- Parses `ssl_ca_file` as a pem bundle, which may contain multiple
certificates. Single pem cert is a valid pem bundle, so the change is
backward compatible.
## Problem
- Part of https://github.com/neondatabase/neon/issues/11113
- Building a new `reqwest::Client` for every request is expensive
because it parses CA certs under the hood. It's noticeable in storcon's
flamegraph.
## Summary of changes
- Reuse one `reqwest::Client` for all API calls to avoid parsing CA
certificates every time.
## Problem
Issue https://github.com/neondatabase/neon/issues/11254 describes a case
where restart during a shard split can result in a bad end state in the
database.
## Summary of changes
- Add a reproducer for the issue
- Tighten an existing safety check around updated row counts in
complete_shard_split
## Problem
Various aspects of the controller's job are already described in RFCs,
but the overall service didn't have an RFC that records design tradeoffs
and the top level structure.
## Summary of changes
- Add a retrospective RFC that should be useful for anyone understanding
storage controller functionality
## Problem
part of https://github.com/neondatabase/neon/issues/11279
## Summary of changes
The invisible flag is used to exclude a timeline from synthetic size
calculation. For the first step, let's persist this flag. Most of the
code are following the `is_archived` modification flow.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
SSL certs are loaded only during start up. It doesn't allow the rotation
of short-lived certificates without server restart.
- Closes: https://github.com/neondatabase/cloud/issues/25525
## Summary of changes
- Implement `ReloadingCertificateResolver` which reloads certificates
from disk periodically.
## Problem
Now that stuck connections are quickly terminated it's not easy to
quickly find the right port from the pid to correlate the connection
with the one seen on pageserver side.
## Summary of changes
Call getsockname() and include the local port number in the
no-response-from-pageserver log messages.
## Problem
Currently, we only split tenants into 8 shards once, at the 64 GB split
threshold. For very large tenants, we need to keep splitting to avoid
huge shards. And we also want to eagerly split at a lower threshold to
improve throughput during initial ingestion.
See
https://github.com/neondatabase/cloud/issues/22532#issuecomment-2706215907
for details.
Touches https://github.com/neondatabase/cloud/issues/22532.
Requires #11157.
## Summary of changes
This adds parameters and logic to enable repeated splits when a tenant's
largest timeline divided by shard count exceeds `split_threshold`, as
well as eager initial splits at a lower threshold to speed up initial
ingestion. The default parameters are all set such that they retain the
current behavior in production (only split into 8 shards once, at 64
GB).
* `split_threshold` now specifies a maximum shard size. When a shard
exceeds it, all tenant shards are split by powers of 2 such that all
tenant shards fall below `split_threshold`. Disabled by default, like
today.
* Add `max_split_shards` to specify a max shard count for autosplits.
Defaults to 8 to retain current behavior.
* Add `initial_split_threshold` and `initial_split_shards` to specify a
threshold and target count for eager splits of unsharded tenants.
Defaults to 64 GB and 8 shards to retain current production behavior.
Because this PR sets `initial_split_threshold` to 64 GB by default, it
has the effect of enabling autosplits by default. This was not the case
previously, since `split_threshold` defaults to None, but it is already
enabled across production and staging. This is temporary until we
complete the production rollout.
For more details, see code comments.
This must wait until #11157 has been deployed to Pageservers.
Once this has been deployed to production, we plan to change the
parameters to:
* `split-threshold`: 256 GB
* `initial-split-threshold`: 16 GB
* `initial-split-shards`: 4
* `max-split-shards`: 16
The final split points will thus be:
* Start: 1 shard
* 16 GB: 4 shards
* 1 TB: 8 shards
* 2 TB: 16 shards
We will then change the default settings to be disabled by default.
---------
Co-authored-by: John Spray <john@neon.tech>
## Problem
`fast_import` binary is being run inside neonvms, and they do not
support proper `kubectl describe logs` now, there are a bunch of other
caveats as well: https://github.com/neondatabase/autoscaling/issues/1320
Anyway, we needed a signal if job finished successfully or not, and if
not — at least some error message for the cplane operation. And after [a
short
discussion](https://neondb.slack.com/archives/C07PG8J1L0P/p1741954251813609),
that s3 object is the most convenient at the moment.
## Summary of changes
If `s3_prefix` was provided to `fast_import` call, any job run puts a
status object file into `{s3_prefix}/status/fast_import` with contents
`{"done": true}` or `{"done": false, "error": "..."}`. Added a test as
well
## Problem
See https://neondb.slack.com/archives/C04DGM6SMTM/p1741176713523469
The problem is that this function is using `PQgetCopyData(shard->conn,
&resp_buff.data, 1 /* async = true */)`
to try to fetch next message. But this function returns 0 if the whole
message is not present in the buffer.
And input buffer may contain only part of message so result is not
fetched.
## Summary of changes
Use `PQisBusy` + `WaitEventSetWait` to check if data is available and
`PQgetCopyData(shard->conn, &resp_buff.data, 0)` to read whole message
in this case.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
If a tenant gets deleted, delete also all of its timelines. We assume
that by the time a tenant is being deleted, no new timelines are being
created, so we don't need to worry about races with creation in this
situation.
Unlike #11233, which was very simple because it listed the timelines and
invoked timeline deletion, this PR obtains a list of safekeepers to
invoke the tenant deletion on, and then invokes tenant deletion on each
safekeeper that has one or multiple timelines.
Alternative to #11233
Builds on #11288
Part of #9011
## Problem
Pageservers use unencrypted HTTP requests for storage controller API.
- Closes: https://github.com/neondatabase/cloud/issues/25524
## Summary of changes
- Replace hyper0::server::Server with http_utils::server::Server in
storage controller.
- Add HTTPS handler for storage controller API.
- Support `ssl_ca_file` in pageserver.
Timeline IDs do not contain characters that may cause a SQL injection,
but best to always play it safe.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
We currently have this code duplicated across different PG versions.
Moving this to an extension would reduce duplication and simplify
maintenance.
## Summary of changes
Moving the LastWrittenLSN code from PG versions to the Neon extension
and linking it with hooks.
Related Postgres PR: https://github.com/neondatabase/postgres/pull/590
Closes: https://github.com/neondatabase/neon/issues/10973
---------
Co-authored-by: Tristan Partin <tristan@neon.tech>
## Problem
The pageserver upcall api was designed to work with control plane or the
storage controller.
We have completed the transition period and now the upcall api only
targets the storage controller.
## Summary of changes
Rename types accordingly and tweak some comments.
## Problem
We had a recent Postgres startup latency (`start_postgres_ms`)
degradation, but it was only caught with SLO alerts. There was actually
an existing test for the same purpose -- `start_postgres_ms`, but it's
doing only two starts, so it's a bit noisy.
## Summary of changes
Add new compute startup latency test that does 100 iterations and
reports p50, p90 and p99 latencies.
Part of https://github.com/neondatabase/cloud/issues/24882
## Problem
Some extensions do not contain tests, which can be easily run on top of
docker-compose or staging.
## Summary of changes
Added the pg_regress based tests for `pg_tiktoken`, `pgx_ulid`, `pg_rag`
Now they will be run on top of docker-compose, but I intend to adopt
them to be run on top staging in the next PRs
## Problem
We're seeing timeline creation failures that look suspiciously like some
race with the cleanup-deletion of initdb temporary directories. I
couldn't spot the bug, but we can make it a bit easier to debug.
Related: https://github.com/neondatabase/neon/issues/11296
## Summary of changes
- Avoid surfacing distracting ENOENT failure to delete as a log error --
this is fine, and can happen if timeline is cancelled while doing
initdb, or if initdb itself has an error where it doesn't write the dir
(this error is surfaced separately)
- Log after purging initdb temp directories
The only difference between
- `pageserver_api::models::TenantConfig` and
- `pageserver::tenant::config::TenantConfOpt`
at this point is that `TenantConfOpt` serializes with
`skip_serializing_if = Option::is_none`.
That is an efficiency improvement for all the places that currently
serde `models::TenantConfig` because new serializations will no longer
write `$fieldname: null` for each field that is `None` at runtime.
This should be particularly beneficial for Storcon, which stores
JSON-serialized `models::TenantConfig` in its DB.
# Behavior Changes
This PR changes the serialization behavior: we omit `None` fields
instead of serializing `$fieldname: null`).
So it's a data format change (see section on compatibility below).
And it changes API responses from Storcon and Pageserver.
## API Response Compatibility
Storcon returns the location description.
Afaik it is passed through into
- storcon_cli output
- storcon UI in console admin UI
These outputs will no longer contain `$fieldname: null` values,
which de-bloats the output (good).
But in storcon UI, it also serves as an editor "default", which
will be eliminated after a storcon with this PR is released.
## Data Format Compatibility
Backwards compat: new software reading old serialized data will
deserialize to the same runtime value because all the field types
are exactly the same and `skip_serializing_if` does not affect
deserialization.
Forward compat: old software reading data serialized by new software
will map absence fields in the serialized form to runtime value
`Option::None`. This is serde default behavior, see this playground
to convince yourself:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=f7f4e1a169959a3085b6158c022a05eb
The `serde(with="humantime_serde")` however behaves strangely:
if used on an `Option<Duration>`, it still requires the field to be
present,
unlike the serde default behavior shown in the previous paragraph.
The workaround is to set `serde(default)`.
Previously it was set on each individual field, but, we do have the
container attribute, so, set it there.
This requires deriving a `Default` impl, which, because all fields are
`Option`,
is non-magic.
See my notes here:
https://gist.github.com/problame/eddbc225a5d12617e9f2c6413e0cf799
# Future Work
We should have separate types (& crates) for
- runtime types configuration (e.g. PageServerConf::tenant_config,
AttachedLocationConf)
- `config-v1` file pageserver local disk file format
- `mgmt API`
- `pageserver.toml`
Right now they all use the same, which is convenient but makes it hard
to reason about compatibility breakage.
# Refs
- corresponding docs.neon.build PR
https://github.com/neondatabase/docs/pull/470
## Problem
#11061 changed how artifacts for releases are built, by
reusing/retagging the artifacts from release PRs. This resulted in the
BUILD_TAG that's baked into the images to not be as expected.
Context: https://neondb.slack.com/archives/C08JBTT3R1Q/p1742333300129069
## Summary of changes
Set BUILD_TAG to the release tag of the upcoming release when running
inside release PRs.
In
-
https://github.com/neondatabase/neon/pull/10993#issuecomment-2690428336
I added infinite retries for buffered writer flush IOs, primarily to
gracefully handle ENOSPC but more generally so that the buffered writer
is not left in a state where reads from the surrounding InMemoryLayer
cause panics.
However, I didn't add cancellation sensitivity, which is concerning
because then there is no way to detach a timeline/tenant that is
encountering the write IO errors.
That’s a legitimate scenario in the case of some edge case bug.
See the #10993 description for details.
This PR
- first makes flush loop infallible, enabled by infinite retries
- then adds sensitivity to `Timeline::cancel` to the flush loop, thereby
making it fallible in one specific way again
- finally fixes the InMemoryLayer/EphemeralFile/BufferedWriter
amalgamate to remain read-available after flush loop is cancelled.
The support for read-availability after cancellation is necessary so
that reads from the InMemoryLayer that are already queued up behind the
RwLock that wraps the BufferedWriter won't panic because of the
`mutable=None` that we leave behind in case the flush loop gets
cancelled.
# Alternatives
One might think that we can only ship the change for read-availability
if flush encounters an error, without the infinite retrying and/or
cancellation sensitivity complexity.
The problem with that is that read-availability sounds good but is
really quite useless, because we cannot ingest new WAL without a
writable InMemoryLayer. Thus, very soon after we transition to read-only
mode, reads from compute are going to wait anyway, but on `wait_lsn`
instead of the RwLock, because ingest isn't progressing.
Thus, having the infinite flush retries still makes more sense because
they're just "slowness" to the user, whereas wait_lsn is hard errors.
## Problem
https://github.com/neondatabase/neon/pull/11210 migrated pushing images
to ghcr. Unfortunately, it was incomplete in using images from ghcr,
which resulted in a few places referencing the ghcr build-tools image,
while trying to use docker hub credentials.
## Summary of changes
Use build-tools image from ghcr consistently.
## Problem
The pipelines after release merges are slower than they need to be at
the moment. This is because some kinds of tests/checks run on all kinds
of pipelines, even though they only matter in some of those.
## Summary of changes
Run `check-codestyle-{rust,python,jsonnet}`, `build-and-test-locally`
and `trigger-e2e-tests` only on regular PRs, not release PR or pushes to
main or release branches.
Both crates seem well maintained. x509-cert is part of the high quality
RustCrypto project that we already make heavy use of, and I think it
makes sense to reduce the dependencies where possible.
## Problem
Docker Hub has new rate limits coming up, and to avoid problems coming
with those we're switching to GHCR.
## Summary of changes
- Push images to GHCR initially and distribute them from there
- Use images from GHCR in docker-compose
There is no functional change here. We move safekeeper related code from
`service.rs` to `service/safekeeper_service.rs`, so that safekeeper
related stuff is contained in a single file. This also helps with
preventing `service.rs` from growing even further.
Part of #9011.
## Problem
The test is flaky because of the same reasons as described in
https://github.com/neondatabase/neon/issues/11177.
The test has already suppressed these `WARN` and `ERROR` log messages,
but the regexp didn't match all possible errors.
## Summary of changes
- Change regexp to suppress all possible allowed error log messages.
PRs #10891 and #10902 have time-bounded the safekeeper heartbeating of
the storage controller. Those timeouts were not meant to be permanent,
but temporary until we figured out the reasons for the safekeeper
heartbeating causing problems.
Now they are better understood and resolved. A comment is
[here](https://github.com/neondatabase/cloud/issues/24396#issuecomment-2679342929),
but most importantly, we've had:
* #10954 to send heartbeats concurrently (before the issue was we sent
them sequentially, so the total time time was number of nodes times time
for timeout to be hit, now the total time is the maximum of all things
we are heartbeating)
* work to actually make heartbeats work and not error, i.e. JWT rollout
for storcon, not sending heartbeats to decomissioned safekeepers,
removal of decomissioned safekeepers from the databases
Part of https://github.com/neondatabase/cloud/issues/25473
## Problem
Unlogged build is used for GIST/SPGIST/GIN/HNSW indexes.
In this mode we first change relation class to `RELPERSISTENCE_UNLOGGED`
and save them on local disk.
But we do not save unlogged relations in LFC.
It may cause fetching incorrect value from LFC if relfilenode is reused.
## Summary of changes
Save modified pages in LFC on second stage of unlogged build (when
modified pages are walloged).
There is no need to save pages in LFC at first phase because the will be
in any case overwritten with assigned LSN at second phase.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Adds a basic test that makes the storcon issue explicit creation of a
timeline on safeekepers (main storcon PR in #11058). It was adapted from
`test_explicit_timeline_creation` from #11002.
Also, do a bunch of fixes needed to get the test work (the API
definitions weren't correct), and log more stuff when we can't create a
new timeline due to no safekeepers being active.
Part of #9011
---------
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
Work on https://github.com/neondatabase/cloud/issues/23721 and
https://github.com/neondatabase/cloud/issues/23714
Depends on https://github.com/neondatabase/neon/pull/11111
- Add `/configure_telemetry` API endpoint
- Support second rsyslog configuration for Postgres logs export
- Enable logs export when compute feature is enabled and configure
Postgres to send logs to syslog
I have used `/configure_telemetry` name because in the future I see it
also being used for configuring a `pg_tracing` extension to export
traces. Let me know if you'd rather have these APIs separate. In this
case we can rename it to `/configure_rsyslog`.
## Problem
https://github.com/neondatabase/neon/actions/runs/13894288475/job/38871819190
shows the "Add fast-fordward label to PR to trigger fast-forward merge"
job being skipped. This is due to not using the right variable for
checking which branch the merge queue is merging into.
## Summary of changes
Use the `branch` output of the `meta` task for checking the target
branch of a merge group.
## Problem
In unlogged index build (used fir GIST/SPGIST/GIN indexes) files is
created on disk and then removed at the end.
It contradicts to the logic of DEBUG_COMPARE_LOCAL mode.
## Summary of changes
Do not create and unlink files in unlogged build in DEBUG_COMPARE_LOCAL
mode.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
In the previous PR #11045, one edge-case wasn't covered, when an ident
contains only one `$`, we were picking `$$` as a 'wrapper'. Yet, when
this `$` is at the beginning or at the end of the ident, then we end up
with `$$$` in a row which breaks the escaping.
## Summary of changes
Start from `x` tag instead of a blank string.
Slack:
https://neondb.slack.com/archives/C08HV951W2W/p1742076675079769?thread_ts=1742004205.461159&cid=C08HV951W2W
## Problem
Serial/numeric IDs lead to collisions, which is not critical but looks
awkward.
Previous discussion:
https://neondb.slack.com/archives/C033A2WE6BZ/p1741891345869979
## Summary of changes
Suggest using the `YYYY-MM-DD` prefix, which i) has less chance of
collision; ii) provides out-of-the-box lexicographic sorting; iii) even
if it collides, it's not a big deal -- just two RFCs have been started
on the same day.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
... to better match the workload characteristics of real Neon customers
## Problem
We analyzed workloads of large Neon users and want to extend the oltp
workload to include characteristics seen in those workloads.
## Summary of changes
- for re-use branch delete inserted rows from last run
- adjust expected run-time (time-outs) in GitHub workflow
- add queries that exposes the prefetch getpages path
- add I/U/D transactions for another table (so far the workload was
insert/append-only)
- add an explicit vacuum analyze step and measure its time
- add reindex concurrently step and measure its time (and take care that
this step succeeds even if prior reindex runs have failed or were
canceled)
- create a second connection string for the pooled connection that
removes the `-pooler` suffix from the hostname because we want to run
long-running statements (database maintenance) and bypass the pooler
which doesn't support unlimited statement timeout
## Test run
https://github.com/neondatabase/neon/actions/runs/13851772887/job/38760172415
## Problem
There is a rare race between controller graceful deployment and shard
splitting where we may incorrectly both abort _and_ complete the split
(on different pods), and thereby leave no shards at all in the database.
Related: #11254
## Summary of changes
- In complete_shard_split, refuse to delete anything if child shards are
not found
## Problem
Python's `jsonnet` 0.20.0 doesn't support Python 3.13, so we have a
couple of tests xfailed because of that.
## Summary of changes
- Bump `jsonnet` to `0.21.0rc2` which supports Python 3.13
- Unxfail `test_sql_exporter_metrics_e2e` and
`test_sql_exporter_metrics_smoke` on Python 3.13
- add pgaudt_gc thread to compute_ctl
to cleanup old pgaudit logs if they exist.
pgaudit can rotate files, but it doesn't delete the old files
- Add AUDIT_LOG_DIR_SIZE metric to compute_ctl
to track the size of the audit log directory in bytes.
- Fix permissions for rsyslog state files directory
## Problem
We exposed the direction tag in #10925 but didn't actually include the
ingress tag in the export to allow for an adaption period.
## Summary of changes
We now export the ingress direction
# Refs
- fixes https://github.com/neondatabase/neon/issues/11228
# Problem High-Level
When storcon validates whether a `ScheduleOptimizationAction` should be
applied, it retrieves the `tenant_secondary_status` to determine whether
a secondary is ready for the optimization.
When collecting results, it associates secondary statuses with the wrong
optimization actions in the batch of optimizations that we're
validating.
The result is that we make the decision for shard/location X based on
the SecondaryStatus of a random secondary location Y in the current
batch of optimizations.
A possible symptom is an early cutover, as seen in this engineering
investigation here:
- https://github.com/neondatabase/cloud/issues/25734
# Problem Code-Level
This code here in `optimize_all_validate`
97e2e27f68/storage_controller/src/service.rs (L7012-L7029)
zips the `want_secondary_status` with the Vec returned from
`tenant_for_shards_api` .
However, the Vec returned from `want_secondary_status` is not ordered
(it uses FuturesUnordered internally).
# Solution
Sort the Vec in input order before returning it.
`optimize_all_validate` was the only caller affected by this problem
While at it, also future-proof similar-looking function
`tenant_for_shards`.
None of its callers care about the order, but this type of function
signature is easy to use incorrectly.
# Future Work
Avoid the additional iteration, map, and allocation.
Change API to leverage AsyncFn (async closure).
And/or invert `tenant_for_shards_api` into a Future ext trait / iterator
adaptor thing.
Adds API definitions for the safekeeper API endpoints `exclude_timeline`
and `term_bump`. Also does a bugfix to return the correct type from
`delete_timeline`.
Part of #8614
## Problem
This is already disabled in production, as it is replaced by L0 flush
delays. It will be removed in a later PR, once the config option is no
longer specified in production.
## Summary of changes
Disable `l0_flush_wait_upload` by default.
## Problem
`test_metadata_image_creation ` became flaky with #11212, since image
compaction may yield to L0 compaction.
## Summary of changes
Set `NoYield` when compacting in tenant tests.
## Problem
We noticed that error metrics didn't show for some services with light
load. This is not great and can cause problems for dashboards/alerts
## Summary of changes
Pre-initialise some metricvecs.
We want to switch away from and deprecate the `--compute-hook-url` param
for the storcon in favour of `--control-plane-url` because it allows us
to construct urls with `notify-safekeepers`.
This PR switches the pytests and neon_local from a
`control_plane_compute_hook_api` to a new param named
`control_plane_hooks_api` which is supposed to point to the parent of
the `notify-attach` URL.
We still support reading the old url from disk to not be too disruptive
with existing deployments, but we just ignore it.
Also add docs for the `notify-safekeepers` upcall API.
Follow-up of #11173
Part of https://github.com/neondatabase/neon/issues/11163
## Problem
`l0_flush_delay_threshold` has already been set to 30 in production for
a couple of weeks. Let's harmonize the default.
## Summary of changes
Update `DEFAULT_L0_FLUSH_DELAY_FACTOR` to 3 such that the default
`l0_flush_delay_threshold` is `3 * compaction_threshold`.
This differs from the production setting, which is hardcoded to 30 (with
`compaction_threshold` at 10), and is more appropriate for any tenants
that have custom `compaction_threshold` overrides.
## Problem
ef0d4a48a adjusted how we build container images and how we push them,
and the neon-test-extensions image was overlooked. Additionally, is was
also missed in 1f0dea9a1, which pushed our container images to GHCR.
## Summary of changes
Push neon-test-extensions to GHCR and also push release tags for it.
## Problem
DEBUG_COMPARE_LOCAL is not supported in neon_zeroextend added in PG16
## Summary of changes
Add support of DEBUG_COMPARE_LOCAL in neon_zeroextend
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
# Problem
We leave too few observability breadcrumbs in the case where wait_lsn is
exceptionally slow.
# Changes
- refactor: extract the monitoring logic out of `log_slow` into
`monitor_slow_future`
- add global + per-timeline counter for time spent waiting for wait_lsn
- It is updated while we're still waiting, similar to what we do for
page_service response flush.
- add per-timeline counterpair for started & finished wait_lsn count
- add slow-logging to leave breadcrumbs in logs, not just metrics
For the slow-logging, we need to consider not flooding the logs during a
broker or network outage/blip.
The solution is a "log-streak-level" concurrency limit per timeline.
At any given time, there is at most one slow wait_lsn that is logging
the "still running" and "completed" sequence of logs.
Other concurrent slow wait_lsn's don't log at all.
This leaves at least one breadcrumb in each timeline's logs if some
wait_lsn was exceptionally slow during a given period.
The full degree of slowness can then be determined by looking at the
per-timeline metric.
# Performance
Reran the `bench_log_slow` benchmark, no difference, so, existing call
sites are fine.
We do use a Semaphore, but only try_acquire it _after_ things have
already been determined to be slow. So, no baseline overhead
anticipated.
# Refs
-
https://github.com/neondatabase/cloud/issues/23486#issuecomment-2711587222
Closes: https://github.com/neondatabase/cloud/issues/22998
If control-plane reports that TLS should be used, load the certificates
(and watch for updates), make sure postgres use them, and detects
updates.
Procedure:
1. Load certificates
2. Reconfigure postgres/pgbouncer
3. Loop on a timer until certificates have loaded
4. Go to 1
Notes:
1. We only run this procedure if requested on startup by control plane.
2. We needed to compile pgbouncer with openssl enabled
3. Postgres doesn't allow tls keys to be globally accessible - must be
read only to the postgres user. I couldn't convince the autoscaling team
to let me put this logic into the VM settings, so instead compute_ctl
will copy the keys to be read-only by postgres.
4. To mitigate a race condition, we also verify that the key matches the
cert.
## Problem
There was a panic on staging that compaction didn't find any keys. This
is possible if all layers selected for compaction does not contain any
keys within the current shard.
## Summary of changes
Make panic an error. In the future, we can try creating an empty image
layer so that GC can clean up those layers. Otherwise, for now, we can
only rely on shard ancestor compaction to remove these data.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`compaction_l0_first` has already been enabled in production for a
couple of weeks.
## Summary of changes
Enable `compaction_l0_first` by default.
Also set `CompactFlags::NoYield` in `timeline_checkpoint_handler`, to
ensure explicitly requested compaction runs to completion. This endpoint
is mainly used in tests, and caused some flakiness where tests expected
compaction to complete.
## Problem
#11061 introduced code fetching previous releases. #11151 introduced jq
error handling, which has also been applied in #11061, but parenthesis
have been missed.
## Summary of changes
Add parenthesis around error handling code.
## Problem
Autosplits always request `DEFAULT_STRIPE_SIZE` for splits. However,
splits do not allow changing the stripe size of already-sharded tenants,
and will error out if it differs.
In #11168, we are changing the stripe size, which could hit this when
attempting to autosplit already sharded tenants.
Touches #11168.
## Summary of changes
Pass `new_stripe_size: None` when autosplitting already sharded tenants.
Otherwise, pass `DEFAULT_STRIPE_SIZE` instead of the shard identity's
stripe size, since we want to use the current default rather than an
old, persisted default.
# Fix metric_unit length in test_compute_ctl_api.py
## Description
This PR changes the metric_unit from "microseconds" to "μs" in
test_compute_ctl_api.py to fix the issue where perf test results were
not being stored in the database due to the string exceeding the 10
character limit of the metric_unit column in the perf_test_results
table.
## Problem
As reported in Slack, the perf test results were not being uploaded to
the database because the "microseconds" string (12 characters) exceeds
the 10 character limit of the metric_unit column in the
perf_test_results table.
## Solution
Replace "microseconds" with "μs" in all metric_unit parameters in the
test_compute_ctl_api.py file.
## Testing
The changes have been committed and pushed. The PR is ready for review.
Link to Devin run:
https://app.devin.ai/sessions/e29edd672bd34114b059915820e8a853
Requested by: Peter Bendel
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: peterbendel@neon.tech <peterbendel@neon.tech>
## Problem
A small adjustment in #11061 broke the lint-release-pr.sh script, and
the new version was neither tested nor linted. This has been done now,
the script is once again tested and passing `shellcheck`.
## Summary of changes
Add missing `el` of `elif` condition chain.
## Problem
#11061 changed release pr creation, and I missed that the workflow will
checkout a would-be-merge of the rc branch and the release branch
instead of the head ref, unless explicitly instructed otherwise.
## Summary of changes
Check out head ref for linting the release PRs.
## Problem
#11061 changed release pr creation, and I missed that creating PRs using
`gh` in non-interactive environments *requires* `--body` instead of
defaulting to an empty body.
## Summary of changes
Explicitly set an empty body when creating release PRs.
## Problem
#11061 changed release PR creation, and I missed that we need to
explicitly fetch the whole history so that the relevant git refs and
objects are available.
## Summary of changes
- Fetch all git refs including history by setting fetch-depth to 0
- Reference release branch as a remote branch, because we haven't
checked it out locally
## Problem
close https://github.com/neondatabase/neon/issues/10310
## Summary of changes
This patch adds a new behavior for the detach_ancestor API: detach with
multi-level ancestor and no reparenting. Though we can potentially
support multi-level + do reparenting / single-level + no-reparenting in
the future, as it's not required for the recovery/snapshot epic, I'd
prefer keeping things simple now that we only handle the old one and the
new one instead of supporting the full feature matrix.
I only added a test case of successful detaching instead of testing
failures. I'd like to make this into staging and add more tests in the
future.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
When we release our components, we perform builds in the release PR,
then test the components, then merge the PR, and then build everything
*again*, run tests *again*, and only then start deployments.
To speed things up, we want to perform builds and run tests in the PR,
and start deployments using the existing artifacts from the release PR.
To make that possible, we need to have both CI pipelines running on the
same commit hash, which requires fast forwarding release. That only
works, if we have a commit in the PR that has the current release branch
state as an ancestor.
## Summary of changes
- Changes to release PR creation:
- Remove templates and automatic bodies for release PRs. The previous
template wasn't used anymore, and the automatic body we created in the
pipeline didn't contain any useful content anymore after the changees
here.
- Make it possible to select the source branch. For releases that aren't
cut from `main`, like https://github.com/neondatabase/neon/pull/11051,
we need a way to trigger the new flow from a different branch.
- Determine `release-branch` automatically from the component name
instead of passing that as well.
- Changes to the merge queue job:
- Rename `get-changed-files` to `meta` in preparation of additional data
being fetched as part of that job
- Fail the merge queue if we're trying to merge into a branch other than
main - this is to prevent non-fast-forward merges.
- Label PRs to branches other than main as `fast-forward`, to trigger
the fast-forward job
- Add a fast-forward job that can be triggered with the `fast-forward`
label that performs a fast-forward merge. This only happens if the PR
has `mergeable_state == clean`, so CI having passed.
- Build and Test on releases now skips building images, skips testing
images and skips triggering e2e tests. We add new tags to the images
from the release PR to tag them as release images, and we push them to
the prod registries.
In our json encoding, we only need to know about array types.
Information about composites or enums are not actually used.
Enums are quite popular, needing to type query them when not needed can
add some latency cost for no gain.
## Problem
When a node becomes active, we query its locations and update the
observed state in-place.
This can race with the observed state updates done when processing
reconcile results.
## Summary of changes
The argument for this reconciliation step is that is reduces the need
for background reconciliations.
I don't think is actually true anymore. There's two cases.
1. Restart of node after drain. Usually the node does not go through the
offline state here, so observed locations
were not marked as none. In any case, there should be a handful of
shards max on the node since we've just drained it.
2. Node comes back online after failure or network partition. When the
node is marked offline, we reschedule everything away from it. When it
later becomes active, the previous observed location is extraneous and
requires a reconciliation anyway.
Closes https://github.com/neondatabase/neon/issues/11148
## Problem
When the storage controller and Pageserver loads tenants from persisted
storage, it uses `ShardIdentity::unsharded()` for unsharded tenants.
However, this replaces the persisted stripe size of unsharded tenants
with the default stripe size.
This doesn't really matter for practical purposes, since the stripe size
is meaningless for unsharded tenants anyway, but can cause consistency
check failures if the persisted stripe size differs from the default.
This was seen in #11168, where we change the default stripe size.
Touches #11168.
## Summary of changes
Carry over the persisted stripe size from `TenantShardPersistence` for
unsharded tenants, and from `LocationConf` on Pageservers.
Also add bounds checks for type casts when loading persisted shard
metadata.
## Problem
`info_span!` is only used in a `linux` branch, causing the unused lint
to fire on macOS.
## Summary of changes
Fully qualify the `info_span!` use.
## Problem
`humantime` is unmaintained, we want to migrate to `jiff`, see
https://github.com/neondatabase/neon/issues/11179.
`env_logger` in older versions depend on `humantime`, and newer versions
depend on `jiff`, so we need to update it.
## Summary of changes
Update `env_logger` to the most recent release, which does not depend on
`humantime` anymore.
## Problem
This command was used when onboarding tenants to the storage controller.
We no longer do that, so the command can go.
## Summary of changes
- Remove `storcon_cli tenant-warmup` command
## Problem
Test `test_timeline_archive` is flaky because it makes requests that are
intended to fail. It sometimes leads to warning in pageserver's logs.
More details are in the issue.
- Closes: https://github.com/neondatabase/neon/issues/11177
## Summary of changes
- Suppress such errors.
We want to export performance traces from the pageserver in OTEL format.
End goal is to see them in Grafana.
To this end, there are two changes here:
1. Update the `tracing-utils` crate to allow for explicitly specifying
the export configuration. Pageserver configuration is loaded from a file
on start-up. This allows us to use the same flow for export configs
there.
2. Update the `utils::logging::init` common entry point to set up OTEL
tracing infrastructure if requested. Note that an entirely different
tracing subscriber is used. This is to avoid interference with the
existing tracing set-up. For now, no service uses this functionality.
PR to plug this into the pageserver is
[here](https://github.com/neondatabase/neon/pull/11140).
Related https://github.com/neondatabase/neon/issues/9873
## Problem
This field was retained for backward compat only in
https://github.com/neondatabase/neon/pull/10707.
Once https://github.com/neondatabase/cloud/pull/25233 is released,
nothing external will be reading this field.
Internally, this was a mandatory field so storage controller is still
trying to decode it, so we must do this removal in two steps: this PR
makes the field optional, and after one release we can fully remove it.
Related: https://github.com/neondatabase/cloud/issues/24250
## Summary of changes
- Rename field to `_unused`
- Remove field from swagger
- Make field optional
The upgrade to 8.0.0 caused severe performance regressions in the
start_postgres_ms metric, which measures the time it takes from execing
Postgres to the time Postgres marks itself as ready in the
postmaster.pid file. We use the notify crate to watch for changes in the
pgdata directory and the postmaster.pid file.
Signed-off-by: Tristan Partin <tristan@neon.tech>
# Problem
If the Pageserver ingest path
(InMemoryLayer=>EphemeralFile=>BufferedWriter)
encounters ENOSPC or any other write IO error when flushing the mutable
buffer
of the BufferedWriter, the buffered writer is left in a state where
subsequent _reads_
from the InMemoryLayer it will cause a `must not use after we returned
an error` panic.
The reason is that
1. the flush background task bails on flush failure,
2. causing the `FlushHandle::flush` function to fail at channel.recv()
and
3. causing the `FlushHandle::flush` function to bail with the flush
error,
4. leaving its caller `BufferedWriter::flush` with
`BufferedWriter::mutable = None`,
5. once the InMemoryLayer's RwLock::write guard is dropped, subsequent
reads can enter,
6. those reads find `mutable = None` and cause the panic.
# Context
It has always been the contract that writes against the BufferedWriter
API
must not be retried because the writer/stream-style/append-only
interface makes no atomicity guarantees ("On error, did nothing or a
piece of the buffer get appended?").
The idea was that the error would bubble up to upper layers that can
throw away the buffered writer and create a new one. (See our [internal
error handling policy document on how to handle e.g.
`ENOSPC`](c870a50bc0/src/storage/handling_io_and_logical_errors.md (L36-L43))).
That _might_ be true for delta/image layer writers, I haven't checked.
But it's certainly not true for the ingest path: there are no provisions
to throw away an InMemoryLayer that encountered a write error an
reingest the WAL already written to it.
Adding such higher-level retries would involve either resetting
last_record_lsn to a lower value and restarting walreceiver. The code
isn't flexible enough to do that, and such complexity likely isn't worth
it given that write errors are rare.
# Solution
The solution in this PR is to retry _any_ failing write operation
_indefinitely_ inside the buffered writer flush task, except of course
those that are fatal as per `maybe_fatal_err`.
Retrying indefinitely ensures that `BufferedWriter::mutable` is never
left `None` in the case of IO errors, thereby solving the problem
described above.
It's a clear improvement over the status quo.
However, while we're retrying, we build up backpressure because the
`flush` is only double-buffered, not infinitely buffered.
Backpressure here is generally good to avoid resource exhaustion, **but
blocks reads** and hence stalls GetPage requests because InMemoryLayer
reads and writes are mutually exclusive.
That's orthogonal to the problem that is solved here, though.
## Caveats
Note that there are some remaining conditions in the flush background
task where it can bail with an error. I have annotated one of them with
a TODO comment.
Hence the `FlushHandle::flush` is still fallible and hence the overall
scenario of leaving `mutable = None` on the bail path is still possible.
We can clean that up in a later commit.
Note also that retrying indefinitely is great for temporary errors like
ENOSPC but likely undesirable in case the `std::io::Error` we get is
really due to higher-level logic bugs.
For example, we could fail to flush because the timeline or tenant
directory got deleted and VirtualFile's reopen fails with ENOENT.
Note finally that cancellation is not respected while we're retrying.
This means we will block timeline/tenant/pageserver shutdown.
The reason is that the existing cancellation story for the buffered
writer background task was to recv from flush op channel until the
sending side (FlushHandle) is explicitly shut down or dropped.
Failing to handle cancellation carries the operational risk that even if
a single timeline gets stuck because of a logic bug such as the one laid
out above, we must still restart the whole pageserver process.
# Alternatives Considered
As pointed out in the `Context` section, throwing away a InMemoryLayer
that encountered an error and reingesting the WAL is a lot of complexity
that IMO isn't justified for such an edge case.
Also, it's wasteful.
I think it's a local optimum.
A more general and simpler solution for ENOSPC is to `abort()` the
process and run eviction on startup before bringing up the rest of
pageserver.
I argued for it in the past, the pro arguments are still valid and
complete:
https://neondb.slack.com/archives/C033RQ5SPDH/p1716896265296329
The trouble at the time was implementing eviction on startup.
However, maybe things are simpler now that we are fully storcon-managed
and all tenants have secondaries.
For example, if pageserver `abort()`s on ENOSPC and then simply don't
respond to storcon heartbeats while we're running eviction on startup,
storcon will fail tenants over to the secondary anyway, giving us all
the time we need to clean up.
The downside is that if there's a systemic space management bug, above
proposal will just propagate the problem to other nodes. But I imagine
that because of the delays involved with filling up disks, the system
might reach a half-stable state, providing operators more time to react.
# Demo
Intermediary commit `a03f335121480afc0171b0f34606bdf929e962c5` is demoed
in this (internal) screen recording:
https://drive.google.com/file/d/1nBC6lFV2himQ8vRXDXrY30yfWmI2JL5J/view?usp=drive_link
# Perf Testing
Ran `bench_ingest` on tmpfs, no measurable difference.
Spans are uniquely owned by the flush task, and the span stack isn't too
deep, so, enter and exit should be cheap.
Plus, each flush takes ~150us with direct IO enabled, so, not _that_
high frequency event anyways.
# Refs
- fixes https://github.com/neondatabase/neon/issues/10856
## Problem
This should also resolve the test flakiness of `test_gc_feedback`.
close https://github.com/neondatabase/neon/issues/11144
## Summary of changes
If `NoYield` is set, do not yield in gc-compaction.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`humantime` is not maintained and `cargo deny check` fails
- Will be addressed in https://github.com/neondatabase/neon/issues/11179
## Summary of changes
Ignore RUSTSEC-2025-0014 advisory for now
## Problem
Investigate https://github.com/neondatabase/neon/issues/11159
## Summary of changes
This doesn't fix the issue, but at least we can narrow down the cause
next time it happens by logging ancestor referenced layer cnt even if
it's 0.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
https://github.com/neondatabase/neon/issues/10851
## Summary of changes
Do some refactoring before making walproposer generations aware.
- Rename SS_VOTING to SS_WAIT_VOTING, SS_IDLE to SS_WAIT_ELECTED
- Continue to get rid of epochs: rename GetEpoch to GetLastLogTerm,
donorEpoch to donorLastLogTerm
- Instead of counting n_votes, n_connected, introduce explicit
WalProposerState (collecting terms / voting / elected). Refactor out
TermsCollected and VotesCollected; they will determine state transition
differently depending whether generations are enabled or not.
There is no new logic in this PR and thus no new tests.
## Problem
In #11122, we want to split shards once the logical size of the largest
timeline exceeds a split threshold. However, `get_top_tenants` currently
only returns `max_logical_size`, which tracks the max _total_ logical
size of a timeline across all shards.
This is problematic, because the storage controller needs to fetch a
list of N tenants that are eligible for splits, but the API doesn't
currently have a way to express this. For example, with a split
threshold of 1 GB, a tenant with `max_logical_size` of 4 GB is eligible
to split if it has 1 or 2 shards, but not if it already has 4 shards. We
need to express this in per-shard terms, otherwise the `get_top_tenants`
endpoint may end up only returning tenants that can't be split, blocking
splits entirely.
Touches https://github.com/neondatabase/neon/pull/11122.
Touches https://github.com/neondatabase/cloud/issues/22532.
## Summary of changes
Add `TenantShardItem::max_logical_size_per_shard` containing
`max_logical_size / shard_count`, and
`TenantSorting::MaxLogicalSizePerShard` to order and filter by it.
Fixes https://github.com/neondatabase/serverless/issues/144
When tables have enums, we need to perform type queries for that data.
We cache these query statements for performance reasons. In Neon RLS, we
run "discard all" for security reasons, which discards all the
statements. When we need to type check again, the statements are no
longer valid.
This fixes it to discard the statements as well.
I've also added some new logs and error types to monitor this. Currently
we don't see the prepared statement errors in our logs.
# Refs
- fixes https://github.com/neondatabase/neon/issues/6107
# Problem
`VirtualFile` currently parses the path it is opened with to identify
the `tenant,shard,timeline` labels to be used for the `STORAGE_IO_SIZE`
metric.
Further, for each read or write call to VirtualFile, it uses
`with_label_values` to retrieve the correct metrics object, which under
the hood is a global hashmap guarded by a parking_lot mutex.
We perform tens of thousands of reads and writes per second on every
pageserver instance; thus, doing the mutex lock + hashmap lookup is
wasteful.
# Changes
Apply the technique we use for all other timeline-scoped metrics to
avoid the repeat `with_label_values`: add it to `TimelineMetrics`.
Wrap `TimelineMetrics` into an `Arc`.
Propagate the `Arc<TimelineMetrics>` down do `VirtualFile`, and use
`Timeline::metrics::storage_io_size`.
To avoid contention on the `Arc<TimelineMetrics>`'s refcount atomics
between different connection handlers for the same timeline, we wrap it
into another Arc.
To avoid frequent allocations, we store that Arc<Arc<TimelineMetrics>>
inside the per-connection timeline cache.
Preliminary refactorings to enable this change:
- https://github.com/neondatabase/neon/pull/11001
- https://github.com/neondatabase/neon/pull/11030
# Performance
I ran the benchmarks in
`test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py`
on an `i3en.3xlarge` because that's what we currently run them on.
None of the benchmarks shows a meaningful difference in latency or
throughput or CPU utilization.
I would have expected some improvement in the
many-tenants-one-client-each workload because they all hit that hashmap
constantly, and clone the same `UintCounter` / `Arc` inside of it.
But apparently the overhead is miniscule compared to the remaining work
we do per getpage.
Yet, since the changes are already made, the added complexity is
manageable, and the perf overhead of `with_label_values` demonstrable in
micro-benchmarks, let's have this change anyway.
Also, propagating TimelineMetrics through RequestContext might come in
handy down the line.
The micro-benchmark that demonstrates perf impact of
`with_label_values`, along with other pitfalls and mitigation techniques
around the `metrics`/`prometheus` crate:
- https://github.com/neondatabase/neon/pull/11019
# Alternative Designs
An earlier iteration of this PR stored an `Arc<Arc<Timeline>>` inside
`RequestContext`.
The problem is that this risks reference cycles if the RequestContext
gets stored in an object that is owned directly or indirectly by
`Timeline`.
Ideally, we wouldn't be using this mess of Arc's at all and propagate
Rust references instead.
But tokio requires tasks to be `'static`, and so, we wouldn't be able to
propagate references across task boundaries, which is incompatible with
any sort of fan-out code we already have (e.g. concurrent IO) or future
code (parallel compaction).
So, opt for Arc for now.
We use the `metrics` / `prometheus` crate in the Pageserver code base.
This PR demonstrates
- typical performance pitfalls with that crate
- our current set of techniques to avoid most of these pitfalls.
refs
- https://github.com/neondatabase/neon/issues/10948
- https://github.com/neondatabase/neon/pull/7202
- I applied the `label_values__cache_label_values_lookup` technique
there.
- It didn't yield measurable results in high-level benchmarks though.
This PR extends the storcon with basic safekeeper management of
timelines, mainly timeline creation and deletion. We want to make the
storcon manage safekeepers in the future. Timeline creation is
controlled by the `--timelines-onto-safekeepers` flag.
1. it adds the `timelines` and `safekeeper_timeline_pending_ops` tables
to the storcon db
2. extend code for the timeline creation and deletion
4. it adds per-safekeeper reconciler tasks
TODO:
* maybe not immediately schedule reconciliations for deletions but have
a prior manual step
* tenant deletions
* add exclude API definitions (probably separate PR)
* how to choose safekeeper to do exclude on vs deletion? this can be a
bit hairy because the safekeeper might go offline in the meantime.
* error/failure case handling
* tests (cc test_explicit_timeline_creation from #11002)
* single safekeeper mode: we often only have one SK (in tests for
example)
* `notify-safekeepers` hook:
https://github.com/neondatabase/neon/issues/11163
TODOs implemented:
* cancellations of enqueued reconciliations on a per-timeline basis,
helpful if there is an ongoing deletion
* implement pending ops overwrite behavior
* load pending operations from db
RFC section for important reading:
[link](https://github.com/neondatabase/neon/blob/main/docs/rfcs/035-safekeeper-dynamic-membership-change.md#storage_controller-implementation)
Implements the bulk of #9011
Successor of #10440.
---------
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
## Problem
Async prefetch in LFC PR cause incorrect calculation of LFC
`used_pages`when page is overwritten
## Summary of changes
Decrement `used_pages` is page is overwritten.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Matthias van de Meent <matthias@neon.tech>
This allows for improved decoding of otherwise broken WAL.
## Problem
Currently, if (or when) a WAL record has a wrong prevptr, that breaks
decoding. With this, we don't have to break on that if we decide it's OK
to proceed after that.
## Summary of changes
Use a Neon GUC to allow the system to enable the NEON-specific
skip_lsn_checks option in XLogReader.
## Problem
Partial reads are still problematic. They are stored in the buffer of
the wal decoder and result in gaps being reported too eagerly on the
pageserver side.
## Summary of changes
Previously, we always used the start LSN of the chunk of WAL that was
just read. This patch switches to using the end LSN of the last record
that was decoded in the previous iteration.
## Problem
Pinging groups on slack didn't work, because I didn't use the correct
syntax.
## Summary of changes
Use the correct syntax for pinging groups.
* pprof can also use `prost` as a backend, switch to it as `protobuf`
has no update available but a security issue.
* `paste` is a build time dependency, so add the unmaintained warning as
an exception.
## Problem
`home_shard_count` is not updated on the preferred AZ change.
Closes: https://github.com/neondatabase/neon/issues/10493
## Summary of changes
- Update scheduler stats (node ref counts) on preferred AZ change.
## Problem
See https://neondb.slack.com/archives/C08DE6Q9C3B
Sometimes compute is not able to receive responses from PS for a long
time (minutes).
I do not think that the problem is at compute side, but in order to
exclude this possibility I wan to see more information about connection
state at compute side, particularly amount of data cached in connection
buffer.
## Summary of changes
Right now we are dumping state of socket buffer.
This PR adds printing state of connection buffer.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Previously, remote extensions were not fetched unless they were used in
some other manner. For instance, loading a BM25 index in pg_search
fetches the pg_search extension. However, if on a fresh compute with
pg_search 0.15.5 installed, the user ran `ALTER EXTENSION pg_search
UPDATE TO '0.15.6'` without first using the pg_search extension, we
would not fetch the extension and fail to find an update path.
Signed-off-by: Tristan Partin <tristan@neon.tech>
We were previously only revoking privileges granted by neon_superuser.
However, we need to do it for all grantors.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
As part of the disaster recovery tool. Partly for
https://github.com/neondatabase/neon/issues/9114.
## Summary of changes
* Add a new pageserver API to force patch the fields in index_part and
modify the timeline internal structures.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Storage controller uses http for requests to safekeeper management API.
Closes: https://github.com/neondatabase/cloud/issues/24835
## Summary of changes
- Add `use_https_safekeeper_api` option to storcon to use https api
- Use https for requests to safekeeper management API if this option is
enabled
- Add `ssl_ca_file` option to storcon for ability to specify custom root
CA certificate
## Problem
The current migration API does a live migration, but if the destination
doesn't already have a secondary, that live migration is unlikely to be
able to warm up a tenant properly within its timeout (full warmup of a
big tenant can take tens of minutes).
Background optimisation code knows how to do this gracefully by creating
a secondary first, but we don't currently give a human a way to trigger
that.
Closes: https://github.com/neondatabase/neon/issues/10540
## Summary of changes
- Add `prefererred_node` parameter to TenantShard, which is respected by
optimize_attachment
- Modify migration API to have optional prewarm=true mode, in which we
set preferred_node and call optimize_attachment, rather than directly
modifying intentstate
- Require override_scheduler=true flag if migrating somewhere that is a
less-than-optimal scheduling location (e.g. wrong AZ)
- Add `origin_node_id` to migration API so that callers can ensure
they're moving from where they think they're moving from
- Add tests for the above
The storcon_cli wrapper for this has a 'watch' mode that waits for
eventual cutover. This doesn't show the warmth of the secondary evolve
because we don't currently have an API for that in the controller, as
the passthrough API only targets attached locations, not secondaries. It
would be straightforward to add later as a dedicated endpoint for
getting secondary status, then extend the storcon_cli to consume that
and print a nice progress indicator.
## Problem
Shard zero needs to track the start LSN of the latest record
in adition to the LSN of the next record to ingest. The former
is included in basebackup persisted by the compute in WAL.
Previously, empty records were skipped for all shards. This caused
the prev LSN tracking on the PS to fall behind and led to logical
replication
issues.
## Summary of changes
Shard zero now receives emtpy interpreted records for LSN tracking
purposes.
A test is included too.
## Problem
Allow for using _meta.yml with workflow_dispatch event.
## Summary of changes
Handle this event in the run-kind step; fix and update the description
of the run-kind output.
## Problem
This patch makes some initial tweaks as preparation for
https://github.com/neondatabase/cloud/issues/22532, where we will be
introducing additional autosplit logic.
The plan is outlined in
https://github.com/neondatabase/cloud/issues/22532#issuecomment-2706215907.
## Summary of changes
Minor code cleanups and behavioral changes:
* Decide that we'll split based on `max_logical_size` (later possibly
`total_logical_size`).
* Fix a bug that would split the smallest candidate rather than the
largest.
* Pick the largest candidate by `max_logical_size` rather than
`resident_size`, for consistency (debatable).
* Split out `get_top_tenant_shards()` to fetch split candidates.
* Fetch candidates concurrently from all nodes.
* Make `TenantShard.get_scheduling_policy()` return a copy instead of a
reference.
We have introduced the ability to specify safekeeper JWTs for the
storage controller. It now does heartbeats. We now want to also require
the presence of those JWTs.
Let's merge this PR shortly after the release cutoff.
Part of / follow-up of
https://github.com/neondatabase/cloud/issues/24727
## Problem
We just had a regression reported at
https://neondb.slack.com/archives/C08EXUJF554/p1741102467515599, which
clearly came with one of the releases. It's not a huge problem yet, but
it's annoying that we cannot quickly attribute it to a specific commit.
## Summary of changes
Add a very simple `compute_ctl` HTTP API benchmark that does 10k
requests to `/status` and `metrics.json` and reports p50 and p99.
---------
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
**NB: this PR must be merged only by 'Create a merge commit'!**
### Checklist when preparing for release
- [ ] Read or refresh [the release flow guide](https://www.notion.so/neondatabase/Release-general-flow-61f2e39fd45d4d14a70c7749604bd70b)
- [ ] Ask in the [cloud Slack channel](https://neondb.slack.com/archives/C033A2WE6BZ) that you are going to rollout the release. Any blockers?
- [ ] Does this release contain any db migrations? Destructive ones? What is the rollback plan?
<!-- List everything that should be done**before** release, any issues / setting changes / etc -->
### Checklist after release
- [ ] Make sure instructions from PRs included in this release and labeled `manual_release_instructions` are executed (either by you or by people who wrote them).
- [ ] Based on the merged commits write release notes and open a PR into `website` repo ([example](https://github.com/neondatabase/website/pull/219/files))
description:"Tag for the release if this is an RC PR run"
value:${{ jobs.tags.outputs.release-tag }}
previous-storage-release:
description:"Tag of the last storage release"
value:${{ jobs.tags.outputs.storage }}
@@ -19,27 +25,41 @@ on:
description:"Tag of the last compute release"
value:${{ jobs.tags.outputs.compute }}
run-kind:
description:"The kind of run we're currently in. Will be one of `pr`, `push-main`, `storage-rc`, `storage-release`, `proxy-rc`, `proxy-release`, `compute-rc`, `compute-release` or `merge_queue`"
description:"The kind of run we're currently in. Will be one of `push-main`, `storage-release`, `compute-release`, `proxy-release`, `storage-rc-pr`, `compute-rc-pr`, `proxy-rc-pr`, `pr`, or `workflow-dispatch`"
value:${{ jobs.tags.outputs.run-kind }}
release-pr-run-id:
description:"Only available if `run-kind in [storage-release, proxy-release, compute-release]`. Contains the run ID of the `Build and Test` workflow, assuming one with the current commit can be found."
value:${{ jobs.tags.outputs.release-pr-run-id }}
sha:
description:"github.event.pull_request.head.sha on release PRs, github.sha otherwise"
RELEASE_PR_RUN_ID=$(gh api "/repos/${GITHUB_REPOSITORY}/actions/runs?head_sha=$CURRENT_SHA" | jq '[.workflow_runs[] | select(.name == "Build and Test") | select(.head_branch | test("^rc/release.*$"; "s"))] | first | .id // ("Failed to find Build and Test run from RC PR!" | halt_error(1))')
echo "release-pr-run-id=$RELEASE_PR_RUN_ID" | tee -a $GITHUB_OUTPUT
--body "Not trying to forward pull-request, because \`mergeable_state\` is \`${{ github.event.pull_request.mergeable_state }}\`, not \`clean\` or \`unstable\`."
@@ -173,7 +173,7 @@ RUN curl -fsSL "https://github.com/protocolbuffers/protobuf/releases/download/v$
&& rm -rf protoc.zip protoc
# s5cmd
ENVS5CMD_VERSION=2.2.2
ENVS5CMD_VERSION=2.3.0
RUN curl -sL "https://github.com/peak/s5cmd/releases/download/v${S5CMD_VERSION}/s5cmd_${S5CMD_VERSION}_Linux-$(uname -m | sed 's/x86_64/64bit/g'| sed 's/aarch64/arm64/g').tar.gz"| tar zxvf - s5cmd \
&& chmod +x s5cmd \
&& mv s5cmd /usr/local/bin/s5cmd
@@ -206,7 +206,7 @@ RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "aws
# Program name comes from postgres' syslog_facility configuration: https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-SYSLOG-IDENT
# Default value is 'postgres'.
if $programname == 'postgres' then {{
# Forward Postgres logs to telemetry otel collector
info!("change anon extension functions owner to db owner");
db_client.simple_query(&query).await?;
// affects views as well
letquery=format!("GRANT ALL ON ALL TABLES IN SCHEMA anon TO {}",db_owner);
info!("granting anon extension permissions with query: {}",query);
db_client.simple_query(&query).await?;
letquery=format!("GRANT ALL ON ALL SEQUENCES IN SCHEMA anon TO {}",db_owner);
info!("granting anon extension permissions with query: {}",query);
db_client.simple_query(&query).await?;
}
}
Ok(())
}
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.