In order for the test to work when sanitizers are enabled, we would need
to compile the dummy Postgres extension with the same sanitizer flags
that we compile Postgres and the neon extension with. Doing this work
would be a little more than trivial, so skipping is the best option, at
least for now.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
We didn't consider tombstones in replorigin read path in the past. This
was fine because tombstones are stored as LSN::Invalid before we
universally define what the tombstone is for sparse keyspaces.
Now we remove non-inherited keys during detach ancestor and write the
universal tombstone "empty image". So we need to consider it across all
the read paths.
related: https://github.com/neondatabase/neon/pull/11299
## Summary of changes
Empty values are ignored for replorigin scans.
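A minimal sketch of the behaviour (hypothetical signature, not the actual pageserver code):
```rust
// Drop tombstoned values (the universal "empty image" tombstone for sparse
// keyspaces) from a replorigin scan instead of trying to decode them.
fn skip_tombstones(
    scan: impl Iterator<Item = (u64, Vec<u8>)>, // (key, value) pairs from a scan
) -> impl Iterator<Item = (u64, Vec<u8>)> {
    scan.filter(|(_key, value)| !value.is_empty())
}
```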
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We had retained the ability to run in a generation-less mode to support
test_generations_upgrade, which was replaced with a cleaner backward
compat test in https://github.com/neondatabase/neon/pull/10701
## Summary of changes
- Remove all the special cases for "if no generation" or "if no control
plane api"
- Make control_plane_api config mandatory
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Postgres has a nice self-documenting macro called pg_unreachable() for
when you want to assert that a location in code won't be hit.
Warning in question:
```
/home/tristan957/Projects/work/neon//pgxn/neon/libpagestore.c: In function ‘pageserver_connect’:
/home/tristan957/Projects/work/neon//pgxn/neon/libpagestore.c:739:1: warning: control reaches end of non-void function [-Wreturn-type]
739 | }
| ^
```
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
In principle, pageservers with different postgres binaries might generate
different initdbs, resulting in inconsistency between shards. To avoid
that, we should have shard 0 generate the initdb and other shards re-use
it.
Fixes: https://github.com/neondatabase/neon/issues/11340
## Summary of changes
- For shards with index greater than zero, set
`existing_initdb_timeline_id` in timeline creation to consume the
existing initdb rather than creating a new one
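A sketch of the rule, with assumed names rather than the actual API:
```rust
/// Only shard zero runs initdb; every other shard passes the existing initdb
/// timeline so it consumes what shard zero already uploaded.
fn existing_initdb_timeline_id(shard_number: u8, timeline_id: u128) -> Option<u128> {
    if shard_number == 0 {
        None // shard zero creates a fresh initdb and uploads it
    } else {
        Some(timeline_id) // other shards reuse shard zero's initdb
    }
}
```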
## Problem
Our different repositories each had code to achieve very similar
results in terms of release PR creation, but it was structured
differently and had different extensions. This was likely to cause
maintainability problems in the long run.
## Summary of changes
Switch to a Python-CLI-based composite action for creating the release
PRs; it will also be introduced in our other repos later.
## To Do
- [ ] Adjust our docs to reflect the changes from this.
# Remove SAFEKEEPER_AUTH_TOKEN env var parsing from safekeeper
This PR is a follow-up to #11443 that removes the parsing of the
`SAFEKEEPER_AUTH_TOKEN` environment variable from the safekeeper
codebase while keeping the `auth_token_path` CLI flag functionality.
## Changes:
- Removed code that checks for the `SAFEKEEPER_AUTH_TOKEN` environment
variable
- Updated comments to reflect that only the `auth_token_path` CLI flag
is now used
As mentioned in PR #11443, the environment variable approach was planned
to be deprecated and removed in favor of the file-based approach, which
is more secure since environment variables can be quite public in both
procfs and unit files.
Link to Devin run:
https://app.devin.ai/sessions/d6f56cf1b4164ea9880a9a06358a58ac
Requested by: arpad@neon.tech
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: arpad@neon.tech <arpad@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
The `pageserver_smgr_query_seconds` buckets are too coarse, using powers
of 10: 1 µs, 10 µs, 100 µs, 1 ms, 10 ms, 100 ms, 1 s, 10 s, 100 s. This
is one of our most crucial latency metrics, and needs better resolution.
Touches #11594.
## Summary of changes
This patch uses buckets with better resolution around 1 ms (the typical
latency):
* 0.6 ms
* 1 ms
* 3 ms
* 6 ms
* 10 ms
* 30 ms
* 100 ms
* 1 s
* 3 s
These will be the same as the compute's `compute_getpage_wait_seconds`,
to make them comparable across the compute and Pageserver:
https://github.com/neondatabase/flux-fleet/pull/579. We sacrifice
buckets above 3 s, since these can already be considered "too slow".
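For reference, a sketch of the new bucket layout (the constant name is an assumption):
```rust
// Buckets with better resolution around the typical ~1 ms latency, matching
// the compute's compute_getpage_wait_seconds buckets.
pub const SMGR_QUERY_TIME_BUCKETS: &[f64] = &[
    0.0006, // 0.6 ms
    0.001,  // 1 ms
    0.003,  // 3 ms
    0.006,  // 6 ms
    0.01,   // 10 ms
    0.03,   // 30 ms
    0.1,    // 100 ms
    1.0,    // 1 s
    3.0,    // 3 s
];
```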
This does not change the previously used `CRITICAL_OP_BUCKETS`, which is
also used for other operations on different timescales (e.g. LSN waits).
We should consider replacing this with more appropriate buckets for
specific operations, since it covers a large span with low resolution.
## Problem
pg-sni-router isn't aware of compute TLS
## Summary of changes
If connections come in on port 4433, we require TLS from pg-sni-router
to the compute.
## Problem
- if-conditions for the `check-macos-build` workflow don't trigger it on
PRs with relevant changes (in Rust code or Postgres submodules).
- Jobs in the workflow depend on the presence of a cache, which is not
guaranteed.
## Summary of changes
- Fix if-conditions
- Use artifacts on top of cache whenever the workflow depends on it —
the cache might not be available
## Problem
We currently don't run end-to-end tests for PostgreSQL extensions on our
cloud infrastructure, which means we might miss problems that only occur
in a real cloud environment.
## Summary of changes
- Added a workflow to run extension tests against a cloud staging
instance
- Set up proper project configuration for extension testing
- Implemented test execution with appropriate environment settings
- Added error handling and reporting for test failures
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
## Problem
Provide an easy way to run particular test(s) N times on CI.
## Summary of changes
* Allow for passing the test selection and the number of test runs to
the existing "Build and Test Locally" workflow
* Allow for running multiple selected tests by the "Pytest regression
tests" step
* Introduce a new workflow to run specified test(s) several times
* Store results in a separate database to distinguish stability testing
of tests from usual testing
## Problem
Proposed minor changes to the `consumption_metrics` document.
## Summary of changes
- Fixed minor typos in the document.
- Minor formatting in the descriptions of the `timeline_logical_size`
and `synthetic_storage_size` metrics, making them consistent with the
descriptions of the other metrics in the document.
Co-authored-by: Mikhail Kot <mikhail@neon.tech>
### Summary
I'm fixing one or more of the following CI/CD misconfigurations to
improve security. Please feel free to leave a comment if you think the
current permissions for the GITHUB_TOKEN should not be restricted so I
can take a note of it as accepted behaviour.
- Restrict permissions for GITHUB_TOKEN
- Add step-security/harden-runner
- Pin Actions to a full length commit SHA
### Security Fixes
will fix https://github.com/neondatabase/cloud/issues/26141
## Problem
Broker supports only HTTP, no HTTPS
- Closes: https://github.com/neondatabase/cloud/issues/27492
## Summary of changes
- Add `listen_https_addr`, `ssl_key_file`, `ssl_cert_file`,
`ssl_cert_reload_period` arguments to storage broker
- Make `listen_addr` argument optional
- Listen on HTTPS in the storage broker
- Support HTTPS for storage broker requests in neon_local
- Add `use_https_storage_broker_api` option to NeonEnvBuilder
## Problem
Shard splits break timeline imports.
## Summary of Changes
Ensure mutual exclusion for imports and shard splits.
On the shard split code path:
1. Right before shard splitting, check the database to ensure that
no import is ongoing for the tenant. Exclusion is guaranteed because
this validation is done while holding the exclusive tenant lock.
Timeline creation (and import creation implicitly) requires a shared
tenant lock.
2. When selecting a shard to split, use the in-memory state to exclude
shards with an ongoing import. This is opportunistic since an import
might start after the check, but it allows shard splits to make progress
instead of continuously retrying to split the same shard.
On the timeline creation code path:
1. Check the in-memory splitting flag on all shards of the tenant. If
any of them are splitting, error out asking the client to retry. On the
happy path this is not required, due to the tenant lock set-up described
above, but it covers the case where we restart with a pending
shard-split.
Closes https://github.com/neondatabase/neon/issues/11567
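A minimal sketch (hypothetical types and fields, not the actual storage controller code) of the two in-memory checks described above; the authoritative exclusion still comes from the tenant lock plus the database check before splitting:
```rust
struct ShardState {
    importing: bool,
    splitting: bool,
}

/// Skip shards with an ongoing import when picking split candidates.
fn eligible_for_split(shard: &ShardState) -> bool {
    !shard.importing
}

/// If any shard of the tenant is splitting, ask the client to retry
/// the timeline creation later.
fn timeline_creation_allowed(shards: &[ShardState]) -> Result<(), &'static str> {
    if shards.iter().any(|s| s.splitting) {
        Err("tenant shard split in progress, retry later")
    } else {
        Ok(())
    }
}
```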
## Problem
- using Hetzner buckets for the cache requires secrets; we would need
`secrets: inherit` to make it work
- we don't have self-hosted macOS runners, so the GH-native cache is
actually the better option there
## Summary of changes
- Switch to the GH-native cache for macOS builds
## Problem
The docker compose test script (`docker_compose_test.sh`) had
inconsistent codestyle, mixing legacy syntax with modern approaches and
not following best practices at all. This inconsistency could lead to
potential issues with variable expansion, path handling, and
maintainability.
## Summary of changes
This PR modernizes the test script with several codestyle improvements:
* Variable scoping and exports:
  * Added proper export declarations for environment variables
  * Added explicit COMPOSE_PROFILES export to avoid repetitive flags
* Modern Bash syntax:
  * Replaced `[ ]` with `[[ ]]` for safer conditional testing
  * Used arithmetic operations `(( cnt += 3 ))` instead of `expr`
  * Added proper variable expansion with braces `${variable}`
  * Added proper quoting around variables and paths with `"${variable}"`
* Docker Compose commands:
  * Replaced hardcoded container names with service names
  * Used `docker compose exec` instead of `docker exec $CONTAINER_NAME`
  * Removed repetitive flags by using environment variables
* Shell script best practices:
  * Added the `function` keyword before function definitions
  * Used safer path handling with `"$(dirname "${0}")"`
These changes make the script more maintainable, less error-prone, and
more consistent with modern shell scripting standards.
## Problem
It seems our production-ready cert-manager setup now includes a full
certificate chain. This was not accounted for, and the decoder would
error.
## Summary of changes
Change the way we decode certificates to support cert-chains, ignoring
all but the first cert.
This also changes a log line to not use multi-line errors.
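A minimal sketch of taking only the first certificate from a PEM bundle, assuming the rustls-pemfile 2.x and rustls-pki-types crates (not necessarily what our code uses):
```rust
fn first_cert(
    mut pem: &[u8],
) -> std::io::Result<Option<rustls_pki_types::CertificateDer<'static>>> {
    // &[u8] implements BufRead, so we can parse the bundle directly and keep
    // only the first (leaf) certificate, ignoring the rest of the chain.
    rustls_pemfile::certs(&mut pem).next().transpose()
}
```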
~~I have tested this code manually against real certificates/keys, I
didn't want to embed those in a test just yet, not until the cert
expires in 24 hours.~~
## Problem
close https://github.com/neondatabase/neon/issues/11694
We had the delta layer iterator and image layer iterator set to buffer
at most 8MB of data. Note that 8MB is the compressed size, so it is
possible for those iterators to hold more than 8MB of data in memory.
For the recent OOM case, gc-compaction was running over 556 layers,
which means that we will have 556 active iterators. So in theory, it
could take up to 556*8=4448MB memory when the compaction is going on. If
images get compressed and the compression ratio is high (for that
tenant, we see 3x compression ratio across image layers), then that's
13344MB memory.
We also have layer rewrites, which explains the memory taken by
gc-compaction itself (versus the iterators). We rewrite 424 out of 556
layers, and each such rewrite needs a pair of delta layer writers. So
we are buffering a lot of deltas in memory.
The flamegraph shows that gc-compaction itself takes 6GB memory, delta
iterator 7GB, and image iterator 2GB, which can be explained by the
above theory.
## Summary of changes
- Reduce the buffer sizes.
- Estimate memory consumption and give up if it is too high.
- Also give up if the number of layers-to-rewrite is too high.
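A back-of-the-envelope sketch of the estimate described above (constants and names are assumptions):
```rust
// With one iterator per layer, each buffering up to its buffer size, memory
// usage grows linearly with the number of layers.
const ITERATOR_BUFFER_BYTES: u64 = 8 * 1024 * 1024; // 8 MiB of compressed data per iterator

fn estimated_iterator_memory_bytes(num_layers: u64, compression_ratio: f64) -> u64 {
    // e.g. 556 layers * 8 MiB ≈ 4448 MiB compressed,
    // times a ~3x compression ratio ≈ 13344 MiB once decompressed in memory.
    ((num_layers * ITERATOR_BUFFER_BYTES) as f64 * compression_ratio) as u64
}
```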
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
The current test used SQL files only, but we also want to test a
remote extension that includes a loadable library. With both extensions,
we should cover a larger portion of compute_ctl's remote extension code
paths.
Fixes: https://github.com/neondatabase/neon/issues/11146
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
In https://github.com/neondatabase/neon/pull/11345 coordination of
imports moved to the storage controller.
It involves notifying cplane when the import has been completed by
calling an idempotent endpoint.
If the storage controller shuts down in the middle of finalizing an
import, it would never be retried.
## Summary of changes
Reconcile imports at start-up by fetching the complete imports from the
database and spawning a background
task which notifies cplane.
Closes: https://github.com/neondatabase/neon/issues/11570
## Problem
We saw the following scenario in staging:
1. Pod A starts up. Becomes leader and steps down the previous pod
cleanly.
2. Pod B starts up (deployment).
3. Step down request from pod B to pod A times out. Pod A did not manage
to stop its reconciliations within 10 seconds and exited with return
code 1
([code](7ba8519b43/storage_controller/src/service.rs (L8686-L8702))).
4. Pod B marks itself as the leader and finishes start-up
5. k8s restarts pod A
6. k8s marks pod B as ready
7. pod A sends step down request to pod A - this succeeds => pod A is
now the leader
8. k8s kills pod A because it thinks pod B is healthy and pod A is part
of the old replica set
We end up in a situation where the only pod we have (B) is stepped down
and attempts to forward requests to a leader that doesn't exist. k8s
can't detect that pod B is in a bad state since the /status endpoint
simply returns 200 if the pod is running.
## Summary of changes
This PR includes a number of robustness improvements to the leadership
protocol:
* use a single step down task per controller
* add a new endpoint to be used as k8s liveness probe and check
leadership status there
* handle restarts explicitly (i.e. don't step yourself down)
* increase the step down retry count
* don't kill the process on long step down since k8s will just restart
it
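A sketch of the liveness-probe idea, with hypothetical types (the actual endpoint and state layout may differ):
```rust
use std::sync::atomic::{AtomicBool, Ordering};

// A stepped-down controller reports itself as unhealthy so k8s does not treat
// a pod without leadership as alive and ready.
struct ControllerState {
    is_leader: AtomicBool,
}

fn liveness_probe(state: &ControllerState) -> (u16, &'static str) {
    if state.is_leader.load(Ordering::Relaxed) {
        (200, "ok")
    } else {
        (503, "stepped down")
    }
}
```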
# Problem
The Pageserver read path exclusively uses direct IO if
`virtual_file_io_mode=direct`.
The write path is half-finished. Here is what the various writing
components use:
|what|buffering|flags on <br/>`v_f_io_mode`<br/>=`buffered`|flags on <br/>`virtual_file_io_mode`<br/>=`direct`|
|-|-|-|-|
|`DeltaLayerWriter`| BlobWriter<BUFFERED=true> | () | () |
|`ImageLayerWriter`| BlobWriter<BUFFERED=false> | () | () |
|`download_layer_file`|BufferedWriter|()|()|
|`InMemoryLayer`|BufferedWriter|()|O_DIRECT|
The vehicle towards direct IO support is `BufferedWriter`, which
- largely takes care of O_DIRECT alignment & size-multiple requirements
- provides double-buffering to mask latency
`DeltaLayerWriter`, `ImageLayerWriter` use `blob_io::BlobWriter` , which
has neither of these.
# Changes
## High-Level
At a high-level this PR makes the following primary changes:
- switch the two layer writer types to use `BufferedWriter` & make them
sensitive to `virtual_file_io_mode` (via open_with_options_**v2**)
- make `download_layer_file` sensitive to `virtual_file_io_mode` (also
via open_with_options_**v2**)
- add `virtual_file_io_mode=direct-rw` as a feature gate
- we're hackish-ly piggybacking on OpenOptions's ask for write access
here
- this means with just `=direct`, InMemoryLayer reads and writes no
longer use O_DIRECT
- this is transitory and we'll remove the `direct-rw` variant once the
rollout is complete
(The `_v2` APIs for opening / creating VirtualFile are those that are
sensitive to `virtual_file_io_mode`)
The result is:
|what|uses <br/>`BufferedWriter`|flags on <br/>`v_f_io_mode`<br/>=`buffered`|flags on <br/>`v_f_io_mode`<br/>=`direct`|flags on <br/>`v_f_io_mode`<br/>=`direct-rw`|
|-|-|-|-|-|
|`DeltaLayerWriter`| ~~Blob~~BufferedWriter | () | () | O_DIRECT |
|`ImageLayerWriter`| ~~Blob~~BufferedWriter | () | () | O_DIRECT |
|`download_layer_file`|BufferedWriter|()|()|O_DIRECT|
|`InMemoryLayer`|BufferedWriter|()|~~O_DIRECT~~()|O_DIRECT|
## Code-Level
The main change is:
- Switch `blob_io::BlobWriter` away from its own buffering method to use
`BufferedWriter`.
Additional prep for upholding `O_DIRECT` requirements:
- Layer writer `finish()` methods switched to use IoBufferMut for
guaranteed buffer address alignment. The size of the buffers is PAGE_SZ
and thereby implicitly assumed to fulfill O_DIRECT requirements.
For the hacky feature-gating via `=direct-rw`:
- Track `OpenOptions::write(true|false)` in a field; bunch of mechanical
churn.
- Consolidate the APIs in which we "open" or "create" VirtualFile for
better overview over which parts of the code use the `_v2` APIs.
Necessary refactorings & infra work:
- Add doc comments explaining how BufferedWriter ensures that writes are
compliant with O_DIRECT alignment & size constraints. This isn't new,
but should be spelled out.
- Add the concept of shutdown modes to `BufferedWriter::shutdown` to
make writer shutdown adhere to these constraints.
- The `PadThenTruncate` mode might not be necessary in practice because
I believe all layer files ever written are sized in multiples of
`PAGE_SZ`, and since `PAGE_SZ` is larger than the current alignment
requirements (512/4k depending on platform), it won't be necessary to
pad (see the sketch after this list).
- Some test (I believe `round_trip_test_compressed`?) required it though
- [ ] TODO: decide if we want to accept that complexity; if we do then
address TODO in the code to separate alignment requirement from buffer
capacity
- Add `set_len` (=`ftruncate`) VirtualFile operation to support the
above.
- Allow `BufferedWriter` to start at a non-zero offset (to make room for
the summary block).
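A sketch of the `PadThenTruncate` idea mentioned above; the constant and helper are assumptions, not the actual `BufferedWriter` code:
```rust
const ALIGN: u64 = 4096; // O_DIRECT size/alignment granularity (platform dependent)

/// Pad the final, partially filled buffer up to the O_DIRECT size granularity
/// for the write, then set_len (ftruncate) the file back to its logical length.
/// Returns (bytes to submit for the final write, length to truncate the file to).
fn pad_then_truncate_plan(logical_len: u64, buffered_tail: u64) -> (u64, u64) {
    let padded_tail = buffered_tail.div_ceil(ALIGN) * ALIGN;
    (padded_tail, logical_len)
}
```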
Cleanups unlocked by this change:
- Remove non-positional APIs from VirtualFile (e.g. seek, write_full,
read_full)
Drive-by fixes:
- PR https://github.com/neondatabase/neon/pull/11585 aimed to run unit
tests for all `virtual_file_io_mode` combinations but didn't because of
a missing `_` in the env var.
# Performance
This section assesses this PR's impact on deployments with current
production setting (`=direct`) and anticipated impact of switching to
(`=direct-rw`).
For `DeltaLayerWriter`, `=direct` should remain unchanged or slightly
improve on throughput, because the `BlobWriter`'s buffer had the same
size as the `BufferedWriter`'s buffer but didn't have the
double-buffering that `BufferedWriter` has.
`=direct-rw` enables direct IO; thanks to double-buffering, throughput
should not suffer; benchmarks will show whether this is true.
The `ImageLayerWriter` was previously not doing any buffering
(`BUFFERED=false`).
It went straight to issuing the IO operation to the underlying
VirtualFile and the buffering was done by the kernel.
The switch to `BufferedWriter` under `=direct` adds an additional memcpy
into the BufferedWriter's buffer.
We will win back that memcpy when enabling direct IO via `=direct-rw`.
A nice win from the switch to `BufferedWriter` is that ImageLayerWriter
performs >=16x fewer write operations to VirtualFile (the BlobWriter
performs one write per len field and one write per image value).
This should save low tens of microseconds of CPU overhead from doing all
these syscalls/io_uring operations, regardless of `=direct` or
`=direct-rw`.
Aside from problems with alignment, this write frequency without
double-buffering is prohibitive if we actually have to wait for the
disk, which is what will happen when we enable direct IO via
(`=direct-rw`).
Thanks to BufferedWriter's double-buffering, throughput should not
suffer; benchmarks will show whether this is true.
`InMemoryLayer` at `=direct` will flip back to using buffered IO but
remain on BufferedWriter.
The buffered IO adds back one memcpy of CPU overhead.
Throughput should not suffer and might even improve on Pageservers that
are not under memory pressure, but let's remember that we're doing the
whole direct IO thing to eliminate global memory pressure as a source of
perf variability.
## bench_ingest
I reran `bench_ingest` on `im4gn.2xlarge` and `Hetzner AX102`.
Use `git diff` with `--word-diff` or similar to see the change.
General guidance on interpretation:
- immediate production impact of this PR without production config
change can be gauged by comparing the same `io_mode=Direct`
- end state of production switched over to `io_mode=DirectRw` can be
gauged by comparing old results' `io_mode=Direct` to new results'
`io_mode=DirectRw`
Given above guidance, on `im4gn.2xlarge`
- immediate impact is a significant improvement in all cases
- end state after switching has same significant improvements in all
cases
- ... except `ingest/io_mode=DirectRw volume_mib=128 key_size_bytes=8192
key_layout=Sequential write_delta=Yes` which only achieves `238 MiB/s`
instead of `253.43 MiB/s`
- this is a 6% degradation
- this workload is typical for image layer creation
# Refs
- epic https://github.com/neondatabase/neon/issues/9868
- stacked atop
- preliminary refactor https://github.com/neondatabase/neon/pull/11549
- bench_ingest overhaul https://github.com/neondatabase/neon/pull/11667
- derived from https://github.com/neondatabase/neon/pull/10063
Co-authored-by: Yuchen Liang <yuchen@neon.tech>
## Problem
`cargo-deny` 0.16.2 spits a bunch of warnings like:
```
warning[index-failure]: unable to check for yanked crates
```
The issue is fixed for the latest version of `cargo-deny` (0.18.2). And
while we're here, let's bump all the packages we have in `build-tools`
image
## Summary of changes
- bump cargo-hakari to 0.9.36
- bump cargo-deny to 0.18.2
- bump cargo-hack to 0.6.36
- bump cargo-nextest to 0.9.94
- bump diesel_cli to 2.2.9
- bump s5cmd to 2.3.0
- bump mold to 2.37.1
- bump python to 3.11.12
## Problem
Currently, we only report the timestamp of the last moment we think
Postgres was active. The problem is that if Postgres becomes completely
unresponsive, we still report some old timestamp, and it's impossible to
distinguish between 'Postgres is effectively down' and 'Postgres is
running, but there is no client activity'.
## Summary of changes
Refactor `compute_ctl`'s compute monitor so that it is easier to
track connection errors and failed activity checks, and report:
- `now() - last_successful_check` as current downtime on any failure
- cumulative Postgres downtime during the whole compute lifetime
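A sketch of the accounting described above, with hypothetical types (not the actual compute monitor code):
```rust
use std::time::{Duration, Instant};

// Current downtime is `now() - last_successful_check`; downtime accumulates
// into a lifetime total once checks start succeeding again.
struct DowntimeTracker {
    last_successful_check: Instant,
    total_downtime: Duration,
    currently_down: bool,
}

impl DowntimeTracker {
    fn on_check_failed(&mut self) -> Duration {
        self.currently_down = true;
        // Reported as the current downtime on any failure.
        self.last_successful_check.elapsed()
    }

    fn on_check_succeeded(&mut self) {
        if self.currently_down {
            // Fold the outage we just recovered from into the lifetime total.
            self.total_downtime += self.last_successful_check.elapsed();
            self.currently_down = false;
        }
        self.last_successful_check = Instant::now();
    }
}
```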
After adding a test, I also noticed that the compute monitor may not
reconnect even though queries fail with `connection closed` or `error
communicating with the server: Connection reset by peer (os error 54)`,
but for some reason we do not catch it with `client.is_closed()`, so I
added an explicit reconnect in case of any failures.
Discussion:
https://neondb.slack.com/archives/C03TN5G758R/p1742489426966639
Main change:
- `BufferedWriter` owns the `W`; no more `Arc<W>`
- We introduce auto-delete-on-drop wrappers for `VirtualFile`.
- `TempVirtualFile` for write-only users
- `TempVirtualFileCoOwnedByEphemeralFileAndBufferedWriter` for
EphemeralFile which requires read access to the immutable prefix of the
file (see doc comments for details)
- Users of `BufferedWriter` hand it such a wrapped `VirtualFile`.
- The wrapped `VirtualFile` moves to the background flush task.
- On `BufferedWriter` shutdown, ownership moves back.
- Callers remove the wrapper (`disarm_into_inner()`) after doing final
touches, e.g., flushing index blocks and summary for delta/image layer
writers.
If the BufferedWriter isn't shut down properly via
`BufferedWriter::shutdown`, or if there is an error during final
touches, the wrapper type ensures that the file gets unlinked.
We store a GateGuard inside the wrapper to ensure that the Timeline is
still alive when unlinking on drop.
Rust doesn't have async drop yet, so, the unlinking happens using a
synchronous syscall.
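A minimal sketch of the auto-delete-on-drop idea (hypothetical names, simplified: no gate guard, no VirtualFile):
```rust
// If the writer is not shut down cleanly, or final touches fail, dropping the
// wrapper unlinks the temp file with a synchronous syscall.
struct TempFileGuard {
    path: std::path::PathBuf,
    armed: bool,
}

impl TempFileGuard {
    fn new(path: std::path::PathBuf) -> Self {
        Self { path, armed: true }
    }

    /// Called after final touches succeed; keeps the file on disk.
    fn disarm_into_inner(mut self) -> std::path::PathBuf {
        self.armed = false;
        std::mem::replace(&mut self.path, std::path::PathBuf::new())
    }
}

impl Drop for TempFileGuard {
    fn drop(&mut self) {
        if self.armed {
            // Synchronous unlink; Rust has no async drop.
            let _ = std::fs::remove_file(&self.path);
        }
    }
}
```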
NB we don't fsync the surrounding directory.
This is how it's been before this PR; I believe it is correct because
all of these files are temporary paths that get cleaned up on timeline
load.
Again, timeline load does not need to fsync because the next timeline
load will unlink again if the file reappears.
The auto-delete-on-drop can happen after a higher-level mechanism has
already retried, so the unlink could hit the retry's file if paths were
reused.
Therefore, we switch all users to monotonically increasing, never-reused
temp file disambiguators.
The aspects pointed out in the last two paragraphs will receive further
cleanup in follow-up task
- https://github.com/neondatabase/neon/issues/11692
Drive-by changes:
- It turns out we can remove the two-pronged code in the layer file
download code.
No need to make this a separate PR because all of production has already
been using `tokio-epoll-uring` with the buffered writer for many weeks.
Refs
- epic https://github.com/neondatabase/neon/issues/9868
- alternative to https://github.com/neondatabase/neon/pull/11544
* Replace yanked papaya version
* Remove unused allowed license: OpenSSL
* Remove Zlib license from general allow list since it's listed in the
exceptions section per crate
* Drop clarification for ring since they have separate LICENSE files now
* List the tower-otel repo as allowed source while we sort out the OTel
deps
Switches the tenant snapshot subcommand of the storage scrubber to
`remote_storage`. As this is the last piece of the storage scrubber
still using the S3 SDK, this finishes the project started in #7547.
This allows us to do tenant snapshots on Azure as well.
Builds on #11671.
Fixes #8830.
Adds a versioning API to remote_storage. We want to use it in the
scrubber, both for tenant snapshots and for metadata checks.
For #8830
and for #11588
## Problem
Pageservers notify control plane directly when a shard import has
completed.
Control plane has to download the status of each shard from S3 and
figure out if everything is truly done,
before proceeding with branch activation.
Issues with this approach are:
* We can't control shard split behaviour on the storage controller side.
It's unsafe to split
during import.
* Control plane needs to know about shards and implement logic to check
all timelines are indeed ready.
## Summary of changes
In short, storage controller coordinates imports, and, only when
everything is done, notifies control plane.
Big rocks:
1. Store timeline imports in the storage controller database. Each
import stores the status of its shards in the database.
We hook into the timeline creation call as our entry point for this.
2. Pageservers get a new upcall endpoint to notify the storage
controller of shard import updates.
3. Storage controller handles these updates by updating persisted state.
If an update finalizes the import, we poll pageservers until timeline
activation and then notify the control plane that the import is
complete.
Cplane side change with new endpoint is in
https://github.com/neondatabase/cloud/pull/26166
Closes https://github.com/neondatabase/neon/issues/11566
Update the sentry crate to 0.37. This deduplicates the `webpki-roots`
crate in our crate graph, and brings another dependency onto newer
rustls `0.23.18`.
# Add --dev CLI flag to pageserver and safekeeper binaries
This PR adds the `--dev` CLI flag to both the pageserver and safekeeper
binaries without implementing any functionality yet. This is a precursor
to PR #11517, which will implement the full functionality to require
authentication by default unless the `--dev` flag is specified.
## Changes
- Add `dev_mode` config field to pageserver binary
- Add `--dev` CLI flag to safekeeper binary
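A minimal sketch of what the flag looks like, assuming the clap derive API (names are illustrative):
```rust
use clap::Parser;

#[derive(Parser)]
struct Args {
    /// Run in development mode: authentication will not be required by default.
    /// The flag is accepted but not acted upon yet.
    #[arg(long)]
    dev: bool,
}

fn main() {
    let args = Args::parse();
    println!("dev mode: {}", args.dev);
}
```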
This PR is needed for forward compatibility tests to work properly, when
we try to merge #11517
Link to Devin run:
https://app.devin.ai/sessions/ad8231b4e2be430398072b6fc4e85d46
Requested by: John Spray (john@neon.tech)
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: John Spray <john@neon.tech>
# Fix KeyError in physical replication benchmark test
This PR fixes the failing physical replication benchmark test that was
encountering a KeyError: 'endpoints'.
The issue was in accessing `project["project"]["endpoints"][0]["id"]`
when it should be `project["endpoints"][0]["id"]`, consistent with how
endpoints are accessed elsewhere in the codebase.
Fixed the issue in both test functions:
- test_ro_replica_lag
- test_replication_start_stop
Link to Devin run:
https://app.devin.ai/sessions/be3fe9a9ee5942e4b12e74a7055f541b
Requested by: Peter Bendel
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: peterbendel@neon.tech <peterbendel@neon.tech>
ARM computes are incoming and we need to account for that in remote
extensions. Previously, we just blindly assumed that all computes were
x86_64.
Note that we use the Go architecture naming convention instead of the
Rust one, to do our best to stay consistent across the stack.
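A minimal sketch of such a mapping (an assumption, not necessarily the actual compute_ctl code):
```rust
// Map the Rust target architecture onto the Go naming convention used for
// remote extension artifacts.
fn go_arch() -> &'static str {
    match std::env::consts::ARCH {
        "x86_64" => "amd64",
        "aarch64" => "arm64",
        other => other, // fall through for anything else
    }
}
```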
Part-of: https://github.com/neondatabase/cloud/issues/23148
Signed-off-by: Tristan Partin <tristan@neon.tech>
In tests, and when one safekeeper is down in small regions, we need to
make do with only one or two safekeepers. Before, we returned an error
from `safekeepers_for_new_timeline`. Now we just silently allow the
timeline to be created on one or two safekeepers.
Part of #9011
## Problem
test_storage_controller_heartbeats is flaky because of unallowed
reconciler errors (#11625)
## Summary of changes
Allow reconcile errors as in other tests in test_storage_controller.py.
## Problem
Init fork is used in DEBUG_COMPARE_LOCAL to determine unlogged relation
or unlogged build.
But it is created only after the relation is initialized and so can be
swapped out, producing `Page is evicted with zero LSN` error.
## Summary of changes
Create init fork together with main fork for unlogged relations in
DEBUG_COMPARE_LOCAL mode.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Safekeeper doesn't use TLS in wal service
- Closes: https://github.com/neondatabase/cloud/issues/27302
## Summary of changes
- Add `enable_tls_wal_service_api` option to safekeeper's cmd arguments
- Propagate `tls_server_config` to `wal_service` if the option is
enabled
- Create `BACKGROUND_RUNTIME` for small background tasks and offload SSL
certificate reloader to it.
No integration tests for now because support from compute side is
required: https://github.com/neondatabase/cloud/issues/25823
## Problem
The pg_repack test can be flaky due to unpredictable `NOTICE` messages
about waiting for some processes.
E.g.,
```
INFO: repacking table "public.issue3_2"
+NOTICE: Waiting for 1 transactions to finish. First PID: 427
```
## Summary of changes
Set `client_min_messages` to `warning` for the regression tests.
## Problem
We run benchmarks in batches (five parallel jobs on different runners).
If any test in a batch fails, we won’t upload any results for that
batch, even for the tests that passed.
## Summary of changes
- Move the results upload to a separate step in the run-python-test-set
action, and execute this step even if tests fail.
## Problem
If all batched requests are excluded from the query by
`Timeline::get_rel_page_at_lsn_batched` (e.g. because they are past the
end of the relation), the read path would panic since it doesn't expect
empty queries. This is a change in behaviour that was introduced with
the scattered query implementation.
## Summary of Changes
Handle empty queries explicitly.
This makes it easier to add a different client implementation alongside
the current one. I started working on a new gRPC-based protocol to
replace the libpq protocol, which will introduce a new function like
`client_libpq`, but for the new protocol.
It's a little more readable with less indentation anyway.
## Problem
https://github.com/neondatabase/neon/actions/runs/14538136318/job/40790985693?pr=11645
failed, even though the relevant parts of the CI had passed and
auto-merge determined the PR is ready to merge. After that, commenting
failed.
## Summary of changes
- set GH_TOKEN for commenting after fast-forward failure
- allow merging with mergeable_state unstable
## Problem
Pageservers and safekeepers do not pass CA certificates to the broker
client, so the client does not trust locally issued certificates.
- Part of https://github.com/neondatabase/cloud/issues/27492
## Summary of changes
- Change `ssl_ca_certs` type in PS/SK's config to `Pem` which may be
converted to both `reqwest` and `tonic` certificates.
- Pass CA certificates to storage broker client in PS and SK
## Problem
follow-up on https://github.com/neondatabase/neon/pull/11601
## Summary of changes
- serialize the start/end time using rfc3339 time string
- compute the size ratio of the compaction
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
This delivers some additional fixes and improvements to storcon managed
safekeeper timelines:
* use `i32::MAX` for the generation number of timeline deletion
* start the generation for new timelines at 1 instead of 0: this ensures
that the other components actually are generation enabled
* fix database operations we use for metrics
* use join in list_pending_ops to prevent the classical ORM issue where
one does many db queries
* use enums in `test_storcon_create_delete_sk_down`. we are adding a
second parameter, and having two bool parameters is weird.
* extend `test_storcon_create_delete_sk_down` with a test of whole
tenant deletion. this hasn't been tested before.
* remove some redundant logging contexts
* Don't require mutable access to the service lock for scheduling
pending ops in memory. To pull this off, create reconcilers eagerly;
that way we no longer need mutable access to the service lock.
Part of #9011
---------
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
## Problem
We need to test the stability of Neon.
## Summary of changes
The test runs random operations on a Neon project. Via Public API
calls, it performs the following operations: `create a branch`, `delete a
branch`, `add a read-only endpoint`, `delete a read-only endpoint`,
`restore a branch to a random position in the past`. All the branches
and endpoints are loaded with `pgbench`.
---------
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
We saw OOMs due to L0 compaction happening simultaneously for all shards
of the same tenant right after the shard split.
## Summary of changes
Lower the threshold so that we compact fewer files.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
https://github.com/neondatabase/neon/pull/11531 did not fully fix the
problem because the warning comes from the storcon instead of the
pageserver.
## Summary of changes
Allow stale generation error in storcon.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`test_compute_startup_simple` and `test_compute_ondemand_slru_startup`
are failing.
This test implicitly asserts that the metrics.json endpoint succeeds and
returns all expected metrics, but doesn't make it easy to see what went
wrong if it doesn't (e.g. in this failure
https://neon-github-public-dev.s3.amazonaws.com/reports/main/14513210240/index.html#suites/13d8e764c394daadbad415a08454c04e/b0f92a86b2ed309f/)
In this case, it was failing because of a missing auth token, because it
was using `requests` directly instead of using the endpoint http client
type.
## Summary of changes
- Use endpoint http wrapper to get raise_for_status & auth token
## Problem
`Tenant` isn't really a whole tenant: it's just one shard of a tenant.
## Summary of changes
- Automated rename of Tenant to TenantShard
- Followup commit to change references in comments
## Problem
There are mentions of `ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` and
`ALLOW_FORWARD_COMPATIBILITY_BREAKAGE`, but in reality, this mechanism
doesn't work, so let's remove it to avoid confusion.
The idea behind it was to allow some breaking changes by adding a
special label to a PR that would `xfail` the test. However, in practice,
this means we would need to carry this label through all subsequent PRs
until the release (and artifact regeneration). This approach isn't
really viable, as it increases the risk of missing a compatibility break
in another PR.
## Summary of changes
- Remove mentions and handling of
`ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` /
`ALLOW_FORWARD_COMPATIBILITY_BREAKAGE`
## Problem
`pg-clients` can't start:
```
The workflow is not valid. .github/workflows/pg-clients.yml (Line: 44, Col: 3): Error calling workflow 'neondatabase/neon/.github/workflows/build-build-tools-image.yml@aa19f10e7e958fbe0e0641f2e8c5952ce3be44b3'. The nested job 'check-image' is requesting 'packages: read', but is only allowed 'packages: none'. .github/workflows/pg-clients.yml (Line: 44, Col: 3): Error calling workflow 'neondatabase/neon/.github/workflows/build-build-tools-image.yml@aa19f10e7e958fbe0e0641f2e8c5952ce3be44b3'. The nested job 'build-image' is requesting 'packages: write', but is only allowed 'packages: none'.
```
## Summary of changes
- Grant required `packages: write` permissions to the workflow
## Problem
The LFC working set approximation test became flaky after recent changes
in prefetch.
Maybe it is caused by updating the HLL in `lfc_write`, maybe by some
other reason.
## Summary of changes
1. Disable autovacuum in this test (as a possible source of extra page
accesses).
2. Increase the upper boundary for the WS approximation from 12 to 20.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
This is mostly a documentation update, but there are also a few updates
to neon_local, the pageserver, and tests.
17 is our default for users in production, so dropping references to 16
makes sense.
Signed-off-by: Tristan Partin <tristan@neon.tech>
These various hacks were needed for the forward compatibility tests.
Enough time has passed since the merge that these are no longer needed.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
The proxy denies using `unwrap()`s in regular code, but we want to use
it in test code
and so have to allow it for each test block.
## Summary of changes
Set `allow-unwrap-in-tests = true` in clippy.toml and remove all
exceptions.
Testodrome measures uptime based on failed requests and errors. For
Testodrome requests, we now send back an error based on the service.
This will help us distinguish error types in Testodrome and rely on the
uptime SLI.
## Problem
We currently only have gc-compaction statistics for each single
sub-compaction job.
## Summary of changes
Add meta statistics across all sub-compaction jobs scheduled.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
During shard ancestor compaction, we currently recompress all page
images as we move them into a new layer file. This is expensive and
unnecessary.
Resolves #11562.
Requires #11607.
## Summary of changes
Pass through compressed page images in `ImageLayerInner::filter()`.
1. Compute may generate WAL on shutdown. The test assumes that no
further ingest happens after shutdown. Tweak the compute shutdown to
make that assumption true.
2. The assertion on the local layer count after cold migration is not
right, since we may have downloaded layers due to ingest. Remove it.
Closes https://github.com/neondatabase/neon/issues/11587
## Problem
To avoid recompressing page images during layer filtering, we need
access to the raw header and data from vectored reads such that we can
pass them through to the target layer.
Touches #11562.
## Summary of changes
Adds `VectoredBlob::raw_with_header()` to return a raw view of the
header+data, and updates `read()` to track it.
Also adds `blob_io::Header` with header metadata and decode logic, to
reuse for tests and assertions. This isn't yet widely used.
## Problem
Splits of large tenants (several TB) can cause a huge amount of shard
ancestor compaction work, which can overload Pageservers.
Touches https://github.com/neondatabase/cloud/issues/22532.
## Summary of changes
Add a setting `compaction_shard_ancestor` (default `true`) to disable
shard ancestor compaction on a per-tenant basis.
Finally figured out the right incantation. I had had this in my original
go, but due to some refactoring and apparently missed testing, I
committed a mistake. The reason this doesn't currently break anything is
that we bypass the authorization middleware when the "testing" cargo
feature is enabled.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Metrics are saved in https://github.com/neondatabase/neon/pull/11559,
but the file is not matched by the attachment regex.
## Summary of changes
Make attachment regex match the metrics file.
## Problem
close https://github.com/neondatabase/neon/issues/11486, continuing
https://github.com/neondatabase/neon/pull/11531
## Summary of changes
This patch fixes the remaining 50% of the instability of
`test_create_churn_during_restart`. During tenant warmup, we'll request
logical size; however, if the startup gets cancelled, we won't be able
to spawn the initial logical size calculation task that sets the
`cancel_wait_for_background_loop_concurrency_limit_semaphore`.
Therefore, we check `cancelled` before proceeding to get
`cancel_wait_for_background_loop_concurrency_limit_semaphore`. There
will still be a race if the timeline shutdown happens after L5710 and
before L5711, but it should be enough to reduce the flakiness of the
test.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We can't have more than one open release PR created on the same day (due
to non-unique enough branch names).
## Summary of changes
- Add time (hours and minutes) to RC PR branch names
- Also make sure we use UTC for releases
## Problem
When doing power-of-two shard splits (i.e. 4 → 8 → 16), we end up
rewriting all layers since half of the pages will be local due to
striping. This causes a lot of resource usage when splitting large
tenants.
## Summary of changes
Drop the threshold of local/total pages to 30%, to reduce the amount of
layer rewrites after splits.
## Problem
Shard ancestor compaction can be very expensive following shard splits
of large tenants. We currently rewrite garbage layers after shard splits
as well, which can be a significant amount of data.
Touches https://github.com/neondatabase/cloud/issues/22532.
## Summary of changes
Don't rewrite invisible layers after shard splits.
## Problem
https://github.com/neondatabase/neon/pull/11515 introduced a bug that
some key history cannot be verified.
If a key only exists above the horizon, the verification will fail for
its first occurrence because the history does not exist at that point.
As gc-compaction skips a key range whenever an error occurs, it might be
doing some wasted work in staging/prod now. But I'm not planning a
hotfix this week as the bug doesn't affect correctness/performance.
## Summary of changes
Allow keys whose history exists only above the horizon in the
verification.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
There is too much delay between merging a PR into `main` and deploying
the changes to staging
## Summary of changes
- Trigger `deploy` job without waiting for `build-and-test-locally` job
## Problem
The information in the README.md contained errors, and some information
was missing.
## Summary of changes
The errors found have been fixed, and new information has been added.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
The `/__w/neon/neon` directory is mounted from host to container and
persists between runs.
Sometimes the next workflow run fails to delete it:
```
Deleting the contents of '/__w/neon/neon'
Error: File was unable to be removed Error: EACCES: permission denied, rmdir '/__w/neon/neon/allure-2.32.2/bin'
```
## Summary of changes
- Download and install allure into `/tmp`, which exists only inside the
container
Ref https://github.com/neondatabase/cloud/issues/27186
## Problem
Benchmark results are inconsistent on the existing small-metal runners
## Summary of changes
Introduce new `unit-perf` runners and run benchmarks on them.
The new hardware has a slower but consistent CPU frequency when run with
the default governor, schedutil.
Thus we needed to adjust some testcases' timeouts and add some retry
steps where hard-coded timeouts couldn't be increased without changing
the system under test.
-
[wait_for_last_record_lsn](6592d69a67/test_runner/fixtures/pageserver/utils.py (L193))
1000s -> 2000s
-
[test_branch_creation_many](https://github.com/neondatabase/neon/pull/11409/files#diff-2ebfe76f89004d563c7e53e3ca82462e1d85e92e6d5588e8e8f598bbe119e927)
1000s
-
[test_ingest_insert_bulk](https://github.com/neondatabase/neon/pull/11409/files#diff-e90e685be4a87053bc264a68740969e6a8872c8897b8b748d0e8c5f683a68d9f)
- with back throttling disabled compute becomes unresponsive for more
than 60 seconds (PG hard-coded client authentication connection timeout)
-
[test_sharded_ingest](https://github.com/neondatabase/neon/pull/11409/files#diff-e8d870165bd44acb9a6d8350f8640b301c1385a4108430b8d6d659b697e4a3f1)
600s -> 1200s
Right now there are only 2 runners of that class. If we decide to go
with them, we have to check how many runners of that type we need, so
that jobs don't get stuck waiting for one to become available.
However, we have now decided to run those runners with the performance
governor instead of schedutil.
This achieves almost the same performance as the previous runners while
still producing consistent results for the same commit.
Related issue to activate the performance governor on these runners:
https://github.com/neondatabase/runner/pull/138
## Verification that it helps
### analyze runtimes on new runner for same commit
Table of runtimes for the same commit on different runners in
[run](https://github.com/neondatabase/neon/actions/runs/14417589789)
| Run | Benchmarks (1) | Benchmarks (2) | Benchmarks (3) | Benchmarks (4) | Benchmarks (5) |
|--------|--------|---------|---------|---------|---------|
| 1 | 1950.37s | 6374.55s | 3646.15s | 4149.48s | 2330.22s |
| 2 | - | 6369.27s | 3666.65s | 4162.42s | 2329.23s |
| Delta % | - | 0,07 % | 0,5 % | 0,3 % | 0,04 % |
| with governor performance | 1519.57s | 4131.62s | - | - | - |
| second run gov. perf. | 1513.62s | 4134.67s | - | - | - |
| Delta % | 0,3 % | 0,07 % | - | - | - |
| speedup gov. performance | 22 % | 35 % | - | - | - |
| current desktop class hetzner runners (main) | 1487.10s | 3699.67s | - | - | - |
| slower than desktop class | 2 % | 12 % | - | - | - |
In summary, the runtimes for the same commit on this hardware vary by
less than 1 %.
---------
Co-authored-by: BodoBolero <peterbendel@neon.tech>
## Problem
Shard ancestor compaction doesn't currently log any global progress
information, only for the current batch.
## Summary of changes
Log the number of layers checked for eligibility this iteration, and the
total number of layers to check. This will indicate how far along the
total shard ancestor compaction has gotten for this iteration.
## Problem
Every time we make changes to the read path to fix a bug or add a
feature,
we end up adding another incomprehensible test.
## Summary of changes
Add some generic infrastructure for generating a layer map from a type
spec
and use that for a read path test. The test is randomized but uses a
fixed seed
by default. A fuzzing mode is available for confidence building.
See [Notion
page](https://www.notion.so/neondatabase/Read-Path-Unit-Testing-Fuzzing-1d1f189e0047806c8e5cd37781b0a350?pvs=4)
for a diagram of the layer map
used.
Just for fun I tried removing [this
commit](9990199cb4)
from https://github.com/neondatabase/neon/pull/11494
and it caught the bug in the normal mode (no fuzzing required).
## Problem
Sometimes it's useful to see the pageserver metrics after a test in
order to debug stuff.
For example, for https://github.com/neondatabase/neon/issues/11465 I'd
like to know
what the remote storage latencies are from the client.
## Summary of changes
When stopping the env, record the pageserver metrics into a file in the
pageserver's workdir.
## Problem
Part of https://github.com/neondatabase/neon/issues/11486
## Summary of changes
50% of the test instability of `test_create_churn_during_restart` is
due to an error message that changed. Allow the new error message.
We still need to fix other errors, caused by failure to acquire a
semaphore, in this or the next patch.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Pageservers now ignore unknown config fields, so this config tweaking is
no longer needed.
## Summary of changes
Get rid of the hack.
Closes https://github.com/neondatabase/neon/issues/11524
## Problem
https://github.com/neondatabase/neon/pull/11494 changes the batching
logic, but we don't have a way to evaluate it.
## Summary of changes
This PR introduces a global and a per-timeline metric that track the
reason for which a batch was broken.
In #10063 we will switch BlobWriter to use the owned buffers IO buffered
writer, which implements double-buffering by virtue of a background task
that performs the flushing.
That task's lifecycle must be contained within the Timeline lifecycle,
so it must hold the timeline gate open and respect Timeline::cancel.
This PR does the noisy plumbing to reduce the #10063 diff.
Refs
- extracted from https://github.com/neondatabase/neon/pull/10063
- epic https://github.com/neondatabase/neon/issues/9868
We have already migrated the storage controller to
`--control-plane-url`, added in #11173. The new param was added to
also support safekeeper-specific endpoints. See the docs changes in
#11195 for further details.
Part of #11163
## Problem
clang produces a warning about the unused variable `n_synced` in
HandleSafekeeperResponse.
## Summary of changes
Remove local variable.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Get page batching stops when we encounter requests at different LSNs.
We are leaving batching factor on the table.
## Summary of changes
The goal is to support keys with different LSNs in a single batch and
still serve them with a single vectored get.
Important restriction: the same key at different LSNs is not supported
in one batch. Returning different key
versions is a much more intrusive change.
Firstly, the read path is changed to support "scattered" queries. This
is a conceptually simple step from
https://github.com/neondatabase/neon/pull/11463. Instead of initializing
the fringe for one keyspace, we do it for multiple keyspaces at
different LSNs and let the logic already present in the fringe handle
selection.
Secondly, the page service code is updated to support batching at
different LSNs. Each request parsed from the wire determines its
effective request LSN and keeps it in memory for the batcher to inspect.
The batcher allows keys at different LSNs in one batch as long as the
same key is not requested at different LSNs (see the sketch below).
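A minimal sketch of the batching rule (hypothetical types):
```rust
use std::collections::HashMap;

// Requests at different LSNs may share a batch, but the same key must not be
// requested at two different LSNs within one batch.
fn can_add_to_batch(
    batch: &HashMap<u64 /* key */, u64 /* effective request LSN */>,
    key: u64,
    lsn: u64,
) -> bool {
    match batch.get(&key) {
        None => true,                               // new key: always fine
        Some(&existing_lsn) => existing_lsn == lsn, // same key only at the same LSN
    }
}
```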
I'd suggest doing the first pass commit by commit to get a feel for the
changes.
## Results
I used the batching test from [Christian's
PR](https://github.com/neondatabase/neon/pull/11391), which increases
the chance of batch breaks. Looking at the logs, I think the new code is at
the max batching factor for the workload (we
only break batches due to them being oversized or because the executor
is idle).
```
Main:
Reasons for stopping batching: {'LSN changed': 22843, 'of batch size': 33417}
test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 14.6662
My branch:
Reasons for stopping batching: {'of batch size': 37024}
test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 19.8333
```
Related: https://github.com/neondatabase/neon/issues/10765
## Problem
See https://github.com/neondatabase/neon/issues/10652
The Neon extension launches 2 BGWs, which reduces the limit for parallel
workers and so affects the parallel_deadlock isolation test.
## Summary of changes
Increase `max_worker_processes` from default 8 to 16 for isolation test.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Part of #9114
There was a debug-mode verification that verifies at every
retain_lsn. However, the code was tangled with the actual history
generation itself, and it was hard to reason about correctness. This
patch adds a separate post-verification of the gc-compaction result that
redoes logs at every retain_lsn and every record above the GC horizon.
This ensures that all key history we produce with gc-compaction is
readable, and if there are read errors after gc-compaction, they can
only be read-path errors instead of gc-compaction bugs.
## Summary of changes
* Add gc_compaction_verification flag, default to true.
* Implement a post-verification process.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Shard ancestor compaction does not yield for L0 compaction, potentially
starving it.
close https://github.com/neondatabase/neon/issues/11125
## Summary of changes
* Yield for L0 during shard ancestor compaction.
* Return `CompactionOutcome::Pending` when limited by `rewrite_max`, for
eager rescheduling.
Previously, the structure of the spec file was just the compute spec.
However, the response from the control plane's get-spec request included
the compute spec and the compute_ctl config. This divergence was
hindering other work such as adding regression tests for compute_ctl
HTTP authorization.
Signed-off-by: Tristan Partin <tristan@neon.tech>
Introduces a `WalIngestError` struct together with a
`WalIngestErrorKind` enum, to be used for walingest related failures and
errors.
* the enum captures backtraces, so we don't regress in comparison to
`anyhow::Error`s (backtraces might be a bit shorter if we use one of the
`anyhow::Error` wrappers)
* it explicitly lists most/all of the potential cases that can occur.
I've originally been inspired to do this in #11496, but it's a
longer-term TODO.
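A minimal sketch of the pattern, assuming the thiserror crate (variant names are illustrative):
```rust
// An error-kind enum listing the known failure cases, wrapped in a struct
// that captures a backtrace at construction time so we don't regress
// relative to anyhow::Error.
#[derive(Debug, thiserror::Error)]
enum WalIngestErrorKind {
    #[error("invalid WAL record: {0}")]
    InvalidRecord(String),
    #[error(transparent)]
    Other(#[from] anyhow::Error),
}

#[derive(Debug)]
struct WalIngestError {
    kind: WalIngestErrorKind,
    backtrace: std::backtrace::Backtrace,
}

impl From<WalIngestErrorKind> for WalIngestError {
    fn from(kind: WalIngestErrorKind) -> Self {
        Self {
            kind,
            backtrace: std::backtrace::Backtrace::capture(),
        }
    }
}
```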
## Problem
Shard ancestor compaction always logs "starting shard ancestor
compaction", even if there is no work to do. This is very spammy (every
20 seconds for every shard). It also has limited progress logging.
## Summary of changes
* Only log "starting shard ancestor compaction" when there's work to do.
* Include details about the amount of work.
* Log progress messages for each layer, and when waiting for uploads.
* Log when compaction is completed, with elapsed duration and whether
there is more work for a later iteration.
## Problem
The `pagebench` benchmarks set up an initial dataset by creating a
template tenant, copying the remote storage to a bunch of new tenants,
and attaching them to Pageservers.
In #11420, we found that
`test_pageserver_characterize_throughput_with_n_tenants` had degraded
performance because it set a custom tenant config in Pageservers that
was then replaced with the default tenant config by the storage
controller.
The initial fix was to register the tenants directly in the storage
controller, but this created the tenants with generation 1. This broke
`test_basebackup_with_high_slru_count`, where the template tenant was at
generation 2, leading to all layer files at generation 2 being ignored.
Resolves #11485.
Touches #11381.
## Summary of changes
This patch addresses both test issues by modifying `attach_hook` to also
take a custom tenant config. This allows attaching tenants to
Pageservers from pre-existing remote storage, specifying both the
generation and tenant config when registering them in the storage
controller.
The batching perf test workload is currently read-only sequential scans.
However, realistic workloads have concurrent writes (to other pages)
going on.
This PR simulates concurrent writes to other pages by emitting logical
replication messages.
These degrade the achieved batching factor; for the reason, see
- https://github.com/neondatabase/neon/issues/10765
PR
- https://github.com/neondatabase/neon/pull/11494
will fix this problem and bring the batching factor back up.
---------
Co-authored-by: Vlad Lazar <vlad@neon.tech>
# Refs
- fixes https://github.com/neondatabase/neon/issues/11395
# Problem
Since 2025-03-10, we have observed increased flakiness of
`test_pageserver_getpage_throttle`.
The test is timing-dependent by nature, and was hitting the
```
assert duration_secs >= 10 * actual_smgr_query_seconds, (
"smgr metrics should not include throttle wait time"
)
```
quite frequently.
# Analysis
These failures are not reproducible.
In this PR's history is a commit that reran the test 100 times without
requiring a single retry.
In https://github.com/neondatabase/neon/issues/11395 there is a link to
a query to the test results database.
It shows that the flakiness was not constant, but rather episodic:
2025-03-{10,11,12,13} 2025-03-{19,20,21} 2025-03-31 and 2025-04-01.
To me, this suggests variability in available CPU.
# Solution
The point of the offending assertion is to ensure that most of the
request latency is spent on throttling, because testing of the
throttling mechanism is the point of the test.
The `10` magic number means at most 10% of mean latency may be spent on
request processing.
Ideally we would control the passage of time (virtual clock source) to
make this test deterministic.
But I don't see that happening in our regression test setup.
So, this PR de-flakes the test as follows:
- allot up to 66% of mean latency for request processing
- increase duration from 10s to 20s, hoping to get better protection
from momentary CPU spikes in noisy neighbor tests or VMs on the runner
host
As a drive-by, switch to `pytest.approx` and remove one self-test
assertion I can't make sense of anymore.
## Problem
We need to export some metrics about certs/connections to configure
alerts and make sure that all HTTP requests are gone before turning
https-only mode on.
- Closes: https://github.com/neondatabase/cloud/issues/25526
## Summary of changes
- Add started connection and connection error metrics to http/https
Server.
- Add certificate expiration time and reload metrics to
ReloadingCertificateResolver.
Because it wasn't recursive, there was a limit to the depth of updates.
This work is necessary because as we teach neon_local and compute_ctl
that the content in --spec-path should match a similar structure we get
from the control plane, the spec object itself will no longer be
toplevel. It will be under the "spec" key.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
This test is slow to execute, particularly if you're on a slow
environment like vscode in a browser. Might have got much slower when we
switched to direct IO?
## Summary of changes
- Reduce the scale of the test by 10x, since there was nothing special
about the original size.
## Problem
The graceful leadership transfer process involves calling step_down on
the old controller, but this was not waiting for shard splits to
complete, and the new controller could therefore end up trying to abort
a shard split while it was still going on.
We mitigated this already in #11256 by avoiding the case where shard
split completion would update the database incorrectly, but this was a
fragile fix because it assumes that is the only problematic part of the
split running concurrently.
Precursors:
- #11290
- #11256
Closes: #11254
## Summary of changes
- Hold the reconciler gate from shard splits, so that step_down will
wait for them. Splits should always be fairly prompt, so it is okay to
wait here.
- Defense in depth: if step_down times out (hardcoded 10 second limit),
then fully terminate the controller process rather than letting it
continue running, potentially doing split-brainy things. This makes
sense because the new controller will always declare itself leader
unilaterally if step_down fails, so leaving an old controller running is
not beneficial.
- Tests: extend
`test_storage_controller_leadership_transfer_during_split` to separately
exercise the case of a split holding up step_down, and the case where
the overall timeout on step_down is hit and the controller terminates.
## Problem
Part of https://github.com/neondatabase/neon/issues/10395
## Summary of changes
Add a test case to ensure gc-compaction doesn't fire any critical errors
if the key history is invalid due to partial GC.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Pass more neon ids to compute_ctl.
Expose them to postgres as neon extension GUCs:
neon.project_id, neon.branch_id, neon.endpoint_id.
This is the compute side PR, not yet supported by cplane.
## Problem
`test_location_conf_churn` performs random location updates on
Pageservers. While doing this, it could instruct the compute to connect
to a stale generation and execute queries. This is invalid, and will
fail if a newer generation has removed layer files used by the stale
generation.
Resolves #11348.
## Summary of changes
Only connect to the latest generation when executing queries.
## Problem
Walproposer should get elected and commit WAL on safekeepers specified
by the membership configuration.
## Summary of changes
- Add `members_safekeepers` and `new_members_safekeepers` arrays to the
walproposer, mapping configuration members to connection slots. Establish
this mapping (by node id) when a safekeeper sends its greeting, giving its
id, and when the mconf becomes known / changes.
- Add membership-aware logic to TermsCollected, VotesCollected, and
GetAcknowledgedByQuorumWALPosition. Currently it partially duplicates the
existing logic, but we'll drop the latter eventually.
- In python, rename Configuration to MembershipConfiguration for
clarity.
- Add test_quorum_sanity testing new logic.
ref https://github.com/neondatabase/neon/issues/10851
## Problem
Page service doesn't use TLS for incoming requests.
- Closes: https://github.com/neondatabase/cloud/issues/27236
## Summary of changes
- Add option `enable_tls_page_service_api` to pageserver config
- Propagate `tls_server_config` to `page_service` if the option is
enabled
No integration tests for now, because I didn't find a way to call the
page service API from Python, and AFAIK computes don't support TLS yet.
It isn't used by the production control plane or neon_local. The removal
simplifies compute spec logic just a little bit more since we can remove
any notion of whether we should allow live reconfigurations.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Part of #9114
## Summary of changes
The gc-compaction flag was not correctly set, causing it to not get
preempted by L0 compaction.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`test_location_conf_churn` often fails with `neither image nor delta
layer`, but doesn't say what the file actually is. However, past local
failures have indicated that it might be `.___temp` files.
Touches https://github.com/neondatabase/neon/issues/11348.
## Summary of changes
Ignore `.___temp` files when evicting local layers, and include the file
name in the error message.
## Problem
In some cases gc-compaction doesn't respond to the L0 compaction yield
notifier. I suspect it's stuck on getting the first item, and if so, we
probably need to let L0 yield notifier preempt `next_with_trace`.
## Summary of changes
- Add `time_to_first_kv_pair` to gc-compaction statistics.
- Invert the ratio so that a smaller ratio means a better compaction ratio.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Not a complete fix for https://github.com/neondatabase/neon/issues/11492
but should work for a short term.
Our current retry strategy for walredo is to retry every request exactly
once. This doesn't make sense, because each error is expected to cause a
process restart and cause future requests to fail. I'll explain it with
a scenario of two threads requesting redos: one with an invalid history
(that will cause walredo to panic) and another that has a correct redo
sequence.
First, let's look at how we handle retries right now in
do_with_walredo_process. At the beginning of the function it will spawn
a new process if there's no existing one. Then it will continue to redo.
If the process fails, the first request that encounters the error will
remove the walredo process object from the OnceCell, so that the next
time it gets accessed, a new process will be spawned; if it is the last
one that uses the old walredo process, it will kill and wait for the
process in `drop(proc)`. I'm skeptical whether this works under races,
but I think this is not the root cause of the problem. In this retry
handler, if there are N requests attached to a walredo process and the
i-th request fails (panics the walredo), all other N-i requests will fail
and need to retry so that they can access a new walredo process.
```
time ---->
proc A None B
request 1 ^-----------------^ fail
uses A for redo replace with None
request 2 ^-------------------- fail
uses A for redo
request 3 ^----------------^ fail
uses A for redo last ref, wait for A to be killed
request 4 ^---------------
None, spawn new process B
```
The problem is with our retry strategy. Normally, for a system that we
want to retry against, the error probabilities of individual requests are
uncorrelated. However, in walredo, a prior request that panics the
walredo process will cause all future walredo calls on that process to
fail (the errors are correlated).
So, back to the situation where we have 2 requests where one will
definitely fail and the other will succeed and we get the following
sequence, where retry attempts = 1,
* new walredo process A starts.
* request 1 (invalid) being processed on A and panics A, waiting for
retry, remove process A from the process object.
* request 2 (valid) being processed on A and receives pipe broken /
poisoned process error, waiting for retry, wait for A to be killed --
this very likely takes a while and cannot finish before request 1 gets
processed again
* new walredo process B starts.
* request 1 (invalid) being processed again on B and panics B, the whole
request fail.
* request 2 (valid) being processed again on B, and get a poisoned error
again.
```
time ---->
proc A None B None
request 1 ^-----------------^--------------^--------------------^
spawn A for redo fail spawn B for redo fail
request 2 ^--------------------^-------------------------^------------^
use A for redo fail, wait to kill A B for redo fail again
```
In such cases, no matter how we set n_attempts, as long as the retry
count applies to all requests, this sequence is bound to fail both
requests because of how they get sequenced, even though we could
potentially make request 2 succeed.
There are many solutions to this -- like having a separate walredo
manager for compactions, or defining which errors are retryable (e.g., a
broken pipe can be retried, while a real walredo error won't be), or
having an exclusive big lock over the whole redo process (the current
one is very fine-grained). In this patch, we go with a simple approach:
use different retry attempts for different types of requests.
For gc-compaction, the attempt count is set to 0, so that it never
retries and consequently stops the compaction process -- no more redo
will be issued from gc-compaction. Once the walredo process gets
restarted, the normal read requests will proceed normally.
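A minimal sketch of that idea, with illustrative names (the real code threads the attempt budget through each reconstruct-value request):
```rust
use std::future::Future;

/// Retry a redo closure up to `max_retries` extra attempts. Interactive reads
/// keep a retry (a fresh walredo process is spawned in between), while
/// gc-compaction passes 0 so a poisoned process fails it immediately.
async fn redo_with_retries<T, E, F, Fut>(max_retries: u32, mut redo: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut attempt = 0;
    loop {
        match redo().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => attempt += 1, // the next call goes to a freshly spawned process
        }
    }
}
```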
## Summary of changes
Add `redo_attempt` to each reconstruct-value request so that different
retry policies can be set.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Erik Grinaker <erik@neon.tech>
## Problem
Our large OLTP benchmark runs very long; we want to remove the duration
of the reindex step.
We don't run a concurrent workload anyway, but added "concurrently" only to
have a "prod-like" approach. Since it just doubles the time we report,
because it requires two full table scans instead of one, we can remove it.
## Summary of changes
Remove the keyword "concurrently" from the reindex step.
Instead of encoding a certain structure for claims, let's allow the
caller to specify which claims get encoded.
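Roughly, assuming a `jsonwebtoken`-style API (not necessarily the exact crate or names used here):
```rust
use jsonwebtoken::{encode, EncodingKey, Header};
use serde::Serialize;

/// Sign whatever claims the caller hands us, instead of a hard-coded claims struct.
fn sign_token<C: Serialize>(claims: &C, key: &EncodingKey) -> jsonwebtoken::errors::Result<String> {
    encode(&Header::default(), claims, key)
}
```
The caller can then pass its own claims struct or a `serde_json::json!` value.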
Signed-off-by: Tristan Partin <tristan@neon.tech>
We'd like to run benchmarks starting from a steady state. To this end,
do a reconciliation round before proceeding with the benchmark.
This is useful for benchmarks that use tenant dir snapshots since a
non-standard tenant configuration is used to generate the snapshot. The
storage controller is not aware of the non default tenant configuration
and will reconcile while the bench is running.
I like to run nightly clippy every so often to make our future rust
upgrades easier. Some notable changes:
* Prefer `next_back()` over `last()`. Generic iterators will implement
`last()` to run forward through the iterator until the end.
* Prefer `io::Error::other()`.
* Use implicit returns
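For the first two bullets, a small illustration:
```rust
use std::io;

// `last()` walks a generic iterator all the way forward; for double-ended
// iterators, `next_back()` returns the same element without the walk.
fn newest_id(ids: &[u64]) -> Option<&u64> {
    ids.iter().next_back() // instead of ids.iter().last()
}

// `io::Error::other(..)` is the shorter spelling of
// `io::Error::new(io::ErrorKind::Other, ..)`.
fn to_io_error(err: impl Into<Box<dyn std::error::Error + Send + Sync>>) -> io::Error {
    io::Error::other(err)
}
```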
One case where I haven't dealt with the issues is the now
[more-sensitive "large enum variant"
lint](https://github.com/rust-lang/rust-clippy/pull/13833). I chose not
to take any decisions around it here, and simply marked them as allow
for now.
## Problem
We wish to improve pageserver batching such that one batch can contain
requests for
pages at different LSNs. The current shape of the code doesn't lend
itself to the change.
## Summary of changes
Refactor the read path such that the fringe gets initialized upfront.
This is where the multi LSN
change will plug in. A couple other small changes fell out of this.
There should be NO behaviour change here. If you smell one, shout!
I recommend reviewing commits individually (intentionally made them as
small as possible).
Related: https://github.com/neondatabase/neon/issues/10765
## Problem
Now `get_timestamp_of_lsn` returns `404 NotFound` if there are no clog
pages for the given LSN, and it's difficult to distinguish this from other
404 errors. A separate status code for this error will allow the control
plane to handle this case.
- Closes: https://github.com/neondatabase/neon/issues/11439
- Corresponding PR in control plane:
https://github.com/neondatabase/cloud/pull/27125
## Summary of changes
- Return `412 PreconditionFailed` instead of `404 NotFound` if no
timestamp is found for the given LSN (sketched below).
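A sketch of the mapping, with a hypothetical error type and the `http` crate's status constants:
```rust
use http::StatusCode;

// Hypothetical error type for the lookup; the point is the status mapping.
enum TimestampLookupError {
    TimelineNotFound,
    NoTimestampForLsn, // the LSN predates any clog page we still have
}

fn status_for(err: &TimestampLookupError) -> StatusCode {
    match err {
        TimestampLookupError::TimelineNotFound => StatusCode::NOT_FOUND,
        TimestampLookupError::NoTimestampForLsn => StatusCode::PRECONDITION_FAILED, // 412
    }
}
```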
I looked briefly through the current error handling code in cloud.git
and the status code change should not affect anything for the existing
code. The change from the corresponding PR also looks fine and should work
with the current PS status code. Additionally, here is the OK from the
control plane team to merge it:
https://github.com/neondatabase/neon/issues/11439#issuecomment-2789327552
---------
Co-authored-by: John Spray <john@neon.tech>
pagestore_smgr.c had grown pretty large. Split into two parts, such
that the smgr routines that PostgreSQL code calls stays in
pagestore_smgr.c, and all the prefetching logic and other lower-level
routines related to communicating with the pageserver are moved to a
new source file, "communicator.c".
There are plans to replace communicator parts with a new
implementation. See https://github.com/neondatabase/neon/pull/10799.
This commit doesn't implement any of the new things yet, but it is
good preparation for it. I'm imagining that the new implementation
will approximately replace the current "communicator.c" code, exposing
roughly the same functions to pagestore_smgr.c.
This commit doesn't change any functionality or behavior, or make any
other changes to the existing code: It just moves existing code
around.
The timeline stopping state is set much earlier than the cancellation
token is fired, so by checking for the stopping state, we can prevent
races with timeline shutdown where we issue a cancellation error but the
cancellation token hasn't been fired yet.
Fix #11427.
## Problem
The current stripe size of 256 MB is a bit large, and can cause load
imbalances across shards. A stripe size of 16 MB appears more reasonable
to avoid hotspots, although we don't see evidence of this in benchmarks.
Resolves https://github.com/neondatabase/cloud/issues/25634.
Touches https://github.com/neondatabase/cloud/issues/21870.
## Summary of changes
* Change the default stripe size to 16 MB.
* Remove `ShardParameters::DEFAULT_STRIPE_SIZE`, and only use
`pageserver_api::shard::DEFAULT_STRIPE_SIZE`.
* Update a bunch of tests that assumed a certain stripe size.
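For intuition, assuming the 8 KiB Postgres page size, the defaults work out as follows in pages:
```rust
const PAGE_SIZE_BYTES: u64 = 8 * 1024;

// Old and new default stripe sizes, expressed in pages.
const OLD_DEFAULT_STRIPE_PAGES: u64 = (256 * 1024 * 1024) / PAGE_SIZE_BYTES; // 32768
const NEW_DEFAULT_STRIPE_PAGES: u64 = (16 * 1024 * 1024) / PAGE_SIZE_BYTES; // 2048

fn main() {
    assert_eq!(OLD_DEFAULT_STRIPE_PAGES, 32768);
    assert_eq!(NEW_DEFAULT_STRIPE_PAGES, 2048);
}
```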
## Problem
Storcon will not start up if `use_https` is on and there are some
pageservers or safekeepers without an https port in the database. Metrics
for "how many nodes with https we have in the DB" will help us make sure
that `use_https` can be turned on safely.
- Part of https://github.com/neondatabase/cloud/issues/25526
## Summary of changes
- Add `storage_controller_https_pageserver_nodes`,
`storage_controller_safekeeper_nodes` and
`storage_controller_https_safekeeper_nodes` Prometheus metrics.
## Problem
With the recent improvements to L0 compaction responsiveness,
`test_create_snapshot` now ends up generating 10,000 layer files
(compared to 1,000 in previous snapshots). This increases the snapshot
size by 4x, and significantly slows down tests.
## Summary of changes
Increase the target layer size from 128 KB to 256 KB, and the L0
compaction threshold from 1 to 5. This reduces the layer count from
about 10,000 to 1,000.
## Problem
Support unlogged index builds in DEBUG_COMPARE_LOCAL.
The Neon SMGR treats the presence of a local file as an indicator of an
unlogged relation.
But this doesn't work in DEBUG_COMPARE_LOCAL mode.
## Summary of changes
Use INIT_FORKNUM as the indicator of an unlogged relation and create this
file during unlogged index builds.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
`tenant_import`, used to import an existing tenant from remote storage
into a storage controller for support and debugging, assumed
`DEFAULT_STRIPE_SIZE` since this can't be recovered from remote storage.
In #11168, we are changing the stripe size, which will break
`tenant_import`.
Resolves#11175.
## Summary of changes
* Add `stripe_size` to the tenant manifest.
* Add `TenantScanRemoteStorageShard::stripe_size` and return from
`tenant_scan_remote` if present.
* Recover the stripe size during `tenant_import`, or fall back to 32768
(the original default stripe size).
* Add tenant manifest compatibility snapshot:
`2025-04-08-pgv17-tenant-manifest-v1.tar.zst`
There are no cross-version concerns here, since unknown fields are
ignored during deserialization where relevant.
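A minimal sketch of the optional-field-with-fallback shape (field and type names are illustrative):
```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct TenantManifest {
    version: u32,
    // Older manifests have no stripe size; missing fields are tolerated.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    stripe_size: Option<u32>, // in pages
}

fn effective_stripe_size(manifest: &TenantManifest) -> u32 {
    // Fall back to the original default of 32768 pages (256 MiB).
    manifest.stripe_size.unwrap_or(32768)
}
```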
## Problem
This is generated e.g. by `test_historic_storage_formats`, and causes
VSCode to list all the contained files as new.
## Summary of changes
Add `/artifact_cache` to `.gitignore`.
## Problem
For future gc-compaction tests when we support
https://github.com/neondatabase/neon/issues/10395
## Summary of changes
Add a new type of neon test WAL record that is conditionally applied
(i.e., only when image == the specified value). We can use this to mock
the situation where we lose some records in the middle, firing an error,
and see how gc-compaction reacts to it.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Service targeted for storing and retrieving LFC prewarm data.
Can be used for proxying S3 access for Postgres extensions like
pg_mooncake as well.
Requests must include a Bearer JWT token.
Token is validated using a pemfile (should be passed in infra/).
Note: app is not tolerant to extra trailing slashes, see app.rs
`delete_prefix` test for comments.
Resolves: https://github.com/neondatabase/cloud/issues/26342
Unrelated changes: gate a `rename_noreplace` feature and disable it in
`remote_storage` so that `object_storage` can be built with musl.
## Problem
`test_scrubber_tenant_snapshot` is flaky with `request was dropped`
errors. More details are in the issue.
- Closes: https://github.com/neondatabase/neon/issues/11278
## Summary of changes
- Disable shard scheduling during pageservers restart
- Add `reconcile_until_idle` at the end of the test
Also, move the call to the lfc_init() function. It was weird to have it
in libpagestore.c, when libpagestore.c otherwise had nothing to do with
the LFC. Move it directly into _PG_init()
## Problem
If the local file cache is shrunk, so that we punch some holes in the
underlying file, the local_cache view displays the holes incorrectly.
See https://github.com/neondatabase/neon/issues/10770
## Summary of changes
Skip hole tags in the local_cache view.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Fix various small issues discovered during gc-compaction rollout.
## Summary of changes
- Log level changes: if errors are from gc-compaction, fire a warning
instead of errors or critical errors.
- Yield to L0 compaction more aggressively. Instead of checking every 1k
keys, we check on every key. Sometimes a single key reconstruct takes a
long time.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Currently, the tenant manifest is only uploaded if there are offloaded
timelines. The checks are also a bit loose (e.g. only checks number of
offloaded timelines). We want to start using the manifest for other
things too (e.g. stripe size).
Resolves #11271.
## Summary of changes
This patch ensures that a tenant manifest always exists. The lifecycle
is:
* During preload, fetch the existing manifest, if any.
* During attach, upload a tenant manifest if it differs from the
preloaded one (or does not exist).
* Upload a new manifest as needed, if it differs from the last-known
manifest (ignoring version number).
* On splits, pre-populate the manifest from the parent.
* During Pageserver physical GC, remove old manifests but keep the
latest 2 generations.
This will cause nearly all existing tenants to upload a new tenant
manifest on their first attach after this change. Attaches are
concurrency-limited in the storage controller, so we expect this will be
fine.
Also updates `make_broken` to automatically log at `INFO` level when the
tenant has been cancelled, to avoid spurious error logs during shutdown.
## Problem
If a tenant is cancelled (e.g. due to Pageserver shutdown) during
attach, it is set to `Broken`. This results both in error log spam and
500 responses for clients -- shutdown is supposed to return 503
responses which can be retried.
This becomes more likely to happen with #11328, where we perform tenant
manifest downloads/uploads during attach.
## Summary of changes
Set tenant state to `Stopping` when attach fails and the tenant is
cancelled, downgrading the log messages to INFO. This introduces two
variants of `Stopping` -- with and without a caller barrier -- where the
latter is used to signal attach cancellation.
Before, we specified the JWT via `SAFEKEEPER_AUTH_TOKEN`, but env vars
are quite public, both in procfs as well as in unit files. So add a way
to put the auth token into a file directly.
context: https://neondb.slack.com/archives/C033RQ5SPDH/p1743692566311099
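A minimal sketch of the file-based variant:
```rust
use std::path::Path;

/// Read the JWT from the file given by `--auth-token-path`, trimming the
/// trailing newline that editors and secret mounts tend to add.
fn load_auth_token(path: &Path) -> std::io::Result<String> {
    Ok(std::fs::read_to_string(path)?.trim().to_owned())
}
```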
The 'neon_read' function needs to have a different prototype on PG < 16,
because it's part of the smgr interface. But neon_read_at_lsn doesn't
have that restriction.
## Problem
The shared libraries preloaded by default interfered with the
`pg_regress` tests on staging, causing wrong results
## Summary of changes
The projects used for these tests are now free from unnecessary
extensions. Some changes were made in patches.
Remove useless and often wrong IDENTIFICATION comments. PostgreSQL
sources have them, mostly for historical reasons, but there's no need
for us to copy that style.
Remove unnecessary #includes in header files, putting the #includes
directly in the .c files that need them. The principle is that a header
file should #include other header files if they need definitions from
them, such that each header file can be compiled on its own, but not
other #includes. (There are tools to enforce that, but this was just a
manual clean up of violations that I happened to spot.)
## Problem
The `pg_embedding` extension has been deprecated and can cause issues
with recent changes such as with
https://github.com/neondatabase/neon/issues/10973
Issue: `PG:2025-04-03 15:39:25.498 GMT
ttid=a4de5bee50225424b053dc64bac96d87/d6f3891b8f968458b3f7edea58fb3c6f
sqlstate=58P01 [15526] ERROR: could not load library
"/usr/local/lib/embedding.so": /usr/local/lib/embedding.so: undefined
symbol: SetLastWrittenLSNForRelation`
## Summary of changes
Removed `pg_embedding` extension from the compute image.
## Problem
We don't have metrics to exactly quantify the end user impact of
on-demand downloads.
Perf tracing is underway (#11140) to supply us with high-resolution
*samples*.
But it will also be useful to have some aggregate per-timeline and
per-instance metrics that definitively contain all observations.
## Summary of changes
This PR consists of independent commits that should be reviewed
independently.
However, for convenience, we're going to merge them together.
- refactor(metrics): measure_remote_op can use async traits
- impr(pageserver metrics): task_kind dimension for
remote_timeline_client latency histo
- implements https://github.com/neondatabase/cloud/issues/26800
- refs
https://github.com/neondatabase/cloud/issues/26193#issuecomment-2769705793
- use the opportunity to rename the metric and add a _global suffix;
checked grafana export, it's only used in two personal dashboards, one
of them mine, the other by Heikki
- log on-demand download latency for expensive-to-query but precise
ground truth
- metric for wall clock time spent waiting for on-demand downloads
## Refs
- refs https://github.com/neondatabase/cloud/issues/26800
- a bunch of minor investigations / incidents into latency outliers
# Refs
- refs https://github.com/neondatabase/neon/issues/8915
- discussion thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1742406381132599
- stacked atop https://github.com/neondatabase/neon/pull/11298
- corresponding internal docs update that illustrates how this PR
removes friction: https://github.com/neondatabase/docs/pull/404
# Problem
Rejecting `pageserver.toml`s with unknown fields adds friction,
especially when using `pageserver.toml` fields as feature flags that
need to be decommissioned.
See the added paragraphs on `pageserver_api::models::ConfigToml` for
details on what kind of friction it causes.
Also read the corresponding internal docs update linked above to see a
more imperative guide for using `pageserver.toml` flags as feature
flags.
# Solution
## Ignoring unknown fields
Ignoring is the serde default behavior.
So, just remove `serde(deny_unknown_fields)` from all structs in
`pageserver_api::config::ConfigToml` and
`pageserver_api::config::TenantConfigToml`.
I went through all the child fields and verified they don't use
`deny_unknown_fields` either, including those shared with
`pageserver_api::models`.
## Warning about unknown fields
We still want to warn about unknown fields to
- be informed about typos in the config template
- be reminded about feature-flag style configs that have been cleaned up
in code but not yet in config templates
We tried `serde_ignore` (cf draft #11319) but it doesn't work with
`serde(flatten)`.
The solution we arrived at is to compare the on-disk TOML with the TOML
that we produce if we serialize the `ConfigToml` again.
Any key specified in the on-disk TOML but not present in the serialized
TOML is flagged as an ignored key.
The mechanism to do it is a tiny recursive descent visitor on the
`toml_edit::DocumentMut`.
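A minimal sketch of that comparison, using plain `toml::Value` instead of `toml_edit::DocumentMut` for brevity (names are illustrative):
```rust
use toml::Value;

/// Collect dotted key paths present in the on-disk TOML but absent from the
/// TOML produced by re-serializing the parsed config.
fn ignored_keys(prefix: &str, on_disk: &Value, effective: &Value, out: &mut Vec<String>) {
    if let (Value::Table(disk), Value::Table(eff)) = (on_disk, effective) {
        for (key, disk_val) in disk {
            let path = if prefix.is_empty() {
                key.clone()
            } else {
                format!("{prefix}.{key}")
            };
            match eff.get(key) {
                // Key was dropped during deserialization: warn about it.
                None => out.push(path),
                // Key exists: recurse into nested tables.
                Some(eff_val) => ignored_keys(&path, disk_val, eff_val, out),
            }
        }
    }
}
```
Parse both documents with `toml::from_str::<toml::Value>` and walk from the root with an empty prefix; anything collected gets logged as an ignored key.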
# Future Work
Invalid config _values_ in known fields will continue to fail pageserver
startup.
See
- https://github.com/neondatabase/cloud/issues/24349
for current worst case impact to deployments & ideas to improve.
# Problem
Current perf tracing fields do not allow answering the question what a
specific Postgres backend was waiting for.
# Background
For Pageserver logs, we set the backend PID as the libpq
`application_name` on the compute side, and funnel that into a tracing
field for the spans that emit to the global tracing subscriber.
# Solution
Funnel `application_name`, and the other fields that we use in the
logging spans, into the root span for perf tracing.
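Roughly, with assumed span and field names:
```rust
use tracing::{info_span, Span};

/// Build the root perf span with the same identifiers we already put on the
/// logging spans; `application_name` carries the compute backend PID.
fn perf_root_span(application_name: &str, tenant_id: &str, timeline_id: &str) -> Span {
    info_span!(
        "page_request",
        application_name = %application_name,
        tenant_id = %tenant_id,
        timeline_id = %timeline_id,
    )
}
```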
# Refs
- fixes https://github.com/neondatabase/neon/issues/11393
- stacked atop https://github.com/neondatabase/neon/pull/11433
- epic: https://github.com/neondatabase/neon/issues/9873
## Problem
We've started sending slack notifications for failed container image
pushes that are being retried. There are more messages coming in than
expected, so clicking through the link to see what image failed is
happening more often than we hoped.
## Summary of changes
- Make slack notifications clearer, including whether the job succeeded
and what retries have happened.
- Log failures/retries in step more clearly, so that you can easily see
when something fails.
## Problem
Postgres build fails with the following error on macOS:
```
/Users/bayandin/work/neon//vendor/postgres-v14/src/port/snprintf.c:424:27: error: 'strchrnul' is only available on macOS 15.4 or newer [-Werror,-Wunguarded-availability-new]
424 | const char *next_pct = strchrnul(format + 1, '%');
| ^~~~~~~~~
/Users/bayandin/work/neon//vendor/postgres-v14/src/port/snprintf.c:376:14: note: 'strchrnul' has been marked as being introduced in macOS 15.4 here, but the deployment target is macOS 15.0.0
376 | extern char *strchrnul(const char *s, int c);
| ^
/Users/bayandin/work/neon//vendor/postgres-v14/src/port/snprintf.c:424:27: note: enclose 'strchrnul' in a __builtin_available check to silence this warning
424 | const char *next_pct = strchrnul(format + 1, '%');
| ^~~~~~~~~
425 |
426 | /* Dump literal data we just scanned over */
427 | dostr(format, next_pct - format, target);
428 | if (target->failed)
429 | break;
430 |
431 | if (*next_pct == '\0')
432 | break;
433 | format = next_pct;
|
1 error generated.
```
## Summary of changes
- Update Postgres fork to include changes from
6da2ba1d8a
Corresponding Postgres PRs:
- https://github.com/neondatabase/postgres/pull/608
- https://github.com/neondatabase/postgres/pull/609
- https://github.com/neondatabase/postgres/pull/610
- https://github.com/neondatabase/postgres/pull/611
## Problem
https://github.com/neondatabase/neon/pull/11140 introduces performance
tracing with OTEL
and a pageserver config which configures the sampling ratio of get page
requests.
Enabling a non-zero sampling ratio on a per region basis is too
aggressive and comes with perf
impact that isn't very well understood yet.
## Summary of changes
Add a `sampling_ratio` tenant level config which overrides the
pageserver level config.
Note that we do not cache the config but load it on every get page
request, so that changes propagate
promptly.
Note that I've had to remove the `SHARD_SELECTION` span to get this to
work. The tracing library doesn't
expose a neat way to drop a span if one realises it's not needed at
runtime.
Closes https://github.com/neondatabase/neon/issues/11392
## Problem
IO metrics for secondary locations do not get deregistered when the
timeline is removed.
## Summary of changes
Stash the request context to be used for downloads in
`SecondaryTimelineDetail`. These objects match the lifetime of the
secondary timeline location pretty well.
When the timeline is removed, deregister the metrics too.
Closes https://github.com/neondatabase/neon/issues/11156
## Problem
Changes in compute can cause errors in tests if another version of
`neon-test-extensions` image is used.
## Summary of changes
Use the same version of `neon-test-extensions` image as `compute` one
for docker-compose based extension tests.
## Problem
There are some places in the code where we create `reqwest::Client`
without providing SSL CA certs from `ssl_ca_file`. These will break
after we enable TLS everywhere.
- Part of https://github.com/neondatabase/cloud/issues/22686
## Summary of changes
- Support `ssl_ca_file` in storage scrubber.
- Add `use_https_safekeeper_api` option to safekeeper to use https for
peer requests.
- Propagate SSL CA certs to storage_controller/client, storcon's
ComputeHook, PeerClient and maybe_forward.
Previously we attempted to download all extensions in CREATE EXTENSION
statements. Extensions like pg_stat_statements and neon are not remote
extensions, but still we were requesting them when
skip_pg_catalog_updates was set to false.
Fixes: https://github.com/neondatabase/neon/issues/11127
Signed-off-by: Tristan Partin <tristan@neon.tech>
Adds a test `test_storcon_create_delete_sk_down` which tests the
reconciler and pending op persistence if faced with a temporary
safekeeper downtime during timeline creation or deletion. This is in
contrast to `test_explicit_timeline_creation_storcon`, which tests the
happy path.
We also do some fixes:
* timeline and tenant deletion http requests didn't expect a body, but
`()` sent one.
* we got the tenant deletion http request's return type wrong: it's
supposed to be a hash map
* we add some logging to improve observability
* We fix `list_pending_ops` which had broken code meant to make it
possible to restrict oneself to a single pageserver. But diesel doesn't
support that sadly, or at least I couldn't figure out a way to make it
work. We don't need that functionality, so remove it.
* We add an info span with the node id to the heartbeater futures, so
that there are no context-free messages like "Backoff: waiting 1.1 seconds
before processing with the task" in the storcon logs. We could also add
the full base url of the node, but we don't, as most other log lines
contain that information already, and if we do duplicate it, it should at
least not be verbose. One can always find out the base url from the node
id.
Successor of #11261
Part of #9011
Log the created project and endpoint IDs and improve typing in the
source code to improve readability.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
We switched `h2` from 4.1.0 to a git commit to fix stubgen (in
https://github.com/neondatabase/neon/pull/10491). `h2` 4.2.0 was
released soon after that, so we can switch back to a pinned version.
Expected no changes, as 4.2.0 is the right next commit after the commit
we currently use:
dacd614fed%5E
## Summary of changes
- Bump `h2` to 4.2.0
## Problem
Part of https://github.com/neondatabase/neon/issues/11318
It's not 100% safe for now to run gc-compaction over the sparse
keyspace. It might cause deleted files to re-appear if a specific
sequence of operations is done as in the issue, which in reality
doesn't happen due to how we split delta/image layers based on the key
range.
A long-term fix would be either having a separate gc-compaction code
path for metadata keys (as we have a different code path for
metadata image layer generation), or making the compaction process aware
that "there's an image layer that doesn't contain a key", so that we can
skip those keys.
## Summary of changes
* gc-compaction auto trigger only triggers compaction over the normal
data range.
* do not hold gc_block_guard across the full compaction job, only hold
it during each subcompaction.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
The leadership transfer protocol between storage controller instances is
as follows, listing the steps for the new pod:
1. The new pod comes online. It looks in the database for a leader. If
there is one, it asks that leader to step down.
2. The new pod does some operations to come online. They should be
fairly quick, but not instantaneous.
3. The new pod updates the leader entry in the database.
The old pod, once it gets the step-down request, changes its internal
state to stepped down. It now treats all incoming requests specially:
instead of processing them, it wants to forward them to the new pod. The
forwarding, however, only works if the new pod is online already, so
before forwarding it reads the leader from the db (also to get the
address to forward to in the first place).
If the new pod is not online yet, i.e. during step 2 above, the old pod
might legitimately land in the branch which this patch is editing: the
leader in the database is a stepped-down instance.
Before, we returned an `ApiError::InternalServerError`, but that would
print the full backtrace plus an error log. With this patch, we cut down
on the noise, as it's an expected situation to have a short storcon
downtime while we are cutting over to the new instance. A
`ResourceUnavailable` error is not just more fitting, it also doesn't
print a backtrace once encountered, and only prints on the INFO log
level (see `api_error_handler` function).
Fixes #11320
cc #8954
Based on https://github.com/neondatabase/neon/pull/11139
## Problem
We want to export performance traces from the pageserver in OTEL format.
End goal is to see them in Grafana.
## Summary of changes
https://github.com/neondatabase/neon/pull/11139 introduces the
infrastructure required to run the otel collector alongside the
pageserver.
### Design
Requirements:
1. We'd like to avoid implementing our own performance tracing stack if
possible and use the `tracing` crate instead.
2. Ideally, we'd like zero overhead at a sampling rate of zero, and to be
able to change the tracing config for a tenant on the fly.
3. We should leave the current span hierarchy intact. This includes
adding perf traces without modifying existing tracing.
To satisfy (3) (and (2) in part) a separate span hierarchy is used.
`RequestContext` gains an optional `perf_span` member
that's only set when the request was chosen by sampling. All perf span
related methods added to `RequestContext` are no-ops for requests that
are not sampled.
This on its own is not enough for (3), so performance spans use a
separate tracing subscriber. The `tracing` crate doesn't have great
support for this, so there's a fair amount of boilerplate to override
the subscriber at all points of the perf span lifecycle.
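A minimal sketch of the no-op-when-unsampled shape (names are illustrative):
```rust
use tracing::Span;

struct RequestContext {
    // Only set when the request was selected by the sampling ratio.
    perf_span: Option<Span>,
}

impl RequestContext {
    /// Run `f` inside the perf span if this request is sampled; otherwise
    /// this is a plain call with no tracing overhead.
    fn with_perf_span<R>(&self, f: impl FnOnce() -> R) -> R {
        match &self.perf_span {
            Some(span) => span.in_scope(f),
            None => f(),
        }
    }
}
```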
### Perf Impact
[Periodic
pagebench](https://neonprod.grafana.net/d/ddqtbfykfqfi8d/e904990?orgId=1&from=2025-02-08T14:15:59.362Z&to=2025-03-10T14:15:59.362Z&timezone=utc)
shows no statistically significant regression with a sample ratio of 0.
There's an annotation on the dashboard on 2025-03-06.
### Overview of changes:
1. Clean up the `RequestContext` API a bit. Namely, get rid of the
`RequestContext::extend` API and use the builder instead.
2. Add pageserver level configs for tracing: sampling ratio, otel
endpoint, etc.
3. Introduce some perf span tracking utilities and expose them via
`RequestContext`. We add a `tracing::Span` wrapper to be used for perf
spans and a `tracing::Instrumented` equivalent for it. See doc comments
for reason.
4. Set up OTEL tracing infra according to configuration. A separate
runtime is used for the collector.
5. Add perf traces to the read path.
## Refs
- epic https://github.com/neondatabase/neon/issues/9873
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
There are some cases where traditional gc might collect some layer files,
leaving gc-compaction unable to read the full history of a key. This
needs to be resolved in the long term by improving the compaction
process. For now, let's simply avoid having such errors trigger the
circuit breaker.
## Summary of changes
* Move the place where we trigger the circuit breaker. We only trigger
it during compactions other than L0 compactions. We added the trigger a
year ago due to file cleanup concerns in image layer compaction.
* For gc-compaction, only return errors to the upper
compaction_iteration if it's a shutdown error. Otherwise, just log it
and skip the compaction for a key range.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
Previously, if the observed state was refreshed and matched the intent,
we wouldn't send
a compute notification. This is unsafe. There's no guarantee that the
location landed on the
pageserver _and_ a compute notification for it was delivered.
See
https://github.com/neondatabase/neon/issues/11291#issuecomment-2743205411
for one such example.
## Summary of changes
Add a reproducer and notify the compute if the correct observed state
required a refresh.
Closes https://github.com/neondatabase/neon/issues/11291
## Problem
We had a problem with https://github.com/neondatabase/neon/pull/11413
having e2e tests failing, because an e2e test
(8d271bed47)
depended on an unreleased pageserver fix
(0ee5bfa2fc).
This came up because neon release CI runs against the most recent
releases of the other components, but cloud e2e tests run against
latest, which is tagged from main.
## Summary of changes
Add an additional `released` tag for released versions.
## Alternative to consider
We could (and maybe should) instead switch to `latest` being used for
released versions and `main` being used where we use `latest` right now.
That'd also mean we don't have to adjust the CI in the cloud repo.
## Problem
Exporting `file_cache_used` which specifies the number of used chunks in
the LFC. This helps calculate LFC utilization as: `file_cache_used_pages
/ (file_cache_used * file_cache_chunk_size_pages)`
## Summary of changes
Exporting `file_cache_used`.
Related Issue: https://github.com/neondatabase/cloud/issues/26688
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.
[Announcement blog
post](https://blog.rust-lang.org/2025/04/03/Rust-1.86.0.html).
Prior update was in #10914.
## Problem
Since
0f367cb665
the timeout in `with_client_retries` is implemented via `tokio::timeout`
instead of `reqwest::ClientBuilder::timeout` (because we reuse the
client). It changed the error representation if the timeout is exceeded.
Such errors were suppressed in `allowed_errors.py`, but old regexps do
not match the new error.
Discussion:
https://neondb.slack.com/archives/C033RQ5SPDH/p1743533184736319
## Summary of changes
- Add new `Timeout` error to `allowed_errors.py`
I've encountered this error in #11422. Ideally we'd have the URL as well
to associate it with a tenant, but at this level we only have the remote
addr I guess. Better than nothing.
## Problem
The test_pageserver_gc_compaction_smoke fails rather often due to a
timeout on slow machines.
See https://github.com/neondatabase/neon/issues/11355.
## Summary of changes
Increase the timeout for the test.
## Problem
We've seen quite a few CI failures related to pushes to docker hub
failing with weird error messages that indicate maybe docker hub is just
not reliable.
## Summary of changes
Retry container image pushing up to 10 times, and send a slack message
if we had to retry, regardless of the job succeeding or not.
## Problem
Pagebench creates a bunch of tenants by first creating a template tenant
and copying its remote storage, then attaching the copies to the
Pageserver.
These tenants had custom configurations to disable GC and compaction.
However, these configs were only picked up by the Pageserver on attach,
and not registered with the storage controller. This caused the storage
controller to replace the tenant configs with the default tenant config,
re-enabling GC and compaction which interferes with benchmark
performance.
Resolves #11381.
## Summary of changes
Register the copied tenants with the storage controller, instead of
directly attaching them to the Pageserver.
Right now we start safekeeper node ids at 0. However, other code treats
0 as invalid (see #11407). We decided on the latter. Therefore, make the
Python tests register safekeepers starting at node id 1 instead
of 0, and forbid safekeepers with id 0 from registering.
Context:
https://github.com/neondatabase/neon/pull/11407#discussion_r2024852328
## Problem
close https://github.com/neondatabase/neon/issues/11279
## Summary of changes
* Allow other HTTP methods in storcon's tenant timeline shard-0
passthrough.
* Pass through the mark-invisible API in storcon.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
In #11122, we changed the autosplit behavior to allow repeated and
initial splits. The defaults were set such that they retain the current
production settings (8 shards at 64 GB). However, these defaults don't
really make sense by themselves.
Once we deploy new settings to production, we should change the defaults
to something more reasonable.
## Summary of changes
Changes the following default settings:
* `max_split_shards`: 8 → 16
* `initial_split_threshold`: 64 GB → disabled
* `initial_split_shards`: 8 → 2
## Problem
Hotfix releases mean that sometimes changes in release PRs haven't been
tested and linted yet. Disabling tests and lints is therefore not
necessarily safe. In the future we will check whether tests have run on
the same git tree already to speed things up, but for now we need to
turn tests back on fully. This partially reverts:
https://github.com/neondatabase/neon/pull/11272
## Summary of changes
Run checks on `.*-rc-pr` runs.
## Problem
Tenants in attachment state `Stale` can't upload layers, and don't run
compaction, but still do periodic L0 layer flushes in the tenant
housekeeping loop. If the tenant remains stuck in stale mode, this
causes a large buildup of L0 layers, causing logging, metrics increases,
and possibly alerts.
Resolves #11245.
## Summary of changes
Don't perform periodic layer flushes in stale attachment state.
## Problem
Sometimes the forced extension upgrade test fails (on schedule) due to a
timeout.
## Summary of changes
The timeout is increased to 60 mins.
## Problem
In Neon DBaaS we adjust shared_buffers to the size of the compute; more
precisely, we adjust the max number of connections to the compute size,
and we adjust the shared_buffers size to the max number of connections,
roughly according to the following sizes:
`2 CU: 225mb; 4 CU: 450mb; 8 CU: 900mb`
[see](877e33b428/goapp/controlplane/internal/pkg/compute/computespec/pg_settings.go (L405))
## Summary of changes
We should run perf unit tests with settings that are realistic for a
paying customer, and select 8 CU as the reference for those tests.
## Problem
Followup to https://github.com/neondatabase/neon/pull/10913
Existing chaos injection just does simple cutovers to secondary
locations. Let's also exercise code for doing graceful migrations. This
should implicitly test how such migrations cope with overlapping with
service restarts.
## Summary of changes
## Problem
ref https://github.com/neondatabase/neon/issues/11279
Imagine we have a branch with 3 snapshots A, B, and C:
```
base---+---+---+---main
\-A \-B \-C
base=100G, base-A=1G, A-B=1G, B-C=1G, C-main=1G
```
at this point, the synthetic size should be 100+1+1+1+1=104G.
after the deletion, the structure looks like:
```
base---+---+---+
\-A \-B \-C
```
If we simply assume main never exists, the size will be calculated as
size(A) + size(B) + size(C)=300GB, which obviously is not what the user
would expect.
The correct way to do this is to assume part of main still exists, that
is to say, set C-main=1G:
```
base---+---+---+main
\-A \-B \-C
```
And we will get the correct synthetic size of 100G+1+1+1=103G.
## Summary of changes
* Do not generate gc cutoff point for invisible branches.
* Use the same LSN as the last branchpoint for branch end.
* Remove test_api_handler for mark_invisible.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
See https://neondb.slack.com/archives/C04DGM6SMTM/p1741594233757489
Consider the following scenario:
1. Backend A wants to prefetch some block B
2. Backend A checks that block B is not present in shared buffer
3. Backend A registers new prefetch request and calls
prefetch_do_request
4. prefetch_do_request calls neon_get_request_lsns
5. neon_get_request_lsns obtains LwLSN for block B
6. Backend B downloads B, updates and wallogs it (let say to Lsn1)
7. Block B is once again thrown from shared buffers, its LwLSN is set to
Lsn1
8. Backend A obtains current flush LSN, let's say that it is Lsn1
9. Backend A stores Lsn1 as effective_lsn in prefetch slot.
10. Backend A reads page B with LwLSN=Lsn1
11. Backend A finds in prefetch ring response for prefetch request for
block B with effective_lsn=Lsn1, so that it satisfies
neon_prefetch_response_usable condition
12. Backend A uses an outdated version of the page!
## Summary of changes
Use `not_modified_since` as `effective_lsn`.
It should not cause any performance degradation, because we store the
LwLSN when it was not found in the LwLSN hash, so if the page is not
changed before the prefetch response arrives, the LwLSN should not change.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
This PR contains a bunch of smaller followups and fixes of the original
PR #11058. Most of these implement suggestions from Arseny:
* remove `Queryable, Selectable` from `TimelinePersistence`: they are
not needed.
* no `Arc` around `CancellationToken`: it itself is an arc wrapper
* only schedule deletes instead of scheduling excludes and deletes
* persist and delete deletion ops
* delete rows in timelines table upon tenant and timeline deletion
* set `deleted_at` for timelines we are deleting before we start any
reconciles: this flag will help us later to recognize half-executed
deletions, or when we crashed before we could remove the timeline row
but after we removed the last pending op (handling these situations is
left for later).
Part of #9011
Add an optional `safekeepers` field to `TimelineInfo` which is returned
by the storcon upon timeline creation if the
`--timelines-onto-safekeepers` flag is enabled. It contains the list of
safekeepers chosen.
Other contexts where we return `TimelineInfo` do not contain the
`safekeepers` field. Sadly, I couldn't make this more type safe like done
in Rust via `TimelineCreateResponseStorcon`, as there is no way of
flattening or inheritance (and I don't think that duplicating the entire
type for some minor type safety improvements is worth it).
The storcon side has been done in #11058.
Part of https://github.com/neondatabase/cloud/issues/16176
cc https://github.com/neondatabase/cloud/issues/16796
## Problem
`CompactFlags::NoYield` was a bit inconvenient, since every caller
except for the background compaction loop should generally set it (e.g.
HTTP API calls, tests, etc). It was also inconsistent with
`CompactionOutcome::YieldForL0`.
## Summary of changes
Invert `CompactFlags::NoYield` as `CompactFlags::YieldForL0`. There
should be no behavioral changes.
## Problem
For computes running inside NeonVM, the actual compute image tag is
buried inside the NeonVM spec, and we cannot get it as part of standard
k8s container metrics (it's always an image and a tag of the NeonVM
runner container). The workaround we currently use is to extract the
running computes info from the control plane database with SQL. It has
several drawbacks: i) it's complicated, separate DB per region; ii) it's
slow; iii) it's still an indirect source of info, i.e. k8s state could
be different from what the control plane expects.
## Summary of changes
Add a new `compute_ctl_up` gauge metric with `build_tag` and `status`
labels. It will help us both get an overview of the tags/versions of
all running computes, and break them down by current status (`empty`,
`running`, `failed`, etc.)
Later, we could introduce low cardinality (no endpoint or compute ids)
streaming aggregates for such metrics, so they will be blazingly fast
and usable for monitoring the fleet-wide state.
## Problem
Commit
3da70abfa5
caused a noticeable performance regression (40% in update-with-prefetch in
test_bulk_update):
https://neondb.slack.com/archives/C04BLQ4LW7K/p1742633167580879
## Summary of changes
Remove the loop from pageserver_try_receive so that it fetches no more
than one response. There is still a loop in `pump_prefetch_state` which
can fetch as many responses as available.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Previously we had different meanings for the bitmask of vector IOps.
That has now been unified to "bit set = final result, no more
scribbling".
Furthermore, the LFC read path scribbled on pages that were already
read; that's probably not a good thing so that's been fixed too. In
passing, the read path of LFC has been updated to read only the
requested pages into the provided buffers, thus reducing the IO size of
vectorized IOs.
## Problem
The current version of the GitHub Workflow Stats action pulls docker
images from DockerHub, which could be an issue with the new pull limits
on the DockerHub side.
## Summary of changes
Switch to version `v0.2.2`, with docker images hosted on `ghcr.io`
In sqlstate, we have a manual `phf` construction, which is not
explicitly guaranteed to be stable - you're intended to use a build.rs
or the macro to make sure it's constructed correctly each time. This was
inherited from tokio-postgres upstream, which has the same issue
(https://github.com/rust-phf/rust-phf/pull/321#issuecomment-2724521193).
We don't need this encoding of sqlstate, so I've switched it to simply
parse 5 bytes
(https://www.postgresql.org/docs/current/errcodes-appendix.html).
While here, I switched out log for tracing.
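A minimal sketch of the five-byte representation (names are illustrative):
```rust
/// A SQLSTATE is exactly five ASCII characters, e.g. "23505".
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct SqlState([u8; 5]);

impl SqlState {
    pub fn parse(code: &str) -> Option<Self> {
        // Copies the five bytes; anything that isn't exactly five bytes is rejected.
        let bytes: [u8; 5] = code.as_bytes().try_into().ok()?;
        Some(SqlState(bytes))
    }

    /// The two-character class prefix, e.g. "23" for integrity constraint violations.
    pub fn class(&self) -> &[u8] {
        &self.0[..2]
    }
}
```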
## Problem
The IS_LOCAL_REL macro used for DEBUG_COMPARE_LOCAL mode uses a
greater-than rather than a greater-or-equal comparison, while the first
table really is assigned FirstNormalObjectId.
## Summary of changes
Replace the strict greater-than with a greater-or-equal comparison.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
`TYPE_CHECKING` is used inconsistently across Python tests.
## Summary of changes
- Update `ruff`: 0.7.0 -> 0.11.2
- Enable TC (flake8-type-checking):
https://docs.astral.sh/ruff/rules/#flake8-type-checking-tc
- (auto)fix all new issues
## Problem
Previously, L0 flushes would wait for uploads, as a simple form of
backpressure. However, this prevented flush pipelining and upload
parallelism. It has since been disabled by default and replaced by L0
compaction backpressure.
Touches https://github.com/neondatabase/cloud/issues/24664.
## Summary of changes
This patch removes L0 flush upload waits, along with the
`l0_flush_wait_upload` setting. This can't be merged until the setting has
been removed across the fleet.
## Problem
`github.sha` contains a merge commit of `head` and `base` if we're in a
PR. In release PRs, this makes no sense, because we fast-forward the
`base` branch to contain the changes from `head`.
Even though we correctly use `${{ github.event.pull_request.head.sha ||
github.sha }}` to reference the git commit when building artifacts, we
don't use that when checking out code, because we want to test the merge
of head and base usually. In the case of release PRs, we definitely
always want to test on the head sha though, because we're going to
fast-forward to that, and it already has the base sha as a parent, so the
merge would end up with the same tree anyway.
As a side effect, not checking out `${{
github.event.pull_request.head.sha || github.sha }}` also caused
https://github.com/neondatabase/neon/actions/runs/13986389780/job/39173256184#step:6:49
to say `release-tag=release-compute-8187`, while
https://github.com/neondatabase/neon/actions/runs/14084613121/job/39445314780#step:6:48
is talking about `build-tag=release-compute-8186`
## Summary of changes
Run a few things on `github.event.pull_request.head.sha`, if we're in a
release PR.
## Problem
Occasionally getting data from GH cache could be slow, with less than
10MB/s and taking 5+ minutes to download cache:
```
Received 20971520 of 2987085791 (0.7%), 9.9 MBs/sec
Received 50331648 of 2987085791 (1.7%), 15.9 MBs/sec
...
Received 1065353216 of 2987085791 (35.7%), 4.8 MBs/sec
Received 1065353216 of 2987085791 (35.7%), 4.7 MBs/sec
...
```
https://github.com/neondatabase/neon/actions/runs/13956437454/job/39068664599#step:7:17
This results in fetching the cache taking even longer than the build itself.
## Summary of changes
Switch to caches that are closer to the runners; they provide stable
throughput of about 70-80 MB/s.
## Problem
Some useful debugging tools are missing from the compute image and
sometimes it's impossible to install them because memory is tightly
packed.
## Summary of changes
Add the following tools: iproute2, lsof, screen, tcpdump.
The other changes come from sorting the packages alphabetically.
```bash
$ docker image inspect ghcr.io/neondatabase/vm-compute-node-v16:7555 | jaq '.[0].Size'
1389759645
$ docker image inspect ghcr.io/neondatabase/vm-compute-node-v16:14083125313 | jaq '.[0].Size'
1396051101
$ echo $((1396051101 - 1389759645))
6291456
```
To help with narrowing down
https://github.com/neondatabase/cloud/issues/26362, we make the case
where we are waiting for the shutdown of a specific task more noisy (in
the case of that issue, the `gc_loop`).
…ng (#11308)"
This reverts commit e5aef3747c.
The logic of this commit was incorrect:
enabling audit requires a restart of the compute,
because audit extensions use shared_preload_libraries.
So it cannot be done in the configuration phase;
it requires an endpoint restart instead.
## Problem
While working on bulk import, I want to use the `control-plane-url` flag
for a different request.
Currently, the local compute hook is used whenever no control plane is
specified in the config.
My test requires local compute notifications and a configured
`control-plane-url` which isn't supported.
## Summary of changes
Add a `use-local-compute-notifications` flag. When this is set, we use
the local flow regardless of other config values.
It's enabled by default in neon_local and disabled by default in all
other envs. I had to turn the flag off in tests
that wish to bypass the local flow, but that's expected.
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
- We need to support multiple SSL CA certificates for graceful root CA
certificate rotation.
- Closes: https://github.com/neondatabase/cloud/issues/25971
## Summary of changes
- Parses `ssl_ca_file` as a pem bundle, which may contain multiple
certificates. Single pem cert is a valid pem bundle, so the change is
backward compatible.
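For illustration, a minimal std-only sketch of how a PEM bundle can be treated as a list of certificate blocks; the helper name is hypothetical and the real code uses a proper PEM parser:

```rust
// Split a PEM bundle into individual certificate blocks. A file containing a
// single certificate simply yields a one-element vector, which is why the
// change is backward compatible.
fn split_pem_bundle(bundle: &str) -> Vec<String> {
    const BEGIN: &str = "-----BEGIN CERTIFICATE-----";
    const END: &str = "-----END CERTIFICATE-----";
    let mut certs = Vec::new();
    let mut rest = bundle;
    while let Some(start) = rest.find(BEGIN) {
        if let Some(end) = rest[start..].find(END) {
            let stop = start + end + END.len();
            certs.push(rest[start..stop].to_string());
            rest = &rest[stop..];
        } else {
            break;
        }
    }
    certs
}
```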
## Problem
- Part of https://github.com/neondatabase/neon/issues/11113
- Building a new `reqwest::Client` for every request is expensive
because it parses CA certs under the hood. It's noticeable in storcon's
flamegraph.
## Summary of changes
- Reuse one `reqwest::Client` for all API calls to avoid parsing CA
certificates every time.
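A rough sketch of the reuse pattern (names and timeout are assumptions, not the actual storcon code): build the client once and reuse it for every request, since a `reqwest::Client` shares its connection pool and TLS state.

```rust
use once_cell::sync::Lazy;

// Build the client once; CA parsing and TLS setup happen here, not per request.
static HTTP_CLIENT: Lazy<reqwest::Client> = Lazy::new(|| {
    reqwest::Client::builder()
        .timeout(std::time::Duration::from_secs(30))
        .build()
        .expect("failed to build http client")
});

async fn call_api(url: &str) -> reqwest::Result<String> {
    // Reusing (or cheaply cloning) the client avoids re-parsing CA certificates.
    HTTP_CLIENT.get(url).send().await?.text().await
}
```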
## Problem
Issue https://github.com/neondatabase/neon/issues/11254 describes a case
where restart during a shard split can result in a bad end state in the
database.
## Summary of changes
- Add a reproducer for the issue
- Tighten an existing safety check around updated row counts in
complete_shard_split
## Problem
Various aspects of the controller's job are already described in RFCs,
but the overall service didn't have an RFC that records design tradeoffs
and the top level structure.
## Summary of changes
- Add a retrospective RFC that should be useful for anyone understanding
storage controller functionality
## Problem
part of https://github.com/neondatabase/neon/issues/11279
## Summary of changes
The invisible flag is used to exclude a timeline from synthetic size
calculation. For the first step, let's persist this flag. Most of the
code are following the `is_archived` modification flow.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
SSL certs are loaded only during start up. It doesn't allow the rotation
of short-lived certificates without server restart.
- Closes: https://github.com/neondatabase/cloud/issues/25525
## Summary of changes
- Implement `ReloadingCertificateResolver` which reloads certificates
from disk periodically.
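A minimal sketch of the periodic-reload idea, assuming a hypothetical publisher built on a `tokio::sync::watch` channel rather than the actual rustls resolver wiring:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::watch;

// Periodically re-read the certificate file and publish it when it changes.
// Receivers (e.g. the TLS acceptor side) swap in the new value atomically.
async fn reload_loop(
    cert_path: std::path::PathBuf,
    period: Duration,
    tx: watch::Sender<Arc<Vec<u8>>>,
) {
    let mut last = Vec::new();
    loop {
        if let Ok(bytes) = tokio::fs::read(&cert_path).await {
            if bytes != last {
                last = bytes.clone();
                let _ = tx.send(Arc::new(bytes));
            }
        }
        tokio::time::sleep(period).await;
    }
}
```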
## Problem
Now that stuck connections are quickly terminated, it's not easy to
find the right port from the pid to correlate the connection with the
one seen on the pageserver side.
## Summary of changes
Call getsockname() and include the local port number in the
no-response-from-pageserver log messages.
## Problem
Currently, we only split tenants into 8 shards once, at the 64 GB split
threshold. For very large tenants, we need to keep splitting to avoid
huge shards. And we also want to eagerly split at a lower threshold to
improve throughput during initial ingestion.
See
https://github.com/neondatabase/cloud/issues/22532#issuecomment-2706215907
for details.
Touches https://github.com/neondatabase/cloud/issues/22532.
Requires #11157.
## Summary of changes
This adds parameters and logic to enable repeated splits when a tenant's
largest timeline divided by shard count exceeds `split_threshold`, as
well as eager initial splits at a lower threshold to speed up initial
ingestion. The default parameters are all set such that they retain the
current behavior in production (only split into 8 shards once, at 64
GB).
* `split_threshold` now specifies a maximum shard size. When a shard
exceeds it, all tenant shards are split by powers of 2 such that all
tenant shards fall below `split_threshold`. Disabled by default, like
today.
* Add `max_split_shards` to specify a max shard count for autosplits.
Defaults to 8 to retain current behavior.
* Add `initial_split_threshold` and `initial_split_shards` to specify a
threshold and target count for eager splits of unsharded tenants.
Defaults to 64 GB and 8 shards to retain current production behavior.
Because this PR sets `initial_split_threshold` to 64 GB by default, it
has the effect of enabling autosplits by default. This was not the case
previously, since `split_threshold` defaults to None, but it is already
enabled across production and staging. This is temporary until we
complete the production rollout.
For more details, see code comments.
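As an illustration of the sizing rule described above (not the actual controller code; names are assumptions, the threshold is assumed non-zero, and `max_split_shards` is assumed to be a power of two):

```rust
/// Pick a power-of-two shard count such that each shard's share of the largest
/// timeline falls below `split_threshold`, capped at `max_split_shards`.
fn target_shard_count(
    largest_timeline_size: u64,
    split_threshold: u64,
    max_split_shards: u8,
) -> u8 {
    let mut count: u64 = 1;
    while largest_timeline_size / count > split_threshold && count < max_split_shards as u64 {
        count *= 2; // split by powers of two, as described above
    }
    count.min(max_split_shards as u64) as u8
}
```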
This must wait until #11157 has been deployed to Pageservers.
Once this has been deployed to production, we plan to change the
parameters to:
* `split-threshold`: 256 GB
* `initial-split-threshold`: 16 GB
* `initial-split-shards`: 4
* `max-split-shards`: 16
The final split points will thus be:
* Start: 1 shard
* 16 GB: 4 shards
* 1 TB: 8 shards
* 2 TB: 16 shards
We will then change the default settings to be disabled by default.
---------
Co-authored-by: John Spray <john@neon.tech>
## Problem
`fast_import` binary is being run inside neonvms, and they do not
support proper `kubectl describe logs` now, there are a bunch of other
caveats as well: https://github.com/neondatabase/autoscaling/issues/1320
Anyway, we needed a signal whether the job finished successfully or
not, and if not, at least some error message for the cplane operation.
After [a short
discussion](https://neondb.slack.com/archives/C07PG8J1L0P/p1741954251813609),
an s3 status object is the most convenient option at the moment.
## Summary of changes
If `s3_prefix` was provided to the `fast_import` call, any job run puts
a status object into `{s3_prefix}/status/fast_import` with contents
`{"done": true}` or `{"done": false, "error": "..."}`. Added a test as
well.
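For reference, a small sketch of the status object shape quoted above, assuming a serde-based writer (names are illustrative):

```rust
use serde::Serialize;

// Matches the JSON shapes quoted above; the real writer uploads this JSON to
// `{s3_prefix}/status/fast_import`.
#[derive(Serialize)]
struct FastImportStatus {
    done: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    error: Option<String>,
}

fn status_json(result: Result<(), String>) -> String {
    let status = match result {
        Ok(()) => FastImportStatus { done: true, error: None },
        Err(e) => FastImportStatus { done: false, error: Some(e) },
    };
    // Produces `{"done":true}` or `{"done":false,"error":"..."}`.
    serde_json::to_string(&status).expect("status serialization cannot fail")
}
```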
## Problem
See https://neondb.slack.com/archives/C04DGM6SMTM/p1741176713523469
The problem is that this function uses `PQgetCopyData(shard->conn,
&resp_buff.data, 1 /* async = true */)`
to try to fetch the next message. But this function returns 0 if the
whole message is not present in the buffer, and the input buffer may
contain only part of a message, so the result is not fetched.
## Summary of changes
Use `PQisBusy` + `WaitEventSetWait` to check if data is available and
`PQgetCopyData(shard->conn, &resp_buff.data, 0)` to read whole message
in this case.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
If a tenant gets deleted, delete also all of its timelines. We assume
that by the time a tenant is being deleted, no new timelines are being
created, so we don't need to worry about races with creation in this
situation.
Unlike #11233, which was very simple because it listed the timelines and
invoked timeline deletion, this PR obtains a list of safekeepers to
invoke the tenant deletion on, and then invokes tenant deletion on each
safekeeper that has one or multiple timelines.
Alternative to #11233
Builds on #11288
Part of #9011
## Problem
Pageservers use unencrypted HTTP requests for storage controller API.
- Closes: https://github.com/neondatabase/cloud/issues/25524
## Summary of changes
- Replace hyper0::server::Server with http_utils::server::Server in
storage controller.
- Add HTTPS handler for storage controller API.
- Support `ssl_ca_file` in pageserver.
Timeline IDs do not contain characters that may cause a SQL injection,
but it's best to always play it safe.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
We currently have this code duplicated across different PG versions.
Moving this to an extension would reduce duplication and simplify
maintenance.
## Summary of changes
Moving the LastWrittenLSN code from PG versions to the Neon extension
and linking it with hooks.
Related Postgres PR: https://github.com/neondatabase/postgres/pull/590
Closes: https://github.com/neondatabase/neon/issues/10973
---------
Co-authored-by: Tristan Partin <tristan@neon.tech>
## Problem
The pageserver upcall api was designed to work with control plane or the
storage controller.
We have completed the transition period and now the upcall api only
targets the storage controller.
## Summary of changes
Rename types accordingly and tweak some comments.
## Problem
We had a recent Postgres startup latency (`start_postgres_ms`)
degradation, but it was only caught with SLO alerts. There was actually
an existing test for the same purpose -- `start_postgres_ms`, but it's
doing only two starts, so it's a bit noisy.
## Summary of changes
Add new compute startup latency test that does 100 iterations and
reports p50, p90 and p99 latencies.
Part of https://github.com/neondatabase/cloud/issues/24882
## Problem
Some extensions do not contain tests that can be easily run on top of
docker-compose or staging.
## Summary of changes
Added pg_regress-based tests for `pg_tiktoken`, `pgx_ulid`, `pg_rag`.
For now they will be run on top of docker-compose, but I intend to adapt
them to run on top of staging in the next PRs.
## Problem
We're seeing timeline creation failures that look suspiciously like some
race with the cleanup-deletion of initdb temporary directories. I
couldn't spot the bug, but we can make it a bit easier to debug.
Related: https://github.com/neondatabase/neon/issues/11296
## Summary of changes
- Avoid surfacing a distracting ENOENT failure-to-delete as a log
error -- this is fine and can happen if the timeline is cancelled while
doing initdb, or if initdb itself fails and doesn't write the dir (that
error is surfaced separately)
- Log after purging initdb temp directories
The only difference between
- `pageserver_api::models::TenantConfig` and
- `pageserver::tenant::config::TenantConfOpt`
at this point is that `TenantConfOpt` serializes with
`skip_serializing_if = Option::is_none`.
That is an efficiency improvement for all the places that currently
serde `models::TenantConfig` because new serializations will no longer
write `$fieldname: null` for each field that is `None` at runtime.
This should be particularly beneficial for Storcon, which stores
JSON-serialized `models::TenantConfig` in its DB.
# Behavior Changes
This PR changes the serialization behavior: we omit `None` fields
instead of serializing `$fieldname: null`.
So it's a data format change (see section on compatibility below).
And it changes API responses from Storcon and Pageserver.
## API Response Compatibility
Storcon returns the location description.
Afaik it is passed through into
- storcon_cli output
- storcon UI in console admin UI
These outputs will no longer contain `$fieldname: null` values,
which de-bloats the output (good).
But in storcon UI, it also serves as an editor "default", which
will be eliminated after a storcon with this PR is released.
## Data Format Compatibility
Backwards compat: new software reading old serialized data will
deserialize to the same runtime value because all the field types
are exactly the same and `skip_serializing_if` does not affect
deserialization.
Forward compat: old software reading data serialized by new software
will map absent fields in the serialized form to the runtime value
`Option::None`. This is serde default behavior, see this playground
to convince yourself:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=f7f4e1a169959a3085b6158c022a05eb
The `serde(with="humantime_serde")` attribute however behaves strangely:
if used on an `Option<Duration>`, it still requires the field to be
present, unlike the serde default behavior shown in the previous
paragraph. The workaround is to set `serde(default)`. Previously it was
set on each individual field, but we do have the container attribute, so
set it there. This requires deriving a `Default` impl, which, because
all fields are `Option`, is non-magic.
See my notes here:
https://gist.github.com/problame/eddbc225a5d12617e9f2c6413e0cf799
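A minimal sketch of the resulting attribute layout, with hypothetical field names rather than the real `TenantConfig` fields:

```rust
use std::time::Duration;
use serde::{Deserialize, Serialize};

// Hypothetical fields for illustration only.
#[derive(Default, Serialize, Deserialize)]
#[serde(default)] // container-level default: absent fields become None on deserialize
struct TenantConfigSketch {
    // Omitted from the serialized form when None, instead of `"compaction_threshold": null`.
    #[serde(skip_serializing_if = "Option::is_none")]
    compaction_threshold: Option<usize>,

    // humantime_serde on Option<Duration> would otherwise require the field to be
    // present; the container-level `default` above is the workaround described here.
    #[serde(with = "humantime_serde", skip_serializing_if = "Option::is_none")]
    pitr_interval: Option<Duration>,
}
```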
# Future Work
We should have separate types (& crates) for
- runtime configuration types (e.g. PageServerConf::tenant_config,
AttachedLocationConf)
- the `config-v1` pageserver local disk file format
- the `mgmt API`
- `pageserver.toml`
Right now they all use the same type, which is convenient but makes it
hard to reason about compatibility breakage.
# Refs
- corresponding docs.neon.build PR
https://github.com/neondatabase/docs/pull/470
## Problem
#11061 changed how artifacts for releases are built, by
reusing/retagging the artifacts from release PRs. This resulted in the
BUILD_TAG that's baked into the images not being as expected.
Context: https://neondb.slack.com/archives/C08JBTT3R1Q/p1742333300129069
## Summary of changes
Set BUILD_TAG to the release tag of the upcoming release when running
inside release PRs.
In
-
https://github.com/neondatabase/neon/pull/10993#issuecomment-2690428336
I added infinite retries for buffered writer flush IOs, primarily to
gracefully handle ENOSPC but more generally so that the buffered writer
is not left in a state where reads from the surrounding InMemoryLayer
cause panics.
However, I didn't add cancellation sensitivity, which is concerning
because then there is no way to detach a timeline/tenant that is
encountering the write IO errors.
That’s a legitimate scenario in the case of some edge case bug.
See the #10993 description for details.
This PR
- first makes flush loop infallible, enabled by infinite retries
- then adds sensitivity to `Timeline::cancel` to the flush loop, thereby
making it fallible in one specific way again
- finally fixes the InMemoryLayer/EphemeralFile/BufferedWriter
amalgam to remain read-available after the flush loop is cancelled.
The support for read-availability after cancellation is necessary so
that reads from the InMemoryLayer that are already queued up behind the
RwLock that wraps the BufferedWriter won't panic because of the
`mutable=None` that we leave behind in case the flush loop gets
cancelled.
# Alternatives
One might think that we can only ship the change for read-availability
if flush encounters an error, without the infinite retrying and/or
cancellation sensitivity complexity.
The problem with that is that read-availability sounds good but is
really quite useless, because we cannot ingest new WAL without a
writable InMemoryLayer. Thus, very soon after we transition to read-only
mode, reads from compute are going to wait anyway, but on `wait_lsn`
instead of the RwLock, because ingest isn't progressing.
Thus, having the infinite flush retries still makes more sense because
they're just "slowness" to the user, whereas wait_lsn is hard errors.
## Problem
https://github.com/neondatabase/neon/pull/11210 migrated pushing images
to ghcr. Unfortunately, it was incomplete in using images from ghcr,
which resulted in a few places referencing the ghcr build-tools image,
while trying to use docker hub credentials.
## Summary of changes
Use build-tools image from ghcr consistently.
## Problem
The pipelines after release merges are slower than they need to be at
the moment. This is because some kinds of tests/checks run on all kinds
of pipelines, even though they only matter in some of those.
## Summary of changes
Run `check-codestyle-{rust,python,jsonnet}`, `build-and-test-locally`
and `trigger-e2e-tests` only on regular PRs, not release PR or pushes to
main or release branches.
Both crates seem well maintained. x509-cert is part of the high quality
RustCrypto project that we already make heavy use of, and I think it
makes sense to reduce the dependencies where possible.
## Problem
Docker Hub has new rate limits coming up, and to avoid problems coming
with those we're switching to GHCR.
## Summary of changes
- Push images to GHCR initially and distribute them from there
- Use images from GHCR in docker-compose
There is no functional change here. We move safekeeper related code from
`service.rs` to `service/safekeeper_service.rs`, so that safekeeper
related stuff is contained in a single file. This also helps with
preventing `service.rs` from growing even further.
Part of #9011.
## Problem
The test is flaky because of the same reasons as described in
https://github.com/neondatabase/neon/issues/11177.
The test has already suppressed these `WARN` and `ERROR` log messages,
but the regexp didn't match all possible errors.
## Summary of changes
- Change regexp to suppress all possible allowed error log messages.
PRs #10891 and #10902 have time-bounded the safekeeper heartbeating of
the storage controller. Those timeouts were not meant to be permanent,
but temporary until we figured out the reasons for the safekeeper
heartbeating causing problems.
Now they are better understood and resolved. A comment is
[here](https://github.com/neondatabase/cloud/issues/24396#issuecomment-2679342929),
but most importantly, we've had:
* #10954 to send heartbeats concurrently (before, the issue was that we
sent them sequentially, so the total time was the number of nodes times
the timeout; now the total time is the maximum across all the things we
are heartbeating)
* work to actually make heartbeats work and not error, i.e. JWT rollout
for storcon, not sending heartbeats to decommissioned safekeepers,
removal of decommissioned safekeepers from the databases
Part of https://github.com/neondatabase/cloud/issues/25473
## Problem
The unlogged build mode is used for GIST/SPGIST/GIN/HNSW indexes.
In this mode we first change the relation persistence to
`RELPERSISTENCE_UNLOGGED` and save the pages on local disk.
But we do not save unlogged relations in the LFC.
This may cause fetching an incorrect value from the LFC if the
relfilenode is reused.
## Summary of changes
Save modified pages in the LFC at the second stage of the unlogged build
(when the modified pages are WAL-logged).
There is no need to save pages in the LFC at the first phase because
they will in any case be overwritten with an assigned LSN at the second
phase.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Adds a basic test that makes the storcon issue explicit creation of a
timeline on safekeepers (main storcon PR in #11058). It was adapted from
`test_explicit_timeline_creation` from #11002.
Also, do a bunch of fixes needed to get the test to work (the API
definitions weren't correct), and log more stuff when we can't create a
new timeline due to no safekeepers being active.
Part of #9011
---------
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
Work on https://github.com/neondatabase/cloud/issues/23721 and
https://github.com/neondatabase/cloud/issues/23714
Depends on https://github.com/neondatabase/neon/pull/11111
- Add `/configure_telemetry` API endpoint
- Support second rsyslog configuration for Postgres logs export
- Enable logs export when compute feature is enabled and configure
Postgres to send logs to syslog
I have used `/configure_telemetry` name because in the future I see it
also being used for configuring a `pg_tracing` extension to export
traces. Let me know if you'd rather have these APIs separate. In this
case we can rename it to `/configure_rsyslog`.
## Problem
https://github.com/neondatabase/neon/actions/runs/13894288475/job/38871819190
shows the "Add fast-fordward label to PR to trigger fast-forward merge"
job being skipped. This is due to not using the right variable for
checking which branch the merge queue is merging into.
## Summary of changes
Use the `branch` output of the `meta` task for checking the target
branch of a merge group.
## Problem
In an unlogged index build (used for GIST/SPGIST/GIN indexes), files
are created on disk and then removed at the end.
This contradicts the logic of DEBUG_COMPARE_LOCAL mode.
## Summary of changes
Do not create and unlink files in unlogged builds in DEBUG_COMPARE_LOCAL
mode.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
In the previous PR #11045, one edge case wasn't covered: when an ident
contains only one `$`, we were picking `$$` as a 'wrapper'. Yet, when
this `$` is at the beginning or at the end of the ident, we end up with
`$$$` in a row, which breaks the escaping.
## Summary of changes
Start from `x` tag instead of a blank string.
Slack:
https://neondb.slack.com/archives/C08HV951W2W/p1742076675079769?thread_ts=1742004205.461159&cid=C08HV951W2W
## Problem
Serial/numeric IDs lead to collisions, which is not critical but looks
awkward.
Previous discussion:
https://neondb.slack.com/archives/C033A2WE6BZ/p1741891345869979
## Summary of changes
Suggest using the `YYYY-MM-DD` prefix, which i) has less chance of
collision; ii) provides out-of-the-box lexicographic sorting; iii) even
if it collides, it's not a big deal -- it just means two RFCs were
started on the same day.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
... to better match the workload characteristics of real Neon customers
## Problem
We analyzed workloads of large Neon users and want to extend the oltp
workload to include characteristics seen in those workloads.
## Summary of changes
- for the re-used branch, delete the rows inserted by the last run
- adjust expected run-time (time-outs) in GitHub workflow
- add queries that expose the prefetch getpages path
- add I/U/D transactions for another table (so far the workload was
insert/append-only)
- add an explicit vacuum analyze step and measure its time
- add reindex concurrently step and measure its time (and take care that
this step succeeds even if prior reindex runs have failed or were
canceled)
- create a second connection string for the pooled connection that
removes the `-pooler` suffix from the hostname because we want to run
long-running statements (database maintenance) and bypass the pooler
which doesn't support unlimited statement timeout
## Test run
https://github.com/neondatabase/neon/actions/runs/13851772887/job/38760172415
## Problem
There is a rare race between controller graceful deployment and shard
splitting where we may incorrectly both abort _and_ complete the split
(on different pods), and thereby leave no shards at all in the database.
Related: #11254
## Summary of changes
- In complete_shard_split, refuse to delete anything if child shards are
not found
## Problem
Python's `jsonnet` 0.20.0 doesn't support Python 3.13, so we have a
couple of tests xfailed because of that.
## Summary of changes
- Bump `jsonnet` to `0.21.0rc2` which supports Python 3.13
- Unxfail `test_sql_exporter_metrics_e2e` and
`test_sql_exporter_metrics_smoke` on Python 3.13
- add pgaudit_gc thread to compute_ctl
to clean up old pgaudit logs if they exist.
pgaudit can rotate files, but it doesn't delete the old files
- Add AUDIT_LOG_DIR_SIZE metric to compute_ctl
to track the size of the audit log directory in bytes.
- Fix permissions for rsyslog state files directory
## Problem
We exposed the direction tag in #10925 but didn't actually include the
ingress tag in the export to allow for an adaptation period.
## Summary of changes
We now export the ingress direction.
# Refs
- fixes https://github.com/neondatabase/neon/issues/11228
# Problem High-Level
When storcon validates whether a `ScheduleOptimizationAction` should be
applied, it retrieves the `tenant_secondary_status` to determine whether
a secondary is ready for the optimization.
When collecting results, it associates secondary statuses with the wrong
optimization actions in the batch of optimizations that we're
validating.
The result is that we make the decision for shard/location X based on
the SecondaryStatus of a random secondary location Y in the current
batch of optimizations.
A possible symptom is an early cutover, as seen in this engineering
investigation here:
- https://github.com/neondatabase/cloud/issues/25734
# Problem Code-Level
This code here in `optimize_all_validate`
97e2e27f68/storage_controller/src/service.rs (L7012-L7029)
zips `want_secondary_status` with the Vec returned from
`tenant_for_shards_api`.
However, the Vec returned from `tenant_for_shards_api` is not ordered
(it uses FuturesUnordered internally).
# Solution
Sort the Vec in input order before returning it.
`optimize_all_validate` was the only caller affected by this problem.
While at it, also future-proof similar-looking function
`tenant_for_shards`.
None of its callers care about the order, but this type of function
signature is easy to use incorrectly.
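A rough sketch of the sorting fix, assuming a generic helper rather than the actual `tenant_for_shards_api` signature:

```rust
use futures::stream::{FuturesUnordered, StreamExt};

// Tag each future with its input index so that results, which arrive in
// completion order from FuturesUnordered, can be restored to input order
// before being zipped with per-shard state.
async fn in_input_order<T, F>(futs: Vec<F>) -> Vec<T>
where
    F: std::future::Future<Output = T>,
{
    let mut unordered: FuturesUnordered<_> = futs
        .into_iter()
        .enumerate()
        .map(|(i, f)| async move { (i, f.await) })
        .collect();

    let mut results = Vec::new();
    while let Some(item) = unordered.next().await {
        results.push(item);
    }
    // Sort back to input order before returning.
    results.sort_by_key(|(i, _)| *i);
    results.into_iter().map(|(_, v)| v).collect()
}
```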
# Future Work
Avoid the additional iteration, map, and allocation.
Change API to leverage AsyncFn (async closure).
And/or invert `tenant_for_shards_api` into a Future ext trait / iterator
adaptor thing.
Adds API definitions for the safekeeper API endpoints `exclude_timeline`
and `term_bump`. Also does a bugfix to return the correct type from
`delete_timeline`.
Part of #8614
## Problem
This is already disabled in production, as it is replaced by L0 flush
delays. It will be removed in a later PR, once the config option is no
longer specified in production.
## Summary of changes
Disable `l0_flush_wait_upload` by default.
## Problem
`test_metadata_image_creation` became flaky with #11212, since image
compaction may yield to L0 compaction.
## Summary of changes
Set `NoYield` when compacting in tenant tests.
## Problem
We noticed that error metrics didn't show for some services with light
load. This is not great and can cause problems for dashboards/alerts
## Summary of changes
Pre-initialise some metricvecs.
We want to switch away from and deprecate the `--compute-hook-url` param
for the storcon in favour of `--control-plane-url` because it allows us
to construct urls with `notify-safekeepers`.
This PR switches the pytests and neon_local from a
`control_plane_compute_hook_api` to a new param named
`control_plane_hooks_api` which is supposed to point to the parent of
the `notify-attach` URL.
We still support reading the old url from disk to not be too disruptive
with existing deployments, but we just ignore it.
Also add docs for the `notify-safekeepers` upcall API.
Follow-up of #11173
Part of https://github.com/neondatabase/neon/issues/11163
## Problem
`l0_flush_delay_threshold` has already been set to 30 in production for
a couple of weeks. Let's harmonize the default.
## Summary of changes
Update `DEFAULT_L0_FLUSH_DELAY_FACTOR` to 3 such that the default
`l0_flush_delay_threshold` is `3 * compaction_threshold`.
This differs from the production setting, which is hardcoded to 30 (with
`compaction_threshold` at 10), and is more appropriate for any tenants
that have custom `compaction_threshold` overrides.
## Problem
ef0d4a48a adjusted how we build container images and how we push them,
and the neon-test-extensions image was overlooked. Additionally, it was
also missed in 1f0dea9a1, which pushed our container images to GHCR.
## Summary of changes
Push neon-test-extensions to GHCR and also push release tags for it.
## Problem
DEBUG_COMPARE_LOCAL is not supported in neon_zeroextend, which was added
in PG16.
## Summary of changes
Add support for DEBUG_COMPARE_LOCAL in neon_zeroextend.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
# Problem
We leave too few observability breadcrumbs in the case where wait_lsn is
exceptionally slow.
# Changes
- refactor: extract the monitoring logic out of `log_slow` into
`monitor_slow_future`
- add global + per-timeline counter for time spent waiting for wait_lsn
- It is updated while we're still waiting, similar to what we do for
page_service response flush.
- add per-timeline counterpair for started & finished wait_lsn count
- add slow-logging to leave breadcrumbs in logs, not just metrics
For the slow-logging, we need to consider not flooding the logs during a
broker or network outage/blip.
The solution is a "log-streak-level" concurrency limit per timeline.
At any given time, there is at most one slow wait_lsn that is logging
the "still running" and "completed" sequence of logs.
Other concurrent slow wait_lsn's don't log at all.
This leaves at least one breadcrumb in each timeline's logs if some
wait_lsn was exceptionally slow during a given period.
The full degree of slowness can then be determined by looking at the
per-timeline metric.
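A minimal sketch of the log-streak limit, under assumed names and not the actual `monitor_slow_future` implementation:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

// Only the task that wins the single per-timeline permit logs the slow-wait
// breadcrumbs; other concurrent slow waiters stay silent.
async fn wait_with_slow_logging<F: std::future::Future>(
    fut: F,
    slow_after: Duration,
    log_permits: &Arc<Semaphore>, // one permit per timeline
) -> F::Output {
    tokio::pin!(fut);
    match tokio::time::timeout(slow_after, &mut fut).await {
        Ok(output) => output,
        Err(_) => {
            // Slow: try to become the single logger for this timeline.
            let permit = log_permits.try_acquire().ok();
            if permit.is_some() {
                tracing::info!("wait_lsn still running after {slow_after:?}");
            }
            let output = fut.await;
            if permit.is_some() {
                tracing::info!("slow wait_lsn completed");
            }
            output
        }
    }
}
```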
# Performance
Reran the `bench_log_slow` benchmark, no difference, so, existing call
sites are fine.
We do use a Semaphore, but only try_acquire it _after_ things have
already been determined to be slow. So, no baseline overhead
anticipated.
# Refs
-
https://github.com/neondatabase/cloud/issues/23486#issuecomment-2711587222
Closes: https://github.com/neondatabase/cloud/issues/22998
If the control plane reports that TLS should be used, load the
certificates, make sure postgres uses them, and watch for updates.
Procedure:
1. Load certificates
2. Reconfigure postgres/pgbouncer
3. Loop on a timer until certificates have loaded
4. Go to 1
Notes:
1. We only run this procedure if requested on startup by control plane.
2. We needed to compile pgbouncer with openssl enabled
3. Postgres doesn't allow TLS keys to be globally accessible; they must
be readable only by the postgres user. I couldn't convince the
autoscaling team to let me put this logic into the VM settings, so
instead compute_ctl will copy the keys to be read-only by postgres.
4. To mitigate a race condition, we also verify that the key matches the
cert.
## Problem
There was a panic on staging because compaction didn't find any keys.
This is possible if none of the layers selected for compaction contains
any keys within the current shard.
## Summary of changes
Make the panic an error. In the future, we can try creating an empty
image layer so that GC can clean up those layers. For now, we can only
rely on shard ancestor compaction to remove this data.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`compaction_l0_first` has already been enabled in production for a
couple of weeks.
## Summary of changes
Enable `compaction_l0_first` by default.
Also set `CompactFlags::NoYield` in `timeline_checkpoint_handler`, to
ensure explicitly requested compaction runs to completion. This endpoint
is mainly used in tests, and caused some flakiness where tests expected
compaction to complete.
## Problem
#11061 introduced code fetching previous releases. #11151 introduced jq
error handling, which has also been applied in #11061, but parentheses
were missed.
## Summary of changes
Add parentheses around the error handling code.
## Problem
Autosplits always request `DEFAULT_STRIPE_SIZE` for splits. However,
splits do not allow changing the stripe size of already-sharded tenants,
and will error out if it differs.
In #11168, we are changing the stripe size, which could hit this when
attempting to autosplit already sharded tenants.
Touches #11168.
## Summary of changes
Pass `new_stripe_size: None` when autosplitting already sharded tenants.
Otherwise, pass `DEFAULT_STRIPE_SIZE` instead of the shard identity's
stripe size, since we want to use the current default rather than an
old, persisted default.
# Fix metric_unit length in test_compute_ctl_api.py
## Description
This PR changes the metric_unit from "microseconds" to "μs" in
test_compute_ctl_api.py to fix the issue where perf test results were
not being stored in the database due to the string exceeding the 10
character limit of the metric_unit column in the perf_test_results
table.
## Problem
As reported in Slack, the perf test results were not being uploaded to
the database because the "microseconds" string (12 characters) exceeds
the 10 character limit of the metric_unit column in the
perf_test_results table.
## Solution
Replace "microseconds" with "μs" in all metric_unit parameters in the
test_compute_ctl_api.py file.
## Testing
The changes have been committed and pushed. The PR is ready for review.
Link to Devin run:
https://app.devin.ai/sessions/e29edd672bd34114b059915820e8a853
Requested by: Peter Bendel
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: peterbendel@neon.tech <peterbendel@neon.tech>
## Problem
A small adjustment in #11061 broke the lint-release-pr.sh script, and
the new version was neither tested nor linted. This has been done now,
the script is once again tested and passing `shellcheck`.
## Summary of changes
Add missing `el` of `elif` condition chain.
## Problem
#11061 changed release pr creation, and I missed that the workflow will
checkout a would-be-merge of the rc branch and the release branch
instead of the head ref, unless explicitly instructed otherwise.
## Summary of changes
Check out head ref for linting the release PRs.
## Problem
#11061 changed release pr creation, and I missed that creating PRs using
`gh` in non-interactive environments *requires* `--body` instead of
defaulting to an empty body.
## Summary of changes
Explicitly set an empty body when creating release PRs.
## Problem
#11061 changed release PR creation, and I missed that we need to
explicitly fetch the whole history so that the relevant git refs and
objects are available.
## Summary of changes
- Fetch all git refs including history by setting fetch-depth to 0
- Reference release branch as a remote branch, because we haven't
checked it out locally
## Problem
close https://github.com/neondatabase/neon/issues/10310
## Summary of changes
This patch adds a new behavior for the detach_ancestor API: detach with
a multi-level ancestor and no reparenting. Though we could potentially
support multi-level + reparenting / single-level + no-reparenting in the
future, they are not required for the recovery/snapshot epic, so I'd
prefer to keep things simple for now and handle only the old mode and
the new one instead of supporting the full feature matrix.
I only added a test case for successful detaching instead of testing
failures. I'd like to get this into staging and add more tests in the
future.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
When we release our components, we perform builds in the release PR,
then test the components, then merge the PR, and then build everything
*again*, run tests *again*, and only then start deployments.
To speed things up, we want to perform builds and run tests in the PR,
and start deployments using the existing artifacts from the release PR.
To make that possible, we need to have both CI pipelines running on the
same commit hash, which requires fast-forwarding the release branch.
That only works if we have a commit in the PR that has the current
release branch state as an ancestor.
## Summary of changes
- Changes to release PR creation:
- Remove templates and automatic bodies for release PRs. The previous
template wasn't used anymore, and the automatic body we created in the
pipeline didn't contain any useful content anymore after the changes
here.
- Make it possible to select the source branch. For releases that aren't
cut from `main`, like https://github.com/neondatabase/neon/pull/11051,
we need a way to trigger the new flow from a different branch.
- Determine `release-branch` automatically from the component name
instead of passing that as well.
- Changes to the merge queue job:
- Rename `get-changed-files` to `meta` in preparation of additional data
being fetched as part of that job
- Fail the merge queue if we're trying to merge into a branch other than
main - this is to prevent non-fast-forward merges.
- Label PRs to branches other than main as `fast-forward`, to trigger
the fast-forward job
- Add a fast-forward job that can be triggered with the `fast-forward`
label that performs a fast-forward merge. This only happens if the PR
has `mergeable_state == clean`, i.e. CI has passed.
- Build and Test on releases now skips building images, skips testing
images and skips triggering e2e tests. We add new tags to the images
from the release PR to tag them as release images, and we push them to
the prod registries.
In our JSON encoding, we only need to know about array types.
Information about composites or enums is not actually used.
Enums are quite popular, so issuing type queries for them when not
needed can add some latency cost for no gain.
## Problem
When a node becomes active, we query its locations and update the
observed state in-place.
This can race with the observed state updates done when processing
reconcile results.
## Summary of changes
The argument for this reconciliation step is that it reduces the need
for background reconciliations.
I don't think this is actually true anymore. There are two cases.
1. Restart of node after drain. Usually the node does not go through the
offline state here, so observed locations
were not marked as none. In any case, there should be a handful of
shards max on the node since we've just drained it.
2. Node comes back online after failure or network partition. When the
node is marked offline, we reschedule everything away from it. When it
later becomes active, the previous observed location is extraneous and
requires a reconciliation anyway.
Closes https://github.com/neondatabase/neon/issues/11148
## Problem
When the storage controller and Pageserver loads tenants from persisted
storage, it uses `ShardIdentity::unsharded()` for unsharded tenants.
However, this replaces the persisted stripe size of unsharded tenants
with the default stripe size.
This doesn't really matter for practical purposes, since the stripe size
is meaningless for unsharded tenants anyway, but can cause consistency
check failures if the persisted stripe size differs from the default.
This was seen in #11168, where we change the default stripe size.
Touches #11168.
## Summary of changes
Carry over the persisted stripe size from `TenantShardPersistence` for
unsharded tenants, and from `LocationConf` on Pageservers.
Also add bounds checks for type casts when loading persisted shard
metadata.
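As a small illustration of the bounds-checked casts (hypothetical field and function names):

```rust
// Fail loudly instead of silently truncating when loading persisted shard
// metadata from the database.
fn shard_count_from_db(raw: i32) -> anyhow::Result<u8> {
    u8::try_from(raw).map_err(|_| anyhow::anyhow!("shard_count {raw} out of range"))
}
```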
## Problem
`info_span!` is only used in a `linux` branch, causing the unused lint
to fire on macOS.
## Summary of changes
Fully qualify the `info_span!` use.
## Problem
`humantime` is unmaintained, we want to migrate to `jiff`, see
https://github.com/neondatabase/neon/issues/11179.
`env_logger` in older versions depend on `humantime`, and newer versions
depend on `jiff`, so we need to update it.
## Summary of changes
Update `env_logger` to the most recent release, which does not depend on
`humantime` anymore.
## Problem
This command was used when onboarding tenants to the storage controller.
We no longer do that, so the command can go.
## Summary of changes
- Remove `storcon_cli tenant-warmup` command
## Problem
Test `test_timeline_archive` is flaky because it makes requests that are
intended to fail. It sometimes leads to warning in pageserver's logs.
More details are in the issue.
- Closes: https://github.com/neondatabase/neon/issues/11177
## Summary of changes
- Suppress such errors.
We want to export performance traces from the pageserver in OTEL format.
End goal is to see them in Grafana.
To this end, there are two changes here:
1. Update the `tracing-utils` crate to allow for explicitly specifying
the export configuration. Pageserver configuration is loaded from a file
on start-up. This allows us to use the same flow for export configs
there.
2. Update the `utils::logging::init` common entry point to set up OTEL
tracing infrastructure if requested. Note that an entirely different
tracing subscriber is used. This is to avoid interference with the
existing tracing set-up. For now, no service uses this functionality.
PR to plug this into the pageserver is
[here](https://github.com/neondatabase/neon/pull/11140).
Related https://github.com/neondatabase/neon/issues/9873
## Problem
This field was retained for backward compat only in
https://github.com/neondatabase/neon/pull/10707.
Once https://github.com/neondatabase/cloud/pull/25233 is released,
nothing external will be reading this field.
Internally, this was a mandatory field so storage controller is still
trying to decode it, so we must do this removal in two steps: this PR
makes the field optional, and after one release we can fully remove it.
Related: https://github.com/neondatabase/cloud/issues/24250
## Summary of changes
- Rename field to `_unused`
- Remove field from swagger
- Make field optional
The upgrade of the `notify` crate to 8.0.0 caused severe performance regressions in the
start_postgres_ms metric, which measures the time it takes from execing
Postgres to the time Postgres marks itself as ready in the
postmaster.pid file. We use the notify crate to watch for changes in the
pgdata directory and the postmaster.pid file.
Signed-off-by: Tristan Partin <tristan@neon.tech>
# Problem
If the Pageserver ingest path
(InMemoryLayer=>EphemeralFile=>BufferedWriter)
encounters ENOSPC or any other write IO error when flushing the mutable
buffer
of the BufferedWriter, the buffered writer is left in a state where
subsequent _reads_
from the InMemoryLayer will cause a `must not use after we returned
an error` panic.
The reason is that
1. the flush background task bails on flush failure,
2. causing the `FlushHandle::flush` function to fail at channel.recv()
and
3. causing the `FlushHandle::flush` function to bail with the flush
error,
4. leaving its caller `BufferedWriter::flush` with
`BufferedWriter::mutable = None`,
5. once the InMemoryLayer's RwLock::write guard is dropped, subsequent
reads can enter,
6. those reads find `mutable = None` and cause the panic.
# Context
It has always been the contract that writes against the BufferedWriter
API
must not be retried because the writer/stream-style/append-only
interface makes no atomicity guarantees ("On error, did nothing or a
piece of the buffer get appended?").
The idea was that the error would bubble up to upper layers that can
throw away the buffered writer and create a new one. (See our [internal
error handling policy document on how to handle e.g.
`ENOSPC`](c870a50bc0/src/storage/handling_io_and_logical_errors.md (L36-L43))).
That _might_ be true for delta/image layer writers, I haven't checked.
But it's certainly not true for the ingest path: there are no provisions
to throw away an InMemoryLayer that encountered a write error and
reingest the WAL already written to it.
Adding such higher-level retries would involve resetting
last_record_lsn to a lower value and restarting walreceiver. The code
isn't flexible enough to do that, and such complexity likely isn't worth
it given that write errors are rare.
# Solution
The solution in this PR is to retry _any_ failing write operation
_indefinitely_ inside the buffered writer flush task, except of course
those that are fatal as per `maybe_fatal_err`.
Retrying indefinitely ensures that `BufferedWriter::mutable` is never
left `None` in the case of IO errors, thereby solving the problem
described above.
It's a clear improvement over the status quo.
However, while we're retrying, we build up backpressure because the
`flush` is only double-buffered, not infinitely buffered.
Backpressure here is generally good to avoid resource exhaustion, **but
blocks reads** and hence stalls GetPage requests because InMemoryLayer
reads and writes are mutually exclusive.
That's orthogonal to the problem that is solved here, though.
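A minimal sketch of the retry shape, leaving out the `maybe_fatal_err` classification and backpressure handling of the real flush task (names are assumptions):

```rust
// Retry a write IO operation indefinitely with a fixed backoff, so the writer
// is never left half-torn-down. The real flush task first classifies fatal
// errors and respects them; this sketch omits that step.
async fn flush_with_infinite_retries<F, Fut>(mut op: F)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = std::io::Result<()>>,
{
    loop {
        match op().await {
            Ok(()) => return,
            Err(e) => {
                tracing::error!("flush failed, retrying: {e}");
                tokio::time::sleep(std::time::Duration::from_secs(1)).await;
            }
        }
    }
}
```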
## Caveats
Note that there are some remaining conditions in the flush background
task where it can bail with an error. I have annotated one of them with
a TODO comment.
Hence the `FlushHandle::flush` is still fallible and hence the overall
scenario of leaving `mutable = None` on the bail path is still possible.
We can clean that up in a later commit.
Note also that retrying indefinitely is great for temporary errors like
ENOSPC but likely undesirable in case the `std::io::Error` we get is
really due to higher-level logic bugs.
For example, we could fail to flush because the timeline or tenant
directory got deleted and VirtualFile's reopen fails with ENOENT.
Note finally that cancellation is not respected while we're retrying.
This means we will block timeline/tenant/pageserver shutdown.
The reason is that the existing cancellation story for the buffered
writer background task was to recv from flush op channel until the
sending side (FlushHandle) is explicitly shut down or dropped.
Failing to handle cancellation carries the operational risk that even if
a single timeline gets stuck because of a logic bug such as the one laid
out above, we must still restart the whole pageserver process.
# Alternatives Considered
As pointed out in the `Context` section, throwing away a InMemoryLayer
that encountered an error and reingesting the WAL is a lot of complexity
that IMO isn't justified for such an edge case.
Also, it's wasteful.
I think it's a local optimum.
A more general and simpler solution for ENOSPC is to `abort()` the
process and run eviction on startup before bringing up the rest of
pageserver.
I argued for it in the past, the pro arguments are still valid and
complete:
https://neondb.slack.com/archives/C033RQ5SPDH/p1716896265296329
The trouble at the time was implementing eviction on startup.
However, maybe things are simpler now that we are fully storcon-managed
and all tenants have secondaries.
For example, if pageserver `abort()`s on ENOSPC and then simply doesn't
respond to storcon heartbeats while we're running eviction on startup,
storcon will fail tenants over to the secondary anyway, giving us all
the time we need to clean up.
The downside is that if there's a systemic space management bug, the
above proposal will just propagate the problem to other nodes. But I imagine
that because of the delays involved with filling up disks, the system
might reach a half-stable state, providing operators more time to react.
# Demo
Intermediary commit `a03f335121480afc0171b0f34606bdf929e962c5` is demoed
in this (internal) screen recording:
https://drive.google.com/file/d/1nBC6lFV2himQ8vRXDXrY30yfWmI2JL5J/view?usp=drive_link
# Perf Testing
Ran `bench_ingest` on tmpfs, no measurable difference.
Spans are uniquely owned by the flush task, and the span stack isn't too
deep, so, enter and exit should be cheap.
Plus, each flush takes ~150us with direct IO enabled, so, not _that_
high frequency event anyways.
# Refs
- fixes https://github.com/neondatabase/neon/issues/10856
## Problem
This should also resolve the test flakiness of `test_gc_feedback`.
close https://github.com/neondatabase/neon/issues/11144
## Summary of changes
If `NoYield` is set, do not yield in gc-compaction.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`humantime` is not maintained and `cargo deny check` fails
- Will be addressed in https://github.com/neondatabase/neon/issues/11179
## Summary of changes
Ignore RUSTSEC-2025-0014 advisory for now
## Problem
Investigate https://github.com/neondatabase/neon/issues/11159
## Summary of changes
This doesn't fix the issue, but at least we can narrow down the cause
next time it happens by logging ancestor referenced layer cnt even if
it's 0.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
https://github.com/neondatabase/neon/issues/10851
## Summary of changes
Do some refactoring before making walproposer generations aware.
- Rename SS_VOTING to SS_WAIT_VOTING, SS_IDLE to SS_WAIT_ELECTED
- Continue to get rid of epochs: rename GetEpoch to GetLastLogTerm,
donorEpoch to donorLastLogTerm
- Instead of counting n_votes, n_connected, introduce explicit
WalProposerState (collecting terms / voting / elected). Refactor out
TermsCollected and VotesCollected; they will determine state transitions
differently depending on whether generations are enabled or not.
There is no new logic in this PR and thus no new tests.
## Problem
In #11122, we want to split shards once the logical size of the largest
timeline exceeds a split threshold. However, `get_top_tenants` currently
only returns `max_logical_size`, which tracks the max _total_ logical
size of a timeline across all shards.
This is problematic, because the storage controller needs to fetch a
list of N tenants that are eligible for splits, but the API doesn't
currently have a way to express this. For example, with a split
threshold of 1 GB, a tenant with `max_logical_size` of 4 GB is eligible
to split if it has 1 or 2 shards, but not if it already has 4 shards. We
need to express this in per-shard terms, otherwise the `get_top_tenants`
endpoint may end up only returning tenants that can't be split, blocking
splits entirely.
Touches https://github.com/neondatabase/neon/pull/11122.
Touches https://github.com/neondatabase/cloud/issues/22532.
## Summary of changes
Add `TenantShardItem::max_logical_size_per_shard` containing
`max_logical_size / shard_count`, and
`TenantSorting::MaxLogicalSizePerShard` to order and filter by it.
Fixes https://github.com/neondatabase/serverless/issues/144
When tables have enums, we need to perform type queries for that data.
We cache these query statements for performance reasons. In Neon RLS, we
run "discard all" for security reasons, which discards all the
statements. When we need to type check again, the statements are no
longer valid.
This fixes it by discarding our cached statements as well.
I've also added some new logs and error types to monitor this. Currently
we don't see the prepared statement errors in our logs.
# Refs
- fixes https://github.com/neondatabase/neon/issues/6107
# Problem
`VirtualFile` currently parses the path it is opened with to identify
the `tenant,shard,timeline` labels to be used for the `STORAGE_IO_SIZE`
metric.
Further, for each read or write call to VirtualFile, it uses
`with_label_values` to retrieve the correct metrics object, which under
the hood is a global hashmap guarded by a parking_lot mutex.
We perform tens of thousands of reads and writes per second on every
pageserver instance; thus, doing the mutex lock + hashmap lookup is
wasteful.
# Changes
Apply the technique we use for all other timeline-scoped metrics to
avoid the repeat `with_label_values`: add it to `TimelineMetrics`.
Wrap `TimelineMetrics` into an `Arc`.
Propagate the `Arc<TimelineMetrics>` down to `VirtualFile`, and use
`Timeline::metrics::storage_io_size`.
To avoid contention on the `Arc<TimelineMetrics>`'s refcount atomics
between different connection handlers for the same timeline, we wrap it
into another Arc.
To avoid frequent allocations, we store that Arc<Arc<TimelineMetrics>>
inside the per-connection timeline cache.
Preliminary refactorings to enable this change:
- https://github.com/neondatabase/neon/pull/11001
- https://github.com/neondatabase/neon/pull/11030
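A rough sketch of the technique described in the changes above, with assumed metric and type names rather than the actual `TimelineMetrics`:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounter, IntCounterVec};

static STORAGE_IO_SIZE: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "storage_io_size_sketch",
        "bytes of storage IO",
        &["tenant_id", "shard_id", "timeline_id"]
    )
    .unwrap()
});

// Resolve the labelled child once per timeline and keep it around; the per-IO
// hot path then only does an atomic add, not a mutex + hashmap lookup.
struct TimelineMetricsSketch {
    storage_io_size: IntCounter,
}

impl TimelineMetricsSketch {
    fn new(tenant: &str, shard: &str, timeline: &str) -> Self {
        Self {
            storage_io_size: STORAGE_IO_SIZE.with_label_values(&[tenant, shard, timeline]),
        }
    }

    fn on_write(&self, nbytes: u64) {
        self.storage_io_size.inc_by(nbytes);
    }
}
```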
# Performance
I ran the benchmarks in
`test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py`
on an `i3en.3xlarge` because that's what we currently run them on.
None of the benchmarks shows a meaningful difference in latency or
throughput or CPU utilization.
I would have expected some improvement in the
many-tenants-one-client-each workload because they all hit that hashmap
constantly, and clone the same `UintCounter` / `Arc` inside of it.
But apparently the overhead is minuscule compared to the remaining work
we do per getpage.
Yet, since the changes are already made, the added complexity is
manageable, and the perf overhead of `with_label_values` demonstrable in
micro-benchmarks, let's have this change anyway.
Also, propagating TimelineMetrics through RequestContext might come in
handy down the line.
The micro-benchmark that demonstrates perf impact of
`with_label_values`, along with other pitfalls and mitigation techniques
around the `metrics`/`prometheus` crate:
- https://github.com/neondatabase/neon/pull/11019
# Alternative Designs
An earlier iteration of this PR stored an `Arc<Arc<Timeline>>` inside
`RequestContext`.
The problem is that this risks reference cycles if the RequestContext
gets stored in an object that is owned directly or indirectly by
`Timeline`.
Ideally, we wouldn't be using this mess of Arc's at all and propagate
Rust references instead.
But tokio requires tasks to be `'static`, and so, we wouldn't be able to
propagate references across task boundaries, which is incompatible with
any sort of fan-out code we already have (e.g. concurrent IO) or future
code (parallel compaction).
So, opt for Arc for now.
We use the `metrics` / `prometheus` crate in the Pageserver code base.
This PR demonstrates
- typical performance pitfalls with that crate
- our current set of techniques to avoid most of these pitfalls.
refs
- https://github.com/neondatabase/neon/issues/10948
- https://github.com/neondatabase/neon/pull/7202
- I applied the `label_values__cache_label_values_lookup` technique
there.
- It didn't yield measurable results in high-level benchmarks though.
This PR extends the storcon with basic safekeeper management of
timelines, mainly timeline creation and deletion. We want to make the
storcon manage safekeepers in the future. Timeline creation is
controlled by the `--timelines-onto-safekeepers` flag.
1. it adds the `timelines` and `safekeeper_timeline_pending_ops` tables
to the storcon db
2. it extends the code for timeline creation and deletion
3. it adds per-safekeeper reconciler tasks
TODO:
* maybe not immediately schedule reconciliations for deletions but have
a prior manual step
* tenant deletions
* add exclude API definitions (probably separate PR)
* how to choose safekeeper to do exclude on vs deletion? this can be a
bit hairy because the safekeeper might go offline in the meantime.
* error/failure case handling
* tests (cc test_explicit_timeline_creation from #11002)
* single safekeeper mode: we often only have one SK (in tests for
example)
* `notify-safekeepers` hook:
https://github.com/neondatabase/neon/issues/11163
TODOs implemented:
* cancellations of enqueued reconciliations on a per-timeline basis,
helpful if there is an ongoing deletion
* implement pending ops overwrite behavior
* load pending operations from db
RFC section for important reading:
[link](https://github.com/neondatabase/neon/blob/main/docs/rfcs/035-safekeeper-dynamic-membership-change.md#storage_controller-implementation)
Implements the bulk of #9011
Successor of #10440.
---------
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
## Problem
The async prefetch in LFC PR causes incorrect calculation of LFC
`used_pages` when a page is overwritten.
## Summary of changes
Decrement `used_pages` if a page is overwritten.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Matthias van de Meent <matthias@neon.tech>
This allows for improved decoding of otherwise broken WAL.
## Problem
Currently, if (or when) a WAL record has a wrong prevptr, that breaks
decoding. With this, we don't have to break on that if we decide it's OK
to proceed after that.
## Summary of changes
Use a Neon GUC to allow the system to enable the NEON-specific
skip_lsn_checks option in XLogReader.
## Problem
Partial reads are still problematic. They are stored in the buffer of
the wal decoder and result in gaps being reported too eagerly on the
pageserver side.
## Summary of changes
Previously, we always used the start LSN of the chunk of WAL that was
just read. This patch switches to using the end LSN of the last record
that was decoded in the previous iteration.
## Problem
Pinging groups on slack didn't work, because I didn't use the correct
syntax.
## Summary of changes
Use the correct syntax for pinging groups.
* pprof can also use `prost` as a backend, switch to it as `protobuf`
has no update available but a security issue.
* `paste` is a build time dependency, so add the unmaintained warning as
an exception.
## Problem
`home_shard_count` is not updated on the preferred AZ change.
Closes: https://github.com/neondatabase/neon/issues/10493
## Summary of changes
- Update scheduler stats (node ref counts) on preferred AZ change.
## Problem
See https://neondb.slack.com/archives/C08DE6Q9C3B
Sometimes compute is not able to receive responses from the PS for a
long time (minutes).
I do not think that the problem is on the compute side, but in order to
exclude this possibility I want to see more information about the
connection state on the compute side, particularly the amount of data
cached in the connection buffer.
## Summary of changes
Right now we only dump the state of the socket buffer.
This PR adds printing of the connection buffer state as well.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Previously, remote extensions were not fetched unless they were used in
some other manner. For instance, loading a BM25 index in pg_search
fetches the pg_search extension. However, if on a fresh compute with
pg_search 0.15.5 installed, the user ran `ALTER EXTENSION pg_search
UPDATE TO '0.15.6'` without first using the pg_search extension, we
would not fetch the extension and fail to find an update path.
Signed-off-by: Tristan Partin <tristan@neon.tech>
We were previously only revoking privileges granted by neon_superuser.
However, we need to do it for all grantors.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
As part of the disaster recovery tool. Partly for
https://github.com/neondatabase/neon/issues/9114.
## Summary of changes
* Add a new pageserver API to force patch the fields in index_part and
modify the timeline internal structures.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Storage controller uses http for requests to safekeeper management API.
Closes: https://github.com/neondatabase/cloud/issues/24835
## Summary of changes
- Add `use_https_safekeeper_api` option to storcon to use https api
- Use https for requests to safekeeper management API if this option is
enabled
- Add `ssl_ca_file` option to storcon for ability to specify custom root
CA certificate
## Problem
The current migration API does a live migration, but if the destination
doesn't already have a secondary, that live migration is unlikely to be
able to warm up a tenant properly within its timeout (full warmup of a
big tenant can take tens of minutes).
Background optimisation code knows how to do this gracefully by creating
a secondary first, but we don't currently give a human a way to trigger
that.
Closes: https://github.com/neondatabase/neon/issues/10540
## Summary of changes
- Add `preferred_node` parameter to TenantShard, which is respected by
optimize_attachment
- Modify migration API to have an optional prewarm=true mode, in which we
set preferred_node and call optimize_attachment, rather than directly
modifying the intent state
- Require override_scheduler=true flag if migrating somewhere that is a
less-than-optimal scheduling location (e.g. wrong AZ)
- Add `origin_node_id` to migration API so that callers can ensure
they're moving from where they think they're moving from
- Add tests for the above
The storcon_cli wrapper for this has a 'watch' mode that waits for
eventual cutover. This doesn't show the warmth of the secondary evolving,
because we don't currently have an API for that in the controller, as
the passthrough API only targets attached locations, not secondaries. It
would be straightforward to add later as a dedicated endpoint for
getting secondary status, then extend the storcon_cli to consume that
and print a nice progress indicator.
## Problem
Shard zero needs to track the start LSN of the latest record
in addition to the LSN of the next record to ingest. The former
is included in the basebackup persisted by the compute in WAL.
Previously, empty records were skipped for all shards. This caused
the prev LSN tracking on the PS to fall behind and led to logical
replication issues.
## Summary of changes
Shard zero now receives empty interpreted records for LSN tracking
purposes.
A test is included too.
## Problem
Allow for using _meta.yml with workflow_dispatch event.
## Summary of changes
Handle this event in the run-kind step; fix and update the description
of the run-kind output.
## Problem
This patch makes some initial tweaks as preparation for
https://github.com/neondatabase/cloud/issues/22532, where we will be
introducing additional autosplit logic.
The plan is outlined in
https://github.com/neondatabase/cloud/issues/22532#issuecomment-2706215907.
## Summary of changes
Minor code cleanups and behavioral changes:
* Decide that we'll split based on `max_logical_size` (later possibly
`total_logical_size`).
* Fix a bug that would split the smallest candidate rather than the
largest.
* Pick the largest candidate by `max_logical_size` rather than
`resident_size`, for consistency (debatable).
* Split out `get_top_tenant_shards()` to fetch split candidates.
* Fetch candidates concurrently from all nodes.
* Make `TenantShard.get_scheduling_policy()` return a copy instead of a
reference.
We have introduced the ability to specify safekeeper JWTs for the
storage controller. It now does heartbeats. We now want to also require
the presence of those JWTs.
Let's merge this PR shortly after the release cutoff.
Part of / follow-up of
https://github.com/neondatabase/cloud/issues/24727
## Problem
We just had a regression reported at
https://neondb.slack.com/archives/C08EXUJF554/p1741102467515599, which
clearly came with one of the releases. It's not a huge problem yet, but
it's annoying that we cannot quickly attribute it to a specific commit.
## Summary of changes
Add a very simple `compute_ctl` HTTP API benchmark that does 10k
requests to `/status` and `metrics.json` and reports p50 and p99.
---------
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
## Problem
Part of https://github.com/neondatabase/neon/issues/9114
## Summary of changes
gc-compaction could take a long time in some cases, for example, if the
job split heuristic is wrong and we selected too large a region for
compaction that can't be finished within a reasonable amount of time. We
will give up such tasks and yield to L0 compaction. Each gc-compaction
sub-compaction job is atomic and cannot be split further, so we have to
give up (instead of storing a state and continuing later as in image
creation).
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
After splitting teams, it became a bit more complicated for the PerfCorr
team to work on test changes.
## Summary of changes
1. Add PerfCorr team co-owners for the `.github/` folder
2. Add PerfCorr team as owner of the `test_runner/` folder
## Problem
In f37eeb56, I properly escaped the identifier, but I hadn't noticed
that the resulting string is used in `format('...')`, so it needs
additional escaping. After looking at it closer, and with Heikki's
and Tristan's help, it turned out to be a full can of worms: we have
problems all over the code in places where we use PL/pgSQL blocks.
## Summary of changes
Add a new `pg_quote_dollar()` helper to deal with it, as dollar-quoting
of strings seems to be the only robust way to escape strings in dynamic
PL/pgSQL blocks. We mimic Postgres' `pg_get_functiondef` logic here
[1].
While on it, I added more tests and caught a couple more bugs with
string escaping:
1. `get_existing_dbs_async()` was wrapping `owner` in additional
double-quotes if it contained special characters
2. `construct_superuser_query()` was flawed in even more ways than the
rest of the code. It wasn't realistic to fix it quickly, but after
thinking about it more, I realized that we could drop most of it
altogether. IIUC, it was added as some sort of migration, probably back
when we didn't have migrations yet, so all the complicated code was
needed to properly update existing roles and DBs. In the current Neon,
this code only runs before we create the very first DB and role. When we
create roles and DBs, all `neon_superuser` grants are added in
different places. So the worst thing that could happen is that there is
an ancient branch somewhere, and when users poke it, they will realize
that not all Neon features work as expected. Yet, the fix is simple and
self-serve -- just create a new role via UI or API, and it will get a
proper `neon_superuser` grant.
[1]:
8b49392b27/src/backend/utils/adt/ruleutils.c (L3153)
Closes neondatabase/cloud#25048
## Problem
When periodic pagebench runs only once a day a lot of commits can be in
between a good run and a regression.
## Summary of changes
Run the workflow every 3 hours
## Problem
part of https://github.com/neondatabase/neon/issues/9114
## Summary of changes
* Add timers for each phase of the gc-compaction.
* Add a final ratio computation to directly show the garbage collection
ratio in the logs.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Original Slack discussion:
https://neondb.slack.com/archives/C04DGM6SMTM/p1739915430147169
TL;DR in Postgres, it's totally normal to have 'invalid' DBs (state
after the interrupted `DROP DATABASE`). Yet, some of our metrics
collected with `sql_exporter` try to get the size of such invalid DBs.
Typical log lines:
```
time=2025-03-05T16:30:32.368Z level=ERROR source=promhttp.go:52 msg="Error gathering metrics" error="[from Gatherer #1] [collector=neon_collector,query=pg_stats_userdb] pq: [NEON_SMGR] [reqid 0] could not read db size of db 173228 from page server at lsn 0/44A0E8C0"
time=2025-03-05T16:30:32.369Z level=ERROR source=promhttp.go:52 msg="Error gathering metrics" error="[from Gatherer #1] [collector=neon_collector,query=db_total_size] pq: [NEON_SMGR] [reqid 0] could not read db size of db 173228 from page server at lsn 0/44A0E8C0"
```
## Summary of changes
Ignore invalid DBs in these two metrics -- `pg_stats_userdb` and
`db_total_size`
## Problem
On unarchival, we update the previous heatmap with all visible layers.
When the primary generates a new heatmap it includes all those layers,
so the secondary will download them. Since they're not actually resident
on the primary (we didn't call the warm-up API), they'll never be
evicted, so they remain in the heatmap.
We want these layers in the heatmap, since we might wish to warm up an
unarchived timeline after a shard migration. However, we don't want them
to be downloaded on the secondary until we've warmed up the primary.
## Summary of Changes
Include these layers in the heatmap and mark them as cold. All heatmap
operations act on non-cold layers, apart from the attached-location
warm-up API, which will download the cold layers. Once the cold layers
are downloaded on the primary, they'll be included in the next heatmap
as hot, and the secondary starts fetching them too.
## Problem
I like using `cargo stress` to hammer on a test, but it doesn't work out
of the box because it does parallel runs by default and tests always use
the same repo dir.
## Summary of changes
Add a UUID to the test repo dir when generating it.
## Problem
In a batch, `pageserver_layers_per_read_global` counts all layer visits
towards every read in the batch, since this directly affects the
observed latency of the read. However, this doesn't give a good picture
of the amortized read amplification due to batching.
## Summary of changes
Add two more global read amp metrics:
* `pageserver_layers_per_read_batch_global`: number of layers visited
per batch.
* `pageserver_layers_per_read_amortized_global`: number of layers
divided by reads in a batch.
* Remove callsite identifier registration on span creation. Forgot to
remove from last PR. Was part of alternative idea.
* Move "spans" object to right after "fields", so event and span fields
are listed together.
## Problem
For pg_regress test, we do both v1 and v2; for all the rest, we default
to v2.
part of https://github.com/neondatabase/neon/issues/9516
## Summary of changes
Use reldir v2 across test cases by default.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
This PR adds a new key to neon.neon_lfc_stats —
'file_cache_chunk_size_pages'. It just returns the value of
BLOCKS_PER_CHUNK from the LFC implementation.
The new value should (eventually) allow changing the chunk size without
breaking any places that rely on LFC stats values measured in number of
chunks.
See neondatabase/cloud#25170 for more.
## Problem
Our benchmarking workflows contain links to grafana dashboards to
troubleshoot problems.
This works fine for non-pooled endpoints.
For pooled endpoints we need to remove the `-pooler` suffix from the
endpoint's hostname to get a valid endpoint ID.
Example link that doesn't work in this run
https://github.com/neondatabase/neon/actions/runs/13678933253/job/38246028316#step:8:311
## Summary of changes
Check if connection string is a -pooler connection string and if so
remove this suffix from the endpoint ID.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
We realized that we can use this metric for more 'live' info about
extension installations than the installed-extensions metric, which is
currently only updated at compute start.
## Summary of changes
Add `filename` label to `compute_ctl_remote_ext_requests_total`. Note
that it contains the raw archive name with `.tar.zst` at the end, so the
consumer may need to strip this suffix.
Closes https://github.com/neondatabase/cloud/issues/24694
Set up the pgaudit and pgauditlogtofile extensions in compute_ctl
when the ComputeAuditLogLevel is set to 'hipaa'.
See cloud PR https://github.com/neondatabase/cloud/pull/24568
Add rsyslog setup for compute_ctl.
Spin up an rsyslog server in the compute VM,
and configure it to send logs to the endpoint
specified in the AUDIT_LOGGING_ENDPOINT env var.
## Problem
part of https://github.com/neondatabase/neon/issues/9114
We used anyhow::Error everywhere and it's time to fix that.
## Summary of changes
* Make sure that cancel errors are correctly propagated as
CompactionError::ShuttingDown.
* Skip all the trigger computation work if gc_cutoff is not generated
yet.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
If an image fails to push to dev registries, we shouldn't trigger the
deploy job, because that depends on images existing in dev registries.
To ensure this is the case, the deploy job needs to depend on pushing to
dev registries.
## Summary of changes
Make `deploy` depend on `push-neon-image-dev` and
`push-compute-image-dev`.
## Problem
`test_check_visibility_map` is the slowest test in CI, and can cause
timeouts under particularly slow configurations (`debug` and
`without-lfc`).
## Summary of changes
* Reduce the `pgbench` scale factor from 10 to 8.
* Omit a redundant vacuum during `pgbench` init.
* Remove a final `vacuum freeze` + `pg_check_visible` pass, which has
questionable value (we've already done a vacuum freeze previously, and
we don't flush the compute cache before checking anyway).
## Problem
On unarchival, we update the previous heatmap with all visible layers.
When the primary generates a new heatmap it includes all those layers,
so the secondary will download them. Since they're not actually resident
on the primary (we didn't call the warm up API), they'll never be
evicted, so they remain in the heatmap.
This leads to oversized secondary locations like we saw in pre-prod.
## Summary of changes
Gate the loading of the previous heatmaps and the heatmap generation on
unarchival behind configuration
flags. They are disabled by default, but enabled in tests.
## Problem
Each page of the slru segment is fetched individually when it's loaded
on demand.
## Summary of Changes
Use `Timeline::get_vectored` to fetch 16 pages at a time.
fixes https://github.com/neondatabase/cloud/issues/24292
Do not drop tablesync replication slots on the publisher when we're in
the process of dropping subscriptions inherited by a Neon branch,
because these slots are still needed by the parent branch's subscriptions.
For regular slots we handle this by setting the slot_name to NONE
before calling DROP SUBSCRIPTION, but tablesync slots are not exposed to
SQL.
We rely on the GUC disable_logical_replication_subscribers=true
to know that we're in the Neon-specific process of dropping
subscriptions.
## Problem
We don't have any logging for Sentry initialization. This makes it hard
to verify that it has been configured correctly.
## Summary of changes
Log some basic info when Sentry has been initialized, but omit the
public key (which allows submitting events). Also log when `SENTRY_DSN`
isn't specified at all, and when it fails to initialize (which is
supposed to panic, but we may as well).
## Problem
Grafana Loki's JSON handling is somewhat limited and the log message
should be structured in a way that it's easy to sift through logs and
filter.
## Summary of changes
* Drop span_id. It's too short lived to be of value and only bloats the
logs.
* Use the span's name as the object key, but append a unique numeric
value to prevent name collisions.
* Extract interesting span fields into a separate object at the root.
New format:
```json
{
"timestamp": "2025-03-04T18:54:44.134435Z",
"level": "INFO",
"message": "connected to compute node at 127.0.0.1 (127.0.0.1:5432) latency=client: 22.002292ms, cplane: 0ns, compute: 5.338875ms, retry: 0ns",
"fields": {
"cold_start_info": "unknown"
},
"process_id": 56675,
"thread_id": 9122892,
"task_id": "24",
"target": "proxy::compute",
"src": "proxy/src/compute.rs:288",
"trace_id": "5eb89b840ec63fee5fc56cebd633e197",
"spans": {
"connect_request#1": {
"ep": "endpoint",
"role": "proxy",
"session_id": "b8a41818-12bd-4c3f-8ef0-9a942cc99514",
"protocol": "tcp",
"conn_info": "127.0.0.1"
},
"connect_to_compute#6": {},
"connect_once#8": {
"compute_id": "compute",
"pid": "853"
}
},
"extract": {
"session_id": "b8a41818-12bd-4c3f-8ef0-9a942cc99514"
}
}
```
## Problem
Our benchmarking workflow has a job step `bench` which runs all tests in
test_runner/performance/* except those that we want to run separately.
We recently added two test cases to that testcase directory that we want
to run separately but forgot to ignore them during the bench step. This
is now causing
[failures](https://github.com/neondatabase/neon/actions/runs/13667689340/job/38212087331#step:7:392).
## Summary of changes
Ignore the separately run tests in the bench step.
## Problem
The new async prefetch introduces a `prefetch_lookup` function which is
called before the LFC lookup to check whether the prefetch request has
already completed.
This function currently does not check that the response is actually
`T_NeonGetPageResponse` (and not an error).
## Summary of changes
Add checks for the response tag.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
The periodic pagebench workflow runs periodically from the latest main
commit and also allows manual dispatch for a given commit hash to bisect
regressions.
However, in the dashboards we cannot distinguish manual runs from
periodic runs, which makes it harder to follow the trend.
## Summary of changes
Send an additional flag (commit type) to the benchmark runner instance to
distinguish the run type.
Note: this needs a follow-up PR on the receiving side.
The compute should only act if requests come from the control plane.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
On macOS, the `unused` lint complains about two variables not used in
`!linux` builds.
These were introduced in #11007.
## Summary of changes
Appease the linter by explicitly using the variables in `!linux`
branches.
## Problem
part of https://github.com/neondatabase/neon/issues/11067
My observation is that with the current settings, x86-v1 usually takes
30s, arm-v1 1m30s, x86-v2 1m, and arm-v2 3m. But sometimes the system
runs too slowly and causes the test to time out on arm with reldir v2.
While I investigate what's going on and further improve the performance,
I'd like to set both of them to use the same test input, so that the
test doesn't time out and we don't abuse this test case as a performance
test.
## Summary of changes
Use the same settings for both test cases.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The code to generate symbolized pprof heap profiles and flamegraph SVGs
has been upstreamed to the `jemalloc_pprof` crate:
* https://github.com/polarsignals/rust-jemalloc-pprof/pull/22
* https://github.com/polarsignals/rust-jemalloc-pprof/pull/23
## Summary of changes
Use `jemalloc_pprof` to generate symbolized pprof heap profiles and
flamegraph SVGs.
This reintroduces a bunch of internal jemalloc stack frames that we
previously stripped, e.g. each stack now always ends with
`prof_backtrace_impl` (where jemalloc takes a stack trace for heap
profiling), but that seems ok.
## Problem
Incoming requests often take the service lock, and sometimes even do
database transactions. That creates a risk that a rogue client can
starve the controller of the ability to do its primary job of
reconciling tenants to an available state.
## Summary of changes
* Use the `governor` crate to rate limit tenant requests at 10 requests
per second. This is ~10-100x lower than the worst "attack" we've seen
from a client bug. Admin APIs are not rate limited.
* Add a `storage_controller_http_request_rate_limited` histogram for
rate limited requests.
* Log a warning every 10 seconds for rate limited tenants.
The rate limiter is parametrized on TenantId, because the kinds of
client bug we're protecting against generally happen within tenant
scope, and the rates should be somewhat stable: we expect the global
rate of requests to increase as we do more work, but we do not expect
the rate of requests to one tenant to increase.
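As a minimal sketch of a keyed limiter with the `governor` crate, using a placeholder `TenantId` type; the quota, key type, and error handling here are illustrative, not the controller's actual code:

```rust
use std::num::NonZeroU32;
use governor::{Quota, RateLimiter};

// Placeholder key type; the real limiter is keyed on TenantId.
type TenantId = String;

fn main() {
    // Roughly 10 requests per second, per tenant.
    let limiter = RateLimiter::keyed(Quota::per_second(NonZeroU32::new(10).unwrap()));

    let tenant: TenantId = "tenant-a".to_string();
    match limiter.check_key(&tenant) {
        Ok(()) => { /* proceed with the request */ }
        Err(_not_until) => { /* reject or delay, bump the rate-limited metric, maybe warn */ }
    }
}
```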
---------
Co-authored-by: John Spray <john@neon.tech>
## Problem
part of https://github.com/neondatabase/neon/issues/9516
## Summary of changes
Similar to the aux v2 migration, we persist the relv2 migration status
into index_part, so that even if the config item is set to false, we will
still read from the v2 storage to avoid data loss.
Note that only the two variants `None` and
`Some(RelSizeMigration::Migrating)` are used for now. We don't have full
migration implemented so it will never be set to
`RelSizeMigration::Migrated`.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The "did not trigger" gets logged at 10k/minute in staging.
## Summary of changes
Change it to debug level.
Signed-off-by: Alex Chi Z <chi@neon.tech>
The interpreted reader tracks a record-aligned current position in the
WAL stream.
Partial reads move the stream internally, but not from the point of view
of the interpreted WAL reader.
Hence, a new shard may subscribe with a start position that matches the
reader's current position while we've also done some partial reads. This
confuses the gap tracking.
To make it more robust, update the current batch start to the minimum of
the new start position and its current value. Since no record has been
decoded yet (the positions match), we can't have lost anything.
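A tiny sketch of that adjustment, with hypothetical names and `Lsn` reduced to a plain integer:

```rust
// Hypothetical names; `Lsn` is reduced to a plain integer for the sketch.
type Lsn = u64;

struct InterpretedWalReader {
    /// Start of the WAL batch currently buffered by the decoder.
    current_batch_start: Lsn,
}

impl InterpretedWalReader {
    /// A new shard attaches at `new_start`. Never move the batch start forward
    /// past data the shard may still need: take the minimum of the two.
    fn on_shard_attach(&mut self, new_start: Lsn) {
        self.current_batch_start = self.current_batch_start.min(new_start);
    }
}

fn main() {
    let mut reader = InterpretedWalReader { current_batch_start: 0x30 };
    reader.on_shard_attach(0x20);
    assert_eq!(reader.current_batch_start, 0x20);
}
```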
## Problem
When a build is made with sanitizers, this is not reflected in the
artifact name, which can lead to overriding normal builds with sanitized
ones.
## Summary of changes
Take this property of a build into account when constructing the
artifact name.
## Problem
Image layers may be nested inside in-memory layers as diagnosed
[here](https://github.com/neondatabase/neon/issues/10720#issuecomment-2649419252).
The read path doesn't support this and may skip over the image layer,
resulting in a failure to reconstruct the page.
## Summary of changes
We already support nesting of image layers inside delta layers. The
logic lives in `LayerMap::select_layer`.
The main goal of this PR is to propagate the candidate in-memory layer
down to that point and update
the selection logic.
Important changes are:
1. Support partial reads for the in-memory layer. Previously, we could
only specify the start LSN of the read.
We need to control the end LSN too.
2. `LayerMap::ranged_search` considers in-memory layers too. Previously,
the search for in-memory layers
was done explicitly in `Timeline::get_reconstruct_data_timeline`. Note
that `LayerMap::ranged_search` now returns
a weak readable layer which the `LayerManager` can upgrade. This dance
is such that we can unit test the layer selection logic.
3. Update `LayerMap::select_layer` to consider the candidate in-memory
layer too
Loosely related drive bys:
1. Remove the "keys not found" tracking in the ranged search. This
wasn't used anywhere and it just complicates things.
2. Remove the difficulty map stuff from the layer map. Again, not used
anywhere.
Closes https://github.com/neondatabase/neon/issues/9185
Closes https://github.com/neondatabase/neon/issues/10720
## Problem
If a caller times out on safekeeper timeline deletion on a large
timeline, and waits a while before retrying, the deletion will not
progress while the retry is waiting. The net effect is very very slow
deletion as it only proceeds in 30 second bursts across 5 minute idle
periods.
Related: https://github.com/neondatabase/neon/issues/10265
## Summary of changes
- Run remote deletion in a background task
- Carry a watch::Receiver on the Timeline for other callers to join the
wait
- Restart deletion if the API is called again and the previous attempt
failed
## Problem
We want to support larger tenants (regarding logical database size,
number of transactions per second etc.) and should increase our test
coverage of OLTP transactions at larger scale.
## Summary of changes
Start a new benchmark that over time will add more OLTP tests at larger
scale.
This PR covers the first version and will be extended in further PRs.
Also fix some infrastructure:
- The default for new connections and large tenants is to use the
pgbouncer connection pooler, but our fixture always added
`statement_timeout=120`, which is not compatible with the pooler
([see](https://neon.tech/docs/connect/connection-errors#unsupported-startup-parameter))
- The action to create a branch timed out after 10 seconds and 10
retries, but for large tenants it can take longer, so use increasing
back-off for retries
## Test run
https://github.com/neondatabase/neon/actions/runs/13593446706
## Problem
Timeline shutdown during basebackup logs at error level because the
cancellation error is smushed into BasebackupError::Server.
## Summary of changes
Introduce BasebackupError::Shutdown and use it. `log_query_error` will
now see `QueryError::Shutdown` and log at info level.
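A simplified sketch of the error mapping, with trimmed-down hypothetical variants; the real enums carry more cases and richer payloads:

```rust
// Trimmed-down, hypothetical variants; the real enums carry richer payloads.
enum BasebackupError {
    Server(String),
    Shutdown,
}

enum QueryError {
    Shutdown,
    Other(String),
}

impl From<BasebackupError> for QueryError {
    fn from(e: BasebackupError) -> Self {
        match e {
            // Cancellation during timeline shutdown is expected: map it to
            // QueryError::Shutdown so the query-error logging stays at INFO.
            BasebackupError::Shutdown => QueryError::Shutdown,
            BasebackupError::Server(msg) => QueryError::Other(msg),
        }
    }
}

fn main() {
    let _q: QueryError = BasebackupError::Shutdown.into();
}
```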
## Problem
Preparation for https://github.com/neondatabase/neon/issues/10851
## Summary of changes
Add a walproposer `safekeepers_generations` field which can be set by
prefixing the `neon.safekeepers` GUC with `g#n:`. A non-zero value (n)
forces the walproposer to use generations. In particular, this also
disables implicit timeline creation, as the timeline will be created by
storcon. Add a test checking this. Also add missing infra: a
`--safekeepers-generation` flag for neon_local endpoint start, and fix
the `--start-timeout` flag: it existed but its value wasn't used.
## Problem
See https://neondb.slack.com/archives/C033RQ5SPDH/p1740157873114339
smgrextend for the FSM fork is called during page reconstruction by the
walredo process, causing overflow of the inmem SMGR (64 pages).
## Summary of changes
Do not store zero pages in the inmem SMGR, because `inmem_read` returns a
zero page if it is not able to locate the specified block.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
We build compute-nodes as multi-arch images, but not the
vm-compute-nodes. The PR adds multiarch vm images the same way as in
autoscaling repo.
## Summary of changes
Add architecture to the matrix for vm compute build steps
Add merge job
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
We have not synced `force-test-extensions-upgrade.yml` with the last
changes.
The variable `TEST_EXTENSIONS_UPGRADE` was ignored in the script and
actually set to `NEW_COMPUTE_TAG` while it should be set to
`OLD_COMPUTE_TAG` as we are about to run compatibility tests.
## Summary of changes
The tag names were synced, the logic was fixed.
To speed up compute startup. Resizing swap in particular takes about 100
ms on my laptop. By performing it in parallel with downloading the
basebackup, that latency is effectively hidden. I would imagine that
downloading remote extensions can also take a non-trivial amount of
time, although I didn't try to measure that. In any case that's now also
performed in parallel with downloading the basebackup.
Move most of the code to compute.rs, so that all the major startup steps
are visible in one place. You can now get a pretty good picture of what
happens in the latency-critical path at compute startup by reading
ComputeNode::start_compute().
This also clarifies the error handling in start_compute. Previously, the
start_postgres function sometimes returned an Err, and sometimes Ok but
with the compute status already set to Failed. Now the start_compute
function always returns Err on failure, and it's the caller's
responsibility to change the compute status to Failed. Separately from
that, it returns a handle to the Postgres process via a `&mut` reference
if it had already started Postgres (i.e. on success, or if the failure
happens after launching the Postgres process).
---------
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
## Problem
In multi-character keys, the GIN index creates a CRC hash of the first 3
bytes of the key. The hash can have its first bit set or unset, so a
consistent representation of `char` across architectures is needed for
consistent results. GIN stores these keys by their hashes, which
determines the order in which the keys are obtained from the GIN index.
By default, chars are signed on x86 and unsigned on arm, leading to
inconsistent behavior across different platform architectures. Adding
the `-fsigned-char` flag to the GCC compiler forces chars to be treated
as signed across platforms, ensuring the ordering in which the keys are
obtained is consistent.
## Summary of changes
Added `-fsigned-char` to the `CFLAGS` to force GCC to use signed chars
across platforms.
Added a test to check this across platforms.
Fixes: https://github.com/neondatabase/cloud/issues/23199
## Problem
To measure latency accurately, we should associate the testodrome role
with the latency data.
## Summary of changes
Add latency logging to associate the different roles with their latency
measurements.
Relates to #22486.
## Problem
`wait_for_active_tenant()`, used when starting background tasks, has a
race condition that can cause it to wait forever (until cancelled). It
first checks the current tenant state, and then subscribes for state
updates, but if the state changes between these then it won't be
notified about it.
We've seen this wedge compaction tasks, which can cause unbounded layer
file buildup and read amplification.
## Summary of changes
Use `watch::Receiver::wait_for()` to check both the current and new
tenant states.
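As a minimal sketch of the race-free wait using `tokio::sync::watch::Receiver::wait_for`, with a hypothetical `TenantState` enum standing in for the real tenant state type:

```rust
use tokio::sync::watch;

// Hypothetical state type standing in for the real tenant state.
#[derive(Clone, Debug, PartialEq)]
enum TenantState {
    Attaching,
    Active,
}

// `wait_for` evaluates the predicate against the *current* value before
// awaiting changes, so the "state changed between check and subscribe" race
// described above cannot happen.
async fn wait_for_active_tenant(
    mut rx: watch::Receiver<TenantState>,
) -> Result<(), watch::error::RecvError> {
    rx.wait_for(|state| *state == TenantState::Active).await?;
    Ok(())
}

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel(TenantState::Attaching);
    tx.send(TenantState::Active).unwrap();
    wait_for_active_tenant(rx).await.unwrap();
}
```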
## Problem
Somehow the previous patch lost the loop in the chaos injector function,
so everything only runs once.
https://github.com/neondatabase/neon/pull/10934
## Summary of changes
Add back the loop.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
JWT tokens aren't in place, so all SK heartbeats fail. This is
equivalent to a wait before applying the PS heartbeats and makes things
more flaky.
## Summary of Changes
Add a flag that skips loading SKs from the db on start-up and at
runtime.
https://github.com/neondatabase/cloud/issues/23008
For TLS between proxy and compute, we are using an internally
provisioned CA to sign the compute certificates. This change ensures
that proxy will load them from a supplied env var pointing to the
correct file - this file and env var will be configured later, using a
kubernetes secret.
Control plane responds with a `server_name` field if and only if the
compute uses TLS. This server name is the name we use to validate the
certificate. Control plane still sends us the IP to connect to as well
(to support overlay IP).
To support this change, I had to split `host` and `host_addr` into
separate fields. We use `host_addr` and bypass `lookup_addr` if possible
(which is what happens in production); `host` is then only used for the
TLS connection.
There's no blocker to merging this. The code paths will not be triggered
until the new control plane is deployed and the `enableTLS` compute flag
is enabled on a project.
## Problem
This `critical!` could fire on IO errors, which is just noisy.
Resolves #11027.
## Summary of changes
Downgrade to error, except for decode errors. These could be either data
corruption or a bug, but seem worth investigating either way.
# Changes
While working on
- https://github.com/neondatabase/neon/pull/7202
I found myself needing to cache another expensive Arc::clone inside
the timeline::handle::Cache by wrapping it in another Arc.
Before this PR, it seemed like the only expensive thing we were caching
was the connection handler tasks' clone of `Arc<Timeline>`.
In fact, the GateGuard was another such thing, but it was special-cased
in the implementation.
So, this refactoring PR de-special-cases the GateGuard.
# Performance
With this PR we are doing strictly _fewer_ operations per `Cache::get`.
The reason is that we wrap the entire `Types::Timeline` into one Arc.
Before this PR, it was a separate Arc around the Arc<Timeline> and
one around the Arc<GateGuard>.
With this PR, we avoid an allocation per cached item, namely,
the separate Arc around the GateGuard.
This PR does not change the amount of shared mutable state.
So, all in all, it should be a net positive, albeit probably not
noticeable with our small non-NUMA instances and generally high CPU
usage per request.
# Reviewing
To understand the refactoring logistics, look at the changes to the unit
test types first.
Then read the improved module doc comment.
Then the remaining changes.
In the future, we could rename things to be even more generic.
For example, `Types::TenantMgr` could really be a `Types::Resolver`.
And `Types::Timeline` should, to avoid constant confusion in the doc
comment,
be called `Types::Cached` or `Types::Resolved`.
Because the `handle` module, after this PR, really doesn't care that
we're
using it for storing Arc's and GateGuards.
Then again, specificity is sometimes more useful than being generic.
And writing the module doc comment in a totally generic way would
probably also be more confusing than helpful.
## Problem
CI does not pass for the compute release due to the absence of some
images
## Summary of changes
Now we use the images from the old non-compute releases for non-compute
images
## Problem
We intend for cplane to use the heatmap layer download API to warm up
timelines after unarchival. It's tricky for them to recurse into the
ancestors, and the current implementation doesn't work well when
unarchiving a chain of branches and warming them up.
## Summary of changes
* Add a `recurse` flag to the API. When the flag is set, the operation
recurses into the parent
timeline after the current one is done.
* Be resilient to warming up a chain of unarchived branches. Let's say
we unarchived `B` and `C` from the `A -> B -> C` branch hierarchy. `B`
got unarchived first. We generated the unarchival heatmaps and stashed
them in `A` and `B`. When `C` was unarchived, it dropped its unarchival
heatmap since `A` and `B` already had one. If `C` needed layers from `A`
and `B`, it was out of luck. Now, when choosing whether to keep an
unarchival heatmap, we look at its end LSN. If it's more inclusive than
what we currently have, keep it.
## Problem
Storage controller will proxy GETs to pageserver-like tenant/timeline
paths through to the pageserver.
Usually GET passthroughs make sense to go to shard 0, e.g. if you want
to list timelines.
But sometimes you really want to know about a particular shard, e.g.
reading its cache state or similar.
## Summary of changes
- Accept shard IDs as well as tenant IDs in the passthrough route
- Refactor node lookup to take a shard ID and make the tenant ID case a
layer on top of that. This is one more lock take-drop during these
requests, but it's not particularly expensive and these requests
shouldn't be terribly frequent
This is not immediately used by anything, but will be there any time we
want to e.g. do a pass-through query to check the warmth of a tenant
cache on a particular shard or somesuch.
## Problem
layer_map access was unwrapped. It might return an error during
shutdown.
## Summary of changes
Propagate the layer_map access error back to the compaction loop.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
https://github.com/neondatabase/neon/pull/10841 made building compute
and neon images optional on releases that don't need them. The
`push-<component>-image-prod` jobs had transitive dependencies that were
skipped due to that, causing the images not to be pushed to production
registries.
## Summary of changes
Add `!failure() && !cancelled() &&` to the beginning of the conditions
for these jobs to ensure they run even if some of their transitive
dependencies are skipped.
## Problem
We see `Inmem storage overflow` in page server logs:
https://neondb.slack.com/archives/C033RQ5SPDH/p1740157873114339
The walredo process is using the inmem SMGR with storage size limited to
64 pages and a warning watermark of 32 (based on the assumption that
XLR_MAX_BLOCK_ID is 32, so a WAL record cannot access more than 32
pages).
Actually this is not true. We can update up to 3 forks for each block
(including updates of the FSM and VM forks).
## Summary of changes
This PR increases the inmem SMGR size for the walredo process to 100
pages and prints a stack trace in case of overflow.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
We created extensions in a single database. The tests could interfere,
i.e., discover some service tables left by other extensions and produce
unexpected results.
## Summary of changes
The tests are now run in a separate timeline, so only one extension owns
the database, which prevents interference.
Before this PR, re-attach and validate would log the same warning
```
calling control plane generation validation API failed
```
on retry errors.
This can be confusing.
This PR makes the message generically valid for any upcall and adds
additional tracing spans to capture context.
Along the way, clean up some copy-pasta variable naming.
refs
-
https://github.com/neondatabase/neon/issues/10381#issuecomment-2684755827
---------
Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>
## Problem
So far, cumulative statistics have not been persisted when Neon scales to
zero (suspends the endpoint).
With PR https://github.com/neondatabase/neon/pull/6560 the cumulative
statistics should now survive endpoint restarts and correctly trigger
autovacuum and autoanalyze maintenance.
So far we did not have a testcase that validates that improvement in our
dev cloud environment with a real project.
## Summary of changes
Introduce testcase `test_cumulative_statistics_persistence` in the daily
benchmarking workflow to verify that the cumulative statistics are
correctly persisted across restarts.
Cumulative statistics are important to persist across restarts because
they are used when autovacuum and autoanalyze trigger conditions are
evaluated.
The test performs the following steps:
- Seed a new project using pgbench
- Insert tuples that by themselves are not enough to trigger autovacuum
- Suspend the endpoint
- Resume the endpoint
- Insert additional tuples that by themselves are not enough to trigger
autovacuum but in combination with the previous tuples are
- Verify that autovacuum is triggered by the combination of tuples
inserted before and after endpoint suspension
## Test run
https://github.com/neondatabase/neon/actions/runs/13546879714/job/37860609089#step:6:282
Migrates the remaining crates to edition 2024. We like to stay on the
latest edition if possible. There are no functional changes, however some
code changes had to be done to accommodate the edition's breaking
changes.
Like the previous migration PRs, this is comprised of three commits:
* the first does the edition update and makes `cargo check`/`cargo
clippy` pass. we had to update bindgen to make its output [satisfy the
requirements of edition
2024](https://doc.rust-lang.org/edition-guide/rust-2024/unsafe-extern.html)
* the second commit does a `cargo fmt` for the new style edition.
* the third commit reorders imports as a one-off change. As before, it
is entirely optional.
Part of #10918
## Problem
https://github.com/neondatabase/neon/pull/10841 broke CI on PRs that
aren't based on main or a release branch but want to merge into another
PR.
## Summary of changes
Replace `run-kind=pr-main` with `run-kind=pr`, so that all PRs that
aren't release PRs are treated equally.
## Problem
Safekeepers must ignore walproposer messages with a non-matching
membership conf.
## Summary of changes
Make safekeepers reject vote request, proposer elected, and append
request messages with a non-matching generation. Switch to the
configuration in the greeting message if it is higher.
In passing, fix one comment and WAL truncation.
Last part of https://github.com/neondatabase/neon/issues/9965
The comment was woefully outdated and outright wrong. It applied a long
time ago (before commit e5cc2f92c4 to be precise), but nowadays the
function just launches postgres and waits until it starts accepting
connections. The other things the comment talked about are done in other
functions.
## Problem
Check neon with extra platform builds is failing on main with:
```
The template is not valid. .github/workflows/neon_extra_builds.yml (Line: 74, Col: 26): Unexpected value 'false'
```
https://github.com/neondatabase/neon/actions/runs/13549634905
## Summary of changes
Use `fromJson()` to have `false` as a boolean value.
Thanks to @skyzh for pointing out the issue.
## Problem
Yet another source of flakiness for
https://github.com/neondatabase/neon/issues/10517
## Summary of changes
The test scenario we want to create is that we have an image layer in
index_part and then overwrite it, so we have to ensure it gets persisted
in index_part by doing a force checkpoint.
Signed-off-by: Alex Chi Z <chi@neon.tech>
We lost this with the switch to axum for the HTTP server. Add it back.
In addition to just resurrecting the functionality we had before, pass
the tracing context of the /configure HTTP request to the start_postgres
operation that runs in the main thread. This way, the 'start_postgres'
and all its sub-spans like getting the basebackup become children of the
HTTP request span. This allows end-to-end tracing of a compute start,
all the way from the proxy to the SQL queries executed by compute_ctl as
part of compute startup.
## Problem
https://github.com/neondatabase/neon/pull/10241 added a configuration
switch endpoint, but it didn't delete the timeline if a node was
excluded.
## Summary of changes
Add a separate /exclude API endpoint which similarly accepts a membership
configuration where the sk is supposed to be excluded. The implementation
deletes the timeline locally.
Some more small related tweaks:
- make mconf switch API PUT instead of POST as it is idempotent;
- return 409 if switch was refused instead of 200 with requested &
current;
- remove unused was_active flag from delete response;
- remove meaningless _force suffix from delete functions names;
- reuse timeline.rs delete_dir function in timelines_global_map instead
of its own copy.
part of https://github.com/neondatabase/neon/issues/9965
## Problem
I see a lot of timeout errors, which indicates that this test is too
slow. It seems that creating relations is fast, but the subsequent
truncation step is slow.
## Summary of changes
Reduce the number of relations for now, and investigate later.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Release CI is slow, because we're doing unnecessary work, for example
building compute images on storage releases and vice versa.
## Summary of changes
- Extract tag generation into reusable workflow and extend it with
fetching of previous component releases
- Don't build neon images on compute releases and don't build compute
images on proxy and storage releases
- Reuse images from previous releases for tests on branches where we
don't build those images
## Open questions
- We differentiate between `TAG` and `COMPUTE_TAG` in a few places, but
we don't differentiate between storage and proxy releases. Since they
use the same image, this will continue to work, but I'm not sure this is
what we want.
## Problem
ref https://github.com/neondatabase/neon/issues/10927
## Summary of changes
* Implement `is_critical` and `is_cancel` over `CompactionError`.
* Revisit all places that use `CollectKeyspaceError` to ensure they are
handled correctly.
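A heavily simplified sketch of the shape of these helpers; the variants and the critical/non-critical classification below are illustrative only:

```rust
// Hypothetical, trimmed-down variants; the real enum has more cases and the
// classification below is purely illustrative.
#[derive(Debug)]
enum CompactionError {
    ShuttingDown,
    Other(String),
}

impl CompactionError {
    /// Shutdown/cancellation is expected; callers should bail out quietly.
    fn is_cancel(&self) -> bool {
        matches!(self, CompactionError::ShuttingDown)
    }

    /// Everything that isn't a clean cancellation is a candidate for louder
    /// logging (the real classification is finer-grained than this).
    fn is_critical(&self) -> bool {
        !self.is_cancel()
    }
}

fn main() {
    assert!(CompactionError::ShuttingDown.is_cancel());
    assert!(CompactionError::Other("layer map".into()).is_critical());
}
```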
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Delete layers that we have hardlinked so far when there is an error in
`remote_copy`. This prevents a retry of the ancestor detach from
stumbling over already present layer files: the hardlink would fail with
an error.
If there is a crash, we already clean up during the timeline attach: we
loop over all layer files and purge all layers that are not referenced
by the `index_part.json`.
Make sure to hold the timeline gate to prevent races with
detach&attach&read from the layer file.
These cleanups aren't completely enough however, as there is code after
`prepare` as well. To handle errors there, we add a special case for
`AlreadyExists` errors during the hardlink, where we check if the layer
is an orphan, and if yes, we delete it from local disk. That is ideally
not the case we hit, as it is less clear in that scenario where the
layer came from, but it provides good defense in depth.
Related: #10729
Fixes #10970
## Problem
If the client connection goes dead without an explicit close (e.g. due
to network infrastructure dropping the connection) then we currently
won't detect it for a long time, which may e.g. block GetPage flushes
and keep the task running.
Touches https://github.com/neondatabase/cloud/issues/23515.
## Summary of changes
Enable `SO_KEEPALIVE` on the page service socket, to enable periodic TCP
keepalive probes. These are configured via Linux sysctls, which will be
deployed separately. By default, the first probe is sent after 2 hours,
so this doesn't have a practical effect until we change the sysctls.
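As a sketch, enabling keepalive with the `socket2` crate on a std `TcpStream` (the real code operates on the accepted page-service connection):

```rust
use socket2::SockRef;
use std::net::TcpStream;

// Enable TCP keepalive on an accepted connection. Probe timing (first probe,
// interval, retry count) is left to the Linux tcp_keepalive_* sysctls, which
// are deployed separately as described above.
fn enable_keepalive(stream: &TcpStream) -> std::io::Result<()> {
    SockRef::from(stream).set_keepalive(true)
}
```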
## Problem
json_ctrl.rs is an obsolete attempt to have tests with fine control of
feeding messages into the safekeeper, superseded by the desim framework.
## Summary of changes
Drop it.
Updates the `compute_tools` and `compute_api` crates to edition 2024. We
like to stay on the latest edition if possible. There are no functional
changes, however some code changes had to be done to accommodate the
edition's breaking changes.
The PR has three commits:
* the first commit updates the named crates to edition 2024 and appeases
`cargo clippy` by changing code.
* the second commit performs a `cargo fmt` that does some minor changes
(not many)
* the third commit performs a cargo fmt with nightly options to reorder
imports as a one-time thing. it's completely optional, but I offer it
here for the compute team to review it.
I'd like to hear opinions about the third commit, if it's wanted and
felt worth the diff or not. I think most attention should be put onto
the first commit.
Part of #10918
Ref: https://github.com/neondatabase/cloud/issues/24939
## Problem
I found that we are missing authorization for some container jobs, which
makes them use anonymous pulls. It's not an issue for now, with high
enough limits, but it could become an issue when the new limits are
introduced on Docker Hub (10 pulls / hour).
## Summary of changes
- add credentials for the jobs that run in containers
We don't want to serialize to/from string all the time, so use
`SchedulingPolicy` in `SafekeeperPersistence` via the use of a wrapper.
Stacked atop #10891
In a local dev environment, these steps take around 100 ms, and they are
in the critical path of a compute startup on a compute pool hit. I don't
know if it's like that in production, but as a first step, add tracing
spans to the functions so that they can be measured more easily.
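A sketch of what adding such spans can look like with `tracing`'s `#[instrument]` attribute; the function names here are placeholders, not the actual startup steps:

```rust
use tracing::instrument;

// Placeholder step names; the point is that each startup step gets its own
// span, so its duration shows up in traces without hand-written timing code.
#[instrument(skip_all)]
fn prepare_local_filesystem() {
    // ... the real step body ...
}

#[instrument(skip_all, fields(path = %spec_path))]
fn apply_startup_spec(spec_path: &str) {
    let _ = spec_path; // ... the real step body ...
}

fn main() {
    tracing_subscriber::fmt().init(); // install a subscriber so spans are recorded
    prepare_local_filesystem();
    apply_startup_spec("/var/db/spec.json");
}
```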
Updates storage components to edition 2024. We like to stay on the
latest edition if possible. There are no functional changes, however some
code changes had to be done to accommodate the edition's breaking
changes.
The PR has two commits:
* the first commit updates storage crates to edition 2024 and appeases
`cargo clippy` by changing code. I accidentally ran the formatter
on some files that had other edits.
* the second commit performs a `cargo fmt`
I would recommend a closer review of the first commit and a less close
review of the second one (as it just runs `cargo fmt`).
part of https://github.com/neondatabase/neon/issues/10918
## Problem
PR https://github.com/neondatabase/neon/pull/10442 added the
prefetch_lookup function.
It changed the handling of getpage requests in compute.
Before:
1. Lookup in LFC (return if found)
2. Register prefetch buffer
3. Wait prefetch result (increment getpage_hist)
Now:
1. Lookup prefetch ring (return if prefetch request is already
completed)
2. Lookup in LFC (return if found)
3. Register prefetch buffer
4. Wait prefetch result (increment getpage_hist)
So if the prefetch result is already available, the getpage histogram is
not incremented.
This caused failures of some test_throughput benchmarks:
https://neondb.slack.com/archives/C033RQ5SPDH/p1740425527249499
## Summary of changes
Increment getpage histogram in `prefetch_lookup`
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
`neon endpoint list` shows a different LSN than what the state of the
replica is. This is mainly down to what we define as LSN in this output.
If we define it as the LSN that a compute was started with, it only
makes sense to show it for static computes.
## Summary of changes
Removed the output of `last_record_lsn` for primary/hot standby
computes.
Closes: https://github.com/neondatabase/neon/issues/5825
---------
Co-authored-by: Tristan Partin <tristan@neon.tech>
## Problem
The patch for the `semver` extension relies on the `PG_VERSION`
environment variable. The files were named without the letter `v`, so the
script cannot find them.
## Summary of changes
The patch files were renamed.
## Problem
part of https://github.com/neondatabase/neon/issues/9114
## Summary of changes
Add the auto trigger for gc-compaction. It computes two values: L1 size
and L2 size. When L1 size >= initial trigger threshold, we will trigger
an initial gc-compaction. When l1_size / l2_size >=
gc_compaction_ratio_percent, we will trigger the "tiered" gc-compaction.
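A rough sketch of the trigger arithmetic described above; the struct, field names, and the use of an empty L2 to detect the "initial" case are assumptions for illustration:

```rust
// Illustrative only: names and the use of "no L2 yet" to detect the initial
// case are assumptions, not the actual trigger implementation.
struct GcCompactionTrigger {
    initial_trigger_bytes: u64,
    gc_compaction_ratio_percent: u64,
}

impl GcCompactionTrigger {
    fn should_trigger(&self, l1_size: u64, l2_size: u64) -> bool {
        if l2_size == 0 {
            // Initial gc-compaction: fire once L1 grows past the threshold.
            return l1_size >= self.initial_trigger_bytes;
        }
        // "Tiered" gc-compaction: fire when l1_size / l2_size exceeds the
        // configured percentage (rewritten to avoid floating point).
        l1_size * 100 >= l2_size * self.gc_compaction_ratio_percent
    }
}

fn main() {
    let trigger = GcCompactionTrigger {
        initial_trigger_bytes: 10 * 1024 * 1024 * 1024,
        gc_compaction_ratio_percent: 100,
    };
    assert!(trigger.should_trigger(11 << 30, 0)); // initial trigger
    assert!(trigger.should_trigger(6 << 30, 5 << 30)); // ratio >= 100%
}
```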
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We have `test_perf_many_relations` but it only runs on remote clusters,
and we cannot directly modify tenant config. Therefore, I patched one of
the current tests to benchmark relv2 performance.
close https://github.com/neondatabase/neon/issues/9986
## Summary of changes
* Add `v1/v2` selector to `test_tx_abort_with_many_relations`.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We use `docker cp` to copy the files required for the extension tests
now.
It causes problems if we run older images with the newer source tree.
## Summary of changes
Copying the files was moved to the compute Dockerfile.
## Problem
As part of https://github.com/neondatabase/neon/issues/8614 we need to
pass membership configurations between compute and safekeepers.
## Summary of changes
Add version 3 of the protocol carrying membership configurations.
The greeting message on both sides gets the full conf, and other messages
carry the generation number only. Use the protocol bump to include other
accumulated changes:
- stop packing whole structs on the wire as is;
- make the tag u8 instead of u64;
- send all ints in network order;
- drop proposer_uuid, we can pass it in START_WAL_PUSH and it wasn't
much useful anyway.
Per message changes, apart from mconf:
- ProposerGreeting: tenant / timeline id is sent now as hex cstring.
Remove proto version, it is passed outside in START_WAL_PUSH. Remove
postgres timeline, it is unused. Reorder fields a bit.
- AcceptorGreeting: reorder fields
- VoteResponse: timeline_start_lsn is removed. It can be taken from
first member of term history, and later we won't need it at all when all
timelines will be explicitly created. Vote itself is u8 instead of u64.
- ProposerElected: timeline_start_lsn is removed for the same reasons.
- AppendRequest: epoch_start_lsn removed, it is known from term history
in ProposerElected.
Both compute and sk are able to talk v2 and v3 to make rollbacks (in
case we need them) easier; neon.safekeeper_proto_version GUC sets the
client version. v2 code can be dropped later.
So far empty conf is passed everywhere, future PRs will handle them.
To test, add a param to some tests choosing the proto version; we want to
test both 2 and 3 until we fully migrate.
ref https://github.com/neondatabase/neon/issues/10326
---------
Co-authored-by: Arthur Petukhovsky <petuhovskiy@yandex.ru>
## Problem
The interpreted WAL reader tracks the start of the current logical
batch.
This needs to be invalidated when the reader is reset.
This bug caused a couple of WAL gap alerts in staging.
## Summary of changes
* Refactor to make it possible to write a reproducer
* Add repro unit test
* Fix by resetting the start with the reader
Related https://github.com/neondatabase/cloud/issues/23935
## Problem
While looking at logs, I noticed that heartbeats are sent sequentially.
The loop polling the UnorderedSet is at the wrong level of indentation.
Instead of polling after we have the full set, we polled after each entry.
## Summary of Changes
Poll the UnorderedSet properly.
## Problem
We recently added slow GetPage request logging. However, this
unintentionally included the flush time when logging (which we already
have separate logging for). It also logs at WARN level, which is a bit
aggressive since we see this fire quite frequently.
Follows https://github.com/neondatabase/neon/pull/10906.
## Summary of changes
* Only log the request execution time, not the flush time.
* Extract a `pagestream_dispatch_batched_message()` helper.
* Rename `warn_slow()` to `log_slow()` and downgrade to INFO.
We've seen some cases in production where a compute doesn't get a
response to a pageserver request for several minutes, or even more. We
haven't found the root cause for that yet, but whatever the reason is,
it seems overly optimistic to think that if the pageserver hasn't
responded for 2 minutes, we'd get a response if we just wait patiently a
little longer. More likely, the pageserver is dead or there's some kind
of a network glitch so that the TCP connection is dead, or at least
stuck for a long time. Either way, it's better to disconnect and
reconnect. I set the default timeout to 2 minutes, which should be
enough for any GetPage request under normal circumstances, even if the
pageserver has to download several layer files from remote storage.
Make the disconnect timeout configurable. Also make the "log interval",
after which we print a message to the log configurable, so that if you
change the disconnect timeout, you can set the log timeout
correspondingly. The default log interval is still 10 s. The new GUCs
are called "neon.pageserver_response_log_timeout" and
"neon.pageserver_response_disconnect_timeout".
Includes a basic test for the log and disconnect timeouts.
Implements issue #10857
## Problem
There's new rate-limits coming on docker hub. To reduce our reliance on
docker hub and the problems the limits are going to cause for us, we
want to prepare for this by also pushing our container images to ghcr.io
## Summary of changes
Push our images to ghcr.io as well and not just docker hub.
WaitEventSetWait() returns the number of "events" that happened, and
only that many events in the WaitEvent array are updated. When the
timeout is reached, it returns 0 and does not modify the WaitEvent array
at all. We were reading 'event.events' without checking the return
value, which would be uninitialized when the timeout was hit.
No test included, as this is harmless at the moment. But this caused the
test I'm including in PR #10882 to fail. That PR changes the logic to
loop back to retry the PQgetCopyData() call if WL_SOCKET_READABLE was
set. Currently, making an extra call to PQconsumeInput() is harmless,
but with that change in logic, it turns into a busy-wait.
For archived timelines, we would like to prevent all non-pageserver
issued getpage requests, as users are not supposed to send these.
Instead, they should unarchive a timeline before issuing any external
read traffic.
As this is non-trivial to do, at least prevent launches of new computes,
by erroring on basebackup requests for archived timelines. In #10688,
we started issuing a warning instead of an error, because an error would
mean a stuck project. Now that we can confirm the warning has not been
present in the logs for about a week, we can issue errors.
Follow-up of #10688
Related: #9548
## Problem
close https://github.com/neondatabase/cloud/issues/24485
## Summary of changes
This patch adds a new chaos injection mode for the storcon. The chaos
injector reads the crontab and exits immediately at the configured time.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
This upgrades the `proxy/` crate as well as the forked libraries in
`libs/proxy/` to edition 2024.
Also reformats the imports of those forked libraries via:
```
cargo +nightly fmt -p proxy -p postgres-protocol2 -p postgres-types2 -p tokio-postgres2 -- -l --config imports_granularity=Module,group_imports=StdExternalCrate,reorder_imports=true
```
It can be read commit-by-commit: the first commit has no formatting
changes, only changes to accommodate the new edition.
Part of #10918
## Problem
https://github.com/neondatabase/neon/pull/10788 introduced an API for
warming up attached locations
by downloading all layers in the heatmap. We intend to use it for
warming up timelines after unarchival too,
but it doesn't work. Any heatmap generated after the unarchival will not
include our timeline, so we've lost
all those layers.
## Summary of changes
Generate a cheeky heatmap on unarchival. It includes all the visible
layers.
Use that as the `PreviousHeatmap` which inputs into actual heatmap
generation.
Closes: https://github.com/neondatabase/neon/issues/10541
## Problem
There are a bunch of minor improvements that are too small and
insignificant as is, so collecting them in one PR.
## Summary of changes
- Add runner arch to artifact name to make it easier to distinguish
files on S3
([ref](https://neondb.slack.com/archives/C059ZC138NR/p1739365938371149))
- Use `github.event.pull_request.number` instead of parsing
`$GITHUB_EVENT_PATH` file
- Update Allure CLI and `allure-pytest`
## Problem
We failed to detect https://github.com/neondatabase/neon/pull/10845
before merging, because the tests we run with a matrix of component
versions didn't include the ones that did live migrations.
## Summary of changes
- Do a live migration during the storage controller smoke test, since
this is a pretty core piece of functionality
- Apply a compat version matrix to the graceful cluster restart test,
since this is the functionality that we most urgently need to work
across versions to make deploys work.
I expect the first CI run of this to fail, because
https://github.com/neondatabase/neon/pull/10845 isn't merged yet.
ref: https://github.com/neondatabase/cloud/issues/23385
Adds a direction flag as well as private-link ID to the traffic
reporting pipeline. We do not yet actually count ingress, but we include
the flag anyway.
I have additionally moved vpce_id string parsing earlier, since we
expect it to be utf8 (ascii).
The synchronous 'postgres' crate is just a wrapper around the async
'tokio_postgres' crate. Some places were unnecessarily using the
re-exported NoTls and Error from the synchronous 'postgres' crate, even
though they were otherwise using the 'tokio_postgres' crate. Tidy up by
using the tokio_postgres types directly.
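As a hedged illustration of the tidy-up (not a snippet from the PR), taking the types from `tokio_postgres` directly looks roughly like this:
```rust
// Use tokio_postgres types directly instead of the sync crate's re-exports.
use tokio_postgres::{Error, NoTls};

async fn connect(conninfo: &str) -> Result<tokio_postgres::Client, Error> {
    let (client, connection) = tokio_postgres::connect(conninfo, NoTls).await?;
    // The connection object drives the protocol; run it in the background.
    tokio::spawn(connection);
    Ok(client)
}
```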
## Problem
After refactoring the configuration code to phases, it became a bit
fuzzy which code filters out DBs that are not present in Postgres, are
invalid, or have `datallowconn = false`. The first two matter for the DB
dropping case: we could be in an operation retry, so the DB could already
be absent in Postgres or invalid (interrupted `DROP DATABASE`).
Recent case:
https://neondb.slack.com/archives/C03H1K0PGKH/p1740053359712419
## Summary of changes
Add a common code that filters out inaccessible DBs inside
`ApplySpecPhase::RunInEachDatabase`.
Updates `vm_monitor` to edition 2024. We like to stay on the latest
edition if possible. There are no functional changes, only formatting
changes due to the rustfmt edition.
part of https://github.com/neondatabase/neon/issues/10918
There was a race condition between compute_ctl and the collection of the
metric that reports whether the neon extension has been updated.
compute_ctl runs `ALTER EXTENSION neon UPDATE` in the postgres database
on compute start.
Fixes: https://github.com/neondatabase/neon/issues/10932
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Prefetch is performed locally, so different backends can request the same
pages from the pageserver.
Such duplicated requests increase pageserver load and network traffic.
Making prefetch global seems to be very difficult and undesirable,
because different queries can access chunks at different speeds. Storing
prefetched chunks in the LFC will not completely eliminate duplicates,
but it can minimise such requests.
The problem with storing prefetch results in the LFC is that in this case
the page is not protected by a shared buffer lock.
So we have to perform extra synchronisation on the LFC side.
See:
https://neondb.slack.com/archives/C0875PUD0LC/p1736772890602029?thread_ts=1736762541.116949&cid=C0875PUD0LC
@MMeent's implementation of prewarm:
See https://github.com/neondatabase/neon/pull/10312/
## Summary of changes
Use condition variables to synchronise access to LFC entries.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
The storage controller treats durations in the tenant config as strings.
These are loaded from the db.
The pageserver maps these durations to a seconds-only format, so we
always get a mismatch compared
to what's in the db.
## Summary of changes
Treat durations as durations inside the storage controller and not as
strings.
Nothing changes in the cross service API's themselves or the way things
are stored in the db.
I also added some logging that would have made the investigation a
10-minute job:
1. Reason for why the reconciliation was spawned
2. Location config diff between the observed and wanted states
## Problem
Now that we rely on the optimisation logic to handle fixing things up
after some tenants are in the wrong AZ (e.g. after node failure), it's
no longer appropriate to treat optimisations as an ultra-low-priority
task. We used to reflect that low priority with a very low limit on
concurrent execution, such that we would only migrate 2 things every 20
seconds.
## Summary of changes
- Increase MAX_OPTIMIZATIONS_EXEC_PER_PASS from 2 to 16
- Increase MAX_OPTIMIZATIONS_PLAN_PER_PASS from 8 to 64.
Since we recently gave user-initiated actions their own semaphore, this
should not risk starving out API requests.
Safekeepers only respond to requests with the per-tenant scope, or the
`safekeeperdata` JWT scope. Therefore, add infrastructure in the storage
controller for safekeeper JWTs. Also, rename the ambiguous `jwt_token`
to `pageserver_jwt_token`.
Part of #9011
Related: https://github.com/neondatabase/cloud/issues/24727
Return an empty json response in the `scheduling_policy` handler.
This prevents errors of the form:
```
Error: receive body: error decoding response body: EOF while parsing a value at line 1 column 0
```
when setting the scheduling policy via the `storcon_cli`.
part of #9011.
Plv8 consists of two parts:
1. the V8 engine, which is built from vendored sources, and
2. the PostgreSQL extension.
Split those into two separate steps in the Dockerfile. The first step
doesn't need any PostgreSQL sources or any other files from the neon
repository, just the build tools and the upstream plv8 sources. Use the
build-deps image as the base for that step, so that the layer can be
cached and doesn't need to be rebuilt every time. This is worthwhile
because the V8 build takes a very long time.
## Problem
This should be one last fix for
https://github.com/neondatabase/neon/issues/10517.
## Summary of changes
If a keyspace is empty, we might produce a gc-compaction job which
covers no layer files. We should avoid generating such jobs so that the
gc-compaction image layer can cover the full key range.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We don't have good observability for "stuck" getpage requests.
Resolves https://github.com/neondatabase/cloud/issues/23808.
## Summary of changes
Log a periodic warning (every 30 seconds) if GetPage request execution
is slow to complete, to aid in debugging stuck GetPage requests.
This does not cover response flushing (we have separate logging for
that), nor reading the request from the socket and batching it (expected
to be insignificant and not straightforward to handle with the current
protocol).
This costs 95 nanoseconds on the happy path when awaiting a
`tokio::task::yield_now()`:
```
warn_slow/enabled=false time: [45.716 ns 46.116 ns 46.687 ns]
warn_slow/enabled=true time: [141.53 ns 141.83 ns 142.18 ns]
```
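A minimal sketch of the idea (names and the warning cadence are illustrative, not the actual pageserver code): wrap the request future and emit a warning each time a 30-second ticker fires before it completes.
```rust
use std::time::Duration;
use tokio::time::sleep;

// Race the request against a 30s ticker; log a warning on every tick until
// the request completes, then return its output unchanged.
async fn warn_slow<F: std::future::Future>(name: &str, fut: F) -> F::Output {
    tokio::pin!(fut);
    let mut waited = Duration::ZERO;
    loop {
        tokio::select! {
            output = &mut fut => return output,
            _ = sleep(Duration::from_secs(30)) => {
                waited += Duration::from_secs(30);
                tracing::warn!("slow {name} request: still executing after {waited:?}");
            }
        }
    }
}
```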
## Problem
Refer to https://github.com/neondatabase/neon/issues/10885
The wait position in the ring buffer, used to restrict the number of
in-flight requests, is not correctly calculated.
## Summary of changes
Update the condition and remove a redundant assertion
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
It clashed with pg_mooncake
This is the same as the hotfix #10908 , but for the main branch, to keep
the release and main branches in sync. In particular, we don't want to
accidentally revert this temporary fix, if we cut a new release from
main.
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.
[Announcement blog
post](https://blog.rust-lang.org/2025/02/20/Rust-1.85.0.html).
Prior update was in #10618.
## Problem
The interpreted SK <-> PS protocol does not guard against gaps (neither
does the Vanilla one, but that's beside the point).
## Summary of changes
Extend the protocol to include the start LSN of the PG WAL section from
which the records were interpreted.
Validation is enabled via a config flag on the pageserver and works as
follows:
**Case 1**: `raw_wal_start_lsn` is smaller than the requested LSN
There can't be gaps here, but we check that the shard received records
which it hasn't seen before.
**Case 2**: `raw_wal_start_lsn` is equal to the requested LSN
This is the happy case. No gap and nothing to check
**Case 3**: `raw_wal_start_lsn` is greater than the requested LSN
This is a gap.
To make Case 3 work I had to bend the protocol a bit.
We read chunks of WAL which aren't record-aligned and feed them
to the decoder.
The picture below shows a shard which subscribes at a position somewhere
within Record 2.
We already have a wal reader which is below that position so we wait to
catch up.
We read some wal in Read 1 (all of Record 1 and some of Record 2). The
new shard doesn't
need Record 1 (it has already processed it according to the starting
position), but we read
past its starting position. When we do Read 2, we decode Record 2 and
ship it off to the shard,
but the starting position of Read 2 is greater than the starting
position the shard requested.
This looks like a gap.

To make it work, we extend the protocol to send an empty
`InterpretedWalRecords` to shards
if the WAL the records originated from ends the requested start
position. On the pageserver,
that just updates the tracking LSNs in memory (no-op really). This gives
us a workaround for
the fake gap.
As a drive-by, make `InterpretedWalRecords::next_record_lsn` mandatory
in the application-level definition.
It's always included.
Related: https://github.com/neondatabase/cloud/issues/23935
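A hedged sketch of the three-case check described above; the struct, field names, and plain `u64` LSNs are stand-ins, not the actual protocol types.
```rust
// Stand-in for the real batch type carried by the interpreted protocol.
struct InterpretedWalRecords {
    raw_wal_start_lsn: u64,
    next_record_lsn: u64,
}

fn validate_no_gap(
    batch: &InterpretedWalRecords,
    requested_start_lsn: u64,
    last_seen_record_lsn: u64,
) -> Result<(), String> {
    if batch.raw_wal_start_lsn > requested_start_lsn {
        // Case 3: the raw WAL the records came from starts past the requested
        // position, so WAL in between was never decoded for this shard.
        return Err(format!(
            "gap: requested {requested_start_lsn}, raw WAL starts at {}",
            batch.raw_wal_start_lsn
        ));
    }
    if batch.raw_wal_start_lsn < requested_start_lsn {
        // Case 1: no gap is possible, but the batch must advance past what
        // the shard has already seen.
        assert!(batch.next_record_lsn > last_seen_record_lsn);
    }
    // Case 2 (equal): the happy case, nothing to check.
    Ok(())
}
```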
## Problem
The storage controller uses insecure HTTP for the pageserver API.
Closes: https://github.com/neondatabase/cloud/issues/23734
Closes: https://github.com/neondatabase/cloud/issues/24091
## Summary of changes
- Add an optional `listen_https_port` field to storage controller's Node
state and its API (RegisterNode/ListNodes/etc).
- Allow updating `listen_https_port` on node registration to gradually
add https port for all nodes.
- Add `use_https_pageserver_api` CLI option to storage controller to
enable https.
- Pageserver doesn't support https for now and always reports
`https_port=None`. This will be addressed in a follow-up PR.
ALTER SUBSCRIPTION requires an AccessExclusive lock,
which conflicts with iteration over pg_subscription when multiple
databases are present
and operations are applied concurrently.
Fix by explicitly locking pg_subscription
at the beginning of the transaction in each database.
## Problem
https://github.com/neondatabase/cloud/issues/24292
Fix an issue caused by PR
https://github.com/neondatabase/neon/pull/10891: we introduced the
concept of timeouts for heartbeats, where we would hang up on the other
side of the oneshot channel if a timeout happened (future gets
cancelled, receiver is dropped).
This hang-up would make the heartbeat task panic when it did obtain the
response, as we unwrap the result of the send operation. According to the
logs, that panic is the last sign of life we see of that process
invocation. I'm not sure what brings down the process; in theory tokio
[should
continue](https://docs.rs/tokio/latest/tokio/runtime/enum.UnhandledPanic.html#variant.Ignore),
but I don't know.
Alternative to #10901.
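A hedged sketch of the failure mode described above (illustrative names, not the actual heartbeat code): once the requesting side times out and drops the receiver, `send` returns an error, and unwrapping it is what made the task panic.
```rust
use tokio::sync::oneshot;

// Tolerant variant: if the receiver was dropped because the heartbeat timed
// out, there is nobody left to notify, so just drop the response.
fn deliver_response(tx: oneshot::Sender<u64>, response: u64) {
    if tx.send(response).is_err() {
        // Receiver dropped (request timed out); ignore instead of unwrapping.
    }
}
```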
## Problem
Background heatmap uploads and downloads were blocking the ones done
manually by the test.
## Summary of changes
Disable background heatmap uploads and downloads for the cold migration
test; the test does them explicitly.
## Problem
Pinning build tools still replicated the ACR/ECR/Docker Hub login and
pushing, even though we have a reusable workflow for this. Was mentioned
as a TODO in https://github.com/neondatabase/neon/pull/10613.
## Summary of changes
Reuse `_push-to-container-registry.yml` for pinning the build-tools
images.
`pprof::symbolize()` used a regex to strip the Rust monomorphization
suffix from generic methods. However, the `backtrace` crate can do this
itself if formatted with the `:#` flag.
Also tighten up the code a bit.
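A hedged sketch of the simplification (not the exact code in `pprof::symbolize()`): the `backtrace` crate's `SymbolName` already drops the `::h…` hash when formatted with the alternate flag.
```rust
// Collect demangled frame names without the monomorphization hash suffix.
fn frame_names(frame: &backtrace::BacktraceFrame) -> Vec<String> {
    frame
        .symbols()
        .iter()
        .filter_map(|symbol| symbol.name())
        .map(|name| format!("{name:#}")) // `:#` strips the trailing ::h0123abcd…
        .collect()
}
```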
This PR does the following things:
* The initial heartbeat round blocks the storage controller from
becoming online again. If all safekeepers are unresponsive, this can
cause storage controller startup to be very slow. The original intent of
#10583 was that heartbeats don't affect normal functionality of the
storage controller. So add a short timeout to prevent it from impeding
storcon functionality.
* Fix the URL of the utilization endpoint.
* Don't send heartbeats to safekeepers which are decommissioned.
Part of https://github.com/neondatabase/neon/issues/9011
context: https://neondb.slack.com/archives/C033RQ5SPDH/p1739966807592589
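A hedged sketch of the startup-bound heartbeat round described in the first bullet above; the timeout value and names are illustrative.
```rust
use std::time::Duration;
use tokio::time::timeout;

// Bound the initial heartbeat round so unresponsive safekeepers cannot stall
// storage controller startup; the regular heartbeat loop catches up later.
async fn initial_heartbeat_round<F>(heartbeat_all: F)
where
    F: std::future::Future<Output = ()>,
{
    if timeout(Duration::from_secs(5), heartbeat_all).await.is_err() {
        // Timed out: proceed with startup anyway.
    }
}
```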
## Problem
A simpler version of https://github.com/neondatabase/neon/pull/10812
## Summary of changes
Image layer creation will be preempted by L0 accumulated on other
timelines. We stop image layer generation if there's a pending L0
compaction request.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Adds CPU/heap profiling for storcon.
Also fixes allowlists to match on the path only, since profiling
endpoints take query parameters.
Requires #10892 for heap profiling.
## Problem
We'd like to enable CPU/heap profiling for storcon. This requires
jemalloc.
## Summary of changes
Use jemalloc as the global allocator, and enable heap sampling for
profiling.
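A minimal sketch of the allocator switch, assuming the `tikv-jemallocator` crate; heap sampling itself is switched on through jemalloc's own configuration (e.g. `MALLOC_CONF="prof:true,prof_active:true,lg_prof_sample:21"`).
```rust
// Route all allocations through jemalloc so the profiling endpoints can
// sample and export heap data.
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
```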
## Problem
We've seen the previous default of 50 cause OOMs. Compacting many L0
layers at once now has limited benefit, since the cost is mostly linear
anyway. This is already being reduced to 20 in production settings.
## Summary of changes
Reduce `DEFAULT_COMPACTION_UPPER_LIMIT` to 20.
Once released, let's remove the config overrides.
## Problem
Our AWS account IDs are copy-pasted all over the place. A wrong paste
might only be caught late if we hardcode them, but will get flagged
instantly by actionlint if we access them from github actions variables.
Resolves https://github.com/neondatabase/neon/issues/10787, follow-up
for https://github.com/neondatabase/neon/pull/10613.
## Summary of changes
Access AWS account IDs using GitHub Actions variables.
## Problem
Autosplits are crucial for bulk ingest performance. However, autosplits
were only attempted when there was no other pending work. This could
cause e.g. mass AZ affinity violations following Pageserver restarts to
starve out autosplits for hours.
Resolves #10762.
## Summary of changes
Always attempt autosplits in the background reconciliation loop,
regardless of other pending work.
## Problem
We measure ingest performance for a few variants (stripe sizes,
pre-sharded, shard-split).
However, some pageserver phenomena (e.g. related to L0 compaction) can be
better observed and optimized with unsharded tenants.
## Summary of changes
- Allow creating projects with a policy that disables sharding
(`{"scheduling": "Essential"}`)
- Add a variant to ingest_benchmark that uses that policy for the new
project
## Test run
https://github.com/neondatabase/neon/actions/runs/13396325970
## Problem
The nightly test discovered problems in the extensions upgrade test.
1. `PLv8` has different versions on PGv17 and PGv16 and a different test
set, which was not implemented correctly
[sample](https://github.com/neondatabase/neon/actions/runs/13382330475/job/37372930271)
2. The same for `semver`
[sample](https://github.com/neondatabase/neon/actions/runs/13382330475/job/37372930017)
3. `pgtap` interfered with the other tests, e.g. tables created by
other extensions caused the tests to fail.
## Summary of changes
The discovered problems were fixed.
1. The tests list for `PLv8` is now generated using the original
Makefile
2. The patches for `semver` are now split for PGv16 and PGv17.
3. `pgtap` is being tested in a separate database now.
---------
Co-authored-by: Mikhail Kot <mikhail@neon.tech>
## Problem
Read errors during repartition should be a critical error.
## Summary of changes
<del>We only have one call site</del> We have two call sites of
`repartition` where one of them is during the initial image upload
optimization and another is during image layer creation, so I added a
`critical!` here instead of inside `collect_keyspace`.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The usual workflow for me to debug read path errors in staging is:
download the tenant to my laptop, import, and then run some read tests.
With this patch, we can do this directly over staging pageservers.
## Summary of changes
* Add a new `touchpagelsn` API that does a page read but does not return
page info back.
* Allow reading from the latest record LSN from get/touchpagelsn
* Add read_debug config in the context.
* The read path will read the context config to decide whether to enable
read path tracing or not.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We log image compaction stats even when no image compaction happened.
This is logged every 10 seconds for every timeline.
## Summary of changes
Only log when we actually performed any image compaction.
## Problem
If the deploy job on the release branch doesn't succeed, the preprod
deployment will not have happened. It was requested that this triggers a
notification in https://github.com/neondatabase/neon/issues/10662.
## Summary of changes
If we're on the release branch and the deploy job doesn't end up in
"success", notify storage oncall on slack.
## Problem
We lack an API for warming up attached locations based on the heatmap
contents.
This is problematic in two places:
1. If we manually migrate and cut over while the secondary is still cold
2. When we re-attach a previously offloaded tenant
## Summary of changes
https://github.com/neondatabase/neon/pull/10597 made heatmap generation
additive
across migrations, so we won't clobber it after a cold migration. This
allows us to implement:
1. An endpoint for downloading all missing heatmap layers on the
pageserver:
`/v1/tenant/:tenant_shard_id/timeline/:timeline_id/download_heatmap_layers`.
Only one such operation per timeline is allowed at any given time. The
granularity is tenant shard.
2. An endpoint to the storage controller to trigger the downloads on the
pageserver:
`/v1/tenant/:tenant_shard_id/timeline/:timeline_id/download_heatmap_layers`.
This works both at
tenant and tenant shard level. If an unsharded tenant id is provided,
the operation is started on
all shards, otherwise only the specified shard.
3. A storcon cli command. Again, tenant and tenant-shard level
granularities are supported.
Cplane will call into storcon and trigger the downloads for all shards.
When we want to rescue a migration, we will use storcon cli targeting
the specific tenant shard.
Related: https://github.com/neondatabase/neon/issues/10541
## Problem
ref https://github.com/neondatabase/neon/issues/10517
## Summary of changes
For some reason, the job split algorithm decides on different image
coverage ranges for the two compactions before/after restart. So we remove
the subcompaction key range and let it generate an image covering the
full range, which should make the test more stable.
Also slightly tuned the logging span.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
This reverts commit 443c8d0b4b.
## Problem
We observe a massive amount of compaction errors.
## Summary of changes
If the tenant did not write any L1 layers (i.e., it accumulated L0
layers whose count is below the L0 compaction threshold), image creation
will always fail. Therefore, it's not correct to simply use the
disk_consistent_lsn or the L0/L1 boundary for image creation.
Preparations for a successor of #10440:
* Move `pull_timeline` to `safekeeper_api` and add it to
`SafekeeperClient`. We want to do `pull_timeline` on any creations that
we couldn't do initially.
* Add a `SafekeeperGeneration` type instead of relying on a type alias.
We want to maintain a safekeeper-specific generation number in the
storcon database now. A separate type is important to make it impossible
to mix it up with the tenant's pageserver-specific generation number,
which we absolutely want to avoid for correctness reasons. If someone
mixes up a safekeeper and a pageserver id (both use the `NodeId` type),
that's bad, but at least no wrong generations are flying around.
part of #9011
## Problem
`discard all` cannot run in a transaction (even if implicit)
## Summary of changes
Split up the query into two, we don't need transaction support.
## Problem
Tests with mixed versions of binaries always pick up new versions if
services are started using `neon_local`.
## Summary of changes
- Set `neon_local_binpath` along with `neon_binpath` and
`pg_distrib_dir` for tests with mixed versions
This replaces the use of the awscli utility. The awscli binary is
massive: it added about 200 MB to the docker image size, while the s3
client was already a dependency, so using it is essentially free as far
as binary size is concerned.
I implemented a simple upload function that tries to keep 10 uploads
going in parallel. I believe that's the default behavior of the "aws s3
sync" command too.
These generated Postgres settings JSON files can get out of sync, causing
the control plane to reject updates to an endpoint's or project's Postgres
settings.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
The current `pageserver_layers_per_read` histogram buckets don't
represent the current reality very well. For the percentiles we care
about (e.g. p50 and p99), we often see fairly high read amp, especially
during ingestion, and anything below 4 can be considered very good.
## Summary of changes
Change the per-timeline read amp histogram buckets to `[4.0, 8.0, 16.0,
32.0, 64.0, 128.0, 256.0]`.
## Problem
Test fails when comparing the first WAL segment because the system id in
the segment header is different. The system id is not consistently set
correctly since segments are usually inited on the safekeeper sync step
with sysid 0.
## Summary of Changes
Compare timeline digests instead. This skips the header.
Closes https://github.com/neondatabase/neon/issues/10596
I was looking into
https://github.com/neondatabase/serverless/issues/144, I recall previous
cases where proxy would trigger these prepared statements which would
conflict with other statements prepared by our client downstream.
Because of that, and also to aid in debugging, I've made sure all
prepared statements that proxy needs to make have specific names that
are unlikely to conflict and make it clear in an error log if it's our
statements that are causing issues.
Instead of hardcoding the request timeout, let's make it configurable as
a PGC_SUSET GUC.
Additionally, add a connect timeout GUC. Although the extension server
runs on the compute, it is always best to keep operations from hanging.
Better to present a timeout error to the user than a stuck backend.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
In #10707 some new fields were introduced in TimelineInfo.
I forgot that we do not only use TimelineInfo for encoding, but also
decoding when the storage controller calls into a pageserver, so this
broke some calls from controller to pageserver while in a mixed-version
state.
## Summary of changes
- Make new fields have default behavior so that they are optional
Originally I wanted to switch back to the `neon` branch before merging
#10825, but I forgot to do it. Do it in a separate PR now.
No actual change of the source code, only changes the branch name (so
that maybe in a few weeks we can delete the temporary branch
`arpad/neon-rebase`).
## Problem
`SmgrOpFlushInProgress::measure()` takes a `socket_fd` argument which is
only used on Linux. This causes linter warnings on macOS.
Touches #10823.
## Summary of changes
Add a noop use of `socket_fd` on non-Linux branch.
## Problem
See https://github.com/neondatabase/neon/issues/10839
The rho(x,b) function returns values in the range [1, b+1], and addSHLL
tries to store them in an array of size b+1.
## Summary of changes
Subtract 1 from the value returned by rho
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
We want to host pg_duckdb (starting with v0.3.1) on Neon.
This PR replaces https://github.com/neondatabase/neon/pull/10350 which
was for older pg_duckdb v0.2.0
Use cases:
- faster OLAP queries
- access to data lake files (e.g. parquet) on S3 buckets from Neon
PostgreSQL
Because Neon does not provide a superuser role to its customers, we need
to grant some additional permissions to neon_superuser.
Note: some grants that we require are already granted to `PUBLIC` in a
newer release of pg_duckdb
[here](3789e4c509/sql/pg_duckdb--0.2.0--0.3.0.sql (L1054))
```sql
GRANT ALL ON FUNCTION duckdb.install_extension(TEXT) TO neon_superuser;
GRANT ALL ON TABLE duckdb.extensions TO neon_superuser;
GRANT ALL ON SEQUENCE duckdb.extensions_table_seq TO neon_superuser;
```
## Problem
The timestamp prefix of pytest log lines contains milliseconds without
leading zeros, so millisecond values less than 100 are printed
incorrectly.
For example:
```
2025-02-15 12:02:51.997 INFO [_internal.py:97] 127.0.0.1 - - ...
2025-02-15 12:02:52.4 INFO [_internal.py:97] 127.0.0.1 - - ...
2025-02-15 12:02:52.9 INFO [_internal.py:97] 127.0.0.1 - - ...
2025-02-15 12:02:52.23 INFO [_internal.py:97] 127.0.0.1 - - ...
```
## Summary of changes
Fix log_format for pytest so that milliseconds are printed with leading
zeros.
In commit 9537829ccd I made shared_buffers be derived from the system's
available RAM. However, I failed to remove the old hard-coded
shared_buffers=10GB setting, so shared_buffers was set twice. Oopsie.
## Problem
Sometimes, a regression test run gets stuck (taking more than 60
minutes) and is killed by GitHub's `timeout-minutes` without leaving any
traces in the test results database.
I find no correlation between this and either the build type, the
architecture, or the Postgres version.
See: https://neonprod.grafana.net/goto/nM7ih7cHR?orgId=1
## Summary of changes
- Bump `pytest-timeout` to the version that supports `--session-timeout`
- Set `--session-timeout` to (timeout-minutes - 10 minutes) * 60 seconds
in an attempt to stop tests gracefully and generate test reports before
they are forcibly stopped by the stricter `timeout-minutes` limit.
## Problem
Part of https://github.com/neondatabase/neon/issues/9516
## Summary of changes
This patch adds support for storing reldir in the sparse keyspace.
All logic is guarded by the `rel_size_v2_enabled` flag, so if it's
set to false, the code path is exactly the same as what's currently in
prod.
Note that we do not yet persist the `rel_size_v2_enabled` flag; the
persistence logic will be implemented in the next patch (i.e., what if we
enable it, restart the pageserver, and then it gets set to false? We
should still read from v2 using the rel_size_v2_migration_status in the
index_part). The persistence logic I'll implement in the next patch will
disallow switching from v2 back to v1 via the config item.
I also refactored the metrics so that they work with the new reldir
store. However, this metric was not correctly computed for reldirs before
(see the comments). With the refactor, the value will be computed only
once we have an initial value for the reldir size. The refactor keeps the
incorrectness of the computation when there is more than one database.
For the tests, we currently run all the tests with v2, and I'll set it
to false and add some v2-specific tests before merging, probably also
v1->v2 migration tests.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
I noticed the opportunity to simplify here while working on
https://github.com/neondatabase/neon/pull/9353 .
The only difference is the zero-fill behavior: if one reads past the rel
size, `get_rel_page_at_lsn` returns a zeroed page whereas `Timeline::get`
returns an error.
However, `endblk` is at most the rel size, because `nblocks` equals
`get_rel_size`; see a few lines above this change.
We're using the same LSN (`self.lsn`) for everything, so there is no
chance of non-determinism.
Refs:
- Slack discussion debating correctness:
https://neondb.slack.com/archives/C033RQ5SPDH/p1737457010607119
# Summary
In
- https://github.com/neondatabase/neon/pull/10813
we added slow flush logging but it didn't log the TCP send & recv queue
length.
This PR adds that data to the log message.
I believe the implementation to be safe & correct right now, but it's
brittle and thus this PR should be reverted or improved upon once the
investigation is over.
Refs:
- stacked atop https://github.com/neondatabase/neon/pull/10813
- context:
https://neondb.slack.com/archives/C08DE6Q9C3B/p1739464533762049?thread_ts=1739462628.361019&cid=C08DE6Q9C3B
- improves https://github.com/neondatabase/neon/issues/10668
- part of https://github.com/neondatabase/cloud/issues/23515
# How It Works
The trouble is two-fold:
1. getting to the raw socket file descriptor through the many Rust types
that wrap it and
2. integrating with the `measure()` function
Rust wraps it in types to model file descriptor lifetimes and ownership,
and usually one can get access using `as_raw_fd()`.
However, we `split()` the stream, and the resulting
[`tokio::io::WriteHalf`](https://docs.rs/tokio/latest/tokio/io/struct.WriteHalf.html)
doesn't give access to the underlying file descriptor.
Check the PR commit history for my attempts to do it.
My solution is to get the socket fd before we wrap it in our protocol
types, and to store that fd in the new `PostgresBackend::socket_fd`
field.
I believe it's safe because the lifetime of the
`PostgresBackend::socket_fd` value equals the lifetime of the `TcpStream`
that we wrap and store in `PostgresBackend::framed`.
Specifically, the only place that close()s the socket is the `impl Drop
for TcpStream`.
I think the protocol stack calls `TcpStream::shutdown()`, but that
doesn't `close()` the file descriptor underneath.
Regarding integration with the `measure()` function, the trouble is that
`flush_fut` is currently a generic `Future` type. So, we just pass in
the `socket_fd` as a separate argument.
A clean implementation would convert the `pgb_writer.flush()` to a named
future that provides an accessor for the socket fd while not being
polled.
I tried (see PR history), but failed to break through the `WriteHalf`.
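A hedged sketch of the fd-capture approach (`PostgresBackend` here is a stand-in, not the real struct): grab the raw fd before the stream disappears behind the framing and `split()` layers.
```rust
use std::os::unix::io::{AsRawFd, RawFd};
use tokio::net::TcpStream;

struct PostgresBackend {
    // Valid as long as `framed` (the sole owner of the TcpStream) is alive,
    // because only the TcpStream's Drop impl close()s the fd.
    socket_fd: RawFd,
    framed: TcpStream, // stand-in for the real framed protocol wrapper
}

fn new_backend(stream: TcpStream) -> PostgresBackend {
    let socket_fd = stream.as_raw_fd(); // capture before wrapping/splitting
    PostgresBackend { socket_fd, framed: stream }
}
```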
# Testing
Tested locally by running
```
./target/debug/pagebench get-page-latest-lsn --num-clients=1000 --queue-depth=1000
```
in one terminal, waiting a bit, then
```
pkill -STOP pagebench
```
then wait for slow logs to show up in `pageserver.log`.
Pick one of the slow log message's port pairs, e.g., `127.0.0.1:39500`,
and then check the socket stats with
```
ss -ntp | grep '127.0.0.1:39500'
```
to ensure that send & recv queue size match those in the log message.
The logic might seem a bit intricate / over-optimized, but I recently
spent time benchmarking this code path in the context of a nightly
pagebench regression
(https://github.com/neondatabase/cloud/issues/21759)
and I want to avoid regressing it any further.
Ideally would also log the socket send & recv queue length like we do on
the compute side in
- https://github.com/neondatabase/neon/pull/10673
But that is proving difficult due to the Rust abstractions that wrap the
socket fd.
Work in progress on that is happening in
- https://github.com/neondatabase/neon/pull/10823
Regarding production impact, I am worried at a theoretical level that
the additional logging may cause a downward spiral in the case where a
pageserver is slow to flush because there is not enough CPU. The logging
would consume more CPU and thereby slow down flushes even more. However,
I don't think this matters practically speaking.
# Refs
- context:
https://neondb.slack.com/archives/C08DE6Q9C3B/p1739464533762049?thread_ts=1739462628.361019&cid=C08DE6Q9C3B
- fixes https://github.com/neondatabase/neon/issues/10668
- part of https://github.com/neondatabase/cloud/issues/23515
# Testing
Tested locally by running
```
./target/debug/pagebench get-page-latest-lsn --num-clients=1000 --queue-depth=1000
```
in one terminal, waiting a bit, then
```
pkill -STOP pagebench
```
then wait for slow logs to show up in `pageserver.log`.
To see that the completion log message is logged, run
```
pkill -CONT pagebench
```
## Problem
Some situations may produce a large number of pending reconciles. If we
experience an issue where reconciles are processed more slowly than
expected, that can prevent us responding promptly to user requests like
tenant/timeline CRUD.
This is a cleaner implementation of the hotfix in
https://github.com/neondatabase/neon/pull/10815
## Summary of changes
- Introduce a second semaphore for high priority tasks, with
configurable units (default 256). The intent is that in practical
situations these user-facing requests should never have to wait.
- Use the high priority semaphore for: tenant/timeline CRUD, and shard
splitting operations. Use normal priority for everything else.
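A hedged sketch of the two-lane scheme described above (permit counts and names are illustrative): user-facing work draws from its own permit pool, so a backlog of background reconciles cannot starve tenant/timeline CRUD or shard splits.
```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

struct ReconcileLimits {
    background: Arc<Semaphore>,
    high_priority: Arc<Semaphore>,
}

impl ReconcileLimits {
    fn new() -> Self {
        Self {
            background: Arc::new(Semaphore::new(128)),
            high_priority: Arc::new(Semaphore::new(256)),
        }
    }

    // High-priority work acquires from its own semaphore and never queues
    // behind background reconciles.
    async fn acquire(&self, high_priority: bool) -> OwnedSemaphorePermit {
        let sem = if high_priority {
            &self.high_priority
        } else {
            &self.background
        };
        sem.clone().acquire_owned().await.expect("semaphore closed")
    }
}
```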
## Problem
We run the compatibility tests only if we are upgrading the extension.
An accidental code change may break the test itself, so we have to check
this code as well.
## Summary of changes
The test is scheduled once a day to save time and resources.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
PR #10305 makes sure that there is no *actual* race, i.e. we will never
attempt to offload a timeline that has just been unarchived, or similar.
However, if a timeline has been unarchived and has children that are
unarchived too, we will get an error log line. Such races can occur as
in compaction we check if the timeline can be offloaded way before we
attempt to offload it: the result might change in the meantime.
This patch checks if the delete guard can't be obtained because the
timeline has unarchived children, and if yes, it does another check for
whether the timeline has become unarchived or not. If it is unarchived,
it just prints an info log message and integrates itself into the error
suppression logic of the compaction code calling into it.
If you squint at it really closely, there is still a possible race in
which we print an error log, but this one is unlikely because the
timeline and its children need to be archived right after the check for
whether the timeline has any unarchived children, and right before the
check whether the timeline is archived. Archival involves a network
operation while nothing between these two checks does that, so it's very
unlikely to happen in real life.
https://github.com/neondatabase/cloud/issues/23979#issuecomment-2651265729
## Problem
We had code for stripping IDs out of proxied paths to reduce cardinality
of metrics, but it was only stripping out tenant IDs, and leaving in
timeline IDs and query parameters (e.g. LSN in lsn->timestamp lookups).
## Summary of changes
- Use a more general regex approach.
There is still some risk that a future pageserver API might include a
parameter in `/the/path/`, but we control that API and it is not often
extended. We will also alert on metrics cardinality in staging so that
if we made that mistake we would notice.
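A hedged sketch of the general approach (the exact regex in the PR may differ): collapse anything that looks like an ID in the proxied path and drop the query string before using it as a metrics label.
```rust
use regex::Regex;

// Replace long hex runs (tenant/timeline IDs, shard suffixes) with a fixed
// placeholder and strip query parameters to keep label cardinality bounded.
fn metrics_path_label(path: &str) -> String {
    let path = path.split('?').next().unwrap_or(path);
    let id_like = Regex::new(r"[0-9a-fA-F]{16,}(-[0-9a-fA-F]+)?").unwrap();
    id_like.replace_all(path, "{id}").into_owned()
}
```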
## Problem
See https://github.com/neondatabase/neon/issues/10755
The random access pattern of pgbench leaves sparse chunks, which makes the
on-disk size of file.cache unpredictable.
## Summary of changes
Perform a seqscan to fill LFC chunks with data so that the on-disk file
size includes the size of the table.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
`benchmarks` is a long-running and non-blocking job. If, on Staging, a
deploy-blocking job fails, restarting it requires cancelling any running
`benchmarks` jobs, which is a waste of CI resources and requires a
couple of extra clicks for a human to do.
Ref: https://neondb.slack.com/archives/C059ZC138NR/p1739292995400899
## Summary of changes
- Run `benchmarks` after `deploy` job
- Handle `benchmarks` run in PRs with `run-benchmarks` label but without
`deploy` job.
There was a typo in the name of the utilization endpoint URL, fix it.
Also, ensure that the heartbeat mechanism actually works.
Related: #10583, #10429
Part of #9011
Before this PR, an IO error returned from the kernel, e.g., due to a bad
disk, would get bubbled up, all the way to a user-visible query failing.
This is against the IO error handling policy that we have established
and is hence being rectified in this PR.
[[(internal Policy document
link)]](bef44149f7/src/storage/handling_io_and_logical_errors.md (L33-L35))
The practice on the write path seems to be that we call
`maybe_fatal_err()` or `fatal_err()` fairly high up the stack.
That is, regardless of whether std::fs, tokio::fs, or VirtualFile is
used to perform the IO.
For the read path, I choose a centralized approach in this PR by
checking for errors as close to the kernel interface as possible.
I believe this is better for long-term consistency.
To mitigate the problem of missing context if we abort so far down in
the stack, the `on_fatal_io_error` now captures and logs a backtrace.
I grepped the pageserver code base for `fs::read` to convince myself
that all non-VirtualFile reads already handle IO errors according to
policy.
Refs
- fixes https://github.com/neondatabase/neon/issues/10454
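A minimal sketch of the centralized read-path handling described above (not the actual pageserver helper): abort on any kernel-reported IO error and capture a backtrace so the abort still carries context.
```rust
use std::backtrace::Backtrace;

// Treat a kernel-reported IO error as fatal: log what we were doing plus a
// backtrace, then abort instead of bubbling the error up to a user query.
fn fatal_io_err<T>(result: std::io::Result<T>, context: &str) -> T {
    match result {
        Ok(value) => value,
        Err(err) => {
            let backtrace = Backtrace::force_capture();
            eprintln!("fatal IO error while {context}: {err}\n{backtrace}");
            std::process::abort();
        }
    }
}
```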
## Problem
We didn't catch all client errors causing alerts.
## Summary of changes
Client errors should be wrapped with ClientError so that they don't fire
alerts.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The test is flaky: WAL in remote storage appears to be corrupted. One of
the hypotheses so far is that the corruption is the result of the local fs
implementation being non-atomic, as safekeepers may concurrently PUT
the same segment. That's dubious though, because looking at the local_fs
impl I'd expect an early EOF on segment read rather than the zeros
observed in test failures, but other directions seem even less probable.
## Summary of changes
Let's add s3 backend as well and see if it is also flaky. Also add some
more logging around segments uploads.
ref https://github.com/neondatabase/neon/issues/10761
There is now a compute_ctl_config field in the response that currently
only contains a JSON Web Key set. compute_ctl currently doesn't do
anything with the keys, but will in the future.
The reasoning for the new field is due to the nature of empty computes.
When an empty compute is created, it does not have a tenant. A compute
spec is the primary means of communicating the details of an attached
tenant. In the empty compute state, there is no spec. Instead we wait
for the control plane to pass us one via /configure. If we were to
include the jwks field in the compute spec, we would have a partial
compute spec, which doesn't logically make sense.
Instead, we can have two means of passing settings to the compute:
- spec: tenant specific config details
- compute_ctl_config: compute specific settings
For instance, the JSON Web Key set passed to the compute is independent
of any tenant. It is a setting of the compute whether it is attached or
not.
Signed-off-by: Tristan Partin <tristan@neon.tech>
pg_search is 46ish MB. All other remote extensions are around hundreds of
KB. 3 seconds is not long enough to download the tarball if the S3
gateway cache doesn't already contain a copy. According to our setup,
the cache is limited to 10 GB in size and anything that has not been
accessed for an hour is purged.
This is really bad for scaling to 0, even more so if you're the only
project actively using the extension in a production Kubernetes cluster.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
This test occasionally fails while the test teardown tries to do a
graceful shutdown, because the test has quickly written lots of data
into the pageserver.
Closes: #10654
## Summary of changes
- Call `post_checks` at the end of `test_isolation`, as we already do
for test_pg_regress -- this improves our detection of issues, and as a
nice side effect flushes the pageserver.
- Ignore pg_notify files when validating state at end of test, these are
not expected to be the same
## Problem
In #10752 I used an overly-strict regex that only ignored error on a
particular key.
## Summary of changes
- Drop key from regex so it matches all such errors
## Problem
The build of the neon container image is not caching any part of the
rust build, making it fairly slow.
## Summary of changes
Cache dependency building using cargo-chef.
## Problem
Previously, when cutting over to cold secondary locations,
we would clobber the previous, good, heatmap with a cold one.
This is because heatmap generation used to include only resident layers.
Once this merges, we can add an endpoint which triggers full heatmap
hydration on attached locations to heal cold migrations.
## Summary of changes
With this patch, heatmap generation becomes additive. If we have a
heatmap from when this location was secondary, the new uploaded heatmap
will be the result of a reconciliation between the old one and the on
disk resident layers.
More concretely, when we have the previous heatmap:
1. Filter the previous heatmap and keep layers that are (a) present
in the current layer map, (b) visible, (c) not resident. Call this set
of layers `visible_non_resident`.
2. From the layer map, select all layers that are resident and visible.
Call this set of layers `resident`.
3. The new heatmap is the result of merging the two disjoint sets.
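A hedged sketch of the merge described in the steps above, with plain strings and booleans standing in for the real layer-map types:
```rust
use std::collections::{HashMap, HashSet};

// layer_map: layer name -> (visible, resident)
fn merge_heatmap(
    previous_heatmap: &HashSet<String>,
    layer_map: &HashMap<String, (bool, bool)>,
) -> HashSet<String> {
    // Step 1: layers from the old heatmap that are still in the layer map,
    // visible, but not resident.
    let visible_non_resident = previous_heatmap
        .iter()
        .filter(|name| matches!(layer_map.get(*name), Some((true, false))))
        .cloned();
    // Step 2: all currently visible and resident layers.
    let resident = layer_map
        .iter()
        .filter(|(_, (visible, resident))| *visible && *resident)
        .map(|(name, _)| name.clone());
    // Step 3: the two sets are disjoint, so the new heatmap is their union.
    visible_non_resident.chain(resident).collect()
}
```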
Related https://github.com/neondatabase/neon/issues/10541
In #9011, we want to schedule timelines to safekeepers. In order to do
such scheduling, we need information about how utilized a safekeeper is
and if it's available or not.
Therefore, send constant heartbeats to the safekeepers and try to figure
out if they are online or not.
Includes some code from #10440.
## Problem
We expose `latest_gc_cutoff` in our API, and callers understandably were
using that to validate LSNs for branch creation. However, this is _not_
the true GC cutoff from a user's point of view: it's just the point at
which we last actually did GC. The actual cutoff used when validating
branch creations and page_service reads is the min() of latest_gc_cutoff
and the planned GC lsn in GcInfo.
Closes: https://github.com/neondatabase/neon/issues/10639
## Summary of changes
- Expose the more useful min() of GC cutoffs as `gc_cutoff_lsn` in the
API, so that the most obviously named field is really the one people
should use.
- Retain the ability to read the LSN at which GC was actually done, in
an `applied_gc_cutoff_lsn` field.
- Internally rename `latest_gc_cutoff_lsn` to `applied_gc_cutoff_lsn`
("latest" was a confusing name, as the value in GcInfo is more up to
date in terms of what a user experiences)
- Temporarily preserve the old `latest_gc_cutoff_lsn` field for compat
with control plane until we update it to use the new field.
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
https://github.com/neondatabase/neon/pull/10613 changed how images are
pushed, and therefore also how we have to wait for images to be pushed
in `trigger-e2e-tests`. The `trigger-e2e-tests` workflow is triggered in
three different ways:
- When a pull request is pushed to that is already ready to review, here
we call the workflow from `build_and_test`
- When a pull request is marked ready for review, then the workflow is
triggered directly
- When a push to `main` or `release(-.*)?` triggers `build_and_test` and
that indirectly calls `trigger-e2e-tests`.
The second of these paths had a bug, which was not tested in the PR,
because it wasn't clear to me that this path was different.
## Summary of changes
Fix the jq statement that caused the bug.
**NB: this PR must be merged only by 'Create a merge commit'!**
### Checklist when preparing for release
- [ ] Read or refresh [the release flow guide](https://www.notion.so/neondatabase/Release-general-flow-61f2e39fd45d4d14a70c7749604bd70b)
- [ ] Ask in the [cloud Slack channel](https://neondb.slack.com/archives/C033A2WE6BZ) that you are going to rollout the release. Any blockers?
- [ ] Does this release contain any db migrations? Destructive ones? What is the rollback plan?
<!-- List everything that should be done **before** release, any issues / setting changes / etc -->
### Checklist after release
- [ ] Make sure instructions from PRs included in this release and labeled `manual_release_instructions` are executed (either by you or by people who wrote them).
- [ ] Based on the merged commits write release notes and open a PR into `website` repo ([example](https://github.com/neondatabase/website/pull/219/files))