In
-
https://github.com/neondatabase/neon/pull/10993#issuecomment-2690428336
I added infinite retries for buffered writer flush IOs, primarily to
gracefully handle ENOSPC but more generally so that the buffered writer
is not left in a state where reads from the surrounding InMemoryLayer
cause panics.
However, I didn't add cancellation sensitivity, which is concerning
because then there is no way to detach a timeline/tenant that is
encountering the write IO errors.
That’s a legitimate scenario in the case of some edge case bug.
See the #10993 description for details.
This PR
- first makes flush loop infallible, enabled by infinite retries
- then adds sensitivity to `Timeline::cancel` to the flush loop, thereby
making it fallible in one specific way again
- finally fixes the InMemoryLayer/EphemeralFile/BufferedWriter
amalgamate to remain read-available after flush loop is cancelled.
The support for read-availability after cancellation is necessary so
that reads from the InMemoryLayer that are already queued up behind the
RwLock that wraps the BufferedWriter won't panic because of the
`mutable=None` that we leave behind in case the flush loop gets
cancelled.
# Alternatives
One might think that we can only ship the change for read-availability
if flush encounters an error, without the infinite retrying and/or
cancellation sensitivity complexity.
The problem with that is that read-availability sounds good but is
really quite useless, because we cannot ingest new WAL without a
writable InMemoryLayer. Thus, very soon after we transition to read-only
mode, reads from compute are going to wait anyway, but on `wait_lsn`
instead of the RwLock, because ingest isn't progressing.
Thus, having the infinite flush retries still makes more sense because
they're just "slowness" to the user, whereas wait_lsn is hard errors.
## Problem
https://github.com/neondatabase/neon/pull/11210 migrated pushing images
to ghcr. Unfortunately, it was incomplete in using images from ghcr,
which resulted in a few places referencing the ghcr build-tools image,
while trying to use docker hub credentials.
## Summary of changes
Use build-tools image from ghcr consistently.
## Problem
The pipelines after release merges are slower than they need to be at
the moment. This is because some kinds of tests/checks run on all kinds
of pipelines, even though they only matter in some of those.
## Summary of changes
Run `check-codestyle-{rust,python,jsonnet}`, `build-and-test-locally`
and `trigger-e2e-tests` only on regular PRs, not release PR or pushes to
main or release branches.
Both crates seem well maintained. x509-cert is part of the high quality
RustCrypto project that we already make heavy use of, and I think it
makes sense to reduce the dependencies where possible.
## Problem
Docker Hub has new rate limits coming up, and to avoid problems coming
with those we're switching to GHCR.
## Summary of changes
- Push images to GHCR initially and distribute them from there
- Use images from GHCR in docker-compose
There is no functional change here. We move safekeeper related code from
`service.rs` to `service/safekeeper_service.rs`, so that safekeeper
related stuff is contained in a single file. This also helps with
preventing `service.rs` from growing even further.
Part of #9011.
## Problem
The test is flaky because of the same reasons as described in
https://github.com/neondatabase/neon/issues/11177.
The test has already suppressed these `WARN` and `ERROR` log messages,
but the regexp didn't match all possible errors.
## Summary of changes
- Change regexp to suppress all possible allowed error log messages.
PRs #10891 and #10902 have time-bounded the safekeeper heartbeating of
the storage controller. Those timeouts were not meant to be permanent,
but temporary until we figured out the reasons for the safekeeper
heartbeating causing problems.
Now they are better understood and resolved. A comment is
[here](https://github.com/neondatabase/cloud/issues/24396#issuecomment-2679342929),
but most importantly, we've had:
* #10954 to send heartbeats concurrently (before the issue was we sent
them sequentially, so the total time time was number of nodes times time
for timeout to be hit, now the total time is the maximum of all things
we are heartbeating)
* work to actually make heartbeats work and not error, i.e. JWT rollout
for storcon, not sending heartbeats to decomissioned safekeepers,
removal of decomissioned safekeepers from the databases
Part of https://github.com/neondatabase/cloud/issues/25473
## Problem
Unlogged build is used for GIST/SPGIST/GIN/HNSW indexes.
In this mode we first change relation class to `RELPERSISTENCE_UNLOGGED`
and save them on local disk.
But we do not save unlogged relations in LFC.
It may cause fetching incorrect value from LFC if relfilenode is reused.
## Summary of changes
Save modified pages in LFC on second stage of unlogged build (when
modified pages are walloged).
There is no need to save pages in LFC at first phase because the will be
in any case overwritten with assigned LSN at second phase.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Adds a basic test that makes the storcon issue explicit creation of a
timeline on safeekepers (main storcon PR in #11058). It was adapted from
`test_explicit_timeline_creation` from #11002.
Also, do a bunch of fixes needed to get the test work (the API
definitions weren't correct), and log more stuff when we can't create a
new timeline due to no safekeepers being active.
Part of #9011
---------
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
Work on https://github.com/neondatabase/cloud/issues/23721 and
https://github.com/neondatabase/cloud/issues/23714
Depends on https://github.com/neondatabase/neon/pull/11111
- Add `/configure_telemetry` API endpoint
- Support second rsyslog configuration for Postgres logs export
- Enable logs export when compute feature is enabled and configure
Postgres to send logs to syslog
I have used `/configure_telemetry` name because in the future I see it
also being used for configuring a `pg_tracing` extension to export
traces. Let me know if you'd rather have these APIs separate. In this
case we can rename it to `/configure_rsyslog`.
## Problem
https://github.com/neondatabase/neon/actions/runs/13894288475/job/38871819190
shows the "Add fast-fordward label to PR to trigger fast-forward merge"
job being skipped. This is due to not using the right variable for
checking which branch the merge queue is merging into.
## Summary of changes
Use the `branch` output of the `meta` task for checking the target
branch of a merge group.
## Problem
In unlogged index build (used fir GIST/SPGIST/GIN indexes) files is
created on disk and then removed at the end.
It contradicts to the logic of DEBUG_COMPARE_LOCAL mode.
## Summary of changes
Do not create and unlink files in unlogged build in DEBUG_COMPARE_LOCAL
mode.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
In the previous PR #11045, one edge-case wasn't covered, when an ident
contains only one `$`, we were picking `$$` as a 'wrapper'. Yet, when
this `$` is at the beginning or at the end of the ident, then we end up
with `$$$` in a row which breaks the escaping.
## Summary of changes
Start from `x` tag instead of a blank string.
Slack:
https://neondb.slack.com/archives/C08HV951W2W/p1742076675079769?thread_ts=1742004205.461159&cid=C08HV951W2W
## Problem
Serial/numeric IDs lead to collisions, which is not critical but looks
awkward.
Previous discussion:
https://neondb.slack.com/archives/C033A2WE6BZ/p1741891345869979
## Summary of changes
Suggest using the `YYYY-MM-DD` prefix, which i) has less chance of
collision; ii) provides out-of-the-box lexicographic sorting; iii) even
if it collides, it's not a big deal -- just two RFCs have been started
on the same day.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
... to better match the workload characteristics of real Neon customers
## Problem
We analyzed workloads of large Neon users and want to extend the oltp
workload to include characteristics seen in those workloads.
## Summary of changes
- for re-use branch delete inserted rows from last run
- adjust expected run-time (time-outs) in GitHub workflow
- add queries that exposes the prefetch getpages path
- add I/U/D transactions for another table (so far the workload was
insert/append-only)
- add an explicit vacuum analyze step and measure its time
- add reindex concurrently step and measure its time (and take care that
this step succeeds even if prior reindex runs have failed or were
canceled)
- create a second connection string for the pooled connection that
removes the `-pooler` suffix from the hostname because we want to run
long-running statements (database maintenance) and bypass the pooler
which doesn't support unlimited statement timeout
## Test run
https://github.com/neondatabase/neon/actions/runs/13851772887/job/38760172415
## Problem
There is a rare race between controller graceful deployment and shard
splitting where we may incorrectly both abort _and_ complete the split
(on different pods), and thereby leave no shards at all in the database.
Related: #11254
## Summary of changes
- In complete_shard_split, refuse to delete anything if child shards are
not found
## Problem
Python's `jsonnet` 0.20.0 doesn't support Python 3.13, so we have a
couple of tests xfailed because of that.
## Summary of changes
- Bump `jsonnet` to `0.21.0rc2` which supports Python 3.13
- Unxfail `test_sql_exporter_metrics_e2e` and
`test_sql_exporter_metrics_smoke` on Python 3.13
- add pgaudt_gc thread to compute_ctl
to cleanup old pgaudit logs if they exist.
pgaudit can rotate files, but it doesn't delete the old files
- Add AUDIT_LOG_DIR_SIZE metric to compute_ctl
to track the size of the audit log directory in bytes.
- Fix permissions for rsyslog state files directory
## Problem
We exposed the direction tag in #10925 but didn't actually include the
ingress tag in the export to allow for an adaption period.
## Summary of changes
We now export the ingress direction
# Refs
- fixes https://github.com/neondatabase/neon/issues/11228
# Problem High-Level
When storcon validates whether a `ScheduleOptimizationAction` should be
applied, it retrieves the `tenant_secondary_status` to determine whether
a secondary is ready for the optimization.
When collecting results, it associates secondary statuses with the wrong
optimization actions in the batch of optimizations that we're
validating.
The result is that we make the decision for shard/location X based on
the SecondaryStatus of a random secondary location Y in the current
batch of optimizations.
A possible symptom is an early cutover, as seen in this engineering
investigation here:
- https://github.com/neondatabase/cloud/issues/25734
# Problem Code-Level
This code here in `optimize_all_validate`
97e2e27f68/storage_controller/src/service.rs (L7012-L7029)
zips the `want_secondary_status` with the Vec returned from
`tenant_for_shards_api` .
However, the Vec returned from `want_secondary_status` is not ordered
(it uses FuturesUnordered internally).
# Solution
Sort the Vec in input order before returning it.
`optimize_all_validate` was the only caller affected by this problem
While at it, also future-proof similar-looking function
`tenant_for_shards`.
None of its callers care about the order, but this type of function
signature is easy to use incorrectly.
# Future Work
Avoid the additional iteration, map, and allocation.
Change API to leverage AsyncFn (async closure).
And/or invert `tenant_for_shards_api` into a Future ext trait / iterator
adaptor thing.
Adds API definitions for the safekeeper API endpoints `exclude_timeline`
and `term_bump`. Also does a bugfix to return the correct type from
`delete_timeline`.
Part of #8614
## Problem
This is already disabled in production, as it is replaced by L0 flush
delays. It will be removed in a later PR, once the config option is no
longer specified in production.
## Summary of changes
Disable `l0_flush_wait_upload` by default.
## Problem
`test_metadata_image_creation ` became flaky with #11212, since image
compaction may yield to L0 compaction.
## Summary of changes
Set `NoYield` when compacting in tenant tests.
## Problem
We noticed that error metrics didn't show for some services with light
load. This is not great and can cause problems for dashboards/alerts
## Summary of changes
Pre-initialise some metricvecs.
We want to switch away from and deprecate the `--compute-hook-url` param
for the storcon in favour of `--control-plane-url` because it allows us
to construct urls with `notify-safekeepers`.
This PR switches the pytests and neon_local from a
`control_plane_compute_hook_api` to a new param named
`control_plane_hooks_api` which is supposed to point to the parent of
the `notify-attach` URL.
We still support reading the old url from disk to not be too disruptive
with existing deployments, but we just ignore it.
Also add docs for the `notify-safekeepers` upcall API.
Follow-up of #11173
Part of https://github.com/neondatabase/neon/issues/11163
## Problem
`l0_flush_delay_threshold` has already been set to 30 in production for
a couple of weeks. Let's harmonize the default.
## Summary of changes
Update `DEFAULT_L0_FLUSH_DELAY_FACTOR` to 3 such that the default
`l0_flush_delay_threshold` is `3 * compaction_threshold`.
This differs from the production setting, which is hardcoded to 30 (with
`compaction_threshold` at 10), and is more appropriate for any tenants
that have custom `compaction_threshold` overrides.
## Problem
ef0d4a48a adjusted how we build container images and how we push them,
and the neon-test-extensions image was overlooked. Additionally, is was
also missed in 1f0dea9a1, which pushed our container images to GHCR.
## Summary of changes
Push neon-test-extensions to GHCR and also push release tags for it.
## Problem
DEBUG_COMPARE_LOCAL is not supported in neon_zeroextend added in PG16
## Summary of changes
Add support of DEBUG_COMPARE_LOCAL in neon_zeroextend
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
# Problem
We leave too few observability breadcrumbs in the case where wait_lsn is
exceptionally slow.
# Changes
- refactor: extract the monitoring logic out of `log_slow` into
`monitor_slow_future`
- add global + per-timeline counter for time spent waiting for wait_lsn
- It is updated while we're still waiting, similar to what we do for
page_service response flush.
- add per-timeline counterpair for started & finished wait_lsn count
- add slow-logging to leave breadcrumbs in logs, not just metrics
For the slow-logging, we need to consider not flooding the logs during a
broker or network outage/blip.
The solution is a "log-streak-level" concurrency limit per timeline.
At any given time, there is at most one slow wait_lsn that is logging
the "still running" and "completed" sequence of logs.
Other concurrent slow wait_lsn's don't log at all.
This leaves at least one breadcrumb in each timeline's logs if some
wait_lsn was exceptionally slow during a given period.
The full degree of slowness can then be determined by looking at the
per-timeline metric.
# Performance
Reran the `bench_log_slow` benchmark, no difference, so, existing call
sites are fine.
We do use a Semaphore, but only try_acquire it _after_ things have
already been determined to be slow. So, no baseline overhead
anticipated.
# Refs
-
https://github.com/neondatabase/cloud/issues/23486#issuecomment-2711587222
Closes: https://github.com/neondatabase/cloud/issues/22998
If control-plane reports that TLS should be used, load the certificates
(and watch for updates), make sure postgres use them, and detects
updates.
Procedure:
1. Load certificates
2. Reconfigure postgres/pgbouncer
3. Loop on a timer until certificates have loaded
4. Go to 1
Notes:
1. We only run this procedure if requested on startup by control plane.
2. We needed to compile pgbouncer with openssl enabled
3. Postgres doesn't allow tls keys to be globally accessible - must be
read only to the postgres user. I couldn't convince the autoscaling team
to let me put this logic into the VM settings, so instead compute_ctl
will copy the keys to be read-only by postgres.
4. To mitigate a race condition, we also verify that the key matches the
cert.
## Problem
There was a panic on staging that compaction didn't find any keys. This
is possible if all layers selected for compaction does not contain any
keys within the current shard.
## Summary of changes
Make panic an error. In the future, we can try creating an empty image
layer so that GC can clean up those layers. Otherwise, for now, we can
only rely on shard ancestor compaction to remove these data.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`compaction_l0_first` has already been enabled in production for a
couple of weeks.
## Summary of changes
Enable `compaction_l0_first` by default.
Also set `CompactFlags::NoYield` in `timeline_checkpoint_handler`, to
ensure explicitly requested compaction runs to completion. This endpoint
is mainly used in tests, and caused some flakiness where tests expected
compaction to complete.
## Problem
#11061 introduced code fetching previous releases. #11151 introduced jq
error handling, which has also been applied in #11061, but parenthesis
have been missed.
## Summary of changes
Add parenthesis around error handling code.
## Problem
Autosplits always request `DEFAULT_STRIPE_SIZE` for splits. However,
splits do not allow changing the stripe size of already-sharded tenants,
and will error out if it differs.
In #11168, we are changing the stripe size, which could hit this when
attempting to autosplit already sharded tenants.
Touches #11168.
## Summary of changes
Pass `new_stripe_size: None` when autosplitting already sharded tenants.
Otherwise, pass `DEFAULT_STRIPE_SIZE` instead of the shard identity's
stripe size, since we want to use the current default rather than an
old, persisted default.
# Fix metric_unit length in test_compute_ctl_api.py
## Description
This PR changes the metric_unit from "microseconds" to "μs" in
test_compute_ctl_api.py to fix the issue where perf test results were
not being stored in the database due to the string exceeding the 10
character limit of the metric_unit column in the perf_test_results
table.
## Problem
As reported in Slack, the perf test results were not being uploaded to
the database because the "microseconds" string (12 characters) exceeds
the 10 character limit of the metric_unit column in the
perf_test_results table.
## Solution
Replace "microseconds" with "μs" in all metric_unit parameters in the
test_compute_ctl_api.py file.
## Testing
The changes have been committed and pushed. The PR is ready for review.
Link to Devin run:
https://app.devin.ai/sessions/e29edd672bd34114b059915820e8a853
Requested by: Peter Bendel
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: peterbendel@neon.tech <peterbendel@neon.tech>
## Problem
A small adjustment in #11061 broke the lint-release-pr.sh script, and
the new version was neither tested nor linted. This has been done now,
the script is once again tested and passing `shellcheck`.
## Summary of changes
Add missing `el` of `elif` condition chain.
## Problem
#11061 changed release pr creation, and I missed that the workflow will
checkout a would-be-merge of the rc branch and the release branch
instead of the head ref, unless explicitly instructed otherwise.
## Summary of changes
Check out head ref for linting the release PRs.
## Problem
#11061 changed release pr creation, and I missed that creating PRs using
`gh` in non-interactive environments *requires* `--body` instead of
defaulting to an empty body.
## Summary of changes
Explicitly set an empty body when creating release PRs.
## Problem
#11061 changed release PR creation, and I missed that we need to
explicitly fetch the whole history so that the relevant git refs and
objects are available.
## Summary of changes
- Fetch all git refs including history by setting fetch-depth to 0
- Reference release branch as a remote branch, because we haven't
checked it out locally
## Problem
close https://github.com/neondatabase/neon/issues/10310
## Summary of changes
This patch adds a new behavior for the detach_ancestor API: detach with
multi-level ancestor and no reparenting. Though we can potentially
support multi-level + do reparenting / single-level + no-reparenting in
the future, as it's not required for the recovery/snapshot epic, I'd
prefer keeping things simple now that we only handle the old one and the
new one instead of supporting the full feature matrix.
I only added a test case of successful detaching instead of testing
failures. I'd like to make this into staging and add more tests in the
future.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
When we release our components, we perform builds in the release PR,
then test the components, then merge the PR, and then build everything
*again*, run tests *again*, and only then start deployments.
To speed things up, we want to perform builds and run tests in the PR,
and start deployments using the existing artifacts from the release PR.
To make that possible, we need to have both CI pipelines running on the
same commit hash, which requires fast forwarding release. That only
works, if we have a commit in the PR that has the current release branch
state as an ancestor.
## Summary of changes
- Changes to release PR creation:
- Remove templates and automatic bodies for release PRs. The previous
template wasn't used anymore, and the automatic body we created in the
pipeline didn't contain any useful content anymore after the changees
here.
- Make it possible to select the source branch. For releases that aren't
cut from `main`, like https://github.com/neondatabase/neon/pull/11051,
we need a way to trigger the new flow from a different branch.
- Determine `release-branch` automatically from the component name
instead of passing that as well.
- Changes to the merge queue job:
- Rename `get-changed-files` to `meta` in preparation of additional data
being fetched as part of that job
- Fail the merge queue if we're trying to merge into a branch other than
main - this is to prevent non-fast-forward merges.
- Label PRs to branches other than main as `fast-forward`, to trigger
the fast-forward job
- Add a fast-forward job that can be triggered with the `fast-forward`
label that performs a fast-forward merge. This only happens if the PR
has `mergeable_state == clean`, so CI having passed.
- Build and Test on releases now skips building images, skips testing
images and skips triggering e2e tests. We add new tags to the images
from the release PR to tag them as release images, and we push them to
the prod registries.
In our json encoding, we only need to know about array types.
Information about composites or enums are not actually used.
Enums are quite popular, needing to type query them when not needed can
add some latency cost for no gain.
## Problem
When a node becomes active, we query its locations and update the
observed state in-place.
This can race with the observed state updates done when processing
reconcile results.
## Summary of changes
The argument for this reconciliation step is that is reduces the need
for background reconciliations.
I don't think is actually true anymore. There's two cases.
1. Restart of node after drain. Usually the node does not go through the
offline state here, so observed locations
were not marked as none. In any case, there should be a handful of
shards max on the node since we've just drained it.
2. Node comes back online after failure or network partition. When the
node is marked offline, we reschedule everything away from it. When it
later becomes active, the previous observed location is extraneous and
requires a reconciliation anyway.
Closes https://github.com/neondatabase/neon/issues/11148
## Problem
When the storage controller and Pageserver loads tenants from persisted
storage, it uses `ShardIdentity::unsharded()` for unsharded tenants.
However, this replaces the persisted stripe size of unsharded tenants
with the default stripe size.
This doesn't really matter for practical purposes, since the stripe size
is meaningless for unsharded tenants anyway, but can cause consistency
check failures if the persisted stripe size differs from the default.
This was seen in #11168, where we change the default stripe size.
Touches #11168.
## Summary of changes
Carry over the persisted stripe size from `TenantShardPersistence` for
unsharded tenants, and from `LocationConf` on Pageservers.
Also add bounds checks for type casts when loading persisted shard
metadata.
## Problem
`info_span!` is only used in a `linux` branch, causing the unused lint
to fire on macOS.
## Summary of changes
Fully qualify the `info_span!` use.
## Problem
`humantime` is unmaintained, we want to migrate to `jiff`, see
https://github.com/neondatabase/neon/issues/11179.
`env_logger` in older versions depend on `humantime`, and newer versions
depend on `jiff`, so we need to update it.
## Summary of changes
Update `env_logger` to the most recent release, which does not depend on
`humantime` anymore.
## Problem
This command was used when onboarding tenants to the storage controller.
We no longer do that, so the command can go.
## Summary of changes
- Remove `storcon_cli tenant-warmup` command
## Problem
Test `test_timeline_archive` is flaky because it makes requests that are
intended to fail. It sometimes leads to warning in pageserver's logs.
More details are in the issue.
- Closes: https://github.com/neondatabase/neon/issues/11177
## Summary of changes
- Suppress such errors.
We want to export performance traces from the pageserver in OTEL format.
End goal is to see them in Grafana.
To this end, there are two changes here:
1. Update the `tracing-utils` crate to allow for explicitly specifying
the export configuration. Pageserver configuration is loaded from a file
on start-up. This allows us to use the same flow for export configs
there.
2. Update the `utils::logging::init` common entry point to set up OTEL
tracing infrastructure if requested. Note that an entirely different
tracing subscriber is used. This is to avoid interference with the
existing tracing set-up. For now, no service uses this functionality.
PR to plug this into the pageserver is
[here](https://github.com/neondatabase/neon/pull/11140).
Related https://github.com/neondatabase/neon/issues/9873
## Problem
This field was retained for backward compat only in
https://github.com/neondatabase/neon/pull/10707.
Once https://github.com/neondatabase/cloud/pull/25233 is released,
nothing external will be reading this field.
Internally, this was a mandatory field so storage controller is still
trying to decode it, so we must do this removal in two steps: this PR
makes the field optional, and after one release we can fully remove it.
Related: https://github.com/neondatabase/cloud/issues/24250
## Summary of changes
- Rename field to `_unused`
- Remove field from swagger
- Make field optional
The upgrade to 8.0.0 caused severe performance regressions in the
start_postgres_ms metric, which measures the time it takes from execing
Postgres to the time Postgres marks itself as ready in the
postmaster.pid file. We use the notify crate to watch for changes in the
pgdata directory and the postmaster.pid file.
Signed-off-by: Tristan Partin <tristan@neon.tech>
# Problem
If the Pageserver ingest path
(InMemoryLayer=>EphemeralFile=>BufferedWriter)
encounters ENOSPC or any other write IO error when flushing the mutable
buffer
of the BufferedWriter, the buffered writer is left in a state where
subsequent _reads_
from the InMemoryLayer it will cause a `must not use after we returned
an error` panic.
The reason is that
1. the flush background task bails on flush failure,
2. causing the `FlushHandle::flush` function to fail at channel.recv()
and
3. causing the `FlushHandle::flush` function to bail with the flush
error,
4. leaving its caller `BufferedWriter::flush` with
`BufferedWriter::mutable = None`,
5. once the InMemoryLayer's RwLock::write guard is dropped, subsequent
reads can enter,
6. those reads find `mutable = None` and cause the panic.
# Context
It has always been the contract that writes against the BufferedWriter
API
must not be retried because the writer/stream-style/append-only
interface makes no atomicity guarantees ("On error, did nothing or a
piece of the buffer get appended?").
The idea was that the error would bubble up to upper layers that can
throw away the buffered writer and create a new one. (See our [internal
error handling policy document on how to handle e.g.
`ENOSPC`](c870a50bc0/src/storage/handling_io_and_logical_errors.md (L36-L43))).
That _might_ be true for delta/image layer writers, I haven't checked.
But it's certainly not true for the ingest path: there are no provisions
to throw away an InMemoryLayer that encountered a write error an
reingest the WAL already written to it.
Adding such higher-level retries would involve either resetting
last_record_lsn to a lower value and restarting walreceiver. The code
isn't flexible enough to do that, and such complexity likely isn't worth
it given that write errors are rare.
# Solution
The solution in this PR is to retry _any_ failing write operation
_indefinitely_ inside the buffered writer flush task, except of course
those that are fatal as per `maybe_fatal_err`.
Retrying indefinitely ensures that `BufferedWriter::mutable` is never
left `None` in the case of IO errors, thereby solving the problem
described above.
It's a clear improvement over the status quo.
However, while we're retrying, we build up backpressure because the
`flush` is only double-buffered, not infinitely buffered.
Backpressure here is generally good to avoid resource exhaustion, **but
blocks reads** and hence stalls GetPage requests because InMemoryLayer
reads and writes are mutually exclusive.
That's orthogonal to the problem that is solved here, though.
## Caveats
Note that there are some remaining conditions in the flush background
task where it can bail with an error. I have annotated one of them with
a TODO comment.
Hence the `FlushHandle::flush` is still fallible and hence the overall
scenario of leaving `mutable = None` on the bail path is still possible.
We can clean that up in a later commit.
Note also that retrying indefinitely is great for temporary errors like
ENOSPC but likely undesirable in case the `std::io::Error` we get is
really due to higher-level logic bugs.
For example, we could fail to flush because the timeline or tenant
directory got deleted and VirtualFile's reopen fails with ENOENT.
Note finally that cancellation is not respected while we're retrying.
This means we will block timeline/tenant/pageserver shutdown.
The reason is that the existing cancellation story for the buffered
writer background task was to recv from flush op channel until the
sending side (FlushHandle) is explicitly shut down or dropped.
Failing to handle cancellation carries the operational risk that even if
a single timeline gets stuck because of a logic bug such as the one laid
out above, we must still restart the whole pageserver process.
# Alternatives Considered
As pointed out in the `Context` section, throwing away a InMemoryLayer
that encountered an error and reingesting the WAL is a lot of complexity
that IMO isn't justified for such an edge case.
Also, it's wasteful.
I think it's a local optimum.
A more general and simpler solution for ENOSPC is to `abort()` the
process and run eviction on startup before bringing up the rest of
pageserver.
I argued for it in the past, the pro arguments are still valid and
complete:
https://neondb.slack.com/archives/C033RQ5SPDH/p1716896265296329
The trouble at the time was implementing eviction on startup.
However, maybe things are simpler now that we are fully storcon-managed
and all tenants have secondaries.
For example, if pageserver `abort()`s on ENOSPC and then simply don't
respond to storcon heartbeats while we're running eviction on startup,
storcon will fail tenants over to the secondary anyway, giving us all
the time we need to clean up.
The downside is that if there's a systemic space management bug, above
proposal will just propagate the problem to other nodes. But I imagine
that because of the delays involved with filling up disks, the system
might reach a half-stable state, providing operators more time to react.
# Demo
Intermediary commit `a03f335121480afc0171b0f34606bdf929e962c5` is demoed
in this (internal) screen recording:
https://drive.google.com/file/d/1nBC6lFV2himQ8vRXDXrY30yfWmI2JL5J/view?usp=drive_link
# Perf Testing
Ran `bench_ingest` on tmpfs, no measurable difference.
Spans are uniquely owned by the flush task, and the span stack isn't too
deep, so, enter and exit should be cheap.
Plus, each flush takes ~150us with direct IO enabled, so, not _that_
high frequency event anyways.
# Refs
- fixes https://github.com/neondatabase/neon/issues/10856
## Problem
This should also resolve the test flakiness of `test_gc_feedback`.
close https://github.com/neondatabase/neon/issues/11144
## Summary of changes
If `NoYield` is set, do not yield in gc-compaction.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`humantime` is not maintained and `cargo deny check` fails
- Will be addressed in https://github.com/neondatabase/neon/issues/11179
## Summary of changes
Ignore RUSTSEC-2025-0014 advisory for now
## Problem
Investigate https://github.com/neondatabase/neon/issues/11159
## Summary of changes
This doesn't fix the issue, but at least we can narrow down the cause
next time it happens by logging ancestor referenced layer cnt even if
it's 0.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
https://github.com/neondatabase/neon/issues/10851
## Summary of changes
Do some refactoring before making walproposer generations aware.
- Rename SS_VOTING to SS_WAIT_VOTING, SS_IDLE to SS_WAIT_ELECTED
- Continue to get rid of epochs: rename GetEpoch to GetLastLogTerm,
donorEpoch to donorLastLogTerm
- Instead of counting n_votes, n_connected, introduce explicit
WalProposerState (collecting terms / voting / elected). Refactor out
TermsCollected and VotesCollected; they will determine state transition
differently depending whether generations are enabled or not.
There is no new logic in this PR and thus no new tests.
## Problem
In #11122, we want to split shards once the logical size of the largest
timeline exceeds a split threshold. However, `get_top_tenants` currently
only returns `max_logical_size`, which tracks the max _total_ logical
size of a timeline across all shards.
This is problematic, because the storage controller needs to fetch a
list of N tenants that are eligible for splits, but the API doesn't
currently have a way to express this. For example, with a split
threshold of 1 GB, a tenant with `max_logical_size` of 4 GB is eligible
to split if it has 1 or 2 shards, but not if it already has 4 shards. We
need to express this in per-shard terms, otherwise the `get_top_tenants`
endpoint may end up only returning tenants that can't be split, blocking
splits entirely.
Touches https://github.com/neondatabase/neon/pull/11122.
Touches https://github.com/neondatabase/cloud/issues/22532.
## Summary of changes
Add `TenantShardItem::max_logical_size_per_shard` containing
`max_logical_size / shard_count`, and
`TenantSorting::MaxLogicalSizePerShard` to order and filter by it.
Fixes https://github.com/neondatabase/serverless/issues/144
When tables have enums, we need to perform type queries for that data.
We cache these query statements for performance reasons. In Neon RLS, we
run "discard all" for security reasons, which discards all the
statements. When we need to type check again, the statements are no
longer valid.
This fixes it to discard the statements as well.
I've also added some new logs and error types to monitor this. Currently
we don't see the prepared statement errors in our logs.
# Refs
- fixes https://github.com/neondatabase/neon/issues/6107
# Problem
`VirtualFile` currently parses the path it is opened with to identify
the `tenant,shard,timeline` labels to be used for the `STORAGE_IO_SIZE`
metric.
Further, for each read or write call to VirtualFile, it uses
`with_label_values` to retrieve the correct metrics object, which under
the hood is a global hashmap guarded by a parking_lot mutex.
We perform tens of thousands of reads and writes per second on every
pageserver instance; thus, doing the mutex lock + hashmap lookup is
wasteful.
# Changes
Apply the technique we use for all other timeline-scoped metrics to
avoid the repeat `with_label_values`: add it to `TimelineMetrics`.
Wrap `TimelineMetrics` into an `Arc`.
Propagate the `Arc<TimelineMetrics>` down do `VirtualFile`, and use
`Timeline::metrics::storage_io_size`.
To avoid contention on the `Arc<TimelineMetrics>`'s refcount atomics
between different connection handlers for the same timeline, we wrap it
into another Arc.
To avoid frequent allocations, we store that Arc<Arc<TimelineMetrics>>
inside the per-connection timeline cache.
Preliminary refactorings to enable this change:
- https://github.com/neondatabase/neon/pull/11001
- https://github.com/neondatabase/neon/pull/11030
# Performance
I ran the benchmarks in
`test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py`
on an `i3en.3xlarge` because that's what we currently run them on.
None of the benchmarks shows a meaningful difference in latency or
throughput or CPU utilization.
I would have expected some improvement in the
many-tenants-one-client-each workload because they all hit that hashmap
constantly, and clone the same `UintCounter` / `Arc` inside of it.
But apparently the overhead is miniscule compared to the remaining work
we do per getpage.
Yet, since the changes are already made, the added complexity is
manageable, and the perf overhead of `with_label_values` demonstrable in
micro-benchmarks, let's have this change anyway.
Also, propagating TimelineMetrics through RequestContext might come in
handy down the line.
The micro-benchmark that demonstrates perf impact of
`with_label_values`, along with other pitfalls and mitigation techniques
around the `metrics`/`prometheus` crate:
- https://github.com/neondatabase/neon/pull/11019
# Alternative Designs
An earlier iteration of this PR stored an `Arc<Arc<Timeline>>` inside
`RequestContext`.
The problem is that this risks reference cycles if the RequestContext
gets stored in an object that is owned directly or indirectly by
`Timeline`.
Ideally, we wouldn't be using this mess of Arc's at all and propagate
Rust references instead.
But tokio requires tasks to be `'static`, and so, we wouldn't be able to
propagate references across task boundaries, which is incompatible with
any sort of fan-out code we already have (e.g. concurrent IO) or future
code (parallel compaction).
So, opt for Arc for now.
We use the `metrics` / `prometheus` crate in the Pageserver code base.
This PR demonstrates
- typical performance pitfalls with that crate
- our current set of techniques to avoid most of these pitfalls.
refs
- https://github.com/neondatabase/neon/issues/10948
- https://github.com/neondatabase/neon/pull/7202
- I applied the `label_values__cache_label_values_lookup` technique
there.
- It didn't yield measurable results in high-level benchmarks though.
This PR extends the storcon with basic safekeeper management of
timelines, mainly timeline creation and deletion. We want to make the
storcon manage safekeepers in the future. Timeline creation is
controlled by the `--timelines-onto-safekeepers` flag.
1. it adds the `timelines` and `safekeeper_timeline_pending_ops` tables
to the storcon db
2. extend code for the timeline creation and deletion
4. it adds per-safekeeper reconciler tasks
TODO:
* maybe not immediately schedule reconciliations for deletions but have
a prior manual step
* tenant deletions
* add exclude API definitions (probably separate PR)
* how to choose safekeeper to do exclude on vs deletion? this can be a
bit hairy because the safekeeper might go offline in the meantime.
* error/failure case handling
* tests (cc test_explicit_timeline_creation from #11002)
* single safekeeper mode: we often only have one SK (in tests for
example)
* `notify-safekeepers` hook:
https://github.com/neondatabase/neon/issues/11163
TODOs implemented:
* cancellations of enqueued reconciliations on a per-timeline basis,
helpful if there is an ongoing deletion
* implement pending ops overwrite behavior
* load pending operations from db
RFC section for important reading:
[link](https://github.com/neondatabase/neon/blob/main/docs/rfcs/035-safekeeper-dynamic-membership-change.md#storage_controller-implementation)
Implements the bulk of #9011
Successor of #10440.
---------
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
## Problem
Async prefetch in LFC PR cause incorrect calculation of LFC
`used_pages`when page is overwritten
## Summary of changes
Decrement `used_pages` is page is overwritten.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Matthias van de Meent <matthias@neon.tech>
This allows for improved decoding of otherwise broken WAL.
## Problem
Currently, if (or when) a WAL record has a wrong prevptr, that breaks
decoding. With this, we don't have to break on that if we decide it's OK
to proceed after that.
## Summary of changes
Use a Neon GUC to allow the system to enable the NEON-specific
skip_lsn_checks option in XLogReader.
## Problem
Partial reads are still problematic. They are stored in the buffer of
the wal decoder and result in gaps being reported too eagerly on the
pageserver side.
## Summary of changes
Previously, we always used the start LSN of the chunk of WAL that was
just read. This patch switches to using the end LSN of the last record
that was decoded in the previous iteration.
## Problem
Pinging groups on slack didn't work, because I didn't use the correct
syntax.
## Summary of changes
Use the correct syntax for pinging groups.
* pprof can also use `prost` as a backend, switch to it as `protobuf`
has no update available but a security issue.
* `paste` is a build time dependency, so add the unmaintained warning as
an exception.
## Problem
`home_shard_count` is not updated on the preferred AZ change.
Closes: https://github.com/neondatabase/neon/issues/10493
## Summary of changes
- Update scheduler stats (node ref counts) on preferred AZ change.
## Problem
See https://neondb.slack.com/archives/C08DE6Q9C3B
Sometimes compute is not able to receive responses from PS for a long
time (minutes).
I do not think that the problem is at compute side, but in order to
exclude this possibility I wan to see more information about connection
state at compute side, particularly amount of data cached in connection
buffer.
## Summary of changes
Right now we are dumping state of socket buffer.
This PR adds printing state of connection buffer.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Previously, remote extensions were not fetched unless they were used in
some other manner. For instance, loading a BM25 index in pg_search
fetches the pg_search extension. However, if on a fresh compute with
pg_search 0.15.5 installed, the user ran `ALTER EXTENSION pg_search
UPDATE TO '0.15.6'` without first using the pg_search extension, we
would not fetch the extension and fail to find an update path.
Signed-off-by: Tristan Partin <tristan@neon.tech>
We were previously only revoking privileges granted by neon_superuser.
However, we need to do it for all grantors.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
As part of the disaster recovery tool. Partly for
https://github.com/neondatabase/neon/issues/9114.
## Summary of changes
* Add a new pageserver API to force patch the fields in index_part and
modify the timeline internal structures.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Storage controller uses http for requests to safekeeper management API.
Closes: https://github.com/neondatabase/cloud/issues/24835
## Summary of changes
- Add `use_https_safekeeper_api` option to storcon to use https api
- Use https for requests to safekeeper management API if this option is
enabled
- Add `ssl_ca_file` option to storcon for ability to specify custom root
CA certificate
## Problem
The current migration API does a live migration, but if the destination
doesn't already have a secondary, that live migration is unlikely to be
able to warm up a tenant properly within its timeout (full warmup of a
big tenant can take tens of minutes).
Background optimisation code knows how to do this gracefully by creating
a secondary first, but we don't currently give a human a way to trigger
that.
Closes: https://github.com/neondatabase/neon/issues/10540
## Summary of changes
- Add `prefererred_node` parameter to TenantShard, which is respected by
optimize_attachment
- Modify migration API to have optional prewarm=true mode, in which we
set preferred_node and call optimize_attachment, rather than directly
modifying intentstate
- Require override_scheduler=true flag if migrating somewhere that is a
less-than-optimal scheduling location (e.g. wrong AZ)
- Add `origin_node_id` to migration API so that callers can ensure
they're moving from where they think they're moving from
- Add tests for the above
The storcon_cli wrapper for this has a 'watch' mode that waits for
eventual cutover. This doesn't show the warmth of the secondary evolve
because we don't currently have an API for that in the controller, as
the passthrough API only targets attached locations, not secondaries. It
would be straightforward to add later as a dedicated endpoint for
getting secondary status, then extend the storcon_cli to consume that
and print a nice progress indicator.
## Problem
Shard zero needs to track the start LSN of the latest record
in adition to the LSN of the next record to ingest. The former
is included in basebackup persisted by the compute in WAL.
Previously, empty records were skipped for all shards. This caused
the prev LSN tracking on the PS to fall behind and led to logical
replication
issues.
## Summary of changes
Shard zero now receives emtpy interpreted records for LSN tracking
purposes.
A test is included too.
## Problem
Allow for using _meta.yml with workflow_dispatch event.
## Summary of changes
Handle this event in the run-kind step; fix and update the description
of the run-kind output.
## Problem
This patch makes some initial tweaks as preparation for
https://github.com/neondatabase/cloud/issues/22532, where we will be
introducing additional autosplit logic.
The plan is outlined in
https://github.com/neondatabase/cloud/issues/22532#issuecomment-2706215907.
## Summary of changes
Minor code cleanups and behavioral changes:
* Decide that we'll split based on `max_logical_size` (later possibly
`total_logical_size`).
* Fix a bug that would split the smallest candidate rather than the
largest.
* Pick the largest candidate by `max_logical_size` rather than
`resident_size`, for consistency (debatable).
* Split out `get_top_tenant_shards()` to fetch split candidates.
* Fetch candidates concurrently from all nodes.
* Make `TenantShard.get_scheduling_policy()` return a copy instead of a
reference.
We have introduced the ability to specify safekeeper JWTs for the
storage controller. It now does heartbeats. We now want to also require
the presence of those JWTs.
Let's merge this PR shortly after the release cutoff.
Part of / follow-up of
https://github.com/neondatabase/cloud/issues/24727
## Problem
We just had a regression reported at
https://neondb.slack.com/archives/C08EXUJF554/p1741102467515599, which
clearly came with one of the releases. It's not a huge problem yet, but
it's annoying that we cannot quickly attribute it to a specific commit.
## Summary of changes
Add a very simple `compute_ctl` HTTP API benchmark that does 10k
requests to `/status` and `metrics.json` and reports p50 and p99.
---------
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
## Problem
Part of https://github.com/neondatabase/neon/issues/9114
## Summary of changes
gc-compaction could take a long time in some cases, for example, if the
job split heuristics is wrong and we selected a too large region for
compaction that can't be finished within a reasonable amount of time. We
will give up such tasks and yield to L0 compaction. Each gc-compaction
sub-compaction job is atomic and cannot be split further so we have to
give up (instead of storing a state and continue later as in image
creation).
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
After splitting teams it became a bit more complicated for the PerfCorr
team to work on tests changes.
## Summary of changes
1. Add PerfCorr team co-owners for `.github/` folder
2. Add PerCorr team as owner for `test_runner/` folder
## Problem
In f37eeb56, I properly escaped the identifier, but I haven't noticed
that the resulting string is used in the `format('...')`, so it needs
additional escaping. Yet, after looking at it closer and with Heikki's
and Tristan's help, it appeared to be that it's a full can of worms and
we have problems all over the code in places where we use PL/pgSQL
blocks.
## Summary of changes
Add a new `pg_quote_dollar()` helper to deal with it, as dollar-quoting
of strings seems to be the only robust way to escape strings in dynamic
PL/pgSQL blocks. We mimic the Postgres' `pg_get_functiondef` logic here
[1].
While on it, I added more tests and caught a couple of more bugs with
string escaping:
1. `get_existing_dbs_async()` was wrapping `owner` in additional
double-quotes if it contained special characters
2. `construct_superuser_query()` was flawed in even more ways than the
rest of the code. It wasn't realistic to fix it quickly, but after
thinking about it more, I realized that we could drop most of it
altogether. IIUC, it was added as some sort of migration, probably back
when we haven't had migrations yet. So all the complicated code was
needed to properly update existing roles and DBs. In the current Neon,
this code only runs before we create the very first DB and role. When we
create roles and DBs, all `neon_superuser` grants are added in the
different places. So the worst thing that could happen is that there is
an ancient branch somewhere, so when users poke it, they will realize
that not all Neon features work as expected. Yet, the fix is simple and
self-serve -- just create a new role via UI or API, and it will get a
proper `neon_superuser` grant.
[1]:
8b49392b27/src/backend/utils/adt/ruleutils.c (L3153)Closesneondatabase/cloud#25048
## Problem
When periodic pagebench runs only once a day a lot of commits can be in
between a good run and a regression.
## Summary of changes
Run the workflow every 3 hours
## Problem
part of https://github.com/neondatabase/neon/issues/9114
## Summary of changes
* Add timers for each phase of the gc-compaction.
* Add a final ratio computation to directly show the garbage collection
ratio in the logs.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Original Slack discussion:
https://neondb.slack.com/archives/C04DGM6SMTM/p1739915430147169
TL;DR in Postgres, it's totally normal to have 'invalid' DBs (state
after the interrupted `DROP DATABASE`). Yet, some of our metrics
collected with `sql_exporter` try to get the size of such invalid DBs.
Typical log lines:
```
time=2025-03-05T16:30:32.368Z level=ERROR source=promhttp.go:52 msg="Error gathering metrics" error="[from Gatherer #1] [collector=neon_collector,query=pg_stats_userdb] pq: [NEON_SMGR] [reqid 0] could not read db size of db 173228 from page server at lsn 0/44A0E8C0"
time=2025-03-05T16:30:32.369Z level=ERROR source=promhttp.go:52 msg="Error gathering metrics" error="[from Gatherer #1] [collector=neon_collector,query=db_total_size] pq: [NEON_SMGR] [reqid 0] could not read db size of db 173228 from page server at lsn 0/44A0E8C0"
```
## Summary of changes
Ignore invalid DBs in these two metrics -- `pg_stats_userdb` and
`db_total_size`
## Problem
On unarchival, we update the previous heatmap with all visible layers.
When the primary generates a new heatmap it includes all those layers,
so the secondary will download them. Since they're not actually resident
on the primary (we didn't call the warm up API), they'll never be
evicted,
so they remain in the heatmap.
We want these layers in the heatmap, since we might wish to warm-up an
unarchived timeline after a shard migration. However, we don't want them
to be downloaded on the secondary until we've warmed up the primary.
## Summary of Changes
Include these layers in the heatmap and mark them as cold. All heatmap
operations act on non-cold layers apart from the attached location
warming up API,
which will download the cold layers. Once the cold layers are downloaded
on the primary,
they'll be included in the next heatmap as hot and the secondary starts
fetching them too.
## Problem
I like using `cargo stress` to hammer on a test, but it doesn't work out
of the box because it does parallel runs by default and tests always use
the same repo dir.
## Summary of changes
Add an uuid to the test repo dir when generating it.
## Problem
In a batch, `pageserver_layers_per_read_global` counts all layer visits
towards every read in the batch, since this directly affects the
observed latency of the read. However, this doesn't give a good picture
of the amortized read amplification due to batching.
## Summary of changes
Add two more global read amp metrics:
* `pageserver_layers_per_read_batch_global`: number of layers visited
per batch.
* `pageserver_layers_per_read_amortized_global`: number of layers
divided by reads in a batch.
* Remove callsite identifier registration on span creation. Forgot to
remove from last PR. Was part of alternative idea.
* Move "spans" object to right after "fields", so event and span fields
are listed together.
## Problem
For pg_regress test, we do both v1 and v2; for all the rest, we default
to v2.
part of https://github.com/neondatabase/neon/issues/9516
## Summary of changes
Use reldir v2 across test cases by default.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
This PR adds a new key to neon.neon_lfc_stats —
'file_cache_chunk_size_pages'. It just returns the value of
BLOCKS_PER_CHUNK from the LFC implementation.
The new value should (eventually) allow changing the chunk size without
breaking any places that rely on LFC stats values measured in number of
chunks.
See neondatabase/cloud#25170 for more.
## Problem
Our benchmarking workflows contain links to grafana dashboards to
troubleshoot problems.
This works fine for non-pooled endpoints.
For pooled endpoints we need to remove the `-pooler` suffix from the
endpoint's hostname to get a valid endpoint ID.
Example link that doesn't work in this run
https://github.com/neondatabase/neon/actions/runs/13678933253/job/38246028316#step:8:311
## Summary of changes
Check if connection string is a -pooler connection string and if so
remove this suffix from the endpoint ID.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
We realized that we may use this metric for more 'live' info about
extension installations vs. what we have with installed extensions
metric, which is only updated at start, atm.
## Summary of changes
Add `filename` label to `compute_ctl_remote_ext_requests_total`. Note
that it contains the raw archive name with `.tar.zst` at the end, so the
consumer may need to strip this suffix.
Closes https://github.com/neondatabase/cloud/issues/24694
Setup pgaudit and pgauditlogtofile extensions
in compute_ctl when the ComputeAuditLogLevel is
set to 'hipaa'.
See cloud PR https://github.com/neondatabase/cloud/pull/24568
Add rsyslog setup for compute_ctl.
Spin up a rsyslog server in the compute VM,
and configure it to send logs to the endpoint
specified in AUDIT_LOGGING_ENDPOINT env.
## Problem
part of https://github.com/neondatabase/neon/issues/9114
We used anyhow::Error everywhere and it's time to fix.
## Summary of changes
* Make sure that cancel errors are correctly propagated as
CompactionError::ShuttingDown.
* Skip all the trigger computation work if gc_cutoff is not generated
yet.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
If an image fails to push to dev registries, we shouldn't trigger the
deploy job, because that depends on images existing in dev registries.
To ensure this is the case, the deploy job needs to depend on pushing to
dev registries.
## Summary of changes
Make `deploy` depend on `push-neon-image-dev` and
`push-compute-image-dev`.
## Problem
`test_check_visibility_map` is the slowest test in CI, and can cause
timeouts under particularly slow configurations (`debug` and
`without-lfc`).
## Summary of changes
* Reduce the `pgbench` scale factor from 10 to 8.
* Omit a redundant vacuum during `pgbench` init.
* Remove a final `vacuum freeze` + `pg_check_visible` pass, which has
questionable value (we've already done a vacuum freeze previously, and
we don't flush the compute cache before checking anyway).
## Problem
On unarchival, we update the previous heatmap with all visible layers.
When the primary generates a new heatmap it includes all those layers,
so the secondary will download them. Since they're not actually resident
on the primary (we didn't call the warm up API), they'll never be
evicted, so they remain in the heatmap.
This leads to oversized secondary locations like we saw in pre-prod.
## Summary of changes
Gate the loading of the previous heatmaps and the heatmap generation on
unarchival behind configuration
flags. They are disabled by default, but enabled in tests.
## Problem
Each page of the slru segment is fetched individually when it's loaded
on demand.
## Summary of Changes
Use `Timeline::get_vectored` to fetch 16 at a time.
fixes https://github.com/neondatabase/cloud/issues/24292
Do not drop tablesync replication slots on the publisher,
when we're in the process of dropping subscriptions inherited by a neon
branch.
Because these slots are still needed by the parent branch subscriptions.
For regular slots we handle this by setting the slot_name to NONE
before calling DROP SUBSCRIPTION, but tablesync slots are not exposed to
SQL.
rely on GUC disable_logical_replication_subscribers=true
to know that we're in the Neon-specific process of dropping
subscriptions.
## Problem
We don't have any logging for Sentry initialization. This makes it hard
to verify that it has been configured correctly.
## Summary of changes
Log some basic info when Sentry has been initialized, but omit the
public key (which allows submitting events). Also log when `SENTRY_DSN`
isn't specified at all, and when it fails to initialize (which is
supposed to panic, but we may as well).
## Problem
Grafana Loki's JSON handling is somewhat limited and the log message
should be structured in a way that it's easy to sift through logs and
filter.
## Summary of changes
* Drop span_id. It's too short lived to be of value and only bloats the
logs.
* Use the span's name as the object key, but append a unique numeric
value to prevent name collisions.
* Extract interesting span fields into a separate object at the root.
New format:
```json
{
"timestamp": "2025-03-04T18:54:44.134435Z",
"level": "INFO",
"message": "connected to compute node at 127.0.0.1 (127.0.0.1:5432) latency=client: 22.002292ms, cplane: 0ns, compute: 5.338875ms, retry: 0ns",
"fields": {
"cold_start_info": "unknown"
},
"process_id": 56675,
"thread_id": 9122892,
"task_id": "24",
"target": "proxy::compute",
"src": "proxy/src/compute.rs:288",
"trace_id": "5eb89b840ec63fee5fc56cebd633e197",
"spans": {
"connect_request#1": {
"ep": "endpoint",
"role": "proxy",
"session_id": "b8a41818-12bd-4c3f-8ef0-9a942cc99514",
"protocol": "tcp",
"conn_info": "127.0.0.1"
},
"connect_to_compute#6": {},
"connect_once#8": {
"compute_id": "compute",
"pid": "853"
}
},
"extract": {
"session_id": "b8a41818-12bd-4c3f-8ef0-9a942cc99514"
}
}
```
## Problem
Our benchmarking workflow has a job step `bench`which runs all tests in
test_runner/performance/* except those that we want to run separately.
We recently added two test cases to that testcase directory that we want
to run separately but forgot to ignore them during the bench step. This
is now causing
[failures](https://github.com/neondatabase/neon/actions/runs/13667689340/job/38212087331#step:7:392).
## Summary of changes
Ignore the separately run tests in the bench step.
## Problem
New async prefetch introduces `prefetch+lookup[` function which is
called before LFC lookup to check if prefetch request is already
completed.
This function is not containing now check that response is actually
`T_NeonGetPageResponse` (and not error).
## Summary of changes
Add checks for response tag.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Periodic pagebench workflow runs periodically from latest main commit
and also allows to dispatch it manually for a given commit hash to
bi-sect regressions.
However in the dashboards we can not distinguish manual runs from
periodic runs which makes it harder to follow the trend.
## Summary of changes
Send an additional flag commit type to the benchmark runner instance to
distinguish the run type.
Note: this needs a follow-up PR on the receiving side.
The compute should only act if requests come from the control plane.
Signed-off-by: Tristan Partin <tristan@neon.tech>
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
On macOS, the `unused` lint complains about two variables not used in
`!linux` builds.
These were introduced in #11007.
## Summary of changes
Appease the linter by explicitly using the variables in `!linux`
branches.
## Problem
part of https://github.com/neondatabase/neon/issues/11067
My observation is that with the current value of settings, x86-v1
usually takes 30s, arm-v1 1m30s, x86-v2 1m, arm-v2 3m. But sometimes the
system could run too slow and cause test to timeout on arm with reldir
v2.
While I investigate what's going on and further improve the performance,
I'd like to set both of them to use the same test input, so that it
doesn't timeout and we don't abuse this test case as a performance test.
## Summary of changes
Use the same settings for both test cases.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The code to generate symbolized pprof heap profiles and flamegraph SVGs
has been upstreamed to the `jemalloc_pprof` crate:
* https://github.com/polarsignals/rust-jemalloc-pprof/pull/22
* https://github.com/polarsignals/rust-jemalloc-pprof/pull/23
## Summary of changes
Use `jemalloc_pprof` to generate symbolized pprof heap profiles and
flamegraph SVGs.
This reintroduces a bunch of internal jemalloc stack frames that we'd
previously strip, e.g. each stack now always ends with
`prof_backtrace_impl` (where jemalloc takes a stack trace for heap
profiling), but that seems ok.
## Problem
Incoming requests often take the service lock, and sometimes even do
database transactions. That creates a risk that a rogue client can
starve the controller of the ability to do its primary job of
reconciling tenants to an available state.
## Summary of changes
* Use the `governor` crate to rate limit tenant requests at 10 requests
per second. This is ~10-100x lower than the worst "attack" we've seen
from a client bug. Admin APIs are not rate limited.
* Add a `storage_controller_http_request_rate_limited` histogram for
rate limited requests.
* Log a warning every 10 seconds for rate limited tenants.
The rate limiter is parametrized on TenantId, because the kinds of
client bug we're protecting against generally happen within tenant
scope, and the rates should be somewhat stable: we expect the global
rate of requests to increase as we do more work, but we do not expect
the rate of requests to one tenant to increase.
---------
Co-authored-by: John Spray <john@neon.tech>
## Problem
part of https://github.com/neondatabase/neon/issues/9516
## Summary of changes
Similar to the aux v2 migration, we persist the relv2 migration status
into index_part, so that even the config item is set to false, we will
still read from the v2 storage to avoid loss of data.
Note that only the two variants `None` and
`Some(RelSizeMigration::Migrating)` are used for now. We don't have full
migration implemented so it will never be set to
`RelSizeMigration::Migrated`.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The "did not trigger" gets logged at 10k/minute in staging.
## Summary of changes
Change it to debug level.
Signed-off-by: Alex Chi Z <chi@neon.tech>
The interpreted reader tracks a record aligned current position in the
WAL stream.
Partial reads move the stream internally, but not from the pov of the
interpreted WAL reader.
Hence, where new shards subscribe with a start position that matches the
reader's current position,
but we've also done some partial reads. This confuses the gap tracking.
To make it more robust,
update the current batch start to the min between the new start position
and its current value.
Since no record has been decoded yet (position matches), we can't have
lost it
## Problem
When a build is made with sanitizers, this is not reflected in the
artifact name, which can lead to overriding normal builds with sanitized
ones.
## Summary of changes
Take this property of a build into account when constructing the
artifact name.
## Problem
Image layers may be nested inside in-memory layers as diagnosed
[here](https://github.com/neondatabase/neon/issues/10720#issuecomment-2649419252).
The read path doesn't support this and may skip over the image layer,
resulting in a failure to reconstruct the page.
## Summary of changes
We already support nesting of image layers inside delta layers. The
logic lives in `LayerMap::select_layer`.
The main goal of this PR is to propagate the candidate in-memory layer
down to that point and update
the selection logic.
Important changes are:
1. Support partial reads for the in-memory layer. Previously, we could
only specify the start LSN of the read.
We need to control the end LSN too.
2. `LayerMap::ranged_search` considers in-memory layers too. Previously,
the search for in-memory layers
was done explicitly in `Timeline::get_reconstruct_data_timeline`. Note
that `LayerMap::ranged_search` now returns
a weak readable layer which the `LayerManager` can upgrade. This dance
is such that we can unit test the layer selection logic.
3. Update `LayerMap::select_layer` to consider the candidate in-memory
layer too
Loosely related drive bys:
1. Remove the "keys not found" tracking in the ranged search. This
wasn't used anywhere and it just complicates things.
2. Remove the difficulty map stuff from the layer map. Again, not used
anywhere.
Closes https://github.com/neondatabase/neon/issues/9185
Closes https://github.com/neondatabase/neon/issues/10720
## Problem
If a caller times out on safekeeper timeline deletion on a large
timeline, and waits a while before retrying, the deletion will not
progress while the retry is waiting. The net effect is very very slow
deletion as it only proceeds in 30 second bursts across 5 minute idle
periods.
Related: https://github.com/neondatabase/neon/issues/10265
## Summary of changes
- Run remote deletion in a background task
- Carry a watch::Receiver on the Timeline for other callers to join the
wait
- Restart deletion if the API is called again and the previous attempt
failed
## Problem
We want to support larger tenants (regarding logical database size,
number of transactions per second etc.) and should increase our test
coverage of OLTP transactions at larger scale.
## Summary of changes
Start a new benchmark that over time will add more OLTP tests at larger
scale.
This PR covers the first version and will be extended in further PRs.
Also fix some infrastructure:
- default for new connections and large tenants is to use connection
pooler pgbouncer, however our fixture always added
`statement_timeout=120` which is not compatible with pooler
[see](https://neon.tech/docs/connect/connection-errors#unsupported-startup-parameter)
- action to create branch timed out after 10 seconds and 10 retries but
for large tenants it can take longer so use increasing back-off for
retries
## Test run
https://github.com/neondatabase/neon/actions/runs/13593446706
## Problem
Timeline shutdown during basebackup logs at error level because the the
canecellation error is smushed into BasebackupError::Server.
## Summary of changes
Introduce BasebackupError::Shutdown and use it. `log_query_error` will
now see `QueryError::Shutdown` and log at info level.
## Problem
Preparation for https://github.com/neondatabase/neon/issues/10851
## Summary of changes
Add walproposer `safekeepers_generations` field which can be set by
prefixing `neon.safekeepers` GUC with `g#n:`. Non zero value (n) forces
walproposer to use generations. In particular, this also disables
implicit timeline creation as timeline will be created by storcon. Add
test checking this. Also add missing infra: `--safekeepers-generation`
flag to neon_local endpoint start + fix `--start-timeout` flag: it
existed but value wasn't used.
## Problem
See https://neondb.slack.com/archives/C033RQ5SPDH/p1740157873114339
smgrextend for FSM fork is called during page reconstruction by walredo
process causing overflow of inmem SMGR (64 pages).
## Summary of changes
Do not store zero pages in inmem SMGR because `inmem_read` returns zero
page if it is not able to locate specified block.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
We build compute-nodes as multi-arch images, but not the
vm-compute-nodes. The PR adds multiarch vm images the same way as in
autoscaling repo.
## Summary of changes
Add architecture to the matrix for vm compute build steps
Add merge job
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
We have not synced `force-test-extensions-upgrade.yml` with the last
changes.
The variable `TEST_EXTENSIONS_UPGRADE` was ignored in the script and
actually set to `NEW_COMPUTE_TAG` while it should be set to
`OLD_COMPUTE_TAG` as we are about to run compatibility tests.
## Summary of changes
The tag names were synced, the logic was fixed.
To speed up compute startup. Resizing swap in particular takes about 100
ms on my laptop. By performing it in parallel with downloading the
basebackup, that latency is effectively hidden. I would imagine that
downloading remote extensions can also take a non-trivial amount of
time, although I didn't try to measure that. In any case that's now also
performed in parallel with downloading the basebackup.
Move most of the code to compute.rs, so that all the major startup steps
are visible in one place. You can now get a pretty good picture of what
happens in the latency-critical path at compute startup by reading
ComputeNode::start_compute().
This also clarifies the error handling in start_compute. Previously, the
start_postgres function sometimes returned an Err, and sometimes Ok but
with the compute status already set to Failed. Now the start_compute
function always returns Err on failure, and it's the caller's
responsibility to change the compute status to Failed. Separately from
that, it returns a handle to the Postgres process via a `&mut` reference
if it had already started Postgres (i.e. on success, or if the failure
happens after launching the Postgres process).
---------
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
## Problem
In multi-character keys, the GIN index creates a CRC Hash of the first 3
bytes of the key.
The hash can have the first bit to be set or unset, needing to have a
consistent representation
of `char` across architectures for consistent results. GIN stores these
keys by their hashes
which determines the order in which the keys are obtained from the GIN
index.
By default, chars are signed in x86 and unsigned in arm, leading to
inconsistent behavior across different platform architectures. Adding
the `-fsigned-char` flag to the GCC compiler forces chars to be treated
as signed across platforms, ensuring the ordering in which the keys are
obtained consistent.
## Summary of changes
Added `-fsigned-char` to the `CFLAGS` to force GCC to use signed chars
across platforms.
Added a test to check this across platforms.
Fixes: https://github.com/neondatabase/cloud/issues/23199
## Problem
To measure latency accurate we should associate the testodrome role
within a latency data
## Summary of changes
Add latency logging to associate different roles within a latency.
Relates to the #22486
## Problem
`wait_for_active_tenant()`, used when starting background tasks, has a
race condition that can cause it to wait forever (until cancelled). It
first checks the current tenant state, and then subscribes for state
updates, but if the state changes between these then it won't be
notified about it.
We've seen this wedge compaction tasks, which can cause unbounded layer
file buildup and read amplification.
## Summary of changes
Use `watch::Receiver::wait_for()` to check both the current and new
tenant states.
## Problem
Somehow the previous patch loses the loop in the chaos injector function
so everything will only run once.
https://github.com/neondatabase/neon/pull/10934
## Summary of changes
Add back the loop.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
JWT tokens aren't in place, so all SK heartbeats fail. This is
equivalent to a wait before applying the PS heartbeats and makes things
more flaky.
## Summary of Changes
Add a flag that skips loading SKs from the db on start-up and at
runtime.
https://github.com/neondatabase/cloud/issues/23008
For TLS between proxy and compute, we are using an internally
provisioned CA to sign the compute certificates. This change ensures
that proxy will load them from a supplied env var pointing to the
correct file - this file and env var will be configured later, using a
kubernetes secret.
Control plane responds with a `server_name` field if and only if the
compute uses TLS. This server name is the name we use to validate the
certificate. Control plane still sends us the IP to connect to as well
(to support overlay IP).
To support this change, I'd had to split `host` and `host_addr` into
separate fields. Using `host_addr` and bypassing `lookup_addr` if
possible (which is what happens in production). `host` then is only used
for the TLS connection.
There's no blocker to merging this. The code paths will not be triggered
until the new control plane is deployed and the `enableTLS` compute flag
is enabled on a project.
## Problem
This `critical!` could fire on IO errors, which is just noisy.
Resolves#11027.
## Summary of changes
Downgrade to error, except for decode errors. These could be either data
corruption or a bug, but seem worth investigating either way.
# Changes
While working on
- https://github.com/neondatabase/neon/pull/7202
I found myself needing to cache another expensive Arc::clone inside
inside the timeline::handle::Cache by wrapping it in another Arc.
Before this PR, it seemed like the only expensive thing we were caching
was the
connection handler tasks' clone of `Arc<Timeline>`.
But in fact the GateGuard was another such thing, but it was
special-cased
in the implementation.
So, this refactoring PR de-special-cases the GateGuard.
# Performance
With this PR we are doing strictly _less_ operations per `Cache::get`.
The reason is that we wrap the entire `Types::Timeline` into one Arc.
Before this PR, it was a separate Arc around the Arc<Timeline> and
one around the Arc<GateGuard>.
With this PR, we avoid an allocation per cached item, namely,
the separate Arc around the GateGuard.
This PR does not change the amount of shared mutable state.
So, all in all, it should be a net positive, albeit probably not
noticable
with our small non-NUMA instances and generally high CPU usage per
request.
# Reviewing
To understand the refactoring logistics, look at the changes to the unit
test types first.
Then read the improved module doc comment.
Then the remaining changes.
In the future, we could rename things to be even more generic.
For example, `Types::TenantMgr` could really be a `Types::Resolver`.
And `Types::Timeline` should, to avoid constant confusion in the doc
comment,
be called `Types::Cached` or `Types::Resolved`.
Because the `handle` module, after this PR, really doesn't care that
we're
using it for storing Arc's and GateGuards.
Then again, specicifity is sometimes more useful than being generic.
And writing the module doc comment in a totally generic way would
probably also be more confusing than helpful.
## Problem
CI does not pass for the compute release due to the absence of some
images
## Summary of changes
Now we use the images from the old non-compute releases for non-compute
images
## Problem
We intend for cplane to use the heatmap layer download API to warm up
timelines after unarchival. It's tricky for them to recurse in the
ancestors,
and the current implementation doesn't work well when unarchiving a
chain
of branches and warming them up.
## Summary of changes
* Add a `recurse` flag to the API. When the flag is set, the operation
recurses into the parent
timeline after the current one is done.
* Be resilient to warming up a chain of unarchived branches. Let's say
we unarchived `B` and `C` from
the `A -> B -> C` branch hierarchy. `B` got unarchived first. We
generated the unarchival heatmaps
and stash them in `A` and `B`. When `C` unarchived, it dropped it's
unarchival heatmap since `A` and `B`
already had one. If `C` needed layers from `A` and `B`, it was out of
luck. Now, when choosing whether
to keep an unarchival heatmap we look at its end LSN. If it's more
inclusive than what we currently have,
keep it.
## Problem
Storage controller will proxy GETs to pageserver-like tenant/timeline
paths through to the pageserver.
Usually GET passthroughs make sense to go to shard 0, e.g. if you want
to list timelines.
But sometimes you really want to know about a particular shard, e.g.
reading its cache state or similar.
## Summary of changes
- Accept shard IDs as well as tenant IDs in the passthrough route
- Refactor node lookup to take a shard ID and make the tenant ID case a
layer on top of that. This is one more lock take-drop during these
requests, but it's not particularly expensive and these requests
shouldn't be terribly frequent
This is not immediately used by anything, but will be there any time we
want to e.g. do a pass-through query to check the warmth of a tenant
cache on a particular shard or somesuch.
## Problem
layer_map access was unwrapped. It might return an error during
shutdown.
## Summary of changes
Propagate the layer_map access error back to the compaction loop.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
https://github.com/neondatabase/neon/pull/10841 made building compute
and neon images optional on releases that don't need them. The
`push-<component>-image-prod` jobs had transitive dependencies that were
skipped due to that, causing the images not to be pushed to production
registries.
## Summary of changes
Add `!failure() && !cancelled() &&` to the beginning of the conditions
for these jobs to ensure they run even if some of their transitive
dependencies are skipped.
## Problem
We see `Inmem storage overflow` in page server logs:
https://neondb.slack.com/archives/C033RQ5SPDH/p1740157873114339
walked process is using inseam SMGR with storage size limited by 64
pages with warning watermark 32 (based ion the assumption that
XLR_MAX_BLOCK_ID is 32, so WAL record can not access more than 32
pages).
Actually it is not true. We can update up to 3 forks for each block
(including update of FSM and VM forks).
## Summary of changes
This PR increases inseam SMGR size for walled process to 100 pages and
print stack trace in case of overflow.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
We created extensions in a single database. The tests could interfere,
i.e., discover some service tables left by other extensions and produce
unexpected results.
## Summary of changes
The tests are now run in a separate timeline, so only one extension owns
the database, which prevents interference.
Before this PR, re-attach and validate would log the same warning
```
calling control plane generation validation API failed
```
on retry errors.
This can be confusing.
This PR makes the message generically valid for any upcall and adds
additional tracing spans to capture context.
Along the way, clean up some copy-pasta variable naming.
refs
-
https://github.com/neondatabase/neon/issues/10381#issuecomment-2684755827
---------
Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>
## Problem
So far cumulative statistics have not been persisted when Neon scales to
zero (suspends endpoint).
With PR https://github.com/neondatabase/neon/pull/6560 the cumulative
statistics should now survive endpoint restarts and correctly trigger
the auto- vacuum and auto analyze maintenance
So far we did not have a testcase that validates that improvement in our
dev cloud environment with a real project.
## Summary of changes
Introduce testcase `test_cumulative_statistics_persistence`in the
benchmarking workflow running daily to verify:
- Verifies that the cumulative statistics are correctly persisted across
restarts.
- Cumulative statistics are important to persist across restarts because
they are used
- when auto-vacuum an auto-analyze trigger conditions are met.
- The test performs the following steps:
- Seed a new project using pgbench
- insert tuples that by itself are not enough to trigger auto-vacuum
- suspend the endpoint
- resume the endpoint
- insert additional tuples that by itself are not enough to trigger
auto-vacuum but in combination with the previous tuples are
- verify that autovacuum is triggered by the combination of tuples
inserted before and after endpoint suspension
## Test run
https://github.com/neondatabase/neon/actions/runs/13546879714/job/37860609089#step:6:282
Migrates the remaining crates to edition 2024. We like to stay on the
latest edition if possible. There is no functional changes, however some
code changes had to be done to accommodate the edition's breaking
changes.
Like the previous migration PRs, this is comprised of three commits:
* the first does the edition update and makes `cargo check`/`cargo
clippy` pass. we had to update bindgen to make its output [satisfy the
requirements of edition
2024](https://doc.rust-lang.org/edition-guide/rust-2024/unsafe-extern.html)
* the second commit does a `cargo fmt` for the new style edition.
* the third commit reorders imports as a one-off change. As before, it
is entirely optional.
Part of #10918
## Problem
https://github.com/neondatabase/neon/pull/10841 broke CI on PRs that
aren't based on main or a release branch but want to merge into another
PR.
## Summary of changes
Replace `run-kind=pr-main` with `run-kind=pr`, so that all PRs that
aren't release PRs are treated equally.
## Problem
safekeepers must ignore walproposer messages with non matching
membership conf.
## Summary of changes
Make safekeepers reject vote request, proposer elected and append
request messages with non matching generation. Switch to the
configuration in the greeting message if it is higher.
In passing, fix one comment and WAL truncation.
Last part of https://github.com/neondatabase/neon/issues/9965
The comment was woefully outdated and outright wrong. It applied a long
time ago (before commit e5cc2f92c4 to be precise), but nowadays the
function just launches postgres and waits until it starts accepting
connections. The other things the comment talked about are done in other
functions.
## Problem
Check neon with extra platform builds is failing on main with:
```
The template is not valid. .github/workflows/neon_extra_builds.yml (Line: 74, Col: 26): Unexpected value 'false'
```
https://github.com/neondatabase/neon/actions/runs/13549634905
## Summary of changes
Use `fromJson()` to have `false` as boolean value.
thanks to @skyzh for pointing on the issue
## Problem
Yet another source of flakyness for
https://github.com/neondatabase/neon/issues/10517
## Summary of changes
The test scenario we want to create is that we have an image layer in
index_part and then overwrite it, so we have to ensure it gets persisted
in index_part by doing a force checkpoint.
Signed-off-by: Alex Chi Z <chi@neon.tech>
We lost this with the switch to axum for the HTTP server. Add it back.
In addition to just resurrecting the functionality we had before, pass
the tracing context of the /configure HTTP request to the start_postgres
operation that runs in the main thread. This way, the 'start_postgres'
and all its sub-spans like getting the basebackup become children of the
HTTP request span. This allows end-to-end tracing of a compute start,
all the way from the proxy to the SQL queries executed by compute_ctl as
part of compute startup.
## Problem
https://github.com/neondatabase/neon/pull/10241 added configuration
switch endpoint, but it didn't delete timeline if node was excluded.
## Summary of changes
Add separate /exclude API endpoint which similarly accepts membership
configuration where sk is supposed by be excluded. Implementation
deletes the timeline locally.
Some more small related tweaks:
- make mconf switch API PUT instead of POST as it is idempotent;
- return 409 if switch was refused instead of 200 with requested &
current;
- remove unused was_active flag from delete response;
- remove meaningless _force suffix from delete functions names;
- reuse timeline.rs delete_dir function in timelines_global_map instead
of its own copy.
part of https://github.com/neondatabase/neon/issues/9965
## Problem
I see a lot of timeout errors, which indicates that this test is too
slow. It seems that create relations are fast, but the subsequent
truncating step is slow.
## Summary of changes
Reduce number of relations for now, and investigate later.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Release CI is slow, because we're doing unnecessary work, for example
building compute images on storage releases and vice versa.
## Summary of changes
- Extract tag generation into reusable workflow and extend it with
fetching of previous component releases
- Don't build neon images on compute releases and don't build compute
images on proxy and storage releases
- Reuse images from previous releases for tests on branches where we
don't build those images
## Open questions
- We differentiate between `TAG` and `COMPUTE_TAG` in a few places, but
we don't differentiate between storage and proxy releases. Since they
use the same image, this will continue to work, but I'm not sure this is
what we want.
## Problem
ref https://github.com/neondatabase/neon/issues/10927
## Summary of changes
* Implement `is_critical` and `is_cancel` over `CompactionError`.
* Revisit all places that uses `CollectKeyspaceError` to ensure they are
handled correctly.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Delete layers that we have hardlinked so far when there is an error in
`remote_copy`. This prevents a retry of the ancestor detach from
stumbling over already present layer files: the hardlink would fail with
an error.
If there is a crash, we already clean up during the timeline attach: we
loop over all layer files and purge all layers that are not referenced
by the `index_part.json`.
Make sure to hold the timeline gate to prevent races with
detach&attach&read from the layer file.
These cleanups aren't completely enough however, as there is code after
`prepare` as well. To handle errors there, we add a special case for
`AlreadyExists` errors during the hardlink, where we check if the layer
is an orphan, and if yes, we delete it from local disk. That is ideally
not the case we hit, as it is less clear in that scenario where the
layer came from, but it provides good defense in depth.
Related #10729Fixes#10970
## Problem
If the client connection goes dead without an explicit close (e.g. due
to network infrastructure dropping the connection) then we currently
won't detect it for a long time, which may e.g. block GetPage flushes
and keep the task running.
Touches https://github.com/neondatabase/cloud/issues/23515.
## Summary of changes
Enable `SO_KEEPALIVE` on the page service socket, to enable periodic TCP
keepalive probes. These are configured via Linux sysctls, which will be
deployed separately. By default, the first probe is sent after 2 hours,
so this doesn't have a practical effect until we change the sysctls.
## Problem
json_ctrl.rs is an obsolete attempt to have tests with fine control of
feeding messages into safekeeper superseded by desim framework.
## Summary of changes
Drop it.
Updates `compute_tools` and `compute_api` crates to edition 2024. We
like to stay on the latest edition if possible. There is no functional
changes, however some code changes had to be done to accommodate the
edition's breaking changes.
The PR has three commits:
* the first commit updates the named crates to edition 2024 and appeases
`cargo clippy` by changing code.
* the second commit performs a `cargo fmt` that does some minor changes
(not many)
* the third commit performs a cargo fmt with nightly options to reorder
imports as a one-time thing. it's completely optional, but I offer it
here for the compute team to review it.
I'd like to hear opinions about the third commit, if it's wanted and
felt worth the diff or not. I think most attention should be put onto
the first commit.
Part of #10918
Ref: https://github.com/neondatabase/cloud/issues/24939
## Problem
I found that we are missing authorization for some container jobs, that
will make them use anonymous pulls. It's not an issue for now, with high
enough limits, but that could be an issue when new limits introduced in
DockerHub (10 pulls / hour)
## Summary of changes
- add credentials for the jobs that run in containers
We don't want to serialize to/from string all the time, so use
`SchedulingPolicy` in `SafekeeperPersistence` via the use of a wrapper.
Stacked atop #10891
In local dev environment, these steps take around 100 ms, and they are
in the critical path of a compute startup on a compute pool hit. I don't
know if it's like that in production, but as first step, add tracing
spans to the functions so that they can be measured more easily.
Updates storage components to edition 2024. We like to stay on the
latest edition if possible. There is no functional changes, however some
code changes had to be done to accommodate the edition's breaking
changes.
The PR has two commits:
* the first commit updates storage crates to edition 2024 and appeases
`cargo clippy` by changing code. i have accidentially ran the formatter
on some files that had other edits.
* the second commit performs a `cargo fmt`
I would recommend a closer review of the first commit and a less close
review of the second one (as it just runs `cargo fmt`).
part of https://github.com/neondatabase/neon/issues/10918
## Problem
PR https://github.com/neondatabase/neon/pull/10442 added prefetch_lookup
function.
It changed handling of getpage requests in compute.
Before:
1. Lookup in LFC (return if found)
2. Register prefetch buffer
3. Wait prefetch result (increment getpage_hist)
Now:
1. Lookup prefetch ring (return if prefetch request is already
completed)
2. Lookup in LFC (return if found)
3. Register prefetch buffer
4. Wait prefetch result (increment getpage_hist)
So if prefetch result is already available, then get page histogram is
not incremented.
It case failure of some test_throughtput benchmarks:
https://neondb.slack.com/archives/C033RQ5SPDH/p1740425527249499
## Summary of changes
Increment getpage histogram in `prefetch_lookup`
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
`neon endpoint list` shows a different LSN than what the state of the
replica is. This is mainly down to what we define as LSN in this output.
If we define it as the LSN that a compute was started with, it only
makes sense to show it for static computes.
## Summary of changes
Removed the output of `last_record_lsn` for primary/hot standby
computes.
Closes: https://github.com/neondatabase/neon/issues/5825
---------
Co-authored-by: Tristan Partin <tristan@neon.tech>
## Problem
The patch for `semver` extensions relies on `PG_VERSION` environment
variable. The files were named without the letter `v` so script cannot
find them.
## Summary of changes
The patch files were renamed.
## Problem
part of https://github.com/neondatabase/neon/issues/9114
## Summary of changes
Add the auto trigger for gc-compaction. It computes two values: L1 size
and L2 size. When L1 size >= initial trigger threshold, we will trigger
an initial gc-compaction. When l1_size / l2_size >=
gc_compaction_ratio_percent, we will trigger the "tiered" gc-compaction.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We have `test_perf_many_relations` but it only runs on remote clusters,
and we cannot directly modify tenant config. Therefore, I patched one of
the current tests to benchmark relv2 performance.
close https://github.com/neondatabase/neon/issues/9986
## Summary of changes
* Add `v1/v2` selector to `test_tx_abort_with_many_relations`.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We use `docker cp` to copy the files required for the extension tests
now.
It causes problems if we run older images with the newer source tree.
## Summary of changes
Copying the files was moved to the compute Dockerfile.
## Problem
As part of https://github.com/neondatabase/neon/issues/8614 we need to
pass membership configurations between compute and safekeepers.
## Summary of changes
Add version 3 of the protocol carrying membership configurations.
Greeting message in both sides gets full conf, and other messages
generation number only. Use protocol bump to include other accumulated
changes:
- stop packing whole structs on the wire as is;
- make the tag u8 instead of u64;
- send all ints in network order;
- drop proposer_uuid, we can pass it in START_WAL_PUSH and it wasn't
much useful anyway.
Per message changes, apart from mconf:
- ProposerGreeting: tenant / timeline id is sent now as hex cstring.
Remove proto version, it is passed outside in START_WAL_PUSH. Remove
postgres timeline, it is unused. Reorder fields a bit.
- AcceptorGreeting: reorder fields
- VoteResponse: timeline_start_lsn is removed. It can be taken from
first member of term history, and later we won't need it at all when all
timelines will be explicitly created. Vote itself is u8 instead of u64.
- ProposerElected: timeline_start_lsn is removed for the same reasons.
- AppendRequest: epoch_start_lsn removed, it is known from term history
in ProposerElected.
Both compute and sk are able to talk v2 and v3 to make rollbacks (in
case we need them) easier; neon.safekeeper_proto_version GUC sets the
client version. v2 code can be dropped later.
So far empty conf is passed everywhere, future PRs will handle them.
To test, add param to some tests choosing proto version; we want to test
both 2 and 3 until we fully migrate.
ref https://github.com/neondatabase/neon/issues/10326
---------
Co-authored-by: Arthur Petukhovsky <petuhovskiy@yandex.ru>
## Problem
The interpreted WAL reader tracks the start of the current logical
batch.
This needs to be invalidated when the reader is reset.
This bug caused a couple of WAL gap alerts in staging.
## Summary of changes
* Refactor to make it possible to write a reproducer
* Add repro unit test
* Fix by resetting the start with the reader
Related https://github.com/neondatabase/cloud/issues/23935
## Problem
While looking at logs I noticed that heartbeats are sent sequentially.
The loop polling the UnorderedSet is at the wrong level of identation.
Instead of doing it after we have the full set, we did after each entry.
## Summary of Changes
Poll the UnorderedSet properly.
## Problem
We recently added slow GetPage request logging. However, this
unintentionally included the flush time when logging (which we already
have separate logging for). It also logs at WARN level, which is a bit
aggressive since we see this fire quite frequently.
Follows https://github.com/neondatabase/neon/pull/10906.
## Summary of changes
* Only log the request execution time, not the flush time.
* Extract a `pagestream_dispatch_batched_message()` helper.
* Rename `warn_slow()` to `log_slow()` and downgrade to INFO.
We've seen some cases in production where a compute doesn't get a
response to a pageserver request for several minutes, or even more. We
haven't found the root cause for that yet, but whatever the reason is,
it seems overly optimistic to think that if the pageserver hasn't
responded for 2 minutes, we'd get a response if we just wait patiently a
little longer. More likely, the pageserver is dead or there's some kind
of a network glitch so that the TCP connection is dead, or at least
stuck for a long time. Either way, it's better to disconnect and
reconnect. I set the default timeout to 2 minutes, which should be
enough for any GetPage request under normal circumstances, even if the
pageserver has to download several layer files from remote storage.
Make the disconnect timeout configurable. Also make the "log interval",
after which we print a message to the log configurable, so that if you
change the disconnect timeout, you can set the log timeout
correspondingly. The default log interval is still 10 s. The new GUCs
are called "neon.pageserver_response_log_timeout" and
"neon.pageserver_response_disconnect_timeout".
Includes a basic test for the log and disconnect timeouts.
Implements issue #10857
## Problem
There's new rate-limits coming on docker hub. To reduce our reliance on
docker hub and the problems the limits are going to cause for us, we
want to prepare for this by also pushing our container images to ghcr.io
## Summary of changes
Push our images to ghcr.io as well and not just docker hub.
WaitEventSetWait() returns the number of "events" that happened, and
only that many events in the WaitEvent array are updated. When the
timeout is reached, it returns 0 and does not modify the WaitEvent array
at all. We were reading 'event.events' without checking the return
value, which would be uninitialized when the timeout was hit.
No test included, as this is harmless at the moment. But this caused the
test I'm including in PR #10882 to fail. That PR changes the logic to
loop back to retry the PQgetCopyData() call if WL_SOCKET_READABLE was
set. Currently, making an extra call to PQconsumeInput() is harmless,
but with that change in logic, it turns into a busy-wait.
For archived timelines, we would like to prevent all non-pageserver
issued getpage requests, as users are not supposed to send these.
Instead, they should unarchive a timeline before issuing any external
read traffic.
As this is non-trivial to do, at least prevent launches of new computes,
by errorring on basebackup requests for archived timelines. In #10688,
we started issuing a warning instead of an error, because an error would
mean a stuck project. Now after we can confirm the the warning is not
present in the logs for about a week, we can issue errors.
Follow-up of #10688
Related: #9548
## Problem
close https://github.com/neondatabase/cloud/issues/24485
## Summary of changes
This patch adds a new chaos injection mode for the storcon. The chaos
injector reads the crontab and exits immediately at the configured time.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
This upgrades the `proxy/` crate as well as the forked libraries in
`libs/proxy/` to edition 2024.
Also reformats the imports of those forked libraries via:
```
cargo +nightly fmt -p proxy -p postgres-protocol2 -p postgres-types2 -p tokio-postgres2 -- -l --config imports_granularity=Module,group_imports=StdExternalCrate,reorder_imports=true
```
It can be read commit-by-commit: the first commit has no formatting
changes, only changes to accomodate the new edition.
Part of #10918
## Problem
https://github.com/neondatabase/neon/pull/10788 introduced an API for
warming up attached locations
by downloading all layers in the heatmap. We intend to use it for
warming up timelines after unarchival too,
but it doesn't work. Any heatmap generated after the unarchival will not
include our timeline, so we've lost
all those layers.
## Summary of changes
Generate a cheeky heatmap on unarchival. It includes all the visible
layers.
Use that as the `PreviousHeatmap` which inputs into actual heatmap
generation.
Closes: https://github.com/neondatabase/neon/issues/10541
## Problem
There are a bunch of minor improvements that are too small and
insignificant as is, so collecting them in one PR.
## Summary of changes
- Add runner arch to artifact name to make it easier to distinguish
files on S3
([ref](https://neondb.slack.com/archives/C059ZC138NR/p1739365938371149))
- Use `github.event.pull_request.number` instead of parsing
`$GITHUB_EVENT_PATH` file
- Update Allure CLI and `allure-pytest`
## Problem
We failed to detect https://github.com/neondatabase/neon/pull/10845
before merging, because the tests we run with a matrix of component
versions didn't include the ones that did live migrations.
## Summary of changes
- Do a live migration during the storage controller smoke test, since
this is a pretty core piece of functionality
- Apply a compat version matrix to the graceful cluster restart test,
since this is the functionality that we most urgently need to work
across versions to make deploys work.
I expect the first CI run of this to fail, because
https://github.com/neondatabase/neon/pull/10845 isn't merged yet.
ref: https://github.com/neondatabase/cloud/issues/23385
Adds a direction flag as well as private-link ID to the traffic
reporting pipeline. We do not yet actually count ingress, but we include
the flag anyway.
I have additionally moved vpce_id string parsing earlier, since we
expect it to be utf8 (ascii).
The synchronous 'postgres' crate is just a wrapper around the async
'tokio_postgres' crate. Some places were unnecessarily using the
re-exported NoTls and Error from the synchronous 'postgres' crate, even
though they were otherwise using the 'tokio_postgres' crate. Tidy up by
using the tokio_postgres types directly.
## Problem
After refactoring the configuration code to phases, it became a bit
fuzzy who filters out DBs that are not present in Postgres, are invalid,
or have `datallowconn = false`. The first 2 are important for the DB
dropping case, as we could be in operation retry, so DB could be already
absent in Postgres or invalid (interrupted `DROP DATABASE`).
Recent case:
https://neondb.slack.com/archives/C03H1K0PGKH/p1740053359712419
## Summary of changes
Add a common code that filters out inaccessible DBs inside
`ApplySpecPhase::RunInEachDatabase`.
Updates `vm_monitor` to edition 2024. We like to stay on the latest
edition if possible. There is no functional changes, it's only changes
due to the rustfmt edition.
part of https://github.com/neondatabase/neon/issues/10918
There was a race condition with compute_ctl and the metric being
collected related to whether the neon extension had been updated or not.
compute_ctl will run `ALTER EXTENSION neon UPDATE` on compute start in
the postgres database.
Fixes: https://github.com/neondatabase/neon/issues/10932
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Prefetch is performed locally, so different backers can request the same
pages form PS.
Such duplicated request increase load of page server and network
traffic.
Making prefetch global seems to be very difficult and undesirable,
because different queries can access chunks on different speed. Storing
prefetch chunks in LFC will not completely eliminate duplicates, but can
minimise such requests.
The problem with storing prefetch result in LFC is that in this case
page is not protected by share buffer lock.
So we will have to perform extra synchronisation at LFC side.
See:
https://neondb.slack.com/archives/C0875PUD0LC/p1736772890602029?thread_ts=1736762541.116949&cid=C0875PUD0LC
@MMeent implementation of prewarm:
See https://github.com/neondatabase/neon/pull/10312/
## Summary of changes
Use conditional variables to sycnhronize access to LFC entry.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
The storage controller treats durations in the tenant config as strings.
These are loaded from the db.
The pageserver maps these durations to a seconds only format and we
always get a mismatch compared
to what's in the db.
## Summary of changes
Treat durations as durations inside the storage controller and not as
strings.
Nothing changes in the cross service API's themselves or the way things
are stored in the db.
I also added some logging which I would have made the investigation a
10min job:
1. Reason for why the reconciliation was spawned
2. Location config diff between the observed and wanted states
## Problem
Now that we rely on the optimisation logic to handle fixing things up
after some tenants are in the wrong AZ (e.g. after node failure), it's
no longer appropriate to treat optimisations as an ultra-low-priority
task. We used to reflect that low priority with a very low limit on
concurrent execution, such that we would only migrate 2 things every 20
seconds.
## Summary of changes
- Increase MAX_OPTIMIZATIONS_EXEC_PER_PASS from 2 to 16
- Increase MAX_OPTIMIZATIONS_PLAN_PER_PASS from 8 to 64.
Since we recently gave user-initiated actions their own semaphore, this
should not risk starving out API requests.
Safekeepers only respond to requests with the per-token scope, or the
`safekeeperdata` JWT scope. Therefore, add infrastructure in the storage
controller for safekeeper JWTs. Also, rename the ambiguous `jwt_token`
to `pageserver_jwt_token`.
Part of #9011
Related: https://github.com/neondatabase/cloud/issues/24727
Return an empty json response in the `scheduling_policy` handler.
This prevents errors of the form:
```
Error: receive body: error decoding response body: EOF while parsing a value at line 1 column 0
```
when setting the scheduling policy via the `storcon_cli`.
part of #9011.
Plv8 consists of two parts:
1. the V8 engine, which is built from vendored sources, and
2. the PostgreSQL extension.
Split those into two separate steps in the Dockerfile. The first step
doesn't need any PostgreSQL sources or any other files from the neon
repository, just the build tools and the upstream plv8 sources. Use the
build-deps image as the base for that step, so that the layer can be
cached and doesn't need to be rebuilt every time. This is worthwhile
because the V8 build takes a very long time.
## Problem
This should be one last fix for
https://github.com/neondatabase/neon/issues/10517.
## Summary of changes
If a keyspace is empty, we might produce a gc-compaction job which
covers no layer files. We should avoid generating such jobs so that the
gc-compaction image layer can cover the full key range.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We don't have good observability for "stuck" getpage requests.
Resolves https://github.com/neondatabase/cloud/issues/23808.
## Summary of changes
Log a periodic warning (every 30 seconds) if GetPage request execution
is slow to complete, to aid in debugging stuck GetPage requests.
This does not cover response flushing (we have separate logging for
that), nor reading the request from the socket and batching it (expected
to be insignificant and not straightforward to handle with the current
protocol).
This costs 95 nanoseconds on the happy path when awaiting a
`tokio::task::yield_now()`:
```
warn_slow/enabled=false time: [45.716 ns 46.116 ns 46.687 ns]
warn_slow/enabled=true time: [141.53 ns 141.83 ns 142.18 ns]
```
## Problem
Refer https://github.com/neondatabase/neon/issues/10885
Wait position in ring buffer to restrict number of in-flight requests is
not correctly calculated.
## Summary of changes
Update condition and remove redundant assertion
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
It clashed with pg_mooncake
This is the same as the hotfix #10908 , but for the main branch, to keep
the release and main branches in sync. In particular, we don't want to
accidentally revert this temporary fix, if we cut a new release from
main.
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.
[Announcement blog
post](https://blog.rust-lang.org/2025/02/20/Rust-1.85.0.html).
Prior update was in #10618.
## Problem
The interpreted SK <-> PS protocol does not guard against gaps (neither
does the Vanilla one, but that's beside the point).
## Summary of changes
Extend the protocol to include the start LSN of the PG WAL section from
which the records were interpreted.
Validation is enabled via a config flag on the pageserver and works as
follows:
**Case 1**: `raw_wal_start_lsn` is smaller than the requested LSN
There can't be gaps here, but we check that the shard received records
which it hasn't seen before.
**Case 2**: `raw_wal_start_lsn` is equal to the requested LSN
This is the happy case. No gap and nothing to check
**Case 3**: `raw_wal_start_lsn` is greater than the requested LSN
This is a gap.
To make Case 3 work I had to bend the protocol a bit.
We read record chunks of WAL which aren't record aligned and feed them
to the decoder.
The picture below shows a shard which subscribes at a position somewhere
within Record 2.
We already have a wal reader which is below that position so we wait to
catch up.
We read some wal in Read 1 (all of Record 1 and some of Record 2). The
new shard doesn't
need Record 1 (it has already processed it according to the starting
position), but we read
past it's starting position. When we do Read 2, we decode Record 2 and
ship it off to the shard,
but the starting position of Read 2 is greater than the starting
position the shard requested.
This looks like a gap.

To make it work, we extend the protocol to send an empty
`InterpretedWalRecords` to shards
if the WAL the records originated from ends the requested start
position. On the pageserver,
that just updates the tracking LSNs in memory (no-op really). This gives
us a workaround for
the fake gap.
As a drive by, make `InterpretedWalRecords::next_record_lsn` mandatory
in the application level definition.
It's always included.
Related: https://github.com/neondatabase/cloud/issues/23935
## Problem
Storage controller uses unsecure http for pageserver API.
Closes: https://github.com/neondatabase/cloud/issues/23734
Closes: https://github.com/neondatabase/cloud/issues/24091
## Summary of changes
- Add an optional `listen_https_port` field to storage controller's Node
state and its API (RegisterNode/ListNodes/etc).
- Allow updating `listen_https_port` on node registration to gradually
add https port for all nodes.
- Add `use_https_pageserver_api` CLI option to storage controller to
enable https.
- Pageserver doesn't support https for now and always reports
`https_port=None`. This will be addressed in follow-up PR.
ALTER SUBSCRIPTION requires AccessExclusive lock
which conflicts with iteration over pg_subscription when multiple
databases are present
and operations are applied concurrently.
Fix by explicitly locking pg_subscription
in the beginning of the transaction in each database.
## Problem
https://github.com/neondatabase/cloud/issues/24292
Fix an issue caused by PR
https://github.com/neondatabase/neon/pull/10891: we introduced the
concept of timeouts for heartbeats, where we would hang up on the other
side of the oneshot channel if a timeout happened (future gets
cancelled, receiver is dropped).
This hang up would make the heartbeat task panic when it did obtain the
response, as we unwrap the result of the result sending operation. The
panic would lead to the heartbeat task panicing itself, which is then
according to logs the last sign of life we of that process invocation.
I'm not sure what brings down the process, in theory tokio [should
continue](https://docs.rs/tokio/latest/tokio/runtime/enum.UnhandledPanic.html#variant.Ignore),
but idk.
Alternative to #10901.
## Problem
Background heatmap uploads and downloads were blocking the ones done
manually by the test.
## Summary of changes
Disable Background heatmap uploads and downloads for the cold migration
test. The test does
them explicitly.
## Problem
Pinning build tools still replicated the ACR/ECR/Docker Hub login and
pushing, even though we have a reusable workflow for this. Was mentioned
as a TODO in https://github.com/neondatabase/neon/pull/10613.
## Summary of changes
Reuse `_push-to-container-registry.yml` for pinning the build-tools
images.
`pprof::symbolize()` used a regex to strip the Rust monomorphization
suffix from generic methods. However, the `backtrace` crate can do this
itself if formatted with the `:#` flag.
Also tighten up the code a bit.
This PR does the following things:
* The initial heartbeat round blocks the storage controller from
becoming online again. If all safekeepers are unresponsive, this can
cause storage controller startup to be very slow. The original intent of
#10583 was that heartbeats don't affect normal functionality of the
storage controller. So add a short timeout to prevent it from impeding
storcon functionality.
* Fix the URL of the utilization endpoint.
* Don't send heartbeats to safekeepers which are decomissioned.
Part of https://github.com/neondatabase/neon/issues/9011
context: https://neondb.slack.com/archives/C033RQ5SPDH/p1739966807592589
## Problem
A simpler version of https://github.com/neondatabase/neon/pull/10812
## Summary of changes
Image layer creation will be preempted by L0 accumulated on other
timelines. We stop image layer generation if there's a pending L0
compaction request.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Adds CPU/heap profiling for storcon.
Also fixes allowlists to match on the path only, since profiling
endpoints take query parameters.
Requires #10892 for heap profiling.
## Problem
We'd like to enable CPU/heap profiling for storcon. This requires
jemalloc.
## Summary of changes
Use jemalloc as the global allocator, and enable heap sampling for
profiling.
## Problem
We've seen the previous default of 50 cause OOMs. Compacting many L0
layers at once now has limited benefit, since the cost is mostly linear
anyway. This is already being reduced to 20 in production settings.
## Summary of changes
Reduce `DEFAULT_COMPACTION_UPPER_LIMIT` to 20.
Once released, let's remove the config overrides.
## Problem
Our AWS account IDs are copy-pasted all over the place. A wrong paste
might only be caught late if we hardcode them, but will get flagged
instantly by actionlint if we access them from github actions variables.
Resolves https://github.com/neondatabase/neon/issues/10787, follow-up
for https://github.com/neondatabase/neon/pull/10613.
## Summary of changes
Access AWS account IDs using Github Actions variables.
## Problem
Autosplits are crucial for bulk ingest performance. However, autosplits
were only attempted when there was no other pending work. This could
cause e.g. mass AZ affinity violations following Pageserver restarts to
starve out autosplits for hours.
Resolves#10762.
## Summary of changes
Always attempt autosplits in the background reconciliation loop,
regardless of other pending work.
## Problem
we measure ingest performance for a few variants (stripe-sizes,
pre-sharded, shard-splitted).
However some phenomena (e.g. related to L0 compaction) in PS can be
better observed and optimized with un-sharded tenants.
## Summary of changes
- Allow to create projects with a policy that disables sharding
(`{"scheduling": "Essential"}`)
- add a variant to ingest_benchmark that uses that policy for the new
project
## Test run
https://github.com/neondatabase/neon/actions/runs/13396325970
## Problem
The nightly test discovered problems in the extensions upgrade test.
1. `PLv8` has different versions on PGv17 and PGv16 and a different test
set, which was not implemented correctly
[sample](https://github.com/neondatabase/neon/actions/runs/13382330475/job/37372930271)
2. The same for `semver`
[sample](https://github.com/neondatabase/neon/actions/runs/13382330475/job/37372930017)
3. `pgtap` interfered with the other tests, e.g. tables, created by
other extensions caused the tests to fail.
## Summary of changes
The discovered problems were fixed.
1. The tests list for `PLv8` is now generated using the original
Makefile
2. The patches for `semver` are now split for PGv16 and PGv17.
3. `pgtap` is being tested in a separate database now.
---------
Co-authored-by: Mikhail Kot <mikhail@neon.tech>
## Problem
Read errors during repartition should be a critical error.
## Summary of changes
<del>We only have one call site</del> We have two call sites of
`repartition` where one of them is during the initial image upload
optimization and another is during image layer creation, so I added a
`critical!` here instead of inside `collect_keyspace`.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The usual workflow for me to debug read path errors in staging is:
download the tenant to my laptop, import, and then run some read tests.
With this patch, we can do this directly over staging pageservers.
## Summary of changes
* Add a new `touchpagelsn` API that does a page read but does not return
page info back.
* Allow read from latest record LSN from get/touchpagelsn
* Add read_debug config in the context.
* The read path will read the context config to decide whether to enable
read path tracing or not.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We log image compaction stats even when no image compaction happened.
This is logged every 10 seconds for every timeline.
## Summary of changes
Only log when we actually performed any image compaction.
## Problem
If the deploy job on the release branch doesn't succeed, the preprod
deployment will not have happened. It was requested that this triggers a
notification in https://github.com/neondatabase/neon/issues/10662.
## Summary of changes
If we're on the release branch and the deploy job doesn't end up in
"success", notify storage oncall on slack.
## Problem
We lack an API for warming up attached locations based on the heatmap
contents.
This is problematic in two places:
1. If we manually migrate and cut over while the secondary is still cold
2. When we re-attach a previously offloaded tenant
## Summary of changes
https://github.com/neondatabase/neon/pull/10597 made heatmap generation
additive
across migrations, so we won't clobber it a after a cold migration. This
allows us to implement:
1. An endpoint for downloading all missing heatmap layers on the
pageserver:
`/v1/tenant/:tenant_shard_id/timeline/:timeline_id/download_heatmap_layers`.
Only one such operation per timeline is allowed at any given time. The
granularity is tenant shard.
2. An endpoint to the storage controller to trigger the downloads on the
pageserver:
`/v1/tenant/:tenant_shard_id/timeline/:timeline_id/download_heatmap_layers`.
This works both at
tenant and tenant shard level. If an unsharded tenant id is provided,
the operation is started on
all shards, otherwise only the specified shard.
3. A storcon cli command. Again, tenant and tenant-shard level
granularities are supported.
Cplane will call into storcon and trigger the downloads for all shards.
When we want to rescue a migration, we will use storcon cli targeting
the specific tenant shard.
Related: https://github.com/neondatabase/neon/issues/10541
## Problem
ref https://github.com/neondatabase/neon/issues/10517
## Summary of changes
For some reasons the job split algorithm decides to have different image
coverage range for two compactions before/after restart. So we remove
the subcompaction key range and let it generate an image covering the
full range, which should make the test more stable.
Also slightly tuned the logging span.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
This reverts commit 443c8d0b4b.
## Problem
We observe a massive amount of compaction errors.
## Summary of changes
If the tenant did not write any L1 layers (i.e., they accumulate L0
layers where number of them is below L0 threshold), image creation will
always fail. Therefore, it's not correct to simply use the
disk_consistent_lsn or L0/L1 boundary for the image creation.
Preparations for a successor of #10440:
* move `pull_timeline` to `safekeeper_api` and add it to
`SafekeeperClient`. we want to do `pull_timeline` on any creations that
we couldn't do initially.
* Add a `SafekeeperGeneration` type instead of relying on a type alias.
we want to maintain a safekeeper specific generation number now in the
storcon database. A separate type is important to make it impossible to
mix it up with the tenant's pageserver specific generation number. We
absolutely want to avoid that for correctness reasons. If someone mixes
up a safekeeper and pageserver id (both use the `NodeId` type), that's
bad but there is no wrong generations flying around.
part of #9011
## Problem
`discard all` cannot run in a transaction (even if implicit)
## Summary of changes
Split up the query into two, we don't need transaction support.
## Problem
Tests with mixed versions of binaries always pick up new versions if
services are started using `neon_local`.
## Summary of changes
- Set `neon_local_binpath` along with `neon_binpath` and
`pg_distrib_dir` for tests with mixed versions
This replaces the use of the awscli utility. awscli binary is massive,
it added about 200 MB to the docker image size, while the s3 client was
already a dependency so using that is essentially free, as far as binary
size is concerned.
I implemented a simple upload function that tries to keep 10 uploads
going in parallel. I believe that's the default behavior of the "aws s3
sync" command too.
These generated Postgres settings JSON files can get out of sync causing
the control plane to reject updated to an endpoint or project's Postgres
settings.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
The current `pageserver_layers_per_read` histogram buckets don't
represent the current reality very well. For the percentiles we care
about (e.g. p50 and p99), we often see fairly high read amp, especially
during ingestion, and anything below 4 can be considered very good.
## Summary of changes
Change the per-timeline read amp histogram buckets to `[4.0, 8.0, 16.0,
32.0, 64.0, 128.0, 256.0]`.
## Problem
Test fails when comparing the first WAL segment because the system id in
the segment header is different. The system id is not consistently set
correctly since segments are usually inited on the safekeeper sync step
with sysid 0.
## Summary of Chnages
Compare timeline digests instead. This skips the header.
Closes https://github.com/neondatabase/neon/issues/10596
I was looking into
https://github.com/neondatabase/serverless/issues/144, I recall previous
cases where proxy would trigger these prepared statements which would
conflict with other statements prepared by our client downstream.
Because of that, and also to aid in debugging, I've made sure all
prepared statements that proxy needs to make have specific names that
likely won't conflict and makes it clear in a error log if it's our
statements that are causing issues
Instead of hardcoding the request timeout, let's make it configurable as
a PGC_SUSET GUC.
Additionally, add a connect timeout GUC. Although the extension server
runs on the compute, it is always best to keep operations from hanging.
Better to present a timeout error to the user than a stuck backend.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
In #10707 some new fields were introduced in TimelineInfo.
I forgot that we do not only use TimelineInfo for encoding, but also
decoding when the storage controller calls into a pageserver, so this
broke some calls from controller to pageserver while in a mixed-version
state.
## Summary of changes
- Make new fields have default behavior so that they are optional
Originally I wanted to switch back to the `neon` branch before merging
#10825, but I forgot to do it. Do it in a separate PR now.
No actual change of the source code, only changes the branch name (so
that maybe in a few weeks we can delete the temporary branch
`arpad/neon-rebase`).
## Problem
`SmgrOpFlushInProgress::measure()` takes a `socket_fd` argument which is
only used on Linux. This causes linter warnings on macOS.
Touches #10823.
## Summary of changes
Add a noop use of `socket_fd` on non-Linux branch.
## Problem
See https://github.com/neondatabase/neon/issues/10839
rho(x,b) functions returns values in range [1,b+1] and addSHLL tries to
store it in array of size b+1.
## Summary of changes
Subtract 1 fro value returned by rho
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
We want to host pg_duckdb (starting with v0.3.1) on Neon.
This PR replaces https://github.com/neondatabase/neon/pull/10350 which
was for older pg_duckdb v0.2.0
Use cases
- faster OLAP queries
- access to datelake files (e.g. parquet) on S3 buckets from Neon
PostgreSQL
Because neon does not provide superuser role to neon customers we need
to grant some additional permissions to neon_superuser:
Note: some grants that we require are already granted to `PUBLIC` in new
release of pg_duckdb
[here](3789e4c509/sql/pg_duckdb--0.2.0--0.3.0.sql (L1054))
```sql
GRANT ALL ON FUNCTION duckdb.install_extension(TEXT) TO neon_superuser;
GRANT ALL ON TABLE duckdb.extensions TO neon_superuser;
GRANT ALL ON SEQUENCE duckdb.extensions_table_seq TO neon_superuser;
```
## Problem
The timestamp prefix of pytest log lines contains milliseconds without
leading zeros, so values of milliseconds less than 100 printed
incorrectly.
For example:
```
2025-02-15 12:02:51.997 INFO [_internal.py:97] 127.0.0.1 - - ...
2025-02-15 12:02:52.4 INFO [_internal.py:97] 127.0.0.1 - - ...
2025-02-15 12:02:52.9 INFO [_internal.py:97] 127.0.0.1 - - ...
2025-02-15 12:02:52.23 INFO [_internal.py:97] 127.0.0.1 - - ...
```
## Summary of changes
Fix log_format for pytest so that milliseconds are printed with leading
zeros.
In commit 9537829ccd I made shared_buffers be derived from the system's
available RAM. However, I failed to remove the old hard-coded
shared_buffers=10GB settings, shared_buffers was set twice. Oopsie.
## Problem
Sometimes, a regression test run gets stuck (taking more than 60
minutes) and is killed by GitHub's `timeout-minutes` without leaving any
traces in the test results database.
I find no correlation between this and either the build type, the
architecture, or the Postgres version.
See: https://neonprod.grafana.net/goto/nM7ih7cHR?orgId=1
## Summary of changes
- Bump `pytest-timeout` to the version that supports `--session-timeout`
- Set `--session-timeout` to (timeout-minutes - 10 minutes) * 60 seconds
in Attempt to stop tests gracefully to generate test reports until they
are forcibly stopped by the stricter `timeout-minutes` limit.
## Problem
Part of https://github.com/neondatabase/neon/issues/9516
## Summary of changes
This patch adds the support for storing reldir in the sparse keyspace.
All logic are guarded with the `rel_size_v2_enabled` flag, so if it's
set to false, the code path is exactly the same as what's currently in
prod.
Note that we did not persist the `rel_size_v2_enabled` flag and the
logic around it will be implemented in the next patch. (i.e., what if we
enabled it, restart the pageserver, and then it gets set to false? we
should still read from v2 using the rel_size_v2_migration_status in the
index_part). The persistence logic I'll implement in the next patch will
disallow switching from v2->v1 via config item.
I also refactored the metrics so that it can work with the new reldir
store. However, this metric is not correctly computed for reldirs (see
the comments) before. With the refactor, the value will be computed only
when we have an initial value for the reldir size. The refactor keeps
the incorrectness of the computation when there are more than 1
database.
For the tests, we currently run all the tests with v2, and I'll set it
to false and add some v2-specific tests before merging, probably also
v1->v2 migration tests.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
I noticed the opportunity to simplify here while working on
https://github.com/neondatabase/neon/pull/9353 .
The only difference is the zero-fill behavior: if one reads past rel
size,
`get_rel_page_at_lsn` returns a zeroed page whereas `Timeline::get`
returns an error.
However, the `endblk` is at most rel size large, because `nblocks` is eq
`get_rel_size`, see a few lines above this change.
We're using the same LSN (`self.lsn`) for everything, so there is no
chance of non-determinism.
Refs:
- Slack discussion debating correctness:
https://neondb.slack.com/archives/C033RQ5SPDH/p1737457010607119
# Summary
In
- https://github.com/neondatabase/neon/pull/10813
we added slow flush logging but it didn't log the TCP send & recv queue
length.
This PR adds that data to the log message.
I believe the implementation to be safe & correct right now, but it's
brittle and thus this PR should be reverted or improved upon once the
investigation is over.
Refs:
- stacked atop https://github.com/neondatabase/neon/pull/10813
- context:
https://neondb.slack.com/archives/C08DE6Q9C3B/p1739464533762049?thread_ts=1739462628.361019&cid=C08DE6Q9C3B
- improves https://github.com/neondatabase/neon/issues/10668
- part of https://github.com/neondatabase/cloud/issues/23515
# How It Works
The trouble is two-fold:
1. getting to the raw socket file descriptor through the many Rust types
that wrap it and
2. integrating with the `measure()` function
Rust wraps it in types to model file descriptor lifetimes and ownership,
and usually one can get access using `as_raw_fd()`.
However, we `split()` the stream and the resulting
[`tokio::io::WriteHalf`](https://docs.rs/tokio/latest/tokio/io/struct.WriteHalf.html)
.
Check the PR commit history for my attempts to do it.
My solution is to get the socket fd before we wrap it in our protocol
types, and to store that fd in the new `PostgresBackend::socket_fd`
field.
I believe it's safe because the lifetime of `PostgresBackend::socket_fd`
value == the lifetime of the `TcpStream` that wrap and store in
`PostgresBackend::framed`.
Specifically, the only place that close()s the socket is the `impl Drop
for TcpStream`.
I think the protocol stack calls `TcpStream::shutdown()`, but, that
doesn't `close()` the file descriptor underneath.
Regarding integration with the `measure()` function, the trouble is that
`flush_fut` is currently a generic `Future` type. So, we just pass in
the `socket_fd` as a separate argument.
A clean implementation would convert the `pgb_writer.flush()` to a named
future that provides an accessor for the socket fd while not being
polled.
I tried (see PR history), but failed to break through the `WriteHalf`.
# Testing
Tested locally by running
```
./target/debug/pagebench get-page-latest-lsn --num-clients=1000 --queue-depth=1000
```
in one terminal, waiting a bit, then
```
pkill -STOP pagebench
```
then wait for slow logs to show up in `pageserver.log`.
Pick one of the slow log message's port pairs, e.g., `127.0.0.1:39500`,
and then checking sockstat output
```
ss -ntp | grep '127.0.0.1:39500'
```
to ensure that send & recv queue size match those in the log message.
The logic might seem a bit intricate / over-optimized, but I recently
spent time benchmarking this code path in the context of a nightly
pagebench regression
(https://github.com/neondatabase/cloud/issues/21759)
and I want to avoid regressing it any further.
Ideally would also log the socket send & recv queue length like we do on
the compute side in
- https://github.com/neondatabase/neon/pull/10673
But that is proving difficult due to the Rust abstractions that wrap the
socket fd.
Work in progress on that is happening in
- https://github.com/neondatabase/neon/pull/10823
Regarding production impact, I am worried at a theoretical level that
the additional logging may cause a downward spiral in the case where a
pageserver is slow to flush because there is not enough CPU. The logging
would consume more CPU and thereby slow down flushes even more. However,
I don't think this matters practically speaking.
# Refs
- context:
https://neondb.slack.com/archives/C08DE6Q9C3B/p1739464533762049?thread_ts=1739462628.361019&cid=C08DE6Q9C3B
- fixes https://github.com/neondatabase/neon/issues/10668
- part of https://github.com/neondatabase/cloud/issues/23515
# Testing
Tested locally by running
```
./target/debug/pagebench get-page-latest-lsn --num-clients=1000 --queue-depth=1000
```
in one terminal, waiting a bit, then
```
pkill -STOP pagebench
```
then wait for slow logs to show up in `pageserver.log`.
To see that the completion log message is logged, run
```
pkill -CONT pagebench
```
## Problem
Some situations may produce a large number of pending reconciles. If we
experience an issue where reconciles are processed more slowly than
expected, that can prevent us responding promptly to user requests like
tenant/timeline CRUD.
This is a cleaner implementation of the hotfix in
https://github.com/neondatabase/neon/pull/10815
## Summary of changes
- Introduce a second semaphore for high priority tasks, with
configurable units (default 256). The intent is that in practical
situations these user-facing requests should never have to wait.
- Use the high priority semaphore for: tenant/timeline CRUD, and shard
splitting operations. Use normal priority for everything else.
## Problem
We run the compatibility tests only if we are upgrading the extension.
An accidental code change may break the test itself, so we have to check
this code as well.
## Summary of changes
The test is scheduled once a day to save time and resources.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
PR #10305 makes sure that there is no *actual* race, i.e. we will never
attempt to offload a timeline that has just been unarchived, or similar.
However, if a timeline has been unarchived and has children that are
unarchived too, we will get an error log line. Such races can occur as
in compaction we check if the timeline can be offloaded way before we
attempt to offload it: the result might change in the meantime.
This patch checks if the delete guard can't be obtained because the
timeline has unarchived children, and if yes, it does another check for
whether the timeline has become unarchived or not. If it is unarchived,
it just prints an info log msg and integrates itself into the error
suppression logic of the compaction calling into it.
If you squint at it really closely, there is still a possible race in
which we print an error log, but this one is unlikely because the
timeline and its children need to be archived right after the check for
whether the timeline has any unarchived children, and right before the
check whether the timeline is archived. Archival involves a network
operation while nothing between these two checks does that, so it's very
unlikely to happen in real life.
https://github.com/neondatabase/cloud/issues/23979#issuecomment-2651265729
## Problem
We had code for stripping IDs out of proxied paths to reduce cardinality
of metrics, but it was only stripping out tenant IDs, and leaving in
timeline IDs and query parameters (e.g. LSN in lsn->timestamp lookups).
## Summary of changes
- Use a more general regex approach.
There is still some risk that a future pageserver API might include a
parameter in `/the/path/`, but we control that API and it is not often
extended. We will also alert on metrics cardinality in staging so that
if we made that mistake we would notice.
## Problem
See https://github.com/neondatabase/neon/issues/10755
Random access pattern of pgbench leaves sparse chunks, which makes the
on-disk size of file.cache unpredictable.
## Summary of changes
Perform seqscan to fill LFC chunks with data so that on-disk file size
included size of table.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
`benchmarks` is a long-running and non-blocking job. If, on Staging, a
deploy-blocking job fails, restarting it requires cancelling any running
`benchmarks` jobs, which is a waste of CI resources and requires a
couple of extra clicks for a human to do.
Ref: https://neondb.slack.com/archives/C059ZC138NR/p1739292995400899
## Summary of changes
- Run `benchmarks` after `deploy` job
- Handle `benchmarks` run in PRs with `run-benchmarks` label but without
`deploy` job.
There was a typo in the name of the utilization endpoint URL, fix it.
Also, ensure that the heartbeat mechanism actually works.
Related: #10583, #10429
Part of #9011
Before this PR, an IO error returned from the kernel, e.g., due to a bad
disk, would get bubbled up, all the way to a user-visible query failing.
This is against the IO error handling policy where we have established
and is hence being rectified in this PR.
[[(internal Policy document
link)]](bef44149f7/src/storage/handling_io_and_logical_errors.md (L33-L35))
The practice on the write path seems to be that we call
`maybe_fatal_err()` or `fatal_err()` fairly high up the stack.
That is, regardless of whether std::fs, tokio::fs, or VirtualFile is
used to perform the IO.
For the read path, I choose a centralized approach in this PR by
checking for errors as close to the kernel interface as possible.
I believe this is better for long-term consistency.
To mitigate the problem of missing context if we abort so far down in
the stack, the `on_fatal_io_error` now captures and logs a backtrace.
I grepped the pageserver code base for `fs::read` to convince myself
that all non-VirtualFile reads already handle IO errors according to
policy.
Refs
- fixes https://github.com/neondatabase/neon/issues/10454
## Problem
We didn't catch all client errors causing alerts.
## Summary of changes
Client errors should be wrapped with ClientError so that it doesn't fire
alerts.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The test is flaky: WAL in remote storage appears to be corrupted. One of
hypotheses so far is that corruption is the result of local fs
implementation being non atomic, and safekeepers may concurrently PUT
the same segment. That's dubious though because by looking at local_fs
impl I'd expect then early EOF on segment read rather then observed
zeros in test failures, but other directions seem even less probable.
## Summary of changes
Let's add s3 backend as well and see if it is also flaky. Also add some
more logging around segments uploads.
ref https://github.com/neondatabase/neon/issues/10761
There is now a compute_ctl_config field in the response that currently
only contains a JSON Web Key set. compute_ctl currently doesn't do
anything with the keys, but will in the future.
The reasoning for the new field is due to the nature of empty computes.
When an empty compute is created, it does not have a tenant. A compute
spec is the primary means of communicating the details of an attached
tenant. In the empty compute state, there is no spec. Instead we wait
for the control plane to pass us one via /configure. If we were to
include the jwks field in the compute spec, we would have a partial
compute spec, which doesn't logically make sense.
Instead, we can have two means of passing settings to the compute:
- spec: tenant specific config details
- compute_ctl_config: compute specific settings
For instance, the JSON Web Key set passed to the compute is independent
of any tenant. It is a setting of the compute whether it is attached or
not.
Signed-off-by: Tristan Partin <tristan@neon.tech>
pg_search is 46ish MB. All other remote extensions are around hundeds of
KB. 3 seconds is not long enough to download the tarball if the S3
gateway cache doesn't already contain a copy. According to our setup,
the cache is limited to 10 GB in size and anything that has not been
accessed for an hour is purged.
This is really bad for scaling to 0, even more so if you're the only
project actively using the extension in a production Kubernetes cluster.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
This test occasionally fails while the test teardown tries to do a
graceful shutdown, because the test has quickly written lots of data
into the pageserver.
Closes: #10654
## Summary of changes
- Call `post_checks` at the end of `test_isolation`, as we already do
for test_pg_regress -- this improves our detection of issues, and as a
nice side effect flushes the pageserver.
- Ignore pg_notify files when validating state at end of test, these are
not expected to be the same
## Problem
In #10752 I used an overly-strict regex that only ignored error on a
particular key.
## Summary of changes
- Drop key from regex so it matches all such errors
## Problem
The build of the neon container image is not caching any part of the
rust build, making it fairly slow.
## Summary of changes
Cache dependency building using cargo-chef.
## Problem
Previously, when cutting over to cold secondary locations,
we would clobber the previous, good, heatmap with a cold one.
This is because heatmap generation used to include only resident layers.
Once this merges, we can add an endpoint which triggers full heatmap
hydration on attached locations to heal cold migrations.
## Summary of changes
With this patch, heatmap generation becomes additive. If we have a
heatmap from when this location was secondary, the new uploaded heatmap
will be the result of a reconciliation between the old one and the on
disk resident layers.
More concretely, when we have the previous heatmap:
1. Filter the previous heatmap and keep layers that are (a) present
in the current layer map, (b) visible, (c) not resident. Call this set
of layers `visible_non_resident`.
2. From the layer map, select all layers that are resident and visible.
Call this set of layers `resident`.
3. The new heatmap is the result of merging the two disjoint sets.
Related https://github.com/neondatabase/neon/issues/10541
In #9011, we want to schedule timelines to safekeepers. In order to do
such scheduling, we need information about how utilized a safekeeper is
and if it's available or not.
Therefore, send constant heartbeats to the safekeepers and try to figure
out if they are online or not.
Includes some code from #10440.
## Problem
We expose `latest_gc_cutoff` in our API, and callers understandably were
using that to validate LSNs for branch creation. However, this is _not_
the true GC cutoff from a user's point of view: it's just the point at
which we last actually did GC. The actual cutoff used when validating
branch creations and page_service reads is the min() of latest_gc_cutoff
and the planned GC lsn in GcInfo.
Closes: https://github.com/neondatabase/neon/issues/10639
## Summary of changes
- Expose the more useful min() of GC cutoffs as `gc_cutoff_lsn` in the
API, so that the most obviously named field is really the one people
should use.
- Retain the ability to read the LSN at which GC was actually done, in
an `applied_gc_cutoff_lsn` field.
- Internally rename `latest_gc_cutoff_lsn` to `applied_gc_cutoff_lsn`
("latest" was a confusing name, as the value in GcInfo is more up to
date in terms of what a user experiences)
- Temporarily preserve the old `latest_gc_cutoff_lsn` field for compat
with control plane until we update it to use the new field.
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
https://github.com/neondatabase/neon/pull/10613 changed how images are
pushed, and therefore also how we have to wait for images to be pushed
in `trigger-e2e-tests`. The `trigger-e2e-tests` workflow is triggered in
three different ways:
- When a pull request is pushed to that is already ready to review, here
we call the workflow from `build_and_test`
- When a pull request is marked ready for review, then the workflow is
triggered directly
- When a push to `main` or `release(-.*)?` triggers `build_and_test` and
that indirectly calls `trigger-e2e-tests`.
The second of these paths had a bug, which was not tested in the PR,
because this path being different wasn't clear to me.
## Summary of changes
Fix the jq statement that caused the bug.
## Problem
https://github.com/neondatabase/neon/pull/10613 changed how images are
pushed, and there was a small mismatch between the github workflow and
the script generating what to push where. This resulted in the workflow
trying to push images to prod registries from the main branch, even
though we don't do that and therefore didn't generate a mapping for
those registries in the script that decides what to push where.
This misconception happened because promote-images-dev pushed to dev
registries, and promote-images-prod pushed to prod registries, but
promote-images-prod also updated the latest tag in the dev registries if
and only if we are on the main branch. This last bit is why the
push-<component>-image-prod jobs were trying to run on the main branch.
## Summary of changes
Don't try pushing to prod registries from the main branch.
## Problem
Retagging container images and pushing container images taken from one
registry to another is very tangled up with artifact building and not
separated by component. This makes not building compute for storage
releases and vice versa pretty tricky. To enable that, I want to clean
up retagging and pushing of container images and then continue on making
the pipelines for releases leaner by not building unnecessary things.
## Summary of changes
- Add a reusable workflow that can push to ACR, ECR and Docker Hub,
while being very flexible in terms of source and target images. This
allows for retagging and pushing images between container registries.
- Stop pushing images to registries aside of docker hub in the jobs that
build the images
- Split image pushing into 4 different jobs (not mentioning special
cases):
- neon-dev
- neon-prod
- compute-dev
- compute-prod
## TODO
- Consider also using this for `pin-build-tools-image`, as it's
basically another instance of the same thing.
## Known limitations
- The ECR part of this workflow supports authenticating to multiple AWS
accounts and therefore multiple ECR endpoints, but the ACR part only
supports one Azure Account. If someone with more knowledge on Azure can
tell me whether an equivalent to
https://github.com/aws-actions/amazon-ecr-login?tab=readme-ov-file#login-to-ecr-on-multiple-aws-accounts
is easily possible, that'd be great.
- The `image_map` input is a bit complex. It expects something along the
lines of
```
{
"docker.io/neondatabase/compute-node-v14:13196061314": [
"docker.io/neondatabase/compute-node-v14:13196061314",
"369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v14:13196061314",
"neoneastus2.azurecr.io/neondatabase/compute-node-v14:13196061314"
],
"docker.io/neondatabase/compute-node-v15:13196061314": [
"docker.io/neondatabase/compute-node-v15:13196061314",
"369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v15:13196061314",
"neoneastus2.azurecr.io/neondatabase/compute-node-v15:13196061314"
]
}
```
to map from source to target image. We have a small python step to
generate this map for the 4 main image pushing jobs. The concrete
example is taken from
https://github.com/neondatabase/neon/actions/runs/13196061314/job/36838584098?pr=10613#step:3:6
and shortened to two images.
## Problem
We respect `skip_pg_catalog_updates` at the initial start, but ignore at
the follow-up `/configure`. Yet, it's used for storage->cplane->compute
notify requests after migrations, shard split, etc. So every time we get
them, applying the new config takes much longer than it should because
we go through Postgres catalog checks. Cplane sets this flag, when it
does serves notify attach call
9068c7d743
Related to `inc-403`, for example
## Summary of changes
Look at `skip_pg_catalog_updates` in `compute.reconfigure()`
## Problem
PRs created by external contributors, in some cases might list failed
jobs
- `Trigger E2E Tests / cancel-previous-e2e-tests`
- `Trigger E2E Tests / tag`
They don't block the merge, and tests in fact pass (their counterparts
in internal PR), but because jobs are triggered from an external PR (and
not from the corresponding internal one) they still present as red
marks.
For example https://github.com/neondatabase/neon/pull/10778
## Summary of changes
- Check permissions before triggering e2e tests
## Problem
L0 compaction frequently gets starved out by other background tasks and
image/GC compaction. L0 compaction must be responsive to keep read
amplification under control.
Touches #10694.
Resolves#10689.
## Summary of changes
Use a separate semaphore for the L0-only compaction pass.
* Add a `CONCURRENT_L0_COMPACTION_TASKS` semaphore and
`BackgroundLoopKind::L0Compaction`.
* Add a setting `compaction_l0_semaphore` (default off via
`compaction_l0_first`).
* Use the L0 semaphore when doing an `OnlyL0Compaction` pass.
* Use the background semaphore when doing a regular compaction pass
(which includes an initial L0 pass).
* While waiting for the background semaphore, yield for L0 compaction if
triggered.
* Add `CompactFlags::NoYield` to disable L0 yielding, and set it for the
HTTP API route.
* Remove the old `use_compaction_semaphore` setting and
compaction-scoped semaphore.
* Remove the warning when waiting for a semaphore; it's noisy and we
have metrics.
## Problem
With current approach for the base images in `Dockerfiles`, it's hard to
track when image is updated, and as they are base, than update will
invalidate all the layers, as base image changed.
That also becomes more complicated, as we have a number of runners, and
they may have different images with the tag `bookworm-slim`, so that
will lead to invalidate caches, when image build on one runner will be
used on another runners.
To fix that problem, we could pin our base images to the specific sha,
and that not only align images across runners, and also will allow us to
have reproducible build and don't depend on any spontaneous changes in
upstream.
Fix: https://github.com/neondatabase/cloud/issues/24084
## Summary of changes
Beside of the main goal, that PR also included some small changes around
Dockerfiles:
1. Main change: use `SHA` for `bookworm-slim` and `bullseye-slim` debian
images
2. For the layers requiring `curl` we could add `curl` and `unzip` to
the `build-deps` image, and use it as a base image for all the steps,
removing extra dependency on `alpine/curl`
3. added `retry-on-host-error=on` for the `wgetrc` as it happened to me:
fail to resolve hostname
## Problem
Previously, Workload was reconfiguring the compute before each run of
writes, which was meant to be a no-op when nothing changed, but was
actually writing extra data due to an issue being fixed in
https://github.com/neondatabase/neon/pull/10696.
The row counts in tests were too low in some cases, these tests were
only working because of those extra writes that shouldn't have been
happening, and moreover were relying on checkpoints happening.
## Summary of changes
- Only reconfigure compute if the attached pageserver actually changed.
If pageserver is set to None, that means controller is managing
everything, so never reconfigure compute.
- Update tests that wrote too few rows.
---------
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
The old values assumed that you have at least about 18 GB of RAM
available (shared_buffers=10GB and maintenance_work_mem=8GB). That's a
lot when testing locally. Make it configurable, and make the default
assumption much smaller: 256 MB.
This is nice for local testing, but it's also in preparation for
starting to use VMs to run these jobs. When launched in a VM, the
control plane can set these env variables according to the max size of
the VM.
Also change the formula for how RAM is distributed: use 10% of RAM for
shared_buffers, and 70% for maintenance_work_mem. That leaves a good
amount for misc. other stuff and the OS. A very large shared_buffers
setting won't typically help with bulk loading. It won't help with the
network and I/O of processing all the tables, unless maybe if the whole
database fits in shared buffers, but even then it's not much faster than
using local disk. Bulk loading is all sequential I/O. It also won't help
much with index creation, which is also sequential I/O. A large
maintenance_work_mem can be quite useful, however, so that's where we
put most of the RAM.
Resolves: https://github.com/neondatabase/neon/issues/5734
When we query pageserver and it's unavailable after some retries,
postgres sigabrt's. This is intended behavior so I've added tests
checking it
## Problem
When image compaction yields for L0 compaction, it may not immediately
schedule L0 compaction, because it just goes on to compact the next
pending timeline.
Touches #10694.
Requires #10744.
## Summary of changes
Extend `CompactionOutcome` with `YieldForL0` and `Skipped` variants, and
immediately schedule an L0 compaction pass in the `YieldForL0` case.
## Problem
Image compaction can starve out L0 compaction if a tenant has several
timelines with L0 debt.
Touches #10694.
Requires #10740.
## Summary of changes
* Add an initial L0 compaction pass, in order of L0 count.
* Add a tenant option `compaction_l0_first` to control the L0 pass
(disabled by default).
* Add `CompactFlags::OnlyL0Compaction` to run an L0-only compaction
pass.
* Clean up the compaction iteration logic.
A later PR will use separate semaphores for the L0 and image compaction
passes to avoid cross-tenant L0 starvation. That PR will also make image
compaction yield if _any_ of the tenant's timelines have pending L0
compaction to further avoid starvation.
Avoids compiling the crate and its dependencies into binaries that don't
need them. Shrinks the compute_ctl binary from about 31MB to 28MB in the
release-line-debug-size-lto profile.
Run "apt install" first, and only then COPY the files from the
intermediary build layers to the final image. This way, if you modify
any of the sources that trigger e.g. rebuilding compute_ctl, the "apt
install" step can still be cached.
## Problem
In https://github.com/neondatabase/neon/pull/10411 fill logic changes
such that it benefits us to test it with & without AZs set up. I didn't
extend the test inline in that PR because there were overlapping test
changes in flight to add `num_az` parameter.
## Summary of changes
- Parameterise test on AZ count (1 or 2)
- When AZ count is 2, use a different balance check that just asserts
the _tenants_ are balanced (since AZ affinity is chosen on a per-tenant
basis)
The compute_ctl HTTP server has the following purposes:
- Allow management via the control plane
- Provide an endpoint for scaping metrics
- Provide APIs for compute internal clients
- Neon Postgres extension for installing remote extensions
- local_proxy for installing extensions and adding grants
The first two purposes require the HTTP server to be available outside
the compute.
The Neon threat model is a bad actor within our internal network. We
need to reduce the surface area of attack. By exposing unnecessary
unauthenticated HTTP endpoints to the internal network, we increase the
surface area of attack. For endpoints described in the third bullet
point, we can just run an extra HTTP server, which is only bound to the
loopback interface since all consumers of those endpoints are within the
compute.
This changes the default value of the `timeline_offloading` pageserver
and tenant configs to true, now that offloading has been rolled out
without problems.
There is also a small fix in the tenant config merge function, where we
applied the `lazy_slru_download` value instead of `timeline_offloading`.
Related issue: https://github.com/neondatabase/cloud/issues/21353
# Problem
Say we have a batch of 10 responses to send out.
Then, even with
- #10728
we've still only called observe_execution_end_flush_start for the first
3 responses.
The remaining 7 response timers are still ticking.
When compute now closes the connection, the waiting flush fails with an
error and we `drop()` the remaining 7 responses' smgr op timers. The
`impl Drop for SmgrOpTimer` will observe an execution time that includes
the flush time.
In practice, this is supsected to produce the `+Inf` observations in the
smgr op latency histogram we've seen since the introduction of
pipelining, even after shipping #10728.
refs:
- fixup of https://github.com/neondatabase/neon/pull/10042
- fixup of https://github.com/neondatabase/neon/pull/10728
- fixes https://github.com/neondatabase/neon/issues/10754
## Problem
These tests can encounter a bug in the pageserver read path (#9185)
which occurs under the very specific circumstances that the tests
create, but is very unlikely to happen in the field.
We will fix the bug, but in the meantime let's un-flake the tests.
Related: https://github.com/neondatabase/neon/issues/10720
## Summary of changes
- Permit "could not find data for key" errors in tests affected by #9185
I don't see the point of static linking, postgres itself and many of the
extensions are already built dynamically.
One reason for the change is that I'm working on bigger changes to start
using systemd in the compute, and as part of that I wanted to add the
--with-systemd configure option to pgbouncer, and there doesn't seem to
be a static version of libsystemd (at least not on Debian).
## Problem
/database_schema endpoint returns incomplete output from `pg_dump`
## Summary of changes
The Tokio process was not used properly. The returned stream does not
include `process::Child`, and the process is scheduled to be killed
immediately after the `get_database_schema` call when `cmd` goes out of
scope.
The solution in this PR is to return a special Stream implementation
that retains `process::Child`.
compute_ctl is mostly written in synchronous fashion, intended to run in
a single thread. However various parts had become async, and they
launched their own tokio runtimes to run the async code. For example, VM
monitor ran in its own multi-threaded runtime, and apply_spec_sql()
launched another multi-threaded runtime to run the per-database SQL
commands in parallel. In addition to that, a few places used a
current-thread runtime to run async code in the main thread, or launched
a current-thread runtime in a *different* thread to run background
tasks.
Unify the runtimes so that there is only one tokio runtime. It's created
very early at process startup, and the main thread "enters" the runtime,
so that it's always available for tokio::spawn() and runtime.block_on()
calls. All code that needs to run async code uses the same runtime.
The main thread still mostly runs in a synchronous fashion. When it
needs to run async code, it uses rt.block_on().
Spawn fewer additional threads, prefer to spawn tokio tasks instead.
Convert some code that ran synchronously in background threads into
async. I didn't go all the way, though, some background threads are
still spawned.
## Problem
close https://github.com/neondatabase/neon/issues/10213
`range_search` only returns the top-most layers that may satisfy the
search, so it doesn't include all layers that might be accessed (the
user needs to recursively call this function). We need to retrieve the
full layer map and find overlaps in order to have a correct heuristics
of the job split.
## Summary of changes
Retrieve all layers and find overlaps instead of doing `range_search`.
The patch also reduces the time holding the layer map read guard.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The compaction loop currently runs periodically, which can cause it to
wait for up to 20 seconds before starting L0 compaction by default.
Also, when we later separate the semaphores for L0 compaction and image
compaction, we want to give up waiting for the image compaction
semaphore if L0 compaction is needed on any timeline.
Touches #10694.
## Summary of changes
Notify the compaction loop when an L0 flush (on any timeline) exceeds
`compaction_threshold`.
Also do some opportunistic cleanups in the area.
Building the compute rust binaries from scratch is pretty slow, it takes
between 4-15 minutes on my laptop, depending on which compiler flags and
other tricks I use. A cache mount allows caching the dependencies and
incremental builds, which speeds up rebuilding significantly when you
only makes a small change in a source file.
The awscli was downloaded at the last stages of the overall compute
image build, which meant that if you modified any part of the build, it
would trigger a re-download of the awscli. That's a bit annoying when
developing locally and rebuilding the compute image repeatedly. Move it
to a separate layer, to cache separately and to avoid the spurious
rebuilds.
The compute_id will be used when verifying claims sent by the control
plane.
Signed-off-by: Tristan Partin <tristan@neon.tech>
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Make it possible to dump WAL records in format recognised by walredo
process.
Intended usage:
```
pg_waldump -R 1663/5/16396 -B 771727 000000010000000100000034 --save-records=/tmp/walredo.records
postgres --wal-redo < /tmp/walredo.records > /tmp/page.img
```
## Summary of changes
Related Postgres PRs:
https://github.com/neondatabase/postgres/pull/575https://github.com/neondatabase/postgres/pull/572
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
helps investigate https://github.com/neondatabase/neon/issues/10482
## Summary of changes
In debug mode and testing mode, we will record all files visited by a
read operation, and print it out when it errors.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Reduce the read amplification when doing `repartition`.
## Summary of changes
Compute the L0-L1 boundary LSN and do repartition here.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The upgrade test for pg_jwt does not work correctly.
## Summary of changes
The script for the upgrade test is modified to use the database
`contrib_regression`.
# Problem
walredo shutdown is done in the compaction task. Let's move it to tenant
housekeeping.
# Summary of changes
* Rename "ingest housekeeping" to "tenant housekeeping".
* Move walredo shutdown into tenant housekeeping.
* Add a constant `WALREDO_IDLE_TIMEOUT` set to 3 minutes (previously 10x
compaction threshold).
This patch does a bunch of superficial cleanups of `tenant::tasks` to
avoid noise in subsequent PRs. There are no functional changes.
PS: enable "hide whitespace" when reviewing, due to the unindentation of
large async blocks.
# Problem
Before this PR, the `shard_id` field was missing when page_service logs
a reconstruct error.
This was caused by batching-related refactorings.
Example from staging:
```
2025-01-30T07:10:04.346022Z ERROR page_service_conn_main{peer_addr=...}:process_query{tenant_id=... timeline_id=...}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key ...
```
# Changes
Delay creation of the handler-specific span until after shard routing
This also avoids the need for the record() call in the pagestream hot
path.
# Testing
Manual testing with a failpoint that is part of this PR's history but
will be squashed away.
# Refs
- fixes https://github.com/neondatabase/neon/issues/10599
## Problem
This test would sometimes fail its assertion that a timeline does not
revert to active once archived. That's because it was using the
in-memory offload state, not the persistent state, so this was sometimes
lost across a pageserver restart.
Closes: https://github.com/neondatabase/neon/issues/10389
## Summary of changes
- When reading offload status, read from pageserver API _and_ remote
storage before considering the timeline offloaded
I plan to use this when launching a fast_import job in a VM. There's
currently no good way for an executable running in a NeonVM to exit
gracefully and have the VM shut down. The inittab we use always respawns
the payload command. The idea is that the control plane can use
"fast_import ... && poweroff" as the command, so that when fast_import
completes successfully, the VM is terminated, and the k8s Pod and
VirtualMachine object are marked as completed successfully.
I'm working on bigger changes to how we launch VMs, and will try to come
up with a nicer system for that, but in the meanwhile, this quick hack
allows us to proceed with using VMs for one-off jobs like fast_import.
## Problem
There are a couple of log warnings tripping up
`test_timeline_archival_chaos`
- `[stopping left-over name="timeline_delete"
tenant_shard_id=2d526292b67dac0e6425266d7079c253
timeline_id=Some(44ba36bfdee5023672c93778985facd9)
kind=TimelineDeletionWorker\n')](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10672/13161357302/index.html#/testresult/716b997bb1d8a021)`
- `ignoring attempt to restart exited flush_loop
503d8f401d8887cfaae873040a6cc193/d5eed0673ba37d8992f7ec411363a7e3\n')`
Related: https://github.com/neondatabase/neon/issues/10389
## Summary of changes
- Downgrade the 'ignoring attempt to restart' to info -- there's nothing
in the design that forbids this happening, i.e. someone calling
maybe_spawn_flush_loop concurrently with shutdown()
- Prevent timeline deletion tasks outliving tenants by carrying a
gateguard. This logically makes sense because the deletion process does
call into Tenant to update manifests.
## Problem
L0 compaction can get starved by other background tasks. It needs to be
responsive to avoid read amp blowing up during heavy write workloads.
Touches #10694.
## Summary of changes
Add a separate semaphore for compaction, configurable via
`use_compaction_semaphore` (disabled by default). This is primarily for
testing in staging; it needs further work (in particular to split
image/L0 compaction jobs) before it can be enabled.
## Problem
Endpoint kept running while timeline was deleted, causing forbidden
warnings on the pageserver when the tenant is not found.
## Summary of changes
- Explicitly stop the endpoint before the end of the test, so that it
isn't trying to talk to the pageserver in the background while things
are torn down
## Problem
This is tech debt. While we introduced generations for tenants, some
legacy situations without generations needed to delete things inline
(async operation) instead of enqueing them (sync operation).
## Summary of changes
- Remove the async code, replace calls with the sync variant, and assert
that the generation is always set
## Problem
Ref: https://github.com/neondatabase/neon/issues/10632
We use dns named `*.localtest.me` in our test, and that domain is
well-known and widely used for that, with all the records there resolve
to the localhost, both IPv4 and IPv6: `127.0.0.1` and `::1`
In some cases on our runners these addresses resolves only to `IPv6`,
and so components fail to connect when runner doesn't have `IPv6`
address. We suspect issue in systemd-resolved here
(https://github.com/systemd/systemd/issues/17745)
To workaround that and improve test stability, we introduced our own
domain `*.local.neon.build` with IPv4 address `127.0.0.1` only
See full details and troubleshoot log in referred issue.
p.s.
If you're FritzBox user, don't forget to add that domain
`local.neon.build` to the `DNS Rebind Protection` section under `Home
Network -> Network -> Network Settings`, otherwise FritzBox will block
addresses, resolving to the local addresses.
For other devices/vendors, please check corresponding documentation, if
resolving `local.neon.build` will produce empty answer for you.
## Summary of changes
Replace all the occurrences of `localtest.me` with `local.neon.build`
We need to disable the detection of memory leaks when running
``neon_local init` for build with sanitizers to avoid an error thrown by
AddressSanitizer.
## Problem
Found another pgcopydb segfault in error handling
```bash
2025-02-06 15:30:40.112 51299 ERROR pgsql.c:2330 [TARGET -738302813] FATAL: terminating connection due to administrator command
2025-02-06 15:30:40.112 51298 ERROR pgsql.c:2330 [TARGET -1407749748] FATAL: terminating connection due to administrator command
2025-02-06 15:30:40.112 51297 ERROR pgsql.c:2330 [TARGET -2073308066] FATAL: terminating connection due to administrator command
2025-02-06 15:30:40.112 51300 ERROR pgsql.c:2330 [TARGET 1220908650] FATAL: terminating connection due to administrator command
2025-02-06 15:30:40.432 51300 ERROR pgsql.c:2536 [Postgres] FATAL: terminating connection due to administrator command
2025-02-06 15:30:40.513 51290 ERROR copydb.c:773 Sub-process 51300 exited with code 0 and signal Segmentation fault
2025-02-06 15:30:40.578 51299 ERROR pgsql.c:2536 [Postgres] FATAL: terminating connection due to administrator command
2025-02-06 15:30:40.613 51290 ERROR copydb.c:773 Sub-process 51299 exited with code 0 and signal Segmentation fault
2025-02-06 15:30:41.253 51298 ERROR pgsql.c:2536 [Postgres] FATAL: terminating connection due to administrator command
2025-02-06 15:30:41.314 51290 ERROR copydb.c:773 Sub-process 51298 exited with code 0 and signal Segmentation fault
2025-02-06 15:30:43.133 51297 ERROR pgsql.c:2536 [Postgres] FATAL: terminating connection due to administrator command
2025-02-06 15:30:43.215 51290 ERROR copydb.c:773 Sub-process 51297 exited with code 0 and signal Segmentation fault
2025-02-06 15:30:43.215 51290 ERROR indexes.c:123 Some INDEX worker process(es) have exited with error, see above for details
2025-02-06 15:30:43.215 51290 ERROR indexes.c:59 Failed to create indexes, see above for details
2025-02-06 15:30:43.232 51271 ERROR copydb.c:768 Sub-process 51290 exited with code 12
```
```bashadmin@ip-172-31-38-164:~/pgcopydb$ gdb /usr/local/pgsql/bin/pgcopydb core
GNU gdb (Debian 13.1-3) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/pgsql/bin/pgcopydb...
[New LWP 51297]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Core was generated by `pgcopydb: create index ocr.ocr_pipeline_step_results_version_pkey '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000aaaac3a4b030 in splitLines (lbuf=lbuf@entry=0xffffd8b86930, buffer=<optimized out>) at string_utils.c:630
630 *newLinePtr = '\0';
(gdb) bt
#0 0x0000aaaac3a4b030 in splitLines (lbuf=lbuf@entry=0xffffd8b86930, buffer=<optimized out>) at string_utils.c:630
#1 0x0000aaaac3a3a678 in pgsql_execute_log_error (pgsql=pgsql@entry=0xffffd8b87040, result=result@entry=0x0,
sql=sql@entry=0xffff81fe9be0 "CREATE UNIQUE INDEX IF NOT EXISTS ocr_pipeline_step_results_version_pkey ON ocr.ocr_pipeline_step_results_version USING btree (id, transaction_id);",
debugParameters=debugParameters@entry=0xaaaaec5f92f0, context=context@entry=0x0) at pgsql.c:2322
#2 0x0000aaaac3a3bbec in pgsql_execute_with_params (pgsql=pgsql@entry=0xffffd8b87040,
sql=0xffff81fe9be0 "CREATE UNIQUE INDEX IF NOT EXISTS ocr_pipeline_step_results_version_pkey ON ocr.ocr_pipeline_step_results_version USING btree (id, transaction_id);", paramCount=paramCount@entry=0,
paramTypes=paramTypes@entry=0x0, paramValues=paramValues@entry=0x0, context=context@entry=0x0, parseFun=parseFun@entry=0x0) at pgsql.c:1649
#3 0x0000aaaac3a3c468 in pgsql_execute (pgsql=pgsql@entry=0xffffd8b87040, sql=<optimized out>) at pgsql.c:1522
#4 0x0000aaaac3a245f4 in copydb_create_index (specs=specs@entry=0xffffd8b8ec98, dst=dst@entry=0xffffd8b87040, index=index@entry=0xffff81f71800, ifNotExists=<optimized out>) at indexes.c:846
#5 0x0000aaaac3a24ca8 in copydb_create_index_by_oid (specs=specs@entry=0xffffd8b8ec98, dst=dst@entry=0xffffd8b87040, indexOid=<optimized out>) at indexes.c:410
#6 0x0000aaaac3a25040 in copydb_index_worker (specs=specs@entry=0xffffd8b8ec98) at indexes.c:297
#7 0x0000aaaac3a25238 in copydb_start_index_workers (specs=specs@entry=0xffffd8b8ec98) at indexes.c:209
#8 0x0000aaaac3a252f4 in copydb_index_supervisor (specs=specs@entry=0xffffd8b8ec98) at indexes.c:112
#9 0x0000aaaac3a253f4 in copydb_start_index_supervisor (specs=0xffffd8b8ec98) at indexes.c:57
#10 copydb_start_index_supervisor (specs=specs@entry=0xffffd8b8ec98) at indexes.c:34
#11 0x0000aaaac3a51ff4 in copydb_process_table_data (specs=specs@entry=0xffffd8b8ec98) at table-data.c:146
#12 0x0000aaaac3a520dc in copydb_copy_all_table_data (specs=specs@entry=0xffffd8b8ec98) at table-data.c:69
#13 0x0000aaaac3a0ccd8 in cloneDB (copySpecs=copySpecs@entry=0xffffd8b8ec98) at cli_clone_follow.c:602
#14 0x0000aaaac3a0d2cc in start_clone_process (pid=0xffffd8b743d8, copySpecs=0xffffd8b8ec98) at cli_clone_follow.c:502
#15 start_clone_process (copySpecs=copySpecs@entry=0xffffd8b8ec98, pid=pid@entry=0xffffd8b89788) at cli_clone_follow.c:482
#16 0x0000aaaac3a0d52c in cli_clone (argc=<optimized out>, argv=<optimized out>) at cli_clone_follow.c:164
#17 0x0000aaaac3a53850 in commandline_run (command=command@entry=0xffffd8b9eb88, argc=0, argc@entry=22, argv=0xffffd8b9edf8, argv@entry=0xffffd8b9ed48) at /home/admin/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71
#18 0x0000aaaac3a01464 in main (argc=22, argv=0xffffd8b9ed48) at main.c:140
(gdb)
```
The problem is most likely that the following call returned a message in
a read-only memory segment where we cannot replace \n with \0 in
string_utils.c splitLines() function
```C
char *message = PQerrorMessage(pgsql->connection);
```
## Summary of changes
modified the patch to also address this problem
## Problem
Currently the following line below uses array subscript notation which
is confusing since `reqlsns` is not an array but just a pointer to a
struct.
```
XLogWaitForReplayOf(reqlsns[0].request_lsn);
```
## Summary of changes
Switch from array subscript notation to arrow operator to improve
readability of code.
Close#10620.
## Problem
We don't have visibility into how long an individual background job is
waiting for a semaphore permit.
## Summary of changes
* Make `pageserver_background_loop_semaphore_wait_seconds` a histogram
rather than a sum.
* Add a paced warning when a task takes more than 10 minutes to get a
permit (for now).
* Drive-by cleanup of some `EnumMap` usage.
## Problem
This test called NeonPageserver.tenant_detach, which as well as
detaching locally on the pageserver, also updates the storage controller
to put the tenant into Detached mode. When the test runs slowly in debug
mode, it sometimes takes long enough that the background_reconcile loop
wakes up and drops the tenant from memory in response, such that the
pageserver can't validate its deletions and the test does not behave as
expected.
Closes: https://github.com/neondatabase/neon/issues/10513
## Summary of changes
- Call the pageserver HTTP client directly rather than going via
NeonPageserver.tenant_detach
## Problem
The default `GITHUB_TOKEN` is used to push changes created with
`approved-for-ci-run`, which doesn't work:
```
Run git push --force origin "${BRANCH}"
remote: Permission to neondatabase/neon.git denied to github-actions[bot].
fatal: unable to access 'https://github.com/neondatabase/neon/': The requested URL returned error: 403
```
Ref:
https://github.com/neondatabase/neon/actions/runs/13166108303/job/36746518291?pr=10687
## Summary of changes
- Use `CI_ACCESS_TOKEN` to clone an external repo
- Remove unneeded `actions/checkout`
## Problem
During ingest_benchmark which uses `pgcopydb`
([see](https://github.com/dimitri/pgcopydb))we sometimes had outages.
- when PostgreSQL COPY step failed we got a segfault (reported
[here](https://github.com/dimitri/pgcopydb/issues/899))
- the root cause was Neon idle_in_transaction_session_timeout is set to
5 minutes which is suboptimal for long-running tasks like project import
(reported [here](https://github.com/dimitri/pgcopydb/issues/900))
## Summary of changes
Patch pgcopydb to avoid segfault.
override idle_in_transaction_session_timeout and set it to "unlimited"
## Problem
We need a metrics to know what's going on in pageserver's background
jobs.
## Summary of changes
* Waiting tasks: task still waiting for the semaphore.
* Running tasks: tasks doing their actual jobs.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Erik Grinaker <erik@neon.tech>
Often the output of the timestamp->lsn API is used as input for branch
creation, and branch creation takes the planned lsn into account, i.e.
rejects lsn's as branch lsns that are before the planned lsn.
This patch doesn't fix all race conditions, it's still racy. But at
least it is a step into the right direction.
For #10639
## Problem
Following #10641, let's add a few critical errors.
Resolves#10094.
## Summary of changes
Adds the following critical errors:
* WAL sender read/decode failure.
* WAL record ingestion failure.
* WAL redo failure.
* Missing key during compaction.
We don't add an error for missing keys during GetPage requests, since
we've seen a handful of these in production recently, and the cause is
still unclear (most likely a benign race).
Right now, branch creation doesn't care if a lsn lease exists or not, it
just fails if the passed lsn is older than either the last or the
planned gc cutoff.
However, if an lsn lease exists for a given lsn, we can actually create
a branch at that point: nothing has been gc'd away.
This prevents race conditions that #10678 still leaves around.
Related: #10639https://github.com/neondatabase/cloud/issues/23667
We don't want any external requests for an archived timeline. This
includes basebackup requests, i.e. when a compute is being started up.
Therefore, we'd like to forbid such basebackup requests: any attempt to
get a basebackup on an archived timeline (or any getpage request really)
is a cplane bug. Make this a warning for now so that, if there is
potentially a bug, we can detect cases in the wild before they cause
stuck operations, but the intention is to return an error eventually.
Related: #9548
On my AX102 Hetzner box, removing this line removes about 20us from the
`latency_mean` result in
`test_pageserver_characterize_latencies_with_1_client_and_throughput_with_many_clients_one_tenant`.
If the same 20us can be removed in the nightly benchmark run, this will
be a ~10% improvement because there, mean latencies are about ~220us.
This span was added during batching refactors, we didn't have it before,
and I don't think it's terribly useful.
refs
- https://github.com/neondatabase/cloud/issues/21759
## Problem
The local filesystem backend for remote storage doesn't set a
concurrency limit. While it can't/won't enforce a concurrency limit
itself, this also bounds the upload queue concurrency. Some tests create
thousands of uploads, which slows down the quadratic scheduling of the
upload queue, and there is no point spawning that many Tokio tasks.
Resolves#10409.
## Summary of changes
Set a concurrency limit of 100 for the LocalFS backend.
Before: `test_layer_map[release-pg17].test_query: 68.338 s`
After: `test_layer_map[release-pg17].test_query: 5.209 s`
## Problem
Incorrect manipulations with iteration index in `lfc_cache_containsv`
## Summary of changes
```
- int this_chunk = Min(nblocks, BLOCKS_PER_CHUNK - chunk_offs);
+ int this_chunk = Min(nblocks - i, BLOCKS_PER_CHUNK - chunk_offs); int this_chunk = ```
- if (i + 1 >= nblocks)
+ if (i >= nblocks)
```
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
USE_ASSERT _CHECKING is defined as empty entity. but it is checked using
#if
## Summary of changes
Replace `#if USE_ASSERT _CHECKING` with `#ifdef USE_ASSERT _CHECKING` as
done in other places in Postgres
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
The code is intended to reschedule compaction immediately if there are
pending tasks. We set the duration to 0 before if there are pending
tasks, but this will go through the `if period == Duration::ZERO {`
branch and sleep for another 10 seconds.
## Summary of changes
Set duration to 1 so that it doesn't sleep for too long.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Any compaction should never produce l0 layers. This never happened in my
experiments, but would be good to guard it early.
## Summary of changes
Disallow gc-compaction to produce l0 layers.
Signed-off-by: Alex Chi Z <chi@neon.tech>
The full build of all extensions takes a long time. When working locally
on parts that don't need extensions, you can iterate more quickly by
skipping the unnecessary extensions.
This adds a build argument to the dockerfile to specify extensions to
build. There are three options:
- EXTENSIONS=all (default)
- EXTENSIONS=minimal: Build only a few extensions that are listed in
shared_preload_libraries in the default neon config.
- EXTENSIONS=none: Build no extensions (except for the mandatory 'neon'
extension).
## Problem
close https://github.com/neondatabase/neon/issues/10651
## Summary of changes
* Image layer creation starts from the next partition of the last
processed partition if the previous attempt was not complete.
* Add tests.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
When a shard has two secondary locations, but one of them is on a node
with MaySchedule::No, the optimiser would get stuck, because it couldn't
decide which secondary to remove.
This is generally okay if a node is offline, but if a node is in Pause
mode for a long period of time, it's a problem.
Closes: https://github.com/neondatabase/neon/issues/10646
## Summary of changes
- Instead of insisting on finding a node in the wrong AZ to remove, find
an available node in the _right_ AZ, and remove all the others. This
ensures that if there is one live suitable node, then other
offline/paused nodes cannot hold things up.
Given a remote extensions manifest of the following:
```json
{
"public_extensions": [],
"custom_extensions": null,
"library_index": {
"pg_search": "pg_search"
},
"extension_data": {
"pg_search": {
"control_data": {
"pg_search.control": "comment = 'pg_search: Full text search for PostgreSQL using BM25'\ndefault_version = '0.14.1'\nmodule_pathname = '$libdir/pg_search'\nrelocatable = false\nsuperuser = true\nschema = paradedb\ntrusted = true\n"
},
"archive_path": "13117844657/v14/extensions/pg_search.tar.zst"
}
}
}
```
We were allowing a compute to install a remote extension that wasn't
listed in either public_extensions or custom_extensions.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
We don't have visibility into the ratio of image vs. delta pages
ingested in Pageservers. This might be useful to determine whether we
should compress WAL records before storing them, which in turn might
make compaction more efficient.
## Summary of changes
Add `pageserver_wal_ingest_values_committed` metric with dimensions
`class=metadata|data` and `kind=image|delta`.
## Problem
We get WARN log noise on tenant creations. Cplane creates tenants via
/location_config. That returns the attached locations in the response
and spawns a reconciliation which will also attempt to notify cplane. If
the notification is attempted before cplane persists the shards to its
database, storcon gets back a 404. The situation is harmless, but
annoying.
## Summary of Changes
* Add a tenant creation hint to the reconciler config
* If the hint is true and we get back a 404 on the notification from
cplane, ignore the error, but still queue the reconcile up for a retry.
Closes https://github.com/neondatabase/cloud/issues/20732
## Problem
Ref: https://github.com/neondatabase/cloud/issues/23461
and follow-up after: https://github.com/neondatabase/neon/pull/10553
we used `echo` to set-up `.wgetrc` and `.curlrc`, and there we used `\n`
to make these multiline configs with one echo command.
The problem is that Debian `/bin/sh`'s built-in echo command behaves
differently from the `/bin/echo` executable and from the `echo` built-in
in `bash`. Namely, it does not support the`-e` option, and while it does
treat `\n` as a newline, passing `-e` here will add that `-e` to the
output.
At the same time, when we use different base images, for example
`alpine/curl`, their `/bin/sh` supports and requires `-e` for treating
escape sequences like `\n`.
But having different `echo` and remembering difference in their
behaviour isn't best experience for the developer and makes bad
experience maintaining Dockerfiles.
Work-arounds:
- Explicitly use `/bin/bash` (like in this PR)
- Use `/bin/echo` instead of the shell's built-in echo function
- Use printf "foo\n" instead of echo -e "foo\n"
## Summary of changes
1. To fix that, we process with the option setting `/bin/bash` as a
SHELL for the debian-baysed layers
2. With no changes for `alpine/curl` based layers.
3. And one more change here: in `extensions` layer split to the 2 steps:
installing dependencies from `CPAN` and installing `lcov` from github,
so upgrading `lcov` could reuse previous layer with installed cpan
modules.
## Problem
We wish to make heatmap generation additive in
https://github.com/neondatabase/neon/pull/10597.
However, if the pageserver restarts and has a heatmap on disk from when
it was a secondary long ago,
we can end up keeping extra layers on the secondary's disk.
## Summary of changes
Persist the heatmap after a successful upload.
## Problem
Hopefully this can resolve
https://github.com/neondatabase/neon/issues/10517. The reason why the
test is flaky is that after restart the compute node might write some
data so that the pageserver flush some layers, and in the end, causing
L0 compaction to run, and we cannot get the test scenario as we want.
## Summary of changes
Ensure all L0 layers are compacted before starting the test.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
In #9895, we fixed some issues where `ClearVmBits` were broadcast to all
shards, even those not owning the VM relation. As part of that, we found
some ancient code from #1417, which discarded spurious incorrect
`ClearVmBits` records for pages outside of the VM relation. We added
observability in #9911 to see how often this actually happens in the
wild.
After two months, we have not seen this happen once in production or
staging. However, out of caution, we don't want a hard error and break
WAL ingestion.
Resolves#10067.
## Summary of changes
Log a critical error when ingesting `ClearVmBits` for unknown VM
relations or pages.
## Problem
We want to switch proxy and ideally all Rust services to structured JSON
logging to support better filtering and cross-referencing with tracing.
## Summary of changes
* Introduce a custom tracing-subscriber to write the JSON. In a first
attempt a customized tracing::fmt::FmtSubscriber was used, but it's very
inefficient and can still generate invalid JSON. It's also doesn't allow
us to add important fields to the root object.
* Make this opt in: the `LOGFMT` env var can be set to `"json"` to
enable to new logger at startup.
This PR does a bunch of things:
* only allow errors of the server cert verification, not of the TLS
handshake. The TLS handshake doesn't cause any errors for us so we can
just always require it to be valid. This simplifies the code a little.
* As the solution is more permanent than originally anticipated, I think
it makes sense to move the `AcceptAll` verifier outside.
* log the connstr information. this helps with figuring out which domain
names are configured in the connstr, etc. I think it is generally useful
to print it. make extra sure that the password is not leaked.
Follow-up of #10640
Refactor how extensions are built in compute Dockerfile
1. Rename some of the extension layers, so that names correspond more
precisely to the upstream repository name and the source directory
name. For example, instead of "pg-jsonschema-pg-build", spell it
"pg_jsonschema-build". Some of the layer names had the extra "pg-"
part, and some didn't; harmonize on not having it. And use an
underscore if the upstream project name uses an underscore.
2. Each extension now consists of two dockerfile targets:
[extension]-src and [extension]-build. By convention, the -src
target downloads the sources and applies any neon-specific patches
if necessary. The source tarball is downloaded and extracted under
/ext-src. For example, the 'pgvector' extension creates the
following files and directory:
/ext-src/pgvector.tar.gz # original tarball
/ext-src/pgvector.patch # neon-specific patch, copied from patches/ dir
/ext-src/pgvector-src/ # extracted tarball, with patch applied
This separation avoids re-downloading the sources every time the
extension is recompiled. The 'extension-tests' target also uses the
[extension]-src layers, by copying the /ext-src/ dirs from all
the extensions together into one image.
This refactoring came about when I was experimenting with different
ways of splitting up the Dockerfile so that each extension would be in
a separate file. That's not part of this PR yet, but this is a good
step in modularizing the extensions.
## Problem
Image layer generation could block L0 compactions for a long time.
## Summary of changes
* Refactored the return value of `create_image_layers_for_*` functions
to make it self-explainable.
* Preempt image layer generation in `Try` mode if L0 piles up.
Note that we might potentially run into a state that only the beginning
part of the keyspace gets image coverage. In that case, we either need
to implement something to prioritize some keyspaces with image coverage,
or tune the image_creation_threshold to ensure that the frequency of
image creation could keep up with L0 compaction.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Erik Grinaker <erik@neon.tech>
## Problem
We don't currently have good alerts for critical errors, e.g. data
loss/corruption.
Touches #10094.
## Summary of changes
Add a `critical!` macro and corresponding
`libmetrics_tracing_event_count{level="critical"}` metric. This will:
* Emit an `ERROR` log message with prefix `"CRITICAL:"` and a backtrace.
* Increment `libmetrics_tracing_event_count{level="critical"}`, and
indirectly `level="error"`.
* Trigger a pageable alert (via the metric above).
* In debug builds, panic the process.
I'll add uses of the macro separately.
## Problem
I noticed when onboarding lots of tenants that the AZ scheduling
violation stat was climbing, before falling later as optimisations
happened. This was happening because we first add the tenant with
PlacementPolicy::Secondary, and then later go to
PlacementPolicy::Attached, and the scheduler's behavior led to a bad AZ
choice:
1. Create a secondary location in the non-preferred AZ
2. Upgrade to Attached where we promote that non-preferred-AZ location
to attached and then create another secondary
3. Optimiser later realises we're in the wrong AZ and moves us
## Summary of changes
- Extend some logging to give more information about AZs
- When scheduling secondary location in PlacementPolicy::Secondary,
select it as if we were attached: in this mode, our business goal is to
have a warm pageserver location that we can make available as attached
quickly if needed, therefore we want it to be in the preferred AZ.
- Make optimize_secondary logic the same, so that it will consider a
secondary location in the preferred AZ to be optimal when in
PlacementPolicy::Secondary
- When transitioning to from PlacementPolicy::Attached(N) to
PlacementPolicy::Secondary, instead of arbitrarily picking a location to
keep, prefer to keep the location in the preferred AZ
We encountered some TLS validation errors for the storcon since applying
#10614. Add an option to downgrade them to logged errors instead to
allow us to debug with more peace.
cc issue https://github.com/neondatabase/cloud/issues/23583
## Problem
* The behavior of this flag changed. Plus, it's not necessary to disable
the IP check as long as there are no IPs listed in the local postgres.
## Summary of changes
* Drop the flag from the command in the README.md section.
* Change the postgres URL passed to proxy to not use the endpoint
hostname.
* Also swap postgres creation and proxy startup, so the DB is running
when proxy comes up.
## Problem
Tests for logical replication (on Staging) have been failing for some
time because logical replication is not enabled for them. This issue
occurred after switching to an org API key with a different default
setting, where logical replication was not enabled by default.
## Summary of changes
- Add `enable_logical_replication` input to
`actions/neon-project-create`
- Enable logical replication in `test-logical-replication` job
Logging errors with the debug format specifier causes multi-line errors,
which are sometimes a pain to deal with. Instead, we should use anyhow's
alternate display format, which shows the same information on a single
line.
Also adjusted a couple of error messages that were stale.
Fixesneondatabase/cloud#14710.
## Problem
1. First of all it's more correct
2. Current usage allows ` Time-of-Check-Time-of-Use (TOCTOU) 'Pwn
Request' vulnerabilities`. Please check security slack channel or reach
me for more details. I will update PR description after merge.
## Summary of changes
1. Use `actions/checkout` with `ref: ${{
github.event.pull_request.head.sha }}`
Discovered by and Co-author: @varunsh-coder
Fixes flaky test_lr_with_slow_safekeeper test #10242
Fix query to `pg_catalog.pg_stat_subscription` catalog to handle table
synchronization and parallel LR correctly.
Successor of #10280 after it was reverted in #10592.
Re-introduce the usage of diesel-async again, but now also add TLS
support so that we connect to the storcon database using TLS. By
default, diesel-async doesn't support TLS, so add some code to make us
explicitly request TLS.
cc https://github.com/neondatabase/cloud/issues/23583
## Problem
While working on https://github.com/neondatabase/neon/pull/10617 I
(unintentionally) merged the PR before the main CI pipeline has
finished.
I suspect this happens because we have received all the required job
results from the pre-merge-checks workflow, which runs on PRs that
include changes to relevant files.
## Summary of changes
- Skip the `conclusion` job in `pre-merge-checks` workflows for PRs
## Problem
This test would sometimes emit unexpected logs from the storage
controller's requests to do migrations, which overlap with the test's
restarts of pageservers, where those migrations are happening some time
after a shard split as the controller moves load around.
Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10602/13067323736/index.html#testresult/f66f1329557a1fc5/retries
## Summary of changes
- Do a reconcile_until_idle after shard split, so that the rest of the
test doesn't run concurrently with migrations
In the safekeeper, we block deletions on the timeline's gate closing,
and any `WalResidentTimeline` keeps the gate open (because it owns a
gate lock object). Thus, unless the `main_task` function of a partial
backup doesn't return, we can't delete the associated timeline.
In order to make these tasks exit early, we call the cancellation token
of the timeline upon its shutdown. However, the partial backup task
wasn't looking for the cancellation while waiting to acquire a partial
backup permit.
On a staging safekeeper we have been in a situation in the past where
the semaphore was already empty for a duration of many hours, rendering
all attempted deletions unable to proceed until a restart where the
semaphore was reset:
https://neondb.slack.com/archives/C03H1K0PGKH/p1738416586442029
## Problem
when introducing pg17 for job step `Generate matrix for OLAP benchmarks`
I introduced a syntax error that only hits on Saturdays.
## Summary of changes
Remove trailing comma
## successful test run
https://github.com/neondatabase/neon/actions/runs/13086363907
- Wired up filtering on VPC endpoints
- Wired up block access from public internet / VPC depending on per
project flag
- Added cache invalidation for VPC endpoints (partially based on PR from
Raphael)
- Removed BackendIpAllowlist trait
---------
Co-authored-by: Ivan Efremov <ivan@neon.tech>
We forked copy_bidirectional to solve some issues like fast-shutdown
(disallowing half-open connections) and to introduce better error
tracking (which side of the conn closed down).
A change recently made its way upstream offering performance
improvements: https://github.com/tokio-rs/tokio/pull/6532. These seem
applicable to our fork, thus it makes sense to apply them here as well.
The primary benefit is that all the ad hoc get_matches() calls are no
longer necessary. Now all it takes to get at the CLI arguments is
referencing a struct member. It's also great the we can replace the ad
hoc CLI struct we had with this more formal solution.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Merge Queue fails if changes include Rust code.
## Summary of changes
- Fix condition for `build-build-tools-image`
- Add a couple of no-op `false ||` to make predicates look
symmetric
## Problem
When a client dropped before a request completed, and a handler returned
an ApiError, we would log that at error severity. That was excessive in
the case of a request erroring on a shutdown, and could cause test
flakes.
example:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/13067651123/index.html#suites/ad9c266207b45eafe19909d1020dd987/6021ce86a0d72ae7/
```
Cancelled request finished with an error: ShuttingDown
```
## Summary of changes
- Log a different info-level on ShuttingDown and ResourceUnavailable API
errors from cancelled requests
## Problem
This assertion is incorrect: it is legal to see another shard's data at
this point, after a shard split.
Closes: https://github.com/neondatabase/neon/issues/10609
## Summary of changes
- Remove faulty assertion
## Problem
There are two (related) problems with the previous handling of
`cargo-deny`:
- When a new advisory is added to rustsec that affects a dependency,
unrelated pull requests will fail.
- New advisories rely on pushes or PRs to be surfaced. Problems that
already exist on main will only be found if we try to merge new things
into main.
## Summary of changes
We split out `cargo-deny` into a separate workflow that runs on all PRs
that touch `Cargo.lock`, and on a schedule on `main`, `release`,
`release-compute` and `release-proxy` to find new advisories.
Update to a rebased version of our rust-postgres patches, rebased on
[this](98f5a11bc0)
commit this time.
With #10280 reapplied, this means that the rust-postgres crates will be
deduplicated, as the new crate versions are finally compatible with the
requirements of diesel-async.
Earlier update: #10561
rust-postgres PR: https://github.com/neondatabase/rust-postgres/pull/39
## Problem
We want to check that `diesel print-schema` doesn't generate any changes
(`storage_controller/src/schema.rs`) in comparison with the list of
migration.
## Summary of changes
- Add `diesel_cli` to `build-tools` image
- Add `Check diesel schema` step to `build-neon` job, at this stage we
have all required binaries, so don't need to compile anything
additionally
- Check runs only on x86 release builds to be sure we do it at least
once per CI run.
Ref: https://github.com/neondatabase/cloud/issues/23461
## Problem
Just made changes around and see these 2 base layers could be optimised.
and after review comment from @myrrc setting up timeouts and retries in
`alpine/curl` image
## Summary of changes
## Problem
This test may not fully detect data corruption during splits, since we
don't force-compact the entire keyspace.
## Summary of changes
Force-compact all data in `test_sharding_autosplit`.
## Problem
If offloading races with normal shutdown, we get a "failed to freeze and
flush: cannot flush frozen layers when flush_loop is not running, state
is Exited". This is harmless but points to it being quite strange to try
and freeze and flush such a timeline. flushing on shutdown for an
archived timeline isn't useful.
Related: https://github.com/neondatabase/neon/issues/10389
## Summary of changes
- During Timeline::shutdown, ignore ShutdownMode::FreezeAndFlush if the
timeline is archived
## Problem
test_scrubber_tenant_snapshot restarts pageservers, but log validation
fails tests on any non white listed storcon warnings, making the test
flaky.
## Summary of changes
Allow warns like
2025-01-29T12:37:42.622179Z WARN reconciler{seq=1
tenant_id=2011077aea9b4e8a60e8e8a19407634c shard_id=0004}: Call to node
2 (localhost:15352) management API failed, will retry (attempt 1):
receive body: error sending request for url
(http://localhost:15352/v1/tenant/2011077aea9b4e8a60e8e8a19407634c-0004/location_config):
client error (Connect)
ref https://github.com/neondatabase/neon/issues/10462
Update `tokio` base crates and their deps. Pin `tokio` to at least 1.41
which stabilized task ID APIs.
To dedup `mio` dep the `notify` crate is updated. It's used in
`compute_tools`.
9f81828429/compute_tools/src/pg_helpers.rs (L258-L367)
## Problem
Because dashmap 6 switched to hashbrown RawTable API, it required us to
use unsafe code in the upgrade:
https://github.com/neondatabase/neon/pull/8107
## Summary of changes
Switch to clashmap, a fork maintained by me which removes much of the
unsafe and ultimately switches to HashTable instead of RawTable to
remove much of the unsafe requirement on us.
The test runs this query:
select count(*), sum(data::bigint)::bigint from t
to validate the test results between each part of the test. It performs
a simple sequential scan and aggregation, but was taking an order of
magnitude longer on v17 than on previous Postgres versions, which
sometimes caused the test to time out. There were two reasons for that:
1. On v17, the planner estimates the table to have only only one row. In
reality it has 305790 rows, and older versions estimated it at 611580,
which is not too bad given that the table has not been analyzed so the
planner bases that estimate just on the number of pages and the widths
of the datatypes. The new estimate of 1 row is much worse, and it leads
the planner to disregard parallel plans, whereas on older versions you
got a Parallel Seq Scan.
I tracked this down to upstream commit 29cf61ade3, "Consider fillfactor
when estimating relation size". With that commit,
table_block_relation_estimate_size() function calculates that each page
accommodates less than 1 row when the fillfactor is taken into account,
which rounds down to 0. In reality, the executor will always place at
least one row on a page regardless of fillfactor, but the new estimation
formula doesn't take that into account.
I reported this to pgsql-hackers
(https://www.postgresql.org/message-id/2bf9d973-7789-4937-a7ca-0af9fb49c71e%40iki.fi),
we don't need to do anything more about it in neon. It's OK to not use
parallel scans here; once issue 2. below is addressed, the queries are
fast enough without parallelism..
2. On v17, prefetching was not happening for the sequential scan. That's
because starting with v17, buffers are reserved in the shared buffer
cache before prefetching is initiated, and we use a tiny
shared_buffers=1MB setting in the tests. The prefetching is effectively
disabled with such a small shared_buffers setting, to protect the system
from completely starving out of buffers.
To address that, simply bump up shared_buffers in the test.
This patch addresses the second issue, which is enough to fix the
problem.
## Problem
In https://github.com/neondatabase/neon/pull/10438 it was pointed out
that it would be good to avoid picking tenants in ID order, and also to
avoid situations where we might double-select the same tenant.
There was an initial swing at this in
https://github.com/neondatabase/neon/pull/10443, where Chi suggested a
simpler approach which is done in this PR
## Summary of changes
- Split total set of tenants into in and out of home AZ
- Consume out of home AZ first, and if necessary shuffle + consume from
out of home AZ
## Problem
In https://github.com/neondatabase/neon/pull/10438 I had got the
function for picking tenants backwards, and it was preferring to move
things _away_ from their preferred AZ.
## Summary of changes
- Fix condition in `is_attached_outside_preferred_az`
## Problem
Timeline bootstrap starts a flush loop, but doesn't reliably shut down
the timeline (incl. waiting for flush loop to exit) before destroying
UninitializedTimeline, and that destructor tries to clean up local
storage. If local storage is still being written to, then this is
unsound.
Currently the symptom is that we see a "Directory not empty" error log,
e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/main/12966756686/index.html#testresult/5523f7d15f46f7f7/retries
## Summary of changes
- Move fallible IO part of bootstrap into a function (notably, this is
fallible in the case of the tenant being shut down while creation is
happening)
- When that function returns an error, call shutdown() on the timeline
## Problem
The test asserts that it completes at least 10 full timeline lifecycles,
but the noisy CI environment sometimes doesn't meet that goal.
Related: https://github.com/neondatabase/neon/issues/10389
## Summary of changes
- Sleep for longer between pageserver restarts, so that the timeline
workers have more chance to make progress
- Sleep for shorter between retries from timeline worker, so that they
have better chance to get in while a pageserver is up between restarts
- Relax the success condition to complete at least 5 iterations instead
of 10
## Problem
for OLAP benchmarks we need specific connstr secrets with different
database names for each job step
This is a follow-up for https://github.com/neondatabase/neon/pull/10536
In previous PR we used a common GitHub secret for a shared re-use
project that has 4 databases: neondb, tpch, clickbench and userexamples.
[Failure
example](https://neon-github-public-dev.s3.amazonaws.com/reports/main/13044872855/index.html#suites/54d0af6f403f1d8611e8894c2e07d023/fc029330265e9f6e/):
```log
# /tmp/neon/pg_install/v17/bin/psql user=neondb_owner dbname=neondb host=ep-broad-brook-w2luwzzv.us-east-2.aws.neon.build sslmode=require options='-cstatement_timeout=0 ' -c -- $ID$
-- TPC-H/TPC-R Pricing Summary Report Query (Q1)
-- Functional Query Definition
-- Approved February 1998
...
ERROR: relation "lineitem" does not exist
```
## Summary of changes
We need dedicated GitHub secrets and dedicated connection strings for
each of the use cases.
## Test run
https://github.com/neondatabase/neon/actions/runs/13053968231
It took me ages to figure out why it was failing on my laptop. What I
saw was that when the test makes the 'import_pgdata' in the pageserver,
the pageserver actually performs a regular 'bootstrap' timeline creation
by running initdb, with no importing. It boiled down to the json request
that the test uses:
```
{
"new_timeline_id": str(timeline_id),
"import_pgdata": {
"idempotency_key": str(idempotency),
"location": {"LocalFs": {"path": str(importbucket.absolute())}},
},
},
```
and how serde deserializes into rust structs. The 'LocalFs' enum variant
in `models.rs` is gated on the 'testing' cargo feature. On a non-testing
build, that got deserialized into the default Bootstrap enum variant, as
a valid TimelineCreateRequestModeImportPgdata variant could not be
formed.
PS. IMHO we should get rid of the testing feature, compile in all the
functionality, and have a runtime flag to disable anything dangeorous.
With that, you would've gotten a nice "feature only enabled in testing
mode" error in this case, or the test would've simply worked. But that's
another story.
## Problem
close https://github.com/neondatabase/neon/issues/10482
## Summary of changes
Add an extra lock on the read path to protect against races. The read
path has an implication that only certain kind of compactions can be
performed. Garbage keys must first have an image layer covering the
range, and then being gc-ed -- they cannot be done in one operation. An
alternative to fix this is to move the layers read guard to be acquired
at the beginning of `get_vectored_reconstruct_data_timeline`, but that
was intentionally optimized out and I don't want to regress.
The race is not limited to image layers. Gc-compaction will consolidate
deltas automatically and produce a flat delta layer (i.e., when we have
retain_lsns below the gc-horizon). The same race would also cause
behaviors like getting an un-replayable key history as in
https://github.com/neondatabase/neon/issues/10049.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
In some cases, we were returning a very shallow error like `error
sending request for url (XXX)`, which made it very hard to figure out
the actual error.
## Summary of changes
Use `{:?}` in a few places, and remove it from places where we were
printing a string anyway.
Related to #10308, we might have legitimate changes in file size or
generation. Those changes should not cause warn log lines.
In order to detect changes of the generation number while the file size
stayed the same, load the metadata that we store on disk on loading of
the timeline.
Still do a comparison with the on-disk layer sizes to find any
discrepancies that might occur due to race conditions (new metadata file
gets written but layer file has not been updated yet, and PS shuts
down). However, as it's possible to hit it in a race conditon, downgrade
it to a warning.
Also fix a mistake in #10529: we want to compare the old with the new
metadata, not the old metadata with itself.
## Problem
The client code for `tenant-set-preferred-az` declared response type
`()`, so printed a spurious error on each use:
```
Error: receive body: error decoding response body: invalid type: map, expected unit at line 1 column 0
```
The requests were successful anyway.
## Summary of changes
- Declare the proper return type, so that the command succeeds quietly.
## Problem
We don't have per-timeline observability for read amplification.
Touches https://github.com/neondatabase/cloud/issues/23283.
## Summary of changes
Add a per-timeline `pageserver_layers_per_read` histogram.
NB: per-timeline histograms are expensive, but probably worth it in this
case.
## Problem
We suspect that Postgres checkpoints will limit the number of page
deltas necessary to reconstruct a page, but don't know for certain.
Touches https://github.com/neondatabase/cloud/issues/23283.
## Summary of changes
Add `pageserver_deltas_per_read_global` metric.
This pairs with `pageserver_layers_per_read_global` from #10573.
## Problem
The current global `pageserver_layers_visited_per_vectored_read_global`
metric does not appear to accurately measure read amplification. It
divides the layer count by the number of reads in a batch, but this
means that e.g. 10 reads with 100 L0 layers will only measure a read amp
of 10 per read, while the actual read amp was 100.
While the cost of layer visits are amortized across the batch, and some
layers may not intersect with a given key, each visited layer
contributes directly to the observed latency for every read in the
batch, which is what we care about.
Touches https://github.com/neondatabase/cloud/issues/23283.
Extracted from #10566.
## Summary of changes
* Count the number of layers visited towards each read in the batch,
instead of the average across the batch.
* Rename `pageserver_layers_visited_per_vectored_read_global` to
`pageserver_layers_per_read_global`.
* Reduce the read amp log warning threshold down from 512 to 100.
## Problem
Follow up of https://github.com/neondatabase/neon/pull/10550 in case the
upper limit is set larger than threshold. It does not make sense for
someone to enforce the behavior like "if there are >= 50 L0s, only
compact 10 of them".
## Summary of changes
Use the maximum of compaction threshold and upper limit when selecting
L0 files to compact.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Repartition is slow, but it's only used in image layer creation. We can
skip it if we have a lot of L0 layers to ingest.
## Summary of changes
If L0 compaction is not complete, do not repartition and do not create
image layers.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We don't have good observability for per-timeline compaction debt,
specifically the number of delta layers in the frozen, L0, and L1
levels.
Touches https://github.com/neondatabase/cloud/issues/23283.
## Summary of changes
* Add a `level` label for `pageserver_layer_{count,size}` with values
`l0`, `l1`, and `frozen`.
* Track metrics for frozen layers.
There is already a `kind={delta,image}` label. `kind=image` is only
possible for `level=l1`.
We don't include the currently open ephemeral layer, only frozen layers.
There is always exactly 1 ephemeral layer, with a dynamic size which is
already tracked in `pageserver_timeline_ephemeral_bytes`.
## Problem
benchmarking.yml so far is only running benchmarks with PostgreSQL
version 16.
However neon recently changed the default for new customers to
PostgreSQL version 17.
See related [epic](https://github.com/neondatabase/cloud/issues/23295)
## Summary of changes
We do not want to run every job step with both pg 16 and 17 because this
would need excessive resources (runners, computes) and extend the
benchmarking run wall clock time too much.
So we select an opinionated subset of testcases that we also report in
weekly reporting and add a postgres v17 job step.
For re-use projects associated Neon projects have been created and
connection strings have been added to neon database organization
secrets.
A follow up is to add the reporting for these new runs to some grafana
dashboards.
## Problem
1. d04d924 added separate metrics for total requests and failures
separately, but it doesn't make much sense. We could just have a unified
counter with `http_status`.
2. `test_compute_migrations_retry` had a race, i.e., it was waiting for
the last successful migration, not an actual failure. This was revealed
after adding an assert on failure metric in d04d924.
## Summary of changes
1. Switch to unified counters for `compute_ctl` requests.
2. Add a waiting loop into `test_compute_migrations_retry` to eliminate
the race.
Part of neondatabase/cloud#17590
## Problem
If we are GC-ing because a new image layer was added while traversing
the timeline, then it will remove layers that are required for
fulfilling the current get request (read-path cannot "look back" and
notice the new image layer).
## Summary of Changes
Prevent GC from progressing on the current timeline while it is being
visited for a read.
Epic: https://github.com/neondatabase/neon/issues/9376
Luckily they were the same version, so we didn't spend time compiling
two versions, which could have been the case in the future.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
The approach of having CancelMap as an in-memory structure increases
code complexity,
as well as putting additional load for Redis streams.
## Summary of changes
- Implement a set of KV ops for Redis client;
- Remove cancel notifications code;
- Send KV ops over the bounded channel to the handling background task
for removing and adding the cancel keys.
Closes#9660
## Problem
We have to test the extensions, shipped with Neon for compatibility
before the upgrade.
## Summary of changes
Added the test for compatibility with the upgraded extensions.
## Problem
Follow-up of the incident, we should not use the same bound on
lower/upper limit of compaction files. This patch adds an upper bound
limit, which is set to 50 for now.
## Summary of changes
Add `compaction_upper_limit`.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
Ref: https://github.com/neondatabase/cloud/issues/23314
We suspect some inconsistency in Benchmark tests runs could be due to
different type of runners they are landed in.
To have that aligned in both terms: failure rates and benchmark results,
lets run them for now on `small-metal` servers and see the progress for
the tests stability.
## Summary of changes
## Problem
There are several parts of `compute_ctl` with a very low visibility of
errors:
1. DB migrations that run async in the background after compute start.
2. Requests made to control plane (currently only `GetSpec`).
3. Requests made to the remote extensions server.
## Summary of changes
Add new counters to quickly evaluate the amount of errors among the
fleet.
Part of neondatabase/cloud#17590
## Problem
https://github.com/neondatabase/neon/pull/10448 removed release notes,
because if their generation failed, the whole release was failing.
People liked them though, and wanted some basic release notes as a
fall-back instead of completely removing them.
## Summary of changes
Include basic release notes that link to the release PR and to a diff to
the previous release.
## Problem
We've seen the ingest connection manager get stuck shortly after a
migration.
## Summary of changes
A speculative mitigation is to use the same mechanism as get page
requests for kicking LSN ingest. The connection manager monitors
LSN waits and queries the broker if no updates are received for the
timeline.
Closes https://github.com/neondatabase/neon/issues/10351
## Problem
We need a setting to disable the flush upload wait, to test L0 flush
backpressure in staging.
## Summary of changes
Add `l0_flush_wait_upload` setting.
## Problem
The request data and usage metrics S3 requests use the same identifier
shown in logs, causing confusion about what type of upload failed.
## Summary of changes
Use the correct identifier for usage metrics uploads.
neondatabase/cloud#23084
Only a few things that needed updating:
- async_trait was removed
- Message::Text takes a Utf8Bytes object instead of a String
Signed-off-by: Tristan Partin <tristan@neon.tech>
Co-authored-by: Conrad Ludgate <connor@neon.tech>
In #10308, we noticed many warnings about the local layer having
different sizes on-disk compared to the metadata.
However, the layer downloader would never redownload layer files if the
sizes or generation numbers change. This is obviously a bug, which we
aim to fix with this PR.
This change also moves the code deciding what to do about a layer to a
dedicated function: before we handled the "routing" via control flow,
but now it's become too complicated and it is nicer to have the
different verdicts for a layer spelled out in a list/match.
This reverts commit 9e55d79803.
We'll still need this until we can tune L0 flush backpressure and
compaction. I'll add a setting to disable this separately.
## Problem
This one is fairly embarrassing. Safekeeper node id was used in the
pageserver application name
when connecting to safekeepers.
## Summary of changes
Use the right node id.
Closes https://github.com/neondatabase/neon/issues/10461
We want to verify if pageserver stripe size has an impact on ingest
performance.
We want to verify if ingest performance has improved or regressed with
postgres version 17.
## Summary of changes
- Allow to create new project with different postgres versions
- allow to pre-shard new project with different stripe sizes instead of
relying on storage manager to shard_split the project once a threshold
is exceeded
Replaces https://github.com/neondatabase/neon/pull/10509
Test run https://github.com/neondatabase/neon/actions/runs/12986410381
We now don't need libpq any more for the build of the storage
controller, as we use `diesel-async` since #10280. Therefore, we remove
the env var that gave cargo/rustc the location for libpq.
Follow-up of #10280
During broker deploys, pageservers log this noisy WARN en masse.
I can trivially reproduce the WARN message in neon_local by SIGKILLing
broker during e.g. `pgbench -i`.
I don't understand why tonic is not detecting the error as
`Code::Unavailable`.
Until we find time to understand that / fix upstream, this PR adds the
error message to the existing list of known error messages that get
demoted to INFO level.
Refs:
- refs https://github.com/neondatabase/neon/issues/9562
## Problem
We were logging a warning after a single request timeout, while listing
objects.
Closes: https://github.com/neondatabase/neon/issues/10166
## Summary of changes
- These timeouts are a pretty normal part of life, so back it off to
only log a warning after two in a row.
## Problem
From time to time, folks discover our `control_plane/` folder and make
the (reasonable) mistake of thinking it's a tool for running full-sized
Neon systems, whereas in reality it is a tool for dev/test.
## Summary of changes
- Change control_plane's readme title to "Local Development Control
Plane (`neon_local`)`
- Change "Running local installation" to "Running a local development
environment" in the main readme
## Problem
The intent of this parameter is to have pageservers consider themselves
"full" if they've got lots of shards, even if they have plenty of
capacity. It works, but because we typically successfully oversubscribe
capacity up to 200%, the MAX_SHARDS limit is effectively doubled, so
this 20,000 value ends up meaning 40,000, whereas the original intent
was to limit nodes to ~10000 shards.
## Summary of changes
- Change MAX_SHARDS to 5000, so that a node with 5000 will get a 100%
utilization, which is equivalent in practice to being considered "half
full" by the storage controller in capacity terms.
This is all a bit subtle and indiret. Originally the limit was baked
into the pageserver with the idea that the pageserver knows better what
its own resources tolerate than the storage controller does, but in
practice it would be probably be easier to understand all this if we
just did it controller-side. So there's scope to refactor here in
future.
Switches the storcon away from using diesel's synchronous APIs in favour
of `diesel-async`.
Advantages:
* less C dependencies, especially no openssl, which might be behind the
bug: https://github.com/neondatabase/cloud/issues/21010
* Better to only have async than mix of async plus `spawn_blocking`
We had to turn off usage of the connection pool for migrations, as
diesel migrations don't support async APIs. Thus we still use
`spawn_blocking` in that one place. But this is explicitly done in one
of the `diesel-async` examples.
## Problem
In ingest benchmarks, we see L0 compaction delays of over 10 minutes due
to image compaction. We can't stall L0 flushes for that long.
## Summary of changes
Disable L0 flush stalls, and bump the default L0 flush delay threshold
from 20 to 30 L0 layers.
## Problem
If compaction fails, we disable L0 flush stalls to avoid persistent
stalls. However, the logic would unset the failure marker on offload
failures or shutdown. This can lead to sudden L0 flush stalls if we try
and fail to offload a timeline with compaction failures, or if there is
some kind of shutdown race.
Touches #10405.
## Summary of changes
Don't touch the compaction failure marker on offload failures or
shutdown.
## Problem
After talking about it again with @bayandin again this should replace
the changes from https://github.com/neondatabase/neon/pull/10475. While
the previous changes worked, they are less visually clear in what
happens, and we might end up in a situation where we update `latest`,
but don't actually have the tagged image pushed that contains the same
changes. The latter would result in potentially hard to debug
situations.
## Summary of changes
Revert c283aaaf8d and make
promote-images-prod depend on promote-images-dev instead.
## Problem
The containers' log output is mixed with the tests' output, so you must
scroll up to find the error.
## Summary of changes
Printing of containers' logs moved to a separate step.
Note: this has to merge after the release is cut on `2025-01-17` for
compat tests to start passing.
## Problem
SK wal reader fan-out is not enabled in tests by default.
## Summary of changes
Enable it.
## Problem
There is no direct backpressure for compaction and L0 read
amplification. This allows a large buildup of compaction debt and read
amplification.
Resolves#5415.
Requires #10402.
## Summary of changes
Delay layer flushes based on the number of level 0 delta layers:
* `l0_flush_delay_threshold`: delay flushes such that they take 2x as
long (default `2 * compaction_threshold`).
* `l0_flush_stall_threshold`: stall flushes until level 0 delta layers
drop below threshold (default `4 * compaction_threshold`).
If either threshold is reached, ephemeral layer rolls also synchronously
wait for layer flushes to propagate this backpressure up into WAL
ingestion. This will bound the number of frozen layers to 1 once
backpressure kicks in, since all other frozen layers must flush before
the rolled layer.
## Analysis
This will significantly change the compute backpressure characteristics.
Recall the three compute backpressure knobs:
* `max_replication_write_lag`: 500 MB (based on Pageserver
`last_received_lsn`).
* `max_replication_flush_lag`: 10 GB (based on Pageserver
`disk_consistent_lsn`).
* `max_replication_apply_lag`: disabled (based on Pageserver
`remote_consistent_lsn`).
Previously, the Pageserver would keep ingesting WAL and build up
ephemeral layers and L0 layers until the compute hit
`max_replication_flush_lag` at 10 GB and began backpressuring. Now, once
we delay/stall WAL ingestion, the compute will begin backpressuring
after `max_replication_write_lag`, i.e. 500 MB. This is probably a good
thing (we're not building up a ton of compaction debt), but we should
consider tuning these settings.
`max_replication_flush_lag` probably doesn't serve a purpose anymore,
and we should consider removing it.
Furthermore, the removal of the upload barrier in #10402 will mean that
we no longer backpressure flushes based on S3 uploads, since
`max_replication_apply_lag` is disabled. We should consider enabling
this as well.
### When and what do we compact?
Default compaction settings:
* `compaction_threshold`: 10 L0 delta layers.
* `compaction_period`: 20 seconds (between each compaction loop check).
* `checkpoint_distance`: 256 MB (size of L0 delta layers).
* `l0_flush_delay_threshold`: 20 L0 delta layers.
* `l0_flush_stall_threshold`: 40 L0 delta layers.
Compaction characteristics:
* Minimum compaction volume: 10 layers * 256 MB = 2.5 GB.
* Additional compaction volume (assuming 128 MB/s WAL): 128 MB/s * 20
seconds = 2.5 GB (10 L0 layers).
* Required compaction bandwidth: 5.0 GB / 20 seconds = 256 MB/s.
### When do we hit `max_replication_write_lag`?
Depending on how fast compaction and flushes happens, the compute will
backpressure somewhere between `l0_flush_delay_threshold` or
`l0_flush_stall_threshold` + `max_replication_write_lag`.
* Minimum compute backpressure lag: 20 layers * 256 MB + 500 MB = 5.6 GB
* Maximum compute backpressure lag: 40 layers * 256 MB + 500 MB = 10.0
GB
This seems like a reasonable range to me.
This reapplies #10135. Just removing this flush backpressure without
further mitigations caused read amp increases during bulk ingestion
(predictably), so it was reverted. We will replace it by
compaction-based backpressure.
## Problem
In #8550, we made the flush loop wait for uploads after every layer.
This was to avoid unbounded buildup of uploads, and to reduce compaction
debt. However, the approach has several problems:
* It prevents upload parallelism.
* It prevents flush and upload pipelining.
* It slows down ingestion even when there is no need to backpressure.
* It does not directly backpressure based on compaction debt and read
amplification.
We will instead implement compaction-based backpressure in a PR
immediately following this removal (#5415).
Touches #5415.
Touches #10095.
## Summary of changes
Remove waiting on the upload queue in the flush loop.
## Problem
If gc-compaction decides to rewrite an image layer, it will now cause
index_part to lose reference to that layer. In details,
* Assume there's only one image layer of key 0000...AAAA at LSN 0x100
and generation 0xA in the system.
* gc-compaction kicks in at gc-horizon 0x100, and then produce
0000...AAAA at LSN 0x100 and generation 0xB.
* It submits a compaction result update into the index part that unlinks
0000-AAAA-100-A and adds 0000-AAAA-100-B
On the remote storage / local disk side, this is fine -- it unlinks
things correctly and uploads the new file. However, the
`index_part.json` itself doesn't record generations. The buggy procedure
is as follows:
1. upload the new file
2. update the index part to remove the old file and add the new file
3. remove the new file
Therefore, the correct update result process for gc-compaction should be
as follows:
* When modifying the layer map, delete the old one and upload the new
one.
* When updating the index, uploading the new one in the index without
deleting the old one.
## Summary of changes
* Modify `finish_gc_compaction` to correctly order insertions and
deletions.
* Update the way gc-compaction uploads the layer files.
* Add new tests.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
We've finally transitioned to using a separate `release-compute` branch.
Now, we can finally automatically create release PRs on Fri and release
them during the following week.
Part of neondatabase/cloud#11698
Sometimes, especially when the host running the tests is overloaded, we
can run into reconcile timeouts in
`test_timeline_ancestor_detach_idempotent_success`, making the test
flaky. By increasing the timeouts from 30 seconds to 120 seconds, we can
address the flakiness.
Fixes#10464
## Problem
Currently, the report does not contain the LFC state of the failed
tests.
## Summary of changes
Added the LFC state to the link to the allure report.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Drop logical replication subscribers
before compute starts on a non-main branch.
Add new compute_ctl spec flag: drop_subscriptions_before_start
If it is set, drop all the subscriptions from the compute node
before it starts.
To avoid race on compute start, use new GUC
neon.disable_logical_replication_subscribers
to temporarily disable logical replication workers until we drop the
subscriptions.
Ensure that we drop subscriptions exactly once when endpoint starts on a
new branch.
It is essential, because otherwise, we may drop not only inherited, but
newly created subscriptions.
We cannot rely only on spec.drop_subscriptions_before_start flag,
because if for some reason compute restarts inside VM,
it will start again with the same spec and flag value.
To handle this, we save the fact of the operation in the database
in the neon.drop_subscriptions_done table.
If the table does not exist, we assume that the operation was never
performed, so we must do it.
If table exists, we check if the operation was performed on the current
timeline.
fixes: https://github.com/neondatabase/neon/issues/8790
## Problem
Not really a bug fix, but hopefully can reproduce
https://github.com/neondatabase/neon/issues/10482 more.
If the layer map does not contain layers that end at exactly the end
range of the compaction job, the current split algorithm will produce
the last job that ends at the maximum layer key. This patch extends it
all the way to the compaction job end key.
For example, the user requests a compaction of 0000...FFFF. However, we
only have a layer 0000..3000 in the layer map, and the split job will
have a range of 0000..3000 instead of 0000..FFFF.
This is not a correctness issue but it would be better to fix it so that
we can get consistent job splits.
## Summary of changes
Compaction job split will always cover the full specified key range.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
PR #10457 was supposed to fix the flakiness of
`test_scrubber_physical_gc_ancestors`, but instead it made it even more
flaky. However, the original error causes disappeared, now to be
replaced by key not found errors.
See this for a longer explanation:
https://github.com/neondatabase/neon/issues/10391#issuecomment-2608018967
## Solution
This does one churn rows after all compactions, and before we do any
timeline gc's. That way, we remain more accessible at older lsn's.
## Problem
The trust connection to the compute required for `pg_anon` was removed.
However, the PGPASSWORD environment variable was not added to
`docker-compose.yml`.
This caused connection errors, which were interpreted as success due to
errors in the bash script.
## Summary of changes
The environment variable was added, and the logic in the bash script was
fixed.
## Problem
https://github.com/neondatabase/neon/actions/runs/12896686483/job/35961290336#step:5:107
showed that `promote-images-prod` was missing another dependency.
## Summary of changes
Modify `promote-images-prod` to tag based on docker-hub images, so that
`promote-images-prod` does not rely on `promote-images-dev`. The result
should be the exact same, but allows the two jobs to run in parallel.
## Refs
- Epic: https://github.com/neondatabase/neon/issues/9378
Co-authored-by: Vlad Lazar <vlad@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
The read path does its IOs sequentially.
This means that if N values need to be read to reconstruct a page,
we will do N IOs and getpage latency is `O(N*IoLatency)`.
## Solution
With this PR we gain the ability to issue IO concurrently within one
layer visit **and** to move on to the next layer without waiting for IOs
from the previous visit to complete.
This is an evolved version of the work done at the Lisbon hackathon,
cf https://github.com/neondatabase/neon/pull/9002.
## Design
### `will_init` now sourced from disk btree index keys
On the algorithmic level, the only change is that the
`get_values_reconstruct_data`
now sources `will_init` from the disk btree index key (which is
PS-page_cache'd), instead
of from the `Value`, which is only available after the IO completes.
### Concurrent IOs, Submission & Completion
To separate IO submission from waiting for its completion, while
simultaneously
feature-gating the change, we introduce the notion of an `IoConcurrency`
struct
through which IO futures are "spawned".
An IO is an opaque future, and waiting for completions is handled
through
`tokio::sync::oneshot` channels.
The oneshot Receiver's take the place of the `img` and `records` fields
inside `VectoredValueReconstructState`.
When we're done visiting all the layers and submitting all the IOs along
the way
we concurrently `collect_pending_ios` for each value, which means
for each value there is a future that awaits all the oneshot receivers
and then calls into walredo to reconstruct the page image.
Walredo is now invoked concurrently for each value instead of
sequentially.
Walredo itself remains unchanged.
The spawned IO futures are driven to completion by a sidecar tokio task
that
is separate from the task that performs all the layer visiting and
spawning of IOs.
That tasks receives the IO futures via an unbounded mpsc channel and
drives them to completion inside a `FuturedUnordered`.
(The behavior from before this PR is available through
`IoConcurrency::Sequential`,
which awaits the IO futures in place, without "spawning" or "submitting"
them
anywhere.)
#### Alternatives Explored
A few words on the rationale behind having a sidecar *task* and what
alternatives were considered.
One option is to queue up all IO futures in a FuturesUnordered that is
polled
the first time when we `collect_pending_ios`.
Firstly, the IO futures are opaque, compiler-generated futures that need
to be polled at least once to submit their IO. "At least once" because
tokio-epoll-uring may not be able to submit the IO to the kernel on
first
poll right away.
Second, there are deadlocks if we don't drive the IO futures to
completion
independently of the spawning task.
The reason is that both the IO futures and the spawning task may hold
some
_and_ try to acquire _more_ shared limited resources.
For example, both spawning task and IO future may try to acquire
* a VirtualFile file descriptor cache slot async mutex (observed during
impl)
* a tokio-epoll-uring submission slot (observed during impl)
* a PageCache slot (currently this is not the case but we may move more
code into the IO futures in the future)
Another option is to spawn a short-lived `tokio::task` for each IO
future.
We implemented and benchmarked it during development, but found little
throughput improvement and moderate mean & tail latency degradation.
Concerns about pressure on the tokio scheduler made us discard this
variant.
The sidecar task could be obsoleted if the IOs were not arbitrary code
but a well-defined struct.
However,
1. the opaque futures approach taken in this PR allows leaving the
existing
code unchanged, which
2. allows us to implement the `IoConcurrency::Sequential` mode for
feature-gating
the change.
Once the new mode sidecar task implementation is rolled out everywhere,
and `::Sequential` removed, we can think about a descriptive submission
& completion interface.
The problems around deadlocks pointed out earlier will need to be solved
then.
For example, we could eliminate VirtualFile file descriptor cache and
tokio-epoll-uring slots.
The latter has been drafted in
https://github.com/neondatabase/tokio-epoll-uring/pull/63.
See the lengthy doc comment on `spawn_io()` for more details.
### Error handling
There are two error classes during reconstruct data retrieval:
* traversal errors: index lookup, move to next layer, and the like
* value read IO errors
A traversal error fails the entire get_vectored request, as before this
PR.
A value read error only fails that value.
In any case, we preserve the existing behavior that once
`get_vectored` returns, all IOs are done. Panics and failing
to poll `get_vectored` to completion will leave the IOs dangling,
which is safe but shouldn't happen, and so, a rate-limited
log statement will be emitted at warning level.
There is a doc comment on `collect_pending_ios` giving more code-level
details and rationale.
### Feature Gating
The new behavior is opt-in via pageserver config.
The `Sequential` mode is the default.
The only significant change in `Sequential` mode compared to before
this PR is the buffering of results in the `oneshot`s.
## Code-Level Changes
Prep work:
* Make `GateGuard` clonable.
Core Feature:
* Traversal code: track `will_init` in `BlobMeta` and source it from
the Delta/Image/InMemory layer index, instead of determining `will_init`
after we've read the value. This avoids having to read the value to
determine whether traversal can stop.
* Introduce `IoConcurrency` & its sidecar task.
* `IoConcurrency` is the clonable handle.
* It connects to the sidecar task via an `mpsc`.
* Plumb through `IoConcurrency` from high level code to the
individual layer implementations' `get_values_reconstruct_data`.
We piggy-back on the `ValuesReconstructState` for this.
* The sidecar task should be long-lived, so, `IoConcurrency` needs
to be rooted up "high" in the call stack.
* Roots as of this PR:
* `page_service`: outside of pagestream loop
* `create_image_layers`: when it is called
* `basebackup`(only auxfiles + replorigin + SLRU segments)
* Code with no roots that uses `IoConcurrency::sequential`
* any `Timeline::get` call
* `collect_keyspace` is a good example
* follow-up: https://github.com/neondatabase/neon/issues/10460
* `TimelineAdaptor` code used by the compaction simulator, unused in
practive
* `ingest_xlog_dbase_create`
* Transform Delta/Image/InMemoryLayer to
* do their values IO in a distinct `async {}` block
* extend the residence of the Delta/Image layer until the IO is done
* buffer their results in a `oneshot` channel instead of straight
in `ValuesReconstructState`
* the `oneshot` channel is wrapped in `OnDiskValueIo` /
`OnDiskValueIoWaiter`
types that aid in expressiveness and are used to keep track of
in-flight IOs so we can print warnings if we leave them dangling.
* Change `ValuesReconstructState` to hold the receiving end of the
`oneshot` channel aka `OnDiskValueIoWaiter`.
* Change `get_vectored_impl` to `collect_pending_ios` and issue walredo
concurrently, in a `FuturesUnordered`.
Testing / Benchmarking:
* Support queue-depth in pagebench for manual benchmarkinng.
* Add test suite support for setting concurrency mode ps config
field via a) an env var and b) via NeonEnvBuilder.
* Hacky helper to have sidecar-based IoConcurrency in tests.
This will be cleaned up later.
More benchmarking will happen post-merge in nightly benchmarks, plus in
staging/pre-prod.
Some intermediate helpers for manual benchmarking have been preserved in
https://github.com/neondatabase/neon/pull/10466 and will be landed in
later PRs.
(L0 layer stack generator!)
Drive-By:
* test suite actually didn't enable batching by default because
`config.compatibility_neon_binpath` is always Truthy in our CI
environment
=> https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
* initial logical size calculation wasn't always polled to completion,
which was
surfaced through the added WARN logs emitted when dropping a
`ValuesReconstructState` that still has inflight IOs.
* remove the timing histograms
`pageserver_getpage_get_reconstruct_data_seconds`
and `pageserver_getpage_reconstruct_seconds` because with planning,
value read
IO, and walredo happening concurrently, one can no longer attribute
latency
to any one of them; we'll revisit this when Vlad's work on
tracing/sampling
through RequestContext lands.
* remove code related to `get_cached_lsn()`.
The logic around this has been dead at runtime for a long time,
ever since the removal of the materialized page cache in #8105.
## Testing
Unit tests use the sidecar task by default and run both modes in CI.
Python regression tests and benchmarks also use the sidecar task by
default.
We'll test more in staging and possibly preprod.
# Future Work
Please refer to the parent epic for the full plan.
The next step will be to fold the plumbing of IoConcurrency
into RequestContext so that the function signatures get cleaned up.
Once `Sequential` isn't used anymore, we can take the next
big leap which is replacing the opaque IOs with structs
that have well-defined semantics.
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
Both these versions are binary compatible, but the way pgvector
structures the SQL files forbids installing 0.7.4 if you have a 0.8.0
distribution. Yet, some users may need a previous version for backward
compatibility, e.g., restoring the dump.
See this thread for discussion
https://neondb.slack.com/archives/C04DGM6SMTM/p1735911490242919?thread_ts=1731343604.259169&cid=C04DGM6SMTM
## Summary of changes
Put `vector--0.7.4.sql` file into compute image to allow installing this
version as well.
Tested on staging and it seems to be working as expected:
```sql
select * from pg_available_extensions where name = 'vector';
name | default_version | installed_version | comment
--------+-----------------+-------------------+------------------------------------------------------
vector | 0.8.0 | (null) | vector data type and ivfflat and hnsw access methods
create extension vector version '0.7.4';
select * from pg_available_extensions where name = 'vector';
name | default_version | installed_version | comment
--------+-----------------+-------------------+------------------------------------------------------
vector | 0.8.0 | 0.7.4 | vector data type and ivfflat and hnsw access methods
alter extension vector update;
select * from pg_available_extensions where name = 'vector';
name | default_version | installed_version | comment
--------+-----------------+-------------------+------------------------------------------------------
vector | 0.8.0 | 0.8.0 | vector data type and ivfflat and hnsw access methods
drop extension vector;
create extension vector;
select * from pg_available_extensions where name = 'vector';
name | default_version | installed_version | comment
--------+-----------------+-------------------+------------------------------------------------------
vector | 0.8.0 | 0.8.0 | vector data type and ivfflat and hnsw access methods
```
If we find out it's a good approach, we can adopt the same for other
extensions with a stable ABI -- support both `current` and `current - 1`
releases.
# Refs
- extracted from https://github.com/neondatabase/neon/pull/9353
# Problem
Before this PR, when task_mgr shutdown is signalled, e.g. during
pageserver shutdown or Tenant shutdown, initial logical size calculation
stops polling and drops the future that represents the calculation.
This is against the current policy that we poll all futures to
completion.
This became apparent during development of concurrent IO which warns if
we drop a `Timeline::get_vectored` future that still has in-flight IOs.
We may revise the policy in the future, but, right now initial logical
size calculation is the only part of the codebase that doesn't adhere to
the policy, so let's fix it.
## Code Changes
- make sensitive exclusively to `Timeline::cancel`
- This should be sufficient for all cases of shutdowns; the sensitivity
to task_mgr shutdown is unnecessary.
- this broke the various cancel tests in `test_timeline_size.py`, e.g.,
`test_timeline_initial_logical_size_calculation_cancellation`
- the tests would time out because the await point was not sensitive to
cancellation
- to fix this, refactor `pausable_failpoint` so that it accepts a
cancellation token
- side note: we _really_ should write our own failpoint library; maybe
after we get heap-allocated RequestContext, we can plumb failpoints
through there.
## Problem
PR #9993 was supposed to enable `page_service_pipelining` by default for
all `NeonEnv`s, but this was ineffective in our CI environment.
Thus, CI Python-based tests and benchmarks, unless explicitly
configuring pipelining, were still using serial protocol handling.
## Analysis
The root cause was that in our CI environment,
`config.compatibility_neon_binpath` is always Truthy.
It's not in local environments, which is why this slipped through in
local testing.
Lesson: always add a log line ot pageserver startup and spot-check tests
to ensure the intended default is picked up.
## Summary of changes
Fix it. Since enough time has passed, the compatiblity snapshot contains
a recent enough software version so we don't need to worry about
`compatibility_neon_binpath` anymore.
## Future Work
The question how to add a new default except for compatibliity tests,
which is what the broken code was supposed to do, is still unsolved.
Slack discussion:
https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
## Problem
Currently, the layer flush loop will continue flushing layers as long as
any are pending, and only notify waiters once there are no further
layers to flush. This can cause waiters to wait longer than necessary,
and potentially starve them if pending layers keep arriving faster than
they can be flushed. The impact of this will increase when we add
compaction backpressure and propagate it up into the WAL receiver.
Extracted from #10405.
## Summary of changes
Break out of the layer flush loop once we've flushed up to the requested
LSN. If further flush requests have arrived in the meanwhile, flushing
will resume immediately after.
## Problem
For compaction backpressure, we need a mechanism to signal when
compaction has reduced the L0 delta layer count below the backpressure
threshold.
Extracted from #10405.
## Summary of changes
Add `LayerMap::watch_level0_deltas()` which returns a
`tokio::sync:⌚:Receiver` signalling the current L0 delta layer
count.
## Problem
It's sometimes useful to obtain the elapsed duration from a
`StorageTimeMetricsTimer` for purposes beyond just recording it in
metrics (e.g. to log it).
Extracted from #10405.
## Summary of changes
Add `StorageTimeMetricsTimer.elapsed()` and return the duration from
`stop_and_record()`.
## Problem
part of https://github.com/neondatabase/neon/issues/9114
The automatic trigger is already implemented at
https://github.com/neondatabase/neon/pull/10221 but I need to write some
tests and finish my experiments in staging before I can merge it with
confidence. Given that I have some other patches that will modify the
config items, I'd like to get the config items merged first to reduce
conflicts.
## Summary of changes
* add `l2_lsn` to index_part.json -- below that LSN, data have been
processed by gc-compaction
* add a set of gc-compaction auto trigger control items into the config
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We are removing the `pg_anon` v1 extension from Neon. So we don't need
to test it anymore and can remove the code for simplicity.
## Summary of changes
The code required for testing `pg_anon` is removed.
We did not have any tests on fast_import binary yet.
In this PR I have introduced:
- `FastImport` class and tools for testing in python
- basic test that runs fast import against vanilla postgres and checks
that data is there
Should be merged after https://github.com/neondatabase/neon/pull/10251
We currently have some flakiness in
`test_scrubber_physical_gc_ancestors`, see #10391.
The first flakiness kind is about the reconciler not actually becoming
idle within the timeout of 30 seconds. We see continuous forward
progress so this is likely not a hang. We also see this happen in
parallel to a test failure, so is likely due to runners being
overloaded. Therefore, we increase the timeout.
The second flakiness kind is an assertion failure. This one is a little
bit more tricky, but we saw in the successful run that there was some
advance of the lsn between the compaction ran (which created layer
files) and the gc run. Apparently gc rejects reductions to the single
image layer setting if the cutoff lsn is the same as the lsn of the
image layer: it will claim that that layer is newer than the space
cutoff and therefore skip it, while thinking the old layer (that we want
to delete) is the latest one (so it's not deleted).
We address the second flakiness kind by inserting a tiny amount of WAL
between the compaction and gc. This should hopefully fix things.
Related issue: #10391
(not closing it with the merger of the PR as we'll need to validate that
these changes had the intended effect).
Thanks to Chi for going over this together with me in a call.
## Problem
When releasing `release-7574`, the Github Release creation failed with
"body is too long" (see
https://github.com/neondatabase/neon/actions/runs/12834025431/job/35792346745#step:5:77).
There's lots of room for improvement of the release notes, but for now
we'll disable them instead.
## Summary of changes
- Disable automatic generation of release notes for Github releases
- Enable creation of Github releases for proxy/compute
This simplifies the code in `pageserver_physical_gc` a little bit after
the feedback in #10007 that the code is too complicated.
Most importantly, we don't pass around `GcSummary` any more in a
complicated fashion, and we save on async stream-combinator-inception in
one place in favour of `try_stream!{}`.
Follow-up of #10007
Otherwise we might hit ERRORs in otherwise safe situations (such as user
queries), which isn't a great user experience.
## Problem
https://github.com/neondatabase/neon/pull/10376
## Summary of changes
Instead of accepting internal errors as acceptable, we ensure we don't
exceed our allocated usage.
## Refs
- fixes https://github.com/neondatabase/neon/issues/10444
## Problem
We're seeing a panic `handles are only shut down once in their lifetime`
in our performance testbed.
## Hypothesis
Annotated code in
https://github.com/neondatabase/neon/issues/10444#issuecomment-2602286415.
```
T1: drop Cache, executes up to (1)
=> HandleInner is now in state ShutDown
T2: Timeline::shutdown => PerTimelineState::shutdown executes shutdown() again => panics
```
Likely this snuck in the final touches of #10386 where I narrowed down
the locking rules.
## Summary of changes
Make duplicate shutdowns a no-op.
## Summary
Whereas currently we send all WAL to all pageserver shards, and each
shard filters out the data that it needs,
in this RFC we add a mechanism to filter the WAL on the safekeeper, so
that each shard receives
only the data it needs.
This will place some extra CPU load on the safekeepers, in exchange for
reducing the network bandwidth
for ingesting WAL back to scaling as O(1) with shard count, rather than
O(N_shards).
Touches #9329.
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Vlad Lazar <vlalazar.vlad@gmail.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
The other test focus on the external interface usage while the tests
added in this PR add some testing around HandleInner's lifecycle,
ensuring we don't leak it once either connection gets dropped or
per-timeline-state is shut down explicitly.
It was requested by review in #10305 to use an enum or something like it
for distinguishing the different modes instead of two parameters,
because two flags allow four combinations, and two of them don't really
make sense/ aren't used.
follow-up of #10305
## Problem
Since #9916 , the chaos code is actively fighting the optimizer: tenants
tend to be attached in their preferred AZ, so most chaos migrations were
moving them to a non-preferred AZ.
## Summary of changes
- When picking migrations, prefer to migrate things _toward_ their
preferred AZ when possible. Then pick shards to move the other way when
necessary.
The resulting behavior should be an alternating "back and forth" where
the chaos code migrates thiings away from home, and then migrates them
back on the next iteration.
The side effect will be that the chaos code actively helps to push
things into their home AZ. That's not contrary to its purpose though: we
mainly just want it to continuously migrate things to exercise
migration+notification code.
## Problem
Occasionally, we encounter bugs in test environments that can be
detected at the point of uploading an index, but we proceed to upload it
anyway and leave a tenant in a broken state that's awkward to handle.
## Summary of changes
- Validate index when submitting it for upload, so that we can see the
issue quickly e.g. in an API invoking compaction
- Validate index before executing the upload, so that we have a hard
enforcement that any code path that tries to upload an index will not
overwrite a valid index with an invalid one.
Add an endpoint to obtain the utilization of a safekeeper. Future
changes to the storage controller can use this endpoint to find the most
suitable safekeepers for newly created timelines, analogously to how
it's done for pageservers already.
Initially we just want to assign by timeline count, then we can iterate
from there.
Part of https://github.com/neondatabase/neon/issues/9011
## Problem
871e8b325f failed CI on main because a job
ran to soon. This was caused by
ea84ec357f. While `promote-images-dev`
does not inherently need `neon-image`, a few jobs depending on
`promote-images-dev` do need it, and previously had it when it was
`promote-images`, which depended on `test-images`, which in turn
depended on `neon-image`.
## Summary of changes
To ensure jobs depending `docker.io/neondatabase/neon` images get them,
`promote-images-dev` gets the dependency to `neon-image` back which it
previously had transitively through `test-images`.
Instead of generating our own request ID, we can just use the one
provided by the control plane. In the event, we get a request from a
client which doesn't set X-Request-ID, then we just generate one which
is useful for tracing purposes.
Signed-off-by: Tristan Partin <tristan@neon.tech>
# Refs
- fixes https://github.com/neondatabase/neon/issues/10309
- fixup of batching design, first introduced in
https://github.com/neondatabase/neon/pull/9851
- refinement of https://github.com/neondatabase/neon/pull/8339
# Problem
`Tenant::shutdown` was occasionally taking many minutes (sometimes up to
20) in staging and prod if the
`page_service_pipelining.mode="concurrent-futures"` is enabled.
# Symptoms
The issue happens during shard migration between pageservers.
There is page_service unavailability and hence effectively downtime for
customers in the following case:
1. The source (state `AttachedStale`) gets stuck in `Tenant::shutdown`,
waiting for the gate to close.
2. Cplane/Storcon decides to transition the target `AttachedMulti` to
`AttachedSingle`.
3. That transition comes with a bump of the generation number, causing
the `PUT .../location_config` endpoint to do a full `Tenant::shutdown` /
`Tenant::attach` cycle for the target location.
4. That `Tenant::shutdown` on the target gets stuck, waiting for the
gate to close.
5. Eventually the gate closes (`close completed`), correlating with a
`page_service` connection handler logging that it's exiting because of a
network error (`Connection reset by peer` or `Broken pipe`).
While in (4):
- `Tenant::shutdown` is stuck waiting for all `Timeline::shutdown` calls
to complete.
So, really, this is a `Timeline::shutdown` bug.
- retries from Cplane/Storcon to complete above state transitions, fail
with errors related to the tenant mgr slot being in state
`TenantSlot::InProgress`, the tenant state being
`TenantState::Stopping`, and the timelines being in
`TimelineState::Stopping`, and the `Timeline::cancel` being cancelled.
- Existing (and/or new?) page_service connections log errors `error
reading relation or page version: Not found: Timed out waiting 30s for
tenant active state. Latest state: None`
# Root-Cause
After a lengthy investigation ([internal
write-up](https://www.notion.so/neondatabase/2025-01-09-batching-deadlock-Slow-Log-Analysis-in-Staging-176f189e00478050bc21c1a072157ca4?pvs=4))
I arrived at the following root cause.
The `spsc_fold` channel (`batch_tx`/`batch_rx`) that connects the
Batcher and Executor stages of the pipelined mode was storing a `Handle`
and thus `GateGuard` of the Timeline that was not shutting down.
The design assumption with pipelining was that this would always be a
short transient state.
However, that was incorrect: the Executor was stuck on writing/flushing
an earlier response into the connection to the client, i.e., socket
write being slow because of TCP backpressure.
The probable scenario of how we end up in that case:
1. Compute backend process sends a continuous stream of getpage prefetch
requests into the connection, but never reads the responses (why this
happens: see Appendix section).
2. Batch N is processed by Batcher and Executor, up to the point where
Executor starts flushing the response.
3. Batch N+1 is procssed by Batcher and queued in the `spsc_fold`.
4. Executor is still waiting for batch N flush to finish.
5. Batcher eventually hits the `TimeoutReader` error (10min).
From here on it waits on the
`spsc_fold.send(Err(QueryError(TimeoutReader_error)))`
which will never finish because the batch already inside the `spsc_fold`
is not
being read by the Executor, because the Executor is still stuck in the
flush.
(This state is not observable at our default `info` log level)
6. Eventually, Compute backend process is killed (`close()` on the
socket) or Compute as a whole gets killed (probably no clean TCP
shutdown happening in that case).
7. Eventually, Pageserver TCP stack learns about (6) through RST packets
and the Executor's flush() call fails with an error.
8. The Executor exits, dropping `cancel_batcher` and its end of the
spsc_fold.
This wakes Batcher, causing the `spsc_fold.send` to fail.
Batcher exits.
The pipeline shuts down as intended.
We return from `process_query` and log the `Connection reset by peer` or
`Broken pipe` error.
The following diagram visualizes the wait-for graph at (5)
```mermaid
flowchart TD
Batcher --spsc_fold.send(TimeoutReader_error)--> Executor
Executor --flush batch N responses--> socket.write_end
socket.write_end --wait for TCP window to move forward--> Compute
```
# Analysis
By holding the GateGuard inside the `spsc_fold` open, the pipelining
implementation
violated the principle established in
(https://github.com/neondatabase/neon/pull/8339).
That is, that `Handle`s must only be held across an await point if that
await point
is sensitive to the `<Handle as Deref<Target=Timeline>>::cancel` token.
In this case, we were holding the Handle inside the `spsc_fold` while
awaiting the
`pgb_writer.flush()` future.
One may jump to the conclusion that we should simply peek into the
spsc_fold to get
that Timeline cancel token and be sensitive to it during flush, then.
But that violates another principle of the design from
https://github.com/neondatabase/neon/pull/8339.
That is, that the page_service connection lifecycle and the Timeline
lifecycles must be completely decoupled.
Tt must be possible to shut down one shard without shutting down the
page_service connection, because on that single connection we might be
serving other shards attached to this pageserver.
(The current compute client opens separate connections per shard, but,
there are plans to change that.)
# Solution
This PR adds a `handle::WeakHandle` struct that does _not_ hold the
timeline gate open.
It must be `upgrade()`d to get a `handle::Handle`.
That `handle::Handle` _does_ hold the timeline gate open.
The batch queued inside the `spsc_fold` only holds a `WeakHandle`.
We only upgrade it while calling into the various `handle_` methods,
i.e., while interacting with the `Timeline` via `<Handle as
Deref<Target=Timeline>>`.
All that code has always been required to be (and is!) sensitive to
`Timeline::cancel`, and therefore we're guaranteed to bail from it
quickly when `Timeline::shutdown` starts.
We will drop the `Handle` immediately, before we start
`pgb_writer.flush()`ing the responses.
Thereby letting go of our hold on the `GateGuard`, allowing the timeline
shutdown to complete while the page_service handler remains intact.
# Code Changes
* Reproducer & Regression Test
* Developed and proven to reproduce the issue in
https://github.com/neondatabase/neon/pull/10399
* Add a `Test` message to the pagestream protocol (`cfg(feature =
"testing")`).
* Drive-by minimal improvement to the parsing code, we now have a
`PagestreamFeMessageTag`.
* Refactor `pageserver/client` to allow sending and receiving
`page_service` requests independently.
* Add a Rust helper binary to produce situation (4) from above
* Rationale: (4) and (5) are the same bug class, we're holding a gate
open while `flush()`ing.
* Add a Python regression test that uses the helper binary to
demonstrate the problem.
* Fix
* Introduce and use `WeakHandle` as explained earlier.
* Replace the `shut_down` atomic with two enum states for `HandleInner`,
wrapped in a `Mutex`.
* To make `WeakHandle::upgrade()` and `Handle::downgrade()`
cache-efficient:
* Wrap the `Types::Timeline` in an `Arc`
* Wrap the `GateGuard` in an `Arc`
* The separate `Arc`s enable uncontended cloning of the timeline
reference in `upgrade()` and `downgrade()`.
If instead we were `Arc<Timeline>::clone`, different connection handlers
would be hitting the same cache line on every upgrade()/downgrade(),
causing contention.
* Please read the udpated module-level comment in `mod handle`
module-level comment for details.
# Testing & Performance
The reproducer test that failed before the changes now passes, and
obviously other tests are passing as well.
We'll do more testing in staging, where the issue happens every ~4h if
chaos migrations are enabled in storcon.
Existing perf testing will be sufficient, no perf degradation is
expected.
It's a few more alloctations due to the added Arc's, but, they're low
frequency.
# Appendix: Why Compute Sometimes Doesn't Read Responses
Remember, the whole problem surfaced because flush() was slow because
Compute was not reading responses. Why is that?
In short, the way the compute works, it only advances the page_service
protocol processing when it has an interest in data, i.e., when the
pagestore smgr is called to return pages.
Thus, if compute issues a bunch of requests as part of prefetch but then
it turns out it can service the query without reading those pages, it
may very well happen that these messages stay in the TCP until the next
smgr read happens, either in that session, or possibly in another
session.
If there’s too many unread responses in the TCP, the pageserver kernel
is going to backpressure into userspace, resulting in our stuck flush().
All of this stems from the way vanilla Postgres does prefetching and
"async IO":
it issues `fadvise()` to make the kernel do the IO in the background,
buffering results in the kernel page cache.
It then consumes the results through synchronous `read()` system calls,
which hopefully will be fast because of the `fadvise()`.
If it turns out that some / all of the prefetch results are not needed,
Postgres will not be issuing those `read()` system calls.
The kernel will eventually react to that by reusing page cache pages
that hold completed prefetched data.
Uncompleted prefetch requests may or may not be processed -- it's up to
the kernel.
In Neon, the smgr + Pageserver together take on the role of the kernel
in above paragraphs.
In the current implementation, all prefetches are sent as GetPage
requests to Pageserver.
The responses are only processed in the places where vanilla Postgres
would do the synchronous `read()` system call.
If we never get to that, the responses are queued inside the TCP
connection, which, once buffers run full, will backpressure into
Pageserver's sending code, i.e., the `pgb_writer.flush()` that was the
root cause of the problems we're fixing in this PR.
The extension now supports Postgres 17. The release also seems to be
binary compatible with the previous version.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
`test_storage_controller_node_deletion` sometimes failed because shards
were moving around during timeline creation, and neon_local isn't
tolerant of that. The movements were unexpected because the shards had
only just been created.
This was a regression from #9916Closes: #10383
## Summary of changes
- Make this test use multiple AZs -- this makes the storage controller's
scheduling reliably stable
Why this works: in #9916 , I made a simplifying assumption that we would
have multiple AZs to get nice stable scheduling -- it's much easier,
because each tenant has a well defined primary+secondary location when
they have an AZ preference and nodes have different AZs. Everything
still works if you don't have multiple AZs, but you just have this quirk
that sometimes the optimizer can disagree with initial scheduling, so
once in a while a shard moves after being created -- annoying for tests,
harmless IRL.
## Problem
All pageserver have the same application name which makes it hard to
distinguish them.
## Summary of changes
Include the node id in the application name sent to the safekeeper. This
should gives us
more visibility in logs. There's a few metrics that will increase in
cardinality by `pageserver_count`,
but that's fine.
## Problem
Node fills were limited to moving (total shards / node_count) shards. In
systems that aren't perfectly balanced already, that leads us to skip
migrating some of the shards that belong on this node, generating work
for the optimizer later to gradually move them back.
## Summary of changes
- Where a shard has a preferred AZ and is currently attached outside
this AZ, then always promote it during fill, irrespective of target fill
count
## Problem
We were comparing serialized configs from the database with serialized
configs from memory. If fields have been added/removed to TenantConfig,
this generates spurious consistency errors. This is fine in test
environments, but limits the usefulness of this debug API in the field.
Closes: https://github.com/neondatabase/neon/issues/10369
## Summary of changes
- Do a decode/encode cycle on the config before comparing it, so that it
will have exactly the expected fields.
## Problem
gc-compaction needs the partitioning data to decide the job split. This
refactor allows concurrent access/computing the partitioning.
## Summary of changes
Make `partitioning` an ArcSwap so that others can access the
partitioning while we compute it. Fully eliminate the `repartition is
called concurrently` warning when gc-compaction is going on.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Rename the safekeeper scheduling policy "disabled" to "pause".
A rename was requested in
https://github.com/neondatabase/neon/pull/10400#discussion_r1916259124,
as the "disabled" policy is meant to be analogous to the "pause" policy
for pageservers.
Also simplify the `SkSchedulingPolicyArg::from_str` function, relying on
the `from_str` implementation of `SkSchedulingPolicy`. Latter is used
for the database format as well, so it is quite stable. If we ever want
to change the UI, we'll need to duplicate the function again but this is
cheap.
## Problem
Threads spawned in `test_tenant_delete_races_timeline_creation` are not
joined before the test ends, and can generate
`PytestUnhandledThreadExceptionWarning` in other tests.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10419/12805365523/index.html#/testresult/53a72568acd04dbd
## Summary of changes
- Wrap threads in ThreadPoolExecutor which will join them before the
test ends
- Remove a spurious deletion call -- the background thread doing
deletion ought to succeed.
## Problem
When multiple changes are grouped in a merge group to be merged as part
of the merge queue, the changes might individually pass
`check-codestyle-rust` but not in their combined form.
## Summary of changes
- Move `check-codestyle-rust` into a reusable workflow that is called
from it's previous location in `build_and_test.yml`, and additionally
call it from `pre_merge_checks.yml`. The additional call does not run on
ARM, only x86, to ensure the merge queue continues being responsive.
- Trigger `pre_merge_checks.yml` on PRs that change any of the workflows
running in `pre_merge_checks.yml`, so that we get feedback on those
early an not only after trying to merge those changes.
This should fix the largest source of flakyness of
test_nbtree_pagesplit_cycleid.
## Problem
https://github.com/neondatabase/neon/issues/10390
## Summary of changes
By using a guaranteed-flushed LSN, we ensure that PS won't have to wait
forever.
(If it does wait forever, we know the issue can't be with Compute's WAL)
## Problem
As part of https://github.com/neondatabase/neon/issues/8614 we need to
pass options to START_WAL_PUSH.
## Summary of changes
Add two options. `allow_timeline_creation`, default true, disables
implicit timeline creation in the connection from compute. Eventually
such creation will be forbidden completely, but as we migrate to
configurations we need to support both: current mode and configurations
enabled where creation by compute is disabled.
`proto_version` specifies compute <-> sk protocol version. We have it
currently in the first greeting package also, but I plan to change tag
size from u64 to u8, which would make it hard to use. Command is more
appropriate place for it anyway.
This reduces pressure on the OS TCP read buffer by increasing the
moments we read data out of the receive buffer, and increasing the
number of bytes we can pull from that buffer when we do reads.
## Problem
A backend may not always consume its prefetch data quick enough
## Summary of changes
We add a new function `prefetch_pump_state` which pulls as many prefetch
requests from the OS TCP receive buffer as possible, but without
blocking.
This thus reduces pressure on OS-level TCP buffers, thus increasing
throughput by limiting throttling caused by full TCP buffers.
## Problem
part of https://github.com/neondatabase/neon/issues/9114
part of investigation of
https://github.com/neondatabase/neon/issues/10049
## Summary of changes
* If `cfg!(test) or cfg!(feature = testing)`, then we will always try
generating an image to ensure the history is replayable, but not put the
image layer into the final layer results, therefore discovering wrong
key history before we hit a read error.
* I suspect it's easier to trigger some races if gc-compaction is
continuously run on a timeline, so I increased the frequency to twice
per 10 churns.
* Also, create branches in gc-compaction smoke tests to get more test
coverage.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad@neon.tech>
## Problem
`postgres` is system database at neon, so we need to do `pg_restore`
into `neondb` instead
https://github.com/neondatabase/cloud/issues/22100
## Summary of changes
Changed fast_import a little bit:
1. After succesfull connection creating `neondb` in postgres instance
2. Changed restore connstring to use new db
3. Added optional `source_connection_string`, which allows to skip
`s3_prefix` and just connect directly.
4. Added `-i` that stops process until sigterm
## TODO
- [x] test image in cplane e2e
- [ ] Change import job image back to latest after this merged (partial
revert of https://github.com/neondatabase/cloud/pull/22338)
## Problem
When a pageserver is receiving high rates of requests, we don't have a
good way to efficiently discover what the client's access pattern is.
Closes: https://github.com/neondatabase/neon/issues/10275
## Summary of changes
- Add
`/v1/tenant/x/timeline/y/page_trace?size_limit_bytes=...&time_limit_secs=...`
API, which returns a binary buffer.
- Add `pagectl page-trace` tool to decode and analyze the output.
---------
Co-authored-by: Erik Grinaker <erik@neon.tech>
Implementing the last missing endpoint of #9981, this adds support to
set the scheduling policy of an individual safekeeper, as specified in
the RFC. However, unlike in the RFC we call the endpoint
`scheduling_policy` not `status`
Closes#9981.
As for why not use the upsert endpoint for this: we want to have the
safekeeper upsert endpoint be used for testing and for deploying new
safekeepers, but not for changes of the scheduling policy. We don't want
to change any of the other fields when marking a safekeeper as
decommissioned for example, so we'd have to first fetch them only to
then specify them again. Of course one can also design an endpoint where
one can omit any field and it doesn't get modified, but it's still not
great for observability to put everything into one big "change something
about this safekeeper" endpoint.
## Problem
Safekeepers currently decode and interpret WAL for each shard
separately.
This is wasteful in terms of CPU memory usage - we've seen this in
profiles.
## Summary of changes
Fan-out interpreted WAL to multiple shards.
The basic is that wal decoding and interpretation happens in a separate
tokio task and senders
attach to it. Senders only receive batches concerning their shard and
only past the Lsn they've last seen.
Fan-out is gated behind the `wal_reader_fanout` safekeeper flag
(disabled by default for now).
When fan-out is enabled, it might be desirable to control the absolute
delta between the
current position and a new shard's desired position (i.e. how far behind
or ahead a shard may be).
`max_delta_for_fanout` is a new optional safekeeper flag which dictates
whether to create a new
WAL reader or attach to the existing one. By default, this behaviour is
disabled. Let's consider enabling
it if we spot the need for it in the field.
## Testing
Tests passed [here](https://github.com/neondatabase/neon/pull/10301)
with wal reader fanout enabled
as of
34f6a71718.
Related: https://github.com/neondatabase/neon/issues/9337
Epic: https://github.com/neondatabase/neon/issues/9329
## Problem
https://github.com/neondatabase/neon/issues/9965
## Summary of changes
Add to safekeeper http endpoint to switch membership configuration. Also
add it to python client for tests, and add simple test itself.
## Problem
Successful `benchmarks` runs doesn't have enough visibility
Ref https://neondb.slack.com/archives/C069Z2199DL/p1736868055094539
## Summary of changes
- Report both successful and failed `benchmarks` to Slack
- Update `slackapi/slack-github-action` action
## Problem
Currently, we call `InterpretedWalRecord::from_bytes_filtered`
from each shard. To serve multiple shards at the same time,
the API needs to allow for enquiring about multiple shards.
## Summary of changes
This commit tweaks it a pretty brute force way. Naively, we could
just generate the shard for a key, but pre and post split shards
may be subscribed at the same time, so doing it efficiently is more
complex.
## Problem
https://github.com/neondatabase/neon/issues/9965
## Summary of changes
Add safekeeper membership configuration struct itself and storing it in
the control file. In passing also add creation timestamp to the control
file (there were cases where I wanted it in the past).
Remove obsolete unused PersistedPeerInfo struct from control file (still
keep it control_file_upgrade.rs to have it in old upgrade code).
Remove the binary representation of cfile in the roundtrip test.
Updating it is annoying, and we still test the actual roundtrip.
Also add configuration to timeline creation http request, currently used
only in one python test. In passing, slightly change LSNs meaning in the
request: normally start_lsn is passed (the same as ancestor_start_lsn in
similar pageserver call), but we allow specifying higher commit_lsn for
manual intervention if needed. Also when given LSN initialize
term_history with it.
## Problem
https://github.com/neondatabase/neon/pull/8455 wasn't specific enough on
migration from current situation to enabling generations.
## Summary of changes
Describe the missing parts, including control plane pushing generation
to compute, which also defines whether generations are enabled -- non
zero value does it.
## Problem
For large deployments, the `control/v1/tenant` listing API can time out
transmitting a monolithic serialized response.
## Summary of changes
- Add `limit` and `start_after` parameters to listing API
- Update storcon_cli to use these parameters and limit requests to 1000
items at a time
## Problem
With upload queue reordering in #10218, we can easily get into a
situation where multiple index uploads are queued back to back, which
can't be parallelized. This will happen e.g. when multiple layer flushes
enqueue layer/index/layer/index/... and the layers skip the queue and
are uploaded in parallel.
These index uploads will incur serial S3 roundtrip latencies, and may
block later operations.
Touches #10096.
## Summary of changes
When multiple back-to-back index uploads are ready to upload, only
upload the most recent index and drop the rest.
## Problem
The upload queue can currently schedule an arbitrary number of tasks.
This can both spawn an unbounded number of Tokio tasks, and also
significantly slow down upload queue scheduling as it's quadratic in
number of operations.
Touches #10096.
## Summary of changes
Limit the number of inprogress tasks to the remote storage upload
concurrency. While this concurrency limit is shared across all tenants,
there's certainly no point in scheduling more than this -- we could even
consider setting the limit lower, but don't for now to avoid
artificially constraining tenants.
By setting PATH in the 'pg-build' layer, all the extension build layers
will inherit. No need to pass PG_CONFIG to all the various make
invocations either: once pg_config is in PATH, the Makefiles will pick
it up from there.
## Problem
The upload queue currently sees significant head-of-line blocking. For
example, index uploads act as upload barriers, and for every layer flush
we schedule a layer and index upload, which effectively serializes layer
uploads.
Resolves#10096.
## Summary of changes
Allow upload queue operations to bypass the queue if they don't conflict
with preceding operations, increasing parallelism.
NB: the upload queue currently schedules an explicit barrier after every
layer flush as well (see #8550). This must be removed to enable
parallelism. This will require a better mechanism for compaction
backpressure, see e.g. #8390 or #5415.
## Problem
Since https://github.com/neondatabase/neon/pull/9916, the preferred AZ
of a tenant is much more impactful, and we would like to make it more
visible in tooling.
## Summary of changes
- Include AZ in node describe API
- Include AZ info in node & tenant outputs in CLI
- Add metrics for per-node shard counts, labelled by AZ
- Add a CLI for setting preferred AZ on a tenant
- Extend AZ-setting API+CLI to handle None for clearing preferred AZ
## Problem
Before this PR, the pagestream throttle was applied weighted on a
per-batch basis.
This had several problems:
1. The throttle occurence counters were only bumped by `1` instead of
`batch_size`.
2. The throttle wait time aggregator metric only counted one wait time,
irrespective
of `batch_size`. That makes sense in some ways of looking at it but not
in others.
3. If the last request in the batch runs into the throttle, the other
requests in the
batch are also throttled, i.e., over-throttling happens (theoretical,
didn't measure
it in practice).
## Solution
It occured to me that we can simply push the throttling upwards into
`pagestream_read_message`.
This has the added benefit that in pipeline mode, the `executor` stage
will, if it is idle,
steal whatever requests already made it into the `spsc_fold` and execute
them; before this
change, that was not the case - the throttling happened in the
`executor` stage instead of
the `batcher` stage.
## Code Changes
There are two changes in this PR:
1. Lifting up the throttling into the `pagestream_read_message` method.
2. Move the throttling metrics out of the `Throttle` type into
`SmgrOpMetrics`.
Unlike the other smgr metrics, throttling is per-tenant, hence the Arc.
3. Refactor the `SmgrOpTimer` implementation to account for the new
observation states,
and simplify its design.
4. Drive-by-fix flush time metrics. It was using the same `now` in the
`observe_guard` every time.
The `SmgrOpTimer` is now a state machine.
Each observation point moves the state machine forward.
If a timer object is dropped early some "pair"-like metrics still
require an increment or observation.
That's done in the Drop implementation, by driving the state machine to
completion.
## Problem
Discovered during the relation dir refactor work.
If we do not create images as in this patch, we would get two set of
image layers:
```
0000...METADATA_KEYS
0000...REL_KEYS
```
They overlap at the same LSN and would cause data loss for relation
keys. This doesn't happen in prod because initial image layer generation
is never called, but better to be fixed to avoid future issues with the
reldir refactors.
## Summary of changes
* Consolidate create_image_layers call into a single one.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Because of https://github.com/Azure/azure-sdk-for-rust/issues/1739, our
identity token file was not being refreshed. This caused our uploads to
start failing when the storage token expired.
## Summary of changes
Drop and recreate the remote storage config every time we upload in
order to force reload the identity token file.
## Problem
See https://github.com/neondatabase/neon/issues/10167
Too small number of `max_connections` (2) can cause failures of
test_physical_replication_config_mismatch_too_many_known_xids test
## Summary of changes
Increase `max_connections` to 5
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
We want to do a more robust job of scheduling tenants into their home
AZ: https://github.com/neondatabase/neon/issues/8264.
Closes: https://github.com/neondatabase/neon/issues/8969
## Summary of changes
### Scope
This PR combines prioritizing AZ with a larger rework of how we do
optimisation. The rationale is that just bumping AZ in the order of
Score attributes is a very tiny change: the interesting part is lining
up all the optimisation logic to respect this properly, which means
rewriting it to use the same scores as the scheduler, rather than the
fragile hand-crafted logic that we had before. Separating these changes
out is possible, but would involve doing two rounds of test updates
instead of one.
### Scheduling optimisation
`TenantShard`'s `optimize_attachment` and `optimize_secondary` methods
now both use the scheduler to pick a new "favourite" location. Then
there is some refined logic for whether + how to migrate to it:
- To decide if a new location is sufficiently "better", we generate
scores using some projected ScheduleContexts that exclude the shard
under consideration, so that we avoid migrating from a node with
AffinityScore(2) to a node with AffinityScore(1), only to migrate back
later.
- Score types get a `for_optimization` method so that when we compare
scores, we will only do an optimisation if the scores differ by their
highest-ranking attributes, not just because one pageserver is lower in
utilization. Eventually we _will_ want a mode that does this, but doing
it here would make scheduling logic unstable and harder to test, and to
do this correctly one needs to know the size of the tenant that one is
migrating.
- When we find a new attached location that we would like to move to, we
will create a new secondary location there, even if we already had one
on some other node. This handles the case where we have a home AZ A, and
want to migrate the attachment between pageservers in that AZ while
retaining a secondary location in some other AZ as well.
- A unit test is added for
https://github.com/neondatabase/neon/issues/8969, which is implicitly
fixed by reworking optimisation to use the same scheduling scores as
scheduling.
## Problem
In preparation to https://github.com/neondatabase/neon/issues/9516. We
need to store rel size and directory data in the sparse keyspace, but it
does not support inheritance yet.
## Summary of changes
Add a new type of keyspace "sparse but inherited" into the system.
On the read path: we don't remove the key range when we descend into the
ancestor. The search will stop when (1) the full key range is covered by
image layers (which has already been implemented before), or (2) we
reach the end of the ancestor chain.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Generally ed25519 seems to be much preferred for cryptographic strength
to P256 nowadays, and it is NIST approved finally. We should use it
where we can as it's also faster than p256.
This PR makes the re-signed JWTs between local_proxy and pg_session_jwt
use ed25519.
This does introduce a new dependency on ed25519, but I do recall some
Neon Authorise customers asking for support for ed25519, so I am
justifying this dependency addition in the context that we can then
introduce support for customer ed25519 keys
sources:
* https://csrc.nist.gov/pubs/fips/186-5/final subsection 7 (EdDSA)
* https://datatracker.ietf.org/doc/html/rfc8037#section-3.1
## Problem
Currently, if we want to move a secondary there isn't a neat way to do
that: we just have migration API for the attached location, and it is
only clean to use that if you've manually created a secondary via
pageserver API in the place you're going to move it to.
Secondary migration API enables:
- Moving the secondary somewhere because we would like to later move the
attached location there.
- Move the secondary location because we just want to reclaim some disk
space from its current location.
## Summary of changes
- Add `/migrate_secondary` API
- Add `tenant-shard-migrate-secondary` CLI
- Add tests for above
## Problem
We would sometimes fail to retry compute notifications:
1. Try and send, set compute_notify_failure if we can't
2. On next reconcile, reconcile() fails for some other reason (e.g.
tried to talk to an offline node), and we fail the `result.is_ok() &&
must_notify` condition around the re-sending.
Closes: https://github.com/neondatabase/cloud/issues/22612
## Summary of changes
- Clarify the meaning of the reconcile result: it should be Ok(()) if
configuring attached location worked, even if secondary or detach
locations cannot be reached.
- Skip trying to talk to secondaries if they're offline
- Even if reconcile fails and we can't send the compute notification (we
can't send it because we're not sure if it's really attached), make sure
we save the `compute_notify_failure` flag so that subsequent reconciler
runs will try again
- Add a regression test for the above
Taking continuous profiles every 20 seconds is likely too expensive (in
dollar terms). Let's try 60-second profiles. We can now interrupt
running profiles via `?force=true`, so this should be fine.
For testing the proxy's websockets support.
I wrote this to test https://github.com/neondatabase/neon/issues/3822.
Unfortunately, that bug can *not* be reproduced with this tunnel. The
bug only appears when the client pipelines the first query with the
authentication messages. The tunnel doesn't do that.
---
Update (@conradludgate 2025-01-10):
We have since added some websocket tests, but they manually implemented
a very simplistic setup of the postgres protocol. Introducing the tunnel
would make more complex testing simpler in the future.
---------
Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
## Problem
Limitations found while using this to investigate
https://github.com/neondatabase/neon/issues/10234:
- If we hit a node consistency issue, we drop out and don't check shards
for consistency
- The messages printed after a shard consistency issue are huge, and
grafana appears to drop them.
## Summary of changes
- Defer node consistency errors until the end of the function, so that
we always proceed to check shards for consistency
- Print out smaller log lines that just point out the diffs between
expected and persistent state
With a new beta build of the rust compiler, it's good to check out the
new lints. Either to find false positives, or find flaws in our code.
Additionally, it helps reduce the effort required to update to 1.85 in 6
weeks.
## Problem
It's only possible to take one CPU profile at a time. With Grafana
continuous profiling, a (low-frequency) CPU profile will always be
running, making it hard to take an ad hoc CPU profile at the same time.
Resolves#10072.
## Summary of changes
Add a `force` parameter for `/profile/cpu` which will end and return an
already running CPU profile, starting a new one for the current caller.
## Problem
Currently, the heap profiling frequency is every 1 MB allocated. Taking
a profile stack trace takes about 1 µs, and allocating 1 MB takes about
15 µs, so the overhead is about 6.7% which is a bit high. This is a
fixed cost regardless of whether heap profiles are actually accessed.
## Summary of changes
Increase the heap profiling sample frequency from 1 MB to 2 MB, which
reduces the overhead to about 3.3%. This seems acceptable, considering
performance-sensitive code will avoid allocations as far as possible
anyway.
There used to be some pg version dependencies in these extensions, but
now that there isn't, follow the simpler pattern used in other
extensions. No change in the produced images.
We don't need or want the `active` column. Remove it. Vlad pointed out
that this is safe.
Thanks to the separation of the schemata in earlier PRs, this is easy.
follow-up of #10205
Part of https://github.com/neondatabase/neon/issues/9981
When we moved throttling up from Timeline::get into page_service,
we stopped being sensitive to `Timeline::cancel`, even though we're
holding a Handle and thus a guard on the `Timeline::gate` open.
This PR rectifies the situation.
Refs
- Found while investigating #10309 (hung detach because gate kept open),
but not expected to be the root cause of that issue because the
affected tenants are not being throttled according to their metrics.
## Problem
https://github.com/neondatabase/infra/pull/2725 updated the scrubber to
use a non-host
port endpoint for storcon. That breaks when unwrapping the port.
## Summary of changes
Support both `host:port` and `host` formats for the storcon api.
## Problem
These two tests came up in #9537 as doing multi-gigabyte I/O, and from
inspection of the tests it doesn't seem like they need that to fulfil
their purpose.
## Summary of changes
- In test_local_file_cache_unlink, run fewer background threads with a
smaller number of rows. These background threads AFAICT exist to make
sure some I/O is going on while we unlink the LFC directory, but 5
threads should be enough for "some".
- In test_lfc_resize, tweak the test to validate that the cache size is
larger than the final size before resizing it, so that we're sure we're
writing enough data to really be doing something. Then decrease the
pgbench scale.
## Problem
When poetry v2 (released Jan 5) is used it needs `packaging.metadata`
module, but we downgrade `packaging` to 23.0. `packaging==23.1`
introduced the metadata submodule.
## Summary of changes
Update `packaging` to 24.2.
## Problem
This test writes ~5GB of data. It is not suitable to run in parallel
with all the other small tests in test_runner/regress.
via #9537
## Summary of changes
- Move test_parallel_copy into the performance directory, so that it
does not run in parallel with other tests
## Problem
I noticed in https://github.com/neondatabase/neon/pull/9537 that tests
which work with compat snapshots were writing several hundred MB of
data, which isn't really necessary.
Also, the snapshots are large but don't have the proper variety of
storage format features, e.g. they could just have L0 deltas.
## Summary of changes
- Use smaller scale factor and runtime to generate less data
- Configure a small layer size and use force image layer generation so
that our output contains L1 deltas and image layers, and has a decent
number of entries in the layer map
## Problem
Unlike CPU profiles, the `/profile/heap` endpoint can't automatically
generate SVG flamegraphs. This requires the user to install and use
`pprof` tooling, which is unnecessary and annoying.
Resolves#10203.
## Summary of changes
Add `format=svg` for the `/profile/heap` route, and generate an SVG
flamegraph using the `inferno` crate, similarly to what `pprof-rs`
already does for CPU profiles.
# Problem
Before this PR, there were cases where send() in state
SenderWaitsForReceiverToConsume would never be woken up
by the receiver, because it never registered with `wake_sender`.
Example Scenario 1: we stop polling a send() future A that was waiting
for the receiver to consume. We drop A and create a new send() future B.
B would return Poll::Pending and never regsister a waker.
Example Scenario 2: a send() future A transitions from HasData
to SenderWaitsForReceiverToConsume. This registers the context X
with `wake_sender`. But before the Receiver consumes the data,
we poll A from a different context Y.
The state is still SenderWaitsForReceiverToConsume, but we wouldn't
register the new context with `wake_sender`.
When the Receiver comes around to consume and `wake_sender.notify()`s,
it wakes the old context X instead of Y.
# Fix
Register the waker in the case where we're polled in
state `SenderWaitsForReceiverToConsume`.
# Relation to #10309
I found this bug while investigating #10309.
There was never proof that this bug here is the root cause for #10309.
In the meantime we found a more probably hypothesis
for the root cause than what is being fixed here.
Regardless, let's walk through my thought process about
how it might have been relevant:
There (in page_service), Scenario 1 does not apply because
we poll the send() future to completion.
Scenario 2 (`tokio::join!`) also does not apply with the
current `tokio::join!()` impl, because it will just poll each
future every time, each with the same context.
Although if we ever used something like a FuturesUnordered anywhere,
that will be using a different context, so, in that case,
the bug might materialize.
Regarding tokio & spurious poll in general:
@conradludgate is not aware of any spurious wakeup cases in current
tokio,
but within a `tokio::join!()`, any wake meant for one future will poll
all
the futures, so that can appear as a spurious wake up to the N-1 futures
of the `tokio::join!()`.
## Problem
We were incorrectly constructing the ComputeUserInfo, used for
cancellation checks, based on the return parameters from postgres. This
didn't contain the correct info.
## Summary of changes
Propagate down the existing ComputeUserInfo.
## Problem
When the proxy receives a `Notification` with an unknown topic it's
supposed to use the `UnknownTopic` unit variant. Unfortunately, in
adjacently tagged enums serde will not simply ignore the configured
content if found and try to deserialize a map/object instead.
## Summary of changes
* Use a custom deserialize function to ignore variant content.
* Add a little unit test covering both cases.
## Problem
Auto-offloading as requested by the compaction task is racy with
unarchival, in that the compaction task might attempt to offload an
unarchived timeline. By that point it will already have set the timeline
to the `Stopping` state however, which makes it unusable for any
purpose. For example:
1. compaction task decides to offload timeline
2. timeline gets unarchived
3. `offload_timeline` gets called by compaction task
* sets timeline's state to `Stopping`
* realizes that the timeline can't be unarchived, errors out
6. endpoint can't be started as the timeline is `Stopping` and thus
'can't be found'.
A future iteration of the compaction task can't "heal" this state either
as the timeline will still not be archived, same goes for other
automatic stuff. The only way to heal this is a tenant detach+attach, or
alternatively a pageserver restart.
Furthermore, the compaction task is especially amenable for such races
as it first stores `can_offload` into a variable, figures out whether
compaction is needed (which takes some time), and only then does it
attempt an offload operation: the time difference between "check" and
"use" is non-trivially small.
To make it even worse, we start the compaction task right after attach
of a tenant, and it is a common pattern by pageserver users to attach a
tenant to then immediately unarchive a timeline, so that an endpoint can
be started.
## Solutions not adopted
The simplest solution is to move the `can_offload` check to right before
attempting of the offload. But this is not a good solution, as no lock
is held between that check and timeline shutdown. So races would still
be possible, just become less likely.
I explored using the timeline state for this, as in adding an additional
enum variant. But `Timeline::set_state` is racy (#10297).
## Adopted solution
We use the lock on the timeline's upload queue as an arbiter: either
unarchival gets to it first and sours the state for auto-offloading, or
auto-offloading shuts it down, which stops any parallel unarchival in
its tracks. The key part is not releasing the upload queue's lock
between the check whether the timeline is archived or not, and shutting
it down (the actual implementation only sets `shutting_down` but it has
the same effect on `initialized_mut()` as a full shutdown). The rest of
the patch is stuff that follows from this.
We also move the part where we set the state to `Stopping` to after that
arbiter has decided the fate of the timeline. For deletions, we do keep
it inside `DeleteTimelineFlow::prepare` however, so that it is called
with all of the the timelines locks held that the function allocates
(timelines lock most importantly). This is only a precautionary measure
however, as I didn't want to analyze deletion related code for possible
races.
## Future changes
It might make sense to move `can_offload` to right before the offload
attempt. Maybe some other properties might have changed as well.
Although this will not be perfect either as no lock is held. I want to
keep it out of this change to emphasize that this move wasn't the main
reason we are race free now.
Fixes#10220
This is a refactor to create better abstractions related to our
management server. It cleans up the code, and prepares everything for
authorized communication to and from the control plane.
Signed-off-by: Tristan Partin <tristan@neon.tech>
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.
[Release notes](https://releases.rs/docs/1.84.0/).
Prior update was in #9926.
## Problem
In Postgres, one cannot drop a role if it has any dependent objects in
the DB. In `compute_ctl`, we automatically reassign all dependent
objects in every DB to the corresponding DB owner. Yet, it seems that it
doesn't help with some implicit permissions. The issue is reproduced by
installing a `postgis` extension because it creates some views and
tables in the public schema.
## Summary of changes
Added a repro test without using a `postgis`: i) create a role via
`compute_ctl` (with `neon_superuser` grant); ii) create a test role, a
table in schema public, and grant permissions via the role in
`neon_superuser`.
To fix the issue, I added a new `compute_ctl` code that removes such
dangling permissions before dropping the role. It's done in the least
invasive way, i.e., only touches the schema public, because i) that's
the problem we had with PostGIS; ii) it creates a smaller chance of
messing anything up and getting a stuck operation again, just for a
different reason.
Properly, any API-based catalog operations should fail gracefully and
provide an actionable error and status code to the control plane,
allowing the latter to unwind the operation and propagate an error
message and hint to the user. In this sense, it's aligned with another
feature request https://github.com/neondatabase/cloud/issues/21611Resolveneondatabase/cloud#13582
## Problem
Initially we defaulted this to zero to reduce risk. We have now been
using pooling in staging for some time without issues, so let's make it
the default for anyone using this software without setting the config
explicitly.
Closes: https://github.com/neondatabase/cloud/issues/20971
## Summary of changes
- Set Azure blob storage connection pool size to 8 by default
## Problem
Occasionally we see an unexpected error like:
```
ERROR spawn_heartbeat_driver: Failed to update node state 1 after heartbeat round: Shutting down\n')
Hint: use scripts/check_allowed_errors.sh to test any new allowed_error you add
```
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10324/12690404952/index.html#/testresult/63406a0687bf6eca
## Summary of changes
- Explicitly handle ApiError::ShuttingDown as a no-op when mutating node
status
## Problem
If for some reasons we already garbage-collected the data under an LSN
but the caller uses a past LSN for the find_time_cutoff function, now we
will report a missing key error and GC will never proceed.
Note that missing key error can also happen if the key is really missing
(i.e., during the past offload incidents)
## Summary of changes
Make sure GC proceeds by bumping the LSN. When time_cutoff=None, we will
not increase the time_cutoff (it will be set to latest_gc_cutoff). If we
really need to bump the GC LSN for maintenance purpose, we need a
separate API to do that.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We have several serious data corruption incidents caused by mismatch of
get-age requests:
https://neondb.slack.com/archives/C07FJS4QF7V/p1723032720164359
We hope that the problem is fixed now. But it is better to prevent such
kind of problems in future.
Part of https://github.com/neondatabase/cloud/issues/16472
## Summary of changes
This PR introduce new V3 version of compute<->pageserver protocol,
adding tag to getpage response.
So now compute is able to check if it really gets response to the
requested page.
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
The filtered record metric doesn't make sense for interpreted ingest.
## Summary of changes
While of dubious utility in the first place, this patch replaces them
with records received and records observed metrics for interpreted
ingest:
* received records cause the pageserver to do _something_: write a key,
value pair to storage, update some metadata or flush pending
modifications
* observed records are a shard 0 concept and contain only key metadata
used in tracking relation sizes (received records include observed
records)
## Problem
We want to define the algorithm for safekeeper membership change.
## Summary of changes
Add spec for it, several models and logs of checking them.
ref https://github.com/neondatabase/neon/issues/8699
## Problem
This was causing storage controller to still use neon-built libpq
instead of vanilla libpq.
Since https://github.com/neondatabase/neon/pull/10269 we have a vanilla
postgres in the system path -- anything that wants a postgres library
will use that.
## Summary of changes
- Remove LD_LIBRARY_PATH assignment in Dockerfile
This PR removes the direct dependency of the IP allowlist from
CancelClosure, allowing for more scalable and flexible IP restrictions
and enabling the future use of Redis-based CancelMap storage.
Changes:
- Introduce a new BackendAuth async trait that retrieves the IP
allowlist through existing authentication methods;
- Improve cancellation error handling by instrument() async
cancel_sesion() rather than dropping it.
- Set and store IP allowlist for SCRAM Proxy to consistently perform IP
allowance check
Relates to #9660
## Problem
Project gets stuck if database with subscriptions was deleted via API /
UI.
https://github.com/neondatabase/cloud/issues/18646
## Summary of changes
Before dropping the database, drop all the subscriptions in it.
Do not drop slot on publisher, because we have no guarantee that the
slot still exists or that the publisher is reachable.
Add `DropSubscriptionsForDeletedDatabases` phase to run these operations
in all databases, we're about to delete.
Ignore the error if the database does not exist.
## Problem
Typical deployments of neon have some tenants that stay in use
continuously, and a background churning population of tenants that are
created and then fall idle, and are configured to Detached state.
Currently, this churn of short lived tenants results in an
ever-increasing memory footprint.
Closes: https://github.com/neondatabase/neon/issues/9712
## Summary of changes
- At startup, filter to only load shards that don't have Detached policy
- In process_result, check if a tenant's shards are all Detached and
observed=={}, and if so drop them from memory
- In tenant_location_conf and other tenant mutators, load the tenants'
shards on-demand if they are not present
## Problem
The observed state removal may race with the inline updates of the
observed state done from `Service::node_activate_reconcile`.
This was intended to work as follows:
1. Detaches while the node is unavailable remove the entry from the
observed state.
2. `Service::node_activate_reconcile` diffs the locations returned
by the pageserver with the observed state and detaches in-line
when required.
## Summary of changes
This PR removes step (1) and lets background reconciliations
deal with the mismatch between the intent and observed state.
A follow up will attempt to remove `Service::node_activate_reconcile`
altogether.
Closes https://github.com/neondatabase/neon/issues/10253
## Problem
Consider the pageserver is doing the following sequence of operations:
* upload X files
* update index_part to add X and remove Y
* delete Y files
When storage scrubber obtains the initial timeline snapshot before
"update index_part" (that is the old version that contains Y but not X),
and then obtains the index_part file after it gets updated, it will
report all Y files are missing.
## Summary of changes
Do not report layer file missing if index_part listed and downloaded are
not the same (i.e. different last_modified times)
Signed-off-by: Alex Chi Z <chi@neon.tech>
Closes: https://github.com/neondatabase/cloud/issues/17784
## Problem
Currently, we run the whole CI pipeline for any changes. It's slow and
expensive.
## Suggestion
Starting with MacOs builds:
- check what files were changed
- rebuild only needed parts
- reuse results from previous builds when available
- run builds in parallel when possible
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
We currently parse Notification twice even in the happy path.
## Summary of changes
Use `#[serde(other)]` to catch unknown topics and defer the second
parsing.
## Problem
`promote-images` was split into `promote-images-dev` and
`promote-images-prod` in
https://github.com/neondatabase/neon/pull/10267.
`dev` credentials were loaded in `promote-images-dev` and `prod`
credentials were loaded in `promote-images-prod`, but
`promote-images-prod` needs `dev` credentials as well to access the
`dev` images to replicate them from `dev` to `prod`.
## Summary of changes
Load `dev` credentials in `promote-images-prod` as well.
Apparently, we failed to do this bookkeeping in quite a few places...
## Problem
Fixes https://github.com/neondatabase/cloud/issues/22364
## Summary of changes
Add accounting of dropped requests. Note that this includes prefetches
dropped due to things like "PS connection dropped unexpectedly" or
"prefetch queue is already full", but *not* (yet?) "dropped due to
backend shutdown".
## Problem
`trigger-e2e-tests` waits half an hour before starting to run. Nearly
half of that time can be saved by promoting images before tests on them
are complete, so the e2e tests can run in parallel.
On `main` and `release{,-proxy,-compute}`, `promote-images` updates
`latest` and pushes things to prod ecr, so we want to run
`promote-images` only after `test-images` is done, but on other
branches, there is no harm in promoting images that aren't tested yet.
## Summary of changes
To promote images into dev container registries sooner, `promote-images`
is split into `promote-images-dev` and `promote-images-prod`. The former
pushes to dev container registries, the latter to prod ones. The latter
also waits for `test-images`, while the former doesn't. This allows to
run `trigger-e2e-tests` sooner.
Using `min(0, ...)` causes us to fail to wait in most situations, so a
lack of data would be a hot wait loop, which is bad.
## Problem
We noticed high CPU usage in some situations
## Problem
On macOS:
```
error: unused variable: `disable_lfc_resizing`
--> compute_tools/src/bin/compute_ctl.rs:431:9
|
431 | disable_lfc_resizing,
| ^^^^^^^^^^^^^^^^^^^^ help: try ignoring the field: `disable_lfc_resizing: _`
|
= note: `-D unused-variables` implied by `-D warnings`
= help: to override `-D warnings` add `#[allow(unused_variables)]`
```
## Summary of changes
- Initialise `disable_lfc_resizing` only on Linux (because it's used on
Linux only in further bloc)
## Problem
It's impossible to run regression tests with Python 3.13 as some
dependencies don't support it (some of them are outdated, and `jsonnet`
doesn't support it at all yet)
## Summary of changes
- Update dependencies for Python 3.13
- Install `jsonnet` only on Python < 3.13 and skip relevant tests on
Python 3.13
Closes#10237
## Problem
close https://github.com/neondatabase/neon/issues/10192
## Summary of changes
* `find_gc_time_cutoff` takes `now` parameter so that all branches
compute the cutoff based on the same start time, avoiding races.
* gc-compaction uses a single `get_gc_compaction_watermark` function to
get the safe LSN to compact.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
Frame pointers are typically disabled by default (depending on CPU
architecture), to improve performance. This frees up a CPU register, and
avoids a couple of instructions per function call. However, it makes
stack unwinding much more inefficient, since it has to use DWARF debug
information instead, and gives worse results with e.g. `perf` and eBPF
profiles. The `backtrace` implementation of `libunwind` is also
suspected to cause seg faults.
The performance benefit of frame pointer omission doesn't appear to
matter that much on modern 64-bit CPU architectures (which have plenty
of registers and optimized instruction execution), and benchmarks did
not show measurable overhead.
The Rust standard library and jemalloc already enable frame pointers by
default.
For more information, see
https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html.
Resolves#10224.
Touches #10225.
## Summary of changes
Enable frame pointers in all builds, and use frame pointers for pprof-rs
stack sampling.
## Problem
Before the holidays, and just before our code freeze, a change to cplane
was made that started publishing the topics from #10197. This triggered
our alerts and put us in a sticky situation as it was not an error, and
we didn't want to silence the alert for the entire holidays, and we
didn't want to release proxy 2 days in a row if it was not essential.
We fixed it eventually by rewriting the alert based on logs, but this is
not a good solution.
## Summary of changes
Introduces an intermediate parsing step to check the topic name first,
to allow us to ignore parsing errors for any topics we do not know
about.
## Problem
We are chasing down segfaults in the storage controller
https://github.com/neondatabase/cloud/issues/21010
This is for use by the storage controller, which links dynamically with
`libpq`. We currently use the neon-built libpq, but this may be unsafe
for use from multi-threaded programs like the controller, as it uses a
statically linked openssl
Precursor to https://github.com/neondatabase/neon/pull/10258
## Summary of changes
- Include `postgresql-15` in container builds.
The reason for using version 15 is simply because that is what's
available in Debian 12 without adding any extra repositories, and we
don't have any special need for latest version in our libpq usage.
## Problem
It's not legal to modify layers that are referenced by the current layer
index. Assert this in the upload queue, as preparation for upload queue
reordering.
Touches #10096.
## Summary of changes
Add a debug assertion that the upload queue does not modify layers
referenced by the current index.
I could be convinced that this should be a plain assertion, but will be
conservative for now.
## Problem
Since enabling continuous profiling in staging, we've seen frequent seg
faults. This is suspected to be because jemalloc and pprof-rs take a
stack trace at the same time, and the handlers aren't signal safe.
jemalloc does this probabilistically on every allocation, regardless of
whether someone is taking a heap profile, which means that any CPU
profile has a chance to cause a seg fault.
Touches #10225.
## Summary of changes
For now, just disable heap profiles -- CPU profiles are more important,
and we need to be able to take them without risking a crash.
There is a race condition between `Tenant::shutdown`'s `defuse_for_drop`
loop and `offload_timeline`, where timeline offloading can insert into a
tenant that is in the process of shutting down, in fact so far
progressed that the `defuse_for_drop` has already been called.
This prevents warn log lines of the form:
```
offloaded timeline <hash> was dropped without having cleaned it up at the ancestor
```
The solution piggybacks on the `offloaded_timelines` lock: both the
defuse loop and the offloaded timeline insertion need to acquire the
lock, and we know that the defuse loop only runs after the tenant has
set its `TenantState` to `Stopping`.
So if we hold the `offloaded_timelines` lock, and know that the
`TenantState` is not `Stopping`, then we know that the defuse loop has
not ran yet, and holding the lock ensures that it doesn't start running
while we are inserting the offloaded timeline.
Fixes#10070
## Problem
When we do a timeline CRUD operation, we check that the shards we need
to mutate are currently attached to a pageserver, by reading
`generation` and `generation_pageserver` from the database.
If any don't appear to be attached, we respond with a a 503 and "One or
more shards in tenant is not yet attached".
This is happening more often than expected, and it's not obvious with
current logging what's going on: specifically which shard has a problem,
and exactly what we're seeing in these persistent generation columns.
(Aside: it's possible that we broke something with the change in #10011
which clears generation_pageserver when we detach a shard, although if
so the mechanism isn't trivial: what should happen is that if we stamp
on generation_pageserver if a reconciler is running, then it shouldn't
matter because we're about to
## Summary of changes
- When we are in Attached mode but find that
generation_pageserver/generation are unset, output details while looping
over shards.
## Problem
We see periodic failures in `test_scrubber_physical_gc_ancestors`, where
the logs show that the pageserver is creating image layers that should
cause child shards to no longer reference their parents' layers, but
then the scrubber runs and doesn't find any unreferenced layers.[
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10256/12582034135/index.html#/testresult/78ea06dea6ba8dd3
From inspecting the code & test, it seems like this could be as simple
as the test failing to wait for uploads before running the scrubber. It
had a 2 second delay built in to satisfy the scrubbers time threshold
checks, which on a lightly loaded machine would also have been easily
enough for uploads to complete, but our test machines are more heavily
loaded all the time.
## Summary of changes
- Wait for uploads to complete after generating images layers in
test_scrubber_physical_gc_ancestors, so that the scrubber should
reliably see the post-compaction metadata.
## Problem
Versions of `diesel` and `pq-sys` were somewhat stale. I was checking on
libpq->openssl versions while investigating a segfault via
https://github.com/neondatabase/cloud/issues/21010. I don't think these
rust bindings are likely to be the source of issues, but we might as
well freshen them as a precaution.
## Summary of changes
- Update diesel to 2.2.6
- Update pq-sys to 0.6.3
There was no value in saving them off to temporary variables.
Signed-off-by: Tristan Partin <tristan@neon.tech>
Signed-off-by: Tristan Partin <tristan@neon.tech>
ref neondatabase/cloud#21731
## Problem
When we manually override the LFC size for particular computes,
autoscaling will typically undo that because vm-monitor will resize LFC
itself.
So, we'd like a way to make vm-monitor not set LFC size — this actually
already exists, if we just don't give vm-monitor a postgres connection
string.
## Summary of changes
Add a new field to the compute spec, `disable_lfc_resizing`. When set to
`true`, we pass in `None` for its postgres connection string. That
matches the configuration tested in `neondatabase/autoscaling` CI.
## Problem
Building local_proxy and compute_tools features the same dependency
tree, but as they are currently built in separate clean layers all that
progress is wasted. For our arm builds that's an extra 10 minutes.
## Summary of changes
Combines the compute_tools and local_proxy build layers.
## Problem
https://neondb.slack.com/archives/C085MBDUSS2/p1734604792755369
## Summary of changes
Recognize and ignore the 3 new broadcast messages:
- `/block_public_or_vpc_access_updated`
- `/allowed_vpc_endpoints_updated_for_org`
- `/allowed_vpc_endpoints_updated_for_projects`
## Problem
Running clippy with `cargo hack --feature-powerset` in CI isn't
particularly fast. This PR follows-up on
https://github.com/neondatabase/neon/pull/8912 to improve the speed of
our clippy runs.
Parallelism as suggested in
https://github.com/neondatabase/neon/issues/9901 was tested, but didn't
show consistent enough improvements to be worth it. It actually
increased the amount of work done, as there's less cache hits when
clippy runs are spread out over multiple target directories.
Additionally, parallelism makes it so caching needs to be thought about
more actively and copying around target directories to enable
parallelism eats the rest of the performance gains from parallel
execution.
After some discussion, the decision was to instead cut down on the
number of jobs that are running further. The easiest way to do this is
to not run clippy *without* default features. The list of default
features is empty for all crates, and I haven't found anything using
`cfg(feature = "default")` either, so this is likely not going to change
anything except speeding the runs up.
## Summary of changes
Reduce the amount of feature combinations tried by `cargo hack` (as
suggested in
https://github.com/neondatabase/neon/pull/8912#pullrequestreview-2286482368)
by never disabling default features.
## Alternatives
- We can split things out into different jobs which reduces the time
until everything is finished by running more things in parallel. This
does however decreases the amount of cache hits and increases the amount
of time spent on overhead tasks like repo cloning and restoring caches
by doing those multiple times instead of once.
- We could replace `cargo hack [...] clippy` with `cargo clippy [...];
cargo clippy --features testing`. I'm not 100% sure how this compares to
the change here in the PR, but it does seem to run a bit faster. That
likely means it's doing less work, but without understanding what
exactly we loose by that I'd rather not do that for now. I'd appreciate
input on this though.
Now that we construct the TLS client config for cancellation as well as
connect, it feels appropriate to construct the same config once and
re-use it elsewhere. It might also help should #7500 require any extra
setup, so we can easily add it to all the appropriate call sites.
In #10207 it was clear there was some confusion with the current
connection logic. To analyse the flow to make sure there was no poll
stalling, I ended up with the following refactor.
Notable changes:
1. Now all functions called `poll_xyz` and that have a `cx: &mut
Context` argument must return a `Poll<_>` type, and can only return
`Pending` iff an internal poll call also returned `Pending`
2. State management is handled entirely by `poll_messages`. There are
now only 2 states which makes it much easier to keep track of.
Each commit should be self-reviewable and should be simple to verify
that it keeps the same behaviour
## Problem
Currently default value of storage controller heartbeat interval is
100msec. It means that 10 times per second it establish connection to
PS. And it seems to be quite expensive.
At MacOS right now storage_controller consumes 70% CPU and trusts - 30%.
So together they completely utilize one core.
A lot of us has Macs. Let's save environment a little bit and do not
waste electricity and contribute to global warming.
By the way, on prod we have interval 10seconds
## Summary of changes
Increase heartbeat interval from 100msec to 1 second.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
In https://github.com/neondatabase/neon/pull/9897 we temporarily
disabled the layer valid check because the current one only considers
the end result of all compaction algorithms, but partial gc-compaction
would temporarily produce an "invalid" layer map.
part of https://github.com/neondatabase/neon/issues/9114
## Summary of changes
Allow LSN splits to overlap in the slow path check. Currently, the valid
check is only used in storage scrubber (background job) and during
gc-compaction (without taking layer lock). Therefore, it's fine for such
checks to be a little bit inefficient but more accurate.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
Safekeeper may currently send a batch to the pageserver even if it
hasn't decoded a new record.
I think this is quite unlikely in the field, but worth adressing.
## Summary of changes
Don't send anything if we haven't decoded a full record. Once this
merges and releases, the `InterpretedWalRecords` struct can be updated
to remove the Option wrapper for `next_record_lsn`.
## Problem
The benchmarking utilities are also useful for testing. We want to write
tests in the safekeeper crate.
## Summary of changes
This commit lifts the utils to the safekeeper crate. They are compiled
if the benchmarking features is enabled or if in test mode.
## Problem
test_timeline_archival_chaos does timeline creation with failure
injection, and thereby sometimes leaves timelines in a part created
state. This was being reported as corruption by the scrubber on test
teardown, because it considered a layer without an index to be an
invalid state. This was incorrect: the scrubber should accept this
state, it occurs legitimately during timeline creation.
Closes: https://github.com/neondatabase/neon/issues/9988
## Summary of changes
- Report a timeline with layers but no index as Relic rather than
MissingIndexPart.
- We retain the MissingIndexPart variant for the case where an index
_was_ found in the listing, but was not found by a subsequent GET, i.e.
racing with deletion.
## Problem
`test_pgdata_import_smoke` writes two gigabytes of pages and then reads
them back serially. This is CPU bottlenecked and results in a long
runtime, and sensitivity to CPU load from other tests on the same
machine.
Closes: https://github.com/neondatabase/neon/issues/10071
## Summary of changes
- Use effective_io_concurrency=32 when doing sequential scans through
2GiB of pages in test_pgdata_import_smoke. This is a ~10x runtime
decrease in the parts of the test that do sequential scans.
- Also set `effective_io_concurrency=2` for tests, as I noticed while
debugging that we were doing all getpage requests serially, which is bad
for checking the stability of the batching code.
## Problem
We want to verify how much / if pgbench throughput and latency on Neon
suffers if the database contains many other relations, too.
## Summary of changes
Modify the benchmarking.yml pgbench-compare job to
- create an addiitional project at scale factor 10 GiB
- before running pgbench add n tables (initially 10k) to the database
- then compare the pgbench throughput and latency to the existing
pgbench-compare at 10 Gib scale factor
We use a realistic template for the n relations that is a partitioned
table with some realistic data types, indexes and constraints - similar
to a table that we use internally.
Example run:
https://github.com/neondatabase/neon/actions/runs/12377565956/job/34547386959
## Problem
s5cmd doesn't pick up the pod service account
```
2024/12/16 16:26:01 Ignoring, HTTP credential provider invalid endpoint host, "169.254.170.23", only loopback hosts are allowed. <nil>
ERROR "ls s3://neon-dev-bulk-import-us-east-2/import-pgdata/fast-import/v1/br-wandering-hall-w2xobawv": NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors
```
## Summary of changes
Switch to offical CLI.
## Testing
Tested the pre-merge image in staging, using `job_image` override in
project settings.
https://neondb.slack.com/archives/C033RQ5SPDH/p1734554944391949?thread_ts=1734368383.258759&cid=C033RQ5SPDH
## Future Work
Switch back to s5cmd once https://github.com/peak/s5cmd/pull/769 gets
merged.
## Refs
- fixes https://github.com/neondatabase/cloud/issues/21876
---------
Co-authored-by: Gleb Novikov <NanoBjorn@users.noreply.github.com>
## Problem
In https://github.com/neondatabase/neon/pull/8103 we changed the test
case to have more test coverage of gc_compaction. Now that we have
`test_gc_compaction_smoke`, we can revert this test case to serve its
original purpose and revert the parameter changes.
part of https://github.com/neondatabase/neon/issues/9114
## Summary of changes
* Revert pitr_interval from 60s to 10s.
* Assert the physical/logical size ratio in the benchmark.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
We cannot get the size of the compaction queue and access the info.
Part of #9114
## Summary of changes
* Add an API endpoint to get the compaction queue.
* gc_compaction test case now waits until the compaction finishes.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`neon_local` has always been unsafe to run concurrently with itself: it
uses simple text files for persistent state, and concurrent runs will
step on each other.
In some test environments we intentionally handle this with mutexes in
python land, but it's fragile to try and always remember to do that.
## Summary of changes
- Add a `flock` based mutex around the `main` function of neon_local,
using the repo directory as the file to lock
- Clean up an Option<> around control_plane_api, this is a drive-by
change because it was one of the fields that had a weird effect when
previous concurrent stuff stamped on it.
## Problem
part of https://github.com/neondatabase/neon/issues/9114
In https://github.com/neondatabase/neon/pull/10127 we fixed the race,
but we didn't add the errors to the allowlist.
## Summary of changes
* Allow repartition errors in the gc-compaction smoke test.
I think it might be worth to refactor the code to allow multiple threads
getting a copy of repartition status (i.e., using Rcu) in the future.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Add a `safekeepers` subcommand to `storcon_cli` that allows listing the
safekeepers.
```
$ curl -X POST --url http://localhost:1234/control/v1/safekeeper/42 --data \
'{"active":true, "id":42, "created_at":"2023-10-25T09:11:25Z", "updated_at":"2024-08-28T11:32:43Z","region_id":"neon_local","host":"localhost","port":5454,"http_port":0,"version":123,"availability_zone_id":"us-east-2b"}'
$ cargo run --bin storcon_cli -- --api http://localhost:1234 safekeepers
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.38s
Running `target/debug/storcon_cli --api 'http://localhost:1234' safekeepers`
+----+---------+-----------+------+-----------+------------+
| Id | Version | Host | Port | Http Port | AZ Id |
+==========================================================+
| 42 | 123 | localhost | 5454 | 0 | us-east-2b |
+----+---------+-----------+------+-----------+------------+
```
Also:
* Don't return the raw `SafekeeperPersistence` struct that contains the
raw database presentation, but instead a new
`SafekeeperDescribeResponse` struct.
* The `SafekeeperPersistence` struct leaves out the `active` field on
purpose because we want to deprecate it and replace it with a
`scheduling_policy` one.
Part of https://github.com/neondatabase/neon/issues/9981
**NB: this PR must be merged only by 'Create a merge commit'!**
### Checklist when preparing for release
- [ ] Read or refresh [the release flow guide](https://www.notion.so/neondatabase/Release-general-flow-61f2e39fd45d4d14a70c7749604bd70b)
- [ ] Ask in the [cloud Slack channel](https://neondb.slack.com/archives/C033A2WE6BZ) that you are going to rollout the release. Any blockers?
- [ ] Does this release contain any db migrations? Destructive ones? What is the rollback plan?
<!-- List everything that should be done**before** release, any issues / setting changes / etc -->
### Checklist after release
- [ ] Make sure instructions from PRs included in this release and labeled `manual_release_instructions` are executed (either by you or by people who wrote them).
- [ ] Based on the merged commits write release notes and open a PR into `website` repo ([example](https://github.com/neondatabase/website/pull/219/files))
# settings below only needed if you want the project to be sharded from the beginning
shard_split_project:
description:'by default new projects are not shard-split initiailly, but only when shard-split threshold is reached, specify true to explicitly shard-split initially'
required:false
default:'false'
disable_sharding:
description:'by default new projects use storage controller default policy to shard-split when shard-split threshold is reached, specify true to explicitly disable sharding'
required:false
default:'false'
admin_api_key:
description:'Admin API Key needed for shard-splitting. Must be specified if shard_split_project is true'
required:false
shard_count:
description:'Number of shards to split the project into, only applies if shard_split_project is true'
required:false
default:'8'
stripe_size:
description:'Stripe size, optional, in 8kiB pages. e.g. set 2048 for 16MB stripes. Default is 128 MiB, only applies if shard_split_project is true'
required:false
default:'32768'
psql_path:
description:'Path to psql binary - it is caller responsibility to provision the psql binary'
required:false
default:'/tmp/neon/pg_install/v16/bin/psql'
libpq_lib_path:
description:'Path to directory containing libpq library - it is caller responsibility to provision the libpq library'
required:false
default:'/tmp/neon/pg_install/v16/lib'
project_settings:
description:'A JSON object with project settings'
required:false
default:'{}'
outputs:
dsn:
@@ -48,7 +80,7 @@ runs:
\"provisioner\": \"k8s-neonvm\",
\"autoscaling_limit_min_cu\": ${MIN_CU},
\"autoscaling_limit_max_cu\": ${MAX_CU},
\"settings\": { }
\"settings\": ${PROJECT_SETTINGS}
}
}")
@@ -63,6 +95,38 @@ runs:
echo "project_id=${project_id}" >> $GITHUB_OUTPUT
echo "Project ${project_id} has been created"
if [ "${SHARD_SPLIT_PROJECT}" = "true" ]; then
# determine tenant ID
TENANT_ID=`${PSQL} ${dsn} -t -A -c "SHOW neon.tenant_id"`
echo "Splitting project ${project_id} with tenant_id ${TENANT_ID} into $((SHARD_COUNT)) shards with stripe size $((STRIPE_SIZE))"
echo "Sending PUT request to https://${API_HOST}/regions/${REGION_ID}/api/v1/admin/storage/proxy/control/v1/tenant/${TENANT_ID}/shard_split"
echo "with body {\"new_shard_count\": $((SHARD_COUNT)), \"new_stripe_size\": $((STRIPE_SIZE))}"
# we need an ADMIN API KEY to invoke storage controller API for shard splitting (bash -u above checks that the variable is set)
description:"The kind of run we're currently in. Will be one of `push-main`, `storage-release`, `compute-release`, `proxy-release`, `storage-rc-pr`, `compute-rc-pr`, `proxy-rc-pr`, `pr`, or `workflow-dispatch`"
value:${{ jobs.tags.outputs.run-kind }}
release-pr-run-id:
description:"Only available if `run-kind in [storage-release, proxy-release, compute-release]`. Contains the run ID of the `Build and Test` workflow, assuming one with the current commit can be found."
RELEASE_PR_RUN_ID=$(gh api "/repos/${GITHUB_REPOSITORY}/actions/runs?head_sha=$CURRENT_SHA" | jq '[.workflow_runs[] | select(.name == "Build and Test") | select(.head_branch | test("^rc/release(-(proxy|compute))?/[0-9]{4}-[0-9]{2}-[0-9]{2}$"; "s"))] | first | .id // ("Failed to find Build and Test run from RC PR!" | halt_error(1))')
echo "release-pr-run-id=$RELEASE_PR_RUN_ID" | tee -a $GITHUB_OUTPUT
# Program name comes from postgres' syslog_facility configuration: https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-SYSLOG-IDENT
# Default value is 'postgres'.
if $programname == 'postgres' then {{
# Forward Postgres logs to telemetry otel collector
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.