## Problem
The endpoint filter cache is still unused because it's not yet reliable
enough to be used. It only consumes a lot of memory.
## Summary of changes
Remove the code. Needs a new design.
neondatabase/cloud#30634
See #11992 and #11961 for some examples of usecases.
This introduces a JSON serialization lib, designed for more flexibility
than serde_json offers.
## Dynamic construction
Sometimes you have dynamic values you want to serialize, that are not
already in a serde-aware model like a struct or a Vec etc. To achieve
this with serde, you need to implement a lot of different traits on a
lot of different new-types. Because of this, it's often easier to
give-in and pull all the data into a serde-aware model
(serde_json::Value or some intermediate struct), but that is often not
very efficient.
This crate allows full control over the JSON encoding without needing to
implement any extra traits. Just call the relevant functions, and it
will guarantee a correctly encoded JSON value.
## Async construction
Similar to the above, sometimes the values arrive asynchronously. Often
collecting those values in memory is more expensive than writing them as
JSON, since the overheads of `Vec` and `String` is much higher, however
there are exceptions.
Serializing to JSON all in one go is also more CPU intensive and can
cause lag spikes, whereas serializing values incrementally spreads out
the CPU load and reduces lag.
## Problem
We don't log gRPC request errors on the server.
Touches #11728.
## Summary of changes
Automatically log non-OK gRPC response statuses in the observability
middleware, and add corresponding logging for the `get_pages` stream.
Also adds the peer address and gRPC method to the gRPC tracing span.
Example output:
```
2025-07-02T20:18:16.813718Z WARN grpc:pageservice{peer=127.0.0.1:56698 method=CheckRelExists tenant_id=c7b45faa1924b1958f05c5fdee8b0d04 timeline_id=4a36ee64fd2f97781b9dcc2c3cddd51b shard_id=0000}: request failed with NotFound: Tenant c7b45faa1924b1958f05c5fdee8b0d04 not found
```
## Problem
Names are not consistent between safekeeper migration RFC and the actual
implementation.
It's not used anywhere in production yet, so it's safe to rename. We
don't need to worry about backward compatibility.
- Follow up on https://github.com/neondatabase/neon/pull/12432
## Summary of changes
- rename term -> last_log_term in TimelineMembershipSwitchResponse
- add missing fields to TimelineMembershipSwitchResponse in python
## Problem
The safekeeper migration code/logic slightly diverges from the initial
RFC. This PR aims to address these differences.
- Part of https://github.com/neondatabase/neon/issues/12192
## Summary of changes
- Adjust the RFC to reflect that we implemented the safekeeper
reconciler with in-memory queue.
- Add `sk_set_notified_generation` field to the `timelines` table in the
RFC to address the "finish migration atomically" problem.
- Describe how we are going to make the timeline migration handler fully
retriable with in-memory reconciler queue.
- Unify type/field/method names in the code and RFC.
- Fix typos
This doesn't do anything interesting yet, but demonstrates linking Rust
code to the neon Postgres extension, so that we can review and test
drive just the build process changes independently.
Mostly a revert of https://github.com/neondatabase/neon/pull/11851 and
https://github.com/neondatabase/neon/pull/12330 .
Christian suggested reverting his PR to fix the issue
https://github.com/neondatabase/neon/issues/12369 .
Alternatives considered:
1. I have originally wanted to introduce cancellation tokens to
`RequestContext`, but in the end I gave up on them because I didn't find
a select-free way of preventing
`test_layer_download_cancelled_by_config_location` from hanging.
Namely if I put a select around the `get_or_maybe_download` invocation
in `get_values_reconstruct_data`, it wouldn't hang, but if I put it
around the `download_init_and_wait` invocation in
`get_or_maybe_download`, the test would still hang. Not sure why, even
though I made the attached child function of the `RequestContext` create
a child token.
2. Introduction of a `download_cancel` cancellation token as a child of
a timeline token, putting it into `RemoteTimelineClient` together with
the main token, and then putting it into the whole
`RemoteTimelineClient` read path.
3. Greater refactorings, like to make cancellation tokens follow a DAG
structure so you can have tokens cancelled either by say timeline
shutting down or a request ending. It doesn't just represent an effort
that we don't have the engineering budget for, it also causes
interesting questions like what to do about batching (do you cancel the
entire request if only some requests get cancelled?).
We might see a reemergence of
https://github.com/neondatabase/neon/issues/11762, but given that we
have https://github.com/neondatabase/neon/pull/11853 and
https://github.com/neondatabase/neon/pull/12376 now, it is possible that
it will not come back. Looking at some code, it might actually fix the
locations where the error pops up. Let's see.
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
- Add ComputeSpec flag `offload_lfc_interval_seconds` controlling
whether LFC should be offloaded to endpoint storage. Default value
(None) means "don't offload".
- Add glue code around it for `neon_local` and integration tests.
- Add `autoprewarm` mode for `test_lfc_prewarm` testing
`offload_lfc_interval_seconds` and `autoprewarm` flags in conjunction.
- Rename `compute_ctl_lfc_prewarm_requests_total` and
`compute_ctl_lfc_offload_requests_total` to
`compute_ctl_lfc_prewarms_total`
and `compute_ctl_lfc_offloads_total` to reflect we count prewarms and
offloads, not `compute_ctl` requests of those.
Don't count request in metrics if there is a prewarm/offload already
ongoing.
https://github.com/neondatabase/cloud/issues/19011
Resolves: https://github.com/neondatabase/cloud/issues/30770
Don't build walproposer-lib as a separate job. It only takes a few
seconds, after you have built all its dependencies.
Don't cache the Neon Pg extensions in the per-postgres-version caches.
This is in preparation for the communicator project, which will
introduce Rust parts to the Neon Pg extension, which complicates the
build process. With that, the 'make neon-pg-ext' step requires some of
the Rust bits to be built already, or it will build them on the spot,
which in turn requires all the Rust sources to be present, and we don't
want to repeat that part for each Postgres version anyway. To prepare
for that, rely on "make all" to build the neon extension and the rust
bits in the correct order instead. Building the neon extension doesn't
currently take very long anyway after you have built Postgres itself, so
you don't gain much by caching it. See
https://github.com/neondatabase/neon/pull/12266.
Add an explicit "rustup update" step to update the toolchain. It's not
strictly necessary right now, because currently "make all" will only
invoke "cargo build" once and the race condition described in the
comment doesn't happen. But prepare for the future.
To further simplify the build, get rid of the separate 'build-postgres'
jobs too, and just build Postgres as a step in the main job. That makes
the overall workflow run longer, because we no longer build all the
postgres versions in parallel (although you still get intra-runner
parallelism thanks to `make -j`), but that's acceptable. In the
cache-hit case, it might even be a little faster because there is less
overhead from launching jobs, and in the cache-miss case, it's maybe
5-10 minutes slower altogether.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
The current deletion operation is synchronous and blocking, which is
unsuitable for potentially long-running tasks like. In such cases, the
standard HTTP request-response pattern is not a good fit.
## Summary of Changes
- Added new `storcon_cli` commands: `NodeStartDelete` and
`NodeCancelDelete` to initiate and cancel deletion asynchronously.
- Added corresponding `storcon` HTTP handlers to support the new
start/cancel deletion flow.
- Introduced a new type of background operation: `Delete`, to track and
manage the deletion process outside the request lifecycle.
---------
Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
## Problem
I suspect that the pageservers get stuck on receiving broker updates.
## Summary of changes
This is a an opportunistic (staging only) patch that resets the
susbscription
stream if it's been idle for a while. This won't go to prod in this
form.
I'll revert or update it before Friday.
## Problem
DEBUG_LOCAL_COMPARE mode allows to detect data corruption.
But it requires rebuild of neon extension (and so requires special
image) and significantly slowdown execution because always fetch pages
from page server.
## Summary of changes
Introduce new GUC `neon.debug_compare_local`, accepting the following
values: " none", "prefetch", "lfc", "all" (by default it is definitely
disabled).
In mode less than "all", neon SMGR will not fetch page from PS if it is
found in local caches.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
The background installed extensions worker relied on `interval.tick()`
to go to sleep for a period of time. This can lead to bugs due to the
interval being updated at the end of the loop as the first tick is
[instantaneous](https://docs.rs/tokio/latest/tokio/time/struct.Interval.html#method.tick).
## Summary of changes
Changed it to a `tokio::time::sleep` to prevent this issue. Now it puts
the thread to sleep and only wakes up after the specified duration
## Problem
Better investigate memory usage during backpressure
## Summary of changes
Print open layer size if backpressure is activated
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Ancestor detach from a previously detached parent when there were no
writes panics since it tries to upload the tombstone layer twice.
## Summary of Changes
If we're gonna copy the tombstone from the ancestor, don't bother
creating it.
Fixes https://github.com/neondatabase/neon/issues/12458
When deploying new safekeepers, we don't immediately want to send
traffic to them. Maybe they are not ready yet by the time the deploy
script is registering them with the storage controller.
For pageservers, the storcon solves the problem by not scheduling stuff
to them unless there has been a positive heartbeat response. We can't do
the same for safekeepers though, otherwise a single down safekeeper
would mean we can't create new timelines in smaller regions where there
is only three safekeepers in total.
So far we have created safekeepers as `pause` but this adds a manual
step to safekeeper deployment which is prone to oversight. We want
things to be automatted. So we introduce a new state `activating` that
acts just like `pause`, except that we automatically transition the
policy to `active` once we get a positive heartbeat from the safekeeper.
For `pause`, we always keep the safekeeper paused.
## Problem
Currently, if `storcon` (storage controller) reconciliations repeatedly
fail, the system will indefinitely freeze optimizations. This can result
in optimization starvation for several days until the reconciliation
issues are manually resolved. To mitigate this, we should detect
persistently failing reconciliations and exclude them from influencing
the optimization decision.
## Summary of Changes
- A tenant shard reconciliation is now considered "keep-failing" if it
fails 5 consecutive times. These failures are excluded from the
optimization readiness check.
- Added a new metric: `storage_controller_keep_failing_reconciles` to
monitor such cases.
- Added a warning log message when a reconciliation is marked as
"keep-failing".
---------
Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
## TLDR
This PR is a no-op. The changes are disabled by default.
## Problem
I. Currently we don't have a way to detect disk I/O failures from WAL
operations.
II.
We observe that the offloader fails to upload a segment due to race
conditions on XLOG SWITCH and PG start streaming WALs. wal_backup task
continously failing to upload a full segment while the segment remains
partial on the disk.
The consequence is that commit_lsn for all SKs move forward but
backup_lsn stays the same. Then, all SKs run out of disk space.
III.
We have discovered SK bugs where the WAL offload owner cannot keep up
with WAL backup/upload to S3, which results in an unbounded accumulation
of WAL segment files on the Safekeeper's disk until the disk becomes
full. This is a somewhat dangerous operation that is hard to recover
from because the Safekeeper cannot write its control files when it is
out of disk space. There are actually 2 problems here:
1. A single problematic timeline can take over the entire disk for the
SK
2. Once out of disk, it's difficult to recover SK
IV.
Neon reports certain storage errors as "critical" errors using a marco,
which will increment a counter/metric that can be used to raise alerts.
However, this metric isn't sliced by tenant and/or timeline today. We
need the tenant/timeline dimension to better respond to incidents and
for blast radius analysis.
## Summary of changes
I.
The PR adds a `safekeeper_wal_disk_io_errors ` which is incremented when
SK fails to create or flush WALs.
II.
To mitigate this issue, we will re-elect a new offloader if the current
offloader is lagging behind too much.
Each SK makes the decision locally but they are aware of each other's
commit and backup lsns.
The new algorithm is
- determine_offloader will pick a SK. say SK-1.
- Each SK checks
-- if commit_lsn - back_lsn > threshold,
-- -- remove SK-1 from the candidate and call determine_offloader again.
SK-1 will step down and all SKs will elect the same leader again.
After the backup is caught up, the leader will become SK-1 again.
This also helps when SK-1 is slow to backup.
I'll set the reelect backup lag to 4 GB later. Setting to 128 MB in dev
to trigger the code more frequently.
III.
This change addresses problem no. 1 by having the Safekeeper perform a
timeline disk utilization check check when processing WAL proposal
messages from Postgres/compute. The Safekeeper now rejects the WAL
proposal message, effectively stops writing more WAL for the timeline to
disk, if the existing WAL files for the timeline on the SK disk exceeds
a certain size (the default threshold is 100GB). The disk utilization is
calculated based on a `last_removed_segno` variable tracked by the
background task removing WAL files, which produces an accurate and
conservative estimate (>= than actual disk usage) of the actual disk
usage.
IV.
* Add a new metric `hadron_critical_storage_event_count` that has the
`tenant_shard_id` and `timeline_id` as dimensions.
* Modified the `crtitical!` marco to include tenant_id and timeline_id
as additional arguments and adapted existing call sites to populate the
tenant shard and timeline ID fields. The `critical!` marco invocation
now increments the `hadron_critical_storage_event_count` with the extra
dimensions. (In SK there isn't the notion of a tenant-shard, so just the
tenant ID is recorded in lieu of tenant shard ID.)
I considered adding a separate marco to avoid merge conflicts, but I
think in this case (detecting critical errors) conflicts are probably
more desirable so that we can be aware whenever Neon adds another
`critical!` invocation in their code.
---------
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
Co-authored-by: William Huang <william.huang@databricks.com>
## Problem
integrating subzero requires a bit of refactoring. To make the
integration PR a bit more manageable, the refactoring is done in this
separate PR.
## Summary of changes
* move common types/functions used in sql_over_http to errors.rs and
http_util.rs
* add the "Local" auth backend to proxy (similar to local_proxy), useful
in local testing
* change the Connect and Send type for the http client to allow for
custom body when making post requests to local_proxy from the proxy
---------
Co-authored-by: Ruslan Talpa <ruslan.talpa@databricks.com>
## Problem
`keep_connection` does not exit, so it was never setting
`credentials_refreshed`.
## Summary of changes
Set `credentials_refreshed` to true when we first establish a
connection, and after we re-authenticate the connection.
## Problem
`keep_connection` does not exit, so it was never setting
`credentials_refreshed`.
## Summary of changes
Set `credentials_refreshed` to true when we first establish a
connection, and after we re-authenticate the connection.
## Problem
#12450
## Summary of changes
Instead of `#[arg(default_value_t = typed_default_value)]`, we use
`#[arg(default_value = "str that deserializes into the value")]`,
because apparently you can't convince clap to _not_ deserialize from the
Display implementation of an imported enum.
Health checks for pg-sni-router open a TCP connection and immediately
close it again. This is noisy. We will filter out any EOF errors on the
first message.
"acquired permit" debug log is incorrect since it logs when we timedout
as well. This fixes the debug log.
## Problem
When local proxy is configured with TLS, the certificate does not match
the endpoint string. This currently returns an error.
## Summary of changes
I don't think this code is necessary anymore, taking the prefix from the
hostname is good enough (and is equivalent to what `endpoint_sni` was
doing) and we ignore checking the domain suffix.
## Problem
Neon currently implements several features that guarantee high uptime of
compute nodes:
1. Storage high-availability (HA), i.e. each tenant shard has a
secondary pageserver location, so we can quickly switch over compute to
it in case of primary pageserver failure.
2. Fast compute provisioning, i.e. we have a fleet of pre-created empty
computes, that are ready to serve workload, so restarting unresponsive
compute is very fast.
3. Preemptive NeonVM compute provisioning in case of k8s node
unavailability.
This helps us to be well-within the uptime SLO of 99.95% most of the
time. Problems begin when we go up to multi-TB workloads and 32-64 CU
computes. During restart, compute looses all caches: LFC, shared
buffers, file system cache. Depending on the workload, it can take a lot
of time to warm up the caches, so that performance could be degraded and
might be even unacceptable for certain workloads. The latter means that
although current approach works well for small to
medium workloads, we still have to do some additional work to avoid
performance degradation after restart of large instances.
[Rendered
version](https://github.com/neondatabase/neon/blob/alexk/pg-prewarm-rfc/docs/rfcs/2025-03-17-compute-prewarm.md)
Part of https://github.com/neondatabase/cloud/issues/19011
## Problem
The billing team wants to change the billing events pipeline and use a
common events format in S3 buckets across different event producers.
## Summary of changes
Change the events storage format for billing events from JSON to NDJSON.
Also partition files by hours, rather than days.
Resolves: https://github.com/neondatabase/cloud/issues/29995
We would like to move towards an arena system for JSON encoding the
responses. This change pushes an "out" parameter into the pg-test to
json routines to make swapping in an arena system easier in the future.
(see #11992)
This additionally removes the redundant `column: &[Type]` argument, as
well as rewriting the pg_array parser.
---
I rewrote the pg_array parser since while making these changes I found
it hard to reason about. I went back to the specification and rewrote it
from scratch. There's 4 separate routines:
1. pg_array_parse - checks for any prelude (multidimensional array
ranges)
2. pg_array_parse_inner - only deals with the arrays themselves
3. pg_array_parse_item - parses a single item from the array, this might
be quoted, unquoted, or another nested array.
4. pg_array_parse_quoted - parses a quoted string, following the
relevant string escaping rules.
Don't print errors like:
```
Compaction failed 1 times, retrying in 2s: Failed to offload timeline: Unexpected offload error: Timeline deletion is already in progress
```
Print it at info log level instead.
https://github.com/neondatabase/cloud/issues/30666
## Problem
https://github.com/neondatabase/neon/pull/11712 changed how computes are
started in the test: the lsn is specified, making them read-only static
replicas. Lsn is `last_record_lsn` from pageserver. It works fine with
read-only branches (because their `last_record_lsn` is equal to
`start_lsn` and always valid). But with writable timelines, the
`last_record_lsn` on the pageserver might be stale.
Particularly in this test, after the `detach_branch` operation, the
tenant is reset on the pagesever. It leads to `last_record_lsn` going
back to `disk_consistent_lsn`, so basically rolling back some recent
writes.
If we start a primary compute, it will start at safekeepers' commit Lsn,
which is the correct one , and will wait till pageserver catches up with
this Lsn after reset.
- Closes: https://github.com/neondatabase/neon/issues/12365
## Summary of changes
- Start `primary` compute for writable timelines.
## Problem
Part of #11813
## Summary of changes
* Compute tenant remote size in the housekeeping loop.
* Add a new `TenantFeatureResolver` struct to cache the tenant-specific
properties.
* Evaluate feature flag based on the remote size.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
close https://github.com/neondatabase/neon/issues/11528
## Summary of changes
Gives us better observability of compaction progress.
- Image creation: num of partition processed / total partition
- Gc-compaction: index of the in the queue / total items for a full
compaction
- Shard ancestor compaction: layers to rewrite / total layers
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
L0 compaction currently holds the read lock for a long region while it
doesn't need to.
## Summary of changes
This patch reduces the one long contention region into 2 short ones:
gather the layers to compact at the beginning, and several short read
locks when querying the image coverage.
Co-Authored-By: Chen Luo
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
This PR works on adding dynamic masking support for `anon` v2. It
currently only supports static masking.
## Summary of changes
Added a security definer function that sets the dynamic masking guc to
`true` with superuser permissions.
Added a security definer function that adds `anon` to
`session_preload_libraries` if it's not already present.
Related to: https://github.com/neondatabase/cloud/issues/20456
## Problem
Currently we do not have metrics for autovacuum.
## Summary of changes
Added a metric that extracts the top 5 DBs with oldest mxid and frozen
xid. Tables that were vacuumed recently should have younger value (or
younger age).
Related Issue: https://github.com/neondatabase/cloud/issues/27296
## Problem
#8843
User initiated sql queries are being classified as "postgres" errors,
whereas they're really user errors.
## Summary of changes
Classify user-initiated postgres errors as user errors if they are
related to a sql query that we ran on their behalf. Do not log those
errors.
Add support to transport syslogs over TLS. Since TLS params essentially
require passing host and port separately, add a boolean flag to the
configuration template and also use the same `action` format for
plaintext logs. This allows seamless transition.
The plaintext host:port is picked from `AUDIT_LOGGING_ENDPOINT` (as
earlier) and from `AUDIT_LOGGING_TLS_ENDPOINT`. The TLS host:port is
used when defined and non-empty.
`remote_endpoint` is split separately to hostname and port as required
by `omfwd` module.
Also the address parsing and config content generation are split to more
testable functions with basic tests added.
## Problem
We persist safekeeper host/port in the storcon DB after
https://github.com/neondatabase/neon/pull/11712, so the storcon fails to
ping safekeepers in the compatibility tests, where we start the cluster
from the snapshot.
PR also adds some small code improvements related to the test failure.
- Closes: https://github.com/neondatabase/neon/issues/12339
## Summary of changes
- Update safekeeper ports in the storcon DB when starting the neon from
the dir (snapshot)
- Fail the response on all not-success codes (e.g. 3xx). Should not
happen, but just to be more safe.
- Add `neon_previous/` to .gitignore to make it easier to run compat
tests.
- Add missing EXPORT to the instruction for running compat tests
## Problem
Previously, the background worker that collects the list of installed
extensions across DBs had a timeout set to 1 hour. This cause a problem
with computes that had a `suspend_timeout` > 1 hour as this collection
was treated as activity, preventing compute shutdown.
Issue: https://github.com/neondatabase/cloud/issues/30147
## Summary of changes
Passing the `suspend_timeout` as part of the `ComputeSpec` so that any
updates to this are taken into account by the background worker and
updates its collection interval.
If a hardlink operation inside `detach_ancestor` fails due to the layer
already existing, we delete the layer to make sure the source is one we
know about, and then retry.
But we deleted the wrong file, namely, the one we wanted to use as the
source of the hardlink. As a result, the follow up hard link operation
failed. Our PR corrects this mistake.
## Problem
We don't notify cplane about safekeeper membership change yet. Without
the notification the compute needs to know all the safekeepers on the
cluster to be able to speak to them. Change notifications will allow to
avoid it.
- Closes: https://github.com/neondatabase/neon/issues/12188
## Summary of changes
- Implement `notify_safekeepers` method in `ComputeHook`
- Notify cplane about safekeepers in `safekeeper_migrate` handler.
- Update the test to make sure notifications work.
## Out of scope
- There is `cplane_notified_generation` field in `timelines` table in
strocon's database. It's not needed now, so it's not updated in the PR.
Probably we can remove it.
- e2e tests to make sure it works with a production cplane
## Problem
The gRPC API does not provide LSN leases.
## Summary of changes
* Add LSN lease support to the gRPC API.
* Use gRPC LSN leases for static computes with `grpc://` connstrings.
* Move `PageserverProtocol` into the `compute_api::spec` module and
reuse it.
## Problem
Location config changes can currently result in changes to the shard
identity. Such changes will cause data corruption, as seen with #12217.
Resolves#12227.
Requires #12377.
## Summary of changes
Assert that the shard identity does not change on location config
updates and on (re)attach.
This is currently asserted with `critical!`, in case it misfires in
production. Later, we should reject such requests with an error and turn
this into a proper assertion.
Implicit mapping to an `anyhow::Error` when we do `?` is discouraged
because tooling to find those places isn't great.
As a drive-by, also make SplitImageLayerWriter::new infallible and sync.
I think we should also make ImageLayerWriter::new completely lazy,
then `BatchLayerWriter:new` infallible and async.
Reverts neondatabase/neon#11663 and
https://github.com/neondatabase/neon/pull/11265/
Step Security is not yet approved by Databricks team, in order to
prevent issues during Github org migration, I'll revert this PR to use
the previous action instead of Step Security maintained action.
## Problem
Similarly to #12217, the following endpoints may result in a stripe size
mismatch between the storage controller and Pageserver if an unsharded
tenant has a different stripe size set than the default. This can lead
to data corruption if the tenant is later manually split without
specifying an explicit stripe size, since the storage controller and
Pageserver will apply different defaults. This commonly happens with
tenants that were created before the default stripe size was changed
from 32k to 2k.
* `PUT /v1/tenant/config`
* `PATCH /v1/tenant/config`
These endpoints are no longer in regular production use (they were used
when cplane still managed Pageserver directly), but can still be called
manually or by tests.
## Summary of changes
Retain the current shard parameters when updating the location config in
`PUT | PATCH /v1/tenant/config`.
Also opportunistically derive `Copy` for `ShardParameters`.
## Problem
Some of the design decisions in PR #12256 were influenced by the
requirements of consistency tests. These decisions introduced
intermediate logic that is no longer needed and should be cleaned up.
## Summary of Changes
- Remove the `feature("testing")` flag related to
`kick_secondary_download`.
- Set the default value of `kick_secondary_download` back to false,
reflecting the intended production behavior.
Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
## Problem
- Closes: https://github.com/neondatabase/neon/issues/12298
## Summary of changes
- Set `timeline_safekeeper_count` in `neon_local` if we have less than 3
safekeepers
- Remove `cfg!(feature = "testing")` code from
`safekeepers_for_new_timeline`
- Change `timeline_safekeeper_count` type to `usize`
## Problem
`compute_ctl` should support gRPC base backups.
Requires #12111.
Requires #12243.
Touches #11926.
## Summary of changes
Support `grpc://` connstrings for `compute_ctl` base backups.
## Problem
The problem has been well described in already-commited PR #11853.
tl;dr: BufferedWriter is sensitive to cancellation, which the previous
approach was not.
The write path was most affected (ingest & compaction), which was mostly
fixed in #11853:
it introduced `PutError` and mapped instances of `PutError` that were
due to cancellation of underlying buffered writer into
`CreateImageLayersError::Cancelled`.
However, there is a long tail of remaining errors that weren't caught by
#11853 that result in `CompactionError::Other`s, which we log with great
noise.
## Solution
The stack trace logging for CompactionError::Other added in #11853
allows us to chop away at that long tail using the following pattern:
- look at the stack trace
- from leaf up, identify the place where we incorrectly map from the
distinguished variant X indicating cancellation to an `anyhow::Error`
- follow that anyhow further up, ensuring it stays the same anyhow all
the way up in the `CompactionError::Other`
- since it stayed one anyhow chain all the way up, root_cause() will
yield us X
- so, in `log_compaction_error`, add an additional `downcast_ref` check
for X
This PR specifically adds checks for
- the flush task cancelling (FlushTaskError, BlobWriterError)
- opening of the layer writer (GateError)
That should cover all the reports in issues
- https://github.com/neondatabase/cloud/issues/29434
- https://github.com/neondatabase/neon/issues/12162
## Refs
- follow-up to #11853
- fixup of / fixes https://github.com/neondatabase/neon/issues/11762
- fixes https://github.com/neondatabase/neon/issues/12162
- refs https://github.com/neondatabase/cloud/issues/29434
We don't have cancellation support for timeline deletions. In other
words, timeline deletion might still go on in an older generation while
we are attaching it in a newer generation already, because the
cancellation simply hasn't reached the deletion code.
This has caused us to hit a situation with offloaded timelines in which
the timeline was in an unrecoverable state: always returning an accepted
response, but never a 404 like it should be.
The detailed description can be found in
[here](https://github.com/neondatabase/cloud/issues/30406#issuecomment-3008667859)
(private repo link).
TLDR:
1. we ask to delete timeline on old pageserver/generation, starts
process in background
2. the storcon migrates the tenant to a different pageserver.
- during attach, the pageserver still finds an index part, so it adds it
to `offloaded_timelines`
4. the timeline deletion finishes, removing the index part in S3
5. there is a retry of the timeline deletion endpoint, sent to the new
pageserver location. it is bound to fail however:
- as the index part is gone, we print `Timeline already deleted in
remote storage`.
- the problem is that we then return an accepted response code, and not
a 404.
- this confuses the code calling us. it thinks the timeline is not
deleted, so keeps retrying.
- this state never gets recovered from until a reset/detach, because of
the `offloaded_timelines` entry staying there.
This is where this PR fixes things: if no index part can be found, we
can safely assume that the timeline is gone in S3 (it's the last thing
to be deleted), so we can remove it from `offloaded_timelines` and
trigger a reupload of the manifest. Subsequent retries will pick that
up.
Why not improve the cancellation support? It is a more disruptive code
change, that might have its own risks. So we don't do it for now.
Fixes https://github.com/neondatabase/cloud/issues/30406
Mainly for general readability. Some notable changes:
- Postgres can be built without the rest of the repository, and in
particular without any of the Rust bits. Some CI scripts took advantage
of that, so let's make that more explicit by separating those parts.
Also add an explicit comment about that in the new postgres.mk file.
- Add a new PG_INSTALL_CACHED variable. If it's set, `make all` and
other top-Makefile targets skip checking if Postgres is up-to-date. This
is also to be used in CI scripts that build and cache Postgres as
separate steps. (It is currently only used in the macos walproposer-lib
rule, but stay tuned for more.)
- Introduce a POSTGRES_VERSIONS variable that lists all supported
PostgreSQL versions. Refactor a few Makefile rules to use that.
## Problem
In large oltp test run
https://github.com/neondatabase/neon/actions/runs/15905488707/job/44859116742
we see that the `Benchmark database maintenance` step is skipped in all
3 strategy variants, however it should be executed in two.
This is due to treating the `test_maintenance` boolean type in the
strategy in the condition of the `Benchmark database maintenance` step
## Summary of changes
Use a boolean condition instead of a string comparison
## Test run from this pull request branch
https://github.com/neondatabase/neon/actions/runs/15923605412
## Problem
When resolving a shard during a split we might have multiple attached
shards with the old shard count (i.e. not all of them are marked in
progress and ignored). Hence, we can compute the desired shard number
based on the old shard count and misroute the request.
## Summary of Changes
Recompute the desired shard every time the shard count changes during
the iteration
# Problem
In #12335 I moved the `authenticate` method outside of the
`connect_to_compute` loop. This triggered [e2e tests to become
flaky](https://github.com/neondatabase/cloud/pull/30533). This
highlighted an edge case we forgot to consider with that change.
When we connect to compute, the compute IP might be cached. This cache
hit might however be stale. Because we can't validate the IP is
associated with a specific compute-id☨, we will succeed the
connect_to_compute operation and fail when it comes to password
authentication☨☨. Before the change, we were invalidating the cache and
triggering wake_compute if the authentication failed.
Additionally, I noticed some faulty logic I introduced 1 year ago
https://github.com/neondatabase/neon/pull/8141/files#diff-5491e3afe62d8c5c77178149c665603b29d88d3ec2e47fc1b3bb119a0a970afaL145-R147
☨ We can when we roll out TLS, as the certificate common name includes
the compute-id.
☨☨ Technically password authentication could pass for the wrong compute,
but I think this would only happen in the very very rare event that the
IP got reused **and** the compute's endpoint happened to be a
branch/replica.
# Solution
1. Fix the broken logic
2. Simplify cache invalidation (I don't know why it was so convoluted)
3. Add a loop around connect_to_compute + authenticate to re-introduce
the wake_compute invalidation we accidentally removed.
I went with this approach to try and avoid interfering with
https://github.com/neondatabase/neon/compare/main...cloneable/proxy-pglb-connect-compute-split.
The changes made in commit 3 will move into `handle_client_request` I
suspect,
## Problem
Some pageservers hit `max_size_entries` limit in staging with only ~25
MiB storage used by basebackup cache. The limit is too strict. It should
be safe to relax it.
- Part of https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Increase the default `max_size_entries` from 1000 to 10000
## Problem
Druing shard splits we shut down the remote client early and allow the
parent shard to keep ingesting data. While ingesting data, the wal
receiver task may wait for the current flush to complete in order to
apply backpressure. Notifications are delivered via
`Timeline::layer_flush_done_tx`.
When the remote client was being shut down the flush loop exited
whithout delivering a notification. This left
`Timeline::wait_flush_completion` hanging indefinitely which blocked the
shutdown of the wal receiver task, and, hence, the shard split.
## Summary of Changes
Deliver a final notification when the flush loop is shutting down
without the timeline cancel cancellation token having fired. I tried
writing a test for this, but got stuck in failpoint hell and decided
it's not worth it.
`test_sharding_autosplit`, which reproduces this reliably in CI, passed
with the proposed fix in
https://github.com/neondatabase/neon/pull/12304.
Closes https://github.com/neondatabase/neon/issues/12060
Make the safekeeper `pull_timeline` endpoint support timelines that
haven't had any writes yet. In the storcon managed sk timelines world,
if a safekeeper goes down temporarily, the storcon will schedule a
`pull_timeline` call. There is no guarantee however that by when the
safekeeper is online again, there have been writes to the timeline yet.
The `snapshot` endpoint gives an error if the timeline hasn't had
writes, so we avoid calling it if `timeline_start_lsn` indicates a
freshly created timeline.
Fixes#11422
Part of #11670
## Problem
gRPC base backups do not use the base backup cache.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
Integrate gRPC base backups with the base backup cache.
Also fixes a bug where the base backup cache did not differentiate
between primary/replica base backups (at least I think that's a bug?).
## Problem
In our infra config, we have to split server_api_key and other fields in
two files: the former one in the sops file, and the latter one in the
normal config. It creates the situation that we might misconfigure some
regions that it only has part of the fields available, causing
storcon/pageserver refuse to start.
## Summary of changes
Allow PostHog config to have part of the fields available. Parse it
later.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Basebackup cache now uses unbounded channel for prepare requests. In
theory it can grow large if the cache is hung and does not process the
requests.
- Part of https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Replace an unbounded channel with a bounded one, the size is
configurable.
- Add `pageserver_basebackup_cache_prepare_queue_size` to observe the
size of the queue.
- Refactor a bit to move all metrics logic to `basebackup_cache.rs`
I was looking at how we could expose our proxy config as toml again, and
as I was writing out the schema format, I noticed some cruft in our CLI
args that no longer seem to be in use.
The redis change is the most complex, but I am pretty sure it's sound.
Since https://github.com/neondatabase/cloud/pull/15613 cplane longer
publishes to the global redis instance.
## Problem
If LFC generation is changed then `lfc_readv_select` will return -1 but
pages are still marked as available in bitmap.
## Summary of changes
Update bitmap after generation check.
Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com>
## Problem
Fix for https://github.com/neondatabase/neon/pull/12324
## Summary of changes
Need `serde(default)` to allow this field not present in the config,
otherwise there will be a config deserialization error.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
gRPC base backups use gRPC compression. However, this has two problems:
* Base backup caching will cache compressed base backups (making gRPC
compression pointless).
* Tonic does not support varying the compression level, and zstd default
level is 10% slower than gzip fastest level.
Touches https://github.com/neondatabase/neon/issues/11728.
Touches https://github.com/neondatabase/cloud/issues/29353.
## Summary of changes
This patch adds a gRPC parameter `BaseBackupRequest::compression`
specifying the compression algorithm. It also moves compression into
`send_basebackup_tarball` to reduce code duplication.
A follow-up PR will integrate the base backup cache with gRPC.
## Summary
A design for a storage system that allows storage of files required to
make
Neon's Endpoints have a better experience at or after a reboot.
## Motivation
Several systems inside PostgreSQL (and Neon) need some persistent
storage for
optimal workings across reboots and restarts, but still work without.
Examples are the cumulative statistics file in `pg_stat/global.stat`,
`pg_stat_statements`' `pg_stat/pg_stat_statements.stat`, and
`pg_prewarm`'s
`autoprewarm.blocks`. We need a storage system that can store and manage
these files for each Endpoint.
[GH rendered
file](https://github.com/neondatabase/neon/blob/MMeent/rfc-unlogged-file/docs/rfcs/040-Endpoint-Persistent-Unlogged-Files-Storage.md)
Part of https://github.com/neondatabase/cloud/issues/24225
## Problem
part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
It costs $$$ to directly retrieve the feature flags from the pageserver.
Therefore, this patch adds new APIs to retrieve the spec from the
storcon and updates it via pageserver.
* Storcon retrieves the feature flag and send it to the pageservers.
* If the feature flag gets updated outside of the normal refresh loop of
the pageserver, pageserver won't fetch the flags on its own as long as
the last updated time <= refresh_period.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
https://github.com/neondatabase/cloud/issues/30539
If the current leader cancels the `call` function, then it has removed
the jobs from the queue, but will never finish sending the responses.
Because of this, it is not cancellation safe.
## Summary of changes
Document these functions as not cancellation safe. Move cancellation of
the queued jobs into the queue itself.
## Alternatives considered
1. We could spawn the task that runs the batch, since that won't get
cancelled.
* This requires `fn call(self: Arc<Self>)` or `fn call(&'static self)`.
2. We could add another scopeguard and return the requests back to the
queue.
* This requires that requests are always retry safe, and also requires
requests to be `Clone`.
## Problem
https://github.com/neondatabase/cloud/issues/30539
If the current leader cancels the `call` function, then it has removed
the jobs from the queue, but will never finish sending the responses.
Because of this, it is not cancellation safe.
## Summary of changes
Document these functions as not cancellation safe. Move cancellation of
the queued jobs into the queue itself.
## Alternatives considered
1. We could spawn the task that runs the batch, since that won't get
cancelled.
* This requires `fn call(self: Arc<Self>)` or `fn call(&'static self)`.
2. We could add another scopeguard and return the requests back to the
queue.
* This requires that requests are always retry safe, and also requires
requests to be `Clone`.
## Problem
While working more on TLS to compute, I realised that Console Redirect
-> pg-sni-router -> compute would break if channel binding was set to
prefer. This is because the channel binding data would differ between
Console Redirect -> pg-sni-router vs pg-sni-router -> compute.
I also noticed that I actually disabled channel binding in #12145, since
`connect_raw` would think that the connection didn't support TLS.
## Summary of changes
Make sure we specify the channel binding.
Make sure that `connect_raw` can see if we have TLS support.
In `test_layer_download_cancelled_by_config_location`, we simulate hung
downloads via the `before-downloading-layer-stream-pausable` failpoint.
Then, we cancel a timeline via the `location_config` endpoint.
With the new default as of
https://github.com/neondatabase/neon/pull/11712, we would be creating
the timeline on safekeepers regardless if there have been writes or not,
and it turns out the test relied on the timeline not existing on
safekeepers, due to a cancellation bug:
* as established before, the test makes the read path hang
* the timeline cancellation function first cancels the walreceiver, and
only then cancels the timeline's token
* `WalIngest::new` is requesting a checkpoint, which hits the read path
* at cancellation time, we'd be hanging inside the read, not seeing the
cancellation of the walreceiver
* the test would time out due to the hang
This is probably also reproducible in the wild when there is S3
unavailabilies or bottlenecks. So we thought that it's worthwhile to fix
the hang issue. The approach chosen in the end involves the
`tokio::select` macro.
In PR 11712, we originally punted on the test due to the hang and opted
it out from the new default, but now we can use the new default.
Part of https://github.com/neondatabase/neon/issues/12299
pgaudit can spam logs due to all the monitoring that we do. Logs from
these connections are not necessary for HIPPA compliance, so we can stop
logging from those connections.
Part-of: https://github.com/neondatabase/cloud/issues/29574
Signed-off-by: Tristan Partin <tristan@neon.tech>
This makes it possible for the compiler to validate that a match block
matched all PostgreSQL versions we support.
## Problem
We did not have a complete picture about which places we had to test
against PG versions, and what format these versions were: The full PG
version ID format (Major/minor/bugfix `MMmmbb`) as transfered in
protocol messages, or only the Major release version (`MM`). This meant
type confusion was rampant.
With this change, it becomes easier to develop new version-dependent
features, by making type and niche confusion impossible.
## Summary of changes
Every use of `pg_version` is now typed as either `PgVersionId` (u32,
valued in decimal `MMmmbb`) or PgMajorVersion (an enum, with a value for
every major version we support, serialized and stored like a u32 with
the value of that major version)
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
The billing team wants to change the billing events pipeline and use a
common events format in S3 buckets across different event producers.
## Summary of changes
Change the events storage format for billing events from JSON to NDJSON.
Resolves: https://github.com/neondatabase/cloud/issues/29994
## Problem
PGLB will do the connect_to_compute logic, neonkeeper will do the
session establishment logic. We should split it.
## Summary of changes
Moves postgres authentication to compute to a separate routine that
happens after connect_to_compute.
The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.
This is therefore a preparation PR (like similar PRs before it).
There is a lot of changes for this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).
The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
## Problem
`Attached(0)` tenant migrations can get stuck if the heatmap file has
not been uploaded.
## Summary of Changes
- Added a test to reproduce the issue.
- Introduced a `kick_secondary_downloads` config flag:
- Enabled in testing environments.
- Disabled in production (and in the new test).
- Updated `Attached(0)` locations to consider the number of secondaries
in their intent when deciding whether to download the heatmap.
## Problem
Part of #11813
## Summary of changes
Add a test API to make it easier to manipulate the feature flags within
tests.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The scheduler uses total shards per AZ to select the AZ for newly
created or attached tenants.
This makes bad decisions when we have different node counts per AZ -- we
might have 2 very busy pageservers in one AZ, and 4 more lightly loaded
pageservers in other AZs, and the scheduler picks the busy pageservers
because the total shard count in their AZ is lower.
## Summary of changes
- Divide the shard count by the number of nodes in the AZ when scoring
in `get_az_for_new_tenant`
---------
Co-authored-by: John Spray <john.spray@databricks.com>
The `--gzip-probability` parameter was removed in #12250. However,
`test_basebackup_with_high_slru_count` still uses it, and keeps failing.
This patch removes the use of the parameter (gzip is enabled by
default).
## Problem
Part of #11813
## Summary of changes
The current interval is 30s and it costs a lot of $$$. This patch
reduced it to 600s refresh interval (which means that it takes 10min for
feature flags to propagate from UI to the pageserver). In the future we
can let storcon retrieve the feature flags and push it to pageservers.
We can consider creating a new release or we can postpone this to the
week after the next week.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
The 'make all' step must run always. PR #12311 accidentally left the
condition in there to skip it if there were no changes in postgres v14
sources. That condition belonged to a whole different step that was
removed altogether in PR#12311, and the condition should've been removed
too.
Per CI failure:
https://github.com/neondatabase/neon/actions/runs/15820148967/job/44587394469
## Problem
Previously, we couldn't read from an in-memory layer while a batch was
being written to it. Vice-versa, we couldn't write to it while there
was an on-going read.
## Summary of Changes
The goal of this change is to improve concurrency. Writes happened
through a &mut self method so the enforcement was at the type system
level.
We attempt to improve by:
1. Adding interior mutability to EphemeralLayer. This involves wrapping
the buffered writer in a read-write lock.
2. Minimise the time that the read lock is held for. Only hold the read
lock while reading from the buffers (recently flushed or pending
flush). If we need to read from the file, drop the lock and allow IO
to be concurrent.
The new benchmark variants with concurrent reads improve between 70 to
200 percent (against main).
Benchmark results are in this
[commit](891f094ce6).
## Future Changes
We can push the interior mutability into the buffered writer. The
mutable tail goes under a read lock, the flushed part goes into an
ArcSwap and then we can read from anything that is flushed _without_ any
locking.
gRPC base backups send a stream of fixed-size 64KB chunks.
pagebench basebackup with compression enabled shows this to reduce
throughput:
* 64 KB: 55 RPS
* 128 KB: 69 RPS
* 256 KB: 73 RPS
* 1024 KB: 73 RPS
This patch sets the base backup chunk size to 256 KB.
The build-tools image contains various build tools and dependencies,
mostly Rust-related. The compute image build used it to build
compute_ctl and a few other little rust binaries that are included in
the compute image. However, for extensions built in Rust (pgrx), the
build used a different layer which installed the rust toolchain using
rustup.
Switch to using the same rust toolchain for both pgrx-based extensions
and compute_ctl et al. Since we don't need anything else from the
build-tools image, I switched to using the toolchain installed with
rustup, and eliminated the dependency to build-tools altogether. The
compute image build no longer depends on build-tools.
Note: We no longer use 'mold' for linking compute_ctl et al, since mold
is not included in the build-deps-with-cargo layer. We could add it
there, but it doesn't seem worth it. I proposed stopping using mold
altogether in https://github.com/neondatabase/neon/pull/10735, but that
was rejected because 'mold' is faster for incremental builds. That
doesn't matter much for docker builds however, since they're not
incremental, and the compute binaries are not as large as the storage
server binaries anyway.
To avoid duplicating the build logic. `make all` covers the separate
`postgres-*` and `neon-pg-ext` steps, and also does `cargo build`.
That's how you would typically do a full local build anyway.
Every time you run `make`, it runs `make install` on all the PostgreSQL
sources, which copies the header files. That in turn triggers a rebuild
of the `postgres_ffi` crate, and everything that depends on it. We had
worked around this earlier (see #2458), by passing a custom INSTALL
script to the Postgres makefiles, which refrains from updating the
modification timestamp on headers when they have not been changed, but
the v14 makefile didn't obey INSTALL for the header files. Backporting
c0a1d7621b to v14 fixes that.
This backports upstream PostgreSQL commit c0a1d7621b to v14.
Corresponding PR in the 'postgres' repo:
https://github.com/neondatabase/postgres/pull/660
## Problem
In #12217, we began passing the stripe size in reattach responses, and
persisting it in the on-disk state. This is necessary to ensure the
storage controller and Pageserver have a consistent view of the intended
stripe size of unsharded tenants, which will be used for splits that do
not specify a stripe size. However, for backwards compatibility, these
stripe sizes were optional.
## Summary of changes
Make the stripe sizes required for reattach responses and on-disk
location configs. These will always be provided by the previous
(current) release.
The previous behavior was for the compute to override control plane
options if there was a conflict. We want to change the behavior so that
the control plane has the absolute power on what is right. In the event
that we need a new option passed to the compute as soon as possible, we
can initially roll it out in the control plane, and then migrate the
option to EXTRA_OPTIONS within the compute later, for instance.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
We need to specify the number of safekeepers for neon_local without
`testing` feature.
Also we need this option for testing different configurations of
safekeeper migration code.
We cannot set it in `neon_fixtures.py` and in the default config of
`neon_local` yet, because it will fail compatibility tests. I'll make a
separate PR with removing `cfg!("testing")` completely and specifying
this option in the config when this option reaches the release branch.
- Part of https://github.com/neondatabase/neon/issues/12298
## Summary of changes
- Add `timeline_safekeeper_count` config option to storcon and
neon_local
Enable it across tests and set it as default. Marks the first milestone
of https://github.com/neondatabase/neon/issues/9114. We already enabled
it in all AWS regions and planning to enable it in all Azure regions
next week.
will merge after we roll out in all regions.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Timeline GC is very aggressive with regards to layer map locking.
We've seen timelines with loads of layers in production that hold the
write lock for the layer map for 30 minutes at a time.
This blocks reads and the write path to some extent.
## Summary of changes
Determining the set of layers to GC is done under the read lock.
Applying the updates is done under the write lock.
Previously, everything was done under write lock.
See #11942
Idea:
* if connections are short lived, they can get enqueued and then also
remove themselves later if they never made it to redis. This reduces the
load on the queue.
* short lived connections (<10m, most?) will only issue 1 command, we
remove the delete command and rely on ttl.
* we can enqueue as many commands as we want, as we can always cancel
the enqueue, thanks to the ~~intrusive linked lists~~ `BTreeMap`.
This improves `pagebench getpage-latest-lsn` gRPC support by:
* Using `page_api::Client`.
* Removing `--protocol`, and using the `page-server-connstring` scheme
instead.
* Adding `--compression` to enable zstd compression.
## Problem
Pagebench does not support gRPC for `basebackup` benchmarks.
Requires #12243.
Touches #11728.
## Summary of changes
Add gRPC support via gRPC connstrings, e.g. `pagebench basebackup
--page-service-connstring grpc://localhost:51051`.
Also change `--gzip-probability` to `--no-compression`, since this must
be specified per-client for gRPC.
Introduce a separate `postgres_ffi_types` crate which contains a few
types and functions that were used in the API. `postgres_ffi_types` is a
much small crate than `postgres_ffi`, and it doesn't depend on bindgen
or the Postgres C headers.
Move NeonWalRecord and Value types to wal_decoder crate. They are only
used in the pageserver-safekeeper "ingest" API. The rest of the ingest
API types are defined in wal_decoder, so move these there as well.
We have been bad at keeping them up-to-date, several contrib modules and
neon extensions were missing from the clean rules. Give up trying, and
remove the targets altogether. In practice, it's straightforward to just
do `rm -rf pg_install/build`, so the clean-targets are hardly worth the
maintenance effort.
I kept `make distclean` though. The rule for that is simple enough.
Since commit 87ad50c925, storage_controller has used diesel_async, which
in turn uses tokio-postgres as the Postgres client, which doesn't
require libpq. Thus we no longer need libpq in the storage image.
## Problem
The gRPC page service should support compression.
Requires #12111.
Touches #11728.
Touches https://github.com/neondatabase/cloud/issues/25679.
## Summary of changes
Add support for gzip and zstd compression in the server, and a client
parameter to enable compression.
This will need further benchmarking under realistic network conditions.
## Problem
GC info is an input to updating layer visibility.
Currently, gc info is updated on timeline activation and visibility is
computed on tenant attach, so we ignore branch points and compute
visibility by taking all layers into account.
Side note: gc info is also updated when timelines are created and
dropped. That doesn't help because we create the timelines in
topological order from the root. Hence the root timeline goes first,
without context of where the branch points are.
The impact of this in prod is that shards need to rehydrate layers after
live migration since the non-visible ones were excluded from the
heatmap.
## Summary of Changes
Move the visibility calculation into tenant attachment instead of
activation.
## Problem
`test_runner` integration tests should support gRPC.
Touches #11926.
## Summary of changes
* Enable gRPC for Pageservers, with dynamic port allocations.
* Add a `grpc` parameter for endpoint creation, plumbed through to
`neon_local endpoint create`.
No tests actually use gRPC yet, since computes don't support it yet.
## Problem
`test_sharding_failures` is flaky due to interference from the
`background_reconcile` process.
The details are in the issue
https://github.com/neondatabase/neon/issues/12029.
## Summary of changes
- Use `reconcile_until_idle` to ensure a stable state before running
test assertions
- Added error tolerance in `reconcile_until_idle` test function (Failure
cases: 1, 3, 19, 20)
- Ignore the `Keeping extra secondaries` warning message since it i
retryable (Failure case: 2)
- Deduplicated code in `assert_rolled_back` and `assert_split_done`
- Added a log message before printing plenty of Node `X` seen on
pageserver `Y`
## Problem
- We run the large tenant oltp workload with a fixed size (larger than
existing customers' workloads).
Our customer's workloads are continuously growing and our testing should
stay ahead of the customers' production workloads.
- we want to touch all tables in the tenant's database (updates) so that
we simulate a continuous change in layer files like in a real production
workload
- our current oltp benchmark uses a mixture of read and write
transactions, however we also want a separate test run with read-only
transactions only
## Summary of changes
- modify the existing workload to have a separate run with pgbench
custom scripts that are read-only
- create a new workload that
- grows all large tables in each run (for the reuse branch in the large
oltp tenant's project)
- updates a percentage of rows in all large tables in each run (to
enforce table bloat and auto-vacuum runs and layer rebuild in
pageservers
Each run of the new workflow increases the logical database size about
16 GB.
We start with 6 runs per day which will give us about 96-100 GB growth
per day.
---------
Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>
- Add optional `?mode=fast|immediate` to `/terminate`, `fast` is
default. Immediate avoids waiting 30
seconds before returning from `terminate`.
- Add `TerminateMode` to `ComputeStatus::TerminationPending`
- Use `/terminate?mode=immediate` in `neon_local` instead of `pg_ctl
stop` for `test_replica_promotes`.
- Change `test_replica_promotes` to check returned LSN
- Annotate `finish_sync_safekeepers` as `noreturn`.
https://github.com/neondatabase/cloud/issues/29807
This check API only cheks the safekeeper_connstrings at the moment, and
the validation is limited to checking we have at least one entry in
there, and no duplicates.
## Problem
If the compute_ctl service is started with an empty list of safekeepers,
then hard-to-debug errors may happen at runtime, where it would be much
easier to catch them early.
## Summary of changes
Add an entry point in the compute_ctl API to validate the configuration
for safekeeper_connstrings.
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Create a separate stage for downloading the Rust toolchain for pgrx, so
that it can be cached independently of the pg-build layer. Before this,
the 'pg-build-nonroot=with-cargo' layer was unnecessarily rebuilt every
time there was a change in PostgreSQL sources. Furthermore, this allows
using the same cached layer for building the compute images of all
Postgres versions.
## Problem
Full basebackups are used in tests, and may be useful for debugging as
well, so we should support them in the gRPC API.
Touches #11728.
## Summary of changes
Add `GetBaseBackupRequest::full` to generate full base backups.
The libpq implementation also allows specifying `prev_lsn` for full
backups, i.e. the end LSN of the previous WAL record. This is omitted in
the gRPC API, since it's not used by any tests, and presumably of
limited value since it's autodetected. We can add it later if we find
that we need it.
## Problem
The checksum for eigen (a dependency for onnxruntime) has changed which
breaks compute image build.
## Summary of changes
- Add a patch for onnxruntime which backports changes from
f57db79743
(we keep the current version)
Ref https://github.com/microsoft/onnxruntime/issues/24861
## Problem
https://github.com/neondatabase/neon/issues/7570
Even triggers are supported only for superusers.
## Summary of changes
Temporary switch to superuser when even trigger is created and disable
execution of user's even triggers under superuser.
---------
Co-authored-by: Dimitri Fontaine <dim@tapoueh.org>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
When converting `proto::GetPageRequest` into `page_api::GetPageRequest`
and validating the request, errors are returned as `tonic::Status`. This
will tear down the GetPage stream, which is disruptive and unnecessary.
## Summary of changes
Emit invalid request errors as `GetPageResponse` with an appropriate
`status_code` instead.
Also move the conversion from `tonic::Status` to `GetPageResponse` out
into the stream handler.
## Problem
Compatibility tests may be run against a compatibility snapshot
generated with --timelines-onto-safekeepers=false. We need to start the
compute without a generation (or with 0 generation) if the timeline is
not storcon-managed, otherwise the compute will hang.
- Follow up on https://github.com/neondatabase/neon/pull/12203
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
- Handle compatibility snapshot generated with no
`--timelines-onot-safekeepers` properly
## Problem
The cache was introduced as a hackathon project and the only supported
limit was the number of entries.
The basebackup entry size may vary. We need to have more control over
disk space usage to ship it to production.
- Part of https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Store the size of entries in the cache and use it to limit
`max_total_size_bytes`
- Add the size of the cache in bytes to metrics.
## Problem
`neon_local` should support endpoints using gRPC, by providing `grpc://`
connstrings with the Pageservers' gRPC ports.
Requires #12268.
Touches #11926.
## Summary of changes
* Add `--grpc` switch for `neon_local endpoint create`.
* Generate `grpc://` connstrings for endpoints when enabled.
Computes don't actually support `grpc://` connstrings yet, but will
soon.
gRPC is configured when the endpoint is created, not when it's started,
such that it continues to use gRPC across restarts and reconfigurations.
In particular, this is necessary for the storage controller's local
notify hook, which can't easily plumb through gRPC configuration from
the start/reconfigure commands but has access to the endpoint's
configuration.
## Problem
The metrics `storage_controller_safekeeper_request_error` and
`storage_controller_safekeeper_request_latency` currently use
`pageserver_id` as a label.
This can be misleading, as the metrics are about safekeeper requests.
We want to replace this with a more accurate label — either
`safekeeper_id` or `node_id`.
## Summary of changes
- Introduced `SafekeeperRequestLabelGroup` with `safekeeper_id`.
- Updated the affected metrics to use the new label group.
- Fixed incorrect metric usage in safekeeper_client.rs
## Follow-up
- Review usage of these metrics in alerting rules and existing Grafana
dashboards to ensure this change does not break something.
## Problem
Pageservers now expose a gRPC API on a separate address and port. This
must be registered with the storage controller such that it can be
plumbed through to the compute via cplane.
Touches #11926.
## Summary of changes
This patch registers the gRPC address and port with the storage
controller:
* Add gRPC address to `nodes` database table and `NodePersistence`, with
a Diesel migration.
* Add gRPC address in `NodeMetadata`, `NodeRegisterRequest`,
`NodeDescribeResponse`, and `TenantLocateResponseShard`.
* Add gRPC address flags to `storcon_cli node-register`.
These changes are backwards-compatible, since all structs will ignore
unknown fields during deserialization.
## Problem
The gRPC base backup implementation has a few issues: chunks are not
properly bounded, and it's not possible to omit the LSN.
Touches #11728.
## Summary of changes
* Properly bound chunks by using a limited writer.
* Use an `Option<Lsn>` rather than a `ReadLsn` (the latter requires an
LSN).
## Problem
The `stably_attached` function is hard to read due to deeply nested
conditionals
## Summary of Changes
- Refactored `stably_attached` to use early returns and the `?` operator
for improved readability
## Problem
If the node intent includes more than one secondary, we can generate a
replace optimization using a candidate node that is already a secondary
location.
## Summary of changes
- Exclude all other secondary nodes from the scoring process to ensure
optimal candidate selection.
## Problem
We could easily miss a sanitizer-detected defect, if it occurred due to
some race condition, as we just rerun the test and if it succeeds, the
overall test run is considered successful. It was more reasonable
before, when we had much more unstable tests in main, but now we can
track all test failures.
## Summary of changes
Don't rerun failed tests.
## Problem
We don't test debug builds for v14..v16 in the regular "Build and Test"
runs to perform the testing faster, but it means we can't detect
assertion failures in those versions.
(See https://github.com/neondatabase/neon/issues/11891,
https://github.com/neondatabase/neon/issues/11997)
## Summary of changes
Add a new workflow to test all build types and all versions on all
architectures.
## Problem
In our environment, we don't always have access to the pagectl tool on
the pageserver. We have to download the page files to local env to
introspect them. Hence, it'll be useful to be able to parse the local
files using `pagectl`.
## Summary of changes
* Add `dump-layer-local` to `pagectl` that takes a local path as
argument and returns the layer content:
```
cargo run -p pagectl layer dump-layer-local ~/Desktop/000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00003E7A53EDE611-00003E7AF27BFD19-v1-00000001
```
* Bonus: Fix a bug in `pageserver/ctl/src/draw_timeline_dir.rs` in which
we don't filter out temporary files.
## Problem
The `page_api` domain types are missing a few derives.
## Summary of changes
Add `Clone`, `Copy`, and `Debug` derives for all types where
appropriate.
## Problem
Comment for `storage_controller_reconcile_long_running` metric was
copy-pasted and not updated in #9207
## Summary of changes
- Fixed comment
## Problem
Comment is in incorrect place: `/metrics` code is above its description
comment.
## Summary of changes
- `/metrics` code is now below the comment
## Problem
Need to fix naming `safkeeper` -> `safekeeper`
## Summary of changes
- `storage_controller_safkeeper_reconciles_*` renamed to
`storage_controller_safekeeper_reconciles_*`
## Problem
ChaosInjector is intended to skip non-active scheduling policies, but
the current logic skips active shards instead.
## Summary of changes
- Fixed shard eligibility condition to correctly allow chaos injection
for shards with an Active scheduling policy.
## Problem
We would easily hit this limit for a tenant running for enough long
time.
## Summary of changes
Remove the max key limit for time-travel recovery if the command is
running locally.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Part of #11813
PostHog has two endpoints to retrieve feature flags: the old project ID
one that uses personal API token, and the new one using a special
feature flag secure token that can only retrieve feature flag. The new
API I added in this patch is not documented in the PostHog API doc but
it's used in their Python SDK.
## Summary of changes
Add support for "feature flag secure token API". The API has no way of
providing a project ID so we verify if the retrieved spec is consistent
with the project ID specified by comparing the `team_id` field.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`test_historic_storage_formats` uses `/tenant_import` to import historic
data. Tenant import does not create timelines onto safekeepers, because
they might already exist on some safekeeper set. If it does, then we may
end up with two different quorums accepting WAL for the same timeline.
If the tenant import is used in a real deployment, the administrator is
responsible for looking for the proper safekeeper set and migrate
timelines into storcon-managed timelines.
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
- Create timelines onto safekeepers manually after tenant import in
`test_historic_storage_formats`
- Add a note to tenant import that timelines will be not storcon-managed
after the import.
## Problem
* Inside `compact_with_gc_inner`, there is a similar log line:
db24ba95d1/pageserver/src/tenant/timeline/compaction.rs (L3181-L3187)
* Also, I think it would be useful when debugging to have the ability to
select a particular sub-compaction job (e.g., `1/100`) to see all the
logs for that job.
## Summary of changes
* Attach a span to the `compact_with_gc_inner`.
CC: @skyzh
Sometimes during a failed redis connection attempt at the init stage
proxy pod can continuously restart. This, in turn, can aggravate the
problem if redis is overloaded.
Solves the #11114
## Problem
The location config (which includes the stripe size) is stored on
pageserver disk.
For unsharded tenants we [do not include the shard identity in the
serialized
description](ad88ec9257/pageserver/src/tenant/config.rs (L64-L66)).
When the pageserver restarts, it reads that configuration and will use
the stripe size from there
and rely on storcon input from reattach for generation and mode.
The default deserialization is ShardIdentity::unsharded. This has the
new default stripe size of 2048.
Hence, for unsharded tenants we can be running with a stripe size
different from that the one in the
storcon observed state. This is not a problem until we shard split
without specifying a stripe size (i.e. manual splits via the UI or
storcon_cli). When that happens the new shards will use the 2048 stripe
size until storcon realises and switches them back. At that point it's
too late, since we've ingested data with the wrong stripe sizes.
## Summary of changes
Ideally, we would always have the full shard identity on disk. To
achieve this over two releases we do:
1. Always persist the shard identity in the location config on the PS.
2. Storage controller includes the stripe size to use in the re attach
response.
After the first release, we will start persisting correct stripe sizes
for any tenant shard that the storage controller
explicitly sends a location_conf. After the second release, the
re-attach change kicks in and we'll persist the
shard identity for all shards.
## Problem
Base64 0.13 is outdated.
## Summary of changes
Update base64 to 0.22. Affects mostly proxy and proxy libs. Also upgrade
serde_with to remove another dep on base64 0.13 from dep tree.
## Problem
The test creates an endpoint and deletes its tenant. The compute cannot
stop gracefully because it tries to write a checkpoint shutdown record
into the WAL, but the timeline had been already deleted from
safekeepers.
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
Stop the compute before deleting a tenant
## Problem
Compatibility tests may be run against a compatibility snapshot
generated with `--timelines-onto-safekeepers=false`. We need to start
the compute without a generation (or with 0 generation) if the timeline
is not storcon-managed, otherwise the compute will hang.
This handler is needed to check if the timeline is storcon-managed.
It's also needed for better test coverage of safekeeper migration code.
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
- Implement `tenant_timeline_locate` handler in storcon to get
safekeeper info from storcon's DB
## Problem
We hold the layer map for too long on occasion.
## Summary of changes
This should help us identify the places where it's happening from.
Related https://github.com/neondatabase/neon/issues/12182
## Problem
DROP FUNCTION doesn't allow to specify default for parameters.
## Summary of changes
Remove DEFAULT clause from pgxn/neon/neon--1.6--1.5.sql
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
The project limits were not respected, resulting in errors.
## Summary of changes
Now limits are checked before running an action, and if the action is
not possible to run, another random action will be run.
---------
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
## Problem
See https://github.com/neondatabase/neon/issues/12018
Now `prefetch_register_bufferv` calculates min_ring_index of all vector
requests.
But because of pump prefetch state or connection failure, previous slots
can be already proceeded and reused.
## Summary of changes
Instead of returning minimal index, this function should return last
slot index.
Actually result of this function is used only in two places. A first
place just fort checking (and this check is redundant because the same
check is done in `prefetch_register_bufferv` itself.
And in the second place where index of filled slot is actually used,
there is just one request.
Sp fortunately this bug can cause only assert failure in debug build.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
There is this TODO in code:
https://github.com/neondatabase/neon/blob/main/pageserver/src/tenant/mgr.rs#L300-L302
This is an old TODO by @jcsp.
## Summary of changes
This PR addresses the TODO. Specifically, it removes a global static
`TENANTS`. Instead the `TenantManager` now directly manages the tenant
map. Enhancing abstraction.
Essentially, this PR moves all module-level methods to inside the
implementation of `TenantManager`.
## Problem
Looks like our sql-over-http tests get to rely on "trust"
authentication, so the path that made sure the authkeys data was set was
never being hit.
## Summary of changes
Slight refactor to WakeComputeBackends, as well as making sure auth keys
are propagated. Fix tests to ensure passwords are tested.
## Problem
`VotesCollectedMset` uses the wrong safekeeper to update truncateLsn.
This led to some failed assert later in the code during running
safekeeper migration tests.
- Relates to https://github.com/neondatabase/neon/issues/11823
## Summary of changes
Use proper safekeeper to update truncateLsn in VotesCollectedMset
## Problem
part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
Collect pageserver hostname property so that we can use it in the
PostHog UI. Not sure if this is the best way to do that -- open to
suggestions.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
part of https://github.com/neondatabase/neon/issues/7546
Add Azure time travel recovery support. The tricky thing is how Azure
handles deletes in its blob version API. For the following sequence:
```
upload file_1 = a
upload file_1 = b
delete file_1
upload file_1 = c
```
The "delete file_1" won't be stored as a version (as AWS did).
Therefore, we can never rollback to a state where file_1 is temporarily
invisible. If we roll back to the time before file_1 gets created for
the first time, it will be removed correctly.
However, this is fine for pageservers, because (1) having extra files in
the tenant storage is usually fine (2) for things like
timelines/X/index_part-Y.json, it will only be deleted once, so it can
always be recovered to a correct state. Therefore, I don't expect any
issues when this functionality is used on pageserver recovery.
TODO: unit tests for time-travel recovery.
## Summary of changes
Add Azure blob storage time-travel recovery support.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We are getting tenants with a lot of branches and num of timelines is a
good indicator of pageserver loads. I added this metrics to help us
better plan pageserver capacities.
## Summary of changes
Add `pageserver_timeline_states_count` with two labels: active +
offloaded.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
If a shard split fails and must roll back, the tenant may hit a cold
start as the parent shard's files have already been removed from local
disk.
External contribution with minor adjustments, see
https://neondb.slack.com/archives/C08TE3203RQ/p1748246398269309.
## Summary of changes
Keep the parent shard's files on local disk until the split has been
committed, such that they are available if the spilt is rolled back. If
all else fails, the files will be removed on the next Pageserver
restart.
This should also be fine in a mixed version:
* New storcon, old Pageserver: the Pageserver will delete the files
during the split, storcon will log an error when the cleanup detach
fails.
* Old storcon, new Pageserver: the Pageserver will leave the parent's
files around until the next Pageserver restart.
The change looks good to me, but shard splits are delicate so I'd like
some extra eyes on this.
The current cache invalidation messages are far too specific. They
should be more generic since it only ends up triggering a
`GetEndpointAccessControl` message anyway.
Mappings:
* `/allowed_ips_updated`, `/block_public_or_vpc_access_updated`, and
`/allowed_vpc_endpoints_updated_for_projects` ->
`/project_settings_update`.
* `/allowed_vpc_endpoints_updated_for_org` ->
`/account_settings_update`.
* `/password_updated` -> `/role_setting_update`.
I've also introduced `/endpoint_settings_update`.
All message types support singular or multiple entries, which allows us
to simplify things both on our side and on cplane side.
I'm opening a PR to cplane to apply the above mappings, but for now
using the old phrases to allow both to roll out independently.
This change is inspired by my need to add yet another cached entry to
`GetEndpointAccessControl` for
https://github.com/neondatabase/cloud/issues/28333
We might delete timelines on safekeepers before we are deleting them on
pageservers. This should be an exceptional situation, but can occur. As
the first step to improve behaviour here, emit a special error that is
less scary/obscure than "was not found in global map".
It is for example emitted when the pageserver tries to run
`IDENTIFY_SYSTEM` on a timeline that has been deleted on the safekeeper.
Found when analyzing the failure of
`test_scrubber_physical_gc_timeline_deletion` when enabling
`--timelines-onto-safekeepers` on the pytests.
Due to safekeeper restarts, there is no hard guarantee that we will keep
issuing this error, so we need to think of something better if we start
encountering this in staging/prod. But I would say that the introduction
of `--timelines-onto-safekeepers` in the pytests and into staging won't
change much about this: we are already deleting timelines from there. In
`test_scrubber_physical_gc_timeline_deletion`, we'd just be leaking the
timeline before on the safekeepers.
Part of #11712
## Problem
PGLB/Neonkeeper needs to separate the concerns of connecting to compute,
and authenticating to compute.
Additionally, the code within `connect_to_compute` is rather messy,
spending effort on recovering the authentication info after
wake_compute.
## Summary of changes
Split `ConnCfg` into `ConnectInfo` and `AuthInfo`. `wake_compute` only
returns `ConnectInfo` and `AuthInfo` is determined separately from the
`handshake`/`authenticate` process.
Additionally, `ConnectInfo::connect_raw` is in-charge or establishing
the TLS connection, and the `postgres_client::Config::connect_raw` is
configured to use `NoTls` which will force it to skip the TLS
negotiation. This should just work.
## Problem
Removed nodes can re-add themselves on restart if not properly
tombstoned. We need a mechanism (e.g. soft-delete flag) to prevent this,
especially in cases where the node is unreachable.
More details there: #12036
## Summary of changes
- Introduced `NodeLifecycle` enum to represent node lifecycle states.
- Added a string representation of `NodeLifecycle` to the `nodes` table.
- Implemented node removal using a tombstone mechanism.
- Introduced `/debug/v1/tombstone*` handlers to manage the tombstone
state.
## Problem
Makes it easier to debug AWS permission issues (i.e., storage scrubber)
## Summary of changes
Install awscliv2 into the docker image.
Signed-off-by: Alex Chi Z <chi@neon.tech>
neon_local's timeline import subcommand creates timelines manually, but
doesn't create them on the safekeepers. If a test then tries to open an
endpoint to read from the timeline, it will error in the new world with
`--timelines-onto-safekeepers`.
Therefore, if that flag is enabled, create the timelines on the
safekeepers.
Note that this import functionality is different from the fast import
feature (https://github.com/neondatabase/neon/issues/10188, #11801).
Part of #11670
As well as part of #11712
## Problem
The `computed_columns` test assumes that computed columns are always
faster than the request itself. However, this is not always the case on
Neon, which can lead to flaky results.
## Summary of changes
The `computed_columns` test is excluded from the PostGIS test for
PostgreSQL v16, accompanied by related patch refactoring.
## Problem
After introducing a naive downtime calculation for the Postgres process
inside compute in https://github.com/neondatabase/neon/pull/11346, I
noticed that some amount of computes regularly report short downtime.
After checking some particular cases, it looks like all of them report
downtime close to the end of the life of the compute, i.e., when the
control plane calls a `/terminate` and we are waiting for Postgres to
exit.
Compute monitor also produces a lot of error logs because Postgres stops
accepting connections, but it's expected during the termination process.
## Summary of changes
Regularly check the compute status inside the main compute monitor loop
and exit gracefully when the compute is in some terminal or
soon-to-be-terminal state.
---------
Co-authored-by: Tristan Partin <tristan@neon.tech>
## Problem
- `test_basebackup_cache` fails in
https://github.com/neondatabase/neon/pull/11712 because after the
timelines on safekeepers are managed by storage controller, they do
contain proper start_lsn and the compute_ctl tool sends the first
basebackup request with this LSN.
- `Failed to prepare basebackup` log messages during timeline
initialization, because the timeline is not yet in the global timeline
map.
- Relates to https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Account for `timeline_onto_safekeepers` storcon's option in the test.
- Do not trigger basebackup prepare during the timeline initialization.
## Problem
The new gRPC page service protocol supports client-side batches. The
current libpq protocol only does best-effort server-side batching.
To compare these approaches, Pagebench should support submitting
contiguous page batches, similar to how Postgres will submit them (e.g.
with prefetches or vectored reads).
## Summary of changes
Add a `--batch-size` parameter specifying the size of contiguous page
batches. One batch counts as 1 RPS and 1 queue depth.
For the libpq protocol, a batch is submitted as individual requests and
we rely on the server to batch them for us. This will give a realistic
comparison of how these would be processed in the wild (e.g. when
Postgres sends 100 prefetch requests).
This patch also adds some basic validation of responses.
## Problem
We support two ingest protocols on the pageserver: vanilla and
interpreted.
Interpreted has been the only protocol in use for a long time.
## Summary of changes
* Remove the ingest handling of the vanilla protocol
* Remove tenant and pageserver configuration for it
* Update all tests that tweaked the ingest protocol
## Compatibility
Backward compatibility:
* The new pageserver version can read the existing pageserver
configuration and it will ignore the unknown field.
* When the tenant config is read from the storcon db or from the
pageserver disk, the extra field will be ignored.
Forward compatiblity:
* Both the pageserver config and the tenant config map missing fields to
their default value.
I'm not aware of any tenant level override that was made for this knob.
## Problem
It will be useful to understand what kind of queries our clients are
executed.
And one of the most important characteristic of query is query execution
time - at least it allows to distinguish OLAP and OLTP queries. Also
monitoring query execution time can help to detect problem with
performance (assuming that workload is more or less stable).
## Summary of changes
Add query execution time histogram.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Split the modules responsible for passing data and connecting to compute
from auth and waking for PGLB.
This PR just moves files. The waking is going to get removed from pglb
after this.
## Problem
Part of https://github.com/neondatabase/neon/issues/11813
In PostHog UI, we need to create the properties before using them as a
filter. We report all variants automatically when we start the
pageserver. In the future, we can report all real tenants instead of
fake tenants (we do that now to save money + we don't need real tenants
in the UI).
## Summary of changes
* Collect `region`, `availability_zone`, `pageserver_id` properties and
use them in the feature evaluation.
* Report 10 fake tenants on each pageserver startup.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
I believe in all environments we now specify either required/rejected
for proxy-protocol V2 as required. We no longer rely on the supported
flow. This means we no longer need to keep around read bytes incase
they're not in a header.
While I designed ChainRW to be fast (the hot path with an empty buffer
is very easy to branch predict), it's still unnecessary.
## Summary of changes
* Remove the ChainRW wrapper
* Refactor how we read the proxy-protocol header using read_exact.
Slightly worse perf but it's hardly significant.
* Don't try and parse the header if it's rejected.
`safekeepers_cmp` was added by #8840 to make changes of the safekeeper
set order independent: a `sk1,sk2,sk3` specifier changed to
`sk3,sk1,sk2` should not cause a walproposer restart. However, this
check didn't support generations, in the sense that it would see the
`g#123:` as part of the first safekeeper in the list, and if the first
safekeeper changes, it would also restart the walproposer.
Therefore, parse the generation properly and make it not be a part of
the generation.
This PR doesn't add a specific test, but I have confirmed locally that
`test_safekeepers_reconfigure_reorder` is fixed with the changes of PR
#11712 applied thanks to this PR.
Part of https://github.com/neondatabase/neon/issues/11670
## Problem
Inbetween adding the TLS config for compute-ctl, and adding the TLS
config in controlplane, we switched from using a provision flag to a
bind flag. This happened to work in all of my testing in preview regions
as they have no VM pool, so each bind was also a provision. However, in
staging I found that the TLS config is still only processed during
provision, even though it's only sent on bind.
## Summary of changes
* Add a new feature flag value, `tls_experimental`, which tells
postgres/pgbouncer/local_proxy to use the TLS certificates on bind.
* compute_ctl on provision will be told where the certificates are,
instead of being told on bind.
Url::to_string() adds a trailing slash on the base URL, so when we did
the format!(), we were adding a double forward slash.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
The gRPC page service doesn't respect `get_vectored_concurrent_io` and
always uses sequential IO.
## Summary of changes
Spawn a sidecar task for concurrent IO when enabled.
Cancellation will be addressed separately.
## Problem
The script `compute.sh` had a non-consistent coding style and didn't
follow best practices for modern bash scripts
## Summary of changes
The coding style was fixed to follow best practices.
## Problem
We want to repro an OOM situation, but large partial reads are required.
## Summary of Changes
Make the max partial read size configurable for import jobs.
## Problem
We don't currently run tests for PostGIS in our test environment.
## Summary of Changes
- Added PostGIS test support for PostgreSQL v16 and v17
- Configured different PostGIS versions based on PostgreSQL version:
- PostgreSQL v17: PostGIS 3.5.0
- PostgreSQL v14/v15/v16: PostGIS 3.3.3
- Added necessary test scripts and configurations
This ensures our PostgreSQL implementation remains compatible with this
widely-used extension.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
## Problem
Part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
Add a counter on the feature evaluation outcome and we will set up
alerts for too many failed evaluations in the future.
Signed-off-by: Alex Chi Z <chi@neon.tech>
JsonResponse::error() properly logs an error message which can be read
in the compute logs. invalid_status() was not going through that helper
function, thus not logging anything.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
This is already the default in production and in our test suite.
## Summary of changes
Set the default proto to interpreted to reduce friction when spinning up
new regions or cells.
## Problem
Currently, `page_api` domain types validate message invariants both when
converting Protobuf → domain and domain → Protobuf. This is annoying for
clients, because they can't use stream combinators to convert streamed
requests (needed for hot path performance), and also performs the
validation twice in the common case.
Blocks #12099.
## Summary of changes
Only validate the Protobuf → domain type conversion, i.e. on the
receiver side, and make domain → Protobuf infallible. This is where it
matters -- the Protobuf types are less strict than the domain types, and
receivers should expect all sorts of junk from senders (they're not
required to validate anyway, and can just construct an invalid message
manually).
Also adds a missing `impl From<CheckRelExistsRequest> for
proto::CheckRelExistsRequest`.
## Problem
Setting `max_batch_size` to anything higher than
`Timeline::MAX_GET_VECTORED_KEYS` will cause runtime error. We should
rather fail fast at startup if this is the case.
## Summary of changes
* Create `max_get_vectored_keys` as a new configuration (default to 32);
* Validate `max_batch_size` against `max_get_vectored_keys` right at
config parsing and validation.
Closes https://github.com/neondatabase/neon/issues/11994
## Problem
The gRPC `page_api` domain types used smallvecs to avoid heap
allocations in the common case where a single page is requested.
However, this is pointless: the Protobuf types use a normal vec, and
converting a smallvec into a vec always causes a heap allocation anyway.
## Summary of changes
Use a normal `Vec` instead of a `SmallVec` in `page_api` domain types.
## Problem
We print a backtrace in an info level log every 10 seconds while waiting
for the import data to land in the bucket.
## Summary of changes
The backtrace is not useful. Remove it.
## Problem
We should use the full host name for computes, according to
https://github.com/neondatabase/cloud/issues/26005 , but now a truncated
host name is used.
## Summary of changes
The URL for REMOTE_ONNX is rewritten using the FQDN.
## Problem
fix https://github.com/neondatabase/neon/issues/12101; this is a quick
hack and we need better API in the future.
In `get_db_size`, we call `get_reldir_size` for every relation. However,
we do the same deserializing the reldir directory thing for every
relation. This creates huge CPU overhead.
## Summary of changes
Get and deserialize the reldir v1 key once and use it across all
get_rel_size requests.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We should expose the page service over gRPC.
Requires #12093.
Touches #11728.
## Summary of changes
This patch adds an initial page service implementation over gRPC. It
ties in with the existing `PageServerHandler` request logic, to avoid
the implementations drifting apart for the core read path.
This is just a bare-bones functional implementation. Several important
aspects have been omitted, and will be addressed in follow-up PRs:
* Limited observability: minimal tracing, no logging, limited metrics
and timing, etc.
* Rate limiting will currently block.
* No performance optimization.
* No cancellation handling.
* No tests.
I've only done rudimentary testing of this, but Pagebench passes at
least.
A smaller version of #12066 that is somewhat easier to review.
Now that I've been using https://crates.io/crates/top-type-sizes I've
found a lot more of the low hanging fruit that can be tweaks to reduce
the memory usage.
Some context for the optimisations:
Rust's stack allocation in futures is quite naive. Stack variables, even
if moved, often still end up taking space in the future. Rearranging the
order in which variables are defined, and properly scoping them can go a
long way.
`async fn` and `async move {}` have a consequence that they always
duplicate the "upvars" (aka captures). All captures are permanently
allocated in the future, even if moved. We can be mindful when writing
futures to only capture as little as possible.
TlsStream is massive. Needs boxing so it doesn't contribute to the above
issue.
## Measurements from `top-type-sizes`:
### Before
```
10328 {async block@proxy::proxy::task_main::{closure#0}::{closure#0}} align=8
6120 {async fn body of proxy::proxy::handle_client<proxy::protocol2::ChainRW<tokio::net::TcpStream>>()} align=8
```
### After
```
4040 {async block@proxy::proxy::task_main::{closure#0}::{closure#0}}
4704 {async fn body of proxy::proxy::handle_client<proxy::protocol2::ChainRW<tokio::net::TcpStream>>()} align=8
```
## Problem
We need gRPC support in Pagebench to benchmark the new gRPC Pageserver
implementation.
Touches #11728.
## Summary of changes
Adds a `Client` trait to make the client transport swappable, and a gRPC
client via a `--protocol grpc` parameter. This must also specify the
connstring with the gRPC port:
```
pagebench get-page-latest-lsn --protocol grpc --page-service-connstring grpc://localhost:51051
```
The client is implemented using the raw Tonic-generated gRPC client, to
minimize client overhead.
## Problem
The page service logic asserts that a tracing span is present with
tenant/timeline/shard IDs. An initial gRPC page service implementation
thus requires a tracing span.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
Adds an `ObservabilityLayer` middleware that generates a tracing span
and decorates it with IDs from the gRPC metadata.
This is a minimal implementation to address the tracing span assertion.
It will be extended with additional observability in later PRs.
## Problem
If the wal receiver is cancelled, there's a 50% chance that it will
ingest yet more WAL.
## Summary of Changes
Always check cancellation first.
Precursor to https://github.com/neondatabase/cloud/issues/28333.
We want per-endpoint configuration for rate limits, which will be
distributed via the `GetEndpointAccessControl` API. This lays some of
the ground work.
1. Allow the endpoint rate limiter to accept a custom leaky bucket
config on check.
2. Remove the unused auth rate limiter, as I don't want to think about
how it fits into this.
3. Refactor the caching of `GetEndpointAccessControl`, as it adds
friction for adding new cached data to the API.
That third one was rather large. I couldn't find any way to split it up.
The core idea is that there's now only 2 cache APIs.
`get_endpoint_access_controls` and `get_role_access_controls`.
I'm pretty sure the behaviour is unchanged, except I did a drive by
change to fix#8989 because it felt harmless. The change in question is
that when a password validation fails, we eagerly expire the role cache
if the role was cached for 5 minutes. This is to allow for edge cases
where a user tries to connect with a reset password, but the cache never
expires the entry due to some redis related quirk (lag, or
misconfiguration, or cplane error)
libs/pqproto is designed for safekeeper/pageserver with maximum
throughput.
proxy only needs it for handshakes/authentication where throughput is
not a concern but memory efficiency is. For this reason, we switch to
using read_exact and only allocating as much memory as we need to.
All reads return a `&'a [u8]` instead of a `Bytes` because accidental
sharing of bytes can cause fragmentation. Returning the reference
enforces all callers only hold onto the bytes they absolutely need. For
example, before this change, `pqproto` was allocating 8KiB for the
initial read `BytesMut`, and proxy was holding the `Bytes` in the
`StartupMessageParams` for the entire connection through to passthrough.
## Problem
Release pipeline failing due to some tools cannot be installed.
## Summary of changes
Install with `--locked`.
Signed-off-by: Alex Chi Z <chi@neon.tech>
The expected operating range for the production NVMe drives is
in the range of 50 to 250us.
The bucket boundaries before this PR were not well suited
to reason about the utilization / queuing / latency variability
of those devices.
# Performance
There was some concern about perf impact of having so many buckets,
considering the impl does a linear search on each observe().
I added a benchmark and measured on relevant machines.
In any way, the PR is 40 buckets, so, won't make a meaningful
difference on production machines (im4gn.2xlarge),
going from 30ns -> 35ns.
## Problem
There's a bunch of TODOs in the import code.
## Summary of changes
1. Bound max import byte range to 128MiB. This might still be too high,
given the default job concurrency, but it needs to be balanced with
going back and forth to S3.
2. Prevent unsigned overflow when determining key range splits for
concurrent jobs
3. Use sharded ranges to estimate task size when splitting jobs
4. Bubble up errors that we might hit due to invalid data in the bucket
back to the storage controller.
5. Tweak the import bucket S3 client configuration.
## Problem
I noticed a small percentage of flakes on some import tests.
They were all instances of the storage controller being too eager on the
finalization.
As a refresher: the pageserver notifies the storage controller that it's
done from the import task
and the storage controller has to call back into it in order to finalize
the import. The pageserver
checks that the import task is done before serving that request. Hence,
we can get this race.
In practice, this has no impact since the storage controller will simply
retry.
## Summary of changes
Allow list such cases
This patch is a fixup for
- https://github.com/neondatabase/neon/pull/6788
Background
----------
That PR 6788 added artificial advancement of `disk_consistent_lsn`
and `remote_consistent_lsn` for shards that weren't written to
while other shards _were_ written to.
See the PR description for more context.
At the time of that PR, Pageservers shards were doing WAL filtering.
Nowadays, the WAL filtering happens in Safekeepers.
Shards learn about the WAL gaps via
`InterpretedWalRecords::next_record_lsn`.
The Bug
-------
That artificial advancement code also runs if the flush failed.
So, we advance the disk_consistent_lsn / remote_consistent_lsn,
without having the corresponding L0 to the `index_part.json`.
The frozen layer remains in the layer map until detach,
so we continue to serve data correctly.
We're not advancing flush loop variable `flushed_to_lsn` either,
so, subsequent flush requests will retry the flush and repair the
situation if they succeed.
But if there aren't any successful retries, eventually the tenant
will be detached and when it is attached somewhere else, the
`index_part.json` and therefore layer map...
1. ... does not contain the frozen layer that failed to flush and
2. ... won't re-ingest that WAL either because walreceiver
starts up with the advanced disk_consistent_lsn/remote_consistent_lsn.
The result is that the read path will have a gap in the reconstruct
data for the keys whose modifications were lost, resulting in
a) either walredo failure
b) or an incorrect page@lsn image if walredo doesn't error.
The Fix
-------
The fix is to only do the artificial advancement if `result.is_ok()`.
Misc
----
As an aside, I took some time to re-review the flush loop and its
callers.
I found one more bug related to error handling that I filed here:
- https://github.com/neondatabase/neon/issues/12025
## Problem
## Summary of changes
## Problem
Checking the most recent state of pageservers was insufficient to
evaluate whether another pageserver may read in a particular generation,
since the latest state might mask some earlier AttachedSingle state.
Related: https://github.com/neondatabase/neon/issues/11348
## Summary of changes
- Maintain a history of all attachments
- Write out explicit rules for when a pageserver may read
## Problem
Part of #11813. This pull request adds misc observability improvements
for the functionality.
## Summary of changes
* Info span for the PostHog feature background loop.
* New evaluate feature flag API.
* Put the request error into the error message.
* Log when feature flag gets updated.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
I have a hypothesis that import might be using lower number of jobs than
max for the VM, where the job is running. This change will help finding
this out from logs
## Summary of changes
Added logging of number of jobs, which is passed into both `pg_dump` and
`pg_restore`
Adds two metrics to the storcon that are related to the safekeeper
reconciler:
* `storage_controller_safkeeper_reconciles_queued` to indicate currrent
queue depth
* `storage_controller_safkeeper_reconciles_complete` to indicate the
number of complete reconciles
Both metrics operate on a per-safekeeper basis (as reconcilers run on a
per-safekeeper basis too).
These metrics mirror the `storage_controller_pending_reconciles` and
`storage_controller_reconcile_complete` metrics, although those are not
scoped on a per-pageserver basis but are global for the entire storage
controller.
Part of #11670
## Problem
Disk usage eviction isn't sensitive to layers of imported timelines.
## Summary of changes
Hook importing timelines up into eviction and add a test for it.
I don't think we need any special eviction logic for this. These layers
will all be visible and
their access time will be their creation time. Hence, we'll remove
covered layers first
and get to the imported layers if there's still disk pressure.
## Problem
Importing timelines can't currently be deleted. This is problematic
because:
1. Cplane cannot delete failed imports and we leave the timeline behind.
2. The flow does not support user driven cancellation of the import
## Summary of changes
On the pageserver: I've taken the path of least resistance, extended
`TimelineOrOffloaded`
with a new variant and added handling in the right places. I'm open to
thoughts here,
but I think it turned out better than I was envisioning.
On the storage controller: Again, fairly simple business: when a DELETE
timeline request is
received, we remove the import from the DB and stop any finalization
tasks/futures. In order
to stop finalizations, we track them in-memory. For each finalizing
import, we associate a gate
and a cancellation token.
Note that we delete the entry from the database before cancelling any
finalizations. This is such
that a concurrent request can't progress the import into finalize state
and race with the deletion.
This concern about deleting an import with on-going finalization is
theoretical in the near future.
We are only going to delete importing timelines after the storage
controller reports the failure to
cplane. Alas, the design works for user driven cancellation too.
Closes https://github.com/neondatabase/neon/issues/11897
## Summary
The optimiser background loop could get delayed a lot by waiting for
timeouts trying to talk to offline nodes.
Fixes: #12056
## Solution
- Skip offline nodes in `get_top_tenant_shards`
Link to Devin run:
https://app.devin.ai/sessions/065afd6756734d33bbd4d012428c4b6e
Requested by: John Spray (john@neon.tech)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: John Spray <john@neon.tech>
## Problem
Temporarily reduce the concurrency of gc-compaction to 1 job at a time.
We are going to roll out in the largest AWS region next week. Having one
job running at a time makes it easier to identify what tenant causes
problem if it's not running well and pause gc-compaction for that
specific tenant.
(We can make this configurable via pageserver config in the future!)
## Summary of changes
Reduce `CONCURRENT_GC_COMPACTION_TASKS` from 2 to 1.
Signed-off-by: Alex Chi Z <chi@neon.tech>
The `test_storcon_create_delete_sk_down` test is still flaky. This test
addresses two possible causes for flakiness. both causes are related to
deletion racing with `pull_timeline` which hasn't finished yet.
* the first cause is timeline deletion racing with `pull_timeline`:
* the first deletion attempt doesn't contain the line because the
timeline doesn't exist yet
* the subsequent deletion attempts don't contain it either, only a note
that the timeline is already deleted.
* so this patch adds the note that the timeline is already deleted to
the regex
* the second cause is about tenant deletion racing with `pull_timeline`:
* there were no tenant specific tombstones so if a tenant was deleted,
we only added tombstones for the specific timelines being deleted, not
for the tenant itself.
* This patch changes this, so we now have tenant specific tombstones as
well as timeline specific ones, and creation of a timeline checks both.
* we also don't see any retries of the tenant deletion in the logs. once
it's done it's done. so extend the regex to contain the tenant deletion
message as well.
One could wonder why the regex and why not using the API to check
whether the timeline is just "gone". The issue with the API is that it
doesn't allow one to distinguish between "deleted" and "has never
existed", and latter case might race with `pull_timeline`. I.e. the
second case flakiness helped in the discovery of a real bug (no tenant
tombstones), so the more precise check was helpful.
Before, I could easily reproduce 2-9 occurences of flakiness when
running the test with an additional `range(128)` parameter (i.e. 218
times 4 times). With this patch, I ran it three times, not a single
failure.
Fixes#11838
## Problem
Import planning takes a job size limit as its input. Previously, the job
size came from a pageserver config field. This field may change while
imports are in progress. If this happens, plans will no longer be
identical and the import would fail permanently.
## Summary of Changes
Bake the job size into the import progress reported to the storage
controller. For new imports, use the value from the pagesever config,
and, for existing imports, use the value present in the shard progress.
This value is identical for all shards, but we want it to be versioned
since future versions of the planner might split the jobs up
differently. Hence, it ends up in `ShardImportProgress`.
Closes https://github.com/neondatabase/neon/issues/11983
## Problem
Previous attempt https://github.com/neondatabase/neon/pull/10548 caused
some issues in staging and we reverted it. This is a re-attempt to
address https://github.com/neondatabase/neon/issues/11063.
Currently we create image layers at latest record LSN. We would create
"future image layers" (i.e., image layers with LSN larger than disk
consistent LSN) that need special handling at startup. We also waste a
lot of read operations to reconstruct from L0 layers while we could have
compacted all of the L0 layers and operate on a flat level of historic
layers.
## Summary of changes
* Run repartition at L0-L1 boundary.
* Roll out with feature flags.
* Piggyback a change that downgrades "image layer creating below
gc_cutoff" to debug level.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
There is a new API that I plan to use. We generate client from the spec
so it should be in the spec
## Summary of changes
Document the existing API in openAPI format
## Problem
Currently, we collect metrics of what extensions are installed on
computes at start up time. We do not have a mechanism that does this at
runtime.
## Summary of changes
Added a background thread that queries all DBs at regular intervals and
collects a list of installed extensions.
## Problem
Part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
* Support evaluate boolean flags.
* Add docs on how to handle errors.
* Add test cases based on real PostHog config.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We have some gaps in our traces. This indicates missing spans.
## Summary of changes
This PR adds two new spans:
* WAIT_EXECUTOR: time a batched request spends in the batch waiting to
be picked up
* FLUSH_RESPONSE: time a get page request spends flushing the response
to the compute

## Problem
The page API gRPC errors need a few tweaks to integrate better with the
GetPage machinery.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
* Add `GetPageStatus::InternalError` for internal server errors.
* Rename `GetPageStatus::Invalid` to `InvalidRequest` for clarity.
* Rename `status` and `GetPageStatus` to `status_code` and
`GetPageStatusCode`.
* Add an `Into<tonic::Status>` implementation for `ProtocolError`.
Some tests still explicitly specify version 3 of the safekeeper
walproposer protocol. Remove the explicit opt in from the tests as v3 is
the default now since #11518.
We don't touch the places where a test exercises both v2 and v3. Those
we leave for #12021.
Part of https://github.com/neondatabase/neon/issues/10326
## Problem
Test coverage of timeline imports is lacking.
## Summary of changes
This PR adds a chaos import test. It runs an import while injecting
various chaos events
in the environment. All the commits that follow the test fix various
issues that were surfaced by it.
Closes https://github.com/neondatabase/neon/issues/10191
## Problem
The regression test for the extension online_advisor fails on the
staging instance due to a lack of permission to alter the database.
## Summary of changes
A script was added to work around this problem.
---------
Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>
In order to enable TLS connections between computes and safekeepers, we
need to provide the control plane with a way to configure the various
libpq keyword parameters, sslmode and sslrootcert. neon.safekeepers is a
comma separated list of safekeepers formatted as host:port, so isn't
available for extension in the same way that neon.pageserver_connstring
is. This could be remedied in a future PR.
Part-of: https://github.com/neondatabase/cloud/issues/25823
Link:
https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PARAMKEYWORDS
Signed-off-by: Tristan Partin <tristan@neon.tech>
Support timeline creations on the storage controller to opt out from
their creation on the safekeepers, introducing the read-only timelines
concept. Read only timelines:
* will never receive WAL of their own, so it's fine to not create them
on the safekeepers
* the property is non-transitive. children of read-only timelines aren't
neccessarily read-only themselves.
This feature can be used for snapshots, to prevent the safekeepers from
being overloaded by empty timelines that won't ever get written to. In
the current world, this is not a problem, because timelines are created
implicitly by the compute connecting to a safekeeper that doesn't have
the timeline yet. In the future however, where the storage controller
creates timelines eagerly, we should watch out for that.
We represent read-only timelines in the storage controller database so
that we ensure that they never touch the safekeepers at all. Especially
we don't want them to cause a mess during the importing process of the
timelines from the cplane to the storcon database.
In a hypothetical future where we have a feature to detach timelines
from safekeepers, we'll either need to find a way to distinguish the
two, or if not, asking safekeepers to list the (empty) timeline prefix
and delete everything from it isn't a big issue either.
This patch will unconditionally hit the new safekeeper timeline creation
path for read-only timelines, without them needing the
`--timelines-onto-safekeepers` flag enabled. This is done because it's
lower risk (no safekeepers or computes involved at all) and gives us
some initial way to verify at least some parts of that code in prod.
https://github.com/neondatabase/cloud/issues/29435https://github.com/neondatabase/neon/issues/11670
Previously we were using project-id/endpoint-id as SYSLOGTAG, which has
a
limit of 32 characters, so the endpoint-id got truncated.
The output is now in RFC5424 format, where the message is json encoded
with additional metadata `endpoint_id` and `project_id`
Also as pgaudit logs multiline messages, we now detect this by parsing
the timestamp in the specific format, and consider non-matching lines to
belong in the previous log message.
Using syslog structured-data would be an alternative, but leaning
towards json
due to being somewhat more generic.
## Problem
part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
* Integrate feature store with tenant structure.
* gc-compaction picks up the current strategy from the feature store.
* We only log them for now for testing purpose. They will not be used
until we have more patches to support different strategies defined in
PostHog.
* We don't support property-based evaulation for now; it will be
implemented later.
* Evaluating result of the feature flag is not cached -- it's not
efficient and cannot be used on hot path right now.
* We don't report the evaluation result back to PostHog right now.
I plan to enable it in staging once we get the patch merged.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We see unexpected basebackup error alerts in the alert channel.
https://github.com/neondatabase/neon/pull/11778 only fixed the alerts
for shutdown errors. However, another path is that tenant shutting down
while waiting LSN -> WaitLsnError::BadState -> QueryError::Reconnect.
Therefore, the reconnect error should also be discarded from the
ok/error counter.
## Summary of changes
Do not increase ok/err counter for reconnect errors.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We didn't test the h3 extension in our test suite.
## Summary of changes
Added tests for h3 and h3-postgis extensions
Includes upgrade test for h3
---------
Co-authored-by: Tristan Partin <tristan@neon.tech>
## Problem
For the gRPC Pageserver API, we should convert the Protobuf types to
stricter, canonical Rust types.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
Adds Rust domain types that mirror the Protobuf types, with conversion
and validation.
## Problem
We need authentication for the gRPC server.
Requires #11972.
Touches #11728.
## Summary of changes
Add two request interceptors that decode the tenant/timeline/shard
metadata and authenticate the JWT token against them.
## Problem
We want to expose the page service over gRPC, for use with the
communicator.
Requires #11995.
Touches #11728.
## Summary of changes
This patch wires up a gRPC server in the Pageserver, using Tonic. It
does not yet implement the actual page service.
* Adds `listen_grpc_addr` and `grpc_auth_type` config options (disabled
by default).
* Enables gRPC by default with `neon_local`.
* Stub implementation of `page_api.PageService`, returning unimplemented
errors.
* gRPC reflection service for use with e.g. `grpcurl`.
Subsequent PRs will implement the actual page service, including
authentication and observability.
Notably, TLS support is not yet implemented. Certificate reloading
requires us to reimplement the entire Tonic gRPC server.
## Problem
For #11992 I realised we need to get the type info before executing the
query. This is important to know how to decode rows with custom types,
eg the following query:
```sql
CREATE TYPE foo AS ENUM ('foo','bar','baz');
SELECT ARRAY['foo'::foo, 'bar'::foo, 'baz'::foo] AS data;
```
Getting that to work was harder that it seems. The original
tokio-postgres setup has a split between `Client` and `Connection`,
where messages are passed between. Because multiple clients were
supported, each client message included a dedicated response channel.
Each request would be terminated by the `ReadyForQuery` message.
The flow I opted to use for parsing types early would not trigger a
`ReadyForQuery`. The flow is as follows:
```
PARSE "" // parse the user provided query
DESCRIBE "" // describe the query, returning param/result type oids
FLUSH // force postgres to flush the responses early
// wait for descriptions
// check if we know the types, if we don't then
// setup the typeinfo query and execute it against each OID:
PARSE typeinfo // prepare our typeinfo query
DESCRIBE typeinfo
FLUSH // force postgres to flush the responses early
// wait for typeinfo statement
// for each OID we don't know:
BIND typeinfo
EXECUTE
FLUSH
// wait for type info, might reveal more OIDs to inspect
// close the typeinfo query, we cache the OID->type map and this is kinder to pgbouncer.
CLOSE typeinfo
// finally once we know all the OIDs:
BIND "" // bind the user provided query - already parsed - to the user provided params
EXECUTE // run the user provided query
SYNC // commit the transaction
```
## Summary of changes
Please review commit by commit. The main challenge was allowing one
query to issue multiple sub-queries. To do this I first made sure that
the client could fully own the connection, which required removing any
shared client state. I then had to replace the way responses are sent to
the client, by using only a single permanent channel. This required some
additional effort to track which query is being processed. Lastly I had
to modify the query/typeinfo functions to not issue `sync` commands, so
it would fit into the desired flow above.
To note: the flow above does force an extra roundtrip into each query. I
don't know yet if this has a measurable latency overhead.
## Problem
- Benchmark periodic pagebench had inconsistent benchmarking results
even when run with the same commit hash.
Hypothesis is this was due to running on dedicated but virtualized EC
instance with varying CPU frequency.
- the dedicated instance type used for the benchmark is quite "old" and
we increasingly get `An error occurred (InsufficientInstanceCapacity)
when calling the StartInstances operation (reached max retries: 2):
Insufficient capacity.`
- periodic pagebench uses a snapshot of pageserver timelines to have the
same layer structure in each run and get consistent performance.
Re-creating the snapshot was a painful manual process (see
https://github.com/neondatabase/cloud/issues/27051 and
https://github.com/neondatabase/cloud/issues/27653)
## Summary of changes
- Run the periodic pagebench on a custom hetzner GitHub runner with
large nvme disk and governor set to defined perf profile
- provide a manual dispatch option for the workflow that allows to
create a new snapshot
- keep the manual dispatch option to specify a commit hash useful for
bi-secting regressions
- always use the newest created snapshot (S3 bucket uses date suffix in
S3 key, example
`s3://neon-github-public-dev/performance/pagebench/shared-snapshots-2025-05-17/`
- `--ignore`
`test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py`
in regular benchmarks run for each commit
- improve perf copying snapshot by using `cp` subprocess instead of
traversing tree in python
## Example runs with code in this PR:
- run which creates new snapshot
https://github.com/neondatabase/neon/actions/runs/15083408849/job/42402986376#step:19:55
- run which uses latest snapshot
-
https://github.com/neondatabase/neon/actions/runs/15084907676/job/42406240745#step:11:65
## Problem
We're about to implement a gRPC interface for Pageserver. Let's upgrade
Tonic first, to avoid a more painful migration later. It's currently
only used by storage-broker.
Touches #11728.
## Summary of changes
Upgrade Tonic 0.12.3 → 0.13.1. Also opportunistically upgrade Prost
0.13.3 → 0.13.5. This transitively pulls in Indexmap 2.0.1 → 2.9.0, but
it doesn't appear to be used in any particularly critical code paths.
## Problem
See https://github.com/neondatabase/neon/issues/11997
This guard prevents race condition with pump prefetch state (initiated
by timeout).
Assert checks that prefetching is also done under guard.
But prewarm knows nothing about it.
## Summary of changes
Pump prefetch state only in regular backends.
Prewarming is done by background workers now.
Also it seems to have not sense to pump prefetch state in any other
background workers: parallel executors, vacuum,... because they are
short living and can not leave unconsumed responses in socket.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Detect problems with Postgres optimiser: lack of indexes and statistics
## Summary of changes
https://github.com/knizhnik/online_advisor
Add online_advistor extension to docker image
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
https://github.com/neondatabase/neon/pull/11988 waits only for max
~200ms, so we still see failures, which self-resolve after several
operation retries.
## Summary of changes
Change it to waiting for at least 5 seconds, starting with 2 ms sleep
between iterations and x2 sleep on each next iteration. It could be that
it's not a problem with a slow `rsyslog` start, but a longer wait won't
hurt. If it won't start, we should debug why `inittab` doesn't start it,
or maybe there is another problem.
Add retry loop around waiting for rsyslog start
## Problem
## Summary of changes
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Matthias van de Meent <matthias@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Basebackup cache is on the hot path of compute startup and is generated
on every request (may be slow).
- Issue: https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Add `BasebackupCache` which stores basebackups on local disk.
- Basebackup prepare requests are triggered by
`XLOG_CHECKPOINT_SHUTDOWN` records in the log.
- Limit the size of the cache by number of entries.
- Add `basebackup_cache_enabled` feature flag to TenantConfig.
- Write tests for the cache
## Not implemented yet
- Limit the size of the cache by total size in bytes
---------
Co-authored-by: Aleksandr Sarantsev <aleksandr@neon.tech>
## Problem
For billing, we'd like per-branch consumption metrics.
Requires https://github.com/neondatabase/neon/pull/11984.
Resolves https://github.com/neondatabase/cloud/issues/28155.
## Summary of changes
This patch adds two new consumption metrics:
* `written_size_since_parent`: `written_size - ancestor_lsn`
* `pitr_history_size_since_parent`: `written_size - max(pitr_cutoff,
ancestor_lsn)`
Note that `pitr_history_size_since_parent` will not be emitted until the
PITR cutoff has been computed, and may or may not increase ~immediately
when a user increases their PITR window (depending on how much history
we have available and whether the tenant is restarted/migrated).
## Problem
When testing local proxy the auth-endpoint password shows up in command
line and log
```bash
RUST_LOG=proxy LOGFMT=text cargo run --release --package proxy --bin proxy --features testing -- \
--auth-backend postgres \
--auth-endpoint 'postgresql://postgres:secret_password@127.0.0.1:5432/postgres' \
--tls-cert server.crt \
--tls-key server.key \
--wss 0.0.0.0:4444
```
## Summary of changes
- Allow to set env variable PGPASSWORD
- fall back to use PGPASSWORD env variable when auth-endpoint does not
contain password
- remove auth-endpoint password from logs in `--features testing` mode
Example
```bash
export PGPASSWORD=secret_password
RUST_LOG=proxy LOGFMT=text cargo run --package proxy --bin proxy --features testing -- \
--auth-backend postgres \
--auth-endpoint 'postgresql://postgres@127.0.0.1:5432/postgres' \
--tls-cert server.crt \
--tls-key server.key \
--wss 0.0.0.0:4444
```
## Problem
It is not currently possible to disambiguate a timeline with an
uninitialized PITR cutoff from one that was created within the PITR
window -- both of these have `GcCutoffs::time == Lsn(0)`. For billing
metrics, we need to disambiguate these to avoid accidentally billing the
entire history when a tenant is initially loaded.
Touches https://github.com/neondatabase/cloud/issues/28155.
## Summary of changes
Make `GcCutoffs::time` an `Option<Lsn>`, and only set it to `Some` when
initialized. A `pitr_interval` of 0 will yield `Some(last_record_lsn)`.
This PR takes a conservative approach, and mostly retains the old
behavior of consumers by using `unwrap_or_default()` to yield 0 when
uninitialized, to avoid accidentally introducing bugs -- except in cases
where there is high confidence that the change is beneficial (e.g. for
the `pageserver_pitr_history_size` Prometheus metric and to return early
during GC).
## Problem
Hitting max_client_conn from pgbouncer would lead to invalidation of the
conn info cache.
Customers would hit the limit on wake_compute.
## Summary of changes
`should_retry_wake_compute` detects this specific error from pgbouncer
as non-retriable,
meaning we won't try to wake up the compute again.
## Problem
When using an incorrect endpoint string - `"localhost:4317"`, it's a
runtime error, but it can be a config error
- Closes: https://github.com/neondatabase/neon/issues/11394
## Summary of changes
Add config parse time check via `request::Url::parse` validation.
---------
Co-authored-by: Aleksandr Sarantsev <ephemeralsad@gmail.com>
#11962
Please review each commit separately.
Each commit is rather small in goal. The overall goal of this PR is to
keep the behaviour identical, but shave away small inefficiencies here
and there.
## Problem
See
Discussion:
https://neondb.slack.com/archives/C033RQ5SPDH/p1746645666075799
Issue: https://github.com/neondatabase/cloud/issues/28609
Relation size cache is not correctly updated at PS in case of replicas.
## Summary of changes
1. Have two caches for relation size in timeline:
`rel_size_primary_cache` and `rel_size_replica_cache`.
2. `rel_size_primary_cache` is actually what we have now. The only
difference is that it is not updated in `get_rel_size`, only by WAL
ingestion
3. `rel_size_replica_cache` has limited size (LruCache) and it's key is
`(Lsn,RelTag)` . It is updated in `get_rel_size`. Only strict LSN
matches are accepted as cache hit.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
In the escaping path we were checking that `${tag}$` or `${outer_tag}$`
are present in the string, but that's not enough, as original string
surrounded by `$` can also form a 'tag', like `$x$xx$x$`, which is fine
on it's own, but cannot be used in the string escaped with `$xx$`.
## Summary of changes
Remove `$` from the checks, just check if `{tag}` or `{outer_tag}` are
present. Add more test cases and change the catalog test to stress the
`drop_subscriptions_before_start: true` path as well.
Fixes https://github.com/neondatabase/cloud/issues/29198
Like #9931 but without rebasing upstream just yet, to try and minimise
the differences.
Removes all proxy-specific commits from the rust-postgres fork, now that
proxy no longer depends on them. Merging upstream changes to come later.
Closes#9387.
## Problem
`BufferedWriter` cannot proceed while the owned buffer is flushing to
disk. We want to implement double buffering so that the flush can happen
in the background. See #9387.
## Summary of changes
- Maintain two owned buffers in `BufferedWriter`.
- The writer is in charge of copying the data into owned, aligned
buffer, once full, submit it to the flush task.
- The flush background task is in charge of flushing the owned buffer to
disk, and returned the buffer to the writer for reuse.
- The writer and the flush background task communicate through a
bi-directional channel.
For in-memory layer, we also need to be able to read from the buffered
writer in `get_values_reconstruct_data`. To handle this case, we did the
following
- Use replace `VirtualFile::write_all` with `VirtualFile::write_all_at`,
and use `Arc` to share it between writer and background task.
- leverage `IoBufferMut::freeze` to get a cheaply clonable `IoBuffer`,
one clone will be submitted to the channel, the other clone will be
saved within the writer to serve reads. When we want to reuse the
buffer, we can invoke `IoBuffer::into_mut`, which gives us back the
mutable aligned buffer.
- InMemoryLayer reads is now aware of the maybe_flushed part of the
buffer.
**Caveat**
- We removed the owned version of write, because this interface does not
work well with buffer alignment. The result is that without direct IO
enabled,
[`download_object`](a439d57050/pageserver/src/tenant/remote_timeline_client/download.rs (L243))
does one more memcpy than before this PR due to the switch to use
`_borrowed` version of the write.
- "Bypass aligned part of write" could be implemented later to avoid
large amount of memcpy.
**Testing**
- use an oneshot channel based control mechanism to make flush behavior
deterministic in test.
- test reading from `EphemeralFile` when the last submitted buffer is
not flushed, in-progress, and done flushing to disk.
## Performance
We see performance improvement for small values, and regression on big
values, likely due to being CPU bound + disk write latency.
[Results](https://www.notion.so/neondatabase/Benchmarking-New-BufferedWriter-11-20-2024-143f189e0047805ba99acda89f984d51?pvs=4)
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
We have a scale test for the storage controller which also acts as a
good stress test for scheduling stability. However, it created nodes
with no AZs set.
## Summary of changes
- Bump node count to 6 and set AZs on them.
This is a precursor to other AZ-related PRs, to make sure any new code
that's landed is getting scale tested in an AZ-aware environment.
## Problem
We practice a manual release flow for the compute module. This will
allow automation of the compute release process.
## Summary of changes
The workflow was modified to make a compute release automatically on the
branch release-compute.
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
## Problem
Reqwest errors don't include details about the inner source error. This
means that we get opaque errors like:
```
receive body: error sending request for url (http://localhost:9898/v1/location_config)
```
Instead of the more helpful:
```
receive body: error sending request for url (http://localhost:9898/v1/location_config): operation timed out
```
Touches #9801.
## Summary of changes
Include the source error for `reqwest::Error` wherever it's displayed.
## Problem
When client specifies `application_name`, pgbouncer propagates it to the
Postgres. Yet, if client doesn't do it, we have hard time figuring out
who opens a lot of Postgres connections (including the `cloud_admin`
ones).
See this investigation as an example:
https://neondb.slack.com/archives/C0836R0RZ0D
## Summary of changes
I haven't found this documented, but it looks like pgbouncer accepts
standard Postgres connstring parameters in the connstring in the
`[databases]` section, so put the default `application_name=pgbouncer`
there. That way, we will always see who opens Postgres connections. I
did tests, and if client specifies a `application_name`, pgbouncer
overrides this default, so it only works if it's not specified or set to
blank `&application_name=` in the connection string.
This is the last place we could potentially open some Postgres
connections without `application_name`. Everything else should be either
of two:
1. Direct client connections without `application_name`, but these
should be strictly non-`cloud_admin` ones
2. Some ad-hoc internal connections, so if we see spikes of unidentified
`cloud_admin` connections, we will need to investigate it again.
Fixesneondatabase/cloud#20948
(stacked on #9990 and #9995)
Partially fixes#1287 with a custom option field to enable the fixed
behaviour. This allows us to gradually roll out the fix without silently
changing the observed behaviour for our customers.
related to https://github.com/neondatabase/cloud/issues/15284
## Problem
During deploys, we see a lot of 500 errors due to heapmap uploads for
inactive tenants. These should be 503s instead.
Resolves#9574.
## Summary of changes
Make the secondary tenant scheduler use `ApiError` rather than
`anyhow::Error`, to propagate the tenant error and convert it to an
appropriate status code.
## Problem
we tried different parallelism settings for ingest bench
## Summary of changes
the following settings seem optimal after merging
- SK side Wal filtering
- batched getpages
Settings:
- effective_io_concurrency 100
- concurrency limit 200 (different from Prod!)
- jobs 4, maintenance workers 7
- 10 GB chunk size
## Problem
```
2024-12-03T15:42:46.5978335Z + poetry run python /__w/neon/neon/scripts/ingest_perf_test_result.py --ingest /__w/neon/neon/test_runner/perf-report-local
2024-12-03T15:42:49.5325077Z Traceback (most recent call last):
2024-12-03T15:42:49.5325603Z File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 165, in <module>
2024-12-03T15:42:49.5326029Z main()
2024-12-03T15:42:49.5326316Z File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 155, in main
2024-12-03T15:42:49.5326739Z ingested = ingest_perf_test_result(cur, item, recorded_at_timestamp)
2024-12-03T15:42:49.5327488Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-03T15:42:49.5327914Z File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 99, in ingest_perf_test_result
2024-12-03T15:42:49.5328321Z psycopg2.extras.execute_values(
2024-12-03T15:42:49.5328940Z File "/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.11/lib/python3.11/site-packages/psycopg2/extras.py", line 1299, in execute_values
2024-12-03T15:42:49.5335618Z cur.execute(b''.join(parts))
2024-12-03T15:42:49.5335967Z psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type numeric: "concurrent-futures"
2024-12-03T15:42:49.5336287Z LINE 57: 'concurrent-futures',
2024-12-03T15:42:49.5336462Z ^
```
## Summary of changes
- `test_page_service_batching`: save non-numeric params as `labels`
- Add a runtime check that `metric_value` is NUMERIC
Before this PR, some override callbacks used `.default()`, others
used `.setdefault()`.
As of this PR, all callbacks use `.setdefault()` which I think is least
prone to failure.
Aligning on a single way will set the right example for future tests
that need such customization.
The `test_pageserver_getpage_throttle.py` technically is a change in
behavior: before, it replaced the `tenant_config` field, now it just
configures the throttle. This is what I believe is intended anyway.
Support tenant manifests in the storage scrubber:
* list the manifests, order them by generation
* delete all manifests except for the two most recent generations
* for the latest manifest: try parsing it.
I've tested this patch by running the against a staging bucket and it
successfully deleted stuff (and avoided deleting the latest two
generations).
In follow-up work, we might want to also check some invariants of the
manifest, as mentioned in #8088.
Part of #9386
Part of #8088
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
The Pageserver signal handler would only respond to a single signal and
initiate shutdown. Subsequent signals were ignored. This meant that a
`SIGQUIT` sent after a `SIGTERM` had no effect (e.g. in the case of a
slow or stalled shutdown). The `test_runner` uses this to force shutdown
if graceful shutdown is slow.
Touches #9740.
## Summary of changes
Keep responding to signals after the initial shutdown signal has been
received.
Arguably, the `test_runner` should also use `SIGKILL` rather than
`SIGQUIT` in this case, but it seems reasonable to respond to `SIGQUIT`
regardless.
Keeping the `mock` postgres cplane adaptor using "stock" tokio-postgres
allows us to remove a lot of dead weight from our actual postgres
connection logic.
## Problem
We saw a peculiar case where a pageserver apparently got a 0-tenant
response to `/re-attach` but we couldn't see the request landing on a
storage controller. It was hard to confirm retrospectively that the
pageserver was configured properly at the moment it sent the request.
## Summary of changes
- Log the URL to which we are sending the request
- Log the NodeId and metadata that we sent
## Problem
Sharded tenants should be run in a single AZ for best performance, so
that computes have AZ-local latency to all the shards.
Part of https://github.com/neondatabase/neon/issues/8264
## Summary of changes
- When we split a tenant, instead of updating each shard's preferred AZ
to wherever it is scheduled, propagate the preferred AZ from the parent.
- Drop the check in `test_shard_preferred_azs` that asserts shards end
up in their preferred AZ: this will not be true again until the
optimize_attachment logic is updated to make this so. The existing check
wasn't testing anything about scheduling, it was just asserting that we
set preferred AZ in a way that matches the way things happen to be
scheduled at time of split.
## Problem
In the batching PR
- https://github.com/neondatabase/neon/pull/9870
I stopped deducting the time-spent-in-throttle fro latency metrics,
i.e.,
- smgr latency metrics (`SmgrOpTimer`)
- basebackup latency (+scan latency, which I think is part of
basebackup).
The reason for stopping the deduction was that with the introduction of
batching, the trick with tracking time-spent-in-throttle inside
RequestContext and swap-replacing it from the `impl Drop for
SmgrOpTimer` no longer worked with >1 requests in a batch.
However, deducting time-spent-in-throttle is desirable because our
internal latency SLO definition does not account for throttling.
## Summary of changes
- Redefine throttling to be a page_service pagestream request throttle
instead of a throttle for repository `Key` reads through `Timeline::get`
/ `Timeline::get_vectored`.
- This means reads done by `basebackup` are no longer subject to any
throttle.
- The throttle applies after batching, before handling of the request.
- Drive-by fix: make throttle sensitive to cancellation.
- Rename metric label `kind` from `timeline_get` to `pagestream` to
reflect the new scope of throttling.
To avoid config format breakage, we leave the config field named
`timeline_get_throttle` and ignore the `task_kinds` field.
This will be cleaned up in a future PR.
## Trade-Offs
Ideally, we would apply the throttle before reading a request off the
connection, so that we queue the minimal amount of work inside the
process.
However, that's not possible because we need to do shard routing.
The redefinition of the throttle to limit pagestream request rate
instead of repository `Key` rate comes with several downsides:
- We're no longer able to use the throttle mechanism for other other
tasks, e.g. image layer creation.
However, in practice, we never used that capability anyways.
- We no longer throttle basebackup.
## Problem
`test_sharded_ingest` ingests a lot of data, which can cause shutdown to
be slow e.g. due to local "S3 uploads" or compactions. This can cause
test flakes during teardown.
Resolves#9740.
## Summary of changes
Perform an immediate shutdown of the cluster.
## Problem
We don't have good observability for memory usage. This would be useful
e.g. to debug OOM incidents or optimize performance or resource usage.
We would also like to use continuous profiling with e.g. [Grafana Cloud
Profiles](https://grafana.com/products/cloud/profiles-for-continuous-profiling/)
(see https://github.com/neondatabase/cloud/issues/14888).
This PR is intended as a proof of concept, to try it out in staging and
drive further discussions about profiling more broadly.
Touches https://github.com/neondatabase/neon/issues/9534.
Touches https://github.com/neondatabase/cloud/issues/14888.
Depends on #9779.
Depends on #9780.
## Summary of changes
Adds a HTTP route `/profile/heap` that takes a heap profile and returns
it. Query parameters:
* `format`: output format (`jemalloc` or `pprof`; default `pprof`).
Unlike CPU profiles (see #9764), heap profiles are not symbolized and
require the original binary to translate addresses to function names. To
make this work with Grafana, we'll probably have to symbolize the
process server-side -- this is left as future work, as is other output
formats like SVG.
Heap profiles don't work on macOS due to limitations in jemalloc.
## Problem
The extensions for Postgres v17 are ready but we do not test the
extensions shipped with v17
## Summary of changes
Build the test image based on Postgres v17. Run the tests for v17.
---------
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
This PR
- fixes smgr metrics https://github.com/neondatabase/neon/issues/9925
- adds an additional startup log line logging the current batching
config
- adds a histogram of batch sizes global and per-tenant
- adds a metric exposing the current batching config
The issue described #9925 is that before this PR, request latency was
only observed *after* batching.
This means that smgr latency metrics (most importantly getpage latency)
don't account for
- `wait_lsn` time
- time spent waiting for batch to fill up / the executor stage to pick
up the batch.
The fix is to use a per-request batching timer, like we did before the
initial batching PR.
We funnel those timers through the entire request lifecycle.
I noticed that even before the initial batching changes, we weren't
accounting for the time spent writing & flushing the response to the
wire.
This PR drive-by fixes that deficiency by dropping the timers at the
very end of processing the batch, i.e., after the `pgb.flush()` call.
I was **unable to maintain the behavior that we deduct
time-spent-in-throttle from various latency metrics.
The reason is that we're using a *single* counter in `RequestContext` to
track micros spent in throttle.
But there are *N* metrics timers in the batch, one per request.
As a consequence, the practice of consuming the counter in the drop
handler of each timer no longer works because all but the first timer
will encounter error `close() called on closed state`.
A failed attempt to maintain the current behavior can be found in
https://github.com/neondatabase/neon/pull/9951.
So, this PR remvoes the deduction behavior from all metrics.
I started a discussion on Slack about it the implications this has for
our internal SLO calculation:
https://neondb.slack.com/archives/C033RQ5SPDH/p1732910861704029
# Refs
- fixes https://github.com/neondatabase/neon/issues/9925
- sub-issue https://github.com/neondatabase/neon/issues/9377
- epic: https://github.com/neondatabase/neon/issues/9376
Before this PR, the storcon_cli didn't have a way to show the
tenant-wide information of the TenantDescribeResponse.
Sadly, the `Serialize` impl for the tenant config doesn't skip on
`None`, so, the output becomes a bit bloated.
Maybe we can use `skip_serializing_if(Option::is_none)` in the future.
=> https://github.com/neondatabase/neon/issues/9983
## Problem
I was touching `test_storage_controller_node_deletion` because for AZ
scheduling work I was adding a change to the storage controller (kick
secondaries during optimisation) that made a FIXME in this test defunct.
While looking at it I also realized that we can easily fix the way node
deletion currently doesn't use a proper ScheduleContext, using the
iterator type recently added for that purpose.
## Summary of changes
- A testing-only behavior in storage controller where if a secondary
location isn't yet ready during optimisation, it will be actively
polled.
- Remove workaround in `test_storage_controller_node_deletion` that
previously was needed because optimisation would get stuck on cold
secondaries.
- Update node deletion code to use a `TenantShardContextIterator` and
thereby a proper ScheduleContext
## Problem
After enabling LFC in tests and lowering `shared_buffers` we started
having more problems with `test_pg_regress`.
## Summary of changes
Set `shared_buffers` to 1MB to both exercise getPage requests/LFC, and
still have enough room for Postgres to operate. Everything smaller might
be not enough for Postgres under load, and can cause errors like 'no
unpinned buffers available'.
See Konstantin's comment [1] as well.
Fixes#9956
[1]:
https://github.com/neondatabase/neon/issues/9956#issuecomment-2511608097
On reconfigure, we no longer passed a port for the extension server
which caused us to not write out the neon.extension_server_port line.
Thus, Postgres thought we were setting the port to the default value of
0. PGC_POSTMASTER GUCs cannot be set at runtime, which causes the
following log messages:
> LOG: parameter "neon.extension_server_port" cannot be changed without
restarting the server
> LOG: configuration file
"/var/db/postgres/compute/pgdata/postgresql.conf" contains errors;
unaffected changes were applied
Fixes: https://github.com/neondatabase/neon/issues/9945
Signed-off-by: Tristan Partin <tristan@neon.tech>
The spec was written for the buggy protocol which we had before the one
more similar to Raft was implemented. Update the spec with what we
currently have.
ref https://github.com/neondatabase/neon/issues/8699
## Problem
The credentials providers tries to connect to AWS STS even when we use
plain Redis connections.
## Summary of changes
* Construct the CredentialsProvider only when needed ("irsa").
## Problem
`if: ${{ github.event.schedule }}` gets skipped if a previous step has
failed, but we want to run the step for both `success` and `failure`
## Summary of changes
- Add `!cancelled()` to notification step if-condition, to skip only
cancelled jobs
Fixes https://github.com/neondatabase/cloud/issues/20973.
This refactors `connect_raw` in order to return direct access to the
delayed notices.
I cannot find a way to test this with psycopg2 unfortunately, although
testing it with psql does return the expected results.
## Problem
We can't easily tell how far the state of shards is from their AZ
preferences. This can be a cause of performance issues, so it's
important for diagnosability that we can tell easily if there are
significant numbers of shards that aren't running in their preferred AZ.
Related: https://github.com/neondatabase/cloud/issues/15413
## Summary of changes
- In reconcile_all, count shards that are scheduled into the wrong AZ
(if they have a preference), and publish it as a prometheus gauge.
- Also calculate a statistic for how many shards wanted to reconcile but
couldn't.
This is clearly a lazy calculation: reconcile all only runs
periodically. But that's okay: shards in the wrong AZ is something that
only matters if it stays that way for some period of time.
Improves `wait_until` by:
* Use `timeout` instead of `iterations`. This allows changing the
timeout/interval parameters independently.
* Make `timeout` and `interval` optional (default 20s and 0.5s). Most
callers don't care.
* Only output status every 1s by default, and add optional
`status_interval` parameter.
* Remove `show_intermediate_error`, this was always emitted anyway.
Most callers have been updated to use the defaults, except where they
had good reason otherwise.
## Problem
We saw unexpected container terminations when running in k8s with with
small CPU resource requests.
The /status and /ready handlers called `maybe_forward`, which always
takes the lock on Service::inner.
If there is a lot of writer lock contention, and the container is
starved of CPU, this increases the likelihood that we will get killed by
the kubelet.
It isn't certain that this was a cause of issues, but it is a potential
source that we can eliminate.
## Summary of changes
- Revise logic to return immediately if the URL is in the non-forwarded
list, rather than calling maybe_forward
## Problem
See https://neondb.slack.com/archives/C04DGM6SMTM/p1732110190129479
We observe the following error in the logs
```
[XX000] ERROR: [NEON_SMGR] [shard 3] Incorrect prefetch read: status=1 response=0x7fafef335138 my=128 receive=128
```
most likely caused by changing `neon.readahead_buffer_size`
## Summary of changes
1. Copy shard state
2. Do not use prefetch_set_unused in readahead_buffer_resize
3. Change prefetch buffer overflow criteria
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Current compute images for Postgres 14-16 don't build on Debian 12
because of issues with extensions.
This PR fixes that, but for the current setup, it is mostly a no-op
change.
## Summary of changes
- Use `/bin/bash -euo pipefail` as SHELL to fail earlier
- Fix `plv8` build: backport a trivial patch for v8
- Fix `postgis` build: depend `sfgal` version on Debian version instead
of Postgres version
Tested in: https://github.com/neondatabase/neon/pull/9849
#8564
## Problem
The main and backup consumption metric pushes are completely
independent,
resulting in different event time windows and different idempotency
keys.
## Summary of changes
* Merge the push tasks, but keep chunks the same size.
# Problem
The timeout-based batching adds latency to unbatchable workloads.
We can choose a short batching timeout (e.g. 10us) but that requires
high-resolution timers, which tokio doesn't have.
I thoroughly explored options to use OS timers (see
[this](https://github.com/neondatabase/neon/pull/9822) abandoned PR).
In short, it's not an attractive option because any timer implementation
adds non-trivial overheads.
# Solution
The insight is that, in the steady state of a batchable workload, the
time we spend in `get_vectored` will be hundreds of microseconds anyway.
If we prepare the next batch concurrently to `get_vectored`, we will
have a sizeable batch ready once `get_vectored` of the current batch is
done and do not need an explicit timeout.
This can be reasonably described as **pipelining of the protocol
handler**.
# Implementation
We model the sub-protocol handler for pagestream requests
(`handle_pagrequests`) as two futures that form a pipeline:
2. Batching: read requests from the connection and fill the current
batch
3. Execution: `take` the current batch, execute it using `get_vectored`,
and send the response.
The Reading and Batching stage are connected through a new type of
channel called `spsc_fold`.
See the long comment in the `handle_pagerequests_pipelined` for details.
# Changes
- Refactor `handle_pagerequests`
- separate functions for
- reading one protocol message; produces a `BatchedFeMessage` with just
one page request in it
- batching; tried to merge an incoming `BatchedFeMessage` into an
existing `BatchedFeMessage`; returns `None` on success and returns back
the incoming message in case merging isn't possible
- execution of a batched message
- unify the timeline handle acquisition & request span construction; it
now happen in the function that reads the protocol message
- Implement serial and pipelined model
- serial: what we had before any of the batching changes
- read one protocol message
- execute protocol messages
- pipelined: the design described above
- optionality for execution of the pipeline: either via concurrent
futures vs tokio tasks
- Pageserver config
- remove batching timeout field
- add ability to configure pipelining mode
- add ability to limit max batch size for pipelined configurations
(required for the rollout, cf
https://github.com/neondatabase/cloud/issues/20620 )
- ability to configure execution mode
- Tests
- remove `batch_timeout` parametrization
- rename `test_getpage_merge_smoke` to `test_throughput`
- add parametrization to test different max batch sizes and execution
moes
- rename `test_timer_precision` to `test_latency`
- rename the test case file to `test_page_service_batching.py`
- better descriptions of what the tests actually do
## On the holding The `TimelineHandle` in the pending batch
While batching, we hold the `TimelineHandle` in the pending batch.
Therefore, the timeline will not finish shutting down while we're
batching.
This is not a problem in practice because the concurrently ongoing
`get_vectored` call will fail quickly with an error indicating that the
timeline is shutting down.
This results in the Execution stage returning a `QueryError::Shutdown`,
which causes the pipeline / entire page service connection to shut down.
This drops all references to the
`Arc<Mutex<Option<Box<BatchedFeMessage>>>>` object, thereby dropping the
contained `TimelineHandle`s.
- => fixes https://github.com/neondatabase/neon/issues/9850
# Performance
Local run of the benchmarks, results in [this empty
commit](1cf5b1463f)
in the PR branch.
Key take-aways:
* `concurrent-futures` and `tasks` deliver identical `batching_factor`
* tail latency impact unknown, cf
https://github.com/neondatabase/neon/issues/9837
* `concurrent-futures` has higher throughput than `tasks` in all
workloads (=lower `time` metric)
* In unbatchable workloads, `concurrent-futures` has 5% higher
`CPU-per-throughput` than that of `tasks`, and 15% higher than that of
`serial`.
* In batchable-32 workload, `concurrent-futures` has 8% lower
`CPU-per-throughput` than that of `tasks` (comparison to tput of
`serial` is irrelevant)
* in unbatchable workloads, mean and tail latencies of
`concurrent-futures` is practically identical to `serial`, whereas
`tasks` adds 20-30us of overhead
Overall, `concurrent-futures` seems like a slightly more attractive
choice.
# Rollout
This change is disabled-by-default.
Rollout plan:
- https://github.com/neondatabase/cloud/issues/20620
# Refs
- epic: https://github.com/neondatabase/neon/issues/9376
- this sub-task: https://github.com/neondatabase/neon/issues/9377
- the abandoned attempt to improve batching timeout resolution:
https://github.com/neondatabase/neon/pull/9820
- closes https://github.com/neondatabase/neon/issues/9850
- fixes https://github.com/neondatabase/neon/issues/9835
## Problem
It appears that the Azure storage API tends to hang TCP connections more
than S3 does.
Currently we use a 2 minute timeout for all downloads. This is large
because sometimes the objects we download are large. However, waiting 2
minutes when doing something like downloading a manifest on tenant
attach is problematic, because when someone is doing a "create tenant,
create timeline" workflow, that 2 minutes is long enough for them
reasonably to give up creating that timeline.
Rather than propagate oversized timeouts further up the stack, we should
use a different timeout for objects that we expect to be small.
Closes: https://github.com/neondatabase/neon/issues/9836
## Summary of changes
- Add a `small_timeout` configuration attribute to remote storage,
defaulting to 30 seconds (still a very generous period to do something
like download an index)
- Add a DownloadKind parameter to DownloadOpts, so that callers can
indicate whether they expect the object to be small or large.
- In the azure client, use small timeout for HEAD requests, and for GET
requests if DownloadKind::Small is used.
- Use DownloadKind::Small for manifests, indices, and heatmap downloads.
This PR intentionally does not make the equivalent change to the S3
client, to reduce blast radius in case this has unexpected consequences
(we could accomplish the same thing by editing lots of configs, but just
skipping the code is simpler for right now)
## Problem
It was not always possible to judge what exactly some `cloud_admin`
connections were doing because we didn't consistently set
`application_name` everywhere.
## Summary of changes
Unify the way we connect to Postgres:
1. Switch to building configs everywhere
2. Always set `application_name` and make naming consistent
Follow-up for #9919
Part of neondatabase/cloud#20948
## Problem
To add Safekeeper heap profiling in #9778, we need to switch to an
allocator that supports it. Pageserver and proxy already use jemalloc.
Touches #9534.
## Summary of changes
Use jemalloc in Safekeeper.
## Problem
When picking locations for a shard, we should use a ScheduleContext that
includes all the other shards in the tenant, so that we apply proper
anti-affinity between shards. If we don't do this, then it can lead to
unstable scheduling, where we place a shard somewhere that the optimizer
will then immediately move it away from.
We didn't always do this, because it was a bit awkward to accumulate the
context for a tenant rather than just walking tenants.
This was a TODO in `handle_node_availability_transition`:
```
// TODO: populate a ScheduleContext including all shards in the same tenant_id (only matters
// for tenants without secondary locations: if they have a secondary location, then this
// schedule() call is just promoting an existing secondary)
```
This is a precursor to https://github.com/neondatabase/neon/issues/8264,
where the current imperfect scheduling during node evacuation hampers
testing.
## Summary of changes
- Add an iterator type that yields each shard along with a
schedulecontext that includes all the other shards from the same tenant
- Use the iterator to replace hand-crafted logic in optimize_all_plan
(functionally identical)
- Use the iterator in `handle_node_availability_transition` to apply
proper anti-affinity during node evacuation.
Our rust-postgres fork is getting messy. Mostly because proxy wants more
control over the raw protocol than tokio-postgres provides. As such,
it's diverging more and more. Storage and compute also make use of
rust-postgres, but in more normal usage, thus they don't need our crazy
changes.
Idea:
* proxy maintains their subset
* other teams use a minimal patch set against upstream rust-postgres
Reviewing this code will be difficult. To implement it, I
1. Copied tokio-postgres, postgres-protocol and postgres-types from
00940fcdb5
2. Updated their package names with the `2` suffix to make them compile
in the workspace.
3. Updated proxy to use those packages
4. Copied in the code from tokio-postgres-rustls 0.13 (with some patches
applied https://github.com/jbg/tokio-postgres-rustls/pull/32https://github.com/jbg/tokio-postgres-rustls/pull/33)
5. Removed as much dead code as I could find in the vendored libraries
6. Updated the tokio-postgres-rustls code to use our existing channel
binding implementation
Adds a benchmark for logical message WAL ingestion throughput
end-to-end. Logical messages are essentially noops, and thus ignored by
the Pageserver.
Example results from my MacBook, with fsync enabled:
```
postgres_ingest: 14.445 s
safekeeper_ingest: 29.948 s
pageserver_ingest: 30.013 s
pageserver_recover_ingest: 8.633 s
wal_written: 10,340 MB
message_count: 1310720 messages
postgres_throughput: 715 MB/s
safekeeper_throughput: 345 MB/s
pageserver_throughput: 344 MB/s
pageserver_recover_throughput: 1197 MB/s
```
See
https://github.com/neondatabase/neon/issues/9642#issuecomment-2475995205
for running analysis.
Touches #9642.
## Problem
We used `set_path()` to replace the database name in the connection
string. It automatically does url-safe encoding if the path is not
already encoded, but it does it as per the URL standard, which assumes
that tabs can be safely removed from the path without changing the
meaning of the URL. See, e.g.,
https://url.spec.whatwg.org/#concept-basic-url-parser. It also breaks
for DBs with properly %-encoded names, like with `%20`, as they are kept
intact, but actually should be escaped.
Yet, this is not true for Postgres, where it's completely valid to have
trailing tabs in the database name.
I think this is the PR that caused this regression
https://github.com/neondatabase/neon/pull/9717, as it switched from
`postgres::config::Config` back to `set_path()`.
This was fixed a while ago already [1], btw, I just haven't added a test
to catch this regression back then :(
## Summary of changes
This commit changes the code back to use
`postgres/tokio_postgres::Config` everywhere.
While on it, also do some changes around, as I had to touch this code:
1. Bump some logging from `debug` to `info` in the spec apply path. We
do not use `debug` in prod, and it was tricky to understand what was
going on with this bug in prod.
2. Refactor configuration concurrency calculation code so it was
reusable. Yet, still keep `1` in the case of reconfiguration. The
database can be actively used at this moment, so we cannot guarantee
that there will be enough spare connection slots, and the underlying
code won't handle connection errors properly.
3. Simplify the installed extensions code. It was spawning a blocking
task inside async function, which doesn't make much sense. Instead, just
have a main sync function and call it with `spawn_blocking` in the API
code -- the only place we need it to be async.
4. Add regression python test to cover this and related problems in the
future. Also, add more extensive testing of schema dump and DBs and
roles listing API.
[1]:
4d1e48f3b9
[2]:
https://www.postgresql.org/message-id/flat/20151023003445.931.91267%40wrigleys.postgresql.orgResolvesneondatabase/cloud#20869
## Problem
Currently, we rerun only known flaky tests. This approach was chosen to
reduce the number of tests that go unnoticed (by forcing people to take
a look at failed tests and rerun the job manually), but it has some
drawbacks:
- In PRs, people tend to push new changes without checking failed tests
(that's ok)
- In the main, tests are just restarted without checking
(understandable)
- Parametrised tests become flaky one by one, i.e. if `test[1]` is flaky
`, test[2]` is not marked as flaky automatically (which may or may not
be the case).
I suggest rerunning all failed tests to increase the stability of GitHub
jobs and using the Grafana Dashboard with flaky tests for deeper
analysis.
## Summary of changes
- Rerun all failed tests twice at max
## Problem
For the interpreted proto the pageserver is not returning the correct
LSN
in replies to keep alive requests. This is because the interpreted
protocol arm
was not updating `last_rec_lsn`.
## Summary of changes
* Return correct LSN in keep-alive responses
* Fix shard field in wal sender traces
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.
[Release notes](https://releases.rs/docs/1.83.0/).
Also update `cargo-hakari`, `cargo-deny`, `cargo-hack` and
`cargo-nextest` to their latest versions.
Prior update was in #9445.
## Problem
We currently see elevated levels of errors for GetBlob requests. This is
because 404 and 304 are counted as errors for metric reporting.
## Summary of Changes
Bring the implementation in line with the S3 client and treat 404 and
304 responses as ok for metric purposes.
Related: https://github.com/neondatabase/cloud/issues/20666
## Problem
For cancellation, a connection is open during all the cancel checks.
## Summary of changes
Spawn cancellation checks in the background, and close connection
immediately.
Use task_tracker for cancellation checks.
## Problem
possible for the database connections to not close in time.
## Summary of changes
force the closing of connections if the client has hung up
## Problem
In a recent refactor, we accidentally dropped the cancel session early
## Summary of changes
Hold the cancel session during proxy passthrough
## Problem
Not really a problem, just refactoring.
## Summary of changes
Separate authenticate from wake compute.
Do not call wake compute second time if we managed to connect to
postgres or if we got it not from cache.
## Problem
hard to see where time is taken during HTTP flow.
## Summary of changes
add a lot more for query state. add a conn_id field to the sql-over-http
span
## Problem
`tokio::io::copy_bidirectional` doesn't close the connection once one of
the sides closes it. It's not really suitable for the postgres protocol.
## Summary of changes
Fork `copy_bidirectional` and initiate a shutdown for both connections.
---------
Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
There is currently no cleanup done after a delta layer creation error,
so delta layers can accumulate. The problem gets worse as the operation
gets retried and delta layers accumulate on the disk. Therefore, delete
them from disk (if something has been written to disk).
## Problem
When a tenant is in Attaching state, and waiting for the
`concurrent_tenant_warmup` semaphore, it also listens for the tenant
cancellation token. When that token fires, Tenant::attach drops out.
Meanwhile, Tenant::set_stopping waits forever for the tenant to exit
Attaching state.
Fixes: https://github.com/neondatabase/neon/issues/6423
## Summary of changes
- In the absence of a valid state for the tenant, it is set to Broken in
this path. A more elegant solution will require more refactoring, beyond
this minimal fix.
(cherry picked from commit 93572a3e99)
Before this patch, the select! still retured immediately if `futs` was
empty. Must have tested a stale build in my manual testing of #6388.
(cherry picked from commit 15c0df4de7)
To exercise MAX_SEND_SIZE sending from safekeeper; we've had a bug with WAL
records torn across several XLogData messages. Add failpoint to safekeeper to
slow down sending. Also check for corrupted WAL complains in standby log.
Make the test a bit simpler in passing, e.g. we don't need explicit commits as
autocommit is enabled by default.
https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719https://github.com/neondatabase/cloud/issues/9057
Otherwise they are left orphaned when compute_ctl is terminated with a
signal. It was invisible most of the time because normally neon_local or k8s
kills postgres directly and then compute_ctl finishes gracefully. However, in
some tests compute_ctl gets stuck waiting for sync-safekeepers which
intentionally never ends because safekeepers are offline, and we want to stop
compute_ctl without leaving orphanes behind.
This is a quite rough approach which doesn't wait for children termination. A
better way would be to convert compute_ctl to async which would make waiting
easy.
Release 2023-12-19
We need to do a config change that requires restarting the pageservers.
Slip in two metrics-related commits that didn't make this week's regularly release.
Pre-merge `git merge --squash` of
https://github.com/neondatabase/neon/pull/6115
Lowering the tracing level in get_value_reconstruct_data and
get_or_maybe_download from info to debug reduces the overhead
of span creation in non-debug environments.
## Problem
#6112 added some logs and metrics: clean these up a bit:
- Avoid counting startup completions for tenants launched after startup
- exclude no-op cases from timing histograms
- remove a rogue log messages
Error indicating request cancellation OR timeline shutdown was deemed as
a reason to exit the background worker that calculated synthetic size.
Fix it to only be considered for avoiding logging such of such errors.
This conflicted on tenant_shard_id having already replaced tenant_id on
`main`.
```
could not start the compute node: compute is in state "failed": db error: ERROR: could not access file "$libdir/timescaledb-2.10.1": No such file or directory Caused by: ERROR: could not access file "$libdir/timescaledb-2.10.1": No such file or directory
```
Only applicable change was neondatabase/autoscaling#584, setting
pgbouncer auth_dbname=postgres in order to fix superuser connections
from preventing dropping databases.
Only applicable change was neondatabase/autoscaling#571, removing the
postgres_exporter flags `--auto-discover-databases` and
`--exclude-databases=...`
## Problem
Logical replication requires new AUX_FILES_KEY which is definitely
absent in existed database.
We do not have function to check if key exists in our KV storage.
So I have to handle the error in `list_aux_files` method.
But this key is also included in key space range and accessed y
`create_image_layer` method.
## Summary of changes
Check if AUX_FILES_KEY exists before including it in keyspace.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Shany Pozin <shany@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Fixes an issue we observed on staging that happens when the
autoscaler-agent attempts to immediately downscale the VM after binding,
which is typical for pooled computes.
The issue was occurring because the autoscaler-agent was requesting
downscaling before the vm-monitor had gathered sufficient cgroup memory
stats to be confident in approving it. When the vm-monitor returned an
internal error instead of denying downscaling, the autoscaler-agent
retried the connection and immediately hit the same issue (in part
because cgroup stats are collected per-connection, rather than
globally).
There's currently an issue with the vm-monitor on staging that's not
really feasible to debug because the current display impl gives no
context to the errors (just says "failed to downscale").
Logging the full error should help.
For communications with the autoscaler-agent, it's ok to only provide
the outermost cause, because we can cross-reference with the VM logs.
At some point in the future, we may want to change that.
tl;dr it's really hard to avoid throttling from memory.high, and it
counts tmpfs & page cache usage, so it's also hard to make sense of.
In the interest of fixing things quickly with something that should be
*good enough*, this PR switches to instead periodically fetch memory
statistics from the cgroup's memory.stat and use that data to determine
if and when we should upscale.
This PR fixes#5444, which has a lot more detail on the difficulties
we've hit with memory.high. This PR also supersedes #5488.
Before this PR, when we restarted pageserver, we'd see a rush of
`$number_of_tenants` concurrent eviction tasks starting to do imitate
accesses building up in the period of `[init_order allows activations,
$random_access_delay + EvictionPolicyLayerAccessThreshold::period]`.
We simply cannot handle that degree of concurrent IO.
We already solved the problem for compactions by adding a semaphore.
So, this PR shares that semaphore for use by evictions.
Part of https://github.com/neondatabase/neon/issues/5479
Which is again part of https://github.com/neondatabase/neon/issues/4743
Risks / Changes In System Behavior
==================================
* we don't do evictions as timely as we currently do
* we log a bunch of warnings about eviction taking too long
* imitate accesses and compactions compete for the same concurrency
limit, so, they'll slow each other down through this shares semaphore
Changes
=======
- Move the `CONCURRENT_COMPACTIONS` semaphore into `tasks.rs`
- Rename it to `CONCURRENT_BACKGROUND_TASKS`
- Use it also for the eviction imitate accesses:
- Imitate acceses are both per-TIMELINE and per-TENANT
- The per-TENANT is done through coalescing all the per-TIMELINE
tasks via a tokio mutex `eviction_task_tenant_state`.
- We acquire the CONCURRENT_BACKGROUND_TASKS permit early, at the
beginning of the eviction iteration, much before the imitate
acesses start (and they may not even start at all in the given
iteration, as they happen only every $threshold).
- Acquiring early is **sub-optimal** because when the per-timline
tasks coalesce on the `eviction_task_tenant_state` mutex,
they are already holding a CONCURRENT_BACKGROUND_TASKS permit.
- It's also unfair because tenants with many timelines win
the CONCURRENT_BACKGROUND_TASKS more often.
- I don't think there's another way though, without refactoring
more of the imitate accesses logic, e.g, making it all per-tenant.
- Add metrics for queue depth behind the semaphore.
I found these very useful to understand what work is queued in the
system.
- The metrics are tagged by the new `BackgroundLoopKind`.
- On a green slate, I would have used `TaskKind`, but we already had
pre-existing labels whose names didn't map exactly to task kind.
Also the task kind is kind of a lower-level detail, so, I think
it's fine to have a separate enum to identify background work kinds.
Future Work
===========
I guess I could move the eviction tasks from a ticker to "sleep for
$period".
The benefit would be that the semaphore automatically "smears" the
eviction task scheduling over time, so, we only have the rush on restart
but a smeared-out rush afterward.
The downside is that this perverts the meaning of "$period", as we'd
actually not run the eviction at a fixed period. It also means the the
"took to long" warning & metric becomes meaningless.
Then again, that is already the case for the compaction and gc tasks,
which do sleep for `$period` instead of using a ticker.
(cherry picked from commit 9256788273)
## Problem
Folks have re-taged releases for `pg_jsonschema` and `pg_graphql` (to
increase timeouts on their CI), for us, these are a noop changes,
but unfortunately, this will cause our builds to fail due to checksums
mismatch (this might not strike right away because of the build cache).
- 8ba7c7be9d
- aa7509370a
## Summary of changes
- `pg_jsonschema` update checksum
- `pg_graphql` update checksum
When you log more than a few blocks, you need to reserve the space in
advance. We didn't do that, so we got errors. Now we do that, and
shouldn't get errors.
## Problem
See https://neondb.slack.com/archives/C05L7D1JAUS/p1694614585955029https://www.notion.so/neondatabase/Duplicate-key-issue-651627ce843c45188fbdcb2d30fd2178
## Summary of changes
Swap old/new block references
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
The sequence that can lead to a deadlock:
1. DELETE request gets all the way to `tenant.shutdown(progress,
false).await.is_err() ` , while holding TENANTS.read()
2. POST request for tenant creation comes in, calls `tenant_map_insert`,
it does `let mut guard = TENANTS.write().await;`
3. Something that `tenant.shutdown()` needs to wait for needs a
`TENANTS.read().await`.
The only case identified in exhaustive manual scanning of the code base
is this one:
Imitate size access does `get_tenant().await`, which does
`TENANTS.read().await` under the hood.
In the above case (1) waits for (3), (3)'s read-lock request is queued
behind (2)'s write-lock, and (2) waits for (1).
Deadlock.
I made a reproducer/proof-that-above-hypothesis-holds in
https://github.com/neondatabase/neon/pull/5281 , but, it's not ready for
merge yet and we want the fix _now_.
fixes https://github.com/neondatabase/neon/issues/5284
## Problem
We were returning Pending when a connection had a notice/notification
(introduced recently in #5020). When returning pending, the runtime
assumes you will call `cx.waker().wake()` in order to continue
processing.
We weren't doing that, so the connection task would get stuck
## Summary of changes
Don't return pending. Loop instead
## Problem
cargo deny lint broken
Links to the CVEs:
[rustsec.org/advisories/RUSTSEC-2023-0052](https://rustsec.org/advisories/RUSTSEC-2023-0052)
[rustsec.org/advisories/RUSTSEC-2023-0053](https://rustsec.org/advisories/RUSTSEC-2023-0053)
One is fixed, the other one isn't so we allow it (for now), to unbreak
CI. Then later we'll try to get rid of webpki in favour of the rustls
fork.
## Summary of changes
```
+ignore = ["RUSTSEC-2023-0052"]
```
## Problem
When an endpoint is shutting down, it can take a few seconds. Currently
when starting a new compute, this causes an "endpoint is in transition"
error. We need to add delays before retrying to ensure that we allow
time for the endpoint to shutdown properly.
## Summary of changes
Adds a delay before retrying in auth. connect_to_compute already has
this delay
commit
commit 5f8fd640bf
Author: Alek Westover <alek.westover@gmail.com>
Date: Wed Jul 26 08:24:03 2023 -0400
Upload Test Remote Extensions (#4792)
switched to using the release tag instead of `latest`, but,
the `promote-images` job only uploads `latest` to the prod ECR.
The switch to using release tag was good in principle, but,
reverting that part to make the release pipeine work.
Note that a proper fix should abandon use of `:latest` tag
at all: currently, if a `main` pipeline runs concurrently
with a `release` pipeline, the `release` pipeline may end
up using the `main` pipeline's images.
## Problem
If we fail to wake up the compute node, a subsequent connect attempt
will definitely fail. However, kubernetes won't fail the connection
immediately, instead it hangs until we timeout (10s).
## Summary of changes
Refactor the loop to allow fast retries of compute_wake and to skip a
connect attempt.
## Problem
#4598 compute nodes are not accessible some time after wake up due to
kubernetes DNS not being fully propagated.
## Summary of changes
Update connect retry mechanism to support handling IO errors and
sleeping for 100ms
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
```
CREATE EXTENSION embedding;
CREATE TABLE t (val real[]);
INSERT INTO t (val) VALUES ('{0,0,0}'), ('{1,2,3}'), ('{1,1,1}'), (NULL);
CREATE INDEX ON t USING hnsw (val) WITH (maxelements = 10, dims=3, m=3);
INSERT INTO t (val) VALUES (array[1,2,4]);
SELECT * FROM t ORDER BY val <-> array[3,3,3];
val
---------
{1,2,3}
{1,2,4}
{1,1,1}
{0,0,0}
(5 rows)
```
The consumption metrics synthetic size worker does logical size calculation.
Logical size calculation currently does synchronous disk IO.
This blocks the MGMT_REQUEST_RUNTIME's executor threads, starving other futures.
While there's work on the way to move the synchronous disk IO into spawn_blocking,
the quickfix here is to use the BACKGROUND_RUNTIME instead of MGMT_REQUEST_RUNTIME.
Actually it's not just a quickfix. We simply shouldn't be blocking MGMT_REQUEST_RUNTIME
executor threads on CPU or sync disk IO.
That work isn't done yet, as many of the mgmt tasks still _do_ disk IO.
But it's not as intensive as the logical size calculations that we're fixing here.
While we're at it, fix disk-usage-based eviction in a similar way.
It wasn't the culprit here, according to prod logs, but it can theoretically be
a little CPU-intensive.
More context, including graphs from Prod:
https://neondb.slack.com/archives/C03F5SM1N02/p1687541681336949
(cherry picked from commit d6e35222ea)
This commit introduces an SQL-over-HTTP endpoint in the proxy, with a JSON
response structure resembling that of the node-postgres driver. This method,
using HTTP POST, achieves smaller amortized latencies in edge setups due to
fewer round trips and an enhanced open connection reuse by the v8 engine.
This update involves several intricacies:
1. SQL injection protection: We employed the extended query protocol, modifying
the rust-postgres driver to send queries in one roundtrip using a text
protocol rather than binary, bypassing potential issues like those identified
in https://github.com/sfackler/rust-postgres/issues/1030.
2. Postgres type compatibility: As not all postgres types have binary
representations (e.g., acl's in pg_class), we adjusted rust-postgres to
respond with text protocol, simplifying serialization and fixing queries with
text-only types in response.
3. Data type conversion: Considering JSON supports fewer data types than
Postgres, we perform conversions where possible, passing all other types as
strings. Key conversions include:
- postgres int2, int4, float4, float8 -> json number (NaN and Inf remain
text)
- postgres bool, null, text -> json bool, null, string
- postgres array -> json array
- postgres json and jsonb -> json object
4. Alignment with node-postgres: To facilitate integration with js libraries,
we've matched the response structure of node-postgres, returning command tags
and column oids. Command tag capturing was added to the rust-postgres
functionality as part of this change.
## Problem
Compatibility tests don't support Postgres 15 yet, but we're still
trying to upload compatibility snapshot (which we do not collect).
Ref
https://github.com/neondatabase/neon/actions/runs/4991394158/jobs/8940369368#step:4:38129
## Summary of changes
Add `pg_version` parameter to `run-python-test-set` actions and do not
upload compatibility snapshot for Postgres 15
This reverts commit 732acc5.
Reverted PR: #3869
As noted in PR #4094, we do in fact try to insert duplicates to the
layer map, if L0->L1 compaction is interrupted. We do not have a proper
fix for that right now, and we are in a hurry to make a release to
production, so revert the changes related to this to the state that we
have in production currently. We know that we have a bug here, but
better to live with the bug that we've had in production for a long
time, than rush a fix to production without testing it in staging first.
Cc: #4094, #4088
Otherwise they get lost. Normally buffer is empty before proxy pass, but this is
not the case with pipeline mode of out npm driver; fixes connection hangup
introduced by b80fe41af3 for it.
fixes https://github.com/neondatabase/neon/issues/3822
## Describe your changes
We have previously changed the neon-proxy to use RollingUpdate. This
should be enabled in legacy proxy too in order to avoid breaking
connections for the clients and allow for example backups to run even
during deployment. (https://github.com/neondatabase/neon/pull/3683)
## Issue ticket number and link
https://github.com/neondatabase/neon/issues/3333
## Describe your changes
Rebase vendored PostgreSQL onto 14.7 and 15.2
## Issue ticket number and link
#3579
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [x] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [x] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
```
The version of PostgreSQL that we use is updated to 14.7 for PostgreSQL
14 and 15.2 for PostgreSQL 15.
```
previously we applied the ratelimiting only up to receiving the headers
from s3, or somewhere near it. the commit adds an adapter which carries
the permit until the AsyncRead has been disposed.
fixes#3662.
Calculation of logical size is now async because of layer downloads, so
we shouldn't use spawn_blocking for it. Use of `spawn_blocking`
exhausted resources which are needed by `tokio::io::copy` when copying
from a stream to a file which lead to deadlock.
Fixes: #3657
these are happening in tests because of #3655 but they sure took some
time to appear.
makes the `Compaction failed, retrying in 2s: Cannot run compaction
iteration on inactive tenant` into a globally allowed error, because it
has been seen failing on different test cases.
Small changes, but hopefully this will help with the panic detected in
staging, for which we cannot get the debugging information right now
(end-of-branch before branch-point).
Before only the timelines which have passed the `gc_horizon` were
processed which failed with orphans at the tree_sort phase. Example
input in added `test_branched_empty_timeline_size` test case.
The PR changes iteration to happen through all timelines, and in
addition to that, any learned branch points will be calculated as they
would had been in the original implementation if the ancestor branch had
been over the `gc_horizon`.
This also changes how tenants where all timelines are below `gc_horizon`
are handled. Previously tenant_size 0 was returned, but now they will
have approximately `initdb_lsn` worth of tenant_size.
The PR also adds several new tenant size tests that describe various corner
cases of branching structure and `gc_horizon` setting.
They are currently disabled to not consume time during CI.
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Previously, we were trying to re-assign owned objects of the already
deleted role. This were causing a crash loop in the case when compute
was restarted with a spec that includes delta operation for role
deletion. To avoid such cases, check that role is still present before
calling `reassign_owned_objects`.
Resolvesneondatabase/cloud#3553
This reverts commit 826e89b9ce.
The problem with that commit was that it deletes the TempDir while
there are still EphemeralFile instances open.
At first I thought this could be fixed by simply adding
Handle::current().block_on(task_mgr::shutdown(None, Some(tenant_id), None))
to TenantHarness::drop, but it turned out to be insufficient.
So, reverting the commit until we find a proper solution.
refs https://github.com/neondatabase/neon/issues/3385
Refactors Compute::prepare_and_run. It's split into subroutines
differently, to make it easier to attach tracing spans to the
different stages. The high-level logic for waiting for Postgres to
exit is moved to the caller.
Replace 'env_logger' with 'tracing', and add `#instrument` directives
to different stages fo the startup process. This is a fairly
mechanical change, except for the changes in 'spec.rs'. 'spec.rs'
contained some complicated formatting, where parts of log messages
were printed directly to stdout with `print`s. That was a bit messed
up because the log normally goes to stderr, but those lines were
printed to stdout. In our docker images, stderr and stdout both go to
the same place so you wouldn't notice, but I don't think it was
intentional.
This changes the log format to the default
'tracing_subscriber::format' format. It's different from the Postgres
log format, however, and because both compute_tools and Postgres print
to the same log, it's now a mix of two different formats. I'm not
sure how the Grafana log parsing pipeline can handle that. If it's a
problem, we can build custom formatter to change the compute_tools log
format to be the same as Postgres's, like it was before this commit,
or we can change the Postgres log format to match tracing_formatter's,
or we can start printing compute_tool's log output to a different
destination than Postgres
IMDSv2 has limits, and if we query it on every s3 interaction we are
going to go over those limits. Changes the s3_bucket client
configuration to use:
- ChainCredentialsProvider to handle env variables or imds usage
- LazyCachingCredentialsProvider to actually cache any credentials
Related: https://github.com/awslabs/aws-sdk-rust/issues/629
Possibly related: https://github.com/neondatabase/neon/issues/3118
plv8 can only be built with a fairly new gold linker version. We used to install
it via binutils packages from testing, but it also updates libc and that causes
troubles in the resulting image as different extensions were built against
different libc versions. We could either use libc from debian-testing everywhere
or restrain from using testing packages and install necessary programs manually.
This patch uses the latter approach: gold for plv8 and cmake for h3 are
installed manually.
In a passing declare h3_postgis as a safe extension (previous omission).
`GRANT CREATE ON SCHEMA public` fails if there is no schema `public`.
Disable it in release for now and make a better fix later (it is
needed for v15 support).
* Check for entire range during sasl validation (#2281)
* Gen2 GH runner (#2128)
* Re-add rustup override
* Try s3 bucket
* Set git version
* Use v4 cache key to prevent problems
* Switch to v5 for key
* Add second rustup fix
* Rebase
* Add kaniko steps
* Fix typo and set compress level
* Disable global run default
* Specify shell for step
* Change approach with kaniko
* Try less verbose shell spec
* Add submodule pull
* Add promote step
* Adjust dependency chain
* Try default swap again
* Use env
* Don't override aws key
* Make kaniko build conditional
* Specify runs on
* Try without dependency link
* Try soft fail
* Use image with git
* Try passing to next step
* Fix duplicate
* Try other approach
* Try other approach
* Fix typo
* Try other syntax
* Set env
* Adjust setup
* Try step 1
* Add link
* Try global env
* Fix mistake
* Debug
* Try other syntax
* Try other approach
* Change order
* Move output one step down
* Put output up one level
* Try other syntax
* Skip build
* Try output
* Re-enable build
* Try other syntax
* Skip middle step
* Update check
* Try first step of dockerhub push
* Update needs dependency
* Try explicit dir
* Add missing package
* Try other approach
* Try other approach
* Specify region
* Use with
* Try other approach
* Add debug
* Try other approach
* Set region
* Follow AWS example
* Try github approach
* Skip Qemu
* Try stdin
* Missing steps
* Add missing close
* Add echo debug
* Try v2 endpoint
* Use v1 endpoint
* Try without quotes
* Revert
* Try crane
* Add debug
* Split steps
* Fix duplicate
* Add shell step
* Conform to options
* Add verbose flag
* Try single step
* Try workaround
* First request fails hunch
* Try bullseye image
* Try other approach
* Adjust verbose level
* Try previous step
* Add more debug
* Remove debug step
* Remove rogue indent
* Try with larger image
* Add build tag step
* Update workflow for testing
* Add tag step for test
* Remove unused
* Update dependency chain
* Add ownership fix
* Use matrix for promote
* Force update
* Force build
* Remove unused
* Add new image
* Add missing argument
* Update dockerfile copy
* Update Dockerfile
* Update clone
* Update dockerfile
* Go to correct folder
* Use correct format
* Update dockerfile
* Remove cd
* Debug find where we are
* Add debug on first step
* Changedir to postgres
* Set workdir
* Use v1 approach
* Use other dependency
* Try other approach
* Try other approach
* Update dockerfile
* Update approach
* Update dockerfile
* Update approach
* Update dockerfile
* Update dockerfile
* Add workspace hack
* Update Dockerfile
* Update Dockerfile
* Update Dockerfile
* Change last step
* Cleanup pull in prep for review
* Force build images
* Add condition for latest tagging
* Use pinned version
* Try without name value
* Remove more names
* Shorten names
* Add kaniko comments
* Pin kaniko
* Pin crane and ecr helper
* Up one level
* Switch to pinned tag for rust image
* Force update for test
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@b04468bf-cdf4-41eb-9c94-aff4ca55e4bf.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@4795e9ee-4f32-401f-85f3-f316263b62b8.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@2f8bc4e5-4ec2-4ea2-adb1-65d863c4a558.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@27565b2b-72d5-4742-9898-a26c9033e6f9.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@ecc96c26-c6c4-4664-be6e-34f7c3f89a3c.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@7caff3a5-bf03-4202-bd0e-f1a93c86bdae.fritz.box>
* Add missing step output, revert one deploy step (#2285)
* Add missing step output, revert one deploy step
* Conform to syntax
* Update approach
* Add missing value
* Add missing needs
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
* Error for fatal not git repo (#2286)
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
* Use main, not branch for ref check (#2288)
* Use main, not branch for ref check
* Add more debug
* Count main, not head
* Try new approach
* Conform to syntax
* Update approach
* Get full history
* Skip checkout
* Cleanup debug
* Remove more debug
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
* Fix docker zombie process issue (#2289)
* Fix docker zombie process issue
* Init everywhere
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
* Fix 1.63 clippy lints (#2282)
* split out timeline metrics, track layer map loading and size calculation
* reset rust cache for clippy run to avoid an ICE
additionally remove trailing whitespaces
* Rename pg_control_ffi.h to bindgen_deps.h, for clarity.
The pg_control_ffi.h name implies that it only includes stuff related to
pg_control.h. That's mostly true currently, but really the point of the
file is to include everything that we need to generate Rust definitions
from.
* Make local mypy behave like CI mypy (#2291)
* Fix flaky pageserver restarts in tests (#2261)
* Remove extra type aliases (#2280)
* Update cachepot endpoint (#2290)
* Update cachepot endpoint
* Update dockerfile & remove env
* Update image building process
* Cannot use metadata endpoint for this
* Update workflow
* Conform to kaniko syntax
* Update syntax
* Update approach
* Update dockerfiles
* Force update
* Update dockerfiles
* Update dockerfile
* Cleanup dockerfiles
* Update s3 test location
* Revert s3 experiment
* Add more debug
* Specify aws region
* Remove debug, add prefix
* Remove one more debug
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
* workflows/benchmarking: increase timeout (#2294)
* Rework `init` in pageserver CLI (#2272)
* Do not create initial tenant and timeline (adjust Python tests for that)
* Rework config handling during init, add --update-config to manage local config updates
* Fix: Always build images (#2296)
* Always build images
* Remove unused
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
* Move auto-generated 'bindings' to a separate inner module.
Re-export only things that are used by other modules.
In the future, I'm imagining that we run bindgen twice, for Postgres
v14 and v15. The two sets of bindings would go into separate
'bindings_v14' and 'bindings_v15' modules.
Rearrange postgres_ffi modules.
Move function, to avoid Postgres version dependency in timelines.rs
Move function to generate a logical-message WAL record to postgres_ffi.
* fix cargo test
* Fix walreceiver and safekeeper bugs (#2295)
- There was an issue with zero commit_lsn `reason: LaggingWal { current_commit_lsn: 0/0, new_commit_lsn: 1/6FD90D38, threshold: 10485760 } }`. The problem was in `send_wal.rs`, where we initialized `end_pos = Lsn(0)` and in some cases sent it to the pageserver.
- IDENTIFY_SYSTEM previously returned `flush_lsn` as a physical end of WAL. Now it returns `flush_lsn` (as it was) to walproposer and `commit_lsn` to everyone else including pageserver.
- There was an issue with backoff where connection was cancelled right after initialization: `connected!` -> `safekeeper_handle_db: Connection cancelled` -> `Backoff: waiting 3 seconds`. The problem was in sleeping before establishing the connection. This is fixed by reworking retry logic.
- There was an issue with getting `NoKeepAlives` reason in a loop. The issue is probably the same as the previous.
- There was an issue with filtering safekeepers based on retry attempts, which could filter some safekeepers indefinetely. This is fixed by using retry cooldown duration instead of retry attempts.
- Some `send_wal.rs` connections failed with errors without context. This is fixed by adding a timeline to safekeepers errors.
New retry logic works like this:
- Every candidate has a `next_retry_at` timestamp and is not considered for connection until that moment
- When walreceiver connection is closed, we update `next_retry_at` using exponential backoff, increasing the cooldown on every disconnect.
- When `last_record_lsn` was advanced using the WAL from the safekeeper, we reset the retry cooldown and exponential backoff, allowing walreceiver to reconnect to the same safekeeper instantly.
* on safekeeper registration pass availability zone param (#2292)
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Rory de Zoete <33318916+zoete@users.noreply.github.com>
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@b04468bf-cdf4-41eb-9c94-aff4ca55e4bf.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@4795e9ee-4f32-401f-85f3-f316263b62b8.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@2f8bc4e5-4ec2-4ea2-adb1-65d863c4a558.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@27565b2b-72d5-4742-9898-a26c9033e6f9.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@ecc96c26-c6c4-4664-be6e-34f7c3f89a3c.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@7caff3a5-bf03-4202-bd0e-f1a93c86bdae.fritz.box>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Anton Galitsyn <agalitsyn@users.noreply.github.com>
* github/workflows: Fix git dubious ownership (#2223)
* Move relation size cache from WalIngest to DatadirTimeline (#2094)
* Move relation sie cache to layered timeline
* Fix obtaining current LSN for relation size cache
* Resolve merge conflicts
* Resolve merge conflicts
* Reestore 'lsn' field in DatadirModification
* adjust DatadirModification lsn in ingest_record
* Fix formatting
* Pass lsn to get_relsize
* Fix merge conflict
* Update pageserver/src/pgdatadir_mapping.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update pageserver/src/pgdatadir_mapping.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* refactor: replace lazy-static with once-cell (#2195)
- Replacing all the occurrences of lazy-static with `once-cell::sync::Lazy`
- fixes#1147
Signed-off-by: Ankur Srivastava <best.ankur@gmail.com>
* Add more buckets to pageserver latency metrics (#2225)
* ignore record property warning to fix benchmarks
* increase statement timeout
* use event so it fires only if workload thread successfully finished
* remove debug log
* increase timeout to pass test with real s3
* avoid duplicate parameter, increase timeout
* Major migration script (#2073)
This script can be used to migrate a tenant across breaking storage versions, or (in the future) upgrading postgres versions. See the comment at the top for an overview.
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
* Fix etcd typos
* Fix links to safekeeper protocol docs. (#2188)
safekeeper/README_PROTO.md was moved to docs/safekeeper-protocol.md in
commit 0b14fdb078, as part of reorganizing the docs into 'mdbook' format.
Fixes issue #1475. Thanks to @banks for spotting the outdated references.
In addition to fixing the above issue, this patch also fixes other broken links as a result of 0b14fdb078. See https://github.com/neondatabase/neon/pull/2188#pullrequestreview-1055918480.
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Thang Pham <thang@neon.tech>
* Update CONTRIBUTING.md
* Update CONTRIBUTING.md
* support node id and remote storage params in docker_entrypoint.sh
* Safe truncate (#2218)
* Move relation sie cache to layered timeline
* Fix obtaining current LSN for relation size cache
* Resolve merge conflicts
* Resolve merge conflicts
* Reestore 'lsn' field in DatadirModification
* adjust DatadirModification lsn in ingest_record
* Fix formatting
* Pass lsn to get_relsize
* Fix merge conflict
* Update pageserver/src/pgdatadir_mapping.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update pageserver/src/pgdatadir_mapping.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Check if relation exists before trying to truncat it
refer #1932
* Add test reporducing FSM truncate problem
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Fix exponential backoff values
* Update back `vendor/postgres` back; it was changed accidentally. (#2251)
Commit 4227cfc96e accidentally reverted vendor/postgres to an older
version. Update it back.
* Add pageserver checkpoint_timeout option.
To flush inmemory layer eventually when no new data arrives, which helps
safekeepers to suspend activity (stop pushing to the broker). Default 10m should
be ok.
* Share exponential backoff code and fix logic for delete task failure (#2252)
* Fix bug when import large (>1GB) relations (#2172)
Resolves#2097
- use timeline modification's `lsn` and timeline's `last_record_lsn` to determine the corresponding LSN to query data in `DatadirModification::get`
- update `test_import_from_pageserver`. Split the test into 2 variants: `small` and `multisegment`.
+ `small` is the old test
+ `multisegment` is to simulate #2097 by using a larger number of inserted rows to create multiple segment files of a relation. `multisegment` is configured to only run with a `release` build
* Fix timeline physical size flaky tests (#2244)
Resolves#2212.
- use `wait_for_last_flush_lsn` in `test_timeline_physical_size_*` tests
## Context
Need to wait for the pageserver to catch up with the compute's last flush LSN because during the timeline physical size API call, it's possible that there are running `LayerFlushThread` threads. These threads flush new layers into disk and hence update the physical size. This results in a mismatch between the physical size reported by the API and the actual physical size on disk.
### Note
The `LayerFlushThread` threads are processed **concurrently**, so it's possible that the above error still persists even with this patch. However, making the tests wait to finish processing all the WALs (not flushing) before calculating the physical size should help reduce the "flakiness" significantly
* postgres_ffi/waldecoder: validate more header fields
* postgres_ffi/waldecoder: remove unused startlsn
* postgres_ffi/waldecoder: introduce explicit `enum State`
Previously it was emulated with a combination of nullable fields.
This change should make the logic more readable.
* disable `test_import_from_pageserver_multisegment` (#2258)
This test failed consistently on `main` now. It's better to temporarily disable it to avoid blocking others' PRs while investigating the root cause for the test failure.
See: #2255, #2256
* get_binaries uses DOCKER_TAG taken from docker image build step (#2260)
* [proxy] Rework wire format of the password hack and some errors (#2236)
The new format has a few benefits: it's shorter, simpler and
human-readable as well. We don't use base64 anymore, since
url encoding got us covered.
We also show a better error in case we couldn't parse the
payload; the users should know it's all about passing the
correct project name.
* test_runner/pg_clients: collect docker logs (#2259)
* get_binaries script fix (#2263)
* get_binaries uses DOCKER_TAG taken from docker image build step
* remove docker tag discovery at all and fix get_binaries for version variable
* Better storage sync logs (#2268)
* Find end of WAL on safekeepers using WalStreamDecoder.
We could make it inside wal_storage.rs, but taking into account that
- wal_storage.rs reading is async
- we don't need s3 here
- error handling is different; error during decoding is normal
I decided to put it separately.
Test
cargo test test_find_end_of_wal_last_crossing_segment
prepared earlier by @yeputons passes now.
Fixes https://github.com/neondatabase/neon/issues/544https://github.com/neondatabase/cloud/issues/2004
Supersedes https://github.com/neondatabase/neon/pull/2066
* Improve walreceiver logic (#2253)
This patch makes walreceiver logic more complicated, but it should work better in most cases. Added `test_wal_lagging` to test scenarios where alive safekeepers can lag behind other alive safekeepers.
- There was a bug which looks like `etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())` filtered all safekeepers in some strange cases. I removed this filter, it should probably help with #2237
- Now walreceiver_connection reports status, including commit_lsn. This allows keeping safekeeper connection even when etcd is down.
- Safekeeper connection now fails if pageserver doesn't receive safekeeper messages for some time. Usually safekeeper sends messages at least once per second.
- `LaggingWal` check now uses `commit_lsn` directly from safekeeper. This fixes the issue with often reconnects, when compute generates WAL really fast.
- `NoWalTimeout` is rewritten to trigger only when we know about the new WAL and the connected safekeeper doesn't stream any WAL. This allows setting a small `lagging_wal_timeout` because it will trigger only when we observe that the connected safekeeper has stuck.
* increase timeout in wait_for_upload to avoid spurious failures when testing with real s3
* Bump vendor/postgres to include XLP_FIRST_IS_CONTRECORD fix. (#2274)
* Set up a workflow to run pgbench against captest (#2077)
Signed-off-by: Ankur Srivastava <best.ankur@gmail.com>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Co-authored-by: Ankur Srivastava <ansrivas@users.noreply.github.com>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Thang Pham <thang@neon.tech>
Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
Co-authored-by: Egor Suvorov <egor@neon.tech>
Co-authored-by: Andrey Taranik <andrey@cicd.team>
Co-authored-by: Dmitry Ivanov <ivadmi5@gmail.com>
[HOTFIX] Release deploy fix
This PR uses this branch neondatabase/postgres#171 and several required commits from the main to use only locally built compute-tools. This should allow us to rollout safekeepers sync issue fix on prod
name:Periodic pagebench performance test on dedicated EC2 machine in eu-central-1 region
name:Periodic pagebench performance test on unit-perf-aws-arm runners
on:
schedule:
@@ -8,7 +8,7 @@ on:
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
# │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
- cron:'0 */3 * * *'# Runs every 3 hours
- cron:'0 */4 * * *'# Runs every 4 hours
workflow_dispatch:# Allows manual triggering of the workflow
inputs:
commit_hash:
@@ -16,6 +16,11 @@ on:
description:'The long neon repo commit hash for the system under test (pageserver) to be tested.'
required:false
default:''
recreate_snapshots:
type:boolean
description:'Recreate snapshots - !!!WARNING!!! We should only recreate snapshots if the previous ones are no longer compatible. Otherwise benchmarking results are not comparable across runs.'
storage_broker={version="0.1",path="./storage_broker/"}# Note: main broker code is inside the binary crate, so linking with the library shouldn't be heavy.
# Keep the version the same as in compute/compute-node.Dockerfile and
# test_runner/regress/test_compute_metrics.py.
ENVSQL_EXPORTER_VERSION=0.17.0
ENVSQL_EXPORTER_VERSION=0.17.3
RUN curl -fsSL \
"https://github.com/burningalchemist/sql_exporter/releases/download/${SQL_EXPORTER_VERSION}/sql_exporter-${SQL_EXPORTER_VERSION}.linux-$(case"$(uname -m)" in x86_64) echo amd64;; aarch64) echo arm64;; esac).tar.gz"\
--output sql_exporter.tar.gz \
@@ -165,6 +165,7 @@ RUN curl -fsSL \
&& rm sql_exporter.tar.gz
# protobuf-compiler (protoc)
# Keep the version the same as in compute/compute-node.Dockerfile
ENVPROTOC_VERSION=25.1
RUN curl -fsSL "https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOC_VERSION}/protoc-${PROTOC_VERSION}-linux-$(uname -m | sed 's/aarch64/aarch_64/g').zip" -o "protoc.zip"\
&& unzip -q protoc.zip -d protoc \
@@ -179,7 +180,7 @@ RUN curl -sL "https://github.com/peak/s5cmd/releases/download/v${S5CMD_VERSION}/
&& mv s5cmd /usr/local/bin/s5cmd
# LLVM
ENVLLVM_VERSION=19
ENVLLVM_VERSION=20
RUN curl -fsSL 'https://apt.llvm.org/llvm-snapshot.gpg.key'| apt-key add - \
"address_standardizer.control":"# address_standardizer extension\ncomment = ''Used to parse an address into constituent elements. Generally used to support geocoding address normalization step.''\ndefault_version = ''3.3.2''\nrelocatable = true\ntrusted = true\n",
This directory contains configuration files and patches for running PostGIS tests in the Neon database environment.
## Overview
PostGIS is a spatial database extension for PostgreSQL that adds support for geographic objects. Testing PostGIS compatibility ensures that Neon's modifications to PostgreSQL don't break compatibility with this critical extension.
## PostGIS Versions
- PostgreSQL v17: PostGIS 3.5.0
- PostgreSQL v14/v15/v16: PostGIS 3.3.3
## Test Configuration
The test setup includes:
-`postgis-no-upgrade-test.patch`: Disables upgrade tests by removing the upgrade test section from regress/runtest.mk
-`postgis-regular-v16.patch`: Version-specific patch for PostgreSQL v16
-`postgis-regular-v17.patch`: Version-specific patch for PostgreSQL v17
-`regular-test.sh`: Script to run PostGIS tests as a regular user
-`neon-test.sh`: Script to handle version-specific test configurations
-`raster_outdb_template.sql`: Template for raster tests with explicit file paths
## Excluded Tests
**Important Note:** The test exclusions listed below are specifically for regular-user tests against staging instances. These exclusions are necessary because staging instances run with limited privileges and cannot perform operations requiring superuser access. Docker-compose based tests are not affected by these exclusions.
### Tests Requiring Superuser Permissions
These tests cannot be run as a regular user:
-`estimatedextent`
-`regress/core/legacy`
-`regress/core/typmod`
-`regress/loader/TestSkipANALYZE`
-`regress/loader/TestANALYZE`
### Tests Requiring Filesystem Access
These tests need direct filesystem access that is only possible for superusers:
-`loader/load_outdb`
### Tests with Flaky Results
These tests have assumptions that don't always hold true:
-`regress/core/computed_columns` - Assumes computed columns always outperform alternatives, which is not consistently true
accepting `Configuration`. Safekeeper switches to the given conf if it is higher than its
current one and ignores it otherwise. In any case it replies with
```
struct ConfigurationSwitchResponse {
struct TimelineMembershipSwitchResponse {
conf: Configuration,
term: Term,
last_log_term: Term,
@@ -108,7 +108,7 @@ establishes this configuration as its own and moves to voting.
It should stop talking to safekeepers not listed in the configuration at this
point, though it is not unsafe to continue doing so.
To be elected it must receive votes from both majorites if `new_sk_set` is present.
To be elected it must receive votes from both majorities if `new_sk_set` is present.
Similarly, to commit WAL it must receive flush acknowledge from both majorities.
If walproposer hears from safekeeper configuration higher than his own (i.e.
@@ -130,7 +130,7 @@ storage are reachable.
1) Fetch current timeline configuration from the configuration storage.
2) If it is already joint one and `new_set` is different from `desired_set`
refuse to change. However, assign join conf to (in memory) var
`join_conf` and proceed to step 4 to finish the ongoing change.
`joint_conf` and proceed to step 4 to finish the ongoing change.
3) Else, create joint `joint_conf: Configuration`: increment current conf number
`n` and put `desired_set` to `new_sk_set`. Persist it in the configuration
storage by doing CAS on the current generation: change happens only if
@@ -161,11 +161,11 @@ storage are reachable.
because `pull_timeline` already includes it and plus additionally would be
broadcast by compute. More importantly, we may proceed to the next step
only when `<last_log_term, flush_lsn>` on the majority of the new set reached
`sync_position`. Similarly, on the happy path no waiting is not needed because
`sync_position`. Similarly, on the happy path no waiting is needed because
`pull_timeline` already includes it. However, we should double
check to be safe. For example, timeline could have been created earlier e.g.
manually or after try-to-migrate, abort, try-to-migrate-again sequence.
7) Create `new_conf: Configuration` incrementing `join_conf` generation and having new
7) Create `new_conf: Configuration` incrementing `joint_conf` generation and having new
safekeeper set as `sk_set` and None `new_sk_set`. Write it to configuration
storage under one more CAS.
8) Call `PUT``configuration` on safekeepers from the new set,
@@ -178,12 +178,12 @@ spec of it.
Description above focuses on safety. To make the flow practical and live, here a few more
considerations.
1) It makes sense to ping new set to ensure it we are migrating to live node(s) before
1) It makes sense to ping new set to ensure we are migrating to live node(s) before
step 3.
2) If e.g. accidentally wrong new sk set has been specified, before CAS in step `6` is completed
it is safe to rollback to the old conf with one more CAS.
3) On step 4 timeline might be already created on members of the new set for various reasons;
the simplest is the procedure restart. There are more complicated scenarious like mentioned
the simplest is the procedure restart. There are more complicated scenarios like mentioned
in step 5. Deleting and re-doing `pull_timeline` is generally unsafe without involving
generations, so seems simpler to treat existing timeline as success. However, this also
has a disadvantage: you might imagine an surpassingly unlikely schedule where condition in
@@ -192,7 +192,7 @@ considerations.
4) In the end timeline should be locally deleted on the safekeeper(s) which are
in the old set but not in the new one, unless they are unreachable. To be
safe this also should be done under generation number (deletion proceeds only if
current configuration is <= than one in request and safekeeper is not memeber of it).
current configuration is <= than one in request and safekeeper is not member of it).
5) If current conf fetched on step 1 is already not joint and members equal to `desired_set`,
jump to step 7, using it as `new_conf`.
@@ -261,14 +261,14 @@ Timeline (branch) creation in cplane should call storage_controller POST
Response should be augmented with `safekeepers_generation` and `safekeepers`
fields like described in `/notify-safekeepers` above. Initially (currently)
these fields may be absent; in this case cplane chooses safekeepers on its own
like it currently does. The call should be retried until succeeds.
like it currently does. The call should be retried until it succeeds.
Timeline deletion and tenant deletion in cplane should call appropriate
storage_controller endpoints like it currently does for sharded tenants. The
calls should be retried until they succeed.
When compute receives safekeepers list from control plane it needs to know the
generation to checked whether it should be updated (note that compute may get
When compute receives safekeeper list from control plane it needs to know the
generation to check whether it should be updated (note that compute may get
safekeeper list from either cplane or safekeepers). Currently `neon.safekeepers`
GUC is just a comma separates list of `host:port`. Let's prefix it with
`g#<generation>:` to this end, so it will look like
@@ -305,8 +305,8 @@ enum MigrationRequest {
```
`FinishPending` requests to run the procedure to ensure state is clean: current
configuration is not joint and majority of safekeepers are aware of it, but do
not attempt to migrate anywhere. If current configuration fetched on step 1 is
configuration is not joint and the majority of safekeepers are aware of it, but do
not attempt to migrate anywhere. If the current configuration fetched on step 1 is
not joint it jumps to step 7. It should be run at startup for all timelines (but
similarly, in the first version it is ok to trigger it manually).
@@ -315,7 +315,7 @@ similarly, in the first version it is ok to trigger it manually).
`safekeepers` table mirroring current `nodes` should be added, except that for
`scheduling_policy`: it is enough to have at least in the beginning only 3
fields: 1) `active` 2) `paused` (initially means only not assign new tlis there
3) `decomissioned` (node is removed).
3) `decommissioned` (node is removed).
`timelines` table:
```
@@ -326,9 +326,10 @@ table! {
tenant_id -> Varchar,
start_lsn -> pg_lsn,
generation -> Int4,
sk_set -> Array<Int4>, // list of safekeeper ids
sk_set -> Array<Int8>, // list of safekeeper ids
new_sk_set -> Nullable<Array<Int8>>, // list of safekeeper ids, null if not joint conf
cplane_notified_generation -> Int4,
sk_set_notified_generation -> Int4, // the generation a quorum of sk_set knows about
deleted_at -> Nullable<Timestamptz>,
}
}
@@ -338,13 +339,23 @@ table! {
might also want to add ancestor_timeline_id to preserve the hierarchy, but for
this RFC it is not needed.
`cplane_notified_generation` and `sk_set_notified_generation` fields are used to
track the last stage of the algorithm, when we need to notify safekeeper set and cplane
with the final configuration after it's already committed to DB.
The timeline is up-to-date (no migration in progress) if `new_sk_set` is null and
`*_notified_generation` fields are up to date with `generation`.
It's possible to replace `*_notified_generation` with one boolean field `migration_completed`,
but for better observability it's nice to have them separately.
#### API
Node management is similar to pageserver:
1) POST `/control/v1/safekeepers` inserts safekeeper.
2) GET `/control/v1/safekeepers` lists safekeepers.
3) GET `/control/v1/safekeepers/:node_id` gets safekeeper.
4) PUT `/control/v1/safekepers/:node_id/status` changes status to e.g.
1) POST `/control/v1/safekeeper` inserts safekeeper.
2) GET `/control/v1/safekeeper` lists safekeepers.
3) GET `/control/v1/safekeeper/:node_id` gets safekeeper.
4) PUT `/control/v1/safekeper/:node_id/scheduling_policy` changes status to e.g.
`offline` or `decomissioned`. Initially it is simpler not to schedule any
migrations here.
@@ -368,8 +379,8 @@ Migration API: the first version is the simplest and the most imperative:
all timelines from one safekeeper to another. It accepts json
```
{
"src_sk": u32,
"dst_sk": u32,
"src_sk": NodeId,
"dst_sk": NodeId,
"limit": Optional<u32>,
}
```
@@ -379,12 +390,15 @@ Returns list of scheduled requests.
2) PUT `/control/v1/tenant/:tenant_id/timeline/:timeline_id/safekeeper_migrate` schedules `MigrationRequest`
to move single timeline to given set of safekeepers:
```
{
"desired_set": Vec<u32>,
struct TimelineSafekeeperMigrateRequest {
"new_sk_set": Vec<NodeId>,
}
```
Returns scheduled request.
In the first version the handler migrates the timeline to `new_sk_set` synchronously.
Should be retried until success.
In the future we might change it to asynchronous API and return scheduled request.
Similar call should be added for the tenant.
@@ -434,6 +448,9 @@ table! {
}
```
We load all pending ops from the table on startup into the memory.
The table is needed only to preserve the state between restarts.
`op_type` can be `include` (seed from peers and ensure generation is up to
date), `exclude` (remove locally) and `delete`. Field is actually not strictly
needed as it can be computed from current configuration, but gives more explicit
@@ -474,7 +491,7 @@ actions must be idempotent. Now, a tricky point here is timeline start LSN. For
the initial (tenant creation) call cplane doesn't know it. However, setting
start_lsn on safekeepers during creation is a good thing -- it provides a
guarantee that walproposer can always find a common point in WAL histories of
safekeeper and its own, and so absense of it would be a clear sign of
safekeeper and its own, and so absence of it would be a clear sign of
corruption. The following sequence works:
1) Create timeline (or observe that it exists) on pageserver,
figuring out last_record_lsn in response.
@@ -497,11 +514,9 @@ corruption. The following sequence works:
retries the call until 200 response.
There is a small question how request handler (timeline creation in this
case) would interact with per sk reconciler. As always I prefer to do the
simplest possible thing and here it seems to be just waking it up so it
re-reads the db for work to do. Passing work in memory is faster, but
that shouldn't matter, and path to scan db for work will exist anyway,
simpler to reuse it.
case) would interact with per sk reconciler. In the current implementation
we first persist the request in the DB, and then send an in-memory request
to each safekeeper reconciler to process it.
For pg version / wal segment size: while we may persist them in `timelines`
table, it is not necessary as initial creation at step 3 can take them from
@@ -509,30 +524,40 @@ pageserver or cplane creation call and later pull_timeline will carry them
around.
Timeline migration.
1) CAS to the db to create joint conf, and in the same transaction create
`safekeeper_timeline_pending_ops``include` entries to initialize new members
as well as deliver this conf to current ones; poke per sk reconcilers to work
on it. Also any conf change should also poke cplane notifier task(s).
2) Once it becomes possible per alg description above, get out of joint conf
with another CAS. Task should get wakeups from per sk reconcilers because
conf switch is required for advancement; however retries should be sleep
based as well as LSN advancement might be needed, though in happy path
it isn't. To see whether further transition is possible on wakup migration
executor polls safekeepers per the algorithm. CAS creating new conf with only
new members should again insert entries to `safekeeper_timeline_pending_ops`
to switch them there, as well as `exclude` rows to remove timeline from
old members.
1) CAS to the db to create joint conf. Since this moment the migration is considered to be
"in progress". We can detect all "in-progress" migrations looking into the database.
2) Do steps 4-6 from the algorithm, including `pull_timeline` onto `new_sk_set`, update membership
configuration on all safekeepers, notify cplane, etc. All operations are idempotent,
so we don't need to persist anything in the database at this stage. If any errors occur,
it's safe to retry or abort the migration.
3) Once it becomes possible per alg description above, get out of joint conf
with another CAS. Also should insert `exclude` entries into `safekeeper_timeline_pending_ops`
in the same DB transaction. Adding `exclude` entries atomically is nesessary because after
CAS we don't have the list of excluded safekeepers in the `timelines` table anymore, but we
need to have them persisted somewhere in case the migration is interrupted right after the CAS.
4) Finish the migration. The final membership configuration is committed to the DB at this stage.
So, the migration can not be aborted anymore. But it can still be retried if the migration fails
past stage 3. To finish the migration we need to send the new membership configuration to
a new quorum of safekeepers, notify cplane with the new safekeeper list and schedule the `exclude`
requests to in-memory queue for safekeeper reconciler. If the algrorithm is retried, it's
possible that we have already committed `exclude` requests to DB, but didn't send them to
the in-memory queue. In this case we need to read them from `safekeeper_timeline_pending_ops`
because it's the only place where they are persistent. The fields `sk_set_notified_generation`
and `cplane_notified_generation` are updated after each step. The migration is considered
fully completed when they match the `generation` field.
In practice, we can report "success" after stage 3 and do the "finish" step in per-timeline
reconciler (if we implement it). But it's wise to at least try to finish them synchronously,
so the timeline is always in a "good state" and doesn't require an old quorum to commit
WAL after the migration reported "success".
Timeline deletion: just set `deleted_at` on the timeline row and insert
`safekeeper_timeline_pending_ops` entries in the same xact, the rest is done by
per sk reconcilers.
When node is removed (set to `decomissioned`), `safekeeper_timeline_pending_ops`
When node is removed (set to `decommissioned`), `safekeeper_timeline_pending_ops`
for it must be cleared in the same transaction.
One more task pool should infinitely retry notifying control plane about changed
safekeeper sets (trying making `cplane_notified_generation` equal `generation`).
#### Dealing with multiple instances of storage_controller
Operations described above executed concurrently might create some errors but do
@@ -541,7 +566,7 @@ of storage_controller it is fine to have it temporarily, e.g. during redeploy.
To harden against some controller instance creating some work in
`safekeeper_timeline_pending_ops` and then disappearing without anyone pickup up
the job per sk reconcilers apart from explicit wakups should scan for work
the job per sk reconcilers apart from explicit wakeups should scan for work
periodically. It is possible to remove that though if all db updates are
protected with leadership token/term -- then such scans are needed only after
leadership is acquired.
@@ -563,7 +588,7 @@ There should be following layers of tests:
safekeeper communication and pull_timeline need to be mocked and main switch
procedure wrapped to as a node (thread) in simulation tests, using these
mocks. Test would inject migrations like it currently injects
safekeeper/walproposer restars. Main assert is the same -- committed WAL must
safekeeper/walproposer restarts. Main assert is the same -- committed WAL must
not be lost.
3) Since simulation testing injects at relatively high level points (not
@@ -613,7 +638,7 @@ Let's have the following implementation bits for gradual rollout:
`notify-safekeepers`.
Then the rollout for a region would be:
- Current situation: safekeepers are choosen by control_plane.
- Current situation: safekeepers are chosen by control_plane.
- We manually migrate some timelines, test moving them around.
- Then we enable `--set-safekeepers` so that all new timelines
are on storage controller.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.