rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 13:10:37 +00:00

Author	SHA1	Message	Date
Christian Schwarz	e0d2d293b1	Merge commit '7cd006621' into problame/standby-horizon-leases	2025-08-06 18:00:38 +02:00
Christian Schwarz	8799e87ae3	Merge commit '62d844e65' into problame/standby-horizon-leases	2025-08-06 18:00:09 +02:00
Christian Schwarz	737f5825bb	Merge commit 'b623fbae0' into problame/standby-horizon-leases	2025-08-06 17:59:58 +02:00
Christian Schwarz	b95106cd79	Merge commit '5c57e8a11' into problame/standby-horizon-leases	2025-08-06 17:59:21 +02:00
Christian Schwarz	7d28fb118b	Merge commit 'f85935446' into problame/standby-horizon-leases	2025-08-06 17:58:36 +02:00
Christian Schwarz	e52d0ef311	Merge commit '5b0972151' into problame/standby-horizon-leases	2025-08-06 17:56:07 +02:00
Christian Schwarz	d22e23f66d	Merge commit '108f7ec54' into problame/standby-horizon-leases	2025-08-06 17:55:56 +02:00
Christian Schwarz	54480167dc	Merge commit '9c0efba91' into problame/standby-horizon-leases	2025-08-06 17:55:48 +02:00
Christian Schwarz	30e7c4b75d	Merge commit '187170be4' into problame/standby-horizon-leases	2025-08-06 17:55:39 +02:00
Christian Schwarz	35c916c062	Merge commit '5c934efb2' into problame/standby-horizon-leases	2025-08-06 17:50:33 +02:00
Christian Schwarz	02e1aeef66	Merge commit 'a456e818a' into problame/standby-horizon-leases	2025-08-06 17:49:56 +02:00
Christian Schwarz	e2c88c1929	Merge commit '296c9190b' into problame/standby-horizon-leases	2025-08-06 17:49:50 +02:00
Christian Schwarz	553a120075	Merge commit '15f633922' into problame/standby-horizon-leases	2025-08-06 17:49:41 +02:00
Christian Schwarz	e2facbde4e	Merge commit 'cec0543b5' into problame/standby-horizon-leases	2025-08-06 17:47:10 +02:00
Christian Schwarz	b8c8168378	Merge commit 'be5bbaeca' into problame/standby-horizon-leases	2025-08-06 17:46:44 +02:00
Christian Schwarz	28a2cd05d5	Merge commit '5ec82105c' into problame/standby-horizon-leases	2025-08-06 17:46:37 +02:00
Christian Schwarz	1635390a96	fix all clippy complaints in this branch	2025-08-06 17:39:17 +02:00
Christian Schwarz	47146fe1d6	Merge commit '7049003cf' into problame/standby-horizon-leases	2025-08-06 17:17:11 +02:00
Christian Schwarz	2ee0f4271c	fix(page_service): lsn lease API puts tenant_shard_id in tenant_id tracing field The LSN lease api actually accepts a tenant_shard_id, not a tenant_id. But we put the Display of the tenant_shard_id into the tenant_id field. This PR fixes it. Refs - fixes https://databricks.atlassian.net/browse/LKB-2930	2025-08-05 22:48:27 +02:00
Christian Schwarz	8a9f1dd5e7	use tokio::time::Instant internally, chrono::DateTime<Utc> externally; communicate expiration through rfc3339 format; chrono::DateTime has good Debug fmt so this also serves observability; finish implementing release valve mechanism	2025-08-05 22:47:53 +02:00
Christian Schwarz	9f01840c18	use standby_horizon leases feature in the test, demonstrating that it passes now	2025-08-05 22:47:28 +02:00
Christian Schwarz	44466cebdb	WIP better observability for return values (SystemTime Debug is useless)	2025-08-05 22:46:54 +02:00
Christian Schwarz	b865e85de3	previous commit broke the tests because of the cfg business, see this commit's TODO	2025-08-05 22:46:24 +02:00
Christian Schwarz	73336962a8	finalize 3-stepped feature-gating (legacy,all,leases) + more tests + observability + fixes	2025-08-05 19:24:06 +02:00
Erik Grinaker	7cd0066212	page_api: add `SplitError` for `GetPageSplitter` (#12709 ) Add a `SplitError` for `GetPageSplitter`, with an `Into<tonic::Status>` implementation. This avoids a bunch of boilerplate to convert `GetPageSplitter` errors into `tonic::Status`. Requires #12702. Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).	2025-07-29 18:26:20 +00:00
Erik Grinaker	65d1be6e90	pageserver: route gRPC requests to child shards (#12702 ) ## Problem During shard splits, each parent shard is split and removed incrementally. Only when all parent shards have split is the split committed and the compute notified. This can take several minutes for large tenants. In the meanwhile, the compute will be sending requests to the (now-removed) parent shards. This was (mostly) not a problem for the libpq protocol, because it does shard routing on the server-side. The compute just sends requests to some Pageserver, and the server will figure out which local shard should serve it. It is a problem for the gRPC protocol, where the client explicitly says which shard it's talking to. Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191). Requires #12772. ## Summary of changes * Add server-side routing of gRPC requests to any local child shards if the parent does not exist. * Add server-side splitting of GetPage batch requests straddling multiple child shards. * Move the `GetPageSplitter` into `pageserver_page_api`. I really don't like this approach, but it avoids making changes to the split protocol. I could be convinced we should change the split protocol instead, e.g. to keep the parent shard alive until the split commits and the compute has been notified, but we can also do that as a later change without blocking the communicator on it.	2025-07-29 16:28:57 +00:00
Erik Grinaker	61f267d8f9	pageserver: only retry `WaitForActiveTimeout` during shard resolution (#12772 ) ## Problem In https://github.com/neondatabase/neon/pull/12467, timeouts and retries were added to `Cache::get` tenant shard resolution to paper over an issue with read unavailability during shard splits. However, this retries _all_ errors, including irrecoverable errors like `NotFound`. This causes problems with gRPC child shard routing in #12702, which targets specific shards with `ShardSelector::Known` and relies on prompt `NotFound` errors to reroute requests to child shards. These retries introduce a 1s delay for all reads during child routing. The broader problem of read unavailability during shard splits is left as future work, see https://databricks.atlassian.net/browse/LKB-672. Touches #12702. Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191). ## Summary of changes * Change `TenantManager` to always return a concrete `GetActiveTimelineError`. * Only retry `WaitForActiveTimeout` errors. * Lots of code unindentation due to the simplified error handling. Out of caution, we do not gate the retries on `ShardSelector`, since this can trigger other races. Improvements here are left as future work.	2025-07-29 12:33:02 +00:00
John Spray	60feb168e2	pageserver: decrease MAX_SHARDS in utilization (#12668 ) ## Problem When tenants have a lot of timelines, the number of tenants that a pageserver can comfortably handle goes down. Branching is much more widely used in practice now than it was when this code was written, and we generally run pageservers with a few thousand tenants (where each tenant has many timelines), rather than the 10k-20k we might have done historically. This should really be something configurable, or a more direct proxy for resource utilization (such as non-archived timeline count), but this change should be a low effort improvement. ## Summary of changes * Change the target shard count (MAX_SHARDS) to 2500 from 5000 when calculating pageserver utilization (i.e. a 200% overcommit now corresponds to 5000 shards, not 10000 shards) Co-authored-by: John Spray <john.spray@databricks.com>	2025-07-28 13:50:18 +00:00
Christian Schwarz	3365c8c648	enforce standby_horizon leases are always above applied_gc_cutoff (check against cutoff on upsert + block gc for lease length to allow renewals after attach)	2025-07-26 16:38:44 +02:00
Christian Schwarz	bc09df8823	add todo about init deadline	2025-07-26 16:23:59 +02:00
Christian Schwarz	e1eb98c0e9	add basic test & fix embarrassing bug in cull (needs comment out todo!())	2025-07-26 16:23:59 +02:00
Christian Schwarz	1e61ac6af2	cargo fmt (unrelated to prev commit)	2025-07-26 16:23:59 +02:00
Christian Schwarz	a948054db3	naming orthodoxy: always refer to leases as LSN leases	2025-07-26 16:23:59 +02:00
Christian Schwarz	19b74b8837	fix(page_service): getpage requests don't hold `applied_gc_cutoff_lsn` guard (#12743 ) Before this PR, getpage requests wouldn't hold the `applied_gc_cutoff_lsn` guard until they were done. Theoretical impact: if we’re not holding the `RcuReadGuard`, gc can theoretically concurrently delete reconstruct data that we need to reconstruct the page. I don't think this practically occurs in production because the odds of it happening are quite low, especially for primary read_write computes. But RO replicas / standby_horizon relies on correct `applied_gc_cutoff_lsn`, so, I'm fixing this as part of the work on replacing standby_horizon propagation mechanism with leases (LKB-88). The change is feature-gated with a feature flag, and evaluated once when entering `handle_pagestream` to avoid performance impact. For observability, we add a field to the `handle_pagestream` span, and a slow-log to the place in `gc_loop` where it waits for the in-flight RcuReadGuard's to drain. refs - fixes https://databricks.atlassian.net/browse/LKB-2572 - standby_horizon leases epic: https://databricks.atlassian.net/browse/LKB-2572 --------- Co-authored-by: Christian Schwarz <Christian Schwarz>	2025-07-25 20:25:04 +00:00
Vlad Lazar	b0dfe0ffa6	storcon: attempt all non-essential location config calls during reconciliations (#12745 ) ## Problem We saw the following in the field: Context and observations: * The storage controller keeps track of the latest generations and the pageserver that issued the latest generation in the database * When the storage controller needs to proxy a request (e.g. timeline creation) to the pageservers, it will use the pageserver that issued the latest generation from the db (generation_pageserver). * pageserver-2.cell-2 got into a bad state and wasn't able to apply location_config (e.g. detach a shard) What happened: 1. pageserver-2.cell-2 was a secondary for our shard since we were not able to detach it 2. control plane asked to detach a tenant (presumably because it was idle) a. In response storcon clears the generation_pageserver from the db and attempts to detach all locations b. it tries to detach pageserver-2.cell-2 first, but fails, which fails the entire reconciliation leaving the good attached location still there c. return success to cplane 3. control plane asks to re-attach the tenant a. In response storcon performs a reconciliation b. it finds that the observed state matches the intent (remember we did not detach the primary at step(2)) c. skips incrementing the generation and setting the generation_pageserver column Now any requests that need to be proxied to pageservers and rely on the generation_pageserver db column fail because that's not set ## Summary of changes 1. We do all non-essential location config calls (setting up secondaries, detaches) at the end of the reconciliation. Previously, we bailed out of the reconciliation on the first failure. With this patch we attempt all of the RPCs. This allows the observed state to update even if another RPC failed for unrelated reasons. 2. If the overall reconciliation failed, we don't want to remove nodes from the observed state as a safe-guard. With the previous patch, we'll get a deletion delta to process, which would be ignored. Ignoring it is not the right thing to do since it's out of sync with the db state. Hence, on reconciliation failures map deletion from the observed state to the uncertain state. Future reconciliation will query the node to refresh their observed state. Closes LKB-204	2025-07-25 14:03:17 +00:00
Erik Grinaker	185ead8395	pageserver: verify gRPC GetPages on correct shard (#12722 ) Verify that gRPC `GetPageRequest` has been sent to the shard that owns the pages. This avoids spurious `NotFound` errors if a compute misroutes a request, which can appear scarier (e.g. data loss). Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).	2025-07-25 13:43:04 +00:00
Erik Grinaker	37e322438b	pageserver: document gRPC compute accessibility (#12724 ) Document that the Pageserver gRPC port is accessible by computes, and should not provide internal services. Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).	2025-07-25 13:35:44 +00:00
Christian Schwarz	2ee24900ca	have claude generate plumbing for standby_horizon_lease_length	2025-07-25 13:16:20 +02:00
Christian Schwarz	23d1029afd	explain why there's no need to check standby_horizon lease deadline for getpage requests	2025-07-25 09:30:27 +00:00
Tristan Partin	512210bb5a	[BRC-2368] Add PS and compute_ctl metrics to report pagestream request errors (#12716 ) ## Problem In our experience running the system so far, almost all of the "hang compute" situations are due to the compute (postgres) pointing at the wrong pageservers. We currently mainly rely on the prometheus exporter (PGExporter) running on PG to detect and report any down time, but these can be unreliable because the read and write probes the PGExporter runs do not always generate pageserver requests due to caching, even though the real user might be experiencing down time when touching uncached pages. We are also about to start disk-wiping node pool rotation operations in prod clusters for our pageservers, and it is critical to have a convenient way to monitor the impact of these node pool rotations so that we can quickly respond to any issues. These metrics should provide very clear signals to address this operational need. ## Summary of changes Added a pair of metrics to detect issues between postgres' PageStream protocol (e.g. get_page_at_lsn, get_base_backup, etc.) communications with pageservers: * On the compute node (compute_ctl), exports a counter metric that is incremented every time postgres requests a configuration refresh. Postgres today only requests these configuration refreshes when it cannot connect to a pageserver or if the pageserver rejects its request by disconnecting. * On the pageserver, exports a counter metric that is incremented every time it receives a PageStream request that cannot be handled because the tenant is not known or if the request was routed to the wrong shard (e.g. secondary). ### How I plan to use metrics I plan to use the metrics added here to create alerts. The alerts can fire, for example, if these counters have been continuously increasing for over a certain period of time. During rollouts, misrouted requests may occasionally happen, but they should soon die down as reconfigurations make progress. We can start with something like raising the alert if the counters have been increasing continuously for over 5 minutes. ## How is this tested? New integration tests in `test_runner/regress/test_hadron_ps_connectivity_metrics.py` Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-24 19:05:00 +00:00
Alex Chi Z.	5c57e8a11b	feat(pageserver): rework reldirv2 rollout (#12576 ) ## Problem LKB-197, #9516 To make sure the migration path is smooth. The previous plan is to store new relations in new keyspace and old ones in old keyspace until it gets dropped. This makes the migration path hard as we can't validate v2 writes and can't rollback. This patch gives us a more smooth migration path: - The first time we enable reldirv2 for a tenant, we copy over everything in the old keyspace to the new one. This might create a short spike of latency for the create relation operation, but it's a one-off. - After that, we have identical v1/v2 keyspace and read/write both of them. We validate reads every time we list the reldirs. - If we are in `migrating` mode, use v1 as source of truth and log a warning for failed v2 operations. If we are in `migrated` mode, use v2 as source of truth and error when writes fail. - One compatibility test uses dataset from the time where we enabled reldirv2 (of the original rollout plan), which only has relations written to the v2 keyspace instead of the v1 keyspace. We had to adjust it accordingly. - Add `migrated_at` in index_part to indicate the LSN where we did the initialization. TODOs: - Test if relv1 can be read below the migrated_at LSN. - Move the initialization process to L0 compaction instead of doing it on the write path. - Disable relcache in the relv2 test case so that all code path gets fully tested. ## Summary of changes - New behavior of reldirv2 migration flags as described above. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-23 16:12:46 +00:00
Alex Chi Z.	f859354466	feat(pageserver): add db rel count as feature flag property (#12632 ) ## Problem As part of the reldirv2 rollout: LKB-197. We will use number of db/rels as a criteria whether to rollout reldirv2 directly on the write path (simplest and easiest way of rollout). If the number of rel/db is small then it shouldn't take too long time on the write path. ## Summary of changes * Compute db/rel count during basebackup. * Also compute it during logical size computation. * Collect maximum number of db/rel across all timelines in the feature flag properties. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-22 17:55:07 +00:00
Vlad Lazar	5b0972151c	pageserver: silence shard resolution warning (#12685 ) ## Problem We drive the get page requests that have started processing to completion. So in the case when the compute received a reconfiguration request and the old connection has a request processing on the pageserver, we are going to issue the warning. I spot checked a few instances of the warning and in all cases the compute was already connected to the correct pageserver. ## Summary of Changes Downgrade to INFO. It would be nice to somehow figure out if the connection has been terminated in the meantime, but the terminate libpq message is still in the pipe while we're doing the shard resolution. Closes LKB-2381	2025-07-22 17:34:23 +00:00
Folke Behrens	108f7ec544	Bump opentelemetry crates to 0.30 (#12680 ) This rebuilds #11552 on top the current Cargo.lock. --------- Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>	2025-07-22 16:05:35 +00:00
Alex Chi Z.	88391ce069	feat(pageserver): create image layers at L0-L1 boundary by default (#12669 ) ## Problem Post LKB-198 rollout. We added a new strategy to generate image layers at the L0-L1 boundary instead of the latest LSN to ensure too many L0 layers do not trigger image layer creation. ## Summary of changes We already rolled it out to all users so we can remove the feature flag now. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-22 14:29:26 +00:00
Vlad Lazar	d91d018afa	storcon: handle pageserver disk loss (#12667 ) NB: effectively a no-op in the neon env since the handling is config gated in storcon ## Problem When a pageserver suffers from a local disk/node failure and restarts, the storage controller will receive a re-attach call and return all the tenants the pageserver is supposed to attach, but the pageserver will not act on any tenants that it doesn't know about locally. As a result, the pageserver will not rehydrate any tenants from remote storage if it restarted following a local disk loss, while the storage controller still thinks that the pageserver has all the tenants attached. This leaves the system in a bad state, and the symptom is that PG's pageserver connections will fail with "tenant not found" errors. ## Summary of changes Made a slight change to the storage controller's `re_attach` API: * The pageserver will set an additional bit `empty_local_disk` in the reattach request, indicating whether it has started with an empty disk or does not know about any tenants. * Upon receiving the reattach request, if this `empty_local_disk` bit is set, the storage controller will go ahead and clear all observed locations referencing the pageserver. The reconciler will then discover the discrepancy between the intended state and observed state of the tenant and take care of the situation. To facilitate rollouts this extra behavior in the `re_attach` API is guarded by the `handle_ps_local_disk_loss` command line flag of the storage controller. --------- Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-22 11:04:03 +00:00
Folke Behrens	9c0efba91e	Bump rand crate to 0.9 (#12674 )	2025-07-22 09:31:39 +00:00
Vlad Lazar	30e1213141	pageserver: check env var for ip address before node registration (#12666 ) Include the ip address (optionally read from an env var) in the pageserver's registration request. Note that the ip address is ignored by the storage controller at the moment, which makes it a no-op in the neon env.	2025-07-21 15:32:28 +00:00
Erik Grinaker	5a48365fb9	pageserver/client_grpc: don't set stripe size for unsharded tenants (#12639 ) ## Problem We've had bugs where the compute would use the stale default stripe size from an unsharded tenant after the tenant split with a new stripe size. ## Summary of changes Never specify a stripe size for unsharded tenants, to guard against misuse. Only specify it once tenants are sharded and the stripe size can't change. Also opportunistically changes `GetPageSplitter` to return `anyhow::Result`, since we'll be using this in other code paths as well (specifically during server-side shard splits).	2025-07-21 12:28:39 +00:00
Erik Grinaker	194b9ffc41	pageserver: remove gRPC `CheckRelExists` (#12616 ) ## Problem Postgres will often immediately follow a relation existence check with a relation size query. This incurs two roundtrips, and may prevent effective caching. See [Slack thread](https://databricks.slack.com/archives/C091SDX74SC/p1751951732136139). Touches #11728. ## Summary of changes For the gRPC API: * Add an `allow_missing` parameter to `GetRelSize`, which returns `missing=true` instead of a `NotFound` error. * Remove `CheckRelExists`. There are no changes to libpq behavior.	2025-07-21 11:43:26 +00:00
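
Commit 61f267d8f9 above restricts shard-resolution retries to `WaitForActiveTimeout`, so that `NotFound` surfaces promptly and requests can be rerouted to child shards. The following is a minimal sketch of that retry filter, not the pageserver's code: only the two variant names and the `GetActiveTimelineError` type name come from the commit message, while `is_retryable`, `resolve_with_retries`, and the attempt counter are hypothetical names assumed for illustration.

```rust
/// Hypothetical stand-in for the error type named in commit 61f267d8f9.
/// Only the type name and the two variant names are taken from the commit
/// message; the real type likely has more variants and carries more data.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum GetActiveTimelineError {
    NotFound,
    WaitForActiveTimeout,
}

impl GetActiveTimelineError {
    /// Assumed helper: only timeouts are worth retrying. `NotFound` is
    /// irrecoverable and must be returned to the caller immediately.
    fn is_retryable(&self) -> bool {
        matches!(self, GetActiveTimelineError::WaitForActiveTimeout)
    }
}

/// Calls `resolve` up to `max_attempts` times (at least once), retrying
/// only when the error is retryable. Returns the final result together
/// with the number of attempts made.
pub fn resolve_with_retries<T>(
    max_attempts: u32,
    mut resolve: impl FnMut() -> Result<T, GetActiveTimelineError>,
) -> (Result<T, GetActiveTimelineError>, u32) {
    let mut attempts = 0;
    loop {
        attempts += 1;
        match resolve() {
            // Retry timeouts while attempts remain.
            Err(e) if e.is_retryable() && attempts < max_attempts => continue,
            // Success, a non-retryable error, or retries exhausted.
            other => return (other, attempts),
        }
    }
}
```

With this shape, a resolver that always fails with `NotFound` is called exactly once, whereas one that times out is called again until it succeeds or the attempt budget runs out. A real implementation would presumably be async and wait between attempts.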


3130 Commits
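
The arithmetic in commit 60feb168e2 (MAX_SHARDS lowered from 5000 to 2500, so that 200% overcommit corresponds to 5000 shards instead of 10000) can be checked with a small sketch. The function name and signature below are hypothetical, and it assumes utilization is simply the shard count as a percentage of the target; the actual pageserver utilization score may combine other inputs.

```rust
/// Target shard counts before and after commit 60feb168e2.
pub const OLD_MAX_SHARDS: u64 = 5000;
pub const NEW_MAX_SHARDS: u64 = 2500;

/// Hypothetical helper: shard-count utilization as a percentage of the
/// target shard count. Assumes a plain ratio; integer division truncates.
pub fn shard_utilization_percent(shard_count: u64, max_shards: u64) -> u64 {
    shard_count * 100 / max_shards
}
```

Under this assumption, `shard_utilization_percent(5000, NEW_MAX_SHARDS)` and `shard_utilization_percent(10000, OLD_MAX_SHARDS)` both evaluate to 200, matching the commit message.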