rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-25 09:00:37 +00:00

Author	SHA1	Message	Date
Christian Schwarz	35c916c062	Merge commit '5c934efb2' into problame/standby-horizon-leases	2025-08-06 17:50:33 +02:00
Christian Schwarz	02e1aeef66	Merge commit 'a456e818a' into problame/standby-horizon-leases	2025-08-06 17:49:56 +02:00
Christian Schwarz	e2c88c1929	Merge commit '296c9190b' into problame/standby-horizon-leases	2025-08-06 17:49:50 +02:00
Christian Schwarz	553a120075	Merge commit '15f633922' into problame/standby-horizon-leases	2025-08-06 17:49:41 +02:00
Christian Schwarz	e2facbde4e	Merge commit 'cec0543b5' into problame/standby-horizon-leases	2025-08-06 17:47:10 +02:00
Christian Schwarz	b8c8168378	Merge commit 'be5bbaeca' into problame/standby-horizon-leases	2025-08-06 17:46:44 +02:00
Christian Schwarz	28a2cd05d5	Merge commit '5ec82105c' into problame/standby-horizon-leases	2025-08-06 17:46:37 +02:00
Christian Schwarz	1635390a96	fix all clippy complaints in this branch	2025-08-06 17:39:17 +02:00
Christian Schwarz	47146fe1d6	Merge commit '7049003cf' into problame/standby-horizon-leases	2025-08-06 17:17:11 +02:00
Christian Schwarz	2ee0f4271c	fix(page_service): lsn lease API puts tenant_shard_id in tenant_id tracing field The LSN lease api actually accepts a tenant_shard_id, not a tenant_id. But we put the Display of the tenant_shard_id into the tenant_id field. This PR fixes it. Refs - fixes https://databricks.atlassian.net/browse/LKB-2930	2025-08-05 22:48:27 +02:00
Christian Schwarz	8a9f1dd5e7	use tokio::time::Instant internally, chrono::DateTime<Utc> externally; commuicate expiration through rfc3339 format; chrono::DateTime has good Debug fmt so this also serves observability; finish implementing release valve mechanism	2025-08-05 22:47:53 +02:00
Christian Schwarz	9f01840c18	use standby_horizon leases feature in the test, demonstrating that it passes now	2025-08-05 22:47:28 +02:00
Christian Schwarz	44466cebdb	WIP better observability for return values (SystemTime Debug is useless)	2025-08-05 22:46:54 +02:00
Christian Schwarz	b865e85de3	previous commit broke the tests because of the cfg business, see this commit's TODO	2025-08-05 22:46:24 +02:00
Christian Schwarz	73336962a8	finalize 3-stepped feature-gating (legacy,all,leases) + more tests + observability + fixes	2025-08-05 19:24:06 +02:00
Christian Schwarz	3365c8c648	enforce standby_horizon leases are always above applied_gc_cutoff (check against cutoff on upsert + block gc for lease length to allow renewals after attach)	2025-07-26 16:38:44 +02:00
Christian Schwarz	bc09df8823	add todo about init deadline	2025-07-26 16:23:59 +02:00
Christian Schwarz	e1eb98c0e9	add basic test & fix embarrasing bug in cull (needs comment out todo!())	2025-07-26 16:23:59 +02:00
Christian Schwarz	1e61ac6af2	cargo fmt (unrelated to prev commit)	2025-07-26 16:23:59 +02:00
Christian Schwarz	a948054db3	naming orhtodoxy: always refere to leases as LSN leases	2025-07-26 16:23:59 +02:00
Christian Schwarz	2ee24900ca	have claude generate plumbing for standby_horizon_lease_length	2025-07-25 13:16:20 +02:00
Christian Schwarz	23d1029afd	explain why there's no need to check standby_horizon lease deadline for getpage requests	2025-07-25 09:30:27 +00:00
Christian Schwarz	b47d3900b9	observability and debugging facilities	2025-07-20 23:13:02 +00:00
Christian Schwarz	f4b38d5975	expand comment on why we normalize_lsn	2025-07-20 22:38:07 +00:00
Christian Schwarz	2a89f72389	rudimentary leases impl, lacks initial lease deadline stuff	2025-07-20 22:17:03 +00:00
Christian Schwarz	2b5bb850f2	WIP	2025-07-20 20:46:20 +00:00
Christian Schwarz	4dc3acf3ed	WIP: standby_horizon leases	2025-07-20 19:21:47 +00:00
Arpad Müller	5c934efb29	Don't depend on the postgres_ffi just for one type (#12610 ) We don't want to depend on postgres_ffi in an API crate. If there is no such dependency, we can compile stuff like `storcon_cli` without needing a full working postgres build. Fixes regression of #12548 (before we could compile it).	2025-07-15 17:28:08 +00:00
Heikki Linnakangas	5c9c3b3317	Misc cosmetic cleanups (#12598 ) - Remove a few obsolete "allowed error messages" from tests. The pageserver doesn't emit those messages anymore. - Remove misplaced and outdated docstring comment from `test_tenants.py`. A docstring is supposed to be the first thing in a function, but we had added some code before it. And it was outdated, as we haven't supported running without safekeepers for a long time. - Fix misc typos in comments - Remove obsolete comment about backwards compatibility with safekeepers without `TIMELINE_STATUS` API. All safekeepers have it by now.	2025-07-15 14:36:28 +00:00
Vlad Lazar	f8d3f86f58	pageserver: include records in get page debug handler (#12578 ) Include records and image in the debug get page handler. This endpoint does not update the metrics and does not support tracing. Note that this now returns individual bytes which need to be encoded properly for debugging. Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>	2025-07-14 16:37:28 +00:00
HaoyuHuang	f67a8a173e	A few SK changes (#12577 ) # TLDR This PR is a no-op. ## Problem When a SK loses a disk, it must recover all WALs from the very beginning. This may take days/weeks to catch up to the latest WALs for all timelines it owns. ## Summary of changes When SK starts up, if it finds that it has 0 timelines, - it will ask SC for the timeline it owns. - Then, pulls the timeline from its peer safekeepers to restore the WAL redundancy right away. After pulling timeline is complete, it will become active and accepts new WALs. The current impl is a prototype. We can optimize the impl further, e.g., parallel pull timelines. --------- Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>	2025-07-14 16:37:04 +00:00
Erik Grinaker	eb830fa547	pageserver/client_grpc: use unbounded pools (#12585 ) ## Problem The communicator gRPC client currently uses bounded client/stream pools. This can artificially constrain clients, especially after we remove pipelining in #12584. [Benchmarks](https://github.com/neondatabase/neon/pull/12583) show that the cost of an idle server-side GetPage worker task is about 26 KB (2.5 GB for 100,000), so we can afford to scale out. In the worst case, we'll degenerate to the current libpq state with one stream per backend, but without the TCP connection overhead. In the common case we expect significantly lower stream counts due to stream sharing, driven e.g. by idle backends, LFC hits, read coalescing, sharding (backends typically only talk to one shard at a time), etc. Currently, Pageservers rarely serve more than 4000 backend connections, so we have at least 2 orders of magnitude of headroom. Touches #11735. Requires #12584. ## Summary of changes Remove the pool limits, and restructure the pools. We still keep a separate bulk pool for Getpage batches of >4 pages (>32 KB), with fewer streams per connection. This reduces TCP-level congestion and head-of-line blocking for non-bulk requests, and concentrates larger window sizes on a smaller set of streams/connections, presumably reducing memory usage. Apart from this, bulk requests don't have any latency penalty compared to other requests.	2025-07-14 13:22:38 +00:00
Erik Grinaker	a203f9829a	pageserver: add timeline_id span when freezing layers (#12572 ) ## Problem We don't log the timeline ID when rolling ephemeral layers during housekeeping. Resolves [LKB-179](https://databricks.atlassian.net/browse/LKB-179) ## Summary of changes Add a span with timeline ID when calling `maybe_freeze_ephemeral_layer` from the housekeeping loop. We don't instrument the function itself, since future callers may not have a span including the tenant_id already, but we don't want to duplicate the tenant_id for these spans.	2025-07-14 12:30:28 +00:00
Erik Grinaker	42ab34dc36	pageserver/client_grpc: don't pipeline GetPage requests (#12584 ) ## Problem The communicator gRPC client currently attempts to pipeline GetPage requests from multiple callers onto the same gRPC stream. This has a number of issues: * Head-of-line blocking: the request may block on e.g. layer download or LSN wait, delaying the next request. * Cancellation: we can't easily cancel in-progress requests (e.g. due to timeout or backend termination), so it may keep blocking the next request (even its own retry). * Complex stream scheduling: picking a stream becomes harder/slower, and additional Tokio tasks and synchronization is needed for stream management. Touches #11735. Requires #12579. ## Summary of changes This patch removes pipelining of gRPC stream requests, and instead prefers to scale out the number of streams to achieve the same throughput. Stream scheduling has been rewritten, and mostly follows the same pattern as the client pool with exclusive acquisition by a single caller. [Benchmarks](https://github.com/neondatabase/neon/pull/12583) show that the cost of an idle server-side GetPage worker task is about 26 KB (2.5 GB for 100,000), so we can afford to scale out. This has a number of advantages: * It (mostly) eliminates head-of-line blocking (except at the TCP level). * Cancellation becomes trivial, by closing the stream. * Stream scheduling becomes significantly simpler and cheaper. * Individual callers can still use client-side batching for pipelining.	2025-07-14 12:11:33 +00:00
Erik Grinaker	30b877074c	pagebench: add CPU profiling support (#12478 ) ## Problem The new communicator gRPC client has significantly worse Pagebench performance than a basic gRPC client. We need to find out why. ## Summary of changes Add a `pagebench --profile` flag which takes a client CPU profile of the benchmark and writes a flamegraph to `profile.svg`.	2025-07-14 11:44:53 +00:00
Erik Grinaker	f18cc808f0	pageserver/client_grpc: reap idle channels immediately (#12587 ) ## Problem It can take 3x the idle timeout to reap a channel. We have to wait for the idle timeout to trigger first for the stream, then the client, then the channel. Touches #11735. ## Summary of changes Reap empty channels immediately, and rely indirectly on the channel/stream timeouts. This can still lead to 2x the idle timeout for streams (first stream then client), but that's okay -- if the stream closes abruptly (e.g. due to timeout or error) we want to keep the client around in the pool for a while.	2025-07-14 10:47:26 +00:00
Erik Grinaker	d14d8271b8	pageserver/client_grpc: improve retry logic (#12579 ) ## Problem gRPC client retries currently include pool acquisition under the per-attempt timeout. If pool acquisition is slow (e.g. full pool), this will cause spurious timeout warnings, and the caller will lose its place in the pool queue. Touches #11735. ## Summary of changes Makes several improvements to retries and related logic: * Don't include pool acquisition time under request timeouts. * Move attempt timeouts out of `Retry` and into the closure. * Make `Retry` configurable, move constants into main module. * Don't backoff on the first retry, and reduce initial/max backoffs to 5ms and 5s respectively. * Add `with_retries` and `with_timeout` helpers. * Add slow logging for pool acquisition, and a `warn_slow` counterpart to `log_slow`. * Add debug logging for requests and responses at the client boundary.	2025-07-14 10:43:10 +00:00
Erik Grinaker	fecb707b19	pagebench: add `idle-streams` (#12583 ) ## Problem For the communicator scheduling policy, we need to understand the server-side cost of idle gRPC streams. Touches #11735. ## Summary of changes Add an `idle-streams` benchmark to `pagebench` which opens a large number of idle gRPC GetPage streams.	2025-07-14 09:41:58 +00:00
HaoyuHuang	cb991fba42	A few more PS changes (#12552 ) # TLDR Problem-I is a bug fix. The rest are no-ops. ## Problem I Page server checks image layer creation based on the elapsed time but this check depends on the current logical size, which is only computed on shard 0. Thus, for non-0 shards, the check will be ineffective and image creation will never be done for idle tenants. ## Summary of changes I This PR fixes the problem by simply removing the dependency on current logical size. ## Summary of changes II This PR adds a timeout when calling page server to split shard to make sure SC does not wait for the API call forever. Currently the PR doesn't adds any retry logic because it's not clear whether page server shard split can be safely retried if the existing operation is still ongoing or left the storage in a bad state. Thus it's better to abort the whole operation and restart. ## Problem III `test_remote_failures` requires PS to be compiled in the testing mode. For PS in dev/staging, they are compiled without this mode. ## Summary of changes III Remove the restriction and also increase the number of total failures allowed. ## Summary of changes IV remove test on PS getpage http route. --------- Co-authored-by: Chen Luo <chen.luo@databricks.com> Co-authored-by: Yecheng Yang <carlton.yang@databricks.com> Co-authored-by: Vlad Lazar <vlad@neon.tech>	2025-07-11 19:27:55 +00:00
Matthias van de Meent	4566b12a22	NEON: Finish Zenith->Neon rename (#12566 ) Even though we're now part of Databricks, let's at least make this part consistent. ## Summary of changes - PG14: https://github.com/neondatabase/postgres/pull/669 - PG15: https://github.com/neondatabase/postgres/pull/670 - PG16: https://github.com/neondatabase/postgres/pull/671 - PG17: https://github.com/neondatabase/postgres/pull/672 --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-07-11 18:56:39 +00:00
Alex Chi Z.	63ca084696	fix(pageserver): downgrade wal apply error during gc-compaction (#12518 ) ## Problem close LKB-162 close https://github.com/neondatabase/cloud/issues/30665, related to https://github.com/neondatabase/cloud/issues/29434 We see a lot of errors like: ``` 2025-05-22T23:06:14.928959Z ERROR compaction_loop{tenant_id=? shard_id=0304}:run:gc_compact_timeline{timeline_id=?}: error applying 4 WAL records 35/DC0DF0B8..3B/E43188C0 (8119 bytes) to key 000000067F0000400500006027000000B9D0, from base image with LSN 0/0 to reconstruct page image at LSN 61/150B9B20 n_attempts=0: apply_wal_records Caused by: 0: read walredo stdout 1: early eof ``` which is an acceptable form of error and we should downgrade it to warning. ## Summary of changes walredo error during gc-compaction is expected when the data below the gc horizon does not contain a full key history. This is possible in some rare cases of gc that is only able to remove data in the middle of the history but not all earlier history when a full keyspace gets deleted. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-11 18:37:55 +00:00
Vlad Lazar	154f6dc59c	pageserver: log only on final shard resolution failure (#12565 ) This log is too noisy. Instead of warning on every retry, let's log only on the final failure.	2025-07-11 13:25:25 +00:00
Vlad Lazar	15f633922a	pageserver: use image consistent LSN for force image layer creation (#12547 ) This is a no-op for the neon deployment * Introduce the concept image consistent lsn: of the largest LSN below which all pages have been redone successfully * Use the image consistent LSN for forced image layer creations * Optionally expose the image consistent LSN via the timeline describe HTTP endpoint * Add a sharded timeline describe endpoint to storcon --------- Co-authored-by: Chen Luo <chen.luo@databricks.com>	2025-07-11 11:39:51 +00:00
Erik Grinaker	8aa9540a05	pageserver/page_api: include block number and rel in gRPC `GetPageResponse` (#12542 ) ## Problem With gRPC `GetPageRequest` batches, we'll have non-trivial fragmentation/reassembly logic in several places of the stack (concurrent reads, shard splits, LFC hits, etc). If we included the block numbers with the pages in `GetPageResponse` we could have better verification and observability that the final responses are correct. Touches #11735. Requires #12480. ## Summary of changes Add a `Page` struct with`block_number` for `GetPageResponse`, along with the `RelTag` for completeness, and verify them in the rich gRPC client.	2025-07-10 22:35:14 +00:00
Erik Grinaker	44ea17b7b2	pageserver/page_api: add attempt to GetPage request ID (#12536 ) ## Problem `GetPageRequest::request_id` is supposed to be a unique ID for a request. It's not, because we may retry the request using the same ID. This causes assertion failures and confusion. Touches #11735. Requires #12480. ## Summary of changes Extend the request ID with a retry attempt, and handle it in the gRPC client and server.	2025-07-10 20:39:42 +00:00
Erik Grinaker	dcdfe80bf0	pagebench: add support for rich gRPC client (#12477 ) ## Problem We need to benchmark the rich gRPC client `client_grpc::PageserverClient` against the basic, no-frills `page_api::Client` to determine how much overhead it adds. Touches #11735. Requires #12476. ## Summary of changes Add a `pagebench --rich-client` parameter to use `client_grpc::PageserverClient`. Also adds a compression parameter to the client.	2025-07-10 17:30:09 +00:00
Erik Grinaker	2fc77c836b	pageserver/client_grpc: add shard map updates (#12480 ) ## Problem The communicator gRPC client must support changing the shard map on splits. Touches #11735. Requires #12476. ## Summary of changes * Wrap the shard set in a `ArcSwap` to allow swapping it out. * Add a new `ShardSpec` parameter struct to pass validated shard info to the client. * Add `update_shards()` to change the shard set. In-flight requests are allowed to complete using the old shards. * Restructure `get_page` to use a stable view of the shard map, and retry errors at the top (pre-split) level to pick up shard map changes. * Also marks `tonic::Status::Internal` as non-retryable, so that we can use it for client-side invariant checks without continually retrying these.	2025-07-10 15:46:39 +00:00
HaoyuHuang	2c6b327be6	A few PS changes (#12540 ) # TLDR All changes are no-op except some metrics. ## Summary of changes I ### Pageserver Added a new global counter metric `pageserver_pagestream_handler_results_total` that categorizes pagestream request results according to their outcomes: 1. Success 2. Internal errors 3. Other errors Internal errors include: 1. Page reconstruction error: This probably indicates a pageserver bug/corruption 2. LSN timeout error: Could indicate overload or bugs with PS's ability to reach other components 3. Misrouted request error: Indicates bugs in the Storage Controller/HCC Other errors include transient errors that are expected during normal operation or errors indicating bugs with other parts of the system (e.g., malformed requests, errors due to cancelled operations during PS shutdown, etc.) ## Summary of changes II This PR adds a pageserver endpoint and its counterpart in storage controller to list visible size of all tenant shards. This will be a prerequisite of the tenant rebalance command. ## Problem III We need a way to download WAL segments/layerfiles from S3 and replay WAL records. We cannot access production S3 from our laptops directly, and we also can't transfer any user data out of production systems for GDPR compliance, so we need solutions. ## Summary of changes III This PR adds a couple of tools to support the debugging workflow in production: 1. A new `pagectl download-remote-object` command that can be used to download remote storage objects assuming the correct access is set up. ## Summary of changes IV This PR adds a command to list all visible delta and image layers from index_part. This is useful to debug compaction issues as index_part often contain a lot of covered layers due to PITR. --------- Co-authored-by: William Huang <william.huang@databricks.com> Co-authored-by: Chen Luo <chen.luo@databricks.com> Co-authored-by: Vlad Lazar <vlad@neon.tech>	2025-07-10 14:39:38 +00:00
Vlad Lazar	ffeede085e	libs: move metric collection for pageserver and safekeeper in a background task (#12525 ) ## Problem Safekeeper and pageserver metrics collection might time out. We've seen this in both hadron and neon. ## Summary of changes This PR moves metrics collection in PS/SK to the background so that we will always get some metrics, despite there may be some delays. Will leave it to the future work to reduce metrics collection time. --------- Co-authored-by: Chen Luo <chen.luo@databricks.com>	2025-07-10 11:58:22 +00:00
Erik Grinaker	f4b03ddd7b	pageserver/client_grpc: reap idle pool resources (#12476 ) ## Problem The gRPC client pools don't reap idle resources. Touches #11735. Requires #12475. ## Summary of changes Reap idle pool resources (channels/clients/streams) after 3 minutes of inactivity. Also restructure the `StreamPool` to use a mutex rather than atomics for synchronization, for simplicity. This will be optimized later.	2025-07-10 10:18:37 +00:00

1 2 3 4 5 ...

3095 Commits