
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-21 23:20:40 +00:00

Author	SHA1	Message	Date
Erik Grinaker	8cd5370c00	Merge branch 'main' into communicator-rewrite	2025-07-11 10:39:26 +02:00
Erik Grinaker	8aa9540a05	pageserver/page_api: include block number and rel in gRPC `GetPageResponse` (#12542) ## Problem With gRPC `GetPageRequest` batches, we'll have non-trivial fragmentation/reassembly logic in several places of the stack (concurrent reads, shard splits, LFC hits, etc). If we included the block numbers with the pages in `GetPageResponse` we could have better verification and observability that the final responses are correct. Touches #11735. Requires #12480. ## Summary of changes Add a `Page` struct with `block_number` for `GetPageResponse`, along with the `RelTag` for completeness, and verify them in the rich gRPC client.	2025-07-10 22:35:14 +00:00
Erik Grinaker	44ea17b7b2	pageserver/page_api: add attempt to GetPage request ID (#12536) ## Problem `GetPageRequest::request_id` is supposed to be a unique ID for a request. It's not, because we may retry the request using the same ID. This causes assertion failures and confusion. Touches #11735. Requires #12480. ## Summary of changes Extend the request ID with a retry attempt, and handle it in the gRPC client and server.	2025-07-10 20:39:42 +00:00
Erik Grinaker	dcdfe80bf0	pagebench: add support for rich gRPC client (#12477) ## Problem We need to benchmark the rich gRPC client `client_grpc::PageserverClient` against the basic, no-frills `page_api::Client` to determine how much overhead it adds. Touches #11735. Requires #12476. ## Summary of changes Add a `pagebench --rich-client` parameter to use `client_grpc::PageserverClient`. Also adds a compression parameter to the client.	2025-07-10 17:30:09 +00:00
Erik Grinaker	2fc77c836b	pageserver/client_grpc: add shard map updates (#12480) ## Problem The communicator gRPC client must support changing the shard map on splits. Touches #11735. Requires #12476. ## Summary of changes * Wrap the shard set in an `ArcSwap` to allow swapping it out. * Add a new `ShardSpec` parameter struct to pass validated shard info to the client. * Add `update_shards()` to change the shard set. In-flight requests are allowed to complete using the old shards. * Restructure `get_page` to use a stable view of the shard map, and retry errors at the top (pre-split) level to pick up shard map changes. * Also marks `tonic::Status::Internal` as non-retryable, so that we can use it for client-side invariant checks without continually retrying these.	2025-07-10 15:46:39 +00:00
Erik Grinaker	f4b03ddd7b	pageserver/client_grpc: reap idle pool resources (#12476) ## Problem The gRPC client pools don't reap idle resources. Touches #11735. Requires #12475. ## Summary of changes Reap idle pool resources (channels/clients/streams) after 3 minutes of inactivity. Also restructure the `StreamPool` to use a mutex rather than atomics for synchronization, for simplicity. This will be optimized later.	2025-07-10 10:18:37 +00:00
Erik Grinaker	2f71eda00f	pageserver/client_grpc: add separate pools for bulk requests (#12475) ## Problem GetPage bulk requests such as prefetches and vacuum can head-of-line block foreground requests, causing increased latency. Touches #11735. Requires #12469. ## Summary of changes * Use dedicated channel/client/stream pools for bulk GetPage requests. * Use lower concurrency but higher queue depth for bulk pools. * Make pool limits configurable. * Require unbounded client pool for stream pool, to avoid accidental starvation.	2025-07-09 16:12:59 +00:00
Erik Grinaker	8f3351fa91	pageserver/client_grpc: split GetPage batches across shards (#12469) ## Problem The rich gRPC Pageserver client needs to split GetPage batches that straddle multiple shards. Touches #11735. Requires #12462. ## Summary of changes Adds a `GetPageSplitter` which splits `GetPageRequest`s that span multiple shards, and then reassembles the responses. Dispatches per-shard requests in parallel.	2025-07-09 14:17:22 +00:00
Heikki Linnakangas	8db138ef64	Plumb through the stripe size to the communicator	2025-07-09 16:18:26 +03:00
Erik Grinaker	3915995530	pageserver/client_grpc: add rich Pageserver gRPC client (#12462) ## Problem For the communicator, we need a rich Pageserver gRPC client. Touches #11735. Requires #12434. ## Summary of changes This patch adds an initial rich Pageserver gRPC client. It supports: * Sharded tenants across multiple Pageservers. * Pooling of connections, clients, and streams for efficient resource use. * Concurrent use by many callers. * Internal handling of GetPage bidirectional streams, with pipelining and error handling. * Automatic retries. * Observability. The client is still under development. In particular, it needs GetPage batch splitting, shard map updates, and performance optimization. This will be addressed in follow-up PRs.	2025-07-09 11:42:46 +00:00
Erik Grinaker	08399672be	Temporary workaround for timeout retry errors	2025-07-09 09:49:15 +02:00
Erik Grinaker	8223c1ba9d	pageserver/client_grpc: add initial gRPC client pools (#12434) ## Problem The communicator will need gRPC channel/client/stream pools for efficient reuse across many backends. Touches #11735. Requires #12396. ## Summary of changes Adds three nested resource pools: * `ChannelPool` for gRPC channels (i.e. TCP connections). * `ClientPool` for gRPC clients (i.e. `page_api::Client`). Acquires channels from `ChannelPool`. * `StreamPool` for gRPC GetPage streams. Acquires clients from `ClientPool`. These are minimal functional implementations that will need further improvements and performance optimization. However, the overall structure is expected to be roughly final, so reviews should focus on that. The pools are not yet in use, but will form the foundation of a rich gRPC Pageserver client used by the communicator (see #12462). This PR also adds the initial crate scaffolding for that client. See doc comments for details.	2025-07-08 20:58:18 +00:00
Erik Grinaker	9ae004f3bc	Rename ShardMap to ShardSpec	2025-07-06 19:13:59 +02:00
Erik Grinaker	341c5f53d8	Restructure get_page retries	2025-07-06 18:35:47 +02:00
Erik Grinaker	4b06b547c1	pageserver/client_grpc: add shard map updates	2025-07-06 13:27:17 +02:00
Erik Grinaker	23ba42446b	Fix accidental 1ms sleeps for GetPages	2025-07-06 11:09:58 +02:00
Erik Grinaker	6f3fb4433f	Add TODO	2025-07-05 14:15:34 +02:00
Erik Grinaker	d7678df445	Reap idle pool resources	2025-07-05 13:35:28 +02:00
Erik Grinaker	03d9f0ec41	Comment tweaks	2025-07-05 11:16:40 +02:00
Erik Grinaker	56845f2da2	Add `GetPageClass::is_bulk`	2025-07-05 11:15:28 +02:00
Erik Grinaker	cb698a3951	Add dedicated client pools for bulk requests	2025-07-04 21:52:25 +02:00
Erik Grinaker	f6cc5cbd0c	Split out retry handler to separate module	2025-07-04 20:20:09 +02:00
Erik Grinaker	88d1127bf4	Tweak GetPageSplitter	2025-07-03 21:12:26 +02:00
Erik Grinaker	42e4e5a418	Add GetPage request splitting	2025-07-03 18:31:12 +02:00
Erik Grinaker	6f8650782f	Client tweaks	2025-07-03 14:54:23 +02:00
Erik Grinaker	14214eb853	Add client shard routing	2025-07-03 14:42:35 +02:00
Erik Grinaker	d4b4724921	Sanity-check Pageserver URLs	2025-07-03 14:18:14 +02:00
Erik Grinaker	9aba9550dd	Instrument client methods	2025-07-03 14:11:53 +02:00
Erik Grinaker	375e8e5592	Improve retries and logging	2025-07-03 14:02:43 +02:00
Erik Grinaker	52c586f678	Restructure shard management	2025-07-03 11:51:19 +02:00
Erik Grinaker	12dade35fa	Comment tweaks	2025-07-02 14:47:27 +02:00
Erik Grinaker	1ec63bd6bc	Misc pool improvements	2025-07-02 14:42:06 +02:00
Erik Grinaker	bf01145ae4	Remove some old code	2025-07-02 11:46:54 +02:00
Erik Grinaker	6f0af96a54	Add new PageserverClient	2025-07-02 10:59:40 +02:00
Heikki Linnakangas	9913d2668a	print retried pageserver requests to log Not sure how verbose we want this to be in production, but for now, more is better. This shows that many tests are failing with errors like these: PG:2025-07-01 23:02:34.311 GMT [1456523] LOG: [COMMUNICATOR] send_process_get_rel_size_request: got error status: NotFound, message: "Read error", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 01 Jul 2025 23:02:34 GMT"} }, retrying I haven't debugged why that is yet. Did the compute make a bogus request?	2025-07-02 02:04:04 +03:00
Erik Grinaker	f6761760a2	Documentation and tweaks	2025-07-01 17:54:41 +02:00
Erik Grinaker	0bce818d5e	Add stream pool	2025-07-01 17:54:41 +02:00
Erik Grinaker	48be1da6ef	Add initial client pool	2025-07-01 17:54:41 +02:00
Erik Grinaker	d2efc80e40	Add initial ChannelPool	2025-07-01 17:54:41 +02:00
Erik Grinaker	c3cb1ab98d	Merge branch 'main' into communicator-rewrite	2025-06-30 21:07:01 +02:00
Erik Grinaker	81ac4ef43a	Add a generic pool prototype	2025-06-30 14:49:34 +02:00
Erik Grinaker	67b04f8ab3	Fix a bunch of linter warnings	2025-06-30 11:10:02 +02:00
Heikki Linnakangas	924c6a6fdf	Fix handling the case that server closes the stream - avoid panic by checking for Ok(None) response from tonic::Streaming::message() instead of just using unwrap() - There was a race condition, if the caller sent the message, but the receiver task concurrently received Ok(None) indicating the stream was closed. (I didn't see that in action, but I think it could happen by reading the code)	2025-06-29 22:53:39 +03:00
Heikki Linnakangas	7020476bf5	Run `cargo fmt`	2025-06-29 22:53:09 +03:00
Heikki Linnakangas	80e948db93	Remove unused mock factory After reading the code a few times, I didn't quite understand what it was, to be honest, or how it was going to be used. Remove it now to reduce noise, but we can resurrect it from git history if we need it in the future.	2025-06-29 22:52:48 +03:00
Heikki Linnakangas	bfb30d434c	minor code tidy-up	2025-06-29 22:51:34 +03:00
Heikki Linnakangas	f3ba201800	Run `cargo fmt`	2025-06-29 21:21:07 +03:00
Heikki Linnakangas	8b7796cbfa	wip	2025-06-29 21:20:48 +03:00
Heikki Linnakangas	fdc7e9c2a4	Extract repeated code to look up RequestTracker into a helper function	2025-06-29 21:20:14 +03:00
Heikki Linnakangas	a352d290eb	Plumb through both libpq and grpc connection strings to the compute Add a new 'pageserver_connection_info' field in the compute spec. It replaces the old 'pageserver_connstring' field with a more complicated struct that includes both libpq and grpc URLs, for each shard (or only one of the URLs, depending on the configuration). It also includes a flag suggesting which one to use; compute_ctl now uses it to decide which protocol to use for the basebackup. This is compatible with everything that's in production, because the control plane never used the 'pageserver_connstring' field. That was added a long time ago with the idea that it would replace the code that digs the 'neon.pageserver_connstring' GUC from the list of Postgres settings, but we never got around to doing that in the control plane. Hence, it was only used with neon_local. But the plan now is to pass the 'pageserver_connection_info' from the control plane, and once that's fully deployed everywhere, the code to parse 'neon.pageserver_connstring' in compute_ctl can be removed. The 'grpc' flag on an endpoint in endpoint config is now more of a suggestion. Compute_ctl gets both URLs, so it can choose to use libpq or grpc as it wishes. It currently always obeys the 'prefer_grpc' flag that's part of the connection info though. Postgres however uses grpc iff the new rust-based communicator is enabled. TODO/plan for the control plane: - Start to pass `pageserver_connection_info` in the spec file. - Also keep the current `neon.pageserver_connstring` setting for now, for backwards compatibility with old computes After that, the `pageserver_connection_info.prefer_grpc` flag in the spec file can be used to control whether compute_ctl uses grpc or libpq. The actual compute's grpc usage will be controlled by the `neon.enable_new_communicator` GUC. It can be set separately from 'prefer_grpc'. Later: - Once all old computes are gone, remove the code to pass `neon.pageserver_connstring`	2025-06-29 18:16:49 +03:00
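
Two of the commits listed above describe splitting a GetPage batch across shards and reassembling the responses (#12469), and tagging each returned page with its block number so the reassembled result can be verified (#12542). The following is only a minimal, standard-library sketch of that idea, not the code from those commits. Every name in it (`shard_for_block`, `split_batch`, `reassemble`, and this simplified `Page`) is hypothetical, and the round-robin stripe mapping is an assumption made for illustration; it is not claimed to be Neon's actual key-to-shard function.

```rust
use std::collections::BTreeMap;

/// Hypothetical shard mapping, assumed purely for illustration:
/// consecutive stripes of `stripe_size` blocks go to shards round-robin.
fn shard_for_block(block: u32, stripe_size: u32, shard_count: u32) -> u32 {
    (block / stripe_size) % shard_count
}

/// Splits a batch of block numbers into per-shard sub-batches,
/// preserving the request order within each shard.
fn split_batch(blocks: &[u32], stripe_size: u32, shard_count: u32) -> BTreeMap<u32, Vec<u32>> {
    let mut per_shard: BTreeMap<u32, Vec<u32>> = BTreeMap::new();
    for &block in blocks {
        let shard = shard_for_block(block, stripe_size, shard_count);
        per_shard.entry(shard).or_default().push(block);
    }
    per_shard
}

/// Simplified stand-in for a page response that carries its block number.
#[derive(Debug, Clone, PartialEq)]
struct Page {
    block_number: u32,
    image: Vec<u8>,
}

/// Reassembles per-shard responses into the original request order and
/// checks that every page carries the block number that was asked for.
fn reassemble(
    blocks: &[u32],
    stripe_size: u32,
    shard_count: u32,
    responses: BTreeMap<u32, Vec<Page>>,
) -> Result<Vec<Page>, String> {
    // One iterator per shard, consumed in request order.
    let mut iters: BTreeMap<u32, std::vec::IntoIter<Page>> = responses
        .into_iter()
        .map(|(shard, pages)| (shard, pages.into_iter()))
        .collect();

    let mut out = Vec::with_capacity(blocks.len());
    for &block in blocks {
        let shard = shard_for_block(block, stripe_size, shard_count);
        let page = iters
            .get_mut(&shard)
            .and_then(|it| it.next())
            .ok_or_else(|| format!("missing page for block {block} from shard {shard}"))?;
        if page.block_number != block {
            return Err(format!("expected block {block}, got {}", page.block_number));
        }
        out.push(page);
    }

    // Any leftover page means a shard returned more than was requested.
    if iters.values_mut().any(|it| it.next().is_some()) {
        return Err("shard returned more pages than requested".to_string());
    }
    Ok(out)
}
```

Under these assumptions, a caller would run `split_batch` on the requested blocks, send each sub-batch to its shard, and pass the collected responses to `reassemble`, which returns an error rather than a silently misordered result if any page is missing, extra, or tagged with an unexpected block number.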


81 Commits