Commit Graph

8428 Commits

Author SHA1 Message Date
Heikki Linnakangas
c14cf15b52 Tidy up the memory ordering instructions on request slot code
I believe the explicit memory fence instructions are
unnecessary: performing a store with Release ordering also makes all
the preceding non-atomic writes visible. Per the Rust docs for Ordering::Release
( https://doc.rust-lang.org/std/sync/atomic/enum.Ordering.html#variant.Release):

> When coupled with a store, all previous operations become ordered
> before any load of this value with Acquire (or stronger)
> ordering. In particular, all previous writes become visible to all
> threads that perform an Acquire (or stronger) load of this value.
>
> ...
>
> Corresponds to memory_order_release in C++20.

The "all previous writes" means non-atomic writes too. It's not very
clear from that text, but the C++20 docs that it links to is more
explicit about it:

> All memory writes (including non-atomic and relaxed atomic) that
> happened-before the atomic store from the point of view of thread A,
> become visible side-effects in thread B. That is, once the atomic
> load is completed, thread B is guaranteed to see everything thread A
> wrote to memory.

In addition to removing the fence instructions, fix the comments on
each atomic Acquire operation to point to the correct Release
counterpart. We had such comments but they had gone out-of-date as
code has moved.
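
A minimal, self-contained sketch of the pattern (the types and names
here are hypothetical, not the actual slot code): the Release store
alone publishes the plain payload write to any thread whose Acquire
load observes the stored value, with no separate fence.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU32, Ordering};

const SUBMITTED: u32 = 1;

// Hypothetical slot: `state` is the atomic publish flag, `payload`
// is plain (non-atomic) memory.
struct RequestSlot {
    payload: UnsafeCell<u64>,
    state: AtomicU32,
}

// Writer: plain write, then a Release store. The store orders the
// earlier non-atomic write before itself; no fence needed.
unsafe fn submit(slot: &RequestSlot, req: u64) {
    *slot.payload.get() = req; // non-atomic write
    slot.state.store(SUBMITTED, Ordering::Release);
}

// Reader: an Acquire load that observes SUBMITTED is guaranteed to
// also observe the payload write made before the Release store.
unsafe fn poll(slot: &RequestSlot) -> Option<u64> {
    if slot.state.load(Ordering::Acquire) == SUBMITTED {
        Some(*slot.payload.get())
    } else {
        None
    }
}
```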
2025-07-10 15:19:20 +03:00
Heikki Linnakangas
5da06d4129 Make start_neon_io_request() wakeup the communicator process
All the callers did that previously, so rather than document that the
caller needs to do it, just do it in start_neon_io_request() straight
away. (We might want to revisit this if we get codepaths where the C
code submits multiple IO requests as a batch. In that case, it would
be more efficient to fill all the request slots first and send only
one notification to the pipe for all of them.)
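
A rough sketch of the shape of the change (the helper names here are
invented for illustration):

```rust
struct NeonIORequest;

fn fill_request_slot(_req: &NeonIORequest) { /* write the request into a slot */ }
fn notify_communicator_pipe() { /* write a byte to the wakeup pipe */ }

fn start_neon_io_request(req: NeonIORequest) {
    fill_request_slot(&req);
    // Wake up the communicator right away, instead of making every
    // caller remember to do it. If C code ever submits batches, we'd
    // fill all the slots first and notify only once.
    notify_communicator_pipe();
}
```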
2025-07-10 15:19:20 +03:00
Heikki Linnakangas
f30c59bec9 Improve comments on request slots 2025-07-10 15:19:20 +03:00
Heikki Linnakangas
47c099a0fb Rename NeonIOHandle to NeonIORequestSlot
All the code talks about "request slots", so it's better for the
struct name to reflect that. The "Handle" term was borrowed from the
Postgres v18 AIO implementation, from the similar handles or slots
used to submit IO requests from backends to worker processes. But
even though the idea is similar, this is a completely separate
implementation, and nothing is shared between them beyond the very
high-level design.
2025-07-10 14:52:16 +03:00
Heikki Linnakangas
b67e8f2edc Move some code, just for more natural logical ordering 2025-07-10 14:49:29 +03:00
Heikki Linnakangas
b5b1db29bb Implement shard map live-update 2025-07-10 12:25:15 +03:00
Heikki Linnakangas
ed4652b65b Update the relsize cache rather than forget it at end of index build
This greatly reduces the cases where we make a request to the
pageserver with a very recent LSN. Those cases are slow because the
pageserver needs to wait for the WAL to arrive. This speeds up the
Postgres pg_regress and isolation tests considerably.
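
To illustrate the idea (the cache types here are stand-ins, not the
real relsize cache API): instead of forgetting the cached size, record
the size we already know, so the next lookup doesn't need a
recent-LSN request to the pageserver.

```rust
use std::collections::HashMap;

type RelTag = u32; // stand-in for the real relation key
struct RelSizeCache(HashMap<RelTag, u32>);

impl RelSizeCache {
    // Old behavior: drop the entry, so the next lookup asks the
    // pageserver at a very recent LSN (slow: it waits for WAL).
    fn forget(&mut self, rel: RelTag) {
        self.0.remove(&rel);
    }

    // New behavior: we know the size at the end of the index build,
    // so just record it.
    fn update(&mut self, rel: RelTag, nblocks: u32) {
        self.0.insert(rel, nblocks);
    }
}
```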
2025-07-09 17:21:06 +03:00
Heikki Linnakangas
60d87966b8 minor comment improvement 2025-07-09 16:39:40 +03:00
Heikki Linnakangas
8db138ef64 Plumb through the stripe size to the communicator 2025-07-09 16:18:26 +03:00
Heikki Linnakangas
1ee24602d5 Implement working set size estimation 2025-07-09 16:18:26 +03:00
Heikki Linnakangas
732bd26e70 cargo fmt 2025-07-09 16:18:26 +03:00
Erik Grinaker
08399672be Temporary workaround for timeout retry errors 2025-07-09 09:49:15 +02:00
Heikki Linnakangas
d63f1d259a avoid assertion failure about calling palloc() in critical section 2025-07-08 21:33:25 +03:00
Heikki Linnakangas
4053092408 Fix LSN tracking on "unlogged index builds"
Fixes the test_gin_redo.py test failure, and probably some others
2025-07-08 17:22:24 +03:00
Heikki Linnakangas
ccf88e9375 Improve debug logging by printing IO request details 2025-07-08 17:16:09 +03:00
Heikki Linnakangas
a79fd3bda7 Move logic for picking request slot to the C code
With this refactoring, the Rust code deals with one giant array of
requests, and doesn't know that it's sliced up per backend
process. The C code is now responsible for slicing it up.

This also adds code to complete, at backend start, old IOs that were
started and left behind by a previous session. That was a little more
straightforward to do after the refactoring, which is why I tackled
it now.
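
Conceptually, the division of labor is now (names hypothetical):

```rust
// The Rust side sees only a flat array of request slots. Computing
// which contiguous range belongs to which backend is the C side's
// job; in Rust terms it would amount to nothing more than this:
fn backend_slot_index(backend_id: usize, slots_per_backend: usize, n: usize) -> usize {
    backend_id * slots_per_backend + n
}
```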
2025-07-07 12:59:08 +03:00
Heikki Linnakangas
e1b58d5d69 Don't segfault if one of the unimplemented functions is called
We'll need to implement these, but let's stop the crashing for now.
2025-07-07 11:33:44 +03:00
Erik Grinaker
9ae004f3bc Rename ShardMap to ShardSpec 2025-07-06 19:13:59 +02:00
Erik Grinaker
341c5f53d8 Restructure get_page retries 2025-07-06 18:35:47 +02:00
Erik Grinaker
4b06b547c1 pageserver/client_grpc: add shard map updates 2025-07-06 13:27:17 +02:00
Heikki Linnakangas
74e0d85a04 fix: Don't lose track of in-progress request if query is cancelled 2025-07-06 13:04:03 +03:00
Erik Grinaker
23ba42446b Fix accidental 1ms sleeps for GetPages 2025-07-06 11:09:58 +02:00
Heikki Linnakangas
71a83daac2 Revert crate dependencies to the versions in main branch
Some tests were failing with "Only request bodies with a known size
can be checksum validated." errors. This is a known issue with more
recent AWS client versions; see
https://github.com/neondatabase/neon/issues/11363.
2025-07-05 18:03:19 +03:00
Heikki Linnakangas
1b8355a9f9 put back option lost in merge 2025-07-05 17:36:27 +03:00
Heikki Linnakangas
e14bb4be39 Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-05 16:59:51 +03:00
Heikki Linnakangas
f3a6c0d8ff cargo fmt 2025-07-05 16:26:24 +03:00
Heikki Linnakangas
17ec37aab2 Close gRPC getpage streams on shutdown
Some tests were failing because the pageserver didn't shut down
promptly. Tonic's graceful server shutdown was a little too graceful:
any open streams linger until they're closed. Check the cancellation
token while waiting for the next request, and close the stream if
shutdown/cancellation was requested.
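
A sketch of the shutdown-aware receive loop (the tonic/tokio APIs are
real; the function itself is illustrative):

```rust
use tokio_util::sync::CancellationToken;

// Instead of waiting unconditionally for the next request, also
// watch the cancellation token, and end the stream (None) when
// shutdown is requested.
async fn next_request<T>(
    stream: &mut tonic::Streaming<T>,
    cancel: &CancellationToken,
) -> Option<Result<T, tonic::Status>> {
    tokio::select! {
        _ = cancel.cancelled() => None,
        msg = stream.message() => msg.transpose(),
    }
}
```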
2025-07-05 16:26:24 +03:00
Heikki Linnakangas
d6ec1f1a1c Skip legacy LFC initialization when communicator is used
It clashes with the initialization of the LFC file
2025-07-05 16:26:24 +03:00
Erik Grinaker
6f3fb4433f Add TODO 2025-07-05 14:15:34 +02:00
Erik Grinaker
d7678df445 Reap idle pool resources 2025-07-05 13:35:28 +02:00
Erik Grinaker
03d9f0ec41 Comment tweaks 2025-07-05 11:16:40 +02:00
Erik Grinaker
56845f2da2 Add GetPageClass::is_bulk 2025-07-05 11:15:28 +02:00
Heikki Linnakangas
b568189f7b Build dummy libcommunicator into the 'neon' extension (#12266)
This doesn't do anything interesting yet, but demonstrates linking Rust
code to the neon Postgres extension, so that we can review and test
drive just the build process changes independently.
2025-07-04 23:27:28 +00:00
Heikki Linnakangas
9a37bfdf63 Fix re-finding an entry in bucket chain 2025-07-05 00:44:46 +03:00
Arpad Müller
b94a5ce119 Don't await the walreceiver on timeline shutdown (#12402)
Mostly a revert of https://github.com/neondatabase/neon/pull/11851 and
https://github.com/neondatabase/neon/pull/12330 .

Christian suggested reverting his PR to fix the issue
https://github.com/neondatabase/neon/issues/12369 .

Alternatives considered:

1. I had originally wanted to introduce cancellation tokens to
`RequestContext`, but in the end I gave up on them because I didn't find
a select-free way of preventing
`test_layer_download_cancelled_by_config_location` from hanging.

Namely, if I put a select around the `get_or_maybe_download` invocation
in `get_values_reconstruct_data`, it wouldn't hang, but if I put it
around the `download_init_and_wait` invocation in
`get_or_maybe_download`, the test would still hang. I'm not sure why,
even though I made the attached child function of the `RequestContext`
create a child token (see the sketch after this list).

2. Introduction of a `download_cancel` cancellation token as a child of
a timeline token, putting it into `RemoteTimelineClient` together with
the main token, and then putting it into the whole
`RemoteTimelineClient` read path.

3. Greater refactorings, like making cancellation tokens follow a DAG
structure so that a token can be cancelled either by, say, the timeline
shutting down or by a request ending. This isn't just an effort that
we don't have the engineering budget for; it also raises interesting
questions, like what to do about batching (do you cancel the entire
batch if only some of its requests get cancelled?).
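
For reference, the child-token mechanism from alternative 1 (the
tokio-util API is real; how it would hang off `RequestContext` is not
shown):

```rust
use tokio_util::sync::CancellationToken;

// A child token is cancelled whenever its parent is cancelled, but
// can also be cancelled on its own without affecting the parent.
fn per_request_token(timeline_cancel: &CancellationToken) -> CancellationToken {
    timeline_cancel.child_token()
}
```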

We might see a reemergence of
https://github.com/neondatabase/neon/issues/11762, but given that we
now have https://github.com/neondatabase/neon/pull/11853 and
https://github.com/neondatabase/neon/pull/12376, it's possible that it
won't come back. Looking at some of the code, this might actually fix
the locations where the error pops up. Let's see.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2025-07-04 20:12:10 +00:00
Heikki Linnakangas
4c916552e8 Reduce logging noise
These are very useful while debugging, but also very noisy; let's dial
it down a little.
2025-07-04 23:11:36 +03:00
Heikki Linnakangas
50fbf4ac53 Fix hash table initialization across forked processes
attach_writer()/reader() are called from each forked process. It's too
late to do initialization there; in fact, we used to overwrite the
contents of the hash table (or at least the freelist?) every time a
new process attached to it. The initialization must be done earlier,
in the HashMapInit() constructors.
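
A sketch of the fix (types hypothetical): the one-time setup moves
into the constructor that creates the shared area, and attaching
becomes a pure "map it in" operation.

```rust
struct HashMapInit { /* shared-memory layout, freelist, ... */ }
struct Writer<'a> { map: &'a HashMapInit }

impl HashMapInit {
    fn new_in_shmem() -> Self {
        // Build the freelist etc. here, exactly once, before any
        // process forks off and attaches.
        HashMapInit {}
    }

    fn attach_writer(&self) -> Writer<'_> {
        // Called from each forked process: must NOT re-initialize,
        // or it clobbers the freelist other processes already use.
        Writer { map: self }
    }
}
```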
2025-07-04 23:08:34 +03:00
Erik Grinaker
cb698a3951 Add dedicated client pools for bulk requests 2025-07-04 21:52:25 +02:00
Mikhail
7ed4530618 offload_lfc_interval_seconds in ComputeSpec (#12447)
- Add a ComputeSpec flag `offload_lfc_interval_seconds` controlling
  whether the LFC should be offloaded to endpoint storage. The default
  value (None) means "don't offload"; see the sketch below.
- Add glue code around it for `neon_local` and integration tests.
- Add an `autoprewarm` mode for `test_lfc_prewarm`, testing the
  `offload_lfc_interval_seconds` and `autoprewarm` flags in conjunction.
- Rename `compute_ctl_lfc_prewarm_requests_total` and
  `compute_ctl_lfc_offload_requests_total` to
  `compute_ctl_lfc_prewarms_total` and `compute_ctl_lfc_offloads_total`,
  to reflect that we count prewarms and offloads, not `compute_ctl`
  requests for them.
- Don't count a request in the metrics if a prewarm/offload is already
  ongoing.
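
A sketch of the flag's semantics (the serde shape is assumed, not
copied from the actual ComputeSpec definition):

```rust
// A missing value deserializes to None, which means "don't offload".
#[derive(serde::Deserialize)]
struct ComputeSpec {
    #[serde(default)]
    offload_lfc_interval_seconds: Option<u64>,
}
```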

https://github.com/neondatabase/cloud/issues/19011
Resolves: https://github.com/neondatabase/cloud/issues/30770
2025-07-04 18:49:57 +00:00
Erik Grinaker
f6cc5cbd0c Split out retry handler to separate module 2025-07-04 20:20:09 +02:00
Heikki Linnakangas
00affada26 Add request ID to all communicator log lines as context information 2025-07-04 20:34:26 +03:00
Heikki Linnakangas
90d3c09c24 Minor cleanup
Tidy up and add some comments. Rename a few things for clarity.
2025-07-04 20:32:59 +03:00
Heikki Linnakangas
6c398aeae7 Fix dependency in Makefile 2025-07-04 20:24:21 +03:00
Heikki Linnakangas
3a44774227 impr(ci): Simplify build-macos workflow, prepare for rust communicator (#12357)
Don't build walproposer-lib as a separate job; it only takes a few
seconds once you have built all its dependencies.

Don't cache the Neon Pg extensions in the per-Postgres-version caches.
This is in preparation for the communicator project, which will
introduce Rust parts to the Neon Pg extension and thereby complicate
the build process. With that change, the 'make neon-pg-ext' step
requires some of the Rust bits to be built already, or it will build
them on the spot, which in turn requires all the Rust sources to be
present, and we don't want to repeat that part for each Postgres
version anyway. To prepare for that, rely on "make all" to build the
neon extension and the Rust bits in the correct order instead.
Building the neon extension doesn't currently take very long anyway
once you have built Postgres itself, so you don't gain much by caching
it. See https://github.com/neondatabase/neon/pull/12266.

Add an explicit "rustup update" step to update the toolchain. It's not
strictly necessary right now, because currently "make all" will only
invoke "cargo build" once and the race condition described in the
comment doesn't happen. But prepare for the future.

To further simplify the build, get rid of the separate 'build-postgres'
jobs too, and just build Postgres as a step in the main job. That makes
the overall workflow run longer, because we no longer build all the
postgres versions in parallel (although you still get intra-runner
parallelism thanks to `make -j`), but that's acceptable. In the
cache-hit case, it might even be a little faster because there is less
overhead from launching jobs, and in the cache-miss case, it's maybe
5-10 minutes slower altogether.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2025-07-04 15:34:58 +00:00
Heikki Linnakangas
1856bbbb9f Minor cleanup and commenting 2025-07-04 18:28:34 +03:00
Aleksandr Sarantsev
b2705cfee6 storcon: Make node deletion process cancellable (#12320)
## Problem

The current deletion operation is synchronous and blocking, which is
unsuitable for potentially long-running tasks like node deletion. In
such cases, the standard HTTP request-response pattern is not a good fit.
standard HTTP request-response pattern is not a good fit.

## Summary of Changes

- Added new `storcon_cli` commands: `NodeStartDelete` and
`NodeCancelDelete` to initiate and cancel deletion asynchronously.
- Added corresponding `storcon` HTTP handlers to support the new
start/cancel deletion flow.
- Introduced a new type of background operation, `Delete`, to track and
manage the deletion process outside the request lifecycle (see the
sketch below).
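
A rough sketch of the new background operation variant (the
surrounding types here are hypothetical):

```rust
// Deletion now runs as a tracked background operation that can be
// cancelled, instead of running inside the HTTP request handler.
struct Delete { /* node id, cancellation handle, ... */ }

enum BackgroundOperation {
    // ...existing operation kinds...
    Delete(Delete),
}
```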

---------

Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
2025-07-04 14:08:09 +00:00
Heikki Linnakangas
bd46dd60a0 Add a temporary timeout to IO request handling in the communicator
It's nicer to time out in the communicator and return an error to the
backend than to PANIC the backend.
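
A sketch of the idea (tokio's timeout is real; the request types and
the 30-second value are placeholders):

```rust
use std::time::Duration;
use tokio::time::timeout;

struct NeonIORequest; // stand-ins for the real request/result types
struct NeonIOResult;

async fn handle_request(_req: NeonIORequest) -> NeonIOResult {
    NeonIOResult
}

// Bound the wait, and turn expiry into an error returned to the
// backend, instead of letting the backend PANIC on a stuck request.
async fn handle_with_timeout(req: NeonIORequest) -> Result<NeonIOResult, String> {
    timeout(Duration::from_secs(30), handle_request(req))
        .await
        .map_err(|_| "IO request timed out".to_string())
}
```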
2025-07-04 16:08:22 +03:00
Heikki Linnakangas
5f2d476a58 Add request ID to io-in-progress locking table, to ease debugging
I also added INFO messages for when a backend blocks on the
io-in-progress lock. It's probably too noisy for production, but
useful now to get a picture of how much it happens.
2025-07-04 15:55:57 +03:00
Heikki Linnakangas
3231cb6138 Await the io-in-progress locking futures
Otherwise they don't do anything. Oops.
2025-07-04 15:55:57 +03:00
Heikki Linnakangas
e558e0da5c Assign request_id earlier, in the originating backend
Makes it more useful for stitching together logs etc. for a specific
request.
2025-07-04 15:55:55 +03:00