rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-10 15:02:56 +00:00

Author	SHA1	Message	Date
quantumish	d33071a386	Fix concurrency bugs in resizing (WIP)	2025-08-12 10:34:29 -07:00
David Freifeld	add1a0ad78	Add initial work in implementing incremental resizing (WIP)	2025-07-14 09:02:56 -07:00
David Freifeld	282b90df28	Fix freelist bug, clean up interface to BucketIdx	2025-07-11 12:47:27 -07:00
David Freifeld	7f655ddb9a	Heavily WIP rewrite of hashmap for concurrency + perf	2025-07-10 17:34:30 -07:00
David Freifeld	da6f4dbafe	Merge branch 'communicator-rewrite' into quantumish/lfc-resizable-map	2025-07-10 17:33:56 -07:00
Heikki Linnakangas	a79fd3bda7	Move logic for picking request slot to the C code With this refactoring, the Rust code deals with one giant array of requests, and doesn't know that it's sliced up per backend process. The C code is now responsible for slicing it up. This also adds code to complete old IOs at backend start that were started and left behind by a previous session. That was a little more straightforward to do with the refactoring, which is why I tackled it now.	2025-07-07 12:59:08 +03:00
Heikki Linnakangas	e1b58d5d69	Don't segfault if one of the unimplemented functions is called We'll need to implement these, but let's stop the crashing for now	2025-07-07 11:33:44 +03:00
Erik Grinaker	9ae004f3bc	Rename ShardMap to ShardSpec	2025-07-06 19:13:59 +02:00
Erik Grinaker	341c5f53d8	Restructure get_page retries	2025-07-06 18:35:47 +02:00
Erik Grinaker	4b06b547c1	pageserver/client_grpc: add shard map updates	2025-07-06 13:27:17 +02:00
Heikki Linnakangas	74e0d85a04	fix: Don't lose track of in-progress request if query is cancelled	2025-07-06 13:04:03 +03:00
Erik Grinaker	23ba42446b	Fix accidental 1ms sleeps for GetPages	2025-07-06 11:09:58 +02:00
Heikki Linnakangas	71a83daac2	Revert crate dependencies to the versions in main branch Some tests were failing with "Only request bodies with a known size can be checksum validated." errors. This is a known issue with more recent aws client versions, see https://github.com/neondatabase/neon/issues/11363.	2025-07-05 18:03:19 +03:00
Heikki Linnakangas	1b8355a9f9	put back option lost in merge	2025-07-05 17:36:27 +03:00
Heikki Linnakangas	e14bb4be39	Merge remote-tracking branch 'origin/main' into communicator-rewrite	2025-07-05 16:59:51 +03:00
Heikki Linnakangas	f3a6c0d8ff	cargo fmt	2025-07-05 16:26:24 +03:00
Heikki Linnakangas	17ec37aab2	Close gRPC getpage streams on shutdown Some tests were failing, because pageserver didn't shut down promptly. Tonic server graceful shutdown was a little too graceful; any open streams linger until they're closed. Check the cancellation token while waiting for next request, and close the stream if shutdown/cancellation was requested.	2025-07-05 16:26:24 +03:00
Heikki Linnakangas	d6ec1f1a1c	Skip legacy LFC initialization when communicator is used It clashes with the initialization of the LFC file	2025-07-05 16:26:24 +03:00
Erik Grinaker	6f3fb4433f	Add TODO	2025-07-05 14:15:34 +02:00
Erik Grinaker	d7678df445	Reap idle pool resources	2025-07-05 13:35:28 +02:00
Erik Grinaker	03d9f0ec41	Comment tweaks	2025-07-05 11:16:40 +02:00
Erik Grinaker	56845f2da2	Add `GetPageClass::is_bulk`	2025-07-05 11:15:28 +02:00
Heikki Linnakangas	b568189f7b	Build dummy libcommunicator into the 'neon' extension (#12266 ) This doesn't do anything interesting yet, but demonstrates linking Rust code to the neon Postgres extension, so that we can review and test drive just the build process changes independently.	2025-07-04 23:27:28 +00:00
Heikki Linnakangas	9a37bfdf63	Fix re-finding an entry in bucket chain	2025-07-05 00:44:46 +03:00
Arpad Müller	b94a5ce119	Don't await the walreceiver on timeline shutdown (#12402 ) Mostly a revert of https://github.com/neondatabase/neon/pull/11851 and https://github.com/neondatabase/neon/pull/12330 . Christian suggested reverting his PR to fix the issue https://github.com/neondatabase/neon/issues/12369 . Alternatives considered: 1. I have originally wanted to introduce cancellation tokens to `RequestContext`, but in the end I gave up on them because I didn't find a select-free way of preventing `test_layer_download_cancelled_by_config_location` from hanging. Namely if I put a select around the `get_or_maybe_download` invocation in `get_values_reconstruct_data`, it wouldn't hang, but if I put it around the `download_init_and_wait` invocation in `get_or_maybe_download`, the test would still hang. Not sure why, even though I made the attached child function of the `RequestContext` create a child token. 2. Introduction of a `download_cancel` cancellation token as a child of a timeline token, putting it into `RemoteTimelineClient` together with the main token, and then putting it into the whole `RemoteTimelineClient` read path. 3. Greater refactorings, like to make cancellation tokens follow a DAG structure so you can have tokens cancelled either by say timeline shutting down or a request ending. It doesn't just represent an effort that we don't have the engineering budget for, it also causes interesting questions like what to do about batching (do you cancel the entire request if only some requests get cancelled?). We might see a reemergence of https://github.com/neondatabase/neon/issues/11762, but given that we have https://github.com/neondatabase/neon/pull/11853 and https://github.com/neondatabase/neon/pull/12376 now, it is possible that it will not come back. Looking at some code, it might actually fix the locations where the error pops up. Let's see. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-07-04 20:12:10 +00:00
Heikki Linnakangas	4c916552e8	Reduce logging noise These are very useful while debugging, but also very noisy; let's dial it down a little.	2025-07-04 23:11:36 +03:00
Heikki Linnakangas	50fbf4ac53	Fix hash table initialization across forked processes attach_writer()/reader() are called from each forked process. It's too late to do initialization there, in fact we used to overwrite the contents of the hash table (or at least the freelist?) every time a new process attached to it. The initialization must be done earlier, in the HashMapInit() constructors.	2025-07-04 23:08:34 +03:00
Erik Grinaker	cb698a3951	Add dedicated client pools for bulk requests	2025-07-04 21:52:25 +02:00
Mikhail	7ed4530618	`offload_lfc_interval_seconds` in ComputeSpec (#12447 ) - Add ComputeSpec flag `offload_lfc_interval_seconds` controlling whether LFC should be offloaded to endpoint storage. Default value (None) means "don't offload". - Add glue code around it for `neon_local` and integration tests. - Add `autoprewarm` mode for `test_lfc_prewarm` testing `offload_lfc_interval_seconds` and `autoprewarm` flags in conjunction. - Rename `compute_ctl_lfc_prewarm_requests_total` and `compute_ctl_lfc_offload_requests_total` to `compute_ctl_lfc_prewarms_total` and `compute_ctl_lfc_offloads_total` to reflect we count prewarms and offloads, not `compute_ctl` requests of those. Don't count request in metrics if there is a prewarm/offload already ongoing. https://github.com/neondatabase/cloud/issues/19011 Resolves: https://github.com/neondatabase/cloud/issues/30770	2025-07-04 18:49:57 +00:00
Erik Grinaker	f6cc5cbd0c	Split out retry handler to separate module	2025-07-04 20:20:09 +02:00
Heikki Linnakangas	00affada26	Add request ID to all communicator log lines as context information	2025-07-04 20:34:26 +03:00
Heikki Linnakangas	90d3c09c24	Minor cleanup Tidy up and add some comments. Rename a few things for clarity.	2025-07-04 20:32:59 +03:00
Heikki Linnakangas	6c398aeae7	Fix dependency in Makefile	2025-07-04 20:24:21 +03:00
Heikki Linnakangas	3a44774227	impr(ci): Simplify build-macos workflow, prepare for rust communicator (#12357 ) Don't build walproposer-lib as a separate job. It only takes a few seconds, after you have built all its dependencies. Don't cache the Neon Pg extensions in the per-postgres-version caches. This is in preparation for the communicator project, which will introduce Rust parts to the Neon Pg extension, which complicates the build process. With that, the 'make neon-pg-ext' step requires some of the Rust bits to be built already, or it will build them on the spot, which in turn requires all the Rust sources to be present, and we don't want to repeat that part for each Postgres version anyway. To prepare for that, rely on "make all" to build the neon extension and the rust bits in the correct order instead. Building the neon extension doesn't currently take very long anyway after you have built Postgres itself, so you don't gain much by caching it. See https://github.com/neondatabase/neon/pull/12266. Add an explicit "rustup update" step to update the toolchain. It's not strictly necessary right now, because currently "make all" will only invoke "cargo build" once and the race condition described in the comment doesn't happen. But prepare for the future. To further simplify the build, get rid of the separate 'build-postgres' jobs too, and just build Postgres as a step in the main job. That makes the overall workflow run longer, because we no longer build all the postgres versions in parallel (although you still get intra-runner parallelism thanks to `make -j`), but that's acceptable. In the cache-hit case, it might even be a little faster because there is less overhead from launching jobs, and in the cache-miss case, it's maybe 5-10 minutes slower altogether. --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-07-04 15:34:58 +00:00
Heikki Linnakangas	1856bbbb9f	Minor cleanup and commenting	2025-07-04 18:28:34 +03:00
Aleksandr Sarantsev	b2705cfee6	storcon: Make node deletion process cancellable (#12320) ## Problem The current deletion operation is synchronous and blocking, which is unsuitable for potentially long-running tasks. In such cases, the standard HTTP request-response pattern is not a good fit. ## Summary of Changes - Added new `storcon_cli` commands: `NodeStartDelete` and `NodeCancelDelete` to initiate and cancel deletion asynchronously. - Added corresponding `storcon` HTTP handlers to support the new start/cancel deletion flow. - Introduced a new type of background operation: `Delete`, to track and manage the deletion process outside the request lifecycle. --------- Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-07-04 14:08:09 +00:00
Heikki Linnakangas	bd46dd60a0	Add a temporary timeout to handling an IO request in the communicator It's nicer to timeout in the communicator and return an error to the backend, than PANIC the backend.	2025-07-04 16:08:22 +03:00
Heikki Linnakangas	5f2d476a58	Add request ID to io-in-progress locking table, to ease debugging I also added INFO messages for when a backend blocks on the io-in-progress lock. It's probably too noisy for production, but useful now to get a picture of how much it happens.	2025-07-04 15:55:57 +03:00
Heikki Linnakangas	3231cb6138	Await the io-in-progress locking futures Otherwise they don't do anything. Oops.	2025-07-04 15:55:57 +03:00
Heikki Linnakangas	e558e0da5c	Assign request_id earlier, in the originating backend Makes it more useful for stitching together logs etc. for a specific request.	2025-07-04 15:55:55 +03:00
Heikki Linnakangas	70bf2e088d	Request multiple block numbers in a single GetPageV request That's how it was always intended to be used	2025-07-04 15:49:04 +03:00
Trung Dinh	225267b3ae	Make disk eviction run by default (#12464 ) ## Problem ## Summary of changes Provide a sane set of default values for disk_usage_based_eviction. Closes https://github.com/neondatabase/neon/issues/12301.	2025-07-04 12:06:10 +00:00
Vlad Lazar	d378726e38	pageserver: reset the broker subscription if it's been idle for a while (#12436) ## Problem I suspect that the pageservers get stuck on receiving broker updates. ## Summary of changes This is an opportunistic (staging only) patch that resets the subscription stream if it's been idle for a while. This won't go to prod in this form. I'll revert or update it before Friday.	2025-07-04 10:25:03 +00:00
Konstantin Knizhnik	436a117c15	Do not allocate anything in subtransaction memory context (#12176 ) ## Problem See https://github.com/neondatabase/neon/issues/12173 ## Summary of changes Allocate table in TopTransactionMemoryContext --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-07-04 10:24:39 +00:00
Heikki Linnakangas	da3f9ee72d	cargo fmt	2025-07-04 12:39:41 +03:00
Alex Chi Z.	cc699f6f85	fix(pageserver): do not log no-route-to-host errors (#12468 ) ## Problem close https://github.com/neondatabase/neon/issues/12344 ## Summary of changes Add `HostUnreachable` and `NetworkUnreachable` to expected I/O error. This was new in Rust 1.83. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-03 21:57:42 +00:00
Erik Grinaker	88d1127bf4	Tweak GetPageSplitter	2025-07-03 21:12:26 +02:00
David Freifeld	794bb7a9e8	Merge branch 'quantumish/comm-lfc-integration' into communicator-rewrite	2025-07-03 10:52:29 -07:00
Konstantin Knizhnik	495112ca50	Add GUC for dynamically enabling compare local mode (#12424) ## Problem DEBUG_LOCAL_COMPARE mode makes it possible to detect data corruption. But it requires rebuilding the neon extension (and so requires a special image) and significantly slows down execution because it always fetches pages from the page server. ## Summary of changes Introduce new GUC `neon.debug_compare_local`, accepting the following values: "none", "prefetch", "lfc", "all" (it is disabled by default). In modes less than "all", neon SMGR will not fetch a page from PS if it is found in local caches. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-07-03 17:37:05 +00:00
Suhas Thalanki	46158ee63f	fix(compute): background installed extensions worker would collect data without waiting for interval (#12465 ) ## Problem The background installed extensions worker relied on `interval.tick()` to go to sleep for a period of time. This can lead to bugs due to the interval being updated at the end of the loop as the first tick is [instantaneous](https://docs.rs/tokio/latest/tokio/time/struct.Interval.html#method.tick). ## Summary of changes Changed it to a `tokio::time::sleep` to prevent this issue. Now it puts the thread to sleep and only wakes up after the specified duration	2025-07-03 17:10:30 +00:00

8419 Commits