rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-23 06:09:59 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	cd14f6ca94	Fix LSN lease background thread with grpc The spawned thread didn't have the tokio runtime active, which led to this error: ERROR lsn_lease_bg_task{tenant_id=1bb647cb7d3974b52e74f7442fa7d059 timeline_id=cf41456d3202e1c3940cb8f372d160ab lsn=0/1576000}:panic{thread=<unnamed> location=compute_tools/src/lsn_lease.rs:201:5}: there is no reactor running, must be called from the context of a Tokio 1.x runtime Fixes `test_readonly_node_gc`	2025-08-01 00:37:22 +03:00
Heikki Linnakangas	8ed56decfb	Make LFC prewarming test case less sensitive to LFC chunk size Namely, this makes it pass with the new communicator, which doesn't do chunking at all.	2025-08-01 00:24:01 +03:00
Heikki Linnakangas	e466cd1eb2	fix prewarm test with grpc I added a fixture to run these tests with and without grpc, but missed passing the option to one endpoint creation.	2025-08-01 00:12:06 +03:00
Heikki Linnakangas	4a031b9467	Fix LFC prewarm cancellation	2025-08-01 00:11:50 +03:00
Heikki Linnakangas	26bd994852	reformat	2025-07-31 23:44:11 +03:00
Heikki Linnakangas	b78cdfe3ea	Fix test_lfc_prewarm.py test failure	2025-07-31 23:44:11 +03:00
Heikki Linnakangas	50302499f5	Silence test failure with gRPC The error message is just a little different with gRPC.	2025-07-31 22:02:30 +03:00
Heikki Linnakangas	ede37c5346	revert unintentional changes to submodules	2025-07-31 21:17:19 +03:00
Heikki Linnakangas	b72f410b6e	cargo fmt	2025-07-31 20:47:52 +03:00
Heikki Linnakangas	e1c7d79e2a	dial down smgr trace logging to same level as on 'main'	2025-07-31 20:47:20 +03:00
Heikki Linnakangas	bb1f50bf09	Set `num_shards` in shared memory. The get_num_shards() function, called from the WAL proposer, requires it. Fixes test_timeline_size_quota_on_startup	2025-07-31 16:29:24 +03:00
Heikki Linnakangas	9871a3f9e7	tidy up error handling a bit Pass back a suitable 'errno' from the communicator process to the originating backend in all cases. Usually it's just EIO because we don't have a good way to map from tonic StatusCodes to libc error numbers. That's probably good enough; from the original backend's perspective all errors are IO errors. In the C code, set libc errno variable before calling ereport(), so that errcode_for_file_access() works. And once we do that, we can replace pg_strerror() calls with %m.	2025-07-31 15:31:19 +03:00
Heikki Linnakangas	e1df05448c	Merge remote-tracking branch 'origin/main' into communicator-rewrite	2025-07-31 15:01:34 +03:00
Heikki Linnakangas	b4a63e0a34	Fix how `neon.stripe_size` option is set in postgresql.conf file (#12776 ) Commit `1dce2a9e74` changed how the `neon.pageserver_connstring` setting is formed, but it messed up setting the `neon.stripe_size` setting so that it was set twice. That got mixed up during development of the patch, as commit `7fef4435c1` landed first and was merged incorrectly.	2025-07-31 11:46:57 +00:00
Heikki Linnakangas	17cd611ccc	Merge remote-tracking branch 'origin/main' into communicator-rewrite	2025-07-31 14:45:22 +03:00
Heikki Linnakangas	c509d53cd1	fix clippy warnings	2025-07-31 14:45:13 +03:00
Heikki Linnakangas	84f4dcd2be	fix test scripts to not set neon.use_communicator_worker anymore compute_ctl does it based on prefer_protocol now	2025-07-31 14:36:26 +03:00
Heikki Linnakangas	b4808a4e5c	Set `neon.use_communicator_worker` GUC based on `prefer_protocol`	2025-07-31 14:24:38 +03:00
Heikki Linnakangas	5e2a19ce73	cargo fmt	2025-07-31 14:24:17 +03:00
Heikki Linnakangas	8a4f16a471	More work on metrics Switch to the 'measured' crate everywhere in the communicator. Connect the allocator metrics to the metrics endpoint.	2025-07-31 14:09:39 +03:00
Erik Grinaker	f8fc0bf3c0	neon_local: use doc comments for help texts (#12270 ) Clap automatically uses doc comments as help/about texts. Doc comments are strictly better, since they're also used e.g. for IDE documentation, and are better formatted. This patch updates all `neon_local` commands to use doc comments (courtesy of GPT-o3).	2025-07-31 10:25:33 +00:00
Alexey Kondratov	8fe7596120	chore(compute_tools): Delete unused anon_ext_fn_reassign.sql (#12787 ) It's an anon v1 failed launch artifact, I suppose.	2025-07-31 10:11:30 +00:00
Krzysztof Szafrański	f3ee6e818d	[proxy] Correctly classify ConnectErrors (#12793 ) As is, e.g. quota errors on wake compute are logged as "compute" errors.	2025-07-31 09:53:48 +00:00
Dmitrii Kovalkov	edd60730c8	safekeeper: use last_log_term in mconf switch + choose most advanced sk in pull timeline (#12778 ) ## Problem I discovered two bugs related to safekeeper migration, which together might lead to data loss during the migration. The second bug is from a hadron patch and might lead to data loss during the safekeeper restore in hadron as well. 1. `switch_membership` returns the current `term` instead of `last_log_term`. It is used to choose the `sync_position` in the algorithm, so we might choose the wrong one and break the correctness guarantees. 2. The current `term` is used to choose the most advanced SK in `pull_timeline` with higher priority than `flush_lsn`. It is incorrect because the most advanced safekeeper is the one with the highest `(last_log_term, flush_lsn)` pair. The compute might bump term on the least advanced sk, making it the best choice to pull from, and thus making committed log entries "uncommitted" after `pull_timeline` Part of https://databricks.atlassian.net/browse/LKB-1017 ## Summary of changes - Return `last_log_term` in `switch_membership` - Use `(last_log_term, flush_lsn)` as a primary key for choosing the most advanced sk in `pull_timeline` and deny pulling if the `max_term` is higher than on the most advanced sk (hadron only) - Write tests for both cases - Retry `sync_safekeepers` in `compute_ctl` - Take the quorum size into account when calculating `sync_position`	2025-07-31 09:29:25 +00:00
Aleksandr Sarantsev	975b95f4cd	Introduce deletion API improvement RFC (#12484 ) ## Problem The deletion logic had become difficult to understand and maintain. ## Summary of changes - Added an RFC detailing proposed improvements to all deletion-related APIs. --------- Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-07-31 08:34:47 +00:00
Heikki Linnakangas	0428164058	Fix LFC stats exposed by the built-in prometheus endpoint	2025-07-31 11:34:14 +03:00
Heikki Linnakangas	c8042f9e31	Run pgindent on the new communicator C code	2025-07-31 11:11:38 +03:00
Heikki Linnakangas	4016808dff	Handle get_raw_page_at_lsn() debugging function properly This adds a new request type between backend and communicator, to make a getpage request at a given LSN, bypassing the LFC. Only used by the get_raw_page_at_lsn() debugging/testing function.	2025-07-31 11:04:15 +03:00
Mikhail	01c39f378e	prewarm cancellation (#12785 ) Add DELETE /lfc/prewarm route which handles ongoing prewarm cancellation, update API spec, add prewarm Cancelled state Add offload Cancelled state when LFC is not initialized	2025-07-30 22:05:51 +00:00
Dimitri Fontaine	4d3b28bd2e	[Hadron] Always run databricks auth hook. (#12683 )	2025-07-30 21:34:30 +00:00
Heikki Linnakangas	c8b875c93b	Merge remote-tracking branch 'origin/main' into communicator-rewrite	2025-07-30 23:08:43 +03:00
Heikki Linnakangas	768fc101cc	Merge remote-tracking branch 'origin/main' into communicator-rewrite	2025-07-30 23:08:18 +03:00
Heikki Linnakangas	81ddd10be6	tests: Don't print Hostname on every test connection (#12782 ) These lines are a significant fraction of the total log size of the regression tests. And it seems very uninteresting, it's always 'localhost' in local tests.	2025-07-30 19:56:22 +00:00
Heikki Linnakangas	3dfa2fc3e4	Fix relsize caching in hot standby mode Fixes remaining test_hot_standby.py failures	2025-07-30 22:55:38 +03:00
Suhas Thalanki	e470997627	enable tests introduced in hadron commits (#12790 ) Enables skipped tests introduced in hadron integration commits	2025-07-30 19:10:33 +00:00
Heikki Linnakangas	49204b6a59	don't try to update the legacy last-written LSN cache with new communicator	2025-07-30 22:01:04 +03:00
Heikki Linnakangas	c0360644a7	Evict and retry if the block hash map is full I made this change to the is_write==true case earlier already, but the is_write==false codepath needs the same treatment.	2025-07-30 21:48:25 +03:00
Heikki Linnakangas	688990e7ec	Crank down the logging More logs are useful during debugging, but it's time to crank it down a notch...	2025-07-30 21:24:19 +03:00
Heikki Linnakangas	af5e3da381	Fix updating last-written LSN when WAL redo skips updating a block This makes the test_replica_query_race test pass, and probably some other read replica tests too.	2025-07-30 21:20:10 +03:00
Erik Grinaker	eb2741758b	storcon: actually update gRPC address on reattach (#12784 ) ## Problem In #12268, we added Pageserver gRPC addresses to the storage controller. However, we didn't actually persist these in the database. ## Summary of changes Update the database with the new gRPC address on reattach.	2025-07-30 16:18:35 +00:00
Matthias van de Meent	f3a0e4f255	Improve specificity with which we apply compute specs (#12773 ) This makes sure we don't confuse user-controlled functions with PG's builtin functions. ## Problem See https://github.com/neondatabase/cloud/issues/31628	2025-07-30 15:29:16 +00:00
Suhas Thalanki	842a5091d5	[BRC-3051] Walproposer: Safekeeper quorum health metrics (#930 ) (#12750 ) Today we don't have any indications (other than spammy logs in PG that nobody monitors) if the Walproposer in PG cannot connect to/get votes from all Safekeepers. This means we don't have signals indicating that the Safekeepers are operating at degraded redundancy. We need these signals. Added plumbing in PG extension so that the `neon_perf_counters` view exports the following gauge metrics on safekeeper health: - `num_configured_safekeepers`: The total number of safekeepers configured in PG. - `num_active_safekeepers`: The number of safekeepers that PG is actively streaming WAL to. An alert should be raised whenever `num_active_safekeepers` < `num_configured_safekeepers`. The metrics are implemented by adding additional state to the Walproposer shared memory keeping track of the active statuses of safekeepers using a simple array. The status of the safekeeper is set to active (1) after the Walproposer acquires a quorum and starts streaming data to the safekeeper, and is set to inactive (0) when the connection with a safekeeper is shut down. We scan the safekeeper status array in Walproposer shared memory when collecting the metrics to produce results for the gauges. Added coverage for the metrics to integration test `test_wal_acceptor.py::test_timeline_disk_usage_limit`. ## Problem ## Summary of changes --------- Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-30 15:14:59 +00:00
Suhas Thalanki	056056bef0	fix(compute): validate `prewarm_local_cache()` input (#12648 ) ## Problem ``` postgres=> select neon.prewarm_local_cache('\xfcfcfcfc01000000ffffffff070000000000000000000000000000000000000000000000000000000000000000000000000000ff', 1); WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. FATAL: server conn crashed? ``` The function takes a bytea argument and casts it to a C struct, without validating the contents. ## Summary of changes Added validation for number of pages to be prefetched and for the chunks as well.	2025-07-30 14:33:19 +00:00
Heikki Linnakangas	fca52af7e3	Don't update the legacy last-written LSN cache with new communicator The new communicator has its own tracking	2025-07-30 17:31:51 +03:00
Ruslan Talpa	e989e0da78	[proxy] accept jwts when configured as rest_broker (#12777 ) ## Problem when compiled with rest_broker feature and is_rest_broker=true (but is_auth_broker=false) accept_jwts is set to false ## Summary of changes set the config with ``` accept_jwts: args.is_auth_broker || args.is_rest_broker ``` Co-authored-by: Ruslan Talpa <ruslan.talpa@databricks.com>	2025-07-30 14:17:51 +00:00
Heikki Linnakangas	b3c1aecd11	tests: Stop endpoints in parallel (#12769 ) Shaves off a few seconds from tests involving multiple endpoints.	2025-07-30 12:19:00 +00:00
Heikki Linnakangas	95ef69ca95	Enable gRPC in the docker-compose setup	2025-07-30 15:16:50 +03:00
Heikki Linnakangas	9e250e382a	Merge remote-tracking branch 'origin/main' into communicator-rewrite	2025-07-30 11:19:42 +03:00
Heikki Linnakangas	1dce2a9e74	Change how pageserver connection info is passed in compute spec (#12604 ) Add a new 'pageserver_connection_info' field in the compute spec. It replaces the old 'pageserver_connstring' field with a more complicated struct that includes both libpq and grpc URLs, for each shard (or only one of the URLs, depending on the configuration). It also includes a flag suggesting which one to use; compute_ctl now uses it to decide which protocol to use for the basebackup. This is backwards-compatible with everything that's in production. If the control plane fills in `pageserver_connection_info`, compute_ctl uses that. If it fills in the `pageserver_connstring`/`shard_stripe_size` fields, it uses those. As a last resort, it uses the 'neon.pageserver_connstring' GUC from the list of Postgres settings. The 'grpc' flag in the endpoint config is now more of a suggestion, and it's used to populate the 'prefer_protocol' flag in the compute spec. Regardless of the flag, compute_ctl gets both URLs, so it can choose to use libpq or grpc as it wishes. It currently always obeys the flag to choose which method to use for getting the basebackup, but Postgres itself will always use the libpq protocol. (That will be changed with the new rust-based communicator project, which implements the gRPC client in the compute). After that, the `pageserver_connection_info.prefer_protocol` flag in the spec file can be used to control whether compute_ctl uses grpc or libpq. The actual compute's grpc usage will be controlled by the `neon.enable_new_communicator` GUC (not yet; that will be introduced in the future, with the new rust-based communicator project). It can be set separately from 'prefer_protocol'. Later: - Once all old computes are gone, remove the code to pass `neon.pageserver_connstring`	2025-07-29 22:20:05 +00:00
HaoyuHuang	ca88521653	Set neon_superuser privilege under lakebase mode (#12775 ) ## Problem ## Summary of changes	2025-07-29 21:30:34 +00:00
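
Commit edd60730c8 above says the most advanced safekeeper is the one with the highest `(last_log_term, flush_lsn)` pair, not the one with the highest current `term`. A minimal sketch of that ordering follows. It is an illustration only: `SkStatus` and `most_advanced` are hypothetical names, the LSN is simplified to a plain `u64`, and none of this is taken from the actual neon code.

```rust
/// Hypothetical summary of one safekeeper's state. The field names follow
/// the wording of the commit message, not the real neon types.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SkStatus {
    id: u32,
    term: u64,          // current term; may be bumped without any new log entries
    last_log_term: u64, // term of the last entry actually written to the log
    flush_lsn: u64,     // simplified stand-in for the flushed LSN
}

/// Pick the safekeeper with the highest (last_log_term, flush_lsn) pair.
/// Tuples compare lexicographically, so flush_lsn only breaks ties between
/// safekeepers that share the same last_log_term.
fn most_advanced(sks: &[SkStatus]) -> Option<&SkStatus> {
    sks.iter().max_by_key(|sk| (sk.last_log_term, sk.flush_lsn))
}

fn main() {
    let sks = vec![
        // The term was bumped on the least advanced safekeeper.
        SkStatus { id: 1, term: 5, last_log_term: 3, flush_lsn: 100 },
        SkStatus { id: 2, term: 4, last_log_term: 4, flush_lsn: 200 },
        SkStatus { id: 3, term: 4, last_log_term: 4, flush_lsn: 150 },
    ];
    if let Some(best) = most_advanced(&sks) {
        println!("most advanced safekeeper: {}", best.id);
    }
}
```

In this example safekeeper 1 has the highest `term` but the oldest log, so ordering by `term` first would select it. That is the failure mode the commit describes.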

8758 Commits