rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 05:52:55 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	bb1f50bf09	Set `num_shards` in shared memory. The get_num_shards() function, called from the WAL proposer, requires it. Fixes test_timeline_size_quota_on_startup	2025-07-31 16:29:24 +03:00
Heikki Linnakangas	9871a3f9e7	tidy up error handling a bit Pass back a suitable 'errno' from the communicator process to the originating backend in all cases. Usually it's just EIO because we don't have a good way to map from tonic StatusCodes to libc error numbers. That's probably good enough; from the original backend's perspective all errors are IO errors. In the C code, set libc errno variable before calling ereport(), so that errcode_for_file_access() works. And once we do that, we can replace pg_strerror() calls with %m.	2025-07-31 15:31:19 +03:00
Heikki Linnakangas	c509d53cd1	fix clippy warnings	2025-07-31 14:45:13 +03:00
Heikki Linnakangas	5e2a19ce73	cargo fmt	2025-07-31 14:24:17 +03:00
Heikki Linnakangas	8a4f16a471	More work on metrics Switch to the 'measured' crate everywhere in the communicator. Connect the allocator metrics to the metrics endpoint.	2025-07-31 14:09:39 +03:00
Heikki Linnakangas	0428164058	Fix LFC stats exposed by the built-in prometheus endpoint	2025-07-31 11:34:14 +03:00
Heikki Linnakangas	c8042f9e31	Run pgindent on the new communicator C code	2025-07-31 11:11:38 +03:00
Heikki Linnakangas	4016808dff	Handle get_raw_page_at_lsn() debugging function properly This adds a new request type between backend and communicator, to make a getpage request at a given LSN, bypassing the LFC. Only used by the get_raw_page_at_lsn() debugging/testing function.	2025-07-31 11:04:15 +03:00
Heikki Linnakangas	768fc101cc	Merge remote-tracking branch 'origin/main' into communicator-rewrite	2025-07-30 23:08:18 +03:00
Heikki Linnakangas	3dfa2fc3e4	Fix relsize caching in hot standby mode Fixes remaining test_hot_standby.py failures	2025-07-30 22:55:38 +03:00
Heikki Linnakangas	49204b6a59	don't try to update the legacy last-written LSN cache with new communicator	2025-07-30 22:01:04 +03:00
Heikki Linnakangas	c0360644a7	Evict and retry if the block hash map is full I made this change to the is_write==true case earlier already, but the is_write==false codepath needs the same treatment.	2025-07-30 21:48:25 +03:00
Heikki Linnakangas	688990e7ec	Crank down the logging More logging is useful during debugging, but it's time to crank it down a notch...	2025-07-30 21:24:19 +03:00
Heikki Linnakangas	af5e3da381	Fix updating last-written LSN when WAL redo skips updating a block This makes the test_replica_query_race test pass, and probably some other read replica tests too.	2025-07-30 21:20:10 +03:00
Suhas Thalanki	842a5091d5	[BRC-3051] Walproposer: Safekeeper quorum health metrics (#930 ) (#12750 ) Today we don't have any indications (other than spammy logs in PG that nobody monitors) if the Walproposer in PG cannot connect to/get votes from all Safekeepers. This means we don't have signals indicating that the Safekeepers are operating at degraded redundancy. We need these signals. Added plumbing in PG extension so that the `neon_perf_counters` view exports the following gauge metrics on safekeeper health: - `num_configured_safekeepers`: The total number of safekeepers configured in PG. - `num_active_safekeepers`: The number of safekeepers that PG is actively streaming WAL to. An alert should be raised whenever `num_active_safekeepers` < `num_configured_safekeepers`. The metrics are implemented by adding additional state to the Walproposer shared memory keeping track of the active statuses of safekeepers using a simple array. The status of the safekeeper is set to active (1) after the Walproposer acquires a quorum and starts streaming data to the safekeeper, and is set to inactive (0) when the connection with a safekeeper is shut down. We scan the safekeeper status array in Walproposer shared memory when collecting the metrics to produce results for the gauges. Added coverage for the metrics to integration test `test_wal_acceptor.py::test_timeline_disk_usage_limit`. ## Problem ## Summary of changes --------- Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-30 15:14:59 +00:00
Suhas Thalanki	056056bef0	fix(compute): validate `prewarm_local_cache()` input (#12648 ) ## Problem ``` postgres=> select neon.prewarm_local_cache('\xfcfcfcfc01000000ffffffff070000000000000000000000000000000000000000000000000000000000000000000000000000ff', 1); WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. FATAL: server conn crashed? ``` The function takes a bytea argument and casts it to a C struct, without validating the contents. ## Summary of changes Added validation for number of pages to be prefetched and for the chunks as well.	2025-07-30 14:33:19 +00:00
Heikki Linnakangas	fca52af7e3	Don't update the legacy last-written LSN cache with new communicator The new communicator has its own tracking	2025-07-30 17:31:51 +03:00
Heikki Linnakangas	9e250e382a	Merge remote-tracking branch 'origin/main' into communicator-rewrite	2025-07-30 11:19:42 +03:00
Heikki Linnakangas	1dce2a9e74	Change how pageserver connection info is passed in compute spec (#12604) Add a new 'pageserver_connection_info' field in the compute spec. It replaces the old 'pageserver_connstring' field with a more complicated struct that includes both libpq and grpc URLs, for each shard (or only one of the URLs, depending on the configuration). It also includes a flag suggesting which one to use; compute_ctl now uses it to decide which protocol to use for the basebackup. This is backwards-compatible with everything that's in production. If the control plane fills in `pageserver_connection_info`, compute_ctl uses that. If it fills in the `pageserver_connstring`/`shard_stripe_size` fields, it uses those. As a last resort, it uses the 'neon.pageserver_connstring' GUC from the list of Postgres settings. The 'grpc' flag in the endpoint config is now more of a suggestion, and it's used to populate the 'prefer_protocol' flag in the compute spec. Regardless of the flag, compute_ctl gets both URLs, so it can choose to use libpq or grpc as it wishes. It currently always obeys the flag to choose which method to use for getting the basebackup, but Postgres itself will always use the libpq protocol. (That will be changed with the new rust-based communicator project, which implements the gRPC client in the compute). After that, the `pageserver_connection_info.prefer_protocol` flag in the spec file can be used to control whether compute_ctl uses grpc or libpq. The actual compute's grpc usage will be controlled by the `neon.enable_new_communicator` GUC (not yet; that will be introduced in the future, with the new rust-based communicator project). It can be set separately from 'prefer_protocol'. Later: - Once all old computes are gone, remove the code to pass `neon.pageserver_connstring`	2025-07-29 22:20:05 +00:00
Suhas Thalanki	07c3cfd2a0	[BRC-2905] Feed back PS-detected data corruption signals to SK and PG… (#12748 ) … walproposer (#895) Data corruptions are typically detected on the pageserver side when it replays WAL records. However, since PS doesn't synchronously replay WAL records as they are being ingested through safekeepers, we need some extra plumbing to feed information about pageserver-detected corruptions during compaction (and/or WAL redo in general) back to SK and PG for proper action. We don't yet know what actions PG/SK should take upon receiving the signal, but we should have the detection and feedback in place. Add an extra `corruption_detected` field to the `PageserverFeedback` message that is sent from PS -> SK -> PG. It's a boolean value that is set to true when PS detects a "critical error" that signals data corruption, and it's sent in all `PageserverFeedback` messages. Upon receiving this signal, the safekeeper raises a `safekeeper_ps_corruption_detected` gauge metric (value set to 1). The safekeeper then forwards this signal to PG where a `ps_corruption_detected` gauge metric (value also set to 1) is raised in the `neon_perf_counters` view. Added an integration test in `test_compaction.py::test_ps_corruption_detection_feedback` that confirms that the safekeeper and PG can receive the data corruption signal in the `PageserverFeedback` message in a simulated data corruption. ## Problem ## Summary of changes --------- Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-29 20:40:07 +00:00
Erik Grinaker	80d052f262	Merge branch 'main' into communicator-rewrite	2025-07-29 22:05:16 +02:00
Suhas Thalanki	bf3a1529bf	Report metrics on data/index corruption (#12729) ## Problem We don't have visibility into data/index corruption. ## Summary of changes Add data/index corruption metrics. PG calls elog ERROR errcode to emit these corruption errors. PG Changes: https://github.com/neondatabase/postgres/pull/698	2025-07-29 18:08:24 +00:00
Heikki Linnakangas	aad301e083	cargo fmt	2025-07-29 16:46:54 +03:00
Heikki Linnakangas	b6b3911063	Merge remote-tracking branch 'origin/main' into communicator-rewrite	2025-07-29 16:44:00 +03:00
Heikki Linnakangas	5e3cb2ab07	Refactor LFC stats functions (#12696) Split the functions into two parts: an internal function in file_cache.c which returns an array of structs representing the result set, and another function in neon.c with the glue code to expose it as a SQL function. This is in preparation for the new communicator, which needs to implement the same SQL functions, but getting the information from a different place. In the glue code, use the more modern Postgres way of building a result set using a tuplestore.	2025-07-29 13:12:44 +00:00
Heikki Linnakangas	40cae8cc36	Fix misc typos and some cosmetic code cleanup (#12695)	2025-07-28 16:21:35 +00:00
Heikki Linnakangas	02fc8b7c70	Add compatibility macros for MyProcNumber and PGIOAlignedBlock (#12715) There were a few uses of these already, so collect them to the compatibility header to avoid the repetition and scattered #ifdefs. The definition of MyProcNumber is a little different from what was used before, but the end result is the same. (PGPROC->pgprocno values were just assigned sequentially to all PGPROC array members, see InitProcGlobal(). That's a bit silly, which is why it was removed in v17.)	2025-07-28 15:05:36 +00:00
Tristan Partin	b623fbae0c	Cancel PG query if stuck at refreshing configuration (#12717) ## Problem While configuring or reconfiguring PG due to PageServer movements, it's possible PG may get stuck if PageServer is moved around after fetching the spec from StorageController. ## Summary of changes To fix this issue, this PR introduces two changes: 1. Fail the PG query directly if the query fails to request configuration a certain number of times. 2. Introduce a new state `RefreshConfiguration` in compute tools to differentiate it from `RefreshConfigurationPending`. If the compute tool is already in the `RefreshConfiguration` state, then it will not accept new configuration requests. ## How is this tested? Chaos testing. Co-authored-by: Chen Luo <chen.luo@databricks.com>	2025-07-25 00:01:59 +00:00
Tristan Partin	11527b9df7	[BRC-2951] Enforce PG backpressure parameters at the shard level (#12694) ## Problem Currently PG backpressure parameters are enforced globally. With tenant splitting, this makes it hard to balance small tenants and large tenants. For large tenants with more shards, we need to increase the allowed lag because each shard receives total/shard_count amount of data, while doing so could be suboptimal for small tenants with fewer shards. ## Summary of changes This PR enforces these parameters at the shard level, i.e., PG computes the actual lag limit by multiplying the configured limit by the shard count. ## How is this tested? Added regression test. Co-authored-by: Chen Luo <chen.luo@databricks.com>	2025-07-24 18:41:29 +00:00
Tristan Partin	89554af1bd	[BRC-1778] Have PG signal compute_ctl to refresh configuration if it suspects that it is talking to the wrong PSs (#12712) ## Problem This is a follow-up to TODO, as part of the effort to rewire the compute reconfiguration/notification mechanism to make it more robust. Please refer to that commit or ticket BRC-1778 for full context of the problem. ## Summary of changes The previous change added a mechanism in `compute_ctl` that makes it possible to refresh the configuration of PG on-demand by having `compute_ctl` go out to download a new config from the control plane/HCC. This change wired this mechanism up with PG so that PG will signal `compute_ctl` to refresh its configuration when it suspects that it could be talking to incorrect pageservers due to a stale configuration. PG will become suspicious that it is talking to the wrong pageservers in the following situations: 1. It cannot connect to a pageserver (e.g., getting a network-level connection refused error) 2. It can connect to a pageserver, but the pageserver does not return any data for the GetPage request 3. It can connect to a pageserver, but the pageserver returns a malformed response 4. It can connect to a pageserver, but there is an error receiving the GetPage request response for any other reason This change also includes a minor tweak to `compute_ctl`'s config refresh behavior. Upon receiving a request to refresh PG configuration, `compute_ctl` will reach out to download a config, but it will not attempt to apply the configuration if the config is the same as the old config it is replacing. This optimization is added because the act of reconfiguring itself requires working pageserver connections. In many failure situations it is likely that PG detects an issue with a pageserver before the control plane can detect the issue, migrate tenants, and update the compute config. 
In this case even the latest compute config won't point PG to working pageservers, causing the configuration attempt to hang and negatively impact PG's time-to-recovery. With this change, `compute_ctl` only attempts reconfiguration if the refreshed config points PG to different pageservers. ## How is this tested? The new code paths are exercised in all existing tests because this mechanism is on by default. Explicitly tested in `test_runner/regress/test_change_pageserver.py`. Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-24 16:44:45 +00:00
Erik Grinaker	d793088225	pgxn: set `MACOSX_DEPLOYMENT_TARGET` (#12723 ) ## Problem Compiling `neon-pg-ext-v17` results in these linker warnings for `libcommunicator.a`: ``` $ make -j`nproc` -s neon-pg-ext-v17 Installing PostgreSQL v17 headers Compiling PostgreSQL v17 Compiling neon-specific Postgres extensions for v17 ld: warning: object file (/Users/erik.grinaker/Projects/neon/target/debug/libcommunicator.a[1159](25ac62e5b3c53843-curve25519.o)) was built for newer 'macOS' version (15.5) than being linked (15.0) ld: warning: object file (/Users/erik.grinaker/Projects/neon/target/debug/libcommunicator.a[1160](0bbbd18bda93c05b-aes_nohw.o)) was built for newer 'macOS' version (15.5) than being linked (15.0) ld: warning: object file (/Users/erik.grinaker/Projects/neon/target/debug/libcommunicator.a[1161](00c879ee3285a50d-montgomery.o)) was built for newer 'macOS' version (15.5) than being linked (15.0) [...] ``` ## Summary of changes Set `MACOSX_DEPLOYMENT_TARGET` to the current local SDK version (15.5 in this case), which links against object files for that version.	2025-07-24 14:48:35 +00:00
Heikki Linnakangas	0e0aff7b8c	fix metrics when not using the new communicator	2025-07-24 01:40:32 +03:00
Tristan Partin	9b2e6f862a	Set an upper limit on PG backpressure throttling (#12675) ## Problem Tenant split test revealed another bug with PG backpressure throttling: in some cases PS may never report its progress back to SK (e.g., observed when aborting tenant shard where the old shard needs to re-establish SK connection and re-ingest WALs from a much older LSN). In this case, PG may get stuck forever. ## Summary of changes As a general precaution, since the PS feedback mechanism may not always be reliable, this PR uses the previously introduced WAL write rate limit mechanism to slow down write rates instead of completely pausing them. The idea is to introduce a new `databricks_effective_max_wal_bytes_per_second`, which is set to `databricks_max_wal_mb_per_second` when there is no PS back pressure and is set to `10KB` when there is back pressure. This way, PG can still write to SK, though at a very low speed. The PR also fixes the problem that the current WAL rate limiting mechanism is too coarse grained and cannot enforce limits < 1MB. This is because it always resets the rate limiter after 1 second, even if PG could have written more data in the past second. The fix is to introduce a `batch_end_time_us` which records the expected end time of the current batch. For example, if PG writes 10MB of data in a single batch, and max WAL write rate is set as `1MB/s`, then `batch_end_time_us` will be set as 10 seconds later. ## How is this tested? Tweaked the existing test, and also did manual testing on dev. I set `max_replication_flush_lag` as 1GB, and loaded 500GB pgbench tables. PG is expected to get throttled periodically because PS will accumulate 4GB of data before flushing. 
Results: when PG is throttled: ``` 9500000 of 3300000000 tuples (0%) done (elapsed 10.36 s, remaining 3587.62 s) 9600000 of 3300000000 tuples (0%) done (elapsed 124.07 s, remaining 42523.59 s) 9700000 of 3300000000 tuples (0%) done (elapsed 255.79 s, remaining 86763.97 s) 9800000 of 3300000000 tuples (0%) done (elapsed 315.89 s, remaining 106056.52 s) 9900000 of 3300000000 tuples (0%) done (elapsed 412.75 s, remaining 137170.58 s) ``` when PS just flushed: ``` 18100000 of 3300000000 tuples (0%) done (elapsed 433.80 s, remaining 78655.96 s) 18200000 of 3300000000 tuples (0%) done (elapsed 433.85 s, remaining 78231.71 s) 18300000 of 3300000000 tuples (0%) done (elapsed 433.90 s, remaining 77810.62 s) 18400000 of 3300000000 tuples (0%) done (elapsed 433.96 s, remaining 77395.86 s) 18500000 of 3300000000 tuples (0%) done (elapsed 434.03 s, remaining 76987.27 s) 18600000 of 3300000000 tuples (0%) done (elapsed 434.08 s, remaining 76579.59 s) 18700000 of 3300000000 tuples (0%) done (elapsed 434.13 s, remaining 76177.12 s) 18800000 of 3300000000 tuples (0%) done (elapsed 434.19 s, remaining 75779.45 s) 18900000 of 3300000000 tuples (0%) done (elapsed 434.84 s, remaining 75489.40 s) 19000000 of 3300000000 tuples (0%) done (elapsed 434.89 s, remaining 75097.90 s) 19100000 of 3300000000 tuples (0%) done (elapsed 434.94 s, remaining 74712.56 s) 19200000 of 3300000000 tuples (0%) done (elapsed 498.93 s, remaining 85254.20 s) 19300000 of 3300000000 tuples (0%) done (elapsed 498.97 s, remaining 84817.95 s) 19400000 of 3300000000 tuples (0%) done (elapsed 623.80 s, remaining 105486.76 s) 19500000 of 3300000000 tuples (0%) done (elapsed 745.86 s, remaining 125476.51 s) ``` Co-authored-by: Chen Luo <chen.luo@databricks.com>	2025-07-23 22:37:27 +00:00
Tristan Partin	12e87d7a9f	Add neon.lakebase_mode boolean GUC (#12714) This GUC will become useful for temporarily disabling Lakebase-specific features during the code merge. Signed-off-by: Tristan Partin <tristan.partin@databricks.com>	2025-07-23 22:37:20 +00:00
Heikki Linnakangas	5a5ea9cb9f	cargo fmt	2025-07-24 01:33:02 +03:00
Heikki Linnakangas	3d209dcaae	Minor changes to minimize diff against 'main' The `pgxn/neon/communicator/Cargo.lock` file was not used, since the package is part of the workspace.	2025-07-24 00:42:00 +03:00
Heikki Linnakangas	f939691f6a	remove leftover empty file	2025-07-24 00:27:49 +03:00
Erik Grinaker	c8cdd25da4	Pass stripe size during shard map updates	2025-07-23 16:38:20 +02:00
Heikki Linnakangas	6d8b1cc754	silence compiler warning about unused variable	2025-07-23 13:47:35 +03:00
Heikki Linnakangas	35da660200	more work on exposing LFC stats	2025-07-23 13:39:32 +03:00
Heikki Linnakangas	bfdd37b54e	Fix segfault in unimplemented function We need to implement this eventually, but for now let's at least silence the segfault. See also https://github.com/neondatabase/neon/pull/12696	2025-07-23 13:08:59 +03:00
Heikki Linnakangas	6cd1295d9f	Refactor communicator process initialization when new communicator is not used This should fix the 'cargo test' failures on xlog_utils tests, which launch Postgres in stand-alone mode, i.e. without setting 'neon_tenant'	2025-07-23 13:01:19 +03:00
Erik Grinaker	464ed0cbc7	rustfmt	2025-07-23 09:41:01 +02:00
Erik Grinaker	f55ccd2c17	Fix lints	2025-07-23 08:17:06 +02:00
Erik Grinaker	c9758dc46b	Fix communicator build	2025-07-23 08:06:20 +02:00
Heikki Linnakangas	a7a6df3d6f	fix datatype used in test mock function	2025-07-23 01:44:45 +03:00
Heikki Linnakangas	bfb4b0991d	Refactor the way lfc_get_stats() is implemented This reduces the boilerplate a little, and makes it more straightforward to dispatch the call to the old or the new communicator	2025-07-23 01:40:42 +03:00
Heikki Linnakangas	c18f4a52f8	refactor metrics to use 'measured' crate	2025-07-23 00:56:21 +03:00
Tristan Partin	fc242afcc2	PG ignore PageserverFeedback from unknown shards (#12671) ## Problem When testing tenant splits, I found that PG can get backpressure throttled indefinitely if the split is aborted afterwards. It turns out that each PageServer activates the new shard separately even before the split is committed, and the new shards may start sending PageserverFeedback to PG directly. As a result, if the split is aborted, no one resets the pageserver feedback in PG, and thus PG will be backpressure throttled forever unless it's restarted manually. ## Summary of changes This PR fixes this problem by having `walprop_pg_process_safekeeper_feedback` simply ignore all pageserver feedback from unknown shards. The source of truth here is defined by the shard map, which is guaranteed to be reloaded only after the split is committed. Co-authored-by: Chen Luo <chen.luo@databricks.com>	2025-07-22 21:41:56 +00:00
Heikki Linnakangas	48535798ba	Merge remote-tracking branch 'origin/main' into communicator-rewrite	2025-07-23 00:00:10 +03:00
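
Illustration for commit 9b2e6f862a (#12675): the message describes a `batch_end_time_us` that records when the current batch is expected to finish at the configured rate, replacing a limiter that reset every second. The sketch below is one possible reading of that idea in Rust, not the code from the commit. Apart from `batch_end_time_us`, every name is hypothetical, and it assumes that a new batch starts when the previous one ends (or immediately, if the writer was idle).

```rust
/// Illustrative WAL write rate limiter. Only `batch_end_time_us` is named in
/// the commit message; every other name here is hypothetical.
pub struct WalRateLimiter {
    /// Allowed write rate in bytes per second (must be non-zero).
    pub max_bytes_per_second: u64,
    /// Expected end time of the current batch, in microseconds.
    pub batch_end_time_us: u64,
}

impl WalRateLimiter {
    pub fn new(max_bytes_per_second: u64) -> Self {
        assert!(max_bytes_per_second > 0, "rate must be non-zero");
        Self {
            max_bytes_per_second,
            batch_end_time_us: 0,
        }
    }

    /// Account for `bytes` written at `now_us` and return how long the writer
    /// should wait (in microseconds) before writing again.
    pub fn record_write(&mut self, now_us: u64, bytes: u64) -> u64 {
        // Assumption: a batch starts where the previous one ended, or now if idle.
        let start_us = self.batch_end_time_us.max(now_us);
        // Time this batch "costs" at the configured rate.
        let cost_us = bytes.saturating_mul(1_000_000) / self.max_bytes_per_second;
        self.batch_end_time_us = start_us.saturating_add(cost_us);
        self.batch_end_time_us - now_us
    }

    /// True while the previous batch has not yet "paid for itself".
    pub fn is_throttled(&self, now_us: u64) -> bool {
        now_us < self.batch_end_time_us
    }
}
```

With a rate of 1 MiB/s, recording a 10 MiB batch at time 0 moves `batch_end_time_us` 10 seconds ahead, which mirrors the example in the commit message. Because the cost is computed per batch instead of per one-second window, rates well below 1 MB/s (such as the 10KB back-pressure rate) can also be expressed.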

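Illustration for commit 1dce2a9e74 (#12604): the message lists three places compute_ctl looks for pageserver connection details, in order of preference. The Rust sketch below shows that precedence only. The struct and enum are hypothetical stand-ins (the real compute spec types are not shown in this log), and the connection info is reduced to a plain string for brevity.

```rust
/// Hypothetical, simplified view of the compute spec fields named in the
/// commit message. The real types are richer than this.
pub struct SpecSketch {
    pub pageserver_connection_info: Option<String>,
    pub pageserver_connstring: Option<String>,
    pub shard_stripe_size: Option<u32>,
    /// Postgres settings as (name, value) pairs.
    pub settings: Vec<(String, String)>,
}

/// Which source the connection details were taken from (hypothetical enum).
#[derive(Debug, Clone, PartialEq)]
pub enum ConnSource {
    ConnectionInfo(String),
    LegacyFields {
        connstring: String,
        stripe_size: Option<u32>,
    },
    Guc(String),
}

/// Apply the precedence described in the commit message:
/// 1. `pageserver_connection_info`
/// 2. `pageserver_connstring` / `shard_stripe_size`
/// 3. the 'neon.pageserver_connstring' GUC from the Postgres settings
pub fn resolve_pageserver_conn(spec: &SpecSketch) -> Option<ConnSource> {
    if let Some(info) = &spec.pageserver_connection_info {
        return Some(ConnSource::ConnectionInfo(info.clone()));
    }
    if let Some(connstring) = &spec.pageserver_connstring {
        return Some(ConnSource::LegacyFields {
            connstring: connstring.clone(),
            stripe_size: spec.shard_stripe_size,
        });
    }
    spec.settings
        .iter()
        .find(|(name, _)| name.as_str() == "neon.pageserver_connstring")
        .map(|(_, value)| ConnSource::Guc(value.clone()))
}
```

Returning the source alongside the value makes the fallback visible to callers, which would help when the legacy paths are eventually removed as the commit message anticipates.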
556 Commits