rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-04 04:30:38 +00:00

Author	SHA1	Message	Date
Erik Grinaker	8cd5370c00	Merge branch 'main' into communicator-rewrite	2025-07-11 10:39:26 +02:00
Tristan Partin	cec0543b51	Add background to compute migration 0002-alter_roles.sql (#11708 ) On December 8th, 2023, an engineering escalation (INC-110) was opened after it was found that BYPASSRLS was being applied to all roles. PR that introduced the issue: https://github.com/neondatabase/neon/pull/5657 Subsequent commit on main: `ad99fa5f03` NOBYPASSRLS and INHERIT are the defaults for a Postgres role, but because it isn't easy to know if a Postgres cluster is affected by the issue, we need to keep the migration around for a long time, if not indefinitely, so any cluster can be fixed. Branching is the gift that keeps on giving... Signed-off-by: Tristan Partin <tristan.partin@databricks.com>	2025-07-10 22:58:54 +00:00
Mikhail	3593fe195a	split TerminationPending into two values, keeping ComputeStatus stateless (#12506 ) After https://github.com/neondatabase/neon/pull/12240 we observed issues in our go code as `ComputeStatus` is not stateless, thus doesn't deserialize as string. ``` could not check compute activity: json: cannot unmarshal object into Go struct field ComputeState.status of type computeclient.ComputeStatus ``` - Fix this by splitting this status into two. - Update compute OpenApi spec to reflect changes to `/terminate` in previous PR	2025-07-10 19:28:10 +00:00
Mikhail	c5aaf1ae21	Qualify call to neon extension in compute_ctl's prewarming (#12554 ) https://github.com/neondatabase/cloud/issues/19011 Calls without `neon.` failed on staging. Also fix local tests to work with qualified calls	2025-07-10 18:37:54 +00:00
Alex Chi Z.	13b5e7b26f	fix(compute_ctl): reload config before applying spec (#12551 ) ## Problem If we have catalog update AND a pageserver migration batched in a single spec, we will not be able to apply the spec (running the SQL) because the compute is not attached to the right pageserver and we are not able to read anything if we don't pick up the latest pageserver connstring. This is not a case for now because cplane always schedules shard split / pageserver migrations with `skip_pg_catalog_updates` (I suppose). Context: https://databricks.slack.com/archives/C09254R641L/p1752163559259399?thread_ts=1752160163.141149&cid=C09254R641L With this fix, backpressure will likely not be able to affect reconfigurations. ## Summary of changes Do `pg_reload_conf` before we apply specs in SQL. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-10 18:02:54 +00:00
Mikhail	bdca5b500b	Fix test_lfc_prewarm: reduce number of prewarms, sleep before LFC offloading (#12515 ) Fixes: - Sleep before LFC offloading in `test_lfc_prewarm[autoprewarm]` to ensure offloaded LFC is the one exported after all writes finish - Reduce number of prewarms and increase timeout in `test_lfc_prewarm_under_workload` as debug builds were failing due to timeout. Additional changes: - Remove `check_pinned_entries`: https://github.com/neondatabase/neon/pull/12447#discussion_r2185946210 - Fix LFC error metrics description: https://github.com/neondatabase/neon/pull/12486#discussion_r2190763107	2025-07-10 11:11:53 +00:00
Tristan Partin	13e38a58a1	Grant pg_signal_backend to neon_superuser (#12533 ) Allow neon_superuser to cancel backends from non-neon_superusers, excluding Postgres superusers. Signed-off-by: Tristan Partin <tristan.partin@databricks.com> Co-authored-by: Vikas Jain <vikas.jain@databricks.com>	2025-07-09 21:35:39 +00:00
Tristan Partin	28f604d628	Make pg_monitor neon_superuser test more robust (#12532 ) Make sure to check for NULL just in case. Signed-off-by: Tristan Partin <tristan.partin@databricks.com> Co-authored-by: Vikas Jain <vikas.jain@databricks.com>	2025-07-09 18:45:50 +00:00
Tristan Partin	12c26243fc	Fix typo in migration testing related to pg_monitor (#12530 ) We should be joining on the neon_superuser roleid, not the pg_monitor roleid. Signed-off-by: Tristan Partin <tristan.partin@databricks.com>	2025-07-09 16:47:21 +00:00
Mikhail	bc6a756f1c	ci: lint openapi specs using redocly (#12510 ) We need to lint specs for pageserver, endpoint storage, and safekeeper #0000	2025-07-09 14:29:45 +00:00
Mikhail	e7d18bc188	Replica promotion in compute_ctl (#12183 ) Add `/promote` method for `compute_ctl` promoting secondary replica to primary, depends on secondary being prewarmed. Add `compute-ctl` mode to `test_replica_promotes`, testing happy path only (no corner cases yet) Add openapi spec for `/promote` and `/lfc` handlers https://github.com/neondatabase/cloud/issues/19011 Resolves: https://github.com/neondatabase/cloud/issues/29807	2025-07-09 12:55:10 +00:00
Suhas Thalanki	09ff22a4d4	fix(compute): removing `NEON_EXT_INT_UPD` log statement added for debugging verbosity (#12509 ) Removes the `NEON_EXT_INT_UPD` log statement that was added for debugging verbosity.	2025-07-08 21:12:26 +00:00
Erik Grinaker	81e7218c27	pageserver: tighten up gRPC `page_api::Client` (#12396 ) This patch tightens up `page_api::Client`. It's mostly superficial changes, but also adds a new constructor that takes an existing gRPC channel, for use with the communicator connection pool.	2025-07-08 18:15:13 +00:00
Mikhail	4f16ab3f56	add lfc offload and prewarm error metrics (#12486 ) Add `compute_ctl_lfc_prewarm_errors_total` and `compute_ctl_lfc_offload_errors_total` metrics. Add comments in `test_lfc_prewarm`. Correction PR for https://github.com/neondatabase/neon/pull/12447 https://github.com/neondatabase/cloud/issues/19011	2025-07-08 09:34:01 +00:00
Heikki Linnakangas	e14bb4be39	Merge remote-tracking branch 'origin/main' into communicator-rewrite	2025-07-05 16:59:51 +03:00
Mikhail	7ed4530618	`offload_lfc_interval_seconds` in ComputeSpec (#12447 ) - Add ComputeSpec flag `offload_lfc_interval_seconds` controlling whether LFC should be offloaded to endpoint storage. Default value (None) means "don't offload". - Add glue code around it for `neon_local` and integration tests. - Add `autoprewarm` mode for `test_lfc_prewarm` testing `offload_lfc_interval_seconds` and `autoprewarm` flags in conjunction. - Rename `compute_ctl_lfc_prewarm_requests_total` and `compute_ctl_lfc_offload_requests_total` to `compute_ctl_lfc_prewarms_total` and `compute_ctl_lfc_offloads_total` to reflect we count prewarms and offloads, not `compute_ctl` requests of those. Don't count request in metrics if there is a prewarm/offload already ongoing. https://github.com/neondatabase/cloud/issues/19011 Resolves: https://github.com/neondatabase/cloud/issues/30770	2025-07-04 18:49:57 +00:00
Suhas Thalanki	46158ee63f	fix(compute): background installed extensions worker would collect data without waiting for interval (#12465 ) ## Problem The background installed extensions worker relied on `interval.tick()` to go to sleep for a period of time. This can lead to bugs due to the interval being updated at the end of the loop as the first tick is [instantaneous](https://docs.rs/tokio/latest/tokio/time/struct.Interval.html#method.tick). ## Summary of changes Changed it to a `tokio::time::sleep` to prevent this issue. Now it puts the thread to sleep and only wakes up after the specified duration	2025-07-03 17:10:30 +00:00
Erik Grinaker	958c2577f5	pageserver: tighten up `page_api::Client`	2025-07-01 17:54:41 +02:00
Lassi Pölönen	6d73cfa608	Support audit syslog over TLS (#12124 ) Add support to transport syslogs over TLS. Since TLS params essentially require passing host and port separately, add a boolean flag to the configuration template and also use the same `action` format for plaintext logs. This allows seamless transition. The plaintext host:port is picked from `AUDIT_LOGGING_ENDPOINT` (as earlier) and from `AUDIT_LOGGING_TLS_ENDPOINT`. The TLS host:port is used when defined and non-empty. `remote_endpoint` is split separately to hostname and port as required by `omfwd` module. Also the address parsing and config content generation are split to more testable functions with basic tests added.	2025-07-01 12:53:46 +00:00
Suhas Thalanki	5f3532970e	[compute] fix: background worker that collects installed extension metrics now updates collection interval (#12277 ) ## Problem Previously, the background worker that collects the list of installed extensions across DBs had a timeout set to 1 hour. This caused a problem with computes that had a `suspend_timeout` > 1 hour as this collection was treated as activity, preventing compute shutdown. Issue: https://github.com/neondatabase/cloud/issues/30147 ## Summary of changes Passing the `suspend_timeout` as part of the `ComputeSpec` so that any updates to this are taken into account by the background worker and updates its collection interval.	2025-06-30 22:12:37 +00:00
Erik Grinaker	c3cb1ab98d	Merge branch 'main' into communicator-rewrite	2025-06-30 21:07:01 +02:00
Erik Grinaker	d0a4ae3e8f	pageserver: add gRPC LSN lease support (#12384 ) ## Problem The gRPC API does not provide LSN leases. ## Summary of changes * Add LSN lease support to the gRPC API. * Use gRPC LSN leases for static computes with `grpc://` connstrings. * Move `PageserverProtocol` into the `compute_api::spec` module and reuse it.	2025-06-30 12:44:17 +00:00
Erik Grinaker	67b04f8ab3	Fix a bunch of linter warnings	2025-06-30 11:10:02 +02:00
Heikki Linnakangas	f3ba201800	Run `cargo fmt`	2025-06-29 21:21:07 +03:00
Heikki Linnakangas	a352d290eb	Plumb through both libpq and grpc connection strings to the compute Add a new 'pageserver_connection_info' field in the compute spec. It replaces the old 'pageserver_connstring' field with a more complicated struct that includes both libpq and grpc URLs, for each shard (or only one of the URLs, depending on the configuration). It also includes a flag suggesting which one to use; compute_ctl now uses it to decide which protocol to use for the basebackup. This is compatible with everything that's in production, because the control plane never used the 'pageserver_connstring' field. That was added a long time ago with the idea that it would replace the code that digs the 'neon.pageserver_connstring' GUC from the list of Postgres settings, but we never got around to doing that in the control plane. Hence, it was only used with neon_local. But the plan now is to pass the 'pageserver_connection_info' from the control plane, and once that's fully deployed everywhere, the code to parse 'neon.pageserver_connstring' in compute_ctl can be removed. The 'grpc' flag on an endpoint in endpoint config is now more of a suggestion. Compute_ctl gets both URLs, so it can choose to use libpq or grpc as it wishes. It currently always obeys the 'prefer_grpc' flag that's part of the connection info though. Postgres however uses grpc iff the new rust-based communicator is enabled. TODO/plan for the control plane: - Start to pass `pageserver_connection_info` in the spec file. - Also keep the current `neon.pageserver_connstring` setting for now, for backwards compatibility with old computes After that, the `pageserver_connection_info.prefer_grpc` flag in the spec file can be used to control whether compute_ctl uses grpc or libpq. The actual compute's grpc usage will be controlled by the `neon.enable_new_communicator` GUC. It can be set separately from 'prefer_grpc'. Later: - Once all old computes are gone, remove the code to pass `neon.pageserver_connstring`	2025-06-29 18:16:49 +03:00
Erik Grinaker	e50b914a8e	compute_tools: support gRPC base backups in `compute_ctl` (#12244 ) ## Problem `compute_ctl` should support gRPC base backups. Requires #12111. Requires #12243. Touches #11926. ## Summary of changes Support `grpc://` connstrings for `compute_ctl` base backups.	2025-06-27 16:39:00 +00:00
Mikhail	ebc12a388c	fix: endpoint_storage_addr as String (#12359 ) It's not a SocketAddr as we use k8s DNS https://github.com/neondatabase/cloud/issues/19011	2025-06-27 11:06:27 +00:00
Tristan Partin	aa75722010	Set pgaudit.log=none for monitoring connections (#12137 ) pgaudit can spam logs due to all the monitoring that we do. Logs from these connections are not necessary for HIPAA compliance, so we can stop logging from those connections. Part-of: https://github.com/neondatabase/cloud/issues/29574 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-06-24 17:42:23 +00:00
Matthias van de Meent	6c6de6382a	Use enum-typed PG versions (#12317 ) This makes it possible for the compiler to validate that a match block matched all PostgreSQL versions we support. ## Problem We did not have a complete picture about which places we had to test against PG versions, and what format these versions were: The full PG version ID format (Major/minor/bugfix `MMmmbb`) as transferred in protocol messages, or only the Major release version (`MM`). This meant type confusion was rampant. With this change, it becomes easier to develop new version-dependent features, by making type and niche confusion impossible. ## Summary of changes Every use of `pg_version` is now typed as either `PgVersionId` (u32, valued in decimal `MMmmbb`) or PgMajorVersion (an enum, with a value for every major version we support, serialized and stored like a u32 with the value of that major version) --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-06-24 17:25:31 +00:00
Arpad Müller	552249607d	apply clippy fixes for 1.88.0 beta (#12331 ) The 1.88.0 stable release is near (this Thursday). We'd like to fix most warnings beforehand so that the compiler upgrade doesn't require approval from too many teams. This is therefore a preparation PR (like similar PRs before it). There are a lot of changes for this release, mostly because the `uninlined_format_args` lint has been added to the `style` lint group. One can read more about the lint [here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args). The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One remaining warning is left for the proxy team. --------- Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2025-06-24 10:12:42 +00:00
Heikki Linnakangas	e90be06d46	silence a few compiler warnings about unnecessary 'mut's and 'use's	2025-06-23 18:16:54 +03:00
Heikki Linnakangas	356ba67607	Merge remote-tracking branch 'origin/main' into HEAD I also included build script changes from https://github.com/neondatabase/neon/pull/12266, which is not yet merged but will be soon.	2025-06-23 17:46:30 +03:00
Tristan Partin	c8b2ac93cf	Allow the control plane to override any Postgres connection options (#12262 ) The previous behavior was for the compute to override control plane options if there was a conflict. We want to change the behavior so that the control plane has the absolute power on what is right. In the event that we need a new option passed to the compute as soon as possible, we can initially roll it out in the control plane, and then migrate the option to EXTRA_OPTIONS within the compute later, for instance. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-06-20 18:46:30 +00:00
Suhas Thalanki	20f4febce1	fix: additional changes to terminate pgbouncer on compute suspend (#12153 ) (#12284 ) Addressed [retrospective comments](https://github.com/neondatabase/neon/pull/12153#discussion_r2154197503) to https://github.com/neondatabase/neon/pull/12153	2025-06-18 19:31:22 +00:00
Mikhail	e95f2f9a67	compute_ctl: return LSN in /terminate (#12240 ) - Add optional `?mode=fast|immediate` to `/terminate`, `fast` is default. Immediate avoids waiting 30 seconds before returning from `terminate`. - Add `TerminateMode` to `ComputeStatus::TerminationPending` - Use `/terminate?mode=immediate` in `neon_local` instead of `pg_ctl stop` for `test_replica_promotes`. - Change `test_replica_promotes` to check returned LSN - Annotate `finish_sync_safekeepers` as `noreturn`. https://github.com/neondatabase/cloud/issues/29807	2025-06-18 12:25:19 +00:00
Dimitri Fontaine	67fbc0582e	Validate safekeeper_connstrings when parsing compute specs. (#11906 ) This check API only checks the safekeeper_connstrings at the moment, and the validation is limited to checking we have at least one entry in there, and no duplicates. ## Problem If the compute_ctl service is started with an empty list of safekeepers, then hard-to-debug errors may happen at runtime, where it would be much easier to catch them early. ## Summary of changes Add an entry point in the compute_ctl API to validate the configuration for safekeeper_connstrings. --------- Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-06-18 10:01:05 +00:00
Suhas Thalanki	83069f6ca1	fix: terminate pgbouncer on compute suspend (#12153 ) ## Problem PgBouncer does not terminate connections on a suspend: https://github.com/neondatabase/cloud/issues/16282 ## Summary of changes 1. Adds a pid file to store the pid of PgBouncer 2. Terminates connections on a compute suspend --------- Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>	2025-06-17 22:56:05 +00:00
Tristan Partin	f669e18477	Remove TODO comment related to default_transaction_read_only (#12261 ) This code has been deployed for a while, so let's remove the TODO, and remove the option passed from the control plane. Link: https://github.com/neondatabase/cloud/pull/30274 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-06-16 19:38:26 +00:00
Erik Grinaker	d0b3629412	Tweak base backups	2025-06-13 13:47:26 -07:00
Mikhail	1b935b1958	endpoint_storage: add ?from_endpoint= to /lfc/prewarm (#12195 ) Related: https://github.com/neondatabase/cloud/issues/24225 Add optional from_endpoint parameter to allow prewarming from other endpoint	2025-06-10 19:25:32 +00:00
Erik Grinaker	ec17ae0658	Handle gRPC basebackups in compute_ctl	2025-06-09 22:50:57 +02:00
Alexey Kondratov	6ae4b89000	feat(compute_ctl): Implement graceful compute monitor exit (#11911 ) ## Problem After introducing a naive downtime calculation for the Postgres process inside compute in https://github.com/neondatabase/neon/pull/11346, I noticed that some amount of computes regularly report short downtime. After checking some particular cases, it looks like all of them report downtime close to the end of the life of the compute, i.e., when the control plane calls a `/terminate` and we are waiting for Postgres to exit. Compute monitor also produces a lot of error logs because Postgres stops accepting connections, but it's expected during the termination process. ## Summary of changes Regularly check the compute status inside the main compute monitor loop and exit gracefully when the compute is in some terminal or soon-to-be-terminal state. --------- Co-authored-by: Tristan Partin <tristan@neon.tech>	2025-06-05 12:17:28 +00:00
Conrad Ludgate	1fb1315aed	compute-ctl: add spec for enable_tls, separate from compute-ctl config (#12109 ) ## Problem In between adding the TLS config for compute-ctl, and adding the TLS config in controlplane, we switched from using a provision flag to a bind flag. This happened to work in all of my testing in preview regions as they have no VM pool, so each bind was also a provision. However, in staging I found that the TLS config is still only processed during provision, even though it's only sent on bind. ## Summary of changes * Add a new feature flag value, `tls_experimental`, which tells postgres/pgbouncer/local_proxy to use the TLS certificates on bind. * compute_ctl on provision will be told where the certificates are, instead of being told on bind.	2025-06-04 20:07:47 +00:00
Tristan Partin	3fd5a94a85	Use Url::join() when creating the final remote extension URL (#12121 ) Url::to_string() adds a trailing slash on the base URL, so when we did the format!(), we were adding a double forward slash. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-06-04 15:56:12 +00:00
Mikhail	c698cee19a	ComputeSpec: prewarm_lfc_on_startup -> autoprewarm (#12120 ) https://github.com/neondatabase/cloud/pull/29472 https://github.com/neondatabase/cloud/issues/26346	2025-06-04 05:38:03 +00:00
Tristan Partin	4a3f32bf4a	Clean up compute_tools::http::JsonResponse::invalid_status() (#12110 ) JsonResponse::error() properly logs an error message which can be read in the compute logs. invalid_status() was not going through that helper function, thus not logging anything. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-06-03 16:00:56 +00:00
Shockingly Good	f05df409bd	impr(compute): Remove the deprecated CLI arg alias for remote-ext-config. (#12087 ) Also moves it from `String` to `Url`.	2025-05-30 17:45:24 +00:00
Shockingly Good	62cd3b8d3d	fix(compute) Remove the hardcoded default value for PGXN HTTP URL. (#12030 ) Removes the hardcoded value for the Postgres Extensions HTTP gateway URL as it is always provided by the calling code.	2025-05-30 15:26:22 +00:00
Gleb Novikov	3b4d4eb535	fast_import.rs: log number of jobs for pg_dump/pg_restore (#12068 ) ## Problem I have a hypothesis that import might be using a lower number of jobs than the max for the VM where the job is running. This change will help find this out from logs ## Summary of changes Added logging of number of jobs, which is passed into both `pg_dump` and `pg_restore`	2025-05-29 18:25:42 +00:00
Suhas Thalanki	e77961c1c6	background worker that collects installed extensions (#11939 ) ## Problem Currently, we collect metrics of what extensions are installed on computes at start up time. We do not have a mechanism that does this at runtime. ## Summary of changes Added a background thread that queries all DBs at regular intervals and collects a list of installed extensions.	2025-05-27 19:40:51 +00:00
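
Commit 67fbc0582e above describes validating `safekeeper_connstrings` by checking that the list has at least one entry and no duplicates. A minimal sketch of that rule follows, using only the standard library. The function name, the error type, and the sample connection strings are hypothetical and are not taken from the `compute_ctl` source.

```rust
use std::collections::HashSet;

/// Hypothetical error type for this sketch; not the type used in compute_ctl.
#[derive(Debug, PartialEq)]
enum SafekeeperListError {
    /// The list contained no entries at all.
    Empty,
    /// The same connection string appeared more than once.
    Duplicate(String),
}

/// Checks the two rules named in the commit message:
/// at least one entry, and no duplicate entries.
fn validate_safekeeper_connstrings(connstrings: &[String]) -> Result<(), SafekeeperListError> {
    if connstrings.is_empty() {
        return Err(SafekeeperListError::Empty);
    }
    let mut seen = HashSet::new();
    for connstring in connstrings {
        // `insert` returns false when the value was already present.
        if !seen.insert(connstring.as_str()) {
            return Err(SafekeeperListError::Duplicate(connstring.clone()));
        }
    }
    Ok(())
}

fn main() {
    // Made-up host:port values, for illustration only.
    let good = vec!["sk-0:5454".to_string(), "sk-1:5454".to_string()];
    let dup = vec!["sk-0:5454".to_string(), "sk-0:5454".to_string()];
    println!("{:?}", validate_safekeeper_connstrings(&good));
    println!("{:?}", validate_safekeeper_connstrings(&dup));
    println!("{:?}", validate_safekeeper_connstrings(&[]));
}
```

Rejecting an empty list at parse time matches the motivation in the commit message: a bad spec fails early instead of producing hard-to-debug errors at runtime.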

442 Commits
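
Commit 6c6de6382a in the log above separates the full version number (`PgVersionId`, a `u32` in decimal `MMmmbb`) from the major version (`PgMajorVersion`, an enum) so the compiler can check that a `match` covers every supported version. The sketch below illustrates the idea under stated assumptions: the two type names come from the commit message, while the variant names, the set of versions, the `major` method, and the `install_dir` helper are hypothetical.

```rust
/// Full version number as sent in protocol messages, decimal `MMmmbb`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PgVersionId(u32);

/// One variant per supported major version (the set here is an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PgMajorVersion {
    PG14 = 14,
    PG15 = 15,
    PG16 = 16,
    PG17 = 17,
}

impl PgVersionId {
    /// Extracts the major version; returns None for unsupported versions.
    fn major(self) -> Option<PgMajorVersion> {
        match self.0 / 10000 {
            14 => Some(PgMajorVersion::PG14),
            15 => Some(PgMajorVersion::PG15),
            16 => Some(PgMajorVersion::PG16),
            17 => Some(PgMajorVersion::PG17),
            _ => None,
        }
    }
}

/// Hypothetical version-dependent helper. There is no wildcard arm, so adding
/// a new enum variant makes this match fail to compile until it is handled.
fn install_dir(version: PgMajorVersion) -> &'static str {
    match version {
        PgMajorVersion::PG14 => "v14",
        PgMajorVersion::PG15 => "v15",
        PgMajorVersion::PG16 => "v16",
        PgMajorVersion::PG17 => "v17",
    }
}

fn main() {
    let id = PgVersionId(170005);
    match id.major() {
        Some(major) => println!("{:?} -> {}", major, install_dir(major)),
        None => println!("unsupported version id {}", id.0),
    }
}
```

Because the two representations are distinct types, passing a full `MMmmbb` value where a major version is expected becomes a compile error rather than a silent mix-up.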