rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 21:42:56 +00:00

Author	SHA1	Message	Date
Tristan Partin	90cd5a5be8	[BRC-1778] Add mechanism to `compute_ctl` to pull a new config (#12711 ) ## Problem We have been dealing with a number of issues with the SC compute notification mechanism. Various race conditions exist in the PG/HCC/cplane/PS distributed system, and relying on the SC to send notifications to the compute node to notify it of PS changes is not robust. We decided to pursue a more robust option where the compute node itself discovers whether it may be pointing to the incorrect PSs and proactively reconfigure itself if issues are suspected. ## Summary of changes To support this self-healing reconfiguration mechanism several pieces are needed. This PR adds a mechanism to `compute_ctl` called "refresh configuration", where the compute node reaches out to the control plane to pull a new config and reconfigure PG using the new config, instead of listening for a notification message containing a config to arrive from the control plane. Main changes to compute_ctl: 1. The `compute_ctl` state machine now has a new State, `RefreshConfigurationPending`. The compute node may enter this state upon receiving a signal that it may be using the incorrect page servers. 2. Upon entering the `RefreshConfigurationPending` state, the background configurator thread in `compute_ctl` wakes up, pulls a new config from the control plane, and reconfigures PG (with `pg_ctl reload`) according to the new config. 3. The compute node may enter the new `RefreshConfigurationPending` state from `Running` or `Failed` states. If the configurator managed to configure the compute node successfully, it will enter the `Running` state, otherwise, it stays in `RefreshConfigurationPending` and the configurator thread will wait for the next notification if an incorrect config is still suspected. 4. Added various plumbing in `compute_ctl` data structures to allow the configurator thread to perform the config fetch. The "incorrect config suspected" notification is delivered using a HTTP endpoint, `/refresh_configuration`, on `compute_ctl`. This endpoint is currently not called by anyone other than the tests. In a follow up PR I will set up some code in the PG extension/libpagestore to call this HTTP endpoint whenever PG suspects that it is pointing to the wrong page servers. ## How is this tested? Modified `test_runner/regress/test_change_pageserver.py` to add a scenario where we use the new `/refresh_configuration` mechanism instead of the existing `/configure` mechanism (which requires us sending a full config to compute_ctl) to have the compute node reload and reconfigure its pageservers. I took one shortcut to reduce the scope of this change when it comes to testing: the compute node uses a local config file instead of pulling a config over the network from the HCC. This simplifies the test setup in the following ways: * The existing test framework is set up to use local config files for compute nodes only, so it's convenient if I just stick with it. * The HCC today generates a compute config with production settings (e.g., assuming 4 CPUs, 16GB RAM, with local file caches), which is probably not suitable in tests. We may need to add another test-only endpoint config to the control plane to make this work. The config-fetch part of the code is relatively straightforward (and well-covered in both production and the KIND test) so it is probably fine to replace it with loading from the local config file for these integration tests. In addition to making sure that the tests pass, I also manually inspected the logs to make sure that the compute node is indeed reloading the config using the new mechanism instead of going down the old `/configure` path (it turns out the test has bugs which causes compute `/configure` messages to be sent despite the test intending to disable/blackhole them). ```test 2024-09-24T18:53:29.573650Z INFO http request{otel.name=/refresh_configuration http.method=POST}: serving /refresh_configuration POST request 2024-09-24T18:53:29.573689Z INFO configurator_main_loop: compute node suspects its configuration is out of date, now refreshing configuration 2024-09-24T18:53:29.573706Z INFO configurator_main_loop: reloading config.json from path: /workspaces/hadron/test_output/test_change_pageserver_using_refresh[release-pg16]/repo/endpoints/ep-1/spec.json PG:2024-09-24 18:53:29.574 GMT [52799] LOG: received SIGHUP, reloading configuration files PG:2024-09-24 18:53:29.575 GMT [52799] LOG: parameter "neon.extension_server_port" cannot be changed without restarting the server PG:2024-09-24 18:53:29.575 GMT [52799] LOG: parameter "neon.pageserver_connstring" changed to "postgresql://no_user@localhost:15008" ... ``` Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-24 14:26:21 +00:00
Erik Grinaker	e181b996c3	utils: move `ShardStripeSize` into `shard` module (#12640 ) ## Problem `ShardStripeSize` will be used in the compute spec and internally in the communicator. It shouldn't require pulling in all of `pageserver_api`. ## Summary of changes Move `ShardStripeSize` into `utils::shard`, along with other basic shard types. Also remove the `Default` implementation, to discourage clients from falling back to a default (it's generally a footgun). The type is still re-exported from `pageserver_api::shard`, along with all the other shard types.	2025-07-21 10:56:20 +00:00
Alexey Kondratov	dd7fff655a	feat(compute): Introduce privileged_role_name parameter (#12539 ) ## Problem Currently `neon_superuser` is hardcoded in many places. It makes it harder to reuse the same code in different envs. ## Summary of changes Parametrize `neon_superuser` in `compute_ctl` via `--privileged-role-name` and in `neon` extensions via `neon.privileged_role_name`, so it's now possible to use different 'superuser' role names if needed. Everything still defaults to `neon_superuser`, so no control plane code changes are needed and I intentionally do not touch regression and migrations tests. Postgres PRs: - https://github.com/neondatabase/postgres/pull/674 - https://github.com/neondatabase/postgres/pull/675 - https://github.com/neondatabase/postgres/pull/676 - https://github.com/neondatabase/postgres/pull/677 Cloud PR: - https://github.com/neondatabase/cloud/pull/31138	2025-07-15 20:22:57 +00:00
Heikki Linnakangas	5c9c3b3317	Misc cosmetic cleanups (#12598 ) - Remove a few obsolete "allowed error messages" from tests. The pageserver doesn't emit those messages anymore. - Remove misplaced and outdated docstring comment from `test_tenants.py`. A docstring is supposed to be the first thing in a function, but we had added some code before it. And it was outdated, as we haven't supported running without safekeepers for a long time. - Fix misc typos in comments - Remove obsolete comment about backwards compatibility with safekeepers without `TIMELINE_STATUS` API. All safekeepers have it by now.	2025-07-15 14:36:28 +00:00
Matthias van de Meent	4566b12a22	NEON: Finish Zenith->Neon rename (#12566 ) Even though we're now part of Databricks, let's at least make this part consistent. ## Summary of changes - PG14: https://github.com/neondatabase/postgres/pull/669 - PG15: https://github.com/neondatabase/postgres/pull/670 - PG16: https://github.com/neondatabase/postgres/pull/671 - PG17: https://github.com/neondatabase/postgres/pull/672 --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-07-11 18:56:39 +00:00
Mikhail	3593fe195a	split TerminationPending into two values, keeping ComputeStatus stateless (#12506 ) After https://github.com/neondatabase/neon/pull/12240 we observed issues in our go code as `ComputeStatus` is not stateless, thus doesn't deserialize as string. ``` could not check compute activity: json: cannot unmarshal object into Go struct field ComputeState.status of type computeclient.ComputeStatus ``` - Fix this by splitting this status into two. - Update compute OpenApi spec to reflect changes to `/terminate` in previous PR	2025-07-10 19:28:10 +00:00
Mikhail	7ed4530618	`offload_lfc_interval_seconds` in ComputeSpec (#12447 ) - Add ComputeSpec flag `offload_lfc_interval_seconds` controlling whether LFC should be offloaded to endpoint storage. Default value (None) means "don't offload". - Add glue code around it for `neon_local` and integration tests. - Add `autoprewarm` mode for `test_lfc_prewarm` testing `offload_lfc_interval_seconds` and `autoprewarm` flags in conjunction. - Rename `compute_ctl_lfc_prewarm_requests_total` and `compute_ctl_lfc_offload_requests_total` to `compute_ctl_lfc_prewarms_total` and `compute_ctl_lfc_offloads_total` to reflect we count prewarms and offloads, not `compute_ctl` requests of those. Don't count request in metrics if there is a prewarm/offload already ongoing. https://github.com/neondatabase/cloud/issues/19011 Resolves: https://github.com/neondatabase/cloud/issues/30770	2025-07-04 18:49:57 +00:00
Suhas Thalanki	5f3532970e	[compute] fix: background worker that collects installed extension metrics now updates collection interval (#12277 ) ## Problem Previously, the background worker that collects the list of installed extensions across DBs had a timeout set to 1 hour. This cause a problem with computes that had a `suspend_timeout` > 1 hour as this collection was treated as activity, preventing compute shutdown. Issue: https://github.com/neondatabase/cloud/issues/30147 ## Summary of changes Passing the `suspend_timeout` as part of the `ComputeSpec` so that any updates to this are taken into account by the background worker and updates its collection interval.	2025-06-30 22:12:37 +00:00
Dmitrii Kovalkov	8e216a3a59	storcon: notify cplane on safekeeper membership change (#12390 ) ## Problem We don't notify cplane about safekeeper membership change yet. Without the notification the compute needs to know all the safekeepers on the cluster to be able to speak to them. Change notifications will allow to avoid it. - Closes: https://github.com/neondatabase/neon/issues/12188 ## Summary of changes - Implement `notify_safekeepers` method in `ComputeHook` - Notify cplane about safekeepers in `safekeeper_migrate` handler. - Update the test to make sure notifications work. ## Out of scope - There is `cplane_notified_generation` field in `timelines` table in strocon's database. It's not needed now, so it's not updated in the PR. Probably we can remove it. - e2e tests to make sure it works with a production cplane	2025-06-30 14:09:50 +00:00
Erik Grinaker	d0a4ae3e8f	pageserver: add gRPC LSN lease support (#12384 ) ## Problem The gRPC API does not provide LSN leases. ## Summary of changes * Add LSN lease support to the gRPC API. * Use gRPC LSN leases for static computes with `grpc://` connstrings. * Move `PageserverProtocol` into the `compute_api::spec` module and reuse it.	2025-06-30 12:44:17 +00:00
Matthias van de Meent	6c6de6382a	Use enum-typed PG versions (#12317 ) This makes it possible for the compiler to validate that a match block matched all PostgreSQL versions we support. ## Problem We did not have a complete picture about which places we had to test against PG versions, and what format these versions were: The full PG version ID format (Major/minor/bugfix `MMmmbb`) as transfered in protocol messages, or only the Major release version (`MM`). This meant type confusion was rampant. With this change, it becomes easier to develop new version-dependent features, by making type and niche confusion impossible. ## Summary of changes Every use of `pg_version` is now typed as either `PgVersionId` (u32, valued in decimal `MMmmbb`) or PgMajorVersion (an enum, with a value for every major version we support, serialized and stored like a u32 with the value of that major version) --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-06-24 17:25:31 +00:00
Arpad Müller	552249607d	apply clippy fixes for 1.88.0 beta (#12331 ) The 1.88.0 stable release is near (this Thursday). We'd like to fix most warnings beforehand so that the compiler upgrade doesn't require approval from too many teams. This is therefore a preparation PR (like similar PRs before it). There is a lot of changes for this release, mostly because the `uninlined_format_args` lint has been added to the `style` lint group. One can read more about the lint [here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args). The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One remaining warning is left for the proxy team. --------- Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2025-06-24 10:12:42 +00:00
Mikhail	e95f2f9a67	compute_ctl: return LSN in /terminate (#12240 ) - Add optional `?mode=fast\|immediate` to `/terminate`, `fast` is default. Immediate avoids waiting 30 seconds before returning from `terminate`. - Add `TerminateMode` to `ComputeStatus::TerminationPending` - Use `/terminate?mode=immediate` in `neon_local` instead of `pg_ctl stop` for `test_replica_promotes`. - Change `test_replica_promotes` to check returned LSN - Annotate `finish_sync_safekeepers` as `noreturn`. https://github.com/neondatabase/cloud/issues/29807	2025-06-18 12:25:19 +00:00
Suhas Thalanki	83069f6ca1	fix: terminate pgbouncer on compute suspend (#12153 ) ## Problem PgBouncer does not terminate connections on a suspend: https://github.com/neondatabase/cloud/issues/16282 ## Summary of changes 1. Adds a pid file to store the pid of PgBouncer 2. Terminates connections on a compute suspend --------- Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>	2025-06-17 22:56:05 +00:00
Erik Grinaker	edf51688bc	neon_local: support gRPC connstrings for endpoints (#12271 ) ## Problem `neon_local` should support endpoints using gRPC, by providing `grpc://` connstrings with the Pageservers' gRPC ports. Requires #12268. Touches #11926. ## Summary of changes * Add `--grpc` switch for `neon_local endpoint create`. * Generate `grpc://` connstrings for endpoints when enabled. Computes don't actually support `grpc://` connstrings yet, but will soon. gRPC is configured when the endpoint is created, not when it's started, such that it continues to use gRPC across restarts and reconfigurations. In particular, this is necessary for the storage controller's local notify hook, which can't easily plumb through gRPC configuration from the start/reconfigure commands but has access to the endpoint's configuration.	2025-06-17 14:39:42 +00:00
Folke Behrens	1dce65308d	Update base64 to 0.22 (#12215 ) ## Problem Base64 0.13 is outdated. ## Summary of changes Update base64 to 0.22. Affects mostly proxy and proxy libs. Also upgrade serde_with to remove another dep on base64 0.13 from dep tree.	2025-06-12 16:12:47 +00:00
Mikhail	c698cee19a	ComputeSpec: prewarm_lfc_on_startup -> autoprewarm (#12120 ) https://github.com/neondatabase/cloud/pull/29472 https://github.com/neondatabase/cloud/issues/26346	2025-06-04 05:38:03 +00:00
Shockingly Good	4d2e4b19c3	fix(compute) Correct the PGXN s3 gateway URL. (#11796 ) Corrects the postgres extension s3 gateway address to be not just a domain name but a full base URL. To make the code more readable, the option is renamed to "remote_ext_base_url", while keeping the old name also accessible by providing a clap argument alias. Also provides a very simple and, perhaps, even redundant unit test to confirm the logic behind parsing of the corresponding CLI argument. ## Problem As it is clearly stated in https://github.com/neondatabase/cloud/issues/26005, using of the short version of the domain name might work for now, but in the future, we should get rid of using the `default` namespace and this is where it will, most likely, break down. ## Summary of changes The changes adjust the domain name of the extension s3 gateway to use the proper base url format instead of the just domain name assuming the "default" namespace and add a new CLI argument name for to reflect the change and the expectance.	2025-05-07 16:34:08 +00:00
Tristan Partin	0ef6851219	Make the audience claim in compute JWTs a vector (#11845 ) According to RFC 7519, `aud` is generally an array of StringOrURI, but in special cases may be a single StringOrURI value. To accomodate future control plane work where a single token may work for multiple services, make the claim a vector. Link: https://www.rfc-editor.org/rfc/rfc7519#section-4.1.3 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-05-06 22:19:15 +00:00
Mikhail	5c356c63eb	endpoint_storage compute_ctl integration (#11550 ) Add `/lfc/(prewarm\|offload)` routes to `compute_ctl` which interact with endpoint storage. Add `prewarm_lfc_on_startup` spec option which, if enabled, downloads LFC prewarm data on compute startup. Resolves: https://github.com/neondatabase/cloud/issues/26343	2025-05-06 22:02:12 +00:00
Tristan Partin	f9b3a2e059	Add scoping to compute_ctl JWT claims (#11639 ) Currently we only have an admin scope which allows a user to bypass the compute_id check. When the admin scope is provided, validate the audience of the JWT to be "compute". Closes: https://github.com/neondatabase/cloud/issues/27614 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-05-06 19:51:10 +00:00
Tristan Partin	79083de61c	Remove forward compatibility hacks related to compute_ctl auth (#11621 ) These various hacks were needed for the forward compatibility tests. Enough time has passed since the merge that these are no longer needed. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-16 23:14:24 +00:00
Tristan Partin	edc11253b6	Fix neon_local public key parsing when create compute JWKS (#11602 ) Finally figured out the right incantation. I had had this in my original go, but due to some refactoring and apparently missed testing, I committed a mistake. The reason this doesn't currently break anything is that we bypass the authorization middleware when the "testing" cargo feature is enabled. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-16 12:51:48 +00:00
Tristan Partin	eadb05f78e	Teach neon_local to pass the Authorization header to compute_ctl (#11490 ) This allows us to remove hacks in the compute_ctl authorization middleware which allowed for bypasses of auth checks. Fixes: https://github.com/neondatabase/neon/issues/11316 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-15 17:27:49 +00:00
Tristan Partin	ff5a527167	Consolidate compute_ctl configuration structures (#11514 ) Previously, the structure of the spec file was just the compute spec. However, the response from the control plane get spec request included the compute spec and the compute_ctl config. This divergence was hindering other work such as adding regression tests for compute_ctl HTTP authorization. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-11 15:06:29 +00:00
Anastasia Lubennikova	5063151271	compute: Add more neon ids to compute (#11366 ) Pass more neon ids to compute_ctl. Expose them to postgres as neon extension GUCs: neon.project_id, neon.branch_id, neon.endpoint_id. This is the compute side PR, not yet supported by cplane.	2025-04-10 13:04:18 +00:00
Roman Zaynetdinov	a7142f3bc6	Configure rsyslog for logs export using the spec (#11338 ) - Work on https://github.com/neondatabase/cloud/issues/24896 - Cplane part https://github.com/neondatabase/cloud/pull/26808 Instead of reconfiguring rsyslog via an API endpoint [we have agreed](https://neondb.slack.com/archives/C04DGM6SMTM/p1743513810964509?thread_ts=1743170369.865859&cid=C04DGM6SMTM) to have a new `logs_export_host` field as part of the compute spec. --------- Co-authored-by: Tristan Partin <tristan@neon.tech>	2025-04-08 14:03:09 +00:00
Anastasia Lubennikova	d94fc75cfc	Setup compute_ctl pgaudit and rsyslog (#10615 ) Setup pgaudit and pgauditlogtofile extensions in compute_ctl when the ComputeAuditLogLevel is set to 'hipaa'. See cloud PR https://github.com/neondatabase/cloud/pull/24568 Add rsyslog setup for compute_ctl. Spin up a rsyslog server in the compute VM, and configure it to send logs to the endpoint specified in AUDIT_LOGGING_ENDPOINT env.	2025-03-05 18:01:00 +00:00
Arseny Sher	ef2b50994c	walproposer: basic infra to enable generations (#11002 ) ## Problem Preparation for https://github.com/neondatabase/neon/issues/10851 ## Summary of changes Add walproposer `safekeepers_generations` field which can be set by prefixing `neon.safekeepers` GUC with `g#n:`. Non zero value (n) forces walproposer to use generations. In particular, this also disables implicit timeline creation as timeline will be created by storcon. Add test checking this. Also add missing infra: `--safekeepers-generation` flag to neon_local endpoint start + fix `--start-timeout` flag: it existed but value wasn't used.	2025-03-03 13:20:20 +00:00
Arpad Müller	a22be5af72	Migrate the last crates to edition 2024 (#10998 ) Migrates the remaining crates to edition 2024. We like to stay on the latest edition if possible. There is no functional changes, however some code changes had to be done to accommodate the edition's breaking changes. Like the previous migration PRs, this is comprised of three commits: * the first does the edition update and makes `cargo check`/`cargo clippy` pass. we had to update bindgen to make its output [satisfy the requirements of edition 2024](https://doc.rust-lang.org/edition-guide/rust-2024/unsafe-extern.html) * the second commit does a `cargo fmt` for the new style edition. * the third commit reorders imports as a one-off change. As before, it is entirely optional. Part of #10918	2025-02-27 09:40:40 +00:00
Tristan Partin	d571553d8a	Remove hacks in compute_ctl related to compute ID (#10751 )	2025-02-20 17:31:52 +00:00
Tristan Partin	f7474d3f41	Remove forward compatibility hacks related to compute HTTP servers (#10797 ) These hacks were added to appease the forward compatibility tests and can be removed. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-20 17:31:42 +00:00
Anastasia Lubennikova	7c7180a79d	Fix deadlock in drop_subscriptions_before_start (#10806 ) ALTER SUBSCRIPTION requires AccessExclusive lock which conflicts with iteration over pg_subscription when multiple databases are present and operations are applied concurrently. Fix by explicitly locking pg_subscription in the beginning of the transaction in each database. ## Problem https://github.com/neondatabase/cloud/issues/24292	2025-02-20 17:14:16 +00:00
Tristan Partin	0cf9157adc	Handle new compute_ctl_config parameter in compute spec requests (#10746 ) There is now a compute_ctl_config field in the response that currently only contains a JSON Web Key set. compute_ctl currently doesn't do anything with the keys, but will in the future. The reasoning for the new field is due to the nature of empty computes. When an empty compute is created, it does not have a tenant. A compute spec is the primary means of communicating the details of an attached tenant. In the empty compute state, there is no spec. Instead we wait for the control plane to pass us one via /configure. If we were to include the jwks field in the compute spec, we would have a partial compute spec, which doesn't logically make sense. Instead, we can have two means of passing settings to the compute: - spec: tenant specific config details - compute_ctl_config: compute specific settings For instance, the JSON Web Key set passed to the compute is independent of any tenant. It is a setting of the compute whether it is attached or not. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-13 18:04:36 +00:00
Tristan Partin	da9c101939	Implement a second HTTP server within compute_ctl (#10574 ) The compute_ctl HTTP server has the following purposes: - Allow management via the control plane - Provide an endpoint for scaping metrics - Provide APIs for compute internal clients - Neon Postgres extension for installing remote extensions - local_proxy for installing extensions and adding grants The first two purposes require the HTTP server to be available outside the compute. The Neon threat model is a bad actor within our internal network. We need to reduce the surface area of attack. By exposing unnecessary unauthenticated HTTP endpoints to the internal network, we increase the surface area of attack. For endpoints described in the third bullet point, we can just run an extra HTTP server, which is only bound to the loopback interface since all consumers of those endpoints are within the compute.	2025-02-11 18:02:22 +00:00
Tristan Partin	3d143ad799	Unbrick the forward compatibility test failures (#10747 ) Since the merge of https://github.com/neondatabase/neon/pull/10523, forward compatibility tests have been broken everywhere. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-10 22:22:10 +00:00
Tristan Partin	946da3f7e2	Require --compute-id when running compute_ctl (#10523 ) The compute_id will be used when verifying claims sent by the control plane. Signed-off-by: Tristan Partin <tristan@neon.tech> Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-10 16:46:20 +00:00
Anastasia Lubennikova	8e8df1b453	Disable logical replication subscribers (#10249 ) Drop logical replication subscribers before compute starts on a non-main branch. Add new compute_ctl spec flag: drop_subscriptions_before_start If it is set, drop all the subscriptions from the compute node before it starts. To avoid race on compute start, use new GUC neon.disable_logical_replication_subscribers to temporarily disable logical replication workers until we drop the subscriptions. Ensure that we drop subscriptions exactly once when endpoint starts on a new branch. It is essential, because otherwise, we may drop not only inherited, but newly created subscriptions. We cannot rely only on spec.drop_subscriptions_before_start flag, because if for some reason compute restarts inside VM, it will start again with the same spec and flag value. To handle this, we save the fact of the operation in the database in the neon.drop_subscriptions_done table. If the table does not exist, we assume that the operation was never performed, so we must do it. If table exists, we check if the operation was performed on the current timeline. fixes: https://github.com/neondatabase/neon/issues/8790	2025-01-23 11:02:15 +00:00
Tristan Partin	49756a0d01	Implement compute_ctl management API in Axum (#10099 ) This is a refactor to create better abstractions related to our management server. It cleans up the code, and prepares everything for authorized communication to and from the control plane. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-09 20:08:26 +00:00
Em Sharnoff	cd10c719f9	compute: Add spec support for disabling LFC resizing (#10132 ) ref neondatabase/cloud#21731 ## Problem When we manually override the LFC size for particular computes, autoscaling will typically undo that because vm-monitor will resize LFC itself. So, we'd like a way to make vm-monitor not set LFC size — this actually already exists, if we just don't give vm-monitor a postgres connection string. ## Summary of changes Add a new field to the compute spec, `disable_lfc_resizing`. When set to `true`, we pass in `None` for its postgres connection string. That matches the configuration tested in `neondatabase/autoscaling` CI.	2025-01-02 19:45:59 +00:00
John Spray	65042cbadd	tests: use high IO concurrency in `test_pgdata_import_smoke`, use `effective_io_concurrency=2` in tests by default (#10114 ) ## Problem `test_pgdata_import_smoke` writes two gigabytes of pages and then reads them back serially. This is CPU bottlenecked and results in a long runtime, and sensitivity to CPU load from other tests on the same machine. Closes: https://github.com/neondatabase/neon/issues/10071 ## Summary of changes - Use effective_io_concurrency=32 when doing sequential scans through 2GiB of pages in test_pgdata_import_smoke. This is a ~10x runtime decrease in the parts of the test that do sequential scans. - Also set `effective_io_concurrency=2` for tests, as I noticed while debugging that we were doing all getpage requests serially, which is bad for checking the stability of the batching code.	2024-12-19 10:58:49 +00:00
Arseny Sher	e4bb1ca7d8	Increase neon_local http client to compute timeout in reconfigure. (#10088 ) Seems like 30s sometimes not enough when CI runners are overloaded, causing pull_timeline flakiness. ref https://github.com/neondatabase/neon/issues/9731#issuecomment-2535946443	2024-12-11 15:46:50 +00:00
Tristan Partin	6ff4175fd7	Send Content-Type header on reconfigure request from neon_local (#10029 ) Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-12-05 20:30:35 +00:00
Alexey Kondratov	13e8105740	feat(compute): Allow specifying the reconfiguration concurrency (#10006 ) ## Problem We need a higher concurrency during reconfiguration in case of many DBs, but the instance is already running and used by the client. We can easily get out of `max_connections` limit, and the current code won't handle that. ## Summary of changes Default to 1, but also allow control plane to override this value for specific projects. It's also recommended to bump `superuser_reserved_connections` += `reconfigure_concurrency` for such projects to ensure that we always have enough spare connections for reconfiguration process to succeed. Quick workaround for neondatabase/cloud#17846	2024-12-05 17:57:25 +00:00
Alexey Kondratov	2e9207fdf3	fix(testing): Use 1 MB shared_buffers even with LFC (#9969 ) ## Problem After enabling LFC in tests and lowering `shared_buffers` we started having more problems with `test_pg_regress`. ## Summary of changes Set `shared_buffers` to 1MB to both exercise getPage requests/LFC, and still have enough room for Postgres to operate. Everything smaller might be not enough for Postgres under load, and can cause errors like 'no unpinned buffers available'. See Konstantin's comment [1] as well. Fixes #9956 [1]: https://github.com/neondatabase/neon/issues/9956#issuecomment-2511608097	2024-12-02 18:46:06 +00:00
Arpad Müller	d92ff578c4	Add test for fixed storage broker issue (#9311 ) Adds a test for the (now fixed) storage broker limit issue, see #9268 for the description and #9299 for the fix. Also fix a race condition with endpoint creation/starts running in parallel, leading to file not found errors.	2024-10-14 14:34:57 +02:00
Conrad Ludgate	6c05f89f7d	proxy: add local-proxy to compute image (#8823 ) 1. Adds local-proxy to compute image and vm spec 2. Updates local-proxy config processing, writing PID to a file eagerly 3. Updates compute-ctl to understand local proxy compute spec and to send SIGHUP to local-proxy over that pid. closes https://github.com/neondatabase/cloud/issues/16867	2024-10-04 14:52:01 +00:00
Arthur Petukhovsky	ba498a630a	Set disk quotas on bind in compute_ctl (#8936 ) Part of https://github.com/neondatabase/cloud/issues/13127. Resolves #9153 What changed in this PR: 1. Adds `ComputeSpec.disk_quota_bytes: Option<u64>` 2. Adds new arg to compute_ctl: `--set-disk-quota-for-fs <mountpoint>` 3. Implements running `/neonvm/bin/set-disk-quota` with the right value if both cmdline arg AND field in the spec are specified 4. Patches `/etc/sudoers.d` to allow `compute_ctl` to set quota with sudo This PR is very similar to the swap support added earlier, you can take a look at it as prior art: #7434 In theory, it can be implemented outside of compute_ctl when we will have a separate neonvm daemon, but we are not there yet. Current implementation is the simplest possible to unblock computes with larger disks. All code related to usage of `/neonvm/bin/set-disk-quota` is located in `disk_quota.rs`. We need to call this script with the following arguments: `/neonvm/bin/set-disk-quota {size_kb} {mountpoint}`. Quotas are set on the filesystem level, so we need to provide path to the directory that filesystem was mounted to. I tested this change locally with https://github.com/neondatabase/cloud/pull/17270. It should be safe to merge, because this feature is gated by both cmdline arg and field in the spec. If control-plane doesn't set values in both places, compute_ctl won't be affected by this change.	2024-09-27 20:52:22 +01:00
Christian Schwarz	035a49a6b2	`neon_local start`: parallel startup to break cyclic dependency (#8950 ) (Found this useful during investigation https://github.com/neondatabase/cloud/issues/16886.) Problem ------- Before this PR, `neon_local` sequentially does the following: 1. launch storcon process 2. wait for storcon to signal readiness [here](`75310fe441/control_plane/src/storage_controller.rs (L804-L808)`) 3. start pageserver 4. wait for pageserver to become ready [here](`c43e664ff5/control_plane/src/pageserver.rs (L343-L346)`) 5. etc The problem is that storcon's readiness waits for the [`startup_reconcile`](`cbcd4058ed/storage_controller/src/service.rs (L520-L523)`) to complete. But pageservers aren't started at this point. So, worst case we wait for `STARTUP_RECONCILE_TIMEOUT/2`, i.e., 15s. This is more than the 10s default timeout allowed by neon_local. So, the result is that `neon_local start` fails to start storcon and stops everything. Solution -------- In this PR I choose the the radical solution to start everything in parallel. It junks up the output because we do stuff like `print!(".")` to indicate progress. We should just abandon that. And switch to `utils::logging` + `tracing` with separate spans for each component. I can do that in this PR or we leave it as a follow-up. Alternatives Considered ----------------------- The Pageserver's `/v1/status` or in fact any endpoint of the mgmt API will not `accept()` on the mgmt API socket until after the `re-attach` call to storcon returned success. So, it's insufficient to change the startup order to start Pageservers first. We cannot easily change Pageserver startup order because `init_tenant_mgr` must complete before we start serving the mgmt API. Otherwise tenant detach calls et al can race with `init_tenant_mgr`. We'd have to add a "loading" state to tenant mgr and make all API endpoints except `/v1/status` wait for _that_ to complete. Related ------- - https://github.com/neondatabase/neon/pull/6475	2024-09-18 18:17:55 +02:00
Arseny Sher	a4eea5025c	Fix logical apply worker reporting of flush_lsn wrt sync replication. It should take syncrep flush_lsn into account because WAL before it on endpoint restart is lost, which makes replication miss some data if slot had already been advanced too far. This commit adds test reproducing the issue and bumps vendor/postgres to commit with the actual fix.	2024-08-12 13:14:02 +03:00

1 2

93 Commits