rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-21 07:00:38 +00:00

Author	SHA1	Message	Date
Tristan Partin	512210bb5a	[BRC-2368] Add PS and compute_ctl metrics to report pagestream request errors (#12716 ) ## Problem In our experience running the system so far, almost all of the "hang compute" situations are due to the compute (postgres) pointing at the wrong pageservers. We currently mainly rely on the promethesus exporter (PGExporter) running on PG to detect and report any down time, but these can be unreliable because the read and write probes the PGExporter runs do not always generate pageserver requests due to caching, even though the real user might be experiencing down time when touching uncached pages. We are also about to start disk-wiping node pool rotation operations in prod clusters for our pageservers, and it is critical to have a convenient way to monitor the impact of these node pool rotations so that we can quickly respond to any issues. These metrics should provide very clear signals to address this operational need. ## Summary of changes Added a pair of metrics to detect issues between postgres' PageStream protocol (e.g. get_page_at_lsn, get_base_backup, etc.) communications with pageservers: * On the compute node (compute_ctl), exports a counter metric that is incremented every time postgres requests a configuration refresh. Postgres today only requests these configuration refreshes when it cannot connect to a pageserver or if the pageserver rejects its request by disconnecting. * On the pageserver, exports a counter metric that is incremented every time it receives a PageStream request that cannot be handled because the tenant is not known or if the request was routed to the wrong shard (e.g. secondary). ### How I plan to use metrics I plan to use the metrics added here to create alerts. The alerts can fire, for example, if these counters have been continuously increasing for over a certain period of time. During rollouts, misrouted requests may occasionally happen, but they should soon die down as reconfigurations make progress. We can start with something like raising the alert if the counters have been increasing continuously for over 5 minutes. ## How is this tested? New integration tests in `test_runner/regress/test_hadron_ps_connectivity_metrics.py` Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-24 19:05:00 +00:00
Tristan Partin	90cd5a5be8	[BRC-1778] Add mechanism to `compute_ctl` to pull a new config (#12711 ) ## Problem We have been dealing with a number of issues with the SC compute notification mechanism. Various race conditions exist in the PG/HCC/cplane/PS distributed system, and relying on the SC to send notifications to the compute node to notify it of PS changes is not robust. We decided to pursue a more robust option where the compute node itself discovers whether it may be pointing to the incorrect PSs and proactively reconfigure itself if issues are suspected. ## Summary of changes To support this self-healing reconfiguration mechanism several pieces are needed. This PR adds a mechanism to `compute_ctl` called "refresh configuration", where the compute node reaches out to the control plane to pull a new config and reconfigure PG using the new config, instead of listening for a notification message containing a config to arrive from the control plane. Main changes to compute_ctl: 1. The `compute_ctl` state machine now has a new State, `RefreshConfigurationPending`. The compute node may enter this state upon receiving a signal that it may be using the incorrect page servers. 2. Upon entering the `RefreshConfigurationPending` state, the background configurator thread in `compute_ctl` wakes up, pulls a new config from the control plane, and reconfigures PG (with `pg_ctl reload`) according to the new config. 3. The compute node may enter the new `RefreshConfigurationPending` state from `Running` or `Failed` states. If the configurator managed to configure the compute node successfully, it will enter the `Running` state, otherwise, it stays in `RefreshConfigurationPending` and the configurator thread will wait for the next notification if an incorrect config is still suspected. 4. Added various plumbing in `compute_ctl` data structures to allow the configurator thread to perform the config fetch. The "incorrect config suspected" notification is delivered using a HTTP endpoint, `/refresh_configuration`, on `compute_ctl`. This endpoint is currently not called by anyone other than the tests. In a follow up PR I will set up some code in the PG extension/libpagestore to call this HTTP endpoint whenever PG suspects that it is pointing to the wrong page servers. ## How is this tested? Modified `test_runner/regress/test_change_pageserver.py` to add a scenario where we use the new `/refresh_configuration` mechanism instead of the existing `/configure` mechanism (which requires us sending a full config to compute_ctl) to have the compute node reload and reconfigure its pageservers. I took one shortcut to reduce the scope of this change when it comes to testing: the compute node uses a local config file instead of pulling a config over the network from the HCC. This simplifies the test setup in the following ways: * The existing test framework is set up to use local config files for compute nodes only, so it's convenient if I just stick with it. * The HCC today generates a compute config with production settings (e.g., assuming 4 CPUs, 16GB RAM, with local file caches), which is probably not suitable in tests. We may need to add another test-only endpoint config to the control plane to make this work. The config-fetch part of the code is relatively straightforward (and well-covered in both production and the KIND test) so it is probably fine to replace it with loading from the local config file for these integration tests. In addition to making sure that the tests pass, I also manually inspected the logs to make sure that the compute node is indeed reloading the config using the new mechanism instead of going down the old `/configure` path (it turns out the test has bugs which causes compute `/configure` messages to be sent despite the test intending to disable/blackhole them). ```test 2024-09-24T18:53:29.573650Z INFO http request{otel.name=/refresh_configuration http.method=POST}: serving /refresh_configuration POST request 2024-09-24T18:53:29.573689Z INFO configurator_main_loop: compute node suspects its configuration is out of date, now refreshing configuration 2024-09-24T18:53:29.573706Z INFO configurator_main_loop: reloading config.json from path: /workspaces/hadron/test_output/test_change_pageserver_using_refresh[release-pg16]/repo/endpoints/ep-1/spec.json PG:2024-09-24 18:53:29.574 GMT [52799] LOG: received SIGHUP, reloading configuration files PG:2024-09-24 18:53:29.575 GMT [52799] LOG: parameter "neon.extension_server_port" cannot be changed without restarting the server PG:2024-09-24 18:53:29.575 GMT [52799] LOG: parameter "neon.pageserver_connstring" changed to "postgresql://no_user@localhost:15008" ... ``` Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-24 14:26:21 +00:00
HaoyuHuang	63ea4b0579	A few more compute_tool changes (#12687 ) ## Summary of changes All changes are no-op except that the tracing-appender lib is upgraded from 0.2.2 to 0.2.3	2025-07-23 18:30:33 +00:00
HaoyuHuang	8de320ab9b	Add a few compute_tool changes (#12677 ) ## Summary of changes All changes are no-op.	2025-07-22 16:22:18 +00:00
Folke Behrens	108f7ec544	Bump opentelemetry crates to 0.30 (#12680 ) This rebuilds #11552 on top the current Cargo.lock. --------- Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>	2025-07-22 16:05:35 +00:00
Alexey Kondratov	dd7fff655a	feat(compute): Introduce privileged_role_name parameter (#12539 ) ## Problem Currently `neon_superuser` is hardcoded in many places. It makes it harder to reuse the same code in different envs. ## Summary of changes Parametrize `neon_superuser` in `compute_ctl` via `--privileged-role-name` and in `neon` extensions via `neon.privileged_role_name`, so it's now possible to use different 'superuser' role names if needed. Everything still defaults to `neon_superuser`, so no control plane code changes are needed and I intentionally do not touch regression and migrations tests. Postgres PRs: - https://github.com/neondatabase/postgres/pull/674 - https://github.com/neondatabase/postgres/pull/675 - https://github.com/neondatabase/postgres/pull/676 - https://github.com/neondatabase/postgres/pull/677 Cloud PR: - https://github.com/neondatabase/cloud/pull/31138	2025-07-15 20:22:57 +00:00
Suhas Thalanki	5f3532970e	[compute] fix: background worker that collects installed extension metrics now updates collection interval (#12277 ) ## Problem Previously, the background worker that collects the list of installed extensions across DBs had a timeout set to 1 hour. This cause a problem with computes that had a `suspend_timeout` > 1 hour as this collection was treated as activity, preventing compute shutdown. Issue: https://github.com/neondatabase/cloud/issues/30147 ## Summary of changes Passing the `suspend_timeout` as part of the `ComputeSpec` so that any updates to this are taken into account by the background worker and updates its collection interval.	2025-06-30 22:12:37 +00:00
Matthias van de Meent	6c6de6382a	Use enum-typed PG versions (#12317 ) This makes it possible for the compiler to validate that a match block matched all PostgreSQL versions we support. ## Problem We did not have a complete picture about which places we had to test against PG versions, and what format these versions were: The full PG version ID format (Major/minor/bugfix `MMmmbb`) as transfered in protocol messages, or only the Major release version (`MM`). This meant type confusion was rampant. With this change, it becomes easier to develop new version-dependent features, by making type and niche confusion impossible. ## Summary of changes Every use of `pg_version` is now typed as either `PgVersionId` (u32, valued in decimal `MMmmbb`) or PgMajorVersion (an enum, with a value for every major version we support, serialized and stored like a u32 with the value of that major version) --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-06-24 17:25:31 +00:00
Arpad Müller	552249607d	apply clippy fixes for 1.88.0 beta (#12331 ) The 1.88.0 stable release is near (this Thursday). We'd like to fix most warnings beforehand so that the compiler upgrade doesn't require approval from too many teams. This is therefore a preparation PR (like similar PRs before it). There is a lot of changes for this release, mostly because the `uninlined_format_args` lint has been added to the `style` lint group. One can read more about the lint [here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args). The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One remaining warning is left for the proxy team. --------- Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2025-06-24 10:12:42 +00:00
Suhas Thalanki	83069f6ca1	fix: terminate pgbouncer on compute suspend (#12153 ) ## Problem PgBouncer does not terminate connections on a suspend: https://github.com/neondatabase/cloud/issues/16282 ## Summary of changes 1. Adds a pid file to store the pid of PgBouncer 2. Terminates connections on a compute suspend --------- Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>	2025-06-17 22:56:05 +00:00
Tristan Partin	3fd5a94a85	Use Url::join() when creating the final remote extension URL (#12121 ) Url::to_string() adds a trailing slash on the base URL, so when we did the format!(), we were adding a double forward slash. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-06-04 15:56:12 +00:00
Shockingly Good	f05df409bd	impr(compute): Remove the deprecated CLI arg alias for remote-ext-config. (#12087 ) Also moves it from `String` to `Url`.	2025-05-30 17:45:24 +00:00
Shockingly Good	62cd3b8d3d	fix(compute) Remove the hardcoded default value for PGXN HTTP URL. (#12030 ) Removes the hardcoded value for the Postgres Extensions HTTP gateway URL as it is always provided by the calling code.	2025-05-30 15:26:22 +00:00
Gleb Novikov	3b4d4eb535	fast_import.rs: log number of jobs for pg_dump/pg_restore (#12068 ) ## Problem I have a hypothesis that import might be using lower number of jobs than max for the VM, where the job is running. This change will help finding this out from logs ## Summary of changes Added logging of number of jobs, which is passed into both `pg_dump` and `pg_restore`	2025-05-29 18:25:42 +00:00
Suhas Thalanki	e77961c1c6	background worker that collects installed extensions (#11939 ) ## Problem Currently, we collect metrics of what extensions are installed on computes at start up time. We do not have a mechanism that does this at runtime. ## Summary of changes Added a background thread that queries all DBs at regular intervals and collects a list of installed extensions.	2025-05-27 19:40:51 +00:00
Mark Novikov	d22377c754	Skip event triggers in dump-restore (#11794 ) ## Problem Data import fails if the src db has any event triggers, because those can only be restored by a superuser. Specifically imports from Heroku and Supabase are guaranteed to fail. Closes https://github.com/neondatabase/cloud/issues/27353 ## Summary of changes Depends on `pg_dump` patches per each supported PostgreSQL version: - https://github.com/neondatabase/postgres/pull/630 - https://github.com/neondatabase/postgres/pull/629 - https://github.com/neondatabase/postgres/pull/627 - https://github.com/neondatabase/postgres/pull/628	2025-05-08 11:04:28 +00:00
Shockingly Good	4d2e4b19c3	fix(compute) Correct the PGXN s3 gateway URL. (#11796 ) Corrects the postgres extension s3 gateway address to be not just a domain name but a full base URL. To make the code more readable, the option is renamed to "remote_ext_base_url", while keeping the old name also accessible by providing a clap argument alias. Also provides a very simple and, perhaps, even redundant unit test to confirm the logic behind parsing of the corresponding CLI argument. ## Problem As it is clearly stated in https://github.com/neondatabase/cloud/issues/26005, using of the short version of the domain name might work for now, but in the future, we should get rid of using the `default` namespace and this is where it will, most likely, break down. ## Summary of changes The changes adjust the domain name of the extension s3 gateway to use the proper base url format instead of the just domain name assuming the "default" namespace and add a new CLI argument name for to reflect the change and the expectance.	2025-05-07 16:34:08 +00:00
Konstantin Knizhnik	1531712555	Undo commit `d1728a6bcd` because it causes problems with creating pg_search extension (#11700 ) ## Problem See https://neondb.slack.com/archives/C03H1K0PGKH/p1745489241982209 pg_search extension now can not be created. ## Summary of changes Undo `d1728a6bcd` Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-24 14:46:10 +00:00
Tristan Partin	d1728a6bcd	Remove old compatibility hack for remote extensions (#11620 ) Control plane has long since been updated to send the right value. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-17 14:08:42 +00:00
Tristan Partin	79083de61c	Remove forward compatibility hacks related to compute_ctl auth (#11621 ) These various hacks were needed for the forward compatibility tests. Enough time has passed since the merge that these are no longer needed. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-16 23:14:24 +00:00
Tristan Partin	028a191040	Continue with s/spec/config changes (#11574 ) Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-14 21:18:21 +00:00
Tristan Partin	ff5a527167	Consolidate compute_ctl configuration structures (#11514 ) Previously, the structure of the spec file was just the compute spec. However, the response from the control plane get spec request included the compute spec and the compute_ctl config. This divergence was hindering other work such as adding regression tests for compute_ctl HTTP authorization. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-11 15:06:29 +00:00
Tristan Partin	a04e33ceb6	Remove --spec-json argument from compute_ctl (#11510 ) It isn't used by the production control plane or neon_local. The removal simplifies compute spec logic just a little bit more since we can remove any notion of whether we should allow live reconfigurations. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-09 22:39:54 +00:00
Alexey Kondratov	557127550c	feat(compute): Add compute_ctl_up metric (#11376 ) ## Problem For computes running inside NeonVM, the actual compute image tag is buried inside the NeonVM spec, and we cannot get it as part of standard k8s container metrics (it's always an image and a tag of the NeonVM runner container). The workaround we currently use is to extract the running computes info from the control plane database with SQL. It has several drawbacks: i) it's complicated, separate DB per region; ii) it's slow; iii) it's still an indirect source of info, i.e. k8s state could be different from what the control plane expects. ## Summary of changes Add a new `compute_ctl_up` gauge metric with `build_tag` and `status` labels. It will help us to both overview what are the tags/versions of all running computes; and to break them down by current status (`empty`, `running`, `failed`, etc.) Later, we could introduce low cardinality (no endpoint or compute ids) streaming aggregates for such metrics, so they will be blazingly fast and usable for monitoring the fleet-wide state.	2025-04-01 08:51:17 +00:00
Gleb Novikov	2065074559	fast_import: put job status to s3 (#11284 ) ## Problem `fast_import` binary is being run inside neonvms, and they do not support proper `kubectl describe logs` now, there are a bunch of other caveats as well: https://github.com/neondatabase/autoscaling/issues/1320 Anyway, we needed a signal if job finished successfully or not, and if not — at least some error message for the cplane operation. And after [a short discussion](https://neondb.slack.com/archives/C07PG8J1L0P/p1741954251813609), that s3 object is the most convenient at the moment. ## Summary of changes If `s3_prefix` was provided to `fast_import` call, any job run puts a status object file into `{s3_prefix}/status/fast_import` with contents `{"done": true}` or `{"done": false, "error": "..."}`. Added a test as well	2025-03-20 15:23:35 +00:00
Tristan Partin	7b7e4a9fd3	Authorize compute_ctl requests from the control plane (#10530 ) The compute should only act if requests come from the control plane. Signed-off-by: Tristan Partin <tristan@neon.tech> Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-03-04 18:08:00 +00:00
Heikki Linnakangas	066324d6ec	compute_ctl: Rearrange startup code (#11007 ) Move most of the code to compute.rs, so that all the major startup steps are visible in one place. You can now get a pretty good picture of what happens in the latency-critical path at compute startup by reading ComputeNode::start_compute(). This also clarifies the error handling in start_compute. Previously, the start_postgres function sometimes returned an Err, and sometimes Ok but with the compute status already set to Failed. Now the start_compute function always returns Err on failure, and it's the caller's responsibility to change the compute status to Failed. Separately from that, it returns a handle to the Postgres process via a `&mut` reference if it had already started Postgres (i.e. on success, or if the failure happens after launching the Postgres process). --------- Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>	2025-02-28 22:48:05 +00:00
Heikki Linnakangas	5cfdb1244f	compute_ctl: Add OTEL tracing to incoming HTTP requests and startup (#10971 ) We lost this with the switch to axum for the HTTP server. Add it back. In addition to just resurrecting the functionality we had before, pass the tracing context of the /configure HTTP request to the start_postgres operation that runs in the main thread. This way, the 'start_postgres' and all its sub-spans like getting the basebackup become children of the HTTP request span. This allows end-to-end tracing of a compute start, all the way from the proxy to the SQL queries executed by compute_ctl as part of compute startup.	2025-02-26 19:27:16 +00:00
Arpad Müller	f94286f0c9	Upgrade compute_tools and compute_api to edition 2024 (#10983 ) Updates `compute_tools` and `compute_api` crates to edition 2024. We like to stay on the latest edition if possible. There is no functional changes, however some code changes had to be done to accommodate the edition's breaking changes. The PR has three commits: * the first commit updates the named crates to edition 2024 and appeases `cargo clippy` by changing code. * the second commit performs a `cargo fmt` that does some minor changes (not many) * the third commit performs a cargo fmt with nightly options to reorder imports as a one-time thing. it's completely optional, but I offer it here for the compute team to review it. I'd like to hear opinions about the third commit, if it's wanted and felt worth the diff or not. I think most attention should be put onto the first commit. Part of #10918	2025-02-26 13:12:26 +00:00
Tristan Partin	d571553d8a	Remove hacks in compute_ctl related to compute ID (#10751 )	2025-02-20 17:31:52 +00:00
Tristan Partin	f7474d3f41	Remove forward compatibility hacks related to compute HTTP servers (#10797 ) These hacks were added to appease the forward compatibility tests and can be removed. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-20 17:31:42 +00:00
Alexander Lakhin	29e4ca351e	Pass asan/ubsan options to pg_dump/pg_restore started by fast_import (#10866 )	2025-02-18 15:41:20 +00:00
Heikki Linnakangas	811506aaa2	fast_import: Use rust s3 client for uploading (#10777 ) This replaces the use of the awscli utility. awscli binary is massive, it added about 200 MB to the docker image size, while the s3 client was already a dependency so using that is essentially free, as far as binary size is concerned. I implemented a simple upload function that tries to keep 10 uploads going in parallel. I believe that's the default behavior of the "aws s3 sync" command too.	2025-02-17 20:07:31 +00:00
Heikki Linnakangas	2dae0612dd	fast_import: Fix shared_buffers setting (#10837 ) In commit `9537829ccd` I made shared_buffers be derived from the system's available RAM. However, I failed to remove the old hard-coded shared_buffers=10GB settings, shared_buffers was set twice. Oopsie.	2025-02-16 00:01:19 +00:00
Gleb Novikov	3d7a32f619	fast import: allow restore to provided connection string (#10407 ) Within https://github.com/neondatabase/cloud/issues/22089 we decided that would be nice to start with import that runs dump-restore into a running compute (more on this [here](https://www.notion.so/neondatabase/2024-Jan-13-Migration-Assistant-Next-Steps-Proposal-Revised-17af189e004780228bdbcad13eeda93f?pvs=4#17af189e004780de816ccd9c13afd953)) We could do it by writing another tool or by extending existing `fast_import.rs`, we chose the latter. In this PR, I have added optional `restore_connection_string` as a cli arg and as a part of the json spec. If specified, the script will not run postgres and will just perform restore into provided connection string. TODO: - [x] fast_import.rs: - [x] cli arg in the fast_import.rs - [x] encoded connstring in json spec - [x] simplify `fn main` a little, take out too verbose stuff to some functions - [ ] ~~allow streaming from dump stdout to restore stdin~~ will do in a separate PR - [ ] ~~address https://github.com/neondatabase/neon/pull/10251#pullrequestreview-2551877845~~ will do in a separate PR - [x] tests: - [x] restore with cli arg in the fast_import.rs - [x] restore with encoded connstring in json spec in s3 - [ ] ~~test with custom dbname~~ will do in a separate PR - [ ] ~~test with s3 + pageserver + fast import binary~~ https://github.com/neondatabase/neon/pull/10487 - [ ] ~~https://github.com/neondatabase/neon/pull/10271#discussion_r1923715493~~ will do in a separate PR neondatabase/cloud#22775 --------- Co-authored-by: Eduard Dykman <bird.duskpoet@gmail.com>	2025-02-14 16:10:06 +00:00
Tristan Partin	0cf9157adc	Handle new compute_ctl_config parameter in compute spec requests (#10746 ) There is now a compute_ctl_config field in the response that currently only contains a JSON Web Key set. compute_ctl currently doesn't do anything with the keys, but will in the future. The reasoning for the new field is due to the nature of empty computes. When an empty compute is created, it does not have a tenant. A compute spec is the primary means of communicating the details of an attached tenant. In the empty compute state, there is no spec. Instead we wait for the control plane to pass us one via /configure. If we were to include the jwks field in the compute spec, we would have a partial compute spec, which doesn't logically make sense. Instead, we can have two means of passing settings to the compute: - spec: tenant specific config details - compute_ctl_config: compute specific settings For instance, the JSON Web Key set passed to the compute is independent of any tenant. It is a setting of the compute whether it is attached or not. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-13 18:04:36 +00:00
Heikki Linnakangas	9537829ccd	fast_import: Make CPU & memory size configurable (#10709 ) The old values assumed that you have at least about 18 GB of RAM available (shared_buffers=10GB and maintenance_work_mem=8GB). That's a lot when testing locally. Make it configurable, and make the default assumption much smaller: 256 MB. This is nice for local testing, but it's also in preparation for starting to use VMs to run these jobs. When launched in a VM, the control plane can set these env variables according to the max size of the VM. Also change the formula for how RAM is distributed: use 10% of RAM for shared_buffers, and 70% for maintenance_work_mem. That leaves a good amount for misc. other stuff and the OS. A very large shared_buffers setting won't typically help with bulk loading. It won't help with the network and I/O of processing all the tables, unless maybe if the whole database fits in shared buffers, but even then it's not much faster than using local disk. Bulk loading is all sequential I/O. It also won't help much with index creation, which is also sequential I/O. A large maintenance_work_mem can be quite useful, however, so that's where we put most of the RAM.	2025-02-12 11:43:23 +00:00
Tristan Partin	da9c101939	Implement a second HTTP server within compute_ctl (#10574 ) The compute_ctl HTTP server has the following purposes: - Allow management via the control plane - Provide an endpoint for scaping metrics - Provide APIs for compute internal clients - Neon Postgres extension for installing remote extensions - local_proxy for installing extensions and adding grants The first two purposes require the HTTP server to be available outside the compute. The Neon threat model is a bad actor within our internal network. We need to reduce the surface area of attack. By exposing unnecessary unauthenticated HTTP endpoints to the internal network, we increase the surface area of attack. For endpoints described in the third bullet point, we can just run an extra HTTP server, which is only bound to the loopback interface since all consumers of those endpoints are within the compute.	2025-02-11 18:02:22 +00:00
Heikki Linnakangas	98883e4b30	compute_ctl: Use a single tokio runtime (#10743 ) compute_ctl is mostly written in synchronous fashion, intended to run in a single thread. However various parts had become async, and they launched their own tokio runtimes to run the async code. For example, VM monitor ran in its own multi-threaded runtime, and apply_spec_sql() launched another multi-threaded runtime to run the per-database SQL commands in parallel. In addition to that, a few places used a current-thread runtime to run async code in the main thread, or launched a current-thread runtime in a different thread to run background tasks. Unify the runtimes so that there is only one tokio runtime. It's created very early at process startup, and the main thread "enters" the runtime, so that it's always available for tokio::spawn() and runtime.block_on() calls. All code that needs to run async code uses the same runtime. The main thread still mostly runs in a synchronous fashion. When it needs to run async code, it uses rt.block_on(). Spawn fewer additional threads, prefer to spawn tokio tasks instead. Convert some code that ran synchronously in background threads into async. I didn't go all the way, though, some background threads are still spawned.	2025-02-11 00:39:44 +00:00
Tristan Partin	3d143ad799	Unbrick the forward compatibility test failures (#10747 ) Since the merge of https://github.com/neondatabase/neon/pull/10523, forward compatibility tests have been broken everywhere. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-10 22:22:10 +00:00
Tristan Partin	946da3f7e2	Require --compute-id when running compute_ctl (#10523 ) The compute_id will be used when verifying claims sent by the control plane. Signed-off-by: Tristan Partin <tristan@neon.tech> Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-10 16:46:20 +00:00
Alexander Lakhin	977781e423	Enable sanitizers for postgres v17 (#10401 ) Add a build with sanitizers (asan, ubsan) to the CI pipeline and run tests on it. See https://github.com/neondatabase/neon/issues/6053 --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-02-06 12:53:43 +00:00
Tristan Partin	fcd195c2b6	Migrate compute_ctl arg parsing to clap derive (#10497 ) The primary benefit is that all the ad hoc get_matches() calls are no longer necessary. Now all it takes to get at the CLI arguments is referencing a struct member. It's also great the we can replace the ad hoc CLI struct we had with this more formal solution. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-31 19:04:26 +00:00
Gleb Novikov	19bf7b78a0	fast import: basic python test (#10271 ) We did not have any tests on fast_import binary yet. In this PR I have introduced: - `FastImport` class and tools for testing in python - basic test that runs fast import against vanilla postgres and checks that data is there Should be merged after https://github.com/neondatabase/neon/pull/10251	2025-01-21 16:50:44 +00:00
Gleb Novikov	55a68b28a2	fast import: restore to neondb (not postgres) database (#10251 ) ## Problem `postgres` is system database at neon, so we need to do `pg_restore` into `neondb` instead https://github.com/neondatabase/cloud/issues/22100 ## Summary of changes Changed fast_import a little bit: 1. After succesfull connection creating `neondb` in postgres instance 2. Changed restore connstring to use new db 3. Added optional `source_connection_string`, which allows to skip `s3_prefix` and just connect directly. 4. Added `-i` that stops process until sigterm ## TODO - [x] test image in cplane e2e - [ ] Change import job image back to latest after this merged (partial revert of https://github.com/neondatabase/cloud/pull/22338)	2025-01-15 20:51:09 +00:00
Heikki Linnakangas	70a3bf37a0	Stop building 'compute-tools' image (#10333 ) It's been unused from time immemorial. --------- Co-authored-by: Matthias van de Meent <matthias@neon.tech>	2025-01-11 13:09:55 +00:00
Folke Behrens	b6205af4a5	Update tracing/otel crates (#10311 ) Update the tracing(-x) and opentelemetry(-x) crates. Some breaking changes require updating our code: * Initialization is done via builders now https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-otlp/CHANGELOG.md#0270 * Errors from OTel SDK are logged via tracing crate as well. https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry/CHANGELOG.md#0270	2025-01-10 08:48:03 +00:00
Tristan Partin	49756a0d01	Implement compute_ctl management API in Axum (#10099 ) This is a refactor to create better abstractions related to our management server. It cleans up the code, and prepares everything for authorized communication to and from the control plane. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-09 20:08:26 +00:00
Alexander Bayandin	02f81b6469	Fix clippy warning on macOS (#10282 ) ## Problem On macOS: ``` error: unused variable: `disable_lfc_resizing` --> compute_tools/src/bin/compute_ctl.rs:431:9 \| 431 \| disable_lfc_resizing, \| ^^^^^^^^^^^^^^^^^^^^ help: try ignoring the field: `disable_lfc_resizing: _` \| = note: `-D unused-variables` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(unused_variables)]` ``` ## Summary of changes - Initialise `disable_lfc_resizing` only on Linux (because it's used on Linux only in further bloc)	2025-01-06 20:28:33 +00:00
Em Sharnoff	cd10c719f9	compute: Add spec support for disabling LFC resizing (#10132 ) ref neondatabase/cloud#21731 ## Problem When we manually override the LFC size for particular computes, autoscaling will typically undo that because vm-monitor will resize LFC itself. So, we'd like a way to make vm-monitor not set LFC size — this actually already exists, if we just don't give vm-monitor a postgres connection string. ## Summary of changes Add a new field to the compute spec, `disable_lfc_resizing`. When set to `true`, we pass in `None` for its postgres connection string. That matches the configuration tested in `neondatabase/autoscaling` CI.	2025-01-02 19:45:59 +00:00

1 2 3

128 Commits