rust/neon - neon - Gitea: Git with a cup of tea

mirror of https://github.com/neondatabase/neon.git synced 2026-01-03 19:42:55 +00:00

Author	SHA1	Message	Date
Alexander Lakhin	29e4ca351e	Pass asan/ubsan options to pg_dump/pg_restore started by fast_import (#10866)	2025-02-18 15:41:20 +00:00
Heikki Linnakangas	811506aaa2	fast_import: Use rust s3 client for uploading (#10777) This replaces the use of the awscli utility. The awscli binary is massive: it added about 200 MB to the docker image size, while the s3 client was already a dependency, so using it is essentially free as far as binary size is concerned. I implemented a simple upload function that tries to keep 10 uploads going in parallel. I believe that's the default behavior of the "aws s3 sync" command too.	2025-02-17 20:07:31 +00:00
Heikki Linnakangas	2dae0612dd	fast_import: Fix shared_buffers setting (#10837) In commit `9537829ccd` I made shared_buffers be derived from the system's available RAM. However, I failed to remove the old hard-coded shared_buffers=10GB setting, so shared_buffers was set twice. Oopsie.	2025-02-16 00:01:19 +00:00
Gleb Novikov	3d7a32f619	fast import: allow restore to provided connection string (#10407) Within https://github.com/neondatabase/cloud/issues/22089 we decided that it would be nice to start with an import that runs dump-restore into a running compute (more on this [here](https://www.notion.so/neondatabase/2024-Jan-13-Migration-Assistant-Next-Steps-Proposal-Revised-17af189e004780228bdbcad13eeda93f?pvs=4#17af189e004780de816ccd9c13afd953)). We could do it by writing another tool or by extending the existing `fast_import.rs`; we chose the latter. In this PR, I have added an optional `restore_connection_string` as a CLI arg and as part of the JSON spec. If specified, the script will not run postgres and will just perform the restore into the provided connection string. TODO: - [x] fast_import.rs: - [x] cli arg in the fast_import.rs - [x] encoded connstring in json spec - [x] simplify `fn main` a little, move overly verbose code into separate functions - [ ] ~~allow streaming from dump stdout to restore stdin~~ will do in a separate PR - [ ] ~~address https://github.com/neondatabase/neon/pull/10251#pullrequestreview-2551877845~~ will do in a separate PR - [x] tests: - [x] restore with cli arg in the fast_import.rs - [x] restore with encoded connstring in json spec in s3 - [ ] ~~test with custom dbname~~ will do in a separate PR - [ ] ~~test with s3 + pageserver + fast import binary~~ https://github.com/neondatabase/neon/pull/10487 - [ ] ~~https://github.com/neondatabase/neon/pull/10271#discussion_r1923715493~~ will do in a separate PR neondatabase/cloud#22775 --------- Co-authored-by: Eduard Dykman <bird.duskpoet@gmail.com>	2025-02-14 16:10:06 +00:00
Tristan Partin	0cf9157adc	Handle new compute_ctl_config parameter in compute spec requests (#10746) There is now a compute_ctl_config field in the response that currently only contains a JSON Web Key set. compute_ctl currently doesn't do anything with the keys, but will in the future. The new field is needed because of the nature of empty computes. When an empty compute is created, it does not have a tenant. A compute spec is the primary means of communicating the details of an attached tenant. In the empty compute state, there is no spec. Instead, we wait for the control plane to pass us one via /configure. If we were to include the jwks field in the compute spec, we would have a partial compute spec, which doesn't logically make sense. Instead, we can have two means of passing settings to the compute: - spec: tenant-specific config details - compute_ctl_config: compute-specific settings For instance, the JSON Web Key set passed to the compute is independent of any tenant. It is a setting of the compute whether it is attached or not. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-13 18:04:36 +00:00
Alexey Kondratov	49775d28e4	fix(compute): Respect skip_pg_catalog_updates in reconfigure() (#10696) ## Problem We respect `skip_pg_catalog_updates` at the initial start, but ignore it at the follow-up `/configure`. Yet, it's used for storage->cplane->compute notify requests after migrations, shard split, etc. So every time we get them, applying the new config takes much longer than it should because we go through Postgres catalog checks. Cplane sets this flag when it serves the notify attach call `9068c7d743`. Related to `inc-403`, for example ## Summary of changes Look at `skip_pg_catalog_updates` in `compute.reconfigure()`	2025-02-12 17:54:21 +00:00
Heikki Linnakangas	9537829ccd	fast_import: Make CPU & memory size configurable (#10709) The old values assumed that you have at least about 18 GB of RAM available (shared_buffers=10GB and maintenance_work_mem=8GB). That's a lot when testing locally. Make it configurable, and make the default assumption much smaller: 256 MB. This is nice for local testing, but it's also in preparation for starting to use VMs to run these jobs. When launched in a VM, the control plane can set these env variables according to the max size of the VM. Also change the formula for how RAM is distributed: use 10% of RAM for shared_buffers, and 70% for maintenance_work_mem. That leaves a good amount for misc. other stuff and the OS. A very large shared_buffers setting won't typically help with bulk loading. It won't help with the network and I/O of processing all the tables, unless perhaps the whole database fits in shared buffers, but even then it's not much faster than using local disk. Bulk loading is all sequential I/O. It also won't help much with index creation, which is also sequential I/O. A large maintenance_work_mem can be quite useful, however, so that's where we put most of the RAM.	2025-02-12 11:43:23 +00:00
Heikki Linnakangas	635b67508b	Split utils::http to separate crate (#10753) Avoids compiling the crate and its dependencies into binaries that don't need them. Shrinks the compute_ctl binary from about 31MB to 28MB in the release-line-debug-size-lto profile.	2025-02-11 22:06:53 +00:00
Tristan Partin	da9c101939	Implement a second HTTP server within compute_ctl (#10574) The compute_ctl HTTP server has the following purposes: - Allow management via the control plane - Provide an endpoint for scraping metrics - Provide APIs for compute internal clients - Neon Postgres extension for installing remote extensions - local_proxy for installing extensions and adding grants The first two purposes require the HTTP server to be available outside the compute. The Neon threat model assumes a bad actor within our internal network, so we need to reduce the attack surface. By exposing unnecessary unauthenticated HTTP endpoints to the internal network, we increase the attack surface. For endpoints described in the third bullet point, we can just run an extra HTTP server, which is only bound to the loopback interface since all consumers of those endpoints are within the compute.	2025-02-11 18:02:22 +00:00
Andrew Rudenko	4ab18444ec	compute_ctl: database_schema should keep process::Child as part of returned value (#10273) ## Problem The /database_schema endpoint returns incomplete output from `pg_dump` ## Summary of changes The Tokio process was not used properly. The returned stream does not include `process::Child`, and the process is scheduled to be killed immediately after the `get_database_schema` call when `cmd` goes out of scope. The solution in this PR is to return a special Stream implementation that retains `process::Child`.	2025-02-11 07:02:13 +00:00
Heikki Linnakangas	98883e4b30	compute_ctl: Use a single tokio runtime (#10743) compute_ctl is mostly written in synchronous fashion, intended to run in a single thread. However, various parts had become async, and they launched their own tokio runtimes to run the async code. For example, VM monitor ran in its own multi-threaded runtime, and apply_spec_sql() launched another multi-threaded runtime to run the per-database SQL commands in parallel. In addition to that, a few places used a current-thread runtime to run async code in the main thread, or launched a current-thread runtime in a different thread to run background tasks. Unify the runtimes so that there is only one tokio runtime. It's created very early at process startup, and the main thread "enters" the runtime, so that it's always available for tokio::spawn() and runtime.block_on() calls. All code that needs to run async code uses the same runtime. The main thread still mostly runs in a synchronous fashion. When it needs to run async code, it uses rt.block_on(). Spawn fewer additional threads; prefer spawning tokio tasks instead. Convert some code that ran synchronously in background threads into async. I didn't go all the way, though; some background threads are still spawned.	2025-02-11 00:39:44 +00:00
Tristan Partin	3d143ad799	Unbrick the forward compatibility test failures (#10747) Since the merge of https://github.com/neondatabase/neon/pull/10523, forward compatibility tests have been broken everywhere. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-10 22:22:10 +00:00
Tristan Partin	946da3f7e2	Require --compute-id when running compute_ctl (#10523) The compute_id will be used when verifying claims sent by the control plane. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-10 16:46:20 +00:00
Alexander Lakhin	977781e423	Enable sanitizers for postgres v17 (#10401) Add a build with sanitizers (asan, ubsan) to the CI pipeline and run tests on it. See https://github.com/neondatabase/neon/issues/6053 --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-02-06 12:53:43 +00:00
Tristan Partin	fcd195c2b6	Migrate compute_ctl arg parsing to clap derive (#10497) The primary benefit is that all the ad hoc get_matches() calls are no longer necessary. Now all it takes to get at the CLI arguments is referencing a struct member. It's also great that we can replace the ad hoc CLI struct we had with this more formal solution. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-31 19:04:26 +00:00
Alexey Kondratov	be51b10da7	chore(compute): Print some compute_ctl errors in debug mode (#10586) ## Problem In some cases, we were returning a very shallow error like `error sending request for url (XXX)`, which made it very hard to figure out the actual error. ## Summary of changes Use `{:?}` in a few places, and remove it from places where we were printing a string anyway.	2025-01-30 14:31:49 +00:00
Tristan Partin	707a926057	Remove unused compute_ctl HTTP routes (#10544) These are not used anywhere within the platform, so let's remove dead code. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-29 19:22:01 +00:00
Alexey Kondratov	34322b2424	chore(compute): Simplify new compute_ctl metrics and fix flaky test (#10560) ## Problem 1. `d04d924` added separate metrics for total requests and for failures, but that doesn't make much sense. We could just have a unified counter with `http_status`. 2. `test_compute_migrations_retry` had a race, i.e., it was waiting for the last successful migration, not an actual failure. This was revealed after adding an assert on the failure metric in `d04d924`. ## Summary of changes 1. Switch to unified counters for `compute_ctl` requests. 2. Add a waiting loop into `test_compute_migrations_retry` to eliminate the race. Part of neondatabase/cloud#17590	2025-01-29 18:09:25 +00:00
Alexey Kondratov	d04d924649	feat(compute): Add some basic compute_ctl metrics (#10504) ## Problem There are several parts of `compute_ctl` with very low visibility into errors: 1. DB migrations that run async in the background after compute start. 2. Requests made to control plane (currently only `GetSpec`). 3. Requests made to the remote extensions server. ## Summary of changes Add new counters to quickly evaluate the number of errors across the fleet. Part of neondatabase/cloud#17590	2025-01-28 19:24:07 +00:00
Tristan Partin	15fecb8474	Update axum to 0.8.1 (#10332) Only a few things needed updating: - async_trait was removed - Message::Text takes a Utf8Bytes object instead of a String Signed-off-by: Tristan Partin <tristan@neon.tech> Co-authored-by: Conrad Ludgate <connor@neon.tech>	2025-01-28 15:32:59 +00:00
Anastasia Lubennikova	8e8df1b453	Disable logical replication subscribers (#10249) Drop logical replication subscribers before compute starts on a non-main branch. Add a new compute_ctl spec flag, drop_subscriptions_before_start. If it is set, drop all the subscriptions from the compute node before it starts. To avoid a race on compute start, use the new GUC neon.disable_logical_replication_subscribers to temporarily disable logical replication workers until we drop the subscriptions. Ensure that we drop subscriptions exactly once when an endpoint starts on a new branch. This is essential, because otherwise we may drop not only inherited but also newly created subscriptions. We cannot rely only on the spec.drop_subscriptions_before_start flag, because if for some reason the compute restarts inside the VM, it will start again with the same spec and flag value. To handle this, we record that the operation was performed in the neon.drop_subscriptions_done table. If the table does not exist, we assume that the operation was never performed, so we must do it. If the table exists, we check whether the operation was performed on the current timeline. fixes: https://github.com/neondatabase/neon/issues/8790	2025-01-23 11:02:15 +00:00
Gleb Novikov	19bf7b78a0	fast import: basic python test (#10271) We did not have any tests for the fast_import binary yet. In this PR I have introduced: - `FastImport` class and tools for testing in python - basic test that runs fast import against vanilla postgres and checks that data is there Should be merged after https://github.com/neondatabase/neon/pull/10251	2025-01-21 16:50:44 +00:00
Tristan Partin	871e8b325f	Use the request ID given by the control plane in compute_ctl (#10418) Instead of generating our own request ID, we can just use the one provided by the control plane. In the event that we get a request from a client which doesn't set X-Request-ID, we just generate one, which is useful for tracing purposes. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-16 20:46:53 +00:00
Tristan Partin	58f6af6c9a	Clean up compute_ctl extension server code (#10417)	2025-01-16 08:35:36 +00:00
Gleb Novikov	55a68b28a2	fast import: restore to neondb (not postgres) database (#10251) ## Problem `postgres` is a system database at neon, so we need to do `pg_restore` into `neondb` instead https://github.com/neondatabase/cloud/issues/22100 ## Summary of changes Changed fast_import a little bit: 1. After a successful connection, create `neondb` in the postgres instance 2. Changed restore connstring to use the new db 3. Added optional `source_connection_string`, which allows skipping `s3_prefix` and just connecting directly. 4. Added `-i`, which keeps the process waiting until SIGTERM ## TODO - [x] test image in cplane e2e - [ ] Change import job image back to latest after this is merged (partial revert of https://github.com/neondatabase/cloud/pull/22338)	2025-01-15 20:51:09 +00:00
Heikki Linnakangas	70a3bf37a0	Stop building 'compute-tools' image (#10333) It's been unused from time immemorial. --------- Co-authored-by: Matthias van de Meent <matthias@neon.tech>	2025-01-11 13:09:55 +00:00
Folke Behrens	b6205af4a5	Update tracing/otel crates (#10311) Update the tracing(-x) and opentelemetry(-x) crates. Some breaking changes require updating our code: * Initialization is done via builders now https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-otlp/CHANGELOG.md#0270 * Errors from the OTel SDK are logged via the tracing crate as well. https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry/CHANGELOG.md#0270	2025-01-10 08:48:03 +00:00
Tristan Partin	49756a0d01	Implement compute_ctl management API in Axum (#10099) This is a refactor to create better abstractions related to our management server. It cleans up the code, and prepares everything for authorized communication to and from the control plane. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-09 20:08:26 +00:00
Alexey Kondratov	f37eeb56ad	fix(compute_ctl): Resolve issues with dropping roles having dangling permissions (#10299) ## Problem In Postgres, one cannot drop a role if it has any dependent objects in the DB. In `compute_ctl`, we automatically reassign all dependent objects in every DB to the corresponding DB owner. Yet, it seems that it doesn't help with some implicit permissions. The issue is reproduced by installing the `postgis` extension because it creates some views and tables in the public schema. ## Summary of changes Added a repro test without using `postgis`: i) create a role via `compute_ctl` (with `neon_superuser` grant); ii) create a test role, a table in schema public, and grant permissions via the role in `neon_superuser`. To fix the issue, I added new `compute_ctl` code that removes such dangling permissions before dropping the role. It's done in the least invasive way, i.e., it only touches schema public, because i) that's the problem we had with PostGIS; ii) it creates a smaller chance of messing anything up and getting a stuck operation again, just for a different reason. Ideally, any API-based catalog operation should fail gracefully and provide an actionable error and status code to the control plane, allowing the latter to unwind the operation and propagate an error message and hint to the user. In this sense, it's aligned with another feature request https://github.com/neondatabase/cloud/issues/21611 Resolve neondatabase/cloud#13582	2025-01-09 16:39:53 +00:00
Tristan Partin	5b2751397d	Refactor MigrationRunner::run_migrations() to call a helper (#10232) This will make it easier to add per-db migrations, such as that for CVE-2024-4317. Link: https://www.postgresql.org/support/security/CVE-2024-4317/ Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-09 07:05:07 +00:00
Anastasia Lubennikova	0ad0db6ff8	compute: dropdb DROP SUBSCRIPTION fix (#10066) ## Problem A project gets stuck if a database with subscriptions is deleted via API / UI. https://github.com/neondatabase/cloud/issues/18646 ## Summary of changes Before dropping the database, drop all the subscriptions in it. Do not drop the slot on the publisher, because we have no guarantee that the slot still exists or that the publisher is reachable. Add a `DropSubscriptionsForDeletedDatabases` phase to run these operations in all databases we're about to delete. Ignore the error if the database does not exist.	2025-01-08 18:55:04 +00:00
Alexander Bayandin	02f81b6469	Fix clippy warning on macOS (#10282) ## Problem On macOS: ``` error: unused variable: `disable_lfc_resizing` --> compute_tools/src/bin/compute_ctl.rs:431:9 | 431 | disable_lfc_resizing, | ^^^^^^^^^^^^^^^^^^^^ help: try ignoring the field: `disable_lfc_resizing: _` | = note: `-D unused-variables` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(unused_variables)]` ``` ## Summary of changes - Initialise `disable_lfc_resizing` only on Linux (because it's only used on Linux in the following block)	2025-01-06 20:28:33 +00:00
Tristan Partin	eefad27538	Inline various migration queries (#10231) There was no value in saving them off to temporary variables. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-02 22:12:56 +00:00
Em Sharnoff	cd10c719f9	compute: Add spec support for disabling LFC resizing (#10132) ref neondatabase/cloud#21731 ## Problem When we manually override the LFC size for particular computes, autoscaling will typically undo that because vm-monitor will resize LFC itself. So, we'd like a way to make vm-monitor not set LFC size; this actually already exists if we just don't give vm-monitor a postgres connection string. ## Summary of changes Add a new field to the compute spec, `disable_lfc_resizing`. When set to `true`, we pass in `None` for its postgres connection string. That matches the configuration tested in `neondatabase/autoscaling` CI.	2025-01-02 19:45:59 +00:00
Tristan Partin	363ea97f69	Add more substantial tests for compute migrations (#9811) The previous tests really didn't do much. This set should be quite a bit more comprehensive. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-02 18:37:50 +00:00
Christian Schwarz	a1b0558493	fast import: importer: use aws s3 cli (#10162) ## Problem s5cmd doesn't pick up the pod service account ``` 2024/12/16 16:26:01 Ignoring, HTTP credential provider invalid endpoint host, "169.254.170.23", only loopback hosts are allowed. <nil> ERROR "ls s3://neon-dev-bulk-import-us-east-2/import-pgdata/fast-import/v1/br-wandering-hall-w2xobawv": NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors ``` ## Summary of changes Switch to the official CLI. ## Testing Tested the pre-merge image in staging, using the `job_image` override in project settings. https://neondb.slack.com/archives/C033RQ5SPDH/p1734554944391949?thread_ts=1734368383.258759&cid=C033RQ5SPDH ## Future Work Switch back to s5cmd once https://github.com/peak/s5cmd/pull/769 gets merged. ## Refs - fixes https://github.com/neondatabase/cloud/issues/21876 --------- Co-authored-by: Gleb Novikov <NanoBjorn@users.noreply.github.com>	2024-12-19 10:04:17 +00:00
Anastasia Lubennikova	ef233e91ef	Update compute_installed_extensions metric: (#9891) add owned_by_superuser field to filter out system extensions. While on it, also correct related code: - fix the metric setting: use set() instead of inc() in a loop. inc() is not idempotent and can lead to incorrect results if the function is called multiple times. Currently it is only called at compute start, but this will change soon. - fix the return type of the installed_extensions endpoint to match the metric. Currently it is only used in the test.	2024-12-11 16:43:26 +00:00
Mikhail Kot	c79c1dd8e9	compute_ctl: don't panic if control plane can't be reached (#10078) ## Problem If the control plane cannot be reached for some reason, compute_ctl panics ## Summary of changes The panic is removed in favour of returning an error. Code is reformatted a bit for flatter control flow Resolves: #5391	2024-12-11 15:03:11 +00:00
Alexey Kondratov	13e8105740	feat(compute): Allow specifying the reconfiguration concurrency (#10006) ## Problem We need higher concurrency during reconfiguration when there are many DBs, but the instance is already running and in use by the client. We can easily exceed the `max_connections` limit, and the current code won't handle that. ## Summary of changes Default to 1, but also allow the control plane to override this value for specific projects. It's also recommended to bump `superuser_reserved_connections` += `reconfigure_concurrency` for such projects to ensure that we always have enough spare connections for the reconfiguration process to succeed. Quick workaround for neondatabase/cloud#17846	2024-12-05 17:57:25 +00:00
Tristan Partin	d8ebd33fe6	Stop changing the value of neon.extension_server_port at runtime (#9972) On reconfigure, we no longer passed a port for the extension server, which caused us to not write out the neon.extension_server_port line. Thus, Postgres thought we were setting the port to the default value of 0. PGC_POSTMASTER GUCs cannot be set at runtime, which causes the following log messages: > LOG: parameter "neon.extension_server_port" cannot be changed without restarting the server > LOG: configuration file "/var/db/postgres/compute/pgdata/postgresql.conf" contains errors; unaffected changes were applied Fixes: https://github.com/neondatabase/neon/issues/9945 Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-12-02 18:06:19 +00:00
Gleb Novikov	c848f25ec2	Fixed fast_import pgbin in calling get_pg_version (#9933) I was working on https://github.com/neondatabase/cloud/pull/20795 and discovered that fast_import was not working correctly.	2024-11-29 17:58:36 +00:00
Alexey Kondratov	538e2312a6	feat(compute_ctl): Always set application_name (#9934) ## Problem It was not always possible to judge what exactly some `cloud_admin` connections were doing because we didn't consistently set `application_name` everywhere. ## Summary of changes Unify the way we connect to Postgres: 1. Switch to building configs everywhere 2. Always set `application_name` and make naming consistent Follow-up for #9919 Part of neondatabase/cloud#20948	2024-11-29 13:55:56 +00:00
Alexey Kondratov	42fb3c4d30	fix(compute_ctl): Allow usage of DB names with whitespaces (#9919) ## Problem We used `set_path()` to replace the database name in the connection string. It automatically does url-safe encoding if the path is not already encoded, but it does so per the URL standard, which assumes that tabs can be safely removed from the path without changing the meaning of the URL. See, e.g., https://url.spec.whatwg.org/#concept-basic-url-parser. It also breaks for DBs with properly %-encoded names, like with `%20`, as they are kept intact, but actually should be escaped. Yet, this is not true for Postgres, where it's completely valid to have trailing tabs in the database name. I think this is the PR that caused this regression https://github.com/neondatabase/neon/pull/9717, as it switched from `postgres::config::Config` back to `set_path()`. This was fixed a while ago already [1], btw; I just didn't add a test to catch this regression back then :( ## Summary of changes This commit changes the code back to use `postgres/tokio_postgres::Config` everywhere. While at it, I also made some related changes, as I had to touch this code: 1. Bump some logging from `debug` to `info` in the spec apply path. We do not use `debug` in prod, and it was tricky to understand what was going on with this bug in prod. 2. Refactor the configuration concurrency calculation code so it is reusable. Yet, still keep `1` in the case of reconfiguration. The database can be actively used at this moment, so we cannot guarantee that there will be enough spare connection slots, and the underlying code won't handle connection errors properly. 3. Simplify the installed extensions code. It was spawning a blocking task inside an async function, which doesn't make much sense. Instead, just have a main sync function and call it with `spawn_blocking` in the API code -- the only place we need it to be async. 4. Add a regression Python test to cover this and related problems in the future. 
Also, add more extensive testing of schema dump and the DBs and roles listing API. [1]: `4d1e48f3b9` [2]: https://www.postgresql.org/message-id/flat/20151023003445.931.91267%40wrigleys.postgresql.org Resolves neondatabase/cloud#20869	2024-11-28 21:38:30 +00:00
Arpad Müller	a74ab9338d	fast_import: remove hardcoding of pg_version (#9878) Before, we hardcoded the pg_version to 140000, while the code expected version numbers like 14. Now we use an enum, and code from `extension_server.rs` to auto-detect the correct version. The enum helps when we add support for a version: enums ensure that compilation fails if one forgets to add the version to one of the `match` locations. cc https://github.com/neondatabase/neon/pull/9218	2024-11-25 20:23:42 +00:00
Christian Schwarz	450be26bbb	fast imports: initial Importer and Storage changes (#9218) Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Co-authored-by: Stas Kelvic <stas@neon.tech> # Context This PR contains PoC-level changes for a product feature that allows onboarding large databases into Neon without going through the regular data path. # Changes This internal RFC provides all the context * https://github.com/neondatabase/cloud/pull/19799 In the language of the RFC, this PR covers * the Importer code (`fast_import`) * all the Pageserver changes (mgmt API changes, flow implementation, etc) * a basic test for the Pageserver changes # Reviewing As acknowledged in the RFC, the code added in this PR is not ready for general availability. Also, the architecture is not to be discussed in this PR, but in the RFC and associated Slack channel instead. Reviewers of this PR should take that into consideration. The quality bar to apply during review depends on what area of the code is being reviewed: * Importer code (`fast_import`): practically anything goes * Core flow (`flow.rs`): * Malicious input data must be expected and the existing threat models apply. * The code must not be safe to execute on dedicated Pageserver instances: * This means in particular that tenants on other Pageserver instances must not be affected negatively wrt data confidentiality, integrity or availability. * Other code: the usual quality bar * Pay special attention to correct use of gate guards, timeline cancellation in all places during shutdown & migration, etc. * Consider the broader system impact; if you find potentially problematic interactions with Storage features that were not covered in the RFC, bring that up during the review. I recommend submitting three separate reviews, for the three high-level areas with different quality bars. 
# References (Internal-only) * refs https://github.com/neondatabase/cloud/issues/17507 * refs https://github.com/neondatabase/company_projects/issues/293 * refs https://github.com/neondatabase/company_projects/issues/309 * refs https://github.com/neondatabase/cloud/issues/20646 --------- Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com> Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Co-authored-by: John Spray <john@neon.tech>	2024-11-22 22:47:06 +00:00
Anastasia Lubennikova	3245f7b88d	Rename 'installed_extensions' metric to 'compute_installed_extensions' (#9759) to keep it consistent with existing compute metrics. A flux-fleet change is not needed, because it doesn't filter compute metrics by metric name.	2024-11-22 19:27:04 +00:00
Tristan Partin	c10b7f7de9	Write a newline after adding dynamic_shared_memory_type to PG conf (#9843) Without adding a newline, we can end up with a conf line that looks like the following: dynamic_shared_memory_type = mmap# Managed by compute_ctl: begin This leads to Postgres logging: LOG: configuration file "/var/db/postgres/compute/pgdata/postgresql.conf" contains errors; unaffected changes were applied Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-22 13:37:06 +00:00
Tristan Partin	37962e729e	Fix panic in compute_ctl metrics collection (#9831) Calling unwrap on the encoder is a little overzealous. One of the errors that can be returned by the encode function in particular is the non-existence of metrics for a metric family, so we should filter such instances out beforehand. I believe that this panic was caused by a race condition between the prometheus collector and the compute collecting the installed extensions metric for the first time. The HTTP server is spawned on a separate thread before we even start bringing up Postgres. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-21 20:19:02 +00:00
Arpad Müller	59c2c3f8ad	compute_ctl: print OpenTelemetry errors via tracing, not stdout (#9830) Before, `OpenTelemetry` errors were printed to stdout/stderr directly, causing one of the few log lines without a timestamp, like: ``` OpenTelemetry trace error occurred. error sending request for url (http://localhost:4318/v1/traces) ``` Now, we print: ``` 2024-11-21T02:24:20.511160Z INFO OpenTelemetry error: error sending request for url (http://localhost:4318/v1/traces) ``` I found this while investigating #9731.	2024-11-21 04:46:01 +00:00
Matthias van de Meent	ea1858e3b6	compute_ctl: Streamline and Pipeline startup SQL (#9717) Before, compute_ctl didn't have a good registry for what command would run when, depending exclusively on sync code to apply changes. When users have many databases/roles to manage, this step can take a substantial amount of time, breaking assumptions about low (re)start times in other systems. This commit reduces the time compute_ctl takes to restart when changes must be applied, by making all commands more or less blind writes, and applying these commands in an asynchronous context, only waiting for completion once we know the commands have all been sent. Additionally, this reduces time spent by batching per-database operations, where previously we would create a new SQL connection for every user-database operation we planned to execute.	2024-11-20 02:14:58 +01:00
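
Commit 811506aaa2 keeps about 10 S3 uploads in flight at once. The bounded-parallelism idea can be sketched with the standard library alone. This is a minimal sketch, not the fast_import code: it assumes OS threads in place of the async S3 client, and `upload_all` plus the `upload` closure are hypothetical stand-ins for the real upload call.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical helper: run `upload` on every path, keeping at most
/// `max_in_flight` uploads running at once (the commit uses 10).
/// Returns the peak number of concurrent uploads that was observed.
fn upload_all<F>(paths: Vec<String>, max_in_flight: usize, upload: F) -> usize
where
    F: Fn(&str) + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(VecDeque::from(paths)));
    let upload = Arc::new(upload);
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    // One worker per upload slot; each pulls the next path until the queue is empty.
    let workers: Vec<_> = (0..max_in_flight)
        .map(|_| {
            let queue = Arc::clone(&queue);
            let upload = Arc::clone(&upload);
            let in_flight = Arc::clone(&in_flight);
            let peak = Arc::clone(&peak);
            thread::spawn(move || loop {
                // The lock guard is a temporary, so it is released before the upload starts.
                let next = queue.lock().unwrap().pop_front();
                let Some(path) = next else { break };
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                (*upload)(path.as_str());
                in_flight.fetch_sub(1, Ordering::SeqCst);
            })
        })
        .collect();

    for worker in workers {
        worker.join().expect("upload worker panicked");
    }
    peak.load(Ordering::SeqCst)
}
```

A fixed pool of workers draining a shared queue caps concurrency without tracking individual tasks; an async version would more likely use a semaphore or a bounded stream combinator on the single tokio runtime.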
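
The memory split from commit 9537829ccd (10% of RAM for shared_buffers, 70% for maintenance_work_mem, 256 MB default budget) works out, for the default, to 25 MB of shared_buffers and 179 MB of maintenance_work_mem, leaving roughly 52 MB for everything else. The sketch below assumes the budget is given in whole megabytes and rounds down; the function names and the `name=valueMB` option strings are hypothetical. It emits each setting exactly once, which is the property commit 2dae0612dd restored.

```rust
/// Hypothetical helper: split a memory budget (in MB) using the ratios from
/// the commit message. Integer division rounds down; the remaining ~20%
/// is left for other Postgres memory and the OS.
fn memory_settings(total_mb: u64) -> (u64, u64) {
    let shared_buffers_mb = total_mb * 10 / 100;
    let maintenance_work_mem_mb = total_mb * 70 / 100;
    (shared_buffers_mb, maintenance_work_mem_mb)
}

/// Hypothetical rendering step: derive every setting from the budget and
/// emit each one once, so no leftover hard-coded value can override it.
fn postgres_options(total_mb: u64) -> Vec<String> {
    let (shared_buffers_mb, maintenance_work_mem_mb) = memory_settings(total_mb);
    vec![
        format!("shared_buffers={shared_buffers_mb}MB"),
        format!("maintenance_work_mem={maintenance_work_mem_mb}MB"),
    ]
}
```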
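
Commit da9c101939 splits compute_ctl's endpoints across two HTTP servers and binds the one for compute-internal clients to the loopback interface only. A minimal std-only sketch of that binding decision, assuming plain `TcpListener`s instead of the real Axum servers; `bind_listeners` and its port arguments are hypothetical.

```rust
use std::io;
use std::net::{Ipv4Addr, SocketAddr, TcpListener};

/// Hypothetical helper returning (external, internal) listeners.
/// The external listener (control plane, metrics scraping) accepts
/// connections on all interfaces; the internal one (compute-local clients
/// such as the Postgres extension and local_proxy) is bound to 127.0.0.1,
/// so it is not reachable from the internal network at all.
fn bind_listeners(external_port: u16, internal_port: u16) -> io::Result<(TcpListener, TcpListener)> {
    let external = TcpListener::bind(SocketAddr::from((Ipv4Addr::UNSPECIFIED, external_port)))?;
    let internal = TcpListener::bind(SocketAddr::from((Ipv4Addr::LOCALHOST, internal_port)))?;
    Ok((external, internal))
}
```

Binding to loopback removes the internal endpoints from the network-facing attack surface without requiring authentication on them.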
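
The "drop subscriptions exactly once" rule from commit 8e8df1b453 reduces to a small decision function. This sketch models the neon.drop_subscriptions_done table as an in-memory list of timeline IDs; that is an assumption, since the commit only says the table records enough to tell whether the current timeline already performed the drop. `should_drop_subscriptions` is a hypothetical name.

```rust
/// Hypothetical decision helper. `done_table` models
/// neon.drop_subscriptions_done: `None` if the table does not exist,
/// otherwise the timeline IDs on which the drop was already performed.
fn should_drop_subscriptions(
    drop_subscriptions_before_start: bool,
    done_table: Option<&[String]>,
    current_timeline: &str,
) -> bool {
    if !drop_subscriptions_before_start {
        return false;
    }
    match done_table {
        // No table yet: the drop was never performed, so do it now.
        None => true,
        // Table exists: skip if this timeline already did the drop, e.g. after
        // the compute restarted inside the VM with the same spec and flag.
        Some(timelines) => !timelines.iter().any(|t| t == current_timeline),
    }
}
```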
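
Commits cd10c719f9 and 13e8105740 both add spec fields that change how compute_ctl behaves. A hypothetical sketch of the two decisions, assuming both fields are optional in the spec: `disable_lfc_resizing` withholds the Postgres connection string from vm-monitor, and `reconfigure_concurrency` defaults to 1, with the suggested matching bump of `superuser_reserved_connections`. The struct and helper names are made up for illustration.

```rust
/// Hypothetical subset of the compute spec. Field names come from the
/// commit messages; the struct, the Option types and the helpers are assumptions.
#[derive(Debug, Default)]
struct SpecKnobs {
    disable_lfc_resizing: Option<bool>,
    reconfigure_concurrency: Option<usize>,
}

/// vm-monitor only resizes the LFC when it has a Postgres connection string,
/// so disabling resizing means handing it `None`.
fn vm_monitor_connstr(spec: &SpecKnobs, connstr: &str) -> Option<String> {
    if spec.disable_lfc_resizing.unwrap_or(false) {
        None
    } else {
        Some(connstr.to_string())
    }
}

/// Reconfiguration runs against a live instance, so default to a single
/// connection unless the control plane overrides it.
fn reconfigure_concurrency(spec: &SpecKnobs) -> usize {
    spec.reconfigure_concurrency.unwrap_or(1)
}

/// Suggested headroom: superuser_reserved_connections += reconfigure_concurrency.
fn suggested_superuser_reserved(current: usize, spec: &SpecKnobs) -> usize {
    current + reconfigure_concurrency(spec)
}
```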
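
Commit ef233e91ef switches the installed-extensions metric from inc() in a loop to set(). The difference is easy to see with a toy gauge. This is not the prometheus crate API, and modelling the value as the number of databases where an extension is installed is an assumption.

```rust
use std::collections::HashMap;

/// Label set: (extension name, version, owned_by_superuser).
type Labels = (String, String, bool);

/// Toy labelled gauge standing in for a Prometheus metric.
#[derive(Debug, Default)]
struct Gauge {
    values: HashMap<Labels, i64>,
}

impl Gauge {
    fn inc(&mut self, labels: &Labels) {
        *self.values.entry(labels.clone()).or_insert(0) += 1;
    }
    fn set(&mut self, labels: &Labels, value: i64) {
        self.values.insert(labels.clone(), value);
    }
}

/// Old approach: inc() once per database, so every repeated call adds again.
fn collect_with_inc(gauge: &mut Gauge, installed: &[Labels]) {
    for labels in installed {
        gauge.inc(labels);
    }
}

/// Fixed approach: count first, then set(), which is idempotent across calls.
fn collect_with_set(gauge: &mut Gauge, installed: &[Labels]) {
    let mut counts: HashMap<Labels, i64> = HashMap::new();
    for labels in installed {
        *counts.entry(labels.clone()).or_insert(0) += 1;
    }
    for (labels, n) in &counts {
        gauge.set(labels, *n);
    }
}
```

Calling the inc()-based collector twice doubles every value, while the set()-based one reports the same numbers no matter how often it runs.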
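
Commit a74ab9338d replaces the hard-coded `140000` with an enum so that the compiler flags any `match` that forgets a version. A hypothetical sketch of that pattern; the variant list, `parse`, and the `install_dir` mapping are assumptions, not the fast_import code.

```rust
/// Hypothetical enum of supported Postgres major versions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PgMajorVersion {
    V14,
    V15,
    V16,
    V17,
}

impl PgMajorVersion {
    /// Parse a major version such as "14". A value like "140000" is rejected
    /// instead of being passed through silently.
    fn parse(s: &str) -> Option<Self> {
        match s.trim().parse::<u32>().ok()? {
            14 => Some(Self::V14),
            15 => Some(Self::V15),
            16 => Some(Self::V16),
            17 => Some(Self::V17),
            _ => None,
        }
    }

    /// Exhaustive match with no wildcard arm: adding a new variant makes this
    /// function fail to compile until the new version is handled here too.
    fn install_dir(self) -> &'static str {
        match self {
            Self::V14 => "v14",
            Self::V15 => "v15",
            Self::V16 => "v16",
            Self::V17 => "v17",
        }
    }
}
```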
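
The fix in commit c10b7f7de9 comes down to always terminating the appended line. A minimal sketch, assuming the configuration is assembled in a `String`; `append_setting` is a hypothetical helper, not compute_ctl's function.

```rust
use std::fmt::Write;

/// Hypothetical helper that appends one `name = value` line to a
/// postgresql.conf buffer. `writeln!` supplies the trailing newline, so a
/// following "# Managed by compute_ctl: begin" marker starts on its own line.
fn append_setting(conf: &mut String, name: &str, value: &str) {
    writeln!(conf, "{name} = {value}").expect("writing to a String cannot fail");
}
```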

319 Commits