
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-03 20:20:38 +00:00

Author	SHA1	Message	Date
Nikita Kalyanov	cbb599f353	Add /terminate API (#6745). This is to speed up suspends, see https://github.com/neondatabase/cloud/issues/10284	2024-02-20 19:42:36 +02:00
Alexey Kondratov	c1295bfb3a	[compute_ctl] Use correct HTTP code in the /configure errors (#6017) It was using `PRECONDITION_FAILED` for errors during `ComputeSpec` to `ParsedSpec` conversion, but this disobeys the OpenAPI spec [1] and the correct code should be `BAD_REQUEST` for any spec processing errors. While at it, I also noticed that the `compute_ctl` OpenAPI spec has an invalid format and fixed it. [1] `fd81945a60/compute_tools/src/http/openapi_spec.yaml (L119-L120)`	2023-12-01 18:19:55 +01:00
Anastasia Lubennikova	92bc2bb132	Refactor remote extensions feature to request extensions from proxy (#5836) instead of direct S3 request. Pros: - simplify code a lot (no need to provide AWS credentials and paths); - reduce latency of downloading extension data as proxy resides near computes; - reduce AWS costs as proxy has cache and 1000 computes asking the same extension will not generate 1000 downloads from S3; - we can use only one S3 bucket to store extensions (and get rid of regional buckets which were introduced to reduce latency). Changes: - deprecate remote-ext-config compute_ctl parameter, use http://pg-ext-s3-gateway if any old format remote-ext-config is provided; - refactor tests to use mock http server.	2023-11-27 12:10:23 +00:00
Nikita Kalyanov	77658a155b	support deploying in IPv6-only environments (#4135 ) A set of changes to enable neon to work in IPv6 environments. The changes are backward-compatible but allow to deploy neon even to IPv6-only environments: - bind to both IPv4 and IPv6 interfaces - allow connections to Postgres from IPv6 interface - parse the address from control plane that could also be IPv6	2023-09-05 12:45:46 +03:00
Anastasia Lubennikova	e5a397cf96	Form archive_path for remote extensions on the fly	2023-08-30 13:56:51 +03:00
Anastasia Lubennikova	786c7b3708	Refactor remote extensions index download. Don't download ext_index.json from s3, but instead receive it as a part of spec from control plane. This eliminates s3 access for most compute starts, and also allows us to update extensions spec on the fly	2023-08-17 12:48:33 +03:00
Anastasia Lubennikova	4ce7aa9ffe	Fix extensions download error handling (#4941 ) Don't panic if library or extension is not found in remote extension storage or download has failed. Instead, log the error and proceed - if file is not present locally as well, postgres will fail with postgres error. If it is a shared_preload_library, it won't start, because of bad config. Otherwise, it will just fail to run the SQL function/ command that needs the library. Also, don't try to download extensions if remote storage is not configured.	2023-08-09 15:37:51 +03:00
Alek Westover	d005c77ea3	Tar Remote Extensions (#4715) Add infrastructure to dynamically load postgres extensions and shared libraries from remote extension storage. Before postgres starts, 'compute_ctl' downloads the list of available remote extensions and libraries, and also downloads 'shared_preload_libraries'. After postgres is running, 'compute_ctl' listens for HTTP requests to load files. Postgres has a new GUC 'extension_server_port' to specify the port on which 'compute_ctl' listens for requests. When PostgreSQL requests a file, 'compute_ctl' downloads it. See more details about feature design and remote extension storage layout in docs/rfcs/024-extension-loading.md --------- Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech> Co-authored-by: Alek Westover <alek.westover@gmail.com>	2023-08-02 12:38:12 +03:00
Heikki Linnakangas	df3bae2ce3	Use `compute_ctl` to manage Postgres in tests. (#3886 ) This adds test coverage for 'compute_ctl', as it is now used by all the python tests. There are a few differences in how 'compute_ctl' is called in the tests, compared to the real web console: - In the tests, the postgresql.conf file is included as one large string in the spec file, and it is written out as it is to the data directory. I added a new field for that to the spec file. The real web console, however, sets all the necessary settings in the 'settings' field, and 'compute_ctl' creates the postgresql.conf from those settings. - In the tests, the information needed to connect to the storage, i.e. tenant_id, timeline_id, connection strings to pageserver and safekeepers, are now passed as new fields in the spec file. The real web console includes them as the GUCs in the 'settings' field. (Both of these are different from what the test control plane used to do: It used to write the GUCs directly in the postgresql.conf file). The plan is to change the control plane to use the new method, and remove the old method, but for now, support both. Some tests that were sensitive to the amount of WAL generated needed small changes, to accommodate that compute_ctl runs the background health monitor which makes a few small updates. Also some tests shut down the pageserver, and now that the background health check can run some queries while the pageserver is down, that can produce a few extra errors in the logs, which needed to be allowlisted. Other changes: - remove obsolete comments about PostgresNode; - create standby.signal file for Static compute node; - log output of `compute_ctl` and `postgres` is merged into `endpoints/compute.log`. --------- Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2023-06-06 14:59:36 +01:00
Alexey Kondratov	dd4fd89dc6	[compute_ctl] Do not initialize `last_active` on start (#4137) Our scale-to-zero logic was optimized for short auto-suspend intervals, e.g. minutes or hours. In this case, if compute was restarted by k8s due to some reason (OOM, k8s node went down, pod relocation, etc.), `last_active` got bumped, and we started counting the auto-suspend timeout again. It's not a big deal, i.e. we suspend a completely idle compute not after 5 minutes, but after 10 minutes or so. Yet, some clients may want days or even weeks. The chance that compute is restarted during such an interval is pretty high, and in that case we might not be able to suspend some computes for weeks. After this commit, we won't initialize `last_active` on start, so `/status` could return an unset attribute. This means that there was no user activity since start. Control-plane should deal with it by taking `max()` out of all available activity timestamps: `started_at`, `last_active`, etc. compute_ctl part of neondatabase/cloud#4853	2023-05-05 11:45:37 +02:00
Alexey Kondratov	7ba5c286b7	[compute_ctl] Improve 'empty' compute startup sequence (#4034 ) Do several attempts to get spec from the control-plane and retry network errors and all reasonable HTTP response codes. Do not hang waiting for spec without confirmation from the control-plane that compute is known and is in the `Empty` state. Adjust the way we track `total_startup_ms` metric, it should be calculated since the moment we received spec, not from the moment `compute_ctl` started. Also introduce a new `wait_for_spec_ms` metric to track the time spent sleeping and waiting for spec to be delivered from control-plane. Part of neondatabase/cloud#3533	2023-04-21 11:10:48 +02:00
Alexey Kondratov	589cf1ed21	[compute_ctl] Do not create availability checker data on each start (#4019) Initially, the idea was to ensure that when we come and check data availability, the special service table already contains one row. So if we lose it for some reason, we will error out. Yet, to do the availability check we start compute first anyway! So it doesn't really add any value, but it affects each compute start as we update at least one row in the database. Also this writes some WAL, so if the timeline is close to `neon.max_cluster_size` it could prevent compute from starting up. That said, do CREATE TABLE IF NOT EXISTS + UPSERT right in the `/check_writability` handler.	2023-04-14 13:05:07 +02:00
Alexey Kondratov	db8dd6f380	[compute_ctl] Implement live reconfiguration (#3980 ) With this commit one can request compute reconfiguration from the running `compute_ctl` with compute in `Running` state by sending a new spec: ```shell curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure ``` Internally, we start a separate configurator thread that is waiting on `Condvar` for `ConfigurationPending` compute state in a loop. Then it does reconfiguration, sets compute back to `Running` state and notifies other waiters. It will need some follow-ups, e.g. for retry logic for control-plane requests, but should be useful for testing in the current state. This shouldn't affect any existing environment, since computes are configured in a different way there. Resolves neondatabase/cloud#4433	2023-04-13 18:07:29 +02:00
Heikki Linnakangas	ef68321b31	Use Lsn, TenantId, TimelineId types in compute_ctl. Stronger types are generally nicer.	2023-04-12 12:11:22 +03:00
Heikki Linnakangas	6064a26963	Refactor 'spec' in ComputeState. Sometimes, it contained real values, sometimes just defaults if the spec was not received yet. Make the state more clear by making it an Option instead. One consequence is that if some of the required settings like neon.tenant_id are missing from the spec file sent to the /configure endpoint, it is spotted earlier and you get an immediate HTTP error response. Not that it matters very much, but it's nicer nevertheless.	2023-04-12 01:55:40 +03:00
Heikki Linnakangas	f0b2e076d9	Move compute_ctl structs used in HTTP API and spec file to separate crate. This is in preparation of using compute_ctl to launch postgres nodes in the neon_local control plane. And seems like a good idea to separate the public interfaces anyway. One non-mechanical change here is that the 'metrics' field is moved under the Mutex, instead of using atomics. We were not using atomics for performance but for convenience here, and it seems more clear to not use atomics in the model for the HTTP response type.	2023-04-09 21:52:28 +03:00
Alexey Kondratov	e42982fb1e	[compute_ctl] Empty computes and /configure API (#3963) This commit adds an option to start compute without spec and then pass it a valid spec via `POST /configure` API endpoint. This is the main prerequisite for maintaining the pool of compute nodes in the control-plane. For example: 1. Start compute with ```shell cargo run --bin compute_ctl -- -i no-compute \ -p http://localhost:9095 \ -D compute_pgdata \ -C "postgresql://cloud_admin@127.0.0.1:5434/postgres" \ -b ./pg_install/v15/bin/postgres ``` 2. Configure it with ```shell curl -d "{\"spec\": $(cat ./compute-spec.json)}" http://localhost:3080/configure ``` Internally, it's implemented using a `Condvar` + `Mutex`. Compute spec is moved under Mutex, as it can now be updated in the HTTP handler. Also `RwLock` was replaced with `Mutex` because the latter works well with `Condvar`. First part of the neondatabase/cloud#4433	2023-04-06 21:21:58 +02:00
Vadim Kharitonov	1401021b21	Be able to get number of CPUs (#3774) After enabling autoscaling, we faced the issue that customers are not able to get the number of CPUs they use at this moment. Therefore I've added these two options: 1. PostgreSQL function to allow customers to call it whenever they want 2. `compute_ctl` endpoint to show this number in the console	2023-03-10 19:00:20 +02:00
Alexey Kondratov	e43c413a3f	[compute_tools] Add /insights endpoint to compute_ctl (#3704 ) This commit adds a basic HTTP API endpoint that allows scraping the `pg_stat_statements` data and getting a list of slow queries. New insights like cache hit rate and so on could be added later. Extension `pg_stat_statements` is checked / created only if compute tries to load the corresponding shared library. The latter is configured by control-plane and currently covered with feature flag. Co-authored by Eduard Dyckman (bird.duskpoet@gmail.com)	2023-03-09 14:21:10 +01:00
Vadim Kharitonov	f4359b688c	Backport `cargo fmt` diff from `release` branch into `main`	2023-02-10 14:20:55 +01:00
Heikki Linnakangas	0c0e15b81d	compute_ctl: Extract tracing context from incoming HTTP requests. This allows tracing the handling of HTTP requests as part of the caller's trace.	2023-01-26 15:20:03 +02:00
Heikki Linnakangas	e5cc2f92c4	Switch to 'tracing' for logging, restructure code to make use of spans. Refactors Compute::prepare_and_run. It's split into subroutines differently, to make it easier to attach tracing spans to the different stages. The high-level logic for waiting for Postgres to exit is moved to the caller. Replace 'env_logger' with 'tracing', and add `#[instrument]` directives to different stages of the startup process. This is a fairly mechanical change, except for the changes in 'spec.rs'. 'spec.rs' contained some complicated formatting, where parts of log messages were printed directly to stdout with `print`s. That was a bit messed up because the log normally goes to stderr, but those lines were printed to stdout. In our docker images, stderr and stdout both go to the same place so you wouldn't notice, but I don't think it was intentional. This changes the log format to the default 'tracing_subscriber::format' format. It's different from the Postgres log format, however, and because both compute_tools and Postgres print to the same log, it's now a mix of two different formats. I'm not sure how the Grafana log parsing pipeline can handle that. If it's a problem, we can build a custom formatter to change the compute_tools log format to be the same as Postgres's, like it was before this commit, or we can change the Postgres log format to match tracing_formatter's, or we can start printing compute_tool's log output to a different destination than Postgres	2023-01-18 19:42:47 +02:00
Vadim Kharitonov	434fcac357	Remove unused HTTP endpoints from compute_ctl	2022-12-29 13:59:40 +01:00
Alexey Kondratov	772c2fb4ff	Report startup metrics and failure reason from compute_ctl (#1581) + neondatabase/cloud#1103 This adds a couple of control endpoints to simplify compute state discovery for control-plane. For example, now we may figure out that Postgres wasn't able to start or basebackup failed within seconds instead of just blindly polling the compute readiness for a minute or two. Also we now expose startup metrics (time of each step: basebackup, sync safekeepers, config, total). Console grabs them after each successful start and reports them as a histogram to prometheus and grafana. OpenAPI spec is added and up-to-date, but is not used in the console yet.	2022-05-18 13:03:29 +04:00

24 Commits