Commit Graph

49 Commits

Author SHA1 Message Date
Alexey Kondratov
bc1020f965 compute_ctl: Notify waiters when Postgres failed to start (#6034)
When configuring an empty compute, the API handler waits on a condvar
for a compute state change. Previously, if Postgres failed to start, we
just set the compute status to `Failed` without notifying waiters. This
caused a timeout on the control plane side, even though the compute
could have returned a proper error earlier.

With this commit, the API handler is properly notified.
2023-12-05 13:38:45 +01:00
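
A minimal sketch of the notification pattern this fixes, using std `Mutex` + `Condvar`; the enum and struct here are illustrative, not the actual `compute_ctl` types:

```rust
use std::sync::{Condvar, Mutex};

#[derive(Clone, Copy, PartialEq)]
enum ComputeStatus {
    Empty,
    Running,
    Failed,
}

struct ComputeNode {
    state: Mutex<ComputeStatus>,
    state_changed: Condvar,
}

impl ComputeNode {
    // The fix: notify waiters on *every* state change, including the
    // transition to `Failed` when Postgres doesn't start.
    fn set_status(&self, new: ComputeStatus) {
        *self.state.lock().unwrap() = new;
        self.state_changed.notify_all();
    }

    // What the API handler does while configuring an empty compute:
    // without the notify above, a failed start left it waiting here
    // until the control plane timed out.
    fn wait_while_empty(&self) -> ComputeStatus {
        let mut state = self.state.lock().unwrap();
        while *state == ComputeStatus::Empty {
            state = self.state_changed.wait(state).unwrap();
        }
        *state
    }
}
```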
Anastasia Lubennikova
92bc2bb132 Refactor remote extensions feature to request extensions from proxy (#5836)
instead of making direct S3 requests.

Pros:
- simplify the code a lot (no need to provide AWS credentials and paths);
- reduce the latency of downloading extension data, as the proxy resides
near the computes;
- reduce AWS costs, as the proxy has a cache, so 1000 computes asking for
the same extension will not generate 1000 downloads from S3;
- we can use a single S3 bucket to store extensions (and get rid of the
regional buckets that were introduced to reduce latency).

Changes:
- deprecate the remote-ext-config compute_ctl parameter; use
http://pg-ext-s3-gateway if any old-format remote-ext-config value is
provided;
- refactor tests to use a mock HTTP server.
2023-11-27 12:10:23 +00:00
Em Sharnoff
cad0dca4b8 compute_ctl: Remove deprecated flag --file-cache-on-disk (#5622)
See neondatabase/cloud#7516 for more.
2023-11-18 12:43:54 +01:00
Em Sharnoff
367971a0e9 vm-monitor: Remove support for file cache in tmpfs (#5617)
ref neondatabase/cloud#7516.

We switched everything over to the file cache on disk, so now it's time
to remove support for having it in tmpfs.
2023-11-02 16:06:16 +00:00
Em Sharnoff
f7067a38b7 compute_ctl: Assume --vm-monitor-addr arg is always present (#5611)
It has a default value, so this should be sound.
Treating its presence as semantically significant was leading to
spurious warnings.
2023-10-31 10:00:23 -07:00
Sasha Krassovsky
116c342cad Support changing pageserver dynamically (#5542)
## Problem
We currently require a full restart of the compute if we change the
pageserver URL.
## Summary of changes
Make it so that we don't have to do a full restart; we can just send
SIGHUP instead.
2023-10-26 10:56:07 -07:00
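
A hedged sketch of a SIGHUP-driven reload, using the `signal-hook` crate (an assumption — the commit doesn't say how the signal is handled, and the reload itself is stubbed out):

```rust
use signal_hook::{consts::SIGHUP, iterator::Signals};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut signals = Signals::new([SIGHUP])?;
    std::thread::spawn(move || {
        for _ in signals.forever() {
            // Hypothetical reload: re-read the spec and swap the
            // pageserver connection string without restarting Postgres.
            eprintln!("SIGHUP: re-applying pageserver configuration");
        }
    });

    // ... the rest of compute_ctl's work ...
    Ok(())
}
```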
Em Sharnoff
8d2a4aa5f8 vm-monitor: Add flag for when file cache on disk (#5130)
Part 1 of 2, for moving the file cache onto disk.

Because VMs are created by the control plane (and that's where the
filesystem for the file cache is defined), we can't rely on any kind of
synchronization between releases, so the change needs to be
feature-gated (kind of), with the default remaining the same for now.

See also: neondatabase/cloud#6593
2023-08-29 12:44:48 -07:00
Felix Prasanna
85d6d9dc85 monitor/compute_ctl: remove references to the informant (#5115)
Also added some docs to the monitor :)

Co-authored-by: Em Sharnoff <sharnoff@neon.tech>
2023-08-29 02:59:27 +03:00
Em Sharnoff
529f8b5016 compute_ctl: Fix switched vm-monitor args (#5117)
Small switcheroo from #4946.
2023-08-28 14:55:41 +02:00
Felix Prasanna
18537be298 monitor: listen on correct port to accept agent connections (#5100)
## Problem
The previous arguments have the monitor listen on `localhost`, which the
informant can connect to since it's also in the VM, but which the agent
cannot. Also, the port is wrong.

## Summary of changes
Listen on `0.0.0.0:10301`
2023-08-24 17:32:46 -04:00
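
For illustration, the difference is just the bind address (a minimal sketch, not the monitor's actual listener setup):

```rust
use std::net::TcpListener;

fn bind_for_agent() -> std::io::Result<TcpListener> {
    // "127.0.0.1:<port>" is reachable only from inside the VM (fine for
    // the informant); "0.0.0.0:10301" also accepts the agent's
    // connections from outside the VM.
    TcpListener::bind("0.0.0.0:10301")
}
```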
Felix Prasanna
3128eeff01 compute_ctl: add vm-monitor (#4946)
Co-authored-by: Em Sharnoff <sharnoff@neon.tech>
2023-08-24 15:54:37 -04:00
Anastasia Lubennikova
786c7b3708 Refactor remote extensions index download.
Don't download ext_index.json from S3; instead, receive it as part of the spec from the control plane.
This eliminates S3 access for most compute starts,
and also allows us to update the extensions spec on the fly.
2023-08-17 12:48:33 +03:00
bojanserafimov
4ad0c8f960 compute_ctl: Prewarm before starting http server (#4867) 2023-08-02 14:19:06 -04:00
Alek Westover
d005c77ea3 Tar Remote Extensions (#4715)
Add infrastructure to dynamically load postgres extensions and shared libraries from remote extension storage.

Before Postgres starts, 'compute_ctl' downloads the list of available remote extensions and libraries, and also downloads 'shared_preload_libraries'. After Postgres is running, 'compute_ctl' listens for HTTP requests to load files.

Postgres has a new GUC, 'extension_server_port', which specifies the port on which 'compute_ctl' listens for requests.

When PostgreSQL requests a file, 'compute_ctl' downloads it.

See more details about feature design and remote extension storage layout in docs/rfcs/024-extension-loading.md

---------

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Alek Westover <alek.westover@gmail.com>
2023-08-02 12:38:12 +03:00
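
A rough sketch of the download-on-request flow, assuming a blocking `reqwest` client; the URL layout and helper name are illustrative (see docs/rfcs/024-extension-loading.md for the real storage layout):

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Hypothetical helper behind compute_ctl's HTTP endpoint: when Postgres
// asks for a file (via the port in 'extension_server_port'), fetch it
// from remote extension storage and put it where Postgres expects it.
fn download_extension_file(
    remote_base: &str,
    filename: &str,
    pg_lib_dir: &Path,
) -> anyhow::Result<PathBuf> {
    let url = format!("{remote_base}/{filename}");
    let bytes = reqwest::blocking::get(&url)?.error_for_status()?.bytes()?;
    let dest = pg_lib_dir.join(filename);
    fs::write(&dest, &bytes)?;
    Ok(dest)
}
```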
bojanserafimov
ddbe170454 Prewarm compute nodes (#4828) 2023-07-31 14:13:32 -04:00
Alek Westover
7feb0d1a80 unwrap instead of passing anyhow::Error on failure to spawn a thread (#4779) 2023-07-21 15:17:16 -04:00
bojanserafimov
9de1a6fb14 cold starts: Run sync_safekeepers on compute_ctl shutdown (#4588) 2023-06-30 16:29:47 -04:00
Anastasia Lubennikova
2f618f46be Use BUILD_TAG in compute_ctl binary. (#4541)
Pass BUILD_TAG to compute_ctl binary. 
We need it to access versioned extension storage.
2023-06-22 17:06:16 +03:00
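
One common way to bake such a tag into the binary is a compile-time env lookup (a sketch; the fallback value is an assumption):

```rust
// Read BUILD_TAG at compile time; fall back if it wasn't set during
// the build. option_env! keeps the build working either way.
const BUILD_TAG: &str = match option_env!("BUILD_TAG") {
    Some(tag) => tag,
    None => "latest",
};

fn main() {
    println!("compute_ctl build tag: {BUILD_TAG}");
}
```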
Heikki Linnakangas
df3bae2ce3 Use compute_ctl to manage Postgres in tests. (#3886)
This adds test coverage for 'compute_ctl', as it is now used by all
the python tests.
    
There are a few differences in how 'compute_ctl' is called in the
tests, compared to the real web console:
    
- In the tests, the postgresql.conf file is included as one large
  string in the spec file, and it is written out as-is to the data
  directory. I added a new field for that to the spec file. The real
  web console, however, sets all the necessary settings in the
  'settings' field, and 'compute_ctl' creates the postgresql.conf from
  those settings.

- In the tests, the information needed to connect to the storage, i.e.
  tenant_id, timeline_id, connection strings to pageserver and
  safekeepers, are now passed as new fields in the spec file. The real
  web console includes them as the GUCs in the 'settings' field. (Both
  of these are different from what the test control plane used to do:
  It used to write the GUCs directly in the postgresql.conf file). The
  plan is to change the control plane to use the new method, and
  remove the old method, but for now, support both.

Some tests that were sensitive to the amount of WAL generated needed
small changes, to accommodate that compute_ctl runs the background
health monitor which makes a few small updates. Also some tests shut
down the pageserver, and now that the background health check can run
some queries while the pageserver is down, that can produce a few
extra errors in the logs, which needed to be allowlisted.

Other changes:
- remove obsolete comments about PostgresNode;
- create standby.signal file for Static compute node;
- log output of `compute_ctl` and `postgres` is merged into
`endpoints/compute.log`.

---------

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
2023-06-06 14:59:36 +01:00
Heikki Linnakangas
66b06e416a Pass tracing context in env variables instead of the spec file. (#4174)
If compute_ctl is launched without a spec file, it fetches it from the
control plane with an HTTP request. We cannot get the startup tracing
context from the compute spec in that case, because we don't have it
available on start. We could still read the tracing context from the
compute spec after we have fetched it, but that would leave the fetch
itself out of the context. Pass the tracing context in environment
variables instead.
2023-05-09 17:08:02 +03:00
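
A minimal sketch of the receiving side, assuming W3C-style `TRACEPARENT`/`TRACESTATE` variable names (the commit doesn't name the variables here, so treat these as an assumption):

```rust
use std::collections::HashMap;
use std::env;

// Collect the startup tracing context from the environment instead of
// the spec file; an OpenTelemetry propagator can then extract a parent
// span context from this carrier map.
fn tracing_carrier_from_env() -> HashMap<String, String> {
    let mut carrier = HashMap::new();
    for (var, key) in [("TRACEPARENT", "traceparent"), ("TRACESTATE", "tracestate")] {
        if let Ok(value) = env::var(var) {
            carrier.insert(key.to_string(), value);
        }
    }
    carrier
}
```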
Alexey Kondratov
7ba5c286b7 [compute_ctl] Improve 'empty' compute startup sequence (#4034)
Make several attempts to get the spec from the control plane, retrying
network errors and all reasonable HTTP response codes. Do not hang
waiting for the spec without confirmation from the control plane that
the compute is known and is in the `Empty` state.

Adjust the way we track the `total_startup_ms` metric: it should be
calculated from the moment we received the spec, not from the moment
`compute_ctl` started. Also introduce a new `wait_for_spec_ms` metric to
track the time spent sleeping and waiting for the spec to be delivered
from the control plane.

Part of neondatabase/cloud#3533
2023-04-21 11:10:48 +02:00
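
A simplified sketch of the retry loop (attempt counts, backoff, and the error classification are illustrative, not the actual values):

```rust
use std::thread::sleep;
use std::time::Duration;

// Stand-in for the real HTTP call to the control plane: Ok(Some(spec))
// when the spec is ready, Ok(None) while the compute is known but still
// `Empty`, Err for network errors / retryable HTTP statuses.
fn try_fetch_spec() -> anyhow::Result<Option<String>> {
    Ok(None)
}

fn fetch_spec_with_retries(max_attempts: u32) -> anyhow::Result<String> {
    let mut delay = Duration::from_millis(100);
    for attempt in 1..=max_attempts {
        match try_fetch_spec() {
            Ok(Some(spec)) => return Ok(spec),
            // Known and Empty: keep waiting, but not forever.
            Ok(None) => {}
            Err(e) if attempt < max_attempts => {
                eprintln!("attempt {attempt} failed: {e:#}, retrying");
            }
            Err(e) => return Err(e),
        }
        sleep(delay);
        delay *= 2; // simple exponential backoff
    }
    anyhow::bail!("no spec after {max_attempts} attempts")
}
```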
Alexey Kondratov
db8dd6f380 [compute_ctl] Implement live reconfiguration (#3980)
With this commit, one can request compute reconfiguration from a
running `compute_ctl` (with the compute in the `Running` state)
by sending a new spec:
```shell
curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure
```

Internally, we start a separate configurator thread that waits on a
`Condvar` for the `ConfigurationPending` compute state in a loop. It
then performs the reconfiguration, sets the compute back to the
`Running` state, and notifies other waiters.

It will need some follow-ups, e.g. retry logic for control-plane
requests, but should be useful for testing in the current state. This
shouldn't affect any existing environment, since computes are configured
differently there.

Resolves neondatabase/cloud#4433
2023-04-13 18:07:29 +02:00
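
A sketch of that configurator loop (illustrative types; the same `Condvar` + `Mutex` shape introduced by the `/configure` commit further down):

```rust
use std::sync::{Arc, Condvar, Mutex};

#[derive(Clone, Copy, PartialEq)]
enum Status {
    Running,
    ConfigurationPending,
}

fn spawn_configurator(state: Arc<(Mutex<Status>, Condvar)>) {
    std::thread::spawn(move || loop {
        let (lock, cvar) = &*state;
        let mut status = lock.lock().unwrap();
        // Sleep until the /configure handler flips the state.
        while *status != Status::ConfigurationPending {
            status = cvar.wait(status).unwrap();
        }
        drop(status); // release the lock while reconfiguring

        // ... apply the new spec to the running Postgres here ...

        *lock.lock().unwrap() = Status::Running;
        cvar.notify_all(); // wake the /configure handler and other waiters
    });
}
```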
Heikki Linnakangas
6064a26963 Refactor 'spec' in ComputeState.
Sometimes it contained real values, sometimes just defaults if the
spec had not been received yet. Make the state clearer by making it an
Option instead.

One consequence is that if a required setting like neon.tenant_id is
missing from the spec file sent to the /configure endpoint, it is
spotted earlier and you get an immediate HTTP error response. Not that
it matters very much, but it's nicer nevertheless.
2023-04-12 01:55:40 +03:00
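
Schematically (field and type names are illustrative):

```rust
struct ComputeSpec {
    tenant_id: String,
    timeline_id: String,
}

struct ComputeState {
    // None until the spec has actually been received; callers must
    // handle the missing case instead of silently reading defaults.
    spec: Option<ComputeSpec>,
}

fn configure(state: &ComputeState) -> Result<(), String> {
    // Missing required settings surface here as an immediate error
    // (an HTTP error response in compute_ctl), not a late failure.
    let spec = state.spec.as_ref().ok_or("spec not received yet")?;
    println!("configuring tenant {}", spec.tenant_id);
    let _ = &spec.timeline_id;
    Ok(())
}
```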
Alexey Kondratov
40a68e9077 [compute_ctl] Add timeout for tracing_utils::shutdown_tracing() (#3982)
Shutting down the OTEL tracing provider may hang for quite some time;
see, for example:
- https://github.com/open-telemetry/opentelemetry-rust/issues/868
- and our problems with staging
https://github.com/neondatabase/cloud/issues/3707#issuecomment-1493983636

Yet, we want computes to shut down fast enough, as we may need a new one
for the same timeline ASAP. So wait no longer than 2s for the shutdown
to complete, then just error out and exit the main thread.

Related to neondatabase/cloud#3707
2023-04-11 15:05:35 +02:00
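
The 2s cap can be sketched with a helper thread and a channel timeout (std-only; whether the real code uses this exact mechanism is an assumption):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn shutdown_tracing_with_timeout() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        shutdown_tracing_stub(); // stand-in for the call that may hang
        let _ = tx.send(());
    });
    // Wait no longer than 2s; a stuck OTEL shutdown must not delay
    // compute exit, since a new compute may be needed ASAP.
    if rx.recv_timeout(Duration::from_secs(2)).is_err() {
        eprintln!("shutdown_tracing() did not finish in 2s, exiting anyway");
    }
}

fn shutdown_tracing_stub() {}
```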
Heikki Linnakangas
f0b2e076d9 Move compute_ctl structs used in HTTP API and spec file to separate crate.
This is in preparation for using compute_ctl to launch postgres nodes
in the neon_local control plane. It seems like a good idea to
separate the public interfaces anyway.

One non-mechanical change here is that the 'metrics' field is moved
under the Mutex, instead of using atomics. We were not using atomics
for performance but for convenience here, and it seems more clear to
not use atomics in the model for the HTTP response type.
2023-04-09 21:52:28 +03:00
Alexey Kondratov
e42982fb1e [compute_ctl] Empty computes and /configure API (#3963)
This commit adds an option to start a compute without a spec and then
pass it a valid spec via the `POST /configure` API endpoint. This is
the main prerequisite for maintaining a pool of compute nodes in the
control plane.

For example:

1. Start compute with
   ```shell
   cargo run --bin compute_ctl -- -i no-compute \
    -p http://localhost:9095 \
    -D compute_pgdata \
    -C "postgresql://cloud_admin@127.0.0.1:5434/postgres" \
    -b ./pg_install/v15/bin/postgres
   ```

2. Configure it with
   ```shell
   curl -d "{\"spec\": $(cat ./compute-spec.json)}" http://localhost:3080/configure
   ```

Internally, it's implemented using a `Condvar` + `Mutex`. The compute
spec is moved under the Mutex, as it can now be updated in the HTTP
handler. Also, `RwLock` was replaced with `Mutex` because the latter
works well with `Condvar`.

First part of neondatabase/cloud#4433
2023-04-06 21:21:58 +02:00
Lassi Pölönen
41d364a8f1 Add more detailed logging to compute_ctl's shutdown (#3915)
Currently we can't see from the logs whether shutting down tracing
takes a long time or not. We do see that shutting down computes gets
delayed for some reason and hits the grace period limit. Move the
shutdown message slightly later, to the point where there is nothing
left to do but exit.
2023-03-30 22:02:39 +03:00
Heikki Linnakangas
6fdd9c10d1 Read storage auth token from spec file.
We read the pageserver connection string from the spec file, so let's
read the auth token from the same place.

We've been talking about pre-launching compute nodes that are not
associated with any particular tenant at startup, so that the spec
file is delivered to the compute node later. We cannot change the env
variables after the process has been launched.

We still pass the token to 'postgres' binary in the NEON_AUTH_TOKEN
env variable, but compute_ctl is now responsible for setting it.
2023-03-21 20:12:09 +02:00
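
Schematically, compute_ctl now injects the token into the environment of the 'postgres' child process itself (a sketch; spec parsing omitted):

```rust
use std::process::{Child, Command};

fn launch_postgres(pgbin: &str, pgdata: &str, auth_token: Option<&str>) -> std::io::Result<Child> {
    let mut cmd = Command::new(pgbin);
    cmd.args(["-D", pgdata]);
    if let Some(token) = auth_token {
        // Taken from the spec file; compute_ctl, not the environment
        // it was launched in, sets this for 'postgres'.
        cmd.env("NEON_AUTH_TOKEN", token);
    }
    cmd.spawn()
}
```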
Sam Kleinman
c79dd8d458 compute_ctl: support for fetching spec from control plane (#3610) 2023-02-23 13:19:39 -05:00
sharnoff
2153d2e00a Run compute_ctl in a cgroup in VMs (#3577) 2023-02-17 14:14:41 -08:00
Heikki Linnakangas
3e94fd5af3 Inherit OpenTelemetry context for compute startup from cloud console.
This allows fine-grained distributed tracing of the 'start_compute'
operation from the cloud console. The startup actions performed by
'compute_ctl' are now performed in a child of the 'start_compute'
context, so you can trace through the whole compute start operation.

This needs a corresponding change in the cloud console to fill in the
'startup_tracing_context' field in the json spec. If it's missing, the
startup operations are simply traced as a separate trace, without
a parent.
2023-01-26 15:20:03 +02:00
Heikki Linnakangas
006ee5f94a Configure 'compute_ctl' to use OpenTelemetry exporter.
This allows tracing the startup actions e.g. with Jaeger
(https://www.jaegertracing.io/). We use the "tracing-opentelemetry"
crate, which turns tracing spans into OpenTelemetry spans, so you can
use the usual "#[instrument]" directives to add tracing.

I put the tracing initialization code into a separate crate,
`tracing-utils`, so that we can reuse it in other programs. We
probably want to set up tracing in the same way in all our programs.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-01-26 15:20:03 +02:00
Heikki Linnakangas
e5cc2f92c4 Switch to 'tracing' for logging, restructure code to make use of spans.
Refactors Compute::prepare_and_run. It's split into subroutines
differently, to make it easier to attach tracing spans to the
different stages. The high-level logic for waiting for Postgres to
exit is moved to the caller.

Replace 'env_logger' with 'tracing', and add `#instrument` directives
to different stages of the startup process. This is a fairly
mechanical change, except for the changes in 'spec.rs'. 'spec.rs'
contained some complicated formatting, where parts of log messages
were printed directly to stdout with `print`s. That was a bit messed
up because the log normally goes to stderr, but those lines were
printed to stdout. In our docker images, stderr and stdout both go to
the same place so you wouldn't notice, but I don't think it was
intentional.

This changes the log format to the default
'tracing_subscriber::format' format. It's different from the Postgres
log format, however, and because both compute_tools and Postgres print
to the same log, it's now a mix of two different formats. I'm not
sure how the Grafana log parsing pipeline can handle that. If it's a
problem, we can build a custom formatter to change the compute_tools
log format to be the same as Postgres's, like it was before this
commit, or we can change the Postgres log format to match
tracing_formatter's, or we can start printing compute_tools' log
output to a different destination than Postgres's.
2023-01-18 19:42:47 +02:00
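
With the `tracing` crate, each startup stage becomes a span via `#[instrument]` (a generic sketch, not the actual compute_ctl signatures):

```rust
use tracing::{info, instrument};

#[instrument] // creates a "get_basebackup" span around the call
fn get_basebackup(tenant_id: &str, timeline_id: &str) {
    info!("requesting basebackup from the pageserver");
}

#[instrument]
fn sync_safekeepers() {
    info!("running sync_safekeepers");
}
```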
sharnoff
5c6a7a17cb Add VM informant to vm-compute-node (#3324)
The general idea is that the VM informant binary is added to the
vm-compute-node images only. `compute_tools` will then run whatever's at
`/bin/vm-informant`, if the path exists.
2023-01-16 07:05:29 -08:00
Vadim Kharitonov
9b71215906 Simplify some functions in compute_tools and fix typos in function names
2022-12-22 15:05:43 +01:00
Kirill Bulatov
c4ee62d427 Bump clap and other minor dependencies (#2623) 2022-10-17 12:58:40 +03:00
Heikki Linnakangas
d865892a06 Print full error with stacktrace, if compute node startup fails.
It failed in the staging environment a few times, and all we got in the
logs was:

    ERROR could not start the compute node: failed to get basebackup@0/2D6194F8 from pageserver host=zenith-us-stage-ps-2.local port=6400
    giving control plane 30s to collect the error before shutdown

That's missing all the detail on *why* it failed.
2022-07-29 16:41:55 +03:00
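
With `anyhow`, the difference is only the format specifier (a sketch):

```rust
fn report_startup_error(err: anyhow::Error) {
    // "{}" prints just the top-level message (what the truncated staging
    // logs showed); "{:#}" adds the cause chain; "{:?}" also includes a
    // backtrace when one was captured (e.g. RUST_BACKTRACE=1).
    eprintln!("could not start the compute node: {err:?}");
}
```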
Dmitry Rodionov
00fc696606 replace extra urlencode dependency with already present url library 2022-06-30 14:32:15 +03:00
Anastasia Lubennikova
67d6ff4100 Rename custom GUCs:
- zenith.zenith_tenant -> neon.tenant_id
- zenith.zenith_timeline -> neon.timeline_id
2022-05-30 11:11:01 +03:00
Anastasia Lubennikova
6a867bce6d Rename 'zenith_admin' role to 'cloud_admin' 2022-05-30 11:11:01 +03:00
Anastasia Lubennikova
3accde613d Rename contrib/zenith to contrib/neon. Rename custom GUCs:
- zenith.page_server_connstring -> neon.pageserver_connstring
- zenith.zenith_tenant -> neon.tenantid
- zenith.zenith_timeline -> neon.timelineid
- zenith.max_cluster_size -> neon.max_cluster_size
2022-05-30 11:11:01 +03:00
Alexey Kondratov
772c2fb4ff Report startup metrics and failure reason from compute_ctl (#1581)
+ neondatabase/cloud#1103

This adds a couple of control endpoints to simplify compute state
discovery for the control plane. For example, now we can figure out
within seconds that Postgres wasn't able to start or that basebackup
failed, instead of just blindly polling the compute readiness for a
minute or two.

We also now expose startup metrics (the time of each step: basebackup,
sync safekeepers, config, total). The console grabs them after each
successful start and reports them as histograms to Prometheus and
Grafana.

An OpenAPI spec is added and up to date, but it is not used in the
console yet.
2022-05-18 13:03:29 +04:00
Anastasia Lubennikova
78a6cb247f allow the users to create extensions: GRANT CREATE ON DATABASE 2022-04-25 15:35:44 +03:00
Kirill Bulatov
81cad6277a Move library crates into a dedicated directory and rename them 2022-04-21 13:30:33 +03:00
Daniil
58d5136a61 compute_tools: check writability handler (#941) 2022-04-13 17:16:25 +03:00
Kirill Bulatov
76b74349cb Bump pageserver dependencies 2022-02-10 08:33:22 -05:00
Kirill Bulatov
7fb62fc849 Fix macos compilation 2022-01-18 23:01:04 +02:00
Alexey Kondratov
06c28174c2 Integrate compute_tools into zenith workspace and improve logging (zenithdb/console#487) 2022-01-18 18:47:31 +03:00
Alexey Kondratov
f64074c609 Move compute_tools from console repo (zenithdb/console#383)
Currently it's included with minimal changes and lives outside the main
workspace. Later we may reuse and combine common parts with the zenith
control_plane.

This change is mostly needed to unify the cloud deployment pipeline:
1.1. build the compute-tools image
1.2. build the compute-node image based on the freshly built compute-tools
2. build the zenith image

This way we can roll out a new compute image together with the new
storage it requires to operate properly. It also becomes easier to
test the console against a specific version of compute-node/-tools.
2021-12-28 20:17:29 +03:00