rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 21:42:56 +00:00

Author	SHA1	Message	Date
bojanserafimov	ddbe170454	Prewarm compute nodes (#4828 )	2023-07-31 14:13:32 -04:00
bojanserafimov	916a5871a6	compute_ctl: Parse sk connstring (#4809 )	2023-07-26 08:10:49 -04:00
bojanserafimov	520046f5bd	cold starts: Add sync-safekeepers fast path (#4804 )	2023-07-25 19:44:18 -04:00
Alek Westover	7feb0d1a80	`unwrap` instead of passing `anyhow::Error` on failure to spawn a thread (#4779 )	2023-07-21 15:17:16 -04:00
arpad-m	982fce1e72	Fix rustdoc warnings and test cargo doc in CI (#4711 ) ## Problem `cargo +nightly doc` is giving a lot of warnings: broken links, naked URLs, etc. ## Summary of changes * update the `proc-macro2` dependency so that it can compile on latest Rust nightly, see https://github.com/dtolnay/proc-macro2/pull/391 and https://github.com/dtolnay/proc-macro2/issues/398 * allow the `private_intra_doc_links` lint, as linking to something that's private is always more useful than just mentioning it without a link: if the link breaks in the future, at least there is a warning due to that. Also, one might enable [`--document-private-items`](https://doc.rust-lang.org/cargo/commands/cargo-doc.html#documentation-options) in the future and make these links work in general. * fix all the remaining warnings given by `cargo +nightly doc` * make it possible to run `cargo doc` on stable Rust by updating `opentelemetry` and associated crates to version 0.19, pulling in a fix that previously broke `cargo doc` on stable: https://github.com/open-telemetry/opentelemetry-rust/pull/904 * Add `cargo doc` to CI to ensure that it won't get broken in the future. Fixes #2557 ## Future work * Potentially, it might make sense, for development purposes, to publish the generated rustdocs somewhere, like for example [how the rust compiler does it](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/index.html). I will file an issue for discussion.	2023-07-15 05:11:25 +03:00
Alexey Kondratov	ed938885ff	[compute_ctl] Fix deletion of template databases (#4661 ) If database was created with `is_template true` Postgres doesn't allow dropping it right away and throws error ``` ERROR: cannot drop a template database ``` so we have to unset `is_template` first. Fixing it, I noticed that our `escape_literal` isn't exactly correct and following the same logic as in `quote_literal_internal`, we need to prepend string with `E`. Otherwise, it's not possible to filter `pg_database` using `escape_literal()` result if name contains `\`, for example. Also use `FORCE` to drop database even if there are active connections. We run this from `cloud_admin`, so it should have enough privileges. NB: there could be other db states, which prevent us from dropping the database. For example, if db is used by any active subscription or logical replication slot. TODO: deal with it once we allow logical replication. Proper fix should involve returning an error code to the control plane, so it could figure out that this is a non-retryable error, return it to the user and mark operation as permanently failed. Related to neondatabase/cloud#4258	2023-07-13 13:18:35 +02:00
bojanserafimov	92aee7e07f	cold starts: basebackup compression (#4482 ) Co-authored-by: Alex Chi Z <iskyzh@gmail.com>	2023-07-11 13:11:23 -04:00
bojanserafimov	618d36ee6d	compute_ctl: log a structured event on successful start (#4679 )	2023-07-10 15:34:26 -04:00
bojanserafimov	c7143dbde6	compute_ctl: Fix misleading metric (#4608 )	2023-07-04 19:07:36 -04:00
Joonas Koivunen	cff7ae0b0d	fix: no more ansi colored logs (#4613 ) Allure does not support ansi colored logs, yet `compute_ctl` has them. Upgrade criterion to get rid of atty dependency, disable ansi colors, remove atty dependency and disable ansi feature of tracing-subscriber. This is a heavy-handed approach. I am not aware of a workflow where you'd want to connect a terminal directly to for example `compute_ctl`, usually you find the logs in a file. If someone had been using colors, they will now need to: - turn the `tracing-subscriber.default-features` to `true` - edit their wanted project to have colors I decided to explicitly disable ansi colors in case we would have in future a dependency accidentally enabling the feature on `tracing-subscriber`, which would be quite surprising but not unimagineable. By getting rid of `atty` from dependencies we get rid of <https://github.com/advisories/GHSA-g98v-hv3f-hcfr>.	2023-07-03 16:37:02 +03:00
bojanserafimov	9de1a6fb14	cold starts: Run sync_safekeepers on compute_ctl shutdown (#4588 )	2023-06-30 16:29:47 -04:00
Joonas Koivunen	44e7d5132f	fix: hide token from logs (#4584 ) fixes #4583 and also changes all needlessly arg listing places to use `skip_all`.	2023-06-29 15:53:16 +03:00
Sasha Krassovsky	c215389f1c	quote_ident identifiers when creating neon_superuser (#4562 ) ## Problem	2023-06-24 10:34:15 +03:00
Sasha Krassovsky	b1477b4448	Create neon_superuser role, grant it to roles created from control plane (#4425 ) ## Problem Currently, if a user creates a role, it won't by default have any grants applied to it. If the compute restarts, the grants get applied. This gives a very strange UX of being able to drop roles/not have any access to anything at first, and then once something triggers a config application, suddenly grants are applied. This removes these grants.	2023-06-24 01:38:27 +03:00
Anastasia Lubennikova	2f618f46be	Use BUILD_TAG in compute_ctl binary. (#4541 ) Pass BUILD_TAG to compute_ctl binary. We need it to access versioned extension storage.	2023-06-22 17:06:16 +03:00
Alexey Kondratov	1299df87d2	[compute_ctl] Fix logging if catalog updates are skipped (#4480 ) Otherwise, it wasn't clear from the log when Postgres started up completely if catalog updates were skipped. Follow-up for `4936ab6`	2023-06-13 13:34:56 +02:00
bojanserafimov	4936ab6842	compute_ctl: add flag to avoid config step (#4457 ) Add backwards-compatible flag that cplane can use to speed up startup time	2023-06-12 13:57:02 -04:00
Heikki Linnakangas	df3bae2ce3	Use `compute_ctl` to manage Postgres in tests. (#3886 ) This adds test coverage for 'compute_ctl', as it is now used by all the python tests. There are a few differences in how 'compute_ctl' is called in the tests, compared to the real web console: - In the tests, the postgresql.conf file is included as one large string in the spec file, and it is written out as it is to the data directory. I added a new field for that to the spec file. The real web console, however, sets all the necessary settings in the 'settings' field, and 'compute_ctl' creates the postgresql.conf from those settings. - In the tests, the information needed to connect to the storage, i.e. tenant_id, timeline_id, connection strings to pageserver and safekeepers, are now passed as new fields in the spec file. The real web console includes them as the GUCs in the 'settings' field. (Both of these are different from what the test control plane used to do: It used to write the GUCs directly in the postgresql.conf file). The plan is to change the control plane to use the new method, and remove the old method, but for now, support both. Some tests that were sensitive to the amount of WAL generated needed small changes, to accommodate that compute_ctl runs the background health monitor which makes a few small updates. Also some tests shut down the pageserver, and now that the background health check can run some queries while the pageserver is down, that can produce a few extra errors in the logs, which needed to be allowlisted. Other changes: - remove obsolete comments about PostgresNode; - create standby.signal file for Static compute node; - log output of `compute_ctl` and `postgres` is merged into `endpoints/compute.log`. --------- Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2023-06-06 14:59:36 +01:00
Joonas Koivunen	36fee50f4d	compute_ctl: enable tracing panic hook (#4375 ) compute_ctl can panic, but `tracing` is used for logging. panic stderr output can interleave with messages from normal logging. The fix is to use the established way (pageserver, safekeeper, storage_broker) of using `tracing` to report panics.	2023-06-01 20:12:07 +03:00
Sasha Krassovsky	6052ecee07	Add connector extension to send Role/Database updates to console (#3891 ) ## Describe your changes ## Issue ticket number and link ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [x] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.	2023-05-25 12:36:57 +03:00
Heikki Linnakangas	66b06e416a	Pass tracing context in env variables instead of the spec file. (#4174 ) If compute_ctl is launched without a spec file, it fetches it from the control plane with an HTTP request. We cannot get the startup tracing context from the compute spec in that case, because we don't have it available on start. We could still read the tracing context from the compute spec after we have fetched it, but that would leave the fetch itself out of the context. Pass the tracing context in environment variables instead.	2023-05-09 17:08:02 +03:00
Alexey Kondratov	dd4fd89dc6	[compute_ctl] Do not initialize `last_active` on start (#4137 ) Our scale-to-zero logic was optimized for short auto-suspend intervals, e.g. minutes or hours. In this case, if compute was restarted by k8s due to some reason (OOM, k8s node went down, pod relocation, etc.), `last_active` got bumped, we start counting auto-suspend timeout again. It's not a big deal, i.e. we suspend completely idle compute not after 5 minutes, but after 10 minutes or so. Yet, some clients may want days or even weeks. And chance that compute could be restarted during this interval is pretty high, but in this case we could be not able to suspend some computes for weeks. After this commit, we won't initialize `last_active` on start, so `/status` could return an unset attribute. This means that there was no user activity since start. Control-plane should deal with it by taking `max()` out of all available activity timestamps: `started_at`, `last_active`, etc. compute_ctl part of neondatabase/cloud#4853	2023-05-05 11:45:37 +02:00
Heikki Linnakangas	b627fa71e4	Make read-only replicas explicit in compute spec (#4136 ) This builds on top of PR #4058, and supersedes #4018	2023-05-04 17:41:42 +03:00
MMeent	e6ec2400fc	Enable hot standby PostgreSQL replicas. Notes: - This still needs UI support from the Console - I've not tuned any GUCs for PostgreSQL to make this work better - Safekeeper has gotten a tweak in which WAL is sent and how: It now sends zero-ed WAL data from the start of the timeline's first segment up to the first byte of the timeline to be compatible with normal PostgreSQL WAL streaming. - This includes the commits of #3714 Fixes one part of https://github.com/neondatabase/neon/issues/769 Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2023-04-27 15:26:44 +02:00
Alexey Kondratov	7ba5c286b7	[compute_ctl] Improve 'empty' compute startup sequence (#4034 ) Do several attempts to get spec from the control-plane and retry network errors and all reasonable HTTP response codes. Do not hang waiting for spec without confirmation from the control-plane that compute is known and is in the `Empty` state. Adjust the way we track `total_startup_ms` metric, it should be calculated since the moment we received spec, not from the moment `compute_ctl` started. Also introduce a new `wait_for_spec_ms` metric to track the time spent sleeping and waiting for spec to be delivered from control-plane. Part of neondatabase/cloud#3533	2023-04-21 11:10:48 +02:00
Alexey Kondratov	589cf1ed21	[compute_ctl] Do not create availability checker data on each start (#4019 ) Initially, idea was to ensure that when we come and check data availability, special service table already contains one row. So if we loose it for some reason, we will error out. Yet, to do availability check we anyway start compute first! So it doesn't really add some value, but we affect each compute start as we update at least one row in the database. Also this writes some WAL, so if timeline is close to `neon.max_cluster_size` it could prevent compute from starting up. That said, do CREATE TABLE IF NOT EXISTS + UPSERT right in the `/check_writability` handler.	2023-04-14 13:05:07 +02:00
Alexey Kondratov	db8dd6f380	[compute_ctl] Implement live reconfiguration (#3980 ) With this commit one can request compute reconfiguration from the running `compute_ctl` with compute in `Running` state by sending a new spec: ```shell curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure ``` Internally, we start a separate configurator thread that is waiting on `Condvar` for `ConfigurationPending` compute state in a loop. Then it does reconfiguration, sets compute back to `Running` state and notifies other waiters. It will need some follow-ups, e.g. for retry logic for control-plane requests, but should be useful for testing in the current state. This shouldn't affect any existing environment, since computes are configured in a different way there. Resolves neondatabase/cloud#4433	2023-04-13 18:07:29 +02:00
Heikki Linnakangas	06ce83c912	Tolerate missing 'operation_uuid' field in spec file. 'compute_ctl' doesn't use the operation_uuid for anything, it just prints it to the log.	2023-04-12 12:11:22 +03:00
Heikki Linnakangas	ef68321b31	Use Lsn, TenantId, TimelineId types in compute_ctl. Stronger types are generally nicer.	2023-04-12 12:11:22 +03:00
Heikki Linnakangas	6064a26963	Refactor 'spec' in ComputeState. Sometimes, it contained real values, sometimes just defaults if the spec was not received yet. Make the state more clear by making it an Option instead. One consequence is that if some of the required settings like neon.tenant_id are missing from the spec file sent to the /configure endpoint, it is spotted earlier and you get an immediate HTTP error response. Not that it matters very much, but it's nicer nevertheless.	2023-04-12 01:55:40 +03:00
Alexey Kondratov	40a68e9077	[compute_ctl] Add timeout for `tracing_utils::shutdown_tracing()` (#3982 ) Shutting down OTEL tracing provider may hang for quite some time, see, for example: - https://github.com/open-telemetry/opentelemetry-rust/issues/868 - and our problems with staging https://github.com/neondatabase/cloud/issues/3707#issuecomment-1493983636 Yet, we want computes to shut down fast enough, as we may need a new one for the same timeline ASAP. So wait no longer than 2s for the shutdown to complete, then just error out and exit the main thread. Related to neondatabase/cloud#3707	2023-04-11 15:05:35 +02:00
Heikki Linnakangas	f0b2e076d9	Move compute_ctl structs used in HTTP API and spec file to separate crate. This is in preparation of using compute_ctl to launch postgres nodes in the neon_local control plane. And seems like a good idea to separate the public interfaces anyway. One non-mechanical change here is that the 'metrics' field is moved under the Mutex, instead of using atomics. We were not using atomics for performance but for convenience here, and it seems more clear to not use atomics in the model for the HTTP response type.	2023-04-09 21:52:28 +03:00
Alexey Kondratov	e42982fb1e	[compute_ctl] Empty computes and /configure API (#3963 ) This commit adds an option to start compute without spec and then pass it a valid spec via `POST /configure` API endpoint. This is a main prerequisite for maintaining the pool of compute nodes in the control-plane. For example: 1. Start compute with ```shell cargo run --bin compute_ctl -- -i no-compute \ -p http://localhost:9095 \ -D compute_pgdata \ -C "postgresql://cloud_admin@127.0.0.1:5434/postgres" \ -b ./pg_install/v15/bin/postgres ``` 2. Configure it with ```shell curl -d "{\"spec\": $(cat ./compute-spec.json)}" http://localhost:3080/configure ``` Internally, it's implemented using a `Condvar` + `Mutex`. Compute spec is moved under Mutex, as it's now could be updated in the http handler. Also `RwLock` was replaced with `Mutex` because the latter works well with `Condvar`. First part of the neondatabase/cloud#4433	2023-04-06 21:21:58 +02:00
Lassi Pölönen	41d364a8f1	Add more detailed logging to compute_ctl's shutdown (#3915 ) Currently we don't see from the logs, if shutting down tracing takes long time or not. We do see that shutting down computes gets delayed for some reason and hits thhe grace period limit. Moving the shutdown message to slightly later, when we don't have anything else than just exit left. ## Issue ticket number and link ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.	2023-03-30 22:02:39 +03:00
Heikki Linnakangas	5a123b56e5	Remove obsolete hack to rename neon-specific GUCs. I checked the console database, we don't have any of these left in production.	2023-03-28 17:57:22 +03:00
Heikki Linnakangas	6fdd9c10d1	Read storage auth token from spec file. We read the pageserver connection string from the spec file, so let's read the auth token from the same place. We've been talking about pre-launching compute nodes that are not associated with any particular tenant at startup, so that the spec file is delivered to the compute node later. We cannot change the env variables after the process has been launched. We still pass the token to 'postgres' binary in the NEON_AUTH_TOKEN env variable, but compute_ctl is now responsible for setting it.	2023-03-21 20:12:09 +02:00
Heikki Linnakangas	299db9d028	Simplify and clean up the $NEON_AUTH_TOKEN stuff in compute - Remove the neon.safekeeper_token_env GUC. It was used to set the name of an environment variable, which was then used in pageserver and safekeeper connection strings to in place of the password. Instead, always look up the environment variable called NEON_AUTH_TOKEN. That's what neon.safekeeper_token_env was always set to in practice, and I don't see the need for the extra level of indirection or configurability. - Instead of substituting $NEON_AUTH_TOKEN in the connection strings, pass $NEON_AUTH_TOKEN "out-of-band" as the password, when we connect to the pageserver or safekeepers. That's simpler. - Also use the password from $NEON_AUTH_TOKEN in compute_ctl, when it connects to the pageserver to get the "base backup".	2023-03-21 00:15:04 +02:00
Vadim Kharitonov	1401021b21	Be able to get number of CPUs (#3774 ) After enabling autoscaling, we faced the issue that customers are not able to get the number of CPUs they use at this moment. Therefore I've added these two options: 1. Postgresql function to allow customers to call it whenever they want 2. `compute_ctl` endpoint to show these number in console	2023-03-10 19:00:20 +02:00
Heikki Linnakangas	d1537a49fa	Fix escaping in postgresql.conf that we generate at compute startup If there are any config options that contain single quotes or backslashes, they need to be escaped	2023-03-10 14:59:21 +02:00
Heikki Linnakangas	856d01ff68	Add newline at end of postgresql.conf	2023-03-10 14:59:21 +02:00
Heikki Linnakangas	42ec79fb0d	Make expected test output nicer to read. By using Rust raw string literal.	2023-03-10 14:59:21 +02:00
Alexey Kondratov	e43c413a3f	[compute_tools] Add /insights endpoint to compute_ctl (#3704 ) This commit adds a basic HTTP API endpoint that allows scraping the `pg_stat_statements` data and getting a list of slow queries. New insights like cache hit rate and so on could be added later. Extension `pg_stat_statements` is checked / created only if compute tries to load the corresponding shared library. The latter is configured by control-plane and currently covered with feature flag. Co-authored by Eduard Dyckman (bird.duskpoet@gmail.com)	2023-03-09 14:21:10 +01:00
Sam Kleinman	c79dd8d458	compute_ctl: support for fetching spec from control plane (#3610 )	2023-02-23 13:19:39 -05:00
sharnoff	2153d2e00a	Run compute_ctl in a cgroup in VMs (#3577 )	2023-02-17 14:14:41 -08:00
Vadim Kharitonov	f4359b688c	Backport `cargo fmt` diff from `release` branch into `main`	2023-02-10 14:20:55 +01:00
Heikki Linnakangas	5ee77c0b1f	Fix holding tracing span guard over query execution. I added these spans to trace how long the queries take, but I didn't realize that there's a difference between: let _ = span.entered(); and let _guard = span.entered(); The former drops the guard immediately, while the latter holds it until the end of the scope. As a result, the span was ended immediately, and the query was executed outside the span.	2023-01-30 12:10:51 +02:00
Heikki Linnakangas	0c0e15b81d	compute_ctl: Extract tracing context from incoming HTTP requests. This allows tracing the handling of HTTP requests as part of the caller's trace.	2023-01-26 15:20:03 +02:00
Heikki Linnakangas	3e94fd5af3	Inherit OpenTelemetry context for compute startup from cloud console. This allows fine-grained distributed tracing of the 'start_compute' operation from the cloud console. The startup actions performed by 'compute_ctl' are now performed in a child of the 'start_compute' context, so you can trace through the whole compute start operation. This needs a corresponding change in the cloud console to fill in the 'startup_tracing_context' field in the json spec. If it's missing, the startup operations are simply traced as a separate trace, without a parent.	2023-01-26 15:20:03 +02:00
Heikki Linnakangas	006ee5f94a	Configure 'compute_ctl' to use OpenTelemetry exporter. This allows tracing the startup actions e.g. with Jaeger (https://www.jaegertracing.io/). We use the "tracing-opentelemetry" crate, which turns tracing spans into OpenTelemetry spans, so you can use the usual "#[instrument]" directives to add tracing. I put the tracing initialization code to a separate crate, `tracing-utils`, so that we can reuse it in other programs. We probably want to set up tracing in the same way in all our programs. Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-01-26 15:20:03 +02:00
Alexey Kondratov	a4be54d21f	[compute_ctl] Stop updating roles on each compute start (#3391 ) I noticed that `compute_ctl` updates all roles on each start, search for rows like > - web_access:[FILTERED] -> update in the compute startup log. It happens since we had an adhoc hack for md5 hashes comparison, which doesn't work with scram hashes stored in the `pg_authid`. It doesn't really hurt, as nothing changes, but we just run >= 2 extra queries on each start, so fix it.	2023-01-23 17:46:22 +01:00

1 2 3

111 Commits