86 Commits

Author SHA1 Message Date
Alexey Kondratov
589cf1ed21 [compute_ctl] Do not create availability checker data on each start (#4019)
Initially, the idea was to ensure that when we come and check data
availability, the special service table already contains one row. So if
we lose it for some reason, we will error out.

Yet, to do the availability check we start compute first anyway! So it
doesn't really add any value, but it affects each compute start, as we
update at least one row in the database. This also writes some WAL, so
if the timeline is close to `neon.max_cluster_size` it could prevent
compute from starting up.

That said, do CREATE TABLE IF NOT EXISTS + UPSERT right in the
`/check_writability` handler.
2023-04-14 13:05:07 +02:00
Alexey Kondratov
db8dd6f380 [compute_ctl] Implement live reconfiguration (#3980)
With this commit one can request compute reconfiguration
from the running `compute_ctl` with compute in `Running` state
by sending a new spec:
```shell
curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure
```

Internally, we start a separate configurator thread that is waiting on
`Condvar` for `ConfigurationPending` compute state in a loop. Then it does
reconfiguration, sets compute back to `Running` state and notifies other
waiters.
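
The hand-off described above can be sketched with the standard library
alone. This is a minimal illustration, not the actual `compute_ctl` code:
the two-variant `ComputeStatus` enum, the `Compute` struct and the function
names are assumptions, and the sketch performs a single reconfiguration
round instead of looping forever.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Hypothetical, simplified set of states; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ComputeStatus {
    Running,
    ConfigurationPending,
}

// Hypothetical shared handle: the state plus a Condvar to signal changes.
struct Compute {
    state: Mutex<ComputeStatus>,
    state_changed: Condvar,
}

// Configurator side: wait for a pending reconfiguration, apply it, then
// set the state back to Running and notify the other waiters.
fn configure_once(compute: &Compute) {
    let mut state = compute.state.lock().unwrap();
    while *state != ComputeStatus::ConfigurationPending {
        state = compute.state_changed.wait(state).unwrap();
    }
    // ... the actual reconfiguration work would happen here ...
    *state = ComputeStatus::Running;
    compute.state_changed.notify_all();
}

// Requester side (the HTTP handler in this sketch): mark the compute as
// pending and block until the configurator reports Running again.
fn request_reconfigure(compute: &Compute) -> ComputeStatus {
    let mut state = compute.state.lock().unwrap();
    *state = ComputeStatus::ConfigurationPending;
    compute.state_changed.notify_all();
    while *state != ComputeStatus::Running {
        state = compute.state_changed.wait(state).unwrap();
    }
    *state
}

fn run_demo() -> ComputeStatus {
    let compute = Arc::new(Compute {
        state: Mutex::new(ComputeStatus::Running),
        state_changed: Condvar::new(),
    });
    let worker = {
        let compute = Arc::clone(&compute);
        thread::spawn(move || configure_once(&compute))
    };
    let final_state = request_reconfigure(&compute);
    worker.join().unwrap();
    final_state
}

fn main() {
    println!("{:?}", run_demo());
}
```

Both sides re-check the state in a loop after every wake-up, so spurious
wake-ups and either start order are handled.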

It will need some follow-ups, e.g. for retry logic for control-plane
requests, but should be useful for testing in the current state. This
shouldn't affect any existing environment, since computes are configured
in a different way there.

Resolves neondatabase/cloud#4433
2023-04-13 18:07:29 +02:00
Heikki Linnakangas
06ce83c912 Tolerate missing 'operation_uuid' field in spec file.
'compute_ctl' doesn't use the operation_uuid for anything, it just prints
it to the log.
2023-04-12 12:11:22 +03:00
Heikki Linnakangas
ef68321b31 Use Lsn, TenantId, TimelineId types in compute_ctl.
Stronger types are generally nicer.
2023-04-12 12:11:22 +03:00
Heikki Linnakangas
6064a26963 Refactor 'spec' in ComputeState.
Sometimes, it contained real values, sometimes just defaults if the
spec was not received yet. Make the state more clear by making it an
Option instead.

One consequence is that if some of the required settings like
neon.tenant_id are missing from the spec file sent to the /configure
endpoint, it is spotted earlier and you get an immediate HTTP error
response. Not that it matters very much, but it's nicer nevertheless.
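
A rough sketch of the idea, assuming a trimmed-down spec with two string
fields. The struct layout, the `configure` helper and the error texts are
illustrative and not taken from the actual code:

```rust
// Hypothetical, trimmed-down spec; the real one carries many more settings.
#[derive(Debug, Clone, PartialEq)]
struct ComputeSpec {
    tenant_id: String,
    timeline_id: String,
}

#[derive(Debug, Default)]
struct ComputeState {
    // None until a spec has been received; no placeholder defaults.
    spec: Option<ComputeSpec>,
}

// Validate the incoming settings up front, so that a missing required
// value is rejected before the state is touched.
fn configure(
    state: &mut ComputeState,
    tenant_id: Option<&str>,
    timeline_id: Option<&str>,
) -> Result<(), String> {
    let tenant_id = tenant_id.ok_or("missing neon.tenant_id")?;
    let timeline_id = timeline_id.ok_or("missing timeline id")?;
    state.spec = Some(ComputeSpec {
        tenant_id: tenant_id.to_string(),
        timeline_id: timeline_id.to_string(),
    });
    Ok(())
}

fn main() {
    let mut state = ComputeState::default();
    // Rejected immediately: the tenant id is missing.
    println!("{:?}", configure(&mut state, None, Some("tl-1")));
    // Accepted: the state now holds a real spec.
    println!("{:?}", configure(&mut state, Some("tenant-1"), Some("tl-1")));
    println!("{:?}", state.spec);
}
```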
2023-04-12 01:55:40 +03:00
Alexey Kondratov
40a68e9077 [compute_ctl] Add timeout for tracing_utils::shutdown_tracing() (#3982)
Shutting down OTEL tracing provider may hang for quite some time, see,
for example:
- https://github.com/open-telemetry/opentelemetry-rust/issues/868
- and our problems with staging
https://github.com/neondatabase/cloud/issues/3707#issuecomment-1493983636

Yet, we want computes to shut down fast enough, as we may need a new one
for the same timeline ASAP. So wait no longer than 2s for the shutdown
to complete, then just error out and exit the main thread.
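
One way to bound a potentially hanging call is to run it on a helper
thread and wait on a channel with a timeout. The sketch below uses only
the standard library. `shutdown_with_timeout` is a hypothetical helper
name, and the closure stands in for the real tracing shutdown call.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run `shutdown` on a helper thread and wait at most `timeout` for it.
// On timeout the helper thread is left detached, which is acceptable
// when the process is about to exit anyway.
fn shutdown_with_timeout<F>(shutdown: F, timeout: Duration) -> Result<(), String>
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        shutdown();
        // The receiver may be gone if we already timed out; ignore that.
        let _ = done_tx.send(());
    });
    done_rx
        .recv_timeout(timeout)
        .map_err(|e| format!("shutdown did not complete in {timeout:?}: {e}"))
}

fn main() {
    // A fast shutdown completes within the limit.
    let fast = shutdown_with_timeout(|| {}, Duration::from_secs(2));
    // A hanging shutdown is abandoned once the limit is hit.
    let slow = shutdown_with_timeout(
        || thread::sleep(Duration::from_secs(5)),
        Duration::from_millis(100),
    );
    println!("fast: {fast:?}, slow: {slow:?}");
}
```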

Related to neondatabase/cloud#3707
2023-04-11 15:05:35 +02:00
Heikki Linnakangas
f0b2e076d9 Move compute_ctl structs used in HTTP API and spec file to separate crate.
This is in preparation of using compute_ctl to launch postgres nodes
in the neon_local control plane. And seems like a good idea to
separate the public interfaces anyway.

One non-mechanical change here is that the 'metrics' field is moved
under the Mutex, instead of using atomics. We were not using atomics
for performance but for convenience here, and it seems more clear to
not use atomics in the model for the HTTP response type.
2023-04-09 21:52:28 +03:00
Alexey Kondratov
e42982fb1e [compute_ctl] Empty computes and /configure API (#3963)
This commit adds an option to start compute without spec and then pass
it a valid spec via `POST /configure` API endpoint. This is a main
prerequisite for maintaining the pool of compute nodes in the
control-plane.

For example:

1. Start compute with
   ```shell
   cargo run --bin compute_ctl -- -i no-compute \
    -p http://localhost:9095 \
    -D compute_pgdata \
    -C "postgresql://cloud_admin@127.0.0.1:5434/postgres" \
    -b ./pg_install/v15/bin/postgres
   ```

2. Configure it with
   ```shell
   curl -d "{\"spec\": $(cat ./compute-spec.json)}" http://localhost:3080/configure
   ```

Internally, it's implemented using a `Condvar` + `Mutex`. The compute
spec is moved under the `Mutex`, as it can now be updated in the HTTP
handler. Also, `RwLock` was replaced with `Mutex` because the latter
works well with `Condvar`.
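
For illustration, waiting for a spec that arrives later could look
roughly like this. The `Shared` struct, the function names and the
`String` spec are assumptions made for the sketch. The standard library's
`Condvar` only operates on a `MutexGuard`, which is why a `Mutex` is the
natural partner here.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Hypothetical shared state: the spec is just a String in this sketch.
struct Shared {
    spec: Mutex<Option<String>>,
    spec_arrived: Condvar,
}

// Handler side: store the spec and wake up anyone waiting for it.
fn deliver_spec(shared: &Shared, spec: &str) {
    *shared.spec.lock().unwrap() = Some(spec.to_string());
    shared.spec_arrived.notify_all();
}

// Startup side of an "empty" compute: block until a spec exists.
fn wait_for_spec(shared: &Shared) -> String {
    let guard = shared.spec.lock().unwrap();
    let guard = shared
        .spec_arrived
        .wait_while(guard, |spec| spec.is_none())
        .unwrap();
    guard
        .clone()
        .expect("wait_while only returns once the spec is set")
}

fn run_demo() -> String {
    let shared = Arc::new(Shared {
        spec: Mutex::new(None),
        spec_arrived: Condvar::new(),
    });
    let handler = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || deliver_spec(&shared, "spec-v1"))
    };
    let spec = wait_for_spec(&shared);
    handler.join().unwrap();
    spec
}

fn main() {
    println!("{}", run_demo());
}
```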

First part of the neondatabase/cloud#4433
2023-04-06 21:21:58 +02:00
Lassi Pölönen
41d364a8f1 Add more detailed logging to compute_ctl's shutdown (#3915)
Currently we can't tell from the logs whether shutting down tracing
takes a long time. We do see that shutting down computes gets delayed
for some reason and hits the grace period limit. Move the shutdown
message slightly later, to the point where nothing is left to do but
exit.
2023-03-30 22:02:39 +03:00
Heikki Linnakangas
5a123b56e5 Remove obsolete hack to rename neon-specific GUCs.
I checked the console database, we don't have any of these left in
production.
2023-03-28 17:57:22 +03:00
Heikki Linnakangas
6fdd9c10d1 Read storage auth token from spec file.
We read the pageserver connection string from the spec file, so let's
read the auth token from the same place.

We've been talking about pre-launching compute nodes that are not
associated with any particular tenant at startup, so that the spec
file is delivered to the compute node later. We cannot change the env
variables after the process has been launched.

We still pass the token to 'postgres' binary in the NEON_AUTH_TOKEN
env variable, but compute_ctl is now responsible for setting it.
2023-03-21 20:12:09 +02:00
Heikki Linnakangas
299db9d028 Simplify and clean up the $NEON_AUTH_TOKEN stuff in compute
- Remove the neon.safekeeper_token_env GUC. It was used to set the
  name of an environment variable, which was then used in pageserver
  and safekeeper connection strings in place of the
  password. Instead, always look up the environment variable called
  NEON_AUTH_TOKEN. That's what neon.safekeeper_token_env was always
  set to in practice, and I don't see the need for the extra level of
  indirection or configurability.

- Instead of substituting $NEON_AUTH_TOKEN in the connection strings,
  pass $NEON_AUTH_TOKEN "out-of-band" as the password, when we connect
  to the pageserver or safekeepers. That's simpler.

- Also use the password from $NEON_AUTH_TOKEN in compute_ctl, when it
  connects to the pageserver to get the "base backup".
2023-03-21 00:15:04 +02:00
Vadim Kharitonov
1401021b21 Be able to get number of CPUs (#3774)
After enabling autoscaling, we faced the issue that customers are not
able to get the number of CPUs they use at this moment. Therefore I've
added these two options:

1. A PostgreSQL function that customers can call whenever they want
2. A `compute_ctl` endpoint to show this number in the console
2023-03-10 19:00:20 +02:00
Heikki Linnakangas
d1537a49fa Fix escaping in postgresql.conf that we generate at compute startup
If there are any config options that contain single quotes or backslashes,
they need to be escaped.
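
A minimal sketch of such escaping, assuming the value is written as a
single-quoted string in which both single quotes and backslashes are
doubled. The helper name `escape_conf_value` is hypothetical.

```rust
// Quote a value for postgresql.conf: double every single quote and every
// backslash, then wrap the result in single quotes.
fn escape_conf_value(value: &str) -> String {
    let escaped = value.replace('\'', "''").replace('\\', "\\\\");
    format!("'{escaped}'")
}

fn main() {
    println!("{}", escape_conf_value("it's"));
    println!("{}", escape_conf_value(r"C:\path"));
}
```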
2023-03-10 14:59:21 +02:00
Heikki Linnakangas
856d01ff68 Add newline at end of postgresql.conf 2023-03-10 14:59:21 +02:00
Heikki Linnakangas
42ec79fb0d Make expected test output nicer to read.
By using Rust raw string literal.
2023-03-10 14:59:21 +02:00
Alexey Kondratov
e43c413a3f [compute_tools] Add /insights endpoint to compute_ctl (#3704)
This commit adds a basic HTTP API endpoint that allows scraping the
`pg_stat_statements` data and getting a list of slow queries. New
insights like cache hit rate and so on could be added later.

The `pg_stat_statements` extension is checked / created only if compute
tries to load the corresponding shared library. The latter is configured
by the control plane and is currently behind a feature flag.

Co-authored by Eduard Dyckman (bird.duskpoet@gmail.com)
2023-03-09 14:21:10 +01:00
Sam Kleinman
c79dd8d458 compute_ctl: support for fetching spec from control plane (#3610) 2023-02-23 13:19:39 -05:00
sharnoff
2153d2e00a Run compute_ctl in a cgroup in VMs (#3577) 2023-02-17 14:14:41 -08:00
Vadim Kharitonov
f4359b688c Backport cargo fmt diff from release branch into main 2023-02-10 14:20:55 +01:00
Heikki Linnakangas
5ee77c0b1f Fix holding tracing span guard over query execution.
I added these spans to trace how long the queries take, but I didn't realize
that there's a difference between:

    let _ = span.entered();

and

    let _guard = span.entered();

The former drops the guard immediately, while the latter holds it
until the end of the scope. As a result, the span was ended
immediately, and the query was executed outside the span.
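
The difference can be reproduced without the `tracing` crate. In the
sketch below, `Guard` is a stand-in for the span guard (an assumption
made for illustration) that records when it is created and dropped:

```rust
use std::cell::RefCell;

// Stand-in for a span guard: logs "enter" on creation and "exit" on drop.
struct Guard<'a> {
    log: &'a RefCell<Vec<&'static str>>,
}

impl<'a> Guard<'a> {
    fn enter(log: &'a RefCell<Vec<&'static str>>) -> Self {
        log.borrow_mut().push("enter");
        Guard { log }
    }
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        self.log.borrow_mut().push("exit");
    }
}

fn with_underscore() -> Vec<&'static str> {
    let log = RefCell::new(Vec::new());
    {
        let _ = Guard::enter(&log); // not bound: dropped at the end of this statement
        log.borrow_mut().push("query");
    }
    log.into_inner()
}

fn with_named_guard() -> Vec<&'static str> {
    let log = RefCell::new(Vec::new());
    {
        let _guard = Guard::enter(&log); // bound: held until the end of the block
        log.borrow_mut().push("query");
    }
    log.into_inner()
}

fn main() {
    println!("{:?}", with_underscore()); // ["enter", "exit", "query"]
    println!("{:?}", with_named_guard()); // ["enter", "query", "exit"]
}
```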
2023-01-30 12:10:51 +02:00
Heikki Linnakangas
0c0e15b81d compute_ctl: Extract tracing context from incoming HTTP requests.
This allows tracing the handling of HTTP requests as part of the caller's
trace.
2023-01-26 15:20:03 +02:00
Heikki Linnakangas
3e94fd5af3 Inherit OpenTelemetry context for compute startup from cloud console.
This allows fine-grained distributed tracing of the 'start_compute'
operation from the cloud console. The startup actions performed by
'compute_ctl' are now performed in a child of the 'start_compute'
context, so you can trace through the whole compute start operation.

This needs a corresponding change in the cloud console to fill in the
'startup_tracing_context' field in the json spec. If it's missing, the
startup operations are simply traced as a separate trace, without
a parent.
2023-01-26 15:20:03 +02:00
Heikki Linnakangas
006ee5f94a Configure 'compute_ctl' to use OpenTelemetry exporter.
This allows tracing the startup actions e.g. with Jaeger
(https://www.jaegertracing.io/). We use the "tracing-opentelemetry"
crate, which turns tracing spans into OpenTelemetry spans, so you can
use the usual "#[instrument]" directives to add tracing.

I put the tracing initialization code to a separate crate,
`tracing-utils`, so that we can reuse it in other programs. We
probably want to set up tracing in the same way in all our programs.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-01-26 15:20:03 +02:00
Alexey Kondratov
a4be54d21f [compute_ctl] Stop updating roles on each compute start (#3391)
I noticed that `compute_ctl` updates all roles on each start, search for
rows like

> - web_access:[FILTERED] -> update

in the compute startup log.

It happens because we had an ad hoc hack for comparing md5 hashes, which
doesn't work with the scram hashes stored in `pg_authid`. It doesn't
really hurt, as nothing changes, but we run >= 2 extra queries on each
start, so fix it.
2023-01-23 17:46:22 +01:00
Alexey Kondratov
20b1e26e74 [compute_ctl] Make role deletion spec processing idempotent (#3380)
Previously, we were trying to re-assign owned objects of an already
deleted role. This was causing a crash loop when compute was restarted
with a spec that includes a delta operation for role deletion. To avoid
such cases, check that the role is still present before calling
`reassign_owned_objects`.

Resolves neondatabase/cloud#3553
2023-01-20 15:37:24 +01:00
Heikki Linnakangas
e5cc2f92c4 Switch to 'tracing' for logging, restructure code to make use of spans.
Refactors Compute::prepare_and_run. It's split into subroutines
differently, to make it easier to attach tracing spans to the
different stages. The high-level logic for waiting for Postgres to
exit is moved to the caller.

Replace 'env_logger' with 'tracing', and add `#[instrument]` directives
to different stages of the startup process. This is a fairly
mechanical change, except for the changes in 'spec.rs'. 'spec.rs'
contained some complicated formatting, where parts of log messages
were printed directly to stdout with `print`s. That was a bit messed
up because the log normally goes to stderr, but those lines were
printed to stdout. In our docker images, stderr and stdout both go to
the same place so you wouldn't notice, but I don't think it was
intentional.

This changes the log format to the default
'tracing_subscriber::format' format. It's different from the Postgres
log format, however, and because both compute_tools and Postgres print
to the same log, it's now a mix of two different formats.  I'm not
sure how the Grafana log parsing pipeline can handle that. If it's a
problem, we can build a custom formatter to change the compute_tools log
format to be the same as Postgres's, like it was before this commit,
or we can change the Postgres log format to match tracing_formatter's,
or we can start printing compute_tool's log output to a different
destination than Postgres.
2023-01-18 19:42:47 +02:00
Heikki Linnakangas
3b58c61b33 If an error happens while checking for core dumps, don't panic.
If we panic, we skip the 30s wait in 'main', and don't give the
console a chance to observe the error. Which is not nice.

Spotted by @ololobus at
https://github.com/neondatabase/neon/pull/3352#discussion_r1072806981
2023-01-18 11:25:47 +02:00
sharnoff
5c6a7a17cb Add VM informant to vm-compute-node (#3324)
The general idea is that the VM informant binary is added to the
vm-compute-node images only. `compute_tools` then will run whatever's at
`/bin/vm-informant`, if the path exists.
2023-01-16 07:05:29 -08:00
Kirill Bulatov
bce4233d3a Rework Cargo.toml dependencies (#3322)
* Use workspace variables from cargo, coming with rustc
[1.64](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1640-2022-09-22)

See
https://doc.rust-lang.org/nightly/cargo/reference/workspaces.html#the-package-table
and
https://doc.rust-lang.org/nightly/cargo/reference/workspaces.html#the-dependencies-table
sections.

Now, all dependencies in all non-root `Cargo.toml` files are defined as 
```
clap.workspace = true
```

or, when extra features are needed, as
```
bytes = {workspace = true, features = ['serde'] }
```

With the actual declarations (with shared features and version
numbers/file paths/etc.) in the root Cargo.toml.
Features are additive:

https://doc.rust-lang.org/nightly/cargo/reference/specifying-dependencies.html#inheriting-a-dependency-from-a-workspace

* Uses the mechanism above to set a common edition (2021) and license
across the workspace

* Mechanically bumps a few dependencies

* Updates hakari format, as it suggested:
```
work/neon/neon kb/cargo-templated ❯ cargo hakari generate
info: no changes detected
info: new hakari format version available: 3 (current: 2)
(add or update `dep-format-version = "3"` in hakari.toml, then run `cargo hakari generate && cargo hakari manage-deps`)
```
2023-01-13 18:13:34 +02:00
Heikki Linnakangas
af9425394f Print time taken by CREATE/ALTER DATABASE at compute start.
Trying to investigate why the "apply_config" stage is taking longer
than expected. This proves or disproves that it's the CREATE DATABASE
statement.
2023-01-06 17:50:44 +02:00
Heikki Linnakangas
df42213dbb Fix missing COMMIT in handle_role_deletions.
There was no COMMIT, so the DROP ROLE commands were always implicitly
rolled back.

Fixes issue #3279.
2023-01-06 17:07:46 +02:00
Vadim Kharitonov
f436fb2dfb Fix panics at compute_ctl:monitor 2023-01-04 17:26:42 +01:00
Vadim Kharitonov
0b428f7c41 Enable licenses check for 3rd-parties 2023-01-03 15:11:50 +01:00
Vadim Kharitonov
434fcac357 Remove unused HTTP endpoints from compute_ctl 2022-12-29 13:59:40 +01:00
Vadim Kharitonov
9b71215906 Simplify some functions in compute_tools and fix typo errors in func
name
2022-12-22 15:05:43 +01:00
Stas Kelvich
5a762744c7 Collect core dump backtraces in compute_ctl.
Scan the core dump directory on exit. If any core dumps exist, call
gdb/lldb to get a backtrace and log it. By default, look for core dumps
in the postgres data directory as core.<pid>. That is how core
collection is configured in our k8s nodes (and a reasonable convention
in general).
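
The scanning step could be sketched as below. The helper names and the
rule that `<pid>` consists only of digits are assumptions; the debugger
invocation is left out on purpose.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// True for file names of the form `core.<pid>`, where <pid> is all digits.
fn is_core_dump_name(name: &str) -> bool {
    match name.strip_prefix("core.") {
        Some(pid) => !pid.is_empty() && pid.chars().all(|c| c.is_ascii_digit()),
        None => false,
    }
}

// List core dump files directly inside `dir` (no recursion), sorted by path.
fn find_core_dumps(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_core = path
            .file_name()
            .and_then(|name| name.to_str())
            .map_or(false, is_core_dump_name);
        if is_core && path.is_file() {
            found.push(path);
        }
    }
    found.sort();
    Ok(found)
}

fn main() {
    // Scan the current directory as a stand-in for the data directory.
    match find_core_dumps(Path::new(".")) {
        Ok(dumps) => println!("found {} core dump(s): {:?}", dumps.len(), dumps),
        Err(e) => eprintln!("could not scan for core dumps: {e}"),
    }
}
```

Returning an `io::Result` instead of unwrapping lets the caller log a
scan failure and carry on with the exit.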
2022-12-22 16:01:49 +02:00
Kirill Bulatov
fca25edae8 Fix 1.66 Clippy warnings (#3178)
The 1.66 release speeds up compile times by over 10% according to tests.

Also its Clippy finds plenty of old nits in our code:
* useless conversion, `foo as u8` where `foo: u8` and similar, removed
`as u8` and similar
* useless references and dereferences (that were automatically adjusted
by the compiler), removed various `&` and `*`
* bool -> u8 conversion via `if/else`, changed to `u8::from`
* Map `.iter()` calls where only values were used, changed to
`.values()` instead

Lints that stand out:
* `Eq` is missing in our protoc generated structs. Silenced, does not
seem crucial for us.
* `fn default` looks like the one from `Default` trait, so I've
implemented that instead and replaced the `dummy_*` method in tests with
`::default()` invocation
* Clippy detected that
```
if retry_attempt < u32::MAX {
    retry_attempt += 1;
}
```
is a saturating add and proposed to replace it.
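
For reference, the replacements suggested by these lints look roughly
like this. The function names are made up for the sketch; the standard
library calls are real.

```rust
use std::collections::HashMap;

// Saturating increment: stays at u32::MAX instead of overflowing.
fn next_attempt(retry_attempt: u32) -> u32 {
    retry_attempt.saturating_add(1)
}

// bool -> u8 conversion without an if/else.
fn flag_byte(flag: bool) -> u8 {
    u8::from(flag)
}

// Iterate over the values only, instead of `.iter()` and ignoring the keys.
fn total(sizes: &HashMap<String, u64>) -> u64 {
    sizes.values().sum()
}

fn main() {
    let mut sizes = HashMap::new();
    sizes.insert("a".to_string(), 2);
    sizes.insert("b".to_string(), 3);
    println!("{}", next_attempt(u32::MAX)); // 4294967295
    println!("{}", flag_byte(true)); // 1
    println!("{}", total(&sizes)); // 5
}
```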
2022-12-22 14:27:48 +02:00
Dmitry Ivanov
61194ab2f4 Update rust-postgres everywhere
I've rebased[1] Neon's fork of rust-postgres to incorporate
latest upstream changes (including dependabot's fixes),
so we need to advance revs here as well.

[1] https://github.com/neondatabase/rust-postgres/commits/neon
2022-12-17 00:26:10 +03:00
Dmitry Ivanov
83baf49487 [proxy] Forward compute connection params to client
This fixes all kinds of problems related to missing params,
like broken timestamps (due to `integer_datetimes`).

This solution is not ideal, but it will help. Meanwhile,
I'm going to dedicate some time to improving connection machinery.

Note that this **does not** fix problems with passing certain parameters
in a reverse direction, i.e. **from client to compute**. This is a
separate matter and will be dealt with in an upcoming PR.
2022-12-16 21:37:50 +03:00
Alexander Bayandin
61825dfb57 Update chrono to 0.4.23; use only clock feature from it 2022-12-06 15:45:58 +01:00
andres
1cf257bc4a feedback 2022-11-08 20:15:54 +04:00
Anastasia Lubennikova
39897105b2 Check postgres version and ensure that public schema exists
before running GRANT query on it
2022-10-25 09:55:24 +03:00
Stas Kelvich
2f399f08b2 Hotfix to disable grant create on public schema
`GRANT CREATE ON SCHEMA public` fails if there is no schema `public`.
Disable it in release for now and make a better fix later (it is
needed for v15 support).
2022-10-25 09:55:24 +03:00
Alexey Kondratov
4d1e48f3b9 [compute_ctl] Use postgres::config to properly escape database names (#2652)
We've got at least one user in production that cannot create a
database with a trailing space in the name.

This happens because we use `url` crate for manipulating the
DATABASE_URL, but it follows a standard that doesn't fit really
well with Postgres. For example, it trims all trailing spaces
from the path:

  > Remove any leading and trailing C0 control or space from input.
  > See: https://url.spec.whatwg.org/#url-parsing

But we used `set_path()` to set database name and it's totally valid
to have trailing spaces in the database name in Postgres.

Thus, use `postgres::config::Config` to modify database name in the
connection details.
2022-10-19 19:20:06 +02:00
Anastasia Lubennikova
7576b18b14 [compute_tools] fix GRANT CREATE ON SCHEMA public -
run the grant query in each database
2022-10-19 18:37:52 +03:00
Anastasia Lubennikova
0ec5ddea0b GRANT CREATE ON SCHEMA public TO web_access 2022-10-17 22:42:51 +03:00
Kirill Bulatov
c4ee62d427 Bump clap and other minor dependencies (#2623) 2022-10-17 12:58:40 +03:00
Arthur Petukhovsky
687ba81366 Display sync safekeepers output in compute_ctl (#2571)
Pipe postgres output to compute_ctl stdout and create a test to check that compute_ctl works and prints postgres logs.
2022-10-06 13:53:52 +00:00
Joonas Koivunen
e8b195acb7 fix: apply notify workaround on m1 mac docker (#2564)
workaround as discussed in the notify repository.
2022-10-06 11:13:40 +03:00