Commit Graph

62 Commits

Author SHA1 Message Date
Alek Westover
b4abbfe6fb fix clippy; use oncelock instead of mutex because it is more appropriate 2023-06-28 16:49:04 -04:00
Alek Westover
a6c9a4abe7 cache extensions 2023-06-28 14:51:55 -04:00
Alek Westover
ce0da2889e cache available libraries: increases efficiency, and reduces code duplication 2023-06-28 11:48:55 -04:00
Anastasia Lubennikova
7357b7cad5 Code cleanup 2023-06-27 15:29:33 +03:00
Alek Westover
a2e154f07b real s3 and tenant specific files 2023-06-26 15:25:20 -04:00
Anastasia Lubennikova
1104de0b9b refactoring
- enable CREATE EXTENSION and LOAD test
- change test_file_download to use mock_s3

- some code cleanup
- add caching of extensions_list

- WIP downloading of shared_preload_libraries (not tested yet)
2023-06-26 18:54:38 +03:00
Alek Westover
a8f848b5de test reals3 2023-06-23 15:51:49 -04:00
Alek Westover
ca59330df8 modify file names 2023-06-23 15:28:30 -04:00
Alek Westover
36bb5ad527 refactor 2023-06-23 14:47:48 -04:00
Alek Westover
8bc128e474 real s3 tests 2023-06-23 13:41:08 -04:00
Alek Westover
c3994541eb add real s3 tests 2023-06-23 13:26:28 -04:00
Anastasia Lubennikova
7cdcc8a500 Fix downloading of sql files for extension and libraries.
Rust code refactoring and C code fixes.

Add test for CREATE EXTENSION and LOAD 'library'
2023-06-23 20:25:14 +03:00
Alek Westover
31aa0283b0 More Extension Features (#4555)
Added tenant specific extensions and more tests
2023-06-23 09:30:49 -04:00
Alek Westover
9c35c06c58 small refactor 2023-06-22 14:24:59 -04:00
Alek Westover
44ac7a45be Merge branch 'main' into extension_server 2023-06-22 10:30:19 -04:00
Alek Westover
a79b0d69c4 made remote_ext_config an optional parameter 2023-06-22 10:21:07 -04:00
Anastasia Lubennikova
2f618f46be Use BUILD_TAG in compute_ctl binary. (#4541)
Pass BUILD_TAG to compute_ctl binary. 
We need it to access versioned extension storage.
2023-06-22 17:06:16 +03:00
Alek Westover
bf3b83b504 fix code style for clippy 2023-06-22 09:37:07 -04:00
Alek Westover
f984f9e7d3 seems close to working 2023-06-21 15:25:06 -04:00
Alek Westover
605c30e5c5 fixed an issue where pgconfig was pointing at global installation of postgres rather than the correct local version 2023-06-21 14:34:24 -04:00
Alek Westover
0b11d8e836 replaced download_files function with more appropriate download_extensions function 2023-06-21 14:01:44 -04:00
Alek Westover
02a1d4d8c1 refactoring a bit 2023-06-21 11:32:04 -04:00
Alek Westover
559e318328 remote dead imports 2023-06-21 11:01:45 -04:00
Alek Westover
bb414e5a0a removing debugging 2023-06-21 10:45:37 -04:00
Alek Westover
c99e203094 I think it's working 2023-06-21 10:10:02 -04:00
Alek Westover
356f7d3a7e more debugging 2023-06-20 22:39:02 -04:00
Alek Westover
890061d371 arg passing is mostly working 2023-06-20 19:35:13 -04:00
Alek Westover
6b74d1a76a partils 2023-06-19 15:25:53 -04:00
Alek Westover
a936b8a92b add ext cli args 2023-06-16 17:05:39 -04:00
Alek Westover
c7bea52849 adding command line argument 2023-06-16 16:58:13 -04:00
Anastasia Lubennikova
34f22e9b12 Request extension files from compute_ctl 2023-06-13 16:52:37 +02:00
Heikki Linnakangas
df3bae2ce3 Use compute_ctl to manage Postgres in tests. (#3886)
This adds test coverage for 'compute_ctl', as it is now used by all
the python tests.
    
There are a few differences in how 'compute_ctl' is called in the
tests, compared to the real web console:
    
- In the tests, the postgresql.conf file is included as one large
  string in the spec file, and it is written out as it is to the data
  directory.  I added a new field for that to the spec file. The real
  web console, however, sets all the necessary settings in the
  'settings' field, and 'compute_ctl' creates the postgresql.conf from
  those settings.

- In the tests, the information needed to connect to the storage, i.e.
  tenant_id, timeline_id, connection strings to pageserver and
  safekeepers, are now passed as new fields in the spec file. The real
  web console includes them as the GUCs in the 'settings' field. (Both
  of these are different from what the test control plane used to do:
  It used to write the GUCs directly in the postgresql.conf file). The
  plan is to change the control plane to use the new method, and
  remove the old method, but for now, support both.

Some tests that were sensitive to the amount of WAL generated needed
small changes, to accommodate that compute_ctl runs the background
health monitor which makes a few small updates. Also some tests shut
down the pageserver, and now that the background health check can run
some queries while the pageserver is down, that can produce a few
extra errors in the logs, which needed to be allowlisted.

Other changes:
- remove obsolete comments about PostgresNode;
- create standby.signal file for Static compute node;
- log output of `compute_ctl` and `postgres` is merged into
`endpoints/compute.log`.

---------

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
2023-06-06 14:59:36 +01:00
Heikki Linnakangas
66b06e416a Pass tracing context in env variables instead of the spec file. (#4174)
If compute_ctl is launched without a spec file, it fetches it from the
control plane with an HTTP request. We cannot get the startup tracing
context from the compute spec in that case, because we don't have it
available on start. We could still read the tracing context from the
compute spec after we have fetched it, but that would leave the fetch
itself out of the context. Pass the tracing context in environment
variables instead.
2023-05-09 17:08:02 +03:00
Alexey Kondratov
7ba5c286b7 [compute_ctl] Improve 'empty' compute startup sequence (#4034)
Do several attempts to get spec from the control-plane and retry network
errors and all reasonable HTTP response codes. Do not hang waiting for
spec without confirmation from the control-plane that compute is known
and is in the `Empty` state.

Adjust the way we track `total_startup_ms` metric, it should be
calculated since the moment we received spec, not from the moment
`compute_ctl` started. Also introduce a new `wait_for_spec_ms` metric
to track the time spent sleeping and waiting for spec to be delivered
from control-plane.

Part of neondatabase/cloud#3533
2023-04-21 11:10:48 +02:00
Alexey Kondratov
db8dd6f380 [compute_ctl] Implement live reconfiguration (#3980)
With this commit one can request compute reconfiguration
from the running `compute_ctl` with compute in `Running` state
by sending a new spec:
```shell
curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure
```

Internally, we start a separate configurator thread that is waiting on
`Condvar` for `ConfigurationPending` compute state in a loop. Then it does
reconfiguration, sets compute back to `Running` state and notifies other
waiters.

It will need some follow-ups, e.g. for retry logic for control-plane
requests, but should be useful for testing in the current state. This
shouldn't affect any existing environment, since computes are configured
in a different way there.

Resolves neondatabase/cloud#4433
2023-04-13 18:07:29 +02:00
Heikki Linnakangas
6064a26963 Refactor 'spec' in ComputeState.
Sometimes, it contained real values, sometimes just defaults if the
spec was not received yet. Make the state more clear by making it an
Option instead.

One consequence is that if some of the required settings like
neon.tenant_id are missing from the spec file sent to the /configure
endpoint, it is spotted earlier and you get an immediate HTTP error
response. Not that it matters very much, but it's nicer nevertheless.
2023-04-12 01:55:40 +03:00
Alexey Kondratov
40a68e9077 [compute_ctl] Add timeout for tracing_utils::shutdown_tracing() (#3982)
Shutting down OTEL tracing provider may hang for quite some time, see,
for example:
- https://github.com/open-telemetry/opentelemetry-rust/issues/868
- and our problems with staging
https://github.com/neondatabase/cloud/issues/3707#issuecomment-1493983636

Yet, we want computes to shut down fast enough, as we may need a new one
for the same timeline ASAP. So wait no longer than 2s for the shutdown
to complete, then just error out and exit the main thread.

Related to neondatabase/cloud#3707
2023-04-11 15:05:35 +02:00
Heikki Linnakangas
f0b2e076d9 Move compute_ctl structs used in HTTP API and spec file to separate crate.
This is in preparation of using compute_ctl to launch postgres nodes
in the neon_local control plane. And seems like a good idea to
separate the public interfaces anyway.

One non-mechanical change here is that the 'metrics' field is moved
under the Mutex, instead of using atomics. We were not using atomics
for performance but for convenience here, and it seems more clear to
not use atomics in the model for the HTTP response type.
2023-04-09 21:52:28 +03:00
Alexey Kondratov
e42982fb1e [compute_ctl] Empty computes and /configure API (#3963)
This commit adds an option to start compute without spec and then pass
it a valid spec via `POST /configure` API endpoint. This is a main
prerequisite for maintaining the pool of compute nodes in the
control-plane.

For example:

1. Start compute with
   ```shell
   cargo run --bin compute_ctl -- -i no-compute \
    -p http://localhost:9095 \
    -D compute_pgdata \
    -C "postgresql://cloud_admin@127.0.0.1:5434/postgres" \
    -b ./pg_install/v15/bin/postgres
   ```

2. Configure it with
   ```shell
   curl -d "{\"spec\": $(cat ./compute-spec.json)}" http://localhost:3080/configure
   ```

Internally, it's implemented using a `Condvar` + `Mutex`. Compute spec
is moved under Mutex, as it's now could be updated in the http handler.
Also `RwLock` was replaced with `Mutex` because the latter works well
with `Condvar`.

First part of the neondatabase/cloud#4433
2023-04-06 21:21:58 +02:00
Lassi Pölönen
41d364a8f1 Add more detailed logging to compute_ctl's shutdown (#3915)
Currently we don't see from the logs, if shutting down tracing takes
long time or not. We do see that shutting down computes gets delayed for
some reason and hits thhe grace period limit. Moving the shutdown
message to slightly later, when we don't have anything else than just
exit left.
## Issue ticket number and link

## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-03-30 22:02:39 +03:00
Heikki Linnakangas
6fdd9c10d1 Read storage auth token from spec file.
We read the pageserver connection string from the spec file, so let's
read the auth token from the same place.

We've been talking about pre-launching compute nodes that are not
associated with any particular tenant at startup, so that the spec
file is delivered to the compute node later. We cannot change the env
variables after the process has been launched.

We still pass the token to 'postgres' binary in the NEON_AUTH_TOKEN
env variable, but compute_ctl is now responsible for setting it.
2023-03-21 20:12:09 +02:00
Sam Kleinman
c79dd8d458 compute_ctl: support for fetching spec from control plane (#3610) 2023-02-23 13:19:39 -05:00
sharnoff
2153d2e00a Run compute_ctl in a cgroup in VMs (#3577) 2023-02-17 14:14:41 -08:00
Heikki Linnakangas
3e94fd5af3 Inherit OpenTelemetry context for compute startup from cloud console.
This allows fine-grained distributed tracing of the 'start_compute'
operation from the cloud console. The startup actions performed by
'compute_ctl' are now performed in a child of the 'start_compute'
context, so you can trace through the whole compute start operation.

This needs a corresponding change in the cloud console to fill in the
'startup_tracing_context' field in the json spec. If it's missing, the
startup operations are simply traced as a separate trace, without
a parent.
2023-01-26 15:20:03 +02:00
Heikki Linnakangas
006ee5f94a Configure 'compute_ctl' to use OpenTelemetry exporter.
This allows tracing the startup actions e.g. with Jaeger
(https://www.jaegertracing.io/). We use the "tracing-opentelemetry"
crate, which turns tracing spans into OpenTelemetry spans, so you can
use the usual "#[instrument]" directives to add tracing.

I put the tracing initialization code to a separate crate,
`tracing-utils`, so that we can reuse it in other programs. We
probably want to set up tracing in the same way in all our programs.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-01-26 15:20:03 +02:00
Heikki Linnakangas
e5cc2f92c4 Switch to 'tracing' for logging, restructure code to make use of spans.
Refactors Compute::prepare_and_run. It's split into subroutines
differently, to make it easier to attach tracing spans to the
different stages. The high-level logic for waiting for Postgres to
exit is moved to the caller.

Replace 'env_logger' with 'tracing', and add `#instrument` directives
to different stages fo the startup process. This is a fairly
mechanical change, except for the changes in 'spec.rs'. 'spec.rs'
contained some complicated formatting, where parts of log messages
were printed directly to stdout with `print`s. That was a bit messed
up because the log normally goes to stderr, but those lines were
printed to stdout. In our docker images, stderr and stdout both go to
the same place so you wouldn't notice, but I don't think it was
intentional.

This changes the log format to the default
'tracing_subscriber::format' format. It's different from the Postgres
log format, however, and because both compute_tools and Postgres print
to the same log, it's now a mix of two different formats.  I'm not
sure how the Grafana log parsing pipeline can handle that. If it's a
problem, we can build custom formatter to change the compute_tools log
format to be the same as Postgres's, like it was before this commit,
or we can change the Postgres log format to match tracing_formatter's,
or we can start printing compute_tool's log output to a different
destination than Postgres
2023-01-18 19:42:47 +02:00
sharnoff
5c6a7a17cb Add VM informant to vm-compute-node (#3324)
The general idea is that the VM informant binary is added to the
vm-compute-node images only. `compute_tools` then will run whatever's at
`/bin/vm-informant`, if the path exists.
2023-01-16 07:05:29 -08:00
Vadim Kharitonov
9b71215906 Simplify some functions in compute_tools and fix typo errors in func
name
2022-12-22 15:05:43 +01:00
Kirill Bulatov
c4ee62d427 Bump clap and other minor dependencies (#2623) 2022-10-17 12:58:40 +03:00
Heikki Linnakangas
d865892a06 Print full error with stacktrace, if compute node startup fails.
It failed in staging environment a few times, and all we got in the
logs was:

    ERROR could not start the compute node: failed to get basebackup@0/2D6194F8 from pageserver host=zenith-us-stage-ps-2.local port=6400
    giving control plane 30s to collect the error before shutdown

That's missing all the detail on *why* it failed.
2022-07-29 16:41:55 +03:00