It was using `PRECONDITION_FAILED` for errors during `ComputeSpec` to
`ParsedSpec` conversion, but this disobeys the OpenAPI spec [1] and
correct code should be `BAD_REQUEST` for any spec processing errors.
While on it, I also noticed that `compute_ctl` OpenAPI spec has an
invalid format and fixed it.
[1] fd81945a60/compute_tools/src/http/openapi_spec.yaml (L119-L120)
Add infrastructure to dynamically load postgres extensions and shared libraries from remote extension storage.
Before postgres start downloads list of available remote extensions and libraries, and also downloads 'shared_preload_libraries'. After postgres is running, 'compute_ctl' listens for HTTP requests to load files.
Postgres has new GUC 'extension_server_port' to specify port on which 'compute_ctl' listens for requests.
When PostgreSQL requests a file, 'compute_ctl' downloads it.
See more details about feature design and remote extension storage layout in docs/rfcs/024-extension-loading.md
---------
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Alek Westover <alek.westover@gmail.com>
Our scale-to-zero logic was optimized for short auto-suspend intervals,
e.g. minutes or hours. In this case, if compute was restarted by k8s due
to some reason (OOM, k8s node went down, pod relocation, etc.),
`last_active` got bumped, we start counting auto-suspend timeout again.
It's not a big deal, i.e. we suspend completely idle compute not after 5
minutes, but after 10 minutes or so.
Yet, some clients may want days or even weeks. And chance that compute
could be restarted during this interval is pretty high, but in this case
we could be not able to suspend some computes for weeks.
After this commit, we won't initialize `last_active` on start, so
`/status` could return an unset attribute. This means that there was no
user activity since start. Control-plane should deal with it by taking
`max()` out of all available activity timestamps: `started_at`,
`last_active`, etc.
compute_ctl part of neondatabase/cloud#4853
Do several attempts to get spec from the control-plane and retry network
errors and all reasonable HTTP response codes. Do not hang waiting for
spec without confirmation from the control-plane that compute is known
and is in the `Empty` state.
Adjust the way we track `total_startup_ms` metric, it should be
calculated since the moment we received spec, not from the moment
`compute_ctl` started. Also introduce a new `wait_for_spec_ms` metric
to track the time spent sleeping and waiting for spec to be delivered
from control-plane.
Part of neondatabase/cloud#3533
This commit adds an option to start compute without spec and then pass
it a valid spec via `POST /configure` API endpoint. This is a main
prerequisite for maintaining the pool of compute nodes in the
control-plane.
For example:
1. Start compute with
```shell
cargo run --bin compute_ctl -- -i no-compute \
-p http://localhost:9095 \
-D compute_pgdata \
-C "postgresql://cloud_admin@127.0.0.1:5434/postgres" \
-b ./pg_install/v15/bin/postgres
```
2. Configure it with
```shell
curl -d "{\"spec\": $(cat ./compute-spec.json)}" http://localhost:3080/configure
```
Internally, it's implemented using a `Condvar` + `Mutex`. Compute spec
is moved under Mutex, as it's now could be updated in the http handler.
Also `RwLock` was replaced with `Mutex` because the latter works well
with `Condvar`.
First part of the neondatabase/cloud#4433
After enabling autoscaling, we faced the issue that customers are not
able to get the number of CPUs they use at this moment. Therefore I've
added these two options:
1. Postgresql function to allow customers to call it whenever they want
2. `compute_ctl` endpoint to show these number in console
This commit adds a basic HTTP API endpoint that allows scraping the
`pg_stat_statements` data and getting a list of slow queries. New
insights like cache hit rate and so on could be added later.
Extension `pg_stat_statements` is checked / created only if compute
tries to load the corresponding shared library. The latter is configured
by control-plane and currently covered with feature flag.
Co-authored by Eduard Dyckman (bird.duskpoet@gmail.com)
+ neondatabase/cloud#1103
This adds a couple of control endpoints to simplify compute state
discovery for control-plane. For example, now we may figure out
that Postgres wasn't able to start or basebackup failed within
seconds instead of just blindly polling the compute readiness
for a minute or two.
Also we now expose startup metrics (time of the each step: basebackup,
sync safekeepers, config, total). Console grabs them after each
successful start and report as histogram to prometheus and grafana.
OpenAPI spec is added and up-tp date, but is not currently used in the
console yet.