Temporarily reset neon.max_cluster_size to avoid
the possibility of hitting the limit, while we are applying config:
creating new extensions, roles, etc...
Add infrastructure to dynamically load postgres extensions and shared libraries from remote extension storage.
Before postgres start downloads list of available remote extensions and libraries, and also downloads 'shared_preload_libraries'. After postgres is running, 'compute_ctl' listens for HTTP requests to load files.
Postgres has new GUC 'extension_server_port' to specify port on which 'compute_ctl' listens for requests.
When PostgreSQL requests a file, 'compute_ctl' downloads it.
See more details about feature design and remote extension storage layout in docs/rfcs/024-extension-loading.md
---------
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Alek Westover <alek.westover@gmail.com>
If database was created with `is_template true` Postgres doesn't allow
dropping it right away and throws error
```
ERROR: cannot drop a template database
```
so we have to unset `is_template` first.
Fixing it, I noticed that our `escape_literal` isn't exactly correct
and following the same logic as in `quote_literal_internal`, we need to
prepend string with `E`. Otherwise, it's not possible to filter
`pg_database` using `escape_literal()` result if name contains `\`, for
example.
Also use `FORCE` to drop database even if there are active connections.
We run this from `cloud_admin`, so it should have enough privileges.
NB: there could be other db states, which prevent us from dropping
the database. For example, if db is used by any active subscription
or logical replication slot.
TODO: deal with it once we allow logical replication. Proper fix should
involve returning an error code to the control plane, so it could
figure out that this is a non-retryable error, return it to the user and
mark operation as permanently failed.
Related to neondatabase/cloud#4258
This adds test coverage for 'compute_ctl', as it is now used by all
the python tests.
There are a few differences in how 'compute_ctl' is called in the
tests, compared to the real web console:
- In the tests, the postgresql.conf file is included as one large
string in the spec file, and it is written out as it is to the data
directory. I added a new field for that to the spec file. The real
web console, however, sets all the necessary settings in the
'settings' field, and 'compute_ctl' creates the postgresql.conf from
those settings.
- In the tests, the information needed to connect to the storage, i.e.
tenant_id, timeline_id, connection strings to pageserver and
safekeepers, are now passed as new fields in the spec file. The real
web console includes them as the GUCs in the 'settings' field. (Both
of these are different from what the test control plane used to do:
It used to write the GUCs directly in the postgresql.conf file). The
plan is to change the control plane to use the new method, and
remove the old method, but for now, support both.
Some tests that were sensitive to the amount of WAL generated needed
small changes, to accommodate that compute_ctl runs the background
health monitor which makes a few small updates. Also some tests shut
down the pageserver, and now that the background health check can run
some queries while the pageserver is down, that can produce a few
extra errors in the logs, which needed to be allowlisted.
Other changes:
- remove obsolete comments about PostgresNode;
- create standby.signal file for Static compute node;
- log output of `compute_ctl` and `postgres` is merged into
`endpoints/compute.log`.
---------
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
This is in preparation of using compute_ctl to launch postgres nodes
in the neon_local control plane. And seems like a good idea to
separate the public interfaces anyway.
One non-mechanical change here is that the 'metrics' field is moved
under the Mutex, instead of using atomics. We were not using atomics
for performance but for convenience here, and it seems more clear to
not use atomics in the model for the HTTP response type.
+ neondatabase/cloud#1103
This adds a couple of control endpoints to simplify compute state
discovery for control-plane. For example, now we may figure out
that Postgres wasn't able to start or basebackup failed within
seconds instead of just blindly polling the compute readiness
for a minute or two.
Also we now expose startup metrics (time of the each step: basebackup,
sync safekeepers, config, total). Console grabs them after each
successful start and report as histogram to prometheus and grafana.
OpenAPI spec is added and up-tp date, but is not currently used in the
console yet.
Currently it's included with minimal changes and lives aside of the main
workspace. Later we may re-use and combine common parts with zenith
control_plane.
This change is mostly needed to unify cloud deployment pipeline:
1.1. build compute-tools image
1.2. build compute-node image based on the freshly built compute-tools
2. build zenith image
So we can roll new compute image and new storage required by it to
operate properly. Also it becomes easier to test console against some
specific version of compute-node/-tools.