Currently, the exporter exposes the same LFC metrics that are exposed by
the "autoscaling" sql_exporter in the docker image. With this, we can
remove the dedicated sql_exporter instance. (Actually doing the removal
is left as a TODO until this is rolled out to production and we have
changed autoscaling-agent to fetch the metrics from this new endpoint.)
The exporter runs as a Postgres background worker process. This is
extracted from the Rust communicator rewrite project, which will use the
same worker process for much more, to handle the communications with the
pageservers. For now, though, it merely handles the metrics requests.
In the future, we will add more metrics, and perhaps even APIs to
control the running Postgres instance.
The exporter listens on a Unix Domain socket within the Postgres data
directory. A Unix Domain socket is a bit unconventional, but it has some
advantages:
- Permissions are taken care of. Only processes that can access the data
directory, and therefore already have full access to the running
Postgres instance, can connect to it.
- No need to allocate and manage a new port number for the listener
It has some downsides too: it's not immediately accessible from the
outside world, and the functions to work with Unix Domain sockets are
more low-level than TCP sockets (see the symlink hack in
`postgres_metrics_client.rs`, for example).
To expose the metrics from the local Unix Domain Socket to the
autoscaling agent, introduce a new '/autoscaling_metrics' endpoint in
the compute_ctl's HTTP server. Currently it merely forwards the request
to the Postgres instance, but we could add rate limiting and access
control there in the future.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
Add support to transport syslogs over TLS. Since TLS params essentially
require passing host and port separately, add a boolean flag to the
configuration template and also use the same `action` format for
plaintext logs. This allows seamless transition.
The plaintext host:port is picked from `AUDIT_LOGGING_ENDPOINT` (as
earlier) and from `AUDIT_LOGGING_TLS_ENDPOINT`. The TLS host:port is
used when defined and non-empty.
`remote_endpoint` is split separately to hostname and port as required
by `omfwd` module.
Also the address parsing and config content generation are split to more
testable functions with basic tests added.
## Problem
`compute_ctl` should support gRPC base backups.
Requires #12111.
Requires #12243.
Touches #11926.
## Summary of changes
Support `grpc://` connstrings for `compute_ctl` base backups.
This makes it possible for the compiler to validate that a match block
matched all PostgreSQL versions we support.
## Problem
We did not have a complete picture about which places we had to test
against PG versions, and what format these versions were: The full PG
version ID format (Major/minor/bugfix `MMmmbb`) as transfered in
protocol messages, or only the Major release version (`MM`). This meant
type confusion was rampant.
With this change, it becomes easier to develop new version-dependent
features, by making type and niche confusion impossible.
## Summary of changes
Every use of `pg_version` is now typed as either `PgVersionId` (u32,
valued in decimal `MMmmbb`) or PgMajorVersion (an enum, with a value for
every major version we support, serialized and stored like a u32 with
the value of that major version)
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
Read replicas cannot grant permissions for roles for Neon RLS. Usually
the permission is already granted, so we can optimistically check. See
INC-509
## Summary of changes
Perform a permission lookup prior to actually executing any grants.
Add `/lfc/(prewarm|offload)` routes to `compute_ctl` which interact with
endpoint storage.
Add `prewarm_lfc_on_startup` spec option which, if enabled, downloads
LFC prewarm data on compute startup.
Resolves: https://github.com/neondatabase/cloud/issues/26343
## Problem
It seems are production-ready cert-manager setup now includes a full
certificate chain. This was not accounted for and the decoder would
error.
## Summary of changes
Change the way we decode certificates to support cert-chains, ignoring
all but the first cert.
This also changes a log line to not use multi-line errors.
~~I have tested this code manually against real certificates/keys, I
didn't want to embed those in a test just yet, not until the cert
expires in 24 hours.~~
Both crates seem well maintained. x509-cert is part of the high quality
RustCrypto project that we already make heavy use of, and I think it
makes sense to reduce the dependencies where possible.
Closes: https://github.com/neondatabase/cloud/issues/22998
If control-plane reports that TLS should be used, load the certificates
(and watch for updates), make sure postgres use them, and detects
updates.
Procedure:
1. Load certificates
2. Reconfigure postgres/pgbouncer
3. Loop on a timer until certificates have loaded
4. Go to 1
Notes:
1. We only run this procedure if requested on startup by control plane.
2. We needed to compile pgbouncer with openssl enabled
3. Postgres doesn't allow tls keys to be globally accessible - must be
read only to the postgres user. I couldn't convince the autoscaling team
to let me put this logic into the VM settings, so instead compute_ctl
will copy the keys to be read-only by postgres.
4. To mitigate a race condition, we also verify that the key matches the
cert.
The compute should only act if requests come from the control plane.
Signed-off-by: Tristan Partin <tristan@neon.tech>
Signed-off-by: Tristan Partin <tristan@neon.tech>
We lost this with the switch to axum for the HTTP server. Add it back.
In addition to just resurrecting the functionality we had before, pass
the tracing context of the /configure HTTP request to the start_postgres
operation that runs in the main thread. This way, the 'start_postgres'
and all its sub-spans like getting the basebackup become children of the
HTTP request span. This allows end-to-end tracing of a compute start,
all the way from the proxy to the SQL queries executed by compute_ctl as
part of compute startup.
Updates `compute_tools` and `compute_api` crates to edition 2024. We
like to stay on the latest edition if possible. There is no functional
changes, however some code changes had to be done to accommodate the
edition's breaking changes.
The PR has three commits:
* the first commit updates the named crates to edition 2024 and appeases
`cargo clippy` by changing code.
* the second commit performs a `cargo fmt` that does some minor changes
(not many)
* the third commit performs a cargo fmt with nightly options to reorder
imports as a one-time thing. it's completely optional, but I offer it
here for the compute team to review it.
I'd like to hear opinions about the third commit, if it's wanted and
felt worth the diff or not. I think most attention should be put onto
the first commit.
Part of #10918
This replaces the use of the awscli utility. awscli binary is massive,
it added about 200 MB to the docker image size, while the s3 client was
already a dependency so using that is essentially free, as far as binary
size is concerned.
I implemented a simple upload function that tries to keep 10 uploads
going in parallel. I believe that's the default behavior of the "aws s3
sync" command too.
There is now a compute_ctl_config field in the response that currently
only contains a JSON Web Key set. compute_ctl currently doesn't do
anything with the keys, but will in the future.
The reasoning for the new field is due to the nature of empty computes.
When an empty compute is created, it does not have a tenant. A compute
spec is the primary means of communicating the details of an attached
tenant. In the empty compute state, there is no spec. Instead we wait
for the control plane to pass us one via /configure. If we were to
include the jwks field in the compute spec, we would have a partial
compute spec, which doesn't logically make sense.
Instead, we can have two means of passing settings to the compute:
- spec: tenant specific config details
- compute_ctl_config: compute specific settings
For instance, the JSON Web Key set passed to the compute is independent
of any tenant. It is a setting of the compute whether it is attached or
not.
Signed-off-by: Tristan Partin <tristan@neon.tech>
Instead of generating our own request ID, we can just use the one
provided by the control plane. In the event, we get a request from a
client which doesn't set X-Request-ID, then we just generate one which
is useful for tracing purposes.
Signed-off-by: Tristan Partin <tristan@neon.tech>
This is a refactor to create better abstractions related to our
management server. It cleans up the code, and prepares everything for
authorized communication to and from the control plane.
Signed-off-by: Tristan Partin <tristan@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Stas Kelvic <stas@neon.tech>
# Context
This PR contains PoC-level changes for a product feature that allows
onboarding large databases into Neon without going through the regular
data path.
# Changes
This internal RFC provides all the context
* https://github.com/neondatabase/cloud/pull/19799
In the language of the RFC, this PR covers
* the Importer code (`fast_import`)
* all the Pageserver changes (mgmt API changes, flow implementation,
etc)
* a basic test for the Pageserver changes
# Reviewing
As acknowledged in the RFC, the code added in this PR is not ready for
general availability.
Also, the **architecture is not to be discussed in this PR**, but in the
RFC and associated Slack channel instead.
Reviewers of this PR should take that into consideration.
The quality bar to apply during review depends on what area of the code
is being reviewed:
* Importer code (`fast_import`): practically anything goes
* Core flow (`flow.rs`):
* Malicious input data must be expected and the existing threat models
apply.
* The code must not be safe to execute on *dedicated* Pageserver
instances:
* This means in particular that tenants *on other* Pageserver instances
must not be affected negatively wrt data confidentiality, integrity or
availability.
* Other code: the usual quality bar
* Pay special attention to correct use of gate guards, timeline
cancellation in all places during shutdown & migration, etc.
* Consider the broader system impact; if you find potentially
problematic interactions with Storage features that were not covered in
the RFC, bring that up during the review.
I recommend submitting three separate reviews, for the three high-level
areas with different quality bars.
# References
(Internal-only)
* refs https://github.com/neondatabase/cloud/issues/17507
* refs https://github.com/neondatabase/company_projects/issues/293
* refs https://github.com/neondatabase/company_projects/issues/309
* refs https://github.com/neondatabase/cloud/issues/20646
---------
Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
and add /metrics endpoint to compute_ctl to expose such metrics
metric format example for extension pg_rag
with versions 1.2.3 and 1.4.2
installed in 3 and 1 databases respectively:
neon_extensions_installed{extension="pg_rag", version="1.2.3"} = 3
neon_extensions_installed{extension="pg_rag", version="1.4.2"} = 1
------
infra part: https://github.com/neondatabase/flux-fleet/pull/251
---------
Co-authored-by: Tristan Partin <tristan@neon.tech>
1. Adds local-proxy to compute image and vm spec
2. Updates local-proxy config processing, writing PID to a file eagerly
3. Updates compute-ctl to understand local proxy compute spec and to
send SIGHUP to local-proxy over that pid.
closes https://github.com/neondatabase/cloud/issues/16867
Follow-up of #9234 to give hyper 1.0 the version-free name, and the
legacy version of hyper the one with the version number inside. As we
move away from hyper 0.14, we can remove the `hyper0` name piece by
piece.
Part of #9255
* tracing-utils now returns a `Layer` impl. Removes the need for crates
to
import OTel crates.
* Drop the /v1/traces URI check. Verified that the code does the right
thing.
* Leave a TODO to hook in an error handler for OTel to log errors to
when it
assumes the regular pipeline cannot be used/is broken.
Makes `flush_frozen_layer` add a barrier to the upload queue and makes
it wait for that barrier to be reached until it lets the flushing be
completed.
This gives us backpressure and ensures that writes can't build up in an
unbounded fashion.
Fixes#7317
Set core rmilit to ulimited in compute_ctl, so that all child processes
inherit it. We could also set rlimit in relevant startup script, but
that way we would depend on external setup and might inadvertently
disable it again (core dumping worked in pods, but not in VMs with
inittab-based startup).
## Problem
There are two cloud's features that require extra compute endpoints.
1. We are running pg_dump to get DB schemas. Currently, we are using a
special service for this. But it would be great to execute pg_dump in an
isolated environment. And we already have such an environment, it's our
compute! And likely enough pg_dump already exists there too! (see
https://github.com/neondatabase/cloud/issues/11644#issuecomment-2084617832)
2. We need to have a way to get databases and roles from compute after
time travel (see https://github.com/neondatabase/cloud/issues/12109)
## Summary of changes
It adds two API endpoints to compute_ctl HTTP API that target both of
the aforementioned cases.
---------
Co-authored-by: Tristan Partin <tristan@neon.tech>
Otherwise they are left orphaned when compute_ctl is terminated with a
signal. It was invisible most of the time because normally neon_local or k8s
kills postgres directly and then compute_ctl finishes gracefully. However, in
some tests compute_ctl gets stuck waiting for sync-safekeepers which
intentionally never ends because safekeepers are offline, and we want to stop
compute_ctl without leaving orphanes behind.
This is a quite rough approach which doesn't wait for children termination. A
better way would be to convert compute_ctl to async which would make waiting
easy.
- add pgbouncer_settings section to compute spec;
- add pgbouncer-connstr option to compute_ctl.
- add pgbouncer-ini-path option to compute_ctl. Default: /etc/pgbouncer/pgbouncer.ini
Apply pgbouncer config on compute start and respec to override default spec.
Save pgbouncer config updates to pgbouncer.ini to preserve them across pgbouncer restarts.
instead of direct S3 request.
Pros:
- simplify code a lot (no need to provide AWS credentials and paths);
- reduce latency of downloading extension data as proxy resides near
computes; -reduce AWS costs as proxy has cache and 1000 computes asking
the same extension will not generate 1000 downloads from S3.
- we can use only one S3 bucket to store extensions (and rid of regional
buckets which were introduced to reduce latency);
Changes:
- deprecate remote-ext-config compute_ctl parameter, use
http://pg-ext-s3-gateway if any old format remote-ext-cofig is provided;
- refactor tests to use mock http server;
Add infrastructure to dynamically load postgres extensions and shared libraries from remote extension storage.
Before postgres start downloads list of available remote extensions and libraries, and also downloads 'shared_preload_libraries'. After postgres is running, 'compute_ctl' listens for HTTP requests to load files.
Postgres has new GUC 'extension_server_port' to specify port on which 'compute_ctl' listens for requests.
When PostgreSQL requests a file, 'compute_ctl' downloads it.
See more details about feature design and remote extension storage layout in docs/rfcs/024-extension-loading.md
---------
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Alek Westover <alek.westover@gmail.com>
This is in preparation of using compute_ctl to launch postgres nodes
in the neon_local control plane. And seems like a good idea to
separate the public interfaces anyway.
One non-mechanical change here is that the 'metrics' field is moved
under the Mutex, instead of using atomics. We were not using atomics
for performance but for convenience here, and it seems more clear to
not use atomics in the model for the HTTP response type.
After enabling autoscaling, we faced the issue that customers are not
able to get the number of CPUs they use at this moment. Therefore I've
added these two options:
1. Postgresql function to allow customers to call it whenever they want
2. `compute_ctl` endpoint to show these number in console
This allows fine-grained distributed tracing of the 'start_compute'
operation from the cloud console. The startup actions performed by
'compute_ctl' are now performed in a child of the 'start_compute'
context, so you can trace through the whole compute start operation.
This needs a corresponding change in the cloud console to fill in the
'startup_tracing_context' field in the json spec. If it's missing, the
startup operations are simply traced as a separate trace, without
a parent.
This allows tracing the startup actions e.g. with Jaeger
(https://www.jaegertracing.io/). We use the "tracing-opentelemetry"
crate, which turns tracing spans into OpenTelemetry spans, so you can
use the usual "#[instrument]" directives to add tracing.
I put the tracing initialization code to a separate crate,
`tracing-utils`, so that we can reuse it in other programs. We
probably want to set up tracing in the same way in all our programs.
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Refactors Compute::prepare_and_run. It's split into subroutines
differently, to make it easier to attach tracing spans to the
different stages. The high-level logic for waiting for Postgres to
exit is moved to the caller.
Replace 'env_logger' with 'tracing', and add `#instrument` directives
to different stages fo the startup process. This is a fairly
mechanical change, except for the changes in 'spec.rs'. 'spec.rs'
contained some complicated formatting, where parts of log messages
were printed directly to stdout with `print`s. That was a bit messed
up because the log normally goes to stderr, but those lines were
printed to stdout. In our docker images, stderr and stdout both go to
the same place so you wouldn't notice, but I don't think it was
intentional.
This changes the log format to the default
'tracing_subscriber::format' format. It's different from the Postgres
log format, however, and because both compute_tools and Postgres print
to the same log, it's now a mix of two different formats. I'm not
sure how the Grafana log parsing pipeline can handle that. If it's a
problem, we can build custom formatter to change the compute_tools log
format to be the same as Postgres's, like it was before this commit,
or we can change the Postgres log format to match tracing_formatter's,
or we can start printing compute_tool's log output to a different
destination than Postgres
This fixes all kinds of problems related to missing params,
like broken timestamps (due to `integer_datetimes`).
This solution is not ideal, but it will help. Meanwhile,
I'm going to dedicate some time to improving connection machinery.
Note that this **does not** fix problems with passing certain parameters
in a reverse direction, i.e. **from client to compute**. This is a
separate matter and will be dealt with in an upcoming PR.
Compute node startup time is very important. After launching
PostgreSQL, use 'notify' to be notified immediately when it has
updated the PID file, instead of polling. The polling loop had 100 ms
interval so this shaves up to 100 ms from the startup time.