Currently, the exporter exposes the same LFC metrics that are exposed by
the "autoscaling" sql_exporter in the docker image. With this, we can
remove the dedicated sql_exporter instance. (Actually doing the removal
is left as a TODO until this is rolled out to production and we have
changed autoscaling-agent to fetch the metrics from this new endpoint.)
The exporter runs as a Postgres background worker process. This is
extracted from the Rust communicator rewrite project, which will use the
same worker process for much more, to handle the communications with the
pageservers. For now, though, it merely handles the metrics requests.
In the future, we will add more metrics, and perhaps even APIs to
control the running Postgres instance.
The exporter listens on a Unix Domain socket within the Postgres data
directory. A Unix Domain socket is a bit unconventional, but it has some
advantages:
- Permissions are taken care of. Only processes that can access the data
directory, and therefore already have full access to the running
Postgres instance, can connect to it.
- No need to allocate and manage a new port number for the listener
It has some downsides too: it's not immediately accessible from the
outside world, and the functions to work with Unix Domain sockets are
more low-level than TCP sockets (see the symlink hack in
`postgres_metrics_client.rs`, for example).
To expose the metrics from the local Unix Domain Socket to the
autoscaling agent, introduce a new '/autoscaling_metrics' endpoint in
the compute_ctl's HTTP server. Currently it merely forwards the request
to the Postgres instance, but we could add rate limiting and access
control there in the future.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
## Problem
See https://databricks.slack.com/archives/C092W8NBXC0/p1752924508578339
In case of larger number of databases and large `max_connections` we can
open too many connection for parallel apply config which may cause `Too
many open files` error.
## Summary of changes
Limit maximal number of parallel config apply connections by 100.
---------
Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com>
## Problem
We currently offload LFC state unconditionally, which can cause
problems. Imagine a situation:
1. Endpoint started with `autoprewarm: true`.
2. While prewarming is not completed, we upload the new incomplete
state.
3. Compute gets interrupted and restarts.
4. We start again and try to prewarm with the state from 2. instead of
the previous complete state.
During the orchestrated prewarming, it's probably not a big issue, but
it's still better to do not interfere with the prewarm process.
## Summary of changes
Do not offload LFC state if we are currently prewarming or any issue
occurred. While on it, also introduce `Skipped` LFC prewarm status,
which is used when the corresponding LFC state is not present in the
endpoint storage. It's primarily needed to distinguish the first compute
start for particular endpoint, as it's completely valid to do not have
LFC state yet.
## Problem
See https://databricks.slack.com/archives/C09254R641L/p1752004515032899
stripe_size GUC update may be delayed at different backends and so cause
inconsistency with connection strings (shard map).
## Summary of changes
Postmaster should store stripe_size in shared memory as well as
connection strings.
It should be also enforced that stripe size is defined prior to
connection strings in postgresql.conf
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com>
- Remove a few obsolete "allowed error messages" from tests. The
pageserver doesn't emit those messages anymore.
- Remove misplaced and outdated docstring comment from
`test_tenants.py`. A docstring is supposed to be the first thing in a
function, but we had added some code before it. And it was outdated, as
we haven't supported running without safekeepers for a long time.
- Fix misc typos in comments
- Remove obsolete comment about backwards compatibility with safekeepers
without `TIMELINE_STATUS` API. All safekeepers have it by now.
All Errors that can occur during get_installed_extensions() come from
tokio-postgres functions, e.g. if the database is being shut down
("FATAL: terminating connection due to administrator command"). I'm
seeing a lot of such errors in the logs with the regression tests, with
very verbose stack traces. The compute_ctl stack trace is pretty useless
for errors originating from the Postgres connection, the error message
has all the information, so stop printing the stack trace.
I changed the result type of the functions to return the originating
tokio_postgres Error rather than anyhow::Error, so that if we introduce
other error sources to the functions where the stack trace might be
useful, we'll be forced to revisit this, probably by introducing a new
Error type that separates postgres errors from other errors. But this
will do for now.
On December 8th, 2023, an engineering escalation (INC-110) was opened
after it was found that BYPASSRLS was being applied to all roles.
PR that introduced the issue:
https://github.com/neondatabase/neon/pull/5657
Subsequent commit on main:
ad99fa5f03
NOBYPASSRLS and INHERIT are the defaults for a Postgres role, but
because it isn't easy to know if a Postgres cluster is affected by the
issue, we need to keep the migration around for a long time, if not
indefinitely, so any cluster can be fixed.
Branching is the gift that keeps on giving...
Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
After https://github.com/neondatabase/neon/pull/12240 we observed
issues in our go code as `ComputeStatus` is not stateless, thus doesn't
deserialize as string.
```
could not check compute activity: json: cannot unmarshal object into Go struct field
ComputeState.status of type computeclient.ComputeStatus
```
- Fix this by splitting this status into two.
- Update compute OpenApi spec to reflect changes to `/terminate` in
previous PR
## Problem
If we have catalog update AND a pageserver migration batched in a single
spec, we will not be able to apply the spec (running the SQL) because
the compute is not attached to the right pageserver and we are not able
to read anything if we don't pick up the latest pageserver connstring.
This is not a case for now because cplane always schedules shard split /
pageserver migrations with `skip_pg_catalog_updates` (I suppose).
Context:
https://databricks.slack.com/archives/C09254R641L/p1752163559259399?thread_ts=1752160163.141149&cid=C09254R641L
With this fix, backpressure will likely not be able to affect
reconfigurations.
## Summary of changes
Do `pg_reload_conf` before we apply specs in SQL.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
This patch tightens up `page_api::Client`. It's mostly superficial
changes, but also adds a new constructor that takes an existing gRPC
channel, for use with the communicator connection pool.
- Add ComputeSpec flag `offload_lfc_interval_seconds` controlling
whether LFC should be offloaded to endpoint storage. Default value
(None) means "don't offload".
- Add glue code around it for `neon_local` and integration tests.
- Add `autoprewarm` mode for `test_lfc_prewarm` testing
`offload_lfc_interval_seconds` and `autoprewarm` flags in conjunction.
- Rename `compute_ctl_lfc_prewarm_requests_total` and
`compute_ctl_lfc_offload_requests_total` to
`compute_ctl_lfc_prewarms_total`
and `compute_ctl_lfc_offloads_total` to reflect we count prewarms and
offloads, not `compute_ctl` requests of those.
Don't count request in metrics if there is a prewarm/offload already
ongoing.
https://github.com/neondatabase/cloud/issues/19011
Resolves: https://github.com/neondatabase/cloud/issues/30770
## Problem
The background installed extensions worker relied on `interval.tick()`
to go to sleep for a period of time. This can lead to bugs due to the
interval being updated at the end of the loop as the first tick is
[instantaneous](https://docs.rs/tokio/latest/tokio/time/struct.Interval.html#method.tick).
## Summary of changes
Changed it to a `tokio::time::sleep` to prevent this issue. Now it puts
the thread to sleep and only wakes up after the specified duration
Add support to transport syslogs over TLS. Since TLS params essentially
require passing host and port separately, add a boolean flag to the
configuration template and also use the same `action` format for
plaintext logs. This allows seamless transition.
The plaintext host:port is picked from `AUDIT_LOGGING_ENDPOINT` (as
earlier) and from `AUDIT_LOGGING_TLS_ENDPOINT`. The TLS host:port is
used when defined and non-empty.
`remote_endpoint` is split separately to hostname and port as required
by `omfwd` module.
Also the address parsing and config content generation are split to more
testable functions with basic tests added.
## Problem
Previously, the background worker that collects the list of installed
extensions across DBs had a timeout set to 1 hour. This cause a problem
with computes that had a `suspend_timeout` > 1 hour as this collection
was treated as activity, preventing compute shutdown.
Issue: https://github.com/neondatabase/cloud/issues/30147
## Summary of changes
Passing the `suspend_timeout` as part of the `ComputeSpec` so that any
updates to this are taken into account by the background worker and
updates its collection interval.
## Problem
The gRPC API does not provide LSN leases.
## Summary of changes
* Add LSN lease support to the gRPC API.
* Use gRPC LSN leases for static computes with `grpc://` connstrings.
* Move `PageserverProtocol` into the `compute_api::spec` module and
reuse it.
## Problem
`compute_ctl` should support gRPC base backups.
Requires #12111.
Requires #12243.
Touches #11926.
## Summary of changes
Support `grpc://` connstrings for `compute_ctl` base backups.
pgaudit can spam logs due to all the monitoring that we do. Logs from
these connections are not necessary for HIPPA compliance, so we can stop
logging from those connections.
Part-of: https://github.com/neondatabase/cloud/issues/29574
Signed-off-by: Tristan Partin <tristan@neon.tech>
This makes it possible for the compiler to validate that a match block
matched all PostgreSQL versions we support.
## Problem
We did not have a complete picture about which places we had to test
against PG versions, and what format these versions were: The full PG
version ID format (Major/minor/bugfix `MMmmbb`) as transfered in
protocol messages, or only the Major release version (`MM`). This meant
type confusion was rampant.
With this change, it becomes easier to develop new version-dependent
features, by making type and niche confusion impossible.
## Summary of changes
Every use of `pg_version` is now typed as either `PgVersionId` (u32,
valued in decimal `MMmmbb`) or PgMajorVersion (an enum, with a value for
every major version we support, serialized and stored like a u32 with
the value of that major version)
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.
This is therefore a preparation PR (like similar PRs before it).
There is a lot of changes for this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).
The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
The previous behavior was for the compute to override control plane
options if there was a conflict. We want to change the behavior so that
the control plane has the absolute power on what is right. In the event
that we need a new option passed to the compute as soon as possible, we
can initially roll it out in the control plane, and then migrate the
option to EXTRA_OPTIONS within the compute later, for instance.
Signed-off-by: Tristan Partin <tristan@neon.tech>
- Add optional `?mode=fast|immediate` to `/terminate`, `fast` is
default. Immediate avoids waiting 30
seconds before returning from `terminate`.
- Add `TerminateMode` to `ComputeStatus::TerminationPending`
- Use `/terminate?mode=immediate` in `neon_local` instead of `pg_ctl
stop` for `test_replica_promotes`.
- Change `test_replica_promotes` to check returned LSN
- Annotate `finish_sync_safekeepers` as `noreturn`.
https://github.com/neondatabase/cloud/issues/29807
This check API only cheks the safekeeper_connstrings at the moment, and
the validation is limited to checking we have at least one entry in
there, and no duplicates.
## Problem
If the compute_ctl service is started with an empty list of safekeepers,
then hard-to-debug errors may happen at runtime, where it would be much
easier to catch them early.
## Summary of changes
Add an entry point in the compute_ctl API to validate the configuration
for safekeeper_connstrings.
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
After introducing a naive downtime calculation for the Postgres process
inside compute in https://github.com/neondatabase/neon/pull/11346, I
noticed that some amount of computes regularly report short downtime.
After checking some particular cases, it looks like all of them report
downtime close to the end of the life of the compute, i.e., when the
control plane calls a `/terminate` and we are waiting for Postgres to
exit.
Compute monitor also produces a lot of error logs because Postgres stops
accepting connections, but it's expected during the termination process.
## Summary of changes
Regularly check the compute status inside the main compute monitor loop
and exit gracefully when the compute is in some terminal or
soon-to-be-terminal state.
---------
Co-authored-by: Tristan Partin <tristan@neon.tech>
## Problem
Inbetween adding the TLS config for compute-ctl, and adding the TLS
config in controlplane, we switched from using a provision flag to a
bind flag. This happened to work in all of my testing in preview regions
as they have no VM pool, so each bind was also a provision. However, in
staging I found that the TLS config is still only processed during
provision, even though it's only sent on bind.
## Summary of changes
* Add a new feature flag value, `tls_experimental`, which tells
postgres/pgbouncer/local_proxy to use the TLS certificates on bind.
* compute_ctl on provision will be told where the certificates are,
instead of being told on bind.
Url::to_string() adds a trailing slash on the base URL, so when we did
the format!(), we were adding a double forward slash.
Signed-off-by: Tristan Partin <tristan@neon.tech>
JsonResponse::error() properly logs an error message which can be read
in the compute logs. invalid_status() was not going through that helper
function, thus not logging anything.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
I have a hypothesis that import might be using lower number of jobs than
max for the VM, where the job is running. This change will help finding
this out from logs
## Summary of changes
Added logging of number of jobs, which is passed into both `pg_dump` and
`pg_restore`
## Problem
Currently, we collect metrics of what extensions are installed on
computes at start up time. We do not have a mechanism that does this at
runtime.
## Summary of changes
Added a background thread that queries all DBs at regular intervals and
collects a list of installed extensions.
Previously we were using project-id/endpoint-id as SYSLOGTAG, which has
a
limit of 32 characters, so the endpoint-id got truncated.
The output is now in RFC5424 format, where the message is json encoded
with additional metadata `endpoint_id` and `project_id`
Also as pgaudit logs multiline messages, we now detect this by parsing
the timestamp in the specific format, and consider non-matching lines to
belong in the previous log message.
Using syslog structured-data would be an alternative, but leaning
towards json
due to being somewhat more generic.