rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-05 20:42:54 +00:00

Tristan Partin 512210bb5a [BRC-2368] Add PS and compute_ctl metrics to report pagestream request errors (#12716)

## Problem

In our experience running the system so far, almost all of the "hang
compute" situations are due to the compute (postgres) pointing at the
wrong pageservers. We currently rely mainly on the Prometheus exporter
(PGExporter) running on PG to detect and report any downtime, but this
can be unreliable because the read and write probes the PGExporter runs
do not always generate pageserver requests due to caching, even though
the real user might be experiencing downtime when touching uncached
pages.

We are also about to start disk-wiping node pool rotation operations in
prod clusters for our pageservers, and it is critical to have a
convenient way to monitor the impact of these node pool rotations so
that we can quickly respond to any issues. These metrics should provide
very clear signals to address this operational need.

## Summary of changes

Added a pair of metrics to detect issues in postgres' PageStream
protocol communications (e.g. get_page_at_lsn, get_base_backup, etc.)
with pageservers:
* On the compute node (compute_ctl), exports a counter metric that is
incremented every time postgres requests a configuration refresh.
Postgres today only requests these configuration refreshes when it
cannot connect to a pageserver or if the pageserver rejects its request
by disconnecting.
* On the pageserver, exports a counter metric that is incremented every
time it receives a PageStream request that cannot be handled because the
tenant is not known or if the request was routed to the wrong shard
(e.g. secondary).
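
As a rough illustration of the two counters described above, the sketch below uses plain atomics from the Rust standard library. The names `ConnectivityMetrics`, `RejectReason`, and all field and method names are hypothetical; the actual change presumably registers its counters with the project's metrics library, which is not shown here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Why a pageserver could not handle a PageStream request (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RejectReason {
    /// The tenant is not known to this pageserver.
    UnknownTenant,
    /// The request was routed to the wrong shard (e.g. a secondary).
    WrongShard,
}

/// Monotonic counters mirroring the two metrics (all names are assumptions).
#[derive(Debug, Default)]
pub struct ConnectivityMetrics {
    /// compute_ctl side: postgres asked for a configuration refresh.
    pub config_refresh_requests: AtomicU64,
    /// pageserver side: a PageStream request could not be handled.
    pub rejected_pagestream_requests: AtomicU64,
}

impl ConnectivityMetrics {
    /// Record one configuration refresh request from postgres.
    pub fn on_config_refresh_requested(&self) {
        self.config_refresh_requests.fetch_add(1, Ordering::Relaxed);
    }

    /// Record one rejected PageStream request. The reason is accepted only to
    /// show where a per-reason label could be attached; it is not stored here.
    pub fn on_pagestream_rejected(&self, _reason: RejectReason) {
        self.rejected_pagestream_requests
            .fetch_add(1, Ordering::Relaxed);
    }

    /// Current values as (refresh requests, rejected requests).
    pub fn snapshot(&self) -> (u64, u64) {
        (
            self.config_refresh_requests.load(Ordering::Relaxed),
            self.rejected_pagestream_requests.load(Ordering::Relaxed),
        )
    }
}
```

Counters only ever increase, so `Ordering::Relaxed` is enough here: readers need the running totals, not ordering against other memory operations.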

### How I plan to use metrics
I plan to use the metrics added here to create alerts. The alerts can
fire, for example, if these counters have been continuously increasing
for over a certain period of time. During rollouts, misrouted requests
may occasionally happen, but they should soon die down as
reconfigurations make progress. We can start with something like raising
the alert if the counters have been increasing continuously for over 5
minutes.
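
The "increasing continuously for over 5 minutes" condition can be sketched as follows. This only illustrates the logic; a real alert would most likely be written in the monitoring system's own rule language. The `Sample` type and the function `increasing_for` are hypothetical, and timestamps are assumed to be in seconds and sorted oldest first.

```rust
/// One scrape of a counter: (timestamp in seconds, counter value).
pub type Sample = (u64, u64);

/// Returns true if the counter rose between every pair of consecutive samples
/// over a trailing span of at least `window_secs` seconds.
///
/// `samples` must be sorted by timestamp, oldest first.
pub fn increasing_for(samples: &[Sample], window_secs: u64) -> bool {
    let last_ts = match samples.last() {
        Some(&(ts, _)) => ts,
        None => return false,
    };
    // Walk backwards from the newest sample while each step shows an increase.
    let mut streak_start = last_ts;
    for pair in samples.windows(2).rev() {
        let (prev, next) = (pair[0], pair[1]);
        if next.1 <= prev.1 {
            break; // counter was flat (or reset) here, so the streak ends
        }
        streak_start = prev.0;
    }
    last_ts.saturating_sub(streak_start) >= window_secs
}
```

With one scrape per minute, `increasing_for(&samples, 300)` fires only when the six most recent samples are strictly increasing, which tolerates short bursts of misrouted requests during a rollout.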

## How is this tested?

New integration tests in
`test_runner/regress/test_hadron_ps_connectivity_metrics.py`

Co-authored-by: William Huang <william.huang@databricks.com>

2025-07-24 19:05:00 +00:00

| Name | Last commit | Date |
|---|---|---|
| src | [BRC-2368] Add PS and compute_ctl metrics to report pagestream request errors (#12716) | 2025-07-24 19:05:00 +00:00 |
| tests | offload_lfc_interval_seconds in ComputeSpec (#12447) | 2025-07-04 18:49:57 +00:00 |
| .dockerignore | Move compute_tools from console repo (zenithdb/console#383) | 2021-12-28 20:17:29 +03:00 |
| .gitignore | Move compute_tools from console repo (zenithdb/console#383) | 2021-12-28 20:17:29 +03:00 |
| Cargo.toml | A few more compute_tool changes (#12687) | 2025-07-23 18:30:33 +00:00 |
| README.md | [BRC-1778] Add mechanism to compute_ctl to pull a new config (#12711) | 2025-07-24 14:26:21 +00:00 |
| rustfmt.toml | Move compute_tools from console repo (zenithdb/console#383) | 2021-12-28 20:17:29 +03:00 |

README.md

Compute node tools

Postgres wrapper (compute_ctl) is intended to be run as a Docker entrypoint or as a systemd ExecStart option. It will handle all the Neon specifics during compute node initialization:

* compute_ctl accepts the cluster (compute node) specification as a JSON file.
* Every start is a fresh start, so the data directory is removed and initialized again on each run.
* Next, it puts configuration files into the PGDATA directory.
* It syncs safekeepers and gets the commit LSN.
* It gets the basebackup from the pageserver using the LSN returned in the previous step.
* It tries to start postgres and waits until it is ready to accept connections.
* It checks and alters/drops/creates roles and databases.
* It blocks, waiting for the postmaster process to exit.
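
The ordering of the initialization steps listed above can be sketched as a small driver that runs each step in turn and stops at the first failure. The `StartupStep` enum, `STARTUP_ORDER`, and `run_startup` are illustrative names only and are not taken from the compute_ctl source.

```rust
/// Startup steps in the order listed above (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartupStep {
    ReadSpec,
    ResetDataDir,
    WriteConfig,
    SyncSafekeepers,
    GetBasebackup,
    StartPostgres,
    ApplyRolesAndDatabases,
    WaitForPostmasterExit,
}

/// The fixed order in which the steps run.
pub const STARTUP_ORDER: [StartupStep; 8] = [
    StartupStep::ReadSpec,
    StartupStep::ResetDataDir,
    StartupStep::WriteConfig,
    StartupStep::SyncSafekeepers,
    StartupStep::GetBasebackup,
    StartupStep::StartPostgres,
    StartupStep::ApplyRolesAndDatabases,
    StartupStep::WaitForPostmasterExit,
];

/// Run `action` for each step in order. On the first error, stop and report
/// which step failed together with its message.
pub fn run_startup<F>(mut action: F) -> Result<(), (StartupStep, String)>
where
    F: FnMut(StartupStep) -> Result<(), String>,
{
    for step in STARTUP_ORDER {
        action(step).map_err(|e| (step, e))?;
    }
    Ok(())
}
```

Because later steps depend on earlier ones (the basebackup needs the LSN obtained from the safekeepers, for example), the sketch never continues past a failed step.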

compute_ctl also spawns two separate service threads:

* compute-monitor checks the last Postgres activity timestamp and saves it into the shared ComputeNode;
* http-endpoint runs a Hyper HTTP API server, which serves readiness and last-activity requests.
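
A minimal sketch of the monitor-thread pattern described above, assuming a shared state struct guarded by a mutex. The struct below is a stand-in that borrows the `ComputeNode` name, not the real type; `spawn_monitor`, `last_activity`, the `probe` closure, and the bounded `rounds` parameter are all hypothetical (the real monitor presumably runs for the lifetime of the process and queries Postgres itself).

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, SystemTime};

/// Stand-in for the shared compute state (only the field this sketch needs).
#[derive(Debug, Default)]
pub struct ComputeNode {
    pub last_active: Mutex<Option<SystemTime>>,
}

/// Spawn a monitor thread that calls `probe` `rounds` times, `interval` apart,
/// and stores the newest activity timestamp it reports.
pub fn spawn_monitor<F>(
    node: Arc<ComputeNode>,
    interval: Duration,
    rounds: usize,
    mut probe: F,
) -> thread::JoinHandle<()>
where
    F: FnMut() -> Option<SystemTime> + Send + 'static,
{
    thread::spawn(move || {
        for _ in 0..rounds {
            if let Some(seen) = probe() {
                let mut last = node.last_active.lock().unwrap();
                // Only ever move the timestamp forward.
                let newer = match *last {
                    Some(prev) => seen > prev,
                    None => true,
                };
                if newer {
                    *last = Some(seen);
                }
            }
            thread::sleep(interval);
        }
    })
}

/// What an HTTP handler could read when asked for the last activity.
pub fn last_activity(node: &ComputeNode) -> Option<SystemTime> {
    *node.last_active.lock().unwrap()
}
```

Keeping the timestamp behind a shared handle lets the HTTP thread answer last-activity requests without talking to Postgres directly.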

If the AUTOSCALING environment variable is set, compute_ctl will start the vm-monitor located in [neon/libs/vm_monitor]. For VM compute nodes, vm-monitor communicates with the VM autoscaling system. It coordinates downscaling and requests immediate upscaling under resource pressure.

Usage example:

compute_ctl -D /var/db/postgres/compute \
            -C 'postgresql://cloud_admin@localhost/postgres' \
            -S /var/db/postgres/specs/current.json \
            -b /usr/local/bin/postgres
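
To make the shape of this command line concrete, here is a hand-rolled sketch that collects the four short options used in the example. It is not how compute_ctl parses its arguments; the `Args` struct, its field names, and `parse_args` are assumptions, and only the flags `-D`, `-C`, `-S`, and `-b` come from the example above.

```rust
/// The four options used in the example above (field names are assumptions).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Args {
    pub pgdata: Option<String>,    // -D
    pub connstr: Option<String>,   // -C
    pub spec_path: Option<String>, // -S
    pub pgbin: Option<String>,     // -b
}

/// Parse `flag value` pairs; reject unknown flags and flags without a value.
pub fn parse_args<I>(argv: I) -> Result<Args, String>
where
    I: IntoIterator<Item = String>,
{
    let mut args = Args::default();
    let mut it = argv.into_iter();
    while let Some(flag) = it.next() {
        // Pick the field that this flag fills in.
        let slot = match flag.as_str() {
            "-D" => &mut args.pgdata,
            "-C" => &mut args.connstr,
            "-S" => &mut args.spec_path,
            "-b" => &mut args.pgbin,
            other => return Err(format!("unknown option: {}", other)),
        };
        let value = it
            .next()
            .ok_or_else(|| format!("missing value for {}", flag))?;
        *slot = Some(value);
    }
    Ok(args)
}
```

In a program, this could be called as `parse_args(std::env::args().skip(1))`.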

State Diagram

Computes can be in various states. Below is a diagram that details how a compute moves between states.

%% https://mermaid.js.org/syntax/stateDiagram.html
stateDiagram-v2
  [*] --> Empty : Compute spawned
  Empty --> ConfigurationPending : Waiting for compute spec
  ConfigurationPending --> Configuration : Received compute spec
  Configuration --> Failed : Failed to configure the compute
  Configuration --> Running : Compute has been configured
  Empty --> Init : Compute spec is immediately available
  Empty --> TerminationPendingFast : Requested termination
  Empty --> TerminationPendingImmediate : Requested termination
  Init --> Failed : Failed to start Postgres
  Init --> Running : Started Postgres
  Running --> TerminationPendingFast : Requested termination
  Running --> TerminationPendingImmediate : Requested termination
  Running --> ConfigurationPending : Received a /configure request with spec
  Running --> RefreshConfigurationPending : Received a /refresh_configuration request, compute node will pull a new spec and reconfigure
  RefreshConfigurationPending --> Running : Compute has been re-configured
  TerminationPendingFast --> Terminated : Terminated compute with 30s delay for cplane to inspect status
  TerminationPendingImmediate --> Terminated : Terminated compute immediately
  Running --> TerminationPending : Requested termination
  TerminationPending --> Terminated : Terminated compute
  Failed --> RefreshConfigurationPending : Received a /refresh_configuration request
  Failed --> [*] : Compute exited
  Terminated --> [*] : Compute exited
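
The edges in the diagram above can also be written down as a transition check. This is a sketch derived from the diagram only: the enum borrows the state names shown there, `can_transition` is a hypothetical helper, and the `[*]` start and end markers are left out.

```rust
/// Compute states named in the diagram above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComputeStatus {
    Empty,
    ConfigurationPending,
    Configuration,
    Init,
    Running,
    RefreshConfigurationPending,
    TerminationPendingFast,
    TerminationPendingImmediate,
    TerminationPending,
    Terminated,
    Failed,
}

/// Returns true if the diagram has an edge `from --> to`.
pub fn can_transition(from: ComputeStatus, to: ComputeStatus) -> bool {
    use ComputeStatus::*;
    matches!(
        (from, to),
        (
            Empty,
            ConfigurationPending | Init | TerminationPendingFast | TerminationPendingImmediate
        ) | (ConfigurationPending, Configuration)
            | (Configuration, Failed | Running)
            | (Init, Failed | Running)
            | (
                Running,
                TerminationPendingFast
                    | TerminationPendingImmediate
                    | TerminationPending
                    | ConfigurationPending
                    | RefreshConfigurationPending
            )
            | (RefreshConfigurationPending, Running)
            | (
                TerminationPendingFast | TerminationPendingImmediate | TerminationPending,
                Terminated
            )
            | (Failed, RefreshConfigurationPending)
    )
}
```

For example, `can_transition(ComputeStatus::Failed, ComputeStatus::RefreshConfigurationPending)` is true, while a terminated compute has no outgoing edges other than exiting.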

Tests

Cargo formatter:

cargo fmt

Run tests:

cargo test

Clippy linter:

cargo clippy --all --all-targets -- -Dwarnings -Drust-2018-idioms

Cross-platform compilation

Imagine that you are on macOS (x86) and you want a Linux GNU executable (the x86_64-unknown-linux-gnu platform in Rust terminology).

Using Docker

You can use a throw-away Docker container (rustlang/rust image) for doing that:

docker run --rm \
    -v $(pwd):/compute_tools \
    -w /compute_tools \
    -t rustlang/rust:nightly cargo build --release --target=x86_64-unknown-linux-gnu

or, as a one-liner (note that this variant uses the rust:latest image instead of rustlang/rust:nightly):

docker run --rm -v $(pwd):/compute_tools -w /compute_tools -t rust:latest cargo build --release --target=x86_64-unknown-linux-gnu

Using Rust native cross-compilation

Another way is to add the x86_64-unknown-linux-gnu target on your host system:

rustup target add x86_64-unknown-linux-gnu

Install macOS cross-compiler toolchain:

brew tap SergioBenitez/osxct
brew install x86_64-unknown-linux-gnu

And finally run cargo build:

CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_LINKER=x86_64-unknown-linux-gnu-gcc cargo build --target=x86_64-unknown-linux-gnu --release