rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-23 06:09:59 +00:00

Files

Tristan Partin 89554af1bd [BRC-1778] Have PG signal compute_ctl to refresh configuration if it suspects that it is talking to the wrong PSs (#12712 )

## Problem

This is a follow-up to TODO, as part
of the effort to rewire the compute reconfiguration/notification
mechanism to make it more robust. Please refer to that commit or ticket
BRC-1778 for full context of the problem.

## Summary of changes

The previous change added mechanism in `compute_ctl` that makes it
possible to refresh the configuration of PG on-demand by having
`compute_ctl` go out to download a new config from the control
plane/HCC. This change wired this mechanism up with PG so that PG will
signal `compute_ctl` to refresh its configuration when it suspects that
it could be talking to incorrect pageservers due to a stale
configuration.

PG will become suspicious that it is talking to the wrong pageservers in
the following situations:
1. It cannot connect to a pageserver (e.g., getting a network-level
connection refused error)
2. It can connect to a pageserver, but the pageserver does not return
any data for the GetPage request
3. It can connect to a pageserver, but the pageserver returns a
malformed response
4. It can connect to a pageserver, but there is an error receiving the
GetPage request response for any other reason

This change also includes a minor tweak to `compute_ctl`'s config
refresh behavior. Upon receiving a request to refresh PG configuration,
`compute_ctl` will reach out to download a config, but it will not
attempt to apply the configuration if the config is the same as the old
config is it replacing. This optimization is added because the act of
reconfiguring itself requires working pageserver connections. In many
failure situations it is likely that PG detects an issue with a
pageserver before the control plane can detect the issue, migrate
tenants, and update the compute config. In this case even the latest
compute config won't point PG to working pageservers, causing the
configuration attempt to hang and negatively impact PG's
time-to-recovery. With this change, `compute_ctl` only attempts
reconfiguration if the refreshed config points PG to different
pageservers.

## How is this tested?

The new code paths are exercised in all existing tests because this
mechanism is on by default.

Explicitly tested in `test_runner/regress/test_change_pageserver.py`.

Co-authored-by: William Huang <william.huang@databricks.com>

2025-07-24 16:44:45 +00:00

src

[BRC-1778] Have PG signal compute_ctl to refresh configuration if it suspects that it is talking to the wrong PSs (#12712 )

2025-07-24 16:44:45 +00:00

tests

offload_lfc_interval_seconds in ComputeSpec (#12447 )

2025-07-04 18:49:57 +00:00

.dockerignore

Move compute_tools from console repo (zenithdb/console#383 )

2021-12-28 20:17:29 +03:00

.gitignore

Move compute_tools from console repo (zenithdb/console#383 )

2021-12-28 20:17:29 +03:00

Cargo.toml

A few more compute_tool changes (#12687 )

2025-07-23 18:30:33 +00:00

README.md

[BRC-1778] Add mechanism to compute_ctl to pull a new config (#12711 )

2025-07-24 14:26:21 +00:00

rustfmt.toml

Move compute_tools from console repo (zenithdb/console#383 )

2021-12-28 20:17:29 +03:00

README.md

Compute node tools

Postgres wrapper (compute_ctl) is intended to be run as a Docker entrypoint or as a systemd ExecStart option. It will handle all the Neon specifics during compute node initialization:

compute_ctl accepts cluster (compute node) specification as a JSON file.
Every start is a fresh start, so the data directory is removed and initialized again on each run.
Next it will put configuration files into the PGDATA directory.
Sync safekeepers and get commit LSN.
Get basebackup from pageserver using the returned on the previous step LSN.
Try to start postgres and wait until it is ready to accept connections.
Check and alter/drop/create roles and databases.
Hang waiting on the postmaster process to exit.

Also compute_ctl spawns two separate service threads:

compute-monitor checks the last Postgres activity timestamp and saves it into the shared ComputeNode;
http-endpoint runs a Hyper HTTP API server, which serves readiness and the last activity requests.

If AUTOSCALING environment variable is set, compute_ctl will start the vm-monitor located in [neon/libs/vm_monitor]. For VM compute nodes, vm-monitor communicates with the VM autoscaling system. It coordinates downscaling and requests immediate upscaling under resource pressure.

Usage example:

compute_ctl -D /var/db/postgres/compute \
            -C 'postgresql://cloud_admin@localhost/postgres' \
            -S /var/db/postgres/specs/current.json \
            -b /usr/local/bin/postgres

State Diagram

Computes can be in various states. Below is a diagram that details how a compute moves between states.

%% https://mermaid.js.org/syntax/stateDiagram.html
stateDiagram-v2
  [*] --> Empty : Compute spawned
  Empty --> ConfigurationPending : Waiting for compute spec
  ConfigurationPending --> Configuration : Received compute spec
  Configuration --> Failed : Failed to configure the compute
  Configuration --> Running : Compute has been configured
  Empty --> Init : Compute spec is immediately available
  Empty --> TerminationPendingFast : Requested termination
  Empty --> TerminationPendingImmediate : Requested termination
  Init --> Failed : Failed to start Postgres
  Init --> Running : Started Postgres
  Running --> TerminationPendingFast : Requested termination
  Running --> TerminationPendingImmediate : Requested termination
  Running --> ConfigurationPending : Received a /configure request with spec
  Running --> RefreshConfigurationPending : Received a /refresh_configuration request, compute node will pull a new spec and reconfigure
  RefreshConfigurationPending --> Running : Compute  has been re-configured
  TerminationPendingFast --> Terminated compute with 30s delay for cplane to inspect status
  TerminationPendingImmediate --> Terminated : Terminated compute immediately
  Running --> TerminationPending : Requested termination
  TerminationPending --> Terminated : Terminated compute
  Failed --> RefreshConfigurationPending : Received a /refresh_configuration request
  Failed --> [*] : Compute exited
  Terminated --> [*] : Compute exited

Tests

Cargo formatter:

cargo fmt

Run tests:

cargo test

Clippy linter:

cargo clippy --all --all-targets -- -Dwarnings -Drust-2018-idioms

Cross-platform compilation

Imaging that you are on macOS (x86) and you want a Linux GNU (x86_64-unknown-linux-gnu platform in rust terminology) executable.

Using docker

You can use a throw-away Docker container (rustlang/rust image) for doing that:

docker run --rm \
    -v $(pwd):/compute_tools \
    -w /compute_tools \
    -t rustlang/rust:nightly cargo build --release --target=x86_64-unknown-linux-gnu

or one-line:

docker run --rm -v $(pwd):/compute_tools -w /compute_tools -t rust:latest cargo build --release --target=x86_64-unknown-linux-gnu

Using rust native cross-compilation

Another way is to add x86_64-unknown-linux-gnu target on your host system:

rustup target add x86_64-unknown-linux-gnu

Install macOS cross-compiler toolchain:

brew tap SergioBenitez/osxct
brew install x86_64-unknown-linux-gnu

And finally run cargo build:

CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_LINKER=x86_64-unknown-linux-gnu-gcc cargo build --target=x86_64-unknown-linux-gnu --release