## Problem

This is a follow-up to TODO, as part of the effort to rewire the compute reconfiguration/notification mechanism to make it more robust. Please refer to that commit or ticket BRC-1778 for full context of the problem.

## Summary of changes

The previous change added a mechanism in `compute_ctl` that makes it possible to refresh the configuration of PG on demand by having `compute_ctl` download a new config from the control plane/HCC. This change wires this mechanism up with PG so that PG signals `compute_ctl` to refresh its configuration when it suspects that it could be talking to incorrect pageservers due to a stale configuration.

PG becomes suspicious that it is talking to the wrong pageservers in the following situations:

1. It cannot connect to a pageserver (e.g., getting a network-level connection refused error).
2. It can connect to a pageserver, but the pageserver does not return any data for the GetPage request.
3. It can connect to a pageserver, but the pageserver returns a malformed response.
4. It can connect to a pageserver, but there is an error receiving the GetPage request response for any other reason.

This change also includes a minor tweak to `compute_ctl`'s config refresh behavior. Upon receiving a request to refresh PG configuration, `compute_ctl` will reach out to download a config, but it will not attempt to apply the configuration if the config is the same as the old config it is replacing. This optimization is added because the act of reconfiguring itself requires working pageserver connections. In many failure situations it is likely that PG detects an issue with a pageserver before the control plane can detect the issue, migrate tenants, and update the compute config. In this case even the latest compute config won't point PG to working pageservers, causing the configuration attempt to hang and negatively impact PG's time-to-recovery.
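The refresh rule described above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the actual `compute_ctl` code: the `ComputeConfig` struct, its `pageserver_connstrings` field, and the `should_reconfigure` function are hypothetical names chosen for this example.

```rust
/// Hypothetical, simplified view of a compute config. Only the part that
/// matters for the refresh decision is modeled here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ComputeConfig {
    /// Connection strings of the pageservers PG should talk to (assumed field).
    pub pageserver_connstrings: Vec<String>,
}

/// Returns true only if the refreshed config points PG at different
/// pageservers than the config currently in use. When the two match,
/// reconfiguration is skipped, because applying a config needs working
/// pageserver connections and would otherwise hang.
pub fn should_reconfigure(current: &ComputeConfig, refreshed: &ComputeConfig) -> bool {
    current.pageserver_connstrings != refreshed.pageserver_connstrings
}
```

In this sketch, a caller would download the new config, call `should_reconfigure`, and return early when it yields `false`.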
With this change, `compute_ctl` only attempts reconfiguration if the refreshed config points PG to different pageservers.

## How is this tested?

The new code paths are exercised in all existing tests because this mechanism is on by default. Explicitly tested in `test_runner/regress/test_change_pageserver.py`.

Co-authored-by: William Huang <william.huang@databricks.com>
Compute node tools
Postgres wrapper (compute_ctl) is intended to be run as a Docker entrypoint or as a systemd
ExecStart option. It will handle all the Neon specifics during compute node
initialization:
- `compute_ctl` accepts cluster (compute node) specification as a JSON file.
- Every start is a fresh start, so the data directory is removed and initialized again on each run.
- Next it will put configuration files into the `PGDATA` directory.
- Sync safekeepers and get commit LSN.
- Get `basebackup` from pageserver using the LSN returned in the previous step.
- Try to start `postgres` and wait until it is ready to accept connections.
- Check and alter/drop/create roles and databases.
- Hang waiting on the `postmaster` process to exit.
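The ordering of the steps above can be made explicit with a small Rust sketch. It is only an illustration: `InitStep`, `INIT_ORDER`, and `run_init` are hypothetical names, the real work of each step is replaced by a caller-supplied closure, and data passed between steps (such as the commit LSN) is not modeled.

```rust
/// The initialization steps from the list above (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InitStep {
    LoadSpec,
    ResetDataDir,
    WriteConfigFiles,
    SyncSafekeepers,
    GetBasebackup,
    StartPostgres,
    ApplyRolesAndDatabases,
    WaitForPostmasterExit,
}

/// The steps in the order the list describes them.
pub const INIT_ORDER: [InitStep; 8] = [
    InitStep::LoadSpec,
    InitStep::ResetDataDir,
    InitStep::WriteConfigFiles,
    InitStep::SyncSafekeepers,
    InitStep::GetBasebackup,
    InitStep::StartPostgres,
    InitStep::ApplyRolesAndDatabases,
    InitStep::WaitForPostmasterExit,
];

/// Runs every step in order and stops at the first failure, reporting
/// which step failed together with its error message.
pub fn run_init<F>(mut run_step: F) -> Result<(), (InitStep, String)>
where
    F: FnMut(InitStep) -> Result<(), String>,
{
    for step in INIT_ORDER.iter().copied() {
        run_step(step).map_err(|err| (step, err))?;
    }
    Ok(())
}
```

Stopping at the first failing step reflects that later steps depend on earlier ones, for example the basebackup needing the LSN from the safekeeper sync.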
Also `compute_ctl` spawns two separate service threads:
- `compute-monitor` checks the last Postgres activity timestamp and saves it into the shared `ComputeNode`;
- `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the last activity requests.
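The way two service threads can share state is sketched below with the Rust standard library only. It is a simplified illustration, not the real `ComputeNode`: `SharedState`, `record_activity`, `read_status`, and `spawn_monitor` are hypothetical names, and the HTTP server is reduced to a plain read function.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::SystemTime;

/// Hypothetical stand-in for the state shared between the service threads.
#[derive(Debug, Default)]
pub struct SharedState {
    /// Last observed Postgres activity, if any.
    pub last_active: Option<SystemTime>,
    /// Whether the compute is ready to accept connections.
    pub ready: bool,
}

/// Monitor side: store an observed activity timestamp in the shared state.
pub fn record_activity(state: &Arc<Mutex<SharedState>>, at: SystemTime) {
    let mut guard = state.lock().unwrap();
    guard.last_active = Some(at);
}

/// HTTP side: read what a readiness / last-activity handler would report.
pub fn read_status(state: &Arc<Mutex<SharedState>>) -> (bool, Option<SystemTime>) {
    let guard = state.lock().unwrap();
    (guard.ready, guard.last_active)
}

/// Spawns a named thread that records a single timestamp and then exits.
/// A real monitor would loop and query Postgres periodically instead.
pub fn spawn_monitor(
    state: Arc<Mutex<SharedState>>,
    observed: SystemTime,
) -> thread::JoinHandle<()> {
    thread::Builder::new()
        .name("compute-monitor".into())
        .spawn(move || record_activity(&state, observed))
        .expect("cannot spawn monitor thread")
}
```

Wrapping the state in `Arc<Mutex<...>>` is one common choice for this pattern; the actual synchronization used by `compute_ctl` may differ.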
If the AUTOSCALING environment variable is set, compute_ctl will start the
vm-monitor located in [neon/libs/vm_monitor]. For VM compute nodes,
vm-monitor communicates with the VM autoscaling system. It coordinates
downscaling and requests immediate upscaling under resource pressure.
Usage example:
compute_ctl -D /var/db/postgres/compute \
-C 'postgresql://cloud_admin@localhost/postgres' \
-S /var/db/postgres/specs/current.json \
-b /usr/local/bin/postgres
State Diagram
Computes can be in various states. Below is a diagram that details how a compute moves between states.
%% https://mermaid.js.org/syntax/stateDiagram.html
stateDiagram-v2
[*] --> Empty : Compute spawned
Empty --> ConfigurationPending : Waiting for compute spec
ConfigurationPending --> Configuration : Received compute spec
Configuration --> Failed : Failed to configure the compute
Configuration --> Running : Compute has been configured
Empty --> Init : Compute spec is immediately available
Empty --> TerminationPendingFast : Requested termination
Empty --> TerminationPendingImmediate : Requested termination
Init --> Failed : Failed to start Postgres
Init --> Running : Started Postgres
Running --> TerminationPendingFast : Requested termination
Running --> TerminationPendingImmediate : Requested termination
Running --> ConfigurationPending : Received a /configure request with spec
Running --> RefreshConfigurationPending : Received a /refresh_configuration request, compute node will pull a new spec and reconfigure
RefreshConfigurationPending --> Running : Compute has been re-configured
TerminationPendingFast --> Terminated : Terminated compute with 30s delay for cplane to inspect status
TerminationPendingImmediate --> Terminated : Terminated compute immediately
Running --> TerminationPending : Requested termination
TerminationPending --> Terminated : Terminated compute
Failed --> RefreshConfigurationPending : Received a /refresh_configuration request
Failed --> [*] : Compute exited
Terminated --> [*] : Compute exited
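The transitions in the diagram can also be written down as a lookup function, which is handy for checking whether a state change is allowed. The Rust sketch below mirrors the edges of the diagram, including the transition from `TerminationPendingFast` to `Terminated`. The names `ComputeState` and `can_transition` are hypothetical and are not taken from the `compute_ctl` sources.

```rust
/// States from the diagram (the enum name is illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComputeState {
    Empty,
    ConfigurationPending,
    Configuration,
    Init,
    Running,
    RefreshConfigurationPending,
    TerminationPendingFast,
    TerminationPendingImmediate,
    TerminationPending,
    Failed,
    Terminated,
}

/// Returns true if the diagram contains an edge from `from` to `to`.
pub fn can_transition(from: ComputeState, to: ComputeState) -> bool {
    use ComputeState::*;
    matches!(
        (from, to),
        (
            Empty,
            ConfigurationPending | Init | TerminationPendingFast | TerminationPendingImmediate
        ) | (ConfigurationPending, Configuration)
            | (Configuration, Failed | Running)
            | (Init, Failed | Running)
            | (
                Running,
                TerminationPendingFast
                    | TerminationPendingImmediate
                    | TerminationPending
                    | ConfigurationPending
                    | RefreshConfigurationPending
            )
            | (RefreshConfigurationPending, Running)
            | (
                TerminationPendingFast | TerminationPendingImmediate | TerminationPending,
                Terminated
            )
            | (Failed, RefreshConfigurationPending)
    )
}
```

For example, `can_transition(ComputeState::Failed, ComputeState::RefreshConfigurationPending)` holds, while a move from `Terminated` back to `Running` does not appear in the diagram and is rejected.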
Tests
Cargo formatter:
cargo fmt
Run tests:
cargo test
Clippy linter:
cargo clippy --all --all-targets -- -Dwarnings -Drust-2018-idioms
Cross-platform compilation
Imagine that you are on macOS (x86) and you want a Linux GNU executable (the x86_64-unknown-linux-gnu platform in Rust terminology).
Using docker
You can use a throw-away Docker container (rustlang/rust image) for doing that:
docker run --rm \
-v $(pwd):/compute_tools \
-w /compute_tools \
-t rustlang/rust:nightly cargo build --release --target=x86_64-unknown-linux-gnu
or as a one-liner:
docker run --rm -v $(pwd):/compute_tools -w /compute_tools -t rust:latest cargo build --release --target=x86_64-unknown-linux-gnu
Using rust native cross-compilation
Another way is to add x86_64-unknown-linux-gnu target on your host system:
rustup target add x86_64-unknown-linux-gnu
Install macOS cross-compiler toolchain:
brew tap SergioBenitez/osxct
brew install x86_64-unknown-linux-gnu
And finally run cargo build:
CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_LINKER=x86_64-unknown-linux-gnu-gcc cargo build --target=x86_64-unknown-linux-gnu --release