[BRC-1778] Add mechanism to compute_ctl to pull a new config (#12711)

## Problem

We have been dealing with a number of issues with the SC compute notification mechanism. Various race conditions exist in the PG/HCC/cplane/PS distributed system, and relying on the SC to notify the compute node of PS changes is not robust. We decided to pursue a more robust option where the compute node itself discovers whether it may be pointing to the incorrect PSs and proactively reconfigures itself if issues are suspected.

## Summary of changes

To support this self-healing reconfiguration mechanism, several pieces are needed. This PR adds a mechanism to `compute_ctl` called "refresh configuration", where the compute node reaches out to the control plane to pull a new config and reconfigures PG using the new config, instead of listening for a notification message containing a config to arrive from the control plane.

Main changes to compute_ctl:

1. The `compute_ctl` state machine now has a new state, `RefreshConfigurationPending`. The compute node may enter this state upon receiving a signal that it may be using the incorrect page servers.
2. Upon entering the `RefreshConfigurationPending` state, the background configurator thread in `compute_ctl` wakes up, pulls a new config from the control plane, and reconfigures PG (with `pg_ctl reload`) according to the new config.
3. The compute node may enter the new `RefreshConfigurationPending` state from the `Running` or `Failed` states. If the configurator manages to configure the compute node successfully, the node enters the `Running` state. Otherwise, it stays in `RefreshConfigurationPending` and the configurator thread waits for the next notification if an incorrect config is still suspected.
4. Added various plumbing in `compute_ctl` data structures to allow the configurator thread to perform the config fetch.

The "incorrect config suspected" notification is delivered using an HTTP endpoint, `/refresh_configuration`, on `compute_ctl`. This endpoint is currently not called by anyone other than the tests. In a follow-up PR I will set up some code in the PG extension/libpagestore to call this HTTP endpoint whenever PG suspects that it is pointing to the wrong page servers.

## How is this tested?

Modified `test_runner/regress/test_change_pageserver.py` to add a scenario where we use the new `/refresh_configuration` mechanism instead of the existing `/configure` mechanism (which requires us to send a full config to compute_ctl) to have the compute node reload and reconfigure its pageservers.

I took one shortcut to reduce the scope of this change when it comes to testing: the compute node uses a local config file instead of pulling a config over the network from the HCC. This simplifies the test setup in the following ways:

* The existing test framework is set up to use local config files for compute nodes only, so it's convenient if I just stick with it.
* The HCC today generates a compute config with production settings (e.g., assuming 4 CPUs, 16GB RAM, with local file caches), which is probably not suitable in tests. We may need to add another test-only endpoint config to the control plane to make this work.

The config-fetch part of the code is relatively straightforward (and well-covered in both production and the KIND test), so it is probably fine to replace it with loading from the local config file for these integration tests.

In addition to making sure that the tests pass, I also manually inspected the logs to make sure that the compute node is indeed reloading the config using the new mechanism instead of going down the old `/configure` path (it turns out the test has bugs which cause compute `/configure` messages to be sent despite the test intending to disable/blackhole them).

```test
2024-09-24T18:53:29.573650Z INFO http request{otel.name=/refresh_configuration http.method=POST}: serving /refresh_configuration POST request
2024-09-24T18:53:29.573689Z INFO configurator_main_loop: compute node suspects its configuration is out of date, now refreshing configuration
2024-09-24T18:53:29.573706Z INFO configurator_main_loop: reloading config.json from path: /workspaces/hadron/test_output/test_change_pageserver_using_refresh[release-pg16]/repo/endpoints/ep-1/spec.json
PG:2024-09-24 18:53:29.574 GMT [52799] LOG: received SIGHUP, reloading configuration files
PG:2024-09-24 18:53:29.575 GMT [52799] LOG: parameter "neon.extension_server_port" cannot be changed without restarting the server
PG:2024-09-24 18:53:29.575 GMT [52799] LOG: parameter "neon.pageserver_connstring" changed to "postgresql://no_user@localhost:15008"
...
```

Co-authored-by: William Huang <william.huang@databricks.com>
2026-01-08 05:52:55 +00:00 · 2025-07-24 09:26:21 -05:00
parent 643448b1a2
commit 90cd5a5be8
12 changed files with 315 additions and 32 deletions
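
As a reading aid for the diff that follows, the refresh-related transitions described in the commit message can be sketched as a small transition function. This is a simplified illustration, not the code in `compute_ctl`: the `Status` enum and both helper functions are hypothetical, and the `Init` state is left out because its handling is not fully visible in the hunks.

```rust
/// Simplified stand-in for `ComputeStatus`; only states relevant to the
/// refresh path are modeled (assumption: `Init` and others are omitted).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Running,
    Failed,
    ConfigurationPending,
    RefreshConfigurationPending,
    Terminated,
}

/// Hypothetical helper: the status after a `/refresh_configuration` request,
/// or `None` if the request is not accepted in the current state.
fn on_refresh_request(current: Status) -> Option<Status> {
    match current {
        // These three states allow a refresh, mirroring
        // `states_allowing_configuration_refresh` in the diff.
        Status::Running | Status::Failed | Status::RefreshConfigurationPending => {
            Some(Status::RefreshConfigurationPending)
        }
        _ => None,
    }
}

/// Hypothetical helper: the status after the configurator finishes one
/// refresh attempt. A failed attempt leaves the node pending.
fn on_refresh_attempt(current: Status, reconfigured: bool) -> Status {
    if current == Status::RefreshConfigurationPending && reconfigured {
        Status::Running
    } else {
        current
    }
}
```

Keeping a failed attempt in `RefreshConfigurationPending` means a later notification can trigger another attempt without any extra bookkeeping.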
--- a/compute_tools/README.md
+++ b/compute_tools/README.md
@@ -52,8 +52,14 @@ stateDiagram-v2
  Init --> Running : Started Postgres
  Running --> TerminationPendingFast : Requested termination
  Running --> TerminationPendingImmediate : Requested termination
+  Running --> ConfigurationPending : Received a /configure request with spec
+  Running --> RefreshConfigurationPending : Received a /refresh_configuration request, compute node will pull a new spec and reconfigure
+  RefreshConfigurationPending --> Running : Compute has been re-configured
  TerminationPendingFast --> Terminated : Terminated compute with 30s delay for cplane to inspect status
  TerminationPendingImmediate --> Terminated : Terminated compute immediately
+  Running --> TerminationPending : Requested termination
+  TerminationPending --> Terminated : Terminated compute
+  Failed --> RefreshConfigurationPending : Received a /refresh_configuration request
  Failed --> [*] : Compute exited
  Terminated --> [*] : Compute exited
 ```
--- a/compute_tools/src/bin/compute_ctl.rs
+++ b/compute_tools/src/bin/compute_ctl.rs
@@ -235,6 +235,9 @@ fn main() -> Result<()> {
            pg_isready_bin: get_pg_isready_bin(&cli.pgbin),
            instance_id: std::env::var("INSTANCE_ID").ok(),
            lakebase_mode: cli.lakebase_mode,
+            build_tag: BUILD_TAG.to_string(),
+            control_plane_uri: cli.control_plane_uri,
+            config_path_test_only: cli.config,
        },
        config,
    )?;
--- a/compute_tools/src/compute.rs
+++ b/compute_tools/src/compute.rs
@@ -21,6 +21,7 @@ use postgres::NoTls;
 use postgres::error::SqlState;
 use remote_storage::{DownloadError, RemotePath};
 use std::collections::{HashMap, HashSet};
+use std::ffi::OsString;
 use std::os::unix::fs::{PermissionsExt, symlink};
 use std::path::Path;
 use std::process::{Command, Stdio};
@@ -120,6 +121,10 @@ pub struct ComputeNodeParams {
    // Path to the `pg_isready` binary.
    pub pg_isready_bin: String,
    pub lakebase_mode: bool,
+
+    pub build_tag: String,
+    pub control_plane_uri: Option<String>,
+    pub config_path_test_only: Option<OsString>,
 }

 type TaskHandle = Mutex<Option<JoinHandle<()>>>;
@@ -1796,12 +1801,12 @@ impl ComputeNode {
        let states_allowing_configuration_refresh = [
            ComputeStatus::Running,
            ComputeStatus::Failed,
-            // ComputeStatus::RefreshConfigurationPending,
+            ComputeStatus::RefreshConfigurationPending,
        ];

-        let state = self.state.lock().expect("state lock poisoned");
+        let mut state = self.state.lock().expect("state lock poisoned");
        if states_allowing_configuration_refresh.contains(&state.status) {
-            // state.status = ComputeStatus::RefreshConfigurationPending;
+            state.status = ComputeStatus::RefreshConfigurationPending;
            self.state_changed.notify_all();
            Ok(())
        } else if state.status == ComputeStatus::Init {
@@ -1988,6 +1993,7 @@ impl ComputeNode {
                            // wait
                            ComputeStatus::Init
                            | ComputeStatus::Configuration
+                            | ComputeStatus::RefreshConfigurationPending
                            | ComputeStatus::Empty => {
                                state = self.state_changed.wait(state).unwrap();
                            }
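
The signalling pattern in this hunk (set the status under the mutex, then `notify_all` on the condition variable) can be sketched in isolation with only the standard library. The `Node` type, its fields, and `signal_and_observe` are hypothetical names for illustration; the real `ComputeNode` carries far more state.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Running,
    RefreshConfigurationPending,
    Terminated,
}

/// Hypothetical, stripped-down stand-in for `ComputeNode`.
struct Node {
    status: Mutex<Status>,
    changed: Condvar,
}

impl Node {
    /// Marks a refresh as pending and wakes every waiter, or returns an
    /// error if the current state does not allow a refresh.
    fn signal_refresh(&self) -> Result<(), String> {
        let mut status = self.status.lock().expect("status lock poisoned");
        let current = *status;
        match current {
            Status::Running | Status::RefreshConfigurationPending => {
                *status = Status::RefreshConfigurationPending;
                self.changed.notify_all();
                Ok(())
            }
            other => Err(format!("cannot refresh configuration in state {:?}", other)),
        }
    }
}

/// Spawns a waiter that blocks until a refresh is pending, signals it, and
/// returns the status the waiter observed when it woke up.
fn signal_and_observe() -> Status {
    let node = Arc::new(Node {
        status: Mutex::new(Status::Running),
        changed: Condvar::new(),
    });
    let waiter = {
        let node = Arc::clone(&node);
        thread::spawn(move || {
            let guard = node.status.lock().unwrap();
            // `wait_while` checks the predicate before sleeping and after
            // every wakeup, so a signal sent early is not lost.
            let guard = node
                .changed
                .wait_while(guard, |s| *s != Status::RefreshConfigurationPending)
                .unwrap();
            *guard
        })
    };
    node.signal_refresh().expect("Running allows a refresh");
    waiter.join().unwrap()
}
```

Because the waiter tests the predicate while holding the lock, it does not matter whether the signal arrives before or after the waiter starts.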
--- a/compute_tools/src/configurator.rs
+++ b/compute_tools/src/configurator.rs
@@ -1,10 +1,12 @@
-use std::sync::Arc;
+use std::fs::File;
 use std::thread;
+use std::{path::Path, sync::Arc};

-use compute_api::responses::ComputeStatus;
+use compute_api::responses::{ComputeConfig, ComputeStatus};
 use tracing::{error, info, instrument};

-use crate::compute::ComputeNode;
+use crate::compute::{ComputeNode, ParsedSpec};
+use crate::spec::get_config_from_control_plane;

 #[instrument(skip_all)]
 fn configurator_main_loop(compute: &Arc<ComputeNode>) {
@@ -12,12 +14,22 @@ fn configurator_main_loop(compute: &Arc<ComputeNode>) {
    loop {
        let mut state = compute.state.lock().unwrap();

-        // We have to re-check the status after re-acquiring the lock because it could be that
-        // the status has changed while we were waiting for the lock, and we might not need to
-        // wait on the condition variable. Otherwise, we might end up in some soft-/deadlock, i.e.
-        // we are waiting for a condition variable that will never be signaled.
-        if state.status != ComputeStatus::ConfigurationPending {
-            state = compute.state_changed.wait(state).unwrap();
+        if compute.params.lakebase_mode {
+            while state.status != ComputeStatus::ConfigurationPending
+                && state.status != ComputeStatus::RefreshConfigurationPending
+                && state.status != ComputeStatus::Failed
+            {
+                info!("configurator: compute status: {:?}, sleeping", state.status);
+                state = compute.state_changed.wait(state).unwrap();
+            }
+        } else {
+            // We have to re-check the status after re-acquiring the lock because it could be that
+            // the status has changed while we were waiting for the lock, and we might not need to
+            // wait on the condition variable. Otherwise, we might end up in some soft-/deadlock, i.e.
+            // we are waiting for a condition variable that will never be signaled.
+            if state.status != ComputeStatus::ConfigurationPending {
+                state = compute.state_changed.wait(state).unwrap();
+            }
        }

        // Re-check the status after waking up
@@ -38,6 +50,80 @@ fn configurator_main_loop(compute: &Arc<ComputeNode>) {
            // std::thread::sleep(std::time::Duration::from_millis(10000));

            compute.set_status(new_status);
+        } else if state.status == ComputeStatus::RefreshConfigurationPending {
+            info!(
+                "compute node suspects its configuration is out of date, now refreshing configuration"
+            );
+            // Drop the lock guard here to avoid holding the lock while downloading spec from the control plane / HCC.
+            // This is the only thread that can move compute_ctl out of the `RefreshConfigurationPending` state, so it
+            // is safe to drop the lock like this.
+            drop(state);
+
+            let spec = if let Some(config_path) = &compute.params.config_path_test_only {
+                // This path is only to make testing easier. In production we always get the spec from the HCC.
+                info!(
+                    "reloading config.json from path: {}",
+                    config_path.to_string_lossy()
+                );
+                let path = Path::new(config_path);
+                if let Ok(file) = File::open(path) {
+                    match serde_json::from_reader::<File, ComputeConfig>(file) {
+                        Ok(config) => config.spec,
+                        Err(e) => {
+                            error!("could not parse spec file: {}", e);
+                            None
+                        }
+                    }
+                } else {
+                    error!(
+                        "could not open config file at path: {}",
+                        config_path.to_string_lossy()
+                    );
+                    None
+                }
+            } else if let Some(control_plane_uri) = &compute.params.control_plane_uri {
+                match get_config_from_control_plane(control_plane_uri, &compute.params.compute_id) {
+                    Ok(config) => config.spec,
+                    Err(e) => {
+                        error!("could not get config from control plane: {}", e);
+                        None
+                    }
+                }
+            } else {
+                None
+            };
+
+            if let Some(spec) = spec {
+                if let Ok(pspec) = ParsedSpec::try_from(spec) {
+                    {
+                        let mut state = compute.state.lock().unwrap();
+                        // Defensive programming to make sure this thread is indeed the only one that can move the compute
+                        // node out of the `RefreshConfigurationPending` state. Would be nice if we can encode this invariant
+                        // into the type system.
+                        assert_eq!(state.status, ComputeStatus::RefreshConfigurationPending);
+                        // state.pspec is consumed by compute.reconfigure() below. Note that compute.reconfigure() will acquire
+                        // the compute.state lock again so we need to have the lock guard go out of scope here. We could add a
+                        // "locked" variant of compute.reconfigure() that takes the lock guard as an argument to make this cleaner,
+                        // but it's not worth forking the codebase too much for this minor point alone right now.
+                        state.pspec = Some(pspec);
+                    }
+                    match compute.reconfigure() {
+                        Ok(_) => {
+                            info!("Refresh configuration: compute node configured");
+                            compute.set_status(ComputeStatus::Running);
+                        }
+                        Err(e) => {
+                            error!(
+                                "Refresh configuration: could not configure compute node: {}",
+                                e
+                            );
+                            // Leave the compute node in the `RefreshConfigurationPending` state if the configuration
+                            // was not successful. It should be okay to treat this situation the same as if the loop
+                            // hasn't executed yet as long as the detection side keeps notifying.
+                        }
+                    }
+                }
+            }
        } else if state.status == ComputeStatus::Failed {
            info!("compute node is now in Failed state, exiting");
            break;
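
The locking discipline in the new branch (release the lock during the slow fetch, re-acquire it to store the spec, and leave the node pending on any failure) can be sketched as follows. This is a minimal model under stated assumptions: `State`, `refresh_once`, and the `fetch`/`apply` closures are hypothetical stand-ins for the compute state, the control-plane or local-file fetch, and `compute.reconfigure()`.

```rust
use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Running,
    RefreshConfigurationPending,
}

/// Hypothetical stand-in for the compute state; `spec` models `pspec`.
struct State {
    status: Status,
    spec: Option<String>,
}

/// Runs one refresh attempt and returns true if the node moved to `Running`.
fn refresh_once<F, A>(state: &Mutex<State>, fetch: F, apply: A) -> bool
where
    F: FnOnce() -> Option<String>,
    A: FnOnce(&str) -> Result<(), String>,
{
    // Check the status, then release the lock before the slow fetch.
    {
        let guard = state.lock().unwrap();
        if guard.status != Status::RefreshConfigurationPending {
            return false;
        }
    }
    let spec = match fetch() {
        Some(spec) => spec,
        None => return false, // fetch failed: stay pending
    };
    {
        let mut guard = state.lock().unwrap();
        // Only this code path may leave the pending state, so the status
        // must be unchanged after re-acquiring the lock.
        assert_eq!(guard.status, Status::RefreshConfigurationPending);
        guard.spec = Some(spec.clone());
    } // guard dropped here; `apply` is free to take the lock itself
    match apply(&spec) {
        Ok(()) => {
            state.lock().unwrap().status = Status::Running;
            true
        }
        Err(_) => false, // apply failed: stay pending
    }
}
```

As in the diff, the new spec is stored before the apply step, so a failed apply leaves the latest spec in place while the status remains `RefreshConfigurationPending`.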
--- a/compute_tools/src/http/routes/refresh_configuration.rs
+++ b/compute_tools/src/http/routes/refresh_configuration.rs
@@ -7,28 +7,22 @@ use axum::{
    response::{IntoResponse, Response},
 };
 use http::StatusCode;
-use tracing::debug;

 use crate::compute::ComputeNode;
 // use crate::hadron_metrics::POSTGRES_PAGESTREAM_REQUEST_ERRORS;
 use crate::http::JsonResponse;

-// The /refresh_configuration POST method is used to nudge compute_ctl to pull a new spec
-// from the HCC and attempt to reconfigure Postgres with the new spec. The method does not wait
-// for the reconfiguration to complete. Rather, it simply delivers a signal that will cause
-// configuration to be reloaded in a best effort manner. Invocation of this method does not
-// guarantee that a reconfiguration will occur. The caller should consider keep sending this
-// request while it believes that the compute configuration is out of date.
+/// The /refresh_configuration POST method is used to nudge compute_ctl to pull a new spec
+/// from the HCC and attempt to reconfigure Postgres with the new spec. The method does not wait
+/// for the reconfiguration to complete. Rather, it simply delivers a signal that will cause
+/// configuration to be reloaded in a best effort manner. Invocation of this method does not
+/// guarantee that a reconfiguration will occur. The caller should keep sending this
+/// request while it believes that the compute configuration is out of date.
 pub(in crate::http) async fn refresh_configuration(
    State(compute): State<Arc<ComputeNode>>,
 ) -> Response {
-    debug!("serving /refresh_configuration POST request");
-    // POSTGRES_PAGESTREAM_REQUEST_ERRORS.inc();
    match compute.signal_refresh_configuration().await {
        Ok(_) => StatusCode::OK.into_response(),
-        Err(e) => {
-            tracing::error!("error handling /refresh_configuration request: {}", e);
-            JsonResponse::error(StatusCode::INTERNAL_SERVER_ERROR, e)
-        }
+        Err(e) => JsonResponse::error(StatusCode::INTERNAL_SERVER_ERROR, e),
    }
 }
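
Since the endpoint is best effort, the doc comment asks callers to keep sending the request while they believe the configuration is stale. A caller-side loop along those lines might look like the sketch below. The function name, the closures, and the attempt limit are all assumptions for illustration; the actual caller is planned for a follow-up PR and is not part of this diff.

```rust
use std::thread;
use std::time::Duration;

/// Keeps nudging until `is_stale` reports false or `max_attempts` requests
/// have been sent. Returns the number of requests sent.
///
/// `send_refresh` stands in for a POST to `/refresh_configuration`;
/// `is_stale` stands in for the caller's own staleness check.
fn nudge_until_fresh<S, C>(
    mut send_refresh: S,
    mut is_stale: C,
    max_attempts: u32,
    pause: Duration,
) -> u32
where
    S: FnMut() -> Result<(), String>,
    C: FnMut() -> bool,
{
    let mut sent = 0;
    while sent < max_attempts && is_stale() {
        // Errors are ignored on purpose: a later attempt may still succeed,
        // and the staleness check decides when to stop.
        let _ = send_refresh();
        sent += 1;
        thread::sleep(pause);
    }
    sent
}
```

Bounding the number of attempts is a design choice of this sketch; a real caller might instead retry for as long as the staleness condition holds.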
--- a/compute_tools/src/http/server.rs
+++ b/compute_tools/src/http/server.rs
@@ -23,11 +23,11 @@ use super::{
    middleware::authorize::Authorize,
    routes::{
        check_writability, configure, database_schema, dbs_and_roles, extension_server, extensions,
-        grants, insights, lfc, metrics, metrics_json, promote, status, terminate,
+        grants, hadron_liveness_probe, insights, lfc, metrics, metrics_json, promote,
+        refresh_configuration, status, terminate,
    },
 };
 use crate::compute::ComputeNode;
-use crate::http::routes::{hadron_liveness_probe, refresh_configuration};

 /// `compute_ctl` has two servers: internal and external. The internal server
 /// binds to the loopback interface and handles communication from clients on