vm-monitor: Deny (not fail) downscale if no memory stats yet (#5606)

Fixes an issue we observed on staging that happens when the
autoscaler-agent attempts to immediately downscale the VM after binding,
which is typical for pooled computes.

The issue occurred because the autoscaler-agent requested downscaling
before the vm-monitor had gathered enough cgroup memory stats to be
confident in approving it. When the vm-monitor returned an internal
error instead of denying the request, the autoscaler-agent retried the
connection and immediately hit the same issue again (in part because
cgroup stats are collected per-connection, rather than globally), so it
never made progress.

Author: Em Sharnoff
Date: 2023-10-19 11:09:37 -07:00
Committed by: GitHub
Parent: 5a8bcdccb0
Commit: 2cf6a47cca
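
For context, the deny-vs-fail distinction matters because of how each
outcome reaches the agent: a denial is an ordinary protocol reply that
keeps the connection (and its accumulating per-connection cgroup stats)
alive, while an internal error tears the connection down and resets
those stats. Below is a minimal sketch of that contract; the type alias
and handler are illustrative, not the actual vm-monitor API:

    use anyhow::Result;

    /// Illustrative stand-in for the monitor's downscale verdict: `(approved, reason)`.
    type DownscaleVerdict = (bool, String);

    /// Hypothetical dispatch showing why denying beats failing during startup:
    /// `Ok((false, _))` becomes a normal "denied" reply and the connection (with
    /// its accumulating cgroup stats) stays alive, while `Err(_)` closes the
    /// connection, so per-connection stats restart from zero and the agent's
    /// retry hits the same error again.
    fn reply_for(verdict: Result<DownscaleVerdict>) -> String {
        match verdict {
            Ok((true, reason)) => format!("downscale approved: {reason}"),
            Ok((false, reason)) => format!("downscale denied: {reason}"),
            // Closing the connection here is what caused the retry loop on staging.
            Err(err) => format!("internal error, closing connection: {err}"),
        }
    }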


@@ -253,11 +253,22 @@ impl Runner {
         if let Some(cgroup) = &self.cgroup {
             let (last_time, last_history) = *cgroup.watcher.borrow();
+            // NB: The ordering of these conditions is intentional. During startup, we should deny
+            // downscaling until we have enough information to determine that it's safe to do so
+            // (i.e. enough samples have come in). But if it's been a while and we *still* haven't
+            // received any information, we should *fail* instead of just denying downscaling.
+            //
+            // `last_time` is set to `Instant::now()` on startup, so checking `last_time.elapsed()`
+            // serves double-duty: it trips if we haven't received *any* metrics for long enough,
+            // OR if we haven't received metrics *recently enough*.
+            //
+            // TODO: make the duration here configurable.
             if last_time.elapsed() > Duration::from_secs(5) {
                 bail!("haven't gotten cgroup memory stats recently enough to determine downscaling information");
             } else if last_history.samples_count <= 1 {
-                bail!("haven't received enough cgroup memory stats yet");
+                let status = "haven't received enough cgroup memory stats yet";
+                info!(status, "discontinuing downscale");
+                return Ok((false, status.to_owned()));
             }
             let new_threshold = self
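
A side note on the new logging line: `info!(status, "discontinuing
downscale")` uses the `tracing` crate's field shorthand, which records
the local variable `status` as a structured field on the event alongside
the message. A standalone sketch of that pattern (the subscriber setup
is an assumption for the example, not part of this diff):

    use tracing::info;

    fn main() {
        // Assumes the `tracing-subscriber` crate for a simple stdout subscriber.
        tracing_subscriber::fmt::init();

        let status = "haven't received enough cgroup memory stats yet";
        // Field shorthand: equivalent to `info!(status = status, "discontinuing downscale")`.
        info!(status, "discontinuing downscale");
        // Emits something like:
        // INFO discontinuing downscale status="haven't received enough cgroup memory stats yet"
    }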