Cancel PG query if stuck at refreshing configuration (#12717)

## Problem

While configuring or reconfiguring PG due to PageServer movements, it's
possible PG may get stuck if PageServer is moved around after fetching
the spec from StorageController.

## Summary of changes

To fix this issue, this PR introduces two changes:
1. Fail the PG query directly if the query cannot request configuration
for certain number of times.
2. Introduce a new state `RefreshConfiguration` in compute tools to
differentiate it from `RefreshConfigurationPending`. If compute tool is
already in `RefreshConfiguration` state, then it will not accept new
request configuration requests.

## How is this tested?
Chaos testing.

Co-authored-by: Chen Luo <chen.luo@databricks.com>
This commit is contained in:
Tristan Partin
2025-07-24 19:01:59 -05:00
committed by GitHub
parent 512210bb5a
commit b623fbae0c
8 changed files with 542 additions and 55 deletions

View File

@@ -54,11 +54,11 @@ stateDiagram-v2
Running --> TerminationPendingImmediate : Requested termination
Running --> ConfigurationPending : Received a /configure request with spec
Running --> RefreshConfigurationPending : Received a /refresh_configuration request, compute node will pull a new spec and reconfigure
RefreshConfigurationPending --> Running : Compute has been re-configured
RefreshConfigurationPending --> RefreshConfiguration: Received compute spec and started configuration
RefreshConfiguration --> Running : Compute has been re-configured
RefreshConfiguration --> RefreshConfigurationPending : Configuration failed and to be retried
TerminationPendingFast --> Terminated compute with 30s delay for cplane to inspect status
TerminationPendingImmediate --> Terminated : Terminated compute immediately
Running --> TerminationPending : Requested termination
TerminationPending --> Terminated : Terminated compute
Failed --> RefreshConfigurationPending : Received a /refresh_configuration request
Failed --> [*] : Compute exited
Terminated --> [*] : Compute exited