Cancel PG query if stuck at refreshing configuration (#12717)

## Problem

While configuring or reconfiguring PG due to PageServer movements, it's
possible PG may get stuck if PageServer is moved around after fetching
the spec from StorageController.

## Summary of changes

To fix this issue, this PR introduces two changes:
1. Fail the PG query directly if the query cannot request configuration
for certain number of times.
2. Introduce a new state `RefreshConfiguration` in compute tools to
differentiate it from `RefreshConfigurationPending`. If compute tool is
already in `RefreshConfiguration` state, then it will not accept new
request configuration requests.

## How is this tested?
Chaos testing.

Co-authored-by: Chen Luo <chen.luo@databricks.com>
This commit is contained in:
Tristan Partin
2025-07-24 19:01:59 -05:00
committed by GitHub
parent 512210bb5a
commit b623fbae0c
8 changed files with 542 additions and 55 deletions

View File

@@ -938,7 +938,8 @@ impl Endpoint {
| ComputeStatus::TerminationPendingFast
| ComputeStatus::TerminationPendingImmediate
| ComputeStatus::Terminated
| ComputeStatus::RefreshConfigurationPending => {
| ComputeStatus::RefreshConfigurationPending
| ComputeStatus::RefreshConfiguration => {
bail!("unexpected compute status: {:?}", state.status)
}
}