Fix flaky test_sharding_split_failures (#12199)

## Problem

`test_sharding_split_failures` is flaky due to interference from the
`background_reconcile` process.

The details are in the issue
https://github.com/neondatabase/neon/issues/12029.

## Summary of changes

- Use `reconcile_until_idle` to ensure a stable state before running
test assertions
- Added error tolerance in `reconcile_until_idle` test function (Failure
cases: 1, 3, 19, 20)
- Ignore the `Keeping extra secondaries` warning message since it is
retryable (Failure case: 2)
- Deduplicated code in `assert_rolled_back` and `assert_split_done`
- Added a log message before the long run of "Node `X` seen on
pageserver `Y`" lines
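
The error tolerance described in the list above can be illustrated with a minimal, self-contained sketch. The enums here are simplified, hypothetical stand-ins for the storage controller's `ReconcileWaitError`, `ReconcileError`, and `mgmt_api::Error` types, and `tolerate_cancellation` is an illustrative helper name, not a function from the codebase. The sketch only assumes that a cancelled reconcile (local or remote) is benign and that every other error should still propagate.

```rust
// Hypothetical stand-in for `mgmt_api::Error` (only the variants needed here).
#[derive(Debug, PartialEq)]
enum ApiError {
    Cancelled,
    Other(String),
}

// Hypothetical stand-in for `ReconcileError`.
#[derive(Debug, PartialEq)]
enum ReconcileError {
    Cancel,
    Remote(ApiError),
    Other(String),
}

// Hypothetical stand-in for `ReconcileWaitError`; the `u32` plays the role of a shard id.
#[derive(Debug, PartialEq)]
enum ReconcileWaitError {
    Timeout,
    Failed(u32, Box<ReconcileError>),
}

/// Treats local and remote cancellations as benign and forwards every other error.
/// (Illustrative helper, not part of the real service.)
fn tolerate_cancellation(
    result: Result<(), ReconcileWaitError>,
) -> Result<(), ReconcileWaitError> {
    match result {
        Ok(()) => Ok(()),
        // A cancelled reconciler was probably superseded by a newer change,
        // so the caller can simply try again later.
        Err(ReconcileWaitError::Failed(_, ref inner))
            if matches!(
                **inner,
                ReconcileError::Cancel | ReconcileError::Remote(ApiError::Cancelled)
            ) =>
        {
            Ok(())
        }
        // Anything else (timeouts, real failures) is still reported.
        Err(e) => Err(e),
    }
}
```

In this sketch a caller would loop until no work remains, treating an `Ok(())` returned after a cancellation as "possibly not idle yet, call again".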
Aleksandr Sarantsev
2025-06-18 17:27:41 +04:00
committed by GitHub
parent 7e711ede44
commit e6a404c66d
3 changed files with 45 additions and 48 deletions

@@ -8778,15 +8778,22 @@ impl Service {
         let waiter_count = waiters.len();
         match self.await_waiters(waiters, RECONCILE_TIMEOUT).await {
             Ok(()) => {}
-            Err(ReconcileWaitError::Failed(_, reconcile_error))
-                if matches!(*reconcile_error, ReconcileError::Cancel) =>
-            {
-                // Ignore reconciler cancel errors: this reconciler might have shut down
-                // because some other change superceded it. We will return a nonzero number,
-                // so the caller knows they might have to call again to quiesce the system.
-            }
             Err(e) => {
-                return Err(e);
+                if let ReconcileWaitError::Failed(_, reconcile_error) = &e {
+                    match **reconcile_error {
+                        ReconcileError::Cancel
+                        | ReconcileError::Remote(mgmt_api::Error::Cancelled) => {
+                            // Ignore reconciler cancel errors: this reconciler might have shut down
+                            // because some other change superceded it. We will return a nonzero number,
+                            // so the caller knows they might have to call again to quiesce the system.
+                        }
+                        _ => {
+                            return Err(e);
+                        }
+                    }
+                } else {
+                    return Err(e);
+                }
             }
         };