mirror of
https://github.com/neondatabase/neon.git
synced 2026-01-03 19:42:55 +00:00
## Problem

Currently, if `storcon` (storage controller) reconciliations repeatedly fail, the system will indefinitely freeze optimizations. This can result in optimization starvation for several days until the reconciliation issues are manually resolved. To mitigate this, we should detect persistently failing reconciliations and exclude them from influencing the optimization decision.

## Summary of Changes

- A tenant shard reconciliation is now considered "keep-failing" if it fails 5 consecutive times. These failures are excluded from the optimization readiness check.
- Added a new metric: `storage_controller_keep_failing_reconciles` to monitor such cases.
- Added a warning log message when a reconciliation is marked as "keep-failing".

---------

Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
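
The "keep-failing" rule described above can be illustrated with a minimal sketch. This is not the storage controller's actual code: `ReconcileTracker`, its methods, and the use of string shard IDs are hypothetical names chosen for illustration. It also assumes that a successful reconciliation resets the consecutive-failure counter. Only the threshold of 5 comes from the description.

```rust
use std::collections::HashMap;

/// Threshold taken from the description: 5 consecutive failures.
const KEEP_FAILING_THRESHOLD: u32 = 5;

/// Hypothetical tracker of consecutive reconcile failures per tenant shard.
#[derive(Default)]
struct ReconcileTracker {
    consecutive_failures: HashMap<String, u32>,
}

impl ReconcileTracker {
    /// Record the outcome of one reconcile attempt.
    /// Returns true exactly when this failure makes the shard "keep-failing",
    /// which is the point where a caller would emit the warning log.
    fn record(&mut self, shard: &str, succeeded: bool) -> bool {
        if succeeded {
            // Assumption: any success resets the streak.
            self.consecutive_failures.remove(shard);
            return false;
        }
        let count = self
            .consecutive_failures
            .entry(shard.to_string())
            .or_insert(0);
        *count += 1;
        *count == KEEP_FAILING_THRESHOLD
    }

    /// True if the shard has failed at least the threshold number of times in a row.
    fn is_keep_failing(&self, shard: &str) -> bool {
        self.consecutive_failures.get(shard).copied().unwrap_or(0) >= KEEP_FAILING_THRESHOLD
    }

    /// Value a gauge such as `storage_controller_keep_failing_reconciles` could report.
    fn keep_failing_count(&self) -> usize {
        self.consecutive_failures
            .values()
            .filter(|&&c| c >= KEEP_FAILING_THRESHOLD)
            .count()
    }

    /// Readiness check: shards with pending reconciles block optimization
    /// unless every one of them is keep-failing.
    fn can_optimize<'a>(&self, pending: impl IntoIterator<Item = &'a str>) -> bool {
        pending.into_iter().all(|shard| self.is_keep_failing(shard))
    }
}
```

In this sketch, a shard that has failed four times still blocks `can_optimize`. After the fifth consecutive failure it is ignored by the readiness check until a later success clears its streak.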