mirror of
https://github.com/neondatabase/neon.git
synced 2025-12-22 21:59:59 +00:00
## Problem While running tenant split tests I ran into a situation where PG got stuck completely. This seems to be a general problem that was not found in the previous chaos testing fixes. What happened is that if PG gets throttled by PS, and SC decided to move some tenant away, then PG reconfiguration could be blocked forever because it cannot talk to the old PS anymore to refresh the throttling stats, and reconfiguration cannot proceed because it's being throttled. Neon has considered the case that configuration could be blocked if the PG storage is full, but forgot the backpressure case. ## Summary of changes The PR fixes this problem by simply skipping throttling while PS is being configured, i.e., `max_cluster_size < 0`. An alternative fix is to set those throttle knobs to -1 (e.g., max_replication_apply_lag), however these knobs were labeled with PGC_POSTMASTER so their values cannot be changed unless we restart PG. ## How is this tested? Tested manually. Co-authored-by: Chen Luo <chen.luo@databricks.com>