Skip PG throttle during configuration (#12670)

## Problem While running tenant split tests I ran into a situation where PG got stuck completely. This seems to be a general problem that was not found in the previous chaos testing fixes. What happened is that if PG gets throttled by PS, and SC decided to move some tenant away, then PG reconfiguration could be blocked forever because it cannot talk to the old PS anymore to refresh the throttling stats, and reconfiguration cannot proceed because it's being throttled. Neon has considered the case that configuration could be blocked if the PG storage is full, but forgot the backpressure case. ## Summary of changes The PR fixes this problem by simply skipping throttling while PS is being configured, i.e., `max_cluster_size < 0`. An alternative fix is to set those throttle knobs to -1 (e.g., max_replication_apply_lag), however these knobs were labeled with PGC_POSTMASTER so their values cannot be changed unless we restart PG. ## How is this tested? Tested manually. Co-authored-by: Chen Luo <chen.luo@databricks.com>
2026-01-09 22:42:57 +00:00 · 2025-07-21 15:50:02 -05:00
parent 050c9f704f
commit b7bc3ce61e
3 changed files with 18 additions and 1 deletions
--- a/pgxn/neon/walproposer_pg.c
+++ b/pgxn/neon/walproposer_pg.c
@@ -400,6 +400,14 @@ static uint64
 backpressure_lag_impl(void)
 {
 	struct WalproposerShmemState* state = NULL;
+
+	/* BEGIN_HADRON */
+	if(max_cluster_size < 0){
+		// if max cluster size is not set, then we don't apply backpressure because we're reconfiguring PG
+		return 0;
+	}
+	/* END_HADRON */
+
 	if (max_replication_apply_lag > 0 || max_replication_flush_lag > 0 || max_replication_write_lag > 0)
 	{
 		XLogRecPtr	writePtr;
--- a/test_runner/regress/test_pg_regress.py
+++ b/test_runner/regress/test_pg_regress.py
@@ -368,7 +368,14 @@ def test_max_wal_rate(neon_simple_env: NeonEnv):
    superuser_name = "databricks_superuser"

    # Connect to postgres and create a database called "regression".
-    endpoint = env.endpoints.create_start("main")
+    endpoint = env.endpoints.create_start(
+        "main",
+        config_lines=[
+            # we need this option because default max_cluster_size < 0 will disable throttling completely
+            "neon.max_cluster_size=10GB",
+        ],
+    )
+
    endpoint.safe_psql_many(
        [
            f"CREATE ROLE {superuser_name}",
--- a/test_runner/regress/test_sharding.py
+++ b/test_runner/regress/test_sharding.py
@@ -1810,6 +1810,8 @@ def test_sharding_backpressure(neon_env_builder: NeonEnvBuilder):
            "config_lines": [
                # Tip: set to 100MB to make the test fail
                "max_replication_write_lag=1MB",
+                # Hadron: Need to set max_cluster_size to some value to enable any backpressure at all.
+                "neon.max_cluster_size=1GB",
            ],
            # We need `neon` extension for calling backpressure functions,
            # this flag instructs `compute_ctl` to pre-install it.