Set an upper limit on PG backpressure throttling (#12675)

## Problem
The tenant split test revealed another bug with PG backpressure throttling:
in some cases the PS may never report its progress back to the SK (e.g.,
observed when aborting a tenant shard split, where the old shard needs to
re-establish its SK connection and re-ingest WAL from a much older LSN). In
this case, PG may get stuck forever.

## Summary of changes
As a general precaution, since the PS feedback mechanism may not always be
reliable, this PR uses the previously introduced WAL write rate limiting
mechanism to slow writes down instead of pausing them completely. The idea
is to introduce a new `databricks_effective_max_wal_bytes_per_second`,
which is set to `databricks_max_wal_mb_per_second` when there is no PS
backpressure and to `10KB` when there is backpressure. This way, PG can
still write to the SK, although at a very low speed.
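
The selection logic can be pictured with a short Python sketch (illustrative
only; the helper name, the constant, and the unit conversion are assumptions,
not the actual implementation):

```python
# Illustrative model of how the effective WAL write limit could be chosen.
# The 10KB floor keeps PG trickling WAL to the SK even under backpressure.
BACKPRESSURE_FLOOR_BYTES_PER_SECOND = 10 * 1024


def effective_max_wal_bytes_per_second(
    max_wal_mb_per_second: int, backpressure_active: bool
) -> int:
    """Return the write rate the WAL proposer should enforce right now."""
    if backpressure_active:
        # Never pause completely; slow down to the floor instead.
        return BACKPRESSURE_FLOOR_BYTES_PER_SECOND
    return max_wal_mb_per_second * 1024 * 1024
```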

The PR also fixes the problem that the current WAL rate limiting mechanism
is too coarse grained and cannot enforce limits below 1MB/s. This is
because it always resets the rate limiter after one second, even if PG has
already written more data in that second than the limit allows. The fix is
to introduce a `batch_end_time_us` field which records the expected end
time of the current batch. For example, if PG writes 10MB of data in a
single batch and the max WAL write rate is set to `1MB/s`, then
`batch_end_time_us` will be set to 10 seconds in the future.
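
The carry-over behavior can be modeled with the sketch below (a minimal
Python model; the class, method names, and bookkeeping details are
assumptions, not the actual implementation):

```python
import time


class WalRateLimiter:
    """Toy model of the batch-based limiter: excess from a large batch is
    carried forward instead of being forgiven at a fixed 1-second reset."""

    def __init__(self, max_bytes_per_second: int):
        self.max_bytes_per_second = max(1, max_bytes_per_second)
        self.batch_end_time_us = int(time.monotonic() * 1_000_000)

    def record_batch(self, batch_bytes: int) -> None:
        now_us = int(time.monotonic() * 1_000_000)
        # If the previous deadline has already passed, start counting from now.
        start_us = max(self.batch_end_time_us, now_us)
        # Expected end time of this batch at the configured rate: a 10MB batch
        # at 1MB/s pushes the deadline ~10 seconds into the future.
        self.batch_end_time_us = start_us + batch_bytes * 1_000_000 // self.max_bytes_per_second

    def throttle(self) -> None:
        # Sleep until the current batch's deadline before sending more WAL.
        now_us = int(time.monotonic() * 1_000_000)
        if self.batch_end_time_us > now_us:
            time.sleep((self.batch_end_time_us - now_us) / 1_000_000)
```

In this model the writer calls `record_batch()` after each WAL append and
`throttle()` before the next one, so a large batch is paid off in full
before more WAL is sent.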

## How is this tested?
Tweaked the existing tests and also did manual testing on dev: I set
`max_replication_flush_lag` to 1GB and loaded 500GB of pgbench tables. PG
is expected to get throttled periodically because the PS will accumulate
4GB of data before flushing.

Results:
When PG is throttled:
```
9500000 of 3300000000 tuples (0%) done (elapsed 10.36 s, remaining 3587.62 s)
9600000 of 3300000000 tuples (0%) done (elapsed 124.07 s, remaining 42523.59 s)
9700000 of 3300000000 tuples (0%) done (elapsed 255.79 s, remaining 86763.97 s)
9800000 of 3300000000 tuples (0%) done (elapsed 315.89 s, remaining 106056.52 s)
9900000 of 3300000000 tuples (0%) done (elapsed 412.75 s, remaining 137170.58 s)
```

When the PS has just flushed:
```
18100000 of 3300000000 tuples (0%) done (elapsed 433.80 s, remaining 78655.96 s)
18200000 of 3300000000 tuples (0%) done (elapsed 433.85 s, remaining 78231.71 s)
18300000 of 3300000000 tuples (0%) done (elapsed 433.90 s, remaining 77810.62 s)
18400000 of 3300000000 tuples (0%) done (elapsed 433.96 s, remaining 77395.86 s)
18500000 of 3300000000 tuples (0%) done (elapsed 434.03 s, remaining 76987.27 s)
18600000 of 3300000000 tuples (0%) done (elapsed 434.08 s, remaining 76579.59 s)
18700000 of 3300000000 tuples (0%) done (elapsed 434.13 s, remaining 76177.12 s)
18800000 of 3300000000 tuples (0%) done (elapsed 434.19 s, remaining 75779.45 s)
18900000 of 3300000000 tuples (0%) done (elapsed 434.84 s, remaining 75489.40 s)
19000000 of 3300000000 tuples (0%) done (elapsed 434.89 s, remaining 75097.90 s)
19100000 of 3300000000 tuples (0%) done (elapsed 434.94 s, remaining 74712.56 s)
19200000 of 3300000000 tuples (0%) done (elapsed 498.93 s, remaining 85254.20 s)
19300000 of 3300000000 tuples (0%) done (elapsed 498.97 s, remaining 84817.95 s)
19400000 of 3300000000 tuples (0%) done (elapsed 623.80 s, remaining 105486.76 s)
19500000 of 3300000000 tuples (0%) done (elapsed 745.86 s, remaining 125476.51 s)
```

Co-authored-by: Chen Luo <chen.luo@databricks.com>
Excerpts from the test changes:

```
@@ -395,23 +395,6 @@ def test_max_wal_rate(neon_simple_env: NeonEnv):
    tuples = endpoint.safe_psql("SELECT backpressure_throttling_time();")
    assert tuples[0][0] == 0, "Backpressure throttling detected"
    # 0 MB/s max_wal_rate. WAL proposer can still push some WALs but will be super slow.
    endpoint.safe_psql_many(
        [
            "ALTER SYSTEM SET databricks.max_wal_mb_per_second = 0;",
            "SELECT pg_reload_conf();",
        ]
    )
    # Write ~10 KB data should hit backpressure.
    with endpoint.cursor(dbname=DBNAME) as cur:
        cur.execute("SET databricks.max_wal_mb_per_second = 0;")
        for _ in range(0, 10):
            cur.execute("INSERT INTO usertable SELECT random(), repeat('a', 1000);")
    tuples = endpoint.safe_psql("SELECT backpressure_throttling_time();")
    assert tuples[0][0] > 0, "No backpressure throttling detected"
    # 1 MB/s max_wal_rate.
    endpoint.safe_psql_many(
        [
```

```
@@ -1508,20 +1508,55 @@ def test_sharding_split_failures(
    env.storage_controller.consistency_check()
@pytest.mark.skip(reason="The backpressure change has not been merged yet.")
# HADRON
def test_create_tenant_after_split(neon_env_builder: NeonEnvBuilder):
    """
    Tests creating a tenant and a timeline should fail after a tenant split.
    """
    env = neon_env_builder.init_start(initial_tenant_shard_count=4)
    env.storage_controller.allowed_errors.extend(
        [
            ".*already exists with a different shard count.*",
        ]
    )
    ep = env.endpoints.create_start("main", tenant_id=env.initial_tenant)
    ep.safe_psql("CREATE TABLE usertable ( YCSB_KEY INT, FIELD0 TEXT);")
    ep.safe_psql("INSERT INTO usertable VALUES (1, 'test1');")
    ep.safe_psql("INSERT INTO usertable VALUES (2, 'test2');")
    ep.safe_psql("INSERT INTO usertable VALUES (3, 'test3');")
    # Split the tenant
    env.storage_controller.tenant_shard_split(env.initial_tenant, shard_count=8)
    with pytest.raises(RuntimeError):
        env.create_tenant(env.initial_tenant, env.initial_timeline, shard_count=4)
    # run more queries
    ep.safe_psql("SELECT * FROM usertable;")
    ep.safe_psql("UPDATE usertable set FIELD0 = 'test4';")
    ep.stop_and_destroy()
# HADRON
def test_back_pressure_during_split(neon_env_builder: NeonEnvBuilder):
    """
    Test backpressure can ignore new shards during tenant split so that if we abort the split,
    PG can continue without being blocked.
    Test backpressure works correctly during a shard split, especially after a split is aborted,
    PG will not be stuck forever.
    """
    DBNAME = "regression"
    init_shard_count = 4
    init_shard_count = 1
    neon_env_builder.num_pageservers = init_shard_count
    stripe_size = 32
    env = neon_env_builder.init_start(
        initial_tenant_shard_count=init_shard_count, initial_tenant_shard_stripe_size=stripe_size
        initial_tenant_shard_count=init_shard_count,
        initial_tenant_shard_stripe_size=stripe_size,
        initial_tenant_conf={
            "checkpoint_distance": 1024 * 1024 * 1024,
        },
    )
    env.storage_controller.allowed_errors.extend(
```

```
@@ -1537,19 +1572,31 @@ def test_back_pressure_during_split(neon_env_builder: NeonEnvBuilder):
        "main",
        config_lines=[
            "max_replication_write_lag = 1MB",
            "databricks.max_wal_mb_per_second = 1",
            "neon.max_cluster_size = 10GB",
            "databricks.max_wal_mb_per_second=100",
        ],
    )
    endpoint.respec(skip_pg_catalog_updates=False)  # Needed for databricks_system to get created.
    endpoint.respec(skip_pg_catalog_updates=False)
    endpoint.start()
    endpoint.safe_psql(f"CREATE DATABASE {DBNAME}")
    endpoint.safe_psql("CREATE TABLE usertable ( YCSB_KEY INT, FIELD0 TEXT);")
    # generate 10MB of data
    endpoint.safe_psql(
        "CREATE TABLE usertable AS SELECT s AS KEY, repeat('a', 1000) as VALUE from generate_series(1, 10000) s;"
    )
    write_done = Event()
    def write_data(write_done):
    def get_write_lag():
        res = endpoint.safe_psql(
            """
            SELECT
                pg_wal_lsn_diff(pg_current_wal_flush_lsn(), received_lsn) as received_lsn_lag
            FROM neon.backpressure_lsns();
            """,
            log_query=False,
        )
        return res[0][0]
    def write_data(write_done: Event):
        while not write_done.is_set():
            endpoint.safe_psql(
                "INSERT INTO usertable SELECT random(), repeat('a', 1000);", log_query=False
```

```
@@ -1560,35 +1607,39 @@ def test_back_pressure_during_split(neon_env_builder: NeonEnvBuilder):
    writer_thread.start()
    env.storage_controller.configure_failpoints(("shard-split-pre-complete", "return(1)"))
    # sleep 10 seconds before re-activating the old shard when aborting the split.
    # this is to add some backpressures to PG
    env.pageservers[0].http_client().configure_failpoints(
        ("attach-before-activate-sleep", "return(10000)"),
    )
    # split the tenant
    with pytest.raises(StorageControllerApiException):
        env.storage_controller.tenant_shard_split(env.initial_tenant, shard_count=16)
        env.storage_controller.tenant_shard_split(env.initial_tenant, shard_count=4)
    def check_tenant_status():
        status = (
            env.pageservers[0].http_client().tenant_status(TenantShardId(env.initial_tenant, 0, 1))
        )
        assert status["state"]["slug"] == "Active"
    wait_until(check_tenant_status)
    write_done.set()
    writer_thread.join()
    log.info(f"current write lag: {get_write_lag()}")
    # writing more data to page servers after split is aborted
    for _i in range(5000):
        endpoint.safe_psql(
            "INSERT INTO usertable SELECT random(), repeat('a', 1000);", log_query=False
        )
    with endpoint.cursor() as cur:
        for _i in range(1000):
            cur.execute("INSERT INTO usertable SELECT random(), repeat('a', 1000);")
    # wait until write lag becomes 0
    def check_write_lag_is_zero():
        res = endpoint.safe_psql(
            """
            SELECT
                pg_wal_lsn_diff(pg_current_wal_flush_lsn(), received_lsn) as received_lsn_lag
            FROM neon.backpressure_lsns();
            """,
            dbname="databricks_system",
            log_query=False,
        )
        log.info(f"received_lsn_lag = {res[0][0]}")
        assert res[0][0] == 0
        res = get_write_lag()
        assert res == 0
    wait_until(check_write_lag_is_zero)
    endpoint.stop_and_destroy()
# BEGIN_HADRON
```

```
@@ -1674,7 +1725,6 @@ def test_shard_resolve_during_split_abort(neon_env_builder: NeonEnvBuilder):
# HADRON
@pytest.mark.skip(reason="The backpressure change has not been merged yet.")
def test_back_pressure_per_shard(neon_env_builder: NeonEnvBuilder):
    """
    Tests back pressure knobs are enforced on the per shard basis instead of at the tenant level.
```

```
@@ -1703,20 +1753,16 @@ def test_back_pressure_per_shard(neon_env_builder: NeonEnvBuilder):
            "neon.max_cluster_size = 10GB",
        ],
    )
    endpoint.respec(skip_pg_catalog_updates=False)  # Needed for databricks_system to get created.
    endpoint.respec(skip_pg_catalog_updates=False)
    endpoint.start()
    # generate 20MB of data
    # generate 10MB of data
    endpoint.safe_psql(
        "CREATE TABLE usertable AS SELECT s AS KEY, repeat('a', 1000) as VALUE from generate_series(1, 20000) s;"
        "CREATE TABLE usertable AS SELECT s AS KEY, repeat('a', 1000) as VALUE from generate_series(1, 10000) s;"
    )
    res = endpoint.safe_psql(
        "SELECT neon.backpressure_throttling_time() as throttling_time", dbname="databricks_system"
    )[0]
    res = endpoint.safe_psql("SELECT neon.backpressure_throttling_time() as throttling_time")[0]
    assert res[0] == 0, f"throttling_time should be 0, but got {res[0]}"
    endpoint.stop()
# HADRON
def test_shard_split_page_server_timeout(neon_env_builder: NeonEnvBuilder):
```

```
@@ -1880,14 +1926,14 @@ def test_sharding_backpressure(neon_env_builder: NeonEnvBuilder):
    shards_info()
    for _write_iter in range(30):
        # approximately 1MB of data
        workload.write_rows(8000, upload=False)
        # approximately 10MB of data
        workload.write_rows(80000, upload=False)
        update_write_lsn()
        infos = shards_info()
        min_lsn = min(Lsn(info["last_record_lsn"]) for info in infos)
        max_lsn = max(Lsn(info["last_record_lsn"]) for info in infos)
        diff = max_lsn - min_lsn
        assert diff < 2 * 1024 * 1024, f"LSN diff={diff}, expected diff < 2MB due to backpressure"
        assert diff < 8 * 1024 * 1024, f"LSN diff={diff}, expected diff < 8MB due to backpressure"
def test_sharding_unlogged_relation(neon_env_builder: NeonEnvBuilder):
```