Skip PG throttle during configuration (#12670)

## Problem

While running tenant split tests, I ran into a situation where PG got
stuck completely. This appears to be a general problem that the previous
chaos-testing fixes did not cover.

If PG is being throttled by PS and SC decides to move a tenant away,
PG reconfiguration can block forever: PG can no longer talk to the old
PS to refresh the throttling stats, so the throttle never lifts, and
reconfiguration cannot proceed while PG is throttled. Neon already
handles the case where configuration is blocked because the PG storage
is full, but it missed the backpressure case.
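
The cycle described above can be illustrated with a toy model. This is
only a sketch, not Neon code: `StaleLagSource`, `wait_for_backpressure`,
and the step limit are hypothetical names made up for illustration, and
the real check lives in `backpressure_lag_impl`.

```python
from dataclasses import dataclass


@dataclass
class StaleLagSource:
    """Hypothetical stand-in for lag stats that only the old PS can refresh."""

    lag_bytes: int
    old_ps_reachable: bool

    def refresh(self) -> int:
        # The lag only drops when the old PS can still report progress.
        if self.old_ps_reachable:
            self.lag_bytes = 0
        return self.lag_bytes


def wait_for_backpressure(source: StaleLagSource, max_steps: int = 5) -> bool:
    """Return True if throttling clears within max_steps, False if still stuck."""
    for _ in range(max_steps):
        if source.refresh() == 0:
            return True
    return False


# Old PS unreachable: the stats never refresh, so the wait never completes.
stuck = wait_for_backpressure(StaleLagSource(lag_bytes=1 << 20, old_ps_reachable=False))
# Old PS reachable: the lag drains and throttling clears.
healthy = wait_for_backpressure(StaleLagSource(lag_bytes=1 << 20, old_ps_reachable=True))
```

In the model the step limit only exists so the example terminates; the
situation described here has no such bound.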

## Summary of changes
The PR fixes this problem by skipping throttling while PG is being
configured, i.e., when `max_cluster_size < 0`. An alternative fix would
be to set the throttle knobs (e.g., `max_replication_apply_lag`) to -1.
However, these knobs are marked `PGC_POSTMASTER`, so their values cannot
be changed without restarting PG.
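
The decision logic can be sketched in Python as follows. This is a
simplified model under stated assumptions, not the actual C code:
`backpressure_lag` is a hypothetical name, and it collapses the three
replication lag limits into a single `max_lag` parameter.

```python
def backpressure_lag(max_cluster_size: int, lag: int, max_lag: int) -> int:
    """Toy model: return the excess lag in bytes, or 0 when no throttling applies."""
    if max_cluster_size < 0:
        # An unset cluster size is treated as "PG is being reconfigured",
        # so throttling is skipped regardless of the current lag.
        return 0
    if max_lag > 0 and lag > max_lag:
        # Assumption: a positive limit enables the check, as in the C condition.
        return lag - max_lag
    return 0
```

Note the side effect: a configuration that leaves `max_cluster_size`
negative never throttles, which is why the tests in this commit set
`neon.max_cluster_size` explicitly.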

## How is this tested?
Tested manually.

Co-authored-by: Chen Luo <chen.luo@databricks.com>
Tristan Partin
2025-07-21 15:50:02 -05:00
committed by GitHub
parent 050c9f704f
commit b7bc3ce61e
3 changed files with 18 additions and 1 deletions


@@ -400,6 +400,14 @@ static uint64
 backpressure_lag_impl(void)
 {
     struct WalproposerShmemState* state = NULL;
+    /* BEGIN_HADRON */
+    if(max_cluster_size < 0){
+        // if max cluster size is not set, then we don't apply backpressure because we're reconfiguring PG
+        return 0;
+    }
+    /* END_HADRON */
     if (max_replication_apply_lag > 0 || max_replication_flush_lag > 0 || max_replication_write_lag > 0)
     {
         XLogRecPtr writePtr;


@@ -368,7 +368,14 @@ def test_max_wal_rate(neon_simple_env: NeonEnv):
     superuser_name = "databricks_superuser"
     # Connect to postgres and create a database called "regression".
-    endpoint = env.endpoints.create_start("main")
+    endpoint = env.endpoints.create_start(
+        "main",
+        config_lines=[
+            # we need this option because default max_cluster_size < 0 will disable throttling completely
+            "neon.max_cluster_size=10GB",
+        ],
+    )
     endpoint.safe_psql_many(
         [
             f"CREATE ROLE {superuser_name}",


@@ -1810,6 +1810,8 @@ def test_sharding_backpressure(neon_env_builder: NeonEnvBuilder):
"config_lines": [
     # Tip: set to 100MB to make the test fail
     "max_replication_write_lag=1MB",
+    # Hadron: Need to set max_cluster_size to some value to enable any backpressure at all.
+    "neon.max_cluster_size=1GB",
],
# We need `neon` extension for calling backpressure functions,
# this flag instructs `compute_ctl` to pre-install it.