Mirror of https://github.com/neondatabase/neon.git
## Problem

There is no direct backpressure for compaction and L0 read amplification. This allows a large buildup of compaction debt and read amplification.

Resolves #5415. Requires #10402.

## Summary of changes

Delay layer flushes based on the number of level 0 delta layers:

* `l0_flush_delay_threshold`: delay flushes such that they take 2x as long (default `2 * compaction_threshold`).
* `l0_flush_stall_threshold`: stall flushes until level 0 delta layers drop below the threshold (default `4 * compaction_threshold`).

If either threshold is reached, ephemeral layer rolls also synchronously wait for layer flushes to propagate this backpressure up into WAL ingestion. This bounds the number of frozen layers to 1 once backpressure kicks in, since all other frozen layers must flush before the rolled layer.

## Analysis

This will significantly change the compute backpressure characteristics. Recall the three compute backpressure knobs:

* `max_replication_write_lag`: 500 MB (based on Pageserver `last_received_lsn`).
* `max_replication_flush_lag`: 10 GB (based on Pageserver `disk_consistent_lsn`).
* `max_replication_apply_lag`: disabled (based on Pageserver `remote_consistent_lsn`).

Previously, the Pageserver would keep ingesting WAL and building up ephemeral layers and L0 layers until the compute hit `max_replication_flush_lag` at 10 GB and began backpressuring. Now, once we delay/stall WAL ingestion, the compute will begin backpressuring after `max_replication_write_lag`, i.e. 500 MB. This is probably a good thing (we're not building up a ton of compaction debt), but we should consider tuning these settings. `max_replication_flush_lag` probably doesn't serve a purpose anymore, and we should consider removing it.

Furthermore, the removal of the upload barrier in #10402 will mean that we no longer backpressure flushes based on S3 uploads, since `max_replication_apply_lag` is disabled. We should consider enabling this as well.

### When and what do we compact?

Default compaction settings:

* `compaction_threshold`: 10 L0 delta layers.
* `compaction_period`: 20 seconds (between each compaction loop check).
* `checkpoint_distance`: 256 MB (size of L0 delta layers).
* `l0_flush_delay_threshold`: 20 L0 delta layers.
* `l0_flush_stall_threshold`: 40 L0 delta layers.

Compaction characteristics:

* Minimum compaction volume: 10 layers * 256 MB = 2.5 GB.
* Additional compaction volume (assuming 128 MB/s WAL): 128 MB/s * 20 seconds = 2.5 GB (10 L0 layers).
* Required compaction bandwidth: 5.0 GB / 20 seconds = 256 MB/s.

### When do we hit `max_replication_write_lag`?

Depending on how fast compaction and flushes happen, the compute will begin backpressuring somewhere between `l0_flush_delay_threshold` and `l0_flush_stall_threshold`, plus `max_replication_write_lag`:

* Minimum compute backpressure lag: 20 layers * 256 MB + 500 MB = 5.6 GB.
* Maximum compute backpressure lag: 40 layers * 256 MB + 500 MB = 10.7 GB.

This seems like a reasonable range to me.
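The arithmetic above is easy to re-check with a small back-of-the-envelope sketch (plain Python using only the default values quoted in this description; none of this is pageserver code):

```python
# Back-of-the-envelope check of the backpressure lag bounds above.
# All values are the defaults quoted in this description, in MB.
compaction_threshold = 10        # L0 delta layers
checkpoint_distance = 256        # MB per L0 delta layer
max_replication_write_lag = 500  # MB

# Proposed defaults, derived from compaction_threshold.
l0_flush_delay_threshold = 2 * compaction_threshold  # 20 layers
l0_flush_stall_threshold = 4 * compaction_threshold  # 40 layers

# The compute starts backpressuring once ingestion is delayed/stalled,
# plus the write-lag allowance, so the lag lands somewhere in this range.
min_lag = l0_flush_delay_threshold * checkpoint_distance + max_replication_write_lag
max_lag = l0_flush_stall_threshold * checkpoint_distance + max_replication_write_lag

print(f"min backpressure lag: {min_lag / 1000:.1f} GB")  # ~5.6 GB
print(f"max backpressure lag: {max_lag / 1000:.1f} GB")  # ~10.7 GB
```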
The following regression test (Python, 70 lines, 2.4 KiB) disables the new flush thresholds via `initial_tenant_conf` and exercises pageserver recovery after a crash:
```python
from __future__ import annotations

import time
from contextlib import closing

from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder


#
# Test pageserver recovery after crash
#
def test_pageserver_recovery(neon_env_builder: NeonEnvBuilder):
    # Override default checkpointer settings to run it more often.
    # This also creates a bunch more L0 layers, so disable backpressure.
    env = neon_env_builder.init_start(
        initial_tenant_conf={
            "checkpoint_distance": "1048576",
            "l0_flush_delay_threshold": "0",
            "l0_flush_stall_threshold": "0",
        }
    )
    env.pageserver.is_testing_enabled_or_skip()

    # We expect the pageserver to exit, which will cause storage controller
    # requests to fail and warn.
    env.storage_controller.allowed_errors.append(".*management API still failed.*")
    env.storage_controller.allowed_errors.append(
        ".*Reconcile error.*error sending request for url.*"
    )

    # Create a branch for us
    env.create_branch("test_pageserver_recovery", ancestor_branch_name="main")

    endpoint = env.endpoints.create_start("test_pageserver_recovery")

    with closing(endpoint.connect()) as conn:
        with conn.cursor() as cur:
            with env.pageserver.http_client() as pageserver_http:
                # Create and initialize test table
                cur.execute("CREATE TABLE foo(x bigint)")
                cur.execute("INSERT INTO foo VALUES (generate_series(1,100000))")

                # Sleep for some time to let checkpoint create image layers
                time.sleep(2)

                # Configure failpoints
                pageserver_http.configure_failpoints(
                    [
                        ("flush-frozen-pausable", "sleep(2000)"),
                        ("flush-frozen-exit", "exit"),
                    ]
                )

                # Do some updates until pageserver is crashed
                try:
                    while True:
                        cur.execute("update foo set x=x+1")
                except Exception as err:
                    log.info(f"Expected server crash {err}")

    log.info("Wait before server restart")
    env.pageserver.stop()
    env.pageserver.start()

    with closing(endpoint.connect()) as conn:
        with conn.cursor() as cur:
            cur.execute("select count(*) from foo")
            assert cur.fetchone() == (100000,)
```