## Problem

There is no direct backpressure for compaction and L0 read amplification. This allows a large buildup of compaction debt and read amplification.

Resolves #5415. Requires #10402.

## Summary of changes

Delay layer flushes based on the number of level 0 delta layers:

* `l0_flush_delay_threshold`: delay flushes such that they take 2x as long (default `2 * compaction_threshold`).
* `l0_flush_stall_threshold`: stall flushes until the number of level 0 delta layers drops below the threshold (default `4 * compaction_threshold`).

If either threshold is reached, ephemeral layer rolls also synchronously wait for layer flushes, propagating this backpressure up into WAL ingestion. This bounds the number of frozen layers to 1 once backpressure kicks in, since all other frozen layers must flush before the rolled layer.

## Analysis

This will significantly change the compute backpressure characteristics. Recall the three compute backpressure knobs:

* `max_replication_write_lag`: 500 MB (based on Pageserver `last_received_lsn`).
* `max_replication_flush_lag`: 10 GB (based on Pageserver `disk_consistent_lsn`).
* `max_replication_apply_lag`: disabled (based on Pageserver `remote_consistent_lsn`).

Previously, the Pageserver would keep ingesting WAL, building up ephemeral layers and L0 layers, until the compute hit `max_replication_flush_lag` at 10 GB and began backpressuring. Now, once we delay/stall WAL ingestion, the compute will begin backpressuring after `max_replication_write_lag`, i.e. 500 MB. This is probably a good thing (we no longer build up a ton of compaction debt), but we should consider tuning these settings. `max_replication_flush_lag` probably doesn't serve a purpose anymore, and we should consider removing it.

Furthermore, the removal of the upload barrier in #10402 means that we no longer backpressure flushes based on S3 uploads, since `max_replication_apply_lag` is disabled. We should consider enabling that knob as well.

### When and what do we compact?

Default compaction settings:

* `compaction_threshold`: 10 L0 delta layers.
* `compaction_period`: 20 seconds (between each compaction loop check).
* `checkpoint_distance`: 256 MB (size of L0 delta layers).
* `l0_flush_delay_threshold`: 20 L0 delta layers.
* `l0_flush_stall_threshold`: 40 L0 delta layers.

Compaction characteristics:

* Minimum compaction volume: 10 layers * 256 MB = 2.5 GB.
* Additional compaction volume (assuming 128 MB/s WAL): 128 MB/s * 20 seconds = 2.5 GB (10 more L0 layers).
* Required compaction bandwidth: 5.0 GB / 20 seconds = 256 MB/s.

### When do we hit `max_replication_write_lag`?

Depending on how fast compaction and flushes happen, the compute will begin backpressuring somewhere between `l0_flush_delay_threshold` + `max_replication_write_lag` and `l0_flush_stall_threshold` + `max_replication_write_lag` worth of lag:

* Minimum compute backpressure lag: 20 layers * 256 MB + 500 MB = 5.6 GB.
* Maximum compute backpressure lag: 40 layers * 256 MB + 500 MB = 10.7 GB.

This seems like a reasonable range to me. The sketch below re-derives these numbers.
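As a back-of-the-envelope check, here is a small standalone Python sketch (not pageserver code; the variable names mirror the settings above, and it assumes the prose's decimal rounding of 1 GB ≈ 1000 MB) that re-derives the backpressure lag bounds from the defaults:

```python
# Back-of-the-envelope check of the backpressure bounds above.
MB = 1
GB = 1000 * MB  # decimal rounding, matching the prose

compaction_threshold = 10       # L0 delta layers
checkpoint_distance = 256 * MB  # size of each L0 delta layer
max_replication_write_lag = 500 * MB

l0_flush_delay_threshold = 2 * compaction_threshold  # 20 layers (default)
l0_flush_stall_threshold = 4 * compaction_threshold  # 40 layers (default)

min_lag = l0_flush_delay_threshold * checkpoint_distance + max_replication_write_lag
max_lag = l0_flush_stall_threshold * checkpoint_distance + max_replication_write_lag

print(f"min compute backpressure lag: {min_lag / GB:.1f} GB")  # 5.6 GB
print(f"max compute backpressure lag: {max_lag / GB:.1f} GB")  # 10.7 GB
```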
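For intuition, the delay/stall decision described above can be sketched as follows. This is illustrative only, with made-up names; the actual logic lives in the Rust pageserver:

```python
# Illustrative sketch of the flush delay/stall decision; not real pageserver code.
def l0_flush_backoff(l0_count: int, compaction_threshold: int = 10) -> str:
    """Classify how a layer flush should behave given the current L0 layer count."""
    delay_threshold = 2 * compaction_threshold  # default l0_flush_delay_threshold
    stall_threshold = 4 * compaction_threshold  # default l0_flush_stall_threshold
    if l0_count >= stall_threshold:
        return "stall"  # block the flush until compaction drains L0
    if l0_count >= delay_threshold:
        return "delay"  # pace the flush so it takes ~2x as long
    return "proceed"

assert l0_flush_backoff(5) == "proceed"
assert l0_flush_backoff(25) == "delay"
assert l0_flush_backoff(45) == "stall"
```

Note that the benchmark test below sets both `l0_flush_delay_threshold` and `l0_flush_stall_threshold` to 0 (i.e. disabled), so that the deliberately tiny layers it creates don't throttle WAL ingestion.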
from __future__ import annotations

import time

from fixtures.neon_fixtures import NeonEnvBuilder, flush_ep_to_pageserver


def test_layer_map(neon_env_builder: NeonEnvBuilder, zenbenchmark):
    """Benchmark searching the layer map, when there are a lot of small layer files."""
    env = neon_env_builder.init_configs()
    n_iters = 10
    n_records = 100000

    env.start()

    # We want to have a lot of layer files to exercise the layer map. Disable GC,
    # and make checkpoint_distance very small, so that we get a lot of small
    # layer files.
    tenant, timeline = env.create_tenant(
        conf={
            "gc_period": "0s",
            "checkpoint_distance": "16384",
            "compaction_period": "1 s",
            "compaction_threshold": "1",
            "l0_flush_delay_threshold": "0",
            "l0_flush_stall_threshold": "0",
            "compaction_target_size": "16384",
        }
    )

    endpoint = env.endpoints.create_start("main", tenant_id=tenant)
    cur = endpoint.connect().cursor()
    cur.execute("create table t(x integer)")
    for _ in range(n_iters):
        cur.execute(f"insert into t values (generate_series(1,{n_records}))")
        time.sleep(1)

    cur.execute("vacuum t")

    with zenbenchmark.record_duration("test_query"):
        cur.execute("SELECT count(*) from t")
        assert cur.fetchone() == (n_iters * n_records,)

    flush_ep_to_pageserver(env, endpoint, tenant, timeline)
    env.pageserver.http_client().timeline_checkpoint(
        tenant, timeline, compact=False, wait_until_uploaded=True
    )