neon/regress at skyzh/workaround-layer-map - neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-06 05:30:38 +00:00

Files

Erik Grinaker ddb9ae1214 pageserver: add compaction backpressure for layer flushes (#10405)

## Problem

There is no direct backpressure for compaction and L0 read
amplification. This allows a large buildup of compaction debt and read
amplification.

Resolves #5415.
Requires #10402.

## Summary of changes

Delay layer flushes based on the number of level 0 delta layers:

* `l0_flush_delay_threshold`: delay flushes such that they take 2x as
long (default `2 * compaction_threshold`).
* `l0_flush_stall_threshold`: stall flushes until level 0 delta layers
drop below threshold (default `4 * compaction_threshold`).

If either threshold is reached, ephemeral layer rolls also synchronously
wait for layer flushes to propagate this backpressure up into WAL
ingestion. This will bound the number of frozen layers to 1 once
backpressure kicks in, since all other frozen layers must flush before
the rolled layer.
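
The threshold rule above can be sketched roughly as follows. This is a minimal illustration in Python, not the Pageserver's actual Rust code: the names `FlushAction` and `flush_action` are hypothetical, and treating a threshold as "reached" when the L0 count is greater than or equal to it is an assumption.

```python
from enum import Enum
from typing import Optional


class FlushAction(Enum):
    PROCEED = "proceed"  # flush immediately
    DELAY = "delay"      # flush, but slow it down so it takes about 2x as long
    STALL = "stall"      # wait until the L0 count drops below the stall threshold


def flush_action(
    l0_count: int,
    compaction_threshold: int = 10,
    delay_threshold: Optional[int] = None,
    stall_threshold: Optional[int] = None,
) -> FlushAction:
    """Pick a flush action from the current number of L0 delta layers."""
    # Defaults from the commit message: 2x and 4x compaction_threshold.
    if delay_threshold is None:
        delay_threshold = 2 * compaction_threshold
    if stall_threshold is None:
        stall_threshold = 4 * compaction_threshold

    # Assumption: a threshold counts as reached at l0_count >= threshold.
    if l0_count >= stall_threshold:
        return FlushAction.STALL
    if l0_count >= delay_threshold:
        return FlushAction.DELAY
    return FlushAction.PROCEED
```

With the defaults, `flush_action(19)` proceeds, `flush_action(20)` delays, and `flush_action(40)` stalls.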

## Analysis

This will significantly change the compute backpressure characteristics.
Recall the three compute backpressure knobs:

* `max_replication_write_lag`: 500 MB (based on Pageserver
`last_received_lsn`).
* `max_replication_flush_lag`: 10 GB (based on Pageserver
`disk_consistent_lsn`).
* `max_replication_apply_lag`: disabled (based on Pageserver
`remote_consistent_lsn`).

Previously, the Pageserver would keep ingesting WAL and build up
ephemeral layers and L0 layers until the compute hit
`max_replication_flush_lag` at 10 GB and began backpressuring. Now, once
we delay/stall WAL ingestion, the compute will begin backpressuring
after `max_replication_write_lag`, i.e. 500 MB. This is probably a good
thing (we're not building up a ton of compaction debt), but we should
consider tuning these settings.
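
To make the shift from the flush-lag knob to the write-lag knob concrete, here is a small sketch. It assumes each lag is simply the compute's current WAL position minus the corresponding Pageserver LSN, with LSNs modelled as plain byte offsets and a limit counted as exceeded only when the lag is strictly greater; the function name `exceeded_knobs` and its parameters are hypothetical.

```python
from typing import Optional

MB = 1024 * 1024
GB = 1024 * MB


def exceeded_knobs(
    compute_lsn: int,
    last_received_lsn: int,
    disk_consistent_lsn: int,
    remote_consistent_lsn: int,
    max_write_lag: Optional[int] = 500 * MB,
    max_flush_lag: Optional[int] = 10 * GB,
    max_apply_lag: Optional[int] = None,  # None models "disabled"
) -> list[str]:
    """Return the names of the knobs whose lag limit is exceeded."""
    checks = [
        ("max_replication_write_lag", max_write_lag, last_received_lsn),
        ("max_replication_flush_lag", max_flush_lag, disk_consistent_lsn),
        ("max_replication_apply_lag", max_apply_lag, remote_consistent_lsn),
    ]
    return [
        name
        for name, limit, pageserver_lsn in checks
        if limit is not None and compute_lsn - pageserver_lsn > limit
    ]


# Ingestion is stalled: the Pageserver is 600 MB behind on received WAL
# and 2 GB behind on flushed WAL. Only the write-lag knob is exceeded.
print(exceeded_knobs(
    compute_lsn=3 * GB,
    last_received_lsn=3 * GB - 600 * MB,
    disk_consistent_lsn=1 * GB,
    remote_consistent_lsn=0,
))
```

In this model a stalled ingest path trips `max_replication_write_lag` long before the flush lag gets anywhere near 10 GB.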

`max_replication_flush_lag` probably doesn't serve a purpose anymore,
and we should consider removing it.

Furthermore, the removal of the upload barrier in #10402 will mean that
we no longer backpressure flushes based on S3 uploads, since
`max_replication_apply_lag` is disabled. We should consider enabling
this as well.

### When and what do we compact?

Default compaction settings:

* `compaction_threshold`: 10 L0 delta layers.
* `compaction_period`: 20 seconds (between each compaction loop check).
* `checkpoint_distance`: 256 MB (size of L0 delta layers).
* `l0_flush_delay_threshold`: 20 L0 delta layers.
* `l0_flush_stall_threshold`: 40 L0 delta layers.

Compaction characteristics:

* Minimum compaction volume: 10 layers * 256 MB = 2.5 GB.
* Additional compaction volume (assuming 128 MB/s WAL): 128 MB/s * 20
seconds = 2.5 GB (10 L0 layers).
* Required compaction bandwidth: 5.0 GB / 20 seconds = 256 MB/s.
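
The arithmetic above can be reproduced with a short helper. This is only a worked calculation under the stated defaults and the assumed 128 MB/s WAL rate; `compaction_budget` is a hypothetical name, and sizes are kept in MB to avoid rounding.

```python
def compaction_budget(
    compaction_threshold: int = 10,     # L0 delta layers
    compaction_period_s: int = 20,      # seconds between compaction loop checks
    checkpoint_distance_mb: int = 256,  # size of one L0 delta layer
    wal_rate_mb_s: int = 128,           # assumed WAL ingest rate
) -> dict:
    """Compute the compaction volumes and the bandwidth needed to keep up."""
    # Volume already present when compaction first triggers.
    min_volume_mb = compaction_threshold * checkpoint_distance_mb
    # WAL that arrives during one compaction period.
    additional_mb = wal_rate_mb_s * compaction_period_s
    return {
        "min_volume_mb": min_volume_mb,
        "additional_volume_mb": additional_mb,
        "additional_layers": additional_mb / checkpoint_distance_mb,
        "required_bandwidth_mb_s": (min_volume_mb + additional_mb) / compaction_period_s,
    }


budget = compaction_budget()
print(budget["min_volume_mb"])            # 2560
print(budget["additional_layers"])        # 10.0
print(budget["required_bandwidth_mb_s"])  # 256.0
```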

### When do we hit `max_replication_write_lag`?

Depending on how fast compaction and flushes happen, the compute will
begin backpressuring somewhere between `l0_flush_delay_threshold` +
`max_replication_write_lag` and `l0_flush_stall_threshold` +
`max_replication_write_lag`.

* Minimum compute backpressure lag: 20 layers * 256 MB + 500 MB = 5.6 GB
* Maximum compute backpressure lag: 40 layers * 256 MB + 500 MB = 10.7 GB

This seems like a reasonable range to me.
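
As a quick check of the range, the bound is the L0 layer threshold times the layer size, plus the write lag limit. The helper below is a hypothetical sketch that keeps everything in MB:

```python
def backpressure_lag_mb(
    l0_threshold: int,
    checkpoint_distance_mb: int = 256,  # size of one L0 delta layer
    max_write_lag_mb: int = 500,        # max_replication_write_lag
) -> int:
    """Lag at which the compute starts backpressuring, in MB."""
    return l0_threshold * checkpoint_distance_mb + max_write_lag_mb


print(backpressure_lag_mb(20))  # 5620  (l0_flush_delay_threshold)
print(backpressure_lag_mb(40))  # 10740 (l0_flush_stall_threshold)
```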

2025-01-24 09:47:28 +00:00

data/extension_test/5670669815

[compute/postgres] feature: PostgreSQL 17 (#8573)

2024-09-12 23:18:41 +01:00

test_ancestor_branch.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_attach_tenant_config.py

pageserver: add compaction backpressure for layer flushes (#10405)

2025-01-24 09:47:28 +00:00

test_auth_broker.py

[auth_broker]: regress test (#9541)

2024-10-29 11:39:09 +00:00

test_auth.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_backpressure.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_bad_connection.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_basebackup_error.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_branch_and_gc.py

pageserver: add compaction backpressure for layer flushes (#10405)

2025-01-24 09:47:28 +00:00

test_branch_behind.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_branching.py

Reapply "pageserver: revert flush backpressure" (#10270) (#10402)

2025-01-24 08:35:35 +00:00

test_broken_timeline.py

pageserver: fix spurious error logs in timeline lifecycle (#9589)

2024-10-31 14:44:59 +00:00

test_build_info_metric.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_change_pageserver.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_clog_truncate.py

pageserver: tighten up code around SLRU dir key handling (#10082)

2024-12-16 10:06:08 +00:00

test_close_fds.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_combocid.py

test_runner: use LFC by default (#8613)

2024-11-25 09:01:05 +00:00

test_compaction.py

fix(pageserver): handle dup layers during gc-compaction (#10430)

2025-01-23 21:54:44 +00:00

test_compatibility.py

pageserver: add compaction backpressure for layer flushes (#10405)

2025-01-24 09:47:28 +00:00

test_compute_catalog.py

fix(compute_ctl): Resolve issues with dropping roles having dangling permissions (#10299)

2025-01-09 16:39:53 +00:00

test_compute_locales.py

Switch compute-related locales to C.UTF-8 by default

2024-11-08 12:19:18 -06:00

test_compute_metrics.py

Update postgres-exporter and sql_exporter in computes (#10349)

2025-01-14 00:44:39 +00:00

test_compute_migrations.py

Add more substantial tests for compute migrations (#9811)

2025-01-02 18:37:50 +00:00

test_config.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_crafted_wal_end.py

safekeeper: use protobuf for sending compressed records to pageserver (#9821)

2024-11-27 12:12:21 +00:00

test_createdropdb.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_createuser.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_ddl_forwarding.py

fix(adapter): password not set in role drop (#10130)

2024-12-14 17:37:13 +00:00

test_disk_usage_eviction.py

pageserver/storcon: add patch endpoints for tenant config metrics (#10020)

2024-12-11 19:16:33 +00:00

test_download_extensions.py

Improve typing in test_runner/fixtures/httpserver.py (#10103)

2024-12-11 22:21:42 +00:00

test_endpoint_crash.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_explain_with_lfc_stats.py

test_runner: use LFC by default (#8613)

2024-11-25 09:01:05 +00:00

test_extensions.py

compute_ctl: Add endpoint that allows extensions to be installed (#9344)

2024-10-18 15:07:36 +03:00

test_fsm_truncate.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_fullbackup.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_gc_aggressive.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_gin_redo.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_hot_standby.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_import_pgdata.py

fast import: basic python test (#10271)

2025-01-21 16:50:44 +00:00

test_import.py

pageserver: fix spurious error logs in timeline lifecycle (#9589)

2024-10-31 14:44:59 +00:00

test_ingestion_layer_size.py

pageserver/storcon: add patch endpoints for tenant config metrics (#10020)

2024-12-11 19:16:33 +00:00

test_installed_extensions.py

Update compute_installed_extensions metric: (#9891)

2024-12-11 16:43:26 +00:00

test_large_schema.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_layer_bloating.py

test_runner: skip more tests using decorator instead of pytest.skip (#9704)

2024-11-11 18:07:01 +00:00

test_layer_eviction.py

test_runner: skip more tests using decorator instead of pytest.skip (#9704)

2024-11-11 18:07:01 +00:00

test_layer_writers_fail.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_layers_from_future.py

pageserver/storcon: add patch endpoints for tenant config metrics (#10020)

2024-12-11 19:16:33 +00:00

test_lfc_resize.py

tests: smaller datasets in LFC tests (#10346)

2025-01-10 15:53:23 +00:00

test_lfc_working_set_approximation.py

test_runner: use LFC by default (#8613)

2024-11-25 09:01:05 +00:00

test_local_file_cache.py

Compute/LFC: Apply limits consistently (#10449)

2025-01-20 18:29:21 +00:00

test_logging.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_logical_replication.py

Fix test_subscriber_synchronous_commit flakiness. (#10057)

2024-12-12 11:57:00 +00:00

test_lsn_mapping.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_multixact.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_nbtree_pagesplit_cycleid.py

Make sure we request pages with a known-flushed LSN. (#10413)

2025-01-16 08:34:11 +00:00

test_neon_cli.py

test_runner: skip more tests using decorator instead of pytest.skip (#9704)

2024-11-11 18:07:01 +00:00

test_neon_extension.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_neon_local_cli.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_neon_superuser.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_next_xid.py

Update ruff to much newer version (#9433)

2024-10-18 12:42:41 +02:00

test_normal_work.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_oid_overflow.py

test_runner: use LFC by default (#8613)

2024-11-25 09:01:05 +00:00

test_old_request_lsn.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_ondemand_download.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_ondemand_slru_download.py

Python 3.11 (#9515)

2024-11-21 16:25:31 +00:00

test_ondemand_wal_download.py

2024-11-19 22:29:57 +02:00

test_page_service_batching_regressions.py

fix(page_service pipelining): tenant cannot shut down because gate kept open while flushing responses (#10386)

2025-01-16 20:34:02 +00:00

test_pageserver_api.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_pageserver_catchup.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_pageserver_crash_consistency.py

pageserver/storcon: add patch endpoints for tenant config metrics (#10020)

2024-12-11 19:16:33 +00:00

test_pageserver_generations.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_pageserver_getpage_throttle.py

tests & benchmarks: unify the way we customize the default tenant config (#9992)

2024-12-03 22:07:03 +00:00

test_pageserver_layer_rolling.py

fix(pageserver): run psql in thread to avoid blocking (#10177)

2024-12-19 09:45:06 +00:00

test_pageserver_metric_collection.py

Improve typing in test_runner/fixtures/httpserver.py (#10103)

2024-12-11 22:21:42 +00:00

test_pageserver_reconnect.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_pageserver_restart.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_pageserver_restarts_under_workload.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_pageserver_secondary.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_pg_query_cancellation.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_pg_regress.py

Python 3.11 (#9515)

2024-11-21 16:25:31 +00:00

test_pg_waldump.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_physical_and_logical_replicaiton.py

Fix flukyness of test_physical_and_logical_replicaiton.py (#10176)

2024-12-18 19:15:38 +00:00

test_physical_replication.py

Increase max connection for replica to prevent test flukyness (#10306)

2025-01-13 20:01:03 +00:00

test_pitr_gc.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_postgres_version.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_prefetch_buffer_resize.py

test_prefetch: reduce timeout to default 5m from 10m (#10105)

2024-12-13 14:52:54 +00:00

test_proxy_allowed_ips.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_proxy_metric_collection.py

Improve typing in test_runner/fixtures/httpserver.py (#10103)

2024-12-11 22:21:42 +00:00

test_proxy_websockets.py

Add a websockets tunnel and a test for the proxy's websockets support. (#3823)

2025-01-13 11:35:39 +00:00

test_proxy.py

feat(proxy): add option to forward startup params (#9979)

2024-12-04 12:58:35 +00:00

test_read_validation.py

test_runner: use LFC by default (#8613)

2024-11-25 09:01:05 +00:00

test_readonly_node.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_recovery.py

pageserver: add compaction backpressure for layer flushes (#10405)

2025-01-24 09:47:28 +00:00

test_remote_storage.py

pageserver: add compaction backpressure for layer flushes (#10405)

2025-01-24 09:47:28 +00:00

test_replica_start.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_role_grants.py

compute_ctl: Add endpoint that allows setting role grants (#9395)

2024-10-18 11:25:45 +01:00

test_s3_restore.py

Python 3.11 (#9515)

2024-11-21 16:25:31 +00:00

test_setup.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_sharding.py

storcon: rework scheduler optimisation, prioritize AZ (#9916)

2025-01-13 19:33:00 +00:00

test_sni_router.py

build(deps): bump mypy from 1.3.0 to 1.13.0 (#9670)

2024-11-22 14:31:36 +00:00

test_storage_controller.py

tests: stabilize test_storage_controller_node_deletion (#10420)

2025-01-16 19:00:16 +00:00

test_storage_scrubber.py

Only churn rows once in test_scrubber_physical_gc_ancestors (#10481)

2025-01-22 19:45:12 +00:00

test_subscriber_branching.py

Disable logical replication subscribers (#10249)

2025-01-23 11:02:15 +00:00

test_subscriber_restart.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_subxacts.py

safekeeper: use protobuf for sending compressed records to pageserver (#9821)

2024-11-27 12:12:21 +00:00

test_tenant_conf.py

pageserver/storcon: add patch endpoints for tenant config metrics (#10020)

2024-12-11 19:16:33 +00:00

test_tenant_delete.py

tests: refactor test_tenant_delete_races_timeline_creation (#10425)

2025-01-16 14:11:33 +00:00

test_tenant_detach.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_tenant_relocation.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_tenant_size.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_tenant_tasks.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_tenants_with_remote_storage.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_tenants.py

pageserver: include node id when subscribing to SK (#10432)

2025-01-16 18:51:56 +00:00

test_threshold_based_eviction.py

Improve typing in test_runner/fixtures/httpserver.py (#10103)

2024-12-11 22:21:42 +00:00

test_timeline_archive.py

Handle race between auto-offload and unarchival (#10305)

2025-01-09 20:41:49 +00:00

test_timeline_delete.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_timeline_detach_ancestor.py

Increase reconciler timeout after shard split (#10490)

2025-01-23 16:43:04 +00:00

test_timeline_gc_blocking.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_timeline_size.py

pageserver: add compaction backpressure for layer flushes (#10405)

2025-01-24 09:47:28 +00:00

test_truncate.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_twophase.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_unlogged.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00

test_unstable_extensions.py

Create the notion of unstable extensions

2024-10-28 17:47:15 -05:00

test_vm_bits.py

pageserver: add compaction backpressure for layer flushes (#10405)

2025-01-24 09:47:28 +00:00

test_vm_truncate.py

Correctly truncate VM (#9342)

2024-11-14 17:19:13 +02:00

test_wal_acceptor_async.py

safekeeper: use protobuf for sending compressed records to pageserver (#9821)

2024-11-27 12:12:21 +00:00

test_wal_acceptor.py

safekeeper: add membership configuration switch endpoint (#10241)

2025-01-15 14:16:04 +00:00

test_wal_receiver.py

test_runner: improve wait_until (#9936)

2024-12-02 10:26:15 +00:00

test_wal_restore.py

Switch compute-related locales to C.UTF-8 by default

2024-11-08 12:19:18 -06:00

test_walredo_not_left_behind_on_detach.py

Enable all pyupgrade checks in ruff

2024-10-08 14:32:26 -05:00