## Problem

There is no direct backpressure for compaction and L0 read amplification. This allows a large buildup of compaction debt and read amplification.

Resolves #5415. Requires #10402.

## Summary of changes

Delay layer flushes based on the number of level 0 delta layers:

* `l0_flush_delay_threshold`: delay flushes such that they take 2x as long (default `2 * compaction_threshold`).
* `l0_flush_stall_threshold`: stall flushes until level 0 delta layers drop below the threshold (default `4 * compaction_threshold`).

If either threshold is reached, ephemeral layer rolls also synchronously wait for layer flushes, to propagate this backpressure up into WAL ingestion. This bounds the number of frozen layers to 1 once backpressure kicks in, since all other frozen layers must flush before the rolled layer.

## Analysis

This will significantly change the compute backpressure characteristics. Recall the three compute backpressure knobs:

* `max_replication_write_lag`: 500 MB (based on Pageserver `last_received_lsn`).
* `max_replication_flush_lag`: 10 GB (based on Pageserver `disk_consistent_lsn`).
* `max_replication_apply_lag`: disabled (based on Pageserver `remote_consistent_lsn`).

Previously, the Pageserver would keep ingesting WAL and build up ephemeral layers and L0 layers until the compute hit `max_replication_flush_lag` at 10 GB and began backpressuring. Now, once we delay/stall WAL ingestion, the compute will begin backpressuring after `max_replication_write_lag`, i.e. 500 MB. This is probably a good thing (we're not building up a ton of compaction debt), but we should consider tuning these settings. `max_replication_flush_lag` probably doesn't serve a purpose anymore, and we should consider removing it.

Furthermore, the removal of the upload barrier in #10402 means that we no longer backpressure flushes based on S3 uploads, since `max_replication_apply_lag` is disabled. We should consider enabling it as well.

### When and what do we compact?

Default compaction settings:

* `compaction_threshold`: 10 L0 delta layers.
* `compaction_period`: 20 seconds (between each compaction loop check).
* `checkpoint_distance`: 256 MB (size of L0 delta layers).
* `l0_flush_delay_threshold`: 20 L0 delta layers.
* `l0_flush_stall_threshold`: 40 L0 delta layers.

Compaction characteristics:

* Minimum compaction volume: 10 layers * 256 MB = 2.5 GB.
* Additional compaction volume (assuming 128 MB/s WAL): 128 MB/s * 20 seconds = 2.5 GB (10 L0 layers).
* Required compaction bandwidth: 5.0 GB / 20 seconds = 256 MB/s.

### When do we hit `max_replication_write_lag`?

Depending on how fast compaction and flushes happen, the compute will begin backpressuring somewhere between `l0_flush_delay_threshold` and `l0_flush_stall_threshold`, plus `max_replication_write_lag` (see the sketch below):

* Minimum compute backpressure lag: 20 layers * 256 MB + 500 MB = 5.6 GB.
* Maximum compute backpressure lag: 40 layers * 256 MB + 500 MB = 10.7 GB.

This seems like a reasonable range to me.
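For concreteness, here is a small Python sketch of the lag arithmetic above. The variable names mirror the settings discussed; this is an illustration of the calculation, not pageserver code:

```python
# Illustrative recalculation of the backpressure bounds above; not code from this PR.
compaction_threshold = 10            # L0 delta layers
checkpoint_distance_mb = 256         # size of each L0 delta layer, in MB
max_replication_write_lag_mb = 500   # compute backpressure knob, in MB

l0_flush_delay_threshold = 2 * compaction_threshold  # 20 layers: flushes take 2x as long
l0_flush_stall_threshold = 4 * compaction_threshold  # 40 layers: flushes stall entirely

min_lag_mb = l0_flush_delay_threshold * checkpoint_distance_mb + max_replication_write_lag_mb
max_lag_mb = l0_flush_stall_threshold * checkpoint_distance_mb + max_replication_write_lag_mb
print(f"backpressure kicks in between {min_lag_mb / 1000:.1f} and {max_lag_mb / 1000:.1f} GB of lag")
# -> backpressure kicks in between 5.6 and 10.7 GB of lag
```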
## Running locally
First make a release build. The `-s` flag silences a lot of output, and makes it
easier to see if you have compile errors without scrolling up.

```bash
BUILD_TYPE=release CARGO_BUILD_FLAGS="--features=testing" make -s -j8
```
You may also need to run `./scripts/pysync`.
Then run the tests:

```bash
DEFAULT_PG_VERSION=16 NEON_BIN=./target/release poetry run pytest test_runner/performance
```
Some handy pytest flags for local development:
* `-x` tells pytest to stop on first error
* `-s` shows test output
* `-k` selects a test to run
* `--timeout=0` disables our default timeout of 300s (see `setup.cfg`)
* `--preserve-database-files` to skip cleanup
* `--out-dir` to produce a JSON with the recorded test metrics
## What performance tests do we have and how we run them
Performance tests are built using the same infrastructure as our usual Python integration tests. There are some extra fixtures that help collect performance metrics and run tests against both vanilla PostgreSQL and Neon for comparison.
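As a rough illustration of the shape of such a test (the `zenbenchmark` fixture name, its `record` interface, and the environment helpers below are assumptions for illustration, not verified API):

```python
# Hypothetical sketch of a performance test. Fixture names (neon_env,
# zenbenchmark) and helper methods are assumptions, not verified API.
import time

def test_insert_speed_sketch(neon_env, zenbenchmark):
    endpoint = neon_env.endpoints.create_start("main")  # assumed helper
    start = time.monotonic()
    endpoint.safe_psql("create table t as select generate_series(1, 1000000)")
    # The benchmark fixture is what gets the metric into the results JSON.
    zenbenchmark.record("insert_duration", time.monotonic() - start, "s")
```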
### Tests that are run against a local installation
Most of the performance tests run against a local installation. This is not very representative of a production environment. Firstly, Postgres, the safekeeper(s), and the pageserver have to share CPU and I/O resources, which can add noise to the results. Secondly, network overhead is eliminated.
In the CI, the performance tests are run in the same environment as the other integration tests. We don't have control over the host that the CI runs on, so the environment may vary widely from one run to another, which makes results noisy and hard to compare across runs.
### Remote tests
There are a few tests that are marked with `pytest.mark.remote_cluster`. These tests do not set up a local environment; instead, they require a libpq connection string to connect to, so they can be run against any Postgres-compatible database. Currently, the CI runs these tests against our staging and captest environments daily. Those are not isolated environments, so there can be noise in the results due to activity from other clusters.
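A minimal sketch of what such a test looks like (the `remote_pg` fixture name and `safe_psql` helper are assumptions for illustration, not verified API):

```python
# Hypothetical sketch; the remote_pg fixture and safe_psql helper are
# assumptions for illustration, not verified API.
import pytest

@pytest.mark.remote_cluster
def test_select_one(remote_pg):
    # No local environment is set up; the fixture is assumed to wrap the
    # libpq connection string supplied by the environment/CI.
    remote_pg.safe_psql("select 1")
```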
### Noise
All tests run only once. To obtain more consistent performance numbers, a test should usually be repeated multiple times and the results aggregated, for example by taking the min, max, mean, or median (see the sketch below).
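A minimal, self-contained sketch of the repeat-and-aggregate idea (illustrative only, not part of the test infrastructure):

```python
# Illustrative only: repeat a measurement and aggregate to smooth out noise.
import statistics
import timeit

def measure(run, repeats=5):
    samples = [timeit.timeit(run, number=1) for _ in range(repeats)]
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
    }

print(measure(lambda: sum(range(1_000_000))))
```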
## Results collection
Local test results for the main branch, and results of the daily performance tests, are stored in a Neon project deployed in the production environment. There is a Grafana dashboard that visualizes the results. The main problem with it is that you cannot point it at a particular commit, though the data for that is available in the database. It needs some tweaking from someone who knows Grafana tricks.
There is also an inconsistency in test naming. Test names should be the same across platforms, with results differentiated by the platform field. But currently, the platform is sometimes included in the test name because of the way parametrization works in pytest. For example, there is a platform switch in the dashboard with `neon-local-ci` and `neon-staging` variants, yet some tests under the `neon-local-ci` platform value are displayed as `test_runner/performance/test_bulk_insert.py::test_bulk_insert[vanilla]` and `test_runner/performance/test_bulk_insert.py::test_bulk_insert[neon]`, which is highly confusing.
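The bracketed suffixes come from pytest parametrization; a minimal illustration (the `env` parameter below is hypothetical, not the actual comparison fixture):

```python
# Illustration of how pytest parametrization embeds the platform in the test
# name; the "env" parameter is hypothetical, not the real fixture.
import pytest

@pytest.mark.parametrize("env", ["vanilla", "neon"])
def test_bulk_insert(env):
    # pytest generates the IDs test_bulk_insert[vanilla] and
    # test_bulk_insert[neon], so the platform leaks into the test name
    # instead of living in a separate platform field.
    ...
```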