## Problem

There is no direct backpressure for compaction and L0 read amplification. This allows a large buildup of compaction debt and read amplification.

Resolves #5415. Requires #10402.

## Summary of changes

Delay layer flushes based on the number of level 0 delta layers:

* `l0_flush_delay_threshold`: delay flushes such that they take 2x as long (default `2 * compaction_threshold`).
* `l0_flush_stall_threshold`: stall flushes until level 0 delta layers drop below the threshold (default `4 * compaction_threshold`).

If either threshold is reached, ephemeral layer rolls also synchronously wait for layer flushes to propagate this backpressure up into WAL ingestion. This bounds the number of frozen layers to 1 once backpressure kicks in, since all other frozen layers must flush before the rolled layer.

## Analysis

This will significantly change the compute backpressure characteristics. Recall the three compute backpressure knobs:

* `max_replication_write_lag`: 500 MB (based on Pageserver `last_received_lsn`).
* `max_replication_flush_lag`: 10 GB (based on Pageserver `disk_consistent_lsn`).
* `max_replication_apply_lag`: disabled (based on Pageserver `remote_consistent_lsn`).

Previously, the Pageserver would keep ingesting WAL and building up ephemeral layers and L0 layers until the compute hit `max_replication_flush_lag` at 10 GB and began backpressuring. Now, once we delay/stall WAL ingestion, the compute will begin backpressuring after `max_replication_write_lag`, i.e. 500 MB. This is probably a good thing (we're not building up a ton of compaction debt), but we should consider tuning these settings. `max_replication_flush_lag` probably doesn't serve a purpose anymore, and we should consider removing it.

Furthermore, the removal of the upload barrier in #10402 means that we no longer backpressure flushes based on S3 uploads, since `max_replication_apply_lag` is disabled. We should consider enabling it as well.

### When and what do we compact?

Default compaction settings:

* `compaction_threshold`: 10 L0 delta layers.
* `compaction_period`: 20 seconds (between each compaction loop check).
* `checkpoint_distance`: 256 MB (size of L0 delta layers).
* `l0_flush_delay_threshold`: 20 L0 delta layers.
* `l0_flush_stall_threshold`: 40 L0 delta layers.

Compaction characteristics:

* Minimum compaction volume: 10 layers * 256 MB = 2.5 GB.
* Additional compaction volume (assuming 128 MB/s WAL): 128 MB/s * 20 seconds = 2.5 GB (10 L0 layers).
* Required compaction bandwidth: 5.0 GB / 20 seconds = 256 MB/s.

### When do we hit `max_replication_write_lag`?

Depending on how fast compaction and flushes happen, the compute will begin backpressuring somewhere between `l0_flush_delay_threshold` + `max_replication_write_lag` and `l0_flush_stall_threshold` + `max_replication_write_lag`:

* Minimum compute backpressure lag: 20 layers * 256 MB + 500 MB ≈ 5.6 GB.
* Maximum compute backpressure lag: 40 layers * 256 MB + 500 MB ≈ 10.7 GB.

This seems like a reasonable range to me.
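As a quick sanity check, the range above can be recomputed directly from the quoted defaults. This is only a back-of-the-envelope sketch; the variable names below are illustrative, not actual settings:

```shell
# Recompute the backpressure range from the defaults quoted above:
# checkpoint_distance = 256 MB, max_replication_write_lag = 500 MB,
# l0_flush_delay_threshold = 20 layers, l0_flush_stall_threshold = 40 layers.
layer_mb=256
write_lag_mb=500
echo "min lag: $(( 20 * layer_mb + write_lag_mb )) MB"        # 5620 MB ≈ 5.6 GB
echo "max lag: $(( 40 * layer_mb + write_lag_mb )) MB"        # 10740 MB ≈ 10.7 GB
echo "compaction bandwidth: $(( 20 * layer_mb / 20 )) MB/s"   # 5120 MB per 20 s period = 256 MB/s
```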
# Control Plane and Neon Local

This crate contains tools to start a Neon development environment locally. It can be used via the `cargo neon` command.
## Example: Start with Postgres 16

To create and start a local development environment with Postgres 16, you will need to pass the `--pg-version` flag to three of the start-up commands:

```shell
cargo neon init --pg-version 16
cargo neon start
cargo neon tenant create --set-default --pg-version 16
cargo neon endpoint create main --pg-version 16
cargo neon endpoint start main
```
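Once the endpoint is running, you can connect to it with `psql`. The exact connection string is printed by `cargo neon endpoint start`; the port shown below (55432) is only the usual default for the first local endpoint, so substitute whatever was printed for your environment:

```shell
# Connect to the local endpoint as the default cloud_admin role.
# Replace the port with the one printed by `cargo neon endpoint start`.
psql -h 127.0.0.1 -p 55432 -U cloud_admin postgres
```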
## Example: Create Test User and Database

By default, `cargo neon` starts an endpoint with the `cloud_admin` role and the `postgres` database. If you want a role and a database similar to what we have on the cloud service, you can create them with the following commands when starting an endpoint:

```shell
cargo neon endpoint create main --pg-version 16 --update-catalog true
cargo neon endpoint start main --create-test-user true
```

The first command creates `neon_superuser` and the other necessary roles. The second command creates the test user and the `neondb` database. After running the second command, you will see a connection string that connects you as the test user.
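The connection string will look roughly like the following (an illustrative example only; the actual host, port, and parameters are whatever `cargo neon endpoint start` prints):

```shell
# Example only: connect as the test user to the neondb database.
psql postgresql://test@127.0.0.1:55432/neondb
```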