pageserver: revert open layer rolling revert (#6962)

## Problem
We reverted https://github.com/neondatabase/neon/pull/6661 a few days
ago. The change led to OOMs in
benchmarks followed by large WAL reingests.

The issue was that we removed [this
code](d04af08567/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs (L409-L417)).
That call may trigger a roll of the open layer due to
the keepalive messages received from the safekeeper. Removing it meant
that enforcing
of checkpoint timeout became even more lax and led to using up large
amounts of memory
for the in memory layer indices.

## Summary of changes
Piggyback on keep alive messages to enforce checkpoint timeout. This is
a hack, but it's exactly what
the current code is doing.

## Alternatives
Christhian, Joonas and myself sketched out a timer based approach
[here](https://github.com/neondatabase/neon/pull/6940). While discussing
it further, it became obvious that's also a bit of a hack and not the
desired end state. I chose not
to take that further since it's not what we ultimately want and it'll be
harder to rip out.

Right now it's unclear what the ideal system behaviour is:
* early flushing on memory pressure, or ...
* detaching tenants on memory pressure
This commit is contained in:
Vlad Lazar
2024-03-07 19:53:10 +00:00
committed by GitHub
parent 02358b21a4
commit 0f05ef67e2
6 changed files with 322 additions and 184 deletions

View File

@@ -17,10 +17,10 @@ def test_layer_map(neon_env_builder: NeonEnvBuilder, zenbenchmark):
tenant, _ = env.neon_cli.create_tenant(
conf={
"gc_period": "0s",
"checkpoint_distance": "8192",
"checkpoint_distance": "16384",
"compaction_period": "1 s",
"compaction_threshold": "1",
"compaction_target_size": "8192",
"compaction_target_size": "16384",
}
)