pageserver: handle WAL gaps on sharded tenants (#6788)

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 05:22:56 +00:00

## Problem

In the test for https://github.com/neondatabase/neon/pull/6776, a test
case uses tiny layer sizes and tiny stripe sizes. This hits a scenario
where a shard's checkpoint interval spans a region of the WAL in which
none of the content is ingested by this shard. Since there is no layer to
flush, we do not advance disk_consistent_lsn, and this causes the test
to fail while waiting for the LSN to advance.
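
To make the scenario concrete, here is a toy sketch of how a small stripe size can leave one shard with nothing to ingest over a stretch of WAL. The modulo mapping, the `shard_for_block` and `ingested_lsns` helpers, and the sample WAL are all assumptions for illustration; this is not the pageserver's real key-to-shard mapping.

```python
from typing import List, Tuple


def shard_for_block(block: int, stripe_size: int, shard_count: int) -> int:
    """Toy mapping (an assumption): stripes are dealt to shards round-robin."""
    return (block // stripe_size) % shard_count


def ingested_lsns(
    wal: List[Tuple[int, int]], shard: int, stripe_size: int, shard_count: int
) -> List[int]:
    """Return the LSNs of the WAL records that `shard` would ingest."""
    return [
        lsn
        for lsn, block in wal
        if shard_for_block(block, stripe_size, shard_count) == shard
    ]


# Hypothetical WAL as (lsn, block) pairs. With stripe_size=4 and two shards,
# blocks 0-3 belong to shard 0 and block 4 belongs to shard 1.
wal = [(100, 0), (200, 1), (300, 2), (400, 3), (500, 4)]

shard0 = ingested_lsns(wal, shard=0, stripe_size=4, shard_count=2)
shard1 = ingested_lsns(wal, shard=1, stripe_size=4, shard_count=2)
```

In this toy WAL, shard 1 ingests nothing between LSN 100 and LSN 400, so a checkpoint interval covering only that range would give it no layer to flush.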

## Summary of changes

- Pass an LSN through `layer_flush_start_tx`. This is the LSN up to which
we had frozen layers at the time we requested the flush, so the flush task
knows how far it is expected to get.
- In the layer flush task, if the layers we flush do not reach
`frozen_to_lsn`, then advance disk_consistent_lsn up to this point.
- In `maybe_freeze_ephemeral_layer`, handle the case where
last_record_lsn has advanced without writing a layer file: this ensures
that disk_consistent_lsn and remote_consistent_lsn advance anyway.

The net effect is that the disk_consistent_lsn is allowed to advance
past regions in the WAL where a shard ingests no data, and that we
uphold our guarantee that remote_consistent_lsn always eventually
reaches the tip of the WAL.
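
The flush-side behaviour can be modelled roughly as follows. This is a simplified Python sketch rather than the pageserver's Rust code: `TimelineModel`, `flush_frozen_layers`, and the representation of a layer as just its end LSN are assumptions made for illustration; only the names `frozen_to_lsn` and `disk_consistent_lsn` come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TimelineModel:
    """Hypothetical stand-in for a timeline; tracks only the LSNs discussed here."""

    disk_consistent_lsn: int = 0
    # Each frozen in-memory layer is represented only by its end LSN.
    frozen_layer_end_lsns: List[int] = field(default_factory=list)

    def flush_frozen_layers(self, frozen_to_lsn: int) -> None:
        """Flush all frozen layers, then advance to frozen_to_lsn if needed."""
        flushed_to = self.disk_consistent_lsn
        for end_lsn in sorted(self.frozen_layer_end_lsns):
            # A real flush would write a layer file here.
            flushed_to = max(flushed_to, end_lsn)
        self.frozen_layer_end_lsns.clear()

        # If the flushed layers did not reach frozen_to_lsn (or there were
        # none), advance anyway: this shard ingested nothing in the gap.
        # max() also keeps disk_consistent_lsn from moving backwards.
        self.disk_consistent_lsn = max(flushed_to, frozen_to_lsn)


# No layers were frozen, but the WAL was processed up to LSN 400.
tl = TimelineModel(disk_consistent_lsn=100)
tl.flush_frozen_layers(frozen_to_lsn=400)
lsn_after_empty_flush = tl.disk_consistent_lsn

# A layer ending at 250 is flushed, but we had frozen up to 600.
tl.frozen_layer_end_lsns.append(250)
tl.flush_frozen_layers(frozen_to_lsn=600)
lsn_after_short_flush = tl.disk_consistent_lsn
```

In both calls the model ends up at the `frozen_to_lsn` it was given, which is the property the change is meant to guarantee for shards that see gaps in the WAL.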

The case of no layer at all is hard to test at present because shards >0
are polluted with SLRU writes, but I have tested it locally with a
branch that disables SLRU writes on shards >0. We can tighten up the
testing on this in future as and when we refine shard filtering (currently
shards >0 need the SLRU because they use it to figure out the GC cutoff
via timestamp-to-LSN lookup).

John Spray, 2024-04-04 17:54:38 +01:00, committed by GitHub

parent 862a6b7018
commit ac7fc6110b

4 changed files with 225 additions and 31 deletions

									
test_runner/fixtures/workload.py

@@ -85,6 +85,11 @@ class Workload:
        if self._endpoint is not None:
            self._endpoint.stop()

    def stop(self):
        if self._endpoint is not None:
            self._endpoint.stop()
            self._endpoint = None

    def init(self, pageserver_id: Optional[int] = None):
        endpoint = self.endpoint(pageserver_id)
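
As a rough illustration of what the `stop()` method in this hunk does, the sketch below shows that clearing `_endpoint` after stopping makes repeated calls harmless. `FakeEndpoint` and `WorkloadSketch` are hypothetical stand-ins written for this example; they are not the real fixture classes.

```python
from typing import Optional


class FakeEndpoint:
    """Hypothetical endpoint that only counts how often it was stopped."""

    def __init__(self) -> None:
        self.stop_calls = 0

    def stop(self) -> None:
        self.stop_calls += 1


class WorkloadSketch:
    """Minimal stand-in for Workload, containing only the stop() logic."""

    def __init__(self, endpoint: Optional[FakeEndpoint]) -> None:
        self._endpoint = endpoint

    def stop(self) -> None:
        if self._endpoint is not None:
            self._endpoint.stop()
            # Forget the endpoint so a second stop() is a no-op.
            self._endpoint = None


fake = FakeEndpoint()
workload = WorkloadSketch(fake)
workload.stop()
workload.stop()  # safe: the endpoint reference was already cleared
```

With this shape a test can stop the workload's endpoint explicitly, and a later call to `stop()` will not stop the same endpoint twice.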
