fix(test_timeline_archival_chaos): flakiness caused by orphan layers (#10083)

The test was failing with the scary but generic message `Remote storage
metadata corrupted`.

The underlying scrubber error is `Orphan layer detected: ...`.
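
A minimal sketch of how the allow-list entry added in the diff below behaves, assuming the scrubber's allowed errors are applied as regular expressions to whole error lines. The tenant ids and the message tails are made up for illustration, and `is_allowed` is a hypothetical helper, not part of the test fixtures.

```python
import re

# Hypothetical tenant id, for illustration only.
tenant_id = "0123456789abcdef0123456789abcdef"

# Same pattern shape as the allow-list entry in the diff.
pattern = re.compile(f".*Orphan layer detected: tenants/{tenant_id}/.*")


def is_allowed(line: str) -> bool:
    """Return True if the whole error line matches the allow-list pattern."""
    return pattern.fullmatch(line) is not None


# Made-up error lines: one for our tenant, one for another tenant,
# and the generic message the test used to fail with.
own = f"Orphan layer detected: tenants/{tenant_id}/timelines/aaaa/layer-1"
other = "Orphan layer detected: tenants/ffffffffffffffffffffffffffffffff/timelines/aaaa/layer-1"
generic = "Remote storage metadata corrupted"

for line in (own, other, generic):
    print(is_allowed(line), line)
```

Scoping the pattern to `tenants/{tenant_id}/` means that orphan layers reported for any other tenant would still be treated as errors.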

The test kills the pageserver at random points, so we expect to leak
layers when the kill lands in the window after a layer is uploaded but
before it is referenced from the index part.

Refer to the generation numbers RFC for details.
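
The window can be illustrated with a toy model. This is a sketch under simplifying assumptions, not pageserver code: `RemoteStorage`, `upload_layer`, `find_orphans`, and the `Killed` exception are all hypothetical names, and remote storage is reduced to two in-memory sets.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteStorage:
    """Toy stand-in for remote storage: layer objects plus an index."""
    objects: set = field(default_factory=set)  # uploaded layer names
    index: set = field(default_factory=set)    # layers referenced by the index


class Killed(Exception):
    """Simulates the process being killed mid-operation."""


def upload_layer(remote, name, kill_before_index_update=False):
    # Step 1: the layer object lands in remote storage.
    remote.objects.add(name)
    # A kill here leaves the object uploaded but unreferenced.
    if kill_before_index_update:
        raise Killed(name)
    # Step 2: the index is updated to reference the layer.
    remote.index.add(name)


def find_orphans(remote):
    """Layers present in storage that the index does not reference."""
    return sorted(remote.objects - remote.index)


remote = RemoteStorage()
upload_layer(remote, "layer-a")
try:
    upload_layer(remote, "layer-b", kill_before_index_update=True)
except Killed:
    pass

print(find_orphans(remote))  # ['layer-b']
```

In this model the orphan is a leaked object rather than corrupted metadata, which is why the test can tolerate it.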

Refs:
- fixes https://github.com/neondatabase/neon/issues/9988
- root-cause analysis
https://github.com/neondatabase/neon/issues/9988#issuecomment-2520673167
Christian Schwarz
2024-12-13 17:28:21 +01:00
committed by GitHub
parent 2c91062828
commit fcff752851

@@ -435,6 +435,14 @@ def test_timeline_archival_chaos(neon_env_builder: NeonEnvBuilder):
        ]
    )

    env.storage_scrubber.allowed_errors.extend(
        [
            # Unclean shutdowns of pageserver can legitimately result in orphan layers
            # (https://github.com/neondatabase/neon/issues/9988#issuecomment-2520558211)
            f".*Orphan layer detected: tenants/{tenant_id}/.*"
        ]
    )

    class TimelineState:
        def __init__(self):
            self.timeline_id = TimelineId.generate()