pageserver: improved synthetic size & find_gc_cutoff error handling (#8051)

## Problem

This PR refactors some error handling to avoid log spam on
tenant/timeline shutdown. The messages in question:

- "ignoring failure to find gc cutoffs: timeline shutting down." logs
(https://github.com/neondatabase/neon/issues/8012)
- "synthetic_size_worker: failed to calculate synthetic size for tenant
...: Failed to refresh gc_info before gathering inputs: tenant shutting
down", for example here:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8049/9502988669/index.html#suites/3fc871d9ee8127d8501d607e03205abb/1a074a66548bbcea

Closes: https://github.com/neondatabase/neon/issues/8012

## Summary of changes

- Refactor: add a PageReconstructError variant to GcError, since this is
the only kind of error that find_gc_cutoffs can emit.
- Functional change: only ignore the shutdown variant of
PageReconstructError; treat all other variants as real errors.
- Refactor: add a structured CalculateSyntheticSizeError type and use it
instead of anyhow::Error in synthetic size calculations
- Functional change: while iterating through timelines to gather logical
sizes, only bail out if the whole tenant is cancelled. An individual
timeline cancellation indicates a deletion in progress, so that timeline
can simply be skipped.
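
The two functional changes above can be illustrated with a minimal, self-contained sketch. It is not the pageserver code: the enum variants (`Cancelled`, `Other`, `GcCutoffs`, `Fatal`) and the helper functions `gc_cutoff_or_skip` and `gather_logical_sizes` are hypothetical stand-ins chosen for illustration, and only the type names `PageReconstructError`, `GcError` and `CalculateSyntheticSizeError` come from this commit message.

```rust
/// Hypothetical, simplified stand-in for the pageserver's PageReconstructError.
#[derive(Debug, PartialEq)]
pub enum PageReconstructError {
    /// The timeline is shutting down (assumed variant name).
    Cancelled,
    /// Any genuine failure (assumed variant name).
    Other(String),
}

/// Hypothetical, simplified GcError that wraps a PageReconstructError.
#[derive(Debug, PartialEq)]
pub enum GcError {
    GcCutoffs(PageReconstructError),
}

/// Hypothetical structured error used instead of an opaque anyhow::Error.
#[derive(Debug, PartialEq)]
pub enum CalculateSyntheticSizeError {
    /// The whole tenant is shutting down.
    Cancelled,
    /// A real failure that should be reported.
    Fatal(String),
}

/// Ignore only the shutdown variant (returning `None` so the caller can
/// skip quietly); surface every other variant as a real error.
pub fn gc_cutoff_or_skip(
    result: Result<u64, PageReconstructError>,
) -> Result<Option<u64>, GcError> {
    match result {
        Ok(cutoff) => Ok(Some(cutoff)),
        Err(PageReconstructError::Cancelled) => Ok(None),
        Err(other) => Err(GcError::GcCutoffs(other)),
    }
}

/// Gather per-timeline logical sizes. A cancelled timeline is skipped
/// unless the whole tenant is cancelled, in which case we stop.
pub fn gather_logical_sizes(
    tenant_cancelled: bool,
    timelines: Vec<Result<u64, PageReconstructError>>,
) -> Result<Vec<u64>, CalculateSyntheticSizeError> {
    let mut sizes = Vec::new();
    for timeline in timelines {
        match timeline {
            Ok(size) => sizes.push(size),
            Err(PageReconstructError::Cancelled) if tenant_cancelled => {
                return Err(CalculateSyntheticSizeError::Cancelled);
            }
            // Timeline-level cancellation only: deletion in progress, skip it.
            Err(PageReconstructError::Cancelled) => continue,
            Err(PageReconstructError::Other(msg)) => {
                return Err(CalculateSyntheticSizeError::Fatal(msg));
            }
        }
    }
    Ok(sizes)
}
```

With structured variants like these, a caller can match on the shutdown case and stay silent, while still logging anything else as a genuine error.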
John Spray
2024-06-14 11:08:11 +01:00
committed by GitHub
parent 6843fd8f89
commit eb0ca9b648
7 changed files with 115 additions and 71 deletions


@@ -94,8 +94,6 @@ DEFAULT_PAGESERVER_ALLOWED_ERRORS = (
".*WARN.*path=/v1/utilization .*request was dropped before completing",
# Can happen during shutdown
".*scheduling deletion on drop failed: queue is in state Stopped.*",
# Can happen during shutdown
".*ignoring failure to find gc cutoffs: timeline shutting down.*",
)


@@ -678,10 +678,6 @@ def test_synthetic_size_while_deleting(neon_env_builder: NeonEnvBuilder):
with pytest.raises(PageserverApiException, match=matcher):
completion.result()
# this happens on both cases
env.pageserver.allowed_errors.append(
".*ignoring failure to find gc cutoffs: timeline shutting down.*"
)
# this happens only in the case of deletion (http response logging)
env.pageserver.allowed_errors.append(".*Failed to refresh gc_info before gathering inputs.*")