pageserver: return ACCEPTED when deletion already in flight (#7384)

## Problem

test_sharding_smoke recently gained a section that checks deletion of a
sharded tenant. The storage controller retries deletion in a loop, waiting
for a 404 response. When deletion is a bit slow (e.g. in debug builds), a
retried deletion request could get a 500 response -- this made the test
flaky (example failure:
https://neon-github-public-dev.s3.amazonaws.com/reports/release-proxy/8659801445/index.html#testresult/b4cbf5b58190f60e/retries)
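
For context, the controller-side behaviour is essentially a poll-until-404 loop. The sketch below is a hypothetical, simplified Python version of such a loop (the endpoint path, helper name and timings are assumptions, not the storage controller's actual code); it shows why an intermittent 500 aborts the retry instead of being treated as "deletion still in progress":

```python
import time

import requests


def delete_tenant_and_wait(base_url: str, tenant_id: str, timeout_s: float = 60.0) -> None:
    """Hypothetical sketch of a poll-until-404 deletion loop (not the controller's real code)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = requests.delete(f"{base_url}/v1/tenant/{tenant_id}")
        if resp.status_code == 404:
            return  # deletion finished: the tenant is gone
        if resp.status_code in (200, 202):
            time.sleep(0.5)  # accepted / still in progress: poll again
            continue
        # Anything else -- like the intermittent 500 described above -- aborts the
        # loop and fails the caller instead of counting as "still deleting".
        raise RuntimeError(f"unexpected status {resp.status_code}: {resp.text}")
    raise TimeoutError(f"tenant {tenant_id} was not deleted within {timeout_s}s")
```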

There was an incorrect comment in the code:
```
         match tenant.current_state() {
             TenantState::Broken { .. } | TenantState::Stopping { .. } => {
-                // If a tenant is broken or stopping, DeleteTenantFlow can
-                // handle it: broken tenants proceed to delete, stopping tenants
-                // are checked for deletion already in progress.
```

If the tenant is stopping, DeleteTenantFlow does not in fact handle it:
it returns an error that yields a 500.

## Summary of changes

Before calling into DeleteTenantFlow, if the tenant is in Stopping or Broken
state and a deletion is already in progress, return 202. This makes the API
friendlier to retries.

The historic AlreadyInProgress (409) response still exists for the case where
we enter DeleteTenantFlow and unexpectedly see the tenant stopping. That
should go away when we implement #5080; for the moment, callers that
handle 409s should continue to do so.
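
Under these semantics, a caller's handling of the deletion endpoint reduces to a small mapping: 404 means the tenant is gone, 200/202 mean deletion is underway (keep polling), and 409 is the historic transient conflict to retry. A minimal, hypothetical sketch of that mapping (illustration only, not code from this change):

```python
from enum import Enum


class DeletionPoll(Enum):
    DONE = "done"          # 404: the tenant is gone
    IN_PROGRESS = "poll"   # 200/202: deletion accepted or already running in the background
    RETRY = "retry"        # 409: historic AlreadyInProgress, retry until #5080 removes it


def classify_delete_status(status_code: int) -> DeletionPoll:
    """Map a tenant-deletion response code to the caller action described above."""
    if status_code == 404:
        return DeletionPoll.DONE
    if status_code in (200, 202):
        return DeletionPoll.IN_PROGRESS
    if status_code == 409:
        return DeletionPoll.RETRY
    raise RuntimeError(f"unexpected tenant deletion status: {status_code}")
```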
Author: John Spray
Date: 2024-04-16 11:39:18 +03:00
Committed by: GitHub
Parent: 2d5a8462c8
Commit: 3366cd34ba
4 changed files with 31 additions and 16 deletions

@@ -436,6 +436,11 @@ impl DeleteTenantFlow
         .await
     }
 
+    /// Check whether background deletion of this tenant is currently in progress
+    pub(crate) fn is_in_progress(tenant: &Tenant) -> bool {
+        tenant.delete_progress.try_lock().is_err()
+    }
+
     async fn prepare(
         tenant: &Arc<Tenant>,
     ) -> Result<tokio::sync::OwnedMutexGuard<Self>, DeleteTenantError> {

@@ -1410,9 +1410,15 @@ impl TenantManager
         match tenant.current_state() {
             TenantState::Broken { .. } | TenantState::Stopping { .. } => {
-                // If a tenant is broken or stopping, DeleteTenantFlow can
-                // handle it: broken tenants proceed to delete, stopping tenants
-                // are checked for deletion already in progress.
+                // If deletion is already in progress, return success (the semantics of this
+                // function are to return success after deletion is spawned in background).
+                // Otherwise fall through and let [`DeleteTenantFlow`] handle this state.
+                if DeleteTenantFlow::is_in_progress(&tenant) {
+                    // The `delete_progress` lock is held: deletion is already happening
+                    // in the background
+                    slot_guard.revert();
+                    return Ok(());
+                }
             }
             _ => {
                 tenant

@@ -2449,10 +2449,12 @@ class NeonPageserver(PgProtocol):
                 if cur_line_no < skip_until_line_no:
                     cur_line_no += 1
                     continue
-                if contains_re.search(line):
+                elif contains_re.search(line):
                     # found it!
                     cur_line_no += 1
                     return (line, LogCursor(cur_line_no))
+                else:
+                    cur_line_no += 1
         return None
 
     def tenant_attach(
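
The behavioural fix in this fixture hunk is that non-matching lines now also advance the line counter, so the `LogCursor` returned on a match can be passed back as `offset` to continue scanning strictly after that point. A standalone re-implementation of the scan, for illustration only (the real fixture reads the pageserver log file rather than a list of lines, and the function name here is made up):

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LogCursor:
    _line_no: int


def scan_log(
    lines: List[str], pattern: str, offset: Optional[LogCursor] = None
) -> Optional[Tuple[str, LogCursor]]:
    """Return the first line matching `pattern` at or after `offset`, plus a cursor
    pointing just past it; every examined line advances the counter, matching or not."""
    contains_re = re.compile(pattern)
    skip_until_line_no = 0 if offset is None else offset._line_no
    cur_line_no = 0
    for line in lines:
        if cur_line_no < skip_until_line_no:
            cur_line_no += 1
            continue
        elif contains_re.search(line):
            # found it!
            cur_line_no += 1
            return (line, LogCursor(cur_line_no))
        else:
            cur_line_no += 1
    return None


# Example: the second search starts after the first match, so it finds nothing new.
log = ["starting deletion", "deletion complete"]
first = scan_log(log, "deletion")
assert first is not None and first[0] == "starting deletion"
assert scan_log(log, "starting", offset=first[1]) is None
```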

@@ -469,7 +469,8 @@ def test_tenant_delete_concurrent(
 ):
     """
     Validate that concurrent delete requests to the same tenant behave correctly:
-    exactly one should succeed.
+    exactly one should execute: the rest should give 202 responses but not start
+    another deletion.
 
     This is a reproducer for https://github.com/neondatabase/neon/issues/5936
     """
@@ -484,14 +485,10 @@ def test_tenant_delete_concurrent(
     run_pg_bench_small(pg_bin, endpoint.connstr())
     last_flush_lsn_upload(env, endpoint, tenant_id, timeline_id)
 
-    CONFLICT_MESSAGE = "Precondition failed: Invalid state Stopping. Expected Active or Broken"
-
     env.pageserver.allowed_errors.extend(
         [
             # lucky race with stopping from flushing a layer we fail to schedule any uploads
             ".*layer flush task.+: could not flush frozen layer: update_metadata_file",
-            # Errors logged from our 4xx requests
-            f".*{CONFLICT_MESSAGE}.*",
         ]
     )
@@ -507,7 +504,7 @@ def test_tenant_delete_concurrent(
         return ps_http.tenant_delete(tenant_id)
 
     def hit_remove_failpoint():
-        env.pageserver.assert_log_contains(f"at failpoint {BEFORE_REMOVE_FAILPOINT}")
+        return env.pageserver.assert_log_contains(f"at failpoint {BEFORE_REMOVE_FAILPOINT}")[1]
 
     def hit_run_failpoint():
         env.pageserver.assert_log_contains(f"at failpoint {BEFORE_RUN_FAILPOINT}")
@@ -518,11 +515,14 @@ def test_tenant_delete_concurrent(
     # Wait until the first request completes its work and is blocked on removing
     # the TenantSlot from tenant manager.
-    wait_until(100, 0.1, hit_remove_failpoint)
+    log_cursor = wait_until(100, 0.1, hit_remove_failpoint)
+    assert log_cursor is not None
 
-    # Start another request: this should fail when it sees a tenant in Stopping state
-    with pytest.raises(PageserverApiException, match=CONFLICT_MESSAGE):
-        ps_http.tenant_delete(tenant_id)
+    # Start another request: this should succeed without actually entering the deletion code
+    ps_http.tenant_delete(tenant_id)
+    assert not env.pageserver.log_contains(
+        f"at failpoint {BEFORE_RUN_FAILPOINT}", offset=log_cursor
+    )
 
     # Start another background request, which will pause after acquiring a TenantSlotGuard
     # but before completing.
@@ -539,8 +539,10 @@ def test_tenant_delete_concurrent(
     # Permit the duplicate background request to run to completion and fail.
     ps_http.configure_failpoints((BEFORE_RUN_FAILPOINT, "off"))
-    with pytest.raises(PageserverApiException, match=CONFLICT_MESSAGE):
-        background_4xx_req.result(timeout=10)
+    background_4xx_req.result(timeout=10)
+    assert not env.pageserver.log_contains(
+        f"at failpoint {BEFORE_RUN_FAILPOINT}", offset=log_cursor
+    )
 
     # Physical deletion should have happened
     assert_prefix_empty(