storage controller: always wait for tenant detach before delete (#8049)

## Problem

This test could fail with a timeout waiting for tenant deletions.

Tenant deletions could get tripped up on nodes transitioning from offline to online at the moment of the deletion. In a previous reconciliation, the reconciler would skip detaching a particular location because the node was offline, but by the time we do the delete the node is marked as online and can be picked as the node to use for issuing a deletion request. This hits the "Unexpectedly still attached" path, which would still work if the caller kept calling DELETE, but if a caller does a DELETE, GET, GET, GET poll, then it doesn't work because the GET calls fail after we've marked the tenant as detached.

## Summary of changes

Fix the undesirable storage controller behavior highlighted by this test failure:
- Change tenant deletion flow to _always_ wait for reconciliation to succeed: it was unsound to proceed and return 202 if something was still attached, because after the 202 callers can no longer GET the tenant.

Stabilize the test:
- Add a reconcile_until_idle to the test, so that it will not have reconciliations running in the background while we mark a node online. This test is not meant to be a chaos test: we should test that kind of complexity elsewhere.
- This reconcile_until_idle also fixes another failure mode where the test might see a None for a tenant location because a reconcile was mutating it (https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7288/9500177581/index.html#suites/8fc5d1648d2225380766afde7c428d81/4acece42ae00c442/)

It remains the case that a motivated tester could produce a situation where a DELETE gives a 500, when precisely the wrong node transitions from offline to available at the precise moment of a deletion (but the 500 is better than returning 202 and then failing all subsequent GETs). Note that nodes don't go through the offline state during normal restarts, so this is super rare.

We should eventually fix this by making DELETE to the pageserver implicitly detach the tenant if it's attached, but that should wait until nobody is using the legacy-style deletes (the ones that use 202 + polling).
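The difference between the two caller patterns described above can be sketched with a toy model. This is an illustrative simulation only: `FakeController`, `delete_then_poll`, and `delete_with_retry` are hypothetical names, not part of the storage controller or its test fixtures, and the model assumes the skipped detach has completed by the time a second DELETE arrives.

```python
from dataclasses import dataclass


@dataclass
class FakeController:
    """Hypothetical stand-in for the storage controller; not the real API."""

    wait_for_detach: bool         # True models the behavior after this commit
    still_attached: bool = True   # a location whose detach was skipped earlier
    marked_detached: bool = False

    def delete(self) -> int:
        if self.still_attached:
            # Assumption: the leftover attachment is gone by the next attempt.
            self.still_attached = False
            if self.wait_for_detach:
                return 500        # surface the error, mark nothing as detached
            self.marked_detached = True
            return 202            # old behavior: accept and expect polling
        self.marked_detached = True
        return 200

    def get(self) -> bool:
        # Per the commit message, GETs fail once the tenant is marked detached.
        return not self.marked_detached


def delete_then_poll(ctrl: FakeController, polls: int = 3) -> list:
    """Legacy-style caller: one DELETE, then GET polling."""
    return [ctrl.delete()] + [ctrl.get() for _ in range(polls)]


def delete_with_retry(ctrl: FakeController, attempts: int = 3) -> list:
    """Caller that repeats DELETE until it sees a 200."""
    statuses = []
    for _ in range(attempts):
        statuses.append(ctrl.delete())
        if statuses[-1] == 200:
            break
    return statuses
```

In this model the old flow strands a polling caller (a 202 followed only by failing GETs), while a caller that keeps calling DELETE succeeds under both the old and the new behavior.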
2026-01-09 06:22:57 +00:00 · 2024-06-14 10:37:30 +01:00
parent edc900028e
commit 6843fd8f89
2 changed files with 18 additions and 11 deletions
--- a/storage_controller/src/service.rs
+++ b/storage_controller/src/service.rs
@@ -2409,11 +2409,17 @@ impl Service {
            (detach_waiters, shard_ids, node.clone())
        };

-        if let Err(e) = self.await_waiters(detach_waiters, RECONCILE_TIMEOUT).await {
-            // Failing to detach shouldn't hold up deletion, e.g. if a node is offline we should be able
-            // to use some other node to run the remote deletion.
-            tracing::warn!("Failed to detach some locations: {e}");
-        }
+        // This reconcile wait can fail in a few ways:
+        //  A there is a very long queue for the reconciler semaphore
+        //  B some pageserver is failing to handle a detach promptly
+        //  C some pageserver goes offline right at the moment we send it a request.
+        //
+        // A and C are transient: the semaphore will eventually become available, and once a node is marked offline
+        // the next attempt to reconcile will silently skip detaches for an offline node and succeed.  If B happens,
+        // it's a bug, and needs resolving at the pageserver level (we shouldn't just leave attachments behind while
+        // deleting the underlying data).
+        self.await_waiters(detach_waiters, RECONCILE_TIMEOUT)
+            .await?;

        let locations = shard_ids
            .into_iter()
@@ -2431,13 +2437,11 @@ impl Service {
        for result in results {
            match result {
                Ok(StatusCode::ACCEPTED) => {
-                    // This could happen if we failed detach above, and hit a pageserver where the tenant
-                    // is still attached: it will accept the deletion in the background
-                    tracing::warn!(
-                        "Unexpectedly still attached on {}, client should retry",
+                    // This should never happen: we waited for detaches to finish above
+                    return Err(ApiError::InternalServerError(anyhow::anyhow!(
+                        "Unexpectedly still attached on {}",
                        node
-                    );
-                    return Ok(StatusCode::ACCEPTED);
+                    )));
                }
                Ok(_) => {}
                Err(mgmt_api::Error::Cancelled) => {
--- a/test_runner/regress/test_storage_controller.py
+++ b/test_runner/regress/test_storage_controller.py
@@ -133,6 +133,9 @@ def test_storage_controller_smoke(

    wait_until(10, 1, lambda: node_evacuated(env.pageservers[0].id))

+    # Let all the reconciliations after marking the node offline complete
+    env.storage_controller.reconcile_until_idle()
+
    # Marking pageserver active should not migrate anything to it
    # immediately
    env.storage_controller.node_configure(env.pageservers[0].id, {"availability": "Active"})