## Problem

The deletion logic had become difficult to understand and maintain.

## Summary of changes

- Added an RFC detailing proposed improvements to all deletion-related APIs.

---------

Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
# Node deletion API improvement
Created on 2025-07-07

Implemented on TBD
## Summary
This RFC describes improvements to the storage controller API for gracefully deleting pageserver nodes.
## Motivation
The basic node deletion API introduced in #8226 has several limitations:
- Deleted nodes can re-add themselves if they restart (e.g., a flaky node that keeps restarting and that we cannot reach via SSH to stop the pageserver). This issue has been resolved by the tombstone mechanism in #12036
- The node deletion process is not graceful, i.e. it simply imitates a node failure
In this context, "graceful" node deletion means that users do not experience any disruption or negative effects, provided the system remains in a healthy state (i.e., the remaining pageservers can handle the workload and all requirements are met). To achieve this, the system must perform live migration of all tenant shards from the node being deleted while the node is still running and continue processing all incoming requests. The node is removed only after all tenant shards have been safely migrated.
Although live migrations can be achieved with the drain functionality, drain leads to incorrect shard placement, such as shards not matching their preferred availability zones. This results in unnecessary work to re-optimize a placement that had only recently been optimized.
If we delete a node before its tenant shards are fully moved, the new node won't have all the needed data (e.g. heatmaps) ready. This means user requests to the new node will be much slower at first. If there are many tenant shards, this slowdown affects a large number of users.
Graceful node deletion is more complicated and can introduce new issues. It takes longer because live migration of each tenant shard can last several minutes. Using non-blocking accessors may also cause deletion to wait if other processes are holding the inner state lock. It also gets trickier because we need to handle other requests, like drain and fill, at the same time.
## Impacted components (e.g. pageserver, safekeeper, console, etc)
- storage controller
- pageserver (indirectly)
## Proposed implementation
### Tombstones
To resolve the problem of deleted nodes re-adding themselves, a tombstone mechanism was introduced as part of the node stored information. Each node has a separate `NodeLifecycle` field with two possible states: `Active` and `Deleted`. When node deletion completes, the database row is not deleted but instead has its `NodeLifecycle` column switched to `Deleted`. Nodes with `Deleted` lifecycle are treated as if the row is absent for most handlers, with several exceptions: reattach and register functionality must be aware of tombstones. Additionally, new debug handlers are available for listing and deleting tombstones via the `/debug/v1/tombstone` path.
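As a rough illustration of the register-time check, here is a minimal sketch; the `NodeLifecycle` states match the mechanism above, while the function shape and error type are assumptions:

```rust
// Minimal sketch of a tombstone-aware register check. `NodeLifecycle` and
// its states follow the RFC; the handler shape and error type are assumed.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum NodeLifecycle {
    Active,
    Deleted,
}

#[derive(Debug)]
enum RegisterError {
    /// The node was deleted and its tombstone row still exists.
    Tombstoned,
}

/// Decide whether a register request may proceed, given the lifecycle
/// stored for this node id (`None` means no row exists at all).
fn check_register(existing: Option<NodeLifecycle>) -> Result<(), RegisterError> {
    match existing {
        // A `Deleted` row is a tombstone: re-registration is refused until
        // the tombstone is removed via the /debug/v1/tombstone handlers.
        Some(NodeLifecycle::Deleted) => Err(RegisterError::Tombstoned),
        // No row (fresh node) or an `Active` row (re-register) is accepted.
        _ => Ok(()),
    }
}
```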
### Gracefulness
The problem of making node deletion graceful is complex and involves several challenges:
- Cancellable: The operation must be cancellable to allow administrators to abort the process if needed, e.g. if run by mistake.
- Non-blocking: We don't want to block deployment operations like draining/filling on the node deletion process. We need clear policies for handling concurrent operations: what happens when a drain/fill request arrives while deletion is in progress, and what happens when a delete request arrives while drain/fill is in progress.
- Persistent: If the storage controller restarts during this long-running operation, we must preserve progress and automatically resume the deletion process after the storage controller restarts.
- Migrated correctly: We cannot simply use the existing drain mechanism for nodes scheduled for deletion, as this would move shards to irrelevant locations. The drain process expects the node to return, so it only moves shards to backup locations, not to their preferred AZs. It also leaves secondary locations unmoved. This could result in unnecessary load on the storage controller and inefficient resource utilization.
- Force option: Administrators need the ability to force immediate, non-graceful deletion when time constraints or emergency situations require it, bypassing the normal graceful migration process.
See below for a detailed breakdown of the proposed changes and mechanisms.
### Node lifecycle
A new `NodeLifecycle` enum and a matching database field with these values:
- `Active`: The normal state. All operations are allowed.
- `ScheduledForDeletion`: The node is marked to be deleted soon. Deletion may be in progress or will happen later, but the node will eventually be removed. All operations are allowed.
- `Deleted`: The node is fully deleted. No operations are allowed, and the node cannot be brought back. The only action left is to remove its record from the database. Any attempt to register a node in this state will fail.
This state persists across storage controller restarts.
### State transition
```
        +--------------------+
    +---|       Active       |<---------------------+
    |   +--------------------+                      |
    |        ^                                      |
    | start_node_delete                             |
    |        | cancel_node_delete                   |
    v        |                                      |
+----------------------------------+                |
|       ScheduledForDeletion       |                | node_register
+----------------------------------+                |
    |                                               |
    | delete_node (at the finish)                   |
    |                                               |
    v                                               |
+---------+       tombstone_delete      +----------+
| Deleted |---------------------------->|  no row  |
+---------+                             +----------+
```
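The diagram condenses into a small transition table. The following sketch is illustrative: the `Option` encoding of the "no row" state and the string-typed events are assumptions, not the actual implementation.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum NodeLifecycle {
    Active,
    ScheduledForDeletion,
    Deleted,
}

/// Legal transitions from the diagram above. The outer `Option` in the
/// return value signals whether the transition is allowed; the inner
/// `Option<NodeLifecycle>` uses `None` for the "no row" state.
fn transition(
    state: Option<NodeLifecycle>,
    event: &str,
) -> Option<Option<NodeLifecycle>> {
    use NodeLifecycle::*;
    match (state, event) {
        (Some(Active), "start_node_delete") => Some(Some(ScheduledForDeletion)),
        (Some(ScheduledForDeletion), "cancel_node_delete") => Some(Some(Active)),
        (Some(ScheduledForDeletion), "delete_node") => Some(Some(Deleted)),
        (Some(Deleted), "tombstone_delete") => Some(None),
        (None, "node_register") => Some(Some(Active)),
        // Everything else is rejected, including registering a Deleted node.
        _ => None,
    }
}
```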
### `NodeSchedulingPolicy::Deleting`
A new `Deleting` variant is added to the `NodeSchedulingPolicy` enum. It means the deletion function is currently running for the node. Only one node can have the `Deleting` policy at a time.
The `NodeSchedulingPolicy::Deleting` state is persisted in the database. However, after a storage controller restart, any node previously marked as `Deleting` will have its scheduling policy reset to `Pause`. The policy will only transition back to `Deleting` when the deletion operation is actively started again, as triggered by the node's `NodeLifecycle::ScheduledForDeletion` state.
`NodeSchedulingPolicy` transition details:
- When `node_delete` begins, set the policy to `NodeSchedulingPolicy::Deleting`.
- If `node_delete` is cancelled (for example, due to a concurrent drain operation), revert the policy to its previous value. The policy is persisted in the storcon DB.
- After `node_delete` completes, the final value of the scheduling policy is irrelevant, since `NodeLifecycle::Deleted` prevents any further access to this field.
The deletion process cannot be initiated for nodes currently undergoing deployment-related operations (`Draining`, `Filling`, or `PauseForRestart` policies). Deletion will only be triggered once the node transitions to either the `Active` or `Pause` state.
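A minimal sketch of this policy bookkeeping; the enum variants follow the RFC, while the helper functions are assumptions:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum NodeSchedulingPolicy {
    Active,
    Pause,
    PauseForRestart,
    Draining,
    Filling,
    Deleting,
}

/// Deletion may only start from a quiescent scheduling state; nodes that
/// are draining, filling, or pausing for restart are skipped for now.
fn may_start_delete(policy: NodeSchedulingPolicy) -> bool {
    matches!(
        policy,
        NodeSchedulingPolicy::Active | NodeSchedulingPolicy::Pause
    )
}

/// After a storage controller restart, a node found in `Deleting` is reset
/// to `Pause`; the `ScheduledForDeletion` lifecycle re-triggers deletion.
fn policy_after_restart(policy: NodeSchedulingPolicy) -> NodeSchedulingPolicy {
    match policy {
        NodeSchedulingPolicy::Deleting => NodeSchedulingPolicy::Pause,
        other => other,
    }
}
```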
### `OperationTracker`
A replacement for `Option<OperationHandler> ongoing_operation`, the `OperationTracker` is a dedicated service state object responsible for managing all long-running node operations (drain, fill, delete) with robust concurrency control.
Key responsibilities:
- Orchestrates the execution of operations
- Supports cancellation of currently running operations
- Enforces operation constraints, e.g. allowing only a single drain/fill operation at a time
- Persists deletion state, enabling recovery of pending deletions across restarts
- Ensures thread safety across concurrent requests
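The shape of such a tracker might look as follows. This is a sketch under the assumption of a `tokio_util` cancellation token; all names except `OperationTracker` are illustrative.

```rust
use std::sync::Mutex;

use tokio_util::sync::CancellationToken;

// The three long-running operation kinds the tracker arbitrates between.
enum Operation {
    Drain { node_id: u64 },
    Fill { node_id: u64 },
    Delete { node_id: u64 },
}

struct Running {
    op: Operation,
    cancel: CancellationToken,
}

struct OperationTracker {
    // Interior mutability so concurrent request handlers can consult and
    // update the tracker; at most one operation runs at a time here.
    active: Mutex<Option<Running>>,
}

impl OperationTracker {
    /// Try to claim the slot for a new operation; returns `None` if another
    /// drain/fill/delete is already running (no queueing is supported).
    fn try_start(&self, op: Operation) -> Option<CancellationToken> {
        let mut guard = self.active.lock().unwrap();
        if guard.is_some() {
            return None;
        }
        let cancel = CancellationToken::new();
        *guard = Some(Running { op, cancel: cancel.clone() });
        Some(cancel)
    }

    /// Request cancellation of the currently running operation, if any.
    fn cancel_active(&self) {
        if let Some(running) = self.active.lock().unwrap().as_ref() {
            running.cancel.cancel();
        }
    }
}
```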
### Attached tenant shard processing
When deleting a node, handle each attached tenant shard as follows:
- Pick the best node to become the new attached (the candidate).
- If the candidate already has this shard as a secondary:
  - Create a new secondary for the shard on another suitable node.
- Otherwise:
  - Create a secondary for the shard on the candidate node.
- Wait until all secondaries are ready and pre-warmed.
- Promote the candidate's secondary to attached.
- Remove the secondary from the node being deleted.
This process safely moves all attached shards before deleting the node.
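In code, the loop for a single attached shard could look roughly like this; every function here is an illustrative stub standing in for the storage controller's real machinery:

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
struct NodeId(u64);
#[derive(Clone, Copy)]
struct ShardId(u64);

// Hypothetical stubs for the real scheduling/migration primitives.
fn pick_candidate(_shard: ShardId) -> NodeId { unimplemented!() } // AZ-aware choice
fn has_secondary_on(_shard: ShardId, _node: NodeId) -> bool { unimplemented!() }
fn create_secondary_on(_shard: ShardId, _node: NodeId) { unimplemented!() }
fn create_secondary_elsewhere(_shard: ShardId, _avoid: &[NodeId]) { unimplemented!() }
fn wait_secondaries_warm(_shard: ShardId) { unimplemented!() } // blocks until pre-warmed
fn promote_to_attached(_shard: ShardId, _node: NodeId) { unimplemented!() }
fn drop_secondary(_shard: ShardId, _node: NodeId) { unimplemented!() }

/// Move one attached shard off the node being deleted.
fn migrate_attached(shard: ShardId, deleting: NodeId) {
    let candidate = pick_candidate(shard);
    if has_secondary_on(shard, candidate) {
        // The candidate's existing secondary will be promoted, so create a
        // replacement secondary elsewhere to keep the secondary count.
        create_secondary_elsewhere(shard, &[deleting, candidate]);
    } else {
        create_secondary_on(shard, candidate);
    }
    wait_secondaries_warm(shard);
    promote_to_attached(shard, candidate);
    drop_secondary(shard, deleting);
}
```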
### Secondary tenant shard processing
When deleting a node, handle each secondary tenant shard as follows:
- Choose the best node to become the new secondary.
- Create a secondary for the shard on that node.
- Wait until the new secondary is ready.
- Remove the secondary from the node being deleted.
This ensures all secondary shards are safely moved before deleting the node.
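The secondary case is a simpler variant of the loop above; this sketch reuses the hypothetical stubs from the attached-shard sketch:

```rust
/// Move one secondary shard off the node being deleted. Reuses the stub
/// functions from the attached-shard sketch above.
fn migrate_secondary(shard: ShardId, deleting: NodeId) {
    let candidate = pick_candidate(shard); // best new secondary location
    create_secondary_on(shard, candidate);
    wait_secondaries_warm(shard); // wait until the new secondary is ready
    drop_secondary(shard, deleting);
}
```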
## Reliability, failure modes and corner cases
In case of a storage controller failure and the subsequent restart, the system behavior depends on the `NodeLifecycle` state:
- If `NodeLifecycle` is `Active`: No action is taken for this node.
- If `NodeLifecycle` is `Deleted`: The node will not be re-added.
- If `NodeLifecycle` is `ScheduledForDeletion`: A deletion background task will be launched for this node.
In case of a pageserver node failure during deletion, the behavior depends on the `force` flag:
- If `force` is set: The node deletion will proceed regardless of the node's availability.
- If `force` is not set: The deletion will be retried a limited number of times. If the node remains unavailable, the deletion process will pause and automatically resume when the node becomes healthy again.
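A sketch of this retry-and-pause behavior; the attempt limit, helper names, and control flow are assumptions:

```rust
// Hypothetical constants and stubs; only the force/retry/pause behavior
// mirrors the RFC text above.
const MAX_DELETE_ATTEMPTS: u32 = 3;

fn migrate_all_shards(_node: u64) -> Result<(), ()> { unimplemented!() }
fn finalize_delete(_node: u64) { unimplemented!() } // flip lifecycle to Deleted
async fn wait_until_healthy(_node: u64) { unimplemented!() }

async fn delete_node(node: u64, force: bool) {
    if force {
        // `force`: remove the node immediately, regardless of availability.
        finalize_delete(node);
        return;
    }
    loop {
        for _ in 0..MAX_DELETE_ATTEMPTS {
            if migrate_all_shards(node).is_ok() {
                finalize_delete(node);
                return;
            }
        }
        // Node still unavailable after the retries: pause the deletion and
        // resume automatically once the node is healthy again.
        wait_until_healthy(node).await;
    }
}
```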
## Operations concurrency
The following sections describe the behavior when different types of requests arrive at the storage controller and how they interact with ongoing operations.
### Delete request
Handler: `PUT /control/v1/node/:node_id/delete`
- If node lifecycle is `NodeLifecycle::ScheduledForDeletion`:
  - Return `200 OK`: there is already an ongoing deletion request for this node
- Update & persist lifecycle to `NodeLifecycle::ScheduledForDeletion`
- Persist current scheduling policy
- If there is no active operation (drain/fill/delete):
  - Run the deletion process for this node
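Condensed into code, the handler flow above might read as follows; the status codes come from the list, while the types and helpers are assumptions:

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum Lifecycle {
    Active,
    ScheduledForDeletion,
}

// Hypothetical stubs for persistence and the operation tracker.
fn lifecycle(_node: u64) -> Lifecycle { unimplemented!() }
fn persist_lifecycle(_node: u64, _l: Lifecycle) { unimplemented!() }
fn persist_scheduling_policy(_node: u64) { unimplemented!() }
fn has_active_operation() -> bool { unimplemented!() }
fn spawn_delete_operation(_node: u64) { unimplemented!() }

/// PUT /control/v1/node/:node_id/delete
fn handle_node_delete(node: u64) -> u16 {
    if lifecycle(node) == Lifecycle::ScheduledForDeletion {
        // Idempotent: a deletion request is already in flight for this node.
        return 200;
    }
    persist_lifecycle(node, Lifecycle::ScheduledForDeletion);
    persist_scheduling_policy(node); // remembered for a possible cancel
    if !has_active_operation() {
        spawn_delete_operation(node);
    }
    200
}
```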
### Cancel delete request
Handler: `DELETE /control/v1/node/:node_id/delete`
- If node lifecycle is not `NodeLifecycle::ScheduledForDeletion`:
  - Return `404 Not Found`: there is no current deletion request for this node
- If the active operation is deleting this node, cancel it
- Update & persist lifecycle to `NodeLifecycle::Active`
- Restore the last scheduling policy from persistence
### Drain/fill request
- If there are already ongoing drain/fill processes:
  - Return `409 Conflict`: queueing of drain/fill processes is not supported
- If there is an ongoing delete process:
  - Cancel it and wait until it is cancelled
- Run the drain/fill process
- After the drain/fill process is cancelled or finished:
  - Try to find another candidate to delete and run the deletion process for that node
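The preemption logic above might be sketched as follows; the operation enum and all helpers are assumptions:

```rust
// Hypothetical operation kinds and stubs for the tracker interactions.
enum Op {
    Drain(u64),
    Fill(u64),
    Delete(u64),
}

fn active_operation() -> Option<Op> { unimplemented!() }
async fn cancel_and_wait(_node: u64) { unimplemented!() }
async fn run_drain_or_fill(_node: u64) { unimplemented!() }
fn next_scheduled_for_deletion() -> Option<u64> { unimplemented!() }
fn spawn_delete_operation(_node: u64) { unimplemented!() }

async fn handle_drain_or_fill(node: u64) -> u16 {
    match active_operation() {
        // Queueing of deployment operations is not supported.
        Some(Op::Drain(_)) | Some(Op::Fill(_)) => return 409,
        // Deletion yields to deployment traffic; the node stays in
        // ScheduledForDeletion, so deletion can be resumed afterwards.
        Some(Op::Delete(deleting)) => cancel_and_wait(deleting).await,
        None => {}
    }
    run_drain_or_fill(node).await;
    // Once the drain/fill completes or is cancelled, pick up any node that
    // is still scheduled for deletion and restart its deletion.
    if let Some(pending) = next_scheduled_for_deletion() {
        spawn_delete_operation(pending);
    }
    200
}
```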
### Drain/fill cancel request
- If the active operation is not the related process:
  - Return `400 Bad Request`: the cancellation request is incorrect, the operations are not the same
- Cancel the active operation
- Try to find another candidate to delete and run the deletion process for that node
## Definition of Done
- Fix flaky node scenario and introduce related debug handlers
- Node deletion intent is persistent - a node will eventually be deleted after a deletion request, regardless of draining/filling requests and restarts
- Node deletion can be graceful - deletion completes only after moving all tenant shards to recommended locations
- Deploying does not break due to long deletions - drain/fill operations override the deletion process, and deletion resumes after drain/fill completes
- `force` flag is implemented and provides fast, failure-tolerant node removal (e.g., when a pageserver node does not respond)
- Legacy delete handler code is removed from `storage_controller`, `test_runner`, and `storcon_cli`