storcon: squash all migrations into one

Problem Neon and Hadron deployments have the same database schema, but different migration histories. Some transactions have different identifiers too. If we don't do anything about it, then the storage controller would fail to apply the merged set of transactions. Summary of Changes We squash all migrations into a single one. If the schema already matches, then the new transaction gets applie without doing anything. For new regions, this migration will bootstrap the database schema. This should be merged in both neon and hadron codebases. Note that after deploying this change, the `__diesel_schema_migrations` table will still contain entries for the old pre-squash transactions. This is fine because diesel only considers the tranasactions embedded in the repo for application. Once we are certain that we are not going to roll back, we can clean up the `__diesel_schema_migrations` tables in prod, but this is manual and error prone, so I'd skip it. Rolling back Rolling back to a previous deployment which embeds the non-squashed transactions is safe. The old transactions are still present in `__diesel_schema_migrations` (i.e. considered applied), so no migrations will be run, so no migrations will be run. Note that this assumes that all transactions are applied and squashed into the new migration before deployment.
chore(compute_tools): Delete unused anon_ext_fn_reassign.sql (#12787 )
2026-05-21 15:10:44 +00:00 · 2025-07-31 13:39:27 +01:00 · 2025-07-31 10:11:30 +00:00 · 2025-07-31 09:53:48 +00:00 · 2025-07-31 09:29:25 +00:00 · 2025-07-31 08:34:47 +00:00
71 changed files with 860 additions and 280 deletions
--- a/compute_tools/src/compute.rs
+++ b/compute_tools/src/compute.rs
@@ -32,8 +32,12 @@ use std::sync::{Arc, Condvar, Mutex, RwLock};
 use std::time::{Duration, Instant};
 use std::{env, fs};
 use tokio::{spawn, sync::watch, task::JoinHandle, time};
+use tokio_util::sync::CancellationToken;
 use tracing::{Instrument, debug, error, info, instrument, warn};
 use url::Url;
+use utils::backoff::{
+    DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS, exponential_backoff_duration,
+};
 use utils::id::{TenantId, TimelineId};
 use utils::lsn::Lsn;
 use utils::measured_stream::MeasuredReader;
@@ -192,6 +196,7 @@ pub struct ComputeState {
    pub startup_span: Option<tracing::span::Span>,

    pub lfc_prewarm_state: LfcPrewarmState,
+    pub lfc_prewarm_token: CancellationToken,
    pub lfc_offload_state: LfcOffloadState,

    /// WAL flush LSN that is set after terminating Postgres and syncing safekeepers if
@@ -217,6 +222,7 @@ impl ComputeState {
            lfc_offload_state: LfcOffloadState::default(),
            terminate_flush_lsn: None,
            promote_state: None,
+            lfc_prewarm_token: CancellationToken::new(),
        }
    }

@@ -1554,6 +1560,41 @@ impl ComputeNode {
        Ok(lsn)
    }

+    fn sync_safekeepers_with_retries(&self, storage_auth_token: Option<String>) -> Result<Lsn> {
+        let max_retries = 5;
+        let mut attempts = 0;
+        loop {
+            let result = self.sync_safekeepers(storage_auth_token.clone());
+            match &result {
+                Ok(_) => {
+                    if attempts > 0 {
+                        tracing::info!("sync_safekeepers succeeded after {attempts} retries");
+                    }
+                    return result;
+                }
+                Err(e) if attempts < max_retries => {
+                    tracing::info!(
+                        "sync_safekeepers failed, will retry (attempt {attempts}): {e:#}"
+                    );
+                }
+                Err(err) => {
+                    tracing::warn!(
+                        "sync_safekeepers still failed after {attempts} retries, giving up: {err:?}"
+                    );
+                    return result;
+                }
+            }
+            // sleep and retry
+            let backoff = exponential_backoff_duration(
+                attempts,
+                DEFAULT_BASE_BACKOFF_SECONDS,
+                DEFAULT_MAX_BACKOFF_SECONDS,
+            );
+            std::thread::sleep(backoff);
+            attempts += 1;
+        }
+    }
+
    /// Do all the preparations like PGDATA directory creation, configuration,
    /// safekeepers sync, basebackup, etc.
    #[instrument(skip_all)]
@@ -1589,7 +1630,7 @@ impl ComputeNode {
                    lsn
                } else {
                    info!("starting safekeepers syncing");
-                    self.sync_safekeepers(pspec.storage_auth_token.clone())
+                    self.sync_safekeepers_with_retries(pspec.storage_auth_token.clone())
                        .with_context(|| "failed to sync safekeepers")?
                };
                info!("safekeepers synced at LSN {}", lsn);
--- a/compute_tools/src/compute_prewarm.rs
+++ b/compute_tools/src/compute_prewarm.rs
@@ -7,7 +7,8 @@ use http::StatusCode;
 use reqwest::Client;
 use std::mem::replace;
 use std::sync::Arc;
-use tokio::{io::AsyncReadExt, spawn};
+use tokio::{io::AsyncReadExt, select, spawn};
+use tokio_util::sync::CancellationToken;
 use tracing::{error, info};

 #[derive(serde::Serialize, Default)]
@@ -92,34 +93,35 @@ impl ComputeNode {
    /// If there is a prewarm request ongoing, return `false`, `true` otherwise.
    /// Has a failpoint "compute-prewarm"
    pub fn prewarm_lfc(self: &Arc<Self>, from_endpoint: Option<String>) -> bool {
+        let token: CancellationToken;
        {
-            let state = &mut self.state.lock().unwrap().lfc_prewarm_state;
-            if let LfcPrewarmState::Prewarming = replace(state, LfcPrewarmState::Prewarming) {
+            let state = &mut self.state.lock().unwrap();
+            token = state.lfc_prewarm_token.clone();
+            if let LfcPrewarmState::Prewarming =
+                replace(&mut state.lfc_prewarm_state, LfcPrewarmState::Prewarming)
+            {
                return false;
            }
        }
        crate::metrics::LFC_PREWARMS.inc();

-        let cloned = self.clone();
+        let this = self.clone();
        spawn(async move {
-            let state = match cloned.prewarm_impl(from_endpoint).await {
-                Ok(true) => LfcPrewarmState::Completed,
-                Ok(false) => {
-                    info!(
-                        "skipping LFC prewarm because LFC state is not found in endpoint storage"
-                    );
-                    LfcPrewarmState::Skipped
-                }
+            let prewarm_state = match this.prewarm_impl(from_endpoint, token).await {
+                Ok(state) => state,
                Err(err) => {
                    crate::metrics::LFC_PREWARM_ERRORS.inc();
                    error!(%err, "could not prewarm LFC");
-                    LfcPrewarmState::Failed {
-                        error: format!("{err:#}"),
-                    }
+                    let error = format!("{err:#}");
+                    LfcPrewarmState::Failed { error }
                }
            };

-            cloned.state.lock().unwrap().lfc_prewarm_state = state;
+            let state = &mut this.state.lock().unwrap();
+            if let LfcPrewarmState::Cancelled = prewarm_state {
+                state.lfc_prewarm_token = CancellationToken::new();
+            }
+            state.lfc_prewarm_state = prewarm_state;
        });
        true
    }
@@ -132,47 +134,70 @@ impl ComputeNode {

    /// Request LFC state from endpoint storage and load corresponding pages into Postgres.
    /// Returns a result with `false` if the LFC state is not found in endpoint storage.
-    async fn prewarm_impl(&self, from_endpoint: Option<String>) -> Result<bool> {
-        let EndpointStoragePair { url, token } = self.endpoint_storage_pair(from_endpoint)?;
+    async fn prewarm_impl(
+        &self,
+        from_endpoint: Option<String>,
+        token: CancellationToken,
+    ) -> Result<LfcPrewarmState> {
+        let EndpointStoragePair {
+            url,
+            token: storage_token,
+        } = self.endpoint_storage_pair(from_endpoint)?;

        #[cfg(feature = "testing")]
-        fail::fail_point!("compute-prewarm", |_| {
-            bail!("prewarm configured to fail because of a failpoint")
-        });
+        fail::fail_point!("compute-prewarm", |_| bail!("compute-prewarm failpoint"));

        info!(%url, "requesting LFC state from endpoint storage");
-        let request = Client::new().get(&url).bearer_auth(token);
-        let res = request.send().await.context("querying endpoint storage")?;
-        match res.status() {
+        let request = Client::new().get(&url).bearer_auth(storage_token);
+        let response = select! {
+            _ = token.cancelled() => return Ok(LfcPrewarmState::Cancelled),
+            response = request.send() => response
+        }
+        .context("querying endpoint storage")?;
+
+        match response.status() {
            StatusCode::OK => (),
-            StatusCode::NOT_FOUND => {
-                return Ok(false);
-            }
+            StatusCode::NOT_FOUND => return Ok(LfcPrewarmState::Skipped),
            status => bail!("{status} querying endpoint storage"),
        }

        let mut uncompressed = Vec::new();
-        let lfc_state = res
-            .bytes()
-            .await
-            .context("getting request body from endpoint storage")?;
-        ZstdDecoder::new(lfc_state.iter().as_slice())
-            .read_to_end(&mut uncompressed)
-            .await
-            .context("decoding LFC state")?;
+        let lfc_state = select! {
+            _ = token.cancelled() => return Ok(LfcPrewarmState::Cancelled),
+            lfc_state = response.bytes() => lfc_state
+        }
+        .context("getting request body from endpoint storage")?;
+
+        let mut decoder = ZstdDecoder::new(lfc_state.iter().as_slice());
+        select! {
+            _ = token.cancelled() => return Ok(LfcPrewarmState::Cancelled),
+            read = decoder.read_to_end(&mut uncompressed) => read
+        }
+        .context("decoding LFC state")?;
+
        let uncompressed_len = uncompressed.len();
+        info!(%url, "downloaded LFC state, uncompressed size {uncompressed_len}");

-        info!(%url, "downloaded LFC state, uncompressed size {uncompressed_len}, loading into Postgres");
-
-        ComputeNode::get_maintenance_client(&self.tokio_conn_conf)
+        // Client connection and prewarm info querying are fast and therefore don't need
+        // cancellation
+        let client = ComputeNode::get_maintenance_client(&self.tokio_conn_conf)
            .await
-            .context("connecting to postgres")?
-            .query_one("select neon.prewarm_local_cache($1)", &[&uncompressed])
-            .await
-            .context("loading LFC state into postgres")
-            .map(|_| ())?;
+            .context("connecting to postgres")?;
+        let pg_token = client.cancel_token();

-        Ok(true)
+        let params: Vec<&(dyn postgres_types::ToSql + Sync)> = vec![&uncompressed];
+        select! {
+            res = client.query_one("select neon.prewarm_local_cache($1)", &params) => res,
+            _ = token.cancelled() => {
+                pg_token.cancel_query(postgres::NoTls).await
+                    .context("cancelling neon.prewarm_local_cache()")?;
+                return Ok(LfcPrewarmState::Cancelled)
+            }
+        }
+        .context("loading LFC state into postgres")
+        .map(|_| ())?;
+
+        Ok(LfcPrewarmState::Completed)
    }

    /// If offload request is ongoing, return false, true otherwise
@@ -200,20 +225,20 @@ impl ComputeNode {

    async fn offload_lfc_with_state_update(&self) {
        crate::metrics::LFC_OFFLOADS.inc();
-
-        let Err(err) = self.offload_lfc_impl().await else {
-            self.state.lock().unwrap().lfc_offload_state = LfcOffloadState::Completed;
-            return;
+        let state = match self.offload_lfc_impl().await {
+            Ok(state) => state,
+            Err(err) => {
+                crate::metrics::LFC_OFFLOAD_ERRORS.inc();
+                error!(%err, "could not offload LFC");
+                let error = format!("{err:#}");
+                LfcOffloadState::Failed { error }
+            }
        };

-        crate::metrics::LFC_OFFLOAD_ERRORS.inc();
-        error!(%err, "could not offload LFC state to endpoint storage");
-        self.state.lock().unwrap().lfc_offload_state = LfcOffloadState::Failed {
-            error: format!("{err:#}"),
-        };
+        self.state.lock().unwrap().lfc_offload_state = state;
    }

-    async fn offload_lfc_impl(&self) -> Result<()> {
+    async fn offload_lfc_impl(&self) -> Result<LfcOffloadState> {
        let EndpointStoragePair { url, token } = self.endpoint_storage_pair(None)?;
        info!(%url, "requesting LFC state from Postgres");

@@ -228,7 +253,7 @@ impl ComputeNode {
            .context("deserializing LFC state")?;
        let Some(state) = state else {
            info!(%url, "empty LFC state, not exporting");
-            return Ok(());
+            return Ok(LfcOffloadState::Skipped);
        };

        let mut compressed = Vec::new();
@@ -242,7 +267,7 @@ impl ComputeNode {

        let request = Client::new().put(url).bearer_auth(token).body(compressed);
        match request.send().await {
-            Ok(res) if res.status() == StatusCode::OK => Ok(()),
+            Ok(res) if res.status() == StatusCode::OK => Ok(LfcOffloadState::Completed),
            Ok(res) => bail!(
                "Request to endpoint storage failed with status: {}",
                res.status()
@@ -250,4 +275,8 @@ impl ComputeNode {
            Err(err) => Err(err).context("writing to endpoint storage"),
        }
    }
+
+    pub fn cancel_prewarm(self: &Arc<Self>) {
+        self.state.lock().unwrap().lfc_prewarm_token.cancel();
+    }
 }
--- a/compute_tools/src/http/openapi_spec.yaml
+++ b/compute_tools/src/http/openapi_spec.yaml
@@ -139,6 +139,15 @@ paths:
            application/json:
              schema:
                $ref: "#/components/schemas/LfcPrewarmState"
+    delete:
+      tags:
+        - Prewarm
+      summary: Cancel ongoing LFC prewarm
+      description: ""
+      operationId: cancelLfcPrewarm
+      responses:
+        202:
+          description: Prewarm cancelled

  /lfc/offload:
    post:
@@ -636,7 +645,7 @@ components:
      properties:
        status:
          description: LFC offload status
-          enum: [not_offloaded, offloading, completed, failed]
+          enum: [not_offloaded, offloading, completed, skipped, failed]
          type: string
        error:
          description: LFC offload error, if any
--- a/compute_tools/src/http/routes/lfc.rs
+++ b/compute_tools/src/http/routes/lfc.rs
@@ -46,3 +46,8 @@ pub(in crate::http) async fn offload(compute: Compute) -> Response {
        )
    }
 }
+
+pub(in crate::http) async fn cancel_prewarm(compute: Compute) -> StatusCode {
+    compute.cancel_prewarm();
+    StatusCode::ACCEPTED
+}
--- a/compute_tools/src/http/server.rs
+++ b/compute_tools/src/http/server.rs
@@ -99,7 +99,12 @@ impl From<&Server> for Router<Arc<ComputeNode>> {
                    );

                let authenticated_router = Router::<Arc<ComputeNode>>::new()
-                    .route("/lfc/prewarm", get(lfc::prewarm_state).post(lfc::prewarm))
+                    .route(
+                        "/lfc/prewarm",
+                        get(lfc::prewarm_state)
+                            .post(lfc::prewarm)
+                            .delete(lfc::cancel_prewarm),
+                    )
                    .route("/lfc/offload", get(lfc::offload_state).post(lfc::offload))
                    .route("/promote", post(promote::promote))
                    .route("/check_writability", post(check_writability::is_writable))
--- a/compute_tools/src/sql/anon_ext_fn_reassign.sql
+++ b/compute_tools/src/sql/anon_ext_fn_reassign.sql
@@ -1,13 +0,0 @@
-DO $$
-DECLARE
-    query varchar;
-BEGIN
-    FOR query IN
-    SELECT pg_catalog.format('ALTER FUNCTION %I(%s) OWNER TO {db_owner};', p.oid::regproc, pg_catalog.pg_get_function_identity_arguments(p.oid))
-    FROM pg_catalog.pg_proc p
-        WHERE p.pronamespace OPERATOR(pg_catalog.=) 'anon'::regnamespace::oid
-    LOOP
-        EXECUTE query;
-    END LOOP;
-END
-$$;
--- a/docs/rfcs/2025-07-07-node-deletion-api-improvement.md
+++ b/docs/rfcs/2025-07-07-node-deletion-api-improvement.md
@@ -0,0 +1,246 @@
+# Node deletion API improvement
+
+Created on 2025-07-07
+Implemented on _TBD_
+
+## Summary
+
+This RFC describes improvements to the storage controller API for gracefully deleting pageserver
+nodes.
+
+## Motivation
+
+The basic node deletion API introduced in [#8226](https://github.com/neondatabase/neon/issues/8333)
+has several limitations:
+
+- Deleted nodes can re-add themselves if they restart (e.g., a flaky node that keeps restarting and
+we cannot reach via SSH to stop the pageserver). This issue has been resolved by tombstone
+mechanism in [#12036](https://github.com/neondatabase/neon/issues/12036)
+- Process of node deletion is not graceful, i.e. it just imitates a node failure
+
+In this context, "graceful" node deletion means that users do not experience any disruption or
+negative effects, provided the system remains in a healthy state (i.e., the remaining pageservers
+can handle the workload and all requirements are met). To achieve this, the system must perform
+live migration of all tenant shards from the node being deleted while the node is still running
+and continue processing all incoming requests. The node is removed only after all tenant shards
+have been safely migrated.
+
+Although live migrations can be achieved with the drain functionality, it leads to incorrect shard
+placement, such as not matching availability zones. This results in unnecessary work to optimize
+the placement that was just recently performed.
+
+If we delete a node before its tenant shards are fully moved, the new node won't have all the
+needed data (e.g. heatmaps) ready. This means user requests to the new node will be much slower at
+first. If there are many tenant shards, this slowdown affects a huge amount of users.
+
+Graceful node deletion is more complicated and can introduce new issues. It takes longer because
+live migration of each tenant shard can last several minutes. Using non-blocking accessors may
+also cause deletion to wait if other processes are holding inner state lock. It also gets trickier
+because we need to handle other requests, like drain and fill, at the same time.
+
+## Impacted components (e.g. pageserver, safekeeper, console, etc)
+
+- storage controller
+- pageserver (indirectly)
+
+## Proposed implementation
+
+### Tombstones
+
+To resolve the problem of deleted nodes re-adding themselves, a tombstone mechanism was introduced
+as part of the node stored information. Each node has a separate `NodeLifecycle` field with two
+possible states: `Active` and `Deleted`. When node deletion completes, the database row is not
+deleted but instead has its `NodeLifecycle` column switched to `Deleted`. Nodes with `Deleted`
+lifecycle are treated as if the row is absent for most handlers, with several exceptions: reattach
+and register functionality must be aware of tombstones. Additionally, new debug handlers are
+available for listing and deleting tombstones via the `/debug/v1/tombstone` path.
+
+### Gracefulness
+
+The problem of making node deletion graceful is complex and involves several challenges:
+
+- **Cancellable**: The operation must be cancellable to allow administrators to abort the process
+if needed, e.g. if run by mistake.
+- **Non-blocking**: We don't want to block deployment operations like draining/filling on the node
+deletion process. We need clear policies for handling concurrent operations: what happens when a
+drain/fill request arrives while deletion is in progress, and what happens when a delete request
+arrives while drain/fill is in progress.
+- **Persistent**: If the storage controller restarts during this long-running operation, we must
+preserve progress and automatically resume the deletion process after the storage controller
+restarts.
+- **Migrated correctly**: We cannot simply use the existing drain mechanism for nodes scheduled
+for deletion, as this would move shards to irrelevant locations. The drain process expects the
+node to return, so it only moves shards to backup locations, not to their preferred AZs. It also
+leaves secondary locations unmoved. This could result in unnecessary load on the storage
+controller and inefficient resource utilization.
+- **Force option**: Administrators need the ability to force immediate, non-graceful deletion when
+time constraints or emergency situations require it, bypassing the normal graceful migration
+process.
+
+See below for a detailed breakdown of the proposed changes and mechanisms.
+
+#### Node lifecycle
+
+New `NodeLifecycle` enum and a matching database field with these values:
+- `Active`: The normal state. All operations are allowed.
+- `ScheduledForDeletion`: The node is marked to be deleted soon. Deletion may be in progress or
+will happen later, but the node will eventually be removed. All operations are allowed.
+- `Deleted`: The node is fully deleted. No operations are allowed, and the node cannot be brought
+back. The only action left is to remove its record from the database. Any attempt to register a
+node in this state will fail.
+
+This state persists across storage controller restarts.
+
+**State transition**
+```
+        +--------------------+
+    +---|       Active       |<---------------------+
+    |   +--------------------+                      |
+    |                     ^                         |
+    | start_node_delete   | cancel_node_delete      |
+    v                     |                         |
+  +----------------------------------+              |
+  |       ScheduledForDeletion       |              |
+  +----------------------------------+              |
+       |                                            |
+       |                              node_register |
+       |                                            |
+       | delete_node (at the finish)                |
+       |                                            |
+       v                                            |
+  +---------+         tombstone_delete        +----------+
+  | Deleted |-------------------------------->|  no row  |
+  +---------+                                 +----------+
+```
+
+#### NodeSchedulingPolicy::Deleting
+
+A `Deleting` variant to the `NodeSchedulingPolicy` enum. This means the deletion function is
+running for the node right now. Only one node can have the `Deleting` policy at a time.
+
+The `NodeSchedulingPolicy::Deleting` state is persisted in the database. However, after a storage
+controller restart, any node previously marked as `Deleting` will have its scheduling policy reset
+to `Pause`. The policy will only transition back to `Deleting` when the deletion operation is
+actively started again, as triggered by the node's `NodeLifecycle::ScheduledForDeletion` state.
+
+`NodeSchedulingPolicy` transition details:
+1. When `node_delete` begins, set the policy to `NodeSchedulingPolicy::Deleting`.
+2. If `node_delete` is cancelled (for example, due to a concurrent drain operation), revert the
+policy to its previous value. The policy is persisted in storcon DB.
+3. After `node_delete` completes, the final value of the scheduling policy is irrelevant, since
+`NodeLifecycle::Deleted` prevents any further access to this field.
+
+The deletion process cannot be initiated for nodes currently undergoing deployment-related
+operations (`Draining`, `Filling`, or `PauseForRestart` policies). Deletion will only be triggered
+once the node transitions to either the `Active` or `Pause` state.
+
+#### OperationTracker
+
+A replacement for `Option<OperationHandler> ongoing_operation`, the `OperationTracker` is a
+dedicated service state object responsible for managing all long-running node operations (drain,
+fill, delete) with robust concurrency control.
+
+Key responsibilities:
+- Orchestrates the execution of operations
+- Supports cancellation of currently running operations
+- Enforces operation constraints, e.g. allowing only single drain/fill operation at a time
+- Persists deletion state, enabling recovery of pending deletions across restarts
+- Ensures thread safety across concurrent requests
+
+#### Attached tenant shard processing
+
+When deleting a node, handle each attached tenant shard as follows:
+
+1. Pick the best node to become the new attached (the candidate).
+2. If the candidate already has this shard as a secondary:
+    - Create a new secondary for the shard on another suitable node.
+   Otherwise:
+    - Create a secondary for the shard on the candidate node.
+3. Wait until all secondaries are ready and pre-warmed.
+4. Promote the candidate's secondary to attached.
+5. Remove the secondary from the node being deleted.
+
+This process safely moves all attached shards before deleting the node.
+
+#### Secondary tenant shard processing
+
+When deleting a node, handle each secondary tenant shard as follows:
+
+1. Choose the best node to become the new secondary.
+2. Create a secondary for the shard on that node.
+3. Wait until the new secondary is ready.
+4. Remove the secondary from the node being deleted.
+
+This ensures all secondary shards are safely moved before deleting the node.
+
+### Reliability, failure modes and corner cases
+
+In case of a storage controller failure and following restart, the system behavior depends on the
+`NodeLifecycle` state:
+
+- If `NodeLifecycle` is `Active`: No action is taken for this node.
+- If `NodeLifecycle` is `Deleted`: The node will not be re-added.
+- If `NodeLifecycle` is `ScheduledForDeletion`: A deletion background task will be launched for
+this node.
+
+In case of a pageserver node failure during deletion, the behavior depends on the `force` flag:
+- If `force` is set: The node deletion will proceed regardless of the node's availability.
+- If `force` is not set: The deletion will be retried a limited number of times. If the node
+remains unavailable, the deletion process will pause and automatically resume when the node
+becomes healthy again.
+
+### Operations concurrency
+
+The following sections describe the behavior when different types of requests arrive at the storage
+controller and how they interact with ongoing operations.
+
+#### Delete request
+
+Handler: `PUT /control/v1/node/:node_id/delete`
+
+1. If node lifecycle is `NodeLifecycle::ScheduledForDeletion`:
+    - Return `200 OK`: there is already an ongoing deletion request for this node
+2. Update & persist lifecycle to `NodeLifecycle::ScheduledForDeletion`
+3. Persist current scheduling policy
+4. If there is no active operation (drain/fill/delete):
+    - Run deletion process for this node
+
+#### Cancel delete request
+
+Handler: `DELETE /control/v1/node/:node_id/delete`
+
+1. If node lifecycle is not `NodeLifecycle::ScheduledForDeletion`:
+    - Return `404 Not Found`: there is no current deletion request for this node
+2. If the active operation is deleting this node, cancel it
+3. Update & persist lifecycle to `NodeLifecycle::Active`
+4. Restore the last scheduling policy from persistence
+
+#### Drain/fill request
+
+1. If there are already ongoing drain/fill processes:
+    - Return `409 Conflict`: queueing of drain/fill processes is not supported
+2. If there is an ongoing delete process:
+    - Cancel it and wait until it is cancelled
+3. Run the drain/fill process
+4. After the drain/fill process is cancelled or finished:
+    - Try to find another candidate to delete and run the deletion process for that node
+
+#### Drain/fill cancel request
+
+1. If the active operation is not the related process:
+    - Return `400 Bad Request`: cancellation request is incorrect, operations are not the same
+2. Cancel the active operation
+3. Try to find another candidate to delete and run the deletion process for that node
+
+## Definition of Done
+
+- [x] Fix flaky node scenario and introduce related debug handlers
+- [ ] Node deletion intent is persistent - a node will be eventually deleted after a deletion
+request regardless of draining/filling requests and restarts
+- [ ] Node deletion can be graceful - deletion completes only after moving all tenant shards to
+recommended locations
+- [ ] Deploying does not break due to long deletions - drain/fill operations override deletion
+process and deletion resumes after drain/fill completes
+- [ ] `force` flag is implemented and provides fast, failure-tolerant node removal (e.g., when a
+pageserver node does not respond)
+- [ ] Legacy delete handler code is removed from storage_controller, test_runner, and storcon_cli
--- a/libs/compute_api/src/responses.rs
+++ b/libs/compute_api/src/responses.rs
@@ -68,11 +68,15 @@ pub enum LfcPrewarmState {
    /// We tried to fetch the corresponding LFC state from the endpoint storage,
    /// but received `Not Found 404`. This should normally happen only during the
    /// first endpoint start after creation with `autoprewarm: true`.
+    /// This may also happen if LFC is turned off or not initialized
    ///
    /// During the orchestrated prewarm via API, when a caller explicitly
    /// provides the LFC state key to prewarm from, it's the caller responsibility
    /// to handle this status as an error state in this case.
    Skipped,
+    /// LFC prewarm was cancelled. Some pages in LFC cache may be prewarmed if query
+    /// has started working before cancellation
+    Cancelled,
 }

 impl Display for LfcPrewarmState {
@@ -83,6 +87,7 @@ impl Display for LfcPrewarmState {
            LfcPrewarmState::Completed => f.write_str("Completed"),
            LfcPrewarmState::Skipped => f.write_str("Skipped"),
            LfcPrewarmState::Failed { error } => write!(f, "Error({error})"),
+            LfcPrewarmState::Cancelled => f.write_str("Cancelled"),
        }
    }
 }
@@ -97,6 +102,7 @@ pub enum LfcOffloadState {
    Failed {
        error: String,
    },
+    Skipped,
 }

 #[derive(Serialize, Debug, Clone, PartialEq)]
--- a/proxy/src/serverless/backend.rs
+++ b/proxy/src/serverless/backend.rs
@@ -458,7 +458,7 @@ pub(crate) enum LocalProxyConnError {
 impl ReportableError for HttpConnError {
    fn get_error_kind(&self) -> ErrorKind {
        match self {
-            HttpConnError::ConnectError(_) => ErrorKind::Compute,
+            HttpConnError::ConnectError(e) => e.get_error_kind(),
            HttpConnError::ConnectionClosedAbruptly(_) => ErrorKind::Compute,
            HttpConnError::PostgresConnectionError(p) => match p.as_db_error() {
                // user provided a wrong database name
--- a/safekeeper/src/pull_timeline.rs
+++ b/safekeeper/src/pull_timeline.rs
@@ -612,19 +612,25 @@ pub async fn handle_request(
        }
    }

+    let max_term = statuses
+        .iter()
+        .map(|(status, _)| status.acceptor_state.term)
+        .max()
+        .unwrap();
+
    // Find the most advanced safekeeper
    let (status, i) = statuses
        .into_iter()
        .max_by_key(|(status, _)| {
            (
                status.acceptor_state.epoch,
+                status.flush_lsn,
                /* BEGIN_HADRON */
                // We need to pull from the SK with the highest term.
                // This is because another compute may come online and vote the same highest term again on the other two SKs.
                // Then, there will be 2 computes running on the same term.
                status.acceptor_state.term,
                /* END_HADRON */
-                status.flush_lsn,
                status.commit_lsn,
            )
        })
@@ -634,6 +640,22 @@ pub async fn handle_request(
    assert!(status.tenant_id == request.tenant_id);
    assert!(status.timeline_id == request.timeline_id);

+    // TODO(diko): This is hadron only check to make sure that we pull the timeline
+    // from the safekeeper with the highest term during timeline restore.
+    // We could avoid returning the error by calling bump_term after pull_timeline.
+    // However, this is not a big deal because we retry the pull_timeline requests.
+    // The check should be removed together with removing custom hadron logic for
+    // safekeeper restore.
+    if wait_for_peer_timeline_status && status.acceptor_state.term != max_term {
+        return Err(ApiError::PreconditionFailed(
+            format!(
+                "choosen safekeeper {} has term {}, but the most advanced term is {}",
+                safekeeper_host, status.acceptor_state.term, max_term
+            )
+            .into(),
+        ));
+    }
+
    match pull_timeline(
        status,
        safekeeper_host,
--- a/safekeeper/src/timeline.rs
+++ b/safekeeper/src/timeline.rs
@@ -195,12 +195,14 @@ impl StateSK {
        to: Configuration,
    ) -> Result<TimelineMembershipSwitchResponse> {
        let result = self.state_mut().membership_switch(to).await?;
+        let flush_lsn = self.flush_lsn();
+        let last_log_term = self.state().acceptor_state.get_last_log_term(flush_lsn);

        Ok(TimelineMembershipSwitchResponse {
            previous_conf: result.previous_conf,
            current_conf: result.current_conf,
-            last_log_term: self.state().acceptor_state.term,
-            flush_lsn: self.flush_lsn(),
+            last_log_term,
+            flush_lsn,
        })
    }

--- a/storage_controller/migrations/2024-01-07-211257_create_tenant_shards/down.sql
+++ b/storage_controller/migrations/2024-01-07-211257_create_tenant_shards/down.sql
@@ -1 +0,0 @@
-DROP TABLE tenant_shards;
--- a/storage_controller/migrations/2024-01-07-211257_create_tenant_shards/up.sql
+++ b/storage_controller/migrations/2024-01-07-211257_create_tenant_shards/up.sql
@@ -1,13 +0,0 @@
-CREATE TABLE tenant_shards (
-  tenant_id VARCHAR NOT NULL,
-  shard_number INTEGER NOT NULL,
-  shard_count INTEGER NOT NULL,
-  PRIMARY KEY(tenant_id, shard_number, shard_count),
-  shard_stripe_size INTEGER NOT NULL,
-  generation INTEGER NOT NULL,
-  generation_pageserver BIGINT NOT NULL,
-  placement_policy VARCHAR NOT NULL,
-  splitting SMALLINT NOT NULL,
-  -- config is JSON encoded, opaque to the database.
-  config TEXT NOT NULL
-);
--- a/storage_controller/migrations/2024-01-07-212945_create_nodes/down.sql
+++ b/storage_controller/migrations/2024-01-07-212945_create_nodes/down.sql
@@ -1 +0,0 @@
-DROP TABLE nodes;
--- a/storage_controller/migrations/2024-01-07-212945_create_nodes/up.sql
+++ b/storage_controller/migrations/2024-01-07-212945_create_nodes/up.sql
@@ -1,10 +0,0 @@
-CREATE TABLE nodes (
-  node_id BIGINT PRIMARY KEY NOT NULL,
-
-  scheduling_policy VARCHAR NOT NULL,
-
-  listen_http_addr VARCHAR NOT NULL,
-  listen_http_port INTEGER NOT NULL,
-  listen_pg_addr VARCHAR NOT NULL,
-  listen_pg_port INTEGER NOT NULL
-);
--- a/storage_controller/migrations/2024-02-29-094122_generations_null/down.sql
+++ b/storage_controller/migrations/2024-02-29-094122_generations_null/down.sql
@@ -1,2 +0,0 @@
-ALTER TABLE tenant_shards ALTER generation SET NOT NULL;
-ALTER TABLE tenant_shards ALTER generation_pageserver SET NOT NULL;
--- a/storage_controller/migrations/2024-02-29-094122_generations_null/up.sql
+++ b/storage_controller/migrations/2024-02-29-094122_generations_null/up.sql
@@ -1,4 +0,0 @@
-
-
-ALTER TABLE tenant_shards ALTER generation DROP NOT NULL;
-ALTER TABLE tenant_shards ALTER generation_pageserver DROP NOT NULL;
--- a/storage_controller/migrations/2024-03-18-184429_rename_policy/down.sql
+++ b/storage_controller/migrations/2024-03-18-184429_rename_policy/down.sql
@@ -1,3 +0,0 @@
-
-UPDATE tenant_shards set placement_policy='{"Double": 1}' where placement_policy='{"Attached": 1}';
-UPDATE tenant_shards set placement_policy='"Single"' where placement_policy='{"Attached": 0}';
--- a/storage_controller/migrations/2024-03-18-184429_rename_policy/up.sql
+++ b/storage_controller/migrations/2024-03-18-184429_rename_policy/up.sql
@@ -1,3 +0,0 @@
-
-UPDATE tenant_shards set placement_policy='{"Attached": 1}' where placement_policy='{"Double": 1}';
-UPDATE tenant_shards set placement_policy='{"Attached": 0}' where placement_policy='"Single"';
--- a/storage_controller/migrations/2024-03-27-133204_tenant_policies/down.sql
+++ b/storage_controller/migrations/2024-03-27-133204_tenant_policies/down.sql
@@ -1,3 +0,0 @@
-- This file should undo anything in `up.sql`
-
-ALTER TABLE tenant_shards drop scheduling_policy;
--- a/storage_controller/migrations/2024-03-27-133204_tenant_policies/up.sql
+++ b/storage_controller/migrations/2024-03-27-133204_tenant_policies/up.sql
@@ -1,2 +0,0 @@
-
-ALTER TABLE tenant_shards add scheduling_policy VARCHAR NOT NULL DEFAULT '"Active"';
--- a/storage_controller/migrations/2024-07-23-191537_create_metadata_health/down.sql
+++ b/storage_controller/migrations/2024-07-23-191537_create_metadata_health/down.sql
@@ -1 +0,0 @@
-DROP TABLE metadata_health;
--- a/storage_controller/migrations/2024-07-23-191537_create_metadata_health/up.sql
+++ b/storage_controller/migrations/2024-07-23-191537_create_metadata_health/up.sql
@@ -1,14 +0,0 @@
-CREATE TABLE metadata_health (
-  tenant_id VARCHAR NOT NULL,
-  shard_number INTEGER NOT NULL,
-  shard_count INTEGER NOT NULL,
-  PRIMARY KEY(tenant_id, shard_number, shard_count),
-  -- Rely on cascade behavior for delete
-  FOREIGN KEY(tenant_id, shard_number, shard_count) REFERENCES tenant_shards ON DELETE CASCADE,
-  healthy BOOLEAN NOT NULL DEFAULT TRUE,
-  last_scrubbed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
-);
-
-
-INSERT INTO metadata_health(tenant_id, shard_number, shard_count)
-SELECT tenant_id, shard_number, shard_count FROM tenant_shards;
--- a/storage_controller/migrations/2024-07-26-140924_create_leader/down.sql
+++ b/storage_controller/migrations/2024-07-26-140924_create_leader/down.sql
@@ -1 +0,0 @@
-DROP TABLE controllers;
--- a/storage_controller/migrations/2024-07-26-140924_create_leader/up.sql
+++ b/storage_controller/migrations/2024-07-26-140924_create_leader/up.sql
@@ -1,5 +0,0 @@
-CREATE TABLE controllers (
-  address VARCHAR NOT NULL,
-  started_at TIMESTAMPTZ NOT NULL,
-  PRIMARY KEY(address, started_at)
-);
--- a/storage_controller/migrations/2024-08-23-102952_safekeepers/down.sql
+++ b/storage_controller/migrations/2024-08-23-102952_safekeepers/down.sql
@@ -1,2 +0,0 @@
-- This file should undo anything in `up.sql`
-DROP TABLE safekeepers;
--- a/storage_controller/migrations/2024-08-23-102952_safekeepers/up.sql
+++ b/storage_controller/migrations/2024-08-23-102952_safekeepers/up.sql
@@ -1,15 +0,0 @@
-- started out as a copy of cplane schema, removed the unnecessary columns.
-CREATE TABLE safekeepers (
-	-- the surrogate identifier defined by control plane database sequence
-	id BIGINT PRIMARY KEY,
-	region_id TEXT NOT NULL,
-	version BIGINT NOT NULL,
-	-- the natural id on whatever cloud platform, not needed in storage controller
-	-- instance_id TEXT UNIQUE NOT NULL,
-	host TEXT NOT NULL,
-	port INTEGER NOT NULL,
-	active BOOLEAN NOT NULL DEFAULT false,
-	-- projects_count INTEGER NOT NULL DEFAULT 0,
-	http_port INTEGER NOT NULL,
-	availability_zone_id TEXT NOT NULL
-);
--- a/storage_controller/migrations/2024-08-23-170149_tenant_id_index/down.sql
+++ b/storage_controller/migrations/2024-08-23-170149_tenant_id_index/down.sql
@@ -1,2 +0,0 @@
-- This file should undo anything in `up.sql`
-DROP INDEX tenant_shards_tenant_id;
--- a/storage_controller/migrations/2024-08-23-170149_tenant_id_index/up.sql
+++ b/storage_controller/migrations/2024-08-23-170149_tenant_id_index/up.sql
@@ -1,2 +0,0 @@
-- Your SQL goes here
-CREATE INDEX tenant_shards_tenant_id ON tenant_shards (tenant_id);
--- a/storage_controller/migrations/2024-08-27-184400_pageserver_az/down.sql
+++ b/storage_controller/migrations/2024-08-27-184400_pageserver_az/down.sql
@@ -1 +0,0 @@
-ALTER TABLE nodes DROP availability_zone_id;
--- a/storage_controller/migrations/2024-08-27-184400_pageserver_az/up.sql
+++ b/storage_controller/migrations/2024-08-27-184400_pageserver_az/up.sql
@@ -1 +0,0 @@
-ALTER TABLE nodes ADD availability_zone_id VARCHAR;
--- a/storage_controller/migrations/2024-08-28-150530_pageserver_az_not_null/down.sql
+++ b/storage_controller/migrations/2024-08-28-150530_pageserver_az_not_null/down.sql
@@ -1 +0,0 @@
-ALTER TABLE nodes ALTER availability_zone_id DROP NOT NULL;
--- a/storage_controller/migrations/2024-08-28-150530_pageserver_az_not_null/up.sql
+++ b/storage_controller/migrations/2024-08-28-150530_pageserver_az_not_null/up.sql
@@ -1 +0,0 @@
-ALTER TABLE nodes ALTER availability_zone_id SET NOT NULL;
--- a/storage_controller/migrations/2024-09-05-104500_tenant_shard_preferred_az/down.sql
+++ b/storage_controller/migrations/2024-09-05-104500_tenant_shard_preferred_az/down.sql
@@ -1 +0,0 @@
-ALTER TABLE tenant_shards DROP preferred_az_id;
--- a/storage_controller/migrations/2024-09-05-104500_tenant_shard_preferred_az/up.sql
+++ b/storage_controller/migrations/2024-09-05-104500_tenant_shard_preferred_az/up.sql
@@ -1 +0,0 @@
-ALTER TABLE tenant_shards ADD preferred_az_id VARCHAR;
--- a/storage_controller/migrations/2024-12-12-212515_safekeepers_scheduling_policy/down.sql
+++ b/storage_controller/migrations/2024-12-12-212515_safekeepers_scheduling_policy/down.sql
@@ -1 +0,0 @@
-ALTER TABLE safekeepers DROP scheduling_policy;
--- a/storage_controller/migrations/2024-12-12-212515_safekeepers_scheduling_policy/up.sql
+++ b/storage_controller/migrations/2024-12-12-212515_safekeepers_scheduling_policy/up.sql
@@ -1 +0,0 @@
-ALTER TABLE safekeepers ADD scheduling_policy VARCHAR NOT NULL DEFAULT 'disabled';
--- a/storage_controller/migrations/2025-01-09-160454_safekeepers_remove_active/down.sql
+++ b/storage_controller/migrations/2025-01-09-160454_safekeepers_remove_active/down.sql
@@ -1,4 +0,0 @@
-- this sadly isn't a "true" revert of the migration, as the column is now at the end of the table.
-- But preserving order is not a trivial operation.
-- https://wiki.postgresql.org/wiki/Alter_column_position
-ALTER TABLE safekeepers ADD active BOOLEAN NOT NULL DEFAULT false;
--- a/storage_controller/migrations/2025-01-09-160454_safekeepers_remove_active/up.sql
+++ b/storage_controller/migrations/2025-01-09-160454_safekeepers_remove_active/up.sql
@@ -1 +0,0 @@
-ALTER TABLE safekeepers DROP active;
--- a/storage_controller/migrations/2025-01-15-181207_safekeepers_disabled_to_pause/down.sql
+++ b/storage_controller/migrations/2025-01-15-181207_safekeepers_disabled_to_pause/down.sql
@@ -1,2 +0,0 @@
-ALTER TABLE safekeepers ALTER COLUMN scheduling_policy SET DEFAULT 'disabled';
-UPDATE safekeepers SET scheduling_policy = 'disabled' WHERE scheduling_policy = 'pause';
--- a/storage_controller/migrations/2025-01-15-181207_safekeepers_disabled_to_pause/up.sql
+++ b/storage_controller/migrations/2025-01-15-181207_safekeepers_disabled_to_pause/up.sql
@@ -1,2 +0,0 @@
-ALTER TABLE safekeepers ALTER COLUMN scheduling_policy SET DEFAULT 'pause';
-UPDATE safekeepers SET scheduling_policy = 'pause' WHERE scheduling_policy = 'disabled';
--- a/storage_controller/migrations/2025-02-11-144848_pageserver_use_https/down.sql
+++ b/storage_controller/migrations/2025-02-11-144848_pageserver_use_https/down.sql
@@ -1 +0,0 @@
-ALTER TABLE nodes DROP listen_https_port;
--- a/storage_controller/migrations/2025-02-11-144848_pageserver_use_https/up.sql
+++ b/storage_controller/migrations/2025-02-11-144848_pageserver_use_https/up.sql
@@ -1 +0,0 @@
-ALTER TABLE nodes ADD listen_https_port INTEGER;
--- a/storage_controller/migrations/2025-02-14-160526_safekeeper_timelines/down.sql
+++ b/storage_controller/migrations/2025-02-14-160526_safekeeper_timelines/down.sql
@@ -1,2 +0,0 @@
-DROP TABLE timelines;
-DROP TABLE safekeeper_timeline_pending_ops;
--- a/storage_controller/migrations/2025-02-14-160526_safekeeper_timelines/up.sql
+++ b/storage_controller/migrations/2025-02-14-160526_safekeeper_timelines/up.sql
@@ -1,19 +0,0 @@
-CREATE TABLE timelines (
-  tenant_id VARCHAR NOT NULL,
-  timeline_id VARCHAR NOT NULL,
-  start_lsn pg_lsn NOT NULL,
-  generation INTEGER NOT NULL,
-  sk_set BIGINT[] NOT NULL,
-  new_sk_set BIGINT[],
-  cplane_notified_generation INTEGER NOT NULL,
-  deleted_at timestamptz,
-  PRIMARY KEY(tenant_id, timeline_id)
-);
-CREATE TABLE safekeeper_timeline_pending_ops (
-  sk_id BIGINT NOT NULL,
-  tenant_id VARCHAR NOT NULL,
-  timeline_id VARCHAR NOT NULL,
-  generation INTEGER NOT NULL,
-  op_kind VARCHAR NOT NULL,
-  PRIMARY KEY(tenant_id, timeline_id, sk_id)
-);
--- a/storage_controller/migrations/2025-02-28-141741_safekeeper_use_https/down.sql
+++ b/storage_controller/migrations/2025-02-28-141741_safekeeper_use_https/down.sql
@@ -1 +0,0 @@
-ALTER TABLE safekeepers DROP https_port;
--- a/storage_controller/migrations/2025-02-28-141741_safekeeper_use_https/up.sql
+++ b/storage_controller/migrations/2025-02-28-141741_safekeeper_use_https/up.sql
@@ -1 +0,0 @@
-ALTER TABLE safekeepers ADD https_port INTEGER;
--- a/storage_controller/migrations/2025-03-18-103700_timeline_imports/down.sql
+++ b/storage_controller/migrations/2025-03-18-103700_timeline_imports/down.sql
@@ -1 +0,0 @@
-DROP TABLE timeline_imports;
--- a/storage_controller/migrations/2025-03-18-103700_timeline_imports/up.sql
+++ b/storage_controller/migrations/2025-03-18-103700_timeline_imports/up.sql
@@ -1,6 +0,0 @@
-CREATE TABLE timeline_imports (
-  tenant_id VARCHAR NOT NULL,
-  timeline_id VARCHAR NOT NULL,
-  shard_statuses JSONB NOT NULL,
-  PRIMARY KEY(tenant_id, timeline_id)
-);
--- a/storage_controller/migrations/2025-06-01-201442_add_lifecycle_to_nodes/down.sql
+++ b/storage_controller/migrations/2025-06-01-201442_add_lifecycle_to_nodes/down.sql
@@ -1 +0,0 @@
-ALTER TABLE nodes DROP COLUMN lifecycle;
--- a/storage_controller/migrations/2025-06-01-201442_add_lifecycle_to_nodes/up.sql
+++ b/storage_controller/migrations/2025-06-01-201442_add_lifecycle_to_nodes/up.sql
@@ -1 +0,0 @@
-ALTER TABLE nodes ADD COLUMN lifecycle VARCHAR NOT NULL DEFAULT 'active';
--- a/storage_controller/migrations/2025-06-17-082247_pageserver_grpc_addr/down.sql
+++ b/storage_controller/migrations/2025-06-17-082247_pageserver_grpc_addr/down.sql
@@ -1 +0,0 @@
-ALTER TABLE nodes DROP listen_grpc_addr, listen_grpc_port;
--- a/storage_controller/migrations/2025-06-17-082247_pageserver_grpc_addr/up.sql
+++ b/storage_controller/migrations/2025-06-17-082247_pageserver_grpc_addr/up.sql
@@ -1 +0,0 @@
-ALTER TABLE nodes ADD listen_grpc_addr VARCHAR NULL, ADD listen_grpc_port INTEGER NULL;
--- a/storage_controller/migrations/2025-07-02-170751_safekeeper_default_no_pause/down.sql
+++ b/storage_controller/migrations/2025-07-02-170751_safekeeper_default_no_pause/down.sql
@@ -1 +0,0 @@
-ALTER TABLE safekeepers ALTER COLUMN scheduling_policy SET DEFAULT 'pause';
--- a/storage_controller/migrations/2025-07-02-170751_safekeeper_default_no_pause/up.sql
+++ b/storage_controller/migrations/2025-07-02-170751_safekeeper_default_no_pause/up.sql
@@ -1 +0,0 @@
-ALTER TABLE safekeepers ALTER COLUMN scheduling_policy SET DEFAULT 'activating';
--- a/storage_controller/migrations/2025-07-08-114340_sk_set_notified_generation/down.sql
+++ b/storage_controller/migrations/2025-07-08-114340_sk_set_notified_generation/down.sql
@@ -1 +0,0 @@
-ALTER TABLE timelines DROP sk_set_notified_generation;
--- a/storage_controller/migrations/2025-07-08-114340_sk_set_notified_generation/up.sql
+++ b/storage_controller/migrations/2025-07-08-114340_sk_set_notified_generation/up.sql
@@ -1 +0,0 @@
-ALTER TABLE timelines ADD sk_set_notified_generation INTEGER NOT NULL DEFAULT 1;
--- a/storage_controller/migrations/2025-07-17-000001_hadron_safekeepers/down.sql
+++ b/storage_controller/migrations/2025-07-17-000001_hadron_safekeepers/down.sql
@@ -1,2 +0,0 @@
-DROP TABLE hadron_safekeepers;
-DROP TABLE hadron_timeline_safekeepers;
--- a/storage_controller/migrations/2025-07-17-000001_hadron_safekeepers/up.sql
+++ b/storage_controller/migrations/2025-07-17-000001_hadron_safekeepers/up.sql
@@ -1,17 +0,0 @@
-- hadron_safekeepers keep track of all Safe Keeper nodes that exist in the system.
-- Upon startup, each Safe Keeper reaches out to the hadron cluster coordinator to register its node ID and listen addresses.
-
-CREATE TABLE hadron_safekeepers (
-  sk_node_id BIGINT PRIMARY KEY NOT NULL,
-  listen_http_addr VARCHAR NOT NULL,
-  listen_http_port INTEGER NOT NULL,
-  listen_pg_addr VARCHAR NOT NULL,
-  listen_pg_port INTEGER NOT NULL
-);
-
-CREATE TABLE hadron_timeline_safekeepers (
-  timeline_id VARCHAR NOT NULL,
-  sk_node_id BIGINT NOT NULL,
-  legacy_endpoint_id UUID DEFAULT NULL,
-  PRIMARY KEY(timeline_id, sk_node_id)
-);
--- a/storage_controller/migrations/2025-07-31-115100_squash_migrations/down.sql
+++ b/storage_controller/migrations/2025-07-31-115100_squash_migrations/down.sql
@@ -0,0 +1,10 @@
+DROP TABLE IF EXISTS timelines CASCADE;
+DROP TABLE IF EXISTS timeline_imports CASCADE;
+DROP TABLE IF EXISTS tenant_shards CASCADE;
+DROP TABLE IF EXISTS safekeepers CASCADE;
+DROP TABLE IF EXISTS safekeeper_timeline_pending_ops CASCADE;
+DROP TABLE IF EXISTS nodes CASCADE;
+DROP TABLE IF EXISTS metadata_health CASCADE;
+DROP TABLE IF EXISTS hadron_timeline_safekeepers CASCADE;
+DROP TABLE IF EXISTS hadron_safekeepers CASCADE;
+DROP TABLE IF EXISTS controllers CASCADE;
--- a/storage_controller/migrations/2025-07-31-115100_squash_migrations/up.sql
+++ b/storage_controller/migrations/2025-07-31-115100_squash_migrations/up.sql
@@ -0,0 +1,112 @@
+/*
+ * This migration squashes all previous migrations into a single one.
+ * It will be applied in two cases:
+ * 1. Spinning up a new region. There's no previous migrations apart from the diesel setup in this case
+ * 2. For existing regions.
+ *
+ * Hence, this migration does nothing if the schema is already up to date.
+ */
+
+CREATE TABLE IF NOT EXISTS controllers (
+    address character varying NOT NULL,
+    started_at timestamp with time zone NOT NULL,
+    PRIMARY KEY(address, started_at)
+);
+
+CREATE TABLE IF NOT EXISTS hadron_safekeepers (
+    sk_node_id bigint PRIMARY KEY NOT NULL,
+    listen_http_addr character varying NOT NULL,
+    listen_http_port integer NOT NULL,
+    listen_pg_addr character varying NOT NULL,
+    listen_pg_port integer NOT NULL
+);
+
+CREATE TABLE IF NOT EXISTS hadron_timeline_safekeepers (
+    timeline_id character varying NOT NULL,
+    sk_node_id bigint NOT NULL,
+    legacy_endpoint_id uuid,
+    PRIMARY KEY(timeline_id, sk_node_id)
+);
+
+CREATE TABLE IF NOT EXISTS tenant_shards (
+    tenant_id character varying NOT NULL,
+    shard_number integer NOT NULL,
+    shard_count integer NOT NULL,
+    PRIMARY KEY(tenant_id, shard_number, shard_count),
+    shard_stripe_size integer NOT NULL,
+    generation integer,
+    generation_pageserver bigint,
+    placement_policy character varying NOT NULL,
+    splitting smallint NOT NULL,
+    config text NOT NULL,
+    scheduling_policy character varying DEFAULT '"Active"'::character varying NOT NULL,
+    preferred_az_id character varying
+);
+
+CREATE INDEX IF NOT EXISTS tenant_shards_tenant_id ON tenant_shards USING btree (tenant_id);
+
+CREATE TABLE IF NOT EXISTS metadata_health (
+    tenant_id character varying NOT NULL,
+    shard_number integer NOT NULL,
+    shard_count integer NOT NULL,
+    PRIMARY KEY(tenant_id, shard_number, shard_count),
+    -- Rely on cascade behavior for delete
+    FOREIGN KEY(tenant_id, shard_number, shard_count) REFERENCES tenant_shards ON DELETE CASCADE,
+    healthy boolean DEFAULT true NOT NULL,
+    last_scrubbed_at timestamp with time zone DEFAULT now() NOT NULL
+);
+
+CREATE TABLE IF NOT EXISTS nodes (
+    node_id bigint PRIMARY KEY NOT NULL,
+    scheduling_policy character varying NOT NULL,
+    listen_http_addr character varying NOT NULL,
+    listen_http_port integer NOT NULL,
+    listen_pg_addr character varying NOT NULL,
+    listen_pg_port integer NOT NULL,
+    availability_zone_id character varying NOT NULL,
+    listen_https_port integer,
+    lifecycle character varying DEFAULT 'active'::character varying NOT NULL,
+    listen_grpc_addr character varying,
+    listen_grpc_port integer
+);
+
+CREATE TABLE IF NOT EXISTS safekeeper_timeline_pending_ops (
+    sk_id bigint NOT NULL,
+    tenant_id character varying NOT NULL,
+    timeline_id character varying NOT NULL,
+    generation integer NOT NULL,
+    op_kind character varying NOT NULL,
+    PRIMARY KEY(tenant_id, timeline_id, sk_id)
+);
+
+CREATE TABLE IF NOT EXISTS safekeepers (
+    id bigint PRIMARY KEY NOT NULL,
+    region_id text NOT NULL,
+    version bigint NOT NULL,
+    host text NOT NULL,
+    port integer NOT NULL,
+    http_port integer NOT NULL,
+    availability_zone_id text NOT NULL,
+    scheduling_policy character varying DEFAULT 'activating'::character varying NOT NULL,
+    https_port integer
+);
+
+CREATE TABLE IF NOT EXISTS timeline_imports (
+    tenant_id character varying NOT NULL,
+    timeline_id character varying NOT NULL,
+    shard_statuses jsonb NOT NULL,
+    PRIMARY KEY(tenant_id, timeline_id)
+);
+
+CREATE TABLE IF NOT EXISTS timelines (
+    tenant_id character varying NOT NULL,
+    timeline_id character varying NOT NULL,
+    start_lsn pg_lsn NOT NULL,
+    generation integer NOT NULL,
+    sk_set bigint[] NOT NULL,
+    new_sk_set bigint[],
+    cplane_notified_generation integer NOT NULL,
+    deleted_at timestamp with time zone,
+    sk_set_notified_generation integer DEFAULT 1 NOT NULL,
+    PRIMARY KEY(tenant_id, timeline_id)
+);
--- a/storage_controller/src/service/safekeeper_service.rs
+++ b/storage_controller/src/service/safekeeper_service.rs
@@ -24,12 +24,12 @@ use pageserver_api::controller_api::{
 };
 use pageserver_api::models::{SafekeeperInfo, SafekeepersInfo, TimelineInfo};
 use safekeeper_api::PgVersionId;
+use safekeeper_api::Term;
 use safekeeper_api::membership::{self, MemberSet, SafekeeperGeneration};
 use safekeeper_api::models::{
    PullTimelineRequest, TimelineLocateResponse, TimelineMembershipSwitchRequest,
    TimelineMembershipSwitchResponse,
 };
-use safekeeper_api::{INITIAL_TERM, Term};
 use safekeeper_client::mgmt_api;
 use tokio::task::JoinSet;
 use tokio_util::sync::CancellationToken;
@@ -1298,13 +1298,7 @@ impl Service {
            )
            .await?;

-        let mut sync_position = (INITIAL_TERM, Lsn::INVALID);
-        for res in results.into_iter().flatten() {
-            let sk_position = (res.last_log_term, res.flush_lsn);
-            if sync_position < sk_position {
-                sync_position = sk_position;
-            }
-        }
+        let sync_position = Self::get_sync_position(&results)?;

        tracing::info!(
            %generation,
@@ -1598,4 +1592,36 @@ impl Service {

        Ok(())
    }
+
+    /// Get membership switch responses from all safekeepers and return the sync position.
+    ///
+    /// Sync position is a position equal or greater than the commit position.
+    /// It is guaranteed that all WAL entries with (last_log_term, flush_lsn)
+    /// greater than the sync position are not committed (= not on a quorum).
+    ///
+    /// Returns error if there is no quorum of successful responses.
+    fn get_sync_position(
+        responses: &[mgmt_api::Result<TimelineMembershipSwitchResponse>],
+    ) -> Result<(Term, Lsn), ApiError> {
+        let quorum_size = responses.len() / 2 + 1;
+
+        let mut wal_positions = responses
+            .iter()
+            .flatten()
+            .map(|res| (res.last_log_term, res.flush_lsn))
+            .collect::<Vec<_>>();
+
+        // Should be already checked if the responses are from tenant_timeline_set_membership_quorum.
+        if wal_positions.len() < quorum_size {
+            return Err(ApiError::InternalServerError(anyhow::anyhow!(
+                "not enough successful responses to get sync position: {}/{}",
+                wal_positions.len(),
+                quorum_size,
+            )));
+        }
+
+        wal_positions.sort();
+
+        Ok(wal_positions[quorum_size - 1])
+    }
 }
--- a/test_runner/fixtures/endpoint/http.py
+++ b/test_runner/fixtures/endpoint/http.py
@@ -78,20 +78,26 @@ class EndpointHttpClient(requests.Session):
        json: dict[str, str] = res.json()
        return json

-    def prewarm_lfc(self, from_endpoint_id: str | None = None):
+    def prewarm_lfc(self, from_endpoint_id: str | None = None) -> dict[str, str]:
        """
        Prewarm LFC cache from given endpoint and wait till it finishes or errors
        """
        params = {"from_endpoint": from_endpoint_id} if from_endpoint_id else dict()
        self.post(self.prewarm_url, params=params).raise_for_status()
-        self.prewarm_lfc_wait()
+        return self.prewarm_lfc_wait()

-    def prewarm_lfc_wait(self):
+    def cancel_prewarm_lfc(self):
+        """
+        Cancel LFC prewarm if any is ongoing
+        """
+        self.delete(self.prewarm_url).raise_for_status()
+
+    def prewarm_lfc_wait(self) -> dict[str, str]:
        """
        Wait till LFC prewarm returns with error or success.
        If prewarm was not requested before calling this function, it will error
        """
-        statuses = "failed", "completed", "skipped"
+        statuses = "failed", "completed", "skipped", "cancelled"

        def prewarmed():
            json = self.prewarm_lfc_status()
@@ -101,6 +107,7 @@ class EndpointHttpClient(requests.Session):
        wait_until(prewarmed, timeout=60)
        res = self.prewarm_lfc_status()
        assert res["status"] != "failed", res
+        return res

    def offload_lfc_status(self) -> dict[str, str]:
        res = self.get(self.offload_url)
@@ -108,29 +115,31 @@ class EndpointHttpClient(requests.Session):
        json: dict[str, str] = res.json()
        return json

-    def offload_lfc(self):
+    def offload_lfc(self) -> dict[str, str]:
        """
        Offload LFC cache to endpoint storage and wait till offload finishes or errors
        """
        self.post(self.offload_url).raise_for_status()
-        self.offload_lfc_wait()
+        return self.offload_lfc_wait()

-    def offload_lfc_wait(self):
+    def offload_lfc_wait(self) -> dict[str, str]:
        """
        Wait till LFC offload returns with error or success.
        If offload was not requested before calling this function, it will error
        """
+        statuses = "failed", "completed", "skipped"

        def offloaded():
            json = self.offload_lfc_status()
            status, err = json["status"], json.get("error")
-            assert status in ["failed", "completed"], f"{status}, {err=}"
+            assert status in statuses, f"{status}, {err=}"

        wait_until(offloaded, timeout=60)
        res = self.offload_lfc_status()
        assert res["status"] != "failed", res
+        return res

-    def promote(self, promote_spec: dict[str, Any], disconnect: bool = False):
+    def promote(self, promote_spec: dict[str, Any], disconnect: bool = False) -> dict[str, str]:
        url = f"http://localhost:{self.external_port}/promote"
        if disconnect:
            try:  # send first request to start promote and disconnect
--- a/test_runner/fixtures/neon_fixtures.py
+++ b/test_runner/fixtures/neon_fixtures.py
@@ -2313,6 +2313,7 @@ class NeonStorageController(MetricsGetter, LogUtils):
        timeline_id: TimelineId,
        new_sk_set: list[int],
    ):
+        log.info(f"migrate_safekeepers({tenant_id}, {timeline_id}, {new_sk_set})")
        response = self.request(
            "POST",
            f"{self.api}/v1/tenant/{tenant_id}/timeline/{timeline_id}/safekeeper_migrate",
--- a/test_runner/regress/test_lfc_prewarm.py
+++ b/test_runner/regress/test_lfc_prewarm.py
@@ -1,6 +1,6 @@
 import random
-import threading
 from enum import StrEnum
+from threading import Thread
 from time import sleep
 from typing import Any

@@ -47,19 +47,23 @@ def offload_lfc(method: PrewarmMethod, client: EndpointHttpClient, cur: Cursor)
        # With autoprewarm, we need to be sure LFC was offloaded after all writes
        # finish, so we sleep. Otherwise we'll have less prewarmed pages than we want
        sleep(AUTOOFFLOAD_INTERVAL_SECS)
-        client.offload_lfc_wait()
-        return
+        offload_res = client.offload_lfc_wait()
+        log.info(offload_res)
+        return offload_res

    if method == PrewarmMethod.COMPUTE_CTL:
        status = client.prewarm_lfc_status()
        assert status["status"] == "not_prewarmed"
        assert "error" not in status
-        client.offload_lfc()
+        offload_res = client.offload_lfc()
+        log.info(offload_res)
        assert client.prewarm_lfc_status()["status"] == "not_prewarmed"
+
        parsed = prom_parse(client)
        desired = {OFFLOAD_LABEL: 1, PREWARM_LABEL: 0, OFFLOAD_ERR_LABEL: 0, PREWARM_ERR_LABEL: 0}
        assert parsed == desired, f"{parsed=} != {desired=}"
-        return
+
+        return offload_res

    raise AssertionError(f"{method} not in PrewarmMethod")

@@ -68,21 +72,30 @@ def prewarm_endpoint(
    method: PrewarmMethod, client: EndpointHttpClient, cur: Cursor, lfc_state: str | None
 ):
    if method == PrewarmMethod.AUTOPREWARM:
-        client.prewarm_lfc_wait()
+        prewarm_res = client.prewarm_lfc_wait()
+        log.info(prewarm_res)
    elif method == PrewarmMethod.COMPUTE_CTL:
-        client.prewarm_lfc()
+        prewarm_res = client.prewarm_lfc()
+        log.info(prewarm_res)
+        return prewarm_res
    elif method == PrewarmMethod.POSTGRES:
        cur.execute("select neon.prewarm_local_cache(%s)", (lfc_state,))


-def check_prewarmed(
+def check_prewarmed_contains(
    method: PrewarmMethod, client: EndpointHttpClient, desired_status: dict[str, str | int]
 ):
    if method == PrewarmMethod.AUTOPREWARM:
-        assert client.prewarm_lfc_status() == desired_status
+        prewarm_status = client.prewarm_lfc_status()
+        for k in desired_status:
+            assert desired_status[k] == prewarm_status[k]
+
        assert prom_parse(client)[PREWARM_LABEL] == 1
    elif method == PrewarmMethod.COMPUTE_CTL:
-        assert client.prewarm_lfc_status() == desired_status
+        prewarm_status = client.prewarm_lfc_status()
+        for k in desired_status:
+            assert desired_status[k] == prewarm_status[k]
+
        desired = {OFFLOAD_LABEL: 0, PREWARM_LABEL: 1, PREWARM_ERR_LABEL: 0, OFFLOAD_ERR_LABEL: 0}
        assert prom_parse(client) == desired

@@ -149,9 +162,6 @@ def test_lfc_prewarm(neon_simple_env: NeonEnv, method: PrewarmMethod):
    log.info(f"Used LFC size: {lfc_used_pages}")
    pg_cur.execute("select * from neon.get_prewarm_info()")
    total, prewarmed, skipped, _ = pg_cur.fetchall()[0]
-    log.info(f"Prewarm info: {total=} {prewarmed=} {skipped=}")
-    progress = (prewarmed + skipped) * 100 // total
-    log.info(f"Prewarm progress: {progress}%")
    assert lfc_used_pages > 10000
    assert total > 0
    assert prewarmed > 0
@@ -161,7 +171,54 @@ def test_lfc_prewarm(neon_simple_env: NeonEnv, method: PrewarmMethod):
    assert lfc_cur.fetchall()[0][0] == n_records * (n_records + 1) / 2

    desired = {"status": "completed", "total": total, "prewarmed": prewarmed, "skipped": skipped}
-    check_prewarmed(method, client, desired)
+    check_prewarmed_contains(method, client, desired)
+
+
+@pytest.mark.skipif(not USE_LFC, reason="LFC is disabled, skipping")
+def test_lfc_prewarm_cancel(neon_simple_env: NeonEnv):
+    """
+    Test we can cancel LFC prewarm and prewarm successfully after
+    """
+    env = neon_simple_env
+    n_records = 1000000
+    cfg = [
+        "autovacuum = off",
+        "shared_buffers=1MB",
+        "neon.max_file_cache_size=1GB",
+        "neon.file_cache_size_limit=1GB",
+        "neon.file_cache_prewarm_limit=1000",
+    ]
+    endpoint = env.endpoints.create_start(branch_name="main", config_lines=cfg)
+
+    pg_conn = endpoint.connect()
+    pg_cur = pg_conn.cursor()
+    pg_cur.execute("create schema neon; create extension neon with schema neon")
+    pg_cur.execute("create database lfc")
+
+    lfc_conn = endpoint.connect(dbname="lfc")
+    lfc_cur = lfc_conn.cursor()
+    log.info(f"Inserting {n_records} rows")
+    lfc_cur.execute("create table t(pk integer primary key, payload text default repeat('?', 128))")
+    lfc_cur.execute(f"insert into t (pk) values (generate_series(1,{n_records}))")
+    log.info(f"Inserted {n_records} rows")
+
+    client = endpoint.http_client()
+    method = PrewarmMethod.COMPUTE_CTL
+    offload_lfc(method, client, pg_cur)
+
+    endpoint.stop()
+    endpoint.start()
+
+    thread = Thread(target=lambda: prewarm_endpoint(method, client, pg_cur, None))
+    thread.start()
+    # wait 2 seconds to ensure we cancel prewarm SQL query
+    sleep(2)
+    client.cancel_prewarm_lfc()
+    thread.join()
+    assert client.prewarm_lfc_status()["status"] == "cancelled"
+
+    prewarm_endpoint(method, client, pg_cur, None)
+    assert client.prewarm_lfc_status()["status"] == "completed"


@pytest.mark.skipif(not USE_LFC, reason="LFC is disabled, skipping")
@@ -178,9 +235,8 @@ def test_lfc_prewarm_empty(neon_simple_env: NeonEnv):
    cur = conn.cursor()
    cur.execute("create schema neon; create extension neon with schema neon")
    method = PrewarmMethod.COMPUTE_CTL
-    offload_lfc(method, client, cur)
-    prewarm_endpoint(method, client, cur, None)
-    assert client.prewarm_lfc_status()["status"] == "skipped"
+    assert offload_lfc(method, client, cur)["status"] == "skipped"
+    assert prewarm_endpoint(method, client, cur, None)["status"] == "skipped"


 # autoprewarm isn't needed as we prewarm manually
@@ -251,11 +307,11 @@ def test_lfc_prewarm_under_workload(neon_simple_env: NeonEnv, method: PrewarmMet

    workload_threads = []
    for _ in range(n_threads):
-        t = threading.Thread(target=workload)
+        t = Thread(target=workload)
        workload_threads.append(t)
        t.start()

-    prewarm_thread = threading.Thread(target=prewarm)
+    prewarm_thread = Thread(target=prewarm)
    prewarm_thread.start()

    def prewarmed():
--- a/test_runner/regress/test_safekeeper_migration.py
+++ b/test_runner/regress/test_safekeeper_migration.py
@@ -286,3 +286,177 @@ def test_sk_generation_aware_tombstones(neon_env_builder: NeonEnvBuilder):
    assert re.match(r".*Timeline .* deleted.*", exc.value.response.text)
    # The timeline should remain deleted.
    expect_deleted(second_sk)
+
+
+def test_safekeeper_migration_stale_timeline(neon_env_builder: NeonEnvBuilder):
+    """
+    Test that safekeeper migration handles stale timeline correctly by migrating to
+    a safekeeper with a stale timeline.
+    1. Check that we are waiting for the stale timeline to catch up with the commit lsn.
+       The migration might fail if there is no compute to advance the WAL.
+    2. Check that we rely on last_log_term (and not the current term) when waiting for the
+       sync_position on step 7.
+    3. Check that migration succeeds if the compute is running.
+    """
+    neon_env_builder.num_safekeepers = 2
+    neon_env_builder.storage_controller_config = {
+        "timelines_onto_safekeepers": True,
+        "timeline_safekeeper_count": 1,
+    }
+    env = neon_env_builder.init_start()
+    env.pageserver.allowed_errors.extend(PAGESERVER_ALLOWED_ERRORS)
+    env.storage_controller.allowed_errors.append(".*not enough successful .* to reach quorum.*")
+
+    mconf = env.storage_controller.timeline_locate(env.initial_tenant, env.initial_timeline)
+
+    active_sk = env.get_safekeeper(mconf["sk_set"][0])
+    other_sk = [sk for sk in env.safekeepers if sk.id != active_sk.id][0]
+
+    ep = env.endpoints.create("main", tenant_id=env.initial_tenant)
+    ep.start(safekeeper_generation=1, safekeepers=[active_sk.id])
+    ep.safe_psql("CREATE TABLE t(a int)")
+    ep.safe_psql("INSERT INTO t VALUES (0)")
+
+    # Pull the timeline to other_sk, so other_sk now has a "stale" timeline on it.
+    other_sk.pull_timeline([active_sk], env.initial_tenant, env.initial_timeline)
+
+    # Advance the WAL on active_sk.
+    ep.safe_psql("INSERT INTO t VALUES (1)")
+
+    # The test is more tricky if we have the same last_log_term but different term/flush_lsn.
+    # Stop the active_sk during the endpoint shutdown because otherwise compute_ctl runs
+    # sync_safekeepers and advances last_log_term on active_sk.
+    active_sk.stop()
+    ep.stop(mode="immediate")
+    active_sk.start()
+
+    active_sk_status = active_sk.http_client().timeline_status(
+        env.initial_tenant, env.initial_timeline
+    )
+    other_sk_status = other_sk.http_client().timeline_status(
+        env.initial_tenant, env.initial_timeline
+    )
+
+    # other_sk should have the same last_log_term, but a stale flush_lsn.
+    assert active_sk_status.last_log_term == other_sk_status.last_log_term
+    assert active_sk_status.flush_lsn > other_sk_status.flush_lsn
+
+    commit_lsn = active_sk_status.flush_lsn
+
+    # Bump the term on other_sk to make it higher than active_sk.
+    # This is to make sure we don't use current term instead of last_log_term in the algorithm.
+    other_sk.http_client().term_bump(
+        env.initial_tenant, env.initial_timeline, active_sk_status.term + 100
+    )
+
+    # TODO(diko): now it fails because the timeline on other_sk is stale and there is no compute
+    # to catch up it with active_sk. It might be fixed in https://databricks.atlassian.net/browse/LKB-946
+    # if we delete stale timelines before starting the migration.
+    # But the rest of the test is still valid: we should not lose committed WAL after the migration.
+    with pytest.raises(
+        StorageControllerApiException, match="not enough successful .* to reach quorum"
+    ):
+        env.storage_controller.migrate_safekeepers(
+            env.initial_tenant, env.initial_timeline, [other_sk.id]
+        )
+
+    mconf = env.storage_controller.timeline_locate(env.initial_tenant, env.initial_timeline)
+    assert mconf["new_sk_set"] == [other_sk.id]
+    assert mconf["sk_set"] == [active_sk.id]
+    assert mconf["generation"] == 2
+
+    # Start the endpoint, so it advances the WAL on other_sk.
+    ep.start(safekeeper_generation=2, safekeepers=[active_sk.id, other_sk.id])
+    # Now the migration should succeed.
+    env.storage_controller.migrate_safekeepers(
+        env.initial_tenant, env.initial_timeline, [other_sk.id]
+    )
+
+    # Check that we didn't lose committed WAL.
+    assert (
+        other_sk.http_client().timeline_status(env.initial_tenant, env.initial_timeline).flush_lsn
+        >= commit_lsn
+    )
+    assert ep.safe_psql("SELECT * FROM t") == [(0,), (1,)]
+
+
+def test_pull_from_most_advanced_sk(neon_env_builder: NeonEnvBuilder):
+    """
+    Test that we pull the timeline from the most advanced safekeeper during the
+    migration and do not lose committed WAL.
+    """
+    neon_env_builder.num_safekeepers = 4
+    neon_env_builder.storage_controller_config = {
+        "timelines_onto_safekeepers": True,
+        "timeline_safekeeper_count": 3,
+    }
+    env = neon_env_builder.init_start()
+    env.pageserver.allowed_errors.extend(PAGESERVER_ALLOWED_ERRORS)
+
+    mconf = env.storage_controller.timeline_locate(env.initial_tenant, env.initial_timeline)
+
+    sk_set = mconf["sk_set"]
+    assert len(sk_set) == 3
+
+    other_sk = [sk.id for sk in env.safekeepers if sk.id not in sk_set][0]
+
+    ep = env.endpoints.create("main", tenant_id=env.initial_tenant)
+    ep.start(safekeeper_generation=1, safekeepers=sk_set)
+    ep.safe_psql("CREATE TABLE t(a int)")
+    ep.safe_psql("INSERT INTO t VALUES (0)")
+
+    # Stop one sk, so we have a lagging WAL on it.
+    env.get_safekeeper(sk_set[0]).stop()
+    # Advance the WAL on the other sks.
+    ep.safe_psql("INSERT INTO t VALUES (1)")
+
+    # Stop other sks to make sure compute_ctl doesn't advance the last_log_term on them during shutdown.
+    for sk_id in sk_set[1:]:
+        env.get_safekeeper(sk_id).stop()
+    ep.stop(mode="immediate")
+    for sk_id in sk_set:
+        env.get_safekeeper(sk_id).start()
+
+    # Bump the term on the lagging sk to make sure we don't use it to choose the most advanced sk.
+    env.get_safekeeper(sk_set[0]).http_client().term_bump(
+        env.initial_tenant, env.initial_timeline, 100
+    )
+
+    def get_commit_lsn(sk_set: list[int]):
+        flush_lsns = []
+        last_log_terms = []
+        for sk_id in sk_set:
+            sk = env.get_safekeeper(sk_id)
+            status = sk.http_client().timeline_status(env.initial_tenant, env.initial_timeline)
+            flush_lsns.append(status.flush_lsn)
+            last_log_terms.append(status.last_log_term)
+
+        # In this test we assume that all sks have the same last_log_term.
+        assert len(set(last_log_terms)) == 1
+
+        flush_lsns.sort(reverse=True)
+        commit_lsn = flush_lsns[len(sk_set) // 2]
+
+        log.info(f"sk_set: {sk_set}, flush_lsns: {flush_lsns}, commit_lsn: {commit_lsn}")
+        return commit_lsn
+
+    commit_lsn_before_migration = get_commit_lsn(sk_set)
+
+    # Make two migrations, so the lagging sk stays in the sk_set, but other sks are replaced.
+    new_sk_set1 = [sk_set[0], sk_set[1], other_sk]  # remove sk_set[2], add other_sk
+    new_sk_set2 = [sk_set[0], other_sk, sk_set[2]]  # remove sk_set[1], add sk_set[2] back
+    env.storage_controller.migrate_safekeepers(
+        env.initial_tenant, env.initial_timeline, new_sk_set1
+    )
+    env.storage_controller.migrate_safekeepers(
+        env.initial_tenant, env.initial_timeline, new_sk_set2
+    )
+
+    commit_lsn_after_migration = get_commit_lsn(new_sk_set2)
+
+    # We should not lose committed WAL.
+    # If we have choosen the lagging sk to pull the timeline from, this might fail.
+    assert commit_lsn_before_migration <= commit_lsn_after_migration
+
+    ep.start(safekeeper_generation=5, safekeepers=new_sk_set2)
+    assert ep.safe_psql("SELECT * FROM t") == [(0,), (1,)]
--- a/vendor/postgres-v14
+++ b/vendor/postgres-v14
--- a/vendor/postgres-v15
+++ b/vendor/postgres-v15
--- a/vendor/postgres-v16
+++ b/vendor/postgres-v16
--- a/vendor/postgres-v17
+++ b/vendor/postgres-v17
--- a/vendor/revisions.json
+++ b/vendor/revisions.json
@@ -1,18 +1,18 @@
 {
  "v17": [
    "17.5",
-    "fa1788475e3146cc9c7c6a1b74f48fd296898fcd"
+    "1e01fcea2a6b38180021aa83e0051d95286d9096"
  ],
  "v16": [
    "16.9",
-    "9b9cb4b3e33347aea8f61e606bb6569979516de5"
+    "a42351fcd41ea01edede1daed65f651e838988fc"
  ],
  "v15": [
    "15.13",
-    "aaaeff2550d5deba58847f112af9b98fa3a58b00"
+    "2aaab3bb4a13557aae05bb2ae0ef0a132d0c4f85"
  ],
  "v14": [
    "14.18",
-    "c9f9fdd0113b52c0bd535afdb09d3a543aeee25f"
+    "2155cb165d05f617eb2c8ad7e43367189b627703"
  ]
 }
Author	SHA1	Message	Date
Vlad Lazar	89231e3f99	storcon: squash all migrations into one Problem Neon and Hadron deployments have the same database schema, but different migration histories. Some transactions have different identifiers too. If we don't do anything about it, then the storage controller would fail to apply the merged set of transactions. Summary of Changes We squash all migrations into a single one. If the schema already matches, then the new transaction gets applie without doing anything. For new regions, this migration will bootstrap the database schema. This should be merged in both neon and hadron codebases. Note that after deploying this change, the `__diesel_schema_migrations` table will still contain entries for the old pre-squash transactions. This is fine because diesel only considers the tranasactions embedded in the repo for application. Once we are certain that we are not going to roll back, we can clean up the `__diesel_schema_migrations` tables in prod, but this is manual and error prone, so I'd skip it. Rolling back Rolling back to a previous deployment which embeds the non-squashed transactions is safe. The old transactions are still present in `__diesel_schema_migrations` (i.e. considered applied), so no migrations will be run, so no migrations will be run. Note that this assumes that all transactions are applied and squashed into the new migration before deployment.	2025-07-31 13:39:27 +01:00
Alexey Kondratov	8fe7596120	chore(compute_tools): Delete unused anon_ext_fn_reassign.sql (#12787 ) It's an anon v1 failed launch artifact, I suppose.	2025-07-31 10:11:30 +00:00
Krzysztof Szafrański	f3ee6e818d	[proxy] Correctly classify ConnectErrors (#12793 ) As is, e.g. quota errors on wake compute are logged as "compute" errors.	2025-07-31 09:53:48 +00:00
Dmitrii Kovalkov	edd60730c8	safekeeper: use last_log_term in mconf switch + choose most advanced sk in pull timeline (#12778 ) ## Problem I discovered two bugs corresponding to safekeeper migration, which together might lead to a data loss during the migration. The second bug is from a hadron patch and might lead to a data loss during the safekeeper restore in hadron as well. 1. `switch_membership` returns the current `term` instead of `last_log_term`. It is used to choose the `sync_position` in the algorithm, so we might choose the wrong one and break the correctness guarantees. 2. The current `term` is used to choose the most advanced SK in `pull_timeline` with higher priority than `flush_lsn`. It is incorrect because the most advanced safekeeper is the one with the highest `(last_log_term, flush_lsn)` pair. The compute might bump term on the least advanced sk, making it the best choice to pull from, and thus making committed log entries "uncommitted" after `pull_timeline` Part of https://databricks.atlassian.net/browse/LKB-1017 ## Summary of changes - Return `last_log_term` in `switch_membership` - Use `(last_log_term, flush_lsn)` as a primary key for choosing the most advanced sk in `pull_timeline` and deny pulling if the `max_term` is higher than on the most advanced sk (hadron only) - Write tests for both cases - Retry `sync_safekeepers` in `compute_ctl` - Take into the account the quorum size when calculating `sync_position`	2025-07-31 09:29:25 +00:00
Aleksandr Sarantsev	975b95f4cd	Introduce deletion API improvement RFC (#12484 ) ## Problem The deletion logic had become difficult to understand and maintain. ## Summary of changes - Added an RFC detailing proposed improvements to all deletion-related APIs. --------- Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-07-31 08:34:47 +00:00
Mikhail	01c39f378e	prewarm cancellation (#12785 ) Add DELETE /lfc/prewarm route which handles ongoing prewarm cancellation, update API spec, add prewarm Cancelled state Add offload Cancelled state when LFC is not initialized	2025-07-30 22:05:51 +00:00
Dimitri Fontaine	4d3b28bd2e	[Hadron] Always run databricks auth hook. (#12683 )	2025-07-30 21:34:30 +00:00
				`@@ -1,2 +0,0 @@`

				`ALTER TABLE tenant_shards add scheduling_policy VARCHAR NOT NULL DEFAULT '"Active"';`
				`@@ -1 +0,0 @@`
				`ALTER TABLE nodes DROP availability_zone_id;`
				`@@ -1 +0,0 @@`
				`ALTER TABLE nodes ADD availability_zone_id VARCHAR;`
				`@@ -1 +0,0 @@`
				`ALTER TABLE nodes ALTER availability_zone_id DROP NOT NULL;`
				`@@ -1 +0,0 @@`
				`ALTER TABLE tenant_shards DROP preferred_az_id;`
				`@@ -1 +0,0 @@`
				`ALTER TABLE tenant_shards ADD preferred_az_id VARCHAR;`
				`@@ -1 +0,0 @@`
				`ALTER TABLE safekeepers DROP scheduling_policy;`
				`@@ -1 +0,0 @@`
				`ALTER TABLE safekeepers ADD scheduling_policy VARCHAR NOT NULL DEFAULT 'disabled';`
				`@@ -1 +0,0 @@`
				`ALTER TABLE nodes ADD listen_https_port INTEGER;`