A few SK changes (#12577)

# TLDR
This PR is a no-op.

## Problem
When an SK (safekeeper) loses a disk, it must recover all WALs from the very beginning. Catching up to the latest WALs for all timelines it owns may take days or weeks.

## Summary of changes
When an SK starts up and finds that it has 0 timelines:
- It asks the SC (storage controller) for the timelines it owns.
- It then pulls those timelines from its peer safekeepers to restore WAL redundancy right away.

After the timeline pull completes, the SK becomes active and accepts new WALs.

The current implementation is a prototype. It can be optimized further, e.g., by pulling timelines in parallel.

---------

Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
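The startup flow described in the summary can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code in this PR: the `StorageController` and `Peer` traits, the `Safekeeper` struct, and every method name are hypothetical stand-ins, and the real implementation is presumably async and considerably more involved.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the SC: reports which timelines an SK owns.
trait StorageController {
    fn timelines_owned_by(&self, sk_id: u64) -> Vec<String>;
}

/// Hypothetical stand-in for a peer safekeeper that can serve a timeline's WAL.
trait Peer {
    fn pull_timeline(&self, timeline: &str) -> Option<Vec<u8>>;
}

#[derive(Debug, PartialEq)]
enum State {
    Recovering,
    Active,
}

/// Hypothetical in-memory model of a safekeeper: timeline id -> WAL bytes.
struct Safekeeper {
    timelines: HashMap<String, Vec<u8>>,
    state: State,
}

impl Safekeeper {
    /// Start up; if no timelines exist locally, restore them from peers
    /// before becoming active.
    fn start(
        sk_id: u64,
        local: HashMap<String, Vec<u8>>,
        sc: &dyn StorageController,
        peers: &[&dyn Peer],
    ) -> Result<Self, String> {
        let mut sk = Safekeeper {
            timelines: local,
            state: State::Recovering,
        };
        if sk.timelines.is_empty() {
            // Ask the SC which timelines this SK owns, then pull each one
            // sequentially from the first peer that has it.
            for tl in sc.timelines_owned_by(sk_id) {
                let wal = peers
                    .iter()
                    .find_map(|p| p.pull_timeline(&tl))
                    .ok_or_else(|| format!("no peer has timeline {}", tl))?;
                sk.timelines.insert(tl, wal);
            }
        }
        // Only after the pull completes does the SK accept new WALs.
        sk.state = State::Active;
        Ok(sk)
    }
}
```

The sketch pulls timelines one at a time, matching the prototype described above; parallel pulls would replace the sequential loop.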
2025-12-22 21:59:59 +00:00 · 2025-07-14 09:37:04 -07:00
parent 2288efae66
commit f67a8a173e
14 changed files with 808 additions and 36 deletions
--- a/control_plane/storcon_cli/src/main.rs
+++ b/control_plane/storcon_cli/src/main.rs
@@ -476,6 +476,7 @@ async fn main() -> anyhow::Result<()> {
                        listen_http_port,
                        listen_https_port,
                        availability_zone_id: AvailabilityZone(availability_zone_id),
+                        node_ip_addr: None,
                    }),
                )
                .await?;