From ef3e75abc3caf58555c7477f39119fe1c6300cac Mon Sep 17 00:00:00 2001
From: Christian Schwarz <christian@neon.tech>
Date: Fri, 1 Sep 2023 19:10:44 +0200
Subject: [PATCH] for #5029 (rfc tenant migrations): editorial fixes (#5185)

---
 docs/rfcs/027-pageserver-migration.md | 35 +++++++++++++++++----------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/docs/rfcs/027-pageserver-migration.md b/docs/rfcs/027-pageserver-migration.md
index 74a9a3a571..32a2178c2c 100644
--- a/docs/rfcs/027-pageserver-migration.md
+++ b/docs/rfcs/027-pageserver-migration.md
@@ -1,4 +1,4 @@
-# Fast tenant transfers for high availability
+# Seamless tenant migration
 
 - Author: john@neon.tech
 - Created on 2023-08-11
@@ -7,15 +7,15 @@
 ## Summary
 
 The preceding [generation numbers RFC](025-generation-numbers.md) may be thought of as "making tenant
-transfers safe". Following that,
-this RFC is about how those transfers are to be done:
+migration safe". Following that,
+this RFC is about how those migrations are to be done:
 
 1. Seamlessly (without interruption to client availability)
 2. Quickly (enabling faster operations)
 3. Efficiently (minimizing I/O and $ cost)
 
 These points are in priority order: if we have to sacrifice
-efficiency to make a transfer seamless for clients, we will
+efficiency to make a migration seamless for clients, we will
 do so, etc.
 
 This is accomplished by introducing two high level changes:
@@ -36,9 +36,18 @@ at scale, in several contexts:
    database and they need to migrate to a pageserver with more capacity.
 3. Restarting pageservers for upgrades and maintenance
 
-Currently, a tenant may migrated by attaching to a new node,
-re-configuring endpoints to use the new node, and then later detaching from the old node. This is safe once [generation numbers](025-generation-numbers.md) are implemented, but does meet
-our seamless/fast/efficient goals:
+The current steps for migration are:
+* detach from old node; skip if old node is dead (the [skip part is still WIP](https://github.com/neondatabase/cloud/issues/5426)).
+* attach to new node
+* re-configure endpoints to use the new node
+
+Once [generation numbers](025-generation-numbers.md) are implemented,
+the detach step is no longer critical for correctness. So, we can:
+* attach to a new node,
+* re-configure endpoints to use the new node, and then
+* detach from the old node.
+
+However, this still does not meet our seamless/fast/efficient goals:
 
 - Not fast: The new node will have to download potentially large amounts
   of data from S3, which may take many minutes.
@@ -54,7 +63,7 @@ The user expectations for availability are:
 - For unplanned changes (e.g. node failures), there should be
   minimal availability gap.
 
-## Non Goals (if relevant)
+## Non Goals
 
 - We do not aim to have the pageservers fail over if the
   control plane is unavailable.
@@ -63,7 +72,7 @@ The user expectations for availability are:
   page cache usually contains such pages, we do not expect
   them to be read frequently from the pageserver).
 
-## Impacted components (e.g. pageserver, safekeeper, console, etc)
+## Impacted components
 
 Pageserver, control plane
 
@@ -81,7 +90,7 @@ Pageserver, control plane
 
 ## Implementation (high level)
 
-### Secondary locations
+### Warm secondary locations
 
 To enable faster migrations, we will identify at least one _secondary location_
 for each tenant. This secondary location will keep a warm cache of layers
@@ -149,8 +158,8 @@ The following table summarizes how the state of the system advances:
 This procedure readily applies to other migration cases:
 
 - **Node failures**: if node A is unavailable, then all calls into
-  node A are simply skipped, and when waiting for node B LSN to catch
-  up, we may proceed immediately.
+  node A are skipped and we don't wait for B to catch up before
+  updating the endpoints to use B.
 - **Migration without a secondary location**: if node A is initially
   in Detached state, the procedure is idential, but waiting for Node B
   to download layers and catch up with WAL will take much longer.
@@ -437,7 +446,7 @@ attachment on the secondary node.
 The downside to this approach is a potentially large gap in readability of
 recent LSNs while loading data onto the new node. To avoid this, it is worthwhile
 to incur the extra cost of double-replaying the WAL onto old and new nodes' local
-storage during a transfer.
+storage during a migration.
 
 ### Peer-to-peer pageserver communication