From c9d2b6119576d8bd5f98460ac249db468b51bc7a Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Fri, 2 Aug 2024 12:28:11 +0300
Subject: [PATCH] fix term uniqueness

---
 ...35-safekeeper-dynamic-membership-change.md | 51 ++++++++++---------
 1 file changed, 27 insertions(+), 24 deletions(-)

diff --git a/docs/rfcs/035-safekeeper-dynamic-membership-change.md b/docs/rfcs/035-safekeeper-dynamic-membership-change.md
index 88087270d6..ed831f1492 100644
--- a/docs/rfcs/035-safekeeper-dynamic-membership-change.md
+++ b/docs/rfcs/035-safekeeper-dynamic-membership-change.md
@@ -68,9 +68,9 @@ case of sk->wp message.
 
 Basic rule: once safekeeper observes configuration higher than his own it
 immediately switches to it. It must refuse all messages with lower generation
-that his. It also refuses messages if it is not member of the current
-generation, though it is likely not unsafe to process them (walproposer should
-ignore them anyway).
+that his. It also refuses messages if it is not member of the current generation
+(that is, of either `sk_set` of `sk_new_set`), though it is likely not unsafe to
+process them (walproposer should ignore result anyway).
 
 If there is non null configuration in `ProposerGreeting` and it is higher than
 current safekeeper one, safekeeper switches to it.
@@ -87,9 +87,9 @@ current one and ignores it otherwise. In any case it replies with
 ```
 struct ConfigurationSwitchResponse {
     conf: Configuration,
+    term: Term,
     last_log_term: Term,
     flush_lsn: Lsn,
-    term: Term, // not used by this RFC, but might be useful for observability
 }
 ```
 
@@ -142,29 +142,33 @@ storage are reachable.
    delivering them `joint_conf`. Collecting responses from majority is required
    to proceed. If any response returned generation higher than 
    `joint_conf.generation`, abort (another switch raced us). Otherwise, choose
-   max `<last_log_term, flush_lsn>` among responses and establish it as 
-   (in memory) `sync_position`. We can't finish switch until majority 
-   of the new set catches up to this position because data before it 
-   could be committed without ack from the new set.
-4) Initialize timeline on safekeeper(s) from `new_sk_set` where it 
+   max `<last_log_term, flush_lsn>` among responses and establish it as
+   (in memory) `sync_position`. Also choose max `term` and establish it as (in
+   memory) `sync_term`. We can't finish the switch until majority of the new set
+   catches up to this `sync_position` because data before it could be committed
+   without ack from the new set. Similarly, we'll bump term on new majority
+   to `sync_term` so that two computes with the same term are never elected.
+4) Initialize timeline on safekeeper(s) from `new_sk_set` where it
    doesn't exist yet by doing `pull_timeline` from the majority of the 
-   current set. Doing  that on majority of `new_sk_set` is enough to
+   current set. Doing that on majority of `new_sk_set` is enough to
    proceed, but it is reasonable to ensure that all `new_sk_set` members
    are initialized -- if some of them are down why are we migrating there?
-5) Call `PUT` `configuration` on safekeepers from the new set,
+5) Call `POST` `bump_term(sync_term)` on safekeepers from the new set. 
+   Success on majority is enough.
+6) Repeatedly call `PUT` `configuration` on safekeepers from the new set,
    delivering them `joint_conf` and collecting their positions. This will
    switch them to the `joint_conf` which generally won't be needed 
    because `pull_timeline` already includes it and plus additionally would be
-   broadcast by compute. More importantly, we may proceed to the next step 
+   broadcast by compute. More importantly, we may proceed to the next step
    only when `<last_log_term, flush_lsn>` on the majority of the new set reached 
-   `sync_position`. Similarly, on the happy path this is not needed because 
-   `pull_timeline` already includes it. However, it is better to double
+   `sync_position`. Similarly, on the happy path no waiting is not needed because 
+   `pull_timeline` already includes it. However, we should double
     check to be safe. For example, timeline could have been created earlier e.g.
-    manually or after try-to-migrate, abort, try-to-migrate-again sequence.   
-6) Create `new_conf: Configuration` incrementing `join_conf` generation and having new 
+    manually or after try-to-migrate, abort, try-to-migrate-again sequence. 
+7) Create `new_conf: Configuration` incrementing `join_conf` generation and having new 
    safekeeper set as `sk_set` and None `new_sk_set`. Write it to configuration 
    storage under one more CAS.
-7) Call `PUT` `configuration` on safekeepers from the new set,
+8) Call `PUT` `configuration` on safekeepers from the new set,
    delivering them `new_conf`. It is enough to deliver it to the majority 
    of the new set; the rest can be updated by compute.
 
@@ -469,13 +473,12 @@ implement proper migration before doing smarter timeline archival.
 
 ## Possible optimizations
 
-
-Algorithm suggested above forces walproposer re-election (technically restart)
-and thus reconnection to safekeepers; essentially we treat generation as part of
-term and don't allow leader to survive configuration change. It is possible to
-optimize this, but this is untrivial and I don't think needed. Reconnection is
-very fast and it is much more important to avoid compute restart than
-millisecond order of write stall.
+Steps above suggest walproposer restart (with re-election) and thus reconnection
+to safekeepers. Since by bumping term on new majority we ensure that leader
+terms are unique even across generation switches it is possible to preserve
+connections. However, it is more complicated, reconnection is very fast and it
+is much more important to avoid compute restart than millisecond order of write
+stall.
 
 Multiple joint consensus: algorithm above rejects attempt to change membership
 while another attempt is in progress. It is possible to overlay them and AFAIK