tests, rollout

2026-06-04 22:10:39 +00:00 · 2024-07-23 17:58:32 +03:00
parent 7b50c1a457
commit 4d1cf2dc6f
1 changed files with 62 additions and 32 deletions
--- a/docs/rfcs/035-safekeeper-dynamic-membership-change.md
+++ b/docs/rfcs/035-safekeeper-dynamic-membership-change.md
@@ -347,7 +347,7 @@ Returns scheduled request.

 Similar call should be added for the tenant.

-It would be great to have some way of subscribing to the results (appart from
+It would be great to have some way of subscribing to the results (apart from
 looking at logs/metrics).

 Migration is executed as described above. One subtlety is that (local) deletion on
@@ -367,6 +367,9 @@ which switches configurations. Similarly when timeline is fully deleted to
 prevent cplane operation from blocking when some safekeeper is not available
 deletion should be also registered.

+One more task pool should infinitely retry notifying control plane about changed
+safekeeper sets.
+
 3) GET `/control/v1/tenant/:tenant_id/timeline/:timeline_id/` should return
   current in memory state of the timeline and pending `MigrationRequest`,
   if any.
@@ -402,15 +405,19 @@ There should be following layers of tests:
   not be lost.

 3) Since simulation testing injects at relatively high level points (not
-   syscalls), it omits some code, in particular `pull_timeline`. Thus it 
-   is better to have basic tests covering whole system. Extended 
-   version of `test_restarts_under_load` would do. TBD
+   syscalls), it omits some code, in particular `pull_timeline`. Thus it is
+   better to have basic tests covering whole system as well. Extended version of
+   `test_restarts_under_load` would do: start background load and do migration 
+   under it, then restart endpoint and check that no reported commits 
+   had been lost. I'd also add one more creating classic network split scenario, with
+   one compute talking to AC and another to BD while migration from nodes ABC to ABD
+   happens.

-4) Basic e2e test should ensure that full flow including cplane notification works.
+4) Simple e2e test should ensure that full flow including cplane notification works.

 ## Order of implementation and rollout

-note that 
+Note that 
 - Control plane parts and integration with it is fully independent from everything else
  (tests would use simulation and neon_local).
 - There is a lot of infra work making storage_controller aware of timelines and safekeepers
@@ -418,40 +425,63 @@ note that
 - Initially walproposer can just stop working while it observers joint configuration.
  Such window would be typically very short anyway.

-TimelineCreateRequest should get optional safekeepers field with safekeepers chosen by cplane.
+To rollout smoothly, both walproposer and safekeeper should have flag
+`configurations_enabled`; when set to false, they would work as currently, i.e.
+walproposer is able to commit on whatever safekeeper set it is provided. Until
+all timelines are managed by storcon we'd need to use current script to migrate
+and update/drop entries in the storage_controller database if it has any.

-rough order:
- add sk infra, but not enforce confs
- change proto
- add wp proto, but not enforce confs
- implement storconn. It will be used and tested by neon_local.
- implement cplane/storcon integration. Route branch creation/deletion 
-  through storcon. Then we can test migration of these branches, hm.
-  In principle sk choice from cplane can be removed at this point. 
-  However, that would be bad because before import 1) 
-  storconn doesn't know about existing project so can't colocate tenants 
-  2) neither it knows about capacity. So we could instead allow to set sk 
-  set in the branch creation request.
-  These cplane -> storconn calls should be under feature flag; 
-  rollback is safe.
- finally import existing branches. Then we can drop cplane 
-  sk selection code.
-  also only at this point wp will always use generations and
-  so we can drop 'tli creation on connect'.
+Safekeepers would need to be able to talk both current and new protocol version
+with compute to reduce number of computes restarted in prod once v2 protocol is
+deployed (though before completely switching we'd need to force this).
+
+Let's have the following rollout order:
+- storage_controller becomes aware of safekeepers;
+- storage_controller gets timeline creation for new timelines and deletion requests, but
+  doesn't manage all timelines yet. Migration can be tested on these new timelines.
+  To keep control plane and storage_controller databases in sync while control 
+  plane still chooses the safekeepers initially (until all timelines are imported
+  it can choose better), `TimelineCreateRequest` can get optional safekeepers
+  field with safekeepers chosen by cplane.
+- Then we can import all existing timelines from control plane to
+  storage_controller and gradually enable configurations region by region.
+
+
+Very rough implementation order:
+- Add concept of configurations to safekeepers (including control file),
+  implement v3 protocol.
+- Implement walproposer changes, including protocol.
+- Implement storconn part. Use it in neon_local (and pytest).
+- Make cplane store safekeepers per timeline instead of per tenant.
+- Implement cplane/storcon integration. Route branch creation/deletion 
+  through storcon. Then we can test migration of new branches.
+- Finally import existing branches. Then we can drop cplane 
+  safekeeper selection code. Gradually enable configurations at 
+  computes and safekeepers. Before that, all computes must talk only
+  v3 protocol version.

 ## Integration with evicted timelines

+Currently, `pull_timeline` doesn't work correctly with evicted timelines because
+copy would point to original partial file. To fix let's just do s3 copy of the
+file. It is a bit stupid as generally unnecessary work, but it makes sense to
+implement proper migration before doing smarter timeline archival.
+
 ## Possible optimizations

-`AcceptorRefusal` separate message

-Preserving connections (not needed)
+Algorithm suggested above forces walproposer re-election (technically restart)
+and thus reconnection to safekeepers; essentially we treat generation as part of
+term and don't allow leader to survive configuration change. It is possible to
+optimize this, but this is untrivial and I don't think needed. Reconnection is
+very fast and it is much more important to avoid compute restart than
+millisecond order of write stall.

-multiple joint consensus (not needed)
+Multiple joint consensus: algorithm above rejects attempt to change membership
+while another attempt is in progress. It is possible to overlay them and AFAIK
+Aurora does this but similarly I don't think this is needed.

 ## Misc

-We should use Compute <-> safekeeper protocol change to include other (long
-yearned) modifications:
- network order
-
+We should use Compute <-> safekeeper protocol change to include another long
+yearned modifications: send data in network order to make arm work.