From 5ac3f86c962f655efd9902a6090ce2ccd17960b2 Mon Sep 17 00:00:00 2001
From: Arseny Sher
Date: Wed, 2 Aug 2023 17:05:36 +0300
Subject: [PATCH] Safekeeper peer recovery rfc.

---
 docs/rfcs/025-sk-peer-recovery.md | 71 +++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100644 docs/rfcs/025-sk-peer-recovery.md

diff --git a/docs/rfcs/025-sk-peer-recovery.md b/docs/rfcs/025-sk-peer-recovery.md
new file mode 100644
index 0000000000..1a297bb4db
--- /dev/null
+++ b/docs/rfcs/025-sk-peer-recovery.md
@@ -0,0 +1,71 @@
+# Safekeeper peer recovery
+
+## A problem and proposed solution
+
+Currently walproposer preserves WAL for all safekeepers: on start it
+determines the horizon beyond which all safekeepers have received WAL,
+downloads the missing part into pg_wal and keeps holding WAL from this
+horizon onwards. This is problematic, because
+
+1. If one safekeeper is down and/or lagging, pg_wal eventually explodes -- we
+   intentionally don't have much space on computes.
+2. If one safekeeper is down and/or lagging, compute start takes longer.
+
+The proposed solution is to teach safekeepers to fetch WAL directly from
+peers, respecting consensus rules. Namely,
+- On start, walproposer won't download WAL at all -- it will have WAL only
+  from its write position onwards. As WAL grows it should also keep some
+  fixed number of the latest segments (~20) to provide a gradual switch from
+  peer recovery to walproposer streaming under load; this can be achieved by
+  setting wal_keep_size or implemented separately.
+- Whenever a safekeeper discovers through peer communication that 1) it lacks
+  WAL compared to some peer and 2) walproposer streaming is not active, it
+  starts recovery. Recovery ends when either there is nothing more to fetch
+  or a streaming walproposer is discovered.
+
+## Details
+
+### Correctness
+
+The idea is simple: the recovery process imitates the actions of the leader
+of the donor's last_log_term (which, per condition 3 below, equals the
+donor's term). That is, sk A will fetch WAL from sk B if
+1) B's (last_log_term, LSN) is higher than A's (last_log_term, LSN),
+2) A's term <= B's term -- otherwise the append request can't be accepted,
+   *and*
+3) B's term == B's last_log_term -- to ensure that such a leader was ever
+   elected in the first place.
+
+Note that such a configuration is not always possible, e.g. in the scenario
+A 1.1
+B 1.1 2.1
+C 1.1 3.1
+
+where (x.y) is a (term, LSN) pair, if A has voted for term 4 and B and C
+haven't (their terms are 2 and 3 respectively), then A can't pull from either
+B or C. IOW, we need an elected authoritative leader to determine the correct
+log sequence. However, such a scenario is unlikely and will be fixed by
+walproposer voting once a walproposer appears, so we can ignore it for now
+and add elections on the safekeeper side later if needed.
+
+Just like a normal leader, recovery would first truncate divergent WAL and
+only then start inserting.
+
+### Start/stop criterion
+
+Recovery shouldn't get in the way of an actively streaming compute -- we
+don't skip records, so if recovery inserts something after a walproposer
+push, the next push will error out. OTOH, for better availability recovery
+should finish its job of aligning all safekeepers even if the compute is
+missing. So I propose to track on the safekeeper the existence of a streaming
+compute. Recovery should kick in if 1) there is something to pull and 2) a
+streaming compute doesn't exist. On each insert, compute presence is checked
+and recovery is terminated if one has appeared. It also terminates if there
+is nothing more to pull.
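+
+To make the start condition concrete, below is a minimal sketch of the
+decision, under the assumption of hypothetical types and names (`TermLsn`,
+`PeerStatus`, `LocalStatus`, `should_recover_from`) rather than the actual
+safekeeper code:
+
+```rust
+/// (last_log_term, LSN) pair; the derived Ord gives the lexicographic
+/// comparison used by the consensus rules.
+#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
+struct TermLsn {
+    last_log_term: u64,
+    lsn: u64,
+}
+
+/// What we learn about a peer through peer communication (hypothetical).
+struct PeerStatus {
+    term: u64,
+    flush: TermLsn,
+}
+
+/// Local safekeeper state relevant to the decision (hypothetical).
+struct LocalStatus {
+    term: u64,
+    flush: TermLsn,
+    /// Whether a streaming walproposer (compute) is currently connected.
+    streaming_compute_connected: bool,
+}
+
+/// Should this safekeeper start fetching WAL from `donor`?
+fn should_recover_from(me: &LocalStatus, donor: &PeerStatus) -> bool {
+    // Don't compete with an actively streaming compute.
+    if me.streaming_compute_connected {
+        return false;
+    }
+    // 1) the donor is ahead of us in (last_log_term, LSN) order,
+    donor.flush > me.flush
+        // 2) our term is not higher, otherwise append requests imitating
+        //    the donor's leader would be rejected,
+        && me.term <= donor.term
+        // 3) the donor's leader was actually elected: the donor's term
+        //    equals the term of its last log record.
+        && donor.term == donor.flush.last_log_term
+}
+
+fn main() {
+    let me = LocalStatus {
+        term: 3,
+        flush: TermLsn { last_log_term: 1, lsn: 0x100 },
+        streaming_compute_connected: false,
+    };
+    let donor = PeerStatus {
+        term: 3,
+        flush: TermLsn { last_log_term: 3, lsn: 0x500 },
+    };
+    assert!(should_recover_from(&me, &donor));
+}
+```
+
+Presumably the real implementation would evaluate this against all peers and
+pick a donor among those that qualify, e.g. the one with the highest
+(last_log_term, LSN), and re-check compute presence on every insert as
+described above.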
+
+This should be good enough, though not bulletproof: in theory we can imagine
+recovery regularly starting before streaming has started and inserting
+something after it. Such a loop is very unlikely though; we can add more
+heuristics if it shows up.
+
+## Alternatives
+
+An entirely different direction would be more granular WAL management on
+computes -- don't hold a lot, but download WAL and pass it on demand to stale
+safekeepers. It seems of comparable complexity, but writing Rust is more
+pleasant and less dependent on the postgres version.
\ No newline at end of file