+
+A few databases are stored in one chunk, replicated three times
+
+- When a database can't fit into one storage node, it can occupy many chunks that were split off while the database was growing. Chunk placement on nodes is controlled by us with some automation, but we can always move chunks around the cluster manually.
+
+
+
+Here one big database occupies two sets of nodes. Some chunks were also moved around to restore the replication factor after a disk failure. In this case we also have "sharded" storage for a big database and issue WAL writes to different chunks in parallel.
+
+## **Chunk placement strategies**
+
+There are a few scenarios where we may want to move chunks around the cluster:
+
+- disk usage on some node is too high
+- some disk experienced a failure
+- some node experienced a failure or needs maintenance
+
+## **Chunk replication**
+
+Chunk replication may be done by cloning page ranges up to some LSN from peer nodes, updating global metadata, waiting for new WAL to arrive, replaying the previous WAL, and becoming online -- more or less like during a chunk split.
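+
+A minimal sketch of that sequence, assuming hypothetical `PeerNode`/`MetadataService` interfaces (none of these names exist in the codebase; this is only to make the ordering of steps explicit):
+
+```rust
+// Hypothetical sketch of the chunk replication sequence; all types and
+// methods here are placeholders, not real zenith APIs.
+#[derive(Clone, Copy)]
+struct Lsn(u64);
+
+trait PeerNode {
+    // Clone page ranges of `chunk_id` up to `up_to` from a healthy replica.
+    fn clone_page_range(&self, chunk_id: u64, up_to: Lsn) -> Vec<u8>;
+}
+
+trait MetadataService {
+    fn register_replica(&self, chunk_id: u64, node_id: u64, base_lsn: Lsn);
+    fn mark_online(&self, chunk_id: u64, node_id: u64);
+}
+
+fn replicate_chunk(
+    chunk_id: u64,
+    node_id: u64,
+    up_to: Lsn,
+    peer: &dyn PeerNode,
+    meta: &dyn MetadataService,
+) {
+    // 1. Clone page ranges up to `up_to` from a peer node.
+    let _pages = peer.clone_page_range(chunk_id, up_to);
+    // 2. Update global metadata so new WAL for this chunk is also routed here.
+    meta.register_replica(chunk_id, node_id, up_to);
+    // 3./4. Wait for WAL to arrive, replay everything past `up_to` (elided),
+    //       then become online -- more or less as during a chunk split.
+    meta.mark_online(chunk_id, node_id);
+}
+```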
+
diff --git a/docs/rfcs/003-laptop-cli.md b/docs/rfcs/003-laptop-cli.md
new file mode 100644
index 0000000000..4d1f0a68f0
--- /dev/null
+++ b/docs/rfcs/003-laptop-cli.md
@@ -0,0 +1,267 @@
+# Command line interface (end-user)
+
+The Zenith CLI as described here mostly resides at the same conceptual level as pg_ctl/initdb/pg_receivexlog/etc. and replaces some of them in an opinionated way. I would also suggest bundling our patched postgres inside the zenith distribution, at least at the start.
+
+This proposal is focused on managing local installations. For cluster operations, different tooling would be needed. The point of integration between the two is the storage URL: no matter how complex the cluster setup is, it can provide an endpoint where the user may push snapshots.
+
+The most important concept here is the snapshot, which can be created/pushed/pulled/exported. Also, we may start a temporary read-only postgres instance over any local snapshot. More complex scenarios consist of several basic operations over snapshots.
+
+# Possible usage scenarios
+
+## Install zenith, run a postgres
+
+```
+> brew install pg-zenith
+> zenith pg create # creates pgdata with default pattern pgdata$i
+> zenith pg list
+ID PGDATA USED STORAGE ENDPOINT
+primary1 pgdata1 0G zenith-local localhost:5432
+```
+
+## Import standalone postgres to zenith
+
+```
+> zenith snapshot import --from=basebackup://replication@localhost:5432/ oldpg
+[====================------------] 60% | 20MB/s
+> zenith snapshot list
+ID SIZE PARENT
+oldpg 5G -
+
+> zenith pg create --snapshot oldpg
+Started postgres on localhost:5432
+
+> zenith pg list
+ID PGDATA USED STORAGE ENDPOINT
+primary1 pgdata1 5G zenith-local localhost:5432
+
+> zenith snapshot destroy oldpg
+Ok
+```
+
+Also, we may start the snapshot import implicitly by looking at the snapshot URL scheme:
+
+```
+> zenith pg create --snapshot basebackup://replication@localhost:5432/
+Downloading snapshot... Done.
+Started postgres on localhost:5432
+Destroying snapshot... Done.
+```
+
+## Pull snapshot with some publicly shared database
+
+Since we may export a whole snapshot as one big file (a tar of a basebackup, maybe with some manifest), it can be shared by conventional means: http, ssh, [git+lfs](https://docs.github.com/en/github/managing-large-files/about-git-large-file-storage).
+
+```
+> zenith pg create --snapshot http://learn-postgres.com/movies_db.zenith movies
+```
+
+## Create snapshot and push it to the cloud
+
+```
+> zenith snapshot create pgdata1@snap1
+> zenith snapshot push --to ssh://stas@zenith.tech pgdata1@snap1
+```
+
+## Roll back the database to a snapshot
+
+One way to roll back the database is just to init a new database from the snapshot and destroy the old one. But creating a new database from a snapshot would require a copy of that snapshot, which is a time-consuming operation. Another option that would be nice to support is the ability to create a copy-on-write database on top of the snapshot without copying data, storing updated pages in a separate location; however, that approach would have performance implications. So to properly roll back the database to an older state we have `zenith pg checkout`.
+
+```
+> zenith pg list
+ID PGDATA USED STORAGE ENDPOINT
+primary1 pgdata1 5G zenith-local localhost:5432
+
+> zenith snapshot create pgdata1@snap1
+
+> zenith snapshot list
+ID SIZE PARENT
+oldpg 5G -
+pgdata1@snap1 6G -
+pgdata1@CURRENT 6G -
+
+> zenith pg checkout pgdata1@snap1
+Stopping postgres on pgdata1.
+Rolling back pgdata1@CURRENT to pgdata1@snap1.
+Starting postgres on pgdata1.
+
+> zenith snapshot list
+ID SIZE PARENT
+oldpg 5G -
+pgdata1@snap1 6G -
+pgdata1@HEAD{0} 6G -
+pgdata1@CURRENT 6G -
+```
+
+Some notes: pgdata1@CURRENT is an implicit snapshot representing the current state of the database in the data directory. When we check out some snapshot, CURRENT will be set to that snapshot and the old CURRENT state will be named HEAD{0} (0 is the number of the postgres timeline; it is incremented after each such checkout).
+
+## Configure a PITR (Point-In-Time Recovery) area
+
+A PITR area acts like a continuous snapshot: you can reset the database to any point in time within this area (by area I mean some TTL period or some size limit, both possibly infinite).
+
+```
+> zenith pitr create --storage s3tank --ttl 30d --name pitr_last_month
+```
+
+Resetting the database to some state in the past would then require creating a snapshot at some LSN / time in this PITR area.
+
+# Manual
+
+## storage
+
+Storage is either a zenith pagestore or S3. Users may create a database in a pagestore and create/move *snapshots* and *PITR regions* in both a pagestore and S3. Storage is a concept similar to a `git remote`. After installation, I imagine one local storage is available by default.
+
+**zenith storage attach** -t [native|s3] -c key=value -n name
+
+Attaches/initializes storage. For --type=s3, user credentials and a path should be provided. For --type=native we may support --path=/local/path and --url=zenith.tech/stas/mystore. Another possible term for native is 'zstore'.
+
+
+**zenith storage list**
+
+Shows the currently attached storages. For example:
+
+```
+> zenith storage list
+NAME USED TYPE OPTIONS PATH
+local 5.1G zenith-local /opt/zenith/store/local
+local.compr 20.4G zenith-local compression=on /opt/zenith/store/local.compr
+zcloud 60G zenith-remote zenith.tech/stas/mystore
+s3tank 80G S3
+```
+
+**zenith storage detach**
+
+**zenith storage show**
+
+
+
+## pg
+
+Manages postgres data directories and can start postgres instances with the proper configuration. An experienced user may avoid using it (except `pg create`) and configure/run postgres themselves.
+
+Pg is the term for a single postgres instance running on some data directory. I'm trying to avoid separating datadir management from postgres instance management here -- both concepts are bundled together.
+
+**zenith pg create** [--no-start --snapshot --cow] -s storage-name -n pgdata
+
+Creates (initializes) a new data directory in the given storage and starts postgres. I imagine that the storage for this operation may only be local, and data movement to a remote location happens through snapshots/PITR.
+
+--no-start: just init the datadir without starting postgres
+
+--snapshot snap: init from a snapshot. snap is a name or a URL (zenith.tech/stas/mystore/snap1)
+
+--cow: initialize a copy-on-write data directory on top of some snapshot (makes sense if it is a snapshot of a currently running database)
+
+**zenith pg destroy**
+
+**zenith pg start** [--replica] pgdata
+
+Starts postgres with the proper extensions preloaded/installed.
+
+**zenith pg checkout**
+
+Rolls back the data directory to some previous snapshot.
+
+**zenith pg stop** pg_id
+
+**zenith pg list**
+
+```
+ROLE PGDATA USED STORAGE ENDPOINT
+primary my_pg 5.1G local localhost:5432
+replica-1 localhost:5433
+replica-2 localhost:5434
+primary my_pg2 3.2G local.compr localhost:5435
+- my_pg3 9.2G local.compr -
+```
+
+**zenith pg show**
+
+```
+my_pg:
+ storage: local
+ space used on local: 5.1G
+ space used on all storages: 15.1G
+ snapshots:
+ on local:
+ snap1: 1G
+ snap2: 1G
+ on zcloud:
+ snap2: 1G
+ on s3tank:
+ snap5: 2G
+ pitr:
+ on s3tank:
+ pitr_one_month: 45G
+
+```
+
+**zenith pg start-rest/graphql** pgdata
+
+Starts a REST/GraphQL proxy on top of the postgres master. Not sure we should do that, just an idea.
+
+
+## snapshot
+
+Snapshot creation is cheap -- no actual data is copied, we just start retaining old pages. Snapshot size means the amount of retained data, not the full data size. A snapshot name looks like pgdata_name@tag_name, where tag_name is set by the user during snapshot creation. There are some reserved tag names: CURRENT represents the current state of the data directory; HEAD{i} represents the data directory state as it was before the i-th checkout.
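+
+To make the naming convention concrete, here is a small illustrative parser (a hypothetical helper, not part of the actual CLI) that splits `pgdata_name@tag_name` and recognizes the reserved tags:
+
+```rust
+// Illustrative only: parse "pgdata1@snap1" / "pgdata1@CURRENT" / "pgdata1@HEAD{0}".
+#[derive(Debug, PartialEq)]
+enum Tag {
+    Current,      // pgdata1@CURRENT
+    Head(u32),    // pgdata1@HEAD{i}
+    User(String), // pgdata1@snap1
+}
+
+fn parse_snapshot_name(name: &str) -> Option<(String, Tag)> {
+    let (pgdata, tag) = name.split_once('@')?;
+    let tag = if tag == "CURRENT" {
+        Tag::Current
+    } else if let Some(i) = tag.strip_prefix("HEAD{").and_then(|t| t.strip_suffix('}')) {
+        Tag::Head(i.parse().ok()?)
+    } else {
+        Tag::User(tag.to_string())
+    };
+    Some((pgdata.to_string(), tag))
+}
+
+fn main() {
+    assert_eq!(
+        parse_snapshot_name("pgdata1@HEAD{0}"),
+        Some(("pgdata1".to_string(), Tag::Head(0)))
+    );
+}
+```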
+
+**zenith snapshot create** pgdata_name@snap_name
+
+Creates a new snapshot in the same storage where pgdata_name exists.
+
+**zenith snapshot push** --to url pgdata_name@snap_name
+
+Produces a binary stream of a given snapshot. Under the hood it starts a temporary read-only postgres over this snapshot and sends a basebackup stream. The receiving side should start `zenith snapshot recv` before the push happens. If the url has a special scheme like zenith://, the receiving side may require auth and start `zenith snapshot recv` on the go.
+
+**zenith snapshot recv**
+
+Starts listening on a port for a basebackup stream, prints connection info to stdout (so that the user may use it in the push command), and expects data on that socket.
+
+**zenith snapshot pull** --from url or path
+
+Connects to a remote zenith/s3/file location and pulls a snapshot. The remote side should be a zenith service or files in our format.
+
+**zenith snapshot import** --from basebackup://<...> or path
+
+Creates a new snapshot out of a running postgres via the basebackup protocol, or from basebackup files.
+
+**zenith snapshot export**
+
+Starts a read-only postgres over this snapshot and exports data in some format (pg_dump, or COPY TO on some/all tables). One of the options may be zenith's own format, which is handy for us (but I think just a tar of a basebackup would be okay).
+
+**zenith snapshot diff** snap1 snap2
+
+Shows the size of data changed between two snapshots. We may also provide options to diff schema/data in tables. To do that, temporary read-only postgres instances are started.
+
+**zenith snapshot destroy**
+
+## pitr
+
+A pitr represents a WAL stream and a TTL policy for that stream.
+
+XXX: any suggestions on a better name?
+
+**zenith pitr create** name
+
+--ttl = inf | period
+
+--size-limit = inf | limit
+
+--storage = storage_name
+
+**zenith pitr extract-snapshot** pitr_name --lsn xxx
+
+Creates a snapshot at some LSN in the PITR area. The obtained snapshot may be managed with the snapshot routines (move/send/export).
+
+**zenith pitr gc** pitr_name
+
+Forces garbage collection on some PITR area.
+
+**zenith pitr list**
+
+**zenith pitr destroy**
+
+
+## console
+
+**zenith console**
+
+Opens a browser targeted at the web console, which has more or less the same functionality as described here.
diff --git a/docs/rfcs/004-durability.md b/docs/rfcs/004-durability.md
new file mode 100644
index 0000000000..4543be3dae
--- /dev/null
+++ b/docs/rfcs/004-durability.md
@@ -0,0 +1,218 @@
+Durability & Consensus
+======================
+
+When a transaction commits, a commit record is generated in the WAL.
+When do we consider the WAL record as durable, so that we can
+acknowledge the commit to the client and be reasonably certain that we
+will not lose the transaction?
+
+Zenith uses a group of WAL safekeeper nodes to hold the generated WAL.
+A WAL record is considered durable when it has been written to a
+majority of the WAL safekeeper nodes. In this document, I use 5
+safekeepers, because I have five fingers. A WAL record is then durable
+when at least 3 safekeepers have written it to disk.
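+
+To make the rule concrete, here is a small sketch (not actual zenith
+code) of how the primary could compute the durable point from the
+acknowledgments it has received; with 5 safekeepers it is simply the
+3rd-highest acknowledged LSN:
+
+```rust
+// Sketch only: the LSN that a majority (3 of 5) of safekeepers have durably
+// written, given the latest acked LSN reported by each safekeeper.
+fn durable_lsn(acked: &[u64]) -> Option<u64> {
+    let quorum = acked.len() / 2 + 1; // 3 for 5 safekeepers
+    let mut sorted = acked.to_vec();
+    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending
+    sorted.get(quorum - 1).copied()
+}
+
+fn main() {
+    // Three safekeepers are at LSN >= 120, so a commit record at LSN 120
+    // can be acknowledged to the client.
+    assert_eq!(durable_lsn(&[150, 120, 90, 130, 80]), Some(120));
+}
+```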
+
+First, assume that only one primary node can be running at a
+time. This can be achieved by Kubernetes or etcd or some
+cloud-provider specific facility, or we can implement it
+ourselves. These options are discussed in later chapters. For now,
+assume that there is a Magic STONITH Fairy that ensures that.
+
+In addition to the WAL safekeeper nodes, the WAL is archived in
+S3. WAL that has been archived to S3 can be removed from the
+safekeepers, so the safekeepers don't need a lot of disk space.
+
+
+ +----------------+
+ +-----> | WAL safekeeper |
+ | +----------------+
+ | +----------------+
+ +-----> | WAL safekeeper |
++------------+ | +----------------+
+| Primary | | +----------------+
+| Processing | ---------+-----> | WAL safekeeper |
+| Node | | +----------------+
++------------+ | +----------------+
+ \ +-----> | WAL safekeeper |
+ \ | +----------------+
+ \ | +----------------+
+ \ +-----> | WAL safekeeper |
+ \ +----------------+
+ \
+ \
+ \
+ \
+ \ +--------+
+ \ | |
+ +--> | S3 |
+ | |
+ +--------+
+
+
+Every WAL safekeeper holds a section of WAL, and a VCL (Volume Complete
+LSN, in Aurora's terminology) value.
+The WAL can be divided into three portions:
+
+
+ VCL LSN
+ | |
+ V V
+.................ccccccccccccccccccccXXXXXXXXXXXXXXXXXXXXXXX
+Archived WAL Completed WAL In-flight WAL
+
+
+Note that all the WAL kept in a safekeeper is one contiguous section.
+This is different from Aurora: In Aurora, there can be holes in the
+WAL, and there is a Gossip protocol to fill the holes. That could be
+implemented in the future, but let's keep it simple for now. WAL needs
+to be written to a safekeeper in order. However, during crash
+recovery, In-flight WAL that has already been stored in a safekeeper
+can be truncated or overwritten.
+
+The Archived WAL has already been stored in S3, and can be removed from
+the safekeeper.
+
+The Completed WAL has been written to at least three safekeepers. The
+algorithm ensures that it is not lost when at most two nodes fail at
+the same time.
+
+The In-flight WAL has been persisted in the safekeeper, but if a crash
+happens, it may still be overwritten or truncated.
+
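+The same three regions, written as a small classification helper (a
+sketch; the boundary variable names are made up for illustration):
+
+```rust
+// Sketch: classify an LSN against the archived-up-to point and the VCL
+// from the diagram above.
+#[derive(Debug, PartialEq)]
+enum WalRegion {
+    Archived,  // already in S3, may be removed from the safekeeper
+    Completed, // written to at least 3 of 5 safekeepers, will not be lost
+    InFlight,  // persisted here, but may still be truncated during recovery
+}
+
+fn classify(lsn: u64, archived_up_to: u64, vcl: u64) -> WalRegion {
+    if lsn <= archived_up_to {
+        WalRegion::Archived
+    } else if lsn <= vcl {
+        WalRegion::Completed
+    } else {
+        WalRegion::InFlight
+    }
+}
+
+fn main() {
+    assert_eq!(classify(150, 100, 200), WalRegion::Completed);
+    assert_eq!(classify(250, 100, 200), WalRegion::InFlight);
+}
+```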
+
+The VCL point is determined in the Primary. It is not strictly
+necessary to store it in the safekeepers, but it allows some
+optimizations and sanity checks and is probably generally useful for
+the system as a whole. The VCL values stored in the safekeepers can lag
+behind the VCL computed by the primary.
+
+
+Primary node Normal operation
+-----------------------------
+
+1. Generate some WAL.
+
+2. Send the WAL to all the safekeepers that you can reach.
+
+3. As soon as a quorum of safekeepers have acknowledged that they have
+   received and durably stored the WAL up to that LSN, update the local VCL
+ value in memory, and acknowledge commits to the clients.
+
+4. Send the new VCL to all the safekeepers that were part of the quorum.
+ (Optional)
+
+
+Primary Crash recovery
+----------------------
+
+When a new Primary node starts up, before it can generate any new WAL
+it needs to contact a majority of the WAL safekeepers to compute the
+VCL. Remember that there is a Magic STONITH fairy that ensures that
+only one node can be doing this at a time.
+
+1. Contact all WAL safekeepers. Find the Max((Epoch, LSN)) tuple among the ones you
+ can reach. This is the Winner safekeeper, and its LSN becomes the new VCL.
+
+2. Update the other safekeepers you can reach, by copying all the WAL
+ from the Winner, starting from each safekeeper's old VCL point. Any old
+ In-Flight WAL from previous Epoch is truncated away.
+
+3. Increment Epoch, and send the new Epoch to the quorum of
+ safekeepers. (This ensures that if any of the safekeepers that we
+ could not reach later come back online, they will be considered as
+ older than this in any future recovery)
+
+You can now start generating new WAL, starting from the newly-computed
+VCL.
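+
+A sketch of step 1 (illustrative, not actual code): the winner is simply
+the maximum (Epoch, LSN) tuple among the safekeepers we could reach,
+assuming we reached at least a quorum of them:
+
+```rust
+// Sketch: pick the winner safekeeper by Max((Epoch, LSN)); its LSN becomes
+// the new VCL. `states` holds (epoch, last_lsn) for each reachable safekeeper.
+fn compute_new_vcl(states: &[(u64, u64)]) -> Option<u64> {
+    states
+        .iter()
+        .max() // tuples compare by epoch first, then by LSN
+        .map(|&(_epoch, lsn)| lsn)
+}
+
+fn main() {
+    let states = [(4, 900), (5, 700), (5, 850)];
+    assert_eq!(compute_new_vcl(&states), Some(850)); // epoch 5, LSN 850 wins
+}
+```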
+
+Optimizations
+-------------
+
+As described, the Primary node sends all the WAL to all the WAL safekeepers. That
+can be a lot of network traffic. Instead of sending the WAL directly from Primary,
+some safekeepers can be daisy-chained off other safekeepers, or there can be a
+broadcast mechanism among them. There should still be a direct connection from
+each safekeeper to the Primary for the acknowledgments, though.
+
+Similarly, the responsibility for archiving WAL to S3 can be delegated to one of
+the safekeepers, to reduce the load on the primary.
+
+
+Magic STONITH fairy
+-------------------
+
+Now that we have a system that works as long as only one primary node is running at a time, how
+do we ensure that?
+
+1. Use etcd to grant a lease on a key. The primary node is only allowed to operate as primary
+ when it's holding a valid lease. If the primary node dies, the lease expires after a timeout
+ period, and a new node is allowed to become the primary.
+
+2. Use S3 to store the lease. S3's consistency guarantees are more lenient, so in theory you
+ cannot do this safely. In practice, it would probably be OK if you make the lease times and
+ timeouts long enough. This has the advantage that we don't need to introduce a new
+ component to the architecture.
+
+3. Use Raft or Paxos, with the WAL safekeepers acting as the Acceptors to form the quorum. The
+ next chapter describes this option.
+
+
+Built-in Paxos
+--------------
+
+The WAL safekeepers act as PAXOS Acceptors, and the Processing nodes
+as both Proposers and Learners.
+
+Each WAL safekeeper holds an Epoch value in addition to the VCL and
+the WAL. Each request by the primary to safekeep WAL is accompanied by
+an Epoch value. If a safekeeper receives a request with an Epoch that
+doesn't match its current Accepted Epoch, it must ignore (NACK) it.
+(In different Paxos papers, Epochs are called "terms" or "round
+numbers")
+
+When a node wants to become the primary, it generates a new Epoch
+value that is higher than any previously observed Epoch value, and
+globally unique.
+
+
+Accepted Epoch: 555 VCL LSN
+ | |
+ V V
+.................ccccccccccccccccccccXXXXXXXXXXXXXXXXXXXXXXX
+Archived WAL Completed WAL In-flight WAL
+
+
+Primary node startup:
+
+1. Contact all WAL safekeepers that you can reach (if you cannot
+ connect to a quorum of them, you can give up immediately). Find the
+ latest Epoch among them.
+
+2. Generate a new globally unique Epoch, greater than the latest Epoch
+ found in previous step.
+
+3. Send the new Epoch in a Prepare message to a quorum of
+ safekeepers. (PAXOS Prepare message)
+
+4. Each safekeeper responds with a Promise. If a safekeeper has
+ already made a promise with a higher Epoch, it doesn't respond (or
+ responds with a NACK). After making a promise, the safekeeper stops
+ responding to any write requests with earlier Epoch.
+
+5. Once you have received a majority of promises, you know that the
+ VCL cannot advance on the old Epoch anymore. This effectively kills
+ any old primary server.
+
+6. Find the highest written LSN among the quorum of safekeepers (these
+ can be included in the Promise messages already). This is the new
+ VCL. If a new node starts the election process after this point,
+ it will compute the same or higher VCL.
+
+7. Copy the WAL from the safekeeper with the highest LSN to the other
+ safekeepers in the quorum, using the new Epoch. (PAXOS Accept
+ phase)
+
+8. You can now start generating new WAL starting from the VCL. If
+ another process starts the election process after this point and
+ gains control of a majority of the safekeepers, we will no longer
+ be able to advance the VCL.
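+
+To make the safekeeper side of this concrete, here is a minimal,
+hypothetical sketch of the acceptor logic (message and field names are
+invented; this is not the actual protocol code): a safekeeper promises
+to the highest Epoch it has seen, and NACKs WAL writes whose Epoch does
+not match its promise.
+
+```rust
+// Hypothetical sketch of safekeeper-side acceptor behaviour.
+struct Safekeeper {
+    promised_epoch: u64, // highest Epoch this safekeeper has promised to
+    flushed_lsn: u64,    // highest LSN durably written here
+}
+
+enum Reply {
+    Promise { flushed_lsn: u64 },
+    Ack { flushed_lsn: u64 },
+    Nack,
+}
+
+impl Safekeeper {
+    // PAXOS Prepare (steps 3-4 above): promise to the new Epoch if it is
+    // higher than anything seen before, otherwise NACK.
+    fn handle_prepare(&mut self, epoch: u64) -> Reply {
+        if epoch > self.promised_epoch {
+            self.promised_epoch = epoch;
+            Reply::Promise { flushed_lsn: self.flushed_lsn }
+        } else {
+            Reply::Nack
+        }
+    }
+
+    // WAL write (PAXOS Accept): only accept writes carrying exactly the
+    // promised Epoch, so a deposed primary's Epoch can no longer advance
+    // the VCL.
+    fn handle_append(&mut self, epoch: u64, end_lsn: u64) -> Reply {
+        if epoch != self.promised_epoch {
+            return Reply::Nack;
+        }
+        self.flushed_lsn = self.flushed_lsn.max(end_lsn);
+        Reply::Ack { flushed_lsn: self.flushed_lsn }
+    }
+}
+
+fn main() {
+    let mut sk = Safekeeper { promised_epoch: 555, flushed_lsn: 1000 };
+    assert!(matches!(sk.handle_prepare(600), Reply::Promise { .. }));
+    assert!(matches!(sk.handle_append(555, 1100), Reply::Nack)); // stale Epoch
+}
+```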
+
diff --git a/docs/rfcs/005-zenith_local.md b/docs/rfcs/005-zenith_local.md
new file mode 100644
index 0000000000..7b078e9ec0
--- /dev/null
+++ b/docs/rfcs/005-zenith_local.md
@@ -0,0 +1,103 @@
+# Zenith local
+
+Here I list some objectives to keep in mind when discussing zenith-local design and a proposal that brings all components together. Your comments on both parts are very welcome.
+
+#### Why do we need it?
+- For distribution - this easy-to-use binary will help us build adoption among developers.
+- For internal use - to test all components together.
+
+In my understanding, we consider it to be just a mock-up version of zenith-cloud.
+> Question: How much should we care about durability and security issues for a local setup?
+
+
+#### Why is it better than a simple local postgres?
+
+- Easy one-line setup. As simple as `cargo install zenith && zenith start`
+
+- Quick and cheap creation of compute nodes over the same storage.
+> Question: How can we describe a use-case for this feature?
+
+- Zenith-local can work with S3 directly.
+
+- Push and pull images (snapshots) to remote S3 to exchange data with other users.
+
+- Quick and cheap snapshot checkouts to switch back and forth in the database history.
+> Question: Do we want it in the very first release? This feature seems quite complicated.
+
+#### Distribution:
+
+Ideally, just one binary that incorporates all elements we need.
+> Question: Let's discuss pros and cons of having a separate package with modified PostgreSQL.
+
+#### Components:
+
+- **zenith-CLI** - interface for end-users. Turns commands into REST requests and handles responses to show them in a user-friendly way.
+CLI proposal is here https://github.com/libzenith/rfcs/blob/003-laptop-cli.md/003-laptop-cli.md
+WIP code is here: https://github.com/libzenith/postgres/tree/main/pageserver/src/bin/cli
+
+- **zenith-console** - web UI with the same functionality as the CLI.
+>Note: not for the first release.
+
+- **zenith-local** - entrypoint. Service that starts all other components and handles REST API requests. See the REST API proposal below.
+ > Idea: spawn all other components as child processes, so that we could shut everything down by stopping zenith-local (see the sketch after this component list).
+
+- **zenith-pageserver** - consists of storage and a WAL-replaying service (a modified PG in the current implementation).
+> Question: Probably, for a local setup, we should be able to bypass the page store and interact directly with S3, to avoid double caching in shared buffers and the page server?
+
+WIP code is here: https://github.com/libzenith/postgres/tree/main/pageserver/src
+
+- **zenith-S3** - stores base images of the database and WAL in S3 object storage. Imports and exports images from/to zenith.
+> Question: How should it operate in a local setup? Will we manage it ourselves or ask the user to provide credentials for an existing S3 object storage (e.g. minio)?
+> Question: Do we use it together with the local page store, or are they interchangeable?
+
+WIP code is ???
+
+- **zenith-safekeeper** - receives WAL from postgres, stores it durably, and acknowledges to Postgres that the "sync" succeeded.
+> Question: How should it operate in a local setup? In my understanding it should push WAL directly to S3 (if we use it) or store all data locally (if we use local page storage). The latter option seems meaningless (extra overhead and no gain), but it is still useful for testing the system.
+
+WIP code is here: https://github.com/libzenith/postgres/tree/main/src/bin/safekeeper
+
+- **zenith-computenode** - bottomless PostgreSQL, ideally upstream, but for a start our modified version. Users can quickly create and destroy compute nodes and work with them as regular postgres databases.
+
+ WIP code is in main branch and here: https://github.com/libzenith/postgres/commits/compute_node
+
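+A minimal sketch of the child-process idea mentioned for zenith-local above (binary names and arguments are invented; this is not the actual implementation):
+
+```rust
+// Hypothetical sketch: zenith-local spawns its components as child processes,
+// so stopping zenith-local stops everything. Binary names/args are made up.
+use std::process::{Child, Command};
+
+fn spawn_components() -> std::io::Result<Vec<Child>> {
+    let components = [
+        ("zenith-pageserver", vec!["--listen", "127.0.0.1:6400"]),
+        ("zenith-safekeeper", vec!["--listen", "127.0.0.1:6500"]),
+    ];
+    components
+        .iter()
+        .map(|(bin, args)| Command::new(bin).args(args).spawn())
+        .collect()
+}
+
+fn shutdown(children: &mut [Child]) {
+    for child in children {
+        let _ = child.kill(); // ignore errors: the child may have already exited
+        let _ = child.wait();
+    }
+}
+
+fn main() -> std::io::Result<()> {
+    let mut children = spawn_components()?;
+    // ... serve the REST API on localhost:3000 until asked to stop ...
+    shutdown(&mut children);
+    Ok(())
+}
+```
+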
+#### REST API:
+
+Service endpoint: `http://localhost:3000`
+
+Resources:
+- /storages - Where data lives: zenith-pageserver or zenith-s3
+- /pgs - Postgres - zenith-computenode
+- /snapshots - snapshots **TODO**
+
+>Question: Do we want to extend this API to manage zenith components? I.e. start the page server, manage safekeepers, and so on? Or will they be hardcoded to just start once and for all?
+
+Methods and their mapping to CLI:
+
+- /storages - zenith-pageserver or zenith-s3
+
+CLI | REST API
+------------- | -------------
+storage attach -n name --type [native\s3] --path=[datadir\URL] | PUT -d { "name": "name", "type": "native", "path": "/tmp" } /storages
+storage detach -n name | DELETE /storages/:storage_name
+storage list | GET /storages
+storage show -n name | GET /storages/:storage_name
+
+
+- /pgs - zenith-computenode
+
+CLI | REST API
+------------- | -------------
+pg create -n name --s storage_name | PUT -d { "name": "name", "storage_name": "storage_name" } /pgs
+pg destroy -n name | DELETE /pgs/:pg_name
+pg start -n name --replica | POST -d {"action": "start", "is_replica":"replica"} /pgs/:pg_name/actions
+pg stop -n name | POST -d {"action": "stop"} /pgs/:pg_name/actions
+pg promote -n name | POST -d {"action": "promote"} /pgs/:pg_name/actions
+pg list | GET /pgs
+pg show -n name | GET /pgs/:pg_name
+
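+For illustration, the `pg create` row above could be exercised with a small client like this (assumes the service accepts the JSON body from the table and that the `reqwest` crate with its `blocking` feature is available; not part of the actual codebase):
+
+```rust
+// Hypothetical client for the mapping above: PUT /pgs with a JSON body.
+use reqwest::blocking::Client;
+
+fn main() -> Result<(), reqwest::Error> {
+    let resp = Client::new()
+        .put("http://localhost:3000/pgs")
+        .header("Content-Type", "application/json")
+        .body(r#"{ "name": "pg1", "storage_name": "local" }"#)
+        .send()?;
+    println!("pg create -> HTTP {}", resp.status());
+    Ok(())
+}
+```
+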
+- /snapshots **TODO**
+
+CLI | REST API
+------------- | -------------
+
diff --git a/docs/rfcs/006-laptop-cli-v2-CLI.md b/docs/rfcs/006-laptop-cli-v2-CLI.md
new file mode 100644
index 0000000000..a04536922a
--- /dev/null
+++ b/docs/rfcs/006-laptop-cli-v2-CLI.md
@@ -0,0 +1,64 @@
+The Zenith CLI allows you to operate database clusters (catalog clusters) and their commit history locally and in the cloud. Since ANSI calls them catalog clusters, and "cluster" is a loaded term in modern infrastructure, we will call them "catalogs".
+
+# CLI v2 (after chatting with Carl)
+
+Zenith introduces the notion of a repository.
+
+```bash
+zenith init
+zenith clone zenith://zenith.tech/piedpiper/northwind -- clones a repo to the northwind directory
+```
+
+Once you have a cluster catalog, you can explore it:
+
+```bash
+zenith log -- returns a list of commits
+zenith status -- shows whether there are changes in the catalog that can be committed
+zenith commit -- commits the changes and generates a new commit hash
+zenith branch experimental
+```
+
+```
+A(t=1, e=1) 1.1 1.2 1.3 1.4
+B(t=1, e=1) 1.1
+C(t=1, e=1) 1.1
+D(t=1, e=1) 1.1
+E(t=1, e=1) 1.1
+```
+
+2) P2 is elected by CDE in term 2, epochStartLsn is 2, and writes 2.2, 2.3 on CD:
+
+```
+A(t=1, e=1) 1.1 1.2 1.3 1.4
+B(t=1, e=1) 1.1
+C(t=2, e=2) 1.1 2.2 2.3
+D(t=2, e=2) 1.1 2.2 2.3
+E(t=2, e=1) 1.1
+```
+
+3) P3 is elected by CDE in term 3, epochStartLsn is 4, and writes 3.4 on D:
+
+```
+A(t=1, e=1) 1.1 1.2 1.3 1.4
+B(t=1, e=1) 1.1
+C(t=3, e=2) 1.1 2.2 2.3
+D(t=3, e=3) 1.1 2.2 2.3 3.4
+E(t=3, e=1) 1.1
+```
+
+Now, A comes back and P3 starts recovering it. How should it proceed? There are
+two options.
+
+## Don't try to find the divergence point at all
+
+...start sending WAL conservatively since the horizon (1.1), and truncate the
+obsolete part of WAL only when recovery is finished, i.e. epochStartLsn (4) is
+reached, i.e. 2.3 transferred -- that's what https://github.com/zenithdb/zenith/pull/505 proposes.
+
+Then the following is possible:
+
+4) P3 moves one record 2.2 to A.
+
+```
+A(t=1, e=1) 1.1 2.2 1.3 1.4
+B(t=1, e=1) 1.1 1.2
+C(t=3, e=2) 1.1 2.2 2.3
+D(t=3, e=3) 1.1 2.2 2.3 3.4
+E(t=3, e=1) 1.1
+```
+
+Now the log of A is basically corrupted. Moreover, since ABE are all in epoch 1
+and A's log is the longest one, they can elect P4 who will commit such a log.
+
+Note that this particular history couldn't happen if we forbade *creating* new
+records in term n until a majority of safekeepers have switched to it. It would
+force CDE to switch to 2 before 2.2 is created, and A could never become the
+donor while its log is corrupted. Generally, with this additional barrier I
+believe the algorithm becomes safe, but
+ - I don't like this kind of artificial barrier;
+ - I also feel somewhat uncomfortable about even temporarily having
+   intentionally corrupted WAL;
+ - I'd still model check the idea.
+
+## Find the divergence point and truncate at it
+
+Then step 4 would delete 1.3 1.4 on A, and we are ok. The question is, how do we
+do that? Without the term switching history we have to resort to sending again
+since the horizon and memcmp'ing records, which is inefficient and ugly. Or we
+can maintain the full history and determine the truncation point by comparing
+the 'wrong' and 'right' histories -- much like pg_rewind does -- and perform the
+truncation + start streaming right there.
+
+# Proposal
+
+- Add term history as array of