Durability & Consensus
======================

When a transaction commits, a commit record is generated in the WAL.
When do we consider the WAL record as durable, so that we can
acknowledge the commit to the client and be reasonably certain that we
will not lose the transaction?

Neon uses a group of WAL safekeeper nodes to hold the generated WAL.
A WAL record is considered durable when it has been written to a
majority of WAL safekeeper nodes. In this document, I use 5
safekeepers, because I have five fingers. A WAL record is durable
when at least 3 safekeepers have written it to disk.

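To make the quorum rule concrete, here is a minimal sketch (in Rust, not taken
from the actual safekeeper code) of how the durable LSN can be derived from the
LSNs that the individual safekeepers have flushed to disk: with 5 safekeepers
and a quorum of 3, the durable point is the 3rd-highest flushed LSN.

```rust
/// Hypothetical illustration: the LSN that is durable on a quorum of
/// safekeepers is the `quorum`-th highest flushed LSN.
fn quorum_durable_lsn(flushed_lsns: &[u64], quorum: usize) -> Option<u64> {
    let mut lsns = flushed_lsns.to_vec();
    lsns.sort_unstable_by(|a, b| b.cmp(a)); // descending
    lsns.get(quorum - 1).copied()
}

fn main() {
    // 5 safekeepers, quorum of 3: only the 3rd-highest LSN is durable.
    let flushed = [900, 870, 850, 400, 0];
    assert_eq!(quorum_durable_lsn(&flushed, 3), Some(850));
}
```
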
First, assume that only one primary node can be running at a
time. This can be achieved by Kubernetes or etcd or some
cloud-provider specific facility, or we can implement it
ourselves. These options are discussed in later chapters. For now,
assume that there is a Magic STONITH Fairy that ensures that.

In addition to the WAL safekeeper nodes, the WAL is archived in
S3. WAL that has been archived to S3 can be removed from the
safekeepers, so the safekeepers don't need a lot of disk space.

```
                         +----------------+
                 +-----> | WAL safekeeper |
                 |       +----------------+
                 |       +----------------+
                 +-----> | WAL safekeeper |
+------------+   |       +----------------+
| Primary    |   |       +----------------+
| Processing |---+-----> | WAL safekeeper |
| Node       |   |       +----------------+
+------------+   |       +----------------+
      \          +-----> | WAL safekeeper |
       \         |       +----------------+
        \        |       +----------------+
         \       +-----> | WAL safekeeper |
          \              +----------------+
           \
            \
             \
              \
               \         +--------+
                \        |        |
                 +-----> |   S3   |
                         |        |
                         +--------+
```

Every WAL safekeeper holds a section of WAL, and a VCL value.
The WAL can be divided into three portions:

```
                                   VCL                      LSN
                                    |                        |
                                    V                        V
.................ccccccccccccccccccccXXXXXXXXXXXXXXXXXXXXXXX
  Archived WAL      Completed WAL        In-flight WAL
```

Note that all the WAL kept in a safekeeper is one contiguous section.
This is different from Aurora: in Aurora, there can be holes in the
WAL, and there is a gossip protocol to fill the holes. That could be
implemented in the future, but let's keep it simple for now. WAL needs
to be written to a safekeeper in order. However, during crash
recovery, In-flight WAL that has already been stored in a safekeeper
can be truncated or overwritten.

The Archived WAL has already been stored in S3, and can be removed from
the safekeeper.

The Completed WAL has been written to at least three safekeepers. The
algorithm ensures that it is not lost when at most two nodes fail at
the same time.

The In-flight WAL has been persisted in the safekeeper, but if a crash
happens, it may still be overwritten or truncated.

The VCL point is determined in the Primary. It is not strictly
necessary to store it in the safekeepers, but it allows some
optimizations and sanity checks and is probably generally useful for
the system as a whole. The VCL values stored in the safekeepers can lag
behind the VCL computed by the primary.

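As a rough illustration of the state each safekeeper tracks (a sketch only;
the field names are assumptions, not the actual safekeeper structs), the three
portions can be described by three LSN boundaries:

```rust
/// Hypothetical per-safekeeper state, illustrating the three WAL portions.
struct SafekeeperState {
    /// Everything up to this LSN has been archived to S3 and may be
    /// deleted locally (end of the "Archived WAL" portion).
    archived_up_to: u64,
    /// VCL as last reported by the primary. WAL between `archived_up_to`
    /// and `vcl` is "Completed WAL": durable on a quorum.
    vcl: u64,
    /// Last LSN written to local disk. WAL between `vcl` and `flushed_lsn`
    /// is "In-flight WAL": it may still be truncated during crash recovery.
    flushed_lsn: u64,
}
```
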
Primary node Normal operation
-----------------------------

1. Generate some WAL.

2. Send the WAL to all the safekeepers that you can reach.

3. As soon as a quorum of safekeepers have acknowledged that they have
   received and durably stored the WAL up to that LSN, update the local VCL
   value in memory, and acknowledge commits to the clients.

4. Send the new VCL to all the safekeepers that were part of the quorum.
   (Optional)

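The following sketch shows roughly how steps 1-4 fit together on the primary.
The types and function names are invented for illustration; this is not the
actual walproposer code.

```rust
// Invented types for illustration only.
struct Safekeeper;
impl Safekeeper {
    /// Append WAL and return the LSN the safekeeper has durably flushed.
    fn append(&mut self, _wal: &[u8], end_lsn: u64) -> Option<u64> { Some(end_lsn) }
    fn update_vcl(&mut self, _vcl: u64) {}
}

const QUORUM: usize = 3;

fn process_wal(safekeepers: &mut [Safekeeper], wal: &[u8], end_lsn: u64, vcl: &mut u64) {
    // Step 2: send the WAL to every safekeeper we can reach, collecting
    // the flushed LSN from each acknowledgment.
    let mut acked: Vec<u64> = safekeepers
        .iter_mut()
        .filter_map(|sk| sk.append(wal, end_lsn))
        .collect();

    // Step 3: the quorum-acknowledged LSN becomes the new VCL;
    // commits up to it can now be acknowledged to clients.
    acked.sort_unstable_by(|a, b| b.cmp(a));
    if let Some(&quorum_lsn) = acked.get(QUORUM - 1) {
        *vcl = (*vcl).max(quorum_lsn);
    }

    // Step 4 (optional): tell the safekeepers about the new VCL.
    for sk in safekeepers.iter_mut() {
        sk.update_vcl(*vcl);
    }
}
```
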
Primary Crash recovery
----------------------

When a new Primary node starts up, before it can generate any new WAL
it needs to contact a majority of the WAL safekeepers to compute the
VCL. Remember that there is a Magic STONITH fairy that ensures that
only one node can be doing this at a time.

1. Contact all WAL safekeepers. Find the Max((Epoch, LSN)) tuple among the ones you
   can reach. This is the Winner safekeeper, and its LSN becomes the new VCL.

2. Update the other safekeepers you can reach, by copying all the WAL
   from the Winner, starting from each safekeeper's old VCL point. Any old
   In-flight WAL from the previous Epoch is truncated away.

3. Increment Epoch, and send the new Epoch to the quorum of
   safekeepers. (This ensures that if any of the safekeepers that we
   could not reach later come back online, they will be considered as
   older than this in any future recovery.)

You can now start generating new WAL, starting from the newly-computed
VCL.

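A condensed sketch of this recovery procedure (Rust-flavored, with invented
types; a real implementation also has to handle unreachable safekeepers,
timeouts, and the actual WAL transfer):

```rust
// Invented illustration types.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Position { epoch: u64, lsn: u64 }

struct Safekeeper { pos: Position }

/// Compute the new VCL and bring the reachable safekeepers up to it.
fn recover(safekeepers: &mut [Safekeeper], epoch: &mut u64) -> u64 {
    // Step 1: the safekeeper with the highest (Epoch, LSN) is the Winner;
    // its LSN becomes the new VCL.
    let winner = safekeepers.iter().map(|sk| sk.pos).max().expect("no safekeepers");
    let new_vcl = winner.lsn;

    // Step 3: the new Epoch is larger than anything seen before, so
    // stale safekeepers lose any future recovery comparison.
    *epoch = winner.epoch + 1;

    // Step 2: copy WAL from the Winner to the other safekeepers,
    // truncating In-flight WAL from older Epochs (transfer omitted).
    for sk in safekeepers.iter_mut() {
        sk.pos = Position { epoch: *epoch, lsn: new_vcl };
    }

    new_vcl
}
```
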
Optimizations
-------------

As described, the Primary node sends all the WAL to all the WAL safekeepers. That
can be a lot of network traffic. Instead of sending the WAL directly from the Primary,
some safekeepers can be daisy-chained off other safekeepers, or there can be a
broadcast mechanism among them. There should still be a direct connection from
each safekeeper to the Primary for the acknowledgments, though.

Similarly, the responsibility for archiving WAL to S3 can be delegated to one of
the safekeepers, to reduce the load on the primary.

Magic STONITH fairy
-------------------

Now that we have a system that works as long as only one primary node is running at a time, how
do we ensure that?

1. Use etcd to grant a lease on a key. The primary node is only allowed to operate as primary
   when it's holding a valid lease. If the primary node dies, the lease expires after a timeout
   period, and a new node is allowed to become the primary. (See the sketch after this list.)

2. Use S3 to store the lease. S3's consistency guarantees are more lenient, so in theory you
   cannot do this safely. In practice, it would probably be OK if you make the lease times and
   timeouts long enough. This has the advantage that we don't need to introduce a new
   component to the architecture.

3. Use Raft or Paxos, with the WAL safekeepers acting as the Acceptors to form the quorum. The
   next chapter describes this option.

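To illustrate option 1, here is a deliberately generic sketch of lease-based
fencing. The `LeaseClient` trait and the lease key are hypothetical, standing
in for whatever etcd (or other coordination service) client is used; this is
not a real API.

```rust
use std::time::{Duration, Instant};

/// Hypothetical interface to a coordination service such as etcd.
trait LeaseClient {
    /// Try to acquire (or renew) the primary lease; returns its expiry time.
    fn acquire_or_renew(&mut self, key: &str, ttl: Duration) -> Option<Instant>;
}

/// The node may act as primary only while it holds an unexpired lease.
fn may_act_as_primary(client: &mut dyn LeaseClient) -> bool {
    match client.acquire_or_renew("neon/primary-lease", Duration::from_secs(10)) {
        // Leave a safety margin: stop acting as primary well before expiry,
        // so clock skew or a stalled renewal cannot lead to two primaries.
        Some(expires_at) => Instant::now() + Duration::from_secs(2) < expires_at,
        None => false,
    }
}
```
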
Built-in Paxos
--------------

The WAL safekeepers act as PAXOS Acceptors, and the Processing nodes
as both Proposers and Learners.

Each WAL safekeeper holds an Epoch value in addition to the VCL and
the WAL. Each request by the primary to safekeep WAL is accompanied by
an Epoch value. If a safekeeper receives a request with an Epoch that
doesn't match its current Accepted Epoch, it must ignore (NACK) it.
(In different Paxos papers, Epochs are called "terms" or "round
numbers".)

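A minimal sketch of that acceptor-side check (types invented for illustration;
real requests also carry the WAL payload and LSNs):

```rust
// Invented types for illustration.
struct AppendRequest { epoch: u64 /* plus WAL payload, LSNs, ... */ }

enum Reply { Ack, Nack }

struct Safekeeper { accepted_epoch: u64 }

impl Safekeeper {
    fn handle_append(&mut self, req: &AppendRequest) -> Reply {
        // A request from an old (or unknown) Epoch must not be applied.
        if req.epoch != self.accepted_epoch {
            return Reply::Nack;
        }
        // ... write the WAL to disk here ...
        Reply::Ack
    }
}
```
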
When a node wants to become the primary, it generates a new Epoch
value that is higher than any previously observed Epoch value, and
globally unique.

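One simple way to get such a value (an assumption for illustration, not
necessarily what Neon does) is to pair a counter with a unique node id and
compare the pairs lexicographically:

```rust
/// Ordered by counter first, then by node id: bumping the counter past the
/// highest observed one yields a higher, globally unique Epoch.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Epoch { counter: u64, node_id: u64 }

fn next_epoch(highest_observed: Epoch, my_node_id: u64) -> Epoch {
    Epoch { counter: highest_observed.counter + 1, node_id: my_node_id }
}
```
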
```
Accepted Epoch: 555                VCL                      LSN
                                    |                        |
                                    V                        V
.................ccccccccccccccccccccXXXXXXXXXXXXXXXXXXXXXXX
  Archived WAL      Completed WAL        In-flight WAL
```

Primary node startup:

1. Contact all WAL safekeepers that you can reach (if you cannot
   connect to a quorum of them, you can give up immediately). Find the
   latest Epoch among them.

2. Generate a new globally unique Epoch, greater than the latest Epoch
   found in the previous step.

3. Send the new Epoch in a Prepare message to a quorum of
   safekeepers. (PAXOS Prepare message)

4. Each safekeeper responds with a Promise. If a safekeeper has
   already made a promise with a higher Epoch, it doesn't respond (or
   responds with a NACK). After making a promise, the safekeeper stops
   responding to any write requests with an earlier Epoch.

5. Once you have received a majority of promises, you know that the
   VCL cannot advance on the old Epoch anymore. This effectively kills
   any old primary server.

6. Find the highest written LSN among the quorum of safekeepers (these
   can be included in the Promise messages already). This is the new
   VCL. If a new node starts the election process after this point,
   it will compute the same or higher VCL.

7. Copy the WAL from the safekeeper with the highest LSN to the other
   safekeepers in the quorum, using the new Epoch. (PAXOS Accept
   phase)

8. You can now start generating new WAL, starting from the VCL. If
   another process starts the election process after this point and
   gains control of a majority of the safekeepers, we will no longer
   be able to advance the VCL.

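Putting the proposer side together, a compressed sketch of this startup
sequence (again with invented types; a real implementation has to handle
unreachable safekeepers, NACKs, retries, and the actual WAL copy):

```rust
// Invented illustration types.
#[derive(Clone, Copy)]
struct Promise { flushed_lsn: u64 }

struct Safekeeper { accepted_epoch: u64, flushed_lsn: u64 }

impl Safekeeper {
    /// PAXOS Prepare: promise not to accept writes from older Epochs.
    fn prepare(&mut self, epoch: u64) -> Option<Promise> {
        if epoch <= self.accepted_epoch {
            return None; // NACK
        }
        self.accepted_epoch = epoch;
        Some(Promise { flushed_lsn: self.flushed_lsn })
    }
}

const QUORUM: usize = 3;

/// Steps 1-7; the caller can then start generating WAL (step 8).
fn become_primary(safekeepers: &mut [Safekeeper]) -> Option<(u64, u64)> {
    // Steps 1-2: pick an Epoch greater than anything observed.
    // (A real Epoch also needs a tie-breaker to be globally unique.)
    let new_epoch = safekeepers.iter().map(|sk| sk.accepted_epoch).max()? + 1;

    // Steps 3-4: Prepare / Promise round.
    let promises: Vec<Promise> =
        safekeepers.iter_mut().filter_map(|sk| sk.prepare(new_epoch)).collect();

    // Step 5: without a majority of promises, give up.
    if promises.len() < QUORUM {
        return None;
    }

    // Step 6: the highest flushed LSN among the promising quorum is the new VCL.
    let new_vcl = promises.iter().map(|p| p.flushed_lsn).max()?;

    // Step 7: copy WAL up to new_vcl to the rest of the quorum (omitted here).
    Some((new_epoch, new_vcl))
}
```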