# WAL service
The zenith WAL service acts as a holding area and redistribution
center for recently generated WAL. The primary Postgres server streams
the WAL to the WAL safekeeper, and treats it like a (synchronous)
replica. A replication slot is used in the primary to prevent the
primary from discarding WAL that hasn't been streamed to the WAL
service yet.
+--------------+ +------------------+
| | WAL | |
| Compute node | ----------> | WAL Service |
| | | |
+--------------+ +------------------+
|
|
| WAL
|
|
V
+--------------+
| |
| Pageservers |
| |
+--------------+
The WAL service consists of multiple WAL safekeepers that all store a
copy of the WAL. A WAL record is considered durable when a majority of
the safekeepers have received and stored it on local disk. A
consensus algorithm based on Paxos is used to manage the quorum.
+-------------------------------------------+
| WAL Service |
| |
| |
| +------------+ |
| | safekeeper | |
| +------------+ |
| |
| +------------+ |
| | safekeeper | |
| +------------+ |
| |
| +------------+ |
| | safekeeper | |
| +------------+ |
| |
+-------------------------------------------+
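To make the quorum rule concrete, here is a minimal Rust sketch of how a
commit LSN could be derived from the per-safekeeper flush positions. The
function name and the use of plain `u64` LSNs are illustrative, not the
actual safekeeper code:

```rust
/// Illustrative only: compute the LSN up to which a majority of
/// safekeepers have flushed WAL to disk. Everything up to this LSN
/// is considered durable.
fn quorum_commit_lsn(mut flush_lsns: Vec<u64>) -> Option<u64> {
    let n = flush_lsns.len();
    if n == 0 {
        return None;
    }
    // Sort flush positions from highest to lowest.
    flush_lsns.sort_unstable_by(|a, b| b.cmp(a));
    // With n safekeepers a majority is n/2 + 1 of them, so the entry at
    // index n/2 (0-based) is the highest LSN that at least a majority
    // of safekeepers has reached.
    Some(flush_lsns[n / 2])
}

fn main() {
    // Three safekeepers; two of them have flushed up to at least 0x2000.
    assert_eq!(quorum_commit_lsn(vec![0x3000, 0x2000, 0x1000]), Some(0x2000));
}
```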
The primary connects to the WAL safekeepers, so it works in a "push"
fashion. That's different from how streaming replication usually
works, where the replica initiates the connection. To do that, there
is a component called the "WAL proposer". The WAL proposer is a
background worker that runs in the primary Postgres server. It
connects to the WAL safekeepers and sends them all the WAL.
(PostgreSQL's archive_command works in the "push" style, but it
operates at WAL segment granularity. If PostgreSQL had a push-style
API for streaming, the WAL proposer could be implemented using it.)
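As a rough illustration of the push model, the sketch below shows the
proposer broadcasting each WAL chunk to every safekeeper and remembering
the acknowledged position per safekeeper. All names and types here are
made up; the real walproposer is a C background worker inside Postgres:

```rust
/// Hypothetical view of one safekeeper connection held by the proposer.
struct SafekeeperConn {
    /// Highest LSN this safekeeper has acknowledged as flushed.
    acked_lsn: u64,
}

impl SafekeeperConn {
    /// Stand-in for sending a WAL chunk over the network and reading the
    /// safekeeper's acknowledgement of its new flush position.
    fn send_wal(&mut self, start_lsn: u64, wal: &[u8]) {
        self.acked_lsn = start_lsn + wal.len() as u64;
    }
}

/// "Push" model: the proposer initiates every send; the safekeepers never
/// pull WAL from the primary.
fn push_wal(conns: &mut [SafekeeperConn], start_lsn: u64, wal: &[u8]) {
    for conn in conns.iter_mut() {
        conn.send_wal(start_lsn, wal);
    }
}

fn main() {
    let mut conns = vec![
        SafekeeperConn { acked_lsn: 0 },
        SafekeeperConn { acked_lsn: 0 },
        SafekeeperConn { acked_lsn: 0 },
    ];
    push_wal(&mut conns, 0x1000, b"some WAL bytes");
    assert!(conns.iter().all(|c| c.acked_lsn == 0x1000 + 14));
}
```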
The Page Server connects to the WAL safekeeper, using the same
streaming replication protocol that's used between a Postgres primary
and a standby. You can also connect the Page Server directly to a
primary PostgreSQL node for testing.
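For illustration only, this is roughly what starting such a physical
replication stream looks like at the protocol level. The host, port, and
user below are placeholders, and the real code also handles the COPY-both
data flow that follows the command:

```rust
fn main() {
    // Connection string with replication=true so the walsender protocol is
    // available. Host, port, and user are placeholders for a safekeeper
    // (or a primary Postgres node when testing the Page Server against one).
    let conninfo = "host=safekeeper-1 port=5454 user=replicator replication=true";

    // Start streaming physical WAL from the given LSN on timeline 1.
    // This is the same command a standby sends to its primary.
    let start_replication = "START_REPLICATION PHYSICAL 0/16B3748 TIMELINE 1";

    println!("connect with: {conninfo}");
    println!("then issue:   {start_replication}");
}
```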
In a production installation, there are multiple WAL safekeepers
running on different nodes, and there is a quorum mechanism using the
Paxos algorithm to ensure that a piece of WAL is considered durable
only after it has been flushed to disk on more than half of the WAL
safekeepers. The Paxos and crash recovery algorithm ensures that only
one primary node can be actively streaming WAL to the quorum of
safekeepers.
See README_PROTO.md for a more detailed description of the consensus
protocol. The spec/ directory contains a TLA+ specification of it.
# Q&A
Q: Why have a separate service instead of connecting the Page Server directly to
the primary PostgreSQL node?
A: The Page Server is a single server that can be lost. Our primary
fault-tolerant storage is S3, but we do not want to wait for S3 before
committing a transaction. The WAL service acts as temporary fault-tolerant
storage for recent data before it reaches the Page Server and, eventually,
S3. Once the WAL and pages have been committed to S3, the WAL service's
storage can be trimmed.
Q: What if the compute node evicts a page, needs it back, but the page has not
yet reached the Page Server?
A: If the compute node has evicted a page, the changes to it have been WAL-logged
(that's why it is called Write-Ahead Logging; there are some exceptions, like
index builds, but they are exceptions). These WAL records will eventually
reach the Page Server. The Page Server notices that the compute node requests
pages with a very recent LSN and does not respond to the compute node until the
corresponding WAL has been received from the WAL safekeepers.
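A minimal sketch of that waiting logic on the Page Server side, assuming a
hypothetical `WalState` that tracks the last received LSN; the real
implementation is more involved:

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical per-timeline state on the Page Server: the last WAL
/// position received from the safekeepers.
struct WalState {
    last_record_lsn: Mutex<u64>,
    wal_arrived: Condvar,
}

impl WalState {
    /// Block a page request until WAL up to `lsn` has been received,
    /// or give up after roughly `timeout`.
    fn wait_for_lsn(&self, lsn: u64, timeout: Duration) -> Result<(), &'static str> {
        let mut current = self.last_record_lsn.lock().unwrap();
        while *current < lsn {
            let (guard, res) = self.wal_arrived.wait_timeout(current, timeout).unwrap();
            current = guard;
            if res.timed_out() && *current < lsn {
                return Err("timed out waiting for WAL from the safekeepers");
            }
        }
        Ok(())
    }

    /// Called by the WAL receiver when new WAL has been ingested.
    fn advance(&self, lsn: u64) {
        let mut current = self.last_record_lsn.lock().unwrap();
        if lsn > *current {
            *current = lsn;
            self.wal_arrived.notify_all();
        }
    }
}

fn main() {
    let state = WalState {
        last_record_lsn: Mutex::new(0),
        wal_arrived: Condvar::new(),
    };
    // WAL up to 0x1000 has already arrived, so this returns immediately.
    state.advance(0x1000);
    assert!(state.wait_for_lsn(0x0800, Duration::from_secs(1)).is_ok());
}
```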
Q: How long might the Page Server have to wait?
A: Not too long, hopefully. If a page was evicted, it probably had not been used
for a while, so the WAL service has had enough time to push the changes to the
Page Server. There may be issues if there is no backpressure and the compute
node and the WAL service run far ahead of the Page Server, though.
There is no backpressure right now, so you may even see some spurious
timeouts in tests.
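Backpressure here would mean slowing the compute node down when the Page
Server falls too far behind. A tiny sketch of the kind of check involved,
with a made-up threshold and made-up LSN parameters:

```rust
use std::{thread, time::Duration};

/// Made-up threshold: allow at most 512 MiB of WAL that the Page Server
/// has not applied yet.
const MAX_LAG_BYTES: u64 = 512 * 1024 * 1024;

/// Illustrative backpressure check: if the WAL durably stored by the
/// safekeepers runs too far ahead of what the Page Server has applied,
/// throttle the writer instead of letting the lag grow without bound.
fn maybe_throttle(commit_lsn: u64, pageserver_applied_lsn: u64) {
    let lag = commit_lsn.saturating_sub(pageserver_applied_lsn);
    if lag > MAX_LAG_BYTES {
        // A real system would delay WAL generation on the compute node;
        // sleeping here just stands in for that delay.
        thread::sleep(Duration::from_millis(10));
    }
}

fn main() {
    // 1 GiB of lag exceeds the threshold, so this call would throttle.
    maybe_throttle(2 * 1024 * 1024 * 1024, 1024 * 1024 * 1024);
}
```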
Q: How do WAL safekeepers communicate with each other?
A: They only exchange messages via the compute node; they never
communicate directly with each other.
Q: Why have a consensus algorithm if there is only a single compute node?
A: There may actually be moments when multiple PostgreSQL nodes are running at
the same time, e.g. while one is being brought up and another is being brought
down. We want to avoid simultaneous writes from different nodes, so there must
be a consensus on which node is the primary.
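Consensus protocols typically enforce a single writer with a term (epoch)
number: a new primary must first get a majority of safekeepers to accept
its term, and safekeepers reject WAL from older terms. A minimal sketch of
that acceptor-side rule, not the actual protocol (see README_PROTO.md for
that):

```rust
/// Illustrative acceptor (safekeeper) state for electing a single writer.
struct Acceptor {
    /// Highest proposer term this acceptor has promised to follow.
    accepted_term: u64,
}

impl Acceptor {
    /// A compute node announces itself with its term. The acceptor only
    /// switches to it if the term is higher than anything seen so far.
    fn handle_proposal(&mut self, proposer_term: u64) -> bool {
        if proposer_term > self.accepted_term {
            self.accepted_term = proposer_term;
            true
        } else {
            // A stale proposer (e.g. an old primary that is still running)
            // is rejected, so it cannot reach a quorum and cannot commit
            // any more WAL.
            false
        }
    }

    /// WAL is only accepted from the proposer holding the current term.
    fn accept_wal(&self, proposer_term: u64) -> bool {
        proposer_term == self.accepted_term
    }
}

fn main() {
    let mut acc = Acceptor { accepted_term: 0 };
    assert!(acc.handle_proposal(1)); // first primary is elected
    assert!(acc.handle_proposal(2)); // a newer primary takes over
    assert!(!acc.accept_wal(1));     // the old primary's WAL is rejected
}
```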
# Terminology
WAL service - The service as a whole that ensures that WAL is stored durably.
WAL safekeeper - One node that participates in the quorum. All the safekeepers
together form the WAL service.
WAL acceptor, WAL proposer - In the context of the consensus algorithm, the Postgres
compute node is also known as the WAL proposer, and the safekeeper is also known
as the acceptor. Those are the standard terms in the Paxos algorithm.