mirror of
https://github.com/neondatabase/neon.git
synced 2026-05-23 16:10:37 +00:00
This replaces the page server's "datadir" concept. The Page Server now always works with a "Zenith Repository". When you initialize a new repository with "zenith init", it runs initdb and loads an initial basebackup of the freshly-created cluster into the repository, on the "main" branch. A repository can hold multiple "timelines", which can be given human-friendly names, making them "branches". One page server simultaneously serves all timelines stored in the repository, and you can have multiple Postgres compute nodes connected to the page server, as long as each of them operates on a different timeline.

There is a new command "zenith branch", which can be used to fork off new branches from existing branches.

The repository uses the directory layout described as Repository format v1 in https://github.com/zenithdb/rfcs/pull/5. It is *highly* inefficient:

- we never create new snapshots. So in practice, it's really just a base backup of the initial empty cluster, and everything else is reconstructed by redoing all WAL

- when you create a new timeline, the base snapshot and *all* WAL is copied from the old timeline to the new one. There are no smarts about referencing the old snapshots/WAL from the ancestor timeline.

To support all this, this commit includes a bunch of other changes:

- Implement "basebackup" functionality in page server. When you initialize a new compute node with "zenith pg create", it connects to the page server and requests a base backup of the Postgres data directory on that timeline. (The base backup excludes user tables, so it's not as bad as it sounds.)

- Have page server's WAL receiver write the WAL into the timeline dir. This allows running a Page Server and Compute Nodes without a WAL safekeeper, until we get around to integrating that properly into the system. (Even after we integrate the WAL safekeeper, this is perhaps how this will operate when you want to run the system on your laptop.)

- restore_datadir.rs was renamed to restore_local_repo.rs, and heavily modified to use the new format. It now also restores all WAL.

- Page server no longer scans and restores everything into memory at startup. Instead, when the first request is made for a timeline, the timeline is slurped into memory at that point.

- The responsibility for telling page server to "callmemaybe" was moved into the Postgres libpqpagestore code. Also, the WAL producer connstring can no longer be specified on the pageserver's command line.

- Having multiple "system identifiers" in the same page server is no longer supported. I repurposed much of that code to support multiple timelines instead.

- Implemented very basic, incomplete support for PostgreSQL's Extended Query Protocol in page_service.rs. Turns out that rust-postgres' copy_out() function always uses the extended query protocol to send out the command, and I'm using that to stream the base backup from the page server.

TODO: I haven't fixed the WAL safekeeper for this scheme, so all the integration tests involving safekeepers are failing. My plan is to modify the safekeeper to know about Zenith timelines, too, and modify it to work with the same Zenith repository format. It only needs to care about the '.zenith/timelines/<timeline>/wal' directories.
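As a rough illustration of the "load a timeline on first request" behaviour and the per-timeline WAL directory mentioned above, here is a minimal sketch. It is not the code from this commit: TimelineCache, Timeline, get() and the loads counter are hypothetical names made up for the example, and "loading" is reduced to computing the WAL directory path rather than replaying any WAL.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Hypothetical in-memory state for one timeline. A real page server would
/// hold page versions here; this sketch only remembers the WAL directory.
#[derive(Debug)]
pub struct Timeline {
    pub wal_dir: PathBuf,
}

/// Hypothetical cache that loads timelines lazily, on first request,
/// instead of scanning the whole repository at startup.
pub struct TimelineCache {
    repo_dir: PathBuf,
    loaded: HashMap<String, Timeline>,
    /// Number of times a timeline actually had to be loaded.
    pub loads: usize,
}

impl TimelineCache {
    pub fn new(repo_dir: impl Into<PathBuf>) -> Self {
        TimelineCache {
            repo_dir: repo_dir.into(),
            loaded: HashMap::new(),
            loads: 0,
        }
    }

    /// Builds `<repo>/timelines/<timeline>/wal`, following the layout
    /// named in the TODO above.
    pub fn wal_dir(&self, timeline: &str) -> PathBuf {
        self.repo_dir.join("timelines").join(timeline).join("wal")
    }

    /// Returns the timeline, loading it only if it is not in memory yet.
    pub fn get(&mut self, timeline: &str) -> &Timeline {
        if !self.loaded.contains_key(timeline) {
            // Placeholder for the expensive step (restoring the snapshot
            // and redoing WAL); assumed, not taken from the commit.
            let tl = Timeline {
                wal_dir: self.wal_dir(timeline),
            };
            self.loads += 1;
            self.loaded.insert(timeline.to_string(), tl);
        }
        &self.loaded[timeline]
    }
}
```

With a cache created as TimelineCache::new(".zenith"), the first get("main") performs the load and later calls for the same timeline reuse the in-memory entry, so startup cost no longer depends on how many timelines the repository holds.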
# WAL safekeeper

Also known as the WAL service, WAL keeper or WAL acceptor.

The WAL safekeeper acts as a holding area and redistribution center for recently generated WAL. The primary Postgres server streams the WAL to the WAL safekeeper, and treats it like a (synchronous) replica. A replication slot is used in the primary to prevent the primary from discarding WAL that hasn't been streamed to the safekeeper yet.

The primary connects to the WAL safekeeper, so it works in a "push" fashion. That's different from how streaming replication usually works, where the replica initiates the connection. To do that, there is a component called "safekeeper_proxy". The safekeeper_proxy runs on the same host as the primary Postgres server and connects to it to do streaming replication. It also connects to the WAL safekeeper, and forwards all the WAL. (PostgreSQL's archive_command works in the "push" style, but it operates on a WAL segment granularity. If PostgreSQL had a push style API for streaming, we wouldn't need the proxy.)

The Page Server connects to the WAL safekeeper, using the same streaming replication protocol that's used between a Postgres primary and standby. You can also connect the Page Server directly to a primary PostgreSQL node for testing.

In a production installation, there are multiple WAL safekeepers running on different nodes, and there is a quorum mechanism using the Paxos algorithm to ensure that a piece of WAL is considered durable only after it has been flushed to disk on more than half of the WAL safekeepers. The Paxos and crash recovery algorithms ensure that only one primary node can be actively streaming WAL to the quorum of safekeepers.

See vendor/postgres/src/bin/safekeeper/README.md for a more detailed description of the consensus protocol. (TODO: move the text here?)
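
The "more than half" durability rule can be illustrated with a small sketch. This is not the safekeeper's actual code, and it leaves out the Paxos terms and crash recovery entirely. It only assumes that each safekeeper reports the highest WAL position (LSN) it has flushed to disk. The names `Lsn`, `quorum_flush_lsn` and `is_durable` are hypothetical.

```rust
/// Log sequence number: a byte position in the WAL stream (assumed u64 here).
pub type Lsn = u64;

/// Returns the highest LSN that has been flushed on more than half of the
/// safekeepers, given one reported flush position per safekeeper.
/// A safekeeper that has reported nothing yet should be passed as 0.
pub fn quorum_flush_lsn(flush_lsns: &[Lsn]) -> Option<Lsn> {
    if flush_lsns.is_empty() {
        return None;
    }
    let mut sorted = flush_lsns.to_vec();
    // Sort descending, so sorted[k] has been reached by at least k + 1 nodes.
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    // Smallest group size that is strictly more than half of the nodes.
    let majority = sorted.len() / 2 + 1;
    Some(sorted[majority - 1])
}

/// A WAL record ending at `record_end` counts as durable once a majority
/// of safekeepers has flushed at least up to that position.
pub fn is_durable(record_end: Lsn, flush_lsns: &[Lsn]) -> bool {
    matches!(quorum_flush_lsn(flush_lsns), Some(q) if q >= record_end)
}
```

For example, with three safekeepers reporting flush positions 100, 80 and 60, the quorum position is 80: WAL up to 80 is on two of the three nodes, while WAL between 80 and 100 is so far on only one node and is not yet considered durable.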