20 Commits

Author SHA1 Message Date
Heikki Linnakangas
220a023e51 Fix typo in error message 2021-04-21 17:12:43 +03:00
Eric Seppanen
92e4f4b3b6 cargo fmt 2021-04-20 17:59:56 -07:00
Heikki Linnakangas
d047a3abf7 Fixes, per Eric's and Konstantin's comments 2021-04-20 19:11:29 +03:00
Heikki Linnakangas
f69db17409 Make WAL safekeeper work with zenith timelines 2021-04-20 19:11:29 +03:00
Heikki Linnakangas
3600b33f1c Implement "timelines" in page server
This replaces the page server's "datadir" concept. The Page Server now
always works with a "Zenith Repository". When you initialize a new
repository with "zenith init", it runs initdb and loads an initial
basebackup of the freshly-created cluster into the repository, on "main"
branch. Repository can hold multiple "timelines", which can be given
human-friendly names, making them "branches". One page server simultaneously
serves all timelines stored in the repository, and you can have multiple
Postgres compute nodes connected to the page server, as long they all
operate on a different timeline.

There is a new command "zenith branch", which can be used to fork off
new branches from existing branches.

The repository uses the directory layout desribed as Repository format
v1 in https://github.com/zenithdb/rfcs/pull/5. It it *highly* inefficient:
- we never create new snapshots. So in practice, it's really just a base
  backup of the initial empty cluster, and everything else is reconstructed
  by redoing all WAL

- when you create a new timeline, the base snapshot and *all* WAL is copied
  from the new timeline to the new one. There is no smarts about
  referencing the old snapshots/wal from the ancestor timeline.

To support all this, this commit includes a bunch of other changes:

- Implement "basebackup" funtionality in page server. When you initialize
  a new compute node with "zenith pg create", it connects to the page
  server, and requests a base backup of the Postgres data directory on
  that timeline. (the base backup excludes user tables, so it's not
  as bad as it sounds).

- Have page server's WAL receiver write the WAL into timeline dir. This
  allows running a Page Server and Compute Nodes without a WAL safekeeper,
  until we get around to integrate that properly into the system. (Even
  after we integrate WAL safekeeper, this is perhaps how this will operate
  when you want to run the system on your laptop.)

- restore_datadir.rs was renamed to restore_local_repo.rs, and heavily
  modified to use the new format. It now also restores all WAL.

- Page server no longer scans and restores everything into memory at startup.
  Instead, when the first request is made for a timeline, the timeline is
  slurped into memory at that point.

- The responsibility for telling page server to "callmemaybe" was moved
  into Postgres libpqpagestore code. Also, WAL producer connstring cannot
  be specified in the pageserver's command line anymore.

- Having multiple "system identifiers" in the same page server is no
  longer supported. I repurposed much of that code to support multiple
  timelines, instead.

- Implemented very basic, incomplete, support for PostgreSQL's Extended
  Query Protocol in page_service.rs. Turns out that rust-postgres'
  copy_out() function always uses the extended query protocol to send
  out the command, and I'm using that to stream the base backup from the
  page server.

TODO: I haven't fixed the WAL safekeeper for this scheme, so all the
integration tests involving safekeepers are failing. My plan is to modify
the safekeeper to know about Zenith timelines, too, and modify it to work
with the same Zenith repository format. It only needs to care about the
'.zenith/timelines/<timeline>/wal' directories.
2021-04-20 19:11:27 +03:00
Eric Seppanen
533087fd5d cargo fmt 2021-04-19 23:26:13 -07:00
Konstantin Knizhnik
8879f747ee Add multitenancy test for wal_acceptor 2021-04-19 14:30:42 +03:00
Eric Seppanen
3c7f810849 clippy cleanup #1
Resolve some basic warnings from clippy:
- useless conversion to the same type
- redundant field names in struct initialization
- redundant single-component path imports
2021-04-18 19:15:06 -07:00
Eric Seppanen
4ff248515b remote unnecessary dependencies between peer crates
These dependencies make cargo rebuild more than is strictly necessary.
Removing them makes the build a little faster.
2021-04-16 15:25:43 -07:00
Eric Seppanen
e8032f26e6 adopt new tokio-postgres:replication branch
This PR has evolved a lot; jump to the newer version. This should make
it easier to handle keepalive messages.
2021-04-16 08:29:47 -07:00
anastasia
d7eeaec706 add test for restore from local pgdata 2021-04-15 16:43:03 +03:00
anastasia
1190030872 handle SLRU in restore_datadir 2021-04-15 16:43:03 +03:00
lubennikovaav
82dc1e82ba Restore pageserver from s3 or local datadir (#9)
* change pageserver --skip-recovery option to --restore-from=[s3|local]
* implement restore from local pgdata 
* add simple test for local restore
2021-04-14 21:14:10 +03:00
anastasia
2e9c730dd1 Cargo fmt pass 2021-04-14 20:12:50 +03:00
Eric Seppanen
d1d6c968d5 control_plane: add error handling to reading pid files
print file errors to stderr; propagate the io::Error to the caller.
This error isn't handled very gracefully in WalAcceptorNode::drop(),
but there aren't any good options there since drop can't fail.
2021-04-13 14:30:48 -07:00
Stas Kelvich
f35d13183e fixup, check testing in CI 2021-04-13 18:58:22 +03:00
Stas Kelvich
c5f379bff3 [WIP] Implement CLI pg part 2021-04-13 18:58:22 +03:00
Stas Kelvich
39ebec51d1 Return Result<()> from pageserver start/stop.
To provide meaningful error messages when it is called by CLI.
2021-04-10 19:03:40 +03:00
Stas Kelvich
6264dc6aa3 Move control_plane code out of lib.rs and split up control plane
into compute and storage parts.

Before that code was concentrated in lib.rs which was unhandy to
open by name.
2021-04-10 13:56:19 +03:00
Stas Kelvich
59163cf3b3 Rework controle_plane code to reuse it in CLI.
Move all paths from control_plane to local_env which can set them
for testing environment or for local installation.
2021-04-10 12:09:20 +03:00