Commit Graph

187 Commits

Author SHA1 Message Date
Konstantin Knizhnik
ed30f2096c Disable GC by default 2021-04-22 11:30:27 +03:00
Konstantin Knizhnik
da9508716d Address issues from Eric's review 2021-04-22 10:37:52 +03:00
Konstantin Knizhnik
9e7c45cb72 Merge with master 2021-04-22 09:45:13 +03:00
Eric Seppanen
2cd730d31f page_cache: replace long mutex sleep with SeqWait
When calling into the page cache, it was possible to wait on a blocking
mutex, which can stall the async executor.

Replace that sleep with a SeqWait::wait_for(lsn).await so that the
executor can go on with other work while we wait.

Change walreceiver_works to an AtomicBool to avoid the awkwardness of
taking the lock, then dropping it while we call wait_for and then
acquiring it again to do real work.
2021-04-21 18:02:13 -07:00
Eric Seppanen
8060e17b50 add SeqWait
SeqWait adds a way to .await the arrival of some sequence number.
It provides wait_for(num) which is an async fn, and advance(num) which
is synchronous.

This should be useful in solving the page cache deadlocks, and may be
useful in other areas too.

This implementation still uses a Mutex internally, but only for a brief
critical section. If we find this code broadly useful and start to care
more about executor stalls due to unfair thread scheduling, there might
be ways to make it lock-free.
2021-04-21 18:02:13 -07:00
Konstantin Knizhnik
4f3f0304c2 Merge branch 'main' into rocksdb_pageserver 2021-04-21 19:05:02 +03:00
Konstantin Knizhnik
c981f4ad66 Implement garbage collection of unused versions 2021-04-21 19:04:30 +03:00
Heikki Linnakangas
e911427872 Remove some unnecessary dependencies 2021-04-21 16:42:12 +03:00
Konstantin Knizhnik
07507274c0 Merge branch 'main' into rocksdb_pageserver 2021-04-21 16:06:31 +03:00
Eric Seppanen
f387769203 add zenith_utils crate
This is a place for code that's shared between other crates in this
repository.
2021-04-20 11:11:29 -07:00
Heikki Linnakangas
3600b33f1c Implement "timelines" in page server
This replaces the page server's "datadir" concept. The Page Server now
always works with a "Zenith Repository". When you initialize a new
repository with "zenith init", it runs initdb and loads an initial
basebackup of the freshly-created cluster into the repository, on "main"
branch. Repository can hold multiple "timelines", which can be given
human-friendly names, making them "branches". One page server simultaneously
serves all timelines stored in the repository, and you can have multiple
Postgres compute nodes connected to the page server, as long they all
operate on a different timeline.

There is a new command "zenith branch", which can be used to fork off
new branches from existing branches.

The repository uses the directory layout desribed as Repository format
v1 in https://github.com/zenithdb/rfcs/pull/5. It it *highly* inefficient:
- we never create new snapshots. So in practice, it's really just a base
  backup of the initial empty cluster, and everything else is reconstructed
  by redoing all WAL

- when you create a new timeline, the base snapshot and *all* WAL is copied
  from the new timeline to the new one. There is no smarts about
  referencing the old snapshots/wal from the ancestor timeline.

To support all this, this commit includes a bunch of other changes:

- Implement "basebackup" funtionality in page server. When you initialize
  a new compute node with "zenith pg create", it connects to the page
  server, and requests a base backup of the Postgres data directory on
  that timeline. (the base backup excludes user tables, so it's not
  as bad as it sounds).

- Have page server's WAL receiver write the WAL into timeline dir. This
  allows running a Page Server and Compute Nodes without a WAL safekeeper,
  until we get around to integrate that properly into the system. (Even
  after we integrate WAL safekeeper, this is perhaps how this will operate
  when you want to run the system on your laptop.)

- restore_datadir.rs was renamed to restore_local_repo.rs, and heavily
  modified to use the new format. It now also restores all WAL.

- Page server no longer scans and restores everything into memory at startup.
  Instead, when the first request is made for a timeline, the timeline is
  slurped into memory at that point.

- The responsibility for telling page server to "callmemaybe" was moved
  into Postgres libpqpagestore code. Also, WAL producer connstring cannot
  be specified in the pageserver's command line anymore.

- Having multiple "system identifiers" in the same page server is no
  longer supported. I repurposed much of that code to support multiple
  timelines, instead.

- Implemented very basic, incomplete, support for PostgreSQL's Extended
  Query Protocol in page_service.rs. Turns out that rust-postgres'
  copy_out() function always uses the extended query protocol to send
  out the command, and I'm using that to stream the base backup from the
  page server.

TODO: I haven't fixed the WAL safekeeper for this scheme, so all the
integration tests involving safekeepers are failing. My plan is to modify
the safekeeper to know about Zenith timelines, too, and modify it to work
with the same Zenith repository format. It only needs to care about the
'.zenith/timelines/<timeline>/wal' directories.
2021-04-20 19:11:27 +03:00
Konstantin Knizhnik
8aa3013ec2 Merge branch 'main' into rocksdb_pageserver 2021-04-19 16:28:29 +03:00
Eric Seppanen
b32cc6a088 pageserver: change over some error handling to anyhow+thiserror
This is a first attempt at a new error-handling strategy:
- Use anyhow::Error as the first choice for easy error handling
- Use thiserror to generate local error types for anything that
  needs it (no error type is available to us) or will be inspected
  or matched on by higher layers.
2021-04-18 23:06:35 -07:00
Eric Seppanen
35e0099ac6 pin remote rust-s3 dependency to a git hash
Using the hash should allow us to change the remote repo and propagate
that change to user builds without that change becoming visible at a
random time.

It's unfortunate that we can't declare this dependency once in the
top-level Cargo.toml; that feature request is rust-lang rfc 2906.
2021-04-16 15:26:11 -07:00
Eric Seppanen
4ff248515b remote unnecessary dependencies between peer crates
These dependencies make cargo rebuild more than is strictly necessary.
Removing them makes the build a little faster.
2021-04-16 15:25:43 -07:00
Eric Seppanen
e8032f26e6 adopt new tokio-postgres:replication branch
This PR has evolved a lot; jump to the newer version. This should make
it easier to handle keepalive messages.
2021-04-16 08:29:47 -07:00
lubennikovaav
82dc1e82ba Restore pageserver from s3 or local datadir (#9)
* change pageserver --skip-recovery option to --restore-from=[s3|local]
* implement restore from local pgdata 
* add simple test for local restore
2021-04-14 21:14:10 +03:00
Stas Kelvich
c5f379bff3 [WIP] Implement CLI pg part 2021-04-13 18:58:22 +03:00
Stas Kelvich
59163cf3b3 Rework controle_plane code to reuse it in CLI.
Move all paths from control_plane to local_env which can set them
for testing environment or for local installation.
2021-04-10 12:09:20 +03:00
Konstantin Knizhnik
07fb30747a Store pageserver data in RocksDB 2021-04-08 19:39:30 +03:00
Heikki Linnakangas
1367332447 Separate walkeeper and pageserver sources into different directories.
The integration tests, which depend on both walkeeper and pageserver,
are moved into yet another directory, 'integration_tests'.
2021-04-06 13:15:26 +03:00
Konstantin Knizhnik
13f507f0b4 Calculate records CRC in wal decoder 2021-04-02 10:30:56 +03:00
Konstantin Knizhnik
02ca245081 Port wal_acceptor to rust 2021-04-02 10:30:56 +03:00
anastasia
d1ef8a1784 [issue #7] CLI parse args for pg subcommand 2021-03-29 15:59:28 +03:00
Stas Kelvich
8a80d055b9 daemon mode for pageserver 2021-03-29 15:59:28 +03:00
Heikki Linnakangas
0cec3b5ed7 Update dependencies with "cargo update".
Notably, this updates the version of rust-postgres so that it doesn't
panic if the replication connection is lost unexpectedly.
2021-03-29 15:59:28 +03:00
Heikki Linnakangas
4c0be32bf5 Implement Text User Interface to show log streams in multiple "windows"
Switch to 'slog' crate for logging, it gives us the flexibility that we
need for the widget to scroll logs on TUI
2021-03-29 15:59:28 +03:00
Stas Kelvich
9e89c1e2cd add CLI options to pageserver 2021-03-29 15:59:28 +03:00
Heikki Linnakangas
303a546aba Refactor locking in page cache, and use async I/O for WAL redo
Story on why:

The apply_wal_records() function spawned the special postgres process
to perform WAL redo. That was done in a blocking fashion: it launches
the process, then it writes the command to its stdin, then it reads
the result from its stdout.  I wanted to also read the child process's
stderr, and forward it to the page server's log (which is just the
page server's stderr ATM). That has classic potential for deadlock:
the child process might block trying to write to stderr/stdout, if the
parent isn't reading it. So the parent needs to perform the read/write
with the child's stdin/stdout/stderr in an async fashion.  So I
refactored the code in walredo.c into async style.  But it started to
hang. It took me a while to figure it out; async makes for really ugly
stacktraces, it's hard to figure out what's going on. The call path
goes like this: Page service -> get_page_at_lsn() in page cache ->
apply_wal_records() the page service is written in async style. And I
refactored apply_wal_recorsds() to also be async. BUT,
get_page_at_lsn() acquires a lock, in a blocking fashion.

The lock-up happened like this:

- a GetPage@LSN request arrives. The asynch handler thread calls
  get_page_at_lsn(), which acquires a lock. While holding the lock,
  it calls apply_wal_records().
- apply_wal_records() launches the child process, and waits on it
  using async functions
- more GetPage@LSN requests arrive. They also call get_page_at_lsn().
  But because the lock is already held, they all block

The subsequent GetPage@LSN calls that block waiting on the lock use up
all the async handler threads. All the threads are locked up, so there
is no one left to make progress on the apply_wal_records() call, so it
never releases the lock. Deadlock So my lesson here is that mixing
async and blocking styles is painful. Googling around, this is a well
known problem, there are long philosophical discussions on "what color
is your function".  My plan to fix that is to move the WAL redo into a
separate thread or thread pool, and have the GetPage@LSN handlers
communicate with it using channels.  Having a separate thread pool for
it makes sense anyway in the long run. We'll want to keep the postgres
process around, rather than launch it separately every time we need to
reconstruct a page. Also, when we're not busy reconstructing pages
that are needed right now by GetPage@LSN calls, we want to proactively
apply incoming WAL records from a "backlog".

Solution:

Launch a dedicated thread for WAL redo at startup. It has an event loop,
where it listens on a channel for requests to apply WAL. When a page
server thread needs some WAL to be applied, it sends the request on
the channel, and waits for response. After it's done the WAL redo process
puts the new page image in the page cache, and wakes up the requesting
thread using a condition variable.

This also needed locking changes in the page cache. Each cache entry now
has a reference counter and a dedicated Mutex to protect just the entry.
2021-03-29 15:59:28 +03:00
Heikki Linnakangas
e7694a1d5a page server: use logger module 2021-03-29 15:59:28 +03:00
Stas Kelvich
851e910324 WIP: local control plane 2021-03-29 15:59:28 +03:00
Heikki Linnakangas
8bb282dcad page server: Restore base backup from S3 at page server startup
This includes a "launch.sh" script that I've been using to initialize
and launch the Postgres + Page Server combination.
2021-03-29 15:59:28 +03:00
Heikki Linnakangas
af7ebb6395 Implement WAL redo.
When a page is requested from the page cache (GetPage@LSN), launch postgres
in special "WAL redo" mode to reconstruct that page version.

Plus a bunch of misc fixes to the WAL decoding code.
2021-03-29 15:59:27 +03:00
Stas Kelvich
626b4e9987 basic support of postgres backend protocol 2021-03-29 15:59:27 +03:00
Heikki Linnakangas
3058021ca7 WIP: beginnings of page server page cache 2021-03-29 15:57:15 +03:00
Heikki Linnakangas
9a9480e8c9 Add WIP support for decoding WAL records. 2021-03-29 15:57:15 +03:00
Stas Kelvich
c856a2f2d2 pageserver stub 2021-03-29 15:57:15 +03:00