Commit Graph

16 Commits

Author SHA1 Message Date
Heikki Linnakangas
1367332447 Separate walkeeper and pageserver sources into different directories.
The integration tests, which depend on both walkeeper and pageserver,
are moved into yet another directory, 'integration_tests'.
2021-04-06 13:15:26 +03:00
Konstantin Knizhnik
13f507f0b4 Calculate records CRC in wal decoder 2021-04-02 10:30:56 +03:00
Konstantin Knizhnik
02ca245081 Port wal_acceptor to rust 2021-04-02 10:30:56 +03:00
anastasia
853c130ff0 [issue #7] CLI parse subcommands 2021-03-29 15:59:28 +03:00
Stas Kelvich
8a80d055b9 daemon mode for pageserver 2021-03-29 15:59:28 +03:00
Heikki Linnakangas
4c0be32bf5 Implement Text User Interface to show log streams in multiple "windows"
Switch to the 'slog' crate for logging; it gives us the flexibility that we
need for the widget to scroll logs in the TUI.
2021-03-29 15:59:28 +03:00
Stas Kelvich
9e89c1e2cd add CLI options to pageserver 2021-03-29 15:59:28 +03:00
Heikki Linnakangas
303a546aba Refactor locking in page cache, and use async I/O for WAL redo
Story on why:

The apply_wal_records() function spawned the special postgres process
to perform WAL redo. That was done in a blocking fashion: it launches
the process, then it writes the command to its stdin, then it reads
the result from its stdout.  I wanted to also read the child process's
stderr, and forward it to the page server's log (which is just the
page server's stderr ATM). That has classic potential for deadlock:
the child process might block trying to write to stderr/stdout, if the
parent isn't reading it. So the parent needs to perform the read/write
with the child's stdin/stdout/stderr in an async fashion.  So I
refactored the code in walredo.c into async style.  But it started to
hang. It took me a while to figure it out; async makes for really ugly
stacktraces, it's hard to figure out what's going on. The call path
goes like this: Page service -> get_page_at_lsn() in page cache ->
apply_wal_records(). The page service is written in async style, and I
refactored apply_wal_records() to also be async. BUT
get_page_at_lsn() acquires a lock in a blocking fashion.
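
The stdin/stdout/stderr deadlock mentioned above can be illustrated with a minimal sketch. This is not the code from this commit, which uses async I/O. It uses plain threads instead, one per pipe, so that no pipe can fill up while the parent is busy with another one. `run_child` is a hypothetical helper, and `sh` merely stands in for the postgres WAL redo process.

```rust
use std::io::{self, Read, Write};
use std::process::{Command, Stdio};
use std::thread;

/// Hypothetical helper: feed `input` to the child's stdin and collect
/// everything it writes to stdout and stderr, without risking a pipe deadlock.
fn run_child(program: &str, args: &[&str], input: &[u8]) -> io::Result<(Vec<u8>, Vec<u8>)> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    let mut stdin = child.stdin.take().expect("stdin was piped");
    let mut stdout = child.stdout.take().expect("stdout was piped");
    let mut stderr = child.stderr.take().expect("stderr was piped");

    // Write the request on its own thread. `stdin` is dropped when the
    // closure returns, which closes the pipe and signals EOF to the child.
    let input = input.to_vec();
    let writer = thread::spawn(move || stdin.write_all(&input));

    // Drain stderr on its own thread so the child never blocks writing to it.
    let err_reader = thread::spawn(move || {
        let mut buf = Vec::new();
        stderr.read_to_end(&mut buf).map(|_| buf)
    });

    // Drain stdout on the calling thread.
    let mut out = Vec::new();
    stdout.read_to_end(&mut out)?;

    writer.join().expect("writer thread panicked")?;
    let err = err_reader.join().expect("stderr thread panicked")?;
    child.wait()?;
    Ok((out, err))
}

fn main() -> io::Result<()> {
    // Assumes a Unix-like system where `sh` and `cat` are available.
    let (out, err) = run_child("sh", &["-c", "cat; echo redo-log >&2"], b"page")?;
    println!("stdout: {:?}", String::from_utf8_lossy(&out));
    println!("stderr: {:?}", String::from_utf8_lossy(&err));
    Ok(())
}
```

In the page server, the collected stderr would be forwarded to the log rather than returned to the caller.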

The lock-up happened like this:

- a GetPage@LSN request arrives. The async handler thread calls
  get_page_at_lsn(), which acquires a lock. While holding the lock,
  it calls apply_wal_records().
- apply_wal_records() launches the child process, and waits on it
  using async functions
- more GetPage@LSN requests arrive. They also call get_page_at_lsn().
  But because the lock is already held, they all block

The subsequent GetPage@LSN calls that block waiting on the lock use up
all the async handler threads. All the threads are locked up, so there
is no one left to make progress on the apply_wal_records() call, so it
never releases the lock. Deadlock.

So my lesson here is that mixing async and blocking styles is painful.
Googling around, this is a well-known problem; there are long
philosophical discussions on "what color is your function".

My plan to fix that is to move the WAL redo into a separate thread or
thread pool, and have the GetPage@LSN handlers communicate with it using
channels. Having a separate thread pool for it makes sense anyway in the
long run. We'll want to keep the postgres process around, rather than
launch it separately every time we need to reconstruct a page. Also,
when we're not busy reconstructing pages that are needed right now by
GetPage@LSN calls, we want to proactively apply incoming WAL records
from a "backlog".

Solution:

Launch a dedicated thread for WAL redo at startup. It has an event loop,
where it listens on a channel for requests to apply WAL. When a page
server thread needs some WAL to be applied, it sends the request on
the channel, and waits for the response. After it's done, the WAL redo thread
puts the new page image in the page cache, and wakes up the requesting
thread using a condition variable.
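
A minimal sketch of this design, using only the standard library, might look as follows. The names `CacheEntry`, `RedoRequest`, `spawn_wal_redo_thread` and `get_page` are hypothetical, and `apply_wal_records` here is a stand-in that simply appends the records to the base image instead of talking to a postgres process.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical cache entry: the page image is filled in by the redo thread.
struct CacheEntry {
    page_image: Mutex<Option<Vec<u8>>>,
    ready: Condvar,
}

/// Hypothetical request: the entry to fill in and the WAL records to apply.
struct RedoRequest {
    entry: Arc<CacheEntry>,
    base_image: Vec<u8>,
    records: Vec<Vec<u8>>,
}

/// Stand-in for real WAL redo: appends each record to the base image.
fn apply_wal_records(base: &[u8], records: &[Vec<u8>]) -> Vec<u8> {
    let mut page = base.to_vec();
    for rec in records {
        page.extend_from_slice(rec);
    }
    page
}

/// Start the dedicated redo thread and return the sending half of its channel.
fn spawn_wal_redo_thread() -> Sender<RedoRequest> {
    let (tx, rx) = channel::<RedoRequest>();
    thread::spawn(move || {
        // Event loop: runs until every Sender has been dropped.
        for req in rx {
            let page = apply_wal_records(&req.base_image, &req.records);
            // Store the new page image, then wake up whoever is waiting for it.
            *req.entry.page_image.lock().unwrap() = Some(page);
            req.entry.ready.notify_all();
        }
    });
    tx
}

/// Called by a page server thread: send the request, then block on the
/// condition variable until the redo thread has produced the page.
fn get_page(tx: &Sender<RedoRequest>, base_image: Vec<u8>, records: Vec<Vec<u8>>) -> Vec<u8> {
    let entry = Arc::new(CacheEntry {
        page_image: Mutex::new(None),
        ready: Condvar::new(),
    });
    tx.send(RedoRequest { entry: Arc::clone(&entry), base_image, records })
        .expect("WAL redo thread has exited");

    let mut guard = entry.page_image.lock().unwrap();
    while guard.is_none() {
        guard = entry.ready.wait(guard).unwrap();
    }
    let page = (*guard).clone().expect("page image was just checked");
    page
}

fn main() {
    let tx = spawn_wal_redo_thread();
    let page = get_page(&tx, b"base".to_vec(), vec![b"+r1".to_vec(), b"+r2".to_vec()]);
    assert_eq!(page, b"base+r1+r2".to_vec());
}
```

Because the requesting thread waits on a per-entry condition variable rather than on a lock held across the redo, other requests are not blocked behind it.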

This also needed locking changes in the page cache. Each cache entry now
has a reference counter and a dedicated Mutex to protect just the entry.
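
As a rough illustration of that locking scheme, the sketch below assumes that `Arc` provides the reference counter and that the map lock is held only long enough to look up or create an entry. `PageCache`, `EntryContent`, `get_entry` and the `u32` block-number key are hypothetical names chosen for the example, not taken from the commit.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical per-entry state, protected by the entry's own Mutex.
#[derive(Default)]
struct EntryContent {
    page_image: Option<Vec<u8>>,
}

/// `Arc` supplies the reference count; the `Mutex` protects just this entry.
type CacheEntry = Arc<Mutex<EntryContent>>;

#[derive(Default)]
struct PageCache {
    // Hypothetical key: block number.
    entries: Mutex<HashMap<u32, CacheEntry>>,
}

impl PageCache {
    /// Look up (or create) an entry. The map lock is released on return,
    /// so slow work on one entry does not block lookups of other entries.
    fn get_entry(&self, blkno: u32) -> CacheEntry {
        let mut map = self.entries.lock().unwrap();
        let entry = Arc::clone(map.entry(blkno).or_default());
        entry
    }
}

fn main() {
    let cache = PageCache::default();
    let a = cache.get_entry(1);
    let b = cache.get_entry(1);

    // Both handles refer to the same entry; only that entry is locked here.
    a.lock().unwrap().page_image = Some(vec![42]);
    assert_eq!(b.lock().unwrap().page_image, Some(vec![42]));

    // One reference held by the map, plus `a` and `b`.
    assert_eq!(Arc::strong_count(&a), 3);
}
```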
2021-03-29 15:59:28 +03:00
Heikki Linnakangas
e7694a1d5a page server: use logger module 2021-03-29 15:59:28 +03:00
Stas Kelvich
851e910324 WIP: local control plane 2021-03-29 15:59:28 +03:00
Heikki Linnakangas
8bb282dcad page server: Restore base backup from S3 at page server startup
This includes a "launch.sh" script that I've been using to initialize
and launch the Postgres + Page Server combination.
2021-03-29 15:59:28 +03:00
Heikki Linnakangas
af7ebb6395 Implement WAL redo.
When a page is requested from the page cache (GetPage@LSN), launch postgres
in special "WAL redo" mode to reconstruct that page version.

Plus a bunch of misc fixes to the WAL decoding code.
2021-03-29 15:59:27 +03:00
Stas Kelvich
626b4e9987 basic support of postgres backend protocol 2021-03-29 15:59:27 +03:00
Heikki Linnakangas
3058021ca7 WIP: beginnings of page server page cache 2021-03-29 15:57:15 +03:00
Heikki Linnakangas
9a9480e8c9 Add WIP support for decoding WAL records. 2021-03-29 15:57:15 +03:00
Stas Kelvich
c856a2f2d2 pageserver stub 2021-03-29 15:57:15 +03:00