rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-22 21:59:59 +00:00

Author	SHA1	Message	Date
Stas Kelvich	bef2731880	bump vendor/postgres	2021-03-31 16:55:15 +03:00
Heikki Linnakangas	d97bd869ae	Fix confusion on what record's LSN means. A WAL record's LSN is the end of the record (exclusive), not the beginning. The WAL receiver and redo code were confused on that, and sometimes returned wrong page version because of that.	2021-03-31 16:54:31 +03:00
Heikki Linnakangas	bfb20522eb	Advance "last valid LSN", even if it's not at WAL record boundary. The GetPage@LSN requests used last flushed WAL position as the request LSN, but the last flushed WAL position might point in the middle of a WAL record (most likely at a page boundary). But we used to only update the "last valid LSN" after fully decoding a record. As a result, this could happen: 1. Postgres generates two WAL record. They span from 0/10000 to 0/20000, and from 0/20000 to 0/30000. 2. Postgres flushes the WAL to 0/25000. 3. Page server receives the WAL up to 0/25000. It decodes the first WAL record and advances the last valid LSN to the end of that record, 0/20000 3. Postgres issues a GetPage@LSN request, using 0/15000 as the request LSN. 4. The GetPage@LSN request is stuck in the page server, because last valid LSN is 0/10000, and the request LSN is 0/15000. This situation gets unwedged when something kicks a new WAL flush in the Postgres server, like a new transaction. But that can take a long time. Fix by updating the last valid LSN to the last received LSN, even if it points in the middle of a record.	2021-03-31 16:54:20 +03:00
Heikki Linnakangas	52e754f301	Make page server startup less noisy.	2021-03-31 16:53:29 +03:00
Stas Kelvich	98b8426780	Ignore test_pageserver::test_regress as it fails now	2021-03-31 15:30:35 +03:00
Stas Kelvich	98cc8400f4	Move regression tests output to tmp_check/regress	2021-03-31 15:13:50 +03:00
Heikki Linnakangas	cd98818a22	Add @LSN argument to GetPage requests	2021-03-31 12:13:10 +03:00
Heikki Linnakangas	9a8bda2938	Break out of busy loop, if the page-service connection is lost.	2021-03-31 12:13:10 +03:00
Heikki Linnakangas	98fd4aeffe	Use write_all() when sending messages to Postgres WAL process. tokio::io:AsyncWrite.read() function will do a short write, if the pipe's buffer is full.	2021-03-31 12:13:10 +03:00
Stas Kelvich	1348915655	add regression tests runner	2021-03-31 12:13:10 +03:00
anastasia	91700e56de	fix postgres submodule state	2021-03-30 18:59:35 +03:00
anastasia	e84e2d7d4d	make pgbuild.sh less noisy	2021-03-30 17:02:09 +03:00
Stas Kelvich	924962de06	Update README.md	2021-03-30 03:22:07 +03:00
Stas Kelvich	4ea713b720	run tests with github actions, take 3	2021-03-29 22:46:58 +03:00
Stas Kelvich	50b3b4a3c2	run tests with github actions, take 2	2021-03-29 22:43:16 +03:00
Stas Kelvich	98b0d3d32c	run tests with github actions	2021-03-29 22:40:10 +03:00
Stas Kelvich	5912d0b9da	switch pg submodule to our branch	2021-03-29 17:56:19 +03:00
Stas Kelvich	9038116714	dockerfile	2021-03-29 15:59:28 +03:00
anastasia	d1ef8a1784	[issue #7 ] CLI parse args for pg subcommand	2021-03-29 15:59:28 +03:00
anastasia	853c130ff0	[issue #7 ] CLI parse subcommands	2021-03-29 15:59:28 +03:00
anastasia	c018247c56	[issue #7 ] CLI first commit	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	e4d90fe0bb	Add timeouts to WAL redo, to prevent it from getting stuck. It is getting stuck at least in spgist index currently. Not sure why, that needs to be investigated, but having a timeout is a good idea anyway.	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	95c1ef5bb7	Don't panic if we receive a duplicate WAL record. This is currently happening sometimes. I'm not sure why, and that needs to be investigated, but just shut up the panic for now.	2021-03-29 15:59:28 +03:00
Stas Kelvich	a482c3256c	clean up tests a bit; drop username dependency	2021-03-29 15:59:28 +03:00
Stas Kelvich	c8eeb8573d	add 'test_' prexix to test files to avoid lots of similarly-name files in repo	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	6e29523fba	If GetPage@LSN fails, print an error to the log, too.	2021-03-29 15:59:28 +03:00
Stas Kelvich	d42f2b431f	pg + pageserver integration testing	2021-03-29 15:59:28 +03:00
Stas Kelvich	a0de2b6255	redirect logging to file in daemon mode	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	e74a812850	Handle WAL redo errors more gracefully. Don't panic, but return an error to the caller, all the way up to the processing node instance that did the GetPage@LSN call.	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	53b31f60be	Add basic metrics to page cache, and show them in the TUI. In the passing, tefactor the code to render the widgets slightly.	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	bfc2ed26cf	Reuse postgres --wal-redo process. Use it for 100 requests before restarting the process.	2021-03-29 15:59:28 +03:00
Stas Kelvich	667ec0db60	start pageserver and pg on top of it	2021-03-29 15:59:28 +03:00
Stas Kelvich	8a80d055b9	daemon mode for pageserver	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	0cec3b5ed7	Update dependencies with "cargo update". Notably, this updates the version of rust-postgres so that it doesn't panic if the replication connection is lost unexpectedly.	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	4c0be32bf5	Implement Text User Interface to show log streams in multiple "windows" Switch to 'slog' crate for logging, it gives us the flexibility that we need for the widget to scroll logs on TUI	2021-03-29 15:59:28 +03:00
Stas Kelvich	9e89c1e2cd	add CLI options to pageserver	2021-03-29 15:59:28 +03:00
Stas Kelvich	a522961d0c	add test stubs	2021-03-29 15:59:28 +03:00
Stas Kelvich	efe9fdbbe6	Add 'make postgres' Makefile target. That would build postgres and install it into REPO_ROOT/tmp_install where pageserver integration tests would be able to find it.	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	22a0da114f	page server: Name the threads for debugging.	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	375f8e3134	add comment	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	303a546aba	Refactor locking in page cache, and use async I/O for WAL redo Story on why: The apply_wal_records() function spawned the special postgres process to perform WAL redo. That was done in a blocking fashion: it launches the process, then it writes the command to its stdin, then it reads the result from its stdout. I wanted to also read the child process's stderr, and forward it to the page server's log (which is just the page server's stderr ATM). That has classic potential for deadlock: the child process might block trying to write to stderr/stdout, if the parent isn't reading it. So the parent needs to perform the read/write with the child's stdin/stdout/stderr in an async fashion. So I refactored the code in walredo.c into async style. But it started to hang. It took me a while to figure it out; async makes for really ugly stacktraces, it's hard to figure out what's going on. The call path goes like this: Page service -> get_page_at_lsn() in page cache -> apply_wal_records() the page service is written in async style. And I refactored apply_wal_recorsds() to also be async. BUT, get_page_at_lsn() acquires a lock, in a blocking fashion. The lock-up happened like this: - a GetPage@LSN request arrives. The asynch handler thread calls get_page_at_lsn(), which acquires a lock. While holding the lock, it calls apply_wal_records(). - apply_wal_records() launches the child process, and waits on it using async functions - more GetPage@LSN requests arrive. They also call get_page_at_lsn(). But because the lock is already held, they all block The subsequent GetPage@LSN calls that block waiting on the lock use up all the async handler threads. All the threads are locked up, so there is no one left to make progress on the apply_wal_records() call, so it never releases the lock. Deadlock So my lesson here is that mixing async and blocking styles is painful. Googling around, this is a well known problem, there are long philosophical discussions on "what color is your function". My plan to fix that is to move the WAL redo into a separate thread or thread pool, and have the GetPage@LSN handlers communicate with it using channels. Having a separate thread pool for it makes sense anyway in the long run. We'll want to keep the postgres process around, rather than launch it separately every time we need to reconstruct a page. Also, when we're not busy reconstructing pages that are needed right now by GetPage@LSN calls, we want to proactively apply incoming WAL records from a "backlog". Solution: Launch a dedicated thread for WAL redo at startup. It has an event loop, where it listens on a channel for requests to apply WAL. When a page server thread needs some WAL to be applied, it sends the request on the channel, and waits for response. After it's done the WAL redo process puts the new page image in the page cache, and wakes up the requesting thread using a condition variable. This also needed locking changes in the page cache. Each cache entry now has a reference counter and a dedicated Mutex to protect just the entry.	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	1c1812df05	Fix decoding WAL segment header	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	e7694a1d5a	page server: use logger module	2021-03-29 15:59:28 +03:00
Stas Kelvich	6fc00e855c	[WIP] local control plane: clean up on exit, use whoami user name	2021-03-29 15:59:28 +03:00
Stas Kelvich	851e910324	WIP: local control plane	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	8bb282dcad	page server: Restore base backup from S3 at page server startup This includes a "launch.sh" script that I've been using to initialize and launch the Postgres + Page Server combination.	2021-03-29 15:59:28 +03:00
Stas Kelvich	0298a6dad6	smgr <-> page service	2021-03-29 15:59:27 +03:00
Heikki Linnakangas	ab3434f30f	Fix instructions in the page server README	2021-03-29 15:59:27 +03:00
Heikki Linnakangas	1016e69bbf	Add instructions on how to run the Page Server	2021-03-29 15:59:27 +03:00
Stas Kelvich	a11ef162f7	stubs for smgr proto	2021-03-29 15:59:27 +03:00

... 166 167 168 169 170

8470 Commits