Commit Graph

92 Commits

Author SHA1 Message Date
Stas Kelvich
dab1f0381c Cache postgres build and cargo deps in CI builds
Currently most of the CI check time is spent on dependency installation and compilation (~10 min total). Use actions/cache@v2 to cache them between checks. This commit sets up two caching targets:

* ./tmp_install, with the postgres build files and installed binaries, uses $runner.os-pg-$pg_submodule_revision as the cache key and will be rebuilt only if the linked submodule revision changes.

* ./target, with the cargo dependencies, uses hash(Cargo.lock) as the cache key and will be rebuilt only when dependencies are updated.

Also add tg notifications in passing.
2021-04-05 22:59:49 +03:00
Heikki Linnakangas
412e4a2ee7 Move the mgmt-console code from the vendor/postgres repository. 2021-04-05 21:27:40 +03:00
Heikki Linnakangas
0bf3c3224e Make the debugging output from WAL receiver a bit nicer. 2021-04-05 17:35:28 +03:00
Stas Kelvich
79e4110cf0 Use immediate postgres stop in tests.
No need to wait for a checkpoint during compute node stop.
2021-04-05 14:11:05 +03:00
Stas Kelvich
bd606ab37a Start pageserver walreceiver from predefined "-inf" lsn.
If we start the walreceiver at identify_system.xlogpos(), we have a race condition with
postgres startup: postgres may request a page that was modified at an LSN
smaller than identify_system.xlogpos().

The current procedure for starting postgres will be changed anyway to something
different, like having an 'initdb' method on the pageserver (or importing some shared
empty database snapshot), so for now I just use the start of the first segment, which
seems to be a valid record position and is strictly before the LSNs of the first records.
2021-04-05 12:37:45 +03:00
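The choice of a predefined start position can be sketched as follows, assuming the default 16 MB WAL segment size and that a freshly initialized cluster writes its first WAL into segment number 1 (as PostgreSQL's initdb does). The helper names are hypothetical, not the pageserver's actual code.

```rust
/// Default PostgreSQL WAL segment size (16 MB). This is an assumption:
/// the segment size is configurable at initdb time.
const WAL_SEGMENT_SIZE: u64 = 16 * 1024 * 1024;

/// LSN at which WAL segment number `segno` begins.
fn segment_start_lsn(segno: u64, seg_size: u64) -> u64 {
    segno * seg_size
}

/// Format an LSN the way PostgreSQL prints it: high and low 32 bits in hex.
fn format_lsn(lsn: u64) -> String {
    format!("{:X}/{:X}", lsn >> 32, lsn & 0xFFFF_FFFF)
}

fn main() {
    // Hypothetical: start streaming from the beginning of segment 1, which
    // lies before the LSNs of the first records in a fresh cluster.
    let start = segment_start_lsn(1, WAL_SEGMENT_SIZE);
    println!("starting WAL receiver at {}", format_lsn(start));
}
```

Starting from a fixed, early position trades some redundant streaming for never missing a record that a concurrently starting postgres might ask about.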
Stas Kelvich
a555b5917f bump pg commit 2021-04-04 10:50:51 +03:00
Konstantin Knizhnik
2c7ed574b8 Fix bug in XLogFromFileName 2021-04-04 08:58:23 +03:00
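For reference, PostgreSQL's XLogFromFileName parses a 24-hex-digit WAL file name into a timeline (first 8 digits), a "log" number and a segment within that log, and computes the segment number as log * (2^32 / segment size) + seg. The function below is a hypothetical Rust port for illustration, not the fixed code from this commit; the 16 MB segment size is an assumption.

```rust
/// Assumed default WAL segment size (16 MB).
const WAL_SEGMENT_SIZE: u64 = 16 * 1024 * 1024;

/// Parse a WAL file name into (timeline, segment number).
/// Returns None if the name is not exactly 24 hex digits.
fn xlog_from_file_name(fname: &str, seg_size: u64) -> Option<(u32, u64)> {
    if fname.len() != 24 || !fname.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    let tli = u32::from_str_radix(&fname[0..8], 16).ok()?;
    let log = u64::from_str_radix(&fname[8..16], 16).ok()?;
    let seg = u64::from_str_radix(&fname[16..24], 16).ok()?;
    // Segments per 4 GB "xlogid", as in PostgreSQL's XLogSegmentsPerXLogId().
    let segs_per_id = 0x1_0000_0000u64 / seg_size;
    Some((tli, log * segs_per_id + seg))
}

fn main() {
    match xlog_from_file_name("000000010000000100000002", WAL_SEGMENT_SIZE) {
        Some((tli, segno)) => println!("timeline {tli}, segment {segno}"),
        None => println!("not a WAL file name"),
    }
}
```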
Stas Kelvich
decb04fc4f fix build 2021-04-03 19:42:45 +03:00
Stas Kelvich
da0decc24e bump pg version: include system_id in getPage requests 2021-04-03 19:15:15 +03:00
Stas Kelvich
2c308da4d2 Support several postgres instances on top of a single pageserver.
Each postgres instance uses its own page cache with associated data
structures. The Postgres system_id is used to distinguish instances.
That also means that a backup should have a valid system_id stashed
somewhere. For now I put '42' as the sys_id during S3 restore, but
that ought to be fixed.

Also this commit introduces a new way of starting WAL receivers:
postgres can initiate such a connection by calling the 'callmemaybe $url'
command in the page_service, which will start the appropriate wal-redo
and wal-receiver threads. This way the page server can start without
a priori knowledge of compute node addresses.
2021-04-03 19:02:44 +03:00
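A minimal sketch of the two ideas in this commit, under stated assumptions: page caches kept in a map keyed by the Postgres system_id, and a `callmemaybe <url>` command parsed out of a page_service request. The names `PageCache`, `PageServer` and `parse_callmemaybe`, and the example URL, are hypothetical; the real command may carry other arguments.

```rust
use std::collections::HashMap;

/// Hypothetical per-instance state; a real page cache holds much more.
#[derive(Default)]
struct PageCache {
    pages_cached: usize,
}

/// One page cache per Postgres instance, keyed by its system_id.
#[derive(Default)]
struct PageServer {
    caches: HashMap<u64, PageCache>,
}

impl PageServer {
    /// Get the cache for `system_id`, creating it on first use.
    fn cache_for(&mut self, system_id: u64) -> &mut PageCache {
        self.caches.entry(system_id).or_default()
    }
}

/// Parse a `callmemaybe <url>` command and return the URL to connect back to.
fn parse_callmemaybe(cmd: &str) -> Option<&str> {
    let url = cmd.trim().strip_prefix("callmemaybe ")?.trim();
    if url.is_empty() { None } else { Some(url) }
}

fn main() {
    let mut ps = PageServer::default();
    // 42 mirrors the placeholder sys_id used during S3 restore.
    ps.cache_for(42).pages_cached += 1;
    if let Some(url) = parse_callmemaybe("callmemaybe postgres://localhost:5432") {
        // Here the page server would spawn the wal-redo and wal-receiver threads.
        println!("would start WAL receiver for {url}");
    }
}
```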
Konstantin Knizhnik
6eabe17e98 Fix bugs in wal_acceptor WAL parser 2021-04-02 20:13:20 +03:00
anastasia
4566a0a160 update vendor/postgres: merge compute_node branch 2021-04-02 15:19:16 +03:00
anastasia
dc4b5f5f23 use pg_resetwal to generate pg_control in the test 2021-04-02 15:09:20 +03:00
anastasia
9fdf1964a7 generate controlfile using pg_resetwal 2021-04-02 14:57:16 +03:00
anastasia
2d8a19affa add protocol message to receive pg_control 2021-04-02 14:56:32 +03:00
Stas Kelvich
e2ce9e562e remove unused modules 2021-04-02 10:38:51 +03:00
Konstantin Knizhnik
13f507f0b4 Calculate records CRC in wal decoder 2021-04-02 10:30:56 +03:00
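PostgreSQL protects each WAL record with a CRC-32C (Castagnoli) checksum, computed over the record payload first and then over the header bytes preceding the xl_crc field. Below is a minimal bitwise sketch of that computation; it is not this project's decoder, and a real implementation would use a table-driven or hardware-accelerated CRC. The function names are hypothetical.

```rust
/// Update a running (un-finalized) CRC-32C state with `data`.
/// Uses the reflected Castagnoli polynomial 0x82F63B78.
fn crc32c_update(mut crc: u32, data: &[u8]) -> u32 {
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all ones if the low bit is set
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    crc
}

/// One-shot CRC-32C: initialize, update, finalize.
fn crc32c(data: &[u8]) -> u32 {
    crc32c_update(0xFFFF_FFFF, data) ^ 0xFFFF_FFFF
}

/// Record checksum in PostgreSQL's order: payload first, then the header
/// bytes that precede the xl_crc field.
fn record_crc(header_before_crc: &[u8], payload: &[u8]) -> u32 {
    let crc = crc32c_update(0xFFFF_FFFF, payload);
    crc32c_update(crc, header_before_crc) ^ 0xFFFF_FFFF
}

fn main() {
    println!("{:08X}", crc32c(b"123456789"));
}
```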
Konstantin Knizhnik
02ca245081 Port wal_acceptor to rust 2021-04-02 10:30:56 +03:00
Heikki Linnakangas
08e59f5674 If WAL streaming connection is lost, restart at right place.
We need to restart at the end of the last complete WAL record, not in the
middle of a record, if we had previously streamed a partial record.
2021-03-31 21:27:26 +03:00
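The bookkeeping this implies can be sketched as follows, assuming the receiver tracks both the end of the last fully decoded record and how far WAL has been received; on reconnect it resumes from the former, so a partially streamed record is requested again in full. All names are hypothetical.

```rust
/// Hypothetical WAL receiver position tracking.
struct WalReceiverState {
    /// End LSN of the last fully decoded WAL record.
    last_record_end: u64,
    /// Highest LSN received so far; may point into the middle of a record.
    received_up_to: u64,
}

impl WalReceiverState {
    fn new(start_lsn: u64) -> Self {
        WalReceiverState { last_record_end: start_lsn, received_up_to: start_lsn }
    }

    /// Called for every chunk of WAL bytes that arrives.
    fn wal_received(&mut self, up_to: u64) {
        self.received_up_to = self.received_up_to.max(up_to);
    }

    /// Called when the decoder completes a record ending at `end_lsn`.
    fn record_decoded(&mut self, end_lsn: u64) {
        self.last_record_end = end_lsn;
    }

    /// Position to resume streaming from after the connection was lost.
    /// Bytes of a partial record past `last_record_end` are discarded and
    /// streamed again, so the decoder never starts mid-record.
    fn restart_lsn(&mut self) -> u64 {
        self.received_up_to = self.last_record_end;
        self.last_record_end
    }
}

fn main() {
    let mut state = WalReceiverState::new(0x10000);
    state.wal_received(0x25000); // one full record plus part of the next
    state.record_decoded(0x20000);
    let lsn = state.restart_lsn(); // resume at 0/20000, not 0/25000
    println!("restart at {:X}/{:X}", lsn >> 32, lsn as u32);
}
```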
Heikki Linnakangas
5f272380a2 Don't panic on XLOG_SWITCH records. 2021-03-31 20:00:02 +03:00
Heikki Linnakangas
7353098b47 Fix typo 2021-03-31 18:50:22 +03:00
Stas Kelvich
9175fb5ea7 enable test_regress in CI 2021-03-31 17:01:00 +03:00
Stas Kelvich
bef2731880 bump vendor/postgres 2021-03-31 16:55:15 +03:00
Heikki Linnakangas
d97bd869ae Fix confusion about what a record's LSN means.
A WAL record's LSN is the *end* of the record (exclusive), not the
beginning. The WAL receiver and redo code were confused about that, and
sometimes returned the wrong page version because of it.
2021-03-31 16:54:31 +03:00
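To illustrate the convention with a small sketch (names hypothetical): if each page version is tagged with the end LSN of the record that produced it, a GetPage@LSN request at LSN X sees exactly the versions whose tag is <= X. Tagging with the start LSN instead would wrongly expose a record that was not yet complete at X.

```rust
/// Hypothetical: one version of a page, tagged with the LSN of the WAL
/// record that produced it. That LSN is the *end* of the record (exclusive).
struct PageVersion {
    lsn: u64,
    image: &'static str,
}

/// Newest version visible at `request_lsn`: the last one whose record ended
/// at or before the request LSN. `versions` must be sorted by LSN.
fn page_at_lsn(versions: &[PageVersion], request_lsn: u64) -> Option<&PageVersion> {
    versions.iter().take_while(|v| v.lsn <= request_lsn).last()
}

fn main() {
    // Records spanning 0/10000..0/20000 and 0/20000..0/30000,
    // each identified by its end LSN.
    let versions = [
        PageVersion { lsn: 0x20000, image: "after record 1" },
        PageVersion { lsn: 0x30000, image: "after record 2" },
    ];
    for &req in &[0x15000u64, 0x20000, 0x30000] {
        let img = page_at_lsn(&versions, req).map(|v| v.image);
        println!("{req:X}: {img:?}");
    }
}
```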
Heikki Linnakangas
bfb20522eb Advance "last valid LSN", even if it's not at WAL record boundary.
The GetPage@LSN requests used the last flushed WAL position as the request LSN,
but the last flushed WAL position might point into the middle of a WAL record
(most likely at a page boundary). However, we used to update the "last valid
LSN" only after fully decoding a record. As a result, this could happen:

1. Postgres generates two WAL records. They span from 0/10000 to 0/20000, and
   from 0/20000 to 0/30000.

2. Postgres flushes the WAL to 0/25000.

3. Page server receives the WAL up to 0/25000. It decodes the first WAL
   record and advances the last valid LSN to the end of that record, 0/20000.

4. Postgres issues a GetPage@LSN request, using 0/25000 as the request LSN.

5. The GetPage@LSN request is stuck in the page server, because the last valid
   LSN is 0/20000, and the request LSN is 0/25000.

This situation gets unwedged when something kicks a new WAL flush in the
Postgres server, like a new transaction. But that can take a long time.

Fix by updating the last valid LSN to the last received LSN, even if it
points in the middle of a record.
2021-03-31 16:54:20 +03:00
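The fix can be sketched with a mutex and condition variable from the standard library, under the assumption that GetPage@LSN blocks until the last valid LSN reaches the request LSN. The WAL receiver advances that LSN as soon as bytes arrive instead of only at record boundaries. `LastValidLsn` and its methods are hypothetical names.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical shared "last valid LSN" that GetPage@LSN requests wait on.
struct LastValidLsn {
    lsn: Mutex<u64>,
    advanced: Condvar,
}

impl LastValidLsn {
    fn new(lsn: u64) -> Self {
        LastValidLsn { lsn: Mutex::new(lsn), advanced: Condvar::new() }
    }

    /// Called by the WAL receiver for every chunk received, even if the
    /// chunk ends in the middle of a WAL record.
    fn advance(&self, received_up_to: u64) {
        let mut lsn = self.lsn.lock().unwrap();
        if received_up_to > *lsn {
            *lsn = received_up_to;
            self.advanced.notify_all();
        }
    }

    /// Called by GetPage@LSN: block until WAL up to `request_lsn` has arrived.
    fn wait_for(&self, request_lsn: u64) -> u64 {
        let guard = self.lsn.lock().unwrap();
        let guard = self
            .advanced
            .wait_while(guard, |lsn| *lsn < request_lsn)
            .unwrap();
        *guard
    }
}

fn main() {
    let shared = Arc::new(LastValidLsn::new(0x10000));
    let receiver = Arc::clone(&shared);
    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        // WAL flushed to 0/25000 arrives: the first record (ending at
        // 0/20000) is complete, the second is partial. Advance anyway.
        receiver.advance(0x25000);
    });
    // The request LSN is the last flushed position, 0/25000.
    let reached = shared.wait_for(0x25000);
    println!("unblocked at {:X}", reached);
    handle.join().unwrap();
}
```

Had `advance` been called only with record end LSNs (0/20000 here), the waiter above would block until Postgres flushed more WAL, which is the stall described in the commit message.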
Heikki Linnakangas
52e754f301 Make page server startup less noisy. 2021-03-31 16:53:29 +03:00
Stas Kelvich
98b8426780 Ignore test_pageserver::test_regress as it fails now 2021-03-31 15:30:35 +03:00
Stas Kelvich
98cc8400f4 Move regression tests output to tmp_check/regress 2021-03-31 15:13:50 +03:00
Heikki Linnakangas
cd98818a22 Add @LSN argument to GetPage requests 2021-03-31 12:13:10 +03:00
Heikki Linnakangas
9a8bda2938 Break out of busy loop, if the page-service connection is lost. 2021-03-31 12:13:10 +03:00
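A common cause of such a busy loop is ignoring end-of-file on the socket. The sketch below assumes a blocking reader built on std::io::Read, where a read that returns 0 bytes means the peer closed the connection; the function name is hypothetical and the page service itself may be structured differently.

```rust
use std::io::{self, Read};

/// Read from a connection until the peer disconnects; returns total bytes.
fn serve_connection<R: Read>(mut conn: R) -> io::Result<usize> {
    let mut buf = [0u8; 8192];
    let mut total = 0;
    loop {
        let n = conn.read(&mut buf)?;
        if n == 0 {
            // EOF: the connection is gone. Without this check the loop would
            // keep calling read() and getting 0 bytes back forever.
            break;
        }
        total += n;
        // ... handle the message bytes in buf[..n] ...
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    // A byte slice implements Read and reports EOF once exhausted.
    let total = serve_connection(&b"page request"[..])?;
    println!("connection closed after {total} bytes");
    Ok(())
}
```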
Heikki Linnakangas
98fd4aeffe Use write_all() when sending messages to Postgres WAL process.
tokio's AsyncWriteExt::write() function may do a short write if the pipe's
buffer is full.
2021-03-31 12:13:10 +03:00
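The difference can be demonstrated with std::io::Write, whose write() and write_all() follow the same contract as tokio's AsyncWriteExt counterparts: write() may accept only part of the buffer, while write_all() keeps writing until everything is sent. `ShortWriter` is a hypothetical stand-in for a pipe with limited free buffer space.

```rust
use std::io::{self, Write};

/// Accepts at most `max_per_call` bytes per write() call, like a pipe
/// whose buffer has little free space.
struct ShortWriter {
    data: Vec<u8>,
    max_per_call: usize,
}

impl Write for ShortWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = buf.len().min(self.max_per_call);
        self.data.extend_from_slice(&buf[..n]);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let msg = b"a WAL redo request message";

    let mut pipe = ShortWriter { data: Vec::new(), max_per_call: 4 };
    let n = pipe.write(msg)?; // short write: only 4 bytes accepted
    println!("write() accepted {n} of {} bytes", msg.len());

    let mut pipe = ShortWriter { data: Vec::new(), max_per_call: 4 };
    pipe.write_all(msg)?; // loops until the whole message is written
    assert_eq!(pipe.data.as_slice(), &msg[..]);
    Ok(())
}
```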
Stas Kelvich
1348915655 add regression tests runner 2021-03-31 12:13:10 +03:00
anastasia
91700e56de fix postgres submodule state 2021-03-30 18:59:35 +03:00
anastasia
e84e2d7d4d make pgbuild.sh less noisy 2021-03-30 17:02:09 +03:00
Stas Kelvich
924962de06 Update README.md 2021-03-30 03:22:07 +03:00
Stas Kelvich
4ea713b720 run tests with github actions, take 3 2021-03-29 22:46:58 +03:00
Stas Kelvich
50b3b4a3c2 run tests with github actions, take 2 2021-03-29 22:43:16 +03:00
Stas Kelvich
98b0d3d32c run tests with github actions 2021-03-29 22:40:10 +03:00
Stas Kelvich
5912d0b9da switch pg submodule to our branch 2021-03-29 17:56:19 +03:00
Stas Kelvich
9038116714 dockerfile 2021-03-29 15:59:28 +03:00
anastasia
d1ef8a1784 [issue #7] CLI parse args for pg subcommand 2021-03-29 15:59:28 +03:00
anastasia
853c130ff0 [issue #7] CLI parse subcommands 2021-03-29 15:59:28 +03:00
anastasia
c018247c56 [issue #7] CLI first commit 2021-03-29 15:59:28 +03:00
Heikki Linnakangas
e4d90fe0bb Add timeouts to WAL redo, to prevent it from getting stuck.
It currently gets stuck at least in the spgist index. Not sure why,
that needs to be investigated, but having a timeout is a good idea anyway.
2021-03-29 15:59:28 +03:00
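A minimal sketch of a timeout on a redo request using only the standard library, assuming the redo result comes back over a channel: recv_timeout() returns an error after the deadline instead of blocking forever. The channel-based design and the `apply_with_timeout` name are assumptions, not how this code base talks to the WAL redo process.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical: wait for a redo result, failing if it takes too long.
fn apply_with_timeout(
    rx: &mpsc::Receiver<Vec<u8>>,
    timeout: Duration,
) -> Result<Vec<u8>, String> {
    match rx.recv_timeout(timeout) {
        Ok(page) => Ok(page),
        Err(RecvTimeoutError::Timeout) => Err(format!("WAL redo timed out after {timeout:?}")),
        Err(RecvTimeoutError::Disconnected) => Err("WAL redo process exited".to_string()),
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Simulate a redo worker that is too slow to answer in time.
    let worker = thread::spawn(move || {
        thread::sleep(Duration::from_millis(200));
        let _ = tx.send(vec![0u8; 8192]);
    });
    match apply_with_timeout(&rx, Duration::from_millis(50)) {
        Ok(_) => println!("page reconstructed"),
        Err(e) => eprintln!("error: {e}"),
    }
    worker.join().unwrap();
}
```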
Heikki Linnakangas
95c1ef5bb7 Don't panic if we receive a duplicate WAL record.
This currently happens sometimes. I'm not sure why, and that needs to
be investigated, but just silence the panic for now.
2021-03-29 15:59:28 +03:00
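A sketch of tolerating duplicates instead of panicking, assuming records arrive in LSN order so that a record whose LSN is not beyond the last applied one can only be a repeat. Note that under this assumption an out-of-order record would also be skipped. `RecordApplier` is a hypothetical name.

```rust
/// Hypothetical: remembers the LSN of the last applied WAL record.
struct RecordApplier {
    last_applied_lsn: u64,
    applied: Vec<u64>,
}

impl RecordApplier {
    /// Apply a record identified by its (end) LSN. A duplicate is logged and
    /// skipped rather than triggering a panic. Returns true if applied.
    fn apply(&mut self, lsn: u64) -> bool {
        if lsn <= self.last_applied_lsn {
            eprintln!("warning: ignoring duplicate WAL record at {:X}/{:X}", lsn >> 32, lsn as u32);
            return false;
        }
        self.last_applied_lsn = lsn;
        self.applied.push(lsn);
        true
    }
}

fn main() {
    let mut applier = RecordApplier { last_applied_lsn: 0, applied: Vec::new() };
    for lsn in [0x20000u64, 0x20000, 0x30000] {
        applier.apply(lsn);
    }
    println!("applied {:?}", applier.applied);
}
```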
Stas Kelvich
a482c3256c clean up tests a bit; drop username dependency 2021-03-29 15:59:28 +03:00
Stas Kelvich
c8eeb8573d add 'test_' prefix to test files to avoid lots of similarly-named files in repo 2021-03-29 15:59:28 +03:00
Heikki Linnakangas
6e29523fba If GetPage@LSN fails, print an error to the log, too. 2021-03-29 15:59:28 +03:00
Stas Kelvich
d42f2b431f pg + pageserver integration testing 2021-03-29 15:59:28 +03:00
Stas Kelvich
a0de2b6255 redirect logging to file in daemon mode 2021-03-29 15:59:28 +03:00