Commit Graph

112 Commits

Author SHA1 Message Date
anastasia
a267dfa41f code cleanup for compute_node_rebase branch 2021-04-09 17:25:41 +03:00
anastasia
1b9eb9430c 1. Handle SLRU and nonrel files as pageserver pages: upload them via restore_s3, handle in protocol.
2. Parse pg_control to retrieve systemid, lsn and so on. Store it in pagecache.
3. Setup compute node without files: only request a few essential files from pageserver to bootstrap. And after that route ALL I/O requests to pageserver.
Use the initdb --compute-node flag to create such a minimal node without files, and the GUC 'computenode_mode=true' to request all pages from the pageserver.
2021-04-08 16:01:24 +03:00
anastasia
9a4fbf365c add test test_pageserver_recovery.
zenith_push postgres to minio and start pageserver using this base image
2021-04-08 15:14:44 +03:00
Heikki Linnakangas
ba4f8e94aa Minor comment fixes. 2021-04-08 13:30:42 +03:00
Heikki Linnakangas
3e09cb5718 Add README to give an overview of the WAL safekeeper. 2021-04-08 13:30:10 +03:00
Heikki Linnakangas
0c6471ca0d Print a few more LSNs in the standard format. 2021-04-07 19:05:28 +03:00
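For reference, the "standard format" for an LSN is the one PostgreSQL itself prints: the high and low 32 bits of the 64-bit position in hexadecimal, separated by a slash. A minimal sketch follows; the function name `format_lsn` is an assumption for illustration and is not taken from this commit.

```rust
/// Format a 64-bit LSN the way PostgreSQL prints it ("%X/%X"):
/// high 32 bits, a slash, then low 32 bits, both in uppercase hex.
/// NOTE: `format_lsn` is a hypothetical helper name, not the repository's API.
pub fn format_lsn(lsn: u64) -> String {
    format!("{:X}/{:X}", lsn >> 32, lsn & 0xffff_ffff)
}
```

Printing every LSN this way makes log lines directly comparable with the positions PostgreSQL reports.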
Heikki Linnakangas
198fc9ea53 Capture initdb's stdout/stderr, to avoid messing with log formatting.
Especially with --interactive.
2021-04-07 18:51:34 +03:00
Konstantin Knizhnik
abaa36a15c Format code according to rust style guide 2021-04-07 16:50:58 +03:00
Heikki Linnakangas
6b9fc3aff0 Fix minor typos and copy-pastos 2021-04-07 16:39:37 +03:00
Konstantin Knizhnik
3fea78d688 Multitenant wal_acceptor 2021-04-07 13:43:40 +03:00
Konstantin Knizhnik
8547184830 Fix merge conflicts 2021-04-06 15:55:55 +03:00
Konstantin Knizhnik
dc16386639 Fix wal_acceptor tests 2021-04-06 15:45:21 +03:00
Stas Kelvich
c0fcbbbe0c Cargo fmt pass over a codebase 2021-04-06 14:42:13 +03:00
Heikki Linnakangas
494b95886b Move launch.sh into 'pageserver' directory.
It's a script to launch the Page Server.
2021-04-06 14:05:43 +03:00
Heikki Linnakangas
4706ed4238 Add --test-threads=1 to instructions on how to run tests.
Currently, the tests don't work in parallel. The tests try to create a
data directory in the same path and clash with each other. That should be
fixed, but for now, let's at least fix the instructions on how it works
now.
2021-04-06 14:00:37 +03:00
Heikki Linnakangas
37e258edf2 Add a brief overview of the source code layout in README 2021-04-06 13:55:29 +03:00
Heikki Linnakangas
1367332447 Separate walkeeper and pageserver sources into different directories.
The integration tests, which depend on both walkeeper and pageserver,
are moved into yet another directory, 'integration_tests'.
2021-04-06 13:15:26 +03:00
Konstantin Knizhnik
012c528749 Add race condition test for safekeeper 2021-04-06 11:58:48 +03:00
Heikki Linnakangas
abbbdc401a Don't panic in test if data directory is missing. 2021-04-06 11:48:25 +03:00
Konstantin Knizhnik
b5c49b2482 Add safekeeper tests 2021-04-06 00:51:22 +03:00
Stas Kelvich
dab1f0381c Cache postgres build and cargo deps in CI builds
Currently, most of the CI check time is spent on dependency installation and compilation (~10 min total). Use actions/cache@v2 to cache things between checks. This commit sets up two caching targets:

* ./tmp_install with postgres build files and installed binaries uses $runner.os-pg-$pg_submodule_revision as a cache key and will be rebuilt only if linked submodule revision changes.

* ./target with cargo dependencies. That one uses hash(Cargo.lock) as a caching key and will be rebuilt only on deps update.

Also add tg notifications in passing.
2021-04-05 22:59:49 +03:00
Heikki Linnakangas
412e4a2ee7 Move the mgmt-console code from vendor/postgres repository. 2021-04-05 21:27:40 +03:00
Heikki Linnakangas
0bf3c3224e Make the debugging output from WAL receiver a bit nicer. 2021-04-05 17:35:28 +03:00
Stas Kelvich
79e4110cf0 Use immediate postgres stop in tests.
No need to wait for checkpoint during compute node stop.
2021-04-05 14:11:05 +03:00
Stas Kelvich
bd606ab37a Start pageserver walreceiver from predefined "-inf" lsn.
If we start walreceiver with identify_system.xlogpos() we will have a race condition with
postgres start: postgres may request a page that was modified with an lsn
smaller than identify_system.xlogpos().

The current procedure for starting postgres will be changed anyway to something
different, like having an 'initdb' method on the pageserver (or importing some shared
empty database snapshot), so for now I just use the start of the first segment, which
seems to be a valid record and is strictly before the first lsn records.
2021-04-05 12:37:45 +03:00
Stas Kelvich
a555b5917f bump pg commit 2021-04-04 10:50:51 +03:00
Konstantin Knizhnik
2c7ed574b8 Fix bug XLogFromFileName 2021-04-04 08:58:23 +03:00
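The commit does not say what the bug was. As background only, PostgreSQL's XLogFromFileName decodes a 24-character WAL segment file name into a timeline ID and a segment number, using three 8-digit hex fields (timeline, "log" ID, segment within that log). The sketch below illustrates that decoding under those assumptions; the Rust function name and signature are hypothetical and do not reproduce the code touched by this commit.

```rust
/// Decode a WAL segment file name such as "000000010000000000000001"
/// into (timeline id, segment number).
/// NOTE: hypothetical sketch of the file-name layout, not the repository's code.
pub fn xlog_from_file_name(fname: &str, wal_seg_size: u64) -> Option<(u32, u64)> {
    // The name must be exactly 24 hex digits; reject anything else.
    if wal_seg_size == 0 || fname.len() != 24 || !fname.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let tli = u32::from_str_radix(&fname[0..8], 16).ok()?;
    let log = u64::from_str_radix(&fname[8..16], 16).ok()?;
    let seg = u64::from_str_radix(&fname[16..24], 16).ok()?;
    // Each "log" id covers 4 GiB of WAL, split into segments of wal_seg_size bytes.
    let segs_per_xlogid = 0x1_0000_0000u64 / wal_seg_size;
    Some((tli, log * segs_per_xlogid + seg))
}
```

With the default 16 MiB segment size there are 256 segments per "log" ID, so the last field of a valid name only ranges from 00000000 to 000000FF.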
Stas Kelvich
decb04fc4f fix build 2021-04-03 19:42:45 +03:00
Stas Kelvich
da0decc24e bump pg version: include system_id in getPage requests 2021-04-03 19:15:15 +03:00
Stas Kelvich
2c308da4d2 Support several postgres instances on top of a single pageserver.
Each postgres will use its own page cache with associated data
structures. Postgres system_id is used to distinguish instances.
That also means that a backup should have a valid system_id stashed
somewhere. For now I put '42' as sys_id during S3 restore, but
that ought to be fixed.

Also this commit introduces a new way of starting WAL receivers:
postgres can initiate such a connection by calling the 'callmemaybe $url'
command in the page_service -- that will start the appropriate wal-redo
and wal-receiver threads. This way the page server may start without
a priori knowledge of compute node addresses.
2021-04-03 19:02:44 +03:00
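The per-instance page cache described above can be pictured as a registry keyed by system_id. This is a minimal sketch under that assumption; `PageCache`, `PageCacheRegistry`, and `get_or_create` are hypothetical names chosen for illustration, and the real data structures in the commit may differ.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the per-instance page cache and its
/// associated data structures.
#[derive(Debug, Default)]
pub struct PageCache {
    pub system_id: u64,
}

/// Hypothetical registry: one page cache per Postgres system_id.
#[derive(Default)]
pub struct PageCacheRegistry {
    caches: Mutex<HashMap<u64, Arc<PageCache>>>,
}

impl PageCacheRegistry {
    /// Return the cache for `system_id`, creating it on first use.
    pub fn get_or_create(&self, system_id: u64) -> Arc<PageCache> {
        let mut caches = self.caches.lock().unwrap();
        caches
            .entry(system_id)
            .or_insert_with(|| Arc::new(PageCache { system_id }))
            .clone()
    }
}
```

Creating the cache lazily on first lookup fits the 'callmemaybe' flow: the page server needs no prior list of instances, and a cache appears when a compute node first identifies itself.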
Konstantin Knizhnik
6eabe17e98 Fix bugs in wal_acceptor WAL parser 2021-04-02 20:13:20 +03:00
anastasia
4566a0a160 update vendor/postgres: merge compute_node branch 2021-04-02 15:19:16 +03:00
anastasia
dc4b5f5f23 use pg_resetwal to generate pg_control in the test 2021-04-02 15:09:20 +03:00
anastasia
9fdf1964a7 generate controlfile using pg_resetwal 2021-04-02 14:57:16 +03:00
anastasia
2d8a19affa add protocol message to receive pg_control 2021-04-02 14:56:32 +03:00
Stas Kelvich
e2ce9e562e remove unused modules 2021-04-02 10:38:51 +03:00
Konstantin Knizhnik
13f507f0b4 Calculate records CRC in wal decoder 2021-04-02 10:30:56 +03:00
Konstantin Knizhnik
02ca245081 Port wal_acceptor to rust 2021-04-02 10:30:56 +03:00
Heikki Linnakangas
08e59f5674 If WAL streaming connection is lost, restart at right place.
Need to restart at the end of last WAL record, not in the middle of a
record if we had previously streamed a partial record.
2021-03-31 21:27:26 +03:00
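The restart rule can be sketched as a small position tracker: remember both how far bytes have been received and where the last fully decoded record ended, and reconnect from the latter. All names below are hypothetical and only illustrate the idea; they are not the identifiers used in the commit.

```rust
/// Hypothetical tracker for a WAL streaming connection.
#[derive(Debug, Default)]
pub struct StreamPosition {
    /// End of the last fully decoded WAL record.
    last_record_end: u64,
    /// How far raw WAL bytes have been received (may be mid-record).
    received_upto: u64,
}

impl StreamPosition {
    /// Note that raw WAL bytes up to `upto` have arrived.
    pub fn on_bytes_received(&mut self, upto: u64) {
        self.received_upto = self.received_upto.max(upto);
    }

    /// Note that a complete record ending at `record_end` was decoded.
    pub fn on_record_decoded(&mut self, record_end: u64) {
        self.last_record_end = self.last_record_end.max(record_end);
    }

    pub fn received_upto(&self) -> u64 {
        self.received_upto
    }

    /// Where to restart streaming after a lost connection: the end of the
    /// last complete record, so a partially received record is re-streamed whole.
    pub fn restart_lsn(&self) -> u64 {
        self.last_record_end
    }
}
```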
Heikki Linnakangas
5f272380a2 Don't panic on XLOG_SWITCH records. 2021-03-31 20:00:02 +03:00
Heikki Linnakangas
7353098b47 Fix typo 2021-03-31 18:50:22 +03:00
Stas Kelvich
9175fb5ea7 enable test_regress in CI 2021-03-31 17:01:00 +03:00
Stas Kelvich
bef2731880 bump vendor/postgres 2021-03-31 16:55:15 +03:00
Heikki Linnakangas
d97bd869ae Fix confusion on what record's LSN means.
A WAL record's LSN is the *end* of the record (exclusive), not the
beginning. The WAL receiver and redo code were confused on that, and
sometimes returned the wrong page version because of that.
2021-03-31 16:54:31 +03:00
Heikki Linnakangas
bfb20522eb Advance "last valid LSN", even if it's not at WAL record boundary.
The GetPage@LSN requests used last flushed WAL position as the request LSN,
but the last flushed WAL position might point in the middle of a WAL record
(most likely at a page boundary). But we used to only update the "last valid
LSN" after fully decoding a record. As a result, this could happen:

1. Postgres generates two WAL records. They span from 0/10000 to 0/20000, and
from 0/20000 to 0/30000.

2. Postgres flushes the WAL to 0/25000.

3. Page server receives the WAL up to 0/25000. It decodes the first WAL
   record and advances the last valid LSN to the end of that record, 0/20000.

4. Postgres issues a GetPage@LSN request, using 0/25000 as the request LSN.

5. The GetPage@LSN request is stuck in the page server, because last valid
   LSN is 0/20000, and the request LSN is 0/25000.

This situation gets unwedged when something kicks a new WAL flush in the
Postgres server, like a new transaction. But that can take a long time.

Fix by updating the last valid LSN to the last received LSN, even if it
points in the middle of a record.
2021-03-31 16:54:20 +03:00
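The difference between the old and new behaviour can be sketched with a tiny model. It assumes a request can be served once its LSN is at or below the last valid LSN, and it uses the flush position 0/25000 from the scenario above as the request LSN. `LsnTracker` and its methods are hypothetical names for illustration only.

```rust
/// Hypothetical model of the page server's "last valid LSN" bookkeeping.
pub struct LsnTracker {
    last_valid_lsn: u64,
    /// false = old behaviour (advance only at record boundaries),
    /// true  = fixed behaviour (advance to the last received LSN).
    advance_mid_record: bool,
}

impl LsnTracker {
    pub fn new(advance_mid_record: bool) -> Self {
        LsnTracker { last_valid_lsn: 0, advance_mid_record }
    }

    /// Called after receiving WAL up to `received_upto`, of which complete
    /// records were decoded up to `last_record_end`.
    pub fn on_wal_received(&mut self, received_upto: u64, last_record_end: u64) {
        let candidate = if self.advance_mid_record {
            received_upto
        } else {
            last_record_end
        };
        self.last_valid_lsn = self.last_valid_lsn.max(candidate);
    }

    /// A GetPage@LSN request must wait until its LSN is covered.
    pub fn can_serve(&self, request_lsn: u64) -> bool {
        request_lsn <= self.last_valid_lsn
    }
}
```

Feeding both variants the same input (WAL received to 0/25000, records decoded to 0/20000) leaves a request at 0/25000 waiting under the old rule and servable under the new one.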
Heikki Linnakangas
52e754f301 Make page server startup less noisy. 2021-03-31 16:53:29 +03:00
Stas Kelvich
98b8426780 Ignore test_pageserver::test_regress as it fails now 2021-03-31 15:30:35 +03:00
Stas Kelvich
98cc8400f4 Move regression tests output to tmp_check/regress 2021-03-31 15:13:50 +03:00
Heikki Linnakangas
cd98818a22 Add @LSN argument to GetPage requests 2021-03-31 12:13:10 +03:00
Heikki Linnakangas
9a8bda2938 Break out of busy loop, if the page-service connection is lost. 2021-03-31 12:13:10 +03:00