Most of the CI check time is currently spent installing and compiling
dependencies (~10 min total). Use actions/cache@v2 to cache things between
runs. This commit sets up two caching targets:
* ./tmp_install, with postgres build files and installed binaries. It uses
$runner.os-pg-$pg_submodule_revision as the cache key and is rebuilt only
when the linked submodule revision changes.
* ./target, with cargo dependencies. It uses hash(Cargo.lock) as the cache
key and is rebuilt only when dependencies change.
Also add tg notifications in passing.
If we start the walreceiver at identify_system.xlogpos(), we get a race
condition with postgres startup: postgres may request a page that was
modified at an LSN smaller than identify_system.xlogpos().
The current procedure for starting postgres will be changed anyway, to
something like an 'initdb' method on the pageserver (or importing a shared
empty database snapshot). So for now, just start from the beginning of the
first WAL segment, which appears to be a valid record boundary and is
strictly before the first LSNs that can be requested (see the sketch below).
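
A minimal sketch of the LSN arithmetic behind that choice, assuming the
default 16 MB WAL segment size and a plain u64 LSN; the names here are
illustrative, not the actual pageserver API:

    // Assumption: default 16 MB WAL segments; real code would take this
    // from the server configuration.
    const WAL_SEGMENT_SIZE: u64 = 16 * 1024 * 1024;

    /// Round an LSN down to the start of the WAL segment containing it.
    fn segment_start(lsn: u64) -> u64 {
        lsn - (lsn % WAL_SEGMENT_SIZE)
    }

    fn main() {
        // Right after initdb the cluster is still in its first segment, so
        // rounding the reported position down yields the start of that
        // first segment, strictly before any LSN postgres can request.
        let xlogpos: u64 = 0x0100_0A28; // hypothetical identify_system.xlogpos()
        assert_eq!(segment_start(xlogpos), 0x0100_0000);
    }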
Each postgres instance now uses its own page cache with associated data
structures; the postgres system_id is used to distinguish instances. That
also means a backup must have a valid system_id stashed somewhere. For now
'42' is used as the sys_id during S3 restore, but that ought to be fixed.
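
A minimal sketch of the per-instance lookup, assuming a global map keyed by
system_id; PageCache and the map type are stand-ins, not the actual data
structures:

    use std::collections::HashMap;
    use std::sync::{Arc, Mutex};

    // Stand-in for the real per-instance page cache and its data structures.
    #[derive(Default)]
    struct PageCache { /* buffers, relation metadata, ... */ }

    /// Fetch the cache for a postgres instance, creating it on first use.
    /// system_id is what distinguishes instances.
    fn get_pagecache(
        caches: &Mutex<HashMap<u64, Arc<PageCache>>>,
        system_id: u64,
    ) -> Arc<PageCache> {
        let mut map = caches.lock().unwrap();
        map.entry(system_id)
            .or_insert_with(|| Arc::new(PageCache::default()))
            .clone()
    }

    fn main() {
        let caches = Mutex::new(HashMap::new());
        // S3 restore currently stashes sys_id 42; both lookups must hit
        // the same cache instance.
        let a = get_pagecache(&caches, 42);
        let b = get_pagecache(&caches, 42);
        assert!(Arc::ptr_eq(&a, &b));
    }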
This commit also introduces a new way of starting WAL receivers: postgres
can initiate the connection by issuing a 'callmemaybe $url' command to the
page_service, which starts the appropriate wal-redo and wal-receiver
threads. This way the page server can start without a priori knowledge of
compute node addresses.
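
A rough sketch of the command dispatch, with invented names; the real
page_service parsing and thread startup look different:

    use std::thread;
    use std::time::Duration;

    /// Handle one command received over the page service connection.
    fn handle_command(cmd: &str) {
        // 'callmemaybe <url>' asks the page server to connect back to the
        // compute node, so no compute addresses need to be known up front.
        if let Some(url) = cmd.strip_prefix("callmemaybe ") {
            let url = url.trim().to_owned();
            // The matching wal-redo thread for this instance would be
            // started here as well.
            thread::spawn(move || walreceiver_main(url));
        }
    }

    fn walreceiver_main(url: String) {
        // Connect to `url` and start streaming WAL (omitted).
        println!("walreceiver: connecting to {}", url);
    }

    fn main() {
        handle_command("callmemaybe postgresql://compute-node:5432");
        thread::sleep(Duration::from_millis(50)); // let the thread run
    }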
A WAL record's LSN is the *end* of the record (exclusive), not the
beginning. The WAL receiver and redo code were confused about that, and
sometimes returned the wrong page version as a result.
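
To pin down the convention (illustrative types only, not the actual code):

    /// A decoded WAL record occupying the byte range [start, end) of the WAL.
    struct WalRecord {
        start: u64,
        end: u64,
    }

    impl WalRecord {
        /// The record's LSN is the *end* of the record (exclusive), so
        /// page versions produced by replaying it must be tagged with
        /// `end`, never with `start`.
        fn lsn(&self) -> u64 {
            self.end
        }
    }

    fn main() {
        let rec = WalRecord { start: 0x10000, end: 0x20000 };
        assert_eq!(rec.lsn(), 0x20000);
    }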
The GetPage@LSN requests used the last flushed WAL position as the request
LSN, but the last flushed WAL position might point to the middle of a WAL
record (most likely at a page boundary), while we used to update the "last
valid LSN" only after fully decoding a record. As a result, this could
happen:
1. Postgres generates two WAL records. They span from 0/10000 to 0/20000, and
from 0/20000 to 0/30000.
2. Postgres flushes the WAL to 0/25000.
3. The page server receives the WAL up to 0/25000. It decodes the first WAL
record and advances the last valid LSN to the end of that record, 0/20000.
4. Postgres issues a GetPage@LSN request, using the last flushed position,
0/25000, as the request LSN.
5. The GetPage@LSN request is stuck in the page server, because the last
valid LSN is 0/20000, and the request LSN is 0/25000.
This situation gets unwedged when something kicks off a new WAL flush in the
Postgres server, like a new transaction, but that can take a long time.
Fix by updating the last valid LSN to the last received LSN, even if it
points to the middle of a record.
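
A minimal sketch of the fix, with invented names; the real tracking and the
wakeup of waiting requests are more involved:

    use std::sync::atomic::{AtomicU64, Ordering};

    // Last LSN up to which WAL has been received; GetPage@LSN requests
    // wait until their request LSN is <= this value.
    static LAST_VALID_LSN: AtomicU64 = AtomicU64::new(0);

    /// Called for every chunk of WAL received from the walreceiver.
    /// `write_lsn` is the end of the received data and may point to the
    /// middle of a record. That is safe: a record still incomplete at
    /// `write_lsn` ends after it, so its page versions are not visible
    /// to a request at this LSN anyway.
    fn advance_last_valid_lsn(write_lsn: u64) {
        LAST_VALID_LSN.fetch_max(write_lsn, Ordering::SeqCst);
        // ...wake up waiting GetPage@LSN requests here...
    }

    fn main() {
        // WAL received up to 0/25000: even though the record ending at
        // 0/30000 is still incomplete, a GetPage@0/25000 request can now
        // be answered instead of waiting for the next WAL flush.
        advance_last_valid_lsn(0x25000);
        assert_eq!(LAST_VALID_LSN.load(Ordering::SeqCst), 0x25000);
    }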