Compare files in existing compute node's pgdata with fresh basebackup at the same lsn. We expect that content is identical, except tmp files
Use it after some tests.
1) Do epoch switch without record from new epoch, immediately after recovery --
--sync-safekeepers mode doesn't generate new records.
2) Fix commit_lsn advancement by taking into account wal we have locally --
setting it further is incorrect.
3) Report it back to walproposer so he knows when sync is done.
4) Remove system id check as it is unknown in sync mode.
And make logging slightly better.
ref #439
Change control plane code to call `postgres --sync-safekeepers` before
compute node start when safekeepers are enabled. Now `pg create` will
create an empty data directory with the proper config file. Subsequent
`pg start` will run `sync-safekeepers` and will call basebackup with
the resulting LSN. Also change few tests to accommodate this new behavior.
In a passing fix two minor issues with basabackup:
* check that we can't create branches with pre-initdb LSN's
* normalize branch LSN's that are pointing to the segment boundary
patch by @knizhnik
closes#506
Now that the page server collects this metric (since commit 212920e47e),
let's include it in the performance test results
The new metric looks like this:
performance/test_perf_pgbench.py . [100%]
--------------- Benchmark results ----------------
test_pgbench.init: 6.784 s
test_pgbench.pageserver_writes: 466 MB <---- THIS IS NEW
test_pgbench.5000_xacts: 8.196 s
test_pgbench.size: 163 MB
=============== 1 passed in 21.00s ===============
This should fix the sporadic regression test failures we've been seeing
lately with "no base img found" errors.
This fixes the common case, but one corner case is still not handled:
If a relation is extended across a segment boundary, leaving a gap block
in the segment preceding the segment containing the target block, the
preceding segment will not be padded with zeros correctly. This adds
a test case for that, but it's commented out.
See github issue https://github.com/zenithdb/zenith/issues/500
To fix, break out of the loop when you reach an in-memory layer that was
created after the checkpoint started. To do that, add a "generation"
counter into the layer map.
Fixes https://github.com/zenithdb/zenith/issues/494
Previously, the InMemoryLayer and DeltaLayer implementations of
get_page_reconstruct_data would recursively call the predecessor layer's
get_page_reconstruct_data function. Refactor so that we iterate in the
caller instead. Make get_page_reconstruct_data() return the predecessor
layer along with the continuation LSN, so that the caller can iterate.
IMO this makes the logic more clear, although this is more lines of code.
DeltaLayer uses the name `predecessor` for the same thing. Use the
same name in InMemoryLayer. The 'img_layer' name was misleading, as
the predecessor layer is not necessarily an image layer. Currently,
the 'freeze' function always creates a new image layer, but it
wouldn't have to be that way. Also, when you create a new branch, at
the branch point the predecessor layer can be a delta layer on the
ancestor branch.
* add lsn argument
* do not expose wait_lsn, wait inside list_nonrels()
* fix parameters parsing
* expose get_last_record_rlsn() to atomically read (last,prev) pair
More work is needed to correctly handle basebackup@old_lsn but current
approach already allows to fix test_restart_compute
There are two main reasons for that:
a) Latest unfinished record may disapper after compute node restart, so let's
try not leak volatile part of the WAL into the repository. Always use
last_valid_record instead.
That change requires different getPage@LSN logic in postgres -- we need
to ask LSN's that point to some complete record instead of GetFlushRecPtr()
that can point in the middle of the record. That was already done by @knizhnik
to deal with the same problem during the work on `postgres --sync-safekeepers`.
Postgres will use LSN's aligned on 0x8 boundary in get_page requests, so we
also need to be sure that last_valid_record is aligned.
b) Switch to get_last_record_lsn() in basebackup@no_lsn. When compute node
is running without safekeepers and streams WAL directly
to pageserver it is important to match basebackup LSN and LSN of replication
start. Before this commit basebackup@no_lsn was waiting for last_valid_lsn
and walreceiver started replication with last_record_lsn, which can be less.
So replication was failing since compute node doesn't have requested WAL.
Make this test look like 'test_compute_restart.sh' by @ololobus, which
was surprisingly good for checking safekeepers behavior. This test adds
an intermediate compute node start with bulk select that causes a lot of
FPI's and select itself wouldn't wait for all that WAL to be replicated.
So if we kill compute node right after that we end up with lagging safekeepers
with VCL != flush_lsn. And starting new node from that state takes special
care.
Also, run and print `pg_controldata` output after each compute node start
to eyeball lsn/checkpoint info of basebackup.
This commit only adds test without fixing the problem.
I noticed that the timeline directory contained files like this:
pg_xact_0000_0_000000000169C3C2_00000000016BB399
pg_xact_0000_0_00000000016BB399
pg_xact_0000_0_00000000016BB399_00000000016BDD06
pg_xact_0000_0_00000000016BDD06
pg_xact_0000_0_00000000016BDD06_00000000016C63AA
pg_xact_0000_0_00000000016C63AA
pg_xact_0000_0_00000000016C63AA_0000000001765226_DROPPED
pg_xact_0000_0_0000000001765226
pg_xact_0001_0_00000000016BB77E_00000000016BDD06
pg_xact_0001_0_00000000016BDD06
pg_xact_0001_0_00000000016BDD06_0000000001765226_DROPPED
pg_xact_0001_0_0000000001765226
Note how there is an image file after each DROPPED file. It's a waste of
time and space to materialize an image of the file at the point where it's
dropped, no one is going to request pages on a dropped relation. And it's
a correctness issue too: list_rels() and list_nonrels() will not consider
the relation as unlinked, unless the latest layer indicates so, and there
is no concept of a dropped image layer. That was causing test_clog_truncate
test to fail, when I adjusted the checkpointer to force a checkpoint more
aggressively.
There are a bunch more issues related to dropped rels and branching,
see https://github.com/zenithdb/zenith/issues/502. Hence this doesn't
completely fix the issue I saw with test_clog_truncate either. But it's
a start.
The comment talked about the WAL redo thread, but commit 6e22a8f709
refactored that away. The problem the comment describes probably still
exists, so keep the comment, but update the wording.
Avoid slurping entire image files into memory.
For blocky segments, we write the bytes directly to a bookfile chapter.
The blocks are a fixed size, which allows for random access.
split the page versions into two chapters:
PAGE_VERSION_METAS - a rust BTreeMap from (block #, lsn) -> page & WAL
byte ranges in PAGE_VERSIONS_CHAPTER
PAGE_VERSIONS_CHAPTER - raw page images and serialized WAL records