Commit Graph

64 Commits

Author SHA1 Message Date
Arseny Sher
0aec60938a Make flush_lsn reported by safekeepers point to record boundary.
Otherwise we produce corrupted record holes in WAL during compute node restart
in case there was an unfinished record from the old compute, as these reports
advance commit_lsn -- reliably persisted part of WAL.

ref #549.

Mostly by @knizhnik. I adjusted to make sure proposer always starts streaming
since record beginning so we don't need special quirks for decoding in
safekeeper.
2021-09-11 06:10:10 +03:00
Dmitry Rodionov
bc709561b6 fix clippy warnings 2021-09-02 18:54:44 +03:00
Stas Kelvich
8c07a36fda Remove last_valid_lsn tracking in wal_receiver.
There are two main reasons for that:

a) Latest unfinished record may disapper after compute node restart, so let's
    try not leak volatile part of the WAL into the repository. Always use
    last_valid_record instead.

    That change requires different getPage@LSN logic in postgres -- we need
    to ask LSN's that point to some complete record instead of GetFlushRecPtr()
    that can point in the middle of the record. That was already done by @knizhnik
    to deal with the same problem during the work on `postgres --sync-safekeepers`.

    Postgres will use LSN's aligned on 0x8 boundary in get_page requests, so we
    also need to be sure that last_valid_record is aligned.

b) Switch to get_last_record_lsn() in basebackup@no_lsn. When compute node
    is running without safekeepers and streams WAL directly
    to pageserver it is important to match basebackup LSN and LSN of replication
    start. Before this commit basebackup@no_lsn was waiting for last_valid_lsn
    and walreceiver started replication with last_record_lsn, which can be less.
    So replication was failing since compute node doesn't have requested WAL.
2021-09-02 12:06:12 +03:00
Heikki Linnakangas
d7bebd8074 Add 'dump_layerfile' utility for debugging.
Seems handy for getting a quick idea of what's stored in an
image or delta layer file.

Example output on a file after runnnig pgbench for a while:

    % ./target/debug/dump_layerfile pgbench_layers/pg_control_checkpoint_0_00000000016B914A
    ----- image layer for checkpoint.0 at 0/16B914A ----
    non-blocky (88 bytes)
    % ./target/debug/dump_layerfile pgbench_layers/pg_xact_0000_0_000000000412FD40
    ----- image layer for pg_xact/0000.0 at 0/412FD40 ----
    (1) blocks
    % ./target/debug/dump_layerfile pgbench_layers/rel_1663_14236_1247_0_0_00000000016B914A_000000000412FD40 | head -n 20
    ----- delta layer for 1663/14236/1247.0 0/16B914A-0/412FD40 ----
    --- relsizes ---
      0/16B914A: 14
      0/16CA559: 15
    --- page versions ---
      blk 13 at 0/16BB1D2:  rec 8162 bytes will_init: true HEAP INSERT
      blk 14 at 0/16CA559:  rec 8241 bytes will_init: true XLOG FPI
      blk 14 at 0/16CA637:  rec 215 bytes will_init: true HEAP INSERT
      blk 14 at 0/16DF14F:  rec 215 bytes will_init: false HEAP INSERT
      blk 14 at 0/16DF3A7:  rec 215 bytes will_init: false HEAP INSERT
      blk 14 at 0/16E0637:  rec 215 bytes will_init: false HEAP INSERT
      blk 14 at 0/16E088F:  rec 215 bytes will_init: false HEAP INSERT
      blk 14 at 0/16E5F9F:  rec 215 bytes will_init: false HEAP INSERT
      blk 14 at 0/16E620F:  rec 215 bytes will_init: false HEAP INSERT
2021-09-01 12:20:16 -07:00
anastasia
8b3a293bb0 Use postgres_ffi bindings instead of custom type definitions.
Move several functions to postgres_ffi crate
2021-09-01 16:11:44 +03:00
anastasia
78963ad104 Issue #411. Support drop database in pageserver.
Use put_unlink for FileNodeMap relishes.
Always store FileNodeMap as materialized page images (no wal records).
2021-08-30 17:29:29 +03:00
Heikki Linnakangas
b949127b06 Rename page_cache.rs to tenant_mgr.rs.
Once upon a time, 'page_cache.rs' contained an actual page cache, but
it hasn't for a very long time. Rename to reflect what it actually does
these days.
2021-08-30 15:17:30 +03:00
Patrick Insinger
d265b4cdd3 waldecoder - check for trailing bytes
When we parse the main data in a WAL record, ensure we consume all bytes.
2021-08-26 10:24:33 -07:00
Heikki Linnakangas
81dd4bc41e Fix decoding XLOG_HEAP_DELETE and XLOG_HEAP_UPDATE records.
Because the t_cid field was missing from the XlHeapDelete struct that
corresponds to the PostgreSQL xl_heap_delete struct, the check for the
XLH_DELETE_ALL_VISIBLE_CLEARED flag did not work correctly.

Decoding XlHeapUpdate struct was also missing the t_cid field, but that
didn't cause any immediate problems because in that struct, the t_cid
field is after all the fields that the page server cares about. But fix
that too, as it was an accident waiting to happen.

The bug was mostly hidden by the VM page handling in zenith_wallog_page,
where it forcibly generates a FPW record whenever a VM page is evicted:

    else if (forknum == VISIBILITYMAP_FORKNUM && !RecoveryInProgress())
    {
        /*
         * Always WAL-log vm.
         * We should never miss clearing visibility map bits.
         *
         * TODO Is it too bad for performance?
         * Hopefully we do not evict actively used vm too often.
         */
        XLogRecPtr recptr;
        recptr = log_newpage_copy(&reln->smgr_rnode.node, forknum, blocknum, buffer, false);
        XLogFlush(recptr);
        lsn = recptr;

But that was just hiding the issue: it's still visible if you had a
read-only node relying on the data in the page server, or you killed and
restarted the primary node, or you started a branch. In the included test
case, I used a new branch to expose this.

Fixes https://github.com/zenithdb/zenith/issues/461
2021-08-24 15:59:25 +03:00
anastasia
a368642790 cargo fmt 2021-08-10 14:26:52 +03:00
Konstantin Knizhnik
3ca3394170 [refer #395] Check WAL record CRC in waldecoder (#396) 2021-08-05 16:57:57 +03:00
Heikki Linnakangas
9ff122835f Refactor ObjectTags, intruducing a new concept called "relish"
This clarifies - I hope - the abstractions between Repository and
ObjectRepository. The ObjectTag struct was a mix of objects that could
be accessed directly through the public Timeline interface, and also
objects that were created and used internally by the ObjectRepository
implementation and not supposed to be accessed directly by the
callers.  With the RelishTag separaate from ObjectTag, the distinction
is more clear: RelishTag is used in the public interface, and
ObjectTag is used internally between object_repository.rs and
object_store.rs, and it contains the internal metadata object types.

One awkward thing with the ObjectTag struct was that the Repository
implementation had to distinguish between ObjectTags for relations,
and track the size of the relation, while others were used to store
"blobs".  With the RelishTags, some relishes are considered
"non-blocky", and the Repository implementation is expected to track
their sizes, while others are stored as blobs. I'm not 100% happy with
how RelishTag captures that either: it just knows that some relish
kinds are blocky and some non-blocky, and there's an is_block()
function to check that.  But this does enable size-tracking for SLRUs,
allowing us to treat them more like relations.

This changes the way SLRUs are stored in the repository. Each SLRU
segment, e.g. "pg_clog/0000", "pg_clog/0001", are now handled as a
separate relish.  This removes the need for the SLRU-specific
put_slru_truncate() function in the Timeline trait. SLRU truncation is
now handled by caling put_unlink() on the segment. This is more in
line with how PostgreSQL stores SLRUs and handles their trunction.

The SLRUs are "blocky", so they are accessed one 8k page at a time,
and repository tracks their size. I considered an alternative design
where we would treat each SLRU segment as non-blocky, and just store
the whole file as one blob. Each SLRU segment is up to 256 kB in size,
which isn't that large, so that might've worked fine, too. One reason
I didn't do that is that it seems better to have the WAL redo
routines be as close as possible to the PostgreSQL routines. It
doesn't matter much in the repository, though; we have to track the
size for relations anyway, so there's not much difference in whether
we also do it for SLRUs.

While working on this, I noticed that the CLOG and MultiXact redo code
did not handle wraparound correctly. We need to fix that, but for now,
I just commented them out with a FIXME comment.
2021-08-03 14:01:05 +03:00
Konstantin Knizhnik
f6705b7a7d Fix TimestampTz type to i64 to be compatbile with Postgres 2021-07-16 18:43:07 +03:00
Konstantin Knizhnik
eb0a56eb22 Replay non-relational WAL records on page server 2021-07-16 18:43:07 +03:00
Konstantin Knizhnik
97681acfcf Replace XLR_RMGR_INFO_MASK with XLOG_HEAP_OPMASK 2021-07-10 10:09:56 +03:00
Konstantin Knizhnik
baf8800b96 Fix incorrect mask in wldecoder 2021-07-10 10:09:56 +03:00
Heikki Linnakangas
ced338fd20 Handle relation DROPs in page server.
Add back code to parse transaction commit and abort records, and in
particular the list of dropped relations in them. Add 'put_unlink'
function to the Timeline trait and implementation. We had the code to
handle dropped relations in the GC code and elsewhere in ObjectRepository
already, but there was nothing to create the RelationSizeEntry::Unlink
tombstone entries until now. Also add a test to check that GC correctly
removes all page versions of a dropped relation.

Implements https://github.com/zenithdb/zenith/issues/232, except for the
"orphaned" rels.

Reviewed-by: Konstantin Knizhnik
2021-06-29 00:27:10 +03:00
Heikki Linnakangas
44c35722d8 Remove a bunch of dead code
Some of these were related to handling various WAL records that are not
related to any relations, like pg_multixact updates. These should have
been removed in the revert commit 6a9c036ac1, but I missed them.

Also, we didn't anything with commit/abort records. We will start
parsing commit/abort records in the next commit, but seems better to
add that from clean slate.

Reviewed-by: Konstantin Knizhnik
2021-06-29 00:26:53 +03:00
anastasia
0969574d48 Use bindgen for various xlog structures and checkpoint.
Implement encode/decode methods for them.

Some methods are unused now. This is a preparatory commit for nonrel_wal
2021-06-09 01:00:42 +03:00
anastasia
2b0193e6bf implement from_bytes for XLogPageHeader structs 2021-06-08 13:08:57 +03:00
anastasia
c31a5e2c8f move XLogPageHeader structs to xlog_utils 2021-06-08 13:08:57 +03:00
anastasia
d85d67a6f1 use constants defined in xlog_utils for waldecoder 2021-06-08 13:08:57 +03:00
Konstantin Knizhnik
063429aade Implement GC for new object_store API (#229)
* Implement GC for new object_store API

* Add comments for GC

* Revert postgres module version reference
2021-06-04 20:11:56 +03:00
anastasia
0e423d481e Update rustdoc comments and README for pageserver crate 2021-06-01 19:38:42 +03:00
Heikki Linnakangas
558a2214bc Fix comment 2021-06-01 18:28:01 +03:00
Heikki Linnakangas
ac60b68d50 Handle VM and FSM truncation WAL records in the page server.
Fixes issue #190.

Original patch by Konstantin Knizhnik.
2021-05-31 23:36:17 +03:00
Heikki Linnakangas
6a9c036ac1 Revert all changes related to storing and restoring non-rel data in page server
This includes the following commits:

35a1c3d521 Specify right LSN in test_createdb.py
d95e1da742 Fix issue with propagation of CREATE DATABASE to the branch
8465738aa5 [refer #167] Fix handling of pg_filenode.map files in page server
86056abd0e Fix merge conflict: set initial WAL position to second segment because of pg_resetwal
2bf2dd1d88 Add nonrelfile_utils.rs file
20b6279beb Fix restoring non-relational data during compute node startup
06f96f9600 Do not transfer WAL to computation nodes: use pg_resetwal for node startup

As well as some older changes related to storing CLOG and MultiXact data as
"pseudorelation" in the page server.

With this revert, we go back to the situtation that when you create a
new compute node, we ship *all* the WAL from the beginning of time to
the compute node. Obviously we need a better solution, like the code
that this reverts. But per discussion with Konstantin and Stas, this
stuff was still half-baked, and it's better for it to live in a branch
for now, until it's more complete and has gone through some review.
2021-05-24 16:05:45 +03:00
anastasia
84508d4f68 fix replay of nextMulti and nextMultiOffset fields 2021-05-24 15:17:35 +03:00
Eric Seppanen
4aabc9a682 easy clippy cleanups
Various things that clippy complains about, and are really easy to
fix.
2021-05-23 13:17:15 -07:00
Konstantin Knizhnik
20b6279beb Fix restoring non-relational data during compute node startup 2021-05-20 14:14:52 +03:00
Konstantin Knizhnik
06f96f9600 Do not transfer WAL to computation nodes: use pg_resetwal for node startup 2021-05-20 14:13:47 +03:00
Konstantin Knizhnik
04dc698d4b Add support of twophase transactions 2021-05-16 00:03:20 +03:00
Konstantin Knizhnik
9ece1e863d Compute and restore pg_xact, pg_multixact and pg_filenode.map files 2021-05-14 16:35:09 +03:00
Konstantin Knizhnik
22e7fcbf2d Handle visbility map updates in WAL redo 2021-05-12 10:38:43 +03:00
Eric Seppanen
d26b76fe7c cargo fmt 2021-05-07 13:11:44 -07:00
anastasia
15db0d1d6f refactor walreciever and restore_local_repo 2021-05-06 12:58:08 +03:00
Konstantin Knizhnik
eea6f0898e Restore CLOG from snapshot 2021-04-30 14:22:47 +03:00
anastasia
1369145e83 code cleanup 2021-04-29 18:41:42 +03:00
anastasia
b49164a1d4 cargo fmt 2021-04-29 18:41:42 +03:00
anastasia
e7b112aacc Refactor pg_constants. Move them to postgres_ffi/ 2021-04-29 18:41:42 +03:00
anastasia
421d586953 code cleanup for XLogRecord decoding 2021-04-28 13:56:27 +03:00
anastasia
ef37eb96b9 refactor XLogRecord reading 2021-04-28 13:56:27 +03:00
anastasia
d311f708b6 handle subtrans in COMMIT/ABORT records 2021-04-28 13:56:27 +03:00
Eric Seppanen
648755a25e add Lsn::block_offset, remaining_in_block, calc_padding
Replace open-coded math with member fns.
2021-04-25 19:37:02 -07:00
Eric Seppanen
01e239afa3 apply Lsn type everywhere
Use the `Lsn` type everywhere that I can find u64 being used to
represent an LSN.
2021-04-25 19:37:02 -07:00
Konstantin Knizhnik
52ee3a2bac Support CREATE DATABASE command 2021-04-23 17:03:56 +03:00
anastasia
b64bd2a8af handle XLOG_DBASE_CREATE in waldecoder 2021-04-23 14:06:09 +03:00
Konstantin Knizhnik
9e7c45cb72 Merge with master 2021-04-22 09:45:13 +03:00
Eric Seppanen
1f3f4cfaf5 clippy cleanup #2
- remove needless return
- remove needless format!
- remove a few more needless clone()
- from_str_radix(_, 10) -> .parse()
- remove needless reference
- remove needless `mut`

Also manually replaced a match statement with map_err() because after
clippy was done with it, there was almost nothing left in the match
expression.
2021-04-21 17:56:58 -07:00
Konstantin Knizhnik
a22cb7acc1 Merge with main branch 2021-04-21 20:19:34 +03:00