workspace_hack is needed to avoid recompilation when different crates
in the workspace depend on the same packages but with different
features enabled. The problem occurs when you build crates separately,
one by one. It is therefore irrelevant to our CI setup, where we build
all binaries at once, but it may be relevant for local development.
This also changes Cargo's resolver version to 2.
This is a backwards-incompatible change. The new pageserver cannot
read repositories created with an old pageserver binary, or vice
versa.
Simplify Repository to a value-store
------------------------------------
Move the responsibility of tracking relation metadata, like which
relations exist and what their sizes are, from Repository to a new
module, pgdatadir_mapping.rs. The interface to Repository is now
simple key-value PUT/GET operations.
It's still not just any key-value store, though. A Repository is still
responsible for handling branching, and every GET operation comes
with an LSN.
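As a rough sketch (the trait and method names here are illustrative, not
the exact pageserver API), the interface now amounts to something like:

    use anyhow::Result;
    use bytes::Bytes;

    // Illustrative stand-ins for the real pageserver types.
    pub struct Key(pub u128);
    pub struct Lsn(pub u64);

    // The Repository interface, reduced to versioned PUT/GET.
    pub trait Repository {
        // Every read is at a specific LSN; the store returns the latest
        // version of the key at or before that LSN, following the
        // branch's ancestry if needed.
        fn get(&self, key: Key, lsn: Lsn) -> Result<Bytes>;

        // Writes are tagged with the LSN of the WAL that caused them.
        fn put(&self, key: Key, lsn: Lsn, value: Bytes) -> Result<()>;
    }
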
Mapping from Postgres data directory to keys/values
---------------------------------------------------
All the data is now stored in the key-value store. The
'pgdatadir_mapping.rs' module handles the mapping from PostgreSQL
objects, like relation pages and SLRUs, to key-value pairs.
The key to the Repository key-value store is a Key struct, which
consists of a few integer fields. It's wide enough to store a full
RelFileNode, fork and block number, and to distinguish those from
metadata keys.
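For illustration only (the exact field names and widths may differ), the
idea is along these lines:

    // Wide enough to hold (spcnode, dbnode, relnode, forknum, blknum)
    // for relation blocks, with room left over to tag non-relation keys
    // such as SLRU pages and other metadata.
    #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
    pub struct Key {
        pub field1: u8,  // key "class": relation data, SLRU, metadata, ...
        pub field2: u32, // e.g. tablespace OID
        pub field3: u32, // e.g. database OID
        pub field4: u32, // e.g. relfilenode
        pub field5: u8,  // e.g. fork number
        pub field6: u32, // e.g. block number
    }
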
'pgdatadir_mapping.rs' is also responsible for maintaining a
"partitioning" of the keyspace. Partitioning means splitting the
keyspace so that each partition holds a roughly equal number of keys.
The partitioning is used when new image layer files are created, so
that each image layer file is roughly the same size.
The partitioning is also responsible for reclaiming space used by
deleted keys. The Repository implementation doesn't have any explicit
support for deleting keys. Instead, the deleted keys are simply
omitted from the partitioning, and when a new image layer is created,
the omitted keys are not copied over to the new image layer. We might
want to implement tombstone keys in the future, to reclaim space
faster, but this will work for now.
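The mechanism, reduced to a sketch (the real code works on key ranges and
layer sizes rather than a flat vector, but the idea is the same):

    // Split a sorted list of live keys into partitions of roughly
    // `target_size` keys each; each partition later becomes one image
    // layer. Keys of deleted objects simply never appear in the input.
    fn partition<K: Clone>(live_keys: &[K], target_size: usize) -> Vec<Vec<K>> {
        live_keys
            .chunks(target_size.max(1))
            .map(|chunk| chunk.to_vec())
            .collect()
    }
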
Changes to low-level layer file code
------------------------------------
The concept of a "segment" is gone. Each layer file can now store an
arbitrary range of Keys.
Checkpointing, compaction
-------------------------
The background tasks are somewhat different now. Whenever
checkpoint_distance is reached, the WAL receiver thread "freezes" the
current in-memory layer, and creates a new one. This is a quick
operation and doesn't perform any I/O yet. It then launches a
background "layer flushing thread" to write the frozen layer to disk,
as a new L0 delta layer. This mechanism takes care of durability. It
replaces the checkpointing thread.
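In outline (a simplified sketch; names are illustrative and most
bookkeeping and locking details are elided):

    use std::sync::{Arc, Mutex};
    use std::thread;

    struct InMemoryLayer; // illustrative stand-in
    struct LayerMap {
        open_layer: InMemoryLayer,
        frozen_layers: Vec<Arc<InMemoryLayer>>,
    }

    // Called from the WAL receiver thread when checkpoint_distance is
    // reached. Freezing is just a swap; no I/O happens here.
    fn freeze_and_flush(layers: &Arc<Mutex<LayerMap>>) {
        let frozen = {
            let mut map = layers.lock().unwrap();
            let frozen = Arc::new(std::mem::replace(&mut map.open_layer, InMemoryLayer));
            map.frozen_layers.push(Arc::clone(&frozen));
            frozen
        };
        // The write to disk, as a new L0 delta layer, happens in the
        // background, off the WAL receiver thread.
        thread::spawn(move || write_l0_delta_layer(&frozen));
    }

    fn write_l0_delta_layer(_frozen: &InMemoryLayer) {
        // ... write the frozen layer's contents to a delta layer file ...
    }
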
Compaction is a new background operation that takes a bunch of L0
delta layers, and reshuffles the data in them. It runs in a separate
compaction thread.
Deployment
----------
This also contains changes to the ansible scripts that enable having
multiple different pageservers running at the same time in the staging
environment. We will use that to keep an old version of the pageserver
running, for clusters created with the old version, alongside a
pageserver running the new binary.
Author: Heikki Linnakangas
Author: Konstantin Knizhnik <knizhnik@zenith.tech>
Author: Andrey Taranik <andrey@zenith.tech>
Reviewed-by: Matthias Van De Meent <matthias@zenith.tech>
Reviewed-by: Bojan Serafimov <bojan@zenith.tech>
Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Reviewed-by: Anton Shyrabokau <antons@zenith.tech>
Reviewed-by: Dhammika Pathirana <dham@zenith.tech>
Reviewed-by: Kirill Bulatov <kirill@zenith.tech>
Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech>
Reviewed-by: Alexey Kondratov <alexey@zenith.tech>
* Fix checkpoint.nextXid update
* Add test for checkpoint.nextXid
* Fix indentation of test_next_xid.py
* Fix mypy error in test_next_xid.py
* Tidy up the test case.
* Add a unit test
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Introduce the concept of a "ZenithWalRecord", which can be a Postgres WAL
record that is replayed with the Postgres WAL redo process, or a built-in
type that is handled entirely by pageserver code.
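In sketch form (the variant set here is illustrative and not exhaustive):

    use bytes::Bytes;

    type TransactionId = u32; // illustrative stand-in

    enum ZenithWalRecord {
        // A regular Postgres WAL record, handed to the Postgres WAL
        // redo process to replay against the old page image.
        Postgres { will_init: bool, rec: Bytes },

        // Built-in record types, applied directly by pageserver code.
        // For example, marking a set of transactions as committed or
        // aborted on one CLOG page:
        ClogSetCommitted { xids: Vec<TransactionId> },
        ClogSetAborted { xids: Vec<TransactionId> },
    }
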
Replace the special code to replay Postgres XACT commit/abort records
with new Zenith WAL records. A separate zenith WAL record is created for
each modified CLOG page. This allows removing the 'main_data_offset'
field from stored PostgreSQL WAL records, which saves some memory and
some disk space in delta layers.
Introduce zenith WAL records for updating bits in the visibility map.
Previously, when e.g. a heap insert cleared the VM bit, we duplicated the
heap insert WAL record for the affected VM page. That was very wasteful.
The heap WAL record could be massive, containing a full page image in
the worst case. This addresses github issue #941.
Move the code for decoding a WAL stream into WAL records into
'postgres_ffi', and keep the deeper parsing of the WAL records in the
'pageserver' crate, renamed to walrecord.rs.
This tidies up the dependencies a bit. 'walkeeper' reuses the same
waldecoder routines, and it used to depend on 'pageserver' because of
that. Now it only depends on 'postgres_ffi'.
(The comment in walkeeper/Cargo.toml that claimed that the dependency was
needed for ZTimelineId was obsolete. ZTimelineId is defined in
'zenith_utils', the dependency was actually needed for the waldecoder.)
Change 'zenith.signal' file to a human-readable format, similar to
backup_label. It can contain a "PREV LSN: %X/%X" line, or a special
value to indicate that it's OK to start with invalid LSN ('none'), or
that it's a read-only node and generating WAL is forbidden
('invalid').
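For illustration, the file could be interpreted roughly like this (a
sketch; the type and function names are made up for this example):

    // What the "PREV LSN: ..." line in zenith.signal can say.
    enum PrevLsn {
        Known(String), // e.g. "PREV LSN: 0/16B2E50" (LSN kept as a string here)
        None,          // "PREV LSN: none"    - OK to start with an invalid prev LSN
        Invalid,       // "PREV LSN: invalid" - read-only node, generating WAL is forbidden
    }

    fn parse_zenith_signal(contents: &str) -> Option<PrevLsn> {
        for line in contents.lines() {
            if let Some(value) = line.strip_prefix("PREV LSN: ") {
                return Some(match value.trim() {
                    "none" => PrevLsn::None,
                    "invalid" => PrevLsn::Invalid,
                    lsn => PrevLsn::Known(lsn.to_string()),
                });
            }
        }
        None
    }
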
The 'zenith pg create' and 'zenith pg start' commands now take a node
name parameter, separate from the branch name. If the node name is not
given, it defaults to the branch name, so this doesn't break existing
scripts.
If you pass "foo@<lsn>" as the branch name, a read-only node anchored
at that LSN is created. The anchoring is performed by setting the
'recovery_target_lsn' option in the postgresql.conf file, and putting
the server into standby mode with 'standby.signal'.
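For example (with a made-up LSN), the data directory of such a node ends
up with an empty standby.signal file and a postgresql.conf line like:

    recovery_target_lsn = '0/169C3D8'
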
We no longer store the synthetic checkpoint record in the WAL segment.
The postgres startup code has been changed to use the copy of the
checkpoint record in the pg_control file, when starting in zenith
mode.
This adds a fast path for the common case where the record doesn't
cross a page boundary. We now split off a new Bytes directly from the
original input buffer in that case, instead of copying the record into
a new BytesMut. This shaves about 5% off the pageserver's CPU time on
my laptop, in the 'test_bulk_insert' test.
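The pattern is roughly the following, using the bytes crate (simplified;
the real decoder has more bookkeeping around page headers):

    use bytes::{Buf, Bytes, BytesMut};

    // `inputbuf` holds incoming WAL; `len` is the length of the record
    // we are about to hand out.
    fn take_record(inputbuf: &mut BytesMut, len: usize, crosses_page_boundary: bool) -> Bytes {
        if !crosses_page_boundary {
            // Fast path: the record is contiguous in the input buffer,
            // so split it off without copying.
            inputbuf.split_to(len).freeze()
        } else {
            // Slow path: reassemble the record into a fresh buffer,
            // skipping the page headers at each boundary (elided here).
            let mut recordbuf = BytesMut::with_capacity(len);
            recordbuf.extend_from_slice(&inputbuf[..len]); // simplified
            inputbuf.advance(len);
            recordbuf.freeze()
        }
    }
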
Previously, the WAL receiver would make a decoded copy of the current
Checkpoint before each WAL record, and compare it with the Checkpoint
after the record had been processed. If it had changed, the checkpoint
relish was updated in the repository. That was somewhat expensive; the
Checkpoint::encode() function was visible in a 'perf' profile. Change
that so that we set a flag whenever the Checkpoint struct is modified,
so that we don't need to compare the whole struct anymore.
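The pattern is roughly this (field and type names here are illustrative):

    struct Checkpoint {
        next_xid: u64,
        // ... other fields ...
    }

    struct CheckpointState {
        checkpoint: Checkpoint,
        modified: bool, // set by every code path that changes `checkpoint`
    }

    impl CheckpointState {
        fn advance_next_xid(&mut self, xid: u64) {
            if xid > self.checkpoint.next_xid {
                self.checkpoint.next_xid = xid;
                self.modified = true;
            }
        }

        // After each WAL record: only re-encode and store the checkpoint
        // key if something actually changed.
        fn maybe_store(&mut self) {
            if self.modified {
                // ... encode and put the checkpoint into the repository ...
                self.modified = false;
            }
        }
    }
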
Otherwise, restarting a safekeeper before the first segment is filled
makes it report 0 as the flushed LSN. To this end, tweak
find_end_of_wal_segment to allow starting from a given LSN, not only
from the start of the segment. While here, make it less panicky.
Change control plane code to call `postgres --sync-safekeepers` before
compute node start when safekeepers are enabled. Now `pg create` will
create an empty data directory with the proper config file. Subsequent
`pg start` will run `sync-safekeepers` and will call basebackup with
the resulting LSN. Also change a few tests to accommodate this new behavior.
In passing, fix two minor issues with basebackup:
* check that we can't create branches with pre-initdb LSNs
* normalize branch LSNs that point to a segment boundary
Patch by @knizhnik
Closes #506
1) Extract consensus logic to safekeeper.rs.
2) Change the voting flow so that the acceptor reports its epoch along with
its vote, not before it; otherwise the epoch might immediately become stale. #294
3) Process messages from compute atomically and sync state properly. #270
4) Use separate structs for disk and network.
ref #315
Upgrade to bindgen 0.59, which has two new abilities:
- specify arbitrary #[derive] attributes to attach to generated structs
- request explicit padding fields
These two features are enough to replace transmute with serde/bincode.
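A hedged sketch of what that enables (the struct and its fields are
invented for this example; the real structs are generated from the
PostgreSQL headers by bindgen):

    use serde::Deserialize;

    // With explicit padding fields and a #[derive(Deserialize)] attached
    // by bindgen, the raw on-disk bytes of a C struct can be decoded with
    // bincode instead of transmute.
    #[repr(C)]
    #[derive(Debug, Deserialize)]
    struct ControlFileSample {
        system_identifier: u64,
        pg_control_version: u32,
        _padding_0: u32, // explicit padding, so every byte is accounted for
        checkpoint: u64,
    }

    fn decode(buf: &[u8]) -> Result<ControlFileSample, bincode::Error> {
        // bincode's default fixed-width, little-endian encoding lines up
        // with the struct's in-memory layout once padding is explicit.
        bincode::deserialize(buf)
    }
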
Follow PostgreSQL logic: remove twophase files when a prepared transaction
is committed or aborted. Always store twophase segments as materialized
page images (no WAL records).
- Add a new subdir postgres_ffi/samples/ for config file samples.
- Don't copy WAL to the new branch on zenith init or zenith branch.
- Call import_timeline_wal on zenith init.
Add back code to parse transaction commit and abort records, and in
particular the list of dropped relations in them. Add 'put_unlink'
function to the Timeline trait and implementation. We had the code to
handle dropped relations in the GC code and elsewhere in ObjectRepository
already, but there was nothing to create the RelationSizeEntry::Unlink
tombstone entries until now. Also add a test to check that GC correctly
removes all page versions of a dropped relation.
Implements https://github.com/zenithdb/zenith/issues/232, except for the
"orphaned" rels.
Reviewed-by: Konstantin Knizhnik
Some of these were related to handling various WAL records that are not
related to any relations, like pg_multixact updates. These should have
been removed in the revert commit 6a9c036ac1, but I missed them.
Also, we didn't do anything with commit/abort records. We will start
parsing commit/abort records in the next commit, but it seems better to
add that from a clean slate.
Reviewed-by: Konstantin Knizhnik
Note the unsafety of the unsafe block, with a link to the ongoing
discussion. This doesn't try to solve the problem, but let's at least
document the status quo.