rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-30 11:30:37 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	6127b6638b	Major storage format rewrite. Major changes and new concepts: (1) Simplify Repository to a value-store. Move the responsibility of tracking relation metadata, like which relations exist and what their sizes are, from Repository to a new module, pgdatadir_mapping.rs. The interface to Repository is now a set of simple key-value PUT/GET operations. It's still not any old key-value store though. A Repository is still responsible for handling branching, and every GET operation comes with an LSN. (2) Key. The key to the Repository key-value store is a Key struct, which consists of a few integer fields. It's wide enough to store a full RelFileNode, fork and block number, and to distinguish those from metadata keys. See pgdatadir_mapping.rs for how relation blocks and metadata keys are mapped to the Key struct. (3) Store arbitrary key-ranges in the layer files. The concept of a "segment" is gone. Each layer file can store an arbitrary range of Keys. TODO: - Deleting keys, to reclaim space. This isn't visible to Postgres; dropping or truncating a relation works as you would expect if you look at it from the compute node. If you drop a relation, for example, the relation is removed from the metadata entry, so that it appears to be gone. However, the layered repository implementation never reclaims the storage. - Tracking "logical database size", for disk space quotas. That ought to be reimplemented now in pgdatadir_mapping.rs, or perhaps in walingest.rs. - LSM compaction. The logic for checkpointing and creating image layers is very dumb. AFAIK the read code could deal with a full-fledged LSM tree now consisting of the delta and image layers. But there's no code to take a bunch of delta layers and compact them, and the heuristics for when to create image layers are pretty dumb. - The code to track the layers is inefficient. All layers are just stored in a vector, and whenever we need to find a layer, we do a linear search in it.	2022-03-09 11:36:39 +02:00
Heikki Linnakangas	55a4cf64a1	Refactor WAL record handling. Introduce the concept of a "ZenithWalRecord", which can be a Postgres WAL record that is replayed with the Postgres WAL redo process, or a built-in type that is handled entirely by pageserver code. Replace the special code to replay Postgres XACT commit/abort records with new Zenith WAL records. A separate zenith WAL record is created for each modified CLOG page. This allows removing the 'main_data_offset' field from stored PostgreSQL WAL records, which saves some memory and some disk space in delta layers. Introduce zenith WAL records for updating bits in the visibility map. Previously, when e.g. a heap insert cleared the VM bit, we duplicated the heap insert WAL record for the affected VM page. That was very wasteful. The heap WAL record could be massive, containing a full page image in the worst case. This addresses github issue #941.	2022-01-04 11:26:37 +02:00
Stas Kelvich	ed4eed0a19	Make use of `postgres --sync-safekeepers` in tests and CLI. Change control plane code to call `postgres --sync-safekeepers` before compute node start when safekeepers are enabled. Now `pg create` will create an empty data directory with the proper config file. Subsequent `pg start` will run `sync-safekeepers` and will call basebackup with the resulting LSN. Also change a few tests to accommodate this new behavior.	2021-09-06 13:06:20 +03:00
Dmitry Rodionov	bc709561b6	fix clippy warnings	2021-09-02 18:54:44 +03:00
Heikki Linnakangas	d7bebd8074	Add 'dump_layerfile' utility for debugging. Seems handy for getting a quick idea of what's stored in an image or delta layer file. Example output on a file after running pgbench for a while: % ./target/debug/dump_layerfile pgbench_layers/pg_control_checkpoint_0_00000000016B914A ----- image layer for checkpoint.0 at 0/16B914A ---- non-blocky (88 bytes) % ./target/debug/dump_layerfile pgbench_layers/pg_xact_0000_0_000000000412FD40 ----- image layer for pg_xact/0000.0 at 0/412FD40 ---- (1) blocks % ./target/debug/dump_layerfile pgbench_layers/rel_1663_14236_1247_0_0_00000000016B914A_000000000412FD40 | head -n 20 ----- delta layer for 1663/14236/1247.0 0/16B914A-0/412FD40 ---- --- relsizes --- 0/16B914A: 14 0/16CA559: 15 --- page versions --- blk 13 at 0/16BB1D2: rec 8162 bytes will_init: true HEAP INSERT blk 14 at 0/16CA559: rec 8241 bytes will_init: true XLOG FPI blk 14 at 0/16CA637: rec 215 bytes will_init: true HEAP INSERT blk 14 at 0/16DF14F: rec 215 bytes will_init: false HEAP INSERT blk 14 at 0/16DF3A7: rec 215 bytes will_init: false HEAP INSERT blk 14 at 0/16E0637: rec 215 bytes will_init: false HEAP INSERT blk 14 at 0/16E088F: rec 215 bytes will_init: false HEAP INSERT blk 14 at 0/16E5F9F: rec 215 bytes will_init: false HEAP INSERT blk 14 at 0/16E620F: rec 215 bytes will_init: false HEAP INSERT	2021-09-01 12:20:16 -07:00
anastasia	8b3a293bb0	Use postgres_ffi bindings instead of custom type definitions. Move several functions to postgres_ffi crate	2021-09-01 16:11:44 +03:00
Patrick Insinger	d265b4cdd3	waldecoder - check for trailing bytes. When we parse the main data in a WAL record, ensure we consume all bytes.	2021-08-26 10:24:33 -07:00
anastasia	1bfade8adc	Issue #330. Use put_unlink for twophase relishes. Follow PostgreSQL logic: remove Twophase files when prepared transaction is committed/aborted. Always store Twophase segments as materialized page images (no wal records).	2021-08-12 14:42:21 +03:00
anastasia	1e6267a35f	Get rid of snapshot directory + related code cleanup and refactoring. - Add new subdir postgres_ffi/samples/ for config file samples. - Don't copy wal to the new branch on zenith init or zenith branch. - Import_timeline_wal on zenith init.	2021-07-23 13:21:45 +03:00
Konstantin Knizhnik	eb0a56eb22	Replay non-relational WAL records on page server	2021-07-16 18:43:07 +03:00
Konstantin Knizhnik	3e69c41a47	Add XLOG_HEAP_OPMASK to pg_constants	2021-07-10 10:09:56 +03:00
Heikki Linnakangas	ced338fd20	Handle relation DROPs in page server. Add back code to parse transaction commit and abort records, and in particular the list of dropped relations in them. Add 'put_unlink' function to the Timeline trait and implementation. We had the code to handle dropped relations in the GC code and elsewhere in ObjectRepository already, but there was nothing to create the RelationSizeEntry::Unlink tombstone entries until now. Also add a test to check that GC correctly removes all page versions of a dropped relation. Implements https://github.com/zenithdb/zenith/issues/232, except for the "orphaned" rels. Reviewed-by: Konstantin Knizhnik	2021-06-29 00:27:10 +03:00
Heikki Linnakangas	44c35722d8	Remove a bunch of dead code. Some of these were related to handling various WAL records that are not related to any relations, like pg_multixact updates. These should have been removed in the revert commit `6a9c036ac1`, but I missed them. Also, we didn't do anything with commit/abort records. We will start parsing commit/abort records in the next commit, but it seems better to add that from a clean slate. Reviewed-by: Konstantin Knizhnik	2021-06-29 00:26:53 +03:00
Heikki Linnakangas	4f1b22a2c8	Use ObjectTag enum instead of special fork number to store metadata objects. Extracted from Konstantin's larger PR: https://github.com/zenithdb/zenith/pull/268	2021-06-22 21:34:31 +03:00
anastasia	0969574d48	Use bindgen for various xlog structures and checkpoint. Implement encode/decode methods for them. Some methods are unused now. This is a preparatory commit for nonrel_wal	2021-06-09 01:00:42 +03:00
Heikki Linnakangas	d18cc8a3a8	Update 'postgres_ffi' module's readme file and comments. Explain the purpose of the 'postgres_ffi' module, explain what the PostgreSQL control file is, and some other minor cleanup.	2021-06-04 23:05:11 +03:00
Heikki Linnakangas	ac60b68d50	Handle VM and FSM truncation WAL records in the page server. Fixes issue #190. Original patch by Konstantin Knizhnik.	2021-05-31 23:36:17 +03:00
Heikki Linnakangas	d5fe515363	Implement "checkpointing" in the page server. - Previously, we checked on first use of a timeline, whether there is a snapshot and WAL for the timeline, and loaded it all into the (rocksdb) repository. That's a waste of effort if we had done that earlier already, and stopped and restarted the server. Track the last LSN that we have loaded into the repository, and only load the recent missing WAL after that. - When you create a new zenith repository with "zenith init", immediately load the initial empty postgres cluster into the rocksdb repository. Previously, we only did that on the first connection. This way, we don't need any "load from filesystem" codepath during normal operation, we can assume that the repository for a timeline is always up to date. (We might still want to use the functionality to import an existing PostgreSQL data directory into the repository in the future, as a separate Import feature, but not today.)	2021-05-24 17:02:05 +03:00
Heikki Linnakangas	6a9c036ac1	Revert all changes related to storing and restoring non-rel data in page server. This includes the following commits: `35a1c3d521` Specify right LSN in test_createdb.py `d95e1da742` Fix issue with propagation of CREATE DATABASE to the branch `8465738aa5` [refer #167] Fix handling of pg_filenode.map files in page server `86056abd0e` Fix merge conflict: set initial WAL position to second segment because of pg_resetwal `2bf2dd1d88` Add nonrelfile_utils.rs file `20b6279beb` Fix restoring non-relational data during compute node startup `06f96f9600` Do not transfer WAL to computation nodes: use pg_resetwal for node startup As well as some older changes related to storing CLOG and MultiXact data as "pseudorelation" in the page server. With this revert, we go back to the situation that when you create a new compute node, we ship all the WAL from the beginning of time to the compute node. Obviously we need a better solution, like the code that this reverts. But per discussion with Konstantin and Stas, this stuff was still half-baked, and it's better for it to live in a branch for now, until it's more complete and has gone through some review.	2021-05-24 16:05:45 +03:00
Konstantin Knizhnik	20b6279beb	Fix restoring non-relational data during compute node startup	2021-05-20 14:14:52 +03:00
Konstantin Knizhnik	06f96f9600	Do not transfer WAL to computation nodes: use pg_resetwal for node startup	2021-05-20 14:13:47 +03:00
Heikki Linnakangas	aa8debf4e8	Add test for a relation that's larger than 1 GB. This isn't very exciting with the current RocksDB implementation, because it doesn't care about the PostgreSQL 1 GB segment boundaries at all. But I think we will care about this in the future, and more tests is generally better anyway.	2021-05-19 09:22:17 +03:00
Konstantin Knizhnik	04dc698d4b	Add support of twophase transactions	2021-05-16 00:03:20 +03:00
Konstantin Knizhnik	9ece1e863d	Compute and restore pg_xact, pg_multixact and pg_filenode.map files	2021-05-14 16:35:09 +03:00
Konstantin Knizhnik	2f2dff4c8d	Merge with main branch	2021-05-12 10:46:01 +03:00
Konstantin Knizhnik	22e7fcbf2d	Handle visibility map updates in WAL redo	2021-05-12 10:38:43 +03:00
Heikki Linnakangas	33d126ecbe	Tidy up usage of a few constants from PostgreSQL headers.	2021-05-06 21:57:01 +03:00
anastasia	15db0d1d6f	refactor walreceiver and restore_local_repo	2021-05-06 12:58:08 +03:00
Konstantin Knizhnik	eea6f0898e	Restore CLOG from snapshot	2021-04-30 14:22:47 +03:00
anastasia	b49164a1d4	cargo fmt	2021-04-29 18:41:42 +03:00
anastasia	e7b112aacc	Refactor pg_constants. Move them to postgres_ffi/	2021-04-29 18:41:42 +03:00

31 Commits