rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 13:32:57 +00:00

Author	SHA1	Message	Date
Kirill Bulatov	8ab4c8a050	Code review fixes	2022-01-11 15:44:23 +02:00
Kirill Bulatov	65c851a451	Test pageserver's timeline http methods z	2022-01-11 15:44:23 +02:00
Kirill Bulatov	23cf2fa984	Properly shutdown storage sync loop	2022-01-11 15:44:23 +02:00
Kirill Bulatov	384b2a91fa	Pass generic pageserver params through zenith cli	2022-01-11 15:44:23 +02:00
Konstantin Knizhnik	2fd4c390cb	Do not hold timelines lock during GC (#1089 ) * Do not hold timelines lock during GC refer #1087 * Add gc_cs mutex for preveting creation of new timelines during GC * Make clippy happy * Use Mutex<()> instead of Mutex<i32> for GC critical section	2022-01-10 14:41:15 +03:00
Patrick Insinger	191d9d2b74	par_fsync - use VirtualFile	2022-01-04 20:40:57 -08:00
Patrick Insinger	24c8dab86f	pageserver - parallelize checkpoint fsyncs	2022-01-04 20:40:57 -08:00
Heikki Linnakangas	55a4cf64a1	Refactor WAL record handling. Introduce the concept of a "ZenithWalRecord", which can be a Postgres WAL record that is replayed with the Postgres WAL redo process, or a built-in type that is handled entirely by pageserver code. Replace the special code to replay Postgres XACT commit/abort records with new Zenith WAL records. A separate zenith WAL record is created for each modified CLOG page. This allows removing the 'main_data_offset' field from stored PostgreSQL WAL records, which saves some memory and some disk space in delta layers. Introduce zenith WAL records for updating bits in the visibility map. Previously, when e.g. a heap insert cleared the VM bit, we duplicated the heap insert WAL record for the affected VM page. That was very wasteful. The heap WAL record could be massive, containing a full page image in the worst case. This addresses github issue #941.	2022-01-04 11:26:37 +02:00
Heikki Linnakangas	722667f189	Add test case for performance issue #941 . The first COPY generates about 230 MB of write I/O, but the second COPY, after deleting most of the rows and vacuuming the rows away, generates 370 MB of writes. Both COPYs insert the same amount of data, so they should generate roughly the same amount of I/O. This commit doesn't try to fix the issue, just adds a test case to demonstrate it. Add a new 'checkpoint' command to the pageserver API. Previously, we've used 'do_gc' for that, but many tests, including this new one, really only want to perform a checkpoint and don't care about GC. For now, I only used the command in the new test, though, and didn't convert any existing tests to use it.	2022-01-04 11:26:37 +02:00
Konstantin Knizhnik	1c47fbae81	Do not write image layers during enforced checkpoint (#1057 ) * Do not write image layers during enforced checkpoint refer #1056 * Add Flush option to CheckpointConfig refer #1057	2022-01-01 19:08:09 +03:00
Dmitry Rodionov	c910132d4b	Fix wal receiver shutdown This patch allows to shutdown wal receiver when there are no messages and wal receiver is blocked inside tokio-postgres. In this case it cannot check the shutdown flag. This patch switches to use async interface of tokio-postgres directly without sync wrappers. It opens the possibility to use tokio::select! between the phsycal_stream.next() and a shutdown channel readiness to interrupt replication process. Also this allows to shutdown only particular wal receiver without using global shutdown_requested flag.	2021-12-29 14:42:29 +03:00
Kirill Bulatov	f0afd08667	Fix zenith init defaults	2021-12-28 00:21:48 +02:00
Kirill Bulatov	b494ac1ea0	Remove redundant pageserver cli params	2021-12-27 18:38:54 +02:00
Arseny Sher	a163650a99	Refactor Postgres command parsing in safekeeper. Do it separately with SafekeeperPostgresCommand enum as a result. Since query is always C string, switch postgres_backend process_query argument from Bytes to &str. Make passing ztli/ztenant id in safekeeper connection string optional; this is needed for upcoming intra-safekeeper heartbeat cmd which is not bound to any timeline.	2021-12-24 15:48:13 +03:00
anastasia	980f5f8440	Propagate remote_consistent_lsn to safekeepers. Change meaning of lsns in HOT_STANDBY_FEEDBACK: flush_lsn = disk_consistent_lsn, apply_lsn = remote_consistent_lsn Update compute node backpressure configuration respectively. Update compute node configuration: set 'synchronous_commit=remote_write' in setup without safekeepers. This way compute node doesn't have to wait for data checkpoint on pageserver. This doesn't guarantee data durability, but we only use this setup for tests, so it's fine.	2021-12-24 15:32:54 +03:00
bojanserafimov	b807570f46	Use parking_lot::Mutex instead of std::Mutex in walreceiver (#1045 )	2021-12-23 14:25:44 -05:00
Kirill Bulatov	114a757d1c	Use generic config parameters in pageserver cli Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>	2021-12-23 18:58:28 +02:00
Heikki Linnakangas	fdd987c3ad	Refactor the way Image- and DeltaLayers are created Introduce builder objects, DeltaLayerWriter and ImageLayerWriter. This gives more flexibility, as the DeltaLayer::create and ImageLayer::create functions don't need to know about the details of the format of where the page versions are coming from. This allows us to change the format used in InMemoryLayer more easily, without having to modify Delta- and ImageLayer code. Also refactor the code in InMemoryLayer::write_to_disk for clarity.	2021-12-23 00:33:16 +02:00
Heikki Linnakangas	da62407fce	Change the meaning of 'blknum' argument in Layer trait Previously, the 'blknum' argument of various Layer functions was the block number within the overall relation. That was pretty confusing, because an individual layer only holds data from a one segment of the relation. Furthermore, the 'put_truncation' function already dealt with per-segment size, not overall relation size, adding to the confusion. Change the meaning of the 'blknum' argument to mean the block number within the segment, not the overall relation.	2021-12-22 16:55:37 +02:00
Heikki Linnakangas	1cc181ca32	Fix WAL redo of commit records with subtransactions. If a commit record contains XIDs that are stored on different CLOG pages, we duplicate the commit record for each affected CLOG page. In the redo routine, we must only apply the parts of the record that apply to the CLOG page being restored. We got that right in the loop that handles the sub-XIDs, but incorrectly always set the bit that corresponds to the main XID.	2021-12-21 23:08:01 +02:00
Heikki Linnakangas	bcf80eaa95	Fix multixacts members WAL redo. The logic to compute the page number was broken, and as a result, only the first page of multixact members was updated correctly. All the rest were left as zeros. Improve test_multixact.py to generate more multixacts, to cover this case. Also fix the check that the restored PG data directory matches the original one. Previously, the test compared the 'pg_new' cluster, which is a bit silly because the test restored the 'pg_new' cluster only a few lines earlier, so if the multixact WAL redo is somehow broken, the comparison will just compare two broken data directories and report success. Change it to compare the original datadir, the one where the multixacts were originally created, with a restored image of the same.	2021-12-21 17:50:06 +02:00
Konstantin Knizhnik	68aa9d2715	Set utf8 encoding in initdb (#993 ) refer #992	2021-12-21 15:43:34 +03:00
Konstantin Knizhnik	76777f5812	Add utility for dumping/editing metadata file (#1031 )	2021-12-21 15:43:15 +03:00
anastasia	90e5b6f983	Don't try to reconnect failed walreceiver. If necessary, wal service will send new callmemaybe request	2021-12-20 12:34:28 +03:00
Heikki Linnakangas	75cbaafb96	Remove old ephemeral files on pageserver restart. The ephemeral files are not usable after restart, so just delete them. Before this, you got "unrecognized filename in timeline dir" warnings about them, as Konstantin noted at: https://github.com/zenithdb/zenith/issues/906#issuecomment-995530870. While we're at it, refactor away the list_files() function, moving the logic fully into the caller. Seems more straightforward.	2021-12-17 00:00:02 +02:00
Heikki Linnakangas	d8a367dd32	Remove dead code, fix typos.	2021-12-15 19:58:03 +02:00
Kirill Bulatov	ca60561a01	Propagate disk consistent lsn in timeline sync statuses	2021-12-15 15:13:21 +02:00
Heikki Linnakangas	7f78e80c51	Refactor WAL ingestion code. Rename save_decoded_record() to ingest_record(), and move the responsibility for decoding the record into ingest_record(). Also move the responsibility of updating the CheckPoint relish to ingest_record(). Put it in a new WalIngest struct, to help with tracking that.	2021-12-14 20:24:03 +02:00
Heikki Linnakangas	f8f88154d5	Split restore_local_repo.rs into two files, with more descriptive names.	2021-12-14 20:24:03 +02:00
Kirill Bulatov	5cff7d1de9	Use proper download order	2021-12-14 15:32:22 +02:00
Heikki Linnakangas	e0d41ac6a3	Move constants related to metadata file to metadata.rs. They're not used anywhere else, so seems like a better place.	2021-12-13 16:57:16 +02:00
Heikki Linnakangas	72ef59c378	Fix small typos in comments, add a comment. The introducing paragraph README could use some more love, but let's at least fix the typos.	2021-12-13 13:44:08 +02:00
Kirill Bulatov	673c297949	Download timelines on demand	2021-12-10 17:23:35 +02:00
Kirill Bulatov	e61732ca7c	Compress checkpoint files before streaming into S3	2021-12-10 17:23:35 +02:00
Heikki Linnakangas	cb4a8396fb	Use rustls rather than native-tls in all dependencies. We depends on rustls in postgres_backend anyway, so might as well use it for all TLS stuff. Seems better to depend on only one library both from a security point of view, and because fewer dependencies means less code to compile. With this commit, we no longer depend on OpenSSL.	2021-12-10 15:14:27 +02:00
Heikki Linnakangas	c77e30116e	Split waldecoder.rs into two source files. Move the code for decoding a WAL stream into WAL records into 'postgres_ffi', and keep the code to parse the WAL records deeper in 'pageserver' crate, renamed to walrecord.rs. This tidies up the dependencies a bit. 'walkeeper' reuses the same waldecoder routines, and it used to depend on 'pageserver' because of that. Now it only depends on 'postgres_ffi'. (The comment in walkeeper/Cargo.toml that claimed that the dependency was needed for ZTimelineId was obsolete. ZTimelineId is defined in 'zenith_utils', the dependency was actually needed for the waldecoder.)	2021-12-10 15:14:13 +02:00
Heikki Linnakangas	9d369f158c	Update rust-s3 to version 0.28.0 0.28.0 includes two changes I submitted to upstream: - Add support for older ListObjects API, needed to use rust-s3 with Google Cloud Storage: https://github.com/durch/rust-s3/pull/229 - If file is smaller than one chunk, don't initiate multi-part upload. https://github.com/durch/rust-s3/pull/228 These are not critical for Zenith right now, but let's stay up-to-date.	2021-12-10 14:52:08 +02:00
Heikki Linnakangas	f3f059c1f8	Fix a few cases where request beyond end of rel would error out. Currently, we return an all-zeros page if you request a block beyond end of a relation. That has been implemented in LayeredTimeline::materialize_page, so that if Layer::get_page_reconstruct_data returns Missing, it returns and all-zeros page. However InMemoryLayer and DeltaLayer would return Continue, not Missing, in that case, and materialize_page would try to find the predecessor layer. If there was a preceding image layer, then everything would still work, but if there wasn't, it would return a "could not find predecessor of layer" error. Fix that in InMemoryLayer and DeltaLayer, making them check the size of the relation and return Missing in that case. This is hard to reproduce at the moment, but it happened quickly with pgbench when I modified InMemoryLayer::write_to_disk so that it didn't always create a new ImageLayer.	2021-12-09 17:46:48 +02:00
Dmitry Ivanov	7cec13d1df	Improve shutdown story for code coverage This patch introduces fixes for several problems affecting LLVM-based code coverage: * Daemonizing parent processes should call _exit() to prevent coverage data file corruption (.profraw) due to concurrent writes. Implement proper shutdown handlers in safekeeper.	2021-12-06 13:27:52 +03:00
anastasia	c7f3b4e62c	Clarify the meaning of StandbyReply LSNs: write_lsn - The last LSN received and processed by pageserver's walreceiver. flush_lsn - same as write_lsn. At pageserver it doesn't guarantees data persistence, but it's fine. We rely on safekeepers. apply_lsn - The LSN at which pageserver guaranteed persistence of all received data (disk_consistent_lsn).	2021-12-06 12:49:42 +03:00
Heikki Linnakangas	5bad2deff8	Don't hold 'timelines' lock over checkpoint. It was very noticeable that you while the checkpointer was busy, you could not e.g. open a new connection.	2021-12-03 07:42:10 -05:00
Heikki Linnakangas	f49ad33f1b	Initialize 'loaded' correctly in DeltaLayer. While we're at it, reuse the Book and the VirtualFile that's backing it even over unload() calls. Previously, we would keep the Book open, but on load(), we would re-open it anyway, which didn't make much sense. Now we reuse it it. Alternatively, perhaps we should close it on unload() to save some memory, but this I'm not going to think too hard about it right now as the whole load/unload thing is a bit of a hack and needs to be rewritten. This is hard to reproduce ATM, because the incorrect state would get fixed by an unload(). A checkpoint creates the DeltaLayer, and it also calls unload() afterwards, so the window is not very large. I hit it occasionally with a scale 1000 pgbench test, after I had modified InMemoryLayer::write_to_disk() to not write an image layer every time, which made the DeltaLayers be accessed more often.	2021-11-30 22:23:59 +02:00
Kirill Bulatov	670205e17a	Evict excessively failing sync tasks, improve processing for the rest of the tasks	2021-11-30 13:58:49 +02:00
Konstantin Knizhnik	f72d4814b1	Extract page images from FPI WAL records (#949 ) * Extract page images from FPI WAL records * Fix issues reported in review	2021-11-30 12:57:26 +03:00
Heikki Linnakangas	5ecf0664cc	Fix off-by-one error in check for future delta layers. This doesnt show up at the moment, because we never create a delta layer with end-LSN equal to the last LSN. We always create an image layer at that LSN instead. For example, if the latest processed LSN is 100, we would create a delta layer with end LSN 100 (exclusive), and an image layer at 100. But that's just how InMemoryLayer::write_to_disk happens to work at the moment, there's no fundamental reason it needs to always create that image layer. I noticed this bug when I tried to change the logic in InMemoryLayer::write_to_disk to only create an image layer after a few delta layers.	2021-11-29 14:35:24 +02:00
Heikki Linnakangas	7cae265447	Fix dump_layerfile. The VirtualFile machinery panics if it's not initialized	2021-11-29 11:26:54 +02:00
Heikki Linnakangas	5aa969a588	Replace in-memory layers and OOM-triggered eviction with temp files. The "in-memory layer" is misnomer now, each in-memory layer is now actually backed by a file. The files are ephemeral, in that they don't survive page server crash or shutdown. To avoid reading the file for every operation, "ephemeral files" are cached in a page cache. This includes changes from 'inmemory-layer-chunks' branch to serialize / the page versions when they are added to the open layer. The difference is that they are not serialized to the expandable in-memory "chunk buffer", but written out to the file.	2021-11-26 17:25:17 +03:00
Dmitry Rodionov	130184fee9	Prohibit branch creation and basebackup at out of scope lsns Out of scope LSNs include pre initdb LSNs, and LSNs prior to latest_gc_cutoff. To get there there was also two cleanups: * Fix error handling in Execute message handler. This fixes behaviour when basebackup retured an error. Previously pageserver thread just died. * Remove "ancestor" file which previously contained ancestor id and branch lsn. Currently the same data can be obtained from metadata file. And just the way we handled ancestor file in the code introduced the case when branching fails timeline directory is created but there is no data in it except ancestor file. And this confused gc because it scans directories. So it is better to just remove ancestor file and clean up this timeline directory creation so it happens after all validity checks have passed	2021-11-25 15:27:16 +03:00
Heikki Linnakangas	d47f610606	Fix pageserver CLI parameter names and document them	2021-11-25 13:31:52 +02:00
Dmitry Rodionov	0650e51f0b	add test one more case for layer visibility	2021-11-22 11:39:20 +03:00

1 2 3 4 5 ...

560 Commits