rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-28 10:30:40 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	468366a28f	Fix wrong 'lsn' stored in test page image. The test creates a page version with a string like "foo 123 at 0/10" as the content. But the LSN stored in that string was wrong: the page version stored at LSN 0/20 would say "foo <blk> at 0/10".	2022-02-23 11:33:17 +02:00
Heikki Linnakangas	9632c352ab	Avoid having multiple records for the same page and LSN. If a heap UPDATE record modified two pages, and both pages needed to have their VM bits cleared, and the VM bits were located on the same VM page, we would emit two ZenithWalRecord::ClearVisibilityMapFlags records for the same VM page. That produced warnings like this in the pageserver log: Page version Wal(ClearVisibilityMapFlags { heap_blkno: 18, flags: 3 }) of rel 1663/13949/2619_vm blk 0 at 2A/346046A0 already exists To fix, change ClearVisibilityMapFlags so that it can update the bits for both pages as one operation. This was already covered by several python tests, so no need to add a new one. Fixes #1125. Co-authored-by: Konstantin Knizhnik <knizhnik@zenith.tech>	2022-02-15 14:26:16 +02:00
Dmitry Rodionov	5df21e1058	remove Timeline::start_lsn in favor of ancestor_lsn	2022-01-28 12:31:15 +03:00
Konstantin Knizhnik	79f0e44a20	Gc cutoff rwlock (#1139) * Reproduce github issue #1047. * Use RwLock to protect gc_cutoff_lsn * Reduce number of updates in test_gc_aggressive * Change test_prohibit_get_page_at_lsn_for_garbage_collected_pages test * Change test_prohibit_get_page_at_lsn_for_garbage_collected_pages * Lock latest_gc_cutoff_lsn in all operations accessing storage to prevent race conditions with GC * Remove random sleep between wait_for_lsn and get_page_at_lsn * Initialize latest_gc_cutoff with initdb_lsn and remove separate check that lsn >= initdb_lsn * Update test_prohibit_branch_creation_on_pre_initdb_lsn test Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>	2022-01-27 14:41:16 +03:00
Dmitry Rodionov	37c440c5d3	Introduce first version of tenant migration between pageservers. This patch includes attach/detach http endpoints in pageservers, some changes in callmemaybe handling inside safekeeper, and an integration test to check migration with and without load. There are still some rough edges that will be addressed in follow-up patches.	2022-01-24 17:20:15 +03:00
Heikki Linnakangas	dab30c27b6	Refactor thread management and shutdown. This introduces a new module to handle thread creation and shutdown. All page server threads are now registered in a global hash map, and there's a function to request individual threads to shut down gracefully. Thread shutdown request is signalled to the thread with a flag, as well as a Future that can be used to wake up async operations if shutdown is requested. Use that facility to have the libpq listener thread respond to pageserver shutdown, based on Kirill's earlier prototype (https://github.com/zenithdb/zenith/pull/1088). That addresses https://github.com/zenithdb/zenith/issues/1036; previously the libpq listener thread would not exit until one more connection arrived. This also eliminates a resource leak in the accept() loop. Previously, we added the JoinHandle of each new thread to a vector, but old handles for threads that had already exited were never removed.	2022-01-14 18:36:10 +02:00
Heikki Linnakangas	55a4cf64a1	Refactor WAL record handling. Introduce the concept of a "ZenithWalRecord", which can be a Postgres WAL record that is replayed with the Postgres WAL redo process, or a built-in type that is handled entirely by pageserver code. Replace the special code to replay Postgres XACT commit/abort records with new Zenith WAL records. A separate zenith WAL record is created for each modified CLOG page. This allows removing the 'main_data_offset' field from stored PostgreSQL WAL records, which saves some memory and some disk space in delta layers. Introduce zenith WAL records for updating bits in the visibility map. Previously, when e.g. a heap insert cleared the VM bit, we duplicated the heap insert WAL record for the affected VM page. That was very wasteful. The heap WAL record could be massive, containing a full page image in the worst case. This addresses github issue #941.	2022-01-04 11:26:37 +02:00
Kirill Bulatov	114a757d1c	Use generic config parameters in pageserver cli Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>	2021-12-23 18:58:28 +02:00
Heikki Linnakangas	da62407fce	Change the meaning of 'blknum' argument in Layer trait. Previously, the 'blknum' argument of various Layer functions was the block number within the overall relation. That was pretty confusing, because an individual layer only holds data from one segment of the relation. Furthermore, the 'put_truncation' function already dealt with per-segment size, not overall relation size, adding to the confusion. Change the meaning of the 'blknum' argument to mean the block number within the segment, not the overall relation.	2021-12-22 16:55:37 +02:00
Kirill Bulatov	ca60561a01	Propagate disk consistent lsn in timeline sync statuses	2021-12-15 15:13:21 +02:00
Kirill Bulatov	673c297949	Download timelines on demand	2021-12-10 17:23:35 +02:00
Heikki Linnakangas	f3f059c1f8	Fix a few cases where request beyond end of rel would error out. Currently, we return an all-zeros page if you request a block beyond end of a relation. That has been implemented in LayeredTimeline::materialize_page, so that if Layer::get_page_reconstruct_data returns Missing, it returns an all-zeros page. However, InMemoryLayer and DeltaLayer would return Continue, not Missing, in that case, and materialize_page would try to find the predecessor layer. If there was a preceding image layer, then everything would still work, but if there wasn't, it would return a "could not find predecessor of layer" error. Fix that in InMemoryLayer and DeltaLayer, making them check the size of the relation and return Missing in that case. This is hard to reproduce at the moment, but it happened quickly with pgbench when I modified InMemoryLayer::write_to_disk so that it didn't always create a new ImageLayer.	2021-12-09 17:46:48 +02:00
Kirill Bulatov	670205e17a	Evict excessively failing sync tasks, improve processing for the rest of the tasks	2021-11-30 13:58:49 +02:00
Heikki Linnakangas	5ecf0664cc	Fix off-by-one error in check for future delta layers. This doesn't show up at the moment, because we never create a delta layer with end-LSN equal to the last LSN. We always create an image layer at that LSN instead. For example, if the latest processed LSN is 100, we would create a delta layer with end LSN 100 (exclusive), and an image layer at 100. But that's just how InMemoryLayer::write_to_disk happens to work at the moment; there's no fundamental reason it needs to always create that image layer. I noticed this bug when I tried to change the logic in InMemoryLayer::write_to_disk to only create an image layer after a few delta layers.	2021-11-29 14:35:24 +02:00
Dmitry Rodionov	130184fee9	Prohibit branch creation and basebackup at out-of-scope LSNs. Out-of-scope LSNs include pre-initdb LSNs and LSNs prior to latest_gc_cutoff. To get there, there were also two cleanups: * Fix error handling in the Execute message handler. This fixes the behaviour when basebackup returned an error. Previously the pageserver thread just died. * Remove the "ancestor" file, which previously contained the ancestor id and branch lsn. Currently the same data can be obtained from the metadata file. The way we handled the ancestor file in the code also introduced a case where, when branching fails, the timeline directory is created but contains no data except the ancestor file. This confused gc, because it scans directories. So it is better to just remove the ancestor file and clean up the timeline directory creation so that it happens after all validity checks have passed.	2021-11-25 15:27:16 +03:00
Dmitry Rodionov	0650e51f0b	add one more test case for layer visibility	2021-11-22 11:39:20 +03:00
Dmitry Rodionov	6f7ebe6e01	preserve data in parent branch that might be referenced in child branch	2021-11-22 11:39:20 +03:00
Dmitry Rodionov	44111e3ba3	Prohibit branch creation at lsn that was already garbage collected. This introduces a new timeline field, latest_gc_cutoff. It is updated before each gc iteration. A new check is added to branch_timelines to prevent branch creation with a start point less than latest_gc_cutoff. This also adds a check to get_page_at_lsn which asserts that the lsn at which the page is requested was not garbage collected. This check is currently triggered for readonly nodes which are pinned to a specific lsn: because they are not tracked in pageserver, garbage collection can remove data that still might be referenced. This is a bug and will be fixed separately.	2021-11-15 20:03:16 +03:00
Heikki Linnakangas	4ba783d0af	Remove a couple of unused functions. We might want to have custom serialize/deserialize functions for WALRecords and PageVersions for performance reasons, see github issue 832. But that would probably look a bit different from this, and currently these functions are just dead.	2021-11-03 19:10:23 +02:00
Heikki Linnakangas	fb524dd973	Put a global limit on memory used by in-memory layers. Adds simple global tracking of memory used by the in-memory layers. It's very approximate, it doesn't take into account allocator, memory fragmentation or many other things, but it's a good first step. After storing a WAL record in the repository, the WAL receiver checks the global memory usage. If it's above a configurable threshold (hard coded at 128 MB at the moment), it evicts a layer. The victim layer is chosen by the GClock algorithm, similar to that used in the Postgres buffer cache. This stops the page server from using an unbounded amount of memory. It's pretty crude: the eviction, materializing, and writing of a layer to disk now happen in the WAL receiver thread. It would be nice to move that to a background thread, and it would be nice to have a smarter policy on when to materialize a new image layer and when to just write out a delta layer, and it would be nice to have more accurate accounting of memory. But this should fix the most pressing OOM issues, and is a step in the right direction. Co-authored-by: Patrick Insinger <patrickinsinger@gmail.com>	2021-11-02 15:49:39 +02:00
Kirill Bulatov	e6ef27637b	Better API to handle timeline metadata properly	2021-10-29 23:51:40 +03:00
anastasia	ea5900f155	Refactoring of checkpointer and GC. Move them to a separate tenant_threads module to detangle thread management from LayeredRepository implementation.	2021-10-27 20:50:26 +03:00
Kirill Bulatov	04fb0a0342	Add core relish backup and restore functionality	2021-10-22 22:22:38 +03:00
Konstantin Knizhnik	c310932121	Implement backpressure for compute node to avoid WAL overflow Co-authored-by: Arseny Sher <sher-ars@yandex.ru> Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>	2021-10-21 18:15:50 +03:00
Heikki Linnakangas	feae7f39c1	Support read-only nodes. Change 'zenith.signal' file to a human-readable format, similar to backup_label. It can contain a "PREV LSN: %X/%X" line, or a special value to indicate that it's OK to start with invalid LSN ('none'), or that it's a read-only node and generating WAL is forbidden ('invalid'). The 'zenith pg create' and 'zenith pg start' commands now take a node name parameter, separate from the branch name. If the node name is not given, it defaults to the branch name, so this doesn't break existing scripts. If you pass "foo@<lsn>" as the branch name, a read-only node anchored at that LSN is created. The anchoring is performed by setting the 'recovery_target_lsn' option in the postgresql.conf file, and putting the server into standby mode with 'standby.signal'. We no longer store the synthetic checkpoint record in the WAL segment. The postgres startup code has been changed to use the copy of the checkpoint record in the pg_control file, when starting in zenith mode.	2021-10-19 09:48:12 +03:00
Patrick Insinger	1c29de81de	pageserver - remove `lsn` from `WALRecord`	2021-10-13 00:03:42 -07:00
Patrick Insinger	6e5ca5dc5c	pageserver - create TimelineWriter	2021-10-12 10:02:15 -07:00
anastasia	d7c9dd06f4	Implement graceful shutdown at 'pageserver stop': - perform checkpoint for each tenant repository. - wait for the completion of all threads. Add new option 'immediate' to 'pageserver stop' command to terminate the pageserver immediately.	2021-10-11 13:35:01 +03:00
Kirill Bulatov	bf58f7f649	Expose certain layered repository structs to reuse in relish storage (#688)	2021-10-09 19:23:57 +03:00
Kirill Bulatov	5719f13cb2	Rework the relish thread model (#689)	2021-10-05 10:15:56 +03:00
Patrick Insinger	d134a9856e	pageserver - introduce RepoHarness for testing	2021-10-04 08:36:35 -07:00
Patrick Insinger	664b99b5ac	pageserver - use constant TIMELINE_ID for tests	2021-10-04 08:36:35 -07:00
Patrick Insinger	7095a5d551	pageserver - reject and backup future layer files. If a layer file is found with LSN after the disk_consistent_lsn, it is renamed (to avoid conflicts with new layer files) and a warning is logged.	2021-10-01 11:41:39 -07:00
Patrick Insinger	538c2a2a3e	pageserver - store timeline metadata durably. The metadata file is now always 512 bytes. The last 4 bytes are a crc32c checksum of the previous 508 bytes. Padding zeroes are added between the serde serialization and the start of the checksum. A single write call is used, and the file is fsynced after. On file creation, the parent directory is fsynced as well.	2021-10-01 11:41:39 -07:00
anastasia	7fb7f67bb4	Fix relish extension after it was dropped or truncated. - Turn dropped layers into non-writeable in get_layer_for_write(). - Handle non-writeable dropped layers in checkpointer. They don't need freezing, so just remove them from list of open_segs and write out to disk. - Remove code that handles dropped layers in freeze() function. It is not used anymore.	2021-09-23 13:19:45 +03:00
anastasia	86164c8b33	Add unit tests for drop_lsn. test_drop_extend and test_truncate_extend illustrate what happens if we dropped a segment and then created it again within the same layer.	2021-09-23 13:19:45 +03:00
anastasia	a4fc6da57b	Fix gc_internal to treat dropped layers. Some dropped layers serve as tombstones for earlier layers and thus cannot be garbage collected. Add new fields to GcResult for layers that are preserved as tombstones	2021-09-23 12:21:47 +03:00
anastasia	c934e724a8	Enable test_list_rels_drop test	2021-09-23 12:21:47 +03:00
anastasia	e554f9514f	gc refactoring - rename 'compact' argument of GC to 'checkpoint_before_gc'. - gc_iteration_internal() refactoring	2021-09-23 12:21:47 +03:00
Kirill Bulatov	1d5abf1253	Initial version of the relish storage	2021-09-17 15:30:22 +03:00
anastasia	98d4f9cea5	Add checkpoint_distance config parameter. - Change hardcoded OLDEST_INMEM_DISTANCE value to pageserver config option checkpoint_distance. - Get rid of 'force' flag in checkpoint_internal(). Use checkpoint_distance=0 instead.	2021-09-16 12:33:50 +03:00
Max Sharnoff	b11b0bb088	bin_ser: reject trailing bytes by default (#587). Changes `LeSer`/`BeSer::des`. Also adds a new `des_prefix` function to keep a way to allow trailing bytes.	2021-09-15 11:48:19 -07:00
Dmitry Rodionov	4ebe643d0c	Support parallel test running for python tests. Support is done via the pytest-xdist plugin. To use the feature, add -n<concurrency> to the pytest invocation, e.g. pytest -n8 to run 8 tests in parallel. Changes in code are mostly about port assignment. Previously the port for pageserver was hardcoded without the ability to override it through zenith cli, and ports for started compute nodes were calculated twice, in zenith cli and in test code. Now zenith cli supports port arguments for pageserver and compute nodes to be passed explicitly. Tests are modified in such a way that each worker gets a non-overlapping port range, which can be configured and now contains 100 ports. These ports are distributed to test services (pageserver, wal acceptors, compute nodes) so they can work independently.	2021-09-15 14:02:15 +03:00
Heikki Linnakangas	ab33614ab1	Forbid adding WAL to the repository after advancing last record LSN. When you advance last record LSN, all changes up to that LSN should be imported into repository. We have been a bit sloppy about that when it comes to the checkpoint information that we also store in the repository. In WAL receiver, for example, we would receive a WAL record, advance last record LSN, and only then update the checkpoint relish at the same LSN. Reorder that so that you advance the last record LSN only after updating the checkpoint relish. It hasn't apparently caused any problems so far, but let's be tidy. Tighten the check for that in get_layer_for_write(), so that it checks for 'lsn > last_record_lsn' rather than 'lsn >= last_record_lsn'.	2021-09-10 10:59:09 +03:00
Heikki Linnakangas	03dff207db	Remove start_lsn arg from `create_empty_repository`. Always use lsn(0) as the initial last_record_lsn. It is updated soon after creating the timeline anyway, after loading the bootstrap data, so it doesn't stay long in that state. I was a bit worried about using a special value like 0, but it's actually nice that you can distinguish it from any real LSN value. The unit tests have been using Lsn(0) as the initial start LSN all along.	2021-09-10 10:24:35 +03:00
Heikki Linnakangas	6a8785379a	Add explicit 'wait_lsn' calls before get_page_at_lsn and such calls. Move the responsibility to wait for the WAL to arrive to the callers, and remove the wait_lsn() calls from the Timeline::get_page_at_lsn() and friends. We were not totally consistent before, list_rels() was missing the wait_lsn() call for example. Closes https://github.com/zenithdb/zenith/issues/521	2021-09-10 09:56:11 +03:00
anastasia	b79754d06e	list_rels() and list_nonrels() refactoring: move shared code to list_relishes() function.	2021-09-09 16:05:32 +03:00
anastasia	674807eee1	Add test for dropped relations. Fix list_rels() and list_nonrels() functions	2021-09-09 16:05:32 +03:00
Dmitry Rodionov	b4ecae33e4	add incremental tracking of logical timeline size. In order to exclude problems with synchronizing disk and memory, logical size is not stored in metadata on disk. It is calculated on timeline "start" by scanning the contents of layered repo, and then size is maintained via an atomic variable. This patch also adds a new endpoint to the pageserver http api: branch detail. It allows retrieval of a particular branch's info by its name. Size info is also added to the response of the endpoint and used in tests.	2021-09-07 18:25:15 +03:00
Heikki Linnakangas	04ee1d5977	Add test for managing old open segments in binary heap. I thought this test would trigger the bug fixed in the previous commit, but it did not. More tests are nice in any case.	2021-09-07 18:10:07 +03:00
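
Commit 538c2a2a3e describes a fixed-size metadata file: 512 bytes, of which the last 4 are a crc32c checksum over the preceding 508, with zero padding after the serialized payload. The following is only a rough sketch of that layout, not the pageserver's code. The function names, the little-endian checksum byte order, and the hand-rolled bitwise CRC-32C (used here to avoid an external crate) are all assumptions made for illustration.

```rust
/// Total file size and checksum size, as stated in the commit message.
pub const METADATA_SIZE: usize = 512;
const CHECKSUM_SIZE: usize = 4;
const BODY_SIZE: usize = METADATA_SIZE - CHECKSUM_SIZE; // 508 bytes

/// Bitwise CRC-32C (Castagnoli polynomial, reflected form 0x82F63B78).
/// Slow but dependency-free; a real implementation would likely use a crate.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // mask is all ones when the low bit is set, else zero
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Hypothetical helper: pad the serialized payload with zeroes to 508 bytes
/// and append the checksum. Returns None if the payload does not fit.
pub fn encode_metadata(serialized: &[u8]) -> Option<[u8; METADATA_SIZE]> {
    if serialized.len() > BODY_SIZE {
        return None;
    }
    let mut buf = [0u8; METADATA_SIZE];
    buf[..serialized.len()].copy_from_slice(serialized);
    let checksum = crc32c(&buf[..BODY_SIZE]);
    // Assumption: little-endian byte order for the stored checksum.
    buf[BODY_SIZE..].copy_from_slice(&checksum.to_le_bytes());
    Some(buf)
}

/// Hypothetical helper: return the 508-byte body (payload plus padding)
/// if the stored checksum matches, None otherwise.
pub fn verify_metadata(buf: &[u8; METADATA_SIZE]) -> Option<&[u8]> {
    let mut stored = [0u8; CHECKSUM_SIZE];
    stored.copy_from_slice(&buf[BODY_SIZE..]);
    if u32::from_le_bytes(stored) == crc32c(&buf[..BODY_SIZE]) {
        Some(&buf[..BODY_SIZE])
    } else {
        None
    }
}
```

Because the buffer is built fully in memory first, it can be handed to a single write call followed by an fsync, which is the durability pattern the commit message describes.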
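
Commit fb524dd973 says the layer to evict is chosen by a GClock algorithm similar to the one in the Postgres buffer cache. As a minimal sketch of that general technique (not the pageserver's implementation): each slot has a small usage counter, a clock hand sweeps over the slots decrementing non-zero counters, and the first slot found at zero becomes the victim. The `GClock` type, its method names, and the cap of 5 are hypothetical choices for this example.

```rust
/// Upper bound on the usage counter (an assumption for this sketch).
const MAX_USAGE: u8 = 5;

/// Minimal generalized-clock victim selector over a fixed set of slots.
pub struct GClock {
    usage: Vec<u8>,
    hand: usize,
}

impl GClock {
    pub fn new(slots: usize) -> Self {
        GClock { usage: vec![0; slots], hand: 0 }
    }

    /// Record an access to `slot`, bumping its usage count up to the cap.
    pub fn touch(&mut self, slot: usize) {
        if self.usage[slot] < MAX_USAGE {
            self.usage[slot] += 1;
        }
    }

    /// Sweep the hand until a slot with usage count zero is found.
    /// Slots passed over have their counter decremented, so the loop
    /// terminates after a bounded number of passes over the slots.
    pub fn pick_victim(&mut self) -> Option<usize> {
        if self.usage.is_empty() {
            return None;
        }
        loop {
            let slot = self.hand;
            self.hand = (self.hand + 1) % self.usage.len();
            if self.usage[slot] == 0 {
                return Some(slot);
            }
            self.usage[slot] -= 1;
        }
    }
}
```

In this sketch a caller would invoke `touch` whenever a layer is accessed and `pick_victim` once the tracked memory usage exceeds the threshold; frequently touched slots survive more sweeps than cold ones.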
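
Commits 5ecf0664cc and 7095a5d551 both concern layer files "from the future", that is, files holding data past disk_consistent_lsn. The subtlety in the off-by-one fix is that a delta layer's end LSN is exclusive, while an image layer sits at a single LSN. The sketch below shows one way such a check could be written under that reading. The function names and the plain `u64` LSN type are hypothetical, and the exact comparison used in the pageserver is an assumption here.

```rust
/// An image layer holds page images at exactly `lsn`; it is from the
/// future if that LSN lies beyond what is known to be durable.
pub fn image_layer_is_future(lsn: u64, disk_consistent_lsn: u64) -> bool {
    lsn > disk_consistent_lsn
}

/// A delta layer covers [start_lsn, end_lsn), so its last included LSN is
/// end_lsn - 1. With disk_consistent_lsn = 100, an exclusive end of 101 is
/// still fine, while 102 would include LSN 101, which may not be durable.
pub fn delta_layer_is_future(end_lsn_exclusive: u64, disk_consistent_lsn: u64) -> bool {
    end_lsn_exclusive > disk_consistent_lsn.saturating_add(1)
}
```

Comparing the exclusive end directly against disk_consistent_lsn would wrongly flag a delta layer whose last record is exactly at the consistent LSN, which is the kind of off-by-one the commit message describes.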
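
Several entries above write LSNs in the Postgres "%X/%X" form (for example 0/10 or 2A/346046A0), and commit feae7f39c1 introduces the "foo@<lsn>" syntax for a read-only node anchored at an LSN. The sketch below shows how such strings could be parsed, assuming the usual Postgres convention that the two hex fields are the high and low 32 bits of a 64-bit LSN. The function names are hypothetical and are not taken from the zenith CLI.

```rust
/// Parse "HI/LO" (both hexadecimal) into a 64-bit LSN.
pub fn parse_lsn(s: &str) -> Option<u64> {
    let (hi, lo) = s.split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some((u64::from(hi) << 32) | u64::from(lo))
}

/// Format a 64-bit LSN back into the "%X/%X" form.
pub fn format_lsn(lsn: u64) -> String {
    format!("{:X}/{:X}", lsn >> 32, lsn & 0xFFFF_FFFF)
}

/// Split a branch argument into (branch name, optional anchor LSN).
/// "main" yields ("main", None); "foo@0/20" yields ("foo", Some(0x20)).
/// Returns None if an '@' is present but the LSN part does not parse.
pub fn parse_branch_spec(spec: &str) -> Option<(&str, Option<u64>)> {
    match spec.split_once('@') {
        Some((name, lsn)) => Some((name, Some(parse_lsn(lsn)?))),
        None => Some((spec, None)),
    }
}
```

Treating a malformed LSN as an error, rather than silently falling back to a writable node, is a design choice of this sketch; the source does not say how the real command handles that case.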
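
Commit 9632c352ab changes ClearVisibilityMapFlags so that one record can clear the bits for both heap pages touched by an UPDATE when they map to the same VM page. A rough sketch of that idea follows. The struct and field names are illustrative guesses rather than the actual ZenithWalRecord definition, and the bitmap is treated as a bare byte slice with two bits per heap block, ignoring the page header that a real VM page has.

```rust
/// Two visibility-map bits per heap block, so four heap blocks per byte.
const BITS_PER_HEAPBLOCK: u32 = 2;
const HEAPBLOCKS_PER_BYTE: u32 = 8 / BITS_PER_HEAPBLOCK;

/// Hypothetical record: clear `flags` for up to two heap blocks at once.
pub struct ClearVmFlags {
    pub new_heap_blkno: Option<u32>,
    pub old_heap_blkno: Option<u32>,
    pub flags: u8,
}

/// Apply the record to a raw bitmap. Block numbers are relative to the
/// start of `bitmap`; an out-of-range block number panics on indexing.
pub fn apply_clear_vm_flags(rec: &ClearVmFlags, bitmap: &mut [u8]) {
    for blkno in [rec.new_heap_blkno, rec.old_heap_blkno].iter().flatten() {
        let blkno = *blkno;
        let byte = (blkno / HEAPBLOCKS_PER_BYTE) as usize;
        let shift = (blkno % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;
        // Clear only the requested flag bits for this heap block.
        bitmap[byte] &= !(rec.flags << shift);
    }
}
```

With a single record carrying both block numbers, there is only one page version for the VM page at that LSN, which avoids the "already exists" warning quoted in the commit message.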

170 Commits