rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-04 04:30:38 +00:00

Author	SHA1	Message	Date
Kirill Bulatov	33251a9d8f	Disable failing remote storage tests for now	2022-01-28 18:35:46 +03:00
Konstantin Knizhnik	c045ae7a9b	Fix random range for keys in test_gc_aggressive.py (#1199 )	2022-01-28 16:29:55 +03:00
Dmitry Rodionov	602ccb7d5f	distinguish failures for pre-initdb lsn and pre-ancestor lsn branching in test_branch_behind	2022-01-28 12:31:15 +03:00
Konstantin Knizhnik	08135910a5	Fix checkpoint.nextXid update (#1166 ) * Fix checkpoint.nextXid update * Add test for checkpoint.nextXid * Fix indentation of test_next_xid.py * Fix mypy error in test_next_xid.py * Tidy up the test case. * Add a unit test Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>	2022-01-27 18:21:51 +03:00
Arthur Petukhovsky	cedde559b8	Add test for replacement of the failed safekeeper (#1179 ) * Add test to replace failed safekeeper * Restart safekeepers in test_replace_safekeeper * Update vendor/postgres	2022-01-27 17:26:55 +03:00
Konstantin Knizhnik	79f0e44a20	Gc cutoff rwlock (#1139 ) * Reproduce github issue #1047. * Use RwLock to protect gc_cuttof_lsn * Reduce number of updates in test_gc_aggressive * Change test_prohibit_get_page_at_lsn_for_garbage_collected_pages test * Change test_prohibit_get_page_at_lsn_for_garbage_collected_pages * Lock latest_gc_cutoff_lsn in all operations accessing storage to prevent race conditions with GC * Remove random sleep between wait_for_lsn and get_page_at_lsn * Initialize latest_gc_cutoff with initdb_lsn and remove separate check that lsn >= initdb_lsn * Update test_prohibit_branch_creation_on_pre_initdb_lsn test Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>	2022-01-27 14:41:16 +03:00
Dmitry Rodionov	63dd7bce7e	bandaid to avoid concurrent timeline downloading until proper refactoring/fix	2022-01-26 19:54:09 +03:00
Dmitry Rodionov	39591ef627	reduce flakiness	2022-01-24 17:20:15 +03:00
Dmitry Rodionov	37c440c5d3	Introduce first version of tenant migration between pageservers. This patch includes attach/detach http endpoints in pageservers, some changes in callmemaybe handling inside safekeeper, and an integration test to check migration with and without load. There are still some rough edges that will be addressed in follow-up patches.	2022-01-24 17:20:15 +03:00
Dmitry Rodionov	5f5a11525c	Switch our python package management solution to poetry. Mainly because it has better support for installing the packages from different python versions. It also has better dependency resolver than Pipenv. And supports modern standard for python dependency management. This includes usage of pyproject.toml for project specific configuration instead of per tool conf files. See following links for details: https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/ https://www.python.org/dev/peps/pep-0518/	2022-01-24 11:33:47 +03:00
Dmitry Rodionov	026eb64a83	Use python lib to mock s3	2022-01-20 18:42:47 +02:00
anastasia	7aba299dbd	Use safekeeper in test_branch_behind (#1068 ) to avoid a subtle race condition. Without safekeeper, walreceiver reconnection can get stuck because of an IO deadlock between walsender auth and a regular backend.	2022-01-12 14:38:04 +03:00
Kirill Bulatov	8ab4c8a050	Code review fixes	2022-01-11 15:44:23 +02:00
Kirill Bulatov	7c4a653230	Propagate Zenith CLI's RUST_LOG env var to subprocesses	2022-01-11 15:44:23 +02:00
Kirill Bulatov	a3cd8f0e6d	Add the remote storage test	2022-01-11 15:44:23 +02:00
Kirill Bulatov	65c851a451	Test pageserver's timeline http methods	2022-01-11 15:44:23 +02:00
Arthur Petukhovsky	70778058d9	Add test for safekeeper setup without pageserver (#1000 )	2021-12-29 12:58:27 +03:00
anastasia	5ef2b1baf7	Add new test illustrating issue with sync-safekeepers. If safekeepers sync fast enough, callmemaybe thread may never make a call before receiving Unsubscribe request. This leads to the situation, when pageserver lacks data that exists on safekeepers.	2021-12-28 17:50:48 +03:00
anastasia	980f5f8440	Propagate remote_consistent_lsn to safekeepers. Change meaning of lsns in HOT_STANDBY_FEEDBACK: flush_lsn = disk_consistent_lsn, apply_lsn = remote_consistent_lsn Update compute node backpressure configuration respectively. Update compute node configuration: set 'synchronous_commit=remote_write' in setup without safekeepers. This way compute node doesn't have to wait for data checkpoint on pageserver. This doesn't guarantee data durability, but we only use this setup for tests, so it's fine.	2021-12-24 15:32:54 +03:00
Heikki Linnakangas	1cc181ca32	Fix WAL redo of commit records with subtransactions. If a commit record contains XIDs that are stored on different CLOG pages, we duplicate the commit record for each affected CLOG page. In the redo routine, we must only apply the parts of the record that apply to the CLOG page being restored. We got that right in the loop that handles the sub-XIDs, but incorrectly always set the bit that corresponds to the main XID.	2021-12-21 23:08:01 +02:00
Heikki Linnakangas	927587cec8	Fix comments in tests	2021-12-21 22:38:33 +02:00
Heikki Linnakangas	bcf80eaa95	Fix multixacts members WAL redo. The logic to compute the page number was broken, and as a result, only the first page of multixact members was updated correctly. All the rest were left as zeros. Improve test_multixact.py to generate more multixacts, to cover this case. Also fix the check that the restored PG data directory matches the original one. Previously, the test compared the 'pg_new' cluster, which is a bit silly because the test restored the 'pg_new' cluster only a few lines earlier, so if the multixact WAL redo is somehow broken, the comparison will just compare two broken data directories and report success. Change it to compare the original datadir, the one where the multixacts were originally created, with a restored image of the same.	2021-12-21 17:50:06 +02:00
Kirill Bulatov	673c297949	Download timelines on demand	2021-12-10 17:23:35 +02:00
Dmitry Rodionov	557e3024cd	Forward pageserver connection string from compute to safekeeper This is needed for implementation of tenant rebalancing. With this change safekeeper becomes aware of which pageserver is supposed to be used for replication from this particular compute.	2021-12-06 21:28:49 +03:00
Arseny Sher	cba4da3f4d	Add term history to safekeepers. Persist full history of term switches on safekeepers instead of storing only the single term of the highest entry (called epoch). This makes it possible to easily and correctly find the divergence point of two logs and truncate the obsolete part before overwriting it with entries of the newer proposer(s). Full history of the proposer is transferred in a separate message before the proposer starts streaming; it is immediately persisted by the safekeeper, though the safekeeper might not yet have entries for some older terms there. That's because we can't atomically append to WAL and update the control file anyway, so locally available WAL must be taken into account when looking at the history. We should sometimes purge term history entries beyond truncate_lsn; this is not done here. Per https://github.com/zenithdb/rfcs/pull/12 Closes #296. Bumps vendor/postgres.	2021-12-03 12:43:57 +03:00
Dmitry Rodionov	130184fee9	Prohibit branch creation and basebackup at out of scope lsns. Out of scope LSNs include pre initdb LSNs, and LSNs prior to latest_gc_cutoff. To get there, there were also two cleanups: * Fix error handling in Execute message handler. This fixes behaviour when basebackup returned an error. Previously pageserver thread just died. * Remove "ancestor" file which previously contained ancestor id and branch lsn. Currently the same data can be obtained from metadata file. The way we handled the ancestor file in the code also introduced a case where, when branching fails, the timeline directory is created but there is no data in it except the ancestor file. This confused gc because it scans directories. So it is better to just remove the ancestor file and clean up this timeline directory creation so it happens after all validity checks have passed.	2021-11-25 15:27:16 +03:00
Dmitry Rodionov	737a557f09	add check to python tests that after gc number of rows is unchanged in all branches	2021-11-22 11:39:20 +03:00
Dmitry Rodionov	44111e3ba3	Prohibit branch creation at lsn that was already garbage collected. This introduces new timeline field latest_gc_cutoff. It is updated before each gc iteration. New check is added to branch_timelines to prevent branch creation with start point less than latest_gc_cutoff. Also this adds a check to get_page_at_lsn which asserts that lsn at which the page is requested was not garbage collected. This check currently is triggered for readonly nodes which are pinned to specific lsn and because they are not tracked in pageserver garbage collection can remove data that still might be referenced. This is a bug and will be fixed separately.	2021-11-15 20:03:16 +03:00
Heikki Linnakangas	849ac791a6	Bandaid fix for "page not found" errors, when a table is loaded. During parallel load of a table, Postgres sometimes requests a page from the page server for which no WAL has been generated yet. That's normal; Postgres expects the page to be full of zeros. There was a special case for that in LayeredTimeline::materialize_page, but the problem remained when you're crossing a segment boundary, so that there's no layer for the segment at all. It would be nice to have a more robust cross-check for this case. That might need help from the Postgres side. But this extends the bandaid fix we had in materialize_page() to the case where we cross a segment boundary. Fixes https://github.com/zenithdb/zenith/issues/841	2021-11-12 18:47:39 +02:00
Arthur Petukhovsky	9aaa02bc9a	Fix high CPU usage in walproposer (#860 ) * Bump vendor/postgres * Update time limits for test_restarts_under_load	2021-11-10 17:18:07 +03:00
Egor Suvorov	587935ebed	Add Safekeeper metrics tests (#746 ) * zenith_fixtures.py: add SafekeeperHttpClient.get_metrics() * Ensure that `collect_lsn` and `flush_lsn`'s reported values look reasonable in `test_many_timelines`	2021-11-09 22:18:59 +03:00
Arthur Petukhovsky	d19263aec8	Adjust timeouts for test_restarts_under_load (#830 ) * Adjust timeouts for test_restarts_under_load * Add test timeout for test_restarts_under_load	2021-11-04 19:58:40 +03:00
Heikki Linnakangas	66ec135676	Refactor pytest fixtures Instead of having a lot of separate fixtures for setting up the page server, the compute nodes, the safekeepers etc., have one big ZenithEnv object that encapsulates the whole environment. Every test either uses a shared "zenith_simple_env" fixture, which contains the default setup of a pageserver with no authentication, and no safekeepers. Tests that want to use safekeepers or authentication set up a custom test-specific ZenithEnv fixture. Gathering information about the whole environment into one object makes some things simpler. For example, when a new compute node is created, you no longer need to pass the 'wal_acceptors' connection string as argument to the 'postgres.create_start' function. The 'create_start' function fetches that information directly from the ZenithEnv object.	2021-10-25 14:14:47 +03:00
Heikki Linnakangas	28af3e5008	Remove some unnecessary fixture arguments	2021-10-25 14:14:45 +03:00
Heikki Linnakangas	57ce541521	Remove unnecessary 'pg_bin' object from 'postgres' fixture. It was only used in check_restored_datadir_content(), and that function can construct it easily from the other information it has.	2021-10-25 14:14:41 +03:00
Egor Suvorov	ff563ff080	test_runner: fix mypy errors and force it on CI (#774 ) * Fix bugs found by mypy * Add some missing types and runtime checks, remove unused code * Make ZenithPageserver start right away for better type safety * Add `types-` packages to Pipfile Pin mypy version and run it on CircleCI	2021-10-21 13:51:54 +03:00
anastasia	7f9d2a7d05	Change 'zenith tenant list' API to return tenant state added in `0dc7a3fc`	2021-10-21 11:04:22 +03:00
Arthur Petukhovsky	13f4e173c9	Wait for safekeepers to catch up in test_restarts_under_load (#776 )	2021-10-20 14:42:53 +03:00
Egor Suvorov	eb706bc9f4	Force yapf (Python code formatter) in CI (#772 ) * Add yapf run to CircleCI * Pin yapf version * Enable `SPLIT_ALL_TOP_LEVEL_COMMA_SEPARATED_VALUES` setting * Reformat all existing code with slight manual adjustments * test_runner/README: note that yapf is forced	2021-10-19 20:13:47 +03:00
Heikki Linnakangas	feae7f39c1	Support read-only nodes Change 'zenith.signal' file to a human-readable format, similar to backup_label. It can contain a "PREV LSN: %X/%X" line, or a special value to indicate that it's OK to start with invalid LSN ('none'), or that it's a read-only node and generating WAL is forbidden ('invalid'). The 'zenith pg create' and 'zenith pg start' commands now take a node name parameter, separate from the branch name. If the node name is not given, it defaults to the branch name, so this doesn't break existing scripts. If you pass "foo@<lsn>" as the branch name, a read-only node anchored at that LSN is created. The anchoring is performed by setting the 'recovery_target_lsn' option in the postgresql.conf file, and putting the server into standby mode with 'standby.signal'. We no longer store the synthetic checkpoint record in the WAL segment. The postgres startup code has been changed to use the copy of the checkpoint record in the pg_control file, when starting in zenith mode.	2021-10-19 09:48:12 +03:00
Heikki Linnakangas	c2b468c958	Separate node name from the branch name in ComputeControlPlane. This is in preparation for supporting read-only nodes. You can launch multiple read-only nodes on the same branch, so we need an identifier for each node, separate from the branch name.	2021-10-19 09:48:10 +03:00
Arseny Sher	de744a44dd	Add /timeline http request to safekeeper returning its status. Which is mainly generational state (terms) and useful LSNs. Also add /status basic healthcheck request which is now used in tests to determine the safekeeper is up; this fixes #726. ref #115	2021-10-14 19:02:38 +03:00
Arthur Petukhovsky	4b87acb1f6	Use logging in python tests (#674 ) * Use logging in python tests * Use f-strings for logs * Don't log test output while running * Use only pytest logging handler * Add more info about pytest logging	2021-10-14 13:10:09 +03:00
Heikki Linnakangas	e3945d94fd	Store unlogged tables locally, and replace PD_WAL_LOGGED. All the changes are in the vendor/postgres side. However, because we now generate fewer Full Page Writes, the 'branch_behind' test needs to be modified so that it still generates enough WAL to consume a few WAL segments.	2021-10-06 10:58:15 +03:00
Arseny Sher	4256231eb7	Enable test_start_compute with safekeepers. It should work now.	2021-10-04 16:50:46 +03:00
Arthur Petukhovsky	d6fc74a412	Various fixes for test_sync_safekeepers (#668 ) * Send ProposerGreeting manually in tests * Move test_sync_safekeepers to test_wal_acceptor.py * Capture test_sync_safekeepers output * Add comment for handle_json_ctrl * Save captured output in CI	2021-09-28 19:25:05 +03:00
Arseny Sher	7a370394a7	Wait till previous victim recovers in run_restarts_under_load. Fixes test flakiness, as recovery easily might take the whole iteration.	2021-09-28 19:15:41 +03:00
Arseny Sher	70b08923ed	Disable new safekeepers tests as not stable enough.	2021-09-26 22:33:58 +03:00
Heikki Linnakangas	ff5cbe2694	Support overlapping and nested Layers in the layer map. This introduces a new tree data structure for holding intervals, and queries of the form "which intervals contain the given point?". It then uses that to store the Layers in the layer map, instead of the BTreeMap. While we don't currently create overlapping layers in the page server, that situation might arise in the future if we start to create extra layers for performance purposes, or as part of some multi-stage garbage collection operation that creates new layers in some interval and then removes old ones. The situation might also arise if you have multiple page servers running on the same timeline, freezing layers at different points, and both uploading them to S3. So even though overlapping layers might not happen currently, let's avoid getting confused if it does happen for some reason. Fixes https://github.com/zenithdb/zenith/issues/517.	2021-09-24 14:10:52 +03:00
Heikki Linnakangas	2319e0ec8f	Define a layer's start and end bounds more precisely. After this, a layer's start bound is always defined to be inclusive, and end bound exclusive. For example, if you have a layer in the range 100-200, that layer can be used for GetPage@LSN requests at LSN 100, 199, or anything in between. But for LSN 200, you need to look at the next layer (if one exists). This is one part of a fix for https://github.com/zenithdb/zenith/issues/517. After this, the page server shouldn't create layers for the same segment with the same LSN, which avoids the issue. However, the same thing would still happen, if you managed to create layers with same start LSN again. That could happen e.g. if you had two page servers running, or in some weird crash/restart scenario, or due to bugs or features added later. The next commit makes the layer map more robust, so that it tolerates that situation without deleting wrong files.	2021-09-24 14:10:49 +03:00
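
The half-open layer bounds described in commit 2319e0ec8f (start inclusive, end exclusive), and the "which intervals contain the given point?" query from commit ff5cbe2694, can be illustrated with a small sketch. This is not the page server's code: the `Layer` class, its field names, and both helper functions are hypothetical, and the lookup is a plain linear scan rather than the interval tree that the commit introduces.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Layer:
    """Hypothetical stand-in for a layer covering an LSN range."""
    start_lsn: int  # inclusive bound
    end_lsn: int    # exclusive bound

    def covers(self, lsn: int) -> bool:
        # Half-open interval check: start <= lsn < end.
        return self.start_lsn <= lsn < self.end_lsn


def find_layer(layers: List[Layer], lsn: int) -> Optional[Layer]:
    """Return the first layer whose range contains lsn, or None."""
    for layer in layers:
        if layer.covers(lsn):
            return layer
    return None


def layers_containing(layers: List[Layer], lsn: int) -> List[Layer]:
    """Return every layer containing lsn; overlapping layers all match."""
    return [layer for layer in layers if layer.covers(lsn)]


# Two adjacent layers plus one that overlaps both of them.
layers = [Layer(100, 200), Layer(200, 300), Layer(150, 250)]
```

With these layers, LSN 199 resolves to the 100-200 layer, while LSN 200 falls through to the 200-300 layer, matching the example in the commit message. A query at LSN 200 with `layers_containing` returns both the 200-300 layer and the overlapping 150-250 layer.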

98 Commits