rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-27 08:09:58 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	d9dcbffac3	python: allow using allowed_errors.py (#7719) See #7718. Fix it by renaming all `types.py` to `common_types.py`. Additionally, add an advert for using `allowed_errors.py` to test any added regex.	2024-05-13 15:16:23 +03:00
John Spray	39c712f2ca	tests: adjust log allow list since reqwest upgrade (#7666) ## Problem Various performance test cases were destabilized by the recent upgrade of `reqwest`, because it changes an error string. Examples: - https://neon-github-public-dev.s3.amazonaws.com/reports/main/9005532594/index.html#testresult/3f984e471a9029a5/ - https://neon-github-public-dev.s3.amazonaws.com/reports/main/9005532594/index.html#testresult/8bd0f095fe0402b7/ The performance tests suffer from this more than most tests, because they churn enough data that the pageserver is still trying to contact the storage controller while it is shut down at the end of tests. ## Summary of changes s/Connection refused/error sending request/	2024-05-09 10:07:59 +01:00
Arpad Müller	f3af5f4660	Fix test_ts_of_lsn_api flakiness (#7599) Changes parameters to fix the flakiness of `test_ts_of_lsn_api`. Already now, the amount of flakiness of the test is pretty low. With this, it's even lower. cc #5768	2024-05-06 16:41:51 +00:00
John Spray	6814bb4b59	tests: add a log allow list to stabilize benchmarks (#7251) ## Problem https://github.com/neondatabase/neon/pull/7227 destabilized various tests in the performance suite, with log errors during shutdown. It's because we switched shutdown order to stop the storage controller before the pageservers. ## Summary of changes - Tolerate "connection failed" errors from pageservers trying to validate their deletion queue.	2024-03-26 17:44:18 +00:00
John Spray	0099dfa56b	storage controller: tighten up secrets handling (#7105) - Remove code for using AWS secrets manager, as we're deploying with k8s->env vars instead - Load each secret independently, so that one can mix CLI args with environment variables, rather than requiring that all secrets are loaded with the same mechanism. - Add a 'strict mode', enabled by default, which will refuse to start if secrets are not loaded. This avoids the risk of accidentally disabling auth by omitting the public key, for example	2024-03-25 11:52:33 +00:00
John Spray	b80704cd34	tests: log hygiene checks for storage controller (#6710) ## Problem As with the pageserver, we should fail tests that emit unexpected log errors/warnings. ## Summary of changes - Refactor existing log checks to be reusable - Run log checks for attachment_service - Add allow lists as needed.	2024-03-19 10:30:33 +00:00
John Spray	22c26d610b	pageserver: remove un-needed "uninit mark" (#5717) Switched the order; doing https://github.com/neondatabase/neon/pull/6139 first then can remove uninit marker after. ## Problem Previously, existence of a timeline directory was treated as evidence of the timeline's logical existence. That is no longer the case since we treat remote storage as the source of truth on each startup: we can therefore do without this mark file. The mark file had also been used as a pseudo-lock to guard against concurrent creations of the same TimelineId -- now that persistence is no longer required, this is a bit unwieldy. In #6139 the `Tenant::timelines_creating` was added to protect against concurrent creations on the same TimelineId, making the uninit mark file entirely redundant. ## Summary of changes - Code that writes & reads mark file is removed - Some nearby `pub` definitions are amended to `pub(crate)` - `test_duplicate_creation` is added to demonstrate that mutual exclusion of creations still works.	2024-03-15 17:23:05 +02:00
Vlad Lazar	587cb705b8	pageserver: roll open layer in timeline writer (#6661) ## Problem One WAL record can actually produce an arbitrary amount of key value pairs. This is problematic since it might cause our frozen layers to bloat past the max allowed size of S3 single shot uploads. [#6639](https://github.com/neondatabase/neon/pull/6639) introduced a "should roll" check after every batch of `ingest_batch_size` (100 WAL records by default). This helps, but the original problem still exists. ## Summary of changes This patch moves the responsibility of rolling the currently open layer to the `TimelineWriter`. Previously, this was done ad-hoc via calls to `check_checkpoint_distance`. The advantages of this approach are: * ability to split one batch over multiple open layers * less layer map locking * remove ad-hoc check_checkpoint_distance calls More specifically, we track the current size of the open layer in the writer. On each `put` check whether the current layer should be closed and a new one opened. Keeping track of the currently open layer results in less contention on the layer map lock. It only needs to be acquired on the first write and on writes that require a roll afterwards. Rolling the open layer can be triggered by: 1. The distance from the last LSN we rolled at. This bounds the amount of WAL that the safekeepers need to store. 2. The size of the currently open layer. 3. The time since the last roll. It helps safekeepers to regard pageserver as caught up and suspend activity. Closes #6624	2024-02-19 12:34:27 +00:00
John Spray	56f7d55ba7	pageserver: basic cancel/timeout for remote storage operations (#6097) ## Problem Various places in remote storage were not subject to a timeout (thereby stuck TCP connections could hold things up), and did not respect a cancellation token (so things like timeline deletion or tenant detach would have to wait arbitrarily long). ## Summary of changes - Add download_cancellable and upload_cancellable helpers, and use them in all the places we wait for remote storage operations (with the exception of initdb downloads, where it would not have been safe). - Add a cancellation token arg to `download_retry`. - Use cancellation token args in various places that were missing one per #5066 Closes: #5066 Why is this only "basic" handling? - Doesn't express difference between shutdown and errors in return types, to avoid refactoring all the places that use an anyhow::Error (these should all eventually return a more structured error type) - Implements timeouts on top of remote storage, rather than within it: this means that operations hitting their timeout will lose their semaphore permit and thereby go to the back of the queue for their retry. - Doing a nicer job is tracked in https://github.com/neondatabase/neon/issues/6096	2023-12-15 17:43:02 +00:00
Joonas Koivunen	831fad46d5	tests: fix allowed_error for compaction detecting a shutdown (#5919) This has been causing flaky tests, [example evidence]. Follow-up to #5883 where I forgot to fix this. [example evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5917/6981540065/index.html#suites/9d2450a537238135fd4007859e09aca7/6fd3556a879fa3d1	2023-11-24 16:14:32 +00:00
Joonas Koivunen	044375732a	test: support validating allowed_errors against a logfile (#5905) this will make it easier to test if an added allowed_error does in fact match for example against a log file from an allure report. ``` $ python3 test_runner/fixtures/pageserver/allowed_errors.py --help usage: allowed_errors.py [-h] [-i INPUT] check input against pageserver global allowed_errors optional arguments: -h, --help show this help message and exit -i INPUT, --input INPUT Pageserver logs file. Reads from stdin if no file is provided. ``` Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2023-11-24 12:43:25 +00:00
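
Several of the commits above revolve around a log allow list: tests fail on unexpected error or warning lines unless a regex in the list matches them. The following is only a minimal sketch of that idea, not the contents of `allowed_errors.py`. The names `ALLOWED_ERRORS` and `unexpected_lines`, the two patterns, and the assumption that log lines carry an `ERROR` or `WARN` level marker are all hypothetical and chosen for illustration.

```python
import re
from typing import Iterable, List

# Hypothetical allow list for illustration only; the real patterns live in
# allowed_errors.py and are not reproduced here.
ALLOWED_ERRORS = [
    r"error sending request",
    r"connection failed",
]


def unexpected_lines(lines: Iterable[str], allowed: Iterable[str]) -> List[str]:
    """Return ERROR/WARN lines that no allow-list regex matches."""
    compiled = [re.compile(pattern) for pattern in allowed]
    result = []
    for line in lines:
        # Assumption: the log level appears as a plain ERROR or WARN token.
        if "ERROR" not in line and "WARN" not in line:
            continue
        # re.search matches anywhere in the line, so patterns need no ".*" padding.
        if any(regex.search(line) for regex in compiled):
            continue
        result.append(line)
    return result


if __name__ == "__main__":
    sample = [
        "2024-05-09T10:00:00Z INFO starting up",
        "2024-05-09T10:00:01Z ERROR error sending request for url",
        "2024-05-09T10:00:02Z WARN something unexpected happened",
    ]
    for line in unexpected_lines(sample, ALLOWED_ERRORS):
        print(line)
```

In this sketch, only the `WARN something unexpected happened` line is reported, because the `INFO` line is ignored and the `ERROR` line matches an allowed pattern. Checking a newly added regex against a saved log file in this way is the workflow that the `--input` option described in 044375732a is meant to support.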

11 Commits