rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-04 04:30:38 +00:00

Author	SHA1	Message	Date
Christian Schwarz	3d2c2ce139	NeonEnv.from_repo_dir: use storage_controller_db instead of `attachments.json` (#8382 ) When `NeonEnv.from_repo_dir` was introduced, storage controller stored its state exclusively `attachments.json`. Since then, it has moved to using Postgres, which stores its state in `storage_controller_db`. But `NeonEnv.from_repo_dir` wasn't adjusted to do this. This PR rectifies the situation. Context for this is failures in `test_pageserver_characterize_throughput_with_n_tenants` CF: https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH Notably, `from_repo_dir` is also used by the backwards- and forwards-compatibility. Thus, the changes in this PR affect those tests as well. However, it turns out that the compatibility snapshot already contains the `storage_controller_db`. Thus, it should just work and in fact we can remove hacks like `fixup_storage_controller`. Follow-ups created as part of this work: * https://github.com/neondatabase/neon/issues/8399 * https://github.com/neondatabase/neon/issues/8400	2024-07-22 14:36:56 +02:00
John Spray	7e2a3d2728	pageserver: downgrade stale generation messages to INFO (#8256 ) ## Problem When generations were new, these messages were an important way of noticing if something unexpected was going on. We found some real issues when investigating tests that unexpectedly tripped them. At time has gone on, this code is now pretty battle-tested, and as we do more live migrations etc, it's fairly normal to see the occasional message from a node with a stale generation. At this point the cognitive load on developers to selectively allow-list these logs outweighs the benefit of having them at warn severity. Closes: https://github.com/neondatabase/neon/issues/8080 ## Summary of changes - Downgrade "Dropped remote consistent LSN updates" and "Dropping stale deletions" messages to INFO - Remove all the allow-list entries for these logs.	2024-07-08 17:22:35 +01:00
Peter Bendel	08b33adfee	add pagebench test cases for periodic pagebench on dedicated hardware (#8233 ) we want to run some specific pagebench test cases on dedicated hardware to get reproducible results run1: 1 client per tenant => characterize throughput with n tenants. - 500 tenants - scale 13 (200 MB database) - 1 hour duration - ca 380 GB layer snapshot files run2.singleclient: 1 client per tenant => characterize latencies run2.manyclient: N clients per tenant => characterize throughput scalability within one tenant. - 1 tenant with 1 client for latencies - 1 tenant with 64 clients because typically for a high number of connections we recommend the connection pooler which by default uses 64 connections (for scalability) - scale 136 (2048 MB database) - 20 minutes each	2024-07-08 17:22:35 +01:00
Peter Bendel	82266a252c	Allow longer timeout for starting pageserver, safe keeper and storage controller in test cases to make test cases less flaky (#8079 ) ## Problem see https://github.com/neondatabase/neon/issues/8070 ## Summary of changes the neon_local subcommands to - start neon - start pageserver - start safekeeper - start storage controller get a new option -t=xx or --start-timeout=xx which allows to specify a longer timeout in seconds we wait for the process start. This is useful in test cases where the pageserver has to read a lot of layer data, like in pagebench test cases. In addition we exploit the new timeout option in the python test infrastructure (python fixtures) and modify the flaky testcase to increase the timeout from 10 seconds to 1 minute. Example from the test execution ```bash RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py ... 2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s" 2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001" 2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s" 2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s" ```	2024-06-21 10:36:12 +00:00
Peter Bendel	56da624870	allow storage_controller error during pagebench (#8109 ) ## Problem `test_pageserver_max_throughput_getpage_at_latest_lsn` is a pagebench testcase which creates several tenants/timelines to verify pageserver performance. The test swaps environments around in the tenant duplication stage, so the storage controller uses two separate db instances (one in the duplication stage and another one in the benchmarking stage). In the benchmarking stage, the storage controller starts without any knowledge of nodes, but with knowledge of tenants (via attachments.json). When we re-attach and attempt to update the scheduler stats, the scheduler rightfully complains about the node not being known. The setup should preserve the storage controller across the two envs, but i think it's fine to just allow list the error in this case. ## Summary of changes add the error message `2024-06-19T09:38:27.866085Z ERROR Scheduler missing node 1`` to the list of allowed errors for storage_controller	2024-06-19 13:04:29 +00:00
Peter Bendel	9ba9f32dfe	Reactivate page bench test in CI after ignoring CopyFail error in pageserver (#8023 ) ## Problem Testcase page bench test_pageserver_max_throughput_getpage_at_latest_lsn had been deactivated because it was flaky. We now ignore copy fail error messages like in `270d3be507/test_runner/regress/test_pageserver_getpage_throttle.py (L17-L20)` and want to reactivate it to see it it is still flaky ## Summary of changes - reactivate the test in CI - ignore CopyFail error message during page bench test cases ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2024-06-12 16:10:57 +02:00
Jure Bajic	affc18f912	Add performance regress `test_ondemand_download_churn.py` (#7242 ) Add performance regress test for on-demand download throughput. Closes https://github.com/neondatabase/neon/issues/7146 Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-05-15 18:41:12 +02:00
Joonas Koivunen	d9dcbffac3	python: allow using allowed_errors.py (#7719 ) See #7718. Fix it by renaming all `types.py` to `common_types.py`. Additionally, add an advert for using `allowed_errors.py` to test any added regex.	2024-05-13 15:16:23 +03:00
macdoos	3b95e8072a	test_runner: replace all `.format()` with f-strings (#7194 )	2024-04-02 14:32:14 +01:00
Vlad Lazar	3d8830ac35	test_runner: re-enable large slru benchmark (#7125 ) Previously disabled due to https://github.com/neondatabase/neon/issues/7006.	2024-03-14 16:47:32 +00:00
John Spray	89cf714890	tests/neon_local: rename "attachment service" -> "storage controller" (#7087 ) Not a user-facing change, but can break any existing `.neon` directories created by neon_local, as the name of the database used by the storage controller changes. This PR changes all the locations apart from the path of `control_plane/attachment_service` (waiting for an opportune moment to do that one, because it's the most conflict-ish wrt ongoing PRs like #6676 )	2024-03-12 11:36:27 +00:00
Vlad Lazar	2daa2f1d10	test: disable large slru basebackup bench in ci (#7025 ) The test is flaky due to https://github.com/neondatabase/neon/issues/7006.	2024-03-05 15:41:05 +00:00
Vlad Lazar	2b11466b59	pageserver: optimise disk io for vectored get (#6780 ) ## Problem The vectored read path proposed in https://github.com/neondatabase/neon/pull/6576 seems to be functionally correct, but in my testing (see below) it is about 10-20% slower than the naive sequential vectored implementation. ## Summary of changes There's three parts to this PR: 1. Supporting vectored blob reads. This is actually trickier than it sounds because on disk blobs are prefixed with a variable length size header. Since the blobs are not necessarily fixed size, we need to juggle the offsets such that the callers can retrieve the blobs from the resulting buffer. 2. Merge disk read requests issued by the vectored read path up to a maximum size. Again, the merging is complicated by the fact that blobs are not fixed size. We keep track of the begin and end offset of each blob and pass them into the vectored blob reader. In turn, the reader will return a buffer and the offsets at which the blobs begin and end. 3. A benchmark for basebackup requests against tenant with large SLRU block counts is added. This required a small change to pagebench and a new config variable for the pageserver which toggles the vectored get validation. We can probably optimise things further by adding a little bit of concurrency for our IO. In principle, it's as simple as spawning a task which deals with issuing IO and doing the serialisation and handling on the parent task which receives input via a channel.	2024-02-28 12:06:00 +00:00
Alexander Bayandin	59c5b374de	test_pageserver_max_throughput_getpage_at_latest_lsn: disable on CI (#6785 ) ## Problem `test_pageserver_max_throughput_getpage_at_latest_lsn` is flaky which makes CI status red pretty frequently. `benchmarks` is not a blocking job (doesn't block `deploy`), so having it red might hide failures in other jobs Ref: https://github.com/neondatabase/neon/issues/6724 ## Summary of changes - Disable `test_pageserver_max_throughput_getpage_at_latest_lsn` on CI until it fixed	2024-02-16 15:30:04 +00:00
Christian Schwarz	fd4cce9417	test_pageserver_max_throughput_getpage_at_latest_lsn: remove n_tenants=100 combination (#6477 ) Need to fix the neon_local timeouts first (https://github.com/neondatabase/neon/issues/6473) and also not run them on every merge, but only nightly: https://github.com/neondatabase/neon/issues/6476	2024-01-25 18:17:53 +00:00
Christian Schwarz	996abc9563	pagebench-based GetPage@LSN performance test (#6214 )	2024-01-24 12:51:53 +01:00

16 Commits