Commit Graph

5500 Commits

Author SHA1 Message Date
Arthur Petukhovsky
3118c24521 Panic on unexpected error in simtests (#8169) 2024-06-26 16:46:14 +01:00
Alexander Bayandin
5af9660b9e CI(build-tools): don't install Postgres 14 (#6540)
## Problem

We install Postgres 14 in the `build-tools` image, but we don't need
it: we use Postgres binaries that we build ourselves.

## Summary of changes
- Remove the PostgreSQL 14 installation from the `build-tools` image
2024-06-26 16:37:04 +01:00
Conrad Ludgate
d7e349d33c proxy: report blame for passthrough disconnect io errors (#8170)
## Problem

It is currently hard to debug the reason for a disconnection.

## Summary of changes

Keep track of the error direction, and therefore the error source (client vs
compute), during passthrough.
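
A minimal sketch of the idea, with invented names (`ErrorSource` and
`PassthroughError` are not the proxy's actual types):

```rust
// Tag each side of the passthrough so an I/O error can be blamed on the
// connection that actually produced it.
#[derive(Debug, Clone, Copy)]
enum ErrorSource {
    Client,
    Compute,
}

#[derive(Debug)]
struct PassthroughError {
    source_side: ErrorSource,
    inner: std::io::Error,
}

// Wrap an I/O result from one side of the passthrough so the failure is
// attributed to that side when it is logged.
fn blame(side: ErrorSource, res: std::io::Result<usize>) -> Result<usize, PassthroughError> {
    res.map_err(|inner| PassthroughError { source_side: side, inner })
}
```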
2024-06-26 15:11:26 +00:00
Arthur Petukhovsky
47e5bf3bbb Improve term reject message in walproposer (#8164)
Co-authored-by: Tristan Partin <tristan@neon.tech>
2024-06-26 15:26:52 +01:00
Alex Chi Z
5d2f9ffa89 test(bottom-most-compaction): wal apply order (#8163)
A follow-up on https://github.com/neondatabase/neon/pull/8103/.
Previously, the main branch failed with:

```
assertion `left == right` failed
  left: b"value 3@0x10@0x30@0x28@0x40"
 right: b"value 3@0x10@0x28@0x30@0x40"
```

This is fixed now that #8103 has been merged.


Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-26 09:34:41 -04:00
Heikki Linnakangas
fdadd6a152 Remove primary_is_running (#8162)
This was a half-finished mechanism to allow a replica to enter hot
standby mode sooner, without waiting for a running-xacts record. It had
issues, and we are working on a better mechanism to replace it.

The control plane might still set the flag in the spec file, but
compute_ctl will simply ignore it.
2024-06-26 15:13:03 +03:00
Peter Bendel
9b623d3a2c add commit hash to S3 object identifier for artifacts on S3 (#8161)
In the future we may want to run periodic tests on dedicated cloud instances
that are not GitHub Actions runners.
To allow these to download artifact binaries for a specific commit hash,
we make searching by commit hash possible by prefixing the S3 objects with
`artifacts/${GITHUB_SHA}/${GITHUB_RUN_ID}/${GITHUB_RUN_ATTEMPT}`

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-06-26 05:46:52 +00:00
Alex Chi Z
9b98823d61 bottom-most-compaction: use in test_gc_feedback + fix bugs (#8103)
Adds a manual compaction trigger; adds gc-compaction to test_gc_feedback.

Part of https://github.com/neondatabase/neon/issues/8002

```
test_gc_feedback[debug-pg15].logical_size: 50 Mb
test_gc_feedback[debug-pg15].physical_size: 2269 Mb
test_gc_feedback[debug-pg15].physical/logical ratio: 44.5302 
test_gc_feedback[debug-pg15].max_total_num_of_deltas: 7 
test_gc_feedback[debug-pg15].max_num_of_deltas_above_image: 2 
test_gc_feedback[debug-pg15].logical_size_after_bottom_most_compaction: 50 Mb
test_gc_feedback[debug-pg15].physical_size_after_bottom_most_compaction: 287 Mb
test_gc_feedback[debug-pg15].physical/logical ratio after bottom_most_compaction: 5.6312 
test_gc_feedback[debug-pg15].max_total_num_of_deltas_after_bottom_most_compaction: 4 
test_gc_feedback[debug-pg15].max_num_of_deltas_above_image_after_bottom_most_compaction: 1
```

## Summary of changes

* Add the manual compaction trigger
* Use it in test_gc_feedback
* Add a guard to avoid running it with retain_lsns
* Fix: Do `schedule_compaction_update` after compaction
* Fix: Supply deltas in the correct order to reconstruct value

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-25 23:00:14 +00:00
Alex Chi Z
76864e6a2a feat(pageserver): add image layer iterator (#8006)
part of https://github.com/neondatabase/neon/issues/8002

## Summary of changes

This pull request adds the image layer iterator. It buffers a fixed
number of key-value pairs in memory and gives developers an iterator
abstraction over the image layer. Once the buffer is exhausted, it
issues one I/O to fetch the next batch.

Due to Rust lifetime constraints, the `get_stream_from` API has been
refactored into `into_stream`, which consumes `self`.

The delta layer iterator implementation will be similar, so I'll add
it after this pull request is merged.
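
For illustration, a rough sketch of such a buffered iterator (names and
signatures are invented, not the pageserver's actual API):

```rust
// Illustrative sketch only; `ImageLayerIterator` and `read_next_batch` are
// hypothetical stand-ins for the real image layer reading code.
struct ImageLayerIterator<K, V> {
    buffer: std::collections::VecDeque<(K, V)>,
    exhausted: bool,
}

impl<K, V> ImageLayerIterator<K, V> {
    /// Returns the next key-value pair, refilling the in-memory buffer with
    /// one batched read when it runs empty.
    fn next_entry(
        &mut self,
        mut read_next_batch: impl FnMut() -> Vec<(K, V)>,
    ) -> Option<(K, V)> {
        if self.buffer.is_empty() && !self.exhausted {
            let batch = read_next_batch(); // one I/O per refill
            if batch.is_empty() {
                self.exhausted = true;
            }
            self.buffer.extend(batch);
        }
        self.buffer.pop_front()
    }
}
```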

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-25 20:49:29 +00:00
Conrad Ludgate
6c5d3b5263 proxy fix wake compute console retry (#8141)
## Problem

1. Proxy is retrying errors from cplane that shouldn't be retried
2. ~~Proxy is not using the retry_after_ms value~~

## Summary of changes

1. Correct the could_retry impl for ConsoleError.
2. ~~Update could_retry interface to support returning a fixed wait
duration.~~
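
For illustration only, a hedged sketch of the shape of such a `could_retry`
check; the status codes below are made up, and the real ConsoleError logic in
the proxy crate is more nuanced:

```rust
// Sketch: decide retryability per error, instead of retrying everything.
struct ConsoleError {
    http_status: u16,
}

fn could_retry(err: &ConsoleError) -> bool {
    match err.http_status {
        // Rate limiting and transient server-side trouble are worth retrying.
        429 | 503 | 504 => true,
        // Other errors (e.g. "project not found") will not get better on retry.
        _ => false,
    }
}
```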
2024-06-25 18:07:54 +00:00
Christian Schwarz
cd9a550d97 clippy-deny the todo!() macro (#4340)
`todo!()` shouldn't slip into prod code
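
One way to get this effect is a crate-level lint attribute (a sketch; the PR
may instead wire the lint up through CI flags or workspace lint config):

```rust
// Deny the lint at the crate root so any leftover todo!() fails the clippy run.
#![deny(clippy::todo)]

fn main() {
    // todo!(); // uncommenting this line would now be a hard clippy error
    println!("no todos here");
}
```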
2024-06-25 18:03:27 +00:00
John Spray
07f21dd6b6 pageserver: remove attach/detach apis (#8134)
## Problem

These APIs have been deprecated for some time, but were still used from
test code.

Closes: https://github.com/neondatabase/neon/issues/4282

## Summary of changes

- It is still convenient to do a "tenant_attach" from a test without
having to write out a location_conf body, so those test methods have
been retained with implementations that call through to their
location_conf equivalent.
2024-06-25 17:38:06 +01:00
Heikki Linnakangas
64a4461191 Fix submodule references to match the REL_*_STABLE_neon branches (#8159)
No code changes, just point to the correct commit SHAs.
2024-06-25 19:05:13 +03:00
Yuchen Liang
961fc0ba8f feat(pageserver): add metrics for number of valid leases after each refresh (#8147)
Part of #7497, closes #8120.

## Summary of changes

This PR adds a metric to track the number of valid leases each time `GCInfo`
gets refreshed.

Besides this metric, we should also track disk space and synthetic size
(after #8071 is closed) to make sure leases are used properly.
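
A generic sketch of such a gauge using the `prometheus` crate directly (the
pageserver has its own metrics wrappers, and the metric name below is
illustrative, not the one added by this PR):

```rust
use prometheus::{register_int_gauge, IntGauge};

// Gauge set to the number of still-valid leases each time GCInfo is refreshed.
fn valid_lease_gauge() -> IntGauge {
    register_int_gauge!(
        "pageserver_valid_lsn_leases",
        "Number of valid LSN leases after the latest GCInfo refresh"
    )
    .expect("failed to register metric")
}

fn on_gc_info_refresh(gauge: &IntGauge, valid_leases: usize) {
    gauge.set(valid_leases as i64);
}
```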

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-06-25 15:43:12 +00:00
Alexander Bayandin
9b2f9419d9 CI: upload docker cache only from main (#8157)
## Problem
The Docker build cache gets invalidated by PRs

## Summary of changes
- Upload cache only from the main branch
2024-06-25 16:18:22 +01:00
Christian Schwarz
947f6da75e L0 flush: avoid short-lived allocation when checking key_range empty (#8154)
We only use `keys` to check if it's empty so we can bail out early. No
need to collect the keys for that.

Found this while doing research for
https://github.com/neondatabase/neon/issues/7418
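
The general pattern, as a standalone sketch (not the actual pageserver code):

```rust
// Instead of collecting all keys just to test emptiness, peek at the
// iterator so no short-lived Vec is allocated.
fn has_any_key<I: IntoIterator>(keys: I) -> bool {
    keys.into_iter().next().is_some()
}

fn main() {
    // Before (conceptually): let keys: Vec<_> = iter.collect(); if keys.is_empty() { ... }
    // After: bail out early without materializing the collection.
    let empty_range = 0..0;
    if !has_any_key(empty_range) {
        println!("key range empty, bailing out early");
    }
}
```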
2024-06-25 17:04:44 +02:00
Vlad Lazar
7026dde9eb storcon: update db related dependencies (#8155)
## Problem
The storage controller runs into a memory corruption issue on the drain/fill
code paths.

## Summary of changes
Update DB-related dependencies in the unlikely case that the issue was
fixed in diesel.
2024-06-25 15:06:18 +01:00
Heikki Linnakangas
d502313841 Fix MVCC bug with prepared xact with subxacts on standby (#8152)
We did not recover the subtransaction IDs of prepared transactions when
starting a hot standby from a shutdown checkpoint. As a result, such
subtransactions were considered as aborted, rather than in-progress.
That would lead to hint bits being set incorrectly, and the
subtransactions suddenly becoming visible to old snapshots when the
prepared transaction was committed.

To fix, update pg_subtrans with the prepared transactions' subxids when
starting hot standby from a shutdown checkpoint. The snapshots taken
from that state need to be marked as "suboverflowed", so that we also
check pg_subtrans.

Discussion:
https://www.postgresql.org/message-id/6b852e98-2d49-4ca1-9e95-db419a2696e0%40iki.fi

NEON: cherry-picked from the upstream thread ahead of time, to unblock
https://github.com/neondatabase/neon/pull/7288. I expect this to be
committed to upstream in the next few days, superseding this. NOTE: I
did not include the new regression test on v15 and v14 branches, because
the test would need some adapting, and we don't run the perl tests on
Neon anyway.
2024-06-25 16:29:32 +03:00
Yuchen Liang
219e78f885 feat(pageserver): add an optional lease to the get_lsn_by_timestamp API (#8104)
Part of #7497, closes #8072.

## Problem

Currently the `get_lsn_by_timestamp` and branch creation pageserver APIs do not provide a pleasant client experience: the looked-up LSN might be GC-ed between the two API calls.

This PR attempts to prevent common races between GC and branch creation by making use of LSN leases provided in #8084. A lease can be optionally granted to a looked-up LSN. With the lease, GC will not touch layers needed to reconstruct all pages at this LSN for the duration of the lease.
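
A hypothetical sketch of the handler shape, with invented names, to illustrate
the optional lease:

```rust
// Invented types for illustration; not the pageserver's actual API surface.
struct LsnByTimestampResult {
    lsn: u64,
    lease_expires_at_unix_ms: Option<u64>,
}

fn get_lsn_by_timestamp(
    lookup_lsn_for_timestamp: impl Fn() -> u64,
    grant_lease: impl Fn(u64) -> u64,
    with_lease: bool,
) -> LsnByTimestampResult {
    let lsn = lookup_lsn_for_timestamp();
    // Optionally grant a lease on the looked-up LSN in the same request, so GC
    // cannot remove the layers needed for it before the caller creates a branch.
    let lease_expires_at_unix_ms = with_lease.then(|| grant_lease(lsn));
    LsnByTimestampResult { lsn, lease_expires_at_unix_ms }
}
```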

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-06-24 20:12:24 +00:00
John Spray
1ea5d8b132 tests: accommodate some messages that can fail tests (#8144)
## Problem

- `test_storage_controller_many_tenants` can fail with warnings in the
storage controller about tenant creation holding a lock for too long,
because this test stresses the machine running the test with many
concurrent timeline creations
- `test_tenant_delete_smoke` can fail when synthetic remote storage
errors show up

## Summary of changes

- tolerate warnings about slow timeline creation in
test_storage_controller_many_tenants
- tolerate both possible errors during error_tolerant_delete
2024-06-24 17:03:53 +00:00
John Spray
3d760938e1 storcon_cli: remove old tenant-scatter command (#8127)
## Problem

This command was used in the very early days of sharding, before the
storage controller had anti-affinity + scheduling optimization to spread
out shards.

## Summary of changes

- Remove `storcon_cli tenant-scatter`
2024-06-24 12:57:16 -04:00
Alex Chi Z
9211de0df7 test(pageserver): add delta records tests for gc-compaction (#8078)
Part of https://github.com/neondatabase/neon/issues/8002

This pull request adds tests for bottom-most gc-compaction with delta
records. It also fixes a bug in the compaction process that created
overlapping delta layers, by force-splitting at the original delta layer
boundary.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-24 15:50:31 +00:00
Alex Chi Z
d8ffe662a9 fix(pageserver): handle version number in draw timeline (#8102)
We now have a `vX` version number in the file name, e.g.,
`000000067F0000000400000B150100000000-000000067F0000000400000D350100000000__00000000014B7AC8-v1-00000001`

The related pull request for the new-style path was merged a month ago:
https://github.com/neondatabase/neon/pull/7660

## Summary of changes

Fixed the draw timeline dir command to handle it.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-24 15:31:06 +00:00
Arthur Petukhovsky
a4db2af1f0 Truncate waltmp file on creation (#8133)
Previously in the safekeeper code, a new segment file was opened without
the truncate option. I don't think there is a reason to do that; this
commit replaces it with `File::create` to make the code simpler and to
remove the `clippy::suspicious_open_options` linter warning.
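
A minimal sketch of the difference (not the actual safekeeper code):

```rust
use std::fs::File;
use std::io;
use std::path::Path;

// `File::create` truncates an existing file, whereas an OpenOptions builder
// without `.truncate(true)` would keep stale bytes from a previous, longer file.
fn create_segment_tmp(path: &Path) -> io::Result<File> {
    // Before (roughly): OpenOptions::new().create(true).write(true).open(path)
    File::create(path)
}
```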
2024-06-24 14:07:59 +00:00
John Spray
47fdf93cf0 tests: fix a flake in test_sharding_split_compaction (#8136)
## Problem

This test could occasionally trigger a "removing local file ... because
it has unexpected length" log message when the
`compact-shard-ancestors-persistent` failpoint is in use, which is
unexpected because that failpoint stops the process when the remote
metadata is in sync with local files.

This happened because there are two shards on the same pageserver: while
the one being compacted explicitly stops at the failpoint, another shard
was compacting in the background and failing at an unclean point. The
test intends to disable background compaction, but was mistakenly
dropping the override of `compaction_period` when it updated
`pitr_interval`.

Example failure:

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8123/9602976462/index.html#/testresult/7dd6165da7daef40

## Summary of changes

- Update `TENANT_CONF` in the test to use properly typed values, so that
it is usable in pageserver APIs as well as via neon_local.
- When updating tenant config with `pitr_interval`, retain the overrides
from the start of the test, so that there won't be any background
compaction going on during the test.
2024-06-24 14:54:54 +01:00
John Spray
de05f90735 pageserver: add more info-level logging in shard splits (#8137)
## Problem

`test_sharding_autosplit` is occasionally failing on warnings about
shard splits taking longer than expected (`Exclusive lock by ShardSplit
was held for`...)

It's not obvious which part is taking the time (I suspect remote storage
uploads).

Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/9618788427/index.html#testresult/b395294d5bdeb783/

## Summary of changes

- Since shard splits are infrequent events, we can afford to be very
chatty: add a bunch of info-level logging throughout the process.
2024-06-24 11:53:43 +01:00
John Spray
188797f048 pageserver: remove code that resumes tenant deletions after restarts (#8091)
#8082 removed the legacy deletion path, but retained code for completing
deletions that were started before a pageserver restart. This PR cleans
up that remaining code, and removes all the pageserver code that dealt
with tenant deletion markers and resuming tenant deletions.

The release at https://github.com/neondatabase/neon/pull/8138 contains
https://github.com/neondatabase/neon/pull/8082, so we can now merge this
to `main`
2024-06-24 11:41:11 +01:00
Arpad Müller
5446e08891 Move remote_storage config related code into dedicated module (#8132)
Moves `RemoteStorageConfig` and related structs and functions into a
dedicated module. Also implements `Serialize` for the config structs
(requested in #8126).

Follow-up of #8126
2024-06-24 12:29:54 +02:00
Conrad Ludgate
78d9059fc7 proxy: update tokio-postgres to allow arbitrary config params (#8076)
## Problem

Fixes https://github.com/neondatabase/neon/issues/1287

## Summary of changes

tokio-postgres now supports arbitrary server params through the
`param(key, value)` method. Some keys are special so we explicitly
filter them out.
2024-06-24 10:20:27 +00:00
Arpad Müller
75747cdbff Use serde for RemoteStorageConfig parsing (#8126)
Adds a `Deserialize` impl to `RemoteStorageConfig`. We thus achieve the
same as #7743 but with less repetitive code, by deriving `Deserialize`
impls on `S3Config`, `AzureConfig`, and `RemoteStorageConfig`. The
disadvantage is less useful error messages.

The git history of this PR contains a state where we go via an
intermediate representation, leveraging the `serde_json` crate,
without it ever being actual json though.

Also, the PR adds deserialization tests.

Alternative to #7743 .
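
A rough sketch of the derive-based approach; the field names below are
illustrative, not the exact neon config schema (assumes `serde` with the
derive feature and the `toml` crate):

```rust
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct S3Config {
    bucket_name: String,
    bucket_region: String,
    prefix_in_bucket: Option<String>,
}

#[derive(Deserialize, Debug)]
struct RemoteStorageConfig {
    s3: Option<S3Config>,
    local_path: Option<String>,
}

fn main() {
    // Deserialize straight from TOML instead of hand-written parsing code.
    let toml = r#"
        [s3]
        bucket_name = "my-bucket"
        bucket_region = "eu-central-1"
    "#;
    let cfg: RemoteStorageConfig = toml::from_str(toml).expect("valid config");
    println!("{cfg:?}");
}
```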
2024-06-22 17:57:09 +00:00
Vlad Lazar
8fe3f17c47 storcon: improve drain and fill shard placement (#8119)
## Problem
While adapting the storage controller scale test to do graceful rolling
restarts via drain and fill, I noticed that secondaries are also being
rescheduled, which, in turn, caused the storage controller to optimise
attachments.
## Summary of changes
* Introduce a transactional-looking rescheduling primitive (i.e. "try to
schedule to this secondary, but leave everything as-is if you can't")
* Use it for the drain and fill stages to avoid calling into
`Scheduler::schedule` and having secondaries move around.
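
A conceptual sketch of such a "transactional-looking" reschedule, with
invented types; the real primitive lives in the storage controller's
scheduler:

```rust
#[derive(Clone)]
struct Placement {
    attached: u32,         // node id currently holding the attachment
    secondaries: Vec<u32>, // node ids holding secondary locations
}

// Try to promote the shard to a specific secondary; if that node cannot take
// it, leave the current placement completely untouched.
fn try_promote_to(
    placement: &mut Placement,
    target: u32,
    node_has_capacity: impl Fn(u32) -> bool,
) -> bool {
    if !placement.secondaries.contains(&target) || !node_has_capacity(target) {
        return false; // no partial changes
    }
    let old_attached = placement.attached;
    placement.attached = target;
    placement.secondaries.retain(|&n| n != target);
    placement.secondaries.push(old_attached);
    true
}
```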
2024-06-22 14:20:58 +00:00
Anastasia Lubennikova
8776089c70 Remove kq_imcx extension support per customer request
neondatabase/cloud#13648
2024-06-21 20:22:54 +01:00
John Spray
b74232eb4d tests: allow-list neon_local endpoint errors from storage controller (#8123)
## Problem

For testing, the storage controller has a built-in hack that loads
neon_local endpoint config from disk, and uses it to reconfigure
endpoints when the attached pageserver changes.

Some tests that stop an endpoint while the storage controller is running
could occasionally fail on log errors from the controller trying to use
its special test-mode calls into neon local Endpoint.

Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8117/9592392425/index.html#/testresult/9d2bb8623d0d53f8

## Summary of changes

- Give NotifyError an explicit NeonLocal variant, to avoid munging these
into generic 500s (I don't want to ignore 500s in general)
- Allow-list errors related to the local notification hook.

The expectation is that tests using endpoints/workloads should be
independently checking that those endpoints work: if neon_local
generates an error inside the storage controller, that's ignorable.
2024-06-21 17:23:31 +00:00
Vlad Lazar
ee3081863e storcon: implement endpoints for cancellation of drain and fill operations (#8029)
## Problem
There's no way to cancel drain and fill operations.

## Summary of changes
Implement HTTP endpoints to allow cancelling background operations.
When an operation is cancelled successfully, the node's scheduling policy
reverts to `Active`.
2024-06-21 17:13:51 +01:00
John Spray
15728be0e1 pageserver: always detach before deleting (#8082)
In #7957 we enabled deletion without attachment, but retained the
old-style deletion (return 202, delete in background) for attached
tenants. In this PR, we remove the old-style deletion path, such that if
the tenant delete API is invoked while a tenant is attached, it is
simply detached before completing the deletion.

This intentionally doesn't rip out all the old deletion code: in case a
deletion was in progress at time of upgrade, we keep around the code for
finishing it for one release cycle. The rest of the code removal happens
in https://github.com/neondatabase/neon/pull/8091

Now that deletion will always go via the new path, the new path is also
updated to use some retries around remote storage operations, to avoid
tripping up the control plane with 500s if S3 has an intermittent issue.
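
A generic sketch of the retry idea (an invented `with_retries` helper, not the
actual code in the deletion path):

```rust
use std::time::Duration;

// Retry a remote-storage call a few times with exponential backoff so a
// transient S3 hiccup does not surface as a 500 to the control plane.
async fn with_retries<T, E, F, Fut>(mut op: F, attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for attempt in 0..attempts {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) => {
                last_err = Some(e);
                // Backoff between attempts: 100ms, 200ms, 400ms, ...
                tokio::time::sleep(Duration::from_millis(100 * 2u64.pow(attempt))).await;
            }
        }
    }
    Err(last_err.expect("attempts must be > 0"))
}
```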
2024-06-21 15:39:19 +01:00
Arthur Petukhovsky
f45cf28247 Add eviction_state to control file (#8125)
This is a preparation for #8022, to make the PR both backwards and
forward compatible.

This commit adds an `eviction_state` field to the control file. It adds
support for reading it, but writes the control file in the old format
where possible, to keep the disk format forward compatible.

Note: in `patch_control_file`, the new field gets serialized to JSON like
this (see the sketch below):
- `"eviction_state": "Present"`
- `"eviction_state": {"Offloaded": "0/8F"}`
2024-06-21 13:15:02 +01:00
Peter Bendel
82266a252c Allow longer timeout for starting pageserver, safekeeper and storage controller in test cases to make them less flaky (#8079)
## Problem

see https://github.com/neondatabase/neon/issues/8070

## Summary of changes

The neon_local subcommands to
- start neon
- start pageserver
- start safekeeper
- start storage controller

get a new option `-t=xx` or `--start-timeout=xx`, which allows specifying a
longer timeout (in seconds) to wait for the process to start.
This is useful in test cases where the pageserver has to read a lot of
layer data, like in pagebench test cases.

In addition, we use the new timeout option in the Python test
infrastructure (Python fixtures) and modify the flaky test case to
increase the timeout from 10 seconds to 1 minute.

Example from the test execution

```bash
RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release     ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
...
2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s"
2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001"
2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s"
2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s"
```
2024-06-21 10:36:12 +00:00
John Spray
59f949b4a8 pageserver: remove unused load/ignore APIs (#8122)
## Problem

These APIs have been unused for some time. They were superseded by
/location_conf: the equivalent of ignoring a tenant is now to put it in
secondary mode.

## Summary of changes

- Remove APIs
- Remove tests & helpers that used them
- Remove error variants that are no longer needed.
2024-06-21 10:02:15 +00:00
Vlad Lazar
01399621d5 storcon: avoid promoting too many shards of the same tenant (#8099)
## Problem

The fill planner introduced in
https://github.com/neondatabase/neon/pull/8014 selects tenant shards to
promote strictly based on attached shard count load (tenant shards on
nodes with the most attached shards are considered first). This approach
runs the risk of migrating too many shards belonging to the same tenant
onto the same primary node.

This is bad for availability and causes extra reconciles via the storage
controller's background
optimisations.

Also see
https://github.com/neondatabase/neon/pull/8014#discussion_r1642456241.

## Summary of changes
Refine the fill plan to avoid promoting too many shards belonging to the
same tenant onto the same node.
We allow at most `max(1, shard_count / node_count)` shards belonging to
the same tenant to be promoted onto the same node.
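
The cap itself, as a tiny sketch:

```rust
// At most max(1, shard_count / node_count) shards of one tenant are promoted
// onto the same node during a fill.
fn per_tenant_promotion_limit(shard_count: usize, node_count: usize) -> usize {
    std::cmp::max(1, shard_count / node_count.max(1))
}

fn main() {
    assert_eq!(per_tenant_promotion_limit(8, 4), 2);
    assert_eq!(per_tenant_promotion_limit(2, 4), 1); // never below one
}
```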
2024-06-21 10:19:01 +01:00
Jure Bajic
0792bb6785 Add tracing for shared locks in id_lock_map (#7618)
## Problem
Storage controller shared locks do not print a warning when held for a long time.

## Summary of changes
Issue https://github.com/neondatabase/neon/issues/7108 added tracing for
the exclusive lock in `id_lock_map`; this extends the same to shared
locks. It was mentioned in the comment
https://github.com/neondatabase/neon/pull/7397#discussion_r1587961160
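
A conceptual sketch of timing a shared guard and warning on long holds (not
the actual `id_lock_map` implementation):

```rust
use std::sync::RwLockReadGuard;
use std::time::{Duration, Instant};

// Remember when a shared guard was handed out and warn if it was held longer
// than a threshold when dropped.
struct TracedSharedGuard<'a, T> {
    _guard: RwLockReadGuard<'a, T>,
    acquired_at: Instant,
    warn_after: Duration,
}

impl<'a, T> Drop for TracedSharedGuard<'a, T> {
    fn drop(&mut self) {
        let held = self.acquired_at.elapsed();
        if held > self.warn_after {
            eprintln!("shared lock held for {held:?} (warn threshold {:?})", self.warn_after);
        }
    }
}
```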
2024-06-21 09:47:04 +01:00
Vlad Lazar
f8ac3b0e0e storcon: use attached shard counts for initial shard placement (#8061)
## Problem
When creating a new shard, the storage controller schedules via
`Scheduler::schedule_shard`. This does not take into account the number
of attached shards. What it does take into account is the node affinity:
when a shard is scheduled, all its nodes (primaries and secondaries) get
their affinity incremented.

For two-node clusters and shards with one secondary, we have a
pathological case where all primaries are scheduled on the same node.
Now that we track the count of attached shards per node, this is trivial
to fix. Still, the "proper" fix is to use the pageserver's utilization
score.

Closes https://github.com/neondatabase/neon/issues/8041

## Summary of changes
Use attached shard count when deciding which node to schedule a fresh
shard on.
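
A simplified sketch of the tie-breaking idea, with invented field names:

```rust
// Pick the node with the fewest attached shards, falling back to affinity
// only to break ties.
#[derive(Clone, Copy)]
struct NodeStats {
    node_id: u32,
    attached_shards: usize,
    affinity: usize,
}

fn pick_node(nodes: &[NodeStats]) -> Option<u32> {
    nodes
        .iter()
        .min_by_key(|n| (n.attached_shards, n.affinity))
        .map(|n| n.node_id)
}
```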
2024-06-20 17:32:01 +01:00
Christian Schwarz
02ecdd137b fix: preinitialize pageserver_basebackup_query_seconds metric (#8121)
Without this patch, the Pageserver 4 Golden Signals dashboard shows no
data if there are no basebackups (observed in pre-prod).
2024-06-20 15:50:43 +00:00
Christian Schwarz
79401638df remove materialized page cache (#8105)
part of Epic https://github.com/neondatabase/neon/issues/7386

# Motivation

The materialized page cache adds complexity to the code base, which
increases the maintenance burden and the risk of subtle and
hard-to-reproduce bugs such as #8050.

Further, the best hit rate that we currently achieve in production is ca.
1% of materialized page cache lookups for
`task_kind=PageRequestHandler`. Other task kinds have hit rates <0.2%.

Last, caching page images in Pageserver rewards under-sized caches in
Computes because reading from Pageserver's materialized page cache over
the network is often sufficiently fast (low hundreds of microseconds).
Such Computes should upscale their local caches to fit their working
set, rather than repeatedly requesting the same page from Pageserver.

Some more discussion and context in internal thread
https://neondb.slack.com/archives/C033RQ5SPDH/p1718714037708459

# Changes

This PR removes the materialized page cache code & metrics.

The infrastructure for different key kinds in `PageCache` is left in
place, even though the "Immutable" key kind is the only remaining one.
This can be further simplified in a future commit.

Some tests started failing because their total runtime was dependent on
high materialized page cache hit rates. This PR makes them
fixed-runtime or raises their pytest timeouts:
* test_local_file_cache_unlink
* test_physical_replication
* test_pg_regress

# Performance

I focussed on ensuring that this PR will not result in a performance
regression in prod.

* **getpage** requests: our production metrics have shown the
materialized page cache to be irrelevant (low hit rate). Also,
Pageserver is the wrong place to cache page images; it should happen in
compute.
* **ingest** (`task_kind=WalReceiverConnectionHandler`): prod metrics
show 0 percent hit rate, so removing it will not be a regression.
* **get_lsn_by_timestamp**: important API for branch creation, used by
the control plane. The clog pages that this code uses are not
materialized-page-cached because they're not 8k. No risk of introducing a
regression here.

We will watch the various nightly benchmarks closely for more results
before shipping to prod.
2024-06-20 11:56:14 +02:00
Alexander Bayandin
c789ec21f6 CI: miscellaneous cleanups (#8073)
## Problem
There are a couple of small CI cleanups that seem too small for dedicated PRs

## Summary of changes
- Create the release PR with a title that matches the title in the description
- Tune the error message for disallowing `ubuntu-latest` to explicitly
mention what to do
- Remove JUnit output from pytest; we use Allure instead
2024-06-19 19:21:09 +01:00
Alexander Bayandin
558a57b15b CI(test-images): add dockerhub auth (#8115)
## Problem
```
Unable to find image 'neondatabase/neon:9583413584' locally
docker: Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit.
```

## Summary of changes
- add `docker/login-action@v3` for `test-images` job
2024-06-19 16:54:07 +00:00
John Spray
f0e2bb79b2 tests: use semaphore instead of lock for Endpoint.running (#8112)
## Problem

Ahem, let's try this again.

https://github.com/neondatabase/neon/pull/8110 had a spooky failure in
test_multi_attach where a call to Endpoint.stop() timed out waiting for
a lock, even though we can see an earlier call completing and releasing
the lock. I suspect something weird is going on with the way pytest runs
tests across processes, or use of asyncio perhaps.

Anyway: the simplest fix is to just use a semaphore instead: if we don't
lock we can't deadlock.

## Summary of changes

- Make Endpoint.running a semaphore, where we add a unit to its counter
when starting the process and atomically decrement it when stopping.
2024-06-19 16:07:14 +00:00
MMeent
fd0b22f5cd Make sure we can handle temporarily offline PS when we first connect (#8094)
Fixes https://github.com/neondatabase/neon/issues/7897

## Problem

`shard->delay_us` was potentially uninitialized when we connect to PS,
as it wasn't set to a non-0 value until we've first connected to the
shard's pageserver.

That caused the exponential backoff to use an initial value (multiplier)
of 0 for the first connection attempt to that pageserver, thus causing a
hot retry loop with connection attempts to the pageserver without
significant delay. That in turn caused attempts to reconnect to fail
quickly, rather than showing the expected 'wait until pageserver is
available' behaviour.

## Summary of changes

We initialize shard->delay_us before connection initialization if we
notice it is not initialized yet.
2024-06-19 15:05:31 +02:00
Peter Bendel
56da624870 allow storage_controller error during pagebench (#8109)
## Problem

`test_pageserver_max_throughput_getpage_at_latest_lsn` is a pagebench
test case that creates several tenants/timelines to verify pageserver
performance.


The test swaps environments around in the tenant duplication stage, so
the storage controller uses two separate db instances (one in the
duplication stage and another one in the benchmarking stage).
In the benchmarking stage, the storage controller starts without any
knowledge of nodes, but with knowledge of tenants (via
attachments.json). When we re-attach and attempt to update the scheduler
stats, the scheduler rightfully complains
about the node not being known. The setup should preserve the storage
controller across the two envs, but I think it's fine to just allow-list
the error in this case.

## Summary of changes

Add the error message

`2024-06-19T09:38:27.866085Z ERROR Scheduler missing node 1`

to the list of allowed errors for storage_controller.
2024-06-19 13:04:29 +00:00
Conrad Ludgate
b998b70315 proxy: reduce some per-task memory usage (#8095)
## Problem

Some tasks are using upwards of 10KB of memory at all times,
sometimes with buffers that swing them up to 30MB.

## Summary of changes

Split some of the async tasks in selective places and box them as
appropriate to try to reduce the constant memory usage, especially in
the locations where the large future is only a small part of the total
runtime of the task.

Also, reduce the CopyBuffer buffer size from 8KB to 1KB.

In my local testing and in staging this had a minor improvement; sadly
not the improvement I was hoping for :/ It might have more impact in
production.
2024-06-19 13:34:15 +01:00
John Spray
76aa6936e8 tests: make Endpoint.stop() thread safe (occasional flakes in test_multi_attach) (#8110)
## Problem

Tests using the `Workload` helper would occasionally fail in a strange
way, where the endpoint appears to try and stop twice concurrently, and
the second stop fails because the pidfile is already gone.
`test_multi_attach` suffered from this.

Workload has a `__del__` that stops the endpoint, and Python destroys
this object in a different thread from the one in which NeonEnv.stop is
called, resulting in racing stop() calls. Endpoint has a `running`
attribute that avoids calling neon_local's stop twice, but that doesn't
help in the concurrent case.

## Summary of changes

- Make `Endpoint.stop` thread safe with a simple lock held across the
updates to `running` and the actual act of stopping it.

One could also work around this by letting Workload.endpoint outlive the
Workload, or making Workload a context manager, but this change feels
most robust, as it avoids all test code having to know that it must not
try and stop an endpoint from a destructor.
2024-06-19 13:14:50 +01:00