rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-06 13:40:37 +00:00

Author	SHA1	Message	Date
Christian Schwarz	3388eb1205	Merge branch 'problame/async-cleanup-on-drop-for-writers' into yuchen/direct-io-delta-image-layer-write	2025-04-11 17:46:46 +02:00
Christian Schwarz	2f0677be26	refactor delta&image writers to perform cleanup on Drop in the background In #10063 we will switch BlobWriter, which underlies delta and image layer writers, to use the owned buffers IO buffered writer. That buffered writer implements double-buffering by virtue of a background task that performs the flushing -- it owns the VirtualFile and both DeltaLayerWriter and ImageLayerWriter are mere clients to it. The implication is that it's no longer true that dropping these client objects guarantees that all IO activity is complete. We must wait for the flush task to exit. In preparation for that new world, this PR moves the cleanup to a short-lived task that is spawned from the Drop impl, and adds appropriate gate guard holdings to hook it into the Timeline lifecycle. We must (theoretically) worry that there will be a retry in between Drop completing and the spawned task completing. It could collide on the randomly generated temporary file name. We avoid this by switching to a global monotonic counter. Refs - extracted from https://github.com/neondatabase/neon/pull/10063 - epic https://github.com/neondatabase/neon/issues/9868	2025-04-11 17:40:42 +02:00
Christian Schwarz	062c7b9a76	refactor: plumb gate and cancellation down to blob_io::BlobWriter In #10063 we will switch BlobWriter to use the owned buffers IO buffered writer, which implements double-buffering by virtue of a background task that performs the flushing. That task's lifecycle must be contained within the Timeline lifecycle, so, it must hold the timeline gate open and respect Timeline::cancel. This PR does the noisy plumbing to reduce the #10063 diff. Refs - extracted from https://github.com/neondatabase/neon/pull/10063 - epic https://github.com/neondatabase/neon/issues/9868	2025-04-11 17:00:00 +02:00
Christian Schwarz	c6209b4a39	Revert "undo all changes except gate,cancel,context propagation" This reverts commit `f25f71bc98`.	2025-04-11 16:57:36 +02:00
Christian Schwarz	f25f71bc98	undo all changes except gate,cancel,context propagation	2025-04-11 16:57:19 +02:00
Christian Schwarz	2ee316b454	minimize diff a bit	2025-04-11 16:46:41 +02:00
Christian Schwarz	a929e7a844	gate & cancel propagation: make it less invasive - store reference to gate - store CancellationToken clone	2025-04-11 16:39:54 +02:00
Christian Schwarz	3f417d4ac8	Merge 2025-04-11 main commit 'c66444ea1538349d13ab5e87bca880394434004b' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:26:41 +02:00
Christian Schwarz	af6c433947	Merge 2025-04-09 main commit 'a04e33ceb638a3ee5fef8d642b57ffc3a4543c98' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:26:26 +02:00
Christian Schwarz	fd7e3fd82f	Merge WITH CONFLICTS commit '72832b32140a78db7612af626d7c69079d73f445' into yuchen/direct-io-delta-image-layer-write Conflicts: pageserver/src/tenant/blob_io.rs - minor stuff Also I noticed some earlier merge went through cleanly but the `generate_tombstone_image_layer` layer writer didn't have the right arguments, so, failed to compile. Fixed in this merge commit.	2025-04-11 16:24:35 +02:00
Christian Schwarz	ddf6ba75c2	Merge 2025-04-09 main commit 'd11f23a3419a5b8eef62bc5736a4dd9d413bdab8' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:18:59 +02:00
Christian Schwarz	f017382b2b	Merge 2025-04-09 main commit 'e7502a3d637932a59ee502ababb1df3d0e3bca26' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:18:48 +02:00
Christian Schwarz	d0cb1a93dc	Merge 2025-04-09 main commit 'ef8101a9be3ce80d104943238a7d608561432189' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:18:34 +02:00
Christian Schwarz	140b47dc5a	Merge 2025-04-09 main commit 'a6ff8ec3d47963616d9cef07421d9319db958e8a' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:17:54 +02:00
Christian Schwarz	de1c392082	Merge 2025-04-07 main commit '486872dd28d538817599f29b045be025d1e3f43a' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:17:32 +02:00
Christian Schwarz	c5c60e156e	Merge WITH CONFLICTS 2025-03-18 main commit '9fb77d6cdd0894ec4e93b4fe3a576655cfad3b2e' into yuchen/direct-io-delta-image-layer-write The previous merge commit was the commit before, so, all these conflicts are the conflicts that arise from this PR and 97fb77 which is the commit that added cancellation sensitivity to flush task infinite retries. Conflicts: pageserver/src/tenant/remote_timeline_client/download.rs - different return type pageserver/src/virtual_file/owned_buffers_io/write.rs - added TODO that needs to be fixed before merge about retrying final write. I want a different API than this shutdown() thing we have rn pageserver/src/virtual_file/owned_buffers_io/write/flush.rs Most of the churn came from the need to propagate cancellation token. And churn in tests from having to propagate upwards the FlushTaskError instead of the std::io::Error we were propagating upwards before.	2025-04-11 16:13:44 +02:00
Arpad Müller	c66444ea15	Add timeline_import http endpoint (#11484 ) The added `timeline_import` endpoint allows us to migrate safekeeper timelines from control plane managed to storcon managed. Part of #9011	2025-04-11 14:10:27 +00:00
Arpad Müller	88f01c1ca1	Introduce WalIngestError (#11506 ) Introduces a `WalIngestError` struct together with a `WalIngestErrorKind` enum, to be used for walingest related failures and errors. * the enum captures backtraces, so we don't regress in comparison to `anyhow::Error`s (backtraces might be a bit shorter if we use one of the `anyhow::Error` wrappers) * it explicitly lists most/all of the potential cases that can occur. I've originally been inspired to do this in #11496, but it's a longer-term TODO.	2025-04-11 14:08:46 +00:00
Erik Grinaker	a6937a3281	pageserver: improve shard ancestor compaction logging (#11535 ) ## Problem Shard ancestor compaction always logs "starting shard ancestor compaction", even if there is no work to do. This is very spammy (every 20 seconds for every shard). It also has limited progress logging. ## Summary of changes * Only log "starting shard ancestor compaction" when there's work to do. * Include details about the amount of work. * Log progress messages for each layer, and when waiting for uploads. * Log when compaction is completed, with elapsed duration and whether there is more work for a later iteration.	2025-04-11 12:14:08 +00:00
Christian Schwarz	9256935e1b	fix download usage of buffered writer (using pad + set_len strategy) this fixes tenant::timeline::tests::test_heatmap_generation	2025-04-11 13:51:56 +02:00
Christian Schwarz	647c881878	fix for vectored_blob_io::tests::test_really_big_array	2025-04-11 13:31:51 +02:00
Erik Grinaker	3c8565a194	test_runner: propagate config via `attach_hook` for test fix (#11529 ) ## Problem The `pagebench` benchmarks set up an initial dataset by creating a template tenant, copying the remote storage to a bunch of new tenants, and attaching them to Pageservers. In #11420, we found that `test_pageserver_characterize_throughput_with_n_tenants` had degraded performance because it set a custom tenant config in Pageservers that was then replaced with the default tenant config by the storage controller. The initial fix was to register the tenants directly in the storage controller, but this created the tenants with generation 1. This broke `test_basebackup_with_high_slru_count`, where the template tenant was at generation 2, leading to all layer files at generation 2 being ignored. Resolves #11485. Touches #11381. ## Summary of changes This patch addresses both test issues by modifying `attach_hook` to also take a custom tenant config. This allows attaching tenants to Pageservers from pre-existing remote storage, specifying both the generation and tenant config when registering them in the storage controller.	2025-04-11 11:31:12 +00:00
Christian Schwarz	d1277b8259	I have a hypothesis for what the issue is with the vectored_blob_io::tests::test_really_big_array	2025-04-11 13:26:39 +02:00
Christian Schwarz	53b837d507	put in a note on blob_io writer not needed to do owned buffers io anymore	2025-04-11 13:25:25 +02:00
Christian Schwarz	f5d69e97c4	remark: vectored_blob_io::tests::test_really_big_array is failing since before I started merging from main	2025-04-11 12:56:07 +02:00
Christian Schwarz	e79beb0720	turns out we can delete all the seek-related APIs as well	2025-04-11 12:29:22 +02:00
Christian Schwarz	dfc364e4f4	remove non-absolute-position write APIs from VirtualFile	2025-04-11 11:57:09 +02:00
Christian Schwarz	979fa0682b	tests: update batching perf test workload to include scattered LSNs (#11391 ) The batching perf test workload is currently read-only sequential scans. However, realistic workloads have concurrent writes (to other pages) going on. This PR simulates concurrent writes to other pages by emitting logical replication messages. These degrade the achieved batching factor, for the reason see - https://github.com/neondatabase/neon/issues/10765 PR - https://github.com/neondatabase/neon/pull/11494 will fix this problem and get batching factor back up. --------- Co-authored-by: Vlad Lazar <vlad@neon.tech>	2025-04-11 09:55:49 +00:00
Christian Schwarz	8884865bca	tests: make `test_pageserver_getpage_throttle` less flaky (#11482 ) # Refs - fixes https://github.com/neondatabase/neon/issues/11395 # Problem Since 2025-03-10, we have observed increased flakiness of `test_pageserver_getpage_throttle`. The test is timing-dependent by nature, and was hitting the ``` assert duration_secs >= 10 * actual_smgr_query_seconds, ( "smgr metrics should not include throttle wait time" ) ``` quite frequently. # Analysis These failures are not reproducible. In this PR's history is a commit that reran the test 100 times without requiring a single retry. In https://github.com/neondatabase/neon/issues/11395 there is a link to a query to the test results database. It shows that the flakiness was not constant, but rather episodic: 2025-03-{10,11,12,13} 2025-03-{19,20,21} 2025-03-31 and 2025-04-01. To me, this suggests variability in available CPU. # Solution The point of the offending assertion is to ensure that most of the request latency is spent on throttling, because testing of the throttling mechanism is the point of the test. The `10` magic number means at most 10% of mean latency may be spent on request processing. Ideally we would control the passage of time (virtual clock source) to make this test deterministic. But I don't see that happening in our regression test setup. So, this PR de-flakes the test as follows: - allot up to 66% of mean latency for request processing - increase duration from 10s to 20s, hoping to get better protection from momentary CPU spikes in noisy neighbor tests or VMs on the runner host As a drive-by, switch to `pytest.approx` and remove one self-test assertion I can't make sense of anymore.	2025-04-11 09:38:05 +00:00
Dmitrii Kovalkov	4c4e33bc2e	storage: add http/https server and cert resolver metrics (#11450 ) ## Problem We need to export some metrics about certs/connections to configure alerts and make sure that all HTTP requests are gone before turning https-only mode on. - Closes: https://github.com/neondatabase/cloud/issues/25526 ## Summary of changes - Add started connection and connection error metrics to http/https Server. - Add certificate expiration time and reload metrics to ReloadingCertificateResolver.	2025-04-11 06:11:35 +00:00
Tristan Partin	342607473a	Make Endpoint::respec_deep() infinitely deep (#11527 ) Because it wasn't recursive, there was a limit to the depth of updates. This work is necessary because as we teach neon_local and compute_ctl that the content in --spec-path should match a similar structure we get from the control plane, the spec object itself will no longer be toplevel. It will be under the "spec" key. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-10 19:55:51 +00:00
John Spray	9c37bfc90a	pageserver/tests: make image_layer_rewrite write less data (#11525 ) ## Problem This test is slow to execute, particularly if you're on a slow environment like vscode in a browser. Might have got much slower when we switched to direct IO? ## Summary of changes - Reduce the scale of the test by 10x, since there was nothing special about the original size.	2025-04-10 17:03:22 +00:00
John Spray	52dee408dc	storage controller: improve safety of shard splits coinciding with controller restarts (#11412 ) ## Problem The graceful leadership transfer process involves calling step_down on the old controller, but this was not waiting for shard splits to complete, and the new controller could therefore end up trying to abort a shard split while it was still going on. We mitigated this already in #11256 by avoiding the case where shard split completion would update the database incorrectly, but this was a fragile fix because it assumes that is the only problematic part of the split running concurrently. Precursors: - #11290 - #11256 Closes: #11254 ## Summary of changes - Hold the reconciler gate from shard splits, so that step_down will wait for them. Splits should always be fairly prompt, so it is okay to wait here. - Defense in depth: if step_down times out (hardcoded 10 second limit), then fully terminate the controller process rather than letting it continue running, potentially doing split-brainy things. This makes sense because the new controller will always declare itself leader unilaterally if step_down fails, so leaving an old controller running is not beneficial. - Tests: extend `test_storage_controller_leadership_transfer_during_split` to separately exercise the case of a split holding up step_down, and the case where the overall timeout on step_down is hit and the controller terminates.	2025-04-10 16:55:37 +00:00
Anastasia Lubennikova	5487a20b72	compute: Set log_parameter=off for audit logging. (#11500 ) Log -> Base, pgaudit.log = 'ddl', pgaudit.log_parameter='off' Hipaa -> Extended. pgaudit.log = 'all, -misc', pgaudit.log_parameter='off' add new level Full: pgaudit.log='all', pgaudit.log_parameter='on' Keep old parameter names for compatibility, until cplane side changes are implemented and released. closes https://github.com/neondatabase/cloud/issues/27202	2025-04-10 15:28:28 +00:00
Alex Chi Z.	f06d721a98	test(pageserver): ensure gc-compaction does not fire critical errors (#11513 ) ## Problem Part of https://github.com/neondatabase/neon/issues/10395 ## Summary of changes Add a test case to ensure gc-compaction doesn't fire any critical errors if the key history is invalid due to partial GC. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-10 14:53:37 +00:00
Christian Schwarz	2e35f23085	tests: remove ignored `fair` field (#11521 ) Pageserver has been ignoring field `tenant_config.timeline_get_throttle.fair` for many months, since we removed it from the config struct in neondatabase/neon#8539. Refs - epic https://github.com/neondatabase/cloud/issues/27320	2025-04-10 14:24:30 +00:00
Anastasia Lubennikova	5063151271	compute: Add more neon ids to compute (#11366 ) Pass more neon ids to compute_ctl. Expose them to postgres as neon extension GUCs: neon.project_id, neon.branch_id, neon.endpoint_id. This is the compute side PR, not yet supported by cplane.	2025-04-10 13:04:18 +00:00
Erik Grinaker	0122d97f95	test_runner: only use last gen in `test_location_conf_churn` (#11511 ) ## Problem `test_location_conf_churn` performs random location updates on Pageservers. While doing this, it could instruct the compute to connect to a stale generation and execute queries. This is invalid, and will fail if a newer generation has removed layer files used by the stale generation. Resolves #11348. ## Summary of changes Only connect to the latest generation when executing queries.	2025-04-10 10:07:16 +00:00
Arseny Sher	fae7528adb	walproposer: make it aware of membership (#11407 ) ## Problem Walproposer should get elected and commit WAL on safekeepers specified by the membership configuration. ## Summary of changes - Add to wp `members_safekeepers` and `new_members_safekeepers` arrays mapping configuration members to connection slots. Establish this mapping (by node id) when safekeeper sends greeting, giving its id and when mconf becomes known / changes. - Add to TermsCollected, VotesCollected, GetAcknowledgedByQuorumWALPosition membership aware logic. Currently it partially duplicates existing one, but we'll drop the latter eventually. - In python, rename Configuration to MembershipConfiguration for clarity. - Add test_quorum_sanity testing new logic. ref https://github.com/neondatabase/neon/issues/10851	2025-04-10 09:55:37 +00:00
Christian Schwarz	9222995c4f	REVIEW more the shutdown API	2025-04-10 11:16:35 +02:00
Dmitrii Kovalkov	8a72e6f888	pageserver: add enable_tls_page_service_api (#11508 ) ## Problem Page service doesn't use TLS for incoming requests. - Closes: https://github.com/neondatabase/cloud/issues/27236 ## Summary of changes - Add option `enable_tls_page_service_api` to pageserver config - Propagate `tls_server_config` to `page_service` if the option is enabled No integration tests for now because I didn't find out how to call page service API from python and AFAIK computes don't support TLS yet	2025-04-10 08:45:17 +00:00
Christian Schwarz	6f25c976f6	REVIEW: undo the `mutable->tail` rename to minimize conflicts with next commit Changes to be committed: modified: pageserver/src/tenant/ephemeral_file.rs modified: pageserver/src/virtual_file/owned_buffers_io/write.rs	2025-04-10 09:02:45 +02:00
Christian Schwarz	dd3178836d	REVIEW: minor nits	2025-04-10 08:58:06 +02:00
Tristan Partin	a04e33ceb6	Remove --spec-json argument from compute_ctl (#11510 ) It isn't used by the production control plane or neon_local. The removal simplifies compute spec logic just a little bit more since we can remove any notion of whether we should allow live reconfigurations. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-09 22:39:54 +00:00
Alex Chi Z.	af0be11503	fix(pageserver): ensure gc-compaction gets preempted by L0 (#11512 ) ## Problem Part of #9114 ## Summary of changes The gc-compaction flag was not set correctly, so it was not preempted by L0. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-09 21:41:11 +00:00
Alex Chi Z.	405a17bf0b	fix(pageserver): ensure gc-compaction gets preempted by L0 (#11512 ) ## Problem Part of #9114 ## Summary of changes The gc-compaction flag was not set correctly, so it was not preempted by L0. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-09 20:57:50 +00:00
Erik Grinaker	63ee8e2181	test_runner: ignore `.___temp` files in `evict_random_layers` (#11509 ) ## Problem `test_location_conf_churn` often fails with `neither image nor delta layer`, but doesn't say what the file actually is. However, past local failures have indicated that it might be `.___temp` files. Touches https://github.com/neondatabase/neon/issues/11348. ## Summary of changes Ignore `.___temp` files when evicting local layers, and include the file name in the error message.	2025-04-09 19:03:49 +00:00
Alex Chi Z.	2c21a65b0b	feat(pageserver): add gc-compaction time-to-first-item stats (#11475 ) ## Problem In some cases gc-compaction doesn't respond to the L0 compaction yield notifier. I suspect it's stuck on getting the first item, and if so, we probably need to let L0 yield notifier preempt `next_with_trace`. ## Summary of changes - Add `time_to_first_kv_pair` to gc-compaction statistics. - Invert the ratio so that smaller ratio -> better compaction ratio. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-09 18:07:58 +00:00
Alex Chi Z.	ec66b788e2	fix(pageserver): use different walredo retry setting for gc-compaction (#11497 ) ## Problem Not a complete fix for https://github.com/neondatabase/neon/issues/11492 but should work for a short term. Our current retry strategy for walredo is to retry every request exactly once. This retry doesn't make sense because it retries all requests exactly once and each error is expected to cause process restart and cause future requests to fail. I'll explain it with a scenario of two threads requesting redos: one with an invalid history (that will cause walredo to panic) and another that has a correct redo sequence. First let's look at how we handle retries right now in do_with_walredo_process. At the beginning of the function it will spawn a new process if there's no existing one. Then it will continue to redo. If the process fails, the first process that encounters the error will remove the walredo process object from the OnceCell, so that the next time it gets accessed, a new process will be spawned; if it is the last one that uses the old walredo process, it will kill and wait the process in `drop(proc)`. I'm skeptical whether this works under races but I think this is not the root cause of the problem. In this retry handler, if there are N requests attached to a walredo process and the i-th request fails (panics the walredo), all other N-i requests will fail and they need to retry so that they can access a new walredo process. ``` time ----> proc A None B request 1 ^-----------------^ fail uses A for redo replace with None request 2 ^-------------------- fail uses A for redo request 3 ^----------------^ fail uses A for redo last ref, wait for A to be killed request 4 ^--------------- None, spawn new process B ``` The problem is with our retry strategy. Normally, for a system that we want to retry on, the probability of errors for each of the requests are uncorrelated. 
However, in walredo, a prior request that panics the walredo process will cause all future walredo on that process to fail (that's correlated). So, back to the situation where we have 2 requests where one will definitely fail and the other will succeed and we get the following sequence, where retry attempts = 1, * new walredo process A starts. * request 1 (invalid) being processed on A and panics A, waiting for retry, remove process A from the process object. * request 2 (valid) being processed on A and receives pipe broken / poisoned process error, waiting for retry, wait for A to be killed -- this very likely takes a while and cannot finish before request 1 gets processed again * new walredo process B starts. * request 1 (invalid) being processed again on B and panics B, the whole request fails. * request 2 (valid) being processed again on B, and gets a poisoned error again. ``` time ----> proc A None B None request 1 ^-----------------^--------------^--------------------^ spawn A for redo fail spawn B for redo fail request 2 ^--------------------^-------------------------^------------^ use A for redo fail, wait to kill A B for redo fail again ``` In such cases, no matter how we set n_attempts, as long as the retry count applies to all requests, this sequence is bound to fail both requests because of how they get sequenced; while we could potentially make request 2 successful. There are many solutions to this -- like having a separate walredo manager for compactions, or define which errors are retryable (i.e., broken pipe can be retried, while real walredo error won't be retried), or having an exclusive big lock over the whole redo process (the current one is very fine-grained). In this patch, we go with a simple approach: use different retry attempts for different types of requests. For gc-compaction, the attempt count is set to 0, so that it never retries and consequently stops the compaction process -- no more redo will be issued from gc-compaction. 
Once the walredo process gets restarted, the normal read requests will proceed normally. ## Summary of changes Add redo_attempt for each reconstruct value request to set different retry policies. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Erik Grinaker <erik@neon.tech>	2025-04-09 18:01:31 +00:00
Christian Schwarz	2a29b3de89	Merge 2025-03-18 main commit '99639c26b49a0d6d546fd' into yuchen/direct-io-delta-image-layer-write	2025-04-09 19:40:14 +02:00
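
Commit 342607473a above makes `Endpoint::respec_deep()` recursive so that updates apply at any nesting depth, including under a top-level "spec" key. A minimal sketch of that kind of recursive merge, assuming plain nested dicts; the helper name `deep_update` and the sample keys are hypothetical and not taken from the Neon source:

```python
from typing import Any


def deep_update(target: dict[str, Any], updates: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge `updates` into `target` in place, at any depth."""
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            # Both sides are dicts: descend instead of overwriting the subtree.
            deep_update(target[key], value)
        else:
            # Leaf value, or a type mismatch: the update wins.
            target[key] = value
    return target


# Hypothetical spec layout: the spec object nested under a "spec" key.
spec = {"spec": {"cluster": {"settings": {"a": 1, "b": 2}}, "format_version": 1}}
deep_update(spec, {"spec": {"cluster": {"settings": {"b": 3}}}})
```

Only the targeted leaf changes; sibling keys at every level are left untouched, which a single-level update could not guarantee.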

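
Commit 8884865bca above relaxes the throttle test's assertion from allowing at most 10% of latency for request processing to allowing up to 66%. The arithmetic can be sketched as follows; the function name `processing_share_ok` and the sample numbers are illustrative assumptions, and this is not the test's actual code:

```python
def processing_share_ok(
    duration_secs: float, smgr_query_seconds: float, max_share: float
) -> bool:
    """True if request processing took at most `max_share` of the duration."""
    return smgr_query_seconds <= max_share * duration_secs


# The original check `duration_secs >= 10 * actual_smgr_query_seconds`
# corresponds to max_share = 0.1.
duration_secs = 20.0  # hypothetical run length
smgr_query_seconds = 5.0  # hypothetical processing time (25% of the duration)

old_ok = processing_share_ok(duration_secs, smgr_query_seconds, 0.10)
new_ok = processing_share_ok(duration_secs, smgr_query_seconds, 0.66)
```

With these sample numbers the old 10% bound fails while the relaxed 66% bound passes, which is the kind of CPU-noise headroom the commit message aims for.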
7726 Commits