rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-27 01:50:38 +00:00

Author	SHA1	Message	Date
Christian Schwarz	3388eb1205	Merge branch 'problame/async-cleanup-on-drop-for-writers' into yuchen/direct-io-delta-image-layer-write	2025-04-11 17:46:46 +02:00
Christian Schwarz	2f0677be26	refactor delta&image writers to perform cleanup on Drop in the background In #10063 we will switch BlobWriter, which underlies delta and image layer writers, to use the owned buffers IO buffered writer. That buffered writer implements double-buffering by virtue of a background task that performs the flushing -- it owns the VirtualFile and both DeltaLayerWriter and ImageLayerWriter are mere clients to it. The implication is that it's no longer true that dropping these client objects guarantees that all IO activity is complete. We must wait for the flush task to exit. In preparation for that new world, this PR moves the cleanup to a short-lived task that is spawned from the Drop impl, and adds appropriate gate guard holdings to hook it into the Timeline lifecycle. We must (theoretically) worry that there will be a retry in between Drop completing and the spawned task completing. It could collide on the randomly generated temporary file name. We avoid this by switching to a global monotonic counter. Refs - extracted from https://github.com/neondatabase/neon/pull/10063 - epic https://github.com/neondatabase/neon/issues/9868	2025-04-11 17:40:42 +02:00
Christian Schwarz	062c7b9a76	refactor: plumb gate and cancellation down to blob_io::BlobWriter In #10063 we will switch BlobWriter to use the owned buffers IO buffered writer, which implements double-buffering by virtue of a background task that performs the flushing. That task's lifecycle must be contained within the Timeline lifecycle, so, it must hold the timeline gate open and respect Timeline::cancel. This PR does the noisy plumbing to reduce the #10063 diff. Refs - extracted from https://github.com/neondatabase/neon/pull/10063 - epic https://github.com/neondatabase/neon/issues/9868	2025-04-11 17:00:00 +02:00
Christian Schwarz	c6209b4a39	Revert "undo all changes except gate,cancel,context propagation" This reverts commit `f25f71bc98`.	2025-04-11 16:57:36 +02:00
Christian Schwarz	f25f71bc98	undo all changes except gate,cancel,context propagation	2025-04-11 16:57:19 +02:00
Christian Schwarz	2ee316b454	minimize diff a bit	2025-04-11 16:46:41 +02:00
Christian Schwarz	a929e7a844	gate & cancel propagation: make it less invasive - store reference to gate - store CancellationToken clone	2025-04-11 16:39:54 +02:00
Christian Schwarz	3f417d4ac8	Merge 2025-04-11 main commit 'c66444ea1538349d13ab5e87bca880394434004b' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:26:41 +02:00
Christian Schwarz	af6c433947	Merge 2025-04-09 main commit 'a04e33ceb638a3ee5fef8d642b57ffc3a4543c98' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:26:26 +02:00
Christian Schwarz	fd7e3fd82f	Merge WITH CONFLICTS commit '72832b32140a78db7612af626d7c69079d73f445' into yuchen/direct-io-delta-image-layer-write Conflicts: pageserver/src/tenant/blob_io.rs - minor stuff Also I noticed some earlier merge went through cleanly but the `generate_tombstone_image_layer` layer writer didn't have the right arguments, so, failed to compile. Fixed in this merge commit.	2025-04-11 16:24:35 +02:00
Christian Schwarz	ddf6ba75c2	Merge 2025-04-09 main commit 'd11f23a3419a5b8eef62bc5736a4dd9d413bdab8' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:18:59 +02:00
Christian Schwarz	f017382b2b	Merge 2025-04-09 main commit 'e7502a3d637932a59ee502ababb1df3d0e3bca26' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:18:48 +02:00
Christian Schwarz	d0cb1a93dc	Merge 2025-04-09 main commit 'ef8101a9be3ce80d104943238a7d608561432189' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:18:34 +02:00
Christian Schwarz	140b47dc5a	Merge 2025-04-09 main commit 'a6ff8ec3d47963616d9cef07421d9319db958e8a' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:17:54 +02:00
Christian Schwarz	de1c392082	Merge 2025-04-07 main commit '486872dd28d538817599f29b045be025d1e3f43a' into yuchen/direct-io-delta-image-layer-write	2025-04-11 16:17:32 +02:00
Christian Schwarz	c5c60e156e	Merge WITH CONFLICTS 2025-03-18 main commit '9fb77d6cdd0894ec4e93b4fe3a576655cfad3b2e' into yuchen/direct-io-delta-image-layer-write The previous merge commit was the commit before, so, all these conflicts are the conflicts that arise from this PR and 97fb77 which is the commit that added cancellation sensitivity to flush task infinite retries. Conflicts: pageserver/src/tenant/remote_timeline_client/download.rs - different return type pageserver/src/virtual_file/owned_buffers_io/write.rs - added TODO that needs to be fixed before merge about retrying final write. I want a different API than this shutdown() thing we have rn pageserver/src/virtual_file/owned_buffers_io/write/flush.rs Most of the churn came from the need to propagate cancellation token. And churn in tests from having to propagate upwards the FlushTaskError instead of the std::io::Error we were propagating upwards before.	2025-04-11 16:13:44 +02:00
Arpad Müller	88f01c1ca1	Introduce WalIngestError (#11506 ) Introduces a `WalIngestError` struct together with a `WalIngestErrorKind` enum, to be used for walingest related failures and errors. * the enum captures backtraces, so we don't regress in comparison to `anyhow::Error`s (backtraces might be a bit shorter if we use one of the `anyhow::Error` wrappers) * it explicitly lists most/all of the potential cases that can occur. I've originally been inspired to do this in #11496, but it's a longer-term TODO.	2025-04-11 14:08:46 +00:00
Erik Grinaker	a6937a3281	pageserver: improve shard ancestor compaction logging (#11535 ) ## Problem Shard ancestor compaction always logs "starting shard ancestor compaction", even if there is no work to do. This is very spammy (every 20 seconds for every shard). It also has limited progress logging. ## Summary of changes * Only log "starting shard ancestor compaction" when there's work to do. * Include details about the amount of work. * Log progress messages for each layer, and when waiting for uploads. * Log when compaction is completed, with elapsed duration and whether there is more work for a later iteration.	2025-04-11 12:14:08 +00:00
Christian Schwarz	9256935e1b	fix download usage of buffered writer (using pad + set_len strategy) this fixes tenant::timeline::tests::test_heatmap_generation	2025-04-11 13:51:56 +02:00
Christian Schwarz	647c881878	fix for vectored_blob_io::tests::test_really_big_array	2025-04-11 13:31:51 +02:00
Christian Schwarz	d1277b8259	I have a hypothesis for what the issue is with the vectored_blob_io::tests::test_really_big_array	2025-04-11 13:26:39 +02:00
Christian Schwarz	53b837d507	put in a note on blob_io writer not needed to do owned buffers io anymore	2025-04-11 13:25:25 +02:00
Christian Schwarz	e79beb0720	turns out we can delete all the seek-related APIs as well	2025-04-11 12:29:22 +02:00
Christian Schwarz	dfc364e4f4	remove non-absolute-position write APIs from VirtualFile	2025-04-11 11:57:09 +02:00
Dmitrii Kovalkov	4c4e33bc2e	storage: add http/https server and cert resolver metrics (#11450 ) ## Problem We need to export some metrics about certs/connections to configure alerts and make sure that all HTTP requests are gone before turning https-only mode on. - Closes: https://github.com/neondatabase/cloud/issues/25526 ## Summary of changes - Add started connection and connection error metrics to http/https Server. - Add certificate expiration time and reload metrics to ReloadingCertificateResolver.	2025-04-11 06:11:35 +00:00
John Spray	9c37bfc90a	pageserver/tests: make image_layer_rewrite write less data (#11525 ) ## Problem This test is slow to execute, particularly if you're on a slow environment like vscode in a browser. Might have got much slower when we switched to direct IO? ## Summary of changes - Reduce the scale of the test by 10x, since there was nothing special about the original size.	2025-04-10 17:03:22 +00:00
Alex Chi Z.	f06d721a98	test(pageserver): ensure gc-compaction does not fire critical errors (#11513 ) ## Problem Part of https://github.com/neondatabase/neon/issues/10395 ## Summary of changes Add a test case to ensure gc-compaction doesn't fire any critical errors if the key history is invalid due to partial GC. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-10 14:53:37 +00:00
Christian Schwarz	9222995c4f	REVIEW more the shutdown API	2025-04-10 11:16:35 +02:00
Dmitrii Kovalkov	8a72e6f888	pageserver: add enable_tls_page_service_api (#11508 ) ## Problem Page service doesn't use TLS for incoming requests. - Closes: https://github.com/neondatabase/cloud/issues/27236 ## Summary of changes - Add option `enable_tls_page_service_api` to pageserver config - Propagate `tls_server_config` to `page_service` if the option is enabled No integration tests for now because I didn't find out how to call page service API from python and AFAIK computes don't support TLS yet	2025-04-10 08:45:17 +00:00
Christian Schwarz	6f25c976f6	REVIEW: undo the `mutable->tail` rename to minimize conflicts with next commit Changes to be committed: modified: pageserver/src/tenant/ephemeral_file.rs modified: pageserver/src/virtual_file/owned_buffers_io/write.rs	2025-04-10 09:02:45 +02:00
Christian Schwarz	dd3178836d	REVIEW: minor nits	2025-04-10 08:58:06 +02:00
Alex Chi Z.	405a17bf0b	fix(pageserver): ensure gc-compaction gets preempted by L0 (#11512 ) ## Problem Part of #9114 ## Summary of changes Gc-compaction flag was not correctly set, causing it not to get preempted by L0. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-09 20:57:50 +00:00
Alex Chi Z.	2c21a65b0b	feat(pageserver): add gc-compaction time-to-first-item stats (#11475 ) ## Problem In some cases gc-compaction doesn't respond to the L0 compaction yield notifier. I suspect it's stuck on getting the first item, and if so, we probably need to let L0 yield notifier preempt `next_with_trace`. ## Summary of changes - Add `time_to_first_kv_pair` to gc-compaction statistics. - Invert the ratio so that smaller ratio -> better compaction ratio. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-09 18:07:58 +00:00
Alex Chi Z.	ec66b788e2	fix(pageserver): use different walredo retry setting for gc-compaction (#11497 ) ## Problem Not a complete fix for https://github.com/neondatabase/neon/issues/11492 but should work for a short term. Our current retry strategy for walredo is to retry every request exactly once. This retry doesn't make sense because it retries all requests exactly once and each error is expected to cause process restart and cause future requests to fail. I'll explain it with a scenario of two threads requesting redos: one with an invalid history (that will cause walredo to panic) and another that has a correct redo sequence. First let's look at how we handle retries right now in do_with_walredo_process. At the beginning of the function it will spawn a new process if there's no existing one. Then it will continue to redo. If the process fails, the first process that encounters the error will remove the walredo process object from the OnceCell, so that the next time it gets accessed, a new process will be spawned; if it is the last one that uses the old walredo process, it will kill and wait the process in `drop(proc)`. I'm skeptical whether this works under races but I think this is not the root cause of the problem. In this retry handler, if there are N requests attached to a walredo process and the i-th request fails (panics the walredo), all other N-i requests will fail and they need to retry so that they can access a new walredo process. ``` time ----> proc A None B request 1 ^-----------------^ fail uses A for redo replace with None request 2 ^-------------------- fail uses A for redo request 3 ^----------------^ fail uses A for redo last ref, wait for A to be killed request 4 ^--------------- None, spawn new process B ``` The problem is with our retry strategy. Normally, for a system that we want to retry on, the probability of errors for each of the requests are uncorrelated. 
However, in walredo, a prior request that panics the walredo process will cause all future walredo on that process to fail (that's correlated). So, back to the situation where we have 2 requests where one will definitely fail and the other will succeed and we get the following sequence, where retry attempts = 1, * new walredo process A starts. * request 1 (invalid) being processed on A and panics A, waiting for retry, remove process A from the process object. * request 2 (valid) being processed on A and receives pipe broken / poisoned process error, waiting for retry, wait for A to be killed -- this very likely takes a while and cannot finish before request 1 gets processed again * new walredo process B starts. * request 1 (invalid) being processed again on B and panics B, the whole request fails. * request 2 (valid) being processed again on B, and gets a poisoned error again. ``` time ----> proc A None B None request 1 ^-----------------^--------------^--------------------^ spawn A for redo fail spawn B for redo fail request 2 ^--------------------^-------------------------^------------^ use A for redo fail, wait to kill A B for redo fail again ``` In such cases, no matter how we set n_attempts, as long as the retry count applies to all requests, this sequence is bound to fail both requests because of how they get sequenced; while we could potentially make request 2 successful. There are many solutions to this -- like having a separate walredo manager for compactions, or defining which errors are retryable (e.g., broken pipe can be retried, while real walredo error won't be retried), or having an exclusive big lock over the whole redo process (the current one is very fine-grained). In this patch, we go with a simple approach: use different retry attempts for different types of requests. For gc-compaction, the attempt count is set to 0, so that it never retries and consequently stops the compaction process -- no more redo will be issued from gc-compaction. 
Once the walredo process gets restarted, the normal read requests will proceed normally. ## Summary of changes Add redo_attempt for each reconstruct value request to set different retry policies. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Erik Grinaker <erik@neon.tech>	2025-04-09 18:01:31 +00:00
Christian Schwarz	2a29b3de89	Merge 2025-03-18 main commit '99639c26b49a0d6d546fd' into yuchen/direct-io-delta-image-layer-write	2025-04-09 19:40:14 +02:00
Christian Schwarz	91aff7b842	Merge WITH CONFLICTS 2025-03-11 main commit '158db414bf881fb358494e3215d192c8fa420a53' into yuchen/direct-io-delta-image-layer-write Conflicts: pageserver/src/virtual_file.rs pageserver/src/virtual_file/owned_buffers_io/write/flush.rs	2025-04-09 19:39:56 +02:00
Christian Schwarz	f078d7e1a9	Merge WITH CONFLICTS 2025-03-11 main commit '7c462b3417ecd3ae3907f3480f3b8a8c99fc6d7b' into yuchen/direct-io-delta-image-layer-write Conflicts: pageserver/src/tenant/blob_io.rs	2025-04-09 19:39:12 +02:00
Christian Schwarz	537eb334f2	Merge WITH CONFLICTS 2025-02-25 main commit '920040e40240774219b6607f1f8ef74478dc4b29' into yuchen/direct-io-delta-image-layer-write Conflicts: pageserver/src/tenant/blob_io.rs pageserver/src/tenant/block_io.rs pageserver/src/tenant/disk_btree.rs pageserver/src/tenant/storage_layer/delta_layer.rs pageserver/src/tenant/storage_layer/image_layer.rs pageserver/src/virtual_file/owned_buffers_io/write.rs	2025-04-09 19:38:20 +02:00
Christian Schwarz	e37cbc1a50	make clippy pass	2025-04-09 19:33:35 +02:00
Conrad Ludgate	72832b3214	chore: fix clippy lints from nightly-2025-03-16 (#11273 ) I like to run nightly clippy every so often to make our future rust upgrades easier. Some notable changes: * Prefer `next_back()` over `last()`. Generic iterators will implement `last()` to run forward through the iterator until the end. * Prefer `io::Error::other()`. * Use implicit returns One case where I haven't dealt with the issues is the now [more-sensitive "large enum variant" lint](https://github.com/rust-lang/rust-clippy/pull/13833). I chose not to take any decisions around it here, and simply marked them as allow for now.	2025-04-09 15:04:42 +00:00
Vlad Lazar	d11f23a341	pageserver: refactor read path for multi LSN batching support (#11463 ) ## Problem We wish to improve pageserver batching such that one batch can contain requests for pages at different LSNs. The current shape of the code doesn't lend itself to the change. ## Summary of changes Refactor the read path such that the fringe gets initialized upfront. This is where the multi LSN change will plug in. A couple other small changes fell out of this. There should be NO behaviour change here. If you smell one, shout! I recommend reviewing commits individually (intentionally made them as small as possible). Related: https://github.com/neondatabase/neon/issues/10765	2025-04-09 13:17:02 +00:00
Dmitrii Kovalkov	e7502a3d63	pageserver: return 412 PreconditionFailed in get_timestamp_of_lsn if timestamp is not found (#11491 ) ## Problem Now `get_timestamp_of_lsn` returns `404 NotFound` if there are no clog pages for the given LSN, and it's difficult to distinguish from other 404 errors. A separate status code for this error will allow the control plane to handle this case. - Closes: https://github.com/neondatabase/neon/issues/11439 - Corresponding PR in control plane: https://github.com/neondatabase/cloud/pull/27125 ## Summary of changes - Return `412 PreconditionFailed` instead of `404 NotFound` if no timestamp is found for the given LSN. I looked briefly through the current error handling code in cloud.git and the status code change should not affect anything for the existing code. Change from the corresponding PR also looks fine and should work with the current PS status code. Additionally, here is the OK to merge it from the control plane team: https://github.com/neondatabase/neon/issues/11439#issuecomment-2789327552 --------- Co-authored-by: John Spray <john@neon.tech>	2025-04-09 13:16:15 +00:00
Arpad Müller	d2825e72ad	Add is_stopping check around critical macro in walreceiver (#11496 ) The timeline stopping state is set much earlier than the cancellation token is fired, so by checking for the stopping state, we can prevent races with timeline shutdown where we issue a cancellation error but the cancellation token hasn't been fired yet. Fix #11427.	2025-04-09 12:17:45 +00:00
Erik Grinaker	7679b63a2c	pageserver: persist stripe size in tenant manifest for tenant_import (#11181 ) ## Problem `tenant_import`, used to import an existing tenant from remote storage into a storage controller for support and debugging, assumed `DEFAULT_STRIPE_SIZE` since this can't be recovered from remote storage. In #11168, we are changing the stripe size, which will break `tenant_import`. Resolves #11175. ## Summary of changes * Add `stripe_size` to the tenant manifest. * Add `TenantScanRemoteStorageShard::stripe_size` and return from `tenant_scan_remote` if present. * Recover the stripe size during `tenant_import`, or fall back to 32768 (the original default stripe size). * Add tenant manifest compatibility snapshot: `2025-04-08-pgv17-tenant-manifest-v1.tar.zst` There are no cross-version concerns here, since unknown fields are ignored during deserialization where relevant.	2025-04-08 20:43:27 +00:00
Alex Chi Z.	a09c933de3	test(pageserver): add conditional append test record (#11476 ) ## Problem For future gc-compaction tests when we support https://github.com/neondatabase/neon/issues/10395 ## Summary of changes Add a new type of neon test WAL record that is conditionally applied (i.e., only when image == the specified value). We can use this to mock the situation where we lose some records in the middle, firing an error, and see how gc-compaction reacts to it. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-08 16:08:44 +00:00
Mikhail Kot	6138d61592	Object storage proxy (#11357 ) Service targeted for storing and retrieving LFC prewarm data. Can be used for proxying S3 access for Postgres extensions like pg_mooncake as well. Requests must include a Bearer JWT token. Token is validated using a pemfile (should be passed in infra/). Note: app is not tolerant to extra trailing slashes, see app.rs `delete_prefix` test for comments. Resolves: https://github.com/neondatabase/cloud/issues/26342 Unrelated changes: gate a `rename_noreplace` feature and disable it in `remote_storage` so that `object_storage` can be built with musl	2025-04-08 14:54:53 +00:00
Alex Chi Z.	0875dacce0	fix(pageserver): more aggressively yield in gc-compaction, degrade errors to warnings (#11469 ) ## Problem Fix various small issues discovered during gc-compaction rollout. ## Summary of changes - Log level changes: if errors are from gc-compaction, fire a warning instead of errors or critical errors. - Yield to L0 compaction more aggressively. Instead of checking every 1k keys, we check on every key. Sometimes a single key reconstruct takes a long time. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-07 21:19:06 +00:00
Erik Grinaker	99d8788756	pageserver: improve tenant manifest lifecycle (#11328 ) ## Problem Currently, the tenant manifest is only uploaded if there are offloaded timelines. The checks are also a bit loose (e.g. only checks number of offloaded timelines). We want to start using the manifest for other things too (e.g. stripe size). Resolves #11271. ## Summary of changes This patch ensures that a tenant manifest always exists. The lifecycle is: * During preload, fetch the existing manifest, if any. * During attach, upload a tenant manifest if it differs from the preloaded one (or does not exist). * Upload a new manifest as needed, if it differs from the last-known manifest (ignoring version number). * On splits, pre-populate the manifest from the parent. * During Pageserver physical GC, remove old manifests but keep the latest 2 generations. This will cause nearly all existing tenants to upload a new tenant manifest on their first attach after this change. Attaches are concurrency-limited in the storage controller, so we expect this will be fine. Also updates `make_broken` to automatically log at `INFO` level when the tenant has been cancelled, to avoid spurious error logs during shutdown.	2025-04-07 19:10:36 +00:00
Erik Grinaker	26c5c7e942	pageserver: set `Stopping` state on attach cancellation (#11462 ) ## Problem If a tenant is cancelled (e.g. due to Pageserver shutdown) during attach, it is set to `Broken`. This results both in error log spam and 500 responses for clients -- shutdown is supposed to return 503 responses which can be retried. This becomes more likely to happen with #11328, where we perform tenant manifest downloads/uploads during attach. ## Summary of changes Set tenant state to `Stopping` when attach fails and the tenant is cancelled, downgrading the log messages to INFO. This introduces two variants of `Stopping` -- with and without a caller barrier -- where the latter is used to signal attach cancellation.	2025-04-07 17:56:56 +00:00
Alex Chi Z.	d37e90f430	fix(pageserver): allow shard ancestor compaction to be cancelled (#11452 ) ## Problem https://github.com/neondatabase/neon/issues/11330 https://github.com/neondatabase/neon/issues/11358 ## Summary of changes Looking at the staging log, a few tenants right after shard split are stuck on shutdown because they are running shard ancestor compaction. The compaction does not respect the cancellation token. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-07 16:01:21 +00:00

2899 Commits