rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-23 08:00:37 +00:00

Author	SHA1	Message	Date
Vlad Lazar	c54cd9e76a	storcon: signal LSN wait to pageserver during live migration (#10452 ) ## Problem We've seen the ingest connection manager get stuck shortly after a migration. ## Summary of changes A speculative mitigation is to use the same mechanism as get page requests for kicking LSN ingest. The connection manager monitors LSN waits and queries the broker if no updates are received for the timeline. Closes https://github.com/neondatabase/neon/issues/10351	2025-01-28 17:33:07 +00:00
Erik Grinaker	1010b8add4	pageserver: add `l0_flush_wait_upload` setting (#10534 ) ## Problem We need a setting to disable the flush upload wait, to test L0 flush backpressure in staging. ## Summary of changes Add `l0_flush_wait_upload` setting.	2025-01-28 17:21:05 +00:00
Folke Behrens	ae4b2af299	fix(proxy): Use correct identifier for usage metrics upload (#10538 ) ## Problem The request data and usage metrics S3 requests use the same identifier shown in logs, causing confusion about what type of upload failed. ## Summary of changes Use the correct identifier for usage metrics uploads. neondatabase/cloud#23084	2025-01-28 17:08:17 +00:00
Tristan Partin	15fecb8474	Update axum to 0.8.1 (#10332 ) Only a few things that needed updating: - async_trait was removed - Message::Text takes a Utf8Bytes object instead of a String Signed-off-by: Tristan Partin <tristan@neon.tech> Co-authored-by: Conrad Ludgate <connor@neon.tech>	2025-01-28 15:32:59 +00:00
Erik Grinaker	47677ba578	pageserver: disable L0 backpressure by default (#10535 ) ## Problem We'll need further improvements to compaction before enabling L0 flush backpressure by default. See: https://neondb.slack.com/archives/C033RQ5SPDH/p1738066068960519?thread_ts=1737818888.474179&cid=C033RQ5SPDH. Touches #5415. ## Summary of changes Disable `l0_flush_delay_threshold` by default.	2025-01-28 14:51:30 +00:00
Arpad Müller	83b6bfa229	Re-download layer if its local and on-disk metadata diverge (#10529 ) In #10308, we noticed many warnings about the local layer having different sizes on-disk compared to the metadata. However, the layer downloader would never redownload layer files if the sizes or generation numbers change. This is obviously a bug, which we aim to fix with this PR. This change also moves the code deciding what to do about a layer to a dedicated function: before we handled the "routing" via control flow, but now it's become too complicated and it is nicer to have the different verdicts for a layer spelled out in a list/match.	2025-01-28 13:39:53 +00:00
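The "verdicts spelled out in a match" idea from #10529 could look roughly like the sketch below. It is an illustration only, not the pageserver's code: `LayerMeta`, `LayerVerdict`, and `decide` are hypothetical names, and comparing only file size and generation number is an assumption drawn from the description above.

```rust
/// Hypothetical summary of what we know about a layer file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LayerMeta {
    pub file_size: u64,
    pub generation: u32,
}

/// The possible decisions for one layer, listed explicitly.
#[derive(Debug, PartialEq, Eq)]
pub enum LayerVerdict {
    /// The local copy matches the expected metadata: nothing to do.
    Keep,
    /// There is no local copy yet: download it.
    Download,
    /// A local copy exists but its size or generation diverges: fetch it again.
    Redownload,
}

/// Decide what to do about a layer, given the local state (if any)
/// and the metadata we expect it to have.
pub fn decide(local: Option<LayerMeta>, expected: LayerMeta) -> LayerVerdict {
    match local {
        None => LayerVerdict::Download,
        Some(l) if l == expected => LayerVerdict::Keep,
        Some(_) => LayerVerdict::Redownload,
    }
}
```

Listing the verdicts in one enum lets the compiler check that every case is handled, which is harder to guarantee when the routing is spread across control flow.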
Erik Grinaker	ed942b05f7	Revert "pageserver: revert flush backpressure" (#10402 )" (#10533 ) This reverts commit `9e55d79803`. We'll still need this until we can tune L0 flush backpressure and compaction. I'll add a setting to disable this separately.	2025-01-28 13:33:58 +00:00
Vlad Lazar	62a717a2ca	pageserver: use PS node id for SK appname (#10522 ) ## Problem This one is fairly embarrassing. Safekeeper node id was used in the pageserver application name when connecting to safekeepers. ## Summary of changes Use the right node id. Closes https://github.com/neondatabase/neon/issues/10461	2025-01-28 13:11:51 +00:00
Peter Bendel	c8fbbb9b65	Test ingest_benchmark with different stripe size and also PostgreSQL version 17 (#10510 ) We want to verify if pageserver stripe size has an impact on ingest performance. We want to verify if ingest performance has improved or regressed with postgres version 17. ## Summary of changes - Allow to create new project with different postgres versions - allow to pre-shard new project with different stripe sizes instead of relying on storage manager to shard_split the project once a threshold is exceeded Replaces https://github.com/neondatabase/neon/pull/10509 Test run https://github.com/neondatabase/neon/actions/runs/12986410381	2025-01-27 21:06:05 +00:00
John Spray	d73f4a6470	pageserver: retry wrapper on manifest upload (#10524 ) ## Problem On remote storage errors (e.g. I/O timeout) uploading tenant manifest, all of compaction could fail. This is a problem IRL because we shouldn't abort compaction on a single IO error, and in tests because it generates spurious failures. Related: https://github.com/orgs/neondatabase/projects/51/views/2?sliceBy%5Bvalue%5D=jcsp&pane=issue&itemId=93692919&issue=neondatabase%7Cneon%7C10389 ## Summary of changes - Use `backoff::retry` when uploading tenant manifest	2025-01-27 21:02:25 +00:00
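As a rough illustration of wrapping an upload in a retry loop, here is a minimal synchronous sketch using only the standard library. It is not the `backoff::retry` helper named above, whose signature is not shown in this log; `retry_with_backoff`, its parameters, and the doubling delay are all assumptions made for the example.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: call `op` until it succeeds or `max_attempts`
/// is reached, doubling the sleep between attempts.
pub fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
    base_delay: Duration,
) -> Result<T, E> {
    let mut failures = 0u32;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                failures += 1;
                if failures >= max_attempts {
                    // Give up and surface the last error to the caller.
                    return Err(err);
                }
                // Exponential backoff: base, 2x base, 4x base, ...
                // The exponent is capped to avoid overflow.
                let factor = 2u32.pow((failures - 1).min(16));
                sleep(base_delay * factor);
            }
        }
    }
}
```

With a wrapper of this shape, a single I/O timeout during a manifest upload is retried instead of failing the whole compaction pass.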
Heikki Linnakangas	5477d7db93	fast_import: fixes for Postgres v17 (#10414 ) Now that the tests are run on v17, they're also run in debug mode, which is slow. Increase statement_timeout in the test to work around that.	2025-01-27 19:47:49 +00:00
Arpad Müller	eb9832d846	Remove PQ_LIB_DIR env var (#10526 ) We now don't need libpq any more for the build of the storage controller, as we use `diesel-async` since #10280. Therefore, we remove the env var that gave cargo/rustc the location for libpq. Follow-up of #10280	2025-01-27 19:38:18 +00:00
Christian Schwarz	3d36dfe533	fix: noisy `broker subscription failed` error during storage broker deploys (#10521 ) During broker deploys, pageservers log this noisy WARN en masse. I can trivially reproduce the WARN message in neon_local by SIGKILLing broker during e.g. `pgbench -i`. I don't understand why tonic is not detecting the error as `Code::Unavailable`. Until we find time to understand that / fix upstream, this PR adds the error message to the existing list of known error messages that get demoted to INFO level. Refs: - refs https://github.com/neondatabase/neon/issues/9562	2025-01-27 19:19:55 +00:00
John Spray	ebf44210ba	remote_storage: less sensitive timeout logging in ABS listings (#10518 ) ## Problem We were logging a warning after a single request timeout, while listing objects. Closes: https://github.com/neondatabase/neon/issues/10166 ## Summary of changes - These timeouts are a pretty normal part of life, so back it off to only log a warning after two in a row.	2025-01-27 17:44:18 +00:00
John Spray	aabf455dfb	README: clarify that neon_local is a dev/test tool (#10512 ) ## Problem From time to time, folks discover our `control_plane/` folder and make the (reasonable) mistake of thinking it's a tool for running full-sized Neon systems, whereas in reality it is a tool for dev/test. ## Summary of changes - Change control_plane's readme title to "Local Development Control Plane (`neon_local`)" - Change "Running local installation" to "Running a local development environment" in the main readme	2025-01-27 17:24:42 +00:00
John Spray	aec92bfc34	pageserver: decrease utilization MAX_SHARDS (#10489 ) ## Problem The intent of this parameter is to have pageservers consider themselves "full" if they've got lots of shards, even if they have plenty of capacity. It works, but because we typically successfully oversubscribe capacity up to 200%, the MAX_SHARDS limit is effectively doubled, so this 20,000 value ends up meaning 40,000, whereas the original intent was to limit nodes to ~10000 shards. ## Summary of changes - Change MAX_SHARDS to 5000, so that a node with 5000 will get a 100% utilization, which is equivalent in practice to being considered "half full" by the storage controller in capacity terms. This is all a bit subtle and indirect. Originally the limit was baked into the pageserver with the idea that the pageserver knows better what its own resources tolerate than the storage controller does, but in practice it would probably be easier to understand all this if we just did it controller-side. So there's scope to refactor here in future.	2025-01-27 17:03:32 +00:00
Arpad Müller	b0b4b7dd8f	storcon: switch to diesel-async and tokio-postgres (#10280 ) Switches the storcon away from using diesel's synchronous APIs in favour of `diesel-async`. Advantages: * less C dependencies, especially no openssl, which might be behind the bug: https://github.com/neondatabase/cloud/issues/21010 * Better to only have async than mix of async plus `spawn_blocking` We had to turn off usage of the connection pool for migrations, as diesel migrations don't support async APIs. Thus we still use `spawn_blocking` in that one place. But this is explicitly done in one of the `diesel-async` examples.	2025-01-27 14:25:11 +00:00
Mikhail Kot	4dd4096f11	Pgbouncer exporter in compute image (#10503 ) https://github.com/neondatabase/cloud/issues/19081 Include pgbouncer_exporter in compute image and run it at port 9127	2025-01-27 14:09:21 +00:00
Erik Grinaker	be718ed121	pageserver: disable L0 flush stalls, tune delay threshold (#10507 ) ## Problem In ingest benchmarks, we see L0 compaction delays of over 10 minutes due to image compaction. We can't stall L0 flushes for that long. ## Summary of changes Disable L0 flush stalls, and bump the default L0 flush delay threshold from 20 to 30 L0 layers.	2025-01-25 16:51:54 +00:00
Konstantin Knizhnik	9f1408fdf3	Do not assign max(lsn) to maxLastWrittenLsn in SetLastWrittenLSNForblokv (#10474 ) ## Problem See https://github.com/neondatabase/neon/issues/10281 `SetLastWrittenLSNForBlockv` is assigning max(lsn) to `maxLastWrittenLsn` while it should contain only the max LSN not present in the LwLSN cache. This causes unnecessary waits in PS. ## Summary of changes Restore status-quo for pg17. Related Postgres PR: https://github.com/neondatabase/postgres/pull/563 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-01-24 14:57:32 +00:00
Conrad Ludgate	7000aaaf75	chore: fix h2 stubgen (#10491 ) ## Problem ## Summary of changes --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-01-24 14:55:48 +00:00
Erik Grinaker	ef2a2555b1	pageserver: tighten compaction failure detection (#10502 ) ## Problem If compaction fails, we disable L0 flush stalls to avoid persistent stalls. However, the logic would unset the failure marker on offload failures or shutdown. This can lead to sudden L0 flush stalls if we try and fail to offload a timeline with compaction failures, or if there is some kind of shutdown race. Touches #10405. ## Summary of changes Don't touch the compaction failure marker on offload failures or shutdown.	2025-01-24 13:55:05 +00:00
Konstantin Knizhnik	d8ab6ddb0f	Check if relation has storage in calculate_relation_size (#10477 ) ## Problem The parent of a partitioned table has no storage; its relfilelocator is zero. It can be incorrectly hashed and produce wrong results. See https://github.com/neondatabase/postgres/pull/518 ## Summary of changes This problem is already addressed in pg17. Add the same check for all other PG versions. Postgres PRs: https://github.com/neondatabase/postgres/pull/566 https://github.com/neondatabase/postgres/pull/565 https://github.com/neondatabase/postgres/pull/564 Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-01-24 12:43:52 +00:00
JC Grünhage	dcc437da1d	Make promote-images-prod depend on promote-images-dev (#10494 ) ## Problem After talking about it again with @bayandin, this should replace the changes from https://github.com/neondatabase/neon/pull/10475. While the previous changes worked, they are less visually clear in what happens, and we might end up in a situation where we update `latest`, but don't actually have the tagged image pushed that contains the same changes. The latter would result in potentially hard to debug situations. ## Summary of changes Revert `c283aaaf8d` and make promote-images-prod depend on promote-images-dev instead.	2025-01-24 11:03:39 +00:00
a-masterov	c286fea018	Print logs in extensions test in another step to improve readability (#10483 ) ## Problem The containers' log output is mixed with the tests' output, so you must scroll up to find the error. ## Summary of changes Printing of containers' logs moved to a separate step.	2025-01-24 10:44:48 +00:00
Vlad Lazar	de8276488d	tests: enable wal reader fanout in tests (#10301 ) Note: this has to merge after the release is cut on `2025-01-17` for compat tests to start passing. ## Problem SK wal reader fan-out is not enabled in tests by default. ## Summary of changes Enable it.	2025-01-24 10:34:57 +00:00
Erik Grinaker	ddb9ae1214	pageserver: add compaction backpressure for layer flushes (#10405 ) ## Problem There is no direct backpressure for compaction and L0 read amplification. This allows a large buildup of compaction debt and read amplification. Resolves #5415. Requires #10402. ## Summary of changes Delay layer flushes based on the number of level 0 delta layers: * `l0_flush_delay_threshold`: delay flushes such that they take 2x as long (default `2 * compaction_threshold`). * `l0_flush_stall_threshold`: stall flushes until level 0 delta layers drop below threshold (default `4 * compaction_threshold`). If either threshold is reached, ephemeral layer rolls also synchronously wait for layer flushes to propagate this backpressure up into WAL ingestion. This will bound the number of frozen layers to 1 once backpressure kicks in, since all other frozen layers must flush before the rolled layer. ## Analysis This will significantly change the compute backpressure characteristics. Recall the three compute backpressure knobs: * `max_replication_write_lag`: 500 MB (based on Pageserver `last_received_lsn`). * `max_replication_flush_lag`: 10 GB (based on Pageserver `disk_consistent_lsn`). * `max_replication_apply_lag`: disabled (based on Pageserver `remote_consistent_lsn`). Previously, the Pageserver would keep ingesting WAL and build up ephemeral layers and L0 layers until the compute hit `max_replication_flush_lag` at 10 GB and began backpressuring. Now, once we delay/stall WAL ingestion, the compute will begin backpressuring after `max_replication_write_lag`, i.e. 500 MB. This is probably a good thing (we're not building up a ton of compaction debt), but we should consider tuning these settings. `max_replication_flush_lag` probably doesn't serve a purpose anymore, and we should consider removing it. Furthermore, the removal of the upload barrier in #10402 will mean that we no longer backpressure flushes based on S3 uploads, since `max_replication_apply_lag` is disabled. 
We should consider enabling this as well. ### When and what do we compact? Default compaction settings: * `compaction_threshold`: 10 L0 delta layers. * `compaction_period`: 20 seconds (between each compaction loop check). * `checkpoint_distance`: 256 MB (size of L0 delta layers). * `l0_flush_delay_threshold`: 20 L0 delta layers. * `l0_flush_stall_threshold`: 40 L0 delta layers. Compaction characteristics: * Minimum compaction volume: 10 layers * 256 MB = 2.5 GB. * Additional compaction volume (assuming 128 MB/s WAL): 128 MB/s * 20 seconds = 2.5 GB (10 L0 layers). * Required compaction bandwidth: 5.0 GB / 20 seconds = 256 MB/s. ### When do we hit `max_replication_write_lag`? Depending on how fast compaction and flushes happen, the compute will backpressure somewhere between `l0_flush_delay_threshold` and `l0_flush_stall_threshold`, plus `max_replication_write_lag`. * Minimum compute backpressure lag: 20 layers * 256 MB + 500 MB = 5.6 GB * Maximum compute backpressure lag: 40 layers * 256 MB + 500 MB = 10.7 GB This seems like a reasonable range to me.	2025-01-24 09:47:28 +00:00
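The threshold logic and the lag arithmetic described in #10405 can be sketched as below. This is a simplified model under stated assumptions, not the pageserver implementation: `L0Thresholds`, `FlushBackpressure`, `classify`, and `backpressure_lag_mb` are hypothetical names, a threshold is treated as reached when the L0 count is greater than or equal to it, and the defaults follow the `2 *` and `4 * compaction_threshold` rule quoted in the message.

```rust
/// What a layer flush should do given the current L0 layer count.
#[derive(Debug, PartialEq, Eq)]
pub enum FlushBackpressure {
    /// Below both thresholds: flush at full speed.
    Proceed,
    /// Delay threshold reached: slow the flush down.
    Delay,
    /// Stall threshold reached: wait until L0 count drops.
    Stall,
}

/// Hypothetical container for the two settings; `None` disables one.
pub struct L0Thresholds {
    pub delay_threshold: Option<usize>,
    pub stall_threshold: Option<usize>,
}

impl L0Thresholds {
    /// Defaults as described in the PR: 2x and 4x `compaction_threshold`.
    pub fn with_defaults(compaction_threshold: usize) -> Self {
        Self {
            delay_threshold: Some(2 * compaction_threshold),
            stall_threshold: Some(4 * compaction_threshold),
        }
    }

    /// Check the stricter (stall) threshold first.
    pub fn classify(&self, l0_count: usize) -> FlushBackpressure {
        if self.stall_threshold.map_or(false, |t| l0_count >= t) {
            FlushBackpressure::Stall
        } else if self.delay_threshold.map_or(false, |t| l0_count >= t) {
            FlushBackpressure::Delay
        } else {
            FlushBackpressure::Proceed
        }
    }
}

/// Compute-side lag (in MB) at which backpressure begins:
/// buffered L0 layers plus `max_replication_write_lag`.
pub fn backpressure_lag_mb(l0_layers: u64, checkpoint_distance_mb: u64, max_write_lag_mb: u64) -> u64 {
    l0_layers * checkpoint_distance_mb + max_write_lag_mb
}
```

With `compaction_threshold = 10`, `checkpoint_distance = 256` MB, and `max_replication_write_lag = 500` MB, `backpressure_lag_mb(20, 256, 500)` and `backpressure_lag_mb(40, 256, 500)` give the lower and upper bounds of the range discussed above, in MB.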
Erik Grinaker	9e55d79803	Reapply "pageserver: revert flush backpressure" (#10270 ) (#10402 ) This reapplies #10135. Just removing this flush backpressure without further mitigations caused read amp increases during bulk ingestion (predictably), so it was reverted. We will replace it by compaction-based backpressure. ## Problem In #8550, we made the flush loop wait for uploads after every layer. This was to avoid unbounded buildup of uploads, and to reduce compaction debt. However, the approach has several problems: * It prevents upload parallelism. * It prevents flush and upload pipelining. * It slows down ingestion even when there is no need to backpressure. * It does not directly backpressure based on compaction debt and read amplification. We will instead implement compaction-based backpressure in a PR immediately following this removal (#5415). Touches #5415. Touches #10095. ## Summary of changes Remove waiting on the upload queue in the flush loop.	2025-01-24 08:35:35 +00:00
Alex Chi Z.	8d47a60de2	fix(pageserver): handle dup layers during gc-compaction (#10430 ) ## Problem If gc-compaction decides to rewrite an image layer, it will now cause index_part to lose reference to that layer. In detail: * Assume there's only one image layer of key 0000...AAAA at LSN 0x100 and generation 0xA in the system. * gc-compaction kicks in at gc-horizon 0x100, and then produces 0000...AAAA at LSN 0x100 and generation 0xB. * It submits a compaction result update into the index part that unlinks 0000-AAAA-100-A and adds 0000-AAAA-100-B On the remote storage / local disk side, this is fine -- it unlinks things correctly and uploads the new file. However, the `index_part.json` itself doesn't record generations. The buggy procedure is as follows: 1. upload the new file 2. update the index part to remove the old file and add the new file 3. remove the new file Therefore, the correct update result process for gc-compaction should be as follows: * When modifying the layer map, delete the old one and upload the new one. * When updating the index, add the new one to the index without deleting the old one. ## Summary of changes * Modify `finish_gc_compaction` to correctly order insertions and deletions. * Update the way gc-compaction uploads the layer files. * Add new tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-23 21:54:44 +00:00
Alexey Kondratov	6166482589	feat(compute): Automatically create release PRs (#10495 ) We've finally transitioned to using a separate `release-compute` branch. Now, we can finally automatically create release PRs on Fri and release them during the following week. Part of neondatabase/cloud#11698	2025-01-23 20:47:20 +00:00
Arpad Müller	ca6d72ba2a	Increase reconciler timeout after shard split (#10490 ) Sometimes, especially when the host running the tests is overloaded, we can run into reconcile timeouts in `test_timeline_ancestor_detach_idempotent_success`, making the test flaky. By increasing the timeouts from 30 seconds to 120 seconds, we can address the flakiness. Fixes #10464	2025-01-23 16:43:04 +00:00
a-masterov	b6c0f66619	CI(autocomment): add the lfc state (#10121 ) ## Problem Currently, the report does not contain the LFC state of the failed tests. ## Summary of changes Added the LFC state to the link to the allure report. --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-01-23 14:52:07 +00:00
Mikhail Kot	3702ec889f	Enable postgres_fdw (#10426 ) Update compute image to include postgres_fdw #3720	2025-01-23 13:22:31 +00:00
Anastasia Lubennikova	8e8df1b453	Disable logical replication subscribers (#10249 ) Drop logical replication subscribers before compute starts on a non-main branch. Add new compute_ctl spec flag: drop_subscriptions_before_start If it is set, drop all the subscriptions from the compute node before it starts. To avoid race on compute start, use new GUC neon.disable_logical_replication_subscribers to temporarily disable logical replication workers until we drop the subscriptions. Ensure that we drop subscriptions exactly once when endpoint starts on a new branch. It is essential, because otherwise, we may drop not only inherited, but newly created subscriptions. We cannot rely only on spec.drop_subscriptions_before_start flag, because if for some reason compute restarts inside VM, it will start again with the same spec and flag value. To handle this, we save the fact of the operation in the database in the neon.drop_subscriptions_done table. If the table does not exist, we assume that the operation was never performed, so we must do it. If table exists, we check if the operation was performed on the current timeline. fixes: https://github.com/neondatabase/neon/issues/8790	2025-01-23 11:02:15 +00:00
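The "exactly once per branch" decision described in #10249 reduces to a small predicate. The sketch below is an assumption-laden illustration, not the `compute_ctl` code: `should_drop_subscriptions` and its parameters are hypothetical, and `recorded_timeline` stands for whatever is stored in `neon.drop_subscriptions_done` (`None` when the table does not exist or holds no record).

```rust
/// Decide whether subscriptions must be dropped on this start.
///
/// * `flag_set`: the spec's `drop_subscriptions_before_start` flag.
/// * `recorded_timeline`: timeline for which the drop was already
///   recorded, or `None` if there is no record at all.
/// * `current_timeline`: timeline the compute is starting on.
pub fn should_drop_subscriptions(
    flag_set: bool,
    recorded_timeline: Option<&str>,
    current_timeline: &str,
) -> bool {
    // Without the flag we never drop. With the flag, drop unless the
    // operation was already recorded for this exact timeline, which
    // covers a restart inside the VM with the same spec.
    flag_set && recorded_timeline != Some(current_timeline)
}
```

Keying the record on the timeline is what protects subscriptions created after the first start: a restart with the same spec sees a matching record and skips the drop.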
Alex Chi Z.	92d95b08cf	fix(pageserver): extend split job key range to the end (#10484 ) ## Problem Not really a bug fix, but hopefully can reproduce https://github.com/neondatabase/neon/issues/10482 more. If the layer map does not contain layers that end at exactly the end range of the compaction job, the current split algorithm will produce the last job that ends at the maximum layer key. This patch extends it all the way to the compaction job end key. For example, the user requests a compaction of 0000...FFFF. However, we only have a layer 0000..3000 in the layer map, and the split job will have a range of 0000..3000 instead of 0000..FFFF. This is not a correctness issue but it would be better to fix it so that we can get consistent job splits. ## Summary of changes Compaction job split will always cover the full specified key range. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-23 00:15:46 +00:00
Arpad Müller	0af40b5494	Only churn rows once in test_scrubber_physical_gc_ancestors (#10481 ) ## Problem PR #10457 was supposed to fix the flakiness of `test_scrubber_physical_gc_ancestors`, but instead it made it even more flaky. However, the original error causes disappeared, now to be replaced by key not found errors. See this for a longer explanation: https://github.com/neondatabase/neon/issues/10391#issuecomment-2608018967 ## Solution This does one churn rows after all compactions, and before we do any timeline gc's. That way, we remain more accessible at older lsn's.	2025-01-22 19:45:12 +00:00
Arpad Müller	c60b91369a	Expose safekeeper APIs for creation and deletion (#10478 ) Add APIs for timeline creation and deletion to the safekeeper client crate. Going to be used later in #10440. Split off from #10440. Part of https://github.com/neondatabase/neon/issues/9011	2025-01-22 18:52:16 +00:00
a-masterov	f1473dd438	Fix the connection error for extension tests (#10480 ) ## Problem The trust connection to the compute required for `pg_anon` was removed. However, the PGPASSWORD environment variable was not added to `docker-compose.yml`. This caused connection errors, which were interpreted as success due to errors in the bash script. ## Summary of changes The environment variable was added, and the logic in the bash script was fixed.	2025-01-22 16:34:57 +00:00
JC Grünhage	c283aaaf8d	Tag images from docker-hub in promote-images-prod (#10475 ) ## Problem https://github.com/neondatabase/neon/actions/runs/12896686483/job/35961290336#step:5:107 showed that `promote-images-prod` was missing another dependency. ## Summary of changes Modify `promote-images-prod` to tag based on docker-hub images, so that `promote-images-prod` does not rely on `promote-images-dev`. The result should be the exact same, but allows the two jobs to run in parallel.	2025-01-22 16:09:41 +00:00
Vlad Lazar	414ed82c1f	pageserver: issue concurrent IO on the read path (#9353 ) ## Refs - Epic: https://github.com/neondatabase/neon/issues/9378 Co-authored-by: Vlad Lazar <vlad@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech> ## Problem The read path does its IOs sequentially. This means that if N values need to be read to reconstruct a page, we will do N IOs and getpage latency is `O(N*IoLatency)`. ## Solution With this PR we gain the ability to issue IO concurrently within one layer visit *and* to move on to the next layer without waiting for IOs from the previous visit to complete. This is an evolved version of the work done at the Lisbon hackathon, cf https://github.com/neondatabase/neon/pull/9002. ## Design ### `will_init` now sourced from disk btree index keys On the algorithmic level, the only change is that the `get_values_reconstruct_data` now sources `will_init` from the disk btree index key (which is PS-page_cache'd), instead of from the `Value`, which is only available after the IO completes. ### Concurrent IOs, Submission & Completion To separate IO submission from waiting for its completion, while simultaneously feature-gating the change, we introduce the notion of an `IoConcurrency` struct through which IO futures are "spawned". An IO is an opaque future, and waiting for completions is handled through `tokio::sync::oneshot` channels. The oneshot Receiver's take the place of the `img` and `records` fields inside `VectoredValueReconstructState`. When we're done visiting all the layers and submitting all the IOs along the way we concurrently `collect_pending_ios` for each value, which means for each value there is a future that awaits all the oneshot receivers and then calls into walredo to reconstruct the page image. Walredo is now invoked concurrently for each value instead of sequentially. Walredo itself remains unchanged. 
The spawned IO futures are driven to completion by a sidecar tokio task that is separate from the task that performs all the layer visiting and spawning of IOs. That task receives the IO futures via an unbounded mpsc channel and drives them to completion inside a `FuturesUnordered`. (The behavior from before this PR is available through `IoConcurrency::Sequential`, which awaits the IO futures in place, without "spawning" or "submitting" them anywhere.) #### Alternatives Explored A few words on the rationale behind having a sidecar task and what alternatives were considered. One option is to queue up all IO futures in a FuturesUnordered that is polled the first time when we `collect_pending_ios`. Firstly, the IO futures are opaque, compiler-generated futures that need to be polled at least once to submit their IO. "At least once" because tokio-epoll-uring may not be able to submit the IO to the kernel on first poll right away. Second, there are deadlocks if we don't drive the IO futures to completion independently of the spawning task. The reason is that both the IO futures and the spawning task may hold some _and_ try to acquire _more_ shared limited resources. For example, both spawning task and IO future may try to acquire * a VirtualFile file descriptor cache slot async mutex (observed during impl) * a tokio-epoll-uring submission slot (observed during impl) * a PageCache slot (currently this is not the case but we may move more code into the IO futures in the future) Another option is to spawn a short-lived `tokio::task` for each IO future. We implemented and benchmarked it during development, but found little throughput improvement and moderate mean & tail latency degradation. Concerns about pressure on the tokio scheduler made us discard this variant. The sidecar task could be obsoleted if the IOs were not arbitrary code but a well-defined struct. However, 1. the opaque futures approach taken in this PR allows leaving the existing code unchanged, which 2. 
allows us to implement the `IoConcurrency::Sequential` mode for feature-gating the change. Once the new mode sidecar task implementation is rolled out everywhere, and `::Sequential` removed, we can think about a descriptive submission & completion interface. The problems around deadlocks pointed out earlier will need to be solved then. For example, we could eliminate VirtualFile file descriptor cache and tokio-epoll-uring slots. The latter has been drafted in https://github.com/neondatabase/tokio-epoll-uring/pull/63. See the lengthy doc comment on `spawn_io()` for more details. ### Error handling There are two error classes during reconstruct data retrieval: * traversal errors: index lookup, move to next layer, and the like * value read IO errors A traversal error fails the entire get_vectored request, as before this PR. A value read error only fails that value. In any case, we preserve the existing behavior that once `get_vectored` returns, all IOs are done. Panics and failing to poll `get_vectored` to completion will leave the IOs dangling, which is safe but shouldn't happen, and so, a rate-limited log statement will be emitted at warning level. There is a doc comment on `collect_pending_ios` giving more code-level details and rationale. ### Feature Gating The new behavior is opt-in via pageserver config. The `Sequential` mode is the default. The only significant change in `Sequential` mode compared to before this PR is the buffering of results in the `oneshot`s. ## Code-Level Changes Prep work: * Make `GateGuard` clonable. Core Feature: * Traversal code: track `will_init` in `BlobMeta` and source it from the Delta/Image/InMemory layer index, instead of determining `will_init` after we've read the value. This avoids having to read the value to determine whether traversal can stop. * Introduce `IoConcurrency` & its sidecar task. * `IoConcurrency` is the clonable handle. * It connects to the sidecar task via an `mpsc`. 
* Plumb through `IoConcurrency` from high level code to the individual layer implementations' `get_values_reconstruct_data`. We piggy-back on the `ValuesReconstructState` for this. * The sidecar task should be long-lived, so, `IoConcurrency` needs to be rooted up "high" in the call stack. * Roots as of this PR: * `page_service`: outside of pagestream loop * `create_image_layers`: when it is called * `basebackup`(only auxfiles + replorigin + SLRU segments) * Code with no roots that uses `IoConcurrency::sequential` * any `Timeline::get` call * `collect_keyspace` is a good example * follow-up: https://github.com/neondatabase/neon/issues/10460 * `TimelineAdaptor` code used by the compaction simulator, unused in practice * `ingest_xlog_dbase_create` * Transform Delta/Image/InMemoryLayer to * do their values IO in a distinct `async {}` block * extend the residence of the Delta/Image layer until the IO is done * buffer their results in a `oneshot` channel instead of straight in `ValuesReconstructState` * the `oneshot` channel is wrapped in `OnDiskValueIo` / `OnDiskValueIoWaiter` types that aid in expressiveness and are used to keep track of in-flight IOs so we can print warnings if we leave them dangling. * Change `ValuesReconstructState` to hold the receiving end of the `oneshot` channel aka `OnDiskValueIoWaiter`. * Change `get_vectored_impl` to `collect_pending_ios` and issue walredo concurrently, in a `FuturesUnordered`. Testing / Benchmarking: * Support queue-depth in pagebench for manual benchmarking. * Add test suite support for setting concurrency mode ps config field via a) an env var and b) via NeonEnvBuilder. * Hacky helper to have sidecar-based IoConcurrency in tests. This will be cleaned up later. More benchmarking will happen post-merge in nightly benchmarks, plus in staging/pre-prod. Some intermediate helpers for manual benchmarking have been preserved in https://github.com/neondatabase/neon/pull/10466 and will be landed in later PRs. 
(L0 layer stack generator!) Drive-By: * test suite actually didn't enable batching by default because `config.compatibility_neon_binpath` is always Truthy in our CI environment => https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309 * initial logical size calculation wasn't always polled to completion, which was surfaced through the added WARN logs emitted when dropping a `ValuesReconstructState` that still has inflight IOs. * remove the timing histograms `pageserver_getpage_get_reconstruct_data_seconds` and `pageserver_getpage_reconstruct_seconds` because with planning, value read IO, and walredo happening concurrently, one can no longer attribute latency to any one of them; we'll revisit this when Vlad's work on tracing/sampling through RequestContext lands. * remove code related to `get_cached_lsn()`. The logic around this has been dead at runtime for a long time, ever since the removal of the materialized page cache in #8105. ## Testing Unit tests use the sidecar task by default and run both modes in CI. Python regression tests and benchmarks also use the sidecar task by default. We'll test more in staging and possibly preprod. # Future Work Please refer to the parent epic for the full plan. The next step will be to fold the plumbing of IoConcurrency into RequestContext so that the function signatures get cleaned up. Once `Sequential` isn't used anymore, we can take the next big leap which is replacing the opaque IOs with structs that have well-defined semantics. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-01-22 15:30:23 +00:00
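The split between submitting an IO and waiting for its result, as described in #9353, can be approximated with standard-library threads and channels. This sketch deliberately swaps techniques: it uses an OS thread and `std::sync::mpsc` in place of a tokio task, `tokio::sync::oneshot`, and `FuturesUnordered`, and its sidecar runs jobs one at a time, so it shows the submission/completion shape rather than real IO concurrency. `IoConcurrencySketch`, `IoJob`, `spawn_sidecar`, and `submit` are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;

/// A unit of IO work; stands in for the opaque IO futures in the PR.
type IoJob = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical analogue of the two modes described above.
pub enum IoConcurrencySketch {
    /// Run each IO in place, at submission time.
    Sequential,
    /// Hand each IO to a long-lived sidecar worker.
    Sidecar(mpsc::Sender<IoJob>),
}

impl IoConcurrencySketch {
    /// Start the sidecar worker. It exits once every sender is dropped.
    pub fn spawn_sidecar() -> (Self, thread::JoinHandle<()>) {
        let (tx, rx) = mpsc::channel::<IoJob>();
        let handle = thread::spawn(move || {
            for job in rx {
                job();
            }
        });
        (Self::Sidecar(tx), handle)
    }

    /// Submit a read and return a receiver for its result. The caller
    /// keeps traversing and only blocks when it calls `recv()`.
    pub fn submit<T: Send + 'static>(
        &self,
        read: impl FnOnce() -> T + Send + 'static,
    ) -> mpsc::Receiver<T> {
        // Used once, like a oneshot channel: exactly one value is sent.
        let (done_tx, done_rx) = mpsc::channel();
        let job = move || {
            // The waiter may have gone away; ignore the send error.
            let _ = done_tx.send(read());
        };
        match self {
            Self::Sequential => job(),
            Self::Sidecar(tx) => {
                tx.send(Box::new(job)).expect("sidecar thread exited");
            }
        }
        done_rx
    }
}
```

In both modes the result arrives through the receiver, which mirrors the note above that even `Sequential` mode buffers results in the `oneshot`s.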
Alexey Kondratov	881e351f69	feat(compute): Allow installing both 0.8.0 and 0.7.4 pgvector (#10345 ) ## Problem Both these versions are binary compatible, but the way pgvector structures the SQL files forbids installing 0.7.4 if you have a 0.8.0 distribution. Yet, some users may need a previous version for backward compatibility, e.g., restoring the dump. See this thread for discussion https://neondb.slack.com/archives/C04DGM6SMTM/p1735911490242919?thread_ts=1731343604.259169&cid=C04DGM6SMTM ## Summary of changes Put `vector--0.7.4.sql` file into compute image to allow installing this version as well. Tested on staging and it seems to be working as expected: ```sql select * from pg_available_extensions where name = 'vector'; name \| default_version \| installed_version \| comment --------+-----------------+-------------------+------------------------------------------------------ vector \| 0.8.0 \| (null) \| vector data type and ivfflat and hnsw access methods create extension vector version '0.7.4'; select * from pg_available_extensions where name = 'vector'; name \| default_version \| installed_version \| comment --------+-----------------+-------------------+------------------------------------------------------ vector \| 0.8.0 \| 0.7.4 \| vector data type and ivfflat and hnsw access methods alter extension vector update; select * from pg_available_extensions where name = 'vector'; name \| default_version \| installed_version \| comment --------+-----------------+-------------------+------------------------------------------------------ vector \| 0.8.0 \| 0.8.0 \| vector data type and ivfflat and hnsw access methods drop extension vector; create extension vector; select * from pg_available_extensions where name = 'vector'; name \| default_version \| installed_version \| comment --------+-----------------+-------------------+------------------------------------------------------ vector \| 0.8.0 \| 0.8.0 \| vector data type and ivfflat and hnsw access methods ``` 
If we find out it's a good approach, we can adopt the same for other extensions with a stable ABI -- support both `current` and `current - 1` releases.	2025-01-22 12:38:23 +00:00
Christian Schwarz	b31ce14083	initial logical size calculation: always poll to completion (#10471 ) # Refs - extracted from https://github.com/neondatabase/neon/pull/9353 # Problem Before this PR, when task_mgr shutdown is signalled, e.g. during pageserver shutdown or Tenant shutdown, initial logical size calculation stops polling and drops the future that represents the calculation. This is against the current policy that we poll all futures to completion. This became apparent during development of concurrent IO which warns if we drop a `Timeline::get_vectored` future that still has in-flight IOs. We may revise the policy in the future, but, right now initial logical size calculation is the only part of the codebase that doesn't adhere to the policy, so let's fix it. ## Code Changes - make the calculation sensitive exclusively to `Timeline::cancel` - This should be sufficient for all cases of shutdowns; the sensitivity to task_mgr shutdown is unnecessary. - this broke the various cancel tests in `test_timeline_size.py`, e.g., `test_timeline_initial_logical_size_calculation_cancellation` - the tests would time out because the await point was not sensitive to cancellation - to fix this, refactor `pausable_failpoint` so that it accepts a cancellation token - side note: we _really_ should write our own failpoint library; maybe after we get heap-allocated RequestContext, we can plumb failpoints through there.	2025-01-22 12:28:26 +00:00
Christian Schwarz	b4d87b9dfe	fix(tests): actually enable pipelining by default in the test suite (#10472 ) ## Problem PR #9993 was supposed to enable `page_service_pipelining` by default for all `NeonEnv`s, but this was ineffective in our CI environment. Thus, CI Python-based tests and benchmarks, unless explicitly configuring pipelining, were still using serial protocol handling. ## Analysis The root cause was that in our CI environment, `config.compatibility_neon_binpath` is always Truthy. It's not in local environments, which is why this slipped through in local testing. Lesson: always add a log line to pageserver startup and spot-check tests to ensure the intended default is picked up. ## Summary of changes Fix it. Since enough time has passed, the compatibility snapshot contains a recent enough software version so we don't need to worry about `compatibility_neon_binpath` anymore. ## Future Work The question of how to add a new default except for compatibility tests, which is what the broken code was supposed to do, is still unsolved. Slack discussion: https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309	2025-01-22 10:10:43 +00:00
Conrad Ludgate	2b49d6ee05	feat: adjust the tonic features to remove axum dependency (#10348 ) To help facilitate an upgrade to axum 0.8 (https://github.com/neondatabase/neon/pull/10332#pullrequestreview-2541989619) this massages the tonic dependency features so that tonic does not depend on axum.	2025-01-22 09:15:52 +00:00
Erik Grinaker	14e1f89053	pageserver: eagerly notify flush waiters (#10469 ) ## Problem Currently, the layer flush loop will continue flushing layers as long as any are pending, and only notify waiters once there are no further layers to flush. This can cause waiters to wait longer than necessary, and potentially starve them if pending layers keep arriving faster than they can be flushed. The impact of this will increase when we add compaction backpressure and propagate it up into the WAL receiver. Extracted from #10405. ## Summary of changes Break out of the layer flush loop once we've flushed up to the requested LSN. If further flush requests have arrived in the meanwhile, flushing will resume immediately after.	2025-01-21 22:01:27 +00:00
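The change described in the commit above, stopping the flush loop as soon as the requested LSN has been covered instead of draining every pending layer, might look roughly like this. It is a simplified sketch under stated assumptions: pending layers are modelled as a queue of end LSNs (plain `u64` values), and `flush_up_to` is a hypothetical name, not the pageserver's actual function.

```rust
use std::collections::VecDeque;

/// Flushes pending layers in order, stopping once the requested LSN is covered.
/// Returns the end LSN of the last layer flushed, or `None` if nothing was pending.
/// Assumption: each queue entry is the end LSN of one pending layer.
pub fn flush_up_to(pending: &mut VecDeque<u64>, requested_lsn: u64) -> Option<u64> {
    let mut flushed = None;
    while let Some(end_lsn) = pending.pop_front() {
        // The real flush work (writing the layer out) would happen here.
        flushed = Some(end_lsn);
        if end_lsn >= requested_lsn {
            // Waiters for `requested_lsn` can be notified now; any layers still
            // in the queue are handled by the next flush request.
            break;
        }
    }
    flushed
}
```

With the earlier behaviour the loop would have no `break`, so a waiter for a low LSN would be held up by every layer that arrived after its request.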
Erik Grinaker	8a8c656c06	pageserver: add `LayerMap::watch_layer0_deltas()` (#10470 ) ## Problem For compaction backpressure, we need a mechanism to signal when compaction has reduced the L0 delta layer count below the backpressure threshold. Extracted from #10405. ## Summary of changes Add `LayerMap::watch_level0_deltas()` which returns a `tokio::sync::watch::Receiver` signalling the current L0 delta layer count.	2025-01-21 21:18:09 +00:00
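The signalling mechanism in the commit above publishes the current L0 delta layer count so that a waiter can block until it drops below a threshold. The sketch below illustrates that idea with only the standard library: a `Mutex` and `Condvar` stand in for the `tokio::sync::watch` channel the commit uses, and `CountWatch`, `set`, and `wait_below` are hypothetical names introduced for illustration.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical std-only stand-in for a watch channel carrying a layer count.
#[derive(Clone, Default)]
pub struct CountWatch {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

impl CountWatch {
    /// Publishes a new count and wakes every waiter (the "sender" side).
    pub fn set(&self, count: usize) {
        let (lock, cvar) = &*self.inner;
        *lock.lock().unwrap() = count;
        cvar.notify_all();
    }

    /// Blocks until the count is below `threshold`, then returns the count seen.
    pub fn wait_below(&self, threshold: usize) -> usize {
        let (lock, cvar) = &*self.inner;
        let guard = cvar
            .wait_while(lock.lock().unwrap(), |count| *count >= threshold)
            .unwrap();
        *guard
    }
}
```

In this model, compaction would call `set` after removing L0 layers, and a backpressured writer would call `wait_below` with the backpressure threshold.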
Erik Grinaker	a75e11cc00	pageserver: return duration from `StorageTimeMetricsTimer` (#10468 ) ## Problem It's sometimes useful to obtain the elapsed duration from a `StorageTimeMetricsTimer` for purposes beyond just recording it in metrics (e.g. to log it). Extracted from #10405. ## Summary of changes Add `StorageTimeMetricsTimer.elapsed()` and return the duration from `stop_and_record()`.	2025-01-21 20:56:34 +00:00
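The API shape described in the commit above, a timer that can report its elapsed time while running and also returns the duration when it is stopped and recorded, can be illustrated as follows. This is a hedged sketch: `SketchTimer` and its `Vec<Duration>` sink are hypothetical stand-ins, not the real `StorageTimeMetricsTimer` or its metrics backend.

```rust
use std::time::{Duration, Instant};

/// Hypothetical timer that records into a plain vector instead of a metric.
pub struct SketchTimer<'a> {
    start: Instant,
    sink: &'a mut Vec<Duration>,
}

impl<'a> SketchTimer<'a> {
    /// Starts timing; durations are later appended to `sink`.
    pub fn start(sink: &'a mut Vec<Duration>) -> Self {
        Self { start: Instant::now(), sink }
    }

    /// Returns the time elapsed so far without stopping the timer.
    pub fn elapsed(&self) -> Duration {
        self.start.elapsed()
    }

    /// Stops the timer, records the duration, and hands it back to the caller
    /// so it can also be logged or compared.
    pub fn stop_and_record(self) -> Duration {
        let duration = self.start.elapsed();
        self.sink.push(duration);
        duration
    }
}
```

Taking `self` by value in `stop_and_record` means the timer cannot be recorded twice, which is one plausible reason for such a design.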
Alex Chi Z.	7d4bfcdc47	feat(pageserver): add config items for gc-compaction auto trigger (#10455 ) ## Problem part of https://github.com/neondatabase/neon/issues/9114 The automatic trigger is already implemented at https://github.com/neondatabase/neon/pull/10221 but I need to write some tests and finish my experiments in staging before I can merge it with confidence. Given that I have some other patches that will modify the config items, I'd like to get the config items merged first to reduce conflicts. ## Summary of changes * add `l2_lsn` to index_part.json -- below that LSN, data have been processed by gc-compaction * add a set of gc-compaction auto trigger control items into the config --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-21 19:29:38 +00:00
a-masterov	737888e5c9	Remove the tests for `pg_anon` (#10382 ) ## Problem We are removing the `pg_anon` v1 extension from Neon. So we don't need to test it anymore and can remove the code for simplicity. ## Summary of changes The code required for testing `pg_anon` is removed.	2025-01-21 19:17:14 +00:00
Gleb Novikov	19bf7b78a0	fast import: basic python test (#10271 ) We did not have any tests on fast_import binary yet. In this PR I have introduced: - `FastImport` class and tools for testing in python - basic test that runs fast import against vanilla postgres and checks that data is there Should be merged after https://github.com/neondatabase/neon/pull/10251	2025-01-21 16:50:44 +00:00


7016 Commits