rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 13:10:37 +00:00

Author	SHA1	Message	Date
Conrad Ludgate	7000aaaf75	chore: fix h2 stubgen (#10491 ) ## Problem ## Summary of changes --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-01-24 14:55:48 +00:00
Vlad Lazar	de8276488d	tests: enable wal reader fanout in tests (#10301 ) Note: this has to merge after the release is cut on `2025-01-17` for compat tests to start passing. ## Problem SK wal reader fan-out is not enabled in tests by default. ## Summary of changes Enable it.	2025-01-24 10:34:57 +00:00
Erik Grinaker	ddb9ae1214	pageserver: add compaction backpressure for layer flushes (#10405 ) ## Problem There is no direct backpressure for compaction and L0 read amplification. This allows a large buildup of compaction debt and read amplification. Resolves #5415. Requires #10402. ## Summary of changes Delay layer flushes based on the number of level 0 delta layers: * `l0_flush_delay_threshold`: delay flushes such that they take 2x as long (default `2 * compaction_threshold`). * `l0_flush_stall_threshold`: stall flushes until level 0 delta layers drop below threshold (default `4 * compaction_threshold`). If either threshold is reached, ephemeral layer rolls also synchronously wait for layer flushes to propagate this backpressure up into WAL ingestion. This will bound the number of frozen layers to 1 once backpressure kicks in, since all other frozen layers must flush before the rolled layer. ## Analysis This will significantly change the compute backpressure characteristics. Recall the three compute backpressure knobs: * `max_replication_write_lag`: 500 MB (based on Pageserver `last_received_lsn`). * `max_replication_flush_lag`: 10 GB (based on Pageserver `disk_consistent_lsn`). * `max_replication_apply_lag`: disabled (based on Pageserver `remote_consistent_lsn`). Previously, the Pageserver would keep ingesting WAL and build up ephemeral layers and L0 layers until the compute hit `max_replication_flush_lag` at 10 GB and began backpressuring. Now, once we delay/stall WAL ingestion, the compute will begin backpressuring after `max_replication_write_lag`, i.e. 500 MB. This is probably a good thing (we're not building up a ton of compaction debt), but we should consider tuning these settings. `max_replication_flush_lag` probably doesn't serve a purpose anymore, and we should consider removing it. Furthermore, the removal of the upload barrier in #10402 will mean that we no longer backpressure flushes based on S3 uploads, since `max_replication_apply_lag` is disabled. We should consider enabling this as well. ### When and what do we compact? Default compaction settings: * `compaction_threshold`: 10 L0 delta layers. * `compaction_period`: 20 seconds (between each compaction loop check). * `checkpoint_distance`: 256 MB (size of L0 delta layers). * `l0_flush_delay_threshold`: 20 L0 delta layers. * `l0_flush_stall_threshold`: 40 L0 delta layers. Compaction characteristics: * Minimum compaction volume: 10 layers * 256 MB = 2.5 GB. * Additional compaction volume (assuming 128 MB/s WAL): 128 MB/s * 20 seconds = 2.5 GB (10 L0 layers). * Required compaction bandwidth: 5.0 GB / 20 seconds = 256 MB/s. ### When do we hit `max_replication_write_lag`? Depending on how fast compaction and flushes happens, the compute will backpressure somewhere between `l0_flush_delay_threshold` or `l0_flush_stall_threshold` + `max_replication_write_lag`. * Minimum compute backpressure lag: 20 layers * 256 MB + 500 MB = 5.6 GB * Maximum compute backpressure lag: 40 layers * 256 MB + 500 MB = 10.0 GB This seems like a reasonable range to me.	2025-01-24 09:47:28 +00:00
Erik Grinaker	9e55d79803	Reapply "pageserver: revert flush backpressure" (#10270 ) (#10402 ) This reapplies #10135. Just removing this flush backpressure without further mitigations caused read amp increases during bulk ingestion (predictably), so it was reverted. We will replace it by compaction-based backpressure. ## Problem In #8550, we made the flush loop wait for uploads after every layer. This was to avoid unbounded buildup of uploads, and to reduce compaction debt. However, the approach has several problems: * It prevents upload parallelism. * It prevents flush and upload pipelining. * It slows down ingestion even when there is no need to backpressure. * It does not directly backpressure based on compaction debt and read amplification. We will instead implement compaction-based backpressure in a PR immediately following this removal (#5415). Touches #5415. Touches #10095. ## Summary of changes Remove waiting on the upload queue in the flush loop.	2025-01-24 08:35:35 +00:00
Alex Chi Z.	8d47a60de2	fix(pageserver): handle dup layers during gc-compaction (#10430 ) ## Problem If gc-compaction decides to rewrite an image layer, it will now cause index_part to lose reference to that layer. In details, * Assume there's only one image layer of key 0000...AAAA at LSN 0x100 and generation 0xA in the system. * gc-compaction kicks in at gc-horizon 0x100, and then produce 0000...AAAA at LSN 0x100 and generation 0xB. * It submits a compaction result update into the index part that unlinks 0000-AAAA-100-A and adds 0000-AAAA-100-B On the remote storage / local disk side, this is fine -- it unlinks things correctly and uploads the new file. However, the `index_part.json` itself doesn't record generations. The buggy procedure is as follows: 1. upload the new file 2. update the index part to remove the old file and add the new file 3. remove the new file Therefore, the correct update result process for gc-compaction should be as follows: * When modifying the layer map, delete the old one and upload the new one. * When updating the index, uploading the new one in the index without deleting the old one. ## Summary of changes * Modify `finish_gc_compaction` to correctly order insertions and deletions. * Update the way gc-compaction uploads the layer files. * Add new tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-23 21:54:44 +00:00
Arpad Müller	ca6d72ba2a	Increase reconciler timeout after shard split (#10490 ) Sometimes, especially when the host running the tests is overloaded, we can run into reconcile timeouts in `test_timeline_ancestor_detach_idempotent_success`, making the test flaky. By increasing the timeouts from 30 seconds to 120 seconds, we can address the flakiness. Fixes #10464	2025-01-23 16:43:04 +00:00
a-masterov	b6c0f66619	CI(autocomment): add the lfc state (#10121 ) ## Problem Currently, the report does not contain the LFC state of the failed tests. ## Summary of changes Added the LFC state to the link to the allure report. --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-01-23 14:52:07 +00:00
Anastasia Lubennikova	8e8df1b453	Disable logical replication subscribers (#10249 ) Drop logical replication subscribers before compute starts on a non-main branch. Add new compute_ctl spec flag: drop_subscriptions_before_start If it is set, drop all the subscriptions from the compute node before it starts. To avoid race on compute start, use new GUC neon.disable_logical_replication_subscribers to temporarily disable logical replication workers until we drop the subscriptions. Ensure that we drop subscriptions exactly once when endpoint starts on a new branch. It is essential, because otherwise, we may drop not only inherited, but newly created subscriptions. We cannot rely only on spec.drop_subscriptions_before_start flag, because if for some reason compute restarts inside VM, it will start again with the same spec and flag value. To handle this, we save the fact of the operation in the database in the neon.drop_subscriptions_done table. If the table does not exist, we assume that the operation was never performed, so we must do it. If table exists, we check if the operation was performed on the current timeline. fixes: https://github.com/neondatabase/neon/issues/8790	2025-01-23 11:02:15 +00:00
Alex Chi Z.	92d95b08cf	fix(pageserver): extend split job key range to the end (#10484 ) ## Problem Not really a bug fix, but hopefully can reproduce https://github.com/neondatabase/neon/issues/10482 more. If the layer map does not contain layers that end at exactly the end range of the compaction job, the current split algorithm will produce the last job that ends at the maximum layer key. This patch extends it all the way to the compaction job end key. For example, the user requests a compaction of 0000...FFFF. However, we only have a layer 0000..3000 in the layer map, and the split job will have a range of 0000..3000 instead of 0000..FFFF. This is not a correctness issue but it would be better to fix it so that we can get consistent job splits. ## Summary of changes Compaction job split will always cover the full specified key range. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-23 00:15:46 +00:00
Arpad Müller	0af40b5494	Only churn rows once in test_scrubber_physical_gc_ancestors (#10481 ) ## Problem PR #10457 was supposed to fix the flakiness of `test_scrubber_physical_gc_ancestors`, but instead it made it even more flaky. However, the original error causes disappeared, now to be replaced by key not found errors. See this for a longer explanation: https://github.com/neondatabase/neon/issues/10391#issuecomment-2608018967 ## Solution This does one churn rows after all compactions, and before we do any timeline gc's. That way, we remain more accessible at older lsn's.	2025-01-22 19:45:12 +00:00
Vlad Lazar	414ed82c1f	pageserver: issue concurrent IO on the read path (#9353 ) ## Refs - Epic: https://github.com/neondatabase/neon/issues/9378 Co-authored-by: Vlad Lazar <vlad@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech> ## Problem The read path does its IOs sequentially. This means that if N values need to be read to reconstruct a page, we will do N IOs and getpage latency is `O(NIoLatency)`. ## Solution With this PR we gain the ability to issue IO concurrently within one layer visit and* to move on to the next layer without waiting for IOs from the previous visit to complete. This is an evolved version of the work done at the Lisbon hackathon, cf https://github.com/neondatabase/neon/pull/9002. ## Design ### `will_init` now sourced from disk btree index keys On the algorithmic level, the only change is that the `get_values_reconstruct_data` now sources `will_init` from the disk btree index key (which is PS-page_cache'd), instead of from the `Value`, which is only available after the IO completes. ### Concurrent IOs, Submission & Completion To separate IO submission from waiting for its completion, while simultaneously feature-gating the change, we introduce the notion of an `IoConcurrency` struct through which IO futures are "spawned". An IO is an opaque future, and waiting for completions is handled through `tokio::sync::oneshot` channels. The oneshot Receiver's take the place of the `img` and `records` fields inside `VectoredValueReconstructState`. When we're done visiting all the layers and submitting all the IOs along the way we concurrently `collect_pending_ios` for each value, which means for each value there is a future that awaits all the oneshot receivers and then calls into walredo to reconstruct the page image. Walredo is now invoked concurrently for each value instead of sequentially. Walredo itself remains unchanged. The spawned IO futures are driven to completion by a sidecar tokio task that is separate from the task that performs all the layer visiting and spawning of IOs. That tasks receives the IO futures via an unbounded mpsc channel and drives them to completion inside a `FuturedUnordered`. (The behavior from before this PR is available through `IoConcurrency::Sequential`, which awaits the IO futures in place, without "spawning" or "submitting" them anywhere.) #### Alternatives Explored A few words on the rationale behind having a sidecar task and what alternatives were considered. One option is to queue up all IO futures in a FuturesUnordered that is polled the first time when we `collect_pending_ios`. Firstly, the IO futures are opaque, compiler-generated futures that need to be polled at least once to submit their IO. "At least once" because tokio-epoll-uring may not be able to submit the IO to the kernel on first poll right away. Second, there are deadlocks if we don't drive the IO futures to completion independently of the spawning task. The reason is that both the IO futures and the spawning task may hold some _and_ try to acquire _more_ shared limited resources. For example, both spawning task and IO future may try to acquire * a VirtualFile file descriptor cache slot async mutex (observed during impl) * a tokio-epoll-uring submission slot (observed during impl) * a PageCache slot (currently this is not the case but we may move more code into the IO futures in the future) Another option is to spawn a short-lived `tokio::task` for each IO future. We implemented and benchmarked it during development, but found little throughput improvement and moderate mean & tail latency degradation. Concerns about pressure on the tokio scheduler made us discard this variant. The sidecar task could be obsoleted if the IOs were not arbitrary code but a well-defined struct. However, 1. the opaque futures approach taken in this PR allows leaving the existing code unchanged, which 2. allows us to implement the `IoConcurrency::Sequential` mode for feature-gating the change. Once the new mode sidecar task implementation is rolled out everywhere, and `::Sequential` removed, we can think about a descriptive submission & completion interface. The problems around deadlocks pointed out earlier will need to be solved then. For example, we could eliminate VirtualFile file descriptor cache and tokio-epoll-uring slots. The latter has been drafted in https://github.com/neondatabase/tokio-epoll-uring/pull/63. See the lengthy doc comment on `spawn_io()` for more details. ### Error handling There are two error classes during reconstruct data retrieval: * traversal errors: index lookup, move to next layer, and the like * value read IO errors A traversal error fails the entire get_vectored request, as before this PR. A value read error only fails that value. In any case, we preserve the existing behavior that once `get_vectored` returns, all IOs are done. Panics and failing to poll `get_vectored` to completion will leave the IOs dangling, which is safe but shouldn't happen, and so, a rate-limited log statement will be emitted at warning level. There is a doc comment on `collect_pending_ios` giving more code-level details and rationale. ### Feature Gating The new behavior is opt-in via pageserver config. The `Sequential` mode is the default. The only significant change in `Sequential` mode compared to before this PR is the buffering of results in the `oneshot`s. ## Code-Level Changes Prep work: * Make `GateGuard` clonable. Core Feature: * Traversal code: track `will_init` in `BlobMeta` and source it from the Delta/Image/InMemory layer index, instead of determining `will_init` after we've read the value. This avoids having to read the value to determine whether traversal can stop. * Introduce `IoConcurrency` & its sidecar task. * `IoConcurrency` is the clonable handle. * It connects to the sidecar task via an `mpsc`. * Plumb through `IoConcurrency` from high level code to the individual layer implementations' `get_values_reconstruct_data`. We piggy-back on the `ValuesReconstructState` for this. * The sidecar task should be long-lived, so, `IoConcurrency` needs to be rooted up "high" in the call stack. * Roots as of this PR: * `page_service`: outside of pagestream loop * `create_image_layers`: when it is called * `basebackup`(only auxfiles + replorigin + SLRU segments) * Code with no roots that uses `IoConcurrency::sequential` * any `Timeline::get` call * `collect_keyspace` is a good example * follow-up: https://github.com/neondatabase/neon/issues/10460 * `TimelineAdaptor` code used by the compaction simulator, unused in practive * `ingest_xlog_dbase_create` * Transform Delta/Image/InMemoryLayer to * do their values IO in a distinct `async {}` block * extend the residence of the Delta/Image layer until the IO is done * buffer their results in a `oneshot` channel instead of straight in `ValuesReconstructState` * the `oneshot` channel is wrapped in `OnDiskValueIo` / `OnDiskValueIoWaiter` types that aid in expressiveness and are used to keep track of in-flight IOs so we can print warnings if we leave them dangling. * Change `ValuesReconstructState` to hold the receiving end of the `oneshot` channel aka `OnDiskValueIoWaiter`. * Change `get_vectored_impl` to `collect_pending_ios` and issue walredo concurrently, in a `FuturesUnordered`. Testing / Benchmarking: * Support queue-depth in pagebench for manual benchmarkinng. * Add test suite support for setting concurrency mode ps config field via a) an env var and b) via NeonEnvBuilder. * Hacky helper to have sidecar-based IoConcurrency in tests. This will be cleaned up later. More benchmarking will happen post-merge in nightly benchmarks, plus in staging/pre-prod. Some intermediate helpers for manual benchmarking have been preserved in https://github.com/neondatabase/neon/pull/10466 and will be landed in later PRs. (L0 layer stack generator!) Drive-By: * test suite actually didn't enable batching by default because `config.compatibility_neon_binpath` is always Truthy in our CI environment => https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309 * initial logical size calculation wasn't always polled to completion, which was surfaced through the added WARN logs emitted when dropping a `ValuesReconstructState` that still has inflight IOs. * remove the timing histograms `pageserver_getpage_get_reconstruct_data_seconds` and `pageserver_getpage_reconstruct_seconds` because with planning, value read IO, and walredo happening concurrently, one can no longer attribute latency to any one of them; we'll revisit this when Vlad's work on tracing/sampling through RequestContext lands. * remove code related to `get_cached_lsn()`. The logic around this has been dead at runtime for a long time, ever since the removal of the materialized page cache in #8105. ## Testing Unit tests use the sidecar task by default and run both modes in CI. Python regression tests and benchmarks also use the sidecar task by default. We'll test more in staging and possibly preprod. # Future Work Please refer to the parent epic for the full plan. The next step will be to fold the plumbing of IoConcurrency into RequestContext so that the function signatures get cleaned up. Once `Sequential` isn't used anymore, we can take the next big leap which is replacing the opaque IOs with structs that have well-defined semantics. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-01-22 15:30:23 +00:00
Christian Schwarz	b4d87b9dfe	fix(tests): actually enable pipelinig by default in the test suite (#10472 ) ## Problem PR #9993 was supposed to enable `page_service_pipelining` by default for all `NeonEnv`s, but this was ineffective in our CI environment. Thus, CI Python-based tests and benchmarks, unless explicitly configuring pipelining, were still using serial protocol handling. ## Analysis The root cause was that in our CI environment, `config.compatibility_neon_binpath` is always Truthy. It's not in local environments, which is why this slipped through in local testing. Lesson: always add a log line ot pageserver startup and spot-check tests to ensure the intended default is picked up. ## Summary of changes Fix it. Since enough time has passed, the compatiblity snapshot contains a recent enough software version so we don't need to worry about `compatibility_neon_binpath` anymore. ## Future Work The question how to add a new default except for compatibliity tests, which is what the broken code was supposed to do, is still unsolved. Slack discussion: https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309	2025-01-22 10:10:43 +00:00
Alex Chi Z.	7d4bfcdc47	feat(pageserver): add config items for gc-compaction auto trigger (#10455 ) ## Problem part of https://github.com/neondatabase/neon/issues/9114 The automatic trigger is already implemented at https://github.com/neondatabase/neon/pull/10221 but I need to write some tests and finish my experiments in staging before I can merge it with confidence. Given that I have some other patches that will modify the config items, I'd like to get the config items merged first to reduce conflicts. ## Summary of changes * add `l2_lsn` to index_part.json -- below that LSN, data have been processed by gc-compaction * add a set of gc-compaction auto trigger control items into the config --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-21 19:29:38 +00:00
Gleb Novikov	19bf7b78a0	fast import: basic python test (#10271 ) We did not have any tests on fast_import binary yet. In this PR I have introduced: - `FastImport` class and tools for testing in python - basic test that runs fast import against vanilla postgres and checks that data is there Should be merged after https://github.com/neondatabase/neon/pull/10251	2025-01-21 16:50:44 +00:00
Arpad Müller	7e4a39ea53	Fix two flakiness sources in test_scrubber_physical_gc_ancestors (#10457 ) We currently have some flakiness in `test_scrubber_physical_gc_ancestors`, see #10391. The first flakiness kind is about the reconciler not actually becoming idle within the timeout of 30 seconds. We see continuous forward progress so this is likely not a hang. We also see this happen in parallel to a test failure, so is likely due to runners being overloaded. Therefore, we increase the timeout. The second flakiness kind is an assertion failure. This one is a little bit more tricky, but we saw in the successful run that there was some advance of the lsn between the compaction ran (which created layer files) and the gc run. Apparently gc rejects reductions to the single image layer setting if the cutoff lsn is the same as the lsn of the image layer: it will claim that that layer is newer than the space cutoff and therefore skip it, while thinking the old layer (that we want to delete) is the latest one (so it's not deleted). We address the second flakiness kind by inserting a tiny amount of WAL between the compaction and gc. This should hopefully fix things. Related issue: #10391 (not closing it with the merger of the PR as we'll need to validate that these changes had the intended effect). Thanks to Chi for going over this together with me in a call.	2025-01-21 15:40:04 +00:00
Alex Chi Z.	2de2b26c62	feat(pageserver): add reldir migration configs (#10439 ) ## Problem Part of #9516 per RFC at https://github.com/neondatabase/neon/pull/10412 ## Summary of changes Adding the necessary config items and index_part items for the large relation count work. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-20 20:44:12 +00:00
Matthias van de Meent	e781cf6dd8	Compute/LFC: Apply limits consistently (#10449 ) Otherwise we might hit ERRORs in otherwise safe situations (such as user queries), which isn't a great user experience. ## Problem https://github.com/neondatabase/neon/pull/10376 ## Summary of changes Instead of accepting internal errors as acceptable, we ensure we don't exceed our allocated usage.	2025-01-20 18:29:21 +00:00
Christian Schwarz	c47c5f4ace	fix(page_service pipelining): tenant cannot shut down because gate kept open while flushing responses (#10386 ) # Refs - fixes https://github.com/neondatabase/neon/issues/10309 - fixup of batching design, first introduced in https://github.com/neondatabase/neon/pull/9851 - refinement of https://github.com/neondatabase/neon/pull/8339 # Problem `Tenant::shutdown` was occasionally taking many minutes (sometimes up to 20) in staging and prod if the `page_service_pipelining.mode="concurrent-futures"` is enabled. # Symptoms The issue happens during shard migration between pageservers. There is page_service unavailability and hence effectively downtime for customers in the following case: 1. The source (state `AttachedStale`) gets stuck in `Tenant::shutdown`, waiting for the gate to close. 2. Cplane/Storcon decides to transition the target `AttachedMulti` to `AttachedSingle`. 3. That transition comes with a bump of the generation number, causing the `PUT .../location_config` endpoint to do a full `Tenant::shutdown` / `Tenant::attach` cycle for the target location. 4. That `Tenant::shutdown` on the target gets stuck, waiting for the gate to close. 5. Eventually the gate closes (`close completed`), correlating with a `page_service` connection handler logging that it's exiting because of a network error (`Connection reset by peer` or `Broken pipe`). While in (4): - `Tenant::shutdown` is stuck waiting for all `Timeline::shutdown` calls to complete. So, really, this is a `Timeline::shutdown` bug. - retries from Cplane/Storcon to complete above state transitions, fail with errors related to the tenant mgr slot being in state `TenantSlot::InProgress`, the tenant state being `TenantState::Stopping`, and the timelines being in `TimelineState::Stopping`, and the `Timeline::cancel` being cancelled. - Existing (and/or new?) page_service connections log errors `error reading relation or page version: Not found: Timed out waiting 30s for tenant active state. Latest state: None` # Root-Cause After a lengthy investigation ([internal write-up](https://www.notion.so/neondatabase/2025-01-09-batching-deadlock-Slow-Log-Analysis-in-Staging-176f189e00478050bc21c1a072157ca4?pvs=4)) I arrived at the following root cause. The `spsc_fold` channel (`batch_tx`/`batch_rx`) that connects the Batcher and Executor stages of the pipelined mode was storing a `Handle` and thus `GateGuard` of the Timeline that was not shutting down. The design assumption with pipelining was that this would always be a short transient state. However, that was incorrect: the Executor was stuck on writing/flushing an earlier response into the connection to the client, i.e., socket write being slow because of TCP backpressure. The probable scenario of how we end up in that case: 1. Compute backend process sends a continuous stream of getpage prefetch requests into the connection, but never reads the responses (why this happens: see Appendix section). 2. Batch N is processed by Batcher and Executor, up to the point where Executor starts flushing the response. 3. Batch N+1 is procssed by Batcher and queued in the `spsc_fold`. 4. Executor is still waiting for batch N flush to finish. 5. Batcher eventually hits the `TimeoutReader` error (10min). From here on it waits on the `spsc_fold.send(Err(QueryError(TimeoutReader_error)))` which will never finish because the batch already inside the `spsc_fold` is not being read by the Executor, because the Executor is still stuck in the flush. (This state is not observable at our default `info` log level) 6. Eventually, Compute backend process is killed (`close()` on the socket) or Compute as a whole gets killed (probably no clean TCP shutdown happening in that case). 7. Eventually, Pageserver TCP stack learns about (6) through RST packets and the Executor's flush() call fails with an error. 8. The Executor exits, dropping `cancel_batcher` and its end of the spsc_fold. This wakes Batcher, causing the `spsc_fold.send` to fail. Batcher exits. The pipeline shuts down as intended. We return from `process_query` and log the `Connection reset by peer` or `Broken pipe` error. The following diagram visualizes the wait-for graph at (5) ```mermaid flowchart TD Batcher --spsc_fold.send(TimeoutReader_error)--> Executor Executor --flush batch N responses--> socket.write_end socket.write_end --wait for TCP window to move forward--> Compute ``` # Analysis By holding the GateGuard inside the `spsc_fold` open, the pipelining implementation violated the principle established in (https://github.com/neondatabase/neon/pull/8339). That is, that `Handle`s must only be held across an await point if that await point is sensitive to the `<Handle as Deref<Target=Timeline>>::cancel` token. In this case, we were holding the Handle inside the `spsc_fold` while awaiting the `pgb_writer.flush()` future. One may jump to the conclusion that we should simply peek into the spsc_fold to get that Timeline cancel token and be sensitive to it during flush, then. But that violates another principle of the design from https://github.com/neondatabase/neon/pull/8339. That is, that the page_service connection lifecycle and the Timeline lifecycles must be completely decoupled. Tt must be possible to shut down one shard without shutting down the page_service connection, because on that single connection we might be serving other shards attached to this pageserver. (The current compute client opens separate connections per shard, but, there are plans to change that.) # Solution This PR adds a `handle::WeakHandle` struct that does _not_ hold the timeline gate open. It must be `upgrade()`d to get a `handle::Handle`. That `handle::Handle` _does_ hold the timeline gate open. The batch queued inside the `spsc_fold` only holds a `WeakHandle`. We only upgrade it while calling into the various `handle_` methods, i.e., while interacting with the `Timeline` via `<Handle as Deref<Target=Timeline>>`. All that code has always been required to be (and is!) sensitive to `Timeline::cancel`, and therefore we're guaranteed to bail from it quickly when `Timeline::shutdown` starts. We will drop the `Handle` immediately, before we start `pgb_writer.flush()`ing the responses. Thereby letting go of our hold on the `GateGuard`, allowing the timeline shutdown to complete while the page_service handler remains intact. # Code Changes * Reproducer & Regression Test * Developed and proven to reproduce the issue in https://github.com/neondatabase/neon/pull/10399 * Add a `Test` message to the pagestream protocol (`cfg(feature = "testing")`). * Drive-by minimal improvement to the parsing code, we now have a `PagestreamFeMessageTag`. * Refactor `pageserver/client` to allow sending and receiving `page_service` requests independently. * Add a Rust helper binary to produce situation (4) from above * Rationale: (4) and (5) are the same bug class, we're holding a gate open while `flush()`ing. * Add a Python regression test that uses the helper binary to demonstrate the problem. * Fix * Introduce and use `WeakHandle` as explained earlier. * Replace the `shut_down` atomic with two enum states for `HandleInner`, wrapped in a `Mutex`. * To make `WeakHandle::upgrade()` and `Handle::downgrade()` cache-efficient: * Wrap the `Types::Timeline` in an `Arc` * Wrap the `GateGuard` in an `Arc` * The separate `Arc`s enable uncontended cloning of the timeline reference in `upgrade()` and `downgrade()`. If instead we were `Arc<Timeline>::clone`, different connection handlers would be hitting the same cache line on every upgrade()/downgrade(), causing contention. * Please read the udpated module-level comment in `mod handle` module-level comment for details. # Testing & Performance The reproducer test that failed before the changes now passes, and obviously other tests are passing as well. We'll do more testing in staging, where the issue happens every ~4h if chaos migrations are enabled in storcon. Existing perf testing will be sufficient, no perf degradation is expected. It's a few more alloctations due to the added Arc's, but, they're low frequency. # Appendix: Why Compute Sometimes Doesn't Read Responses Remember, the whole problem surfaced because flush() was slow because Compute was not reading responses. Why is that? In short, the way the compute works, it only advances the page_service protocol processing when it has an interest in data, i.e., when the pagestore smgr is called to return pages. Thus, if compute issues a bunch of requests as part of prefetch but then it turns out it can service the query without reading those pages, it may very well happen that these messages stay in the TCP until the next smgr read happens, either in that session, or possibly in another session. If there’s too many unread responses in the TCP, the pageserver kernel is going to backpressure into userspace, resulting in our stuck flush(). All of this stems from the way vanilla Postgres does prefetching and "async IO": it issues `fadvise()` to make the kernel do the IO in the background, buffering results in the kernel page cache. It then consumes the results through synchronous `read()` system calls, which hopefully will be fast because of the `fadvise()`. If it turns out that some / all of the prefetch results are not needed, Postgres will not be issuing those `read()` system calls. The kernel will eventually react to that by reusing page cache pages that hold completed prefetched data. Uncompleted prefetch requests may or may not be processed -- it's up to the kernel. In Neon, the smgr + Pageserver together take on the role of the kernel in above paragraphs. In the current implementation, all prefetches are sent as GetPage requests to Pageserver. The responses are only processed in the places where vanilla Postgres would do the synchronous `read()` system call. If we never get to that, the responses are queued inside the TCP connection, which, once buffers run full, will backpressure into Pageserver's sending code, i.e., the `pgb_writer.flush()` that was the root cause of the problems we're fixing in this PR.	2025-01-16 20:34:02 +00:00
John Spray	8f2ebc0684	tests: stabilize test_storage_controller_node_deletion (#10420 ) ## Problem `test_storage_controller_node_deletion` sometimes failed because shards were moving around during timeline creation, and neon_local isn't tolerant of that. The movements were unexpected because the shards had only just been created. This was a regression from #9916 Closes: #10383 ## Summary of changes - Make this test use multiple AZs -- this makes the storage controller's scheduling reliably stable Why this works: in #9916 , I made a simplifying assumption that we would have multiple AZs to get nice stable scheduling -- it's much easier, because each tenant has a well defined primary+secondary location when they have an AZ preference and nodes have different AZs. Everything still works if you don't have multiple AZs, but you just have this quirk that sometimes the optimizer can disagree with initial scheduling, so once in a while a shard moves after being created -- annoying for tests, harmless IRL.	2025-01-16 19:00:16 +00:00
Vlad Lazar	3a285a046b	pageserver: include node id when subscribing to SK (#10432 ) ## Problem All pageserver have the same application name which makes it hard to distinguish them. ## Summary of changes Include the node id in the application name sent to the safekeeper. This should gives us more visibility in logs. There's a few metrics that will increase in cardinality by `pageserver_count`, but that's fine.	2025-01-16 18:51:56 +00:00
Arpad Müller	e436dcad57	Rename "disabled" safekeeper scheduling policy to "pause" (#10410 ) Rename the safekeeper scheduling policy "disabled" to "pause". A rename was requested in https://github.com/neondatabase/neon/pull/10400#discussion_r1916259124, as the "disabled" policy is meant to be analogous to the "pause" policy for pageservers. Also simplify the `SkSchedulingPolicyArg::from_str` function, relying on the `from_str` implementation of `SkSchedulingPolicy`. Latter is used for the database format as well, so it is quite stable. If we ever want to change the UI, we'll need to duplicate the function again but this is cheap.	2025-01-16 14:30:49 +00:00
John Spray	21d7b6a258	tests: refactor test_tenant_delete_races_timeline_creation (#10425 ) ## Problem Threads spawned in `test_tenant_delete_races_timeline_creation` are not joined before the test ends, and can generate `PytestUnhandledThreadExceptionWarning` in other tests. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10419/12805365523/index.html#/testresult/53a72568acd04dbd ## Summary of changes - Wrap threads in ThreadPoolExecutor which will join them before the test ends - Remove a spurious deletion call -- the background thread doing deletion ought to succeed.	2025-01-16 14:11:33 +00:00
Matthias van de Meent	7be971081a	Make sure we request pages with a known-flushed LSN. (#10413 ) This should fix the largest source of flakyness of test_nbtree_pagesplit_cycleid. ## Problem https://github.com/neondatabase/neon/issues/10390 ## Summary of changes By using a guaranteed-flushed LSN, we ensure that PS won't have to wait forever. (If it does wait forever, we know the issue can't be with Compute's WAL)	2025-01-16 08:34:11 +00:00
Alex Chi Z.	a753349cb0	feat(pageserver): validate data integrity during gc-compaction (#10131 ) ## Problem part of https://github.com/neondatabase/neon/issues/9114 part of investigation of https://github.com/neondatabase/neon/issues/10049 ## Summary of changes * If `cfg!(test) or cfg!(feature = testing)`, then we will always try generating an image to ensure the history is replayable, but not put the image layer into the final layer results, therefore discovering wrong key history before we hit a read error. * I suspect it's easier to trigger some races if gc-compaction is continuously run on a timeline, so I increased the frequency to twice per 10 churns. * Also, create branches in gc-compaction smoke tests to get more test coverage. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad@neon.tech>	2025-01-15 22:04:06 +00:00
Arpad Müller	efaec6cdf8	Add endpoint and storcon cli cmd to set sk scheduling policy (#10400 ) Implementing the last missing endpoint of #9981, this adds support to set the scheduling policy of an individual safekeeper, as specified in the RFC. However, unlike in the RFC we call the endpoint `scheduling_policy` not `status` Closes #9981. As for why not use the upsert endpoint for this: we want to have the safekeeper upsert endpoint be used for testing and for deploying new safekeepers, but not for changes of the scheduling policy. We don't want to change any of the other fields when marking a safekeeper as decommissioned for example, so we'd have to first fetch them only to then specify them again. Of course one can also design an endpoint where one can omit any field and it doesn't get modified, but it's still not great for observability to put everything into one big "change something about this safekeeper" endpoint.	2025-01-15 18:15:30 +00:00
Arseny Sher	05a71c7d6a	safekeeper: add membership configuration switch endpoint (#10241 ) ## Problem https://github.com/neondatabase/neon/issues/9965 ## Summary of changes Add to safekeeper http endpoint to switch membership configuration. Also add it to python client for tests, and add simple test itself.	2025-01-15 14:16:04 +00:00
Arseny Sher	2d0ea08524	Add safekeeper membership conf to control file. (#10196 ) ## Problem https://github.com/neondatabase/neon/issues/9965 ## Summary of changes Add safekeeper membership configuration struct itself and storing it in the control file. In passing also add creation timestamp to the control file (there were cases where I wanted it in the past). Remove obsolete unused PersistedPeerInfo struct from control file (still keep it control_file_upgrade.rs to have it in old upgrade code). Remove the binary representation of cfile in the roundtrip test. Updating it is annoying, and we still test the actual roundtrip. Also add configuration to timeline creation http request, currently used only in one python test. In passing, slightly change LSNs meaning in the request: normally start_lsn is passed (the same as ancestor_start_lsn in similar pageserver call), but we allow specifying higher commit_lsn for manual intervention if needed. Also when given LSN initialize term_history with it.	2025-01-15 09:45:58 +00:00
John Spray	47c1640acc	storage controller: pagination for tenant listing API (#10365 ) ## Problem For large deployments, the `control/v1/tenant` listing API can time out transmitting a monolithic serialized response. ## Summary of changes - Add `limit` and `start_after` parameters to listing API - Update storcon_cli to use these parameters and limit requests to 1000 items at a time	2025-01-14 21:37:32 +00:00
Tristan Partin	430b556b34	Update postgres-exporter and sql_exporter in computes (#10349 ) The postgres-exporter was much further out of date, but let's just bump both. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-14 00:44:39 +00:00
Konstantin Knizhnik	1783501eaa	Increase max connection for replica to prevent test flukyness (#10306 ) ## Problem See https://github.com/neondatabase/neon/issues/10167 Too small number of `max_connections` (2) can cause failures of test_physical_replication_config_mismatch_too_many_known_xids test ## Summary of changes Increase `max_connections` to 5 Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-01-13 20:01:03 +00:00
John Spray	fd1368d31e	storcon: rework scheduler optimisation, prioritize AZ (#9916 ) ## Problem We want to do a more robust job of scheduling tenants into their home AZ: https://github.com/neondatabase/neon/issues/8264. Closes: https://github.com/neondatabase/neon/issues/8969 ## Summary of changes ### Scope This PR combines prioritizing AZ with a larger rework of how we do optimisation. The rationale is that just bumping AZ in the order of Score attributes is a very tiny change: the interesting part is lining up all the optimisation logic to respect this properly, which means rewriting it to use the same scores as the scheduler, rather than the fragile hand-crafted logic that we had before. Separating these changes out is possible, but would involve doing two rounds of test updates instead of one. ### Scheduling optimisation `TenantShard`'s `optimize_attachment` and `optimize_secondary` methods now both use the scheduler to pick a new "favourite" location. Then there is some refined logic for whether + how to migrate to it: - To decide if a new location is sufficiently "better", we generate scores using some projected ScheduleContexts that exclude the shard under consideration, so that we avoid migrating from a node with AffinityScore(2) to a node with AffinityScore(1), only to migrate back later. - Score types get a `for_optimization` method so that when we compare scores, we will only do an optimisation if the scores differ by their highest-ranking attributes, not just because one pageserver is lower in utilization. Eventually we _will_ want a mode that does this, but doing it here would make scheduling logic unstable and harder to test, and to do this correctly one needs to know the size of the tenant that one is migrating. - When we find a new attached location that we would like to move to, we will create a new secondary location there, even if we already had one on some other node. This handles the case where we have a home AZ A, and want to migrate the attachment between pageservers in that AZ while retaining a secondary location in some other AZ as well. - A unit test is added for https://github.com/neondatabase/neon/issues/8969, which is implicitly fixed by reworking optimisation to use the same scheduling scores as scheduling.	2025-01-13 19:33:00 +00:00
John Spray	ef8bfacd6b	storage controller: API + CLI for migrating secondary locations (#10284 ) ## Problem Currently, if we want to move a secondary there isn't a neat way to do that: we just have migration API for the attached location, and it is only clean to use that if you've manually created a secondary via pageserver API in the place you're going to move it to. Secondary migration API enables: - Moving the secondary somewhere because we would like to later move the attached location there. - Move the secondary location because we just want to reclaim some disk space from its current location. ## Summary of changes - Add `/migrate_secondary` API - Add `tenant-shard-migrate-secondary` CLI - Add tests for above	2025-01-13 14:52:43 +00:00
John Spray	d1bc36f536	storage controller: fix retries of compute hook notifications while a secondary node is offline (#10352 ) ## Problem We would sometimes fail to retry compute notifications: 1. Try and send, set compute_notify_failure if we can't 2. On next reconcile, reconcile() fails for some other reason (e.g. tried to talk to an offline node), and we fail the `result.is_ok() && must_notify` condition around the re-sending. Closes: https://github.com/neondatabase/cloud/issues/22612 ## Summary of changes - Clarify the meaning of the reconcile result: it should be Ok(()) if configuring attached location worked, even if secondary or detach locations cannot be reached. - Skip trying to talk to secondaries if they're offline - Even if reconcile fails and we can't send the compute notification (we can't send it because we're not sure if it's really attached), make sure we save the `compute_notify_failure` flag so that subsequent reconciler runs will try again - Add a regression test for the above	2025-01-13 13:31:57 +00:00
Heikki Linnakangas	09fe3b025c	Add a websockets tunnel and a test for the proxy's websockets support. (#3823 ) For testing the proxy's websockets support. I wrote this to test https://github.com/neondatabase/neon/issues/3822. Unfortunately, that bug can not be reproduced with this tunnel. The bug only appears when the client pipelines the first query with the authentication messages. The tunnel doesn't do that. --- Update (@conradludgate 2025-01-10): We have since added some websocket tests, but they manually implemented a very simplistic setup of the postgres protocol. Introducing the tunnel would make more complex testing simpler in the future. --------- Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>	2025-01-13 11:35:39 +00:00
Christian Schwarz	58332cb361	pageserver: remove unused metric `pageserver_layers_visited_per_read_global` (#10141 ) As of commit "pageserver: remove legacy read path" (#8601) we always use vectored get, which has a separate metric.	2025-01-10 20:35:50 +00:00
John Spray	4398051385	tests: smaller datasets in LFC tests (#10346 ) ## Problem These two tests came up in #9537 as doing multi-gigabyte I/O, and from inspection of the tests it doesn't seem like they need that to fulfil their purpose. ## Summary of changes - In test_local_file_cache_unlink, run fewer background threads with a smaller number of rows. These background threads AFAICT exist to make sure some I/O is going on while we unlink the LFC directory, but 5 threads should be enough for "some". - In test_lfc_resize, tweak the test to validate that the cache size is larger than the final size before resizing it, so that we're sure we're writing enough data to really be doing something. Then decrease the pgbench scale.	2025-01-10 15:53:23 +00:00
John Spray	105f66c4ce	tests: move test_parallel_copy into performance tree (#10343 ) ## Problem This test writes ~5GB of data. It is not suitable to run in parallel with all the other small tests in test_runner/regress. via #9537 ## Summary of changes - Move test_parallel_copy into the performance directory, so that it does not run in parallel with other tests	2025-01-10 13:57:26 +00:00
John Spray	0d4fce2d35	tests: refine how compat snapshot is generated (#10342 ) ## Problem I noticed in https://github.com/neondatabase/neon/pull/9537 that tests which work with compat snapshots were writing several hundred MB of data, which isn't really necessary. Also, the snapshots are large but don't have the proper variety of storage format features, e.g. they could just have L0 deltas. ## Summary of changes - Use smaller scale factor and runtime to generate less data - Configure a small layer size and use force image layer generation so that our output contains L1 deltas and image layers, and has a decent number of entries in the layer map	2025-01-10 13:57:23 +00:00
Arpad Müller	6149ac8834	Handle race between auto-offload and unarchival (#10305 ) ## Problem Auto-offloading as requested by the compaction task is racy with unarchival, in that the compaction task might attempt to offload an unarchived timeline. By that point it will already have set the timeline to the `Stopping` state however, which makes it unusable for any purpose. For example: 1. compaction task decides to offload timeline 2. timeline gets unarchived 3. `offload_timeline` gets called by compaction task * sets timeline's state to `Stopping` * realizes that the timeline can't be unarchived, errors out 6. endpoint can't be started as the timeline is `Stopping` and thus 'can't be found'. A future iteration of the compaction task can't "heal" this state either as the timeline will still not be archived, same goes for other automatic stuff. The only way to heal this is a tenant detach+attach, or alternatively a pageserver restart. Furthermore, the compaction task is especially amenable for such races as it first stores `can_offload` into a variable, figures out whether compaction is needed (which takes some time), and only then does it attempt an offload operation: the time difference between "check" and "use" is non-trivially small. To make it even worse, we start the compaction task right after attach of a tenant, and it is a common pattern by pageserver users to attach a tenant to then immediately unarchive a timeline, so that an endpoint can be started. ## Solutions not adopted The simplest solution is to move the `can_offload` check to right before attempting of the offload. But this is not a good solution, as no lock is held between that check and timeline shutdown. So races would still be possible, just become less likely. I explored using the timeline state for this, as in adding an additional enum variant. But `Timeline::set_state` is racy (#10297). ## Adopted solution We use the lock on the timeline's upload queue as an arbiter: either unarchival gets to it first and sours the state for auto-offloading, or auto-offloading shuts it down, which stops any parallel unarchival in its tracks. The key part is not releasing the upload queue's lock between the check whether the timeline is archived or not, and shutting it down (the actual implementation only sets `shutting_down` but it has the same effect on `initialized_mut()` as a full shutdown). The rest of the patch is stuff that follows from this. We also move the part where we set the state to `Stopping` to after that arbiter has decided the fate of the timeline. For deletions, we do keep it inside `DeleteTimelineFlow::prepare` however, so that it is called with all of the the timelines locks held that the function allocates (timelines lock most importantly). This is only a precautionary measure however, as I didn't want to analyze deletion related code for possible races. ## Future changes It might make sense to move `can_offload` to right before the offload attempt. Maybe some other properties might have changed as well. Although this will not be perfect either as no lock is held. I want to keep it out of this change to emphasize that this move wasn't the main reason we are race free now. Fixes #10220	2025-01-09 20:41:49 +00:00
Alexey Kondratov	f37eeb56ad	fix(compute_ctl): Resolve issues with dropping roles having dangling permissions (#10299 ) ## Problem In Postgres, one cannot drop a role if it has any dependent objects in the DB. In `compute_ctl`, we automatically reassign all dependent objects in every DB to the corresponding DB owner. Yet, it seems that it doesn't help with some implicit permissions. The issue is reproduced by installing a `postgis` extension because it creates some views and tables in the public schema. ## Summary of changes Added a repro test without using a `postgis`: i) create a role via `compute_ctl` (with `neon_superuser` grant); ii) create a test role, a table in schema public, and grant permissions via the role in `neon_superuser`. To fix the issue, I added a new `compute_ctl` code that removes such dangling permissions before dropping the role. It's done in the least invasive way, i.e., only touches the schema public, because i) that's the problem we had with PostGIS; ii) it creates a smaller chance of messing anything up and getting a stuck operation again, just for a different reason. Properly, any API-based catalog operations should fail gracefully and provide an actionable error and status code to the control plane, allowing the latter to unwind the operation and propagate an error message and hint to the user. In this sense, it's aligned with another feature request https://github.com/neondatabase/cloud/issues/21611 Resolve neondatabase/cloud#13582	2025-01-09 16:39:53 +00:00
Arpad Müller	bebc46e713	Add scheduling_policy column to safekeepers table (#10205 ) Add a `scheduling_policy` column to the safekeepers table of the storage controller. Part of #9981	2025-01-09 15:55:02 +00:00
Vlad Lazar	f4739d49e3	pageserver: tweak interpreted ingest record metrics (#10291 ) ## Problem The filtered record metric doesn't make sense for interpreted ingest. ## Summary of changes While of dubious utility in the first place, this patch replaces them with records received and records observed metrics for interpreted ingest: * received records cause the pageserver to do _something_: write a key, value pair to storage, update some metadata or flush pending modifications * observed records are a shard 0 concept and contain only key metadata used in tracking relation sizes (received records include observed records)	2025-01-09 12:31:02 +00:00
Anastasia Lubennikova	0ad0db6ff8	compute: dropdb DROP SUBSCRIPTION fix (#10066 ) ## Problem Project gets stuck if database with subscriptions was deleted via API / UI. https://github.com/neondatabase/cloud/issues/18646 ## Summary of changes Before dropping the database, drop all the subscriptions in it. Do not drop slot on publisher, because we have no guarantee that the slot still exists or that the publisher is reachable. Add `DropSubscriptionsForDeletedDatabases` phase to run these operations in all databases, we're about to delete. Ignore the error if the database does not exist.	2025-01-08 18:55:04 +00:00
John Spray	68d8acfd05	storage controller: don't hold detached tenants in memory (#10264 ) ## Problem Typical deployments of neon have some tenants that stay in use continuously, and a background churning population of tenants that are created and then fall idle, and are configured to Detached state. Currently, this churn of short lived tenants results in an ever-increasing memory footprint. Closes: https://github.com/neondatabase/neon/issues/9712 ## Summary of changes - At startup, filter to only load shards that don't have Detached policy - In process_result, check if a tenant's shards are all Detached and observed=={}, and if so drop them from memory - In tenant_location_conf and other tenant mutators, load the tenants' shards on-demand if they are not present	2025-01-08 18:12:09 +00:00
Vlad Lazar	dc284247a5	storage_controller: fix node flap detach race (#10298 ) ## Problem The observed state removal may race with the inline updates of the observed state done from `Service::node_activate_reconcile`. This was intended to work as follows: 1. Detaches while the node is unavailable remove the entry from the observed state. 2. `Service::node_activate_reconcile` diffs the locations returned by the pageserver with the observed state and detaches in-line when required. ## Summary of changes This PR removes step (1) and lets background reconciliations deal with the mismatch between the intent and observed state. A follow up will attempt to remove `Service::node_activate_reconcile` altogether. Closes https://github.com/neondatabase/neon/issues/10253	2025-01-08 10:26:53 +00:00
Alexander Bayandin	ad7f14d526	test_runner: update packages for Python 3.13 (#10285 ) ## Problem It's impossible to run regression tests with Python 3.13 as some dependencies don't support it (some of them are outdated, and `jsonnet` doesn't support it at all yet) ## Summary of changes - Update dependencies for Python 3.13 - Install `jsonnet` only on Python < 3.13 and skip relevant tests on Python 3.13 Closes #10237	2025-01-06 20:25:31 +00:00
Busra Kugler	406cca643b	Update neon_fixtures.py - remove logs (#10219 ) We need to remove this line to prevent aws keys exposing in the public s3 buckets	2025-01-06 10:44:23 +00:00
Erik Grinaker	1393cc668b	Revert "pageserver: revert flush backpressure (#8550 ) (#10135 )" (#10270 ) This reverts commit `f3ecd5d76a`. It is [suspected](https://neondb.slack.com/archives/C033RQ5SPDH/p1735907405716759) to have caused significant read amplification in the [ingest benchmark](https://neonprod.grafana.net/d/de3mupf4g68e8e/perf-test3a-ingest-benchmark?orgId=1&from=now-30d&to=now&timezone=utc&var-new_project_endpoint_id=ep-solitary-sun-w22bmut6&var-large_tenant_endpoint_id=ep-holy-bread-w203krzs) (specifically during index creation). We will revisit an intermediate improvement here to unblock [upload parallelism](https://github.com/neondatabase/neon/issues/10096) before properly addressing [compaction backpressure](https://github.com/neondatabase/neon/issues/8390).	2025-01-03 15:38:51 +00:00
John Spray	ba9722a2fd	tests: add upload wait in test_scrubber_physical_gc_ancestors (#10260 ) ## Problem We see periodic failures in `test_scrubber_physical_gc_ancestors`, where the logs show that the pageserver is creating image layers that should cause child shards to no longer reference their parents' layers, but then the scrubber runs and doesn't find any unreferenced layers.[ https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10256/12582034135/index.html#/testresult/78ea06dea6ba8dd3 From inspecting the code & test, it seems like this could be as simple as the test failing to wait for uploads before running the scrubber. It had a 2 second delay built in to satisfy the scrubbers time threshold checks, which on a lightly loaded machine would also have been easily enough for uploads to complete, but our test machines are more heavily loaded all the time. ## Summary of changes - Wait for uploads to complete after generating images layers in test_scrubber_physical_gc_ancestors, so that the scrubber should reliably see the post-compaction metadata.	2025-01-03 10:55:07 +00:00
Tristan Partin	363ea97f69	Add more substantial tests for compute migrations (#9811 ) The previous tests really didn't do much. This set should be quite a bit more encompassing. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-02 18:37:50 +00:00

1 2 3 4 5 ...

1850 Commits