rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-03 12:10:36 +00:00

Author	SHA1	Message	Date
Christian Schwarz	4e72b22b41	make noise from IoConcurrency::drop instead of the task, for more context	2025-01-21 14:51:54 +01:00
Christian Schwarz	4014c390e2	initial logical size calculation can also reasonably use the sidecar because it's concurrency-limited	2025-01-21 12:50:33 +01:00
Christian Schwarz	bca4263eb8	inspect_image_layer can also have an IoConcurrency root, it's tests only	2025-01-21 12:40:22 +01:00
Christian Schwarz	a958febd7a	reference issue that will remove hard-coded sequential()	2025-01-21 12:36:23 +01:00
Christian Schwarz	fc27da43ff	one more test can do without it	2025-01-21 12:25:30 +01:00
Christian Schwarz	cf2f0c27aa	IoConcurrency roots for scan() and tests	2025-01-21 12:21:46 +01:00
Christian Schwarz	f54c5d5596	turns out create_image_layers is easy	2025-01-21 10:47:49 +01:00
Christian Schwarz	ce5452d2e5	followup `0a37164c29`: also rename IoConcurrency::serial()	2025-01-21 00:47:37 +01:00
Christian Schwarz	af6c9ffac7	Ok, I now understand why it deadlocked in mode=sidecar-task The reason is that even in mode=`sidecar-task`, there are a bunch of places that are serial. Those places obviously deadlock.	2025-01-21 00:41:45 +01:00
Christian Schwarz	081ff26519	fixup `40ab9c2c5e`: deadlock Reproduced by test_runner/regress/test_branching.py::test_branching_with_pgbench[debug-pg16-flat-1-10]' It kinda makes sense that this deadlocks in `sequential` mode. However, it also deadlocks in `sidecar-task` mode. I don't understand why.	2025-01-20 23:46:56 +01:00
Christian Schwarz	0a37164c29	replace env var with config variable; add test suite fixture env var to override default	2025-01-20 23:46:56 +01:00
Christian Schwarz	0eff09e35f	Merge remote-tracking branch 'origin/main' into vlad/read-path-concurrent-io	2025-01-20 19:47:03 +01:00
Matthias van de Meent	e781cf6dd8	Compute/LFC: Apply limits consistently (#10449 ) Otherwise we might hit ERRORs in otherwise safe situations (such as user queries), which isn't a great user experience. ## Problem https://github.com/neondatabase/neon/pull/10376 ## Summary of changes Instead of accepting internal errors as acceptable, we ensure we don't exceed our allocated usage.	2025-01-20 18:29:21 +00:00
Christian Schwarz	cdad6b2de5	we don't need the CancellationToken, the ios_rx.recv() will fail at the same time	2025-01-20 19:13:40 +01:00
Christian Schwarz	82d20b52b3	make noise when dropping an IoConcurrency with unfinished requests	2025-01-20 19:12:00 +01:00
Christian Schwarz	72130d7d6c	fix(page_service / handle): panic when parallel client disconnect & Timeline shutdown (#10445 ) ## Refs - fixes https://github.com/neondatabase/neon/issues/10444 ## Problem We're seeing a panic `handles are only shut down once in their lifetime` in our performance testbed. ## Hypothesis Annotated code in https://github.com/neondatabase/neon/issues/10444#issuecomment-2602286415. ``` T1: drop Cache, executes up to (1) => HandleInner is now in state ShutDown T2: Timeline::shutdown => PerTimelineState::shutdown executes shutdown() again => panics ``` Likely this snuck in the final touches of #10386 where I narrowed down the locking rules. ## Summary of changes Make duplicate shutdowns a no-op.	2025-01-20 17:51:30 +00:00
John Spray	2657b7ec75	rfcs: add sharded ingest RFC (#8754 ) ## Summary Whereas currently we send all WAL to all pageserver shards, and each shard filters out the data that it needs, in this RFC we add a mechanism to filter the WAL on the safekeeper, so that each shard receives only the data it needs. This will place some extra CPU load on the safekeepers, in exchange for reducing the network bandwidth for ingesting WAL back to scaling as O(1) with shard count, rather than O(N_shards). Touches #9329. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Vlad Lazar <vlalazar.vlad@gmail.com> Co-authored-by: Vlad Lazar <vlad@neon.tech>	2025-01-20 17:33:07 +00:00
Christian Schwarz	3b1328423e	basebackup: fetch all SLRUs of one basebackup using the same IoConcurrency	2025-01-20 16:58:14 +01:00
Christian Schwarz	02fc58b878	impr(timeline handles): add more tests covering reference cycle (#10446 ) The other tests focus on the external interface usage while the tests added in this PR add some testing around HandleInner's lifecycle, ensuring we don't leak it once either connection gets dropped or per-timeline-state is shut down explicitly.	2025-01-20 14:37:24 +00:00
Arpad Müller	b312a3c320	Move DeleteTimelineFlow::prepare to separate function and use enum (#10334 ) It was requested by review in #10305 to use an enum or something like it for distinguishing the different modes instead of two parameters, because two flags allow four combinations, and two of them don't really make sense/ aren't used. follow-up of #10305	2025-01-20 12:50:44 +00:00
John Spray	7d761a9d22	storage controller: make chaos less disruptive to AZ locality (#10438 ) ## Problem Since #9916 , the chaos code is actively fighting the optimizer: tenants tend to be attached in their preferred AZ, so most chaos migrations were moving them to a non-preferred AZ. ## Summary of changes - When picking migrations, prefer to migrate things _toward_ their preferred AZ when possible. Then pick shards to move the other way when necessary. The resulting behavior should be an alternating "back and forth" where the chaos code migrates things away from home, and then migrates them back on the next iteration. The side effect will be that the chaos code actively helps to push things into their home AZ. That's not contrary to its purpose though: we mainly just want it to continuously migrate things to exercise migration+notification code.	2025-01-20 09:47:23 +00:00
John Spray	8bdaee35f3	pageserver: safety checks on validity of uploaded indices (#10403 ) ## Problem Occasionally, we encounter bugs in test environments that can be detected at the point of uploading an index, but we proceed to upload it anyway and leave a tenant in a broken state that's awkward to handle. ## Summary of changes - Validate index when submitting it for upload, so that we can see the issue quickly e.g. in an API invoking compaction - Validate index before executing the upload, so that we have a hard enforcement that any code path that tries to upload an index will not overwrite a valid index with an invalid one.	2025-01-20 09:20:31 +00:00
Arpad Müller	b0f34099f9	Add safekeeper utilization endpoint (#10429 ) Add an endpoint to obtain the utilization of a safekeeper. Future changes to the storage controller can use this endpoint to find the most suitable safekeepers for newly created timelines, analogously to how it's done for pageservers already. Initially we just want to assign by timeline count, then we can iterate from there. Part of https://github.com/neondatabase/neon/issues/9011	2025-01-17 21:43:52 +00:00
Christian Schwarz	2eb235e923	doc string explaining why we're deadlock free right now and why it's so brittle	2025-01-17 18:33:34 +01:00
Christian Schwarz	40ab9c2c5e	we can avoid adding the Arc<Mutex<>> around EphemeralLayer if we instead extend the lifetime of the InMemoryLayer for the spawned IO future; plus it's semantically more similar to what we now do for Delta and Image layers	2025-01-17 18:16:17 +01:00
Christian Schwarz	c43400389f	delta & image layer spawned IOs: keep layer resident until IO is done	2025-01-17 18:00:13 +01:00
Vlad Lazar	6975228a76	pageserver: add initdb metrics (#10434 ) ## Problem Initdb observability is poor. ## Summary of changes Add some metrics so we can figure out which part, if any, is slow. Closes https://github.com/neondatabase/neon/issues/10423	2025-01-17 14:51:33 +00:00
JC Grünhage	053abff71f	Fix dependency on neon-image in promote-images-dev (#10437 ) ## Problem `871e8b325f` failed CI on main because a job ran too soon. This was caused by `ea84ec357f`. While `promote-images-dev` does not inherently need `neon-image`, a few jobs depending on `promote-images-dev` do need it, and previously had it when it was `promote-images`, which depended on `test-images`, which in turn depended on `neon-image`. ## Summary of changes To ensure jobs depending on `docker.io/neondatabase/neon` images get them, `promote-images-dev` gets the dependency to `neon-image` back which it previously had transitively through `test-images`.	2025-01-17 14:21:30 +00:00
Tristan Partin	871e8b325f	Use the request ID given by the control plane in compute_ctl (#10418 ) Instead of generating our own request ID, we can just use the one provided by the control plane. In the event, we get a request from a client which doesn't set X-Request-ID, then we just generate one which is useful for tracing purposes. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-16 20:46:53 +00:00
Christian Schwarz	c47c5f4ace	fix(page_service pipelining): tenant cannot shut down because gate kept open while flushing responses (#10386 ) # Refs - fixes https://github.com/neondatabase/neon/issues/10309 - fixup of batching design, first introduced in https://github.com/neondatabase/neon/pull/9851 - refinement of https://github.com/neondatabase/neon/pull/8339 # Problem `Tenant::shutdown` was occasionally taking many minutes (sometimes up to 20) in staging and prod if the `page_service_pipelining.mode="concurrent-futures"` is enabled. # Symptoms The issue happens during shard migration between pageservers. There is page_service unavailability and hence effectively downtime for customers in the following case: 1. The source (state `AttachedStale`) gets stuck in `Tenant::shutdown`, waiting for the gate to close. 2. Cplane/Storcon decides to transition the target `AttachedMulti` to `AttachedSingle`. 3. That transition comes with a bump of the generation number, causing the `PUT .../location_config` endpoint to do a full `Tenant::shutdown` / `Tenant::attach` cycle for the target location. 4. That `Tenant::shutdown` on the target gets stuck, waiting for the gate to close. 5. Eventually the gate closes (`close completed`), correlating with a `page_service` connection handler logging that it's exiting because of a network error (`Connection reset by peer` or `Broken pipe`). While in (4): - `Tenant::shutdown` is stuck waiting for all `Timeline::shutdown` calls to complete. So, really, this is a `Timeline::shutdown` bug. - retries from Cplane/Storcon to complete above state transitions, fail with errors related to the tenant mgr slot being in state `TenantSlot::InProgress`, the tenant state being `TenantState::Stopping`, and the timelines being in `TimelineState::Stopping`, and the `Timeline::cancel` being cancelled. - Existing (and/or new?) 
page_service connections log errors `error reading relation or page version: Not found: Timed out waiting 30s for tenant active state. Latest state: None` # Root-Cause After a lengthy investigation ([internal write-up](https://www.notion.so/neondatabase/2025-01-09-batching-deadlock-Slow-Log-Analysis-in-Staging-176f189e00478050bc21c1a072157ca4?pvs=4)) I arrived at the following root cause. The `spsc_fold` channel (`batch_tx`/`batch_rx`) that connects the Batcher and Executor stages of the pipelined mode was storing a `Handle` and thus `GateGuard` of the Timeline that was not shutting down. The design assumption with pipelining was that this would always be a short transient state. However, that was incorrect: the Executor was stuck on writing/flushing an earlier response into the connection to the client, i.e., socket write being slow because of TCP backpressure. The probable scenario of how we end up in that case: 1. Compute backend process sends a continuous stream of getpage prefetch requests into the connection, but never reads the responses (why this happens: see Appendix section). 2. Batch N is processed by Batcher and Executor, up to the point where Executor starts flushing the response. 3. Batch N+1 is processed by Batcher and queued in the `spsc_fold`. 4. Executor is still waiting for batch N flush to finish. 5. Batcher eventually hits the `TimeoutReader` error (10min). From here on it waits on the `spsc_fold.send(Err(QueryError(TimeoutReader_error)))` which will never finish because the batch already inside the `spsc_fold` is not being read by the Executor, because the Executor is still stuck in the flush. (This state is not observable at our default `info` log level) 6. Eventually, Compute backend process is killed (`close()` on the socket) or Compute as a whole gets killed (probably no clean TCP shutdown happening in that case). 7. Eventually, Pageserver TCP stack learns about (6) through RST packets and the Executor's flush() call fails with an error. 8. 
The Executor exits, dropping `cancel_batcher` and its end of the spsc_fold. This wakes Batcher, causing the `spsc_fold.send` to fail. Batcher exits. The pipeline shuts down as intended. We return from `process_query` and log the `Connection reset by peer` or `Broken pipe` error. The following diagram visualizes the wait-for graph at (5) ```mermaid flowchart TD Batcher --spsc_fold.send(TimeoutReader_error)--> Executor Executor --flush batch N responses--> socket.write_end socket.write_end --wait for TCP window to move forward--> Compute ``` # Analysis By holding the GateGuard inside the `spsc_fold` open, the pipelining implementation violated the principle established in (https://github.com/neondatabase/neon/pull/8339). That is, that `Handle`s must only be held across an await point if that await point is sensitive to the `<Handle as Deref<Target=Timeline>>::cancel` token. In this case, we were holding the Handle inside the `spsc_fold` while awaiting the `pgb_writer.flush()` future. One may jump to the conclusion that we should simply peek into the spsc_fold to get that Timeline cancel token and be sensitive to it during flush, then. But that violates another principle of the design from https://github.com/neondatabase/neon/pull/8339. That is, that the page_service connection lifecycle and the Timeline lifecycles must be completely decoupled. It must be possible to shut down one shard without shutting down the page_service connection, because on that single connection we might be serving other shards attached to this pageserver. (The current compute client opens separate connections per shard, but, there are plans to change that.) # Solution This PR adds a `handle::WeakHandle` struct that does _not_ hold the timeline gate open. It must be `upgrade()`d to get a `handle::Handle`. That `handle::Handle` _does_ hold the timeline gate open. The batch queued inside the `spsc_fold` only holds a `WeakHandle`. 
We only upgrade it while calling into the various `handle_` methods, i.e., while interacting with the `Timeline` via `<Handle as Deref<Target=Timeline>>`. All that code has always been required to be (and is!) sensitive to `Timeline::cancel`, and therefore we're guaranteed to bail from it quickly when `Timeline::shutdown` starts. We will drop the `Handle` immediately, before we start `pgb_writer.flush()`ing the responses. Thereby letting go of our hold on the `GateGuard`, allowing the timeline shutdown to complete while the page_service handler remains intact. # Code Changes * Reproducer & Regression Test * Developed and proven to reproduce the issue in https://github.com/neondatabase/neon/pull/10399 * Add a `Test` message to the pagestream protocol (`cfg(feature = "testing")`). * Drive-by minimal improvement to the parsing code, we now have a `PagestreamFeMessageTag`. * Refactor `pageserver/client` to allow sending and receiving `page_service` requests independently. * Add a Rust helper binary to produce situation (4) from above * Rationale: (4) and (5) are the same bug class, we're holding a gate open while `flush()`ing. * Add a Python regression test that uses the helper binary to demonstrate the problem. * Fix * Introduce and use `WeakHandle` as explained earlier. * Replace the `shut_down` atomic with two enum states for `HandleInner`, wrapped in a `Mutex`. * To make `WeakHandle::upgrade()` and `Handle::downgrade()` cache-efficient: * Wrap the `Types::Timeline` in an `Arc` * Wrap the `GateGuard` in an `Arc` * The separate `Arc`s enable uncontended cloning of the timeline reference in `upgrade()` and `downgrade()`. If instead we were `Arc<Timeline>::clone`, different connection handlers would be hitting the same cache line on every upgrade()/downgrade(), causing contention. * Please read the updated module-level comment in `mod handle` for details. 
# Testing & Performance The reproducer test that failed before the changes now passes, and obviously other tests are passing as well. We'll do more testing in staging, where the issue happens every ~4h if chaos migrations are enabled in storcon. Existing perf testing will be sufficient, no perf degradation is expected. It's a few more allocations due to the added Arc's, but, they're low frequency. # Appendix: Why Compute Sometimes Doesn't Read Responses Remember, the whole problem surfaced because flush() was slow because Compute was not reading responses. Why is that? In short, the way the compute works, it only advances the page_service protocol processing when it has an interest in data, i.e., when the pagestore smgr is called to return pages. Thus, if compute issues a bunch of requests as part of prefetch but then it turns out it can service the query without reading those pages, it may very well happen that these messages stay in the TCP until the next smgr read happens, either in that session, or possibly in another session. If there’s too many unread responses in the TCP, the pageserver kernel is going to backpressure into userspace, resulting in our stuck flush(). All of this stems from the way vanilla Postgres does prefetching and "async IO": it issues `fadvise()` to make the kernel do the IO in the background, buffering results in the kernel page cache. It then consumes the results through synchronous `read()` system calls, which hopefully will be fast because of the `fadvise()`. If it turns out that some / all of the prefetch results are not needed, Postgres will not be issuing those `read()` system calls. The kernel will eventually react to that by reusing page cache pages that hold completed prefetched data. Uncompleted prefetch requests may or may not be processed -- it's up to the kernel. In Neon, the smgr + Pageserver together take on the role of the kernel in above paragraphs. 
In the current implementation, all prefetches are sent as GetPage requests to Pageserver. The responses are only processed in the places where vanilla Postgres would do the synchronous `read()` system call. If we never get to that, the responses are queued inside the TCP connection, which, once buffers run full, will backpressure into Pageserver's sending code, i.e., the `pgb_writer.flush()` that was the root cause of the problems we're fixing in this PR.	2025-01-16 20:34:02 +00:00
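
The `WeakHandle` / `Handle` split described in the commit above can be illustrated with a small sketch. This is not the pageserver's actual `mod handle` code: `Timeline` and `GateGuard` here are hypothetical stand-ins, the `Open` variant name and the `gate_probe` field with its `gate_open()` helper are invented purely to make the effect observable, and real cancellation and gate handling are omitted. It only shows the idea that a queued batch can keep a weak handle without keeping the gate open, and that shutdown is a state flip behind a `Mutex`, so a second shutdown is a no-op.

```rust
use std::ops::Deref;
use std::sync::{Arc, Mutex, Weak};

// Hypothetical stand-ins for the real pageserver types.
pub struct Timeline {
    pub name: String,
}
// The "gate" counts as open while any Arc<GateGuard> is alive.
pub struct GateGuard;

// Two states behind a Mutex, as the commit message describes.
enum HandleInner {
    Open {
        timeline: Arc<Timeline>,
        gate_guard: Arc<GateGuard>,
    },
    ShutDown,
}

// Does not keep the gate open once the timeline has been shut down.
pub struct WeakHandle {
    inner: Arc<Mutex<HandleInner>>,
    // Illustration only: lets us observe whether a guard is still alive.
    gate_probe: Weak<GateGuard>,
}

// Keeps the gate open for as long as it lives.
pub struct Handle {
    timeline: Arc<Timeline>,
    _gate_guard: Arc<GateGuard>,
}

impl WeakHandle {
    pub fn new(timeline: Timeline) -> Self {
        let gate_guard = Arc::new(GateGuard);
        let gate_probe = Arc::downgrade(&gate_guard);
        let inner = HandleInner::Open {
            timeline: Arc::new(timeline),
            gate_guard,
        };
        WeakHandle {
            inner: Arc::new(Mutex::new(inner)),
            gate_probe,
        }
    }

    // Returns a strong handle, or None once shutdown has happened.
    pub fn upgrade(&self) -> Option<Handle> {
        let guard = self.inner.lock().unwrap();
        match &*guard {
            HandleInner::Open { timeline, gate_guard } => Some(Handle {
                timeline: Arc::clone(timeline),
                _gate_guard: Arc::clone(gate_guard),
            }),
            HandleInner::ShutDown => None,
        }
    }

    // Drops the stored guard; calling it again is a harmless no-op.
    pub fn shutdown(&self) {
        let mut guard = self.inner.lock().unwrap();
        *guard = HandleInner::ShutDown;
    }

    pub fn gate_open(&self) -> bool {
        self.gate_probe.strong_count() > 0
    }
}

impl Deref for Handle {
    type Target = Timeline;
    fn deref(&self) -> &Timeline {
        &self.timeline
    }
}
```

In this sketch a caller would `upgrade()` only for the duration of the timeline interaction and drop the resulting `Handle` before flushing responses, so that after `shutdown()` the gate closes as soon as the last in-flight `Handle` is gone.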
Tristan Partin	b0838a68e5	Enable pgx_ulid on Postgres 17 (#10397 ) The extension now supports Postgres 17. The release also seems to be binary compatible with the previous version. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-16 19:49:04 +00:00
Christian Schwarz	65932512c1	run tests with futures-unordered	2025-01-16 20:03:01 +01:00
Christian Schwarz	1866f261e0	make mypy pass	2025-01-16 20:01:42 +01:00
John Spray	8f2ebc0684	tests: stabilize test_storage_controller_node_deletion (#10420 ) ## Problem `test_storage_controller_node_deletion` sometimes failed because shards were moving around during timeline creation, and neon_local isn't tolerant of that. The movements were unexpected because the shards had only just been created. This was a regression from #9916 Closes: #10383 ## Summary of changes - Make this test use multiple AZs -- this makes the storage controller's scheduling reliably stable Why this works: in #9916 , I made a simplifying assumption that we would have multiple AZs to get nice stable scheduling -- it's much easier, because each tenant has a well defined primary+secondary location when they have an AZ preference and nodes have different AZs. Everything still works if you don't have multiple AZs, but you just have this quirk that sometimes the optimizer can disagree with initial scheduling, so once in a while a shard moves after being created -- annoying for tests, harmless IRL.	2025-01-16 19:00:16 +00:00
Vlad Lazar	3a285a046b	pageserver: include node id when subscribing to SK (#10432 ) ## Problem All pageservers have the same application name which makes it hard to distinguish them. ## Summary of changes Include the node id in the application name sent to the safekeeper. This should give us more visibility in logs. There's a few metrics that will increase in cardinality by `pageserver_count`, but that's fine.	2025-01-16 18:51:56 +00:00
Christian Schwarz	7c662b771a	Merge branch 'problame/hung-shutdown/fix' into vlad/read-path-concurrent-io	2025-01-16 19:22:38 +01:00
Christian Schwarz	8f40bd4eb3	there is no `Error` Fe message -,-	2025-01-16 19:21:44 +01:00
John Spray	da13154791	storcon: revise fill logic to prioritize AZ (#10411 ) ## Problem Node fills were limited to moving (total shards / node_count) shards. In systems that aren't perfectly balanced already, that leads us to skip migrating some of the shards that belong on this node, generating work for the optimizer later to gradually move them back. ## Summary of changes - Where a shard has a preferred AZ and is currently attached outside this AZ, then always promote it during fill, irrespective of target fill count	2025-01-16 17:33:46 +00:00
Christian Schwarz	d2f8342080	Merge branch 'problame/hung-shutdown/fix' into vlad/read-path-concurrent-io	2025-01-16 18:16:36 +01:00
Christian Schwarz	92e4dd7ffa	script: template NEON_REPO_DIR	2025-01-16 18:14:34 +01:00
Christian Schwarz	0c3ab9c494	move test message tag to 99 and represent Fe message tag as enum, like we do for Be message	2025-01-16 18:07:56 +01:00
John Spray	2e13a3aa7a	storage controller: handle legacy TenantConf in consistency_check (#10422 ) ## Problem We were comparing serialized configs from the database with serialized configs from memory. If fields have been added/removed to TenantConfig, this generates spurious consistency errors. This is fine in test environments, but limits the usefulness of this debug API in the field. Closes: https://github.com/neondatabase/neon/issues/10369 ## Summary of changes - Do a decode/encode cycle on the config before comparing it, so that it will have exactly the expected fields.	2025-01-16 16:56:44 +00:00
Christian Schwarz	c19a16792a	address nit ; https://github.com/neondatabase/neon/pull/10386#discussion_r1918782034	2025-01-16 17:54:14 +01:00
Christian Schwarz	cf75eb7d86	Revert "hacky experiment: what if we had more walredo procs => doesn't move the needle on throughput" This reverts commit `9fffe6e60d`.	2025-01-16 16:46:49 +01:00
Christian Schwarz	6ededa17e2	Revert "experiment: buffered socket with 128k buffer size; not super needle-moving" This reverts commit `7e13e5fc4a`.	2025-01-16 16:42:10 +01:00
Christian Schwarz	7e13e5fc4a	experiment: buffered socket with 128k buffer size; not super needle-moving	2025-01-16 16:42:01 +01:00
Christian Schwarz	45358bcb65	in the deepl_layers_with_delta script, make the stack height an argument	2025-01-16 16:41:15 +01:00
Alex Chi Z.	cccc196848	refactor(pageserver): make partitioning an ArcSwap (#10377 ) ## Problem gc-compaction needs the partitioning data to decide the job split. This refactor allows concurrent access/computing the partitioning. ## Summary of changes Make `partitioning` an ArcSwap so that others can access the partitioning while we compute it. Fully eliminate the `repartition is called concurrently` warning when gc-compaction is going on. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-16 15:33:37 +00:00
Arpad Müller	e436dcad57	Rename "disabled" safekeeper scheduling policy to "pause" (#10410 ) Rename the safekeeper scheduling policy "disabled" to "pause". A rename was requested in https://github.com/neondatabase/neon/pull/10400#discussion_r1916259124, as the "disabled" policy is meant to be analogous to the "pause" policy for pageservers. Also simplify the `SkSchedulingPolicyArg::from_str` function, relying on the `from_str` implementation of `SkSchedulingPolicy`. Latter is used for the database format as well, so it is quite stable. If we ever want to change the UI, we'll need to duplicate the function again but this is cheap.	2025-01-16 14:30:49 +00:00
John Spray	21d7b6a258	tests: refactor test_tenant_delete_races_timeline_creation (#10425 ) ## Problem Threads spawned in `test_tenant_delete_races_timeline_creation` are not joined before the test ends, and can generate `PytestUnhandledThreadExceptionWarning` in other tests. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10419/12805365523/index.html#/testresult/53a72568acd04dbd ## Summary of changes - Wrap threads in ThreadPoolExecutor which will join them before the test ends - Remove a spurious deletion call -- the background thread doing deletion ought to succeed.	2025-01-16 14:11:33 +00:00

7107 Commits