rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-16 09:52:54 +00:00

Author	SHA1	Message	Date
Christian Schwarz	4e72b22b41	make noise from IoConcurrency::drop instead of the task, for more context	2025-01-21 14:51:54 +01:00
Christian Schwarz	4014c390e2	initial logical size calculation can also reasonably use the sidecar because it's concurrency-limited	2025-01-21 12:50:33 +01:00
Christian Schwarz	bca4263eb8	inspect_image_layer can also have an IoConcurrency root, it's tests only	2025-01-21 12:40:22 +01:00
Christian Schwarz	a958febd7a	reference issue that will remove hard-coded sequential()	2025-01-21 12:36:23 +01:00
Christian Schwarz	fc27da43ff	one more test can do without it	2025-01-21 12:25:30 +01:00
Christian Schwarz	cf2f0c27aa	IoConcurrency roots for scan() and tests	2025-01-21 12:21:46 +01:00
Christian Schwarz	f54c5d5596	turns out create_image_layers is easy	2025-01-21 10:47:49 +01:00
Christian Schwarz	ce5452d2e5	followup `0a37164c29`: also rename IoConcurrency::serial()	2025-01-21 00:47:37 +01:00
Christian Schwarz	081ff26519	fixup `40ab9c2c5e`: deadlock. Reproduced by `test_runner/regress/test_branching.py::test_branching_with_pgbench[debug-pg16-flat-1-10]`. It kinda makes sense that this deadlocks in `sequential` mode. However, it also deadlocks in `sidecar-task` mode. I don't understand why.	2025-01-20 23:46:56 +01:00
Christian Schwarz	0a37164c29	replace env var with config variable; add test suite fixture env var to override default	2025-01-20 23:46:56 +01:00
Christian Schwarz	0eff09e35f	Merge remote-tracking branch 'origin/main' into vlad/read-path-concurrent-io	2025-01-20 19:47:03 +01:00
Christian Schwarz	cdad6b2de5	we don't need the CancellationToken, the ios_rx.recv() will fail at the same time	2025-01-20 19:13:40 +01:00
Christian Schwarz	82d20b52b3	make noise when dropping an IoConcurrency with unfinished requests	2025-01-20 19:12:00 +01:00
Christian Schwarz	72130d7d6c	fix(page_service / handle): panic when parallel client disconnect & Timeline shutdown (#10445) ## Refs - fixes https://github.com/neondatabase/neon/issues/10444 ## Problem We're seeing a panic `handles are only shut down once in their lifetime` in our performance testbed. ## Hypothesis Annotated code in https://github.com/neondatabase/neon/issues/10444#issuecomment-2602286415. ``` T1: drop Cache, executes up to (1) => HandleInner is now in state ShutDown T2: Timeline::shutdown => PerTimelineState::shutdown executes shutdown() again => panics ``` Likely this snuck in with the final touches of #10386, where I narrowed down the locking rules. ## Summary of changes Make duplicate shutdowns a no-op.	2025-01-20 17:51:30 +00:00
Christian Schwarz	3b1328423e	basebackup: fetch all SLRUs of one basebackup using the same IoConcurrency	2025-01-20 16:58:14 +01:00
Christian Schwarz	02fc58b878	impr(timeline handles): add more tests covering reference cycle (#10446) The other tests focus on external interface usage, while the tests added in this PR add some testing around HandleInner's lifecycle, ensuring we don't leak it once either the connection gets dropped or the per-timeline-state is shut down explicitly.	2025-01-20 14:37:24 +00:00
Arpad Müller	b312a3c320	Move DeleteTimelineFlow::prepare to separate function and use enum (#10334) It was requested by review in #10305 to use an enum or something like it for distinguishing the different modes instead of two parameters, because two flags allow four combinations, and two of them don't really make sense / aren't used. follow-up of #10305	2025-01-20 12:50:44 +00:00
John Spray	8bdaee35f3	pageserver: safety checks on validity of uploaded indices (#10403 ) ## Problem Occasionally, we encounter bugs in test environments that can be detected at the point of uploading an index, but we proceed to upload it anyway and leave a tenant in a broken state that's awkward to handle. ## Summary of changes - Validate index when submitting it for upload, so that we can see the issue quickly e.g. in an API invoking compaction - Validate index before executing the upload, so that we have a hard enforcement that any code path that tries to upload an index will not overwrite a valid index with an invalid one.	2025-01-20 09:20:31 +00:00
Christian Schwarz	2eb235e923	doc string explaining why we're deadlock free right now and why it's so brittle	2025-01-17 18:33:34 +01:00
Christian Schwarz	40ab9c2c5e	we can avoid adding the Arc<Mutex<>> around EphemeralLayer if we instead extend the lifetime of the InMemoryLayer for the spawned IO future; plus it's semantically more similar to what we now do for Delta and Image layers	2025-01-17 18:16:17 +01:00
Christian Schwarz	c43400389f	delta & image layer spawned IOs: keep layer resident until IO is done	2025-01-17 18:00:13 +01:00
Vlad Lazar	6975228a76	pageserver: add initdb metrics (#10434 ) ## Problem Initdb observability is poor. ## Summary of changes Add some metrics so we can figure out which part, if any, is slow. Closes https://github.com/neondatabase/neon/issues/10423	2025-01-17 14:51:33 +00:00
Christian Schwarz	c47c5f4ace	fix(page_service pipelining): tenant cannot shut down because gate kept open while flushing responses (#10386 ) # Refs - fixes https://github.com/neondatabase/neon/issues/10309 - fixup of batching design, first introduced in https://github.com/neondatabase/neon/pull/9851 - refinement of https://github.com/neondatabase/neon/pull/8339 # Problem `Tenant::shutdown` was occasionally taking many minutes (sometimes up to 20) in staging and prod if the `page_service_pipelining.mode="concurrent-futures"` is enabled. # Symptoms The issue happens during shard migration between pageservers. There is page_service unavailability and hence effectively downtime for customers in the following case: 1. The source (state `AttachedStale`) gets stuck in `Tenant::shutdown`, waiting for the gate to close. 2. Cplane/Storcon decides to transition the target `AttachedMulti` to `AttachedSingle`. 3. That transition comes with a bump of the generation number, causing the `PUT .../location_config` endpoint to do a full `Tenant::shutdown` / `Tenant::attach` cycle for the target location. 4. That `Tenant::shutdown` on the target gets stuck, waiting for the gate to close. 5. Eventually the gate closes (`close completed`), correlating with a `page_service` connection handler logging that it's exiting because of a network error (`Connection reset by peer` or `Broken pipe`). While in (4): - `Tenant::shutdown` is stuck waiting for all `Timeline::shutdown` calls to complete. So, really, this is a `Timeline::shutdown` bug. - retries from Cplane/Storcon to complete above state transitions, fail with errors related to the tenant mgr slot being in state `TenantSlot::InProgress`, the tenant state being `TenantState::Stopping`, and the timelines being in `TimelineState::Stopping`, and the `Timeline::cancel` being cancelled. - Existing (and/or new?) 
page_service connections log errors `error reading relation or page version: Not found: Timed out waiting 30s for tenant active state. Latest state: None` # Root-Cause After a lengthy investigation ([internal write-up](https://www.notion.so/neondatabase/2025-01-09-batching-deadlock-Slow-Log-Analysis-in-Staging-176f189e00478050bc21c1a072157ca4?pvs=4)) I arrived at the following root cause. The `spsc_fold` channel (`batch_tx`/`batch_rx`) that connects the Batcher and Executor stages of the pipelined mode was storing a `Handle` and thus `GateGuard` of the Timeline that was not shutting down. The design assumption with pipelining was that this would always be a short transient state. However, that was incorrect: the Executor was stuck on writing/flushing an earlier response into the connection to the client, i.e., socket write being slow because of TCP backpressure. The probable scenario of how we end up in that case: 1. Compute backend process sends a continuous stream of getpage prefetch requests into the connection, but never reads the responses (why this happens: see Appendix section). 2. Batch N is processed by Batcher and Executor, up to the point where Executor starts flushing the response. 3. Batch N+1 is processed by Batcher and queued in the `spsc_fold`. 4. Executor is still waiting for batch N flush to finish. 5. Batcher eventually hits the `TimeoutReader` error (10min). From here on it waits on the `spsc_fold.send(Err(QueryError(TimeoutReader_error)))` which will never finish because the batch already inside the `spsc_fold` is not being read by the Executor, because the Executor is still stuck in the flush. (This state is not observable at our default `info` log level) 6. Eventually, Compute backend process is killed (`close()` on the socket) or Compute as a whole gets killed (probably no clean TCP shutdown happening in that case). 7. Eventually, Pageserver TCP stack learns about (6) through RST packets and the Executor's flush() call fails with an error. 8. 
The Executor exits, dropping `cancel_batcher` and its end of the spsc_fold. This wakes Batcher, causing the `spsc_fold.send` to fail. Batcher exits. The pipeline shuts down as intended. We return from `process_query` and log the `Connection reset by peer` or `Broken pipe` error. The following diagram visualizes the wait-for graph at (5) ```mermaid flowchart TD Batcher --spsc_fold.send(TimeoutReader_error)--> Executor Executor --flush batch N responses--> socket.write_end socket.write_end --wait for TCP window to move forward--> Compute ``` # Analysis By holding the GateGuard inside the `spsc_fold` open, the pipelining implementation violated the principle established in (https://github.com/neondatabase/neon/pull/8339). That is, that `Handle`s must only be held across an await point if that await point is sensitive to the `<Handle as Deref<Target=Timeline>>::cancel` token. In this case, we were holding the Handle inside the `spsc_fold` while awaiting the `pgb_writer.flush()` future. One may jump to the conclusion that we should simply peek into the spsc_fold to get that Timeline cancel token and be sensitive to it during flush, then. But that violates another principle of the design from https://github.com/neondatabase/neon/pull/8339. That is, that the page_service connection lifecycle and the Timeline lifecycles must be completely decoupled. It must be possible to shut down one shard without shutting down the page_service connection, because on that single connection we might be serving other shards attached to this pageserver. (The current compute client opens separate connections per shard, but, there are plans to change that.) # Solution This PR adds a `handle::WeakHandle` struct that does _not_ hold the timeline gate open. It must be `upgrade()`d to get a `handle::Handle`. That `handle::Handle` _does_ hold the timeline gate open. The batch queued inside the `spsc_fold` only holds a `WeakHandle`. 
We only upgrade it while calling into the various `handle_` methods, i.e., while interacting with the `Timeline` via `<Handle as Deref<Target=Timeline>>`. All that code has always been required to be (and is!) sensitive to `Timeline::cancel`, and therefore we're guaranteed to bail from it quickly when `Timeline::shutdown` starts. We will drop the `Handle` immediately, before we start `pgb_writer.flush()`ing the responses, thereby letting go of our hold on the `GateGuard` and allowing the timeline shutdown to complete while the page_service handler remains intact. # Code Changes * Reproducer & Regression Test * Developed and proven to reproduce the issue in https://github.com/neondatabase/neon/pull/10399 * Add a `Test` message to the pagestream protocol (`cfg(feature = "testing")`). * Drive-by minimal improvement to the parsing code, we now have a `PagestreamFeMessageTag`. * Refactor `pageserver/client` to allow sending and receiving `page_service` requests independently. * Add a Rust helper binary to produce situation (4) from above * Rationale: (4) and (5) are the same bug class, we're holding a gate open while `flush()`ing. * Add a Python regression test that uses the helper binary to demonstrate the problem. * Fix * Introduce and use `WeakHandle` as explained earlier. * Replace the `shut_down` atomic with two enum states for `HandleInner`, wrapped in a `Mutex`. * To make `WeakHandle::upgrade()` and `Handle::downgrade()` cache-efficient: * Wrap the `Types::Timeline` in an `Arc` * Wrap the `GateGuard` in an `Arc` * The separate `Arc`s enable uncontended cloning of the timeline reference in `upgrade()` and `downgrade()`. If instead we were to `Arc<Timeline>::clone`, different connection handlers would be hitting the same cache line on every upgrade()/downgrade(), causing contention. * Please read the updated module-level comment in `mod handle` for details. 
# Testing & Performance The reproducer test that failed before the changes now passes, and obviously other tests are passing as well. We'll do more testing in staging, where the issue happens every ~4h if chaos migrations are enabled in storcon. Existing perf testing will be sufficient, no perf degradation is expected. It's a few more allocations due to the added `Arc`s, but they're low frequency. # Appendix: Why Compute Sometimes Doesn't Read Responses Remember, the whole problem surfaced because flush() was slow because Compute was not reading responses. Why is that? In short, the way the compute works, it only advances the page_service protocol processing when it has an interest in data, i.e., when the pagestore smgr is called to return pages. Thus, if compute issues a bunch of requests as part of prefetch but then it turns out it can service the query without reading those pages, it may very well happen that these messages stay in the TCP until the next smgr read happens, either in that session, or possibly in another session. If there are too many unread responses in the TCP, the pageserver kernel is going to backpressure into userspace, resulting in our stuck flush(). All of this stems from the way vanilla Postgres does prefetching and "async IO": it issues `fadvise()` to make the kernel do the IO in the background, buffering results in the kernel page cache. It then consumes the results through synchronous `read()` system calls, which hopefully will be fast because of the `fadvise()`. If it turns out that some / all of the prefetch results are not needed, Postgres will not be issuing those `read()` system calls. The kernel will eventually react to that by reusing page cache pages that hold completed prefetched data. Uncompleted prefetch requests may or may not be processed -- it's up to the kernel. In Neon, the smgr + Pageserver together take on the role of the kernel in above paragraphs. 
In the current implementation, all prefetches are sent as GetPage requests to Pageserver. The responses are only processed in the places where vanilla Postgres would do the synchronous `read()` system call. If we never get to that, the responses are queued inside the TCP connection, which, once buffers run full, will backpressure into Pageserver's sending code, i.e., the `pgb_writer.flush()` that was the root cause of the problems we're fixing in this PR.	2025-01-16 20:34:02 +00:00
Vlad Lazar	3a285a046b	pageserver: include node id when subscribing to SK (#10432) ## Problem All pageservers have the same application name, which makes it hard to distinguish them. ## Summary of changes Include the node id in the application name sent to the safekeeper. This should give us more visibility in logs. There are a few metrics that will increase in cardinality by `pageserver_count`, but that's fine.	2025-01-16 18:51:56 +00:00
Christian Schwarz	d2f8342080	Merge branch 'problame/hung-shutdown/fix' into vlad/read-path-concurrent-io	2025-01-16 18:16:36 +01:00
Christian Schwarz	c19a16792a	address nit ; https://github.com/neondatabase/neon/pull/10386#discussion_r1918782034	2025-01-16 17:54:14 +01:00
Christian Schwarz	cf75eb7d86	Revert "hacky experiment: what if we had more walredo procs => doesn't move the needle on throughput" This reverts commit `9fffe6e60d`.	2025-01-16 16:46:49 +01:00
Christian Schwarz	6ededa17e2	Revert "experiment: buffered socket with 128k buffer size; not super needle-moving" This reverts commit `7e13e5fc4a`.	2025-01-16 16:42:10 +01:00
Christian Schwarz	7e13e5fc4a	experiment: buffered socket with 128k buffer size; not super needle-moving	2025-01-16 16:42:01 +01:00
Alex Chi Z.	cccc196848	refactor(pageserver): make partitioning an ArcSwap (#10377 ) ## Problem gc-compaction needs the partitioning data to decide the job split. This refactor allows concurrent access/computing the partitioning. ## Summary of changes Make `partitioning` an ArcSwap so that others can access the partitioning while we compute it. Fully eliminate the `repartition is called concurrently` warning when gc-compaction is going on. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-16 15:33:37 +00:00
Christian Schwarz	9fffe6e60d	hacky experiment: what if we had more walredo procs => doesn't move the needle on throughput	2025-01-16 13:58:23 +01:00
Alex Chi Z.	a753349cb0	feat(pageserver): validate data integrity during gc-compaction (#10131 ) ## Problem part of https://github.com/neondatabase/neon/issues/9114 part of investigation of https://github.com/neondatabase/neon/issues/10049 ## Summary of changes * If `cfg!(test) or cfg!(feature = testing)`, then we will always try generating an image to ensure the history is replayable, but not put the image layer into the final layer results, therefore discovering wrong key history before we hit a read error. * I suspect it's easier to trigger some races if gc-compaction is continuously run on a timeline, so I increased the frequency to twice per 10 churns. * Also, create branches in gc-compaction smoke tests to get more test coverage. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad@neon.tech>	2025-01-15 22:04:06 +00:00
Christian Schwarz	66c0df8109	doc comment on BatchedFeMessage explaining WeakHandle; https://github.com/neondatabase/neon/pull/10386#discussion_r1916968951	2025-01-15 21:50:00 +01:00
Christian Schwarz	9fe77c527f	inline get_impl; https://github.com/neondatabase/neon/pull/10386#discussion_r1916939623	2025-01-15 21:47:39 +01:00
Christian Schwarz	7fb4595c7e	fix: WeakHandle was holding on to the Timeline allocation This made test_timeline_deletion_with_files_stuck_in_upload_queue fail because the RemoteTimelineClient was being kept alive. The fix is to stop keeping the timeline alive from WeakHandle.	2025-01-15 21:46:37 +01:00
Christian Schwarz	350dc251df	test case demonstrates the issue: we hold the Timeline object alive --- STDERR: pageserver tenant::timeline::handle::tests::test_weak_handles --- thread 'tenant::timeline::handle::tests::test_weak_handles' panicked at pageserver/src/tenant/timeline/handle.rs:1131:9: assertion `left == right` failed left: 3 right: 2	2025-01-15 21:46:30 +01:00
John Spray	fb0e2acb2f	pageserver: add `page_trace` API for debugging (#10293 ) ## Problem When a pageserver is receiving high rates of requests, we don't have a good way to efficiently discover what the client's access pattern is. Closes: https://github.com/neondatabase/neon/issues/10275 ## Summary of changes - Add `/v1/tenant/x/timeline/y/page_trace?size_limit_bytes=...&time_limit_secs=...` API, which returns a binary buffer. - Add `pagectl page-trace` tool to decode and analyze the output. --------- Co-authored-by: Erik Grinaker <erik@neon.tech>	2025-01-15 19:07:22 +00:00
Christian Schwarz	5b77a6d3ce	address clippy	2025-01-15 19:38:21 +01:00
Christian Schwarz	8c5005ff59	rename IoConcurrency::{todo=>serial} and remove deprecation warning	2025-01-15 19:38:05 +01:00
Christian Schwarz	f8218ac5fc	Revert "investigation: add log_if_slow => shows that the io_futures are slow" This reverts commit `e81fa7137e`.	2025-01-15 19:34:37 +01:00
Christian Schwarz	40470c66cd	remove opportunistic poll, it seems slightly beneficial for perf; esp. before I remembered to configure pipelining, the unpipelined configuration achieved ~10% higher tput. Either way, it makes sense to not do the opportunistic polling because it registers the wrong waker.	2025-01-15 19:34:05 +01:00
Christian Schwarz	8fafff37c5	remove the whole barriers business	2025-01-15 19:00:00 +01:00
Christian Schwarz	e81fa7137e	investigation: add log_if_slow => shows that the io_futures are slow	2025-01-15 18:56:07 +01:00
Christian Schwarz	351da2349e	Merge branch 'problame/hung-shutdown/fix' into vlad/read-path-concurrent-io	2025-01-15 17:09:02 +01:00
Christian Schwarz	c545d227b9	review doc comment	2025-01-15 16:24:39 +01:00
Christian Schwarz	a4fc6a92c9	fix cargo doc	2025-01-15 16:10:04 +01:00
Christian Schwarz	2205736262	doc comment & one fixup	2025-01-15 14:27:08 +01:00
Vlad Lazar	1577430408	safekeeper: decode and interpret for multiple shards in one go (#10201) ## Problem Currently, we call `InterpretedWalRecord::from_bytes_filtered` from each shard. To serve multiple shards at the same time, the API needs to allow for enquiring about multiple shards. ## Summary of changes This commit tweaks it in a pretty brute-force way. Naively, we could just generate the shard for a key, but pre and post split shards may be subscribed at the same time, so doing it efficiently is more complex.	2025-01-15 11:10:24 +00:00
Christian Schwarz	0340f00228	post-merge fix the handling of the new pagestream Test message, so that the regression test now passes. non-package-mode-py3.10 christian@neon-hetzner-dev-christian:[~/src/neon]: BUILD_TYPE=debug DEFAULT_PG_VERSION=16 poetry run pytest ./test_runner/regress/test_page_service_batching_regressions.py --timeout=0 --pdb	2025-01-14 23:56:35 +01:00
Christian Schwarz	366ff9ffcc	Merge branch 'problame/hung-shutdown/demo-hypothesis' into problame/hung-shutdown/fix	2025-01-14 23:51:53 +01:00
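
The `WeakHandle` idea described in commit c47c5f4ace (and the "duplicate shutdowns are a no-op" fix in 72130d7d6c) can be sketched roughly as below. This is a minimal, self-contained illustration under stated assumptions, not the pageserver's code: `Timeline`, `GateGuard`, `HandleInner`, `WeakHandle` and `Handle` are simplified stand-ins, the `Open`/`ShutDown` variant names, the `new`/`shutdown` helpers and the atomic guard counter are hypothetical, and the real types live in `mod handle` with a richer API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

// Simplified stand-in for the real Timeline (assumption).
struct Timeline {
    name: &'static str,
}

// Stand-in gate guard: bumps a counter while alive, so we can observe
// whether anything still holds the "gate" open (hypothetical mechanism).
struct GateGuard {
    open: Arc<AtomicUsize>,
}

impl GateGuard {
    fn new(open: &Arc<AtomicUsize>) -> Self {
        open.fetch_add(1, Ordering::SeqCst);
        GateGuard { open: Arc::clone(open) }
    }
}

impl Drop for GateGuard {
    fn drop(&mut self) {
        self.open.fetch_sub(1, Ordering::SeqCst);
    }
}

// Two enum states behind a Mutex, as the commit message describes.
enum HandleInner {
    Open { timeline: Arc<Timeline>, guard: Arc<GateGuard> },
    ShutDown,
}

// Safe to park in a queue: upgrading is the only way to reach the Timeline.
#[derive(Clone)]
struct WeakHandle {
    inner: Arc<Mutex<HandleInner>>,
}

// Holds the gate open for as long as it lives.
struct Handle {
    timeline: Arc<Timeline>,
    _guard: Arc<GateGuard>,
}

impl WeakHandle {
    fn new(timeline: Timeline, open: &Arc<AtomicUsize>) -> Self {
        let inner = HandleInner::Open {
            timeline: Arc::new(timeline),
            guard: Arc::new(GateGuard::new(open)),
        };
        WeakHandle { inner: Arc::new(Mutex::new(inner)) }
    }

    // Returns None once the timeline side has been shut down.
    fn upgrade(&self) -> Option<Handle> {
        let state = self.inner.lock().unwrap();
        match &*state {
            HandleInner::Open { timeline, guard } => Some(Handle {
                timeline: Arc::clone(timeline),
                _guard: Arc::clone(guard),
            }),
            HandleInner::ShutDown => None,
        }
    }

    // Drops the stored guard; calling it a second time is a no-op.
    fn shutdown(&self) {
        *self.inner.lock().unwrap() = HandleInner::ShutDown;
    }
}

// Returns (guards still open, whether the queued weak handle can upgrade).
fn demo() -> (usize, bool) {
    let open = Arc::new(AtomicUsize::new(0));
    let weak = WeakHandle::new(Timeline { name: "timeline-a" }, &open);
    let queued = weak.clone(); // what a queued batch would hold
    {
        let handle = queued.upgrade().expect("timeline is still open");
        println!("handling request against {}", handle.timeline.name);
    } // Handle dropped here, i.e. before any slow flush would start
    weak.shutdown();
    weak.shutdown(); // duplicate shutdown: no panic
    (open.load(Ordering::SeqCst), queued.upgrade().is_some())
}

fn main() {
    let (open_guards, upgradable) = demo();
    println!("open guards: {open_guards}, upgradable: {upgradable}");
}
```

The point of the sketch is only the ownership shape: the queued item holds a `WeakHandle`, a `Handle` exists just for the duration of request handling, and after shutdown the queued item can no longer reach the timeline or keep its gate open.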

2744 Commits