
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 05:00:37 +00:00

Author	SHA1	Message	Date
Alex Chi Z.	a4e3989c8d	fix(pageserver): make repartition error critical (#10872 ) ## Problem Read errors during repartition should be a critical error. ## Summary of changes <del>We only have one call site</del> We have two call sites of `repartition` where one of them is during the initial image upload optimization and another is during image layer creation, so I added a `critical!` here instead of inside `collect_keyspace`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-18 20:19:23 +00:00
Alex Chi Z.	538ea03f73	feat(pageserver): allow read path debug in getpagelsn API (#10748 ) ## Problem The usual workflow for me to debug read path errors in staging is: download the tenant to my laptop, import, and then run some read tests. With this patch, we can do this directly over staging pageservers. ## Summary of changes * Add a new `touchpagelsn` API that does a page read but does not return page info back. * Allow read from latest record LSN from get/touchpagelsn * Add read_debug config in the context. * The read path will read the context config to decide whether to enable read path tracing or not. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-18 18:54:53 +00:00
Erik Grinaker	cb8060545d	pageserver: don't log noop image compaction (#10873 ) ## Problem We log image compaction stats even when no image compaction happened. This is logged every 10 seconds for every timeline. ## Summary of changes Only log when we actually performed any image compaction.	2025-02-18 17:49:01 +00:00
Vlad Lazar	1a69a8cba7	storage: add APIs for warming up location after cold migrations (#10788) ## Problem We lack an API for warming up attached locations based on the heatmap contents. This is problematic in two places: 1. If we manually migrate and cut over while the secondary is still cold 2. When we re-attach a previously offloaded tenant ## Summary of changes https://github.com/neondatabase/neon/pull/10597 made heatmap generation additive across migrations, so we won't clobber it after a cold migration. This allows us to implement: 1. An endpoint for downloading all missing heatmap layers on the pageserver: `/v1/tenant/:tenant_shard_id/timeline/:timeline_id/download_heatmap_layers`. Only one such operation per timeline is allowed at any given time. The granularity is tenant shard. 2. An endpoint to the storage controller to trigger the downloads on the pageserver: `/v1/tenant/:tenant_shard_id/timeline/:timeline_id/download_heatmap_layers`. This works both at tenant and tenant shard level. If an unsharded tenant id is provided, the operation is started on all shards, otherwise only the specified shard. 3. A storcon cli command. Again, tenant and tenant-shard level granularities are supported. Cplane will call into storcon and trigger the downloads for all shards. When we want to rescue a migration, we will use storcon cli targeting the specific tenant shard. Related: https://github.com/neondatabase/neon/issues/10541	2025-02-18 16:09:06 +00:00
Alex Chi Z.	ed98f6d57e	feat(pageserver): log lease request (#10832 ) ## Problem To investigate https://github.com/neondatabase/cloud/issues/23650 ## Summary of changes We log lease requests to see why there are clients accessing things below gc_cutoff. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-18 16:06:39 +00:00
Alex Chi Z.	f9a063e2e9	test(pageserver): fix test_pageserver_gc_compaction_idempotent (#10833) ## Problem ref https://github.com/neondatabase/neon/issues/10517 ## Summary of changes For some reason the job split algorithm decides to have a different image coverage range for two compactions before/after restart. So we remove the subcompaction key range and let it generate an image covering the full range, which should make the test more stable. Also slightly tuned the logging span. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-18 16:06:20 +00:00
Alex Chi Z.	290f007b8e	Revert "feat(pageserver): repartition on L0-L1 boundary (#10548 )" (#10870 ) This reverts commit `443c8d0b4b`. ## Problem We observe a massive amount of compaction errors. ## Summary of changes If the tenant did not write any L1 layers (i.e., they accumulate L0 layers where number of them is below L0 threshold), image creation will always fail. Therefore, it's not correct to simply use the disk_consistent_lsn or L0/L1 boundary for the image creation.	2025-02-18 15:43:33 +00:00
Arseny Sher	d36baae758	Add gc_blocking and restore latest_gc_cutoff in openapi spec (#10867 ) ## Problem gc_blocking is missing in the tenant info, but cplane wants to use it. Also, https://github.com/neondatabase/neon/pull/10707/ removed latest_gc_cutoff from the spec, renaming it to applied_gc_cutoff. Temporarily get it back until cplane migrates. ## Summary of changes Add them. ref https://neondb.slack.com/archives/C03438W3FLZ/p1739877734963979	2025-02-18 13:57:12 +00:00
Erik Grinaker	84bbe87d60	pageserver: tweak `pageserver_layers_per_read` histogram resolution (#10847 ) ## Problem The current `pageserver_layers_per_read` histogram buckets don't represent the current reality very well. For the percentiles we care about (e.g. p50 and p99), we often see fairly high read amp, especially during ingestion, and anything below 4 can be considered very good. ## Summary of changes Change the per-timeline read amp histogram buckets to `[4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0]`.	2025-02-17 17:24:17 +00:00
Erik Grinaker	8a2d95b4b5	pageserver: appease unused lint on macOS (#10846 ) ## Problem `SmgrOpFlushInProgress::measure()` takes a `socket_fd` argument which is only used on Linux. This causes linter warnings on macOS. Touches #10823. ## Summary of changes Add a noop use of `socket_fd` on non-Linux branch.	2025-02-17 14:41:22 +00:00
Alex Chi Z.	ae091c6913	feat(pageserver): store reldir in sparse keyspace (#10593 ) ## Problem Part of https://github.com/neondatabase/neon/issues/9516 ## Summary of changes This patch adds the support for storing reldir in the sparse keyspace. All logic are guarded with the `rel_size_v2_enabled` flag, so if it's set to false, the code path is exactly the same as what's currently in prod. Note that we did not persist the `rel_size_v2_enabled` flag and the logic around it will be implemented in the next patch. (i.e., what if we enabled it, restart the pageserver, and then it gets set to false? we should still read from v2 using the rel_size_v2_migration_status in the index_part). The persistence logic I'll implement in the next patch will disallow switching from v2->v1 via config item. I also refactored the metrics so that it can work with the new reldir store. However, this metric is not correctly computed for reldirs (see the comments) before. With the refactor, the value will be computed only when we have an initial value for the reldir size. The refactor keeps the incorrectness of the computation when there are more than 1 database. For the tests, we currently run all the tests with v2, and I'll set it to false and add some v2-specific tests before merging, probably also v1->v2 migration tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-14 20:31:54 +00:00
Christian Schwarz	a32e8871ac	compute/pageserver: correlation of logs through backend PID (via `application_name`) (#10810 ) This PR makes compute set the `application_name` field to the PG backend process PID which is also included in each compute log line. This allows correlation of Pageserver connection logs with compute logs in a way that was guesswork before this PR. In future, we can switch for a more unique identifier for a page_service session. Refs - discussion in https://neondb.slack.com/archives/C08DE6Q9C3B/p1739465208296169?thread_ts=1739462628.361019&cid=C08DE6Q9C3B - fixes https://github.com/neondatabase/neon/issues/10808	2025-02-14 20:11:42 +00:00
Christian Schwarz	9177312ba6	basebackup: use `Timeline::get` for `get_rel` instead of `get_rel_page_at_lsn` (#10476 ) I noticed the opportunity to simplify here while working on https://github.com/neondatabase/neon/pull/9353 . The only difference is the zero-fill behavior: if one reads past rel size, `get_rel_page_at_lsn` returns a zeroed page whereas `Timeline::get` returns an error. However, the `endblk` is at most rel size large, because `nblocks` is eq `get_rel_size`, see a few lines above this change. We're using the same LSN (`self.lsn`) for everything, so there is no chance of non-determinism. Refs: - Slack discussion debating correctness: https://neondb.slack.com/archives/C033RQ5SPDH/p1737457010607119	2025-02-14 17:57:18 +00:00
Christian Schwarz	b992a1a62a	page_service: include socket send & recv queue length in slow flush log message (#10823) # Summary In - https://github.com/neondatabase/neon/pull/10813 we added slow flush logging but it didn't log the TCP send & recv queue length. This PR adds that data to the log message. I believe the implementation to be safe & correct right now, but it's brittle and thus this PR should be reverted or improved upon once the investigation is over. Refs: - stacked atop https://github.com/neondatabase/neon/pull/10813 - context: https://neondb.slack.com/archives/C08DE6Q9C3B/p1739464533762049?thread_ts=1739462628.361019&cid=C08DE6Q9C3B - improves https://github.com/neondatabase/neon/issues/10668 - part of https://github.com/neondatabase/cloud/issues/23515 # How It Works The trouble is two-fold: 1. getting to the raw socket file descriptor through the many Rust types that wrap it and 2. integrating with the `measure()` function Rust wraps it in types to model file descriptor lifetimes and ownership, and usually one can get access using `as_raw_fd()`. However, we `split()` the stream and the resulting [`tokio::io::WriteHalf`](https://docs.rs/tokio/latest/tokio/io/struct.WriteHalf.html) . Check the PR commit history for my attempts to do it. My solution is to get the socket fd before we wrap it in our protocol types, and to store that fd in the new `PostgresBackend::socket_fd` field. I believe it's safe because the lifetime of `PostgresBackend::socket_fd` value == the lifetime of the `TcpStream` that we wrap and store in `PostgresBackend::framed`. Specifically, the only place that close()s the socket is the `impl Drop for TcpStream`. I think the protocol stack calls `TcpStream::shutdown()`, but, that doesn't `close()` the file descriptor underneath. Regarding integration with the `measure()` function, the trouble is that `flush_fut` is currently a generic `Future` type. So, we just pass in the `socket_fd` as a separate argument.
A clean implementation would convert the `pgb_writer.flush()` to a named future that provides an accessor for the socket fd while not being polled. I tried (see PR history), but failed to break through the `WriteHalf`. # Testing Tested locally by running ``` ./target/debug/pagebench get-page-latest-lsn --num-clients=1000 --queue-depth=1000 ``` in one terminal, waiting a bit, then ``` pkill -STOP pagebench ``` then wait for slow logs to show up in `pageserver.log`. Pick one of the slow log message's port pairs, e.g., `127.0.0.1:39500`, and then checking sockstat output ``` ss -ntp \| grep '127.0.0.1:39500' ``` to ensure that send & recv queue size match those in the log message.	2025-02-14 16:20:07 +00:00
Christian Schwarz	fac5db3c8d	page_service: emit periodic log message while response flush is slow (#10813 ) The logic might seem a bit intricate / over-optimized, but I recently spent time benchmarking this code path in the context of a nightly pagebench regression (https://github.com/neondatabase/cloud/issues/21759) and I want to avoid regressing it any further. Ideally would also log the socket send & recv queue length like we do on the compute side in - https://github.com/neondatabase/neon/pull/10673 But that is proving difficult due to the Rust abstractions that wrap the socket fd. Work in progress on that is happening in - https://github.com/neondatabase/neon/pull/10823 Regarding production impact, I am worried at a theoretical level that the additional logging may cause a downward spiral in the case where a pageserver is slow to flush because there is not enough CPU. The logging would consume more CPU and thereby slow down flushes even more. However, I don't think this matters practically speaking. # Refs - context: https://neondb.slack.com/archives/C08DE6Q9C3B/p1739464533762049?thread_ts=1739462628.361019&cid=C08DE6Q9C3B - fixes https://github.com/neondatabase/neon/issues/10668 - part of https://github.com/neondatabase/cloud/issues/23515 # Testing Tested locally by running ``` ./target/debug/pagebench get-page-latest-lsn --num-clients=1000 --queue-depth=1000 ``` in one terminal, waiting a bit, then ``` pkill -STOP pagebench ``` then wait for slow logs to show up in `pageserver.log`. To see that the completion log message is logged, run ``` pkill -CONT pagebench ```	2025-02-14 14:37:03 +00:00
Arpad Müller	878c1c7110	offload_timeline: check if the timeline is archived on HasChildren error (#10776 ) PR #10305 makes sure that there is no actual race, i.e. we will never attempt to offload a timeline that has just been unarchived, or similar. However, if a timeline has been unarchived and has children that are unarchived too, we will get an error log line. Such races can occur as in compaction we check if the timeline can be offloaded way before we attempt to offload it: the result might change in the meantime. This patch checks if the delete guard can't be obtained because the timeline has unarchived children, and if yes, it does another check for whether the timeline has become unarchived or not. If it is unarchived, it just prints an info log msg and integrates itself into the error suppression logic of the compaction calling into it. If you squint at it really closely, there is still a possible race in which we print an error log, but this one is unlikely because the timeline and its children need to be archived right after the check for whether the timeline has any unarchived children, and right before the check whether the timeline is archived. Archival involves a network operation while nothing between these two checks does that, so it's very unlikely to happen in real life. https://github.com/neondatabase/cloud/issues/23979#issuecomment-2651265729	2025-02-14 10:21:50 +00:00
Christian Schwarz	487f3202fe	pageserver read path: abort on fatal IO errors from disk / filesystem (#10786 ) Before this PR, an IO error returned from the kernel, e.g., due to a bad disk, would get bubbled up, all the way to a user-visible query failing. This is against the IO error handling policy where we have established and is hence being rectified in this PR. [[(internal Policy document link)]](`bef44149f7/src/storage/handling_io_and_logical_errors.md (L33-L35)`) The practice on the write path seems to be that we call `maybe_fatal_err()` or `fatal_err()` fairly high up the stack. That is, regardless of whether std::fs, tokio::fs, or VirtualFile is used to perform the IO. For the read path, I choose a centralized approach in this PR by checking for errors as close to the kernel interface as possible. I believe this is better for long-term consistency. To mitigate the problem of missing context if we abort so far down in the stack, the `on_fatal_io_error` now captures and logs a backtrace. I grepped the pageserver code base for `fs::read` to convince myself that all non-VirtualFile reads already handle IO errors according to policy. Refs - fixes https://github.com/neondatabase/neon/issues/10454	2025-02-13 20:53:39 +00:00
Alex Chi Z.	6a741fd1c2	fix(pageserver): ensure all basebackup client errors are caught (#10793 ) ## Problem We didn't catch all client errors causing alerts. ## Summary of changes Client errors should be wrapped with ClientError so that it doesn't fire alerts. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-13 19:38:02 +00:00
Vlad Lazar	8fea43a5ba	pageserver: make heatmap generation additive (#10597 ) ## Problem Previously, when cutting over to cold secondary locations, we would clobber the previous, good, heatmap with a cold one. This is because heatmap generation used to include only resident layers. Once this merges, we can add an endpoint which triggers full heatmap hydration on attached locations to heal cold migrations. ## Summary of changes With this patch, heatmap generation becomes additive. If we have a heatmap from when this location was secondary, the new uploaded heatmap will be the result of a reconciliation between the old one and the on disk resident layers. More concretely, when we have the previous heatmap: 1. Filter the previous heatmap and keep layers that are (a) present in the current layer map, (b) visible, (c) not resident. Call this set of layers `visible_non_resident`. 2. From the layer map, select all layers that are resident and visible. Call this set of layers `resident`. 3. The new heatmap is the result of merging the two disjoint sets. Related https://github.com/neondatabase/neon/issues/10541	2025-02-13 12:48:47 +00:00
John Spray	b8095f84a0	pageserver: make true GC cutoff visible in admin API, rebrand `latest_gc_cutoff` as `applied_gc_cutoff` (#10707 ) ## Problem We expose `latest_gc_cutoff` in our API, and callers understandably were using that to validate LSNs for branch creation. However, this is _not_ the true GC cutoff from a user's point of view: it's just the point at which we last actually did GC. The actual cutoff used when validating branch creations and page_service reads is the min() of latest_gc_cutoff and the planned GC lsn in GcInfo. Closes: https://github.com/neondatabase/neon/issues/10639 ## Summary of changes - Expose the more useful min() of GC cutoffs as `gc_cutoff_lsn` in the API, so that the most obviously named field is really the one people should use. - Retain the ability to read the LSN at which GC was actually done, in an `applied_gc_cutoff_lsn` field. - Internally rename `latest_gc_cutoff_lsn` to `applied_gc_cutoff_lsn` ("latest" was a confusing name, as the value in GcInfo is more up to date in terms of what a user experiences) - Temporarily preserve the old `latest_gc_cutoff_lsn` field for compat with control plane until we update it to use the new field. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-02-13 10:33:47 +00:00
Arpad Müller	61d2474632	Also check by the planned gc cutoff for lease creation (#10764 ) We don't want to allow new leases below the planned gc cutoff either. Other APIs like branch creation or getpage requests already enforce this.	2025-02-12 19:29:17 +00:00
Erik Grinaker	f62047ae97	pageserver: add separate semaphore for L0 compaction (#10780 ) ## Problem L0 compaction frequently gets starved out by other background tasks and image/GC compaction. L0 compaction must be responsive to keep read amplification under control. Touches #10694. Resolves #10689. ## Summary of changes Use a separate semaphore for the L0-only compaction pass. * Add a `CONCURRENT_L0_COMPACTION_TASKS` semaphore and `BackgroundLoopKind::L0Compaction`. * Add a setting `compaction_l0_semaphore` (default off via `compaction_l0_first`). * Use the L0 semaphore when doing an `OnlyL0Compaction` pass. * Use the background semaphore when doing a regular compaction pass (which includes an initial L0 pass). * While waiting for the background semaphore, yield for L0 compaction if triggered. * Add `CompactFlags::NoYield` to disable L0 yielding, and set it for the HTTP API route. * Remove the old `use_compaction_semaphore` setting and compaction-scoped semaphore. * Remove the warning when waiting for a semaphore; it's noisy and we have metrics.	2025-02-12 16:12:21 +00:00
Erik Grinaker	71c30e52fa	pageserver: properly yield for L0 compaction (#10769 ) ## Problem When image compaction yields for L0 compaction, it may not immediately schedule L0 compaction, because it just goes on to compact the next pending timeline. Touches #10694. Requires #10744. ## Summary of changes Extend `CompactionOutcome` with `YieldForL0` and `Skipped` variants, and immediately schedule an L0 compaction pass in the `YieldForL0` case.	2025-02-11 23:43:58 +00:00
Erik Grinaker	6c83ac3fd2	pageserver: do all L0 compaction before image compaction (#10744 ) ## Problem Image compaction can starve out L0 compaction if a tenant has several timelines with L0 debt. Touches #10694. Requires #10740. ## Summary of changes * Add an initial L0 compaction pass, in order of L0 count. * Add a tenant option `compaction_l0_first` to control the L0 pass (disabled by default). * Add `CompactFlags::OnlyL0Compaction` to run an L0-only compaction pass. * Clean up the compaction iteration logic. A later PR will use separate semaphores for the L0 and image compaction passes to avoid cross-tenant L0 starvation. That PR will also make image compaction yield if _any_ of the tenant's timelines have pending L0 compaction to further avoid starvation.	2025-02-11 22:08:46 +00:00
Heikki Linnakangas	635b67508b	Split utils::http to separate crate (#10753 ) Avoids compiling the crate and its dependencies into binaries that don't need them. Shrinks the compute_ctl binary from about 31MB to 28MB in the release-line-debug-size-lto profile.	2025-02-11 22:06:53 +00:00
Arpad Müller	f7b2293317	Hardlink resident layers during detach ancestor (#10729 ) After a detach ancestor operation, we don't want to on-demand download layers that are already resident. This has shown to impede performance, sometimes quite a lot (50 seconds: https://github.com/neondatabase/neon/issues/8828#issuecomment-2643735644) Fixes #8828.	2025-02-11 16:58:34 +00:00
Arpad Müller	be447ba4f8	Change timeline_offloading setting default to true (#10760 ) This changes the default value of the `timeline_offloading` pageserver and tenant configs to true, now that offloading has been rolled out without problems. There is also a small fix in the tenant config merge function, where we applied the `lazy_slru_download` value instead of `timeline_offloading`. Related issue: https://github.com/neondatabase/cloud/issues/21353	2025-02-11 16:36:54 +00:00
Christian Schwarz	9247331c67	fix(page_service / batching): smgr op latency metric of dropped responses include flush time (#10756) # Problem Say we have a batch of 10 responses to send out. Then, even with - #10728 we've still only called observe_execution_end_flush_start for the first 3 responses. The remaining 7 response timers are still ticking. When compute now closes the connection, the waiting flush fails with an error and we `drop()` the remaining 7 responses' smgr op timers. The `impl Drop for SmgrOpTimer` will observe an execution time that includes the flush time. In practice, this is suspected to produce the `+Inf` observations in the smgr op latency histogram we've seen since the introduction of pipelining, even after shipping #10728. refs: - fixup of https://github.com/neondatabase/neon/pull/10042 - fixup of https://github.com/neondatabase/neon/pull/10728 - fixes https://github.com/neondatabase/neon/issues/10754	2025-02-11 14:05:59 +00:00
Alex Chi Z.	b0c7ee0175	feat(pageserver): better gc_compaction_split heuristics (#10727 ) ## Problem close https://github.com/neondatabase/neon/issues/10213 `range_search` only returns the top-most layers that may satisfy the search, so it doesn't include all layers that might be accessed (the user needs to recursively call this function). We need to retrieve the full layer map and find overlaps in order to have a correct heuristics of the job split. ## Summary of changes Retrieve all layers and find overlaps instead of doing `range_search`. The patch also reduces the time holding the layer map read guard. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-10 19:33:34 +00:00
Erik Grinaker	8c4e94107d	pageserver: notify compaction loop at threshold (#10740 ) ## Problem The compaction loop currently runs periodically, which can cause it to wait for up to 20 seconds before starting L0 compaction by default. Also, when we later separate the semaphores for L0 compaction and image compaction, we want to give up waiting for the image compaction semaphore if L0 compaction is needed on any timeline. Touches #10694. ## Summary of changes Notify the compaction loop when an L0 flush (on any timeline) exceeds `compaction_threshold`. Also do some opportunistic cleanups in the area.	2025-02-10 17:48:09 +00:00
Alex Chi Z.	b37f52fdf1	feat(pageserver): dump read path on missing key error (#10528 ) ## Problem helps investigate https://github.com/neondatabase/neon/issues/10482 ## Summary of changes In debug mode and testing mode, we will record all files visited by a read operation, and print it out when it errors. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-10 14:25:56 +00:00
Alex Chi Z.	443c8d0b4b	feat(pageserver): repartition on L0-L1 boundary (#10548 ) ## Problem Reduce the read amplification when doing `repartition`. ## Summary of changes Compute the L0-L1 boundary LSN and do repartition here. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-10 14:25:48 +00:00
Erik Grinaker	ac55e2dbe5	pageserver: improve tenant housekeeping task (#10725 ) # Problem walredo shutdown is done in the compaction task. Let's move it to tenant housekeeping. # Summary of changes * Rename "ingest housekeeping" to "tenant housekeeping". * Move walredo shutdown into tenant housekeeping. * Add a constant `WALREDO_IDLE_TIMEOUT` set to 3 minutes (previously 10x compaction threshold).	2025-02-08 12:42:55 +00:00
Erik Grinaker	874accd6ed	pageserver: misc task cleanups (#10723 ) This patch does a bunch of superficial cleanups of `tenant::tasks` to avoid noise in subsequent PRs. There are no functional changes. PS: enable "hide whitespace" when reviewing, due to the unindentation of large async blocks.	2025-02-08 11:02:13 +00:00
Christian Schwarz	6cd3b501ec	fix(page_service / batching): smgr op latency metrics includes the flush time of preceding requests (#10728 ) Before this PR, if a batch contains N responses, the smgr op latency reported for response (N-i) would include the time we spent flushing the preceding requests. refs: - fixup of https://github.com/neondatabase/neon/pull/10042 - fixes https://github.com/neondatabase/neon/issues/10674	2025-02-08 09:28:09 +00:00
Christian Schwarz	bf20d78292	fix(page_service): page reconstruct error log does not include `shard_id` label (#10680 ) # Problem Before this PR, the `shard_id` field was missing when page_service logs a reconstruct error. This was caused by batching-related refactorings. Example from staging: ``` 2025-01-30T07:10:04.346022Z ERROR page_service_conn_main{peer_addr=...}:process_query{tenant_id=... timeline_id=...}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key ... ``` # Changes Delay creation of the handler-specific span until after shard routing This also avoids the need for the record() call in the pagestream hot path. # Testing Manual testing with a failpoint that is part of this PR's history but will be squashed away. # Refs - fixes https://github.com/neondatabase/neon/issues/10599	2025-02-07 19:45:39 +00:00
John Spray	9609f7547e	tests: address warnings in timeline shutdown (#10702 ) ## Problem There are a couple of log warnings tripping up `test_timeline_archival_chaos` - `[stopping left-over name="timeline_delete" tenant_shard_id=2d526292b67dac0e6425266d7079c253 timeline_id=Some(44ba36bfdee5023672c93778985facd9) kind=TimelineDeletionWorker\n')](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10672/13161357302/index.html#/testresult/716b997bb1d8a021)` - `ignoring attempt to restart exited flush_loop 503d8f401d8887cfaae873040a6cc193/d5eed0673ba37d8992f7ec411363a7e3\n')` Related: https://github.com/neondatabase/neon/issues/10389 ## Summary of changes - Downgrade the 'ignoring attempt to restart' to info -- there's nothing in the design that forbids this happening, i.e. someone calling maybe_spawn_flush_loop concurrently with shutdown() - Prevent timeline deletion tasks outliving tenants by carrying a gateguard. This logically makes sense because the deletion process does call into Tenant to update manifests.	2025-02-07 15:29:34 +00:00
Erik Grinaker	d6e87a3a9c	pageserver: add separate, disabled compaction semaphore (#10716 ) ## Problem L0 compaction can get starved by other background tasks. It needs to be responsive to avoid read amp blowing up during heavy write workloads. Touches #10694. ## Summary of changes Add a separate semaphore for compaction, configurable via `use_compaction_semaphore` (disabled by default). This is primarily for testing in staging; it needs further work (in particular to split image/L0 compaction jobs) before it can be enabled.	2025-02-07 15:11:31 +00:00
John Spray	08f92bb916	pageserver: clean up DeletionQueue push_layers_sync (#10701) ## Problem This is tech debt. While we introduced generations for tenants, some legacy situations without generations needed to delete things inline (async operation) instead of enqueuing them (sync operation). ## Summary of changes - Remove the async code, replace calls with the sync variant, and assert that the generation is always set	2025-02-07 13:03:01 +00:00
Erik Grinaker	2943590694	pageserver: use histogram for background job semaphore waits (#10697 ) ## Problem We don't have visibility into how long an individual background job is waiting for a semaphore permit. ## Summary of changes * Make `pageserver_background_loop_semaphore_wait_seconds` a histogram rather than a sum. * Add a paced warning when a task takes more than 10 minutes to get a permit (for now). * Drive-by cleanup of some `EnumMap` usage.	2025-02-06 17:17:47 +00:00
Alex Chi Z.	f22d41eaec	feat(pageserver): num of background job metrics (#10690) ## Problem We need metrics to know what's going on in pageserver's background jobs. ## Summary of changes * Waiting tasks: tasks still waiting for the semaphore. * Running tasks: tasks doing their actual jobs. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Erik Grinaker <erik@neon.tech>	2025-02-06 14:39:37 +00:00
Alexander Lakhin	977781e423	Enable sanitizers for postgres v17 (#10401 ) Add a build with sanitizers (asan, ubsan) to the CI pipeline and run tests on it. See https://github.com/neondatabase/neon/issues/6053 --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-02-06 12:53:43 +00:00
Arpad Müller	67b71538d0	Limit returned lsn for timestamp by the planned gc cutoff (#10678 ) Often the output of the timestamp->lsn API is used as input for branch creation, and branch creation takes the planned lsn into account, i.e. rejects lsn's as branch lsns that are before the planned lsn. This patch doesn't fix all race conditions, it's still racy. But at least it is a step into the right direction. For #10639	2025-02-06 11:17:08 +00:00
Erik Grinaker	f4cfa725b8	pageserver: add a few critical errors (#10657 ) ## Problem Following #10641, let's add a few critical errors. Resolves #10094. ## Summary of changes Adds the following critical errors: * WAL sender read/decode failure. * WAL record ingestion failure. * WAL redo failure. * Missing key during compaction. We don't add an error for missing keys during GetPage requests, since we've seen a handful of these in production recently, and the cause is still unclear (most likely a benign race).	2025-02-06 10:30:27 +00:00
Arpad Müller	05326cc247	Skip gc cutoff lsn check at timeline creation if lease exists (#10685 ) Right now, branch creation doesn't care if a lsn lease exists or not, it just fails if the passed lsn is older than either the last or the planned gc cutoff. However, if an lsn lease exists for a given lsn, we can actually create a branch at that point: nothing has been gc'd away. This prevents race conditions that #10678 still leaves around. Related: #10639 https://github.com/neondatabase/cloud/issues/23667	2025-02-06 10:10:11 +00:00
Arpad Müller	b66fbd6176	Warn on basebackups for archived timelines (#10688 ) We don't want any external requests for an archived timeline. This includes basebackup requests, i.e. when a compute is being started up. Therefore, we'd like to forbid such basebackup requests: any attempt to get a basebackup on an archived timeline (or any getpage request really) is a cplane bug. Make this a warning for now so that, if there is potentially a bug, we can detect cases in the wild before they cause stuck operations, but the intention is to return an error eventually. Related: #9548	2025-02-06 10:09:20 +00:00
Christian Schwarz	1686d9e733	perf(page_service): dont `.instrument(span.clone())` the response flush (#10686 ) On my AX102 Hetzner box, removing this line removes about 20us from the `latency_mean` result in `test_pageserver_characterize_latencies_with_1_client_and_throughput_with_many_clients_one_tenant`. If the same 20us can be removed in the nightly benchmark run, this will be a ~10% improvement because there, mean latencies are about ~220us. This span was added during batching refactors, we didn't have it before, and I don't think it's terribly useful. refs - https://github.com/neondatabase/cloud/issues/21759	2025-02-06 08:33:37 +00:00
Erik Grinaker	abcd00181c	pageserver: set a concurrency limit for LocalFS (#10676 ) ## Problem The local filesystem backend for remote storage doesn't set a concurrency limit. While it can't/won't enforce a concurrency limit itself, this also bounds the upload queue concurrency. Some tests create thousands of uploads, which slows down the quadratic scheduling of the upload queue, and there is no point spawning that many Tokio tasks. Resolves #10409. ## Summary of changes Set a concurrency limit of 100 for the LocalFS backend. Before: `test_layer_map[release-pg17].test_query: 68.338 s` After: `test_layer_map[release-pg17].test_query: 5.209 s`	2025-02-06 07:24:36 +00:00
Alex Chi Z.	0ceeec9be3	fix(pageserver): schedule compaction immediately if pending (#10684 ) ## Problem The code is intended to reschedule compaction immediately if there are pending tasks. We set the duration to 0 before if there are pending tasks, but this will go through the `if period == Duration::ZERO {` branch and sleep for another 10 seconds. ## Summary of changes Set duration to 1 so that it doesn't sleep for too long. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-05 22:11:50 +00:00
Alex Chi Z.	733a57247b	fix(pageserver): disallow gc-compaction produce l0 layer (#10679 ) ## Problem Any compaction should never produce l0 layers. This never happened in my experiments, but would be good to guard it early. ## Summary of changes Disallow gc-compaction to produce l0 layers. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-05 20:44:28 +00:00


2735 Commits