
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 21:42:56 +00:00

Author	SHA1	Message	Date
Vlad Lazar	a338984dc7	pageserver: support keys at different LSNs in one get page batch (#11494 ) ## Problem Get page batching stops when we encounter requests at different LSNs. We are leaving batching factor on the table. ## Summary of changes The goal is to support keys with different LSNs in a single batch and still serve them with a single vectored get. Important restriction: the same key at different LSNs is not supported in one batch. Returning different key versions is a much more intrusive change. Firstly, the read path is changed to support "scattered" queries. This is a conceptually simple step from https://github.com/neondatabase/neon/pull/11463. Instead of initializing the fringe for one keyspace, we do it for multiple at different LSNs and let the logic already present in the fringe handle selection. Secondly, page service code is updated to support batching at different LSNs. Each request parsed from the wire determines its effective request LSN and keeps it in memory for the batcher to inspect. The batcher allows keys at different LSNs in one batch as long as one key is not requested at different LSNs. I'd suggest doing the first pass commit by commit to get a feel for the changes. ## Results I used the batching test from [Christian's PR](https://github.com/neondatabase/neon/pull/11391) which increases the chance of batch breaks. Looking at the logs I think the new code is at the max batching factor for the workload (we only break batches due to them being oversized or because the executor is idle). 
``` Main: Reasons for stopping batching: {'LSN changed': 22843, 'of batch size': 33417} test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 14.6662 My branch: Reasons for stopping batching: {'of batch size': 37024} test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 19.8333 ``` Related: https://github.com/neondatabase/neon/issues/10765	2025-04-14 09:05:29 +00:00
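The batching rule from #11494 (keys at different LSNs may share a batch, but one key may not appear at two LSNs) can be illustrated with a minimal sketch. The `Key`, `Lsn`, `Batch`, and `BatchBreak` names below are hypothetical stand-ins, not the pageserver's actual types, and the real batcher considers more break reasons than these two.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the pageserver's key and LSN types.
type Key = u64;
type Lsn = u64;

/// Reasons this sketch stops adding requests to a batch.
#[derive(Debug, PartialEq)]
enum BatchBreak {
    /// The batch already holds `max_batch_size` requests.
    BatchFull,
    /// The same key is already in the batch at a different LSN.
    SameKeyDifferentLsn,
}

/// A batch of get-page requests that may span several LSNs.
struct Batch {
    max_batch_size: usize,
    /// Requests accepted so far, in arrival order.
    requests: Vec<(Key, Lsn)>,
    /// Effective request LSN recorded for each key in the batch.
    lsn_by_key: HashMap<Key, Lsn>,
}

impl Batch {
    fn new(max_batch_size: usize) -> Self {
        Batch {
            max_batch_size,
            requests: Vec::new(),
            lsn_by_key: HashMap::new(),
        }
    }

    /// Accept the request unless the batch is full or the key is already
    /// present at another LSN.
    fn try_push(&mut self, key: Key, lsn: Lsn) -> Result<(), BatchBreak> {
        if self.requests.len() >= self.max_batch_size {
            return Err(BatchBreak::BatchFull);
        }
        match self.lsn_by_key.get(&key) {
            Some(&existing) if existing != lsn => Err(BatchBreak::SameKeyDifferentLsn),
            _ => {
                self.lsn_by_key.insert(key, lsn);
                self.requests.push((key, lsn));
                Ok(())
            }
        }
    }
}
```

In this sketch a rejected request would start the next batch; requests for distinct keys are accepted regardless of their LSN.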
Alex Chi Z.	4f7b2cdd4f	feat(pageserver): gc-compaction result verification (#11515 ) ## Problem Part of #9114 There was a debug-mode verification mode that verifies at every retain_lsn. However, the code was tangled within the actual history generation itself and it's hard to reason about correctness. This patch adds a separate post-verification of the gc-compaction result that redos logs at every retain_lsn and every record above the GC horizon. This ensures that all key history we produce with gc-compaction is readable, and if there're read errors after gc-compaction, it can only be read-path errors instead of gc-compaction bugs. ## Summary of changes * Add gc_compaction_verification flag, default to true. * Implement a post-verification process. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-11 15:50:29 +00:00
Erik Grinaker	fd16caa7d0	pageserver: yield for L0 during ancestor compaction (#11536 ) ## Problem Shard ancestor compaction does not yield for L0 compaction, potentially starving it. close https://github.com/neondatabase/neon/issues/11125 ## Summary of changes * Yield for L0 during shard ancestor compaction. * Return `CompactionOutcome::Pending` when limited by `rewrite_max`, for eager rescheduling.	2025-04-11 15:09:28 +00:00
Arpad Müller	88f01c1ca1	Introduce WalIngestError (#11506 ) Introduces a `WalIngestError` struct together with a `WalIngestErrorKind` enum, to be used for walingest related failures and errors. * the enum captures backtraces, so we don't regress in comparison to `anyhow::Error`s (backtraces might be a bit shorter if we use one of the `anyhow::Error` wrappers) * it explicitly lists most/all of the potential cases that can occur. I've originally been inspired to do this in #11496, but it's a longer-term TODO.	2025-04-11 14:08:46 +00:00
Erik Grinaker	a6937a3281	pageserver: improve shard ancestor compaction logging (#11535 ) ## Problem Shard ancestor compaction always logs "starting shard ancestor compaction", even if there is no work to do. This is very spammy (every 20 seconds for every shard). It also has limited progress logging. ## Summary of changes * Only log "starting shard ancestor compaction" when there's work to do. * Include details about the amount of work. * Log progress messages for each layer, and when waiting for uploads. * Log when compaction is completed, with elapsed duration and whether there is more work for a later iteration.	2025-04-11 12:14:08 +00:00
Dmitrii Kovalkov	4c4e33bc2e	storage: add http/https server and cert resolver metrics (#11450 ) ## Problem We need to export some metrics about certs/connections to configure alerts and make sure that all HTTP requests are gone before turning https-only mode on. - Closes: https://github.com/neondatabase/cloud/issues/25526 ## Summary of changes - Add started connection and connection error metrics to http/https Server. - Add certificate expiration time and reload metrics to ReloadingCertificateResolver.	2025-04-11 06:11:35 +00:00
John Spray	9c37bfc90a	pageserver/tests: make image_layer_rewrite write less data (#11525 ) ## Problem This test is slow to execute, particularly if you're on a slow environment like vscode in a browser. Might have got much slower when we switched to direct IO? ## Summary of changes - Reduce the scale of the test by 10x, since there was nothing special about the original size.	2025-04-10 17:03:22 +00:00
Alex Chi Z.	f06d721a98	test(pageserver): ensure gc-compaction does not fire critical errors (#11513 ) ## Problem Part of https://github.com/neondatabase/neon/issues/10395 ## Summary of changes Add a test case to ensure gc-compaction doesn't fire any critical errors if the key history is invalid due to partial GC. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-10 14:53:37 +00:00
Dmitrii Kovalkov	8a72e6f888	pageserver: add enable_tls_page_service_api (#11508 ) ## Problem Page service doesn't use TLS for incoming requests. - Closes: https://github.com/neondatabase/cloud/issues/27236 ## Summary of changes - Add option `enable_tls_page_service_api` to pageserver config - Propagate `tls_server_config` to `page_service` if the option is enabled No integration tests for now because I didn't find out how to call page service API from python and AFAIK computes don't support TLS yet	2025-04-10 08:45:17 +00:00
Alex Chi Z.	405a17bf0b	fix(pageserver): ensure gc-compaction gets preempted by L0 (#11512 ) ## Problem Part of #9114 ## Summary of changes The gc-compaction flag was not set correctly, so gc-compaction was not preempted by L0. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-09 20:57:50 +00:00
Alex Chi Z.	2c21a65b0b	feat(pageserver): add gc-compaction time-to-first-item stats (#11475 ) ## Problem In some cases gc-compaction doesn't respond to the L0 compaction yield notifier. I suspect it's stuck on getting the first item, and if so, we probably need to let L0 yield notifier preempt `next_with_trace`. ## Summary of changes - Add `time_to_first_kv_pair` to gc-compaction statistics. - Inverse the ratio so that smaller ratio -> better compaction ratio. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-09 18:07:58 +00:00
Alex Chi Z.	ec66b788e2	fix(pageserver): use different walredo retry setting for gc-compaction (#11497 ) ## Problem Not a complete fix for https://github.com/neondatabase/neon/issues/11492 but should work for a short term. Our current retry strategy for walredo is to retry every request exactly once. This retry doesn't make sense because it retries all requests exactly once and each error is expected to cause process restart and cause future requests to fail. I'll explain it with a scenario of two threads requesting redos: one with an invalid history (that will cause walredo to panic) and another that has a correct redo sequence. First let's look at how we handle retries right now in do_with_walredo_process. At the beginning of the function it will spawn a new process if there's no existing one. Then it will continue to redo. If the process fails, the first process that encounters the error will remove the walredo process object from the OnceCell, so that the next time it gets accessed, a new process will be spawned; if it is the last one that uses the old walredo process, it will kill and wait the process in `drop(proc)`. I'm skeptical whether this works under races but I think this is not the root cause of the problem. In this retry handler, if there are N requests attached to a walredo process and the i-th request fails (panics the walredo), all other N-i requests will fail and they need to retry so that they can access a new walredo process. ``` time ----> proc A None B request 1 ^-----------------^ fail uses A for redo replace with None request 2 ^-------------------- fail uses A for redo request 3 ^----------------^ fail uses A for redo last ref, wait for A to be killed request 4 ^--------------- None, spawn new process B ``` The problem is with our retry strategy. Normally, for a system that we want to retry on, the probability of errors for each of the requests are uncorrelated. 
However, in walredo, a prior request that panics the walredo process will cause all future walredo on that process to fail (that's correlated). So, back to the situation where we have 2 requests where one will definitely fail and the other will succeed and we get the following sequence, where retry attempts = 1, * new walredo process A starts. * request 1 (invalid) being processed on A and panics A, waiting for retry, remove process A from the process object. * request 2 (valid) being processed on A and receives pipe broken / poisoned process error, waiting for retry, wait for A to be killed -- this very likely takes a while and cannot finish before request 1 gets processed again * new walredo process B starts. * request 1 (invalid) being processed again on B and panics B, the whole request fail. * request 2 (valid) being processed again on B, and get a poisoned error again. ``` time ----> proc A None B None request 1 ^-----------------^--------------^--------------------^ spawn A for redo fail spawn B for redo fail request 2 ^--------------------^-------------------------^------------^ use A for redo fail, wait to kill A B for redo fail again ``` In such cases, no matter how we set n_attempts, as long as the retry count applies to all requests, this sequence is bound to fail both requests because of how they get sequenced; while we could potentially make request 2 successful. There are many solutions to this -- like having a separate walredo manager for compactions, or define which errors are retryable (i.e., broken pipe can be retried, while real walredo error won't be retried), or having a exclusive big lock over the whole redo process (the current one is very fine-grained). In this patch, we go with a simple approach: use different retry attempts for different types of requests. For gc-compaction, the attempt count is set to 0, so that it never retries and consequently stops the compaction process -- no more redo will be issued from gc-compaction. 
Once the walredo process gets restarted, the normal read requests will proceed normally. ## Summary of changes Add redo_attempt for each reconstruct value request to set different retry policies. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Erik Grinaker <erik@neon.tech>	2025-04-09 18:01:31 +00:00
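The per-request retry policy from #11497 can be sketched as follows. This is an illustration only: `RedoAttemptType`, `max_retries`, and `redo_with_retries` are hypothetical names, and the retry counts (one retry for ordinary reads, zero for gc-compaction) are taken from the description above rather than from the actual code.

```rust
/// Hypothetical request kinds, each with its own retry budget.
#[derive(Clone, Copy, Debug, PartialEq)]
enum RedoAttemptType {
    /// Ordinary page reads keep the existing "retry once" behaviour.
    ReadPage,
    /// gc-compaction never retries, so a poisoned history stops the compaction.
    GcCompaction,
}

impl RedoAttemptType {
    fn max_retries(self) -> u32 {
        match self {
            RedoAttemptType::ReadPage => 1,
            RedoAttemptType::GcCompaction => 0,
        }
    }
}

/// Run `redo` until it succeeds or the retry budget for `kind` is exhausted.
/// The closure receives the zero-based attempt number.
fn redo_with_retries<T, E>(
    kind: RedoAttemptType,
    mut redo: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match redo(attempt) {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller.
            Err(err) if attempt >= kind.max_retries() => return Err(err),
            // Otherwise try again, presumably against a freshly spawned process.
            Err(_) => attempt += 1,
        }
    }
}
```

With this split, a gc-compaction request that panics the walredo process fails once and issues no further redo, while a concurrent read request still gets its single retry on the new process.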
Conrad Ludgate	72832b3214	chore: fix clippy lints from nightly-2025-03-16 (#11273 ) I like to run nightly clippy every so often to make our future rust upgrades easier. Some notable changes: * Prefer `next_back()` over `last()`. Generic iterators will implement `last()` to run forward through the iterator until the end. * Prefer `io::Error::other()`. * Use implicit returns One case where I haven't dealt with the issues is the now [more-sensitive "large enum variant" lint](https://github.com/rust-lang/rust-clippy/pull/13833). I chose not to take any decisions around it here, and simply marked them as allow for now.	2025-04-09 15:04:42 +00:00
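Two of the lint-driven changes listed in #11273 can be shown with standard-library calls only. The helper names `last_segment` and `to_io_error` are made up for this example and do not appear in the commit.

```rust
use std::io;

/// Return the final `/`-separated segment of `path`.
/// `next_back()` steps in from the end of a `DoubleEndedIterator`, whereas
/// the generic `last()` walks forward through every element.
fn last_segment(path: &str) -> Option<&str> {
    path.split('/').next_back()
}

/// Wrap a message in an `io::Error` of kind `Other`.
/// `io::Error::other(e)` is shorthand for
/// `io::Error::new(io::ErrorKind::Other, e)`.
fn to_io_error(msg: &str) -> io::Error {
    io::Error::other(msg.to_string())
}
```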
Vlad Lazar	d11f23a341	pageserver: refactor read path for multi LSN batching support (#11463 ) ## Problem We wish to improve pageserver batching such that one batch can contain requests for pages at different LSNs. The current shape of the code doesn't lend itself to the change. ## Summary of changes Refactor the read path such that the fringe gets initialized upfront. This is where the multi LSN change will plug in. A couple other small changes fell out of this. There should be NO behaviour change here. If you smell one, shout! I recommend reviewing commits individually (intentionally made them as small as possible). Related: https://github.com/neondatabase/neon/issues/10765	2025-04-09 13:17:02 +00:00
Dmitrii Kovalkov	e7502a3d63	pageserver: return 412 PreconditionFailed in get_timestamp_of_lsn if timestamp is not found (#11491 ) ## Problem Now `get_timestamp_of_lsn` returns `404 NotFound` if there are no clog pages for the given LSN, and it's difficult to distinguish from other 404 errors. A separate status code for this error will allow the control plane to handle this case. - Closes: https://github.com/neondatabase/neon/issues/11439 - Corresponding PR in control plane: https://github.com/neondatabase/cloud/pull/27125 ## Summary of changes - Return `412 PreconditionFailed` instead of `404 NotFound` if no timestamp is found for the given LSN. I looked briefly through the current error handling code in cloud.git and the status code change should not affect anything for the existing code. The change from the corresponding PR also looks fine and should work with the current PS status code. Additionally, here is the OK to merge it from the control plane team: https://github.com/neondatabase/neon/issues/11439#issuecomment-2789327552 --------- Co-authored-by: John Spray <john@neon.tech>	2025-04-09 13:16:15 +00:00
Arpad Müller	d2825e72ad	Add is_stopping check around critical macro in walreceiver (#11496 ) The timeline stopping state is set much earlier than the cancellation token is fired, so by checking for the stopping state, we can prevent races with timeline shutdown where we issue a cancellation error but the cancellation token hasn't been fired yet. Fix #11427.	2025-04-09 12:17:45 +00:00
Erik Grinaker	7679b63a2c	pageserver: persist stripe size in tenant manifest for tenant_import (#11181 ) ## Problem `tenant_import`, used to import an existing tenant from remote storage into a storage controller for support and debugging, assumed `DEFAULT_STRIPE_SIZE` since this can't be recovered from remote storage. In #11168, we are changing the stripe size, which will break `tenant_import`. Resolves #11175. ## Summary of changes * Add `stripe_size` to the tenant manifest. * Add `TenantScanRemoteStorageShard::stripe_size` and return from `tenant_scan_remote` if present. * Recover the stripe size during`tenant_import`, or fall back to 32768 (the original default stripe size). * Add tenant manifest compatibility snapshot: `2025-04-08-pgv17-tenant-manifest-v1.tar.zst` There are no cross-version concerns here, since unknown fields are ignored during deserialization where relevant.	2025-04-08 20:43:27 +00:00
Alex Chi Z.	a09c933de3	test(pageserver): add conditional append test record (#11476 ) ## Problem For future gc-compaction tests when we support https://github.com/neondatabase/neon/issues/10395 ## Summary of changes Add a new type of neon test WAL record that is conditionally applied (i.e., only when image == the specified value). We can use this to mock the situation where we lose some records in the middle, firing an error, and see how gc-compaction reacts to it. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-08 16:08:44 +00:00
Mikhail Kot	6138d61592	Object storage proxy (#11357 ) Service targeted for storing and retrieving LFC prewarm data. Can be used for proxying S3 access for Postgres extensions like pg_mooncake as well. Requests must include a Bearer JWT token. Token is validated using a pemfile (should be passed in infra/). Note: app is not tolerant to extra trailing slashes, see app.rs `delete_prefix` test for comments. Resolves: https://github.com/neondatabase/cloud/issues/26342 Unrelated changes: gate a `rename_noreplace` feature and disable it in `remote_storage` so as `object_storage` can be built with musl	2025-04-08 14:54:53 +00:00
Alex Chi Z.	0875dacce0	fix(pageserver): more aggressively yield in gc-compaction, degrade errors to warnings (#11469 ) ## Problem Fix various small issues discovered during gc-compaction rollout. ## Summary of changes - Log level changes: if errors are from gc-compaction, fire a warning instead of errors or critical errors. - Yield to L0 compaction more aggressively. Instead of checking every 1k keys, we check on every key. Sometimes a single key reconstruct takes a long time. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-07 21:19:06 +00:00
Erik Grinaker	99d8788756	pageserver: improve tenant manifest lifecycle (#11328 ) ## Problem Currently, the tenant manifest is only uploaded if there are offloaded timelines. The checks are also a bit loose (e.g. only checks number of offloaded timelines). We want to start using the manifest for other things too (e.g. stripe size). Resolves #11271. ## Summary of changes This patch ensures that a tenant manifest always exists. The lifecycle is: * During preload, fetch the existing manifest, if any. * During attach, upload a tenant manifest if it differs from the preloaded one (or does not exist). * Upload a new manifest as needed, if it differs from the last-known manifest (ignoring version number). * On splits, pre-populate the manifest from the parent. * During Pageserver physical GC, remove old manifests but keep the latest 2 generations. This will cause nearly all existing tenants to upload a new tenant manifest on their first attach after this change. Attaches are concurrency-limited in the storage controller, so we expect this will be fine. Also updates `make_broken` to automatically log at `INFO` level when the tenant has been cancelled, to avoid spurious error logs during shutdown.	2025-04-07 19:10:36 +00:00
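The "upload only if it differs from the last-known manifest (ignoring version number)" rule from #11328 can be sketched as below. The struct fields and the `needs_upload` helper are assumptions made for illustration; the real tenant manifest carries different data.

```rust
/// Hypothetical, simplified tenant manifest.
#[derive(Clone, Debug)]
struct TenantManifest {
    version: u32,
    stripe_size: Option<u32>,
    offloaded_timelines: Vec<String>,
}

impl TenantManifest {
    /// Compare the payload only; a bumped `version` alone is not a change.
    fn eq_ignoring_version(&self, other: &Self) -> bool {
        self.stripe_size == other.stripe_size
            && self.offloaded_timelines == other.offloaded_timelines
    }
}

/// Upload when no manifest is known yet, or when the payload changed.
fn needs_upload(last_known: Option<&TenantManifest>, current: &TenantManifest) -> bool {
    match last_known {
        None => true,
        Some(previous) => !previous.eq_ignoring_version(current),
    }
}
```

Under this rule a tenant with no preloaded manifest always uploads one on attach, which matches the note that nearly all existing tenants upload a manifest on their first attach after the change.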
Erik Grinaker	26c5c7e942	pageserver: set `Stopping` state on attach cancellation (#11462 ) ## Problem If a tenant is cancelled (e.g. due to Pageserver shutdown) during attach, it is set to `Broken`. This results both in error log spam and 500 responses for clients -- shutdown is supposed to return 503 responses which can be retried. This becomes more likely to happen with #11328, where we perform tenant manifest downloads/uploads during attach. ## Summary of changes Set tenant state to `Stopping` when attach fails and the tenant is cancelled, downgrading the log messages to INFO. This introduces two variants of `Stopping` -- with and without a caller barrier -- where the latter is used to signal attach cancellation.	2025-04-07 17:56:56 +00:00
Alex Chi Z.	d37e90f430	fix(pageserver): allow shard ancestor compaction to be cancelled (#11452 ) ## Problem https://github.com/neondatabase/neon/issues/11330 https://github.com/neondatabase/neon/issues/11358 ## Summary of changes Looking at the staging log, a few tenants right after shard split are stuck on shutdown because they are running shard ancestor compaction. The compaction does not respect the cancellation token. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-07 16:01:21 +00:00
Christian Schwarz	aad410c8f1	improve ondemand-download latency observability (#11421 ) ## Problem We don't have metrics to exactly quantify the end user impact of on-demand downloads. Perf tracing is underway (#11140) to supply us with high-resolution samples. But it will also be useful to have some aggregate per-timeline and per-instance metrics that definitively contain all observations. ## Summary of changes This PR consists of independent commits that should be reviewed independently. However, for convenience, we're going to merge them together. - refactor(metrics): measure_remote_op can use async traits - impr(pageserver metrics): task_kind dimension for remote_timeline_client latency histo - implements https://github.com/neondatabase/cloud/issues/26800 - refs https://github.com/neondatabase/cloud/issues/26193#issuecomment-2769705793 - use the opportunity to rename the metric and add a _global suffix; checked grafana export, it's only used in two personal dashboards, one of them mine, the other by Heikki - log on-demand download latency for expensive-to-query but precise ground truth - metric for wall clock time spent waiting for on-demand downloads ## Refs - refs https://github.com/neondatabase/cloud/issues/26800 - a bunch of minor investigations / incidents into latency outliers	2025-04-04 18:04:39 +00:00
Christian Schwarz	4f94751b75	pageserver config: ignore+warn about unknown fields (instead of `deny_unknown_fields`) (#11275 ) # Refs - refs https://github.com/neondatabase/neon/issues/8915 - discussion thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1742406381132599 - stacked atop https://github.com/neondatabase/neon/pull/11298 - corresponding internal docs update that illustrates how this PR removes friction: https://github.com/neondatabase/docs/pull/404 # Problem Rejecting `pageserver.toml`s with unknown fields adds friction, especially when using `pageserver.toml` fields as feature flags that need to be decommissioned. See the added paragraphs on `pageserver_api::models::ConfigToml` for details on what kind of friction it causes. Also read the corresponding internal docs update linked above to see a more imperative guide for using `pageserver.toml` flags as feature flags. # Solution ## Ignoring unknown fields Ignoring is the serde default behavior. So, just remove `serde(deny_unknown_fields)` from all structs in `pageserver_api::config::ConfigToml` and `pageserver_api::config::TenantConfigToml`. I went through all the child fields and verified they don't use `deny_unknown_fields` either, including those shared with `pageserver_api::models`. ## Warning about unknown fields We still want to warn about unknown fields to - be informed about typos in the config template - be reminded about feature-flag style configs that have been cleaned up in code but not yet in config templates We tried `serde_ignore` (cf draft #11319) but it doesn't work with `serde(flatten)`. The solution we arrived at is to compare the on-disk TOML with the TOML that we produce if we serialize the `ConfigToml` again. Any key specified in the on-disk TOML but not present in the serialized TOML is flagged as an ignored key. The mechanism to do it is a tiny recursive descent visitor on the `toml_edit::DocumentMut`. 
# Future Work Invalid config _values_ in known fields will continue to fail pageserver startup. See - https://github.com/neondatabase/cloud/issues/24349 for current worst case impact to deployments & ideas to improve.	2025-04-04 17:30:58 +00:00
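The comparison described in #11275 (flag every key present in the on-disk TOML but absent from the re-serialized config) can be sketched without any TOML library. The `Value` enum below is a hypothetical stand-in for a parsed document; the actual change walks a `toml_edit::DocumentMut`, which this sketch does not attempt to reproduce.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a parsed TOML value: a scalar or a nested table.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Leaf(String),
    Table(BTreeMap<String, Value>),
}

/// Dotted paths of keys found in `on_disk` but missing from `reserialized`.
fn ignored_keys(
    on_disk: &BTreeMap<String, Value>,
    reserialized: &BTreeMap<String, Value>,
) -> Vec<String> {
    let mut out = Vec::new();
    visit(on_disk, reserialized, "", &mut out);
    out
}

/// Recursive descent over both tables in lockstep.
fn visit(
    on_disk: &BTreeMap<String, Value>,
    reserialized: &BTreeMap<String, Value>,
    prefix: &str,
    out: &mut Vec<String>,
) {
    for (key, disk_value) in on_disk {
        let path = if prefix.is_empty() {
            key.clone()
        } else {
            format!("{prefix}.{key}")
        };
        match (disk_value, reserialized.get(key)) {
            // The key did not survive the round trip: it was ignored.
            (_, None) => out.push(path),
            // Both sides are tables: descend and compare their children.
            (Value::Table(disk), Some(Value::Table(ser))) => visit(disk, ser, &path, out),
            // The key is known; its value is not validated here.
            _ => {}
        }
    }
}
```

The sketch only compares key presence, which mirrors the note above that invalid values in known fields are handled separately.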
Christian Schwarz	6ee84d985a	impr(perf tracing): ability to correlate with page_service logs (#11398 ) # Problem Current perf tracing fields do not allow answering the question what a specific Postgres backend was waiting for. # Background For Pageserver logs, we set the backend PID as the libpq `application_name` on the compute side, and funnel that into a tracing field for the spans that emit to the global tracing subscriber. # Solution Funnel `application_name`, and the other fields that we use in the logging spans, into the root span for perf tracing. # Refs - fixes https://github.com/neondatabase/neon/issues/11393 - stacked atop https://github.com/neondatabase/neon/pull/11433 - epic: https://github.com/neondatabase/neon/issues/9873	2025-04-04 15:13:54 +00:00
Vlad Lazar	1ef4258f29	pageserver: add tenant level performance tracing sampling ratio (#11433 ) ## Problem https://github.com/neondatabase/neon/pull/11140 introduces performance tracing with OTEL and a pageserver config which configures the sampling ratio of get page requests. Enabling a non-zero sampling ratio on a per region basis is too aggressive and comes with perf impact that isn't very well understood yet. ## Summary of changes Add a `sampling_ratio` tenant level config which overrides the pageserver level config. Note that we do not cache the config and load it on every get page request such that changes propagate timely. Note that I've had to remove the `SHARD_SELECTION` span to get this to work. The tracing library doesn't expose a neat way to drop a span if one realises it's not needed at runtime. Closes https://github.com/neondatabase/neon/issues/11392	2025-04-04 13:41:28 +00:00
Vlad Lazar	65e2aae6e4	pageserver/secondary: deregister IO metrics (#11283 ) ## Problem IO metrics for secondary locations do not get deregistered when the timeline is removed. ## Summary of changes Stash the request context to be used for downloads in `SecondaryTimelineDetail`. These objects match the lifetime of the secondary timeline location pretty well. When the timeline is removed, deregister the metrics too. Closes https://github.com/neondatabase/neon/issues/11156	2025-04-04 10:52:59 +00:00
Dmitrii Kovalkov	181af302b5	storcon + safekeeper + scrubber: propagate root CA certs everywhere (#11418 ) ## Problem There are some places in the code where we create `reqwest::Client` without providing SSL CA certs from `ssl_ca_file`. These will break after we enable TLS everywhere. - Part of https://github.com/neondatabase/cloud/issues/22686 ## Summary of changes - Support `ssl_ca_file` in storage scrubber. - Add `use_https_safekeeper_api` option to safekeeper to use https for peer requests. - Propagate SSL CA certs to storage_controller/client, storcon's ComputeHook, PeerClient and maybe_forward.	2025-04-04 06:30:48 +00:00
Alex Chi Z.	381f42519e	fix(pageserver): skip gc-compaction over sparse keyspaces (#11404 ) ## Problem Part of https://github.com/neondatabase/neon/issues/11318 It's not 100% safe for now to run gc-compaction over the sparse keyspace. It might cause a deleted file to reappear if a specific sequence of operations is performed as in the issue, which in reality doesn't happen due to how we split delta/image layers based on the key range. A long-term fix would be either having a separate gc-compaction code path for metadata keys (as we have a different code path for metadata image layer generation), or making the compaction process aware of the information that "there's an image layer that doesn't contain a key" so that we can skip the keys. ## Summary of changes * gc-compaction auto trigger only triggers compaction over the normal data range. * do not hold gc_block_guard across the full compaction job, only hold it during each subcompaction. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-03 19:40:44 +00:00
Vlad Lazar	9db63fea7a	pageserver: optionally export perf traces in OTEL format (#11140 ) Based on https://github.com/neondatabase/neon/pull/11139 ## Problem We want to export performance traces from the pageserver in OTEL format. End goal is to see them in Grafana. ## Summary of changes https://github.com/neondatabase/neon/pull/11139 introduces the infrastructure required to run the otel collector alongside the pageserver. ### Design Requirements: 1. We'd like to avoid implementing our own performance tracing stack and use the `tracing` crate if possible. 2. Ideally, we'd like zero overhead at a sampling rate of zero and be able to change the tracing config for a tenant on the fly. 3. We should leave the current span hierarchy intact. This includes adding perf traces without modifying existing tracing. To satisfy (3) (and (2) in part) a separate span hierarchy is used. `RequestContext` gains an optional `perf_span` member that's only set when the request was chosen by sampling. All perf span related methods added to `RequestContext` are no-ops for requests that are not sampled. This on its own is not enough for (3), so performance spans use a separate tracing subscriber. The `tracing` crate doesn't have great support for this, so there's a fair amount of boilerplate to override the subscriber at all points of the perf span lifecycle. ### Perf Impact [Periodic pagebench](https://neonprod.grafana.net/d/ddqtbfykfqfi8d/e904990?orgId=1&from=2025-02-08T14:15:59.362Z&to=2025-03-10T14:15:59.362Z&timezone=utc) shows no statistically significant regression with a sample ratio of 0. There's an annotation on the dashboard on 2025-03-06. ### Overview of changes: 1. Clean up the `RequestContext` API a bit. Namely, get rid of the `RequestContext::extend` API and use the builder instead. 2. Add pageserver level configs for tracing: sampling ratio, otel endpoint, etc. 3. Introduce some perf span tracking utilities and expose them via `RequestContext`. 
We add a `tracing::Span` wrapper to be used for perf spans and a `tracing::Instrumented` equivalent for it. See doc comments for reason. 4. Set up OTEL tracing infra according to configuration. A separate runtime is used for the collector. 5. Add perf traces to the read path. ## Refs - epic https://github.com/neondatabase/neon/issues/9873 --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-04-03 17:56:51 +00:00
Alex Chi Z.	109c54a300	fix(pageserver): avoid gc-compaction triggering circuit breaker (#11403 ) ## Problem There are some cases where traditional gc might collect some layer files, so that gc-compaction cannot read the full history of the key. This needs to be resolved in the long term by improving the compaction process. For now, let's simply avoid such errors triggering the circuit breaker. ## Summary of changes * Move the place where we trigger the circuit breaker. We only trigger it during compactions other than L0 compactions. We added the trigger a year ago due to file cleanup concerns in image layer compaction. * For gc-compaction, only return errors to the upper compaction_iteration if it's a shutdown error. Otherwise, just log it and skip the compaction for a key range. Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-04-03 17:18:37 +00:00
Alex Chi Z.	131b32ef48	fix(pageserver): clean up aux files before detaching (#11299 ) ## Problem Related to https://github.com/neondatabase/cloud/issues/26091 and https://github.com/neondatabase/cloud/issues/25840 Close https://github.com/neondatabase/neon/issues/11297 Discussion on Slack: https://neondb.slack.com/archives/C033RQ5SPDH/p1742320666313969 ## Summary of changes * When detaching, scan all aux files within `sparse_non_inherited_keyspace` in the ancestor timeline and create an image layer exactly at the ancestor LSN. All scanned keys will map to an empty value, which is a delete tombstone. - Note that end_lsn for rewritten delta layers = ancestor_lsn + 1, so the image layer will have image_end_lsn=end_lsn. With the current `select_layer` logic, the read path will always first read the image layer. * Add a test case. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-04-03 15:55:22 +00:00
Arpad Müller	d8cee52637	Update rust to 1.86.0 (#11431 ) We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Announcement blog post](https://blog.rust-lang.org/2025/04/03/Rust-1.86.0.html). Prior update was in #10914.	2025-04-03 14:53:28 +00:00
Alex Chi Z.	dd1299f337	feat(storcon): passthrough mark invisible and add tests (#11401 ) ## Problem close https://github.com/neondatabase/neon/issues/11279 ## Summary of changes * Allow passthrough of other methods in tenant timeline shard0 passthrough of storcon. * Passthrough mark invisible API in storcon. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-02 17:11:49 +00:00
John Spray	cb19e4e05d	pageserver: remove legacy TimelineInfo::latest_gc_cutoff field (2/2) (#11136 ) ## Problem This field was retained for backward compat only in #10707. Once https://github.com/neondatabase/cloud/pull/25233 is released, nothing will be reading this field. Related: https://github.com/neondatabase/cloud/issues/24250 ## Summary of changes - Remove TimelineInfo::latest_gc_cutoff_lsn	2025-04-02 15:21:58 +00:00
Erik Grinaker	47f5bcf2bc	pageserver: don't periodically flush layers for stale attachments (#11317 ) ## Problem Tenants in attachment state `Stale` can't upload layers, and don't run compaction, but still do periodic L0 layer flushes in the tenant housekeeping loop. If the tenant remains stuck in stale mode, this causes a large buildup of L0 layers, causing logging, metrics increases, and possibly alerts. Resolves #11245. ## Summary of changes Don't perform periodic layer flushes in stale attachment state.	2025-04-02 12:55:15 +00:00
Alex Chi Z.	c4fc602115	feat(pageserver): support synthetic size calculation for invisible branches (#11335 ) ## Problem ref https://github.com/neondatabase/neon/issues/11279 Imagine we have a branch with 3 snapshots A, B, and C: ``` base---+---+---+---main \-A \-B \-C base=100G, base-A=1G, A-B=1G, B-C=1G, C-main=1G ``` at this point, the synthetic size should be 100+1+1+1+1=104G. after the deletion, the structure looks like: ``` base---+---+---+ \-A \-B \-C ``` If we simply assume main never exists, the size will be calculated as size(A) + size(B) + size(C)=300GB, which obviously is not what the user would expect. The correct way to do this is to assume part of main still exists, that is to say, set C-main=1G: ``` base---+---+---+main \-A \-B \-C ``` And we will get the correct synthetic size of 100G+1+1+1=103G. ## Summary of changes * Do not generate gc cutoff point for invisible branches. * Use the same LSN as the last branchpoint for branch end. * Remove test_api_handler for mark_invisible. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-01 18:50:58 +00:00
Arpad Müller	016068b966	API spec: add safekeepers field returned by storcon (#11385 ) Add an optional `safekeepers` field to `TimelineInfo` which is returned by the storcon upon timeline creation if the `--timelines-onto-safekeepers` flag is enabled. It contains the list of safekeepers chosen. Other contexts where we return `TimelineInfo` do not contain the `safekeepers` field. Sadly, I couldn't make this more type safe like done in Rust via `TimelineCreateResponseStorcon`, as there is no way of flattening or inheritance (and I don't think that duplicating the entire type for some minor type safety improvements is worth it). The storcon side has been done in #11058. Part of https://github.com/neondatabase/cloud/issues/16176 cc https://github.com/neondatabase/cloud/issues/16796	2025-04-01 12:39:10 +00:00
Erik Grinaker	80596feeaa	pageserver: invert `CompactFlags::NoYield` as `YieldForL0` (#11382 ) ## Problem `CompactFlags::NoYield` was a bit inconvenient, since every caller except for the background compaction loop should generally set it (e.g. HTTP API calls, tests, etc). It was also inconsistent with `CompactionOutcome::YieldForL0`. ## Summary of changes Invert `CompactFlags::NoYield` as `CompactFlags::YieldForL0`. There should be no behavioral changes.	2025-04-01 11:43:58 +00:00
Erik Grinaker	225cabd84d	pageserver: update upload queue TODOs (#11377 ) Update some upload queue TODOs, particularly to track https://github.com/neondatabase/neon/issues/10283, which I won't get around to.	2025-04-01 11:38:12 +00:00
Alex Chi Z.	0ee5bfa2fc	fix(pageserver): allow sibling archived branch for detaching (#11383 ) ## Problem close https://github.com/neondatabase/neon/issues/11379 ## Summary of changes Remove checks around archived branches for detach v2. I also updated the comments `ancestor_retain_lsn`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-31 16:32:55 +00:00
Erik Grinaker	db5384e1b0	pageserver: remove L0 flush upload wait (#11196 ) ## Problem Previously, L0 flushes would wait for uploads, as a simple form of backpressure. However, this prevented flush pipelining and upload parallelism. It has since been disabled by default and replaced by L0 compaction backpressure. Touches https://github.com/neondatabase/cloud/issues/24664. ## Summary of changes This patch removes L0 flush upload waits, along with the `l0_flush_wait_upload` setting. This can't be merged until the setting has been removed across the fleet.	2025-03-30 13:14:04 +00:00
Arpad Müller	5f3551e405	Add "still waiting for task" for slow shutdowns (#11351 ) To help with narrowing down https://github.com/neondatabase/cloud/issues/26362, we make the case more noisy where we are waiting for the shutdown of a specific task (in the case of that issue, the `gc_loop`).	2025-03-24 17:29:44 +00:00
Dmitrii Kovalkov	aeb53fea94	storage: support multiple SSL CA certificates (#11341 ) ## Problem - We need to support multiple SSL CA certificates for graceful root CA certificate rotation. - Closes: https://github.com/neondatabase/cloud/issues/25971 ## Summary of changes - Parses `ssl_ca_file` as a pem bundle, which may contain multiple certificates. Single pem cert is a valid pem bundle, so the change is backward compatible.	2025-03-21 13:43:38 +00:00
Dmitrii Kovalkov	0f367cb665	storcon: reuse reqwest http client (#11327 ) ## Problem - Part of https://github.com/neondatabase/neon/issues/11113 - Building a new `reqwest::Client` for every request is expensive because it parses CA certs under the hood. It's noticeable in storcon's flamegraph. ## Summary of changes - Reuse one `reqwest::Client` for all API calls to avoid parsing CA certificates every time.	2025-03-21 11:48:22 +00:00
Alex Chi Z.	bae9b9acdc	feat(pageserver): persist timeline invisible flag (#11331 ) ## Problem part of https://github.com/neondatabase/neon/issues/11279 ## Summary of changes The invisible flag is used to exclude a timeline from synthetic size calculation. For the first step, let's persist this flag. Most of the code follows the `is_archived` modification flow. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-20 18:39:08 +00:00
Nikita Kalyanov	53f54ba37a	chore: expose detach_v2 (#11325 ) we need this exposed in the spec to use it in cplane. extracted from https://github.com/neondatabase/cloud/pull/26167 ## Problem ## Summary of changes	2025-03-20 18:04:17 +00:00
Dmitrii Kovalkov	28fc051dcc	storage: live ssl certificate reload (#11309 ) ## Problem SSL certs are loaded only during start up. It doesn't allow the rotation of short-lived certificates without server restart. - Closes: https://github.com/neondatabase/cloud/issues/25525 ## Summary of changes - Implement `ReloadingCertificateResolver` which reloads certificates from disk periodically.	2025-03-20 16:26:27 +00:00
Alex Chi Z.	78502798ae	feat(compute_ctl): pass compute type to pageserver with pg_options (#11287 ) ## Problem second try of https://github.com/neondatabase/neon/pull/11185, part of https://github.com/neondatabase/cloud/issues/24706 ## Summary of changes Tristan reminded me of the `options` field of the pg wire protocol, which can be used to pass configurations. This patch adds the parsing on the pageserver side, and supplies `neon.endpoint_type` as part of the `options`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-20 15:48:40 +00:00
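
The synthetic size arithmetic described in #11335 above can be sketched as follows. This is only an illustrative model of the numbers in that commit message: the `SEGMENTS_GB` table and the `synthetic_size_gb` helper are hypothetical names, not pageserver code, and the sketch assumes shared history is counted exactly once.

```python
# Hypothetical model of the example in #11335; not the pageserver implementation.
BASE_GB = 100

# Data written on main between consecutive branch points, in GB (from the commit message).
SEGMENTS_GB = {"base-A": 1, "A-B": 1, "B-C": 1, "C-main": 1}


def synthetic_size_gb(segments, base=BASE_GB):
    """Count shared history once: the base plus every retained segment."""
    return base + sum(segments.values())


# While main is visible, every segment up to its tip is retained.
with_main = synthetic_size_gb(SEGMENTS_GB)

# Once main is invisible, end the branch at the last branch point (C),
# so the segment after C no longer contributes.
retained = {name: gb for name, gb in SEGMENTS_GB.items() if name != "C-main"}
without_main = synthetic_size_gb(retained)

print(with_main, without_main)
```

Under these assumptions the two results match the 104G and 103G figures quoted in the commit message.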


2858 Commits