rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 05:52:55 +00:00

Author	SHA1	Message	Date
Arpad Müller	5f3551e405	Add "still waiting for task" for slow shutdowns (#11351 ) To help with narrowing down https://github.com/neondatabase/cloud/issues/26362, we make the case more noisy where we are wait for the shutdown of a specific task (in the case of that issue, the `gc_loop`).	2025-03-24 17:29:44 +00:00
Dmitrii Kovalkov	aeb53fea94	storage: support multiple SSL CA certificates (#11341 ) ## Problem - We need to support multiple SSL CA certificates for graceful root CA certificate rotation. - Closes: https://github.com/neondatabase/cloud/issues/25971 ## Summary of changes - Parses `ssl_ca_file` as a pem bundle, which may contain multiple certificates. Single pem cert is a valid pem bundle, so the change is backward compatible.	2025-03-21 13:43:38 +00:00
Alex Chi Z.	bae9b9acdc	feat(pageserver): persist timeline invisible flag (#11331 ) ## Problem part of https://github.com/neondatabase/neon/issues/11279 ## Summary of changes The invisible flag is used to exclude a timeline from synthetic size calculation. For the first step, let's persist this flag. Most of the code are following the `is_archived` modification flow. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-20 18:39:08 +00:00
Nikita Kalyanov	53f54ba37a	chore: expose detach_v2 (#11325 ) we need this exposed in the spec to use it in cplane. extracted from https://github.com/neondatabase/cloud/pull/26167 ## Problem ## Summary of changes	2025-03-20 18:04:17 +00:00
Dmitrii Kovalkov	28fc051dcc	storage: live ssl certificate reload (#11309 ) ## Problem SSL certs are loaded only during start up. It doesn't allow the rotation of short-lived certificates without server restart. - Closes: https://github.com/neondatabase/cloud/issues/25525 ## Summary of changes - Implement `ReloadingCertificateResolver` which reloads certificates from disk periodically.	2025-03-20 16:26:27 +00:00
Alex Chi Z.	78502798ae	feat(compute_ctl): pass compute type to pageserver with pg_options (#11287 ) ## Problem second try of https://github.com/neondatabase/neon/pull/11185, part of https://github.com/neondatabase/cloud/issues/24706 ## Summary of changes Tristan reminded me of the `options` field of the pg wire protocol, which can be used to pass configurations. This patch adds the parsing on the pageserver side, and supplies `neon.endpoint_type` as part of the `options`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-20 15:48:40 +00:00
Arpad Müller	5dd60933d3	Mark min_readable_lsn as required in the pageserver API spec (#11324 ) Fully applies the changes of https://github.com/neondatabase/cloud/pull/25233 to neon.git. The field is always present in the Rust struct definition, so it can be marked as required. cc #10707	2025-03-20 15:25:09 +00:00
Dmitrii Kovalkov	9bf59989db	storcon: add https API (#11239 ) ## Problem Pageservers use unencrypted HTTP requests for storage controller API. - Closes: https://github.com/neondatabase/cloud/issues/25524 ## Summary of changes - Replace hyper0::server::Server with http_utils::server::Server in storage controller. - Add HTTPS handler for storage controller API. - Support `ssl_ca_file` in pageserver.	2025-03-20 08:22:02 +00:00
Vlad Lazar	9ce3704ab5	pageseserver: rename cplane api to storage controller api (#11310 ) ## Problem The pageserver upcall api was designed to work with control plane or the storage controller. We have completed the transition period and now the upcall api only targets the storage controller. ## Summary of changes Rename types accordingly and tweak some comments.	2025-03-19 16:29:52 +00:00
John Spray	55cb07f680	pageserver: improve debuggability of timeline creation failures during chaos testing (#11300 ) ## Problem We're seeing timeline creation failures that look suspiciously like some race with the cleanup-deletion of initdb temporary directories. I couldn't spot the bug, but we can make it a bit easier to debug. Related: https://github.com/neondatabase/neon/issues/11296 ## Summary of changes - Avoid surfacing distracting ENOENT failure to delete as a log error -- this is fine, and can happen if timeline is cancelled while doing initdb, or if initdb itself has an error where it doesn't write the dir (this error is surfaced separately) - Log after purging initdb temp directories	2025-03-19 13:38:03 +00:00
Christian Schwarz	0f20dae3c3	impr: merge `pageserver_api::models::TenantConfig` and `pageserver::tenant::config::TenantConfOpt` (#11298 ) The only difference between - `pageserver_api::models::TenantConfig` and - `pageserver::tenant::config::TenantConfOpt` at this point is that `TenantConfOpt` serializes with `skip_serializing_if = Option::is_none`. That is an efficiency improvement for all the places that currently serde `models::TenantConfig` because new serializations will no longer write `$fieldname: null` for each field that is `None` at runtime. This should be particularly beneficial for Storcon, which stores JSON-serialized `models::TenantConfig` in its DB. # Behavior Changes This PR changes the serialization behavior: we omit `None` fields instead of serializing `$fieldname: null`). So it's a data format change (see section on compatibility below). And it changes API responses from Storcon and Pageserver. ## API Response Compatibility Storcon returns the location description. Afaik it is passed through into - storcon_cli output - storcon UI in console admin UI These outputs will no longer contain `$fieldname: null` values, which de-bloats the output (good). But in storcon UI, it also serves as an editor "default", which will be eliminated after a storcon with this PR is released. ## Data Format Compatibility Backwards compat: new software reading old serialized data will deserialize to the same runtime value because all the field types are exactly the same and `skip_serializing_if` does not affect deserialization. Forward compat: old software reading data serialized by new software will map absence fields in the serialized form to runtime value `Option::None`. This is serde default behavior, see this playground to convince yourself: https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=f7f4e1a169959a3085b6158c022a05eb The `serde(with="humantime_serde")` however behaves strangely: if used on an `Option<Duration>`, it still requires the field to be present, unlike the serde default behavior shown in the previous paragraph. The workaround is to set `serde(default)`. Previously it was set on each individual field, but, we do have the container attribute, so, set it there. This requires deriving a `Default` impl, which, because all fields are `Option`, is non-magic. See my notes here: https://gist.github.com/problame/eddbc225a5d12617e9f2c6413e0cf799 # Future Work We should have separate types (& crates) for - runtime types configuration (e.g. PageServerConf::tenant_config, AttachedLocationConf) - `config-v1` file pageserver local disk file format - `mgmt API` - `pageserver.toml` Right now they all use the same, which is convenient but makes it hard to reason about compatibility breakage. # Refs - corresponding docs.neon.build PR https://github.com/neondatabase/docs/pull/470	2025-03-19 12:47:17 +00:00
Christian Schwarz	9fb77d6cdd	buffered writer: add cancellation sensitivity (#11052 ) In - https://github.com/neondatabase/neon/pull/10993#issuecomment-2690428336 I added infinite retries for buffered writer flush IOs, primarily to gracefully handle ENOSPC but more generally so that the buffered writer is not left in a state where reads from the surrounding InMemoryLayer cause panics. However, I didn't add cancellation sensitivity, which is concerning because then there is no way to detach a timeline/tenant that is encountering the write IO errors. That’s a legitimate scenario in the case of some edge case bug. See the #10993 description for details. This PR - first makes flush loop infallible, enabled by infinite retries - then adds sensitivity to `Timeline::cancel` to the flush loop, thereby making it fallible in one specific way again - finally fixes the InMemoryLayer/EphemeralFile/BufferedWriter amalgamate to remain read-available after flush loop is cancelled. The support for read-availability after cancellation is necessary so that reads from the InMemoryLayer that are already queued up behind the RwLock that wraps the BufferedWriter won't panic because of the `mutable=None` that we leave behind in case the flush loop gets cancelled. # Alternatives One might think that we can only ship the change for read-availability if flush encounters an error, without the infinite retrying and/or cancellation sensitivity complexity. The problem with that is that read-availability sounds good but is really quite useless, because we cannot ingest new WAL without a writable InMemoryLayer. Thus, very soon after we transition to read-only mode, reads from compute are going to wait anyway, but on `wait_lsn` instead of the RwLock, because ingest isn't progressing. Thus, having the infinite flush retries still makes more sense because they're just "slowness" to the user, whereas wait_lsn is hard errors.	2025-03-18 18:48:43 +00:00
Alex Chi Z.	05ca27c981	fix(pagectl/benches): scope context with debug tools (#11285 ) ## Problem `7c462b3417` requires all contexts have scopes. pagectl/benches don't have such scopes. close https://github.com/neondatabase/neon/issues/11280 ## Summary of changes Adding scopes for the tools. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-17 21:27:27 +00:00
Alex Chi Z.	bb64beffbb	fix(pageserver): log compaction errors with timeline ids (#11231 ) ## Problem Makes it easier to debug. ## Summary of changes Log compaction errors with timeline ids. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-17 19:42:02 +00:00
Erik Grinaker	53d50c7ea5	pageserver: deflake compaction tests (#11246 ) These need to set `NoYield`, otherwise they may be preempted by pending L0 compaction.	2025-03-14 17:45:18 +00:00
Dmitrii Kovalkov	f68be2b5e2	safekeeper: https for management API (#11171 ) ## Problem Storage controller uses unencrypted HTTP requests for safekeeper management API. - Closes: https://github.com/neondatabase/cloud/issues/24836 ## Summary of changes - Replace `hyper0::server::Server` with `http_utils::server::Server` in safekeeper. - Add HTTPS handler for safekeeper management API.	2025-03-14 11:41:22 +00:00
Erik Grinaker	4ff000c042	pageserver: deflake `test_metadata_image_creation` (#11230 ) ## Problem `test_metadata_image_creation ` became flaky with #11212, since image compaction may yield to L0 compaction. ## Summary of changes Set `NoYield` when compacting in tenant tests.	2025-03-13 20:46:21 +00:00
Alex Chi Z.	23b713900e	feat(storcon): passthrough ancestor detach behavior (#11199 ) ## Problem https://github.com/neondatabase/neon/issues/10310 https://github.com/neondatabase/neon/pull/11158 ## Summary of changes We need to passthrough the new detach behavior through the storcon API. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-13 20:21:23 +00:00
Erik Grinaker	8afae9d03c	pageserver: enable `l0_flush_delay_threshold` by default (#11214 ) ## Problem `l0_flush_delay_threshold` has already been set to 30 in production for a couple of weeks. Let's harmonize the default. ## Summary of changes Update `DEFAULT_L0_FLUSH_DELAY_FACTOR` to 3 such that the default `l0_flush_delay_threshold` is `3 * compaction_threshold`. This differs from the production setting, which is hardcoded to 30 (with `compaction_threshold` at 10), and is more appropriate for any tenants that have custom `compaction_threshold` overrides.	2025-03-13 19:15:22 +00:00
Christian Schwarz	ed31dd2a3c	pageserver: better observability for slow wait_lsn (#11176 ) # Problem We leave too few observability breadcrumbs in the case where wait_lsn is exceptionally slow. # Changes - refactor: extract the monitoring logic out of `log_slow` into `monitor_slow_future` - add global + per-timeline counter for time spent waiting for wait_lsn - It is updated while we're still waiting, similar to what we do for page_service response flush. - add per-timeline counterpair for started & finished wait_lsn count - add slow-logging to leave breadcrumbs in logs, not just metrics For the slow-logging, we need to consider not flooding the logs during a broker or network outage/blip. The solution is a "log-streak-level" concurrency limit per timeline. At any given time, there is at most one slow wait_lsn that is logging the "still running" and "completed" sequence of logs. Other concurrent slow wait_lsn's don't log at all. This leaves at least one breadcrumb in each timeline's logs if some wait_lsn was exceptionally slow during a given period. The full degree of slowness can then be determined by looking at the per-timeline metric. # Performance Reran the `bench_log_slow` benchmark, no difference, so, existing call sites are fine. We do use a Semaphore, but only try_acquire it _after_ things have already been determined to be slow. So, no baseline overhead anticipated. # Refs - https://github.com/neondatabase/cloud/issues/23486#issuecomment-2711587222	2025-03-13 15:03:53 +00:00
Alex Chi Z.	b2286f5bcb	fix(pageserver): don't panic if gc-compaction find no keys (#11200 ) ## Problem There was a panic on staging that compaction didn't find any keys. This is possible if all layers selected for compaction does not contain any keys within the current shard. ## Summary of changes Make panic an error. In the future, we can try creating an empty image layer so that GC can clean up those layers. Otherwise, for now, we can only rely on shard ancestor compaction to remove these data. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-13 14:38:45 +00:00
Erik Grinaker	c036fec065	pageserver: enable `compaction_l0_first` by default (#11212 ) ## Problem `compaction_l0_first` has already been enabled in production for a couple of weeks. ## Summary of changes Enable `compaction_l0_first` by default. Also set `CompactFlags::NoYield` in `timeline_checkpoint_handler`, to ensure explicitly requested compaction runs to completion. This endpoint is mainly used in tests, and caused some flakiness where tests expected compaction to complete.	2025-03-13 14:28:42 +00:00
Alex Chi Z.	c3b3b507f7	feat(pageserver): support detaching behavior v2 (#11158 ) ## Problem close https://github.com/neondatabase/neon/issues/10310 ## Summary of changes This patch adds a new behavior for the detach_ancestor API: detach with multi-level ancestor and no reparenting. Though we can potentially support multi-level + do reparenting / single-level + no-reparenting in the future, as it's not required for the recovery/snapshot epic, I'd prefer keeping things simple now that we only handle the old one and the new one instead of supporting the full feature matrix. I only added a test case of successful detaching instead of testing failures. I'd like to make this into staging and add more tests in the future. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-12 22:27:23 +00:00
Erik Grinaker	c7717c85c7	storcon,pageserver: use persisted stripe size when loading unsharded tenants (#11193 ) ## Problem When the storage controller and Pageserver loads tenants from persisted storage, it uses `ShardIdentity::unsharded()` for unsharded tenants. However, this replaces the persisted stripe size of unsharded tenants with the default stripe size. This doesn't really matter for practical purposes, since the stripe size is meaningless for unsharded tenants anyway, but can cause consistency check failures if the persisted stripe size differs from the default. This was seen in #11168, where we change the default stripe size. Touches #11168. ## Summary of changes Carry over the persisted stripe size from `TenantShardPersistence` for unsharded tenants, and from `LocationConf` on Pageservers. Also add bounds checks for type casts when loading persisted shard metadata.	2025-03-12 15:16:54 +00:00
Erik Grinaker	1436b8469c	pageserver: appease unused lint on macOS (#11192 ) ## Problem `info_span!` is only used in a `linux` branch, causing the unused lint to fire on macOS. ## Summary of changes Fully qualify the `info_span!` use.	2025-03-12 14:34:29 +00:00
Vlad Lazar	1c0ff3c04d	utils: explicit OTEL export config and OTEL enablement via common entry point (#11139 ) We want to export performance traces from the pageserver in OTEL format. End goal is to see them in Grafana. To this end, there are two changes here: 1. Update the `tracing-utils` crate to allow for explicitly specifying the export configuration. Pageserver configuration is loaded from a file on start-up. This allows us to use the same flow for export configs there. 2. Update the `utils::logging::init` common entry point to set up OTEL tracing infrastructure if requested. Note that an entirely different tracing subscriber is used. This is to avoid interference with the existing tracing set-up. For now, no service uses this functionality. PR to plug this into the pageserver is [here](https://github.com/neondatabase/neon/pull/11140). Related https://github.com/neondatabase/neon/issues/9873	2025-03-12 11:07:49 +00:00
John Spray	7bf6397334	pageserver: remove legacy `TimelineInfo::latest_gc_cutoff` field (1/2) (#11149 ) ## Problem This field was retained for backward compat only in https://github.com/neondatabase/neon/pull/10707. Once https://github.com/neondatabase/cloud/pull/25233 is released, nothing external will be reading this field. Internally, this was a mandatory field so storage controller is still trying to decode it, so we must do this removal in two steps: this PR makes the field optional, and after one release we can fully remove it. Related: https://github.com/neondatabase/cloud/issues/24250 ## Summary of changes - Rename field to `_unused` - Remove field from swagger - Make field optional	2025-03-12 10:23:41 +00:00
Christian Schwarz	158db414bf	buffered writer: handle write errors by retrying all write IO errors indefinitely (#10993 ) # Problem If the Pageserver ingest path (InMemoryLayer=>EphemeralFile=>BufferedWriter) encounters ENOSPC or any other write IO error when flushing the mutable buffer of the BufferedWriter, the buffered writer is left in a state where subsequent _reads_ from the InMemoryLayer it will cause a `must not use after we returned an error` panic. The reason is that 1. the flush background task bails on flush failure, 2. causing the `FlushHandle::flush` function to fail at channel.recv() and 3. causing the `FlushHandle::flush` function to bail with the flush error, 4. leaving its caller `BufferedWriter::flush` with `BufferedWriter::mutable = None`, 5. once the InMemoryLayer's RwLock::write guard is dropped, subsequent reads can enter, 6. those reads find `mutable = None` and cause the panic. # Context It has always been the contract that writes against the BufferedWriter API must not be retried because the writer/stream-style/append-only interface makes no atomicity guarantees ("On error, did nothing or a piece of the buffer get appended?"). The idea was that the error would bubble up to upper layers that can throw away the buffered writer and create a new one. (See our [internal error handling policy document on how to handle e.g. `ENOSPC`](`c870a50bc0/src/storage/handling_io_and_logical_errors.md (L36-L43)`)). That _might_ be true for delta/image layer writers, I haven't checked. But it's certainly not true for the ingest path: there are no provisions to throw away an InMemoryLayer that encountered a write error an reingest the WAL already written to it. Adding such higher-level retries would involve either resetting last_record_lsn to a lower value and restarting walreceiver. The code isn't flexible enough to do that, and such complexity likely isn't worth it given that write errors are rare. # Solution The solution in this PR is to retry _any_ failing write operation _indefinitely_ inside the buffered writer flush task, except of course those that are fatal as per `maybe_fatal_err`. Retrying indefinitely ensures that `BufferedWriter::mutable` is never left `None` in the case of IO errors, thereby solving the problem described above. It's a clear improvement over the status quo. However, while we're retrying, we build up backpressure because the `flush` is only double-buffered, not infinitely buffered. Backpressure here is generally good to avoid resource exhaustion, but blocks reads and hence stalls GetPage requests because InMemoryLayer reads and writes are mutually exclusive. That's orthogonal to the problem that is solved here, though. ## Caveats Note that there are some remaining conditions in the flush background task where it can bail with an error. I have annotated one of them with a TODO comment. Hence the `FlushHandle::flush` is still fallible and hence the overall scenario of leaving `mutable = None` on the bail path is still possible. We can clean that up in a later commit. Note also that retrying indefinitely is great for temporary errors like ENOSPC but likely undesirable in case the `std::io::Error` we get is really due to higher-level logic bugs. For example, we could fail to flush because the timeline or tenant directory got deleted and VirtualFile's reopen fails with ENOENT. Note finally that cancellation is not respected while we're retrying. This means we will block timeline/tenant/pageserver shutdown. The reason is that the existing cancellation story for the buffered writer background task was to recv from flush op channel until the sending side (FlushHandle) is explicitly shut down or dropped. Failing to handle cancellation carries the operational risk that even if a single timeline gets stuck because of a logic bug such as the one laid out above, we must still restart the whole pageserver process. # Alternatives Considered As pointed out in the `Context` section, throwing away a InMemoryLayer that encountered an error and reingesting the WAL is a lot of complexity that IMO isn't justified for such an edge case. Also, it's wasteful. I think it's a local optimum. A more general and simpler solution for ENOSPC is to `abort()` the process and run eviction on startup before bringing up the rest of pageserver. I argued for it in the past, the pro arguments are still valid and complete: https://neondb.slack.com/archives/C033RQ5SPDH/p1716896265296329 The trouble at the time was implementing eviction on startup. However, maybe things are simpler now that we are fully storcon-managed and all tenants have secondaries. For example, if pageserver `abort()`s on ENOSPC and then simply don't respond to storcon heartbeats while we're running eviction on startup, storcon will fail tenants over to the secondary anyway, giving us all the time we need to clean up. The downside is that if there's a systemic space management bug, above proposal will just propagate the problem to other nodes. But I imagine that because of the delays involved with filling up disks, the system might reach a half-stable state, providing operators more time to react. # Demo Intermediary commit `a03f335121480afc0171b0f34606bdf929e962c5` is demoed in this (internal) screen recording: https://drive.google.com/file/d/1nBC6lFV2himQ8vRXDXrY30yfWmI2JL5J/view?usp=drive_link # Perf Testing Ran `bench_ingest` on tmpfs, no measurable difference. Spans are uniquely owned by the flush task, and the span stack isn't too deep, so, enter and exit should be cheap. Plus, each flush takes ~150us with direct IO enabled, so, not _that_ high frequency event anyways. # Refs - fixes https://github.com/neondatabase/neon/issues/10856	2025-03-11 20:40:23 +00:00
Alex Chi Z.	7d221214bb	feat(pageserver): support no-yield for gc-compaction (#11184 ) ## Problem This should also resolve the test flakiness of `test_gc_feedback`. close https://github.com/neondatabase/neon/issues/11144 ## Summary of changes If `NoYield` is set, do not yield in gc-compaction. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-11 19:13:52 +00:00
Alex Chi Z.	7588983168	fix(scrubber): log even if no refs are found (#11160 ) ## Problem Investigate https://github.com/neondatabase/neon/issues/11159 ## Summary of changes This doesn't fix the issue, but at least we can narrow down the cause next time it happens by logging ancestor referenced layer cnt even if it's 0. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-11 14:33:35 +00:00
Erik Grinaker	f466c01995	pageserver: add `max_logical_size_per_shard` for `get_top_tenants` (#11157 ) ## Problem In #11122, we want to split shards once the logical size of the largest timeline exceeds a split threshold. However, `get_top_tenants` currently only returns `max_logical_size`, which tracks the max _total_ logical size of a timeline across all shards. This is problematic, because the storage controller needs to fetch a list of N tenants that are eligible for splits, but the API doesn't currently have a way to express this. For example, with a split threshold of 1 GB, a tenant with `max_logical_size` of 4 GB is eligible to split if it has 1 or 2 shards, but not if it already has 4 shards. We need to express this in per-shard terms, otherwise the `get_top_tenants` endpoint may end up only returning tenants that can't be split, blocking splits entirely. Touches https://github.com/neondatabase/neon/pull/11122. Touches https://github.com/neondatabase/cloud/issues/22532. ## Summary of changes Add `TenantShardItem::max_logical_size_per_shard` containing `max_logical_size / shard_count`, and `TenantSorting::MaxLogicalSizePerShard` to order and filter by it.	2025-03-11 11:43:55 +00:00
Christian Schwarz	7c462b3417	impr: propagate VirtualFile metrics via RequestContext (#7202 ) # Refs - fixes https://github.com/neondatabase/neon/issues/6107 # Problem `VirtualFile` currently parses the path it is opened with to identify the `tenant,shard,timeline` labels to be used for the `STORAGE_IO_SIZE` metric. Further, for each read or write call to VirtualFile, it uses `with_label_values` to retrieve the correct metrics object, which under the hood is a global hashmap guarded by a parking_lot mutex. We perform tens of thousands of reads and writes per second on every pageserver instance; thus, doing the mutex lock + hashmap lookup is wasteful. # Changes Apply the technique we use for all other timeline-scoped metrics to avoid the repeat `with_label_values`: add it to `TimelineMetrics`. Wrap `TimelineMetrics` into an `Arc`. Propagate the `Arc<TimelineMetrics>` down do `VirtualFile`, and use `Timeline::metrics::storage_io_size`. To avoid contention on the `Arc<TimelineMetrics>`'s refcount atomics between different connection handlers for the same timeline, we wrap it into another Arc. To avoid frequent allocations, we store that Arc<Arc<TimelineMetrics>> inside the per-connection timeline cache. Preliminary refactorings to enable this change: - https://github.com/neondatabase/neon/pull/11001 - https://github.com/neondatabase/neon/pull/11030 # Performance I ran the benchmarks in `test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py` on an `i3en.3xlarge` because that's what we currently run them on. None of the benchmarks shows a meaningful difference in latency or throughput or CPU utilization. I would have expected some improvement in the many-tenants-one-client-each workload because they all hit that hashmap constantly, and clone the same `UintCounter` / `Arc` inside of it. But apparently the overhead is miniscule compared to the remaining work we do per getpage. Yet, since the changes are already made, the added complexity is manageable, and the perf overhead of `with_label_values` demonstrable in micro-benchmarks, let's have this change anyway. Also, propagating TimelineMetrics through RequestContext might come in handy down the line. The micro-benchmark that demonstrates perf impact of `with_label_values`, along with other pitfalls and mitigation techniques around the `metrics`/`prometheus` crate: - https://github.com/neondatabase/neon/pull/11019 # Alternative Designs An earlier iteration of this PR stored an `Arc<Arc<Timeline>>` inside `RequestContext`. The problem is that this risks reference cycles if the RequestContext gets stored in an object that is owned directly or indirectly by `Timeline`. Ideally, we wouldn't be using this mess of Arc's at all and propagate Rust references instead. But tokio requires tasks to be `'static`, and so, we wouldn't be able to propagate references across task boundaries, which is incompatible with any sort of fan-out code we already have (e.g. concurrent IO) or future code (parallel compaction). So, opt for Arc for now.	2025-03-11 07:23:06 +00:00
Dmitrii Kovalkov	63b22d3fb1	pageserver: https for management API (#11025 ) ## Problem Storage controller uses unencrypted HTTP requests for pageserver management API. Closes: https://github.com/neondatabase/cloud/issues/24283 ## Summary of changes - Implement `http_utils::server::Server` with TLS support. - Replace `hyper0::server::Server` with `http_utils::server::Server` in pageserver. - Add HTTPS handler for pageserver management API. - Generate local SSL certificates in neon local.	2025-03-10 15:07:59 +00:00
Alex Chi Z.	cd438406fb	feat(pageserver): add force patch index_part API (#11119 ) ## Problem As part of the disaster recovery tool. Partly for https://github.com/neondatabase/neon/issues/9114. ## Summary of changes * Add a new pageserver API to force patch the fields in index_part and modify the timeline internal structures. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-07 17:42:52 +00:00
Vlad Lazar	084fc4a757	pageserver: enable previous heatmaps by default (#11132 ) We add the off by default configs in https://github.com/neondatabase/neon/pull/11088 because the unarchival heatmap was causing oversized secondary locations. That was fixed in https://github.com/neondatabase/neon/pull/11098, so let's turn them on by default.	2025-03-07 16:05:31 +00:00
Alex Chi Z.	e825974a2d	feat(pageserver): yield gc-compaction to L0 compaction (#11120 ) ## Problem Part of https://github.com/neondatabase/neon/issues/9114 ## Summary of changes gc-compaction could take a long time in some cases, for example, if the job split heuristics is wrong and we selected a too large region for compaction that can't be finished within a reasonable amount of time. We will give up such tasks and yield to L0 compaction. Each gc-compaction sub-compaction job is atomic and cannot be split further so we have to give up (instead of storing a state and continue later as in image creation). --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-06 20:30:11 +00:00
Alex Chi Z.	11334a2cdb	feat(pageserver): more statistics for gc-compaction (#11103 ) ## Problem part of https://github.com/neondatabase/neon/issues/9114 ## Summary of changes * Add timers for each phase of the gc-compaction. * Add a final ratio computation to directly show the garbage collection ratio in the logs. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-06 16:44:00 +00:00
Vlad Lazar	5ceb8c994d	pageserver: mark unarchival heatmap layers as cold (#11098 ) ## Problem On unarchival, we update the previous heatmap with all visible layers. When the primary generates a new heatmap it includes all those layers, so the secondary will download them. Since they're not actually resident on the primary (we didn't call the warm up API), they'll never be evicted, so they remain in the heatmap. We want these layers in the heatmap, since we might wish to warm-up an unarchived timeline after a shard migration. However, we don't want them to be downloaded on the secondary until we've warmed up the primary. ## Summary of Changes Include these layers in the heatmap and mark them as cold. All heatmap operations act on non-cold layers apart from the attached location warming up API, which will download the cold layers. Once the cold layers are downloaded on the primary, they'll be included in the next heatmap as hot and the secondary starts fetching them too.	2025-03-06 11:25:02 +00:00
Vlad Lazar	43cea0df91	pageserver: allow for unit test stress test (#11112 ) ## Problem I like using `cargo stress` to hammer on a test, but it doesn't work out of the box because it does parallel runs by default and tests always use the same repo dir. ## Summary of changes Add an uuid to the test repo dir when generating it.	2025-03-06 11:23:25 +00:00
Erik Grinaker	ab7efe9e47	pageserver: add amortized read amp metrics (#11093 ) ## Problem In a batch, `pageserver_layers_per_read_global` counts all layer visits towards every read in the batch, since this directly affects the observed latency of the read. However, this doesn't give a good picture of the amortized read amplification due to batching. ## Summary of changes Add two more global read amp metrics: * `pageserver_layers_per_read_batch_global`: number of layers visited per batch. * `pageserver_layers_per_read_amortized_global`: number of layers divided by reads in a batch.	2025-03-06 10:23:48 +00:00
Alex Chi Z.	2de3629b88	test(pageserver): use reldirv2 by default in regress tests (#11081 ) ## Problem For pg_regress test, we do both v1 and v2; for all the rest, we default to v2. part of https://github.com/neondatabase/neon/issues/9516 ## Summary of changes Use reldir v2 across test cases by default. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-05 21:02:44 +00:00
Alex Chi Z.	9cdc8c0e6c	feat(pageserver): revisit error types for gc-compaction (#11082 ) ## Problem part of https://github.com/neondatabase/neon/issues/9114 We used anyhow::Error everywhere and it's time to fix. ## Summary of changes * Make sure that cancel errors are correctly propagated as CompactionError::ShuttingDown. * Skip all the trigger computation work if gc_cutoff is not generated yet. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-05 15:57:38 +00:00
Vlad Lazar	8c12ccf729	pageserver: gate previous heatmap behind config flag (#11088 ) ## Problem On unarchival, we update the previous heatmap with all visible layers. When the primary generates a new heatmap it includes all those layers, so the secondary will download them. Since they're not actually resident on the primary (we didn't call the warm up API), they'll never be evicted, so they remain in the heatmap. This leads to oversized secondary locations like we saw in pre-prod. ## Summary of changes Gate the loading of the previous heatmaps and the heatmap generation on unarchival behind configuration flags. They are disabled by default, but enabled in tests.	2025-03-05 12:20:18 +00:00
Vlad Lazar	abae7637d6	pageserver: do big reads to fetch slru segment (#11029 ) ## Problem Each page of the slru segment is fetched individually when it's loaded on demand. ## Summary of Changes Use `Timeline::get_vectored` to fetch 16 at a time.	2025-03-05 11:55:55 +00:00
Alex Chi Z.	6d0976dad5	feat(pageserver): persist reldir v2 migration status (#10980 ) ## Problem part of https://github.com/neondatabase/neon/issues/9516 ## Summary of changes Similar to the aux v2 migration, we persist the relv2 migration status into index_part, so that even the config item is set to false, we will still read from the v2 storage to avoid loss of data. Note that only the two variants `None` and `Some(RelSizeMigration::Migrating)` are used for now. We don't have full migration implemented so it will never be set to `RelSizeMigration::Migrated`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-03 21:05:43 +00:00
Alex Chi Z.	dbf9a80261	fix(pageserver): avoid flooding gc-compaction logs (#11024 ) ## Problem The "did not trigger" gets logged at 10k/minute in staging. ## Summary of changes Change it to debug level. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-03 20:23:20 +00:00
Vlad Lazar	5197e43396	pageserver: add recurse flag to layer download spec (#11068 ) I missed updating the open api spec in the original PR. We need this so that the cplane auto-generated client sees the flag.	2025-03-03 19:04:01 +00:00
Vlad Lazar	8298bc903c	pageserver: handle in-memory layer overlaps with persistent layers (#11000 ) ## Problem Image layers may be nested inside in-memory layers as diagnosed [here](https://github.com/neondatabase/neon/issues/10720#issuecomment-2649419252). The read path doesn't support this and may skip over the image layer, resulting in a failure to reconstruct the page. ## Summary of changes We already support nesting of image layers inside delta layers. The logic lives in `LayerMap::select_layer`. The main goal of this PR is to propagate the candidate in-memory layer down to that point and update the selection logic. Important changes are: 1. Support partial reads for the in-memory layer. Previously, we could only specify the start LSN of the read. We need to control the end LSN too. 2. `LayerMap::ranged_search` considers in-memory layers too. Previously, the search for in-memory layers was done explicitly in `Timeline::get_reconstruct_data_timeline`. Note that `LayerMap::ranged_search` now returns a weak readable layer which the `LayerManager` can upgrade. This dance is such that we can unit test the layer selection logic. 3. Update `LayerMap::select_layer` to consider the candidate in-memory layer too Loosely related drive bys: 1. Remove the "keys not found" tracking in the ranged search. This wasn't used anywhere and it just complicates things. 2. Remove the difficulty map stuff from the layer map. Again, not used anywhere. Closes https://github.com/neondatabase/neon/issues/9185 Closes https://github.com/neondatabase/neon/issues/10720	2025-03-03 17:52:59 +00:00
Vlad Lazar	38277497fd	pageserver: log shutdown at info level for basebackup (#11046 ) ## Problem Timeline shutdown during basebackup logs at error level because the the canecellation error is smushed into BasebackupError::Server. ## Summary of changes Introduce BasebackupError::Shutdown and use it. `log_query_error` will now see `QueryError::Shutdown` and log at info level.	2025-03-03 13:46:50 +00:00
Erik Grinaker	d857f63e3b	pageserver: fix race that can wedge background tasks (#11047 ) ## Problem `wait_for_active_tenant()`, used when starting background tasks, has a race condition that can cause it to wait forever (until cancelled). It first checks the current tenant state, and then subscribes for state updates, but if the state changes between these then it won't be notified about it. We've seen this wedge compaction tasks, which can cause unbounded layer file buildup and read amplification. ## Summary of changes Use `watch::Receiver::wait_for()` to check both the current and new tenant states.	2025-02-28 17:00:22 +00:00

... 2 3 4 5 6 ...

2874 Commits