rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-18 05:30:37 +00:00

Author	SHA1	Message	Date
John Spray	7bf6397334	pageserver: remove legacy `TimelineInfo::latest_gc_cutoff` field (1/2) (#11149 ) ## Problem This field was retained for backward compat only in https://github.com/neondatabase/neon/pull/10707. Once https://github.com/neondatabase/cloud/pull/25233 is released, nothing external will be reading this field. Internally, this was a mandatory field so storage controller is still trying to decode it, so we must do this removal in two steps: this PR makes the field optional, and after one release we can fully remove it. Related: https://github.com/neondatabase/cloud/issues/24250 ## Summary of changes - Rename field to `_unused` - Remove field from swagger - Make field optional	2025-03-12 10:23:41 +00:00
Erik Grinaker	f466c01995	pageserver: add `max_logical_size_per_shard` for `get_top_tenants` (#11157 ) ## Problem In #11122, we want to split shards once the logical size of the largest timeline exceeds a split threshold. However, `get_top_tenants` currently only returns `max_logical_size`, which tracks the max _total_ logical size of a timeline across all shards. This is problematic, because the storage controller needs to fetch a list of N tenants that are eligible for splits, but the API doesn't currently have a way to express this. For example, with a split threshold of 1 GB, a tenant with `max_logical_size` of 4 GB is eligible to split if it has 1 or 2 shards, but not if it already has 4 shards. We need to express this in per-shard terms, otherwise the `get_top_tenants` endpoint may end up only returning tenants that can't be split, blocking splits entirely. Touches https://github.com/neondatabase/neon/pull/11122. Touches https://github.com/neondatabase/cloud/issues/22532. ## Summary of changes Add `TenantShardItem::max_logical_size_per_shard` containing `max_logical_size / shard_count`, and `TenantSorting::MaxLogicalSizePerShard` to order and filter by it.	2025-03-11 11:43:55 +00:00
Arpad Müller	4d3c477689	storcon: timetime table, creation and deletion (#11058 ) This PR extends the storcon with basic safekeeper management of timelines, mainly timeline creation and deletion. We want to make the storcon manage safekeepers in the future. Timeline creation is controlled by the `--timelines-onto-safekeepers` flag. 1. it adds the `timelines` and `safekeeper_timeline_pending_ops` tables to the storcon db 2. extend code for the timeline creation and deletion 4. it adds per-safekeeper reconciler tasks TODO: * maybe not immediately schedule reconciliations for deletions but have a prior manual step * tenant deletions * add exclude API definitions (probably separate PR) * how to choose safekeeper to do exclude on vs deletion? this can be a bit hairy because the safekeeper might go offline in the meantime. * error/failure case handling * tests (cc test_explicit_timeline_creation from #11002) * single safekeeper mode: we often only have one SK (in tests for example) * `notify-safekeepers` hook: https://github.com/neondatabase/neon/issues/11163 TODOs implemented: * cancellations of enqueued reconciliations on a per-timeline basis, helpful if there is an ongoing deletion * implement pending ops overwrite behavior * load pending operations from db RFC section for important reading: [link](https://github.com/neondatabase/neon/blob/main/docs/rfcs/035-safekeeper-dynamic-membership-change.md#storage_controller-implementation) Implements the bulk of #9011 Successor of #10440. --------- Co-authored-by: Arseny Sher <sher-ars@yandex.ru>	2025-03-11 02:31:22 +00:00
Alex Chi Z.	cd438406fb	feat(pageserver): add force patch index_part API (#11119 ) ## Problem As part of the disaster recovery tool. Partly for https://github.com/neondatabase/neon/issues/9114. ## Summary of changes * Add a new pageserver API to force patch the fields in index_part and modify the timeline internal structures. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-07 17:42:52 +00:00
Alex Chi Z.	6d0976dad5	feat(pageserver): persist reldir v2 migration status (#10980 ) ## Problem part of https://github.com/neondatabase/neon/issues/9516 ## Summary of changes Similar to the aux v2 migration, we persist the relv2 migration status into index_part, so that even the config item is set to false, we will still read from the v2 storage to avoid loss of data. Note that only the two variants `None` and `Some(RelSizeMigration::Migrating)` are used for now. We don't have full migration implemented so it will never be set to `RelSizeMigration::Migrated`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-03 21:05:43 +00:00
Arpad Müller	920040e402	Update storage components to edition 2024 (#10919 ) Updates storage components to edition 2024. We like to stay on the latest edition if possible. There is no functional changes, however some code changes had to be done to accommodate the edition's breaking changes. The PR has two commits: * the first commit updates storage crates to edition 2024 and appeases `cargo clippy` by changing code. i have accidentially ran the formatter on some files that had other edits. * the second commit performs a `cargo fmt` I would recommend a closer review of the first commit and a less close review of the second one (as it just runs `cargo fmt`). part of https://github.com/neondatabase/neon/issues/10918	2025-02-25 23:51:37 +00:00
Vlad Lazar	3e82addd64	storcon: use `Duration` for duration's in the storage controller tenant config (#10928 ) ## Problem The storage controller treats durations in the tenant config as strings. These are loaded from the db. The pageserver maps these durations to a seconds only format and we always get a mismatch compared to what's in the db. ## Summary of changes Treat durations as durations inside the storage controller and not as strings. Nothing changes in the cross service API's themselves or the way things are stored in the db. I also added some logging which I would have made the investigation a 10min job: 1. Reason for why the reconciliation was spawned 2. Location config diff between the observed and wanted states	2025-02-21 15:45:00 +00:00
Arseny Sher	d36baae758	Add gc_blocking and restore latest_gc_cutoff in openapi spec (#10867 ) ## Problem gc_blocking is missing in the tenant info, but cplane wants to use it. Also, https://github.com/neondatabase/neon/pull/10707/ removed latest_gc_cutoff from the spec, renaming it to applied_gc_cutoff. Temporarily get it back until cplane migrates. ## Summary of changes Add them. ref https://neondb.slack.com/archives/C03438W3FLZ/p1739877734963979	2025-02-18 13:57:12 +00:00
John Spray	39d42d846a	pageserver_api: fix decoding old-version TimelineInfo (#10845 ) ## Problem In #10707 some new fields were introduced in TimelineInfo. I forgot that we do not only use TimelineInfo for encoding, but also decoding when the storage controller calls into a pageserver, so this broke some calls from controller to pageserver while in a mixed-version state. ## Summary of changes - Make new fields have default behavior so that they are optional	2025-02-17 15:04:47 +00:00
John Spray	b8095f84a0	pageserver: make true GC cutoff visible in admin API, rebrand `latest_gc_cutoff` as `applied_gc_cutoff` (#10707 ) ## Problem We expose `latest_gc_cutoff` in our API, and callers understandably were using that to validate LSNs for branch creation. However, this is _not_ the true GC cutoff from a user's point of view: it's just the point at which we last actually did GC. The actual cutoff used when validating branch creations and page_service reads is the min() of latest_gc_cutoff and the planned GC lsn in GcInfo. Closes: https://github.com/neondatabase/neon/issues/10639 ## Summary of changes - Expose the more useful min() of GC cutoffs as `gc_cutoff_lsn` in the API, so that the most obviously named field is really the one people should use. - Retain the ability to read the LSN at which GC was actually done, in an `applied_gc_cutoff_lsn` field. - Internally rename `latest_gc_cutoff_lsn` to `applied_gc_cutoff_lsn` ("latest" was a confusing name, as the value in GcInfo is more up to date in terms of what a user experiences) - Temporarily preserve the old `latest_gc_cutoff_lsn` field for compat with control plane until we update it to use the new field. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-02-13 10:33:47 +00:00
Erik Grinaker	f62047ae97	pageserver: add separate semaphore for L0 compaction (#10780 ) ## Problem L0 compaction frequently gets starved out by other background tasks and image/GC compaction. L0 compaction must be responsive to keep read amplification under control. Touches #10694. Resolves #10689. ## Summary of changes Use a separate semaphore for the L0-only compaction pass. * Add a `CONCURRENT_L0_COMPACTION_TASKS` semaphore and `BackgroundLoopKind::L0Compaction`. * Add a setting `compaction_l0_semaphore` (default off via `compaction_l0_first`). * Use the L0 semaphore when doing an `OnlyL0Compaction` pass. * Use the background semaphore when doing a regular compaction pass (which includes an initial L0 pass). * While waiting for the background semaphore, yield for L0 compaction if triggered. * Add `CompactFlags::NoYield` to disable L0 yielding, and set it for the HTTP API route. * Remove the old `use_compaction_semaphore` setting and compaction-scoped semaphore. * Remove the warning when waiting for a semaphore; it's noisy and we have metrics.	2025-02-12 16:12:21 +00:00
Erik Grinaker	6c83ac3fd2	pageserver: do all L0 compaction before image compaction (#10744 ) ## Problem Image compaction can starve out L0 compaction if a tenant has several timelines with L0 debt. Touches #10694. Requires #10740. ## Summary of changes * Add an initial L0 compaction pass, in order of L0 count. * Add a tenant option `compaction_l0_first` to control the L0 pass (disabled by default). * Add `CompactFlags::OnlyL0Compaction` to run an L0-only compaction pass. * Clean up the compaction iteration logic. A later PR will use separate semaphores for the L0 and image compaction passes to avoid cross-tenant L0 starvation. That PR will also make image compaction yield if _any_ of the tenant's timelines have pending L0 compaction to further avoid starvation.	2025-02-11 22:08:46 +00:00
Alex Chi Z.	c1be84197e	feat(pageserver): preempt image layer generation if L0 piles up (#10572 ) ## Problem Image layer generation could block L0 compactions for a long time. ## Summary of changes * Refactored the return value of `create_image_layers_for_` functions to make it self-explainable. Preempt image layer generation in `Try` mode if L0 piles up. Note that we might potentially run into a state that only the beginning part of the keyspace gets image coverage. In that case, we either need to implement something to prioritize some keyspaces with image coverage, or tune the image_creation_threshold to ensure that the frequency of image creation could keep up with L0 compaction. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Erik Grinaker <erik@neon.tech>	2025-02-03 20:55:47 +00:00
Alex Chi Z.	983e18e63e	feat(pageserver): add compaction_upper_limit config (#10550 ) ## Problem Follow-up of the incident, we should not use the same bound on lower/upper limit of compaction files. This patch adds an upper bound limit, which is set to 50 for now. ## Summary of changes Add `compaction_upper_limit`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-01-28 23:18:32 +00:00
Vlad Lazar	c54cd9e76a	storcon: signal LSN wait to pageserver during live migration (#10452 ) ## Problem We've seen the ingest connection manager get stuck shortly after a migration. ## Summary of changes A speculative mitigation is to use the same mechanism as get page requests for kicking LSN ingest. The connection manager monitors LSN waits and queries the broker if no updates are received for the timeline. Closes https://github.com/neondatabase/neon/issues/10351	2025-01-28 17:33:07 +00:00
Erik Grinaker	1010b8add4	pageserver: add `l0_flush_wait_upload` setting (#10534 ) ## Problem We need a setting to disable the flush upload wait, to test L0 flush backpressure in staging. ## Summary of changes Add `l0_flush_wait_upload` setting.	2025-01-28 17:21:05 +00:00
Erik Grinaker	ddb9ae1214	pageserver: add compaction backpressure for layer flushes (#10405 ) ## Problem There is no direct backpressure for compaction and L0 read amplification. This allows a large buildup of compaction debt and read amplification. Resolves #5415. Requires #10402. ## Summary of changes Delay layer flushes based on the number of level 0 delta layers: * `l0_flush_delay_threshold`: delay flushes such that they take 2x as long (default `2 * compaction_threshold`). * `l0_flush_stall_threshold`: stall flushes until level 0 delta layers drop below threshold (default `4 * compaction_threshold`). If either threshold is reached, ephemeral layer rolls also synchronously wait for layer flushes to propagate this backpressure up into WAL ingestion. This will bound the number of frozen layers to 1 once backpressure kicks in, since all other frozen layers must flush before the rolled layer. ## Analysis This will significantly change the compute backpressure characteristics. Recall the three compute backpressure knobs: * `max_replication_write_lag`: 500 MB (based on Pageserver `last_received_lsn`). * `max_replication_flush_lag`: 10 GB (based on Pageserver `disk_consistent_lsn`). * `max_replication_apply_lag`: disabled (based on Pageserver `remote_consistent_lsn`). Previously, the Pageserver would keep ingesting WAL and build up ephemeral layers and L0 layers until the compute hit `max_replication_flush_lag` at 10 GB and began backpressuring. Now, once we delay/stall WAL ingestion, the compute will begin backpressuring after `max_replication_write_lag`, i.e. 500 MB. This is probably a good thing (we're not building up a ton of compaction debt), but we should consider tuning these settings. `max_replication_flush_lag` probably doesn't serve a purpose anymore, and we should consider removing it. Furthermore, the removal of the upload barrier in #10402 will mean that we no longer backpressure flushes based on S3 uploads, since `max_replication_apply_lag` is disabled. We should consider enabling this as well. ### When and what do we compact? Default compaction settings: * `compaction_threshold`: 10 L0 delta layers. * `compaction_period`: 20 seconds (between each compaction loop check). * `checkpoint_distance`: 256 MB (size of L0 delta layers). * `l0_flush_delay_threshold`: 20 L0 delta layers. * `l0_flush_stall_threshold`: 40 L0 delta layers. Compaction characteristics: * Minimum compaction volume: 10 layers * 256 MB = 2.5 GB. * Additional compaction volume (assuming 128 MB/s WAL): 128 MB/s * 20 seconds = 2.5 GB (10 L0 layers). * Required compaction bandwidth: 5.0 GB / 20 seconds = 256 MB/s. ### When do we hit `max_replication_write_lag`? Depending on how fast compaction and flushes happens, the compute will backpressure somewhere between `l0_flush_delay_threshold` or `l0_flush_stall_threshold` + `max_replication_write_lag`. * Minimum compute backpressure lag: 20 layers * 256 MB + 500 MB = 5.6 GB * Maximum compute backpressure lag: 40 layers * 256 MB + 500 MB = 10.0 GB This seems like a reasonable range to me.	2025-01-24 09:47:28 +00:00
Alex Chi Z.	7d4bfcdc47	feat(pageserver): add config items for gc-compaction auto trigger (#10455 ) ## Problem part of https://github.com/neondatabase/neon/issues/9114 The automatic trigger is already implemented at https://github.com/neondatabase/neon/pull/10221 but I need to write some tests and finish my experiments in staging before I can merge it with confidence. Given that I have some other patches that will modify the config items, I'd like to get the config items merged first to reduce conflicts. ## Summary of changes * add `l2_lsn` to index_part.json -- below that LSN, data have been processed by gc-compaction * add a set of gc-compaction auto trigger control items into the config --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-21 19:29:38 +00:00
Alex Chi Z.	2de2b26c62	feat(pageserver): add reldir migration configs (#10439 ) ## Problem Part of #9516 per RFC at https://github.com/neondatabase/neon/pull/10412 ## Summary of changes Adding the necessary config items and index_part items for the large relation count work. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-20 20:44:12 +00:00
Christian Schwarz	c47c5f4ace	fix(page_service pipelining): tenant cannot shut down because gate kept open while flushing responses (#10386 ) # Refs - fixes https://github.com/neondatabase/neon/issues/10309 - fixup of batching design, first introduced in https://github.com/neondatabase/neon/pull/9851 - refinement of https://github.com/neondatabase/neon/pull/8339 # Problem `Tenant::shutdown` was occasionally taking many minutes (sometimes up to 20) in staging and prod if the `page_service_pipelining.mode="concurrent-futures"` is enabled. # Symptoms The issue happens during shard migration between pageservers. There is page_service unavailability and hence effectively downtime for customers in the following case: 1. The source (state `AttachedStale`) gets stuck in `Tenant::shutdown`, waiting for the gate to close. 2. Cplane/Storcon decides to transition the target `AttachedMulti` to `AttachedSingle`. 3. That transition comes with a bump of the generation number, causing the `PUT .../location_config` endpoint to do a full `Tenant::shutdown` / `Tenant::attach` cycle for the target location. 4. That `Tenant::shutdown` on the target gets stuck, waiting for the gate to close. 5. Eventually the gate closes (`close completed`), correlating with a `page_service` connection handler logging that it's exiting because of a network error (`Connection reset by peer` or `Broken pipe`). While in (4): - `Tenant::shutdown` is stuck waiting for all `Timeline::shutdown` calls to complete. So, really, this is a `Timeline::shutdown` bug. - retries from Cplane/Storcon to complete above state transitions, fail with errors related to the tenant mgr slot being in state `TenantSlot::InProgress`, the tenant state being `TenantState::Stopping`, and the timelines being in `TimelineState::Stopping`, and the `Timeline::cancel` being cancelled. - Existing (and/or new?) page_service connections log errors `error reading relation or page version: Not found: Timed out waiting 30s for tenant active state. Latest state: None` # Root-Cause After a lengthy investigation ([internal write-up](https://www.notion.so/neondatabase/2025-01-09-batching-deadlock-Slow-Log-Analysis-in-Staging-176f189e00478050bc21c1a072157ca4?pvs=4)) I arrived at the following root cause. The `spsc_fold` channel (`batch_tx`/`batch_rx`) that connects the Batcher and Executor stages of the pipelined mode was storing a `Handle` and thus `GateGuard` of the Timeline that was not shutting down. The design assumption with pipelining was that this would always be a short transient state. However, that was incorrect: the Executor was stuck on writing/flushing an earlier response into the connection to the client, i.e., socket write being slow because of TCP backpressure. The probable scenario of how we end up in that case: 1. Compute backend process sends a continuous stream of getpage prefetch requests into the connection, but never reads the responses (why this happens: see Appendix section). 2. Batch N is processed by Batcher and Executor, up to the point where Executor starts flushing the response. 3. Batch N+1 is procssed by Batcher and queued in the `spsc_fold`. 4. Executor is still waiting for batch N flush to finish. 5. Batcher eventually hits the `TimeoutReader` error (10min). From here on it waits on the `spsc_fold.send(Err(QueryError(TimeoutReader_error)))` which will never finish because the batch already inside the `spsc_fold` is not being read by the Executor, because the Executor is still stuck in the flush. (This state is not observable at our default `info` log level) 6. Eventually, Compute backend process is killed (`close()` on the socket) or Compute as a whole gets killed (probably no clean TCP shutdown happening in that case). 7. Eventually, Pageserver TCP stack learns about (6) through RST packets and the Executor's flush() call fails with an error. 8. The Executor exits, dropping `cancel_batcher` and its end of the spsc_fold. This wakes Batcher, causing the `spsc_fold.send` to fail. Batcher exits. The pipeline shuts down as intended. We return from `process_query` and log the `Connection reset by peer` or `Broken pipe` error. The following diagram visualizes the wait-for graph at (5) ```mermaid flowchart TD Batcher --spsc_fold.send(TimeoutReader_error)--> Executor Executor --flush batch N responses--> socket.write_end socket.write_end --wait for TCP window to move forward--> Compute ``` # Analysis By holding the GateGuard inside the `spsc_fold` open, the pipelining implementation violated the principle established in (https://github.com/neondatabase/neon/pull/8339). That is, that `Handle`s must only be held across an await point if that await point is sensitive to the `<Handle as Deref<Target=Timeline>>::cancel` token. In this case, we were holding the Handle inside the `spsc_fold` while awaiting the `pgb_writer.flush()` future. One may jump to the conclusion that we should simply peek into the spsc_fold to get that Timeline cancel token and be sensitive to it during flush, then. But that violates another principle of the design from https://github.com/neondatabase/neon/pull/8339. That is, that the page_service connection lifecycle and the Timeline lifecycles must be completely decoupled. Tt must be possible to shut down one shard without shutting down the page_service connection, because on that single connection we might be serving other shards attached to this pageserver. (The current compute client opens separate connections per shard, but, there are plans to change that.) # Solution This PR adds a `handle::WeakHandle` struct that does _not_ hold the timeline gate open. It must be `upgrade()`d to get a `handle::Handle`. That `handle::Handle` _does_ hold the timeline gate open. The batch queued inside the `spsc_fold` only holds a `WeakHandle`. We only upgrade it while calling into the various `handle_` methods, i.e., while interacting with the `Timeline` via `<Handle as Deref<Target=Timeline>>`. All that code has always been required to be (and is!) sensitive to `Timeline::cancel`, and therefore we're guaranteed to bail from it quickly when `Timeline::shutdown` starts. We will drop the `Handle` immediately, before we start `pgb_writer.flush()`ing the responses. Thereby letting go of our hold on the `GateGuard`, allowing the timeline shutdown to complete while the page_service handler remains intact. # Code Changes * Reproducer & Regression Test * Developed and proven to reproduce the issue in https://github.com/neondatabase/neon/pull/10399 * Add a `Test` message to the pagestream protocol (`cfg(feature = "testing")`). * Drive-by minimal improvement to the parsing code, we now have a `PagestreamFeMessageTag`. * Refactor `pageserver/client` to allow sending and receiving `page_service` requests independently. * Add a Rust helper binary to produce situation (4) from above * Rationale: (4) and (5) are the same bug class, we're holding a gate open while `flush()`ing. * Add a Python regression test that uses the helper binary to demonstrate the problem. * Fix * Introduce and use `WeakHandle` as explained earlier. * Replace the `shut_down` atomic with two enum states for `HandleInner`, wrapped in a `Mutex`. * To make `WeakHandle::upgrade()` and `Handle::downgrade()` cache-efficient: * Wrap the `Types::Timeline` in an `Arc` * Wrap the `GateGuard` in an `Arc` * The separate `Arc`s enable uncontended cloning of the timeline reference in `upgrade()` and `downgrade()`. If instead we were `Arc<Timeline>::clone`, different connection handlers would be hitting the same cache line on every upgrade()/downgrade(), causing contention. * Please read the udpated module-level comment in `mod handle` module-level comment for details. # Testing & Performance The reproducer test that failed before the changes now passes, and obviously other tests are passing as well. We'll do more testing in staging, where the issue happens every ~4h if chaos migrations are enabled in storcon. Existing perf testing will be sufficient, no perf degradation is expected. It's a few more alloctations due to the added Arc's, but, they're low frequency. # Appendix: Why Compute Sometimes Doesn't Read Responses Remember, the whole problem surfaced because flush() was slow because Compute was not reading responses. Why is that? In short, the way the compute works, it only advances the page_service protocol processing when it has an interest in data, i.e., when the pagestore smgr is called to return pages. Thus, if compute issues a bunch of requests as part of prefetch but then it turns out it can service the query without reading those pages, it may very well happen that these messages stay in the TCP until the next smgr read happens, either in that session, or possibly in another session. If there’s too many unread responses in the TCP, the pageserver kernel is going to backpressure into userspace, resulting in our stuck flush(). All of this stems from the way vanilla Postgres does prefetching and "async IO": it issues `fadvise()` to make the kernel do the IO in the background, buffering results in the kernel page cache. It then consumes the results through synchronous `read()` system calls, which hopefully will be fast because of the `fadvise()`. If it turns out that some / all of the prefetch results are not needed, Postgres will not be issuing those `read()` system calls. The kernel will eventually react to that by reusing page cache pages that hold completed prefetched data. Uncompleted prefetch requests may or may not be processed -- it's up to the kernel. In Neon, the smgr + Pageserver together take on the role of the kernel in above paragraphs. In the current implementation, all prefetches are sent as GetPage requests to Pageserver. The responses are only processed in the places where vanilla Postgres would do the synchronous `read()` system call. If we never get to that, the responses are queued inside the TCP connection, which, once buffers run full, will backpressure into Pageserver's sending code, i.e., the `pgb_writer.flush()` that was the root cause of the problems we're fixing in this PR.	2025-01-16 20:34:02 +00:00
John Spray	fb0e2acb2f	pageserver: add `page_trace` API for debugging (#10293 ) ## Problem When a pageserver is receiving high rates of requests, we don't have a good way to efficiently discover what the client's access pattern is. Closes: https://github.com/neondatabase/neon/issues/10275 ## Summary of changes - Add `/v1/tenant/x/timeline/y/page_trace?size_limit_bytes=...&time_limit_secs=...` API, which returns a binary buffer. - Add `pagectl page-trace` tool to decode and analyze the output. --------- Co-authored-by: Erik Grinaker <erik@neon.tech>	2025-01-15 19:07:22 +00:00
Alex Chi Z.	b5d54ba52a	refactor(pageserver): move queue logic to compaction.rs (#10330 ) ## Problem close https://github.com/neondatabase/neon/issues/10031, part of https://github.com/neondatabase/neon/issues/9114 ## Summary of changes Move the compaction job generation to `compaction.rs`, thus making the code more readable and debuggable. We now also return running job through the get compaction job API, versus before we only return scheduled jobs. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-10 20:53:00 +00:00
Konstantin Knizhnik	20c40eb733	Add response tag to getpage request in V3 protocol version (#8686 ) ## Problem We have several serious data corruption incidents caused by mismatch of get-age requests: https://neondb.slack.com/archives/C07FJS4QF7V/p1723032720164359 We hope that the problem is fixed now. But it is better to prevent such kind of problems in future. Part of https://github.com/neondatabase/cloud/issues/16472 ## Summary of changes This PR introduce new V3 version of compute<->pageserver protocol, adding tag to getpage response. So now compute is able to check if it really gets response to the requested page. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-01-09 13:12:04 +00:00
Alex Chi Z.	3d1c3a80ae	feat(pageserver): add compact queue http endpoint (#10173 ) ## Problem We cannot get the size of the compaction queue and access the info. Part of #9114 ## Summary of changes * Add an API endpoint to get the compaction queue. * gc_compaction test case now waits until the compaction finishes. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-12-18 18:09:02 +00:00
Vlad Lazar	a3e80448e8	pageserver/storcon: add patch endpoints for tenant config metrics (#10020 ) ## Problem Cplane and storage controller tenant config changes are not additive. Any change overrides all existing tenant configs. This would be fine if both did client side patching, but that's not the case. Once this merges, we must update cplane to use the PATCH endpoint. ## Summary of changes ### High Level Allow for patching of tenant configuration with a `PATCH /v1/tenant/config` endpoint. It takes the same data as it's PUT counterpart. For example the payload below will update `gc_period` and unset `compaction_period`. All other fields are left in their original state. ``` { "tenant_id": "1234", "gc_period": "10s", "compaction_period": null } ``` ### Low Level * PS and storcon gain `PATCH /v1/tenant/config` endpoints. PS endpoint is only used for cplane managed instances. * `storcon_cli` is updated to have separate commands for `set-tenant-config` and `patch-tenant-config` Related https://github.com/neondatabase/cloud/issues/21043	2024-12-11 19:16:33 +00:00
Christian Schwarz	4d422b937c	pageserver: only throttle pagestream requests & bring back throttling deduction for smgr latency metrics (#9962 ) ## Problem In the batching PR - https://github.com/neondatabase/neon/pull/9870 I stopped deducting the time-spent-in-throttle fro latency metrics, i.e., - smgr latency metrics (`SmgrOpTimer`) - basebackup latency (+scan latency, which I think is part of basebackup). The reason for stopping the deduction was that with the introduction of batching, the trick with tracking time-spent-in-throttle inside RequestContext and swap-replacing it from the `impl Drop for SmgrOpTimer` no longer worked with >1 requests in a batch. However, deducting time-spent-in-throttle is desirable because our internal latency SLO definition does not account for throttling. ## Summary of changes - Redefine throttling to be a page_service pagestream request throttle instead of a throttle for repository `Key` reads through `Timeline::get` / `Timeline::get_vectored`. - This means reads done by `basebackup` are no longer subject to any throttle. - The throttle applies after batching, before handling of the request. - Drive-by fix: make throttle sensitive to cancellation. - Rename metric label `kind` from `timeline_get` to `pagestream` to reflect the new scope of throttling. To avoid config format breakage, we leave the config field named `timeline_get_throttle` and ignore the `task_kinds` field. This will be cleaned up in a future PR. ## Trade-Offs Ideally, we would apply the throttle before reading a request off the connection, so that we queue the minimal amount of work inside the process. However, that's not possible because we need to do shard routing. The redefinition of the throttle to limit pagestream request rate instead of repository `Key` rate comes with several downsides: - We're no longer able to use the throttle mechanism for other other tasks, e.g. image layer creation. However, in practice, we never used that capability anyways. - We no longer throttle basebackup.	2024-12-03 15:25:58 +00:00
Vlad Lazar	8fdf786217	pageserver: add tenant config override for wal receiver proto (#9888 ) ## Problem Can't change protocol at tenant granularity. ## Summary of changes Add tenant config level override for wal receiver protocol. ## Links Related: https://github.com/neondatabase/neon/issues/9336 Epic: https://github.com/neondatabase/neon/issues/9329	2024-11-27 13:46:23 +00:00
Christian Schwarz	450be26bbb	fast imports: initial Importer and Storage changes (#9218 ) Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Co-authored-by: Stas Kelvic <stas@neon.tech> # Context This PR contains PoC-level changes for a product feature that allows onboarding large databases into Neon without going through the regular data path. # Changes This internal RFC provides all the context * https://github.com/neondatabase/cloud/pull/19799 In the language of the RFC, this PR covers * the Importer code (`fast_import`) * all the Pageserver changes (mgmt API changes, flow implementation, etc) * a basic test for the Pageserver changes # Reviewing As acknowledged in the RFC, the code added in this PR is not ready for general availability. Also, the architecture is not to be discussed in this PR, but in the RFC and associated Slack channel instead. Reviewers of this PR should take that into consideration. The quality bar to apply during review depends on what area of the code is being reviewed: * Importer code (`fast_import`): practically anything goes * Core flow (`flow.rs`): * Malicious input data must be expected and the existing threat models apply. * The code must not be safe to execute on dedicated Pageserver instances: * This means in particular that tenants on other Pageserver instances must not be affected negatively wrt data confidentiality, integrity or availability. * Other code: the usual quality bar * Pay special attention to correct use of gate guards, timeline cancellation in all places during shutdown & migration, etc. * Consider the broader system impact; if you find potentially problematic interactions with Storage features that were not covered in the RFC, bring that up during the review. I recommend submitting three separate reviews, for the three high-level areas with different quality bars. # References (Internal-only) * refs https://github.com/neondatabase/cloud/issues/17507 * refs https://github.com/neondatabase/company_projects/issues/293 * refs https://github.com/neondatabase/company_projects/issues/309 * refs https://github.com/neondatabase/cloud/issues/20646 --------- Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com> Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Co-authored-by: John Spray <john@neon.tech>	2024-11-22 22:47:06 +00:00
Arpad Müller	ee68bbf6f5	Add tenant config option to allow timeline_offloading (#9598 ) Allow us to enable timeline offloading for single tenants without having to enable it for the entire pageserver. Part of #8088.	2024-11-04 21:01:18 +01:00
Alex Chi Z.	57c21aff9f	refactor(pageserver): remove aux v1 configs (#9494 ) ## Problem Part of https://github.com/neondatabase/neon/issues/8623 ## Summary of changes Removed all aux-v1 config processing code. Note that we persisted it into the index part file, so we cannot really remove the field from index part. I also kept the config item within the tenant config, but we will not read it any more. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-28 19:51:14 +00:00
Christian Schwarz	6f5c262684	pageserver: add testing API to scan layers for disposable keys (#9393 ) This PR adds a pageserver mgmt API to scan a layer file for disposable keys. It hooks it up to the sharding compaction test, demonstrating that we're not filtering out all disposable keys. This is extracted from PGDATA import (https://github.com/neondatabase/neon/pull/9218) where I do the filtering of layer files based on `is_key_disposable`.	2024-10-25 14:16:45 +02:00
Christian Schwarz	b782b11b33	refactor(timeline creation): represent bootstrap vs branch using enum (#9366 ) # Problem Timeline creation can either be bootstrap or branch. The distinction is made based on whether the `ancestor_` fields are present or not. In the PGDATA import code (https://github.com/neondatabase/neon/pull/9218), I add a third variant to timeline creation. # Solution The above pushed me to refactor the code in Pageserver to distinguish the different creation requests through enum variants. There is no externally observable effect from this change. On the implementation level, a notable change is that the acquisition of the `TimelineCreationGuard` happens later than before. This is necessary so that we have everything in place to construct the `CreateTimelineIdempotency`. Notably, this moves the acquisition of the creation guard _after_ the acquisition of the `gc_cs` lock in the case of branching. This might appear as if we're at risk of holding `gc_cs` longer than before this PR, but, even before this PR, we were holding `gc_cs` until after the `wait_completion()` that makes the timeline creation durable in S3 returns. I don't see any deadlock risk with reversing the lock acquisition order. As a drive-by change, I found that the `create_timeline()` function in `neon_local` is unused, so I removed it. # Refs platform context: https://github.com/neondatabase/neon/pull/9218 * product context: https://github.com/neondatabase/cloud/issues/17507 * next PR stacked atop this one: https://github.com/neondatabase/neon/pull/9501	2024-10-25 10:04:27 +00:00
Arpad Müller	6f8fcdf9ea	Timeline offloading persistence (#9444 ) Persist timeline offloaded state to S3. Right now, as of #8907, at each restart of the pageserver, all offloaded state is lost, so we load the full timeline again. As it starts with an empty local directory, we might potentially download some files again, leading to downloads that are ultimately wasteful. This patch adds support for persisting the offloaded state, allowing us to never load offloaded timelines in the first place. The persistence feature is facilitated via a new file in S3 that is tenant-global, which contains a list of all offloaded timelines. It is updated each time we offload or unoffload a timeline, and otherwise never touched. This choice means that tenants where no offloading is happening will not immediately get a manifest, keeping the change very minimal at the start. We leave generation support for future work. It is important to support generations, as in the worst case, the manifest might be overwritten by an older generation after a timeline has been unoffloaded (and unarchived), so the next pageserver process instantiation might wrongly believe that some timeline is still offloaded even though it should be active. Part of #9386, #8088	2024-10-22 20:52:30 +00:00
Arpad Müller	34b6bd416a	offloaded timeline list API (#9461 ) Add a way to list the offloaded timelines. Before, one had to look at logs to figure out if a timeline has been offloaded or not, or use the non-presence of a certain timeline in the list of normal timelines. Now, one can list them directly. Part of #8088	2024-10-21 16:33:05 +01:00
Alex Chi Z.	63b3491c1b	refactor(pageserver): remove aux v1 code path (#9424 ) Part of the aux v1 retirement https://github.com/neondatabase/neon/issues/8623 ## Summary of changes Remove write/read path for aux v1, but keeping the config item and the index part field for now. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-17 17:22:44 +01:00
Yuchen Liang	bee04b8a69	pageserver: add direct io config to virtual file (#9214 ) ## Problem We need a way to incrementally switch to direct IO. During the rollout we might want to switch to O_DIRECT on image and delta layer read path first before others. ## Summary of changes - Revisited and simplified direct io config in `PageserverConf`. - We could add a fallback mode for open, but for read there isn't a reasonable alternative (without creating another buffered virtual file). - Added a wrapper around `VirtualFile`, current implementation become `VirtualFileInner` - Use `open_v2`, `create_v2`, `open_with_options_v2` when we want to use the IO mode specified in PS config. - Once we onboard all IO through VirtualFile using this new API, we will delete the old code path. - Make io mode live configurable for benchmarking. - Only guaranteed for files opened after the config change, so do it before the experiment. As an example, we are using `open_v2` with `virtual_file::IoMode::Direct` in https://github.com/neondatabase/neon/pull/9169 We also remove `io_buffer_alignment` config in `a04cfd754b` and use it as a compile time constant. This way we don't have to carry the alignment around or make frequent call to retrieve this information from the static variable. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-10-09 08:33:07 -04:00
Heikki Linnakangas	5cbf5b45ae	Remove TenantState::Loading (#9118 ) The last real use was removed in commit `de90bf4663`. It was still used in a few unit tests, but they can use Attaching too.	2024-09-24 20:58:54 +00:00
Heikki Linnakangas	728b79b9dd	Remove some unnecessary derives	2024-09-19 11:57:10 +03:00
Heikki Linnakangas	2d885ac07a	Update strum (#8962 ) I wanted to use some features from the newer version. The PR that needed the new version is not ready yet (and might never be), but seems nice to stay up in any case.	2024-09-08 21:47:57 +03:00
Arpad Müller	cbcd4058ed	Fix 1.82 clippy lint too_long_first_doc_paragraph (#8941 ) Addresses the 1.82 beta clippy lint `too_long_first_doc_paragraph` by adding newlines to the first sentence if it is short enough, and making a short first sentence if there is the need.	2024-09-06 14:33:52 +02:00
Christian Schwarz	850421ec06	refactor(pageserver): rely on serde derive for toml deserialization (#7656 ) This PR simplifies the pageserver configuration parsing as follows: * introduce the `pageserver_api::config::ConfigToml` type * implement `Default` for `ConfigToml` * use serde derive to do the brain-dead leg-work of processing the toml document * use `serde(default)` to fill in default values * in `pageserver` crate: * use `toml_edit` to deserialize the pageserver.toml string into a `ConfigToml` * `PageServerConfig::parse_and_validate` then * consumes the `ConfigToml` * destructures it exhaustively into its constituent fields * constructs the `PageServerConfig` The rules are: * in `ConfigToml`, use `deny_unknown_fields` everywhere * static default values go in `pageserver_api` * if there cannot be a static default value (e.g. which default IO engine to use, because it depends on the runtime), make the field in `ConfigToml` an `Option` * if runtime-augmentation of a value is needed, do that in `parse_and_validate` * a good example is `virtual_file_io_engine` or `l0_flush`, both of which need to execute code to determine the effective value in `PageServerConf` The benefits: * massive amount of brain-dead repetitive code can be deleted * "unused variable" compile-time errors when removing a config value, due to the exhaustive destructuring in `parse_and_validate` * compile-time errors guide you when adding a new config field Drawbacks: * serde derive is sometimes a bit too magical * `deny_unknown_fields` is easy to miss Future Work / Benefits: * make `neon_local` use `pageserver_api` to construct `ConfigToml` and write it to `pageserver.toml` * This provides more type safety / coompile-time errors than the current approach. ### Refs Fixes #3682 ### Future Work * `remote_storage` deser doesn't reject unknown fields https://github.com/neondatabase/neon/issues/8915 * clean up `libs/pageserver_api/src/config.rs` further * break up into multiple files, at least for tenant config * move `models` as appropriate / refine distinction between config and API models / be explicit about when it's the same * use `pub(crate)` visibility on `mod defaults` to detect stale values	2024-09-05 14:59:49 +02:00
Alex Chi Z.	3d9001d83f	fix(pageserver): is_archived should be optional (#8902 ) Set the field to optional, otherwise there will be decode errors when newer version of the storage controller receives the JSON from older version of the pageservers. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-09-03 14:05:06 -04:00
Conrad Ludgate	a644f01b6a	proxy+pageserver: shared leaky bucket impl (#8539 ) In proxy I switched to a leaky-bucket impl using the GCRA algorithm. I figured I could share the code with pageserver and remove the leaky_bucket crate dependency with some very basic tokio timers and queues for fairness. The underlying algorithm should be fairly clear how it works from the comments I have left in the code. --- In benchmarking pageserver, @problame found that the new implementation fixes a getpage throughput discontinuity in pageserver under the `pagebench get-page-latest-lsn` benchmark with the clickbench dataset (`test_perf_olap.py`). The discontinuity is that for any of `--num-clients={2,3,4}`, getpage throughput remains 10k. With `--num-clients=5` and greater, getpage throughput then jumps to the configured 20k rate limit. With the changes in this PR, the discontinuity is gone, and we scale throughput linearly to `--num-clients` until the configured rate limit. More context in https://github.com/neondatabase/cloud/issues/16886#issuecomment-2315257641. closes https://github.com/neondatabase/cloud/issues/16886 --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-08-29 11:26:52 +00:00
Heikki Linnakangas	2d10306f7a	Remove support for pageserver <-> compute protocol version 1 (#8774 ) Protocol version 2 has been the default for a while now, and we no longer have any computes running in production that used protocol version 1. This completes the migration by removing support for v1 in both the pageserver and the compute. See issue #6211.	2024-08-27 18:36:33 +03:00
Arpad Müller	2dd53e7ae0	Timeline archival test (#8824 ) This PR: * Implements the rule that archived timelines require all of their children to be archived as well, as specified in the RFC. There is no fancy locking mechanism though, so the precondition can still be broken. As a TODO for later, we still allow unarchiving timelines with archived parents. * Adds an `is_archived` flag to `TimelineInfo` * Adds timeline_archival_config to `PageserverHttpClient` * Adds a new `test_timeline_archive` test, loosely based on `test_timeline_delete` Part of #8088	2024-08-26 17:30:19 +02:00
Alex Chi Z.	6eb638f4b3	feat(pageserver): warn on aux v1 tenants + default to v2 (#8625 ) part of https://github.com/neondatabase/neon/issues/8623 We want to discover potential aux v1 customers that we might have missed from the migrations. ## Summary of changes Log warnings on basebackup, load timeline, and the first put_file. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-22 22:31:38 +01:00
Yuchen Liang	542385e364	feat(pageserver): add direct io pageserver config (#8622 ) Part of #8130, [RFC: Direct IO For Pageserver](https://github.com/neondatabase/neon/blob/problame/direct-io-rfc/docs/rfcs/034-direct-io-for-pageserver.md) ## Description Add pageserver config for evaluating/enabling direct I/O. - Disabled: current default, uses buffered io as is. - Evaluate: still uses buffered io, but could do alignment checking and perf simulation (pad latency by direct io RW to a fake file). - Enabled: uses direct io, behavior on alignment error is configurable. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-08-07 21:04:19 +01:00
Joonas Koivunen	138f008bab	feat: persistent gc blocking (#8600 ) Currently, we do not have facilities to persistently block GC on a tenant for whatever reason. We could do a tenant configuration update, but that is risky for generation numbers and would also be transient. Introduce a `gc_block` facility in the tenant, which manages per timeline blocking reasons. Additionally, add HTTP endpoints for enabling/disabling manual gc blocking for a specific timeline. For debugging, individual tenant status now includes a similar string representation logged when GC is skipped. Cc: #6994	2024-08-06 10:09:56 +01:00
John Spray	f5db655447	pageserver: simplify LayerAccessStats (#8431 ) ## Problem LayerAccessStats contains a lot of detail that we don't use: short histories of most recent accesses, specifics on what kind of task accessed a layer, etc. This is all stored inside a Mutex, which is locked every time something accesses a layer. ## Summary of changes - Store timestamps at a very low resolution (to the nearest second), sufficient for use on the timescales of eviction. - Pack access time and last residence change time into a single u64 - Use the high bits of the u64 for other flags, including the new layer visibility concept. - Simplify the external-facing model for access stats to just include what we now track. Note that the `HistoryBufferWithDropCounter` is removed here because it is no longer used. I do not dislike this type, we just happen not to use it for anything else at present. Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-07-24 08:17:28 +01:00
Arpad Müller	3d582b212a	Add archival_config endpoint to pageserver (#8414 ) This adds an archival_config endpoint to the pageserver. Currently it has no effect, and always "works", but later the intent is that it will make a timeline archived/unarchived. - [x] add yml spec - [x] add endpoint handler Part of https://github.com/neondatabase/neon/issues/8088	2024-07-19 21:01:59 +02:00

1 2 3 4

167 Commits