rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 05:00:37 +00:00

Author	SHA1	Message	Date
John Spray	8e2e9f0fed	pageserver: generation-aware storage for TenantManifest (#9555 ) ## Problem When tenant manifest objects are written without a generation suffix, concurrently attached pageservers may stamp on each others writes of the manifest and cause undefined behavior. Closes: #9543 ## Summary of changes - Use download_generation_object helper when reading manifests, to search for the most recent generation - Use Tenant::generation as the generation suffix when writing manifests.	2024-10-29 23:24:04 +01:00
Alex Chi Z.	81f9aba005	fix(pagectl): layer parsing and image layer dump (#9571 ) This patch contains various improvements for the pagectl tool. ## Summary of changes * Rewrite layer name parsing: LayerName now supports all variants we use now. * Drop pagectl's own layer parsing function, use LayerName in the pageserver crate. * Support image layer dumping in the layer dump command using ImageLayer::dump, drop the original implementation. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-29 15:16:23 -04:00
Alex Chi Z.	88ff8a7803	feat(pageserver): support partial gc-compaction for lowest retain lsn (#9134 ) part of https://github.com/neondatabase/neon/issues/8921, https://github.com/neondatabase/neon/issues/9114 ## Summary of changes We start the partial compaction implementation with the image layer partial generation. The partial compaction API now takes a key range. We will only generate images for that key range for now, and remove layers fully included in the key range after compaction. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-10-29 18:25:32 +00:00
Konstantin Knizhnik	0c075fab3a	Add --replica parameter to basebackup (#9553 ) ## Problem See https://github.com/neondatabase/neon/pull/9458 This PR separates PS related changes in #9458 from compute_ctl changes to enforce that PS is deployed before compute. ## Summary of changes This PR adds handlings of `--replica` parameters of backebackup to page server. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-10-29 18:40:10 +02:00
John Spray	7a1331eee5	pageserver: make concurrent offloaded timeline operations safe wrt manifest uploads (#9557 ) ## Problem Uploads of the tenant manifest could race between different tasks, resulting in unexpected results in remote storage. Closes: https://github.com/neondatabase/neon/issues/9556 ## Summary of changes - Create a central function for uploads that takes a tokio::sync::Mutex - Store the latest upload in that Mutex, so that when there is lots of concurrency (e.g. archive 20 timelines at once) we can coalesce their manifest writes somewhat.	2024-10-29 13:54:48 +00:00
John Spray	4ef74215e1	pageserver: refactor generation-aware loading code into generic (#9545 ) ## Problem Indices used to be the only kind of object where we had to search across generations to find the most recent one. As of https://github.com/neondatabase/neon/issues/9543, manifests will need the same treatment. ## Summary of changes - Refactor download_index_part to a generic download_generation_object function, which will be usable for downloading manifest objects as well.	2024-10-29 13:00:03 +00:00
Arpad Müller	a73402e646	Offloaded timeline deletion (#9519 ) As pointed out in https://github.com/neondatabase/neon/pull/9489#discussion_r1814699683 , we currently didn't support deletion for offloaded timelines after the timeline has been loaded from the manifest instead of having been offloaded. This was because the upload queue hasn't been initialized yet. This PR thus initializes the timeline and shuts it down immediately. Part of #8088	2024-10-29 10:41:53 +00:00
Vlad Lazar	07b974480c	pageserver: move things around to prepare for decoding logic (#9504 ) ## Problem We wish to have high level WAL decoding logic in `wal_decoder::decoder` module. ## Summary of Changes For this we need the `Value` and `NeonWalRecord` types accessible there, so: 1. Move `Value` and `NeonWalRecord` to `pageserver::value` and `pageserver::record` respectively. 2. Get rid of `pageserver::repository` (follow up from (1)) 3. Move PG specific WAL record types to `postgres_ffi::walrecord`. In theory they could live in `wal_decoder`, but it would create a circular dependency between `wal_decoder` and `postgres_ffi`. Long term it makes sense for those types to be PG version specific, so that will work out nicely. 4. Move higher level WAL record types (to be ingested by pageserver) into `wal_decoder::models` Related: https://github.com/neondatabase/neon/issues/9335 Epic: https://github.com/neondatabase/neon/issues/9329	2024-10-29 10:00:34 +00:00
Arpad Müller	62f5d484d9	Assert the tenant to be active in `unoffload_timeline` (#9539 ) Currently, all callers of `unoffload_timeline` ensure that the tenant the unoffload operation is called on is active. We rely on it being active as we activate the timeline below and don't want to race with the activation code of the tenant (in the worst case, activating a timeline twice). Therefore, add this assertion. Part of #8088	2024-10-29 00:36:05 +00:00
Alex Chi Z.	57c21aff9f	refactor(pageserver): remove aux v1 configs (#9494 ) ## Problem Part of https://github.com/neondatabase/neon/issues/8623 ## Summary of changes Removed all aux-v1 config processing code. Note that we persisted it into the index part file, so we cannot really remove the field from index part. I also kept the config item within the tenant config, but we will not read it any more. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-28 19:51:14 +00:00
Arpad Müller	e7277885b3	Don't consider archived timelines for synthetic size calculation (#9497 ) Archived timelines should not count towards synthetic size. Closes #9384. Part of #8088.	2024-10-26 13:27:57 +00:00
Yuchen Liang	85b954f449	pageserver: add tokio-epoll-uring slots waiters queue depth metrics (#9482 ) In complement to https://github.com/neondatabase/tokio-epoll-uring/pull/56. ## Problem We want to make tokio-epoll-uring slots waiters queue depth observable via Prometheus. ## Summary of changes - Add `pageserver_tokio_epoll_uring_slots_submission_queue_depth` metrics as a `Histogram`. - Each thread-local tokio-epoll-uring system is given a `LocalHistogram` to observe the metrics. - Keep a list of `Arc<ThreadLocalMetrics>` used on-demand to flush data to the shared histogram. - Extend `Collector::collect` to report `pageserver_tokio_epoll_uring_slots_submission_queue_depth`. Signed-off-by: Yuchen Liang <yuchen@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-10-25 21:30:57 +01:00
Arpad Müller	76328ada05	Fix unoffload_timeline races with creation (#9525 ) This PR does two things: 1. Obtain a `TimelineCreateGuard` object in `unoffload_timeline`. This prevents two unoffload tasks from racing with each other. While they already obtain locks for `timelines` and `offloaded_timelines`, they aren't sufficient, as we have already constructed an entire timeline at that point. We shouldn't ever have two `Timeline` objects in the same process at the same time. 2. don't allow timeline creations for timelines that have been offloaded. Obviously they already exist, so we should not allow creation. the previous logic only looked at the timelines list. Part of #8088	2024-10-25 20:06:27 +00:00
John Spray	8297f7a181	pageserver: fix N^2 I/O when processing relation drops in transaction abort (#9507 ) ## Problem We have some known N^2 behaviors when it comes to large relation counts, due to the monolithic encoding and full rewrites of of RelDirectory each time a relation is added. Ordinarily our backpressure mechanisms give "slow but steady" performance when creating/dropping/truncating relations. However, in the case of a transaction abort, it is possible for a single WAL record to drop an unbounded number of relations. The results in an unavailable compute, as when it sends one of these records, it can stall the pageserver's ingest for many minutes, even though the compute only sent a small amount of WAL. Closes https://github.com/neondatabase/neon/issues/9505 ## Summary of changes - Rewrite relation-dropping code to do one read/modify/write cycle of RelDirectory, instead of doing it separately for each relation in a loop. - Add a test for the bug scenario encountered: `test_tx_abort_with_many_relations` The test has ~40s runtime on my workstation. About 1 second of that is the part where we wait for ingest to catch up after a rollback, the rest is the slowness of creating and truncating a large number of relations. --------- Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-10-25 15:09:02 +01:00
Christian Schwarz	2090e928d1	refactor(timeline creation): idempotency checking (#9501 ) # Context In the PGDATA import code (https://github.com/neondatabase/neon/pull/9218) I add a third way to create timelines, namely, by importing from a copy of a vanilla PGDATA directory in object storage. For idempotency, I'm using the PGDATA object storage location specification, which is stored in the IndexPart for the entire lifespan of the timeline. When loading the timeline from remote storage, that value gets stored inside `struct Timeline` and timeline creation compares the creation argument with that value to determine idempotency of the request. # Changes This PR refactors the existing idempotency handling of Timeline bootstrap and branching such that we simply compare the `CreateTimelineIdempotency` struct, using the derive-generated `PartialEq` implementation. Also, by spelling idempotency out in the type names, I find it adds a lot of clarity. The pathway to idempotency via requester-provided idempotency key also becomes very straight-forward, if we ever want to do this in the future. # Refs * platform context: https://github.com/neondatabase/neon/pull/9218 * product context: https://github.com/neondatabase/cloud/issues/17507 * stacks on top of https://github.com/neondatabase/neon/pull/9366	2024-10-25 14:44:20 +01:00
Christian Schwarz	6f5c262684	pageserver: add testing API to scan layers for disposable keys (#9393 ) This PR adds a pageserver mgmt API to scan a layer file for disposable keys. It hooks it up to the sharding compaction test, demonstrating that we're not filtering out all disposable keys. This is extracted from PGDATA import (https://github.com/neondatabase/neon/pull/9218) where I do the filtering of layer files based on `is_key_disposable`.	2024-10-25 14:16:45 +02:00
Arpad Müller	4d9036bf1f	Support offloaded timelines during shard split (#9489 ) Before, we didn't copy over the `index-part.json` of offloaded timelines to the new shard's location, resulting in the new shard not knowing the timeline even exists. In #9444, we copy over the manifest, but we also need to do this for `index-part.json`. As the operations to do are mostly the same between offloaded and non-offloaded timelines, we can iterate over all of them in the same loop, after the introduction of a `TimelineOrOffloadedArcRef` type to generalize over the two cases. This is analogous to the deletion code added in #8907. The added test also ensures that the sharded archival config endpoint works, something that has not yet been ensured by tests. Part of #8088	2024-10-25 12:32:46 +02:00
Vlad Lazar	b3bedda6fd	pageserver/walingest: log on gappy rel extend (#9502 ) ## Problem https://github.com/neondatabase/neon/pull/9492 added a metric to track the total count of block gaps filled on rel extend. More context is needed to understand when this happens. The current theory is that it may only happen on pg 14 and pg 15 since they do not WAL log relation extends. ## Summary of Changes A rate limited log is added.	2024-10-25 11:15:53 +01:00
Christian Schwarz	b782b11b33	refactor(timeline creation): represent bootstrap vs branch using enum (#9366 ) # Problem Timeline creation can either be bootstrap or branch. The distinction is made based on whether the `ancestor_` fields are present or not. In the PGDATA import code (https://github.com/neondatabase/neon/pull/9218), I add a third variant to timeline creation. # Solution The above pushed me to refactor the code in Pageserver to distinguish the different creation requests through enum variants. There is no externally observable effect from this change. On the implementation level, a notable change is that the acquisition of the `TimelineCreationGuard` happens later than before. This is necessary so that we have everything in place to construct the `CreateTimelineIdempotency`. Notably, this moves the acquisition of the creation guard _after_ the acquisition of the `gc_cs` lock in the case of branching. This might appear as if we're at risk of holding `gc_cs` longer than before this PR, but, even before this PR, we were holding `gc_cs` until after the `wait_completion()` that makes the timeline creation durable in S3 returns. I don't see any deadlock risk with reversing the lock acquisition order. As a drive-by change, I found that the `create_timeline()` function in `neon_local` is unused, so I removed it. # Refs platform context: https://github.com/neondatabase/neon/pull/9218 * product context: https://github.com/neondatabase/cloud/issues/17507 * next PR stacked atop this one: https://github.com/neondatabase/neon/pull/9501	2024-10-25 10:04:27 +00:00
Vlad Lazar	5069123b6d	pageserver: refactor ingest inplace to decouple decoding and handling (#9472 ) ## Problem WAL ingest couples decoding of special records with their handling (updates to the storage engine mostly). This is a roadblock for our plan to move WAL filtering (and implicitly decoding) to safekeepers since they cannot do writes to the storage engine. ## Summary of changes This PR decouples the decoding of the special WAL records from their application. The changes are done in place and I've done my best to refrain from refactorings and attempted to preserve the original code as much as possible. Related: https://github.com/neondatabase/neon/issues/9335 Epic: https://github.com/neondatabase/neon/issues/9329	2024-10-24 17:12:47 +01:00
Alex Chi Z.	fb0406e9d2	refactor(pageserver): refactor split writers using batch layer writer (#9493 ) part of https://github.com/neondatabase/neon/issues/9114, https://github.com/neondatabase/neon/issues/8836, https://github.com/neondatabase/neon/issues/8362 The split layer writer code can be used in a more general way: the caller puts unfinished writers into the batch layer writer and let batch layer writer to ensure the atomicity of the layer produces. ## Summary of changes * Add batch layer writer, which atomically finishes the layers. `BatchLayerWriter::finish` is simply a copy-paste from previous split layer writers. * Refactor split writers to use the batch layer writer. * The current split writer tests cover all code path of batch layer writer. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-24 10:49:54 -04:00
Christian Schwarz	6f34f97573	refactor(pageserver(load_remote_timeline)) remove dead code handling absence of IndexPart (#9408 ) The code is dead at runtime since we're nowadays always running with remote storage and treat it as the source of truth during attach. Clean it up as a preliminary to https://github.com/neondatabase/neon/pull/9218. Related: https://github.com/neondatabase/neon/pull/9366	2024-10-24 09:00:22 +01:00
Vlad Lazar	ac1205c14c	pageserver: add metric for number of zeroed pages on rel extend (#9492 ) ## Problem Filling the gap in with zeroes is annoying for sharded ingest. We are not sure it even happens in reality. ## Summary of Changes Add one global counter which tracks how many such gap blocks we filled on relation extends. We can add more metrics once we understand the scope.	2024-10-23 19:58:28 +01:00
Arpad Müller	3a3bd34a28	Rename IndexPart::{from_s3_bytes,to_s3_bytes} (#9481 ) We support multiple storage backends now, so remove the `_s3_` from the name. Analogous to the names adopted for tenant manifests added in #9444.	2024-10-23 00:34:24 +02:00
Alex Chi Z.	64949a37a9	fix(pageserver): make delta split layer writer finish atomic (#9048 ) similar to https://github.com/neondatabase/neon/pull/8841, we make the delta layer writer atomic when finishing the layers. ## Summary of changes * `put_value` not taking discard fn anymore * `finish` decides what layers to keep --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-22 22:06:21 +00:00
Arpad Müller	6f8fcdf9ea	Timeline offloading persistence (#9444 ) Persist timeline offloaded state to S3. Right now, as of #8907, at each restart of the pageserver, all offloaded state is lost, so we load the full timeline again. As it starts with an empty local directory, we might potentially download some files again, leading to downloads that are ultimately wasteful. This patch adds support for persisting the offloaded state, allowing us to never load offloaded timelines in the first place. The persistence feature is facilitated via a new file in S3 that is tenant-global, which contains a list of all offloaded timelines. It is updated each time we offload or unoffload a timeline, and otherwise never touched. This choice means that tenants where no offloading is happening will not immediately get a manifest, keeping the change very minimal at the start. We leave generation support for future work. It is important to support generations, as in the worst case, the manifest might be overwritten by an older generation after a timeline has been unoffloaded (and unarchived), so the next pageserver process instantiation might wrongly believe that some timeline is still offloaded even though it should be active. Part of #9386, #8088	2024-10-22 20:52:30 +00:00
Arpad Müller	34b6bd416a	offloaded timeline list API (#9461 ) Add a way to list the offloaded timelines. Before, one had to look at logs to figure out if a timeline has been offloaded or not, or use the non-presence of a certain timeline in the list of normal timelines. Now, one can list them directly. Part of #8088	2024-10-21 16:33:05 +01:00
Yuchen Liang	49d5e56c08	pageserver: use direct IO for delta and image layer reads (#9326 ) Part of #8130 ## Problem Pageserver previously goes through the kernel page cache for all the IOs. The kernel page cache makes light-loaded pageserver have deceptive fast performance. Using direct IO would offer predictable latencies of our virtual file IO operations. In particular for reads, the data pages also have an extremely low temporal locality because the most frequently accessed pages are cached on the compute side. ## Summary of changes This PR enables pageserver to use direct IO for delta layer and image layer reads. We can ship them separately because these layers are write-once, read-many, so we will not be mixing buffered IO with direct IO. - implement `IoBufferMut`, an buffer type with aligned allocation (currently set to 512). - use `IoBufferMut` at all places we are doing reads on image + delta layers. - leverage Rust type system and use `IoBufAlignedMut` marker trait to guarantee that the input buffers for the IO operations are aligned. - page cache allocation is also made aligned. _* in-memory layer reads and the write path will be shipped separately._ ## Testing Integration test suite run with O_DIRECT enabled: https://github.com/neondatabase/neon/pull/9350 ## Performance We evaluated performance based on the `get-page-at-latest-lsn` benchmark. The results demonstrate a decrease in the number of IOps, no sigificant change in the latency mean, and an slight improvement on the p99.9 and p99.99 latencies. [Benchmark](https://www.notion.so/neondatabase/Benchmark-O_DIRECT-for-image-and-delta-layers-2024-10-01-112f189e00478092a195ea5a0137e706?pvs=4) ## Rollout We will add `virtual_file_io_mode=direct` region by region to enable direct IO on image + delta layers. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-10-21 11:01:25 -04:00
Alex Chi Z.	aca81f5fa4	fix(pageserver): make image split layer writer finish atomic (#8841 ) Part of https://github.com/neondatabase/neon/issues/8836 ## Summary of changes This pull request makes the image layer split writer atomic when finishing the layers. All the produced layers either finish at the same time, or discard at the same time. Note that this does not prevent atomicity when crash, but anyways, it will be cleaned up on pageserver restart. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-10-21 15:59:48 +01:00
Arpad Müller	71d09c78d4	Accept basebackup <tenant> <timeline> --gzip requests (#9456 ) In #9453, we want to remove the non-gzipped basebackup code in the computes, and always request gzipped basebackups. However, right now the pageserver's page service only accepts basebackup requests in the following formats: * `basebackup <tenant_id> <timeline_id>`, lsn is determined by the pageserver as the most recent one (`timeline.get_last_record_rlsn()`) * `basebackup <tenant_id> <timeline_id> <lsn>` * `basebackup <tenant_id> <timeline_id> <lsn> --gzip` We add a fourth case, `basebackup <tenant_id> <timeline_id> --gzip` to allow gzipping the request for the latest lsn as well.	2024-10-19 00:23:49 +02:00
Conrad Ludgate	b8304f90d6	2024 oct new clippy lints (#9448 ) Fixes new lints from `cargo +nightly clippy` (`clippy 0.1.83 (798fb83f 2024-10-16)`)	2024-10-18 10:27:50 +01:00
John Spray	24398bf060	pageserver: detect & warn on loading an old index which is probably the result of a bad generation (#9383 ) ## Problem The pageserver generally trusts the storage controller/control plane to give it valid generations. However, sometimes it should be obvious that a generation is bad, and for defense in depth we should detect that on the pageserver. This PR is part 1 of 2: 1. in this PR we detect and warn on such situations, but do not block starting up the tenant. Once we have confidence that the check is not firing unexpectedly in the field 2. part 2 of 2 will introduce a condition that refuses to start a tenant in this situtation, and a test for that (maybe, if we can figure out how to spoof an ancient mtime) Related: #6951 ## Summary of changes - When loading an index older than 2 weeks, log an INFO message noting that we will check for other indices - When loading an index older than 2 weeks _and_ a newer-generation index exists, log a warning.	2024-10-17 19:02:24 +01:00
Alex Chi Z.	63b3491c1b	refactor(pageserver): remove aux v1 code path (#9424 ) Part of the aux v1 retirement https://github.com/neondatabase/neon/issues/8623 ## Summary of changes Remove write/read path for aux v1, but keeping the config item and the index part field for now. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-17 17:22:44 +01:00
Erik Grinaker	4c9835f4a3	storage_controller: delete stale shards when deleting tenant (#9333 ) ## Problem Tenant deletion only removes the current shards from remote storage. Any stale parent shards (before splits) will be left behind. These shards are kept since child shards may reference data from the parent until new image layers are generated. ## Summary of changes * Document a special case for pageserver tenant deletion that deletes all shards in remote storage when given an unsharded tenant ID, as well as any unsharded tenant data. * Pass an unsharded tenant ID to delete all remote storage under the tenant ID prefix. * Split out `RemoteStorage::delete_prefix()` to delete a bucket prefix, with additional test coverage. * Add a `delimiter` argument to `asset_prefix_empty()` to support partial prefix matches (i.e. all shards starting with a given tenant ID).	2024-10-17 14:34:51 +00:00
Alex Chi Z.	f3a3eefd26	feat(pageserver): do space check before gc-compaction (#9250 ) part of https://github.com/neondatabase/neon/issues/9114 ## Summary of changes gc-compaction may take a lot of disk space, and if it does, the caller should do a partial gc-compaction. This patch adds space check for the compaction job. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-17 10:29:53 -04:00
Arpad Müller	35e7d91bc9	Add config variable for timeline offloading (#9421 ) Adds a configuration variable for timeline offloading support. The added pageserver-global config option controls whether the pageserver automatically offloads timelines during compaction. Therefore, already offloaded timelines are not affected by this, nor is the manual testing endpoint. This allows the rollout of timeline offloading to be driven by the storage team. Part of #8088	2024-10-17 12:07:58 +00:00
Arpad Müller	55b246085e	Activate timelines during unoffload (#9399 ) The current code has forgotten to activate timelines during unoffload, leading to inability to receive the basebackup, due to the timeline still being in loading state. ``` stderr: command failed: compute startup failed: failed to get basebackup@0/0 from pageserver postgresql://no_user@localhost:15014 Caused by: 0: db error: ERROR: Not found: Timeline 508546c79b2b16a84ab609fdf966e0d3/bfc18c24c4b837ecae5dbb5216c80fce is not active, state: Loading 1: ERROR: Not found: Timeline 508546c79b2b16a84ab609fdf966e0d3/bfc18c24c4b837ecae5dbb5216c80fce is not active, state: Loading ``` Therefore, also activate the timeline during unoffloading. Part of #8088	2024-10-16 16:47:17 +02:00
Arpad Müller	3140c14d60	Remove allow(clippy::unknown_lints) (#9416 ) the lint stabilized in 1.80.	2024-10-16 16:28:55 +02:00
John Spray	89a65a9e5a	pageserver: improve handling of archival_config calls during Timeline shutdown (#9415 ) ## Problem In test `test_timeline_offloading`, we see failures like: ``` PageserverApiException: queue is in state Stopped ``` Example failure: https://neon-github-public-dev.s3.amazonaws.com/reports/main/11356917668/index.html#testresult/ff0e348a78a974ee/retries ## Summary of changes - Amend code paths that handle errors from RemoteTimelineClient to check for cancellation and emit the Cancelled error variant in these cases (will give clients a 503 to retry) - Remove the implicit `#[from]` for the Other error case, to make it harder to add code that accidentally squashes errors into this (500-equivalent) error variant. This would be neater if we made RemoteTimelineClient return a structured error instead of anyhow::Error, but that's a bigger refactor. I'm not sure if the test really intends to hit this path, but the error handling fix makes sense either way.	2024-10-16 13:39:58 +01:00
Alex Chi Z.	f1eb703256	fix(pageserver): use a buffer for basebackup; add aux basebackup metrics log (#9401 ) Our replication bench project is stuck because it is too slow to generate basebackup and it caused compute to disconnect. https://neondb.slack.com/archives/C03438W3FLZ/p1728330685012419 The compute timeout for waiting for basebackup is 10m (is it true?). Generating basebackup directly on pageserver takes ~3min. Therefore, I suspect it's because there are too many wasted round-trip time for writing the 10000+ snapshot aux files. Also, it is possible that the basebackup process takes too long time retrieving all aux files that it did not write anything over the wire protocol, causing a read timeout. Basebackup size is 800KB gzipped for that project and was 55MB tar before compression. ## Summary of changes * Potentially fix the issue by placing a write buffer for basebackup. * Log how many aux files did we read + the time spent on it. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-15 16:35:21 -04:00
Konstantin Knizhnik	614c3aef72	Remove redundant code (#9373 ) ## Problem There is double update of resize cache in `put_rel_truncation` Also `page_server_request` contains check that fork is MAIN_FORKNUM which 1. is incorrect (because Vm/FSM pages are shreded in the same way as MAIN fork pages and 2. is redundant because `page_server_request` is never called for `get page` request so first part to OR condition is always true. ## Summary of changes Remove redundant code ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-10-15 17:18:52 +03:00
Arpad Müller	ec4cc30de9	Shut down timelines during offload and add offload tests (#9289 ) Add a test for timeline offloading, and subsequent unoffloading. Also adds a manual endpoint, and issues a proper timeline shutdown during offloading which prevents a pageserver hang at shutdown. Part of #8088.	2024-10-15 09:46:51 +00:00
John Spray	73c6626b38	pageserver: stabilize & refine controller scale test (#8971 ) ## Problem We were seeing timeouts on migrations in this test. The test unfortunately tends to saturate local storage, which is shared between the pageservers and the control plane database, which makes the test kind of unrealistic. We will also want to increase the scale of this test, so it's worth fixing that. ## Summary of changes - Instead of randomly creating timelines at the same time as the other background operations, explicitly identify a subset of tenant which will have timelines, and create them at the start. This avoids pageservers putting a lot of load on the test node during the main body of the test. - Adjust the tenants created to create some number of 8 shard tenants and the rest 1 shard tenants, instead of just creating a lot of 2 shard tenants. - Use archival_config to exercise tenant-mutating operations, instead of using timeline creation for this. - Adjust reconcile_until_idle calls to avoid waiting 5 seconds between calls, which causes timelines with large shard count tenants. - Fix a pageserver bug where calls to archival_config during activation get 404	2024-10-15 09:31:18 +01:00
Arpad Müller	f54e3e9147	Also consider offloaded timelines for obtaining retain_lsn (#9308 ) Also consider offloaded timelines for obtaining `retain_lsn`. This is required for correctness for all timelines that have not been flattened yet: otherwise we GC data that might still be required for reading. This somewhat counteracts the original purpose of timeline offloading of not having to iterate over offloaded timelines, but sadly it's required. In the future, we can improve the way the offloaded timelines are stored. We also make the `retain_lsn` optional so that in the future, when we implement flattening, we can make it None. This also applies to full timeline objects by the way, where it would probably make most sense to add a bool flag whether the timeline is successfully flattened, and if it is, one can exclude it from `retain_lsn` as well. Also, track whether a timeline was offloaded or not in `retain_lsn` so that the `retain_lsn` can be excluded from visibility and size calculation. Part of #8088	2024-10-14 17:54:03 +02:00
John Spray	426b1c5f08	storage controller: use 'infra' JWT scope for node registration (#9343 ) ## Problem Storage controller `/control` API mostly requires admin tokens, for interactive use by engineers. But for endpoints used by scripts, we should not require admin tokens. Discussion at https://neondb.slack.com/archives/C033RQ5SPDH/p1728550081788989?thread_ts=1728548232.265019&cid=C033RQ5SPDH ## Summary of changes - Introduce the 'infra' JWT scope, which was not previously used in the neon repo - For pageserver & safekeeper node registrations, require infra scope instead of admin Note that admin will still work, as the controller auth checks permit admin tokens for all endpoints irrespective of what scope they require.	2024-10-10 12:26:43 +01:00
Yuchen Liang	bee04b8a69	pageserver: add direct io config to virtual file (#9214 ) ## Problem We need a way to incrementally switch to direct IO. During the rollout we might want to switch to O_DIRECT on image and delta layer read path first before others. ## Summary of changes - Revisited and simplified direct io config in `PageserverConf`. - We could add a fallback mode for open, but for read there isn't a reasonable alternative (without creating another buffered virtual file). - Added a wrapper around `VirtualFile`, current implementation become `VirtualFileInner` - Use `open_v2`, `create_v2`, `open_with_options_v2` when we want to use the IO mode specified in PS config. - Once we onboard all IO through VirtualFile using this new API, we will delete the old code path. - Make io mode live configurable for benchmarking. - Only guaranteed for files opened after the config change, so do it before the experiment. As an example, we are using `open_v2` with `virtual_file::IoMode::Direct` in https://github.com/neondatabase/neon/pull/9169 We also remove `io_buffer_alignment` config in `a04cfd754b` and use it as a compile time constant. This way we don't have to carry the alignment around or make frequent call to retrieve this information from the static variable. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-10-09 08:33:07 -04:00
Erik Grinaker	211970f0e0	remote_storage: add `DownloadOpts::byte_(start\|end)` (#9293 ) `download_byte_range()` is basically a copy of `download()` with an additional option passed to the backend SDKs. This can cause these code paths to diverge, and prevents combining various options. This patch adds `DownloadOpts::byte_(start\|end)` and move byte range handling into `download()`.	2024-10-09 10:29:06 +01:00
Arpad Müller	e8ae37652b	Add timeline offload mechanism (#8907 ) Implements an initial mechanism for offloading of archived timelines. Offloading is implemented as specified in the RFC. For now, there is no persistence, so a restart of the pageserver will retrigger downloads until the timeline is offloaded again. We trigger offloading in the compaction loop because we need the signal for whether compaction is done and everything has been uploaded or not. Part of #8088	2024-10-09 01:33:39 +02:00
Erik Grinaker	37158d0424	pageserver: use conditional GET for secondary tenant heatmaps (#9236 ) ## Problem Secondary tenant heatmaps were always downloaded, even when they hadn't changed. This can be avoided by using a conditional GET request passing the `ETag` of the previous heatmap. ## Summary of changes The `ETag` was already plumbed down into the heatmap downloader, and just needed further plumbing into the remote storage backends. * Add a `DownloadOpts` struct and pass it to `RemoteStorage::download()`. * Add an optional `DownloadOpts::etag` field, which uses a conditional GET and returns `DownloadError::Unmodified` on match.	2024-10-04 12:29:48 +02:00
Vlad Lazar	552fa2b972	pageserver: tweak oversized key read path warning (#9221 ) ## Problem `Oversized vectored read [...]` logs are spewing in prod because we have a few keys that are unexpectedly large: * reldir/relblock - these are unbounded, so it's known technical debt * slru block - they can be a bit bigger than 128KiB due to storage format overhead ## Summary of changes * Bump threshold to 130KiB * Don't warn on oversized reldir and dbdir keys Closes https://github.com/neondatabase/neon/issues/8967	2024-10-03 16:40:35 +01:00

1 2 3 4 5 ...

2507 Commits