rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-06 13:02:55 +00:00

Author	SHA1	Message	Date
duguorong009	706977fb77	fix(pageserver): add the walreceiver state to tenant timeline GET api endpoint (#5196 ) Add a `walreceiver_state` field to `TimelineInfo` (response of `GET /v1/tenant/:tenant_id/timeline/:timeline_id`) and while doing that, refactor out a common `Timeline::walreceiver_state(..)`. No OpenAPI changes, because this is an internal debugging addition. Fixes #3115. Co-authored-by: Joonas Koivunen <joonas.koivunen@gmail.com>	2023-09-07 14:17:18 +03:00
Arpad Müller	7ba0f5c08d	Improve comment in page cache (#5220 ) It was easy to interpret comment in the page cache initialization code to be about justifying why we leak here at all, not just why this specific type of leaking is done (which the comment was actually meant to describe). See https://github.com/neondatabase/neon/pull/5125#discussion_r1308445993 --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-09-06 21:44:54 +02:00
Arpad Müller	6243b44dea	Remove Virtual from FileBlockReaderVirtual variant name (#5225 ) With #5181, the generics for `FileBlockReader` have been removed, so having a `Virtual` postfix makes less sense now.	2023-09-06 20:54:57 +02:00
duguorong009	31e1568dee	refactor(pageserver): refactor pageserver router state creation (#5165 ) Fixes #3894 by: - Refactor the pageserver router creation flow - Create the router state in `pageserver/src/bin/pageserver.rs`	2023-09-06 21:31:49 +03:00
Chengpeng Yan	9a9187b81a	Complete the missing metrics for files_created/bytes_written (#5120 )	2023-09-06 14:00:15 -04:00
Arpad Müller	5e00c44169	Add WriteBlobWriter buffering and make VirtualFile::{write,write_all} async (#5203 ) ## Problem We want to convert the `VirtualFile` APIs to async fn so that we can adopt one of the async I/O solutions. ## Summary of changes This PR is a follow-up of #5189, #5190, and #5195, and does the following: * Move the used `Write` trait functions of `VirtualFile` into inherent functions * Add optional buffering to `WriteBlobWriter`. The buffer is discarded on drop, similarly to how tokio's `BufWriter` does it: drop is neither async nor does it support errors. * Remove the generics by `Write` impl of `WriteBlobWriter`, always using `VirtualFile` * Rename `WriteBlobWriter` to `BlobWriter` * Make various functions in the write path async, like `VirtualFile::{write,write_all}`. Part of #4743.	2023-09-06 18:17:12 +02:00
John Spray	61d661a6c3	pageserver: generation number fetch on startup and use in /attach (#5163 ) ## Problem - #5050 Closes: https://github.com/neondatabase/neon/issues/5136 ## Summary of changes - A new configuration property `control_plane_api` controls other functionality in this PR: if it is unset (default) then everything still works as it does today. - If `control_plane_api` is set, then on startup we call out to control plane `/re-attach` endpoint to discover our attachments and their generations. If an attachment is missing from the response we implicitly detach the tenant. - Calls to pageserver `/attach` API may include a `generation` parameter. If `control_plane_api` is set, then this parameter is mandatory. - RemoteTimelineClient's loading of index_part.json is generation-aware, and will try to load the index_part with the most recent generation <= its own generation. - The `neon_local` testing environment now includes a new binary `attachment_service` which implements the endpoints that the pageserver requires to operate. This is on by default if running `cargo neon` by hand. In `test_runner/` tests, it is off by default: existing tests continue to run in the legacy generation-less mode. Caveats: - The re-attachment during startup assumes that we are only re-attaching tenants that have previously been attached, and not totally new tenants -- this relies on the control plane's attachment logic to keep retrying so that we should eventually see the attach API call. That's important because the `/re-attach` API doesn't tell us which timelines we should attach -- we still use local disk state for that. Ref: https://github.com/neondatabase/neon/issues/5173 - Testing: generations are only enabled for one integration test right now (test_pageserver_restart), as a smoke test that all the machinery basically works. Writing fuller tests that stress tenant migration will come later, and involve extending our test fixtures to deal with multiple pageservers. 
- I'm not in love with "attachment_service" as a name for the neon_local component, but it's not very important because we can easily rename these test bits whenever we want. - Limited observability when in re-attach on startup: when I add generation validation for deletions in a later PR, I want to wrap up the control plane API calls in some small client class that will expose metrics for things like errors calling the control plane API, which will act as a strong red signal that something is not right. Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-09-06 14:44:48 +01:00
John Spray	743933176e	scrubber: add `scan-metadata` and hook into integration tests (#5176 ) ## Problem - Scrubber's `tidy` command requires presence of a control plane - Scrubber has no tests at all ## Summary of changes - Add re-usable async streams for reading metadata from a bucket - Add a `scan-metadata` command that reads from those streams and calls existing `checks.rs` code to validate metadata, then returns a summary struct for the bucket. Command returns nonzero status if errors are found. - Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that tests using remote storage can request to have the scrubber run after they finish - Enable remote storage and scrub_on_exit in test_pageserver_restart and test_pageserver_chaos This is a "toe in the water" of the overall space of validating the scrubber. Later, we should: - Enable scrubbing at end of tests using remote storage by default - Make the success condition stricter than "no errors": tests should declare what tenants+timelines they expect to see in the bucket (or sniff these from the functions tests use to create them) and we should require that the scrubber reports on these particular tenants/timelines. The `tidy` command is untouched in this PR, but it should be refactored later to use similar async streaming interface instead of the current batch-reading approach (the streams are faster with large buckets), and to also be covered by some tests. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2023-09-06 11:55:24 +01:00
duguorong009	4fec48f2b5	chore(pageserver): remove unnecessary logging in tenant task loops (#5188 ) Fixes #3830 by adding the `#[cfg(not(feature = "testing"))]` attribute to unnecessary loggings in `pageserver/src/tenant/tasks.rs`. Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-09-06 13:19:19 +03:00
Arpad Müller	4904613aaa	Convert `VirtualFile::{seek,metadata}` to async (#5195 ) ## Problem We want to convert the `VirtualFile` APIs to async fn so that we can adopt one of the async I/O solutions. ## Summary of changes Convert the following APIs of `VirtualFile` to async fn (as well as all of the APIs calling it): * `VirtualFile::seek` * `VirtualFile::metadata` * Also, prepare for deletion of the write impl by writing the summary to a buffer before writing it to disk, as suggested in https://github.com/neondatabase/neon/issues/4743#issuecomment-1700663864 . This change adds an additional warning for the case when the summary exceeds a block. Previously, we'd have silently corrupted data in this (unlikely) case. * `WriteBlobWriter::write_blob`, in preparation for making `VirtualFile::write_all` async.	2023-09-05 12:55:45 +02:00
Arpad Müller	128a85ba5e	Convert many VirtualFile APIs to async (#5190 ) ## Problem `VirtualFile` does both reading and writing, and it would be nice if both could be converted to async, so that it doesn't have to support an async read path and a blocking write path (especially for the locks this is annoying as none of the lock implementations in std, tokio or parking_lot have support for both async and blocking access). ## Summary of changes This PR is some initial work on making the `VirtualFile` APIs async. It can be reviewed commit-by-commit. * Introduce the `MaybeVirtualFile` enum to be generic in a test that compares real files with virtual files. * Make various APIs of `VirtualFile` async, including `write_all_at`, `read_at`, `read_exact_at`. Part of #4743 , successor of #5180. Co-authored-by: Christian Schwarz <me@cschwarz.com>	2023-09-04 17:05:20 +02:00
Arpad Müller	6cd497bb44	Make VirtualFile::crashsafe_overwrite async fn (#5189 ) ## Problem The `VirtualFile::crashsafe_overwrite` function was introduced by #5186 but it was not turned `async fn` yet. We want to make these functions async fn as part of #4743. ## Summary of changes Make `VirtualFile::crashsafe_overwrite` async fn, as well as all the functions calling it. Don't make anything inside `crashsafe_overwrite` use async functionalities, as per #4743 instructions. Also, add rustdoc to `crashsafe_overwrite`. Part of #4743.	2023-09-04 12:52:35 +02:00
John Spray	80f10d5ced	pageserver: safe deletion for tenant directories (#5182 ) ## Problem If a pageserver crashes partway through deleting a tenant's directory, it might leave a partial state that confuses a subsequent startup/attach. ## Summary of changes Rename tenant directory to a temporary path before deleting. Timeline deletions already have deletion markers to provide safety. In future, it would be nice to exploit this to send responses to detach requests earlier: https://github.com/neondatabase/neon/issues/5183	2023-09-04 08:31:55 +01:00
Christian Schwarz	7e817789d5	VirtualFile: add crash-safe overwrite abstraction & use it (#5186 ) (part of #4743) (preliminary to #5180) This PR adds a special-purpose API to `VirtualFile` for write-once files. It adopts it for `save_metadata` and `persist_tenant_conf`. This is helpful for the asyncification efforts (#4743) and specifically asyncification of `VirtualFile` because above two functions were the only ones that needed the VirtualFile to be an `std::io::Write`. (There was also `manifest.rs` that needed the `std::io::Write`, but, it isn't used right now, and likely won't be used because we're taking a different route for crash consistency, see #5172. So, let's remove it. It'll be in Git history if we need to re-introduce it when picking up the compaction work again; that's why it was introduced in the first place). We can't remove the `impl std::io::Write for VirtualFile` just yet because of the `BufWriter` in ```rust struct DeltaLayerWriterInner { ... blob_writer: WriteBlobWriter<BufWriter<VirtualFile>>, } ``` But, @arpad-m and I have a plan to get rid of that by extracting the append-only-ness-on-top-of-VirtualFile that #4994 added to `EphemeralFile` into an abstraction that can be re-used in the `DeltaLayerWriterInner` and `ImageLayerWriterInner`. That'll be another PR. ### Performance Impact This PR adds more fsyncs compared to before because we fsync the parent directory every time. 1. For `save_metadata`, the additional fsyncs are unnecessary because we know that `metadata` fits into a kernel page, and hence the write won't be torn on the way into the kernel. However, the `metadata` file in general is going to lose significance very soon (=> see #5172), and the NVMes in prod can definitely handle the additional fsync. So, let's not worry about it. 2. For `persist_tenant_conf`, which we don't check to fit into a single kernel page, this PR makes it actually crash-consistent. 
Before, we could crash while writing out the tenant conf, leaving a prefix of the tenant conf on disk.	2023-09-02 10:06:14 +02:00
Christian Schwarz	cfc0fb573d	pageserver: run all Rust tests with remote storage enabled (#5164 ) For [#5086](https://github.com/neondatabase/neon/pull/5086#issuecomment-1701331777) we will require remote storage to be configured in pageserver. This PR enables `localfs`-based storage for all Rust unit tests. Changes: - In `TenantHarness`, set up localfs remote storage for the tenant. - `create_test_timeline` should mimic what real timeline creation does, and real timeline creation waits for the timeline to reach remote storage. With this PR, `create_test_timeline` now does that as well. - All the places that create the harness tenant twice need to shut down the tenant before the re-create through a second call to `try_load` or `load`. - Without shutting down, upload tasks initiated by/through the first incarnation of the harness tenant might still be ongoing when the second incarnation of the harness tenant is `try_load`/`load`ed. That doesn't make sense in the tests that do that, they generally try to set up a scenario similar to pageserver stop & start. - There was one test that recreates a timeline, not the tenant. For that case, I needed to create a `Timeline::shutdown` method. It's a refactoring of the existing `Tenant::shutdown` method. - The remote_timeline_client tests previously set up their own `GenericRemoteStorage` and `RemoteTimelineClient`. Now they re-use the one that's pre-created by the TenantHarness. Some adjustments to the assertions were needed because the assertions now need to account for the initial image layer that's created by `create_test_timeline` to be present.	2023-09-01 18:10:40 +02:00
Christian Schwarz	aa22000e67	FileBlockReader<File> is never used (#5181 ) part of #4743 preliminary to #5180	2023-09-01 17:30:22 +02:00
Christian Schwarz	40ce520c07	remote_timeline_client: tests: run upload ops on the tokio::test runtime (#5177 ) The `remote_timeline_client` tests use `#[tokio::test]` and rely on the fact that the test runtime that is set up by this macro is single-threaded. In PR https://github.com/neondatabase/neon/pull/5164, we observed interesting flakiness with the `upload_scheduling` test case: it would observe the upload of the third layer (`layer_file_name_3`) before we did `wait_completion`. Under the single-threaded-runtime assumption, that wouldn't be possible, because the test code doesn't await in between scheduling the upload and calling `wait_completion`. However, RemoteTimelineClient was actually using `BACKGROUND_RUNTIME`. That means there was parallelism where the tests didn't expect it, leading to flakiness such as execution of an UploadOp task before the test calls `wait_completion`. The most confusing scenario is code like this: ``` schedule_upload(A); wait_completion.await; // B schedule_upload(C); wait_completion.await; // D ``` On a single-threaded executor, it is guaranteed that the upload of C doesn't run before D, because we (the test) don't relinquish control to the executor before D's `await` point. However, RemoteTimelineClient actually scheduled onto the BACKGROUND_RUNTIME, so, `A` could start running before `B` and `C` could start running before `D`. This would cause flaky tests when making assertions about the state manipulated by the operations. The concrete issue that led to the discovery of this bug was an assertion about `remote_fs_dir` state in #5164.	2023-09-01 16:24:04 +03:00
John Spray	616e7046c7	s3_scrubber: import into the main `neon` repository (#5141 ) ## Problem The S3 scrubber currently lives at https://github.com/neondatabase/s3-scrubber We don't have tests that use it, and it has copies of some data structures that can get stale. ## Summary of changes - Import the s3-scrubber as `s3_scrubber/` - Replace copied_definitions/ in the scrubber with direct access to the `utils` and `pageserver` crates - Modify visibility of a few definitions in `pageserver` to allow the scrubber to use them - Update scrubber code for recent changes to `IndexPart` - Update `KNOWN_VERSIONS` for IndexPart and move the definition into index.rs so that it is easier to keep up to date As a future refinement, it would be good to pull the remote persistence types (like IndexPart) out of `pageserver` into a separate library so that the scrubber doesn't have to link against the whole pageserver, and so that it's clearer which types need to be public. Co-authored-by: Kirill Bulatov <kirill@neon.tech> Co-authored-by: Dmitry Rodionov <dmitry@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-08-31 19:01:39 +01:00
John Spray	300a5aa05e	pageserver: fix test v4_indexpart_is_parsed (#5157 ) ## Problem Two recent PRs raced: - https://github.com/neondatabase/neon/pull/5153 - https://github.com/neondatabase/neon/pull/5140 ## Summary of changes Add missing `generation` argument to IndexLayerMetadata construction	2023-08-31 10:40:46 +01:00
John Spray	83ae2bd82c	pageserver: generation number support in keys and indices (#5140 ) ## Problem To implement split brain protection, we need tenants and timelines to be aware of their current generation, and use it when composing S3 keys. ## Summary of changes - A `Generation` type is introduced in the `utils` crate -- it is in this broadly-visible location because it will later be used from `control_plane/` as well as `pageserver/`. Generations can be a number, None, or Broken, to support legacy content (None), and Tenants in the broken state (Broken). - Tenant, Timeline, and RemoteTimelineClient all get a generation attribute - IndexPart's IndexLayerMetadata has a new `generation` attribute. Legacy layers' metadata will deserialize to Generation::none(). - Remote paths are composed with a trailing generation suffix. If a generation is equal to Generation::none() (as it currently always is), then this suffix is an empty string. - Functions for composing remote storage paths added in remote_timeline_client: these avoid the way that we currently always compose a local path and then strip the prefix, and avoid requiring a PageserverConf reference on functions that want to create remote paths (the conf is only needed for local paths). These are less DRY than the old functions, but remote storage paths are a very rarely changing thing, so it's better to write out our paths clearly in the functions than to compose timeline paths from tenant paths, etc. - Code paths that construct a Tenant take a `generation` argument in anticipation that we will soon load generations on startup before constructing Tenant. Until the whole feature is done, we don't want any generation-ful keys though: so initially we will carry this everywhere with the special Generation::none() value. Closes: https://github.com/neondatabase/neon/issues/5135 Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-08-31 09:19:34 +01:00
John Spray	a93274b389	pageserver: remove vestigial `timeline_layers` attribute (#5153 ) ## Problem `timeline_layers` was write-only since `b95addddd5` We deployed the version that no longer requires it for deserializing, so now we can stop including it when serializing. ## Summary of changes Fully remove `timeline_layers`.	2023-08-30 16:14:04 +01:00
Joonas Koivunen	05773708d3	fix: add context for ancestor lsn wait (#5143 ) In logs it is confusing to see seqwait timeouts which seemingly arise from the branched lsn but actually are about the ancestor, leading to questions like "has the last_record_lsn went back". Noticed by @problame.	2023-08-30 12:21:41 +03:00
Arpad Müller	eb0a698adc	Make page cache and read_blk async (#5023 ) ## Problem `read_blk` does I/O and thus we would like to make it async. We can't make the function async as long as the `PageReadGuard` returned by `read_blk` isn't `Send`. The page cache is called by `read_blk`, and thus it can't be async without `read_blk` being async. Thus, we have a circular dependency. ## Summary of changes Due to the circular dependency, we convert both the page cache and `read_blk` to async at the same time: We make the page cache use `tokio::sync` synchronization primitives as those are `Send`. This makes all the places that acquire a lock require async though, which we then also do. This includes also asyncification of the `read_blk` function. Builds upon #4994, #5015, #5056, and #5129. Part of #4743.	2023-08-30 09:04:31 +02:00
Arseny Sher	bc49c73fee	Move wal_stream_connection_config to utils. It will be used by safekeeper as well.	2023-08-29 23:19:40 +03:00
Arseny Sher	e98580b092	Add term and http endpoint to broker messaged SkTimelineInfo. We need them for safekeeper peer recovery https://github.com/neondatabase/neon/pull/4875	2023-08-29 23:19:40 +03:00
Arpad Müller	805fee1483	page cache: small code cleanups (#5125 ) ## Problem I saw these things while working on #5111. ## Summary of changes * Add a comment explaining why we use `Vec::leak` instead of `Vec::into_boxed_slice` plus `Box::leak`. * Add another comment explaining what `valid` is doing, it wasn't very clear before. * Add a function `set_usage_count` to not set it directly.	2023-08-29 11:49:04 +03:00
Christian Schwarz	0fe3b3646a	page cache: don't proactively evict EphemeralFile pages (#5129 ) Before this patch, when dropping an EphemeralFile, we'd scan the entire `slots` to proactively evict its pages (`drop_buffers_for_immutable`). This was _necessary_ before #4994 because the page cache was a write-back cache: we'd be deleting the EphemeralFile from disk after, so, if we hadn't evicted its pages before that, write-back in `find_victim` would have failed. But, since #4994, the page cache is a read-only cache, so, it's safe to keep read-only data cached. It's never going to get accessed again and eventually, `find_victim` will evict it. The only remaining advantage of `drop_buffers_for_immutable` over relying on `find_victim` is that `find_victim` has to do the clock page replacement iterations until the count reaches 0, whereas `drop_buffers_for_immutable` can kick the page out right away. However, weigh that against the cost of `drop_buffers_for_immutable`, which currently scans the entire `slots` array to find the EphemeralFile's pages. Alternatives have been proposed in #5122 and #5128, but, they come with their own overheads & trade-offs. Also, the real reason why we're looking into this piece of code is that we want to make the slots rwlock async in #5023. Since `drop_buffers_for_immutable` is called from drop, and there is no async drop, it would be nice to not have to deal with this. So, let's just stop doing `drop_buffers_for_immutable` and observe the performance impact in benchmarks.	2023-08-28 20:42:18 +02:00
Joonas Koivunen	fbcd174489	load_layer_map: schedule deletions for any future layers (#5103 ) Unrelated fixes noticed while integrating #4938. - Stop leaking future layers in remote storage - We schedule extra index_part uploads if layer name to be removed was not actually present	2023-08-28 10:51:49 +03:00
John Spray	b758bf47ca	pageserver: refactor TimelineMetadata serialization in IndexPart (#5091 ) ## Problem The `metadata_bytes` field of IndexPart required explicit deserialization & error checking everywhere it was used -- there isn't anything special about this structure that should prevent it from being serialized & deserialized along with the rest of the structure. ## Summary of changes - Implement Serialize and Deserialize for TimelineMetadata - Replace IndexPart::metadata_bytes with a simpler `metadata`, that can be used directly. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-08-25 16:16:20 +01:00
Arpad Müller	8c13296add	Remove `BlockReader::read_blk` in favour of `BlockCursor` (#5015 ) ## Problem We want to make `read_blk` an async function, but outside of `async_trait`, which allocates, and nightly features, we can't use async fn's in traits. ## Summary of changes * Remove all uses of `BlockReader::read_blk` in favour of using block cursors, at least where the type of the `BlockReader` is behind a generic * Introduce a `BlockReaderRef` enum that lists all implementors of `BlockReader::read_blk`. * Remove `BlockReader::read_blk` and move its implementations into inherent functions on the types instead. We don't turn `read_blk` into an async fn yet, for that we also need to modify the page cache. So this is a preparatory PR, albeit an important one. Part of #4743.	2023-08-25 12:28:01 +02:00
Arpad Müller	227c87e333	Make EphemeralFile::write_blob function async (#5056 ) ## Problem The `EphemeralFile::write_blob` function accesses the page cache internally. We want to require `async` for these accesses in #5023. ## Summary of changes This removes the implementation of the `BlobWriter` trait for `EphemeralFile` and turns the `write_blob` function into an inherent function. We can then make it async as well as the `push_bytes` function. We move the `SER_BUFFER` thread-local into the `InMemoryLayerInner` so that the same buffer can be accessed by different threads as the async is (potentially) moved between threads. Part of #4743, preparation for #5023.	2023-08-24 19:18:30 +02:00
Chengpeng Yan	fa74d5649e	rename `EphemeralFile::size` to `EphemeralFile::len` (#5076 ) ## Problem close https://github.com/neondatabase/neon/issues/5034 ## Summary of changes Based on the [comment](https://github.com/neondatabase/neon/pull/4994#discussion_r1297277922). Just rename the `EphemeralFile::size` to `EphemeralFile::len`.	2023-08-24 16:41:57 +02:00
Joonas Koivunen	f70871dfd0	internal-devx: pageserver future layers (#5092 ) I've personally forgotten why/how we can have future layers during reconciliation. Adds `#[cfg(feature = "testing")]` logging when we upload such index_part.json, with a cross reference to where the cleanup happens. Latest private slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1692879032573809?thread_ts=1692792276.173979&cid=C033RQ5SPDH Builds upon #5074. Should have been considered in #4837.	2023-08-24 17:22:36 +03:00
Joonas Koivunen	76aa01c90f	refactor: single phase Timeline::load_layer_map (#5074 ) The current implementation first calls `load_layer_map`, which loads all local layers, cleans up files, and leaves cleaning up stuff to the "second function". When the "second function" is finally called, it does not do the cleanup, and some of the first function's setup can be torn down. "Second function" is actually both `reconcile_with_remote` and `create_remote_layers`. This change makes it a bit more verbose but in one phase with the following sub-steps: 1. scan the timeline directory 2. delete extra files - now including on-demand download files - fixes #3660 3. reconcile the two sources of layers (directory, index_part) 4. rename_to_backup future layers, short layers 5. create the remaining as layers Needed by #4938. It was also noticed that this is blocking code in an `async fn` so just do it in a `spawn_blocking`, which should be healthy for our startup times. Other effects include hopefully halving of `stat` calls; extra calls which were not done previously are now done for the future layers. Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: John Spray <john@neon.tech>	2023-08-24 16:07:40 +03:00
John Spray	3e2f0ffb11	libs: make backoff::retry() take a cancellation token (#5065 ) ## Problem Currently, anything that uses backoff::retry will delay the join of its task by however long its backoff sleep is, multiplied by its max retries. Whenever we call a function that sleeps, we should be passing in a CancellationToken. ## Summary of changes - Add a `Cancel` type to backoff::retry that wraps a CancellationToken and an error `Fn` to generate an error if the cancellation token fires. - In call sites that already run in a `task_mgr` task, use `shutdown_token()` to provide the token. In other locations, use a dead `CancellationToken` to satisfy the interface, and leave a TODO to fix it up when we broaden the use of explicit cancellation tokens.	2023-08-24 14:54:46 +03:00
Joonas Koivunen	ad8d777c1c	refactor: remove is_incremental=true for ImageLayers footgun (#5061 ) Accidentally giving is_incremental=true for ImageLayers costs a lot of debugging time. Removes all API which would allow doing that. They can easily be restored later when needed. Split off from #4938.	2023-08-22 22:12:05 +03:00
Joonas Koivunen	533a92636c	refactor: pre-cleanup Layer, PersistentLayer and impls (#5059 ) Remove pub but dead code, move trait methods as inherent methods, remove unnecessary. Split off from #4938.	2023-08-22 21:14:28 +03:00
Christian Schwarz	8cd20485f8	metrics: smgr query time: add a pre-aggregated histogram (#5064 ) When doing global queries in VictoriaMetrics, the per-timeline histograms make us run into cardinality limits. We don't want to give them up just yet because we don't have an alternative for drilling down on timeline-specific performance issues. So, add a pre-aggregated histogram and add observations to it whenever we add observations to the per-timeline histogram. While we're at it, switch to using a strummed enum for the operation type names.	2023-08-22 20:08:31 +03:00
Joonas Koivunen	933a869f00	refactor: compaction becomes async again (#5058 ) #4938 will make on-demand download of layers in compaction possible, so it's not suitable for our "policy" of no `spawn_blocking(|| ... Handle::block_on(async { spawn_blocking(...).await })` because this poses a clear deadlock risk. Nested spawn_blockings are because of the download using `tokio::fs::File`. - Remove `spawn_blocking` from caller of `compact_level0_phase1` - Remove `Handle::block_on` from `compact_level0_phase1` (indentation change) - Revert to `AsLayerDesc::layer_desc` usage temporarily (until it becomes field access in #4938)	2023-08-22 20:03:14 +03:00
John Spray	615a490239	pageserver: refactor Tenant/Timeline args into structs (#5053 ) ## Problem There are some common types that we pass into tenants and timelines as we construct them, such as remote storage and the broker client. Currently the list is small, but this is likely to grow -- the deletion queue PR (#4960) pushed some methods to the point of clippy complaining they had too many args, because of the extra deletion queue client being passed around. There are some shared objects that currently aren't passed around explicitly because they use a static `once_cell` (e.g. CONCURRENT_COMPACTIONS), but as we add more resource management and concurrency control over time, it will be more readable & testable to pass a type around in the respective Resources object, rather than to coordinate via static objects. The `Resources` structures in this PR will make it easier to add references to central coordination functions, without having to rely on statics. ## Summary of changes - For `Tenant`, the `broker_client` and `remote_storage` are bundled into `TenantSharedResources` - For `Timeline`, the `remote_client` is wrapped into `TimelineResources`. Both of these structures will get an additional deletion queue member in #4960.	2023-08-21 17:30:28 +01:00
John Spray	b95addddd5	pageserver: do not read redundant `timeline_layers` from IndexPart, so that we can remove it later (#4972 ) ## Problem IndexPart contains two redundant lists of layer names: a set of the names, and then a map of name to metadata. We already required that all the layers in `timeline_layers` are also in `layers_metadata`, in `initialize_with_current_remote_index_part`, so if there were any index_part.json files in the field that relied on these sets being different, they would already be broken. ## Summary of changes `timeline_layers` is made private and no longer read at runtime. It is still serialized, but not deserialized. `disk_consistent_lsn` is also made private, as this field only exists for convenience of humans reading the serialized JSON. This prepares us to entirely remove `timeline_layers` in a future release, once this change is fully deployed, and therefore no pageservers are trying to read the field.	2023-08-21 14:29:36 +03:00
Dmitry Rodionov	9140a950f4	Resume tenant deletion on attach (#5039 ) I'm still a bit nervous about the attach -> crash case, but it should work (unlike the case with timelines). Ideally, this would be covered by a test. This continues the tradition of adding bool flags for Tenant::set_stopping. The lifecycle project will probably help with fixing that.	2023-08-20 12:28:50 +03:00
Arpad Müller	a23b0773f1	Fix DeltaLayer dumping (#5045 ) ## Problem Before, DeltaLayer dumping (via `cargo run --release -p pagectl -- print-layer-file` ) would crash as one can't call `Handle::block_on` in an async executor thread. ## Summary of changes Avoid the problem by using `DeltaLayerInner::load_keys` to load the keys into RAM (which we already do during compaction), and then load the values one by one during dumping.	2023-08-19 00:56:03 +02:00
Joonas Koivunen	368ee6c8ca	refactor: failpoint support (#5033 ) - move them to pageserver, which is the only dependent of the `fail` crate - "move" the exported macro to the new module - support the same failpoints at init time as at runtime Found while debugging test failures and making tests more repeatable by allowing "exit" from pageserver start via environment variables. Made those changes to `test_gc_cutoff.py`. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-08-19 01:01:44 +03:00
Dmitry Rodionov	30888a24d9	Avoid flakiness in test_timeline_delete_fail_before_local_delete (#5032 ) The problem was that timeline detail can return timelines in states other than active. By the time the request arrives, timeline deletion can still be in progress if we're unlucky (test execution happened to be slower for some reason). Reference for the failed test run: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5022/5891420105/index.html#suites/f588e0a787c49e67b29490359c589fae/dab036e9bd673274 The error was `Exception: detail succeeded (it should return 404)`, reported by @koivunej	2023-08-18 20:49:11 +03:00
Dmitry Rodionov	f6c671c140	resume timeline deletions on attach (#5030 ) closes [#5036](https://github.com/neondatabase/neon/issues/5036)	2023-08-18 20:48:33 +03:00
Christian Schwarz	7a63685cde	simplify page-caching of EphemeralFile (#4994 ) (This PR is the successor of https://github.com/neondatabase/neon/pull/4984 ) ## Summary The current way in which `EphemeralFile` uses `PageCache` complicates the Pageserver code base to a degree that isn't worth it. This PR refactors how we cache `EphemeralFile` contents, by exploiting the append-only nature of `EphemeralFile`. The result is that `PageCache` only holds `ImmutableFilePage` and `MaterializedPage`. These types of pages are read-only and evictable without write-back. This allows us to remove the writeback code from `PageCache`, also eliminating an entire failure mode. Further, many great open-source libraries exist to solve the problem of a read-only cache, much better than our `page_cache.rs` (e.g., better replacement policy, less global locking). With this PR, we can now explore using them. ## Problem & Analysis Before this PR, `PageCache` had three types of pages: * `ImmutableFilePage`: caches Delta / Image layer file contents * `MaterializedPage`: caches results of Timeline::get (page materialization) * `EphemeralPage`: caches `EphemeralFile` contents `EphemeralPage` is quite different from `ImmutableFilePage` and `MaterializedPage`: * Immutable and materialized pages are for the acceleration of (future) reads of the same data using `PAGE_CACHE_SIZE * PAGE_SIZE` bytes of DRAM. * Ephemeral pages are a write-back cache of `EphemeralFile` contents, i.e., if there is pressure in the page cache, we spill `EphemeralFile` contents to disk. `EphemeralFile` is only used by `InMemoryLayer`, for the following purposes: * write: when filling up the `InMemoryLayer`, via `impl BlobWriter for EphemeralFile` * read: when doing page reconstruction for a page@lsn that isn't written to disk * read: when writing L0 layer files, we re-read the `InMemoryLayer` and put the contents into the L0 delta writer (`create_delta_layer`). 
This happens every 10min or when InMemoryLayer reaches 256MB in size. The access patterns of the `InMemoryLayer` use case are as follows: * write: via `BlobWriter`, strictly append-only * read for page reconstruction: via `BlobReader`, random * read for `create_delta_layer`: via `BlobReader`, dependent on data, but generally random. Why? * in classical LSM terms, this function is what writes the memory-resident `C0` tree into the disk-resident `C1` tree * in our system, though, the values of InMemoryLayer are stored in an EphemeralFile, and hence they are not guaranteed to be memory-resident * the function reads `Value`s in `Key, LSN` order, which is `!=` insert order What do these `EphemeralFile`-level access patterns mean for the page cache? * write: * the common case is that `Value` is a WAL record, and if it isn't a full-page-image WAL record, then it's smaller than `PAGE_SIZE` * So, the `EphemeralPage` pages act as a buffer for these `< PAGE_SIZE` sized writes. * If there's no page cache eviction between subsequent `InMemoryLayer::put_value` calls, the `EphemeralPage` is still resident, so the page cache avoids doing a `write` system call. * In practice, a busy page server will have page cache evictions because we only configure 64MB of page cache size. * reads for page reconstruction: read acceleration, just as for the other page types. * reads for `create_delta_layer`: * The `Value` reads happen through a `BlockCursor`, which optimizes the case of repeated reads from the same page. * So, the best case is that subsequent values are located on the same page; hence the `BlockCursor`'s buffer is maximally effective. * The worst case is that each `Value` is on a different page; hence the `BlockCursor`'s 1-page-sized buffer is ineffective. * The best case translates into `256MB/PAGE_SIZE` page cache accesses, one per page. 
* the worst case translates into `#Values` page cache accesses * again, the page cache accesses must be assumed to be random because the `Value`s aren't accessed in insertion order but `Key, LSN` order. ## Summary of changes Preliminaries for this PR were: - #5003 - #5004 - #5005 - uncommitted microbenchmark in #5011 Based on the observations outlined above, this PR makes the following changes: * Rip out `EphemeralPage` from `page_cache.rs` * Move the `block_io::FileId` to `page_cache::FileId` * Add a `PAGE_SIZE`d buffer to the `EphemeralFile` struct. It's called `mutable_tail`. * Change `write_blob` to use `mutable_tail` for the write buffering instead of a page cache page. * if `mutable_tail` is full, it writes it out to disk, zeroes it out, and re-uses it. * There is explicitly no double-buffering, so that memory allocation per `EphemeralFile` instance is fixed. * Change `read_blob` to return different `BlockLease` variants depending on `blknum` * for the `blknum` that corresponds to the `mutable_tail`, return a ref to it * Rust borrowing rules prevent `write_blob` calls while refs are outstanding. * for all non-tail blocks, return a page-cached `ImmutablePage` * It is safe to page-cache these as ImmutablePage because EphemeralFile is append-only. ## Performance How do the changes above affect performance? My claim is: not significantly. * write path: * before this PR, the `EphemeralFile::write_blob` didn't issue its own `write` system calls. * If there were enough free pages, it didn't issue any `write` system calls. * If it had to evict other `EphemeralPage`s to get a page for its writes (`get_buf_for_write`), the page cache code would implicitly issue the writeback of victim pages as needed. * With this PR, `EphemeralFile::write_blob` always issues all of its own `write` system calls. 
* Also, the writes are explicit instead of implicit through page cache write back, which will help #4743 * The perf impact of always doing the writes is the CPU overhead and syscall latency. * Before this PR, we might have never issued them if there were enough free pages. * We don't issue `fsync` and can expect the writes to only hit the kernel page cache. * There is also an advantage in issuing the writes directly: the perf impact is paid by the tenant that caused the writes, instead of whatever tenant evicts the `EphemeralPage`. * reads for page reconstruction: no impact. * The `write_blob` function pre-warms the page cache when it writes the `mutable_tail` to disk. * So, the behavior is the same as with the EphemeralPages before this PR. * reads for `create_delta_layer`: no impact. * Same argument as for page reconstruction. * Note for the future: * going through the page cache likely causes read amplification here. Why? * Due to the `Key,Lsn`-ordered access pattern, we don't read all the values in the page before moving to the next page. In the worst case, we might read the same page multiple times to read different `Values` from it. * So, it might be better to bypass the page cache here. * Idea drafts: * bypass PS page cache + prefetch pipeline + iovec-based IO * bypass PS page cache + use `copy_file_range` to copy from ephemeral file into the L0 delta file, without going through user space	2023-08-18 20:31:03 +03:00
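The `mutable_tail` scheme described in #4994 can be sketched as follows. This is an illustration under stated assumptions, not the pageserver's implementation: `EphemeralFileSketch`, `write_bytes`, `read_block`, and the `Flushed` variant are hypothetical names, an in-memory `Vec` stands in for the on-disk file and the page cache, and `PAGE_SIZE` is assumed to be 8192.

```rust
/// Assumed page size for this sketch.
pub const PAGE_SIZE: usize = 8192;

/// What a block read hands back, depending on which block was requested.
pub enum BlockLease<'a> {
    /// The block still being filled: a reference straight into the buffer.
    MutableTail(&'a [u8; PAGE_SIZE]),
    /// A block that was already written out. It never changes again
    /// (the file is append-only), so a read-only cache could hold it.
    Flushed(&'a [u8; PAGE_SIZE]),
}

pub struct EphemeralFileSketch {
    /// Stand-in for the on-disk file: one entry per fully written page.
    flushed: Vec<Box<[u8; PAGE_SIZE]>>,
    /// The single, fixed-size write buffer; there is no second buffer.
    mutable_tail: Box<[u8; PAGE_SIZE]>,
    /// Number of valid bytes currently in `mutable_tail`.
    tail_len: usize,
}

impl EphemeralFileSketch {
    pub fn new() -> Self {
        Self {
            flushed: Vec::new(),
            mutable_tail: Box::new([0u8; PAGE_SIZE]),
            tail_len: 0,
        }
    }

    /// Append-only write path: fill the tail, and whenever it is full,
    /// "write it out", zero it, and reuse the same buffer.
    pub fn write_bytes(&mut self, mut src: &[u8]) {
        while !src.is_empty() {
            let n = (PAGE_SIZE - self.tail_len).min(src.len());
            self.mutable_tail[self.tail_len..self.tail_len + n].copy_from_slice(&src[..n]);
            self.tail_len += n;
            src = &src[n..];
            if self.tail_len == PAGE_SIZE {
                // A real implementation would issue a `write` system call here.
                self.flushed.push(self.mutable_tail.clone());
                self.mutable_tail.fill(0);
                self.tail_len = 0;
            }
        }
    }

    /// Read path: the tail block is served from the buffer, every earlier
    /// block from the immutable, already-written pages.
    pub fn read_block(&self, blknum: usize) -> Option<BlockLease<'_>> {
        if blknum < self.flushed.len() {
            Some(BlockLease::Flushed(&*self.flushed[blknum]))
        } else if blknum == self.flushed.len() {
            Some(BlockLease::MutableTail(&*self.mutable_tail))
        } else {
            None
        }
    }
}
```

Since `read_block` borrows `&self` and `write_bytes` needs `&mut self`, the compiler rejects any write while a `BlockLease` is still alive, which is the borrowing guarantee the commit message refers to.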
Arpad Müller	f4da010aee	Make the compaction warning more tolerant (#5024 ) ## Problem The performance benchmark in `test_runner/performance/test_layer_map.py` is currently failing due to the warning added in #4888. ## Summary of changes The test mentioned has a `compaction_target_size` of 8192, which is just one page size. This is an unattainable goal, as we generate at least three pages: one for the header, one for the b-tree (minimally sized ones have just the root node in a single page), one for the data. Therefore, we add two pages to the warning limit. The warning text becomes a bit less accurate but I think this is okay.	2023-08-18 16:36:31 +02:00
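The arithmetic behind #5024 can be sketched like this. The function names and the exact comparison are assumptions made for illustration; the commit message only states that a layer needs at least three pages and that two pages are added to the warning limit.

```rust
/// Page size in bytes; 8192 matches the "one page size" figure in the message.
pub const PAGE_SIZE: u64 = 8192;

/// Smallest layer file the message describes:
/// one header page + one b-tree root page + one data page.
pub const MIN_LAYER_PAGES: u64 = 3;

/// Hypothetical threshold: the configured target plus two pages of slack.
pub fn warning_limit(compaction_target_size: u64) -> u64 {
    compaction_target_size + 2 * PAGE_SIZE
}

/// Hypothetical check: warn only when the layer exceeds the padded limit.
pub fn should_warn(layer_size: u64, compaction_target_size: u64) -> bool {
    layer_size > warning_limit(compaction_target_size)
}
```

With a target of 8192 bytes the padded limit is 24576 bytes, exactly three pages, so a minimal layer no longer triggers the warning while a four-page layer still would.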
Joonas Koivunen	67af24191e	test: cleanup remote_timeline_client tests (#5013 ) I will have to change these as I change remote_timeline_client api in #4938. So a bit of cleanup, handle my comments which were just resolved during initial review. Cleanup: - use unwrap in tests instead of mixed `?` and `unwrap` - use `Handle` instead of `&'static Reactor` to make the RemoteTimelineClient more natural - use arrays in tests - use plain `#[tokio::test]`	2023-08-17 19:27:30 +03:00
Joonas Koivunen	6af5f9bfe0	fix: format context (#5022 ) We return an error with unformatted `{timeline_id}`.	2023-08-17 14:30:25 +00:00
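The class of bug fixed in #5022 is easy to reproduce in isolation. The sketch below is a generic illustration, not the code from the commit: the function names and the message text are made up. A plain string literal is never interpolated, so `{timeline_id}` only gets substituted when the text goes through `format!` or a similar macro.

```rust
/// Buggy variant: a plain literal keeps the braces verbatim.
pub fn broken_context(_timeline_id: u32) -> String {
    String::from("failed to load timeline {timeline_id}")
}

/// Fixed variant: `format!` captures the variable named inside the braces.
pub fn fixed_context(timeline_id: u32) -> String {
    format!("failed to load timeline {timeline_id}")
}
```

Calling `broken_context(42)` yields the text with the literal placeholder, while `fixed_context(42)` yields `failed to load timeline 42`.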


1515 Commits