The sequence that can lead to a deadlock:
1. A DELETE request gets all the way to `tenant.shutdown(progress,
false).await.is_err()` while holding TENANTS.read()
2. A POST request for tenant creation comes in and calls `tenant_map_insert`,
which does `let mut guard = TENANTS.write().await;`
3. Something that `tenant.shutdown()` needs to wait for itself needs
`TENANTS.read().await`.
The only such case identified in an exhaustive manual scan of the code base
is this one:
the imitate-size-access code does `get_tenant().await`, which takes
`TENANTS.read().await` under the hood.
So in the sequence above, (1) waits for (3), (3)'s read-lock request is queued
behind (2)'s write-lock request, and (2) waits for (1).
Deadlock.
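The pattern can be reproduced with a minimal standalone sketch (hypothetical code, not the reproducer from the PR below). It hinges on tokio's `RwLock` being fair/write-preferring: a read request that arrives after a queued writer waits for that writer.

```rust
use std::{sync::Arc, time::Duration};
use tokio::sync::RwLock;

#[tokio::main]
async fn main() {
    let tenants = Arc::new(RwLock::new(()));

    // (1) DELETE: holds a read lock while waiting on work that itself
    // needs another read lock (3).
    let delete = tokio::spawn({
        let tenants = Arc::clone(&tenants);
        async move {
            let _read = tenants.read().await;
            tokio::time::sleep(Duration::from_millis(100)).await; // let (2) queue up
            // (3): queued behind (2)'s write request because the lock
            // is fair/write-preferring. Never acquired.
            let _read2 = tenants.read().await;
        }
    });

    // (2) POST: requests the write lock while (1) still holds its read lock.
    let create = tokio::spawn({
        let tenants = Arc::clone(&tenants);
        async move {
            tokio::time::sleep(Duration::from_millis(50)).await;
            let _write = tenants.write().await; // waits for (1). Never acquired.
        }
    });

    let _ = tokio::join!(delete, create); // never completes: deadlock
}
```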
I made a reproducer that proves the above hypothesis in
https://github.com/neondatabase/neon/pull/5281, but it's not ready for
merge yet and we want the fix _now_.
fixes https://github.com/neondatabase/neon/issues/5284
## Problem
Previously, we were using `observe_closure_duration` in the `VirtualFile`
file-opening code, but this doesn't support async open operations, which
we want to use as part of #4743.
## Summary of changes
* Move the duration measurement from the `with_file` macro into an
`observe_duration` macro (sketched below).
* Some smaller drive-by fixes to replace the old strings with the new
variant names introduced by #5273
Part of #4743, follow-up of #5247.
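For illustration, a rough sketch of the shape of such a macro (hypothetical code, assuming the `STORAGE_IO_TIME_METRIC` static with a `get(op)` accessor mentioned further below): unlike a closure-based helper, the macro expands in place, so the measured body may contain `.await`.

```rust
// Hypothetical sketch: the body is expanded inline instead of being
// wrapped in a closure, so `$body` is free to await.
macro_rules! observe_duration {
    ($op:expr, $body:expr) => {{
        let start = std::time::Instant::now();
        let result = $body; // may contain `.await` at the expansion site
        STORAGE_IO_TIME_METRIC
            .get($op)
            .observe(start.elapsed().as_secs_f64());
        result
    }};
}
```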
## Problem
For #4743, we want to convert everything up to the actual I/O operations
of `VirtualFile` to `async fn`.
## Summary of changes
This PR is the last change in a series of changes to `VirtualFile`:
#5189, #5190, #5195, #5203, and #5224.
It makes the last preparations before the I/O operations are actually
converted to async. We are doing the following things:
* First, we change the locks for the file descriptor cache to tokio's
locks, whose guards are `Send`. This is important when one wants to hold
locks across await points (which we want to do); otherwise the Future
won't be `Send`. Also, one shouldn't generally block in async code, as
executors don't like that.
* Due to the lock change, we now take an approach for the `VirtualFile`
destructors similar to the one proposed by #5122 for the page cache:
use `try_write`. Similarly to the situation in the linked PR, one can
make an argument that if we are in the destructor and the slot has not
been reused yet, we are the only user accessing the slot due to owning
the lock mutably. It is still possible that we fail to obtain the
lock, but the only cause for that is the clock algorithm touching the
slot, which should be quite an unlikely occurrence. When `try_write`
fails, we spawn an async task to destroy the lock. As just argued,
however, the code path where we spawn the task should be visited only
rarely (see the sketch after this list).
* Lastly, we split `with_file` into a macro part, and a function part
that contains most of the logic. The function part returns a lock
object that the macro uses. The macro exists to perform the operation
in a more compact fashion, saving the caller from putting the lock into
a variable and then doing the operation while measuring the time to run
it. We take the lock-object approach because Rust has no support for
async closures. One can make normal closures return a future, but that
approach runs into lifetime issues the moment you want to pass data
with a lifetime to these closures via parameters (captures work). For
details, see
[this](https://smallcultfollowing.com/babysteps/blog/2023/03/29/thoughts-on-async-closures/)
and
[this](https://users.rust-lang.org/t/function-that-takes-an-async-closure/61663)
link. In #5224, we ran into a similar problem with the `test_files`
function, and we ended up passing the path and the `OpenOptions`
by-value instead of by-ref, at the expense of a few extra copies. That
was acceptable because the data is cheaply copyable and we were in test
code. But here, we are not, and while `File::try_clone` exists, it
[issues system calls
internally](1e746d7741/library/std/src/os/fd/owned.rs (L94-L111)).
Also, it would allocate an entirely new file descriptor, something that
the fd cache was built to prevent.
* We change the `STORAGE_IO_TIME` metrics to support async.
Part of #4743.
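A minimal sketch of the destructor pattern from the second bullet (types and field names here are hypothetical; `tokio::spawn` additionally requires a runtime context):

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

struct Slot; // stand-in for a file descriptor cache slot

struct VirtualFileHandle {
    slot: Arc<RwLock<Option<Slot>>>,
}

impl Drop for VirtualFileHandle {
    fn drop(&mut self) {
        match self.slot.try_write() {
            // Common case: nobody else touches the slot, clean up inline.
            Ok(mut guard) => *guard = None,
            // Rare case: the clock algorithm holds the lock right now.
            // Drop can't await, so defer the cleanup to a spawned task.
            Err(_) => {
                let slot = Arc::clone(&self.slot);
                tokio::spawn(async move {
                    *slot.write().await = None;
                });
            }
        }
    }
}
```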
Introduce the `StorageIoOperation` enum, the `StorageIoTime` struct, and
the `STORAGE_IO_TIME_METRIC` static, which provides lockless access to
the histograms consumed by `VirtualFile`.
Closes #5131
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
Detaching a tenant can involve many thousands of local filesystem
metadata writes, but the control plane would benefit from us not
blocking detach/delete responses on these.
## Summary of changes
After renaming the local tenant directory, acknowledge the tenant detach
and delete the tenant directory in the background (sketched below).
#5183
---------
Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
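A minimal sketch of the idea (hypothetical function, not the actual pageserver code):

```rust
use std::{io, path::Path};

async fn detach_tenant(tenant_dir: &Path) -> io::Result<()> {
    // The rename is cheap and atomic; afterwards, startup/attach logic
    // no longer sees the tenant.
    let tmp = tenant_dir.with_extension("tmp-delete");
    tokio::fs::rename(tenant_dir, &tmp).await?;

    // Ack the detach to the control plane here; the many filesystem
    // metadata writes happen off the request path.
    tokio::spawn(async move {
        if let Err(e) = tokio::fs::remove_dir_all(&tmp).await {
            eprintln!("background tenant dir deletion failed: {e}");
        }
    });
    Ok(())
}
```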
Prepare to upgrade rust version to latest stable.
- `rustfmt` has learned to format `let irrefutable = $expr else { ...
};` blocks
- There's a new warning about the virtual (workspace) crate resolver;
picked the latest resolver, as I suspect everyone would expect it to be
the latest. Should not matter anyways
- Some new clippies, which seem alright
## Problem
Once we use async file system APIs for `VirtualFile`, these functions
will also need to be async fn.
## Summary of changes
Makes the functions `open`, `open_with_options`, `create`, `sync_all`,
and `with_file` of `VirtualFile` async fns, including all functions that
call them. As in the prior PRs, the actual I/O operations are not using
async APIs yet, as per request in the #4743 epic.
We switch away from using `VirtualFile` in the `par_fsync` module;
hopefully this is only temporary, until we can actually do fully async
I/O in `VirtualFile`. This might cause us to exhaust fd limits in the
tests, but it should only be an issue for local developers, as we have
high ulimits in prod.
This PR is a follow-up of #5189, #5190, #5195, and #5203. Part of #4743.
Add a `walreceiver_state` field to `TimelineInfo` (response of `GET /v1/tenant/:tenant_id/timeline/:timeline_id`) and while doing that, refactor out a common `Timeline::walreceiver_state(..)`. No OpenAPI changes, because this is an internal debugging addition.
Fixes #3115.
Co-authored-by: Joonas Koivunen <joonas.koivunen@gmail.com>
It was easy to interpret the comment in the page cache initialization
code as justifying why we leak here at all, rather than just why this
specific type of leaking is done (which is what the comment was actually
meant to describe).
See
https://github.com/neondatabase/neon/pull/5125#discussion_r1308445993
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
We want to convert the `VirtualFile` APIs to async fn so that we can
adopt one of the async I/O solutions.
## Summary of changes
This PR is a follow-up of #5189, #5190, and #5195, and does the
following:
* Move the used `Write` trait functions of `VirtualFile` into inherent
functions
* Add optional buffering to `WriteBlobWriter`. The buffer is discarded
on drop, similarly to how tokio's `BufWriter` does it: drop is neither
async nor does it support errors (see the sketch below).
* Remove the `Write`-generic impl of `WriteBlobWriter`, always
using `VirtualFile`
* Rename `WriteBlobWriter` to `BlobWriter`
* Make various functions in the write path async, like
`VirtualFile::{write,write_all}`.
Part of #4743.
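A rough sketch of the optional-buffering behavior (hypothetical shape; `tokio::fs::File` stands in for `VirtualFile` here):

```rust
use tokio::{fs::File, io::AsyncWriteExt};

struct BlobWriter {
    file: File,           // stand-in for VirtualFile
    buf: Option<Vec<u8>>, // Some(_) iff buffering is enabled
}

impl BlobWriter {
    async fn write_all(&mut self, data: &[u8]) -> std::io::Result<()> {
        match &mut self.buf {
            Some(buf) => {
                buf.extend_from_slice(data); // no I/O yet
                Ok(())
            }
            None => self.file.write_all(data).await,
        }
    }

    // Must be called explicitly: like tokio's BufWriter, dropping the
    // writer without flushing silently discards buffered data.
    async fn flush(&mut self) -> std::io::Result<()> {
        if let Some(buf) = &mut self.buf {
            self.file.write_all(buf).await?;
            buf.clear();
        }
        Ok(())
    }
}
```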
## Problem
- #5050
Closes: https://github.com/neondatabase/neon/issues/5136
## Summary of changes
- A new configuration property `control_plane_api` controls other
functionality in this PR: if it is unset (default) then everything still
works as it does today.
- If `control_plane_api` is set, then on startup we call out to control
plane `/re-attach` endpoint to discover our attachments and their
generations. If an attachment is missing from the response we implicitly
detach the tenant.
- Calls to pageserver `/attach` API may include a `generation`
parameter. If `control_plane_api` is set, then this parameter is
mandatory.
- RemoteTimelineClient's loading of index_part.json is generation-aware,
and will try to load the index_part with the most recent generation <=
its own generation (see the sketch after this list).
- The `neon_local` testing environment now includes a new binary
`attachment_service` which implements the endpoints that the pageserver
requires to operate. This is on by default if running `cargo neon` by
hand. In `test_runner/` tests, it is off by default: existing tests
continue to run in the legacy generation-less mode.
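The generation-aware index_part selection from the list above boils down to this rule (a hypothetical helper, not the actual RemoteTimelineClient code):

```rust
/// Among the generations for which an index_part exists in remote
/// storage, pick the newest one that is not newer than our own.
fn pick_index_generation(my_generation: u32, found: &[u32]) -> Option<u32> {
    found.iter().copied().filter(|g| *g <= my_generation).max()
}

fn main() {
    // Running with generation 5 and seeing index_parts written by
    // generations 1, 3, and 7, we load the one from generation 3.
    assert_eq!(pick_index_generation(5, &[1, 3, 7]), Some(3));
}
```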
Caveats:
- The re-attachment during startup assumes that we are only re-attaching
tenants that have previously been attached, and not totally new tenants
-- this relies on the control plane's attachment logic to keep retrying
so that we should eventually see the attach API call. That's important
because the `/re-attach` API doesn't tell us which timelines we should
attach -- we still use local disk state for that. Ref:
https://github.com/neondatabase/neon/issues/5173
- Testing: generations are only enabled for one integration test right
now (test_pageserver_restart), as a smoke test that all the machinery
basically works. Writing fuller tests that stress tenant migration will
come later, and involve extending our test fixtures to deal with
multiple pageservers.
- I'm not in love with "attachment_service" as a name for the neon_local
component, but it's not very important because we can easily rename
these test bits whenever we want.
- Limited observability during re-attach on startup: when I add
generation validation for deletions in a later PR, I want to wrap the
control plane API calls in some small client class that will expose
metrics for things like errors calling the control plane API, which will
act as a strong red signal that something is not right.
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
- Scrubber's `tidy` command requires presence of a control plane
- Scrubber has no tests at all
## Summary of changes
- Add re-usable async streams for reading metadata from a bucket
- Add a `scan-metadata` command that reads from those streams and calls
existing `checks.rs` code to validate metadata, then returns a summary
struct for the bucket. Command returns nonzero status if errors are
found.
- Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that
tests using remote storage can request to have the scrubber run after
they finish
- Enable remote storage and scrub_on_exit in test_pageserver_restart
and test_pageserver_chaos
This is a "toe in the water" of the overall space of validating the
scrubber. Later, we should:
- Enable scrubbing at end of tests using remote storage by default
- Make the success condition stricter than "no errors": tests should
declare what tenants+timelines they expect to see in the bucket (or
sniff these from the functions tests use to create them) and we should
require that the scrubber reports on these particular tenants/timelines.
The `tidy` command is untouched in this PR, but it should be refactored
later to use similar async streaming interface instead of the current
batch-reading approach (the streams are faster with large buckets), and
to also be covered by some tests.
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
Fixes #3830 by adding the `#[cfg(not(feature = "testing"))]` attribute
to unnecessary log statements in `pageserver/src/tenant/tasks.rs`.
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
We want to convert the `VirtualFile` APIs to async fn so that we can
adopt one of the async I/O solutions.
## Summary of changes
Convert the following APIs of `VirtualFile` to async fn (as well as all
of the APIs calling it):
* `VirtualFile::seek`
* `VirtualFile::metadata`
* Also, prepare for deletion of the write impl by writing the summary to
a buffer before writing it to disk, as suggested in
https://github.com/neondatabase/neon/issues/4743#issuecomment-1700663864.
This change adds an additional warning for the case where the summary
exceeds a block. Previously, we'd have silently corrupted data in this
(unlikely) case.
* `WriteBlobWriter::write_blob`, in preparation for making
`VirtualFile::write_all` async.
## Problem
`VirtualFile` does both reading and writing, and it would be nice if
both could be converted to async, so that it doesn't have to support an
async read path and a blocking write path (especially for the locks this
is annoying as none of the lock implementations in std, tokio or
parking_lot have support for both async and blocking access).
## Summary of changes
This PR is some initial work on making the `VirtualFile` APIs async. It
can be reviewed commit-by-commit.
* Introduce the `MaybeVirtualFile` enum to be generic in a test that
compares real files with virtual files (sketched below).
* Make various APIs of `VirtualFile` async, including `write_all_at`,
`read_at`, `read_exact_at`.
Part of #4743, successor of #5180.
Co-authored-by: Christian Schwarz <me@cschwarz.com>
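The enum from the first bullet might look roughly like this (a sketch with a stubbed `VirtualFile`; on Unix, `std::fs::File` provides `read_exact_at` via `FileExt`):

```rust
use std::{fs::File, io, os::unix::fs::FileExt};

struct VirtualFile; // stub standing in for the real pageserver type
impl VirtualFile {
    async fn read_exact_at(&self, _buf: &mut [u8], _offset: u64) -> io::Result<()> {
        unimplemented!("stand-in for the real async implementation")
    }
}

// One test can drive both implementations through the same call sites,
// awaiting in both arms even though only one is truly async.
enum MaybeVirtualFile {
    VirtualFile(VirtualFile),
    File(File),
}

impl MaybeVirtualFile {
    async fn read_exact_at(&self, buf: &mut [u8], offset: u64) -> io::Result<()> {
        match self {
            MaybeVirtualFile::VirtualFile(f) => f.read_exact_at(buf, offset).await,
            MaybeVirtualFile::File(f) => f.read_exact_at(buf, offset),
        }
    }
}
```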
## Problem
The `VirtualFile::crashsafe_overwrite` function was introduced by #5186
but it was not turned `async fn` yet. We want to make these functions
async fn as part of #4743.
## Summary of changes
Make `VirtualFile::crashsafe_overwrite` async fn, as well as all the
functions calling it. Don't make anything inside `crashsafe_overwrite`
use async functionalities, as per #4743 instructions.
Also, add rustdoc to `crashsafe_overwrite`.
Part of #4743.
## Problem
If a pageserver crashes partway through deleting a tenant's directory,
it might leave a partial state that confuses a subsequent
startup/attach.
## Summary of changes
Rename tenant directory to a temporary path before deleting.
Timeline deletions already have deletion markers to provide safety.
In future, it would be nice to exploit this to send responses to detach
requests earlier: https://github.com/neondatabase/neon/issues/5183
(part of #4743)
(preliminary to #5180)
This PR adds a special-purpose API to `VirtualFile` for write-once
files.
It adopts it for `save_metadata` and `persist_tenant_conf`.
This is helpful for the asyncification efforts (#4743) and specifically
asyncification of `VirtualFile` because above two functions were the
only ones that needed the VirtualFile to be an `std::io::Write`.
(There was also `manifest.rs`, which needed the `std::io::Write`, but it
isn't used right now and likely won't be, because we're taking a
different route for crash consistency; see #5172. So, let's remove it.
It'll be in Git history if we need to re-introduce it when picking up
the compaction work again; that's why it was introduced in the first
place.)
We can't remove the `impl std::io::Write for VirtualFile` just yet
because of the `BufWriter` in
```rust
struct DeltaLayerWriterInner {
...
blob_writer: WriteBlobWriter<BufWriter<VirtualFile>>,
}
```
But, @arpad-m and I have a plan to get rid of that by extracting the
append-only-ness-on-top-of-VirtualFile that #4994 added to
`EphemeralFile` into an abstraction that can be re-used in the
`DeltaLayerWriterInner` and `ImageLayerWriterInner`.
That'll be another PR.
### Performance Impact
This PR adds more fsyncs compared to before because we fsync the parent
directory every time.
1. For `save_metadata`, the additional fsyncs are unnecessary because we
know that `metadata` fits into a kernel page, and hence the write won't
be torn on the way into the kernel. However, the `metadata` file in
general is going to lose significance very soon (=> see #5172), and the
NVMes in prod can definitely handle the additional fsync. So, let's not
worry about it.
2. For `persist_tenant_conf`, which we don't check to fit into a single
kernel page, this PR makes it actually crash-consistent. Before, we
could crash while writing out the tenant conf, leaving a prefix of the
tenant conf on disk.
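For reference, the write-once sequence including the parent-directory fsync discussed above, as a simplified synchronous sketch (not the actual `VirtualFile` API):

```rust
use std::{fs, io, io::Write, path::Path};

fn crashsafe_overwrite(path: &Path, content: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(content)?;
    file.sync_all()?; // data is durable, but only under the temp name
    fs::rename(&tmp, path)?; // atomically replace the old file
    // fsync the parent directory so the rename itself survives a crash;
    // these are the additional fsyncs discussed above.
    fs::File::open(path.parent().unwrap())?.sync_all()?;
    Ok(())
}
```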
For
[#5086](https://github.com/neondatabase/neon/pull/5086#issuecomment-1701331777)
we will require remote storage to be configured in pageserver.
This PR enables `localfs`-based storage for all Rust unit tests.
Changes:
- In `TenantHarness`, set up localfs remote storage for the tenant.
- `create_test_timeline` should mimic what real timeline creation does,
and real timeline creation waits for the timeline to reach remote
storage. With this PR, `create_test_timeline` now does that as well.
- All the places that create the harness tenant twice need to shut down
the tenant before re-creating it through a second call to `try_load` or
`load`.
- Without shutting down, upload tasks initiated by/through the first
incarnation of the harness tenant might still be ongoing when the second
incarnation of the harness tenant is `try_load`/`load`ed. That doesn't
make sense in the tests that do that; they generally try to set up a
scenario similar to pageserver stop & start.
- There was one test that recreates a timeline, not the tenant. For that
case, I needed to create a `Timeline::shutdown` method. It's a
refactoring of the existing `Tenant::shutdown` method.
- The remote_timeline_client tests previously set up their own
`GenericRemoteStorage` and `RemoteTimelineClient`. Now they re-use the
one that's pre-created by the TenantHarness. Some adjustments to the
assertions were needed because the assertions now need to account for
the initial image layer that's created by `create_test_timeline` to be
present.
The `remote_timeline_client` tests use `#[tokio::test]` and rely on the
fact that the test runtime that is set up by this macro is
single-threaded.
In PR https://github.com/neondatabase/neon/pull/5164, we observed
interesting flakiness with the `upload_scheduling` test case:
it would observe the upload of the third layer (`layer_file_name_3`)
before we did `wait_completion`.
Under the single-threaded-runtime assumption, that wouldn't be possible,
because the test code doesn't await in between scheduling the upload
and calling `wait_completion`.
However, RemoteTimelineClient was actually using `BACKGROUND_RUNTIME`.
That means there was parallelism where the tests didn't expect it,
leading to flakiness such as execution of an UploadOp task before
the test calls `wait_completion`.
The most confusing scenario is code like this:
```
schedule_upload(A);
wait_completion().await; // B
schedule_upload(C);
wait_completion().await; // D
```
On a single-threaded executor, it is guaranteed that the upload of C
doesn't run before D, because we (the test) don't relinquish control
to the executor before D's `await` point.
However, RemoteTimelineClient actually scheduled onto the
BACKGROUND_RUNTIME, so, `A` could start running before `B` and
`C` could start running before `D`.
This would cause flaky tests when making assertions about the state
manipulated by the operations. The concrete issue that led to the
discovery of this bug was an assertion about `remote_fs_dir` state in #5164.
## Problem
The S3 scrubber currently lives at
https://github.com/neondatabase/s3-scrubber
We don't have tests that use it, and it has copies of some data
structures that can get stale.
## Summary of changes
- Import the s3-scrubber as `s3_scrubber/`
- Replace `copied_definitions/` in the scrubber with direct access to the
`utils` and `pageserver` crates
- Modify visibility of a few definitions in `pageserver` to allow the
scrubber to use them
- Update scrubber code for recent changes to `IndexPart`
- Update `KNOWN_VERSIONS` for IndexPart and move the definition into
index.rs so that it is easier to keep up to date
As a future refinement, it would be good to pull the remote persistence
types (like IndexPart) out of `pageserver` into a separate library so
that the scrubber doesn't have to link against the whole pageserver, and
so that it's clearer which types need to be public.
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
To implement split brain protection, we need tenants and timelines to be
aware of their current generation, and use it when composing S3 keys.
## Summary of changes
- A `Generation` type is introduced in the `utils` crate -- it is in
this broadly-visible location because it will later be used from
`control_plane/` as well as `pageserver/`. A generation can be a number,
None, or Broken, to support legacy content (None) and tenants in the
broken state (Broken).
- Tenant, Timeline, and RemoteTimelineClient all get a generation
attribute
- IndexPart's IndexLayerMetadata has a new `generation` attribute.
Legacy layers' metadata will deserialize to Generation::none().
- Remote paths are composed with a trailing generation suffix. If a
generation is equal to Generation::none() (as it currently always is),
then this suffix is an empty string (see the sketch below).
- Functions for composing remote storage paths added in
remote_timeline_client: these avoid the way that we currently always
compose a local path and then strip the prefix, and avoid requiring a
PageserverConf reference on functions that want to create remote paths
(the conf is only needed for local paths). These are less DRY than the
old functions, but remote storage paths are a very rarely changing
thing, so it's better to write out our paths clearly in the functions
than to compose timeline paths from tenant paths, etc.
- Code paths that construct a Tenant take a `generation` argument in
anticipation that we will soon load generations on startup before
constructing Tenant.
Until the whole feature is done, we don't want any generation-ful keys
though: so initially we will carry this everywhere with the special
Generation::none() value.
Closes: https://github.com/neondatabase/neon/issues/5135
Co-authored-by: Christian Schwarz <christian@neon.tech>
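A sketch of how the type and the suffix rule fit together (hypothetical code; the exact suffix encoding here is an assumption):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Generation {
    Valid(u32),
    None,   // legacy, pre-generations content
    Broken, // tenant in the Broken state
}

impl Generation {
    fn none() -> Self {
        Generation::None
    }

    // Remote keys carry a generation suffix; for Generation::none() the
    // suffix is empty, so legacy keys remain unchanged.
    fn get_suffix(self) -> String {
        match self {
            Generation::Valid(g) => format!("-{g:08x}"), // encoding assumed
            _ => String::new(),
        }
    }
}

fn index_part_key(generation: Generation) -> String {
    format!("index_part.json{}", generation.get_suffix())
}
```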
## Problem
`timeline_layers` was write-only since
b95addddd5
We deployed the version that no longer requires it for deserializing, so
now we can stop including it when serializing.
## Summary of changes
Fully remove `timeline_layers`.
In logs it is confusing to see seqwait timeouts that seemingly arise
from the branch point's LSN but are actually about the ancestor, leading
to questions like "has the last_record_lsn gone backwards".
Noticed by @problame.
## Problem
`read_blk` does I/O and thus we would like to make it async. We can't
make the function async as long as the `PageReadGuard` returned by
`read_blk` isn't `Send`. The page cache is called by `read_blk`, and the
page cache can't be async without `read_blk` being async as well. Thus,
we have a circular dependency.
## Summary of changes
Due to the circular dependency, we convert both the page cache and
`read_blk` to async at the same time:
We make the page cache use `tokio::sync` synchronization primitives, as
those are `Send`. This requires every place that acquires a lock to
become async, which we convert as well. This includes the asyncification
of the `read_blk` function.
Builds upon #4994, #5015, #5056, and #5129.
Part of #4743.
## Problem
I saw these things while working on #5111.
## Summary of changes
* Add a comment explaining why we use `Vec::leak` instead of
`Vec::into_boxed_slice` plus `Box::leak`.
* Add another comment explaining what `valid` is doing; it wasn't very
clear before.
* Add a `set_usage_count` function so the field is not set directly.
Before this patch, when dropping an EphemeralFile, we'd scan the entire
`slots` array to proactively evict its pages (`drop_buffers_for_immutable`).
This was _necessary_ before #4994 because the page cache was a
write-back cache: we'd be deleting the EphemeralFile from disk afterwards,
so, if we hadn't evicted its pages before that, write-back in
`find_victim` would have failed.
But, since #4994, the page cache is a read-only cache, so, it's safe
to keep read-only data cached: it's never going to get accessed again,
and eventually, `find_victim` will evict it.
The only remaining advantage of `drop_buffers_for_immutable` over
relying on `find_victim` is that `find_victim` has to do the clock
page replacement iterations until the count reaches 0 (sketched below),
whereas `drop_buffers_for_immutable` can kick the page out right away.
However, weigh that against the cost of `drop_buffers_for_immutable`,
which currently scans the entire `slots` array to find the
EphemeralFile's pages.
Alternatives have been proposed in #5122 and #5128, but, they come
with their own overheads & trade-offs.
Also, the real reason why we're looking into this piece of code is
that we want to make the slots rwlock async in #5023.
Since `drop_buffers_for_immutable` is called from drop, and there
is no async drop, it would be nice to not have to deal with this.
So, let's just stop doing `drop_buffers_for_immutable` and observe
the performance impact in benchmarks.
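For comparison, the clock iteration that eviction now relies on, in a minimal sketch (not the actual page cache code):

```rust
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};

struct Slot {
    usage_count: AtomicU8,
}

static NEXT: AtomicUsize = AtomicUsize::new(0);

// Sweep the slots clock-style: a slot whose usage count is already 0 is
// the victim; otherwise decrement the count and keep going. A recently
// used page thus needs several sweeps before it is evicted, which is the
// latency that `drop_buffers_for_immutable` used to avoid.
fn find_victim(slots: &[Slot]) -> usize {
    loop {
        let i = NEXT.fetch_add(1, Ordering::Relaxed) % slots.len();
        let decremented = slots[i]
            .usage_count
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |c| c.checked_sub(1));
        if decremented.is_err() {
            return i; // count was already 0: evict this slot
        }
    }
}
```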
Unrelated fixes noticed while integrating #4938.
- Stop leaking future layers in remote storage
- Stop scheduling extra index_part uploads when a layer name to be
removed was not actually present
## Problem
The `metadata_bytes` field of IndexPart required explicit
deserialization & error checking everywhere it was used -- there isn't
anything special about this structure that should prevent it from being
serialized & deserialized along with the rest of the structure.
## Summary of changes
- Implement Serialize and Deserialize for TimelineMetadata
- Replace IndexPart::metadata_bytes with a simpler `metadata`, that can
be used directly.
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
We want to make `read_blk` an async function, but outside of
`async_trait` (which allocates) and nightly features, we can't use async
fns in traits.
## Summary of changes
* Remove all uses of `BlockReader::read_blk` in favour of using block
cursors, at least where the type of the `BlockReader` is behind a
generic
* Introduce a `BlockReaderRef` enum that lists all implementors of
`BlockReader::read_blk` (sketched below).
* Remove `BlockReader::read_blk` and move its implementations into
inherent functions on the types instead.
We don't turn `read_blk` into an async fn yet, for that we also need to
modify the page cache. So this is a preparatory PR, albeit an important
one.
Part of #4743.
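A rough sketch of the dispatch pattern (variant set and signatures are hypothetical): with a closed set of implementors, an inherent `async fn` on an enum sidesteps the lack of async fns in traits, without `async_trait`'s boxed futures.

```rust
use std::io;

// Stubs standing in for the real implementor types.
struct FileBlockReader;
struct EphemeralFile;
struct Buffer(Vec<u8>);

impl FileBlockReader {
    async fn read_blk(&self, _blknum: u32) -> io::Result<Buffer> {
        unimplemented!()
    }
}
impl EphemeralFile {
    async fn read_blk(&self, _blknum: u32) -> io::Result<Buffer> {
        unimplemented!()
    }
}

// What used to be `BlockReader::read_blk` becomes one inherent async fn
// that matches on the (closed) set of readers.
enum BlockReaderRef<'a> {
    FileBlockReader(&'a FileBlockReader),
    EphemeralFile(&'a EphemeralFile),
}

impl BlockReaderRef<'_> {
    async fn read_blk(&self, blknum: u32) -> io::Result<Buffer> {
        match self {
            BlockReaderRef::FileBlockReader(r) => r.read_blk(blknum).await,
            BlockReaderRef::EphemeralFile(f) => f.read_blk(blknum).await,
        }
    }
}
```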
## Problem
The `EphemeralFile::write_blob` function accesses the page cache
internally. We want to require `async` for these accesses in #5023.
## Summary of changes
This removes the implementation of the `BlobWriter` trait for
`EphemeralFile` and turns the `write_blob` function into an inherent
function. We can then make it async, as well as the `push_bytes`
function. We move the `SER_BUFFER` thread-local into
`InMemoryLayerInner` so that the same buffer can be accessed by
different threads, as the async task is (potentially) moved between threads.
Part of #4743, preparation for #5023.
The current implementation first calls `load_layer_map`, which loads all
local layers and cleans up some files, leaving the rest of the cleanup
to a "second function". Then the "second function" is finally called; it
does not do the cleanup, and some of the first function's setup can be
torn down by then. The "second function" is actually both
`reconcile_with_remote` and `create_remote_layers`.
This change makes it a bit more verbose, but does everything in one
phase with the following sub-steps:
1. scan the timeline directory
2. delete extra files
   - now including on-demand download files
   - fixes #3660
3. reconcile the two sources of layers (directory, index_part)
4. `rename_to_backup` future layers and short layers
5. create the remaining files as layers
Needed by #4938.
It was also noticed that this is blocking code in an `async fn`, so we
just do it in a `spawn_blocking`, which should be healthy for our startup
times. Other effects include hopefully halving the number of `stat`
calls; extra calls that were not done previously are now done for the
future layers.
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
## Problem
Currently, anything that uses `backoff::retry` will delay the join of its
task by however long its backoff sleep is, multiplied by its max
retries.
Whenever we call a function that sleeps, we should be passing in a
CancellationToken.
## Summary of changes
- Add a `Cancel` type to `backoff::retry` that wraps a CancellationToken
and an error `Fn` to generate an error if the cancellation token fires
(see the sketch below).
- In call sites that already run in a `task_mgr` task, use
`shutdown_token()` to provide the token. In other locations, use a dead
`CancellationToken` to satisfy the interface, and leave a TODO to fix it
up when we broaden the use of explicit cancellation tokens.
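A sketch of the shape described above (hypothetical signature; the real `backoff::retry` differs). The key point is that the backoff sleep races against the cancellation token:

```rust
use std::{future::Future, time::Duration};
use tokio_util::sync::CancellationToken;

async fn retry<T, E, F, Fut>(
    mut op: F,
    max_retries: u32,
    cancel: &CancellationToken,
    cancelled: impl Fn() -> E,
) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut attempt = 0;
    loop {
        let last_err = match op().await {
            Ok(v) => return Ok(v),
            Err(e) => e,
        };
        attempt += 1;
        if attempt >= max_retries {
            return Err(last_err);
        }
        // Without the token, a task could linger for the entire backoff
        // schedule after shutdown was requested.
        let backoff = Duration::from_millis(100 * 2u64.pow(attempt));
        tokio::select! {
            _ = tokio::time::sleep(backoff) => {}
            _ = cancel.cancelled() => return Err(cancelled()),
        }
    }
}
```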
Accidentally giving is_incremental=true for ImageLayers costs a lot of
debugging time. Remove all APIs that would allow doing that. They can
easily be restored later *when needed*.
Split off from #4938.
When doing global queries in VictoriaMetrics, the per-timeline
histograms make us run into cardinality limits.
We don't want to give them up just yet because we don't
have an alternative for drilling down on timeline-specific
performance issues.
So, add a pre-aggregated histogram and add observations to it
whenever we add observations to the per-timeline histogram.
While we're at it, switch to using a `strum`-derived enum for the
operation type names.
#4938 will make on-demand download of layers in compaction possible,
which doesn't fit our "policy" of no
`spawn_blocking(|| ... Handle::block_on(async { spawn_blocking(...).await }))`,
because that pattern poses a clear deadlock risk. The nested
`spawn_blocking`s come from the download using `tokio::fs::File`.
- Remove `spawn_blocking` from caller of `compact_level0_phase1`
- Remove `Handle::block_on` from `compact_level0_phase1` (indentation
change)
- Revert to `AsLayerDesc::layer_desc` usage temporarily (until it
becomes field access in #4938)
## Problem
There are some common types that we pass into tenants and timelines as
we construct them, such as remote storage and the broker client.
Currently the list is small, but this is likely to grow -- the deletion
queue PR (#4960) pushed some methods to the point of clippy complaining
they had too many args, because of the extra deletion queue client being
passed around.
There are some shared objects that currently aren't passed around
explicitly because they use a static `once_cell` (e.g.
CONCURRENT_COMPACTIONS), but as we add more resource management and
concurrency control over time, it will be more readable & testable to
pass a type around in the respective Resources object, rather than to
coordinate via static objects. The `Resources` structures in this PR
will make it easier to add references to central coordination functions,
without having to rely on statics.
## Summary of changes
- For `Tenant`, the `broker_client` and `remote_storage` are bundled
into `TenantSharedResources`
- For `Timeline`, the `remote_client` is wrapped into
`TimelineResources`.
Both of these structures will get an additional deletion queue member in
#4960.
## Problem
IndexPart contains two redundant lists of layer names: a set of the
names, and then a map of name to metadata.
We already required that all the layers in `timeline_layers` are also in
`layers_metadata`, in `initialize_with_current_remote_index_part`, so if
there were any index_part.json files in the field that relied on these
sets being different, they would already be broken.
## Summary of changes
`timeline_layers` is made private and no longer read at runtime. It is
still serialized, but not deserialized.
`disk_consistent_lsn` is also made private, as this field only exists
for convenience of humans reading the serialized JSON.
This prepares us to entirely remove `timeline_layers` in a future
release, once this change is fully deployed, and therefore no
pageservers are trying to read the field.
I'm still a bit nervous about the attach -> crash case, but it should
work (unlike the timeline case). Ideally it would be covered by a test.
This continues the tradition of adding bool flags for
`Tenant::set_stopping`. The lifecycle project will probably help with
fixing that.
## Problem
Before, DeltaLayer dumping (via `cargo run --release -p pagectl --
print-layer-file`) would crash, as one can't call `Handle::block_on` in
an async executor thread.
## Summary of changes
Avoid the problem by using `DeltaLayerInner::load_keys` to load the keys
into RAM (which we already do during compaction), and then loading the
values one by one during dumping.