rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-31 03:50:37 +00:00

Author	SHA1	Message	Date
Arpad Müller	69528b7c30	Prepare k-merge in compaction for async I/O (#4836 ) ## Problem The k-merge in pageserver compaction currently relies on iterators over the keys and also over the values. This approach does not support async code because we are using iterators and those don't support async in general. Also, the k-merge implementation we use doesn't support async either. Instead, as we already load all the keys into memory, the plan is to just do the sorting in-memory for now, switch to async, and then once we want to support workloads that don't have all keys stored in memory, we can look into switching to a k-merge implementation that supports async instead. ## Summary of changes The core of this PR is the move from functions on the `PersistentLayer` trait to return custom iterator types to inherent functions on `DeltaLayer` that return buffers with all keys or value references. Value references are a type we created in this PR, containing a `BlobRef` as well as an `Arc` pointer to the `DeltaLayerInner`, so that we can lazily load the values during compaction. This preserves the property of the current code. This PR does not switch us to doing the k-merge via sort on slices, but with this PR, doing such a switch is relatively easy and only requires changes of the compaction code itself. Part of https://github.com/neondatabase/neon/issues/4743	2023-08-01 13:38:35 +02:00
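A minimal sketch of the plan described in 69528b7c30: collect every key from every layer into one in-memory buffer, sort it, and only load values lazily through a reference that keeps the layer alive. `LayerInner`, `ValueRef`, and `merge_by_sort` are hypothetical names for illustration, keys are simplified to `u64`, and none of this is the pageserver's actual code.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the data a delta layer keeps on disk.
struct LayerInner {
    blobs: Vec<String>,
}

/// Hypothetical value reference: a blob position plus a shared pointer to
/// the layer, so that the value can be loaded later, on demand.
struct ValueRef {
    blob_pos: usize,
    layer: Arc<LayerInner>,
}

impl ValueRef {
    /// Load the value only when it is needed.
    fn load(&self) -> &str {
        &self.layer.blobs[self.blob_pos]
    }
}

/// Replace a k-way merge by "collect everything, then sort".
/// Each input is a layer plus its (key, blob position) pairs.
fn merge_by_sort(layers: &[(Arc<LayerInner>, Vec<(u64, usize)>)]) -> Vec<(u64, ValueRef)> {
    let mut all: Vec<(u64, ValueRef)> = Vec::new();
    for (layer, keys) in layers {
        for &(key, blob_pos) in keys {
            all.push((key, ValueRef { blob_pos, layer: Arc::clone(layer) }));
        }
    }
    // Stable sort: entries with equal keys keep their input order.
    all.sort_by_key(|(key, _)| *key);
    all
}

/// Small usage example: two layers with interleaved keys.
fn demo_merge() -> Vec<(u64, String)> {
    let a = Arc::new(LayerInner { blobs: vec!["a1".into(), "a5".into()] });
    let b = Arc::new(LayerInner { blobs: vec!["b2".into(), "b3".into()] });
    let layers: Vec<(Arc<LayerInner>, Vec<(u64, usize)>)> =
        vec![(a, vec![(1, 0), (5, 1)]), (b, vec![(2, 0), (3, 1)])];
    merge_by_sort(&layers)
        .iter()
        .map(|(key, value)| (*key, value.load().to_string()))
        .collect()
}
```

The sort needs no iterator state, so nothing in it has to be held across an await point; the trade-off is that all keys must fit in memory, which the commit message states is already the case.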
arpad-m	e5b7ddfeee	Preparatory pageserver async conversions (#4773 ) In #4743, we'd like to convert the read path to use `async` rust. In preparation of that, this PR switches some functions that are calling lower level functions like `BlockReader::read_blk`, `BlockCursor::read_blob`, etc into `async`. The PR does not switch all functions however, and only focuses on the ones which are easy to switch. This leaves around some async functions that are (currently) unnecessarily `async`, but on the other hand it makes future changes smaller in diff. Part of #4743 (but does not completely address it).	2023-07-24 14:01:54 +02:00
Alex Chi Z	1066bca5e3	compaction: allow duplicated layers and skip in replacement (#4696 ) ## Problem Compactions might generate files of exactly the same name as before compaction due to our naming of layer files. This could have already caused some mess in the system, and is known to cause some issues like https://github.com/neondatabase/neon/issues/4088. Therefore, we now consider duplicated layers in the post-compaction process to avoid violating the layer map duplicate checks. related previous works: close https://github.com/neondatabase/neon/pull/4094 error reported in: https://github.com/neondatabase/neon/issues/4690, https://github.com/neondatabase/neon/issues/4088 ## Summary of changes If a file already exists in the layer map before the compaction, do not modify the layer map and do not delete the file. The file on disk at that time should be the new one overwritten by the compaction process. This PR also adds a test case with a fail point that produces exactly the same set of files. This bypassing behavior is safe because the produced layer files have the same content / are the same representation of the original file. An alternative might be directly removing the duplicate check in the layer map, but I feel it would be good if we can prevent that in the first place. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru> Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-07-17 17:26:29 +03:00
arpad-m	982fce1e72	Fix rustdoc warnings and test cargo doc in CI (#4711 ) ## Problem `cargo +nightly doc` is giving a lot of warnings: broken links, naked URLs, etc. ## Summary of changes * update the `proc-macro2` dependency so that it can compile on latest Rust nightly, see https://github.com/dtolnay/proc-macro2/pull/391 and https://github.com/dtolnay/proc-macro2/issues/398 * allow the `private_intra_doc_links` lint, as linking to something that's private is always more useful than just mentioning it without a link: if the link breaks in the future, at least there is a warning due to that. Also, one might enable [`--document-private-items`](https://doc.rust-lang.org/cargo/commands/cargo-doc.html#documentation-options) in the future and make these links work in general. * fix all the remaining warnings given by `cargo +nightly doc` * make it possible to run `cargo doc` on stable Rust by updating `opentelemetry` and associated crates to version 0.19, pulling in a fix that previously broke `cargo doc` on stable: https://github.com/open-telemetry/opentelemetry-rust/pull/904 * Add `cargo doc` to CI to ensure that it won't get broken in the future. Fixes #2557 ## Future work * Potentially, it might make sense, for development purposes, to publish the generated rustdocs somewhere, like for example [how the rust compiler does it](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/index.html). I will file an issue for discussion.	2023-07-15 05:11:25 +03:00
Alex Chi Z	c76b74c50d	semantic layer map operations (#4618 ) ## Problem ref https://github.com/neondatabase/neon/issues/4373 ## Summary of changes A step towards immutable layer map. I decided to finish the refactor with this new approach and apply https://github.com/neondatabase/neon/pull/4455 on this patch later. In this PR, we moved all modifications of the layer map to one place with semantic operations like `initialize_local_layers`, `finish_compact_l0`, `finish_gc_timeline`, etc, which is now part of `LayerManager`. This makes it easier to build new features upon this PR: * For immutable storage state refactor, we can simply replace the layer map with `ArcSwap<LayerMap>` and remove the `layers` lock. Moving towards it requires us to put all layer map changes in a single place as in https://github.com/neondatabase/neon/pull/4455. * For manifest, we can write to manifest in each of the semantic functions. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-07-13 17:35:27 +03:00
Alex Chi Z	08bfe1c826	remove `LayerDescriptor` and use `LayerObject` for tests (#4637 ) ## Problem part of https://github.com/neondatabase/neon/pull/4340 ## Summary of changes Remove LayerDescriptor and remove `todo!`. At the same time, this PR adds `AsLayerDesc` trait for all persistent layers and changed `LayerFileManager` to have a generic type. For tests, we are now using `LayerObject`, which is a wrapper around `PersistentLayerDesc`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2023-07-10 19:40:37 +03:00
arpad-m	4f280c2953	Small pageserver cleanups (#4657 ) ## Problem I was reading the code of the page server today and found these minor things that I thought could be cleaned up. ## Summary of changes * remove a redundant indentation layer and continue in the flushing loop * use the builtin `PartialEq` check instead of hand-rolling a `range_eq` function * Add a missing `>` to a prominent doc comment	2023-07-07 16:53:14 +02:00
arpad-m	1c516906e7	Impl Display for LayerFileName and require it for Layer (#4630 ) Does three things: * add a `Display` impl for `LayerFileName` equal to the `short_id` * based on that, replace the `Layer::short_id` function by a requirement for a `Display` impl * use that `Display` impl in the places where the `short_id` and `file_name()` functions were used instead Fixes #4145	2023-07-05 14:27:50 +02:00
Alex Chi Z	7e20b49da4	refactor: use LayerDesc in LayerMap (part 2) (#4437 ) ## Problem part of https://github.com/neondatabase/neon/issues/4392, continuation of https://github.com/neondatabase/neon/pull/4408 ## Summary of changes This PR removes all layer objects from LayerMap and moves it to the timeline struct. In timeline struct, LayerFileManager maps a layer descriptor to a layer object, and it is stored in the same RwLock as LayerMap to avoid behavior difference. Key changes: * LayerMap now does not have generic, and only stores descriptors. * In Timeline, we add a new struct called layer mapping. * Currently, layer mapping is stored in the same lock with layer map. Every time we retrieve data from the layer map, we will need to map the descriptor to the actual object. * Replace_historic is moved to layer mapping's replace, and the return value behavior is different from before. I'm a little bit unsure about this part and it would be good to have some comments on that. * Some test cases are rewritten to adapt to the new interface, and we can decide whether to remove it in the future because it does not make much sense now. * LayerDescriptor is moved to `tests` module and should only be intended for unit testing / benchmarks. * Because we now have a usage pattern like "take the guard of lock, then get the reference of two fields", we want to avoid dropping the incorrect object when we intend to unlock the lock guard. Therefore, a new set of helper function `drop_r/wlock` is added. This can be removed in the future when we finish the refactor. TODOs after this PR: fully remove RemoteLayer, and move LayerMapping to a separate LayerCache. all refactor PRs: ``` #4437 --- #4479 ------------ #4510 (refactor done at this point) \-- #4455 -- #4502 --/ ``` --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2023-06-29 15:06:07 -04:00
Shany Pozin	a7f3f5f356	Revert "run `Layer::get_value_reconstruct_data` in `spawn_blocking`#4498" (#4569 ) This reverts commit `1faf69a698`.	2023-06-27 10:57:28 +03:00
Christian Schwarz	1faf69a698	run `Layer::get_value_reconstruct_data` in `spawn_blocking` (#4498 ) This PR concludes the "async `Layer::get_value_reconstruct_data`" project. The problem we're solving is that, before this patch, we'd execute `Layer::get_value_reconstruct_data` on the tokio executor threads. This function is IO- and/or CPU-intensive. The IO is using VirtualFile / std::fs; hence it's blocking. This results in unfairness towards other tokio tasks, especially under (disk) load. Some context can be found at https://github.com/neondatabase/neon/issues/4154 where I suspect (but can't prove) load spikes of logical size calculation to cause heavy eviction skew. Sadly we don't have tokio runtime/scheduler metrics to quantify the unfairness. But generally, we know blocking the executor threads on std::fs IO is bad. So, let's have this change and watch out for severe perf regressions in staging & during rollout. ## Changes * rename `Layer::get_value_reconstruct_data` to `Layer::get_value_reconstruct_data_blocking` * add a new blanket impl'd `Layer::get_value_reconstruct_data` `async_trait` method that runs `get_value_reconstruct_data_blocking` inside `spawn_blocking`. * The `spawn_blocking` requires `'static` lifetime of the captured variables; hence I had to change the data flow to _move_ the `ValueReconstructState` into and back out of get_value_reconstruct_data instead of passing a reference. It's a small struct, so I don't expect a big performance penalty. 
## Performance Fundamentally, the code changes cause the following performance-relevant changes: * Latency & allocations: each `get_value_reconstruct_data` call now makes a short-lived allocation because `async_trait` is just sugar for boxed futures under the hood * Latency: `spawn_blocking` adds some latency because it needs to move the work to a thread pool * using `spawn_blocking` plus the existing synchronous code inside is probably more efficient than switching all the synchronous code to tokio::fs because _each_ tokio::fs call does `spawn_blocking` under the hood. * Throughput: the `spawn_blocking` thread pool is much larger than the async executor thread pool. Hence, as long as the disks can keep up, which they should according to AWS specs, we will be able to deliver higher `get_value_reconstruct_data` throughput. * Disk IOPS utilization: we will see higher disk utilization if we get more throughput. Not a problem because the disks in prod are currently under-utilized, according to node_exporter metrics & the AWS specs. * CPU utilization: at higher throughput, CPU utilization will be higher. Slightly higher latency under regular load is acceptable given the throughput gains and expected better fairness during disk load peaks, such as logical size calculation peaks uncovered in #4154. ## Full Stack Of Preliminary PRs This PR builds on top of the following preliminary PRs 1. Clean-ups * https://github.com/neondatabase/neon/pull/4316 * https://github.com/neondatabase/neon/pull/4317 * https://github.com/neondatabase/neon/pull/4318 * https://github.com/neondatabase/neon/pull/4319 * https://github.com/neondatabase/neon/pull/4321 * Note: these were mostly to find an alternative to #4291, which I thought we'd need in my original plan where we would need to convert `Tenant::timelines` into an async locking primitive (#4333). In reviews, we walked away from that, but these cleanups were still quite useful. 2. https://github.com/neondatabase/neon/pull/4364 3. 
https://github.com/neondatabase/neon/pull/4472 4. https://github.com/neondatabase/neon/pull/4476 5. https://github.com/neondatabase/neon/pull/4477 6. https://github.com/neondatabase/neon/pull/4485 7. https://github.com/neondatabase/neon/pull/4441	2023-06-26 11:43:11 +02:00
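The data-flow change described in 1faf69a698 (moving the state into the blocking call and back out, instead of passing a reference) can be sketched with the standard library alone. This sketch swaps in `std::thread::spawn` for tokio's `spawn_blocking`, which imposes the same `'static` requirement on captured variables; `ReconstructState`, `fill_blocking`, and `fill_on_other_thread` are hypothetical names, not the pageserver's API.

```rust
use std::thread;

/// Hypothetical stand-in for a small reconstruct-state struct.
#[derive(Debug, Default, PartialEq)]
struct ReconstructState {
    records: Vec<u32>,
}

/// Synchronous worker: takes the state by value and hands it back.
fn fill_blocking(mut state: ReconstructState, key: u32) -> ReconstructState {
    state.records.push(key);
    state
}

/// The closure passed to `thread::spawn` must be `'static`, so it cannot
/// borrow `state` from the caller. Moving the state in and returning it
/// through the join handle satisfies that bound without any reference.
fn fill_on_other_thread(state: ReconstructState, key: u32) -> ReconstructState {
    thread::spawn(move || fill_blocking(state, key))
        .join()
        .expect("worker thread panicked")
}
```

A call such as `fill_on_other_thread(ReconstructState::default(), 7)` returns the same struct with the new record appended; the cost is one move in each direction, which is cheap for a small struct.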
Christian Schwarz	3693d1f431	turn Timeline::layers into tokio::sync::RwLock (#4441 ) This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`). # Full Stack Of Preliminary PRs Thanks to the countless preliminary PRs, this conversion is relatively straight-forward. 1. Clean-ups * https://github.com/neondatabase/neon/pull/4316 * https://github.com/neondatabase/neon/pull/4317 * https://github.com/neondatabase/neon/pull/4318 * https://github.com/neondatabase/neon/pull/4319 * https://github.com/neondatabase/neon/pull/4321 * Note: these were mostly to find an alternative to #4291, which I thought we'd need in my original plan where we would need to convert `Tenant::timelines` into an async locking primitive (#4333). In reviews, we walked away from that, but these cleanups were still quite useful. 2. https://github.com/neondatabase/neon/pull/4364 3. https://github.com/neondatabase/neon/pull/4472 4. https://github.com/neondatabase/neon/pull/4476 5. https://github.com/neondatabase/neon/pull/4477 6. https://github.com/neondatabase/neon/pull/4485 # Significant Changes In This PR ## `compact_level0_phase1` & `create_delta_layer` This commit partially reverts "pgserver: spawn_blocking in compaction (#4265)" `4e359db4c7`. Specifically, it reverts the `spawn_blocking`-ification of `compact_level0_phase1`. If we didn't revert it, we'd have to use `Timeline::layers.blocking_read()` inside `compact_level0_phase1`. That would use up a thread in the `spawn_blocking` thread pool, which is hard-capped. I considered wrapping the code that follows the second `layers.read().await` into `spawn_blocking`, but there are lifetime issues with `deltas_to_compact`. Also, this PR switches the `create_delta_layer` _function_ back to async, and uses `spawn_blocking` inside to run the code that does sync IO, while keeping the code that needs to lock `Timeline::layers` async. 
## `LayerIter` and `LayerKeyIter` `Send` bounds I had to add a `Send` bound on the `dyn` type that `LayerIter` and `LayerKeyIter` wrap. Why? Because we now have the second `layers.read().await` inside `compact_level0_phase`, and these iterator instances are held across that await-point. More background: https://github.com/neondatabase/neon/pull/4462#issuecomment-1587376960 ## `DatadirModification::flush` Needed to replace the `HashMap::retain` with a hand-rolled variant because `TimelineWriter::put` is now async.	2023-06-13 18:38:41 +02:00
Alex Chi Z	2e687bca5b	refactor: use LayerDesc in layer map (part 1) (#4408 ) ## Problem part of https://github.com/neondatabase/neon/issues/4392 ## Summary of changes This PR adds a new HashMap that maps persistent layer desc to the layer object inside LayerMap. Originally I directly went towards adding such layer cache in Timeline, but the changes are too many and cannot be reviewed as a reasonably-sized PR. Therefore, we take this intermediate step to change part of the codebase to use persistent layer desc, and come up with other PRs to move this hash map of layer desc to the timeline struct. Also, file_size is now part of the layer desc. --------- Signed-off-by: Alex Chi <iskyzh@gmail.com> Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>	2023-06-07 18:28:18 +03:00
Alex Chi Z	66cdba990a	refactor: use PersistentLayerDesc for persistent layers (#4398 ) ## Problem Part of https://github.com/neondatabase/neon/issues/4373 ## Summary of changes This PR adds `PersistentLayerDesc`, which will be used in LayerMap mapping and probably layer cache. After this PR and after we change LayerMap to map to layer desc, we can safely drop RemoteLayerDesc. --------- Signed-off-by: Alex Chi <iskyzh@gmail.com> Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>	2023-06-01 22:06:28 +03:00
Alex Chi Z	7126197000	pagectl: refactor ctl and support dump kv in delta (#4268 ) This PR refactors the original page_binutils with a single tool pagectl, use clap derive for better command line parsing, and adds the dump kv tool to extract information from delta file. This helps me better understand what's inside the page server. We can add support for other types of file and more functionalities in the future. --------- Signed-off-by: Alex Chi <iskyzh@gmail.com>	2023-05-24 19:36:07 +03:00
Christian Schwarz	7dd9553bbb	eviction: regression test + distinguish layer write from map insert (#4005 ) This patch adds a regression test for the threshold-based layer eviction. The test asserts the basic invariant that, if left alone, the residence statuses will stabilize, with some layers resident and some layers evicted. Thereby, we cover both the aspect of last-access-time-threshold-based eviction, and the "imitate access" hacks that we put in recently. The aggressive `period` and `threshold` values revealed a subtle bug which is also fixed in this patch. The symptom was that, without the Rust changes of this patch, there would be occasional test failures due to `WARN... unexpectedly downloading` log messages. These log messages were caused by the "imitate access" calls of the eviction task. But, the whole point of the "imitate access" hack was to prevent eviction of the layers that we access there. After some digging, I found the root cause, which is the following race condition: 1. Compact: Write out an L1 layer from several L0 layers. This records residence event `LayerCreate` with the current timestamp. 2. Eviction: imitate access logical size calculation. This accesses the L0 layers because the L1 layer is not yet in the layer map. 3. Compact: Grab layer map lock, add the new L1 to layer map and remove the L0s, release layer map lock. 4. Eviction: observes the new L1 layer whose only activity timestamp is the `LayerCreate` event. The L1 layer had no chance of being accessed until after (3). So, if enough time passes between (1) and (3), then (4) will observe a layer with `now-last_activity > threshold` and evict it The fix is to require the first `record_residence_event` to happen while we already hold the layer map lock. The API requires a ref to a `BatchedUpdates` as a witness that we are inside a layer map lock. That is not fool-proof, e.g., new call sites for `insert_historic` could just completely forget to record the residence event. 
It would be nice to prevent this at the type level. In the meantime, we have a rate-limited log message to warn us if such an implementation error sneaks in in the future. fixes https://github.com/neondatabase/neon/issues/3593 fixes https://github.com/neondatabase/neon/issues/3942 --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-05-04 16:16:48 +02:00
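The fix in 7dd9553bbb relies on a witness argument: the first residence event can only be recorded by a caller who can show it is inside the layer map lock. A minimal sketch of that idea follows, assuming a plain `std::sync::Mutex`; `LayerMap`, `BatchedUpdatesSketch`, `AccessStats`, and `insert_with_event` are hypothetical simplifications, not the real types.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical, simplified layer map: just a list of layer names.
struct LayerMap {
    layers: Vec<String>,
}

/// Witness type: it owns the lock guard, so holding a reference to it
/// proves that the layer map lock is currently held.
struct BatchedUpdatesSketch<'a> {
    guard: MutexGuard<'a, LayerMap>,
}

/// Hypothetical access statistics for one layer.
struct AccessStats {
    events: Vec<&'static str>,
}

impl AccessStats {
    /// Requires the witness, so it cannot be called outside the lock.
    fn record_residence_event(&mut self, _witness: &BatchedUpdatesSketch<'_>, event: &'static str) {
        self.events.push(event);
    }
}

/// Record the creation event and insert the layer under the same lock,
/// so no other task can observe the layer before the event exists.
fn insert_with_event(map: &Mutex<LayerMap>, stats: &mut AccessStats, name: &str) {
    let mut updates = BatchedUpdatesSketch { guard: map.lock().unwrap() };
    stats.record_residence_event(&updates, "LayerCreate");
    updates.guard.layers.push(name.to_string());
}
```

As the commit message notes, this is not fool-proof: the witness forces the event to be recorded under the lock, but nothing forces a new call site to record the event at all.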
Christian Schwarz	a64dd3ecb5	disk-usage-based layer eviction (#3809 ) This patch adds a pageserver-global background loop that evicts layers in response to a shortage of available bytes in the $repo/tenants directory's filesystem. The loop runs periodically at a configurable `period`. Each loop iteration uses `statvfs` to determine filesystem-level space usage. It compares the returned usage data against two different types of thresholds. The iteration tries to evict layers until app-internal accounting says we should be below the thresholds. We cross-check this internal accounting with the real world by making another `statvfs` at the end of the iteration. We're good if that second statvfs shows that we're _actually_ below the configured thresholds. If we're still above one or more thresholds, we emit a warning log message, leaving it to the operator to investigate further. There are two thresholds: - `max_usage_pct` is the maximum relative usage, expressed in percent of the total filesystem space. If the actual usage is higher, the threshold is exceeded. - `min_avail_bytes` is the minimum absolute available space in bytes. If the actual available space is lower, the threshold is exceeded. The iteration evicts layers in LRU fashion with a reservation of up to `tenant_min_resident_size` bytes of the most recent layers per tenant. The layers not part of the per-tenant reservation are evicted least-recently-used first until we're below all thresholds. The `tenant_min_resident_size` can be overridden per tenant as `min_resident_size_override` (bytes). In addition to the loop, there is also an HTTP endpoint to perform one loop iteration synchronous to the request. The endpoint takes an absolute number of bytes that the iteration needs to evict before pressure is relieved. The tests use this endpoint, which is a great simplification over setting up loopback-mounts in the tests, which would be required to test the statvfs part of the implementation. 
We will rely on manual testing in staging to test the statvfs parts. The HTTP endpoint is also handy in emergencies where an operator wants the pageserver to evict a given amount of space _now_. Hence, its arguments are documented in openapi_spec.yml. The response type isn't documented though because we don't consider it stable. The endpoint should _not_ be used by Console but it could be used by on-call. Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Dmitry Rodionov <dmitry@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2023-03-31 14:47:57 +03:00
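The threshold check and the LRU selection with a per-tenant reservation from a64dd3ecb5 can be sketched as follows. This is an illustration under stated assumptions, not the pageserver's implementation: `Usage`, `Thresholds`, `LayerInfo`, `under_pressure`, and `pick_evictions` are hypothetical names, timestamps are plain integers, and "up to `tenant_min_resident_size`" is read as "keep the most recent layers while their total stays within the reservation".

```rust
use std::collections::BTreeMap;

/// Filesystem numbers as a `statvfs`-style call might report them.
struct Usage { total_bytes: u64, avail_bytes: u64 }

/// The two thresholds named in the commit message.
struct Thresholds { max_usage_pct: u64, min_avail_bytes: u64 }

/// True if either threshold is exceeded.
fn under_pressure(u: &Usage, t: &Thresholds) -> bool {
    let used_pct = (u.total_bytes - u.avail_bytes) * 100 / u.total_bytes;
    used_pct > t.max_usage_pct || u.avail_bytes < t.min_avail_bytes
}

/// Hypothetical per-layer bookkeeping; `last_access` is a logical timestamp.
#[derive(Clone, Debug, PartialEq)]
struct LayerInfo { tenant: u32, size: u64, last_access: u64 }

/// Pick layers to evict, least recently used first, after setting aside
/// up to `min_resident` bytes of the most recent layers of each tenant.
fn pick_evictions(layers: &[LayerInfo], min_resident: u64, bytes_needed: u64) -> Vec<LayerInfo> {
    let mut by_tenant: BTreeMap<u32, Vec<&LayerInfo>> = BTreeMap::new();
    for layer in layers {
        by_tenant.entry(layer.tenant).or_default().push(layer);
    }

    let mut candidates: Vec<&LayerInfo> = Vec::new();
    for tenant_layers in by_tenant.values_mut() {
        // Most recently accessed first, so the reservation covers them.
        tenant_layers.sort_by(|a, b| b.last_access.cmp(&a.last_access));
        let mut reserved = 0u64;
        let mut reserving = true;
        for layer in tenant_layers.iter() {
            if reserving && reserved + layer.size <= min_resident {
                reserved += layer.size;
            } else {
                reserving = false;
                candidates.push(*layer);
            }
        }
    }

    // Global LRU order across all tenants.
    candidates.sort_by_key(|layer| layer.last_access);
    let mut freed = 0u64;
    let mut evicted = Vec::new();
    for layer in candidates {
        if freed >= bytes_needed {
            break;
        }
        freed += layer.size;
        evicted.push(layer.clone());
    }
    evicted
}
```

In this sketch `bytes_needed` plays the role of the byte count that the HTTP endpoint accepts; the periodic loop would derive it from the gap between the `statvfs` result and the thresholds.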
Kirill Bulatov	dd22c87100	Remove older layer metadata format support code (#3854 ) The PR enforces the newest `index_part.json` format in the type system (version `1`), not allowing any of the previous forms that were used in the past. Similarly, the code to mitigate the https://github.com/neondatabase/neon/issues/3024 issue is now also removed. Current code does not produce old formats or extra files in the index_part.json; in the future, we will be able to use https://github.com/neondatabase/aversion or another approach to make version transitions more explicit. See https://neondb.slack.com/archives/C033RQ5SPDH/p1679134185248119 for the justification on the breaking changes.	2023-03-21 23:33:28 +02:00
Christian Schwarz	38022ff11c	gc: only decrement resident size if GC'd layer is resident Before this patch, GC would call PersistentLayer::delete() on every GC'ed layer. RemoteLayer::delete() returned Ok(()) unconditionally. GC would then proceed by decrementing the resident size metric, even though the layer is a RemoteLayer. This patch makes the following changes: - Rename PersistentLayer::delete() to delete_resident_layer_file(). That name is unambiguous. - Make RemoteLayer::delete_resident_layer_file return an Err(). We would have uncovered this bug if we had done that from the start. - Change GC / Timeline::delete_historic_layer check whether the layer is remote or not, and only call delete_resident_layer_file() if it's not remote. This brings us in line with how eviction does it. - Add a regression test. fixes https://github.com/neondatabase/neon/issues/3722	2023-03-03 12:10:24 +01:00
Christian Schwarz	175a577ad4	automatic layer eviction This patch adds a per-timeline periodic task that executes an eviction policy. The eviction policy is configurable per tenant. Two policies exist: - NoEviction (the default one) - LayerAccessThreshold The LayerAccessThreshold policy examines the last access timestamp per layer in the layer map and evicts the layer if that last access is further in the past than a configurable threshold value. This policy kind is evaluated periodically at a configurable period. It logs a summary statistic at `info!()` or `warn!()` level, depending on whether any evictions failed. This feature has no explicit killswitch since it's off by default.	2023-02-09 13:33:55 +01:00
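The two policies named in 175a577ad4 can be sketched as an enum plus one decision function. `EvictionPolicySketch` and `should_evict` are hypothetical names; the real policy type and its configuration fields may differ.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical mirror of the two policies named in the commit message.
enum EvictionPolicySketch {
    NoEviction,
    LayerAccessThreshold { threshold: Duration },
}

/// Evict a layer if its last access lies further in the past than the
/// configured threshold. A last access in the future (clock skew) never
/// triggers eviction in this sketch.
fn should_evict(policy: &EvictionPolicySketch, last_access: SystemTime, now: SystemTime) -> bool {
    match policy {
        EvictionPolicySketch::NoEviction => false,
        EvictionPolicySketch::LayerAccessThreshold { threshold } => {
            match now.duration_since(last_access) {
                Ok(age) => age > *threshold,
                Err(_) => false,
            }
        }
    }
}
```

A periodic task would call `should_evict` for every layer in the layer map on each tick and then log a summary of the outcome.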
Joonas Koivunen	1fdf01e3bc	fix: readable Debug for Layers (#3575 ) #3536 added the custom Debug implementations, but using the derived Debug on Key led to overly verbose output. Instead of making `Key`'s `Debug` unconditionally or conditionally do the `Display` variant (for table space'd keys), opted to build a newtype to provide `Debug` for `Range<Key>` via `Display`, which seemed to work unconditionally. Also orders Key to have: 1. comment, 2. derive, 3. `struct Key`.	2023-02-09 13:55:37 +02:00
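The newtype approach from 1fdf01e3bc can be sketched like this: a wrapper around `&Range<Key>` whose `Debug` impl delegates to `Key`'s `Display`. The `Key` here is a hypothetical one-field stand-in and `RangeDisplayDebug` is an illustrative name, so the formatting differs from the real key type.

```rust
use std::fmt;
use std::ops::Range;

/// Hypothetical, simplified key; the real `Key` has more fields.
struct Key(u32);

impl fmt::Display for Key {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:08X}", self.0)
    }
}

/// Newtype that gives `Range<Key>` a compact `Debug` output by
/// formatting both ends with `Display`.
struct RangeDisplayDebug<'a>(&'a Range<Key>);

impl fmt::Debug for RangeDisplayDebug<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}..{}", self.0.start, self.0.end)
    }
}
```

A manual `Debug` impl for a layer type can then pass `&RangeDisplayDebug(&self.key_range)` to `debug_struct().field(...)` without changing how `Key` itself is debug-printed elsewhere.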
Christian Schwarz	446a39e969	make LayerAccesStatFullDetails Copy Method to_api_model renamed to as_api_model because of Clippy complaint: https://rust-lang.github.io/rust-clippy/master/index.html#wrong_self_convention	2023-02-09 12:35:45 +01:00
Christian Schwarz	58fa4f0eb7	maintain access stats for historic layers This patch adds basic access statistics for historic layers and exposes them in the management API's `LayerMapInfo`. We record the accesses in the `{Delta,Image}Layer::load()` function because it's the common path of * page_service (`Timeline::get_reconstruct_data()`) * Compaction (`PersistentLayer::iter()` and `PersistentLayer::key_iter()`) The stats survive residence status changes, and record these as well. When scraping the layer map endpoint to record its evolution over time, one must account for stat resets because they are in-memory only and will reset on pageserver restart. Use the launch timestamp header added by (#3527) to identify pageserver restarts. This is PR https://github.com/neondatabase/neon/pull/3496	2023-02-06 17:01:38 +01:00
Joonas Koivunen	678fe0684f	std::fmt::Debug for Layer implementations (#3536 ) Follow-up to #3513. This removes the old blanket `std::fmt::Debug` impl on `dyn Layer` which did not seem to be used from anywhere (no compilation errors after removing). Adds `std::fmt::Debug` requirement and implementations for `trait Layer` implementors: - LayerDescriptor (derived) - RemoteLayer (manual) - DeltaLayer (manual) - ImageLayer (manual) Manual implementations are used to skip PageserverConf, tenant and timeline ids, large collections. Adds and adjusts some doc comments to be more rustdoc alike.	2023-02-06 14:21:51 +02:00
Joonas Koivunen	f2d89761c2	feat: LayerMap::replace (#3513 ) Cc: #3486 Adds a method to replace a particular layer from the LayerMap for the purposes of remote layer download and layer eviction. In those use cases read lock on layer map needs to be released after initial search, but other operations could modify layermap before replacing thread gets to run. Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>	2023-02-03 15:33:46 +02:00
Kirill Bulatov	2759f1a22e	Evict layers on demand (#3486 ) Closes https://github.com/neondatabase/neon/issues/3439 Adds a set of commands to manipulate the layer map: * dump the layer map contents * evict the layer from the layer map (remove the local file, put the remote layer instead in the layer map) * download the layer (operation, reversing the eviction) The commands will change later, when the statistics are added on top, so the swagger schema is not adjusted. The commands might have issues with a large number of layers: no pagination is done for the dump command, and the eviction and download commands look for the layer to evict/download by iterating all layers sequentially and comparing the layer names. For now, that seems to be tolerable ("big" number of layers is ~2_000) and further experiments are needed. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-02-02 12:14:44 +02:00
Christian Schwarz	f1aece1ba0	add RequestContext plumbing for layer access stats In preparation for #3496 plumb through RequestContext to the data access methods of `PersistentLayer`. This is PR https://github.com/neondatabase/neon/pull/3504	2023-02-01 15:29:01 +02:00
Konstantin Knizhnik	895f929bce	Add layer_map_analyzer tool (#3451 ) See #3348	2023-01-31 15:50:52 +02:00
bojanserafimov	a3d7ad2d52	Implement layer map using immutable BST (#2998 )	2023-01-20 16:10:12 -05:00
Heikki Linnakangas	c1731bc4f0	Push on-demand download into Timeline::get() function itself. This makes Timeline::get() async, and all functions that call it directly or indirectly with it. The with_ondemand_download() mechanism is gone, Timeline::get() now always downloads files, whether you want it or not. That is what all the current callers want, so even though this loses the capability to get a page only if it's already in the pageserver, without downloading, we were not using that capability. There were some places that used 'no_ondemand_download' in the WAL ingestion code that would error out if a layer file was not found locally, but those were dubious. We do actually want to on-demand download in all of those places. Per discussion at https://github.com/neondatabase/neon/pull/3233#issuecomment-1368032358	2023-01-12 11:53:10 +02:00
Kirill Bulatov	b77c33ee06	Move tenant-related modules below `tenant` module (#3190 ) No real code changes besides moving code around and adjusting the imports.	2022-12-23 15:40:37 +02:00
Heikki Linnakangas	7ff591ffbf	On-Demand Download The code in this change was extracted from #2595 (Heikki’s on-demand download draft PR). High-Level Changes - New RemoteLayer Type - On-Demand Download As An Effect Of Page Reconstruction - Breaking Semantics For Physical Size Metrics There are several follow-up work items planned. Refer to the Epic issue on GitHub: https://github.com/neondatabase/neon/issues/2029 closes https://github.com/neondatabase/neon/pull/3013 Co-authored-by: Kirill Bulatov <kirill@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech> New RemoteLayer Type ==================== Instead of downloading all layers during tenant attach, we create RemoteLayer instances for each of them and add them to the layer map. On-Demand Download As An Effect Of Page Reconstruction ====================================================== At the heart of pageserver is Timeline::get_reconstruct_data(). It traverses the layer map until it has collected all the data it needs to produce the page image. Most code in the code base uses it, through many layers of indirection. Before this patch, the function would use synchronous filesystem IO to load data from disk-resident layer files if the data was not cached. That is not possible with RemoteLayer, because the layer file has not been downloaded yet. So, we do the download when get_reconstruct_data gets there, i.e., “on demand”. The mechanics of how the download is done are rather involved, because of the infamous async-sync-async sandwich problem that plagues the async Rust world. We use the new PageReconstructResult type to work around this. Its introduction is the cause for a good amount of code churn in this patch. Refer to the block comment on `with_ondemand_download()` for details. 
Breaking Semantics For Physical Size Metrics ============================================ We rename prometheus metric pageserver_{current,resident}_physical_size to reflect what this metric actually represents with on-demand download. This intentionally BREAKS existing grafana dashboard and the cost model data pipeline. Breaking is desirable because the meaning of this metrics has changed with on-demand download. See https://docs.google.com/document/d/12AFpvKY-7FZdR5a4CaD6Ir_rI3QokdCLSPJ6upHxJBo/edit# for how we will handle this breakage. Likewise, we rename the new billing_metrics’s PhysicalSize => ResidentSize. This is not yet used anywhere, so, this is not a breaking change. There is still a field called TimelineInfo::current_physical_size. It is now the sum of the layer sizes in layer map, regardless of whether local or remote. To compute that sum, we added a new trait method PersistentLayer::file_size(). When updating the Python tests, we got rid of current_physical_size_non_incremental. An earlier commit removed it from the OpenAPI spec already, so this is not a breaking change. test_timeline_size.py has grown additional assertions on the resident_physical_size metric.	2022-12-21 19:16:39 +01:00
Christian Schwarz	1da03141a7	refactor: make Layer::local_path return Option<PathBuf> instead of PathBuf This is in preparation for RemoteLayer, which by definition doesn't have a local path.	2022-12-21 19:16:39 +01:00
Christian Schwarz	e94b451430	no-op: storage_layer::Iter::{iter, key_iter}: make them fallible	2022-12-21 19:16:39 +01:00
Christian Schwarz	f5b424b96c	no-op: type aliases for Layer::iter and Layer::key_iter return types Not needed by anything right now, but the next commit adds a `Result<>` around iter() and key_iter()'s return types, and that makes clippy complain.	2022-12-21 19:16:39 +01:00
Christian Schwarz	22ae67af8d	refactor: use new type LayerFileName when referring to layer file names in PathBuf/RemotePath (#3026 ) refactor: use new type LayerFileName when referring to layer file names in PathBuf/RemotePath Before this patch, we would sometimes carry around plain file names in `Path` types and/or awkwardly "rebase" paths to have a unified representation of the layer file name between local and remote. This patch introduces a new type `LayerFileName` which replaces the use of `Path` / `PathBuf` / `RemotePath` in the `storage_sync2` APIs. Instead of holding a string, it contains the parsed representation of the image and delta file name. When we need the file name, e.g., to construct a local path or remote object key, we construct the name ad-hoc. `LayerFileName` is also serde {Dese,Se}rializable, and in an initial version of this patch, it was supposed to be used directly inside `IndexPart`, replacing `RemotePath`. However, commit `3122f3282f` Ignore backup files (ones with .n.old suffix) in download_missing fixed handling of `.old` backup file names in IndexPart, and we need to carry that behavior forward. The solution is to remove `.old` backup files names during deserialization. When we re-serialize the IndexPart, the `*.old` file will be gone. This leaks the `.old` file in the remote storage, but makes it safe to clean it up later. There is additional churn by a preliminary refactoring that got squashed into this change: split off LayerMap's needs from trait Layer into super trait That refactoring renames `Layer` to `PersistentLayer` and splits off a subset of the functions into a super-trait called `Layer`. The super trait implements just the functions needed by `LayerMap`, whereas `PersistentLayer` adds the context of the pageserver. The naming is imperfect as some functions that reside in `PersistentLayer` have nothing persistence-specific to it. But it's a step in the right direction.	2022-12-13 01:27:59 +02:00
Kirill Bulatov	b8eb908a3d	Rename old project name references	2022-09-14 08:14:05 +03:00
Kirill Bulatov	1a8c8b04d7	Merge Repository and Tenant entities, rework tenant background jobs	2022-09-13 15:39:39 +03:00

38 Commits