rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-07 06:00:38 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	a06c8e9add	inmemory_layer: drop impl Layer	2023-08-28 12:42:28 +03:00
Joonas Koivunen	18eefd61eb	inmemory_layer: less warnings	2023-08-28 12:42:28 +03:00
Joonas Koivunen	a2de0574b5	inmemorylayer: flush integration, partial?	2023-08-28 12:42:28 +03:00
Joonas Koivunen	aa8e954197	layer_manager, layer_access_stats: remove witness sidequest it was interfering with me moving responsibilities back and forth	2023-08-28 12:42:28 +03:00
Joonas Koivunen	30847e59b9	remove remote_layer instead of keep fixing	2023-08-28 12:42:28 +03:00
Joonas Koivunen	d930a581f8	layer_manager: compaction changes	2023-08-28 12:42:28 +03:00
Joonas Koivunen	931c22545b	compaction: clippies	2023-08-28 12:42:28 +03:00
Joonas Koivunen	366f3c8ff8	compaction: reflect LayerE now managing remote client	2023-08-28 12:42:28 +03:00
Joonas Koivunen	26c39d7b4c	timeline: pub(crate) compaction	2023-08-28 12:42:28 +03:00
Joonas Koivunen	6ffa5138ce	compaction: upload index on success otherwise test_gc_of_remote_layers fails	2023-08-28 12:42:28 +03:00
Joonas Koivunen	975e1558cc	compaction: integration	2023-08-28 12:42:28 +03:00
Joonas Koivunen	82a955ebfe	layer_manager: resident layer flush l0 changes	2023-08-28 12:42:28 +03:00
Joonas Koivunen	82596f8807	delta_layer: reflect non-async LayerE::for_written	2023-08-28 12:42:28 +03:00
Joonas Koivunen	e3e57579a1	integrate: download_all_layers this time around with graceful cancellation.	2023-08-28 12:42:28 +03:00
Joonas Koivunen	b4a0f8baf7	integrate: Timeline::get_value_reconstruct_data	2023-08-28 12:42:28 +03:00
Joonas Koivunen	2e686ed6ea	eviction: integration - evictiontask: remove unused imports - eviction_task and dube: cleanup - timeline: pub(crate) eviction - timeline: adjust to "layere: adjust eviction" - test: remove layer_eviction_aba_fails because it can no longer happen - test: fix up evicts later test with ability to await for eviction - eviction_task: more unused imports - eviction: clippy - eviction_task: more clippy - fixup eviction: docs - eviction: hold only Arc<Layer> after checking downloadedness - refactor earlier eviction: use drop_eviction_guard instead - dube: evict in spawned tasks - timeline, eviction: evict in spawned - eviction: add more errors - evict_layers: remove witness - eviction: post-witness forgotten panic - eviction: remove blog references	2023-08-28 12:42:28 +03:00
Joonas Koivunen	c46f72d411	create_image_layers integration (multifile)	2023-08-28 12:42:28 +03:00
Joonas Koivunen	ed46713e5c	remote_timeline_client: pub(crate) on upload	2023-08-28 12:42:28 +03:00
Joonas Koivunen	c228cb7b3f	remote_timeline_client: continued integration work post ResidentLayer	2023-08-28 12:42:28 +03:00
Joonas Koivunen	c88e4a0974	layer_manager: integration	2023-08-28 12:42:28 +03:00
Joonas Koivunen	b71a2f4cb2	load_layer_map: integration	2023-08-28 12:42:23 +03:00
Joonas Koivunen	235a8cbd28	wip: LayerE	2023-08-28 12:41:31 +03:00
Joonas Koivunen	f36658ac10	layerdesc: add from_filename	2023-08-28 12:41:31 +03:00
Joonas Koivunen	f8227da9da	unrelated: layer_manager: avoid arc cloning	2023-08-28 12:41:31 +03:00
Joonas Koivunen	fbcd174489	load_layer_map: schedule deletions for any future layers (#5103 ) Unrelated fixes noticed while integrating #4938. - Stop leaking future layers in remote storage - We schedule extra index_part uploads if layer name to be removed was not actually present	2023-08-28 10:51:49 +03:00
John Spray	b758bf47ca	pageserver: refactor TimelineMetadata serialization in IndexPart (#5091 ) ## Problem The `metadata_bytes` field of IndexPart required explicit deserialization & error checking everywhere it was used -- there isn't anything special about this structure that should prevent it from being serialized & deserialized along with the rest of the structure. ## Summary of changes - Implement Serialize and Deserialize for TimelineMetadata - Replace IndexPart::metadata_bytes with a simpler `metadata`, that can be used directly. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-08-25 16:16:20 +01:00
Arpad Müller	8c13296add	Remove `BlockReader::read_blk` in favour of `BlockCursor` (#5015 ) ## Problem We want to make `read_blk` an async function, but outside of `async_trait`, which allocates, and nightly features, we can't use async fn's in traits. ## Summary of changes * Remove all uses of `BlockReader::read_blk` in favour of using block cursors, at least where the type of the `BlockReader` is behind a generic * Introduce a `BlockReaderRef` enum that lists all implementors of `BlockReader::read_blk`. * Remove `BlockReader::read_blk` and move its implementations into inherent functions on the types instead. We don't turn `read_blk` into an async fn yet, for that we also need to modify the page cache. So this is a preparatory PR, albeit an important one. Part of #4743.	2023-08-25 12:28:01 +02:00
Arpad Müller	227c87e333	Make EphemeralFile::write_blob function async (#5056 ) ## Problem The `EphemeralFile::write_blob` function accesses the page cache internally. We want to require `async` for these accesses in #5023. ## Summary of changes This removes the implementation of the `BlobWriter` trait for `EphemeralFile` and turns the `write_blob` function into an inherent function. We can then make it async as well as the `push_bytes` function. We move the `SER_BUFFER` thread-local into the `InMemoryLayerInner` so that the same buffer can be accessed by different threads as the async is (potentially) moved between threads. Part of #4743, preparation for #5023.	2023-08-24 19:18:30 +02:00
Chengpeng Yan	fa74d5649e	rename `EphemeralFile::size` to `EphemeralFile::len` (#5076 ) ## Problem close https://github.com/neondatabase/neon/issues/5034 ## Summary of changes Based on the [comment](https://github.com/neondatabase/neon/pull/4994#discussion_r1297277922). Just rename the `EphemeralFile::size` to `EphemeralFile::len`.	2023-08-24 16:41:57 +02:00
Joonas Koivunen	f70871dfd0	internal-devx: pageserver future layers (#5092 ) I've personally forgotten why/how can we have future layers during reconciliation. Adds `#[cfg(feature = "testing")]` logging when we upload such index_part.json, with a cross reference to where the cleanup happens. Latest private slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1692879032573809?thread_ts=1692792276.173979&cid=C033RQ5SPDH Builds upon #5074. Should have been considered on #4837.	2023-08-24 17:22:36 +03:00
Joonas Koivunen	76aa01c90f	refactor: single phase Timeline::load_layer_map (#5074 ) Current implementation first calls `load_layer_map`, which loads all local layers, cleans up files, leaves cleaning up stuff to "second function". Then the "second function" is finally called, it does not do the cleanup and some of the first function's setup can be torn down. "Second function" is actually both `reconcile_with_remote` and `create_remote_layers`. This change makes it a bit more verbose but in one phase with the following sub-steps: 1. scan the timeline directory 2. delete extra files - now including on-demand download files - fixes #3660 3. reconcile the two sources of layers (directory, index_part) 4. rename_to_backup future layers, short layers 5. create the remaining as layers Needed by #4938. It was also noticed that this is blocking code in an `async fn` so just do it in a `spawn_blocking`, which should be healthy for our startup times. Other effects include hopefully halving of `stat` calls; extra calls which were not done previously are now done for the future layers. Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: John Spray <john@neon.tech>	2023-08-24 16:07:40 +03:00
John Spray	3e2f0ffb11	libs: make backoff::retry() take a cancellation token (#5065 ) ## Problem Currently, anything that uses backoff::retry will delay the join of its task by however long its backoff sleep is, multiplied by its max retries. Whenever we call a function that sleeps, we should be passing in a CancellationToken. ## Summary of changes - Add a `Cancel` type to backoff::retry that wraps a CancellationToken and an error `Fn` to generate an error if the cancellation token fires. - In call sites that already run in a `task_mgr` task, use `shutdown_token()` to provide the token. In other locations, use a dead `CancellationToken` to satisfy the interface, and leave a TODO to fix it up when we broaden the use of explicit cancellation tokens.	2023-08-24 14:54:46 +03:00
Joonas Koivunen	ad8d777c1c	refactor: remove is_incremental=true for ImageLayers footgun (#5061 ) Accidentally giving is_incremental=true for ImageLayers costs a lot of debugging time. Removes all API which would allow doing that. They can easily be restored later when needed. Split off from #4938.	2023-08-22 22:12:05 +03:00
Joonas Koivunen	533a92636c	refactor: pre-cleanup Layer, PersistentLayer and impls (#5059 ) Remove pub but dead code, move trait methods as inherent methods, remove unnecessary. Split off from #4938.	2023-08-22 21:14:28 +03:00
Christian Schwarz	8cd20485f8	metrics: smgr query time: add a pre-aggregated histogram (#5064 ) When doing global queries in VictoriaMetrics, the per-timeline histograms make us run into cardinality limits. We don't want to give them up just yet because we don't have an alternative for drilling down on timeline-specific performance issues. So, add a pre-aggregated histogram and add observations to it whenever we add observations to the per-timeline histogram. While we're at it, switch to using a strummed enum for the operation type names.	2023-08-22 20:08:31 +03:00
Joonas Koivunen	933a869f00	refactor: compaction becomes async again (#5058 ) #4938 will make on-demand download of layers in compaction possible, so it's not suitable for our "policy" of no `spawn_blocking(|| ... Handle::block_on(async { spawn_blocking(...).await })` because this poses a clear deadlock risk. Nested spawn_blockings are because of the download using `tokio::fs::File`. - Remove `spawn_blocking` from caller of `compact_level0_phase1` - Remove `Handle::block_on` from `compact_level0_phase1` (indentation change) - Revert to `AsLayerDesc::layer_desc` usage temporarily (until it becomes field access in #4938)	2023-08-22 20:03:14 +03:00
John Spray	615a490239	pageserver: refactor Tenant/Timeline args into structs (#5053 ) ## Problem There are some common types that we pass into tenants and timelines as we construct them, such as remote storage and the broker client. Currently the list is small, but this is likely to grow -- the deletion queue PR (#4960) pushed some methods to the point of clippy complaining they had too many args, because of the extra deletion queue client being passed around. There are some shared objects that currently aren't passed around explicitly because they use a static `once_cell` (e.g. CONCURRENT_COMPACTIONS), but as we add more resource management and concurrency control over time, it will be more readable & testable to pass a type around in the respective Resources object, rather than to coordinate via static objects. The `Resources` structures in this PR will make it easier to add references to central coordination functions, without having to rely on statics. ## Summary of changes - For `Tenant`, the `broker_client` and `remote_storage` are bundled into `TenantSharedResources` - For `Timeline`, the `remote_client` is wrapped into `TimelineResources`. Both of these structures will get an additional deletion queue member in #4960.	2023-08-21 17:30:28 +01:00
John Spray	b95addddd5	pageserver: do not read redundant `timeline_layers` from IndexPart, so that we can remove it later (#4972 ) ## Problem IndexPart contains two redundant lists of layer names: a set of the names, and then a map of name to metadata. We already required that all the layers in `timeline_layers` are also in `layers_metadata`, in `initialize_with_current_remote_index_part`, so if there were any index_part.json files in the field that relied on these sets being different, they would already be broken. ## Summary of changes `timeline_layers` is made private and no longer read at runtime. It is still serialized, but not deserialized. `disk_consistent_lsn` is also made private, as this field only exists for convenience of humans reading the serialized JSON. This prepares us to entirely remove `timeline_layers` in a future release, once this change is fully deployed, and therefore no pageservers are trying to read the field.	2023-08-21 14:29:36 +03:00
Dmitry Rodionov	9140a950f4	Resume tenant deletion on attach (#5039 ) I'm still a bit nervous about attach -> crash case. But it should work. (unlike case with timeline). Ideally would be cool to cover this with test. This continues tradition of adding bool flags for Tenant::set_stopping. Probably lifecycle project will help with fixing it.	2023-08-20 12:28:50 +03:00
Arpad Müller	a23b0773f1	Fix DeltaLayer dumping (#5045 ) ## Problem Before, DeltaLayer dumping (via `cargo run --release -p pagectl -- print-layer-file` ) would crash as one can't call `Handle::block_on` in an async executor thread. ## Summary of changes Avoid the problem by using `DeltaLayerInner::load_keys` to load the keys into RAM (which we already do during compaction), and then load the values one by one during dumping.	2023-08-19 00:56:03 +02:00
Joonas Koivunen	368ee6c8ca	refactor: failpoint support (#5033 ) - move them to pageserver which is the only dependant on the crate fail - "move" the exported macro to the new module - support at init time the same failpoints as runtime Found while debugging test failures and making tests more repeatable by allowing "exit" from pageserver start via environment variables. Made those changes to `test_gc_cutoff.py`. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-08-19 01:01:44 +03:00
Dmitry Rodionov	30888a24d9	Avoid flakiness in test_timeline_delete_fail_before_local_delete (#5032 ) The problem was that timeline detail can return timelines in not only active state. And by the time request comes timeline deletion can still be in progress if we're unlucky (test execution happened to be slower for some reason) Reference for failed test run https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5022/5891420105/index.html#suites/f588e0a787c49e67b29490359c589fae/dab036e9bd673274 The error was `Exception: detail succeeded (it should return 404)` reported by @koivunej	2023-08-18 20:49:11 +03:00
Dmitry Rodionov	f6c671c140	resume timeline deletions on attach (#5030 ) closes [#5036](https://github.com/neondatabase/neon/issues/5036)	2023-08-18 20:48:33 +03:00
Christian Schwarz	7a63685cde	simplify page-caching of EphemeralFile (#4994 ) (This PR is the successor of https://github.com/neondatabase/neon/pull/4984 ) ## Summary The current way in which `EphemeralFile` uses `PageCache` complicates the Pageserver code base to a degree that isn't worth it. This PR refactors how we cache `EphemeralFile` contents, by exploiting the append-only nature of `EphemeralFile`. The result is that `PageCache` only holds `ImmutableFilePage` and `MaterializedPage`. These types of pages are read-only and evictable without write-back. This allows us to remove the writeback code from `PageCache`, also eliminating an entire failure mode. Further, many great open-source libraries exist to solve the problem of a read-only cache, much better than our `page_cache.rs` (e.g., better replacement policy, less global locking). With this PR, we can now explore using them. ## Problem & Analysis Before this PR, `PageCache` had three types of pages: * `ImmutableFilePage`: caches Delta / Image layer file contents * `MaterializedPage`: caches results of Timeline::get (page materialization) * `EphemeralPage`: caches `EphemeralFile` contents `EphemeralPage` is quite different from `ImmutableFilePage` and `MaterializedPage`: * Immutable and materialized pages are for the acceleration of (future) reads of the same data using `PAGE_CACHE_SIZE * PAGE_SIZE` bytes of DRAM. * Ephemeral pages are a write-back cache of `EphemeralFile` contents, i.e., if there is pressure in the page cache, we spill `EphemeralFile` contents to disk. `EphemeralFile` is only used by `InMemoryLayer`, for the following purposes: * write: when filling up the `InMemoryLayer`, via `impl BlobWriter for EphemeralFile` * read: when doing page reconstruction for a page@lsn that isn't written to disk * read: when writing L0 layer files, we re-read the `InMemoryLayer` and put the contents into the L0 delta writer (`create_delta_layer`). 
This happens every 10min or when InMemoryLayer reaches 256MB in size. The access patterns of the `InMemoryLayer` use case are as follows: * write: via `BlobWriter`, strictly append-only * read for page reconstruction: via `BlobReader`, random * read for `create_delta_layer`: via `BlobReader`, dependent on data, but generally random. Why? * in classical LSM terms, this function is what writes the memory-resident `C0` tree into the disk-resident `C1` tree * in our system, though, the values of InMemoryLayer are stored in an EphemeralFile, and hence they are not guaranteed to be memory-resident * the function reads `Value`s in `Key, LSN` order, which is `!=` insert order What do these `EphemeralFile`-level access patterns mean for the page cache? * write: * the common case is that `Value` is a WAL record, and if it isn't a full-page-image WAL record, then it's smaller than `PAGE_SIZE` * So, the `EphemeralPage` pages act as a buffer for these `< PAGE_SIZE` sized writes. * If there's no page cache eviction between subsequent `InMemoryLayer::put_value` calls, the `EphemeralPage` is still resident, so the page cache avoids doing a `write` system call. * In practice, a busy page server will have page cache evictions because we only configure 64MB of page cache size. * reads for page reconstruction: read acceleration, just as for the other page types. * reads for `create_delta_layer`: * The `Value` reads happen through a `BlockCursor`, which optimizes the case of repeated reads from the same page. * So, the best case is that subsequent values are located on the same page; hence the `BlockCursor`'s buffer is maximally effective. * The worst case is that each `Value` is on a different page; hence the `BlockCursor`'s 1-page-sized buffer is ineffective. * The best case translates into `256MB/PAGE_SIZE` page cache accesses, one per page. 
* the worst case translates into `#Values` page cache accesses * again, the page cache accesses must be assumed to be random because the `Value`s aren't accessed in insertion order but `Key, LSN` order. ## Summary of changes Preliminaries for this PR were: - #5003 - #5004 - #5005 - uncommitted microbenchmark in #5011 Based on the observations outlined above, this PR makes the following changes: * Rip out `EphemeralPage` from `page_cache.rs` * Move the `block_io::FileId` to `page_cache::FileId` * Add a `PAGE_SIZE`d buffer to the `EphemeralFile` struct. It's called `mutable_tail`. * Change `write_blob` to use `mutable_tail` for the write buffering instead of a page cache page. * if `mutable_tail` is full, it writes it out to disk, zeroes it out, and re-uses it. * There is explicitly no double-buffering, so that memory allocation per `EphemeralFile` instance is fixed. * Change `read_blob` to return different `BlockLease` variants depending on `blknum` * for the `blknum` that corresponds to the `mutable_tail`, return a ref to it * Rust borrowing rules prevent `write_blob` calls while refs are outstanding. * for all non-tail blocks, return a page-cached `ImmutablePage` * It is safe to page-cache these as ImmutablePage because EphemeralFile is append-only. ## Performance How do the changes above affect performance? My claim is: not significantly. * write path: * before this PR, the `EphemeralFile::write_blob` didn't issue its own `write` system calls. * If there were enough free pages, it didn't issue any `write` system calls. * If it had to evict other `EphemeralPage`s to get a page for its writes (`get_buf_for_write`), the page cache code would implicitly issue the writeback of victim pages as needed. * With this PR, `EphemeralFile::write_blob` always issues all of its own `write` system calls. 
* Also, the writes are explicit instead of implicit through page cache write back, which will help #4743 * The perf impact of always doing the writes is the CPU overhead and syscall latency. * Before this PR, we might have never issued them if there were enough free pages. * We don't issue `fsync` and can expect the writes to only hit the kernel page cache. * There is also an advantage in issuing the writes directly: the perf impact is paid by the tenant that caused the writes, instead of whatever tenant evicts the `EphemeralPage`. * reads for page reconstruction: no impact. * The `write_blob` function pre-warms the page cache when it writes the `mutable_tail` to disk. * So, the behavior is the same as with the EphemeralPages before this PR. * reads for `create_delta_layer`: no impact. * Same argument as for page reconstruction. * Note for the future: * going through the page cache likely causes read amplification here. Why? * Due to the `Key,Lsn`-ordered access pattern, we don't read all the values in the page before moving to the next page. In the worst case, we might read the same page multiple times to read different `Values` from it. * So, it might be better to bypass the page cache here. * Idea drafts: * bypass PS page cache + prefetch pipeline + iovec-based IO * bypass PS page cache + use `copy_file_range` to copy from ephemeral file into the L0 delta file, without going through user space	2023-08-18 20:31:03 +03:00
Arpad Müller	f4da010aee	Make the compaction warning more tolerant (#5024 ) ## Problem The performance benchmark in `test_runner/performance/test_layer_map.py` is currently failing due to the warning added in #4888. ## Summary of changes The test mentioned has a `compaction_target_size` of 8192, which is just one page size. This is an unattainable goal, as we generate at least three pages: one for the header, one for the b-tree (minimally sized ones have just the root node in a single page), one for the data. Therefore, we add two pages to the warning limit. The warning text becomes a bit less accurate but I think this is okay.	2023-08-18 16:36:31 +02:00
Joonas Koivunen	67af24191e	test: cleanup remote_timeline_client tests (#5013 ) I will have to change these as I change remote_timeline_client api in #4938. So a bit of cleanup, handle my comments which were just resolved during initial review. Cleanup: - use unwrap in tests instead of mixed `?` and `unwrap` - use `Handle` instead of `&'static Reactor` to make the RemoteTimelineClient more natural - use arrays in tests - use plain `#[tokio::test]`	2023-08-17 19:27:30 +03:00
Joonas Koivunen	6af5f9bfe0	fix: format context (#5022 ) We return an error with unformatted `{timeline_id}`.	2023-08-17 14:30:25 +00:00
Dmitry Rodionov	d8b0a298b7	Do not attach deleted tenants (#5008 ) Rather temporary solution before proper: https://github.com/neondatabase/neon/issues/5006 It requires more plumbing so let's not attach deleted tenants first and then implement resume. Additionally fix `assert_prefix_empty`. It had a buggy prefix calculation, and since we always asserted for absence of stuff it worked. Here I started to assert for presence of stuff too and it failed. Added more "presence" asserts to other places to be confident that it works. Resolves [#5016](https://github.com/neondatabase/neon/issues/5016)	2023-08-17 13:46:49 +03:00
Christian Schwarz	957af049c2	ephemeral file: refactor write_blob impl to concentrate mutable state (#5004 ) Before this patch, we had the `off` and `blknum` as function-wide mutable state. Now it's contained in the `Writer` struct. The use of `push_bytes` instead of index-based filling of the buffer also makes it easier to reason about what's going on. This is prep for https://github.com/neondatabase/neon/pull/4994	2023-08-17 13:07:25 +03:00
Joonas Koivunen	d3612ce266	delta_layer: Restore generic from last week (#5014 ) Restores #4937 work relating to the ability to use `ResidentDeltaLayer` (which is an Arc wrapper) in #4938 for the ValueRef's by removing the borrow from `ValueRef` and providing it from an upper layer. This should not have any functional changes, most importantly, the `main` will continue to use the borrowed `DeltaLayerInner`. It might be that I can change #4938 to be like this. If that is so, I'll gladly rip out the `Ref` and move the borrow back. But I'll first want to look at the current test failures.	2023-08-17 11:47:31 +03:00
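
The write buffering described in commit 7a63685cde (#4994) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pageserver's actual code: `TailBufferedFile`, the `sink`, `tail_len` and `flushed_pages` fields, and the generic `Write` sink are hypothetical stand-ins; only the `mutable_tail` and `push_bytes` names come from the commit messages, and `PAGE_SIZE` is assumed to be 8192 as mentioned in f4da010aee.

```rust
use std::io::{self, Write};

/// Assumed page size (f4da010aee mentions 8192 as "one page size").
const PAGE_SIZE: usize = 8192;

/// Hypothetical append-only file with a single fixed-size write buffer.
/// There is no double-buffering, so memory use per instance is fixed.
struct TailBufferedFile<W: Write> {
    sink: W,                        // stand-in for the on-disk file
    mutable_tail: [u8; PAGE_SIZE],  // the only page that can still change
    tail_len: usize,                // bytes currently used in `mutable_tail`
    flushed_pages: u64,             // full pages already written to `sink`
}

impl<W: Write> TailBufferedFile<W> {
    fn new(sink: W) -> Self {
        Self { sink, mutable_tail: [0u8; PAGE_SIZE], tail_len: 0, flushed_pages: 0 }
    }

    /// Logical length: flushed pages plus the bytes buffered in the tail.
    fn len(&self) -> u64 {
        self.flushed_pages * PAGE_SIZE as u64 + self.tail_len as u64
    }

    /// Appends `src` and returns the offset at which it starts.
    /// Whenever the tail fills up, it is written out, zeroed and re-used.
    fn push_bytes(&mut self, mut src: &[u8]) -> io::Result<u64> {
        let start = self.len();
        while !src.is_empty() {
            let free = PAGE_SIZE - self.tail_len;
            let n = free.min(src.len());
            self.mutable_tail[self.tail_len..self.tail_len + n].copy_from_slice(&src[..n]);
            self.tail_len += n;
            src = &src[n..];
            if self.tail_len == PAGE_SIZE {
                // Explicit write, paid for by the caller that produced the data.
                self.sink.write_all(&self.mutable_tail)?;
                self.mutable_tail.fill(0);
                self.tail_len = 0;
                self.flushed_pages += 1;
            }
        }
        Ok(start)
    }
}
```

Because flushed pages never change again in this sketch, a reader could cache them as immutable pages and only needs special handling for the block that still lives in `mutable_tail`, which is the property the commit relies on.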

1512 Commits