
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-15 09:22:55 +00:00

Author	SHA1	Message	Date
John Spray	9cb255be97	Update pageserver/src/deletion_queue.rs Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-08-22 14:10:11 +01:00
John Spray	57a44dcc01	Update pageserver/src/deletion_queue.rs Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-08-22 14:10:06 +01:00
John Spray	1afc6337fb	Remove unused `num_inprogress_deletions`	2023-08-22 14:06:15 +01:00
John Spray	74058e196a	remote_storage: defensively handle 404 on deletions S3 implementations are _meant_ to return 200 on deleting a nonexistent object, but S3 is not a standard and some implementations have their own ideas.	2023-08-22 13:52:58 +01:00
John Spray	a116f6656f	deletion queue: more consistent use of backoff::retry	2023-08-22 13:38:31 +01:00
John Spray	2c7b97245a	tweak test_remote_storage_upload_queue_retries	2023-08-22 13:34:12 +01:00
John Spray	6efddbf526	flush tweaks	2023-08-22 13:17:57 +01:00
John Spray	7c4d79f4db	deletion queue: cancellable retries	2023-08-22 13:05:04 +01:00
John Spray	8c2ff87f1a	libs: make backoff::retry() take a cancellation token	2023-08-22 12:39:19 +01:00
John Spray	23fc247a03	remove redundant spans	2023-08-22 11:22:51 +01:00
John Spray	d8dc4425f8	Merge remote-tracking branch 'upstream/main' into jcsp/deletion-queue	2023-08-22 10:09:23 +01:00
John Spray	18159b7695	deletion queue: expose errors from push/flush	2023-08-22 10:01:10 +01:00
Conrad Ludgate	0b001a0001	proxy: remove connections on shutdown (#5051) ## Problem On shutdown, proxy connections are staying open. ## Summary of changes Remove the connections on shutdown	2023-08-21 19:20:58 +01:00
Felix Prasanna	4a8bd866f6	bump vm-builder version to v0.16.3 (#5055) This change to autoscaling allows agents to connect directly to the monitor, completely removing the informant.	2023-08-21 13:29:16 -04:00
John Spray	615a490239	pageserver: refactor Tenant/Timeline args into structs (#5053) ## Problem There are some common types that we pass into tenants and timelines as we construct them, such as remote storage and the broker client. Currently the list is small, but this is likely to grow -- the deletion queue PR (#4960) pushed some methods to the point of clippy complaining they had too many args, because of the extra deletion queue client being passed around. There are some shared objects that currently aren't passed around explicitly because they use a static `once_cell` (e.g. CONCURRENT_COMPACTIONS), but as we add more resource management and concurrency control over time, it will be more readable & testable to pass a type around in the respective Resources object, rather than to coordinate via static objects. The `Resources` structures in this PR will make it easier to add references to central coordination functions, without having to rely on statics. ## Summary of changes - For `Tenant`, the `broker_client` and `remote_storage` are bundled into `TenantSharedResources` - For `Timeline`, the `remote_client` is wrapped into `TimelineResources`. Both of these structures will get an additional deletion queue member in #4960.	2023-08-21 17:30:28 +01:00
John Spray	b95addddd5	pageserver: do not read redundant `timeline_layers` from IndexPart, so that we can remove it later (#4972) ## Problem IndexPart contains two redundant lists of layer names: a set of the names, and then a map of name to metadata. We already required that all the layers in `timeline_layers` are also in `layers_metadata`, in `initialize_with_current_remote_index_part`, so if there were any index_part.json files in the field that relied on these sets being different, they would already be broken. ## Summary of changes `timeline_layers` is made private and no longer read at runtime. It is still serialized, but not deserialized. `disk_consistent_lsn` is also made private, as this field only exists for convenience of humans reading the serialized JSON. This prepares us to entirely remove `timeline_layers` in a future release, once this change is fully deployed, and therefore no pageservers are trying to read the field.	2023-08-21 14:29:36 +03:00
Joonas Koivunen	130ccb4b67	Remove initial timeline id troubles (#5044) I made a mistake when adding `env.initial_timeline: Optional[TimelineId]` in #3839; I should have just generated it and used it to create a specific timeline. This PR fixes those mistakes, along with some extra calls into psql, which must be slower than Python field access.	2023-08-20 12:33:19 +03:00
Dmitry Rodionov	9140a950f4	Resume tenant deletion on attach (#5039) I'm still a bit nervous about the attach -> crash case, but it should work (unlike the case with timeline). Ideally it would be good to cover this with a test. This continues the tradition of adding bool flags for Tenant::set_stopping. The lifecycle project will probably help with fixing it.	2023-08-20 12:28:50 +03:00
Arpad Müller	a23b0773f1	Fix DeltaLayer dumping (#5045) ## Problem Before, DeltaLayer dumping (via `cargo run --release -p pagectl -- print-layer-file`) would crash as one can't call `Handle::block_on` in an async executor thread. ## Summary of changes Avoid the problem by using `DeltaLayerInner::load_keys` to load the keys into RAM (which we already do during compaction), and then load the values one by one during dumping.	2023-08-19 00:56:03 +02:00
Joonas Koivunen	368ee6c8ca	refactor: failpoint support (#5033) - move them to pageserver which is the only dependant on the crate fail - "move" the exported macro to the new module - support at init time the same failpoints as runtime Found while debugging test failures and making tests more repeatable by allowing "exit" from pageserver start via environment variables. Made those changes to `test_gc_cutoff.py`. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-08-19 01:01:44 +03:00
Felix Prasanna	5c6a692cf1	bump `VM_BUILDER_VERSION` to v0.16.2 (#5031) A very slight change that allows us to configure the UID of the neon-postgres cgroup owner. We start postgres in this cgroup so we can scale it with the cgroups v2 api. Currently, the control plane overwrites the entrypoint set by `vm-builder`, so `compute_ctl` (and thus postgres) is not started in the neon-postgres cgroup. Having `compute_ctl` start postgres in the cgroup should fix this. However, at the moment it appears that it does not have the correct permissions. Configuring the neon-postgres UID to `postgres` (which is the UID `compute_ctl` runs under) should hopefully fix this. See #4920 - the PR to modify `compute_ctl` to start postgres in the cgroup. See: neondatabase/autoscaling#480, neondatabase/autoscaling#477. Both of these PRs are part of an effort to increase `vm-builder`'s configurability and allow us to adjust it as we integrate in the monitor.	2023-08-18 14:29:20 -04:00
Dmitry Rodionov	30888a24d9	Avoid flakiness in test_timeline_delete_fail_before_local_delete (#5032) The problem was that timeline detail can return timelines that are not in the active state. If we're unlucky (test execution happened to be slower for some reason), timeline deletion can still be in progress by the time the request arrives. Reference for failed test run https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5022/5891420105/index.html#suites/f588e0a787c49e67b29490359c589fae/dab036e9bd673274 The error was `Exception: detail succeeded (it should return 404)` reported by @koivunej	2023-08-18 20:49:11 +03:00
Dmitry Rodionov	f6c671c140	resume timeline deletions on attach (#5030) closes [#5036](https://github.com/neondatabase/neon/issues/5036)	2023-08-18 20:48:33 +03:00
Christian Schwarz	ed5bce7cba	rfcs: archive my MVCC S3 Notion Proposal (#5040) This is a copy from the [original Notion page](https://www.notion.so/neondatabase/Proposal-Pageserver-MVCC-S3-Storage-8a424c0c7ec5459e89d3e3f00e87657c?pvs=4), taken on 2023-08-16. This is for archival mostly. The RFC that we're likely to go with is https://github.com/neondatabase/neon/pull/4919.	2023-08-18 19:34:29 +02:00
Christian Schwarz	7a63685cde	simplify page-caching of EphemeralFile (#4994) (This PR is the successor of https://github.com/neondatabase/neon/pull/4984 ) ## Summary The current way in which `EphemeralFile` uses `PageCache` complicates the Pageserver code base to a degree that isn't worth it. This PR refactors how we cache `EphemeralFile` contents, by exploiting the append-only nature of `EphemeralFile`. The result is that `PageCache` only holds `ImmutableFilePage` and `MaterializedPage`. These types of pages are read-only and evictable without write-back. This allows us to remove the writeback code from `PageCache`, also eliminating an entire failure mode. Further, many great open-source libraries exist to solve the problem of a read-only cache, much better than our `page_cache.rs` (e.g., better replacement policy, less global locking). With this PR, we can now explore using them. ## Problem & Analysis Before this PR, `PageCache` had three types of pages: * `ImmutableFilePage`: caches Delta / Image layer file contents * `MaterializedPage`: caches results of Timeline::get (page materialization) * `EphemeralPage`: caches `EphemeralFile` contents `EphemeralPage` is quite different from `ImmutableFilePage` and `MaterializedPage`: * Immutable and materialized pages are for the acceleration of (future) reads of the same data using `PAGE_CACHE_SIZE * PAGE_SIZE` bytes of DRAM. * Ephemeral pages are a write-back cache of `EphemeralFile` contents, i.e., if there is pressure in the page cache, we spill `EphemeralFile` contents to disk. `EphemeralFile` is only used by `InMemoryLayer`, for the following purposes: * write: when filling up the `InMemoryLayer`, via `impl BlobWriter for EphemeralFile` * read: when doing page reconstruction for a page@lsn that isn't written to disk * read: when writing L0 layer files, we re-read the `InMemoryLayer` and put the contents into the L0 delta writer (`create_delta_layer`).
This happens every 10min or when InMemoryLayer reaches 256MB in size. The access patterns of the `InMemoryLayer` use case are as follows: * write: via `BlobWriter`, strictly append-only * read for page reconstruction: via `BlobReader`, random * read for `create_delta_layer`: via `BlobReader`, dependent on data, but generally random. Why? * in classical LSM terms, this function is what writes the memory-resident `C0` tree into the disk-resident `C1` tree * in our system, though, the values of InMemoryLayer are stored in an EphemeralFile, and hence they are not guaranteed to be memory-resident * the function reads `Value`s in `Key, LSN` order, which is `!=` insert order What do these `EphemeralFile`-level access patterns mean for the page cache? * write: * the common case is that `Value` is a WAL record, and if it isn't a full-page-image WAL record, then it's smaller than `PAGE_SIZE` * So, the `EphemeralPage` pages act as a buffer for these `< PAGE_SIZE` sized writes. * If there's no page cache eviction between subsequent `InMemoryLayer::put_value` calls, the `EphemeralPage` is still resident, so the page cache avoids doing a `write` system call. * In practice, a busy page server will have page cache evictions because we only configure 64MB of page cache size. * reads for page reconstruction: read acceleration, just as for the other page types. * reads for `create_delta_layer`: * The `Value` reads happen through a `BlockCursor`, which optimizes the case of repeated reads from the same page. * So, the best case is that subsequent values are located on the same page; hence the `BlockCursor`'s buffer is maximally effective. * The worst case is that each `Value` is on a different page; hence the `BlockCursor`'s 1-page-sized buffer is ineffective. * The best case translates into `256MB/PAGE_SIZE` page cache accesses, one per page.
* the worst case translates into `#Values` page cache accesses * again, the page cache accesses must be assumed to be random because the `Value`s aren't accessed in insertion order but `Key, LSN` order. ## Summary of changes Preliminaries for this PR were: - #5003 - #5004 - #5005 - uncommitted microbenchmark in #5011 Based on the observations outlined above, this PR makes the following changes: * Rip out `EphemeralPage` from `page_cache.rs` * Move the `block_io::FileId` to `page_cache::FileId` * Add a `PAGE_SIZE`d buffer to the `EphemeralFile` struct. It's called `mutable_tail`. * Change `write_blob` to use `mutable_tail` for the write buffering instead of a page cache page. * if `mutable_tail` is full, it writes it out to disk, zeroes it out, and re-uses it. * There is explicitly no double-buffering, so that memory allocation per `EphemeralFile` instance is fixed. * Change `read_blob` to return different `BlockLease` variants depending on `blknum` * for the `blknum` that corresponds to the `mutable_tail`, return a ref to it * Rust borrowing rules prevent `write_blob` calls while refs are outstanding. * for all non-tail blocks, return a page-cached `ImmutablePage` * It is safe to page-cache these as ImmutablePage because EphemeralFile is append-only. ## Performance How do the changes above affect performance? My claim is: not significantly. * write path: * before this PR, the `EphemeralFile::write_blob` didn't issue its own `write` system calls. * If there were enough free pages, it didn't issue any `write` system calls. * If it had to evict other `EphemeralPage`s to get a page for its writes (`get_buf_for_write`), the page cache code would implicitly issue the writeback of victim pages as needed. * With this PR, `EphemeralFile::write_blob` always issues all of its own `write` system calls.
* Also, the writes are explicit instead of implicit through page cache write back, which will help #4743 * The perf impact of always doing the writes is the CPU overhead and syscall latency. * Before this PR, we might have never issued them if there were enough free pages. * We don't issue `fsync` and can expect the writes to only hit the kernel page cache. * There is also an advantage in issuing the writes directly: the perf impact is paid by the tenant that caused the writes, instead of whatever tenant evicts the `EphemeralPage`. * reads for page reconstruction: no impact. * The `write_blob` function pre-warms the page cache when it writes the `mutable_tail` to disk. * So, the behavior is the same as with the EphemeralPages before this PR. * reads for `create_delta_layer`: no impact. * Same argument as for page reconstruction. * Note for the future: * going through the page cache likely causes read amplification here. Why? * Due to the `Key,Lsn`-ordered access pattern, we don't read all the values in the page before moving to the next page. In the worst case, we might read the same page multiple times to read different `Values` from it. * So, it might be better to bypass the page cache here. * Idea drafts: * bypass PS page cache + prefetch pipeline + iovec-based IO * bypass PS page cache + use `copy_file_range` to copy from ephemeral file into the L0 delta file, without going through user space	2023-08-18 20:31:03 +03:00
Joonas Koivunen	0a082aee77	test: allow race with flush and stopped queue (#5027) A lucky race can happen with the shutdown order I guess right now. Seen in [test_tenant_delete_smoke]. The message is not the greatest to match against. [test_tenant_delete_smoke]: https://neon-github-public-dev.s3.amazonaws.com/reports/main/5892262320/index.html#suites/3556ed71f2d69272a7014df6dcb02317/189a0d1245fb5a8c	2023-08-18 19:36:25 +03:00
Arthur Petukhovsky	0b90411380	Fix safekeeper recovery with auth (#5035) Fix a missing password in walrcv_connect for safekeeper recovery. Add a test which restarts the endpoint and triggers a recovery.	2023-08-18 16:48:55 +01:00
Arpad Müller	f4da010aee	Make the compaction warning more tolerant (#5024) ## Problem The performance benchmark in `test_runner/performance/test_layer_map.py` is currently failing due to the warning added in #4888. ## Summary of changes The test mentioned has a `compaction_target_size` of 8192, which is just one page size. This is an unattainable goal, as we generate at least three pages: one for the header, one for the b-tree (minimally sized ones have just the root node in a single page), one for the data. Therefore, we add two pages to the warning limit. The warning text becomes a bit less accurate but I think this is okay.	2023-08-18 16:36:31 +02:00
John Spray	c1bc9c0f70	Various test fixes + tweaks to flushing	2023-08-18 12:44:35 +01:00
John Spray	2de5efa208	Fix broken wait_untils in test_remote_storage_upload_queue_retries	2023-08-18 12:44:35 +01:00
John Spray	d330eac4bc	clippy	2023-08-18 12:44:35 +01:00
John Spray	3ebceeda71	pageserver: refactor timeline args into TimelineResources This sidesteps clippy complaining about function arg counts, and will enable introducing more shared structures in future without the noise of adding extra args to all the functions involved in timeline setup.	2023-08-18 12:44:35 +01:00
John Spray	31729d6f4d	pageserver: refactor tenant args into a structure This way, when we add some new shared structure that the tenants need a reference to, we do not have to add it individually as an extra argument to the various functions.	2023-08-18 12:44:35 +01:00
John Spray	7e0e3517c1	clippy	2023-08-18 12:44:35 +01:00
John Spray	c4fc6e433d	tests: add e2e deletion queue recovery test	2023-08-18 12:44:35 +01:00
John Spray	c36cba28d6	pageserver: generalize flush API	2023-08-18 12:44:35 +01:00
John Spray	8eaa4015de	deletion queue: versions in keys	2023-08-18 12:44:35 +01:00
John Spray	10e927ee3e	Add encoding versions to deletion queue structs	2023-08-18 12:44:35 +01:00
John Spray	bb3a59f275	clippy	2023-08-18 12:44:35 +01:00
John Spray	a0ed43cc12	deletion queue: add DeletionHeader for sequence numbers	2023-08-18 12:44:35 +01:00
John Spray	99dc5a5c27	Deletion queue: implement recovery on startup	2023-08-18 12:44:35 +01:00
John Spray	54db1f5d8a	remote_storage: add a helper for downloading full objects This is only for use with small objects that we will deserialize in a non-streaming way. Also add a strip_prefix method to RemotePath.	2023-08-18 12:44:35 +01:00
John Spray	404b25e45f	Remove vestigial remote_timeline_client deletion paths	2023-08-18 12:44:35 +01:00
John Spray	f4dba9f907	tests: update tenant deletion tests for deletion queue	2023-08-18 12:44:35 +01:00
John Spray	4ec45bc7dc	tests: update tenant deletion tests for deletion queue	2023-08-18 12:44:35 +01:00
John Spray	a00d4a8d8c	tests: update test_remote_timeline_client_calls_started_metric for deletion queue	2023-08-18 12:44:35 +01:00
John Spray	43c9a09d8f	tests: update remote storage test for deletion queue	2023-08-18 12:44:35 +01:00
John Spray	3edd7ece40	deletion queue: improve frontend retry	2023-08-18 12:44:35 +01:00
John Spray	504fe9c2b0	pageserver: send timeline deletions through the deletion queue	2023-08-18 12:44:35 +01:00
John Spray	10df237a81	deletion queue: add push for generic objects (layers and garbage)	2023-08-18 12:44:35 +01:00
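
Commit 7a63685cde above describes buffering `EphemeralFile` writes in a fixed `PAGE_SIZE` buffer called `mutable_tail`, flushing it when full, and serving reads either from the tail or from immutable pages. A minimal sketch of that idea follows, assuming an in-memory `Vec` as a stand-in for the on-disk file and the page cache. `TailBufferedFile`, `flushed`, `write`, and `read_block` are hypothetical names chosen for illustration; this is not the pageserver's actual code.

```rust
// Hypothetical sketch; 8192 matches the page size mentioned in commit f4da010aee.
const PAGE_SIZE: usize = 8192;

/// Append-only file model. Full pages are moved to `flushed` (standing in
/// for blocks written to disk); the last, partially filled page stays in
/// `mutable_tail`, so memory use per instance is one page.
struct TailBufferedFile {
    flushed: Vec<[u8; PAGE_SIZE]>,
    mutable_tail: [u8; PAGE_SIZE],
    size: usize, // total bytes appended so far
}

/// What a block read hands back: an immutable page or a reference to the tail.
enum BlockLease<'a> {
    Immutable(&'a [u8; PAGE_SIZE]),
    MutableTail(&'a [u8; PAGE_SIZE]),
}

impl TailBufferedFile {
    fn new() -> Self {
        Self { flushed: Vec::new(), mutable_tail: [0u8; PAGE_SIZE], size: 0 }
    }

    /// Append `data`, filling the tail and flushing it each time it is full.
    fn write(&mut self, mut data: &[u8]) {
        while !data.is_empty() {
            let off = self.size % PAGE_SIZE;
            let n = (PAGE_SIZE - off).min(data.len());
            self.mutable_tail[off..off + n].copy_from_slice(&data[..n]);
            self.size += n;
            data = &data[n..];
            if self.size % PAGE_SIZE == 0 {
                // Tail is full: "write it out", zero it, and reuse it.
                self.flushed.push(self.mutable_tail);
                self.mutable_tail = [0u8; PAGE_SIZE];
            }
        }
    }

    /// Return the block at `blknum`. The shared borrow of `self` means the
    /// compiler rejects `write` calls while a lease is still alive.
    fn read_block(&self, blknum: usize) -> Option<BlockLease<'_>> {
        if blknum < self.flushed.len() {
            Some(BlockLease::Immutable(&self.flushed[blknum]))
        } else if blknum == self.flushed.len() {
            Some(BlockLease::MutableTail(&self.mutable_tail))
        } else {
            None
        }
    }
}
```

Because flushed pages are never modified again in this model, they could be cached without any write-back logic, which is the property the commit message relies on.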


3652 Commits
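
Commits 8c2ff87f1a, 7c4d79f4db, and 74058e196a in the list above mention cancellable retries and tolerating a 404 on deletion. The sketch below illustrates both ideas in synchronous, std-only Rust, assuming an `AtomicBool` as a stand-in for a cancellation token. `retry`, `RetryOutcome`, `DeleteError`, and `tolerant_delete` are hypothetical names; they do not reproduce the signature of the repository's `backoff::retry`, which is not shown in this listing.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type for a remote delete.
#[derive(Debug, PartialEq)]
enum DeleteError {
    NotFound,  // e.g. an HTTP 404 from the object store
    Transient, // anything worth retrying
}

#[derive(Debug, PartialEq)]
enum RetryOutcome<T, E> {
    Done(T),
    Cancelled,
    GaveUp(E),
}

/// Retry `op` with exponential backoff, checking the cancellation flag
/// before every attempt. Assumption: a plain AtomicBool models the token.
fn retry<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
    base_delay: Duration,
    cancel: &AtomicBool,
) -> RetryOutcome<T, E> {
    let mut attempt = 0;
    loop {
        if cancel.load(Ordering::Relaxed) {
            return RetryOutcome::Cancelled;
        }
        match op() {
            Ok(v) => return RetryOutcome::Done(v),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return RetryOutcome::GaveUp(e);
                }
                // Delay doubles per failed attempt; exponent capped at 10.
                sleep(base_delay * 2u32.pow(attempt.min(10)));
            }
        }
    }
}

/// Deleting an object that is already gone counts as success.
fn tolerant_delete(raw: Result<(), DeleteError>) -> Result<(), DeleteError> {
    match raw {
        Err(DeleteError::NotFound) => Ok(()),
        other => other,
    }
}
```

Mapping "not found" to success keeps the delete idempotent, so a retry that follows a partially successful attempt does not surface a spurious error.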