rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-06-01 20:40:37 +00:00

Author	SHA1	Message	Date
John Spray	e0821e1eab	pageserver: refined Timeline shutdown (#5833 ) ## Problem We have observed the shutdown of a timeline taking a long time when a deletion arrives at a busy time for the system. This suggests that we are not respecting cancellation tokens promptly enough. ## Summary of changes - Refactor timeline shutdown so that rather than having a shutdown() function that takes a flag for optionally flushing, there are two distinct functions, one for graceful flushing shutdown, and another that does the "normal" shutdown where we're just setting a cancellation token and then tearing down as fast as we can. This makes things a bit easier to reason about, and enables us to remove the hand-written variant of shutdown that was maintained in `delete.rs` - Layer flush task checks cancellation token more carefully - Logical size calculation's handling of cancellation tokens is simplified: rather than passing one in, it respects the Timeline's cancellation token. This PR doesn't touch RemoteTimelineClient, which will be a key thing to fix as well, so that a slow remote storage op doesn't hold up shutdown.	2023-11-09 16:02:59 +00:00
bojanserafimov	4469b1a62c	Fix blob_io test (#5818 )	2023-11-09 10:47:03 -05:00
John Spray	40441f8ada	pageserver: use `Gate` for stronger safety check in `SlotGuard` (#5793 ) ## Problem #5711 and #5367 raced -- the `SlotGuard` type needs `Gate` to properly enforce its invariant that we may not drop an `Arc<Tenant>` from a slot. ## Summary of changes Replace the TODO with the intended check of Gate.	2023-11-08 13:00:11 +00:00
Shany Pozin	0ac4cf67a6	Use self.tenants instead of TENANTS (#5811 )	2023-11-07 11:38:02 +00:00
Joonas Koivunen	4be6bc7251	refactor: remove unnecessary unsafe (#5802 ) unsafe impls for `Send` and `Sync` should not be added by default. in the case of `SlotGuard` removing them does not cause any issues, as the compiler automatically derives those. This PR adds requirement to document the unsafety (see [clippy::undocumented_unsafe_blocks]) and opportunistically adds `#![deny(unsafe_code)]` to most places where we don't have unsafe code right now. TRPL on Send and Sync: https://doc.rust-lang.org/book/ch16-04-extensible-concurrency-sync-and-send.html [clippy::undocumented_unsafe_blocks]: https://rust-lang.github.io/rust-clippy/master/#/undocumented_unsafe_blocks	2023-11-07 10:26:25 +00:00
John Spray	c00651ff9b	pageserver: start refactoring into TenantManager (#5797 ) ## Problem See: https://github.com/neondatabase/neon/issues/5796 ## Summary of changes Completing the refactor is quite verbose and can be done in stages: each interface that is currently called directly from a top-level mgr.rs function can be moved into TenantManager once the relevant subsystems have access to it. Landing the initial change to create of TenantManager is useful because it enables new code to use it without having to be altered later, and sets us up to incrementally fix the existing code to use an explicit Arc<TenantManager> instead of relying on the static TENANTS.	2023-11-07 09:06:53 +00:00
John Spray	85cd97af61	pageserver: add `InProgress` tenant map state, use a sync lock for the map (#5367 ) ## Problem Follows on from #5299 - We didn't have a generic way to protect a tenant undergoing changes: `Tenant` had states, but for our arbitrary transitions between secondary/attached, we need a general way to say "reserve this tenant ID, and don't allow any other ops on it, but don't try and report it as being in any particular state". - The TenantsMap structure was behind an async RwLock, but it was never correct to hold it across await points: that would block any other changes for all tenants. ## Summary of changes - Add the `TenantSlot::InProgress` value. This means: - Incoming administrative operations on the tenant should retry later - Anything trying to read the live state of the tenant (e.g. a page service reader) should retry later or block. - Store TenantsMap in `std::sync::RwLock` - Provide an extended `get_active_tenant_with_timeout` for page_service to use, which will wait on InProgress slots as well as non-active tenants. Closes: https://github.com/neondatabase/neon/issues/5378 --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-11-06 14:03:22 +00:00
bojanserafimov	dc72567288	Layer flush minor speedup (#5765 ) Convert keys to `i128` before sorting	2023-11-06 08:58:20 -05:00
John Spray	6defa2b5d5	pageserver: add `Gate` as a partner to CancellationToken for safe shutdown of `Tenant` & `Timeline` (#5711 ) ## Problem When shutting down a Tenant, it isn't just important to cause any background tasks to stop. It's also important to wait until they have stopped before declaring shutdown complete, in cases where we may re-use the tenant's local storage for something else, such as running in secondary mode, or creating a new tenant with the same ID. ## Summary of changes A `Gate` class is added, inspired by [seastar::gate](https://docs.seastar.io/master/classseastar_1_1gate.html). For types that have an important lifetime that corresponds to some physical resource, use of a Gate as well as a CancellationToken provides a robust pattern for async requests & shutdown: - Requests must always acquire the gate as long as they are using the object - Shutdown must set the cancellation token, and then `close()` the gate to wait for requests in progress before returning. This is not for memory safety: it's for expressing the difference between "Arc<Tenant> exists", and "This tenant's files on disk are eligible to be read/written". - Both Tenant and Timeline get a Gate & CancellationToken. - The Timeline gate is held during eviction of layers, and during page_service requests. - Existing cancellation support in page_service is refined to use the timeline-scope cancellation token instead of a process-scope cancellation token. This replaces the use of `task_mgr::associate_with`: tasks no longer change their tenant/timelineidentity after being spawned. The Tenant's Gate is not yet used, but will be important for Tenant-scoped operations in secondary mode, where we must ensure that our secondary-mode downloads for a tenant are gated wrt the activity of an attached Tenant. This is part of a broader move away from using the global-state driven `task_mgr` shutdown tokens: - less global state where we rely on implicit knowledge of what task a given function is running in, and more explicit references to the cancellation token that a particular function/type will respect, making shutdown easier to reason about. - eventually avoid the big global TASKS mutex. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-06 12:39:20 +00:00
duguorong009	b3d3a2587d	feat: improve the serde impl for several types(`Lsn`, `TenantId`, `TimelineId` ...) (#5335 ) Improve the serde impl for several types (`Lsn`, `TenantId`, `TimelineId`) by making them sensitive to `Serializer::is_human_readadable` (true for json, false for bincode). Fixes #3511 by: - Implement the custom serde for `Lsn` - Implement the custom serde for `Id` - Add the helper module `serde_as_u64` in `libs/utils/src/lsn.rs` - Remove the unnecessary attr `#[serde_as(as = "DisplayFromStr")]` in all possible structs Additionally some safekeeper types gained serde tests. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-06 11:40:03 +02:00
John Spray	306c4f9967	s3_scrubber: prepare for scrubbing buckets with generation-aware content (#5700 ) ## Problem The scrubber didn't know how to find the latest index_part when generations were in use. ## Summary of changes - Teach the scrubber to do the same dance that pageserver does when finding the latest index_part.json - Teach the scrubber how to understand layer files with generation suffixes. - General improvement to testability: scan_metadata has a machine readable output that the testing `S3Scrubber` wrapper can read. - Existing test coverage of scrubber was false-passing because it just didn't see any data due to prefixing of data in the bucket. Fix that. This is incremental improvement: the more confidence we can have in the scrubber, the more we can use it in integration tests to validate the state of remote storage. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-11-03 17:36:02 +00:00
Joonas Koivunen	27bdbf5e36	chore(layer): restore logging, doc changes (#5766 ) Some of the log messages were lost with the #4938. This PR adds some of them back, most notably: - starting to on-demand download - successful completion of on-demand download - ability to see when there were many waiters for the layer download - "unexpectedly on-demand downloading ..." is now `info!` Additionally some rare events are logged as error, which should never happen.	2023-11-02 19:05:33 +00:00
Joonas Koivunen	098d3111a5	fix(layer): get_and_upgrade and metrics (#5767 ) when introducing `get_and_upgrade` I forgot that an `evict_and_wait` would had already incremented the counter for started evictions, but an upgrade would just "silently" cancel the eviction as no drop would ever run. these metrics are likely sources for alerts with the next release, so it's important to keep them correct.	2023-11-02 13:06:14 +00:00
Joonas Koivunen	3737fe3a4b	fix(layer): error out early if layer path is non-file (#5756 ) In an earlier PR https://github.com/neondatabase/neon/pull/5743#discussion_r1378625244 I added a FIXME and there's a simple solution suggested by @jcsp, so implement it. Wondering why I did not implement this originally, there is no concept of a permanent failure, so this failure will happen quite often. I don't think the frequency is a problem however. Sadly for std::fs::FileType there is only decimal and hex formatting, no octal.	2023-11-02 11:03:38 +00:00
Joonas Koivunen	2dca4c03fc	feat(layer): cancellable get_or_maybe_download (#5744 ) With the layer implementation as was done in #4938, it is possible via cancellation to cause two concurrent downloads on the same path, due to how `RemoteTimelineClient::download_remote_layer` does tempfiles. Thread the init semaphore through the spawned task of downloading to make this impossible to happen.	2023-11-02 08:06:32 +00:00
Joonas Koivunen	e82d1ad6b8	fix(layer): reinit on access before eviction happens (#5743 ) Right before merging, I added a loop to `fn LayerInner::get_or_maybe_download`, which was always supposed to be there. However I had forgotten to restart initialization instead of waiting for the eviction to happen to support original design goal of "eviction should always lose to redownload (or init)". This was wrong. After this fix, if `spawn_blocking` queue is blocked on something, nothing bad will happen. Part of #5737.	2023-11-01 17:38:32 +02:00
Joonas Koivunen	896347f307	refactor(layer): remove version checking with atomics (#5742 ) The `LayerInner::version` never needed to be read in more than one place. Clarified while fixing #5737 of which this is the first step. This decrements possible wrong atomics usage in Layer, but does not really fix anything.	2023-10-31 18:40:08 +02:00
John Spray	9c35e1e6e5	pageserver: downgrade slow task warnings from warn to info (#5724 ) ## Problem In #5658 we suppressed the first-iteration output from these logs, but the volume of warnings is still problematic. ## Summary of changes - Downgrade all slow task warnings to INFO. The information is still there if we actively want to know about which tasks are running slowly, without polluting the overall stream of warnings with situations that are unsurprising to us. - Revert the previous change so that we output on the first iteration as we used to do. There is no reason to suppress these, now that the severity is just info.	2023-10-30 18:32:30 +00:00
Conrad Ludgate	d8c21ec70d	fix nightly 1.75 (#5719 ) ## Problem Neon doesn't compile on nightly and had numerous clippy complaints. ## Summary of changes 1. Fixed troublesome dependency 2. Fixed or ignored the lints where appropriate	2023-10-30 16:43:06 +00:00
Joonas Koivunen	4db8efb2cf	Layer: logging fixes (#5676 ) - include Layer generation in the default display, with Generation::Broken as `-broken` - omit layer from `layer_gc` span because the api it works with needs to support N layers, so the api needs to log each layer	2023-10-30 16:21:30 +02:00
John Spray	07c2b29895	pageserver: fix error logging on stray timeline files (#5712 ) ## Problem If there were stray files in the timelines/ dir after tenant deletion, pageserver could panic on out of range. ## Summary of changes Use iterator `take()`, which doesn't care if the number of elements available is less than requested.	2023-10-30 13:24:52 +00:00
John Spray	87db4b441c	pageserver: cleaner shutdown in timeline delete (#5701 ) The flush task logs a backtrace if it tries to upload and remote timeline client is already in stopped state. Therefore we cannot shut them down concurrently: flush task must be shut down first. This wasn't more obvious because: - Timeline deletions IRL usually happen when not much is being written - In tests, there is a global allow-list for this log It's not obvious whether removing the global log allow list is safe, this PR was prompted by how the log spam got in my way when testing deletion changes.	2023-10-30 09:18:40 +00:00
Arpad Müller	bd59349af3	Fix Rust 1.74 warnings (#5702 ) Fixes new warnings and clippy changes introduced by version 1.74 of the rust compiler toolchain.	2023-10-28 03:47:26 +02:00
Joonas Koivunen	68f15cf967	fix: schedule_compaction_update must only unlink (#5675 ) #5649 added the concept of dangling layers which #4938 uses but only partially. I forgot to change `schedule_compaction_update` to not schedule deletions to uphold the "have a layer, you can read it". With the now remembered fix, I don't think these checks should ever fail except for a mistake I already did. These changes might be useful for protecting future changes, even though the Layer carrying the generation AND the `schedule_(gc\|compaction)_update` require strong arcs. Rationale for keeping the `#[cfg(feature = "testing")]` is worsening any leak situation which might come up.	2023-10-27 11:16:01 +01:00
John Spray	de90bf4663	pageserver: always load remote metadata (no more `spawn_load`) (#5580 ) ## Problem The pageserver had two ways of loading a tenant: - `spawn_load` would trust on-disk content to reflect all existing timelines - `spawn_attach` would list timelines in remote storage. It was incorrect for `spawn_load` to trust local disk content, because it doesn't know if the tenant might have been attached and written somewhere else. To make this correct would requires some generation number checks, but the payoff is to avoid one S3 op per tenant at startup, so it's not worth the complexity -- it is much simpler to have one way to load a tenant. ## Summary of changes - `Tenant` objects are always created with `Tenant::spawn`: there is no more distinction between "load" and "attach". - The ability to run without remote storage (for `neon_local`) is preserved by adding a branch inside `attach` that uses a fallback `load_local` if no remote_storage is present. - Fix attaching a tenant when it has a timeline with no IndexPart: this can occur if a newly created timeline manages to upload a layer before it has uploaded an index. - The attach marker file that used to indicate whether a tenant should be "loaded" or "attached" is no longer needed, and is removed. - The GenericRemoteStorage interface gets a `list()` method that maps more directly to what ListObjects does, returning both keys and common prefixes. The existing `list_files` and `list_prefixes` methods are just calls into `list()` now -- these can be removed later if we would like to shrink the interface a bit. - The remote deletion marker is moved into `timelines/` and detected as part of listing timelines rather than as a separate GET request. If any existing tenants have a marker in the old location (unlikely, only happens if something crashes mid-delete), then they will rely on the control plane retrying to complete their deletion. - Revise S3 calls for timeline listing and tenant load to take a cancellation token, and retry forever: it never makes sense to make a Tenant broken because of a transient S3 issue. ## Breaking changes - The remote deletion marker is moved from `deleted` to `timelines/deleted` within the tenant prefix. Markers in the old location will be ignored: it is the control plane's responsibility to retry deletions until they succeed. Markers in the new location will be tolerated by the previous release of pageserver via https://github.com/neondatabase/neon/pull/5632 - The local `attaching` marker file is no longer written. Therefore, if the pageserver is downgraded after running this code, the old pageserver will not be able to distinguish between partially attached tenants and fully attached tenants. This would only impact tenants that were partway through attaching at the moment of downgrade. In the unlikely even t that we do experience an incident that prompts us to roll back, then we may check for attach operations in flight, and manually insert `attaching` marker files as needed. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-10-26 14:48:44 +01:00
John Spray	8360307ea0	pageserver: exponential backoff on compaction/GC failures (#5672 ) Previously, if walredo process crashed we would try to spawn a fresh one every 2 seconds, which is expensive in itself, but also results in a high I/O load from the part of the compaction prior to the failure, which we re-run every 2 seconds. Closes: https://github.com/neondatabase/neon/issues/5671	2023-10-26 14:00:26 +01:00
Joonas Koivunen	c508d3b5fa	reimpl Layer, remove remote layer, trait Layer, trait PersistentLayer (#4938 ) Implement a new `struct Layer` abstraction which manages downloadness internally, requiring no LayerMap locking or rewriting to download or evict providing a property "you have a layer, you can read it". The new `struct Layer` provides ability to keep the file resident via a RAII structure for new layers which still need to be uploaded. Previous solution solved this `RemoteTimelineClient::wait_completion` which lead to bugs like #5639. Evicting or the final local deletion after garbage collection is done using Arc'd value `Drop`. With a single `struct Layer` the closed open ended `trait Layer`, `trait PersistentLayer` and `struct RemoteLayer` are removed following noting that compaction could be simplified by simply not using any of the traits in between: #4839. The new `struct Layer` is a preliminary to remove `Timeline::layer_removal_cs` documented in #4745. Preliminaries: #4936, #4937, #5013, #5014, #5022, #5033, #5044, #5058, #5059, #5061, #5074, #5103, epic #5172, #5645, #5649. Related split off: #5057, #5134.	2023-10-26 12:36:38 +03:00
Joonas Koivunen	f70019797c	refactor(rtc): schedule compaction update (#5649 ) a single operation instead of N uploads and 1 deletion scheduling with write(layer_map) lock releasing in the between. Compaction update will make for a much better place to change how the operation will change in future compared to more general file based operations. builds upon #5645. solves the problem of difficult to see hopeful correctness w.r.t. other `index_part.json` changing operations. Co-authored-by: Shany Pozin <shany@neon.tech>	2023-10-25 22:25:43 +01:00
Joonas Koivunen	325258413a	fix: trampling on global physical size metric (#5663 ) All loading (attached, or from disk) timelines overwrite the global gauge for physical size. The `_set` method cannot be used safely, so remove it and just "add" the physical size.	2023-10-25 19:29:12 +01:00
John Spray	5683ae9eab	pageserver: suppress some of the most common spurious warnings (#5658 ) Two of the most common spurious log messages: - broker connections terminate & we log at error severity. Unfortunately tonic gives us an "Unknown" error so to suppress these we're doing string matching. It's hacky but worthwhile for operations. - the first iteration of tenant background tasks tends to over-run its schedule and emit a warning. Ultimately we should fix these to run on time, but for now we are not benefiting from polluting our logs with the warnings.	2023-10-25 14:55:37 +01:00
Joonas Koivunen	4ae2d1390d	refactor(remote_timeline_client): Split deletion into unlinking + deletion (#5645 ) Quest: #4745. Prerequisite for #4938. Original https://github.com/neondatabase/neon/pull/4938#issuecomment-1777150665. The new Layer implementation has so far been using `RemoteTimelineClient::schedule_layer_file_deletion` from `Layer::drop` but it was noticed that this could mean that the L0s compaction wanted to remove could linger in the index part for longer time or be left there for longer time. Solution is to split the `RemoteTimelineClient::schedule_layer_file_deletion` into two parts: - unlinking from index_part.json, to be called from end of compaction and gc - scheduling of actual deletions, to be called from `Layer::drop` The added methods are added unused.	2023-10-25 15:01:19 +03:00
Joonas Koivunen	c5949e1fd6	misc smaller improvements (#5527 ) - finally add an `#[instrument]` to Timeline::create_image_layers, making it easier to see that something is happening because we create image layers - format some macro context code - add a warning not to create new validation functions a la parse do not validate Split off from #5198.	2023-10-25 14:59:43 +03:00
John Spray	a0b862a8bd	pageserver: schedule frozen layer uploads inside the layers lock (#5639 ) ## Problem Compaction's source of truth for what layers exist is the LayerManager. `flush_frozen_layer` updates LayerManager before it has scheduled upload of the frozen layer. Compaction can then "see" the new layer, decide to delete it, schedule uploads of replacement layers, all before `flush_frozen_layer` wakes up again and schedules the upload. When the upload is scheduled, the local layer file may be gone, in which case we end up with no such layer in remote storage, but an entry still added to IndexPart pointing to the missing layer. ## Summary of changes Schedule layer uploads inside the `self.layers` lock, so that whenever a frozen layer is present in LayerManager, it is also present in RemoteTimelineClient's metadata. Closes: #5635	2023-10-24 13:57:01 +01:00
John Spray	188f67e1df	pageserver: forward compat: be tolerant of deletion marker in `timelines/` (#5632 ) ## Problem https://github.com/neondatabase/neon/pull/5580 will move the remote deletion marker into the `timelines/` path. This would cause old pageserver code to fail loading the tenant due to an apparently invalid timeline ID. That would be a problem if we had to roll back after deploying #5580 ## Summary of changes If a `deleted` file is in `timelines/` just ignore it.	2023-10-23 17:51:38 +02:00
John Spray	7e805200bb	pageserver: parallel load of configs (#5607 ) ## Problem When the number of tenants is large, sequentially issuing the open/read calls for their config files is a ~1000ms delay during startup. It's not a lot, but it's simple to fix. ## Summary of changes Put all the config loads into spawn_blocking() tasks and run them in a JoinSet. We can simplify this a bit later when we have full async disk I/O. --------- Co-authored-by: Shany Pozin <shany@neon.tech>	2023-10-23 15:32:34 +01:00
Christian Schwarz	9da67c4f19	walredo: make request_redo() an async fn (#5559 ) Stacked atop https://github.com/neondatabase/neon/pull/5557 Prep work for https://github.com/neondatabase/neon/pull/5560 These changes have a 2% impact on `bench_walredo`. That's likely because of the `block_on() in the innermost piece of benchmark-only code. So, it doesn't affect production code. The use of closures in the benchmarking code prevents a straightforward conversion of the whole benchmarking code to async. before: ``` $ cargo bench --features testing --bench bench_walredo Compiling pageserver v0.1.0 (/home/cs/src/neon/pageserver) Finished bench [optimized + debuginfo] target(s) in 2m 11s Running benches/bench_walredo.rs (target/release/deps/bench_walredo-d99a324337dead70) Gnuplot not found, using plotters backend short/short/1 time: [26.363 µs 27.451 µs 28.573 µs] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild short/short/2 time: [64.340 µs 64.927 µs 65.485 µs] Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) low mild short/short/4 time: [101.98 µs 104.06 µs 106.13 µs] short/short/8 time: [151.42 µs 152.74 µs 154.03 µs] short/short/16 time: [296.30 µs 297.53 µs 298.88 µs] Found 14 outliers among 100 measurements (14.00%) 10 (10.00%) high mild 4 (4.00%) high severe medium/medium/1 time: [225.12 µs 225.90 µs 226.66 µs] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) low mild medium/medium/2 time: [490.80 µs 491.64 µs 492.49 µs] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) low mild medium/medium/4 time: [934.47 µs 936.49 µs 938.52 µs] Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) low mild 1 (1.00%) high mild 1 (1.00%) high severe medium/medium/8 time: [1.8364 ms 1.8412 ms 1.8463 ms] Found 4 outliers among 100 measurements (4.00%) 4 (4.00%) high mild medium/medium/16 time: [3.6694 ms 3.6896 ms 3.7104 ms] ``` after: ``` $ cargo bench --features testing --bench bench_walredo Compiling pageserver v0.1.0 (/home/cs/src/neon/pageserver) Finished bench [optimized + debuginfo] target(s) in 2m 11s Running benches/bench_walredo.rs (target/release/deps/bench_walredo-d99a324337dead70) Gnuplot not found, using plotters backend short/short/1 time: [28.345 µs 28.529 µs 28.699 µs] change: [-0.2201% +3.9276% +8.2451%] (p = 0.07 > 0.05) No change in performance detected. Found 17 outliers among 100 measurements (17.00%) 4 (4.00%) low severe 5 (5.00%) high mild 8 (8.00%) high severe short/short/2 time: [66.145 µs 66.719 µs 67.274 µs] change: [+1.5467% +2.7605% +3.9927%] (p = 0.00 < 0.05) Performance has regressed. Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) low mild short/short/4 time: [105.51 µs 107.52 µs 109.49 µs] change: [+0.5023% +3.3196% +6.1986%] (p = 0.02 < 0.05) Change within noise threshold. short/short/8 time: [151.90 µs 153.16 µs 154.41 µs] change: [-1.0001% +0.2779% +1.4221%] (p = 0.65 > 0.05) No change in performance detected. short/short/16 time: [297.38 µs 298.26 µs 299.20 µs] change: [-0.2953% +0.2462% +0.7763%] (p = 0.37 > 0.05) No change in performance detected. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild medium/medium/1 time: [229.76 µs 230.72 µs 231.69 µs] change: [+1.5804% +2.1354% +2.6635%] (p = 0.00 < 0.05) Performance has regressed. medium/medium/2 time: [501.14 µs 502.31 µs 503.64 µs] change: [+1.8730% +2.1709% +2.5199%] (p = 0.00 < 0.05) Performance has regressed. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 1 (1.00%) high mild 5 (5.00%) high severe medium/medium/4 time: [954.15 µs 956.74 µs 959.33 µs] change: [+1.7962% +2.1627% +2.4905%] (p = 0.00 < 0.05) Performance has regressed. medium/medium/8 time: [1.8726 ms 1.8785 ms 1.8848 ms] change: [+1.5858% +2.0240% +2.4626%] (p = 0.00 < 0.05) Performance has regressed. Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) low mild 3 (3.00%) high mild 2 (2.00%) high severe medium/medium/16 time: [3.7565 ms 3.7746 ms 3.7934 ms] change: [+1.5503% +2.3044% +3.0818%] (p = 0.00 < 0.05) Performance has regressed. Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild ```	2023-10-18 11:23:06 +01:00
Arpad Müller	093f8c5f45	Update rust to 1.73.0 (#5574 ) [Release notes](https://blog.rust-lang.org/2023/10/05/Rust-1.73.0.html)	2023-10-17 13:13:12 +01:00
Christian Schwarz	9256788273	limit imitate accesses concurrency, using same semaphore as compactions (#5578 ) Before this PR, when we restarted pageserver, we'd see a rush of `$number_of_tenants` concurrent eviction tasks starting to do imitate accesses building up in the period of `[init_order allows activations, $random_access_delay + EvictionPolicyLayerAccessThreshold::period]`. We simply cannot handle that degree of concurrent IO. We already solved the problem for compactions by adding a semaphore. So, this PR shares that semaphore for use by evictions. Part of https://github.com/neondatabase/neon/issues/5479 Which is again part of https://github.com/neondatabase/neon/issues/4743 Risks / Changes In System Behavior ================================== * we don't do evictions as timely as we currently do * we log a bunch of warnings about eviction taking too long * imitate accesses and compactions compete for the same concurrency limit, so, they'll slow each other down through this shares semaphore Changes ======= - Move the `CONCURRENT_COMPACTIONS` semaphore into `tasks.rs` - Rename it to `CONCURRENT_BACKGROUND_TASKS` - Use it also for the eviction imitate accesses: - Imitate acceses are both per-TIMELINE and per-TENANT - The per-TENANT is done through coalescing all the per-TIMELINE tasks via a tokio mutex `eviction_task_tenant_state`. - We acquire the CONCURRENT_BACKGROUND_TASKS permit early, at the beginning of the eviction iteration, much before the imitate acesses start (and they may not even start at all in the given iteration, as they happen only every $threshold). - Acquiring early is sub-optimal because when the per-timline tasks coalesce on the `eviction_task_tenant_state` mutex, they are already holding a CONCURRENT_BACKGROUND_TASKS permit. - It's also unfair because tenants with many timelines win the CONCURRENT_BACKGROUND_TASKS more often. - I don't think there's another way though, without refactoring more of the imitate accesses logic, e.g, making it all per-tenant. - Add metrics for queue depth behind the semaphore. I found these very useful to understand what work is queued in the system. - The metrics are tagged by the new `BackgroundLoopKind`. - On a green slate, I would have used `TaskKind`, but we already had pre-existing labels whose names didn't map exactly to task kind. Also the task kind is kind of a lower-level detail, so, I think it's fine to have a separate enum to identify background work kinds. Future Work =========== I guess I could move the eviction tasks from a ticker to "sleep for $period". The benefit would be that the semaphore automatically "smears" the eviction task scheduling over time, so, we only have the rush on restart but a smeared-out rush afterward. The downside is that this perverts the meaning of "$period", as we'd actually not run the eviction at a fixed period. It also means the the "took to long" warning & metric becomes meaningless. Then again, that is already the case for the compaction and gc tasks, which do sleep for `$period` instead of using a ticker.	2023-10-17 11:29:48 +02:00
Joonas Koivunen	9e1449353d	crash-consistent layer map through index_part.json (#5198 ) Fixes #5172 as it: - removes recoinciliation with remote index_part.json and accepts remote index_part.json as the truth, deleting any local progress which is yet to be reflected in remote - moves to prefer remote metadata Additionally: - tests with single LOCAL_FS parametrization are cleaned up - adds a test case for branched (non-bootstrap) local only timeline availability after restart --------- Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: John Spray <john@neon.tech>	2023-10-17 10:04:56 +01:00
John Spray	b06dffe3dc	pageserver: fixes to `/location_config` API (#5548 ) ## Problem I found some issues with the `/location_config` API when writing new tests. ## Summary of changes - Calling the API with the "Detached" state is now idempotent. - `Tenant::spawn_attach` now takes a boolean to indicate whether to expect a marker file. Marker files are used in the old attach path, but not in the new location conf API. They aren't needed because in the New World, the choice of whether to attach via remote state ("attach") or to trust local state ("load") will be revised to cope with the transitions between secondary & attached (see https://github.com/neondatabase/neon/issues/5550). It is okay to merge this change ahead of that ticket, because the API is not used in the wild yet. - Instead of using `schedule_local_tenant_processing`, the location conf API handler does its own directory creation and calls `spawn_attach` directly. - A new `unsafe_create_dir_all` is added. This differs from crashsafe::create_dir_all in two ways: - It is intentionally not crashsafe, because in the location conf API we are no longer using directory or config existence as the signal for any important business logic. - It is async and uses `tokio::fs`.	2023-10-17 10:21:31 +02:00
John Spray	ded7f48565	pageserver: measure startup duration spent fetching remote indices (#5564 ) ## Problem Currently it's unclear how much of the `initial_tenant_load` period is in S3 objects, and therefore how impactful it is to make changes to remote operations during startup. ## Summary of changes - `Tenant::load` is refactored to load remote indices in parallel and to wait for all these remote downloads to finish before it proceeds to construct any `Timeline` objects. - `pageserver_startup_duration_seconds` gets a new `phase` value of `initial_tenant_load_remote` which counts the time from startup to when the last tenant finishes loading remote content. - `test_pageserver_restart` is extended to validate this phase. The previous version of the test was relying on order of dict entries, which stopped working when adding a phase, so this is refactored a bit. - `test_pageserver_restart` used to explicitly create a branch, now it uses the default initial_timeline. This avoids startup getting held up waiting for logical sizes, when one of the branches is not in use.	2023-10-16 18:21:37 +01:00
John Spray	44b1c4c456	pageserver: fix eviction across generations (#5538 ) ## Problem Bug was introduced by me in `83ae2bd82c` When eviction constructs a RemoteLayer to replace the layer it just evicted, it is building a LayerFileMetadata using its _current_ generation, rather than the generation of the layer. ## Summary of changes - Retrieve Generation from RemoteTimelineClient when evicting. This will no longer be necessary when #4938 lands. - Add a test for the scenario in question (this fails without the fix).	2023-10-15 20:23:18 +01:00
Christian Schwarz	dd6990567f	walredo: apply_batch_postgres: get a backtrace whenever it encounters an error (#5541 ) For 2 weeks we've seen rare, spurious, not-reproducible page reconstruction failures with PG16 in prod. One of the commits we deployed this week was Commit commit `fc467941f9` Author: Joonas Koivunen <joonas@neon.tech> Date: Wed Oct 4 16:19:19 2023 +0300 walredo: log retryed error (#546) With the logs from that commit, we learned that some read() or write() system call that walredo does fails with `EAGAIN`, aka `Resource temporarily unavailable (os error 11)`. But we have no idea where exactly in the code we get back that error. So, use anyhow instead of fake std::io::Error's as an easy way to get a backtrace when the error happens, and change the logging to print that backtrace (i.e., use `{:?}` instead of `utils::error::report_compact_sources(e)`). The `WalRedoError` type had to go because we add additional `.context()` further up the call chain before we `{:?}`-print it. That additional `.context()` further up doesn't see that there's already an anyhow::Error inside the `WalRedoError::ApplyWalRecords` variant, and hence captures another backtrace and prints that one on `{:?}`-print instead of the original one inside `WalRedoError::ApplyWalRecords`. If we ever switch back to `report_compact_sources`, we should make sure we have some other way to uniquely identify the places where we return an error in the error message.	2023-10-13 14:08:23 +00:00
John Spray	39e144696f	pageserver: clean up `mgr.rs` types that needn't be public (#5529 ) ## Problem These types/functions are public and it prevents clippy from catching unused things. ## Summary of changes Move to `pub(crate)` and remove the error enum that becomes clearly unused as a result.	2023-10-11 11:50:16 +00:00
John Spray	acefee9a32	pageserver: flush deletion queue on detach (#5452 ) ## Problem If a caller detaches a tenant and then attaches it again, pending deletions from the old attachment might not have happened yet. This is not a correctness problem, but it causes: - Risk of leaking some objects in S3 - Some warnings from the deletion queue when pending LSN updates and pending deletions don't pass validation. ## Summary of changes - Deletion queue now uses UnboundedChannel so that the push interfaces don't have to be async. - This was pulled out of https://github.com/neondatabase/neon/pull/5397, where it is also useful to be able to drive the queue from non-async contexts. - Why is it okay for this to be unbounded? The only way the unbounded-ness of the channel can become a problem is if writing out deletion lists can't keep up, but if the system were that overloaded then the code generating deletions (GC, compaction) would also be impacted. - DeletionQueueClient gets a new `flush_advisory` function, which is like flush_execute, but doesn't wait for completion: this is appropriate for use in contexts where we would like to encourage the deletion queue to flush, but don't need to block on it. - This function is also expected to be useful in next steps for seamless migration, where the option to flush to S3 while transitioning into AttachedStale will also include flushing deletion queue, but we wouldn't want to block on that flush. - The tenant_detach code in mgr.rs invokes flush_advisory after stopping the `Tenant` object. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-10-10 10:46:24 +01:00
Joonas Koivunen	4772cd6c93	fix: deny branching, starting compute from not yet uploaded timelines (#5484 ) Part of #5172. First commits show that we used to allow starting up a compute or creating a branch off a not yet uploaded timeline. This PR moves activation of a timeline to happen after initial layer file(s) (if any) and `index_part.json` have been uploaded. Simply moving activation to be after downloads have finished works because we now spawn a task per http request handler. Current behaviour of uploading on the timelines on next startup is kept, to be removed later as part of #5172. Adds: - `NeonCli.map_branch` and corresponding `neon_local` implementation: allow creating computes for timelines managed via pageserver http client/api - possibly duplicate tests (I did not want to search for, will cleanup in a follow-up if these duplicated) Changes: - make `wait_until_tenant_state` return immediatedly on `Broken` and not wait more	2023-10-09 17:03:38 +03:00
John Spray	ea5a97e7b4	pageserver: implement emergency mode for operating without control plane (#5469 ) ## Problem Pageservers with `control_plane_api` configured require a control plane to start up: in an incident this might be a problem. ## Summary of changes Note to reviewers: most of the code churn in mgr.rs is the refactor commit that enables the later emergency mode commit: you may want to review commits separately. - Add `control_plane_emergency_mode` configuration property - Refactor init_tenant_mgr to separate loading configurations from the main loop where we construct Tenant, so that the generations fetch can peek at the configs in emergency mode. - During startup, in emergency mode, attach any tenants that were attached on their last run, using the same generation number. Closes: #5381 Closes: https://github.com/neondatabase/neon/issues/5492	2023-10-06 17:25:21 +01:00
John Spray	547914fe19	pageserver: adjust timeline deletion for generations (#5453 ) ## Problem Spun off from https://github.com/neondatabase/neon/pull/5449 Timeline deletion does the following: 1. Delete layers referenced in the index 2. Delete everything else in the timeline prefix, except the index 3. Delete the index. When generations were added, the filter in step 2 got outdated, such that the index objects were deleted along with everything else at step 2. That didn't really break anything, but it makes an automated test unhappy and is a violation of the original intent of the code, which presumably intends to upload an invariant that as long as any objects for a timeline exist, the index exists. (Eventually, this index-object-last complexity can go away: when we do https://github.com/neondatabase/neon/issues/5080, there is no need to keep the index_part around, as deletions can always be retried any time any where.) ## Summary of changes After object listing, split the listed objects into layers and index objects. Delete the layers first, then the index objects.	2023-10-06 16:15:18 +00:00
Arpad Müller	607b185a49	Fix 1.73.0 clippy lints (#5494 ) Doesn't do an upgrade of rustc to 1.73.0 as we want to wait for the cargo response of the curl CVE before updating. In preparation for an update, we address the clippy lints that are newly firing in 1.73.0.	2023-10-06 14:17:19 +01:00
Christian Schwarz	bfba5e3aca	page_cache: ensure forward progress on miss (#5482 ) Problem ======= Prior to this PR, when we had a cache miss, we'd get back a write guard, fill it, the drop it and retry the read from cache. If there's severe contention for the cache, it could happen that the just-filled data gets evicted before our retry, resulting in lost work and no forward progress. Solution ======== This PR leverages the now-available `tokio::sync::RwLockWriteGuard`'s `downgrade()` functionality to turn the filled slot write guard into a read guard. We don't drop the guard at any point, so, forward progress is ensured. Refs ==== Stacked atop https://github.com/neondatabase/neon/pull/5480 part of https://github.com/neondatabase/neon/issues/4743 specifically part of https://github.com/neondatabase/neon/issues/5479	2023-10-06 13:41:13 +01:00

1 2 3 4 5 ...

432 Commits