rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 05:22:56 +00:00

Author	SHA1	Message	Date
Arpad Müller	fa3fc73c1b	Address 1.82 clippy lints (#8944 ) Addresses the clippy lints of the beta 1.82 toolchain. The `too_long_first_doc_paragraph` lint complained a lot and was addressed separately: #8941	2024-09-06 21:05:18 +02:00
Arseny Sher	c1a51416db	safekeeper: fsync filesystem on start. We can't really rely on files contents after boot without fsync'ing them.	2024-09-06 19:14:25 +03:00
Arpad Müller	cbcd4058ed	Fix 1.82 clippy lint too_long_first_doc_paragraph (#8941 ) Addresses the 1.82 beta clippy lint `too_long_first_doc_paragraph` by adding newlines to the first sentence if it is short enough, and making a short first sentence if there is the need.	2024-09-06 14:33:52 +02:00
vladov	ebddda5b7f	Fix precedence issue causing yielding loop to never yield. (#8922 ) There is a bug in `yielding_loop` that causes it to never yield. ## Summary of changes Fixed the bug. `i + 1 % interval == 0` will always evaluate to `i + 1 == 0` which is false ([Playground](https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=68e6ca393a02113cb7720115c2842e75)). This function is called in 2 places [here](`99fa1c3600/pageserver/src/tenant/secondary/scheduler.rs (L389)`) and [here](`99fa1c3600/pageserver/src/tenant/secondary/heatmap_uploader.rs (L152)`) with `interval == 1000` in both cases. This may change the performance of the system since now we are yielding to tokio. Also, this may expose undefined behavior since it is now possible for tasks to be moved between threads/whatever tokio does to tasks. However, this was the intention of the author of the code.	2024-09-05 11:06:57 -04:00
Christian Schwarz	850421ec06	refactor(pageserver): rely on serde derive for toml deserialization (#7656 ) This PR simplifies the pageserver configuration parsing as follows: * introduce the `pageserver_api::config::ConfigToml` type * implement `Default` for `ConfigToml` * use serde derive to do the brain-dead leg-work of processing the toml document * use `serde(default)` to fill in default values * in `pageserver` crate: * use `toml_edit` to deserialize the pageserver.toml string into a `ConfigToml` * `PageServerConfig::parse_and_validate` then * consumes the `ConfigToml` * destructures it exhaustively into its constituent fields * constructs the `PageServerConfig` The rules are: * in `ConfigToml`, use `deny_unknown_fields` everywhere * static default values go in `pageserver_api` * if there cannot be a static default value (e.g. which default IO engine to use, because it depends on the runtime), make the field in `ConfigToml` an `Option` * if runtime-augmentation of a value is needed, do that in `parse_and_validate` * a good example is `virtual_file_io_engine` or `l0_flush`, both of which need to execute code to determine the effective value in `PageServerConf` The benefits: * massive amount of brain-dead repetitive code can be deleted * "unused variable" compile-time errors when removing a config value, due to the exhaustive destructuring in `parse_and_validate` * compile-time errors guide you when adding a new config field Drawbacks: * serde derive is sometimes a bit too magical * `deny_unknown_fields` is easy to miss Future Work / Benefits: * make `neon_local` use `pageserver_api` to construct `ConfigToml` and write it to `pageserver.toml` * This provides more type safety / coompile-time errors than the current approach. ### Refs Fixes #3682 ### Future Work * `remote_storage` deser doesn't reject unknown fields https://github.com/neondatabase/neon/issues/8915 * clean up `libs/pageserver_api/src/config.rs` further * break up into multiple files, at least for tenant config * move `models` as appropriate / refine distinction between config and API models / be explicit about when it's the same * use `pub(crate)` visibility on `mod defaults` to detect stale values	2024-09-05 14:59:49 +02:00
Arpad Müller	8eaa8ad358	Remove async_trait usages from safekeeper and neon_local (#8864 ) Removes additional async_trait usages from safekeeper and neon_local. Also removes now redundant dependencies of the `async_trait` crate. cc earlier work: #6305, #6464, #7303, #7342, #7212, #8296	2024-08-29 18:24:25 +02:00
Conrad Ludgate	a644f01b6a	proxy+pageserver: shared leaky bucket impl (#8539 ) In proxy I switched to a leaky-bucket impl using the GCRA algorithm. I figured I could share the code with pageserver and remove the leaky_bucket crate dependency with some very basic tokio timers and queues for fairness. The underlying algorithm should be fairly clear how it works from the comments I have left in the code. --- In benchmarking pageserver, @problame found that the new implementation fixes a getpage throughput discontinuity in pageserver under the `pagebench get-page-latest-lsn` benchmark with the clickbench dataset (`test_perf_olap.py`). The discontinuity is that for any of `--num-clients={2,3,4}`, getpage throughput remains 10k. With `--num-clients=5` and greater, getpage throughput then jumps to the configured 20k rate limit. With the changes in this PR, the discontinuity is gone, and we scale throughput linearly to `--num-clients` until the configured rate limit. More context in https://github.com/neondatabase/cloud/issues/16886#issuecomment-2315257641. closes https://github.com/neondatabase/cloud/issues/16886 --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-08-29 11:26:52 +00:00
Christian Schwarz	c2f8fdccd7	ingest: rate-limited warning if WAL commit timestamps lags for > wait_lsn_timeout (#8839 ) refs https://github.com/neondatabase/cloud/issues/13750 The logging in this commit will make it easier to detect lagging ingest. We're trusting compute timestamps --- ideally we'd use SK timestmaps instead. But trusting the compute timestamp is ok for now.	2024-08-29 12:06:00 +01:00
Conrad Ludgate	428b105dde	remove workspace hack from libs (#8780 ) This removes workspace hack from all libs, not from any binaries. This does not change the behaviour of the hack. Running ``` cargo clean cargo build --release --bin proxy ``` Before this change took 5m16s. After this change took 3m3s. This is because this allows the build to be parallelisable much more.	2024-08-21 14:45:32 +01:00
Joonas Koivunen	6d6e2c6a39	feat(detach_ancestor): better retries with persistent gc blocking (#8430 ) With the persistent gc blocking, we can now retry reparenting timelines which had failed for whatever reason on the previous attempt(s). Restructure the detach_ancestor into three phases: - prepare (insert persistent gc blocking, copy lsn prefix, layers) - detach and reparent - reparenting can fail, so we might need to retry this portion - complete (remove persistent gc blocking) Cc: #6994	2024-08-13 18:51:51 +01:00
Alexander Bayandin	4a53cd0fc3	Dockerfiles: remove cachepot (#8666 ) ## Problem We install and try to use `cachepot`. But it is not configured correctly and doesn't work (after https://github.com/neondatabase/neon/pull/2290) ## Summary of changes - Remove `cachepot`	2024-08-09 15:48:16 +01:00
Joonas Koivunen	fc78774f39	fix: EphemeralFiles can outlive their Timeline via `enum LayerManager` (#8229 ) Ephemeral files cleanup on drop but did not delay shutdown, leading to problems with restarting the tenant. The solution is as proposed: - make ephemeral files carry the gate guard to delay `Timeline::gate` closing - flush in-memory layers and strong references to those on `Timeline::shutdown` The above are realized by making LayerManager an `enum` with `Open` and `Closed` variants, and fail requests to modify `LayerMap`. Additionally: - fix too eager anyhow conversions in compaction - unify how we freeze layers and handle errors - optimize likely_resident_layers to read LayerFileManager hashmap values instead of bouncing through LayerMap Fixes: #7830	2024-08-07 17:50:09 +03:00
Yuchen Liang	e374d6778e	feat(storcon): store scrubber metadata scan result (#8480 ) Part of #8128, followed by #8502. ## Problem Currently we lack mechanism to alert unhealthy `scan_metadata` status if we start running this scrubber command as part of a cronjob. With the storage controller client introduced to storage scrubber in #8196, it is viable to set up alert by storing health status in the storage controller database. We intentionally do not store the full output to the database as the json blobs potentially makes the table really huge. Instead, only a health status and a timestamp recording the last time metadata health status is posted on a tenant shard. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-30 14:32:00 +01:00
John Spray	f5db655447	pageserver: simplify LayerAccessStats (#8431 ) ## Problem LayerAccessStats contains a lot of detail that we don't use: short histories of most recent accesses, specifics on what kind of task accessed a layer, etc. This is all stored inside a Mutex, which is locked every time something accesses a layer. ## Summary of changes - Store timestamps at a very low resolution (to the nearest second), sufficient for use on the timescales of eviction. - Pack access time and last residence change time into a single u64 - Use the high bits of the u64 for other flags, including the new layer visibility concept. - Simplify the external-facing model for access stats to just include what we now track. Note that the `HistoryBufferWithDropCounter` is removed here because it is no longer used. I do not dislike this type, we just happen not to use it for anything else at present. Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-07-24 08:17:28 +01:00
Yuchen Liang	595c450036	fix(scrubber): more robust metadata consistency checks (#8344 ) Part of #8128. ## Problem Scrubber uses `scan_metadata` command to flag metadata inconsistencies. To trust it at scale, we need to make sure the errors we emit is a reflection of real scenario. One check performed in the scrubber is to see whether layers listed in the latest `index_part.json` is present in object listing. Currently, the scrubber does not robustly handle the case where objects are uploaded/deleted during the scan. ## Summary of changes Condition for success: An object in the index is (1) in the object listing we acquire from S3 or (2) found in a HeadObject request (new object). - Add in the `HeadObject` requests for the layers missing from the object listing. - Keep the order of first getting the object listing and then downloading the layers. - Update check to only consider shards with highest shard count. - Skip analyzing a timeline if `deleted_at` tombstone is marked in `index_part.json`. - Add new test to see if scrubber actually detect the metadata inconsistency. _Misc_ - A timeline with no ancestor should always have some layers. - Removed experimental histograms _Caveat_ - Ancestor layer is not cleaned until #8308 is implemented. If ancestor layers reference non-existing layers in the index, the scrubber will emit false positives. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-22 14:53:33 +01:00
John Spray	44781518d0	storage scrubber: GC ancestor shard layers (#8196 ) ## Problem After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space. ## Summary of changes - Extend the physical GC command with `--mode=full`, which includes cleaning up unreferenced ancestor shard layers - Add test `test_scrubber_physical_gc_ancestors` - Remove colored log output: in testing this is irritating ANSI code spam in logs, and in interactive use doesn't add much. - Refactor storage controller API client code out of storcon_client into a `storage_controller/client` crate - During physical GC of ancestors, call into the storage controller to check that the latest shards seen in S3 reflect the latest state of the tenant, and there is no shard split in progress.	2024-07-19 19:07:59 +03:00
Conrad Ludgate	411a130675	Fix nightly warnings 2024 june (#8151 ) ## Problem new clippy warnings on nightly. ## Summary of changes broken up each commit by warning type. 1. Remove some unnecessary refs. 2. In edition 2024, inference will default to `!` and not `()`. 3. Clippy complains about doc comment indentation 4. Fix `Trait + ?Sized` where `Trait: Sized`. 5. diesel_derives triggering `non_local_defintions`	2024-07-12 13:58:04 +01:00
John Spray	0645ae318e	pageserver: circuit breaker on compaction (#8359 ) ## Problem We already back off on compaction retries, but the impact of a failing compaction can be so great that backing off up to 300s isn't enough. The impact is consuming a lot of I/O+CPU in the case of image layer generation for large tenants, and potentially also leaking disk space. Compaction failures are extremely rare and almost always indicate a bug, frequently a bug that will not let compaction to proceed until it is fixed. Related: https://github.com/neondatabase/neon/issues/6738 ## Summary of changes - Introduce a CircuitBreaker type - Add a circuit breaker for compaction, with a policy that after 5 failures, compaction will not be attempted again for 24 hours. - Add metrics that we can alert on: any >0 value for `pageserver_circuit_breaker_broken_total` should generate an alert. - Add a test that checks this works as intended. Couple notes to reviewers: - Circuit breakers are intrinsically a defense-in-depth measure: this is not the solution to any underlying issues, it is just a general mitigation for "unknown unknowns" that might be encountered in future. - This PR isn't primarily about writing a perfect CircuitBreaker type: the one in this PR is meant to be just enough to mitigate issues in compaction, and make it easy to monitor/alert on these failures. We can refine this type in future as/when we want to use it elsewhere.	2024-07-12 12:04:02 +01:00
Christian Schwarz	e26ef640c1	pageserver: remove `trace_read_requests` (#8338 ) `trace_read_requests` is a per `Tenant`-object option. But the `handle_pagerequests` loop doesn't know which `Tenant` object (i.e., which shard) the request is for. The remaining use of the `Tenant` object is to check `tenant.cancel`. That check is incorrect [if the pageserver hosts multiple shards](https://github.com/neondatabase/neon/issues/7427#issuecomment-2220577518). I'll fix that in a future PR where I completely eliminate the holding of `Tenant/Timeline` objects across requests. See [my code RFC](https://github.com/neondatabase/neon/pull/8286) for the high level idea. Note that we can always bring the tracing functionality if we need it. But since it's actually about logging the `page_service` wire bytes, it should be a `page_service`-level config option, not per-Tenant. And for enabling tracing on a single connection, we can implement a `set pageserver_trace_connection;` option.	2024-07-11 15:17:07 +02:00
Christian Schwarz	1a49f1c15c	pageserver: move `page_service`'s `import basebackup` / `import wal` to mgmt API (#8292 ) I want to fix bugs in `page_service` ([issue](https://github.com/neondatabase/neon/issues/7427)) and the `import basebackup` / `import wal` stand in the way / make the refactoring more complicated. We don't use these methods anyway in practice, but, there have been some objections to removing the functionality completely. So, this PR preserves the existing functionality but moves it into the HTTP management API. Note that I don't try to fix existing bugs in the code, specifically not fixing * it only ever worked correctly for unsharded tenants * it doesn't clean up on error All errors are mapped to `ApiError::InternalServerError`.	2024-07-09 23:17:42 +02:00
Yuchen Liang	a68edad913	refactor: move part of sharding API from `pageserver_api` to `utils` (#8254 ) ## Problem LSN Leases introduced in #8084 is a new API that is made shard-aware from day 1. To support ephemeral endpoint in #7994 without linking Postgres C API against `compute_ctl`, part of the sharding needs to reside in `utils`. ## Summary of changes - Create a new `shard` module in utils crate. - Move more interface related part of tenant sharding API to utils and re-export them in pageserver_api. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-08 15:43:10 +01:00
Christian Schwarz	7dcdbaa25e	remote_storage config: move handling of empty inline table `{}` to callers (#8193 ) Before this PR, `RemoteStorageConfig::from_toml` would support deserializing an empty `{}` TOML inline table to a `None`, otherwise try `Some()`. We can instead let * in proxy: let clap derive handle the Option * in PS & SK: assume that if the field is specified, it must be a valid RemtoeStorageConfig (This PR started with a much simpler goal of factoring out the `deserialize_item` function because I need that in another PR).	2024-07-02 12:53:08 +02:00
John Spray	c39d5b03e8	pageserver: remove legacy tenant config code, clean up redundant generation none/broken usages (#7947 ) ## Problem In https://github.com/neondatabase/neon/pull/5299, the new config-v1 tenant config file was added to hold the LocationConf type. We left the old config file in place for forward compat, and because running without generations (therefore without LocationConf) as still useful before the storage controller was ready for prime-time. Closes: https://github.com/neondatabase/neon/issues/5388 ## Summary of changes - Remove code for reading and writing the legacy config file - Remove Generation::Broken: it was unused. - Treat missing config file on disk as an error loading a tenant, rather than defaulting it. We can now remove LocationConf::default, and thereby guarantee that we never construct a tenant with a None generation. - Update some comments + add some assertions to clarify that Generation::None is only used in layer metadata, not in the state of a running tenant. - Update docker compose test to create tenants with a generation	2024-06-26 19:53:59 +00:00
John Spray	07f21dd6b6	pageserver: remove attach/detach apis (#8134 ) ## Problem These APIs have been deprecated for some time, but were still used from test code. Closes: https://github.com/neondatabase/neon/issues/4282 ## Summary of changes - It is still convenient to do a "tenant_attach" from a test without having to write out a location_conf body, so those test methods have been retained with implementations that call through to their location_conf equivalent.	2024-06-25 17:38:06 +01:00
Yuchen Liang	630cfbe420	refactor(pageserver): designated api error type for cancelled request (#7949 ) Closes #7406. ## Problem When a `get_lsn_by_timestamp` request is cancelled, an anyhow error is exposed to handle that case, which verbosely logs the error. However, we don't benefit from having the full backtrace provided by anyhow in this case. ## Summary of changes This PR introduces a new `ApiError` type to handle errors caused by cancelled request more robustly. - A new enum variant `ApiError::Cancelled` - Currently the cancelled request is mapped to status code 500. - Need to handle this error in proxy's `http_util` as well. - Added a failpoint test to simulate cancelled `get_lsn_by_timestamp` request. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-06-06 14:00:14 +00:00
Christian Schwarz	17116f2ea9	fix(pageserver): abort on duplicate layers, before doing damage (#7799 ) fixes https://github.com/neondatabase/neon/issues/7790 (duplicating most of the issue description here for posterity) # Background From the time before always-authoritative `index_part.json`, we had to handle duplicate layers. See the RFC for an illustration of how duplicate layers could happen: `a8e6d259cb/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md (L41-L50)` As of #5198 , we should not be exposed to that problem anymore. # Problem 1 We still have 1. [code in Pageserver](`82960b2175/pageserver/src/tenant/timeline.rs (L4502-L4521)`) than handles duplicate layers 2. [tests in the test suite](`d9dcbffac3/test_runner/regress/test_duplicate_layers.py (L15)`) that demonstrates the problem using a failpoint However, the test in the test suite doesn't use the failpoint to induce a crash that could legitimately happen in production. What is does instead is to return early with an `Ok()`, so that the code in Pageserver that handles duplicate layers (item 1) actually gets exercised. That "return early" would be a bug in the routine if it happened in production. So, the tests in the test suite are tests for their own sake, but don't serve to actually regress-test any production behavior. # Problem 2 Further, if production code _did_ (it nowawdays doesn't!) create a duplicate layer, the code in Pageserver that handles the condition (item 1 above) is too little and too late: * the code handles it by discarding the newer `struct Layer`; that's good. * however, on disk, we have already overwritten the old with the new layer file * the fact that we do it atomically doesn't matter because ... * if the new layer file is not bit-identical, then we have a cache coherency problem * PS PageCache block cache: caches old bit battern * blob_io offsets stored in variables, based on pre-overwrite bit pattern / offsets * => reading based on these offsets from the new file might yield different data than before # Solution - Remove the test suite code pertaining to Problem 1 - Move & rename test suite code that actually tests RFC-27 crash-consistent layer map. - Remove the Pageserver code that handles duplicate layers too late (Problem 1) - Use `RENAME_NOREPLACE` to prevent over-rename the file during `.finish()`, bail with an error if it happens (Problem 2) - This bailing prevents the caller from even trying to insert into the layer map, as they don't even get a `struct Layer` at hand. - Add `abort`s in the place where we have the layer map lock and check for duplicates (Problem 2) - Note again, we can't reach there because we bail from `.finish()` much earlier in the code. - Share the logic to clean up after failed `.finish()` between image layers and delta layers (drive-by cleanup) - This exposed that test `image_layer_rewrite` was overwriting layer files in place. Fix the test. # Future Work This PR adds a new failure scenario that was previously "papered over" by the overwriting of layers: 1. Start a compaction that will produce 3 layers: A, B, C 2. Layer A is `finish()`ed successfully. 3. Layer B fails mid-way at some `put_value()`. 4. Compaction bails out, sleeps 20s. 5. Some disk space gets freed in the meantime. 6. Compaction wakes from sleep, another iteration starts, it attempts to write Layer A again. But the `.finish()` fails because A already exists on disk. The failure in step 5 is new with this PR, and it causes the compaction to get stuck. Before, it would silently overwrite the file and "successfully" complete the second iteration. The mitigation for this is to `/reset` the tenant.	2024-06-04 16:16:23 +00:00
Joonas Koivunen	ef83f31e77	pagectl: key command for dumping what we know about the key (#7890 ) What we know about the key via added `pagectl key $key` command: - debug formatting - shard placement when `--shard-count` is specified - different boolean queries in `key.rs` - aux files v2 Example: ``` $ cargo run -qp pagectl -- key 000000063F00004005000060270000100E2C parsed from hex: 000000063F00004005000060270000100E2C: Key { field1: 0, field2: 1599, field3: 16389, field4: 24615, field5: 0, field6: 1052204 } rel_block: true rel_vm_block: false rel_fsm_block: false slru_block: false inherited: true rel_size: false slru_segment_size: false recognized kind: None ```	2024-05-31 18:19:41 +00:00
Arseny Sher	3797566c36	safekeeper: test pull_timeline with WAL gc. Do pull_timeline while WAL is being removed. To this end - extract pausable_failpoint to utils, sprinkle pull_timeline with it - add 'checkpoint' sk http endpoint to force WAL removal. After fixing checking for pull file status code test fails so far which is expected.	2024-05-25 06:06:32 +03:00
Joonas Koivunen	a3f5b83677	chore: lower gate guard drop logging threshold to 100ms (#7862 ) We have some 1001ms cases, which do not yield gate guard context.	2024-05-24 14:07:58 +01:00
Joonas Koivunen	6206f76419	build: run doctests (#7697 ) While switching to use nextest with the repository in `f28bdb6`, we had not noticed that it doesn't yet support running doctests. Run the doc tests before other tests.	2024-05-10 16:46:50 +02:00
Arthur Petukhovsky	50a45e67dc	Discover safekeepers via broker request (#7279 ) We had an incident where pageserver requests timed out because pageserver couldn't fetch WAL from safekeepers. This incident was caused by a bug in safekeeper logic for timeline activation, which prevented pageserver from finding safekeepers. This bug was since fixed, but there is still a chance of a similar bug in the future due to overall complexity. We add a new broker message to "signal interest" for timeline. This signal will be sent by pageservers `wait_lsn`, and safekeepers will receive this signal to start broadcasting broker messages. Then every broker subscriber will be able to find the safekeepers and connect to them (to start fetching WAL). This feature is not limited to pageservers and any service that wants to download WAL from safekeepers will be able to use this discovery request. This commit changes pageserver's connection_manager (walreceiver) to send a SafekeeperDiscoveryRequest when there is no information about safekeepers present in memory. Current implementation will send these requests only if there is an active wait_lsn() call and no more often than once per 10 seconds. Add `test_broker_discovery` to test this: safekeepers started with `--disable-periodic-broker-push` will not push info to broker so that pageserver must use a discovery to start fetching WAL. Add task_stats in safekeepers broker module to log a warning if there is no message received from the broker for the last 10 seconds. Closes #5471 --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-04-30 18:50:03 +00:00
John Spray	b655c7030f	neon_local: add "tenant import" (#7399 ) ## Problem Sometimes we have test data in the form of S3 contents that we would like to run live in a neon_local environment. ## Summary of changes - Add a storage controller API that imports an existing tenant. Currently this is equivalent to doing a create with a high generation number, but in future this would be something smarter to probe S3 to find the shards in a tenant and find generation numbers. - Add a `neon_local` command that invokes the import API, and then inspects timelines in the newly attached tenant to create matching branches.	2024-04-29 08:52:18 +01:00
Joonas Koivunen	3df67bf4d7	fix(Layer): metric regression with too many canceled evictions (#7363 ) #7030 introduced an annoying papercut, deeming a failure to acquire a strong reference to `LayerInner` from `DownloadedLayer::drop` as a canceled eviction. Most of the time, it wasn't that, but just timeline deletion or tenant detach with the layer not wanting to be deleted or evicted. When a Layer is dropped as part of a normal shutdown, the `Layer` is dropped first, and the `DownloadedLayer` the second. Because of this, we cannot detect eviction being canceled from the `DownloadedLayer::drop`. We can detect it from `LayerInner::drop`, which this PR adds. Test case is added which before had 1 started eviction, 2 canceled. Now it accurately finds 1 started, 1 canceled.	2024-04-18 15:27:58 +00:00
Christian Schwarz	2d5a8462c8	add `async` walredo mode (disabled-by-default, opt-in via config) (#6548 ) Before this PR, the `nix::poll::poll` call would stall the executor. This PR refactors the `walredo::process` module to allow for different implementations, and adds a new `async` implementation which uses `tokio::process::ChildStd{in,out}` for IPC. The `sync` variant remains the default for now; we'll do more testing in staging and gradual rollout to prod using the config variable. Performance ----------- I updated `bench_walredo.rs`, demonstrating that a single `async`-based walredo manager used by N=1...128 tokio tasks has lower latency and higher throughput. I further did manual less-micro-benchmarking in the real pageserver binary. Methodology & results are published here: https://neondatabase.notion.site/2024-04-08-async-walredo-benchmarking-8c0ed3cc8d364a44937c4cb50b6d7019?pvs=4 tl;dr: - use pagebench against a pageserver patched to answer getpage request & small-enough working set to fit into PS PageCache / kernel page cache. - compare knee in the latency/throughput curve - N tenants, each 1 pagebench clients - sync better throughput at N < 30, async better at higher N - async generally noticable but not much worse p99.X tail latencies - eyeballing CPU efficiency in htop, `async` seems significantly more CPU efficient at ca N=[0.5ncpus, 1.5ncpus], worse than `sync` outside of that band Mental Model For Walredo & Scheduler Interactions ------------------------------------------------- Walredo is CPU-/DRAM-only work. This means that as soon as the Pageserver writes to the pipe, the walredo process becomes runnable. To the Linux kernel scheduler, the `$ncpus` executor threads and the walredo process thread are just `struct task_struct`, and it will divide CPU time fairly among them. In `sync` mode, there are always `$ncpus` runnable `struct task_struct` because the executor thread blocks while `walredo` runs, and the executor thread becomes runnable when the `walredo` process is done handling the request. In `async` mode, the executor threads remain runnable unless there are no more runnable tokio tasks, which is unlikely in a production pageserver. The above means that in `sync` mode, there is an implicit concurrency limit on concurrent walredo requests (`$num_runtimes * $num_executor_threads_per_runtime`). And executor threads do not compete in the Linux kernel scheduler for CPU time, due to the blocked-runnable-ping-pong. In `async` mode, there is no concurrency limit, and the walredo tasks compete with the executor threads for CPU time in the kernel scheduler. If we're not CPU-bound, `async` has a pipelining and hence throughput advantage over `sync` because one executor thread can continue processing requests while a walredo request is in flight. If we're CPU-bound, under a fair CPU scheduler, the fixed number of executor threads has to share CPU time with the aggregate of walredo processes. It's trivial to reason about this in `sync` mode due to the blocked-runnable-ping-pong. In `async` mode, at 100% CPU, the system arrives at some (potentially sub-optiomal) equilibrium where the executor threads get just enough CPU time to fill up the remaining CPU time with runnable walredo process. Why `async` mode Doesn't Limit Walredo Concurrency -------------------------------------------------- To control that equilibrium in `async` mode, one may add a tokio semaphore to limit the number of in-flight walredo requests. However, the placement of such a semaphore is non-trivial because it means that tasks queuing up behind it hold on to their request-scoped allocations. In the case of walredo, that might be the entire reconstruct data. We don't limit the number of total inflight Timeline::get (we only throttle admission). So, that queue might lead to an OOM. The alternative is to acquire the semaphore permit before collecting reconstruct data. However, what if we need to on-demand download? A combination of semaphores might help: one for reconstruct data, one for walredo. The reconstruct data semaphore permit is dropped after acquiring the walredo semaphore permit. This scheme effectively enables both a limit on in-flight reconstruct data and walredo concurrency. However, sizing the amount of permits for the semaphores is tricky: - Reconstruct data retrieval is a mix of disk IO and CPU work. - If we need to do on-demand downloads, it's network IO + disk IO + CPU work. - At this time, we have no good data on how the wall clock time is distributed. It turns out that, in my benchmarking, the system worked fine without a semaphore. So, we're shipping async walredo without one for now. Future Work ----------- We will do more testing of `async` mode and gradual rollout to prod using the config flag. Once that is done, we'll remove `sync` mode to avoid the temporary code duplication introduced by this PR. The flag will be removed. The `wait()` for the child process to exit is still synchronous; the comment [here]( `655d3b6468/pageserver/src/walredo.rs (L294-L306)`) is still a valid argument in favor of that. The `sync` mode had another implicit advantage: from tokio's perspective, the calling task was using up coop budget. But with `async` mode, that's no longer the case -- to tokio, the writes to the child process pipe look like IO. We could/should inform tokio about the CPU time budget consumed by the task to achieve fairness similar to `sync`. However, the [runtime function for this is `tokio_unstable`](`https://docs.rs/tokio/latest/tokio/task/fn.consume_budget.html). Refs ---- refs #6628 refs https://github.com/neondatabase/neon/issues/2975	2024-04-15 22:14:42 +02:00
Kevin Mingtarja	a306d0a54b	implement Serialize/Deserialize for SystemTime with RFC3339 format (#7203 ) ## Problem We have two places that use a helper (`ser_rfc3339_millis`) to get serde to stringify SystemTimes into the desired format. ## Summary of changes Created a new module `utils::serde_system_time` and inside it a wrapper type `SystemTime` for `std::time::SystemTime` that serializes/deserializes to the RFC3339 format. This new type is then used in the two places that were previously using the helper for serialization, thereby eliminating the need to decorate structs. Closes #7151.	2024-04-08 15:53:07 +01:00
Christian Schwarz	1081a4d246	pageserver: option to run with just one tokio runtime (#7331 ) This PR is an off-by-default revision v2 of the (since-reverted) PR #6555 / commit `3220f830b7fbb785d6db8a93775f46314f10a99b`. See that PR for details on why running with a single runtime is desirable and why we should be ready. We reverted #6555 because it showed regressions in prodlike cloudbench, see the revert commit message `ad072de4209193fd21314cf7f03f14df4fa55eb1` for more context. This PR makes it an opt-in choice via an env var. The default is to use the 4 separate runtimes that we have today, there shouldn't be any performance change. I tested manually that the env var & added metric works. ``` # undefined env var => no change to before this PR, uses 4 runtimes ./target/debug/neon_local start # defining the env var enables one-runtime mode, value defines that one runtime's configuration NEON_PAGESERVER_USE_ONE_RUNTIME=current_thread ./target/debug/neon_local start NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:1 ./target/debug/neon_local start NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:2 ./target/debug/neon_local start NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:default ./target/debug/neon_local start ``` I want to use this change to do more manualy testing and potentially testing in staging. Future Work ----------- Testing / deployment ergonomics would be better if this were a variable in `pageserver.toml`. It can be done, but, I don't need it right now, so let's stick with the env var.	2024-04-08 16:27:08 +02:00
Christian Schwarz	3de416a016	refactor(walreceiver): eliminate task_mgr usage (#7260 ) We want to move the code base away from task_mgr. This PR refactors the walreceiver code such that it doesn't use `task_mgr` anymore. # Background As a reminder, there are three tasks in a Timeline that's ingesting WAL. `WalReceiverManager`, `WalReceiverConnectionHandler`, and `WalReceiverConnectionPoller`. See the documentation in `task_mgr.rs` for how they interact. Before this PR, cancellation was requested through task_mgr::shutdown_token() and `TaskHandle::shutdown`. Wait-for-task-finish was implemented using a mixture of `task_mgr::shutdown_tasks` and `TaskHandle::shutdown`. This drawing might help: <img width="300" alt="image" src="https://github.com/neondatabase/neon/assets/956573/b6be7ad6-ecb3-41d0-b410-ec85cb8d6d20"> # Changes For cancellation, the entire WalReceiver task tree now has a `child_token()` of `Timeline::cancel`. The `TaskHandle` no longer is a cancellation root. This means that `Timeline::cancel.cancel()` is propagated. For wait-for-task-finish, all three tasks in the task tree hold the `Timeline::gate` open until they exit. The downside of using the `Timeline::gate` is that we can no longer wait for just the walreceiver to shut down, which is particularly relevant for `Timeline::flush_and_shutdown`. Effectively, it means that we might ingest more WAL while the `freeze_and_flush()` call is ongoing. Also, drive-by-fix the assertiosn around task kinds in `wait_lsn`. The check for `WalReceiverConnectionHandler` was ineffective because that never was a task_mgr task, but a TaskHandle task. Refine the assertion to check whether we would wait, and only fail in that case. # Alternatives I contemplated (ab-)using the `Gate` by having a separate `Gate` for `struct WalReceiver`. All the child tasks would use _that_ gate instead of `Timeline::gate`. And `struct WalReceiver` itself would hold an `Option<GateGuard>` of the `Timeline::gate`. Then we could have a `WalReceiver::stop` function that closes the WalReceiver's gate, then drops the `WalReceiver::Option<GateGuard>`. However, such design would mean sharing the WalReceiver's `Gate` in an `Arc`, which seems awkward. A proper abstraction would be to make gates hierarchical, analogous to CancellationToken. In the end, @jcsp and I talked it over and we determined that it's not worth the effort at this time. # Refs part of #7062	2024-04-03 12:28:04 +02:00
Arpad Müller	3ee34a3f26	Update Rust to 1.77.0 (#7198 ) Release notes: https://blog.rust-lang.org/2024/03/21/Rust-1.77.0.html Thanks to #6886 the diff is reasonable, only for one new lint `clippy::suspicious_open_options`. I added `truncate()` calls to the places where it is obviously the right choice to me, and added allows everywhere else, leaving it for followups. I had to specify cargo install --locked because the build would fail otherwise. This was also recommended by upstream.	2024-03-22 06:52:31 +00:00
Vlad Lazar	c75b584430	storage_controller: add metrics (#7178 ) ## Problem Storage controller had basically no metrics. ## Summary of changes 1. Migrate the existing metrics to use Conrad's [`measured`](https://docs.rs/measured/0.0.14/measured/) crate. 2. Add metrics for incoming http requests 3. Add metrics for outgoing http requests to the pageserver 4. Add metrics for outgoing pass through requests to the pageserver 5. Add metrics for database queries Note that the metrics response for the attachment service does not use chunked encoding like the rest of the metrics endpoints. Conrad has kindly extended the crate such that it can now be done. Let's leave it for a follow-up since the payload shouldn't be that big at this point. Fixes https://github.com/neondatabase/neon/issues/6875	2024-03-21 12:00:20 +00:00
Jure Bajic	94138c1a28	Enforce LSN ordering of batch entries (#7071 ) ## Summary of changes Enforce LSN ordering of batch entries. Closes https://github.com/neondatabase/neon/issues/6707	2024-03-21 09:17:24 +00:00
Joonas Koivunen	a95c41f463	fix(heavier_once_cell): take_and_deinit should take ownership (#7185 ) Small fix to remove confusing `mut` bindings. Builds upon #7175, split off from #7030. Cc: #5331.	2024-03-21 00:42:38 +02:00
Tristan Partin	64c6dfd3e4	Move functions for creating/extracting tarballs into utils Useful for other code paths which will handle zstd compression and decompression.	2024-03-19 10:50:41 -05:00
Arthur Petukhovsky	ad5efb49ee	Support backpressure for sharding (#7100 ) Add shard_number to PageserverFeedback and parse it on the compute side. When compute receives a new ps_feedback, it calculates min LSNs among feedbacks from all shards, and uses those LSNs for backpressure. Add `test_sharding_backpressure` to verify that backpressure slows down compute to wait for the slowest shard.	2024-03-18 21:54:44 +00:00
Christian Schwarz	ad6f538aef	tokio-epoll-uring: use it for on-demand downloads (#6992 ) # Problem On-demand downloads are still using `tokio::fs`, which we know is inefficient. # Changes - Add `pagebench ondemand-download-churn` to quantify on-demand download throughput - Requires dumping layer map, which required making `history_buffer` impl `Deserialize` - Implement an equivalent of `tokio::io::copy_buf` for owned buffers => `owned_buffers_io` module and children. - Make layer file download sensitive to `io_engine::get()`, using VirtualFile + above copy loop - For this, I had to move some code into the `retry_download`, e.g., `sync_all()` call. Drive-by: - fix missing escaping in `scripts/ps_ec2_setup_instance_store` - if we failed in retry_download to create a file, we'd try to remove it, encounter `NotFound`, and `abort()` the process using `on_fatal_io_error`. This PR adds treats `NotFound` as a success. # Testing Functional - The copy loop is generic & unit tested. Performance - Used the `ondemand-download-churn` benchmark to manually test against real S3. - Results (public Notion page): https://neondatabase.notion.site/Benchmarking-tokio-epoll-uring-on-demand-downloads-2024-04-15-newer-code-03c0fdc475c54492b44d9627b6e4e710?pvs=4 - Performance is equivalent at low concurrency. Jumpier situation at high concurrency, but, still less CPU / throughput with tokio-epoll-uring. - It’s a win. # Future Work Turn the manual performance testing described in the above results document into a performance regression test: https://github.com/neondatabase/neon/issues/7146	2024-03-15 18:57:05 +00:00
Joonas Koivunen	59b6cce418	heavier_once_cell: add detached init support (#7135 ) Aiming for the design where `heavier_once_cell::OnceCell` is initialized by a future factory lead to awkwardness with how `LayerInner::get_or_maybe_download` looks right now with the `loop`. The loop helps with two situations: - an eviction has been scheduled but has not yet happened, and a read access should cancel the eviction - a previous `LayerInner::get_or_maybe_download` that canceled a pending eviction was canceled leaving the `heavier_once_cell::OnceCell` uninitialized but needing repair by the next `LayerInner::get_or_maybe_download` By instead supporting detached initialization in `heavier_once_cell::OnceCell` via an `OnceCell::get_or_detached_init`, we can fix what the monolithic #7030 does: - spawned off download task initializes the `heavier_once_cell::OnceCell` regardless of the download starter being canceled - a canceled `LayerInner::get_or_maybe_download` no longer stops eviction but can win it if not canceled Split off from #7030. Cc: #5331	2024-03-15 15:54:28 +00:00
Arthur Petukhovsky	580e136b2e	Forward all backpressure feedback to compute (#7079 ) Previously we aggregated ps_feedback on each safekeeper and sent it to walproposer with every AppendResponse. This PR changes it to send ps_feedback to walproposer right after receiving it from pageserver, without aggregating it in memory. Also contains some preparations for implementing backpressure support for sharding.	2024-03-12 12:14:02 +00:00
John Spray	20d0939b00	control_plane/attachment_service: implement PlacementPolicy::Secondary, configuration updates (#6521 ) During onboarding, the control plane may attempt ad-hoc creation of a secondary location to facilitate live migration. This gives us two problems to solve: - Accept 'Secondary' mode in /location_config and use it to put the tenant into secondary mode on some physical pageserver, then pass through /tenant/xyz/secondary/download requests - Create tenants with no generation initially, since the initial `Secondary` mode call will not provide us a generation. This PR also fixes modification of a tenant's TenantConf during /location_conf, which was previously ignored, and refines the flow for config modification: - avoid bumping generations when the only reason we're reconciling an attached location is a config change - increment TenantState.sequence when spawning a reconciler: usually schedule() does this, but when we do config changes that doesn't happen, so without this change waiters would think reconciliation was done immediately. `sequence` is a bit of a murky thing right now, as it's dual-purposed for tracking waiters, and for checking if an existing reconciliation is already making updates to our current sequence. I'll follow up at some point to clarify it's purpose. - test config modification at the end of onboarding test	2024-03-01 20:25:53 +00:00
Arpad Müller	82853cc1d1	Fix warnings and compile errors on nightly (#6886 ) Nightly has added a bunch of compiler and linter warnings. There is also two dependencies that fail compilation on latest nightly due to using the old `stdsimd` feature name. This PR fixes them.	2024-03-01 17:14:19 +01:00
Joonas Koivunen	5ab10d051d	metrics: record more details of the responding (#6979 ) On eu-west-1 during benchmarks we sometimes lose samples. Add more time measurements.	2024-03-01 14:04:39 +00:00
Vlad Lazar	5984edaecd	libs: fix expired token in auth decode test (#6963 ) The test token expired earlier today (1709200879). I regenerated the token, but without an expiration date this time.	2024-02-29 13:55:38 +00:00

1 2 3 4 5

246 Commits