rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 13:10:37 +00:00

Author	SHA1	Message	Date
John Spray	e22c072064	remote_storage: fix prefix handling in remote storage & clean up (#7431 ) ## Problem Split off from https://github.com/neondatabase/neon/pull/7399, which is the first piece of code that does a WithDelimiter object listing using a prefix that isn't a full directory name. ## Summary of changes - Revise list function to not append a `/` to the prefix -- prefixes don't have to end with a slash. - Fix local_fs implementation of list to not assume that WithDelimiter case will always use a directory as a prerfix. - Remove `list_files`, `list_prefixes` wrappers, as they add little value and obscure the underlying list function -- we need callers to understand the semantics of what they're really calling (listobjectsv2)	2024-04-23 16:24:51 +01:00
Alex Chi Z	89f023e6b0	feat(pageserver): add metadata key range and aux key encoding (#7401 ) Extracted from https://github.com/neondatabase/neon/pull/7375. We assume everything >= 0x80 are metadata keys. AUX file keys are part of the metadata keys, and we use `0x90` as the prefix for AUX file keys. The AUX file encoding is described in the code comment. We use xxhash128 as the hash algorithm. It seems to be portable according to the introduction, > xxHash is an Extremely fast Hash algorithm, processing at RAM speed limits. Code is highly portable, and produces hashes identical across all platforms (little / big endian). ...though whether the Rust version follows the same convention is unknown and might need manual review of the library. Anyways, we can always change the hash algorithm before rolling it out in staging/end-user, and I made a quick decision to use xxhash here because it generates 128b hash + portable. We can save the discussion of which hash algorithm to use later. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-04-23 15:16:04 +00:00
Vlad Lazar	a9fda8c832	pageserver: fix vectored read aux key handling (#7404 ) ## Problem Vectored get would descend into ancestor timelines for aux files. This is not the behaviour of the legacy read path and blocks cutting over to the vectored read path. Fixes https://github.com/neondatabase/neon/issues/7379 ## Summary of Changes Treat non inherited keys specially in vectored get. At the point when we want to descend into the ancestor mark all pending non inherited keys as errored out at the key level. Note that this diverges from the standard vectored get behaviour for missing keys which is a top level error. This divergence is required to avoid blocking compaction in case such an error is encountered when compaction aux files keys. I'm pretty sure the bug I just described predates the vectored get implementation, but it's still worth fixing.	2024-04-23 14:03:33 +01:00
Arpad Müller	fa12d60237	Don't pass tenant_id in location_config requests from storage controller (#7476 ) Tested this locally via a simple patch, the `tenant_id` is now gone from the json. Follow-up of #7055, prerequisite for #7469.	2024-04-23 11:42:58 +00:00
John Spray	0bd16182f7	pageserver: fix unlogged relations with sharding (#7454 ) ## Problem - #7451 INIT_FORKNUM blocks must be stored on shard 0 to enable including them in basebackup. This issue can be missed in simple tests because creating an unlogged table isn't sufficient -- to repro I had to create an _index_ on an unlogged table (then restart the endpoint). Closes: #7451 ## Summary of changes - Add a reproducer for the issue. - Tweak the condition for `key_is_shard0` to include anything that isn't a normal relation block _and_ any normal relation block whose forknum is INIT_FORKNUM. - To enable existing databases to recover from the issue, add a special case that omits relations if they were stored on the wrong INITFORK. This enables postgres to start and the user to drop the table and recreate it.	2024-04-22 11:47:24 +00:00
Heikki Linnakangas	00d9c2d9a8	Make another walcraft test more robust (#7439 ) There were two issues with the test at page boundaries: 1. If the first logical message with 10 bytes payload crossed a page boundary, the calculated 'base_size' was too large because it included the page header. 2. If it was inserted near the end of a page so that there was not enough room for another one, we did "remaining_lsn += XLOG_BLCKSZ" but that didn't take into account the page headers either. As a result, the test would fail if the WAL insert position at the beginning of the test was too close to the end of a WAL page. Fix the calculations by repeating the 10-byte logical message if the starting position is not suitable. I bumped into this with PR #7377; it changed the arguments of a few SQL functions in neon_test_utils extension, which changed the WAL positions slightly, and caused a test failure. This is similar to https://github.com/neondatabase/neon/pull/7436, but for different test.	2024-04-22 10:58:28 +03:00
Heikki Linnakangas	3a673dce67	Make test less sensitive to exact WAL positions (#7436 ) As noted in the comment, the craft_internal() function fails if the inserted WAL happens to land at page boundary. I bumped into that with PR #7377; it changed the arguments of a few SQL functions in neon_test_utils extension, which changed the WAL positions slightly, and caused a test failure.	2024-04-22 10:58:10 +03:00
Joonas Koivunen	3df67bf4d7	fix(Layer): metric regression with too many canceled evictions (#7363 ) #7030 introduced an annoying papercut, deeming a failure to acquire a strong reference to `LayerInner` from `DownloadedLayer::drop` as a canceled eviction. Most of the time, it wasn't that, but just timeline deletion or tenant detach with the layer not wanting to be deleted or evicted. When a Layer is dropped as part of a normal shutdown, the `Layer` is dropped first, and the `DownloadedLayer` the second. Because of this, we cannot detect eviction being canceled from the `DownloadedLayer::drop`. We can detect it from `LayerInner::drop`, which this PR adds. Test case is added which before had 1 started eviction, 2 canceled. Now it accurately finds 1 started, 1 canceled.	2024-04-18 15:27:58 +00:00
Christian Schwarz	2d5a8462c8	add `async` walredo mode (disabled-by-default, opt-in via config) (#6548 ) Before this PR, the `nix::poll::poll` call would stall the executor. This PR refactors the `walredo::process` module to allow for different implementations, and adds a new `async` implementation which uses `tokio::process::ChildStd{in,out}` for IPC. The `sync` variant remains the default for now; we'll do more testing in staging and gradual rollout to prod using the config variable. Performance ----------- I updated `bench_walredo.rs`, demonstrating that a single `async`-based walredo manager used by N=1...128 tokio tasks has lower latency and higher throughput. I further did manual less-micro-benchmarking in the real pageserver binary. Methodology & results are published here: https://neondatabase.notion.site/2024-04-08-async-walredo-benchmarking-8c0ed3cc8d364a44937c4cb50b6d7019?pvs=4 tl;dr: - use pagebench against a pageserver patched to answer getpage request & small-enough working set to fit into PS PageCache / kernel page cache. - compare knee in the latency/throughput curve - N tenants, each 1 pagebench clients - sync better throughput at N < 30, async better at higher N - async generally noticable but not much worse p99.X tail latencies - eyeballing CPU efficiency in htop, `async` seems significantly more CPU efficient at ca N=[0.5ncpus, 1.5ncpus], worse than `sync` outside of that band Mental Model For Walredo & Scheduler Interactions ------------------------------------------------- Walredo is CPU-/DRAM-only work. This means that as soon as the Pageserver writes to the pipe, the walredo process becomes runnable. To the Linux kernel scheduler, the `$ncpus` executor threads and the walredo process thread are just `struct task_struct`, and it will divide CPU time fairly among them. In `sync` mode, there are always `$ncpus` runnable `struct task_struct` because the executor thread blocks while `walredo` runs, and the executor thread becomes runnable when the `walredo` process is done handling the request. In `async` mode, the executor threads remain runnable unless there are no more runnable tokio tasks, which is unlikely in a production pageserver. The above means that in `sync` mode, there is an implicit concurrency limit on concurrent walredo requests (`$num_runtimes * $num_executor_threads_per_runtime`). And executor threads do not compete in the Linux kernel scheduler for CPU time, due to the blocked-runnable-ping-pong. In `async` mode, there is no concurrency limit, and the walredo tasks compete with the executor threads for CPU time in the kernel scheduler. If we're not CPU-bound, `async` has a pipelining and hence throughput advantage over `sync` because one executor thread can continue processing requests while a walredo request is in flight. If we're CPU-bound, under a fair CPU scheduler, the fixed number of executor threads has to share CPU time with the aggregate of walredo processes. It's trivial to reason about this in `sync` mode due to the blocked-runnable-ping-pong. In `async` mode, at 100% CPU, the system arrives at some (potentially sub-optiomal) equilibrium where the executor threads get just enough CPU time to fill up the remaining CPU time with runnable walredo process. Why `async` mode Doesn't Limit Walredo Concurrency -------------------------------------------------- To control that equilibrium in `async` mode, one may add a tokio semaphore to limit the number of in-flight walredo requests. However, the placement of such a semaphore is non-trivial because it means that tasks queuing up behind it hold on to their request-scoped allocations. In the case of walredo, that might be the entire reconstruct data. We don't limit the number of total inflight Timeline::get (we only throttle admission). So, that queue might lead to an OOM. The alternative is to acquire the semaphore permit before collecting reconstruct data. However, what if we need to on-demand download? A combination of semaphores might help: one for reconstruct data, one for walredo. The reconstruct data semaphore permit is dropped after acquiring the walredo semaphore permit. This scheme effectively enables both a limit on in-flight reconstruct data and walredo concurrency. However, sizing the amount of permits for the semaphores is tricky: - Reconstruct data retrieval is a mix of disk IO and CPU work. - If we need to do on-demand downloads, it's network IO + disk IO + CPU work. - At this time, we have no good data on how the wall clock time is distributed. It turns out that, in my benchmarking, the system worked fine without a semaphore. So, we're shipping async walredo without one for now. Future Work ----------- We will do more testing of `async` mode and gradual rollout to prod using the config flag. Once that is done, we'll remove `sync` mode to avoid the temporary code duplication introduced by this PR. The flag will be removed. The `wait()` for the child process to exit is still synchronous; the comment [here]( `655d3b6468/pageserver/src/walredo.rs (L294-L306)`) is still a valid argument in favor of that. The `sync` mode had another implicit advantage: from tokio's perspective, the calling task was using up coop budget. But with `async` mode, that's no longer the case -- to tokio, the writes to the child process pipe look like IO. We could/should inform tokio about the CPU time budget consumed by the task to achieve fairness similar to `sync`. However, the [runtime function for this is `tokio_unstable`](`https://docs.rs/tokio/latest/tokio/task/fn.consume_budget.html). Refs ---- refs #6628 refs https://github.com/neondatabase/neon/issues/2975	2024-04-15 22:14:42 +02:00
John Spray	83cdbbb89a	pageserver: improve readability of shard.rs (#7330 ) No functional changes, this is a comments/naming PR. While merging sharding changes, some cleanup of the shard.rs types was deferred. In this PR: - Rename `is_zero` to `is_shard_zero` to make clear that this method doesn't literally mean that the entire object is zeros, just that it refers to the 0th shard in a tenant. - Pull definitions of types to the top of shard.rs and add a big comment giving an overview of which type is for what. Closes: https://github.com/neondatabase/neon/issues/6072	2024-04-15 11:50:26 +01:00
Conrad Ludgate	5299f917d6	proxy: replace prometheus with measured (#6717 ) ## Problem My benchmarks show that prometheus is not very good. https://github.com/conradludgate/measured We're already using it in storage_controller and it seems to be working well. ## Summary of changes Replace prometheus with my new measured crate in proxy only. Apologies for the large diff. I tried to keep it as minimal as I could. The label types add a bit of boiler plate (but reduce the chance we mistype the labels), and some of our custom metrics like CounterPair and HLL needed to be rewritten.	2024-04-11 16:26:01 +00:00
Conrad Ludgate	f212630da2	update measured with some more convenient features (#7334 ) ## Problem Some awkwardness in the measured API. Missing process metrics. ## Summary of changes Update measured to use the new convenience setup features. Added measured-process lib. Added measured support for libmetrics	2024-04-08 18:01:41 +00:00
Kevin Mingtarja	a306d0a54b	implement Serialize/Deserialize for SystemTime with RFC3339 format (#7203 ) ## Problem We have two places that use a helper (`ser_rfc3339_millis`) to get serde to stringify SystemTimes into the desired format. ## Summary of changes Created a new module `utils::serde_system_time` and inside it a wrapper type `SystemTime` for `std::time::SystemTime` that serializes/deserializes to the RFC3339 format. This new type is then used in the two places that were previously using the helper for serialization, thereby eliminating the need to decorate structs. Closes #7151.	2024-04-08 15:53:07 +01:00
Christian Schwarz	1081a4d246	pageserver: option to run with just one tokio runtime (#7331 ) This PR is an off-by-default revision v2 of the (since-reverted) PR #6555 / commit `3220f830b7fbb785d6db8a93775f46314f10a99b`. See that PR for details on why running with a single runtime is desirable and why we should be ready. We reverted #6555 because it showed regressions in prodlike cloudbench, see the revert commit message `ad072de4209193fd21314cf7f03f14df4fa55eb1` for more context. This PR makes it an opt-in choice via an env var. The default is to use the 4 separate runtimes that we have today, there shouldn't be any performance change. I tested manually that the env var & added metric works. ``` # undefined env var => no change to before this PR, uses 4 runtimes ./target/debug/neon_local start # defining the env var enables one-runtime mode, value defines that one runtime's configuration NEON_PAGESERVER_USE_ONE_RUNTIME=current_thread ./target/debug/neon_local start NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:1 ./target/debug/neon_local start NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:2 ./target/debug/neon_local start NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:default ./target/debug/neon_local start ``` I want to use this change to do more manualy testing and potentially testing in staging. Future Work ----------- Testing / deployment ergonomics would be better if this were a variable in `pageserver.toml`. It can be done, but, I don't need it right now, so let's stick with the env var.	2024-04-08 16:27:08 +02:00
John Spray	66fc465484	Clean up 'attachment service' names to storage controller (#7326 ) The binary etc were renamed some time ago, but the path in the source tree remained "attachment_service" to avoid disruption to ongoing PRs. There aren't any big PRs out right now, so it's a good time to cut over. - Rename `attachment_service` to `storage_controller` - Move it to the top level for symmetry with `storage_broker` & to avoid mixing the non-prod neon_local stuff (`control_plane/`) with the storage controller which is a production component.	2024-04-05 16:18:00 +01:00
Arpad Müller	0fa517eb80	Update test-context dependency to 0.3 (#7303 ) Updates the `test-context` dev-dependency of the `remote_storage` crate to 0.3. This removes a lot of `async_trait` instances. Related earlier work: #6305, #6464	2024-04-05 15:53:29 +02:00
Arthur Petukhovsky	3f77f26aa2	Upload partial segments (#6530 ) Add support for backing up partial segments to remote storage. Disabled by default, can be enabled with `--partial-backup-enabled`. Safekeeper timeline has a background task which is subscribed to `commit_lsn` and `flush_lsn` updates. After the partial segment was updated (`flush_lsn` was changed), the segment will be uploaded to S3 in about 15 minutes. The filename format for partial segments is `Segment_Term_Flush_Commit_skNN.partial`, where: - `Segment` – the segment name, like `000000010000000000000001` - `Term` – current term - `Flush` – flush_lsn in hex format `{:016X}`, e.g. `00000000346BC568` - `Commit` – commit_lsn in the same hex format - `NN` – safekeeper_id, like `1` The full object name example: `000000010000000000000002_2_0000000002534868_0000000002534410_sk1.partial` Each safekeeper will keep info about remote partial segments in its control file. Code updates state in the control file before doing any S3 operations. This way control file stores information about all potentially existing remote partial segments and can clean them up after uploading a newer version. Closes #6336	2024-04-03 15:20:51 +00:00
Christian Schwarz	3de416a016	refactor(walreceiver): eliminate task_mgr usage (#7260 ) We want to move the code base away from task_mgr. This PR refactors the walreceiver code such that it doesn't use `task_mgr` anymore. # Background As a reminder, there are three tasks in a Timeline that's ingesting WAL. `WalReceiverManager`, `WalReceiverConnectionHandler`, and `WalReceiverConnectionPoller`. See the documentation in `task_mgr.rs` for how they interact. Before this PR, cancellation was requested through task_mgr::shutdown_token() and `TaskHandle::shutdown`. Wait-for-task-finish was implemented using a mixture of `task_mgr::shutdown_tasks` and `TaskHandle::shutdown`. This drawing might help: <img width="300" alt="image" src="https://github.com/neondatabase/neon/assets/956573/b6be7ad6-ecb3-41d0-b410-ec85cb8d6d20"> # Changes For cancellation, the entire WalReceiver task tree now has a `child_token()` of `Timeline::cancel`. The `TaskHandle` no longer is a cancellation root. This means that `Timeline::cancel.cancel()` is propagated. For wait-for-task-finish, all three tasks in the task tree hold the `Timeline::gate` open until they exit. The downside of using the `Timeline::gate` is that we can no longer wait for just the walreceiver to shut down, which is particularly relevant for `Timeline::flush_and_shutdown`. Effectively, it means that we might ingest more WAL while the `freeze_and_flush()` call is ongoing. Also, drive-by-fix the assertiosn around task kinds in `wait_lsn`. The check for `WalReceiverConnectionHandler` was ineffective because that never was a task_mgr task, but a TaskHandle task. Refine the assertion to check whether we would wait, and only fail in that case. # Alternatives I contemplated (ab-)using the `Gate` by having a separate `Gate` for `struct WalReceiver`. All the child tasks would use _that_ gate instead of `Timeline::gate`. And `struct WalReceiver` itself would hold an `Option<GateGuard>` of the `Timeline::gate`. Then we could have a `WalReceiver::stop` function that closes the WalReceiver's gate, then drops the `WalReceiver::Option<GateGuard>`. However, such design would mean sharing the WalReceiver's `Gate` in an `Arc`, which seems awkward. A proper abstraction would be to make gates hierarchical, analogous to CancellationToken. In the end, @jcsp and I talked it over and we determined that it's not worth the effort at this time. # Refs part of #7062	2024-04-03 12:28:04 +02:00
John Spray	6e3834d506	controller: add `storcon-cli` (#7114 ) ## Problem During incidents, we may need to quickly access the storage controller's API without trying API client code or crafting `curl` CLIs on the fly. A basic CLI client is needed for this. ## Summary of changes - Update storage controller node listing API to only use public types in controller_api.rs - Add a storage controller API for listing tenants - Add a basic test that the CLI can list and modify nodes and tenants.	2024-04-03 10:07:56 +00:00
Vlad Lazar	090123a429	pageserver: check for new image layers based on ingested WAL (#7230 ) ## Problem Part of the legacy (but current) compaction algorithm is to find a stack of overlapping delta layers which will be turned into an image layer. This operation is exponential in terms of the number of matching layers and we do it roughly every 20 seconds. ## Summary of changes Only check if a new image layer is required if we've ingested a certain amount of WAL since the last check. The amount of wal is expressed in terms of multiples of checkpoint distance, with the intuition being that that there's little point doing the check if we only have two new L1 layers (not enough to create a new image).	2024-03-28 17:44:55 +00:00
John Spray	6633332e67	storage controller: tenant scheduling policy (#7262 ) ## Problem In the event of bugs with scheduling or reconciliation, we need to be able to switch this off at a per-tenant granularity. This is intended to mitigate risk of issues with https://github.com/neondatabase/neon/pull/7181, which makes scheduling more involved. Closes: #7103 ## Summary of changes - Introduce a scheduling policy per tenant, with API to set it - Refactor persistent.rs helpers for updating tenants to be more general - Add tests	2024-03-28 14:19:25 +00:00
Conrad Ludgate	12512f3173	add authentication rate limiting (#6865 ) ## Problem https://github.com/neondatabase/cloud/issues/9642 ## Summary of changes 1. Make `EndpointRateLimiter` generic, renamed as `BucketRateLimiter` 2. Add support for claiming multiple tokens at once 3. Add `AuthRateLimiter` alias. 4. Check `(Endpoint, IP)` pair during authentication, weighted by how many hashes proxy would be doing. TODO: handle ipv6 subnets. will do this in a separate PR.	2024-03-26 19:31:19 +00:00
George Ma	d837ce0686	chore: remove repetitive words (#7206 ) Signed-off-by: availhang <mayangang@outlook.com>	2024-03-25 11:43:02 -04:00
Arseny Sher	a6c1fdcaf6	Try to fix test_crafted_wal_end flakiness. Postgres can always write some more WAL, so previous checks that WAL doesn't change after something had been crafted were wrong; remove them. Add comments here and there. should fix https://github.com/neondatabase/neon/issues/4691	2024-03-25 14:53:06 +03:00
Arpad Müller	3ee34a3f26	Update Rust to 1.77.0 (#7198 ) Release notes: https://blog.rust-lang.org/2024/03/21/Rust-1.77.0.html Thanks to #6886 the diff is reasonable, only for one new lint `clippy::suspicious_open_options`. I added `truncate()` calls to the places where it is obviously the right choice to me, and added allows everywhere else, leaving it for followups. I had to specify cargo install --locked because the build would fail otherwise. This was also recommended by upstream.	2024-03-22 06:52:31 +00:00
John Spray	06cb582d91	pageserver: extend /re-attach response to include tenant mode (#6941 ) This change improves the resilience of the system to unclean restarts. Previously, re-attach responses only included attached tenants - If the pageserver had local state for a secondary location, it would remain, but with no guarantee that it was still _meant_ to be there. After this change, the pageserver will only retain secondary locations if the /re-attach response indicates that they should still be there. - If the pageserver had local state for an attached location that was omitted from a re-attach response, it would be entirely detached. This is wasteful in a typical HA setup, where an offline node's tenants might have been re-attached elsewhere before it restarts, but the offline node's location should revert to a secondary location rather than being wiped. Including secondary tenants in the re-attach response enables the pageserver to avoid throwing away local state unnecessarily. In this PR: - The re-attach items are extended with a 'mode' field. - Storage controller populates 'mode' - Pageserver interprets it (default is attached if missing) to construct either a SecondaryTenant or a Tenant. - A new test exercises both cases.	2024-03-21 13:39:23 +00:00
Vlad Lazar	c75b584430	storage_controller: add metrics (#7178 ) ## Problem Storage controller had basically no metrics. ## Summary of changes 1. Migrate the existing metrics to use Conrad's [`measured`](https://docs.rs/measured/0.0.14/measured/) crate. 2. Add metrics for incoming http requests 3. Add metrics for outgoing http requests to the pageserver 4. Add metrics for outgoing pass through requests to the pageserver 5. Add metrics for database queries Note that the metrics response for the attachment service does not use chunked encoding like the rest of the metrics endpoints. Conrad has kindly extended the crate such that it can now be done. Let's leave it for a follow-up since the payload shouldn't be that big at this point. Fixes https://github.com/neondatabase/neon/issues/6875	2024-03-21 12:00:20 +00:00
Jure Bajic	94138c1a28	Enforce LSN ordering of batch entries (#7071 ) ## Summary of changes Enforce LSN ordering of batch entries. Closes https://github.com/neondatabase/neon/issues/6707	2024-03-21 09:17:24 +00:00
Joonas Koivunen	a95c41f463	fix(heavier_once_cell): take_and_deinit should take ownership (#7185 ) Small fix to remove confusing `mut` bindings. Builds upon #7175, split off from #7030. Cc: #5331.	2024-03-21 00:42:38 +02:00
John Spray	a5d5c2a6a0	storage controller: tech debt (#7165 ) This is a mixed bag of changes split out for separate review while working on other things, and batched together to reduce load on CI runners. Each commits stands alone for review purposes: - do_tenant_shard_split was a long function and had a synchronous validation phase at the start that could readily be pulled out into a separate function. This also avoids the special casing of ApiError::BadRequest when deciding whether an abort is needed on errors - Add a 'describe' API (GET on tenant ID) that will enable storcon-cli to see what's going on with a tenant - the 'locate' API wasn't really meant for use in the field. It's for tests: demote it to the /debug/ prefix - The `Single` placement policy was a redundant duplicate of Double(0), and Double was a bad name. Rename it Attached. (https://github.com/neondatabase/neon/issues/7107) - Some neon_local commands were added for debug/demos, which are now replaced by commands in storcon-cli (#7114 ). Even though that's not merged yet, we don't need the neon_local ones any more. Closes https://github.com/neondatabase/neon/issues/7107 ## Backward compat of Single/Double -> `Attached(n)` change A database migration is used to convert any existing values.	2024-03-19 16:08:20 +00:00
Tristan Partin	64c6dfd3e4	Move functions for creating/extracting tarballs into utils Useful for other code paths which will handle zstd compression and decompression.	2024-03-19 10:50:41 -05:00
Arthur Petukhovsky	ad5efb49ee	Support backpressure for sharding (#7100 ) Add shard_number to PageserverFeedback and parse it on the compute side. When compute receives a new ps_feedback, it calculates min LSNs among feedbacks from all shards, and uses those LSNs for backpressure. Add `test_sharding_backpressure` to verify that backpressure slows down compute to wait for the slowest shard.	2024-03-18 21:54:44 +00:00
John Spray	9752ad8489	pageserver, controller: improve secondary download APIs for large shards (#7131 ) ## Problem The existing secondary download API relied on the caller to wait as long as it took to complete -- for large shards that could be a long time, so typical clients that might have a baked-in ~30s timeout would have a problem. ## Summary of changes - Take a `wait_ms` query parameter to instruct the pageserver how long to wait: if the download isn't complete in this duration, then 201 is returned instead of 200. - For both 200 and 201 responses, include response body describing download progress, in terms of layers and bytes. This is sufficient for the caller to track how much data is being transferred and log/present that status. - In storage controller live migrations, use this API to apply a much longer outer timeout, with smaller individual per-request timeouts, and log the progress of the downloads. - Add a test that injects layer download delays to exercise the new behavior	2024-03-15 19:45:58 +00:00
Christian Schwarz	ad6f538aef	tokio-epoll-uring: use it for on-demand downloads (#6992 ) # Problem On-demand downloads are still using `tokio::fs`, which we know is inefficient. # Changes - Add `pagebench ondemand-download-churn` to quantify on-demand download throughput - Requires dumping layer map, which required making `history_buffer` impl `Deserialize` - Implement an equivalent of `tokio::io::copy_buf` for owned buffers => `owned_buffers_io` module and children. - Make layer file download sensitive to `io_engine::get()`, using VirtualFile + above copy loop - For this, I had to move some code into the `retry_download`, e.g., `sync_all()` call. Drive-by: - fix missing escaping in `scripts/ps_ec2_setup_instance_store` - if we failed in retry_download to create a file, we'd try to remove it, encounter `NotFound`, and `abort()` the process using `on_fatal_io_error`. This PR adds treats `NotFound` as a success. # Testing Functional - The copy loop is generic & unit tested. Performance - Used the `ondemand-download-churn` benchmark to manually test against real S3. - Results (public Notion page): https://neondatabase.notion.site/Benchmarking-tokio-epoll-uring-on-demand-downloads-2024-04-15-newer-code-03c0fdc475c54492b44d9627b6e4e710?pvs=4 - Performance is equivalent at low concurrency. Jumpier situation at high concurrency, but, still less CPU / throughput with tokio-epoll-uring. - It’s a win. # Future Work Turn the manual performance testing described in the above results document into a performance regression test: https://github.com/neondatabase/neon/issues/7146	2024-03-15 18:57:05 +00:00
Joonas Koivunen	59b6cce418	heavier_once_cell: add detached init support (#7135 ) Aiming for the design where `heavier_once_cell::OnceCell` is initialized by a future factory lead to awkwardness with how `LayerInner::get_or_maybe_download` looks right now with the `loop`. The loop helps with two situations: - an eviction has been scheduled but has not yet happened, and a read access should cancel the eviction - a previous `LayerInner::get_or_maybe_download` that canceled a pending eviction was canceled leaving the `heavier_once_cell::OnceCell` uninitialized but needing repair by the next `LayerInner::get_or_maybe_download` By instead supporting detached initialization in `heavier_once_cell::OnceCell` via an `OnceCell::get_or_detached_init`, we can fix what the monolithic #7030 does: - spawned off download task initializes the `heavier_once_cell::OnceCell` regardless of the download starter being canceled - a canceled `LayerInner::get_or_maybe_download` no longer stops eviction but can win it if not canceled Split off from #7030. Cc: #5331	2024-03-15 15:54:28 +00:00
John Spray	516f793ab4	remote_storage: make last_modified and etag mandatory (#7126 ) ## Problem These fields were only optional for the convenience of the `local_fs` test helper -- real remote storage backends provide them. It complicated any code that actually wanted to use them for anything. ## Summary of changes - Make these fields non-optional - For azure/S3 it is an error if the server doesn't provide them - For local_fs, use random strings as etags and the file's mtime for last_modified.	2024-03-15 13:37:49 +00:00
Vlad Lazar	38767ace68	storage_controller: periodic pageserver heartbeats (#7092 ) ## Problem If a pageserver was offline when the storage controller started, there was no mechanism to update the storage controller state when the pageserver becomes active. ## Summary of changes * Add a heartbeater module. The heartbeater must be driven by an external loop. * Integrate the heartbeater into the service. - Extend the types used by the service and scheduler to keep track of a nodes' utilisation score. - Add a background loop to drive the heartbeater and update the state based on the deltas it generated - Do an initial round of heartbeats at start-up	2024-03-14 15:21:36 +00:00
Arpad Müller	5309711691	Make tenant_id in TenantLocationConfigRequest optional (#7055 ) The `tenant_id` in `TenantLocationConfigRequest` in the `location_config` endpoint was only used in the storage controller/attachment service, and there it was only used for assertions and the creation part.	2024-03-13 17:30:29 +01:00
John Spray	1b41db8bdd	pageserver: enable setting stripe size inline with split request. (#7093 ) ## Summary - Currently we can set stripe size at tenant creation, but it doesn't mean anything until we have multiple shards - When onboarding an existing tenant, it will always get a default shard stripe size, so we would like to be able to pick the actual stripe size at the point we split. ## Why do this inline with a split? The alternative to this change would be to have a separate endpoint on the storage controller for setting the stripe size on a tenant, and only permit writes to that endpoint when the tenant has only a single shard. That would work, but be a little bit more work for a client, and not appreciably simpler (instead of having a special argument to the split functions, we'd have a special separate endpoint, and a requirement that the controller must sync its config down to the pageserver before calling the split API). Either approach would work, but this one feels a bit more robust end-to-end: the split API is the _very last moment_ that the stripe size is mutable, so if we aim to set it before splitting, it makes sense to do it as part of the same operation.	2024-03-12 20:41:08 +00:00
John Spray	7ae8364b0b	storage controller: register nodes in re-attach request (#7040 ) ## Problem Currently we manually register nodes with the storage controller, and use a script during deploy to register with the cloud control plane. Rather than extend that script further, nodes should just register on startup. ## Summary of changes - Extend the re-attach request to include an optional NodeRegisterRequest - If the `register` field is set, handle it like a normal node registration before executing the normal re-attach work. - Update tests/neon_local that used to rely on doing an explicit register step that could be enabled/disabled. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-03-12 14:47:12 +00:00
Arthur Petukhovsky	580e136b2e	Forward all backpressure feedback to compute (#7079 ) Previously we aggregated ps_feedback on each safekeeper and sent it to walproposer with every AppendResponse. This PR changes it to send ps_feedback to walproposer right after receiving it from pageserver, without aggregating it in memory. Also contains some preparations for implementing backpressure support for sharding.	2024-03-12 12:14:02 +00:00
John Spray	89cf714890	tests/neon_local: rename "attachment service" -> "storage controller" (#7087 ) Not a user-facing change, but can break any existing `.neon` directories created by neon_local, as the name of the database used by the storage controller changes. This PR changes all the locations apart from the path of `control_plane/attachment_service` (waiting for an opportune moment to do that one, because it's the most conflict-ish wrt ongoing PRs like #6676 )	2024-03-12 11:36:27 +00:00
Heikki Linnakangas	74d09b78c7	Keep walproposer alive until shutdown checkpoint is safe on safekepeers The walproposer pretends to be a walsender in many ways. It has a WalSnd slot, it claims to be a walsender by calling MarkPostmasterChildWalSender() etc. But one different to real walsenders was that the postmaster still treated it as a bgworker rather than a walsender. The difference is that at shutdown, walsenders are not killed until the very end, after the checkpointer process has written the shutdown checkpoint and exited. As a result, the walproposer always got killed before the shutdown checkpoint was written, so the shutdown checkpoint never made it to safekeepers. That's fine in principle, we don't require a clean shutdown after all. But it also feels a bit silly not to stream the shutdown checkpoint. It could be useful for initializing hot standby mode in a read replica, for example. Change postmaster to treat background workers that have called MarkPostmasterChildWalSender() as walsenders. That unfortunately requires another small change in postgres core. After doing that, walproposers stay alive longer. However, it also means that the checkpointer will wait for the walproposer to switch to WALSNDSTATE_STOPPING state, when the checkpointer sends the PROCSIG_WALSND_INIT_STOPPING signal. We don't have the machinery in walproposer to receive and handle that signal reliably. Instead, we mark walproposer as being in WALSNDSTATE_STOPPING always. In commit `568f91420a`, I assumed that shutdown will wait for all the remaining WAL to be streamed to safekeepers, but before this commit that was not true, and the test became flaky. This should make it stable again. Some tests wrongly assumed that no WAL could have been written between pg_current_wal_flush_lsn and quick pg stop after it. Fix them by introducing flush_ep_to_pageserver which first stops the endpoint and then waits till all committed WAL reaches the pageserver. In passing extract safekeeper http client to its own module.	2024-03-11 23:29:32 +04:00
Joonas Koivunen	b09d686335	fix: on-demand downloads can outlive timeline shutdown (#7051 ) ## Problem Before this PR, it was possible that on-demand downloads were started after `Timeline::shutdown()`. For example, we have observed a walreceiver-connection-handler-initiated on-demand download that was started after `Timeline::shutdown()`s final `task_mgr::shutdown_tasks()` call. The underlying issue is that `task_mgr::shutdown_tasks()` isn't sticky, i.e., new tasks can be spawned during or after `task_mgr::shutdown_tasks()`. Cc: https://github.com/neondatabase/neon/issues/4175 in lieu of a more specific issue for task_mgr. We already decided we want to get rid of it anyways. Original investigation: https://neondb.slack.com/archives/C033RQ5SPDH/p1709824952465949 ## Changes - enter gate while downloading - use timeline cancellation token for cancelling download thereby, fixes #7054 Entering the gate might also remove recent "kept the gate from closing" in staging.	2024-03-09 13:09:08 +00:00
Christian Schwarz	74d24582cf	throttling: exclude throttled time from basebackup (fixup of #6953 ) (#7072 ) PR #6953 only excluded throttled time from the handle_pagerequests (aka smgr metrics). This PR implements the deduction for `basebackup ` queries. The other page_service methods either don't use Timeline::get or they aren't used in production. Found by manually inspecting in [staging logs](https://neonprod.grafana.net/explore?schemaVersion=1&panes=%7B%22wx8%22:%7B%22datasource%22:%22xHHYY0dVz%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bhostname%3D%5C%22pageserver-0.eu-west-1.aws.neon.build%5C%22%7D%20%7C~%20%60git-env%7CERR%7CWARN%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22xHHYY0dVz%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22to%22:%221709919114642%22,%22from%22:%221709904430898%22%7D%7D%7D).	2024-03-09 13:37:02 +01:00
John Spray	7329413705	storage controller: enable setting PlacementPolicy in tenant creation (#7037 ) ## Problem Tenants created via the storage controller have a `PlacementPolicy` that defines their HA/secondary/detach intent. For backward compat we can just set it to Single, for onboarding tenants using /location_conf it is automatically set to Double(1) if there are at least two pageservers, but for freshly created tenants we didn't have a way to specify it. This unblocks writing tests that create HA tenants on the storage controller and do failure injection testing. ## Summary of changes - Add optional fields to TenantCreateRequest for specifying PlacementPolicy. This request structure is used both on pageserver API and storage controller API, but this method is only meaningful for the storage controller (same as existing `shard_parameters` attribute). - Use the value from the creation request in tenant creation, if provided.	2024-03-08 15:34:53 +00:00
Conrad Ludgate	02358b21a4	update rustls (#7048 ) ## Summary of changes Update rustls from 0.21 to 0.22. reqwest/tonic/aws-smithy still use rustls 0.21. no upgrade route available yet.	2024-03-07 18:23:19 +00:00
John Spray	4a31e18c81	storage controller: include stripe size in compute notifications (#6974 ) ## Problem - The storage controller is the source of truth for a tenant's stripe size, but doesn't currently have a way to propagate that to compute: we're just using the default stripe size everywhere. Closes: https://github.com/neondatabase/neon/issues/6903 ## Summary of changes - Include stripe size in `ComputeHookNotifyRequest` - Include stripe size in `LocationConfigResponse` The stripe size is optional: it will only be advertised for multi-sharded tenants. This enables the controller to defer the choice of stripe size until we split a tenant for the first time.	2024-03-06 13:56:30 +00:00
Joonas Koivunen	752bf5a22f	build: clippy disallow futures::pin_mut macro (#7016 ) `std` has had `pin!` macro for some time, there is no need for us to use the older alternatives. Cannot disallow `tokio::pin` because tokio macros use that.	2024-03-05 10:14:37 +00:00
John Spray	20d0939b00	control_plane/attachment_service: implement PlacementPolicy::Secondary, configuration updates (#6521 ) During onboarding, the control plane may attempt ad-hoc creation of a secondary location to facilitate live migration. This gives us two problems to solve: - Accept 'Secondary' mode in /location_config and use it to put the tenant into secondary mode on some physical pageserver, then pass through /tenant/xyz/secondary/download requests - Create tenants with no generation initially, since the initial `Secondary` mode call will not provide us a generation. This PR also fixes modification of a tenant's TenantConf during /location_conf, which was previously ignored, and refines the flow for config modification: - avoid bumping generations when the only reason we're reconciling an attached location is a config change - increment TenantState.sequence when spawning a reconciler: usually schedule() does this, but when we do config changes that doesn't happen, so without this change waiters would think reconciliation was done immediately. `sequence` is a bit of a murky thing right now, as it's dual-purposed for tracking waiters, and for checking if an existing reconciliation is already making updates to our current sequence. I'll follow up at some point to clarify it's purpose. - test config modification at the end of onboarding test	2024-03-01 20:25:53 +00:00

1 2 3 4 5 ...

590 Commits