Commit Graph

210 Commits

Author SHA1 Message Date
Christian Schwarz
3de416a016 refactor(walreceiver): eliminate task_mgr usage (#7260)
We want to move the code base away from task_mgr.

This PR refactors the walreceiver code such that it doesn't use
`task_mgr` anymore.

# Background

As a reminder, there are three tasks in a Timeline that's ingesting WAL:
`WalReceiverManager`, `WalReceiverConnectionHandler`, and
`WalReceiverConnectionPoller`.
See the documentation in `task_mgr.rs` for how they interact.

Before this PR, cancellation was requested through
`task_mgr::shutdown_token()` and `TaskHandle::shutdown`.

Wait-for-task-finish was implemented using a mixture of
`task_mgr::shutdown_tasks` and `TaskHandle::shutdown`.

This drawing might help:

<img width="300" alt="image"
src="https://github.com/neondatabase/neon/assets/956573/b6be7ad6-ecb3-41d0-b410-ec85cb8d6d20">


# Changes

For cancellation, the entire WalReceiver task tree now has a
`child_token()` of `Timeline::cancel`. The `TaskHandle` is no longer a
cancellation root.
This means that `Timeline::cancel.cancel()` propagates to the entire tree.
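
A minimal sketch of that token hierarchy, using tokio-util's
`CancellationToken` (the task body here is illustrative):

```rust
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let timeline_cancel = CancellationToken::new();          // Timeline::cancel
    let walreceiver_cancel = timeline_cancel.child_token();  // root of the walreceiver tree
    let connection_cancel = walreceiver_cancel.child_token();

    let handler = tokio::spawn(async move {
        // A child token observes cancellation of any ancestor.
        connection_cancel.cancelled().await;
        println!("connection handler shutting down");
    });

    timeline_cancel.cancel(); // propagates to all child tokens
    handler.await.unwrap();
}
```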

For wait-for-task-finish, all three tasks in the task tree hold the
`Timeline::gate` open until they exit.

The downside of using the `Timeline::gate` is that we can no longer wait
for just the walreceiver to shut down, which is particularly relevant
for `Timeline::flush_and_shutdown`.
Effectively, it means that we might ingest more WAL while the
`freeze_and_flush()` call is ongoing.
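
The gate pattern itself can be sketched with plain tokio primitives; this
is illustrative, not the actual `utils::sync::gate` API:

```rust
use tokio::sync::mpsc;

/// Illustrative gate: each task holds a GateGuard (a Sender clone) until
/// it exits; close() resolves once every guard has been dropped.
pub struct Gate {
    keep_alive: mpsc::Sender<()>,
    all_dropped: mpsc::Receiver<()>,
}

pub struct GateGuard(mpsc::Sender<()>);

impl Gate {
    pub fn new() -> Self {
        let (tx, rx) = mpsc::channel(1);
        Gate { keep_alive: tx, all_dropped: rx }
    }

    /// Called by each task in the walreceiver tree before doing work.
    pub fn enter(&self) -> GateGuard {
        GateGuard(self.keep_alive.clone())
    }

    /// Called during Timeline shutdown, after cancellation was requested.
    pub async fn close(mut self) {
        drop(self.keep_alive);                  // drop the gate's own sender
        let _ = self.all_dropped.recv().await;  // None once all guards are gone
    }
}
```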

Also, drive-by-fix the assertions around task kinds in `wait_lsn`. The
check for `WalReceiverConnectionHandler` was ineffective because that
was never a task_mgr task, but a TaskHandle task. Refine the assertion
to check whether we would actually wait, and only fail in that case.

# Alternatives

I contemplated (ab-)using the `Gate` by having a separate `Gate` for
`struct WalReceiver`.
All the child tasks would use _that_ gate instead of `Timeline::gate`.
And `struct WalReceiver` itself would hold an `Option<GateGuard>` of the
`Timeline::gate`.
Then we could have a `WalReceiver::stop` function that closes the
WalReceiver's gate, then drops the WalReceiver's `Option<GateGuard>`.

However, such a design would mean sharing the WalReceiver's `Gate` in an
`Arc`, which seems awkward.
A proper abstraction would be to make gates hierarchical, analogous to
CancellationToken.

In the end, @jcsp and I talked it over and we determined that it's not
worth the effort at this time.

# Refs

part of #7062
2024-04-03 12:28:04 +02:00
Arpad Müller
3ee34a3f26 Update Rust to 1.77.0 (#7198)
Release notes: https://blog.rust-lang.org/2024/03/21/Rust-1.77.0.html

Thanks to #6886 the diff is reasonably small; the only new lint is
`clippy::suspicious_open_options`. I added `truncate()` calls in the
places where that is obviously the right choice, and added allows
everywhere else, leaving the rest for follow-ups.
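
For reference, the lint fires when a file is opened with `create(true)`
and `write(true)` but no explicit truncation behavior; the fix is to
state it (path and function name here are illustrative):

```rust
use std::fs::{File, OpenOptions};

fn open_for_overwrite(path: &str) -> std::io::Result<File> {
    OpenOptions::new()
        .write(true)
        .create(true)
        // Without this (or .append(true)), clippy::suspicious_open_options
        // warns: a pre-existing longer file would keep its old tail bytes.
        .truncate(true)
        .open(path)
}
```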

I had to pass `--locked` to `cargo install` because the build would fail
otherwise. This was also recommended by upstream.
2024-03-22 06:52:31 +00:00
Vlad Lazar
c75b584430 storage_controller: add metrics (#7178)
## Problem
Storage controller had basically no metrics.

## Summary of changes
1. Migrate the existing metrics to use Conrad's
[`measured`](https://docs.rs/measured/0.0.14/measured/) crate.
2. Add metrics for incoming http requests
3. Add metrics for outgoing http requests to the pageserver
4. Add metrics for outgoing pass through requests to the pageserver
5. Add metrics for database queries

Note that the metrics response for the attachment service does not use
chunked encoding like the rest of the metrics endpoints. Conrad has
kindly extended the crate such that it can now be done. Let's leave it
for a follow-up since the payload shouldn't be that big at this point.

Fixes https://github.com/neondatabase/neon/issues/6875
2024-03-21 12:00:20 +00:00
Jure Bajic
94138c1a28 Enforce LSN ordering of batch entries (#7071)
## Summary of changes

Enforce LSN ordering of batch entries.

Closes https://github.com/neondatabase/neon/issues/6707
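
A hypothetical sketch of the enforced invariant (the actual batch types
in the pageserver differ):

```rust
/// Hypothetical batch entry: a WAL record keyed by its LSN.
struct BatchEntry {
    lsn: u64,
    payload: Vec<u8>,
}

/// Entries within a batch must be in non-decreasing LSN order.
fn assert_lsn_ordered(batch: &[BatchEntry]) {
    debug_assert!(
        batch.windows(2).all(|w| w[0].lsn <= w[1].lsn),
        "batch entries must be LSN-ordered"
    );
}
```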
2024-03-21 09:17:24 +00:00
Joonas Koivunen
a95c41f463 fix(heavier_once_cell): take_and_deinit should take ownership (#7185)
Small fix to remove confusing `mut` bindings.

Builds upon #7175, split off from #7030. Cc: #5331.
2024-03-21 00:42:38 +02:00
Tristan Partin
64c6dfd3e4 Move functions for creating/extracting tarballs into utils
Useful for other code paths which will handle zstd compression and
decompression.
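
One way the create side could look, using the `tar` and `zstd` crates;
the function name and signature are illustrative, not the actual utils
API:

```rust
use std::fs::File;
use std::path::Path;

fn create_tar_zst(src_dir: &Path, dst: &Path) -> anyhow::Result<()> {
    let out = File::create(dst)?;
    let encoder = zstd::stream::Encoder::new(out, 0)?; // 0 = default level
    let mut builder = tar::Builder::new(encoder);
    builder.append_dir_all(".", src_dir)?; // recursively add the directory
    builder.into_inner()?.finish()?;       // flush the tar, then end the zstd frame
    Ok(())
}
```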
2024-03-19 10:50:41 -05:00
Arthur Petukhovsky
ad5efb49ee Support backpressure for sharding (#7100)
Add shard_number to PageserverFeedback and parse it on the compute side.
When compute receives a new ps_feedback, it calculates min LSNs among
feedbacks from all shards, and uses those LSNs for backpressure.
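
The aggregation could be sketched like this (field names are
illustrative, not the actual `PageserverFeedback` layout):

```rust
/// Illustrative per-shard feedback as seen by compute.
struct ShardFeedback {
    last_received_lsn: u64,
    disk_consistent_lsn: u64,
}

/// Backpressure must be driven by the slowest shard, so take the minimum
/// of each LSN across the latest feedback from every shard.
fn backpressure_lsns(latest_per_shard: &[ShardFeedback]) -> Option<(u64, u64)> {
    let received = latest_per_shard.iter().map(|f| f.last_received_lsn).min()?;
    let consistent = latest_per_shard.iter().map(|f| f.disk_consistent_lsn).min()?;
    Some((received, consistent))
}
```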

Add `test_sharding_backpressure` to verify that backpressure slows down
compute to wait for the slowest shard.
2024-03-18 21:54:44 +00:00
Christian Schwarz
ad6f538aef tokio-epoll-uring: use it for on-demand downloads (#6992)
# Problem

On-demand downloads are still using `tokio::fs`, which we know is
inefficient.

# Changes

- Add `pagebench ondemand-download-churn` to quantify on-demand download
throughput
- Requires dumping layer map, which required making `history_buffer`
impl `Deserialize`
- Implement an equivalent of `tokio::io::copy_buf` for owned buffers =>
`owned_buffers_io` module and children (see the sketch after this list).
- Make layer file download sensitive to `io_engine::get()`, using
VirtualFile + above copy loop
- For this, I had to move some code into the `retry_download`, e.g.,
`sync_all()` call.
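
A sketch of a copy loop over owned buffers; the traits are illustrative
stand-ins for the `owned_buffers_io` abstractions, which pass buffers by
value as uring-style APIs require:

```rust
/// Illustrative owned-buffer IO traits: the buffer travels by value and
/// is handed back, so the kernel can own it for the syscall's duration.
trait OwnedRead {
    async fn read(&mut self, buf: Vec<u8>) -> (Vec<u8>, std::io::Result<usize>);
}
trait OwnedWrite {
    async fn write_all(&mut self, buf: Vec<u8>, len: usize) -> (Vec<u8>, std::io::Result<()>);
}

async fn copy_owned(
    src: &mut impl OwnedRead,
    dst: &mut impl OwnedWrite,
) -> std::io::Result<u64> {
    let mut buf = vec![0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let (b, res) = src.read(buf).await;
        let n = res?;
        if n == 0 {
            return Ok(total); // EOF
        }
        let (b, res) = dst.write_all(b, n).await;
        res?;
        total += n as u64;
        buf = b; // the buffer round-trips through both calls, no realloc
    }
}
```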

Drive-by:
- fix missing escaping in `scripts/ps_ec2_setup_instance_store` 
- if we failed in `retry_download` to create a file, we'd try to remove
it, encounter `NotFound`, and `abort()` the process using
`on_fatal_io_error`. This PR treats `NotFound` there as a success.

# Testing

Functional

- The copy loop is generic & unit tested.

Performance

- Used the `ondemand-download-churn` benchmark to manually test against
real S3.
- Results (public Notion page):
https://neondatabase.notion.site/Benchmarking-tokio-epoll-uring-on-demand-downloads-2024-04-15-newer-code-03c0fdc475c54492b44d9627b6e4e710?pvs=4
- Performance is equivalent at low concurrency. Things are jumpier at
high concurrency, but tokio-epoll-uring still uses less CPU for the
throughput achieved.
  - It’s a win.

# Future Work

Turn the manual performance testing described in the above results
document into a performance regression test:
https://github.com/neondatabase/neon/issues/7146
2024-03-15 18:57:05 +00:00
Joonas Koivunen
59b6cce418 heavier_once_cell: add detached init support (#7135)
Aiming for the design where `heavier_once_cell::OnceCell` is initialized
by a future factory led to awkwardness in how
`LayerInner::get_or_maybe_download` looks right now with the `loop`. The
loop helps with two situations:

- an eviction has been scheduled but has not yet happened, and a read
access should cancel the eviction
- a previous `LayerInner::get_or_maybe_download` that canceled a pending
eviction was canceled leaving the `heavier_once_cell::OnceCell`
uninitialized but needing repair by the next
`LayerInner::get_or_maybe_download`

By instead supporting detached initialization in
`heavier_once_cell::OnceCell` via an `OnceCell::get_or_detached_init`,
we can fix what the monolithic #7030 does:
- spawned off download task initializes the
`heavier_once_cell::OnceCell` regardless of the download starter being
canceled
- a canceled `LayerInner::get_or_maybe_download` no longer stops
eviction but can win it if not canceled

Split off from #7030.

Cc: #5331
2024-03-15 15:54:28 +00:00
Arthur Petukhovsky
580e136b2e Forward all backpressure feedback to compute (#7079)
Previously we aggregated ps_feedback on each safekeeper and sent it to
walproposer with every AppendResponse. This PR changes it to send
ps_feedback to walproposer right after receiving it from pageserver,
without aggregating it in memory. Also contains some preparations for
implementing backpressure support for sharding.
2024-03-12 12:14:02 +00:00
John Spray
20d0939b00 control_plane/attachment_service: implement PlacementPolicy::Secondary, configuration updates (#6521)
During onboarding, the control plane may attempt ad-hoc creation of a
secondary location to facilitate live migration. This gives us two
problems to solve:
- Accept 'Secondary' mode in /location_config and use it to put the
tenant into secondary mode on some physical pageserver, then pass
through /tenant/xyz/secondary/download requests
- Create tenants with no generation initially, since the initial
`Secondary` mode call will not provide us a generation.

This PR also fixes modification of a tenant's TenantConf during
/location_conf, which was previously ignored, and refines the flow for
config modification:
- avoid bumping generations when the only reason we're reconciling an
attached location is a config change
- increment TenantState.sequence when spawning a reconciler: usually
schedule() does this, but when we do config changes that doesn't happen,
so without this change waiters would think reconciliation was done
immediately. `sequence` is a bit of a murky thing right now, as it's
dual-purposed for tracking waiters and for checking whether an existing
reconciliation is already making updates to our current sequence. I'll
follow up at some point to clarify its purpose.
- test config modification at the end of onboarding test
2024-03-01 20:25:53 +00:00
Arpad Müller
82853cc1d1 Fix warnings and compile errors on nightly (#6886)
Nightly has added a bunch of compiler and linter warnings. There are
also two dependencies that fail compilation on latest nightly due to
using the old `stdsimd` feature name. This PR fixes them.
2024-03-01 17:14:19 +01:00
Joonas Koivunen
5ab10d051d metrics: record more details of the responding (#6979)
On eu-west-1 during benchmarks we sometimes lose samples. Add more time
measurements.
2024-03-01 14:04:39 +00:00
Vlad Lazar
5984edaecd libs: fix expired token in auth decode test (#6963)
The test token expired earlier today (1709200879). I regenerated the
token, but without an expiration date this time.
2024-02-29 13:55:38 +00:00
Vlad Lazar
5accf6e24a attachment_service: JWT auth enforcement (#6897)
## Problem
Attachment service does not do auth based on JWT scopes.

## Summary of changes
Do JWT based permission checking for requests coming into the attachment
service.

Requests into the attachment service must use different tokens based on
the endpoint:
* `/control` and `/debug` require `admin` scope
* `/upcall` requires `generations_api` scope
* `/v1/...` requires `pageserverapi` scope

Requests into the pageserver from the attachment service must use
`pageserverapi` scope.
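
The endpoint-to-scope mapping could be sketched like this (the enum and
function are illustrative, not the actual attachment service code):

```rust
#[derive(Debug, PartialEq, Eq)]
enum Scope {
    Admin,
    GenerationsApi,
    PageServerApi,
}

/// Map a request path to the JWT scope it requires.
fn required_scope(path: &str) -> Option<Scope> {
    if path.starts_with("/control") || path.starts_with("/debug") {
        Some(Scope::Admin)
    } else if path.starts_with("/upcall") {
        Some(Scope::GenerationsApi)
    } else if path.starts_with("/v1/") {
        Some(Scope::PageServerApi)
    } else {
        None
    }
}
```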
2024-02-26 18:17:06 +00:00
Christian Schwarz
ec3efc56a8 Revert "Revert "refactor(VirtualFile::crashsafe_overwrite): avoid Handle::block_on in callers"" (#6775)
Reverts neondatabase/neon#6765 , bringing back #6731

We concluded that #6731 never was the root cause for the instability in
staging.
More details:
https://neondb.slack.com/archives/C033RQ5SPDH/p1708011674755319

However, the massive amount of concurrent `spawn_blocking` calls from
the `save_metadata` calls during startups might cause a performance
regression.
So, we'll merge this PR here after we've stopped writing the metadata
(#6769).
2024-02-23 17:16:43 +01:00
Christian Schwarz
ca07fa5f8b per-TenantShard read throttling (#6706)
2024-02-16 21:26:59 +01:00
John Spray
5d039c6e9b libs: add 'generations_api' auth scope (#6783)
## Problem

Even if you're not enforcing auth, the JwtAuth middleware barfs on
scopes it doesn't know about.

Add `generations_api` scope, which was invented in the cloud control
plane for the pageserver's /re-attach and /validate upcalls: this will
be enforced in storage controller's implementation of these in a later
PR.

Unfortunately the scope's naming doesn't match the other scopes' naming
style, so it needs a manual serde decorator to give it an underscore.

## Summary of changes

- Add `Scope::GenerationsApi` variant
- Update pageserver + safekeeper auth code to print appropriate message
if they see it.
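
The decorator might look roughly like this (the surrounding variants and
their exact names are illustrative):

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
enum Scope {
    Admin,         // serializes as "admin"
    PageServerApi, // serializes as "pageserverapi"
    // Manual rename: rename_all = "lowercase" would yield "generationsapi",
    // but the control plane invented the token with an underscore.
    #[serde(rename = "generations_api")]
    GenerationsApi,
}
```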
2024-02-16 15:53:09 +00:00
Christian Schwarz
024372a3db Revert "refactor(VirtualFile::crashsafe_overwrite): avoid Handle::block_on in callers" (#6765)
Reverts neondatabase/neon#6731

On high tenant count Pageservers in staging, memory and CPU usage shoot
to 100% with this change. (NB: staging currently has tokio-epoll-uring
enabled)

Will analyze tomorrow.


https://neondb.slack.com/archives/C03H1K0PGKH/p1707933875639379?thread_ts=1707929541.125329&cid=C03H1K0PGKH
2024-02-14 19:17:12 +00:00
Arpad Müller
a2d0d44b42 Remove unused allow's (#6760)
These `allow`s became redundant some time ago, so remove them, or
address the underlying warning where doing so is very simple.
2024-02-14 18:16:05 +00:00
Christian Schwarz
df5d588f63 refactor(VirtualFile::crashsafe_overwrite): avoid Handle::block_on in callers (#6731)
Some callers of `VirtualFile::crashsafe_overwrite` call it on the
executor thread, thereby potentially stalling it.

Others are more diligent and wrap it in `spawn_blocking(...,
Handle::block_on, ... )` to avoid stalling the executor thread.

However, because `crashsafe_overwrite` uses
VirtualFile::open_with_options internally, we spawn a new thread-local
`tokio-epoll-uring::System` in the blocking pool thread that's used for
the `spawn_blocking` call.

This PR refactors the situation such that we do the `spawn_blocking`
inside `VirtualFile::crashsafe_overwrite`. This unifies the situation
for the better:

1. Callers who didn't wrap in `spawn_blocking(..., Handle::block_on,
...)` before no longer stall the executor.
2. Callers who did it before now can avoid the `block_on`, resolving the
problem with the short-lived `tokio-epoll-uring::System`s in the
blocking pool threads.
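
A minimal sketch of the resulting shape (paths and function names are
illustrative):

```rust
use std::fs;
use std::io::Write;
use std::path::{Path, PathBuf};

/// Blocking, std::fs-based overwrite: write a tmp file, fsync it, rename
/// it into place, then fsync the parent directory to persist the rename.
fn overwrite(final_path: &Path, tmp_path: &Path, content: &[u8]) -> std::io::Result<()> {
    let mut tmp = fs::OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open(tmp_path)?;
    tmp.write_all(content)?;
    tmp.sync_all()?;
    drop(tmp);
    fs::rename(tmp_path, final_path)?;
    fs::File::open(final_path.parent().expect("path has a parent"))?.sync_all()?;
    Ok(())
}

/// Async entry point: spawn_blocking now lives inside, so callers neither
/// stall the executor nor need Handle::block_on. The captures must be
/// 'static + Send, hence the owned PathBuf/Vec<u8> parameters.
async fn crashsafe_overwrite(
    final_path: PathBuf,
    tmp_path: PathBuf,
    content: Vec<u8>,
) -> std::io::Result<()> {
    tokio::task::spawn_blocking(move || overwrite(&final_path, &tmp_path, &content))
        .await
        .expect("spawn_blocking task panicked")
}
```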

A future PR will build on top of this and divert to tokio-epoll-uring if
it's configured as the IO engine.

Changes
-------

- Convert implementation to std::fs and move it into `crashsafe.rs`
- Yes, I know, Safekeepers (cc @arssher ) added `durable_rename` and
`fsync_async_opt` recently. However, `crashsafe_overwrite` is different
in the sense that it's higher level, i.e., it's more like
`std::fs::write` and the Safekeeper team's code is more building block
style.
- The consequence is that we don't use the VirtualFile file descriptor
cache anymore.
- I don't think it's a big deal because we have plenty of slack wrt
production file descriptor limit rlimit (see [this
dashboard](https://neonprod.grafana.net/d/e4a40325-9acf-4aa0-8fd9-f6322b3f30bd/pageserver-open-file-descriptors?orgId=1))

- Use `tokio::task::spawn_blocking` in
`VirtualFile::crashsafe_overwrite` to call the new
`crashsafe::overwrite` API.
- Inspect all callers to remove any double-`spawn_blocking`
- spawn_blocking requires the captured data to be 'static + Send. So,
refactor the callers. We'll need this for future tokio-epoll-uring
support anyway, because tokio-epoll-uring requires owned buffers.

Related Issues
--------------

- overall epic to enable write path to tokio-epoll-uring: #6663
- this is also kind of relevant to the tokio-epoll-uring System creation
failures that we encountered in staging, investigation being tracked in
#6667
- why is it relevant? Because this PR removes two uses of
`spawn_blocking+Handle::block_on`
2024-02-14 14:22:41 +00:00
Joonas Koivunen
c77411e903 cleanup around attach (#6621)
The smaller changes I found while looking around #6584.

- rustfmt was not able to format handle_timeline_create
- fix Generation::get_suffix always allocating
- Generation was missing a `#[track_caller]` on a panicky method
- attach has a lot of issues, but even with this PR it cannot be
formatted by rustfmt
- moved the `preload` span to be on top of `attach` -- it is awaited
inline
- turn disconnected `panic!` or `unreachable!` calls into expect, expect_err
2024-02-12 14:52:20 +02:00
Joonas Koivunen
aeda82a010 fix(heavier_once_cell): assertion failure can be hit (#6722)
@problame noticed that the `tokio::sync::AcquireError` branch assertion
can be hit like in the added test. We haven't seen this yet in
production, but I'd prefer not to see it there. There `take_and_deinit`
is being used, but this race must be quite timing sensitive.

Rework of earlier: #6652.
2024-02-12 09:57:29 +00:00
Christian Schwarz
5779c7908a revert two recent heavier_once_cell changes (#6704)
This PR reverts

- https://github.com/neondatabase/neon/pull/6589
- https://github.com/neondatabase/neon/pull/6652

because there's a performance regression that's particularly visible at
high layer counts.

Most likely it's because the switch to RwLock inflates the 

```
    inner: heavier_once_cell::OnceCell<ResidentOrWantedEvicted>,
```

size from 48 to 88 bytes, which by itself almost doubles the cache
footprint; the fact that it's now larger than a cache line probably
doesn't help either.

See this chat on the Neon discord for more context:

https://discord.com/channels/1176467419317940276/1204714372295958548/1205541184634617906

I'm reverting 6652 as well because it might also have perf implications,
and we're getting close to the next release. We should re-do its changes
after the next release, though.

cc @koivunej 
cc @ivaxer
2024-02-09 22:22:40 +00:00
Joonas Koivunen
9a31311990 fix(heavier_once_cell): assertion failure can be hit (#6652)
@problame noticed that the `tokio::sync::AcquireError` branch assertion
can be hit like in the first commit. We haven't seen this yet in
production, but I'd prefer not to see it there. There `take_and_deinit`
is being used, but this race must be quite timing sensitive.
2024-02-08 22:40:14 +02:00
John Spray
e8d2843df6 storage controller: improved handling of node availability on restart (#6658)
- Automatically set a node's availability to Active if it is responsive
in startup_reconcile
- Impose a 5s timeout of HTTP request to list location conf, so that an
unresponsive node can't hang it for minutes
- Do several retries if the request fails with a retryable error, to be
tolerant of concurrent pageserver & storage controller restarts
- Add a readiness hook for use with k8s so that we can tell when the
startup reconciliation is done and the service is fully ready to do work.
- Add /metrics to the list of un-authenticated endpoints (this is
unrelated, but we're touching the line in this PR already, and it fixes
auth error spam in deployed containers.)
- A test for the above.

Closes: #6670
2024-02-08 18:00:53 +00:00
Christian Schwarz
53a3ed0a7e debug_assert presence of shard_id tracing field (#6572)
also: fixes https://github.com/neondatabase/neon/issues/6638
2024-02-06 14:43:33 +00:00
Christian Schwarz
0de46fd6f2 heavier_once_cell: switch to tokio::sync::RwLock (#6589)
Using the RwLock reduces contention on the hot path.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-02-06 14:04:15 +02:00
Joonas Koivunen
947165788d refactor: needless cancellation token cloning (#6618)
The solution we ended up with for `backoff::retry` required always
cloning cancellation tokens even though they are only `.await`ed. Fix
that, and also turn the return type into `Option<Result<T, E>>`, avoiding
the need for the `E::cancelled()` fn passed in.

Cc: #6096
2024-02-06 09:39:06 +02:00
John Spray
7e2436695d storage controller: use AWS Secrets Manager for database URL, etc (#6585)
## Problem

Passing secrets in via CLI/environment is awkward when using helm for
deployment, and not ideal for security (secrets may show up in ps,
/proc).

We can bypass these issues by simply connecting directly to the AWS
Secrets Manager service at runtime.

## Summary of changes

- Add dependency on aws-sdk-secretsmanager
- Update other aws dependencies to latest, to match transitive
dependency versions
- Add `Secrets` type in attachment service, using AWS SDK to load if
secrets are not provided on the command line.
2024-02-02 16:57:11 +00:00
Joonas Koivunen
3d5fab127a rewrite Gate impl for better observability (#6542)
changes:
- log two messages instead of a message every second while a gate is closing
- replace the gate name string with a pointer
- slow GateGuards are likely to log who they were (see example)

example found in regress tests: <https://github.com/neondatabase/neon/pull/6542#issuecomment-1919009256>
2024-01-31 22:15:58 +00:00
Arseny Sher
bc684e9d3b Make WAL segment init atomic.
Since fdatasync is used for flushing WAL, changing file size is unsafe. Make
segment creation atomic by using tmp file + rename to avoid using partially
initialized segments.

fixes https://github.com/neondatabase/neon/issues/6402
2024-01-30 18:05:22 +04:00
John Spray
58f6cb649e control_plane: database persistence for attachment_service (#6468)
## Problem

Spun off from https://github.com/neondatabase/neon/pull/6394 -- this PR
is just the persistence parts and the changes that enable it to work
nicely


## Summary of changes

- Revert #6444 and #6450
- In neon_local, start a vanilla postgres instance for the attachment
service to use.
- Adopt `diesel` crate for database access in attachment service. This
uses raw SQL migrations as the source of truth for the schema, so it's a
soft dependency: we can switch libraries pretty easily.
- Rewrite persistence.rs to use postgres (via diesel) instead of JSON.
- Preserve JSON read+write at startup and shutdown: this enables using
the JSON format in compatibility tests, so that we don't have to commit
to our DB schema yet.
- In neon_local, run database creation + migrations before starting
attachment service
- Run the initial reconciliation in Service::spawn in the background, so
that the pageserver + attachment service don't get stuck waiting for
each other to start, when restarting both together in a test.
2024-01-26 17:20:44 +00:00
John Spray
c9b1657e4c pageserver: fixes for creation operations overlapping with shutdown/startup (#6436)
## Problem

For #6423, creating a reproducer turned out to be very easy, as an
extension to test_ondemand_activation.

However, before I had diagnosed the issue, I was starting with a more
brute force approach of running creation API calls in the background
while restarting a pageserver, and that shows up a bunch of other
interesting issues.

In this PR:
- Add the reproducer for #6423 by extending `test_ondemand_activation`
(confirmed that this test fails if I revert the fix from
https://github.com/neondatabase/neon/pull/6430)
- In timeline creation, return 503 responses when we get an error and
the tenant's cancellation token is set: this covers the cases where we
get an anyhow::Error from something during timeline creation as a result
of shutdown.
- While waiting for tenants to become active during creation, don't
.map_err() the result to a 500: instead let the `From` impl map the
result to something appropriate (this includes mapping shutdown to 503)
- During tenant creation, we were calling `Tenant::load_local` because
no Preload object is provided. This is usually harmless because the
tenant dir is empty, but if there are some half-created timelines in
there, bad things can happen. Propagate the SpawnMode into
Tenant::attach, so that it can properly skip _any_ attempt to load
timelines if creating.
- When we call upsert_location, there's a SpawnMode that tells us
whether to load from remote storage or not. But if the operation is a
retry and we already have the tenant, it is not correct to skip loading
from remote storage: there might be a timeline there. This isn't
strictly a correctness issue as long as the caller behaves correctly
(does not assume that any timelines are persistent until the creation is
acked), but it's a more defensive position.
- If we shut down while the task in Tenant::attach is running, it can
end up spawning rogue tasks. Fix this by holding a GateGuard through
here, and in upsert_location shutting down a tenant after calling
tenant_spawn if we can't insert it into tenants_map. This fixes the
expected behavior that after shutdown_all_tenants returns, no tenant
tasks are running.
- Add `test_create_churn_during_restart`, which runs tenant & timeline
creations across pageserver restarts.
- Update a couple of tests that covered cancellation, to reflect the
cleaner errors we now return.
2024-01-25 12:35:52 +00:00
Christian Schwarz
42c17a6fc6 attachment_service: use atomic overwrite to persist attachments.json (#6444)
The pagebench integration PR (#6214) is the first to SIGQUIT & then
restart attachment_service.

With many tenants (100), we have found frequent failures on restart in
the CI[^1].

[^1]:
[Allure](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6214/7615750160/index.html#suites/e26265675583c610f99af77084ae58f1/851ff709578c4452/)

```
2024-01-22T19:07:57.932021Z  INFO request{method=POST path=/attach-hook request_id=2697503c-7b3e-4529-b8c1-d12ef912d3eb}: Request handled, status: 200 OK
2024-01-22T19:07:58.898213Z  INFO Got SIGQUIT. Terminating
2024-01-22T19:08:02.176588Z  INFO version: git-env:d56f31639356ed8e8ce832097f132f27ee19ac8a, launch_timestamp: 2024-01-22 19:08:02.174634554 UTC, build_tag build_tag-env:7615750160, state at /tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json, listening on 127.0.0.1:15048
thread 'main' panicked at /__w/neon/neon/control_plane/attachment_service/src/persistence.rs:95:17:
Failed to load state from '/tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json': trailing characters at line 1 column 8957 (maybe your .neon/ dir was written by an older version?)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: attachment_service::persistence::PersistentState::load_or_new::{{closure}}
             at ./control_plane/attachment_service/src/persistence.rs:95:17
   3: attachment_service::persistence::Persistence::new::{{closure}}
             at ./control_plane/attachment_service/src/persistence.rs:103:56
   4: attachment_service::main::{{closure}}
             at ./control_plane/attachment_service/src/main.rs:69:61
   5: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:63
   6: tokio::runtime::coop::with_budget
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
   7: tokio::runtime::coop::budget
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
   8: tokio::runtime::park::CachedParkThread::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:31
   9: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/blocking.rs:66:9
  10: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
  11: tokio::runtime::context::runtime::enter_runtime
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
  12: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
  13: tokio::runtime::runtime::Runtime::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:350:50
  14: attachment_service::main
             at ./control_plane/attachment_service/src/main.rs:99:5
  15: core::ops::function::FnOnce::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```

The attachment_service handles SIGQUIT by just exiting the process.
In theory, the SIGQUIT could come in while we're writing out the
`attachments.json`.

Now, in the above log output, there's a 1 second gap between the last
request completing and the SIGQUIT coming in. So there must be some other
issue.

But let's have this change anyway; maybe it helps uncover the real cause
of the test failure.
2024-01-23 17:21:06 +01:00
Conrad Ludgate
72de1cb511 remove some duped deps (#6422)
## Problem

Duplicated dependencies.

## Summary of changes

A little bit of fiddling with deps to reduce duplicates.

needs consideration:
https://github.com/notify-rs/notify/blob/main/CHANGELOG.md#notify-600-2023-05-17
2024-01-23 11:17:15 +00:00
John Spray
b6ec11ad78 control_plane: generalize attachment_service to handle sharding (#6251)
## Problem

To test sharding, we need something to control it. We could write python
code for doing this from the test runner, but that wouldn't be usable
with neon_local run directly, and when we want to write tests with a
large number of shards/tenants, Rust is a better fit for efficiently
handling all the required state.

This service enables automated tests to easily get a system with
sharding/HA without the test itself having to set this all up by hand:
existing tests can be run against sharded tenants just by setting a
shard count when creating the tenant.

## Summary of changes

Attachment service was previously a map of TenantId->TenantState, where
the principal state stored for each tenant was the generation and the
last attached pageserver. This enabled it to serve the re-attach and
validate requests that the pageserver requires.

In this PR, the scope of the service is extended substantially to do
overall management of tenants in the pageserver, including
tenant/timeline creation, live migration, evacuation of offline
pageservers etc. This is done using synchronous code to make declarative
changes to the tenant's intended state (`TenantState.policy` and
`TenantState.intent`), which are then translated into calls into the
pageserver by the `Reconciler`.
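
The split between declarative intent and observed reality might be
sketched like this (the types are illustrative, not the actual service
code):

```rust
use std::collections::HashMap;

type NodeId = u64;

/// Declarative target: where this tenant shard *should* live.
struct IntentState {
    attached: Option<NodeId>,
    secondary: Vec<NodeId>,
}

/// Last-known reality, refreshed from the pageservers' location configs.
struct ObservedState {
    locations: HashMap<NodeId, LocationMode>,
}

#[derive(PartialEq, Eq)]
enum LocationMode {
    Attached,
    Secondary,
}

/// The service mutates intent synchronously; a Reconciler then issues
/// pageserver API calls until observed state matches intent.
fn needs_reconcile(intent: &IntentState, observed: &ObservedState) -> bool {
    let attached_ok = intent
        .attached
        .map_or(true, |n| observed.locations.get(&n) == Some(&LocationMode::Attached));
    let secondary_ok = intent
        .secondary
        .iter()
        .all(|n| observed.locations.get(n) == Some(&LocationMode::Secondary));
    !(attached_ok && secondary_ok)
}
```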

Top level summary of modules within
`control_plane/attachment_service/src`:
- `tenant_state`: structure that represents one tenant shard.
- `service`: implements the main high-level operations such as
tenant/timeline creation, marking a node offline, etc.
- `scheduler`: for operations that need to pick a pageserver for a
tenant, construct a scheduler and call into it.
- `compute_hook`: receive notifications when a tenant shard is attached
somewhere new. Once we have locations for all the shards in a tenant,
emit an update to postgres configuration via the neon_local `LocalEnv`.
- `http`: HTTP stubs. These mostly map to methods on `Service`, but are
separated for readability and so that it'll be easier to adapt if/when
we switch to another RPC layer.
- `node`: structure that describes a pageserver node. The most important
attribute of a node is its availability: marking a node offline causes
tenant shards to reschedule away from it.

This PR is a precursor to implementing the full sharding service for
prod (#6342). What's the difference between this and a production-ready
controller for pageservers?
- JSON file persistence to be replaced with a database
- Limited observability.
- No concurrency limits. Marking a pageserver offline will try and
migrate every tenant to a new pageserver concurrently, even if there are
thousands.
- Very simple scheduler that only knows to pick the pageserver with
fewest tenants, and place secondary locations on a different pageserver
than attached locations: it does not try to place shards for the same
tenant on different pageservers. This matters little in tests, because
picking the least-used pageserver usually results in round-robin
placement.
- Scheduler state is rebuilt exhaustively for each operation that
requires a scheduler.
- Relies on neon_local mechanisms for updating postgres: in production
this would be something that flows through the real control plane.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-01-17 18:01:08 +00:00
John Spray
4b9b4c2c36 pageserver: cleanup redundant create/attach code, fix detach while attaching (#6277)
## Problem

The code for tenant create and tenant attach was just a special case of
what upsert_location does.

## Summary of changes

- Use `upsert_location` for create and attach APIs
- Clean up error handling in upsert_location so that it can generate
appropriate HTTP response codes
- Update tests that asserted the old non-idempotent behavior of attach
- Rework the `test_ignore_while_attaching` test, and fix tenant shutdown
during activation, which this test was supposed to cover, but it was
actually just waiting for activation to complete.
2024-01-09 10:37:54 +00:00
John Spray
3c560d27a8 pageserver: implement secondary-mode downloads (#6123)
Follows on from #6050 , in which we upload heatmaps. Secondary locations
will now poll those heatmaps and download layers mentioned in the
heatmap.

TODO:
- [X] ~Unify/reconcile stats for behind-schedule execution with
warn_when_period_overrun
(https://github.com/neondatabase/neon/pull/6050#discussion_r1426560695)~
- [x] Give downloads their own concurrency config independent of uploads

Deferred optimizations:
- https://github.com/neondatabase/neon/issues/6199
- https://github.com/neondatabase/neon/issues/6200

Eviction will be the next PR:
- #5342
2024-01-05 12:29:20 +00:00
John Spray
18e9208158 pageserver: improved error handling for shard routing error, timeline not found (#6262)
## Problem

- When a client requests a key that isn't found in any shard on the node
(edge case that only happens if a compute's config is out of date), we
should prompt them to reconnect (as this includes a backoff), since they
will not be able to complete the request until they eventually get a
correct pageserver connection string.
- QueryError::Other is used excessively: this contains a type-ambiguous
anyhow::Error and is logged very verbosely (including backtrace).

## Summary of changes

- Introduce PageStreamError to replace use of anyhow::Error in request
handlers for getpage, etc.
- Introduce Reconnect and NotFound variants to QueryError
- Map the "shard routing error" case to PageStreamError::Reconnect ->
QueryError::Reconnect
- Update type conversions for LSN timeouts and tenant/timeline not found
errors to use PageStreamError::NotFound->QueryError::NotFound
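
The mapping might look roughly like this (variant payloads are
illustrative):

```rust
/// Errors from the pagestream request handlers.
enum PageStreamError {
    /// Shard routing failed: the client should reconnect with a fresh
    /// pageserver connection string (reconnects include a backoff).
    Reconnect,
    NotFound(String),
    Other(anyhow::Error),
}

enum QueryError {
    Reconnect,
    NotFound(String),
    Other(anyhow::Error),
}

impl From<PageStreamError> for QueryError {
    fn from(e: PageStreamError) -> Self {
        match e {
            PageStreamError::Reconnect => QueryError::Reconnect,
            PageStreamError::NotFound(what) => QueryError::NotFound(what),
            PageStreamError::Other(err) => QueryError::Other(err),
        }
    }
}
```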
2024-01-04 10:40:03 +00:00
Arseny Sher
dbd36e40dc Move failpoint support code to utils.
To enable failpoints in safekeeper as well.
2024-01-02 10:50:20 +04:00
Christian Schwarz
5385791ca6 add pageserver component-level benchmark (pagebench) (#6174)
This PR adds a component-level benchmarking utility for pageserver.
Its name is `pagebench`.

The problem solved by `pagebench` is that we want to put Pageserver
under high load.

This isn't easily achieved with `pgbench` because it needs to go through
a compute, which has significant performance overhead compared to
accessing Pageserver directly.

Further, compute has its own performance optimizations (most
importantly: caches). Instead of designing a compute-facing workload
that defeats those internal optimizations, `pagebench` simply bypasses
them by accessing pageserver directly.

Supported benchmarks:

* getpage@latest_lsn
* basebackup
* triggering logical size calculation

This code has no automated users yet.
A performance regression test for getpage@latest_lsn will be added in a
later PR.

part of https://github.com/neondatabase/neon/issues/5771
2023-12-21 13:07:23 +01:00
John Spray
56f7d55ba7 pageserver: basic cancel/timeout for remote storage operations (#6097)
## Problem

Various places in remote storage were not subject to a timeout (so stuck
TCP connections could hold things up) and did not respect a cancellation
token (so things like timeline deletion or tenant detach would have to
wait arbitrarily long).



## Summary of changes

- Add download_cancellable and upload_cancellable helpers (sketched
below), and use them in all the places we wait for remote storage
operations (with the exception of initdb downloads, where it would not
have been safe).
- Add a cancellation token arg to `download_retry`.
- Use cancellation token args in various places that were missing one
per #5066
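
A sketch of the wrapper shape (the names are illustrative; the real
helpers feed into remote-storage error types):

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

enum RemoteOpError {
    TimedOut,
    Cancelled,
}

/// Bound a remote-storage future by both a timeout and a cancellation token.
async fn with_timeout_and_cancel<T>(
    fut: impl std::future::Future<Output = T>,
    timeout: Duration,
    cancel: &CancellationToken,
) -> Result<T, RemoteOpError> {
    tokio::select! {
        res = tokio::time::timeout(timeout, fut) => {
            res.map_err(|_elapsed| RemoteOpError::TimedOut)
        }
        _ = cancel.cancelled() => Err(RemoteOpError::Cancelled),
    }
}
```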

Closes: #5066 

Why is this only "basic" handling?
- Doesn't express difference between shutdown and errors in return
types, to avoid refactoring all the places that use an anyhow::Error
(these should all eventually return a more structured error type)
- Implements timeouts on top of remote storage, rather than within it:
this means that operations hitting their timeout will lose their
semaphore permit and thereby go to the back of the queue for their
retry.
- Doing a nicer job is tracked in
https://github.com/neondatabase/neon/issues/6096
2023-12-15 17:43:02 +00:00
Joonas Koivunen
07508fb110 fix: better Json parsing errors (#6135)
Before, JSON parsing from the HTTP API only returned per-field errors.
Now errors are produced using `serde_path_to_error`, which at least
helped greatly with the `disk_usage_eviction_run` endpoint used for
testing. I don't think this can conflict with anything added in #5310.
2023-12-15 12:18:22 +02:00
Christian Schwarz
aa5581d14f utils::logging: TracingEventCountLayer: don't use with_label_values() on hot path (#6129)
fixes #6126
2023-12-14 16:31:41 +01:00
John Spray
20e9cf7d31 pageserver: tweaks to slow/hung task logging (#6098)
## Problem

- `shutdown_tasks` would log when a particular task was taking a long
time to shut down, but not when it eventually completed. That left one
uncertain as to whether the slow task was the source of a hang, or just
a precursor.

## Summary of changes

- Add a log line after a slow task shutdown
- Add an equivalent in Gate's `warn_if_stuck`, in case we ever need it.
This isn't related to the original issue but was noticed when checking
through these logging paths.
2023-12-12 07:19:59 +00:00
Conrad Ludgate
e1a564ace2 proxy simplify cancellation (#5916)
## Problem

The cancellation code was confusing and error prone (as seen before in
our memory leaks).

## Summary of changes

* Use the new `TaskTracker` primitive instead of JoinSet to gracefully
wait for tasks to shutdown.
* Updated libs/utils/completion to use `TaskTracker`
* Remove `tokio::select` in favour of `futures::future::select` in a
specialised `run_until_cancelled()` helper function
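
A sketch of such a helper (the actual signature in the proxy may differ):

```rust
use futures::future::{select, Either};
use std::pin::pin;
use tokio_util::sync::CancellationToken;

/// Run `f` to completion unless `token` fires first; `None` means cancelled.
async fn run_until_cancelled<F: std::future::Future>(
    f: F,
    token: &CancellationToken,
) -> Option<F::Output> {
    match select(pin!(f), pin!(token.cancelled())).await {
        Either::Left((output, _)) => Some(output),
        Either::Right(((), _)) => None,
    }
}
```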
2023-12-08 16:21:17 +00:00
Heikki Linnakangas
31be301ef3 Make simple_rcu::RcuWaitList::wait() async (#6046)
The gc_timeline() function is async, but it calls the synchronous wait()
function. In the worst case, that could lead to a deadlock by using up
all tokio executor threads.

In passing, fix a few typos in comments.

Fixes issue #6045.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-07 10:20:40 +02:00
John Spray
1d81e70d60 pageserver: tweak logs for index_part loading (#6005)
## Problem

On pageservers upgraded to enable generations, these INFO level logs
were rather frequent. If a tenant timeline hasn't written new layers
since the upgrade, it will emit the "No index_part.json*" log every time
it starts.

## Summary of changes

- Downgrade two log lines from info to debug
- Add a tiny unit test that I wrote for sanity-checking that there
wasn't something wrong with our Generation-comparing logic when loading
index parts.
2023-12-04 09:57:47 +00:00
Arpad Müller
54327bbeec Upload initdb results to S3 (#5390)
## Problem

See #2592

## Summary of changes

Compresses the results of initdb into a .tar.zst file and uploads them
to S3, to enable usage in recovery from lsn.

I think generations should not be involved, because we do this only once
at the very beginning of a timeline.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-23 18:11:52 +00:00