rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-15 09:22:55 +00:00

Author	SHA1	Message	Date
John Spray	8f2ebc0684	tests: stabilize test_storage_controller_node_deletion (#10420 ) ## Problem `test_storage_controller_node_deletion` sometimes failed because shards were moving around during timeline creation, and neon_local isn't tolerant of that. The movements were unexpected because the shards had only just been created. This was a regression from #9916 Closes: #10383 ## Summary of changes - Make this test use multiple AZs -- this makes the storage controller's scheduling reliably stable Why this works: in #9916 , I made a simplifying assumption that we would have multiple AZs to get nice stable scheduling -- it's much easier, because each tenant has a well defined primary+secondary location when they have an AZ preference and nodes have different AZs. Everything still works if you don't have multiple AZs, but you just have this quirk that sometimes the optimizer can disagree with initial scheduling, so once in a while a shard moves after being created -- annoying for tests, harmless IRL.	2025-01-16 19:00:16 +00:00
Arpad Müller	e436dcad57	Rename "disabled" safekeeper scheduling policy to "pause" (#10410 ) Rename the safekeeper scheduling policy "disabled" to "pause". A rename was requested in https://github.com/neondatabase/neon/pull/10400#discussion_r1916259124, as the "disabled" policy is meant to be analogous to the "pause" policy for pageservers. Also simplify the `SkSchedulingPolicyArg::from_str` function, relying on the `from_str` implementation of `SkSchedulingPolicy`. Latter is used for the database format as well, so it is quite stable. If we ever want to change the UI, we'll need to duplicate the function again but this is cheap.	2025-01-16 14:30:49 +00:00
Arpad Müller	efaec6cdf8	Add endpoint and storcon cli cmd to set sk scheduling policy (#10400 ) Implementing the last missing endpoint of #9981, this adds support to set the scheduling policy of an individual safekeeper, as specified in the RFC. However, unlike in the RFC we call the endpoint `scheduling_policy` not `status` Closes #9981. As for why not use the upsert endpoint for this: we want to have the safekeeper upsert endpoint be used for testing and for deploying new safekeepers, but not for changes of the scheduling policy. We don't want to change any of the other fields when marking a safekeeper as decommissioned for example, so we'd have to first fetch them only to then specify them again. Of course one can also design an endpoint where one can omit any field and it doesn't get modified, but it's still not great for observability to put everything into one big "change something about this safekeeper" endpoint.	2025-01-15 18:15:30 +00:00
John Spray	47c1640acc	storage controller: pagination for tenant listing API (#10365 ) ## Problem For large deployments, the `control/v1/tenant` listing API can time out transmitting a monolithic serialized response. ## Summary of changes - Add `limit` and `start_after` parameters to listing API - Update storcon_cli to use these parameters and limit requests to 1000 items at a time	2025-01-14 21:37:32 +00:00
John Spray	fd1368d31e	storcon: rework scheduler optimisation, prioritize AZ (#9916 ) ## Problem We want to do a more robust job of scheduling tenants into their home AZ: https://github.com/neondatabase/neon/issues/8264. Closes: https://github.com/neondatabase/neon/issues/8969 ## Summary of changes ### Scope This PR combines prioritizing AZ with a larger rework of how we do optimisation. The rationale is that just bumping AZ in the order of Score attributes is a very tiny change: the interesting part is lining up all the optimisation logic to respect this properly, which means rewriting it to use the same scores as the scheduler, rather than the fragile hand-crafted logic that we had before. Separating these changes out is possible, but would involve doing two rounds of test updates instead of one. ### Scheduling optimisation `TenantShard`'s `optimize_attachment` and `optimize_secondary` methods now both use the scheduler to pick a new "favourite" location. Then there is some refined logic for whether + how to migrate to it: - To decide if a new location is sufficiently "better", we generate scores using some projected ScheduleContexts that exclude the shard under consideration, so that we avoid migrating from a node with AffinityScore(2) to a node with AffinityScore(1), only to migrate back later. - Score types get a `for_optimization` method so that when we compare scores, we will only do an optimisation if the scores differ by their highest-ranking attributes, not just because one pageserver is lower in utilization. Eventually we _will_ want a mode that does this, but doing it here would make scheduling logic unstable and harder to test, and to do this correctly one needs to know the size of the tenant that one is migrating. - When we find a new attached location that we would like to move to, we will create a new secondary location there, even if we already had one on some other node. This handles the case where we have a home AZ A, and want to migrate the attachment between pageservers in that AZ while retaining a secondary location in some other AZ as well. - A unit test is added for https://github.com/neondatabase/neon/issues/8969, which is implicitly fixed by reworking optimisation to use the same scheduling scores as scheduling.	2025-01-13 19:33:00 +00:00
John Spray	ef8bfacd6b	storage controller: API + CLI for migrating secondary locations (#10284 ) ## Problem Currently, if we want to move a secondary there isn't a neat way to do that: we just have migration API for the attached location, and it is only clean to use that if you've manually created a secondary via pageserver API in the place you're going to move it to. Secondary migration API enables: - Moving the secondary somewhere because we would like to later move the attached location there. - Move the secondary location because we just want to reclaim some disk space from its current location. ## Summary of changes - Add `/migrate_secondary` API - Add `tenant-shard-migrate-secondary` CLI - Add tests for above	2025-01-13 14:52:43 +00:00
John Spray	d1bc36f536	storage controller: fix retries of compute hook notifications while a secondary node is offline (#10352 ) ## Problem We would sometimes fail to retry compute notifications: 1. Try and send, set compute_notify_failure if we can't 2. On next reconcile, reconcile() fails for some other reason (e.g. tried to talk to an offline node), and we fail the `result.is_ok() && must_notify` condition around the re-sending. Closes: https://github.com/neondatabase/cloud/issues/22612 ## Summary of changes - Clarify the meaning of the reconcile result: it should be Ok(()) if configuring attached location worked, even if secondary or detach locations cannot be reached. - Skip trying to talk to secondaries if they're offline - Even if reconcile fails and we can't send the compute notification (we can't send it because we're not sure if it's really attached), make sure we save the `compute_notify_failure` flag so that subsequent reconciler runs will try again - Add a regression test for the above	2025-01-13 13:31:57 +00:00
Arpad Müller	bebc46e713	Add scheduling_policy column to safekeepers table (#10205 ) Add a `scheduling_policy` column to the safekeepers table of the storage controller. Part of #9981	2025-01-09 15:55:02 +00:00
John Spray	68d8acfd05	storage controller: don't hold detached tenants in memory (#10264 ) ## Problem Typical deployments of neon have some tenants that stay in use continuously, and a background churning population of tenants that are created and then fall idle, and are configured to Detached state. Currently, this churn of short lived tenants results in an ever-increasing memory footprint. Closes: https://github.com/neondatabase/neon/issues/9712 ## Summary of changes - At startup, filter to only load shards that don't have Detached policy - In process_result, check if a tenant's shards are all Detached and observed=={}, and if so drop them from memory - In tenant_location_conf and other tenant mutators, load the tenants' shards on-demand if they are not present	2025-01-08 18:12:09 +00:00
Vlad Lazar	dc284247a5	storage_controller: fix node flap detach race (#10298 ) ## Problem The observed state removal may race with the inline updates of the observed state done from `Service::node_activate_reconcile`. This was intended to work as follows: 1. Detaches while the node is unavailable remove the entry from the observed state. 2. `Service::node_activate_reconcile` diffs the locations returned by the pageserver with the observed state and detaches in-line when required. ## Summary of changes This PR removes step (1) and lets background reconciliations deal with the mismatch between the intent and observed state. A follow up will attempt to remove `Service::node_activate_reconcile` altogether. Closes https://github.com/neondatabase/neon/issues/10253	2025-01-08 10:26:53 +00:00
Arpad Müller	85696297c5	Add safekeepers command to storcon_cli for listing (#10151 ) Add a `safekeepers` subcommand to `storcon_cli` that allows listing the safekeepers. ``` $ curl -X POST --url http://localhost:1234/control/v1/safekeeper/42 --data \ '{"active":true, "id":42, "created_at":"2023-10-25T09:11:25Z", "updated_at":"2024-08-28T11:32:43Z","region_id":"neon_local","host":"localhost","port":5454,"http_port":0,"version":123,"availability_zone_id":"us-east-2b"}' $ cargo run --bin storcon_cli -- --api http://localhost:1234 safekeepers Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.38s Running `target/debug/storcon_cli --api 'http://localhost:1234' safekeepers` +----+---------+-----------+------+-----------+------------+ \| Id \| Version \| Host \| Port \| Http Port \| AZ Id \| +==========================================================+ \| 42 \| 123 \| localhost \| 5454 \| 0 \| us-east-2b \| +----+---------+-----------+------+-----------+------------+ ``` Also: * Don't return the raw `SafekeeperPersistence` struct that contains the raw database presentation, but instead a new `SafekeeperDescribeResponse` struct. * The `SafekeeperPersistence` struct leaves out the `active` field on purpose because we want to deprecate it and replace it with a `scheduling_policy` one. Part of https://github.com/neondatabase/neon/issues/9981	2024-12-18 12:47:56 +00:00
John Spray	fd230227f2	storcon: include preferred AZ in compute notifications (#9953 ) ## Problem It is unreliable for the control plane to infer the AZ for computes from where the tenant is currently attached, because if a tenant happens to be in a degraded state or a release is ongoing while a compute starts, then the tenant's attached AZ can be a different one to where it will run long-term, and the control plane doesn't check back later to restart the compute. This can land in parallel with https://github.com/neondatabase/neon/pull/9947 ## Summary of changes - Thread through the preferred AZ into the compute hook code via the reconciler - Include the preferred AZ in the body of compute hook notifications	2024-12-17 20:04:09 +00:00
Arpad Müller	342cbea255	storcon: add safekeeper list API (#10089 ) This adds an API to the storage controller to list safekeepers registered to it. This PR does a `diesel print-schema > storage_controller/src/schema.rs` because of an inconsistency between up.sql and schema.rs, introduced by commit `2c142f14f7`, so there are some updates to `schema.rs` due to that. As a followup to this, we should maybe think about running `diesel print-schema` in CI. Part of #9981	2024-12-12 01:09:24 +00:00
Tristan Partin	b391b29bdc	Improve typing in test_runner/fixtures/httpserver.py (#10103 ) Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-12-11 22:21:42 +00:00
Vlad Lazar	e8395807a5	storcon: allow for more concurrency in drain/fill operations (#10093 ) ## Problem We saw the drain/fill operations not drain fast enough in ap-southeast. ## Summary of changes These are some quick changes to speed it up: * double reconcile concurrency - this is now half of the available reconcile bandwidth * reduce the waiter polling timeout - this way we can spawn new reconciliations faster	2024-12-11 19:43:40 +00:00
Vlad Lazar	a3e80448e8	pageserver/storcon: add patch endpoints for tenant config metrics (#10020 ) ## Problem Cplane and storage controller tenant config changes are not additive. Any change overrides all existing tenant configs. This would be fine if both did client side patching, but that's not the case. Once this merges, we must update cplane to use the PATCH endpoint. ## Summary of changes ### High Level Allow for patching of tenant configuration with a `PATCH /v1/tenant/config` endpoint. It takes the same data as its PUT counterpart. For example, the payload below will update `gc_period` and unset `compaction_period`. All other fields are left in their original state. ``` { "tenant_id": "1234", "gc_period": "10s", "compaction_period": null } ``` ### Low Level * PS and storcon gain `PATCH /v1/tenant/config` endpoints. PS endpoint is only used for cplane managed instances. * `storcon_cli` is updated to have separate commands for `set-tenant-config` and `patch-tenant-config` Related https://github.com/neondatabase/cloud/issues/21043	2024-12-11 19:16:33 +00:00
John Spray	ec790870d5	storcon: automatically clear Pause/Stop scheduling policies to enable detaches (#10011 ) ## Problem We saw a tenant get stuck when it had been put into Pause scheduling mode to pin it to a pageserver, then it was left idle for a while and the control plane tried to detach it. Close: https://github.com/neondatabase/neon/issues/9957 ## Summary of changes - When changing policy to Detached or Secondary, set the scheduling policy to Active. - Add a test that exercises this - When persisting tenant shards, set their `generation_pageserver` to null if the placement policy is not Attached (this enables consistency checks to work, and avoids leaving state in the DB that could be confusing/misleading in future)	2024-12-07 13:05:09 +00:00
John Spray	71d004289c	storcon: in shard splits, inherit parent's AZ (#9946 ) ## Problem Sharded tenants should be run in a single AZ for best performance, so that computes have AZ-local latency to all the shards. Part of https://github.com/neondatabase/neon/issues/8264 ## Summary of changes - When we split a tenant, instead of updating each shard's preferred AZ to wherever it is scheduled, propagate the preferred AZ from the parent. - Drop the check in `test_shard_preferred_azs` that asserts shards end up in their preferred AZ: this will not be true again until the optimize_attachment logic is updated to make this so. The existing check wasn't testing anything about scheduling, it was just asserting that we set preferred AZ in a way that matches the way things happen to be scheduled at time of split.	2024-12-03 16:55:00 +00:00
Christian Schwarz	15d01b257a	storcon_cli tenant-describe: include tenant-wide information in output (#9899 ) Before this PR, the storcon_cli didn't have a way to show the tenant-wide information of the TenantDescribeResponse. Sadly, the `Serialize` impl for the tenant config doesn't skip on `None`, so, the output becomes a bit bloated. Maybe we can use `skip_serializing_if(Option::is_none)` in the future. => https://github.com/neondatabase/neon/issues/9983	2024-12-03 10:55:13 +00:00
John Spray	aaee713e53	storcon: use proper schedule context during node delete (#9958 ) ## Problem I was touching `test_storage_controller_node_deletion` because for AZ scheduling work I was adding a change to the storage controller (kick secondaries during optimisation) that made a FIXME in this test defunct. While looking at it I also realized that we can easily fix the way node deletion currently doesn't use a proper ScheduleContext, using the iterator type recently added for that purpose. ## Summary of changes - A testing-only behavior in storage controller where if a secondary location isn't yet ready during optimisation, it will be actively polled. - Remove workaround in `test_storage_controller_node_deletion` that previously was needed because optimisation would get stuck on cold secondaries. - Update node deletion code to use a `TenantShardContextIterator` and thereby a proper ScheduleContext	2024-12-03 08:59:38 +00:00
Erik Grinaker	5330122049	test_runner: improve `wait_until` (#9936 ) Improves `wait_until` by: * Use `timeout` instead of `iterations`. This allows changing the timeout/interval parameters independently. * Make `timeout` and `interval` optional (default 20s and 0.5s). Most callers don't care. * Only output status every 1s by default, and add optional `status_interval` parameter. * Remove `show_intermediate_error`, this was always emitted anyway. Most callers have been updated to use the defaults, except where they had good reason otherwise.	2024-12-02 10:26:15 +00:00
Alexander Bayandin	51d26a261b	build(deps): bump mypy from 1.3.0 to 1.13.0 (#9670 ) ## Problem We use a pretty old version of `mypy` 1.3 (released 1.5 years ago), it produces false positives for `typing.Self`. ## Summary of changes - Bump `mypy` from 1.3 to 1.13 - Fix new warnings and errors - Use `typing.Self` whenever we `return self`	2024-11-22 14:31:36 +00:00
Alexander Bayandin	8d1c44039e	Python 3.11 (#9515 ) ## Problem On Debian 12 (Bookworm), Python 3.11 is the latest available version. ## Summary of changes - Update Python to 3.11 in build-tools - Fix ruff check / format - Fix mypy - Use `StrEnum` instead of pair `str`, `Enum` - Update docs	2024-11-21 16:25:31 +00:00
Alexander Bayandin	e9dcfa2eb2	test_runner: skip more tests using decorator instead of pytest.skip (#9704 ) ## Problem Running `pytest.skip(...)` in a test body instead of marking the test with `@pytest.mark.skipif(...)` causes all fixtures to be initialised, which is not necessary if the test is going to be skipped anyway. Also, some tests are unnecessarily skipped (e.g. `test_layer_bloating` on Postgres 17, or `test_idle_reconnections` at all) or run (e.g. `test_parse_project_git_version_output_positive` on more than one configuration) according to comments. ## Summary of changes - Move `skip_on_postgres` / `xfail_on_postgres` / `run_only_on_default_postgres` decorators to `fixture.utils` - Add new `skip_in_debug_build` and `skip_on_ci` decorators - Replace `pytest.skip(...)` calls with decorators where possible	2024-11-11 18:07:01 +00:00
Tristan Partin	ecde8d7632	Improve type safety according to pyright Pyright found many issues that mypy doesn't seem to want to catch or mypy isn't configured to catch. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-08 14:43:15 -06:00
John Spray	93987b5a4a	tests: add test_storage_controller_onboard_detached (#9431 ) ## Problem We haven't historically taken this API route where we would onboard a tenant to the controller in detached state. It worked, but we didn't have test coverage. ## Summary of changes - Add a test that onboards a tenant to the storage controller in Detached mode, and checks that deleting it without attaching it works as expected.	2024-10-28 11:11:12 +00:00
John Spray	33baca07b6	storcon: add an API to cancel ongoing reconciler (#9520 ) ## Problem If something goes wrong with a live migration, we currently only have awkward ways to interrupt that: - Restart the storage controller - Ask it to do some other modification/migration on the shard, which we don't really want. ## Summary of changes - Add a new `/cancel` control API, and storcon_cli wrapper for it, which fires the Reconciler's cancellation token. This is just for on-call use and we do not expect it to be used by any other services.	2024-10-28 09:26:01 +00:00
John Spray	8dca188974	storage controller: add metrics for tenant shard, node count (#9475 ) ## Problem Previously, figuring out how many tenant shards were managed by a storage controller was typically done by peeking at the database or calling into the API. A metric makes it easier to monitor, as unexpectedly increasing shard counts can be indicative of problems elsewhere in the system. ## Summary of changes - Add metrics `storage_controller_pageserver_nodes` (updated on node CRUD operations from Service) and `storage_controller_tenant_shards` (updated RAII-style from TenantShard)	2024-10-22 19:43:02 +01:00
John Spray	b7173b1ef0	storcon: fix case where we might fail to send compute notifications after two opposite migrations (#9435 ) ## Problem If we migrate A->B, then B->A, and the notification of A->B fails, then we might have retained state that makes us think "A" is the last state we sent to the compute hook, whereas when we migrate B->A we should really be sending a fresh notification in case our earlier failed notification has actually mutated the remote compute config. Closes: #9417 ## Summary of changes - Add a reproducer for the bug (`test_storage_controller_compute_hook_revert`) - Refactor compute hook code to represent remote state with `ComputeRemoteState` which stores a boolean for whether the compute has fully applied the change as well as the request that the compute accepted. - The actual bug fix: after sending a compute notification, if we got a 423 response then update our ComputeRemoteState to reflect that we have mutated the remote state. This way, when we later try and notify for our historic location, we will properly see that as a change and send the notification. Co-authored-by: Vlad Lazar <vlad@neon.tech>	2024-10-18 11:29:23 +01:00
a-masterov	091a175a3e	Test versions mismatch (#9167 ) ## Problem We faced the problem of incompatibility of the different components of different versions. This should be detected automatically to prevent production bugs. ## Summary of changes The test for this situation was implemented Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-10-11 15:29:54 +02:00
John Spray	184935619e	tests: stabilize test_storage_controller_heartbeats (#9347 ) ## Problem This could fail with `reconciliation in progress` if running on a slow test node such that background reconciliation happens at the same time as we call consistency_check. Example: https://neon-github-public-dev.s3.amazonaws.com/reports/main/11258171952/index.html#/testresult/54889c9469afb232 ## Summary of changes - Call reconcile_until_idle before calling consistency check once, rather than calling consistency check until it passes	2024-10-11 09:41:08 +01:00
Tristan Partin	d3464584a6	Improve some typing in test_runner Fixes some types, adds some types, and adds some override annotations. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-09 15:42:22 -05:00
Tristan Partin	5bd8e2363a	Enable all pyupgrade checks in ruff This will help to keep us from using deprecated Python features going forward. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-08 14:32:26 -05:00
Vlad Lazar	dcf7af5a16	storcon: do timeline creation on all attached location (#9237 ) ## Problem Creation of a timeline during a reconciliation can lead to unavailability if the user attempts to start a compute before the storage controller has notified cplane of the cut-over. ## Summary of changes Create timelines on all currently attached locations. For the latest location, we still look at the database (this is as previously). With this change we also look into the observed state to find other attached locations. Related https://github.com/neondatabase/neon/issues/9144	2024-10-04 11:56:43 +01:00
Heikki Linnakangas	8ef0c38b23	tests: Rename NeonLocalCli functions to match the 'neon_local' commands (#9195 ) This makes it more clear that the functions in NeonLocalCli are just typed wrappers around the corresponding 'neon_local' commands.	2024-10-03 22:03:27 +03:00
Heikki Linnakangas	19db9e9aad	tests: Replace direct calls to neon_cli with wrappers in NeonEnv (#9195 ) Add wrappers for a few commands that didn't have them before. Move the logic to generate tenant and timeline IDs from NeonCli to the callers, so that NeonCli is more purely just a type-safe wrapper around 'neon_local'.	2024-10-03 22:03:22 +03:00
Vlad Lazar	38a8dcab9f	storcon: add metric for long running reconciles (#9207 ) ## Problem We don't have an alert for long running reconciles. Stuck reconciles are problematic as we've seen in a recent incident. ## Summary of changes Add a new metric `storage_controller_reconcile_long_running_total` with labels: `{tenant_id, shard_number, seq}`. The metric is removed after the long running reconcile finishes. These events should be rare, so we won't break the bank on cardinality. Related https://github.com/neondatabase/neon/issues/9150	2024-10-02 17:25:11 +01:00
Vlad Lazar	8dbfda98d4	storcon: ignore deleted timelines on new location catch-up (#9244 ) ## Problem If a timeline was deleted right before waiting for LSNs to catch up before the cut-over, then we would wait forever. ## Summary of changes Fix the issue and add a test for timeline deletions mid migration. Related https://github.com/neondatabase/neon/issues/9144	2024-10-02 17:23:26 +01:00
John Spray	651ae44569	storage controller: drop out of blocking compute notification loop if migration origin becomes unavailable (#9147 ) ## Problem The live migration code waits forever for the compute notification hook, on the basis that until it succeeds, the compute is probably using the old location and we shouldn't detach it. However, if a pageserver stops or restarts in the background, then this original location might no longer be available, so there is no point waiting. Waiting is also actively harmful, because it prevents other reconciliations happening for the tenant shard, such as during an upgrade where a stuck "drain" migration might prevent the later "fill" migration from moving the shard back to its original location. ## Summary of changes - Refactor the notification wait loop into a function - Add checks during the loop for the origin node's cancellation token, and an explicit HTTP request to the origin node to confirm the shard is still attached there. Closes: https://github.com/neondatabase/neon/issues/8901	2024-10-01 07:57:22 +00:00
John Spray	4b711caf5e	storage controller: make proxying of GETs to pageservers more robust (#9065 ) ## Problem These commits are split off from https://github.com/neondatabase/neon/pull/8971/commits where I was fixing this to make a better scale test pass -- Vlad also independently recognized these issues with cloudbench in https://github.com/neondatabase/neon/issues/9062. 1. The storage controller proxies GET requests to pageservers based on their intent, not the ground truth of where they're really attached. 2. Proxied requests can race with scheduling to tenants, resulting in 404 responses if the request hits the wrong pageserver. Closes: https://github.com/neondatabase/neon/issues/9062 ## Summary of changes 1. If a shard has a running reconciler, then use the database generation_pageserver to decide who to proxy the request to 2. If such a request gets a 404 response and its scheduled node has changed since the request was dispatched.	2024-09-25 13:56:39 +00:00
Yuchen Liang	1708743e78	pageserver: wait for lsn lease duration after transition into AttachedSingle (#9024 ) Part of #7497, closes https://github.com/neondatabase/neon/issues/8890. ## Problem Since leases are in-memory objects, we need to take special care of them after pageserver restarts and while doing a live migration. The approach we took for pageserver restart is to wait for at least lease duration before doing first GC. We want to do the same for live migration. Since we do not do any GC when a tenant is in `AttachedStale` or `AttachedMulti` mode, only the transition from `AttachedMulti` to `AttachedSingle` requires this treatment. ## Summary of changes - Added `lsn_lease_deadline` field in `GcBlock::reasons`: the tenant is temporarily blocked from GC until we reach the deadline. This information does not persist to S3. - In `GCBlock::start`, skip the GC iteration if we are blocked by the lsn lease deadline. - In `TenantManager::upsert_location`, set the lsn_lease_deadline to `Instant::now() + lsn_lease_length` so the granted leases have a chance to be renewed before we run GC for the first time after transitioned from AttachedMulti to AttachedSingle. Signed-off-by: Yuchen Liang <yuchen@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-09-19 17:27:10 +01:00
Christian Schwarz	135e7e4306	add `neon_local` subcommand for the broker & use that from regression tests (#8948 ) There's currently no way to just start/stop broker from `neon_local`. This PR * adds a sub-command * uses that sub-command from the test suite instead of the pre-existing Python `subprocess` based approach. Found this useful during investigation https://github.com/neondatabase/cloud/issues/16886.	2024-09-18 09:10:27 +02:00
Vlad Lazar	b719d58863	storcon: forward requests from stepped down instance to the current leader (#8954 ) ## Problem It turns out that we can't rely on external orchestration to promptly route traffic to the new leader. This is downtime inducing. Forwarding provides a safe way out. ## Safety We forward when: 1. Request is not one of ["/control/v1/step_down", "/status", "/ready", "/metrics"] 2. Current instance is in [`LeadershipStatus::SteppedDown`] state 3. There is a leader in the database to forward to 4. Leader from step (3) is not the current instance If a storcon instance is persisted in the database, then we know that it is the current leader. There's one exception: time between handling step-down request and the new leader updating the database. Let's treat the happy case first. The stepped down node does not produce any side effects, since all request handling happens on the leader. As for the edge case, we are guaranteed to always have a maximum of two running instances. Hence, if we are in the edge case scenario the leader persisted in the database is the stepped down instance that received the request. Condition (4) above covers this scenario. ## Summary of changes * Conversion utilities for reqwest <-> hyper. I'm not happy with these, but I don't see a better way. Open to suggestions. * Add request forwarding logic * Update each request handler. Again, not happy with this. If anyone knows a nice way to wrap the handlers, lmk. Me and Joonas tried :/ * Update each handler to maybe forward * Tweak tests to showcase new behaviour	2024-09-17 09:25:42 +01:00
Heikki Linnakangas	2db840d8b8	Move a few test functions related to auth tokens to separate file (#9018 ) For readability. neon_fixtures.py is huge.	2024-09-17 06:53:18 +03:00
Alex Chi Z.	ac5815b594	feat(storage-controller): add node shards api (#8896 ) For control-plane managed tenants, we have the page in the admin console that lists all tenants on a specific pageserver. But for storage-controller managed ones, we don't have that functionality for now. ## Summary of changes Adds an API that lists all shards on a given node (intention + observed) --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-09-06 14:14:21 -04:00
Vlad Lazar	e86fef05dd	storcon: track preferred AZ for each tenant shard (#8937 ) ## Problem We want to do AZ aware scheduling, but don't have enough metadata. ## Summary of changes Introduce a `preferred_az_id` concept for each managed tenant shard. In a future PR, the scheduler will use this as a soft preference. The idea is to try and keep the shard attachments within the same AZ. Under the assumption that the compute was placed in the correct AZ, this reduces the chances of cross-AZ traffic between compute and PS. In terms of code changes we: 1. Add a new nullable `preferred_az_id` column to the `tenant_shards` table. Also include an in-memory counterpart. 2. Populate the preferred AZ on tenant creation and shard splits. 3. Add an endpoint which allows bulk-setting preferred AZs. (3) gives us the migration path. I'll write a script which queries the cplane db in the region and sets the preferred AZ of all shards with an active compute to the AZ of said compute. For shards without an active compute, I'll use the AZ of the currently attached pageserver since this is what cplane uses now to schedule computes.	2024-09-06 13:11:17 +01:00
John Spray	1a9b54f1d9	storage controller: read from database in validate API (#8784 ) ## Problem The initial implementation of the validate API treats the in-memory generations as authoritative. - This is true when only one storage controller is running, but if a rogue controller was running that hadn't been shut down properly, and some pageserver requests were routed to that bad controller, it could incorrectly return valid=true for stale generations. - The generation in the main in-memory map gets out of date while a live migration is in flight, and if the origin location for the migration tries to do some deletions even though it is in AttachedStale (for example because it had already started compaction), these might be wrongly validated + executed. ## Summary of changes - Continue to do the in-memory check: if this returns valid=false it is sufficient to reject requests. - When valid=true, do an additional read from the database to confirm the generation is fresh. - Revise behavior for validation on missing shards: this used to always return valid=true as a convenience for deletions and shard splits, so that pageservers weren't prevented from completing any enqueued deletions for these shards after they're gone. However, this becomes unsafe when we consider split brain scenarios. We could reinstate this in future if we wanted to store some tombstones for deleted shards. 
- Update test_scrubber_physical_gc to cope with the behavioral change: they must now explicitly flush the deletion queue before splits, to avoid tripping up on deletions that are enqueued at the time of the split (these tests assert "scrubber deletes nothing", which check fails if the split leaves behind some remote objects that are legitimately GC'able) - Add `test_storage_controller_validate_during_migration`, which uses failpoints to create a situation where incorrect generation validation during a live migration could result in a corruption The rate of validate calls for tenants is pretty low: it happens as a consequence of deletions from GC and compaction, which are both concurrency-limited on the pageserver side.	2024-09-04 15:00:40 +01:00
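The two-stage check described in the validate API commit above can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the controller's code: `validate_generation`, `in_memory`, and `read_db_generation` are hypothetical names, a generation is assumed valid only when it equals the latest recorded one, and missing shards are assumed to yield valid=false after this change.

```python
from typing import Callable, Dict, Optional


def validate_generation(
    shard_id: str,
    generation: int,
    in_memory: Dict[str, int],
    read_db_generation: Callable[[str], Optional[int]],
) -> bool:
    """Cheap in-memory check first; confirm against the database on success."""
    mem_gen = in_memory.get(shard_id)
    if mem_gen is None:
        # Assumption: unknown shards are no longer treated as valid.
        return False
    if mem_gen != generation:
        # An in-memory mismatch is sufficient to reject without a database read.
        return False
    # In-memory state may be stale (split brain, in-flight migration),
    # so a positive answer is confirmed against the database.
    db_gen = read_db_generation(shard_id)
    return db_gen is not None and db_gen == generation
```

The database read only happens on the valid=true path, which keeps the extra load proportional to the already low rate of validate calls mentioned in the commit.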
Joonas Koivunen	7a1397cf37	storcon: boilerplate to upsert safekeeper records on deploy (#8879 ) We currently do not record safekeepers in the storage controller database. We want to migrate timelines across safekeepers eventually, so start recording the safekeepers on deploy. Cc: #8698	2024-09-04 10:10:05 +00:00
John Spray	0aa1450936	storage controller: enable timeline CRUD operations to run concurrently with reconciliation & make them safer (#8783 ) ## Problem - If a reconciler was waiting to be able to notify computes about a change, but the control plane was waiting for the controller to finish a timeline creation/deletion, the overall system can deadlock. - If a tenant shard was migrated concurrently with a timeline creation/deletion, there was a risk that the timeline operation could be applied to a non-latest-generation location, and thereby not really be persistent. This has never happened in practice, but would eventually happen at scale. Closes: #8743 ## Summary of changes - Introduce `Service::tenant_remote_mutation` helper, which looks up shards & generations and passes them into an inner function that may do remote I/O to pageservers. Before returning success, this helper checks that generations haven't incremented, to guarantee that changes are persistent. - Convert tenant_timeline_create, tenant_timeline_delete, and tenant_timeline_detach_ancestor to use this helper. - These functions no longer block on ensure_attached unless the tenant was never attached at all, so they should make progress even if we can't complete compute notifications. This increases the database load from timeline/create operations, but only with cheap read transactions.	2024-08-23 18:56:05 +01:00
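The pattern behind `Service::tenant_remote_mutation` in the commit above (read generations, do remote I/O, then re-check generations before reporting success) can be sketched in Python. This is a hedged illustration only: `remote_mutation`, `read_generations`, and `GenerationChanged` are hypothetical names, and the real helper is Rust code whose signature is not shown in the commit message.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class GenerationChanged(Exception):
    """Raised when a shard generation moved while the operation was running."""


def remote_mutation(
    read_generations: Callable[[], Dict[str, int]],
    op: Callable[[Dict[str, int]], T],
) -> T:
    # Look up shards and their generations before doing any remote I/O.
    before = read_generations()
    # The inner function may talk to pageservers using those generations.
    result = op(before)
    # Re-read before returning success: if any generation changed, the
    # mutation may have landed on a stale location and cannot be trusted.
    after = read_generations()
    if after != before:
        raise GenerationChanged("generations changed during remote mutation")
    return result
```

A caller would typically retry on `GenerationChanged`; the re-read is a cheap read-only query, in line with the commit's note about added database load.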
Joonas Koivunen	73286e6b9f	test: copy dict to avoid error on retry (#8811 ) there is no "const" in python, so when we modify the global dict, it will remain that way on the retry. fix to not have it influence other tests which might be run on the same runner. evidence: <https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8625/10513146742/index.html#/testresult/453c4ce05ada7496>	2024-08-23 14:43:08 +01:00
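The bug class fixed in the commit above is easy to reproduce: mutating a module-level dict leaks the change into later attempts. The sketch below shows the copy-before-modify fix; `DEFAULT_CONF` and `build_conf` are hypothetical names, not the ones used in the neon test suite.

```python
import copy

# Module-level default shared by every test attempt (hypothetical contents).
DEFAULT_CONF = {"gc_period": "0s", "limits": {"max_shards": 4}}


def build_conf(overrides: dict) -> dict:
    # Deep-copy so that neither top-level nor nested edits reach the global.
    conf = copy.deepcopy(DEFAULT_CONF)
    conf.update(overrides)
    return conf
```

A shallow `dict(DEFAULT_CONF)` would be enough when only top-level keys are changed; `copy.deepcopy` also protects nested values.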

79 Commits