rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-25 09:00:37 +00:00

Author	SHA1	Message	Date
Dmitrii Kovalkov	133f16e9b5	storcon: finish safekeeper migration gracefully (#12528 ) ## Problem We don't detect if safekeeper migration fails after the the commiting the membership configuration to the database. As a result, we might leave stale timelines on excluded safekeepers and do not notify cplane/safekepeers about new configuration. - Implements solution proposed in https://github.com/neondatabase/neon/pull/12432 - Closes: https://github.com/neondatabase/neon/issues/12192 - Closes: [LKB-944](https://databricks.atlassian.net/browse/LKB-944) ## Summary of changes - Add `sk_set_notified_generation` column to `timelines` database - Update `_notified_generation` in database during the finish state. - Commit reconciliation requests to database atomically with membership configuration. - Reload pending ops and retry "finish" step if we detect `_notified_generation` mismatch. - Add failpoints and test that we handle failures well	2025-07-22 14:58:20 +00:00
Arpad Müller	80baeaa084	storcon: add force_upsert flag to timeline_import endpoint (#12622 ) It is useful to have ability to update an existing timeline entry, as a way to mirror legacy migrations to the storcon managed table.	2025-07-21 21:14:15 +00:00
Aleksandr Sarantsev	b2705cfee6	storcon: Make node deletion process cancellable (#12320 ) ## Problem The current deletion operation is synchronous and blocking, which is unsuitable for potentially long-running tasks like. In such cases, the standard HTTP request-response pattern is not a good fit. ## Summary of Changes - Added new `storcon_cli` commands: `NodeStartDelete` and `NodeCancelDelete` to initiate and cancel deletion asynchronously. - Added corresponding `storcon` HTTP handlers to support the new start/cancel deletion flow. - Introduced a new type of background operation: `Delete`, to track and manage the deletion process outside the request lifecycle. --------- Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-07-04 14:08:09 +00:00
Arpad Müller	a852bc5e39	Add new activating scheduling policy for safekeepers (#12441 ) When deploying new safekeepers, we don't immediately want to send traffic to them. Maybe they are not ready yet by the time the deploy script is registering them with the storage controller. For pageservers, the storcon solves the problem by not scheduling stuff to them unless there has been a positive heartbeat response. We can't do the same for safekeepers though, otherwise a single down safekeeper would mean we can't create new timelines in smaller regions where there is only three safekeepers in total. So far we have created safekeepers as `pause` but this adds a manual step to safekeeper deployment which is prone to oversight. We want things to be automatted. So we introduce a new state `activating` that acts just like `pause`, except that we automatically transition the policy to `active` once we get a positive heartbeat from the safekeeper. For `pause`, we always keep the safekeeper paused.	2025-07-03 16:27:43 +00:00
Dmitrii Kovalkov	c746678bbc	storcon: implement safekeeper_migrate handler (#11849 ) This PR implements a safekeeper migration algorithm from RFC-035 https://github.com/neondatabase/neon/blob/main/docs/rfcs/035-safekeeper-dynamic-membership-change.md#change-algorithm - Closes: https://github.com/neondatabase/neon/issues/11823 It is not production-ready yet, but I think it's good enough to commit and start testing. There are some known issues which will be addressed in later PRs: - https://github.com/neondatabase/neon/issues/12186 - https://github.com/neondatabase/neon/issues/12187 - https://github.com/neondatabase/neon/issues/12188 - https://github.com/neondatabase/neon/issues/12189 - https://github.com/neondatabase/neon/issues/12190 - https://github.com/neondatabase/neon/issues/12191 - https://github.com/neondatabase/neon/issues/12192 ## Summary of changes - Implement `tenant_timeline_safekeeper_migrate` handler to drive the migration - Add possibility to specify number of safekeepers per timeline in tests (`timeline_safekeeper_count`) - Add `term` and `flush_lsn` to `TimelineMembershipSwitchResponse` - Implement compare-and-swap (CAS) operation over timeline in DB for updating membership configuration safely. - Write simple test to verify that migration code works	2025-06-30 08:30:05 +00:00
Arpad Müller	552249607d	apply clippy fixes for 1.88.0 beta (#12331 ) The 1.88.0 stable release is near (this Thursday). We'd like to fix most warnings beforehand so that the compiler upgrade doesn't require approval from too many teams. This is therefore a preparation PR (like similar PRs before it). There is a lot of changes for this release, mostly because the `uninlined_format_args` lint has been added to the `style` lint group. One can read more about the lint [here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args). The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One remaining warning is left for the proxy team. --------- Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2025-06-24 10:12:42 +00:00
Erik Grinaker	48052477b4	storcon: register Pageserver gRPC address (#12268 ) ## Problem Pageservers now expose a gRPC API on a separate address and port. This must be registered with the storage controller such that it can be plumbed through to the compute via cplane. Touches #11926. ## Summary of changes This patch registers the gRPC address and port with the storage controller: * Add gRPC address to `nodes` database table and `NodePersistence`, with a Diesel migration. * Add gRPC address in `NodeMetadata`, `NodeRegisterRequest`, `NodeDescribeResponse`, and `TenantLocateResponseShard`. * Add gRPC address flags to `storcon_cli node-register`. These changes are backwards-compatible, since all structs will ignore unknown fields during deserialization.	2025-06-17 13:27:10 +00:00
Alexander Sarantcev	590301df08	storcon: Introduce deletion tombstones to support flaky node scenario (#12096 ) ## Problem Removed nodes can re-add themselves on restart if not properly tombstoned. We need a mechanism (e.g. soft-delete flag) to prevent this, especially in cases where the node is unreachable. More details there: #12036 ## Summary of changes - Introduced `NodeLifecycle` enum to represent node lifecycle states. - Added a string representation of `NodeLifecycle` to the `nodes` table. - Implemented node removal using a tombstone mechanism. - Introduced `/debug/v1/tombstone*` handlers to manage the tombstone state.	2025-06-06 10:16:55 +00:00
Vlad Lazar	045ae13e06	pageserver: make imports work with tenant shut downs (#11855 ) ## Problem Lifetime of imported timelines (and implicitly the import background task) has some shortcomings: 1. Timeline activation upon import completion is tricky. Previously, a timeline that finished importing after a tenant detach would not get activated and there's concerns about the safety of activating concurrently with shut-down. 2. Import jobs can prevent tenant shut down since they hold the tenant gate ## Summary of Changes Track the import tasks in memory and abort them explicitly on tenant shutdown. Integrate more closely with the storage controller: 1. When an import task has finished all of its jobs, it notifies the storage controller, but does not mark the import as done in the index_part. When all shards have finished importing, the storage controller will call the `/activate_post_import` idempotent endpoint for all of them. The handler, marks the import complete in index part, resets the tenant if required and checks if the timeline is active yet. 2. Not directly related, but the import job now gets the starting state from the storage controller instead of the import bucket. This paves the way for progress checkpointing. Related: https://github.com/neondatabase/neon/issues/11568	2025-05-13 17:49:49 +00:00
Vlad Lazar	6f0046b688	storage_controller: ensure mutual exclusion for imports and shard splits (#11632 ) ## Problem Shard splits break timeline imports. ## Summary of Changes Ensure mutual exclusion for imports and shard splits. On the shard split code path: 1. Right before shard splitting, check the database to ensure that no-import is on-going for the tenant. Exclusion is guaranteed because this validation is done while holding the exclusive tenant lock. Timeline creation (and import creation implicitly) requires a shared tenant lock. 2. When selecting a shard to split, use the in-mem state to exclude shards with an on-going import. This is opportunistic since an import might start after the check, but allows shard splits to make progres instead of continously retrying to split the same shard. On the timeline creation code path: 1. Check the in-memory splitting flag on all shards of the tenant. If any of them are splitting, error out asking the client to retry. On the happy path this is not required, due to the tenant lock set-up described above, but it covers the case where we restart with a pending shard-split. Closes https://github.com/neondatabase/neon/issues/11567	2025-04-25 11:46:15 +00:00
Vlad Lazar	5ba7315c84	storage_controller: reconcile completed imports at start-up (#11614 ) ## Problem In https://github.com/neondatabase/neon/pull/11345 coordination of imports moved to the storage controller. It involves notifying cplane when the import has been completed by calling an idempotent endpoint. If the storage controller shuts down in the middle of finalizing an import, it would never be retried. ## Summary of changes Reconcile imports at start-up by fetching the complete imports from the database and spawning a background task which notifies cplane. Closes: https://github.com/neondatabase/neon/issues/11570	2025-04-24 18:39:19 +00:00
Vlad Lazar	3a50d95b6d	storage_controller: coordinate imports across shards in the storage controller (#11345 ) ## Problem Pageservers notify control plane directly when a shard import has completed. Control plane has to download the status of each shard from S3 and figure out if everything is truly done, before proceeding with branch activation. Issues with this approach are: * We can't control shard split behaviour on the storage controller side. It's unsafe to split during import. * Control plane needs to know about shards and implement logic to check all timelines are indeed ready. ## Summary of changes In short, storage controller coordinates imports, and, only when everything is done, notifies control plane. Big rocks: 1. Store timeline imports in the storage controller database. Each import stores the status of its shards in the database. We hook into the timeline creation call as our entry point for this. 2. Pageservers get a new upcall endpoint to notify the storage controller of shard import updates. 3. Storage controller handles these updates by updating persisted state. If an update finalizes the import, then poll pageservers until timeline activation, and, then, notify the control plane that the import is complete. Cplane side change with new endpoint is in https://github.com/neondatabase/cloud/pull/26166 Closes https://github.com/neondatabase/neon/issues/11566	2025-04-24 11:26:06 +00:00
Arpad Müller	c1e4befd56	Additional fixes and improvements to storcon safekeeper timelines (#11477 ) This delivers some additional fixes and improvements to storcon managed safekeeper timelines: * use `i32::MAX` for the generation number of timeline deletion * start the generation for new timelines at 1 instead of 0: this ensures that the other components actually are generation enabled * fix database operations we use for metrics * use join in list_pending_ops to prevent the classical ORM issue where one does many db queries * use enums in `test_storcon_create_delete_sk_down`. we are adding a second parameter, and having two bool parameters is weird. * extend `test_storcon_create_delete_sk_down` with a test of whole tenant deletion. this hasn't been tested before. * remove some redundant logging contexts * Don't require mutable access to the service lock for scheduling pending ops in memory. In order to pull this off, create reconcilers eagerly. The advantage is that we don't need mutable access to the service lock that way any more. Part of #9011 --------- Co-authored-by: Arseny Sher <sher-ars@yandex.ru>	2025-04-17 20:25:30 +00:00
Arpad Müller	a917952b30	Add test_storcon_create_delete_sk_down and make it work (#11400 ) Adds a test `test_storcon_create_delete_sk_down` which tests the reconciler and pending op persistence if faced with a temporary safekeeper downtime during timeline creation or deletion. This is in contrast to `test_explicit_timeline_creation_storcon`, which tests the happy path. We also do some fixes: * timeline and tenant deletion http requests didn't expect a body, but `()` sent one. * we got the tenant deletion http request's return type wrong: it's supposed to be a hash map * we add some logging to improve observability * We fix `list_pending_ops` which had broken code meant to make it possible to restrict oneself to a single pageserver. But diesel doesn't support that sadly, or at least I couldn't figure out a way to make it work. We don't need that functionality, so remove it. * We add an info span to the heartbeater futures with the node id, so that there is no context-free msgs like "Backoff: waiting 1.1 seconds before processing with the task" in the storcon logs. we could also add the full base url of the node but don't do it as most other log lines contain that information already, and if we do duplication it should at least not be verbose. One can always find out the base url from the node id. Successor of #11261 Part of #9011	2025-04-04 00:17:40 +00:00
Arpad Müller	1fad1abb24	storcon: timeline deletion improvements and fixes (#11334 ) This PR contains a bunch of smaller followups and fixes of the original PR #11058. most of these implement suggestions from Arseny: * remove `Queryable, Selectable` from `TimelinePersistence`: they are not needed. * no `Arc` around `CancellationToken`: it itself is an arc wrapper * only schedule deletes instead of scheduling excludes and deletes * persist and delete deletion ops * delete rows in timelines table upon tenant and timeline deletion * set `deleted_at` for timelines we are deleting before we start any reconciles: this flag will help us later to recognize half-executed deletions, or when we crashed before we could remove the timeline row but after we removed the last pending op (handling these situations are left for later). Part of #9011	2025-04-01 15:16:33 +00:00
John Spray	76088c16d2	storcon: reproduce shard split issue (#11290 ) ## Problem Issue https://github.com/neondatabase/neon/issues/11254 describes a case where restart during a shard split can result in a bad end state in the database. ## Summary of changes - Add a reproducer for the issue - Tighten an existing safety check around updated row counts in complete_shard_split	2025-03-21 08:48:56 +00:00
Arpad Müller	91dad2514f	storcon: also support tenant deletion for safekeepers (#11289 ) If a tenant gets deleted, delete also all of its timelines. We assume that by the time a tenant is being deleted, no new timelines are being created, so we don't need to worry about races with creation in this situation. Unlike #11233, which was very simple because it listed the timelines and invoked timeline deletion, this PR obtains a list of safekeepers to invoke the tenant deletion on, and then invokes tenant deletion on each safekeeper that has one or multiple timelines. Alternative to #11233 Builds on #11288 Part of #9011	2025-03-20 10:52:21 +00:00
John Spray	a674ed8caf	storcon: safety check when completing shard split (#11256 ) ## Problem There is a rare race between controller graceful deployment and shard splitting where we may incorrectly both abort _and_ complete the split (on different pods), and thereby leave no shards at all in the database. Related: #11254 ## Summary of changes - In complete_shard_split, refuse to delete anything if child shards are not found	2025-03-14 20:08:24 +00:00
Erik Grinaker	c7717c85c7	storcon,pageserver: use persisted stripe size when loading unsharded tenants (#11193 ) ## Problem When the storage controller and Pageserver loads tenants from persisted storage, it uses `ShardIdentity::unsharded()` for unsharded tenants. However, this replaces the persisted stripe size of unsharded tenants with the default stripe size. This doesn't really matter for practical purposes, since the stripe size is meaningless for unsharded tenants anyway, but can cause consistency check failures if the persisted stripe size differs from the default. This was seen in #11168, where we change the default stripe size. Touches #11168. ## Summary of changes Carry over the persisted stripe size from `TenantShardPersistence` for unsharded tenants, and from `LocationConf` on Pageservers. Also add bounds checks for type casts when loading persisted shard metadata.	2025-03-12 15:16:54 +00:00
Arpad Müller	4d3c477689	storcon: timetime table, creation and deletion (#11058 ) This PR extends the storcon with basic safekeeper management of timelines, mainly timeline creation and deletion. We want to make the storcon manage safekeepers in the future. Timeline creation is controlled by the `--timelines-onto-safekeepers` flag. 1. it adds the `timelines` and `safekeeper_timeline_pending_ops` tables to the storcon db 2. extend code for the timeline creation and deletion 4. it adds per-safekeeper reconciler tasks TODO: * maybe not immediately schedule reconciliations for deletions but have a prior manual step * tenant deletions * add exclude API definitions (probably separate PR) * how to choose safekeeper to do exclude on vs deletion? this can be a bit hairy because the safekeeper might go offline in the meantime. * error/failure case handling * tests (cc test_explicit_timeline_creation from #11002) * single safekeeper mode: we often only have one SK (in tests for example) * `notify-safekeepers` hook: https://github.com/neondatabase/neon/issues/11163 TODOs implemented: * cancellations of enqueued reconciliations on a per-timeline basis, helpful if there is an ongoing deletion * implement pending ops overwrite behavior * load pending operations from db RFC section for important reading: [link](https://github.com/neondatabase/neon/blob/main/docs/rfcs/035-safekeeper-dynamic-membership-change.md#storage_controller-implementation) Implements the bulk of #9011 Successor of #10440. --------- Co-authored-by: Arseny Sher <sher-ars@yandex.ru>	2025-03-11 02:31:22 +00:00
Dmitrii Kovalkov	e876794ce5	storcon: use https safekeeper api (#11065 ) ## Problem Storage controller uses http for requests to safekeeper management API. Closes: https://github.com/neondatabase/cloud/issues/24835 ## Summary of changes - Add `use_https_safekeeper_api` option to storcon to use https api - Use https for requests to safekeeper management API if this option is enabled - Add `ssl_ca_file` option to storcon for ability to specify custom root CA certificate	2025-03-07 17:22:47 +00:00
Arpad Müller	26bda17551	storcon: use the SchedulingPolicy enum in SafekeeperPersistence (#10897 ) We don't want to serialize to/from string all the time, so use `SchedulingPolicy` in `SafekeeperPersistence` via the use of a wrapper. Stacked atop #10891	2025-02-26 12:12:50 +00:00
Arpad Müller	920040e402	Update storage components to edition 2024 (#10919 ) Updates storage components to edition 2024. We like to stay on the latest edition if possible. There is no functional changes, however some code changes had to be done to accommodate the edition's breaking changes. The PR has two commits: * the first commit updates storage crates to edition 2024 and appeases `cargo clippy` by changing code. i have accidentially ran the formatter on some files that had other edits. * the second commit performs a `cargo fmt` I would recommend a closer review of the first commit and a less close review of the second one (as it just runs `cargo fmt`). part of https://github.com/neondatabase/neon/issues/10918	2025-02-25 23:51:37 +00:00
Dmitrii Kovalkov	e808e9432a	storcon: use https for pageservers (#10759 ) ## Problem Storage controller uses unsecure http for pageserver API. Closes: https://github.com/neondatabase/cloud/issues/23734 Closes: https://github.com/neondatabase/cloud/issues/24091 ## Summary of changes - Add an optional `listen_https_port` field to storage controller's Node state and its API (RegisterNode/ListNodes/etc). - Allow updating `listen_https_port` on node registration to gradually add https port for all nodes. - Add `use_https_pageserver_api` CLI option to storage controller to enable https. - Pageserver doesn't support https for now and always reports `https_port=None`. This will be addressed in follow-up PR.	2025-02-20 17:16:04 +00:00
Arpad Müller	536bdb3209	storcon: track safekeepers in memory, send heartbeats to them (#10583 ) In #9011, we want to schedule timelines to safekeepers. In order to do such scheduling, we need information about how utilized a safekeeper is and if it's available or not. Therefore, send constant heartbeats to the safekeepers and try to figure out if they are online or not. Includes some code from #10440.	2025-02-13 11:06:30 +00:00
Arpad Müller	b6e9daea9a	storcon: only allow errrors of the server cert verification (#10644 ) This PR does a bunch of things: * only allow errors of the server cert verification, not of the TLS handshake. The TLS handshake doesn't cause any errors for us so we can just always require it to be valid. This simplifies the code a little. * As the solution is more permanent than originally anticipated, I think it makes sense to move the `AcceptAll` verifier outside. * log the connstr information. this helps with figuring out which domain names are configured in the connstr, etc. I think it is generally useful to print it. make extra sure that the password is not leaked. Follow-up of #10640	2025-02-04 14:01:57 +00:00
Arpad Müller	c774f0a147	storcon db: allow accepting any TLS certificate (#10640 ) We encountered some TLS validation errors for the storcon since applying #10614. Add an option to downgrade them to logged errors instead to allow us to debug with more peace. cc issue https://github.com/neondatabase/cloud/issues/23583	2025-02-03 18:21:01 +00:00
Arpad Müller	87ad50c925	storcon: use diesel-async again, now with tls support (#10614 ) Successor of #10280 after it was reverted in #10592. Re-introduce the usage of diesel-async again, but now also add TLS support so that we connect to the storcon database using TLS. By default, diesel-async doesn't support TLS, so add some code to make us explicitly request TLS. cc https://github.com/neondatabase/cloud/issues/23583	2025-02-03 11:53:51 +00:00
Arpad Müller	4d2c2e9460	Revert "storcon: switch to diesel-async and tokio-postgres (#10280 )" (#10592 ) There was a regression of #10280, tracked in [#23583](https://github.com/neondatabase/cloud/issues/23583). I have ideas how to fix the issue, but we are too close to the release cutoff, so revert #10280 for now. We can revert the revert later :).	2025-01-30 19:23:25 +00:00
Arpad Müller	b0b4b7dd8f	storcon: switch to diesel-async and tokio-postgres (#10280 ) Switches the storcon away from using diesel's synchronous APIs in favour of `diesel-async`. Advantages: * less C dependencies, especially no openssl, which might be behind the bug: https://github.com/neondatabase/cloud/issues/21010 * Better to only have async than mix of async plus `spawn_blocking` We had to turn off usage of the connection pool for migrations, as diesel migrations don't support async APIs. Thus we still use `spawn_blocking` in that one place. But this is explicitly done in one of the `diesel-async` examples.	2025-01-27 14:25:11 +00:00
Arpad Müller	efaec6cdf8	Add endpoint and storcon cli cmd to set sk scheduling policy (#10400 ) Implementing the last missing endpoint of #9981, this adds support to set the scheduling policy of an individual safekeeper, as specified in the RFC. However, unlike in the RFC we call the endpoint `scheduling_policy` not `status` Closes #9981. As for why not use the upsert endpoint for this: we want to have the safekeeper upsert endpoint be used for testing and for deploying new safekeepers, but not for changes of the scheduling policy. We don't want to change any of the other fields when marking a safekeeper as decommissioned for example, so we'd have to first fetch them only to then specify them again. Of course one can also design an endpoint where one can omit any field and it doesn't get modified, but it's still not great for observability to put everything into one big "change something about this safekeeper" endpoint.	2025-01-15 18:15:30 +00:00
John Spray	aa7323a384	storage controller: quality of life improvements for AZ handling (#10379 ) ## Problem Since https://github.com/neondatabase/neon/pull/9916, the preferred AZ of a tenant is much more impactful, and we would like to make it more visible in tooling. ## Summary of changes - Include AZ in node describe API - Include AZ info in node & tenant outputs in CLI - Add metrics for per-node shard counts, labelled by AZ - Add a CLI for setting preferred AZ on a tenant - Extend AZ-setting API+CLI to handle None for clearing preferred AZ	2025-01-14 15:30:43 +00:00
Arpad Müller	23c0748cdd	Remove active column (#10335 ) We don't need or want the `active` column. Remove it. Vlad pointed out that this is safe. Thanks to the separation of the schemata in earlier PRs, this is easy. follow-up of #10205 Part of https://github.com/neondatabase/neon/issues/9981	2025-01-11 02:52:45 +00:00
Arpad Müller	bebc46e713	Add scheduling_policy column to safekeepers table (#10205 ) Add a `scheduling_policy` column to the safekeepers table of the storage controller. Part of #9981	2025-01-09 15:55:02 +00:00
John Spray	68d8acfd05	storage controller: don't hold detached tenants in memory (#10264 ) ## Problem Typical deployments of neon have some tenants that stay in use continuously, and a background churning population of tenants that are created and then fall idle, and are configured to Detached state. Currently, this churn of short lived tenants results in an ever-increasing memory footprint. Closes: https://github.com/neondatabase/neon/issues/9712 ## Summary of changes - At startup, filter to only load shards that don't have Detached policy - In process_result, check if a tenant's shards are all Detached and observed=={}, and if so drop them from memory - In tenant_location_conf and other tenant mutators, load the tenants' shards on-demand if they are not present	2025-01-08 18:12:09 +00:00
Arpad Müller	85696297c5	Add safekeepers command to storcon_cli for listing (#10151 ) Add a `safekeepers` subcommand to `storcon_cli` that allows listing the safekeepers. ``` $ curl -X POST --url http://localhost:1234/control/v1/safekeeper/42 --data \ '{"active":true, "id":42, "created_at":"2023-10-25T09:11:25Z", "updated_at":"2024-08-28T11:32:43Z","region_id":"neon_local","host":"localhost","port":5454,"http_port":0,"version":123,"availability_zone_id":"us-east-2b"}' $ cargo run --bin storcon_cli -- --api http://localhost:1234 safekeepers Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.38s Running `target/debug/storcon_cli --api 'http://localhost:1234' safekeepers` +----+---------+-----------+------+-----------+------------+ \| Id \| Version \| Host \| Port \| Http Port \| AZ Id \| +==========================================================+ \| 42 \| 123 \| localhost \| 5454 \| 0 \| us-east-2b \| +----+---------+-----------+------+-----------+------------+ ``` Also: * Don't return the raw `SafekeeperPersistence` struct that contains the raw database presentation, but instead a new `SafekeeperDescribeResponse` struct. * The `SafekeeperPersistence` struct leaves out the `active` field on purpose because we want to deprecate it and replace it with a `scheduling_policy` one. Part of https://github.com/neondatabase/neon/issues/9981	2024-12-18 12:47:56 +00:00
Arpad Müller	342cbea255	storcon: add safekeeper list API (#10089 ) This adds an API to the storage controller to list safekeepers registered to it. This PR does a `diesel print-schema > storage_controller/src/schema.rs` because of an inconsistency between up.sql and schema.rs, introduced by [this](`2c142f14f7`) commit, so there is some updates of `schema.rs` due to that. As a followup to this, we should maybe think about running `diesel print-schema` in CI. Part of #9981	2024-12-12 01:09:24 +00:00
John Spray	ec790870d5	storcon: automatically clear Pause/Stop scheduling policies to enable detaches (#10011 ) ## Problem We saw a tenant get stuck when it had been put into Pause scheduling mode to pin it to a pageserver, then it was left idle for a while and the control plane tried to detach it. Close: https://github.com/neondatabase/neon/issues/9957 ## Summary of changes - When changing policy to Detached or Secondary, set the scheduling policy to Active. - Add a test that exercises this - When persisting tenant shards, set their `generation_pageserver` to null if the placement policy is not Attached (this enables consistency checks to work, and avoids leaving state in the DB that could be confusing/misleading in future)	2024-12-07 13:05:09 +00:00
Vlad Lazar	2cf47b1477	storcon: do az aware scheduling (#9083 ) ## Problem Storage controller didn't previously consider AZ locality between compute and pageservers when scheduling nodes. Control plane has this feature, and, since we are migrating tenants away from it, we need feature parity to avoid perf degradations. ## Summary of changes The change itself is fairly simple: 1. Thread az info into the scheduler 2. Add an extra member to the scheduling scores Step (2) deserves some more discussion. Let's break it down by the shard type being scheduled: Attached Shards We wish for attached shards of a tenant to end up in the preferred AZ of the tenant since that is where the compute is like to be. The AZ member for `NodeAttachmentSchedulingScore` has been placed below the affinity score (so it's got the second biggest weight for picking the node). The rationale for going below the affinity score is to avoid having all shards of a single tenant placed on the same node in 2 node regions, since that would mean that one tenant can drive the general workload of an entire pageserver. I'm not 100% sure this is the right decision, so open to discussing hoisting the AZ up to first place. Secondary Shards We wish for secondary shards of a tenant to be scheduled in a different AZ from the preferred one for HA purposes. The AZ member for `NodeSecondarySchedulingScore` has been placed first, so nodes in different AZs from the preferred one will always be considered first. On small clusters, this can mean that all the secondaries of a tenant are scheduled to the same pageserver, but secondaries don't use up as many resources as the attached location, so IMO the argument made for attached shards doesn't hold. Related: https://github.com/neondatabase/neon/issues/8848	2024-09-25 14:31:04 +01:00
Vlad Lazar	e86fef05dd	storcon: track preferred AZ for each tenant shard (#8937 ) ## Problem We want to do AZ aware scheduling, but don't have enough metadata. ## Summary of changes Introduce a `preferred_az_id` concept for each managed tenant shard. In a future PR, the scheduler will use this as a soft preference. The idea is to try and keep the shard attachments within the same AZ. Under the assumption that the compute was placed in the correct AZ, this reduces the chances of cross AZ trafic from between compute and PS. In terms of code changes we: 1. Add a new nullable `preferred_az_id` column to the `tenant_shards` table. Also include an in-memory counterpart. 2. Populate the preferred az on tenant creation and shard splits. 3. Add an endpoint which allows to bulk-set preferred AZs. (3) gives us the migration path. I'll write a script which queries the cplane db in the region and sets the preferred az of all shards with an active compute to the AZ of said compute. For shards without an active compute, I'll use the AZ of the currently attached pageserver since this is what cplane uses now to schedule computes.	2024-09-06 13:11:17 +01:00
Vlad Lazar	04f99a87bf	storcon: make pageserver AZ id mandatory (#8856 ) ## Problem https://github.com/neondatabase/neon/pull/8852 introduced a new nullable column for the `nodes` table: `availability_zone_id` ## Summary of changes * Make neon local and the test suite always provide an az id * Make the az id field in the ps registration request mandatory * Migrate the column to non-nullable and adjust in memory state accordingly * Remove the code that was used to populate the az id for pre-existing nodes	2024-09-05 19:14:21 +01:00
John Spray	1a9b54f1d9	storage controller: read from database in validate API (#8784 ) ## Problem The initial implementation of the validate API treats the in-memory generations as authoritative. - This is true when only one storage controller is running, but if a rogue controller was running that hadn't been shut down properly, and some pageserver requests were routed to that bad controller, it could incorrectly return valid=true for stale generations. - The generation in the main in-memory map gets out of date while a live migration is in flight, and if the origin location for the migration tries to do some deletions even though it is in AttachedStale (for example because it had already started compaction), these might be wrongly validated + executed. ## Summary of changes - Continue to do the in-memory check: if this returns valid=false it is sufficient to reject requests. - When valid=true, do an additional read from the database to confirm the generation is fresh. - Revise behavior for validation on missing shards: this used to always return valid=true as a convenience for deletions and shard splits, so that pageservers weren't prevented from completing any enqueued deletions for these shards after they're gone. However, this becomes unsafe when we consider split brain scenarios. We could reinstate this in future if we wanted to store some tombstones for deleted shards. - Update test_scrubber_physical_gc to cope with the behavioral change: they must now explicitly flush the deletion queue before splits, to avoid tripping up on deletions that are enqueued at the time of the split (these tests assert "scrubber deletes nothing", which check fails if the split leaves behind some remote objects that are legitimately GC'able) - Add `test_storage_controller_validate_during_migration`, which uses failpoints to create a situation where incorrect generation validation during a live migration could result in a corruption The rate of validate calls for tenants is pretty low: it happens as a consequence deletions from GC and compaction, which are both concurrency-limited on the pageserver side.	2024-09-04 15:00:40 +01:00
Joonas Koivunen	7a1397cf37	storcon: boilerplate to upsert safekeeper records on deploy (#8879 ) We currently do not record safekeepers in the storage controller database. We want to migrate timelines across safekeepers eventually, so start recording the safekeepers on deploy. Cc: #8698	2024-09-04 10:10:05 +00:00
John Spray	516ac0591e	storage controller: eliminate ensure_attached (#8875 ) ## Problem This is a followup to #8783 - The old blocking ensure_attached function had been retained to handle the case where a shard had a None generation_pageserver, but this wasn't really necessary. - There was a subtle `.1` in the code where a struct would have been clearer Closes #8819 ## Summary of changes - Add ShardGenerationState to represent the results of peek_generation - Instead of calling ensure_attached when a tenant has a non-attached shard, check the shard's policy and return 409 if it isn't Attached, else return 503 if the shard's policy is attached but it hasn't been reconciled yet (i.e. has a None generation_pageserver)	2024-09-02 11:36:57 +00:00
Vlad Lazar	793b5061ec	storcon: track pageserver availability zone (#8852 ) ## Problem In order to build AZ aware scheduling, the storage controller needs to know what AZ pageservers are in. Related https://github.com/neondatabase/neon/issues/8848 ## Summary of changes This patch set adds a new nullable column to the `nodes` table: `availability_zone_id`. The node registration request is extended to include the AZ id (pageservers already have this in their `metadata.json` file). If the node is already registered, then we update the persistent and in-memory state with the provided AZ. Otherwise, we add the node with the AZ to begin with. A couple assumptions are made here: 1. Pageserver AZ ids are stable 2. AZ ids do not change over time Once all pageservers have a configured AZ, we can remove the optionals in the code and make the database column not nullable.	2024-08-28 18:23:55 +01:00
John Spray	0aa1450936	storage controller: enable timeline CRUD operations to run concurrently with reconciliation & make them safer (#8783 ) ## Problem - If a reconciler was waiting to be able to notify computes about a change, but the control plane was waiting for the controller to finish a timeline creation/deletion, the overall system can deadlock. - If a tenant shard was migrated concurrently with a timeline creation/deletion, there was a risk that the timeline operation could be applied to a non-latest-generation location, and thereby not really be persistent. This has never happened in practice, but would eventually happen at scale. Closes: #8743 ## Summary of changes - Introduce `Service::tenant_remote_mutation` helper, which looks up shards & generations and passes them into an inner function that may do remote I/O to pageservers. Before returning success, this helper checks that generations haven't incremented, to guarantee that changes are persistent. - Convert tenant_timeline_create, tenant_timeline_delete, and tenant_timeline_detach_ancestor to use this helper. - These functions no longer block on ensure_attached unless the tenant was never attached at all, so they should make progress even if we can't complete compute notifications. This increases the database load from timeline/create operations, but only with cheap read transactions.	2024-08-23 18:56:05 +01:00
Vlad Lazar	1c96957e85	storcon: run db migrations after step down sequence (#8756 ) ## Problem Previously, we would run db migrations before doing the step-down sequence. This meant that the current leader would have to deal with the schema changes and that's generally not safe. ## Summary of changes Push the step-down procedure earlier in start-up and do db migrations right after it (but before we load-up the in-memory state from the db). Epic: https://github.com/neondatabase/cloud/issues/14701	2024-08-20 14:00:36 +01:00
Vlad Lazar	ae527ef088	storcon: implement graceful leadership transfer (#8588 ) ## Problem Storage controller restarts cause temporary unavailability from the control plane POV. See RFC for more details. ## Summary of changes * A couple of small refactors of the storage controller start-up sequence to make extending it easier. * A leader table is added to track the storage controller instance that's currently the leader (if any) * A peer client is added such that storage controllers can send `step_down` requests to each other (implemented in https://github.com/neondatabase/neon/pull/8512). * Implement the leader cut-over as described in the RFC * Add `start-as-candidate` flag to the storage controller to gate the rolling restart behaviour. When the flag is `false` (the default), the only change from the current start-up sequence is persisting the leader entry to the database.	2024-08-12 13:58:46 +01:00
Yuchen Liang	e374d6778e	feat(storcon): store scrubber metadata scan result (#8480 ) Part of #8128, followed by #8502. ## Problem Currently we lack mechanism to alert unhealthy `scan_metadata` status if we start running this scrubber command as part of a cronjob. With the storage controller client introduced to storage scrubber in #8196, it is viable to set up alert by storing health status in the storage controller database. We intentionally do not store the full output to the database as the json blobs potentially makes the table really huge. Instead, only a health status and a timestamp recording the last time metadata health status is posted on a tenant shard. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-30 14:32:00 +01:00
Christian Schwarz	a2d170b6d0	NeonEnv.from_repo_dir: use storage_controller_db instead of `attachments.json` (#8382 ) When `NeonEnv.from_repo_dir` was introduced, storage controller stored its state exclusively `attachments.json`. Since then, it has moved to using Postgres, which stores its state in `storage_controller_db`. But `NeonEnv.from_repo_dir` wasn't adjusted to do this. This PR rectifies the situation. Context for this is failures in `test_pageserver_characterize_throughput_with_n_tenants` CF: https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH Notably, `from_repo_dir` is also used by the backwards- and forwards-compatibility. Thus, the changes in this PR affect those tests as well. However, it turns out that the compatibility snapshot already contains the `storage_controller_db`. Thus, it should just work and in fact we can remove hacks like `fixup_storage_controller`. Follow-ups created as part of this work: * https://github.com/neondatabase/neon/issues/8399 * https://github.com/neondatabase/neon/issues/8400	2024-07-18 10:56:07 +02:00

1 2

57 Commits