
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-25 17:10:38 +00:00

Author	SHA1	Message	Date
Vlad Lazar	c75b584430	storage_controller: add metrics (#7178) ## Problem Storage controller had basically no metrics. ## Summary of changes 1. Migrate the existing metrics to use Conrad's [`measured`](https://docs.rs/measured/0.0.14/measured/) crate. 2. Add metrics for incoming HTTP requests 3. Add metrics for outgoing HTTP requests to the pageserver 4. Add metrics for outgoing pass-through requests to the pageserver 5. Add metrics for database queries Note that the metrics response for the attachment service does not use chunked encoding like the rest of the metrics endpoints. Conrad has kindly extended the crate such that it can now be done. Let's leave it for a follow-up since the payload shouldn't be that big at this point. Fixes https://github.com/neondatabase/neon/issues/6875	2024-03-21 12:00:20 +00:00
Vlad Lazar	38767ace68	storage_controller: periodic pageserver heartbeats (#7092) ## Problem If a pageserver was offline when the storage controller started, there was no mechanism to update the storage controller state when the pageserver becomes active. ## Summary of changes * Add a heartbeater module. The heartbeater must be driven by an external loop. * Integrate the heartbeater into the service. - Extend the types used by the service and scheduler to keep track of a node's utilisation score. - Add a background loop to drive the heartbeater and update the state based on the deltas it generated - Do an initial round of heartbeats at start-up	2024-03-14 15:21:36 +00:00
John Spray	44f42627dd	pageserver/controller: error handling for shard splitting (#7074) ## Problem Shard splits worked, but weren't safe against failures (e.g. node crash during split) yet. Related: #6676 ## Summary of changes - Introduce async rwlocks at the scope of Tenant and Node: - exclusive tenant lock is used to protect splits - exclusive node lock is used to protect new reconciliation process that happens when setting node active - exclusive locks used in both cases when doing persistent updates (e.g. node scheduling conf) where the update to DB & in-memory state needs to be atomic. - Add failpoints to shard splitting in control plane and pageserver code. - Implement error handling in control plane for shard splits: this detaches child shards and ensures parent shards are re-attached. - Crash-safety for storage controller restarts requires little effort: we already reconcile with nodes over a storage controller restart, so as long as we reset any incomplete splits in the DB on restart (added in this PR), things are implicitly cleaned up. - Implement reconciliation with offline nodes before they transition to active: - (in this context reconciliation means something like startup_reconcile, not literally the Reconciler) - This covers cases where split abort cannot reach a node to clean it up: the cleanup will eventually happen when the node is marked active, as part of reconciliation. - This also covers the case where a node was unavailable when the storage controller started, but becomes available later: previously this allowed it to skip the startup reconcile. - Storage controller now terminates on panics. We only use panics for true "should never happen" assertions, and these cases can leave us in an unusable state if we keep running (e.g. panicking in a shard split). In the unlikely event that we get into a crashloop as a result, we'll rely on kubernetes to back us off.
- Add `test_sharding_split_failures` which exercises a variety of failure cases during shard split.	2024-03-14 09:11:57 +00:00
John Spray	7329413705	storage controller: enable setting PlacementPolicy in tenant creation (#7037) ## Problem Tenants created via the storage controller have a `PlacementPolicy` that defines their HA/secondary/detach intent. For backward compat we can just set it to Single, and for onboarding tenants using /location_conf it is automatically set to Double(1) if there are at least two pageservers, but for freshly created tenants we didn't have a way to specify it. This unblocks writing tests that create HA tenants on the storage controller and do failure injection testing. ## Summary of changes - Add optional fields to TenantCreateRequest for specifying PlacementPolicy. This request structure is used both on pageserver API and storage controller API, but this method is only meaningful for the storage controller (same as existing `shard_parameters` attribute). - Use the value from the creation request in tenant creation, if provided.	2024-03-08 15:34:53 +00:00
John Spray	20d0939b00	control_plane/attachment_service: implement PlacementPolicy::Secondary, configuration updates (#6521) During onboarding, the control plane may attempt ad-hoc creation of a secondary location to facilitate live migration. This gives us two problems to solve: - Accept 'Secondary' mode in /location_config and use it to put the tenant into secondary mode on some physical pageserver, then pass through /tenant/xyz/secondary/download requests - Create tenants with no generation initially, since the initial `Secondary` mode call will not provide us a generation. This PR also fixes modification of a tenant's TenantConf during /location_conf, which was previously ignored, and refines the flow for config modification: - avoid bumping generations when the only reason we're reconciling an attached location is a config change - increment TenantState.sequence when spawning a reconciler: usually schedule() does this, but when we do config changes that doesn't happen, so without this change waiters would think reconciliation was done immediately. `sequence` is a bit of a murky thing right now, as it's dual-purposed for tracking waiters, and for checking if an existing reconciliation is already making updates to our current sequence. I'll follow up at some point to clarify its purpose. - test config modification at the end of onboarding test	2024-03-01 20:25:53 +00:00
Vlad Lazar	5accf6e24a	attachment_service: JWT auth enforcement (#6897) ## Problem Attachment service does not do auth based on JWT scopes. ## Summary of changes Do JWT-based permission checking for requests coming into the attachment service. Requests into the attachment service must use different tokens based on the endpoint: * `/control` and `/debug` require `admin` scope * `/upcall` requires `generations_api` scope * `/v1/...` requires `pageserverapi` scope Requests into the pageserver from the attachment service must use `pageserverapi` scope.	2024-02-26 18:17:06 +00:00
John Spray	0c105ef352	storage controller: debug observability endpoints and self-test (#6820) This PR stacks on https://github.com/neondatabase/neon/pull/6814 Observability: - Because we only persist a subset of our state, and our external API is pretty high level, it can be hard to get at the detail of what's going on internally (e.g. the IntentState of a shard). - Add debug endpoints for getting a full dump of all TenantState and SchedulerNode objects - Enrich the /control/v1/node listing endpoint to include full in-memory detail of `Node` rather than just the `NodePersistence` subset Consistency checks: - The storage controller maintains separate in-memory and on-disk states, by design. To catch subtle bugs, it is useful to occasionally cross-check these. - The Scheduler maintains reference counts for shard->node relationships, which could drift if there was a bug in IntentState: exhaustively cross-check them in tests.	2024-02-19 20:29:23 +00:00
John Spray	4f7704af24	storage controller: fix spurious reconciles after pageserver restarts (#6814) ## Problem When investigating test failures (https://github.com/neondatabase/neon/issues/6813) I noticed we were doing a bunch of Reconciler runs right after splitting a tenant. It's because the splitting test does a pageserver restart, and there was a bug in /re-attach handling, where we would update the generation correctly in the database and intent state, but not observed state, thereby triggering a reconciliation on the next call to maybe_reconcile. This didn't break anything profound (underlying rules about generations were respected), but caused the storage controller to do an un-needed extra round of bumping the generation and reconciling. ## Summary of changes - Start adding metrics to the storage controller - Assert on the number of reconciles done in test_sharding_split_smoke - Fix /re-attach to update `observed` such that we don't spuriously re-reconcile tenants.	2024-02-19 17:44:20 +00:00
John Spray	f2e5212fed	storage controller: background reconcile, graceful shutdown, better logging (#6709) ## Problem Now that the storage controller is working end to end, we start burning down the robustness aspects. ## Summary of changes - Add a background task that periodically calls `reconcile_all`. This ensures that if earlier operations couldn't succeed (e.g. because a node was unavailable), we will eventually retry. This naive initial implementation can start an unlimited number of reconcile tasks: limiting reconcile concurrency is a later item in #6342 - Add a number of tracing spans in key locations: each background task, each reconciler task. - Add a top level CancellationToken and Gate, and use these to implement a graceful shutdown that waits for tasks to shut down. This is not bulletproof yet, because within these tasks we have remote HTTP calls that aren't wrapped in cancellation/timeouts, but it creates the structure, and if we don't shut down promptly then k8s will kill us. - To protect shard splits from background reconciliation, expose the `SplitState` in memory and use it to guard any APIs that require an attached tenant.	2024-02-16 13:00:53 +00:00
John Spray	58f6cb649e	control_plane: database persistence for attachment_service (#6468) ## Problem Spun off from https://github.com/neondatabase/neon/pull/6394 -- this PR is just the persistence parts and the changes that enable it to work nicely ## Summary of changes - Revert #6444 and #6450 - In neon_local, start a vanilla postgres instance for the attachment service to use. - Adopt `diesel` crate for database access in attachment service. This uses raw SQL migrations as the source of truth for the schema, so it's a soft dependency: we can switch libraries pretty easily. - Rewrite persistence.rs to use postgres (via diesel) instead of JSON. - Preserve JSON read+write at startup and shutdown: this enables using the JSON format in compatibility tests, so that we don't have to commit to our DB schema yet. - In neon_local, run database creation + migrations before starting attachment service - Run the initial reconciliation in Service::spawn in the background, so that the pageserver + attachment service don't get stuck waiting for each other to start, when restarting both together in a test.	2024-01-26 17:20:44 +00:00
John Spray	a72af29d12	control_plane/attachment_service: implement PlacementPolicy::Detached (#6458) ## Problem The API for detaching things wasn't implemented yet, but one could hit this case indirectly from tests when using attach-hook, and find tenants unexpectedly attached again because their policy remained Single. ## Summary of changes Add PlacementPolicy::Detached, and: - add the behavior for it in schedule() - in tenant_migrate, refuse if the policy is detached - automatically set this policy in attach-hook if the caller has specified pageserver=null.	2024-01-24 12:49:30 +01:00
John Spray	b6ec11ad78	control_plane: generalize attachment_service to handle sharding (#6251) ## Problem To test sharding, we need something to control it. We could write python code for doing this from the test runner, but this wouldn't be usable with neon_local run directly, and when we want to write tests with a large number of shards/tenants, Rust is a better fit for efficiently handling all the required state. This service enables automated tests to easily get a system with sharding/HA without the test itself having to set this all up by hand: existing tests can be run against sharded tenants just by setting a shard count when creating the tenant. ## Summary of changes Attachment service was previously a map of TenantId->TenantState, where the principal state stored for each tenant was the generation and the last attached pageserver. This enabled it to serve the re-attach and validate requests that the pageserver requires. In this PR, the scope of the service is extended substantially to do overall management of tenants in the pageserver, including tenant/timeline creation, live migration, evacuation of offline pageservers etc. This is done using synchronous code to make declarative changes to the tenant's intended state (`TenantState.policy` and `TenantState.intent`), which are then translated into calls into the pageserver by the `Reconciler`. Top level summary of modules within `control_plane/attachment_service/src`: - `tenant_state`: structure that represents one tenant shard. - `service`: implements the main high-level operations such as tenant/timeline creation, marking a node offline, etc. - `scheduler`: for operations that need to pick a pageserver for a tenant, construct a scheduler and call into it. - `compute_hook`: receive notifications when a tenant shard is attached somewhere new. Once we have locations for all the shards in a tenant, emit an update to postgres configuration via the neon_local `LocalEnv`. - `http`: HTTP stubs.
These mostly map to methods on `Service`, but are separated for readability and so that it'll be easier to adapt if/when we switch to another RPC layer. - `node`: structure that describes a pageserver node. The most important attribute of a node is its availability: marking a node offline causes tenant shards to reschedule away from it. This PR is a precursor to implementing the full sharding service for prod (#6342). What's the difference between this and a production-ready controller for pageservers? - JSON file persistence to be replaced with a database - Limited observability. - No concurrency limits. Marking a pageserver offline will try and migrate every tenant to a new pageserver concurrently, even if there are thousands. - Very simple scheduler that only knows to pick the pageserver with fewest tenants, and place secondary locations on a different pageserver than attached locations: it does not try to place shards for the same tenant on different pageservers. This matters little in tests, because picking the least-used pageserver usually results in round-robin placement. - Scheduler state is rebuilt exhaustively for each operation that requires a scheduler. - Relies on neon_local mechanisms for updating postgres: in production this would be something that flows through the real control plane. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-01-17 18:01:08 +00:00

12 Commits
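
The endpoint-to-scope rules listed in the JWT auth enforcement commit (5accf6e24a) can be sketched roughly as follows. This is only an illustration built from the wording of that commit message: the `Scope` enum and the `required_scope` and `is_authorized` functions are hypothetical names, not the types or functions used in the neon codebase, and token parsing and signature verification are left out entirely.

```rust
/// Scopes named in the commit message for #6897.
/// Hypothetical enum for illustration; the real type in neon may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Scope {
    Admin,          // `admin`
    GenerationsApi, // `generations_api`
    PageServerApi,  // `pageserverapi`
}

/// Return the scope a request path requires, or `None` if the path is not
/// covered by the rules listed in the commit message.
pub fn required_scope(path: &str) -> Option<Scope> {
    // Compare the whole first path segment, so that a path such as
    // "/controlled" is not mistaken for "/control".
    let first_segment = path.strip_prefix('/')?.split('/').next()?;
    match first_segment {
        "control" | "debug" => Some(Scope::Admin),
        "upcall" => Some(Scope::GenerationsApi),
        "v1" => Some(Scope::PageServerApi),
        _ => None,
    }
}

/// A request is allowed only if the token carries exactly the scope the
/// path requires. Paths with no known rule are rejected (an assumption of
/// this sketch, not something the commit message states).
pub fn is_authorized(path: &str, token_scope: Scope) -> bool {
    required_scope(path) == Some(token_scope)
}
```

Under these assumptions, `is_authorized("/v1/tenant", Scope::PageServerApi)` is true, while the same token is rejected for `/control/v1/node`, which needs `Scope::Admin`.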