
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-17 02:12:56 +00:00

Author	SHA1	Message	Date
John Spray	aa7323a384	storage controller: quality of life improvements for AZ handling (#10379) ## Problem Since https://github.com/neondatabase/neon/pull/9916, the preferred AZ of a tenant is much more impactful, and we would like to make it more visible in tooling. ## Summary of changes - Include AZ in node describe API - Include AZ info in node & tenant outputs in CLI - Add metrics for per-node shard counts, labelled by AZ - Add a CLI for setting preferred AZ on a tenant - Extend AZ-setting API+CLI to handle None for clearing preferred AZ	2025-01-14 15:30:43 +00:00
John Spray	bd09369198	storcon: add metric for AZ scheduling violations (#9949) ## Problem We can't easily tell how far the state of shards is from their AZ preferences. This can be a cause of performance issues, so it's important for diagnosability that we can tell easily if there are significant numbers of shards that aren't running in their preferred AZ. Related: https://github.com/neondatabase/cloud/issues/15413 ## Summary of changes - In reconcile_all, count shards that are scheduled into the wrong AZ (if they have a preference), and publish it as a prometheus gauge. - Also calculate a statistic for how many shards wanted to reconcile but couldn't. This is clearly a lazy calculation: reconcile all only runs periodically. But that's okay: shards in the wrong AZ is something that only matters if it stays that way for some period of time.	2024-12-02 11:50:22 +00:00
John Spray	8dca188974	storage controller: add metrics for tenant shard, node count (#9475) ## Problem Previously, figuring out how many tenant shards were managed by a storage controller was typically done by peeking at the database or calling into the API. A metric makes it easier to monitor, as unexpectedly increasing shard counts can be indicative of problems elsewhere in the system. ## Summary of changes - Add metrics `storage_controller_pageserver_nodes` (updated on node CRUD operations from Service) and `storage_controller_tenant_shards` (updated RAII-style from TenantShard)	2024-10-22 19:43:02 +01:00
Vlad Lazar	38a8dcab9f	storcon: add metric for long running reconciles (#9207) ## Problem We don't have an alert for long running reconciles. Stuck reconciles are problematic as we've seen in a recent incident. ## Summary of changes Add a new metric `storage_controller_reconcile_long_running_total` with labels: `{tenant_id, shard_number, seq}`. The metric is removed after the long running reconcile finishes. These events should be rare, so we won't break the bank on cardinality. Related https://github.com/neondatabase/neon/issues/9150	2024-10-02 17:25:11 +01:00
Vlad Lazar	1c96957e85	storcon: run db migrations after step down sequence (#8756) ## Problem Previously, we would run db migrations before doing the step-down sequence. This meant that the current leader would have to deal with the schema changes and that's generally not safe. ## Summary of changes Push the step-down procedure earlier in start-up and do db migrations right after it (but before we load-up the in-memory state from the db). Epic: https://github.com/neondatabase/cloud/issues/14701	2024-08-20 14:00:36 +01:00
Vlad Lazar	ae527ef088	storcon: implement graceful leadership transfer (#8588) ## Problem Storage controller restarts cause temporary unavailability from the control plane POV. See RFC for more details. ## Summary of changes * A couple of small refactors of the storage controller start-up sequence to make extending it easier. * A leader table is added to track the storage controller instance that's currently the leader (if any) * A peer client is added such that storage controllers can send `step_down` requests to each other (implemented in https://github.com/neondatabase/neon/pull/8512). * Implement the leader cut-over as described in the RFC * Add `start-as-candidate` flag to the storage controller to gate the rolling restart behaviour. When the flag is `false` (the default), the only change from the current start-up sequence is persisting the leader entry to the database.	2024-08-12 13:58:46 +01:00
Vlad Lazar	7a796a9963	storcon: introduce step down primitive (#8512) ## Problem We are missing the step-down primitive required to implement rolling restarts of the storage controller. ## Summary of changes Add `/control/v1/step_down` endpoint which puts the storage controller into a state where it rejects all API requests apart from `/control/v1/step_down`, `/status` and `/metrics`. When receiving the request, storage controller cancels all pending reconciles and waits for them to exit gracefully. The response contains a snapshot of the in-memory observed state. Related: * https://github.com/neondatabase/cloud/issues/14701 * https://github.com/neondatabase/neon/issues/7797 * https://github.com/neondatabase/neon/pull/8310	2024-07-26 14:54:09 +01:00
Conrad Ludgate	f212630da2	update measured with some more convenient features (#7334) ## Problem Some awkwardness in the measured API. Missing process metrics. ## Summary of changes Update measured to use the new convenience setup features. Added measured-process lib. Added measured support for libmetrics	2024-04-08 18:01:41 +00:00
John Spray	66fc465484	Clean up 'attachment service' names to storage controller (#7326) The binary etc were renamed some time ago, but the path in the source tree remained "attachment_service" to avoid disruption to ongoing PRs. There aren't any big PRs out right now, so it's a good time to cut over. - Rename `attachment_service` to `storage_controller` - Move it to the top level for symmetry with `storage_broker` & to avoid mixing the non-prod neon_local stuff (`control_plane/`) with the storage controller which is a production component.	2024-04-05 16:18:00 +01:00
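
Commit bd09369198 above describes counting shards that are scheduled outside their preferred AZ. The following is a minimal sketch of that kind of count, not the storage controller's actual code: `ShardPlacement` and `count_az_violations` are hypothetical names invented for illustration, and ignoring shards that are not attached anywhere is an assumption of this sketch.

```rust
/// Hypothetical, simplified view of where a shard wants to be and where it is.
/// (Not a type from the neon source tree.)
struct ShardPlacement {
    /// AZ the tenant prefers, if any preference is set.
    preferred_az: Option<String>,
    /// AZ of the node the shard is currently attached to, if attached.
    attached_az: Option<String>,
}

/// Count shards that have an AZ preference and are attached in a different AZ.
/// Shards without a preference, or without an attachment, are not counted
/// (the latter is an assumption made for this sketch).
fn count_az_violations(shards: &[ShardPlacement]) -> usize {
    shards
        .iter()
        .filter(|s| match (&s.preferred_az, &s.attached_az) {
            (Some(preferred), Some(actual)) => preferred != actual,
            _ => false,
        })
        .count()
}
```

A periodic loop could publish the returned value as a gauge. Because the count is only refreshed each time the loop runs, it lags reality slightly, which matches the "lazy calculation" trade-off the commit message accepts.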

9 Commits
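
Commit 8dca188974 mentions that `storage_controller_tenant_shards` is updated "RAII-style". As a rough illustration of that general pattern, using only the standard library: the `Gauge` and `GaugeGuard` types below are hypothetical stand-ins, not the metrics types the storage controller actually uses.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a metrics gauge (not the real metrics library type).
#[derive(Default)]
struct Gauge(AtomicI64);

impl Gauge {
    /// Read the current gauge value.
    fn get(&self) -> i64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Guard that contributes 1 to the gauge for as long as it is alive.
/// Embedding such a guard in an object ties the gauge to that object's lifetime.
struct GaugeGuard {
    gauge: Arc<Gauge>,
}

impl GaugeGuard {
    /// Increment the gauge and return a guard that undoes the increment on drop.
    fn new(gauge: Arc<Gauge>) -> Self {
        gauge.0.fetch_add(1, Ordering::Relaxed);
        Self { gauge }
    }
}

impl Drop for GaugeGuard {
    fn drop(&mut self) {
        // Decrement when the owning object goes away, on any exit path.
        self.gauge.0.fetch_sub(1, Ordering::Relaxed);
    }
}
```

The appeal of this design is that no code path has to remember to decrement the gauge: dropping the owner is enough, so the metric cannot drift from the number of live objects.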