Commit Graph

7 Commits

Author SHA1 Message Date
John Spray
926662eb7c storage_controller: suppress misleading log (#7395)
## Problem

- https://github.com/neondatabase/neon/issues/7355

The optimize_secondary function calls schedule_shard to check for
improvements, but if there are exactly the same number of nodes as there
are replicas of the shard, it emits some scary looking logs about no
nodes being elegible.

Closes https://github.com/neondatabase/neon/issues/7355

## Summary of changes

- Add a mode to SchedulingContext that controls logging: this should be
useful in future any time we add a log to the scheduling path, to avoid
it becoming a source of spam when the scheduler is called during
optimization.
2024-04-16 12:41:48 +00:00
John Spray
83cdbbb89a pageserver: improve readability of shard.rs (#7330)
No functional changes, this is a comments/naming PR.

While merging sharding changes, some cleanup of the shard.rs types was
deferred.

In this PR:
- Rename `is_zero` to `is_shard_zero` to make clear that this method
doesn't literally mean that the entire object is zeros, just that it
refers to the 0th shard in a tenant.
- Pull definitions of types to the top of shard.rs and add a big comment
giving an overview of which type is for what.

Closes: https://github.com/neondatabase/neon/issues/6072
2024-04-15 11:50:26 +01:00
John Spray
1628b5b145 compute hook: use shared client with explicit timeout (#7359)
## Problem

We are seeing some mysterious long waits when sending requests.

## Summary of changes

- To eliminate risk that we are incurring some unreasonable overheads
from setup, e.g. DNS, use a single Client (internally a pool) instead of
repeatedly constructing a fresh one.
- To make it clearer where a timeout is occurring, apply a 10 second
timeout to requests as we send them.
2024-04-11 14:14:09 +00:00
Conrad Ludgate
f212630da2 update measured with some more convenient features (#7334)
## Problem

Some awkwardness in the measured API.
Missing process metrics.

## Summary of changes

Update measured to use the new convenience setup features.
Added measured-process lib.
Added measured support for libmetrics
2024-04-08 18:01:41 +00:00
John Spray
74b2314a5d control_plane: revise compute_hook locking (don't serialise all calls) (#7088)
## Problem

- Previously, an async mutex was held for the duration of
`ComputeHook::notify`. This served multiple purposes:
  - Ensure updates to a given tenant are sent in the proper order
- Prevent concurrent calls into neon_local endpoint updates in test
environments (neon_local is not safe to call concurrently)
- Protect the inner ComputeHook::state hashmap that is used to calculate
when to send notifications.

This worked, but had the major downside that while we're waiting for a
compute hook request to the control plane to succeed, we can't notify
about any other tenants. Notifications block progress of live
migrations, so this is a problem.

## Summary of changes

- Protect `ComputeHook::state` with a sync lock instead of an async lock
- Use a separate async lock ( `ComputeHook::neon_local_lock` ) for
preventing concurrent calls into neon_local, and only take this in the
neon_local code path.
- Add per-tenant async locks in ShardedComputeHookTenant, and use these
to ensure that only one remote notification can be sent at once per
tenant. If several shards update concurrently, their updates will be
coalesced.
- Add an explicit semaphore that limits concurrency of calls into the
cloud control plane.
2024-04-06 19:51:59 +00:00
John Spray
ec01292b55 storage controller: rename TenantState to TenantShard (#7329)
This is a widely used type that had a misleading name: it's not the
total state of a tenant, but rrepresents one shard.
2024-04-05 16:29:53 +00:00
John Spray
66fc465484 Clean up 'attachment service' names to storage controller (#7326)
The binary etc were renamed some time ago, but the path in the source
tree remained "attachment_service" to avoid disruption to ongoing PRs.
There aren't any big PRs out right now, so it's a good time to cut over.

- Rename `attachment_service` to `storage_controller`
- Move it to the top level for symmetry with `storage_broker` & to avoid
mixing the non-prod neon_local stuff (`control_plane/`) with the storage
controller which is a production component.
2024-04-05 16:18:00 +01:00