controller: use PageserverUtilization for scheduling (#8711)

## Problem

Previously, the controller used only shard counts for scheduling. That
works well when hosting only many-sharded tenants, but much less well when
hosting single-sharded tenants, whose size-per-shard varies far more
widely.

Closes: https://github.com/neondatabase/neon/issues/7798

## Summary of changes

- Instead of UtilizationScore, carry the full PageserverUtilization
through into the Scheduler.
- Use PageserverUtilization::score() rather than the raw shard count when
ordering nodes during scheduling (a rough sketch follows below).
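
A rough sketch of the resulting ordering, in illustrative Rust rather than the controller's actual code: the `Utilization` struct, its `score()` weighting, and the node list below are all made up; only the idea of ordering candidate nodes by a combined utilization score instead of raw shard count comes from this PR.

```rust
#[derive(Clone, Copy)]
struct Utilization {
    shard_count: u64,
    disk_used_bytes: u64,
    disk_capacity_bytes: u64,
}

impl Utilization {
    /// Lower score == better scheduling target. Folds disk fullness in
    /// alongside shard count, so a node with a few very large shards no
    /// longer looks artificially attractive. (Weights are arbitrary here.)
    fn score(&self) -> u64 {
        let disk_pct = self.disk_used_bytes * 100 / self.disk_capacity_bytes;
        self.shard_count + disk_pct * 10
    }
}

/// Pick the least-utilized node; ties broken by lower node id.
fn pick_target(nodes: &[(u64, Utilization)]) -> Option<u64> {
    nodes
        .iter()
        .min_by_key(|(node_id, u)| (u.score(), *node_id))
        .map(|(node_id, _)| *node_id)
}

fn main() {
    let nodes = vec![
        // Node 1: few shards, but its disk is nearly full.
        (1, Utilization { shard_count: 2, disk_used_bytes: 900, disk_capacity_bytes: 1000 }),
        // Node 2: more shards, but plenty of free space.
        (2, Utilization { shard_count: 8, disk_used_bytes: 100, disk_capacity_bytes: 1000 }),
    ];
    // Ordering by shard count alone would pick node 1; ordering by the
    // combined score picks node 2.
    assert_eq!(pick_target(&nodes), Some(2));
}
```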

Q: Why did test_sharding_split_smoke need updating in this PR?
A: There's an interesting side effect during shard splits: because we do
not decrement the shard count in the utilization when we de-schedule the
pre-split shards, the controller will now prefer to pick _different_
nodes for the child shards than the ones that held secondaries before the
split. We could use our knowledge of splitting to fix up the utilizations
more actively in this situation, but I'm leaning toward leaving the code
simpler, since in practical systems the impact of one shard on a node's
utilization should be fairly low (single-digit %).
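
A toy illustration of that effect, with invented numbers rather than the controller's real accounting: while the pre-split parent shards are still counted against the node that held them, its utilization stays inflated, so a node that held nothing before the split can win the ordering.

```rust
fn main() {
    // Hypothetical shard counts at the moment the controller schedules the
    // child shards, before the pre-split parent shards have been
    // de-scheduled and removed from the utilization (numbers invented).
    let node_a_shards = 1 + 4; // still counts the parent, plus 4 new children
    let node_b_shards = 4;     // held nothing for this tenant before the split

    // With shard count feeding the utilization score, node B now looks less
    // loaded, so further child shards may land there rather than on node A,
    // which held the secondaries before the split.
    assert!(node_b_shards < node_a_shards);
}
```
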
Author: John Spray
Date: 2024-08-23 18:32:56 +01:00 (committed by GitHub)
Parent: c1cb7a0fa0
Commit: b65a95f12e
11 changed files with 340 additions and 101 deletions


@@ -394,6 +394,7 @@ def test_sharding_split_smoke(
     # Note which pageservers initially hold a shard after tenant creation
     pre_split_pageserver_ids = [loc["node_id"] for loc in env.storage_controller.locate(tenant_id)]
+    log.info(f"Pre-split pageservers: {pre_split_pageserver_ids}")
     # For pageservers holding a shard, validate their ingest statistics
     # reflect a proper splitting of the WAL.
@@ -555,9 +556,9 @@ def test_sharding_split_smoke(
     assert sum(total.values()) == split_shard_count * 2
     check_effective_tenant_config()
-    # More specific check: that we are fully balanced. This is deterministic because
-    # the order in which we consider shards for optimization is deterministic, and the
-    # order of preference of nodes is also deterministic (lower node IDs win).
+    # More specific check: that we are fully balanced. It is deterministic that we will get exactly
+    # one shard on each pageserver, because for these small shards the utilization metric is
+    # dominated by shard count.
     log.info(f"total: {total}")
     assert total == {
         1: 1,
@@ -577,8 +578,14 @@ def test_sharding_split_smoke(
         15: 1,
         16: 1,
     }
+    # The controller is not required to lay out the attached locations in any particular way, but
+    # all the pageservers that originally held an attached shard should still hold one, otherwise
+    # it would indicate that we had done some unnecessary migration.
     log.info(f"attached: {attached}")
-    assert attached == {1: 1, 2: 1, 3: 1, 5: 1, 6: 1, 7: 1, 9: 1, 11: 1}
+    for ps_id in pre_split_pageserver_ids:
+        log.info(f"Pre-split pageserver {ps_id} should still hold an attached location")
+        assert ps_id in attached
     # Ensure post-split pageserver locations survive a restart (i.e. the child shards
     # correctly wrote config to disk, and the storage controller responds correctly