storage controller: check warmth of secondary before doing proactive migration (#7583)

## Problem

The logic in `Service::optimize_all` would sometimes choose to migrate a
tenant to a secondary location that had only recently been created. As a
result, `Reconciler::live_migrate` would hit its 5 minute timeout while
warming up the location, and then proceed to attach the tenant to a
location whose local set of layer files was not warm enough for good
performance.

Closes: #7532 

## Summary of changes

- Add a pageserver API for checking download progress of a secondary
location
- During `optimize_all`, connect to the pageservers that hold the
secondary locations an optimization would migrate to, and check that
those locations are warm.
- During shard split, do heatmap uploads and start secondary downloads,
so that the new shards' secondary locations start downloading ASAP,
rather than waiting minutes for background downloads to kick in.
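
As a rough illustration of the warmth check in the list above, the sketch below decides whether a secondary is warm from a download-progress record. The `SecondaryProgress` fields and the `is_warm` helper are hypothetical names chosen for this example, not the pageserver API added by this commit, and treating a location with no known layers as cold is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class SecondaryProgress:
    """Hypothetical download-progress record for one secondary location."""

    layers_downloaded: int
    layers_total: int
    bytes_downloaded: int
    bytes_total: int


def is_warm(progress: SecondaryProgress, min_fraction: float = 1.0) -> bool:
    """Return True if enough of the layer set is present locally.

    Assumption: a location that reports zero total layers or bytes has not
    seen a heatmap yet, so it is treated as cold rather than as complete.
    """
    if progress.layers_total == 0 or progress.bytes_total == 0:
        return False
    layer_fraction = progress.layers_downloaded / progress.layers_total
    byte_fraction = progress.bytes_downloaded / progress.bytes_total
    # Require both the layer count and the byte count to meet the threshold.
    return min(layer_fraction, byte_fraction) >= min_fraction
```

Checking both layers and bytes is a design choice of this sketch: a location that has fetched many small layers but is still missing the large ones would not count as warm.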

I have intentionally not implemented this by continuously reading the
status of locations, to avoid the scaling challenge of efficiently
polling & updating the status of 10k-100k locations. If we implement
that in the future, this code can be simplified to act on the latest
known state of a location rather than fetching it inline during
`optimize_all`.
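
The inline approach described above can be sketched as follows: the status of a secondary is fetched only when the optimizer actually proposes migrating to it, so the number of status requests scales with the number of candidates rather than with the total number of locations. `plan_migrations`, `fetch_progress`, and `is_ready` are hypothetical names for this illustration; the real logic lives in the storage controller's `optimize_all`.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Location = TypeVar("Location")
Progress = TypeVar("Progress")


def plan_migrations(
    candidates: Iterable[Location],
    fetch_progress: Callable[[Location], Progress],
    is_ready: Callable[[Progress], bool],
) -> Tuple[List[Location], List[Location]]:
    """Split candidate secondaries into (migrate_now, skipped).

    fetch_progress is called once per candidate, at decision time, instead
    of keeping a continuously refreshed status for every location.
    """
    migrate_now: List[Location] = []
    skipped: List[Location] = []
    for location in candidates:
        progress = fetch_progress(location)
        if is_ready(progress):
            migrate_now.append(location)
        else:
            # Leave the attachment where it is; a later pass can retry.
            skipped.append(location)
    return migrate_now, skipped


# Example: two candidates, only one of which has finished downloading.
# Progress is modelled here as a (downloaded, total) tuple.
progress_by_location = {"ps1": (10, 10), "ps2": (3, 10)}
ready, waiting = plan_migrations(
    ["ps1", "ps2"],
    fetch_progress=lambda loc: progress_by_location[loc],
    is_ready=lambda p: p[0] >= p[1],
)
```
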
This commit is contained in:
John Spray
2024-05-03 15:28:23 +01:00
committed by GitHub
parent ce0ddd749c
commit b5a6e68e68
6 changed files with 471 additions and 63 deletions

@@ -287,6 +287,11 @@ def test_sharding_split_smoke(
         == shard_count
     )
+    # Make secondary downloads slow: this exercises the storage controller logic for not migrating an attachment
+    # during post-split optimization until the secondary is ready
+    for ps in env.pageservers:
+        ps.http_client().configure_failpoints([("secondary-layer-download-sleep", "return(1000)")])
     env.storage_controller.tenant_shard_split(tenant_id, shard_count=split_shard_count)
     post_split_pageserver_ids = [loc["node_id"] for loc in env.storage_controller.locate(tenant_id)]
@@ -300,7 +305,7 @@ def test_sharding_split_smoke(
     # Enough background reconciliations should result in the shards being properly distributed.
     # Run this before the workload, because its LSN-waiting code presumes stable locations.
-    env.storage_controller.reconcile_until_idle()
+    env.storage_controller.reconcile_until_idle(timeout_secs=60)
     workload.validate()
@@ -342,6 +347,10 @@ def test_sharding_split_smoke(
     assert cancelled_reconciles is not None and int(cancelled_reconciles) == 0
     assert errored_reconciles is not None and int(errored_reconciles) == 0
+    # We should see that the migration of shards after the split waited for secondaries to warm up
+    # before happening
+    assert env.storage_controller.log_contains(".*Skipping.*because secondary isn't ready.*")
     env.storage_controller.consistency_check()
def get_node_shard_counts(env: NeonEnv, tenant_ids):
@@ -1071,6 +1080,17 @@ def test_sharding_split_failures(
finish_split()
assert_split_done()
+    if isinstance(failure, StorageControllerFailpoint) and "post-complete" in failure.failpoint:
+        # On a post-complete failure, the controller will recover the post-split state
+        # after restart, but it will have missed the optimization part of the split function
+        # where secondary downloads are kicked off. This means that reconcile_until_idle
+        # will take a very long time if we wait for all optimizations to complete, because
+        # those optimizations will wait for secondary downloads.
+        #
+        # Avoid that by configuring the tenant into Essential scheduling mode, so that it will
+        # skip optimizations when we're exercising this particular failpoint.
+        env.storage_controller.tenant_policy_update(tenant_id, {"scheduling": "Essential"})
     # Having completed the split, pump the background reconciles to ensure that
     # the scheduler reaches an idle state
     env.storage_controller.reconcile_until_idle(timeout_secs=30)
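
The `timeout_secs` arguments in the hunks above bound how long the test waits for the scheduler to settle. A minimal sketch of such a bounded wait is shown below, assuming a hypothetical `reconcile_step` callable that returns the number of reconciles it started; it is not the test fixture's actual implementation of `reconcile_until_idle`.

```python
import time
from typing import Callable


def wait_until_idle(
    reconcile_step: Callable[[], int],
    timeout_secs: float,
    interval_secs: float = 0.1,
) -> int:
    """Call reconcile_step until it reports no work, or raise on timeout.

    Returns the number of calls that were made, including the idle one.
    """
    deadline = time.monotonic() + timeout_secs
    calls = 0
    while True:
        calls += 1
        if reconcile_step() == 0:
            return calls
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not idle after {timeout_secs}s")
        time.sleep(interval_secs)
```

Using a monotonic clock for the deadline keeps the wait unaffected by wall-clock adjustments during the test run.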