storcon: handle reattach and heartbeat race

Consider the case when the storage controller handles the re-attach of a node
before the heartbeats detect that the node is back online. We still need
to reconfigure the node (by calling `Service::node_configure`) to migrate
attachments back onto the node.

In order to determine if node reconfiguration is required, we call into
`Node::get_availability_transition`. This commit updates the function
to consider the transition from "node just re-attached" (with no
utilisation score) to "node responded to the first heartbeat after a
period of unavailability" (with some utilisation score).
Vlad Lazar
2024-06-14 11:28:11 +01:00
parent dc2ab4407f
commit 8270b58f39

@@ -3,7 +3,7 @@ use std::{str::FromStr, time::Duration};
 use pageserver_api::{
     controller_api::{
         NodeAvailability, NodeDescribeResponse, NodeRegisterRequest, NodeSchedulingPolicy,
-        TenantLocateResponseShard,
+        TenantLocateResponseShard, UtilizationScore,
     },
     shard::TenantShardId,
 };
@@ -116,6 +116,15 @@ impl Node {
         match (self.availability, availability) {
             (Offline, Active(_)) => ToActive,
             (Active(_), Offline) => ToOffline,
+            // Consider the case when the storage controller handles the re-attach of a node
+            // before the heartbeats detect that the node is back online. We still need
+            // [`Service::node_configure`] to migrate attachments back onto the node.
+            // The unsavoury match arm below handles this situation.
+            (Active(lhs), Active(rhs))
+                if lhs == UtilizationScore::worst() && rhs < UtilizationScore::worst() =>
+            {
+                ToActive
+            }
             _ => Unchanged,
         }
     }
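
The new match arm can be exercised in isolation with a minimal sketch like the one below. It is not the storage controller's code: `UtilizationScore`, `NodeAvailability` and `AvailabilityTransition` are simplified stand-ins defined locally, the free function takes both states as arguments instead of reading `self.availability`, and it assumes that the worst score is the largest value and that a freshly re-attached node is recorded as `Active` with that worst score.

```rust
// Simplified stand-ins for the real types (assumptions, not the actual
// definitions from `pageserver_api::controller_api`).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct UtilizationScore(pub u64);

impl UtilizationScore {
    // Assumption: a larger value means a worse score, so the worst score
    // is the maximum and any real score compares as smaller.
    pub fn worst() -> Self {
        UtilizationScore(u64::MAX)
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum NodeAvailability {
    Active(UtilizationScore),
    Offline,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum AvailabilityTransition {
    ToActive,
    ToOffline,
    Unchanged,
}

// Free-function version of the transition check shown in the diff.
pub fn get_availability_transition(
    current: NodeAvailability,
    next: NodeAvailability,
) -> AvailabilityTransition {
    use AvailabilityTransition::*;
    use NodeAvailability::*;

    match (current, next) {
        (Offline, Active(_)) => ToActive,
        (Active(_), Offline) => ToOffline,
        // Re-attach was handled before the first heartbeat: the node is
        // already Active but only carries the placeholder worst score.
        // The first real score must still count as a transition to Active.
        (Active(lhs), Active(rhs))
            if lhs == UtilizationScore::worst() && rhs < UtilizationScore::worst() =>
        {
            ToActive
        }
        _ => Unchanged,
    }
}
```

Under these assumptions, going from `Active(UtilizationScore::worst())` to `Active(UtilizationScore(10))` yields `ToActive`, while a heartbeat that moves an already scored node from `Active(UtilizationScore(5))` to `Active(UtilizationScore(10))` stays `Unchanged`, so an ordinary score update does not trigger reconfiguration.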