storcon: handle pageserver disk loss (#12667)

NB: effectively a no-op in the Neon env since the handling is config-gated in storcon

## Problem

When a pageserver suffers a local disk/node failure and restarts,
the storage controller will receive a re-attach call and return all the
tenants the pageserver is supposed to attach, but the pageserver will not
act on any tenants that it doesn't know about locally. As a result, the
pageserver will not rehydrate any tenants from remote storage after
restarting from a local disk loss, while the storage controller
still thinks that the pageserver has all the tenants attached. This
leaves the system in a bad state, and the symptom is that PG's
pageserver connections fail with "tenant not found" errors.

## Summary of changes

Made a slight change to the storage controller's `re_attach` API:
* The pageserver will set an additional bit `empty_local_disk` in the
reattach request, indicating whether it has started with an empty disk
or does not know about any tenants.
* Upon receiving the reattach request, if this `empty_local_disk` bit is
set, the storage controller will go ahead and clear all observed
locations referencing the pageserver. The reconciler will then discover
the discrepancy between the intended state and observed state of the
tenant and take care of the situation.

To facilitate rollouts, this extra behavior in the `re_attach` API is
guarded by the `handle_ps_local_disk_loss` command-line flag of the
storage controller.
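
The flag-guarded handling described above can be sketched as follows. This is a simplified, stdlib-only model (the struct and function names here are stand-ins, not the real storcon types): the `empty_local_disk` field is an `Option<bool>` so that requests from older pageservers that omit it still work, with absence treated as `false`.

```rust
use std::collections::HashMap;

// Simplified stand-in for the re-attach request body (hypothetical
// shape; the real struct lives in the storcon/pageserver API crates).
struct ReAttachRequest {
    node_id: u64,
    // Optional so requests from older pageservers that omit the field
    // still parse; absent is treated as `false` via `unwrap_or(false)`.
    empty_local_disk: Option<bool>,
}

// Simplified stand-in for a tenant shard's observed state:
// node_id -> an opaque marker for the observed location config.
#[derive(Default)]
struct ObservedState {
    locations: HashMap<u64, String>,
}

// Clear every observed location referencing the reattaching node, but
// only when the feature flag is on AND the pageserver reported an
// empty local disk. Returns the number of tenant shards affected.
fn clear_observed_locations(
    tenants: &mut HashMap<String, ObservedState>,
    req: &ReAttachRequest,
    handle_ps_local_disk_loss: bool,
) -> usize {
    if !(handle_ps_local_disk_loss && req.empty_local_disk.unwrap_or(false)) {
        return 0;
    }
    let mut affected = 0;
    for shard in tenants.values_mut() {
        if shard.locations.remove(&req.node_id).is_some() {
            affected += 1;
        }
    }
    affected
}

fn main() {
    let mut tenants = HashMap::new();
    tenants.insert(
        "shard-0".to_string(),
        ObservedState { locations: HashMap::from([(1, "attached".to_string())]) },
    );
    tenants.insert(
        "shard-1".to_string(),
        ObservedState { locations: HashMap::from([(2, "attached".to_string())]) },
    );

    // Node 1 reattaches with an empty disk and the flag enabled: only
    // shard-0's observed location referencing node 1 is cleared.
    let req = ReAttachRequest { node_id: 1, empty_local_disk: Some(true) };
    let affected = clear_observed_locations(&mut tenants, &req, true);
    assert_eq!(affected, 1);
    assert!(tenants["shard-0"].locations.is_empty());
    assert!(!tenants["shard-1"].locations.is_empty());
}
```

Once the observed locations are cleared, the reconciler sees intent ("attached on this node") diverge from observation ("nothing observed there") and re-issues the location configs, which is what rehydrates the tenants from remote storage.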

---------

Co-authored-by: William Huang <william.huang@databricks.com>
Author: Vlad Lazar
Date: 2025-07-22 12:04:03 +01:00
Committed by: GitHub
Parent: 9c0efba91e
Commit: d91d018afa
11 changed files with 122 additions and 3 deletions


@@ -225,6 +225,10 @@ struct Cli {
#[arg(long)]
shard_split_request_timeout: Option<humantime::Duration>,
/// **Feature Flag** Whether the storage controller should act to rectify pageserver-reported local disk loss.
#[arg(long, default_value = "false")]
handle_ps_local_disk_loss: bool,
}
enum StrictMode {
@@ -477,6 +481,7 @@ async fn async_main() -> anyhow::Result<()> {
.shard_split_request_timeout
.map(humantime::Duration::into)
.unwrap_or(Duration::MAX),
handle_ps_local_disk_loss: args.handle_ps_local_disk_loss,
};
// Validate that we can connect to the database


@@ -487,6 +487,9 @@ pub struct Config {
/// Timeout used for HTTP client of split requests. [`Duration::MAX`] if None.
pub shard_split_request_timeout: Duration,
// Feature flag: Whether the storage controller should act to rectify pageserver-reported local disk loss.
pub handle_ps_local_disk_loss: bool,
}
impl From<DatabaseError> for ApiError {
@@ -2388,6 +2391,33 @@ impl Service {
tenants: Vec::new(),
};
// [Hadron] If the pageserver reports in the reattach message that it has an empty disk, it's possible that it just
// recovered from a local disk failure. The response of the reattach request will contain a list of tenants but it
// will not be honored by the pageserver in this case (disk failure). We should make sure we clear any observed
// locations of tenants attached to the node so that the reconciler will discover the discrepancy and reconfigure the
// missing tenants on the node properly.
if self.config.handle_ps_local_disk_loss && reattach_req.empty_local_disk.unwrap_or(false) {
tracing::info!(
"Pageserver {node_id} reports empty local disk, clearing observed locations referencing the pageserver for all tenants",
node_id = reattach_req.node_id
);
let mut num_tenant_shards_affected = 0;
for (tenant_shard_id, shard) in tenants.iter_mut() {
if shard
.observed
.locations
.remove(&reattach_req.node_id)
.is_some()
{
tracing::info!("Cleared observed location for tenant shard {tenant_shard_id}");
num_tenant_shards_affected += 1;
}
}
tracing::info!(
"Cleared observed locations for {num_tenant_shards_affected} tenant shards"
);
}
// TODO: cancel/restart any running reconciliation for this tenant, it might be trying
// to call location_conf API with an old generation. Wait for cancellation to complete
// before responding to this request. Requires well implemented CancellationToken logic