fix(tenant/timeline metrics): race condition during shutdown + recreation (#7064)

Tenant::shutdown or Timeline::shutdown completes and becomes externally
observable before the corresponding Tenant/Timeline object is dropped.

For example, after observing Tenant::shutdown complete, we could
attach the same tenant_id again. The shut-down Tenant object might still
be around at the time of the attach.

The race is then the following:
- old object's metrics are still around
- new object uses with_label_values
- old object calls remove_label_values

The outcome is that the new object still holds the metric objects (they're
an Arc internally), but the metrics are no longer part of the internal
registry and hence are missing from `/metrics`.

Later, when the new object gets shut down and calls
remove_label_values, it will observe an error because
the metric was already removed by the old object.
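
The interleaving above can be replayed with a toy model. This is only a sketch: `ToyGaugeVec` is a hypothetical std-only stand-in for a labelled metric family, not the real metrics crate. It assumes only the behaviour described above: handles are shared `Arc`s, and removal goes through the shared registry.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a labelled gauge family (not a real crate API).
#[derive(Default)]
struct ToyGaugeVec {
    children: Mutex<HashMap<String, Arc<AtomicU64>>>,
}

impl ToyGaugeVec {
    /// Returns the existing child for `label`, or registers a new one.
    fn with_label_values(&self, label: &str) -> Arc<AtomicU64> {
        let mut children = self.children.lock().unwrap();
        children.entry(label.to_string()).or_default().clone()
    }

    /// Unregisters the child for `label`; errors if it is already gone.
    fn remove_label_values(&self, label: &str) -> Result<(), String> {
        let mut children = self.children.lock().unwrap();
        children
            .remove(label)
            .map(|_| ())
            .ok_or_else(|| format!("missing metric: {}", label))
    }

    /// What a scrape of `/metrics` would see for `label`.
    fn scrape(&self, label: &str) -> Option<u64> {
        let children = self.children.lock().unwrap();
        children.get(label).map(|g| g.load(Ordering::Relaxed))
    }
}

/// Replays the race; returns (value scraped afterwards, result of the new
/// object's own removal attempt).
fn replay_race() -> (Option<u64>, Result<(), String>) {
    let registry = ToyGaugeVec::default();
    let old = registry.with_label_values("tenant-1"); // old object's metrics are still around
    let new = registry.with_label_values("tenant-1"); // new object gets the same child
    registry.remove_label_values("tenant-1").unwrap(); // old object removes on late drop
    drop(old);
    new.store(42, Ordering::Relaxed); // the new handle still works...
    let scraped = registry.scrape("tenant-1"); // ...but nothing is exported anymore
    let second_removal = registry.remove_label_values("tenant-1"); // new object's shutdown
    (scraped, second_removal)
}

fn main() {
    let (scraped, second_removal) = replay_race();
    println!("scraped = {:?}, second removal = {:?}", scraped, second_removal);
}
```

In this model the scrape finds nothing and the second removal returns an error, matching the two symptoms described above.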

Changes
-------

This PR moves metric removal to `shutdown()`.
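
A minimal sketch of why this ordering helps, under stated assumptions: `Object` and `Registry` are hypothetical std-only stand-ins for Tenant/Timeline and the shared metrics registry, and the re-attach happens only after shutdown has been observed to complete.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical shared registry: just the set of exported label values.
type Registry = Arc<Mutex<HashSet<String>>>;

/// Hypothetical stand-in for Tenant/Timeline; deliberately has no `Drop` impl.
struct Object {
    label: String,
    registry: Registry,
}

impl Object {
    /// Registers the object's metrics under `label`.
    fn attach(label: &str, registry: &Registry) -> Object {
        registry.lock().unwrap().insert(label.to_string());
        Object {
            label: label.to_string(),
            registry: Arc::clone(registry),
        }
    }

    /// Metric removal happens here, so it is done by the time shutdown
    /// is observed to complete.
    fn shutdown(&self) {
        self.registry.lock().unwrap().remove(&self.label);
    }
}

/// Returns whether the re-attached object's metrics are still exported
/// after the old object is finally dropped.
fn still_exported_after_reattach() -> bool {
    let registry = Registry::default();
    let old = Object::attach("tenant-1", &registry);
    old.shutdown(); // removal is ordered before the re-attach
    let new = Object::attach("tenant-1", &registry);
    drop(old); // the late drop no longer touches the registry
    let exported = registry.lock().unwrap().contains(&new.label);
    exported
}

fn main() {
    println!("still exported: {}", still_exported_after_reattach());
}
```

Because the old object's drop no longer interacts with the registry, it cannot remove metrics that the new object registered in the meantime.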

An alternative design would be to multi-version the metrics using a
distinguishing label, or to use a better metrics crate that allows
removing metrics from the registry through the locally held metric
handle instead of interacting with the (globally shared) registry.
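
For illustration only, the multi-versioning alternative might look roughly like this. The per-instance version number and the `Labels` tuple are assumptions made up for this sketch; they are not part of this PR.

```rust
use std::collections::HashSet;

/// Hypothetical label set: the id plus a per-instance version number.
type Labels = (String, u64);

/// Returns true if removing the old version leaves the new version exported.
fn versioned_removal_keeps_new_metrics() -> bool {
    let mut exported: HashSet<Labels> = HashSet::new();
    let old: Labels = ("tenant-1".to_string(), 1);
    let new: Labels = ("tenant-1".to_string(), 2);
    exported.insert(old.clone()); // old object registers its metrics
    exported.insert(new.clone()); // new object registers under a distinct label
    exported.remove(&old); // late removal by the old object hits only its own entry
    exported.contains(&new) && !exported.contains(&old)
}

fn main() {
    println!("new metrics kept: {}", versioned_removal_keeps_new_metrics());
}
```

The trade-off in this sketch is an extra label on every affected series, which whoever consumes the metrics would need to account for.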

refs https://github.com/neondatabase/neon/pull/7051
Christian Schwarz
2024-03-11 15:41:41 +01:00
committed by GitHub
parent 2b0f3549f7
commit 8224580f3e
3 changed files with 5 additions and 8 deletions

@@ -2017,10 +2017,8 @@ impl TimelineMetrics {
     pub(crate) fn resident_physical_size_get(&self) -> u64 {
         self.resident_physical_size_gauge.get()
     }
-}
-impl Drop for TimelineMetrics {
-    fn drop(&mut self) {
+    pub(crate) fn shutdown(&self) {
         let tenant_id = &self.tenant_id;
         let timeline_id = &self.timeline_id;
         let shard_id = &self.shard_id;

@@ -1846,6 +1846,8 @@ impl Tenant {
         // Wait for any in-flight operations to complete
         self.gate.close().await;
+        remove_tenant_metrics(&self.tenant_shard_id);
         Ok(())
     }
@@ -3557,11 +3559,6 @@ async fn run_initdb(
     Ok(())
 }
-impl Drop for Tenant {
-    fn drop(&mut self) {
-        remove_tenant_metrics(&self.tenant_shard_id);
-    }
-}
 /// Dump contents of a layer file to stdout.
 pub async fn dump_layerfile_from_path(
     path: &Utf8Path,

@@ -1257,6 +1257,8 @@ impl Timeline {
         // Finally wait until any gate-holders are complete
         self.gate.close().await;
+        self.metrics.shutdown();
     }
     pub(crate) fn set_state(&self, new_state: TimelineState) {