pageserver: circuit breaker on compaction (#8359)

## Problem

We already back off on compaction retries, but the impact of a failing
compaction can be so great that backing off up to 300s isn't enough. The
impact is consuming a lot of I/O+CPU in the case of image layer
generation for large tenants, and potentially also leaking disk space.

Compaction failures are extremely rare and almost always indicate a bug,
frequently a bug that will not let compaction to proceed until it is
fixed.

Related: https://github.com/neondatabase/neon/issues/6738

## Summary of changes

- Introduce a CircuitBreaker type
- Add a circuit breaker for compaction, with a policy that after 5
failures, compaction will not be attempted again for 24 hours.
- Add metrics that we can alert on: any >0 value for
`pageserver_circuit_breaker_broken_total` should generate an alert.
- Add a test that checks this works as intended.

Couple notes to reviewers:
- Circuit breakers are intrinsically a defense-in-depth measure: this is
not the solution to any underlying issues, it is just a general
mitigation for "unknown unknowns" that might be encountered in future.
- This PR isn't primarily about writing a perfect CircuitBreaker type:
the one in this PR is meant to be just enough to mitigate issues in
compaction, and make it easy to monitor/alert on these failures. We can
refine this type in future as/when we want to use it elsewhere.
This commit is contained in:
John Spray
2024-07-12 12:04:02 +01:00
committed by Alex Chi Z
parent c440756410
commit 3f8819827c
5 changed files with 229 additions and 2 deletions

View File

@@ -569,6 +569,22 @@ static VALID_LSN_LEASE_COUNT: Lazy<UIntGaugeVec> = Lazy::new(|| {
.expect("failed to define a metric")
});
pub(crate) static CIRCUIT_BREAKERS_BROKEN: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pageserver_circuit_breaker_broken",
"How many times a circuit breaker has broken"
)
.expect("failed to define a metric")
});
pub(crate) static CIRCUIT_BREAKERS_UNBROKEN: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pageserver_circuit_breaker_unbroken",
"How many times a circuit breaker has been un-broken (recovered)"
)
.expect("failed to define a metric")
});
pub(crate) mod initial_logical_size {
use metrics::{register_int_counter, register_int_counter_vec, IntCounter, IntCounterVec};
use once_cell::sync::Lazy;

View File

@@ -39,6 +39,7 @@ use tokio::task::JoinSet;
use tokio_util::sync::CancellationToken;
use tracing::*;
use utils::backoff;
use utils::circuit_breaker::CircuitBreaker;
use utils::completion;
use utils::crashsafe::path_with_suffix_extension;
use utils::failpoint_support;
@@ -76,7 +77,8 @@ use crate::is_uninit_mark;
use crate::l0_flush::L0FlushGlobalState;
use crate::metrics::TENANT;
use crate::metrics::{
remove_tenant_metrics, BROKEN_TENANTS_SET, TENANT_STATE_METRIC, TENANT_SYNTHETIC_SIZE_METRIC,
remove_tenant_metrics, BROKEN_TENANTS_SET, CIRCUIT_BREAKERS_BROKEN, CIRCUIT_BREAKERS_UNBROKEN,
TENANT_STATE_METRIC, TENANT_SYNTHETIC_SIZE_METRIC,
};
use crate::repository::GcResult;
use crate::task_mgr;
@@ -276,6 +278,10 @@ pub struct Tenant {
eviction_task_tenant_state: tokio::sync::Mutex<EvictionTaskTenantState>,
/// Track repeated failures to compact, so that we can back off.
/// Overhead of mutex is acceptable because compaction is done with a multi-second period.
compaction_circuit_breaker: std::sync::Mutex<CircuitBreaker>,
/// If the tenant is in Activating state, notify this to encourage it
/// to proceed to Active as soon as possible, rather than waiting for lazy
/// background warmup.
@@ -1641,13 +1647,31 @@ impl Tenant {
timelines_to_compact
};
// Before doing any I/O work, check our circuit breaker
if self.compaction_circuit_breaker.lock().unwrap().is_broken() {
info!("Skipping compaction due to previous failures");
return Ok(());
}
for (timeline_id, timeline) in &timelines_to_compact {
timeline
.compact(cancel, EnumSet::empty(), ctx)
.instrument(info_span!("compact_timeline", %timeline_id))
.await?;
.await
.map_err(|e| {
self.compaction_circuit_breaker
.lock()
.unwrap()
.fail(&CIRCUIT_BREAKERS_BROKEN, &e);
e
})?;
}
self.compaction_circuit_breaker
.lock()
.unwrap()
.success(&CIRCUIT_BREAKERS_UNBROKEN);
Ok(())
}
@@ -2563,6 +2587,14 @@ impl Tenant {
cached_logical_sizes: tokio::sync::Mutex::new(HashMap::new()),
cached_synthetic_tenant_size: Arc::new(AtomicU64::new(0)),
eviction_task_tenant_state: tokio::sync::Mutex::new(EvictionTaskTenantState::default()),
compaction_circuit_breaker: std::sync::Mutex::new(CircuitBreaker::new(
format!("compaction-{tenant_shard_id}"),
5,
// Compaction can be a very expensive operation, and might leak disk space. It also ought
// to be infallible, as long as remote storage is available. So if it repeatedly fails,
// use an extremely long backoff.
Some(Duration::from_secs(3600 * 24)),
)),
activate_now_sem: tokio::sync::Semaphore::new(0),
cancel: CancellationToken::default(),
gate: Gate::default(),