neon/scripts
William Huang fae734b15c [BRC-3082] Monitor commit LSN lag among active SKs (#1002)
Commit e69c3d632b added metrics (used for alerting) to indicate whether Safekeepers are
operating with a degraded quorum due to some of them being down.
However, even if all SKs are active/reachable, we probably still want to
raise an alert if some of them are really slow or otherwise lagging
behind, as it is technically still a "degraded quorum" situation.

Added a new field `max_active_safekeeper_commit_lag` to the
`neon_perf_counters` view that reports the lag between the most advanced
and most lagging commit LSNs among active Safekeepers.

Commit LSNs are received from `AppendResponse` messages from SKs and
recorded in the `WalProposer`'s shared memory state.

Note that this lag is calculated among active SKs only, to keep this
alert clean. If there are inactive SKs, the previous metric on active SKs
will capture that instead.
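
For illustration only, the calculation described above can be sketched in Python. This is not the code from this change: the `SafekeeperState` record, the `max_active_commit_lag` function name, and the choice to report 0 when fewer than two SKs are active are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SafekeeperState:
    # Hypothetical record; the real state lives in the WalProposer's
    # shared memory and is not modelled here.
    name: str
    active: bool
    commit_lsn: int  # last commit LSN reported by this SK, as a byte position


def max_active_commit_lag(safekeepers: Iterable[SafekeeperState]) -> int:
    """Lag between the most advanced and most lagging active SK."""
    # Inactive SKs are skipped on purpose: they are covered by the
    # separate "active SKs" metric.
    lsns = [sk.commit_lsn for sk in safekeepers if sk.active]
    # Assumption: with fewer than two active SKs there is nothing to
    # compare, so the lag is reported as 0.
    if len(lsns) < 2:
        return 0
    return max(lsns) - min(lsns)
```

With two active SKs at commit LSNs 1000 and 400 and one inactive SK, the sketch reports a lag of 600; the inactive SK does not affect the result.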

Note: @chen-luo_data pointed out during the PR review that we can
probably get the benefits of this metric with a PromQL query like
`max(safekeeper_commit_lsn) by (tenant_id, timeline_id) -
min(safekeeper_commit_lsn) by (tenant_id, timeline_id)` on existing
metrics exported by SKs.
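
To make the grouping in that query concrete, here is a rough Python equivalent of what it computes over a set of scraped samples. The `Sample` tuple and `commit_lsn_spread` function are hypothetical names for this sketch and do not correspond to anything in the repository.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Sample(NamedTuple):
    # Hypothetical stand-in for one safekeeper_commit_lsn sample.
    tenant_id: str
    timeline_id: str
    safekeeper: str
    commit_lsn: int


def commit_lsn_spread(samples: Iterable[Sample]) -> Dict[Tuple[str, str], int]:
    """max - min of commit_lsn for each (tenant_id, timeline_id) group."""
    groups = defaultdict(list)
    for s in samples:
        # Mirrors `by (tenant_id, timeline_id)`: the safekeeper label is
        # aggregated away, leaving one value per timeline.
        groups[(s.tenant_id, s.timeline_id)].append(s.commit_lsn)
    return {key: max(lsns) - min(lsns) for key, lsns in groups.items()}
```

Unlike the new field, this grouping has no notion of which SKs are active; it simply uses every sample it is given.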

Given that this code is already ready, @haoyu-huang_data suggested that
I just check in this change anyway, as the reliability of Prometheus
metrics (and especially the aggregation operators when the result set
cardinality is high) is somewhat questionable based on our prior
experience.

Added integration test
`test_wal_acceptor.py::test_max_active_safekeeper_commit_lag`.
2025-07-25 15:08:49 -04:00