neon/pgxn at main

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-22 21:59:59 +00:00

Files

Suhas Thalanki 842a5091d5 [BRC-3051] Walproposer: Safekeeper quorum health metrics (#930) (#12750)

Today, apart from spammy PG logs that nobody monitors, nothing indicates
when the Walproposer in PG cannot connect to, or get votes from, all
Safekeepers. As a result, we have no signal that the Safekeepers are
operating at degraded redundancy. We need these signals.

Added plumbing in the PG extension so that the `neon_perf_counters` view
exports the following gauge metrics on safekeeper health:
- `num_configured_safekeepers`: The total number of safekeepers
  configured in PG.
- `num_active_safekeepers`: The number of safekeepers that PG is
  actively streaming WAL to.

An alert should be raised whenever `num_active_safekeepers` <
`num_configured_safekeepers`.
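
The alert condition above can be sketched as a small predicate. This is only an illustration: the function name `safekeeper_quorum_degraded` is hypothetical, and how the two gauge values are scraped from `neon_perf_counters` and fed to an alerting system is not specified here.

```python
def safekeeper_quorum_degraded(num_active_safekeepers: int,
                               num_configured_safekeepers: int) -> bool:
    """Return True when fewer safekeepers are active than configured.

    Hypothetical helper: the arguments are assumed to be the two gauge
    values described above, already read from the metrics source.
    """
    if num_active_safekeepers < 0 or num_configured_safekeepers < 0:
        raise ValueError("gauge values must be non-negative")
    return num_active_safekeepers < num_configured_safekeepers


# Example: 3 safekeepers configured, only 2 receiving WAL -> alert.
degraded = safekeeper_quorum_degraded(2, 3)
healthy = safekeeper_quorum_degraded(3, 3)
```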

The metrics are implemented by adding additional state to the
Walproposer shared memory keeping track of the active statuses of
safekeepers using a simple array. The status of the safekeeper is set to
active (1) after the Walproposer acquires a quorum and starts streaming
data to the safekeeper, and is set to inactive (0) when the connection
with a safekeeper is shut down. We scan the safekeeper status array in
Walproposer shared memory when collecting the metrics to produce results
for the gauges.
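
The bookkeeping described above can be modeled roughly as follows. This is a Python sketch of the logic only, not the extension's C code: the class and method names are hypothetical, and real shared-memory concerns (layout, locking, atomics) are deliberately left out.

```python
class SafekeeperStatusModel:
    """Toy model of the per-safekeeper status array (names are hypothetical)."""

    def __init__(self, num_configured: int) -> None:
        self.num_configured = num_configured
        # One slot per configured safekeeper: 1 = active, 0 = inactive.
        self.status = [0] * num_configured

    def mark_active(self, index: int) -> None:
        # Set once a quorum is acquired and streaming to this safekeeper starts.
        self.status[index] = 1

    def mark_inactive(self, index: int) -> None:
        # Set when the connection with this safekeeper is shut down.
        self.status[index] = 0

    def collect_gauges(self) -> dict:
        # Scan the status array at collection time to produce both gauges.
        return {
            "num_configured_safekeepers": self.num_configured,
            "num_active_safekeepers": sum(self.status),
        }


model = SafekeeperStatusModel(num_configured=3)
for i in range(3):
    model.mark_active(i)
model.mark_inactive(1)  # one connection is shut down
gauges = model.collect_gauges()
```

Because the gauges are derived by scanning the array at collection time, no separate counter has to be kept in sync with connection state changes.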

Added coverage for the metrics to the integration test
`test_wal_acceptor.py::test_timeline_disk_usage_limit`.

Co-authored-by: William Huang <william.huang@databricks.com>

2025-07-30 15:14:59 +00:00

neon

[BRC-3051] Walproposer: Safekeeper quorum health metrics (#930) (#12750)

2025-07-30 15:14:59 +00:00

neon_rmgr

Remove unused static function

2024-10-07 23:49:11 +03:00

neon_test_utils

NEON: Finish Zenith->Neon rename (#12566)

2025-07-11 18:56:39 +00:00

neon_utils

Be able to get number of CPUs (#3774)