Commit Graph

16 Commits

Author SHA1 Message Date
Suhas Thalanki
0934ce9bce compute: metrics for autovacuum (mxid, postgres) (#12294)
## Problem

Currently we do not have metrics for autovacuum.

## Summary of changes

Added a metric that extracts the top 5 DBs with oldest mxid and frozen
xid. Tables that were vacuumed recently should have younger value (or
younger age).

Related Issue: https://github.com/neondatabase/cloud/issues/27296
2025-07-01 15:33:23 +00:00
Mikhail Kot
8da4ec9740 Postgres metrics for stuck getpage requests (#11710)
https://github.com/neondatabase/neon/issues/10327
Resolves: #11720 

New metrics:
- `compute_getpage_stuck_requests_total`
- `compute_getpage_max_inflight_stuck_time_ms`
2025-04-30 12:01:41 +00:00
Suhas Thalanki
46e046e779 Exporting file_cache_used to calculate LFC utilization (#11384)
## Problem

Exporting `file_cache_used` which specifies the number of used chunks in
the LFC. This helps calculate LFC utilization as: `file_cache_used_pages
/ (file_cache_used * file_cache_chunk_size_pages)`

## Summary of changes

Exporting `file_cache_used`.

Related Issue: https://github.com/neondatabase/cloud/issues/26688
2025-04-03 14:54:45 +00:00
Alexey Kondratov
4b77807de9 fix(compute/sql_exporter): Ignore invalid DBs when collecting size (#11097)
## Problem

Original Slack discussion:
https://neondb.slack.com/archives/C04DGM6SMTM/p1739915430147169

TL;DR in Postgres, it's totally normal to have 'invalid' DBs (state
after the interrupted `DROP DATABASE`). Yet, some of our metrics
collected with `sql_exporter` try to get the size of such invalid DBs.

Typical log lines:
```
time=2025-03-05T16:30:32.368Z level=ERROR source=promhttp.go:52 msg="Error gathering metrics" error="[from Gatherer #1] [collector=neon_collector,query=pg_stats_userdb] pq: [NEON_SMGR] [reqid 0] could not read db size of db 173228 from page server at lsn 0/44A0E8C0"
time=2025-03-05T16:30:32.369Z level=ERROR source=promhttp.go:52 msg="Error gathering metrics" error="[from Gatherer #1] [collector=neon_collector,query=db_total_size] pq: [NEON_SMGR] [reqid 0] could not read db size of db 173228 from page server at lsn 0/44A0E8C0"
```

## Summary of changes

Ignore invalid DBs in these two metrics -- `pg_stats_userdb` and
`db_total_size`
2025-03-06 15:32:17 +00:00
Em Sharnoff
1fe23fe8d2 compute/lfc: Add chunk size to neon_lfc_stats (#11100)
This PR adds a new key to neon.neon_lfc_stats —
'file_cache_chunk_size_pages'. It just returns the value of
BLOCKS_PER_CHUNK from the LFC implementation.

The new value should (eventually) allow changing the chunk size without
breaking any places that rely on LFC stats values measured in number of
chunks.

See neondatabase/cloud#25170 for more.
2025-03-05 20:35:08 +00:00
Alexey Kondratov
2dfd3cab8c fix(compute): Report compute_backpressure_throttling_seconds as counter (#10125)
## Problem

It was reported as `gauge`, but it's actually a `counter`.

Also add `_total` suffix as that's the convention for counters.

The corresponding flux-fleet PR:
https://github.com/neondatabase/flux-fleet/pull/386
2024-12-17 16:14:07 +00:00
Tristan Partin
c0ba416967 Add compute_logical_snapshots_bytes metric (#9887)
This metric exposes the size of all non-temporary logical snapshot
files.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-12-05 19:04:33 +00:00
Mikhail Kot
e12628fe93 Collect max_connections metric (#9770)
This will further allow us to expose this metric to users
2024-11-15 17:42:41 +00:00
Tristan Partin
a61d81bbc7 Calculate compute_backpressure_throttling_seconds correctly
The original value that we get is measured in microseconds. It comes
from a calculation using Postgres' GetCurrentTimestamp(), whihc is
implemented in terms of gettimeofday(2).

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-12 13:12:08 -06:00
Tristan Partin
93123f2623 Rename compute_backpressure_throttling_ms to compute_backpressure_throttling_seconds
This is in line with the Prometheus guidance[0]. We also haven't started
using this metric, so renaming is essentially free.

Link: https://prometheus.io/docs/practices/naming/ [0]
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-06 13:28:23 -06:00
Tristan Partin
8af9412eb2 Collect compute backpressure throttling time
This will tell us how much time the compute has spent throttled if
pageserver/safekeeper cannot keep up with WAL generation.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-30 09:58:29 -05:00
Tristan Partin
0595320c87 Protect call to pg_current_wal_lsn() in retained_wal query
We can't call pg_current_wal_lsn() if we are a standby instance (read
replica). Any attempt to call this function while in recovery results
in:

ERROR:  recovery is in progress

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-23 09:55:00 -06:00
Tristan Partin
fcb55a2aa2 Fix copy-paste error in checkpoints_timed metric
Importing the wrong metric. Sigh...

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-22 14:34:26 -06:00
Tristan Partin
e0fa6bcf1a Fix some sql_exporter metrics for PG 17
Checkpointer related statistics moved from pg_stat_bgwriter to
pg_stat_checkpointer, so we need to adjust our queries accordingly.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-16 14:46:33 -05:00
Matthias van de Meent
18f4e5f10c Add newly added metrics from neondatabase/neon#9116 to exports (#9402)
They weren't added in that PR, but should be available immediately on
rollout as the neon extension already defaults to 1.5.
2024-10-15 23:13:31 +02:00
Tristan Partin
cf7a596a15 Generate sql_exporter config files with Jsonnet
There are quite a few benefits to this approach:

- Reduce config duplication
  - The two sql_exporter configs were super similar with just a few
    differences
- Pull SQL queries into standalone files
  - That means we could run a SQL formatter on the file in the future
  - It also means access to syntax highlighting
- In the future, run different queries for different PG versions
  - This is relevant because right now, we have queries that are failing
    on PG 17 due to catalog updates

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-15 11:18:38 -05:00