pageserver: layer count & size metrics (#8410)

## Problem

We lack insight into:
- How much of a tenant's physical size is image vs. delta layers
- Average sizes of image vs. delta layers
- Total layer counts per timeline, indicating size of index_part object

As well as general observability love, this is motivated by
https://github.com/neondatabase/neon/issues/6738, where we need to
define some sensible thresholds for storage amplification, and using
total physical size may not work well (if someone does a lot of DROPs
then it's legitimate for the physical-synthetic ratio to be huge), but
the ratio between image layer size and delta layer size may be a better
indicator of whether we're generating unreasonable quantities of image
layers.

## Summary of changes

- Add pageserver_layer_bytes and pageserver_layer_count metrics,
labelled by timeline and `kind` (delta or image)
- Add & subtract these with LayerInner's lifetime.

I'm intentionally avoiding using a generic metric RAII guard object, to
avoid bloating LayerInner: it already has all the information it needs
to update metric on new+drop.
This commit is contained in:
John Spray
2024-07-17 21:55:20 +01:00
committed by Christian Schwarz
parent ae1af558b4
commit ef3ebfaf67
3 changed files with 117 additions and 0 deletions

View File

@@ -146,6 +146,8 @@ PAGESERVER_PER_TENANT_METRICS: Tuple[str, ...] = (
"pageserver_smgr_query_seconds_sum",
"pageserver_archive_size",
"pageserver_pitr_history_size",
"pageserver_layer_bytes",
"pageserver_layer_count",
"pageserver_storage_operations_seconds_count_total",
"pageserver_storage_operations_seconds_sum_total",
"pageserver_evictions_total",