This is a rebase of PR #10739 by @henryliu2014 on the current main branch. ## Problem pageserver: remove resident size from billing metrics Fixes #10388 ## Summary of changes The following changes have been made to remove resident size from billing metrics: * removed the metric "resident_size" and related codes in consumption_metrics/metrics.rs * removed the item of the description of metric "resident_size" in consumption_metrics.md * refactored the metric "resident_size" related test case Requested by: John Spray (john@neon.tech) --------- Co-authored-by: liuheqing <hq.liu@qq.com> Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: John Spray <john@neon.tech>
3.4 KiB
Overview
Pageserver and proxy periodically collect consumption metrics and push them to a HTTP endpoint.
This doc describes current implementation details. For design details see the RFC and the discussion on Github.
-
The metrics are collected in a separate thread, and the collection interval and endpoint are configurable.
-
Metrics are cached, so that we don't send unchanged metrics on every iteration.
-
Metrics are sent in batches of 1000 (see CHUNK_SIZE const) metrics max with no particular grouping guarantees.
batch format is
{ "events" : [metric1, metric2, ...] }
See metric format examples below.
-
All metrics values are in bytes, unless otherwise specified.
-
Currently no retries are implemented.
Pageserver metrics
Configuration
The endpoint and the collection interval are specified in the pageserver config file (or can be passed as command line arguments):
metric_collection_endpoint defaults to None, which means that metric collection is disabled by default.
metric_collection_interval defaults to 10min
Metrics
Currently, the following metrics are collected:
written_size
Amount of WAL produced , by a timeline, i.e. last_record_lsn This is an absolute, per-timeline metric.
remote_storage_size
Size of the remote storage (S3) directory. This is an absolute, per-tenant metric.
timeline_logical_size
Logical size of the data in the timeline. This is an absolute, per-timeline metric.
synthetic_storage_size
Size of all tenant's branches including WAL.
This is the same metric that tenant/{tenant_id}/size endpoint returns.
This is an absolute, per-tenant metric.
Synthetic storage size is calculated in a separate thread, so it might be slightly outdated.
Format example
{
"metric": "remote_storage_size",
"type": "absolute",
"time": "2022-12-28T11:07:19.317310284Z",
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
"value": 12345454,
"tenant_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
"timeline_id": "a03ebb4f5922a1c56ff7485cc8854143",
}
idempotency_key is a unique key for each metric, so that we can deduplicate metrics.
It is a combination of the time, node_id and a random number.
Proxy consumption metrics
Configuration
The endpoint and the collection interval can be passed as command line arguments for proxy:
metric_collection_endpoint no default, which means that metric collection is disabled by default.
metric_collection_interval no default
Metrics
Currently, only one proxy metric is collected:
proxy_io_bytes_per_clientOutbound traffic per client. This is an incremental, per-endpoint metric.
Format example
{
"metric": "proxy_io_bytes_per_client",
"type": "incremental",
"start_time": "2022-12-28T11:07:19.317310284Z",
"stop_time": "2022-12-28T11:07:19.317310284Z",
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
"value": 12345454,
"endpoint_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
}
The metric is incremental, so the value is the difference between the current and the previous value.
If there is no previous value, the value is the current value and the start_time equals stop_time.
TODO
- Handle errors better: currently if one tenant fails to gather metrics, the whole iteration fails and metrics are not sent for any tenant.
- Add retries
- Tune the interval