mirror of
https://github.com/neondatabase/neon.git
synced 2025-12-22 21:59:59 +00:00
This is a rebase of PR #10739 by @henryliu2014 on the current main branch. ## Problem pageserver: remove resident size from billing metrics Fixes #10388 ## Summary of changes The following changes have been made to remove resident size from billing metrics: * removed the metric "resident_size" and related codes in consumption_metrics/metrics.rs * removed the item of the description of metric "resident_size" in consumption_metrics.md * refactored the metric "resident_size" related test case Requested by: John Spray (john@neon.tech) --------- Co-authored-by: liuheqing <hq.liu@qq.com> Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: John Spray <john@neon.tech>
113 lines
3.4 KiB
Markdown
113 lines
3.4 KiB
Markdown
### Overview
|
|
Pageserver and proxy periodically collect consumption metrics and push them to a HTTP endpoint.
|
|
|
|
This doc describes current implementation details.
|
|
For design details see [the RFC](./rfcs/021-metering.md) and [the discussion on Github](https://github.com/neondatabase/neon/pull/2884).
|
|
|
|
- The metrics are collected in a separate thread, and the collection interval and endpoint are configurable.
|
|
|
|
- Metrics are cached, so that we don't send unchanged metrics on every iteration.
|
|
|
|
- Metrics are sent in batches of 1000 (see CHUNK_SIZE const) metrics max with no particular grouping guarantees.
|
|
|
|
batch format is
|
|
```json
|
|
|
|
{ "events" : [metric1, metric2, ...] }
|
|
|
|
```
|
|
See metric format examples below.
|
|
|
|
- All metrics values are in bytes, unless otherwise specified.
|
|
|
|
- Currently no retries are implemented.
|
|
|
|
### Pageserver metrics
|
|
|
|
#### Configuration
|
|
The endpoint and the collection interval are specified in the pageserver config file (or can be passed as command line arguments):
|
|
`metric_collection_endpoint` defaults to None, which means that metric collection is disabled by default.
|
|
`metric_collection_interval` defaults to 10min
|
|
|
|
#### Metrics
|
|
|
|
Currently, the following metrics are collected:
|
|
|
|
- `written_size`
|
|
|
|
Amount of WAL produced , by a timeline, i.e. last_record_lsn
|
|
This is an absolute, per-timeline metric.
|
|
|
|
- `remote_storage_size`
|
|
|
|
Size of the remote storage (S3) directory.
|
|
This is an absolute, per-tenant metric.
|
|
|
|
- `timeline_logical_size`
|
|
|
|
Logical size of the data in the timeline.
|
|
This is an absolute, per-timeline metric.
|
|
|
|
- `synthetic_storage_size`
|
|
|
|
Size of all tenant's branches including WAL.
|
|
This is the same metric that `tenant/{tenant_id}/size` endpoint returns.
|
|
This is an absolute, per-tenant metric.
|
|
|
|
Synthetic storage size is calculated in a separate thread, so it might be slightly outdated.
|
|
|
|
#### Format example
|
|
|
|
```json
|
|
{
|
|
"metric": "remote_storage_size",
|
|
"type": "absolute",
|
|
"time": "2022-12-28T11:07:19.317310284Z",
|
|
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
|
|
"value": 12345454,
|
|
"tenant_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
|
|
"timeline_id": "a03ebb4f5922a1c56ff7485cc8854143",
|
|
}
|
|
```
|
|
|
|
`idempotency_key` is a unique key for each metric, so that we can deduplicate metrics.
|
|
It is a combination of the time, node_id and a random number.
|
|
|
|
### Proxy consumption metrics
|
|
|
|
#### Configuration
|
|
The endpoint and the collection interval can be passed as command line arguments for proxy:
|
|
`metric_collection_endpoint` no default, which means that metric collection is disabled by default.
|
|
`metric_collection_interval` no default
|
|
|
|
#### Metrics
|
|
|
|
Currently, only one proxy metric is collected:
|
|
|
|
- `proxy_io_bytes_per_client`
|
|
Outbound traffic per client.
|
|
This is an incremental, per-endpoint metric.
|
|
|
|
#### Format example
|
|
|
|
```json
|
|
{
|
|
"metric": "proxy_io_bytes_per_client",
|
|
"type": "incremental",
|
|
"start_time": "2022-12-28T11:07:19.317310284Z",
|
|
"stop_time": "2022-12-28T11:07:19.317310284Z",
|
|
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
|
|
"value": 12345454,
|
|
"endpoint_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
|
|
}
|
|
```
|
|
|
|
The metric is incremental, so the value is the difference between the current and the previous value.
|
|
If there is no previous value, the value is the current value and the `start_time` equals `stop_time`.
|
|
|
|
### TODO
|
|
|
|
- [ ] Handle errors better: currently if one tenant fails to gather metrics, the whole iteration fails and metrics are not sent for any tenant.
|
|
- [ ] Add retries
|
|
- [ ] Tune the interval
|