neon/pageserver/slo
Christian Schwarz 1251ef6dd4 initial pageserver SLO sloth rules
The generated files were deployed manually in
https://github.com/neondatabase/cloud/pull/5812
2023-07-12 16:03:28 +02:00

Install Sloth

https://sloth.dev/introduction/install/

wget https://github.com/slok/sloth/releases/download/v0.11.0/sloth-linux-amd64
chmod +x ./sloth-linux-amd64

Background on Sloth

https://sloth.dev/introduction/
https://sloth.dev/introduction/architecture/

Generate Prometheus Rules From Sloth Spec

./sloth-linux-amd64 generate --input ./spec.sloth.yml --out generated.prometheus.rules.yml

Background reading:

SRE workbook chapter on "Implementing SLOs", section "Calculating the SLIs"

https://sre.google/workbook/implementing-slos/

Quoted from that section:

Availability

    sum(rate(http_requests_total{host="api", status!~"5.."}[7d]))
    /
    sum(rate(http_requests_total{host="api"}[7d]))

Latency

    histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[7d]))

    histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[7d]))

Sloth Rule Syntax

The spec syntax is under-documented. The Go types are the best reference: https://pkg.go.dev/github.com/slok/sloth@v0.6.0/pkg/prometheus/api/v1#section-readme

For the pageserver latency SLOs, we want the "Raw SLI" type of SLI, not the event-based one. Search for error_ratio_query; example: https://sloth.dev/examples/default/raw-sli/

Use the VictoriaMetrics function histogram_share to compute the error ratio. It is the inverse of histogram_quantile: https://docs.victoriametrics.com/MetricsQL.html#histogram_share
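To make the "inverse" relationship concrete, here is a rough sketch. The symbols are introduced only for illustration: F(x) is the fraction of observations in the window with latency at most x, as estimated from the histogram buckets, φ is a quantile, and T is a latency threshold. Both functions interpolate within buckets, so the relations are approximate.

```latex
\begin{aligned}
\texttt{histogram\_quantile}(\varphi,\ \text{buckets}) &\approx F^{-1}(\varphi) \\
\texttt{histogram\_share}(T,\ \text{buckets}) &\approx F(T) \\
\text{error ratio for ``latency} \le T\text{''} &= 1 - F(T) \approx 1 - \texttt{histogram\_share}(T,\ \text{buckets})
\end{aligned}
```

For example, if 99.2% of requests in the window complete within T, the error ratio is 1 - 0.992 = 0.008, which is within the 0.01 error budget of a 99% objective.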

share_le_over_time also seems useful: https://docs.victoriametrics.com/MetricsQL.html#share_le_over_time

https://stackoverflow.com/questions/72559302/is-it-possible-to-calculate-ranks-of-metrics?rq=1
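As a minimal sketch of how a raw-SLI latency SLO based on histogram_share might look, the snippet below writes a small example spec. The file name, service name, metric name (example_request_duration_seconds_bucket), the 0.1s threshold, and the 99% objective are all placeholders chosen for illustration; they are not taken from the real spec.sloth.yml.

```shell
# Write a hypothetical example spec. All names and numbers below are
# placeholders, not the real pageserver metrics or objectives.
cat > example.sloth.yml <<'EOF'
version: "prometheus/v1"
service: "example-service"
slos:
  - name: "request-latency"
    objective: 99
    description: "99% of requests complete within 100ms."
    sli:
      raw:
        # Error ratio = share of requests slower than 0.1s.
        # {{.window}} is substituted by Sloth for each time window.
        error_ratio_query: |
          1 - histogram_share(
            0.1,
            sum(rate(example_request_duration_seconds_bucket[{{.window}}])) by (le)
          )
    alerting:
      page_alert:
        disable: true
      ticket_alert:
        disable: true
EOF
```

The heredoc delimiter is quoted ('EOF') so that the shell leaves the {{.window}} template untouched.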

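The problem with the VictoriaMetrics-only functions is that Sloth runs an internal validation pass over the query expressions: https://github.com/slok/sloth/issues/510

There is an option to skip the check: https://github.com/slok/sloth/pull/511. To use it, build Sloth from the submodule and pass the flag: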

git submodule update --init
pushd sloth.git
make build
popd
sloth.git/bin/sloth-linux-amd64 generate  \
    --disable-promExpr-validation \
    --input ./spec.sloth.yml \
    --out generated.prometheus.rules.yml

Notes On How To Scale The Process To Multiple Teams / Automate Sloth In Neon