## Problem

In the batching PR (https://github.com/neondatabase/neon/pull/9870) I stopped deducting the time-spent-in-throttle from latency metrics, i.e.,

- smgr latency metrics (`SmgrOpTimer`)
- basebackup latency (plus scan latency, which I believe is part of basebackup)

The reason for stopping the deduction was that, with the introduction of batching, the trick of tracking time-spent-in-throttle inside `RequestContext` and swap-replacing it from the `impl Drop for SmgrOpTimer` no longer worked with more than one request in a batch.

However, deducting time-spent-in-throttle is desirable because our internal latency SLO definition does not account for throttling.

## Summary of changes

- Redefine throttling to be a page_service pagestream request throttle instead of a throttle for repository `Key` reads through `Timeline::get` / `Timeline::get_vectored`.
  - This means reads done by `basebackup` are no longer subject to any throttle.
- The throttle applies after batching, before handling of the request.
- Drive-by fix: make the throttle sensitive to cancellation.
- Rename metric label `kind` from `timeline_get` to `pagestream` to reflect the new scope of throttling.

To avoid breaking the config format, we keep the config field named `timeline_get_throttle` and ignore its `task_kinds` field. This will be cleaned up in a future PR.

## Trade-Offs

Ideally, we would apply the throttle before reading a request off the connection, so that we queue the minimal amount of work inside the process. However, that is not possible because we need to do shard routing.

The redefinition of the throttle to limit pagestream request rate instead of repository `Key` rate comes with several downsides:

- We are no longer able to use the throttle mechanism for other tasks, e.g. image layer creation. However, in practice, we never used that capability anyway.
- We no longer throttle basebackup.
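To illustrate the latency-metric side of this change, here is a minimal sketch (not the actual neon implementation; the struct fields and method are hypothetical) of how a throttle wait measured once per batch can be deducted from each request's reported smgr latency, since the internal latency SLO does not account for throttling:

```rust
use std::time::Duration;

/// Hypothetical stand-in for a per-request smgr op timer. In this sketch,
/// the throttle wait is measured once for the whole batch and stored on
/// every timer in that batch.
struct SmgrOpTimer {
    /// Wall-clock time from request receipt to completion.
    elapsed: Duration,
    /// Time the batch spent waiting in the pagestream request throttle.
    throttled: Duration,
}

impl SmgrOpTimer {
    /// Latency as reported to metrics: wall-clock time minus throttle wait.
    /// `saturating_sub` guards against the wait exceeding the measured time.
    fn observed_latency(&self) -> Duration {
        self.elapsed.saturating_sub(self.throttled)
    }
}

fn main() {
    // One batch of three requests shared a single 5ms throttle wait;
    // that wait is deducted from each request's reported latency.
    let batch_throttle = Duration::from_millis(5);
    let timers: Vec<SmgrOpTimer> = [12u64, 8, 20]
        .iter()
        .map(|&ms| SmgrOpTimer {
            elapsed: Duration::from_millis(ms),
            throttled: batch_throttle,
        })
        .collect();
    for t in &timers {
        println!("reported latency: {:?}", t.observed_latency());
    }
}
```

This avoids the old swap-replace trick on `RequestContext`, which assumed exactly one in-flight request per context and broke once a batch carried more than one.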
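The "sensitive to cancellation" fix can be sketched as follows. This is an illustrative std-only model, not neon's code (the real implementation lives in async code with its own rate-limiter and cancellation types): a simple pacing throttle that checks a cancellation flag while waiting, so a cancelled connection does not sit in the throttle queue.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical pacing throttle: enforces a minimum spacing between
/// pagestream batches. Applied after batching, before request handling.
struct Throttle {
    interval: Duration,
    next_allowed: Instant,
}

impl Throttle {
    fn new(requests_per_second: u32) -> Self {
        Self {
            interval: Duration::from_secs(1) / requests_per_second,
            next_allowed: Instant::now(),
        }
    }

    /// Wait until the next request is allowed. Returns the time spent
    /// waiting (so callers can deduct it from latency metrics), or `Err`
    /// if cancellation was observed while waiting.
    fn acquire(&mut self, cancelled: &AtomicBool) -> Result<Duration, ()> {
        let start = Instant::now();
        while Instant::now() < self.next_allowed {
            // Cancellation sensitivity: bail out instead of finishing the wait.
            if cancelled.load(Ordering::Relaxed) {
                return Err(());
            }
            std::thread::sleep(Duration::from_millis(1));
        }
        self.next_allowed = Instant::now() + self.interval;
        Ok(start.elapsed())
    }
}

fn main() {
    let mut throttle = Throttle::new(100); // 100 requests/sec => 10ms spacing
    let cancelled = AtomicBool::new(false);
    // First acquire passes immediately; the second would wait ~10ms.
    let waited = throttle.acquire(&cancelled).unwrap();
    println!("first acquire waited {:?}", waited);
    // A cancelled waiter returns early instead of completing the wait.
    cancelled.store(true, Ordering::Relaxed);
    println!("cancelled acquire: {:?}", throttle.acquire(&cancelled));
}
```

Because the throttle sits after batching, one `acquire` call covers the whole batch, which is what makes the per-batch deduction of throttle time from each request's latency straightforward.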