Exposes metrics for caches. LKB-2594
This exposes a high level namespace, `cache`, that all cache metrics can
be added to - this makes it easier to make library panels for the caches
as I understand it.
To calculate the current cache fill ratio, you could use the following
query:
```
(
cache_inserted_total{cache="node_info"}
- sum (cache_evicted_total{cache="node_info"}) without (cause)
)
/ cache_capacity{cache="node_info"}
```
To calculate the cache hit ratio, you could use the following query:
```
cache_request_total{cache="node_info", outcome="hit"}
/ sum (cache_request_total{cache="node_info"}) without (outcome)
```
## Problem
LKB-2502 The garbage collection of the project info cache is garbage.
What we observed: If we get unlucky, we might throw away a very hot
entry if the cache is full. The GC loop is dependent on getting a lucky
shard of the projects2ep table that clears a lot of cold entries. The GC
does not take into account active use, and the interval it runs at is
too sparse to do any good.
Can we switch to a proper cache implementation?
Complications:
1. We need to invalidate by project/account.
2. We need to expire based on `retry_delay_ms`.
## Summary of changes
1. Replace `retry_delay_ms: Duration` with `retry_at: Instant` when
deserializing.
2. Split the EndpointControls from the RoleControls into two different
caches.
3. Introduce an expiry policy based on error retry info.
4. Introduce `moka` as a dependency, replacing our `TimedLru`.
See the follow up PR for changing all TimedLru instances to use moka:
#12726.
## Problem
If there's a quota error, it makes sense to cache it for a short window
of time. Many clients do not handle database connection errors
gracefully, so just spam retry 🤡
## Summary of changes
Updates the node_info cache to support storing console errors. Store
console errors if they cannot be retried (using our own heuristic.
should only trigger for quota exceeded errors).
## Problem
https://github.com/neondatabase/cloud/issues/9642
## Summary of changes
1. Make `EndpointRateLimiter` generic, renamed as `BucketRateLimiter`
2. Add support for claiming multiple tokens at once
3. Add `AuthRateLimiter` alias.
4. Check `(Endpoint, IP)` pair during authentication, weighted by how
many hashes proxy would be doing.
TODO: handle ipv6 subnets. will do this in a separate PR.
## Problem
Current cache doesn't support any updates from the cplane.
## Summary of changes
* Added redis notifier listner.
* Added cache which can be invalidated with the notifier. If the
notifier is not available, it's just a normal ttl cache.
* Updated cplane api.
The motivation behind this organization of the data is the following:
* In the Neon data model there are projects. Projects could have
multiple branches and each branch could have more than one endpoint.
* Also there is one special `main` branch.
* Password reset works per branch.
* Allowed IPs are the same for every branch in the project (except,
maybe, the main one).
* The main branch can be changed to the other branch.
* The endpoint can be moved between branches.
Every event described above requires some special processing on the
porxy (or cplane) side.
The idea of invalidating for the project is that whenever one of the
events above is happening with the project, proxy can invalidate all
entries for the entire project.
This approach also requires some additional API change (returning
project_id inside the auth info).