## Problem
Now that https://github.com/neondatabase/cloud/issues/15245 is done, we
can remove the old code.
## Summary of changes
Removes support for the ManagementV2 API, in favour of the ProxyV1 API.
Keeping the `mock` postgres cplane adaptor using "stock" tokio-postgres
allows us to remove a lot of dead weight from our actual postgres
connection logic.
I found the rightward drift of the `renew_jwks` function hard to review.
This PR splits out some major logic and uses early returns to make the
happy path more linear.
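For illustration, a minimal sketch of the early-return shape (the caching details and names are made up, not the real `renew_jwks` internals):

```rust
#[derive(Clone)]
struct Jwks {
    expires_at: u64,
}

struct FetchError;

// Stub standing in for the real network fetch.
fn fetch_jwks() -> Result<Jwks, FetchError> {
    Ok(Jwks { expires_at: u64::MAX })
}

// Early returns keep the happy path flat instead of nesting the
// fetch logic inside ever-deeper if/else arms.
fn renew_jwks(cached: Option<&Jwks>, now: u64) -> Result<Jwks, FetchError> {
    let Some(jwks) = cached else {
        return fetch_jwks(); // nothing cached yet
    };
    if jwks.expires_at <= now {
        return fetch_jwks(); // cache expired
    }
    Ok(jwks.clone()) // happy path: cached and fresh
}
```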
## Problem
It is called context/ctx everywhere, and the Monitoring suffix needlessly
invites confusion with actual monitoring code.
## Summary of changes
* Rename RequestMonitoring to RequestContext
* Rename RequestMonitoringInner to RequestContextInner
See https://github.com/neondatabase/cloud/issues/14378
In collaboration with @cloneable and @awarus, we sifted through logs and
simply demoted some logs to debug. This is not at all finished and there
are more logs to review, but we ran out of time in the session we
organised. In slightly more nuanced cases, we didn't touch the log and
instead left a TODO comment.
While setting up some tests, I noticed that we didn't support Keycloak.
They make use of encryption JWKs as well as signature ones. Our current
jwks crate does not support parsing encryption keys which caused the
entire jwk set to fail to parse. Switching to lazy parsing fixes this.
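A sketch of the lazy approach, with a hypothetical `Jwk` type standing in for the real signature-key type: keep each key as raw JSON and only parse it at lookup time, so an unsupported encryption key can't fail the whole set.

```rust
use serde::Deserialize;

// Hypothetical signature-key type; the real one carries key material.
#[derive(Deserialize)]
struct Jwk {
    kid: String,
}

#[derive(Deserialize)]
struct JwkSet {
    // Parsed lazily: unsupported entries stay as raw JSON.
    keys: Vec<serde_json::Value>,
}

impl JwkSet {
    fn find(&self, kid: &str) -> Option<Jwk> {
        self.keys
            .iter()
            // Keys that fail to parse (e.g. encryption keys) are
            // skipped instead of poisoning the whole set.
            .filter_map(|raw| serde_json::from_value::<Jwk>(raw.clone()).ok())
            .find(|k| k.kid == kid)
    }
}
```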
Also while setting up tests, I couldn't use a localhost JWKS server: we
require HTTPS and were using webpki, so it was impossible to add a
custom CA. Enabling native roots makes this possible.
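As a sketch of what that looks like with rustls directly (assuming rustls 0.23 and rustls-native-certs 0.8; the actual wiring in proxy may differ):

```rust
// Trust the OS certificate store, so a custom CA installed on the
// machine is honoured; bundled webpki roots alone cannot do this.
fn tls_config() -> rustls::ClientConfig {
    let mut roots = rustls::RootCertStore::empty();
    for cert in rustls_native_certs::load_native_certs().certs {
        // Skip individual certificates the store rejects.
        let _ = roots.add(cert);
    }
    rustls::ClientConfig::builder()
        .with_root_certificates(roots)
        .with_no_client_auth()
}
```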
I saw some of our current e2e tests against our custom JWKS in S3 were
taking a while to fetch. I've added a timeout + retries to address this.
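The fetch logic is roughly this shape, sketched here with reqwest (the function name and the timeout/retry values are illustrative):

```rust
use std::time::Duration;

// Fetch with a per-request timeout and a bounded number of retries.
async fn fetch_jwks_text(url: &str, attempts: u32) -> reqwest::Result<String> {
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(5)) // per-request timeout
        .build()?;
    let mut last_err = None;
    for _ in 0..attempts {
        match client.get(url).send().await.and_then(|r| r.error_for_status()) {
            Ok(resp) => return resp.text().await,
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("attempts > 0"))
}
```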
The overall idea of the PR is to rename a few types to make their
purpose clearer, reduce abstraction where it isn't needed, and move
types to better-suited modules.
* Also rename the `AuthFailed` variant to `PasswordFailed`.
* Before this, all JWT errors ended up in `AuthError::AuthFailed()`,
which expects a username and also causes cache invalidation.
In the base64 payload of an AWS Cognito JWT, I saw the following:
```
"iss":"https:\/\/cognito-idp.us-west-2.amazonaws.com\/us-west-2_redacted"
```
Issuers are supposed to be URLs, and URLs are always valid un-escaped
JSON. However, `\/` is a valid escape sequence, so what AWS is doing is
technically correct... sigh...
This PR refactors the test suite and adds a new regression test for
Cognito.
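For illustration only: serde_json already decodes the escaped form to the same string, as this tiny test shows (the pool ID is a placeholder, not the redacted one):

```rust
#[test]
fn issuer_with_escaped_slashes_parses() {
    // `\/` is a legal JSON escape for `/`, so both spellings decode
    // to the same issuer string.
    let v: serde_json::Value = serde_json::from_str(
        r#"{"iss":"https:\/\/cognito-idp.us-west-2.amazonaws.com\/pool"}"#,
    )
    .unwrap();
    assert_eq!(v["iss"], "https://cognito-idp.us-west-2.amazonaws.com/pool");
}
```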
Follow-up to #9344. We want to install the extension automatically. We
didn't want to couple the extension into compute_ctl, so instead
local_proxy is the one to issue the extension-specific requests.
Depends on #9344 and #9395
Removes the ConsoleRedirect backend from the main auth::Backends enum
and copy-pastes the existing crate::proxy::task_main structure to use
the ConsoleRedirectBackend exclusively.
This makes the logic a bit simpler at the cost of some fairly trivial
code duplication.
Preliminary for #9270.
The auth::Backend didn't need to be in the mega ProxyConfig object, so I
split it off and passed it manually in the few places it was necessary.
I've also refined some of the uses of config I saw while doing this
small refactor.
I've also followed the trend and made the console redirect backend its
own struct, same as LocalBackend and ControlPlaneBackend.
```shell
$ cargo run -p proxy --bin proxy -- --auth-backend=web --webauth-confirmation-timeout=5s
```
```
$ psql -h localhost -p 4432
NOTICE: Welcome to Neon!
Authenticate by visiting within 5s:
http://localhost:3000/psql_session/e946900c8a9bc6e9
psql: error: connection to server at "localhost" (::1), port 4432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 4432 failed: ERROR: Disconnected due to inactivity after 5s.
```
Microsoft exposes JWKs without the `alg` header; it's only included on
the tokens. Not a problem.
Also noticed that wrt the `typ` header:
> It will typically not be used by applications when it is already known
that the object is a JWT. This parameter is ignored by JWT
implementations; any processing of this parameter is performed by the
JWT application.
Since we know we are expecting JWTs only, I've followed the guidance and
removed the validation.
Misc changes split out from #8855:
- **allow cloning the request context in a read-only fashion for
background tasks**
- **propagate endpoint and request context through the jwk cache**
- **only allow password based auth for md5 during testing**
- **remove auth info from conn info**
Some tests were very slow and some tests occasionally stalled. This PR
improves some test performance and replaces the custom threadpool in
order to fix the stalling of tests.
Basic JWT implementation that caches JWKs and verifies signatures.
This code is currently not reachable from proxy; I just wanted to get
something merged in.
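A minimal sketch of the cache-then-verify shape (all types here are hypothetical stand-ins, not the actual proxy API):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::RwLock;

// Hypothetical parsed key; the real type carries key material.
#[derive(Clone)]
pub struct Jwk;

struct Entry {
    keys: HashMap<String, Jwk>, // keyed by `kid`
    fetched_at: Instant,
}

#[derive(Default)]
pub struct JwkCache {
    by_issuer: RwLock<HashMap<String, Arc<Entry>>>,
}

impl JwkCache {
    /// Return a cached key, refetching the issuer's JWKS when stale.
    pub async fn key(&self, issuer: &str, kid: &str) -> Option<Jwk> {
        const TTL: Duration = Duration::from_secs(300);
        if let Some(e) = self.by_issuer.read().await.get(issuer) {
            if e.fetched_at.elapsed() < TTL {
                return e.keys.get(kid).cloned();
            }
        }
        let keys = fetch_jwks(issuer).await?;
        let entry = Arc::new(Entry { keys, fetched_at: Instant::now() });
        let key = entry.keys.get(kid).cloned();
        self.by_issuer.write().await.insert(issuer.to_owned(), entry);
        key
    }
}

// Stub for the real HTTP fetch of the issuer's JWKS document.
async fn fetch_jwks(_issuer: &str) -> Option<HashMap<String, Jwk>> {
    None
}
```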
## Problem
1. Hard to correlate startup parameters with the endpoint that provided
them.
2. Some configurations are not needed in the `ProxyConfig` struct.
## Summary of changes
Because of some borrow-checker fun, I needed to switch to an
interior-mutability implementation of our `RequestMonitoring` context
system, using https://docs.rs/try-lock/latest/try_lock/ as a cheap
thread-safe lock for this use-case.
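A sketch of the pattern (the struct and field are illustrative, not the real `RequestMonitoring` layout): `try_lock()` never blocks, so the hot path stays cheap.

```rust
use try_lock::TryLock;

struct RequestMonitoringInner {
    endpoint_id: Option<String>,
}

// Shared, thread-safe context with cheap interior mutability.
struct RequestMonitoring(TryLock<RequestMonitoringInner>);

impl RequestMonitoring {
    fn set_endpoint(&self, id: String) {
        // try_lock() returns None if another thread holds the lock;
        // for best-effort telemetry we can skip the update rather
        // than wait.
        if let Some(mut inner) = self.0.try_lock() {
            inner.endpoint_id = Some(id);
        }
    }
}
```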
Removed the log of each startup message, instead logging the startup
params only on a successful handshake.
Also removed some values from `ProxyConfig` and kept them as arguments
(needed for local-proxy config).
## Problem
The current bucket-based rate limiter is not very intuitive and has
some bad failure cases.
## Summary of changes
Switches from fixed-interval buckets to a leaky-bucket implementation:
a single bucket per endpoint that drains over time. Each check computes
the time since the last check and drains tokens en masse. Garbage
collection works similarly to before: it sweeps a shard (1/64th of the
set) every 2048 checks, and it only removes buckets that are empty.
To be compatible with the existing config, I take the configured min
and max RPS as the sustained RPS and the max bucket size respectively,
which should be roughly equivalent.
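A minimal sketch of the lazy-drain bucket described above (names illustrative):

```rust
use std::time::Instant;

struct LeakyBucketConfig {
    drain_rate: f64, // sustained RPS: tokens drained per second
    max_level: f64,  // bucket capacity: maximum burst size
}

struct LeakyBucket {
    level: f64, // tokens currently in the bucket
    last_checked: Instant,
}

impl LeakyBucket {
    /// Returns true if the request is admitted.
    fn check(&mut self, cfg: &LeakyBucketConfig, now: Instant) -> bool {
        // Drain en masse based on the time since the last check.
        let elapsed = now.duration_since(self.last_checked).as_secs_f64();
        self.level = (self.level - elapsed * cfg.drain_rate).max(0.0);
        self.last_checked = now;

        if self.level + 1.0 > cfg.max_level {
            return false; // bucket full: rate limited
        }
        self.level += 1.0;
        true
    }
}
```

A bucket that has fully drained holds no tokens, which is exactly the state the GC sweep looks for.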
## Problem
We were rate limiting `wake_compute` in the wrong place.
## Summary of changes
Moves the `wake_compute` rate limit to after the permit is acquired.
Also slightly refactors `normalize`, as it caught my eye.
## Problem
Despite making password hashing async, it can still take time away from
the network code.
## Summary of changes
Introduce a custom threadpool, inspired by rayon. Features:
### Fairness
Each task is tagged with its endpoint ID. The more times we have seen
the endpoint, the more likely we are to skip the task when it comes up
in the queue. This uses a count-min sketch to estimate the number of
times we have seen the endpoint, resetting it every 1000+ steps.
Since tasks are immediately rescheduled if they do not complete, a
worker could get stuck in an "always work available" loop. To combat
this, we check the global queue every 61 steps to ensure all tasks
quickly get a worker assigned to them.
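A sketch of such an estimator, with made-up dimensions (the real pool picks its own width, depth, and hashing):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const DEPTH: usize = 4;
const WIDTH: usize = 256;

// Each row hashes the key with a different seed; taking the minimum
// across rows bounds the overcount from hash collisions.
struct CountMinSketch {
    counts: [[u32; WIDTH]; DEPTH],
}

impl CountMinSketch {
    fn new() -> Self {
        Self { counts: [[0; WIDTH]; DEPTH] }
    }

    fn index(seed: u64, key: &impl Hash) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        key.hash(&mut h);
        (h.finish() as usize) % WIDTH
    }

    /// Increment the count for `key` and return the new estimate.
    fn observe(&mut self, key: &impl Hash) -> u32 {
        (0..DEPTH)
            .map(|row| {
                let i = Self::index(row as u64, key);
                self.counts[row][i] += 1;
                self.counts[row][i]
            })
            .min()
            .expect("DEPTH > 0")
    }

    /// Called periodically (e.g. every ~1000 steps) to forget history.
    fn reset(&mut self) {
        self.counts = [[0; WIDTH]; DEPTH];
    }
}
```

The worker then skips a task with probability that grows with the estimate for its endpoint.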
### Balanced
Using crossbeam_deque, like rayon does, we get work stealing out of the
box. I've tested it a fair amount and it seems to balance the workload
well.
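For reference, the core crossbeam_deque primitives in a toy example (not the pool's actual wiring):

```rust
use crossbeam_deque::{Steal, Worker};

fn main() {
    // Each worker thread owns a local deque...
    let local: Worker<u32> = Worker::new_fifo();
    // ...and hands out Stealers so peers can take work from it.
    let stealer = local.stealer();

    for task in 0..4 {
        local.push(task);
    }

    // An idle peer drains tasks; in real code, Steal::Retry should be
    // retried rather than treated as empty.
    while let Steal::Success(task) = stealer.steal() {
        println!("stole task {task}");
    }
}
```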