## Problem
1. Hard to correlate startup parameters with the endpoint that provided
them.
2. Some configurations are not needed in the `ProxyConfig` struct.
## Summary of changes
Because of some borrow checker fun, I needed to switch to an
interior-mutability implementation of our `RequestMonitoring` context
system. Using https://docs.rs/try-lock/latest/try_lock/ as a cheap lock
for such a use-case (needed to be thread safe).
Removed the lock of each startup message, instead just logging only the
startup params in a successful handshake.
Also removed from values from `ProxyConfig` and kept as arguments.
(needed for local-proxy config)
## Problem
Hard to find error reasons by endpoint for HTTP flow.
## Summary of changes
I want all root spans to have session id and endpoint id. I want all
root spans to be consistent.
## Problem
Not really a problem, just refactoring.
## Summary of changes
Separate authenticate from wake compute.
Do not call wake compute second time if we managed to connect to
postgres or if we got it not from cache.
## Problem
Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and
doing a bit less radical changes. smaller commits.
We currently don't report error classifications in proxy as the current
error handling made it hard to do so.
## Summary of changes
1. Add a `ReportableError` trait that all errors will implement. This
provides the error classification functionality.
2. Handle Client requests a strongly typed error
* this error is a `ReportableError` and is logged appropriately
3. The handle client error only has a few possible error types, to
account for the fact that at this point errors should be returned to the
user.
## Problem
not really any problem, just some drive-by changes
## Summary of changes
1. move wake compute
2. move json processing
3. move handle_try_wake
4. move test backend to api provider
5. reduce wake-compute concerns
6. remove duplicate wake-compute loop
## Problem
HTTP connection pool was not respecting the PitR options.
## Summary of changes
1. refactor neon_options a bit to allow easier access to cache_key
2. make HTTP not go through `StartupMessageParams`
3. expose SNI processing to replace what was removed in step 2.
## Problem
The `src/proxy.rs` file is far too large
## Summary of changes
Creates 3 new files:
```
src/metrics.rs
src/proxy/retry.rs
src/proxy/connect_compute.rs
```
## Problem
Per-project IP allowlist:
https://github.com/neondatabase/cloud/issues/8116
## Summary of changes
Implemented IP filtering on the proxy side.
To retrieve ip allowlist for all scenarios, added `get_auth_info` call
to the control plane for:
* sql-over-http
* password_hack
* cleartext_hack
Added cache with ttl for sql-over-http path
This might slow down a bit, consider using redis in the future.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
## Problem
channel binding protects scram from sophisticated MITM attacks where the
attacker is able to produce 'valid' TLS certificates.
## Summary of changes
get the tls-server-end-point channel binding, and verify it is correct
for the SCRAM-SHA-256-PLUS authentication flow
## Problem
In #5539, I moved the connect_to_compute latency to start counting
before authentication - this is because authentication will perform some
calls to the control plane in order to get credentials and to eagerly
wake a compute server. It felt important to include these times in the
latency metric as these are times we should definitely care about
reducing.
What is not interesting to record in this metric is the roundtrip time
during authentication when we wait for the client to respond.
## Summary of changes
Implement a mechanism to pause the latency timer, resuming on drop of
the pause struct. We pause the timer right before we send the
authentication message to the client, and we resume the timer right
after we complete the authentication flow.
## Problem
Errors and notices that happen during a pooled connection lifecycle have
no session identifiers
## Summary of changes
Using a watch channel, we set the session ID whenever it changes. This
way we can see the status of a connection for that session
Also, adding a connection id to be able to search the entire connection
lifecycle
## Problem
A customer is having trouble connecting to neon from their production
environment. The logs show a mix of "Internal error" and "authentication
protocol violation" but not the full error
## Summary of changes
Make sure we don't miss any logs during SASL/SCRAM
## Problem
When an endpoint is shutting down, it can take a few seconds. Currently
when starting a new compute, this causes an "endpoint is in transition"
error. We need to add delays before retrying to ensure that we allow
time for the endpoint to shutdown properly.
## Summary of changes
Adds a delay before retrying in auth. connect_to_compute already has
this delay
## Problem
Pre-requisites for #4852 and #4853
## Summary of changes
1. Includes the client's IP address (which we already log) with the span
info so we can have it on all associated logs. This makes making
dashboards based on IP addresses easier.
2. Switch to a consistent error/warning log for errors during
connection. This includes error, num_retries, retriable=true/false and a
consistent log message that we can grep for.
## Problem
1. In the CacheInvalid state loop, we weren't checking the
`num_retries`. If this managed to get up to `32`, the retry_after
procedure would compute 2^32 which would overflow to 0 and trigger a div
by zero
2. When fixing the above, I started working on a flow diagram for the
state machine logic and realised it was more complex than it had to be:
a. We start in a `Cached` state
b. `Cached`: call `connect_once`. After the first connect_once error, we
always move to the `CacheInvalid` state, otherwise, we return the
connection.
c. `CacheInvalid`: we attempt to `wake_compute` and we either switch to
Cached or we retry this step (or we error).
d. `Cached`: call `connect_once`. We either retry this step or we have a
connection (or we error) - After num_retries > 1 we never switch back to
`CacheInvalid`.
## Summary of changes
1. Insert a `num_retries` check in the `handle_try_wake` procedure. Also
using floats in the retry_after procedure to prevent the overflow
entirely
2. Refactor connect_to_compute to be more linear in design.
## Problem
wake_compute can fail sometimes but is eligible for retries. We retry
during the main connect, but not during auth.
## Summary of changes
retry wake_compute during auth flow if there was an error talking to
control plane, or if there was a temporary error in waking the compute
node
This patch adds a timed LRU cache implementation and a compute node info cache on top of that.
Cache entries might expire on their own (default ttl=5mins) or become invalid due to real-world events,
e.g. compute node scale-to-zero event, so we add a connection retry loop with a wake-up call.
Solved problems:
- [x] Find a decent LRU implementation.
- [x] Implement timed LRU on top of that.
- [x] Cache results of `proxy_wake_compute` API call.
- [x] Don't invalidate newer cache entries for the same key.
- [x] Add cmdline configuration knobs (requires some refactoring).
- [x] Add failed connection estab metric.
- [x] Refactor auth backends to make things simpler (retries, cache
placement, etc).
- [x] Address review comments (add code comments + cleanup).
- [x] Retry `/proxy_wake_compute` if we couldn't connect to a compute
(e.g. stalled cache entry).
- [x] Add high-level description for `TimedLru`.
TODOs (will be addressed later):
- [ ] Add cache metrics (hit, spurious hit, miss).
- [ ] Synchronize http requests across concurrent per-client tasks
(https://github.com/neondatabase/neon/pull/3331#issuecomment-1399216069).
- [ ] Cache results of `proxy_get_role_secret` API call.