## Problem
Missing error classification for SQL-over-HTTP queries.
Not respecting `UserFacingError` for SQL-over-HTTP queries.
## Summary of changes
Adds error classification.
Adds user facing errors.
## Problem
Now that we have tls-listener vendored, we can refactor and remove a lot
of bloated code and make the whole flow a bit simpler
## Summary of changes
1. Remove dead code
2. Move the error handling to inside the `TlsListener` accept() function
3. Extract the peer_addr from the PROXY protocol header and log it with
errors
## Problem
On HTTP query timeout, we should try and cancel the current in-flight
SQL query.
## Summary of changes
Trigger a cancellation command in postgres once the timeout is reach
## Problem
For the ephemeral endpoint feature, it's not really too helpful to keep
them around in the connection pool. This isn't really pressing but I
think it's still a bit better this way.
## Summary of changes
Add `is_ephemeral` function to `NeonOptions`. Allow
`serverless::ConnInfo::endpoint_cache_key()` to return an `Option`.
Handle that option appropriately
`std` has had `pin!` macro for some time, there is no need for us to use
the older alternatives. Cannot disallow `tokio::pin` because tokio
macros use that.
Nightly has added a bunch of compiler and linter warnings. There is also
two dependencies that fail compilation on latest nightly due to using
the old `stdsimd` feature name. This PR fixes them.
## Problem
Hard to find error reasons by endpoint for HTTP flow.
## Summary of changes
I want all root spans to have session id and endpoint id. I want all
root spans to be consistent.
## Problem
## Summary of changes
1. Classify further cplane API errors
2. add 'serviceratelimit' and make a few of the timeout errors return
that.
3. a few additional minor changes
## Problem
Not really a problem, just refactoring.
## Summary of changes
Separate authenticate from wake compute.
Do not call wake compute second time if we managed to connect to
postgres or if we got it not from cache.
## Problem
hard to see where time is taken during HTTP flow.
## Summary of changes
add a lot more for query state. add a conn_id field to the sql-over-http
span
## Problem
usernames and passwords can be URL 'percent' encoded in the connection
string URL provided by serverless driver.
## Summary of changes
Decode the parameters when getting conn info
## Problem
Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and
doing a bit less radical changes. smaller commits.
We currently don't report error classifications in proxy as the current
error handling made it hard to do so.
## Summary of changes
1. Add a `ReportableError` trait that all errors will implement. This
provides the error classification functionality.
2. Handle Client requests a strongly typed error
* this error is a `ReportableError` and is logged appropriately
3. The handle client error only has a few possible error types, to
account for the fact that at this point errors should be returned to the
user.
## Problem
Drizzle needs to be able to configure the array_mode flag per query.
## Summary of changes
Adds an array_mode flag to the query data json that will otherwise
default to the header flag.
## Problem
The password check logic for the sql-over-http is a bit non-intuitive.
## Summary of changes
1. Perform scram auth using the same logic as for websocket cleartext
password.
2. Split establish connection logic and connection pool.
3. Parallelize param parsing logic with authentication + wake compute.
4. Limit the total number of clients
## Problem
not really any problem, just some drive-by changes
## Summary of changes
1. move wake compute
2. move json processing
3. move handle_try_wake
4. move test backend to api provider
5. reduce wake-compute concerns
6. remove duplicate wake-compute loop
## Problem
Right now if get_role_secret response wasn't cached (e.g. cache already
reached max size) it will send the second (exactly the same request).
## Summary of changes
Avoid needless request.
## Problem
Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and
doing a bit less radical changes. smaller commits.
Proxy flow was quite deeply nested, which makes adding more interesting
error handling quite tricky.
## Summary of changes
I recommend reviewing commit by commit.
1. move handshake logic into a separate file
2. move passthrough logic into a separate file
3. no longer accept a closure in CancelMap session logic
4. Remove connect_to_db, copy logic into handle_client
5. flatten auth_and_wake_compute in authenticate
6. record info for link auth
## Problem
http-over-sql allowes host to be in format api.aws.... however it's not
the case for the websocket flow.
## Summary of changes
Relax endpoint check for the ws serverless connections.
## Problem
too many string based IDs. easy to mix up ID types.
## Summary of changes
Add a bunch of `SmolStr` wrappers that provide convenience methods but
are type safe
## Problem
Some fields were missed in the initial spec.
## Summary of changes
Adds a success boolean (defaults to false unless specifically marked as
successful).
Adds a duration_us integer that tracks how many microseconds were taken
from session start through to request completion.
## Problem
HTTP connection pool was not respecting the PitR options.
## Summary of changes
1. refactor neon_options a bit to allow easier access to cache_key
2. make HTTP not go through `StartupMessageParams`
3. expose SNI processing to replace what was removed in step 2.
## Summary of changes
### RequestMonitoring
We want to add an event stream with information on each request for
easier analysis than what we can do with diagnostic logs alone
(https://github.com/neondatabase/cloud/issues/8807). This
RequestMonitoring will keep a record of the final state of a request. On
drop it will be pushed into a queue to be uploaded.
Because this context is a bag of data, I don't want this information to
impact logic of request handling. I personally think that weakly typed
data (such as all these options) makes for spaghetti code. I will
however allow for this data to impact rate-limiting and blocking of
requests, as this does not _really_ change how a request is handled.
### Parquet
Each `RequestMonitoring` is flushed into a channel where it is converted
into `RequestData`, which is accumulated into parquet files. Each file
will have a certain number of rows per row group, and several row groups
will eventually fill up the file, which we then upload to S3.
We will also upload smaller files if they take too long to construct.
## Problem
HTTP connection pool will grow without being pruned
## Summary of changes
Remove connection clients from pools once idle, or once they exit.
Periodically clear pool shards.
GC Logic:
Each shard contains a hashmap of `Arc<EndpointPool>`s.
Each connection stores a `Weak<EndpointPool>`.
During a GC sweep, we take a random shard write lock, and check that if
any of the `Arc<EndpointPool>`s are unique (using `Arc::get_mut`).
- If they are unique, then we check that the endpoint-pool is empty, and
sweep if it is.
- If they are not unique, then the endpoint-pool is in active use and we
don't sweep.
- Idle connections will self-clear from the endpoint-pool after 5
minutes.
Technically, the uniqueness of the endpoint-pool should be enough to
consider it empty, but the connection count check is done for
completeness sake.
## Problem
Currently if we are getting many consecutive connections to the same
user/ep we will send a lot of traffic to the console.
## Summary of changes
Cache with ttl=4min proxy_get_role_secret response.
Note: this is the temporary hack, notifier listener is WIP.
## Problem
The `src/proxy.rs` file is far too large
## Summary of changes
Creates 3 new files:
```
src/metrics.rs
src/proxy/retry.rs
src/proxy/connect_compute.rs
```
## Summary of changes
saw some low-hanging codecov improvements. even if code coverage is
somewhat of a pointless game, might as well add tests where we can and
delete code if it's unused
## Problem
The websockets gauge for active db connections seems to be growing more
than the gauge for client connections over websockets, which does not
make sense.
## Summary of changes
refactor how our counter-pair gauges are represented. not sure if this
will improve the problem, but it should be harder to mess-up the
counters. The API is much nicer though now and doesn't require
scopeguard::defer hacks
## Problem
In snowflake logs currently there is no information about the protocol,
that the client uses.
## Summary of changes
Propagate the information about the protocol together with the app_name.
In format: `{app_name}/{sql_over_http/tcp/ws}`.
This will give to @stepashka more observability on what our clients are
using.
## Problem
no problem
## Summary of changes
replaces boxstr with arcstr as it's cheaper to clone. mild perf
improvement.
probably should look into other smallstring optimsations tbh, they will
likely be even better. The longest endpoint name I was able to construct
is something like `ep-weathered-wildflower-12345678` which is 32 bytes.
Most string optimisations top out at 23 bytes
## Problem
Per-project IP allowlist:
https://github.com/neondatabase/cloud/issues/8116
## Summary of changes
Implemented IP filtering on the proxy side.
To retrieve ip allowlist for all scenarios, added `get_auth_info` call
to the control plane for:
* sql-over-http
* password_hack
* cleartext_hack
Added cache with ttl for sql-over-http path
This might slow down a bit, consider using redis in the future.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
## Problem
#5123
## Summary of changes
Add `--sql-over-http-pool-opt-in true` default cli arg. Allows us to set
`--sql-over-http-pool-opt-in false` region-by-region
The idea is to pass neon_* prefixed options to control plane. It can be
used by cplane to dynamically create timelines and computes. Such
options also should be excluded from passing to compute. Another issue
is how connection caching is working now, because compute's instance now
depends not only on hostname but probably on such options too I included
them to cache key.
## Problem
Our serverless backend was a bit jumbled. As a comment indicated, we
were handling SQL-over-HTTP in our `websocket.rs` file.
I've extracted out the `sql_over_http` and `websocket` files from the
`http` module and put them into a new module called `serverless`.
## Summary of changes
```sh
mkdir proxy/src/serverless
mv proxy/src/http/{conn_pool,sql_over_http,websocket}.rs proxy/src/serverless/
mv proxy/src/http/server.rs proxy/src/http/health_server.rs
mv proxy/src/metrics proxy/src/usage_metrics.rs
```
I have also extracted the hyper server and handler from websocket.rs
into `serverless.rs`