## Problem
hyper1 offers control over the HTTP connection that hyper0_14 does not.
We're blocked on switching all services to hyper1 because of how we use
tonic, but no reason we can't switch proxy over.
## Summary of changes
1. hyper0.14 -> hyper1
1. self managed server
2. Remove the `WithConnectionGuard` wrapper from `protocol2`
2. Remove TLS listener as it's no longer necessary
3. include first session ID in connection startup logs
## Problem
Proxy doesn't know about existing endpoints.
## Summary of changes
* Added caching of all available endpoints.
* On the high load, use it before going to cplane.
* Report metrics for the outcome.
* For rate limiter and credentials caching don't distinguish between
`-pooled` and not
TODOs:
* Make metrics more meaningful
* Consider integrating it with the endpoint rate limiter
* Test it together with cplane in preview
## Problem
Would be nice to have a bit more info on cold start metrics.
## Summary of changes
* Change connect compute latency to include `cold_start_info`.
* Update `ColdStartInfo` to include HttpPoolHit and WarmCached.
* Several changes to make more use of interned strings
## Problem
https://github.com/neondatabase/cloud/issues/11051
additionally, I felt like the http logic was a bit complex.
## Summary of changes
1. Removes timeout for HTTP requests.
2. Split out header parsing to a `HttpHeaders` type.
3. Moved db client handling to `QueryData::process` and
`BatchQueryData::process` to simplify the logic of `handle_inner` a bit.
## Problem
https://github.com/neondatabase/cloud/issues/9642
## Summary of changes
1. Make `EndpointRateLimiter` generic, renamed as `BucketRateLimiter`
2. Add support for claiming multiple tokens at once
3. Add `AuthRateLimiter` alias.
4. Check `(Endpoint, IP)` pair during authentication, weighted by how
many hashes proxy would be doing.
TODO: handle ipv6 subnets. will do this in a separate PR.
## Problem
stack overflow in blanket impl for `CancellationPublisher`
## Summary of changes
Removes `async_trait` and fixes the impl order to make it non-recursive.
## Problem
I noticed code coverage for auth_quirks was pretty bare
## Summary of changes
Adds 3 happy path unit tests for auth_quirks
* scram
* cleartext (websockets)
* cleartext (password hack)
## Problem
Support of IAM Roles for Service Accounts for authentication.
## Summary of changes
* Obtain aws 15m-long credentials
* Retrieve redis password from credentials
* Update every 1h to keep connection for more than 12h
* For now allow to have different endpoints for pubsub/stream redis.
TODOs:
* PubSub doesn't support credentials refresh, consider using stream
instead.
* We need an AWS role for proxy to be able to connect to both: S3 and
elasticache.
Credentials obtaining and connection refresh was tested on xenon
preview.
https://github.com/neondatabase/cloud/issues/10365
## Problem
for HTTP/WS/password hack flows we imitate SCRAM to validate passwords.
This code was unnecessarily complicated.
## Summary of changes
Copy in the `pbkdf2` and 'derive keys' steps from the
`postgres_protocol` crate in our `rust-postgres` fork. Derive the
`client_key`, `server_key` and `stored_key` from the password directly.
Use constant time equality to compare the `stored_key` and `server_key`
with the ones we are sent from cplane.
## Problem
Storage controller had basically no metrics.
## Summary of changes
1. Migrate the existing metrics to use Conrad's
[`measured`](https://docs.rs/measured/0.0.14/measured/) crate.
2. Add metrics for incoming http requests
3. Add metrics for outgoing http requests to the pageserver
4. Add metrics for outgoing pass through requests to the pageserver
5. Add metrics for database queries
Note that the metrics response for the attachment service does not use
chunked encoding like the rest of the metrics endpoints. Conrad has
kindly extended the crate such that it can now be done. Let's leave it
for a follow-up since the payload shouldn't be that big at this point.
Fixes https://github.com/neondatabase/neon/issues/6875
## Problem
faster sha2 hashing.
## Summary of changes
enable asm feature for sha2. this feature will be default in sha2 0.11,
so we might as well lean into it now. It provides a noticeable speed
boost on macos aarch64. Haven't tested on x86 though
## Problem
hyper auto-cancels the request futures on connection close.
`sql_over_http::handle` is not 'drop cancel safe', so we need to do some
other work to make sure connections are queries in the right way.
## Summary of changes
1. tokio::spawn the request handler to resolve the initial cancel-safety
issue
2. share a cancellation token, and cancel it when the request `Service`
is dropped.
3. Add a new log span to be able to track the HTTP connection lifecycle.
## Problem
Currently cplane communication is a part of the latency monitoring. It
doesn't allow to setup the proper alerting based on proxy latency.
## Summary of changes
Added dimension to exclude cplane latency.
## Problem
* quotes in serialized string
* no status if connection is from local cache
## Summary of changes
* remove quotes
* report warm if connection if from local cache
## Problem
Missing error classification for SQL-over-HTTP queries.
Not respecting `UserFacingError` for SQL-over-HTTP queries.
## Summary of changes
Adds error classification.
Adds user facing errors.
## Problem
Now that we have tls-listener vendored, we can refactor and remove a lot
of bloated code and make the whole flow a bit simpler
## Summary of changes
1. Remove dead code
2. Move the error handling to inside the `TlsListener` accept() function
3. Extract the peer_addr from the PROXY protocol header and log it with
errors
## Problem
On HTTP query timeout, we should try and cancel the current in-flight
SQL query.
## Summary of changes
Trigger a cancellation command in postgres once the timeout is reach
## Problem
`422 Unprocessable Entity: compute time quota of non-primary branches is
exceeded` being marked as a control plane error.
## Summary of changes
Add the manual checks to make this a user error that should not be
retried.
## Problem
For the ephemeral endpoint feature, it's not really too helpful to keep
them around in the connection pool. This isn't really pressing but I
think it's still a bit better this way.
## Summary of changes
Add `is_ephemeral` function to `NeonOptions`. Allow
`serverless::ConnInfo::endpoint_cache_key()` to return an `Option`.
Handle that option appropriately
## Problem
Branch/project and coldStart were not populated to data events.
## Summary of changes
Populate it. Also added logging for the coldstart info.
`std` has had `pin!` macro for some time, there is no need for us to use
the older alternatives. Cannot disallow `tokio::pin` because tokio
macros use that.
## Problem
Actually it's good idea to distinguish between cases when it's a cold
start, but we took the compute from the pool
## Summary of changes
Updated to enum.
Nightly has added a bunch of compiler and linter warnings. There is also
two dependencies that fail compilation on latest nightly due to using
the old `stdsimd` feature name. This PR fixes them.
## Problem
Hard to find error reasons by endpoint for HTTP flow.
## Summary of changes
I want all root spans to have session id and endpoint id. I want all
root spans to be consistent.
## Problem
Data team cannot distinguish between cold start and not cold start.
## Summary of changes
Report `is_cold_start` to analytics.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
## Problem
## Summary of changes
1. Classify further cplane API errors
2. add 'serviceratelimit' and make a few of the timeout errors return
that.
3. a few additional minor changes
## Problem
Proxy already supported HTTP2, but I expect no one is using it because
we don't advertise it in the TLS handshake.
## Summary of changes
#6335 without the websocket changes.
1) `scram::password` was used in tests only. can be replaced with
`postgres_protocol::password`.
2) `postgres_protocol::authentication::sasl` provides a client impl of
SASL which improves our ability to test
Cancellation and timeouts are handled at remote_storage callsites, if
they are. However they should always be handled, because we've had
transient problems with remote storage connections.
- Add cancellation token to the `trait RemoteStorage` methods
- For `download*`, `list*` methods there is
`DownloadError::{Cancelled,Timeout}`
- For the rest now using `anyhow::Error`, it will have root cause
`remote_storage::TimeoutOrCancel::{Cancel,Timeout}`
- Both types have `::is_permanent` equivalent which should be passed to
`backoff::retry`
- New generic RemoteStorageConfig option `timeout`, defaults to 120s
- Start counting timeouts only after acquiring concurrency limiter
permit
- Cancellable permit acquiring
- Download stream timeout or cancellation is communicated via an
`std::io::Error`
- Exit backoff::retry by marking cancellation errors permanent
Fixes: #6096Closes: #4781
Co-authored-by: arpad-m <arpad-m@users.noreply.github.com>
## Problem
In a recent refactor, we accidentally dropped the cancel session early
## Summary of changes
Hold the cancel session during proxy passthrough
## Problem
Not really a problem, just refactoring.
## Summary of changes
Separate authenticate from wake compute.
Do not call wake compute second time if we managed to connect to
postgres or if we got it not from cache.
## Problem
hard to see where time is taken during HTTP flow.
## Summary of changes
add a lot more for query state. add a conn_id field to the sql-over-http
span
## Problem
`tokio::io::copy_bidirectional` doesn't close the connection once one of
the sides closes it. It's not really suitable for the postgres protocol.
## Summary of changes
Fork `copy_bidirectional` and initiate a shutdown for both connections.
---------
Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
## Problem
usernames and passwords can be URL 'percent' encoded in the connection
string URL provided by serverless driver.
## Summary of changes
Decode the parameters when getting conn info
## Problem
Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and
doing a bit less radical changes. smaller commits.
We currently don't report error classifications in proxy as the current
error handling made it hard to do so.
## Summary of changes
1. Add a `ReportableError` trait that all errors will implement. This
provides the error classification functionality.
2. Handle Client requests a strongly typed error
* this error is a `ReportableError` and is logged appropriately
3. The handle client error only has a few possible error types, to
account for the fact that at this point errors should be returned to the
user.