## Problem
Some HTTP client connections can stay open for quite a long time.
## Summary of changes
When there are too many HTTP client connections, pick a random
connection and gracefully cancel it.
## Problem
Usually, the connection itself is quite fast (bellow 10ms for p999:
https://neonprod.grafana.net/goto/aOyn8vYIg?orgId=1).
It doesn't make a lot of sense to wait for a lot of time for the lock,
if it takes a lot of time to acquire it, probably, something goes wrong.
We also spawn a lot of retries, but they are not super helpful (0 means
that it was connected successfully, 1, most probably, that it was
re-request of the compute node address
https://neonprod.grafana.net/goto/J_8VQvLIR?orgId=1). Let's try to keep
a small number of retries.
## Problem
Too many connect_compute attempts can overwhelm postgres, getting the
connections stuck.
## Summary of changes
Limit number of connection attempts that can happen at a given time.
## Problem
Currently we cannot configure retries, also, we don't really have
visibility of what's going on there.
## Summary of changes
* Added cli params
* Improved logging
* Decrease the number of retries: it feels like most of retries doesn't
help. Once there would be better errors handling, we can increase it
back.
## Problem
Many users have access to ipv6 subnets (eg a /64). That gives them 2^64
addresses to play with
## Summary of changes
Truncate the address to /64 to reduce the attack surface.
Todo:
~~Will NAT64 be an issue here? AFAIU they put the IPv4 address at the
end of the IPv6 address. By truncating we will lose all that detail.~~
It's the same problem as a host sharing IPv6 addresses between clients.
I don't think it's up to us to solve. If a customer is getting DDoSed,
then they likely need to arrange a dedicated IP with us.
## Problem
Actually read redis events.
## Summary of changes
This is revert of https://github.com/neondatabase/neon/pull/7350 +
fixes.
* Fixed events parsing
* Added timeout after connection failure
* Separated regional and global redis clients.
## Problem
Proxy doesn't know about existing endpoints.
## Summary of changes
* Added caching of all available endpoints.
* On the high load, use it before going to cplane.
* Report metrics for the outcome.
* For rate limiter and credentials caching don't distinguish between
`-pooled` and not
TODOs:
* Make metrics more meaningful
* Consider integrating it with the endpoint rate limiter
* Test it together with cplane in preview
## Problem
https://github.com/neondatabase/cloud/issues/9642
## Summary of changes
1. Make `EndpointRateLimiter` generic, renamed as `BucketRateLimiter`
2. Add support for claiming multiple tokens at once
3. Add `AuthRateLimiter` alias.
4. Check `(Endpoint, IP)` pair during authentication, weighted by how
many hashes proxy would be doing.
TODO: handle ipv6 subnets. will do this in a separate PR.
## Problem
Support of IAM Roles for Service Accounts for authentication.
## Summary of changes
* Obtain aws 15m-long credentials
* Retrieve redis password from credentials
* Update every 1h to keep connection for more than 12h
* For now allow to have different endpoints for pubsub/stream redis.
TODOs:
* PubSub doesn't support credentials refresh, consider using stream
instead.
* We need an AWS role for proxy to be able to connect to both: S3 and
elasticache.
Credentials obtaining and connection refresh was tested on xenon
preview.
https://github.com/neondatabase/cloud/issues/10365
## Problem
Not really a problem, just refactoring.
## Summary of changes
Separate authenticate from wake compute.
Do not call wake compute second time if we managed to connect to
postgres or if we got it not from cache.
## Problem
HTTP connection pool was not respecting the PitR options.
## Summary of changes
1. refactor neon_options a bit to allow easier access to cache_key
2. make HTTP not go through `StartupMessageParams`
3. expose SNI processing to replace what was removed in step 2.
## Problem
Current cache doesn't support any updates from the cplane.
## Summary of changes
* Added redis notifier listner.
* Added cache which can be invalidated with the notifier. If the
notifier is not available, it's just a normal ttl cache.
* Updated cplane api.
The motivation behind this organization of the data is the following:
* In the Neon data model there are projects. Projects could have
multiple branches and each branch could have more than one endpoint.
* Also there is one special `main` branch.
* Password reset works per branch.
* Allowed IPs are the same for every branch in the project (except,
maybe, the main one).
* The main branch can be changed to the other branch.
* The endpoint can be moved between branches.
Every event described above requires some special processing on the
porxy (or cplane) side.
The idea of invalidating for the project is that whenever one of the
events above is happening with the project, proxy can invalidate all
entries for the entire project.
This approach also requires some additional API change (returning
project_id inside the auth info).
## Summary of changes
### RequestMonitoring
We want to add an event stream with information on each request for
easier analysis than what we can do with diagnostic logs alone
(https://github.com/neondatabase/cloud/issues/8807). This
RequestMonitoring will keep a record of the final state of a request. On
drop it will be pushed into a queue to be uploaded.
Because this context is a bag of data, I don't want this information to
impact logic of request handling. I personally think that weakly typed
data (such as all these options) makes for spaghetti code. I will
however allow for this data to impact rate-limiting and blocking of
requests, as this does not _really_ change how a request is handled.
### Parquet
Each `RequestMonitoring` is flushed into a channel where it is converted
into `RequestData`, which is accumulated into parquet files. Each file
will have a certain number of rows per row group, and several row groups
will eventually fill up the file, which we then upload to S3.
We will also upload smaller files if they take too long to construct.
## Problem
HTTP connection pool will grow without being pruned
## Summary of changes
Remove connection clients from pools once idle, or once they exit.
Periodically clear pool shards.
GC Logic:
Each shard contains a hashmap of `Arc<EndpointPool>`s.
Each connection stores a `Weak<EndpointPool>`.
During a GC sweep, we take a random shard write lock, and check that if
any of the `Arc<EndpointPool>`s are unique (using `Arc::get_mut`).
- If they are unique, then we check that the endpoint-pool is empty, and
sweep if it is.
- If they are not unique, then the endpoint-pool is in active use and we
don't sweep.
- Idle connections will self-clear from the endpoint-pool after 5
minutes.
Technically, the uniqueness of the endpoint-pool should be enough to
consider it empty, but the connection count check is done for
completeness sake.
## Problem
Currently if we are getting many consecutive connections to the same
user/ep we will send a lot of traffic to the console.
## Summary of changes
Cache with ttl=4min proxy_get_role_secret response.
Note: this is the temporary hack, notifier listener is WIP.
## Problem
Single rate bucket is limited in usefulness
## Summary of changes
Introduce a secondary bucket allowing an average of 200 requests per
second over 1 minute, and a tertiary bucket allowing an average of 100
requests per second over 10 minutes.
Configured by using a format like
```sh
proxy --endpoint-rps-limit 300@1s --endpoint-rps-limit 100@10s --endpoint-rps-limit 50@1m
```
If the bucket limits are inconsistent, an error is returned on startup
```
$ proxy --endpoint-rps-limit 300@1s --endpoint-rps-limit 10@10s
Error: invalid endpoint RPS limits. 10@10s allows fewer requests per bucket than 300@1s (100 vs 300)
```
## Problem
Per-project IP allowlist:
https://github.com/neondatabase/cloud/issues/8116
## Summary of changes
Implemented IP filtering on the proxy side.
To retrieve ip allowlist for all scenarios, added `get_auth_info` call
to the control plane for:
* sql-over-http
* password_hack
* cleartext_hack
Added cache with ttl for sql-over-http path
This might slow down a bit, consider using redis in the future.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
## Problem
#5123
## Summary of changes
Add `--sql-over-http-pool-opt-in true` default cli arg. Allows us to set
`--sql-over-http-pool-opt-in false` region-by-region
## Problem
channel binding protects scram from sophisticated MITM attacks where the
attacker is able to produce 'valid' TLS certificates.
## Summary of changes
get the tls-server-end-point channel binding, and verify it is correct
for the SCRAM-SHA-256-PLUS authentication flow
## Problem
A user can perform many database connections at the same instant of time
- these will all cache miss and materialise as requests to the control
plane. #5705
## Summary of changes
I am using a `DashMap` (a sharded `RwLock<HashMap>`) of endpoints ->
semaphores to apply a limiter. If the limiter is enabled (permits > 0),
the semaphore will be retrieved per endpoint and a permit will be
awaited before continuing to call the wake_compute endpoint.
### Important details
This dashmap would grow uncontrollably without maintenance. It's not a
cache so I don't think an LRU-based reclamation makes sense. Instead,
I've made use of the sharding functionality of DashMap to lock a single
shard and clear out unused semaphores periodically.
I ran a test in release, using 128 tokio tasks among 12 threads each
pushing 1000 entries into the map per second, clearing a shard every 2
seconds (64 second epoch with 32 shards). The endpoint names were
sampled from a gamma distribution to make sure some overlap would occur,
and each permit was held for 1ms. The histogram for time to clear each
shard settled between 256-512us without any variance in my testing.
Holding a lock for under a millisecond for 1 of the shards does not
concern me as blocking
## Problem
We need a flag to require proxy protocol (prerequisite for #5416)
## Summary of changes
Add a cli flag to require client IP addresses. Error if IP address is
missing when the flag is active.
## Problem
`cargo +nightly doc` is giving a lot of warnings: broken links, naked
URLs, etc.
## Summary of changes
* update the `proc-macro2` dependency so that it can compile on latest
Rust nightly, see https://github.com/dtolnay/proc-macro2/pull/391 and
https://github.com/dtolnay/proc-macro2/issues/398
* allow the `private_intra_doc_links` lint, as linking to something
that's private is always more useful than just mentioning it without a
link: if the link breaks in the future, at least there is a warning due
to that. Also, one might enable
[`--document-private-items`](https://doc.rust-lang.org/cargo/commands/cargo-doc.html#documentation-options)
in the future and make these links work in general.
* fix all the remaining warnings given by `cargo +nightly doc`
* make it possible to run `cargo doc` on stable Rust by updating
`opentelemetry` and associated crates to version 0.19, pulling in a fix
that previously broke `cargo doc` on stable:
https://github.com/open-telemetry/opentelemetry-rust/pull/904
* Add `cargo doc` to CI to ensure that it won't get broken in the
future.
Fixes#2557
## Future work
* Potentially, it might make sense, for development purposes, to publish
the generated rustdocs somewhere, like for example [how the rust
compiler does
it](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/index.html).
I will file an issue for discussion.
In case we try to connect to an outdated address that is no longer valid, the
default behavior of Kubernetes is to drop the packets, causing us to wait for
the entire timeout period. We want to fail fast in such cases.
A specific case to consider is when we have cached compute node information
with a 5-minute TTL (Time To Live), but the user has executed a `/suspend` API
call, resulting in the nonexistence of the compute node.
This commit introduces an SQL-over-HTTP endpoint in the proxy, with a JSON
response structure resembling that of the node-postgres driver. This method,
using HTTP POST, achieves smaller amortized latencies in edge setups due to
fewer round trips and an enhanced open connection reuse by the v8 engine.
This update involves several intricacies:
1. SQL injection protection: We employed the extended query protocol, modifying
the rust-postgres driver to send queries in one roundtrip using a text
protocol rather than binary, bypassing potential issues like those identified
in https://github.com/sfackler/rust-postgres/issues/1030.
2. Postgres type compatibility: As not all postgres types have binary
representations (e.g., acl's in pg_class), we adjusted rust-postgres to
respond with text protocol, simplifying serialization and fixing queries with
text-only types in response.
3. Data type conversion: Considering JSON supports fewer data types than
Postgres, we perform conversions where possible, passing all other types as
strings. Key conversions include:
- postgres int2, int4, float4, float8 -> json number (NaN and Inf remain
text)
- postgres bool, null, text -> json bool, null, string
- postgres array -> json array
- postgres json and jsonb -> json object
4. Alignment with node-postgres: To facilitate integration with js libraries,
we've matched the response structure of node-postgres, returning command tags
and column oids. Command tag capturing was added to the rust-postgres
functionality as part of this change.
When no SNI is provided use the default certificate, otherwise we can't
get to the options parameter which can be used to set endpoint name too.
That means that non-SNI flow will not work for CNAME domains in verify-full
mode.
Old coding here ignored non-wildcard common names and passed None instead. With my recent changes
I started throwing an error in that case. Old logic doesn't seem to be a great choice, so instead
of passing None I actually set non-wildcard common names too. That way it is possible to avoid handling
cases with None in downstream code.
Make it possible to specify directory where proxy will look up for
extra certificates. Proxy will iterate through subdirs of that directory
and load `key.pem` and `cert.pem` files from each subdir. Certs directory
structure may look like that:
certs
|--example.com
| |--key.pem
| |--cert.pem
|--foo.bar
|--key.pem
|--cert.pem
Actual domain names are taken from certs and key, subdir names are
ignored.
On the surface, this doesn't add much, but there are some benefits:
* We can do graceful shutdowns and thus record more code coverage data.
* We now have a foundation for the more interesting behaviors, e.g. "stop
accepting new connections after SIGTERM but keep serving the existing ones".
* We give the otel machinery a chance to flush trace events before
finally shutting down.
This patch adds a timed LRU cache implementation and a compute node info cache on top of that.
Cache entries might expire on their own (default ttl=5mins) or become invalid due to real-world events,
e.g. compute node scale-to-zero event, so we add a connection retry loop with a wake-up call.
Solved problems:
- [x] Find a decent LRU implementation.
- [x] Implement timed LRU on top of that.
- [x] Cache results of `proxy_wake_compute` API call.
- [x] Don't invalidate newer cache entries for the same key.
- [x] Add cmdline configuration knobs (requires some refactoring).
- [x] Add failed connection estab metric.
- [x] Refactor auth backends to make things simpler (retries, cache
placement, etc).
- [x] Address review comments (add code comments + cleanup).
- [x] Retry `/proxy_wake_compute` if we couldn't connect to a compute
(e.g. stalled cache entry).
- [x] Add high-level description for `TimedLru`.
TODOs (will be addressed later):
- [ ] Add cache metrics (hit, spurious hit, miss).
- [ ] Synchronize http requests across concurrent per-client tasks
(https://github.com/neondatabase/neon/pull/3331#issuecomment-1399216069).
- [ ] Cache results of `proxy_get_role_secret` API call.
[proxy] Add the `password hack` authentication flow
This lets us authenticate users which can use neither
SNI (due to old libpq) nor connection string `options`
(due to restrictions in other client libraries).
Note: `PasswordHack` will accept passwords which are not
encoded in base64 via the "password" field. The assumption
is that most user passwords will be valid utf-8 strings,
and the rest may still be passed via "password_".