I was looking into
https://github.com/neondatabase/serverless/issues/144, I recall previous
cases where proxy would trigger these prepared statements which would
conflict with other statements prepared by our client downstream.
Because of that, and also to aid in debugging, I've made sure all
prepared statements that proxy needs to make have specific names that
likely won't conflict and makes it clear in a error log if it's our
statements that are causing issues
- Wired up filtering on VPC endpoints
- Wired up block access from public internet / VPC depending on per
project flag
- Added cache invalidation for VPC endpoints (partially based on PR from
Raphael)
- Removed BackendIpAllowlist trait
---------
Co-authored-by: Ivan Efremov <ivan@neon.tech>
Generally ed25519 seems to be much preferred for cryptographic strength
to P256 nowadays, and it is NIST approved finally. We should use it
where we can as it's also faster than p256.
This PR makes the re-signed JWTs between local_proxy and pg_session_jwt
use ed25519.
This does introduce a new dependency on ed25519, but I do recall some
Neon Authorise customers asking for support for ed25519, so I am
justifying this dependency addition in the context that we can then
introduce support for customer ed25519 keys
sources:
* https://csrc.nist.gov/pubs/fips/186-5/final subsection 7 (EdDSA)
* https://datatracker.ietf.org/doc/html/rfc8037#section-3.1
Now that we construct the TLS client config for cancellation as well as
connect, it feels appropriate to construct the same config once and
re-use it elsewhere. It might also help should #7500 require any extra
setup, so we can easily add it to all the appropriate call sites.
As the title says, I updated the lint rules to no longer allow unwrap or
unimplemented.
Three special cases:
* Tests are allowed to use them
* std::sync::Mutex lock().unwrap() is common because it's usually
correct to continue panicking on poison
* `tokio::spawn_blocking(...).await.unwrap()` is common because it will
only error if the blocking fn panics, so continuing the panic is also
correct
I've introduced two extension traits to help with these last two, that
are a bit more explicit so they don't need an expect message every time.
## Problem
While reviewing #10152 I found it tricky to actually determine whether
the connection used `allow_self_signed_compute` or not.
I've tried to remove this setting in the past:
* https://github.com/neondatabase/neon/pull/7884
* https://github.com/neondatabase/neon/pull/7437
* https://github.com/neondatabase/cloud/pull/13702
But each time it seems it is used by e2e tests
## Summary of changes
The `node_info.allow_self_signed_computes` is always initialised to
false, and then sometimes inherits the proxy config value. There's no
need this needs to be in the node_info, so removing it and propagating
it via `TcpMechansim` is simpler.
(stacked on #9990 and #9995)
Partially fixes#1287 with a custom option field to enable the fixed
behaviour. This allows us to gradually roll out the fix without silently
changing the observed behaviour for our customers.
related to https://github.com/neondatabase/cloud/issues/15284
Keeping the `mock` postgres cplane adaptor using "stock" tokio-postgres
allows us to remove a lot of dead weight from our actual postgres
connection logic.
Our rust-postgres fork is getting messy. Mostly because proxy wants more
control over the raw protocol than tokio-postgres provides. As such,
it's diverging more and more. Storage and compute also make use of
rust-postgres, but in more normal usage, thus they don't need our crazy
changes.
Idea:
* proxy maintains their subset
* other teams use a minimal patch set against upstream rust-postgres
Reviewing this code will be difficult. To implement it, I
1. Copied tokio-postgres, postgres-protocol and postgres-types from
00940fcdb5
2. Updated their package names with the `2` suffix to make them compile
in the workspace.
3. Updated proxy to use those packages
4. Copied in the code from tokio-postgres-rustls 0.13 (with some patches
applied https://github.com/jbg/tokio-postgres-rustls/pull/32https://github.com/jbg/tokio-postgres-rustls/pull/33)
5. Removed as much dead code as I could find in the vendored libraries
6. Updated the tokio-postgres-rustls code to use our existing channel
binding implementation
Follow up to #9803
See https://github.com/neondatabase/cloud/issues/14378
In collaboration with @cloneable and @awarus, we sifted through logs and
simply demoted some logs to debug. This is not at all finished and there
are more logs to review, but we ran out of time in the session we
organised. In any slightly more nuanced cases, we didn't touch the log,
instead leaving a TODO comment.
I've also slightly refactored the sql-over-http body read/length reject
code. I can split that into a separate PR. It just felt natural after I
switched to `read_body_with_limit` as we discussed during the meet.
## Problem
It is called context/ctx everywhere and the Monitoring suffix needlessly
confuses with proper monitoring code.
## Summary of changes
* Rename RequestMonitoring to RequestContext
* Rename RequestMonitoringInner to RequestContextInner
The overall idea of the PR is to rename a few types to make their
purpose more clear, reduce abstraction where not needed, and move types
to to more better suited modules.
Unify client, EndpointConnPool and DbUserConnPool for remote and local
conn.
- Use new ClientDataEnum for additional client data.
- Add ClientInnerCommon client structure.
- Remove Client and EndpointConnPool code from local_conn_pool.rs
* Also rename `AuthFailed` variant to `PasswordFailed`.
* Before this all JWT errors end up in `AuthError::AuthFailed()`,
expects a username and also causes cache invalidation.
Follow up on #9344. We want to install the extension automatically. We
didn't want to couple the extension into compute_ctl so instead
local_proxy is the one to issue requests specific to the extension.
depends on #9344 and #9395
First PR for #9284
Start unification of the client and connection pool interfaces:
- Exclude the 'global_connections_count' out from the get_conn_entry()
- Move remote connection pools to the conn_pool_lib as a reference
- Unify clients among all the conn pools
removes the ConsoleRedirect backend from the main auth::Backends enum,
copy-paste the existing crate::proxy::task_main structure to use the
ConsoleRedirectBackend exclusively.
This makes the logic a bit simpler at the cost of some fairly trivial
code duplication.
preliminary for #9270
The auth::Backend didn't need to be in the mega ProxyConfig object, so I
split it off and passed it manually in the few places it was necessary.
I've also refined some of the uses of config I saw while doing this
small refactor.
I've also followed the trend and make the console redirect backend it's
own struct, same as LocalBackend and ControlPlaneBackend.
misc changes split out from #8855
- **allow cloning the request context in a read-only fashion for
background tasks**
- **propagate endpoint and request context through the jwk cache**
- **only allow password based auth for md5 during testing**
- **remove auth info from conn info**
This reverts #8076 - which was already reverted from the release branch
since forever (it would have been a breaking change to release for all
users who currently set TimeZone options). It's causing conflicts now so
we should revert it here as well.
## Problem
1. Hard to correlate startup parameters with the endpoint that provided
them.
2. Some configurations are not needed in the `ProxyConfig` struct.
## Summary of changes
Because of some borrow checker fun, I needed to switch to an
interior-mutability implementation of our `RequestMonitoring` context
system. Using https://docs.rs/try-lock/latest/try_lock/ as a cheap lock
for such a use-case (needed to be thread safe).
Removed the lock of each startup message, instead just logging only the
startup params in a successful handshake.
Also removed from values from `ProxyConfig` and kept as arguments.
(needed for local-proxy config)
## Problem
new clippy warnings on nightly.
## Summary of changes
broken up each commit by warning type.
1. Remove some unnecessary refs.
2. In edition 2024, inference will default to `!` and not `()`.
3. Clippy complains about doc comment indentation
4. Fix `Trait + ?Sized` where `Trait: Sized`.
5. diesel_derives triggering `non_local_defintions`
## Problem
1. Proxy is retrying errors from cplane that shouldn't be retried
2. ~~Proxy is not using the retry_after_ms value~~
## Summary of changes
1. Correct the could_retry impl for ConsoleError.
2. ~~Update could_retry interface to support returning a fixed wait
duration.~~
## Problem
Fixes https://github.com/neondatabase/neon/issues/1287
## Summary of changes
tokio-postgres now supports arbitrary server params through the
`param(key, value)` method. Some keys are special so we explicitly
filter them out.
## Problem
Some tasks are using around upwards of 10KB of memory at all times,
sometimes having buffers that swing them up to 30MB.
## Summary of changes
Split some of the async tasks in selective places and box them as
appropriate to try and reduce the constant memory usage. Especially in
the locations where the large future is only a small part of the total
runtime of the task.
Also, reduces the size of the CopyBuffer buffer size from 8KB to 1KB.
In my local testing and in staging this had a minor improvement. sadly
not the improvement I was hoping for :/ Might have more impact in
production
## Problem
Computes that are healthy can manage many connection attempts at a time.
Unhealthy computes cannot. We initially handled this with a fixed
concurrency limit, but it seems this inhibits pgbench.
## Summary of changes
Support AIMD for connect_to_compute lock to allow varying the
concurrency limit based on compute health
## Problem
Despite making password hashing async, it can still take time away from
the network code.
## Summary of changes
Introduce a custom threadpool, inspired by rayon. Features:
### Fairness
Each task is tagged with it's endpoint ID. The more times we have seen
the endpoint, the more likely we are to skip the task if it comes up in
the queue. This is using a min-count-sketch estimator for the number of
times we have seen the endpoint, resetting it every 1000+ steps.
Since tasks are immediately rescheduled if they do not complete, the
worker could get stuck in a "always work available loop". To combat
this, we check the global queue every 61 steps to ensure all tasks
quickly get a worker assigned to them.
### Balanced
Using crossbeam_deque, like rayon does, we have workstealing out of the
box. I've tested it a fair amount and it seems to balance the workload
accordingly
## Problem
There is no global per-ep rate limiter in proxy.
## Summary of changes
* Return global per-ep rate limiter back.
* Rename weak compute rate limiter (the cli flags were not used
anywhere, so it's safe to rename).
## Problem
If a permit cannot be acquired to connect to compute, the cache is
invalidated. This had the observed affect of sending more traffic to
ProxyWakeCompute on cplane.
## Summary of changes
Make sure that permit acquire failures are marked as "should not
invalidate cache".
## Problem
Too many connect_compute attempts can overwhelm postgres, getting the
connections stuck.
## Summary of changes
Limit number of connection attempts that can happen at a given time.