## Problem
Some PrivateLink customers are unable to use Private DNS. As such they
use an invalid domain name to address Neon. We currently are rejecting
those connections because we cannot resolve the correct certificate.
## Summary of changes
1. Ensure a certificate is always returned.
2. If there is an SNI field, use endpoint fallback if it doesn't match.
I suggest reviewing each commit separately.
## Problem
The proxy denies using `unwrap()`s in regular code, but we want to use
it in test code
and so have to allow it for each test block.
## Summary of changes
Set `allow-unwrap-in-tests = true` in clippy.toml and remove all
exceptions.
Testodrome measures uptime based on the failed requests and errors. In
case of testodrome request we send back error based on the service. This
will help us distinguish error types in testodrome and rely on the
uptime SLI.
I like to run nightly clippy every so often to make our future rust
upgrades easier. Some notable changes:
* Prefer `next_back()` over `last()`. Generic iterators will implement
`last()` to run forward through the iterator until the end.
* Prefer `io::Error::other()`.
* Use implicit returns
One case where I haven't dealt with the issues is the now
[more-sensitive "large enum variant"
lint](https://github.com/rust-lang/rust-clippy/pull/13833). I chose not
to take any decisions around it here, and simply marked them as allow
for now.
## Problem
We exposed the direction tag in #10925 but didn't actually include the
ingress tag in the export to allow for an adaption period.
## Summary of changes
We now export the ingress direction
Closes: https://github.com/neondatabase/cloud/issues/22998
If control-plane reports that TLS should be used, load the certificates
(and watch for updates), make sure postgres use them, and detects
updates.
Procedure:
1. Load certificates
2. Reconfigure postgres/pgbouncer
3. Loop on a timer until certificates have loaded
4. Go to 1
Notes:
1. We only run this procedure if requested on startup by control plane.
2. We needed to compile pgbouncer with openssl enabled
3. Postgres doesn't allow tls keys to be globally accessible - must be
read only to the postgres user. I couldn't convince the autoscaling team
to let me put this logic into the VM settings, so instead compute_ctl
will copy the keys to be read-only by postgres.
4. To mitigate a race condition, we also verify that the key matches the
cert.
In our json encoding, we only need to know about array types.
Information about composites or enums are not actually used.
Enums are quite popular, needing to type query them when not needed can
add some latency cost for no gain.
Fixes https://github.com/neondatabase/serverless/issues/144
When tables have enums, we need to perform type queries for that data.
We cache these query statements for performance reasons. In Neon RLS, we
run "discard all" for security reasons, which discards all the
statements. When we need to type check again, the statements are no
longer valid.
This fixes it to discard the statements as well.
I've also added some new logs and error types to monitor this. Currently
we don't see the prepared statement errors in our logs.
https://github.com/neondatabase/cloud/issues/23008
For TLS between proxy and compute, we are using an internally
provisioned CA to sign the compute certificates. This change ensures
that proxy will load them from a supplied env var pointing to the
correct file - this file and env var will be configured later, using a
kubernetes secret.
Control plane responds with a `server_name` field if and only if the
compute uses TLS. This server name is the name we use to validate the
certificate. Control plane still sends us the IP to connect to as well
(to support overlay IP).
To support this change, I'd had to split `host` and `host_addr` into
separate fields. Using `host_addr` and bypassing `lookup_addr` if
possible (which is what happens in production). `host` then is only used
for the TLS connection.
There's no blocker to merging this. The code paths will not be triggered
until the new control plane is deployed and the `enableTLS` compute flag
is enabled on a project.
This upgrades the `proxy/` crate as well as the forked libraries in
`libs/proxy/` to edition 2024.
Also reformats the imports of those forked libraries via:
```
cargo +nightly fmt -p proxy -p postgres-protocol2 -p postgres-types2 -p tokio-postgres2 -- -l --config imports_granularity=Module,group_imports=StdExternalCrate,reorder_imports=true
```
It can be read commit-by-commit: the first commit has no formatting
changes, only changes to accomodate the new edition.
Part of #10918
ref: https://github.com/neondatabase/cloud/issues/23385
Adds a direction flag as well as private-link ID to the traffic
reporting pipeline. We do not yet actually count ingress, but we include
the flag anyway.
I have additionally moved vpce_id string parsing earlier, since we
expect it to be utf8 (ascii).
## Problem
`discard all` cannot run in a transaction (even if implicit)
## Summary of changes
Split up the query into two, we don't need transaction support.
I was looking into
https://github.com/neondatabase/serverless/issues/144, I recall previous
cases where proxy would trigger these prepared statements which would
conflict with other statements prepared by our client downstream.
Because of that, and also to aid in debugging, I've made sure all
prepared statements that proxy needs to make have specific names that
likely won't conflict and makes it clear in a error log if it's our
statements that are causing issues
Avoids compiling the crate and its dependencies into binaries that don't
need them. Shrinks the compute_ctl binary from about 31MB to 28MB in the
release-line-debug-size-lto profile.
- Wired up filtering on VPC endpoints
- Wired up block access from public internet / VPC depending on per
project flag
- Added cache invalidation for VPC endpoints (partially based on PR from
Raphael)
- Removed BackendIpAllowlist trait
---------
Co-authored-by: Ivan Efremov <ivan@neon.tech>
## Problem
Because dashmap 6 switched to hashbrown RawTable API, it required us to
use unsafe code in the upgrade:
https://github.com/neondatabase/neon/pull/8107
## Summary of changes
Switch to clashmap, a fork maintained by me which removes much of the
unsafe and ultimately switches to HashTable instead of RawTable to
remove much of the unsafe requirement on us.
## Problem
The approach of having CancelMap as an in-memory structure increases
code complexity,
as well as putting additional load for Redis streams.
## Summary of changes
- Implement a set of KV ops for Redis client;
- Remove cancel notifications code;
- Send KV ops over the bounded channel to the handling background task
for removing and adding the cancel keys.
Closes#9660
Generally ed25519 seems to be much preferred for cryptographic strength
to P256 nowadays, and it is NIST approved finally. We should use it
where we can as it's also faster than p256.
This PR makes the re-signed JWTs between local_proxy and pg_session_jwt
use ed25519.
This does introduce a new dependency on ed25519, but I do recall some
Neon Authorise customers asking for support for ed25519, so I am
justifying this dependency addition in the context that we can then
introduce support for customer ed25519 keys
sources:
* https://csrc.nist.gov/pubs/fips/186-5/final subsection 7 (EdDSA)
* https://datatracker.ietf.org/doc/html/rfc8037#section-3.1
Now that we construct the TLS client config for cancellation as well as
connect, it feels appropriate to construct the same config once and
re-use it elsewhere. It might also help should #7500 require any extra
setup, so we can easily add it to all the appropriate call sites.
As the title says, I updated the lint rules to no longer allow unwrap or
unimplemented.
Three special cases:
* Tests are allowed to use them
* std::sync::Mutex lock().unwrap() is common because it's usually
correct to continue panicking on poison
* `tokio::spawn_blocking(...).await.unwrap()` is common because it will
only error if the blocking fn panics, so continuing the panic is also
correct
I've introduced two extension traits to help with these last two, that
are a bit more explicit so they don't need an expect message every time.
## Problem
While reviewing #10152 I found it tricky to actually determine whether
the connection used `allow_self_signed_compute` or not.
I've tried to remove this setting in the past:
* https://github.com/neondatabase/neon/pull/7884
* https://github.com/neondatabase/neon/pull/7437
* https://github.com/neondatabase/cloud/pull/13702
But each time it seems it is used by e2e tests
## Summary of changes
The `node_info.allow_self_signed_computes` is always initialised to
false, and then sometimes inherits the proxy config value. There's no
need this needs to be in the node_info, so removing it and propagating
it via `TcpMechansim` is simpler.
(stacked on #9990 and #9995)
Partially fixes#1287 with a custom option field to enable the fixed
behaviour. This allows us to gradually roll out the fix without silently
changing the observed behaviour for our customers.
related to https://github.com/neondatabase/cloud/issues/15284
Keeping the `mock` postgres cplane adaptor using "stock" tokio-postgres
allows us to remove a lot of dead weight from our actual postgres
connection logic.
Our rust-postgres fork is getting messy. Mostly because proxy wants more
control over the raw protocol than tokio-postgres provides. As such,
it's diverging more and more. Storage and compute also make use of
rust-postgres, but in more normal usage, thus they don't need our crazy
changes.
Idea:
* proxy maintains their subset
* other teams use a minimal patch set against upstream rust-postgres
Reviewing this code will be difficult. To implement it, I
1. Copied tokio-postgres, postgres-protocol and postgres-types from
00940fcdb5
2. Updated their package names with the `2` suffix to make them compile
in the workspace.
3. Updated proxy to use those packages
4. Copied in the code from tokio-postgres-rustls 0.13 (with some patches
applied https://github.com/jbg/tokio-postgres-rustls/pull/32https://github.com/jbg/tokio-postgres-rustls/pull/33)
5. Removed as much dead code as I could find in the vendored libraries
6. Updated the tokio-postgres-rustls code to use our existing channel
binding implementation
## Problem
For cancellation, a connection is open during all the cancel checks.
## Summary of changes
Spawn cancellation checks in the background, and close connection
immediately.
Use task_tracker for cancellation checks.
Fixes the masking for the CancelKeyData display format. Due to negative
i32 cast to u64, the top-bits all had `0xffffffff` prefix. On the
bitwise-or that followed, these took priority.
This PR also compresses 3 logs during sql-over-http into 1 log with
durations as label fields, as prior discussed.
## Problem
```
curl -H "Neon-Connection-String: postgresql://neondb_owner:PASSWORD@ep-autumn-rain-a58lubg0.us-east-2.aws.neon.tech/neondb?sslmode=require" https://ep-autumn-rain-a58lubg0.us-east-2.aws.neon.tech/sql -d '{"query":"SELECT 1","params":[]}'
```
For such a query, I also need to send `params`. Do I really need it?
## Summary of changes
I've marked `params` as optional
Follow up to #9803
See https://github.com/neondatabase/cloud/issues/14378
In collaboration with @cloneable and @awarus, we sifted through logs and
simply demoted some logs to debug. This is not at all finished and there
are more logs to review, but we ran out of time in the session we
organised. In any slightly more nuanced cases, we didn't touch the log,
instead leaving a TODO comment.
I've also slightly refactored the sql-over-http body read/length reject
code. I can split that into a separate PR. It just felt natural after I
switched to `read_body_with_limit` as we discussed during the meet.
## Problem
It is called context/ctx everywhere and the Monitoring suffix needlessly
confuses with proper monitoring code.
## Summary of changes
* Rename RequestMonitoring to RequestContext
* Rename RequestMonitoringInner to RequestContextInner
While setting up some tests, I noticed that we didn't support keycloak.
They make use of encryption JWKs as well as signature ones. Our current
jwks crate does not support parsing encryption keys which caused the
entire jwk set to fail to parse. Switching to lazy parsing fixes this.
Also while setting up tests, I couldn't use localhost jwks server as we
require HTTPS and we were using webpki so it was impossible to add a
custom CA. Enabling native roots addresses this possibility.
I saw some of our current e2e tests against our custom JWKS in s3 were
taking a while to fetch. I've added a timeout + retries to address this.
The overall idea of the PR is to rename a few types to make their
purpose more clear, reduce abstraction where not needed, and move types
to to more better suited modules.
The PROXY Protocol V2 offers a "command" concept. It can be of two
different values. "Local" and "Proxy". The spec suggests that "Local" be
used for health-checks. We can thus use this to silence logging for such
health checks such as those from NLB.
This additionally refactors the flow to be a bit more type-safe, self
documenting and using zerocopy deser.
Unify client, EndpointConnPool and DbUserConnPool for remote and local
conn.
- Use new ClientDataEnum for additional client data.
- Add ClientInnerCommon client structure.
- Remove Client and EndpointConnPool code from local_conn_pool.rs
* Also rename `AuthFailed` variant to `PasswordFailed`.
* Before this all JWT errors end up in `AuthError::AuthFailed()`,
expects a username and also causes cache invalidation.
AWS/azure private link shares extra information in the "TLV" values of
the proxy protocol v2 header. This code doesn't action on it, but it
parses it as appropriate.