Commit Graph

675 Commits

Author SHA1 Message Date
Conrad Ludgate
8ab96cc71f chore(proxy/jwks): reduce the rightward drift of jwks renewal (#9853)
I found the rightward drift of the `renew_jwks` function hard to review.

This PR splits out some major logic and uses early returns to make the
happy path more linear.
2024-11-22 14:51:32 +00:00
Conrad Ludgate
725a5ff003 fix(proxy): CancelKeyData display log masking (#9838)
Fixes the masking for the CancelKeyData display format. Due to negative
i32 cast to u64, the top-bits all had `0xffffffff` prefix. On the
bitwise-or that followed, these took priority.

This PR also compresses 3 logs during sql-over-http into 1 log with
durations as label fields, as prior discussed.
2024-11-21 16:46:30 +00:00
Ivan Efremov
2d6bf176a0 proxy: Refactor http conn pool (#9785)
- Use the same ConnPoolEntry for http connection pool.
- Rename EndpointConnPool to the HttpConnPool.
- Narrow clone bound for client

Fixes #9284
2024-11-20 19:36:29 +00:00
Vadim Kharitonov
313ebfdb88 [proxy] chore: allow bypassing empty params to /sql endpoint (#9827)
## Problem

```
curl -H "Neon-Connection-String: postgresql://neondb_owner:PASSWORD@ep-autumn-rain-a58lubg0.us-east-2.aws.neon.tech/neondb?sslmode=require" https://ep-autumn-rain-a58lubg0.us-east-2.aws.neon.tech/sql -d '{"query":"SELECT 1","params":[]}'
```

For such a query, I also need to send `params`. Do I really need it?

## Summary of changes
I've marked `params` as optional
2024-11-20 19:36:23 +00:00
Conrad Ludgate
f36f0068b8 chore(proxy): demote more logs during successful connection attempts (#9828)
Follow up to #9803 

See https://github.com/neondatabase/cloud/issues/14378

In collaboration with @cloneable and @awarus, we sifted through logs and
simply demoted some logs to debug. This is not at all finished and there
are more logs to review, but we ran out of time in the session we
organised. In any slightly more nuanced cases, we didn't touch the log,
instead leaving a TODO comment.

I've also slightly refactored the sql-over-http body read/length reject
code. I can split that into a separate PR. It just felt natural after I
switched to `read_body_with_limit` as we discussed during the meet.
2024-11-20 17:50:39 +00:00
Folke Behrens
bf7d859a8b proxy: Rename RequestMonitoring to RequestContext (#9805)
## Problem

It is called context/ctx everywhere and the Monitoring suffix needlessly
confuses with proper monitoring code.

## Summary of changes

* Rename RequestMonitoring to RequestContext
* Rename RequestMonitoringInner to RequestContextInner
2024-11-20 12:50:36 +00:00
Conrad Ludgate
3ae0b2149e chore(proxy): demote a ton of logs for successful connection attempts (#9803)
See https://github.com/neondatabase/cloud/issues/14378

In collaboration with @cloneable and @awarus, we sifted through logs and
simply demoted some logs to debug. This is not at all finished and there
are more logs to review, but we ran out of time in the session we
organised. In any slightly more nuanced cases, we didn't touch the log,
instead leaving a TODO comment.
2024-11-20 10:14:28 +00:00
Conrad Ludgate
191f745c81 fix(proxy/auth_broker): ignore -pooler suffix (#9800)
Fixes https://github.com/neondatabase/cloud/issues/20400

We cannot mix local_proxy and pgbouncer, so we are filtering out the
`-pooler` suffix prior to calling wake_compute.
2024-11-19 13:58:26 +00:00
Conrad Ludgate
37b97b3a68 chore(local_proxy): reduce some startup logging (#9798)
Currently, local_proxy will write an error log if it doesn't find the
config file. This is expected for startup, so it's just noise. It is an
error if we do receive an explicit SIGHUP though.

I've also demoted the build info logs to be debug level. We don't need
them in the compute image since we have other ways to determine what
code is running.

Lastly, I've demoted SIGHUP signal handling from warn to info, since
it's not really a warning event.

See https://github.com/neondatabase/cloud/issues/10880 for more details
2024-11-19 13:58:11 +00:00
Heikki Linnakangas
79929bb1b6 Disable rust_2024_compatibility lint option (#9615)
Compiling with nightly rust compiler, I'm getting a lot of errors like
this:

    error: `if let` assigns a shorter lifetime since Edition 2024
       --> proxy/src/auth/backend/jwt.rs:226:16
        |
    226 |             if let Some(permit) = self.try_acquire_permit() {
        |                ^^^^^^^^^^^^^^^^^^^-------------------------
        |                                   |
| this value has a significant drop implementation which may observe a
major change in drop order and requires your discretion
        |
        = warning: this changes meaning in Rust 2024
= note: for more information, see issue #124085
<https://github.com/rust-lang/rust/issues/124085>
    help: the value is now dropped here in Edition 2024
       --> proxy/src/auth/backend/jwt.rs:241:13
        |
    241 |             } else {
        |             ^
    note: the lint level is defined here
       --> proxy/src/lib.rs:8:5
        |
    8   |     rust_2024_compatibility
        |     ^^^^^^^^^^^^^^^^^^^^^^^
= note: `#[deny(if_let_rescope)]` implied by
`#[deny(rust_2024_compatibility)]`

and this:

error: these values and local bindings have significant drop
implementation that will have a different drop order from that of
Edition 2021
       --> proxy/src/auth/backend/jwt.rs:376:18
        |
    369 |         let client = Client::builder()
| ------ these values have significant drop implementation and will
observe changes in drop order under Edition 2024
    ...
    376 |             map: DashMap::default(),
        |                  ^^^^^^^^^^^^^^^^^^
        |
        = warning: this changes meaning in Rust 2024
= note: for more information, see issue #123739
<https://github.com/rust-lang/rust/issues/123739>
= note: `#[deny(tail_expr_drop_order)]` implied by
`#[deny(rust_2024_compatibility)]`

They are caused by the `rust_2024_compatibility` lint option.

When we actually switch to the 2024 edition, it makes sense to go
through all these and check that the drop order changes don't break
anything, but in the meanwhile, there's no easy way to avoid these
errors. Disable it, to allow compiling with nightly again.

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-11-08 08:35:03 +00:00
Conrad Ludgate
82e3f0ecba [proxy/authorize]: improve JWKS reliability (#9676)
While setting up some tests, I noticed that we didn't support keycloak.
They make use of encryption JWKs as well as signature ones. Our current
jwks crate does not support parsing encryption keys which caused the
entire jwk set to fail to parse. Switching to lazy parsing fixes this.

Also while setting up tests, I couldn't use localhost jwks server as we
require HTTPS and we were using webpki so it was impossible to add a
custom CA. Enabling native roots addresses this possibility.

I saw some of our current e2e tests against our custom JWKS in s3 were
taking a while to fetch. I've added a timeout + retries to address this.
2024-11-07 16:24:38 +00:00
Conrad Ludgate
73bdc9a2d0 [proxy]: minor changes to endpoint-cache handling (#9666)
I think I meant to make these changes over 6 months ago. alas, better
late than never.

1. should_reject doesn't eagerly intern the endpoint string
2. Rate limiter uses a std Mutex instead of a tokio Mutex.
3. Recently I introduced a `-local-proxy` endpoint suffix. I forgot to
add this to normalize.
4. Random but a small cleanup making the ControlPlaneEvent deser
directly to the interned strings.
2024-11-06 17:40:40 +00:00
Folke Behrens
bdd492b1d8 proxy: Replace "web(auth)" with "console redirect" everywhere (#9655) 2024-11-06 11:03:38 +00:00
Folke Behrens
5d8284c7fe proxy: Read cplane JWT with clap arg (#9654) 2024-11-06 10:27:55 +00:00
Folke Behrens
ebc43efebc proxy: Refactor cplane types (#9643)
The overall idea of the PR is to rename a few types to make their
purpose more clear, reduce abstraction where not needed, and move types
to to more better suited modules.
2024-11-05 23:03:53 +01:00
Folke Behrens
754d2950a3 proxy: Revert ControlPlaneEvent back to struct (#9649)
Due to neondatabase/cloud#19815 we need to be more tolerant when reading
events.
2024-11-05 21:32:33 +00:00
Conrad Ludgate
fcde40d600 [proxy] use the proxy protocol v2 command to silence some logs (#9620)
The PROXY Protocol V2 offers a "command" concept. It can be of two
different values. "Local" and "Proxy". The spec suggests that "Local" be
used for health-checks. We can thus use this to silence logging for such
health checks such as those from NLB.

This additionally refactors the flow to be a bit more type-safe, self
documenting and using zerocopy deser.
2024-11-05 17:23:00 +00:00
Ivan Efremov
2f1a56c8f9 proxy: Unify local and remote conn pool client structures (#9604)
Unify client, EndpointConnPool and DbUserConnPool for remote and local
conn.
- Use new ClientDataEnum for additional client data.
- Add ClientInnerCommon client structure.
- Remove Client and EndpointConnPool code from local_conn_pool.rs
2024-11-05 17:33:41 +02:00
Folke Behrens
1085fe57d3 proxy: Rewrite ControlPlaneEvent as enum (#9627) 2024-11-04 20:19:26 +01:00
Folke Behrens
59879985b4 proxy: Wrap JWT errors in separate AuthError variant (#9625)
* Also rename `AuthFailed` variant to `PasswordFailed`.
* Before this all JWT errors end up in `AuthError::AuthFailed()`,
  expects a username and also causes cache invalidation.
2024-11-04 19:56:40 +01:00
Conrad Ludgate
81d1bb1941 quieten aws_config logs (#9626)
logs during aws authentication are soooo noisy in staging 🙃
2024-11-04 17:28:10 +00:00
Conrad Ludgate
8ad1dbce72 [proxy]: parse proxy protocol TLVs with aws/azure support (#9610)
AWS/azure private link shares extra information in the "TLV" values of
the proxy protocol v2 header. This code doesn't action on it, but it
parses it as appropriate.
2024-11-04 14:04:56 +00:00
Conrad Ludgate
3dcdbcc34d remove aws-lc-rs dep and fix storage_broker tls (#9613)
It seems the ecosystem is not so keen on moving to aws-lc-rs as it's
build setup is more complicated than ring (requiring cmake).

Eventually I expect the ecosystem should pivot to
https://github.com/ctz/graviola/tree/main/rustls-graviola as it
stabilises (it has a very simply build step and license), but for now
let's try not have a headache of juggling two crypto libs.

I also noticed that tonic will just fail with tls without a default
provider, so I added some defensive code for that.
2024-11-04 13:29:13 +00:00
Conrad Ludgate
897cffb9d8 auth_broker: fix local_proxy conn count (#9593)
our current metrics for http pool opened connections is always negative
:D oops
2024-10-31 14:57:55 +00:00
Jakub Kołodziejczak
57499640c5 proxy: more granular http status codes for sql-over-http errors (#9549)
closes #9532
2024-10-29 15:44:45 +00:00
Conrad Ludgate
47c35f67c3 [proxy]: fix JWT handling for AWS cognito. (#9536)
In the base64 payload of an aws cognito jwt, I saw the following:

```
"iss":"https:\/\/cognito-idp.us-west-2.amazonaws.com\/us-west-2_redacted"
```

issuers are supposed to be URLs, and URLs are always valid un-escaped
JSON. However, `\/` is a valid escape character so what AWS is doing is
technically correct... sigh...

This PR refactors the test suite and adds a new regression test for
cognito.
2024-10-29 11:01:09 +00:00
Conrad Ludgate
25f1e5cfeb [proxy] demote warnings and remove dead-argument (#9512)
fixes https://github.com/neondatabase/cloud/issues/19000
2024-10-28 15:02:20 +00:00
Conrad Ludgate
dbadb0f9bb proxy: propagate session IDs (#9509)
fixes #9367 by sending session IDs to local_proxy, and also returns
session IDs to the client for easier debugging.
2024-10-25 14:34:19 +00:00
Jakub Kołodziejczak
9768f09f6b proxy: don't follow redirects for user provided JWKS urls + set custom user agent (#9514)
partially fixes https://github.com/neondatabase/cloud/issues/19249

ref https://docs.rs/reqwest/latest/reqwest/redirect/index.html
> By default, a Client will automatically handle HTTP redirects, having
a maximum redirect chain of 10 hops. To customize this behavior, a
redirect::Policy can be used with a ClientBuilder.
2024-10-25 14:04:41 +02:00
Folke Behrens
92d5e0e87a proxy: clear lib.rs of code items (#9479)
We keep lib.rs for crate configs, lint configs and re-exports for the binaries.
2024-10-23 08:21:28 +02:00
Ivan Efremov
2dcac94194 proxy: Use common error interface for error handling with cplane (#9454)
- Remove obsolete error handles.
- Use one source of truth for cplane errors.
#18468
2024-10-21 17:20:09 +03:00
Conrad Ludgate
cc25ef7342 bump pg-session-jwt version (#9455)
forgot to bump this before
2024-10-20 14:42:50 +02:00
Conrad Ludgate
5cbdec9c79 [local_proxy]: install pg_session_jwt extension on demand (#9370)
Follow up on #9344. We want to install the extension automatically. We
didn't want to couple the extension into compute_ctl so instead
local_proxy is the one to issue requests specific to the extension.

depends on #9344 and #9395
2024-10-18 14:41:21 +01:00
Conrad Ludgate
b8304f90d6 2024 oct new clippy lints (#9448)
Fixes new lints from `cargo +nightly clippy` (`clippy 0.1.83 (798fb83f
2024-10-16)`)
2024-10-18 10:27:50 +01:00
Conrad Ludgate
d762ad0883 update rustls (#9396)
The forever ongoing effort of juggling multiple versions of rustls :3

now with new crypto library aws-lc.

Because of dependencies, it is currently impossible to not have both
ring and aws-lc in the dep tree, therefore our only options are not
updating rustls or having both crypto backends enabled...

According to benchmarks run by the rustls maintainer, aws-lc is faster
than ring in some cases too <https://jbp.io/graviola/>, so it's not
without its upsides,
2024-10-17 20:45:37 +01:00
Ivan Efremov
22d8834474 proxy: move the connection pools to separate file (#9398)
First PR for #9284
Start unification of the client and connection pool interfaces:
- Exclude the 'global_connections_count' out from the get_conn_entry()
- Move remote connection pools to the conn_pool_lib as a reference
- Unify clients among all the conn pools
2024-10-17 13:38:24 +03:00
Folke Behrens
ed694732e7 proxy: merge AuthError and AuthErrorImpl (#9418)
Since GetAuthInfoError now boxes the ControlPlaneError message the
variant is not big anymore and AuthError is 32 bytes.
2024-10-16 19:10:49 +02:00
Folke Behrens
f14e45f0ce proxy: format imports with nightly rustfmt (#9414)
```shell
cargo +nightly fmt -p proxy -- -l --config imports_granularity=Module,group_imports=StdExternalCrate,reorder_imports=true
```

These rust-analyzer settings for VSCode should help retain this style:
```json
  "rust-analyzer.imports.group.enable": true,
  "rust-analyzer.imports.prefix": "crate",
  "rust-analyzer.imports.merge.glob": false,
  "rust-analyzer.imports.granularity.group": "module",
  "rust-analyzer.imports.granularity.enforce": true,
```
2024-10-16 15:01:56 +02:00
Folke Behrens
fb74c21e8c proxy: Migrate jwt module away from anyhow (#9361) 2024-10-15 15:24:56 +02:00
Conrad Ludgate
d92d36a315 [local_proxy] update api for pg_session_jwt (#9359)
pg_session_jwt now:
1. Sets the JWK in a PGU_BACKEND session guc, no longer in the init()
function.
2. JWK no longer needs the kid.
2024-10-15 12:13:57 +00:00
John Spray
73c6626b38 pageserver: stabilize & refine controller scale test (#8971)
## Problem

We were seeing timeouts on migrations in this test.

The test unfortunately tends to saturate local storage, which is shared
between the pageservers and the control plane database, which makes the
test kind of unrealistic. We will also want to increase the scale of
this test, so it's worth fixing that.

## Summary of changes

- Instead of randomly creating timelines at the same time as the other
background operations, explicitly identify a subset of tenant which will
have timelines, and create them at the start. This avoids pageservers
putting a lot of load on the test node during the main body of the test.
- Adjust the tenants created to create some number of 8 shard tenants
and the rest 1 shard tenants, instead of just creating a lot of 2 shard
tenants.
- Use archival_config to exercise tenant-mutating operations, instead of
using timeline creation for this.
- Adjust reconcile_until_idle calls to avoid waiting 5 seconds between
calls, which causes timelines with large shard count tenants.
- Fix a pageserver bug where calls to archival_config during activation
get 404
2024-10-15 09:31:18 +01:00
Conrad Ludgate
cb9ab7463c proxy: split out the console-redirect backend flow (#9270)
removes the ConsoleRedirect backend from the main auth::Backends enum,
copy-paste the existing crate::proxy::task_main structure to use the
ConsoleRedirectBackend exclusively.

This makes the logic a bit simpler at the cost of some fairly trivial
code duplication.
2024-10-14 12:25:55 +02:00
Conrad Ludgate
ab5bbb445b proxy: refactor auth backends (#9271)
preliminary for #9270 

The auth::Backend didn't need to be in the mega ProxyConfig object, so I
split it off and passed it manually in the few places it was necessary.

I've also refined some of the uses of config I saw while doing this
small refactor.

I've also followed the trend and make the console redirect backend it's
own struct, same as LocalBackend and ControlPlaneBackend.
2024-10-11 20:14:52 +01:00
Folke Behrens
6baf1aae33 proxy: Demote some errors to warnings in logs (#9354) 2024-10-11 11:29:08 +02:00
Ivan Efremov
b2ecbf3e80 Introduce "quota" ErrorKind (#9300)
## Problem
Fixes #8340
## Summary of changes
Introduced ErrorKind::quota to handle quota-related errors
## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2024-10-11 10:45:55 +03:00
Conrad Ludgate
306094a87d add local-proxy suffix to wake-compute requests, respect the returned port (#9298)
https://github.com/neondatabase/cloud/issues/18349

Use the `-local-proxy` suffix to make sure we get the 10432 local_proxy
port back from cplane.
2024-10-09 22:43:35 +01:00
Conrad Ludgate
75434060a5 local_proxy: integrate with pg_session_jwt extension (#9086) 2024-10-09 18:24:10 +01:00
Folke Behrens
54d1185789 proxy: Unalias hyper1 and replace one use of hyper0 in test (#9324)
Leaves one final use of hyper0 in proxy for the health service,
which requires some coordinated effort with other services.
2024-10-09 12:44:17 +02:00
Folke Behrens
ad267d849f proxy: Move module base files into module directory (#9297) 2024-10-07 16:25:34 +02:00
Conrad Ludgate
8cd7b5bf54 proxy: rename console -> control_plane, rename web -> console_redirect (#9266)
rename console -> control_plane
rename web -> console_redirect

I think these names are a little more representative.
2024-10-07 14:09:54 +01:00