Commit Graph

86 Commits

Author SHA1 Message Date
Conrad Ludgate
49be446d95 async password validation (#7171)
## Problem

password hashing can block main thread

## Summary of changes

spawn_blocking the password hash call
2024-03-18 23:57:32 +01:00
Anna Khanova
b0aff04157 proxy: add new dimension to exclude cplane latency (#7011)
## Problem

Currently cplane communication is a part of the latency monitoring. It
doesn't allow to setup the proper alerting based on proxy latency.

## Summary of changes

Added dimension to exclude cplane latency.
2024-03-13 13:50:05 +01:00
Anna Khanova
15b3665dc4 proxy: fix bug with populating the data (#7023)
## Problem

Branch/project and coldStart were not populated to data events.

## Summary of changes

Populate it. Also added logging for the coldstart info.
2024-03-05 15:32:58 +00:00
Conrad Ludgate
48957e23b7 proxy: refactor span usage (#6946)
## Problem

Hard to find error reasons by endpoint for HTTP flow.

## Summary of changes

I want all root spans to have session id and endpoint id. I want all
root spans to be consistent.
2024-02-28 17:10:07 +04:00
Conrad Ludgate
60e5a56a5a proxy: include client IP in ip deny message (#6854)
## Problem

Debugging IP deny errors is difficult for our users

## Summary of changes

Include the client IP in the deny message
2024-02-21 18:24:59 +01:00
Anna Khanova
fac50a6264 Proxy refactor auth+connect (#6708)
## Problem

Not really a problem, just refactoring.

## Summary of changes

Separate authenticate from wake compute.

Do not call wake compute second time if we managed to connect to
postgres or if we got it not from cache.
2024-02-12 18:41:02 +00:00
Conrad Ludgate
98ec5c5c46 proxy: some more parquet data (#6711)
## Summary of changes

add auth_method and database to the parquet logs
2024-02-12 13:14:06 +00:00
Conrad Ludgate
96d89cde51 Proxy error reworking (#6453)
## Problem

Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and
doing a bit less radical changes. smaller commits.

We currently don't report error classifications in proxy as the current
error handling made it hard to do so.

## Summary of changes

1. Add a `ReportableError` trait that all errors will implement. This
provides the error classification functionality.
2. Handle Client requests a strongly typed error
    * this error is a `ReportableError` and is logged appropriately
3. The handle client error only has a few possible error types, to
account for the fact that at this point errors should be returned to the
user.
2024-02-09 15:50:51 +00:00
Anna Khanova
c63e3e7e84 Proxy: improve http-pool (#6577)
## Problem

The password check logic for the sql-over-http is a bit non-intuitive. 

## Summary of changes

1. Perform scram auth using the same logic as for websocket cleartext
password.
2. Split establish connection logic and connection pool.
3. Parallelize param parsing logic with authentication + wake compute.
4. Limit the total number of clients
2024-02-08 12:57:05 +01:00
Conrad Ludgate
6506fd14c4 proxy: more refactors (#6526)
## Problem

not really any problem, just some drive-by changes

## Summary of changes

1. move wake compute
2. move json processing
3. move handle_try_wake
4. move test backend to api provider
5. reduce wake-compute concerns
6. remove duplicate wake-compute loop
2024-02-02 16:07:35 +00:00
Anna Khanova
271133d960 Proxy: reduce number of get role secret calls (#6557)
## Problem

Right now if get_role_secret response wasn't cached (e.g. cache already
reached max size) it will send the second (exactly the same request).

## Summary of changes

Avoid needless request.
2024-01-31 22:16:56 +00:00
Conrad Ludgate
ec8dcc2231 flatten proxy flow (#6447)
## Problem

Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and
doing a bit less radical changes. smaller commits.

Proxy flow was quite deeply nested, which makes adding more interesting
error handling quite tricky.

## Summary of changes

I recommend reviewing commit by commit.

1. move handshake logic into a separate file
2. move passthrough logic into a separate file
3. no longer accept a closure in CancelMap session logic
4. Remove connect_to_db, copy logic into handle_client
5. flatten auth_and_wake_compute in authenticate
6. record info for link auth
2024-01-29 17:38:03 +00:00
Anna Khanova
8253cf1931 proxy: Relax endpoint check (#6503)
## Problem

http-over-sql allowes host to be in format api.aws.... however it's not
the case for the websocket flow.

## Summary of changes

Relax endpoint check for the ws serverless connections.
2024-01-28 21:27:14 +00:00
Conrad Ludgate
210700d0d9 proxy: add newtype wrappers for string based IDs (#6445)
## Problem

too many string based IDs. easy to mix up ID types.

## Summary of changes

Add a bunch of `SmolStr` wrappers that provide convenience methods but
are type safe
2024-01-24 16:38:10 +00:00
Conrad Ludgate
e03f8abba9 eager parsing of ip addr (#6446)
## Problem

Parsing the IP address at check time is a little wasteful. 

## Summary of changes

Parse the IP when we get it from cplane. Adding a `None` variant to
still allow malformed patterns
2024-01-23 13:25:01 +00:00
Anna Khanova
1905f0bced proxy: store role not found in cache (#6439)
## Problem

There are a lot of responses with 404 role not found error, which are
not getting cached in proxy.

## Summary of changes

If there was returned an empty secret but with the project_id, store it
in cache.
2024-01-23 13:15:05 +01:00
Conrad Ludgate
34ddec67d9 proxy small tweaks (#6398)
## Problem

In https://github.com/neondatabase/neon/pull/6283 I did a couple changes
that weren't directly related to the goal of extracting the state
machine, so I'm putting them here

## Summary of changes

- move postgres vs console provider into another enum
- reduce error cases for link auth
- slightly refactor link flow
2024-01-21 09:58:42 +01:00
Conrad Ludgate
551f0cc097 proxy: refactor how neon-options are handled (#6306)
## Problem

HTTP connection pool was not respecting the PitR options.

## Summary of changes

1. refactor neon_options a bit to allow easier access to cache_key
2. make HTTP not go through `StartupMessageParams`
3. expose SNI processing to replace what was removed in step 2.
2024-01-11 14:58:31 +00:00
Anna Khanova
76372ce002 Added auth info cache with notifiations to redis. (#6208)
## Problem

Current cache doesn't support any updates from the cplane.

## Summary of changes

* Added redis notifier listner.
* Added cache which can be invalidated with the notifier. If the
notifier is not available, it's just a normal ttl cache.
* Updated cplane api.

The motivation behind this organization of the data is the following:
* In the Neon data model there are projects. Projects could have
multiple branches and each branch could have more than one endpoint.
* Also there is one special `main` branch.
* Password reset works per branch.
* Allowed IPs are the same for every branch in the project (except,
maybe, the main one).
* The main branch can be changed to the other branch.
* The endpoint can be moved between branches.

Every event described above requires some special processing on the
porxy (or cplane) side.

The idea of invalidating for the project is that whenever one of the
events above is happening with the project, proxy can invalidate all
entries for the entire project.

This approach also requires some additional API change (returning
project_id inside the auth info).
2024-01-10 11:51:05 +00:00
Conrad Ludgate
8a646cb750 proxy: add request context for observability and blocking (#6160)
## Summary of changes

### RequestMonitoring

We want to add an event stream with information on each request for
easier analysis than what we can do with diagnostic logs alone
(https://github.com/neondatabase/cloud/issues/8807). This
RequestMonitoring will keep a record of the final state of a request. On
drop it will be pushed into a queue to be uploaded.

Because this context is a bag of data, I don't want this information to
impact logic of request handling. I personally think that weakly typed
data (such as all these options) makes for spaghetti code. I will
however allow for this data to impact rate-limiting and blocking of
requests, as this does not _really_ change how a request is handled.

### Parquet

Each `RequestMonitoring` is flushed into a channel where it is converted
into `RequestData`, which is accumulated into parquet files. Each file
will have a certain number of rows per row group, and several row groups
will eventually fill up the file, which we then upload to S3.

We will also upload smaller files if they take too long to construct.
2024-01-08 11:42:43 +00:00
Anna Khanova
6e6e40dd7f Invalidate credentials on auth failure (#6171)
## Problem

If the user reset password, cache could receive this information only
after `ttl` minutes.

## Summary of changes

Invalidate password on auth failure.
2023-12-18 23:24:22 +01:00
Anna Khanova
00d90ce76a Added cache for get role secret (#6165)
## Problem

Currently if we are getting many consecutive connections to the same
user/ep we will send a lot of traffic to the console.

## Summary of changes

Cache with ttl=4min proxy_get_role_secret response.

Note: this is the temporary hack, notifier listener is WIP.
2023-12-18 16:04:47 +01:00
Conrad Ludgate
17bde7eda5 proxy refactor large files (#6153)
## Problem

The `src/proxy.rs` file is far too large

## Summary of changes

Creates 3 new files:
```
src/metrics.rs
src/proxy/retry.rs
src/proxy/connect_compute.rs
```
2023-12-18 10:59:49 +00:00
Anna Khanova
9e071e4458 Propagate information about the protocol to console (#6102)
## Problem

In snowflake logs currently there is no information about the protocol,
that the client uses.

## Summary of changes

Propagate the information about the protocol together with the app_name.
In format: `{app_name}/{sql_over_http/tcp/ws}`.

This will give to @stepashka more observability on what our clients are
using.
2023-12-12 11:42:51 +00:00
Andrew Rudenko
df1f8e13c4 proxy: pass neon options in deep object format (#6068)
---------

Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
2023-12-08 19:58:36 +01:00
Conrad Ludgate
699049b8f3 proxy: make auth more type safe (#5689)
## Problem

a5292f7e67/proxy/src/auth/backend.rs (L146-L148)

a5292f7e67/proxy/src/console/provider/neon.rs (L90)

a5292f7e67/proxy/src/console/provider/neon.rs (L154)

## Summary of changes

1. Test backend is only enabled on `cfg(test)`.
2. Postgres mock backend + MD5 auth keys are only enabled on
`cfg(feature = testing)`
3. Password hack and cleartext flow will have their passwords validated
before proceeding.
4. Distinguish between ClientCredentials with endpoint and without,
removing many panics in the process
2023-12-08 11:48:37 +00:00
Conrad Ludgate
f39fca0049 proxy: chore: replace strings with SmolStr (#5786)
## Problem

no problem

## Summary of changes

replaces boxstr with arcstr as it's cheaper to clone. mild perf
improvement.

probably should look into other smallstring optimsations tbh, they will
likely be even better. The longest endpoint name I was able to construct
is something like `ep-weathered-wildflower-12345678` which is 32 bytes.
Most string optimisations top out at 23 bytes
2023-11-30 20:52:30 +00:00
Anna Khanova
e12e2681e9 IP allowlist on the proxy side (#5906)
## Problem

Per-project IP allowlist:
https://github.com/neondatabase/cloud/issues/8116

## Summary of changes

Implemented IP filtering on the proxy side. 

To retrieve ip allowlist for all scenarios, added `get_auth_info` call
to the control plane for:
* sql-over-http
* password_hack
* cleartext_hack

Added cache with ttl for sql-over-http path

This might slow down a bit, consider using redis in the future.

---------

Co-authored-by: Conrad Ludgate <conrad@neon.tech>
2023-11-30 13:14:33 +00:00
Conrad Ludgate
316309c85b channel binding (#5683)
## Problem

channel binding protects scram from sophisticated MITM attacks where the
attacker is able to produce 'valid' TLS certificates.

## Summary of changes

get the tls-server-end-point channel binding, and verify it is correct
for the SCRAM-SHA-256-PLUS authentication flow
2023-11-27 21:45:15 +00:00
khanova
6b82f22ada Collect number of connections by sni type (#5867)
## Problem

We don't know the number of users with the different kind of
authentication: ["sni", "endpoint in options" (A and B from
[here](https://neon.tech/docs/connect/connection-errors)),
"password_hack"]

## Summary of changes

Collect metrics by sni kind.
2023-11-16 12:19:13 +00:00
Andrew Rudenko
fc47af156f Passing neon options to the console (#5781)
The idea is to pass neon_* prefixed options to control plane. It can be
used by cplane to dynamically create timelines and computes. Such
options also should be excluded from passing to compute. Another issue
is how connection caching is working now, because compute's instance now
depends not only on hostname but probably on such options too I included
them to cache key.
2023-11-07 16:49:26 +01:00
Conrad Ludgate
493b47e1da proxy: exclude client latencies in metrics (#5688)
## Problem

In #5539, I moved the connect_to_compute latency to start counting
before authentication - this is because authentication will perform some
calls to the control plane in order to get credentials and to eagerly
wake a compute server. It felt important to include these times in the
latency metric as these are times we should definitely care about
reducing.

What is not interesting to record in this metric is the roundtrip time
during authentication when we wait for the client to respond.

## Summary of changes

Implement a mechanism to pause the latency timer, resuming on drop of
the pause struct. We pause the timer right before we send the
authentication message to the client, and we resume the timer right
after we complete the authentication flow.
2023-10-27 17:17:39 +00:00
Conrad Ludgate
b2c96047d0 move wake compute after the auth quirks logic (#5642)
## Problem

https://github.com/neondatabase/neon/issues/5568#issuecomment-1777015606

## Summary of changes

Make the auth_quirks_creds return the authentication information, and
push the wake_compute loop to after, inside `auth_quirks`
2023-10-25 08:30:47 +01:00
khanova
b514da90cb Set up timeout for scram protocol execution (#5551)
## Problem
Context:
https://github.com/neondatabase/neon/issues/5511#issuecomment-1759649679

Some of out scram protocol execution timed out only after 17 minutes. 
## Summary of changes
Make timeout for scram execution meaningful and configurable.
2023-10-23 15:11:05 +01:00
Conrad Ludgate
fd20bbc6cb proxy: log params when no endpoint (#5418)
## Problem

Our SNI error dashboard features IP addresses but it's not immediately
clear who that is still (#5369)

## Summary of changes

Log some startup params with this error
2023-09-29 09:40:27 +01:00
Conrad Ludgate
93dcdb293a proxy: password hack hack (#5126)
## Problem

fixes #4881 

## Summary of changes
2023-08-30 16:20:27 +01:00
Conrad Ludgate
ec10838aa4 proxy: pool connection logs (#5020)
## Problem

Errors and notices that happen during a pooled connection lifecycle have
no session identifiers

## Summary of changes

Using a watch channel, we set the session ID whenever it changes. This
way we can see the status of a connection for that session

Also, adding a connection id to be able to search the entire connection
lifecycle
2023-08-18 11:44:08 +01:00
Conrad Ludgate
3e4710c59e proxy: add more sasl logs (#5012)
## Problem

A customer is having trouble connecting to neon from their production
environment. The logs show a mix of "Internal error" and "authentication
protocol violation" but not the full error

## Summary of changes

Make sure we don't miss any logs during SASL/SCRAM
2023-08-17 12:05:54 +01:00
Conrad Ludgate
0fa85aa08e proxy: delay auth on retry (#4929)
## Problem

When an endpoint is shutting down, it can take a few seconds. Currently
when starting a new compute, this causes an "endpoint is in transition"
error. We need to add delays before retrying to ensure that we allow
time for the endpoint to shutdown properly.

## Summary of changes

Adds a delay before retrying in auth. connect_to_compute already has
this delay
2023-08-08 17:19:24 +03:00
Conrad Ludgate
606caa0c5d proxy: update logs and span data to be consistent and have more info (#4878)
## Problem

Pre-requisites for #4852 and #4853

## Summary of changes

1. Includes the client's IP address (which we already log) with the span
info so we can have it on all associated logs. This makes making
dashboards based on IP addresses easier.
2. Switch to a consistent error/warning log for errors during
connection. This includes error, num_retries, retriable=true/false and a
consistent log message that we can grep for.
2023-08-04 12:37:18 +03:00
Conrad Ludgate
eb78603121 proxy: div by zero (#4845)
## Problem

1. In the CacheInvalid state loop, we weren't checking the
`num_retries`. If this managed to get up to `32`, the retry_after
procedure would compute 2^32 which would overflow to 0 and trigger a div
by zero
2. When fixing the above, I started working on a flow diagram for the
state machine logic and realised it was more complex than it had to be:
  a. We start in a `Cached` state
b. `Cached`: call `connect_once`. After the first connect_once error, we
always move to the `CacheInvalid` state, otherwise, we return the
connection.
c. `CacheInvalid`: we attempt to `wake_compute` and we either switch to
Cached or we retry this step (or we error).
d. `Cached`: call `connect_once`. We either retry this step or we have a
connection (or we error) - After num_retries > 1 we never switch back to
`CacheInvalid`.

## Summary of changes

1. Insert a `num_retries` check in the `handle_try_wake` procedure. Also
using floats in the retry_after procedure to prevent the overflow
entirely
2. Refactor connect_to_compute to be more linear in design.
2023-07-31 09:30:24 -04:00
Alex Chi Z
a8f3540f3d proxy: add unit test for wake_compute (#4819)
## Problem

ref https://github.com/neondatabase/neon/pull/4721, ref
https://github.com/neondatabase/neon/issues/4709

## Summary of changes

This PR adds unit tests for wake_compute.

The patch adds a new variant `Test` to auth backends. When
`wake_compute` is called, we will verify if it is the exact operation
sequence we are expecting. The operation sequence now contains 3 more
operations: `Wake`, `WakeRetry`, and `WakeFail`.

The unit tests for proxy connects are now complete and I'll continue
work on WebSocket e2e test in future PRs.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2023-07-28 19:10:55 -04:00
Conrad Ludgate
231d7a7616 proxy: retry compute wake in auth (#4817)
## Problem

wake_compute can fail sometimes but is eligible for retries. We retry
during the main connect, but not during auth.

## Summary of changes

retry wake_compute during auth flow if there was an error talking to
control plane, or if there was a temporary error in waking the compute
node
2023-07-26 16:34:46 +01:00
Alex Chi Z
407a20ceae add proxy unit tests for retry connections (#4721)
Given now we've refactored `connect_to_compute` as a generic, we can
test it with mock backends. In this PR, we mock the error API and
connect_once API to test the retry behavior of `connect_to_compute`. In
the next PR, I'll add mock for credentials so that we can also test
behavior with `wake_compute`.

ref https://github.com/neondatabase/neon/issues/4709

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2023-07-24 20:41:42 +03:00
Arseny Sher
c200ebc096 proxy: log endpoint name everywhere.
Checking out proxy logs for the endpoint is a frequent (often first) operation
during user issues investigation; let's remove endpoint id -> session id mapping
annoying extra step here.
2023-05-24 09:11:23 +04:00
Vadim Kharitonov
4f64be4a98 Add endpoint to connection string 2023-05-15 23:45:04 +02:00
Stas Kelvich
4ac6a9f089 add backward compatibility to proxy 2023-04-28 17:15:43 +03:00
Stas Kelvich
9486d76b2a Add tests for link auth to compute connection 2023-04-28 17:15:43 +03:00
Stas Kelvich
645e4f6ab9 use TLS in link proxy 2023-04-28 17:15:43 +03:00
Stas Kelvich
c3ca48c62b Support extra domain names for proxy.
Make it possible to specify directory where proxy will look up for
extra certificates. Proxy will iterate through subdirs of that directory
and load `key.pem` and `cert.pem` files from each subdir. Certs directory
structure may look like that:

  certs
  |--example.com
  |  |--key.pem
  |  |--cert.pem
  |--foo.bar
     |--key.pem
     |--cert.pem

Actual domain names are taken from certs and key, subdir names are
ignored.
2023-04-05 20:06:48 +03:00