rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-04 12:40:37 +00:00

Author	SHA1	Message	Date
Conrad Ludgate	2f0e3428bb	no error string	2024-01-22 15:00:45 +00:00
Conrad Ludgate	3928d3bc8a	eager parsing of ip addr	2024-01-22 14:53:50 +00:00
Conrad Ludgate	34ddec67d9	proxy small tweaks (#6398 ) ## Problem In https://github.com/neondatabase/neon/pull/6283 I did a couple changes that weren't directly related to the goal of extracting the state machine, so I'm putting them here ## Summary of changes - move postgres vs console provider into another enum - reduce error cases for link auth - slightly refactor link flow	2024-01-21 09:58:42 +01:00
Conrad Ludgate	551f0cc097	proxy: refactor how neon-options are handled (#6306 ) ## Problem HTTP connection pool was not respecting the PitR options. ## Summary of changes 1. refactor neon_options a bit to allow easier access to cache_key 2. make HTTP not go through `StartupMessageParams` 3. expose SNI processing to replace what was removed in step 2.	2024-01-11 14:58:31 +00:00
Anna Khanova	76372ce002	Added auth info cache with notifications to redis. (#6208 ) ## Problem Current cache doesn't support any updates from the cplane. ## Summary of changes * Added redis notifier listener. * Added cache which can be invalidated with the notifier. If the notifier is not available, it's just a normal ttl cache. * Updated cplane api. The motivation behind this organization of the data is the following: * In the Neon data model there are projects. Projects could have multiple branches and each branch could have more than one endpoint. * Also there is one special `main` branch. * Password reset works per branch. * Allowed IPs are the same for every branch in the project (except, maybe, the main one). * The main branch can be changed to the other branch. * The endpoint can be moved between branches. Every event described above requires some special processing on the proxy (or cplane) side. The idea of invalidating for the project is that whenever one of the events above is happening with the project, proxy can invalidate all entries for the entire project. This approach also requires some additional API change (returning project_id inside the auth info).	2024-01-10 11:51:05 +00:00
Conrad Ludgate	8a646cb750	proxy: add request context for observability and blocking (#6160 ) ## Summary of changes ### RequestMonitoring We want to add an event stream with information on each request for easier analysis than what we can do with diagnostic logs alone (https://github.com/neondatabase/cloud/issues/8807). This RequestMonitoring will keep a record of the final state of a request. On drop it will be pushed into a queue to be uploaded. Because this context is a bag of data, I don't want this information to impact logic of request handling. I personally think that weakly typed data (such as all these options) makes for spaghetti code. I will however allow for this data to impact rate-limiting and blocking of requests, as this does not _really_ change how a request is handled. ### Parquet Each `RequestMonitoring` is flushed into a channel where it is converted into `RequestData`, which is accumulated into parquet files. Each file will have a certain number of rows per row group, and several row groups will eventually fill up the file, which we then upload to S3. We will also upload smaller files if they take too long to construct.	2024-01-08 11:42:43 +00:00
Anna Khanova	6e6e40dd7f	Invalidate credentials on auth failure (#6171 ) ## Problem If the user reset password, cache could receive this information only after `ttl` minutes. ## Summary of changes Invalidate password on auth failure.	2023-12-18 23:24:22 +01:00
Anna Khanova	00d90ce76a	Added cache for get role secret (#6165 ) ## Problem Currently if we are getting many consecutive connections to the same user/ep we will send a lot of traffic to the console. ## Summary of changes Cache with ttl=4min proxy_get_role_secret response. Note: this is the temporary hack, notifier listener is WIP.	2023-12-18 16:04:47 +01:00
Conrad Ludgate	17bde7eda5	proxy refactor large files (#6153 ) ## Problem The `src/proxy.rs` file is far too large ## Summary of changes Creates 3 new files: ``` src/metrics.rs src/proxy/retry.rs src/proxy/connect_compute.rs ```	2023-12-18 10:59:49 +00:00
Anna Khanova	9e071e4458	Propagate information about the protocol to console (#6102 ) ## Problem In snowflake logs currently there is no information about the protocol, that the client uses. ## Summary of changes Propagate the information about the protocol together with the app_name. In format: `{app_name}/{sql_over_http/tcp/ws}`. This will give to @stepashka more observability on what our clients are using.	2023-12-12 11:42:51 +00:00
Andrew Rudenko	df1f8e13c4	proxy: pass neon options in deep object format (#6068 ) --------- Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>	2023-12-08 19:58:36 +01:00
Conrad Ludgate	699049b8f3	proxy: make auth more type safe (#5689 ) ## Problem `a5292f7e67/proxy/src/auth/backend.rs (L146-L148)` `a5292f7e67/proxy/src/console/provider/neon.rs (L90)` `a5292f7e67/proxy/src/console/provider/neon.rs (L154)` ## Summary of changes 1. Test backend is only enabled on `cfg(test)`. 2. Postgres mock backend + MD5 auth keys are only enabled on `cfg(feature = testing)` 3. Password hack and cleartext flow will have their passwords validated before proceeding. 4. Distinguish between ClientCredentials with endpoint and without, removing many panics in the process	2023-12-08 11:48:37 +00:00
Conrad Ludgate	f39fca0049	proxy: chore: replace strings with SmolStr (#5786 ) ## Problem no problem ## Summary of changes replaces boxstr with arcstr as it's cheaper to clone. mild perf improvement. probably should look into other smallstring optimisations tbh, they will likely be even better. The longest endpoint name I was able to construct is something like `ep-weathered-wildflower-12345678` which is 32 bytes. Most string optimisations top out at 23 bytes	2023-11-30 20:52:30 +00:00
Anna Khanova	e12e2681e9	IP allowlist on the proxy side (#5906 ) ## Problem Per-project IP allowlist: https://github.com/neondatabase/cloud/issues/8116 ## Summary of changes Implemented IP filtering on the proxy side. To retrieve ip allowlist for all scenarios, added `get_auth_info` call to the control plane for: * sql-over-http * password_hack * cleartext_hack Added cache with ttl for sql-over-http path This might slow down a bit, consider using redis in the future. --------- Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2023-11-30 13:14:33 +00:00
Conrad Ludgate	316309c85b	channel binding (#5683 ) ## Problem channel binding protects scram from sophisticated MITM attacks where the attacker is able to produce 'valid' TLS certificates. ## Summary of changes get the tls-server-end-point channel binding, and verify it is correct for the SCRAM-SHA-256-PLUS authentication flow	2023-11-27 21:45:15 +00:00
khanova	6b82f22ada	Collect number of connections by sni type (#5867 ) ## Problem We don't know the number of users with the different kind of authentication: ["sni", "endpoint in options" (A and B from [here](https://neon.tech/docs/connect/connection-errors)), "password_hack"] ## Summary of changes Collect metrics by sni kind.	2023-11-16 12:19:13 +00:00
Andrew Rudenko	fc47af156f	Passing neon options to the console (#5781 ) The idea is to pass neon_* prefixed options to control plane. It can be used by cplane to dynamically create timelines and computes. Such options also should be excluded from passing to compute. Another issue is how connection caching works now: because a compute instance now depends not only on the hostname but probably on such options too, I included them in the cache key.	2023-11-07 16:49:26 +01:00
Conrad Ludgate	493b47e1da	proxy: exclude client latencies in metrics (#5688 ) ## Problem In #5539, I moved the connect_to_compute latency to start counting before authentication - this is because authentication will perform some calls to the control plane in order to get credentials and to eagerly wake a compute server. It felt important to include these times in the latency metric as these are times we should definitely care about reducing. What is not interesting to record in this metric is the roundtrip time during authentication when we wait for the client to respond. ## Summary of changes Implement a mechanism to pause the latency timer, resuming on drop of the pause struct. We pause the timer right before we send the authentication message to the client, and we resume the timer right after we complete the authentication flow.	2023-10-27 17:17:39 +00:00
Conrad Ludgate	b2c96047d0	move wake compute after the auth quirks logic (#5642 ) ## Problem https://github.com/neondatabase/neon/issues/5568#issuecomment-1777015606 ## Summary of changes Make the auth_quirks_creds return the authentication information, and push the wake_compute loop to after, inside `auth_quirks`	2023-10-25 08:30:47 +01:00
khanova	b514da90cb	Set up timeout for scram protocol execution (#5551 ) ## Problem Context: https://github.com/neondatabase/neon/issues/5511#issuecomment-1759649679 Some of our scram protocol executions timed out only after 17 minutes. ## Summary of changes Make timeout for scram execution meaningful and configurable.	2023-10-23 15:11:05 +01:00
Conrad Ludgate	fd20bbc6cb	proxy: log params when no endpoint (#5418 ) ## Problem Our SNI error dashboard features IP addresses but it's not immediately clear who that is still (#5369) ## Summary of changes Log some startup params with this error	2023-09-29 09:40:27 +01:00
Conrad Ludgate	93dcdb293a	proxy: password hack hack (#5126 ) ## Problem fixes #4881 ## Summary of changes	2023-08-30 16:20:27 +01:00
Conrad Ludgate	ec10838aa4	proxy: pool connection logs (#5020 ) ## Problem Errors and notices that happen during a pooled connection lifecycle have no session identifiers ## Summary of changes Using a watch channel, we set the session ID whenever it changes. This way we can see the status of a connection for that session Also, adding a connection id to be able to search the entire connection lifecycle	2023-08-18 11:44:08 +01:00
Conrad Ludgate	3e4710c59e	proxy: add more sasl logs (#5012 ) ## Problem A customer is having trouble connecting to neon from their production environment. The logs show a mix of "Internal error" and "authentication protocol violation" but not the full error ## Summary of changes Make sure we don't miss any logs during SASL/SCRAM	2023-08-17 12:05:54 +01:00
Conrad Ludgate	0fa85aa08e	proxy: delay auth on retry (#4929 ) ## Problem When an endpoint is shutting down, it can take a few seconds. Currently when starting a new compute, this causes an "endpoint is in transition" error. We need to add delays before retrying to ensure that we allow time for the endpoint to shutdown properly. ## Summary of changes Adds a delay before retrying in auth. connect_to_compute already has this delay	2023-08-08 17:19:24 +03:00
Conrad Ludgate	606caa0c5d	proxy: update logs and span data to be consistent and have more info (#4878 ) ## Problem Pre-requisites for #4852 and #4853 ## Summary of changes 1. Includes the client's IP address (which we already log) with the span info so we can have it on all associated logs. This makes making dashboards based on IP addresses easier. 2. Switch to a consistent error/warning log for errors during connection. This includes error, num_retries, retriable=true/false and a consistent log message that we can grep for.	2023-08-04 12:37:18 +03:00
Conrad Ludgate	eb78603121	proxy: div by zero (#4845 ) ## Problem 1. In the CacheInvalid state loop, we weren't checking the `num_retries`. If this managed to get up to `32`, the retry_after procedure would compute 2^32 which would overflow to 0 and trigger a div by zero 2. When fixing the above, I started working on a flow diagram for the state machine logic and realised it was more complex than it had to be: a. We start in a `Cached` state b. `Cached`: call `connect_once`. After the first connect_once error, we always move to the `CacheInvalid` state, otherwise, we return the connection. c. `CacheInvalid`: we attempt to `wake_compute` and we either switch to Cached or we retry this step (or we error). d. `Cached`: call `connect_once`. We either retry this step or we have a connection (or we error) - After num_retries > 1 we never switch back to `CacheInvalid`. ## Summary of changes 1. Insert a `num_retries` check in the `handle_try_wake` procedure. Also using floats in the retry_after procedure to prevent the overflow entirely 2. Refactor connect_to_compute to be more linear in design.	2023-07-31 09:30:24 -04:00
Alex Chi Z	a8f3540f3d	proxy: add unit test for wake_compute (#4819 ) ## Problem ref https://github.com/neondatabase/neon/pull/4721, ref https://github.com/neondatabase/neon/issues/4709 ## Summary of changes This PR adds unit tests for wake_compute. The patch adds a new variant `Test` to auth backends. When `wake_compute` is called, we will verify if it is the exact operation sequence we are expecting. The operation sequence now contains 3 more operations: `Wake`, `WakeRetry`, and `WakeFail`. The unit tests for proxy connects are now complete and I'll continue work on WebSocket e2e test in future PRs. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2023-07-28 19:10:55 -04:00
Conrad Ludgate	231d7a7616	proxy: retry compute wake in auth (#4817 ) ## Problem wake_compute can fail sometimes but is eligible for retries. We retry during the main connect, but not during auth. ## Summary of changes retry wake_compute during auth flow if there was an error talking to control plane, or if there was a temporary error in waking the compute node	2023-07-26 16:34:46 +01:00
Alex Chi Z	407a20ceae	add proxy unit tests for retry connections (#4721 ) Given now we've refactored `connect_to_compute` as a generic, we can test it with mock backends. In this PR, we mock the error API and connect_once API to test the retry behavior of `connect_to_compute`. In the next PR, I'll add mock for credentials so that we can also test behavior with `wake_compute`. ref https://github.com/neondatabase/neon/issues/4709 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2023-07-24 20:41:42 +03:00
Arseny Sher	c200ebc096	proxy: log endpoint name everywhere. Checking out proxy logs for the endpoint is a frequent (often first) operation during user issues investigation; let's remove endpoint id -> session id mapping annoying extra step here.	2023-05-24 09:11:23 +04:00
Vadim Kharitonov	4f64be4a98	Add endpoint to connection string	2023-05-15 23:45:04 +02:00
Stas Kelvich	4ac6a9f089	add backward compatibility to proxy	2023-04-28 17:15:43 +03:00
Stas Kelvich	9486d76b2a	Add tests for link auth to compute connection	2023-04-28 17:15:43 +03:00
Stas Kelvich	645e4f6ab9	use TLS in link proxy	2023-04-28 17:15:43 +03:00
Stas Kelvich	c3ca48c62b	Support extra domain names for proxy. Make it possible to specify directory where proxy will look up for extra certificates. Proxy will iterate through subdirs of that directory and load `key.pem` and `cert.pem` files from each subdir. Certs directory structure may look like this: certs/example.com/key.pem, certs/example.com/cert.pem, certs/foo.bar/key.pem, certs/foo.bar/cert.pem. Actual domain names are taken from certs and key, subdir names are ignored.	2023-04-05 20:06:48 +03:00
Dmitry Ivanov	f85a61ceac	[proxy] Fix regression in logging For some reason, `tracing::instrument` proc_macro doesn't always print elements specified via `fields()` or even show that it's impossible (e.g. there's no Display impl). Work around this using the `?foo` notation. Before: 2023-04-03T14:48:06.017504Z INFO handle_client:handshake: received SslRequest After: 2023-04-03T14:51:24.424176Z INFO handle_client{session_id=7bd07be8-3462-404e-8ccc-0a5332bf3ace}:handshake: received SslRequest	2023-04-03 18:49:30 +03:00
Dmitry Ivanov	d90cd36bcc	[proxy] Improve tracing spans here and there.	2023-02-17 15:32:14 +03:00
Dmitry Ivanov	a4d5c8085b	Move hacks to a dedicated module.	2023-02-16 22:10:56 +03:00
Dmitry Ivanov	edffe0dd9d	Extract password hack & cleartext hack	2023-02-16 22:10:56 +03:00
Heikki Linnakangas	d9c518b2cc	Refactor use_cleartext_password_flow. It's not a property of the credentials that we receive from the client, so remove it from ClientCredentials. Instead, pass it as an argument directly to 'authenticate' function, where it's actually used. All the rest of the changes is just plumbing to pass it through the call stack to 'authenticate'	2023-02-16 22:10:56 +03:00
Dmitry Ivanov	3569c1bacd	[proxy] Fix: don't cache user & dbname in node info cache Upstream proxy erroneously stores user & dbname in compute node info cache entries, thus causing "funny" connection problems if such an entry is reused while connecting to e.g. a different DB on the same compute node. This PR fixes the problem but doesn't eliminate the root cause just yet. I'll revisit this code and make it more type-safe in the upcoming PR.	2023-02-14 17:54:01 +03:00
Dmitry Ivanov	96e78394f5	[proxy] Fix project (aka endpoint) init in the password hack handler (#3529 ) The project/endpoint should be set in the original (non-as_ref'd) creds, because we call `wake_compute` not only in `try_password_hack` but also later in the connection retry logic. This PR also removes the obsolete `as_ref` method and makes the code simpler because we no longer need this complication after a recent refactoring. Further action points: finally introduce typestate in creds (planned).	2023-02-02 22:56:15 +02:00
Dmitry Ivanov	ea0278cf27	[proxy] Implement compute node info cache (#3331 ) This patch adds a timed LRU cache implementation and a compute node info cache on top of that. Cache entries might expire on their own (default ttl=5mins) or become invalid due to real-world events, e.g. compute node scale-to-zero event, so we add a connection retry loop with a wake-up call. Solved problems: - [x] Find a decent LRU implementation. - [x] Implement timed LRU on top of that. - [x] Cache results of `proxy_wake_compute` API call. - [x] Don't invalidate newer cache entries for the same key. - [x] Add cmdline configuration knobs (requires some refactoring). - [x] Add failed connection estab metric. - [x] Refactor auth backends to make things simpler (retries, cache placement, etc). - [x] Address review comments (add code comments + cleanup). - [x] Retry `/proxy_wake_compute` if we couldn't connect to a compute (e.g. stalled cache entry). - [x] Add high-level description for `TimedLru`. TODOs (will be addressed later): - [ ] Add cache metrics (hit, spurious hit, miss). - [ ] Synchronize http requests across concurrent per-client tasks (https://github.com/neondatabase/neon/pull/3331#issuecomment-1399216069). - [ ] Cache results of `proxy_get_role_secret` API call.	2023-02-01 17:11:41 +03:00
Arthur Petukhovsky	debd134b15	Implement wss support in proxy (#3247 ) This is a hacky implementation of WebSocket server, embedded into our postgres proxy. The server is used to allow https://github.com/neondatabase/serverless to connect to our postgres from browser and serverless javascript functions. How it will work (general schema): - browser opens a websocket connection to `wss://ep-abc-xyz-123.xx-central-1.aws.neon.tech/` - proxy accepts this connection and terminates TLS (https) - inside encrypted tunnel (HTTPS), browser initiates plain (non-encrypted) postgres connection - proxy performs auth as in usual plain pg connection and forwards connection to the compute Related issue: #3225	2023-01-06 18:34:18 +03:00
Dmitry Ivanov	c700c7db2e	[proxy] Add more labels to the pricing metrics	2022-12-29 22:25:52 +03:00
Dmitry Ivanov	83baf49487	[proxy] Forward compute connection params to client This fixes all kinds of problems related to missing params, like broken timestamps (due to `integer_datetimes`). This solution is not ideal, but it will help. Meanwhile, I'm going to dedicate some time to improving connection machinery. Note that this does not fix problems with passing certain parameters in a reverse direction, i.e. from client to compute. This is a separate matter and will be dealt with in an upcoming PR.	2022-12-16 21:37:50 +03:00
Dmitry Ivanov	607c0facfc	[proxy] Propagate more console API errors to the user This patch aims to fix some of the inconsistencies in error reporting, for example "Internal error" or "Console request failed" instead of "password authentication failed for user '<NAME>'".	2022-12-13 16:16:31 +03:00
Dmitry Ivanov	05db6458df	[proxy] Fix project (endpoint) -related error messages	2022-11-23 23:03:29 +03:00
Dmitry Ivanov	9470bc9fe0	[proxy] Implement per-tenant traffic metrics	2022-11-22 18:50:57 +03:00

72 Commits
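
Commit eb78603121 ("proxy: div by zero") describes a backoff factor of 2^32 overflowing to 0 and then triggering a division by zero, fixed by checking `num_retries` and by using floats. The following is a minimal sketch of that failure mode and of a float-based delay, not the proxy's actual code: `wrapping_factor` is a hypothetical helper, and the signature, base delay, and cap of `retry_after` here are assumptions made for illustration.

```rust
use std::time::Duration;

/// Hypothetical integer backoff factor (not the proxy's real code).
/// With 32-bit wrapping arithmetic, 2^32 wraps around to 0.
fn wrapping_factor(num_retries: u32) -> u32 {
    2u32.wrapping_pow(num_retries)
}

/// Hypothetical float-based delay: `base * 2^num_retries`, capped at `max`.
/// Floats grow towards infinity instead of wrapping to zero, and the cap is
/// applied before converting back into a `Duration`.
fn retry_after(num_retries: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2f64.powf(f64::from(num_retries));
    let secs = base.as_secs_f64() * factor;
    // `f64::min` returns the non-NaN operand, so the result is always finite.
    Duration::from_secs_f64(secs.min(max.as_secs_f64()))
}
```

In this sketch `wrapping_factor(32)` is 0, so any integer division by it would fail, while `retry_after` simply saturates at the cap however large `num_retries` becomes.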